Being one of biggest and oldest telecom providers in Netherlands with multiple acquisitions over last few decades left KPN with 1500+ data sources and more than 25 teams working with different tools like Teradata, Informatica, Oracle, OBIEE, Hadoop etc. This resulted into a lot of technical debt and also duplicated data on various systems with complex data relationship. Thus resulting into data quality issues and long processing times to get meaningful insights into business.
This created need for an unified way of Ingesting, Storing and Transforming data to be consumed and processed at multiple stages where KETL Framework was born.
Our journey so far:
Instead of developing mappings and workflows or handcrafted SQL Code, Business teams started writing metadata about their sources and dependency between them using macro based excel files or Django based YAML file generator. These files are used by the KETL framework to generate appropriate Hive/Spark/Informatica/Oracle/Teradata Code along with Airflow Scheduler DAG with schedule and dependency as Code.
Additionally all the environments, configuration and access rights are also managed via YAML files with Ansible thus enabling us to view each change as code. This made teams self sufficient where they can build their own Dev/Test environment to validate their metadata and target model structure before deploying to production.
- On-boarding new sources and integrating with existing data store takes less than a week
- Everything is maintained in Git, giving full visibility of changes along with their timelines
- Minimising technical depth and allowing business teams to focus on data instead of tooling
- Easier adoption of newer tools and path of least resistence for decommissioning of legacy stack
- KISS Architecture, easier to maintain and scale
- Reducing bureaucratic processes and design for transparency
What keeps us busy:
- Adding a Test Framework to enable users with BDD tests using the same metadata
- Adding functionality to generate complex code structures
- Using advanced CI/CD processes like Jenkins pipeline for faster deployments
- Integration with new tools and technologies both enterprise and opensource.
- Hortonworks HDP - HDFS, Yarn, Hive, Spark, LLAP, Tez
- User and Access Management - Ranger, Knox, Kerberos, LDAP, SSSD, Linux ACL's
- ETL & DWH Tools - Informatica, Informatica BDM, Teradata, Querygrid, Aster etc.
- Reporting - OBIEE, Tableau, Zeppelin Notebooks
- Monitoring - Grafana, Zabbix & ELK
- Scheduler - Airflow
- Orchestration - Ansible
- Code - Python, YAML, Jinja2
- CI/CD - Git, Artifactory, Jenkins