Tuesday, May 29, 2012

Building the Next Generation ETL data loading Framework

Do you wish for an ETL framework that is highly customizable, light-weight and suits perfectly with all of your data loading needs? We too! Let's build one together...

What is an ETL framework?

ETL Or "Extraction, Transformation and Loading" is the prevalent technological paradigm for data integration. While ETL in itself is not a tool, ETL processes can be implemented though varied tools and programming methods. This includes, but not limited to, tools like Informatica PowerCentre, DataStage, BusinessObjects Data Services (BODS), SQL Server Integration Services (SSIS), AbInitio etc. and programming methods like PL/SQL (Oracle), T-SQL (Microsoft), UNIX shell scripting etc. Most of these tools and programming methodologies use a generic setup that controls, monitors, executes and Logs the data flow through out the ETL process. This generic 'setup' is often referred as 'ETL framework'
As an example of an ETL framework, let's consider this. "Harry" needs to load 2 tables everyday from one source system to some other target system. For this purpose, Harry has created 2 SQL jobs, each of which reads data from source through some "SELECT" statements and write the data in the target database using some "INSERT" statements. But in order to run these jobs, Harry needs couple of more information - e.g.
  • when is a good time to execute these jobs? Can he schedule these jobs to run automatically everyday?
  • Where is the source system located? (Connection information)
  • What will happen if one of the jobs fail while loading the data? Will Harry get an alert message? Can he simply rerun the jobs after fixing the issue of the failure?
  • How will Harry know if at all any data is retrieved or loaded to the target?
Turns out that, Harry needs something more. He needs some kind of setup that will govern the job execution regularly. This includes - scheduling the jobs, executing the jobs, logging any failure/error information (and also alerting Harry about such failures), maintaining the connection information and even ensuring that Harry does not end up loading the duplicate data.
Such a setup is called "ETL Framework". And we are trying to build the perfect one here.

Critical Features of an ETL framework

In a very broad sense, here are a few of the features that we feel critical in any ETL framework
  • Support for Change Data Capture Or Delta Loading Or Incremental Loading
  • Metadata logging
  • Handling of multiple source formats
  • Restartability support
  • Notification support
  • Highly configurable / customizable

Good-to-have features of ETL Framework

These are some good-to-have features for the framework
  • Inbuilt data reconciliation
  • Customizable log format
  • Dry-load enabling
  • Multiple notification formats

Request for Proposal for the next-gen ETL framework

Based on the feature sets above, we are trying to build a generic framework that we would make available here for free for everyone's use.
However the list of features above are not complete. We are requesting our readership to send us RFP for the proposed ETL framework that would resolve the incapability / issues in their existing frameworks.

0 comments:

Post a Comment

newer post older post Home