Tuesday, December 28, 2010

Service Manager

The Service Manager is a service that manages all domain operations. It runs within Informatica Services, as a Windows service on Windows and as a daemon on UNIX. When you start Informatica Services, you start the Service Manager. The Service Manager runs on each node; if the Service Manager is not running, the node is not available.

The Service Manager runs on all nodes in the domain to support the application services and the domain:

  Application service support. The Service Manager on each node starts application services configured to run on that node. It starts and stops services and service processes based on requests from clients. It also directs service requests to application services. The Service Manager uses TCP/IP to communicate with the application services.

  Domain support. The Service Manager performs functions on each node to support the domain. The functions that the Service Manager performs on a node depend on the type of node. For example, the Service Manager running on the master gateway node performs all domain functions on that node. The Service Manager running on any other node performs some domain functions on that node.
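The sketch below is a minimal, hypothetical illustration (plain Python, not Informatica code or APIs) of these two roles: a per-node manager that starts the application services configured for its node and routes client requests to them. All class, node and service names are assumptions made for the example.

    # Illustrative sketch only -- not Informatica code.
    # A per-node "service manager" that starts the services configured for its
    # node and routes client requests to them, loosely mirroring the two roles
    # described above (application service support and request routing).

    class ApplicationService:
        def __init__(self, name):
            self.name = name
            self.running = False

        def start(self):
            self.running = True
            print(f"{self.name}: service process started")

        def handle(self, request):
            return f"{self.name} handled {request!r}"


    class ServiceManagerSketch:
        def __init__(self, node_name, configured_services):
            self.node_name = node_name
            # Only services configured to run on this node are managed here.
            self.services = {name: ApplicationService(name) for name in configured_services}

        def start_all(self):
            for service in self.services.values():
                service.start()

        def route(self, service_name, request):
            # Direct a client request to the named application service.
            service = self.services.get(service_name)
            if service is None or not service.running:
                raise RuntimeError(f"{service_name} is not available on node {self.node_name}")
            return service.handle(request)


    manager = ServiceManagerSketch("node01", ["RepositoryService", "IntegrationService"])
    manager.start_all()
    print(manager.route("IntegrationService", "run workflow wf_load_sales"))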

Table 1-1 describes the domain functions that the Service Manager performs:

Table 1-1. Domain Functions Performed by the Service Manager


Nodes

When you install PowerCenter Services on a machine, you add the machine to the domain as a node. You can add multiple nodes to a domain. Each node in the domain runs a Service Manager that manages domain operations on that node. The operations that the Service Manager performs depend on the type of node. A node can be a gateway node or a worker node. You can subscribe to alerts to receive notification about node events such as node failure or a master gateway election.

Gateway Nodes
A gateway node is any node you configure to serve as a gateway for the domain. One node acts as the gateway at any given time. That node is called the master gateway. A gateway node can run application services, and it can serve as a master gateway node. The master gateway node is the entry point to the domain.

The Service Manager on the master gateway node performs all domain operations on the master gateway node. The Service Manager running on other gateway nodes performs limited domain operations on those nodes.

You can configure more than one node to serve as a gateway. If the master gateway node becomes unavailable, the Service Managers on the other gateway nodes elect another master gateway node. If you configure only one node to serve as the gateway and that node becomes unavailable, the domain cannot accept service requests.
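As a hedged illustration of that failover idea, the sketch below picks a new master gateway when the current one becomes unavailable. The selection rule used here (first available gateway in configuration order) is an assumption for the example, not Informatica's actual election algorithm.

    # Illustrative sketch only: choosing a new master gateway when the current
    # one becomes unavailable. The election rule (first available gateway in
    # configured order) is an assumption for the example.

    def elect_master(gateway_nodes, availability):
        """Return the first configured gateway node that is available, or None."""
        for node in gateway_nodes:
            if availability.get(node, False):
                return node
        return None  # no gateway available: the domain cannot accept requests


    gateways = ["gw_node1", "gw_node2", "gw_node3"]
    availability = {"gw_node1": True, "gw_node2": True, "gw_node3": True}

    print("master gateway:", elect_master(gateways, availability))   # gw_node1

    availability["gw_node1"] = False                                  # master becomes unavailable
    print("new master gateway:", elect_master(gateways, availability))  # gw_node2

    # With only one configured gateway, its failure leaves no master at all.
    print(elect_master(["gw_only"], {"gw_only": False}))              # None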

Worker Nodes
A worker node is any node not configured to serve as a gateway. A worker node can run application services, but it cannot serve as a gateway. The Service Manager performs limited domain operations on a worker node.

Understanding Domains Overview

PowerCenter has a service-oriented architecture that provides the ability to scale services and share resources across multiple machines. High availability functionality helps minimize service downtime due to unexpected failures or scheduled maintenance in the PowerCenter environment.

The PowerCenter domain is the fundamental administrative unit in PowerCenter. The domain supports the administration of the distributed services. A domain is a collection of nodes and services that you can group in folders based on administration ownership.

A node is the logical representation of a machine in a domain. One node in the domain acts as a gateway to receive service requests from clients and route them to the appropriate service and node. Services and processes run on nodes in a domain. The availability of a service or process on a node depends on how you configure the service and the node. For more information, see Nodes.

Services for the domain include the Service Manager and a set of application services:

  Service Manager. A service that manages all domain operations. It runs the application services and performs domain functions on each node in the domain. Some domain functions include authentication, authorization, and logging. For more information, see Service Manager.

  Application services. Services that represent PowerCenter server-based functionality, such as the Repository Service and the Integration Service. The application services that run on a node depend on the way you configure the services. For more information, see Application Services.

The Service Manager and application services control PowerCenter security. The Service Manager manages users and groups that can log in to PowerCenter applications and authenticates the users who log in to PowerCenter applications. The Service Manager and application services authorize user requests from PowerCenter applications. For more information, see Security.

The PowerCenter Administration Console consolidates the administrative tasks for domain objects such as services, nodes, licenses, and grids and for users, groups, and roles. You manage the domain and the security of the domain through the Administration Console.

To use the SSL protocol to transfer data securely between the Administration Console and the Service Manager, configure HTTPS for all nodes on the domain. You can configure HTTPS when you install PowerCenter or using infasetup commands. The Administration Console uses the HTTPS port to communicate with the Service Manager. The gateway and worker node port numbers you configure for communication with the Service Manager remain the same. Application services and PowerCenter Client applications communicate with the Service Manager using the gateway or worker node port.
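As a small illustration of that split, the sketch below uses Python's standard library to contact a hypothetical Administration Console HTTPS port; the host name and port are placeholders chosen for the example, not defaults taken from PowerCenter.

    # Illustrative sketch: contacting the Administration Console over HTTPS.
    # Host and port below are placeholders for your own gateway node values.
    import http.client
    import ssl

    ADMIN_HOST = "gateway-node.example.com"   # hypothetical gateway node host
    ADMIN_HTTPS_PORT = 8443                   # hypothetical HTTPS port

    # A default SSL context verifies the server certificate; in a test
    # environment you might instead load a specific CA bundle.
    context = ssl.create_default_context()

    conn = http.client.HTTPSConnection(ADMIN_HOST, ADMIN_HTTPS_PORT, context=context, timeout=10)
    try:
        conn.request("GET", "/")
        response = conn.getresponse()
        print("Administration Console responded:", response.status, response.reason)
    except OSError as exc:
        print("Could not reach the Administration Console HTTPS port:", exc)
    finally:
        conn.close()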

If you have the high availability option, you can scale services and eliminate single points of failure for services. Services can continue running despite temporary network or hardware failures.

Informatica Resources

Informatica Customer Portal
As an Informatica customer, you can access the Informatica Customer Portal site at http://my.informatica.com. The site contains product information, user group information, newsletters, access to the Informatica customer support case management system (ATLAS), the Informatica Knowledge Base, the Informatica Documentation Center, and the Informatica user community.

Informatica Documentation
The Informatica Documentation team makes every effort to create accurate, usable documentation. If you have questions, comments, or ideas about this documentation, contact the Informatica Documentation team by email at infa_documentation@informatica.com. We will use your feedback to improve our documentation. Let us know if we can contact you regarding your comments.

Informatica Web Site
You can access the Informatica corporate web site at http://www.informatica.com. The site contains information about Informatica, its background, upcoming events, and sales offices. You will also find product and partner information. The services area of the site includes important information about technical support, training and education, and implementation services.

Informatica Knowledge Base
As an Informatica customer, you can access the Informatica Knowledge Base at http://my.informatica.com. Use the Knowledge Base to search for documented solutions to known technical issues about Informatica products. You can also find answers to frequently asked questions, technical white papers, and technical tips.

Informatica Global Customer Support
There are many ways to access Informatica Global Customer Support. You can contact a Customer Support Center through telephone, email, or the WebSupport Service.

Use the following email addresses to contact Informatica Global Customer Support:

  support@informatica.com for technical inquiries

  support_admin@informatica.com for general customer service requests

WebSupport requires a user name and password. You can request a user name and password at http://my.informatica.com.

Use the following telephone numbers to contact Informatica Global Customer Support:

Sunday, November 28, 2010

Commercial ETL vendors


DB Software Laboratory ETL software


Adeptia ETL Suite

Adeptia ETL Suite (AES) is a graphical, easy-to-use data mapping solution that is ideal for aggregating data from multiple sources to populate databases and data warehouses for Business Intelligence solutions. AES is a comprehensive solution that combines data transport with powerful metadata management and data transformation capability.
AES platform components

Adeptia ETL Suite consists of three distinct components. The first is a web-based Design Studio that provides a wizard-driven, graphical way to document data rules as they relate to validations, mappings and edits. This tool includes a library of functions which can be created in advance and reused.

The Data Mapper has a preview capability that shows actual source and target data while the rules are being specified, provided the source data file is available.
The second component is the Service Repository, where all the rules and mapping objects are saved.
The third component is the Run-time Execution Engine, where the mapping rules and data flow transactions are executed on incoming data files and messages.

[Diagram: key components of the Adeptia ETL Suite]

Key Features of Adeptia ETL Suite

    * Data integration functionality which is perfectly suited for Extract, Transform, and Load (ETL) type of data warehousing solutions for business intelligence, reporting and dashboards
    * Process flow based and service-oriented approach for data transformation and integration
    * Easy to use, intuitive user-friendly interface to define data mapping and data transformation rules
    * Graphical design environment and fully featured run-time engine in one package
    * Object-based architecture that promotes reusability of components and functionality across disparate custom packaged and legacy applications
    * Web-based interface that enhances productivity with minimal training
    * Complete management, reporting and auditing capabilities to track and monitor transactions

What makes Adeptia ETL Suite unique?

    * Process-centric Integration in single package
          o Allows complete business process around the integration to be automated
          o Supports Services Orchestration
          o Allows Human Workflow to handle errors, exceptions and approvals
    * Enables SOA
          o Flows can be exposed as Services
          o Includes a services repository
          o Metadata Driven approach, Integration components are re-usable
    * Quick implementations in days and weeks
          o Allows quick projects so customers can see immediate value
          o Ease of Use: Graphical, No coding approach ideal for business analysts
          o Reduces reliance on expensive consultants
    * Flexible pricing reduces customer risk 



Source: http://www.etltools.net/adeptia-etl-suite.html

Pervasive ETL Software

Pervasive ETL software comprises data integration tools that minimize the costs of system change and make integration more efficient. The tools are designed for both small and large companies. An easy-to-use graphical mapping interface speeds up work and allows developers to increase productivity.
Pervasive Data Integrator

Pervasive Data Integrator is an environment that provides functionality for aggregating data and automating data movement. It can load a high-performance data warehouse as well as smaller data sets. A unified toolset manages the integration of many applications and operational data stores. The Pervasive Integration Engine supports the design of multi-step processes that can be executed automatically; if errors appear during a process, logging and rollback operations are available. Data Integrator supports an "upsert" operation, which makes high-speed loads more efficient, and multithreading is also supported.
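Since "upsert" is mentioned here without explanation, the following generic sketch shows the operation (insert a row if its key is new, otherwise update it), using SQLite purely as a stand-in target; it is not Pervasive-specific code.

    # Generic "upsert" sketch (insert-or-update), with SQLite as a stand-in target.
    # Requires SQLite 3.24+ for the ON CONFLICT clause (bundled with recent Python).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

    def upsert(conn, row):
        # If the key already exists, update the row; otherwise insert it.
        conn.execute(
            """
            INSERT INTO customers (id, name, city) VALUES (:id, :name, :city)
            ON CONFLICT(id) DO UPDATE SET name = excluded.name, city = excluded.city
            """,
            row,
        )

    upsert(conn, {"id": 1, "name": "Acme", "city": "Boston"})
    upsert(conn, {"id": 1, "name": "Acme", "city": "Chicago"})   # same key: updates, no duplicate
    print(conn.execute("SELECT * FROM customers").fetchall())    # [(1, 'Acme', 'Chicago')]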

Rules for complex data aggregation can be defined, so multiple elements can be changed in a single pass. Pervasive Data Integrator supports connectivity to all major databases and to common CRM, ERP and COBOL sources.
Data Integrator's capabilities, such as real-time continuous integration, native connectivity and legacy conversion, make creating integration solutions simpler.

Pervasive application integration provides functionality to integrate multiple applications, inside or outside the organisation. Enterprise Service Bus (ESB), Service-Oriented Architecture (SOA) and point-to-point architectures are supported; the environment can be used, for example, for point-to-point exchange between applications or to build a message queue for an ESB.
Pervasive solutions are transport independent and support "listening" for data arrival (for example, from FTP folders). The application integration solution offers capabilities such as real-time processing, "multi mode" and "mass insert" options, broad database connectivity, validation of application messages, and the ability to keep data objects in buffered memory to increase processing speed.
Pervasive Business Integrator

Pervasive Business Integrator supports the creation of processes that translate documents and move them between systems. It includes useful functionality such as advanced message processing, connectivity to databases and XML, a component that checks for arriving data, and support for trading partners' systems.

Pervasive Business Integrator lets users create and manage message processing and movement. It can connect to major ERP systems (such as SAP). Business Integrator uses standard data exchange formats (for example, XML) and supports the Java framework and any JMS-compliant queue. Continuous, real-time processing is supported. With Business Integrator it is possible to build a scalable integration infrastructure from an integration model, and this structure can be used in SOA or ESB projects. Messages moving between applications are validated. Pervasive Integration Manager provides a management console that makes the tool easier to use.
Data Profiler

Data Profiler is a tool that monitors data and identifies quality risks. It is available as a Pervasive ETL module, but it can also be used as a stand-alone application. Data problems found too late, for example during project testing, can be very costly; through continuous data quality testing, Data Profiler helps prevent them. It improves the quality and speed of software production by eliminating expensive and less efficient manual data validation. Data Profiler supports multiple data platforms (more than 150 adapters), reduces costly and time-consuming refactoring by isolating unsafe data, eliminates intermediate data storage through auditing of data formats (its engine can audit large transitional data sets), improves an organization's internal controls, and increases the value of final products.
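The following generic sketch shows the kind of column-level statistics a data profiler typically computes (null counts, distinct values, minimum and maximum); it is illustrative Python, not the Pervasive Data Profiler itself.

    # Generic column-profiling sketch: null count, distinct count, min/max per column.
    rows = [
        {"id": 1, "amount": 120.0, "country": "US"},
        {"id": 2, "amount": None,  "country": "DE"},
        {"id": 3, "amount": 75.5,  "country": "US"},
    ]

    def profile(rows):
        report = {}
        for col in rows[0].keys():
            values = [r[col] for r in rows]
            non_null = [v for v in values if v is not None]
            report[col] = {
                "nulls": len(values) - len(non_null),
                "distinct": len(set(non_null)),
                "min": min(non_null) if non_null else None,
                "max": max(non_null) if non_null else None,
            }
        return report

    for column, stats in profile(rows).items():
        print(column, stats)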

SAS ETL

ETL tools are developed by many different companies. One widely known and popular vendor is SAS, which offers an integrated ETL platform.

SAS is a US company with more than twenty-five years of history, offering business intelligence software. One of the market leaders, SAS combines data warehousing and intelligence applications for traditional business.

The SAS platform is advertised as meeting current requirements for timeliness, cost-effectiveness, credibility and compatibility with different data storage systems. Moreover, SAS ETL is efficient enough to support enterprise-wide processes, and it can extract data from other platforms and sources.

The SAS Enterprise ETL Server also offers an interactive ETL environment with facilities such as multithreaded, multiprocessing data extraction and engines that can work at the same time, which minimizes the time needed for operations and data transfers. Calendar, job and resource dependency management tools are also included. It is worth noting that the SAS ETL environment can be controlled with SAS ETL Studio (also included), which provides access to the whole database from a single point.

The SAS platform is also able to reduce duplicate or inaccurate data, an option that helps keep costs from growing. It was designed to simplify ETL processes, with strong emphasis on the financial aspects. Efficiency has also been improved: all algorithms, transformations and code are registered (creating full documentation) so they can easily be retrieved and reused, which saves a lot of time. The drag-and-drop interface further simplifies use, without demanding deep knowledge of programming or SQL (Structured Query Language, a database language for managing data).

SAS ETL is only one platform; the market offers a much wider choice. Although all vendors more or less aspire to the common goal of improving data management, the platforms they offer differ significantly. It is impossible to say unambiguously which one is best: it all depends on the needs and requirements.

More information on SAS ETL and Data Integration: SAS tutorial

Sybase ETL

The Sybase ETL application has been designed to extract data from source systems, transform it into data sets and, as a result, load the transformed data into a target data warehouse or database. The solution is characterized by ease of use and a high level of scalability, obtained thanks to the unique GRID architecture on which the platform is based.

The new generation of the Sybase ETL application, Transform on Demand, automates data integration by means of a unique method (Draw, Simulate, and Execute). The application does not require any programming. Furthermore, it enables designing, monitoring and planning data processing according to the user's needs. The comprehensive simulation environment makes it possible to track down errors that occur within a data flow and to make the necessary corrections to prevent damage to the target data.

Sybase ETL benefits

    * Minimizes the time, costs and human resources needed to operate application interfaces and to run data extraction, transformation and loading (ETL) processes
    * Repeatability of data transformation projects (there is no need to create new projects from scratch)
    * No professional training is needed for basic and moderately complex data transformation processes; for advanced issues, only basic training is required
    * Easy updating, by means of initializing the Warehouse Workbench application
    * Lower costs, thanks to automatic registration of the whole data transformation process
    * Better data quality, achieved by using the unique simulation environment before the real process is triggered

Sybase ETL most important features

    * Uses the unique Draw, Simulate, and Execute method
    * Uses modern technologies such as Java, SOAP and XML
    * High efficiency, due to native data access technology
    * Ability to perform any type of transformation
    * Support for all commonly used source and target databases, web data sources, XML and flat files; ODBC compliance
    * An efficient simulation environment that detects damaged data and errors, preventing permanent damage
    * An integrated system for controlling and aligning processes
    * Automatic registration (in printed and XML form)
    * More than two hundred UTL/JavaScript commands created specifically for data integration; installation of the application takes less than five minutes

Uncertain future of Sybase ETL

According to a statement made by Sybase in March 2009, the future of its ETL product looks rather bleak. Sybase ETL can now be used only by Sybase IQ users who hold a license for that product. Because of that, some users will be forced to obtain another ETL tool, or even more than one, for targets other than Sybase IQ instances. Even though Sybase ETL is used almost exclusively by Sybase users, this will hinder their work, and exporting data to other parts of their information systems will become impossible. It seems as if every single target needs a different tool, much like an old hand-coded system.

Microsoft Integration Services

Microsoft SQL Server Integration Services (SSIS) is a part of Microsoft's database product, Microsoft SQL Server. SQL Server Integration Services first appeared in SQL Server 2005 and has been continued through SQL Server 2008. The main purpose of adding this component to the product was to make data integration in the database easy, uncomplicated and fast.

The first question to ask is why data integration is needed in a SQL database at all. Data in regular databases is not very complicated, and accessing it is easy, because the data access system is often a part of the database itself. The answer is that Microsoft SQL Server is a solution not only for transactional databases but also for large data warehouses used as a company's business intelligence solution. The data in a data warehouse has to be integrated in order to make it accessible as easily and as quickly as possible.
Like many other professional data warehousing solutions, Microsoft SQL Server has its own data integration tool, called SQL Server Integration Services.

Key features and components of SQL Server Integration Services:

    * Connection monitoring. The application can manage many connections to data sources, so that the data is stored properly.
    * A task management component, another part of Integration Services, controls actions such as copying and moving data, collecting data from sources, and more.
    * Precedence constraints, whose job is to control tasks, monitor their status, check whether tasks have finished and start new tasks.
    * Event handlers, which matter when something unexpected happens to the data warehouse. This component allows the administrator to define what should be done in abnormal situations, making the system more dependable and safe (both precedence constraints and event handlers are illustrated in the sketch after this list).
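The sketch below (referenced in the list above) shows the ideas of a success-based precedence constraint and an event handler in plain Python; the task names and the on_error callback are assumptions for the example, not SSIS objects.

    # Conceptual sketch of precedence constraints and an event handler,
    # written in plain Python rather than with SSIS objects.

    def copy_data():
        print("copying data from source")

    def load_warehouse():
        print("loading data warehouse")
        raise RuntimeError("disk full")          # simulate an unexpected failure

    def send_report():
        print("sending summary report")

    def on_error(task_name, error):
        # Event handler: decide what happens when a task fails.
        print(f"event handler: {task_name} failed with {error!r}; notifying administrator")

    # Each task runs only if its predecessor succeeded (a success constraint).
    pipeline = [copy_data, load_warehouse, send_report]

    for task in pipeline:
        try:
            task()
        except Exception as exc:
            on_error(task.__name__, exc)
            break                                 # downstream tasks are not started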

Thanks to the components mentioned earlier, Integration Services provides many ways to manipulate data in the warehouse: data can be split into groups under given conditions, sorted, merged and processed in many other ways, making Microsoft SQL Server a very flexible solution for data warehousing.
It is also possible to create scripts to manipulate the data. Unfortunately, these scripts can only be created in the .NET environment using the C# programming language (Windows environments only). There is no support for other programming languages, which is a disadvantage, because the administrator cannot choose how to program the server. On the other hand, there is a great number of online forums, tutorials and support materials for Microsoft programming products such as C# and Visual C#.

In the SQL Server 2008 version, a development environment is available for creating data integration procedures. Thanks to its interface, the application is easy for inexperienced users to use, but as we all know, user-friendly solutions are not always as effective as expert tools.

Integration Services is a useful tool not only for companies maintaining data warehouses but also for administrators of small databases. Its list of features is impressive, although there are some things that could be improved. Thanks to Integration Services, Microsoft SQL Server is a very interesting solution for ETL data integration and data warehousing in business intelligence.

Oracle Warehouse Builder

Oracle Warehouse Builder (OWB) is a tool that enables the design of a custom Business Intelligence application. It provides dimensional ETL process design, extraction from heterogeneous source systems, and metadata reporting functions. Oracle Warehouse Builder allows the creation of both dimensional and relational models, as well as star schema data warehouse architectures.

Besides being an ETL (Extract, Transform, Load) tool, Oracle Warehouse Builder also enables users to design and build ETL processes, target data warehouses, intermediate data stores and user access layers. It allows metadata to be read in a wizard-driven form from a data dictionary or Oracle Designer, and it also supports over 40 metadata file formats from other vendors.

When modeling extraction processes, the following source systems are supported: Oracle databases, flat files, SAP R/3, DB2, Sybase, Informix, SQL Server and others via Oracle Transparent Gateways, ODBC and mainframe connectivity.

Oracle Warehouse Builder supports Oracle Database (releases 8i, 9i and newer) and flat files as targets. Moreover, it allows integration with Oracle AS Discoverer, Oracle Workflow and Oracle Enterprise Manager.
ETL mappings and code design

Modeling of ETL processes takes place in a graphical environment. Oracle Warehouse Builder enables mapping of multiple sources into multiple targets as well as PL/SQL transformations of the data. It allows automated or custom mappings, and operations on data can be turned into PL/SQL code that is stored in a transformation library for later reuse. More advanced operations are supported as well, and a test run on selected data is possible, with error reporting. Additionally, Warehouse Builder provides data cleansing during loading, including a comprehensive Name and Address solution and a Match-Merge operator.

The dependencies between the mappings can be recorded graphically in an Oracle Workflow engine, and process flow definitions can contain multiple activities, including external processes.
Deployment

Deployment of the actual code is carried out by the Deployment Manager for all kinds of supported objects, and the process includes validation of the metadata. A specification created by the user can be saved to the file system and stored for further use.
Managing Metadata

Oracle Warehouse Builder makes it possible to save an image of a design object at any stage; the snapshot is then stored for comparison and restoration. The metadata repository operates in various languages and allows user privileges to be adjusted. Users can also extend their object definitions to enter information not yet supported. Moreover, Oracle Warehouse Builder contains built-in reporting templates and allows metadata to be updated.
Business Intelligence system Management

Oracle Warehouse Builder integrates with Oracle Enterprise Manager, which allows scheduling of the mappings in an Oracle database environment. Additionally, the platform provides the Runtime Audit Browser, an HTML-based tool for detecting errors while loading data, and Change Management, which adjusts the warehouse to changes introduced by a user in the logical model.

Oracle Warehouse Builder supports the OMG CWM (Object Management Group Common Warehouse Metamodel) metadata standard and allows metadata to be exchanged with other tools that support this standard.

ORACLE DATA INTEGRATOR

Oracle Data Integrator (ODI) is a software application based on an ETL (extract-transform-load) architecture, used for data transformation and merging processes. In today's business the importance of well-organized data has increased dramatically, and that is why having specialized software is so important.

Furthermore, such applications must be prepared to work in mixed system and platform environments without any loss of quality. Oracle Data Integrator separates declarative rules from implementation details, and its E-LT (extract-load-transform) architecture eliminates the need for a separate ETL server, which results in optimized performance, efficiency and scalability.
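To make the E-LT idea concrete, the sketch below loads raw rows into the target first and then runs the transformation as SQL inside that target, so no separate transformation server is needed. SQLite stands in for the target database; nothing here is ODI code.

    # E-LT sketch: load raw data into the target first, then transform it
    # with SQL inside the target database (SQLite as a stand-in target).
    import sqlite3

    target = sqlite3.connect(":memory:")
    target.execute("CREATE TABLE stg_orders (order_id INTEGER, amount TEXT, country TEXT)")
    target.execute("CREATE TABLE dw_orders  (order_id INTEGER, amount REAL, country TEXT)")

    # 1. Extract + Load: move source rows into a staging table on the target as-is.
    source_rows = [(1, "100.50", "us"), (2, "75.00", "de")]
    target.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", source_rows)

    # 2. Transform: run set-based SQL inside the target database itself.
    target.execute("""
        INSERT INTO dw_orders (order_id, amount, country)
        SELECT order_id, CAST(amount AS REAL), UPPER(country)
        FROM stg_orders
    """)

    print(target.execute("SELECT * FROM dw_orders").fetchall())
    # [(1, 100.5, 'US'), (2, 75.0, 'DE')]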

Oracle Data Integrator architecture

The main architecture component of Oracle Data Integrator is the repository, which is accessed in client-server mode or thin-client mode.

The repository is an open, comprehensive, shared metadata database accessible to the applications responsible for transformation, project development, integration and information redistribution. The other architecture components are the Topology Manager, Designer, Security Manager, Operator, Agent, Scheduler and Metadata Navigator.
The Topology Manager is the graphical interface module that manages all information about the system's physical and logical structure. Descriptions of servers, machines and sites created with the Topology Manager make it possible to run the same interfaces on different systems.
The Designer is the tool for developing transformations and checking data integrity. Among its many functions, the most significant are visualization of data flows, development and maintenance of data integration and transformation, and reverse-engineering of applications or repository data.
The Security Manager is the tool for managing Oracle Data Integrator users and permissions. The Operator is the module responsible for monitoring and production management. The Agent executes the interfaces designed in the Designer, and the Scheduler schedules them.
The Metadata Navigator browses the contents of the Oracle Data Integrator repository in search of the desired content. It is a thin client oriented towards search and analysis, but it can also manage production processes and trace and cross-reference data lineage.
Apart from these standard modules there is a set of fully extensible "out-of-the-box" modules, for example Log Miner, SQL Server Triggers, Oracle DBLink, Check MS Excel, Oracle Merge and Check Sybase. All of them, as part of the Pluggable Knowledge Architecture, are written for particular systems and environments and are in fact the very core of Oracle Data Integrator, making the whole process accessible, modular and as simple as possible.
When the target is an Oracle database, the natural choice is Oracle Warehouse Builder, but when the target is not an Oracle-based environment, such as SQL Server, XML or Teradata, Oracle Data Integrator can be used to support OWB. When the source is XML or a Web Service (SOAP), ODI supplements OWB as well. Clearly, Oracle Data Integrator can be used in many different cases, for example when dirty data has to be filtered out before loading, or when data integration is very complicated and/or processed in real time.
Oracle Data Integrator works in the staging and transformation area as well as in support of other Oracle software. It is "hot pluggable", which means it can connect to any heterogeneous environment and still provide the required operations.

Source: http://www.etltools.net/oracle-data-integrator.html

BusinessObjects Data Integrator

The use of reliable and correct data is crucial for business users and requires guaranteed data quality, proper and constant controls, and data lineage. Delivering trusted information requires a data integration process that allows data management to be performed smoothly, for example by finding data sources, correcting errors and making changes. BusinessObjects Data Integrator (BODI) is a comprehensive tool that provides many options for data manipulation.

The BusinessObjects data integration process is divided into the following operations:

   1. Data unification
   2. Data profiling
   3. Data validation
   4. Data auditing
   5. Data cleansing

Data unification

BusinessObjects Data Integration lets users operate and make changes in the data quickly and easily.

Thanks to its unique and helpful solutions, it integrates data from the source all the way to the end-user reports and analyses, which is crucial for establishing the lineage of the data and the changes to the data sources. The ability to monitor the whole process, from the results back to the data sources, makes it possible to find out directly which data originates from which source. This is useful not only for delivering trusted and reliable information but also for keeping an overview of the overall architecture.
BusinessObjects Data Integrator makes it possible to quickly and easily create and update universe objects (using the BusinessObjects Data Integrator Designer tool). These actions are facilitated by specific information (column descriptions, data catalogue information) that is loaded into the universes and helps build them easily. Because the changes and the data lineage are easy to check, managing objects, sharing metadata and updating universes take less time and labour. This is useful not only for building or updating universes; the end user can also see how their data was transformed.
Another Data Integrator tool provided by SAP BusinessObjects is the Data Mart Accelerator for Crystal Reports, which automatically extracts data from a Crystal Reports instance into aggregate-aware tables. In addition, by creating a time dimension, it makes continuous trend-based analysis possible.
Data Profiling

BusinessObjects Data Integrator has a well-developed instrument for working in the design environment. The Designer program can be used for most design operations, such as data mapping, transformations and ETL logic, as well as for creating workflows and data flows. A developer working with the Designer tool can create, preview and configure complex data profiles with a minimum of effort and time. Uniquely, the developer can view the results before and after data is extracted or transformed without leaving the design environment.
Data validation

Data validation is one of the most essential elements of trusted information delivery. Providing accurate information means being sure that there is no erroneous or unreliable data. The BusinessObjects data integration process helps identify and then correct or reject incorrect data, NULL values and out-of-range data. This is very useful when specific records or columns need to be validated against changed rules.
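A generic sketch of this kind of rule-based validation, rejecting rows with NULL or out-of-range values, is shown below; the field names and rules are assumptions for the example, not BusinessObjects syntax.

    # Generic validation sketch: reject rows with NULL or out-of-range values.
    rows = [
        {"customer_id": 1, "age": 34},
        {"customer_id": 2, "age": None},     # NULL value
        {"customer_id": 3, "age": 212},      # out of range
    ]

    def validate(row):
        errors = []
        if row["age"] is None:
            errors.append("age is NULL")
        elif not 0 <= row["age"] <= 120:
            errors.append("age out of range")
        return errors

    for row in rows:
        errors = validate(row)
        if errors:
            print("rejected:", row, errors)
        else:
            print("passed:  ", row)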
Data Auditing

Data auditing consists of verifying data integrity, in particular whether the data is read, processed and loaded successfully. BusinessObjects Data Integrator makes it possible to build audit statistics in specific categories (sum, average or row count) and then compare them with the established business rules.
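As a concrete, generic illustration of such an audit check, the sketch below computes a row count and a total for the loaded data and compares them with the figures reported by the extract step; the numbers are made up for the example.

    # Generic audit sketch: compare row count and total amount of the loaded
    # data against what the source reported, flagging any mismatch.
    source_stats = {"row_count": 3, "total_amount": 296.0}    # reported by the extract step
    loaded_rows  = [120.0, 75.5, 100.5]                       # amounts actually loaded

    audit = {
        "row_count": len(loaded_rows),
        "total_amount": round(sum(loaded_rows), 2),
    }

    for metric, expected in source_stats.items():
        status = "OK" if audit[metric] == expected else "MISMATCH"
        print(f"{metric}: expected {expected}, loaded {audit[metric]} -> {status}")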
Data cleansing

With Data Integrator it is possible to maintain data quality by correcting, standardizing and consolidating data. Cleansing is particularly important because of common errors in consumer contact information. As part of the data integration process, address information can be validated by parsing, matching and merging.

Ab Initio ETL

The Ab Initio software is a Business Intelligence platform containing six data processing products: Co>Operating System, The Component Library, Graphical Development Environment, Enterprise Meta>Environment, Data Profiler and Conduct>It. It is a powerful graphical user interface-based parallel processing tool for ETL data management and analysis.
Co>Operating System

Ab Initio Co>Operating System is the foundation for all Ab Initio applications and provides a general engine for the integration of all kinds of data processing and for communication between all the tools within the platform. It runs on OS/390 and z/OS on the mainframe, as well as on Unix, Linux and Windows. It enables distributed and parallel execution, platform-independent data transport, establishing checkpoints, and monitoring of the process. It implements execution parallelism by using data parallelism, component parallelism and pipeline parallelism. The tool also assures high data processing capacity and provides speedups proportional to the hardware resources available.
Component library

The Ab Initio Component Library is a set of reusable software modules for sorting, data transformation, and high-speed database loading and unloading. It is a flexible and extensible tool that adapts at runtime to the formats of the records entered and allows the creation and incorporation of new components obtained from any program, which permits the integration and reuse of external legacy code and storage engines.
Graphical Development Environment (GDE)

Graphical Development Environment provides an intuitive graphical interface for editing and executing applications. You can easily drag and drop components from the library onto a canvas, configure them and connect them into flowcharts. The flowcharts are not only abstract diagrams but also the actual architecture of the ETL functions. Each graph contains program functions and channels data flow in one direction, which allows the graphs to run in a parallel processing environment; the programs can then be executed directly with the Co>Operating System. Ab Initio also allows monitoring of running applications and makes it easy to quantify data volumes and execution times of the programs in order to investigate opportunities for improved performance. You can also write Ab Initio programs using text editors instead of the Graphical Development Environment and then execute them with the Co>Operating System.
Enterprise Meta>Environment (EME)

Ab Initio Enterprise Meta>Environment is a data store with the additional functions of tracking changes in developed graphs and in the metadata used in their development. It can also provide feedback on how the data is used and perform a preliminary classification of the data. It presents graphically how data changes in one graph influence other graphs, which is called data impact analysis. Additionally, Enterprise Meta>Environment manages the configuration and changes of the code to keep the functions of the graphs unchanged. It also offers tools such as dependency analysis, metadata management, statistical analysis and version control.
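Data impact analysis of this kind is essentially reachability over a dependency graph: given the object that changes, find everything downstream of it. The sketch below shows that idea in plain Python; the graph names are made up and nothing here is EME-specific.

    # Impact-analysis sketch: find everything downstream of a changed object
    # in a dependency graph (edges point from producer to consumer).
    dependencies = {
        "extract_customers": ["clean_customers"],
        "clean_customers":   ["load_dw_customers", "build_customer_report"],
        "load_dw_customers": [],
        "build_customer_report": [],
    }

    def impacted(changed, graph):
        """Return all objects reachable downstream from the changed object."""
        seen, stack = set(), [changed]
        while stack:
            node = stack.pop()
            for downstream in graph.get(node, []):
                if downstream not in seen:
                    seen.add(downstream)
                    stack.append(downstream)
        return seen

    print(impacted("extract_customers", dependencies))
    # {'clean_customers', 'load_dw_customers', 'build_customer_report'}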
Data Profiler

The Data Profiler is an analytical application that can specify data range, scope, distribution, variance, and quality. It runs in a graphic environment on top of the Co>Operating system.
Conduct>It

Ab Initio Conduct>It is a tool for developing high-volume data processing systems. It enables graphs from the Graphical Development Environment to be combined with custom scripts and programs from other vendors.

Source: http://www.etltools.net/ab-initio.html

IBM Infosphere Information Server

Information Server is a line of products from IBM responsible for data warehousing and data integration.
IBM introduced InfoSphere Information Server as a complete set of business intelligence and data warehousing products that operate in four functional areas:

    * Source data profiling – understanding and modeling the source data to detect inconsistencies, anomalies and problems with the data (Information Analyzer, Business Glossary)
    * Data quality assurance – data cleansing, standardizing and merging (Information Analyzer, QualityStage and AuditStage)
    * Data transformation – data is integrated, transformed and loaded into a data warehouse using ETL processes
    * Data delivery to the end users

Infosphere applications

    * IBM InfoSphere Warehouse applications offer solutions for the deployment, integration and management of a data warehouse and the data in an organization, based on DB2 database technology.
    * The Information Analyzer application interprets and analyzes data from the source systems or a data warehouse to make it easy to understand for both business and technical users.
      The tool connects to the source data and performs data structure analyses, such as column analysis, table analysis, primary key and foreign key analysis, relationship identification, etc. It also monitors business rules and produces plain-text or formatted reports with the results.
    * Metadata Workbench - a web-based, fully integrated and unified metadata repository management tool for InfoSphere. The tool helps manage business, technical and operational metadata and provides such analysis modules as data lineage, impact analysis and diff.
    * DataStage and QualityStage - a GUI application for integrating the data stored in a data warehouse and implementing the ETL and data quality assurance processes. DataStage can work with very large amounts of data and integrate it to make it easily accessible for other applications and for the end users. QualityStage and DataStage processes are developed in the same tool in IBM Information Server (the Designer).
    * Business Glossary is a web-based tool responsible for managing a common business terms vocabulary (creation, editing, viewing and sharing) in an organization. It is fully integrated with the Information Server platform and can share business terms and categories with such products as Information Analyzer, Rational Data Architect or FastTrack. Business Glossary Anywhere is a desktop application which enables access to the business glossary terms from any other Windows application (including non-IBM software).

Summary

InfoSphere is a complete line of products for creating and using a data warehouse for business intelligence. It has advantages, but it also has disadvantages, one of which is the price of the solution. The components work properly only together, so a company has to buy several products to create a well-working data warehouse. It is also difficult to implement a brand new data warehouse in a company without interrupting data gathering. It would be much easier if the InfoSphere components were compatible with data warehousing solutions from other companies and with other standards.

Source: http://www.etltools.net/information-server.html

INFORMATICA

Informatica PowerCenter is an enterprise data integration platform that works as a single unit. Highly available, fully scalable and high-performing, PowerCenter provides the foundation for all major data integration projects and initiatives throughout the enterprise.


These areas include:

    * B2B data exchange
    * data governance
    * data migration
    * data warehousing
    * data replication and synchronization
    * Integration Competency Centers (ICC)
    * Master Data Management (MDM)
    * service-oriented architectures (SOA) and more.

PowerCenter provides reliable solutions to IT management, global IT teams, developers and business analysts: it delivers trusted data that meets the analytical and operational requirements of the business, and it also supports various data integration projects and collaboration between business and IT across the globe.

Informatica PowerCenter enables access to almost any data sources from one platform. It is possible thanks to the technologies of Informatica PowerExchange and PowerCenter Options.

PowerCenter is able to deliver data on demand of the business offering the choice of data access of real-time, batch or change data capture (CDC).

Informatica PowerCenter is capable of managing the broadest range of data integration initiatives as a single platform. This ETL tool makes it possible to simplify the development of data warehouses and data marts.

Supported by PowerCenter Options, Informatica PowerCenter software meets enterprise expectations and requirements for security, scalability and collaboration through such capabilities as:

    * dynamic partitioning
    * high availability/seamless recovery
    * metadata management
    * data masking
    * grid computing support and many more


In order to increase operational efficiency and to manage and execute various initiatives, the platform enhances successful collaboration between the business and IT.
Informatica ETL products

PowerCenter has an offer of a wide range of features designed for global IT teams and production administrators, as well as for individual developers and professionals:

    * Metadata Manager (consolidates metadata into a unified integration catalog)
    * team-based development capabilities (accelerate development, simplify administration)
    * a set of visual tools and productivity tools (manage administration and collaboration between different specialists)
    * metadata-driven architecture (eliminates the recoding requirement).

The Informatica ETL (Informatica PowerCenter) product comprises three major applications:

    * Informatica PowerCenter Client Tools. These tools have been designed to enable a developer to:
      - report metadata
      - manage the repository
      - monitor sessions' execution
      - define mapping and run-time properties (sessions)
    * Informatica PowerCenter Repository - the centre of the Informatica tools, where all data (e.g. related to mappings or sources/targets) is stored. All metadata for an application is kept here, and all the client tools as well as the Informatica Server use the Repository to obtain data. The Repository can be compared to the hard disk or memory of a PC: without it, it would still be possible to process data, but there would be virtually no data to process.
    * Informatica PowerCenter Server - the place where all the actions are executed. It physically connects to sources and targets to fetch the data, applies all the transformations and loads the data into the target systems.

Detailed business and technical information on Informatica PowerCenter ETL tool can be found at the Informatica tutorial

Source: http://www.etltools.net/informatica.html

ETL Products

Open source (freeware) data integration tools are the latest segment of community-driven software. They are an alternative to commercial packaged solutions as well as a productive and efficient alternative to writing custom code.

The most advanced ETL packages on the market include enterprise-level offerings from IBM, Informatica and Ab Initio. They are designed to handle high performance and scalability requirements. Their minimum prices range from $45,000 to around $150,000 per CPU for an enterprise package.

There are, however, less expensive options, but they are often limited in their support for heterogeneous environments (Microsoft, Oracle) and sometimes charge extra for additional facilities and services (for instance metadata management, data quality and data federation modules, or connectors for SAP or IFS).

It is a challenge for data integration architects to find an ETL tool that is capable of integrating data between a variety of sources and targets and is reasonably priced. If there is a need for such a solution, an open source model should be considered.
Open Source ETL

Open source implementations play a significant role both in bringing community power into ETL and in promoting the development of standards. A large number of testers is available, which keeps free ETL tools widely discussed and evolving. But the most important feature of open source ETL products is that they are significantly less expensive than commercially licensed tools.

There are four basic constituencies that adopt free ETL tools:
# independent software vendors (ISVs) looking for embeddable data integration - costs are reduced and the savings are passed on to customers, data integration, migration and transformation capabilities are incorporated as an embedded component, and the memory footprint of the end product is smaller than that of large commercial offerings;
# system integrators looking for inexpensive integration tooling - open source ETL software allows system integrators to deliver integration capabilities significantly faster and with a higher quality level than custom-building those capabilities;
# enterprise departmental developers looking for a local solution - using free ETL technology in larger enterprises gives support to small initiatives;
# midmarket companies with smaller budgets and less complex requirements - small companies are more likely to adopt open source BI providers, as they have less demanding needs for data integration software and hence no need for a costly BI provider.

A number of open source projects are capable of performing more than one ETL function. The technical features of these projects are more similar than different.
Free ETL tools
Pentaho Data Integration (PDI, Kettle)

According to Pentaho itself, it is a BI provider that offers ETL tools as part of its data integration capability. These ETL capabilities are based on the Kettle project. Pentaho is known for selling subscriptions such as support services and management tools. Focusing primarily on connectivity and transformation, Pentaho's Kettle project incorporates a significant number of contributions from its community. Community-driven enhancements include a Web services lookup, an SAP connector and the development of an Oracle bulk loader.
The SAP connector, although integrated with Kettle, is not a free product: it is a commercially offered plug-in, but it is around ten times cheaper than SAP connectivity for InfoSphere DataStage.
Talend

Talend is a startup of French origin that has positioned itself as a pure play in open source data integration and now offers its product, Open Studio. For vendors wishing to embed Open Studio capabilities in their products, Talend has an OEM license agreement; that is what JasperSoft has done, creating an open source BI stack to compete with Pentaho's Kettle. Talend is a commercial open source vendor that generates profit from support, training and consulting services. Open Studio offers a user-friendly graphical modeling environment and provides both a traditional engine-based approach and pushdown optimization (an architectural approach). The latter allows users to bypass the cost of dedicated hardware for an ETL engine and to leverage spare server capacity within both the source and target environments to power the transformations.
clover.ETL

This project is directed by OpenSys, a company based in the Czech Republic. It is a Java-based, dual-licensed open source project that in its commercially licensed version offers a warranty and support. Its small footprint makes it easy for system integrators and ISVs to embed. It aims at creating a basic library of functions, including mapping and transformations. Its enterprise server edition is a commercial offering.
KETL

This project is sponsored by Kinetic Networks, a professional services company. It started as a tool for customer engagements, because commercial tools were too expensive. Kinetic employees currently develop the code, but outside contributions are expected in the future. Additional modules, such as a data quality and profiling component, were also developed by Kinetic and are not placed under the open source licence. Initially, KETL was designed as a utility to replace the custom PL/SQL code that moved large data volumes. It is a Java-based, XML-driven development environment that is of great use for skilled Java developers. KETL is currently limiting for users, as it does not have a visual development GUI.
Limitations of the license cost free ETL

When used within their limits, today's free ETL tools are quite suitable and do their work.
In the future those limits are expected to be extended; for now the limitations include:
# enterprise application connectivity
# non-RDBMS connectivity
# large data volumes and small batch windows
# multirole collaborations
# complex transformation requirements

Open source ETL tools do not yet provide management capabilities that could be considered a cross-enterprise standard for data integration. They lack advanced connectivity, real-time data integration techniques such as enterprise information integration (EII) and change data capture (CDC), enterprise-level collaboration, and integrated data quality management and profiling. Yet many enterprises are not looking for a large and expensive data integration suite. If an efficient and reliable alternative to custom-coding data integration requirements is available, the option of using free ETL technologies should be taken into consideration.

The most popular open source vendors are still not truly community-driven projects. Increased investment is expected from a wider community to build out and encourage development, especially of connectivity modules for the enormous number of evolving source systems.

Source: http://www.etltools.net/free-etl-tools.html

ETL TOOLS COMPARISON

ETL Tools - general information

ETL tools are designed to save time and money by eliminating the need for hand-coding when a new data warehouse is developed. They are also used to facilitate the work of database administrators who connect different branches of databases and integrate or change existing databases.

The main purposes of an ETL tool are the following (a minimal sketch of these steps appears after the list):

    * extraction of data from legacy sources (usually heterogeneous)
    * data transformation (data optimized for transactions --> data optimized for analysis)
    * synchronization and cleansing of the data
    * loading the data into a data warehouse.
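The sketch below walks through those steps on a toy data set in plain Python; it is a conceptual illustration under assumed data, not the output of any particular ETL tool.

    # Minimal end-to-end ETL sketch: extract, transform/cleanse, load.
    import sqlite3

    # Extract: rows as they arrive from a (heterogeneous) legacy source.
    source_rows = [
        ("1", " Alice ", "2010-12-01", "100.50"),
        ("2", "Bob",     "2010-12-02", None),        # missing amount
        ("1", " Alice ", "2010-12-01", "100.50"),    # duplicate
    ]

    # Transform and cleanse: trim names, drop duplicates and rows without an
    # amount, and convert strings into analysis-friendly types.
    seen, clean_rows = set(), []
    for cust_id, name, day, amount in source_rows:
        if amount is None or (cust_id, day) in seen:
            continue
        seen.add((cust_id, day))
        clean_rows.append((int(cust_id), name.strip(), day, float(amount)))

    # Load: write the cleansed rows into the warehouse table.
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE sales (customer_id INTEGER, name TEXT, day TEXT, amount REAL)")
    warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", clean_rows)
    print(warehouse.execute("SELECT * FROM sales").fetchall())
    # [(1, 'Alice', '2010-12-01', 100.5)]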

There are several requirements that ETL tools must meet in order to deliver optimal value to users and support a full range of possible scenarios.

Those are:
- data delivery and transformation capabilities
- data and metadata modelling capabilities
- data source and target support
- data governance capability
- runtime platform capabilities
- operations and administration capabilities
- service-enablements capability.
ETL tools comparison criteria

The research presented in this article is based on Gartner's data integration Magic Quadrant, Forrester research and our professional experience. The etltools.org portal is not affiliated with any of the companies listed in the comparison below.

The research inclusion and exclusion criteria are as follows:
- range and mode of connectivity/adapter support
- data transformation and delivery modes support
- metadata and data modelling support
- design, development and data governance support
- runtime platform support
- enablement of services
plus three additional requirements for vendors:
- $20 million or more of software revenue from data integration tools every year, or not less than 300 production customers
- support of customers in not less than two major geographic regions
- customer implementations at the cross-departmental and multi-project level.
ETL Tools Comparison

The information provided below lists major strengths and weaknesses of the most popular ETL vendors.
IBM (Information Server Infosphere platform)

    * Advantages: strongest vision on the market, flexibility
    * progress towards common metadata platform
    * high level of satisfaction from clients and a variety of initiatives

    * Disadvantages: difficult learning curve
    * long implementation cycles
    * became very heavy (lots of GBs) with version 8.x and requires a lot of processing power

Informatica PowerCenter

    * Advantages: most substantial size and resources on the market of data integration tools vendors
    * consistent track record, solid technology, straightforward learning curve, ability to address real-time data integration schemes
    * Informatica is highly specialized in ETL and Data Integration and focuses on those topics, not on BI as a whole
    * focus on B2B data exchange

    * Disadvantages: several partnerships diminishing the value of technologies
    * limited experience in the field.

Microsoft (SQL Server Integration Services)

    * Advantages: broad documentation and support, best practices for data warehouses
    * ease and speed of implementation
    * standardized data integration
    * real-time, message-based capabilities
    * relatively low cost - excellent support and distribution model

    * Disadvantages: problems in non-Windows environments; inherits all Microsoft Windows limitations.
    * unclear vision and strategy

Oracle (OWB and ODI)

    * Advantages: based on Oracle Warehouse Builder and Oracle Data Integrator – two very powerful tools;
    * tight connection to all Oracle datawarehousing applications;
    * tendency to integrate all tools into one application and one environment.

    * Disadvantages: focus on ETL solutions rather than on data management in an open context;
    * tools are used mostly for batch-oriented work and transformation rather than real-time processing or federated data delivery;
    * the long-awaited bond between OWB and ODI has brought only promises - customers are confused about functionality and the future is uncertain

SAP BusinessObjects (Data Integrator / Data Services)

    * Advantages: integration with SAP
    * the combination of SAP and Business Objects creates a firm determined to stir the market;
    * Good data modeling and data-management support;
    * SAP Business Objects provides tools for data mining, data quality and profiling, thanks to many acquisitions of other companies.
    * Quick learning curve and ease of use

    * Disadvantages: SAP Business Objects is seen as two different companies
    * Uncertain future. Controversy over deciding which method of delivering data integration to use (SAP BW or BODI).
    * BusinessObjects Data Integrator (Data Services) may not be seen by some organizations as a capable stand-alone application.

SAS

    * Advantages: experienced company, great support and, most of all, a very powerful data integration tool with lots of multi-management features
    * can work on many operating systems and gather data from a number of sources – very flexible
    * great support for business-class companies as well as for medium-sized and smaller ones

    * Disadvantages: misplaced sales force; the company is not well recognized
    * SAS has to extend its influence to reach the non-BI community
    * Costly

Sun Microsystems

    * Advantages: data integration tools are part of the huge Java Composite Application Platform Suite - very flexible, with ongoing development of the products
    * 'Single-view' services draw together data from a variety of sources; one of a small set of vendors with a strong vision

    * Disadvantages: relative weakness in bulk data movement
    * limited mindshare in the market
    * support and services rated below adequate

Sybase

    * Advantages: assembled a range of capabilities to address a multitude of data delivery styles
    * size and global presence of Sybase create opportunities in the market
    * pragmatic near-term strategy - a good fit for current market demand
    * broad partnerships with other data quality and data integration tools vendors

    * Disadvantages: falls behind market leaders and large vendors
    * gaps in many aspects of data management

Syncsort

    * Advantages: functionality; well-known brand on the market (40 years of experience); loyal customer and experience base;
    * easy implementation, strong performance, targeted functionality and lower costs

    * Disadvantages: struggle with gaining mind share in the market
    * lack of support for other than ETL delivery styles
    * lack of professional services capability

Tibco Software

    * Advantages: message-oriented application integration; capabilities based on common SOA structures;
    * support for federated views; easy implementation, support and performance

    * Disadvantages: scarce references from customers; not widely enough recognised for data integration competencies
    * lacking in data quality capabilities.

ETI

    * Advantages: proven and mature code-generating architecture
    * one of the earliest vendors on the data integration market; support for SOA service-oriented deployments;
    * successfully deals with large data volumes and a high degree of complexity, extension of the range of data platforms and data sources;
    * customers' positive responses to ETI technology

    * Disadvantages: relatively slow growth of customer base
    * technology seen as neither particularly attractive nor inventive.

iWay Software

    * Advantages: offers physical data movement and delivery; support of wide range of adapters and access to numerous sources;
    * well integrated, standard tools;
    * reasonable ease of implementation effort

    * Disadvantages: gaps in specific capabilities
    * relatively costly - not competitive versus market leaders

Pervasive Software

    * Advantages: many customers, years of experience, solid applications and support;
    * good use of metadata
    * upgrading from older versions to newer ones is straightforward.

    * Disadvantages: inconsistency in defining the target for their applications;
    * no federation capability;
    * limited presence due to poor marketing.

Open Text

    * Advantages: simplicity of use with less-structured sources
    * Easy licensing for business solutions
    * cooperates with a wide range of sources and targets
    * increasingly high functionality

    * Disadvantages: limited federation, replication and data quality support; rare upgrades due to its simplicity;
    * weak real-time support due to reliance on third-party solutions and other database utilities.

Pitney Bowes Software

    * Advantages: Data Flow concentrates on data integrity and quality;
    * supports mainly ETL patterns; can be used for other purposes too;
    * ease of use, fast implementation, specific ETL functionality.

    * Disadvantages: rarely competes with other major companies; repeated rebranding triggers suspicion among customers.
    * narrow vision of possibilities, even though Data Flow comes with a variety of applications.
    * weak support, inexperienced services.

SOURCE:http://www.etltools.net/etl-tools-comparison.html
newer post

Different Types Of Data Warehouse ETL Tools

0 comments
ETL

Nowadays, most companies' existence depends on data flow. When plenty of information is generally accessible and one can find almost everything one needs, managing has become easier than ever before. The Internet simplifies cooperation – the time needed to send and receive requested data gets shorter as more and more institutions computerize their resources. Communication between separate corporate departments has also become easier – no one needs to send paper letters (or even the office boy), as the process has been replaced by e-mail. Although the new ways of communication have improved and facilitated management, ubiquitous computerization has significant disadvantages.

The variety of data – as positive a phenomenon as it is – has got a little bit out of control. The unlimited growth of database sizes has caused a mess that often slows down (or even disables) the process of finding data.

It's all about effective information storing. Uncategorized data is scattered across different platforms and systems. As a consequence, finding the data you want brings a lot of trouble – the user needs to know what data he administers, where it is located (and whether he has proper access), and finally how to get it out.
Whoever thought that the hardest task was making decisions based on data was wrong. No – finding the data itself is often much more annoying. And users are not the only ones suffering from database overgrowth. The IT departments – usually responsible for keeping the systems working – have to struggle with data in different formats and systems. Keeping it all alive is extremely time-consuming, which delays the company's work.
Slow (or sometimes entirely omitted) transformation of data means it is usually impossible to provide the demanded information at the demanded time. The resulting divergence between the data provided and the data that actually exists at the moment of need harms the IT department's image.

To achieve better results, companies invest in external systems – computing power and resources. Too little power causes a lack of data synchronization: transporting information between separate units takes too long for effective work. On the other hand, increasing computing power – one possible solution – is expensive and leads to an overgrowth of operating costs.
Suppose an example company managed to prepare a well-working database supporting a designated operation. A lot of money and time was spent. Everything seems wonderful until another operation comes along. Suddenly it appears that the system once created does not really fit the requirements of the new operation, and the best idea is to build a new system from scratch. Yes, modifications might be made, but there is no single developer common to all parts of the project, so it demands the cooperation of at least a few parties – which practically rules out the idea.
ETL process

The three-stage ETL process, and the ETL tools implementing the concept, might be a response to the needs described above.

The 'ETL' abbreviation stands for 'Extract, transform, and load' – the words that describe the idea behind the system. ETL tools were created to improve and facilitate data warehousing.

The ETL process consists of the following steps (a minimal sketch in code follows the list):

   1. Initiation
   2. Build reference data
   3. Extract from sources
   4. Validate
   5. Transform
   6. Load into staging tables
   7. Audit reports
   8. Publish
   9. Archive
  10. Clean up
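
As a rough illustration of how the steps above might be strung together, the sketch below names one placeholder function per step. The function bodies and data are invented for the example; real tools implement each step very differently.

```python
# Placeholder implementations of the ten steps listed above, wired into one run.
def initiate():              return {"run_id": 1}
def build_reference_data():  return {"country_codes": {"US", "CA"}}
def extract_from_sources():  return [{"id": 1, "country": "us", "amount": "19.99"}]
def validate(rows, ref):     return [r for r in rows if r["country"].upper() in ref["country_codes"]]
def transform(rows):         return [{**r, "country": r["country"].upper(), "amount": float(r["amount"])} for r in rows]
def load_into_staging(rows): print(f"staged {len(rows)} row(s)"); return rows
def audit_reports(rows):     print(f"audit: {len(rows)} row(s) passed validation")
def publish(rows):           print("published to warehouse tables")
def archive(rows):           print("source extracts archived")
def clean_up():              print("temporary staging objects dropped")

def run_etl():
    initiate()
    ref = build_reference_data()
    rows = extract_from_sources()
    rows = validate(rows, ref)
    rows = transform(rows)
    rows = load_into_staging(rows)
    audit_reports(rows)
    publish(rows)
    archive(rows)
    clean_up()

run_etl()
```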



Sometimes these steps are supervised and performed indirectly, but this is very time-consuming and may not be as accurate.
The purpose of using ETL tools is to save time and make the whole process more reliable.
ETL Tools

The times of increasing data-dependence forced a lot of companies to invest in complicated data warehousing systems. Their differentiation and incompatibility led to an uncontrolled growth of costs and time needed to coordinate all the processes. The ETL (Extract, transform, load) tools were created to simplify the data management with simultaneous reduction of absorbed effort.

Depending on customers' needs, there are several types of tools.
Some of them perform and supervise only selected stages of the ETL process, like data migration tools (EtL tools, 'small t' tools) and data transformation tools (eTl tools, 'capital T' tools). Others are complete ETL tools and have many functions intended for processing large amounts of data or more complicated ETL projects.

Some of them, like server engine tools, execute many ETL steps at the same time from more than one developer, while others, like client engine tools, are simpler and execute ETL routines on the same machine they are developed on.
There are two more types. The first, called code-based tools, is a family of programming tools which allow you to work with many operating systems and programming languages. The second, called GUI-based tools, removes the coding layer and allows you to work without any knowledge (in theory) of coding languages.
How do the ETL tools work?

The first task is data extraction from internal or external sources. After sending queries to the source system, data may go directly to the target database. Usually, however, there is a need to monitor or gather more information, so the data first goes to a staging area. Some tools extract only new or changed information automatically, so we don't have to update it on our own.
The second task is transformation, which is a broad category:
- transforming data into the structure required to continue the operation (extracted data usually has a structure typical of the source)
- sorting data
- connecting or separating
- cleansing
- checking quality

The third task is loading into a data warehouse.

As you can see, ETL tools have many other capabilities (next to the main three: extraction, transformation and loading), for instance sorting, filtering, data profiling, quality control, cleansing, monitoring, synchronization and consolidation.
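
The transformation category in particular hides many small operations. The snippet below is a generic, made-up illustration of a few of them (restructuring, cleansing, a simple quality check and sorting); it is not tied to any specific ETL product.

```python
# Illustrative transformation step: restructure, cleanse, quality-check and sort
# a handful of extracted records before loading.
raw = [
    {"CUST_ID": " 002 ", "NAME": "smith, anna", "SALES": "120.50"},
    {"CUST_ID": "001",   "NAME": "DOE, JOHN",   "SALES": "99.00"},
    {"CUST_ID": "003",   "NAME": "",            "SALES": "notanumber"},
]

def cleanse(record):
    """Trim identifiers, normalise names, convert amounts; return None if unusable."""
    try:
        return {
            "customer_id": record["CUST_ID"].strip(),
            "name": record["NAME"].strip().title(),
            "sales": float(record["SALES"]),      # fails the quality check if not numeric
        }
    except (KeyError, ValueError):
        return None

cleansed = [r for r in (cleanse(rec) for rec in raw) if r is not None]
cleansed.sort(key=lambda r: r["customer_id"])     # sorting
print(cleansed)
print(f"{len(raw) - len(cleansed)} record(s) rejected by the quality check")
```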
ETL Tools providers

Here is a list of the most popular commercial and freeware (open source) ETL tools.

Commercial ETL tools:

    * IBM Infosphere DataStage
    * Informatica PowerCenter
    * Oracle Warehouse Builder (OWB)
    * Oracle Data Integrator (ODI)
    * SAS ETL Studio
    * Business Objects Data Integrator (BODI)
    * Microsoft SQL Server Integration Services (SSIS)
    * Ab Initio

Freeware, open source ETL tools:

    * Pentaho Data Integration (Kettle)
    * Talend Integrator Suite
    * CloverETL
    * Jasper ETL

As you can see, there are many types of ETL tools, and all you have to do now is choose the appropriate one for you. Some of them are relatively expensive, and some may be too complex if you do not need to transform a lot of information, use many sources or use sophisticated features.
It is always necessary to start by defining the business requirements, then consider the technical aspects, and only then choose the right ETL tool.

SOURCE:http://www.etltools.net/
newer post

Data Warehouse ETL Tools Images

0 comments

newer post

The Generic Meta Data Repository

0 comments
The most important data model that the data warehouse developer needs to understand in the business intelligence environment is the meta data repository model. The focus of many projects is on operational source system or data warehouse models because that is where the actual raw data is stored. While this is unquestionably the primary purpose of the business intelligence environment, it is the meta data repository and its data model structure that allow the various function areas in the warehouse to communicate. The repository's role in furnishing context to the data content, processes and reports that are elements of the environment is often overlooked. The meta data repository model is the central hub of the environment. It is the one place where integration occurs between operational source systems; the data warehouse; the extraction, transformation and load (ETL) processes; the business views; reports; and operational statistics.

While many firms understand the significance of the meta data repository, few have the experience to construct one from a blank sheet of paper. This is where a generic meta data repository steps in to help get the project started by providing a template that can be tailored by the repository architect for specific business needs.

Unlike a generic data warehouse model that is useful only for a specific vertical industry or functional purpose, the generic meta data repository model can be utilized across various businesses. This reusability is possible because the repository's purpose is to capture and store meta data – data about data. This function is not influenced by the type of business being supported but by the data warehouse environment infrastructure needed. The generic repository model is not an all-encompassing "shrink wrap" solution for any company. Its purpose is to provide the repository architect with a point of departure to begin the revision process needed for the enterprise. The repository architect's modeling decisions are based primarily on what other type of warehouse components need to be supported.

Meta Data Components

There are typically seven standard components to a meta data repository data model as seen in Figure 1. These components can be viewed as the subject areas of the generic meta data repository model. Depending on the requirements of the business, the model may be revised by removing, editing or integrating a new component area. If a company is using a data cleansing tool to scrub dirty data from a source system, the ETL component area of the model may need to be revised to capture pertinent information. Alternatively, if capture of query statistics is deemed extraneous or is accommodated through some other means in the environment (e.g., OLAP tool, meta data layer, database), this component area may be removed from the model.


Figure 1: Meta Data Repository Components

The first component area of the generic meta data repository model contains information about the logical data warehouse model. The logical data warehouse model contains all the business entities for specific business subject areas, relationships between the entities and attributes of each entity comprising the target model. The level of detail found in the logical model will vary depending on your firm's data modeling methodology and practices. Individual modeling practices for the logical data model may also necessitate changes in the generic model to accommodate additional information.

The second component contains very detailed information about the physical data warehouse target model. The logical data warehouse model must go through a transformation process in order to become a physical data warehouse dimensional model. A fundamental set of dimensional modeling steps is applied to the logical model in order to produce the physical model (e.g., addition of time). Depending on the requirements of your dimensional data modeling and/or database administration departments, this repository component may be extended to include additional physical attributes such as indexes and fragmentation strategies.

The third component of the generic model contains physical information about the sources feeding data to the data warehouse. This operational information can originate from source databases, file extractions, spreadsheets, Internet and other formats. The information stored in this component can be used to alert the data warehouse development team of pending changes to a source that will potentially affect the data warehouse model, ETL and report processing. Additionally, the information in this component can be used by warehouse analysts, both business and technical, for reconciliation and auditing purposes.

The fourth component of the generic model is the source-to-target data warehouse mappings. This component stores the table/column cross-reference mapping and semantic resolution between the source operational systems and the target physical data warehouse model. This is one of the primary purposes of a meta data repository since other environment tools often do not perform this function or do so inadequately. Once again, the detail to which the semantic resolution is documented will vary based on your firm's desires and goals. You may decide to extend the model to take a more structured approach toward business rule identification and use.

The fifth component is the business subject areas. The logical grouping of physical data warehouse tables is stored here. This information provides business end users with a more intuitive navigation method or view of the information stored in the data warehouse. This information can be used by both ETL and front-end reporting tools. Your particular business needs may require a more hierarchical method to store and navigate subject area information, thus requiring revisions to the generic model.

The sixth component of the generic model is ETL statistics. Information on individual extract, transformation and load processes is captured within this component. This information can be used to determine process improvements, database enhancements, fault isolation and other optimization procedures for the warehouse. Use of this particular component will depend greatly on your specific ETL methodology and toolset capabilities. The capture of this type of information in the repository database will require some degree of additional time during ETL processing.

The seventh and last component of the generic meta data repository model is query statistics. Information about every query made against the data warehouse database is stored in this component. This information is used to determine a variety of optimizations that can be performed against the warehouse database. Usage statistics on tables and columns is analyzed to identify dormant data, aggregation and index candidates. Use of this component will be dependent on whether your company has purchased a data monitoring tool for the BI environment since these products typically come with their own database.
Logical View of the Generic Repository Model

The logical modeling view of the generic meta data repository is illustrated in Figure 2. It is important to remember that the generic repository model should only be used as a guide, not as an all-encompassing road map. Individual components of the model may be revised, supplemented and/or removed depending on the business and infrastructure environment needs of your specific enterprise. Methods of versioning and source control for the model components are not addressed in sufficient detail for most projects, but are left to the discretion and design of the repository architect.

Figure 2: Logical View of the Generic Repository Model

The four tables that store both logical and physical information about the actual data warehouse model are the Target Table, Target Column, Target Column Map and Target Domain.

The Target Table is able to store both logical and physical views of tables in the warehouse through use of the Target Table Type column, which distinguishes the two categories. An implied business rule for this table is that only physical table types are actually mapped to the source systems feeding information to the warehouse. The Target Table Name column contains either the logical or the actual physical database name of the warehouse table. The Target Column Map table simply provides a cross-reference of the logical and/or physical tables to their associated columns.

The Target Column table contains both technical and business information about the field in the table (logical or physical since the Target Table Type is inherited onto this table from its parent). The Target Column Name column contains either the logical or the actual physical database name of the warehouse field. The Target Column Business Rules field is used to denote any conventions or practices the column must follow. For example, an amount field must have a corresponding currency value.

The Target Domain table contains definitions of values for columns containing codes (lookup list) only. Each possible domain value or code for a column is stored in this table. For example, the column country code has domain values that include USA for United States and CAN for Canada, etc.
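
As one concrete (and deliberately simplified) reading of these four tables, the sketch below creates them in SQLite with a handful of the columns named in the text. The exact column lists, data types and keys are assumptions; a real repository would be shaped by the repository architect's own design.

```python
import sqlite3

# Simplified versions of the four target-side repository tables described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target_table (
        target_table_id   INTEGER PRIMARY KEY,
        target_table_name TEXT NOT NULL,
        target_table_type TEXT CHECK (target_table_type IN ('LOGICAL', 'PHYSICAL'))
    );
    CREATE TABLE target_column (
        target_column_id             INTEGER PRIMARY KEY,
        target_column_name           TEXT NOT NULL,
        target_column_business_rules TEXT
    );
    CREATE TABLE target_column_map (
        target_table_id  INTEGER REFERENCES target_table(target_table_id),
        target_column_id INTEGER REFERENCES target_column(target_column_id),
        PRIMARY KEY (target_table_id, target_column_id)
    );
    CREATE TABLE target_domain (
        target_column_id INTEGER REFERENCES target_column(target_column_id),
        domain_value     TEXT,
        domain_meaning   TEXT,
        PRIMARY KEY (target_column_id, domain_value)
    );
""")
# Domain entries for a country-code column, echoing the example in the text:
conn.execute("INSERT INTO target_column VALUES (1, 'COUNTRY_CODE', NULL)")
conn.executemany("INSERT INTO target_domain VALUES (?, ?, ?)",
                 [(1, 'USA', 'United States'), (1, 'CAN', 'Canada')])
conn.commit()
```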

The three tables that store physical information about the operational systems that feed information to the data warehouse are the Source Table, Source Column and Source Domain.

The Source Table contains information about the source system database or extract file used to populate the target tables of the warehouse. The Source ID column is used to uniquely identify a particular system of record (e.g., ERP, Order Management 1, Trouble Ticket, Client Extract 5). The Source Format Type provides a means to identify the category of the source information such as a server/database, directory/file or spreadsheet file. In cases where operational information is being extracted directly from a source database table, the Source DBMS column contains the physical names of these items.

The remaining tables and columns in this meta data repository component, Source Column and Source Domain, follow the same definitions found in the Target component tables, Target Column and Target Domain by just substituting source in place of target.

The Source To Target Column and Domain Map tables provide the necessary design strategy to build extraction, transformation and load (ETL) processes by linking the operational sources to the warehouse tables they feed.

In addition to the actual mapping documented in these cross- reference tables, any additional instructions that need to be described are located in the Mapping Semantic Resolution columns on both tables.
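
A minimal sketch of what entries in the Source To Target Column Map might look like, including the free-text Mapping Semantic Resolution mentioned above, follows; the field names and sample mappings are assumptions made for illustration.

```python
# Illustrative Source To Target Column Map entries: the cross-reference plus a
# free-text Mapping Semantic Resolution describing the transformation rule.
source_to_target_column_map = [
    {
        "source_table": "ORD_HDR", "source_column": "ORD_AMT_CENTS",
        "target_table": "FACT_ORDERS", "target_column": "ORDER_AMOUNT",
        "mapping_semantic_resolution": "Divide by 100; default NULL to 0.",
    },
    {
        "source_table": "CUST_MAST", "source_column": "CTRY",
        "target_table": "DIM_CUSTOMER", "target_column": "COUNTRY_CODE",
        "mapping_semantic_resolution": "Upper-case, then translate to the codes held in Target Domain.",
    },
]

for m in source_to_target_column_map:
    print(f"{m['source_table']}.{m['source_column']} -> "
          f"{m['target_table']}.{m['target_column']}: {m['mapping_semantic_resolution']}")
```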

Three tables – ETL Process, ETL Process Source Map and ETL Process Statistics – map warehouse tables and operational sources to the procedures that access them and capture statistics about load processing. Population of these tables will depend on the ETL methodology and products utilized by your organization.

The ETL Process table contains a Process ID column to uniquely identify the ETL procedure in the warehouse environment.

The ETL Process Statistics table contains both the mapping of the process to a warehouse table or tables plus the date/time statistics of the process for a specific batch cycle load. A single ETL process can load one or many target warehouse tables depending on the sources and/or business requirements involved.

The ETL Process Source Map tables provide a cross-reference of ETL processes to operational systems or extract files.
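
One way to picture how these three tables work together during a batch cycle is sketched below. The timing capture, column names and stubbed row count are illustrative assumptions, not a prescribed design.

```python
from datetime import datetime

# Illustrative capture of ETL process statistics for one batch cycle.
etl_process = {"process_id": "LOAD_FACT_ORDERS", "description": "Daily order fact load"}
etl_process_source_map = [{"process_id": "LOAD_FACT_ORDERS", "source_id": "ORDER_MGMT_1"}]
etl_process_statistics = []

def run_and_record(process_id, target_table, load_fn):
    """Run one load and append its date/time statistics for this batch cycle."""
    started = datetime.now()
    rows_loaded = load_fn()
    etl_process_statistics.append({
        "process_id": process_id,
        "target_table": target_table,
        "start_time": started,
        "end_time": datetime.now(),
        "rows_loaded": rows_loaded,
    })

run_and_record("LOAD_FACT_ORDERS", "FACT_ORDERS", lambda: 125_000)  # stubbed row count
print(etl_process_statistics)
```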

The Subject Area and Subject Area Table Map provide the end user with a logical grouping or business view of the data warehouse tables (e.g., sales, financial, human resources).

The Subject Area table contains a Subject Area ID column to uniquely identify the group of tables within the enterprise from a business perspective. The Subject Area Table Map simply groups the warehouse tables (logical and physical) to a subject area.

Two tables, Query Statistics and Query Table Column Hits, map warehouse tables and columns to the query requests that access them and capture statistics about processing. Population of these tables will depend on the DBMS, reporting tool and/or data-monitoring product being employed.

The Query Statistics table contains various information about query results made against the warehouse. These requests can be from your warehouse front-end reporting tool or from ad hoc requests made directly against the database depending on the informational needs of the DBAs, data acquisition developers, data access developers or the architect. The Query Table Column Hits table provides a cross-reference of database queries to warehouse tables and columns.
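
To make the dormant-data idea concrete, the toy analysis below counts column hits from a Query Table Column Hits style cross-reference and flags warehouse columns that no query touched. The structures and sample data are invented for illustration.

```python
from collections import Counter

# Toy analysis over Query Table Column Hits: flag columns no query touched.
warehouse_columns = {("FACT_ORDERS", "ORDER_AMOUNT"),
                     ("FACT_ORDERS", "DISCOUNT_PCT"),
                     ("DIM_CUSTOMER", "COUNTRY_CODE")}

query_table_column_hits = [
    {"query_id": 101, "table": "FACT_ORDERS", "column": "ORDER_AMOUNT"},
    {"query_id": 102, "table": "FACT_ORDERS", "column": "ORDER_AMOUNT"},
    {"query_id": 102, "table": "DIM_CUSTOMER", "column": "COUNTRY_CODE"},
]

hits = Counter((h["table"], h["column"]) for h in query_table_column_hits)
dormant = sorted(col for col in warehouse_columns if hits[col] == 0)
print("hit counts:", dict(hits))
print("dormant column candidates:", dormant)
```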

The generic repository model will need to be tailored by the repository architect. Revisions, additions or deletions to the generic repository model will depend on your warehouse development methodology and the business intelligence products being used. Capture of statistical information, in some cases, may not be possible due to the use of proprietary standards or a lack of integration between development components.

For those firms that decide to develop a meta data repository in house, a generic model can help to quickly advance these efforts by providing a starting template. This generic model can also assist those firms evaluating meta data repository products in the marketplace.

SOURCE:http://www.information-management.com/issues/20001001/2744-1.html
newer post

Pharmaceutical Sales Data Warehouse Data Model

0 comments
newer post

DATA MODEL FOR AIRLINE RESERVATIONS DATA WAREHOUSE

1 comments
newer post

InsFocus BI - Data Model for Insurance

0 comments
The data model design follows industry-proven dimensional modeling standards and stores data at the lowest possible level of granularity. Concise policy information is detailed down to the coverage and risk parameter level. Claim information includes individual payments, historical reserve estimates and full tracking of claim status changes. The model is optimized for fast and flexible querying via InsFocus BI's Query Engine. The insurance data warehouse's design is not limited in the way traditional OLAP data models are, and provides unrivaled query and analysis flexibility.
Insurance business domains


InsFocus BI's insurance data model covers all insurance business domains, as described on the insurance content page. Those domains include:

    * Marketing
    * Underwriting
    * Claims
    * Reinsurance
    * Actuarial
    * Accounting

Policy / section / claim granularity

Freed from the limits of traditional OLAP solutions, InsFocus BI's insurance data warehouse model details data to the lowest insurance granular level, including:

    * Policy data
    * Policy transactions (changes)
    * Policy sections (coverages)
    * Risk codes (specific values per product)
    * Claims data
    * Claim payments
    * Claim outstanding estimate changes
    * Collection data
    * Client data

Optimized for insurance queries
InsFocus BI's data model was built for insurance from the ground up and is therefore highly optimized for fast and flexible insurance-specific queries. A lot of work and effort has been put into each calculation method, index, fact table and other data structures in order to speed up and facilitate typical insurance queries (a small sketch of one such calculation follows the list below), such as:

    * Earned and unearned calculations
    * Issued, latest, exposed, in-force sums-insured
    * For renewal and lapses
    * Outstanding claim estimates
    * Reinsurance cessions and margins
    * Claim status filters (e.g. run-off)

Follows industry standards
InsFocus BI's data model is a star-schema dimensional model, closely following the Kimball approach to data warehousing. The data model has been developed over years of following practices from Ralph Kimball, TDWI, Microsoft and others, to achieve a highly efficient yet standards-compliant insurance data warehouse.

SOURCE:http://www.insfocus.com/product/insurance-data-model.html
newer post