Tuesday, May 29, 2012

SAP HANA Architecture

In this article we will discuss the architecture of the In-Memory Computing Engine of SAP HANA. The SAP HANA database is developed in C++ and runs on SUSE Linux Enterprise Server. The SAP HANA database consists of multiple servers, the most important of which is the Index Server. The servers are the Index Server, Name Server, Statistics Server, Preprocessor Server and XS Engine.
  1. Index Server contains the actual data and the engines for processing the data. It also coordinates and uses all the other servers.
  2. Name Server holds information about the SAP HANA database topology. This is used in a distributed system with instances of the HANA database on different hosts. The name server knows where the components are running and which data is located on which server.
  3. Statistics Server collects information about Status, Performance and Resource Consumption from all the other server components. From the SAP HANA Studio we can access the Statistics Server to get status of various alert monitors.
  4. Preprocessor Server is used for analysing text data and extracting the information on which the text search capabilities are based.
  5. XS Engine is an optional component. Using XS Engine clients can connect to SAP HANA database to fetch data via HTTP.
Now let us check the architecture components of SAP HANA Index Server.

SAP HANA Index Server Architecture:

  1. Connection and Session Management component is responsible for creating and managing sessions and connections for the database clients. Once a session is established, clients can communicate with the SAP HANA database using SQL statements. For each session a set of parameters is maintained, like auto-commit, current transaction isolation level etc. Users are authenticated either by the SAP HANA database itself (login with user and password) or authentication can be delegated to an external authentication provider such as an LDAP directory.
  2. The client requests are analyzed and executed by the set of components summarized as Request Processing And Execution Control. The Request Parser analyses the client request and dispatches it to the responsible component. The Execution Layer acts as the controller that invokes the different engines and routes intermediate results to the next execution step. For example, Transaction Control statements are forwarded to the Transaction Manager. Data Definition statements are dispatched to the Metadata Manager and Object invocations are forwarded to Object Store. Data Manipulation statements are forwarded to the Optimizer which creates an Optimized Execution Plan that is subsequently forwarded to the execution layer.
    • The SQL Parser checks the syntax and semantics of the client SQL statements and generates the Logical Execution Plan. Standard SQL statements are processed directly by DB engine.
    • The SAP HANA database has its own scripting language named SQLScript that is designed to enable optimizations and parallelization. SQLScript is a collection of extensions to SQL. It is based on side-effect-free functions that operate on tables using SQL queries for set processing. The motivation for SQLScript is to offload data-intensive application logic into the database (a small illustrative sketch appears after this list).
    • Multidimensional Expressions (MDX) is a language for querying and manipulating the multidimensional data stored in OLAP cubes.
    • The SAP HANA database also contains a component called the Planning Engine that allows financial planning applications to execute basic planning operations in the database layer. One such basic operation is to create a new version of a dataset as a copy of an existing one while applying filters and transformations. For example: Planning data for a new year is created as a copy of the data from the previous year. This requires filtering by year and updating the time dimension. Another example for a planning operation is the disaggregation operation that distributes target values from higher to lower aggregation levels based on a distribution function.
    • The SAP HANA database also has built-in support for domain-specific models (such as for financial planning) and it offers scripting capabilities that allow application-specific calculations to run inside the database.
    The SAP HANA database features such as SQLScript and planning operations are implemented using a common infrastructure called the Calc Engine. SQLScript, MDX, planning models and domain-specific models are converted into calculation models. The Calc Engine creates the logical execution plan for these calculation models, breaks a model (for example some SQLScript) into operations that can be processed in parallel, and also executes the user-defined functions.
  3. In HANA database, each SQL statement is processed in the context of a transaction. New sessions are implicitly assigned to a new transaction. The Transaction Manager coordinates database transactions, controls transactional isolation and keeps track of running and closed transactions. When a transaction is committed or rolled back, the transaction manager informs the involved engines about this event so they can execute necessary actions. The transaction manager also cooperates with the persistence layer to achieve atomic and durable transactions.
  4. Metadata can be accessed via the Metadata Manager. The SAP HANA database metadata comprises a variety of objects, such as definitions of relational tables, columns, views and indexes, definitions of SQLScript functions and object store metadata. Metadata of all these types is stored in one common catalog for all SAP HANA database stores (in-memory row store, in-memory column store, object store, disk-based). Metadata is stored in tables in the row store. The SAP HANA database features such as transaction support and multi-version concurrency control are also used for metadata management. In distributed database systems central metadata is shared across servers. How metadata is actually stored and shared is hidden from the components that use the metadata manager.
  5. The Authorization Manager is invoked by other SAP HANA database components to check whether the user has the required privileges to execute the requested operations. SAP HANA allows granting of privileges to users or roles. A privilege grants the right to perform a specified operation (such as create, update, select, execute, and so on) on a specified object (for example a table, view, SQLScript function, and so on). The SAP HANA database supports Analytic Privileges that represent filters or hierarchy drilldown limitations for analytic queries. Analytic privileges grant access to values with a certain combination of dimension attributes. This is used to restrict access to a cube with some values of the dimensional attributes.
  6. Database Optimizer gets the Logical Execution Plan from the SQL Parser or the Calc Engine as input and generates the optimized Physical Execution Plan based on the database statistics. The optimizer determines the best plan for accessing row or column stores.
  7. Database Executor basically executes the Physical Execution Plan to access the row and column stores and also processes all the intermediate results.
  8. The Row Store is the SAP HANA database row-based in-memory relational data engine. It is optimized for high write performance and is interfaced from the calculation/execution layer. Optimized write and read operations are possible due to storage separation, i.e. Transactional Version Memory and the Persisted Segment.
    • Transactional Version Memory contains temporary versions i.e. Recent versions of changed records. This is required for Multi-Version Concurrency Control (MVCC). Write Operations mainly go into Transactional Version Memory. INSERT statement also writes to the Persisted Segment.
    • Persisted Segment contains data that may be seen by any ongoing active transaction, i.e. data that was committed before any active transaction was started.
    • Version Memory Consolidation moves the recent versions of changed records from Transactional Version Memory to the Persisted Segment based on Commit ID. It also clears outdated record versions from Transactional Version Memory. It can be considered the garbage collector for MVCC.
    • Segments contain the actual data (the content of row-store tables) in pages. Row store tables are linked lists of memory pages. Pages are grouped in segments. The typical page size is 16 KB.
    • Page Manager is responsible for Memory allocation. It also keeps track of free/used pages.
  9. The Column Store is the SAP HANA database column-based in-memory relational data engine. Parts of it originate from TREX (Text Retrieval and Extraction), i.e. SAP NetWeaver Search and Classification. For the SAP HANA database this proven technology was further developed into a full relational column-based data store. It offers efficient data compression, is optimized for high read performance and is interfaced from the calculation/execution layer. Optimized read and write operations are possible due to storage separation, i.e. Main and Delta storage.
    • Main Storage contains the compressed data in memory for fast read.
    • Delta Storage is meant for fast write operation. The update is performed by inserting a new entry into the delta storage.
    • Delta Merge is an asynchronous process to move changes in delta storage into the compressed and read optimized main storage. Even during the merge operation the columnar table will be still available for read and write operations. To fulfil this requirement, a second delta and main storage are used internally.
    • During Read Operation data is always read from both main & delta storages and result set is merged. Engine uses multi version concurrency control (MVCC) to ensure consistent read operations.
    • As row tables and columnar tables can be combined in one SQL statement, the corresponding engines must be able to consume intermediate results created by each other. A main difference between the two engines is the way they process data: Row store operators process data in a row-at-a-time fashion using iterators. Column store operations require that the entire column is available in contiguous memory locations. To exchange intermediate results, row store can provide results to column store materialized as complete rows in memory while column store can expose results using the iterator interface needed by row store.
  10. The Persistence Layer is responsible for durability and atomicity of transactions. It ensures that the database is restored to the most recent committed state after a restart and that transactions are either completely executed or completely undone. To achieve this goal in an efficient way the persistence layer uses a combination of write-ahead logs, shadow paging and savepoints. The persistence layer offers interfaces for writing and reading data. It also contains SAP HANA's logger that manages the transaction log. Log entries can be written implicitly by the persistence layer when data is written via the persistence interface, or explicitly by using a log interface.
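To make the SQLScript discussion above a bit more concrete, here is a minimal sketch of a SQLScript procedure. The table, column and procedure names are hypothetical and the syntax is approximate (it may vary slightly between HANA revisions); the point is simply that the logic is expressed as a chain of set-oriented SQL steps that the calculation engine can plan and parallelize, instead of row-by-row application code.

-- Illustrative SQLScript sketch (hypothetical names, approximate syntax)
CREATE PROCEDURE top_customers (IN in_year INTEGER,
                                OUT result TABLE (customer_id INTEGER, revenue DECIMAL(15,2)))
LANGUAGE SQLSCRIPT READS SQL DATA AS
BEGIN
  -- first set-oriented step: aggregate revenue per customer for one year
  yearly_sales = SELECT customer_id, SUM(revenue) AS revenue
                 FROM sales
                 WHERE sales_year = :in_year
                 GROUP BY customer_id;

  -- second step consumes the intermediate table variable
  result = SELECT customer_id, revenue
           FROM :yearly_sales
           ORDER BY revenue DESC
           LIMIT 10;
END;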

Distributed System and High Availability

The SAP HANA appliance software supports High Availability. SAP HANA scales beyond one server and can remove the single point of failure. A typical distributed scale-out cluster landscape has many server instances in a cluster, so large tables can be distributed across multiple servers and queries can be executed across servers. The SAP HANA distributed system also ensures transaction safety. Key features:
  • N Active Servers or Worker hosts in the cluster.
  • M Standby Server(s) in the cluster.
  • Shared file system for all servers. Several instances of SAP HANA share the same metadata.
  • Each Server hosts an Index Server & Name Server.
  • Only one Active Server hosts the Statistics Server.
  • During startup one server gets elected as Active Master.
  • The Active Master assigns a volume to each starting Index Server or no volume in case of cold Standby Servers.
  • Up to 3 Master Name Servers can be defined or configured.
  • Maximum of 16 nodes is supported in High Availability configurations.

Name Server Configured Role   Name Server Actual Role   Index Server Configured Role   Index Server Actual Role
Master 1                      Master                    Worker                         Master
Master 2                      Slave                     Worker                         Slave
Master 3                      Slave                     Worker                         Slave
Slave                         Slave                     Standby                        Standby

Failover

  • High Availability enables the failover of a node within one distributed SAP HANA appliance. Failover uses a cold Standby node and is triggered automatically. When an Active Server X fails, the Standby Server N+1 reads the indexes from the shared storage and takes over the logical connection of the failed server X.
  • If the SAP HANA system detects a failover situation, the work of the services on the failed server is reassigned to the services running on the standby host. The failed volume and all the included tables are reassigned and loaded into memory in accordance with the failover strategy defined for the system. This reassignment can be performed without moving any data, because all the persistency of the servers is stored on a shared disk. Data and logs are stored on shared storage, where every server has access to the same disks.
  • The Master Name Server detects an Index Server failure and executes the failover. During the failover the Master Name Server assigns the volume of the failed Index Server to the cold Standby Server. In case of a Master Name Server failure, another of the remaining Name Servers will become Active Master.
  • Before a failover is performed, the system waits for a few seconds to determine whether the service can be restarted. A Standby node can take over the role of a failing master or a failing slave node.

SAP HANA - An Introduction for the beginners

SAP HANA: High-Performance Analytic Appliance (HANA) is an in-memory database from SAP used to store data and analyze large volumes of non-aggregated transactional data in real time with unprecedented performance, making it ideal for decision support and predictive analysis.
The In-Memory Computing Engine is a next-generation innovation that uses cache-conscious data structures and algorithms, leveraging hardware innovation as well as SAP software technology innovations. It is ideal for real-time OLTP and OLAP in one appliance, i.e. an end-to-end solution from transactions to high-performance analytics. SAP HANA can also be used as a secondary database to accelerate analytics on existing applications.

Hardware Innovations - Leading to HANA

In the real world we have a wide variety of data sources, e.g. unstructured data, operational data stores, data marts, data warehouses, online analytical stores, etc. To do analytics or information mining on this Big Data in real time we come across hurdles like latency, high cost and complexity.
Disk I/O was the performance bottleneck in the past, whereas in-memory computing was always much faster. Earlier, however, the cost of in-memory computing was prohibitive for any large-scale implementation. Now, with multi-core CPUs and high-capacity RAM, we can host the entire database in memory. So now the CPU waits for data to be loaded from main memory into the CPU cache, and that is the performance bottleneck today.
This is a total paradigm shift; Tape is Dead, Disk is Tape, Main Memory is Disk & CPU Cache is Main Memory. HANA is optimized to exploit the parallel processing capabilities of modern multi-core/CPU architectures. With this architecture, SAP applications can benefit from current hardware technologies.

Memory Overview - Where we stand

Let us have a quick look at the multi-core CPU caches, main memory (RAM) and the traditional hard disk with respect to response time and typical size.
  • L1 cache - primary, per core, SRAM, fastest: ~1 ns, 64 KB
  • L2 cache - intermediate, per core, SRAM, slower than L1: ~5 ns, 256 KB
  • L3 cache - shared across all cores, SRAM, slowest cache level: ~20 ns, 8 MB
  • Main memory (DRAM): ~100 ns, up to TBs
  • Hard disk: > 1,000,000 ns, TBs

HANA Hardware Requirement

HANA can be installed on certified hardware from SAP hardware partners such as Hewlett-Packard, IBM, Fujitsu, Cisco Systems and Dell.
Currently SUSE Linux Enterprise Server x86-64 (SLES) 11 SP1 is the Operating System supported by SAP HANA.
A typical example of CPU and RAM is 4 Intel Xeon E7-4870 processors (40 cores) and 512 GB of RAM. SAP recommends dedicated 10 Gbit/s network communication between the SAP HANA landscape and the source system for efficient data replication.

HANA Database Features

Important database features of HANA include OLTP and OLAP capabilities, extreme performance, in-memory storage, massively parallel processing, a hybrid database with both column store and row store, complex event processing, the calculation engine, compression, virtual views, partitioning and no aggregate tables. The HANA in-memory architecture includes the In-Memory Computing Engine and the In-Memory Computing Studio for modeling and administration. Each of these properties is explained in more detail below, along with the SAP HANA architecture.

Basic Concepts behind HANA Database

Extreme Hardware Innovations:

Main memory is no longer a limited resource: modern servers can have 2 TB of system memory, and this allows complete databases to be held in RAM. Current processors have up to 64 cores, and 128 cores will soon be available. With the increasing number of cores, CPUs are able to process more data per time interval. This shifts the performance bottleneck from disk I/O to the data transfer between main memory and CPU cache.

In-Memory Database:

HANA fully leverages hardware innovations like multi-core CPUs and the availability of high-capacity RAM. The basic concept is to keep the entire database in fast-accessible main memory, close to the CPU, for faster execution and to avoid disk I/O. Disk storage is still required for permanent persistency since main memory is volatile. SAP HANA holds the bulk of its data in memory for maximum performance, but still uses persistent storage to provide a fallback in case of failure. Data and log are automatically saved to disk at regular savepoints; the log is also saved to disk after each COMMIT of a database transaction. Disk write operations happen asynchronously as a background task. Generally, on system start-up HANA loads the tables into memory.

Massively Parallel Processing:

With availability of Multi-Core CPUs, higher CPU execution speeds can be achieved. Multiple CPUs call for new parallel algorithms to be used in databases in order to fully utilize the computing resources available. HANA Column-based storage makes it easy to execute operations in parallel using multiple processor cores. In a column store data is already vertically partitioned. This means that operations on different columns can easily be processed in parallel. If multiple columns need to be searched or aggregated, each of these operations can be assigned to a different processor core. In addition operations on one column can be parallelized by partitioning the column into multiple sections that can be processed by different processor cores. With the SAP HANA database, queries can be executed rapidly and in parallel.
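As a rough illustration of the idea, the sketch below hash-partitions a hypothetical column table so that scans and aggregations over it can be spread across several cores (or hosts). The table name is made up and the exact partitioning clause should be checked against the SQL reference of your HANA revision.

-- Hypothetical column table, hash-partitioned on customer_id
CREATE COLUMN TABLE sales_line_items (
    order_id    INTEGER,
    customer_id INTEGER,
    product_id  INTEGER,
    revenue     DECIMAL(15,2)
) PARTITION BY HASH (customer_id) PARTITIONS 4;

-- Each partition (and each referenced column) can be aggregated by a
-- separate core; the partial results are then merged into the final result.
SELECT customer_id, SUM(revenue) AS total_revenue, COUNT(*) AS line_items
FROM sales_line_items
GROUP BY customer_id;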

Hybrid Data Store:

Common databases store tabular data row-wise, i.e. all data for a record is stored adjacent in memory. Conceptually, a database table is a two-dimensional data structure with cells organized in rows and columns. Computer memory, however, is organized as a linear structure. To store a table in linear memory, two options exist:
  • A row-oriented storage stores a table as a sequence of records, each of which contain the fields of one row.
  • A column-oriented storage stores all the values of a column in contiguous memory locations.
Use of the column store helps to avoid scanning unnecessary columns while performing search and aggregation operations on single-column values stored in contiguous memory locations. Such an operation has high spatial locality and can be executed efficiently in the CPU cache. With row-oriented storage, the same operation would be much slower because data of the same column is distributed across memory and the CPU is slowed down by cache misses. The column store is optimized for high read performance and efficient data compression. This combination of both classical and innovative storage technologies allows developers to choose the best technology for their application and, where necessary, use both in parallel.
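In SAP HANA the choice is made per table at creation time. The sketch below is illustrative only (table and column names are invented): an analytics-heavy table goes to the column store, while a small, frequently updated table goes to the row store.

-- Read/analytics-oriented table in the column store
CREATE COLUMN TABLE sales_history (
    sales_id   INTEGER,
    product_id INTEGER,
    sales_date DATE,
    revenue    DECIMAL(15,2)
);

-- Write/lookup-oriented table in the row store
CREATE ROW TABLE session_registry (
    session_id INTEGER,
    user_name  VARCHAR(40),
    login_time TIMESTAMP
);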

OLTP and OLAP Database:

HANA is a hybrid database, having both a read-optimized column store, ideally suited for OLAP, and a write-optimized row store, best for OLTP, as its relational engines. Both stores are in-memory. Using column stores in OLTP applications requires a balanced approach to insertion and indexing of column data to minimize cache misses. The SAP HANA database allows the developer to specify whether a table is to be stored column-wise or row-wise. It is also possible to alter an existing table from columnar to row-based and vice versa.
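A hedged sketch of that conversion is shown below, reusing the row table from the earlier example; the exact ALTER clause has varied across HANA revisions, so treat this as approximate syntax to be verified against your system's SQL reference.

-- Approximate syntax: switch an existing table between the two stores
ALTER TABLE session_registry ALTER TYPE COLUMN;   -- row store to column store
ALTER TABLE session_registry ALTER TYPE ROW;      -- and back again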

Higher Data Compression:

The goal of keeping all relevant data in main memory can be achieved with less cost if data compression is used. Columnar data storage allows highly efficient compression. If a column is sorted, there will normally be several contiguous values placed adjacent to each other in memory. In this case compression methods, such as run-length encoding, cluster coding or dictionary coding can be used. In column stores a compression factor of 10 can typically be achieved compared to traditional row-oriented storage systems.
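The following sketch only mimics the dictionary-encoding idea at the SQL level; HANA applies this compression internally and transparently, and the table and column names here are made up.

-- Conceptual illustration of why dictionary encoding pays off
CREATE COLUMN TABLE customer_country (
    customer_id INTEGER,
    country     VARCHAR(40)
);

-- The "dictionary" corresponds to the sorted distinct values of the column...
SELECT DISTINCT country FROM customer_country ORDER BY country;

-- ...and the win comes from the dictionary being far smaller than the column:
-- each row then only needs a small integer id instead of the full string.
SELECT COUNT(DISTINCT country) AS dictionary_size,
       COUNT(*)                AS row_count
FROM customer_country;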

Building the Next Generation ETL data loading Framework

Do you wish for an ETL framework that is highly customizable, light-weight and suits perfectly with all of your data loading needs? We too! Let's build one together...

What is an ETL framework?

ETL Or "Extraction, Transformation and Loading" is the prevalent technological paradigm for data integration. While ETL in itself is not a tool, ETL processes can be implemented though varied tools and programming methods. This includes, but not limited to, tools like Informatica PowerCentre, DataStage, BusinessObjects Data Services (BODS), SQL Server Integration Services (SSIS), AbInitio etc. and programming methods like PL/SQL (Oracle), T-SQL (Microsoft), UNIX shell scripting etc. Most of these tools and programming methodologies use a generic setup that controls, monitors, executes and Logs the data flow through out the ETL process. This generic 'setup' is often referred as 'ETL framework'
As an example of an ETL framework, let's consider this. "Harry" needs to load 2 tables every day from one source system to some other target system. For this purpose, Harry has created 2 SQL jobs, each of which reads data from the source through some "SELECT" statements and writes the data to the target database using some "INSERT" statements. But in order to run these jobs, Harry needs some more information, e.g.
  • When is a good time to execute these jobs? Can he schedule these jobs to run automatically every day?
  • Where is the source system located? (Connection information)
  • What will happen if one of the jobs fails while loading the data? Will Harry get an alert message? Can he simply rerun the jobs after fixing the cause of the failure?
  • How will Harry know if at all any data is retrieved or loaded to the target?
It turns out that Harry needs something more. He needs some kind of setup that will govern the job execution regularly. This includes scheduling the jobs, executing the jobs, logging any failure/error information (and also alerting Harry about such failures), maintaining the connection information and even ensuring that Harry does not end up loading duplicate data.
Such a setup is called "ETL Framework". And we are trying to build the perfect one here.
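To make this concrete, below is a minimal, hypothetical sketch of the kind of control tables such a framework keeps; the names and columns are purely illustrative, not a finished design.

-- Connection information, so jobs do not hard-code where systems live
CREATE TABLE etl_connection (
    connection_name   VARCHAR(30),   -- e.g. 'SOURCE_CRM'
    connection_string VARCHAR(200)
);

-- One row per job run: scheduling, restartability, logging and reconciliation
CREATE TABLE etl_batch (
    batch_id     INTEGER,
    job_name     VARCHAR(50),
    loaded_until DATE,               -- high-water mark for incremental loads
    status       VARCHAR(10),        -- 'Running' / 'Success' / 'Failed'
    start_time   TIMESTAMP,
    end_time     TIMESTAMP,
    rows_read    INTEGER,
    rows_loaded  INTEGER             -- basis for reconciliation and alerts
);

The incremental-loading articles later in this blog use exactly this kind of "batch" table as the high-water mark for change data capture.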

Critical Features of an ETL framework

In a very broad sense, here are a few of the features that we feel are critical in any ETL framework:
  • Support for Change Data Capture Or Delta Loading Or Incremental Loading
  • Metadata logging
  • Handling of multiple source formats
  • Restartability support
  • Notification support
  • Highly configurable / customizable

Good-to-have features of ETL Framework

These are some good-to-have features for the framework
  • Inbuilt data reconciliation
  • Customizable log format
  • Dry-load enabling
  • Multiple notification formats

Request for Proposal for the next-gen ETL framework

Based on the feature sets above, we are trying to build a generic framework that we would make available here for free for everyone's use.
However, the list of features above is not complete. We request our readers to send us their requirements (RFPs) for the proposed ETL framework so that it can resolve the issues and shortcomings in their existing frameworks.

Incremental Loading for Dimension Table

In our previous article we have discussed the concept of incremental loading in general. In this article we will see how to perform incremental loading for dimension tables.

Should we do incremental loading for dimensions?

In a dimensional model, we may perform incremental loading for dimension tables also. One may argue that this won't be necessary as data volumes in dimension tables are not as high as the data volumes in the fact tables, hence we can simply do a full load every time.
I personally do not agree with this argument. This is because during the last few years I have seen tremendous growth of data in dimension tables, and things can get quite heavy, especially if we are trying to load SCD Type 2 dimensions. Anyway, without much ado, let's delve deep.

Standard Method of Loading

Like before, for our purpose we will assume we have the below customer table in our source system from which we need to perform the data loading:
CustomerID  CustomerName  Type         LastUpdatedDate
1           John          Individual   22-Mar-2012
2           Ryan          Individual   22-Mar-2012
3           Bakers'       Corporate    23-Mar-2012
As discussed in the previous article, a typical SQL query to extract data incrementally from this source system will be like this:
SELECT t.* 
FROM Customer t
WHERE t.lastUpdatedDate > (select nvl(
                                 max(b.loaded_until), 
                                 to_date('01-01-1900', 'MM-DD-YYYY')
                                )
                      from batch b
                      where b.status = 'Success');
Here "batch" is a separate table which stores the date until which we have successfully extracted the data.
Batch_ID  Loaded_Until  Status
1         22-Mar-2012   Success
2         23-Mar-2012   Success

Which one to use: "Entry Date" / "Load Date" or "Last Update Date"?

In an incremental load methodology, we should extract a record when it is first created and after that whenever the record is updated. Therefore, we should always look for a "last update date" column for extracting records. This is because "entry date" or "load date" columns in the source systems are not enough to determine whether the record was updated at a later point in time.
Often source systems maintain 2 different columns, load_date and last_update_date. When extracting data based on "last update date", ensure that the source systems always populate the "last updated date" field with the "load date" when the record is first created.

What are the benefits of incremental loading of dimension tables?

Once we extract records incrementally based on their last update date, we can compare each record with the target based on their natural keys and determine if the record is a new record or updated record.
However, if we do not extract incrementally (and every time extract all the records from the source), then the number of records to compare against the target will be much higher, resulting in performance degradation. If we are doing incremental loading, records that do not have any change will not come; only new or updated records will come. But if we are doing a full load, everything will come irrespective of any change.
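As a hedged sketch (Oracle-style SQL, with hypothetical staging and dimension table names), the incrementally extracted records can then be classified as inserts or updates by an outer join on the natural key:

-- stg_customer holds only the incrementally extracted source rows;
-- dim_customer is the target dimension table
SELECT s.CustomerID,
       s.CustomerName,
       s.Type,
       CASE WHEN d.CustomerID IS NULL THEN 'INSERT'   -- natural key not yet in target
            ELSE 'UPDATE'                             -- key exists, attributes may have changed
       END AS load_action
FROM stg_customer s
LEFT OUTER JOIN dim_customer d
  ON d.CustomerID = s.CustomerID;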

Using Informatica Normalizer Transformation

Normalizer transformation is a native transformation in Informatica that can ease many complex data transformation requirements. Learn how to effectively use normalizer in this tutorial.

What is a Normalizer Transformation?

In a snapshot, here is what a Normalizer is or does:
  • Active Transformation
  • Can output multiple rows for each input row
  • Can transpose the data (transposing columns to rows)
A Normalizer is an Active transformation that returns multiple rows from a source row; it returns duplicate data for single-occurring source columns. The Normalizer transformation parses multiple-occurring columns from COBOL sources, relational tables, or other sources. A Normalizer can be used to transpose the data in columns to rows.
Normalizer effectively does the opposite of what Aggregator does!

Transposing data using Normalizer

Let's imagine we have a table like the one below that stores the sales figures for 4 quarters of a year in 4 different columns. As you can see, each row represents one shop and the columns represent the corresponding sales. Next, imagine our task is to generate a result set where we will have separate rows for every quarter. We can configure a Normalizer transformation to return a separate row for each quarter as shown below.
The following source rows contain four quarters of sales by store:
Source Table

Store    Quarter1  Quarter2  Quarter3  Quarter4
Shop 1   100       300       500       700
Shop 2   250       450       650       850
The Normalizer returns a row for each shop and sales combination. It also returns an index - called GCID (we will know later in detail) - that identifies the quarter number:
Target Table

Shop     Sales  Quarter
Shop 1   100    1
Shop 1   300    2
Shop 1   500    3
Shop 1   700    4
Shop 2   250    1
Shop 2   450    2
Shop 2   650    3
Shop 2   850    4

How to use Normalizer transformation inside Informatica Mapping

Now that you know the concept of a normalizer, let's see how we can implement this concept using Normalizer transformation. We will take a different data set for our example this time. Suppose we have the following data in source:

Name   Month  Transportation  House Rent  Food
Sam    Jan    200             1500        500
John   Jan    300             1200        300
Tom    Jan    300             1350        350
Sam    Feb    300             1550        450
John   Feb    350             1200        290
Tom    Feb    350             1400        350
and we need to transform the source data and populate this as below in the target table:

Name   Month  Expense Type  Expense
Sam    Jan    Transport     200
Sam    Jan    House rent    1500
Sam    Jan    Food          500
John   Jan    Transport     300
John   Jan    House rent    1200
John   Jan    Food          300
Tom    Jan    Transport     300
Tom    Jan    House rent    1350
Tom    Jan    Food          350
Now below is the screenshot of a complete mapping which shows how to achieve this result using Informatica PowerCenter Designer.
You can see that after the Source Qualifier we have placed the Normalizer transformation. In the next section, I will explain how to set up the properties of the Normalizer.

Setting Up Normalizer Transformation Property

First we need to set the number of occurrences property of the Expense head as 3 in the Normalizer tab of the Normalizer transformation. This is because we have 3 different types of expenses in the given data - Food, Houserent and Transportation.
As soon as we set the occurrences to 3, the Normalizer will automatically create 3 corresponding input ports in the Ports tab along with the other fields (e.g. the "Individual" and "Month" fields). These 3 input ports, as you can see in the above image, are EXPENSEHEAD_in1, EXPENSEHEAD_in2 and EXPENSEHEAD_in3. We have connected these input ports with food, house rent and transportation from the Source Qualifier. The image below shows the setting of the number of occurrences property.
Next, in the Ports tab of the Normalizer, the ports will be created automatically as configured in the Normalizer tab.
Interestingly, we will observe two new columns here:
  • GK_EXPENSEHEAD
  • GCID_EXPENSEHEAD
See these ports in the screenshot below.
The GK field generates a sequence number starting from the value defined in the Sequence field, while the GCID field holds the occurrence number, i.e. the column number of the input expense head.
In our case, 1 is for FOOD, 2 is for HOUSERENT and 3 is for TRANSPORTATION. The GCID therefore tells us which expense corresponds to which field while converting columns to rows.
Below is the screen-shot of the expression to handle this GCID efficiently:
As you can see, a DECODE statement is used to assign the proper label to the output expense-type field. A rough version of that expression is given below.
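The expression would look roughly like the Informatica expression below; the output port name and the exact labels are assumptions based on the ports discussed above.

-- Output port (e.g. out_EXPENSE_TYPE) of the Expression transformation
DECODE(GCID_EXPENSEHEAD,
       1, 'Food',
       2, 'House rent',
       3, 'Transport',
       'Unknown')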

Incremental Loading for Fact Tables

In the previous articles, we have discussed the general concepts of incremental data loading as well as how to perform incremental data loading for dimension tables. In this article we will discuss the methods and issues of loading data incrementally in Fact tables of a data warehouse.

METHOD OF LOADING

Generally speaking, incremental loading for fact tables is relatively easy as, unlike dimension tables, here you do not need to perform any lookup on your target table to find out whether the source record already exists in the target or not. All you need to do is select incremental records from the source (as shown below for the case of the "sales" table) and load them as-is to the target (you may need to perform lookups on dimension tables to assign the respective surrogate keys, but that's a different story).
Like before we will assume we have a "sales" table in the source
Sales Table
ID         CustomerID    ProductDescription   Qty   Revenue  Sales Date
1          1             White sheet (A4)     100    4.00    22-Mar-2012
2          1             James Clip (Box)     1      2.50    22-Mar-2012
3          2             Whiteboard Marker    1      2.00    22-Mar-2012
4          3             Letter Envelop       200   75.00    23-Mar-2012
5          1             Paper Clip           12     4.00    23-Mar-2012
Given this table, a typical extraction query will be like this:
SELECT t.* 
FROM Sales t
WHERE t.sales_date > (select nvl(
                                 max(b.loaded_until), 
                                 to_date('01-01-1900', 'MM-DD-YYYY')
                                )
                      from batch b
                      where b.status = 'Success');
where "batch" is a separate table maintained at target system having minimal structure and data like below
Batch_ID  Loaded_Until  Status
1         22-Mar-2012   Success
2         23-Mar-2012   Success
However, things may get pretty complicated if your fact is a special type of fact called "snapshot fact". Let's understand them below.

Loading Incident Fact

An incident fact is the normal fact that we encounter most often (and that we have seen above in our sales table example). Records in these types of facts are loaded only if there are transactions coming from the source. For example, a new sales record comes in only when a sale actually happens in the source system. They are dependent on some real "incident" happening in the source, hence the name incident fact.

Loading Snapshot Fact

As opposed to incident fact, snapshot facts are loaded even if there is no real business incident in the source. Let me show you what I mean by using the above example of customer and sales tables in OLTP. Let's say I want to build a fact that would show me total revenue of sales from each customer for each day. In effect, I want to see the below data in my fact table.
Sales fact table (This is what I want to see in my target fact table)
Date          Customer    Revenue
22-Mar-2012   John        6.50
22-Mar-2012   Ryan        2.00
23-Mar-2012   John       10.50
23-Mar-2012   Ryan        2.00
23-Mar-2012   Bakers'    75.00
As you see, even though no sale was made to Ryan on 23-Mar, we still show him here with the old data. Similarly for John, since goods totaling $4.00 were sold to him on 23-Mar, his record shows the cumulative total of $10.50.
Now obviously the next logical question is: how do we load this fact using incremental loading? Incremental loading only brings in incremental data, that is, on 23rd March we will only have Bakers' and John's records, and only with that day's sales figures. We won't have Ryan's record in the incremental set.
Why not a full load
You can obviously opt for a full-load mechanism, as that would solve this problem, but it would take a toll on your loading performance.
Then what's the solution?
One way to resolve this issue is to create 2 incremental channels of loading for the fact. One channel will bring in incremental data from the source and the other channel will bring in incremental data from the target fact itself. Let's see how it happens below. We will take the example of loading the 23-Mar data.
Channel 1: Incremental data from source
Customer    Revenue 
John        4.00
Bakers'    75.00
Channel 2: Incremental data from target fact table (last day's record)
Customer    Revenue
John        6.50
Ryan        2.00
Next we can perform a FULL OUTER JOIN between the above two sets (adding the revenues where a customer appears in both) to come to the below result; a SQL sketch of this join follows the result set.
John       10.50
Ryan        2.00
Bakers'    75.00
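A sketch of that join in Oracle-style SQL (the staging and fact table names are hypothetical) could look like this:

-- Channel 1: today's incremental sales per customer (from the source)
-- Channel 2: yesterday's rows of the snapshot fact itself
SELECT COALESCE(src.Customer, tgt.Customer)      AS Customer,
       NVL(tgt.Revenue, 0) + NVL(src.Revenue, 0) AS Revenue
FROM incr_sales_today src
FULL OUTER JOIN snapshot_fact_prev_day tgt
  ON tgt.Customer = src.Customer;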
Hope this tutorial was helpful. If you have any doubt or question on the above, please ask your question here. We will surely help you out!

If you want to learn more, visit our ETL Basic Concepts and Tutorial page.


Monday, May 28, 2012

SAP BODS Transforms

This article deals with the various types of transforms available in SAP BODS. Transforms are built-in, optional objects used in a dataflow to transform source data into the desired output datasets. They are available in the Local Object Library under the Transforms tab. There are four categories of transforms available in SAP BODS, namely Data Integrator, Data Quality, Platform and Text Data Processing.

List of available transforms

  1. Data Integrator- Data_Transfer, Date_Generation, Effective_Date, Hierarchy_Flattening, History_Preserving, Key_Generation, Map_CDC_Operation, Pivot (Columns to Rows), Reverse Pivot (Rows to Columns), Table_Comparison, XML_Pipeline
  2. Data Quality- Associate, Country ID, Data Cleanse, DSF2 Walk Sequencer, Geocoder, Global Address Cleanse, Global Suggestion Lists, Match, USA Regulatory Address Cleanse, User-Defined
  3. Platform- Case, Map_Operation, Merge, Query, Row_Generation, SQL, Validation
  4. Text Data Processing- Entity_Extraction
Our approach is to get detailed knowledge of all the above transforms, starting with the most commonly used ones. So we will start with the Query transform.

QUERY Transform

Query transform is used to retrieve a data set, based on the input schema, that satisfies the conditions that we specify. A Query transform is similar to a SQL SELECT statement. The Query transform is used to perform the following operations:
  • Map columns from the input schema to the output schema.
  • Perform transformations and functions on the source data.
  • Assign Primary Keys to output Schema columns.
  • Add New Output columns, Nested Schemas, and Function Calls to the output Schema.
  • Perform Data Nesting and Unnesting with Sub Schemas of the Output Schema. Also assign Make Current Schema.
  • Generate Distinct result set output for the input Schema.
  • Join data from Multiple Input Source Schemas. Equi Join as well as Outer Join is supported.
  • Filter input Source Data.
  • Perform aggregation based on input column groups.
  • Generate sorted dataset based on source input column order.
  • Also we can generate DTD, XML Schema or File Format based on the Input or Output Schema.

CASE Transform

Case transform is used to divide or route the input data set into multiple output data sets based on the defined logical expression. It is used to implement IF-THEN-ELSE logic at dataflow level. This transform accepts only one source input. We can define multiple labels and their corresponding CASE expression. For input rows that do not satisfy any of the CASE conditions, we may select to output those records using the DEFAULT case. For that we need to select the check box Produce default output when all expressions are false.
Two other featured properties of this transform are Row can be TRUE for one case only and Preserve expression order. If we select the option Row can be TRUE for one case only, then a row is passed to the first case whose expression returns TRUE. Otherwise, the row is passed to all the cases whose expression returns TRUE. Preserve expression order option is available only when the Row can be TRUE for one case only option is checked. We can select this option if expression order is important to us because there is no way to guarantee which expression will evaluate to TRUE first.

MERGE Transform

Merge transform is used to combine multiple input datasets with the same schema into a single output dataset of the same schema. It is equivalent to the SQL UNION ALL statement. In order to eliminate duplicate records from the output dataset, i.e. to attain a UNION operation, add a Query transform with the DISTINCT option enabled after the Merge transform, as sketched below.
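In plain SQL terms (the table names are illustrative), the Merge transform corresponds to the first statement and Merge followed by a DISTINCT Query to the second:

-- Merge transform: duplicates are retained (UNION ALL)
SELECT * FROM region_a_orders
UNION ALL
SELECT * FROM region_b_orders;

-- Merge + Query with DISTINCT: duplicates removed (UNION)
SELECT DISTINCT *
FROM (
    SELECT * FROM region_a_orders
    UNION ALL
    SELECT * FROM region_b_orders
) merged;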

VALIDATION Transform

Validation transform is used to filter or replace the source dataset based on criteria or validation rules to produce the desired output dataset. It enables us to create validation rules on the input dataset and generate output based on whether records have passed or failed the validation condition. This transform is typically used for NULL checking of mandatory fields, pattern matching, checking the existence of a value in a reference table, validating datatypes, etc.
The Validation transform can generate three output datasets: Pass, Fail, and RuleViolation. The Pass output schema is identical to the input schema. The Fail output schema has two more columns, DI_ERRORACTION and DI_ERRORCOLUMNS. The RuleViolation output schema has three columns: DI_ROWID, DI_RULENAME and DI_COLUMNNAME.

MAP_OPERATION Transform

Map_Operation transform allows conversions between data manipulation operations like INSERT, UPDATE, DELETE and REJECT. It enables us to change the operation codes of input data sets to produce the desired output row type. There are 4 operation codes for any input row type: Normal, Update, Insert and Delete. In addition, the DISCARD option can be assigned to the output row type. Discarded rows are not passed through to the output of the transform.
If the output record is flagged NORMAL or INSERT, then it inserts a new record in the target table. If it is marked as UPDATE it basically overwrites an existing row in the target table. If the input record is flagged as Delete it does not load the records in the target table. But if the output row type is set to DELETE then it deletes the corresponding records present in target. If the row is marked as DISCARD then no records are passed to the output of the transform.

ROW_GENERATION Transform

Row_Generation transform produces a dataset with a single column. The column values start with the number that we specify in the Row number starts at option. The value then increments by one up to the specified number of rows, as set in the Row count option. This transform does not accept any input data set.

SQL Transform

SQL transform is used to submit or perform standard SQL operations on the database server. The SQL transform supports a single SELECT statement only. This transform does not accept any input data set. Use this transform only when the other built-in transforms cannot perform the required SQL operation. Try to use this transform as your last option, as it is not optimized for performance and also reduces readability.

KEY_GENERATION Transform

Key_Generation transform helps to generate artificial (surrogate) keys for new rows in a table. The transform looks up the maximum existing key value of the surrogate key column of the table and uses it as the starting value to generate new keys for the new rows in the input dataset. The transform expects a column with the same name as the Generated key column of the source table to be part of the input schema.
The source table must be imported into the DS repository before defining the source table for this transform. The fully qualified table name, e.g. DATASTORE.OWNER.TABLE, should be specified. Also we can set the Increment value, i.e. the interval between the generated key values; by default it is 1. We can also use a variable placeholder for this option. We will be using this transform frequently while populating surrogate key values of slowly changing dimension tables.

TABLE_COMPARISON Transform

Table_Comparison transform helps to compare two data sets and generates the difference between them as a resultant data set, with rows flagged as INSERT, UPDATE, or DELETE. This transform can be used to ensure rows are not duplicated in a target table, or to compare the changed records of a data warehouse dimension table. It helps to detect and forward either all changes or only the latest ones that have occurred since the last time the comparison table was updated. We will be using this transform frequently while implementing slowly changing dimensions and while designing dataflows for recovery.
The comparison table must already be imported into the DS repository. The fully qualified table name, e.g. DATASTORE.OWNER.TABLE, should be specified. Also set the input dataset columns that uniquely identify each row as the Input primary key columns. These columns must be present in the comparison table with the same column names and datatypes. If the primary key value from the input data set does not match a value in the comparison table, DS generates an INSERT row. Otherwise it generates an UPDATE row with the values from the input dataset row, after comparing all the columns in the input data set that are also present in the comparison table apart from the primary key columns. As per your requirement, select only the required subset of non-key Compare columns, which will improve performance.
If the Input primary key columns have duplicate keys, the transform arbitrarily chooses any of the rows to compare during dataflow processing, i.e. the order of the input rows is ignored. Selecting the Input contains duplicate keys check box provides a method of handling duplicate keys in the input data set.
If the comparison table contains rows with the same primary keys, the transform arbitrarily chooses any of the rows to compare. Specify the column of the comparison table with unique keys i.e. by design contains no duplicate keys as the Generated key column. A generated key column indicates which row of a set containing identical primary keys is to be used in the comparison. This provides a method of handling duplicate keys in the comparison table.
For an UPDATE, the output data set will contain the largest key value found for the given primary key. And for a DELETE, the output data set can include all duplicate key rows or just the row with the largest key value.
When we select the check box Detect deleted row(s) from comparison table, the transform flags rows of the comparison table with the same key value as DELETE. When we select the Generated key column and Detect deleted row(s) from comparison table options together with the Row-by-row select or Sorted input comparison method, an additional section appears to specify how to handle DELETE rows with duplicate keys, i.e. Detect all rows or Detect row with largest generated key value.
Apart from all these properties there are three methods for accessing the comparison table namely Row-by-row select, Cached comparison table and Sorted input. Below is the brief on when to select which option.
  1. Row-by-row select option is best if the target table is large compared to the number of rows the transform will receive as input. In this case for every input row the transform fires a SQL to lookup the target table.
  2. Cached comparison table option is best when we are comparing the entire target table. DS uses pageable cache as the default. If the table fits in the available memory, we can change the Cache type property of the dataflow to In-Memory.
  3. Sorted input option is best when the input data is pre sorted based on the primary key columns. DS reads the comparison table in the order of the primary key columns using sequential read only once. NOTE: The order of the input data set must exactly match the order of all primary key columns in the Table_Comparison transform.
NOTE:
  • The transform only considers rows flagged as NORMAL as Input dataset.
  • Be cautious when using real datatype columns in this transform, as comparison results are unpredictable for this datatype.

What is Active Lookup Transformation

Informatica 9x allows us to configure Lookup transformation to return multiple rows. So now we can retrieve multiple rows from a lookup table thus making Lookup transformation an Active transformation type.

How to configure a Lookup as Active?

To use this option, while creating the transformation we must configure the Lookup transformation property "Lookup Policy on Multiple Match" to Use All Values. Once created, we cannot change the mode between passive and active: whenever the Lookup Policy on Multiple Match attribute is set to Use All Values, the property becomes read-only.

Implementing a Lookup As Active

Scenario: Suppose we have customer order data in a relational table. Each customer has multiple orders in the table. We can configure the Lookup transformation to return all the orders placed by a customer.
Now check the below simple mapping where we want to return all employees in the departments.
Go to Transformation and click Create. Select Transformation Type as Lookup and enter a name for the transformation.
Create LookUp Transformation
Next check the option Return All Values on Multiple Match.
Return All Values on Multiple Match
Here our source is the DEPT table and the EMP table is used as a lookup. The lookup condition is based on the department number.
Basically we try to achieve the same result as the below SQL SELECT:
SELECT DEPT.DEPTNO, DEPT.DNAME, DEPT.LOC, EMP.ENAME, EMP.SAL 
FROM DEPT LEFT OUTER JOIN EMP 
ON DEPT.DEPTNO = EMP.DEPTNO 

Active Lookup Transformation Restrictions:

  1. We cannot return multiple rows from an unconnected Lookup transformation.
  2. We cannot enable dynamic cache for an Active Lookup transformation.
  3. An Active Lookup transformation that returns multiple rows cannot share a cache with a similar Passive Lookup transformation that returns one matching row for each input row.

CDC Implementation using Flatfile

This article shows how to use a flatfile to implement Change Data Capture. Suppose we want to maintain the last extraction date in a flatfile; based on that value we want to capture the changed data of our business table.
First we will discuss what we are going to do followed by how we are going to do.
  • Suppose we have an employee table in the source system. We want to load the Delta or Changed employee data to our target data warehouse.
  • We want to maintain the last extraction date in a flatfile instead of maintaining in a DB table.
  • So the Business ETL session will extract the changed data based on the last extraction date as defined in the flatfile.
How we are going to do
  • We will create three sessions to accomplish our task.
  • First we will create a mapping to read the flatfile maintaining the last extraction date, which in turn generates a parameter file.
  • Next is the Business session mapping which will extract the delta data based on the $$start_date and $$end_date parameter values as defined in the parameter file generated by the previous session.
  • Last we will create a mapping to update the flatfile maintaining the last extraction date, which runs only when the Business session completes successfully.
  • We keep all the three sessions under one workflow.
  • We have one workflow level global parameter file having the connection details.
  • We have another runtime parameter file for the business session containing the extraction parameter values only.
The flatfile which maintains the extraction date is a comma separated file having three fields: START_DATE, END_DATE, RUN_DATE.
Content of the CDC FlatFile:-
C:\Informatica\PowerCenter8.6.0\server\infa_shared\SrcFiles\FF_CDC_DT.txt

2010-10-10,2010-10-11,2010-10-11

NOTE: Date format is YYYY-MM-DD
Source Definition of CDC Flatfile
Now we define a target definition for the Parameter file as below:
Target Definition of Parameter Flatfile
Now find the mapping which reads the extraction dates file and generates the runtime Parameter file.
In the expression transformation add the following ports:
  • V_START_DATE as Variable Port
  • V_END_DATE as Variable Port
  • V_RUN_DATE as Variable Port
  • ParamText as Output Port
Now find the expressions for all the above ports of the expression transformation.
V_START_DATE:-
IIF(
TO_DATE(END_DATE,'YYYY-MM-DD')= TRUNC(SYSDATE),
TO_DATE(START_DATE,'YYYY-MM-DD'), 
TO_DATE(END_DATE,'YYYY-MM-DD')
)

V_END_DATE:-
TRUNC(SYSDATE)

V_RUN_DATE:-
TRUNC(SYSDATE)

ParamText:-
'[WorkFolder.WF:wf_runtime_param.ST:s_m_emp_cdc]' 
|| chr(10) ||
'$$start_date=' || TO_CHAR(V_START_DATE,'YYYY-MM-DD')
|| chr(10) ||
'$$end_date=' || TO_CHAR(V_END_DATE,'YYYY-MM-DD') 
|| chr(10)
NOTE:
  • Informatica Folder Name is WorkFolder.
  • Workflow Name is wf_runtime_param
  • The business session name which will use this parameter file is s_m_emp_cdc.
  • $$start_date and $$end_date are the Mapping Parameters that will be used by the Business session for CDC extraction and load purpose.
  • chr(10) ascii equivalent of NewLine.
Next link the Output port namely ParamText of the expression transformation to the port ParamText of the Target Instance.
Now find the output generated by the mapping i.e. the runtime Parameter file ff_param_runtime.txt
[WorkFolder.WF:wf_runtime_param.ST:s_m_emp_cdc]
$$start_date=2010-10-11
$$end_date=2011-10-12
Next let us see the business mapping which extracts the delta employee information based on the extraction Mapping Parameters $$start_date and $$end_date.
Define two Parameters namely:
  • $$start_date string(10)
  • $$end_date string(10)
Find the Source Qualifier SQL Query.
SELECT 
EMP_SRC.EMPNO, EMP_SRC.ENAME, 
EMP_SRC.JOB, EMP_SRC.MGR, 
EMP_SRC.HIREDATE, EMP_SRC.SAL, 
EMP_SRC.COMM, EMP_SRC.DEPTNO 
FROM
EMP_SRC 
WHERE EMP_SRC.HIREDATE >= TO_DATE('$$start_date','YYYY-MM-DD')
AND EMP_SRC.HIREDATE < TO_DATE('$$end_date','YYYY-MM-DD')
Next let us see the mapping which resets the extraction dates file; it runs only after successful execution of the business session.
In the expression transformation add the following ports:
  • V_START_DATE as Variable Port
  • V_END_DATE as Variable Port
  • V_RUN_DATE as Variable Port
  • OUT_START_DATE as Output Port
  • OUT_END_DATE as Output Port
  • OUT_RUN_DATE as Output Port
Now find the expressions for all the above ports of the expression transformation.
V_START_DATE:-
IIF(
TO_DATE(END_DATE,'YYYY-MM-DD') = TRUNC(SYSDATE),
TO_DATE(START_DATE,'YYYY-MM-DD'),
TO_DATE(END_DATE,'YYYY-MM-DD')
)

V_END_DATE:-
TRUNC(SYSDATE)

V_RUN_DATE:-
TRUNC(SYSDATE)

OUT_START_DATE:-
TO_CHAR(V_START_DATE,'YYYY-MM-DD')

OUT_END_DATE:-
TO_CHAR(V_END_DATE,'YYYY-MM-DD')

OUT_RUN_DATE:-
TO_CHAR(V_RUN_DATE,'YYYY-MM-DD')

Next link the Output ports of the expression transformation to the corresponding ports of the Target Instance.
Now let us look at the Workflow and sessions:
In the Workflow Properties tab set the Parameter file namely global_param.txt and in the Variables tab create a Workflow Variable namely $$var_param_file of datatype nstring.
Next for the business session s_m_emp_cdc set the Parameter Filename in the Properties tab to $$var_param_file.
Content of the Global Parameter FlatFile:-
C:\Informatica\PowerCenter8.6.0\server\infa_shared\BWParam\global_param.txt

[WorkFolder.WF:wf_runtime_param]
$DBConnection_SRC=Info_Src_Conn
$DBConnection_TGT=Info_Tgt_Conn
$PMMergeSessParamFile=TRUE
$$var_param_file=C:\Informatica\PowerCenter8.6.0\server\infa_shared
\TgtFiles\ff_param_runtime.txt

[WorkFolder.WF:wf_runtime_param.ST:s_m_get_runtime_param]
$InputFile_CDC=$PMSourceFileDir\FF_CDC_DT.txt
$OutputFile_Param=$PMTargetFileDir\ff_param_runtime.txt


[WorkFolder.WF:wf_runtime_param.ST:s_m_set_runtime_param]
$InputFile_CDC=$PMSourceFileDir\FF_CDC_DT.txt
$OutputFile_CDC=$PMSourceFileDir\FF_CDC_DT1.txt

Now there are many important points to discuss regarding the Workflow level global parameter file.
NOTE:
  • $$var_param_file is set to the path of the runtime output parameter file generated by session s_m_get_runtime_param.
  • Now the business session needs to read the parameter file ff_param_runtime.txt.
  • We know that a parameter file can be declared at the workflow level and also at the session level.
  • When a parameter file is defined at both levels, the Integration Service by default uses the workflow-level parameter file and ignores the session-level parameter file.
  • However, we want the business session within the workflow wf_runtime_param, which already uses the workflow parameter file global_param.txt, to also use the session parameter file ff_param_runtime.txt.
  • The $PMMergeSessParamFile=TRUE property causes the Integration Service to read both the session-level and workflow-level parameter files, which solves our problem.
  • Observe the parameters in the set session that updates the extraction date in the flat file for the next day's load. Here we are reading from FF_CDC_DT.txt and loading to a different file name, FF_CDC_DT1.txt. So we need to copy FF_CDC_DT1.txt over FF_CDC_DT.txt and then delete FF_CDC_DT1.txt once the session completes successfully.
Find the Post-Session Success Commands:

Copy_file:
copy $OutputFile_CDC $InputFile_CDC;

Delete_file:
del $OutputFile_CDC;
In this way we can Capture Changed Data using a Flatfile.

Top Informatica Interview Questions with Answers

Welcome to the finest collection of Informatica Interview Questions with standard answers that you can count on. Read and understand all the questions and their answers below and in the following pages to get a good grasp of Informatica.

What are the differences between Connected and Unconnected Lookup?

The differences are summarized below:
  • Connected lookup participates in the dataflow and receives input directly from the pipeline, whereas unconnected lookup receives input values from the result of a :LKP expression in another transformation.
  • Connected lookup can use both dynamic and static cache, whereas an unconnected lookup cache cannot be dynamic.
  • Connected lookup can return more than one column value (output ports), whereas unconnected lookup can return only one column value, i.e. its return port.
  • Connected lookup caches all lookup columns, whereas unconnected lookup caches only the lookup ports used in the lookup condition and the return port.
  • Connected lookup supports user-defined default values (i.e. the value to return when the lookup condition is not satisfied), whereas unconnected lookup does not support user-defined default values.

What is the difference between Router and Filter?

The following differences can be noted:
  • Router transformation divides the incoming records into multiple groups based on conditions, and such groups can be mutually inclusive (different groups may contain the same record), whereas Filter transformation restricts or blocks the incoming record set based on one given condition.
  • Router transformation itself does not block any record; if a record does not match any of the routing conditions, it is routed to the default group. Filter transformation does not have a default group; if a record does not match the filter condition, it is blocked.
  • Router acts like a CASE .. WHEN statement in SQL (or a switch .. case statement in C), whereas Filter acts like a WHERE condition in SQL, as the rough analogy below illustrates.
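As a rough SQL analogy only, reusing the EMP_SRC table from the CDC example above (the group names are hypothetical), filtering applies a single condition while routing evaluates every row against each group condition and sends non-matching rows to the default group:
-- Filter: rows that fail the single condition are blocked (rough analogy)
SELECT EMP_SRC.* FROM EMP_SRC WHERE EMP_SRC.DEPTNO = 10;

-- Router: each row is tagged with the first group whose condition it matches,
-- and rows matching no condition fall into the default group (rough analogy;
-- unlike a real Router, a CASE expression assigns only one group per row)
SELECT EMP_SRC.*,
       CASE
         WHEN EMP_SRC.DEPTNO = 10 THEN 'GROUP_DEPT_10'
         WHEN EMP_SRC.DEPTNO = 20 THEN 'GROUP_DEPT_20'
         ELSE 'DEFAULT_GROUP'
       END AS ROUTE_GROUP
FROM EMP_SRC;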

What can we do to improve the performance of Informatica Aggregator Transformation?

Aggregator performance improves dramatically if records are sorted before being passed to the aggregator and the "Sorted Input" option under the aggregator properties is checked. The record set should be sorted on the columns used in the Group By operation.
It is often a good idea to sort the record set at the database level, e.g. inside a source qualifier transformation, unless there is a chance that the already sorted records from the source qualifier can again become unsorted before reaching the aggregator.
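As a sketch, under the assumption that the aggregator groups by DEPTNO and then JOB (reusing the EMP_SRC table from the CDC example), the source qualifier SQL override would order the rows on exactly those columns, in the same order as the Group By ports:
SELECT 
EMP_SRC.DEPTNO, EMP_SRC.JOB, EMP_SRC.SAL 
FROM
EMP_SRC 
ORDER BY EMP_SRC.DEPTNO, EMP_SRC.JOB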

What are the different lookup cache(s)?

Informatica lookups can be cached or un-cached (no cache), and a cached lookup can be either static or dynamic. A static cache is not modified once it is built and remains the same during the session run. A dynamic cache, on the other hand, is refreshed during the session run by inserting or updating records in the cache based on the incoming source data. By default, an Informatica lookup cache is static.
A lookup cache can also be classified as persistent or non-persistent, based on whether Informatica retains the cache after the session run completes or deletes it.

How can we update a record in target table without using Update strategy?

A target table can be updated without using an Update Strategy transformation. For this, we need to define the key of the target table at the Informatica level and then connect the key and the field we want to update in the mapping target. At the session level, we should set the target property to "Update as Update" and check the "Update" check-box.
Let's assume we have a target table "Customer" with the fields "Customer ID", "Customer Name" and "Customer Address". Suppose we want to update "Customer Address" without an Update Strategy. Then we have to define "Customer ID" as the primary key at the Informatica level and connect the Customer ID and Customer Address fields in the mapping. If the session properties are set correctly as described above, the mapping will update the Customer Address field for all matching Customer IDs.
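Conceptually, with "Update as Update" set and the key defined, the update issued for the Customer example above would look roughly like the sketch below (column names and bind placeholders simplified for illustration only):
UPDATE Customer 
SET Customer_Address = :new_address 
WHERE Customer_ID = :customer_id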

Under what condition selecting Sorted Input in aggregator may fail the session?

  • If the input data is not sorted correctly, the session will fail.
  • Even if the input data is properly sorted, the session may fail if the sort-order ports and the Group By ports of the aggregator are not in the same order, as the sketch below illustrates.
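A minimal illustration of the second point, assuming the aggregator Group By ports are DEPTNO followed by JOB (hypothetical scenario, reusing the EMP_SRC table from the CDC example):
-- This sort order does NOT match the Group By port order (DEPTNO, JOB)
-- and can therefore fail the session even though the data is "sorted":
SELECT EMP_SRC.DEPTNO, EMP_SRC.JOB, EMP_SRC.SAL 
FROM EMP_SRC 
ORDER BY EMP_SRC.JOB, EMP_SRC.DEPTNO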

Why is Sorter an Active Transformation?

This is because we can select the "distinct" option in the sorter property.
When the Sorter transformation is configured to treat output rows as distinct, it assigns all ports as part of the sort key. The Integration Service discards duplicate rows compared during the sort operation. The number of Input Rows will vary as compared with the Output rows and hence it is an Active transformation.

Is lookup an active or passive transformation?

From Informatica 9.x onwards, the Lookup transformation can be configured as an "Active" transformation.
However, in older versions of Informatica, Lookup is a passive transformation.

What is the difference between Static and Dynamic Lookup Cache?

We can configure a Lookup transformation to cache the underlying lookup table. In case of static or read-only lookup cache the Integration Service caches the lookup table at the beginning of the session and does not update the lookup cache while it processes the Lookup transformation.
In case of dynamic lookup cache the Integration Service dynamically inserts or updates data in the lookup cache and passes the data to the target. The dynamic cache is synchronized with the target.
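As a rough analogy only (not the actual internal implementation, and with hypothetical table and column names), a dynamic lookup cache behaves like an upsert against the cached copy of the lookup table as each source row arrives:
-- Rough SQL analogy of dynamic lookup cache behaviour (illustrative names only)
MERGE INTO lookup_cache c
USING (SELECT :in_customer_id AS customer_id,
              :in_customer_address AS customer_address
       FROM dual) s
ON (c.customer_id = s.customer_id)
WHEN MATCHED THEN
  UPDATE SET c.customer_address = s.customer_address
WHEN NOT MATCHED THEN
  INSERT (customer_id, customer_address)
  VALUES (s.customer_id, s.customer_address);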

What is the difference between STOP and ABORT options in Workflow Monitor?

When we issue the STOP command on the executing session task, the Integration Service stops reading data from source. It continues processing, writing and committing the data to targets. If the Integration Service cannot finish processing and committing data, we can issue the abort command.
In contrast, the ABORT command has a timeout period of 60 seconds. If the Integration Service cannot finish processing and committing data within the timeout period, it kills the DTM process and terminates the session.

How to delete duplicate rows using Informatica

Scenario 1: Duplicate rows are present in relational database

Suppose we have Duplicate records in Source System and we want to load only the unique records in the Target System eliminating the duplicate rows. What will be the approach?
Assuming that the source system is a Relational Database, to eliminate duplicate records, we can check the Distinct option of the Source Qualifier of the source table and load the target accordingly.
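Checking the Distinct option makes the Source Qualifier generate a SELECT DISTINCT over the selected ports; for the EMP_SRC table used earlier, the generated query would look roughly like this (illustrative sketch):
SELECT DISTINCT
EMP_SRC.EMPNO, EMP_SRC.ENAME, 
EMP_SRC.JOB, EMP_SRC.MGR, 
EMP_SRC.HIREDATE, EMP_SRC.SAL, 
EMP_SRC.COMM, EMP_SRC.DEPTNO 
FROM
EMP_SRC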

WELCOME TO MICRO INFORMATICA!

Since our inception over 22 years ago, Micro Informatica has taken pride in delivering the highest quality and best value to our customers. This philosophy has helped us grow into an industry leader, distributing a wide range of computer hardware and peripherals throughout Latin American and Caribbean markets at heavily discounted prices.

I would like to take this opportunity to communicate directly with you and all of our customers as Micro Informatica takes this significant step forward. Following the traumatic events at the beginning of this century in New York, which deeply touched all of us, economic conditions deteriorated as energy, attention and resources were redirected toward improving security. In light of these conditions, I believe the steady recovery of the technology sector, especially here at Micro Informatica, has been remarkable.

With our typical enthusiasm, we at Micro Informatica intensified our efforts to get closer to our markets and meet the clear demands of our customers. We created Phitronics, a manufacturer of motherboards, located in the Free Zone of Manaus, at the heart of the Amazon Basin in Brazil. After twelve months of intense planning and management involving both public and private entities, we have finally crowned our efforts with success. Phitronics has been operative since December 2005. I am confident that this landmark venture will expand horizons, and allow you to meet new challenges in a most exciting and dynamic manner.

Our new partnership would not have been possible without the essential and enduring support of our valued suppliers, whose products illustrate this brochure. To both our clients and suppliers, I render my sincere thanks.

If you have any questions or would like to learn more about what Micro Informatica has to offer, please feel free to call us at 305-418-3200 or visit our website at www.MicMiami.com.

Terra Informatica


[Screenshot: Sciter 2.0.0.19]

New features:

  • [CSS] background-attachment: scroll | local | fixed; is back.
  • [CSSS!] "string".compare("other string") – lexical string comparison.
  • [TIS] "string".lexicalCompare("other string") – lexical string comparison.
  • [CSS] image-transformation, additive cascade: allows combining transformation lists from different rules. See: sdk/samples/image-transformations.css++/additive-transformations.htm
  • [CSS] behavior:marquee is back. See: sdk/samples/effects/marquee.htm
  • Sciter2 port of behavior_tabs.cpp from Sciter1 SDK.
  • behavior:file-icon is back, see: sdk/samples/goodies/file-icon.htm
  • @novalue support for behavior:edit, password, textarea.
  • [CSS] New ellipsis drawing mode: text-overflow:path-ellipsis - draws the ellipsis in the middle of the string.
  • Windows non-aero themes / "classic" theme support. Not all shapes are supported yet though.
  • TIScriptAPI API function is back.
  • Element.bindImage(url,img) – dynamic image generation for CSS. See: /sdk/samples/graphics/test-dynamic-background.htm and /sdk/samples/ideas/preview/thumbnail.htm
  • IAccessible support on Sciter window.
  • <select type="dropdown-select" multiple> – dropdown select with multiple selected options. See: /sdk/samples/forms/select-dropdown-variants.htm
  • Graphics.line/fillLinear/RadialGradient() methods support multiple color stops now. See: /sdk/samples/graphics/test-immediate-mode.htm

Fixes:

  • fix of dynamic updates of display:inline elements with parent flow:***
  • fix of dynamic updates of visibility:visible|collapse.
  • fix of <!DOCTYPE … preamble parsing.
  • masked-edit, various fixes.
  • fix background-image-transformation + animations.
  • fix of table on popup drawing problem, see: http://terrainformatica.com/forums/topic.php?id=2293
  • fix of Element.$replace() method.
  • html parser tweaks for erroneous table markup.
  • IME (East Asian languages support) works now.
  • Proper WM_SETFOCUS/WM_KILLFOCUS handling on Sciter window.
  • fix of <optgroup> handling.
  • fix of visibility:hidden on display:inline-block elements.
  • fix of SciterDataReadyAsync for CSS files.
  • fix of SciterSetElementText() handling by elements that have attached behaviors.
  • fix of <select editable> value handling.
  • fix of erroneous unescaped ‘&’ handling.

New samples:

  • sdk/samples/forms/edit-plus.htm – demonstrates extensions and use of methods like selectAll() on behavior:edit and behavior:textarea;
  • sdk/samples/ideas/lightbox-dialog – that famous lightbox with modal event processing.
  • sdk/samples/image-transformations.css++/additive-transformations.htm
  • sdk/samples/effects/marquee.htm
  • sdk/samples/replace-animator/test3.htm – one more demo of Metro style animations.
  • sdk/samples/goodies/file-icon.htm
  • /sdk/samples/graphics/test-dynamic-background.htm
  • /sdk/samples/ideas/preview/thumbnail.htm
  • /sdk/samples/forms/select-dropdown-variants.htm
  • /sdk/samples/graphics/test-immediate-mode.htm

BMJ Informatica

Informatica Systems, part of the BMJ Group, delivers performance management systems and innovative software solutions to primary care such as FrontDesk, Contract+, Audit+ and Health Checks.
Since 1992, Informatica Systems has been providing decision support and benchmarking support tools for doctors and developing systems for primary care data extraction and analysis. The Welsh Assembly Government relies upon Informatica's technology for its Primary Care Data Quality Service.
One of the few companies certified by BSi for its ISO 27001 Information Security Management System, Informatica Systems has also passed Connecting for Health’s Information Governance Statement of Compliance.
Informatica Systems is an official partner of EMIS, InPS and iSOFT and is a certified Microsoft partner.

FrontDesk

FrontDesk is the most popular third party appointments system for primary care in the UK and has been in use by a large number of practices since 1992. FrontDesk is running over 600 sites’ appointment books and well over 4 million patients are able to check themselves in on touchscreens using our Patient Check-In module. Our Appointments Online module allows patients to book or cancel appointments and order repeat prescriptions via the web.

Contract+

Contract+ has been built to help GPs and GP practices maximise their performance and earnings under the GMS contract and easily manage their workload. It allows you to easily identify your current QOF status and areas where more points can be gained. Contract+ is in use at well over 1,000 sites, providing GMS analyses on over 8 million patients overall.

Audit+

Audit+ is benchmarking and tracking software used by PCTs and Health Boards to track the performance and effectiveness of locally enhanced services. It can be used by GPs who want to monitor their performance against NICE guidelines or who want to run and record audits that demonstrate improvements for appraisal or revalidation purposes. Deployed nationally in Wales since 2007 to support the Data Quality Service for the Primary Care Informatics Programme, Audit+ has been used to provide daily monitoring of H1N1 outbreaks. Audit+ has been installed at nearly 500 Welsh sites, analysing details on 2.7 million patients.

Health Checks

Health Checks is a software solution that supports delivery of the requirements set out under NHS Health Check. It can be deployed at GP practice level or centrally by a PCT, Health Board or Public Health organisation. Interacting with the patient record, it helps the practice identify and contact the relevant cohort of patients, prompts the collection of the correct information, and provides personalised advice based on the data entered. The software is intuitive and includes a patient visualisation screen. A version of this software has been in successful use in the Ashton Leigh Wigan PCT for more than 2 years and is used by a number of other PCTs.