Incremental Loading for Fact Tables

Tuesday, May 29, 2012

Incremental Loading for Fact Tables

Posted by DWH at 9:52 PM

In the previous articles, we have discussed the general concepts of incremental data loading as well as how to perform incremental data loading for dimension tables. In this article we will discuss the methods and issues of loading data incrementally in Fact tables of a data warehouse.

METHOD OF LOADING

Generally speaking, incremental loading for Fact tables is relatively easier as, unlike dimension tables, here you do not need to perform any look-up on your target table to find out if the source record already exists in the target or not. All you need to do is to select incremental records from source (as shown below for the case of "sales" table) and load them as it is to target (you may need to perform lookup to dimensional tables to assign respective surrogate keys - but that's a different story).
Like before we will assume we have a "sales" table in the source
Sales Table

ID         CustomerID    ProductDescription   Qty   Revenue  Sales Date
1          1             White sheet (A4)     100    4.00    22-Mar-2012
2          1             James Clip (Box)     1      2.50    22-Mar-2012
3          2             Whiteboard Marker    1      2.00    22-Mar-2012
4          3             Letter Envelop       200   75.00    23-Mar-2012
5          1             Paper Clip           12     4.00    23-Mar-2012

Given this table, a typical extraction query will be like this:

SELECT t.* 
FROM Sales t
WHERE t.sales_date > (select nvl(
                                 max(b.loaded_until), 
                                 to_date('01-01-1900', 'MM-DD-YYYY')
                                )
                      from batch b
                      where b.status = 'Success');

where "batch" is a separate table maintained at target system having minimal structure and data like below

Batch_ID  Loaded_Until  Status
1         22-Mar-2012   Success
2         23-Mar-2012   Success

However, things may get pretty complicated if your fact is a special type of fact called "snapshot fact". Let's understand them below.

Loading Incident Fact

Incident fact is the normal fact that we encounter mostly (and that we have seen above in our sales table example). Records in these types of facts are only loaded if there are transactions coming from the source. For example, if at all there is one sale that happens in the source system, then only a new sales record will come. They are dependent on some real "incident" to happen in the source hence the name incident fact.

Loading Snapshot Fact

As opposed to incident fact, snapshot facts are loaded even if there is no real business incident in the source. Let me show you what I mean by using the above example of customer and sales tables in OLTP. Let's say I want to build a fact that would show me total revenue of sales from each customer for each day. In effect, I want to see the below data in my fact table.
Sales fact table (This is what I want to see in my target fact table)

Date          Customer    Revenue
22-Mar-2012   John        6.50
22-Mar-2012   Ryan        2.00
23-Mar-2012   John       10.50
23-Mar-2012   Ryan        2.00
23-Mar-2012   Bakers'    75.00

As you see, even if no sales was made to Ryan on 23-Mar, we still show him here with the old data. Similarly for John, even if goods totaling to $4.00 was sold to him on 23-Mar, his record shows the cumulative total of $10.50.
Now obviously the next logical question is how to load this fact using incremental loading? Because incremental loading only brings in incremental data - that is on 23rd March, we will only have Bakers' and John's records and that too with that day's sales figures. We won't have Ryan record in the incremental set.

Why not a full load

You can obviously opt-in for full load mechanism as that would solve this problem but that would take the toll on your loading performance.

Then what's the solution?

One way to resolve this issue is: creating 2 incremental channels of loading for the fact. 1 channel will bring in incremental data from source and the other channel will bring in incremental data from the target fact itself. Let's see how does it happen below. We will take the example for loading 23-Mar data.

Channel 1: Incremental data from source

Customer    Revenue 
John        4.00
Bakers'    75.00

Channel 2: Incremental data from target fact table (last day's record)

Customer    Revenue
John        6.50
Ryan        2.00

Next we can perform a FULL OUTER JOIN between the above two sets to come to below result

John       10.50
Ryan        2.00
Bakers'    75.00

Hope this tutorial was helpful. If you have any doubt or question on above, please Ask your question here. We will surely help you out!

If you want to learn more, visit our ETL Basic Concepts and Tutorial page.