Monday, February 28, 2011

Slowly Changing Dimensions: A Data Warehouse Administrator’s Ongoing Challenge

1 comments
A key pillar of data mart building capabilities is a dimensional framework. Ralph Kimball, a leading data warehouse consultant, describes building data marts with shared or conformed dimensions as a very effective method for rolling out an enterprise solution. For the business user, dimensions such as "customer" or "location" can be described in terms of the business rules that an organization needs to define in order to make decisions about their business (for example, "customers by location" or "sales over time"). Building data marts with common definitions delivers a consistent view of the data. The industry’s most effective ETL solutions are able to define dimensions once and make them available for reuse in subsequent data marts. This best practice methodology quickly facilitates an enterprise roll out of subsequent data marts. A dimensional framework allows data mart designers to quickly prototype and deploy a data mart solution using its dimensional framework.
Tracking Changes Over Time

In the real world, though, many dimensions change over time. But what happens to a dimension when, for example, a customer or product dimension changes? Tracking changes in dimensions, referred to as slowly changing dimensions, can raise interesting questions. Data mart designers need to add new customers or change the addresses of existing customers, the company assigns new regional hierarchies, or product descriptions and packaging change. These sorts of changes need to be reflected in the data mart dimension tables. In several cases, not only does the change need to be reflected, but the history of the changes also needs to be tracked. By remembering history, businesses are able to look at historical data and compare it to their current situation.
Slowly Changing Dimensions

The issues of maintaining the history of this data has profound implications for the design of the data warehouse. Effective ETL solutions are able to simplify the support for Type 1 and Type 2 (as per Ralph Kimball terminology) slowly changing dimensions through automated dimensional management. Furthermore, these dimensions can contain a mixture of both Type 1 and Type 2 dimensions.


Data mart designers have two choices: they can either allow dimension records to be updated in place (no history preservation), or they can create a new dimension record when changes in designated dimension attributes occur (preserving history). The designer must choose based on the users’ needs.


Figure 1: Type 1 (Overwrite Dimension Records)

The first technique does not maintain any history, but is a simple and fast procedure. Overwriting dimensional records is used when the old values are incorrect or unimportant – when there is no value in keeping the old description. For example, when a sales representative changes her last name due to marriage, there is no business reason to keep the representative’s maiden name in history. Thus, her maiden name can be overwritten.


Figure 2: Type 2 (Add Dimension Record)

Type 2 slowly changing dimensions are used to create a new row of dimension data when there are changes in an attribute whose history needs preserving. For example, do the sales for Mary Smith get credited to the new office when she transfers, or do they stay with the old office? In this case it is necessary to keep the history so Mary’s sales (and her commission check) will be assigned to the correct branch.
Surrogates and Slowly Changing Dimensions

Surrogates refer to artificial keys used to connect a dimension table to a fact table and are always used with slowly changing dimensions. Leading data mart creation tools automatically manage the creation of new surrogates and synchronization of these values in fact and dimension tables. The process of surrogate and dimension history management are now reduced to a set of rules that are associated with the dimension table columns in the warehouse. Once these rules are defined, dimensionally friendly ETL solutions automatically maintain the surrogates and updates or adds dimension table records as necessary when changes are made in the corresponding reference sources in the operational system.

Data mart/warehouse administrators that want to rid themselves of the daily headache of slowly changing dimensions need to look to the automation of this important data warehouse function by simplifying the process of handling slowly changing dimensions. For the data mart designer this translates into the ability to respond to changes and provides an efficient method for tracking history.

Peter Griffiths is vice president of Cognos, Inc.

For more information on related topics, visit the following channels:

    * DW Administration, Mgmt., Performance
    * DW Design, Methodology
    * Databases

SOURCE:http://www.information-management.com/infodirect/20010126/2998-1.html
newer post

Slowly Changing Dimensions Are Not Always as Easy as 1, 2, 3

0 comments
To kick off our first column of the year, we're going to take on a challenging subject that all designers face: how to deal with changing dimensions. Unlike most OLTP systems, a major objective of a data warehouse is to track history. So, accounting for change is one of the analyst's most important responsibilities. A sales force region reassignment is a good example of a business change that may require you to alter the dimensional data warehouse. We'll discuss how to apply the right technique to account for the change historically. Hang on to your hats — this is not an easy topic.

warehouse design presumes that facts, such as customer orders or product shipments, will accumulate quickly. But the supporting dimension attributes of the facts, such as customer name or product size, are comparatively static. Still, most dimensions are subject to change, however slow. When dimensional modelers think about changing a dimension attribute, the three elementary approaches immediately come to mind: slowly changing dimension (SCD) types 1, 2 and 3.

These three fundamental techniques, described in Quick Study, are adequate for most situations. However, what happens when you need variations that build on these basics to serve more analytically mature data warehouse users? Business folks sometimes want to preserve the historically accurate dimension attribute associated with a fact (such as at the time of sale or claim), but maintain the option to roll up historical facts based on current dimension characteristics. That's when you need hybrid variations of the three main types. We'll lead you through the hybrids in this column.

The Mini Dimension with "Current" Overwrite

When you need historical tracking but are faced with semi-rapid changes in a large dimension, pure type 2 tracking is inappropriate. If you use a mini dimension, you can isolate volatile dimension attributes in a separate table rather than track changes in the primary dimension table directly. The mini-dimension grain is one row per "profile," or combination of attributes, while the grain of the primary dimension might be one row per customer. The number of rows in the primary dimension may be in the millions, but the number of mini-dimension rows should be a fraction of that. You capture the evolving relationship between the mini dimension and primary dimension in a fact table. When a business event (transaction or periodic snapshot) spawns a fact row, the row has one foreign key for the primary dimension and another for the mini-dimension profile in effect at the time of the event.

Profile changes sometimes occur outside of a business event, for example when a customer's geographic profile is updated without a sales transaction. If the business requires accurate point-in-time profiling, a supplemental factless fact table with effective and expiration dates can capture every relationship change between the primary and profile dimensions. One more embellishment with this technique is to add a "current profile" key to the primary dimension. This is a type 1 attribute, overwritten with every profile change, that's useful for analysts who want current profile counts regardless of fact table metrics or want to roll up historical facts based on the current profile. You'd logically represent the primary dimension and profile outrigger as a single table to the presentation layer if doing so doesn't harm performance. To minimize user confusion and error, the current attributes should have column names that distinguish them from the mini-dimension attributes. For example, the labels should indicate whether a customer's marketing segment designation is a current one or an obsolete one that was effective when the fact occurred — such as "historical marital status at time of event" in the profile mini dimension and "current marital status" in the primary customer dimension.


Business analysts need to track changes in dimension attributes. Reevaluation of a customer's marketing segment is an example of what might prompt a change. There are three fundamental techniques.

Type 1 is most appropriate when processing corrections; this technique won't preserve historically accurate associations. The changed attribute is simply updated (overwritten) to reflect the most current value.

    Discover how cloud platform services are raising the stakes when it comes to what you can expect from embracing the cloud as part of your architecture.
    Time To Revisit Your Cloud Strategy

With a type 2 change, a new row with a new surrogate primary key is inserted into the dimension table to capture changes. Both the prior and new rows contain as attributes the natural key (or durable identifier), the most-recent-row flag and the row effective and expiration dates.

With type 3, another attribute is added to the existing dimension row to support analysis based on either the new or prior attribute value. This is the least commonly needed technique.

SOURCE:http://www.informationweek.com/news/software/bi/showArticle.jhtml?articleID=59301280
newer post

How to handle Slowly Changing Dimensions (SCDs) in data model design?

0 comments


There are multiple methods to handle the slowly changing dimensions. Which technique to use depends on your business requirements. The choice among these three methods are not a technical design decision since their behaviors are different.

Type One: Overwite the old data with new data

Using this method, you do not store the histoy. For example, that say each customer can have one salesrep at any given point in time. When the salerep of ABC Inc., changes from Sandy to Laura, Sandy was a salerep of ABC will not be kept anywhere. Any report by salesrep will assume that Laura is the salereps of ABC Inc. forever and count all the sales done by Sandy as Lanura’s.

The above example may not sound making business sense. However, if you only report the sales of the current period, and salesrep does not change during the period, this method is ok to be used.

Mary OLTP tables does not need to track the history of changes and thus this method may be used by the source application. However, if you want to report the historical data, even your OLTP does not track history, the data warehouse can still use other methods to track the history.

Type Two: Add a new record at the timeof the change

Using this method, all priorhistory are saved. There are two alternative methods to model the key of this table.

Method A – No surrogate key – Use timestamp

When a change happens, a new record is added into the table. All the attributes are copied from the previous record except the changed values. The nature key is copied as well so the timestamps is used to differentiate the records.

When a fact table is joined with the dimension, if you are interested in the historical data, the timestamp will be used as part of the join condition. To ease the join, the record typically use two date columns – the effective start date and the effective end date.

Method B – No surrogate key – Use version number

Instead of using the date column, a version number is used to differentiate the different versions of the records.

This technique requires the fact table store both nature key and the version number to retrive a given version of the dimension date.

Method C – Use a surrogate key

When an attribue is change, a sequence generated key is used, the fact table will also use this key column as the foreign key.

Type Three: Track changes using a separate column

Using this method, you use a separate column of dimension table to store the values of previous years, in addition to the current year data.

This method does not track all the history, but just one prior version.

If the data is changed, the old value need to be moved from the current value column to the prior column and the new value overwrites the current column.

This method is used when the changes is not randon but a predefined interval such as annual.

This entry was posted on January 13, 2007 at 10:44 pm and is filed under BI, Data Warehouse, OBIEE, Oracle BI Suite EE, SCD, Siebel Analytics. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

SOURCE:http://dylanwan.wordpress.com/2007/01/13/how-to-handle-slowly-changing-dimensions-scds-in-data-model-design/
newer post

Data Warehousing - Slowly Changing DimensionsBy nature of the data warehouse, fact and dimension tables are likely to have very different transactional activity. Fact tables will be mostly inserted into as you load data. Very infrequently you might have to update the facts that were loaded incorrectly. It is even less likely that you'll ever delete rows from the fact table; the only time this might happen is if you wanted to archive the old data that is no longer relevant for the current portion of the data warehouse. Dimension tables, on the other hand are likely to see frequent updates. A classic example of this is a consumer dimension which tracks people that buy your products or services. People can change names due to marriage, divorce or another reason, they can change their title of courtesy from Ms to Mrs or to Dr. As people age they also fit a different age group and perhaps even a different education and income level. Therefore you are likely to change dimension table values often. Dimensions that are changing over time are referred to as slowly changing dimensions (SCD). There are several ways to handle SCD; most common approaches are referred to as type 1, type 2, type 3 and type 4. The easiest way to manage SCD is to simply override the existing value with the new value type 1. Indeed if a department changes the title from "Finance" to "Financial" overwriting such a value isn't likely to do much harm. However, you must weigh the consequences of overriding existing values carefully to ensure that your reports don't lose credibility. For example, suppose Mr. Smith who has near perfect credit marries Ms. Jones who has recently declared bankruptcy. Since Mr. Smith had near perfect credit before his marriage your bank has issued him a loan for $200,000. If you overwrite Mr. Smith's credit score with his new, considerably lower credit score you will make decisions from the not very distant past look very incorrect. If your data analysts happen to run reports immediately before and immediately after the credit rating was changed your data warehouse credibility will be compromised. Here is what type 1 SCD records would look like before and after change: Customer_key Customer_name Customer_city Customer_state 123 Ms. Brown Houston Texas 123 Mrs. Green Houston Texas As you can see type 1 SCD does not allow tracking of changes. Once you overwrite the value all historical reports will display Mrs. Green instead of Ms. Brown. The second and more common way of handling SCD is to create a new record when dimension attributes change - type 2. For example, when Ms. Brown becomes Mrs. Green you create a new record for that customer and mark the first record as obsolete. The second approach adds effective date and obsolete date columns to dimension tables. Doing so allows a data warehouse architect to maintain a history of SCD changes. After a change occurs the warehouse will have the following two records (one of them marked as obsolete): Customer_key Customer_name Customer_city Customer_state Effective_date Obsolete_date 123 Ms.Brown Houston Texas 1/1/2000 1/1/2005 234 Mrs.Green Houston Texas 1/1/2005 NULL The third approach is to add columns to the dimension tables in order to maintain the history of changes in a single row. In the case of consumer dimension you would create an old last name and current last name columns along with the date column that indicates when the change occurred. This type of auditing of changes is typically reserved for very few dimensions where change history tracking is critical. For example if your product prices tend to change over time and you must compare revenues based on old and new price then you might wish to use this method. Type 3 of maintaining SCD carries much overhead and is not used very frequently. After type 3 change the warehouse will still have a single record shown below: Customer_key Customer_name Customer_city Customer_state Change_date Old_name 123 Mrs.Green Houston Texas 1/1/2005 Ms.Brown Most environments that use type 3 slowly changing dimensions track 2 iterations of changes; that is you would record the current version, previous version and version before last for the changed column. Type 4 is a special way of handling type 2 changes. Type 2 usually tracks unlimited history by creating new records each time the change occurs. If you know how far back you wish to track history you can add logic to your code to delete any records that go too far back in history. For example, suppose we created an exam and offered it to our students. As time goes on we might decide to edit the exam. We could manage these edits using type 2 method, but if we accumulate 10 versions of the same exam the data might get confusing. So we might decide that we only store up-to 9 records per each exam; any time 10th version of an exam is created we'll delete the 1st version from the warehouse. SOURCE:http://sqlserverpedia.com/wiki/Data_Warehousing_-_Slowly_Changing_Dimensions

0 comments
By nature of the data warehouse, fact and dimension tables are likely to have very different transactional activity. Fact tables will be mostly inserted into as you load data. Very infrequently you might have to update the facts that were loaded incorrectly. It is even less likely that you'll ever delete rows from the fact table; the only time this might happen is if you wanted to archive the old data that is no longer relevant for the current portion of the data warehouse. Dimension tables, on the other hand are likely to see frequent updates. A classic example of this is a consumer dimension which tracks people that buy your products or services. People can change names due to marriage, divorce or another reason, they can change their title of courtesy from Ms to Mrs or to Dr. As people age they also fit a different age group and perhaps even a different education and income level. Therefore you are likely to change dimension table values often.



Dimensions that are changing over time are referred to as slowly changing dimensions (SCD). There are several ways to handle SCD; most common approaches are referred to as type 1, type 2, type 3 and type 4.



The easiest way to manage SCD is to simply override the existing value with the new value type 1. Indeed if a department changes the title from "Finance" to "Financial" overwriting such a value isn't likely to do much harm. However, you must weigh the consequences of overriding existing values carefully to ensure that your reports don't lose credibility. For example, suppose Mr. Smith who has near perfect credit marries Ms. Jones who has recently declared bankruptcy. Since Mr. Smith had near perfect credit before his marriage your bank has issued him a loan for $200,000. If you overwrite Mr. Smith's credit score with his new, considerably lower credit score you will make decisions from the not very distant past look very incorrect. If your data analysts happen to run reports immediately before and immediately after the credit rating was changed your data warehouse credibility will be compromised. Here is what type 1 SCD records would look like before and after change:

Customer_key     Customer_name     Customer_city     Customer_state
123     Ms. Brown     Houston     Texas
123     Mrs. Green     Houston     Texas


As you can see type 1 SCD does not allow tracking of changes. Once you overwrite the value all historical reports will display Mrs. Green instead of Ms. Brown.



The second and more common way of handling SCD is to create a new record when dimension attributes change - type 2. For example, when Ms. Brown becomes Mrs. Green you create a new record for that customer and mark the first record as obsolete. The second approach adds effective date and obsolete date columns to dimension tables. Doing so allows a data warehouse architect to maintain a history of SCD changes. After a change occurs the warehouse will have the following two records (one of them marked as obsolete):

Customer_key     Customer_name     Customer_city     Customer_state     Effective_date     Obsolete_date
123     Ms.Brown     Houston     Texas     1/1/2000     1/1/2005
234     Mrs.Green     Houston     Texas     1/1/2005     NULL


The third approach is to add columns to the dimension tables in order to maintain the history of changes in a single row. In the case of consumer dimension you would create an old last name and current last name columns along with the date column that indicates when the change occurred. This type of auditing of changes is typically reserved for very few dimensions where change history tracking is critical. For example if your product prices tend to change over time and you must compare revenues based on old and new price then you might wish to use this method. Type 3 of maintaining SCD carries much overhead and is not used very frequently. After type 3 change the warehouse will still have a single record shown below:

Customer_key     Customer_name     Customer_city     Customer_state     Change_date     Old_name
123     Mrs.Green     Houston     Texas     1/1/2005     Ms.Brown


Most environments that use type 3 slowly changing dimensions track 2 iterations of changes; that is you would record the current version, previous version and version before last for the changed column.



Type 4 is a special way of handling type 2 changes. Type 2 usually tracks unlimited history by creating new records each time the change occurs. If you know how far back you wish to track history you can add logic to your code to delete any records that go too far back in history. For example, suppose we created an exam and offered it to our students. As time goes on we might decide to edit the exam. We could manage these edits using type 2 method, but if we accumulate 10 versions of the same exam the data might get confusing. So we might decide that we only store up-to 9 records per each exam; any time 10th version of an exam is created we'll delete the 1st version from the warehouse.


SOURCE:http://sqlserverpedia.com/wiki/Data_Warehousing_-_Slowly_Changing_Dimensions
newer post

Slowly Changing Dimensions SCD in Dimensional Modeling

0 comments
Slowly Changing Dimensions

Entities change over time. Customer demographics, product characteristics, classification rules, status of customers etc. lead to changes in the attributes of dimensions. In a transaction system, many a times the change is overwritten and track of change is lost.

For example a source system may have only the latest customer PIN Code, as it is needed to send the marketing and billing statements. However, a data warehouse needs to maintain all the previous PIN Codes as well, because we need to track on how many customers move to new locations over what frequency.

A key benefit for Data Warehouse is to provide historical information, which is typically over-written (and thus lost)in the transaction systems. How to handle slowly changing dimensions in a Dimensional Model is a key determinant to that benefit.
There are three ways to handle the same:
Slowly Changing Dimension method 1 (In short SCD 1)

The way most of the source systems will handle it- Overwrite the attribute value. For example if a customer’s marital status has moved from 'Unmarried' to 'Married', we over-write 'unmarried' to 'Married'. Similarly, if an insurance policy status has moved from 'Lapsed' to 'Re-instated' the new status is over written on the old status. This is obviously done, when we are not analyzing the historical information.
Slowly Changing Dimension Method 2 (in short SCD 2)

This is the true-blue technique to deliver precise historical analysis. This is used, when there is more than one change in the attributes of an entity, and we need to track the date of change of the attribute.

In this method, a new record is added whereby the new record is given a separate identifier as the primary key. We cannot use the production key as the primary key here as it has not changed (Customer ID has remained the same, while the value of its attribute 'marital status' has changed). This new identifier is called the surrogate key.

Apart from adding a new record and providing a new primary (surrogate) key, the validity period for this new record is also added.

For example- You have a dimensional table with customer_ID '110002' with marital status as 'single'. Overtime, customer gets married and also moved to a new location. The customer dimension record will be:
Surrogate Key     Customer ID     Date Valid     Marital Status     Date of Birth     City
1100021     110002     Sept 23, 2004     Single     Jan8, 1982     Palo Alto
1100022     110002     Oct 25, 2005     Married     Jan8, 1982     Palo Alto
1100023     110002     Nov 23, 2005     Married     Jan8, 1982     San Francisco
Slowly changing dimension method 3 (SCD 3)

This is a mid-way between method 1 and method 2. Here we don’t add an additional record, but add a new field 'old attribute value'. However, this has limitations. This method has to know from the beginning on what attributes will change. This is because a new field/attribute has to be added in the design for every attribute, which can change. Secondly, attribute can change maximum once in the lifetime of the entity OR at least the lifetime of the data warehouse.
Surrogate Key     Customer ID     Marital Status     Date of Birth     City     Marital Status Old     City Old
1100021     110002     Married     Jan8, 1982     San Francisco     Single     Palo Alto

NOTE – The term of 'Slowly changing dimension' is used because of it being a universally acknowledged term. However, the same methods will apply to fast changing dimensions as well.
Surrogate Keys as Primary keys of dimension tables

There is a best practice in dimensional model design to not to use the production primary key as the primary key for the dimension table. This goes against conventional logic, but has a reason.

Data Warehouse has a core need for maintaining historical information and how an entity has moved and changed shape through the passage of time. Typically Source Systems need for this kind of information is quite less. In case of historical tracking in source systems, these systems can have the luxury of using multiple-field primary key (including the key identifier of the entity plus date stamp). For example if an insurance policy is lapsed and after two months it becomes reinstated, one can use the primary key as a combination of Policy number+ date/time +the status in the 'policy history table'. However, Data Warehouse doesn’t recommend the luxury of using multiple field primary key in dimension table.

Therefore, the concept 'surrogate Key' comes into play where the primary key is not the production key, but a key generated by the system. The production key is also used as an attribute within the same dimension table.
The situations/reasons on when a surrogate key is used:

    * 'Slowly changing dimensions'
    * When the primary key itself is repeated.
    * When there is a multiple field primary key. Dimension model typically does not use multiple field primary key to link to the fact table.

Therefore it is always recommended to use surrogate keys. it is difficult to find the organizations, which will not face the situations as highlighted above. If there are, they could as well manage their needs using excel and pivot tables.

SOURCE:http://www.executionmih.com/data-warehouse/slowly-changing-dimension-SCD.php
newer post

Slowly changing dimension

0 comments
Dimension is a term in data management and data warehousing that refers to logical groupings of data such as geographical location, customer information, or product information. Slowly Changing Dimensions (SCDs) are dimensions that have data that changes slowly, rather than changing on a time-based, regular schedule.[1]

For example, you may have a dimension in your database that tracks the sales records of your company's salespeople. Creating sales reports seems simple enough, until a salesperson is transferred from one regional office to another. How do you record such a change in your sales dimension?

You could sum or average the sales by salesperson, but if you use that to compare the performance of salesmen, that might give misleading information. If the salesperson that was transferred used to work in a hot market where sales were easy, and now works in a market where sales are infrequent, her totals will look much stronger than the other salespeople in her new region, even if they are just as good. Or you could create a second salesperson record and treat the transferred person as a new sales person, but that creates problems also.

Dealing with these issues involves SCD management methodologies referred to as Type 0 through 6. Type 6 SCDs are also sometimes called Hybrid SCDs.

Type 0

The Type 0 method is a passive approach to managing dimension value changes, in which no action is taken. Values remain as they were at the time the dimension record was first entered. In certain circumstances historical preservation with a Type 0 SCD may occur. But, higher order SCD types are often employed to guarantee history preservation, whereas Type 0 provides the least control or no control over managing a slowly changing dimension.

The most common slowly changing dimensions are Types 1, 2, and 3.
 Type 1

The Type 1 methodology overwrites old data with new data, and therefore does not track historical data at all. This is most appropriate when correcting certain types of data errors, such as the spelling of a name. (Assuming you won't ever need to know how it used to be misspelled in the past.)

Here is an example of a database table that keeps supplier information:
Supplier_Key     Supplier_Code     Supplier_Name     Supplier_State
123     ABC     Acme Supply Co     CA

In this example, Supplier_Code is the natural key and Supplier_Key is a surrogate key. Technically, the surrogate key is not necessary, since the table will be unique by the natural key (Supplier_Code). However, the joins will perform better on an integer than on a character string.

Now imagine that this supplier moves their headquarters to Illinois. The updated table would simply overwrite this record:
Supplier_Key     Supplier_Code     Supplier_Name     Supplier_State
123     ABC     Acme Supply Co     IL

The obvious disadvantage to this method of managing SCDs is that there is no historical record kept in the data warehouse. You can't tell if your suppliers are tending to move to the Midwest, for example. But an advantage to Type 1 SCDs is that they are very easy to maintain.

If you have calculated an aggregate table summarizing facts by state, it will need to be recalculated when the Supplier_State is changed.[1]
 Type 2

The Type 2 method tracks historical data by creating multiple records for a given natural key in the dimensional tables with separate surrogate keys and/or different version numbers. With Type 2, we have unlimited history preservation as a new record is inserted each time a change is made.

In the same example, if the supplier moves to Illinois, the table could look like this, with incremented version numbers to indicate the sequence of changes:
Supplier_Key     Supplier_Code     Supplier_Name     Supplier_State     Version
123     ABC     Acme Supply Co     CA     0
124     ABC     Acme Supply Co     IL     1

Another popular method for tuple versioning is to add effective date columns.
Supplier_Key     Supplier_Code     Supplier_Name     Supplier_State     Start_Date     End_Date
123     ABC     Acme Supply Co     CA     01-Jan-2000     21-Dec-2004
124     ABC     Acme Supply Co     IL     22-Dec-2004   

The null End_Date in row two indicates the current tuple version. In some cases, a standardized surrogate high date (e.g. 9999-12-31) may be used as an end date, so that the field can be included in an index, and so that null-value substitution is not required when querying.

Transactions that reference a particular surrogate key (Supplier_Key) are then permanently bound to the time slices defined by that row of the slowly changing dimension table. An aggregate table summarizing facts by state continues to reflect the historical state, i.e. the state the supplier was in at the time of the transaction; no update is needed.

If there are retrospective changes made to the contents of the dimension, or if new attributes are added to the dimension (for example a Sales_Rep column) which have different effective dates from those already defined, then this can result in the existing transactions needing to be updated to reflect the new situation. This can be an expensive database operation, so Type 2 SCDs are not a good choice if the dimensional model is subject to change.[1]
 Type 3

The Type 3 method tracks changes using separate columns. Whereas Type 2 had unlimited history preservation, Type 3 has limited history preservation, as it's limited to the number of columns we designate for storing historical data. Where the original table structure in Type 1 and Type 2 was very similar, Type 3 will add additional columns to the tables:
Supplier_Key     Supplier_Code     Supplier_Name     Original_Supplier_State     Effective_Date     Current_Supplier_State
123     ABC     Acme Supply Co     CA     22-Dec-2004     IL

Note that this record can not track all historical changes, such as when a supplier moves twice.

One version of this type is to create the field Previous_Supplier_State instead of Original_Supplier_State which will then track the most recent historical change.[1]
 Type 4

The Type 4 method is usually referred to as using "history tables", where one table keeps the current data, and an additional table is used to keep a record of some or all changes.

Following the example above, the original table might be called Supplier and the history table might be called Supplier_History.
Supplier Supplier_key     Supplier_Code     Supplier_Name     Supplier_State
123     ABC     Acme Supply Co     IL
Supplier_History Supplier_key     Supplier_Code     Supplier_Name     Supplier_State     Create_Date
123     ABC     Acme Supply Co     CA     22-Dec-2004

This method resembles how database audit tables and change data capture techniques function.
 Type 6 / Hybrid

The Type 6 method combines the approaches of types 1, 2 and 3 (1 + 2 + 3 = 6). One possible explanation of the origin of the term was that it was coined by Ralph Kimball during a conversation with Stephen Pace from Kalido. Ralph Kimball calls this method "Unpredictable Changes with Single-Version Overlay" in The Data Warehouse Toolkit[1].

The Supplier table starts out with one record for our example supplier:
Supplier_Key     Supplier_Code     Supplier_Name     Current_State     Historical_State     Start_Date     End_Date     Current_Flag
123     ABC     Acme Supply Co     CA     CA     01-Jan-2000     31-Dec-9999     Y

The Current_State and the Historical_State are the same. The Current_Flag attribute indicates that this is the current or most recent record for this supplier.

When Acme Supply Company moves to Illinois, we add a new record, as in Type 2 processing:
Supplier_Key     Supplier_Code     Supplier_Name     Current_State     Historical_State     Start_Date     End_Date     Current_Flag
123     ABC     Acme Supply Co     IL     CA     01-Jan-2000     21-Dec-2004     N
124     ABC     Acme Supply Co     IL     IL     22-Dec-2004     31-Dec-9999     Y

We overwrite the Current_State information in the first record (Supplier_Key = 123) with the new information, as in Type 1 processing. We create a new record to track the changes, as in Type 2 processing. And we store the history in a second State column (Historical_State), which incorporates Type 3 processing.

If our example supplier company were to relocate again, we would add another record to the Supplier dimension, and we would once again overwrite the contents of the Current_State column:
Supplier_Key     Supplier_Code     Supplier_Name     Current_State     Historical_State     Start_Date     End_Date     Current_Flag
123     ABC     Acme Supply Co     NY     CA     01-Jan-2000     21-Dec-2004     N
124     ABC     Acme Supply Co     NY     IL     22-Dec-2004     03-Feb-2008     N
125     ABC     Acme Supply Co     NY     NY     04-Feb-2008     31-Dec-9999     Y

Note that, for the current record (Current_Flag = 'Y'), the Current_State and the Historical_State are always the same.[1]
 Type 2 / Type 6 Fact Implementation
 Surrogate Key Alone

In many Type 2 and Type 6 SCD implementations, the surrogate key from the dimension is put into the fact table in place of the natural key when the fact data is loaded into the data repository.[1] The surrogate key is selected for a given fact record based on its effective date and the Start_Date and End_Date from the dimension table. This allows the fact data to be easily joined to the correct dimension data for the corresponding effective date.

Here is the Supplier table as we created it above using Type 6 methodology:
Supplier_Key     Supplier_Code     Supplier_Name     Current_State     Historical_State     Start_Date     End_Date     Current_Flag
123     ABC     Acme Supply Co     NY     CA     01-Jan-2000     21-Dec-2004     N
124     ABC     Acme Supply Co     NY     IL     22-Dec-2004     03-Feb-2008     N
125     ABC     Acme Supply Co     NY     NY     04-Feb-2008     31-Dec-9999     Y

The following SQL retrieves the correct Supplier Surrogate_Key for each Delivery fact record, based on the primary effective date, Delivery_Date:

SELECT
  supplier.supplier_key
FROM supplier
INNER JOIN delivery
  ON supplier.supplier_code = delivery.supplier_code
 AND delivery.delivery_date >= supplier.start_date
 AND delivery.delivery_date <= supplier.end_date

A fact record with an effective date of August 9, 2001 will be linked to Surrogate_Key 123, with a Historical_State of 'CA'. A fact record with an effective date of October 11, 2007 will be linked to Surrogate_Key 124, with a Historical_State of 'IL'.

Once the Delivery table contains the correct Supplier_Key, it can easily be joined to the Supplier table using that key. The following SQL retrieves, for each fact record, the correct supplier name and the state the supplier was located in at the time of the delivery:

SELECT
  delivery.delivery_cost,
  supplier.supplier_name,
  supplier.historical_state
FROM delivery
INNER JOIN supplier
  ON delivery.supplier_key = supplier.supplier_key

If you've utilized Type 6 processing for your dimension, then you can easily retrieve the state the company is currently located in, using the same Supplier_Key:

SELECT
  delivery.delivery_cost,
  supplier.supplier_name,
  supplier.current_state
FROM delivery
INNER JOIN supplier
  ON delivery.supplier_key = supplier.supplier_key

 Both Surrogate and Natural Key

An alternate implementation is to place both the surrogate key and the natural key into the fact table.[2] This allows the user to select the appropriate dimension records based on:

    * the primary effective date on the fact record (above),
    * the most recent or current information,
    * any other date associated with the fact record.

This method allows more flexible links to the dimension, even if you have used the Type 2 approach instead of Type 6.

Here is the Supplier table as we might have created it using Type 2 methodology:
Supplier_Key     Supplier_Code     Supplier_Name     Supplier_State     Start_Date     End_Date     Current_Flag
123     ABC     Acme Supply Co     CA     01-Jan-2000     21-Dec-2004     N
124     ABC     Acme Supply Co     IL     22-Dec-2004     03-Feb-2008     N
125     ABC     Acme Supply Co     NY     04-Feb-2008     31-Dec-9999     Y

The following SQL retrieves the most current Supplier_Name and Supplier_State for each fact record:

SELECT
  delivery.delivery_cost,
  supplier.supplier_name,
  supplier.supplier_state
FROM delivery
INNER JOIN supplier
  ON delivery.supplier_code = supplier.supplier_code
WHERE supplier.current_flag = 'Y'

If there are multiple dates on the fact record, the fact can be joined to the dimension using another date instead of the primary effective date. For instance, the Delivery table might have a primary effective date of Delivery_Date, but might also have an Order_Date associated with each record.

The following SQL retrieves the correct Supplier_Name and Supplier_State for each fact record based on the Order_Date:

SELECT
  delivery.delivery_cost,
  supplier.supplier_name,
  supplier.supplier_state
FROM delivery
INNER JOIN supplier
  ON delivery.supplier_code = supplier.supplier_code
 AND delivery.order_date >= supplier.start_date
 AND delivery.order_date <= supplier.end_date

Some cautions:

    * If the join query is not written correctly, it may return duplicate rows and/or give incorrect answers.

    * The date comparison might not perform well.

    * Some Business Intelligence tools do not handle generating complex joins well.

    * The ETL processes needed to create the dimension table needs to be carefully designed to ensure that there are no overlaps in the time periods for each distinct item of reference data.

newer post
newer post older post Home