Monday, January 3, 2011

Data Warehouse Design Strategies

To build an effective data warehouse, it is important for you to understand data warehouse design principles. If your data warehouse is not built correctly, you can run into a number of different problems.

Building an effective data warehouse rests on a handful of design principles, some organizational and some technical. First off, it is important that you and your organization understand the importance of having a data warehouse. If workers feel that a data warehouse is unnecessary, they may not use it, and this can cause conflicts. Everyone in your organization should understand the value of using the system.

Once you have your colleagues behind the concept of using a data warehouse, the next thing to focus on is data integrity. Avoid designing a data warehouse that loads inconsistent data, and avoid building a database that needlessly replicates data. The goal of your organization should be to integrate data and to create standards that will actually be used and followed. After data integrity, the next thing to look at is implementation efficiency. This basically means designing a system that is simple to use. It doesn't matter how well designed your data warehouse is if your workers have a hard time using it.

If your workers have a hard time using the data warehouse, it will slow down the speed and productivity of your operation. When it comes to creating a data warehouse, make it as simple as possible; all of your workers should be able to use it without problems. Implementation efficiency leads naturally to the next principle, user friendliness. This matters because end users will not use a system that is too difficult to learn, so keep them in mind and choose a design that is friendly and easy to pick up.

Once you have designed a data warehouse that is user friendly, the next thing to look at is operational efficiency. Once the data warehouse has been created, it should be able to carry out operations quickly, and it should not suffer from errors or other technical problems. When errors or technical problems do occur, they should be simple to fix. You will also want to look at the cost of supporting the system, and keep those costs as low as possible.

The design principles discussed so far are more related to business than to information technology. However, there are a number of IT design principles that you will want to follow as well. One of these is scalability, an issue that many data warehouse designers run into. The best way to deal with it is to make the data warehouse scalable from the beginning: design it so that it can support expansions or upgrades and adapt to a number of different business situations. The best data warehouses are those which are scalable.

The data warehouse that you design should also follow your organization's information technology standards, and every tool you use to build it should work well with those standards. Make sure it is designed in a way that makes it easier for your workers to use. While following the guidelines in this article won't guarantee success, it will greatly tip the odds in your favor. Be wary of companies that promise perfect results if you use their design methods: no matter how well designed your data warehouse is, you will always run into problems. Following the right principles, however, will make those problems easier to recognize and solve.

When it comes to running a data warehouse, it is not a matter of "if" you will run into problems; it is a matter of "how" and "when." When your data warehouse is well designed, you will be better equipped to solve any problems you encounter.

SOURCE:http://www.exforsys.com/tutorials/data-warehousing/data-warehouse-design-strategies.html


Data Warehousing Methods

Most organizations agree that data warehouses are a useful tool. They benefit from the ability to store and analyze data, which helps them make sound business decisions. It is also important to make sure the correct information is published and that it is easy to access for the people responsible for making decisions.

There are two elements that make up the data warehouse environment: the staging area and the presentation area. The staging area, also known as the acquisition area, is where the ETL operations take place; once the data has been prepared there, it is sent to the presentation area.
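
As a rough illustration of that flow, here is a minimal sketch of a single load pass in SQL. All of the table names (src_orders, stg_orders, presentation_orders) are hypothetical, and the transformation step is deliberately trivial.

-- 1. Acquisition/staging: pull the raw rows from the source system as-is.
INSERT INTO stg_orders (order_id, customer_code, order_amount, order_date)
SELECT order_id, customer_code, order_amount, order_date
FROM   src_orders;

-- 2. Prepare the data inside staging (cleanse and standardize).
UPDATE stg_orders
SET    customer_code = UPPER(TRIM(customer_code));

-- 3. Publish the prepared rows to the presentation area.
INSERT INTO presentation_orders (order_id, customer_code, order_amount, order_date)
SELECT order_id, customer_code, order_amount, order_date
FROM   stg_orders;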

Once the data is placed within the presentation area, a number of programs will analyze and review it. While many organizations agree on the overall goal of data warehouses, the approaches to building them differ. Using data marts alone is not a good approach, because they are geared towards individual departments; it is also inefficient and leads to long-term problems. Two techniques for building data warehouses have become very popular: the Kimball Bus Architecture and the Corporate Information Factory (CIF).

With the Kimball technique, the raw data is transformed and refined within the staging area, and it is important to make sure the data is handled properly during this step. During the staging process, the raw data is pulled from the source systems; some of the staging processes may be centralized, while others are distributed. The presentation area has a dimensional structure. This model holds the same information as a normalized model, but it is easier to use and presents the information in summarized form.
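
As a rough picture of what a dimensional presentation area looks like, here is a minimal star-schema sketch: one fact table carrying the measures, surrounded by dimension tables. The table and column names are hypothetical, not taken from either methodology's literature.

-- Dimension tables hold the descriptive context.
CREATE TABLE dim_product (
    product_key   INTEGER PRIMARY KEY,   -- surrogate key assigned during ETL
    product_name  VARCHAR(100),
    product_type  VARCHAR(50)
);

CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,   -- e.g. 20110103
    calendar_date DATE,
    month_name    VARCHAR(20),
    year_number   INTEGER
);

-- The fact table holds the measurable attributes of each transaction.
CREATE TABLE fact_sales (
    date_key      INTEGER REFERENCES dim_date (date_key),
    product_key   INTEGER REFERENCES dim_product (product_key),
    quantity      INTEGER,
    sale_amount   DECIMAL(12,2)
);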

A dimensional model is organized around a business process, not around the departments that make up the organization. Once the dimensional warehouse is populated, it does not depend on the various departments that may compose the organization, and when the business processes have been built out within the warehouse, the system becomes highly efficient. The next popular data warehouse approach you will want to become familiar with is the Corporate Information Factory, also known as the EDW (enterprise data warehouse) approach. In this approach, the data extracted from the source systems is coordinated and integrated.

Within the CIF, a central data warehouse holds the data repositories, and there may also be specific warehouses designed for data mining. The data marts may be built for specific departments, and they may hold summary data in the form of a dimensional structure, while the atomic data is obtained from the central warehouse. While there are some similarities between these two techniques, there are notable differences as well.

One of the primary differences between the two techniques is the normalized data foundation. With the Kimball approach, the data structures built before the dimensional presentation depend on the source data and the transformations required, and in most cases the data does not have to be stored twice, in both a normalized foundation and a dimensional one. Many of the people who choose a normalized data structure believe that it is faster than the dimensional structure, but they often fail to take the ETL work into consideration.

Another thing that separates the two approaches is the management of atomic data. With the CIF, atomic data is stored within a normalized data warehouse. In contrast, the Kimball method states that the atomic data should be placed within a dimensional structure, where it can be summarized in a wide variety of ways.
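
For example, if the atomic sales rows sit in a dimensional structure like the hypothetical star schema sketched above, the same detail can be rolled up along whichever dimension a question calls for:

-- Total sales by product type.
SELECT p.product_type, SUM(f.sale_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_product p ON p.product_key = f.product_key
GROUP  BY p.product_type;

-- Units sold by month, from the very same atomic rows.
SELECT d.year_number, d.month_name, SUM(f.quantity) AS units_sold
FROM   fact_sales f
JOIN   dim_date d ON d.date_key = f.date_key
GROUP  BY d.year_number, d.month_name;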

It is important to make sure the information you store is detailed enough that users can ask relevant questions. Most users will not care about the details of one atomic transaction, but they may want a summary built from a large number of transactions, and they need the underlying detail to answer the questions that matter to them. The approach that you choose should be the one which best serves the needs of your company.

SOURCE:http://www.exforsys.com/tutorials/data-warehousing/data-warehousing-methods.html

Data Warehouse Tools

There are a number of important tools which are connected to data warehouses, and one of these is data aggregation. A data warehouse can be designed to store information based on a certain level of detail.

For example, you can store one row per transaction, or you can store only summarized data; summarization is a form of data aggregation. When data is summarized, queries run at a much faster rate. However, some of the information is lost during aggregation, and that information may be important for solving a certain problem.

Before you decide which level of detail to store, it is important to weigh your options carefully, because once the warehouse has been built at a given level of detail, you will need to rebuild it in order to undo that choice. The safest way to handle this situation is to construct the data warehouse with a large amount of detail, although the cost of doing so can be huge depending on the storage options you choose. Once you have filled your data warehouse with important information, you will want to use this data to help you make smart investment decisions. The tools that allow you to do this fall under a topic called business intelligence.
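
As a concrete illustration of the trade-off, a warehouse can keep the transaction-level detail and build a pre-aggregated summary on top of it. This is a minimal sketch with hypothetical table and column names; the exact CREATE TABLE ... AS syntax varies by database.

-- Keep the detail, then pre-aggregate it so common queries run fast.
CREATE TABLE sales_daily_summary AS
SELECT store_id,
       product_id,
       sale_date,
       SUM(quantity)    AS total_quantity,
       SUM(sale_amount) AS total_amount
FROM   sales_transactions            -- one row per individual transaction
GROUP  BY store_id, product_id, sale_date;

-- The summary answers daily questions quickly, but per-transaction detail
-- (for example, which customer bought what) is no longer visible in it.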

Business intelligence is a very diverse field. It comprises things such as Executive Information Systems, Decision Support Systems, and On-Line Analytical Processing. One branch of business intelligence is multi-dimensional analysis tools, which allow a user to view data from a wide variety of angles. A query tool allows a user to run SQL queries against a warehouse to look for results. Data mining also falls under business intelligence, and it allows you to look for patterns and relationships within a data warehouse.

Another tool that is connected to data warehouses is data visualization. Data visualization tools present visual models of data, which can come in the form of intricate 3D images. The goal of data visualization is to let the user view trends in a way that is easier to understand than complicated statistical models. One technology that is helping this field advance is VRML, the Virtual Reality Modeling Language. For data warehouses to function properly, it is also important to place an emphasis on metadata management. Metadata can be described as "information about information."

Metadata must be managed whenever data is acquired or analyzed. It is held in a repository and can give you important information about many of the data warehouse tools. Managing metadata properly has become a discipline in itself, and a company that does it well can benefit greatly, because it allows the organization to analyze the changes that occur within database tables. Metadata management plays an important part in the construction of a data warehouse.

Data warehousing is a somewhat complicated field. Many vendors advertise data warehousing tools, but the cost and complexity of the products have kept them from being adopted by a large number of companies. Any company that is thinking of using data warehouses must take the time to review and understand the technology; it can only be useful if you know how to use it. Once you understand and acquire the technology, it is possible to gain a powerful advantage over your competitors, which is what has made data warehouses attractive to many companies.

One of the biggest advantages of data warehouses is that they let you store information you can use to improve your company's marketing strategies. Not only can you improve marketing strategies, but you can also make strategic decisions based on the information you have compiled and organized. With techniques such as data mining and data visualization, you will be able to discover important patterns that you didn't know existed, and those patterns can help your company earn large profits.

SOURCE:http://www.exforsys.com/tutorials/data-warehousing/data-warehouse-tools.html

Sunday, January 2, 2011

For every rule there is an exception; for each exception there are more exceptions…

Implementing an ETL process involves many steps. One such step is creating a mapping document, which describes the data mapping between the source systems and the target, along with the rules of data transformation.

Ex. table/column map between source and target, rules to identify unique rows, not-null attributes, unique values, ranges of attribute values, transformation rules, etc.


Without going into further details of the document, let's analyze the very next step. It seems obvious and natural to start development of the ETL process. The ETL developer is all fired up, comes up with a design document, and starts developing; in a few days' time the code is ready for data loading.
But unexpectedly (?) the code starts having issues every few days. Issues are found and fixed. And then it fails again. What’s happening? Analysis was done properly; rules were chalked out & implemented according to the mapping document. But why are issues popping up?  Was something missed?
Maybe not! Isn't it normal to have more issues early in the lifetime of a process?

Maybe yes! You have surely missed 'Source System Data Profiling'. The business analyst has told you the rules for how the data is structured in the source system and how it is supposed to behave, but he/she has not told you the 'ifs and buts', the EXCEPTIONS to those rules.

To be realistic, it is not possible for anyone to simply recite every rule and exception to you like a parrot. You have to collaborate and dig for the truth. The choice is yours: do data profiling on the source system and try to break every rule the analyst has given you, or wait for the process to go live and then wake up every night as the load fails. If you are lucky, you only have to deal with an unhappy user every morning you go to the office.

Make the right choice; don't skip 'source system data profiling' before writing a single line of code. Question every rule. Try to find exceptions to the rules. There may easily be 20 tables; each table will have on average 30 columns, and each column on average 100k values. Multiply tables by columns by data values and you get the number of ways your assumptions may be wrong. It is like unit testing the source data even before loading it. There is a reason why machines alone cannot do your job, and a reason why IT jobs pay better.
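
A hedged sketch of what such profiling can look like in SQL, run before any ETL code is written. The table and column names (src_customer, cust_status) are purely hypothetical stand-ins for whatever source table the mapping document describes.

-- Profile one source column: volume, nulls, distinct values, and range.
SELECT COUNT(*)                                              AS total_rows,
       SUM(CASE WHEN cust_status IS NULL THEN 1 ELSE 0 END) AS null_rows,
       COUNT(DISTINCT cust_status)                           AS distinct_values,
       MIN(cust_status)                                      AS min_value,
       MAX(cust_status)                                      AS max_value
FROM   src_customer;

-- List the actual values and their frequencies, to compare against what
-- the analyst says the column is allowed to contain.
SELECT cust_status, COUNT(*) AS row_count
FROM   src_customer
GROUP  BY cust_status
ORDER  BY row_count DESC;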

Remember, ‘for every rule there is an exception; for each exception there are more exceptions…’


Difference between Reference Data and Master Data

It is not unusual for people to use ‘Reference Data’ and ‘Master Data’ interchangeably without understanding the differences.
Let's try to understand the differences with the example of a sales transaction.

A sales transaction contains information like….
Store,
Products Sold,
Sales Person,
Store Name,
Sales Date,
Customer,
Price,
Quantity,
etc.

Attributes from the above example can be separated into two types: factual (transactional) information and dimensional information.
Price and Quantity are measurable attributes of a transaction.
Store, Products Sold, Sales Person, Store Name, Sales Date, and Customer are dimensional attributes of a transaction.

We can see that the dimensional data is already embedded in the transaction, and with these dimensional attributes we can successfully complete the transaction. Dimensional data that directly participates in a transaction is master data.

But is the list of dimensional attributes in the transaction complete?

Asking a few analytical questions can help us discover the answer.
     -What is the Male to Female ratio of customers doing purchase at the store?
     -What type of products are customers buying? Ex: Electronic, Computers, Toys
     -What type of Store is it?  Ex: Web store, Brick & Mortar, Telesales, Catalog Sales

The above questions cannot be answered by the attributes present in the transaction; this dimensional data is missing from it. Dimensional data that does not directly participate in the transaction, but describes attributes of a dimension, is reference data.
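
One way to picture the split is a hedged schema sketch; the table and column names below are hypothetical, chosen only to match the questions above. The dimension row the transaction points at is master data, while the descriptive classification it refers to is reference data.

-- Reference data: descriptive values that classify a dimension,
-- but never appear in the transaction itself.
CREATE TABLE gender_ref (
    gender_code CHAR(1) PRIMARY KEY,   -- 'M' / 'F'
    gender_desc VARCHAR(10)
);

-- Master data: the customer row the sales transaction directly refers to.
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,
    customer_name VARCHAR(100),
    gender_code   CHAR(1) REFERENCES gender_ref (gender_code)
);

-- The transaction itself carries only master data keys and measures.
CREATE TABLE sales_transaction (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer (customer_id),
    price       DECIMAL(10,2),
    quantity    INTEGER
);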

Why is it important for an ETL person to understand the difference? Well, 'Reference Data Management' (RDM) was the popular term for a while, and then in the last few years this new term 'Master Data Management' (MDM) suddenly appeared. The two terms mean different things, and they have significant implications for how the data is managed. But that will be a topic of discussion for some future post! I hope this article helps clear at least some of the confusion.

ETL delta logic & de-normalization of data model.


It is a normal practice in data warehousing to de-normalize (or, as autocorrect once put it, 'demoralize') the data model for performance. I am not going to discuss the benefits versus the issues of de-normalization; by the time the model reaches the ETL guy, its fate is already decided.

Let's look at the model on the source side, which is perfectly normalized.

[Figure: normalized source data model (employee table with department and designation parent tables)]

Now let's look at the de-normalized model on the target side.

[Figure: de-normalized target model (dim_employee with dept_name and emp_designation_desc)]

Next, let's think about the delta logic for loading the dim_employee table. Ideally you would only check for changes in the employee table: if there are any changes after the last load date-time, get those rows from ref_employee, look up the department and the designation, and load them into the target table.

The issue with this delta logic is that it does not consider the effect of de-normalizing the employee table on the target side. If you look carefully at the two de-normalized attributes, dept_name and emp_designation_desc, the ETL process will miss any changes in the parent tables: only new or updated employees will get the new definition of department and designation, and any employee that has not been updated on the source side will still carry the old dept_name and emp_designation_desc. This is wrong.

It is wrong because the ETL delta logic only picked up a row from the employee table when that row itself changed, and ignored the changes in the department and designation tables. The truth of the matter is: for any de-normalized target table, the affected rows should be re-captured from the source whenever there is a change in the driving/core table as well as whenever there is a change in any parent table the driving table refers to. In this case, even if the change is only in the department or designation table, all the affected rows in the employee table should be re-processed.

It might seem very simple, but ETL developers, designers, and modelers often miss this point, and once the process has been developed the mistake is very difficult to catch.

The next question is how you would catch the affected rows. There are ways to write SQL that combines the three tables (in this case), treats them as one single entity, and pulls rows based on any update_dttm greater than the last ETL run. Figure out the SQL…
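
For what it's worth, here is one hedged sketch of what that SQL could look like. The post only names ref_employee and dim_employee; the parent table names (ref_department, ref_designation), the key column names, and the :last_etl_run_dttm parameter are assumptions made for illustration.

-- Re-capture every employee row affected by a change in the driving table
-- OR in either parent table it refers to, since the last ETL run.
SELECT e.emp_id,
       e.emp_name,
       d.dept_name,
       g.emp_designation_desc
FROM   ref_employee    e
JOIN   ref_department  d ON d.dept_id        = e.dept_id
JOIN   ref_designation g ON g.designation_id = e.designation_id
WHERE  e.update_dttm > :last_etl_run_dttm
   OR  d.update_dttm > :last_etl_run_dttm
   OR  g.update_dttm > :last_etl_run_dttm;

-- Treating the three tables as one entity this way means a change in the
-- department or designation table pulls in all of its employees' rows,
-- so dim_employee gets refreshed even though those employees never changed.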