Wednesday, February 16, 2011

Should You Use An ETL Tool?

The Extract, Transformation, and Load (ETL) system is the most time-consuming and expensive part of building a data warehouse and delivering business intelligence to your user community. A decade ago the majority of ETL systems were hand-crafted, but the market for ETL software has steadily grown and the majority of practitioners now use ETL tools in place of hand-coded systems.

Does it make sense to hand-code an ETL system in 2008, or is an ETL tool a better choice? Kimball Group now generally recommends using an ETL tool, but a custom-built approach can still make sense. This article summarizes the advantages and disadvantages of ETL tools and offers advice on making the choice that's right for you.

ADVANTAGES OF ETL TOOLS

Visual flow and self-documentation. The single greatest advantage of an ETL tool is that it provides a visual flow of the system's logic. Each ETL tool presents these flows differently, but even the least-appealing of these user interfaces compares favorably to custom systems consisting of stored procedures, SQL and operating system scripts, and a handful of other technologies. Ironically, some ETL tools have no practical way to print the otherwise-attractive self-documentation.

Structured system design. ETL tools are designed for the specific problem of populating a data warehouse. Although they are only tools, they do provide a metadata-driven structure to the development team. This is particularly valuable for teams building their first ETL system.

Operational resilience. Many of the home-grown ETL systems I've evaluated are fragile: they have too many operational problems. ETL tools provide functionality and practices for operating and monitoring the ETL system in production. You can certainly design and build a well-instrumented hand-coded ETL application, and ETL tools' operational features have yet to fully mature. Nonetheless, it's easier for a data warehouse / business intelligence team to build a resilient ETL system on top of the management features of an ETL tool.

Data-lineage and data-dependency functionality. We would like to be able to right-click on a number in a report and see exactly how it was calculated, where the data was stored in the data warehouse, how it was transformed, when the data was most recently refreshed, and what source system or systems underlay the numbers. Dependency is the flip side of lineage: we'd like to look at a table or column in the source system and know which ETL modules, data warehouse tables, OLAP cubes, and user reports might be affected by a structural change. In the absence of ETL standards that hand-coded systems could conform to, we must rely on ETL tool vendors to supply this functionality — though, unfortunately, few have done so to date.

Advanced data cleansing functionality. Most ETL systems are structurally complex, with many sources and targets. At the same time, requirements for transformation are often fairly simple, consisting primarily of lookups and substitutions. If you have a complex transformation requirement, for example if you need to de-duplicate your customer list, you should use a specialized tool. Most ETL tools either offer advanced cleansing and de-duplication modules (usually for a substantial additional price) or they integrate smoothly with other specialized tools. At the very least, ETL tools provide a richer set of cleansing functions than are available in SQL.
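
To make the "lookups and substitutions" point concrete, here is a minimal sketch of that kind of cleansing step in plain Python; the column names and state mappings are invented for illustration and are not drawn from any particular tool.

    # A minimal lookup-and-substitution cleansing step (invented column names and mappings).
    STATE_LOOKUP = {"calif.": "CA", "california": "CA", "tex.": "TX", "texas": "TX"}

    def cleanse(row):
        """Standardize a couple of columns with simple lookups and substitutions."""
        state = row.get("state", "").strip().lower()
        row["state"] = STATE_LOOKUP.get(state, state.upper())  # substitute known variants
        row["email"] = row.get("email", "").strip().lower()    # normalize case and whitespace
        return row

    print(cleanse({"state": "Calif.", "email": " Jane@Example.COM "}))
    # {'state': 'CA', 'email': 'jane@example.com'}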

Performance. You might be surprised that performance is listed last under the advantages of the ETL tools. It's possible to build a high-performance ETL system whether you use a tool or not. It's also possible to build an absolute dog of an ETL system whether you use a tool or not. I've never been able to test whether an excellent hand-coded ETL system outperforms an excellent tool-based ETL system; I believe the answer is that it's situational. But the structure imposed by an ETL tool makes it easier for an inexperienced ETL developer to build a quality system.

DISADVANTAGES OF ETL TOOLS

Software licensing cost. The greatest disadvantage of ETL tools in comparison to hand-crafted systems is the licensing cost for the ETL tool software. Costs vary widely in the ETL space, from several thousand dollars to hundreds of thousands of dollars.

Uncertainty. We've spoken with many ETL teams that are uncertain – and sometimes misinformed – about what an ETL tool will do for them. Some teams under-value ETL tools, believing they are simply a visual way to connect SQL scripts together. Other teams unrealistically over-value ETL tools, imagining that building the ETL system with such a tool will be more like installing and configuring software than developing an application.

Reduced flexibility. A tool-based approach limits you to the tool vendor's abilities and scripting languages.

Build a Solid Foundation

There are some over-arching themes in successful ETL system deployments regardless of which tools and technologies are used. Most important — and most frequently neglected — is the practice of designing the ETL system before development begins. Too often we see systems that just evolved without any initial planning. These systems are inefficient and slow, they break down all the time, and they're unmanageable. The data warehouse team has no idea how to pinpoint the bottlenecks and problem areas of the system. A solid system design should incorporate the concepts described in detail in Kimball University: The Subsystems of ETL Revisited, by Bob Becker.

Good ETL system architects will design standard solutions to common problems such as surrogate key assignment. Excellent ETL systems will implement these standard solutions most of the time but offer enough flexibility to deviate from those standards where necessary. There are usually half a dozen ways to solve any ETL problem, and each one may be the best solution in a specific set of circumstances. Depending on your personality and fondness for solving puzzles, this can be either a blessing or a curse.
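
As one example of such a standard solution, the sketch below shows a common surrogate key assignment pattern in Python: look the incoming natural key up in the dimension's key map and mint a new surrogate key only when that natural key hasn't been seen before. The function and column names are hypothetical, not taken from any particular ETL tool.

    # Minimal sketch of surrogate key assignment (hypothetical names, not from any ETL tool).
    from itertools import count

    def assign_surrogate_keys(rows, key_map, natural_key="customer_id"):
        """Attach a surrogate key to each row, minting new keys for unseen natural keys."""
        next_key = count(start=max(key_map.values(), default=0) + 1)
        for row in rows:
            nk = row[natural_key]
            if nk not in key_map:              # first time this natural key appears
                key_map[nk] = next(next_key)   # mint a new surrogate key
            row["customer_key"] = key_map[nk]  # tag the row with the surrogate key
            yield row

    existing = {"C-100": 1, "C-101": 2}        # natural key -> surrogate key already in the dimension
    incoming = [{"customer_id": "C-101"}, {"customer_id": "C-202"}]
    print(list(assign_surrogate_keys(incoming, existing)))
    # [{'customer_id': 'C-101', 'customer_key': 2}, {'customer_id': 'C-202', 'customer_key': 3}]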

One of the rules you should try to follow is to write data as seldom as possible during the ETL process. Writing data, especially to the relational database, is one of the most expensive tasks that the ETL system performs. ETL tools contain functionality to operate on data in memory and guide the developer along a path to minimize database writes until the data is clean and ready to go into the data warehouse table. However, the relational engine is excellent at some tasks, particularly joining related data. There are times when it is more efficient to write data to a table, even index it, and let the relational engine perform a join than it is to use the ETL tool's lookup or merge operators. We usually want to use those operators, but don't overlook the powerful relational database when trying to solve a thorny performance problem.
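
The following toy sketch illustrates the second point, using SQLite purely as a stand-in relational engine (the table and column names are hypothetical): the incoming rows are staged in a table and the database performs a single set-based join instead of a row-by-row lookup in the ETL layer.

    # Toy illustration: stage rows in a table and let the relational engine do the join.
    # SQLite stands in for the warehouse database; table and column names are hypothetical.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dim_product (product_key INTEGER, sku TEXT)")
    conn.execute("CREATE TABLE stage_sales (sku TEXT, amount REAL)")
    conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(1, "A-1"), (2, "B-2")])
    conn.executemany("INSERT INTO stage_sales VALUES (?, ?)", [("A-1", 10.0), ("B-2", 7.5)])

    # One set-based join instead of a row-by-row lookup in the ETL layer.
    rows = conn.execute("""
        SELECT d.product_key, s.amount
        FROM stage_sales s
        JOIN dim_product d ON d.sku = s.sku
        ORDER BY d.product_key
    """).fetchall()
    print(rows)  # [(1, 10.0), (2, 7.5)]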

Whether your ETL system is hand-coded or tool-based, it's your job to design the system for manageability, auditability, and restartability. Your ETL system should tag all rows in the data warehouse with some kind of batch identifier or audit key that describes exactly which process loaded the data. Your ETL system should log information about its operations, so your team can always know exactly where the process is now and how long each step is expected to take. You should build and test procedures for backing out a load, and, ideally, the system should roll back transactions in the event of a midstream failure. The best systems monitor data health during extraction and transformation, and they either improve the data or issue alerts if data quality is substandard. ETL tools can help you with the implementation of these features, but the design is up to you and your team.
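
Here is a minimal sketch of the batch-identifier idea, with hypothetical names: every row loaded in one run is stamped with the same audit key, and the run itself is logged so the team can see which process loaded which rows and how long it took.

    # Sketch of stamping loaded rows with an audit/batch key (hypothetical schema and names).
    import logging
    import time
    import uuid

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("etl")

    def load_batch(rows, target):
        """Append rows to the target, stamping each with the batch's audit key, and log the run."""
        audit_key = uuid.uuid4().hex           # one identifier for the whole load
        started = time.time()
        log.info("batch %s started, %d rows", audit_key, len(rows))
        for row in rows:
            row["audit_key"] = audit_key       # every row records which process loaded it
            row["load_ts"] = started
            target.append(row)
        log.info("batch %s finished in %.2fs", audit_key, time.time() - started)
        return audit_key

    warehouse_fact = []
    load_batch([{"order_id": 1, "amount": 25.0}], warehouse_fact)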

Should you use an ETL tool? Yes. Do you have to use an ETL tool? No. For teams building their first or second ETL system, the main advantages of visual tools are self-documentation and a structured development path. For neophytes, these advantages are worth the cost of the tool. If you're a seasoned expert — perhaps a consultant who has built dozens of ETL systems by hand — it's tempting to stick to what has worked well in the past. With this level of expertise, you can probably build a system that performs as well, operates as smoothly, and perhaps costs less to develop than a tool-based ETL system. But many seasoned experts are consultants, so you should think objectively about how maintainable and extensible a hand-crafted ETL system might be once the consultant has moved on.

Don't expect to reap a positive return on investment in an ETL tool during the development of your first system. The advantages will come as that first phase moves into operation, as it's modified over time, and as your data warehouse grows with the addition of new business process models and associated ETL systems.

SOURCE:http://www.informationweek.com/news/software/bi/showArticle.jhtml?articleID=207002081&pgno=3&queryText=&isPrev=

k-means Clustering

k-means clustering is a data mining/machine learning algorithm used to cluster observations into groups of related observations without any prior knowledge of those relationships. The k-means algorithm is one of the simplest clustering techniques and it is commonly used in medical imaging, biometrics and related fields.
The k-means Algorithm
The k-means algorithm is an iterative algorithm that gains its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It then assigns each observation to a cluster based upon the observation's proximity to the mean of the cluster. The cluster's mean is then recomputed and the process begins again. Here's how the algorithm works (a short code sketch follows these steps):

   1. The algorithm arbitrarily selects k points as the initial cluster centers (“means”).
   2. Each point in the dataset is assigned to the closest cluster, based upon the Euclidean distance between each point and each cluster center.
   3. Each cluster center is recomputed as the average of the points in that cluster.
   4. Steps 2 and 3 repeat until the clusters converge. Convergence may be defined differently depending upon the implementation, but it normally means that either no observations change clusters when steps 2 and 3 are repeated or that the changes do not make a material difference in the definition of the clusters.
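
The sketch below is a bare-bones Python implementation of those four steps on 2-D points, using Euclidean distance; it is meant only to make the loop concrete, not to stand in for a production clustering library.

    # Bare-bones k-means on 2-D points, following steps 1-4 above (illustrative only).
    import math
    import random

    def kmeans(points, k, max_iters=100):
        centers = random.sample(points, k)                  # step 1: arbitrary initial centers
        for _ in range(max_iters):
            clusters = [[] for _ in range(k)]
            for p in points:                                # step 2: assign each point to the
                i = min(range(k), key=lambda c: math.dist(p, centers[c]))  # closest center
                clusters[i].append(p)
            new_centers = [                                 # step 3: recompute each center as
                tuple(sum(xs) / len(cluster) for xs in zip(*cluster))      # the mean of its points
                if cluster else centers[i]
                for i, cluster in enumerate(clusters)
            ]
            if new_centers == centers:                      # step 4: stop once nothing moves
                break
            centers = new_centers
        return centers, clusters

    points = [(1, 1), (1.5, 2), (0.5, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
    centers, clusters = kmeans(points, k=2)
    print(centers, clusters)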

Choosing the Number of Clusters
One of the main disadvantages to k-means is the fact that you must specify the number of clusters as an input to the algorithm. As designed, the algorithm is not capable of determining the appropriate number of clusters and depends upon the user to identify this in advance. For example, if you had a group of people that were easily clustered based upon gender, calling the k-means algorithm with k=3 would force the people into three clusters, when k=2 would provide a more natural fit. Similarly, if a group of individuals were easily clustered based upon home state and you called the k-means algorithm with k=20, the results might be too generalized to be effective.
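
To see the effect of a poor choice of k, the snippet below reuses the hypothetical kmeans() sketch from earlier in this post on data that naturally forms two groups, once with k=2 and once with k=3. The exact assignments vary with the random initial centers, but with k=3 the algorithm has no choice but to split one of the natural groups.

    # Reuses the kmeans() sketch above; illustrates forcing too many clusters.
    data = [(1, 1), (1.2, 0.8), (0.9, 1.1), (8, 8), (8.2, 7.9), (7.8, 8.1)]

    for k in (2, 3):
        centers, clusters = kmeans(data, k)
        print(f"k={k}: cluster sizes = {[len(c) for c in clusters]}")

    # With k=2 the two natural groups can be recovered; with k=3 one of them
    # is split arbitrarily, because the algorithm must return exactly k clusters.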

SOURCE:http://databases.about.com/od/datamining/a/kmeans.htm

Kimball University: Industry Standard Data Models Fall Short

Industry-standard data models are an appealing concept at first blush, but they aren't the time savers they are cracked up to be. What's more, these prebuilt models may inhibit data warehouse project success.

Vendors and proponents argue that standard, prebuilt models allow for more rapid, less risky implementations by reducing the scope of the data model design effort.

Every manufacturer takes orders and ships products to fulfill the orders. Every insurance company sells policies and processes claims. Every transportation company moves cargo between an origin and a destination. The list goes on across other industries.

Why bother "recreating the wheel" by designing custom data models to support these common business processes when you can buy an industry-standard model instead?

Yes, most businesses in a given industry perform common functions. But if everyone's approach to these business functions were so similar, then why are there so many alternative organizations? Don't most organizations do things slightly differently than their industry peers? And if so, how do these "competitive advantages" get addressed by a pre-defined industry model?

True business intelligence requires the injection of an organization's own intellectual capital. Would you really want to use the identical industry solution as your peers?

In virtually every data warehouse design and delivery project, the vocabulary used by the operational source system's data model needs to be translated into business vernacular. Some might argue that the source system speaks "geek" rather than Greek. Embracing an industry-standard model introduces the need for yet another pocket dictionary.

First, the data in the source system's language needs to be translated and transformed into the industry model's generic language. This is no small feat; while some data will translate without too much compromise, other data will need to be wrestled and coerced into the pre-defined model and invariably some source data just won't fit.

Once the source has been manipulated into the model's supposedly universal language, the data then needs to go through a second translation so that the vocabulary used in the final presentation layer makes sense to the business users. The challenges surrounding these multiple transformations, and the opportunity to lose something in the translations between three languages -- source system, industry model, and business usage -- are extensive but often overlooked when considering a standardized model.

Of course, the transformation effort is less onerous if the source data capture system and industry model are supplied by the same vendor. But there are still some sharp corners lurking even in this scenario. First, you'll need to incorporate any custom extensions or flex field data elements from your implementation into the vendor's generic model. Second, you'll need to worry about the integration of any source data that's outside the vendor's domain.

Can you readily conform the industry model's dimensions with other internally available master data? If not, the industry model is destined to become another isolated stovepipe dataset. Clearly, this outcome is unappealing, but it may be less of an issue if all your operational systems are supported by the same ERP vendor or you’re a very small organization without an IT shop doing independent development.

What can you realistically expect to gain from an industry-standard model? A pre-built generic model can help identify core business processes and associated common dimensions for a given industry. That provides some comfort for data warehouse project teams feeling initially overwhelmed by design tasks.

However, is this knowledge worth six figures? Alternatively, you could likely gain this same insight by spending a few weeks with the business users; you'd not only improve your understanding of the business's needs, but you'd also begin "bonding" business users to the DW/BI initiative.

Ultimately, the business's ownership and stewardship of the effort are critical to its long term success. Even if you buy an industry-standard model, you're still going to need to spend time with the business to understand the final translations required to address their requirements. There's no dodging this necessary step. In Kimball Group's experience, after a few days or weeks studying the standard model, most teams typically gain enough confidence to want to customize the schema extensively for "their data."

It's also worth mentioning that just because you spend thousands of dollars on a standard model doesn't mean it exhibits generally-accepted dimensional modeling best practices. Unfortunately, some pre-built models embody common dimensional modeling design flaws; this isn't surprising if the model's designers have focused more on best practices for source system data capture rather than those required for business reporting and analytics.

SOURCE:http://www.informationweek.com/news/showArticle.jhtml?articleID=227400287&pgno=2&queryText=&isPrev=