Tuesday, January 18, 2011

Simulation and performance analysis of distributed Internet systems using TCPNs (timed coloured Petri nets) (Report)

This paper presents a Timed Coloured Petri Nets based programming tool that supports modeling and performance analysis of distributed World Wide Web environments. A distributed Internet system model, initially described in compliance with Queueing Theory (QT) rules, is mapped onto the Timed Coloured Petri Net (TCPN) structure by means of queueing system templates. Then, it is executed and analyzed. The proposed distributed Internet systems modeling and design methodology has been applied for evaluation of several system architectures under different external loads.

Keywords: distributed systems modeling, performance analysis, timed coloured Petri nets

Povzetek (Slovenian abstract): A Petri net based tool for modelling Web environments is presented.

1 Introduction

One of the modern Internet (or Web) systems development approaches assumes that the systems consist of a set of distributed nodes. Dedicated groups of nodes are organized in layers (clusters) conducting predefined services (e.g. WWW service or database service) [2, 6, 8]. This approach makes it possible to easily scale the system. Additionally, a distributed structure of the system assures its higher dependability. Fig. 1 shows an example cluster-based Internet system structure. The Internet requests are generated by the clients. Then they are distributed by the load balancer among the set of computers that constitute the front-end or WWW cluster. The front-end cluster offers a system interface and some procedures that optimize the load of the next system layer, the database servers cluster. In the standard scenario the client produces the request by, e.g., filling out a form on the web page. Then the request is converted into, e.g., a SQL query and forwarded to the database layer. The result of the query is sent back to the front-end layer. Finally, the client receives the result of the request on the website.

Simultaneously, for a significant number of Internet applications some kind of soft real-time constraints are formulated: the applications should provide up-to-date data within set time frames [20]. Stock market or multimedia applications are good examples of hardware/software systems that may have such timing requirements.

The emergence of the development paradigms mentioned above makes the search for new methods of modeling and timing performance evaluation of distributed Internet systems an up-to-date research path.

One intensively investigated branch of Internet systems software engineering is the application of formal languages to modeling and performance analysis. Among the suggested solutions are: algebraic description [11], mapping through Queueing Nets (QNs) [8, 18], and modeling using both Coloured Petri Nets (CPNs) [12] and Queueing Petri Nets (QPNs) [6, 7].

[FIGURE 1 OMITTED]

Our approach proposed in this paper (1) may be treated as an extension of selected solutions summed up in [6, 7], where the Queueing Petri Nets (QPNs) language [1] has been successfully applied to web-cluster modeling and performance evaluation. The final QPN-based model can be executed and used for performance prediction of the modeled system. In our solution we propose alternative queueing system models, defined as in [3], expressed as Timed Coloured Petri Nets (TCPNs) [4]. These models have been used as a background for developing a programming tool that is able to map the timed behavior of queueing nets by means of simulation. Subsequently, we developed our own TCPN-based method of modeling and analysis of distributed Internet systems. Well-known software toolkits such as Design/CPN or CPN Tools can naturally be used for simulation and performance analysis of our models [10, 19]. The preliminary version of our software tool was announced in [13]; a more mature description can be found in [16].

The remaining work is organized as follows. In Section 2 we informally introduce the basic concepts of TCPNs, and in Section 3 we provide rules for mapping queueing systems into TCPNs. In the next section, we present a method of applying the TCPN-based queueing system models (TCPN templates) to distributed Internet system modeling. Section 5 focuses on the results of simulating some detailed Internet system models, while Section 6 sums up the paper and includes our future research plans.

2 Hierarchical Timed Coloured Petri Nets' Basic Concepts

As TCPNs are the main formal language exploited in the paper, we briefly introduce them here. The introduction is informal and focuses only on the most important TCPN features. More thorough informal introductions to TCPNs can be found in [5, 9]; detailed TCPN features and some applications are presented in [4].

Informally, a Timed Coloured Petri Net is a bipartite graph consisting of "place" nodes and "transition" nodes. The places, drawn as circles or ellipses, are used to represent conditions; the transitions, drawn as bars, are used to represent events.

The places can have some "tokens" associated with them. Each token is equipped with an attached data value, the token "colour". The data value may be arbitrarily complex (e.g. an integer number or a record of data). For each place, a colour set is defined which characterizes the acceptable colours of tokens in that place. All declarations concerning the behavior of the net and the colours of tokens are written in the CPN ML language. It is possible to define colours, variables, expressions and functions connected to the elements of the net. The distribution of tokens in the places is called a marking. The initial marking determines the initial state of the net and is specified by "initialization expressions". The marking of each place is a multi-set over the colour set attached to the place (compare [4]).
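To make the notions of colour, token, and marking a bit more tangible, here is a loose conceptual illustration in plain Python (not CPN ML); the place names and token colours are invented for the example.

    from collections import Counter

    # A marking assigns to each place a multi-set of coloured (time-stamped) tokens.
    # The place names and token colours below are made up for illustration.
    marking = {
        "ClientRequests": Counter({("GET /index.html", 0): 2,   # two tokens with this colour, timestamp 0
                                   ("POST /order", 5): 1}),     # one token arriving at model time 5
        "DatabaseQueue": Counter(),                             # a currently empty place
    }
    print(marking["ClientRequests"])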

Directed arcs (arrows) connect the places and transitions, with some arcs directed from the places to the transitions and other arcs directed from the transitions to …

Similarity measures for relational databases

The relational algebra (4; 15), a relational data model with five basic operations on relations, i.e., Cartesian product ×, projection π, selection σ, union ∪, and set difference −, and several additional operations such as θ-join or intersection, has three main advantages over non-relational data models (13) (a small sketch of two of the basic operations follows the list below):

--From the point of view of usability, the model has a simple interpretation in terms of real-world concepts, i.e., the essential data structure of the model is a relation, which can be visualized in a tabular format.

--From the point of view of applicability, the model is flexible and general, and can be easily adapted to many applications.

--From the point of view of formalism, the model is elegant enough to support extensive research and analysis.
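To make the tabular intuition concrete, here is a minimal sketch in plain Python of two of the basic operations, selection and projection, over an invented employees relation; it illustrates the idea only, not any particular database engine.

    # A relation as a set of tuples; the employees relation and its attributes are invented.
    employees = {
        ("Alice", "Sales", 50000),
        ("Bob",   "IT",    60000),
        ("Carol", "Sales", 55000),
    }

    def selection(relation, predicate):
        """sigma: keep only the tuples satisfying the predicate."""
        return {row for row in relation if predicate(row)}

    def projection(relation, positions):
        """pi: keep only the listed attribute positions (duplicates collapse, as in sets)."""
        return {tuple(row[i] for i in positions) for row in relation}

    print(selection(employees, lambda r: r[1] == "Sales"))   # sigma over dept = 'Sales'
    print(projection(employees, [0, 1]))                     # pi over name, dept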

Hence, relational data models have gained acceptance from a broad range of users, have gained popularity and credibility in a variety of application areas, and facilitate theoretical research into many fundamental issues arising from database query languages and dependency theory.

However, there are several applications that have evolved beyond the capabilities of traditional relational data models, such as applications that require databases to cooperate with the user by suggesting answers which may be helpful but were not explicitly asked for. The cooperative-behaviour or cooperative-answering techniques (5) may be differentiated into the following categories:

i.) consideration of specific information about a user's state of mind,

ii.) evaluation of presuppositions in a query,

iii.) detection and correction of misconceptions in a query,

iv.) formulation of intensional answers,

v.) generalization of queries and of responses.

The cooperative behaviour plays an important part, for instance, in information-providing dialogue systems (7), where the most vital cooperative-answering technique leading to user satisfaction is generalization of queries and of responses as shown by Hajdinjak and Mihelic (8). Generalization of queries and of responses, the aim of which is to capture possibly relevant information, is often achieved by query relaxation (6).

Another kind of application not suited to the traditional relational data models is one that requires the database to be enhanced with a notion of similarity that allows one to perform approximate searches (9). The goal in these applications is often one of the following:

i.) Find objects whose feature values fall within a given range or where the distance from some query object falls into a certain range (range queries).

ii.) Find objects whose features have values similar to those of a given query object or set of query objects (nearest neighbour queries and approximate nearest neighbour queries).

iii.) Find pairs of objects from the same set or different sets which are sufficiently similar to each other (closest pairs queries).

Examples of such approximate-matching or similarity-search applications are databases storing images, fingerprints, audio clips or time sequences, text databases with typographical or spelling errors, text databases where we look for documents that are similar to a given document, and computational-biology applications where we want to find a DNA or a protein sequence in a database allowing some errors due to typical variations.
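As a rough illustration of the first two kinds of query, the sketch below uses an invented one-dimensional universe of objects and the absolute difference as the distance function.

    # Toy universe: objects described by a single numeric feature (invented values).
    objects = {"a": 3.0, "b": 7.5, "c": 8.1, "d": 15.2}

    def distance(x, y):
        return abs(x - y)

    def range_query(query_value, radius):
        """Range query: objects whose distance from the query falls within the radius."""
        return [name for name, value in objects.items()
                if distance(value, query_value) <= radius]

    def nearest_neighbour(query_value):
        """Nearest-neighbour query: the single object closest to the query."""
        return min(objects, key=lambda name: distance(objects[name], query_value))

    print(range_query(8.0, 1.0))    # objects within distance 1.0 of 8.0
    print(nearest_neighbour(8.0))   # the object closest to 8.0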

Persuaded that many applications will never reach the limitations of the widespread relational data model, this article focuses on traditional relational algebra equipped with extra features that allow query relaxation and similarity searches. Although a large body of work has addressed how to extend the relational data model to incorporate cooperativity, neighbouring information, and/or orderings (2; 3; 10; 11; 13), none of these proposals has succeeded in fitting into the representational and operational uniformity of traditional relational algebra or even in reaching a certain degree of generality.

Therefore, we are going to talk about domains, similarity, approximate answers, and nearness of data in a highly systematic and comprehensive way, which will lead us towards a usable, applicable, and formally strong generalization of the relational data model.

2 Sets with similarity

Most applications and proposed solutions for non-exact matching and similarity search, which are not covered by traditional relational algebra, share some common characteristics: there is a universe of objects and a non-negative distance or distance-like function defined on them. The distance function measures how close the non-exact matches are to the exact specifications given by a user willing to accept approximate answers.

Instead of restricting ourselves to distance metrics, we consider more general similarity measures that satisfy the single condition of being reflexive, i.e., every object is most similar to itself. Hence, rather than focusing on ordinary sets or metric spaces, we will consider more general sets with similarity, where a measure of similarity assigns to a pair of objects a similarity value that tells us how similar they are. Note that we speak of similarity instead of distance: if a point x moves toward a point y, the distance between x and y gets smaller, but their similarity gets larger.
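A minimal sketch of such a similarity measure follows; the particular formula, which maps a distance into the interval (0, 1], is just one invented example that satisfies reflexivity.

    def similarity(x, y):
        """A reflexive similarity measure derived from absolute distance:
        values lie in (0, 1], and every object is most similar to itself."""
        return 1.0 / (1.0 + abs(x - y))

    assert similarity(4.2, 4.2) == 1.0                    # s(x, x) is maximal (reflexivity)
    assert similarity(0.0, 1.0) > similarity(0.0, 3.0)    # smaller distance, larger similarity
    print(similarity(0.0, 3.0), similarity(0.0, 1.0), similarity(0.0, 0.0))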

Applying SD-tree for object-oriented query processing (signature declustering) (Report)

The advent of the Internet has made the volume of data grow every day in all computer-based applications. This has driven researchers to design more powerful techniques to generate and manipulate large amounts of data and derive useful information from it. Indexing plays a vital role in the fast retrieval of required data from large databases.

Among the many indexing techniques reported, the signature file approach is preferred for its efficient evaluation of set-oriented queries and easy handling of insert and update operations. Initially applied to text data [2, 3, 12, 17, 21], it has since been used in other applications such as office filing [6], relational and object-oriented databases [14, 16, 26], and hypertext [9].

Signatures are hash-coded abstractions of the original data. A signature is a binary pattern of predefined length with a fixed number of 1s. The attributes' signatures are superimposed to form an object's signature. To resolve a query, the query signature, say Sq, is generated using the same hash function and compared sequentially against the signatures in the signature file for matching 1s, and many non-qualifying objects are immediately rejected.

If all the 1s of Sq match the corresponding bits of a signature in the file, it is called a drop. The signature file method guarantees that all qualifying objects pass through the filtering mechanism; however, some non-qualifying objects may also pass the signature test. A drop that actually matches Sq is called an actual drop, and a drop that fails the subsequent check is called a false drop. The next step in query processing is false drop resolution: to remove false drops, each drop is accessed and checked individually. The number of false drops can be statistically controlled by careful design of the signature extraction method [7] and by using long signatures [3, 6].
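The following sketch in plain Python shows signature generation by superimposed coding, the filtering test, and false-drop resolution; the hash function, the signature length F, the weight M, and the example words are all assumptions made for illustration.

    F, M = 16, 3   # signature length and number of 1s per value (illustrative choices)

    def word_signature(value):
        """Hash a value into a signature with (up to) M bits set out of F."""
        sig = 0
        for i in range(M):
            sig |= 1 << (hash((value, i)) % F)
        return sig

    def object_signature(values):
        """Superimposed coding: OR together the signatures of an object's attribute values."""
        sig = 0
        for v in values:
            sig |= word_signature(v)
        return sig

    def is_drop(obj_sig, query_sig):
        """Filter test: every 1 of the query signature must also be set in the object signature."""
        return obj_sig & query_sig == query_sig

    obj = ["information", "retrieval"]
    osig = object_signature(obj)
    qsig = word_signature("information")
    if is_drop(osig, qsig):
        # False-drop resolution: verify against the actual data before accepting the match.
        print("actual drop" if "information" in obj else "false drop")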

1.1 Related work

Different approaches have been discussed by researchers to represent the signature file in a way conducive to evaluating queries, such as the Sequential Signature File [31], Bit-Slice Signature File [31], Multilevel Signature File [25], Compressed Multi-Framed Signature File [23], Parallel Signature File [20], S-Tree and its variants [13,24], Signature Graph [28], and Signature Tree [27,29,30].

1.2 Motivation for the current work

The signature tree developed by Chen [30] has the following drawbacks:

* Signatures are inserted considering both 0s and 1s, whereas the actual weight lies in the set bits only.

* Insertion path is dictated by the existing tree structure.

* To process a query, bits appearing in the tree from the root node are compared with the query signature pattern for both 0s and 1s, not only by its set bits.

* For a 0-bit in the query both the left and right subtrees are followed, leading to multiple traversals.

These observations laid the foundation for the current work. We study a new indexing technique for OODBSs, called the Signature Declustering (SD)-tree, that uses the dynamic balancing of the B+ tree and in which the positions of 1s in the signatures are distributed over a set of leaf nodes. Using this structure, for a given query signature all the matching signatures can be retrieved cumulatively from a single node.

The rest of the paper is organized as follows. In Section 2 we briefly discuss the different approaches used to represent the signature file. In Section 3 the structure of the proposed SD-tree is shown. Section 4 is devoted to the algorithms for the insert, search and delete operations. Section 5 proposes a sample data set and queries to validate the new structure. Section 6 reports the results of the experiments conducted, with an analytical comparison of the SD-tree against the signature tree [30]. Finally, Section 7 concludes with an outlook on further work.

2 A summary of signature file techniques

2.1 Signature files

A signature is a bit string formed from a given value. Compared to other index structures, the signature file is more efficient in handling new insertions and queries on parts of words. Other advantages include its simple implementation and the ability to support a growing file. But it introduces information loss, which can be minimized by carefully selecting the signature extraction method.

Techniques for signature extraction such as Word Signature (WS) [2,3,4,6], Superimposed Coding (SC) [1,2,3,4,5,6,10], Multilevel Superimposed Coding [12], Run Length Encoding (RL) [3,4,6], Bit-block Compression (BC) [3,4], and Variable Bit-block Compression (VBC) [4,6] have been reported. The encoding scheme sets a constant number, say m, of 1s in the range [1..F], where F is the length of the signature.

The resulting binary pattern with m 1s and (F − m) 0s is called a word signature. The signature of a text block or object can be obtained by superimposing (logical OR operation) all its constituent signatures, i.e., word signatures for a text block and attribute signatures for an object. The set of all signatures forms a signature file. An example of superimposed coding and a sample query evaluation is given below.


Information        0010 0100
Retrieval          0100 0001
Block Signature    0110 0101

Sample queries

Matching query
  Keyword = Information    0010 0100
  Query descriptor         0010 0100
  Block signature matches  (Actual Drop)

False Match query
  Keyword = Coding         0010 0001
  Query descriptor         0010 0001
  Block signature matches  (False Drop)
  but keyword does not

Non-matching query
  Keyword = Information    0010 0100
  Keyword = Science        0000 0110
  Query descriptor         0010 0110
  Block signature does not match

2.2 Applications of signatures

Multiple executions of an ETL process against the same set of data.

Every ETL designer, developer & tester should always ask this question: "What will happen if I run the ETL process multiple times against the same data set?"
Answer 1: I get the same result set.
Answer 2: I get multiple result sets.
If you go back to the original article on What is ETL & What ETL is not!, you will immediately come to the conclusion that Answer 2 is incorrect, as ETL is not allowed to create data.
Why would the process run more than once against the same set of data? There are many reasons; the most common are an operator's mistake, an accidental kickoff, an old data file remaining in the directory, a staging table loaded more than once, an intentional rerun of the ETL process after correction of some data in the source data set, etc. Without going into further details, I would advise ETL folks to always build into their process ways to prevent this from happening, using one or more combinations of the following methods:
1. Identify the primary key (logical/physical) and apply update-else-insert logic (a small sketch of this appears after the list).
2. Deleting the target data set before processing again (based on logical/physical primary key)
3. Preventing occurrences of multiple runs by flagging processed dates
4. Marking processed records with processed flags after commit
5. Prevention of multiple loads in the staging area
6. Identifying duplicate records in the staging area before the data gets processed
7. more…
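As a minimal sketch of method 1 above (update-else-insert keyed on the logical primary key; the table layout and column names are invented), an accidental second run of the same batch leaves the target unchanged:

    # Target modelled as a dictionary keyed on the logical primary key; columns are invented.
    target = {}

    def upsert(rows):
        """Update-else-insert: rerunning the same batch cannot create duplicate rows."""
        for row in rows:
            target[row["customer_id"]] = row   # existing key -> update, new key -> insert

    batch = [{"customer_id": 1, "amount": 100},
             {"customer_id": 2, "amount": 250}]

    upsert(batch)
    upsert(batch)              # accidental second run against the same data set
    assert len(target) == 2    # still two rows: the load is idempotent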
So do this experiment in the development or test environment: run the ETL process more than once and check the result. If you get result 2 (copies of rows, with no way to distinguish or retire the old rows), then the designer or the developer is wrong, and if the process has passed QA or testing then the tester is wrong too.

The Problem with Dimensional Modeling

The problem with dimensional modeling is... well there really isn't a problem with dimensional modeling. In terms of capturing requirements for DSS analysis, dimensional modeling is undoubtedly the best technique for design that there is. Hundreds of organizations have used dimensional modeling successfully and are quite satisfied with the results. There are classes and conferences where dimensional modeling is the focus and the attendees walk away with a happy smile on their face.

So whatever could be the problem with dimensional modeling? In fact, is there really a problem here?

In order to understand what the issues are, consider the results of a dimensional modeling exercise. Typically the database design that results from a dimensional model is called a star join (or sometimes a snowflake structure). A star join is made up of one or more fact tables coupled with any number of related dimensions. A star join is built by gathering and assimilating user requirements. A star join represents the optimal denormalization of data that best suits the requirements of the users.

The starting point for the design of the star join and dimensional modeling is the process of requirements gathering. Users are interviewed and requirements are synthesized from the users' statements. Once the requirements are gathered, an assimilation of requirements from all users interviewed takes place, leading to the determination of the shape and content of the star join. The database designer knows how to build the star join because of the requirements analysis that precede the design activities.

The result of the star join, when built properly, is that the shape and content of the star join are optimal for the requirements of the users that have participated in the requirements-gathering process. Star joins are great for representing the views of people who are of a like mind, but different groups of people want their own star join.

While the star join fits the needs of the users who have been interviewed, not surprisingly the star join does not fit the needs of all users. For a variety of reasons, there will be users who do not have input into the dimensional modeling process. Indeed, the problem with a star join design is that it optimizes the access of data for one group of people at the expense of everyone else. A star join with one shape and set of contents is optimal for one group of people and another star join with another shape and contents is optimal for another group of people. Because the star join is shaped around user requirements and because user requirements vary from one type of user to the next, not surprisingly different star joins are optimal for different types of users.

The advocates of dimensional modeling would like to believe that a single star join can be created for the optimization of everyone's requirements. Such is not the case at all. The world has long ago discovered that the single database for all purposes was pie in the sky. The single database to serve all purposes and all requirements simply has never been a reality.

There are a host of very good reasons why different organizations need their own unique star joins and cannot share a star join with another organization. Some of these reasons are:

    Sequencing of data. Finance likes to see data sequenced one way; marketing likes to see data sequenced another way.
    Data definitions. Sales defines a sale as closed business; accounting defines a sale as booked business.
    Granularity. Finance looks at things in terms of monthly revenue; accounting looks at things in terms of quarterly revenue.
    Geography. Sales looks at things in terms of sales territories by ZIP code; marketing looks at things in terms of states.
    Products. Engineering looks at things in terms of existing products; sales looks at things in terms of future products.
    Data relationships. Finance looks at the relationship between a customer and a sale; accounting looks at things through the eyes of an account/sales relationship.
    Time. Accounting looks at the world through scheduled closing dates; finance looks at the world through calendar dates.
    Technologies. One star join (from finance) is in one technology and another star join (from marketing) is in another technology.
    Sources of data. One source system feeds one star join while another source system feeds another star join.
    Management. Management looks at things through the organization chart, the investors look at things through the marketplace.

The differences in the way that parts of the business operate are much more vast and much more profound than the obvious examples presented here.

The way that the different departments of the organization conduct their business is like the six blind men describing an elephant. Each blind man is touching a different part of the elephant and is describing it. To a casual observer, there appears to be no cohesive pattern in the very different descriptions given by the six blind men. Having different organizations care for different aspects of the business is simply the way business runs. There are good reasons for different departments looking at the data and the business in very different lights. The consequence of this need to look at the same business in different ways results in the need for very different star joins for different departments. The star join that is optimal for one department is hardly optimal (or even recognizable) for another department. The result is that, because of the nature of the way business is run, different departments require different star joins.

When each department has its own star join which has been designed specifically for it, the world is at peace. But herein lies a host of problems.

The problems that arise with star joins and dimensional modeling do not become apparent until multiple star joins appear on the horizon. (The problems of multiple star joins cohabiting in the same environment are well documented and will be only lightly touched upon here, in the interest of not repeating what has been well discussed in the industry.)

When there are multiple independent star join environments, the same detailed data appears in each star join. There is no reconcilability of data, and new star joins require the same amount of work for creation as older star joins. As a result:

    Each star join contains much of the same detailed data as each other star join. Star joins grow unnecessarily large when each star join thinks that it has to go back and gather the same detailed data that all other star joins have already gathered.
    The results obtained from each star join are inconsistent with the results obtained from every other star join. Furthermore, the ability to reconcile the differences is almost nil.
    New star joins require as much work to create as the original star joins. There is no foundation to build on when each star join is built independently.
    The interface to the supporting applications that feed the star joins becomes unmanageable.

In short, simply doing dimensional modeling as a basis for data warehouse design leads down a dark path when multiple star joins are considered. It is never apparent that there is a problem with star joins when you are looking at just one star join. But when you look at multiple star joins, the limitations of dimensional modeling become apparent.

Does this mean that dimensional modeling is invalid as a database design technique for data warehousing? The answer is: not at all. Dimensional modeling and star joins fit the bill very nicely for data marts. In fact, if I had to design a data mart tomorrow, I would not consider using any other approach.

But when it comes to the foundation data, that's another story. Figure 1 shows that the foundation data, the data warehouse, requires an altogether different treatment than dimensional modeling. The data warehouse, which is the proper foundation for all DSS activity, including star joins, requires very granular, very flexible data. The ideal structure for the data warehouse is normalized data. The normalized data can be bent and shaped any old way.

Different styles of data structures fit in different places.

Trying to superimpose a star join on the data warehouse implies that the data warehouse is going to be optimal for the support of one set of users at the expense of everyone else.

Not only does dimensional modeling not fit the data warehouse, dimensional modeling does not fit in many other places in the data warehouse/corporate information factory environment, such as:

    exploration warehouses,
    near-line storage,
    project warehouses and
    parts of the operational data store (ODS).

In short, dimensional modeling really only fits the data mart environment, where requirements for processing are known before the infrastructure is built.

The advocates of dimensional modeling are quite proud of their pioneering efforts and well they should be. They have rightfully advanced the industry in a proper direction. But in their zeal to convert people to their way of thinking, they seem to have forgotten that there are other valid approaches to solving the database design problem for the corporate information factory. The dimensional modeling approach fits in a few places in the DSS environment, but it does not fit everywhere. The dimensional modelers have crafted a fine hammer. Now everything looks like a nail to them.

Bill Inmon is universally recognized as the father of the data warehouse. He has more than 35 years of database technology management experience and data warehouse design expertise. His books have been translated into nine languages. He is known globally for his seminars on developing data warehouses and has been a keynote speaker for many major computing associations. For more information, visit www.inmongif.com and www.inmoncif.com. Inmon may be reached at (303) 681-6772.

Data Mining: An Introduction

By this point in time, you've probably heard a good deal about data mining -- the database industry's latest buzzword. What's this trend all about? To use a simple analogy, it's finding the proverbial needle in the haystack. In this case, the needle is that single piece of intelligence your business needs and the haystack is the large data warehouse you've built up over a long period of time.

Data Mining in Business

Through the use of automated statistical analysis (or "data mining") techniques, businesses are discovering new trends and patterns of behavior that previously went unnoticed. Once they've uncovered this vital intelligence, it can be used in a predictive manner for a variety of applications. Brian James, assistant coach of the Toronto Raptors, uses data mining techniques to rack and stack his team against the rest of the NBA. The Bank of Montreal's business intelligence and knowledge discovery program is used to gain insight into customer behavior.

Gathering Data

The first step toward building a productive data mining program is, of course, to gather data! Most businesses already perform these data gathering tasks to some extent -- the key here is to locate the data critical to your business, refine it and prepare it for the data mining process. If you're currently tracking customer data in a modern DBMS, chances are you're almost done. Take a look at the article Mining Customer Data from DB2 Magazine for a great feature on preparing your data for the mining process.

Selecting an Algorithm

At this point, take a moment to pat yourself on the back. You have a data warehouse! The next step is to choose one or more data mining algorithms to apply to your problem. If you're just starting out, it's probably a good idea to experiment with several techniques to give yourself a feel for how they work. Your choice of algorithm will depend upon the data you've gathered, the problem you're trying to solve and the computing tools you have available to you. Let's take a brief look at two of the more popular algorithms.

Regression

Regression is the oldest and most well-known statistical technique that the data mining community utilizes. Basically, regression takes a numerical dataset and develops a mathematical formula that fits the data. When you're ready to use the results to predict future behavior, you simply take your new data, plug it into the developed formula and you've got a prediction! The major limitation of this technique is that it only works well with continuous quantitative data (like weight, speed or age). If you're working with categorical data where order is not significant (like color, name or gender) you're better off choosing another technique.
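A minimal sketch of the idea, fitting a straight line to a handful of invented numbers with NumPy; real mining tools fit far richer regression models.

    import numpy as np

    # Historical observations (invented): advertising spend vs. resulting sales.
    spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    sales = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

    # Fit a degree-1 polynomial (a straight line): sales ~ slope * spend + intercept.
    slope, intercept = np.polyfit(spend, sales, 1)

    # Plug new data into the fitted formula to obtain a prediction.
    print(f"predicted sales at spend 6.0: {slope * 6.0 + intercept:.1f}")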

Classification

Working with categorical data or a mixture of continuous numeric and categorical data? Classification analysis might suit your needs well. This technique is capable of processing a wider variety of data than regression and is growing in popularity. You'll also find the output much easier to interpret: instead of the complicated mathematical formula given by the regression technique, you'll receive a decision tree that requires a series of binary decisions. (A frequently used related algorithm is k-means clustering, strictly an unsupervised clustering technique rather than a classifier.) Take a look at the Classification Trees chapter from the Electronic Statistics Textbook for in-depth coverage of this technique.
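A minimal sketch using scikit-learn's decision tree classifier on a tiny invented customer data set; it only illustrates the fit/predict workflow described above.

    from sklearn.tree import DecisionTreeClassifier

    # Invented training data: [age, purchases_per_year], label 1 = customer churned.
    X = [[25, 2], [40, 15], [35, 1], [50, 20], [23, 0], [45, 18]]
    y = [1, 0, 1, 0, 1, 0]

    clf = DecisionTreeClassifier(max_depth=2, random_state=0)
    clf.fit(X, y)

    # The fitted tree is a series of binary decisions; apply it to new customers.
    print(clf.predict([[30, 3], [48, 16]]))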

Other Techniques

Regression and classification are two of the more popular data mining techniques, but they only form the tip of the iceberg. For a detailed look at other data mining algorithms, look at this feature on Data Mining Techniques or the SPSS Data Mining page.

Data Mining Products

Data mining products are taking the industry by storm. The major database vendors have already taken steps to ensure that their platforms incorporate data mining techniques. Oracle's Data Mining Suite (Darwin) implements classification and regression trees, neural networks, k-nearest neighbors, regression analysis and clustering algorithms. Microsoft's SQL Server also offers data mining functionality through the use of classification trees and clustering algorithms. If you're already working in a statistics environment, you're probably familiar with the data mining algorithm implementations offered by the advanced statistical packages SPSS, SAS, and S-Plus.

Moving On

Have we whetted your appetite for data mining knowledge? For a more detailed look, check out the excellent slide show presentations and other data mining resources on Megaputer.com. If you're ready to get started but can't find any sample data, take a look at the various repositories listed in Data Sources for Knowledge Discovery. Good luck with your data mining endeavors! Stop by our forum and let us know how things are going!

SOURCE:http://databases.about.com/od/datamining/a/datamining.htm

How important is Extract, Transform, Load (ETL) to data Warehousing?

Politicians and fund-raising can be used as an analogy for data cleansing and data warehousing: there is almost no likelihood of one existing without the other. Data cleansing is often the most time-intensive, and contentious, process in data warehousing projects.

What is Data Cleansing?

The elevator pitch: "Data cleansing ensures that undecipherable data does not enter the data warehouse. Undecipherable data will affect reports generated from the data warehouse via OLAP, Data Mining and KPIs."

A very simple example of where data cleansing would be utilized is how dates are stored in separate applications. Example: 11th March 2007 can be stored as '03/11/07' or '11/03/07' among other formats. A data warehousing project would require the different date formats to be transformed to a uniform standard before being entered in the data warehouse.
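A minimal sketch of that kind of transformation; the list of accepted source formats and the ISO target format are assumptions made for the example.

    from datetime import datetime

    # Source systems deliver dates in different layouts; try each accepted format in turn.
    SOURCE_FORMATS = ["%m/%d/%y", "%d/%m/%Y", "%d %B %Y"]

    def cleanse_date(raw):
        """Transform a raw date string into one warehouse standard (ISO yyyy-mm-dd)."""
        for fmt in SOURCE_FORMATS:
            try:
                return datetime.strptime(raw, fmt).date().isoformat()
            except ValueError:
                continue
        raise ValueError(f"undecipherable date: {raw!r}")

    print(cleanse_date("03/11/07"))        # US-style month/day/year -> 2007-03-11
    print(cleanse_date("11 March 2007"))   # day month-name year     -> 2007-03-11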


Why Extract, Transform and Load (ETL)?
Extract, Transform and Load (ETL) refers to a category of tools that can assist in ensuring that data is cleansed, i.e. conforms to a standard, before being entered into the data warehouse. Vendor-supplied ETL tools are considerably easier to use for managing data cleansing on an ongoing basis. ETL sits in front of the data warehouse, listening for incoming data. If it comes across data that it has been programmed to transform, it will make the change before loading the data into the data warehouse.
ETL tools can also be utilized to extract data from remote databases, either through automatically scheduled events or via manual intervention. There are alternatives to purchasing ETL tools, and the choice will depend on the complexity and budget of your project. Database Administrators (DBAs) can write scripts to perform ETL functionality, which can usually suffice for smaller projects. Microsoft's SQL Server comes with a free ETL tool called Data Transformation Services (DTS). DTS is pretty good for a free tool, but it does have limitations, especially in the ongoing administration of data cleansing.
Examples of ETL vendors are Data Mirror, Oracle, IBM, Cognos and SAS. As with all product selections, list what you think you would require from an ETL tool before approaching a vendor. It may be worthwhile to obtain the services of consultants that can assist with the requirements analysis for product selection.
Figure 1. ETL sits in front of Data Warehouses
How important is Data Cleansing and ETL to the success of Data Warehousing Projects?
ETL is often out of sight and out of mind if the data warehouse is producing results that match stakeholders' expectations. As a result, ETL has been dubbed the silent killer of data warehousing projects. Most data warehousing projects experience delays and budget overruns due to unforeseen circumstances relating to data cleansing.
How to Plan for Data Cleansing?
It is important to start mapping out the data that will be entered into the data warehouse as early as possible. This may change as the project matures, but the documentation trail will prove extremely valuable, as you will need to obtain commitments from data owners that they will not change data formats without prior notice.
Create a list of data that will require Extracting, Transforming and Loading. Create a separate list for data that has a higher likelihood of changing formats. Decide whether you need to purchase ETL tools and set aside an overall budget. Obtain advice from experts in the field and evaluate whether the product fits into the overall technical hierarchy of your organization.


SOURCE:http://www.dwreview.com/Articles/Data_Cleansing.html

Informatica Repository Manager

Q. What type of repositories can be created using Informatica Repository Manager?

A. Informatica PowerCenter includes the following types of repositories:

    Standalone Repository: A repository that functions individually and is unrelated to any other repository.
    Global Repository: A centralized repository in a domain. It can contain objects shared across the repositories in the domain; the objects are shared through global shortcuts.
    Local Repository: A repository within a domain that is not the global repository. A local repository can connect to the global repository using global shortcuts and can use objects in its shared folders.
    Versioned Repository: Either a local or a global repository that enables version control. A versioned repository can store multiple copies, or versions, of an object. This feature allows metadata to be developed, tested, and deployed into production efficiently.

Q. What is a code page?

A. A code page contains the encoding that specifies the characters of one or more languages. The code page is selected based on the source of the data. For example, if the source contains Japanese text, then a code page that supports Japanese text should be selected.

When a code page is chosen, the program or application for which the code page is set refers to a specific set of data that describes the characters the application recognizes. This influences the way the application stores, receives, and sends character data.

Q. Which all databases PowerCenter Server on Windows can connect to?

A. PowerCenter Server on Windows can connect to the following databases:

    IBM DB2
    Informix
    Microsoft Access
    Microsoft Excel
    Microsoft SQL Server
    Oracle
    Sybase
    Teradata

Q. Which all databases PowerCenter Server on UNIX can connect to?

A. PowerCenter Server on UNIX can connect to the following databases:

    IBM DB2
    Informix
    Oracle
    Sybase
    Teradata

Informatica Mapping Designer

Q. How to execute PL/SQL script from Informatica mapping?

A. The Stored Procedure (SP) transformation can be used to execute PL/SQL scripts. The PL/SQL procedure name can be specified in the SP transformation. Whenever the session is executed, it will call the PL/SQL procedure.

Q. How can you define a transformation? What are different types of transformations available in Informatica?

A. A transformation is a repository object that generates, modifies, or passes data. The Designer provides a set of transformations that perform specific functions. For example, an Aggregator transformation performs calculations on groups of data. Below are the various transformations available in Informatica:

    Aggregator
    Application Source Qualifier
    Custom
    Expression
    External Procedure
    Filter
    Input
    Joiner
    Lookup
    Normalizer
    Output
    Rank
    Router
    Sequence Generator
    Sorter
    Source Qualifier
    Stored Procedure
    Transaction Control
    Union
    Update Strategy
    XML Generator
    XML Parser
    XML Source Qualifier

Q. What is a source qualifier? What is meant by Query Override?

A. Source Qualifier represents the rows that the PowerCenter Server reads from a relational or flat file source when it runs a session. When a relational or a flat file source definition is added to a mapping, it is connected to a Source Qualifier transformation.

PowerCenter Server generates a query for each Source Qualifier transformation whenever it runs a session. The default query is a SELECT statement containing all the source columns. The Source Qualifier has the capability to override this default query by changing the default settings of the transformation properties (Query Override). The list of selected ports and the order in which they appear in the default query should not be changed in the overridden query.

Q. What is aggregator transformation?

A. The Aggregator transformation allows performing aggregate calculations, such as averages and sums. Unlike Expression Transformation, the Aggregator transformation can only be used to perform calculations on groups. The Expression transformation permits calculations on a row-by-row basis only.

Aggregator Transformation contains group by ports that indicate how to group the data. While grouping the data, the aggregator transformation outputs the last row of each group unless otherwise specified in the transformation properties.

Various group by functions available in Informatica are : AVG, COUNT, FIRST, LAST, MAX, MEDIAN, MIN, PERCENTILE, STDDEV, SUM, VARIANCE.

Q. What is Incremental Aggregation?

A. Whenever a session is created for a mapping containing an Aggregator transformation, the session option for Incremental Aggregation can be enabled. When PowerCenter performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform the aggregation calculations incrementally.

Q. How Union Transformation is used?

A. The Union transformation is a multiple-input-group transformation that can be used to merge data from various sources (or pipelines). It works just like the UNION ALL statement in SQL, which is used to combine the result sets of two SELECT statements.

Q. Can two flat files be joined with Joiner Transformation?

A. Yes, joiner transformation can be used to join data from two flat file sources.

Q. What is a look up transformation?

A. This transformation is used to look up data in a flat file or a relational table, view, or synonym. It compares lookup transformation ports (input ports) to the source column values based on the lookup condition. The returned values can then be passed on to other transformations.

Q. Can a lookup be done on Flat Files?

A. Yes.

Q. What is the difference between a connected look up and unconnected look up?

A. A connected lookup takes input values directly from other transformations in the pipeline.

Unconnected lookup doesn’t take inputs directly from any other transformation, but it can be used in any transformation (like expression) and can be invoked as a function using :LKP expression. So, an unconnected lookup can be called multiple times in a mapping.

Q. What is a mapplet?

A. A mapplet is a reusable object that is created using the Mapplet Designer. A mapplet contains a set of transformations and allows that transformation logic to be reused in multiple mappings.

Q. What does reusable transformation mean?

A. Reusable transformations can be used multiple times in a mapping. A reusable transformation is stored as metadata, separate from any mapping that uses it. Whenever any changes are made to a reusable transformation, all the mappings in which it is used will be invalidated.

Q. What is update strategy and what are the options for update strategy?

A. Informatica processes the source data row by row. By default, every row is marked to be inserted into the target table. If a row has to be updated or inserted based on some logic, the Update Strategy transformation is used. The condition can be specified in the Update Strategy to mark the processed row for update or insert.

The following options are available for update strategy:

    DD_INSERT : If this is used the Update Strategy flags the row for insertion. Equivalent numeric value of DD_INSERT is 0.
    DD_UPDATE : If this is used the Update Strategy flags the row for update. Equivalent numeric value of DD_UPDATE is 1.
    DD_DELETE : If this is used the Update Strategy flags the row for deletion. Equivalent numeric value of DD_DELETE is 2.
    DD_REJECT : If this is used the Update Strategy flags the row for rejection. Equivalent numeric value of DD_REJECT is 3.

What is lookup transformation in informatica?

Lookup is a transformation used to look up values from a relational table, view, or a flat file. The developer defines the lookup match criteria. There are two types of lookups in PowerCenter Designer, namely 1) Connected Lookup and 2) Unconnected Lookup. Different caches can also be used with a lookup, such as static, dynamic, persistent, and shared (the dynamic cache cannot be used when creating an unconnected lookup). Each of these has its own characteristics. For more details, the Informatica Help documentation can be useful.

We assume you are familiar with the basics of Informatica; let us now walk through the Lookup transformation.

The Lookup transformation is passive, and it can be either connected or unconnected. It is used to look up data in a relational table, view, or synonym. The lookup definition can be imported either from source or from target tables.

For example, suppose we want to retrieve all the sales of a product with ID 10, and the sales data resides in another table called 'Sales'. Instead of using the Sales table as one more source, we can use a Lookup transformation to look up the data for the product with ID 10 in the Sales table.

Difference between Connected and UnConnected Lookup Transformation:

1. A connected lookup receives input values directly from the mapping pipeline, whereas an unconnected lookup receives values from a :LKP expression in another transformation.

2. A connected lookup returns multiple columns from the same row, whereas an unconnected lookup has one return port and returns one column from each row.

3. A connected lookup supports user-defined default values, whereas an unconnected lookup does not.

Example

Select dname from dept, emp where emp.deptno = dept.deptno

Connected LKPs

    A connected LKP transformation is one which is connected to the pipeline.
    A connected LKP transformation will process each and every row.
    If you want to use a dynamic LKP cache, use the connected LKP transformation.
    If the LKP condition is not matched, the LKP transformation will return the default value.
    It cannot be called as a function from another transformation.
    It returns multiple values.
    It can use a static or dynamic cache.


Unconnected LKPs

    An unconnected LKP transformation is one which is not connected to the pipeline.
    It should be called from an Expression or Update Strategy transformation.
    It will not process each and every row; it returns values based on the condition in the calling expression.
    If no match is found for the LKP condition, the LKP transformation will return NULL values.
    It is a reusable transformation; the same LKP transformation can be called multiple times in the same mapping.
    It will return only one value.
    It can use only a static cache.


Performance Considerations for Lookups

Below is a list of performance considerations for lookups in Informatica PowerCenter.


Performance for Lookups

Misconceptions about lookup SQL Indexes

I have seen people suggest an index to improve the performance of any SQL. This suggestion is often incorrect, and especially so when it comes to indexing the condition port columns of the lookup SQL.

Before explaining why it is incorrect, let me detail how the lookup works. To explain with an example, we take the usual HR schema EMP table, with EMPNO, ENAME, and SALARY as its columns.

Let us say there is a lookup in an ETL mapping that checks for a particular EMPNO and returns ENAME and SALARY from the lookup. The output ports for the lookup are "ENAME" and "SALARY", and the condition port is "EMPNO". Imagine that you are facing performance problems with this lookup and one of the suggestions was to index the condition port.

As suggested (incorrectly), you create an index on the EMPNO column in the underlying database table. In practice, the SQL the lookup executes is going to be this:

      select EMPNO,
             ENAME,
             SALARY
      from EMP
      ORDER BY EMPNO,
               ENAME,
               SALARY;

The data resulting from this query is stored in the lookup cache, and then each record from the source is looked up against this cache. So the check against the condition port column is done in the Informatica lookup cache and not in the database, and any index created in the database has no effect on it.

You may be wondering whether we can replicate the same indexing in the lookup cache. You don't have to worry about it: PowerCenter creates an "index" cache and a "data" cache for the lookup. In this case, the condition port data, EMPNO, is indexed and hashed in the "index" cache, and the remaining ports along with EMPNO are found in the "data" cache.
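A rough sketch of why a database index is irrelevant once the cache has been built; plain Python stands in for the PowerCenter caches, and the EMP rows are just sample values.

    # One-time cache build: the lookup SQL scans EMP once, no matter what indexes exist there.
    emp_rows = [(7369, "SMITH", 800), (7499, "ALLEN", 1600), (7521, "WARD", 1250)]

    # "Index" cache idea: the condition port (EMPNO) hashed to its row of output ports.
    lookup_cache = {empno: (ename, salary) for empno, ename, salary in emp_rows}

    def lookup(empno):
        """Each source row probes the in-memory cache, not the database."""
        return lookup_cache.get(empno)   # hash probe inside the Integration Service

    print(lookup(7499))   # ('ALLEN', 1600)
    print(lookup(9999))   # None -> lookup condition not satisfied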

I hope now you understand why indexing condition port columns doesn't increase performance.

Having said that, I want to take you to a different kind of lookup, where you have disabled caching. In this kind of lookup there is no cache: every time a row is sent into the lookup, the SQL is executed against the database. In this scenario the database index "may" help. But if the performance of the lookup is a problem, then the cache-less lookup itself may be the problem.

I would go for a cache-less lookup if the number of source records is less than the number of records in my lookup table. In this case ONLY, indexing the condition ports will help. Everywhere else it is mere chance whether the database picks up the index.

Dynamic Lookups

Dynamic lookups are used for implementing slowly changing dimensions. The ability to provide dynamic caching gives Informatica a definitive edge over other vendor products. In a dynamic lookup, every time a new record is found (based on the lookup condition), the record is appended to the lookup cache. The cache can also update existing records with the incoming values.
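A rough sketch of the dynamic-cache idea in plain Python (not the actual PowerCenter implementation); the keys and rows are invented.

    cache = {}   # dynamic lookup cache, keyed on the lookup condition (e.g. a business key)

    def dynamic_lookup(key, incoming_row):
        """New key -> insert into the cache; existing key -> refresh it with the incoming values."""
        is_new = key not in cache
        cache[key] = incoming_row   # the row is flagged downstream as insert or update accordingly
        return "insert" if is_new else "update"

    print(dynamic_lookup("CUST-1", {"name": "Ann"}))    # insert
    print(dynamic_lookup("CUST-1", {"name": "Anne"}))   # update, cache refreshed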

SOURCE:http://it.toolbox.com/wiki/index.php/What_is_lookup_transformation_in_informatica%3F