Monday, February 21, 2011

Developing a Data Warehouse Architecture

“Every data warehouse has an architecture,” says Warren Thornthwaite, a partner with Menlo Park, CA-based InfoDynamics LLC. “It's either ad hoc or planned; implied or documented. Unfortunately, many warehouses are developed without an explicit architectural plan, which severely limits flexibility.” Without architecture, subject areas don't fit together, connections lead nowhere, and the whole warehouse is difficult to manage and change. In addition, although it might not seem important, the architecture of a data warehouse becomes the framework for product selection.
Thornthwaite compares the development of a data warehouse to building a real house. “But how do you build a $3 million mansion, let alone a $100,000 house?” You do it with blueprints, he says—the drawings, specifications, and standards showing how the house will be constructed, at multiple levels of detail. Of course, there are different versions of the blueprint for various subsystems of the house, such as plumbing, electrical, HVAC, communications, and vacuum. There are also standards that all homes follow, including plugs, lights, plumbing fixtures, door sizes, etc.
For data warehousing, the architecture is a description of the elements and services of the warehouse, with details showing how the components will fit together and how the system will grow over time. Like the house analogy, the warehouse architecture is a set of documents, plans, models, drawings, and specifications, with separate sections for each key component area and enough detail to allow their implementation by skilled professionals.
“This is not a requirements document,” Thornthwaite points out. “The requirements document says what the architecture needs to do. The architecture also isn't a project plan or task list; it's the what, not the how or why.”
It's also not easy, he says, because we've only been developing data warehouse systems for 15 years, versus 5,000 years for building homes. Therefore we have fewer standards, the tools and techniques are rapidly evolving, there is little documentation of what systems we already have, and data warehouse terminology is extremely loose.
So while developing an architecture is difficult, it is possible—and it's critical. First and foremost, he says, the architecture has to be driven by the business. If your requirement is to have nightly updates, this has implications for the architecture, and you must understand the technical requirements needed to achieve what you want. Thornthwaite gives a few examples of business requirements, and the general technical considerations for each:
  • Nightly updates - adequate staging horsepower.
  • Worldwide availability - parallel or distributed servers.
  • Customer-level analysis - [large] server size.
  • New data sources - flexible tools with support for meta data.
  • Reliability - job control features.
Key Component Areas
A complete data warehouse architecture includes data and technical elements. Thornthwaite breaks down the architecture into three broad areas. The first, data architecture, is centered on business processes. The next area, infrastructure, includes hardware, networking, operating systems, and desktop machines. Finally, the technical area encompasses the decision-making technologies that will be needed by the users, as well as their supporting structures. These areas are detailed in the sub-sections below.
Data Architecture
As stated above, the data architecture portion of the overall data warehouse architecture is driven by business processes. For example, in a manufacturing environment the data model might include orders, shipping, and billing. Each area draws on a different set of dimensions. But where dimensions intersect in the data model the definitions have to be the same—the customer who buys must be the same customer who builds. So data items should have a common structure and content, and should be created and maintained by a single process.
Thornthwaite says that organizations often ask how data should be represented in the warehouse—entity/relationship or dimensional? “If you have a star schema then use dimensional. Is your detail normalized or dimensional? Will users be querying detail? Then use dimensional.” He adds that most data warehousing experts are in substantial agreement; the [data] sources are typically entity/relationship models and the front end is a dimensional model. The only issue is where you draw the line between the warehouse itself and the data staging area.
As you work through the architecture and present data to your users, tool choices will be made, but many choices will disappear as the requirements are set. For example, he explains that product capabilities are beginning to merge, like MOLAP and ROLAP. “MOLAP is okay if you stay within the cube you've built. It's fast and allows for flexible querying—within the confines of the cube.” Its weaknesses are size (overall and within a dimension), design constraints (limited by the cube structure), and the need for a proprietary data base.
Infrastructure Architecture
With all the required hardware platforms and boxes, the data warehouse sometimes becomes its own IS shop. Indeed, there are lots of “boxes” in data warehousing, mostly used for data bases and application servers.
The issues with hardware and DBMS choices are size, scalability, and flexibility. In about 80 percent of data warehousing projects this isn't difficult; most businesses can get enough power to handle their needs.
In terms of the network, check the data sources, the warehouse staging area, and everything in between to ensure there's enough bandwidth to move data around. On the desktop, run the tools and actually get some data through them to determine if there's enough power for retrieval. Sometimes the problem is simply with the machine, and the desktops must be powerful enough to run current-generation access tools. Also, don't forget to implement a software distribution mechanism.
Technical Architecture
The technical architecture is driven by the meta data catalog. “Everything should be meta data-driven,” says Thornthwaite. “The services should draw the needed parameters from tables, rather than hard-coding them.” An important component of the technical architecture is the data staging process, which covers five major areas (a minimal sketch of this metadata-driven approach follows the list):
  • Extract - data comes from multiple sources and is of multiple types. Data compression and encryption handling must be considered in this area, where applicable.
  • Transform - data transformation includes surrogate key management, integration, de-normalization, cleansing, conversion, aggregation, and auditing.
  • Load - loading is often done to multiple targets, with load optimization and support for the entire load cycle.
  • Security - administrator access and data encryption policies.
  • Job control - this includes job definition, job scheduling (time and event), monitoring, logging, exception handling, error handling, and notification.
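As a concrete illustration of the metadata-driven approach Thornthwaite describes, the sketch below shows how staging job parameters and run logging might be held in tables rather than hard-coded. The table and column names (STAGE_JOB, STAGE_JOB_LOG, and so on) are hypothetical, not part of his material.

  CREATE TABLE stage_job (
    job_name       VARCHAR2(30) PRIMARY KEY,   -- e.g. 'EXTRACT_ORDERS'
    source_system  VARCHAR2(30),               -- e.g. 'MVS', 'ORACLE'
    source_object  VARCHAR2(61),               -- table or file to extract
    target_table   VARCHAR2(30),               -- staging target
    schedule_expr  VARCHAR2(30),               -- e.g. 'NIGHTLY 02:00'
    enabled_flag   CHAR(1) DEFAULT 'Y'
  );

  CREATE TABLE stage_job_log (
    job_name     VARCHAR2(30) REFERENCES stage_job (job_name),
    run_started  DATE,
    run_ended    DATE,
    rows_loaded  NUMBER,
    status       VARCHAR2(10),                 -- 'OK', 'WARN', 'FAIL'
    message      VARCHAR2(4000)
  );

  -- A scheduler or control program reads its parameters from these tables
  -- instead of hard-coding them:
  SELECT job_name, source_system, source_object, target_table
  FROM   stage_job
  WHERE  enabled_flag = 'Y';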
The staging box needs to be able to extract data from multiple sources, like MVS, Oracle, VM, and others, so be specific when you choose your products. It must handle data compression and encryption, transformation, loading (possibly to multiple targets), and security (at the front end this is challenging, Thornthwaite says). In addition, the staging activities need to be automated. Many vendors' offerings do different things, so he advises that most organizations will need to use multiple products.
A system for monitoring data warehouse use is valuable for capturing queries and tracking usage, and performance tuning is also helpful. Performance optimization includes cost estimation through a “governor” tool, and should include ad hoc query scheduling. Middleware can provide query management services. Tools for all of these and other related tasks are available for the front end, for server-based query management, and for data from multiple sources. Tools are also available for reporting, connectivity, and infrastructure management. Finally, the data access piece should include reporting services (such as publish and subscribe), a report library, a scheduler, and a distribution manager.
A Word About Meta Data
The creation and management of data has the following “steps” in the data warehouse process:
  1. warehouse model
  2. source definitions
  3. table definitions
  4. source-to-target maps
  5. map and transformation information
  6. physical information (table spaces, etc.)
  7. extracted data
  8. transformed data
  9. load statistics
  10. business descriptions
  11. query requests
  12. the data itself
  13. query statistics
To show how important meta data is, of the steps listed above only three involve “real” data—7, 8, and 12. “Everything else is meta data,” says Thornthwaite, “and the whole data warehouse process relies on it.” The major technical elements of a meta data catalog include:
  • Business rules - includes definitions, derivations, related items, validation, and hierarchy information (versions, dates, etc.).
  • Movement/transformation information - source/destination information, as well as DDL (data types, names, etc.).
  • Operations information - data load job schedules, dependencies, notification, and reliability information (such as host redirects and load balancing).
  • Tool-specific information - graphic display information and special function support.
  • Security rules - authentication and authorization.
Developing an Architecture
When you develop the technical architecture model, draft the architecture requirements document first. Next to each business requirement, write down its architecture implications. Group these implications according to architecture areas (remote access, staging, data access tools, etc.). Understand how each area fits in with the others. Capture the definition of each area and its contents. Then refine and document the model.
Thornthwaite recognizes that developing a data warehouse architecture is difficult, and thus warns against using a “just do it” approach, which he also calls “architecture lite.” At the other extreme, the Zachman framework is more than most organizations need for data warehousing, so he recommends a reasonable compromise consisting of a four-layer process: business requirements, technical architecture, standards, and products.
Business requirements essentially drive the architecture, so talk to business managers, analysts, and power users. From your interviews look for major business issues, as well as indicators of business strategy, direction, frustrations, business processes, timing, availability, and performance expectations. Document everything well.
From an IT perspective, talk to existing data warehouse/DSS support staff, OLTP application groups, and DBAs; as well as networking, OS, and desktop support staff. Also speak with architecture and planning professionals. Here you want to get their opinions on data warehousing considerations from the IT viewpoint. Learn if there are existing architecture documents, IT principles, standards statements, organizational power centers, etc.
Not many standards exist for data warehousing, but there are standards for a lot of the components. The following are some to keep in mind:
  • Middleware - ODBC, OLE, OLE DB, DCE, ORBs, and JDBC.
  • Data base connectivity - ODBC, JDBC, OLE DB, and others.
  • Data management - ANSI SQL and FTP.
  • Network access - DCE, DNS, and LDAP.
Regardless of what standards they support, major data warehousing tools are meta data-driven. However, they don't often share meta data with each other and vary in terms of openness. “So research and shop for tools carefully,” advises Thornthwaite. “The architecture is your guide. And use IT advisory firms, like GartnerGroup, META Group, Giga, etc.”
How detailed does a data warehouse architecture need to be? The question to ask is this: Is there enough information to allow a competent team to build a warehouse that meets the needs of the business? As for how long it will take, the architecture effort grows as more people are added to its development (it becomes “techno-politically complex”) and as the resulting system needs to be more complex (“functionally complex”).
Like almost everything in data warehousing, an iterative process is best. You can't do it all at once because it's too big—and the business won't wait. Also, Thornthwaite says, the data warehouse market isn't complete yet. So begin with high leverage, high-value parts of the process. Then, use your success to make a case for additional phases.
Conclusions
To sum up, the benefits of having a data warehouse architecture are as follows:
  • Provides an organizing framework - the architecture draws the lines on the map in terms of what the individual components are, how they fit together, who owns what parts, and priorities.
  • Improved flexibility and maintenance - allows you to quickly add new data sources, interface standards allow plug and play, and the model and meta data allow impact analysis and single-point changes.
  • Faster development and reuse - warehouse developers are able to understand the data warehouse process, data base contents, and business rules more quickly.
  • Management and communications tool - define and communicate direction and scope to set expectations, identify roles and responsibilities, and communicate requirements to vendors.
  • Coordinate parallel efforts - multiple, relatively independent efforts have a chance to converge successfully. Also, data marts without architecture become the stovepipes of tomorrow.
Thornthwaite recommends that companies align with business requirements but remain practical. He also emphasizes the importance of keeping up with advances in the data warehouse industry. Finally, remember that there is always an architecture: implicit or explicit, “almost in time” or planned. Experience shows that the planned and explicit ones have a better chance of succeeding.
Source
“From Bauhaus to warehouse: Understanding data warehouse architecture requirements,” presented by Warren Thornthwaite at DCI's Data Warehouse Summit, held December 8-10, 1998 in Phoenix, AZ. 

SOURCE:http://frame.inetgiant.in/le/aHR0cDovL3d3dy51c2Vycy5xd2VzdC5uZXQvfmxhdXJhbWgvcmVzdW1lL3Rob3JuLmh0bQ%3d%3d

Data Warehouse Architect


Employer Description

EWSolutions (www.EWSolutions.com) is a high-end, full service systems integrator that is completely focused on data warehousing, metadata management, enterprise architecture and data management. We pride ourselves on employing only the "best of the best" consultants at each level in our firm. Our consultants have published over 300 articles and have written multiple top selling technology books. EWSolutions is a fast growing and progressive company that has doubled in size for two consecutive years. We are committed to providing our high-achieving consultants with full life-cycle metadata repository, enterprise architecture and data warehousing projects employing leading edge technologies and software tools. In addition, our consultants have the opportunity to work with the best minds in the metadata, enterprise architecture and data warehousing arena on high visibility projects that are focused on attaining results for our clients. EWSolutions' clients have been selected for various IT industry awards and accolades (visit our website for additional details). Our "100% project success" banner is not a slogan but rather a statistic. We don't fail on our projects.

Responsibilities

The Data Warehouse Architect has the overall responsibility for assessing requirements and defining the strategy, technical architecture, implementation plan, and delivery of data warehouse projects for major organizations.
The Data Warehouse Architect must have prior experience leading successful data warehousing implementations as well as a broad background and experience with IT application development. This individual is responsible for establishing the long-term strategy and technical architecture as well as the short-term scope for a multi-phased data warehouse effort. This person must have strong professional consulting skills and the ability to communicate well at all levels of the organization.
  • Find/generate executive level support for the data warehouse initiative
  • Define key business drivers for the data warehouse initiative
  • Deliver a project scope that directly supports the key business drivers
  • Define the overall data warehouse architecture (e.g., ETL process, ODS, EDW, Data Marts)
  • Define technical requirements, technical and data architectures for the data warehouse
  • Recommend/select data warehouse technologies (e.g., ETL, DBMS, Data Quality, BI)
  • Direct the discovery process
  • Design and direct the ETL process, including data quality and testing
  • Design and direct the information access and delivery effort for the data warehouse
  • Define and direct the implementation of security requirements for the data warehouse
  • Define meta data standards for the data warehouse
  • Direct the data warehouse meta data capture and access effort
  • Define production release requirements and sustainment architecture

Qualifications

  • Bachelor's degree from an accredited college or university required
  • BS/MS in Engineering, IT, or Computer Science preferred
  • Experience in leading successful enterprise data warehousing efforts
  • Strong technical skills with all data warehousing technologies (e.g., ETL, ODS, DBMS, BI)
  • Strong data modeling skills (normalized and multidimensional)
  • Strong business and communication skills
  • Broad based information technology experience (e.g., languages, methods, EAI, XML)
  • Strong professional consulting skills
  • Current knowledge of the data warehouse market, vendors, and standards bodies
EWSolutions is still a relatively small firm with a family atmosphere, without big-firm politics. If this sounds like the type of company that you want to work for and you are prepared to take on the challenges and rewards of working with an exciting and high-growth entrepreneurial company, please send your resume in MS Word format to info@EWSolutions.com. Resumes in PDF format will not be accepted.

WHY DATA WAREHOUSE?

You wish to achieve the goals of your company. You want to know your (potential) clients, their demands and needs. Furthermore, you want to know your present and future competitors, and what they are doing to meet your clients' needs.
Having established your goals, you want to monitor them, in order to determine to what extent they are being achieved.
You do suspect that the management information you need is available somewhere within your company, but you don’t know where. When you ask your analysts, they come up with different results.
In decision making, managers have a pressing business need for timely management information. It is vital to have a proper structure in the whole compilation of data within your company. In many organizations, however, the current structure blocks effective and efficient use of existing data. Most problems are caused by the fact that the required information is stored in different systems, using different definitions or different formats. This frustrates a consistent, overall view of how your company is doing.
The creation of an integrated source of data, specifically designed to provide information in support of the decision-making process, business-wide analysis, and performance monitoring, can greatly enhance your decision support capability. This is what we call a data warehouse. A data warehouse is:
  • Subject Oriented: data is organized around the subjects of interest to managers, such as clients, markets, products, suppliers, etc.
  • Integrated: data from different sources is harmonized and uniquely coded;
  • Time Dependent: all data is stored together with a date/time stamp;
  • Consistent: data is uniquely defined, its description (location, meaning, ownership) is stored in a metabase;
  • Nonvolatile and Historical Integrity: data is periodically refreshed, re-aggregated as required, but in principle not changed.
A data warehouse helps to make the information you need available:
  • Data derived from different sources can be retrieved in an integrated fashion;
  • Due to the subject oriented design, data can be approached from (a combination of) different angles or dimensions, enabling multidimensional analysis.
  • Reports are reliable and verifiable: you are able to search through the metabase to determine the source of the data and the rules which were used to produce it;
  • Analyses produced in different parts of the organization are consistent and are reproducible;
  • The number of interfaces to be maintained between transaction processing systems and decision support systems is minimal;
  • Due to the fact that historical data is maintained, it becomes possible to carry out analyses over periods of time and to do datamining;
  • Because transaction processing systems are separate from the data warehouse, it becomes possible to perform complex analyses without affecting the performance of these systems
[Figure: Data Warehouse Architecture]

SOURCE:http://www.breteler.com/DWH.htm

Overview of OLAP Capabilities

Oracle OLAP provides the query performance and calculation capability previously found only in multidimensional databases to Oracle's relational platform. In addition, it provides a Java OLAP API that is appropriate for the development of internet-ready analytical applications. Unlike other combinations of OLAP and RDBMS technology, Oracle OLAP is not a multidimensional database using bridges to move data from the relational data store to a multidimensional data store. Instead, it is truly an OLAP-enabled relational database. As a result, Oracle provides the benefits of a multidimensional database along with the scalability, accessibility, security, manageability, and high availability of the Oracle database. The Java OLAP API, which is specifically designed for internet-based analytical applications, offers productive data access.

Benefits of OLAP and RDBMS Integration

Basing an OLAP system directly on the Oracle database server offers the following benefits:

Scalability

There is tremendous growth along three dimensions of analytic applications: number of users, size of data, and complexity of analyses. There are more users of analytical applications, and they need access to more data to perform more sophisticated analysis and target marketing. For example, a telephone company might want a customer dimension to include detail such as all telephone numbers as part of an application that is used to analyze customer turnover. This would require support for multi-million row dimension tables and very large volumes of fact data. Oracle can handle very large data sets using parallel execution and partitioning, as well as offering support for advanced hardware and clustering.
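As a rough sketch of how these scalability features surface in SQL, a large fact table might be range-partitioned by date and enabled for parallel query. The table and column names here are illustrative, not from the original text.

  CREATE TABLE call_detail (
    call_date     DATE,
    phone_number  VARCHAR2(15),
    duration_sec  NUMBER,
    charge        NUMBER(10,2)
  )
  PARTITION BY RANGE (call_date) (
    PARTITION p_2010_q4 VALUES LESS THAN (DATE '2011-01-01'),
    PARTITION p_2011_q1 VALUES LESS THAN (DATE '2011-04-01'),
    PARTITION p_future  VALUES LESS THAN (MAXVALUE)
  )
  PARALLEL;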

Availability

Partitioning allows management of precise subsets of tables and indexes, so that management operations affect only small pieces of these data structures. By partitioning tables and indexes, data management processing time is reduced, thus minimizing the time data is unavailable. Transportable tablespaces also support high availability. With transportable tablespaces, large data sets, including tables and indexes, can be added with almost no processing to other databases. This enables extremely rapid data loading and updates.

Manageability

Oracle lets you precisely control resource utilization. The Database Resource Manager, for example, provides a mechanism for allocating the resources of a data warehouse among different sets of end-users.
Another resource management facility is the progress monitor, which gives end users and administrators the status of long-running operations. Oracle maintains statistics describing the percent-complete of these operations. Oracle Enterprise Manager lets you view a bar-graph display of these operations showing what percent complete they are. Moreover, any other tool or any database administrator can also retrieve progress information directly from the Oracle data server using system views.
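For example, the progress information described above can be read directly from the standard V$SESSION_LONGOPS dynamic performance view; the query below is a simple sketch.

  SELECT sid, opname, target, sofar, totalwork,
         ROUND(100 * sofar / totalwork, 1) AS pct_complete
  FROM   v$session_longops
  WHERE  totalwork > 0
  AND    sofar < totalwork;   -- long operations still running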

Backup and Recovery

Oracle provides a server-managed infrastructure for backup, restore, and recovery tasks that enables simpler, safer operations at terabyte scale. Some of the highlights are:
  • Details related to backup, restore, and recovery operations are maintained by the server in a recovery catalog and automatically used as part of these operations.
  • Backup and recovery operations are fully integrated with partitioning. Individual partitions, when placed in their own tablespaces, can be backed up and restored independently of the other partitions of a table.
  • Oracle includes support for incremental backup and recovery using Recovery Manager, enabling operations to be completed efficiently within times proportional to the amount of changes, rather than the overall size of the database.

Security

The security features in Oracle have reached the highest levels of U.S. government certification for database trustworthiness. Oracle's fine grained access control enables cell-level security for OLAP users. Fine grained access control works with minimal burden on query processing, and it enables efficient centralized security management.
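A hedged sketch of how fine grained access control is typically set up: a policy function returns a predicate that the database appends to queries on the protected table (this sketch shows row-level filtering; the schema, table, and function names are assumptions, while DBMS_RLS.ADD_POLICY is the standard Oracle procedure for registering such a policy).

  -- Policy function: limit each user to rows for the region carried in the
  -- client identifier (how the identifier is set is an assumption here)
  CREATE OR REPLACE FUNCTION region_policy (
    p_schema IN VARCHAR2,
    p_object IN VARCHAR2
  ) RETURN VARCHAR2
  IS
  BEGIN
    RETURN 'region = SYS_CONTEXT(''USERENV'', ''CLIENT_IDENTIFIER'')';
  END;
  /

  BEGIN
    DBMS_RLS.ADD_POLICY(
      object_schema   => 'DWH',
      object_name     => 'SALES_FACT',
      policy_name     => 'SALES_REGION_POLICY',
      function_schema => 'DWH',
      policy_function => 'REGION_POLICY',
      statement_types => 'SELECT');
  END;
  /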

Overview of Data Mining

Oracle Data Mining (ODM) embeds data mining within the Oracle Database. The data never leaves the database — the data, data preparation, model building, and model scoring results all remain in the database. This enables Oracle to provide an infrastructure for application developers to integrate data mining seamlessly with database applications. Typical examples of applications in which data mining is used include call centers, ATMs, ERM, and business planning applications.
By eliminating the need for extracting data into specialized tools and then importing the results back into the database, you can save significant amounts of time. In addition, by having the data and the data model in the same location (an Oracle database), there is no need to export the model as code.
Data mining functions such as model building, testing, and scoring are provided through a Java API.
Oracle Data Mining supports the following algorithms:
  • For classification, Naive Bayes, Adaptive Bayes Networks, and Support Vector Machines (SVM)
  • For regression, Support Vector Machines
  • For clustering, k-means and O-Cluster
  • For feature extraction, Non-Negative Matrix Factorization (NMF)
  • For sequence matching and annotation, BLAST


SOURCE:http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10743/bus_intl.htm

Overview of Analytic SQL

Oracle has introduced many SQL operations for performing analytic operations in the database. These operations include ranking, moving averages, cumulative sums, ratio-to-reports, and period-over-period comparisons. Although some of these calculations were previously possible using SQL, the new syntax offers much better performance.
This section discusses SQL for aggregation, SQL for analysis, and SQL for modeling.

SQL for Aggregation

Aggregation is a fundamental part of data warehousing. To improve aggregation performance in your warehouse, Oracle provides extensions to the GROUP BY clause to make querying and reporting easier and faster. Some of these extensions enable you to:
  • Aggregate at increasing levels of aggregation, from the most detailed up to a grand total
  • Calculate all possible combinations of aggregations with a single statement
  • Generate the information needed in cross-tabulation reports with a single query
These extensions let you specify exactly the groupings of interest in the GROUP BY clause. This allows efficient analysis across multiple dimensions without performing a CUBE operation. Computing a full cube creates a heavy processing load, so replacing cubes with grouping sets can significantly increase performance. CUBE, ROLLUP, and grouping sets produce a single result set that is equivalent to a UNION ALL of differently grouped rows.
To enhance performance, these extensions can be parallelized: multiple processes can simultaneously run all of these statements. These capabilities make aggregate calculations more efficient, thereby enhancing database performance and scalability.
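As a brief illustration (against a hypothetical sales table, not one defined in this text), ROLLUP and GROUPING SETS extend GROUP BY like this:

  -- Subtotals at increasing levels plus a grand total
  SELECT   country, state, SUM(amount_sold) AS total_sales
  FROM     sales
  GROUP BY ROLLUP (country, state);

  -- Only the groupings of interest, without computing a full cube
  SELECT   country, product, channel, SUM(amount_sold) AS total_sales
  FROM     sales
  GROUP BY GROUPING SETS ((country, product), (country, channel));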
One of the key concepts in decision support systems is multidimensional analysis: examining the enterprise from all necessary combinations of dimensions. We use the term dimension to mean any category used in specifying questions. Among the most commonly specified dimensions are time, geography, product, department, and distribution channel, but the potential dimensions are as endless as the varieties of enterprise activity. The events or entities associated with a particular set of dimension values are usually referred to as facts. The facts might be sales in units or local currency, profits, customer counts, production volumes, or anything else worth tracking.
Here are some examples of multidimensional requests:
  • Show total sales across all products at increasing aggregation levels for a geography dimension, from state to country to region, for 1999 and 2000.
  • Create a cross-tabular analysis of our operations showing expenses by territory in South America for 1999 and 2000. Include all possible subtotals.
  • List the top 10 sales representatives in Asia according to 2000 sales revenue for automotive products, and rank their commissions.
All these requests involve multiple dimensions. Many multidimensional questions require aggregated data and comparisons of data sets, often across time, geography or budgets.



SQL for Analysis

Oracle has advanced SQL analytical processing capabilities using a family of analytic SQL functions. These analytic functions enable you to calculate:
  • Rankings and percentiles
  • Moving window calculations
  • Lag/lead analysis
  • First/last analysis
  • Linear regression statistics
Ranking functions include cumulative distributions, percent rank, and N-tiles. Moving window calculations allow you to find moving and cumulative aggregations, such as sums and averages. Lag/lead analysis enables direct inter-row references so you can calculate period-to-period changes. First/last analysis enables you to find the first or last value in an ordered group.
Other features include the CASE expression. CASE expressions provide if-then logic useful in many situations.
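The following sketch (again using hypothetical table and column names) shows a ranking, a moving window, a lag comparison, and a CASE expression together:

  SELECT product,
         month,
         amount_sold,
         RANK() OVER (PARTITION BY month ORDER BY amount_sold DESC) AS sales_rank,
         AVG(amount_sold) OVER (PARTITION BY product
                                ORDER BY month
                                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_3m,
         amount_sold - LAG(amount_sold, 1) OVER (PARTITION BY product
                                                 ORDER BY month) AS change_vs_prior,
         CASE WHEN amount_sold >= 100000 THEN 'HIGH'
              WHEN amount_sold >= 10000  THEN 'MEDIUM'
              ELSE 'LOW'
         END AS sales_band
  FROM   monthly_sales;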
To enhance performance, analytic functions can be parallelized: multiple processes can simultaneously run all of these statements. These capabilities make calculations easier and more efficient, thereby enhancing database performance, scalability, and simplicity.

SQL for Modeling

Oracle's MODEL clause brings a new level of power and flexibility to SQL calculations. With the MODEL clause, you can create a multidimensional array from query results and then apply formulas to this array to calculate new values. The formulas can range from basic arithmetic to simultaneous equations using recursion. For some applications, the MODEL clause can replace PC-based spreadsheets. Models in SQL leverage Oracle's strengths in scalability, manageability, collaboration, and security. The core query engine can work with unlimited quantities of data. By defining and executing models within the database, users avoid transferring large datasets to and from separate modeling environments. Models can be shared easily across workgroups, ensuring that calculations are consistent for all applications. Just as models can be shared, access can also be controlled precisely with Oracle's security features. With its rich functionality, the MODEL clause can enhance all types of applications.
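A minimal sketch of the MODEL clause, patterned after the style of Oracle's documentation examples; the sales_view name and its data are assumptions:

  SELECT country, product, year, sales
  FROM   sales_view
  WHERE  country = 'Poland'
  MODEL
    PARTITION BY (country)
    DIMENSION BY (product, year)
    MEASURES    (sales)
    RULES (
      -- project 2002 video sales as the sum of the two prior years
      sales['Video', 2002] = sales['Video', 2000] + sales['Video', 2001]
    )
  ORDER BY country, product, year;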


Introduction to Data Warehousing and Business Intelligence

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources. It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction, transportation, transformation, and loading (ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

Characteristics of Data Warehousing

A common way of introducing data warehousing is to refer to the characteristics of a data warehouse as set forth by William Inmon:

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more about your company's sales data, you can build a warehouse that concentrates on sales. Using this warehouse, you can answer questions like "Who was our best customer for this item last year?" This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated

Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.

Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change. This is logical because the purpose of a warehouse is to enable you to analyze what has occurred.

Time Variant

In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse's focus on change over time is what is meant by the term time variant.
Typically, data flows from one or more online transaction processing (OLTP) databases into a data warehouse on a monthly, weekly, or daily basis. The data is normally processed in a staging file before being added to the data warehouse. Data warehouses commonly range in size from tens of gigabytes to a few terabytes. Usually, the vast majority of the data is stored in a few very large fact tables.

Differences Between Data Warehouse and OLTP Systems

Data warehouses and OLTP systems have very different requirements. Here are some examples of differences between typical data warehouses and OLTP systems:

Workload

Data warehouses are designed to accommodate ad hoc queries. You might not know the workload of your data warehouse in advance, so a data warehouse should be optimized to perform well for a wide variety of possible query operations.
OLTP systems support only predefined operations. Your applications might be specifically tuned or designed to support only these operations.

Data Modifications

A data warehouse is updated on a regular basis by the ETL process (run nightly or weekly) using bulk data modification techniques. The end users of a data warehouse do not directly update the data warehouse.
In OLTP systems, end users routinely issue individual data modification statements to the database. The OLTP database is always up to date, and reflects the current state of each business transaction.

Schema Design

Data warehouses often use denormalized or partially denormalized schemas (such as a star schema) to optimize query performance.
OLTP systems often use fully normalized schemas to optimize update/insert/delete performance, and to guarantee data consistency.

Typical Operations

A typical data warehouse query scans thousands or millions of rows. For example, "Find the total sales for all customers last month."
A typical OLTP operation accesses only a handful of records. For example, "Retrieve the current order for this customer."
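In SQL terms (with illustrative table names), the two kinds of operation might look like this:

  -- Typical data warehouse query: scans many rows, aggregates
  SELECT c.customer_name, SUM(s.amount_sold) AS total_sales
  FROM   sales s JOIN customers c ON s.customer_id = c.customer_id
  WHERE  s.sale_date >= ADD_MONTHS(TRUNC(SYSDATE, 'MM'), -1)
  AND    s.sale_date <  TRUNC(SYSDATE, 'MM')
  GROUP BY c.customer_name;

  -- Typical OLTP operation: touches only a handful of rows
  SELECT order_id, order_date, status
  FROM   orders
  WHERE  customer_id = :cust_id
  AND    status = 'OPEN';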

Historical Data

Data warehouses usually store many months or years of data. This is to support historical analysis.
OLTP systems usually store data from only a few weeks or months. Historical data is retained only as long as needed to meet the requirements of the current transaction.

Data Warehouse Architecture

Data warehouses and their architectures vary depending upon the specifics of an organization's situation. Three common architectures are:

Data Warehouse Architecture (Basic)

Figure 16-1 shows a simple architecture for a data warehouse. End users directly access data derived from several source systems through the data warehouse.
Figure 16-1 Architecture of a Data Warehouse

In Figure 16-1, the metadata and raw data of a traditional OLTP system is present, as is an additional type of data, summary data. Summaries are very valuable in data warehouses because they pre-compute long operations in advance. For example, a typical data warehouse query is to retrieve something like August sales.
Summaries in Oracle are called materialized views.
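For example, a summary like the August sales query above can be pre-computed as a materialized view (the table names here are assumptions) and, with query rewrite enabled, used transparently by the optimizer:

  CREATE MATERIALIZED VIEW sales_by_month_mv
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    ENABLE QUERY REWRITE
  AS
  SELECT t.calendar_month, p.product_category, SUM(s.amount_sold) AS total_sales
  FROM   sales s
  JOIN   times t    ON s.time_id    = t.time_id
  JOIN   products p ON s.product_id = p.product_id
  GROUP BY t.calendar_month, p.product_category;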

Data Warehouse Architecture (with a Staging Area)

In the architecture shown in Figure 16-1, you need to clean and process your operational data before putting it into the warehouse. You can do this programmatically, although most data warehouses use a staging area instead. A staging area simplifies building summaries and general warehouse management. Figure 16-2 illustrates this typical architecture.
Figure 16-2 Architecture of a Data Warehouse with a Staging Area

Data Warehouse Architecture (with a Staging Area and Data Marts)

Although the architecture in Figure 16-2 is quite common, you might want to customize your warehouse's architecture for different groups within your organization.
Do this by adding data marts, which are systems designed for a particular line of business. Figure 16-3 illustrates an example where purchasing, sales, and inventories are separated. In this example, a financial analyst might want to analyze historical data for purchases and sales.
Figure 16-3 Architecture of a Data Warehouse with a Staging Area and Data Marts

Overview of Extraction, Transformation, and Loading (ETL)

You need to load your data warehouse regularly so that it can serve its purpose of facilitating business analysis. To do this, data from one or more operational systems needs to be extracted and copied into the warehouse. The process of extracting data from source systems and bringing it into the data warehouse is commonly called ETL, which stands for extraction, transformation, and loading. The acronym ETL is perhaps too simplistic, because it omits the transportation phase and implies that each of the other phases of the process is distinct. We refer to the entire process, including data loading, as ETL. You should understand that ETL refers to a broad process, and not three well-defined steps.
The methodology and tasks of ETL have been well known for many years, and are not necessarily unique to data warehouse environments: a wide variety of proprietary applications and database systems form the IT backbone of any enterprise. Data has to be shared between applications or systems in an effort to integrate them, giving at least two applications the same picture of the world. This data sharing was mostly addressed by mechanisms similar to what we now call ETL.
Data warehouse environments face the same challenge with the additional burden that they not only have to exchange but to integrate, rearrange and consolidate data over many systems, thereby providing a new unified information base for business intelligence. Additionally, the data volume in data warehouse environments tends to be very large.
What happens during the ETL process? During extraction, the desired data is identified and extracted from many different sources, including database systems and applications. Very often, it is not possible to identify the specific subset of interest, therefore more data than necessary has to be extracted, so the identification of the relevant data will be done at a later point in time. Depending on the source system's capabilities (for example, operating system resources), some transformations may take place during this extraction process. The size of the extracted data varies from hundreds of kilobytes up to gigabytes, depending on the source system and the business situation. The same is true for the time delta between two (logically) identical extractions: the time span may vary between days/hours and minutes to near real-time. Web server log files for example can easily become hundreds of megabytes in a very short period of time.
After extracting data, it has to be physically transported to the target system or an intermediate system for further processing. Depending on the chosen way of transportation, some transformations can be done during this process, too. For example, a SQL statement which directly accesses a remote target through a gateway can concatenate two columns as part of the SELECT statement.
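A small sketch of a transformation applied while data is being transported, assuming a database link named src_db and the column names shown:

  -- Concatenate two source columns while pulling rows across a database link
  INSERT INTO stage_customers (customer_id, customer_name)
  SELECT customer_id,
         first_name || ' ' || last_name
  FROM   customers@src_db;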
If any errors occur during loading, an error is logged and the operation can continue.

Transportable Tablespaces

Transportable tablespaces are the fastest way for moving large volumes of data between two Oracle databases. You can transport tablespaces between different machine architectures and operating systems.
Previously, the most scalable data transportation mechanisms relied on moving flat files containing raw data. These mechanisms required that data be unloaded or exported into files from the source database. Then, after transportation, these files were loaded or imported into the target database. Transportable tablespaces entirely bypass the unload and reload steps.
Using transportable tablespaces, Oracle data files (containing table data, indexes, and almost every other Oracle database object) can be directly transported from one database to another. Furthermore, like import and export, transportable tablespaces provide a mechanism for transporting metadata in addition to transporting data.
The most common applications of transportable tablespaces in data warehouses are in moving data from a staging database to a data warehouse, or in moving data from a data warehouse to a data mart.
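In outline (the tablespace name and the exact export tooling are illustrative assumptions), transporting a tablespace involves making it read-only, moving its metadata and datafiles, and plugging it into the target:

  -- 1. On the source database: make the tablespace read-only
  ALTER TABLESPACE sales_q1 READ ONLY;

  -- 2. Export the tablespace metadata (for example, with Data Pump export in
  --    transportable-tablespace mode), then copy the datafiles to the target.

  -- 3. On the target database: import the metadata so the datafiles are
  --    plugged in, then make the tablespace read/write if updates are needed.
  ALTER TABLESPACE sales_q1 READ WRITE;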

Table Functions

Table functions provide the support for pipelined and parallel execution of transformations implemented in PL/SQL, C, or Java. Scenarios such as those mentioned earlier can be implemented without requiring the use of intermediate staging tables, which interrupt the data flow through the various transformation steps.
A table function is defined as a function that can produce a set of rows as output. Additionally, table functions can take a set of rows as input. Table functions extend database functionality by allowing:
  • Multiple rows to be returned from a function
  • Results of SQL subqueries (that select multiple rows) to be passed directly to functions
  • Cursors to be taken as input by functions
  • Functions to be parallelized
  • Result sets to be returned incrementally for further processing as soon as they are created (this is called incremental pipelining)
Table functions can be defined in PL/SQL using a native PL/SQL interface, or in Java or C using the Oracle Data Cartridge Interface (ODCI).
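A minimal pipelined table function in PL/SQL might look like the sketch below; the types, names, and the transformation itself are assumptions for illustration:

  CREATE TYPE order_row AS OBJECT (order_id NUMBER, amount NUMBER);
  /
  CREATE TYPE order_tab AS TABLE OF order_row;
  /
  CREATE OR REPLACE FUNCTION clean_orders (p_cur SYS_REFCURSOR)
    RETURN order_tab PIPELINED
  IS
    v_id     NUMBER;
    v_amount NUMBER;
  BEGIN
    LOOP
      FETCH p_cur INTO v_id, v_amount;
      EXIT WHEN p_cur%NOTFOUND;
      -- simple transformation: drop negative amounts
      IF v_amount >= 0 THEN
        PIPE ROW (order_row(v_id, v_amount));
      END IF;
    END LOOP;
    RETURN;
  END;
  /

  -- Rows stream through the function without an intermediate staging table
  SELECT *
  FROM   TABLE(clean_orders(CURSOR(SELECT order_id, amount FROM raw_orders)));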

External Tables

External tables let you use external data as a virtual table that can be queried and joined directly and in parallel without requiring the external data to be first loaded in the database. You can then use SQL, PL/SQL, and Java to access the external data.
External tables enable the pipelining of the loading phase with the transformation phase. The transformation process can be merged with the loading process without any interruption of the data streaming. It is no longer necessary to stage the data inside the database for further processing inside the database, such as comparison or transformation. For example, the conversion functionality of a conventional load can be used for a direct-path INSERT AS SELECT statement in conjunction with the SELECT from an external table. Figure 16-4 illustrates a typical example of pipelining.
Figure 16-4 Pipelined Data Transformation

The main difference between external tables and regular tables is that externally organized tables are read-only. No DML operations (UPDATE/INSERT/DELETE) are possible and no indexes can be created on them.
External tables are a complement to SQL*Loader and are especially useful for environments where the complete external source has to be joined with existing database objects and transformed in a complex manner, or where the external data volume is large and used only once. SQL*Loader, on the other hand, might still be the better choice for loading of data where additional indexing of the staging table is necessary. This is true for operations where the data is used in independent complex transformations or the data is only partially used in further processing.
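For illustration, an external table over a comma-delimited file (the directory, file, and column names are assumptions) can be defined and then loaded from in one pass:

  CREATE TABLE sales_ext (
    sale_date    VARCHAR2(10),
    product_id   NUMBER,
    amount_sold  NUMBER
  )
  ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY ext_data_dir
    ACCESS PARAMETERS (
      RECORDS DELIMITED BY NEWLINE
      FIELDS TERMINATED BY ','
      MISSING FIELD VALUES ARE NULL
    )
    LOCATION ('sales_2011.csv')
  )
  REJECT LIMIT UNLIMITED;

  -- Transform while loading, with no intermediate staging table
  INSERT /*+ APPEND */ INTO sales
  SELECT TO_DATE(sale_date, 'YYYY-MM-DD'), product_id, amount_sold
  FROM   sales_ext;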

Table Compression

You can save disk space by compressing heap-organized tables. Typical candidates for table compression are partitioned tables.
To reduce disk use and memory use (specifically, the buffer cache), you can store tables and partitioned tables in a compressed format inside the database. This often leads to a better scaleup for read-only operations. Table compression can also speed up query execution. There is, however, a slight cost in CPU overhead.
Table compression should be used with highly redundant data, such as tables with many foreign keys. You should avoid compressing tables with much update or other DML activity. Although compressed tables or partitions are updatable, there is some overhead in updating these tables, and high update activity may work against compression by causing some space to be wasted.
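For example (the table and partition names are illustrative), compression can be declared when a table is created, or applied to an existing partition by rebuilding it:

  -- New table stored compressed
  CREATE TABLE sales_history (
    sale_date    DATE,
    product_id   NUMBER,
    amount_sold  NUMBER
  ) COMPRESS;

  -- Compress an existing partition by moving (rebuilding) it
  ALTER TABLE sales MOVE PARTITION p_2010_q4 COMPRESS;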

Data Warehouse – Data Model

Dimensional Modeling Techniques
What is Dimensional Modeling?
  • Dimensional Modeling is a logical design technique that seeks to present the data in a standard framework that is intuitive and allows for high performance access.
Strengths of Dimensional Modeling
  • Predictable, standard framework (facts, dimensions)
  • Gracefully extendable
  • Standard Approaches to Standard Problems
  • Easy management of aggregates
Most Important Terminology – Data Warehouse
STAR Schema

- A database design that stores a central fact table surrounded by multiple dimension tables
- Star schema represents a compromise between the fully normalized model and the denormalized model (see the DDL sketch after the dimension table notes below)

Fact table
- Contains keys to dimensions, and measures
- Measures are typically described as the performance measures of the business
- Usually numerical, additive, and represent counts, currency amounts, percentages, or ratios
- Examples of measures are transaction amounts, balances, and counts of approved and declined transactions

Dimension table
- An entity by which the business views the measures (facts)
- Dimensions are groupings of similar data into a larger category
- Dimensions may be hierarchical in nature, like Time – days into months, months into quarters
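A minimal star schema sketch in DDL (the table and column names are illustrative, not from the original material):

  CREATE TABLE dim_time (
    time_key   NUMBER PRIMARY KEY,
    day_date   DATE,
    month_name VARCHAR2(10),
    quarter    VARCHAR2(2),
    year       NUMBER(4)
  );

  CREATE TABLE dim_product (
    product_key  NUMBER PRIMARY KEY,
    product_name VARCHAR2(100),
    category     VARCHAR2(50)
  );

  -- Fact table: foreign keys to the dimensions plus additive measures
  CREATE TABLE fact_transactions (
    time_key       NUMBER REFERENCES dim_time (time_key),
    product_key    NUMBER REFERENCES dim_product (product_key),
    txn_amount     NUMBER(12,2),
    approved_count NUMBER,
    declined_count NUMBER
  );
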
What is OLAP?

  • OLAP is an On-Line Analytical Processing technology which creates new business information from existing data, through a rich set of business transformations and numerical calculations.

- Involves drilling down to lower levels
- Involves roll-ups to higher levels of summarization

ETL Tools

  • Informatica
  • Data Integrator
  • Data Stage
  • MS SQL-Server DTS
  • Ab Initio
  • Data Junction


TOOLS for Front End Analysis (BI tools)

  • SAP BusinessObjects
  • Cognos
  • Web Focus
  • Brio
  • Hyperion
  • Micro Strategy

What is Business Intelligence?
  • Business Intelligence is a broad category of applications and technologies for:
– gathering,
– storing,
– providing access to, and
– analyzing data
to help enterprise users make better, faster business decisions.
Why Business Intelligence?
– By definition, the moment any given business is operating, it begins generating data. Some obvious examples are banking, sales, and production data.
– In addition, there also exist large volumes of data that are important to the business but not directly generated by business operations. Examples are market data, competitive data, tenders and proposals, etc.
– None of the above information can be used in its raw form by corporate management to make decisions, although it is critical in helping make those decisions.
– Therein lies the necessity for Business Intelligence.
– BI technologies bring decision-makers the data in a form they can quickly digest and apply to their decision making.
– BI turns data into information for the people making decisions in a company.
Business Intelligence applications include the activities of:
– Decision support systems,
– Query and reporting,
– Online analytical processing (OLAP),
– Statistical analysis, forecasting, and
– Data visualization

BI Benefits
– Make better decisions by turning enterprise data into real information
– Gain a competitive advantage by getting timely, flexible, sophisticated analysis of corporate data

DATA WAREHOUSING ARCHITECTURE

An analysis and reporting system that covers all the areas an organization requires to support its business decisions at an enterprise level.
Reporting Pressures
Relieve reporting pressure on transactional databases.
Restructure Data
Restructure data to speed up data analysis and reporting capabilities.
Reduce Complexity
Reduce the user complexity associated with generating new reports
Clean Data
Create a repository of “clean data” that does not require wholesale changes to the transactional systems or business processes.
Multiple Source Analysis

Allow easier reporting across multiple transactional systems and external data sources.
Historic Analysis
To provide a data source supporting a longer span of time than can be reasonably supported on the transactional systems.