Sunday, November 28, 2010

The Generic Meta Data Repository

The most important data model that the data warehouse developer needs to understand in the business intelligence environment is the meta data repository model. The focus of many projects is on operational source system or data warehouse models because that is where the actual raw data is stored. While this is unquestionably the primary purpose of the business intelligence environment, it is the meta data repository and its data model structure that allow the various function areas in the warehouse to communicate. The repository's role in furnishing context to the data content, processes and reports that are elements of the environment is often overlooked. The meta data repository model is the central hub of the environment. It is the one place where integration occurs between operational source systems; the data warehouse; the extraction, transformation and load (ETL) processes; the business views; reports; and operational statistics.

While many firms understand the significance of the meta data repository, few have the experience to construct one from a blank sheet of paper. This is where a generic meta data repository steps in to help get the project started by providing a template that can be tailored by the repository architect for specific business needs.

Unlike a generic data warehouse model that is useful only for a specific vertical industry or functional purpose, the generic meta data repository model can be utilized across various businesses. This reusability is possible because the repository's purpose is to capture and store meta data – data about data. This function is not influenced by the type of business being supported but by the data warehouse environment infrastructure needed. The generic repository model is not an all-encompassing "shrink wrap" solution for any company. Its purpose is to provide the repository architect with a point of departure to begin the revision process needed for the enterprise. The repository architect's modeling decisions are based primarily on what other type of warehouse components need to be supported.

Meta Data Components

There are typically seven standard components to a meta data repository data model as seen in Figure 1. These components can be viewed as the subject areas of the generic meta data repository model. Depending on the requirements of the business, the model may be revised by removing, editing or integrating a new component area. If a company is using a data cleansing tool to scrub dirty data from a source system, the ETL component area of the model may need to be revised to capture pertinent information. Alternatively, if capture of query statistics is deemed extraneous or is accommodated through some other means in the environment (e.g., OLAP tool, meta data layer, database), this component area may be removed from the model.


Figure 1: Meta Data Repository Components

The first component area of the generic meta data repository model contains information about the logical data warehouse model. The logical data warehouse model contains all the business entities for specific business subject areas, relationships between the entities and attributes of each entity comprising the target model. The level of detail found in the logical model will vary depending on your firm's data modeling methodology and practices. Individual modeling practices for the logical data model may also necessitate changes in the generic model to accommodate additional information.

The second component contains very detailed information about the physical data warehouse target model. The logical data warehouse model must go through a transformation process in order to become a physical data warehouse dimensional model. A fundamental set of dimensional modeling steps is applied to the logical model in order to produce the physical model (e.g., addition of time). Depending on the requirements of your dimensional data modeling and/or database administration departments, this repository component may be extended to include additional physical attributes such as indexes and fragmentation strategies.

The third component of the generic model contains physical information about the sources feeding data to the data warehouse. This operational information can originate from source databases, file extractions, spreadsheets, Internet and other formats. The information stored in this component can be used to alert the data warehouse development team of pending changes to a source that will potentially affect the data warehouse model, ETL and report processing. Additionally, the information in this component can be used by warehouse analysts, both business and technical, for reconciliation and auditing purposes.

The fourth component of the generic model is the source-to-target data warehouse mappings. This component stores the table/column cross-reference mapping and semantic resolution between the source operational systems and the target physical data warehouse model. This is one of the primary purposes of a meta data repository since other environment tools often do not perform this function or do so inadequately. Once again, the detail to which the semantic resolution is documented will vary based on your firm's desires and goals. You may decide to extend the model to take a more structured approach toward business rule identification and use.

The fifth component is the business subject areas. The logical grouping of physical data warehouse tables is stored here. This information provides business end users with a more intuitive navigation method or view of the information stored in the data warehouse. This information can be used by both ETL and front-end reporting tools. Your particular business needs may require a more hierarchical method to store and navigate subject area information, thus requiring revisions to the generic model.

The sixth component of the generic model is ETL statistics. Information on individual extract, transformation and load processes is captured within this component. This information can be used to determine process improvements, database enhancements, fault isolation and other optimization procedures for the warehouse. Use of this particular component will depend greatly on your specific ETL methodology and toolset capabilities. The capture of this type of information in the repository database will require some degree of additional time during ETL processing.

The seventh and last component of the generic meta data repository model is query statistics. Information about every query made against the data warehouse database is stored in this component. This information is used to determine a variety of optimizations that can be performed against the warehouse database. Usage statistics on tables and columns is analyzed to identify dormant data, aggregation and index candidates. Use of this component will be dependent on whether your company has purchased a data monitoring tool for the BI environment since these products typically come with their own database.
Logical View of the Generic Repository Model

The logical modeling view of the generic meta data repository is illustrated in Figure 2. It is important to remember that the generic repository model should only be used as a guide, not as an all-encompassing road map. Individual components of the model may be revised, supplemented and/or removed depending on the business and infrastructure environment needs of your specific enterprise. Methods of versioning and source control for the model components are not addressed in sufficient detail for most projects, but are left to the discretion and design of the repository architect.

Figure 2: Logical View of the Generic Repository Model

The four tables that store both logical and physical information about the actual data warehouse model are the Target Table, Target Column, Target Column Map and Target Domain.

The Target Table is able to store both logical and physical views of tables in the warehouse through use of the Target Table Type column, which distinguishes the two categories. An implied business rule for this table is that only physical table types are actually mapped to the source systems feeding information to the warehouse. The Target Table Name column contains either the logical or the actual physical database name of the warehouse table. The Table Column Map table simply provides a cross-reference of the logical and/or physical tables to their associated columns.

The Target Column table contains both technical and business information about the field in the table (logical or physical since the Target Table Type is inherited onto this table from its parent). The Target Column Name column contains either the logical or the actual physical database name of the warehouse field. The Target Column Business Rules field is used to denote any conventions or practices the column must follow. For example, an amount field must have a corresponding currency value.

The Target Domain table contains definitions of values for columns containing codes (lookup list) only. Each possible domain value or code for a column is stored in this table. For example, the column country code has domain values that include USA for United States and CAN for Canada, etc.

The three tables that store physical information about the operational systems that feed information to the data warehouse are the Source Table, Source Column and Source Domain.

The Source Table contains information about the source system database or extract file used to populate the target tables of the warehouse. The Source ID column is used to uniquely identify a particular system of record (e.g., ERP, Order Management 1, Trouble Ticket, Client Extract 5). The Source Format Type provides a means to identify the category of the source information such as a server/database, directory/file or spreadsheet file. In cases where operational information is being extracted directly from a source database table, the Source DBMS column contains the physical names of these items.

The remaining tables and columns in this meta data repository component, Source Column and Source Domain, follow the same definitions found in the Target component tables, Target Column and Target Domain by just substituting source in place of target.

The Source To Target Column and Domain Map tables provide the necessary design strategy to build extraction, transformation and load (ETL) processes by linking the operational sources to the warehouse tables they feed.

In addition to the actual mapping documented in these cross- reference tables, any additional instructions that need to be described are located in the Mapping Semantic Resolution columns on both tables.

Three tables – ETL Process, ETL Process Source Map and ETL Process Statistics – map warehouse tables and operational sources to the procedures that access them and capture statistics about load processing. Population of these tables will depend on the ETL methodology and products utilized by your organization.

The ETL Process table contains a Process ID column to uniquely identify the ETL procedure in the warehouse environment.

The ETL Process Statistics table contains both the mapping of the process to a warehouse table or tables plus the date/time statistics of the process for a specific batch cycle load. A single ETL process can load one or many target warehouse tables depending on the sources and/or business requirements involved.

The ETL Process Source Map tables provide a cross-reference of ETL processes to operational systems or extract files.

The Subject Area and Subject Area Table Map provide the end user with a logical grouping or business view of the data warehouse tables (e.g., sales, financial, human resources).

The Subject Area table contains a Subject Area ID column to uniquely identify the group of tables within the enterprise from a business perspective. The Subject Area Table Map simply groups the warehouse tables (logical and physical) to a subject area.

Two tables, Query Statistics and Query Table Column Hits, map warehouse tables and columns to the query requests that access them and capture statistics about processing. Population of these tables will depend on the DBMS, reporting tool and/or data-monitoring product being employed.

The Query Statistics table contains various information about query results made against the warehouse. These requests can be from your warehouse front-end reporting tool or from ad hoc requests made directly against the database depending on the informational needs of the DBAs, data acquisition developers, data access developers or the architect. The Query Table Column Hits table provides a cross-reference of database queries to warehouse tables and columns.

The generic repository model will need to be tailored by the repository architect. Revisions, additions or deletions to the generic repository model will depend on your warehouse development methodology and business intelligence products being used. Capture of statistical information, in some cases, may not be possible due use of proprietary standards or lack of integration between developmental components.

For those firms that decide to develop a meta data repository in house, a generic model can help to quickly advance these efforts by providing a starting template. This generic model can also assist those firms evaluating meta data repository products in the marketplace.

SOURCE:http://www.information-management.com/issues/20001001/2744-1.html

0 comments:

Post a Comment

newer post older post Home