Monday, February 27, 2012

Make room for data

mobilkom austria uses multi-value compression to economically increase data capacity and boost performance.
Given enough time, most organizations will reach a point when they wonder how their Teradata systems can possibly take in any more data. Buying new hardware can solve this problem—but that may be unnecessary.
One company found a way to more readily accommodate its growing volumes of data. Teradata Magazine spoke with Dietmar Trummer, senior IT architect at mobilkom austria, about its use of Teradata multi-value compression. Trummer explained how he was able to free space, increase performance and save money, all without reducing service.

Q: What prompted you to investigate using Teradata’s multi-value compression?

A: Starting at the end of 2006, we anticipated performance issues in the near future due to a shortage of free disk space.


Our users had data analysis results and reports to deliver on time, so in my role as a Teradata system user and developer of smaller data marts, I tried to figure out how to cope with this disk space issue.

Q: What approaches did you consider?

A: Management and administrators at mobilkom austria considered several options: buy new hardware, archive data to external storage or remove indexes. We narrowed the field to these choices:

    Reduce redundant data by deleting specialized data marts. This required developing more complex queries to rebuild the logic of the data marts.
    Aggregate data and skip the details, or reduce history by removing old data. We would lose information and would have to reduce our internal service portfolio.
    Optimize the table definitions based on Teradata technology. This option offers the use of size-optimal data types, primary indexes with optimal table distribution and multi-value compression with optimal size impact.

Q: Why did you select Teradata’s multi-value compression approach?

A: First of all, we didn’t want to develop more complex queries. Second, we didn’t want to reduce
our service portfolio.

I was looking for a solution that had the least possible influence on the daily work of the users and developers. Optimizing table definitions appeared to be worth investigating.

To be honest, the multi-value compression approach was the most interesting. It promised high potential and also posed a technical and mathematical challenge. Besides, I had some prior experience implementing the compression approach.

Two years ago during the development of a simple data mart, I had to store a large amount of intermediate data in a table. The space in my staging database was insufficient, so I needed to find a way to reduce the table space.

However, with multi-value compression I could fit my table into the staging database in a short time, without the help of an administrator.

One year later, I adapted what I learned from this experience and developed a method to ease our new storage problems using multi-value compression.
Q: Is this code for using compression as simple as it looks?

 CREATE MULTISET TABLE dwh_ua.uaf_contract (
     phone_id        INTEGER NOT NULL,
     call_mode_code  CHAR(1) COMPRESS ('A','P','S'),
     source_table_id SMALLINT NOT NULL COMPRESS (1,4,10),
     charge_usage    DECIMAL(18,4) COMPRESS (0.0000),
     ...
 ) PRIMARY INDEX (phone_id)

A: Yes, it is. We optimized our biggest table using multi-value compression: 1TB without compression; 480GB with manual (not optimized) compression; 370GB with optimized compression. The sample code is actually a fraction of the table definition.
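
For reference, before-and-after table sizes like the ones quoted above can be read from the Teradata data dictionary. This is a minimal sketch, assuming the standard DBC.TableSize view and reusing the database and table names from the sample code; substitute your own:

 SELECT DatabaseName,
        TableName,
        SUM(CurrentPerm) AS current_perm_bytes  /* CurrentPerm is reported per AMP, so sum it for the table total */
 FROM   DBC.TableSize
 WHERE  DatabaseName = 'dwh_ua'
 AND    TableName    = 'uaf_contract'
 GROUP  BY DatabaseName, TableName;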

That’s the simple part—the challenging part is how to get the values for optimized compression.
Q: What about the importance of data analysis and finding the break-even point of compression? How does multi-value compression work, and how do you find that break-even point?

A: Unlike other databases, the Teradata Database compresses specific user-defined values to zero space. That sounds like magic but, of course, it isn’t. The idea is to reduce the row size of many rows by a large amount, and to enlarge the row size of all rows by a small amount.

The code that follows and the corresponding row storage table [see table 1] illustrate how this works: We observed the compression of a single column. All rows with a call_mode_code value contained in the compress list ('A', 'P' or 'S') skip the storage for this column in the row data; the compressed values are represented by a binary code, so the column is reduced to zero space and the rows get smaller. In table 1, the binary code "00" indicates that the value is stored in the row data.

 … call_mode_code CHAR(1) COMPRESS ('A','P','S'), …

 2 bits: 'A' = 01, 'P' = 10, 'S' = 11

The binary code is also used in the presence bits to indicate whether a value is not compressed. For rows whose call_mode_code is not 'A', 'P' or 'S', the value itself is still stored and a small amount of storage is added in the presence bits, so those rows get slightly larger. This explains why it is important to know the values contained in your table.

In summary, if you use compression, all rows—regardless of whether the column value is compressed—have to add the presence bits to their row storage.

Then the question arose: How can we use this information to find the optimal compression list for a given column of a table?

The answer is detailed mathematically, but the principle is not very complex: The values that occur more frequently are more likely to be added to the compression list. Therefore, we had to fill the compression list with the most frequently occurring values. The break-even point is reached when the addition of a new value to the list will not result in a further decrease of the column’s total space consumption.
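
As a rough sketch of that principle (a simplified model of my own, ignoring row alignment and the byte-rounding of presence bits): for a fixed-width column of w bytes in a table of n rows, a compress list of k values requires ceil(log2(k+1)) presence bits in every row, while saving the full w bytes in each of the f rows whose value is on the list. The net saving is then approximately

\[
\text{saving} \;\approx\; f \cdot w \;-\; \frac{n \cdot \lceil \log_2(k+1) \rceil}{8} \ \text{bytes,}
\]

and adding one more value to the list pays off only while the extra rows it covers save more than any resulting increase in the presence bits costs across all n rows.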

But be aware that this equation reflects only a static point of view. Data changes over time, so an analysis of data volatility is also necessary and has to be taken into consideration when finding the best compression list.
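
In practice, the candidate values for a single column can be found with a simple frequency query. This is a minimal sketch against the column from the sample code above (table and column names as shown there); the values at the top of the result are the ones worth testing against the break-even rule:

 SELECT call_mode_code,
        COUNT(*) AS row_count        -- the most frequent values are the best compression candidates
 FROM   dwh_ua.uaf_contract
 GROUP  BY call_mode_code
 ORDER  BY row_count DESC;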
Q: What kinds of data gain the best compression by setting “obvious” compression values? Are there problems using this technique?

A: That’s difficult to generalize because we experienced different and unexpected kinds of data with great compression performance. Of course columns with large data types have a better compression ratio than those with small data types. Also, columns that contain a few very frequent values gain good compression. These values might be words in natural language, flags, categories, status and years.

But data columns that contain measures can also be a good source for compression—especially default values, zeros and values that are near the most frequent value of a Gaussian or Poisson distributed column.

The problem with what might be considered "obvious" compression values is that it is difficult to find the break-even point—i.e., the optimal compression list—without data analysis. We frequently found that manual compression without data analysis leads to compression lists that are too large. In these cases, too many values are used for compression, which can lead to "over-compression." Less would have been better.
Q: You described actual results when using multi-value compression. How much table scan performance improvement did you see? Do you have examples to share?

A: We didn’t analyze the table scan performance in a way that would enable me to present a percentage of performance improvement. Our result is based on the theoretical fact that the table scan performance is determined by the table size—and by our users’ experiences.

Nevertheless, we conducted experiments to check a potential negative performance impact caused by “decoding” the compressed values during querying. The results showed that, on the one hand, we could not find a negative performance impact; on the other hand, the table scan performance improvement is directly proportional to the space reduction.
Q: How large of a saving in data size did you see in your results?

A: The bar chart gives a graphical representation of different compression scenarios for a single column. [See figure.] The horizontal axis shows the number of presence bits needed to code the compressed values, and the vertical axis shows the size of the column in megabytes.

The red portion of each bar displays how much space will remain after the compression, and the green portion indicates how much space would be freed. Combined, the size of this column without compression is about 760MB.

The bars, from left to right, indicate how much space would be freed when compressing:

    First bar = 1 compression bit: the most frequent value
    Second bar = 2 compression bits: the two most frequent values
    Third bar = 2 compression bits: the three most frequent values
    Thirteenth bar (the rightmost in the chart) = 4 compression bits: in this case, all occurring values

Notice that the first compression scenario is optimal—the size of this column would be reduced to about 50MB.

Table 2 displays the data analysis results in textual form. The tool recommends compressing one value (= 1 compression bit) and then generates the corresponding COMPRESS clause.

This table is a real example from our biggest table. You can see in row 7 of the table that multi-value compression was used. In this case the developer manually set a large compress list, which resulted in a column size of about 290MB as shown in cell C7. This is one-third the size without compression, indicated in cell C6, but nearly six times the size of the optimally compressed column that appears in cell C8.
Q: Beyond the space savings, what other benefits and cost savings did you experience?

A: The cost savings stemmed directly from the fact that we didn’t need to buy new hardware or perform other actions that would result in indirect costs, such as reducing our service portfolio.

An indirect benefit was that the overall performance of our system stabilized because we could raise the level of free storage to the recommended percentage. Multi-value compression played its part in this, alongside other actions, such as archiving.
Q: Have there been any impacts on your end users or application developers since you implemented compression? Have they had to change anything?

A: The end users who get our reports and analysis results didn’t realize that anything had changed, except that we were able to consistently meet their service level agreements because of our stabilized system performance.

What is interesting is the impact this procedure has had on our developers, like me. The method was originally intended to free some space in a one-time action: the administrators would analyze our biggest tables and change their table definitions to optimize compression.

Today all application developers in our unit know how to use it. During development of medium-sized to large data marts, a procedure is used to optimize the bigger tables, so the tables go into production with optimized compression already in place. What is important is that the developers know about the pitfalls, such as data volatility, which can turn a perfect compression list into a bad one over time.

Also, our administrators didn’t confine themselves to optimizing just the biggest tables. They optimized medium-sized and even small tables. This is why we have approximately 3,000 optimized tables!

Data Warehousing Gets the Data Out

In the past year or two, a growing split has occurred in the database market. Similar to a giant iceberg, this split is about to carve off a huge new piece that will have its own identity and direction. We call this new half of the database market data warehousing. We call the old half OLTP.
Data warehousing is that part of relational technology that deals with getting the data out. Data warehousing is a direct descendant of what led the industry to relational technology in the first place. Back in the early 1980s, we didn't decide to turn all of our IS shops upside down just because we were in love with transaction processing. We wanted better information. Those of you who still own Chris Date's original book explaining relational databases, An Introduction to Database Systems (Addison-Wesley), ought to go back and look at it again. The entire book is about queries and getting the data out! Transaction processing, entity-relationship diagrams, CASE tools, and all of the other OLTP apparatus came much later.
Once we got hooked on relational databases, most of us realized that we had some serious work to do to make these databases production-ready. So, for the last dozen years we have had to put off getting the data out in order to concentrate on getting the data in. We needed systems that could absorb at least 1000 transactions per second to enable us to store our basic enterprise data. Grafting transaction processing onto relational databases nearly put the lights out. We became fixated on the new mechanisms we created, such as entity-relationship diagrams and distributed database technology, and we came perilously close to forgetting why we bought relational databases in the first place.
Fortunately, the chief executives in most companies have long memories. They remembered the promise that we would be able to "slice and dice" all of our data. These executives have noticed that we have almost succeeded in storing all the corporate data in relational databases. They also haven't forgotten that they have spent several billion dollars. From their point of view, it is now time to get all that data out.
Almost every IS shop experiences significant pressure to make the corporate data accessible. Two or three years ago, we saw a brief flurry of marketing material from RDBMS vendors trying to reduce this pressure by promoting the use of OLTP systems for querying. This approach didn't work. Anyone who tried to use a major corporate database for both OLTP and querying soon realized some basic truths about systems for getting the data in, versus systems for getting the data out.
OLTP systems and data warehouse systems have different end users, managers, administrators, communications interfaces, backup strategies, data structures, processing needs, and access rhythms. Fortunately, DBMS and hardware vendors have stopped talking about one system that does it all. They now understand the profound difference between OLTP systems and data warehouse systems. Well, maybe they understand, and maybe they don't. But they have certainly figured out that if they keep quiet, they will sell two DBMS licenses and two hardware systems.
The little crack in the database market iceberg is widening very rapidly. A large channel of blue water is now visible between OLTP systems and data warehouse systems. Both IS shops and vendors realize that you can benefit from having two systems that are specialized for each task. An OLTP system must specialize in managing rapidly changing data and keeping it safe. OLTP systems must track millions of separate transactions per day and must encourage a usage style that peppers the database with tiny atomic processing requests that are all very similar. An OLTP transaction rarely uses a join.
Conversely, a data warehouse system must specialize in managing huge static sets of data copied from production systems. The average data warehouse query ranges dynamically over wide extremes of processing, and assembles hundreds, thousands, or even millions of separate data elements into small answer sets to be delivered to users. A data warehouse query is very join-intensive. Also, the data warehouse performs only one transaction per day when it loads millions of records from the production system. (We are not outlawing a light sprinkle of transactions restricted to forecasting and plan tables within the data warehouse.)
We don't need systems that are only pretty good at transaction processing and pretty good at querying. To be blunt, today's systems are very good at transaction processing and pretty horrible at querying. A lot of shops are still using 20-year-old B-tree technology to index their databases, because B-tree indexes are a compromise between querying and updating. However, we don't need to compromise when we decide to break our database world into two pieces. By specializing in getting the data out, the data warehouse half of the iceberg will yield a 100-fold improvement in query performance over the next few years. (Remember that you read this bold prediction in DBMS.)
Some of our cherished notions, such as cost-based optimization, may come into question; perhaps a cost-based optimizer is only necessary when our data schemas are so complex that no one can figure them out. As you'll read in my next column, the data warehouse marketplace seems to be adopting very simple data schemas that are highly dimensional. These schemas -- called Star Join schemas -- have large central "fact" tables surrounded by a single layer of much smaller dimension tables. (See Figure 1.) One of the great charms of the Star Join schema is that you can use fixed, deterministic evaluation strategies. You don't need a cost-based optimizer. Horrors!
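To make the shape concrete, here is a hypothetical star join of the kind such a schema invites (every table and column name is invented for illustration): one large fact table joined to a single layer of small dimension tables, with the dimensions supplying the filters and groupings.

 SELECT d.calendar_month,
        p.brand,
        SUM(f.sales_amount) AS total_sales
 FROM   sales_fact  f
 JOIN   date_dim    d ON d.date_key    = f.date_key     -- every join follows the same fact-to-dimension pattern
 JOIN   product_dim p ON p.product_key = f.product_key
 JOIN   store_dim   s ON s.store_key   = f.store_key
 WHERE  s.region = 'WEST'
 GROUP  BY d.calendar_month, p.brand;

Because every join follows the same pattern, the evaluation strategy can be fixed in advance: constrain the small dimension tables first, then sweep the fact table once.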
Over the next few years, we'll witness a development in data warehouse-oriented query systems. The wonderful vocabulary for OLTP developed by the hardware and software vendors who are members of the Transaction Processing Performance Council will be mirrored by a new vocabulary specific to data warehousing. Terms like "two-phase commit" and "row-level locking" will have data warehouse counterparts such as "factless fact tables" and "slowly changing dimensions." Hardware and software developers have just begun to turn their attention to data warehousing. Thus, the two icebergs will continue to drift further apart.


Getting the most out of data warehousing

Information technology is the technology used to extract, generate and distribute information, and information is best defined as meaningful data: data that has been processed into a form that is meaningful to the recipient. It should be evident, then, that data management belongs at the core of any information technology thinking. Data warehousing and data mining are avenues in data management to get better information.
A data warehouse stores information from various databases in a single location, cleaned and processed into the right formats for analysis. The process of transforming raw data into a data warehouse involves several steps. Extraction is the process of getting data out of the original database and transferring it to the warehouse infrastructure. Consolidation is the process of combining data from several sources into one database. Cleansing is the process of correcting data.

Filtering is the process of removing unnecessary information, aggregation is the process of combining detailed data into summaries, and conversion is the process of translating data into the model used by the warehouse.
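
As a purely illustrative sketch of how the last three steps might look in SQL (every table and column name here is hypothetical), a single load statement can filter the staging data, aggregate the detail and convert source codes into the warehouse's model:

 INSERT INTO wh_sales_summary (sale_date, region_name, total_amount)
 SELECT s.sale_date,
        CASE s.region_code             -- conversion: source codes to warehouse names
             WHEN 'N' THEN 'NORTH'
             WHEN 'S' THEN 'SOUTH'
             ELSE 'OTHER'
        END,
        SUM(s.amount)                  -- aggregation: detail rows rolled up per day and region
 FROM   staging_sales s
 WHERE  s.status <> 'CANCELLED'        -- filtering: drop unnecessary records
 GROUP  BY s.sale_date,
        CASE s.region_code
             WHEN 'N' THEN 'NORTH'
             WHEN 'S' THEN 'SOUTH'
             ELSE 'OTHER'
        END;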

A data mart is a specialised set of business information focusing on a particular aspect of the enterprise. Many companies choose to feed a data mart from a data warehouse because the information in the warehouse has already been consolidated and processed from the same raw data. An operational data store is a hybrid of an OLTP system and an analytical system. It contains information that's frequently updated on an ad hoc basis, often in response to changes in the OLTP system, as opposed to the scheduled updates of a data warehouse.

Warehousing brings together data. Mining sorts through the data collected and turns up interesting and useful connections. It all starts with a load of finely detailed historical data that needs to be sifted through for gems.

"Data warehousing is similar to any ordinary warehouse where one stores goods and takes out when required," says Dr Girish Kumar, DGM – IS, BPCL. Transactional data needs to be collated into a given area so that analysis can be conducted. BPCL, he says, has implemented SAP business warehouse by consolidating data at customer level. Analytical tools are used to generate management information system reports as per user requirements.

LIC plans to implement data warehousing and data mining, says T S Vijayan, chief – IT, BPR, LIC. Conceptually, in a technical sense, data warehousing comes first, inasmuch as data needs to be collected at one location. After the data is collected, data mining extracts the required information, and from this the MIS reports for management follow. However, in an organisation the management's need for information comes first, then the data mining is charted out, and only after this does data warehousing – collating the data – come into the picture, says Vijayan. Data warehousing has been implemented at LIC to get all historical transactions in one area. Another reason is to link up to other databases. Data mining is implemented to get more information on customers for up-selling, cross-selling and other marketing purposes.

Data mining is mainly for analysis, says S B Patankar, director – information systems, Bombay Stock Exchange. This requires data to be in a particular format. Thus data warehousing is a prerequisite for data mining.

Preprocess External Tables

Use inline preprocessing on external tables to simplify extract, transform, and load operations.

External tables, available since Oracle9i Database, enable you to view a text file as if it were data in a database table. In data warehouse environments, extract, transform, and load (ETL) processes can read an external table's text file directly and subsequently load the data into summary tables, eliminating the step of loading the text files into intermediate storage and saving significant space and time. But sometimes the text files being loaded into the database are compressed or not precisely in the format expected by the external tables. In such cases, the text files must be transformed into the appropriate format—for example, uncompressed—in a separate process before they can be read by external tables. Can this additional "out-of-line" task be eliminated—or at least be made inline with ETL processes—so that processing logic does not need to change?

In Oracle Database 11g Release 2, it is possible with inline preprocessing of input text files in external tables. This article shows you how to use this new feature to make your ETL processes handle any type of file without an additional out-of-line step.
External Tables

First let’s look at an external table setup. Consider a system that receives a file called indata1.txt to be loaded into the database by an ETL process. This file is stored in a directory (on a Windows-based system) named c:\etl_dir.

To create the external table, you must first create a directory object; in this case, the directory object is named ETL_DIR. Logging in as SYS, you issue the following command:

create directory etl_dir as
'C:\etl_dir';


The user who owns the external table is ETLADMIN, and that user should have the privileges to read from and write to the directory. The following SQL grants those privileges to ETLADMIN:

grant read, write on directory etl_dir to etladmin;


Now you can create the external table on the indata1.txt text file:

create table indata1
(
    cust_id     number,
    cust_name    varchar2(20),
    credit_limit number(10)
)
organization external
(
    type oracle_loader
    default directory etl_dir
    access parameters
    (
       records delimited by newline
       fields terminated by ","
    )
location ('indata1.txt')
)
/


The indata1.txt text file contains the following rows of information:

1,John Smith,1000
2,Jack Smith,2000


With the external table created, you can now select from it as you would from a typical database table:

SQL> select * from indata1;

CUST_ID  CUST_NAME   CREDIT_LIMIT
———————  ——————————  ————————————
      1  John Smith          1000
      2  Jack Smith          2000


Although you can access the external table in many of the same ways you would a database table, you cannot update the external table.
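
For example, an attempted UPDATE is rejected because external tables are read-only (a sketch of the behavior; the exact message can vary by release):

SQL> update indata1 set credit_limit = 3000 where cust_id = 1;
update indata1 set credit_limit = 3000 where cust_id = 1
*
ERROR at line 1:
ORA-30657: operation not supported on external organized table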
Change in Process

Now suppose that with your external table and its text file in place, the input file for the external table is compressed to reduce the volume of data transmitted across the network. Although compression helps the network bandwidth utilization, it creates a challenge for the ETL process. The file must be uncompressed before its contents can be accessed by the external table.

Rather than uncompress the file in a separate process, you can use the preprocessor feature of Oracle Database 11g Release 2 with external tables to uncompress the file inline. And you will not need to change the ETL process.

To use the preprocessor feature, first you need to create a preprocessor program. The external table expects input in a text format but not necessarily in a file. The external table does not need to read a file; rather, it expects to get the file contents “fed” to it. So the preprocessor program must stream the input data directly to the external table—and not create another input file. The input to the preprocessor will be a compressed file, and the output will be the uncompressed contents.

The following is the code for your new preprocessor program, named preprocess.bat:

@echo off
C:\oracle\product\11.2.0\dbhome_1\BIN\unzip.exe -qc %1


The first line, @echo off, suppresses the echoing of the command itself in a Windows environment. The remaining code calls the unzip.exe utility located in the Oracle home. The utility needs an input file, which is the first (and only) parameter passed to it, shown as %1. The options q and c tell the utility to uncompress quietly (q), without producing extraneous output such as "Inflating" or "%age inflated", and to write the uncompressed data to standard output (c) rather than to a file, which is exactly the stream the external table expects to read.

Next you need to create the directory object where this preprocessor program is located. Logging in as SYS, issue

create directory execdir as 'c:\tools';


And now grant EXECUTE permissions on the directory to the ETLADMIN user:

grant execute on directory execdir to etladmin;


Finally, create the new external table:

create table indata1
(
  cust_id        number,
  cust_name    varchar2(20),
  credit_limit number(10)
)
organization external
(
  type oracle_loader
  default directory etl_dir
  access parameters
  (
    records delimited by newline
    preprocessor execdir:'preprocess.bat'
    fields terminated by ","
  )
location ('indata1.txt')
)
/


Note that the only difference from the previous CREATE TABLE statement is the new line within the access parameters clause:

preprocessor execdir:'preprocess.bat'


It calls the preprocess.bat executable in the directory specified by EXECDIR before the external table accesses the indata1.txt file in the location specified by the ETL_DIR directory. Remember, indata1.txt is now a compressed file. So, in effect, the external table reads not the actual specified input file but rather the output of preprocess.bat, which is the uncompressed data from the indata1.txt file.

If you select from the external table now, the output will be similar to that of the earlier select * from indata1; query. The preprocessor passed the uncompressed contents of the indata1.txt (compressed) file on to the external table. There was no need to uncompress the file first—saving significant time and the intermediate space required and making it unnecessary to change the ETL process.

This inline preprocessing unzip example uses a script, but that is not always necessary; an executable can be used instead. For example, in Linux you can use /bin/gunzip directly. However, the PREPROCESSOR clause accepts only the program name, not any options. So if the utility needs options (as the unzip call in this article's example does), you must wrap the call in a script.
Security Concerns

The EXECUTE privilege on a directory is a new feature introduced in Oracle Database 11g Release 2. It enables the DBA to grant EXECUTE permissions only for certain directories and only to certain users. Without WRITE privileges, users will not be able to update the executables inside a directory to insert malicious code, but users will be able to execute the “approved” code accessible in a single location. The DBA can put all the necessary preprocessor tools into a single directory and grant EXECUTE privileges there to the users who may need them. And, of course, the executables and data should be in different directories.

Preprocessing also requires some special precautions on the part of the DBA. Because the executables called by preprocessing programs will be executed under the privileges of the Oracle software owner and malicious executable code can cause a lot of damage, the DBA should be extremely careful in monitoring executables for potentially harmful code.

The directory containing the preprocessor executables needs to be accessible to the Oracle software owner for EXECUTE operations only, not for WRITE activity. Therefore, as an added precaution, the system administrator can remove WRITE access to that directory from all users, including the Oracle software owner. This significantly reduces the chance of damage by malicious code.
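
One way a DBA might keep an eye on these grants is to query the data dictionary. This is a minimal sketch using the standard DBA_TAB_PRIVS and DBA_DIRECTORIES views (privileges on directory objects are recorded in DBA_TAB_PRIVS under the directory name):

select p.grantee,
       d.directory_name,
       d.directory_path,
       p.privilege
from   dba_tab_privs p
       join dba_directories d on d.directory_name = p.table_name
where  p.privilege = 'EXECUTE'
order  by d.directory_name, p.grantee;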
Other Uses

Compression is not the only use for inline preprocessing, although it certainly is the most widely used. You can, for example, use this preprocessing technique to show the output of a program as an external table. Consider, for instance, the dir command in Windows for listing the contents of a directory. How would you like to get the output as a table so that you can apply predicates?

Getting and using this output is quite simple with the preprocessor functionality. Remember, the preprocessor does not actually need a file but, rather, requires the output of the preprocessor program. You can write a preprocessor program to send the output of a dir command. The new pre-processor program, named preproc_dir.bat, has only the following two lines:

@echo off
dir

You will also need a file for the external table. The contents of the file are irrelevant, so you can use any file that the Oracle software owner can read in a directory to which that owner has read access. For this example, the file is dirfile.txt, and although the contents of the file are immaterial, the file must exist, because the external table will access it. Listing 1 shows how to create the table.

Code Listing 1: Creating an external table for showing a directory listing

create table dir_tab
(
       mod_dt       date,
       mod_time     char(10),
       file_type    char(10),
       file_size    char(10),
       file_name    char(40)
)
organization external
(
    type oracle_loader
    default directory etl_dir
    access parameters
    (
       records delimited by newline
       preprocessor execdir:'preproc_dir.bat'
       skip 5
       load when (mod_dt != blanks)
       fields
       (
             mod_dt       position (01:10) DATE mask "mm/dd/yyyy",
             mod_time     position (11:20),
             file_type    position (21:29),
             file_size    position (30:38),
             file_name    position (39:80)
       )
    )
location ('dirfile.txt')
)
reject limit unlimited
/

-- select from this table

SQL> select * from dir_tab;

MOD_DT      MOD_TIME    FILE_TYPE  FILE_SIZE  FILE_NAME
—————————    ————————   —————————  —————————  —————————
16-DEC-10   10:12 AM     <DIR>                .
16-DEC-10   10:12 AM     <DIR>                ..
22-MAY-10   09:57 PM     <DIR>                archive
22-MAY-10   10:27 PM                  2,048   hc_alap112.dat
05-DEC-10   07:07 PM                     36   indata1.txt
22-DEC-05   04:07 AM                 31,744   oradba.exe
16-DEC-10   09:58 AM                  1,123   oradim.log
28-SEP-10   12:41 PM                  1,536   PWDALAP112.ora
16-DEC-10   09:58 AM                  2,560   SPFILEALAP112.ORA

9 rows selected.

-- select a file not updated in last 1 year

SQL> select * from dir_tab where mod_dt < sysdate - 365;

MOD_DT      MOD_TIME   FILE_TYPE  FILE_SIZE  FILE_NAME
—————————   ————————   —————————  —————————  —————————
22-DEC-05   04:07 AM               31,744  oradba.exe
 

Data Warehouse Project Management

By Hari Mailvaganam

A paramount determining factor in the success of data warehousing is the input of stakeholders. Data warehousing is unique to each organization, its business processes, systems architecture and decision support needs.

Project management for data warehousing allows for large amounts of user input at all phases of the project. There are commercial software products tailored for data warehouse project management. A good project plan lists the critical tasks that must be performed and when each task should be started and completed. It identifies who is to perform the tasks, describes the deliverables to be created and identifies milestones for measuring progress.

There are a number of publications on data warehousing project management. The standard-bearer is Ralph Kimball's The Data Warehouse Lifecycle Toolkit.

Over the years I have worked with different project management methodologies and processes for data warehousing. In this article I summarize a project management methodology I have developed that ties in with the Rational Unified Process (RUP). This is especially useful if the data warehousing project is part of a greater development effort that follows RUP.

The program management framework for data warehousing follows tested methodologies. The methodology described below follows the project management workflow of the Unified Process.

Figure 1. Rational Unified Process; Copyright IBM

The project plan should include presentations at regular intervals, say monthly, to management and stakeholders. The presentation will include:

    Discussion of any challenges and whether the project is on schedule;
    Review of activities and priorities to be achieved before next meeting;
    Contingency plans to make up time and address problems;
    An open question-and-answer period followed by a summary.

Phases of the Unified Process
The phases in project management following UP, Figure 1, are:

     Inception
    Elaboration
    Construction
    Transition

The project lifecycle is decomposed over time into four sequential phases, each concluded by a major milestone, Figure 2.

The RFP is produced at the end of the inception phase.

Figure 2. Phases in the Unified Process; © IBM

1. Inception Phase

    Project planning and evaluation.
    Requirements gathering.
    Define features the system must support.
    Identify the stakeholders who oversee the system use.
    Users and other applications (i.e. actors) that will interface with the system.
    Define framework to identify business rules.
    List the events that the system must be aware of and respond to.
    List the constraints and risks placed on the project.
    Define product selection criteria.
    Define project management environment.
    Prepare and obtain approval of sponsors for project mandate and approach.
    Define data warehouse objectives.
    Define query library/location matrix.
    Define query libraries.
    Establish meta-data.
    Prepare migration specifications.
    Develop query usage monitoring.
    Technical infrastructure assessment.
    Current technology assessment.
    Release of RFP.

The project plan must also give importance to:

    Change control.
    Risk assessment.
    Training.
    Scope agreements.
    Resources.
    Communication.

2. Elaboration Phase

    Review of RFPs and selection of the winning candidate.
    Determine size of the data warehouse.
    Determine complexity and cleanliness of the data.
    Identify number of source databases and their characteristics.
    Select DSS tools.
    Determine network requirements.
    Conduct training.
    Integration of data.
    Plan quality assurance testing procedures.

3. Construction Phase

    Implement components.
    Systems integration.
    Conduct testing.
    Plan deployment.
    Develop support material.

4. Transition Phase

    Roll-out of modules.
    Managing change-control requests.
    Conduct beta testing to validate the new system against user expectations.
    Perform conversion of operational databases.
    Evaluate performance and benchmarking.
    Evaluate that deployment baselines are complete.
    Obtain stakeholder feedback on deployment baselines' consistency with evaluation criteria.
    Fine-tune product based on feedback.
    Release product to users.
    Product release milestones.