Monday, February 27, 2012

Data Warehousing Gets the Data Out

In the past year or two, a growing split has occurred in the database market. Similar to a giant iceberg, this split is about to carve off a huge new piece that will have its own identity and direction. We call this new half of the database market data warehousing. We call the old half OLTP.
Data warehousing is that part of relational technology that deals with getting the data out. Data warehousing is a direct descendant of what led the industry to relational technology in the first place. Back in the early 1980s, we didn't decide to turn all of our IS shops upside down just because we were in love with transaction processing. We wanted better information. Those of you who still own Chris Date's original book explaining relational databases, An Introduction to Database Systems (Addison-Wesley), ought to go back and look at it again. The entire book is about queries and getting the data out! Transaction processing, entity-relationship diagrams, CASE tools, and all of the other OLTP apparatus came much later.
Once we got hooked on relational databases, most of us realized that we had some serious work to do to make these databases production-ready. So, for the last dozen years we have had to put off getting the data out in order to concentrate on getting the data in. We needed systems that could absorb at least 1000 transactions per second to enable us to store our basic enterprise data. Grafting transaction processing onto relational databases nearly put the lights out. We became fixated on the new mechanisms we created, such as entity-relationship diagrams and distributed database technology, and we came perilously close to forgetting why we bought relational databases in the first place.
Fortunately, the chief executives in most companies have long memories. They remembered the promise that we would be able to "slice and dice" all of our data. These executives have noticed that we have almost succeeded in storing all the corporate data in relational databases. They also haven't forgotten that they have spent several billion dollars. From their point of view, it is now time to get all that data out.
Almost every IS shop experiences significant pressure to make the corporate data accessible. Two or three years ago, we saw a brief flurry of marketing material from RDBMS vendors trying to reduce this pressure by promoting the use of OLTP systems for querying. This approach didn't work. Anyone who tried to use a major corporate database for both OLTP and querying soon realized some basic truths about systems for getting the data in, versus systems for getting the data out.
OLTP systems and data warehouse systems have different end users, managers, administrators, communications interfaces, backup strategies, data structures, processing needs, and access rhythms. Fortunately, DBMS and hardware vendors have stopped talking about one system that does it all. They now understand the profound difference between OLTP systems and data warehouse systems. Well, maybe they understand, and maybe they don't. But they have certainly figured out that if they keep quiet, they will sell two DBMS licenses and two hardware systems.
The little crack in the database market iceberg is widening very rapidly. A large channel of blue water is now visible between OLTP systems and data warehouse systems. Both IS shops and vendors realize that you can benefit from having two systems that are specialized for each task. An OLTP system must specialize in managing rapidly changing data and keeping it safe. OLTP systems must track millions of separate transactions per day and must encourage a usage style that peppers the database with tiny atomic processing requests that are all very similar. An OLTP transaction rarely uses a join.
Conversely, a data warehouse system must specialize in managing huge static sets of data copied from production systems. The average data warehouse query ranges dynamically over wide extremes of processing, and assembles hundreds, thousands, or even millions of separate data elements into small answer sets to be delivered to users. A data warehouse query is very join- intensive. Also, the data warehouse performs only one transaction per day when it loads millions of records from the production system. (We are not outlawing a light sprinkle of transactions restricted to forecasting and plan tables within the data warehouse.)
We don't need systems that are only pretty good at transaction processing and pretty good at querying. To be blunt, today's systems are very good at transaction processing and pretty horrible at querying. A lot of shops are still using 20-year-old B-tree technology to index their databases, because B-tree indexes are a compromise between querying and updating. However, we don't need to compromise when we decide to break our database world into two pieces. By specializing in getting the data out, the data warehouse half of the iceberg will yield a 100-fold improvement in query performance over the next few years. (Remember that you read this bold prediction in DBMS.)
Some of our cherished notions, such as that of cost-based optimization, may mean that optimizer is necessary when our data schemas are too complex -- no one can figure them out. As you'll read in my next column, the data warehouse marketplace seems to be adopting very simple data schemas that are highly dimensional. These schemas -- called Star Join schemas -- have large central "fact" tables surrounded by a single layer of much smaller dimension tables. (See Figure 1.) One of the great charms of the Star Join schema is that you can use fixed, deterministic evaluation strategies. You don't need a cost-based optimizer. Horrors!
Over the next few years, we'll witness a development in data warehouse-oriented query systems. The wonderful vocabulary for OLTP developed by the hardware and software vendors who are members of the Transaction Processing Performance Council will be mirrored by a new vocabulary specific to data warehousing. Terms like "two-phase commit" and "row-level locking" will have data warehouse counterparts such as "factless fact tables" and "slowly changing dimensions." Hardware and software developers have just begun to turn their attention to data warehousing. Thus, the two icebergs will continue to drift further apart.

0 comments:

Post a Comment

newer post older post Home