Sunday, October 7, 2012

Make room for data

Given enough time, most organizations will reach a point when they wonder how their Teradata systems can possibly take in any more data. Buying new hardware can solve this problem—but that may be unnecessary.
One company found a way to more readily accommodate its growing volumes of data. Teradata Magazine spoke with Dietmar Trummer, senior IT architect at mobilkom austria, about its use of Teradata multi-value compression. Trummer explained how he was able to free space, increase performance and save money, all without reducing service.

Q: What prompted you to investigate using Teradata’s multi-value compression?

A: Starting at the end of 2006, we anticipated performance issues in the near future due to a shortage of free disk space. Our users had data analysis results and reports to deliver on time, so in my role as a Teradata system user and developer of smaller data marts, I tried to figure out how to cope with this storage issue.

Q: What approaches did you consider?

A: Management and administrators at mobilkom austria considered several options: buy new hardware, archive data to external storage or remove indexes. We narrowed the options down to these choices:
  • Reduce redundant data by deleting specialized data marts. This required developing more complex queries to rebuild the logic of the data marts.
  • Aggregate data and skip the details, or reduce history by removing old data. We would lose information and would have to reduce our internal service portfolio.
  • Optimize the table definitions based on Teradata technology. This option offers the use of size-optimal data types, primary indexes with optimal table distribution and multi-value compression with optimal size impact.

Q: Why did you select Teradata’s multi-value compression approach?

A: First of all, we didn’t want to develop more complex queries. Second, we didn’t want to reduce our service portfolio.
I was looking for a solution that had the least possible influence on the daily work of the users and developers. Optimizing table definitions appeared to be worth investigating.
To be honest, the multi-value compression approach was the most interesting. It promised a high potential and also delivered a technical and mathematical challenge. Besides, I had some prior experience implementing the compression approach.
Two years ago during the development of a simple data mart, I had to store a large amount of intermediate data in a table. The space in my staging database was insufficient, so I needed to find a way to reduce the table space.
With multi-value compression, I was able to fit my table into the staging database in a short time, without the help of an administrator.
One year later, I adapted what I learned from this experience and developed a method to ease our new storage problems using multi-value compression.

Q: Is this code for using compression as simple as it looks?

 CREATE MULTISET TABLE dwh_ua.uaf_contract (
   phone_id        INTEGER NOT NULL,
   call_mode_code  CHAR(1) COMPRESS ('A','P','S'),
   source_table_id SMALLINT NOT NULL COMPRESS (1,4,10),
   charge_usage    DECIMAL(18,4) COMPRESS (0.0000),
   ...
 ) PRIMARY INDEX (phone_id)
A: Yes, it is. We optimized our biggest table using multi-value compression: 1TB without compression; 480GB with manual (not optimized) compression; 370GB with optimized compression. The sample code is actually a fraction of the table definition.
That’s the simple part—the challenging part is how to get the values for optimized compression.
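Size figures like these can be checked directly from the data dictionary. A minimal sketch using the DBC.TableSize view, with the database and table names taken from the sample definition above:

 -- Current permanent space of the table, summed over all AMPs
 -- (DBC.TableSize reports CurrentPerm per Vproc).
 SELECT DatabaseName,
        TableName,
        SUM(CurrentPerm) / (1024 * 1024 * 1024.0) AS current_size_gb
 FROM DBC.TableSize
 WHERE DatabaseName = 'dwh_ua'
   AND TableName    = 'uaf_contract'
 GROUP BY DatabaseName, TableName;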

Q: What about the importance of data analysis and finding the break-even point of compression? How does multi-value compression work, and how do you find that break-even point?

A: Unlike other databases, the Teradata Database compresses specific user-defined values to zero space. That sounds like magic but, of course, it isn’t. The idea is to reduce the row size of many rows by a large amount, and to enlarge the row size of all rows by a small amount.
[Table 1]
The code that follows and the corresponding row storage table [see table 1] illustrate how this works: We observed the compression of a single column. All rows with a call_mode_code value contained in the compress list (‘A,’ ‘P’ or ‘S’) skipped the storage for this column in the row data. The compressed values are stored as binary code, and the column was reduced to zero space. Consequently, the rows got smaller. In table 1, the binary code “00” indicates that the value is stored in the row data.
 … call_mode_code CHAR(1) COMPRESS ('A','P','S'), …
 2 bits: 'A' = 01, 'P' = 10, 'S' = 11
The binary code is also used in the presence bits to indicate that a value is not compressed. If the table contains no call_mode_code values of ‘A,’ ‘P’ or ‘S’ at all, the presence bits only add a small amount of storage and the rows get slightly larger. This explains why it is important to know the values contained in your table.
In summary, if you use compression, all rows—regardless of whether the column value is compressed—have to add the presence bits to their row storage.
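A rough way to picture that trade-off for the sample column is to compare the bytes saved by the compressed rows with the bytes added by the presence bits. A simplified sketch (it ignores the byte-rounding of the presence bits within the row header):

 -- Benefit vs. cost of COMPRESS ('A','P','S') on call_mode_code:
 -- each matching row saves the 1-byte column, every row pays 2 presence bits.
 SELECT COUNT(*) AS total_rows,
        SUM(CASE WHEN call_mode_code IN ('A','P','S') THEN 1 ELSE 0 END) AS compressible_rows,
        SUM(CASE WHEN call_mode_code IN ('A','P','S') THEN 1 ELSE 0 END) * 1 AS bytes_saved,
        CAST(COUNT(*) AS FLOAT) * 2 / 8 AS bytes_added
 FROM dwh_ua.uaf_contract;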
Then, the question arose: How can we use this information to find the optimal compression list for a column of a table?
The answer involves some detailed mathematics, but the principle is not very complex: values that occur more frequently are better candidates for the compression list. Therefore, we had to fill the compression list with the most frequently occurring values. The break-even point is reached when adding a new value to the list no longer decreases the column’s total space consumption.
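In practice, this starts with a simple value-frequency profile of the column. A sketch against the sample table; the compress list is filled from the top of this ranking until the bytes saved by the next value no longer outweigh any additional presence bit it would force on every row:

 -- Value-frequency profile; compress-list candidates come from the top.
 -- For a CHAR(1) column, each compressed occurrence saves one byte.
 SELECT call_mode_code,
        COUNT(*) AS frequency,
        COUNT(*) * 1 AS bytes_saved_if_compressed
 FROM dwh_ua.uaf_contract
 GROUP BY call_mode_code
 ORDER BY frequency DESC;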
But be aware that this calculation reflects only a static point of view. Data changes over time, so an analysis of data volatility is also necessary and has to be taken into consideration when finding the best compression list.

Q: What kinds of data gain the best compression by setting “obvious” compression values? Are there problems using this technique?

A: That’s difficult to generalize because we experienced different and unexpected kinds of data with great compression performance. Of course, columns with large data types have a better compression ratio than those with small data types. Also, columns that contain a few very frequent values gain good compression. These values might be words in natural language, flags, categories, statuses and years.
But data columns that contain measures can also be a good source for compression—especially default values, zeros and values that are near the most frequent value of a Gaussian or Poisson distributed column.
The problem with what might be considered “obvious” compression values is that it is difficult to find the break-even point—i.e., the optimal compression list—without data analysis. We found that manual compression without data analysis often leads to compression lists that are too large. In these cases, too many values are used for compression, which can lead to “over-compression.” Less would have been better.

Q: You described actual results when using multi-value compression. How much table scan performance improvement did you see? Do you have examples to share?

A: We didn’t analyze the table scan performance in a way that would enable me to present a percentage of performance improvement. Our result is based on the theoretical fact that table scan performance is determined by table size—and on our users’ experiences.
Nevertheless, we conducted experiments to check for a potential negative performance impact caused by “decoding” the compressed values during querying. The results showed that, on the one hand, we could not find a negative performance impact; on the other hand, the table scan performance improvement is directly proportional to the space reduction.

Q: How large of a saving in data size did you see in your results?

A: The bar chart gives a graphical representation of different compression scenarios for a single column. [See figure.] The horizontal axis shows the number of presence bits needed to code the compression values, and the vertical axis shows the size of the column in megabytes.
The red portion of each bar displays how much space will remain after the compression, and the green portion indicates how much space would be freed. Combined, the size of this column without compression is about 760MB.
[Figure: bar chart of compression scenarios]
The bars, from left to right, indicate how much space would be freed when compressing:
  • First column = 1 compression bit: the most frequent value
  • Second column = 2 compression bits: the most and second-most frequent values
  • Third column = 2 compression bits: the most, second-most and third-most frequent values
  • Thirteenth column (the rightmost bar in the chart) = 4 compression bits: in this case all occurring values
Notice that the first compression scenario is optimal—the size of this column would be reduced to about 50MB.
Table 2 displays the data analysis results in textual form. The tool’s recommendation is to compress one value (= 1 compression bit), from which it generates the corresponding compress clause.
This table is a real example from our biggest table. You can see in row 7 of the table that multi-value compression was used. In this case the developer manually set a large compress list, which resulted in a column size of about 290MB as shown in cell C7. This is one-third the size without compression, indicated in cell C6, but nearly six times the size of the optimally compressed column that appears in cell C8.

Q: Beyond the space savings, what other benefits and cost savings did you experience?

A: The cost savings stemmed directly from the fact that we didn’t need to buy new hardware or perform other actions that would result in indirect costs, such as reducing our service portfolio.
An indirect benefit was that the overall performance of our system stabilized because we could raise the level of free storage to the recommended percentage. Multi-value compression played its part in this, along with other actions, such as archiving.

Q: Have there been any impacts on your end users or application developers since you implemented compression? Have they had to change anything?

A: The end users who get our reports and analysis results didn’t realize that anything had changed, except that we were able to consistently meet their service level agreements because of our stabilized system performance.
What is interesting is the impact this procedure has had on our developers, like me. The method was originally intended to free some space in a one-time action: the administrators would analyze our biggest tables and change their table definitions to optimize compression.
Today all application developers in our unit know how to use it. During development of medium-sized to large data marts, a procedure is used to optimize the bigger tables, so the tables go into production with optimized compression. What is important is that the developers know about the pitfalls, such as data volatility, which can turn a perfect compression into a bad one over time.
Also, our administrators didn’t confine themselves to optimizing just the biggest tables. They optimized medium-sized and even small tables. This is why we have approximately 3,000 optimized tables!
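Because a compress list is part of the table definition, rolling an optimized list into an existing production table means rebuilding it. A minimal sketch of one way to do that (not necessarily the exact procedure we use; the _new table name is only for illustration), assuming the table can be reloaded during a maintenance window:

 -- Create a copy with the recommended COMPRESS clauses, reload it,
 -- then swap the names once the row counts have been verified.
 CREATE MULTISET TABLE dwh_ua.uaf_contract_new (
   phone_id        INTEGER NOT NULL,
   call_mode_code  CHAR(1) COMPRESS ('A','P','S'),
   source_table_id SMALLINT NOT NULL COMPRESS (1,4,10),
   charge_usage    DECIMAL(18,4) COMPRESS (0.0000),
   ...
 ) PRIMARY INDEX (phone_id);

 INSERT INTO dwh_ua.uaf_contract_new
 SELECT * FROM dwh_ua.uaf_contract;

 DROP TABLE dwh_ua.uaf_contract;
 RENAME TABLE dwh_ua.uaf_contract_new TO dwh_ua.uaf_contract;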

Q: Has compression added work to maintenance efforts?

A: Yes, it has. We have been using optimized compression since June 2007, and most tables have been optimized as of January 2008.
We know that we have to check and re-adjust our compression lists because of data volatility. Therefore, the developers, as well as the administrators, re-analyze the biggest tables in the form of control samples. But we have not gained enough experience to estimate how much effort we’ll have to invest in compression maintenance.
To reduce maintenance efforts, we try to avoid compression on highly volatile data columns, or we use more robust and less optimal compression lists for those columns.
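For these control samples it helps to read the compress lists that are actually in production from the data dictionary and compare them against a fresh value-frequency profile. A sketch, assuming the CompressValueList field of the DBC.Columns view is available on your release:

 -- Which columns of the table currently carry a compress list?
 SELECT ColumnName,
        ColumnType,
        CompressValueList
 FROM DBC.Columns
 WHERE DatabaseName = 'dwh_ua'
   AND TableName    = 'uaf_contract'
   AND CompressValueList IS NOT NULL;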

Q: How does a compression assessment tool work, and is it available to others?

A: I developed an Excel macro that does data analysis on selected columns of a specified table. It visualizes its results in Excel tables and charts and produces recommendations for the compress clause of the analyzed columns. Tables 1 and 2 show some output of the tool.
I’ve received about 50 e-mails from Teradata users asking about the tool. So far, mobilkom has granted me permission to give the macro tool to them free for personal use. Any future requests would have to be negotiated. Anyone who is interested should e-mail TDCompress@mobilkom.at. Of course, since I am not a software producer, I cannot offer warranties or provide any service for this tool.
I want to mention that there are professional tools on the market that deal with Teradata multi-value compression, its optimization and similar issues. Atanasoft is one vendor of such a tool.


