Wednesday, May 9, 2012

Some Thoughts about Data Proximity for Big Data Calculations – Part 2

Treating Big Data Performance Woes with the Data Replication Cure Blog Series – Part 2
In my last posting, I suggested that the primary performance bottleneck for computing of any type, including big data applications, is the latency associated with getting data from where it is to where it needs to be. If the presumptive big data analytics platform and programming model is Hadoop (which is also often presumed to provide in-memory analytics), though, there are three key issues:
1) Getting the massive amounts of data into the Hadoop file system and memory from where those data sets originate,
2) Getting results out of the Hadoop system to where the results need to be, and
3) Moving data around within the Hadoop application itself.
That third item could use a little further investigation. Hadoop is built as an open source implementation of Google's MapReduce, a model in which computation is allocated across multiple processing units in two phases. During the first phase, "Map," each processing node is allocated a chunk of the data for analysis, and the interim results are cached locally within that node. For example, if the task were to count the occurrences of company names in a collection of social network streams, then each node would create a bucket for each company to hold the count of occurrences accumulated from its subset of the streams.
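To make the Map phase concrete, here is a minimal Python sketch of that counting example. The company names and posts are made up for illustration, and a real Hadoop job would typically implement this as a Java Mapper class, but the idea is the same: each node tallies only its own chunk.

```python
from collections import Counter

# Illustrative company names and one node's chunk of social posts;
# both are invented for this example.
COMPANIES = {"Acme", "Globex", "Initech"}

def map_count(posts):
    """Map phase on one node: tally company mentions in the node's local
    chunk of the stream and cache the interim counts locally."""
    counts = Counter()
    for post in posts:
        for token in post.split():
            if token in COMPANIES:
                counts[token] += 1
    return counts  # interim results: one "bucket" per company

local_chunk = ["Acme ships a new gadget", "Globex and Acme announce a deal"]
print(map_count(local_chunk))  # Counter({'Acme': 2, 'Globex': 1})
```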
During the second phase, "Reduce," the interim results at each node are combined across the network. If there were only a few buckets altogether, this would not be a big deal. But if there are many, many buckets (which we might presume, given the "bigness" of the data), the Reduce phase can incur a significant amount of communication, yet another potential bottleneck.
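A matching sketch of the Reduce phase, merging the hypothetical per-node buckets from the example above; in a real cluster, the merge step is where each bucket's values cross the network, so the more buckets there are, the more communication this phase costs.

```python
from collections import Counter

def reduce_counts(per_node_counts):
    """Reduce phase: merge the interim buckets produced by every mapper node
    into one total per company."""
    totals = Counter()
    for node_counts in per_node_counts:
        totals.update(node_counts)  # in a real cluster, this data arrives over the network
    return totals

# Interim results from three hypothetical mapper nodes
node_results = [
    Counter({"Acme": 2, "Globex": 1}),
    Counter({"Acme": 5, "Initech": 3}),
    Counter({"Globex": 4}),
]
print(reduce_counts(node_results))  # Counter({'Acme': 7, 'Globex': 5, 'Initech': 3})
```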
This theme is not limited to Hadoop applications. Even for the analytical appliances used for traditional business intelligence queries, there is a general assumption that because the data resides within an environment designed to meet the demands of mixed workload processing, the operational data will generally sit in the same locations as the analytical engine. And for the commonplace queries behind regularly generated reports, that foreknowledge can be put to good use in a data distribution scheme.
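A minimal sketch of what that distribution scheme might look like, with hypothetical table and key names: if we know the regular report always groups on customer_id, we can hash-partition the rows on that key so each node answers its share of the query without pulling data from its neighbors.

```python
def hash_partition(rows, key, num_nodes):
    """Place every row holding the same key value on the same node, so a
    report that always groups or joins on that key can run node-locally."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

# Hypothetical sales rows distributed on customer_id for a known
# "revenue by customer" report.
sales = [{"customer_id": c, "amount": a} for c, a in [(1, 10), (2, 5), (1, 7), (3, 2)]]
for node_id, rows in enumerate(hash_partition(sales, "customer_id", 2)):
    print(f"node {node_id}: {rows}")
```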
However, not all queries are the same canned queries run over and over again. In many more sophisticated cases, ad hoc queries with multiple join conditions will require that the attributes used in those join conditions be moved from their original locations and communicated to all of the nodes computing the join. In essence, my opinion is that the idea that the data is already where it needs to be is fundamentally flawed: a mixed workload cannot use the same data resources in their original places except in extremely controlled circumstances in which all of the queries are known ahead of time.
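To make that concrete, here is a toy Python sketch (the tables, keys, and two-node layout are all hypothetical) of the reshuffle an ad hoc join forces when the join attribute is not the one the data was originally distributed on:

```python
from collections import defaultdict

def repartition(nodes, key, num_nodes):
    """Ship every row to the node that owns its join-key hash. When an ad hoc
    query joins on an attribute the data was not distributed by, this
    cross-network reshuffle is unavoidable."""
    out = [[] for _ in range(num_nodes)]
    for node in nodes:
        for row in node:
            out[hash(row[key]) % num_nodes].append(row)
    return out

def local_join(left, right, key):
    """Once matching keys are co-located, each node joins its own slice
    with an in-memory hash table."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [{**l, **r} for l in left for r in index[l[key]]]

# Hypothetical tables originally spread across two nodes by customer_id;
# the ad hoc query joins sales to products on product_id instead.
sales_nodes = [[{"product_id": 7, "amount": 10}], [{"product_id": 8, "amount": 4}]]
prod_nodes  = [[{"product_id": 8, "name": "widget"}], [{"product_id": 7, "name": "gear"}]]
s = repartition(sales_nodes, "product_id", 2)
p = repartition(prod_nodes, "product_id", 2)
for node_id in range(2):
    print(f"node {node_id}: {local_join(s[node_id], p[node_id], 'product_id')}")
```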
So unless we have some additional set of strategies, we are still going to be at the mercy of the network, and as the volumes of data grow, so will the bottlenecks… More next week. I will discuss this topic further on May 23 in the Information-Management.com EspressoShot webinar, Treating Big Data Performance Woes with the Data Replication Cure.
