Treating Big Data Performance Woes with the Data Replication Cure Blog Series – Part 2
In my last posting,
I suggested that the primary bottleneck for performance computing of
any type, including big data applications, is the latency associated
with getting data from where it is to where it needs to be. If the
presumptive big data analytics platform/programming model is Hadoop
(which is also often presumed to provide in-memory analytics), though,
there are three key issues:
1) Getting the massive amounts of data into the Hadoop file system and memory from where those data sets originate,
2) Getting results out of the Hadoop system to where the results need to be, and
3) Moving data around within the Hadoop application.
That
third item could use a little further investigation. Hadoop is built as
an open source implementation of Google’s MapReduce, a model in which
computation is allocated across multiple processing units in a
two-phased manner. During the first phase, “Map,” each processing node
is allocated a chunk of the data for analysis; the interim results are
cached locally within each processing node. For example, if the task
were to count the number of occurrences of company names in a collection
of social network streams, then a bucket for each company would be
created at each node to hold the count of occurrences accumulated from
each stream subset.
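As a minimal sketch of that Map phase (this is plain Python, not Hadoop code itself, and the company names and sample posts are invented purely for illustration), each node builds its own local buckets of counts from the chunk of streams it was handed:

```python
from collections import Counter

# Hypothetical company names to count; a real job would load these from a
# reference data set rather than hard-coding them.
COMPANIES = {"acme", "globex", "initech"}

def map_phase(stream_chunk):
    """Runs on a single node: count company mentions in that node's chunk of
    the social network streams and cache the result locally."""
    local_buckets = Counter()
    for post in stream_chunk:
        for word in post.lower().split():
            if word in COMPANIES:
                local_buckets[word] += 1
    return local_buckets  # interim result stays on this node

# Each element simulates the chunk of streams assigned to one processing node.
chunks = [
    ["Acme ships a new widget", "Globex acquires Initech"],
    ["Initech stock falls", "more chatter about Acme and Acme again"],
]

interim_results = [map_phase(chunk) for chunk in chunks]
print(interim_results)
```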
During
the second phase, “Reduce,” the interim results at each node are then
combined across the network. If there were very few buckets altogether,
this would not be a big deal. However, if there are many, many buckets
(which we might presume due to the “bigness” of the data), the reduce
phase might incur a significant amount of communication – yet another
example of a potential bottleneck.
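Continuing that sketch, the Reduce step below merges the per-node buckets. The pairs_shipped counter stands in for the data that would cross the network during the shuffle; notice that it grows with the number of distinct buckets, which is exactly where the communication cost comes from:

```python
from collections import Counter

def reduce_phase(interim_results):
    """Combine the per-node buckets into global counts. Every (company, count)
    pair handled here would cross the network during the shuffle, so the
    communication cost grows with the number of distinct buckets."""
    totals = Counter()
    pairs_shipped = 0
    for local_buckets in interim_results:
        for company, count in local_buckets.items():
            totals[company] += count
            pairs_shipped += 1  # one pair sent over the wire
    return totals, pairs_shipped

# The per-node buckets produced by the Map sketch above, hard-coded here so
# this snippet runs on its own.
interim_results = [Counter({"acme": 1, "globex": 1, "initech": 1}),
                   Counter({"acme": 2, "initech": 1})]

totals, pairs_shipped = reduce_phase(interim_results)
print(totals, "pairs shipped over the network:", pairs_shipped)
```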
This
theme is not limited to Hadoop applications. Even looking at analytical
appliances used for traditional business intelligence queries, there is a
general presumption that because the data resides within the environment in a
way that is supposed to meet the demands of mixed-workload processing, the
operational data will generally sit in the same locations as the analytical
engine. And if you consider those commonplace queries that drive regularly
generated reports, that foreknowledge can be put to good use in a data
distribution scheme.
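As a toy illustration of such a distribution scheme (the table, the report key, and the node count are all invented here), hash-partitioning the rows on the attribute a regular report groups by lets each node total its own partition locally, so only the small per-group results ever need to move:

```python
def hash_partition(rows, key, num_nodes):
    """Assign each row to a node based on the known report key, so rows that
    will be aggregated together land on the same node."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key]) % num_nodes].append(row)
    return partitions

# Hypothetical sales rows; suppose a weekly report groups revenue by region.
sales = [
    {"region": "east", "revenue": 120},
    {"region": "west", "revenue": 80},
    {"region": "east", "revenue": 40},
]

# Because the grouping key is known ahead of time, each node can total its own
# partition locally; only the small per-region totals would need to move.
for node_id, partition in enumerate(hash_partition(sales, "region", num_nodes=2)):
    local_totals = {}
    for row in partition:
        local_totals[row["region"]] = local_totals.get(row["region"], 0) + row["revenue"]
    print("node", node_id, local_totals)
```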
However,
not all queries are the same old canned queries run over and over again. In
many more sophisticated cases, ad hoc queries with multiple join conditions
are going to require that the attributes used in those join conditions be
moved from their original locations and communicated to all of the nodes
computing the join! In essence, my opinion is that the presumption that the
data is already where it needs to be is fundamentally flawed, since there is
no way that a mixed workload can use the same data resources in their
original places except in extremely controlled circumstances in which all the
queries are known ahead of time.
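Here is a rough sketch of why that hurts (the table names and contents are hypothetical): when an ad hoc query joins on an attribute that neither side was distributed by, the join attributes of one side end up being shipped to every node holding the other side before any node can evaluate the join.

```python
def broadcast_join(local_orders_by_node, customers):
    """Each node holds a slice of 'orders' partitioned by order id, but the ad
    hoc query joins on customer_id. Since neither side is distributed on the
    join key, the customer rows (the join attributes) are broadcast to every
    node, and the volume shipped grows with nodes * rows."""
    rows_shipped = 0
    joined = []
    for node_orders in local_orders_by_node:
        # Simulate sending the whole customer table to this node.
        rows_shipped += len(customers)
        lookup = {c["customer_id"]: c for c in customers}
        for order in node_orders:
            cust = lookup.get(order["customer_id"])
            if cust:
                joined.append({**order, "region": cust["region"]})
    return joined, rows_shipped

# Hypothetical data: orders spread across two nodes, customers held elsewhere.
orders_by_node = [
    [{"order_id": 1, "customer_id": "c1"}, {"order_id": 2, "customer_id": "c2"}],
    [{"order_id": 3, "customer_id": "c1"}],
]
customers = [{"customer_id": "c1", "region": "east"},
             {"customer_id": "c2", "region": "west"}]

result, rows_shipped = broadcast_join(orders_by_node, customers)
print(result, "customer rows shipped:", rows_shipped)
```

Even in this tiny example the customer rows travel once for every node holding orders; multiply that by real data volumes and node counts and the network becomes the limiting factor all over again.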
So
unless we have some additional set of strategies, we are still going to be at
the mercy of the network. And as the volumes of data grow, so will the
bottlenecks… More next week. I will discuss this topic further on May 23
during the Information-Management.com EspressoShot webinar, Treating Big Data Performance Woes with the Data Replication Cure.