Monday, December 26, 2011

MapReduce and Grid Computing


There is a lot of interest and discussion around MapReduce success stories these days, particularly Hadoop's, with the likes of Google, Amazon, Yahoo, and Facebook advocating and adopting implementations of the framework for their production systems. I got curious to understand the core concept behind the MapReduce framework and what makes it so unique for distributed processing of large data sets.

Reading some interesting articles online, I understand the MapReduce framework at its core is a combination of two functions, map() and reduce(). The framework knows where each block of input data lives and dispatches map tasks to those nodes, i.e. the computation happens on the distributed nodes in a completely parallel manner. The reduce function, on the other hand, operates on the sorted intermediate output the mappers produce on each computing node and applies an aggregating function to each key's list of values. Both the input and the output of the map/reduce tasks are stored in a file system, for example the proprietary Google File System (GFS), the Hadoop Distributed File System (HDFS), or something else. Typically, the compute nodes (MapReduce framework) and the storage nodes (HDFS) are co-located, running on the same set of physical machines, based on the assumption that remote data can only be accessed with low bandwidth and high latency. This configuration allows the framework to effectively schedule tasks on the nodes where the data is already present, resulting in very high aggregate bandwidth across the cluster. So, is this an extension of the architectural approach behind storage grid computing?
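
To make the two functions concrete, here is a minimal sketch of the canonical word-count job written against the Hadoop Java API (the org.apache.hadoop.mapreduce classes; exact signatures vary a little between Hadoop releases, so treat this as an illustration rather than a definitive implementation):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map(): runs on the node holding the input split; emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // reduce(): receives the sorted, grouped intermediate pairs and folds
  // each key's list of counts into a single total.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the mapper's node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note how the job itself says nothing about which machine runs which task: the framework reads the block locations from HDFS and schedules each map task on (or near) a node that already holds its input split, which is exactly the data-locality behavior described above.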

The idea of grid computing arose from the need to solve highly parallel computational problems that were beyond the processing capability of any single computer. Oracle has been offering its version of grid technology since 2000. The database grid, representing the approach taken with Oracle Database, deploys code redundantly on multiple servers (or nodes), which break up the workload based on an optimized scheme and execute tasks in parallel against common data resources. If any node fails, its work is taken over by a surviving node to ensure high availability. Simply put, the RAC database grid architecture assigns computing tasks to computing resources and data to storage resources in a way that lets such resources be easily added or removed, and gives the flexibility to move tasks and data as needed.

My take-away points from a computing perspective: both MapReduce and Oracle RAC computing environments harness the processing power of multiple interconnected computers, and both are promising technologies to invest in (depending on the business case) for solving data-intensive and resource-intensive computing problems. A key premise of MapReduce-style systems is that there is insufficient network bandwidth to move the data to the computation, so the computation must move to the data instead. The key differentiator, or limitation, I observed (at the time of blogging) is high availability. Unlike a RAC database transaction-processing system, a MapReduce/HDFS-style computing system does not provide high availability, because the NameNode server of an HDFS file-system instance is a single point of failure.

I am looking forward to more advances in these technologies at an affordable cost for addressing the growing data-intensive computing requirements of today's business economics.