Skip navigation
Toggle Sidebar

Distributed Map-Reduce

Project Description Map-Reduce is one of the more common patterns used in GigaSpaces systems. This library provides an implementation of the Distributed Map Reduce pattern that allows clients to simply write the classes for the map and reduce functionality, relying on GigaSpaces XAP for the infrastructure.
Current Project Version 1.00
Project Maturity Production
Project License Apache License 2.0
Compatible GigaSpaces XAP Version GigaSpacesXAP6.x
Project Captain Patrick May
Contributors Patrick May

Project Sitemap

Project Homepage

Features and Capabilities

This project demonstrates how to leverage the inherently distributed nature of a Space Based Architecture (SBA) to efficiently implement the Map-Reduce pattern. It includes a library that implements the pattern, using GigaSpaces XAP to meet usability, scalability, resiliency, and performance goals.

The Map-Reduce Pattern

The Map-Reduce pattern is enjoying a resurgence of popularity, due in no small part to its use by Google. The concepts underlying this pattern are very simple: work is distributed across multiple processes (mapped) and the intermediate results from each of those processes are consolidated into a single final answer (reduced).

This technique has been used for decades in functional languages. Common Lisp, for example, provides map and reduce as part of the language:

(sqrt (reduce #'+ (mapcar (lambda (x) (* x x)) '(3 4))))

This maps a function that squares its argument over the specified list, then reduces the results by summing them and computes the square root of the sum. The list can be arbitrarily large.

Combining the expressiveness of this functional technique with distributed computing mechanisms results in a pattern that realizes the potential parallelism of clustered hardware for many problem domains.

Space Based Architecture

Space Based Architecture (SBA) is conceptually almost as simple as the Map-Reduce pattern. SBA derives from the observation that traditional modularization techniques have significant benefits when used to logically decouple the components of a software architecture, but significant costs when used to physically decouple those components.

An easy way to understand SBA is to contrast it with a typical tier based architecture (TBA) consisting of Presentation, Messaging, Business Logic, Data Access, Data Caching, and Database layers that not only separate the technical concerns of each tier into logical modules but implement the tiers themselves as distinct modules.

The consequences of separating the tiers are significant communication latency, increased integration complexity, and the need to address non-functional requirements (NFRs) such as scalability, resiliency, and performance at each individual tier. These problems are inherent in TBA.

TBAs fail to address the fact that business value comes not from optimizing the capabilities of each individual tier, which are, after all, merely software artifacts, but from maximizing the scalability, resiliency, and performance of an end to end business transaction. SBA directly addresses this essential point.

SBA preserves the benefits of the logical modularization that allows separation of technical concerns while eliminating the high latency, complex tiers of TBA. The basic idea is that the end to end performance of a particular instance of a business use case can be optimized by colocating in a single process the data, business logic, and messaging required to execute that instance of the use case.

The unit of deployment in an SBA is the processing unit rather than the tier. A processing unit typically contains one node of a clustered space and the set of software services that comprise the business logic needed to execute the end to end business transaction. The space node is a service like any other. It provides in memory storage of the objects required by the business logic and is also used for messaging between other services in the processing unit. Nodes federate into clusters, forming what can be thought of as a distributed shared memory. Services and other clients of the space can access either single nodes or the cluster as a whole, with the same API.

In addition to significantly decreasing latency and improving throughput, this decoupling of executable instances results in near linear scalability across a broad range of hardware resources. If more throughput is required, it is a simple matter to deploy additional processing units. This can be done dynamically, under the control of programmatically enforced service level agreements (SLAs).

The same characteristics of SBA that result in massive scalability also provide high availability, even when the underlying hardware infrastructure is unreliable. In a typical deployment, each processing unit has an identical backup copy In the event of a failure, the backup takes over without losing any state, including that of any open transactions. The cluster monitor detects that a failover has occurred and that there is now a primary processing unit without a backup. In response, a new backup is created and synchronized with the primary. The processing unit concept is essential to this self healing capability.

The Distributed Map-Reduce API

The ability to distribute processing across all nodes in an SBA makes it a natural fit for the Map-Reduce pattern. The fit is so natural, in fact, that the pattern often emerges in various guises during the design and implementation of distributed applications on top of SBAs. The library in this project provides a simple interface for Map-Reduce and eliminate the need to reinvent the GigaSpaces specific code.

When using the Distributed Map-Reduce pattern to solve a problem, a developer defines four components:

  1. The behavior to be mapped over each node
  2. The result of mapping the behavior over a single node
  3. The behavior to reduce the mapping results
  4. The result of reducing the mapping results

The simplest, most usable client API will reflect these concepts as directly as possible. Accordingly, the client API exposed by the library consists of:

  • public interface MapTask, declaring only public MapResult map(GigaSpace space);
  • public interface MapResult
  • public interface ReduceTask, declaring only public ReduceResult reduce(Collection<MapResult mapResults);
  • public interface ReduceResult
  • public class MapReduceTask, with the constructor public MapReduceTask(MapTask mapTask,ReduceTask reduceTask);

To implement the Distributed Map-Reduce pattern using this API, a developer simply:

  1. Implements the two task and two result interfaces
  2. Creates an instance of MapReduceTask from the two task classes
  3. Writes the MapReduceTask to a clustered space
  4. Retrieves an instance of the ReduceResult implementation from the clustered space

Summary

A Space Based Architecture is an excellent platform for deploying the Distributed Map-Reduce pattern. GigaSpaces' SBA implementation adds support for scalability, resiliency, and performance to the capabilities of this pattern, without the need for any additional products or integration effort. The result is a robust platform that provides measurable business value in those environments where Map-Reduce is applicable.

Adaptavist Theme Builder Powered by Atlassian Confluence