MapReduce
MapReduce is the name of a programming framework used by Google for performing distributed computation. The MapReduce algorithms are described in the paper MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat.
In addition to the input data, the MapReduce framework requires the user to pass two functions to describe the computation task. One function generates intermediate data sets (named "map" in the paper) and a function for combining intermediate data sets into smaller data sets (named "reduce" in the paper). The MapReduce framework then distributes tasks as needed, taking care of the messy details of task scheduling.
Note: the map and reduce functions in the paper are different than the primitive functions of the same name in Cat and other functional languages. This is a source of considerable confusion for people.
The following pseduo-code example taken from the paper by Dean and Ghemawat demonstrates sample input functions to the MapReduce algorithm that count words from multple input documents:
``` map(String key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, "1");
reduce(String key, Iterator values): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result)); ```
These two functions when passed to the MapReduce framework, would potentially allow us to count all words in occurences.
MapReduce in Haskell
The paper Google's MapReduce Programming Model -- Revisited by Ralf Lammel provides a detailed explanation of the MapReduce algorithm using Haskell as an example language.
The MapReduce algorithm is shown written in Haskell in Lammel's paper as (the terms "map" and "reduce" have been replaced with "m" and "r" for clarity:
mapReduce m r
= reducePerKey r -- 3. Apply r to each group
. groupByKey -- 2. Group intermediates per key
. mapPerKey m -- 1. Apply m to each key/value pair
From Lammel's paper:
"We note that the basic skeleton of MapReduce computations is stunningly simple. We assume that mapPerKey, groupByKey and reducePerKey are components of the MapReduce library."
Notice that Lammel uses the function composition operator "." which makes the mapReduce algorithm look almost like a Cat program. He could have written:
mapReduce m r
= r . reducePerKey . groupByKey . m . mapPerKey
If we were to rewrite the above algorithm in Cat using named parameters we would arrive with:
define mapReduce(m r)
= r reducePerKey groupByKey m MapPerKey
MapReduce in Cat
The implementation of a naive non-distributed MapReduce algorithm can be expressed in a one-line Cat program as:
define map_reduce { [map flatten self_join] dip map }
In pseudo-code this could be translated to:
map_reduce(input, fmap, freduce) =
map(freduce, self_join(flatten(map(fmap, input))))
where the types of the arguments could be described as:
input = list(pair('input_value, 'input_key))
fmap = pair('input_value, 'input_key)) -> list(pair('output_value, 'output_key))
freduce = pair(list('output_value), 'output_key) -> pair('output_value, 'output_key)
Note that these types are not actually valid Cat types. This is a planned future extension for the type system.
Understanding the Cat algorithm requires familiarity with the self_join and flatten library functions.
Usage
The following demonstrates how the map_reduce algorithm in Cat could be used to count words from sentences:
``` define test_strings { ( (("The", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"), 1), (("I", "am", "very", "lazy"), 2), (("I", "hope", "this", "is", "over", "quick"), 3), (("I", "have", "high", "hopes", "for", "the", "lazy", "dog"), 4) ) }
define m { unpair pop [1 swap pair] map }
define r { unpair [sum] dip pair }
define test_map_reduce { test_strings [m] [r] map_reduce print_list } ```
The output of this example is:
( 1, The)
( 1, brown)
( 1, fox)
( 1, jumped)
( 1, am)
( 1, very)
( 1, hope)
( 1, this)
( 1, is)
( 2, over)
( 2, quick)
( 3, I)
( 1, have)
( 1, high)
( 1, hopes)
( 1, for)
( 2, the)
( 3, lazy)
( 2, dog)