Idioms

There are certain types of computation that show up again and again in parallel computation. This section shows how to perform them with jug.

Map/Reduce

Currently, the dominant paradigm for large scale distributed computing is the map/reduce paradigm. Originally made prominent by Google’s proprietary implementation, it is now available in many implementations, including open-source ones such as Hadoop.

Jug is not a direct competitor to Hadoop as it focuses on medium sized problems. Jug does not implement a distributed file system of any sort, but assumes that all compute nodes have access to a central repository of information (such as a shared filesystem or a redis server).

On the other hand, jug supports much more complex computations than map-reduce. If, however, your problem is naturally described as a map/reduce computation, then jug has some helper functions.

jug.mapreduce.mapreduce

The jug.mapreduce.mapreduce function implements mapreduce:

jug.mapreduce(reducer, mapper, inputs)

is roughly equivalent to:

Task{ reduce(reducer, map(mapper, inputs)) }

If the syntax of Python supported such a thing.

An issue that might come up is that your map function can be too fast. A good task should take at least a few seconds (otherwise, the overhead of scheduling and loading the data overwhelms the performance advantages of parallelism. Analogously for the reduce function.

Therefore, jug groups your inputs so that a mapping task actually consists of mapping and reducing more than one input. How many is controlled by the map_step parameter. By default, it is set to 4. Similarly, the reduce_step parameter controls how many reduction steps to perform in a single task (by default, 8; reflecting the fact that reduce operations tend to be lighter than map operations).

The compound task section has a worked out example of using map/reduce.

Parameter Sweep

This is a standard problem in many fields, for example, in machine learning. You have an algorithm that takes a few parameters (let’s call them p0 and p1) and a function which takes your input data (data) and the parameters and outputs the score of this parameter combination.

In pure Python, we’d write something like:

best = None
best_val = float("-Inf")
for p0 in range(100):
    for p1 in range(-20, 20):
        cur = score(data, p0, p1)
        if cur > best_val:
            best = p0,p1
            best_val = cur
print('Best parameter pair', best)

This is, obviously, an embarassingly parallel problem and we want jug to handle it.

First note: we can, of course, perform this with a map/reduce:

def mapper(data, p0, p1):
    return (p0, p1, score(data, p0, p1))

def reducer(a, b):
    _, _, av = a
    _, _, bv = b
    if av > bv: return a
    return b

best = jug.mapreduce.mapreduce(
                reducer,
                mapper,
                [(p0, p1)
                        for p0 in range(101)
                        for p1 in range(-20, 21)])

However, if you want to look at the whole parameter space instead of just the best score, this will not work. Instead, you can do:

from jug import TaskGenerator

score = TaskGenerator(score)
results = {}
for p0 in range(100):
    for p1 in range(-20, 20):
        result[p0,p1] = value(data, p0, p1)

Now, after you’ve run ``jug execute``, you can use jug shell and load the result dictionary to look at all the results.

result = value(result)
print(result[0, -2])
# Look for the maximum score
print(max(result.values()))
# Look at maximum score *and* the parameters that generated it:
print(max((v, k) for k, v in result.iteritems()))