Package datafu.hourglass.jobs

Incremental Hadoop jobs and some supporting classes.

Jobs within this package form the core of the incremental framework implementation. There are two types of incremental jobs: partition-preserving and partition-collapsing.

A partition-preserving job consumes input data partitioned by day and produces output data partitioned by day. This is equivalent to running a MapReduce job for each individual day of input data, but much more efficient. It compares the input data against the existing output data and only processes input data with no corresponding output.

A partition-collapsing job consumes input data partitioned by day and produces a single output. What distinguishes this job from a standard MapReduce job is that it can reuse the previous output, which enables it to process data much more efficiently. Rather than consuming all input data on each run, it consumes only the new data since the previous run and merges it with the previous output.

Partition-preserving and partition-collapsing jobs can be created by extending AbstractPartitionPreservingIncrementalJob and AbstractPartitionCollapsingIncrementalJob, respectively, and implementing the necessary methods. Alternatively, the concrete classes PartitionPreservingIncrementalJob and PartitionCollapsingIncrementalJob can be used; with these, the implementations are provided through setters.
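
For example, a job built on the concrete PartitionCollapsingIncrementalJob class might be configured roughly as follows. This is a minimal sketch: the input and output paths, record names, and the CountMapper, CountAccumulator, and ExampleSchemas classes (sketched further below) are hypothetical, and some setter names may vary between versions.

    import java.util.Arrays;

    import org.apache.hadoop.fs.Path;

    import datafu.hourglass.jobs.PartitionCollapsingIncrementalJob;

    // Hypothetical job that counts events per member across all days of input.
    public class CountPerMemberJob
    {
      public static void main(String[] args) throws Exception
      {
        PartitionCollapsingIncrementalJob job = new PartitionCollapsingIncrementalJob(CountPerMemberJob.class);

        // Avro schemas for the mapper/combiner and reducer output (see the schema sketch below).
        job.setKeySchema(ExampleSchemas.KEY_SCHEMA);
        job.setIntermediateValueSchema(ExampleSchemas.VALUE_SCHEMA);
        job.setOutputValueSchema(ExampleSchemas.VALUE_SCHEMA);

        // Hypothetical locations; the input under /data/event is partitioned by day.
        job.setInputPaths(Arrays.asList(new Path("/data/event")));
        job.setOutputPath(new Path("/output/count_per_member"));

        // Reuse the previous output so only new days of input are consumed.
        job.setReusePreviousOutput(true);

        // Processing logic (see the Mapper and Accumulator sketch below).
        job.setMapper(new CountMapper());
        job.setCombinerAccumulator(new CountAccumulator());
        job.setReducerAccumulator(new CountAccumulator());
        job.setUseCombiner(true);

        job.run();
      }
    }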

Incremental jobs use Avro for input, intermediate, and output data. To implement an incremental job, one must define these schemas. A key schema and an intermediate value schema specify the output of the mapper and combiner, which emit key-value pairs. The key schema and an output value schema specify the output of the reducer, which emits a record having key and value fields.
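
Continuing the hypothetical count-per-member example, the key and value schemas could be defined with the Avro Schema API, for instance via Schema.Parser. The record and field names below are illustrative only.

    import org.apache.avro.Schema;

    // Hypothetical schemas: the key identifies a member, and a single-field
    // count record serves as both the intermediate value (mapper/combiner
    // output) and the output value (reducer output).
    public class ExampleSchemas
    {
      public static final Schema KEY_SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Key\",\"namespace\":\"com.example\","
          + "\"fields\":[{\"name\":\"member_id\",\"type\":\"long\"}]}");

      public static final Schema VALUE_SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Value\",\"namespace\":\"com.example\","
          + "\"fields\":[{\"name\":\"count\",\"type\":\"int\"}]}");
    }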

An incremental job also requires that implementations of map and reduce be defined, and optionally combine. The map implementation must implement the Mapper interface, which is very similar to the standard map interface in Hadoop. The combine and reduce operations are implemented through the Accumulator interface. This is similar to the standard reduce in Hadoop; however, values are provided one at a time rather than as an enumerable list. An accumulator also returns either a single value or no value at all (by returning null); that is, it may not return an arbitrary number of values as output. This restriction precludes certain operations, such as flatten, which do not fit well within the incremental programming model.
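
To illustrate, the map and reduce logic for the hypothetical count-per-member job might look roughly like the sketch below, assuming the Mapper, KeyValueCollector, and Accumulator interfaces live in datafu.hourglass.model (exact package locations and method signatures may differ between versions). The two classes would normally reside in separate source files.

    import java.io.IOException;

    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    import datafu.hourglass.model.Accumulator;
    import datafu.hourglass.model.KeyValueCollector;
    import datafu.hourglass.model.Mapper;

    // Emits a (member_id, count=1) key-value pair for each input event.
    class CountMapper implements Mapper<GenericRecord, GenericRecord, GenericRecord>
    {
      @Override
      public void map(GenericRecord input, KeyValueCollector<GenericRecord, GenericRecord> collector)
          throws IOException, InterruptedException
      {
        GenericRecord key = new GenericData.Record(ExampleSchemas.KEY_SCHEMA);
        key.put("member_id", input.get("member_id"));

        GenericRecord value = new GenericData.Record(ExampleSchemas.VALUE_SCHEMA);
        value.put("count", 1);

        collector.collect(key, value);
      }
    }

    // Sums counts one value at a time; usable as both the combiner and
    // reducer accumulator.  Returning null from getFinal() would mean
    // no output record is produced for the key.
    class CountAccumulator implements Accumulator<GenericRecord, GenericRecord>
    {
      private transient long count;

      @Override
      public void accumulate(GenericRecord value)
      {
        count += ((Number) value.get("count")).longValue();
      }

      @Override
      public GenericRecord getFinal()
      {
        GenericRecord output = new GenericData.Record(ExampleSchemas.VALUE_SCHEMA);
        output.put("count", (int) count);
        return output;
      }

      @Override
      public void cleanup()
      {
        count = 0L;
      }
    }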