Package datafu.hourglass.jobs

Incremental Hadoop jobs and some supporting classes.

Jobs within this package form the core of the incremental framework implementation. There are two types of incremental jobs: partition-preserving and partition-collapsing.

A partition-preserving job consumes input data partitioned by day and produces output data partitioned by day. This is equivalent to running a MapReduce job for each individual day of input data, but much more efficient. It compares the input data against the existing output data and only processes input data with no corresponding output.

A partition-collapsing job consumes input data partitioned by day and produces a single output. What distinguishes this job from a standard MapReduce job is that it can reuse the previous output, which enables it to process data much more efficiently. Rather than consuming all input data on each run, it consumes only the new data since the previous run and merges it with the previous output.

Partition-preserving and partition-collapsing jobs can be created by extending AbstractPartitionPreservingIncrementalJob and AbstractPartitionCollapsingIncrementalJob, respectively, and implementing the necessary methods. Alternatively, the concrete classes PartitionPreservingIncrementalJob and PartitionCollapsingIncrementalJob can be used; with these, the implementations are provided through setters.
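
For example, a job built on the concrete PartitionCollapsingIncrementalJob class might be configured roughly as follows. This is a minimal sketch: the input and output paths, record names, and the CountMapper, CountAccumulator, and ExampleSchemas classes (sketched further below) are hypothetical, and some setter names may vary between versions.

    import java.util.Arrays;

    import org.apache.hadoop.fs.Path;

    import datafu.hourglass.jobs.PartitionCollapsingIncrementalJob;

    // Hypothetical job that counts events per member across all days of input.
    public class CountPerMemberJob
    {
      public static void main(String[] args) throws Exception
      {
        PartitionCollapsingIncrementalJob job = new PartitionCollapsingIncrementalJob(CountPerMemberJob.class);

        // Avro schemas for the mapper/combiner and reducer output (see the schema sketch below).
        job.setKeySchema(ExampleSchemas.KEY_SCHEMA);
        job.setIntermediateValueSchema(ExampleSchemas.VALUE_SCHEMA);
        job.setOutputValueSchema(ExampleSchemas.VALUE_SCHEMA);

        // Hypothetical locations; the input under /data/event is partitioned by day.
        job.setInputPaths(Arrays.asList(new Path("/data/event")));
        job.setOutputPath(new Path("/output/count_per_member"));

        // Reuse the previous output so only new days of input are consumed.
        job.setReusePreviousOutput(true);

        // Processing logic (see the Mapper and Accumulator sketch below).
        job.setMapper(new CountMapper());
        job.setCombinerAccumulator(new CountAccumulator());
        job.setReducerAccumulator(new CountAccumulator());
        job.setUseCombiner(true);

        job.run();
      }
    }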

Incremental jobs use Avro for input, intermediate, and output data. To implement an incremental job, one must define these schemas. A key schema and an intermediate value schema specify the output of the mapper and combiner, which emit key-value pairs. The key schema and an output value schema specify the output of the reducer, which emits a record having key and value fields.
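
Continuing the hypothetical count-per-member example, the key and value schemas could be defined with the Avro Schema API, for instance via Schema.Parser. The record and field names below are illustrative only.

    import org.apache.avro.Schema;

    // Hypothetical schemas: the key identifies a member, and a single-field
    // count record serves as both the intermediate value (mapper/combiner
    // output) and the output value (reducer output).
    public class ExampleSchemas
    {
      public static final Schema KEY_SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Key\",\"namespace\":\"com.example\","
          + "\"fields\":[{\"name\":\"member_id\",\"type\":\"long\"}]}");

      public static final Schema VALUE_SCHEMA = new Schema.Parser().parse(
          "{\"type\":\"record\",\"name\":\"Value\",\"namespace\":\"com.example\","
          + "\"fields\":[{\"name\":\"count\",\"type\":\"int\"}]}");
    }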

An incremental job also requires that implementations of map and reduce be defined, and optionally combine. The map implementation must implement the Mapper interface, which is very similar to the standard map interface in Hadoop. The combine and reduce operations are implemented through the Accumulator interface. This is similar to the standard reduce in Hadoop; however, values are provided one at a time rather than as an enumerable list. An accumulator also returns either a single value or no value at all (by returning null); that is, it may not return an arbitrary number of values as output. This restriction precludes certain operations, such as flatten, which do not fit well within the incremental programming model.
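
To illustrate, the map and reduce logic for the hypothetical count-per-member job might look roughly like the sketch below, assuming the Mapper, KeyValueCollector, and Accumulator interfaces live in datafu.hourglass.model (exact package locations and method signatures may differ between versions). The two classes would normally reside in separate source files.

    import java.io.IOException;

    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    import datafu.hourglass.model.Accumulator;
    import datafu.hourglass.model.KeyValueCollector;
    import datafu.hourglass.model.Mapper;

    // Emits a (member_id, count=1) key-value pair for each input event.
    class CountMapper implements Mapper<GenericRecord, GenericRecord, GenericRecord>
    {
      @Override
      public void map(GenericRecord input, KeyValueCollector<GenericRecord, GenericRecord> collector)
          throws IOException, InterruptedException
      {
        GenericRecord key = new GenericData.Record(ExampleSchemas.KEY_SCHEMA);
        key.put("member_id", input.get("member_id"));

        GenericRecord value = new GenericData.Record(ExampleSchemas.VALUE_SCHEMA);
        value.put("count", 1);

        collector.collect(key, value);
      }
    }

    // Sums counts one value at a time; usable as both the combiner and
    // reducer accumulator.  Returning null from getFinal() would mean
    // no output record is produced for the key.
    class CountAccumulator implements Accumulator<GenericRecord, GenericRecord>
    {
      private transient long count;

      @Override
      public void accumulate(GenericRecord value)
      {
        count += ((Number) value.get("count")).longValue();
      }

      @Override
      public GenericRecord getFinal()
      {
        GenericRecord output = new GenericData.Record(ExampleSchemas.VALUE_SCHEMA);
        output.put("count", (int) count);
        return output;
      }

      @Override
      public void cleanup()
      {
        count = 0L;
      }
    }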