Interface | Description |
---|---|
DateRangeConfigurable | An interface for an object with a configurable output date range. |
Setup | Used as a callback by PartitionCollapsingIncrementalJob and PartitionPreservingIncrementalJob to provide configuration settings for the Hadoop job. |

Class | Description |
---|---|
AbstractJob | Base class for Hadoop jobs. |
AbstractNonIncrementalJob | Base class for Hadoop jobs that consume time-partitioned data in a non-incremental way. |
AbstractNonIncrementalJob.BaseCombiner | Combiner base class for AbstractNonIncrementalJob. |
AbstractNonIncrementalJob.BaseMapper | Mapper base class for AbstractNonIncrementalJob. |
AbstractNonIncrementalJob.BaseReducer | Reducer base class for AbstractNonIncrementalJob. |
AbstractNonIncrementalJob.Report | Reports files created and processed for an iteration of the job. |
AbstractPartitionCollapsingIncrementalJob | An IncrementalJob that consumes partitioned input data and collapses the partitions to produce a single output. |
AbstractPartitionCollapsingIncrementalJob.Report | Reports files created and processed for an iteration of the job. |
AbstractPartitionPreservingIncrementalJob | An IncrementalJob that consumes partitioned input data and produces output data having the same partitions. |
AbstractPartitionPreservingIncrementalJob.Report | Reports files created and processed for an iteration of the job. |
DateRangePlanner | Determines the date range of inputs which should be processed. |
ExecutionPlanner | Base class for execution planners. |
FileCleaner | Used to remove files from the file system when they are no longer needed. |
IncrementalJob | Base class for incremental jobs. |
MaxInputDataExceededException | |
PartitionCollapsingExecutionPlanner | Execution planner used by AbstractPartitionCollapsingIncrementalJob and its derived classes. |
PartitionCollapsingIncrementalJob | A concrete version of AbstractPartitionCollapsingIncrementalJob. |
PartitionPreservingExecutionPlanner | Execution planner used by AbstractPartitionPreservingIncrementalJob and its derived classes. |
PartitionPreservingIncrementalJob | A concrete version of AbstractPartitionPreservingIncrementalJob. |
ReduceEstimator | Estimates the number of reducers needed based on input size. |
StagedOutputJob | A derivation of Job that stages its output in another location and only moves it to the final destination if the job completes successfully. |
TimeBasedJob | Base class for Hadoop jobs that consume time-partitioned data. |
TimePartitioner | A partitioner used by AbstractPartitionPreservingIncrementalJob to limit the number of named outputs used by each reducer. |

Jobs within this package form the core of the incremental framework implementation. There are two types of incremental jobs: partition-preserving and partition-collapsing.
A partition-preserving job consumes input data partitioned by day and produces output data partitioned by day. This is equivalent to running a MapReduce job for each individual day of input data, but much more efficient. It compares the input data against the existing output data and only processes input data with no corresponding output.
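
The comparison of input days against existing output days can be illustrated with a minimal sketch. It is not the package's planner: the class `PreservingPlanSketch` and the method `daysToProcess` are hypothetical names introduced here, and the sketch assumes each daily partition can be identified by a `LocalDate`.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; these names are not part of the package.
class PreservingPlanSketch {

    /** Returns the input days that have no corresponding output day, in date order. */
    static List<LocalDate> daysToProcess(Set<LocalDate> inputDays, Set<LocalDate> outputDays) {
        TreeSet<LocalDate> pending = new TreeSet<>(inputDays); // sorted copy of the input days
        pending.removeAll(outputDays);                         // drop days that already have output
        return new ArrayList<>(pending);
    }

    public static void main(String[] args) {
        Set<LocalDate> input = new HashSet<>(Arrays.asList(
                LocalDate.of(2013, 3, 1), LocalDate.of(2013, 3, 2), LocalDate.of(2013, 3, 3)));
        Set<LocalDate> output = new HashSet<>(Arrays.asList(LocalDate.of(2013, 3, 1)));

        // Only the days without existing output remain to be processed.
        System.out.println(daysToProcess(input, output));
    }
}
```

In this sketch a rerun over unchanged input does no work, because every input day already has a matching output day.
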
A partition-collapsing job consumes input data partitioned by day and produces a single output. What distinguishes this job from a standard MapReduce job is that it can reuse the previous output. This enables it to process data much more efficiently. Rather than consuming all input data on each run, it can consume only the new data since the previous run and merge it with the previous output.
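
The idea of reusing the previous output can be sketched with a simple per-key count. This is an illustration under stated assumptions, not the package's implementation: `CollapsingMergeSketch`, `countAll`, and `mergeNewDays` are hypothetical names, each day of input is modeled as a list of keys, and the aggregate is assumed to be a sum so that old and new results can be combined.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these names are not part of the package.
class CollapsingMergeSketch {

    /** Full recomputation: counts every key across all days of input. */
    static Map<String, Long> countAll(List<List<String>> days) {
        return mergeNewDays(new HashMap<>(), days);
    }

    /** Incremental run: starts from the previous output and adds only the new days. */
    static Map<String, Long> mergeNewDays(Map<String, Long> previousOutput,
                                          List<List<String>> newDays) {
        Map<String, Long> merged = new HashMap<>(previousOutput); // leave the previous output untouched
        for (List<String> day : newDays) {
            for (String key : day) {
                merged.merge(key, 1L, Long::sum);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> day1 = Arrays.asList("a", "b");
        List<String> day2 = Arrays.asList("a");
        List<String> day3 = Arrays.asList("b", "c");

        Map<String, Long> previous = countAll(Arrays.asList(day1, day2));
        Map<String, Long> incremental = mergeNewDays(previous, Arrays.asList(day3));
        Map<String, Long> full = countAll(Arrays.asList(day1, day2, day3));

        // Prints whether the incremental result matches a full recomputation.
        System.out.println(incremental.equals(full));
    }
}
```

The incremental path reads only the third day, while the full recomputation reads all three; both arrive at the same counts.
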
Partition-preserving and partition-collapsing jobs can be created by extending AbstractPartitionPreservingIncrementalJob and AbstractPartitionCollapsingIncrementalJob, respectively, and implementing the necessary methods. Alternatively, there are concrete versions of these classes, PartitionPreservingIncrementalJob and PartitionCollapsingIncrementalJob, which can be used instead. With these classes, the implementations are provided through setters.
Incremental jobs use Avro for input, intermediate, and output data. To implement an incremental job, one must define its schemas. A key schema and an intermediate value schema specify the output of the mapper and combiner, which output key-value pairs. The key schema and an output value schema specify the output of the reducer, which outputs a record having key and value fields.
An incremental job also requires that implementations of map and reduce be defined, and optionally combine. The map implementation must implement a Mapper interface, which is very similar to the standard map interface in Hadoop. The combine and reduce operations are implemented through an Accumulator interface. This is similar to the standard reduce in Hadoop; however, values are provided one at a time rather than as an enumerable list. Also, an accumulator returns either one value or, by returning null, no value at all. That is, the accumulator may not return an arbitrary number of values for the output. This restriction precludes the implementation of certain operations, like flatten, which do not fit well within the incremental programming model.
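
The accumulator contract described above can be sketched as follows. The interface `SketchAccumulator` and its method names are assumptions made for this illustration and should not be read as the package's actual Accumulator signature; likewise `ThresholdSumAccumulator` and `AccumulatorSketch.reduce` are hypothetical. The sketch shows values arriving one at a time and a final result that is either a single value or null.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical interface modeled on the description above; not the package's Accumulator.
interface SketchAccumulator<In, Out> {
    void accumulate(In value); // called once per value, one at a time
    Out getFinal();            // returns one value, or null for no output
    void cleanup();            // resets state before the next key
}

// Sums values for a key and emits the sum only if it reaches a minimum.
class ThresholdSumAccumulator implements SketchAccumulator<Long, Long> {
    private final long minimum;
    private long total;

    ThresholdSumAccumulator(long minimum) {
        this.minimum = minimum;
    }

    @Override
    public void accumulate(Long value) {
        total += value;
    }

    @Override
    public Long getFinal() {
        return total >= minimum ? Long.valueOf(total) : null; // null suppresses output for this key
    }

    @Override
    public void cleanup() {
        total = 0;
    }
}

class AccumulatorSketch {

    /** Feeds the values for one key to the accumulator and returns its single result, or null. */
    static <In, Out> Out reduce(SketchAccumulator<In, Out> accumulator, Iterable<In> values) {
        for (In value : values) {
            accumulator.accumulate(value);
        }
        Out result = accumulator.getFinal();
        accumulator.cleanup(); // make the accumulator reusable for the next key
        return result;
    }

    public static void main(String[] args) {
        ThresholdSumAccumulator accumulator = new ThresholdSumAccumulator(5);
        List<Long> aboveThreshold = Arrays.asList(2L, 4L);
        List<Long> belowThreshold = Arrays.asList(1L, 2L);

        System.out.println(reduce(accumulator, aboveThreshold));
        System.out.println(reduce(accumulator, belowThreshold));
    }
}
```

Because `getFinal` returns at most one value, an operation such as flatten, which would need to emit several records per key, cannot be expressed in this shape.
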