datafu.hourglass.jobs
Class IncrementalJob

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by datafu.hourglass.jobs.AbstractJob
          extended by datafu.hourglass.jobs.TimeBasedJob
              extended by datafu.hourglass.jobs.IncrementalJob
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable
Direct Known Subclasses:
AbstractPartitionCollapsingIncrementalJob, AbstractPartitionPreservingIncrementalJob

public abstract class IncrementalJob
extends TimeBasedJob

Base class for incremental jobs. Incremental jobs consume day-partitioned input data.

Implementations of this class must provide key, intermediate value, and output value schemas. The key and intermediate value schemas define the output for the mapper and combiner. The key and output value schemas define the output for the reducer.

This class has the same configuration and methods as TimeBasedJob. In addition, it recognizes the following properties:

max.iterations - maximum number of iterations for the job
max.days.to.process - maximum number of days of input data to process in a single run
fail.on.missing - whether the job should fail if input data within the desired range is missing
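As a hedged sketch, a subclass might wire up the three required schema hooks as follows. The class and field names here are illustrative only (in practice you would extend a concrete subclass such as AbstractPartitionCollapsingIncrementalJob, which adds the mapper and reducer hooks):

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import datafu.hourglass.jobs.IncrementalJob;

// Illustrative subclass showing the three schema hooks an
// implementation must provide.
public abstract class MemberCountJob extends IncrementalJob
{
  public MemberCountJob(String name, Properties props)
  {
    super(name, props);
  }

  @Override
  protected Schema getKeySchema()
  {
    // Key emitted by the mapper and combiner: one record per member.
    return SchemaBuilder.record("Key").fields()
        .requiredLong("memberId")
        .endRecord();
  }

  @Override
  protected Schema getIntermediateValueSchema()
  {
    // Partial counts flowing from the mapper through the combiner.
    return SchemaBuilder.record("Value").fields()
        .requiredLong("count")
        .endRecord();
  }

  @Override
  protected Schema getOutputValueSchema()
  {
    // Final per-member count written by the reducer; here it happens
    // to have the same shape as the intermediate value.
    return getIntermediateValueSchema();
  }
}
```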

Author:
Matthew Hayes

Constructor Summary
IncrementalJob()
          Initializes the job.
IncrementalJob(java.lang.String name, java.util.Properties props)
          Initializes the job with a job name and properties.
 
Method Summary
protected abstract  org.apache.avro.Schema getIntermediateValueSchema()
          Gets the Avro schema for the intermediate value.
protected abstract  org.apache.avro.Schema getKeySchema()
          Gets the Avro schema for the key.
 java.lang.Integer getMaxIterations()
          Gets the maximum number of iterations for the job.
 java.lang.Integer getMaxToProcess()
          Gets the maximum number of days of input data to process in a single run.
protected abstract  org.apache.avro.Schema getOutputValueSchema()
          Gets the Avro schema for the output data.
protected  TaskSchemas getSchemas()
          Gets the schemas.
protected  void initialize()
          Initialization required before running job.
 boolean isFailOnMissing()
          Gets whether the job should fail if input data within the desired range is missing.
 void setFailOnMissing(boolean failOnMissing)
          Sets whether the job should fail if input data within the desired range is missing.
 void setMaxIterations(java.lang.Integer maxIterations)
          Sets the maximum number of iterations for the job.
 void setMaxToProcess(java.lang.Integer maxToProcess)
          Sets the maximum number of days of input data to process in a single run.
 void setProperties(java.util.Properties props)
          Sets the configuration properties.
 
Methods inherited from class datafu.hourglass.jobs.TimeBasedJob
getDaysAgo, getEndDate, getNumDays, getStartDate, setDaysAgo, setEndDate, setNumDays, setStartDate, validate
 
Methods inherited from class datafu.hourglass.jobs.AbstractJob
config, createRandomTempPath, ensurePath, getCountersParentPath, getFileSystem, getInputPaths, getName, getNumReducers, getOutputPath, getProperties, getRetentionCount, getTempPath, isUseCombiner, randomTempPath, run, setCountersParentPath, setInputPaths, setName, setNumReducers, setOutputPath, setRetentionCount, setTempPath, setUseCombiner
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

IncrementalJob

public IncrementalJob()
Initializes the job.


IncrementalJob

public IncrementalJob(java.lang.String name,
                      java.util.Properties props)
Initializes the job with a job name and properties.

Parameters:
name - job name
props - configuration properties
Method Detail

setProperties

public void setProperties(java.util.Properties props)
Description copied from class: AbstractJob
Sets the configuration properties.

Overrides:
setProperties in class TimeBasedJob
Parameters:
props - Properties

initialize

protected void initialize()
Description copied from class: AbstractJob
Initialization required before running job.

Overrides:
initialize in class AbstractJob

getKeySchema

protected abstract org.apache.avro.Schema getKeySchema()
Gets the Avro schema for the key.

This is also used as the key for the map output.

Returns:
key schema

getIntermediateValueSchema

protected abstract org.apache.avro.Schema getIntermediateValueSchema()
Gets the Avro schema for the intermediate value.

This is also used as the value for the map output.

Returns:
intermediate value schema

getOutputValueSchema

protected abstract org.apache.avro.Schema getOutputValueSchema()
Gets the Avro schema for the output data.

Returns:
output data schema

getSchemas

protected TaskSchemas getSchemas()
Gets the schemas.

Returns:
schemas

getMaxToProcess

public java.lang.Integer getMaxToProcess()
Gets the maximum number of days of input data to process in a single run.

Returns:
maximum number of days to process

setMaxToProcess

public void setMaxToProcess(java.lang.Integer maxToProcess)
Sets the maximum number of days of input data to process in a single run.

Parameters:
maxToProcess - maximum number of days to process

getMaxIterations

public java.lang.Integer getMaxIterations()
Gets the maximum number of iterations for the job. Multiple iterations will only occur when a maximum has been set for the number of days to process in a single run. An error should be thrown if this number would be exceeded.

Returns:
maximum number of iterations
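To illustrate how this limit interacts with the per-run cap (a worked example, not library code): with 90 days of input and a maximum of 30 days per run, three iterations are needed, so a maximum of 2 iterations would trigger an error.

```java
// Days of input remaining vs. the per-run cap determine how many
// iterations the job needs; exceeding maxIterations is an error.
int numDays = 90;        // days of input in the desired range
int maxToProcess = 30;   // cap per run (setMaxToProcess)
int iterationsNeeded = (numDays + maxToProcess - 1) / maxToProcess; // ceiling division
System.out.println(iterationsNeeded); // prints 3
```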

setMaxIterations

public void setMaxIterations(java.lang.Integer maxIterations)
Sets the maximum number of iterations for the job. Multiple iterations will only occur when a maximum has been set for the number of days to process in a single run. An error should be thrown if this number would be exceeded.

Parameters:
maxIterations - maximum number of iterations

isFailOnMissing

public boolean isFailOnMissing()
Gets whether the job should fail if input data within the desired range is missing.

Returns:
true if the job should fail on missing data

setFailOnMissing

public void setFailOnMissing(boolean failOnMissing)
Sets whether the job should fail if input data within the desired range is missing.

Parameters:
failOnMissing - true if the job should fail on missing data
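Putting the accessors together, a configuration sketch using only the setters documented above (MemberCountJob is a hypothetical subclass, not part of the library):

```java
import java.util.Properties;

// Hypothetical subclass of IncrementalJob; only documented
// IncrementalJob setters are used below.
Properties props = new Properties();
MemberCountJob job = new MemberCountJob("member-count", props);
job.setMaxToProcess(30);     // process at most 30 days of input per run
job.setMaxIterations(2);     // error out rather than run a third iteration
job.setFailOnMissing(true);  // fail if any day in the range is missing
```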

