datafu.hourglass.jobs
Class AbstractJob

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by datafu.hourglass.jobs.AbstractJob
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable
Direct Known Subclasses:
TimeBasedJob

public abstract class AbstractJob
extends org.apache.hadoop.conf.Configured

Base class for Hadoop jobs.

This class defines a set of common methods and configuration shared by Hadoop jobs. Jobs can be configured either by providing properties or by calling setters. Each property has a corresponding setter.

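This class recognizes the following properties:

input.path - Input path(s) the job will read from
output.path - Output path the job will write to
temp.path - Parent directory under which temporary paths are created
retention.count - Number of days of data to retain in the output path
num.reducers - Number of reducers to use
use.combiner - Whether the combiner should be used
counters.path - Path where counters will be stored
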
The input.path property may be a comma-separated list of paths. More than one path implies that a join is to be performed. Alternatively, the paths may be listed in separate properties whose names begin with input.path. (including the trailing period). For example, input.path.first and input.path.second define two separate input paths.

The num.reducers property fixes the number of reducers. When it is not set, the number of reducers is computed from the input size.

The temp.path property defines the parent directory for temporary paths, not the temporary path itself. Temporary paths are created under this directory with an hourglass- prefix followed by a GUID.
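
The naming scheme described above can be illustrated with a minimal sketch. It models paths as plain strings rather than org.apache.hadoop.fs.Path, and the class TempPathSketch and its helper are hypothetical names introduced only for illustration; this is not the library's implementation.

```java
import java.util.UUID;

// Sketch only: paths are plain strings here, not org.apache.hadoop.fs.Path.
class TempPathSketch {
    // Hypothetical helper: builds "<parent>/hourglass-<GUID>" without creating anything.
    static String randomTempPath(String tempParent) {
        // Drop a trailing slash so the result never contains "//".
        String parent = tempParent.endsWith("/")
            ? tempParent.substring(0, tempParent.length() - 1)
            : tempParent;
        return parent + "/hourglass-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        // /tmp is the documented default for the temporary parent directory.
        System.out.println(randomTempPath("/tmp"));
    }
}
```

Because the GUID differs on every call, two jobs sharing the same temp.path parent would not be expected to collide.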

The input and output paths are the only required parameters. The rest are optional.

Hadoop configuration may be provided by setting a property whose name begins with the prefix hadoop-conf. (including the trailing period). For example, mapred.min.split.size can be configured by setting the property hadoop-conf.mapred.min.split.size to the desired value.
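
As a rough illustration of the prefix convention, the sketch below builds a java.util.Properties object and pulls out the entries that carry the hadoop-conf. prefix. The class HadoopConfPrefixSketch, its method, and the example paths are hypothetical; how AbstractJob actually applies these entries is not shown in this documentation.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Sketch only: shows the naming convention, not the library's own code.
class HadoopConfPrefixSketch {
    static final String PREFIX = "hadoop-conf.";

    // Returns the Hadoop settings carried by prefixed properties, with the prefix stripped.
    static Map<String, String> extractHadoopConf(Properties props) {
        Map<String, String> conf = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(PREFIX) && key.length() > PREFIX.length()) {
                conf.put(key.substring(PREFIX.length()), props.getProperty(key));
            }
        }
        return conf;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("input.path", "/data/input");   // placeholder path
        props.setProperty("output.path", "/data/output"); // placeholder path
        props.setProperty("hadoop-conf.mapred.min.split.size", "134217728");
        System.out.println(extractHadoopConf(props));
    }
}
```

Properties without the prefix, such as input.path, are left for the job's own settings.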

Author:
"Matthew Hayes"

Constructor Summary
AbstractJob()
          Initializes the job.
AbstractJob(java.lang.String name, java.util.Properties props)
          Initializes the job with a job name and properties.
 
Method Summary
 void config(org.apache.hadoop.conf.Configuration conf)
          Overridden to provide custom configuration before the job starts.
protected  org.apache.hadoop.fs.Path createRandomTempPath()
          Creates a random temporary path within the file system.
protected  org.apache.hadoop.fs.Path ensurePath(org.apache.hadoop.fs.Path path)
          Creates a path, if it does not already exist.
 org.apache.hadoop.fs.Path getCountersParentPath()
          Gets the path where counters will be stored.
protected  org.apache.hadoop.fs.FileSystem getFileSystem()
          Gets the file system.
 java.util.List<org.apache.hadoop.fs.Path> getInputPaths()
          Gets the input paths.
 java.lang.String getName()
          Gets the job name.
 java.lang.Integer getNumReducers()
          Gets the number of reducers to use.
 org.apache.hadoop.fs.Path getOutputPath()
          Gets the output path.
 java.util.Properties getProperties()
          Gets the configuration properties.
 java.lang.Integer getRetentionCount()
          Gets the number of days of data which will be retained in the output path.
 org.apache.hadoop.fs.Path getTempPath()
          Gets the temporary path under which intermediate files will be stored.
protected  void initialize()
          Initialization required before running job.
 boolean isUseCombiner()
          Gets whether the combiner should be used.
protected  org.apache.hadoop.fs.Path randomTempPath()
          Generates a random temporary path within the file system.
abstract  void run()
          Run the job.
 void setCountersParentPath(org.apache.hadoop.fs.Path countersParentPath)
          Sets the path where counters will be stored.
 void setInputPaths(java.util.List<org.apache.hadoop.fs.Path> inputPaths)
          Sets the input paths.
 void setName(java.lang.String name)
          Sets the job name.
 void setNumReducers(java.lang.Integer numReducers)
          Sets the number of reducers to use.
 void setOutputPath(org.apache.hadoop.fs.Path outputPath)
          Sets the output path.
 void setProperties(java.util.Properties props)
          Sets the configuration properties.
 void setRetentionCount(java.lang.Integer retentionCount)
          Sets the number of days of data which will be retained in the output path.
 void setTempPath(org.apache.hadoop.fs.Path tempPath)
          Sets the temporary path under which intermediate files will be stored.
 void setUseCombiner(boolean useCombiner)
          Sets whether the combiner should be used.
protected  void validate()
          Validation required before running job.
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

AbstractJob

public AbstractJob()
Initializes the job.


AbstractJob

public AbstractJob(java.lang.String name,
                   java.util.Properties props)
Initializes the job with a job name and properties.

Parameters:
name - Job name
props - Configuration properties

Method Detail

getName

public java.lang.String getName()
Gets the job name.

Returns:
Job name

setName

public void setName(java.lang.String name)
Sets the job name.

Parameters:
name - Job name

getProperties

public java.util.Properties getProperties()
Gets the configuration properties.

Returns:
Configuration properties

setProperties

public void setProperties(java.util.Properties props)
Sets the configuration properties.

Parameters:
props - Properties

config

public void config(org.apache.hadoop.conf.Configuration conf)
Overridden to provide custom configuration before the job starts.

Parameters:
conf - Hadoop configuration to customize

getNumReducers

public java.lang.Integer getNumReducers()
Gets the number of reducers to use.

Returns:
Number of reducers

setNumReducers

public void setNumReducers(java.lang.Integer numReducers)
Sets the number of reducers to use. Can also be set with the num.reducers property.

Parameters:
numReducers - Number of reducers to use

isUseCombiner

public boolean isUseCombiner()
Gets whether the combiner should be used.

Returns:
True if combiner should be used, otherwise false.

setUseCombiner

public void setUseCombiner(boolean useCombiner)
Sets whether the combiner should be used. Can also be set with use.combiner.

Parameters:
useCombiner - True if a combiner should be used, otherwise false.

getCountersParentPath

public org.apache.hadoop.fs.Path getCountersParentPath()
Gets the path where counters will be stored.

Returns:
Counters path

setCountersParentPath

public void setCountersParentPath(org.apache.hadoop.fs.Path countersParentPath)
Sets the path where counters will be stored. Can also be set with counters.path.

Parameters:
countersParentPath - Counters path

getRetentionCount

public java.lang.Integer getRetentionCount()
Gets the number of days of data which will be retained in the output path. Only the most recent days are kept; older paths are removed.

Returns:
retention count

setRetentionCount

public void setRetentionCount(java.lang.Integer retentionCount)
Sets the number of days of data which will be retained in the output path. Only the most recent days are kept; older paths are removed. Can also be set with the retention.count property.

Parameters:
retentionCount - Number of days of data to retain
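
The retention rule can be sketched as a selection over dated output names. The sketch assumes the names sort chronologically (for example yyyyMMdd) and that an unset (null) count means nothing is removed; both are assumptions, and RetentionSketch and pathsToRemove are hypothetical names that are not part of this class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only: selects which dated outputs fall outside the retention window.
class RetentionSketch {
    // Returns the names older than the newest retentionCount entries.
    // Assumption: names sort chronologically; a null count keeps everything.
    static List<String> pathsToRemove(List<String> datedNames, Integer retentionCount) {
        if (retentionCount == null) {
            return new ArrayList<>();
        }
        if (retentionCount < 0) {
            throw new IllegalArgumentException("retentionCount must not be negative");
        }
        if (datedNames.size() <= retentionCount) {
            return new ArrayList<>();
        }
        List<String> sorted = new ArrayList<>(datedNames);
        Collections.sort(sorted);
        // Everything before the last retentionCount entries is outside the window.
        return new ArrayList<>(sorted.subList(0, sorted.size() - retentionCount));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("20130103", "20130101", "20130104", "20130102");
        System.out.println(pathsToRemove(names, 2));
    }
}
```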

getInputPaths

public java.util.List<org.apache.hadoop.fs.Path> getInputPaths()
Gets the input paths. Multiple input paths imply a join is to be performed.

Returns:
input paths

setInputPaths

public void setInputPaths(java.util.List<org.apache.hadoop.fs.Path> inputPaths)
Sets the input paths. Multiple input paths imply a join is to be performed. Can also be set with the input.path property or with several properties whose names start with input.path. (for example, input.path.first).

Parameters:
inputPaths - input paths
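
The two property forms can be illustrated with a small parser over java.util.Properties. This is a sketch under assumptions: it returns plain strings instead of Path objects, prefers input.path when it is present, and orders the suffixed properties alphabetically by key. The documentation does not specify the latter two behaviours, and InputPathsSketch and readInputPaths are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

// Sketch only: reads input paths from either property form.
class InputPathsSketch {
    static List<String> readInputPaths(Properties props) {
        List<String> paths = new ArrayList<>();
        String combined = props.getProperty("input.path");
        if (combined != null) {
            // Form 1: a single comma-separated list.
            for (String part : combined.split(",")) {
                String trimmed = part.trim();
                if (!trimmed.isEmpty()) {
                    paths.add(trimmed);
                }
            }
            return paths;
        }
        // Form 2: separate properties such as input.path.first and input.path.second.
        // TreeSet gives a deterministic, alphabetical key order (an assumption).
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            if (key.startsWith("input.path.")) {
                paths.add(props.getProperty(key).trim());
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("input.path.first", "/data/x");  // placeholder path
        props.setProperty("input.path.second", "/data/y"); // placeholder path
        System.out.println(readInputPaths(props));
    }
}
```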

getOutputPath

public org.apache.hadoop.fs.Path getOutputPath()
Gets the output path.

Returns:
output path

setOutputPath

public void setOutputPath(org.apache.hadoop.fs.Path outputPath)
Sets the output path. Can also be set with output.path.

Parameters:
outputPath - output path

getTempPath

public org.apache.hadoop.fs.Path getTempPath()
Gets the temporary path under which intermediate files will be stored. Defaults to /tmp.

Returns:
Temporary path

setTempPath

public void setTempPath(org.apache.hadoop.fs.Path tempPath)
Sets the temporary path under which intermediate files will be stored. Defaults to /tmp. Can also be set with the temp.path property.

Parameters:
tempPath - Temporary path

getFileSystem

protected org.apache.hadoop.fs.FileSystem getFileSystem()
Gets the file system.

Returns:
File system
Throws:
java.io.IOException

randomTempPath

protected org.apache.hadoop.fs.Path randomTempPath()
Generates a random temporary path within the file system. This does not create the path.

Returns:
Random temporary path

createRandomTempPath

protected org.apache.hadoop.fs.Path createRandomTempPath()
                                                  throws java.io.IOException
Creates a random temporary path within the file system.

Returns:
Random temporary path
Throws:
java.io.IOException

ensurePath

protected org.apache.hadoop.fs.Path ensurePath(org.apache.hadoop.fs.Path path)
                                        throws java.io.IOException
Creates a path, if it does not already exist.

Parameters:
path - Path to create
Returns:
The same path that was provided
Throws:
java.io.IOException
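
The create-if-absent contract, including returning the same path that was passed in, can be sketched against the local file system. The sketch uses java.nio.file as a stand-in for the Hadoop FileSystem API, so it is only an analogue of this method; EnsurePathSketch is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch only: local-filesystem analogue using java.nio.file, not Hadoop's FileSystem.
class EnsurePathSketch {
    // Creates the directory (and any missing parents) if absent, then returns the same path.
    static Path ensurePath(Path path) throws IOException {
        if (!Files.exists(path)) {
            Files.createDirectories(path);
        }
        return path;
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("ensure-path-demo");
        Path target = base.resolve("a").resolve("b");
        // Calling it twice is harmless: the second call finds the directory and does nothing.
        ensurePath(target);
        ensurePath(target);
        System.out.println(Files.isDirectory(target));
    }
}
```

Returning the argument lets a caller create and assign a path in a single expression.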

validate

protected void validate()
Validation required before running job.


initialize

protected void initialize()
Initialization required before running job.


run

public abstract void run()
                  throws java.io.IOException,
                         java.lang.InterruptedException,
                         java.lang.ClassNotFoundException
Run the job.

Throws:
java.io.IOException
java.lang.InterruptedException
java.lang.ClassNotFoundException

