datafu.hourglass.jobs
Class StagedOutputJob

java.lang.Object
  extended by org.apache.hadoop.mapreduce.JobContext
      extended by org.apache.hadoop.mapreduce.Job
          extended by datafu.hourglass.jobs.StagedOutputJob
All Implemented Interfaces:
java.util.concurrent.Callable<java.lang.Boolean>

public class StagedOutputJob
extends org.apache.hadoop.mapreduce.Job
implements java.util.concurrent.Callable<java.lang.Boolean>

A derivation of Job that stages its output in another location and only moves it to the final destination if the job completes successfully. It also outputs a counters file to the file system that contains counters fetched from Hadoop and other task statistics.


Nested Class Summary
 
Nested classes/interfaces inherited from class org.apache.hadoop.mapreduce.Job
org.apache.hadoop.mapreduce.Job.JobState
 
Field Summary
 
Fields inherited from class org.apache.hadoop.mapreduce.JobContext
CACHE_ARCHIVES_VISIBILITIES, CACHE_FILE_VISIBILITIES, COMBINE_CLASS_ATTR, conf, credentials, INPUT_FORMAT_CLASS_ATTR, JOB_ACL_MODIFY_JOB, JOB_ACL_VIEW_JOB, JOB_CANCEL_DELEGATION_TOKEN, JOB_NAMENODES, MAP_CLASS_ATTR, OUTPUT_FORMAT_CLASS_ATTR, PARTITIONER_CLASS_ATTR, REDUCE_CLASS_ATTR, ugi, USER_LOG_RETAIN_HOURS
 
Constructor Summary
StagedOutputJob(org.apache.hadoop.conf.Configuration conf, java.lang.String stagingPrefix, org.apache.log4j.Logger log)
          Initializes the job.
 
Method Summary
 java.lang.Boolean call()
          Run the job.
static StagedOutputJob createStagedJob(org.apache.hadoop.conf.Configuration conf, java.lang.String jobName, java.util.List<java.lang.String> inputPaths, java.lang.String stagingLocation, java.lang.String outputPath, org.apache.log4j.Logger log)
          Creates a job which using a temporary staging location for the output data.
 org.apache.hadoop.fs.Path getCountersParentPath()
          Gets path to store the counters.
 org.apache.hadoop.fs.Path getCountersPath()
          Path to written counters.
 boolean getWriteCounters()
          Get whether counters should be written.
 void setCountersParentPath(org.apache.hadoop.fs.Path path)
          Sets path to store the counters.
 void setWriteCounters(boolean writeCounters)
          Sets whether counters should be written.
 boolean waitForCompletion(boolean verbose)
          Run the job and wait for it to complete.
 
Methods inherited from class org.apache.hadoop.mapreduce.Job
failTask, getCounters, getJar, getTaskCompletionEvents, getTrackingURL, isComplete, isSuccessful, killJob, killTask, mapProgress, reduceProgress, setCancelDelegationTokenUponJobCompletion, setCombinerClass, setGroupingComparatorClass, setInputFormatClass, setJarByClass, setJobName, setMapOutputKeyClass, setMapOutputValueClass, setMapperClass, setMapSpeculativeExecution, setNumReduceTasks, setOutputFormatClass, setOutputKeyClass, setOutputValueClass, setPartitionerClass, setReducerClass, setReduceSpeculativeExecution, setSortComparatorClass, setSpeculativeExecution, setupProgress, setWorkingDirectory, submit
 
Methods inherited from class org.apache.hadoop.mapreduce.JobContext
getCombinerClass, getConfiguration, getCredentials, getGroupingComparator, getInputFormatClass, getJobID, getJobName, getMapOutputKeyClass, getMapOutputValueClass, getMapperClass, getNumReduceTasks, getOutputFormatClass, getOutputKeyClass, getOutputValueClass, getPartitionerClass, getReducerClass, getSortComparator, getWorkingDirectory
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

StagedOutputJob

public StagedOutputJob(org.apache.hadoop.conf.Configuration conf,
                       java.lang.String stagingPrefix,
                       org.apache.log4j.Logger log)
                throws java.io.IOException
Initializes the job.

Parameters:
conf - configuration
stagingPrefix - where to stage output temporarily
log - logger
Throws:
java.io.IOException
Method Detail

createStagedJob

public static StagedOutputJob createStagedJob(org.apache.hadoop.conf.Configuration conf,
                                              java.lang.String jobName,
                                              java.util.List<java.lang.String> inputPaths,
                                              java.lang.String stagingLocation,
                                              java.lang.String outputPath,
                                              org.apache.log4j.Logger log)
Creates a job which using a temporary staging location for the output data. The data is only copied to the final output directory on successful completion of the job. This prevents existing output data from being overwritten unless the job completes successfully.

Parameters:
conf - configuration
jobName - job name
inputPaths - input paths
stagingLocation - where to stage output temporarily
outputPath - output path
log - logger
Returns:
job

getCountersParentPath

public org.apache.hadoop.fs.Path getCountersParentPath()
Gets path to store the counters. If this is not set then by default the counters will be stored in the output directory.

Returns:
path parent path for counters

setCountersParentPath

public void setCountersParentPath(org.apache.hadoop.fs.Path path)
Sets path to store the counters. If this is not set then by default the counters will be stored in the output directory.

Parameters:
path - parent path for counters

getCountersPath

public org.apache.hadoop.fs.Path getCountersPath()
Path to written counters.

Returns:
counters path

getWriteCounters

public boolean getWriteCounters()
Get whether counters should be written.

Returns:
true if counters should be written

setWriteCounters

public void setWriteCounters(boolean writeCounters)
Sets whether counters should be written.

Parameters:
writeCounters - true if counters should be written

call

public java.lang.Boolean call()
                       throws java.lang.Exception
Run the job.

Specified by:
call in interface java.util.concurrent.Callable<java.lang.Boolean>
Throws:
java.lang.Exception

waitForCompletion

public boolean waitForCompletion(boolean verbose)
                          throws java.io.IOException,
                                 java.lang.InterruptedException,
                                 java.lang.ClassNotFoundException
Run the job and wait for it to complete. Output will be temporarily stored under the staging path. If the job is successful it will be moved to the final location.

Overrides:
waitForCompletion in class org.apache.hadoop.mapreduce.Job
Throws:
java.io.IOException
java.lang.InterruptedException
java.lang.ClassNotFoundException


Matthew Hayes