datafu.hourglass.jobs
Class TimePartitioner
java.lang.Object
org.apache.hadoop.mapreduce.Partitioner<org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord>,org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord>>
datafu.hourglass.jobs.TimePartitioner
- All Implemented Interfaces:
- org.apache.hadoop.conf.Configurable
public class TimePartitioner
- extends org.apache.hadoop.mapreduce.Partitioner<org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord>,org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord>>
- implements org.apache.hadoop.conf.Configurable
A partitioner used by AbstractPartitionPreservingIncrementalJob to limit the number of named outputs used by each reducer.
The purpose of this partitioner is to prevent a proliferation of small files created by AbstractPartitionPreservingIncrementalJob.
This job writes multiple outputs, each corresponding to one day of input data. By default, records are distributed across all
the reducers, so when many input days are consumed, each reducer writes many outputs, and these outputs are typically
small. The problem gets worse as more input data is consumed, because more reducers are required.
This partitioner solves the problem by limiting how many days of input data are mapped to each reducer. At the extreme, each day of
input data can be mapped to only one reducer. This is controlled through the configuration setting incremental.reducers.per.input,
which should be set in the Hadoop configuration. Input days are assigned to reducers in a round-robin fashion.
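The round-robin assignment described above can be illustrated with a minimal, dependency-free sketch. It is not the TimePartitioner source: the class RoundRobinDaySketch, its methods assign and partitionFor, and the use of the key's hash code to choose among a day's reducers are all assumptions made for illustration. Only the idea of a per-day reducer limit (the value of incremental.reducers.per.input) and the round-robin order come from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and details are not taken from DataFu.
final class RoundRobinDaySketch {

    // Gives each input day a fixed group of reducer indices, handing out
    // reducers in round-robin order across the days.
    static Map<Long, List<Integer>> assign(List<Long> inputDays,
                                           int numReducers,
                                           int reducersPerInput) {
        if (numReducers <= 0 || reducersPerInput <= 0) {
            throw new IllegalArgumentException("reducer counts must be positive");
        }
        // A day cannot use more distinct reducers than exist.
        int perDay = Math.min(reducersPerInput, numReducers);
        Map<Long, List<Integer>> mapping = new LinkedHashMap<>();
        int next = 0;
        for (Long day : inputDays) {
            List<Integer> reducers = new ArrayList<>();
            for (int i = 0; i < perDay; i++) {
                reducers.add(next);
                next = (next + 1) % numReducers;
            }
            mapping.put(day, reducers);
        }
        return mapping;
    }

    // Picks one of the reducers assigned to the record's day.
    // Spreading by key hash is an assumption of this sketch.
    static int partitionFor(Map<Long, List<Integer>> mapping, long day, Object key) {
        List<Integer> reducers = mapping.get(day);
        if (reducers == null) {
            throw new IllegalArgumentException("unknown input day: " + day);
        }
        int index = (key.hashCode() & Integer.MAX_VALUE) % reducers.size();
        return reducers.get(index);
    }
}
```

In this sketch, three input days spread over four reducers with a limit of two reducers per day yield the groups [0, 1], [2, 3], and [0, 1], so each reducer writes output for at most two days instead of all three.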
- Author:
- "Matthew Hayes"
Method Summary
org.apache.hadoop.conf.Configuration | getConf()
int | getPartition(org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord> key, org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord> value, int numReduceTasks)
void | setConf(org.apache.hadoop.conf.Configuration conf)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
INPUT_TIMES
public static java.lang.String INPUT_TIMES
REDUCERS_PER_INPUT
public static java.lang.String REDUCERS_PER_INPUT
TimePartitioner
public TimePartitioner()
getConf
public org.apache.hadoop.conf.Configuration getConf()
- Specified by:
- getConf in interface org.apache.hadoop.conf.Configurable
setConf
public void setConf(org.apache.hadoop.conf.Configuration conf)
- Specified by:
- setConf in interface org.apache.hadoop.conf.Configurable
getPartition
public int getPartition(org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord> key,
org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord> value,
int numReduceTasks)
- Specified by:
- getPartition in class org.apache.hadoop.mapreduce.Partitioner<org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord>,org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord>>