public class TimePartitioner
extends org.apache.hadoop.mapreduce.Partitioner<org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord>,org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord>>
implements org.apache.hadoop.conf.Configurable
A partitioner used by AbstractPartitionPreservingIncrementalJob to limit the number of named outputs used by each reducer.
The purpose of this partitioner is to prevent a proliferation of small files created by AbstractPartitionPreservingIncrementalJob.
This job writes multiple outputs. Each output corresponds to a day of input data. By default, records are distributed across all
the reducers. This means that if many days of input are consumed, each reducer will write many outputs, and these outputs will typically
be small. The problem worsens as more input data is consumed, because more reducers are required.
This partitioner solves the problem by limiting how many days of input data are mapped to each reducer. At the extreme, each day of input data could be mapped to only one reducer. This is controlled through the configuration setting incremental.reducers.per.input, which should be set in the Hadoop configuration. Input days are assigned to reducers in a round-robin fashion.
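The round-robin assignment described above can be sketched in plain Java. This is a hypothetical illustration, not the actual implementation: the `assign` helper, the day strings, and the slot counts are invented for the example; only the idea of giving each input day a bounded number of reducers (per incremental.reducers.per.input) comes from the text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RoundRobinAssignment {
    // Hypothetical sketch: give each input day `reducersPerInput` reducer
    // slots, handed out round-robin and wrapping at the total reducer count.
    static Map<String, List<Integer>> assign(List<String> days,
                                             int reducersPerInput,
                                             int numReduceTasks) {
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        int next = 0;
        for (String day : days) {
            List<Integer> partitions = new ArrayList<>();
            for (int i = 0; i < reducersPerInput; i++) {
                partitions.add(next % numReduceTasks);
                next++;
            }
            assignment.put(day, partitions);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // 3 input days, 2 reducers per day, 4 reducers total:
        // the third day wraps around to reducers 0 and 1.
        List<String> days = Arrays.asList("2013-01-01", "2013-01-02", "2013-01-03");
        System.out.println(assign(days, 2, 4));
        // → {2013-01-01=[0, 1], 2013-01-02=[2, 3], 2013-01-03=[0, 1]}
    }
}
```

Because each day touches only a fixed number of reducers, the number of small per-day output files each reducer writes stays bounded regardless of how many days of input the job consumes.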
Modifier and Type | Field and Description
---|---
static java.lang.String | INPUT_TIMES
static java.lang.String | REDUCERS_PER_INPUT
Constructor and Description
---
TimePartitioner()
Modifier and Type | Method and Description
---|---
org.apache.hadoop.conf.Configuration | getConf()
int | getPartition(org.apache.avro.mapred.AvroKey&lt;org.apache.avro.generic.GenericRecord&gt; key, org.apache.avro.mapred.AvroValue&lt;org.apache.avro.generic.GenericRecord&gt; value, int numReduceTasks)
void | setConf(org.apache.hadoop.conf.Configuration conf)
public static java.lang.String INPUT_TIMES
public static java.lang.String REDUCERS_PER_INPUT
public org.apache.hadoop.conf.Configuration getConf()
Specified by: getConf in interface org.apache.hadoop.conf.Configurable
public void setConf(org.apache.hadoop.conf.Configuration conf)
Specified by: setConf in interface org.apache.hadoop.conf.Configurable
public int getPartition(org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord> key, org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord> value, int numReduceTasks)
Specified by: getPartition in class org.apache.hadoop.mapreduce.Partitioner&lt;org.apache.avro.mapred.AvroKey&lt;org.apache.avro.generic.GenericRecord&gt;,org.apache.avro.mapred.AvroValue&lt;org.apache.avro.generic.GenericRecord&gt;&gt;