public class Quantile extends SimpleEvalFunc<org.apache.pig.data.Tuple>
N.B.: all the data for a key is pushed to a single reducer, so partition the data (e.g., group by 'day') if it is too large. In other words, this is not a distributed quantile computation.
Note that unlike datafu's StreamingQuantile algorithm, this implementation gives exact quantiles. However, it requires the input bag to be sorted. Quantile must spill to disk when the input data is too large to fit in memory, which contributes to longer runtimes. Because StreamingQuantile implements accumulate, it can be much more efficient than Quantile for large input bags that do not fit well in memory.
The constructor takes a single integer argument that specifies the number of evenly-spaced quantiles to compute.
Alternatively, the constructor can take the explicit list of quantiles to compute.
The list of quantiles need not span the entire range from 0.0 to 1.0, nor do the quantiles need to be evenly spaced.
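To make the single-integer form concrete, here is a minimal Java sketch of how a count could be expanded into evenly spaced quantile levels. The class `EvenQuantiles` and the helper `evenlySpaced` are hypothetical names used only for illustration, and the sketch assumes the levels include both endpoints 0.0 and 1.0; it is not taken from the DataFu source.

```java
import java.util.ArrayList;
import java.util.List;

public class EvenQuantiles {
    // Hypothetical helper (not DataFu's code): expands a count n >= 2 into
    // n evenly spaced quantile levels, assuming both 0.0 and 1.0 are included.
    public static List<Double> evenlySpaced(int n) {
        if (n < 2) {
            throw new IllegalArgumentException("need at least 2 quantiles");
        }
        List<Double> levels = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            levels.add((double) i / (n - 1));
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(evenlySpaced(3)); // [0.0, 0.5, 1.0]
        System.out.println(evenlySpaced(5)); // [0.0, 0.25, 0.5, 0.75, 1.0]
    }
}
```

Under that assumption, a count of 3 corresponds to the explicit list '0.0','0.5','1.0', that is, the minimum, the median, and the maximum.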
define Quantile datafu.pig.stats.Quantile('0.0','0.5','1.0');
-- input: 9,10,2,3,5,8,1,4,6,7
input = LOAD 'input' AS (val:int);
grouped = GROUP input ALL;
-- produces: (1,5.5,10)
quantiles = FOREACH grouped {
  sorted = ORDER input BY val;
  GENERATE Quantile(sorted);
}
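The result of the Pig example can be checked by hand with a small Java sketch. The class `QuantileSketch` and its `quantile` helper are hypothetical, and the rule used here (average the two sorted values around position q*N + 0.5, clamped to the valid range) is an assumption chosen because it reproduces the documented output (1,5.5,10); this page does not state the exact estimation rule the UDF uses.

```java
import java.util.Arrays;

public class QuantileSketch {
    // Hypothetical helper (not DataFu's code). Expects a non-empty array
    // sorted in ascending order and a quantile level q in [0.0, 1.0].
    public static double quantile(double[] sorted, double q) {
        int n = sorted.length;
        double p = q * n + 0.5;               // assumed 1-based fractional position
        int lo = (int) Math.floor(p);
        int hi = (int) Math.ceil(p);
        lo = Math.max(1, Math.min(n, lo));    // clamp both indexes to [1, n]
        hi = Math.max(1, Math.min(n, hi));
        return (sorted[lo - 1] + sorted[hi - 1]) / 2.0;
    }

    public static void main(String[] args) {
        double[] vals = {9, 10, 2, 3, 5, 8, 1, 4, 6, 7};
        Arrays.sort(vals);                    // mirrors ORDER input BY val
        System.out.println(quantile(vals, 0.0)); // 1.0
        System.out.println(quantile(vals, 0.5)); // 5.5
        System.out.println(quantile(vals, 1.0)); // 10.0
    }
}
```

The explicit sort step matters: like the UDF, the helper reads values by position, so unsorted input would silently give wrong answers.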
See also: Median, StreamingQuantile
Constructor and Description |
---|
Quantile(java.lang.String... k) |
Modifier and Type | Method and Description |
---|---|
org.apache.pig.data.Tuple | call(org.apache.pig.data.DataBag bag) |
org.apache.pig.impl.logicalLayer.schema.Schema | outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input) Override outputSchema so we can verify the input schema at Pig compile time instead of at runtime. |
exec, getReturnType
getContextProperties, getInstanceName, getInstanceProperties, onReady, setUDFContextSignature
public org.apache.pig.data.Tuple call(org.apache.pig.data.DataBag bag) throws java.io.IOException
Throws: java.io.IOException
public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Overrides: outputSchema in class SimpleEvalFunc<org.apache.pig.data.Tuple>
Parameters: input - input schema