datafu.pig.stats
Class Quantile
java.lang.Object
org.apache.pig.EvalFunc<T>
datafu.pig.util.SimpleEvalFunc<org.apache.pig.data.Tuple>
datafu.pig.stats.Quantile
- Direct Known Subclasses:
- Median
public class Quantile
- extends SimpleEvalFunc<org.apache.pig.data.Tuple>
Computes quantiles for a sorted input bag, using type R-2 estimation.
N.B., all the data for a key is pushed to a single reducer, so make sure some partitioning is
done (e.g., group by 'day') if the data is too large. That is, this is not a distributed quantile computation.
Note that, unlike DataFu's StreamingQuantile algorithm, this implementation gives
exact quantiles. However, it requires the input bag to be sorted. Quantile must spill to
disk when the input data is too large to fit in memory, which contributes to longer runtimes.
Because StreamingQuantile implements accumulate, it can be much more efficient than Quantile for
large input bags that do not fit well in memory.
The constructor takes a single integer argument that specifies the number of evenly-spaced
quantiles to compute, e.g.,
- Quantile('3') yields the min, the median, and the max
- Quantile('5') yields the min, the 25th, 50th, 75th percentiles, and the max
- Quantile('101') yields the min, the 1st through 99th percentiles, and the max
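The integer form can be read as shorthand for an evenly spaced list of quantile fractions. The examples above are consistent with the mapping i/(k-1) for i = 0..k-1, and the sketch below assumes that mapping. The class and method names are hypothetical and are not part of DataFu.

```java
import java.util.Arrays;

// Hypothetical helper, not part of DataFu: expands the integer constructor
// argument k into k evenly spaced quantile fractions on [0.0, 1.0].
class EvenlySpacedQuantiles {

    // Assumed mapping, inferred from the documented examples: i / (k - 1).
    static double[] evenlySpaced(int k) {
        if (k < 2) {
            throw new IllegalArgumentException("need at least 2 quantiles, got " + k);
        }
        double[] fractions = new double[k];
        for (int i = 0; i < k; i++) {
            fractions[i] = (double) i / (k - 1);
        }
        return fractions;
    }

    public static void main(String[] args) {
        // Prints the fractions that correspond to Quantile('5').
        System.out.println(Arrays.toString(evenlySpaced(5)));
    }
}
```

Under this assumption, evenlySpaced(5) gives the same fractions as the explicit form Quantile('0.0','0.25','0.5','0.75','1.0').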
Alternatively the constructor can take the explicit list of quantiles to compute, e.g.
- Quantile('0.0','0.5','1.0') yields the min, the median, and the max
- Quantile('0.0','0.25','0.5','0.75','1.0') yields the min, the 25th, 50th, 75th percentiles, and the max
The list of quantiles need not span the entire range from 0.0 to 1.0, nor does it need to be evenly spaced, e.g.
- Quantile('0.5','0.90','0.95','0.99') yields the median, the 90th, 95th, and the 99th percentiles
- Quantile('0.0013','0.0228','0.1587','0.5','0.8413','0.9772','0.9987') yields the 0.13th, 2.28th, 15.87th, 50th, 84.13th, 97.72nd, and 99.87th percentiles
Example:
define Quantile datafu.pig.stats.Quantile('0.0','0.5','1.0');
-- input: 9,10,2,3,5,8,1,4,6,7
input = LOAD 'input' AS (val:int);
grouped = GROUP input ALL;
-- produces: (1,5.5,10)
quantiles = FOREACH grouped {
  sorted = ORDER input BY val;
  GENERATE Quantile(sorted);
}
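As a worked illustration of R-2 estimation, the sketch below reproduces the (1,5.5,10) result from the example. It assumes the standard R-2 definition (inverse empirical CDF, averaging the two neighbouring order statistics at a discontinuity), with 1-based ranks ceil(n*p) and floor(n*p + 1) clamped to [1, n]. It is not DataFu's source code, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of type R-2 quantile estimation; not DataFu's source code.
class R2QuantileSketch {

    // sorted must be in ascending order; p is a fraction in [0.0, 1.0].
    static double quantile(double[] sorted, double p) {
        int n = sorted.length;
        if (n == 0) {
            throw new IllegalArgumentException("empty input");
        }
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("p must be in [0, 1], got " + p);
        }
        double np = n * p;
        // 1-based ranks of the two candidate order statistics, clamped to [1, n].
        int lo = (int) Math.max(1, Math.min(n, Math.ceil(np)));
        int hi = (int) Math.max(1, Math.min(n, Math.floor(np + 1)));
        // When n*p is an integer the ranks differ and the two values are averaged;
        // otherwise both ranks coincide and a single order statistic is returned.
        return (sorted[lo - 1] + sorted[hi - 1]) / 2.0;
    }

    public static void main(String[] args) {
        // Same values as the Pig example; sorting mirrors ORDER input BY val.
        double[] values = {9, 10, 2, 3, 5, 8, 1, 4, 6, 7};
        Arrays.sort(values);
        for (double p : new double[] {0.0, 0.5, 1.0}) {
            System.out.println(p + " -> " + quantile(values, p));
        }
    }
}
```

For the ten values in the example, n*p = 5 at p = 0.5, so the 5th and 6th sorted values (5 and 6) are averaged to 5.5, while p = 0.0 and p = 1.0 clamp to the first and last values.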
- See Also:
Median, StreamingQuantile
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
Constructor Summary
Quantile(java.lang.String... k)
Method Summary
org.apache.pig.data.Tuple call(org.apache.pig.data.DataBag bag)
org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Overrides outputSchema so that the input schema can be verified at Pig compile time instead of at runtime.
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Quantile
public Quantile(java.lang.String... k)
call
public org.apache.pig.data.Tuple call(org.apache.pig.data.DataBag bag)
throws java.io.IOException
- Throws:
java.io.IOException
outputSchema
public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
- Description copied from class: SimpleEvalFunc
- Overrides outputSchema so that the input schema can be verified at Pig compile time instead of at runtime.
- Overrides:
outputSchema in class SimpleEvalFunc<org.apache.pig.data.Tuple>
- Parameters:
input - input schema
- Returns:
- the result of calling super.outputSchema, in case the schema was defined elsewhere
Matthew Hayes, Sam Shah