StreamingQuantile (datafu-pig 1.6.1 API)

java.lang.Object
- org.apache.pig.EvalFunc<T>
  - org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.Tuple>
    - datafu.pig.stats.StreamingQuantile

All Implemented Interfaces:

org.apache.pig.Accumulator<org.apache.pig.data.Tuple>

Direct Known Subclasses:

StreamingMedian
```
public class StreamingQuantile
extends org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.Tuple>
```
Computes approximate quantiles for a (not necessarily sorted) input bag, using the Munro-Paterson algorithm.
The algorithm is described here: http://www.cs.ucsb.edu/~suri/cs290/MunroPat.pdf

The implementation is based on the one in Sawzall; see szlquantile.cc.

N.B.: all the data for a key is pushed to a single reducer, so make sure the data is partitioned (e.g., group by 'day') if it is too large. In other words, this is not a distributed quantile computation.

Note that unlike DataFu's standard Quantile algorithm, the Munro-Paterson algorithm gives approximate quantiles and does not require the input bag to be sorted. Because it implements accumulate, StreamingQuantile can be much more efficient than Quantile for large amounts of data that do not fit in memory. Quantile must spill to disk when the input data is too large to fit in memory, which leads to longer runtimes.
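
To illustrate why an accumulate-style function can avoid holding the whole bag in memory, here is a minimal sketch of the accumulate / getValue / cleanup lifecycle. The `BatchAccumulator` interface and `MinMaxAccumulator` class are hypothetical stand-ins written for this example; they are not Pig's `Accumulator` interface and not the Munro-Paterson algorithm. The toy accumulator tracks only the min and max, so its state stays constant in size no matter how many batches it sees.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the accumulate/getValue/cleanup contract (not Pig's interface).
interface BatchAccumulator<T> {
    void accumulate(List<Double> batch); // called once per batch of values for the current key
    T getValue();                        // called after the last batch for the key
    void cleanup();                      // resets state before the next key
}

// Toy accumulator (an assumption for illustration): keeps only a running min and max.
class MinMaxAccumulator implements BatchAccumulator<double[]> {
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    @Override
    public void accumulate(List<Double> batch) {
        for (double v : batch) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
    }

    @Override
    public double[] getValue() {
        return new double[] {min, max};
    }

    @Override
    public void cleanup() {
        min = Double.POSITIVE_INFINITY;
        max = Double.NEGATIVE_INFINITY;
    }
}

class AccumulateDemo {
    public static void main(String[] args) {
        MinMaxAccumulator acc = new MinMaxAccumulator();
        // The values arrive in two batches; neither batch is retained.
        acc.accumulate(Arrays.asList(9.0, 10.0, 2.0, 3.0, 5.0));
        acc.accumulate(Arrays.asList(8.0, 1.0, 4.0, 6.0, 7.0));
        System.out.println(Arrays.toString(acc.getValue())); // prints [1.0, 10.0]
        acc.cleanup();
    }
}
```

A quantile sketch needs more state than a min and a max, but the calling pattern is the same: state is updated batch by batch and the result is read once at the end.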

The constructor takes a single integer argument that specifies the number of evenly-spaced quantiles to compute, e.g.,
- StreamingQuantile('3') yields the min, the median, and the max
- StreamingQuantile('5') yields the min, the 25th, 50th, 75th percentiles, and the max
- StreamingQuantile('101') yields the min, the 1st through 99th percentiles, and the max
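
The mapping from the integer argument to quantile values can be sketched as follows. This is an illustration of the arithmetic only, not DataFu's internal code; the class and method names (`EvenQuantiles`, `evenlySpaced`) are hypothetical, and rejecting values below 2 is an assumption of this sketch.

```java
import java.util.Arrays;

class EvenQuantiles {
    // For k evenly spaced quantiles over [0.0, 1.0], the i-th quantile is i / (k - 1).
    static double[] evenlySpaced(int k) {
        if (k < 2) {
            // Assumption of this sketch: at least the min and the max are requested.
            throw new IllegalArgumentException("need at least 2 quantiles");
        }
        double[] q = new double[k];
        for (int i = 0; i < k; i++) {
            q[i] = (double) i / (k - 1);
        }
        return q;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(evenlySpaced(3))); // prints [0.0, 0.5, 1.0]
        System.out.println(Arrays.toString(evenlySpaced(5))); // prints [0.0, 0.25, 0.5, 0.75, 1.0]
    }
}
```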

Alternatively, the constructor can take an explicit list of quantiles to compute, e.g.,
- StreamingQuantile('0.0','0.5','1.0') yields the min, the median, and the max
- StreamingQuantile('0.0','0.25','0.5','0.75','1.0') yields the min, the 25th, 50th, 75th percentiles, and the max

The list of quantiles need not span the entire range from 0.0 to 1.0, nor does it need to be evenly spaced, e.g.,
- StreamingQuantile('0.5','0.90','0.95','0.99') yields the median, the 90th, 95th, and the 99th percentiles
- StreamingQuantile('0.0013','0.0228','0.1587','0.5','0.8413','0.9772','0.9987') yields the 0.13th, 2.28th, 15.87th, 50th, 84.13th, 97.72nd, and 99.87th percentiles

Be aware that when quantiles are specified this way, more quantiles may be computed internally than are actually returned. The greatest common divisor (GCD) of the requested quantiles and 1.0 is found, and it determines the spacing of the evenly spaced quantiles to compute. The requested quantiles are then returned from this set. For instance:
- If the quantiles 0.2 and 0.6 are requested then the quantiles 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0 are computed because 0.2 is the GCD of 0.2, 0.6, and 1.0.
- If 0.2 and 0.7 are requested then the quantiles 0.0, 0.1, 0.2, ... , 0.9, 1.0 are computed because 0.1 is the GCD of 0.2, 0.7, and 1.0.
- If 0.999 is requested the quantiles 0.0, 0.001, 0.002, ... , 0.998, 0.999, 1.0 are computed because 0.001 is the GCD of 0.999 and 1.0.

The approximation error decreases as the number of buckets computed increases.
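
The GCD rule described above can be reproduced with exact decimal arithmetic. The following is a sketch under stated assumptions, not DataFu's implementation: the names `QuantileGrid`, `spacing`, and `gridSize` are hypothetical, and the quantiles are assumed to be plain decimal strings in the range 0.0 to 1.0. Each value is scaled to an integer so that the GCD is computed without floating-point error.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

class QuantileGrid {
    // Spacing of the internal grid: the GCD of all requested quantiles and 1.0.
    static BigDecimal spacing(String... requested) {
        // Find the largest number of decimal places among the inputs.
        int scale = 0;
        for (String s : requested) {
            scale = Math.max(scale, new BigDecimal(s).scale());
        }
        // Start from 1.0 expressed as an integer at that scale (10^scale).
        BigInteger g = BigInteger.TEN.pow(scale);
        for (String s : requested) {
            BigInteger v = new BigDecimal(s).setScale(scale).unscaledValue();
            g = g.gcd(v);
        }
        return new BigDecimal(g, scale);
    }

    // Number of evenly spaced quantiles from 0.0 to 1.0 inclusive at that spacing.
    static int gridSize(String... requested) {
        // The spacing always divides 1.0 exactly, so this division terminates.
        return BigDecimal.ONE.divide(spacing(requested)).intValueExact() + 1;
    }

    public static void main(String[] args) {
        System.out.println(gridSize("0.2", "0.6")); // prints 6
        System.out.println(gridSize("0.2", "0.7")); // prints 11
        System.out.println(gridSize("0.999"));      // prints 1001
    }
}
```

Under these assumptions, requesting a single quantile such as 0.999 implies a grid of 1001 points, which is worth keeping in mind when choosing the values to request.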
Example:
```
DEFINE Quantile datafu.pig.stats.StreamingQuantile('5');

-- input: 9,10,2,3,5,8,1,4,6,7
data = LOAD 'input' AS (val:int);

grouped = GROUP data ALL;

-- produces: (1.0,3.0,5.0,8.0,10.0)
quantiles = FOREACH grouped GENERATE Quantile(data);
```
See Also:

StreamingMedian, Quantile

Nested Class Summary
- Nested classes/interfaces inherited from class org.apache.pig.EvalFunc
  org.apache.pig.EvalFunc.SchemaType

Field Summary
- Fields inherited from class org.apache.pig.EvalFunc
  log, pigLogger, reporter, returnType

Constructor Summary

Constructors
Constructor and Description

StreamingQuantile(java.lang.String... k)

Method Summary

Modifier and Type	Method and Description
`void`	`accumulate(org.apache.pig.data.Tuple b)`
`void`	`cleanup()`
`org.apache.pig.data.Tuple`	`getValue()`
`org.apache.pig.impl.logicalLayer.schema.Schema`	`outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)`

Methods inherited from class org.apache.pig.AccumulatorEvalFunc
exec

Methods inherited from class org.apache.pig.EvalFunc
allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - StreamingQuantile
```
public StreamingQuantile(java.lang.String... k)
```
- Method Detail
  - accumulate
```
public void accumulate(org.apache.pig.data.Tuple b)
                throws java.io.IOException
```
    Specified by: accumulate in interface org.apache.pig.Accumulator<org.apache.pig.data.Tuple>

    Specified by: accumulate in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.Tuple>

    Throws: java.io.IOException
  - cleanup
```
public void cleanup()
```
    Specified by: cleanup in interface org.apache.pig.Accumulator<org.apache.pig.data.Tuple>

    Specified by: cleanup in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.Tuple>
  - getValue
```
public org.apache.pig.data.Tuple getValue()
```
    Specified by: getValue in interface org.apache.pig.Accumulator<org.apache.pig.data.Tuple>

    Specified by: getValue in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.Tuple>
  - outputSchema
```
public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
```
    Overrides: outputSchema in class org.apache.pig.EvalFunc<org.apache.pig.data.Tuple>
