datafu.pig.stats
Class Quantile

java.lang.Object
  extended by org.apache.pig.EvalFunc<T>
      extended by datafu.pig.util.SimpleEvalFunc<org.apache.pig.data.Tuple>
          extended by datafu.pig.stats.Quantile
Direct Known Subclasses:
Median

public class Quantile
extends SimpleEvalFunc<org.apache.pig.data.Tuple>

Computes quantiles for a sorted input bag, using type R-2 estimation.
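For readers unfamiliar with the term, the following is a minimal sketch of the textbook type R-2 estimator (the inverse empirical CDF, averaging the two neighbouring values at discontinuities). It is not taken from the DataFu source. The class name `R2QuantileSketch` and method `quantile` are hypothetical, and clamping the indices at p = 0 and p = 1 is an assumption, chosen so that those quantiles return the minimum and maximum.

```java
import java.util.Arrays;

// Illustrative only: a textbook R-2 estimator, not DataFu's implementation.
class R2QuantileSketch {

    // sorted: non-empty array in ascending order; p: quantile in [0, 1].
    static double quantile(double[] sorted, double p) {
        int n = sorted.length;
        double np = n * p;
        // 1-based indices of the two order statistics that R-2 averages.
        int lo = (int) Math.ceil(np);
        int hi = (int) Math.floor(np + 1.0);
        // Assumption: clamp to [1, n] so p = 0 gives the min and p = 1 the max.
        lo = Math.max(1, Math.min(n, lo));
        hi = Math.max(1, Math.min(n, hi));
        return (sorted[lo - 1] + sorted[hi - 1]) / 2.0;
    }

    public static void main(String[] args) {
        double[] values = {9, 10, 2, 3, 5, 8, 1, 4, 6, 7};
        Arrays.sort(values); // the estimator assumes sorted input
        for (double p : new double[] {0.0, 0.5, 1.0}) {
            System.out.println(p + " -> " + quantile(values, p));
        }
    }
}
```

With the ten values 1 through 10, the median averages the 5th and 6th order statistics, which is why a value of 5.5 appears even though it is not in the input.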

N.B.: all the data for a key is pushed to a single reducer, so make sure some partitioning is done (e.g., group by 'day') if the data is too large. In other words, this is not a distributed quantile computation.

Note that unlike DataFu's StreamingQuantile algorithm, this implementation gives exact quantiles. However, it requires the input bag to be sorted. Quantile must spill to disk when the input data is too large to fit in memory, which contributes to longer runtimes. Because StreamingQuantile implements accumulate, it can be much more efficient than Quantile for large input bags that do not fit well in memory.

The constructor takes a single integer argument that specifies the number of evenly spaced quantiles to compute.
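One plausible reading of "evenly spaced" is sketched below. It assumes the N quantiles include both endpoints, 0.0 and 1.0, so that a count of 3 would mean the minimum, median, and maximum. This page does not confirm that reading, and the class `EvenlySpacedQuantiles` and method `evenlySpaced` are hypothetical names, not part of DataFu.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: expands a count into quantile levels in [0, 1].
class EvenlySpacedQuantiles {

    // Assumption: both endpoints (0.0 and 1.0) are included, so count >= 2.
    static List<Double> evenlySpaced(int count) {
        if (count < 2) {
            throw new IllegalArgumentException("count must be at least 2");
        }
        List<Double> levels = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            levels.add((double) i / (count - 1));
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(evenlySpaced(3));
        System.out.println(evenlySpaced(5));
    }
}
```

Under this assumption, a count of 5 corresponds to the levels 0.0, 0.25, 0.5, 0.75, and 1.0.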

Alternatively, the constructor can take an explicit list of quantiles to compute, such as '0.0','0.5','1.0' for the minimum, the median, and the maximum.

The list of quantiles need not span the entire range from 0.0 to 1.0, nor do the quantiles need to be evenly spaced.

Example:

 define Quantile datafu.pig.stats.Quantile('0.0','0.5','1.0');

 -- input: 9,10,2,3,5,8,1,4,6,7
 input = LOAD 'input' AS (val:int);

 grouped = GROUP input ALL;

 -- produces: (1,5.5,10)
 quantiles = FOREACH grouped {
   sorted = ORDER input BY val;
   GENERATE Quantile(sorted);
 }
 

See Also:
Median, StreamingQuantile

Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
Quantile(java.lang.String... k)
           
 
Method Summary
 org.apache.pig.data.Tuple call(org.apache.pig.data.DataBag bag)
           
 org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
          Override outputSchema so we can verify the input schema at Pig compile time instead of at runtime.
 
Methods inherited from class datafu.pig.util.SimpleEvalFunc
exec, getReturnType
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

Quantile

public Quantile(java.lang.String... k)
Method Detail

call

public org.apache.pig.data.Tuple call(org.apache.pig.data.DataBag bag)
                               throws java.io.IOException
Throws:
java.io.IOException

outputSchema

public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Description copied from class: SimpleEvalFunc
Override outputSchema so we can verify the input schema at Pig compile time instead of at runtime.

Overrides:
outputSchema in class SimpleEvalFunc<org.apache.pig.data.Tuple>
Parameters:
input - input schema
Returns:
call to super.outputSchema in case schema was defined elsewhere


Matthew Hayes, Sam Shah