datafu.pig.sampling
Class SimpleRandomSample

java.lang.Object
  extended by org.apache.pig.EvalFunc<T>
      extended by org.apache.pig.AccumulatorEvalFunc<T>
          extended by org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>
              extended by datafu.pig.sampling.SimpleRandomSample
All Implemented Interfaces:
org.apache.pig.Accumulator<org.apache.pig.data.DataBag>, org.apache.pig.Algebraic

public class SimpleRandomSample
extends org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>

Scalable simple random sampling.

This UDF implements a scalable simple random sampling algorithm described in

 X. Meng, Scalable Simple Random Sampling and Stratified Sampling, ICML 2013.
 
It takes a sampling probability p as input and outputs a simple random sample of size exactly ceil(p*n) with probability at least 99.99%, where $n$ is the size of the population. This UDF is very useful for stratified sampling. For example,
 DEFINE SRS datafu.pig.sampling.SimpleRandomSample('0.01');
 examples = LOAD ...
 grouped = GROUP examples BY label;
 sampled = FOREACH grouped GENERATE FLATTEN(SRS(examples));
 STORE sampled ...
 
We note that, in a Java Hadoop job, we can output pre-selected records directly using MultipleOutputs. However, this feature is not available in a Pig UDF. So we still let pre-selected records go through the sort phase. However, as long as the sample size is not huge, this should not be a big problem.

Author:
ximeng

Nested Class Summary
static class SimpleRandomSample.Final
           
static class SimpleRandomSample.Initial
           
static class SimpleRandomSample.Intermediate
           
 
Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
SimpleRandomSample()
           
SimpleRandomSample(java.lang.String samplingProbability)
           
 
Method Summary
 java.lang.String getFinal()
           
 java.lang.String getInitial()
           
 java.lang.String getIntermed()
           
 org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
           
 
Methods inherited from class org.apache.pig.AlgebraicEvalFunc
accumulate, cleanup, getValue
 
Methods inherited from class org.apache.pig.AccumulatorEvalFunc
exec
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SimpleRandomSample

public SimpleRandomSample()

SimpleRandomSample

public SimpleRandomSample(java.lang.String samplingProbability)
Method Detail

getInitial

public java.lang.String getInitial()
Specified by:
getInitial in interface org.apache.pig.Algebraic
Specified by:
getInitial in class org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>

getIntermed

public java.lang.String getIntermed()
Specified by:
getIntermed in interface org.apache.pig.Algebraic
Specified by:
getIntermed in class org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>

getFinal

public java.lang.String getFinal()
Specified by:
getFinal in interface org.apache.pig.Algebraic
Specified by:
getFinal in class org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>

outputSchema

public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Overrides:
outputSchema in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>


Matthew Hayes, Sam Shah