public class SimpleRandomSample
extends org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>
This UDF implements a scalable simple random sampling algorithm described in
X. Meng, Scalable Simple Random Sampling and Stratified Sampling, ICML 2013.
It takes a bag of n items and a sampling probability p as inputs and, with probability at least 99.99%, outputs a bag containing a simple random sample of exactly ceil(p*n) items. For example, the following script generates a simple random sample with sampling probability 0.01:
DEFINE SRS datafu.pig.sampling.SimpleRandomSample();
item = LOAD 'input' AS (x:double);
sampled = FOREACH (GROUP item ALL) GENERATE FLATTEN(SRS(item, 0.01));
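To make the "exactly ceil(p*n)" output size concrete, the following is a minimal in-memory Java sketch of one textbook way to draw a fixed-size simple random sample: tag each item with a uniform random key and keep the items with the smallest keys. It is an illustration only and is not taken from the UDF's source; the names ExactSizeSample, sampleSize, and sample are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

final class ExactSizeSample {
    /** Target sample size for n items at sampling probability p: ceil(p * n). */
    public static int sampleSize(double p, long n) {
        return (int) Math.ceil(p * n);
    }

    /** Gives every item a uniform random key and keeps the k items with the smallest keys. */
    public static <T> List<T> sample(List<T> items, double p, Random rng) {
        int n = items.size();
        int k = sampleSize(p, n);
        double[] keys = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            keys[i] = rng.nextDouble(); // uniform key in [0, 1)
            order[i] = i;
        }
        // Sort item positions by their random keys; the first k positions form the sample.
        Arrays.sort(order, Comparator.comparingDouble(i -> keys[i]));
        List<T> out = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            out.add(items.get(order[i]));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            items.add(i);
        }
        List<Integer> sampled = sample(items, 0.01, new Random(42));
        System.out.println(sampled.size()); // 10 items for n = 1000, p = 0.01
    }
}
```

Sorting all n keys is fine for a sketch but does not scale; the cited paper is about avoiding exactly that cost on large inputs.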
Optionally, the user can provide a good lower bound on n as the third argument to help reduce the size of the intermediate data. For example:
DEFINE SRS datafu.pig.sampling.SimpleRandomSample();
item = LOAD 'input' AS (x:double);
summary = FOREACH (GROUP item ALL) GENERATE COUNT(item) AS count;
sampled = FOREACH (GROUP item ALL) GENERATE FLATTEN(SRS(item, 0.01, summary.count));
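This page does not say how the lower bound is used. One plausible reading, based on the style of bound used in the cited paper, is that items whose random key exceeds a threshold slightly above p can be discarded early, and that the threshold tightens as the known lower bound on n grows. The Java sketch below only illustrates that trend. The formula, the failure probability delta = 1e-4 (chosen to match the 99.99% figure above), and the names RejectionThreshold and upperThreshold are all assumptions, not the UDF's code.

```java
final class RejectionThreshold {
    /**
     * Assumed early-rejection threshold on the random key (not taken from the UDF source):
     * q = min(1, p + g + sqrt(g^2 + 2*g*p)), where g = -ln(delta) / nLowerBound.
     * Items with a key above q would be dropped before reaching the final step.
     */
    public static double upperThreshold(double p, long nLowerBound, double delta) {
        double g = -Math.log(delta) / nLowerBound;
        return Math.min(1.0, p + g + Math.sqrt(g * g + 2.0 * g * p));
    }

    public static void main(String[] args) {
        double p = 0.01;
        double delta = 1e-4; // assumed failure probability
        // A larger lower bound on n gives a threshold closer to p,
        // so fewer items would be kept as intermediate candidates.
        for (long n : new long[] {1_000L, 1_000_000L, 1_000_000_000L}) {
            System.out.println(n + " -> " + upperThreshold(p, n, delta));
        }
    }
}
```

Under these assumptions, the expected fraction of items kept as candidates is roughly the threshold itself, which is why a tighter bound on n would shrink the intermediate data.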
This UDF is very useful for stratified sampling. For example, the following script keeps all positive examples while downsampling negatives with probability 0.1:
DEFINE SRS datafu.pig.sampling.SimpleRandomSample();
item = LOAD 'input' AS (x:double, label:int);
grouped = FOREACH (GROUP item BY label) GENERATE item, (group == 1 ? 1.0 : 0.1) AS p;
sampled = FOREACH grouped GENERATE FLATTEN(SRS(item, p));
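The same stratified logic can be sketched in plain Java: group items by label, then sample each group with its own probability. This is an in-memory illustration under the same rule as the script (label 1 kept in full, every other label sampled at 0.1); StratifiedSample and its methods are hypothetical names, and a shuffle stands in for the UDF's sampling algorithm.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

final class StratifiedSample {
    /** Keeps ceil(p * n) items of one stratum, chosen uniformly at random via a shuffle. */
    public static <T> List<T> sampleStratum(List<T> stratum, double p, Random rng) {
        List<T> shuffled = new ArrayList<>(stratum);
        Collections.shuffle(shuffled, rng);
        int k = (int) Math.ceil(p * shuffled.size());
        return new ArrayList<>(shuffled.subList(0, k));
    }

    /** Samples each label group separately: label 1 at probability 1.0, all others at 0.1. */
    public static Map<Integer, List<Double>> sample(Map<Integer, List<Double>> byLabel, Random rng) {
        Map<Integer, List<Double>> out = new TreeMap<>();
        for (Map.Entry<Integer, List<Double>> e : byLabel.entrySet()) {
            double p = (e.getKey() == 1) ? 1.0 : 0.1;
            out.put(e.getKey(), sampleStratum(e.getValue(), p, rng));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, List<Double>> byLabel = new TreeMap<>();
        byLabel.put(1, new ArrayList<>());
        byLabel.put(0, new ArrayList<>());
        for (int i = 0; i < 5; i++) byLabel.get(1).add((double) i);
        for (int i = 0; i < 16; i++) byLabel.get(0).add(100.0 + i);
        // Prints the per-label sample sizes.
        sample(byLabel, new Random(7))
                .forEach((label, items) -> System.out.println(label + ": " + items.size()));
    }
}
```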
In a Java Hadoop MapReduce job, selected items can be written out directly using MultipleOutputs. That feature is not available in a Pig UDF, so selected items still pass through the sort phase. As long as the sample size is not huge, this should not be a big problem.
In the first version, the sampling probability was specified in the constructor. That constructor is now deprecated and will be removed in the next release.
Modifier and Type | Class and Description |
---|---|
static class | SimpleRandomSample.Final |
static class | SimpleRandomSample.Initial |
static class | SimpleRandomSample.Intermediate |
Modifier and Type | Field and Description |
---|---|
static java.lang.String | OUTPUT_BAG_NAME_PREFIX: Prefix for the output bag name. |
Constructor and Description |
---|
SimpleRandomSample() |
SimpleRandomSample(java.lang.String samplingProbability) Deprecated. Should specify the sampling probability in the function call. |
Modifier and Type | Method and Description |
---|---|
java.lang.String | getFinal() |
java.lang.String | getInitial() |
java.lang.String | getIntermed() |
org.apache.pig.impl.logicalLayer.schema.Schema | outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input) |
Methods inherited from class org.apache.pig.EvalFunc: allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
public static final java.lang.String OUTPUT_BAG_NAME_PREFIX
public SimpleRandomSample()
@Deprecated public SimpleRandomSample(java.lang.String samplingProbability)
Parameters: samplingProbability - sampling probability
public java.lang.String getInitial()
Specified by: getInitial in interface org.apache.pig.Algebraic
Specified by: getInitial in class org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>
public java.lang.String getIntermed()
Specified by: getIntermed in interface org.apache.pig.Algebraic
Specified by: getIntermed in class org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>
public java.lang.String getFinal()
Specified by: getFinal in interface org.apache.pig.Algebraic
Specified by: getFinal in class org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>
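In Pig's Algebraic interface, getInitial, getIntermed, and getFinal return the names of the classes that implement the three evaluation stages; here those are presumably the nested Initial, Intermediate, and Final classes listed above. The plain-Java sketch below illustrates the general three-stage pattern for a sampling function without using any Pig types. It is an assumption-laden illustration rather than the UDF's code: AlgebraicSketch, Partial, and the method names are hypothetical, items are plain doubles, and unlike a scalable implementation this sketch does not prune candidates in the intermediate stage.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

final class AlgebraicSketch {
    /** Partial state passed between stages: item count plus {randomKey, value} pairs. */
    static final class Partial {
        long count;
        final List<double[]> keyed = new ArrayList<>();
    }

    /** Initial stage: wraps a single item with a uniform random key. */
    public static Partial initial(double value, Random rng) {
        Partial state = new Partial();
        state.count = 1;
        state.keyed.add(new double[] {rng.nextDouble(), value});
        return state;
    }

    /** Intermediate stage: merges partial states; it may run zero or more times. */
    public static Partial intermediate(List<Partial> parts) {
        Partial merged = new Partial();
        for (Partial part : parts) {
            merged.count += part.count;
            merged.keyed.addAll(part.keyed);
        }
        return merged;
    }

    /** Final stage: merges what is left and keeps the ceil(p * n) items with the smallest keys. */
    public static List<Double> finalStep(List<Partial> parts, double p) {
        Partial all = intermediate(parts);
        int k = (int) Math.ceil(p * all.count);
        all.keyed.sort(Comparator.comparingDouble(pair -> pair[0]));
        List<Double> out = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            out.add(all.keyed.get(i)[1]);
        }
        return out;
    }
}
```

Because the total count n is only known once all partial states are merged, the exact sample size can only be fixed in the final stage; earlier stages can at most discard items that are very unlikely to be selected.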
public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Overrides: outputSchema in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>