public class SampleByKey
extends org.apache.pig.FilterFunc
SampleByKey samples by key: it hashes the key, derives a double value from the hash, and tests that value against the supplied probability. The double value derived from a key is uniformly distributed between 0 and 1. Because the value depends only on the key, tuples that share a key are either all kept or all dropped.
The only required parameter is the sampling probability, passed as a string (for example '0.5'). It may be followed by an optional seed value (the salt argument of the two-argument constructor) that controls the random number generation.
SampleByKey is deterministic: as long as the same seed is provided, the same keys are selected.
Example:
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.5');
-- input: (A,1), (A,2), (A,3), (B,1), (B,3)
data = LOAD 'input' AS (A_id:chararray, B_id:int);
sampled = FILTER data BY SampleByKey(A_id);
-- possible output: (B,1), (B,3)
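The sampling rule described above (hash the key, map the hash to a double between 0 and 1, compare against the probability) can be sketched in plain Java. This is only an illustration, not the DataFu implementation: the class name KeySamplingSketch, the helper methods, the choice of MD5, and the way the salt is combined with the key are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of key-based sampling; not the DataFu source.
class KeySamplingSketch {

    // Maps (salt, key) to a double in [0, 1).
    // MD5 and the "salt:key" layout are assumptions for this sketch.
    static double keyToUnitInterval(String key, String salt) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest((salt + ":" + key).getBytes(StandardCharsets.UTF_8));
            // Pack the first 8 bytes of the digest into a long.
            long bits = 0L;
            for (int i = 0; i < 8; i++) {
                bits = (bits << 8) | (digest[i] & 0xFFL);
            }
            // Keep the top 53 bits and scale by 2^-53 to land in [0, 1).
            return (bits >>> 11) * 0x1.0p-53;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // A tuple is kept when the value derived from its key is below the probability.
    static boolean keep(String key, String salt, double probability) {
        return keyToUnitInterval(key, salt) < probability;
    }

    public static void main(String[] args) {
        String[][] rows = { {"A", "1"}, {"A", "2"}, {"A", "3"}, {"B", "1"}, {"B", "3"} };
        for (String[] row : rows) {
            // The decision depends only on row[0], so rows sharing a key move together.
            if (keep(row[0], "seed-1", 0.5)) {
                System.out.println("(" + row[0] + "," + row[1] + ")");
            }
        }
    }
}
```

Which keys survive depends on the hash and the salt, so the printed rows may differ from the Pig example; what the sketch does show is that repeated runs with the same salt give the same result and that every row for a given key is treated the same way.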
Constructor and Description |
---|
SampleByKey(java.lang.String probability) |
SampleByKey(java.lang.String probability, java.lang.String salt) |
Modifier and Type | Method and Description |
---|---|
java.lang.Boolean | exec(org.apache.pig.data.Tuple input) |
void | setUDFContextSignature(java.lang.String signature) |
Methods inherited from class org.apache.pig.EvalFunc: allowCompileTimeCalculation, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, outputSchema, progress, setInputSchema, setPigLogger, setReporter, warn
public SampleByKey(java.lang.String probability)
public SampleByKey(java.lang.String probability, java.lang.String salt)
public void setUDFContextSignature(java.lang.String signature)
Overrides: setUDFContextSignature in class org.apache.pig.EvalFunc<java.lang.Boolean>
public java.lang.Boolean exec(org.apache.pig.data.Tuple input) throws java.io.IOException
Specified by: exec in class org.apache.pig.EvalFunc<java.lang.Boolean>
Throws: java.io.IOException