Performs weighted random sampling with an in-memory reservoir, producing
a sample of a given size based on the A-Res algorithm described in
Items with larger weights have a higher probability of being selected in the final sample set.
This UDF inherits from ReservoirSample and is guaranteed to produce
a sample of the given size. Like ReservoirSample, this guarantee comes at the cost of scalability,
since the UDF uses internal storage whose size equals the desired sample size to guarantee the exact sample size.
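The reservoir idea above can be illustrated with a minimal Python sketch. This is not the UDF's actual implementation: the function name a_res_sample, its parameters, and the handling of non-positive weights are assumptions made for illustration. Each item receives a random key u ** (1 / w), where u is uniform on [0, 1) and w is the item's weight, and the k items with the largest keys are kept in a heap of size at most k, which is the in-memory storage that bounds scalability.

```python
import heapq
import random


def a_res_sample(items, k, weight_index, rng=random):
    """Return up to k tuples from items, favoring larger weights (A-Res sketch).

    items: iterable of tuples; weight_index: position of the weight field.
    rng: any object with a random() method (hypothetical parameter).
    """
    if k <= 0:
        raise ValueError("sample size must be positive")
    heap = []  # min-heap of (key, sequence number, item), at most k entries
    for seq, item in enumerate(items):
        weight = item[weight_index]
        if weight <= 0:
            continue  # assumption: items with non-positive weight are skipped
        key = rng.random() ** (1.0 / weight)
        entry = (key, seq, item)  # seq breaks ties so items are never compared
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            # New key beats the smallest key in the reservoir: swap it in.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]
```

With input such as [('a', 1), ('b', 100)] and k = 1, the tuple ('b', 100) would be returned in the large majority of runs, since its weight dominates. If the input holds fewer than k usable items, this sketch simply returns all of them.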
Its constructor takes two arguments:
- The first argument specifies the sample size, given as a string representing a positive integer.
- The second argument specifies the index of the weight field in the input tuple,
given as a string representing a non-negative integer that is no greater than the input tuple size.
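Example:
-- Sample size 1; the weight is the field at index 1 (v2).
DEFINE WeightedSample datafu.pig.sampling.WeightedReservoirSample('1','1');
data = LOAD 'input' AS (v1:chararray, v2:INT);
-- Group all tuples into a single bag so the UDF sees the whole input.
data_g = GROUP data ALL;
sampled = FOREACH data_g GENERATE WeightedSample(data);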