public class CondEntropy
extends org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
Each tuple of the input bag must have exactly 2 fields: the 1st field is an object instance of variable X and the 2nd field is an object instance of variable Y. An exception is thrown if the number of fields is not 2.
This UDF's constructors and their parameters are the same as those of Entropy.
How to use:
This UDF calculates conditional entropy directly from raw data tuples of X and Y; per-tuple occurrence frequencies do not need to be pre-computed.
Use it in a nested FOREACH after a GROUP BY: sort the inner bag, then pass the sorted bag to this UDF as its input.
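For reference, the standard definition of conditional entropy is given below. This is the textbook formula, not a statement taken from the DataFu source; it assumes that an empirical estimate replaces each probability with the observed relative frequency of the corresponding tuples.

```latex
H(Y \mid X) = -\sum_{x,\,y} p(x, y)\,\log p(y \mid x)
            = -\sum_{x,\,y} p(x, y)\,\log \frac{p(x, y)}{p(x)}
```

Here X is the conditioning variable (the 1st field) and Y is the variable whose remaining uncertainty is measured (the 2nd field).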
Example:
-- define empirical conditional entropy with Euler's number as the logarithm base
define CondEntropy datafu.pig.stats.entropy.CondEntropy();
input = LOAD 'input' AS (grp: chararray, valX: double, valY: double);
-- calculate conditional entropy H(Y|X) in each group
input_group_g = GROUP input BY grp;
entropy_group = FOREACH input_group_g {
  input_val = input.(valX, valY);
  input_ordered = ORDER input_val BY $0, $1;
  GENERATE FLATTEN(group) AS group, CondEntropy(input_ordered) AS cond_entropy;
}
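To illustrate why a sorted bag allows the result to be computed in one pass without pre-computed frequencies, here is a minimal Python sketch. It is not DataFu's implementation: the function name `cond_entropy_sorted` is hypothetical, and it assumes the empirical estimator with the natural logarithm.

```python
import math
from itertools import groupby


def cond_entropy_sorted(pairs):
    """Empirical H(Y|X) in nats from (x, y) pairs sorted by x, then by y."""
    n = 0        # total number of pairs seen
    total = 0.0  # running sum over x of count(x) * H(Y | X = x)
    for _, x_run in groupby(pairs, key=lambda p: p[0]):
        # Because the input is sorted, equal (x, y) pairs are adjacent,
        # so each distinct y within this x run can be counted as it streams by.
        y_counts = [sum(1 for _ in y_run)
                    for _, y_run in groupby(x_run, key=lambda p: p[1])]
        n_x = sum(y_counts)
        n += n_x
        # count(x) * H(Y | X = x) = -sum_y c * log(c / count(x))
        total -= sum(c * math.log(c / n_x) for c in y_counts)
    return total / n if n else 0.0
```

Calling `cond_entropy_sorted(sorted(rows))` on a list of `(x, y)` tuples mirrors the `ORDER ... BY $0, $1` step in the Pig example: only the counts for the current run of equal values need to be kept in memory.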
Use case to calculate mutual information:
------------
-- calculate mutual information I(X, Y) using conditional entropy UDF and entropy UDF
-- I(X, Y) = H(Y) - H(Y|X)
------------
define CondEntropy datafu.pig.stats.entropy.CondEntropy();
define Entropy datafu.pig.stats.entropy.Entropy();
input = LOAD 'input' AS (grp: chararray, valX: double, valY: double);
-- calculate the I(X,Y) in each group
input_group_g = GROUP input BY grp;
mutual_information = FOREACH input_group_g {
  input_val_x_y = input.(valX, valY);
  input_val_x_y_ordered = ORDER input_val_x_y BY $0, $1;
  input_val_y = input.valY;
  input_val_y_ordered = ORDER input_val_y BY $0;
  cond_h_x_y = CondEntropy(input_val_x_y_ordered);
  h_y = Entropy(input_val_y_ordered);
  GENERATE FLATTEN(group), h_y - cond_h_x_y;
}
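The identity I(X, Y) = H(Y) - H(Y|X) used in the script can be checked on small inputs with the following Python sketch. The helper names (`entropy`, `cond_entropy`, `mutual_information`) are hypothetical and the code assumes empirical estimates with the natural logarithm; it illustrates the arithmetic only and is not the UDFs' implementation.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical entropy H(Y) in nats of a list of hashable values."""
    n = len(values)
    return -sum(c / n * math.log(c / n) for c in Counter(values).values())


def cond_entropy(pairs):
    """Empirical conditional entropy H(Y|X) in nats of a list of (x, y) tuples."""
    n = len(pairs)
    x_counts = Counter(x for x, _ in pairs)
    # -sum over (x, y) of p(x, y) * log p(y | x)
    return -sum(c / n * math.log(c / x_counts[x])
                for (x, _), c in Counter(pairs).items())


def mutual_information(pairs):
    """I(X, Y) = H(Y) - H(Y|X), mirroring h_y - cond_h_x_y in the Pig script."""
    return entropy([y for _, y in pairs]) - cond_entropy(pairs)
```

When Y is fully determined by X, H(Y|X) is zero and the mutual information equals H(Y); when X and Y are independent, H(Y|X) equals H(Y) and the mutual information is zero.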
See Also: Entropy
Constructor and Description |
---|
CondEntropy() |
CondEntropy(java.lang.String type) |
CondEntropy(java.lang.String type, java.lang.String base) |
Modifier and Type | Method and Description |
---|---|
void | accumulate(org.apache.pig.data.Tuple input) |
void | cleanup() |
java.lang.Double | getValue() |
org.apache.pig.impl.logicalLayer.schema.Schema | outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input) |
Methods inherited from class org.apache.pig.EvalFunc: allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
public CondEntropy() throws org.apache.pig.backend.executionengine.ExecException
Throws: org.apache.pig.backend.executionengine.ExecException
public CondEntropy(java.lang.String type) throws org.apache.pig.backend.executionengine.ExecException
Throws: org.apache.pig.backend.executionengine.ExecException
public CondEntropy(java.lang.String type, java.lang.String base) throws org.apache.pig.backend.executionengine.ExecException
Throws: org.apache.pig.backend.executionengine.ExecException
public void accumulate(org.apache.pig.data.Tuple input) throws java.io.IOException
Specified by: accumulate in interface org.apache.pig.Accumulator<java.lang.Double>
Specified by: accumulate in class org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
Throws: java.io.IOException
public java.lang.Double getValue()
Specified by: getValue in interface org.apache.pig.Accumulator<java.lang.Double>
Specified by: getValue in class org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
public void cleanup()
Specified by: cleanup in interface org.apache.pig.Accumulator<java.lang.Double>
Specified by: cleanup in class org.apache.pig.AccumulatorEvalFunc<java.lang.Double>
public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Overrides: outputSchema in class org.apache.pig.EvalFunc<java.lang.Double>