public class TupleDiff
extends org.apache.pig.EvalFunc<java.lang.String>
Takes a variable number of arguments: the old tuple, the new tuple, and an optional list of ignored fields.
Values are compared by position (zero-based). If a schema exists, the field names are used for descriptive purposes only.
If the old and new field names differ, both are shown in the format old/new. If there is no schema, position numbers are used instead.
Ignored fields may be given by number or by name. The list is only checked at the topmost level. To ignore a field
whose name has changed, use the format oldname/newname. To ignore a field that exists only in the new tuple, for example,
use null/newFieldName.
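The naming and ignore rules above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the DataFu implementation: the class FieldLabelSketch and its methods label and isIgnored are hypothetical names, and the exact matching rule (an ignore entry matches either the position number or the full label) is an assumption drawn from the description.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of DataFu.
final class FieldLabelSketch {

    // Label for one position when a schema exists: the shared name if both
    // sides agree, otherwise old/new. A missing side renders as "null".
    static String label(String oldName, String newName) {
        if (oldName != null && oldName.equals(newName)) {
            return oldName;
        }
        return oldName + "/" + newName;
    }

    // Assumption: an ignore entry matches either the zero-based position
    // written as a number, or the label produced by label().
    static boolean isIgnored(int position, String label, String... ignored) {
        Set<String> entries = new HashSet<>(Arrays.asList(ignored));
        return entries.contains(String.valueOf(position)) || entries.contains(label);
    }

    public static void main(String[] args) {
        System.out.println(label("f1", null));   // f1/null
        System.out.println(label(null, "f1"));   // null/f1
        System.out.println(isIgnored(1, label(null, "f1"), "null/f1")); // true
    }
}
```

Under these assumptions, a field that exists only in the new tuple gets the label null/f1, which is why passing null/newFieldName in the ignore list suppresses it.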
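The following chart shows some sample results. Each row lists the arguments as (old tuple, new tuple, optional ignored field); an empty result cell means no difference is reported. Assume the schema has field names f0, f1, f2, ... for fields 0, 1, 2, etc.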
Tuple              Result without schema   Result with schema
-----              ---------------------   ------------------
((0),)             missing                 missing
(,(0))             added                   added
((0),(0))
((0),(1))          changed 0               changed f0
((0,1),(0,2))      changed 1               changed f1
((0,1),(0,2),1)
((0,1),(0))        changed 1               changed f1/null
((0),(0,1))        changed 1               changed null/f1
((0),(0,1),1)
((0,1),(2,3))      changed 0 1             changed f0 f1
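To make the positional comparison in the chart concrete, here is a minimal Java sketch of the no-schema case. It is not the UDF's source: the class TupleDiffSketch and its diff method are hypothetical, tuples are modelled as plain lists, ignored fields are given only by position, and returning null when both tuples are absent is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical illustration of the chart's "without schema" column;
// not the DataFu implementation.
final class TupleDiffSketch {

    // Returns null when no difference is reported (the empty chart cells).
    static String diff(List<?> oldTuple, List<?> newTuple, Integer... ignored) {
        if (oldTuple == null && newTuple == null) {
            return null; // assumption: nothing to compare
        }
        if (newTuple == null) {
            return "missing";
        }
        if (oldTuple == null) {
            return "added";
        }

        Set<Integer> skip = new HashSet<>(Arrays.asList(ignored));
        List<String> changed = new ArrayList<>();
        int width = Math.max(oldTuple.size(), newTuple.size());
        for (int i = 0; i < width; i++) {
            if (skip.contains(i)) {
                continue; // ignored positions never count as changed
            }
            boolean inOld = i < oldTuple.size();
            boolean inNew = i < newTuple.size();
            // A position present on only one side counts as a change.
            if (inOld != inNew || !Objects.equals(oldTuple.get(i), newTuple.get(i))) {
                changed.add(String.valueOf(i));
            }
        }
        return changed.isEmpty() ? null : "changed " + String.join(" ", changed);
    }

    public static void main(String[] args) {
        System.out.println(diff(Arrays.asList(0, 1), Arrays.asList(0, 2)));    // changed 1
        System.out.println(diff(Arrays.asList(0, 1), Arrays.asList(2, 3)));    // changed 0 1
        System.out.println(diff(Arrays.asList(0, 1), Arrays.asList(0, 2), 1)); // null
    }
}
```

The sketch reproduces each row of the "without schema" column, including the rows where an ignored position suppresses the only difference.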
The following macro may be useful for calling this UDF when you have a single join field and at most one ignored field:
DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diff_macro_diffs
{
    DEFINE TupleDiff datafu.pig.util.TupleDiff;
    old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
    new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
    join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
    join_data = FOREACH join_data GENERATE
        TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff,
        old::original,
        new::original;
    $diff_macro_diffs = FILTER join_data BY tupleDiff IS NOT NULL;
};