Apache DataFu Pig is a collection of user-defined functions for working with large-scale data in Apache Pig. It has a number of useful functions available:
Compute quantiles, median, variance, Wilson binary confidence, etc.
Perform set intersection, union, or difference of bags.
Convenient functions for working with bags such as enumerate items, append, prepend, concat, group, distinct, etc.
Sessionize events from a stream of data.
Streaming implementations that can estimate quantiles and median.
Simple random sampling with or without replacement, and weighted sampling.
Run PageRank on a graph represented by a bag of nodes and edges.
Other useful methods like Assert and Coalesce.
If you'd like to read more details about these functions, check out the Guide. Otherwise, if you are ready to get started using DataFu Pig, keep reading.
The rest of this page assumes you already have a built JAR available. If this is not the case, please see the Download page.
Let's use DataFu Pig to perform a very basic task: computing the median of some data.
Suppose we have a file input in Hadoop with the following content:
1
2
3
2
2
2
3
2
2
1
We can clearly see that the median is 2 for this data set. First we'll start up Pig's grunt shell by running pig and then register the DataFu JAR:
register datafu-pig-1.6.1.jar
To compute the median we'll use DataFu's StreamingMedian, which computes an estimate of the median but has the benefit of not requiring the data to be sorted:
DEFINE Median datafu.pig.stats.StreamingMedian();
Next we can load the data and pass it into the function to compute the median:
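-- Load one integer per line from the file 'input'.
data = LOAD 'input' USING PigStorage() AS (val:int);
-- Group all rows into a single bag and pass that bag to Median.
medians = FOREACH (GROUP data ALL) GENERATE Median(data);
DUMP medians;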
This produces the expected output:
((2.0))
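For comparison, here is a minimal sketch of computing an exact median with DataFu's non-streaming datafu.pig.stats.Median, which expects its input bag to be sorted. It assumes the DataFu JAR is already registered and the same input file is available. The alias names (ExactMedian, vals, grouped, sorted, exact) are illustrative choices, not part of DataFu.

```pig
-- Sketch only: exact median; assumes the DataFu JAR has already been registered.
DEFINE ExactMedian datafu.pig.stats.Median();

-- Load one integer per line from the same 'input' file.
vals = LOAD 'input' USING PigStorage() AS (val:int);
grouped = GROUP vals ALL;

-- The exact Median UDF expects a sorted bag, so order the values
-- inside the nested FOREACH before passing them in.
exact = FOREACH grouped {
  sorted = ORDER vals BY val;
  GENERATE ExactMedian(sorted);
}

DUMP exact;
```

The sort is the cost of an exact answer. On large inputs the streaming version shown earlier avoids it and returns an estimate instead.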
Check out the Guide for more information on what you can do with DataFu Pig.