Apache DataFu™




We welcome contributions to Apache DataFu. If you're interested, please read the following guide:


Working in the Code Base

Common tasks for working in the DataFu code can be found below. For information on how to contribute patches, please follow the wiki link above.

Get the Code

If you haven't done so already:

git clone https://git-wip-us.apache.org/repos/asf/datafu.git
cd datafu

Generate Eclipse Files

The following command generates the necessary files to load the project in Eclipse:

./gradlew eclipse

To clean up the Eclipse files:

./gradlew cleanEclipse

Note that you may run out of heap space when executing tests in Eclipse. To fix this, adjust the heap settings for the TestNG plugin: go to Eclipse->Preferences, select TestNG->Run/Debug, and add "-Xmx1G" to the JVM args.


All the JARs for the project can be built with the following command:

./gradlew assemble

This builds SNAPSHOT versions of the JARs for DataFu Pig, Spark and Hourglass. The built JARs can be found under datafu-pig/build/libs, datafu-spark/build/libs and datafu-hourglass/build/libs, respectively.
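To confirm what the build produced, the three output directories named above can be checked in one pass. The following is a minimal sketch, not part of DataFu: the function name `list_datafu_jars` is hypothetical, and it assumes it is run from the root of the checkout (or given that root as its first argument).

```shell
# Hypothetical helper (not shipped with DataFu): list the JARs produced
# by ./gradlew assemble. Pass the checkout root as $1, or run it from
# the root of the datafu checkout.
list_datafu_jars() {
  root="${1:-.}"
  for project in datafu-pig datafu-spark datafu-hourglass; do
    libs="$root/$project/build/libs"
    if [ -d "$libs" ]; then
      # Print each JAR path on its own line.
      for jar in "$libs"/*.jar; do
        if [ -e "$jar" ]; then
          echo "$jar"
        fi
      done
    else
      # Report projects that have not been built yet on stderr.
      echo "no build output for $project (expected $libs)" >&2
    fi
  done
}
```

Calling `list_datafu_jars` after `./gradlew assemble` prints the path of every JAR found; projects with no build output are reported on stderr so the listing itself stays clean.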

A single project - for example, DataFu Pig - may be built by running the command below.

./gradlew :datafu-pig:assemble

Running Tests

Tests can be run with the following command:

./gradlew test

All the tests can also be run from within Eclipse.

To run a single project's tests - for example, for DataFu Pig only:

./gradlew :datafu-pig:test

To run a specific set of tests from the command line, pass the --tests option with a filter matching the test class you want to run. For example, to run all tests defined in the QuantileTests test class for DataFu Pig:

./gradlew :datafu-pig:test --tests QuantileTests

You can similarly run a specific Hourglass test like so:

./gradlew :datafu-hourglass:test --tests PartitionCollapsingTests
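The per-project test commands above all follow the same pattern, so they can be wrapped in a small shell function. This is only a sketch: the function name `datafu_test` and the `DRY_RUN` variable are hypothetical conveniences, not part of DataFu or Gradle, and the function assumes it is run from the root of the checkout.

```shell
# Hypothetical wrapper around the commands above; not part of DataFu.
# Usage: datafu_test <project> [test-class-filter]
#   <project> is the suffix after "datafu-", e.g. pig, spark, hourglass.
# Set DRY_RUN=1 to print the Gradle command instead of running it.
datafu_test() {
  project="${1:-}"
  filter="${2:-}"
  if [ -z "$project" ]; then
    echo "usage: datafu_test <project> [test-class-filter]" >&2
    return 2
  fi
  # Build the command in the positional parameters so arguments stay intact.
  set -- ./gradlew ":datafu-$project:test"
  if [ -n "$filter" ]; then
    set -- "$@" --tests "$filter"
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `datafu_test pig QuantileTests` runs the same command as the DataFu Pig example above, and `DRY_RUN=1 datafu_test hourglass` prints the command for the whole Hourglass test suite without running it.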