Getting Started

DataFu Spark

Apache DataFu Spark is a collection of utilities and user-defined functions for working with large-scale data in Apache Spark.

Compatibility Matrix

This matrix lists the versions of Spark that DataFu has been compiled and tested against. Many methods also work on versions not listed here, but those combinations have not been verified.

DataFu               Spark
1.7.0                2.2.0 to 2.2.2, 2.3.0 to 2.3.2, and 2.4.0 to 2.4.3
1.8.0                2.2.3, 2.3.3, and 2.4.4 to 2.4.5
2.0.0                3.0.x to 3.1.x
2.1.0 (unreleased)   3.2.x and up


Examples

DataFu Spark provides functions for common tasks such as deduplicating records; an example is shown below.

If you'd like to read more details about these functions, check out the Guide. Otherwise, if you are ready to get started using DataFu Spark, keep reading.

The rest of this page assumes you already have a built JAR available. If this is not the case, please see the Download page.

This JAR should be added to the Spark classpath. You can verify that you've done this correctly by trying to import one of the DataFu classes, for example, DataFrameOps.
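
For example, using spark-shell (the JAR filename below is illustrative; substitute the artifact you built or downloaded):

$ spark-shell --jars datafu-spark_2.12-2.0.0.jar

scala> import datafu.spark.DataFrameOps._
import datafu.spark.DataFrameOps._

If the import succeeds, the JAR is on the classpath.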

Basic Example: Finding the most recent update of a given record

A common scenario in data stored in HDFS (the Hadoop Distributed File System) is multiple rows representing updates to the same logical record. For example, in a table representing accounts, a record might be written every time customer data is updated, with each update receiving a newer timestamp. Consider the following simplified example.


[Image: Raw customers’ data, with more than one row per customer]
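
A hypothetical customers.csv matching this description might look like the following (julia and quentin appear in the text below; the other rows and all dates are made up for illustration):

id,name,date_updated
1,alice,2023-01-01
2,bob,2023-01-01
3,julia,2023-01-01
3,julia,2023-02-15
4,quentin,2023-01-01
4,quentin,2023-03-02
4,quentin,2023-04-10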

We can see that though most of the customers only appear once, julia and quentin have 2 and 3 rows, respectively. How can we get just the most recent record for each customer? We can use DataFu's dedupWithOrder method.

import datafu.spark.DataFrameOps._
import spark.implicits._ // for the $"column" syntax (pre-imported in spark-shell)

// Read the raw customer data; the header row supplies the column names
val customers = spark.read.format("csv").option("header", "true").load("customers.csv")

// For each id, keep only the row with the most recent date_updated
customers.dedupWithOrder($"id", $"date_updated".desc).show

Our result will be as expected — each customer only appears once, as you can see below:


[Image: “Deduplicated” data, with only the most recent record for each customer (though not in order)]

There are two additional variants of dedupWithOrder in datafu-spark. The dedupWithCombiner method provides similar functionality, but uses a UDAF to take advantage of map-side aggregation; dedupTopN allows retaining more than one record for each key.
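
A brief sketch of how these variants might be invoked (the signatures here are assumptions based on the descriptions above; consult the Guide for the authoritative API):

// dedupWithCombiner: same result as dedupWithOrder for this case,
// but uses map-side aggregation instead of a full sort (signature assumed)
customers.dedupWithCombiner($"id", $"date_updated").show

// dedupTopN: keep the 3 most recent rows per customer instead of one (signature assumed)
customers.dedupTopN(3, $"id", $"date_updated".desc).show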

Next Steps

Check out the Guide for more information on what you can do with DataFu Spark.