Getting Started

DataFu Spark

Apache DataFu Spark is a collection of utilities and user-defined functions for working with large-scale data in Apache Spark.

Compatibility Matrix

This matrix lists the versions of Spark that DataFu has been compiled and tested against. Many methods also work on versions not listed here, but those combinations have not been verified.

DataFu               Spark
1.7.0                2.2.0 to 2.2.2, 2.3.0 to 2.3.2, and 2.4.0 to 2.4.3
1.8.0                2.2.3, 2.3.3, and 2.4.4 to 2.4.5
2.0.0                3.0.x to 3.1.x
2.1.0 (unreleased)   3.2.x and up


Examples

DataFu Spark provides functions for common tasks such as deduplicating records; an example is shown below.

If you'd like to read more details about these functions, check out the Guide. Otherwise, if you are ready to get started using DataFu Spark, keep reading.

The rest of this page assumes you already have a built JAR available. If this is not the case, please see the Download page.

This JAR should be added to the Spark classpath. You can verify that you've done this correctly by trying to import one of the DataFu classes, for example, DataFrameOps.
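
For example, using spark-shell (the JAR filename below is illustrative; substitute the artifact you built or downloaded):

$ spark-shell --jars datafu-spark_2.12-2.0.0.jar

scala> import datafu.spark.DataFrameOps._
import datafu.spark.DataFrameOps._

If the import succeeds, the JAR is on the classpath.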

Basic Example: Finding the most recent update of a given record

A common scenario in data stored in HDFS (the Hadoop Distributed File System) is multiple rows representing updates to the same logical record. For example, in a table representing accounts, a record might be written every time customer data is updated, with each update receiving a newer timestamp. Consider the following simplified example.


[Image: Raw customers’ data, with more than one row per customer]
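
A hypothetical customers.csv matching this description might look like the following (julia and quentin appear in the text below; the other rows and all dates are made up for illustration):

id,name,date_updated
1,alice,2023-01-01
2,bob,2023-01-01
3,julia,2023-01-01
3,julia,2023-02-15
4,quentin,2023-01-01
4,quentin,2023-03-02
4,quentin,2023-04-10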

We can see that though most of the customers only appear once, julia and quentin have 2 and 3 rows, respectively. How can we get just the most recent record for each customer? We can use DataFu's dedupWithOrder method.

import datafu.spark.DataFrameOps._
import spark.implicits._ // for the $"column" syntax (pre-imported in spark-shell)

// Read the raw customer data; the header row supplies the column names
val customers = spark.read.format("csv").option("header", "true").load("customers.csv")

// For each id, keep only the row with the most recent date_updated
customers.dedupWithOrder($"id", $"date_updated".desc).show

Our result will be as expected — each customer only appears once, as you can see below:


[Image: “Deduplicated” data, with only the most recent record for each customer (though not in order)]

There are two additional variants of dedupWithOrder in datafu-spark. The dedupWithCombiner method provides similar functionality, but uses a UDAF to take advantage of map-side aggregation; dedupTopN allows retaining more than one record for each key.
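
A brief sketch of how these variants might be invoked (the signatures here are assumptions based on the descriptions above; consult the Guide for the authoritative API):

// dedupWithCombiner: same result as dedupWithOrder for this case,
// but uses map-side aggregation instead of a full sort (signature assumed)
customers.dedupWithCombiner($"id", $"date_updated").show

// dedupTopN: keep the 3 most recent rows per customer instead of one (signature assumed)
customers.dedupTopN(3, $"id", $"date_updated".desc).show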

Next Steps

Check out the Guide for more information on what you can do with DataFu Spark.