Object

datafu.spark

SparkDFUtils

object SparkDFUtils

Linear Supertypes: AnyRef, Any

Value Members

  1. def broadcastJoinSkewed(notSkewed: DataFrame, skewed: DataFrame, joinCol: String, numRowsToBroadcast: Int, filterCnt: Option[Long] = None, joinType: String = "inner"): DataFrame

    Suitable for performing a join when one DataFrame is skewed and the other is not. Splits both DataFrames into two parts according to the skewed keys, then performs 1. a map-join, broadcasting the skewed-keys part of the non-skewed DataFrame to the skewed-keys part of the skewed DataFrame, and 2. a regular join between the two remaining parts.

    notSkewed - the DataFrame that is not skewed

    skewed - the skewed DataFrame

    joinCol - the join column

    numRowsToBroadcast - number of rows to broadcast

    filterCnt - if defined, filter out unskewed rows from the broadcast to ease the limit calculation

    returns - DataFrame representing the data after the operation
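
    A minimal usage sketch (the DataFrames `users` and `events`, and the skew on "user_id", are hypothetical):

        import datafu.spark.SparkDFUtils

        // events has a few very hot user_id keys; users is evenly distributed
        val joined = SparkDFUtils.broadcastJoinSkewed(
          notSkewed = users,
          skewed = events,
          joinCol = "user_id",
          numRowsToBroadcast = 1000 // rows of skewed keys to broadcast
        )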

  2. def changeSchema(df: DataFrame, newScheme: String*): DataFrame

    Returns a DataFrame with the column names renamed to the column names in the new schema.

    df - DataFrame to operate on

    newScheme - new column names

    returns - DataFrame representing the data after the operation
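
    For example (a sketch; the input DataFrame `df` and the new names are hypothetical):

        import datafu.spark.SparkDFUtils

        // rename the three columns of df to id, name, score
        val renamed = SparkDFUtils.changeSchema(df, "id", "name", "score")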

  3. def dedupRandomN(df: DataFrame, groupCol: Column, maxSize: Int): DataFrame

    Used to get up to maxSize random records in each group. Uses an efficient implementation that doesn't order the data, so it can handle large amounts of data.

    df - DataFrame to operate on

    groupCol - column to group the records by

    maxSize - maximal number of rows per group

    returns - DataFrame representing the data after the operation
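
    A usage sketch (the DataFrame `visits` and its columns are hypothetical):

        import org.apache.spark.sql.functions.col

        // keep at most 3 random visits per user
        val sampled = SparkDFUtils.dedupRandomN(visits, col("user_id"), maxSize = 3)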

  4. def dedupTopN(df: DataFrame, n: Int, groupCol: Column, orderCols: Column*): DataFrame

    Used to get the top N records (after ordering according to the provided order columns) in each group.

    df - DataFrame to operate on

    n - number of records to return from each group

    groupCol - column to group the records by

    orderCols - columns to order the records by

    returns - DataFrame representing the data after the operation
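
    A usage sketch (the DataFrame `scores` and its columns are hypothetical):

        import org.apache.spark.sql.functions.col

        // keep the 2 highest-scoring rows per player
        val top2 = SparkDFUtils.dedupTopN(scores, 2, col("player_id"), col("score").desc)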

  5. def dedupWithCombiner(df: DataFrame, groupCol: Seq[Column], orderByCol: Seq[Column], desc: Boolean = true, moreAggFunctions: Seq[Column] = Nil, columnsFilter: Seq[String] = Nil, columnsFilterKeep: Boolean = true): DataFrame

    Used to get the 'latest' record (after ordering according to the provided order columns) in each group. Provides the same functionality as #dedupWithOrder, but is implemented using a UDAF to utilize map-side aggregation. Use this function when you expect a large number of rows per group, since rows that share the same group columns get combined.

    df - DataFrame to operate on

    groupCol - columns to group the records by

    orderByCol - columns to order the records by

    desc - whether to order descending

    moreAggFunctions - additional aggregate functions

    columnsFilter - columns to filter

    columnsFilterKeep - whether to filter the selected columns 'out' or, alternatively, keep only those columns in the result

    returns - DataFrame representing the data after the operation
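
    A usage sketch (the DataFrame `updates` and its columns are hypothetical):

        import org.apache.spark.sql.functions.col

        // keep the most recent row per entity, combining rows on the map side
        val latest = SparkDFUtils.dedupWithCombiner(
          updates,
          groupCol = Seq(col("entity_id")),
          orderByCol = Seq(col("updated_at"))
        )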

  6. def dedupWithOrder(df: DataFrame, groupCol: Column, orderCols: Column*): DataFrame

    Used to get the 'latest' record (after ordering according to the provided order columns) in each group. Differs from org.apache.spark.sql.Dataset#dropDuplicates in that the order of the records matters.

    df - DataFrame to operate on

    groupCol - column to group the records by

    orderCols - columns to order the records by

    returns - DataFrame representing the data after the operation
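
    A usage sketch (the DataFrame `updates` and its columns are hypothetical):

        import org.apache.spark.sql.functions.col

        // keep one row per entity: the one with the latest updated_at
        val latest = SparkDFUtils.dedupWithOrder(updates, col("entity_id"), col("updated_at").desc)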

  7. def explodeArray(df: DataFrame, arrayCol: Column, alias: String): DataFrame

    Given an array column that you need to explode into different columns, use this method. This function counts the number of output columns by executing a Spark job internally on the input array column, so consider caching the input DataFrame if this is an expensive operation.

    returns - DataFrame with a column added for each position in the array. For example:

    input
    +-----+----------------------------------------+
    |label|sentence_arr                            |
    +-----+----------------------------------------+
    |0.0  |[Hi, I heard, about, Spark]             |
    |0.0  |[I wish, Java, could use, case classes] |
    |1.0  |[Logistic, regression, models, are neat]|
    +-----+----------------------------------------+

    output
    +-----+----------------------------------------+--------+----------+---------+------------+
    |label|sentence_arr                            |token0  |token1    |token2   |token3      |
    +-----+----------------------------------------+--------+----------+---------+------------+
    |0.0  |[Hi, I heard, about, Spark]             |Hi      |I heard   |about    |Spark       |
    |0.0  |[I wish, Java, could use, case classes] |I wish  |Java      |could use|case classes|
    |1.0  |[Logistic, regression, models, are neat]|Logistic|regression|models   |are neat    |
    +-----+----------------------------------------+--------+----------+---------+------------+
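
    A usage sketch matching the example above (the DataFrame `sentences` is hypothetical):

        import org.apache.spark.sql.functions.col

        // appends one column per array position, named token0, token1, ...
        val exploded = SparkDFUtils.explodeArray(sentences, col("sentence_arr"), "token")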

  8. def flatten(df: DataFrame, colName: String): DataFrame

    Returns a DataFrame with the given column (should be a StructType) replaced by its inner fields. This method only flattens a single level of nesting.

    input
    +-------+----------+----------+----------+
    |id     |s.sub_col1|s.sub_col2|s.sub_col3|
    +-------+----------+----------+----------+
    |123    |1         |2         |3         |
    +-------+----------+----------+----------+

    output
    +-------+----------+----------+----------+
    |id     |sub_col1  |sub_col2  |sub_col3  |
    +-------+----------+----------+----------+
    |123    |1         |2         |3         |
    +-------+----------+----------+----------+

    df - DataFrame to operate on

    colName - column name for a column of type StructType

    returns - DataFrame representing the data after the operation
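
    A usage sketch (the DataFrame `records` with a StructType column "s" is hypothetical):

        import datafu.spark.SparkDFUtils

        // replace the struct column "s" with its top-level fields
        val flat = SparkDFUtils.flatten(records, "s")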

  9. def joinSkewed(dfLeft: DataFrame, dfRight: DataFrame, joinExprs: Column, numShards: Int = 10, joinType: String = "inner"): DataFrame

    Used to perform a join when the right DataFrame is relatively small, but still too big to fit in memory for a map-side broadcast join. Use cases: a. excluding keys that might be skewed from a medium-sized list; b. joining a big skewed table with a table that has a small number of very large rows.

    dfLeft - left DataFrame

    dfRight - right DataFrame

    joinExprs - join expression

    numShards - number of shards, i.e. the number of times to duplicate the right DataFrame

    joinType - join type

    returns - joined DataFrame
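
    A usage sketch (the DataFrames `big` and `medium` are hypothetical):

        import datafu.spark.SparkDFUtils

        // duplicate the right side into 20 shards to spread skewed keys
        val joined = SparkDFUtils.joinSkewed(
          big,
          medium,
          big("key") === medium("key"),
          numShards = 20
        )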

  10. def joinWithRange(dfSingle: DataFrame, colSingle: String, dfRange: DataFrame, colRangeStart: String, colRangeEnd: String, DECREASE_FACTOR: Long): DataFrame

    Helper function to join a table that has a point column with a table that has a range column. For example, join a table that contains specific times in minutes with a table that contains time ranges. The main problem this function addresses is that a naive explode on the ranges can result in a huge table. Requires: 1. the point table to be distinct on the point column; 2. the range and point columns to be numeric.

    TIMES:
    +-------+
    |time   |
    +-------+
    |11:55  |
    +-------+

    TIME RANGES:
    +----------+---------+----------+
    |start_time|end_time |desc      |
    +----------+---------+----------+
    |10:00     |12:00    | meeting  |
    |11:50     |12:15    | lunch    |
    +----------+---------+----------+

    OUTPUT:
    +-------+----------+---------+---------+
    |time   |start_time|end_time |desc     |
    +-------+----------+---------+---------+
    |11:55  |10:00     |12:00    | meeting |
    |11:55  |11:50     |12:15    | lunch   |
    +-------+----------+---------+---------+

    dfSingle - DataFrame that contains the point column

    colSingle - the point column's name

    dfRange - DataFrame that contains the range columns

    colRangeStart - the range start column's name

    colRangeEnd - the range end column's name

    DECREASE_FACTOR - resolution factor; instead of exploding the range column directly, its resolution is first decreased by this factor
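
    A usage sketch (the DataFrames `points` and `ranges`, with numeric minute columns, are hypothetical):

        import datafu.spark.SparkDFUtils

        // match each point to every range containing it,
        // first decreasing the range resolution by a factor of 10
        val joined = SparkDFUtils.joinWithRange(
          points, "time",
          ranges, "start_time", "end_time",
          10L
        )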

  11. def joinWithRangeAndDedup(dfSingle: DataFrame, colSingle: String, dfRange: DataFrame, colRangeStart: String, colRangeEnd: String, DECREASE_FACTOR: Long, dedupSmallRange: Boolean): DataFrame

    Runs joinWithRange and afterwards dedups the result so that each point is matched to a single range.

    dedupSmallRange - whether to dedup by keeping the smallest matching range (true) or the largest one (false).

    OUTPUT for dedupSmallRange = "true":
    +-------+----------+---------+---------+
    |time   |start_time|end_time |desc     |
    +-------+----------+---------+---------+
    |11:55  |11:50     |12:15    | lunch   |
    +-------+----------+---------+---------+

    OUTPUT for dedupSmallRange = "false":
    +-------+----------+---------+---------+
    |time   |start_time|end_time |desc     |
    +-------+----------+---------+---------+
    |11:55  |10:00     |12:00    | meeting |
    +-------+----------+---------+---------+
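
    A usage sketch (same hypothetical `points` and `ranges` as above):

        import datafu.spark.SparkDFUtils

        // keep only the smallest range that matches each point
        val deduped = SparkDFUtils.joinWithRangeAndDedup(
          points, "time",
          ranges, "start_time", "end_time",
          10L,
          dedupSmallRange = true
        )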
