Object

datafu.spark

SparkDFUtils

object SparkDFUtils

Linear Supertypes: AnyRef, Any

Value Members

  1. def broadcastJoinSkewed(notSkewed: DataFrame, skewed: DataFrame, joinCol: String, numRowsToBroadcast: Int, filterCnt: Option[Long] = None, joinType: String = "inner"): DataFrame

    Suitable for performing a join when one DataFrame is skewed and the other is not. Splits both DataFrames into two parts according to the skewed keys, then performs 1. a map-join, broadcasting the skewed-keys part of the non-skewed DataFrame to the skewed-keys part of the skewed DataFrame, and 2. a regular join between the two remaining parts.

    notSkewed - the DataFrame that is not skewed

    skewed - the skewed DataFrame

    joinCol - the join column

    numRowsToBroadcast - number of rows to broadcast

    filterCnt - if defined, filter out unskewed rows from the broadcast to ease the limit calculation

    returns - DataFrame representing the data after the operation
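
    A minimal usage sketch (the DataFrames `users` and `events`, and the skew on "user_id", are hypothetical):

        import datafu.spark.SparkDFUtils

        // events has a few very hot user_id keys; users is evenly distributed
        val joined = SparkDFUtils.broadcastJoinSkewed(
          notSkewed = users,
          skewed = events,
          joinCol = "user_id",
          numRowsToBroadcast = 1000 // rows of skewed keys to broadcast
        )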

  2. def changeSchema(df: DataFrame, newScheme: String*): DataFrame

    Returns a DataFrame with the column names renamed to the column names in the new schema.

    df - DataFrame to operate on

    newScheme - new column names

    returns - DataFrame representing the data after the operation
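
    For example (a sketch; the input DataFrame `df` and the new names are hypothetical):

        import datafu.spark.SparkDFUtils

        // rename the three columns of df to id, name, score
        val renamed = SparkDFUtils.changeSchema(df, "id", "name", "score")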

  3. def dedupRandomN(df: DataFrame, groupCol: Column, maxSize: Int): DataFrame

    Used to get up to maxSize random records in each group. Uses an efficient implementation that doesn't order the data, so it can handle large amounts of data.

    df - DataFrame to operate on

    groupCol - column to group the records by

    maxSize - maximal number of rows per group

    returns - DataFrame representing the data after the operation
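
    A usage sketch (the DataFrame `visits` and its columns are hypothetical):

        import org.apache.spark.sql.functions.col

        // keep at most 3 random visits per user
        val sampled = SparkDFUtils.dedupRandomN(visits, col("user_id"), maxSize = 3)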

  4. def dedupTopN(df: DataFrame, n: Int, groupCol: Column, orderCols: Column*): DataFrame

    Used to get the top N records (after ordering according to the provided order columns) in each group.

    df - DataFrame to operate on

    n - number of records to return from each group

    groupCol - column to group the records by

    orderCols - columns to order the records by

    returns - DataFrame representing the data after the operation
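
    A usage sketch (the DataFrame `scores` and its columns are hypothetical):

        import org.apache.spark.sql.functions.col

        // keep the 2 highest-scoring rows per player
        val top2 = SparkDFUtils.dedupTopN(scores, 2, col("player_id"), col("score").desc)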

  5. def dedupWithCombiner(df: DataFrame, groupCol: Seq[Column], orderByCol: Seq[Column], desc: Boolean = true, moreAggFunctions: Seq[Column] = Nil, columnsFilter: Seq[String] = Nil, columnsFilterKeep: Boolean = true): DataFrame

    Used to get the 'latest' record (after ordering according to the provided order columns) in each group. Provides the same functionality as #dedupWithOrder, but is implemented using a UDAF to utilize map-side aggregation. Use this function when you expect a large number of rows per group, since rows that share the same group columns get combined.

    df - DataFrame to operate on

    groupCol - columns to group the records by

    orderByCol - columns to order the records by

    desc - whether to order descending

    moreAggFunctions - additional aggregate functions

    columnsFilter - columns to filter

    columnsFilterKeep - whether to filter the selected columns 'out' or, alternatively, keep only those columns in the result

    returns - DataFrame representing the data after the operation
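
    A usage sketch (the DataFrame `updates` and its columns are hypothetical):

        import org.apache.spark.sql.functions.col

        // keep the most recent row per entity, combining rows on the map side
        val latest = SparkDFUtils.dedupWithCombiner(
          updates,
          groupCol = Seq(col("entity_id")),
          orderByCol = Seq(col("updated_at"))
        )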

  6. def dedupWithOrder(df: DataFrame, groupCol: Column, orderCols: Column*): DataFrame

    Used to get the 'latest' record (after ordering according to the provided order columns) in each group. Differs from org.apache.spark.sql.Dataset#dropDuplicates in that the order of the records matters.

    df - DataFrame to operate on

    groupCol - column to group the records by

    orderCols - columns to order the records by

    returns - DataFrame representing the data after the operation
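
    A usage sketch (the DataFrame `updates` and its columns are hypothetical):

        import org.apache.spark.sql.functions.col

        // keep one row per entity: the one with the latest updated_at
        val latest = SparkDFUtils.dedupWithOrder(updates, col("entity_id"), col("updated_at").desc)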

  7. def explodeArray(df: DataFrame, arrayCol: Column, alias: String): DataFrame

    Given an array column that you need to explode into different columns, use this method. This function counts the number of output columns by executing a Spark job internally on the input array column, so consider caching the input DataFrame if this is an expensive operation.

    returns - DataFrame with a column added for each position in the array. For example:

    input
    +-----+----------------------------------------+
    |label|sentence_arr                            |
    +-----+----------------------------------------+
    |0.0  |[Hi, I heard, about, Spark]             |
    |0.0  |[I wish, Java, could use, case classes] |
    |1.0  |[Logistic, regression, models, are neat]|
    +-----+----------------------------------------+

    output
    +-----+----------------------------------------+--------+----------+---------+------------+
    |label|sentence_arr                            |token0  |token1    |token2   |token3      |
    +-----+----------------------------------------+--------+----------+---------+------------+
    |0.0  |[Hi, I heard, about, Spark]             |Hi      |I heard   |about    |Spark       |
    |0.0  |[I wish, Java, could use, case classes] |I wish  |Java      |could use|case classes|
    |1.0  |[Logistic, regression, models, are neat]|Logistic|regression|models   |are neat    |
    +-----+----------------------------------------+--------+----------+---------+------------+
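
    A usage sketch matching the example above (the DataFrame `sentences` is hypothetical):

        import org.apache.spark.sql.functions.col

        // appends one column per array position, named token0, token1, ...
        val exploded = SparkDFUtils.explodeArray(sentences, col("sentence_arr"), "token")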

  8. def flatten(df: DataFrame, colName: String): DataFrame

    Returns a DataFrame with the given column (should be a StructType) replaced by its inner fields. This method only flattens a single level of nesting.

    input
    +-------+----------+----------+----------+
    |id     |s.sub_col1|s.sub_col2|s.sub_col3|
    +-------+----------+----------+----------+
    |123    |1         |2         |3         |
    +-------+----------+----------+----------+

    output
    +-------+----------+----------+----------+
    |id     |sub_col1  |sub_col2  |sub_col3  |
    +-------+----------+----------+----------+
    |123    |1         |2         |3         |
    +-------+----------+----------+----------+

    df - DataFrame to operate on

    colName - column name for a column of type StructType

    returns - DataFrame representing the data after the operation
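
    A usage sketch (the DataFrame `records` with a StructType column "s" is hypothetical):

        import datafu.spark.SparkDFUtils

        // replace the struct column "s" with its top-level fields
        val flat = SparkDFUtils.flatten(records, "s")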

  9. def joinSkewed(dfLeft: DataFrame, dfRight: DataFrame, joinExprs: Column, numShards: Int = 10, joinType: String = "inner"): DataFrame

    Used to perform a join when the right DataFrame is relatively small, but still too big to fit in memory for a map-side broadcast join. Use cases: a. excluding keys that might be skewed from a medium-sized list; b. joining a big skewed table with a table that has a small number of very large rows.

    dfLeft - left DataFrame

    dfRight - right DataFrame

    joinExprs - join expression

    numShards - number of shards, i.e. the number of times to duplicate the right DataFrame

    joinType - join type

    returns - joined DataFrame
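
    A usage sketch (the DataFrames `big` and `medium` are hypothetical):

        import datafu.spark.SparkDFUtils

        // duplicate the right side into 20 shards to spread skewed keys
        val joined = SparkDFUtils.joinSkewed(
          big,
          medium,
          big("key") === medium("key"),
          numShards = 20
        )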

  10. def joinWithRange(dfSingle: DataFrame, colSingle: String, dfRange: DataFrame, colRangeStart: String, colRangeEnd: String, DECREASE_FACTOR: Long): DataFrame

    Helper function to join a table that has a point column with a table that has a range column. For example, join a table that contains specific times in minutes with a table that contains time ranges. The main problem this function addresses is that a naive explode on the ranges can result in a huge table. Requires: 1. the point table to be distinct on the point column; 2. the range and point columns to be numeric.

    TIMES:
    +-------+
    |time   |
    +-------+
    |11:55  |
    +-------+

    TIME RANGES:
    +----------+---------+----------+
    |start_time|end_time |desc      |
    +----------+---------+----------+
    |10:00     |12:00    | meeting  |
    |11:50     |12:15    | lunch    |
    +----------+---------+----------+

    OUTPUT:
    +-------+----------+---------+---------+
    |time   |start_time|end_time |desc     |
    +-------+----------+---------+---------+
    |11:55  |10:00     |12:00    | meeting |
    |11:55  |11:50     |12:15    | lunch   |
    +-------+----------+---------+---------+

    dfSingle - DataFrame that contains the point column

    colSingle - the point column's name

    dfRange - DataFrame that contains the range columns

    colRangeStart - the range start column's name

    colRangeEnd - the range end column's name

    DECREASE_FACTOR - resolution factor; instead of exploding the range column directly, its resolution is first decreased by this factor
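
    A usage sketch (the DataFrames `points` and `ranges`, with numeric minute columns, are hypothetical):

        import datafu.spark.SparkDFUtils

        // match each point to every range containing it,
        // first decreasing the range resolution by a factor of 10
        val joined = SparkDFUtils.joinWithRange(
          points, "time",
          ranges, "start_time", "end_time",
          10L
        )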

  11. def joinWithRangeAndDedup(dfSingle: DataFrame, colSingle: String, dfRange: DataFrame, colRangeStart: String, colRangeEnd: String, DECREASE_FACTOR: Long, dedupSmallRange: Boolean): DataFrame

    Runs joinWithRange and afterwards dedups the result so that each point is matched to a single range.

    dedupSmallRange - whether to dedup by keeping the smallest matching range (true) or the largest one (false).

    OUTPUT for dedupSmallRange = "true":
    +-------+----------+---------+---------+
    |time   |start_time|end_time |desc     |
    +-------+----------+---------+---------+
    |11:55  |11:50     |12:15    | lunch   |
    +-------+----------+---------+---------+

    OUTPUT for dedupSmallRange = "false":
    +-------+----------+---------+---------+
    |time   |start_time|end_time |desc     |
    +-------+----------+---------+---------+
    |11:55  |10:00     |12:00    | meeting |
    +-------+----------+---------+---------+
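
    A usage sketch (same hypothetical `points` and `ranges` as above):

        import datafu.spark.SparkDFUtils

        // keep only the smallest range that matches each point
        val deduped = SparkDFUtils.joinWithRangeAndDedup(
          points, "time",
          ranges, "start_time", "end_time",
          10L,
          dedupSmallRange = true
        )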
