Suitable for performing a join when one DataFrame is skewed and the other is not. Splits both DataFrames into two parts according to the skewed keys, then performs two joins:
1. Map-side join: broadcasts the skewed-keys part of the non-skewed DataFrame to join it with the skewed-keys part of the skewed DataFrame.
2. Regular join: between the two remaining parts.
non-skewed DataFrame
skewed DataFrame
join column
number of rows to broadcast
DataFrame representing the data after the operation
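A minimal sketch of the technique in Spark/Scala; the function name, signature, and helpers below are illustrative assumptions, not the library's actual API:

import org.apache.spark.sql.{DataFrame, functions => F}

// Sketch: split on the most frequent keys, broadcast-join the hot part,
// regular-join the rest, and union the results.
def broadcastJoinSkewedSketch(notSkewed: DataFrame, skewed: DataFrame,
                              joinCol: String, numRowsToBroadcast: Int): DataFrame = {
  // the most frequent (skewed) keys in the skewed DataFrame
  val skewedKeys = skewed.groupBy(joinCol).count()
    .orderBy(F.col("count").desc)
    .limit(numRowsToBroadcast)
    .select(joinCol)

  // split both DataFrames into a skewed-keys part and a remainder
  val skewedHot     = skewed.join(F.broadcast(skewedKeys), joinCol)
  val skewedRest    = skewed.join(F.broadcast(skewedKeys), Seq(joinCol), "left_anti")
  val notSkewedHot  = notSkewed.join(F.broadcast(skewedKeys), joinCol)
  val notSkewedRest = notSkewed.join(F.broadcast(skewedKeys), Seq(joinCol), "left_anti")

  // 1. map-side join for the skewed keys  2. regular join for the remainder
  skewedHot.join(F.broadcast(notSkewedHot), joinCol)
    .unionByName(skewedRest.join(notSkewedRest, joinCol))
}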
Returns a DataFrame with the column names renamed to the column names in the new schema.
DataFrame to operate on
new column names
DataFrame representing the data after the operation
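In plain Spark this is essentially Dataset.toDF; a minimal sketch with an assumed helper name:

import org.apache.spark.sql.DataFrame

// Sketch: rename all columns positionally to the names of the new schema.
def changeSchemaSketch(df: DataFrame, newNames: String*): DataFrame =
  df.toDF(newNames: _*)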
Used to get the top N records (after ordering according to the provided order columns) in each group.
DataFrame to operate on
number of records to return from each group
column to group the records by
columns to order the records according to
DataFrame representing the data after the operation
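A minimal sketch of top-N-per-group using a window function; the helper name and signature are assumptions:

import org.apache.spark.sql.{Column, DataFrame, functions => F}
import org.apache.spark.sql.expressions.Window

// Sketch: rank the records in each group and keep the top N.
def dedupTopNSketch(df: DataFrame, n: Int, groupCol: Column, orderCols: Column*): DataFrame = {
  val w = Window.partitionBy(groupCol).orderBy(orderCols: _*)
  df.withColumn("_rank", F.row_number().over(w))
    .where(F.col("_rank") <= n)
    .drop("_rank")
}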
Used to get the 'latest' record (after ordering according to the provided order columns) in each group. Provides the same functionality as #dedup, but is implemented using a UDAF to utilize map-side aggregation. This function should be used when you expect a large number of rows to share the same group column value, since map-side aggregation combines them before the shuffle.
DataFrame to operate on
column to group the records by
column to order the records according to
whether the ordering should be descending
more aggregate functions
columns to filter
indicates whether to filter the selected columns out of the result or, alternatively, to keep only those columns
DataFrame representing the data after the operation
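A minimal sketch of the combiner idea, using a regular struct aggregation in place of a custom UDAF; because this is an ordinary aggregation, Spark applies partial (map-side) aggregation before the shuffle. Names are illustrative:

import org.apache.spark.sql.{Column, DataFrame, functions => F}

// Sketch: pack each row into a struct whose first field is the order column;
// max() then selects the 'latest' full record, since structs compare field by
// field in order.
def dedupWithCombinerSketch(df: DataFrame, groupCol: Column, orderCol: Column): DataFrame =
  df.withColumn("_packed", F.struct((orderCol.as("_ord") +: df.columns.map(F.col)): _*))
    .groupBy(groupCol)
    .agg(F.max("_packed").as("_latest"))
    .select(df.columns.map(c => F.col("_latest." + c)): _*)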
Used to get the 'latest' record (after ordering according to the provided order columns) in each group. Differs from org.apache.spark.sql.Dataset#dropDuplicates in that the order of the records matters: the record kept in each group is determined by the provided ordering rather than chosen arbitrarily.
DataFrame to operate on
column to group the records by
columns to order the records according to
DataFrame representing the data after the operation
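To make the difference concrete, a small illustration with hypothetical data:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._
val df = Seq((1, "2024-01-01", "a"), (1, "2024-02-01", "b")).toDF("id", "date", "value")

df.dropDuplicates("id")  // keeps an arbitrary record per id

// order-aware dedup: deterministically keeps the latest record per id
val w = Window.partitionBy($"id").orderBy($"date".desc)
df.withColumn("_rn", row_number().over(w)).where($"_rn" === 1).drop("_rn")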
Returns a DataFrame with the given column (should be a StructType) replaced by its inner fields. This method only flattens a single level of nesting.
INPUT:
+-------+----------+----------+----------+
|id     |s.sub_col1|s.sub_col2|s.sub_col3|
+-------+----------+----------+----------+
|123    |1         |2         |3         |
+-------+----------+----------+----------+

OUTPUT:
+-------+----------+----------+----------+
|id     |sub_col1  |sub_col2  |sub_col3  |
+-------+----------+----------+----------+
|123    |1         |2         |3         |
+-------+----------+----------+----------+
DataFrame to operate on
column name for a column of type StructType
DataFrame representing the data after the operation
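A minimal sketch of single-level flattening; the helper name is an assumption:

import org.apache.spark.sql.{DataFrame, functions => F}

// Sketch: replace the struct column with its top-level fields (one level only).
def flattenSketch(df: DataFrame, colName: String): DataFrame =
  df.select((df.columns.filterNot(_ == colName).map(F.col) :+ F.col(colName + ".*")): _*)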
Used to perform a join when the right DataFrame is relatively small, but still too big to fit in memory for a map-side broadcast join. Use cases:
a. excluding keys that might be skewed from a medium-sized list.
b. joining a big skewed table with a table that has a small number of very large rows.
left DataFrame
right DataFrame
join expression
number of shards - number of times to duplicate the right DataFrame
join type
joined DataFrame
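A minimal sketch of the sharding technique; the helper and column names are assumptions:

import org.apache.spark.sql.{Column, DataFrame, functions => F}

// Sketch of a sharded ("salted") join: each right-side row is duplicated once
// per shard, each left-side row gets a random shard id, and the shard id is
// added to the join keys so a hot key's rows are spread over numShards tasks.
def joinSkewedSketch(dfLeft: DataFrame, dfRight: DataFrame, joinExprs: Column,
                     numShards: Int, joinType: String): DataFrame = {
  val left  = dfLeft.withColumn("_shard", (F.rand() * numShards).cast("int"))
  val right = dfRight.withColumn("_shard",
    F.explode(F.array((0 until numShards).map(F.lit): _*)))
  left.join(right, joinExprs && left("_shard") === right("_shard"), joinType)
    .drop(left("_shard")).drop(right("_shard"))
}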
Helper function to join a table with a point column to a table with range columns. For example, joining a table that contains a specific time in minutes with a table that contains time ranges. The main problem this function addresses is that a naive explode on the ranges can result in a huge table. Requires:
1. the point table is distinct on the point column.
2. the range and point columns are numeric.
TIMES:
+-------+
|time   |
+-------+
|11:55  |
+-------+

TIME RANGES:
+----------+---------+---------+
|start_time|end_time |desc     |
+----------+---------+---------+
|10:00     |12:00    |meeting  |
+----------+---------+---------+
|11:50     |12:15    |lunch    |
+----------+---------+---------+

OUTPUT:
+-------+----------+---------+---------+
|time   |start_time|end_time |desc     |
+-------+----------+---------+---------+
|11:55  |10:00     |12:00    |meeting  |
+-------+----------+---------+---------+
|11:55  |11:50     |12:15    |lunch    |
+-------+----------+---------+---------+
DataFrame that contains the point column
the point column's name
DataFrame that contains the range columns
the start range column's name
the end range column's name
resolution factor: instead of exploding the range directly, we first decrease its resolution by this factor
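A minimal sketch of the technique, assuming numeric point/range columns and Spark 2.4+ for F.sequence; names are illustrative:

import org.apache.spark.sql.{DataFrame, functions => F}

// Sketch: shrink the ranges by the resolution factor, explode each range into
// coarse buckets, join point buckets to range buckets, then filter with the
// exact range condition.
def joinWithRangeSketch(dfPoint: DataFrame, pointCol: String,
                        dfRange: DataFrame, startCol: String, endCol: String,
                        decreaseFactor: Int): DataFrame = {
  val ranges = dfRange.withColumn("_bucket",
    F.explode(F.sequence((F.col(startCol) / decreaseFactor).cast("long"),
                         (F.col(endCol) / decreaseFactor).cast("long"))))
  dfPoint.withColumn("_bucket", (F.col(pointCol) / decreaseFactor).cast("long"))
    .join(ranges, "_bucket")
    .where(F.col(pointCol).between(F.col(startCol), F.col(endCol)))
    .drop("_bucket")
}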
Run joinWithRange and afterwards run dedup by small/large range: for each point, keep only the smallest matching range (dedupSmallRange = "true") or the largest one (dedupSmallRange = "false").

OUTPUT for dedupSmallRange = "true":
+-------+----------+---------+---------+
|time   |start_time|end_time |desc     |
+-------+----------+---------+---------+
|11:55  |11:50     |12:15    |lunch    |
+-------+----------+---------+---------+

OUTPUT for dedupSmallRange = "false":
+-------+----------+---------+---------+
|time   |start_time|end_time |desc     |
+-------+----------+---------+---------+
|11:55  |10:00     |12:00    |meeting  |
+-------+----------+---------+---------+
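A minimal sketch of the dedup step; the helper and parameter names are assumptions:

import org.apache.spark.sql.{DataFrame, functions => F}
import org.apache.spark.sql.expressions.Window

// Sketch: after the range join, keep, per point, the row with the smallest
// (or largest) range.
def dedupByRangeSizeSketch(joined: DataFrame, pointCol: String, startCol: String,
                           endCol: String, dedupSmallRange: Boolean): DataFrame = {
  val rangeSize = F.col(endCol) - F.col(startCol)
  val w = Window.partitionBy(pointCol)
    .orderBy(if (dedupSmallRange) rangeSize.asc else rangeSize.desc)
  joined.withColumn("_rn", F.row_number().over(w))
    .where(F.col("_rn") === 1)
    .drop("_rn")
}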