package spark

Type Members

  1. class CoreBridgeDirectory extends PythonResource

    Contains all the Python files needed by the bridge itself.

  2. abstract class PythonResource extends AnyRef

    Represents a resource that needs to be added to the PYTHONPATH used by ScalaPythonBridge.

    To ensure your Python resources (modules, files, etc.) are properly added to the bridge, do the following:

    1) Put all the resources under some root directory with a unique name x, and make sure path/to/x is visible to the class loader (usually just use src/main/resources/x).

    2) Extend this class like this: class MyResource extends PythonResource("x"). This assumes x is under src/main/resources/x.

    3) (Since we use ServiceLoader) Add a file to your jar/project, META-INF/services/spark.utils.PythonResource, with a single line containing the full name (including package) of MyResource.

    This process involves scanning the entire jar and copying files from the jar to some temporary location, so if your jar is really big, consider putting the resources in a smaller jar.
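    A minimal sketch of the registration steps above (the package com.example is hypothetical):

        // 1) + 2) Resources live under src/main/resources/x; extend PythonResource:
        package com.example

        class MyResource extends PythonResource("x")

        // 3) META-INF/services/spark.utils.PythonResource -- a plain-text file
        //    containing a single line with the fully qualified class name:
        com.example.MyResource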

  3. case class ScalaPythonBridgeRunner(extraPath: String = "") extends Product with Serializable

    This class lets the user invoke PySpark code from Scala. Example usage:

        val runner = ScalaPythonBridgeRunner()
        runner.runPythonFile("my_package/my_pyspark_logic.py")

  4. class SparkDFUtilsBridge extends AnyRef

    Class defined so that this functionality can be exposed in PySpark.

Value Members

  1. object Aggregators

    This file contains UDAFs which extend the Aggregator class. They were migrated from previous implementations which used UserDefinedAggregateFunction.

    The implementations below reuse the intermediate buffer in the merge function (see https://stackoverflow.com/questions/77713959/can-you-reuse-one-of-the-buffers-in-the-merge-method-of-spark-aggregators).
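    A minimal sketch of the pattern, using a hypothetical aggregator (not one of the actual DataFu UDAFs); note how merge mutates and returns b1 instead of allocating a fresh buffer:

        import org.apache.spark.sql.{Encoder, Encoders}
        import org.apache.spark.sql.expressions.Aggregator
        import scala.collection.mutable

        // Collects distinct strings and returns them joined by commas.
        object DistinctConcat extends Aggregator[String, mutable.HashSet[String], String] {
          def zero: mutable.HashSet[String] = mutable.HashSet.empty

          def reduce(buf: mutable.HashSet[String], in: String): mutable.HashSet[String] = {
            buf += in
            buf
          }

          // Reuses b1 as the result instead of building a new set:
          def merge(b1: mutable.HashSet[String], b2: mutable.HashSet[String]): mutable.HashSet[String] = {
            b1 ++= b2
            b1
          }

          def finish(buf: mutable.HashSet[String]): String = buf.toSeq.sorted.mkString(",")

          def bufferEncoder: Encoder[mutable.HashSet[String]] = Encoders.kryo[mutable.HashSet[String]]
          def outputEncoder: Encoder[String] = Encoders.STRING
        }

        // Register as an untyped UDF for use in DataFrame expressions:
        // val distinctConcat = org.apache.spark.sql.functions.udaf(DistinctConcat)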

  2. object DataFrameOps

    Implicit class to enable easier usage, e.g.:

        df.dedup(..)

    instead of:

        SparkDFUtils.dedup(...)
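    A minimal sketch of the implicit-class pattern behind this (the dedup parameters shown are illustrative, not the actual signature):

        import org.apache.spark.sql.{Column, DataFrame}

        object DataFrameOps {
          implicit class SparkDataFrameOps(val df: DataFrame) extends AnyVal {
            // Delegates to the static utility, so the call reads as a DataFrame method:
            def dedup(groupCol: Column, orderCols: Column*): DataFrame =
              SparkDFUtils.dedup(df, groupCol, orderCols: _*)
          }
        }

        // After `import DataFrameOps._`, the methods appear directly on DataFrame:
        // df.dedup(col("id"), col("ts"))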

  3. object PythonPathsManager

    There are two phases of resolving Python file paths:

    1) When launching Spark: the files need to be added to spark.executorEnv.PYTHONPATH.

    2) When executing a Python file via the bridge: the files need to be added to the process PYTHONPATH. This is different from the previous phase because this Python process is spawned by datafu-spark, not by Spark, and always on the driver.
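    A sketch of phase 1, assuming the resolved paths are joined with the platform path separator ("path/to/x" and "path/to/y" stand in for the actual resolved resource locations):

        import org.apache.spark.sql.SparkSession

        // Expose resolved resource paths to Python on the executors:
        val spark = SparkSession.builder()
          .config("spark.executorEnv.PYTHONPATH", "path/to/x:path/to/y")
          .getOrCreate()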

  4. object ResourceCloning

    Utility for extracting a resource from a jar and copying it to a temporary location.
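    A generic sketch of the technique (not ResourceCloning's actual API): copy a classpath resource, possibly packaged inside a jar, to a temporary file on disk.

        import java.nio.file.{Files, Path, StandardCopyOption}

        def cloneResource(resourcePath: String): Path = {
          val in = getClass.getClassLoader.getResourceAsStream(resourcePath)
          require(in != null, s"resource not found: $resourcePath")
          val fileName = resourcePath.substring(resourcePath.lastIndexOf('/') + 1)
          val target = Files.createTempFile("cloned_", "_" + fileName)
          try Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING)
          finally in.close()
          target
        }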

  5. object ScalaPythonBridge

    Do not instantiate this class! Use the companion object instead. This class should only be used by Python.

  6. object SparkDFUtils
  7. object SparkUDAFs

    UserDefinedAggregateFunction is deprecated and will be removed in DataFu 2.1.0 in order to allow compilation with Spark 3.2 and up. Please use the methods in @Aggregators instead.

    Annotations
    @Deprecated
