pyspark.sql.functions.arrow_udtf
- pyspark.sql.functions.arrow_udtf(cls=None, *, returnType=None)
Creates a PyArrow-native user-defined table function (UDTF).
This function provides a PyArrow-native interface for UDTFs, where the eval method receives PyArrow RecordBatches or Arrays and returns an Iterator of PyArrow Tables or RecordBatches. This enables true vectorized computation without row-by-row processing overhead.
New in version 4.1.0.
- Parameters
  - cls : class, optional
    the Python user-defined table function handler class.
  - returnType : pyspark.sql.types.StructType or str, optional
    the return type of the user-defined table function. The value can be either a pyspark.sql.types.StructType object or a DDL-formatted struct type string.
Notes
- The eval method must accept PyArrow RecordBatches or Arrays as input.
- The eval method must yield PyArrow Tables or RecordBatches as output.
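Because eval is a generator, a single input batch may produce zero, one, or many output batches. A minimal sketch of that contract (the SplitBatches handler name is illustrative, not part of this API):

>>> import pyarrow as pa
>>> from pyspark.sql.functions import arrow_udtf
>>> @arrow_udtf(returnType="x int")
... class SplitBatches:  # hypothetical handler, for illustration only
...     def eval(self, batch: pa.RecordBatch):
...         # Yield each half of the input batch as its own RecordBatch;
...         # any number of yields per input batch is allowed.
...         mid = batch.num_rows // 2
...         yield batch.slice(0, mid)
...         yield batch.slice(mid)
...
>>> df = spark.range(8).selectExpr("cast(id as int) as x")
>>> SplitBatches(df.asTable()).show()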
Examples
UDTF with PyArrow RecordBatch input:
>>> import pyarrow as pa
>>> from pyspark.sql.functions import arrow_udtf
>>> @arrow_udtf(returnType="x int, y int")
... class MyUDTF:
...     def eval(self, batch: pa.RecordBatch):
...         # Process the entire batch vectorized
...         x_array = batch.column('x')
...         y_array = batch.column('y')
...         result_table = pa.table({
...             'x': x_array,
...             'y': y_array
...         })
...         yield result_table
...
>>> df = spark.range(10).selectExpr("cast(id as int) as x", "cast(id as int) as y")
>>> MyUDTF(df.asTable()).show()
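Since eval sees whole columns, PyArrow compute kernels can replace per-row logic entirely. A hedged variation of the example above (the SumUDTF name and its column arithmetic are illustrative, not part of this API):

>>> import pyarrow.compute as pc
>>> @arrow_udtf(returnType="total int")
... class SumUDTF:  # hypothetical handler, for illustration only
...     def eval(self, batch: pa.RecordBatch):
...         # Add the two input columns element-wise in one vectorized call.
...         total = pc.add(batch.column('x'), batch.column('y'))
...         yield pa.table({'total': total})
...
>>> SumUDTF(df.asTable()).show()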
UDTF with PyArrow Array inputs:
>>> from pyspark.sql.functions import lit
>>> @arrow_udtf(returnType="x int, y int")
... class MyUDTF2:
...     def eval(self, x: pa.Array, y: pa.Array):
...         # Process arrays vectorized
...         result_table = pa.table({
...             'x': x,
...             'y': y
...         })
...         yield result_table
...
>>> MyUDTF2(lit(1), lit(2)).show()
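Array inputs work with the same compute kernels as RecordBatches. A sketch assuming a hypothetical ProductUDTF handler:

>>> import pyarrow.compute as pc
>>> @arrow_udtf(returnType="product int")
... class ProductUDTF:  # hypothetical handler, for illustration only
...     def eval(self, x: pa.Array, y: pa.Array):
...         # Multiply the two argument arrays element-wise.
...         yield pa.table({'product': pc.multiply(x, y)})
...
>>> ProductUDTF(lit(3), lit(4)).show()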