pyspark.sql.functions.arrow_udtf

pyspark.sql.functions.arrow_udtf(cls=None, *, returnType=None)

Creates a PyArrow-native user defined table function (UDTF).

This function provides a PyArrow-native interface for UDTFs, where the eval method receives PyArrow RecordBatches or Arrays and returns an Iterator of PyArrow Tables or RecordBatches. This enables true vectorized computation without row-by-row processing overhead.

New in version 4.1.0.

Parameters
cls : class, optional

the Python user-defined table function handler class.

returnType : pyspark.sql.types.StructType or str, optional

the return type of the user-defined table function. The value can be either a pyspark.sql.types.StructType object or a DDL-formatted struct type string.

Notes

  • The eval method must accept PyArrow RecordBatches or Arrays as input

  • The eval method must yield PyArrow Tables or RecordBatches as output

Examples

UDTF with PyArrow RecordBatch input:

>>> import pyarrow as pa
>>> from pyspark.sql.functions import arrow_udtf
>>> @arrow_udtf(returnType="x int, y int")
... class MyUDTF:
...     def eval(self, batch: pa.RecordBatch):
...         # Process the entire batch vectorized
...         x_array = batch.column('x')
...         y_array = batch.column('y')
...         result_table = pa.table({
...             'x': x_array,
...             'y': y_array
...         })
...         yield result_table
...
>>> df = spark.range(10).selectExpr("id as x", "id as y")
>>> MyUDTF(df.asTable()).show()  

UDTF with PyArrow Array inputs:

>>> from pyspark.sql.functions import lit
>>> @arrow_udtf(returnType="x int, y int")
... class MyUDTF2:
...     def eval(self, x: pa.Array, y: pa.Array):
...         # Process arrays vectorized
...         result_table = pa.table({
...             'x': x,
...             'y': y
...         })
...         yield result_table
...
>>> MyUDTF2(lit(1), lit(2)).show()