pyspark.sql.functions.arrow_udtf

pyspark.sql.functions.arrow_udtf(cls=None, *, returnType=None)

Creates a PyArrow-native user defined table function (UDTF).

This function provides a PyArrow-native interface for UDTFs, where the eval method receives PyArrow RecordBatches or Arrays and returns an Iterator of PyArrow Tables or RecordBatches. This enables true vectorized computation without row-by-row processing overhead.

New in version 4.1.0.

Parameters
cls : class, optional

the Python user-defined table function handler class.

returnType : pyspark.sql.types.StructType or str, optional

the return type of the user-defined table function. The value can be either a pyspark.sql.types.StructType object or a DDL-formatted struct type string.

Notes

  • The eval method must accept PyArrow RecordBatches or Arrays as input

  • The eval method must yield PyArrow Tables or RecordBatches as output

Examples

UDTF with PyArrow RecordBatch input:

>>> import pyarrow as pa
>>> from pyspark.sql.functions import arrow_udtf
>>> @arrow_udtf(returnType="x int, y int")
... class MyUDTF:
...     def eval(self, batch: pa.RecordBatch):
...         # Process the entire batch vectorized
...         x_array = batch.column('x')
...         y_array = batch.column('y')
...         result_table = pa.table({
...             'x': x_array,
...             'y': y_array
...         })
...         yield result_table
...
>>> df = spark.range(10).selectExpr("id as x", "id as y")
>>> MyUDTF(df.asTable()).show()  

UDTF with PyArrow Array inputs:

>>> from pyspark.sql.functions import lit
>>> @arrow_udtf(returnType="x int, y int")
... class MyUDTF2:
...     def eval(self, x: pa.Array, y: pa.Array):
...         # Process arrays vectorized
...         result_table = pa.table({
...             'x': x,
...             'y': y
...         })
...         yield result_table
...
>>> MyUDTF2(lit(1), lit(2)).show()