ANSI Migration Guide - Pandas API on Spark#
ANSI mode is now enabled by default for Pandas API on Spark. This guide walks through the key behavior differences you will see. In short, with ANSI mode on, Pandas API on Spark matches native pandas in several cases where it previously (with ANSI off) did not.
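The examples throughout this guide assume the following imports and an active SparkSession named spark (for example, the one created automatically in the PySpark shell):
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()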
Behavior Changes#
String Number Comparison#
ANSI off: Spark implicitly casts between numbers and strings, so 1 and '1' are considered equal.
ANSI on: behaves like pandas; 1 == '1' is False.
For example:
>>> pdf = pd.DataFrame({"int": [1, 2], "str": ["1", "2"]})
>>> psdf = ps.from_pandas(pdf)
# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> psdf["int"] == psdf["str"]
0    False
1    False
dtype: bool
# ANSI off
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> psdf["int"] == psdf["str"]
0    True
1    True
dtype: bool
# Pandas
>>> pdf["int"] == pdf["str"]
0    False
1    False
dtype: bool
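If your code relied on the old implicit coercion, make the conversion explicit so the comparison is well-defined under ANSI; a minimal sketch (the strings here are valid integers, so the cast succeeds):
>>> psdf["int"] == psdf["str"].astype(int)
0    True
1    True
dtype: bool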
Strict Casting#
ANSI off: invalid casts (e.g., 'a' → int) quietly become NULL.
ANSI on: the same casts raise errors.
For example:
>>> pdf = pd.DataFrame({"str": ["a"]})
>>> psdf = ps.from_pandas(pdf)
# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> psdf["str"].astype(int)
Traceback (most recent call last):
...
pyspark.errors.exceptions.captured.NumberFormatException: [CAST_INVALID_INPUT] ...
# ANSI off
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> psdf["str"].astype(int)
0   NaN
Name: str, dtype: float64
# Pandas
>>> pdf["str"].astype(int)
Traceback (most recent call last):
...
ValueError: invalid literal for int() with base 10: 'a'
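If you need the old NULL-on-invalid behavior while keeping ANSI on, one option is to drop values that cannot be parsed before casting. A minimal sketch, assuming non-negative integer-like strings (Series.str.isnumeric does not match signs or decimal points):
>>> s = ps.Series(["1", "a", "2"])
>>> s[s.str.isnumeric()].astype(int)
0    1
2    2
dtype: int64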
MultiIndex.to_series Return Type#
ANSI off: each row is returned as an ArrayType value, e.g. [1, red].
ANSI on: each row is returned as a StructType value, which appears as a tuple (e.g., (1, red)) if the Runtime SQL Configuration spark.sql.execution.pandas.structHandlingMode is set to 'row'. Otherwise, the result may vary depending on whether Arrow is used. See the Spark Runtime SQL Configuration docs for details.
For example:
>>> arrays = [[1, 2], ["red", "blue"]]
>>> pidx = pd.MultiIndex.from_arrays(arrays, names=("number", "color"))
>>> psidx = ps.from_pandas(pidx)
# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> spark.conf.set("spark.sql.execution.pandas.structHandlingMode", "row")
>>> psidx.to_series()
number  color
1       red      (1, red)
2       blue     (2, blue)
dtype: object
# ANSI off
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> psidx.to_series()
number  color
1       red      [1, red]
2       blue     [2, blue]
dtype: object
# Pandas
>>> pidx.to_series()
number  color
1       red      (1, red)
2       blue     (2, blue)
dtype: object
Invalid Mixed-Type Operations#
ANSI off: Spark implicitly coerces the operands, so these operations succeed.
ANSI on: behaves like pandas; such operations are disallowed and raise errors.
Operation types that show behavior changes under ANSI mode:
Decimal–Float Arithmetic: /, //, *, %
Boolean vs. None: |, &, ^
Example: Decimal–Float Arithmetic
>>> import decimal
>>> pser = pd.Series([decimal.Decimal(1), decimal.Decimal(2)])
>>> psser = ps.from_pandas(pser)
# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> psser * 0.1
Traceback (most recent call last):
...
TypeError: Multiplication can not be applied to given types.
# ANSI off
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> psser * 0.1
0    0.1
1    0.2
dtype: float64
# Pandas
>>> pser * 0.1
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for *: 'decimal.Decimal' and 'float'
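Under ANSI mode, you can make the intent explicit by converting the decimals to floats first; a minimal sketch (a decimal-to-double cast is itself valid under ANSI):
>>> psser.astype(float) * 0.1
0    0.1
1    0.2
dtype: float64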
Example: Boolean vs. None
# ANSI on
>>> spark.conf.set("spark.sql.ansi.enabled", True)
>>> ps.Series([True, False]) | None
Traceback (most recent call last):
...
TypeError: OR can not be applied to given types.
# ANSI off
>>> spark.conf.set("spark.sql.ansi.enabled", False)
>>> ps.Series([True, False]) | None
0    False
1    False
dtype: bool
# Pandas
>>> pd.Series([True, False]) | None
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for |: 'bool' and 'NoneType'
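If the None stands in for a known default, pass an explicit boolean instead so the operation is well-defined under ANSI; a minimal sketch (note the result follows standard boolean logic, not the ANSI-off output shown above):
>>> ps.Series([True, False]) | False
0     True
1    False
dtype: bool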
Related Configurations#
spark.sql.ansi.enabled (Spark config)#
Native Spark setting that controls ANSI mode.
The top-level switch: it governs both SQL and pandas API on Spark behavior.
If set to False, Spark reverts to the old behavior, and the options below have no effect.
compute.ansi_mode_support (Pandas API on Spark option)#
Indicates whether ANSI mode is fully supported by pandas API on Spark.
Effective only when ANSI is enabled.
If set to False, pandas API on Spark may produce unexpected results or errors.
Default is True.
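These options are set through the pandas-on-Spark option API (ps.set_option / ps.get_option); for example:
>>> ps.set_option("compute.ansi_mode_support", True)
>>> ps.get_option("compute.ansi_mode_support")
True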
compute.fail_on_ansi_mode (Pandas API on Spark option)#
Controls whether pandas API on Spark fails immediately when ANSI mode is enabled.
Effective only when ANSI is enabled and compute.ansi_mode_support is False.
If set to False, forces pandas API on Spark to work with the old behavior even when ANSI is enabled.
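For example, to fall back to the old non-ANSI semantics instead of failing fast when ANSI is on but not fully supported:
>>> ps.set_option("compute.ansi_mode_support", False)
>>> ps.set_option("compute.fail_on_ansi_mode", False)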