Fix pandas 3.0 compatibility: StringDtype, datetime resolution, deprecated APIs, pyspark.pandas import #1527

Open
Copilot wants to merge 11 commits into main from copilot/fix-test-errors

Conversation

Contributor

Copilot AI commented Apr 9, 2026

Pandas 3.0 introduced several breaking changes that caused widespread test failures across classification, forecasting, NLP, default config, and Spark tests.

StringDtype default change

Pandas 3.0 defaults string columns to StringDtype(na_value=nan) with dtype.name == "str" instead of "object". This caused string columns to be misclassified as numeric and string-valued column indexes to crash np.issubdtype().

  • flaml/automl/data.py: Add "str" to dtype checks in fit_transform() and transform(). Wrap np.issubdtype(X_num.columns.dtype, np.integer) in try/except since string column indexes now use StringDtype.
  • flaml/automl/task/generic_task.py: Add "str" to NLP text column validation assertion.
  • flaml/automl/time_series/ts_data.py: Add "str" to dtype checks in DataTransformerTS.
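The dtype handling above can be sketched as follows. This is an illustrative pattern, not FLAML's exact code; the variable names are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["NY", "LA", None]})

# pandas 3.0's default string dtype reports dtype.name == "str";
# earlier versions report "object", so both names must be accepted.
is_string_col = df["city"].dtype.name in ("object", "category", "string", "str")

# With a StringDtype column index, np.issubdtype can raise, so guard it.
try:
    has_int_columns = np.issubdtype(df.columns.dtype, np.integer)
except TypeError:
    has_int_columns = False
```

On pandas 2.x the column is "object" and the issubdtype call simply returns False; on pandas 3.0 the name is "str" and the guard absorbs the TypeError.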

Datetime resolution changes

Pandas 3.0 uses varying datetime resolutions (datetime64[s], [ms], [us]) instead of always datetime64[ns], causing datetime columns to be missed during feature enrichment (root cause of TCN 0-channel errors).

  • flaml/automl/data.py, flaml/automl/time_series/ts_data.py: Replace hardcoded dtype.name == "datetime64[ns]" with is_datetime64_any_dtype().
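A minimal sketch of why the resolution-agnostic check matters (non-nanosecond resolutions are skipped on pandas versions that do not support them):

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

base = pd.Series(pd.to_datetime(["2024-01-01", "2024-06-01"]))
results = []
for res in ("s", "ms", "us", "ns"):
    try:
        s = base.astype(f"datetime64[{res}]")
    except (TypeError, ValueError):
        continue  # pandas < 2.0 supports only nanosecond resolution
    # an exact dtype.name == "datetime64[ns]" comparison matches only res == "ns",
    # while is_datetime64_any_dtype matches every resolution
    results.append(is_datetime64_any_dtype(s))
```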

pyspark.pandas import incompatibility with pandas 3.0

PySpark's pyspark.pandas module fails to import with pandas 3.0 (cannot import name '_builtin_table' from 'pandas.core.common'). Because all pyspark imports lived in a single try block, the pyspark.pandas failure cascaded: SparkSession, _spark_major_minor_version, F, T, etc. all became dummy/None values, breaking every Spark test.

  • flaml/automl/spark/__init__.py: Separate core pyspark imports (SparkSession, F, T, VersionUtils, sparkDataFrame) from pyspark.pandas imports (ps, psDataFrame, psSeries, set_option). Core Spark functionality now works even when pyspark.pandas is incompatible with the installed pandas version.
  • flaml/automl/spark/utils.py: Guard set_option and ps.from_pandas calls with None checks for when pyspark.pandas is unavailable, raising a clear ImportError instead of TypeError.
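The split can be sketched like this; it is a simplified illustration of the pattern, and the actual fallbacks in flaml/automl/spark/__init__.py may differ:

```python
# Two-stage import: a pyspark.pandas failure no longer knocks out the
# core pyspark names.
try:
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql import types as T
except ImportError:
    SparkSession = F = T = None  # pyspark itself is missing

try:
    # may fail independently of core pyspark, e.g. under pandas 3.0
    import pyspark.pandas as ps
except ImportError:
    ps = None
```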

Removed deprecated APIs in tests

  • fillna(method="ffill") → ffill() (removed in pandas 3.0)
  • Frequency aliases "T" → "min", "H" → "h" (removed in pandas 3.0)
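Both replacements also work on pandas 2.x, so the tests stay compatible across versions; for example:

```python
import pandas as pd

# was: pd.date_range(..., freq="T") and s.fillna(method="ffill")
idx = pd.date_range("2024-01-01", periods=4, freq="min")
hourly = pd.date_range("2024-01-01", periods=3, freq="h")  # was freq="H"
s = pd.Series([1.0, None, None, 4.0], index=idx)
filled = s.ffill()  # forward-fill the two missing values
```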

pandas version constraints for pyspark compatibility

Since pyspark doesn't support pandas 3.0 yet, proper version constraints are added:

  • setup.py: Add "pandas<3" to the spark extra requirements so users installing flaml[spark] get a compatible pandas version.
  • .github/workflows/python-package.yml: Pin pandas>=2.0,<3 on all Ubuntu pyspark environments (python 3.11, 3.12, 3.13). Non-pyspark environments (Ubuntu 3.10, all Windows) continue testing with the latest pandas 3.
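The setup.py side of this can be pictured as below; this is a hypothetical excerpt showing only the constraint the PR adds, with the other spark dependencies elided:

```python
# hypothetical excerpt of the extras layout in setup.py
extras_require = {
    "spark": [
        # ... existing spark dependencies ...
        "pandas<3",  # pyspark does not support pandas 3.0 yet
    ],
}
```

With this pin, `pip install flaml[spark]` resolves to a pandas 2.x release, while a plain `pip install flaml` remains free to use pandas 3.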

…cated APIs

- Add "str" to string dtype name checks in data.py, ts_data.py, generic_task.py
  (pandas 3.0 uses StringDtype(na_value=nan) with dtype.name="str")
- Wrap np.issubdtype(columns.dtype, np.integer) in try/except for StringDtype
  column indexes in pandas 3.0
- Use is_datetime64_any_dtype() instead of hardcoded "datetime64[ns]" check
  (pandas 3.0 uses varying datetime resolutions like datetime64[s], [ms], [us])
- Replace fillna(method="ffill") with ffill() in test files (removed in pandas 3.0)
- Replace deprecated frequency aliases "T"->"min", "H"->"h" in test files

Agent-Logs-Url: https://github.com/microsoft/FLAML/sessions/e2339d06-2236-4c1e-901b-b00a558f3796

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix test errors in testing suite Fix pandas 3.0 compatibility: StringDtype, datetime resolution, deprecated APIs Apr 9, 2026
Copilot AI requested a review from thinkall April 9, 2026 06:12
Collaborator

@thinkall thinkall left a comment


Update github actions to test pandas 2 on ubuntu + python 3.12, other environments test pandas 3. @copilot

@thinkall thinkall requested a review from Copilot April 9, 2026 08:00
@thinkall thinkall marked this pull request as ready for review April 9, 2026 08:00

Copilot AI left a comment


Pull request overview

This PR updates FLAML’s pandas integration and test suite to handle pandas 3.0 breaking changes around StringDtype defaults, datetime dtype resolution, and removed/deprecated pandas APIs—preventing failures across AutoML preprocessing and time-series feature enrichment.

Changes:

  • Update dtype handling to recognize pandas 3.0’s dtype.name == "str" string columns in AutoML preprocessing and NLP validation.
  • Use is_datetime64_any_dtype() to detect datetime columns across varying datetime resolutions.
  • Modernize tests for pandas 3.0 by replacing deprecated frequency aliases and fillna(method="ffill").

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Summary per file:

  • test/automl/test_max_iter_1.py: Updates date_range freq alias for pandas 3.0 compatibility.
  • test/automl/test_forecast.py: Updates date_range freq alias and replaces deprecated fillna(method=...).
  • test/automl/test_extra_models.py: Replaces deprecated fillna(method=...) with ffill().
  • flaml/automl/time_series/ts_data.py: Switches datetime detection to is_datetime64_any_dtype() and expands string dtype handling.
  • flaml/automl/task/generic_task.py: Extends NLP text-column dtype validation to include "str".
  • flaml/automl/data.py: Adds pandas 3.0-compatible string/datetime detection and guards np.issubdtype for StringDtype column indexes.

Comment on lines 272 to 276

    new_cols = []
    for col in df.columns:
-       if df[col].dtype.name == "datetime64[ns]":
+       if is_datetime64_any_dtype(df[col]):
            extras = monthly_fourier_features(df[col], fourier_degree)

Copilot AI Apr 9, 2026


is_datetime64_any_dtype is only imported inside the try: import pandas block at the top of this module, but it is not defined in the except ImportError: path. That means calling enrich_dataframe() in an environment without pandas will raise a NameError instead of a clean/expected failure mode. Define a fallback (e.g., set is_datetime64_any_dtype = None or a stub returning False) in the except block and guard uses accordingly.

Collaborator


@copilot , address the comment.

Contributor Author


Added is_datetime64_any_dtype = None fallback in the except ImportError: block of ts_data.py in 98bca3b, consistent with the pattern used in flaml/automl/data.py.
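The pattern being referred to, roughly (the helper name here is made up for illustration):

```python
import numpy as np

# import with a None fallback so environments without pandas fail softly
try:
    from pandas.api.types import is_datetime64_any_dtype
except ImportError:
    is_datetime64_any_dtype = None

def _is_datetime_column(col):
    # guard every call site: returns False (not NameError) when pandas is absent
    return is_datetime64_any_dtype is not None and bool(is_datetime64_any_dtype(col))

non_dt = _is_datetime_column(np.array([1, 2, 3]))
dt = _is_datetime_column(np.array(["2024-01-01"], dtype="datetime64[s]"))
```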

Comment on lines 408 to 412
            continue

        # sklearn/utils/validation.py needs int/float values
-       if X[column].dtype.name in ("object", "category", "string"):
+       if X[column].dtype.name in ("object", "category", "string", "str"):
            if (

Copilot AI Apr 9, 2026


DataTransformerTS.fit() now calls is_datetime64_any_dtype(...), but is_datetime64_any_dtype is only imported inside the module’s try: import pandas block and is not defined in the except ImportError: path. This can surface as a NameError (instead of a predictable ImportError/feature gating) if this codepath is reached without pandas. Define a fallback in the except block and/or guard this call.

Collaborator


@copilot, address the comment.

Contributor Author


Both is_datetime64_any_dtype call sites in ts_data.py (lines 276 and 407) are now guarded with is_datetime64_any_dtype is not None and ..., consistent with the pattern in data.py. Done in 35a075f.

Contributor Author


This was already addressed in 35a075f. Additionally, the root cause of all Spark test failures (including this one) has been fixed in 10f8bce by separating core pyspark imports from pyspark.pandas imports, so is_datetime64_any_dtype guard is no longer the only issue.

Contributor Author

Copilot AI commented Apr 9, 2026

Update github actions to test pandas 2 on ubuntu + python 3.12, other environments test pandas 3. @copilot

Done in 014a54d. Added a CI step that pins pandas>=2.0,<3 on ubuntu + python 3.12. All other environments will continue using pandas 3 (the default).

Copilot AI requested a review from thinkall April 9, 2026 08:06
Collaborator

@thinkall thinkall left a comment


@copilot, address all comments.

Contributor Author

Copilot AI commented Apr 9, 2026

@copilot, address all comments.

All review comments have been addressed. The is_datetime64_any_dtype fallback (= None) was added in 98bca3b, and the is not None guards on both call sites were added in 35a075f.

Collaborator

@thinkall thinkall left a comment


fix test errors @copilot

Contributor Author

Copilot AI commented Apr 10, 2026

fix test errors @copilot

The Spark test failures (createDataFrame NoneType, _spark_major_minor_version NoneType subscriptable) were caused by pyspark.pandas being incompatible with pandas 3.0 (cannot import name '_builtin_table' from 'pandas.core.common'). Since all pyspark imports were in a single try block, the pyspark.pandas failure caused SparkSession, _spark_major_minor_version, F, T etc. to all become dummies/None.

Fixed in 10f8bce (narrowed in 3544d75) by separating core pyspark imports from pyspark.pandas imports, so SparkSession, F, T, VersionUtils work even when pyspark.pandas fails with pandas 3.0.

Copilot AI changed the title Fix pandas 3.0 compatibility: StringDtype, datetime resolution, deprecated APIs Fix pandas 3.0 compatibility: StringDtype, datetime resolution, deprecated APIs, pyspark.pandas import Apr 10, 2026
Copilot AI requested a review from thinkall April 10, 2026 10:42

…pyspark.pandas unavailable

Agent-Logs-Url: https://github.com/microsoft/FLAML/sessions/60ca1746-e262-40e3-a970-2db40bda57dd

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 11, 2026

@copilot fix https://github.com/microsoft/FLAML/actions/runs/24238987457/job/70861850509#step:15:1

Fixed in c268df3. The set_option from pyspark.pandas was None when pyspark.pandas failed to import (incompatible with pandas 3.0), causing TypeError: 'NoneType' object is not callable in to_pandas_on_spark(). Added set_option is not None guards before calling set_option() and ps.from_pandas(), with a clear ImportError when pyspark.pandas is needed but unavailable.
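The guard described above, sketched with stand-in names (in the real module, ps and set_option come from the guarded pyspark.pandas import):

```python
ps = None          # stand-in: pyspark.pandas failed to import
set_option = None  # stand-in: pyspark.pandas.set_option

def to_pandas_on_spark_sketch(pdf):
    # raise a clear ImportError up front instead of letting a later
    # set_option(...) call fail with "'NoneType' object is not callable"
    if ps is None or set_option is None:
        raise ImportError(
            "pyspark.pandas could not be imported; it may be incompatible "
            "with the installed pandas version."
        )
    set_option("compute.default_index_type", "distributed")
    return ps.from_pandas(pdf)

# exercise the failure path
try:
    to_pandas_on_spark_sketch(None)
    raised_clear_error = False
except ImportError:
    raised_clear_error = True
```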

@thinkall
Collaborator

pyspark doesn't support pandas 3.0 yet. Add pandas version constraints properly. @copilot

…xtra in setup.py

Agent-Logs-Url: https://github.com/microsoft/FLAML/sessions/0512ff2f-87ab-4394-b688-ea1416c26c6a

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Contributor Author

Copilot AI commented Apr 11, 2026

pyspark doesn't support pandas 3.0 yet. Add pandas version constraints properly. @copilot

Fixed in 4521268. Added "pandas<3" to the spark extra in setup.py and expanded the CI pandas<3 pin to all Ubuntu pyspark environments (3.11, 3.12, 3.13), not just 3.12.
