Fix pandas 3.0 compatibility: StringDtype, datetime resolution, deprecated APIs, pyspark.pandas import #1527
Conversation
…cated APIs

- Add "str" to string dtype name checks in data.py, ts_data.py, generic_task.py (pandas 3.0 uses StringDtype(na_value=nan) with dtype.name="str")
- Wrap np.issubdtype(columns.dtype, np.integer) in try/except for StringDtype column indexes in pandas 3.0
- Use is_datetime64_any_dtype() instead of hardcoded "datetime64[ns]" check (pandas 3.0 uses varying datetime resolutions like datetime64[s], [ms], [us])
- Replace fillna(method="ffill") with ffill() in test files (removed in pandas 3.0)
- Replace deprecated frequency aliases "T"->"min", "H"->"h" in test files

Agent-Logs-Url: https://github.com/microsoft/FLAML/sessions/e2339d06-2236-4c1e-901b-b00a558f3796
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
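The last two test-file replacements can be sketched as follows. This is an illustrative snippet, not code from the PR; both new spellings (`freq="h"` and `.ffill()`) already work on pandas 2.x, so the change is backward compatible:

```python
import pandas as pd

# freq="h" replaces the removed alias "H" ("min" likewise replaces "T")
idx = pd.date_range("2024-01-01", periods=4, freq="h")
s = pd.Series([1.0, None, 3.0, None], index=idx)

# s.ffill() replaces the removed s.fillna(method="ffill")
filled = s.ffill()
```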
Pull request overview
This PR updates FLAML’s pandas integration and test suite to handle pandas 3.0 breaking changes around StringDtype defaults, datetime dtype resolution, and removed/deprecated pandas APIs—preventing failures across AutoML preprocessing and time-series feature enrichment.
Changes:
- Update dtype handling to recognize pandas 3.0's `dtype.name == "str"` string columns in AutoML preprocessing and NLP validation.
- Use `is_datetime64_any_dtype()` to detect datetime columns across varying datetime resolutions.
- Modernize tests for pandas 3.0 by replacing deprecated frequency aliases and `fillna(method="ffill")`.
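The dtype-name change can be seen with a minimal sketch (not FLAML's actual code): pandas 3.0 names its default string dtype `"str"`, while pandas 2.x string columns report `"object"`, so the membership check has to accept both:

```python
import pandas as pd

# Version-tolerant text-column check: pandas 2.x reports "object",
# pandas 3.0 reports "str" for the same column of Python strings.
s = pd.Series(["spam", "eggs"])
is_text = s.dtype.name in ("object", "category", "string", "str")
```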
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/automl/test_max_iter_1.py | Updates date_range freq alias for pandas 3.0 compatibility. |
| test/automl/test_forecast.py | Updates date_range freq alias and replaces deprecated fillna(method=...). |
| test/automl/test_extra_models.py | Replaces deprecated fillna(method=...) with ffill(). |
| flaml/automl/time_series/ts_data.py | Switches datetime detection to is_datetime64_any_dtype() and expands string dtype handling. |
| flaml/automl/task/generic_task.py | Extends NLP text-column dtype validation to include "str". |
| flaml/automl/data.py | Adds pandas 3.0-compatible string/datetime detection and guards np.issubdtype for StringDtype column indexes. |
```diff
 new_cols = []
 for col in df.columns:
-    if df[col].dtype.name == "datetime64[ns]":
+    if is_datetime64_any_dtype(df[col]):
         extras = monthly_fourier_features(df[col], fourier_degree)
```
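A minimal sketch of why the hardcoded name check is brittle (assuming pandas >= 2.0, where non-nanosecond resolutions exist):

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

s = pd.Series(pd.to_datetime(["2024-01-01", "2024-06-01"]))
s_ms = s.astype("datetime64[ms]")  # e.g. a non-nanosecond resolution

# The hardcoded name check misses it; the introspection helper does not.
old_check = s_ms.dtype.name == "datetime64[ns]"
new_check = is_datetime64_any_dtype(s_ms)
```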
is_datetime64_any_dtype is only imported inside the try: import pandas block at the top of this module, but it is not defined in the except ImportError: path. That means calling enrich_dataframe() in an environment without pandas will raise a NameError instead of a clean/expected failure mode. Define a fallback (e.g., set is_datetime64_any_dtype = None or a stub returning False) in the except block and guard uses accordingly.
Added is_datetime64_any_dtype = None fallback in the except ImportError: block of ts_data.py in 98bca3b, consistent with the pattern used in flaml/automl/data.py.
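A minimal sketch of the fallback-plus-guard pattern discussed here (the wrapper function is illustrative, not FLAML's actual code):

```python
try:
    from pandas.api.types import is_datetime64_any_dtype
except ImportError:  # pandas not installed: fall back to None
    is_datetime64_any_dtype = None

def column_is_datetime(col):
    # Guard the call so a missing pandas yields False, not a NameError.
    return is_datetime64_any_dtype is not None and bool(is_datetime64_any_dtype(col))

# Illustration (assumes pandas is actually available here):
import pandas as pd
dt_ok = column_is_datetime(pd.Series(pd.to_datetime(["2024-01-01"])))
num_ok = column_is_datetime(pd.Series([1, 2, 3]))
```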
```diff
             continue

     # sklearn/utils/validation.py needs int/float values
-    if X[column].dtype.name in ("object", "category", "string"):
+    if X[column].dtype.name in ("object", "category", "string", "str"):
         if (
```
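The related column-index guard this PR adds in data.py can be sketched like this (illustrative, not the exact FLAML code): under pandas 3.0 a string column index uses StringDtype, which `np.issubdtype` cannot interpret and raises `TypeError` on, while under pandas 2.x the same index is object dtype and simply returns False:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})

# Wrap the dtype test so a StringDtype column index (pandas 3.0)
# degrades to "not an integer index" instead of raising.
try:
    int_columns = np.issubdtype(df.columns.dtype, np.integer)
except TypeError:
    int_columns = False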
DataTransformerTS.fit() now calls is_datetime64_any_dtype(...), but is_datetime64_any_dtype is only imported inside the module’s try: import pandas block and is not defined in the except ImportError: path. This can surface as a NameError (instead of a predictable ImportError/feature gating) if this codepath is reached without pandas. Define a fallback in the except block and/or guard this call.
Both is_datetime64_any_dtype call sites in ts_data.py (lines 276 and 407) are now guarded with is_datetime64_any_dtype is not None and ..., consistent with the pattern in data.py. Done in 35a075f.
Agent-Logs-Url: https://github.com/microsoft/FLAML/sessions/7c3ce4f8-8623-4d3e-8986-2fa83e1daa5a Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
… block Agent-Logs-Url: https://github.com/microsoft/FLAML/sessions/2b20aa86-81b1-4b3b-b6af-0cf7669c5575 Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/FLAML/sessions/395c2eef-887f-4278-a4e0-4d4d276afeee Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
…andas Agent-Logs-Url: https://github.com/microsoft/FLAML/sessions/28e597b7-ae5e-41ba-957d-04aaa78834bd Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
The Spark test failures were fixed in 10f8bce (narrowed in 3544d75) by separating the core pyspark imports from the `pyspark.pandas` imports.
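A generic sketch of that split (stand-in names; `core_mod` stands in for the core Spark imports): with one shared `try` block, any failure nulls every name, whereas two independent blocks let only the pandas-on-Spark names degrade:

```python
# Core imports fail (or succeed) independently of the optional ones.
try:
    import json as core_mod  # stands in for SparkSession, F, T, ...
except ImportError:
    core_mod = None

# The optional companion import fails under pandas 3.0 per this PR.
try:
    import pyspark.pandas as ps
except ImportError:
    ps = None  # only the pandas-on-Spark names become None
```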
…pyspark.pandas unavailable Agent-Logs-Url: https://github.com/microsoft/FLAML/sessions/60ca1746-e262-40e3-a970-2db40bda57dd Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Fixed in c268df3.
pyspark doesn't support pandas 3.0 yet. Add pandas version constraints properly. @copilot
…xtra in setup.py Agent-Logs-Url: https://github.com/microsoft/FLAML/sessions/0512ff2f-87ab-4394-b688-ea1416c26c6a Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Pandas 3.0 introduced several breaking changes that caused widespread test failures across classification, forecasting, NLP, default config, and Spark tests.
StringDtype default change
Pandas 3.0 defaults string columns to `StringDtype(na_value=nan)` with `dtype.name == "str"` instead of `"object"`. This caused string columns to be misclassified as numeric and column name indexes to crash `np.issubdtype()`.

- `flaml/automl/data.py`: Add `"str"` to dtype checks in `fit_transform()` and `transform()`. Wrap `np.issubdtype(X_num.columns.dtype, np.integer)` in try/except since string column indexes now use `StringDtype`.
- `flaml/automl/task/generic_task.py`: Add `"str"` to the NLP text-column validation assertion.
- `flaml/automl/time_series/ts_data.py`: Add `"str"` to dtype checks in `DataTransformerTS`.

Datetime resolution changes
Pandas 3.0 uses varying datetime resolutions (`datetime64[s]`, `[ms]`, `[us]`) instead of always `datetime64[ns]`, causing datetime columns to be missed during feature enrichment (the root cause of the TCN 0-channel errors).

- `flaml/automl/data.py`, `flaml/automl/time_series/ts_data.py`: Replace the hardcoded `dtype.name == "datetime64[ns]"` check with `is_datetime64_any_dtype()`.

pyspark.pandas import incompatibility with pandas 3.0
PySpark's `pyspark.pandas` module fails to import with pandas 3.0 (`cannot import name '_builtin_table' from 'pandas.core.common'`). Since all pyspark imports were in a single `try` block, the `pyspark.pandas` failure cascaded into making `SparkSession`, `_spark_major_minor_version`, `F`, `T`, etc. all become dummy/None values, breaking all Spark tests.

- `flaml/automl/spark/__init__.py`: Separate core pyspark imports (`SparkSession`, `F`, `T`, `VersionUtils`, `sparkDataFrame`) from `pyspark.pandas` imports (`ps`, `psDataFrame`, `psSeries`, `set_option`). Core Spark functionality now works even when `pyspark.pandas` is incompatible with the installed pandas version.
- `flaml/automl/spark/utils.py`: Guard `set_option` and `ps.from_pandas` calls with `None` checks for when `pyspark.pandas` is unavailable, raising a clear `ImportError` instead of a `TypeError`.

Removed deprecated APIs in tests
- `fillna(method="ffill")` → `ffill()` (removed in pandas 3.0)
- `"T"` → `"min"`, `"H"` → `"h"` (removed in pandas 3.0)

pandas version constraints for pyspark compatibility
Since pyspark doesn't support pandas 3.0 yet, proper version constraints are added:

- `setup.py`: Add `"pandas<3"` to the `spark` extra requirements so users installing `flaml[spark]` get a compatible pandas version.
- `.github/workflows/python-package.yml`: Pin `pandas>=2.0,<3` on all Ubuntu pyspark environments (Python 3.11, 3.12, 3.13). Non-pyspark environments (Ubuntu 3.10, all Windows) continue testing with the latest pandas 3.
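The extras change can be pictured with a hypothetical fragment of the `extras_require` mapping (only the `"pandas<3"` pin comes from this PR; the other entry is illustrative, not the real dependency list):

```python
# Hypothetical setup.py fragment; "pyspark" line is illustrative.
extras_require = {
    "spark": [
        "pyspark",
        "pandas<3",  # pyspark.pandas does not yet support pandas 3.0
    ],
}
```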