Commit

FIX use np.nan instead of None for missing marker in fetch_openml (sc…
glemaitre committed Jun 14, 2023
1 parent 9c266cf commit f721c6d
Showing 3 changed files with 9 additions and 4 deletions.
4 changes: 4 additions & 0 deletions doc/whats_new/v1.3.rst
@@ -283,6 +283,10 @@ Changelog
the pandas parser. The parameter `read_csv_kwargs` allows to overwrite this behaviour.
:pr:`26551` by :user:`Guillaume Lemaitre <glemaitre>`.

+- |Fix| :func:`datasets.fetch_openml` will consistently use `np.nan` as missing marker
+with both parsers `"pandas"` and `"liac-arff"`.
+:pr:`26579` by :user:`Guillaume Lemaitre <glemaitre>`.

- |Enhancement| Allows to overwrite the parameters used to open the ARFF file using
the parameter `read_csv_kwargs` in :func:`datasets.fetch_openml` when using the
pandas parser.
5 changes: 4 additions & 1 deletion sklearn/datasets/_arff_parser.py
@@ -204,7 +204,10 @@ def _io_to_generator(gzip_file):
if len(dfs) >= 2:
dfs[0] = dfs[0].astype(dfs[1].dtypes)

-frame = pd.concat(dfs, ignore_index=True)
+# liac-arff parser does not depend on NumPy and uses None to represent
+# missing values. To be consistent with the pandas parser, we replace
+# None with np.nan.
+frame = pd.concat(dfs, ignore_index=True).fillna(value=np.nan)
del dfs, first_df

# cast the columns frame
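
For readers skimming the hunk above, here is a minimal standalone sketch of what the added `.fillna(value=np.nan)` call does. It is not taken from the scikit-learn code base: the chunk names and the `label` column are hypothetical, and the two small frames only stand in for the chunks a `None`-based parser might yield.

```python
import numpy as np
import pandas as pd

# Hypothetical chunks; a parser that does not depend on NumPy would
# mark the missing entry with None in an object column.
chunk_a = pd.DataFrame({"label": pd.Series(["a", None], dtype=object)})
chunk_b = pd.DataFrame({"label": pd.Series(["b", "c"], dtype=object)})

# Same pattern as in the diff: concatenate, then swap None for np.nan.
raw = pd.concat([chunk_a, chunk_b], ignore_index=True)
filled = raw.fillna(value=np.nan)

missing_before = raw.loc[1, "label"]    # still the None marker
missing_after = filled.loc[1, "label"]  # now a float NaN

print(repr(missing_before), repr(missing_after))
```

Both markers already count as missing for `isna()`; the point of the change is that the stored value is the same one the pandas parser produces, so downstream comparisons and casts behave alike for both parsers.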
4 changes: 1 addition & 3 deletions sklearn/datasets/tests/test_openml.py
@@ -920,9 +920,7 @@ def datasets_missing_values():
(1119, "liac-arff", 9, 6, 0),
(1119, "pandas", 9, 0, 6),
# miceprotein
-# 1 column has only missing values with object dtype
-(40966, "liac-arff", 1, 76, 0),
-# with casting it will be transformed to either float or Int64
+(40966, "liac-arff", 1, 77, 0),
(40966, "pandas", 1, 77, 0),
# titanic
(40945, "liac-arff", 3, 6, 0),