FIX only consider "?" as missing marker as per ARFF specs (scikit-lea…
glemaitre committed Jun 9, 2023
1 parent 9eea5b7 commit b044ef8
Showing 2 changed files with 6 additions and 0 deletions.
5 changes: 5 additions & 0 deletions doc/whats_new/v1.3.rst
@@ -274,6 +274,11 @@ Changelog
- |Fix| :func:`datasets.fetch_openml` returns improved data types when
`as_frame=True` and `parser="liac-arff"`. :pr:`26386` by `Thomas Fan`_.

- |Fix| Following the ARFF specs, only the marker `"?"` is now considered a missing
value when opening ARFF files fetched using :func:`datasets.fetch_openml` with
the pandas parser. The parameter `read_csv_kwargs` can be used to override this behaviour.
:pr:`26551` by :user:`Guillaume Lemaitre <glemaitre>`.

- |Enhancement| Allows to overwrite the parameters used to open the ARFF file using
the parameter `read_csv_kwargs` in :func:`datasets.fetch_openml` when using the
pandas parser.
1 change: 1 addition & 0 deletions sklearn/datasets/_arff_parser.py
@@ -387,6 +387,7 @@ def _pandas_arff_parser(
"header": None,
"index_col": False, # always force pandas to not use the first column as index
"na_values": ["?"], # missing values are represented by `?`
"keep_default_na": False, # only `?` is a missing value given the ARFF specs
"comment": "%", # skip line starting by `%` since they are comments
"quotechar": '"', # delimiter to use for quoted strings
"skipinitialspace": True, # skip spaces after delimiter to follow ARFF specs
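
To illustrate the effect of the added `keep_default_na` line, here is a minimal sketch using `pandas.read_csv` directly. The helper `read_arff_rows`, the constant `DEFAULT_READ_CSV_KWARGS`, and the sample data are hypothetical and only mirror a subset of the defaults shown in this hunk. The way user-supplied keyword arguments are merged over the defaults is an assumption about how `read_csv_kwargs` behaves, not a copy of the scikit-learn implementation.

```python
import io

import pandas as pd

# Subset of the parser defaults visible in the hunk above (illustrative only).
DEFAULT_READ_CSV_KWARGS = {
    "header": None,
    "index_col": False,
    "na_values": ["?"],        # `?` is the ARFF missing-value marker
    "keep_default_na": False,  # do not treat "NA", "null", ... as missing
}


def read_arff_rows(text, read_csv_kwargs=None):
    """Hypothetical helper: parse ARFF data rows, letting callers override defaults."""
    # Assumption: user-supplied values take precedence over the defaults.
    kwargs = {**DEFAULT_READ_CSV_KWARGS, **(read_csv_kwargs or {})}
    return pd.read_csv(io.StringIO(text), **kwargs)


# Made-up data rows: "NA" and "null" are legitimate strings here, `?` is missing.
raw = "1,NA,?\n2,null,b\n"

strict = read_arff_rows(raw)                             # behaviour after this commit
legacy = read_arff_rows(raw, {"keep_default_na": True})  # opt back into pandas defaults

print(strict)
print(legacy)
```

With the strict defaults, the strings `NA` and `null` in the second column are kept as data and only `?` becomes missing. Passing `keep_default_na=True` makes pandas treat its built-in markers as missing again.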