Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate CSV reader to pylibcudf #16011

Merged
merged 22 commits into from
Jul 18, 2024

Conversation

lithomas1
Copy link
Contributor

@lithomas1 lithomas1 commented Jun 12, 2024

Description

xref #15162

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@lithomas1 lithomas1 added feature request New feature or request non-breaking Non-breaking change labels Jun 12, 2024
@github-actions github-actions bot added Python Affects Python cuDF API. CMake CMake build issue pylibcudf Issues specific to the pylibcudf package labels Jun 12, 2024
@lithomas1 lithomas1 marked this pull request as ready for review July 11, 2024 18:41
@lithomas1 lithomas1 requested a review from a team as a code owner July 11, 2024 18:41
@lithomas1 lithomas1 requested review from vyasr and isVoid July 11, 2024 18:41
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some small suggestions for improvement, but I don't need to review this again so feel free to go ahead with this PR once you feel you've addressed my comments sufficiently.

python/cudf/cudf/_lib/csv.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/csv.pyx Outdated Show resolved Hide resolved
Comment on lines +223 to +228
elif (
cudf.api.types.is_scalar(dtype) or
isinstance(dtype, (
np.dtype, pd.api.extensions.ExtensionDtype, type
))
):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could handle the scalar case up front by wrapping it in a list to keep things simpler. Then you have new_dtypes as a list in the list branch and it's a dict in the mapping branch of this conditional.

python/cudf/cudf/_lib/csv.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/pylibcudf/io/csv.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/pylibcudf/io/csv.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/pylibcudf/io/csv.pyx Outdated Show resolved Hide resolved
Comment on lines 281 to 286
c_na_values.reserve(len(na_values))
for nv in na_values:
if not isinstance(nv, str):
raise TypeError("na_values must be a list of str!")
c_na_values.push_back(nv.encode())
options.set_na_values(c_na_values)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type of code is repeated a lot in this function's parsing of inputs. It feels like a helper function along the lines of

cdef vector[T] vec_from_iterable(vec, test, error_msg)

could help. OTOH maybe that's overengineering since you'll still have to write predicate functions and error messages each time. Maybe give it a shot once and see what it looks like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda did something like that.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small suggestions

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx Outdated Show resolved Hide resolved
python/cudf/cudf/_lib/pylibcudf/io/csv.pyx Outdated Show resolved Hide resolved
Comment on lines 246 to 250
for dtype in dtypes:
if not isinstance(dtype, DataType):
raise TypeError("If passing list to read_csv, "
"all elements must be of type `DataType`!")
c_dtypes_list.push_back((<DataType>dtype).c_obj)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be happy with a simpler structure of:

options.set_dtypes([(<DataType?>dtype).c_obj for dtype in dtypes])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work since you can't put C objects in a list comprehension.

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx Outdated Show resolved Hide resolved
from cudf._lib.pylibcudf.types cimport DataType


cpdef TableWithMetadata read_csv(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: (best as an issue) in a followup, expose the csv_reader_options_builder object. So that we don't have to use this nightmare signature :)


cpdef TableWithMetadata read_csv(
SourceInfo source_info,
compression_type compression = compression_type.AUTO,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
compression_type compression = compression_type.AUTO,
*,
compression_type compression = compression_type.AUTO,

People should not be allowed to call this function with positional arguments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't work with cpdef functions.

Lets punt on this for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be cpdef? I am willing to accept a slight calling cost overhead to avoid inevitable argument order issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, put it as def for now.

We should try to make this consistent in the future, though.
(Either put everything as def, or deprecate and remove a bunch of the parameters in read_csv and turn this back into cpdef)

@lithomas1 lithomas1 requested a review from wence- July 17, 2024 15:45
@lithomas1
Copy link
Contributor Author

Self-merging to keep progress moving forward.

As usual, happy to address further comments in a followup.

@lithomas1
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit faddc8c into rapidsai:branch-24.08 Jul 18, 2024
81 checks passed
@lithomas1 lithomas1 deleted the pylibcudf-io-csv branch July 18, 2024 15:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue feature request New feature or request non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.
Projects
Status: Done
Status: Done
Development

Successfully merging this pull request may close these issues.

None yet

3 participants