Migrate CSV reader to pylibcudf #16011

Merged

rapids-bot merged 22 commits into rapidsai:branch-24.08 from lithomas1:pylibcudf-io-csv

Jul 18, 2024

Contributor

lithomas1 commented Jun 12, 2024 (edited)

Description

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.


Migrate CSV reader to pylibcudf

24a9d94

lithomas1 added feature request non-breaking labels

github-actions bot added Python CMake pylibcudf labels

lithomas1 added 3 commits

June 12, 2024 18:54


fix cudf_kafka

86ebb02


Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-csv

52a7185


stub out tests for csv

989a21e

lithomas1 force-pushed the pylibcudf-io-csv branch from 8093233 to 989a21e Compare

July 5, 2024 20:40

lithomas1 mentioned this pull request

[FEA] Implement all libcudf modules required by cuDF Python in pylibcudf #15162

Open

lithomas1 added 9 commits

July 9, 2024 00:03


Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-csv

bfa4095


more tests

112a099


Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-csv

51827bc


add docs

3b020b1


Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-csv

c3aa845


refactor data generation

b9af4ee


Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-csv

f6afa00


final tests update

f07c994


fix docs

929a39e

lithomas1 marked this pull request as ready for review

July 11, 2024 18:41

lithomas1 requested a review from a team as a code owner

July 11, 2024 18:41

lithomas1 requested review from vyasr and isVoid

July 11, 2024 18:41

lithomas1 and others added 4 commits

July 11, 2024 19:27


remove debug prints

e4877bd


Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-csv

d53e531


typo

3b9cec4


simplify more

dd09dc7

vyasr approved these changes

Contributor

vyasr left a comment

I have some small suggestions for improvement, but I don't need to review this again so feel free to go ahead with this PR once you feel you've addressed my comments sufficiently.

python/cudf/cudf/_lib/csv.pyx (outdated, resolved)

python/cudf/cudf/_lib/csv.pyx (outdated, resolved)

python/cudf/cudf/_lib/csv.pyx

Comment on lines +223 to +228

+    elif (
+        cudf.api.types.is_scalar(dtype) or
+        isinstance(dtype, (
+            np.dtype, pd.api.extensions.ExtensionDtype, type
+        ))
+    ):

Contributor

vyasr Jul 16, 2024

You could handle the scalar case up front by wrapping it in a list to keep things simpler. Then you have new_dtypes as a list in the list branch and it's a dict in the mapping branch of this conditional.
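A minimal sketch of the shape suggested here, in plain Python rather than Cython. The name `normalize_dtypes` is hypothetical, and the scalar check is a simplified stand-in for `cudf.api.types.is_scalar` and the `isinstance` test in the diff; it is not the code that was merged.

```python
from collections.abc import Mapping


def normalize_dtypes(dtype):
    """Hypothetical helper: return None, a dict, or a list of dtypes.

    Assumption: a "scalar" dtype is a string, a Python type, or any
    non-iterable object. The real code uses cudf.api.types.is_scalar.
    """
    if dtype is None:
        return None
    # Mapping branch: keep column-name -> dtype pairs as a dict.
    if isinstance(dtype, Mapping):
        return dict(dtype)
    # Scalar case handled up front by wrapping it in a list.
    if isinstance(dtype, (str, type)) or not hasattr(dtype, "__iter__"):
        dtype = [dtype]
    # List branch: everything that reaches here is list-like.
    return list(dtype)
```

With the scalar wrapped early, the rest of the function only has to distinguish the list branch from the mapping branch.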

python/cudf/cudf/_lib/csv.pyx (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx Outdated

Comment on lines 281 to 286

+    c_na_values.reserve(len(na_values))
+    for nv in na_values:
+        if not isinstance(nv, str):
+            raise TypeError("na_values must be a list of str!")
+        c_na_values.push_back(nv.encode())
+    options.set_na_values(c_na_values)

Contributor

vyasr Jul 16, 2024

This type of code is repeated a lot in this function's parsing of inputs. It feels like a helper function along the lines of

cdef vector[T] vec_from_iterable(vec, test, error_msg)

could help. OTOH maybe that's overengineering since you'll still have to write predicate functions and error messages each time. Maybe give it a shot once and see what it looks like.
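For illustration, a plain-Python analogue of the helper floated above. The name `list_from_iterable` and its `convert` parameter are hypothetical; the Cython version would fill a C++ `vector` rather than a Python list, and this is not the code that ended up in the PR.

```python
def list_from_iterable(items, predicate, error_msg, convert=lambda x: x):
    """Hypothetical helper: validate each item, convert it, collect it.

    Raises TypeError with error_msg on the first item failing predicate.
    """
    out = []
    for item in items:
        if not predicate(item):
            raise TypeError(error_msg)
        out.append(convert(item))
    return out


# Usage mirroring the na_values snippet: check for str, then encode to bytes.
na_values = list_from_iterable(
    ["NA", "null"],
    lambda v: isinstance(v, str),
    "na_values must be a list of str!",
    convert=str.encode,
)
```

As the comment notes, each call site still supplies its own predicate and error message, so the saving is mostly the loop and the `raise`.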

Contributor Author

lithomas1 Jul 17, 2024

Kinda did something like that.

wence- reviewed

Contributor

wence- left a comment

Some small suggestions

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx Outdated

Comment on lines 246 to 250

+    for dtype in dtypes:
+        if not isinstance(dtype, DataType):
+            raise TypeError("If passing list to read_csv, "
+                            "all elements must be of type `DataType`!")
+        c_dtypes_list.push_back((<DataType>dtype).c_obj)

Contributor

wence- Jul 16, 2024

I would be happy with a simpler structure of:

options.set_dtypes([(<DataType?>dtype).c_obj for dtype in dtypes])

Contributor Author

lithomas1 Jul 17, 2024

This doesn't work since you can't put C objects in a list comprehension.

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx (outdated, resolved)

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx Outdated

		from cudf._lib.pylibcudf.types cimport DataType


		cpdef TableWithMetadata read_csv(

Contributor

wence- Jul 16, 2024

TODO (best as an issue): in a followup, expose the csv_reader_options_builder object so that we don't have to use this nightmare signature :)

python/cudf/cudf/_lib/pylibcudf/io/csv.pyx

+cpdef TableWithMetadata read_csv(
+    SourceInfo source_info,
+    compression_type compression = compression_type.AUTO,

Contributor

wence- Jul 16, 2024

Suggested change

-    compression_type compression = compression_type.AUTO,
+    *,
+    compression_type compression = compression_type.AUTO,

People should not be allowed to call this function with positional arguments.

Contributor Author

lithomas1 Jul 16, 2024

This doesn't work with cpdef functions.

Let's punt on this for now.

Contributor

wence- Jul 16, 2024

Does this need to be cpdef? I am willing to accept a slight calling cost overhead to avoid inevitable argument order issues.

Contributor Author

lithomas1 Jul 17, 2024

OK, put it as def for now.

We should try to make this consistent in the future, though.
(Either put everything as def, or deprecate and remove a bunch of the parameters in read_csv and turn this back into cpdef)
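To illustrate what the bare `*` in the suggestion buys once the function is a plain `def`, here is a small Python sketch. `read_csv_sketch` and its parameters are stand-ins invented for this example; they are not the pylibcudf signature.

```python
def read_csv_sketch(source_info, *, compression="auto", delimiter=","):
    """Hypothetical stand-in: everything after * must be passed by keyword."""
    return {
        "source_info": source_info,
        "compression": compression,
        "delimiter": delimiter,
    }


# Keyword arguments work as usual.
opts = read_csv_sketch("data.csv", compression="gzip")

# Passing options positionally is rejected with a TypeError, so callers
# cannot silently depend on the order of a long parameter list.
try:
    read_csv_sketch("data.csv", "gzip")
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

The trade-off discussed in the thread still applies: a `def` function gives up the C-level call path that `cpdef` provides.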

lithomas1 and others added 5 commits

July 16, 2024 16:48


Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-csv

6935b9c


simplify greatly

704d51a

Co-authored-by: Lawrence Mitchell
Co-authored-by: Vyas Ramasubramani


Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-csv

da114dd


Merge branch 'branch-24.08' of github.com:rapidsai/cudf into pylibcudf-io-csv

04cb9fb


cleanup more

7e8ad5d

lithomas1 requested a review from wence-

July 17, 2024 15:45

Contributor Author

lithomas1 commented Jul 18, 2024

Self-merging to keep progress moving forward.

As usual, happy to address further comments in a followup.

Contributor Author

lithomas1 commented Jul 18, 2024

/merge

rapids-bot bot merged commit faddc8c into rapidsai:branch-24.08

81 checks passed

lithomas1 deleted the pylibcudf-io-csv branch

July 18, 2024 15:04
