[KED-1642] index option in pandas.ParquetDataSet #352

Closed
juan-carlos-calvo opened this issue May 4, 2020 · 2 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

juan-carlos-calvo commented May 4, 2020

Description

ParquetLocalDataSet accepts an index save option, which specifies whether the DataFrame index is written to the Parquet file. pandas.ParquetDataSet doesn't have an equivalent option at the moment.

Context

As the datasets in kedro.io are going to be deprecated and moved to kedro.extras.datasets, Kedro should keep backward compatibility with the existing load_args and save_args options. Also, index is a useful option to have.

Possible Implementation

In the _save method of pandas.ParquetDataSet, replace:

table = pa.Table.from_pandas(data)

with

preserve_index = self._save_args.pop('index', False)
table = pa.Table.from_pandas(data, preserve_index=preserve_index)
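One detail of the snippet above is worth a closer look: calling pop() directly on self._save_args removes the 'index' key from the instance, so a second save on the same dataset object would fall back to the default. The following is a minimal sketch of that behaviour and of a per-call copy that avoids it. SketchDataSet and its two methods are hypothetical stand-ins written for illustration, not Kedro's actual code, and nothing is written to disk.

```python
class SketchDataSet:
    """Hypothetical stand-in for a dataset class; not Kedro's actual code."""

    def __init__(self, save_args=None):
        # Copy so the caller's dict is never modified.
        self._save_args = dict(save_args or {})

    def save_with_pop(self):
        # pop() removes 'index' from the instance state on the first call,
        # so later calls only ever see the default value.
        return self._save_args.pop("index", False)

    def save_with_copy(self):
        # Work on a per-call copy so every save sees the same options.
        save_args = dict(self._save_args)
        preserve_index = save_args.pop("index", False)
        # `save_args` (now without 'index') would be passed to the writer here.
        return preserve_index


popping = SketchDataSet({"index": True})
first, second = popping.save_with_pop(), popping.save_with_pop()

copying = SketchDataSet({"index": True})
third, fourth = copying.save_with_copy(), copying.save_with_copy()

print(first, second)  # the option is lost after the first save
print(third, fourth)  # the option survives repeated saves
```

Popping from a copy still keeps 'index' out of the keyword arguments forwarded to the Parquet writer, which appears to be the reason for using pop() in the first place.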

juan-carlos-calvo added the Issue: Feature Request label on May 4, 2020

921kiyo commented May 5, 2020

@juan-carlos-calvo Thank you for opening the issue! This makes sense to me. I've logged this in our backlog and will fix it :)

921kiyo changed the title from "index option in pandas.ParquetDataSet" to "[KED-1642] index option in pandas.ParquetDataSet" on May 5, 2020

andrii-ivaniuk commented May 27, 2020

Thanks @juan-carlos-calvo for reporting this.
It was fixed in commit 5330b45.

The arguments for from_pandas() should be passed through a nested key: from_pandas. E.g.: save_args = {"from_pandas": {"preserve_index": False}}
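To make the nested key concrete, here is a rough sketch of how a dataset could separate the from_pandas options from the remaining save arguments. This is not the code from the commit referenced above: split_save_args is a hypothetical helper, and the "compression" entry is only an illustrative extra writer option.

```python
from copy import deepcopy


def split_save_args(save_args):
    """Hypothetical helper: separate nested 'from_pandas' options from the rest."""
    # Deep-copy so neither the caller's dict nor its nested dict is mutated.
    remaining = deepcopy(save_args) if save_args else {}
    from_pandas_args = remaining.pop("from_pandas", {})
    return from_pandas_args, remaining


# "compression" is an illustrative extra option, not taken from the issue.
save_args = {"from_pandas": {"preserve_index": False}, "compression": "GZIP"}
from_pandas_args, write_args = split_save_args(save_args)

# from_pandas_args would be unpacked into pa.Table.from_pandas(data, **from_pandas_args);
# write_args would be unpacked into the Parquet writer call.
print(from_pandas_args)
print(write_args)
```

Keeping the two groups apart means options intended for from_pandas() never reach the writer, and the other way round.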

pull bot pushed a commit to FoundryAI/kedro that referenced this issue Jul 17, 2020