
Read from parquet does not work #587

Open

m-birke opened this issue Aug 9, 2023 · 7 comments
Labels
good first issue: Non-urgent, simple task with limited scope, suitable for getting started as a contributor.

Comments

m-birke (Collaborator) commented Aug 9, 2023

The Parquet file exists and can be read with https://parquet-viewer-online.com.

Executing the DSL script prints a frame in which all (10) values are nan.

// read_from_parquet.daph
group_data = readFrame("/path/to/data.parquet");

print(group_data);

OUTPUT:

Frame(5x2, [Id:double, Vds:double])
nan nan
nan nan
nan nan
nan nan
nan nan
// /path/to/data.parquet.meta
{
  "numRows": 5,
  "numCols": 2,
  "schema": [
    {
      "label": "Id",
      "valueType": "f64"
    },
    {
      "label": "Vds",
      "valueType": "f64"
    }
  ]
}

This is how I created the parquet file:

import pyarrow.parquet as pq
from pathlib import Path
import pyarrow.csv as csv


def main(path: str):
    p = Path(path)

    # The CSV has no header row, so the column names are given explicitly.
    ro = csv.ReadOptions(column_names=["Id", "Vds"])
    table = csv.read_csv(p, read_options=ro)

    print("Arrow table from csv ----------------------------------------------------------------------------")
    print(f"Num cols in table: {table.num_columns}")
    print(f"Num rows in table: {table.num_rows}")
    print(table)

    print("Writing arrow table -----------------------------------------------------------------------------")
    destpth = p.with_suffix(".parquet")
    print(f"writing to {destpth.resolve()}")
    pq.write_table(table, destpth, use_dictionary=False)

    print("Reading back again from parquet file-------------------------------------------------------------")
    rtable = pq.read_table(p.with_suffix(".parquet"))
    print(rtable)
    print("pandas repr.:")
    print(rtable.to_pandas())


if __name__ == "__main__":
    main("/path/to/data.csv")

from a csv file which looks like this:

0.1184805,4.2727
0.026556,4.2356
-0.0653686,4.248
-0.0347271,4.248
-0.0040855,4.6257

Sample parquet files (with much more complex data) can be obtained here: https://github.com/kaysush/sample-parquet-files/tree/main

corepointer added the "bug" label (A mistake in the code.) on Aug 11, 2023
corepointer (Collaborator) commented:

Hi!
I just looked into this. Apparently we don't support reading frames from Parquet files yet (see src/runtime/local/kernels/Read.h) 🙈
When I changed readFrame() to readMatrix() and used the suggested sample inputs, I noticed that we don't build Arrow with support for the Snappy codec.
Regards, Mark
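
A possible workaround until Snappy support is built in would be to write the Parquet file without any compression, so that reading it requires no codec at all. A minimal pyarrow sketch, reusing the placeholder path and column names from the script above:

from pathlib import Path
import pyarrow.csv as csv
import pyarrow.parquet as pq

# Sketch: write the table uncompressed so that reading it does not depend
# on any codec being available in the Arrow build (path is a placeholder).
src = Path("/path/to/data.csv")
ro = csv.ReadOptions(column_names=["Id", "Vds"])
table = csv.read_csv(src, read_options=ro)
pq.write_table(table, src.with_suffix(".parquet"),
               use_dictionary=False, compression="NONE")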

corepointer added the "good first issue" label and removed the "bug" label on Aug 11, 2023
m-birke changed the title from "[bug] Read from parquet does not work" to "[feature] readFrame from parquet does not work" on Aug 16, 2023
m-birke changed the title from "[feature] readFrame from parquet does not work" to "[feature] support for readFrame from parquet" on Aug 16, 2023
m-birke (Collaborator, Author) commented Aug 16, 2023

I changed the Python script to additionally specify the compression algorithm, like

compression = "GZIP"  # also tried "SNAPPY" and "BROTLI"
pq.write_table(table, destpth, use_dictionary=False, compression=compression)

With readMatrix and all 3 compression algorithms I get the same error:

[error]: Execution error: Could not read Parquet table
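
For reference, a sketch of how that modified write step might look as a self-contained script, writing one file per codec so each variant can be passed to readMatrix() in turn (paths are placeholders):

from pathlib import Path
import pyarrow.csv as csv
import pyarrow.parquet as pq

# Sketch: write one parquet file per compression codec for testing.
src = Path("/path/to/data.csv")
table = csv.read_csv(src, read_options=csv.ReadOptions(column_names=["Id", "Vds"]))
for compression in ("GZIP", "SNAPPY", "BROTLI"):
    dest = src.with_name(f"data_{compression.lower()}.parquet")
    pq.write_table(table, dest, use_dictionary=False, compression=compression)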

m-birke changed the title from "[feature] support for readFrame from parquet" to "Read from parquet does not work" on Aug 16, 2023
corepointer (Collaborator) commented:

The Snappy-compressed sample parquet files from the link above can be read without error by DAPHNE if I compile Arrow with Snappy support. I'll incorporate all Arrow-supported compression formats in the next Docker image updates.
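
On the Python side, whether a given Arrow build ships a particular codec can be checked with pyarrow.Codec; a small sketch (the Arrow C++ build bundled with DAPHNE has the analogous choice at compile time, which is what the comment above refers to):

import pyarrow as pa

# Check which compression codecs this pyarrow build supports.
for codec in ("snappy", "gzip", "brotli", "zstd", "lz4"):
    print(codec, pa.Codec.is_available(codec))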

corepointer added a commit that referenced this issue Sep 18, 2023
Added all compression options to the Arrow compilation. This solves part of the problems described in #587. Reading certain parquet files no longer fails right away due to missing compression format support. Parsing them correctly is still an issue.
m-birke (Collaborator, Author) commented Sep 28, 2023

Hi @corepointer

With the new Docker image I am able to read parquet files. Tested for snappy, brotli and gzip.

Thank you!

Should we close this issue and open a new one requesting readFrame() for parquet, or keep this one open?

KR

corepointer (Collaborator) commented:

As I mentioned in the commit message of c1100d8, this is just a partial fix, as the parquet reader seems to be quite limited at the moment.
In my tests with the sample files from the link above, I noticed that I get a nice matrix if I use readMatrix() (because readFrame() is not supported here). However, that matrix would contain mostly zeros, because the reader does not handle the required data types correctly (e.g., "345.0" becomes 0, since this floating point value is stored as a string in the parquet file). Furthermore, the inputs are converted to CSV in memory and then read from there, and that CSV reader also had some issues parsing (it didn't tokenize well between the commas).
So if it is working for you now (do all values from your input get read correctly?), you could close the issue. We can also deal with the shortcomings of the reader in a separate bug report with a more detailed error description.
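
To check whether a particular file is affected by the data-type issue described above (values stored as strings rather than doubles), the column types can be inspected with pyarrow; a small sketch with a placeholder path:

import pyarrow.parquet as pq

# Print the logical type of each column; a float column stored as string
# would show up here as "string" instead of "double".
schema = pq.read_schema("/path/to/data.parquet")
for field in schema:
    print(field.name, field.type)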

m-birke (Collaborator, Author) commented Sep 28, 2023

I just tried a larger file now, and unfortunately it does not work properly: towards the end of the matrix there are a lot of nans instead of the values.

m-birke (Collaborator, Author) commented Sep 28, 2023

It is very strange: sometimes it works properly, sometimes not
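
One way to narrow this down might be to compare the Parquet metadata of a file that reads correctly with one that produces nans (row counts, row groups, column types, codecs); a small diagnostic sketch with placeholder paths:

import pyarrow.parquet as pq

# Diagnostic sketch: dump per-file metadata so a working and a failing
# file can be compared side by side.
for path in ("/path/to/working.parquet", "/path/to/failing.parquet"):
    pf = pq.ParquetFile(path)
    md = pf.metadata
    print(path, "rows:", md.num_rows, "row groups:", md.num_row_groups)
    print(pf.schema_arrow)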
