
Read from parquet does not work #587

Open

m-birke opened this issue Aug 9, 2023 · 7 comments
Labels
good first issue: Non-urgent, simple task with limited scope, suitable for getting started as a contributor.

Comments

m-birke (Collaborator) commented Aug 9, 2023

The Parquet file exists and can be read with https://parquet-viewer-online.com.

Executing the DSL script prints a frame in which all (10) values are nan.

// read_from_parquet.daph
group_data = readFrame("/path/to/data.parquet");

print(group_data);

OUTPUT:

Frame(5x2, [Id:double, Vds:double])
nan nan
nan nan
nan nan
nan nan
nan nan
// /path/to/data.parquet.meta
{
  "numRows": 5,
  "numCols": 2,
  "schema": [
    {
      "label": "Id",
      "valueType": "f64"
    },
    {
      "label": "Vds",
      "valueType": "f64"
    }
  ]
}

This is how I created the parquet file:

import pyarrow.parquet as pq
from pathlib import Path
import pyarrow.csv as csv


def main(path: str):
    p = Path(path)

    # The CSV has no header row, so the column names are given explicitly.
    ro = csv.ReadOptions(column_names=["Id", "Vds"])
    table = csv.read_csv(p, read_options=ro)

    print("Arrow table from csv ----------------------------------------------------------------------------")
    print(f"Num cols in table: {table.num_columns}")
    print(f"Num rows in table: {table.num_rows}")
    print(table)

    print("Writing arrow table -----------------------------------------------------------------------------")
    destpth = p.with_suffix(".parquet")
    print(f"writing to {destpth.resolve()}")
    pq.write_table(table, destpth, use_dictionary=False)

    print("Reading back again from parquet file-------------------------------------------------------------")
    rtable = pq.read_table(p.with_suffix(".parquet"))
    print(rtable)
    print("pandas repr.:")
    print(rtable.to_pandas())


if __name__ == "__main__":
    main("/path/to/data.csv")

from a csv file which looks like this:

0.1184805,4.2727
0.026556,4.2356
-0.0653686,4.248
-0.0347271,4.248
-0.0040855,4.6257

Sample parquet files (with much more complex data) can be obtained here: https://github.com/kaysush/sample-parquet-files/tree/main

corepointer added the "bug" label (A mistake in the code.) on Aug 11, 2023
corepointer (Collaborator) commented:

Hi!
I just looked into this. Apparently we don't support reading frames from Parquet files yet (see src/runtime/local/kernels/Read.h) 🙈
When I changed readFrame() to readMatrix() and used the suggested sample inputs, I noticed that we don't build Arrow with support for the Snappy codec.
Regards, Mark
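
A possible workaround until Snappy support is built in would be to write the Parquet file without any compression, so that reading it requires no codec at all. A minimal pyarrow sketch, reusing the placeholder path and column names from the script above:

from pathlib import Path
import pyarrow.csv as csv
import pyarrow.parquet as pq

# Sketch: write the table uncompressed so that reading it does not depend
# on any codec being available in the Arrow build (path is a placeholder).
src = Path("/path/to/data.csv")
ro = csv.ReadOptions(column_names=["Id", "Vds"])
table = csv.read_csv(src, read_options=ro)
pq.write_table(table, src.with_suffix(".parquet"),
               use_dictionary=False, compression="NONE")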

corepointer added the "good first issue" label and removed the "bug" label on Aug 11, 2023
m-birke changed the title from "[bug] Read from parquet does not work" to "[feature] readFrame from parquet does not work" on Aug 16, 2023
m-birke changed the title from "[feature] readFrame from parquet does not work" to "[feature] support for readFrame from parquet" on Aug 16, 2023
m-birke (Collaborator, Author) commented Aug 16, 2023

I changed the Python script to additionally specify the compression algorithm, like

compression = "GZIP"  # also tried "SNAPPY" and "BROTLI"
pq.write_table(table, destpth, use_dictionary=False, compression=compression)

With readMatrix and all 3 compression algorithms I get the same error:

[error]: Execution error: Could not read Parquet table
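
For reference, a sketch of how that modified write step might look as a self-contained script, writing one file per codec so each variant can be passed to readMatrix() in turn (paths are placeholders):

from pathlib import Path
import pyarrow.csv as csv
import pyarrow.parquet as pq

# Sketch: write one parquet file per compression codec for testing.
src = Path("/path/to/data.csv")
table = csv.read_csv(src, read_options=csv.ReadOptions(column_names=["Id", "Vds"]))
for compression in ("GZIP", "SNAPPY", "BROTLI"):
    dest = src.with_name(f"data_{compression.lower()}.parquet")
    pq.write_table(table, dest, use_dictionary=False, compression=compression)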

m-birke changed the title from "[feature] support for readFrame from parquet" to "Read from parquet does not work" on Aug 16, 2023
corepointer (Collaborator) commented:

The Snappy-compressed sample parquet files from the link above can be read without error by DAPHNE if I compile Arrow with Snappy support. I'll incorporate all Arrow-supported compression formats in the next Docker image updates.
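
On the Python side, whether a given Arrow build ships a particular codec can be checked with pyarrow.Codec; a small sketch (the Arrow C++ build bundled with DAPHNE has the analogous choice at compile time, which is what the comment above refers to):

import pyarrow as pa

# Check which compression codecs this pyarrow build supports.
for codec in ("snappy", "gzip", "brotli", "zstd", "lz4"):
    print(codec, pa.Codec.is_available(codec))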

corepointer added a commit that referenced this issue Sep 18, 2023
Added all compression options to the Arrow compilation. This solves part of the problems described in #587. Reading certain parquet files no longer fails right away due to missing compression format support. Parsing them correctly is still an issue.
m-birke (Collaborator, Author) commented Sep 28, 2023

Hi @corepointer

With the new Docker image I am able to read parquet files. Tested for snappy, brotli and gzip.

Thank you!

Should we close this issue and open a new one requesting readFrame() for parquet, or keep this one open?

KR

corepointer (Collaborator) commented:

As I mentioned in the commit message of c1100d8, this is just a partial fix, as the parquet reader seems to be quite limited at the moment.
In my tests with the sample files from the link above, I noticed that I get a nice matrix if I use readMatrix() (because readFrame() is not supported here). However, that matrix would contain mostly zeros, because the reader does not handle the required data types correctly (e.g., "345.0" becomes 0, since this floating point value is stored as a string in the parquet file). Furthermore, the inputs are converted to CSV in memory and then read from there, and that CSV reader also had some issues parsing (it didn't tokenize well between the commas).
So if it is working for you now (do all values from your input get read correctly?), you could close the issue. We can also deal with the shortcomings of the reader in a separate bug report with a more detailed error description.
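
To check whether a particular file is affected by the data-type issue described above (values stored as strings rather than doubles), the column types can be inspected with pyarrow; a small sketch with a placeholder path:

import pyarrow.parquet as pq

# Print the logical type of each column; a float column stored as string
# would show up here as "string" instead of "double".
schema = pq.read_schema("/path/to/data.parquet")
for field in schema:
    print(field.name, field.type)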

m-birke (Collaborator, Author) commented Sep 28, 2023

I just tried a larger file now, and unfortunately it does not work properly: towards the end of the matrix there are a lot of nans instead of the values.

m-birke (Collaborator, Author) commented Sep 28, 2023

It is very strange: sometimes it works properly, sometimes not
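
One way to narrow this down might be to compare the Parquet metadata of a file that reads correctly with one that produces nans (row counts, row groups, column types, codecs); a small diagnostic sketch with placeholder paths:

import pyarrow.parquet as pq

# Diagnostic sketch: dump per-file metadata so a working and a failing
# file can be compared side by side.
for path in ("/path/to/working.parquet", "/path/to/failing.parquet"):
    pf = pq.ParquetFile(path)
    md = pf.metadata
    print(path, "rows:", md.num_rows, "row groups:", md.num_row_groups)
    print(pf.schema_arrow)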
