Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unable to select a column without quotation from a table #84

Closed
raybellwaves opened this issue Nov 20, 2020 · 4 comments · Fixed by #88
Closed

unable to select a column without quotation from a table #84

raybellwaves opened this issue Nov 20, 2020 · 4 comments · Fixed by #88

Comments

@raybellwaves
Copy link

Trying to select a column from a table and getting ParsingException.

Select * works.

>>> c.sql("""SELECT * from my_data""")
Dask DataFrame Structure:
              UUID_SAILING a b
npartitions=1
                    object            float64              float64
                       ...                ...                  ...
Dask Name: getitem, 6 tasks

But when trying to select a column (UUID_SAILING) I get

>>> c.sql("""SELECT UUID_SAILING from my_data""")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\ds-rm-obr-dev\lib\site-packages\dask_sql\context.py", line 346, in sql
    rel, select_names, _ = self._get_ral(sql)
  File "C:\Users\131416\AppData\Local\Continuum\anaconda3\envs\ds-rm-obr-dev\lib\site-packages\dask_sql\context.py", line 466, in _get_ral
    raise ParsingException(sql, str(e.message())) from None
dask_sql.utils.ParsingException: Can not parse the given SQL: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 19: Column 'uuid_sailing' not found in any table; did you mean 'UUID_SAILING'?

The problem is probably somewhere here:

        SELECT UUID_SAILING from my_data
               ^^^^^^^^^^^^

I believe what is happening is it is expecting quotation around the field name

>>> c.sql("""SELECT * from my_data""").columns
Index(['UUID_SAILING', 'a', 'b'], dtype='object')
>>> c.sql("""SELECT 'UUID_SAILING' from my_data""")
Dask DataFrame Structure:
              'UUID_SAILING'
npartitions=1
                      object
                         ...
Dask Name: getitem, 5 tasks

Note: i'm on windows and the table was created using
c.create_table("my_data", remote_folder + "databricks_out_tmp/file.parquet", storage_options=storage_options)

Also this isn't reproducible using the timeseries as a dataframe

>>> from dask.datasets import timeseries
>>> df = timeseries()
>>> c.create_table("timeseries", df)
>>> c.sql("""SELECT * from timeseries""").columns
Index(['id', 'name', 'x', 'y'], dtype='object')
>>> c.sql("""SELECT * from timeseries""")
Dask DataFrame Structure:
                   id    name        x        y
npartitions=30
2000-01-01      int32  object  float64  float64
2000-01-02        ...     ...      ...      ...
...               ...     ...      ...      ...
2000-01-30        ...     ...      ...      ...
2000-01-31        ...     ...      ...      ...
Dask Name: getitem, 210 tasks
>>> c.sql("""SELECT id from timeseries""")
Dask DataFrame Structure:
                   id
npartitions=30
2000-01-01      int32
2000-01-02        ...
...               ...
2000-01-30        ...
2000-01-31        ...
Dask Name: getitem, 120 tasks
@raybellwaves
Copy link
Author

This is reproducible

$ conda create -n test_env python=3.8 -y
$ conda activate test_env
$ conda install -c conda-forge dask-sql adlfs pyarrow -y
$ python

from dask_sql import Context
c = Context()

storage_options = {'account_name': 'azureopendatastorage'}
c.create_table("taxi", "az:https://nyctlc/green/puYear=2019/puMonth=1/*.parquet", storage_options=storage_options)

c.sql("""SELECT vendorID from taxi""")
>>> c.sql("""SELECT vendorID from taxi""")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ray/local/bin/anaconda3/envs/test_env/lib/python3.8/site-packages/dask_sql/context.py", line 346, in sql
    rel, select_names, _ = self._get_ral(sql)
  File "/home/ray/local/bin/anaconda3/envs/test_env/lib/python3.8/site-packages/dask_sql/context.py", line 466, in _get_ral
    raise ParsingException(sql, str(e.message())) from None
dask_sql.utils.ParsingException: Can not parse the given SQL: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 15: Column 'vendorid' not found in any table; did you mean 'vendorID'?

The problem is probably somewhere here:

	SELECT vendorID from taxi
	       ^^^^^^^^

>>> c.sql("""SELECT 'vendorID' from taxi""")
Dask DataFrame Structure:
               'vendorID'
npartitions=80           
                   object
                      ...
...                   ...
                      ...
                      ...
Dask Name: getitem, 400 tasks

@nils-braun
Copy link
Collaborator

Very good issue, @raybellwaves, and thanks for the MVE.
Just to add some context to this: dask-sql is following the postgreSQL SQL standard. In this standard, every identifier without quotations will be turned into lowercase for comparison. So vendorID -> vendorid and this column is actually unknown.
That makes sense for postgreSQL, as there only columns with lowercase will be created (except if you use quotation marks). But in pandas and dask there is no such a convention, so it does not make sense - good point.

Now we have three possibilities:

  • either move away from a pure postgreSQL syntax to posgreSQL without lower-casing.
  • or convert every column name of a dask dataframe to lowercase (do not like that, because it is even possible to have the same column name as upper and lowercase in pandas)
  • or have this as a configuration option

I am in favor of the first option, but maybe you have another opinion?

@nils-braun
Copy link
Collaborator

nils-braun commented Nov 28, 2020

I have added a PR to go with the first option

@raybellwaves I tested your example and now I get the correct result

Dask DataFrame Structure:
               vendorID
npartitions=80         
                  int32
                    ...
...                 ...
                    ...
                    ...
Dask Name: getitem, 320 tasks

@nils-braun
Copy link
Collaborator

Feel free to reopen if the problem is still there

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants