
Append write operation ends in AttributeError #249

Open · anders-kiaer opened this issue Jun 15, 2021 · 2 comments

Comments

anders-kiaer (Contributor) commented Jun 15, 2021

import pandas as pd
import numpy as np
import fastparquet
from adlfs import AzureBlobFileSystem

CONTAINER_NAME = ...
BLOB_NAME = f"{CONTAINER_NAME}/some.parquet"
ACCOUNT_NAME = ...
ACCOUNT_KEY = ...

fs = AzureBlobFileSystem(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY, container_name=CONTAINER_NAME)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))

try:
    # Try appending to an already existing file first...
    fastparquet.write(BLOB_NAME, df, open_with=fs.open, append=True)
except FileNotFoundError:
    # ...the file does not already exist, so create it from scratch.
    fastparquet.write(BLOB_NAME, df, open_with=fs.open)

What happened:

Running the script for the first time (i.e. when the file does not already exist), it completes without problems. On the next run, when the file is to be appended, it fails:

Traceback (most recent call last):
  File "test_append.py", line 15, in <module>
    fastparquet.write(BLOB_NAME, df, open_with=fs.open, append=True)
  File "[...]/python3.8/site-packages/fastparquet/writer.py", line 873, in write
    write_simple(filename, data, fmd, row_group_offsets,
  File "[...]/python3.8/site-packages/fastparquet/writer.py", line 705, in write_simple
    with open_with(fn, mode) as f:
  File "[...]/python3.8/site-packages/fsspec/spec.py", line 962, in open
    f = self._open(
  File "[...]/python3.8/site-packages/adlfs/spec.py", line 1607, in _open
    return AzureBlobFile(
  File "[...]/python3.8/site-packages/adlfs/spec.py", line 1720, in __init__
    raise NotImplementedError("File mode not supported")
NotImplementedError: File mode not supported
Exception ignored in: <function AzureBlobFile.__del__ at 0x7f4156c04310>
Traceback (most recent call last):
  File "[...]/python3.8/site-packages/adlfs/spec.py", line 1909, in __del__
    self.close()
  File "[...]/python3.8/site-packages/adlfs/spec.py", line 1745, in close
    super().close()
  File "[...]/python3.8/site-packages/fsspec/spec.py", line 1554, in close
    if not self.forced:
AttributeError: 'AzureBlobFile' object has no attribute 'forced'

What you expected to happen:

One of two options:

  1. A successful write operation. (Is supporting append with fastparquet in adlfs fundamentally challenging, e.g. because the options for appending to/modifying blobs are limited, or is it more a prioritization/not-yet-implemented question?)
  2. Alternatively, a better (final) error message. There is a clue higher up in the traceback stating NotImplementedError, but for some reason that exception is "ignored" and the script instead ends with an AttributeError. 🤔

As far as I can see, in append mode the file mode tried behind the scenes is rb+ when the AttributeError occurs.
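If I read the traceback correctly, the masking looks like the classic pitfall of __del__ running on a half-initialized object. A minimal sketch of the mechanism (hypothetical class, not the actual adlfs code):

# Hypothetical minimal class, not adlfs code: __init__ raises before the
# attribute is ever set, but __del__ still runs on the half-built object
# and trips over the missing attribute, so Python reports
# "Exception ignored in: __del__" with an AttributeError.
class File:
    def __init__(self, mode):
        if mode not in ("rb", "wb"):
            raise NotImplementedError("File mode not supported")
        self.forced = False  # never reached for unsupported modes

    def __del__(self):
        if not self.forced:  # AttributeError on a half-initialized instance
            pass

File("rb+")  # NotImplementedError, then the "ignored" AttributeError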

Environment:

  • Python version: 3.8
  • Operating System: Ubuntu
  • Install method (conda, pip, source): pip
hayesgb (Collaborator) commented Jul 17, 2021

The package's default behavior follows Azure's default, which creates a BlockBlob, as described here. Block blobs do not accept an append operation. There is limited ability to create an append blob, but it hasn't been tried with this use case.
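For context, a rough sketch of what an append blob looks like through the azure-storage-blob v12 SDK directly (credentials are placeholders; again, untested with this use case, and note that per the traceback fastparquet's append wants rb+ to rewrite the existing footer, which an append blob would not allow anyway):

# Rough sketch: creating/appending to an append blob with the
# azure-storage-blob v12 SDK directly (credentials are placeholders).
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.blob.core.windows.net",
    credential=ACCOUNT_KEY,
)
blob = service.get_blob_client(container=CONTAINER_NAME, blob="events.log")
blob.create_append_blob()             # blob type AppendBlob, initially empty
blob.append_block(b"first chunk\n")   # blocks can only be added at the end
blob.append_block(b"second chunk\n")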

From a roadmap perspective, and thinking about how the package gets used by Dask, I'd really like to understand your use case. Currently, the approach I generally see is to incrementally add new files rather than appending to an existing one, as in the sketch below. That said, the ability to append to an existing file, or to a collection of files when updating incrementally, is definitely needed.
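Sketch of the incremental-files pattern, reusing fs, df and CONTAINER_NAME from your example (the dataset/ prefix and part naming are just illustrative):

# Each batch becomes its own parquet file under a common prefix.
import uuid

part = f"{CONTAINER_NAME}/dataset/part-{uuid.uuid4().hex}.parquet"
fastparquet.write(part, df, open_with=fs.open)

# All parts under the prefix can later be read back as one dataset.
paths = fs.glob(f"{CONTAINER_NAME}/dataset/part-*.parquet")
combined = fastparquet.ParquetFile(paths, open_with=fs.open).to_pandas()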

@martindurant -- can you comment on how s3fs and gcsfs handle this?

martindurant (Member) commented

gcsfs does not support append.

In s3fs, append means: "make a new file whose first block(s) are the contents of the file that previously had the same name". This is possible if the original file is >5MB; if not, append works by downloading the contents of the previous file and starting a new file from scratch.
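In user-facing terms, that looks like an ordinary "ab" open (sketch; bucket/key are placeholders):

# Sketch of append through s3fs: "ab" starts a fresh multipart upload,
# seeded with the existing object as its first part(s) when it is >5MB,
# otherwise the old contents are downloaded and re-uploaded.
import s3fs

s3 = s3fs.S3FileSystem()
with s3.open("my-bucket/some.log", "ab") as f:  # placeholders
    f.write(b"appended line\n")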
