
Existing file marked as non-existing #265

Open
lmeyerov opened this issue Aug 15, 2021 · 15 comments
lmeyerov commented Aug 15, 2021

What happened:

fs.isfile(existing_file_path) incorrectly returns False and gives a warning

EDIT: Output is

False
RuntimeWarning: coroutine 'AzureBlobFileSystem._details' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

What you expected to happen:

Return True without a warning

Minimal Complete Verifiable Example:

import os

import adlfs  # noqa: F401 -- needed so fsspec can resolve the 'abfs' protocol
import fsspec

storage_options = {
    'account_name': os.environ['AZ_STORAGE_ACCOUNT_NAME'],
    'account_key': os.environ['AZ_STORAGE_ACCOUNT_KEY'],
}
az_storage_container_name = os.environ['AZ_STORAGE_CONTAINER_NAME']
fs = fsspec.filesystem('abfs', **storage_options)

base_path = f'abfs://{az_storage_container_name}/data/datasets'
existing_file_path = f'{base_path}/{dataset_id}'  # dataset_id: name of an existing blob
fs.isfile(existing_file_path)  # incorrectly returns False, with the warning above

Anything else we need to know?:

Environment:

fsspec '2021.07.0' (conda)
adlfs '2021.08.1' (pip, no conda yet)
docker / ubuntu 18.04 / python 3.7

lmeyerov commented Aug 15, 2021

@hayesgb Digging a bit more: switching to asynchronous=True ... await fs._isfile(existing_file_path) does not work around the issue; the warning still triggers and the wrong result is still returned.
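For concreteness, a minimal sketch of the async variant being described (the function name, credentials, and path here are placeholders, not code from this thread):

```python
import asyncio

import fsspec  # adlfs must be installed so the 'abfs' protocol resolves

async def check(path: str) -> bool:
    # asynchronous=True tells fsspec the filesystem will be driven from a
    # running event loop; _isfile is the coroutine counterpart of isfile.
    fs = fsspec.filesystem(
        "abfs",
        asynchronous=True,
        account_name="<account>",  # placeholder credentials
        account_key="<key>",
    )
    return await fs._isfile(path)

# asyncio.run(check("abfs://<container>/data/datasets/<dataset>"))
```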

lmeyerov commented Aug 15, 2021

@hayesgb (Continuing from #261)

Just tried from head:

  • isfile() on existing files quickly but incorrectly returns False; no async warning anymore
  • isdir() on existing dirs slowly but correctly returns True; I suspect it is downloading the folders
  • isfile() on non-existing paths quickly and correctly returns False
  • isdir() on non-existing paths quickly and correctly returns False

lmeyerov commented Aug 15, 2021

Also if it helps, my paths look like:

abfs://somecontainer/mydata/mydata2/myfile

hayesgb commented Aug 16, 2021 via email

@lmeyerov

AttributeError: 'AzureBlobFileSystem' object has no attribute 'details'

lmeyerov commented Aug 16, 2021

FYI, having more luck with variants of:

from azure.storage.blob.aio import BlobServiceClient

async def aexists_dir(path):
    # conn_str / az_storage_container_name come from the surrounding config
    blob_service_client = BlobServiceClient.from_connection_string(conn_str)
    async with blob_service_client:
        container_client = blob_service_client.get_container_client(az_storage_container_name)
        async for myblob in container_client.list_blobs(name_starts_with=path):
            # the prefix is a "directory" when the first match is a child, not the path itself
            return myblob['name'] != path
    return False

hayesgb commented Aug 16, 2021

Thanks. I may end up updating to this. I asked about details earlier -- could you post the result of fs.info(path)? I'm trying to create a test case for this.

@lmeyerov

{
  "metadata": None,
  "creation_time": datetime.datetime(2020, 9, 29, 0, 16, 6, tzinfo=datetime.timezone.utc),
  "deleted": None,
  "deleted_time": None,
  "last_modified": datetime.datetime(2021, 8, 13, 15, 35, 35, tzinfo=datetime.timezone.utc),
  "content_settings": {
    "content_type": "application/x-gzip",
    "content_encoding": None,
    "content_language": None,
    "content_md5": bytearray(b"*****"),
    "content_disposition": None,
    "cache_control": None
  },
  "remaining_retention_days": None,
  "archive_status": None,
  "last_accessed_on": None,
  "etag": "*****",
  "tags": None,
  "tag_count": None,
  "name": "mycontainer/myfolder/myfile",
  "size": 4332,
  "type": "file"
}

hayesgb commented Aug 18, 2021

Thanks for the help here @lmeyerov Release 2021.08.2 should fix the errors with isfile.

hayesgb commented Aug 18, 2021

Can you share an example of the slowly downloading isdir? This does call cc.list_blobs. Are there a very large number of blobs in the location you're scanning?

@lmeyerov

Yes -- it's a potentially big folder (named parquet dumps); in this case I wouldn't be surprised by 1K-10K files. I think the async list_blobs paginates, though I'm unsure how to ensure the pages stay reasonably small. That's part of the reason we're trying to do everything with asyncio via adlfs: ensuring even occasional blips will not starve out other tasks.

hayesgb commented Aug 18, 2021

@lmeyerov -- I just refactored _isdir on the accel_isdir branch. It passes all the tests, and completely eliminates the list_blobs call. Would appreciate your feedback if you have a chance to check it out.

@lmeyerov

Sure -- will check on Th/F (am traveling).

At the same time, if anything is happening around async multi-connection downloads of individual + folder blobs, happy to check there too. Currently investigating how to do that via Azure's SDK, but we'd rather have it unified under fsspec!

hayesgb commented Aug 18, 2021

Cool. Just curious -- on the multi-connection downloads -- are you looking to use Dask or is the use case async multithreading?

@lmeyerov

  1. Currently single-node / multicore. Our Azure GPU VMs have something like 2-8 NICs at 8-32 Gbps, and I think AWS/GCP end up similar, so we're focusing on saturating abfs => SSD writes with that. Multi-node may be interesting early next year, but we're not there yet :)

RE: async multithreading, the Azure SDK has parallel connection support with a configurable number of streams, which seems like a fine first step.

  2. Our other common use case is reading directly via dask_cudf.read_parquet, which may have some funny NUMA behavior to consider for remote reads, but I'm not sure yet. Local reads go via GPU Direct Storage, and I believe there may be network extensions for GPU Direct as well.
