
Existing file marked as non-existing #265

Open
lmeyerov opened this issue Aug 15, 2021 · 15 comments
lmeyerov commented Aug 15, 2021

What happened:

fs.isfile(existing_file_path) incorrectly returns False and gives a warning

EDIT: Output is

False
RuntimeWarning: coroutine 'AzureBlobFileSystem._details' was never awaited
RuntimeWarning: Enable tracemalloc to get the object allocation traceback

What you expected to happen:

Return True without a warning

Minimal Complete Verifiable Example:

import os

import adlfs  # noqa: F401 -- needed so fsspec can resolve the 'abfs' protocol
import fsspec

storage_options = {
    'account_name': os.environ['AZ_STORAGE_ACCOUNT_NAME'],
    'account_key': os.environ['AZ_STORAGE_ACCOUNT_KEY'],
}
az_storage_container_name = os.environ['AZ_STORAGE_CONTAINER_NAME']
fs = fsspec.filesystem('abfs', **storage_options)

base_path = f'abfs://{az_storage_container_name}/data/datasets'
existing_file_path = f'{base_path}/{dataset_id}'  # dataset_id: name of an existing blob
fs.isfile(existing_file_path)  # incorrectly returns False, with the warning above

Anything else we need to know?:

Environment:

fsspec '2021.07.0' (conda)
adlfs '2021.08.1' (pip, no conda yet)
docker / ubuntu 18.04 / python 3.7

lmeyerov commented Aug 15, 2021

@hayesgb Digging a bit more: switching to asynchronous=True ... await fs._isfile(existing_file_path) does not work around the issue; the warning still triggers and the wrong result is still returned.
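For concreteness, a minimal sketch of the async variant being described (the function name, credentials, and path here are placeholders, not code from this thread):

```python
import asyncio

import fsspec  # adlfs must be installed so the 'abfs' protocol resolves

async def check(path: str) -> bool:
    # asynchronous=True tells fsspec the filesystem will be driven from a
    # running event loop; _isfile is the coroutine counterpart of isfile.
    fs = fsspec.filesystem(
        "abfs",
        asynchronous=True,
        account_name="<account>",  # placeholder credentials
        account_key="<key>",
    )
    return await fs._isfile(path)

# asyncio.run(check("abfs://<container>/data/datasets/<dataset>"))
```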

lmeyerov commented Aug 15, 2021

@hayesgb (Continuing from #261)

Just tried from head:

  • isfile() on existing files quickly but incorrectly returns False; no async warning anymore
  • isdir() on existing dirs slowly but correctly returns True; I suspect it is downloading the folders
  • isfile() on non-existing paths quickly and correctly returns False
  • isdir() on non-existing paths quickly and correctly returns False

lmeyerov commented Aug 15, 2021

Also if it helps, my paths look like:

abfs://somecontainer/mydata/mydata2/myfile

hayesgb commented Aug 16, 2021 via email

@lmeyerov

AttributeError: 'AzureBlobFileSystem' object has no attribute 'details'

lmeyerov commented Aug 16, 2021

FYI, having more luck with variants of:

from azure.storage.blob.aio import BlobServiceClient

async def aexists_dir(path):
    # conn_str / az_storage_container_name come from the surrounding config
    blob_service_client = BlobServiceClient.from_connection_string(conn_str)
    async with blob_service_client:
        container_client = blob_service_client.get_container_client(az_storage_container_name)
        async for myblob in container_client.list_blobs(name_starts_with=path):
            # the prefix is a "directory" when the first match is a child, not the path itself
            return myblob['name'] != path
    return False

hayesgb commented Aug 16, 2021

Thanks. I may end up updating to this. I asked about details earlier -- could you post the result of fs.info(path)? I'm trying to create a test case for this.

@lmeyerov

{
  "metadata": None,
  "creation_time": datetime.datetime(2020, 9, 29, 0, 16, 6, tzinfo=datetime.timezone.utc),
  "deleted": None,
  "deleted_time": None,
  "last_modified": datetime.datetime(2021, 8, 13, 15, 35, 35, tzinfo=datetime.timezone.utc),
  "content_settings": {
    "content_type": "application/x-gzip",
    "content_encoding": None,
    "content_language": None,
    "content_md5": bytearray(b"*****"),
    "content_disposition": None,
    "cache_control": None
  },
  "remaining_retention_days": None,
  "archive_status": None,
  "last_accessed_on": None,
  "etag": "*****",
  "tags": None,
  "tag_count": None,
  "name": "mycontainer/myfolder/myfile",
  "size": 4332,
  "type": "file"
}

hayesgb commented Aug 18, 2021

Thanks for the help here @lmeyerov Release 2021.08.2 should fix the errors with isfile.

hayesgb commented Aug 18, 2021

Can you share an example of the slowly downloading isdir? This does call cc.list_blobs. Are there a very large number of blobs in the location you're scanning?

@lmeyerov

Yes -- it's a potentially big folder (named parquet dumps); in this case I wouldn't be surprised by 1K-10K files. I think the async list_blobs paginates, though I'm unsure how to ensure the pages stay reasonably small. That's part of the reason we're trying to do everything with asyncio via adlfs: ensuring even occasional blips will not starve out other tasks.

hayesgb commented Aug 18, 2021

@lmeyerov -- I just refactored _isdir on the accel_isdir branch. It passes all the tests, and completely eliminates the list_blobs call. Would appreciate your feedback if you have a chance to check it out.

@lmeyerov

Sure -- will check on Th/F (am traveling).

At the same time, if anything is happening around async multi-connection downloads of individual + folder blobs, happy to check there too. Currently investigating how to do that via Azure's SDK, but we'd rather have it unified under fsspec!

hayesgb commented Aug 18, 2021

Cool. Just curious -- on the multi-connection downloads -- are you looking to use Dask or is the use case async multithreading?

@lmeyerov

  1. Currently single-node / multicore. Our Azure GPU VMs have something like 2-8 NICs at 8-32 Gbps, and I think AWS/GCP end up similar, so we're focusing on saturating abfs => SSD writes with that. Multi-node may be interesting early next year, but we're not there yet :)

RE: async multithreading, the Azure SDK has parallel connection support with a configurable number of streams, which seems like a fine first step.

  2. Our other common use case is reading directly via dask_cudf.read_parquet, which may have some funny NUMA behavior to consider for remote reads, but I'm not sure yet. Local reads go via GPU Direct Storage, and I believe there may be network extensions for GPU Direct as well.
