Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NO NOT MERGE before cfdm 1.9.1.0] dask: compression refactor #354

Merged

Conversation

davidhassell
Copy link
Collaborator

@davidhassell davidhassell commented Mar 10, 2022

Refactor of representation of compressed arrays, building on NCAS-CMS/cfdm#167 and NCAS-CMS/cfdm#168

Merged in lama-to-dask on 2022-03-10

@davidhassell davidhassell added the dask Relating to the use of Dask label Mar 10, 2022
@davidhassell davidhassell marked this pull request as draft March 11, 2022 08:28
@davidhassell davidhassell marked this pull request as ready for review March 11, 2022 09:05
@davidhassell
Copy link
Collaborator Author

Hi Sadie - I'll be leaving this alone, now. I was just checking on some things that I forgotten about since writing this, and improved some of the comments. Thanks.

@sadielbartholomew
Copy link
Member

Thanks David. I will try to do a final review for this PR this evening, if not by early next week.

Copy link
Member

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's always great when many more lines are dropped than added - nice job!

These are just some comments from eyeballing the code; I am still reviewing the functionality. I'll try to submit the final review later today. Thanks.

cf/data/abstract/filearray.py Outdated Show resolved Hide resolved
cf/data/abstract/filearray.py Show resolved Hide resolved
cf/data/creation.py Outdated Show resolved Hide resolved
cf/data/creation.py Show resolved Hide resolved
cf/data/creation.py Outdated Show resolved Hide resolved
cf/data/data.py Outdated Show resolved Hide resolved
cf/data/data.py Show resolved Hide resolved
cf/data/data.py Outdated Show resolved Hide resolved
cf/data/umarray.py Outdated Show resolved Hide resolved
cf/test/test_Data.py Show resolved Hide resolved
davidhassell and others added 8 commits March 21, 2022 15:03
Co-authored-by: Sadie L. Bartholomew <[email protected]>
Co-authored-by: Sadie L. Bartholomew <[email protected]>
Co-authored-by: Sadie L. Bartholomew <[email protected]>
Co-authored-by: Sadie L. Bartholomew <[email protected]>
Co-authored-by: Sadie L. Bartholomew <[email protected]>
Co-authored-by: Sadie L. Bartholomew <[email protected]>
Copy link
Member

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good, including previously-raised feedback being fully addressed, except for one possible issue, namely that I see a lone test failure due to a test netCDF file not being found. I am not sure why that file isn't being picked up (I ran python create_test_files.py before running python test_Data.py, and interactively cfdm.read("subsampled_2.nc") works fine)?

======================================================================
ERROR: test_Data__init__compression (__main__.DataTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_Data.py", line 3827, in test_Data__init__compression
    f = cfdm.read("subsampled_2.nc")[-3]
  File "/home/sadie/cfdm/cfdm/read_write/read.py", line 310, in read
    raise IOError(f"Can't read non-existent file {filename}")
OSError: Can't read non-existent file subsampled_2.nc

----------------------------------------------------------------------
Ran 85 tests in 14.292s

FAILED (errors=1, skipped=41)

@davidhassell
Copy link
Collaborator Author

Sorry! da9b13b and 378d5ba should fix this.

Copy link
Member

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding 378d5ba, I am not sure pip is happy with the beta marker, since I see:

$ pip install -e .
...
...
  Running setup.py develop for cfdm
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cf-python 4.0.0b0 requires cfdm<1.9.2.0,>=1.9.1.0, but you have cfdm 1.9.1.0b0 which is incompatible.
Successfully installed cfdm-1.9.1.0b0

(Note the lines above the final 'success' report, which indicate a compatibility issue.)

I can work around that, but when I do I am still seeing the same test failure on the missing file. Could the current merge conflicts have anything to do with it (something that needs to be in data/creation.py for this to work, say)?

@davidhassell
Copy link
Collaborator Author

Hi Sadie. Soory this has been a bit fiddly. Merge conflicts fixed. perhaps they could have affected anything ... I have just run python create_test_files.py with latest commits in, and it does create the required files.

I've changed the lower cfdm version to 1.9.0.1 for now - will that help?

@davidhassell
Copy link
Collaborator Author

With respect to the to_memory keyword to Data.__init__, I wonder if it should be called persist instead, and then we add a self.persist() call in ? E.g.

<snip Data.__init__>
         # Bring the data into memory
         if persist:
             array.persist()

         # Store the dask array
         self._set_dask(array, delete_source=False, reset_mask_hardness=False) 

         # Set the mask hardness on each chunk.
         self.hardmask = hardmask

        

@sadielbartholomew
Copy link
Member

Hi @davidhassell, sorry I lost track of this one and notably missed your latest comments (mostly because GitHub's 'Review requests' list, which I use to monitor review tasks, drops any PR one you have left a review once already, but of course I can't blame that wholly since I have known it works like that for a long time!).

I'm just about to do a final sanity check after you have added the aforementioned updates, and will leave that as a final review shortly. As for:

With respect to the to_memory keyword to Data.init, I wonder if it should be called persist instead, and then we add a self.persist() call

I really like that idea, which will keep us consistent with Dask in that respect and in turn likely align us with any Python Array API guidance on terminology for such methods. Shall we wait until after the Dask migration work is complete, though, in line with our current tentative plan to do the migration and then make a batch of appropriate API changes to comprise the final 4.0.0 version?

@davidhassell
Copy link
Collaborator Author

Hi Sadie - thanks. Since Data.__init__ does not currently have a to_memory keyword, I don't see a problem in introducing a persist one. Is that OK? I'll put in a new commit, if so.

@sadielbartholomew
Copy link
Member

Is that OK? I'll put in a new commit, if so.

Thanks, that sounds great so please go ahead. I'll wait until you push that commit to confirm the final review.

@davidhassell
Copy link
Collaborator Author

Great. Done.

Copy link
Member

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All previous feedback has been addressed well and the persist keyword has been implemented solidly. Notably, the one outstanding issue of:

I see a lone test failure due to a test netCDF file not being found
...
OSError: Can't read non-existent file subsampled_2.nc

has been fixed, so that I now get ​a full local pass of test_Data.

All good! Please merge.

@davidhassell davidhassell merged commit 1808417 into NCAS-CMS:lama-to-dask Apr 8, 2022
@davidhassell davidhassell deleted the dask-compression-subarrays branch November 15, 2022 09:18
@davidhassell davidhassell added this to the 3.14.0 milestone Nov 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dask Relating to the use of Dask
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants