Improve the performance of reading and accessing the data of PP and UM fields files #746

Closed
davidhassell opened this issue Mar 23, 2024 · 1 comment · Fixed by #747
Labels
bug Something isn't working performance Relating to speed and memory performance

davidhassell commented Mar 23, 2024

Some aspects of accessing PP and UM fields file data have been very slow for quite a while. I had previously always assumed that this was a cf.aggregation issue, which it sometimes very much was! ... but I think aggregation now performs pretty well.

@theabro kindly raised a case of reading a CF field from a 16 GB PP file. The CF Field itself comprised 2040 (= 24 x 85) 2-d PP fields:

>>> print(f)
<CF Field: id%UM_m01s50i500_vn1300(time(24), atmosphere_hybrid_height_coordinate(85), latitude(144), longitude(192))>

Accessing the full data array with a = f.array is taking ~11,000 seconds - far too long!

Investigations showed that the reason for this was that the whole PP file was being parsed (i.e. all headers read and processed) for every 2-d PP field that contributes to the array, i.e. 2040 times in this case.

Stopping this repeated parsing reduces the time taken to get the full array, on the same machine, to ~2 seconds (!). The entire 16 GB file can be read from disk in ~3.5 minutes.
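To illustrate the pattern, here is a minimal toy sketch, not the actual cf-python code. The "file" is just a hypothetical in-memory list of (header, data) records, and all of the names (`TOY_FILE`, `parse_all_headers`, `read_field_slow`, `read_field_fast`, `count_parses`) are made up for this example. It compares re-parsing every header for each 2-d field with parsing the headers once and reusing the result:

```python
from functools import lru_cache

# Hypothetical stand-in for a PP file: a list of (header, data) records.
N_HEADERS = 1000
TOY_FILE = [({"index": i}, [float(i)] * 4) for i in range(N_HEADERS)]

parse_count = 0  # running total of individual headers processed


def parse_all_headers():
    """Process every header in the toy file (the expensive step)."""
    global parse_count
    headers = []
    for header, _ in TOY_FILE:
        parse_count += 1
        headers.append(header)
    return headers


def read_field_slow(i):
    """Read one 2-d field, re-parsing the whole file every time."""
    headers = parse_all_headers()
    return TOY_FILE[headers[i]["index"]][1]


@lru_cache(maxsize=None)
def cached_headers():
    """Parse the headers on the first call only; later calls reuse them."""
    return tuple(parse_all_headers())


def read_field_fast(i):
    """Read one 2-d field using the already-parsed headers."""
    headers = cached_headers()
    return TOY_FILE[headers[i]["index"]][1]


def count_parses(read_field, n_fields):
    """Read n_fields fields and return (headers processed, data read)."""
    global parse_count
    parse_count = 0
    cached_headers.cache_clear()
    data = [read_field(i) for i in range(n_fields)]
    return parse_count, data
```

Reading 20 fields from this 1000-header toy processes 20,000 headers with `read_field_slow` but only 1,000 with `read_field_fast`, and both return the same data. The real fix may well be structured differently; the point is only that the header work stops scaling with the number of fields read.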

The size of the file per se is not the cause of the problem, but rather the large number of individual lookup headers in the file: 162,888 in this case. For my small test cases with fewer than 5 PP fields, the slowdown is invisible :(
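A rough back-of-envelope check of these numbers, using only the figures quoted above. The per-header cost is an estimate that assumes essentially all of the ~11,000 seconds went on header processing, which was not measured separately:

```python
# Figures quoted in this issue (timings are approximate).
n_fields = 24 * 85        # 2-d PP fields contributing to the CF Field
n_headers = 162_888       # lookup headers in the 16 GB PP file
t_slow = 11_000.0         # seconds for f.array before the fix
t_fast = 2.0              # seconds for f.array after the fix

# Every field access triggered a parse of every header in the file.
header_parses = n_fields * n_headers

# Speed-up for this particular case.
speedup = t_slow / t_fast

# ASSUMPTION: all of t_slow was spent processing headers.
seconds_per_header = t_slow / header_parses

print(f"{header_parses:,} header parses")
print(f"speed-up of about x{speedup:.0f}")
print(f"roughly {seconds_per_header * 1e6:.0f} microseconds per header")
```

So about 332 million header parses at a few tens of microseconds each is enough to account for the whole wait, which is also why a file with only a handful of headers shows no slowdown at all.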

Long overdue PR to follow.

@davidhassell davidhassell added bug Something isn't working performance Relating to speed and memory performance labels Mar 23, 2024
@davidhassell davidhassell added this to the Next release milestone Mar 23, 2024
@JonathanGregory

Well done - that's magnificent, @davidhassell! David reports a speedup of x700 in reading the data from one of my PP directories. I am looking forward to it in the next release.
