Improve the performance of reading and accessing the data of PP and UM fields files #746

davidhassell · 2024-03-23T19:32:44Z

Aspects of accessing PP and UM fields file data have sometimes been very slow, for quite a while. I had previously always assumed that this was a cf.aggregation issue, which it sometimes very much was, but I think aggregation now performs pretty well.

@theabro kindly raised a case of reading a CF field from a 16 GB PP file. The CF Field itself comprised 2040 (= 24 x 85) 2-d PP fields:

>>> print(f)
<CF Field: id%UM_m01s50i500_vn1300(time(24), atmosphere_hybrid_height_coordinate(85), latitude(144), longitude(192))>

Accessing the full data array with a = f.array is taking ~11,000 seconds - far too long!

Investigations showed that the reason for this was that the whole PP file was being parsed (i.e. all headers read and processed) for every 2-d PP field that contributes to the array, i.e. 2040 times in this case.
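To illustrate the pattern described above, here is a minimal toy sketch. It is not the cf-python implementation: `parse_headers`, `read_field_slow`, `read_field_fast` and the file name are all hypothetical, and using a cache is just one assumed way of avoiding the repeated parse. It only counts how often the "file" gets parsed when each 2-d field is requested in turn.

```python
from functools import lru_cache

PARSE_CALLS = 0  # how many times the (hypothetical) file has been parsed


def parse_headers(path, n_headers=1000):
    """Stand-in for reading and processing every lookup header in a file."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    # Each header is reduced to a made-up (offset, length) of its data record
    return tuple((i * 100, 100) for i in range(n_headers))


def read_field_slow(path, index):
    """Re-parse all headers each time a single 2-d field is requested."""
    return parse_headers(path)[index]


@lru_cache(maxsize=None)
def cached_headers(path):
    """Parse the headers once per path and reuse the result afterwards."""
    return parse_headers(path)


def read_field_fast(path, index):
    """Look up a single 2-d field using the already-parsed headers."""
    return cached_headers(path)[index]


def count_parses(reader, n_fields):
    """Read n_fields fields with `reader`; return (parse count, results)."""
    global PARSE_CALLS
    PARSE_CALLS = 0
    cached_headers.cache_clear()
    results = [reader("example.pp", i) for i in range(n_fields)]
    return PARSE_CALLS, results


slow_count, slow_results = count_parses(read_field_slow, 20)
fast_count, fast_results = count_parses(read_field_fast, 20)
print(slow_count, fast_count)
```

In the slow variant the number of full parses grows with the number of 2-d fields, which is the behaviour reported for the 2040-field case; in the cached variant it stays at one regardless.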

Stopping this parsing reduces the time taken to get the full array, on the same machine, to ~2 seconds (!). The entire 16 GB can be read from disk in ~3.5 minutes.

The size of the file per se is not the cause of the problem, but rather the large number of individual lookup headers in the file: 162,888 in this case. For my small test cases with fewer than 5 PP fields, the slowdown is invisible :(
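A rough back-of-the-envelope check of the figures quoted in this issue, assuming (as described above) that every one of the 2-d fields triggered a pass over every lookup header. The timings are the approximate ones reported here, so the derived numbers are indicative only.

```python
n_fields = 24 * 85         # 2-d PP fields contributing to the CF field
n_headers = 162_888        # lookup headers in the 16 GB file
total_array_time = 11_000  # approximate seconds reported for a = f.array

# Total header visits if the whole file is parsed once per 2-d field
header_visits = n_fields * n_headers

# Average time spent per 2-d field, dominated by the repeated parsing
seconds_per_field = total_array_time / n_fields

print(n_fields)                     # 2040
print(header_visits)                # 332291520
print(round(seconds_per_field, 1))  # 5.4
```

With fewer than 5 PP fields in a small test file the same product is tiny, which is consistent with the slowdown going unnoticed there.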

Long overdue PR to follow.


JonathanGregory · 2024-03-27T13:44:54Z

Well done - that's magnificent, @davidhassell! David reports a speedup of x700 in reading the data from one of my PP directories. I am looking forward to it in the next release.

davidhassell added the bug and performance labels Mar 23, 2024

davidhassell added this to the Next release milestone Mar 23, 2024

davidhassell mentioned this issue Mar 23, 2024

Improve the performance of reading and accessing the data of PP and UM fields files #747

Merged

davidhassell closed this as completed in #747 Mar 26, 2024
