Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pygmt.which: Refactor to get rid of temporary files #3148

Merged
merged 9 commits into from
Apr 7, 2024
Merged

Conversation

seisman
Copy link
Member

@seisman seisman commented Mar 29, 2024

Address #2730.

Output to a pandas.DataFrame via virtualfiles and then convert the DataFrame to a list.

@seisman seisman added maintenance Boring but important stuff for the core devs needs review This PR has higher priority and needs review. labels Mar 29, 2024
@seisman seisman added this to the 0.12.0 milestone Mar 29, 2024
@seisman seisman added final review call This PR requires final review and approval from a second reviewer and removed needs review This PR has higher priority and needs review. labels Apr 2, 2024
Comment on lines 74 to 67
if not path:
result = lib.virtualfile_to_dataset(vfname=vouttbl, output_type="pandas")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can set output_type to numpy to simplify the code below no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to think what's the most efficient way to get the output from which. It seems a bit overkill going virtualfile -> pandas -> numpy -> list[str].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's not very efficient. Instead, we can add another internal type output_type="string", which returns the vector of the trailing text only. It's useful if we know that a module only outputs text strings like which. Actually, GMT provides an API function GMT_Get_Strings which does exactly the same thing (if I understand it correctly). We can also wrap that API function instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, we can add another internal type output_type="string", which returns the vector of the trailing text only. It's useful if we know that a module only outputs text strings like which.

Done in PR #3157.

Actually, GMT provides an API function GMT_Get_Strings which does exactly the same thing (if I understand it correctly). We can also wrap that API function instead.

This API function requires the data family to be GMT_IS_VECTOR/GMT_IS_MATRIX, so it cannot be used in this case (family is GMT_IS_DATASET). xref: https://github.com/GenericMappingTools/gmt/blob/fcb795ef196714fe51f7e5e68d30a18e981294d0/src/gmt_api.c#L15785

@weiji14 weiji14 added the run/benchmark Trigger the benchmark workflow in PRs label Apr 2, 2024
Copy link

codspeed-hq bot commented Apr 2, 2024

CodSpeed Performance Report

Merging #3148 will degrade performances by 46.35%

Comparing vfile/which (4846cfb) with main (6193938)

Summary

❌ 1 regressions
✅ 90 untouched benchmarks

🆕 8 new benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark main vfile/which Change
🆕 test_accessor_set_geographic_cartesian_roundtrip N/A 936.9 µs N/A
🆕 test_binstats_no_outgrid N/A 160.9 ms N/A
🆕 test_earth_relief_holes N/A 192.9 ms N/A
🆕 test_grd2cpt N/A 231.1 ms N/A
🆕 test_grdcut_dataarray_in_dataarray_out N/A 87.8 ms N/A
🆕 test_compute_bins_no_outfile N/A 50.4 ms N/A
🆕 test_grdimage_grid_and_shading_with_xarray[png] N/A 5.9 s N/A
🆕 test_histogram[list] N/A 81.9 ms N/A
test_which_multiple 74.8 ms 139.5 ms -46.35%

@seisman seisman added run/benchmark Trigger the benchmark workflow in PRs and removed run/benchmark Trigger the benchmark workflow in PRs labels Apr 6, 2024
@seisman
Copy link
Member Author

seisman commented Apr 6, 2024

Ping @GenericMappingTools/pygmt-maintainers for final reviews. The pygmt.which function is slower after refactoring, which is likely due to the overhead of creating a GMT_DATASET container.

@seisman seisman merged commit 6069ebc into main Apr 7, 2024
20 of 22 checks passed
@seisman seisman deleted the vfile/which branch April 7, 2024 11:59
@seisman seisman removed final review call This PR requires final review and approval from a second reviewer run/benchmark Trigger the benchmark workflow in PRs labels Apr 7, 2024
@weiji14
Copy link
Member

weiji14 commented Apr 15, 2024

The cache GMT artifacts workflow is failing at https://github.com/GenericMappingTools/pygmt/actions/runs/8680209408/job/23800357490:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/runner/work/pygmt/pygmt/pygmt/helpers/caching.py", line 103, in cache_data
    which(fname=datasets, download="a")
  File "/Users/runner/work/pygmt/pygmt/pygmt/helpers/decorators.py", line 607, in new_module
    return module_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pygmt/pygmt/pygmt/helpers/decorators.py", line 780, in new_module
    return module_func(*bound.args, **bound.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pygmt/pygmt/pygmt/src/which.py", line 67, in which
    paths = lib.virtualfile_to_dataset(vfname=vouttbl, output_type="strings")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pygmt/pygmt/pygmt/clib/session.py", line 1940, in virtualfile_to_dataset
    return result.to_strings()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pygmt/pygmt/pygmt/datatypes/dataset.py", line 156, in to_strings
    return np.char.decode(textvector) if textvector else np.array([], dtype=str)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/micromamba/envs/pygmt/lib/python3.12/site-packages/numpy/core/defchararray.py", line 615, in decode
    _vec_string(a, object_, 'decode', _clean_args(encoding, errors)))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: string operation on non-string array

The error doesn't really indicate what happened, so I added a print(textvector) statement locally, and got something like this:

[None, None, None, None, None, None, None, None, None, None, None, None, None, None, b'/home/user/.gmt/server/earth/earth_relief/earth_relief_15s_p/N30W120.earth_relief_15s_p.nc']

It seems like already downloaded files might return as None?

@seisman
Copy link
Member Author

seisman commented Apr 15, 2024

It seems like already downloaded files might return as None?

It should return the path if the file is already downloaded. Not sure where None come from.

@weiji14
Copy link
Member

weiji14 commented Apr 15, 2024

It seems like already downloaded files might return as None?

It should return the path if the file is already downloaded. Not sure where None come from.

It seems to happen when I have a mix of downloaded and not downloaded files. I get None when the file is already downloaded, and the actual file path when the file hasn't been downloaded.

@seisman
Copy link
Member Author

seisman commented Apr 15, 2024

Now I can reproduce the issue locally. It doesn't always happen for a mix of downloaded/undownloaded files, for example:

>>> from pygmt import which
>>> which(["@capitals.gmt", "@circuit.png"], download="a")
['capitals.gmt', 'circuit.png']

>>> !rm ~/.gmt/capitals.gmt   # delete one file

>>> which(["@capitals.gmt", "@circuit.png"])
gmtwhich [ERROR]: File capitals.gmt not found!
'circuit.png'

>>> which(["@capitals.gmt", "@circuit.png"], download="a")
['capitals.gmt', 'circuit.png']

@seisman
Copy link
Member Author

seisman commented Apr 15, 2024

Here is a minimum example to reproduce the issue. From what I can see, the issue only occurs if we try to download a tile like @N30W120.earth_relief_15s_p.nc:

!rm -rf ~/.gmt/server/earth/earth_relief/

from pygmt import which
datasets = [
    "@earth_relief_01d_p.grd",
    "@N30W120.earth_relief_15s_p.nc",
    "@capitals.gmt",
]
which(fname=datasets, download="a")

@seisman
Copy link
Member Author

seisman commented Apr 15, 2024

Open a bug report in #3170 and add a quick workaround for "Cache data" workflow in #3171.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants