pygmt.which: Refactor to get rid of temporary files #3148

seisman · 2024-03-29T05:57:13Z

Address #2730.

Output to a pandas.DataFrame via virtualfiles and then convert the DataFrame to a list.
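For readers unfamiliar with the motivation, here is a minimal, GMT-free sketch of the difference between a temporary-file round trip and capturing output in memory. `fake_module` and both helper functions are hypothetical stand-ins for illustration only; they are not PyGMT code and do not use GMT virtual files.

```python
import io
import os
import tempfile


def fake_module(names, out):
    """Hypothetical stand-in for a module that writes one path per line."""
    for name in names:
        out.write(f"/cache/{name}\n")


def paths_via_tempfile(names):
    """Old style: write the output to a temporary file, then read it back."""
    fd, tmpname = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "w") as out:
            fake_module(names, out)
        with open(tmpname) as infile:
            return infile.read().splitlines()
    finally:
        # The temporary file must always be cleaned up afterwards.
        os.remove(tmpname)


def paths_in_memory(names):
    """Refactored style: capture the output in memory, nothing on disk."""
    buffer = io.StringIO()
    fake_module(names, buffer)
    return buffer.getvalue().splitlines()
```

Both helpers return the same list; the in-memory variant simply has no file to create or delete.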

weiji14 · 2024-04-02T19:59:49Z

pygmt/src/which.py

- if not path:
+ result = lib.virtualfile_to_dataset(vfname=vouttbl, output_type="pandas")


We can set output_type to numpy to simplify the code below, no?

Trying to think what's the most efficient way to get the output from which. It seems a bit overkill going virtualfile -> pandas -> numpy -> list[str].

Yes, it's not very efficient. Instead, we can add another internal type output_type="string", which returns the vector of the trailing text only. It's useful if we know that a module only outputs text strings like which. Actually, GMT provides an API function GMT_Get_Strings which does exactly the same thing (if I understand it correctly). We can also wrap that API function instead.

> Instead, we can add another internal type output_type="string", which returns the vector of the trailing text only. It's useful if we know that a module only outputs text strings like which.

Done in PR #3157.
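To illustrate the idea of a text-only output type, here is a toy sketch. `ToyRecord`, `ToyDataset`, and `convert` are hypothetical names invented for this example, and the `"table"` type is likewise made up; none of this is the implementation from #3157.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRecord:
    """One output row: optional numeric columns plus trailing text."""

    numbers: list = field(default_factory=list)
    text: str = ""


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset container (not PyGMT's class)."""

    records: list

    def to_table(self):
        # Full conversion: every row keeps its numeric columns and its text.
        return [[*rec.numbers, rec.text] for rec in self.records]

    def to_strings(self):
        # Shortcut for text-only modules: keep only the trailing text.
        return [rec.text for rec in self.records]


def convert(dataset, output_type):
    """Dispatch on output_type, following the idea discussed above."""
    if output_type == "strings":
        return dataset.to_strings()
    if output_type == "table":
        return dataset.to_table()
    raise ValueError(f"Unsupported output_type: {output_type!r}")
```

For a module that only emits text, `convert(dataset, "strings")` skips the intermediate table entirely, which is the saving the thread is after.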

> Actually, GMT provides an API function GMT_Get_Strings which does exactly the same thing (if I understand it correctly). We can also wrap that API function instead.

This API function requires the data family to be GMT_IS_VECTOR/GMT_IS_MATRIX, so it cannot be used in this case (family is GMT_IS_DATASET). xref: https://github.com/GenericMappingTools/gmt/blob/fcb795ef196714fe51f7e5e68d30a18e981294d0/src/gmt_api.c#L15785

codspeed-hq · 2024-04-02T20:17:50Z

CodSpeed Performance Report

Merging #3148 will degrade performances by 46.35%

Comparing vfile/which (4846cfb) with main (6193938)

Summary

❌ 1 regressions
✅ 90 untouched benchmarks

🆕 8 new benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| | Benchmark | `main` | `vfile/which` | Change |
|---|---|---|---|---|
| 🆕 | `test_accessor_set_geographic_cartesian_roundtrip` | N/A | 936.9 µs | N/A |
| 🆕 | `test_binstats_no_outgrid` | N/A | 160.9 ms | N/A |
| 🆕 | `test_earth_relief_holes` | N/A | 192.9 ms | N/A |
| 🆕 | `test_grd2cpt` | N/A | 231.1 ms | N/A |
| 🆕 | `test_grdcut_dataarray_in_dataarray_out` | N/A | 87.8 ms | N/A |
| 🆕 | `test_compute_bins_no_outfile` | N/A | 50.4 ms | N/A |
| 🆕 | `test_grdimage_grid_and_shading_with_xarray[png]` | N/A | 5.9 s | N/A |
| 🆕 | `test_histogram[list]` | N/A | 81.9 ms | N/A |
| ❌ | `test_which_multiple` | 74.8 ms | 139.5 ms | -46.35% |
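As a sanity check on the regression figure, the arithmetic below recomputes the change from the rounded timings in the report. The formula (difference relative to the new timing) is an assumption that happens to match the reported number closely; it is not taken from CodSpeed's documentation.

```python
# Rounded timings for test_which_multiple, copied from the report.
main_ms = 74.8      # on main
branch_ms = 139.5   # on vfile/which

# How many times longer the benchmark takes on the branch.
slowdown = branch_ms / main_ms

# Assumed formula: change relative to the new (slower) timing.
change_pct = (main_ms - branch_ms) / branch_ms * 100

print(f"slowdown factor: {slowdown:.2f}")
print(f"change: {change_pct:.2f}%")
```

The result lands within a few hundredths of a percentage point of the reported -46.35%; the small gap is consistent with the timings being rounded to one decimal place.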

seisman · 2024-04-06T11:18:21Z

Ping @GenericMappingTools/pygmt-maintainers for final reviews. The pygmt.which function is slower after refactoring, which is likely due to the overhead of creating a GMT_DATASET container.

weiji14 · 2024-04-15T03:50:13Z

The cache GMT artifacts workflow is failing at https://github.com/GenericMappingTools/pygmt/actions/runs/8680209408/job/23800357490:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/runner/work/pygmt/pygmt/pygmt/helpers/caching.py", line 103, in cache_data
    which(fname=datasets, download="a")
  File "/Users/runner/work/pygmt/pygmt/pygmt/helpers/decorators.py", line 607, in new_module
    return module_func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pygmt/pygmt/pygmt/helpers/decorators.py", line 780, in new_module
    return module_func(*bound.args, **bound.kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pygmt/pygmt/pygmt/src/which.py", line 67, in which
    paths = lib.virtualfile_to_dataset(vfname=vouttbl, output_type="strings")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pygmt/pygmt/pygmt/clib/session.py", line 1940, in virtualfile_to_dataset
    return result.to_strings()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/work/pygmt/pygmt/pygmt/datatypes/dataset.py", line 156, in to_strings
    return np.char.decode(textvector) if textvector else np.array([], dtype=str)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/runner/micromamba/envs/pygmt/lib/python3.12/site-packages/numpy/core/defchararray.py", line 615, in decode
    _vec_string(a, object_, 'decode', _clean_args(encoding, errors)))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: string operation on non-string array

The error doesn't really indicate what happened, so I added a print(textvector) statement locally, and got something like this:

[None, None, None, None, None, None, None, None, None, None, None, None, None, None, b'/home/user/.gmt/server/earth/earth_relief/earth_relief_15s_p/N30W120.earth_relief_15s_p.nc']

It seems like already downloaded files might return as None?
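The traceback is consistent with this: a non-empty list is truthy even when most entries are None, so the `if textvector` guard passes, and np.char.decode then presumably receives an object array rather than a bytes array. A minimal sketch of None-tolerant decoding follows; `decode_texts` and `find_missing` are hypothetical helpers, and this is not the fix or workaround adopted in the PyGMT repository.

```python
def decode_texts(textvector, missing=""):
    """Decode byte strings to str, replacing None entries with `missing`."""
    return [
        missing if item is None else item.decode("utf-8")
        for item in textvector
    ]


def find_missing(textvector):
    """Return the positions of None entries, handy when debugging."""
    return [index for index, item in enumerate(textvector) if item is None]


# Shape of the data seen in the failing run (path shortened for the example).
textvector = [None, None, b"N30W120.earth_relief_15s_p.nc"]
print(decode_texts(textvector))
print(find_missing(textvector))
```

Replacing None with a placeholder keeps the output aligned with the input file list; silently dropping the entries would hide which files came back empty.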

seisman · 2024-04-15T04:07:59Z

> It seems like already downloaded files might return as None?

It should return the path if the file is already downloaded. Not sure where the None comes from.

weiji14 · 2024-04-15T04:16:48Z

> > It seems like already downloaded files might return as None?
>
> It should return the path if the file is already downloaded. Not sure where the None comes from.

It seems to happen when I have a mix of downloaded and not downloaded files. I get None when the file is already downloaded, and the actual file path when the file hasn't been downloaded.

seisman · 2024-04-15T04:47:21Z

Now I can reproduce the issue locally. It doesn't always happen for a mix of downloaded/undownloaded files, for example:

>>> from pygmt import which
>>> which(["@capitals.gmt", "@circuit.png"], download="a")
['capitals.gmt', 'circuit.png']

>>> !rm ~/.gmt/capitals.gmt   # delete one file

>>> which(["@capitals.gmt", "@circuit.png"])
gmtwhich [ERROR]: File capitals.gmt not found!
'circuit.png'

>>> which(["@capitals.gmt", "@circuit.png"], download="a")
['capitals.gmt', 'circuit.png']

seisman · 2024-04-15T05:04:46Z

Here is a minimal example to reproduce the issue. From what I can see, the issue only occurs if we try to download a tile like @N30W120.earth_relief_15s_p.nc:

!rm -rf ~/.gmt/server/earth/earth_relief/

from pygmt import which
datasets = [
    "@earth_relief_01d_p.grd",
    "@N30W120.earth_relief_15s_p.nc",
    "@capitals.gmt",
]
which(fname=datasets, download="a")

seisman · 2024-04-15T09:50:45Z

Opened a bug report in #3170 and added a quick workaround for the "Cache data" workflow in #3171.

pygmt.which: Refactor to get rid of temporary files

6d74016

seisman added the "maintenance" (Boring but important stuff for the core devs) and "needs review" (This PR has higher priority and needs review.) labels Mar 29, 2024

seisman mentioned this pull request Mar 29, 2024

Get rid of temporary files from pygmt functions and plotting methods #2730

Closed

31 tasks

seisman added this to the 0.12.0 milestone Mar 29, 2024

michaelgrund approved these changes Apr 2, 2024

seisman added the "final review call" (This PR requires final review and approval from a second reviewer) label and removed the "needs review" (This PR has higher priority and needs review.) label Apr 2, 2024

weiji14 reviewed Apr 2, 2024

weiji14 added the "run/benchmark" (Trigger the benchmark workflow in PRs) label Apr 2, 2024

Merge branch 'main' into vfile/which

f60713a

seisman mentioned this pull request Apr 3, 2024

Session.virtualfile_to_dataset: Add 'strings' output type for the array of trailing texts #3157

Merged

seisman added 7 commits April 3, 2024 10:23

Session.virtualfile_to_dataset: Add 'strings' output type for the arr…

2442fda

…ay of trailing texts

Apply suggestions from code review

bafe0f3

Merge branch 'dataset/to_strings' into vfile/which

ccf2dae

Refactor which by setting output_type to strings

b90d655

Merge branch 'main' into vfile/which

d5db7b9

Merge branch 'main' into vfile/which

cbe15ca

Change result to paths

4846cfb

seisman added the "run/benchmark" (Trigger the benchmark workflow in PRs) label and removed the "run/benchmark" (Trigger the benchmark workflow in PRs) label Apr 6, 2024

seisman merged commit 6069ebc into main Apr 7, 2024
20 of 22 checks passed

seisman deleted the vfile/which branch April 7, 2024 11:59

seisman removed the "final review call" (This PR requires final review and approval from a second reviewer) and "run/benchmark" (Trigger the benchmark workflow in PRs) labels Apr 7, 2024

seisman mentioned this pull request Apr 15, 2024

pygmt.which: Errors if downloading multiple tiled grids #3170

Open
