clib: Add virtualfile_to_dataset method for converting virtualfile to a dataset #3083

Merged
merged 13 commits into from
Mar 11, 2024
Rename return_table to return_dataset
seisman committed Mar 7, 2024
commit ce029b252a8b29dd36dcad2bfeedef36a88c8ee8
20 changes: 12 additions & 8 deletions pygmt/clib/session.py
@@ -1738,14 +1738,16 @@ def read_virtualfile(
dtype = {"dataset": _GMT_DATASET, "grid": _GMT_GRID}[kind]
return ctp.cast(pointer, ctp.POINTER(dtype))

-def return_table(
+def return_dataset(
self,
output_type: Literal["pandas", "numpy", "file"],
Member

Should we set a default output type here? It looks like we're using pandas as the default in #3092.

Suggested change
-output_type: Literal["pandas", "numpy", "file"],
+output_type: Literal["pandas", "numpy", "file"] = "pandas",

Member Author

It makes no difference, because we always call the function with the output_type parameter, e.g.:

        return lib.return_dataset(
            output_type=output_type,
            vfile=vouttbl,
        )

Member

Yes, it doesn't make any difference in the PyGMT modules, but this is a good central location to document that output_type="pandas" is the default output (though in #1318, it seemed like most of us were in favour of output_type="input" or auto as the default).

Member Author

output_type="input" or auto may not make sense for PyGMT, especially in cases like:

  1. The input data is a file. Then auto means writing to a file by default, so outfile becomes required.
  2. The input data is a set of vectors (e.g., x/y/z), and each vector can be a list, ndarray, or pd.Series. What is the expected format if auto/input is used?
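
A minimal sketch of both cases, assuming a hypothetical helper `infer_auto_output` (not part of PyGMT) and using plain Python containers as stand-ins for ndarray/pd.Series. A file input maps cleanly to file output, while mixed vector types leave no single "same as input" format:

```python
from array import array
from os import PathLike


def infer_auto_output(*inputs):
    """Hypothetical helper, not a PyGMT API: guess the output kind from the inputs."""
    kinds = {
        "file" if isinstance(obj, (str, PathLike)) else type(obj).__name__
        for obj in inputs
    }
    if len(kinds) != 1:
        # Mixed input containers give no single "same as input" format.
        raise ValueError(f"ambiguous output type for inputs of kinds {sorted(kinds)}")
    return kinds.pop()


# Case 1: a file name as input implies file output, so an outfile is needed.
file_kind = infer_auto_output("points.txt")

# Case 2: x/y/z passed as different containers; "auto" has no clear meaning.
try:
    infer_auto_output([1.0, 2.0], (3.0, 4.0), array("d", [5.0, 6.0]))
    mixed_is_ambiguous = False
except ValueError:
    mixed_is_ambiguous = True
```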

Member

@weiji14, Mar 11, 2024

Yeah, not saying that output_type="auto" would be easy to implement 🙂 I think the default output_type="pandas" is fine for now since it is an in-memory format that can be converted to virtualfiles relatively easily. We can discuss more about what the ideal output type would be in #1318 (if there is still any debate that needs to be had).
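
A rough sketch of what documenting the default in the signature could look like. In the signature shown in this diff, `vfile` follows `output_type` and has no default, so adding `= "pandas"` in place would likely require moving `vfile` first (as assumed below) or making the later parameters keyword-only. `return_dataset_stub` is a hypothetical stand-in that only reports the selected output type; it is not the PyGMT method:

```python
import inspect
from typing import Literal

OutputType = Literal["pandas", "numpy", "file"]

# Python rejects a parameter without a default after one with a default.
bad_signature = 'def f(output_type="pandas", vfile): pass'
try:
    compile(bad_signature, "<sketch>", "exec")
    default_before_required_ok = True
except SyntaxError:
    default_before_required_ok = False


def return_dataset_stub(vfile: str, output_type: OutputType = "pandas") -> str:
    """Hypothetical stand-in: validate and return the selected output type."""
    if output_type not in ("pandas", "numpy", "file"):
        raise ValueError(f"Unrecognized output_type: {output_type!r}")
    return output_type


# Callers that always pass output_type explicitly are unaffected by the default.
selected_explicit = return_dataset_stub(vfile="vfile_name", output_type="numpy")

# The default only matters when the argument is omitted ...
selected_default = return_dataset_stub(vfile="vfile_name")

# ... and it is visible to readers and tools through the signature.
documented_default = (
    inspect.signature(return_dataset_stub).parameters["output_type"].default
)
```

With the default in the signature, existing keyword calls such as `return_dataset(output_type=output_type, vfile=vouttbl)` would behave the same, and the default is recorded in one place.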

vfile: str,
column_names: list[str] | None = None,
) -> pd.DataFrame | np.ndarray | None:
"""
-Return an output table from a virtual file based on the output type.
+Output a dataset stored in a virtual file in different formats.
seisman marked this conversation as resolved.

+The format of the dataset is determined by the ``output_type`` parameter.

Parameters
----------
@@ -1763,8 +1765,8 @@ def return_table(

Returns
-------
-table
-The output table. If ``output_type="file"`` returns ``None``.
+result
+The result dataset. If ``output_type="file"`` returns ``None``.

Examples
--------
@@ -1792,29 +1794,31 @@ def return_table(
... kind="dataset", fname=outtmp.name
... ) as vouttbl:
... lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
-... result = lib.return_table(output_type="file", vfile=vouttbl)
+... result = lib.return_dataset(
+... output_type="file", vfile=vouttbl
+... )
... assert result is None
... assert Path(outtmp.name).stat().st_size > 0
...
... # numpy output
... with Session() as lib:
... with lib.virtualfile_out(kind="dataset") as vouttbl:
... lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
-... outnp = lib.return_table(output_type="numpy", vfile=vouttbl)
+... outnp = lib.return_dataset(output_type="numpy", vfile=vouttbl)
... assert isinstance(outnp, np.ndarray)
...
... # pandas output
... with Session() as lib:
... with lib.virtualfile_out(kind="dataset") as vouttbl:
... lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
-... outpd = lib.return_table(output_type="pandas", vfile=vouttbl)
+... outpd = lib.return_dataset(output_type="pandas", vfile=vouttbl)
... assert isinstance(outpd, pd.DataFrame)
...
... # pandas output with specified column names
... with Session() as lib:
... with lib.virtualfile_out(kind="dataset") as vouttbl:
... lib.call_module("read", f"{tmpfile.name} {vouttbl} -Td")
-... outpd2 = lib.return_table(
+... outpd2 = lib.return_dataset(
... output_type="pandas",
... vfile=vouttbl,
... column_names=["col1", "col2", "col3", "coltext"],