Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support non-ASCII characters in function arguments #2584

Merged
merged 37 commits into from
Aug 21, 2023
Merged

Conversation

seisman
Copy link
Member

@seisman seisman commented Jun 24, 2023

Description of proposed changes

Support non-ASCII characters in PyGMT arguments, so that users don't have to use octal codes.
See #2204 for context.

Working example

import pygmt

fig = pygmt.Figure()
fig.basemap(region=[0, 10, 0, 5], projection="x1c", frame="WSEN+tABC¥±°″≥×≈∇⇔∑DEF")
fig.show()

it produces:

map

Notes to maintainers
The symbols and ZapfDingbats charset can be obtained from https://unicode.org/Public/MAPPINGS/VENDORS/ADOBE/zdingbat.txt and https://unicode.org/Public/MAPPINGS/VENDORS/ADOBE/symbol.txt. The following script is used to generate the list of characters, but manually editing are necessary.

with open("zdingbat.txt", "r") as fin:
    for line in fin:
        if line.startswith("#"):
            continue
        hexval = line.split()[0].strip()
        print(chr(int(hexval, base=16)), end="")
        i += 1
        if i % 16 == 0:
            print()

TODO

  • Support all non-ASCII characters in the ISOLatin1+ character set
  • Support non-ASCII characters in the Symbols character set
  • Support non-ASCII characters in the ZapfDingbats character set
  • Check PS_CHAR_ENCODING so it also works with Standard+ character set?
  • Check arguments that are not processed by build_arg_string (e.g., text in Figure.text)
  • Add a gallery example or a tutorial?
  • update any existing examples

Closes #2204 Closes #2000

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If wrapping a new module, open a 'Wrap new GMT module' issue and submit reasonably-sized PRs.
  • If adding new functionality, add an example to docstrings or tutorials.
  • Use underscores (not hyphens) in names of Python files and directories.

Slash Commands

You can write slash commands (/command) in the first line of a comment to perform
specific operations. Supported slash commands are:

  • /format: automatically format and lint the code
  • /test-gmt-dev: run full tests on the latest GMT development version

@seisman seisman added the feature Brand new feature label Jun 24, 2023
@seisman seisman added this to the 0.10.0 milestone Jun 24, 2023
@seisman seisman added the needs review This PR has higher priority and needs review. label Jun 24, 2023
@seisman seisman changed the title POC: Support non-ASCII characters Support non-ASCII characters Jun 26, 2023
@seisman
Copy link
Member Author

seisman commented Jun 26, 2023

This new feature is almost done and is ready for more testings. Please let me know what you think.

@weiji14
Copy link
Member

weiji14 commented Jun 26, 2023

Looking good! Could you also change this line in the colorbar gallery example to use degrees perhaps?

frame=["x+lTemperature", r"y+l\260C"],

Also, do we need to test different PS_CHAR_ENCODING settings like Standard, Standard+, ISOLatin1, ISOLatin1+, and ISO-8859-x? Currently the mapping seems to assume ISOLatin1, but if a user sets PS_CHAR_ENCODING to another encoding, the output might be different.

@seisman
Copy link
Member Author

seisman commented Jun 27, 2023

Looking good! Could you also change this line in the colorbar gallery example to use degrees perhaps?

frame=["x+lTemperature", r"y+l\260C"],

Done in e1b43b2.

Also, do we need to test different PS_CHAR_ENCODING settings like Standard, Standard+, ISOLatin1, ISOLatin1+, and ISO-8859-x? Currently the mapping seems to assume ISOLatin1, but if a user sets PS_CHAR_ENCODING to another encoding, the output might be different.

ISOLatin1+ is an extension of ISOLatin1, which means ISOLatin1+ supports more characters than ISOLatin (https://docs.generic-mapping-tools.org/latest/cookbook/octal-codes.html). Thus, it does no harm to support ISOLatin1+ only characters even if users set ISOLatin1 encoding.

For Standard+ and other ISO-8859-x encodings, yes, we can definitely support them. But it also means we have to carefully check their differences and maintain the corresponding mapping dictionary from character to octal codes.

Comment on lines 165 to 166
# ISOLatin1+ charset: \240-\377
mapping.update({chr(i): "\\" + format(i, "o") for i in range(160, 256)})
Copy link
Member Author

@seisman seisman Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, Python supports many different encodings, thus it's possible to implement this feature without manually maintaining the big dictionary, just like what I already do at line 166 for ISOLatin1+ characters \240 to \377. But I don't have enough knowledge about Python encodings now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update:

Can be done using:

for i in range(160, 256):
    print(i, chr(i), chr(i).encode("iso-8859-1").decode("iso-8859-5"))

Copy link
Member Author

@seisman seisman Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the improved version:

for code in [*range(0o040, 0o200), *range(0o240, 0o400)]:
    char = codecs.decode(bytes([code]), "iso8859-5", errors="replace")

pygmt/helpers/utils.py Outdated Show resolved Hide resolved
Comment on lines +199 to +206
+ "αβχδεφγηιϕκλμνο" # \14x-15x
+ "πθρστυϖωξψζ{|}∼" # \16x-17x. \177 is undefined
+ "€ϒ′≤⁄∞ƒ♣♦♥♠↔←↑→↓" # \24x-\25x
+ "°±″≥×∝∂•÷≠≡≈…↵" # \26x-27x
+ "ℵℑℜ℘⊗⊕∅∩∪⊃⊇⊄⊂⊆∈∉" # \30x-31x
+ "∠∇®©™∏√⋅¬∧∨⇔⇐⇑⇒⇓" # \32x-33x
+ "◊〈®©™∑" # \34x-35x
+ "〉∫⌠⌡", # \36x-37x. \360 and \377 are undefined
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Symbols \140, \275 and \276 don't exist in the unicode table, so they're incorrectly shown.

Some symbols of \36x and \37x are shown as boxes in the GitHub web GUI, but they're shown correctly in Vim, so these symbols should be good.

Screenshot from 2023-08-02 15-08-53

@seisman seisman marked this pull request as ready for review August 2, 2023 08:41
@seisman
Copy link
Member Author

seisman commented Aug 2, 2023

Check PS_CHAR_ENCODING so it also works with Standard+ character set?

Already answered in #2584 (comment).

For other ISO-8859-x encodings, it's possible to generate the big mapping dictionary automatically using the following code:

{chr(i).encode("iso-8859-1").decode("iso-8859-5"): "\\" + format(i, "o") for i in range(160, 256)}

but we have to check the PS_CHAR_ENCODING parameter to decide which dictionary to use. However, adding from pygmt.clib.session import Session to pygmt.helpers.utils results in circular imports.

Check arguments that are not processed by build_arg_string (e.g., text in Figure.text)

Yes, we can also translate the strings passed by the text parameter, but we can do very little to plaintext files (we can, but it means we have to open and read the whole plaintext file, which is not efficient). So not sure what we should do here.

Add a gallery example or a tutorial?

Better to do it in a separate PR.

@seisman
Copy link
Member Author

seisman commented Aug 16, 2023

@GenericMappingTools/pygmt-maintainers Any comments and suggestions?

Copy link
Member

@michaelgrund michaelgrund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Copy link
Member

@weiji14 weiji14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't checked every individual symbol, but I trust that they are correct 🙂

Check arguments that are not processed by build_arg_string (e.g., text in Figure.text)

Yes, we can also translate the strings passed by the text parameter, but we can do very little to plaintext files (we can, but it means we have to open and read the whole plaintext file, which is not efficient). So not sure what we should do here.

Ok, to handle ASCII characters in fig.text in a separate PR.

pygmt/helpers/utils.py Outdated Show resolved Hide resolved
@seisman seisman changed the title Support non-ASCII characters Support non-ASCII characters in function arguments Aug 21, 2023
@seisman seisman merged commit 691f1d4 into main Aug 21, 2023
9 of 15 checks passed
@seisman seisman deleted the non-ascii-support branch August 21, 2023 10:54
@seisman seisman removed the needs review This PR has higher priority and needs review. label Aug 21, 2023
yvonnefroehlich added a commit that referenced this pull request Aug 21, 2023
weiji14 added a commit that referenced this pull request Sep 2, 2023
Tenth minor release of PyGMT.

* Add changelog entry to version switcher
* Update compatibility table
* Update citation
* Add draft changelog
* Add full names of contributors to changelog
* Add two highlight bullet points
* Combine non-ASCII character PRs #2638 and #2584 into one highlight point
* Swap author positions for Dongdong and Leo
* Change release date to 20230902
* Move Yvonne up a few spots

---------

Co-authored-by: Yvonne Fröhlich <[email protected]>
Co-authored-by: Dongdong Tian <[email protected]>
Co-authored-by: Michael Grund <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Brand new feature
Projects
None yet
4 participants