Clarify how Hyperglot handles composed and decomposed combinations, e.g. accented letters #149

Open
MrBrezina opened this issue Dec 6, 2023 · 7 comments
Labels
documentation Improvements or additions to documentation hyperglot-web-app

Comments

@MrBrezina
Member

@moyogo made the following comment in #147:

maybe hyperglot should automatically have a general note for when graphemes can be both composed or decomposed, instead of adding such a note to every single language where this does or can occur.

I think it makes sense to clarify this in the README, the web app, and maybe even in the CLI. But I wonder if it would be best as a point in some kind of language support checklist.

The handling could potentially differ for each language: a combining dieresis is optional for German, while for Tlingit both <Ḵ> and <ḵ> should be supported as precomposed characters as well as base letters with combining marks (see #147). Perhaps this could be a flag for orthographies, @kontur ?

@moyogo
Contributor

moyogo commented Dec 6, 2023

Note that for German, while it is not common, one may still get decomposed characters when dealing with files on the macOS file system, as strings are stored in something close to NFD rather than NFC. Usually the file dialog or Finder normalizes strings when names are copied, but a name can still be in decomposed form if obtained otherwise, for example in some applications or when files are copied to other operating systems.

One can also easily stumble upon decomposed forms in library catalogues, for example in the LOC catalogue or the NYPL catalogue.

The nature of the Unicode model means these decomposed forms should also be supported in German, even if they are less common in the large corpus of German data.
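
To make the NFC/NFD distinction concrete, here is a minimal Python sketch using the standard library's `unicodedata` module. The filename is made up for illustration; the point is only that the precomposed and decomposed spellings are different code point sequences that are canonically equivalent.

```python
import unicodedata

# Hypothetical German filename with a precomposed a-dieresis (U+00E4).
composed = "B\u00e4r.txt"

# NFD splits the letter into base "a" + combining dieresis (U+0308).
decomposed = unicodedata.normalize("NFD", composed)


def codepoints(text):
    """Return the code points of text as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(codepoints(composed[:2]))    # ['U+0042', 'U+00E4']
print(codepoints(decomposed[:3]))  # ['U+0042', 'U+0061', 'U+0308']

# The two strings differ as code point sequences...
print(composed == decomposed)  # False
# ...but NFC maps the decomposed form back to the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

A font that only encodes U+00E4 can display the first string but not the second unless it also has the combining mark or the text is normalized first.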

@kontur
Contributor

kontur commented Dec 7, 2023

I think there is a difference between being able to represent an orthography accurately and being able to represent all legacy input sequences of an orthography (such as those encountered in digital texts, which is what the PR comment was concerned with). We have the --marks and --decomposed flags for those nuances, should someone want to check a font against them specifically. I don't see this as something that inherently needs differentiating in the language data. Quite the opposite: all data in Hyperglot is saved in composed form where such a form exists, which makes it possible to tell which marks are actually required, namely those in combinations that have no composed form and thus need the base + mark explicitly.

If this were an issue, it would apply to all orthographies and all characters that have Unicode decompositions.
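
The idea that composed storage exposes the truly required marks can be sketched in a few lines of Python. This is an illustration under assumptions, not Hyperglot's code: `required_marks` is a hypothetical helper that composes a grapheme with NFC and reports whatever combining marks are left over.

```python
import unicodedata


def required_marks(grapheme):
    """Return the combining marks that survive NFC composition.

    Hypothetical helper, not part of Hyperglot: a mark that is still
    present after NFC has no precomposed form with its base, so a font
    needs that mark explicitly.
    """
    composed = unicodedata.normalize("NFC", grapheme)
    return [f"U+{ord(ch):04X}" for ch in composed if unicodedata.combining(ch)]


# a + combining dieresis composes to U+00E4, so no mark is left over.
print(required_marks("a\u0308"))  # []
# K + combining macron below composes to U+1E34, the Tlingit letter
# mentioned earlier in the thread.
print(required_marks("K\u0331"))  # []
# x + combining macron below has no precomposed form; the mark stays.
print(required_marks("x\u0331"))  # ['U+0331']
```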

@MrBrezina
Member Author

@moyogo I agree with that, but what you describe sounds like a recommended best practice, not a minimal requirement for language support (good-enough practice). We do not want detection to fail for a font that is good enough. Minimality is a key notion in Hyperglot. (I would have loved to call it a principle, but I am unable to support it with a clear definition.)

Frankly, I forgot about the global switch for the CLI when I wrote the issue, but the issue still stands. For some languages, supporting a decomposed solution may be an essential feature; for others it seems non-essential. In theory at least. In order to add a note in the README and elsewhere, I would like to clarify our position.

Sorry for the latency in my replies.

@kontur
Contributor

kontur commented Dec 11, 2023

From the README:

  • -m, --marks: Flag to signal a font should also include all combining marks used for a language - by default only those marks are required which are not part of preencoded characters (default is False)
  • -d, --decomposed: Flag to signal a font should be considered supporting a language as long as it has all base glyphs and marks to write a language - by default also encoded precomposed glyphs are required (default is False)

I think changing the --decomposed default to True would give the broadest results. Just to give a rough idea, Rosetta's Adapter PE supports 398 languages in default detection and 418 with --decomposed. I think the original argument, which is still valid, is that the mere presence of the combining characters is not enough to be certain the composites are working, e.g. that they have mark attachment points. If we were to check base + mark combinations for anchors, or check that actual shaping happens, it would make sense to change the default then.
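
As a rough illustration of how the two flags quoted above change the outcome, here is a toy Python model. It is one reading of the README wording, not Hyperglot's actual implementation: `grapheme_supported` and `supports` are hypothetical names, a font is reduced to a set of encoded characters, and, as noted above, the presence of code points says nothing about anchors or shaping.

```python
import unicodedata


def grapheme_supported(font_chars, grapheme, decomposed=False, marks=False):
    """Toy check for one grapheme (assumed semantics, see lead-in)."""
    nfc = set(unicodedata.normalize("NFC", grapheme))
    nfd = unicodedata.normalize("NFD", grapheme)
    # Default: the encoded, precomposed characters must be present.
    ok = nfc <= font_chars
    if decomposed:
        # --decomposed: base glyphs plus marks are also accepted.
        ok = ok or set(nfd) <= font_chars
    if marks:
        # --marks: every combining mark in use must be present as well.
        ok = ok and all(
            ch in font_chars for ch in nfd if unicodedata.combining(ch)
        )
    return ok


def supports(font_chars, graphemes, **flags):
    """True if every grapheme passes the toy check."""
    return all(grapheme_supported(font_chars, g, **flags) for g in graphemes)


graphemes = ["a", "\u00e4"]           # stored in composed form
font_precomposed = {"a", "\u00e4"}    # encodes U+00E4, no combining mark
font_marks_only = {"a", "\u0308"}     # base letter plus U+0308 only

print(supports(font_precomposed, graphemes))                 # True
print(supports(font_marks_only, graphemes))                  # False
print(supports(font_marks_only, graphemes, decomposed=True)) # True
print(supports(font_precomposed, graphemes, marks=True))     # False
```

Under this reading, turning on the decomposed check can only add languages to the result, which matches the direction of the 398 versus 418 figures.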

@kontur
Contributor

kontur commented Jan 30, 2024

I consider this clarified :)

@kontur kontur closed this as completed Jan 30, 2024
@MrBrezina MrBrezina reopened this Jan 30, 2024
@MrBrezina
Member Author

This still needs to be clarified in the web app's about section.

@kontur kontur added documentation Improvements or additions to documentation hyperglot-web-app and removed needs more information labels Jun 4, 2024
@kontur
Contributor

kontur commented Jun 20, 2024

Idea: For the CLI, output a short preamble before the test result that clarifies marks, decomposition and shaping checks, as well as opt-in flags, and how they affect the result.
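
A possible shape for such a preamble, sketched in Python. Everything here is hypothetical: the `preamble` function and its wording do not exist in Hyperglot, and only the two flags documented earlier in the thread are covered; shaping is left out because the thread does not specify how it is checked.

```python
def preamble(marks=False, decomposed=False):
    """Return a hypothetical preamble describing the active checks."""
    lines = ["Checks applied to this result:"]
    if marks:
        lines.append("- --marks: all combining marks of a language are required.")
    else:
        lines.append(
            "- Marks: only marks not part of precomposed characters "
            "are required (use --marks to require all)."
        )
    if decomposed:
        lines.append("- --decomposed: base glyphs plus marks count as support.")
    else:
        lines.append(
            "- Precomposed characters are required "
            "(use --decomposed to accept base + mark)."
        )
    return "\n".join(lines)


# Default run, then a run with the decomposed check enabled.
print(preamble())
print(preamble(decomposed=True))
```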


3 participants