Clarify how Hyperglot handles composed and decomposed combinations, e.g. accented letters #149

MrBrezina · 2023-12-06T13:14:38Z

@moyogo made the following comment in #147:

maybe hyperglot should automatically have a general note for when graphemes can be both composed or decomposed, instead of adding such a note to every single language where this does or can occur.

I think it makes sense to clarify this in the README, web app, and maybe even in CLI. But I wonder if it would be best as a point in some kind of language support checklist.

The handling could potentially differ for each language: supporting a combining dieresis is optional for German, while for Tlingit both <Ḵ> and <ḵ> should be supported as precomposed characters and as sequences using combining marks (see #147). Perhaps this could be a flag for orthographies, @kontur ?

moyogo · 2023-12-06T17:48:33Z

Note that for German, while it is not common, one may still get decomposed characters when dealing with files on the macOS file system, as strings are stored in something close to NFD rather than NFC. Usually the UI file dialog or Finder normalizes strings when names are copied, but the name can still be in decomposed form if obtained otherwise, for example in some applications or when files are copied to other operating systems.
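To illustrate the NFC/NFD difference described above, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `same_text` is only illustrative and is not part of Hyperglot.

```python
import unicodedata

composed = "\u00fc"      # "ü" as a single precomposed code point (NFC)
decomposed = "u\u0308"   # "u" followed by U+0308 COMBINING DIAERESIS (NFD)

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: canonically equivalent

# A file name obtained in decomposed form carries the mark as its own code point.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", "M\u00fcller")])
```

Both strings render as the same letter, but a font needs the combining mark (and working mark positioning) to display the second one.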

One can also easily stumble upon decomposed forms in library catalogues. For example in the LOC catalogue or the NYPL catalogue.

The nature of the Unicode model means these decomposed forms should also be supported in German, even if they are less common in the large corpus of German data.

kontur · 2023-12-07T10:00:30Z

I think there is a difference between being able to represent an orthography accurately and being able to represent all legacy input sequences of an orthography (such as those encountered in digital texts, which is what the PR comment was concerned with). We have --marks and --decomposed flags for those nuances, should someone want to check a font against them specifically. I don't see this as something that inherently needs differentiating in the language data. Quite the opposite: all data in Hyperglot is saved in composed form where such a form exists, which makes it possible to tell which marks are required, namely those in combinations that have no composed form and thus need the base + mark explicitly.
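As a rough illustration of that distinction (not Hyperglot's actual code), the sketch below uses NFC normalization to test whether a base + mark pair has a precomposed form. The function name `has_precomposed_form` is hypothetical, and the check assumes a single base character followed by a single combining mark.

```python
import unicodedata

def has_precomposed_form(base: str, mark: str) -> bool:
    """Hypothetical helper: True if base + mark composes to one code point under NFC."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1

LINE_BELOW = "\u0331"  # COMBINING MACRON BELOW, as used in Tlingit

# K + line below composes to U+1E34, so the precomposed letter can be stored.
print(has_precomposed_form("K", LINE_BELOW))
# x + line below has no precomposed code point, so the mark itself is required.
print(has_precomposed_form("x", LINE_BELOW))
```

Under this reading, a mark only becomes a hard requirement when at least one grapheme of the orthography cannot be written without it.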

If this were an issue, it would apply to all orthographies and all characters whose Unicode code points are decomposable.

MrBrezina · 2023-12-08T16:09:43Z

@moyogo I agree with that, but what you describe sounds like a recommended best practice, not a minimal requirement for language support (good-enough practice). We do not want to fail to detect a font's support if it is good enough. Minimality is a key notion in Hyperglot. (I would have loved to call it a principle, but I am unable to support it with a clear definition.)

Frankly, I forgot about the global switch for the CLI when I wrote the issue, but the issue still stands. For some languages, supporting decomposed sequences may be an essential feature; for others it seems non-essential. In theory at least. In order to add a note in the README and elsewhere, I would like to clarify our position.

Sorry for the latency in my replies.

kontur · 2023-12-11T09:11:52Z

From the README:

-m, --marks: Flag to signal a font should also include all combining marks used for a language - by default only those marks are required which are not part of preencoded characters (default is False)
-d, --decomposed: Flag to signal a font should be considered supporting a language as long as it has all base glyphs and marks to write a language - by default also encoded precomposed glyphs are required (default is False)
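One way to read the two flag descriptions is as set checks against a font's character map. The sketch below is an assumption-laden illustration, not Hyperglot's implementation: the function `supports`, its parameters, and the idea of passing the font's characters as a plain set are all hypothetical.

```python
import unicodedata

def supports(cmap: set, graphemes, marks: bool = False, decomposed: bool = False) -> bool:
    """Hypothetical check of one orthography against the characters a font encodes.

    cmap       -- set of characters present in the font (assumed input)
    graphemes  -- iterable of strings, stored in composed form where one exists
    marks      -- also require every combining mark used by the graphemes
    decomposed -- accept base + mark coverage in place of precomposed characters
    """
    for g in graphemes:
        nfc = set(unicodedata.normalize("NFC", g))
        nfd = set(unicodedata.normalize("NFD", g))
        # Default: the composed form must be encoded; graphemes without a
        # composed form stay as base + mark under NFC and are required as such.
        ok = nfc <= cmap or (decomposed and nfd <= cmap)
        if marks:
            used_marks = {c for c in nfd if unicodedata.combining(c)}
            ok = ok and used_marks <= cmap
        if not ok:
            return False
    return True

german_umlauts = ["\u00e4", "\u00f6", "\u00fc"]   # ä ö ü
precomposed_only = set("\u00e4\u00f6\u00fc")      # font without U+0308
base_and_mark_only = set("aou\u0308")             # font without ä ö ü

print(supports(precomposed_only, german_umlauts))                     # default check
print(supports(precomposed_only, german_umlauts, marks=True))         # stricter
print(supports(base_and_mark_only, german_umlauts, decomposed=True))  # looser
```

Note that a check like this only looks at which characters are encoded; it says nothing about whether the marks actually position correctly on their bases.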

I think changing the --decomposed default to True would give the broadest results. Just to give a rough idea, Rosetta's Adapter PE supports 398 languages in default detection, 418 with --decomposed. I think the original argument, which is still valid, is that the mere presence of the combining characters is not enough to be certain the composites are working, e.g. that they have mark attachment points. If we were to check base + mark combinations for anchors, or check that actual shaping happens, it would make sense to change the default here.

kontur · 2024-01-30T07:42:03Z

I consider this clarified :)

MrBrezina · 2024-01-30T09:21:19Z

This still needs to be clarified in the web app.

kontur · 2024-06-20T07:31:37Z

Idea: For the CLI, output a short preamble before the test result that clarifies marks, decomposition and shaping checks, as well as opt-in flags, and how they affect the result.

MrBrezina assigned MrBrezina and kontur Dec 6, 2023

kontur added the needs more information label Dec 7, 2023

kontur mentioned this issue Dec 7, 2023

Add Tlingit (iso 639-3 tli) #147

Merged

kontur closed this as completed Jan 30, 2024

MrBrezina reopened this Jan 30, 2024

kontur added the documentation and hyperglot-web-app labels and removed the needs more information label Jun 4, 2024
