Handling of digraphs and trigraphs #116

Open
MrBrezina opened this issue Apr 7, 2023 · 6 comments
Labels
data Issues in the language data help wanted Extra attention is needed

Comments

@MrBrezina
Member

It would be useful to establish a systematic approach to digraphs (and multigraphs generally), as they are sometimes considered part of the standardised alphabet. That is not the case in English, but it is in Czech (ch) and Hungarian (cs, dz, dzs, gy, ly, ny, sz, ty, zs [from Wikipedia]).

They are not too important from a type-design perspective, as the individual characters may combine with more characters than those in digraphs. Hypothetically, a list of them could inform more meaningful decorative ligatures, but that is about it, I think.

They are important if we are encoding standardised orthographies, as without them these are incomplete. They are also important for sorting, but that is not something we deal with.

Technically, base+mark combinations can be seen as digraphs, too, and we include those.

What do we think?

@MrBrezina MrBrezina added the help wanted Extra attention is needed label Apr 7, 2023
@kontur
Contributor

kontur commented Apr 26, 2023

Insofar as they are part of the official orthography, I am in favor of this. After all, the database is not collecting design requirements but orthographies.

@kontur kontur self-assigned this Apr 26, 2023
@kontur
Contributor

kontur commented Apr 26, 2023

Also relevant for #114

@moyogo
Contributor

moyogo commented Apr 26, 2023

For #114 there was a duplicate n that probably came from the digraph ny in the alphabet.
It could be removed, since digraphs are currently not kept, or it could stay if keeping digraphs is the new scheme.

A few digraphs are interesting, but n-grams would be just as relevant, if not more so.
A lot of languages do not define their digraphs in their alphabets, and in many cases they are not interesting.
Knowing that a language uses the n-grams fì, qj or įj is more interesting than knowing it uses the digraphs cz, gh or rr, for example.

Since the digraphs-as-letter-of-alphabet info is generally available, it could be added.

@kontur kontur added the data Issues in the language data label May 31, 2024
@kontur kontur removed their assignment May 31, 2024
@kontur
Contributor

kontur commented May 31, 2024

I can't pinpoint the commit where we changed how hyperglot-save parses the characters, but as of this writing di/trigraphs are retained as they appear in the character lists. So this issue is now about adding that data to orthographies where, in the past, we did not retain those combinations. Also, the letters comprising a di/trigraph are no longer extracted and appended to the orthography on saving; this happens only when parsing the list, to confirm that all individual characters are in fact included in the check.
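To illustrate the behaviour described above, here is a minimal sketch in Python. It is not the actual hyperglot-save code: the function name `parse_character_list` and the space-separated input format are assumptions made for illustration only. It keeps di/trigraphs as entries in their original order and separately derives the individual characters that a check would need.

```python
def parse_character_list(raw: str) -> tuple[list[str], list[str]]:
    """Hypothetical helper, not part of Hyperglot.

    Returns (entries, singles):
    - entries: tokens as listed, multigraphs retained, order preserved,
      exact duplicates dropped
    - singles: the individual code points found in those entries,
      in order of first appearance
    """
    entries: list[str] = []
    for token in raw.split():
        if token not in entries:
            entries.append(token)

    singles: list[str] = []
    for entry in entries:
        # Note: this iterates code points, so a decomposed base+mark
        # sequence would also be split into its parts here.
        for char in entry:
            if char not in singles:
                singles.append(char)
    return entries, singles


entries, singles = parse_character_list("a c cs d dz dzs")
print(entries)  # ['a', 'c', 'cs', 'd', 'dz', 'dzs']
print(singles)  # ['a', 'c', 's', 'd', 'z']
```

The point of keeping two lists is that the orthography data stays faithful to the listed alphabet, while the derived single characters are only used for checking.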

I suppose something like n-grams or possible/common combinations is out of scope (at least for now), if that is what @moyogo was referring to. Compiling a list of possible combinations is one thing; retaining "interesting" combinations is another. E.g. those samples imply to me that they are useful for checking kerning collisions, but how to pick?

@kontur
Contributor

kontur commented May 31, 2024

Related to this as well: hyperglot-save retains the order of characters in base (I vaguely remember this being the reason we changed the saving implementation), so there are likely orthographies where we should fix the order of characters to correctly represent the official orthography.

kontur added a commit that referenced this issue May 31, 2024
@kontur
Contributor

kontur commented Jun 4, 2024

See #172 for some discussion related to including upper case variants of digraphs; it's somewhat unclear whether the "uppercase" variants of digraphs should be double upper case (like in squ) or title case (like in Czech/Hungarian orthographic references).
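For concreteness, the two candidate casings can be sketched in Python as follows. The helper name `uppercase_variants` is hypothetical and only illustrates the difference; it says nothing about which form a given orthography prescribes.

```python
def uppercase_variants(multigraph: str) -> dict[str, str]:
    """Hypothetical helper: return both candidate upper case forms."""
    return {
        # every letter upper case, e.g. for all-caps setting
        "upper": multigraph.upper(),
        # only the first letter upper case, e.g. at the start of a word
        "title": multigraph[:1].upper() + multigraph[1:],
    }


print(uppercase_variants("ch"))   # {'upper': 'CH', 'title': 'Ch'}
print(uppercase_variants("dzs"))  # {'upper': 'DZS', 'title': 'Dzs'}
```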

3 participants