ICU-22707 fix normalization bug: chars that combine back & fwd
markusicu committed Apr 30, 2024
1 parent 4d9612b commit d0e43d6
Showing 2 changed files with 7 additions and 8 deletions.
10 changes: 5 additions & 5 deletions docs/design/normalization/custom.md
@@ -118,7 +118,8 @@ Per starter that combines forward, old and new data stores a linear, sorted list
* The ICU implementation recomposes starting from a fully decomposed sequence. Therefore, the lookup value needs to indicate combines-forward only for characters that do not have a mapping. The composition table result then indicates whether a composite combines-forward, and the index to the combined mapping+composition data is then found via the index from the composite's lookup result.
* ICU 49 composePair() needs to know whether the first character combines forward even if it is a composite. formatVersion 2 separates the YesNo range into two parts accordingly, adding the yesNoMappingsOnly threshold.
* A composite cannot combine-back because the composition algorithm does not try to combine an earlier starter with the new composite.
- * The algorithm allows for a character to both combine-back and combine-forward, although this seems like a strange situation and it does not occur in Unicode 5.2..10.
+ * The algorithm allows for a character to both combine-back and combine-forward.
+ Such characters occur in Unicode 16 for the first time.
* Hangul syllables are algorithmically decomposed into Jamos, and algorithmically recomposed from them. The actual mappings are not stored in the table.
* In the ICU implementation, recomposition is done only on a fully decomposed sequence. Composition then sees only YesYes and MaybeYes characters which do not have mappings.
* A character that maps to an empty string (that is, one that is deleted during normalization) does not have normalization boundaries before or after it. Its FCD value would be the worst-case 0x1ff (lccc=1, tccc=0xff). (The standard Unicode normalization forms do not delete characters, but NFKC\_Casefold and UTS #46 do.)
@@ -347,10 +348,9 @@ _The rows of the table, from bottom to top, are encoded with increasing 16-bit "
<br />
</td>
<td style="width:476px;height:31px">
- Both combine-back &amp; combine-fwd: strange but allowed
- <br />
+ Both combine-back &amp; combine-fwd
</td>
- <td style="width:60px">none</td>
+ <td style="width:60px">U+1611E GURUNG KHEMA VOWEL SIGN AA</td>
<td style="width:456px;height:31px">
≥minMaybeYes which is 8-aligned
<br />
@@ -1337,4 +1337,4 @@ It should be easy to include the standard Unicode normalization ccc and composit

Another, simpler way is for gennorm2 to take a list of mapping table files, and to provide standard files like ccc.txt, compose.txt, nfd.txt, nfkd.txt and casefold.txt that could be combined (with or without additional custom tables) in various combinations into one binary data file. This would also allow for a character to have different mappings in different files, and the later mapping would override the earlier one. gennorm2 should be able to also output a .txt file with all of the combined data, except without recursively resolved mappings, to keep two-way mappings in the file valid for input. (**Done in ICU 4.4.** _Modification:_ The NFKC mappings cannot simply add to the NFC mappings because some characters with two-way NFC mappings have one-way NFKC mappings. Therefore, there are separate files that specify each normalization form's mappings.)

- We should make it easy to move StringPrep mappings from the .spp files into normalization .txt/.nrm files. (**Not done** (yet).)
+ We should make it easy to move StringPrep mappings from the .spp files into normalization .txt/.nrm files. (**Not done** (yet).)
5 changes: 2 additions & 3 deletions icu4c/source/common/normalizer2impl.h
@@ -627,7 +627,7 @@ class U_COMMON_API Normalizer2Impl : public UObject {
} else if(norm16<minMaybeYes) {
return getMapping(norm16); // for yesYes; if Jamo L: harmless empty list
} else {
- return maybeYesCompositions+norm16-minMaybeYes;
+ return maybeYesCompositions+((norm16-minMaybeYes)>>OFFSET_SHIFT);
}
}
const uint16_t *getCompositionsListForComposite(uint16_t norm16) const {
@@ -880,8 +880,7 @@ unorm_getFCD16(UChar32 c);
* The maybeYesCompositions array contains compositions lists for characters that
* combine both forward (as starters in composition pairs)
* and backward (as trailing characters in composition pairs).
- * Such characters do not occur in Unicode 5.2 but are allowed by
- * the Unicode Normalization algorithms.
+ * Such characters occur in Unicode 16 for the first time.
* If there are no such characters, then minMaybeYes==MIN_NORMAL_MAYBE_YES
* and the maybeYesCompositions array is empty.
* If there are such characters, then minMaybeYes is subtracted from their norm16 values
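
Reviewer note on the `normalizer2impl.h` change: the sketch below is one way to read why the index into `maybeYesCompositions` needs the `>>OFFSET_SHIFT`. It assumes that a norm16 value carries its list offset above one low flag bit, so the raw difference `norm16-minMaybeYes` would be about twice the intended index. The constant values and the helper names (`kOffsetShift`, `kMinMaybeYes`, `makeToyNorm16`, `indexBeforeFix`, `indexAfterFix`) are hypothetical and chosen for illustration; the real values come from `normalizer2impl.h` and the loaded data file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative constants only; the real ones are defined in normalizer2impl.h
// and in the loaded .nrm data, and may differ from these.
constexpr int kOffsetShift = 1;            // assumption: one low flag bit below the offset
constexpr uint16_t kMinMaybeYes = 0xfc00;  // assumption: toy threshold, 8-aligned

// Computation before the commit: the raw difference is used as an array index.
constexpr std::size_t indexBeforeFix(uint16_t norm16) {
    return static_cast<std::size_t>(norm16 - kMinMaybeYes);
}

// Computation after the commit: the low bit(s) are shifted out before indexing.
constexpr std::size_t indexAfterFix(uint16_t norm16) {
    return static_cast<std::size_t>((norm16 - kMinMaybeYes) >> kOffsetShift);
}

// Hypothetical helper: builds a toy norm16 whose compositions list starts at
// `offset`, with the low flag bit optionally set.
constexpr uint16_t makeToyNorm16(std::size_t offset, bool flag) {
    return static_cast<uint16_t>(kMinMaybeYes + (offset << kOffsetShift) + (flag ? 1 : 0));
}

// Small self-check of the toy encoding.
inline void checkToyExample() {
    const uint16_t norm16 = makeToyNorm16(3, true);
    assert(indexAfterFix(norm16) == 3);   // intended list offset
    assert(indexBeforeFix(norm16) != 3);  // unshifted difference misses the list
}
```

Under these assumptions, the unshifted form only happens to work when the array is empty, which would match the statement that such characters did not occur before Unicode 16.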