Clear Statement on Unicode Normalization Form for use in XHTML content for epubs #1564

kevinhendricks · 2024-06-23T22:22:23Z

In the old epub 2.0.1 spec there was a clear, reasoned requirement that Unicode Normalization Form C (NFC) be used for all XHTML content files.

[QUOTE]
1.3.6: Relationship to Unicode
Publications may use the entire Unicode character set, using UTF-8 or UTF-16 encodings, as defined by Unicode (see http://www.unicode.org/unicode/standard/versions). The use of Unicode facilitates internationalization and multilingual documents. However, Reading Systems are not required to provide glyphs for all Unicode characters.
Reading Systems must parse all UTF-8 and UTF-16 characters properly (as required by XML). Reading Systems may decline to display some characters, but must be capable of signaling in some fashion that undisplayable characters are present. Reading Systems must not display Unicode characters merely as if they were 8-bit characters. For example, the biohazard symbol (0x2623) need not be supported by including the correct glyph, but must not be parsed or displayed as if its component bytes were the two characters "&#" (0x0026 0x0023).

To aid Reading Systems in implementing consistent searching and sorting behavior it is required that Unicode Normalization Form C (NFC) be used (See http://www.w3.org/TR/charmod-norm/)
[/QUOTE]

I have looked everywhere in the epub 3.0, 3.0.1, 3.2, and 3.3 specs for a similarly clear statement, but I cannot find one.

The only thing I see in the epub3 specs is a requirement for filenames to be in NFC form - which to me implies that URLs (whether %-encoded or not, since the UTF-8 byte sequence differs between normalization forms and therefore so does the %-encoded URL) would need to be NFC normalized. If that is true, then the OPF should probably be NFC normalized as well. And if both of those hold, should the XHTML content also be NFC normalized?
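To illustrate the URL point, here is a minimal Python sketch. The filename is a hypothetical example, not taken from any spec; the sketch only shows that the same visible name yields different %-encoded URLs depending on its normalization form.

```python
import unicodedata
from urllib.parse import quote

# Hypothetical filename containing "é", written with an escape so the
# example does not depend on how this source text itself is normalized.
name = "caf\u00e9.xhtml"

nfc = unicodedata.normalize("NFC", name)  # é as one code point, U+00E9
nfd = unicodedata.normalize("NFD", name)  # e followed by combining U+0301

print(nfc == nfd)   # False: same rendering, different code points
print(quote(nfc))   # caf%C3%A9.xhtml
print(quote(nfd))   # cafe%CC%81.xhtml
```

A manifest href written in one form would therefore not byte-match a zip entry name stored in the other, unless one side normalizes before comparing.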

FWIW, according to a web search I found the following in related specifications:

[QUOTE]
The W3C Character Model for the World Wide Web 1.0: Normalization [CharNorm] and other W3C Specifications (such as XML 1.0 5th Edition) recommend using Normalization Form C for all content, because this form avoids potential interoperability problems arising from the use of canonically equivalent, yet different, character sequences in document formats on the Web.
[/QUOTE]

And from the XML 1.0 spec the following:

[QUOTE]
Content authors SHOULD use Unicode Normalization Form C (NFC) wherever possible for content.
[/QUOTE]
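As a rough idea of what checking that recommendation could look like, here is a small Python sketch (the helper name `non_nfc_lines` and the sample markup are made up for illustration). It relies on `unicodedata.is_normalized`, which needs Python 3.8 or later.

```python
import unicodedata

def non_nfc_lines(text):
    """Return the 1-based numbers of lines that are not already in NFC."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if not unicodedata.is_normalized("NFC", line)
    ]

# Hypothetical XHTML fragment: line 1 uses precomposed é (NFC),
# line 2 uses e + combining acute accent (NFD).
sample = "<p>caf\u00e9</p>\n<p>cafe\u0301</p>\n"

print(non_nfc_lines(sample))  # [2]
```

This only reports where non-NFC text occurs; whether a checker should warn or silently normalize is a separate policy question.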

Because the epub 3.3 spec omits any real mention of Unicode Normalization Form requirements, or even recommendations, for epub Content Documents, content in Arabic, Hebrew, and other languages could be created in NFD form (under, say, macOS) and used in an epub. That would make the search features of many e-readers worthless for such content and hurt cross-platform compatibility.
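A small Python sketch of the search problem, assuming a reading system that does a plain code point comparison (real e-readers may behave differently). The function names are hypothetical; the Arabic letter used, ALEF WITH MADDA ABOVE (U+0622), canonically decomposes to ALEF (U+0627) plus MADDAH ABOVE (U+0653).

```python
import unicodedata

def naive_contains(text, query):
    """Plain code point search, with no normalization at all."""
    return query in text

def normalized_contains(text, query):
    """Search after bringing both sides to NFC."""
    nfc_text = unicodedata.normalize("NFC", text)
    nfc_query = unicodedata.normalize("NFC", query)
    return nfc_query in nfc_text

# Content stored in NFD: U+0627 U+0653 followed by BEH (U+0628).
nfd_text = unicodedata.normalize("NFD", "\u0622\u0628")

# Query typed in NFC, as most keyboards would produce it.
query = "\u0622"

print(naive_contains(nfd_text, query))       # False
print(normalized_contains(nfd_text, query))  # True
```

If content were required to be NFC, the naive search would already work for NFC queries, which is the consistency the epub 2.0.1 wording was aiming for.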

For the sake of clarity and simplicity, should there not be a content standard for epub3 XHTML content documents, similar to the one previously used in the epub2 spec?

Thank you for your time and consideration.
