> the ability to synchronize a byte stream picked up mid-run, with less than one character being consumed before synchronization

Can somebody explain or link to an explanation on how UTF-8 allows for this?




Note that it says less than one character. A character in UTF-8 can be composed of multiple bytes.

The encoding scheme is laid out in the linked email. Based on the high bits it's possible to detect when a new character starts. Relevant portion:

  We define 7 byte types:
  T0 0xxxxxxx      7 free bits
  Tx 10xxxxxx      6 free bits
  T1 110xxxxx      5 free bits
  T2 1110xxxx      4 free bits
  T3 11110xxx      3 free bits
  T4 111110xx      2 free bits
  T5 111111xx      2 free bits

  Encoding is as follows.
  From hex  Thru hex      Sequence             Bits
  00000000  0000007f      T0                   7
  00000080  000007FF      T1 Tx                11
  00000800  0000FFFF      T2 Tx Tx             16
  00010000  001FFFFF      T3 Tx Tx Tx          21
  00200000  03FFFFFF      T4 Tx Tx Tx Tx       26
  04000000  FFFFFFFF      T5 Tx Tx Tx Tx Tx    32
[...]

  4. All of the sequences synchronize on any byte that is not a Tx byte.

If you are starting mid-run, skip initial Tx bytes. That will always be less than one character.
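To make that concrete, here's a minimal Python sketch (my own illustration, not from the email; the function name is made up) of picking up a stream mid-run and skipping the leftover Tx bytes:

    def resync(stream: bytes) -> int:
        """Return the index of the first byte that starts a character.

        Tx (continuation) bytes look like 10xxxxxx, so any byte whose top
        two bits are not exactly '10' begins a new character.
        """
        i = 0
        while i < len(stream) and (stream[i] & 0b11000000) == 0b10000000:
            i += 1  # still inside the interrupted character; skip it
        return i

    # Picking up the 3-byte encoding of U+20AC (0xE2 0x82 0xAC) one byte late:
    assert resync(b"\x82\xacA") == 2  # skips two Tx bytes, lands on 'A'

At most one (partial) character is dropped; everything after it decodes normally.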


Note that UTF-8 has since been restricted to at most 4 bytes (i.e. the longest sequence is `T3 Tx Tx Tx`).


So now we know who is really responsible for the whole MySQL utf8mb4 fiasco -- these 2 guys sitting in a diner, conjuring up a brilliant scheme to cover 4 billion characters, which turned out to exceed the actual requirement by more than 2000x.

September 1992: 2 guys scribbling on a placemat.

January 1998: RFC 2279 defines UTF-8 as 1 to 6 bytes per character.

March 2001: A bunch of CJK characters were added in Unicode 3.1.0, pushing the total to 94,140 and exceeding the 16-bit range covered by 3-byte UTF-8.

March 2002: MySQL added support for UTF-8, initially setting the limit to 6 bytes (https://github.com/mysql/mysql-server/commit/55e0a9c)

September 2002: MySQL decided to reduce the limit to 3 bytes, probably for storage efficiency reasons (https://github.com/mysql/mysql-server/commit/43a506c, https://adamhooper.medium.com/in-mysql-never-use-utf8-use-ut...)

November 2003: RFC 3629 restricts UTF-8 to 1 to 4 bytes per character.

Arguably, if the placemat had been smaller and the guys had stopped at 4 bytes after running out of space, perhaps MySQL would have done the right thing? Ah, who am I kidding. The same commit would likely still have happened.

EDIT: Just noticed this in the footnotes, and the plot thickens...

> The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these.

So UTF-8 was indeed intended to be utf8mb3!


This is also a very simple application of the idea of a "prefix-free code" from information theory and coding (the set of codewords {0, 10, 110, 1110, 11110, ..., 111111} is prefix-free).

I think there's also the idea that the code can "sync up" when it, say, starts in the middle of a character.
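As a quick check of that claim (my own sketch, nothing from the thread), no codeword in the set is a prefix of another, which is what lets a decoder tell where a lead byte's tag ends:

    # The seven lead-byte tags from the table above (T0, Tx, T1..T5).
    codewords = ["0", "10", "110", "1110", "11110", "111110", "111111"]

    # Prefix-free: no codeword is a proper prefix of any other codeword.
    assert all(
        not b.startswith(a)
        for a in codewords
        for b in codewords
        if a != b
    )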


Other people have answered your question but I wanted to clarify one point. The word "character" here means "Unicode code point". However, what the user thinks of as a single character can be made up of more than one code point. This presents a different problem, and one UTF-8 itself can't help with.

The Unicode Consortium has a report on extended grapheme clusters[0] (i.e. user-perceived characters). Essentially, if you're processing some text mid stream, it might not be clear if a code point is the start of a new user-perceived character or not. So you may want to skip ahead until an unambiguous symbol boundary is reached.

[0]: https://www.unicode.org/reports/tr29/
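For example (a quick Python illustration of the point, not from the report), 'é' written as a base letter plus a combining accent is one user-perceived character but two code points:

    s = "e\u0301"                  # 'e' + U+0301 COMBINING ACUTE ACCENT, renders as 'é'
    print(len(s))                  # 2 code points
    print(len(s.encode("utf-8")))  # 3 bytes: 'e' is 1 byte, U+0301 is 2

    # UTF-8 tells you where each code point starts, but nothing in the bytes
    # says whether U+0301 attaches to a base character you never received.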


It’s fairly simple, actually: leading bytes have a specific bit pattern that continuation bytes don’t. A single-byte character will have the topmost bit unset (0b0xxxxxxx); for a multi-byte run, the first byte will have the top two bits set (0b11xxxxxx) and any succeeding bytes will have the top bit set but the next bit unset (0b10xxxxxx). This means that given an arbitrary byte you can always tell what it is, and you can tell when you’re at the start of the next character by looking for either of the first two patterns.
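In code that test is just a couple of bit masks (a rough sketch, not anyone's production decoder):

    def classify(byte: int) -> str:
        """Tell what role a single byte plays in a UTF-8 stream."""
        if byte & 0b10000000 == 0:
            return "single-byte character (ASCII)"
        if byte & 0b11000000 == 0b10000000:
            return "continuation byte"
        return "leading byte of a multi-byte character"

    print(classify(0x41))  # 01000001 -> single-byte character (ASCII)
    print(classify(0xE2))  # 11100010 -> leading byte of a multi-byte character
    print(classify(0x82))  # 10000010 -> continuation byte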


The upper bits of the FIRST octet are used to determine the run length of the sequence. All of the other bytes in the sequence have the two-bit prefix 10 (i.e. 0b10xxxxxx), marking them as another 6 bits of data for the current character.

If synchronization is lost mid-character, by definition that interrupted character is lost. However, the very next complete character is clearly indicated by a byte beginning with either a 0 bit (a 7-bit character) or a run of 1s giving the octet count, followed by a 0 bit. A sketch of that run-length decoding follows the quoted table below.

This is covered in the section titled:

    Proposed FSS-UTF
    ----------------
    ...
       Bits  Hex Min  Hex Max  Byte Sequence in Binary
    1    7  00000000 0000007f 0vvvvvvv
    2   11  00000080 000007FF 110vvvvv 10vvvvvv
    3   16  00000800 0000FFFF 1110vvvv 10vvvvvv 10vvvvvv
    ... Examples trimmed for mobile.
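Decoding the run length from that first octet is just counting leading 1 bits; here's a rough Python sketch (assuming the argument really is the first byte of a sequence, not a continuation byte):

    def sequence_length(lead: int) -> int:
        """Number of bytes in the sequence started by this leading byte.

        A run of n leading 1 bits means an n-byte sequence; a leading 0 bit
        means a plain 7-bit, single-byte character.
        """
        if lead & 0b10000000 == 0:
            return 1
        n = 0
        while lead & 0b10000000:
            n += 1
            lead = (lead << 1) & 0xFF
        return n

    print(sequence_length(0x41))  # 0vvvvvvv -> 1
    print(sequence_length(0xC3))  # 110vvvvv -> 2
    print(sequence_length(0xE2))  # 1110vvvv -> 3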


All trailing bytes, and only trailing bytes, are of the form 10xxxxxx. If you read such a byte, you just have to iterate backwards until you find a non-trailing byte.
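A minimal sketch of that backwards scan (my own illustration, assuming random access to the buffer):

    def start_of_character(buf: bytes, i: int) -> int:
        """Walk backwards from index i to the byte that starts the
        character containing it; trailing bytes match 10xxxxxx."""
        while i > 0 and (buf[i] & 0b11000000) == 0b10000000:
            i -= 1
        return i

    buf = "a€b".encode("utf-8")        # 61 E2 82 AC 62
    print(start_of_character(buf, 3))  # index 3 is 0xAC -> returns 1 (0xE2)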


It is called a self-synchronizing code: a simple and beautiful design.

https://en.wikipedia.org/wiki/Self-synchronizing_code



