So now we know who is really responsible for the whole MySQL utf8mb4 fiasco -- these 2 guys sitting in a diner, conjuring up a brilliant scheme to cover over 2 billion characters, which turned out to exceed the actual requirement by nearly 2,000x.
September 1992: 2 guys scribbling on a placemat.
January 1998: RFC 2279 defines UTF-8 sequences as 1 to 6 bytes.
March 2001: A bunch of CJK characters were added in Unicode 3.1, pushing the total to 94,140 and exceeding the 16-bit range that 3-byte UTF-8 can cover.
November 2003: RFC 3629 restricts UTF-8 to 1 to 4 bytes.
Arguably, if the placemat had been smaller and the guys had stopped at 4 bytes after running out of space, perhaps MySQL would have done the right thing? Ah, who am I kidding. The same commit would likely still have happened.
EDIT: Just noticed this in the footnotes, and the plot thickens...
> The 4, 5, and 6 byte sequences are only there for
> political reasons. I would prefer to delete these.
This is also a very simple application of the idea of a "prefix-free code" from information theory and coding theory (the set of codes {0, 10, 110, 1110, 11110, ..., 111111} is prefix-free).
I think there's also the idea that the code can "sync up" when it, say, starts in the middle of a character.
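To make the prefix-free point concrete, here's a tiny Python check (the helper name is mine, just for illustration) that no UTF-8 byte marker is a prefix of another, which is exactly what lets you classify a byte in isolation:

    # UTF-8's byte markers: 0 (ASCII), 10 (continuation), 110/1110/11110 (leads)
    markers = ["0", "10", "110", "1110", "11110"]

    # Prefix-free: no marker is an initial segment of another marker
    def is_prefix_free(codes):
        return not any(a != b and b.startswith(a)
                       for a in codes for b in codes)

    print(is_prefix_free(markers))   # True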
Other people have answered your question, but I wanted to clarify one point. The word "character" here means "Unicode code point". However, what the user thinks of as a single character can be made up of more than one code point. This presents a different problem, and one UTF-8 itself can't help with.
The Unicode Consortium has a report on extended grapheme clusters[0] (i.e. user-perceived characters). Essentially, if you're processing some text mid-stream, it might not be clear whether a code point is the start of a new user-perceived character or not. So you may want to skip ahead until an unambiguous symbol boundary is reached.
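For a taste of the problem, the third-party Python regex module (one option among several; this snippet is just an illustration) can split text into extended grapheme clusters with \X:

    import regex   # third-party module (pip install regex), not the stdlib re

    s = "e\u0301 \U0001F1E8\U0001F1E6"   # 'e' + combining acute, space, flag CA
    print(len(s))                        # 5 code points
    print(regex.findall(r"\X", s))       # 3 user-perceived characters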
It’s fairly simple, actually: leading bytes have a specific bit pattern that continuation bytes don’t. A single-byte character will have the topmost bit unset (0b0xxxxxxx), and for a multi-byte run the first byte will have the top two bits set (0b11xxxxxx) and any succeeding bytes will have the top bit set but the next bit unset (0b10xxxxxx). This means given an arbitrary byte you can always tell what it is, and you can tell when you’re at the start of the next character by looking for those first two bit patterns.
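A minimal Python sketch of that test (names are mine):

    def classify(b):
        """Identify a UTF-8 byte from its top bits alone."""
        if b & 0b1000_0000 == 0:
            return "single-byte character"   # 0xxxxxxx
        if b & 0b1100_0000 == 0b1000_0000:
            return "continuation byte"       # 10xxxxxx
        return "lead byte"                   # 11xxxxxx starts a multi-byte run

    for b in "aé€".encode("utf-8"):
        print(f"{b:08b}  {classify(b)}")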
The upper bits of the FIRST octet determine the run length of the sequence. All of the other bytes in the sequence have their upper two bits set to 10 (0b10xxxxxx, i.e. 0x80-0xBF), marking each as another 6 bits of data for the current character.
If synchronization is lost mid-character, by definition that interrupted character is lost. However, the very next complete character will be clearly indicated by a byte beginning with either a 0 bit (a 7-bit character) OR a run of 1s giving the octet count, followed by a zero.
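A rough Python sketch of that recovery (function names are my own):

    def run_length(lead):
        """Octet count implied by a lead byte: its run of leading 1 bits."""
        if lead & 0x80 == 0:
            return 1                          # 0xxxxxxx: 7-bit character
        n = 0
        while lead & (0x80 >> n):             # count the leading 1s
            n += 1
        return n                              # 110xxxxx -> 2, 1110xxxx -> 3, ...

    def resync(data, i):
        """Skip continuation bytes (10xxxxxx) until the next character start."""
        while i < len(data) and data[i] & 0xC0 == 0x80:
            i += 1
        return i

    text = "日本".encode("utf-8")    # two 3-byte characters
    i = resync(text, 1)              # dropped in mid-character...
    print(i, run_length(text[i]))    # -> 3 3: next start at byte 3, a 3-byte run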
This is covered in the section titled:
Proposed FSS-UTF
----------------
...
Bytes  Bits  Hex Min   Hex Max   Byte Sequence in Binary
  1      7   00000000  0000007f  0vvvvvvv
  2     11   00000080  000007FF  110vvvvv 10vvvvvv
  3     16   00000800  0000FFFF  1110vvvv 10vvvvvv 10vvvvvv
... Examples trimmed for mobile.
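To turn those rows into code, here's a toy Python encoder covering just the three rows quoted above (a sketch under the same trimming; no validation, no 4+ byte rows):

    def encode_fss_utf(cp):
        """Encode a code point per the 1-3 byte rows of the table."""
        if cp <= 0x7F:                       # 0vvvvvvv
            return bytes([cp])
        if cp <= 0x7FF:                      # 110vvvvv 10vvvvvv
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp <= 0xFFFF:                     # 1110vvvv 10vvvvvv 10vvvvvv
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        raise ValueError("rows above 3 bytes trimmed, like the examples")

    assert encode_fss_utf(ord("é")) == "é".encode("utf-8")   # 0xC3 0xA9
    assert encode_fss_utf(ord("€")) == "€".encode("utf-8")   # 0xE2 0x82 0xAC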
All trailing bytes, and only trailing bytes, are of the form 10xxxxxx. If you read such a byte, you just have to iterate backwards until you find a non-trailing byte.
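In Python, that backward walk might look like this (helper name is mine):

    def char_start(data, i):
        """Step back from byte i to the first byte of its character."""
        while i > 0 and data[i] & 0xC0 == 0x80:   # 10xxxxxx: trailing byte
            i -= 1
        return i

    s = "x\U0001F642x".encode("utf-8")   # the emoji occupies bytes 1-4
    print(char_start(s, 3))              # mid-emoji -> back to index 1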
Can somebody explain, or link to an explanation of, how UTF-8 allows for this?