Hacker News
The history of UTF-8 as told by Rob Pike (2003) (cat-v.org)
239 points by quyleanh 2 days ago | 114 comments





If curious, past threads:

UTF-8 history (2003) - https://news.ycombinator.com/item?id=21212445 - Oct 2019 (52 comments)

UTF-8 History (2003) - https://news.ycombinator.com/item?id=19565980 - April 2019 (3 comments)

UTF-8 history - https://news.ycombinator.com/item?id=15236856 - Sept 2017 (1 comment)

UTF-8 history (2003) - https://news.ycombinator.com/item?id=8648541 - Nov 2014 (7 comments)

UTF-8 Original Proposal - https://news.ycombinator.com/item?id=6463466 - Sept 2013 (3 comments)

UTF-8 History - https://news.ycombinator.com/item?id=2081932 - Jan 2011 (2 comments)

The history of UTF-8 as told by Rob Pike - https://news.ycombinator.com/item?id=577116 - April 2009 (1 comment)

---

Don't miss this great link from the 2017 thread: https://www.flickr.com/photos/ajstarks/sets/7215763147079887...


I think we should take a moment to appreciate how great UTF-8 is, and how well it worked out. It's easy to get disillusioned with internet standards when IPv6 is taking forever and messaging is all proprietary locked down protocols. Yet character encodings used to be a horrible mess and now they're not. In the 90's the only practical solution was for everyone to use the same OS, same word processor, same web browser, and who cares about talking with foreigners anyway?

I don't think it was always guaranteed to turn out well. China and Japan could have stayed with their own encodings. Microsoft and Apple could have done incompatible things. The tech world is full of bad things we're stuck with because there's no way to coordinate a change.

Unicode has its flaws, UTF-16 is still lurking here and there, everyone loves to argue about emoji, but overall text just works now.


UTF-8 is just... so well designed.

One little feature I like in particular is that if you're looking for an ASCII-7 character in a UTF-8 stream -- say, a LF or comma -- you don't have to decode the stream first because all bytes in the encoding of non-ASCII-7 characters have the high bit set. Or as Wikipedia puts it:

> Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as / (slash) in filenames, \ (backslash) in escape sequences, and % in printf.

It's amazing to hear they put it together in one night at a diner! :-D
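The high-bit property described above can be sketched in a few lines of Python. This is only an illustration: the sample text and the `split_fields` helper are made up here, not taken from any library. It splits raw UTF-8 bytes on an ASCII comma without decoding first, which is safe because no byte of a multi-byte sequence falls in the ASCII range.

```python
def split_fields(raw, delimiter=b","):
    """Split undecoded UTF-8 bytes on one ASCII delimiter byte (hypothetical helper)."""
    if len(delimiter) != 1 or delimiter[0] >= 0x80:
        raise ValueError("delimiter must be a single ASCII byte")
    return raw.split(delimiter)

text = "渡邉,naïve,日本語"
raw = text.encode("utf-8")

# Bytes produced by the non-ASCII characters; all of them have the high bit set.
non_ascii_bytes = [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]
high_bit_always_set = all(b >= 0x80 for b in non_ascii_bytes)

# Split first, decode afterwards: every comma byte found is a real comma.
fields = [part.decode("utf-8") for part in split_fields(raw)]
```

The same trick would not be safe in an encoding like Shift-JIS, where the second byte of a two-byte character can fall in the ASCII range.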


> It's amazing to hear they put it together in one night at a diner! :-D

I guess you're saying that in good humor. But I'll add this because it makes me appreciate how these things happen:

> What happened was this. We had used the original UTF from ISO 10646 to make Plan 9 support 16-bit characters, but we hated it.

"We hated it" -- there is just so much going on in those 3 words. They could have been suffering with the previous state for a year for all we know. And even if not, to know you hate something just takes a lot of system building experience to get to. And then when opportunity struck they probably already had a laundry list of grievances they had built up over that time and were ready to pounce.


Yes, exactly!

If they hadn't had on-the-ground experience of the Plan 9 version, and been able to see what parts of it they wanted to keep and what parts needed to be done differently from that actual experience...

Often you can't build the polished thing until you have experienced the thing before.

Lately I get discouraged that there seems to be not so much attention to "prior art" in software development, that's the only way to make progress!


But to build it in 4 days!

This still strikes me as the height of 1990s programming moxie.


While the design is nice, it doesn't seem -that- earthshattering that it was done in four days. Once you make the realization that 'wait, ascii only needs the lower 7 bits, let's work off that', it's all just details past that.

Don't get me wrong, I love UTF-8 and it is well thought out and designed. But the end result is not so complicated, so much so that pretty much anyone reading the rules could understand it.

I think there was just a lot of low hanging fruit in the 90s that doesn't exist today, as they are solved problems. Today's 'amazing' things would involve image recognition or processing, self driving cars, better ML/AI algos. Things that are hard to impossible to be done by a guy or two over the weekend.

Sadly, as a result, I think we'll have fewer 'programming heroes' than existed in previous decades.


> While the design is nice, it doesn't seem -that- earthshattering that it was done in four days.

And yet it may have needed a genius to design and write something so simple. UTF-8 was not the first multi-lingual encoding system; here's an entire list of them, worked on by a lot of probably very smart people:

* https://en.wikipedia.org/wiki/Template:Character_encodings

It only seems 'obvious' in hindsight:

* https://en.wikipedia.org/wiki/Hindsight_bias

Edit: A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. — Antoine de Saint-Exupery


>I think there was just a lot of low hanging fruit in the 90s that doesn't exist today, as they are solved problems.

git was 2005, and that was probably similarly impactful in the version control space (in that it was much closer to fundamentally correct, than its predecessors). And there are quite a few standards out there that only survive by virtue of already having been established -- not because they meet any reasonable bar of quality. IPv4 (and all the grand schemes to work around the terror of NAT), email (the worst communication system, except for all the others), SQL (the language specifically -- a mishmash of keywords with almost no ability to properly compose), etc.

The bigger difference I think between the 90's and now is that it was probably much easier to make your new superior standard actually be used -- you could implement a new kernel today which was fantastically superior to linux, and you're much more likely than not to get zero traction (ex: plan9) simply by virtue of how well-entrenched linux already is.


> git was 2005

I'm not sure I'd consider git to be "low-hanging fruit"


Given that Torvalds apparently went from design to implementation in 3 days, and 2 months later had it officially managing the kernel, I wouldn’t say it was particularly high-hanging.

Pretty sure Git was a side project so that Linus could manage Linux source code like he wanted.

Yeah, this is great! I came across that recently when working on a parser in Zig, which treats strings as arrays of bytes. I didn't know much about UTF8 other than that it's scary and programmers mess up text processing all the time. I was worried that a multi byte code point could trick my simple char switch which was looking for certain ASCII characters. But then I came across that bit you quoted and was both surprised and relieved!

Then, when I needed to minimally handle non-ASCII characters I found Zig's minimal unicode helper library and saw what I was looking for in a small function that takes a leading byte and returns how many bytes there are in the codepoint. I was impressed with the spec again!
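For readers who want to see roughly what such a helper looks like, here is a sketch in Python. It is not Zig's actual function; the name `utf8_sequence_length` is invented for this example, and it only inspects the leading byte's bit pattern (it does not reject lead bytes such as 0xC0 that modern UTF-8 forbids).

```python
def utf8_sequence_length(lead):
    """Return how many bytes a UTF-8 sequence has, judging by its leading byte."""
    if lead < 0x80:
        return 1                      # 0xxxxxxx: plain ASCII
    if (lead & 0xE0) == 0xC0:
        return 2                      # 110xxxxx
    if (lead & 0xF0) == 0xE0:
        return 3                      # 1110xxxx
    if (lead & 0xF8) == 0xF0:
        return 4                      # 11110xxx
    # 10xxxxxx is a continuation byte; 11111xxx is not used by modern UTF-8.
    raise ValueError(f"0x{lead:02x} is not a valid leading byte")

# Compare against what Python's own encoder produces for a few characters.
lengths = {ch: utf8_sequence_length(ch.encode("utf-8")[0]) for ch in "aé邉😀"}
```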


> It's amazing to hear they put it together in one night at a diner! :-D

On the one hand, sure. But on the other you have Ken Thompson.


I wonder how many pieces of computing technology used today were put together in a single evening by a team of motivated developers. Rubygems, for example, was written in a couple of hours at the back of a hotel bar, then demoed (complete with network install and versioning) at Rubyconf the following morning.

As I age, I'm starting to believe that the best technology is often built this way, rather than stewing for years in an ISO subcommittee. Limited development time can lead to features that provide the greatest value for the time spent.


Here's a picture of Thompson designing UTF-8 on a placemat that night at the diner:

https://www.youtube.com/watch?v=mhvaeHoIE24&t=23m34s


Thanks for that link!

> It's amazing to hear they put it together in one night at a diner! :-D

I will bet that he had half formed ideas of how it could work from the previous pain with the "original UTF". The best people I work with are constantly looking at things that are wrong and coming up with ideas for how they could be better even if 99% of them will never be used.


I think this is more of a case where we were lucky, since most applications used 7-bit ASCII and the high bit was available for UTF-8 encoding.

> China and Japan could have stayed with their own encodings.

Absolutely correct. There was a big debate in Japan in the 1990s about character encodings, with some people arguing strongly against the adoption of Unicode. Their main argument, as I remember it, was that Unicode didn’t capture all of the variations in kanji, especially for personal names.

For those of us who were trying to use Japanese online at the time, though, those arguments seemed beside the point. While it would have been nice, in an ideal world, to be able to encode and display all 51 (at least) variations in the second kanji for the surname Watanabe [1], we were faced with the daily frustration of trying to convert between JIS, S-JIS, EUC, and other encodings and often not being able to exchange Japanese text at all with people who hadn’t installed special software on their computers. It was a great relief when UTF-8 became adopted universally.

And now we have emoji, too!

[1] https://www.fujitv-view.jp/gallery/post-149246/?imgid=1


Tell that to my coworkers. I still get emails encoded in SJIS every day, sometimes with attachments with the file name also encoded in SJIS, which results in funny mojibake when saving them to disk. Not to mention the many web forms that insist you need to write your name in full-width characters or whatever funky shit.

On the other hand, I recently got some Python scripts to crash because someone in the European team decided to encode some texts in ISO-8859-1 and Python assumes everything is in UTF-8.

I really, really wish one day all legacy encodings will disappear from the face of the Earth and only UTF-8 will stay.


Not to mention that the Linux unzip utility doesn't have a way to handle Shift-JIS filenames, or really any filename encodings besides UTF-8. You have to use an entirely different program like unzip-jp just for those files, in order to not be left with dozens of unintelligible folder names.

There's a reason the underground community calls it "shit-jizz."


Iconv is your friend.

On that issue, infsp6 (the Spanish library for Inform6, akin to the English inform6lib one) still uses iso8859-15 and it's a pain in the ass to convert the encoding to and from utf8 if you don't use emacs, joe, or vim to edit the source code (I use nvi).


Kill them all and let God sort out his own.

But at least it's not EBCDIC, the day I find that in the wild is the day I will retire from computers and become a farmer.


Out of curiosity, have you tried the UTF-8 decoder capability and stress test?

https://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html


No, why?

Thunderbird will display SJIS emails just fine. The problem with attachments is when someone adds a ZIP with SJIS filenames, but then it's not Thunderbird's problem but whatever tool you use to decompress it.

Regarding Python, the default behaviour when decoding an invalid UTF-8 string is to raise an exception. But your comment made me research it and I just found that there is a way to replace invalid bytes with U+FFFD, so I will try it.
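A minimal sketch of the two behaviours in Python, using a made-up sample string: strict decoding raises `UnicodeDecodeError`, while `errors="replace"` substitutes U+FFFD for the bytes that are not valid UTF-8.

```python
# "café" encoded as ISO-8859-1: the é becomes the single byte 0xE9,
# which is not a complete UTF-8 sequence on its own.
data = "café".encode("iso-8859-1")

try:
    data.decode("utf-8")              # strict mode is the default
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Replace undecodable bytes with U+FFFD instead of raising.
lossy = data.decode("utf-8", errors="replace")

# If the real encoding is known, decoding with it loses nothing.
exact = data.decode("iso-8859-1")
```

The replacement is lossy, so it is mostly useful for display or logging; when the source encoding is known, decoding with that encoding is the better fix.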


SJIS is still pretty actively used, and Han unification is the most likely culprit. In hindsight it really does feel like a mistake.

Han unification was definitely a mistake. To this day people in different countries will use different fonts so that text looks how it is supposed to in their language.

The promise of unicode was that you can losslessly convert any encoding to unicode. However, because of the failed attempt at Han unification, some important information can be lost.


Exactly, UTF-8 is great, except this part. In the name of unification you destroyed the culture that was there. But people these days are fine with it as long as it is not their culture. And no, adding fonts or notation doesn't solve the problem. I remember reading a very nicely put analogy on HN years ago.

Edit: https://news.ycombinator.com/item?id=8041288


Han unification isn't the main blocker to the transition to UTF-8 in Japan. Just use a Japanese font in this context.

SJIS is still used because there are many legacy systems, and many developers still think SJIS is fine. We tend not to handle other languages, so it mostly works (without emoji). SJIS is sometimes useful because a 1-byte character is half-width and a 2-byte character is full-width by design. Older developers still call Japanese characters "2-byte characters" even when the system is UTF-8.

Another reason is that the Windows OEM codepage is still CP932 (extended SJIS). It's a pain, like this: https://discuss.python.org/t/pep-597-enable-utf-8-mode-by-de...


Han unification is a bullshit excuse. Is two story 'a' a different letter than one story? Is seven with a slash through it different than seven without? Is Japanese as written in pre-war books a different language than Japanese in post-war books?

Unicode may have dropped a couple of variants, but they basically all got added back. There's no problem with Han unification; there's just a FUD campaign powered by nationalism and ignorance that is used to justify everyday technological inertia.


> While it would have been nice, in an ideal world, to be able to encode and display all 51 (at least) variations in the second kanji for the surname Watanabe [1],...

Unicode has gotten so big, isn’t this included by now?



Also see the IVD [1]. Indeed both 邉 (U+9089) and 邊 (U+908A) are exceptionally variable characters, the first having 32 variation sequences (the record as of 2020-11-06) and the second having 21 variation sequences.

[1] https://unicode.org/ivd/


Does the Japanese government use Hiragana on official documents in order to properly spell out names easily?

Either hiragana or katakana is used on most official documents for convenience. On the family registers (戸籍 koseki), which are perhaps the most important, though, the readings of names are not listed. For people whose names are written only in kanji, those kanji, and not the readings, are the legal versions of their names.

Yes, thank you for saying so!

Unicode often gets a lot of online hate, which frustrates me, as I agree with you -- Unicode in general is a remarkably successful standard, technically as well as with regard to adoption.

Its adoption success isn't a coincidence; it's a result of choices made in the design -- with UTF-8 being a big part of that. The choices sometimes involve trade-offs, which lead to the things people complain about (say, the two different codepoint arrangements which can be an é -- there's a reason for that, again related to easing the on-ramp to unicode from legacy technologies, as one of the main goals of UTF-8).
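To make the é example concrete, here is a small Python sketch using the standard `unicodedata` module. The two strings render the same but are different code point sequences until they are normalized to a common form.

```python
import unicodedata

composed = "\u00e9"       # é as one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

# The raw strings differ, both in length and in comparison.
raw_equal = composed == decomposed

# Normalizing both to the same form (NFC composes, NFD decomposes) makes them match.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed
```

In practice this means comparing or hashing user-supplied text usually wants a normalization step first.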

There are always trade-offs, nothing is perfect. But Unicode sometimes seems to me to be almost the optimal balance of all the different concerns, I think they could hardly have done better!

The "UCS=>UTF-16" mis-step was unfortunate, and we are still dealing with some of the consequences (Java/Windows)... but the fact that we made it through with Unicode adoption only continuing to grow, is a testament to Unicode's good design again.

It's not until I ran into some of the "backwaters" of Unicode, realizing they had thought out and specified how to do things like "case-insensitive" normalized collation/comparison for a variety of different specifications in a localized and reasonably performant way...

We are so lucky for Unicode.


> UTF-16 is still lurking here and there

As someone currently stuck in the windows world, this hurts. Every single Windows API is still stuck with using UTF-16/UCS2 as the string encoding.

Also fun fact, on the Nintendo Switch, various subsystems use different kinds of encoding. The filesystem submodule uses Shift-JIS, most of the other modules use UTF-8, but some others yet use UTF-16 (like the virtual keyboard, IIRC). A brilliant mess.


Windows finally added support for UTF-8 2 years ago: https://docs.microsoft.com/en-us/windows/uwp/design/globaliz...

Technically it's had support since (IIRC) Windows 7. What this does is call the translation functions for you instead of having to do it yourself.

Conversion functions - MultiByteToWideChar & co. - were in since Windows 2000 and the UTF8 codepage was supported as early as XP if not in W2K as well.

It existed in W2K and maybe even earlier, but there were bugs in the console regarding codepage 65001, so you couldn't use it as the default. This was not fixed yet in XP, maybe in 7 though.

Ah thanks! It's funny because I can't recall ever using code page 65001 before 7. Maybe there was a reason for that or maybe I simply didn't know it existed until then. Or maybe I thought it simpler to just use UTF-16. I can't remember.

It is not default configuration and it's marked as "experimental" in UI. I would never enable it for my PC, that's just absurd.

I mean that application developers can enable it in their manifest and don't have to do UTF-16 <-> UTF-8 conversions anymore.

I thought that's about user choosing UTF-8 as codepage in regional settings.

If it could be set in application manifest, that's a good thing.

Although I guess that in the end Windows will perform the same conversions in user32.dll, so it does not really matter.


Windows does allow setting it in the application's manifest but it also requires a registry setting to be enabled otherwise the manifest option is ignored. Obviously asking users to edit the registry is a non-starter so it's only used where the developers also control the user environment (e.g. the registry change is deployed through group policies, etc).

But yeah, this just tells Windows to do the conversion so that programmers don't have to type out the function calls themselves. It's simple enough to create a wrapper function for Windows API calls in any case.


Yeah, the "might" here is doing a lot of work

> As Windows operates natively in UTF-16 (WCHAR), you might need to convert UTF-8 data to UTF-16 (or vice versa) to interoperate with Windows APIs.


Java is still using UTF-16, it is the internal format used since its creation. I don't know exactly how much this is a problem or not, but it shows that UTF-16 is still an important thing.

I think it's a huge problem for Java. Try doing proper string collation (standard library or ICU4J), or regular expression matching, in a context where your strings are all UTF-8 and your output should also be UTF-8. Operations that shouldn't require allocation do, because you have to transcode to UTF-16. Not to mention that in some cases, that transcoding is the most expensive part of the operation.

All the core Java APIs are built around String or CharSequence (more the latter in releases post-Java 8). CharSequence is a terrible interface for supporting UTF-8 or any encoding besides latin1 or UTF-16. If Java's interfaces had been designed around Unicode codepoint iteration rather than char random access, then the coupling to UTF-16 wouldn't have been so tight. But as things stand, you aren't doing anything interesting to text in Java without either (1) re-implementing everything from scratch, from integer parsing to regexp, or (2) paying the transcode cost on everything your program consumes and emits.


It's a huge problem. UTF-16 is a big big pain.

JavaScript (ECMAScript) too has this problem.


Personally I don't find UTF-16 to be too bad. It's a simple encoding and very easy to convert to/from UTF-8. So your program can be written in UTF-8 and your WinAPI wrappers can convert as/when needed.

The bad thing with UTF-16 is that so much software assumes that one code point always is 16 bits.
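A short Python sketch of why that assumption breaks. The `utf16_code_units` helper is invented for this example; it just lists the 16-bit units that the standard `utf-16-le` codec produces.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text in UTF-16 (hypothetical helper)."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

# A code point inside the Basic Multilingual Plane is one 16-bit unit...
bmp_units = utf16_code_units("é")        # U+00E9

# ...but a code point above U+FFFF takes two units, a surrogate pair.
emoji_units = utf16_code_units("😀")     # U+1F600
```

Here `bmp_units` is `[0x00E9]` and `emoji_units` is `[0xD83D, 0xDE00]`, so code that counts or indexes by 16-bit unit will split the emoji in half.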

Which is not UTF-16 at all, UTF-16 standard clearly says this is not so. So why do they do that?

It's actually a leftover of the earlier UCS-2 standard, before it was realized we'd need more codepoints than that, and that it was a mistake to limit to 16-bit space for codepoints in any encoding.

Software written for UCS-2 can mostly work compatibly with UTF-16, but there are some problems, encoding the 'higher' codepoints is only one of several. Another is how right-to-left scripts are handled.

http://www.differencebetween.net/technology/software-technol...

https://unicode.org/faq/utf_bom.html#utf16-11


Wasn't UTF-16 explicitly created as a "backward compatibility hack" for UCS-2 when it became clear that 16 bits per code point isn't enough? They should have ditched 16-bit encodings back then instead of combining the disadvantages of UTF-8 (variable length-encoding) and UTF-32 (not endian-agnostic).

Perhaps unicode wouldn't be nearly as successfully adopted as it is, if they had left UCS-2 adopters hanging instead of providing them a "backward compatibility hack" path.

The UCS-2 adopters after all had been faithfully trying to implement the standard at that time. Among other things, showing implementers that if they choose to adopt, you aren't going to leave them hanging out to dry when you realize you made a mistake in the standard, will give other people more confidence to adopt.

But also, just generally I think a lesson of unicode's success -- as illustrated by UTF-8 in particular -- is, you have to give people a feasible path from where they are to adoption, this is a legitimate part of the design goals of a standard.


Most of the hard stuff is there no matter the encoding (normalization, user-perceived characters spanning multiple code units, paths vs strings, ...).

It's the same with html and css: people shit on it all the time, but this just shows they don't have the imagination to see how much worse it could be.

Just compare to e.g. Photoshop file format: https://github.com/gco/xee/blob/master/XeePhotoshopLoader.m#...


The photoshop file format is fine for what it is. The format to explode your head is the MS Office .doc format.

Sort of related: I learned from reading about Facebook's lack of moderation that Myanmar is one of the few countries that doesn't use Unicode (and hence UTF-8). It uses something called Zawgyi that apparently has to be heuristically detected!

https://en.wikipedia.org/wiki/Myanmar_(Unicode_block)#Histor...

https://www.globalapptesting.com/blog/zawgyi-vs-unicode


Facebook is despicable and indefensible. They knew that they could not moderate Myanmar. They knew or should have known that it was a volatile political situation. The amount of money involved could not have been more than a few million dollars. They should have just turned everything off and said we'll come back when we can. It's disgusting what they did and they should never be forgiven for putting market position ahead of human lives.

ttf is also nearly universally supported. Working with text correctly is just really hard so I think people like to reach for stuff that already does it for them.

More or less the same story but in an explainer video:

'Characters, Symbols and the Unicode Miracle - Computerphile'

https://www.youtube.com/watch?v=MijmeoH9LT4


I also want to point out this classic, to better understand Unicode in general. https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

This video is fantastic, especially if you've never really understood how Unicode worked before. Keeps it simple and gets to the point.

I remember seeing this a couple of years ago and thinking "If only this existed when I learned about UTF-8, this would've saved me a lot of bad explanations and time".

This is now my go-to video if anyone asks me how Unicode works.


> This video is fantastic, especially if you've never really understood how Unicode worked before.

Can you people really learn things on videos? My brain sort of shuts down on audiovisual material, I can only really watch and understand light movies. For more complicated material, I can only learn it by reading. There's so much essential back and forth that is impossible on a video. With a text, you have everything already in front of you and your eyes and mind can wander freely. Maybe it's just me, but I really can't stand the fixed, inflexible rhythm that is imposed by listening to speech.


It is really common to read only a glimpse of the entire text and trick yourself to have understood it, only realizing that mistake later. In some sense the text gives you so much information that your brain can suffer frame drop; that's something you should be aware of when you read (for example, you need to rephrase what you understood yourself). By comparison, a well-paced video can give you exactly the amount of information you need to tinker with before moving on. I do agree that well-paced videos are much rarer than well-written text on the internet.

> trick yourself to have understood it, only realizing that mistake later.

Exactly. My point is that technical text is never read linearly (like a video). Reading is an active process, where you scan the whole page repeatedly for all the displayed formulas, then for appearances of these formulas inside the text, then peek at the figure, then read some words in a paragraph while looking from time to time at the figure in case it is referenced by the text. After a few minutes you have grasped everything. At least this is how I read. Looking at a video is so passive and linear that you get bored after a few seconds.


Thank you! Frame-drop is a brilliant analogy.

It wasn’t until I started learning networking concepts from a third-level/college text book that I picked up in a second-hand shop that I realised how much my brain fools me into thinking I’m absorbing information encoded in words and diagrams. The end of each chapter had questions based on the material covered in that chapter and it was only while attempting to answer them that I realised how much I had thought I’d absorbed – but hadn’t.

When buying technical books, I now try to get ones that have questions or exercises at the end of each chapter. If not, I take notes while reading by attempting to summarise each section in my own words. Answering technical questions for other people is also a great way of consolidating knowledge and filling the gaps in my own understanding.


Have you tried taking notes?

Maybe a silly question, but I do this a lot when I'm trying to pick up technical information from a video or a presentation, and it seems to help not least because the result is a durable, textual reference that also can provide starting points for further research. Too, you can pause a video and add a timestamp to your notes as an indexing tool for review.

I do this with a pen and a notebook, not digitally. I don't know about the vaunted "writing by hand helps memories form" effect; I can't say I've observed a dramatic difference, but maybe that's just because I keep my work and technical notes close to hand and refer to them when I need them. That said, I do recommend paper notes over digital ones for stuff like this, if only for the sake of simplicity, reliability, and ease of use.


I also like text, but if you're genuinely asking, yes! Video evidently works great for a lot of people. You can't seek it as well as text, so instead rely on your memory more to remember things that are still unclear or don't make sense yet, and see when they are explained later.

It's okay to have different learning styles, I think; it's not too surprising to me that some prefer different media.


It's precisely the seeking that kills me. The only interface for seeking a video is a tiny, one-dimensional bar. How do you remember exactly where the right information occurs? On the other hand, text is visible as a whole, and the "seeking" is two-dimensional. Much more efficient to seek. I have a good visual memory and I tend to remember exactly where on a page the formula that I want appears, and just glance away at it without moving any muscles other than my eyes. To seek a part on a video, the process is overwhelmingly more complicated. And then the place where I was before the seek is lost, and you have to seek for it again. Argh!

Thanks for the answer, it turns out that we people are wildly different to each other.


Video can have chapters too, but I generally agree. The ideal format is probably text with pictures, animations, interactive models... when needed. It's a real shame that we don't have an electronic document format...

Interesting comment near the end, "2. The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these."

If that had happened, I guess emojis as we know them today might never have happened, since it would have limited us to 16 bits of code points. Or we would have had to start doing surrogate pairs even in UTF-8. Close call.


Unicode was originally meant to fit everything into 16 bits. UTF-8 is just a way to encode, it doesn't decide what goes into Unicode.

> UTF-8 is just a way to encode, it doesn't decide what goes into Unicode.

If UTF-8 had not had sequences above 3 bytes, you would not have been able to use it to express Unicode characters as high as Emoji which would certainly have hampered their adoption, is what the person you're replying to means.


While your conclusion is largely correct, it doesn't follow from your premises: UTF-16 is just a way to encode, but its brain-damaged surrogate pair mechanism very much did get baked into Unicode (namely, high and low surrogate code points D800-DFFF).

The 5-6 byte variants (and also 4 at the time) exist because of the need to round-trip UCS surrogate pairs through UTF-8, no? That's what I assume the "political reasons" are...

Others have already answered why surrogate pairs are irrelevant (and not UCS), but I think it's worth saying what the probable actual reason for 5-6 byte variants was. Remember that UCS and Unicode were at this point still two separate things; Unicode was supposed to be 16-bit (and later it got expanded, causing the whole surrogates mess), while UCS was supposed to be 31-bit. I assume the 5-6 byte variants were for UCS (back before it got merged with Unicode).

Surrogate pairs are only in UTF-16 so as to encode code points that require more than 16 bits. UTF-8 has no need of them because it's already a variable width encoding.

If there were no code points larger than 16 bits then UTF-8 would only need a maximum of 3 bytes per code point and UTF-16 wouldn't need surrogate pairs. Well actually UTF-16 probably wouldn't exist at all because UCS-2 would have been enough for everybody.


No. They exist to encode 31 bits of codepoint space, but later the UC decided to limit the codepoint space to only 21 bits because that is what UTF-16 is limited to, and then UTF-8 no longer needed to support sequences of 5 and 6 bytes.

I don't think so? Aren't UCS surrogate pairs at most 16bit each by their very purpose? Also, >16bit unicode code points came much later, I believe, in Unicode 2.0 in 1996 according to Wikipedia (vs UTF-8 which is from around 1992)


Was always a bit surprised that if you had the length encoding byte, that it didn't also imply an offset by the values that should have been encoded with a shorter sequence

This is there to make decoding faster. I think it's a mistake.

The continuation bits also allow backwards traversal and proper null byte decoding.

> the ability to synchronize a byte stream picked up mid-run, with less that one character being consumed before synchronization

Can somebody explain or link to an explanation on how UTF-8 allows for this?


Note that it says less than one character. A character in UTF-8 can be composed of multiple bytes.

The encoding scheme is laid out in the linked email. Based on the high bits it's possible to detect when a new character starts. Relevant portion:

  We define 7 byte types:
  T0 0xxxxxxx      7 free bits
  Tx 10xxxxxx      6 free bits
  T1 110xxxxx      5 free bits
  T2 1110xxxx      4 free bits
  T3 11110xxx      3 free bits
  T4 111110xx      2 free bits
  T5 111111xx      2 free bits

  Encoding is as follows.
  From hex  Thru hex      Sequence             Bits
  00000000  0000007f      T0                   7
  00000080  000007FF      T1 Tx                11
  00000800  0000FFFF      T2 Tx Tx             16
  00010000  001FFFFF      T3 Tx Tx Tx          21
  00200000  03FFFFFF      T4 Tx Tx Tx Tx       26
  04000000  FFFFFFFF      T5 Tx Tx Tx Tx Tx    32
[...]

  4. All of the sequences synchronize on any byte that is not a Tx byte.
If you are starting mid-run, skip initial Tx bytes. That will always be less than one character.

Note that UTF-8 has since been restricted to at most 4 bytes (i.e. the longest sequence is `T3 Tx Tx Tx`).
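As a minimal sketch of rule 4 (the function name `resync` is mine, not from the email), skipping leading Tx bytes in Python could look like this:

```python
def resync(buf):
    """Drop leading continuation (Tx, 10xxxxxx) bytes from a buffer picked up mid-run."""
    i = 0
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return buf[i:]

data = "na\u00efve caf\u00e9".encode("utf-8")
chunk = data[3:]          # starts on the second byte of the 2-byte U+00EF
print(resync(chunk).decode("utf-8"))
```

Only the partial character at the front is lost; everything after it decodes normally.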

So now we know who is really responsible for the whole MySQL utf8mb4 fiasco -- these 2 guys sitting in a diner, conjuring up a brilliant scheme to cover 4 billion characters, which turned out to exceed the actual requirement by more than 2000x.

September 1992: 2 guys scribbling on a placemat.

January 1998: RFC 2279 defines UTF-8 to be between 1 and 6 bytes.

March 2001: A bunch of CJK characters were added in Unicode Data 3.1.0, pushing the total to 94,140, exceeding the 16-bit limit of 3-byte UTF-8.

March 2002: MySQL added support for UTF-8, initially setting the limit to 6 bytes (https://github.com/mysql/mysql-server/commit/55e0a9c)

September 2002: MySQL decided to reduce the limit to 3 bytes, probably for storage efficiency reasons (https://github.com/mysql/mysql-server/commit/43a506c, https://adamhooper.medium.com/in-mysql-never-use-utf8-use-ut...)

November 2003: RFC 3629 defines UTF-8 to be between 1 and 4 bytes.

Arguably, if the placemat was smaller and the guys stopped at 4 bytes after running out of space, perhaps MySQL would have done the right thing? Ah, who am I kidding. The same commit would likely still happen.

EDIT: Just noticed this in the footnotes, and the plot thickens...

> The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these.

So UTF-8 was indeed intended to be utf8mb3!


This is also a very simple use of the idea of a "prefix-free code" from information theory and coding (the set of codes {0,10,110,1110,11110,...,111111} is prefix-free).

I think there's also the idea that the code can "sync up" when it, say, starts in the middle of a character.
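For anyone who wants to check the prefix-free claim mechanically, here is a quick Python sketch. The list is my reading of the seven byte-type prefixes from the table quoted earlier in the thread (T0, Tx, T1 through T5).

```python
from itertools import permutations

# Leading bit patterns of the seven byte types: T0, Tx, T1, T2, T3, T4, T5.
prefixes = ["0", "10", "110", "1110", "11110", "111110", "111111"]

# Prefix-free means no code is a prefix of a different code.
prefix_free = not any(b.startswith(a) for a, b in permutations(prefixes, 2))
print(prefix_free)
```

Because no pattern is a prefix of another, the type of any byte can be decided from its leading bits alone, without looking at neighbouring bytes.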


Other people have answered your question but I wanted to clarify one point. The word "character" here means "unicode code point". However, what the user thinks of as a single character can be made up of more than one code point. This presents a different problem and one UTF-8 itself can't help with.

The Unicode Consortium has a report on extended grapheme clusters[0] (i.e. user-perceived characters). Essentially, if you're processing some text mid stream, it might not be clear if a code point is the start of a new user-perceived character or not. So you may want to skip ahead until an unambiguous symbol boundary is reached.

[0]: https://www.unicode.org/reports/tr29/
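A tiny Python illustration of the gap between code points and user-perceived characters, using only the standard `unicodedata` module. It does not implement TR29 segmentation; it only shows that one visible character can be two code points.

```python
import unicodedata

s = "e\u0301"                    # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(s))                    # number of code points
print(len(s.encode("utf-8")))    # number of UTF-8 bytes

composed = unicodedata.normalize("NFC", s)
print(len(composed), hex(ord(composed)))
```

Normalization happens to merge this particular pair into one code point, but many clusters (emoji sequences, for example) have no precomposed form, which is why the boundary rules in the report are needed.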


It’s fairly simple, actually: leading bytes have a specific bit pattern that continuation bytes don’t. A single-byte character will have the topmost bit unset (0b0xxxxxxx), and for a multi-byte run the first byte will have the top two bits set (0b11xxxxxx) and any succeeding bytes will have the top bit set but the next bit unset (0b10xxxxxx). This means given an arbitrary byte you can always tell what it is, and you can tell when you’re at the start of a next character by looking for those first two bit patterns.

The upper bits of the FIRST octet are used to determine the run length of the sequence. All of the other bytes in the sequence use the upper two bits (binary 10, i.e. (byte & 0xC0) == 0x80, or 0b10xxxxxx) to indicate that it's another 6 bits of data for the current character.

If synchronization is lost mid-character, by definition that interrupted character is lost. However, the very next complete character will be clearly indicated by a byte beginning with either a 0 bit (a 7-bit character) OR a run of two or more 1s indicating the octet count, followed by a zero.

This is covered in the section titled:

    Proposed FSS-UTF
    ----------------
    ...
       Bits  Hex Min  Hex Max  Byte Sequence in Binary
    1    7  00000000 0000007f 0vvvvvvv
    2   11  00000080 000007FF 110vvvvv 10vvvvvv
    3   16  00000800 0000FFFF 1110vvvv 10vvvvvv 10vvvvvv
    ... Examples trimmed for mobile.
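Roughly, reading the run length off the first octet could look like this in Python. It is a sketch limited to the 1 to 4 byte forms; `sequence_length` is a made-up name, and it does not reject lead bytes that modern UTF-8 disallows for other reasons (such as 0xC0).

```python
def sequence_length(lead):
    """Return the sequence length implied by a lead byte (1 to 4 byte forms only)."""
    if (lead & 0x80) == 0x00:   # 0vvvvvvv
        return 1
    if (lead & 0xE0) == 0xC0:   # 110vvvvv
        return 2
    if (lead & 0xF0) == 0xE0:   # 1110vvvv
        return 3
    if (lead & 0xF8) == 0xF0:   # 11110vvv
        return 4
    raise ValueError("not a lead byte")  # 10vvvvvv, or a 5/6-byte lead

for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(hex(encoded[0]), sequence_length(encoded[0]), len(encoded))
```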

All trailing bytes and only trailing bytes are of the form 10xxxxxx. If you read such a byte you just have to iterate backwards until you find a non-trailing byte.
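In Python that backwards walk is only a few lines. This is a sketch that assumes the buffer holds valid UTF-8; `char_start` is just an illustrative name.

```python
def char_start(buf, i):
    """Walk backwards from index i to the first byte of the character containing it."""
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i

data = "a\u20acb".encode("utf-8")   # bytes: 61 e2 82 ac 62
print([char_start(data, i) for i in range(len(data))])
```

In valid UTF-8 the loop takes at most three steps, since a character has at most three trailing bytes.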

This is called a self-synchronizing code, a simple and beautiful design.

https://en.wikipedia.org/wiki/Self-synchronizing_code


Those of us from NJ really need to know what diner though...

> UTF-8 was designed, in front of my eyes, on a placemat in a New Jersey diner one night in September or so 1992.


“The diner was the Corner Café in New Providence, New Jersey.” — Rob Pike, https://commandcenter.blogspot.com/2020/01/utf-8-turned-20-y...

  > helix: Sep  8 03:22:13: ken: upas/sendmail: remote inet!xopen.co.uk!xojig 
  > From ken Tue Sep  8 03:22:07 EDT 1992 ([email protected]) 6833
It was 1992 and they were still using UUCP-style addressing.

ACM should put a plaque at that diner.

Yes, but at what diner? Wikipedia says NJ has about 525.

We could ask. I've always assumed Prestige Diner & Restaurant just because it is close to the Murray Hill building.

Another commenter stated it's Corner Café in New Providence [1]

[1] https://news.ycombinator.com/item?id=26739049


One of Rob's ideas - the GOPATH - is something that I enjoy very much and even port the concept to other languages, sometimes. However, it seems that the community as a whole didn't embrace it. I feel isolated, but in good company :D

I liked GOPATH at first, but the limitations of projects stepping on each other quickly made me sad about it. Now that Go modules have removed the need for GOPATH, I'm very happy about it being obsolete. :)

Please tell us more?

Designing and coding up something of this importance in less than a week is impressive. Strangely, what I'm more impressed with is that, in 2003 - 11 years after all of this happened - Bell Labs still had _sendmail logs for two users of their system_. How much history do we throw away by trashing logs nowadays? Sure, there's probably a lot more traffic... but wow.

Plan 9 always had great archival utilities. I do not know what they used in those early days, but I guess that at the point rsc recovered it, it was stored in the venti server: https://p9f.org/sys/doc/venti/venti.html

similar post of the same email/story, 15 days ago and a year ago

here's more discussion from then: https://news.ycombinator.com/item?id=21212445


The transition from 32 bit to 64 bit words was a missed opportunity to move from 8 bit to 32 bit bytes, which would have greatly simplified the fundamental aspect that is dealing with text in computers.

Some architectures use 8 bit bytes, but not really. What I mean is: some architectures require a word bigger than 8 bits (16+) be aligned. As in: to read a 32 bit word from memory, the first octet must have an address ending in two zero bits. But if you want to read only a byte, alignment doesn’t matter. IIRC, ARM used to have this “alignment” requirement.

Why? When?

One thing to realize is that UTF-8 is here and required only because we're all stuck with constraints inherited from low-power, limited early systems.

There is not a single reason, other than history, backward compatibility, and the disproportionate pressure the existing ecosystem puts on all future systems, why the basic memory unit is 8 bits.

We're stuck on 8 bits because all APIs, OSes, and specifications assume the basic unit is 8 bits. If computer memory had evolved faster, or computer usage spread slower, the basic unit could have been 32 bits, or even 64 bits.

Think how much easier things would be. UTF could be encoded as a single unit, without ambiguous order.

What we've lost due to bytes and UTF-8 is, for example, the text editor as the universal editor. Because of the need for UTF-8 decoding, we're no longer able to open any (possibly binary) file as text and save it reliably after editing one character.

There are other consequences, for example at the file-system level for file names, corruption, how to handle invalid UTF-8 sequences, etc.

(And before you talk about the byte order problem, that is the point: there is no order problem for the basic unit you choose. No one (outside of some low-level 1-bit serial interfaces) cares about the bit order or nibble order of bytes, because the bit order is hidden within all interfaces that accept the byte as the unit.)


It's hard to argue against a counterfactual, but I believe there is a "sweet spot" for the size of the basic memory unit, and that "sweet spot" is probably in the range from 4 to 16 bits (power of two sizes are desirable, so the possible sizes would be 4 bits, 8 bits, and 16 bits), so even if computers evolved from scratch today already starting with large memories, the basic unit of memory addressing still wouldn't be that big. My guess is that the chosen size would be 16 bits.


