UTF-8 history (2003) - https://news.ycombinator.com/item?id=21212445 - Oct 2019 (52 comments)
UTF-8 History (2003) - https://news.ycombinator.com/item?id=19565980 - April 2019 (3 comments)
UTF-8 history - https://news.ycombinator.com/item?id=15236856 - Sept 2017 (1 comment)
UTF-8 history (2003) - https://news.ycombinator.com/item?id=8648541 - Nov 2014 (7 comments)
UTF-8 Original Proposal - https://news.ycombinator.com/item?id=6463466 - Sept 2013 (3 comments)
UTF-8 History - https://news.ycombinator.com/item?id=2081932 - Jan 2011 (2 comments)
The history of UTF-8 as told by Rob Pike - https://news.ycombinator.com/item?id=577116 - April 2009 (1 comment)
Don't miss this great link from the 2017 thread: https://www.flickr.com/photos/ajstarks/sets/7215763147079887...
I don't think it was always guaranteed to turn out well. China and Japan could have stayed with their own encodings. Microsoft and Apple could have done incompatible things. The tech world is full of bad things we're stuck with because there's no way to coordinate a change.
Unicode has its flaws, UTF-16 is still lurking here and there, and everyone loves to argue about emoji, but overall text just works now.
One little feature I like in particular is that if you're looking for an ASCII-7 character in a UTF-8 stream -- say, an LF or a comma -- you don't have to decode the stream first, because every byte in the encoding of a non-ASCII-7 character has the high bit set. Or as Wikipedia puts it:
> Since ASCII bytes do not occur when encoding non-ASCII code points into UTF-8, UTF-8 is safe to use within most programming and document languages that interpret certain ASCII characters in a special way, such as / (slash) in filenames, \ (backslash) in escape sequences, and % in printf.
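A minimal sketch of that property in Python (the sample text is mine): splitting a raw UTF-8 byte stream on an ASCII delimiter is safe without decoding, because a delimiter byte can never appear inside a multi-byte sequence:

    # Every byte of a multi-byte UTF-8 sequence is >= 0x80, so the
    # LF byte 0x0A can only ever be a real line feed.
    data = "naïve, café\nsecond line".encode("utf-8")
    lines = data.split(b"\n")              # operates on raw bytes
    print([chunk.decode("utf-8") for chunk in lines])
    # ['naïve, café', 'second line']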
It's amazing to hear they put it together in one night at a diner! :-D
I guess you're saying that in good humor. But I'll add this because it makes me appreciate how these things happen:
> What happened was this. We had used the original UTF from ISO 10646 to make Plan 9 support 16-bit characters, but we hated it.
"We hated it" -- there is just so much going on in those 3 words. They could have been suffering with the previous state for a year for all we know. And even if not, to know you hate something just takes a lot of system building experience to get to. And then when opportunity struck they probably already had a laundry list of grievances they had built up over that time and were ready to pounce.
If they hadn't had on-the-ground experience with the Plan 9 version, and been able to see from that actual experience which parts they wanted to keep and which parts needed to be done differently...
Often you can't build the polished thing until you have experienced the thing before.
Lately I get discouraged that there seems to be so little attention paid to "prior art" in software development; that's the only way to make progress!
This still strikes me as the height of 1990s programming moxie.
Don't get me wrong, I love UTF-8 and it is well thought out and designed. But the end result is not so complicated, so much so that pretty much anyone reading the rules could understand it.
I think there was just a lot of low-hanging fruit in the 90s that doesn't exist today, as those are now solved problems. Today's 'amazing' things would involve image recognition or processing, self-driving cars, better ML/AI algos. Things that are hard to impossible for a guy or two to do over a weekend.
Sadly, as a result, I think we'll have fewer 'programming heroes' than existed in previous decades.
And yet it may have needed a genius to design and write something so simple. UTF-8 was not the first multi-lingual encoding system; here's an entire list of them, worked on by a lot of probably very smart people:
It only seems 'obvious' in hindsight:
Edit: A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. — Antoine de Saint-Exupery
git was 2005, and that was probably similarly impactful in the version control space (in that it was much closer to fundamentally correct than its predecessors). And there are quite a few standards out there that only survive by virtue of already having been established -- not because they meet any reasonable bar of quality. IPv4 (and all the grand schemes to work around the terror of NAT), email (the worst communication system, except for all the others), SQL (the language specifically -- a mishmash of keywords with almost no ability to properly compose), etc.
The bigger difference I think between the 90s and now is that it was probably much easier then to make your new superior standard actually be used -- you could implement a new kernel today which was fantastically superior to Linux, and you're much more likely than not to get zero traction (ex: Plan 9), simply by virtue of how well-entrenched Linux already is.
I'm not sure I'd consider git to be "low-hanging fruit"
Then, when I needed to minimally handle non-ASCII characters, I found Zig's minimal Unicode helper library and saw what I was looking for: a small function that takes a leading byte and returns how many bytes there are in the codepoint. I was impressed with the spec again!
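I don't have the Zig code at hand, but a rough Python equivalent of that helper (the name is mine) might look like:

    def utf8_sequence_length(leading_byte: int) -> int:
        # Classify the leading byte by its high bits.
        if leading_byte < 0x80:   # 0xxxxxxx: ASCII, 1 byte total
            return 1
        if leading_byte < 0xC0:   # 10xxxxxx: a continuation byte
            raise ValueError("not a leading byte")
        if leading_byte < 0xE0:   # 110xxxxx
            return 2
        if leading_byte < 0xF0:   # 1110xxxx
            return 3
        if leading_byte < 0xF8:   # 11110xxx
            return 4
        raise ValueError("invalid leading byte")

    assert utf8_sequence_length("é".encode("utf-8")[0]) == 2
    assert utf8_sequence_length("😀".encode("utf-8")[0]) == 4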
On the one hand, sure. But on the other you have Ken Thompson.
As I age, I'm starting to believe that the best technology is often built this way, rather than stewing for years in an ISO subcommittee. Limited development time can lead to features that provide the greatest value for the time spent.
I will bet that he had half-formed ideas of how it could work from the previous pain with the "original UTF". The best people I work with are constantly looking at things that are wrong and coming up with ideas for how they could be better, even if 99% of them will never be used.
Absolutely correct. There was a big debate in Japan in the 1990s about character encodings, with some people arguing strongly against the adoption of Unicode. Their main argument, as I remember it, was that Unicode didn’t capture all of the variations in kanji, especially for personal names.
For those of us who were trying to use Japanese online at the time, though, those arguments seemed beside the point. While it would have been nice, in an ideal world, to be able to encode and display all 51 (at least) variations in the second kanji for the surname Watanabe, we were faced with the daily frustration of trying to convert between JIS, S-JIS, EUC, and other encodings and often not being able to exchange Japanese text at all with people who hadn't installed special software on their computers. It was a great relief when UTF-8 became universally adopted.
And now we have emoji, too!
On the other hand, I recently had some Python scripts crash because someone on the European team decided to encode some text in ISO-8859-1, and Python assumes everything is in UTF-8.
I really, really wish that one day all legacy encodings will disappear from the face of the Earth and only UTF-8 will remain.
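For what it's worth, the usual fix on the Python side is to name the encoding explicitly rather than letting the UTF-8 default win; a toy illustration:

    raw = "café".encode("iso-8859-1")   # b'caf\xe9': one byte for é
    print(raw.decode("iso-8859-1"))     # 'café' -- the right codec
    try:
        raw.decode("utf-8")             # 0xE9 isn't valid UTF-8 here
    except UnicodeDecodeError as e:
        print(e)
    # Same idea for files: open(path, encoding="iso-8859-1")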
There's a reason the underground community calls it "shit-jizz."
On that issue, infsp6 (the Spanish library for Inform 6, akin to the English inform6lib) still uses ISO-8859-15, and it's a pain in the ass to convert the encoding to and from UTF-8 if you don't use emacs, joe, or vim to edit the source code (I use nvi).
But at least it's not EBCDIC, the day I find that in the wild is the day I will retire from computers and become a farmer.
Thunderbird will display SJIS emails just fine. The problem with attachments is when someone adds a ZIP with SJIS filenames, but then it's not Thunderbird's problem but that of whatever tool you use to decompress it.
Regarding Python, the default behaviour when decoding an invalid UTF-8 string is to raise an exception. But your comment made me research it, and I just found that there is a way to replace invalid bytes with U+FFFD, so I will try it.
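In case it saves someone else the search: it's the errors argument to decode() (and to open()); for instance:

    bad = b"caf\xe9"                               # Latin-1 bytes, invalid as UTF-8
    print(bad.decode("utf-8", errors="replace"))   # 'caf\ufffd' (U+FFFD)
    print(bad.decode("utf-8", errors="ignore"))    # 'caf'
    # open(path, encoding="utf-8", errors="replace") behaves the same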
The promise of unicode was that you can losslessly convert any encoding to unicode. However, because of the failed attempt at Han unification, some important information can be lost.
SJIS is still used because there are many legacy systems and developers who still think SJIS is fine. We don't tend to handle other languages, so it mostly works (without emoji).
SJIS is sometimes useful because a 1-byte character is half width and a 2-byte character is full width by design. Old developers still call Japanese characters "2-byte characters" even when the system is UTF-8.
Another reason is that the Windows OEM codepage is still CP932 (extended SJIS). It's a pain, like this: https://discuss.python.org/t/pep-597-enable-utf-8-mode-by-de...
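The "2-byte character" habit is easy to demonstrate with Python's standard codecs: under Shift-JIS, full-width characters really are 2 bytes, while UTF-8 spends 3 on most of them:

    text = "カタカナ"                       # full-width katakana
    print(len(text.encode("shift_jis")))   # 8  -> 2 bytes per character
    print(len(text.encode("utf-8")))       # 12 -> 3 bytes per character
    print(len("A".encode("shift_jis")))    # 1  -> half-width ASCII stays 1 byte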
Unicode may have dropped a couple of variants, but they basically all got added back. There's no problem with Han unification; there's just a FUD campaign powered by nationalism and ignorance that is used to justify everyday technological inertia.
Unicode has gotten so big, isn’t this included by now?
Unicode often gets a lot of online hate, which frustrates me, as I agree with you -- Unicode in general is a remarkably successful standard, technically as well as with regard to adoption.
Its adoption success isn't a coincidence; it's a result of choices made in the design -- with UTF-8 being a big part of that. The choices sometimes involve trade-offs, which lead to the things people complain about (say, the two different codepoint arrangements which can represent an é -- there's a reason for that, again related to easing the on-ramp to Unicode from legacy technologies, one of the main goals of UTF-8).
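Concretely, for the é case (using Python's unicodedata module to show the two arrangements):

    import unicodedata

    composed = "\u00e9"      # é as one precomposed code point
    decomposed = "e\u0301"   # 'e' followed by a combining acute accent

    print(composed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True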
There are always trade-offs, nothing is perfect. But Unicode sometimes seems to me to be almost the optimal balance of all the different concerns, I think they could hardly have done better!
The "UCS=>UTF-16" mis-step was unfortunate, and we are still dealing with some of the consequences (Java/Windows)... but the fact that we made it through with Unicode adoption only continuing to grow, is a testament to Unicode's good design again.
It wasn't until I ran into some of the "backwaters" of Unicode that I realized they had thought out and specified how to do things like "case-insensitive" normalized collation/comparison for a variety of different specifications, in a localized and reasonably performant way...
We are so lucky for Unicode.
As someone currently stuck in the windows world, this hurts. Every single Windows API is still stuck with using UTF-16/UCS2 as the string encoding.
Also, fun fact: on the Nintendo Switch, various subsystems use different encodings. The filesystem submodule uses Shift-JIS, most of the other modules use UTF-8, but some others use UTF-16 (like the virtual keyboard, IIRC). A brilliant mess.
If it can be set in the application manifest, that's a good thing.
Although I guess that in the end Windows will perform the same conversions in user32.dll, so it does not really matter.
But yeah, this just tells Windows to do the conversion so that programmers don't have to type out the function calls themselves. It's simple enough to create a wrapper function for Windows API calls in any case.
> As Windows operates natively in UTF-16 (WCHAR), you might need to convert UTF-8 data to UTF-16 (or vice versa) to interoperate with Windows APIs.
All the core Java APIs are built around String or CharSequence (more the latter in releases post-Java 8). CharSequence is a terrible interface for supporting UTF-8 or any encoding besides latin1 or UTF-16. If Java's interfaces had been designed around Unicode codepoint iteration rather than char random access, then the coupling to UTF-16 wouldn't have been so tight. But as things stand, you aren't doing anything interesting to text in Java without either (1) re-implementing everything from scratch, from integer parsing to regexp, or (2) paying the transcode cost on everything your program consumes and emits.
It's actually a leftover of the earlier UCS-2 standard, before it was realized we'd need more codepoints than that, and that it was a mistake to limit to 16-bit space for codepoints in any encoding.
Software written for UCS-2 can mostly work compatibly with UTF-16, but there are some problems, encoding the 'higher' codepoints is only one of several. Another is how right-to-left scripts are handled.
The UCS-2 adopters, after all, had been faithfully trying to implement the standard at the time. Among other things, showing implementers that if they choose to adopt, you aren't going to leave them hanging out to dry when you realize you made a mistake in the standard will give other people more confidence to adopt.
But also, just generally I think a lesson of unicode's success -- as illustrated by UTF-8 in particular -- is, you have to give people a feasible path from where they are to adoption, this is a legitimate part of the design goals of a standard.
Just compare to e.g. Photoshop file format: https://github.com/gco/xee/blob/master/XeePhotoshopLoader.m#...
'Characters, Symbols and the Unicode Miracle - Computerphile'
I remember seeing this a couple of years ago and thinking "If only this existed when I learned about UTF-8, this would've saved me a lot of bad explanations and time".
This is now my go-to video if anyone asks me how Unicode works.
Can you people really learn things on videos? My brain sort of shuts down on audiovisual material, I can only really watch and understand light movies. For more complicated material, I can only learn it by reading. There's so much essential back and forth that is impossible on a video. With a text, you have everything already in front of you and your eyes and mind can wander freely. Maybe it's just me, but I really can't stand the fixed, inflexible rhythm that is imposed by listening to speech.
Exactly. My point is that technical text is never read linearly (like a video). Reading is an active process, where you scan the whole page repeatedly for all the displayed formulas, then for appearances of these formulas inside the text, then peek at the figure, then read some words in a paragraph while looking from time to time at the figure in case it is referenced by the text. After a few minutes you have grasped everything. At least this is how I read. Watching a video is so passive and linear that you get bored after a few seconds.
It wasn’t until I started learning networking concepts from a third-level/college text book that I picked up in a second-hand shop that I realised how much my brain fools me into thinking I’m absorbing information encoded in words and diagrams. The end of each chapter had questions based on the material covered in that chapter and it was only while attempting to answer them that I realised how much I had thought I’d absorbed – but hadn’t.
When buying technical books, I now try to get ones that have questions or exercises at the end of each chapter. If not, I take notes while reading by attempting to summarise each section in my own words. Answering technical questions for other people is also a great way of consolidating knowledge and filling the gaps in my own understanding.
Maybe a silly question, but I do this a lot when I'm trying to pick up technical information from a video or a presentation, and it seems to help, not least because the result is a durable, textual reference that also can provide starting points for further research. Also, you can pause a video and add a timestamp to your notes as an indexing tool for review.
I do this with a pen and a notebook, not digitally. I don't know about the vaunted "writing by hand helps memories form" effect; I can't say I've observed a dramatic difference, but maybe that's just because I keep my work and technical notes close to hand and refer to them when I need them. That said, I do recommend paper notes over digital ones for stuff like this, if only for the sake of simplicity, reliability, and ease of use.
It's okay to have different learning styles, I think; it's not too surprising to me that some people prefer a different medium.
Thanks for the answer, it turns out that we people are wildly different to each other.
If that had happened, I guess emojis as we know them today might never have happened, since it would have limited us to 16 bits of code points. Or we would have had to start doing surrogate pairs even in UTF-8. Close call.
If UTF-8 had not had sequences above 3 bytes, you would not have been able to use it to express Unicode characters as high as emoji, which would certainly have hampered their adoption, is what the person you're replying to means.
If there were no code points larger than 16 bits then UTF-8 would only need a maximum of 3 bytes per code point and UTF-16 wouldn't need surrogate pairs. Well actually UTF-16 probably wouldn't exist at all because UCS-2 would have been enough for everybody.
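To make that concrete (a quick Python check, with an arbitrary emoji as the example): anything above U+FFFF costs 4 bytes in UTF-8 and a surrogate pair in UTF-16:

    ch = "\U0001F600"                        # 😀, code point U+1F600
    print(len(ch.encode("utf-8")))           # 4 bytes in UTF-8
    print(len(ch.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)
    print(len("é".encode("utf-8")))          # 2 -- BMP characters need at most 3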
Can somebody explain or link to an explanation on how UTF-8 allows for this?
The encoding scheme is laid out in the linked email. Based on the high bits it's possible to detect when a new character starts. Relevant portion:
We define 7 byte types:
T0 0xxxxxxx 7 free bits
Tx 10xxxxxx 6 free bits
T1 110xxxxx 5 free bits
T2 1110xxxx 4 free bits
T3 11110xxx 3 free bits
T4 111110xx 2 free bits
T5 111111xx 2 free bits
Encoding is as follows.
From hex Thru hex Sequence Bits
00000000 0000007f T0 7
00000080 000007FF T1 Tx 11
00000800 0000FFFF T2 Tx Tx 16
00010000 001FFFFF T3 Tx Tx Tx 21
00200000 03FFFFFF T4 Tx Tx Tx Tx 26
04000000 FFFFFFFF T5 Tx Tx Tx Tx Tx 32
4. All of the sequences synchronize on any byte that is not a Tx byte.
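For the curious, here is a small Python sketch of the scheme in that table (my own code, derived from the table above; it stops at the T4 row, and modern UTF-8 per RFC 3629 stops even earlier, at 4 bytes):

    def encode_fss_utf(cp: int) -> bytes:
        if cp < 0x80:                       # T0: 7 free bits
            return bytes([cp])
        # (leading-byte prefix, free bits) for T1 through T4
        for extra, (prefix, free) in enumerate(
                [(0xC0, 5), (0xE0, 4), (0xF0, 3), (0xF8, 2)], start=1):
            if cp < 1 << (free + 6 * extra):
                lead = prefix | (cp >> (6 * extra))         # T1..T4 byte
                trail = [0x80 | ((cp >> (6 * i)) & 0x3F)    # trailing Tx bytes
                         for i in range(extra - 1, -1, -1)]
                return bytes([lead] + trail)
        raise ValueError("code point too large for this sketch")

    assert encode_fss_utf(ord("é")) == "é".encode("utf-8")
    assert encode_fss_utf(0x1F600) == "😀".encode("utf-8")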
September 1992: 2 guys scribbling on a placemat.
January 1998: RFC 2279 defines UTF-8 to be between 1 to 6 bytes.
March 2001: A bunch of CJK characters were added to Unicode Data 3.1.0, pushing the total to 94,140, exceeding the 16-bit limit of 3-byte UTF-8.
March 2002: MySQL added support for UTF-8, initially setting the limit to 6 bytes (https://github.com/mysql/mysql-server/commit/55e0a9c)
September 2002: MySQL decided to reduce the limit to 3 bytes, probably for storage efficiency reason (https://github.com/mysql/mysql-server/commit/43a506c, https://adamhooper.medium.com/in-mysql-never-use-utf8-use-ut...)
November 2003: RFC 3629 defines UTF-8 to be between 1 to 4 bytes.
Arguably, if the placemat had been smaller and the guys had stopped at 4 bytes after running out of space, perhaps MySQL would have done the right thing? Ah, who am I kidding. The same commit would likely still have happened.
EDIT: Just noticed this in the footnotes, and the plot thickens...
> The 4, 5, and 6 byte sequences are only there for political reasons. I would prefer to delete these.
So UTF-8 was indeed intended to be utf8mb3!
I think there's also the idea that the code can "sync up" when it, say, starts in the middle of a character.
The Unicode Consortium has a report on extended grapheme clusters (i.e. user-perceived characters). Essentially, if you're processing some text mid stream, it might not be clear if a code point is the start of a new user-perceived character or not. So you may want to skip ahead until an unambiguous symbol boundary is reached.
If synchronization is lost mid-character, by definition that interrupted character is lost. However, the very next complete character will be clearly indicated by a byte beginning with either a 0 bit (a 7-bit character) or a run of 1s indicating the octet count, followed by a 0.
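A sketch of that resynchronization in Python (the function name is mine): skip continuation bytes, which are always of the form 10xxxxxx, until reaching a byte that can start a sequence:

    def resync(stream: bytes, pos: int) -> int:
        # Continuation bytes all fall in 0x80-0xBF (10xxxxxx).
        while pos < len(stream) and 0x80 <= stream[pos] <= 0xBF:
            pos += 1
        return pos

    data = "héllo".encode("utf-8")      # b'h\xc3\xa9llo'
    pos = resync(data, 2)               # start mid-é: skips 0xA9 -> 3
    print(data[pos:].decode("utf-8"))   # 'llo' -- back in sync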
This is covered in the section titled:
Bits Hex Min Hex Max Byte Sequence in Binary
1 7 00000000 0000007f 0vvvvvvv
2 11 00000080 000007FF 110vvvvv 10vvvvvv
3 16 00000800 0000FFFF 1110vvvv 10vvvvvv 10vvvvvv
... Examples trimmed for mobile.
> UTF-8 was designed, in front of my eyes, on a placemat in a New Jersey diner one night in September or so 1992.
> helix: Sep 8 03:22:13: ken: upas/sendmail: remote inet!xopen.co.uk!xojig
> From ken Tue Sep 8 03:22:07 EDT 1992 ([email protected]) 6833
here's more discussion from then: https://news.ycombinator.com/item?id=21212445
There is not a single reason why the basic memory unit is 8 bits, other than history, backward compatibility, and the overall ecosystem putting immense pressure on all future systems.
We're stuck on 8 bits because every API, OS, and specification assumes the basic unit is 8 bits. If computer memory had evolved faster, or computer usage had spread slower, the basic unit could have been 32 bits, or even 64 bits.
Think how much easier things would be: Unicode could be encoded as a single unit, without ambiguous byte order.
What we've lost due to bytes and UTF-8 is, for example, the text editor as the universal editor. Due to the need for UTF-8 decoding, we're no longer able to open any (possibly binary) file as text and reliably save it after editing one character.
There are other consequences, for example at the file-system level for file names, corruption, how to handle invalid UTF-8 sequences, etc.
(And before you bring up the byte-order problem, that is the point: there is no order problem for the basic unit you choose. No one (outside of some low-level 1-bit serial interfaces) cares about the bit order or nibble order of bytes, because the bit order is hidden within every interface that accepts the byte as the unit.)