Content-Type: RST

Ever seen someone try to write something like "résumé" on a website and, instead of coming out looking right, it looks like "rÃ©sumÃ©" or some other garbled text? That is because someone had their Unicode character encodings completely screwed up!

.. figure :: http://blog.opensourcenerd.com/upload/cat-tell-me-when-its-over

Well, that was a record time. Let me back up and explain what's going on here.

This is about how characters are stored and transmitted. Computers don't know what a "letter" is, or rather, what anything other than a ``1`` or a ``0`` means. They work by grouping these bits into larger structures... like bytes (groups of 8 bits)! A byte can count in binary from ``00000000`` to ``11111111``, in decimal from ``0`` to ``255``, or in hexadecimal (base 16) from ``00`` to ``FF``. Data in a computer is usually passed around in bytes (since plain bits are really too small to be useful most of the time). For convenience's sake, I will be using hexadecimal in this post; for clarity, all hexadecimal values will be prefixed by ``0x``, the way it is done in Unix and other related systems.

.. figure :: /upload/horatio-hex

Now, in The Olden Days™, when someone wanted to represent a character in text, it was done via a single byte. The most common way of doing so was using the `ASCII table of characters`_. It mapped the values ``0x00`` through ``0x7F`` to different characters. For example, ``0x41`` is the symbol for the uppercase letter A (``A``). ``0x61`` is ``a``, ``0x6A`` is ``j``, and so on. Click on the link to the table if you want to see other examples.

.. _`ASCII table of characters`: http://www.asciitable.com/

As time passed, these measly 128 characters proved insufficient! So the eighth bit in the byte got used, making the ``0x80`` through ``0xFF`` space available via what became known as Extended ASCII. A few years later, *that* wasn't enough. Stuff like the trademark character, which I so like to abuse, wasn't in there!
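The byte values above are easy to check for yourself. Here is a minimal Python sketch, nothing more; ``describe`` is a made-up helper name for illustration, not anything standard.

```python
def describe(ch):
    """Return the decimal, hexadecimal, and binary forms of a character's value."""
    value = ord(ch)  # numeric value assigned to the character
    return value, f"0x{value:02X}", f"{value:08b}"


# The three ASCII examples mentioned in the text.
for ch in "Aaj":
    print(ch, describe(ch))

# Plain ASCII only covers 0x00 through 0x7F, so the trademark sign
# cannot be encoded with it and Python raises an error instead.
try:
    "\u2122".encode("ascii")  # "\u2122" is the trademark sign
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print("trademark sign fits in ASCII:", fits_in_ascii)
```

Each ASCII character comes out as a single value below ``0x80``, which is why one byte per character used to be enough.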
So of course, everyone came up with their own Best Solution™.

.. figure :: /upload/barrel-roll-solution

Of course, that didn't work. All that came out was a pile of different ways of representing characters (as a single byte or more), usually one per language, none of them containing all of the necessary characters, and all of them insanely difficult to convert from one to another.

So, coming to save the day is:

Unicode
-------

Backed by the `Unicode Consortium`_, Unicode is a standard that does not worry about how characters are mapped to bits or bytes, and instead just focuses on assigning a numeric value to every character. These values are usually written with a ``U+`` prefix, followed by the number in hexadecimal. For example, ``U+0058`` is "X", ``U+00E6`` is "æ", ``U+03A3`` is "Σ", and ``U+2622`` is "☢".

.. _`Unicode Consortium`: http://www.unicode.org

These are called *code points*. However, since the Unicode standard keeps growing, and because it doesn't even use every possible number, storing raw code points directly is inefficient. For this, you need an *encoding* process, which takes the code point and turns it into an efficiently stored series of bits. There are several encodings in use, but the most popular ones are UTF-8 (an efficient variable-length storage method used by Unix systems), ISO-8859-1 (published by the ISO_), and Windows-1252 (guess what pie Microsoft couldn't resist making its own proprietary version of? That's right, the ISO standard).

.. _`ISO`: http://www.iso.org

.. figure :: /upload/iso-meeting

   `source `__

For example, UTF-8 encodes "résumé" as:

``72 C3 A9 73 75 6D C3 A9``

And ISO-8859-1 encodes it as:

``72 E9 73 75 6D E9``

While the ISO encoding may look more compact, it only has room for 256 characters (one byte each), whereas UTF-8 can encode every Unicode code point. For example, my dear ™ is unsupported in ISO-8859-1, same as ☢, lots of characters in other languages (ಠ_ಠ) and other miscellany.
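The byte sequences above can be reproduced with Python's built-in codecs. This is just a rough sketch: ``hexdump`` is a hypothetical helper written for this example, and the é is spelled as an escape so the single code point ``U+00E9`` is used.

```python
def hexdump(data):
    """Format a bytes object as space-separated uppercase hex pairs."""
    return " ".join(f"{byte:02X}" for byte in data)


text = "r\u00e9sum\u00e9"  # "résumé", with é as the single code point U+00E9

utf8_bytes = text.encode("utf-8")         # é takes two bytes here
latin1_bytes = text.encode("iso-8859-1")  # é takes one byte here

print(hexdump(utf8_bytes))    # 72 C3 A9 73 75 6D C3 A9
print(hexdump(latin1_bytes))  # 72 E9 73 75 6D E9

# The code points themselves, independent of any encoding.
code_points = [f"U+{ord(ch):04X}" for ch in "X\u00e6\u03a3\u2622"]
print(code_points)

# The trademark sign (U+2122) is fine in UTF-8 but has no ISO-8859-1 byte.
trademark_utf8 = hexdump("\u2122".encode("utf-8"))
try:
    "\u2122".encode("iso-8859-1")
    latin1_has_trademark = True
except UnicodeEncodeError:
    latin1_has_trademark = False
```

The same text ends up as eight bytes in one encoding and six in the other, so the bytes alone do not tell you what the characters are.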
If you haven't figured it out by now, this page is UTF-8 encoded.

Now, when you *receive* encoded Unicode text data, the biggest problem you face is *decoding* it. Luckily, things like HTML make it easy, since the encoding can be specified via a ``<meta>`` tag in the header. The problem I talked about at the start of this blog post is what happens when people forget to include the encoding in the document, and the user's browser guesses the wrong one.

An extra problem is the current font's support for whatever character is being displayed: if the font has no glyph for it, the whole thing may fail right there even if the text was decoded properly.

So, now that you know how Unicode works, you get to wait for a follow-up blog post on how it works with **Python**! Have fun!
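In the meantime, here is a minimal sketch of the wrong-guess problem described above. It assumes the sender encoded the text as UTF-8 and the receiver guessed ISO-8859-1; the variable names are made up for illustration.

```python
original = "r\u00e9sum\u00e9"  # "résumé", with é as the single code point U+00E9

# What the sender puts on the wire, with no encoding declared.
wire_bytes = original.encode("utf-8")

# The receiver guesses wrong: each byte of the two-byte é becomes its own character.
wrong_guess = wire_bytes.decode("iso-8859-1")

# The receiver guesses right: the text round-trips unchanged.
right_guess = wire_bytes.decode("utf-8")

print(wrong_guess)
print(right_guess)

# In this particular case the damage is reversible: re-encode with the
# wrongly guessed encoding, then decode with the correct one.
repaired = wrong_guess.encode("iso-8859-1").decode("utf-8")
```

The repair step only works here because ISO-8859-1 maps every byte to some character, so nothing was lost in the bad decode. Other wrong guesses can destroy information for good.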