Ever seen someone try to write something like "résumé" on a website,
only for it to come out looking like "rÃ©sumÃ©" or some other garbled
text? That is because someone had their Unicode character encodings
completely screwed up!
.. figure :: http://blog.opensourcenerd.com/upload/cat-tell-me-when-its-over
Well, that was a record time.
Let me back up and explain what's going on here. This is about how
characters are stored and transmitted. Computers don't know what a
"letter" is, or rather what anything that isn't a ``1`` or a ``0``
means. The way they work is by structuring these bits into more
complicated structures... like bytes (groupings of 8 bits)!
A byte can count in binary from ``00000000`` to ``11111111``, or, in
decimal from ``0`` to ``255``, or, in hexadecimal (base 16), from ``00`` to
``FF``. Data in a computer is usually passed around by bytes (since plain
bits are really too small to be useful most of the time). For
convenience's sake, I will be using hexadecimal in this post; for
clarity, all hexadecimal values will be prefixed by ``0x``, the way it
is done in Unix and other related systems.
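To make the three notations concrete, here is a small Python 3 sketch (the variable names are just for illustration) that writes the same byte value in binary, decimal, and hexadecimal:

```python
# One byte value written in the three notations used above.
value = 255

binary = format(value, "08b")    # eight binary digits
hexa = format(value, "02X")      # two uppercase hex digits
prefixed = hex(value)            # Python's own 0x-prefixed form

print(binary, value, hexa, prefixed)

# All three notations parse back to the same integer.
assert int(binary, 2) == int(hexa, 16) == 0xFF == value
```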
.. figure :: /upload/horatio-hex
Now, in The Olden Days™, when someone wanted to represent a character
in text, it was done via a single byte. The most common way of doing
so was using the `ASCII table of characters`_. It mapped the values
``0x00`` through ``0x7F`` to different characters. For example, ``0x41`` is
the symbol for the uppercase letter A (``A``). ``0x61`` is ``a``,
``0x6A`` is ``j``, and so on. Click on the link to the table if you
want to see other examples.
.. _`ASCII table of characters`: http://www.asciitable.com/
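If you'd rather poke at the table from code, Python 3's built-in ``ord()`` and ``chr()`` convert between a character and its number. A quick sketch using the three examples above (``ascii_codes`` is just an illustrative name):

```python
# ord() gives the number behind a character; chr() goes the other way.
for ch in "Aaj":
    print(ch, hex(ord(ch)))

# Collect the same mapping in a dict for later lookups.
ascii_codes = {ch: ord(ch) for ch in "Aaj"}

# Everything in ASCII fits in the range 0x00 through 0x7F.
assert all(code <= 0x7F for code in ascii_codes.values())
```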
As time passed, these measly 128 characters proved insufficient! So
the eighth bit in the byte got used, making the ``0x80`` through
``0xFF`` space available via the Extended ASCII standard. A few years
later, *that* wasn't enough. Stuff like the trademark character,
which I so like to abuse, wasn't in there!
So of course, everyone came up with their own Best Solution™.
.. figure :: /upload/barrel-roll-solution
Of course, that didn't work. All that came out was a series of
different ways of representing characters (as a single byte or more),
usually one per language, none of them containing all of the necessary
characters, and with insane difficulty in converting from one to the
other. So, coming to save the day is:
Unicode
-------
Backed by the `Unicode Consortium`_, Unicode is a standard
specification that does not worry about the way characters are mapped
to bits or bytes or anything, and instead just focuses on assigning
numeric values to every character. These values are usually marked
with a ``U+``, followed by a hex code representing the number. For
example, ``U+0058`` is "X", ``U+00E6`` is "æ", ``U+03A3`` is "Σ", and
``U+2622`` is "☢".
.. _`Unicode Consortium`: http://www.unicode.org
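In Python 3, strings are sequences of code points, so ``ord()`` returns exactly these numbers. A minimal sketch that prints them in ``U+`` notation (``code_point`` is a hypothetical helper, not a standard function):

```python
def code_point(ch):
    """Format a single character's code point in U+XXXX notation."""
    return "U+{:04X}".format(ord(ch))

# The four examples from the text: X, æ, Σ, ☢ (written as escapes so the
# snippet does not depend on how this file itself is encoded).
labels = [code_point(ch) for ch in "X\u00e6\u03a3\u2622"]
print(labels)
```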
These are called *code points*. However, since the Unicode standard
keeps growing, and because it doesn't even use every possible number in
its range, storing raw code points directly would be wasteful. For
this, you need an *encoding*, which takes a code point and turns it
into an efficiently stored series of bytes.
There are several encodings in use, but the most popular ones are
UTF-8 (an efficient variable-width encoding used by Unix systems),
ISO-8859-1 (published by the ISO_), and Windows-1252 (guess which pie
Microsoft couldn't resist making its own proprietary version of?
That's right, the ISO standard).
.. _`ISO`: http://www.iso.org
.. figure :: /upload/iso-meeting
For example, UTF-8 encodes "résumé" as:
``72 C3 A9 73 75 6D C3 A9``
And ISO-8859-1 encodes it as:
``72 E9 73 75 6D E9``
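You can check both byte sequences yourself. Here is a short Python 3 sketch using ``str.encode`` (``hex_bytes`` is a made-up helper for pretty-printing):

```python
text = "r\u00e9sum\u00e9"  # "résumé", with é written as an escape

def hex_bytes(data):
    """Render a bytes object as space-separated uppercase hex pairs."""
    return " ".join("{:02X}".format(b) for b in data)

utf8_hex = hex_bytes(text.encode("utf-8"))
latin1_hex = hex_bytes(text.encode("iso-8859-1"))

print("UTF-8:     ", utf8_hex)
print("ISO-8859-1:", latin1_hex)
```

Each "é" takes two bytes in UTF-8 and one in ISO-8859-1, which is where the length difference comes from.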
While the ISO encoding may look more compact, it does not support as
many Unicode characters as UTF-8 does. For example, my dear ™ is
unsupported, same as ☢, lots of characters in other languages (ಠ_ಠ)
and other miscellany. If you haven't figured it out by now, this page
is UTF-8 encoded.
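A quick way to see which characters an encoding can hold is to try encoding them and catch the failure. This is only a sketch, and ``can_encode`` is a hypothetical helper name:

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# é, ™ and ☢, written as escapes.
for ch in ("\u00e9", "\u2122", "\u2622"):
    print(ch,
          "ISO-8859-1:", can_encode(ch, "iso-8859-1"),
          "UTF-8:", can_encode(ch, "utf-8"))
```

Only "é" survives ISO-8859-1, while UTF-8 takes all three.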
Now, when you *receive* encoded Unicode text data, the biggest problem you face is
*decoding* it. Luckily, things like HTML make it easy, since the
encoding can easily be specified via a ``<meta>`` tag in the
document's head. The problem I talked about at the start of this blog post is
what happens when people forget to include the encoding in the
document, and the user's browser guesses the wrong one.
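The wrong guess is easy to reproduce. In this Python 3 sketch, the text is encoded as UTF-8 and then decoded as ISO-8859-1, standing in for a browser that guessed badly (the variable names are illustrative):

```python
original = "r\u00e9sum\u00e9"            # "résumé"
wire_bytes = original.encode("utf-8")    # what the server actually sent

# A wrong guess: each byte of the two-byte "é" becomes its own character.
wrong = wire_bytes.decode("iso-8859-1")
# The right guess gets the original text back.
right = wire_bytes.decode("utf-8")

print(wrong)
print(right)
```

Decoding as ISO-8859-1 never raises an error here, because every byte value maps to *some* character in that encoding. That is why the mistake tends to show up as garbage on the page instead of a crash.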
An extra problem is the current font's support for whatever character
is being displayed, and the whole thing may fail right there even if
the text was decoded properly.
So, now that you know how Unicode works, you get to wait for a follow-up
blog post on how it works with **Python**! Have fun!