Using Coconuts - a Pythonic Blog

Unicode in Python, Part 1: What is Unicode?

Ever seen someone try to write something like "résumé" on a website and, instead of coming out looking right, it looks like "rÃ©sumÃ©" or some other garbled text? That is because someone had their Unicode character encodings completely screwed up!

[Image: http://blog.opensourcenerd.com/upload/cat-tell-me-when-its-over]

Well, that was a record time.

Let me back up and explain what's going on here. This is about how characters are stored and transmitted. Computers don't know what a "letter" is, or rather what anything that isn't a 1 or a 0 means. They work by grouping these bits into bigger structures... like bytes (groups of 8 bits)!

A byte can count in binary from 00000000 to 11111111, or, in decimal, from 0 to 255, or, in hexadecimal (base 16), from 00 to FF. Data in a computer is usually passed around in bytes (since plain bits are really too small to be useful most of the time). For convenience's sake, I will be using hexadecimal in this post; for clarity, hexadecimal values will be prefixed by 0x, the way it is done in C, Python, and many other languages.
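If you want to poke at those three notations yourself, here is a quick sketch, assuming a Python 3 interpreter (the variable name is just something I made up for illustration):

```python
# One byte, written three ways. Assumes Python 3.
largest_byte = 0b11111111       # binary literal, all 8 bits set

print(largest_byte)             # 255
print(hex(largest_byte))        # 0xff
print(bin(largest_byte))        # 0b11111111

# Going the other way: show 0x41 as 8 binary digits, and parse "FF" as base 16.
print(format(0x41, "08b"))      # 01000001
print(int("FF", 16))            # 255
```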

[Image: /upload/horatio-hex]

Now, in The Olden Days™, when someone wanted to represent a character in text, it was done via a single byte. The most common way of doing so was using the ASCII table of characters. It mapped the values 0x00 through 0x7F to different characters. For example, 0x41 is the code for the uppercase letter A. 0x61 is a, 0x6A is j, and so on. Look up the full ASCII table if you want to see other examples.
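The same lookups can be done from a Python prompt. This is only a small sketch, assuming Python 3, where chr() turns a number into a character and ord() does the reverse:

```python
# ASCII lookups in both directions. Assumes Python 3.
for code in (0x41, 0x61, 0x6A):
    print(hex(code), "->", chr(code))

print(hex(ord("A")))                 # 0x41

# Encoding a pure-ASCII string gives exactly one byte per character.
ascii_bytes = "Aaj".encode("ascii")
print(list(ascii_bytes))             # [65, 97, 106]
```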

As time passed, these measly 128 characters proved insufficient! So the eighth bit in the byte got used, making the 0x80 through 0xFF space available via what got called Extended ASCII. A few years later, that wasn't enough. Stuff like the trademark character, which I so like to abuse, wasn't in there!

So of course, everyone came up with their own Best Solution™.

[Image: /upload/barrel-roll-solution]

Of course, that didn't work. All that came out was a series of different ways of representing characters (as a single byte or more), usually one per language, none of them containing all of the necessary characters, and with insane difficulty in converting from one to the other. So, coming to save the day is:

Unicode

Backed by the Unicode Consortium, Unicode is a standard specification that does not worry about the way characters are mapped to bits or bytes or anything, and instead just focuses on assigning numeric values to every character. These values are usually marked with a U+, followed by a hex code representing the number. For example, U+0058 is "X", U+00E6 is "æ", U+03A3 is "Σ", and U+2622 is "☢".
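To check those four code points, here is a minimal sketch, assuming Python 3 and its standard unicodedata module (the loop and the names in it are mine, purely for illustration):

```python
# Print each character's code point and official Unicode name. Assumes Python 3.
import unicodedata

samples = ("X", "\u00e6", "\u03a3", "\u2622")
code_points = [f"U+{ord(ch):04X}" for ch in samples]

for ch, code_point in zip(samples, code_points):
    print(code_point, ch, unicodedata.name(ch))
```

Note that nothing here says how the characters are stored; ord() just hands back the number the standard assigned.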

These are called code points. However, a code point is just a number: the standard keeps growing, code points can be far too large to fit in a single byte, and plenty of numbers in the range aren't assigned to anything, so storing them as-is would be inefficient. For this, you need an encoding, which takes the code point and turns it into an efficiently stored series of bytes.

There are several encodings in use, but the most popular ones are UTF-8 (an efficient variable-width encoding that uses one to four bytes per character, and the usual choice on Unix systems), ISO-8859-1 (published by the ISO), and Windows-1252 (guess what pie Microsoft couldn't resist making its own proprietary version of? That's right, the ISO standard).

For example, UTF-8 encodes "résumé" as:

72 C3 A9 73 75 6D C3 A9

And ISO-8859-1 encodes it as:

72 E9 73 75 6D E9
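You can reproduce both byte sequences yourself. A rough sketch, assuming Python 3; hex_dump is a throwaway helper I made up, and the word is spelled with \u escapes so the example doesn't depend on how this snippet's own file is encoded:

```python
# Encode the same text with two different encodings. Assumes Python 3.
word = "r\u00e9sum\u00e9"   # "résumé", with é written as U+00E9


def hex_dump(data):
    """Hypothetical helper: format bytes as space-separated uppercase hex."""
    return " ".join(f"{byte:02X}" for byte in data)


utf8_dump = hex_dump(word.encode("utf-8"))
latin1_dump = hex_dump(word.encode("iso-8859-1"))

print(utf8_dump)     # 72 C3 A9 73 75 6D C3 A9
print(latin1_dump)   # 72 E9 73 75 6D E9
```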

While the ISO encoding may look more compact, it spends exactly one byte per character, so it can only represent 256 of them, while UTF-8 can encode every Unicode code point. For example, my dear ™ is unsupported, as are ☢, lots of characters in other languages (ಠ_ಠ), and other miscellany. If you haven't figured it out by now, this page is UTF-8 encoded.
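Here is what "unsupported" looks like in practice. This is a sketch under the assumption that you are on Python 3; the variable names are mine:

```python
# Try to encode the trademark sign (U+2122) three ways. Assumes Python 3.
trademark = "\u2122"

utf8_bytes = trademark.encode("utf-8")
print(" ".join(f"{byte:02X}" for byte in utf8_bytes))   # E2 84 A2

try:
    trademark.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    # ISO-8859-1 has no byte assigned to this character.
    latin1_ok = False
print("ISO-8859-1 can encode it:", latin1_ok)           # False

# Windows-1252, on the other hand, put it in the 0x80-0x9F range.
cp1252_bytes = trademark.encode("cp1252")
print(" ".join(f"{byte:02X}" for byte in cp1252_bytes)) # 99
```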

Now, when you receive encoded text data, the biggest problem you face is decoding it: you have to know which encoding was used. Luckily, things like HTML make it easy, since the encoding can be specified via a <meta> tag in the document's <head>. The problem I talked about at the start of this blog post is what happens when people forget to declare the encoding in the document, and the user's browser guesses the wrong one.
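The garbling from the start of the post is easy to recreate on purpose. A minimal sketch, assuming Python 3, where the "browser guessing wrong" is simulated by decoding UTF-8 bytes as ISO-8859-1 (all names here are made up for the example):

```python
# Simulate a wrong encoding guess. Assumes Python 3.
original = "r\u00e9sum\u00e9"              # "résumé"
wire_bytes = original.encode("utf-8")      # what the server actually sent

wrong_guess = wire_bytes.decode("iso-8859-1")
right_guess = wire_bytes.decode("utf-8")

print(wrong_guess)   # rÃ©sumÃ©
print(right_guess)   # résumé
```

Each é was sent as the two bytes C3 A9, and ISO-8859-1 happily reads those as two separate characters, which is where the extra junk comes from.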

An extra problem is font support: if the current font has no glyph for the character being displayed, the whole thing may fail right there even if the text was decoded properly.

So, now that you know how Unicode works, you get to wait for a follow-up blog post on how it works with Python! Have fun!

Golden Kumquat says...

We want part 2!

on 2011-08-26 19:28:23.031261