Using Coconuts - a Pythonic Blog

Unicode in Python, Part 2: Flipping Tables

Almost half a year later... part 2! I know, I know, I should apologize for my absence and everything. I am horrible.

/upload/not-okay

Actually... You can deal with it.


So, Unicode in Python. This is actually a bit convoluted, since there are two ways of doing it: the Python 2.x way (Unicode support "added on"), and the Python 3.x way (Unicode support built-in). So, I'll tackle each individually!

Python 2.x

This is the version most commonly used at the moment. I call its Unicode support "added on" because that's exactly what it is: the unicode type was bolted on around Python 2.0, next to the existing byte-based str, and before that Python did not support Unicode at all. As such, you need to be aware of two objects:

  • str - the classic Python string class, built on top of C strings: each character is a byte, as described in my previous post
  • unicode - the Unicode-aware string. Each character is a Unicode code point.

Whereas regular strings just require quotes, unicode strings can be created by prepending a "u" to the quotes, like so:

>>> 'I am a regular string.'
'I am a regular string.'
>>> u'I am a Unicode string.'
u'I am a Unicode string.'
/upload/just-use-unicode

An astute, but dreadfully naïve reader.

Nope! A unicode object is an abstract sequence of code points, and how those code points are stored in memory differs from language to language and system to system. So unicode objects on their own are not enough to manage Unicode strings. See, a string can only be printed, written to a file, or sent anywhere (and "viewed", really) once it has been turned into a byte string using a known encoding.

That's where the encode and decode methods come in!

>>> uni_str = u'Flipping Tables™ (╯°□°）╯︵ ┻━┻'
>>> utf8_str = uni_str.encode('UTF-8')
>>> utf8_str
'Flipping Tables\xe2\x84\xa2 (\xe2\x95\xaf\xc2\xb0\xe2\x96\xa1\xc2\xb0\xef\xbc\x89\xe2\x95\xaf\xef\xb8\xb5 \xe2\x94\xbb\xe2\x94\x81\xe2\x94\xbb'
>>> print utf8_str
Flipping Tables™ (╯°□°）╯︵ ┻━┻
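
Wondering where escapes like \xe2\x84\xa2 come from? Here is a minimal sketch (mine, not part of the session above) that encodes a few single code points and counts the resulting bytes. It sticks to \u escapes and the u prefix, so it should run the same on Python 2.7 and on Python 3.3 or later; the names samples and utf8_lengths are made up for the example.

```python
import unicodedata

# A few code points that appear in the table-flipping string.
samples = [u'F', u'\u00b0', u'\u2122', u'\u253b']

utf8_lengths = {}
for char in samples:
    encoded = char.encode('utf-8')        # text -> bytes
    utf8_lengths[char] = len(encoded)
    print('U+%04X %-40s -> %d byte(s)'
          % (ord(char), unicodedata.name(char), len(encoded)))

# The trade mark sign is a single code point, but three UTF-8 bytes.
assert u'\u2122'.encode('utf-8') == b'\xe2\x84\xa2'
```

Plain ASCII letters stay at one byte each, which is why the "Flipping Tables" part of utf8_str looks untouched while everything after it turns into escapes.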

However, utf8_str is only printed properly because my terminal (and this blog) is set up for UTF-8. If I encoded it in, say, UTF-16 instead...

>>> utf16_str = uni_str.encode('UTF-16')
>>> utf16_str
'\xff\xfeF\x00l\x00i\x00p\x00p\x00i\x00n\x00g\x00 \x00T\x00a\x00b\x00l\x00e\x00s\x00"! \x00(\x00o%\xb0\x00\xa1%\xb0\x00\t\xffo%5\xfe \x00;%\x01%;%'
>>> print utf16_str

I actually can't show the results of that without the server crashing for some reason! Oops! Here's what it looks like in a screenshot, though:

/upload/fail-unicode-screenshot
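
If the screenshot doesn't load for you, the same mismatch can be reproduced without involving a terminal at all. This is a rough sketch of my own: it takes UTF-16 bytes and asks Python to read them as UTF-8, which is roughly what a UTF-8 terminal is stuck trying to do. The variable names are just for illustration, and it should run on Python 2.7 and 3.3+.

```python
uni_str = u'Flipping Tables\u2122'
utf16_bytes = uni_str.encode('utf-16')    # starts with a byte order mark

try:
    utf16_bytes.decode('utf-8')           # wrong encoding on purpose
    decoded_ok = True
except UnicodeDecodeError as exc:
    decoded_ok = False
    print('UTF-8 decoder gave up: %s' % exc)

# Decoding with the encoding that was actually used works fine.
round_trip = utf16_bytes.decode('utf-16')
assert round_trip == uni_str
```

The bytes themselves are perfectly valid; they are only garbage to a reader that assumes a different encoding.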

In short, it doesn't work. Now, what about going the other way, from raw bytes to properly understood Unicode code points? Say I had an input query:

>>> mystr = raw_input("How are you today? ")
How are you today? I AM SO ANGRY I AM FLIPPING TWO TABLES ┻━┻ ︵ヽ(`Д´)ノ︵ ┻━┻
>>> mystr
'I AM SO ANGRY I AM FLIPPING TWO TABLES \xe2\x94\xbb\xe2\x94\x81\xe2\x94\xbb \xef\xb8\xb5\xe3\x83\xbd(`\xd0\x94\xc2\xb4)\xef\xbe\x89\xef\xb8\xb5\xef\xbb\xbf \xe2\x94\xbb\xe2\x94\x81\xe2\x94\xbb'

I happen to know that my terminal uses UTF-8, so what I got is a UTF-8 encoded string. I could store it like that, and pray that it only ever needs to get used on systems that use UTF-8... Or, I could be responsible and decode it into a unicode object and store it that way.

>>> mystr_uni = mystr.decode('UTF-8')

Later, I can encode it as anything I wish, as shown in the examples above.
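
As a small sketch of that round trip (my own example, using just one table from the input above): decode once on the way in, then encode to whatever the destination expects. The target encodings here are arbitrary picks for illustration, and the code should run on Python 2.7 and 3.3+.

```python
raw = b'\xe2\x94\xbb\xe2\x94\x81\xe2\x94\xbb'    # UTF-8 bytes for one table

table = raw.decode('utf-8')                      # bytes -> code points, once
assert table == u'\u253b\u2501\u253b'

as_utf16 = table.encode('utf-16-le')             # little-endian, no byte order mark
as_ascii = table.encode('ascii', 'replace')      # ASCII has no tables, so '?' each

print(repr(as_utf16))
print(repr(as_ascii))
```

The 'replace' error handler is one of several options; leaving it out makes the ASCII encode raise UnicodeEncodeError instead, which is often the more honest behaviour.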

So, get it? Normal strings need to be decoded into unicode objects, which can't really be printed or stored as they are, and which then need to be encoded back into bytes in order to be actually useful.

/upload/headache-cat

I know.

Luckily, there's an easier way, coming Soon™:

Python 3.x

Let's look at our friend, str, now:

>>> mystr = 'Flipping Tables™ (╯°□°)╯︵ ┻━┻'
>>> type(mystr)
<class 'str'>
>>> mystr
'Flipping Tables™ (╯°□°)╯︵ ┻━┻'
>>> mystr[15]
'™'

Perfectly aware of Unicode and its characters. And, printing?

>>> print(mystr)
Flipping Tables™ (╯°□°)╯︵ ┻━┻

Just works. It is aware of its environment and acts properly. No more messy dual object types.
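
To make "aware" a bit more concrete, here is a short Python 3 sketch (my own, on a shortened version of the string) comparing what str counts with what ends up on the wire. The sys.stdout.encoding line is there because, as far as I understand it, that is the encoding print() uses when it writes to the terminal; what it reports depends on your environment.

```python
import sys

mystr = 'Flipping Tables™'

# Indexing and len() work on code points, not bytes.
assert mystr[15] == '™'
assert ord(mystr[15]) == 0x2122

char_count = len(mystr)                    # 16 code points
byte_count = len(mystr.encode('utf-8'))    # 18 bytes: the ™ takes three

# The encoding of the output stream, whatever it is on this machine.
print(sys.stdout.encoding, char_count, byte_count)
```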

/upload/shocked-cat

But there is a catch! What if you want the old 8-bit strings to manage actual byte data, like an image?

>>> myimg = open('someimage.jpg').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.2/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

You can't, silly! That's not a text file! Try opening it in binary mode:

>>> myimg = open('someimage.jpg', 'rb').read()
>>> type(myimg)
<class 'bytes'>
>>> myimg[1000]
110
>>> myimg[1000:1010]
b'n\xfe5IF\x92\xe4\x1fV\x87'

That's right, there's a new bytes object, written with the "b" prefix, which preserves the old functionality of str while also being aware that its values are bytes and should be treated as such: indexing it gives you an integer, and slicing it gives you more bytes. The encode and decode methods of course also still exist to convert between str and bytes if you really need to.
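
Here is a self-contained Python 3 sketch of the same thing that doesn't need someimage.jpg lying around. The "image" is a fake: a few bytes that look like the start of a JPEG followed by some ASCII filler, written to a temporary file. All the names are mine and purely for illustration.

```python
import os
import tempfile

# Not a real image: a JPEG-looking header plus ASCII filler.
fake_jpeg = b'\xff\xd8\xff\xe0' + b'not really an image'

with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as handle:
    handle.write(fake_jpeg)
    path = handle.name

try:
    # Binary mode hands back bytes, untouched.
    with open(path, 'rb') as binary_file:
        data = binary_file.read()

    # Text mode tries to decode, and 0xff is not valid UTF-8.
    try:
        with open(path, encoding='utf-8') as text_file:
            text_file.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True
finally:
    os.remove(path)

first_byte = data[0]                  # an int
first_two = data[:2]                  # still bytes
caption = data[4:].decode('ascii')    # bytes -> str, when it really is text
```

Passing encoding='utf-8' explicitly keeps the text-mode failure from depending on the platform's default encoding.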

Python 3.x: because this is the Right Way™.

I will stop abusing the trade mark character now.
