Unicode and Encoding in Python
22 Sep 2010
I used to run into many Unicode and encoding errors in Python because I underestimated the topic. Unicode and encoding are basic concepts, but handling them carelessly leads to unexpected errors. Converting a byte string to a Unicode object, for instance, can fail. For example
s = "Thank you pälä"
u = unicode(s)
The first line assigns a byte string containing the characters 'T', 'h' and so on to the variable s. The second line converts it to a Unicode object, but it raises an exception (a UnicodeDecodeError) because the default encoding in Python 2 is ascii. Python tries to decode the bytes with the ascii codec, but ä is the code point U+00E4, which is stored as the single byte 0xE4 in Latin-1 or as the two bytes 0xC3 0xA4 in UTF-8. Either way the byte values are above 0x7F, outside the 128 characters that the ascii codec can process.
You might notice that the second line gives no error with some other strings. That happens when the byte string contains only bytes in the ASCII range: the ASCII character set is a subset of Unicode, so the first 128 characters are the same in both.
To convert a byte string to a Unicode string reliably, we must know the encoding of the bytes. In some cases we cannot know the encoding in advance, and the fallback is to guess it by trying popular encodings such as ascii, UTF-8 and UTF-16. Assuming we know the string is encoded in UTF-8, the second line becomes
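Here is a small sketch of both cases. It is my own illustration, not code from a real project: the byte strings are written with escapes so the result does not depend on how the source file is saved, and I call .decode("ascii") explicitly, which is what unicode(s) does implicitly in Python 2 when the default encoding is ascii. Written this way it should run on Python 2 and 3 alike.

```python
# Byte strings spelled out with escapes; the second one holds the
# UTF-8 bytes of "Thank you pälä" (this is an assumption about how
# the text was encoded, chosen for the example).
ascii_only = b"Thank you"
with_umlaut = b"Thank you p\xc3\xa4l\xc3\xa4"

# Bytes in the ASCII range decode fine with the ascii codec.
decoded = ascii_only.decode("ascii")

# Any byte above 0x7F makes the ascii codec give up.
try:
    with_umlaut.decode("ascii")
    failed = False
    bad_position = None
except UnicodeDecodeError as exc:
    failed = True
    bad_position = exc.start  # index of the first offending byte
```

The exception object carries the position of the first byte the codec could not handle, which is handy when tracking down where bad data came from.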
u = unicode(s, 'utf-8')
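When the encoding is not known, the guessing approach could look roughly like the sketch below. The function name guess_decode and the candidate list are my own choices for illustration. Keep in mind that a successful decode does not prove the guess is right: some codecs will happily decode bytes that were produced by a different encoding and give you garbage text.

```python
def guess_decode(raw, candidates=("ascii", "utf-8", "utf-16")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: candidates are tried in order, so put the
    strictest encodings first.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


text, used = guess_decode(b"Thank you p\xc3\xa4l\xc3\xa4")
plain_text, plain_used = guess_decode(b"Thank you")
```

The order matters: ascii is the strictest, so it goes first, and UTF-8 rejects most byte sequences that are not really UTF-8, which makes it a reasonable second try.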
That covers converting a byte string to Unicode. How about the opposite direction? Look at the following example
u = u"Hello pälä"
f = open("file.txt", "w")
f.write(u)
This raises an exception on the third line
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 7: ordinal not in range(128)
The reason is the same. Files, sockets and other media work with byte strings, that is, raw bytes, not Unicode strings. A Unicode string is a special object handled internally by Python, and other media cannot understand it. Hence, before writing to a file or sending data over a socket, Python converts the Unicode string to a byte string, and guess what, it uses the default encoding, ascii, to do so. Since ä is outside the first 128 characters, Python raises an exception.
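The implicit conversion can be reproduced by hand. The sketch below is my own illustration: it encodes the same text with the ascii codec explicitly, which is the step Python 2 performs behind the scenes when a Unicode string is written to a plain file.

```python
text = u"Hello p\xe4l\xe4"  # same text as u"Hello pälä"

try:
    text.encode("ascii")
    position = None
    message = None
except UnicodeEncodeError as exc:
    position = exc.start   # index of the first character that cannot be encoded
    message = str(exc)     # human-readable description of the failure
```

The reported position points at the first ä, which matches the error message above.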
To overcome this problem, we must know which encoding the medium we are dealing with requires. Does the file require a UTF-8 encoded byte string? Many other encodings can represent ä, but UTF-8 is the most popular one. So the correct third line is
f.write(u.encode('utf-8'))
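Two ways of getting UTF-8 bytes into a file can be sketched as follows. This is a minimal example under my own assumptions: the temporary directory and file name are only there to make it self-contained, and io.open is used because it accepts an encoding argument on both Python 2 and 3.

```python
import io
import os
import tempfile

text = u"Hello p\xe4l\xe4"  # same text as u"Hello pälä"

# Throwaway location so the example does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "file.txt")

# Option 1: encode by hand and write the bytes to a binary-mode file.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Option 2: let io.open encode the Unicode string on the way out.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()
```

Either way the file ends up holding bytes, and each ä takes two of them in UTF-8. The second option keeps the encoding decision in one place, at the point where the file is opened.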
Unicode is an industry standard character set used in applications worldwide. An encoding, on the other hand, is how those characters are represented as bytes, for example in a file on disk. When data moves between different environments or media, such as a client browser, web server, socket or database, it is encoded to a byte string and decoded back to a Unicode string many times along the way. Each medium might require a different representation of the data. So pay attention to the encoding of your data and process it properly. The cost is much higher when encoding is handled improperly: lost data, new bugs, crashes and so on. The choice is yours :D