Tim Bray describes why Unicode and UTF-8 are wonderful much better than I could, so go read that for an overview of what Unicode is, and why all your programs should support it. What I'm going to tell you is how to use Unicode, and specifically UTF-8, with one of the coolest programming languages, Python, but I have also written an introduction to Using Unicode in C/C++. Python has good support for Unicode, but there are a few tricks that you need to be aware of. I spent more than a few hours learning these tricks, and I'm hoping that by reading this you won't have to.

This is a very quick and dirty introduction. If you need in-depth knowledge, or need to learn about Unicode in Java or Windows, see Unicode for Programmers.

There are two types of strings in Python: byte strings and Unicode strings. As you may have guessed, a byte string is a sequence of bytes. When needed, Python uses your computer's default locale to convert the bytes into characters. On Mac OS X, the default locale is actually UTF-8, but everywhere else, the default is probably ASCII.

This creates a byte string:

    byteString = "hello world! (in my default locale)"

And this creates a Unicode string:

    unicodeString = u"hello Unicode world!"

You can convert a byte string into a Unicode string and back again. If you do not name an encoding, the conversions use your default character set. However, relying on the locale's character set is a bad idea, since your application is likely to break as soon as someone from Thailand tries to run it on their computer.

In most cases it is probably better to explicitly specify the encoding of the string. With "utf-8" given as the encoding, the byte string s will be treated as a sequence of UTF-8 bytes to create the Unicode string u. Encoding u with the same encoding then stores the UTF-8 representation of u in the byte string backToBytes.

Thankfully, everything in Python is supposed to treat Unicode strings identically to byte strings. However, you need to be careful in your own code when testing to see if an object is a string. Do not do this:

    if isinstance( s, str ): # BAD: Not true for Unicode strings!

Instead, use the generic string base class, basestring:

    if isinstance( s, basestring ): # True for both Unicode and byte strings

Reading UTF-8 Files

You can manually convert strings that you read from files; however, there is an easier way:

    import codecs

    fileObj = codecs.open( "someFile", "r", "utf-8" )
    u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file

The codecs module will take care of all the conversions for you.
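
As a rough sketch of the explicit conversions and the codecs-based file reading described in this post, the snippet below is written so that the same code should run on Python 2.6+ and Python 3. The sample text, the temporary file location, and the fromFile name are assumptions made for illustration. It uses s.decode("utf-8") in place of unicode(s, "utf-8"), since the unicode() built-in exists only in Python 2.

```python
import codecs
import os
import tempfile

# UTF-8 bytes for "cafe" with an acute accent on the e: the accented
# letter takes two bytes (0xC3 0xA9), so 4 characters occupy 5 bytes.
s = b"caf\xc3\xa9"

# Decode with an explicit encoding rather than the default character set.
u = s.decode("utf-8")

# Encode the Unicode string back into UTF-8 bytes.
backToBytes = u.encode("utf-8")

print(len(s))            # 5 (bytes)
print(len(u))            # 4 (characters)
print(backToBytes == s)  # True: the round trip preserves the bytes

# Write the text to a throwaway file (hypothetical path), then read it back
# through codecs.open, which decodes the UTF-8 bytes on read.
path = os.path.join(tempfile.mkdtemp(), "someFile")
with codecs.open(path, "w", "utf-8") as fileObj:
    fileObj.write(u)
with codecs.open(path, "r", "utf-8") as fileObj:
    fromFile = fileObj.read()

print(fromFile == u)     # True: the file yields the same Unicode string
```

The byte count and the character count differ because UTF-8 uses more than one byte for non-ASCII characters, which is why the encoding has to be named explicitly instead of guessed.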