Switching to Python 3 causing UnicodeDecodeError

Accepted answer
Score: 67

Python 3 decodes text files when reading and encodes them when writing. The default encoding is taken from locale.getpreferredencoding(False), which evidently returns 'ASCII' for your setup. See the open() function documentation:

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.
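To see which encoding your own setup would pick, you can query it directly. This is only a small diagnostic sketch; the variable names are illustrative and the value printed depends on your platform and locale settings.

```python
import locale
import sys

# The encoding open() falls back to in text mode when none is given.
preferred = locale.getpreferredencoding(False)
print("locale preferred encoding:", preferred)

# The default for str.encode() / bytes.decode() is a separate setting,
# and in Python 3 it is always UTF-8.
default_codec = sys.getdefaultencoding()
print("str/bytes default codec:", default_codec)
```

If the first value is an ASCII variant while your files contain non-ASCII bytes, that would explain the UnicodeDecodeError.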

Instead of relying on a system setting, you should open your text files using an explicit codec:

currentFile = open(filename, 'rt', encoding='latin1')

where you set the encoding parameter to match the file you are reading.
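As a rough illustration of why the codec has to match the file, the sketch below writes a few Latin-1 bytes to a temporary file and reads them back twice. The sample text and the temporary file are made up for the example; your real file and its encoding will differ.

```python
import os
import tempfile

# Create a sample file holding Latin-1 encoded bytes (hypothetical data).
fd, filename = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as raw:
    raw.write("café\n".encode("latin1"))  # the é is stored as the single byte 0xE9

# Decoding with the matching codec succeeds regardless of the locale.
with open(filename, "rt", encoding="latin1") as current_file:
    text = current_file.read()

# The same bytes are not valid UTF-8, so the wrong codec fails.
try:
    with open(filename, "rt", encoding="utf-8") as current_file:
        current_file.read()
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

os.remove(filename)
print(repr(text), wrong_codec_failed)
```

Using a with block also closes the file for you, which the one-line open() call above leaves to the caller.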

Python 3 uses UTF-8 as the default encoding for source code; that default does not extend to the files you open.

The same applies to writing to a writeable text file; data written will be encoded, and if you rely on the system encoding you are liable to get UnicodeEncodeError exceptions unless you explicitly set a suitable codec. What codec to use when writing depends on what text you are writing and what you plan to do with the file afterward.
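The writing side can be sketched the same way. Here the ASCII codec stands in for a restrictive system default, and the exception raised is UnicodeEncodeError. The text and the output path are invented for the example.

```python
import os
import tempfile

text = "naïve café"  # sample text containing non-ASCII characters
path = os.path.join(tempfile.mkdtemp(), "out.txt")

# An ASCII codec cannot represent the accented characters.
try:
    with open(path, "w", encoding="ascii") as f:
        f.write(text)
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent this text, so the write succeeds.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(ascii_failed, raw)
```

UTF-8 is a reasonable choice when you control both ends; if another program consumes the file, use whatever codec that program expects.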

You may want to read up on Python 3 and Unicode in the Unicode HOWTO, which explains both source code encoding and reading and writing Unicode data.

Score: 1

"as far as I know Python3 is supposed to support utf-8 everywhere ..." Not true. I have Python 3.6 and my default encoding is NOT utf-8. To change it to utf-8 in my code I use:

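import locale

# Monkeypatch: any code that calls locale.getpreferredencoding()
# after this point gets "utf-8" instead of the system value.
def getpreferredencoding(do_setlocale=True):
    return "utf-8"

locale.getpreferredencoding = getpreferredencoding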

as explained in Changing the “locale preferred encoding” in Python 3 in Windows

Score: 1

In general, I found 3 ways to fix Unicode-related errors in Python 3:

  1. Specify the encoding explicitly, like currentFile = open(filename, 'rt', encoding='utf-8')

  2. Bytes have no encoding, so convert the string data to bytes before writing to the file, like data = 'string'.encode('utf-8'). The file must then be opened in binary mode (for example 'wb').

  3. Especially in a Linux environment, check $LANG. Such issues usually arise when LANG=C, which makes the default encoding 'ascii' instead of 'utf-8'. You can change it to another appropriate value, like LANG='en_IN'
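Points 2 and 3 can be sketched roughly as follows. The sample string and file name are made up, and the LANG check only reports what the current process sees; it does not change anything.

```python
import os
import tempfile

# Point 2: encode the string yourself, then write the bytes in binary mode.
data = "string with ü".encode("utf-8")  # str -> bytes with an explicit codec

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:  # binary mode, so no implicit encoding step
    f.write(data)

# Reading back is the mirror image: read bytes, then decode explicitly.
with open(path, "rb") as f:
    restored = f.read().decode("utf-8")

# Point 3: inspect the locale variable the process was started with.
lang = os.environ.get("LANG", "<unset>")
print("LANG =", lang)
print(repr(restored))
```

Passing bytes to a file opened in text mode raises a TypeError, which is why the binary mode flag matters here.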
