[ACCEPTED]-Switching to Python 3 causing UnicodeDecodeError-encoding
Python 3 decodes text files when reading and encodes when writing. The default encoding is taken from
locale.getpreferredencoding(False), which evidently returns 'ASCII' for your setup. See the
open() function documentation:
In text mode, if encoding is not specified the encoding used is platform dependent:
locale.getpreferredencoding(False) is called to get the current locale encoding.
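To see what your own setup reports, a quick check along these lines may help. The variable name default_encoding is only illustrative, and the result depends on the platform and locale, so no expected output is shown:

```python
import locale

# Encoding that text-mode open() falls back to when `encoding` is omitted.
# The value depends on the platform and locale settings.
default_encoding = locale.getpreferredencoding(False)
print("locale preferred encoding:", default_encoding)
```

If this prints an ASCII variant, any non-ASCII byte in a file opened without an explicit encoding will trigger the UnicodeDecodeError.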
Instead of relying on a system setting, you should open your text files using an explicit codec:
currentFile = open(filename, 'rt', encoding='latin1')
where you set the encoding parameter to match the file you are reading.
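As a minimal sketch of the difference, the snippet below writes a small Latin-1 sample file (made-up data, purely for the demonstration) and reads it back twice: once with 'ascii' to simulate the failing default, and once with the matching codec:

```python
import os
import tempfile

# Hypothetical sample data: "café" encoded as Latin-1 is b'caf\xe9'.
raw = "caf\u00e9".encode("latin1")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    filename = tmp.name

try:
    # Decoding with a codec that cannot represent byte 0xE9 fails,
    # which is the situation an ASCII default encoding puts you in.
    try:
        with open(filename, "rt", encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # An explicit codec that matches the file reads the same bytes cleanly.
    with open(filename, "rt", encoding="latin1") as f:
        text = f.read()
finally:
    os.remove(filename)

# ascii() keeps the output printable even on an ASCII-only terminal.
print(ascii_failed, ascii(text))
```

The key point is that the codec has to match how the file was actually written; 'latin1' is right for this sample only because the sample was encoded that way.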
Python 3 uses UTF-8 as the default encoding for source code only; that default does not extend to the files you open.
The same applies to writing to a writeable text file; data written will be encoded, and if you rely on the system encoding you are liable to get
UnicodeEncodeError exceptions unless you explicitly set a suitable codec. What codec to use when writing depends on what text you are writing and what you plan to do with the file afterward.
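The writing side can be sketched the same way. Here 'ascii' is passed explicitly to stand in for an ASCII system default, and the sample text is made up for the demonstration:

```python
import os
import tempfile

# Sample text containing characters outside the ASCII range.
text = "na\u00efve \u2713"

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Stand-in for an ASCII system default: the encoder rejects the text.
    try:
        with open(path, "wt", encoding="ascii") as f:
            f.write(text)
        encode_failed = False
    except UnicodeEncodeError:
        encode_failed = True

    # With an explicit codec that can represent these characters,
    # the write succeeds regardless of the locale.
    with open(path, "wt", encoding="utf-8") as f:
        f.write(text)

    # Read the raw bytes back to confirm what ended up on disk.
    with open(path, "rb") as f:
        written = f.read()
finally:
    os.remove(path)

print(encode_failed, written == text.encode("utf-8"))
```

UTF-8 is used here only as one reasonable choice; as noted above, the right codec depends on who will consume the file.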
You may want to read up on Python 3 and Unicode in the Unicode HOWTO, which covers both source code encoding and reading and writing Unicode data.
"as far as I know Python3 is supposed to support utf-8 everywhere ..." Not true. I have Python 3.6 and my default encoding is NOT utf-8. To change it to utf-8 in my code I use:
import locale

def getpreferredencoding(do_setlocale=True):
    return "utf-8"

locale.getpreferredencoding = getpreferredencoding
as explained in Changing the “locale preferred encoding” in Python 3 in Windows
In general, I found 3 ways to fix Unicode-related errors in Python 3:
1. Specify the encoding explicitly, like currentFile = open(filename, 'rt', encoding='utf-8')
2. Since bytes have no encoding, convert the string data to bytes before writing, like data = 'string'.encode('utf-8'), and write the result to a file opened in binary mode ('wb').
3. Especially in a Linux environment, check $LANG. This issue usually arises when LANG=C, which makes the default encoding 'ascii' instead of 'utf-8'. You can change it to another appropriate value, like LANG='en_IN'
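The second and third approaches in the list above can be sketched roughly as follows. The sample string and variable names are illustrative, and the LANG value printed depends entirely on your environment:

```python
import os
import tempfile

# Second approach: encode the string yourself and write bytes in binary
# mode, so the locale's default text encoding never comes into play.
data = "gr\u00fc\u00df".encode("utf-8")

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb") as f:
        f.write(data)
    # Read the bytes back and decode them with the same codec.
    with open(path, "rb") as f:
        round_trip = f.read().decode("utf-8")
finally:
    os.remove(path)

# Third approach: inspect the variable that influences the default encoding.
lang = os.environ.get("LANG")  # None when the variable is unset
print("LANG =", lang)
```

Note that writing bytes to a file opened in text mode raises a TypeError, which is why the file is opened with 'wb' here.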