Python returning the wrong length of string when using special characters
UTF-8 is a Unicode encoding that uses more than one byte for special characters. If you don't want the length of the encoded string, simply decode it and use len() on the unicode object (and not the str object!).
Here are some examples:
>>> # creates a str literal (with utf-8 encoding, if this was
>>> # specified on the beginning of the file):
>>> len('ë́aúlt')
9
>>> # creates a unicode literal (you should generally use this
>>> # version if you are dealing with special characters):
>>> len(u'ë́aúlt')
6
>>> # the same str literal (written in an encoded notation):
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt')
9
>>> # you can convert any str to a unicode object by calling decode() on it:
>>> len('\xc3\xab\xcc\x81a\xc3\xbalt'.decode('utf-8'))
6
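For readers on Python 3, the same distinction can be sketched roughly as follows. This assumes Python 3, where str is the Unicode type and bytes holds the encoded data; the variable names are only illustrative.

```python
# Python 3 sketch: str holds code points, bytes holds encoded data.
text = "\u00eb\u0301a\u00falt"        # the string from the question, written with escapes
encoded = text.encode("utf-8")        # bytes object containing the UTF-8 encoding

code_point_count = len(text)          # counts Unicode code points
byte_count = len(encoded)             # counts UTF-8 bytes
round_trip = encoded.decode("utf-8")  # decoding gives back the original str

print(code_point_count, byte_count)   # 6 9
```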
Of course, you can also access single characters in a unicode object just as you would in a str object (both inherit from basestring and therefore have the same methods):
>>> test = u'ë́aúlt'
>>> print test[0]
ë
If you develop localized applications, it's generally a good idea to use only unicode objects internally, by decoding all inputs you get. After the work is done, you can encode the result again as 'UTF-8'. If you keep to this principle, you will never see your server crashing because of any internal UnicodeDecodeErrors you might get otherwise ;)
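A minimal sketch of that decode-early, encode-late principle, assuming Python 3 and UTF-8 input. The helper name truncate_text is hypothetical and only serves as an example of "the work" done on decoded text.

```python
def truncate_text(raw: bytes, n: int, encoding: str = "utf-8") -> bytes:
    """Keep the first n code points of an encoded string (hypothetical helper)."""
    text = raw.decode(encoding)      # decode the input at the boundary
    result = text[:n]                # work on Unicode text, not on bytes
    return result.encode(encoding)   # encode again only when leaving the program


raw = b"\xc3\xab\xcc\x81a\xc3\xbalt"
print(truncate_text(raw, 1))         # b'\xc3\xab'
```

Slicing the bytes directly (raw[:1]) would cut the two-byte sequence for ë in half and leave data that is no longer valid UTF-8, which is the kind of error this principle avoids.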
PS: Please note that the str and unicode datatypes have changed significantly in Python 3. In Python 3 there are only Unicode strings and plain byte strings, which can't be mixed anymore. That should help to avoid common pitfalls with Unicode handling...
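As a small illustration of the "can't be mixed" point, assuming Python 3: concatenating str and bytes raises a TypeError instead of decoding silently, so the conversion has to be written out. The variable names are illustrative.

```python
text = "a\u00fa"                  # str: Unicode text
data = text.encode("utf-8")       # bytes: encoded data

try:
    combined = text + data        # mixing the two types is rejected in Python 3
    mixing_error = None
except TypeError as exc:
    mixing_error = exc

# The explicit version: decode first, then combine.
combined = text + data.decode("utf-8")
```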
Regards, Christoph
The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
Yes. That's how code points are defined by Unicode. In general, you can ask Python to combine a letter and a separate ‘combining’ diacritical mark like U+0301 COMBINING ACUTE ACCENT into a single character using Unicode normalisation:
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'a\u0301')
u'\xe1' # single character: á
However, there is no single character in Unicode for “e with diaeresis and acute accent” because no language in the world has ever used the letter ‘ë́’. (Pinyin transliteration has “u with diaeresis and acute accent”, but not ‘e’.) Consequently font support is poor; it renders really badly in many cases and is a messy blob on my web browser.
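The difference can be checked with a short sketch, assuming Python 3 (where string literals are already Unicode). The variable names are illustrative.

```python
import unicodedata

# u + diaeresis + acute has a precomposed form (used in Pinyin), so NFC yields one code point.
u_form = unicodedata.normalize("NFC", "u\u0308\u0301")

# e + diaeresis + acute has none: NFC composes ë but must leave the acute as a combining mark.
e_form = unicodedata.normalize("NFC", "e\u0308\u0301")

print(len(u_form), len(e_form))   # 1 2
```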
To work out where the ‘editable points’ in a string of Unicode code points are is a tricky job that requires quite a bit of domain knowledge of languages. It's part of the issue of “complex text layout”, an area which also includes issues such as bidirectional text and contextual glyph shaping and ligatures. To do complex text layout you'll need a library such as Uniscribe on Windows, or Pango generally (for which there is a Python interface).
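For simple cases, a very rough approximation is to attach each combining mark to the character before it. The sketch below assumes Python 3; rough_clusters is a hypothetical helper, and it is not a substitute for real text segmentation or the libraries mentioned above (it ignores, for example, scripts and joiner sequences that need more than combining-class checks).

```python
import unicodedata


def rough_clusters(s):
    """Group each base character with the combining marks that follow it.

    Rough approximation only: it looks at the canonical combining class
    and nothing else.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters


clusters = rough_clusters("\u00eb\u0301a\u00falt")
print(len(clusters))                # 5
```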
If, on the other hand, you merely want to completely ignore all combining characters when doing a count, you can get rid of them easily enough:
def withoutcombining(s):
    return ''.join(c for c in s if unicodedata.combining(c) == 0)
>>> withoutcombining(u'ë́aúlt')
u'\xeba\xfalt' # ëaúlt
>>> len(_)
5
The best you can do is to use unicodedata.normalize() to decompose the character and then filter out the accents.
Don't forget to use unicode and unicode literals in your code.
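A minimal sketch of that decompose-then-filter approach, assuming Python 3 and NFD as the decomposition form. The function name strip_accents is hypothetical.

```python
import unicodedata


def strip_accents(s):
    """Decompose to NFD, then drop every combining mark (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if unicodedata.combining(c) == 0)


plain = strip_accents("\u00eb\u0301a\u00falt")
print(plain, len(plain))   # eault 5
```

Note that this removes all accents, including those on precomposed letters such as ú, so it changes the text and suits counting or loose comparison rather than display.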
Which Python version are you using? Python 3.1 doesn't have this issue.
>>> print(len("ë́aúlt"))
6
Regards Djoudi
You said: I have a string ë́aúlt that I want to get the length of and manipulate based on character positions and so on. The problem is that the first ë́ is being counted twice, or I guess ë is in position 0 and ´ is in position 1.
The first step in working on any Unicode problem is to know exactly what is in your data; don't guess. In this case your guess is correct; it won't always be.
"Exactly what is in your data": use the repr() built-in function (useful for lots more things apart from Unicode). A useful advantage of showing the repr() output in your question is that answerers then have exactly what you have. Note that your text displays in only FOUR positions instead of 5 with some browsers/fonts -- the 'e' and its diacritics and the 'a' are mangled together in one position.
You can use the unicodedata.name() function to tell you what each component is.
Here's an example:
# coding: utf8
import unicodedata
x = u"ë́aúlt"
print(repr(x))
for c in x:
    try:
        name = unicodedata.name(c)
    except ValueError:  # raised when the code point has no name
        name = "<no name>"
    print "U+%04X" % ord(c), repr(c), name
Results:
u'\xeb\u0301a\xfalt'
U+00EB u'\xeb' LATIN SMALL LETTER E WITH DIAERESIS
U+0301 u'\u0301' COMBINING ACUTE ACCENT
U+0061 u'a' LATIN SMALL LETTER A
U+00FA u'\xfa' LATIN SMALL LETTER U WITH ACUTE
U+006C u'l' LATIN SMALL LETTER L
U+0074 u't' LATIN SMALL LETTER T
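The script above is Python 2. A comparable sketch for Python 3 is shown below; it relies on the optional default argument of unicodedata.name() instead of a try/except, and the function name describe is hypothetical.

```python
import unicodedata


def describe(s):
    """Return (code point, name) pairs for every character in s."""
    return [("U+%04X" % ord(c), unicodedata.name(c, "<no name>")) for c in s]


for code_point, name in describe("\u00eb\u0301a\u00falt"):
    print(code_point, name)
```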
Now read @bobince's answer :-)