Do UTF-8, UTF-16, and UTF-32 differ in the number of characters they can store?
There is no Unicode character that can be stored in one encoding but not another. This is simply because the valid Unicode characters have been restricted to what can be stored in UTF-16 (which has the smallest capacity of the three encodings). In other words, UTF-8 and UTF-32 could be used to represent a wider range of characters than UTF-16, but they aren't. Read on for more details.
UTF-8
UTF-8 is a variable-length code. Some characters require 1 byte, some require 2, some 3 and some 4. The bytes for each character are simply written one after another as a continuous stream of bytes.
While some UTF-8 characters can be 4 bytes long, UTF-8 cannot encode 2^32 characters. It's not even close. I'll try to explain the reasons for this.
The software that reads a UTF-8 stream just gets a sequence of bytes - how is it supposed to decide whether the next 4 bytes are a single 4-byte character, or two 2-byte characters, or four 1-byte characters (or some other combination)? Basically this is done by deciding that certain 1-byte sequences aren't valid characters, and certain 2-byte sequences aren't valid characters, and so on. When these invalid sequences appear, it is assumed that they form part of a longer sequence.
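To make that concrete, here is a minimal sketch (in Python; the function name is my own invention) of how a decoder can tell, from the first byte alone, how long a UTF-8 sequence will be. The bit patterns come straight from the UTF-8 definition.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with this byte occupies."""
    if first_byte < 0x80:            # 0xxxxxxx: a 1-byte (ASCII) character
        return 1
    if 0xC0 <= first_byte < 0xE0:    # 110xxxxx: starts a 2-byte character
        return 2
    if 0xE0 <= first_byte < 0xF0:    # 1110xxxx: starts a 3-byte character
        return 3
    if 0xF0 <= first_byte < 0xF8:    # 11110xxx: starts a 4-byte character
        return 4
    raise ValueError("not a valid UTF-8 lead byte (continuation byte or illegal value)")

# 'é' encodes as two bytes; the first byte alone announces the length.
assert utf8_sequence_length("é".encode("utf-8")[0]) == 2
```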
You've seen a rather different example of this, I'm sure: it's called escaping. In many programming languages it is decided that the `\` character in a string's source code doesn't translate to any valid character in the string's "compiled" form. When a `\` is found in the source, it is assumed to be part of a longer sequence, like `\n` or `\xFF`. Note that `\x` is an invalid 2-character sequence, and `\xF` is an invalid 3-character sequence, but `\xFF` is a valid 4-character sequence.
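To push the analogy a little further, here is a tiny hypothetical escape-sequence decoder (the function name and the minimal set of escapes are my own choice); just like a UTF-8 decoder, it keeps reading until it has seen a complete, valid sequence.

```python
def unescape(source: str) -> str:
    """Decode a tiny subset of backslash escapes: \\n and \\xHH."""
    out, i = [], 0
    while i < len(source):
        if source[i] != "\\":
            out.append(source[i])            # an ordinary 1-character "sequence"
            i += 1
        elif source[i + 1] == "n":           # "\n" is a valid 2-character sequence
            out.append("\n")
            i += 2
        elif source[i + 1] == "x":           # "\x" or "\xF" alone is incomplete
            hex_digits = source[i + 2:i + 4]
            if len(hex_digits) != 2:
                raise ValueError("incomplete \\x escape")
            out.append(chr(int(hex_digits, 16)))
            i += 4                           # "\xFF" is a valid 4-character sequence
        else:
            raise ValueError("invalid escape sequence")
    return "".join(out)

assert unescape(r"A\x42\nC") == "AB\nC"
```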
Basically, there's a trade-off between having many characters and having shorter characters. If you want 2^32 characters, they need to be on average 4 bytes long. If you want all your characters to be 2 bytes or less, then you can't have more than 2^16 characters. UTF-8 gives a reasonable compromise: all ASCII characters (ASCII 0 to 127) are given 1-byte representations, which is great for compatibility, but many more characters are allowed.
Like most variable-length encodings, including the kinds of escape sequences shown above, UTF-8 is an instantaneous code. This means that the decoder just reads byte by byte, and as soon as it reaches the last byte of a character, it knows what the character is (and it knows that it isn't the beginning of a longer character).
For instance, the character 'A' is represented using the byte 65, and there are no two/three/four-byte characters whose first byte is 65. Otherwise the decoder wouldn't be able to tell those characters apart from an 'A' followed by something else.
But UTF-8 is restricted even further. It ensures that the encoding of a shorter character never appears anywhere within the encoding of a longer character. For instance, none of the bytes in a 4-byte character can be 65.
Since UTF-8 has 128 different 1-byte characters (whose byte values are 0-127), all 2, 3 and 4-byte characters must be composed solely of bytes in the range 128-255. That's a big restriction. However, it allows byte-oriented string functions to work with little or no modification. For instance, C's `strstr()` function always works as expected if its inputs are valid UTF-8 strings.
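A quick way to see this in practice, using Python byte strings as a stand-in for `strstr()`: searching for the bytes of one valid UTF-8 string inside another can only ever match at real character boundaries (the sample strings are arbitrary).

```python
haystack = "prix: 10€, café".encode("utf-8")   # '€' and 'é' are multi-byte in UTF-8
needle = "café".encode("utf-8")

# Byte-level search (what strstr() does) finds the substring at the right place,
# and cannot produce a false match starting in the middle of '€' or 'é'.
index = haystack.find(needle)
assert haystack[index:].decode("utf-8") == "café"
```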
UTF-16
UTF-16 is also a variable-length code; its characters consume either 2 or 4 bytes. 2-byte values in the range 0xD800-0xDFFF are reserved for constructing 4-byte characters, and all 4-byte characters consist of a 2-byte value in the range 0xD800-0xDBFF followed by a 2-byte value in the range 0xDC00-0xDFFF. For this reason, Unicode does not assign any characters in the range U+D800-U+DFFF.
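Here is a small sketch of how a code point above U+FFFF is split into that pair of reserved values (the formula is the standard UTF-16 surrogate construction; the variable names are mine):

```python
def utf16_code_units(codepoint: int) -> list[int]:
    """Return the 16-bit code unit(s) UTF-16 uses for one code point."""
    if codepoint <= 0xFFFF:
        return [codepoint]                   # one 2-byte unit
    offset = codepoint - 0x10000             # 20 bits left to store
    high = 0xD800 + (offset >> 10)           # first unit:  0xD800-0xDBFF
    low = 0xDC00 + (offset & 0x3FF)          # second unit: 0xDC00-0xDFFF
    return [high, low]

# U+1F600 (😀) is outside the 16-bit range, so it needs a surrogate pair.
assert utf16_code_units(0x1F600) == [0xD83D, 0xDE00]
```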
UTF-32
UTF-32 is a fixed-length code, with each character being 4 bytes long. While this allows the encoding of 2^32 different characters, only values between 0 and 0x10FFFF are allowed in this scheme.
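As a quick check (standard-library calls only): every character costs exactly 4 bytes in UTF-32, and Python won't even construct a code point above the 0x10FFFF cap.

```python
text = "A€😀"                       # 1-byte, 3-byte and 4-byte characters in UTF-8
encoded = text.encode("utf-32-le")  # fixed width: always 4 bytes per character
assert len(encoded) == 4 * len(text)

# Values above 0x10FFFF are simply not valid code points.
try:
    chr(0x110000)
except ValueError:
    pass
```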
Capacity comparison:
- UTF-8: 2,097,152 (actually 2,166,912 but due to design details some of them map to the same thing)
- UTF-16: 1,112,064
- UTF-32: 4,294,967,296 (but restricted to the first 1,114,112)
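If you want to verify those figures, here is my own back-of-the-envelope arithmetic for them (I won't reproduce the design details behind the 2,166,912 figure):

```python
# UTF-8 limited to 4 bytes: the lead byte leaves 3 payload bits and each of the
# three continuation bytes leaves 6, so 3 + 6 + 6 + 6 = 21 payload bits.
utf8_capacity = 2 ** 21
assert utf8_capacity == 2_097_152

# UTF-16: the 65,536 single-unit values, minus the 2,048 reserved surrogates,
# plus the 1,024 x 1,024 combinations of a high and a low surrogate.
utf16_capacity = (2 ** 16 - 2048) + 1024 * 1024
assert utf16_capacity == 1_112_064

# UTF-32: 4 bytes can hold 2^32 values, but only U+0000..U+10FFFF are allowed.
utf32_raw, utf32_allowed = 2 ** 32, 0x10FFFF + 1
assert (utf32_raw, utf32_allowed) == (4_294_967_296, 1_114_112)
```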
The most restricted is therefore UTF-16! The formal Unicode definition has limited the Unicode characters to those that can be encoded with UTF-16 (i.e. the range U+0000 to U+10FFFF excluding U+D800 to U+DFFF). UTF-8 and UTF-32 support all of these characters.
The UTF-8 system is in fact "artificially" limited to 4 bytes. It can be extended to 8 bytes without violating the restrictions I outlined earlier, and this would yield a capacity of 2^42. The original UTF-8 specification in fact allowed up to 6 bytes, which gives a capacity of 2^31. But RFC 3629 limited it to 4 bytes, since that is how much is needed to cover all of what UTF-16 does.
There are other (mainly historical) Unicode encoding schemes, notably UCS-2 (which is only capable of encoding U+0000 to U+FFFF).
No, they're simply different encoding methods. They all support encoding the same set of characters.
UTF-8 uses anywhere from one to four bytes per character depending on what character you're encoding. Characters within the ASCII range take only one byte while very unusual characters take four.
UTF-32 uses four bytes per character regardless of what character it is, so it will always use more space than UTF-8 to encode the same string. The only advantage is that you can calculate the number of characters in a UTF-32 string by only counting bytes.
UTF-16 uses two bytes for most characters, four bytes for unusual ones.
http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
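To see those size differences side by side, here is a quick comparison using Python's standard encoders (the sample string is just an arbitrary mix of ASCII, accented and supplementary-plane characters; the `-le` variants avoid a byte-order mark):

```python
sample = "hello, café 😀"

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    size = len(sample.encode(encoding))
    print(f"{encoding:10s} -> {size} bytes for {len(sample)} characters")

# Expected output:
#   utf-8      -> 17 bytes for 13 characters
#   utf-16-le  -> 28 bytes for 13 characters
#   utf-32-le  -> 52 bytes for 13 characters
```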
UTF-8, UTF-16, and UTF-32 all support the full set of Unicode code points. There are no characters that are supported by one but not another.
As for the bonus question "Do these encodings differ in the number of characters they can be extended to support?" Yes and no. The way UTF-8 and UTF-16 are encoded limits the total number of code points they can support to less than 2^32. However, the Unicode Consortium will not add code points to UTF-32 that cannot be represented in UTF-8 or UTF-16. Doing so would violate the spirit of the encoding standards, and make it impossible to guarantee a one-to-one mapping from UTF-32 to UTF-8 (or UTF-16).
I personally always check Joel's post about Unicode, encodings and character sets when in doubt.
All of the UTF-8/16/32 encodings can map all Unicode characters. See Wikipedia's Comparison of Unicode Encodings.
This IBM article, Encode your XML documents in UTF-8, is very helpful, and indicates that if you have the choice, it's better to choose UTF-8. Mainly the reasons are wide tool support, and UTF-8 can usually pass through systems that are unaware of Unicode.
From the section What the specs say in the IBM article:
Both the W3C and the IETF have recently become more adamant about choosing UTF-8 first, last, and sometimes only. The W3C Character Model for the World Wide Web 1.0: Fundamentals states, "When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. US-ASCII is upwards-compatible with UTF-8 (an US-ASCII string is also a UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired." In practice, compatibility with US-ASCII is so useful it's almost a requirement. The W3C wisely explains, "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes."
As everyone has said, UTF-8, UTF-16, and UTF-32 can all encode all of the Unicode code points. However, the UCS-2 (sometimes mistakenly referred to as UCS-16) variant can't, and this is the one that you find e.g. in Windows XP/Vista.
See Wikipedia for more information.
Edit: I am wrong about Windows; NT was the only one to support UCS-2. However, many Windows applications will assume a single word per code point as in UCS-2, so you are likely to find bugs. See another Wikipedia article. (Thanks JasonTrue)
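The kind of bug described above is easy to reproduce: code that treats UTF-16 code units (16-bit words) as characters over-counts as soon as something outside the Basic Multilingual Plane shows up. A small illustration, using Python only to count the 16-bit units (the sample text is arbitrary):

```python
text = "𝄞 clef"                     # U+1D11E MUSICAL SYMBOL G CLEF is outside the BMP

code_points = len(text)             # what the user thinks of as characters
utf16_units = len(text.encode("utf-16-le")) // 2   # what UCS-2-minded code counts

assert code_points == 6
assert utf16_units == 7             # the clef needs a surrogate pair, so the counts differ
```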