Escaping Unicode characters with C/C++ (utf-32)

Accepted answer
Score: 18

There are many questions within your question; I will try to answer the most important ones.

Q. I have a C++ string like "Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-8 string, but this is not mandated by the standard. Consult your documentation.

Q. I have a wide C++ string like L"Eat, drink, 愛", is it a UTF-8, UTF-16 or UTF-32 string?
A. This is implementation-defined. In many implementations this will be a UTF-32 string. In some other implementations it will be a UTF-16 string. Neither is mandated by the standard. Consult your documentation.

Q. How can I have portable UTF-8, UTF-16 or UTF-32 C++ string literals?
A. In C++11 there is a way:

u8"I'm a UTF-8 string."
u"I'm a UTF-16 string."
U"I'm a UTF-32 string."

In C++03, no such luck.
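
For illustration, a minimal sketch of how these literals are typically stored (note that in C++20 the u8 literal changes type to const char8_t[], so the first line would need std::u8string there):

#include <string>

std::string    s8  = u8"I'm a UTF-8 string.";   // array of char (C++11/14/17)
std::u16string s16 = u"I'm a UTF-16 string.";   // array of char16_t
std::u32string s32 = U"I'm a UTF-32 string.";   // array of char32_t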

Q. Does the string "Eat, drink, 愛" contain at least one UTF-32 character?
A. There are no such things as UTF-32 (and UTF-16 and UTF-8) characters. There are UTF-32 etc. strings. They all contain Unicode characters.

Q. What the heck is a Unicode character?
A. It is an element of a coded character set defined by the Unicode standard. In a C++ program it can be represented in various ways; the simplest and most straightforward is a single 32-bit integral value corresponding to the character's code point. (I'm ignoring composite characters here and equating "character" and "code point", unless stated otherwise, for simplicity).

Q. Given a Unicode character, how can I escape it?
A. Examine its value. If it's between 256 and 65535, print a 2-byte (4 hex digits) escape sequence. If it's greater than 65535, print a 3-byte (6 hex digits) escape sequence. Otherwise, print it as you normally would.
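
A minimal sketch of that rule in C++; the \uXXXX / \U00XXXX escape format and the helper name print_escaped are my own choices, not something prescribed here:

#include <cstdio>
#include <cstdint>

// Escape a single Unicode code point as described above.
void print_escaped(std::uint32_t cp)
{
    if (cp > 0xFFFF)
        std::printf("\\U%06X", static_cast<unsigned>(cp));   // 6 hex digits
    else if (cp >= 256)
        std::printf("\\u%04X", static_cast<unsigned>(cp));   // 4 hex digits
    else
        std::putchar(static_cast<int>(cp));                  // print as-is
}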

Q. Given a UTF-32 encoded string, how can I decompose it to characters?
A. Each element of the string (which is called a code unit) corresponds to a single character (code point). Just take them one by one. Nothing special needs to be done.
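
As a sketch, reusing the hypothetical print_escaped helper from above:

#include <string>

// One UTF-32 code unit == one code point, so a plain loop suffices.
void escape_utf32(const std::u32string& s)
{
    for (char32_t cp : s)
        print_escaped(static_cast<std::uint32_t>(cp));
}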

Q. Given a UTF-16 encoded string, how can I decompose it to characters?
A. Values (code units) outside of the 0xD800 to 0xDFFF range correspond to the Unicode characters with the same value. For each such value, print either a normal character or a 2-byte (4 hex digits) escape sequence. Values in the 0xD800 to 0xDFFF range are grouped in pairs, each pair representing a single character (code point) in the U+10000 to U+10FFFF range. For such a pair, print a 3-byte (6 hex digits) escape sequence. To convert a pair (v1, v2) to its character value, use this formula:

c = ((v1 - 0xD800) << 10) + (v2 - 0xDC00) + 0x10000

Note that the first element of the pair must be in the range 0xD800..0xDBFF and the second in 0xDC00..0xDFFF; otherwise the pair is ill-formed.
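
A sketch of that procedure, again using the hypothetical print_escaped helper; the check for ill-formed input is deliberately minimal:

#include <string>
#include <cstddef>
#include <cstdint>

void escape_utf16(const std::u16string& s)
{
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        std::uint32_t v1 = s[i];
        if (v1 < 0xD800 || v1 > 0xDFFF)
        {
            print_escaped(v1);                    // a BMP code point, no pairing needed
        }
        else if (v1 <= 0xDBFF && i + 1 < s.size())
        {
            std::uint32_t v2 = s[++i];            // should be in 0xDC00..0xDFFF
            print_escaped(((v1 - 0xD800) << 10) + (v2 - 0xDC00) + 0x10000);
        }
        // else: a lone or trailing surrogate; the string is ill-formed
    }
}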

Q. Given a UTF-8 encoded string, how can I decompose it to characters?
A. The UTF-8 encoding is a bit more complicated than the UTF-16 one and I will not detail it here. There are many descriptions and sample implementations out there on the 'net, look them up.

Q. What's up with my L"प्रे" string?
A. It is a composite character made up of four Unicode code points, U+092A, U+094D, U+0930, U+0947. Note it's not the same as a high code point being represented with a surrogate pair as detailed in the UTF-16 part of the answer. It's a case of "character" being not the same as "code point". Escape each code point separately. At this level of abstraction, you are dealing with code points, not actual characters anyway. Characters come into play when you e.g. display them for the user, or compute their position in a printed text, but not when dealing with string encodings.
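
For example, with the escape_utf32 sketch from above (again, my hypothetical helpers), the four code points come out as four separate escapes:

escape_utf32(U"\u092A\u094D\u0930\u0947");   // prints \u092A\u094D\u0930\u0947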
