How to test an application for correct encoding (e.g. UTF-8)
Thank you, fliptitle!
I, too, am trying to lay out a proper test plan to make sure that an application supports Unicode strings throughout the system.
I am bilingual, but in two languages that only use ISO-8859-1. Therefore, I have been struggling to determine what is a "real-life," "meaningful" way to test the full range of Unicode possibilities.
I just came across this:
Follow-Up Post:
After devising some tests for my application, I realized that I had put together a small list of encoded values that might be helpful to others.
I am using the following international strings in my test:
(NOTE: here comes some UTF-8 encoded text... hopefully you can see this in your browser)
ユーザー別サイト
简体中文
크로스 플랫폼으로
מדורים מבוקשים
أفضل البحوث
Σὲ γνωρίζω ἀπὸ
Десятую Международную
แผ่นดินฮั่นเสื่อมโทรมแสนสังเวช
∮ E⋅da = Q, n → ∞, ∑ f(i) = ∏ g(i)
français langue étrangère
mañana olé
(End of UTF-8 foreign/non-English text)
However, at various points during testing, I realized that it was insufficient to have only information about how the strings were supposed to look when rendered in their respective foreign alphabets. I also needed to know the correct Unicode code point numbers and the correct hexadecimal values for these strings in at least two encodings (UCS-2 and UTF-8).
Here is the equivalent code-point numbering and hex values:
str = L"\u30E6\u30FC\u30B6\u30FC\u5225\u30B5\u30A4\u30C8"; // JAPAN
// Little endian UTF-16/UCS-2: e6 30 fc 30 b6 30 fc 30 25 52 b5 30 a4 30 c8 30 00 00
// Hex of UTF-8: e3 83 a6 e3 83 bc e3 82 b6 e3 83 bc e5 88 a5 e3 82 b5 e3 82 a4 e3 83 88 00
str = L"\u7B80\u4F53\u4E2D\u6587"; // CHINA
// Little endian UTF-16/UCS-2: 80 7b 53 4f 2d 4e 87 65 00 00
// Hex of UTF-8: e7 ae 80 e4 bd 93 e4 b8 ad e6 96 87 00
str = L"\uD06C\uB85C\uC2A4 \uD50C\uB7AB\uD3FC\uC73C\uB85C"; // KOREA
// Little endian UTF-16/UCS-2: 6c d0 5c b8 a4 c2 20 00 0c d5 ab b7 fc d3 3c c7 5c b8 00 00
// Hex of UTF-8: ed 81 ac eb a1 9c ec 8a a4 20 ed 94 8c eb 9e ab ed 8f bc ec 9c bc eb a1 9c 00
str = L"\u05DE\u05D3\u05D5\u05E8\u05D9\u05DD \u05DE\u05D1\u05D5\u05E7\u05E9\u05D9\u05DD"; // ISRAEL
// Little endian UTF-16/UCS-2: de 05 d3 05 d5 05 e8 05 d9 05 dd 05 20 00 de 05 d1 05 d5 05 e7 05 e9 05 d9 05 dd 05 00 00
// Hex of UTF-8: d7 9e d7 93 d7 95 d7 a8 d7 99 d7 9d 20 d7 9e d7 91 d7 95 d7 a7 d7 a9 d7 99 d7 9d 00
str = L"\u0623\u0641\u0636\u0644 \u0627\u0644\u0628\u062D\u0648\u062B"; // EGYPT
// Little endian UTF-16/UCS-2: 23 06 41 06 36 06 44 06 20 00 27 06 44 06 28 06 2d 06 48 06 2b 06 00 00
// Hex of UTF-8: d8 a3 d9 81 d8 b6 d9 84 20 d8 a7 d9 84 d8 a8 d8 ad d9 88 d8 ab 00
str = L"\u03A3\u1F72 \u03B3\u03BD\u03C9\u03C1\u03AF\u03B6\u03C9 \u1F00\u03C0\u1F78"; // GREECE
// Little endian UTF-16/UCS-2: a3 03 72 1f 20 00 b3 03 bd 03 c9 03 c1 03 af 03 b6 03 c9 03 20 00 00 1f c0 03 78 1f 00 00
// Hex of UTF-8: ce a3 e1 bd b2 20 ce b3 ce bd cf 89 cf 81 ce af ce b6 cf 89 20 e1 bc 80 cf 80 e1 bd b8 00
str = L"\u0414\u0435\u0441\u044F\u0442\u0443\u044E \u041C\u0435\u0436\u0434\u0443\u043D\u0430\u0440\u043E\u0434\u043D\u0443\u044E"; // RUSSIA
// Little endian UTF-16/UCS-2: 14 04 35 04 41 04 4f 04 42 04 43 04 4e 04 20 00 1c 04 35 04 36 04 34 04 43 04 3d 04 30 04 40 04 3e 04 34 04 3d 04 43 04 4e 04 00 00
// Hex of UTF-8: d0 94 d0 b5 d1 81 d1 8f d1 82 d1 83 d1 8e 20 d0 9c d0 b5 d0 b6 d0 b4 d1 83 d0 bd d0 b0 d1 80 d0 be d0 b4 d0 bd d1 83 d1 8e 00
str = L"\u0E41\u0E1C\u0E48\u0E19\u0E14\u0E34\u0E19\u0E2E\u0E31\u0E48\u0E19\u0E40\u0E2A\u0E37\u0E48\u0E2D\u0E21\u0E42\u0E17\u0E23\u0E21\u0E41\u0E2A\u0E19\u0E2A\u0E31\u0E07\u0E40\u0E27\u0E0A"; // THAILAND
// Little endian UTF-16/UCS-2: 41 0e 1c 0e 48 0e 19 0e 14 0e 34 0e 19 0e 2e 0e 31 0e 48 0e 19 0e 40 0e 2a 0e 37 0e 48 0e 2d 0e 21 0e 42 0e 17 0e 23 0e 21 0e 41 0e 2a 0e 19 0e 2a 0e 31 0e 07 0e 40 0e 27 0e 0a 0e 00 00
// Hex of UTF-8: e0 b9 81 e0 b8 9c e0 b9 88 e0 b8 99 e0 b8 94 e0 b8 b4 e0 b8 99 e0 b8 ae e0 b8 b1 e0 b9 88 e0 b8 99 e0 b9 80 e0 b8 aa e0 b8 b7 e0 b9 88 e0 b8 ad e0 b8 a1 e0 b9 82 e0 b8 97 e0 b8 a3 e0 b8 a1 e0 b9 81 e0 b8 aa e0 b8 99 e0 b8 aa e0 b8 b1 e0 b8 87 e0 b9 80 e0 b8 a7 e0 b8 8a 00
str = L"\u222E E\u22C5da = Q, n \u2192 \u221E, \u2211 f(i) = \u220F g(i)"; // MATHEMATICS
// Little endian UTF-16/UCS-2: 2e 22 20 00 45 00 c5 22 64 00 61 00 20 00 3d 00 20 00 51 00 2c 00 20 00 20 00 6e 00 20 00 92 21 20 00 1e 22 2c 00 20 00 11 22 20 00 66 00 28 00 69 00 29 00 20 00 3d 00 20 00 0f 22 20 00 67 00 28 00 69 00 29 00 00 00
// Hex of UTF-8: e2 88 ae 20 45 e2 8b 85 64 61 20 3d 20 51 2c 20 20 6e 20 e2 86 92 20 e2 88 9e 2c 20 e2 88 91 20 66 28 69 29 20 3d 20 e2 88 8f 20 67 28 69 29 00
str = L"fran\u00E7ais langue \u00E9trang\u00E8re"; // FRANCE
// Little endian UTF-16/UCS-2: 66 00 72 00 61 00 6e 00 e7 00 61 00 69 00 73 00 20 00 6c 00 61 00 6e 00 67 00 75 00 65 00 20 00 e9 00 74 00 72 00 61 00 6e 00 67 00 e8 00 72 00 65 00 00 00
// Hex of UTF-8: 66 72 61 6e c3 a7 61 69 73 20 6c 61 6e 67 75 65 20 c3 a9 74 72 61 6e 67 c3 a8 72 65 00
str = L"ma\u00F1ana ol\u00E9"; // SPAIN
// Little endian UTF-16/UCS-2: 6d 00 61 00 f1 00 61 00 6e 00 61 00 20 00 6f 00 6c 00 e9 00 00 00
// Hex of UTF-8: 6d 61 c3 b1 61 6e 61 20 6f 6c c3 a9 00
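If you need to generate this kind of table for your own test strings, a small throwaway program will do it. Here is a rough C++ sketch (my own, not part of the original list); it dumps the UTF-8 bytes and the little-endian UTF-16 code units of the JAPAN string, and its output should match the two hex rows above, apart from the trailing 00 terminator bytes:

#include <cstdio>
#include <string>

// Print each byte of a UTF-8 encoded, NUL-terminated string as hex.
static void dump_utf8(const char* s) {
    for (; *s; ++s) std::printf("%02x ", (unsigned char)*s);
    std::printf("\n");
}

// Print each UTF-16 code unit as a little-endian byte pair.
static void dump_utf16le(const std::u16string& s) {
    for (char16_t u : s)
        std::printf("%02x %02x ", (unsigned)(u & 0xFF), (unsigned)((u >> 8) & 0xFF));
    std::printf("\n");
}

int main() {
    dump_utf8(reinterpret_cast<const char*>(
        u8"\u30E6\u30FC\u30B6\u30FC\u5225\u30B5\u30A4\u30C8"));        // JAPAN, as UTF-8
    dump_utf16le(u"\u30E6\u30FC\u30B6\u30FC\u5225\u30B5\u30A4\u30C8"); // JAPAN, as UTF-16LE
}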
Also, here are a couple of images that show some common "mis-renderings" that can happen in various editors, even though the underlying bytes are well-formed UTF-8. If you see any of these renderings, it probably means that you correctly produced a UTF-8 string, but that your editor/viewer is interpreting it under some encoding other than UTF-8.
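(If the images do not load, you can reproduce the effect yourself: take well-formed UTF-8 bytes, pretend each byte is an ISO-8859-1 character, and re-encode the result as UTF-8 so a UTF-8 terminal displays the garbage. A minimal sketch of that idea in C++, assuming a UTF-8 terminal:)

#include <iostream>
#include <string>

// Simulate a viewer that guesses the wrong encoding: treat every byte of a
// UTF-8 string as an ISO-8859-1 character and encode those characters as UTF-8.
static std::string view_utf8_as_latin1(const std::string& utf8) {
    std::string out;
    for (unsigned char b : utf8) {
        if (b < 0x80) {
            out += (char)b;                      // ASCII is unchanged
        } else {                                 // U+0080..U+00FF becomes two UTF-8 bytes
            out += (char)(0xC0 | (b >> 6));
            out += (char)(0x80 | (b & 0x3F));
        }
    }
    return out;
}

int main() {
    std::string s = "fran\xC3\xA7" "ais";        // "français" in UTF-8
    std::cout << view_utf8_as_latin1(s) << "\n"; // prints "franÃ§ais"
}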
There is a regular expression to test if a string is valid UTF-8:
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
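If no Perl-style regex engine is at hand, the same byte-pattern check can be written by hand. Here is a sketch in C++ that mirrors the ranges above (it is slightly more permissive in that it accepts all ASCII bytes, including the control characters the regex filters out):

#include <string>

// True if 'data' is well-formed UTF-8: no overlong sequences, no UTF-16
// surrogates, nothing beyond U+10FFFF.
static bool is_well_formed_utf8(const std::string& data) {
    size_t i = 0, n = data.size();
    // The k bytes after position i must be continuation bytes (0x80..0xBF).
    auto cont = [&](size_t k) {
        for (size_t j = 1; j <= k; ++j) {
            if (i + j >= n) return false;
            unsigned char c = (unsigned char)data[i + j];
            if (c < 0x80 || c > 0xBF) return false;
        }
        return true;
    };
    while (i < n) {
        unsigned char b  = (unsigned char)data[i];
        unsigned char b1 = (i + 1 < n) ? (unsigned char)data[i + 1] : 0;
        if      (b <= 0x7F)              { i += 1; }                                         // ASCII
        else if (b >= 0xC2 && b <= 0xDF) { if (!cont(1)) return false; i += 2; }              // non-overlong 2-byte
        else if (b == 0xE0)              { if (b1 < 0xA0 || !cont(2)) return false; i += 3; } // excluding overlongs
        else if (b >= 0xE1 && b <= 0xEC) { if (!cont(2)) return false; i += 3; }              // straight 3-byte
        else if (b == 0xED)              { if (b1 > 0x9F || !cont(2)) return false; i += 3; } // excluding surrogates
        else if (b == 0xEE || b == 0xEF) { if (!cont(2)) return false; i += 3; }              // straight 3-byte
        else if (b == 0xF0)              { if (b1 < 0x90 || !cont(3)) return false; i += 4; } // planes 1-3
        else if (b >= 0xF1 && b <= 0xF3) { if (!cont(3)) return false; i += 4; }              // planes 4-15
        else if (b == 0xF4)              { if (b1 > 0x8F || !cont(3)) return false; i += 4; } // plane 16
        else return false;               // 0x80..0xC1 and 0xF5..0xFF never start a sequence
    }
    return true;
}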
But this doesn't ensure that the text actually is UTF-8.
An example: the letter ö (U+00F6) has the UTF-8 byte sequence 0xC3 0xB6.
So when you get 0xC3 0xB6 as input, you can say that it is valid UTF-8. But you cannot say for certain that the letter ö was submitted.
This is because ISO 8859-1 might have been used instead of UTF-8. There, the sequence 0xC3 0xB6 represents the characters Ã (0xC3) and ¶ (0xB6) respectively.
So the sequence 0xC3 0xB6 can represent either ö in UTF-8 or Ã¶ in ISO 8859-1 (although the latter is rather unusual).
So in the end it’s only guessing.
The real troublemaker with character encoding is quite often that there are multiple encoding-related bugs, and that some incorrect behavior has been introduced to compensate for other bugs. I have lost count of how many times I have seen this happen.
The goal, as always, is to handle it correctly in every single place. So most of the time, simple unit tests can do the trick; it doesn't even have to be a very complex character set. I find all our bugs just by testing with our national character "ø", because it maps differently in UTF-8 than in most of the other character sets.
The aggregate works fine when all the pieces do it correctly. I know this sounds trivial, but when it comes to character set issues it has always worked for me ;)
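A unit test in that spirit can be very small. The sketch below is a C++ illustration of the idea; save_and_load() is a hypothetical stand-in for whatever layer you are exercising (database write/read, HTTP round trip, file I/O), and "ø" (U+00F8, UTF-8 bytes C3 B8) is the canary character:

#include <cassert>
#include <string>

// Hypothetical stand-in for the layer under test (DB, web service, file...).
std::string save_and_load(const std::string& s);

void test_o_slash_round_trip() {
    const std::string input = "S\xC3\xB8rensen";     // "Sørensen" encoded as UTF-8
    const std::string output = save_and_load(input);
    // If any layer silently re-encodes (e.g. to ISO-8859-1 and back),
    // the C3 B8 bytes will not survive and this assertion fails.
    assert(output == input);
}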
Localization is pretty tough.
I think you are really asking two questions. One of them, how do you get everybody to work correctly on an i18n application, is not technical but a project management issue, in my opinion. If you want people to use a common standard, like UTF-8, then you will simply have to enforce that. Tools will help, but people will first need to be told to do so.
Besides saying that UTF-8 is, in my opinion, the way to go, it is hard to give an answer to the questions about tools. It really depends on the kind of project you are doing. If, for example, it is a Java project you are talking about, then it is a simple matter of properly configuring the IDE to encode files in UTF-8, and of making sure your UTF-8 localizations are in external resource files.
One thing you can certainly do is write unit tests that check compliance. If your localized messages/labels are in resource files, then it is fairly easy to check whether they are properly UTF-8 encoded, I think.
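For example, a compliance test could walk the resource directory and fail on any file that is not valid UTF-8. The idea is language-agnostic; here is a rough C++17 sketch, where "resources" is a placeholder path and is_well_formed_utf8() is a validator like the one sketched earlier in this thread (or one from a library you already trust):

#include <filesystem>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

bool is_well_formed_utf8(const std::string& bytes);   // assumed to be provided elsewhere

int main() {
    bool ok = true;
    for (const auto& entry : std::filesystem::recursive_directory_iterator("resources")) {
        if (!entry.is_regular_file()) continue;
        std::ifstream in(entry.path(), std::ios::binary);
        std::string bytes((std::istreambuf_iterator<char>(in)),
                          std::istreambuf_iterator<char>());
        if (!is_well_formed_utf8(bytes)) {
            std::cerr << "Not valid UTF-8: " << entry.path() << "\n";
            ok = false;
        }
    }
    return ok ? 0 : 1;   // non-zero exit makes the build (or test) fail
}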
In PHP we use the mb_ functions such as mb_detect_encoding() and mb_convert_encoding(). They aren't perfect, but they get us 99.9% of the way there. Then we have a few regular expressions to strip out funky characters that somehow make their way in at times.
If you are going international, you definitely want to use UTF-8. We have yet to find the perfect solution for getting all of our data into UTF-8, and I'm not sure one exists. You just have to keep tinkering with it.