What use is the 'encoding' in the XML header?

Accepted answer
Score: 42

As you mentioned, you'd have to know the encoding of the file to read the encoding attribute.

However, there is a heuristic that can easily get you close enough to the "real" encoding to allow you to read the encoding attribute. This works because the <?xml part by definition can only contain characters in the ASCII range (however they are encoded).

The XML standard even describes the exact process used to find out the encoding.
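As a rough illustration of that process, here is a minimal Python sketch of the byte-pattern check described in the XML specification's appendix on encoding autodetection. The function name `guess_encoding_family` and the returned labels are made up for this example, and a real parser handles more cases than this.

```python
import codecs

def guess_encoding_family(data: bytes) -> str:
    """Guess the encoding family of an XML document from its first bytes."""
    # With a byte order mark. UTF-32 is tested before UTF-16 because the
    # UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
    boms = [
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name

    # Without a BOM: look at how the leading "<?" / "<?xm" is laid out.
    patterns = [
        (b"\x00\x00\x00<", "utf-32-be"),
        (b"<\x00\x00\x00", "utf-32-le"),
        (b"\x00<\x00?", "utf-16-be"),
        (b"<\x00?\x00", "utf-16-le"),
        (b"<?xm", "ascii-compatible"),  # still need the encoding attribute
        (b"\x4c\x6f\xa7\x94", "ebcdic"),  # "<?xm" in EBCDIC
    ]
    for prefix, name in patterns:
        if data.startswith(prefix):
            return name

    # No BOM and no recognizable declaration: XML defaults to UTF-8.
    return "utf-8"
```

The result is only a family: for "ascii-compatible" (and "ebcdic") the declaration still has to be read to learn the exact encoding.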

And the encoding label isn't redundant either. For example, if you use the algorithm in the XML spec to find out that some ASCII-based (or ASCII-compatible) encoding is used, you still need to read the encoding to find out which one is actually used (valid candidates would be ASCII, UTF-8, any of the ISO-8859-* encodings, any of the Windows-* encodings, KOI8-R and many, many others). For the <?xml part itself it won't make a difference which one it is, but for the rest of the document, it can make a huge difference.
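To make that concrete, here is a small Python sketch. It reads the declaration from the raw bytes, which works because the declaration is pure ASCII in any ASCII-compatible encoding, and then decodes the whole document. The helper names `declared_encoding` and `decode_document` and the two sample documents are invented for this example, and the regular expression is a simplification of the real grammar.

```python
import re
from typing import Optional

# Simplified pattern for the encoding pseudo-attribute of the XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes) -> Optional[str]:
    """Return the declared encoding name, or None if there is no declaration."""
    match = _DECL.search(data[:200])
    return match.group(1).decode("ascii") if match else None

def decode_document(data: bytes) -> str:
    """Decode the document with its declared encoding (UTF-8 if undeclared)."""
    return data.decode(declared_encoding(data) or "utf-8")

# Same payload byte (0xE9), two different labels.
doc_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9</a>'
doc_koi8r = b'<?xml version="1.0" encoding="KOI8-R"?><a>\xe9</a>'

print(decode_document(doc_latin1))  # the 0xE9 byte becomes a Latin letter
print(decode_document(doc_koi8r))   # the same byte becomes a Cyrillic letter
```

Both documents start with the same readable `<?xml` bytes, but the content of `<a>` comes out as a different character depending on the label.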

Regarding mislabeled XML files: yes, it's easy to produce those. However, the XML spec clearly specifies that those files are malformed and as such are not correct XML. Incorrect encodings must be reported as an error (as long as they can be detected!). So it's the problem of whoever is producing the XML.
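The "as long as they can be detected" caveat is easy to see with Python's standard `xml.etree.ElementTree` parser. This is only a sketch: the helper `parse_text` and the sample documents are made up for this example.

```python
import xml.etree.ElementTree as ET

def parse_text(data: bytes) -> str:
    """Return the root element's text, or a short note if parsing fails."""
    try:
        return ET.fromstring(data).text or ""
    except ET.ParseError as exc:
        return f"rejected: {exc}"

# A lone 0xE9 byte is not valid UTF-8, so this mislabeling is detectable.
mislabeled = b'<?xml version="1.0" encoding="UTF-8"?><a>\xe9</a>'

# Every byte is valid ISO-8859-1, so if the producer really wrote KOI8-R
# here, the parser has no way to notice the wrong label.
undetectable = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9</a>'

print(parse_text(mislabeled))
print(parse_text(undetectable))
```

The first document is rejected by the parser. The second parses without complaint, whatever the producer actually meant.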

Score: 6

You're quite right that it looks like an odd design. It only works because the XML declaration uses only ASCII characters, and nearly all encodings are supersets of ASCII. If you're prepared to accept something that isn't, for example EBCDIC, you can check whether the file starts with whatever the EBCDIC representation of "<?xml" is. This means you're relying on the general level of redundancy in the header of the file, rather than purely on the encoding attribute itself. Like many things in XML, it's pragmatic and works, but isn't particularly elegant.
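A minimal Python sketch of that EBCDIC check, assuming code page 037 (Python's `cp037` codec) as the EBCDIC flavor. Other EBCDIC code pages exist, and the function name `looks_like_ebcdic_xml` is invented for this example.

```python
ASCII_PREFIX = b"<?xml"
# "<?xml" in EBCDIC code page 037; the first four bytes are 4C 6F A7 94.
EBCDIC_PREFIX = "<?xml".encode("cp037")

def looks_like_ebcdic_xml(data: bytes) -> bool:
    """True if the data starts with the EBCDIC (cp037) bytes for '<?xml'."""
    return data.startswith(EBCDIC_PREFIX)

sample = '<?xml version="1.0" encoding="cp037"?><a/>'.encode("cp037")
print(looks_like_ebcdic_xml(sample))                      # EBCDIC document
print(looks_like_ebcdic_xml(b'<?xml version="1.0"?>'))    # ASCII document
```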

Score: 3

XML parsers are only required to support UTF-8 and UTF-16. The XML parser starts by trying the encodings indicated by the Byte Order Mark (BOM), if present (for UTF-16, UTF-32 and even UTF-8 with its dummy BOM). If no BOM is found, the parser will try UTF-32, UTF-16, UTF-8, ASCII and other ASCII-compatible single-byte encodings. Only then will it see the encoding attribute, and it will restart parsing if necessary.

Score: 0

I think in principle you might have a point that the encoding statement is 'late' in the file. However, the whole first line only uses basic characters. AFAIK, those are the same in almost all encodings, so whatever you decode it as, it'll read <?xml ... ?> anyway.

Whatever comes after that, however, could matter. For example, text in a CDATA section could be encoded in a Cyrillic encoding.
