Reading XML with an "&" into C# XMLDocument Object

Accepted answer
Score: 42

The problem is that the XML is not well-formed. Properly generated XML would list that data like this:

Prepaid &amp; Charge

I've had to fix the same problem before, and I did it with this regex:

Regex badAmpersand = new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)");

Combine that with a string constant defined like this:

const string goodAmpersand = "&amp;";

Now you can just say badAmpersand.Replace(<your input>, goodAmpersand);

Note that a simple String.Replace("&", "&amp;") isn't good enough, since you can't know in advance for a given document whether any & characters will be coded correctly, incorrectly, or even both in the same document.
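A minimal sketch of the whole approach, assuming the input arrives as a string. The class and method names (AmpersandFixer, LoadLenient) are made up for illustration; the pattern is the one given above.

```csharp
using System.Text.RegularExpressions;
using System.Xml;

public static class AmpersandFixer
{
    // An '&' that is NOT already the start of a named or decimal entity.
    private static readonly Regex BadAmpersand =
        new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};)");

    private const string GoodAmpersand = "&amp;";

    // Hypothetical helper: escape stray ampersands, then parse the result.
    public static XmlDocument LoadLenient(string rawXml)
    {
        string repaired = BadAmpersand.Replace(rawXml, GoodAmpersand);
        var doc = new XmlDocument();
        doc.LoadXml(repaired); // still throws XmlException for other defects
        return doc;
    }
}
```

Given an input that mixes both styles, such as <items><item>Prepaid & Charge</item><item>Fish &amp; Chips</item></items>, only the bare & in the first item is rewritten; the already-escaped one is left alone, which is what a plain String.Replace cannot do.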

The catches here are that you have to do this to your XML document before loading it into your parser, which likely means an extra pass through it. Also, it does not account for ampersands inside of a CDATA section. Finally, it only catches ampersands, not other illegal characters like <. Update: based on the comment, I need to update the expression for hex-coded (&#x...;) entities as well.
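The answer does not show the updated expression. One possible variant, offered only as an assumption about what the update might look like, adds a third alternative to the lookahead for hex references; the {1,6} bound and the class name are my own choices.

```csharp
using System.Text.RegularExpressions;

public static class HexAwareAmpersandFixer
{
    // Original pattern plus one extra alternative for hex references (&#x...;).
    // The hex length bound {1,6} is an assumption, not taken from the answer.
    private static readonly Regex BadAmpersand =
        new Regex("&(?![a-zA-Z]{2,6};|#[0-9]{2,4};|#x[0-9a-fA-F]{1,6};)");

    // Returns the input with every stray '&' replaced by "&amp;".
    public static string Repair(string rawXml)
    {
        return BadAmpersand.Replace(rawXml, "&amp;");
    }
}
```

With this variant, a &#x26; b & c becomes a &#x26; b &amp; c. It still ignores the CDATA caveat mentioned above.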

Regarding which characters can cause problems, the actual rules are a little complex. For example, certain characters are allowed in data, but not as the first letter of an element name. And there's no simple list of illegal characters. Instead, a large (non-contiguous) swath of Unicode is defined as legal, and anything outside of that is illegal.
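Rather than hand-coding those ranges, you can ask the framework. The sketch below assumes .NET 4.0 or later, where XmlConvert.VerifyXmlChars and XmlConvert.VerifyName are available; the wrapper class and method names are illustrative only.

```csharp
using System.Xml;

public static class XmlCharCheck
{
    // True if every character falls inside the ranges XML 1.0 allows.
    // This checks characters only; markup such as a bare '&' still passes.
    public static bool IsLegalContent(string text)
    {
        try { XmlConvert.VerifyXmlChars(text); return true; }
        catch (XmlException) { return false; }
    }

    // True if the (non-empty) string could be used as an element or attribute name.
    public static bool IsLegalName(string name)
    {
        try { XmlConvert.VerifyName(name); return true; }
        catch (XmlException) { return false; }
    }
}
```

For instance, "1abc" is fine as content but fails as a name because of the leading digit, while a control character such as U+0001 fails as content.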

So when it comes down to it, you have to trust your document source to have at least a certain amount of compliance and consistency. For example, I've found that people are often smart enough to make sure the tags work properly and escape <, even if they don't know that & isn't allowed, hence your problem today. However, the best thing would be to get this fixed at the source.

Oh, and a note about the CDATA suggestion: I'd use that to make sure XML that I'm creating is well-formed, but when dealing with existing XML from outside, I find the regex method easier.

Score: 4

The web application isn't at fault, the XML document is. Ampersands in XML should be encoded as &amp;. Failure to do so is a syntax error.

Edit: in answer to the followup question, yes, there are all kinds of similar errors. For example, unbalanced tags, unencoded less-than signs, unquoted attribute values, octets outside of the character encoding and various Unicode oddities, unrecognised entity references, and so on. In order to get any decent XML parser to consume a document, that document must be well-formed. The XML specification requires that a parser encountering a malformed document throw a fatal error.

Score: 4

The other answers are all correct, and I concur with their advice, but let me just add one thing:

PLEASE do not make applications that work with non-well-formed XML; it just makes the rest of our lives more difficult :).

Granted, there are times when you really just don't have a choice if you have no control over the other end, but you should really have it throwing a fatal error and complaining very loudly and explicitly about what is broken when such an event occurs.

You could probably take it one step further and say "Ack! This XML is broken in these places and for these reasons, here's how I tried to fix it to make it well-formed: ...".

I'm not overly familiar with the MSXML APIs, but most good XML parsers will allow you to install error handlers so that you can trap the exact line/column number where errors are appearing along with getting the error code and message.
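Since the question is about C# rather than MSXML, here is a rough sketch of the same idea with System.Xml: XmlException carries LineNumber and LinePosition, so a failed load can be reported precisely. The XmlDiagnostics class and its Describe method are hypothetical names.

```csharp
using System.Xml;

public static class XmlDiagnostics
{
    // Returns null when the text parses; otherwise a message that includes
    // the position the parser reported for the first fatal error.
    public static string Describe(string rawXml)
    {
        try
        {
            new XmlDocument().LoadXml(rawXml);
            return null;
        }
        catch (XmlException ex)
        {
            return $"Malformed XML at line {ex.LineNumber}, position {ex.LinePosition}: {ex.Message}";
        }
    }
}
```

Logging this message before (or instead of) attempting a repair gives whoever owns the source something concrete to fix.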

Score: 3

Your database doesn't contain XML documents. It contains some well-formed XML documents and some strings that look like XML to a human.

If it's at all possible, you should fix this - in particular, you should fix whatever process is generating the malformed XML documents. Fixing the program that reads data out of this database is just putting wallpaper over a crack in the wall.

Score: 2

You can replace & with &amp;

Or you might also be able to use CDATA sections.

Score: 2

There are several characters which will cause XML data to be reported as badly-formed.

From w3schools:

Characters like "<" and "&" are illegal in XML elements.

The best solution for input you can't trust to be XML-compliant is to wrap it in CDATA tags, e.g.

<![CDATA[This is my wonderful & great user text]]>

Everything between <![CDATA[ and ]]> is treated as plain character data rather than parsed as markup.
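If you are the one generating the XML, you can let the DOM build the CDATA section instead of concatenating strings. This is only a sketch: the element name "comment" and the CDataExample class are invented for the example.

```csharp
using System.Xml;

public static class CDataExample
{
    // Wraps arbitrary user text in a CDATA section under a <comment> element
    // (the element name is hypothetical) and returns the serialized XML.
    public static string Wrap(string userText)
    {
        var doc = new XmlDocument();
        XmlElement root = doc.CreateElement("comment");
        root.AppendChild(doc.CreateCDataSection(userText));
        doc.AppendChild(root);
        return doc.OuterXml;
    }
}
```

Bear in mind that a single CDATA section cannot contain the sequence ]]>, so text that might include it needs separate handling.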
