[ACCEPTED]-Is HTML a context-free language?-sgml

Accepted answer
Score: 61

Context Free is a concept from language theory that 47 has important implications in parser implementation. A 46 Context Free Language can be described by a Context Free Grammar, which is one in 45 which all rules have a single non-terminal 44 symbol at the left of the arrow:

X→δ

That simple 43 restriction allows X to be substituted by 42 the right-hand side of the rules in which 41 appears on the left without regard to what 40 came before or after. For example, if while 39 deriving or parsing one arrives at:

αXλ 

one is 38 sure that

αδλ

is also valid. Examples of non-context-free 37 rules would be:

XY→δ
Xa→δ
aX→δ

Those would require knowing 36 what could be derive arround X to determine 35 if a rule applies, and that leads to non-determinism 34 (what's around X would also like to know 33 what it derives to), which is a no-no in 32 parsing, and in any case we want a language 31 to be well-defined.

The only way to prove 30 that a language is context-free is by proving 29 that there's a context-free grammar for 28 it, which is not an easy task. Most programming 27 languages one comes about are already described 26 by CFGs, so the job is done. But there are 25 other languages, including programming languages, that 24 are described using logic or plain English, so 23 work is required to find if they are context-free.

For 22 HTML, the answer about its context-freedom 21 is yes. SGML is a well defined Context Free 20 Language, and HTML defined on top of it 19 is also a CFL. Parsers and grammars for 18 both languages abound on the Web. At any 17 rate, that there exist LL(k) grammars for valid HTML is enough proof that 16 the language is context-free, because LL 15 is a proven subset of CF.

But the way HTML 14 evolved over the life of the Web forced 13 browsers to treat it as not that well defined. Modern 12 Web browsers will go out of their way to 11 try to render something sensible out of 10 almost anything they find. The grammars 9 they use are not CFGs, and the parsers are 8 far more complex than the ones required 7 for SGML/HTML.

HTML is defined at several 6 levels.

  1. At the lexical level there are the rules for valid characters, identifiers, strings, and so on.
  2. At the next level is XML, which consists of the opening and closing <tags> that define a hierarchical document structure. You can use XML or something XML-like for any purpose, like Apache Ant does for build scripts.
  3. At the next level are the tags that are valid in HTML, and the rules about which tags may be nested within which tags.
  4. At the next level are the rules about which attributes are valid for which tags, languages that can be embedded in HTML like CSS and JavaScript.
  5. Finally, you have the semantic rules about what a given HTML document means.

The syntactic part is defined well 5 enough that it can be verified. The semantic part 4 is much larger than the syntactic one, and 3 is defined in terms of browser actions regarding 2 HTTP, and the Document Object Model (DOM), and how a model should 1 be rendered to the screen.

In the end:

  1. Parsing correct HTML is extremely easy (it's context-free and LL/LR).
  2. Parsing the HTML that actually exists over the Web is difficult.
  3. Implementing the semantics (a browser) over HTML/CSS/DOM is extremely difficult.
Score: 14

Valid HTML is not a context-free language.

First 59 of all, HTML being an application of SGML 58 is fiction for all practical purposes, so 57 analyzing SGML to answer the question is 56 useless. (However, the SGML fiction probably 55 isn't context-free, either.)

It's more useful 54 to look at the actually defined HTML parsing 53 algorithm. It works on two levels: tokenization 52 and tree building. What HTML calls tokenization 51 is a higher-level operation than what is 50 usually called tokenization when talking 49 about parsers. In the case of HTML, tokenization 48 splits a stream of characters into units 47 like start tags, end tags, comments and 46 text. The tokenizer expands character references. Usually, when 45 talking about parsers, you'd probably treat 44 stuff like the less-than sign as "tokens" and 43 would consider character references to consist 42 of tokens instead of being resolved by the 41 tokenizer.

If you consider the process of 40 splitting the input stream into tokens, that 39 level of the HTML language is regular (except for 38 feedback from the tree builder).

However, there 37 are three complications: The first one is 36 that splitting the input stream into tokens 35 is just the first and then there's the tree 34 builder's side that actually cares about 33 the identifiers in the tokens. The second 32 one is that the tree builder feeds back 31 into the tokenizer so that some state transitions 30 made by the tokenizer depend on the state 29 of the tree builder! The third one is that 28 valid documents in the language are defined 27 by rules that apply to the output of the 26 tree builder stage and those rules are complex 25 enough that they can't be fully defined 24 using tree automata (as evidenced by RELAX 23 NG not being expressive enough to describe 22 all the validity constraints).

This isn't 21 an actual proof, but you can probably develop 20 real proofs by working from complications 19 #2 and #3.

Note that the case of invalid 18 documents is not particularly interesting 17 as a question of whether the language is 16 context-free in the sense of there being 15 a context-free grammar that generates all 14 the possible strings with no regard to the 13 parse tree having some intelligible interpretation 12 in terms of the tree that an HTML parser 11 generates. The HTML parser will successfully 10 consume all possible strings, so in that 9 sense, all possible strings are in the "invalid 8 HTML" language.

Edit: Interesting questions 7 left as exercise to the reader:

Is HTML without 6 parse errors but ignoring validity a context-free 5 language?

Is HTML without parse errors and 4 ignoring general validity but with only 3 valid element names allowed a context-free 2 language?

(Complication #2 applies in both 1 cases.)

Score: 11

NO

See Edit Below

It depends.

If you are talking about the subset consisting of only theoretical HTML, then yes.

If you also include real life, working HTML 67 that is accessed and used successfully by 66 millions of people daily on many of the 65 top sites on the internet then NO.

That is 64 what gives HTML flexibility. The parsing 63 engine adds tags, closes tags, and takes 62 care of stuff that a theoretical CFG can't 61 do. If you took automata you might remember 60 that a production rule in a formal grammar 59 cannot be empty (aka epsilon/lambda) on 58 the lhs (left-hand side). Since the parsing 57 engine is basically using knowledge that 56 a formal grammar and automata couldn't have, it 55 isn't restricted by that and the 'grammar' would 54 have epsilon/lambda -> result where the specific epsilon/lambda 53 rule is chosen based on information not 52 available in the grammar.

Since I don't think 51 empty lhs are allowed in any formal grammars, HTML 50 cannot be defined by a formal grammar and 49 is not a formal language at all.

Sure, HTML5 48 might try to move towards a 'more formal' language 47 description but the likelihood that it becomes 46 a context free language in reality (i.e. strings 45 not matched by the grammar are rejected) is 44 about the likelihood XHTML 2.0 takes the 43 world by storm and replaces HTML altogether 42 (XHTML is the attempt they made to make 41 HTML a formal language...it was rejected 40 en masse due to its fragility).

Noteworthy 39 is the fact that HTML 5 is the FIRST HTML 38 standard to be defined before being implemented! That's 37 right, HTML 1-4 consist of random ideas 36 someone just implemented in a browser, and 35 were collected into standards after the 34 fact based on which features were popularly 33 used and widely implemented. Then they 32 tried XHTML, which totally failed to be 31 adopted. Even 'xhtml' on the web is automatically 30 parsed as HTML under almost every circumstance 29 to prevent stuff from just breaking with 28 a cryptic syntax error. Now you can see 27 how we got here and why it is unlikely to 26 be formalized any time soon.

Lesson: "In 25 theory, there is no difference between theory 24 and practice. In practice, there is." - Yogi 23 Berra

EDIT:

Actually, after reading through 22 the documents it turns out that HTML, even 21 according to the HTML 4.01 specification, doesn't 20 actually conform to SGML. To see for yourself, view 19 the HTML 4.01 Strict document type definition 18 (doctype) at http://www.w3.org/TR/html4/strict.dtd and note the following lines:

The 17 HTML 4.01 specification includes additional syntactic 16 constraints that cannot be expressed within the 15 DTDs.

So I would say that it is probably not a CFL 14 due to those features (although it technically 13 it doesn't disprove the hypothesis that 12 there is some possible PDA that accepts 11 HTML 4.01, it does prevent the argument 10 that SGML is a CFL therefore HTML is a CFL).

HTML5 9 flip-flops, abandoning any implied conformance to SGML, but is presumably describable 8 by a CFG. However it will still provide 7 best-effort parsing not based on a cfg, so 6 IMO the current situation (i.e. language 5 specification is defined formally, with 4 invalid strings still being accepted, parsed 3 and rendered in a best effort fashion) in 2 this regard is unlikely to change drastically 1 for a long, long, long time.

Score: 5

HTML5 is different from previous HTML versions 5 in that it strictly defines the parsing 4 behaviour of code that isn't completely 3 correct. Pre-HTML5 parsers vary and each 2 do their best to 'guess' the intention of 1 the code author.

More Related questions