zlacker

It took me years to notice, but did you catch that the answer actually subtly misinterprets what the question is asking for?

Guy (in my reading) appears to talk about matching an entire HTML document with regex. Indeed, that is not possible due to the grammars involved. But that is not what was being asked.

What was being asked is whether the individual HTML tags can be parsed via regex. And to my understanding those are very much workable, and there's no grammar capability mismatch either.

replies(2): >>tiagod+S2 >>somat+Zq

>>perchi+(OP)
I think even for single opening tags, as asked, there are impossible edge cases.

For example, this is perfectly valid XHTML:

    <a href="/" title="<a /> />"></a>

replies(2): >>comex+Bd >>chungy+sF

>>tiagod+S2
If you already know where the start of the opening tag is, then I think a regex is capable of finding the end of that same opening tag, even in cases like yours. In that sense, it’s possible to use a regex to parse a single tag. What’s not possible is finding opening tags within a larger fragment of HTML.
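To make that concrete, here is a minimal Python sketch, assuming a simplified notion of an opening tag: a tag name, then any mix of ordinary characters and complete quoted strings, then the closing `>`. The names `OPEN_TAG` and `end_of_opening_tag` are made up for illustration, and the pattern ignores plenty of real-world HTML quirks.

```python
import re

# Assumed simplification: after the tag name, consume either a character that
# is not '>' or a quote, or a complete single- or double-quoted string.
# A '>' inside quotes therefore never ends the tag.
OPEN_TAG = re.compile(r"""<[A-Za-z][^\s/>]*(?:[^>"']|"[^"]*"|'[^']*')*>""")

def end_of_opening_tag(html, start):
    """Return the index just past the opening tag that begins at `start`, or None."""
    m = OPEN_TAG.match(html, start)
    return m.end() if m else None

snippet = '<a href="/" title="<a /> />"></a>'
end = end_of_opening_tag(snippet, 0)
print(snippet[:end])
```

The important part is that the caller supplies `start`. The pattern only answers "where does this tag end?", not "where do tags begin?".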

replies(1): >>kstrau+sB

>>perchi+(OP)
The thing is, even when parsing HTML "correctly" (whatever that is), regexes will still be used. Sure, there will be a bunch of additional structures and mechanisms involved, but you will be identifying tokens via a bunch of regexes.

So yes, while it is an inspired, comedic genius of a rant, and sort of informative in that it opens your eyes to the limitations of regexes, it sort of brushes under the rug all the places where those poor maligned regular expressions will be used when parsing HTML.

replies(1): >>taftst+hI3

>>comex+Bd
For any given regex, an opponent can craft a string which is valid HTML but that the regex cannot parse. There are a million edge cases like:

  <!—- Don't count <hr> this! -—> but do count <hr> this -->

and

  <!-- <!-- Ignore <ht> this --> but do count <hr> this —->

Now your regex has to include balanced comment markers. Solve that.

You need a context-free grammar to correctly parse HTML with its quoting rules, and escaping, and embedded scripts and CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.

Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.

replies(2): >>marcos+wY >>Democr+jQ2

>>tiagod+S2
No, that is not valid. The "<" and ">" characters in string values must always be escaped with &lt; and &gt;. The correct form would be:

    <a href="/" title="&lt;a /&gt; /&gt;"></a>
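For what it's worth, here is a small sketch of producing that escaped form programmatically, assuming Python and its standard `html.escape` function. The variable names are just for illustration.

```python
import html

# The raw attribute value from the example upthread.
title = "<a /> />"

# html.escape replaces &, <, > (and quotes, by default) with entities.
escaped = html.escape(title)
tag = f'<a href="/" title="{escaped}"></a>'
print(tag)
```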

>>kstrau+sB
HTML comments do not nest. The obvious tokenizer you can create with regular expressions is the correct one.

replies(1): >>kstrau+g21

>>marcos+wY
If you're talking about tokenizers, then you're no longer parsing HTML with a regex. You're tokenizing it with a regex and processing it with an actual parser.
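A rough Python sketch of that split, under heavy simplifying assumptions: the regex only produces tokens, and a stack (the part a regular expression cannot do) tracks nesting. The token set, the short `VOID` list, and the function names are all made up for illustration. A real HTML parser also does error recovery, which this ignores.

```python
import re

# Tokenizer: one regex with an alternative per token kind (assumed token set).
TOKEN = re.compile(
    r"""(?P<comment><!--.*?-->)
      | (?P<close></[A-Za-z][^\s/>]*\s*>)
      | (?P<open><[A-Za-z][^\s/>]*(?:[^>"']|"[^"]*"|'[^']*')*>)
      | (?P<text>[^<]+|<)
    """,
    re.VERBOSE | re.DOTALL,
)
NAME = re.compile(r"[A-Za-z][^\s/>]*")
VOID = {"br", "hr", "img", "input", "meta", "link"}  # deliberately incomplete

def tokenize(html):
    """Yield (kind, text) pairs; this is the only place the regex is used."""
    for m in TOKEN.finditer(html):
        yield m.lastgroup, m.group()

def is_balanced(html):
    """The 'actual parser' part: a stack that checks tags nest properly."""
    stack = []
    for kind, text in tokenize(html):
        if kind == "open":
            name = NAME.search(text).group().lower()
            if name not in VOID and not text.endswith("/>"):
                stack.append(name)
        elif kind == "close":
            name = NAME.search(text).group().lower()
            if not stack or stack.pop() != name:
                return False
    return not stack

print(is_balanced("<div><p>hi <hr> there</p></div>"))
print(is_balanced("<div><p></div></p>"))
```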

replies(2): >>marcos+yF1 >>umanwi+AX4

>>kstrau+g21
If you are talking about detecting tags, you (and the person asking that SO question) are talking about tokenization, and everybody (like the one making that famous answer) bringing parsing into the discussion is just being an asshole.

>>kstrau+sB
I don't think your comment assumes the right givens. I just tried in Vivaldi (i.e. Chrome) and this snippet:

    <!doctype html>
    A<!—- Don't count <hr> this! -—> but do count <hr> that -->Z

gets fixed and rendered as

    <!DOCTYPE html>
    <html><head></head><body>A<!--—- Don't count <hr--> this! -—&gt; but do count <hr> that --&gt;Z</body></html>

Another surprise is that

    <!doctype html>
    A<!—- Don't count this! -— but do count that -->Z

gets rewritten to

    <!DOCTYPE html>
    <html><head></head><body>A<!--—- Don't count this! -— but do count that ---->Z</body></html>

Note the insertion of extra `--` hyphen-minus characters.

This is what MDN (https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Com...) has to say:

Comments start with the string `<!--` and end with the string `-->`, generally with text in between. This text cannot start with the string `>` or `->`, cannot contain the strings `-->` or `--!>`, nor end with the string `<!-`, though `<!` is allowed. [...] The above is true for XML comments as well. In addition, in XML, such as in SVG or MathML markup, a comment cannot contain the character sequence `--`.

Meaning that you can recognize HTML comments with (one branch of) a RegEx—you start wherever you see `<!--` and consume everything up to one of the listed alternatives. No nesting required.
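As a sketch of that branch in Python (an approximation, not the exact HTML rules): start at `<!--` and lazily consume up to the first `-->` or `--!>`. The names `COMMENT` and `strip_comments` are made up here, the examples use plain ASCII hyphens, and corner cases such as `<!-->` are not handled.

```python
import re

# Start at '<!--' and stop at the first '-->' or '--!>'. No nesting, no counting.
COMMENT = re.compile(r"<!--.*?(?:-->|--!>)", re.DOTALL)

def strip_comments(html):
    """Remove every comment the pattern recognizes."""
    return COMMENT.sub("", html)

# The first '-->' ends the comment; the rest is ordinary text.
print(strip_comments("A<!-- Don't count <hr> this! --> but do count <hr> that -->Z"))

# A second '<!--' inside a comment is just comment text.
print(strip_comments("<!-- <!-- Ignore <hr> this --> but do count <hr> this -->"))
```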

Be it said that I find the precise rules too convoluted for what they do. Especially XML's prohibition on `--` in comments is ridiculous taken on its own. First you tell me that a comment ends with three characters `-->`, and then you tell me I can't use the specific substring `--`, either? And why can't I use `--!>`?

An interesting bit here is that AFAIK the `<!` syntax was used in SGML as one of the alternatives to write a 'lone tag', so instead of `<hr></hr>` or `<hr/>` (XHTML) or `<hr>` (HTML) you could write `<!hr>` to denote a tag with no content. We should have kept this IMO.

*EDIT* In the quoted HTML source you see things like `-—` (hyphen-minus, em-dash). This is how the Vivaldi DevTools render it; my text editor and the HN comment system did not alter these characters. I have no idea whether Chrome's rendering engine internally uses these em-dashes or whether it's just a quirk in the DevTools text output.

>>somat+Zq
This is a pragmatic answer. While yes, regex is not proven to be the Most Correct Solution for a generalized parse, when you are sitting down with some data in front of you and you can grab the needed bits with a regex group, why not use exactly that? It might be part of a bigger parsing strategy, sure. But if it gets the job done, that means you can move on to the next thing.

>>kstrau+g21
The original SO question was not asking about parsing.