I'm sure you already know this one, but for anyone else reading this I can share my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454
I don't suggest writing generic HTML parsers that works with any site, but for custom crawlers they work great.
Not to say that the tools available are the same now as 20 years ago. Today I would probably use puppeteer or some similar tool and query the DOM instead.
I know it's a hassle for a platform to moderate good rants from bad ones, and I decry SO from pushing too hard against these. I truly believe that our industry would benefit from more drunken technical rants.
A scraper is already resigned to being brittle and weird. You’re relying not only on the syntax of the data, but an implicit structure beyond that. This structure is unspecified and may change without notice, so whatever robustness you can achieve will come from being loose with what you accept and trying to guess what changes might be made on the other end. Regex is a decent tool for that.
It's a very bad answer. First of all, processing HTML with regex can be perfectly acceptable depending on what you're trying to do. Yes, this doesn't include full-blown "parsing" of arbitrary HTML, but there are plenty of ways in which you might want to process or transform HTML that either don't require producing a parse tree, don't require perfect accuracy, or are operating on HTML whose structure is constrained and known in advance. Second, it doesn't even attempt to explain to OP why parsing arbitrary HTML with regex is impossible or poorly-advised.
The OP didn't want his post to be taken over by someone hamming it up with an attempt at creative writing. He wanted a useful answer. Yes, this answer is "quirky" and "whimsical" and "fun" but I read those as euphemisms for "trying to conscript unwilling victims into your personal sense of nerd-humor".
I parse my own HTML I produce directly in a context where I fully control the output. It works fine, but parsing other people’s HTML is a lesson in humility. I’ve also done that, but I did it as a one time thing. I parsed a specific point in time, refusing to change that at any point.
It also comes from a time in Internet culture when humor was appreciated instead of aggressively downvoted.
So extracting information from this text with regexps often makes perfect sense.
Guy (in my reading) appears to talk about matching an entire HTML document with regex. Indeed, that is not possible due to the grammars involved. But that is not what was being asked.
What was being asked is whether the individual HTML tags can be parsed via regex. And to my understanding those are very much workable, and there's no grammar capability mismatch either.
Like, it’s not a matter of cleverness, either. You can’t code around it. It’s simply not possible.
For example, this is perfectly valid XHTML:
<a href="/" title="<a /> />"></a>This is also the reason why I consider the lack of images in IRC a feature.
So yes, while it is an inspired comidic genius of a rant, and sort of informative in that it opens your eyes to the limitations of regexes, it sort of brushes under the rug all the places that those poor maligned regular expressions will be used when parsing html.
<!—- Don't count <hr> this! -—> but do count <hr> this -->
and <!-- <!-- Ignore <ht> this --> but do count <hr> this —->
Now your regex has to include balanced comment markers. Solve thatYou need a context-free grammar to correctly parse HTML with its quoting rules, and escaping, and embedded scripts and CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.
Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.
<a href="/" title="<a /> />"></a> <!doctype html>
A<!—- Don't count <hr> this! -—> but do count <hr> that -->Z
gets fixed and rendered as <!DOCTYPE html>
<html><head></head><body>A<!--—- Don't count <hr--> this! -—> but do count <hr> that -->Z</body></html>
Another surprise is that <!doctype html>
A<!—- Don't count this! -— but do count that -->Z
gets rewritten to <!DOCTYPE html>
<html><head></head><body>A<!--—- Don't count this! -— but do count that ---->Z</body></html>
Note the insertion of extra `--` minus-hyphens.This is what MDN (https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Com...) has to say:
Comments start with the string `<!--` and end with the string `-->`, generally with text in between. This text cannot start with the string `>` or `->`, cannot contain the strings `-->` or `--!>`, nor end with the string `<!-`, though `<!` is allowed. [...] The above is true for XML comments as well. In addition, in XML, such as in SVG or MathML markup, a comment cannot contain the character sequence `--`.
Meaning that you can recognize HTML comments with (one branch of) a RegEx—you start wherever you see `<!--` and consume everything up to one of the listed alternatives. No nesting required.
Be it said that I find the precise rules too convoluted for what they do. Especially XML's prohibition on `--` in comments is ridiculous taken on its own. First you tell me that a comment ends with three characters `-->`, and then you tell me I can't use the specific substring `--`, either? And why can't I use `--!>`?
An interesting bit here is that AFAIK the `<!` syntax was used in SGML as one of the alternatives to write a 'lone tag', so instead of `<hr></hr>` or `<hr/>` (XHTML) or `<hr>` (HTML) you could write `<!hr>` to denote a tag with no content. We should have kept this IMO.
*EDIT* On the quoted HTML source you see things like `-—` (hyphen-minus, em-dash). This is how the Vivaldi DevTools render it; my text editor and HN comment system did not alter these characters. I have no idea whether Chrome's rendering engine internally uses these em-dashes or whether it's just a quirk in DevTool text output.