zlacker

> It's the same class of bug as manually parsing HTML with regex, it works right up until it doesn't

I'm sure you already know this one, but for anyone else reading this I can share my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454

replies(5): >>josefx+r2 >>Cthulh+S5 >>bayesn+w6 >>umanwi+fC >>perchi+ET

>>lvncel+(OP)
I prefer the question about CPU pipelines that gets explained using a railroad switch as an example. That one does a decent job of answering the question instead of going off on a, how to best put it, mentally deranged one-page rant about regexes, with the lazy throwaway line at the end being the only thing that makes it qualify as an answer at all.

replies(3): >>kapep+96 >>MrGilb+B7 >>bityar+0H

>>lvncel+(OP)
HE COMES

>>josefx+r2
The regex answer is from the very old days of Stack Overflow, before fun was banned. I agree it barely qualifies as an answer, but considering that the question has over 4 million page views (which almost puts it in the top 100 most viewed questions of all time), it has reached a lot of people. The answer probably had much more influence than any serious answer on that topic. So I'd say the author did a good job.

replies(2): >>bobinc+gf >>Dangit+kg

>>lvncel+(OP)
I know this is grumpy, but I’ve never liked this answer. It is a perfect encapsulation of the elitism in the SO community: if you’re new, your questions are closed and your answers are edited and downvoted. Meanwhile this is tolerated only because it’s posted by a member with high rep and username recognition.

replies(2): >>171862+o9 >>throwa+Wd

>>josefx+r2
For anyone wondering about the railroad switch post: https://stackoverflow.com/questions/11227809/why-is-processi...

replies(1): >>operat+3K

>>bayesn+w6
I think this answer was tolerated when SO wasn't as bad as it is now, and wouldn't be tolerated now from anyone.

replies(1): >>bombca+AG

>>bayesn+w6
As someone who used to write custom crawlers 20 years ago, I can confirm that regular expressions worked great. All my crawlers were custom designed for a page, and the sites were mostly generated by some CMS and had consistent HTML. I don't remember having to do many bug fixes that were related to regular expression issues.

I don't suggest writing generic HTML parsers that work with any site, but for custom crawlers they work great.

Not to say that the tools available are the same now as 20 years ago. Today I would probably use puppeteer or some similar tool and query the DOM instead.

replies(2): >>wat100+7y >>vbezhe+bO

>>kapep+96
Of all the things I wrote on SO, including many actually-useful detailed explanations, it was this drunken rant that stuck, for some reason.

replies(2): >>falcor+Xk >>scott_+UA

>>kapep+96
People have shared it here and on reddit a bunch of times because it's funny. I always found the pragmatic counter-answer about using regex and the comments about how brittle it is to parse XML properly assuming a specific structure to be much more useful.

replies(1): >>imtrin+fE3

>>bobinc+gf
And for that I applaud you.

I know it's a hassle for a platform to tell good rants from bad ones, and I decry SO for pushing too hard against these. I truly believe that our industry would benefit from more drunken technical rants.

>>throwa+Wd
I would distinguish between parsing and scraping. Parsing really needs a, well, parser. Otherwise you’ll get things wrong on perfectly well formed input and your program will be brittle and weird.

A scraper is already resigned to being brittle and weird. You’re relying not only on the syntax of the data, but an implicit structure beyond that. This structure is unspecified and may change without notice, so whatever robustness you can achieve will come from being loose with what you accept and trying to guess what changes might be made on the other end. Regex is a decent tool for that.

>>bobinc+gf
I think of, and look up, this drunken rant at least once a year.

>>lvncel+(OP)
Funny how differently people can perceive things. That's my least favorite SO answer of all time, and I cringe every time I see it.

It's a very bad answer. First of all, processing HTML with regex can be perfectly acceptable depending on what you're trying to do. Yes, this doesn't include full-blown "parsing" of arbitrary HTML, but there are plenty of ways in which you might want to process or transform HTML that either don't require producing a parse tree, don't require perfect accuracy, or are operating on HTML whose structure is constrained and known in advance. Second, it doesn't even attempt to explain to OP why parsing arbitrary HTML with regex is impossible or poorly-advised.

The OP didn't want his post to be taken over by someone hamming it up with an attempt at creative writing. He wanted a useful answer. Yes, this answer is "quirky" and "whimsical" and "fun" but I read those as euphemisms for "trying to conscript unwilling victims into your personal sense of nerd-humor".

replies(2): >>philis+AE >>chucks+eI

>>umanwi+fC
The whole argument hinges on one word in your post: arbitrary.

I parse my own HTML I produce directly in a context where I fully control the output. It works fine, but parsing other people’s HTML is a lesson in humility. I’ve also done that, but I did it as a one time thing. I parsed a specific point in time, refusing to change that at any point.

replies(1): >>umanwi+6I

>>171862+o9
It's because SO at the time was a small high-trust society where "everyone knew each other" and so things flew back then that wouldn't fly now.

>>josefx+r2
But--and this is crucial--the one about regexes is hilarious.

It also comes from a time in Internet culture when humor was appreciated instead of aggressively downvoted.

replies(1): >>encom+o61

>>philis+AE
It also hinges on another word: parsing. There are things other than parsing that you might want to do. For example, if you want to count the number of `<hr>` tags in an HTML document, that doesn't require parsing it, and can indeed be done with regex.
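
A minimal sketch of that count, assuming Python's `re` module and a made-up sample document (neither comes from the thread). It counts text that looks like an `<hr>` tag and knows nothing about where in the document the match sits:

```python
import re

# Hypothetical sample document, used only for illustration.
html = "<p>intro</p><hr><p>middle</p><HR class='wide'><p>end</p><hr />"

# "<hr", then a word boundary so that "<hrx>" is not counted,
# then anything up to the next ">".
HR_TAG = re.compile(r"<hr\b[^>]*>", re.IGNORECASE)

hr_count = len(HR_TAG.findall(html))
print(hr_count)
```

On this sample the three variants (`<hr>`, `<HR class='wide'>`, `<hr />`) are all counted.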

replies(1): >>kstrau+9U

>>umanwi+fC
There's nothing that brings joy into this world quite like the guy waiting around to tell people he doesn't like the thing they like.

>>MrGilb+B7
This is new to me, and a wonderful dive that I wish I was aware of during my OS course. Thanks!

>>throwa+Wd
An interesting thing is that most webpages are generated using text templates. There's some text processing like escaping special characters, but it's mostly text that happened to be (somewhat) valid HTML.

So extracting information from this text with regexps often makes perfect sense.

>>lvncel+(OP)
It took me years to notice, but did you catch that the answer actually subtly misinterprets what the question is asking for?

Guy (in my reading) appears to talk about matching an entire HTML document with regex. Indeed, that is not possible due to the grammars involved. But that is not what was being asked.

What was being asked is whether the individual HTML tags can be parsed via regex. And to my understanding those are very much workable, and there's no grammar capability mismatch either.

replies(2): >>tiagod+wW >>somat+Dk1

>>umanwi+6I
No you can’t. You can have an unescaped <hr> inside a script tag, for example. The best you can do is a simple string search for “<hr>” and hope it’s returning what you think it might be returning. Regexps are not powerful enough to determine whether any particular instance of “<hr>” is actually an HTML tag.

Like, it’s not a matter of cleverness, either. You can’t code around it. It’s simply not possible.
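
To make the script-tag case concrete, here is a small sketch comparing a plain regex count with Python's standard `html.parser`. The sample document and the `HrCounter` class are illustrative assumptions, and `html.parser` is just one tokenizer that treats script contents as raw text, not a full browser-grade parser:

```python
import re
from html.parser import HTMLParser


class HrCounter(HTMLParser):
    """Counts <hr> start tags as reported by the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; script contents are delivered
        # as data, so an "<hr>" inside a script never reaches this hook.
        if tag == "hr":
            self.count += 1


# Hypothetical document: two real <hr> tags and one inside a script string.
doc = '<p>a</p><hr><script>var s = "<hr>";</script><hr>'

regex_count = len(re.findall(r"<hr\b[^>]*>", doc, re.IGNORECASE))

parser = HrCounter()
parser.feed(doc)
parser.close()

print(regex_count, parser.count)
```

The regex sees three matches, while the parser reports two start tags, because it tracks that it is inside a `<script>` element.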

>>perchi+ET
I think even for single opening tags like asked there are impossible edge cases.

For example, this is perfectly valid XHTML:

    <a href="/" title="<a /> />"></a>

replies(2): >>comex+f71 >>chungy+6z1

>>bityar+0H
It's because the author put effort into it. Most (online) humour is lazy, low effort, regurgitated meme spam. See: Reddit. It should be downvoted and ideally never posted at all.

This is also the reason why I consider the lack of images in IRC a feature.

>>tiagod+wW
If you already know where the start of the opening tag is, then I think a regex is capable of finding the end of that same opening tag, even in cases like yours. In that sense, it’s possible to use a regex to parse a single tag. What’s not possible is finding opening tags within a larger fragment of HTML.
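
A sketch of that idea, assuming the match is anchored at a known tag start. The pattern names are made up for illustration, and the pattern only skips over quoted attribute values; it is not a full implementation of the HTML tokenizer rules:

```python
import re

# Anchored at a known "<": tag name, then any mix of double-quoted
# values, single-quoted values, or other characters, up to the first
# ">" that is outside quotes.
OPEN_TAG = re.compile(r"""<[A-Za-z][^\s/>]*(?:"[^"]*"|'[^']*'|[^'">])*>""")

# Naive version that stops at the first ">" it sees, quoted or not.
NAIVE_TAG = re.compile(r"<[^>]*>")

fragment = '<a href="/" title="<a /> />"></a>'

quote_aware = OPEN_TAG.match(fragment).group(0)
naive = NAIVE_TAG.match(fragment).group(0)

print(quote_aware)  # the whole opening tag, including the title attribute
print(naive)        # cut off inside the title attribute
```

Note that both calls use `match`, so they only work because the start position is given. Searching a larger fragment for tag starts is the part that needs context.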

replies(1): >>kstrau+6v1

>>perchi+ET
The thing is, even when parsing HTML "correctly" (whatever that is), regexes will still be used. Sure, there will be a bunch of additional structures and mechanisms involved, but you will be identifying tokens via a bunch of regexes.

So yes, while it is an inspired comedic genius of a rant, and sort of informative in that it opens your eyes to the limitations of regexes, it sort of brushes under the rug all the places that those poor maligned regular expressions will be used when parsing HTML.

replies(1): >>taftst+VB4

>>comex+f71
For any given regex, an opponent can craft a string which is valid HTML but that the regex cannot parse. There are a million edge cases like:

  <!—- Don't count <hr> this! -—> but do count <hr> this -->

and

  <!-- <!-- Ignore <ht> this --> but do count <hr> this —->

Now your regex has to include balanced comment markers. Solve that.

You need a context-free grammar to correctly parse HTML with its quoting rules, and escaping, and embedded scripts and CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.

Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.

replies(2): >>marcos+aS1 >>Democr+XJ3

>>tiagod+wW
No, that is not valid. The "<" and ">" characters in string values must always be escaped with `&lt;` and `&gt;`. The correct form would be:

    <a href="/" title="&lt;a /&gt; /&gt;"></a>

>>kstrau+6v1
HTML comments do not nest. The obvious tokenizer you can create with regular expressions is the correct one.

replies(1): >>kstrau+UV1

>>marcos+aS1
If you're talking about tokenizers, then you're no longer parsing HTML with a regex. You're tokenizing it with a regex and processing it with an actual parser.
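
A rough sketch of that split, with a deliberately simplified token grammar (the `TOKEN` pattern, the `VOID` set and `is_balanced` are all illustrative assumptions, not real HTML rules). The regex only finds tokens; the stack, which a regular expression cannot emulate for arbitrary depth, does the nesting check:

```python
import re

# Simplified tokens: comment, closing tag, opening tag, or text.
TOKEN = re.compile(
    r"<!--.*?-->"
    r"|</(?P<close>[A-Za-z][A-Za-z0-9]*)\s*>"
    r"|<(?P<open>[A-Za-z][A-Za-z0-9]*)[^>]*>"
    r"|[^<]+",
    re.DOTALL,
)

# A few elements that never take a closing tag (partial list).
VOID = {"br", "hr", "img", "input", "meta", "link"}


def is_balanced(html):
    """Return True if every non-void opening tag is closed in order."""
    stack = []
    for m in TOKEN.finditer(html):
        if m.group("open"):
            name = m.group("open").lower()
            if name not in VOID:
                stack.append(name)
        elif m.group("close"):
            name = m.group("close").lower()
            if not stack or stack.pop() != name:
                return False
    return not stack


print(is_balanced("<div><p>hi</p><hr></div>"))
print(is_balanced("<div><p>hi</div></p>"))
```

The first call returns `True` and the second `False`. Real HTML has optional end tags and error recovery, so this is only a picture of the division of labour.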

replies(2): >>marcos+cz2 >>umanwi+eR5

>>kstrau+UV1
If you are talking about detecting tags, you (and the person asking that SO question) are talking about tokenization, and everybody (like the one making that famous answer) bringing parsing into the discussion is just being an asshole.

>>Dangit+kg
How is it more useful? Even if you insist on using regex, you'd primarily use it to fix the HTML so that it can be parsed, not to use regex itself to parse HTML.

replies(1): >>Dangit+BG7

>>kstrau+6v1
I don't think your comment assumes the right givens. I just tried in Vivaldi (i.e. Chrome) and this snippet:

    <!doctype html>
    A<!—- Don't count <hr> this! -—> but do count <hr> that -->Z

gets fixed and rendered as

    <!DOCTYPE html>
    <html><head></head><body>A<!--—- Don't count <hr--> this! -—&gt; but do count <hr> that --&gt;Z</body></html>

Another surprise is that

    <!doctype html>
    A<!—- Don't count this! -— but do count that -->Z

gets rewritten to

    <!DOCTYPE html>
    <html><head></head><body>A<!--—- Don't count this! -— but do count that ---->Z</body></html>

Note the insertion of extra `--` hyphen-minus characters.

This is what MDN (https://developer.mozilla.org/en-US/docs/Web/HTML/Guides/Com...) has to say:

Comments start with the string `<!--` and end with the string `-->`, generally with text in between. This text cannot start with the string `>` or `->`, cannot contain the strings `-->` or `--!>`, nor end with the string `<!-`, though `<!` is allowed. [...] The above is true for XML comments as well. In addition, in XML, such as in SVG or MathML markup, a comment cannot contain the character sequence `--`.

Meaning that you can recognize HTML comments with (one branch of) a RegEx—you start wherever you see `<!--` and consume everything up to one of the listed alternatives. No nesting required.
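
Here is a sketch of such a branch in Python, based on my reading of the rules quoted above. The `COMMENT` pattern and `strip_comments` helper are illustrative names, the sample uses plain ASCII hyphens, and unterminated comments are simply not matched:

```python
import re

# "<!--" followed either by an immediate ">" or "->" (an empty comment),
# or by the shortest run of text that ends in "-->" or "--!>".
COMMENT = re.compile(r"<!--(?:-?>|.*?--!?>)", re.DOTALL)


def strip_comments(text):
    """Remove every span the pattern recognizes as a comment."""
    return COMMENT.sub("", text)


sample = "A<!-- Don't count <hr> this! --> but do count <hr> that -->Z"
print(strip_comments(sample))
```

The comment ends at the first `-->`, so the output keeps ` but do count <hr> that -->` as ordinary text. No counting of nested markers is involved.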

Be it said that I find the precise rules too convoluted for what they do. Especially XML's prohibition on `--` in comments is ridiculous taken on its own. First you tell me that a comment ends with three characters `-->`, and then you tell me I can't use the specific substring `--`, either? And why can't I use `--!>`?

An interesting bit here is that AFAIK the `<!` syntax was used in SGML as one of the alternatives to write a 'lone tag', so instead of `<hr></hr>` or `<hr/>` (XHTML) or `<hr>` (HTML) you could write `<!hr>` to denote a tag with no content. We should have kept this IMO.

*EDIT* On the quoted HTML source you see things like `-—` (hyphen-minus, em-dash). This is how the Vivaldi DevTools render it; my text editor and HN comment system did not alter these characters. I have no idea whether Chrome's rendering engine internally uses these em-dashes or whether it's just a quirk in DevTool text output.

>>somat+Dk1
This is a pragmatic answer. While yes, regex is not proven to be the Most Correct Solution for a generalized parse, when you are sitting down with some data in front of you and you can grab the needed bits with a regex group, why not use exactly that? It might be part of a bigger parsing strategy, sure. But if it gets the job done, you can move on to the next thing.

>>kstrau+UV1
The original SO question was not asking about parsing.

>>imtrin+fE3
I do insist on using regex, and I know that it will be good enough for my purposes.