zlacker

The whole argument hinges on one word in your post: arbitrary.

I parse my own HTML I produce directly in a context where I fully control the output. It works fine, but parsing other people’s HTML is a lesson in humility. I’ve also done that, but I did it as a one time thing. I parsed a specific point in time, refusing to change that at any point.

replies(1): >>umanwi+w3

>>philis+(OP)
It also hinges on another word: parsing. There are things other than parsing that you might want to do. For example, if you want to count the number of `<hr>` tags in an HTML document, that doesn't require parsing it, and can indeed be done with regex.

replies(1): >>kstrau+zf

>>umanwi+w3
No you can’t. You can have an unescaped <hr> inside a script tag, for example. The best you can do is a simple string search for “<hr>” and hope it’s returning what you think it might be returning. Regexps are not powerful enough to determine whether any particular instance of “<hr>” is actually an HTML tag.

Like, it’s not a matter of cleverness, either. You can’t code around it. It’s simply not possible.