What's up with all those equals signs anyway?

>>todsac+(OP)
The real punchline is that this is a perfect example of "just enough knowledge to be dangerous." Whoever processed these emails knew enough to know emails aren't plain text, but not enough to know that quoted-printable decoding isn't something you hand-roll with find-and-replace. It's the same class of bug as manually parsing HTML with regex, it works right up until it doesn't, and then you get congressional evidence full of mystery equals signs.

>>ruhith+Vh
> It's the same class of bug as manually parsing HTML with regex, it works right up until it doesn't

I'm sure you already know this one, but for anyone else reading this I can share my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454

>>lvncel+Ko
I know this is grumpy but this I’ve never liked this answer. It is a perfect encapsulation of the elitism in the SO community—if you’re new, your questions are closed and your answers are edited and downvoted. Meanwhile this is tolerated only because it’s posted by a member with high rep and username recognition.

>>bayesn+gv
As someone who used to write custom crawlers 20 years ago, I can confirm that regular expressions worked great. All my crawlers were custom designed for a page and the sites were mostly generated by some CMS and had consistent HTML. I don't remember having to do much bug fixes that were related to regular expression issues.

I don't suggest writing generic HTML parsers that works with any site, but for custom crawlers they work great.

Not to say that the tools available are the same now as 20 years ago. Today I would probably use puppeteer or some similar tool and query the DOM instead.

>>throwa+GC
I would distinguish between parsing and scraping. Parsing really needs a, well, parser. Otherwise you’ll get things wrong on perfectly well formed input and your program will be brittle and weird.

A scraper is already resigned to being brittle and weird. You’re relying not only on the syntax of the data, but an implicit structure beyond that. This structure is unspecified and may change without notice, so whatever robustness you can achieve will come from being loose with what you accept and trying to guess what changes might be made on the other end. Regex is a decent tool for that.

zlacker