What's up with all those equals signs anyway?

>>todsac+(OP)
The real punchline is that this is a perfect example of "just enough knowledge to be dangerous." Whoever processed these emails knew enough to know emails aren't plain text, but not enough to know that quoted-printable decoding isn't something you hand-roll with find-and-replace. It's the same class of bug as manually parsing HTML with regex, it works right up until it doesn't, and then you get congressional evidence full of mystery equals signs.

>>ruhith+Vh
> It's the same class of bug as manually parsing HTML with regex, it works right up until it doesn't

I'm sure you already know this one, but for anyone else reading this I can share my favourite StackOverflow answer of all time: https://stackoverflow.com/a/1732454

>>lvncel+Ko
It took me years to notice, but did you catch that the answer actually subtly misinterprets what the question is asking for?

Guy (in my reading) appears to talk about matching an entire HTML document with regex. Indeed, that is not possible due to the grammars involved. But that is not what was being asked.

What was being asked is whether the individual HTML tags can be parsed via regex. And to my understanding those are very much workable, and there's no grammar capability mismatch either.

>>perchi+oi1
I think even for single opening tags like asked there are impossible edge cases.

For example, this is perfectly valid XHTML:

    <a href="/" title="<a /> />"></a>

>>tiagod+gl1
If you already know where the start of the opening tag is, then I think a regex is capable of finding the end of that same opening tag, even in cases like yours. In that sense, it’s possible to use a regex to parse a single tag. What’s not possible is finding opening tags within a larger fragment of HTML.

>>comex+Zv1
For any given regex, an opponent can craft a string which is valid HTML but that the regex cannot parse. There are a million edge cases like:

  <!—- Don't count <hr> this! -—> but do count <hr> this -->

and

  <!-- <!-- Ignore <ht> this --> but do count <hr> this —->

Now your regex has to include balanced comment markers. Solve that

You need a context-free grammar to correctly parse HTML with its quoting rules, and escaping, and embedded scripts and CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.

Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.

>>kstrau+QT1
HTML comments do not nest. The obvious tokenizer you can create with regular expressions is the correct one.

>>marcos+Ug2
If you're talking about tokenizers, then you're no longer parsing HTML with a regex. You're tokenizing it with a regex and processing it with an actual parser.

>>kstrau+Ek2
If you are talking about detecting tags, you (and the person asking that SO question) is talking about tokenization, and everybody (like the one making that famous answer) bringing parsing into the discussion is just being an asshole.

zlacker