zlacker

mixing rendering definitions with content (PDF) is something from the printer era, that is unsuitable for the digital era.

HTML was a digital format, but it wanted to be a generic format for all document types, not just papers, so it contains a lot of extras that a paper format doesn't need.

for research papers, since they share the same structure, we can further separate content from rendering.

for example, if you want to later connect a paper with an AI, do you want to send <div class="abstract"> ... ?

or do some nasty heuristic to extract the abstract? like document. getElementsByClassName("abstract")[0] ?

replies(1): >>simonw+p1

>>billco+(OP)
All of the interesting LLMs can handle a full paper these days without any trouble at all. I don't think it's worth spending much time optimizing for that use-case any more - that was much more important two years ago when most models topped out at 4,000 or 8,000 tokens.