zlacker

[return to "Google's new pipe syntax in SQL"]
1. aragon+Nda[view] [source] 2024-08-29 01:57:22
>>heyden+(OP)
> This remains a long-standing pet peeve of mine. PDFs like this are horrible to read on mobile phones, hard to copy-and-paste from ...

I've never understood why copying text from digitally native PDFs (created directly from digital source files, rather than by OCR-ing scanned images) is so often such a poor experience. Even PDFs produced from LaTeX often contain undesirable ligatures in the copied text, such as ﬁ and ﬂ. Text copied from some Springer journals sometimes lacks spaces between words or introduces unwanted spaces between letters in a word ... Is it due to something inherent in PDF technology?
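
For the ligature part specifically, one possible cleanup after copying is Unicode compatibility normalization. This is only a sketch: `expand_ligatures` is a made-up helper name, and it assumes the copied text contains the single-character ligature code points (for example U+FB01 and U+FB02). NFKC also rewrites other compatibility characters (full-width forms, superscripts and so on), so it may change more than you want.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Expand compatibility ligatures (e.g. U+FB01, U+FB02) into plain letters.

    NFKC normalization is broader than ligatures alone: it also folds
    other compatibility characters, so treat this as a blunt tool.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # "\ufb01" is the single-character "fi" ligature, "\ufb02" is "fl".
    print(expand_ligatures("\ufb01nal work\ufb02ow"))
```

This does nothing for the missing or extra spaces, which come from a different cause.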

2. crazyg+8ka[view] [source] 2024-08-29 03:09:31
>>aragon+Nda
> Is it due to something inherent in PDF technology?

Exactly. PDF doesn't have instructions to say "render this paragraph of text in this box", it has instructions to say "render each of these glyphs at each of these x,y coordinates".

It was never designed to have text extracted from it. So trying to turn it back into text involves a lot of heuristics and guesswork, such as deciding how much separation between characters should be considered a space.
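
To make the guesswork concrete, here is a rough sketch of the kind of gap heuristic an extractor might use. It is not how any particular PDF library works: the `Glyph` record, the `glyphs_to_text` function, and the `space_ratio` and `line_tolerance` thresholds are all invented for illustration, and real extractors deal with far messier input (rotated text, multiple columns, fonts without usable width data).

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical record for one positioned glyph on a page."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; PDF y coordinates grow upward
    width: float  # advance width of the glyph


def glyphs_to_text(glyphs, space_ratio=0.3, line_tolerance=2.0):
    """Guess lines and word breaks from glyph positions alone."""
    # Group glyphs into lines: similar baselines count as the same line.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left-to-right reading order
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            # Horizontal gap between the end of one glyph and the next.
            gap = cur.x - (prev.x + prev.width)
            # Guess: a gap wider than a fraction of the previous glyph
            # is a word break. The threshold is arbitrary.
            if gap > space_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)


if __name__ == "__main__":
    page = [
        Glyph("h", 0, 100, 5), Glyph("i", 5, 100, 2),
        Glyph("y", 10, 100, 5), Glyph("o", 15, 100, 5),
        Glyph("o", 0, 88, 5), Glyph("k", 5, 88, 5),
    ]
    print(glyphs_to_text(page))
```

With a threshold like this, tightly set text can lose its spaces and letter-spaced text can gain them, which matches the symptoms described above.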

A lot also depends on what software produced the PDF, which can make it easier or harder to extract the text.
