zlacker

1. dredmo+(OP)[view] [source] 2022-02-18 19:00:36
Then pipe the result to a PS/PDF generator.

For most modern Web publishing, this largely comes down to finding and extracting the <article> block, along with the metadata (title, byline, dateline).

html-xml-utils is quite useful for this.
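
For anyone who wants to try the same idea without installing anything, here is a rough sketch using only Python's standard-library html.parser. It is not the extractor described in this comment, just an illustration of the approach: keep what is inside <article>, drop scripts and styles, and pick up the title plus a couple of <meta> values. The class and function names are made up for this example, and the meta keys ("author", "article:published_time") are assumptions about common conventions; real sites vary.

```python
from html import escape
from html.parser import HTMLParser

# Tags whose contents are dropped even inside <article> (assumption: nags
# and trackers usually live here).
SKIP_TAGS = {"script", "style"}


class ArticleExtractor(HTMLParser):
    """Hypothetical helper: collects <title>, <meta> values and the <article> body."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.meta = {}
        self.article_parts = []
        self._in_title = False
        self._article_depth = 0  # > 0 while inside (possibly nested) <article>
        self._skip = False       # True while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if self._article_depth:
            if tag in SKIP_TAGS:
                self._skip = True
                return
            if tag == "article":
                self._article_depth += 1
            # Re-emit the start tag exactly as it appeared in the source.
            self.article_parts.append(self.get_starttag_text())
        elif tag == "article":
            self._article_depth = 1
        elif tag == "title":
            self._in_title = True
        elif tag == "meta":
            a = dict(attrs)
            key = a.get("name") or a.get("property")
            if key and a.get("content"):
                self.meta[key] = a["content"]

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit once, with no closing tag.
        if self._article_depth:
            self.article_parts.append(self.get_starttag_text())
        elif tag == "meta":
            self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        if not self._article_depth:
            return
        if tag in SKIP_TAGS:
            self._skip = False
            return
        if tag == "article":
            self._article_depth -= 1
            if self._article_depth == 0:
                return  # outermost </article>: stop collecting
        self.article_parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip:
            return
        if self._article_depth:
            # Text arrives unescaped, so escape it again for HTML output.
            self.article_parts.append(escape(data, quote=False))
        elif self._in_title:
            self.title += data


def extract(html_text):
    """Return title, byline, dateline and the inner HTML of <article>."""
    parser = ArticleExtractor()
    parser.feed(html_text)
    parser.close()
    return {
        "title": parser.title.strip(),
        "byline": parser.meta.get("author", ""),                   # assumed key
        "dateline": parser.meta.get("article:published_time", ""),  # assumed key
        "body": "".join(parser.article_parts).strip(),
    }
```

The returned "body" string is plain HTML, so it can be wrapped in a minimal page and handed to whatever PS/PDF or ePub generator you prefer. Pages with badly nested markup will need something more forgiving than this.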

I'd created a WaPo extractor that reduced page size by about 95% and stripped the nags, paywalls, etc. The output was HTML, but it could just as easily have generated PDF or ePub if I'd wanted.

replies(1): >>titano+K1j
2. titano+K1j[view] [source] 2022-02-25 01:57:28
>>dredmo+(OP)
I applaud people who take advantage of the fact that the internet is still largely machine-readable and hackable.

I am much lazier, but I use "reader mode" to similar effect.
