HTML as an Accessible Format for Papers

>>el3ctr+(OP)
Is this new or somehow updated? HTML versions of papers have been available for several years now.

EDIT: indeed, it was introduced in 2023: https://blog.arxiv.org/2023/12/21/accessibility-update-arxiv...

>>sega_s+M7
Do the older papers work via [Ar5iv](https://ar5iv.labs.arxiv.org/)?

> View any arXiv article URL [in HTML] by changing the X to a 5

The line

> Sources upto the end of November 2025.

sounds to me like this is indeed intended for older articles.
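The "change the X to a 5" rule quoted above can be sketched as a small URL rewrite. This is only an illustration: the function name `to_ar5iv` is made up, the paper ID is just an example, and it assumes the `ar5iv.org` host is the one you get by literally swapping the letter in `arxiv.org`.

```python
from urllib.parse import urlsplit, urlunsplit


def to_ar5iv(url: str) -> str:
    """Rewrite an arxiv.org article URL to the matching ar5iv.org URL.

    Hypothetical helper: it only swaps the host name and leaves the
    path, query, and fragment untouched.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host not in ("arxiv.org", "www.arxiv.org"):
        raise ValueError(f"not an arXiv URL: {url}")
    # "Changing the X to a 5": arxiv.org -> ar5iv.org
    return urlunsplit(parts._replace(netloc="ar5iv.org"))


if __name__ == "__main__":
    print(to_ar5iv("https://arxiv.org/abs/1706.03762"))
```

Whether a given paper actually renders there depends on the conversion coverage and the snapshot date mentioned in the quoted line.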

>>ForceB+E6
You're right: https://github.com/arXiv/arxiv-docs/blob/develop/source/abou... This needs a 2023 tag, @dang.

>>Domini+Ds
I believe dginev's Docker image https://github.com/dginev/ar5ivist is very close to what runs on arXiv and can be run locally. It uses a recent LaTeXML snapshot from September.

>>el3ctr+(OP)
If the Unicode Consortium spent less time and effort on emoji and more on making the most common and important mathematical symbols and notations renderable in plain text, maybe we could move past the (La)TeX/PDF marriage. There are no technical restrictions. OpenType and TrueType have supported (for well over a decade now) the conditional rendering needed to make sequences of Unicode code points display the way they should, at least in theory. Missing-glyph font fallback is also available pretty much everywhere, so symbols absent from your primary font are seamlessly drawn from a fallback font (something like Noto, which supports every Unicode symbol by design, or math-specific fonts like Cambria Math or TeX Gyre).

I’ve dug into this in the past: it was never a lack of technical ability that kept them from adding even proper superscript/subscript support, but rather their opinion that this doesn’t belong in the symbolic layer. But emoji rely on (or abuse) ZWJ sequences and modifiers left and right to display in any of a myriad of variations, so there’s really no good reason not to allow the same here: 2 and the squared symbol (²) are not semantically the same, so it’s not merely a design choice.
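To make the point about 2 versus the squared symbol concrete, here is a minimal sketch using only Python's standard `unicodedata` module. The helper names `to_superscript` and `describe` are made up for illustration. It shows that the superscript digits that do exist are separate code points, and that compatibility normalization folds them back to plain digits, which loses the distinction.

```python
import unicodedata

# The ten superscript digits that exist as dedicated code points.
# There is no equivalent complete set for arbitrary letters or symbols.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def to_superscript(digits: str) -> str:
    """Map ASCII digits to their Unicode superscript counterparts."""
    return digits.translate(SUPERSCRIPT_DIGITS)


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    print(describe("2"))   # U+0032 DIGIT TWO
    print(describe("²"))   # U+00B2 SUPERSCRIPT TWO
    print("x" + to_superscript("10"))
    # NFKC treats the superscript as a compatibility variant of the digit,
    # so "x squared" collapses to the two characters "x2".
    print(unicodedata.normalize("NFKC", "x²"))
```

This per-character mapping only covers digits. A general mechanism, such as the one proposed in the reply below, would have to cover arbitrary content.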

An interesting (complete) tangent: Gemini 3 Pro is the only model I’ve tested (I do a lot of math-related stuff with LLMs) that absolutely will not, under any circumstances, respect (system/user) prompt requests to avoid inline math mode (aka LaTeX) in its output. It made no difference whether I asked for a blanket ban on TeX/MathJax/etc. or insisted that it substitute extended Unicode code points for all math formula rendering. (I primarily use LLMs via a TUI, where I don’t have MathJax support, and as familiar as I once was with raw TeX math notation and symbols, it’s still quite easy to misread unrendered raw output if you’re not careful.) Gemini 3 Pro would insist on rendering even single-letter constants or variables as $k$ instead of just k (or k in markdown italics, etc.) no matter how hard I asked it not to, which makes me think it may have been overfit on raw LaTeX papers, and which is also an interesting argument in favor of “VL LLMs are the more natural construct”. I shared my experiment and results here: https://x.com/NeoSmart/status/1995582721327071367?s=20
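For the `$k$` case specifically, one possible client-side workaround (not something described in the comment, just a sketch) is to post-process model output before it reaches the terminal. The function name `strip_trivial_math` and the definition of "trivial" (a single ASCII letter or a run of digits) are assumptions for illustration. Anything more complex is left untouched.

```python
import re

# Matches $...$ wrapping a single ASCII letter or a run of digits,
# e.g. $k$ or $42$. Escaped dollars (\$) and display math ($$...$$)
# are deliberately not matched.
TRIVIAL_INLINE_MATH = re.compile(r"(?<![\\$])\$([A-Za-z]|\d+)\$(?!\$)")


def strip_trivial_math(text: str) -> str:
    """Replace trivially simple inline math like $k$ with its plain content."""
    return TRIVIAL_INLINE_MATH.sub(r"\1", text)


if __name__ == "__main__":
    sample = "Let $k$ be a constant and keep $x^2 + 1$ as is."
    print(strip_trivial_math(sample))
```

This leaves real formulas such as `$x^2 + 1$` alone, so it addresses only the single-symbol annoyance and not the wider problem of unrendered TeX.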

>>Comput+ex
https://github.com/stevengj/subsuper-proposal

>>Tagber+U6
It's kind of fun to compare this formulation with the seemingly contradictory official arXiv argument for submitting the TeX source [1]:

> 1. TeX has many advantages that make it ideal as a format for the archives: It is plain text, it is compact, it is freely available for all platforms, it produces extremely high-quality output, and it retains contextual information.

> 2. It is thus more likely to be a good source from which to generate newer formats, e.g., HTML, MathML, various ePub formats, etc. [...]

Not that I disagree with the effort; converting the Turing-complete macro language TeX to something other than PDF, at scale, is surely a unique challenge. At the same time, the task would be monumentally more difficult if only the generated PDFs were available. So both are right at the same time.

[1] https://info.arxiv.org/help/faq/whytex.html#contextual

>>el3ctr+(OP)
Hi, an arXiv HTML Papers developer here.

As a very brief update: a larger update is pending.

You will spot many (many) issues with our current coverage and fidelity of the paper rendering. When they jump out at you, please report them to us. All reports from the last 2 years have landed on GitHub. We have made a bit of progress since, but there is (a lot) more low-hanging fruit to pick.

Project issues:

https://github.com/arXiv/html_feedback/issues/

The main bottleneck at the moment is developer time. And the main vehicle for improvements on the LaTeX side of things continues to be LaTeXML. Happy to field any questions.

>>sundar+kb
ar5iv tracks the arXiv collection with a one-month lag, precisely to signal that it is not the "official" arXiv rendering. It is also a showcase that predates the arXiv /html/ route but largely uses the same technology. Nowadays it is maintained by the same people (hi!).

There used to be another showcase, called arxiv-vanity. They captured what happened pretty well with their farewell post on their homepage:

https://www.arxiv-vanity.com/

>>crazyg+PE
Come on, are you serious? HTML/CSS is more powerful than TeX or PDF.

https://csszengarden.com/

>>tefkah+Kh
This is what I'm talking about. HTML/CSS is more powerful than PDF or TeX.

https://csszengarden.com/

>>benatk+Fk
In practice, sometimes. But in principle, hard disagree.

HTML was explicitly designed to semantically represent scientific documents. [1]

“HTML documents represent a media-independent description of interactive content. HTML documents might be rendered to a screen, or through a speech synthesizer, or on a braille display. To influence exactly how such rendering takes place, authors can use a styling language such as CSS.” [2]

1: https://html.spec.whatwg.org/multipage/introduction.html#bac...

2: https://html.spec.whatwg.org/multipage/introduction.html#:~:...
