Internet Archive is now a federal depository library

>>XnoiVe+(OP)
Official post: https://blog.archive.org/2025/07/24/internet-archive-designa... (>>44677535 )

>>jprd+uw
“2023 The Internet Archive, a non-profit research library, makes use of internal processes and tools, including human review and hash-matching, as well as reports from external parties to identify, disable access to, and limit the reappearance of illegal and/or proscribed violent extremist material on archive.org”

https://help.archive.org/help/tco-transparency-reports/

>>XnoiVe+(OP)
I'm having trouble finding what this means. Does IA now have new obligations, or get new information, or something else, or all of the above?

The submission says:

> These records account for “millions and millions of pages” that can take up entire floors of public libraries, Kahle said. San Diego’s public library gave up its federal depository status in 2020 because its government documents took up so much space and often went unused. [...] The GPO [...] has ramped up efforts to digitize the Federal Depository Library Program.

Does IA now have to store floors upon floors of paper copies of information, at least until they get digitized? Or are they now merely obliged to host the digital materials insofar as they already exist? That sounds like what they are doing already for the whole web, and also apparently since 2022 when they started "Democracy’s Library, a free online compendium of government research and publications", just that now they're legally obliged to do this or something?

What I find on doi.gov[1] is "The mission of Federal depository libraries is to provide local, free access to information from the Federal government" and nothing really further on what this concretely means. Sounds like just an obligation though?

What I find on gpo.gov[2] is "The Federal Depository Library Program [ensures] that the American public has access to Government information in depository libraries". Could mean anything. The program ensures that, but let's assume that means the designated libraries ensure that, so then do these libraries get extra info that the public doesn't get (but in order to disseminate it to the public)? Makes no sense either.

The GPO page and the submission also say that "Members of Congress may designate up to two qualified libraries." Did they get picked and now it's IA's obligation, or did IA ask for this? What do they get out of it?

[1] https://www.doi.gov/library/collections/federal-documents

[2] https://www.gpo.gov/how-to-work-with-us/agency/services-for-...

>>01HNNW+KK
There are utilities to help, waybackpack comes to mind, but I haven't looked in a while. https://github.com/jsvine/waybackpack
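For anyone who would rather script it than install a tool, here is a minimal sketch of the general approach: list captures through the Wayback Machine's public CDX endpoint, then build one snapshot URL per capture. This is not how waybackpack works internally, just an illustration. The function names `cdx_query_url` and `snapshot_urls` are made up for this example, and the code assumes the CDX JSON output has a header row containing `timestamp` and `original` columns.

```python
import json
from urllib.parse import urlencode

# Public CDX search endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(page_url, from_year=None, to_year=None):
    """Build a CDX query URL listing captures of page_url (hypothetical helper)."""
    params = {"url": page_url, "output": "json"}
    if from_year:
        params["from"] = str(from_year)
    if to_year:
        params["to"] = str(to_year)
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(cdx_json_text, raw=False):
    """Turn a CDX JSON response into Wayback snapshot URLs (hypothetical helper).

    Assumes the first row is a header naming the columns. With raw=True the
    "id_" flag is appended to the timestamp, which asks for the archived
    bytes without the Wayback toolbar and link rewriting.
    """
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    header, *records = rows
    ts_i = header.index("timestamp")
    orig_i = header.index("original")
    flag = "id_" if raw else ""
    return [
        f"https://web.archive.org/web/{r[ts_i]}{flag}/{r[orig_i]}"
        for r in records
    ]
```

You would fetch `cdx_query_url("example.org", 2019, 2020)` with something like `urllib.request.urlopen`, pass the response body to `snapshot_urls`, and then download each resulting URL, ideally with a delay between requests so you don't hammer the service.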

>>cwillu+LA
I wonder how many gems like this https://archive.org/details/youtube-moXX8lbnmHs that could have been saved have been lost. (Obviously this one is saved, for now.)

This is not to disparage the tremendous work done and being done by the IA, it's more me lamenting the trend in our society and others to mentally babysit people lest their minds get exposed to something bad, with the implicit assumption that adult humans can't be trusted to see some stupid bs and react with "that was some stupid bs. I am moving it into the stupid bs bucket of things I know about".

>>layman+8h
A community-maintained list collecting examples of such exclusions: https://wiki.archiveteam.org/index.php/List_of_websites_excl...

>>dylan6+461
Libraries have an entire concept of weeding, and numerous criteria for doing so: https://en.m.wikipedia.org/wiki/Weeding_(library)

Libraries are constantly bringing in new materials and very few are capable of constantly increasing in size to match. I believe national libraries like the Library of Congress tend not to weed, but they do have to offload material to satellite locations and storage facilities.

>>toomuc+rb1
A federal judge also ruled that "IA does not perform the traditional functions of a library."

https://publishers.org/wp-content/uploads/2024/09/2024.09.04...

Brewster has a friend in a state senator and he's trying to do what he can to preserve his section 108 privileges. He's removed over a million items in the past year after being repeatedly sued for copyright infringement, and leaked millions of private communications with patrons, including passports and driver's licenses. That's the undercurrent here.

Egos aside, the goal isn't to be a library: it's providing access to knowledge. But when your site is on the blocklist at public library terminals because you keep getting flagged for copyright violations and child pornography, maybe you're not on the path.

>>01HNNW+KK
> A bot told me they offer downloads of the underlying WARC files but I could not find it

The "bot" is wrong. Most of the crawl data used by the Internet Archive, particularly the Alexa crawls, isn't publicly accessible. (This is because some of it includes archived pages which have since been suppressed by the site owner - removing those pages from the archived crawl data isn't practical.)

https://archive.org/details/alexacrawls

Common Crawl data is public, but less comprehensive than IA - https://commoncrawl.org/

>>01HNNW+KK
I used wayback-machine-downloader; I think you need one of the forks to make it work, though.

https://github.com/hartator/wayback-machine-downloader

>>mdp202+EO1
Physical libraries act under a different set of rules and those were already made well known to Brewster as part of the Hachette lawsuit.

For music, the Music Modernization Act set up a statutory process for making things available, even downloadable. Brewster and others celebrated the measure in blog posts and speaking gigs. Then didn't follow the process, didn't honor polite requests to stop, then got sued for $700 million.

Previously they did some seriously stupid things in their implementation of Controlled Digital Lending, and got the whole concept killed. Not even a debate, just destroyed on summary judgment without even a trial. This set the future many of us want back decades, and ruined a lot of proper efforts that were run much better than the well-intentioned but undermanaged Internet Archive.

Combined with them giving the finger to the fairly innovative and progressive music act, this caused damage not only to reputations, but also the culture.

Regarding copyright basics, we're likely to agree on many positions, including some radical ones. But Internet Archive cannot be a long-term archive, an activist organization, and an open library. There are different laws, risk profiles, and financial/management requirements for each.

And you can't beg people for donations to "save the internet" then set it all on fire to save a bunch of old records that already existed at the Library of Congress. Or act surprised that just because you scan them, it doesn't mean you can then make them available for unlimited download without permission. Again, archives behave differently from libraries. Although it's annoying to tech people, there are good reasons for it.

Brewster likes his honorary library status and degree but he and the site violate the majority of the librarian code of ethics. https://www.ala.org/tools/ethics

>>mdp202+QO1
archive.org: "Our mission is to provide Universal Access to All Knowledge."

As the name implies, Internet Archive started as an archive. Which is very different from a library.

Running an archive is not particularly fun, and it is very expensive, and you cannot monetize it without having rights to the things you're archiving. They've never offered research services or grants, and yet the monthly bill and tech debt just keep growing. Last year's hacks showed the state of things, and they leaked patron information and even passports and driver's licenses.

They tried to be a library but didn't follow the law. Hell, they even tried to be a bank at one point and got spanked hard by the feds there, too.

https://ncua.gov/newsroom/press-release/2016/internet-archiv...

With the $700 million lawsuit over old records it became clear that the whole thing is little more than a catch all for things that Brewster Kahle finds interesting. He's got money and seems like a kind guy. But it's not a well-run organization and he's at retirement age without having put much of a dent in that mission.
