
Why Digital Preservation Is Failing


Between platform migrations, AI content floods, and the structural impossibility of archiving the modern web, we're losing more than we realize.


AnandTech’s archive shutdown should have been a bigger story. One of the most comprehensive sources of hardware analysis, written by people who genuinely understood the chips they were testing, said it was going offline “indefinitely.” In corporate language, “indefinitely” means “until we decide otherwise,” which in practice means “it’s gone.”

This is one incident in a much larger failure.

The Discord Migration

Entire communities migrated from public, indexed forums to Discord over the past decade. The reasoning made sense at the time: Discord was where the active community was, the interface was better, moderation tools were stronger.

But conversations on Discord scroll away. They’re invisible to search engines. And when servers close - which they do - everything is gone. Years of troubleshooting discussions, community knowledge, and primary sources for understanding how communities developed simply disappear.

Forums were ugly and sometimes badly organized, but they were indexed, searchable, and persistent. The trade-off we made for better community tools was worse preservation, and we didn’t think carefully enough about it.

AI Amplification Makes It Worse

Large language models now generate content at scales that make meaningful curation impossible. A single GPU produces more text per hour than a human writer produces in months. Much of this output is repackaged, derivative, or synthetic - not inherently malicious, but adding nothing genuinely new worth preserving.

Meanwhile, aggressive AI crawlers are overwhelming smaller sites. Cloudflare now blocks AI crawlers by default. Server logs at small independent websites show crawler traffic that dwarfs human visitor traffic. The web’s infrastructure is increasingly dedicated to serving machines that consume content for training, not humans who might benefit from the content existing.
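Cloudflare's default block operates at the network edge, but site operators can also declare crawler policy themselves. A sketch of a robots.txt that disallows some widely documented AI training crawlers (the user-agent names below are the publicly documented ones at the time of writing; note that robots.txt is purely advisory, and aggressive crawlers can simply ignore it - which is part of the problem described above):

```text
# robots.txt - ask AI training crawlers to stay out
# (advisory only; compliant crawlers honor it, aggressive ones may not)

User-agent: GPTBot        # OpenAI's training crawler
Disallow: /

User-agent: CCBot         # Common Crawl
Disallow: /

User-agent: ClaudeBot     # Anthropic's crawler
Disallow: /

# Everyone else, including search engines, remains welcome
User-agent: *
Allow: /
```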

The signal-to-noise ratio has degraded to the point where preservation curation - deciding what’s worth keeping - is increasingly difficult.

The Structural Impossibility

The scale of modern digital content production makes comprehensive archiving literally impossible. Social media platforms generate petabytes of content daily. The same piece of content exists in multiple versions, formats, and quality levels. Web content is increasingly dynamic - what you see depends on who you are, when you look, and which A/B test your browser was assigned to.

The Internet Archive does extraordinary work. It’s also fighting a losing battle against scale, legal challenges, and the fundamental physics of storage and bandwidth.

What We Actually Lose

This isn’t abstract. Specific consequences:

Research primary sources: Academic research increasingly relies on online sources. When those sources disappear, citations become dead links. Work that built on those sources becomes harder to verify and reproduce.
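This kind of link rot is straightforward to measure. A minimal sketch using only the Python standard library - the status-code classification below is an illustrative heuristic, not a standard:

```python
# Minimal link-rot checker: classify a cited URL as alive, dead,
# unclear, or unreachable. Standard library only; thresholds are
# illustrative heuristics, not an archival standard.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError


def classify_status(code: int) -> str:
    """Map an HTTP status code to a rough liveness label."""
    if 200 <= code < 300:
        return "alive"
    if code in (404, 410):       # not found / permanently gone
        return "dead"
    return "unclear"             # 403s, 429s, 500s, etc. need a human look


def check_url(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL (GET, redirects followed) and classify the result."""
    req = Request(url, headers={"User-Agent": "link-rot-check/0.1"})
    try:
        with urlopen(req, timeout=timeout) as resp:
            return classify_status(resp.status)
    except HTTPError as e:
        return classify_status(e.code)
    except URLError:
        return "unreachable"     # DNS failure, refused connection, timeout
```

Run `check_url` over a bibliography's URLs and the "dead" and "unreachable" buckets are the citations that can no longer be verified.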

Cultural memory: The early web had communities that developed their own cultures, memes, vocabulary, and social norms. Much of that is gone. We’re reconstructing the early internet from scattered fragments, when we’re thinking about it at all.

Technical documentation: Documentation for legacy systems - how to configure hardware that’s no longer supported, how to work with software that’s been discontinued - disappears when company sites shut down. This creates real problems for the people maintaining those systems.

A More Realistic Model

We cannot preserve everything. The honest conversation is about what we choose to preserve and why.

A realistic approach: accept that comprehensive archiving is impossible, prioritize by historical and cultural significance, build redundant distributed systems (so that no single organization’s closure takes everything with it), and teach digital literacy that includes an understanding of impermanence.

The web was built on an assumption of persistence that was never technically warranted. We’re paying for that assumption now.
