Data on the web is more ephemeral than you might realize. A recent examination of linkrot and content drift by Harvard Law School found that web-based content is increasingly lost as time passes. Analyzing half a million links within New York Times articles, the researchers found that 6% of deep links from 2018 were inaccessible and that 72% of links from 1998 were completely dead.
The rotting web could have severe consequences for society’s collective understanding. Consider links used to substantiate pivotal legal arguments. Studies now show that 49% of hyperlinks in Supreme Court decisions no longer function. Linkrot undermines the shared intelligence that the Internet once promised.
Much of the content on the web is outdated or inaccurate, or it has changed since it was first cited even though the URL still resolves, a problem known as content drift. Stopgaps like the Internet Archive’s Wayback Machine have emerged as stewards of the historical web. However, some proponents in the industry argue we need a more modern, decentralized, immutable reference for this historical data. You may have guessed where this article is heading: blockchain.
I recently met with Jonathan Victor of Protocol Labs to see how Web 3.0 and blockchain might be used to address the linkrot epidemic. According to Victor, smart contracts that reference canonical URLs could help create a more resilient web. This would, in effect, require a peer-to-peer model with shared ownership and incentive models to sustain content longevity.
The State of Linkrot
The longer data sits, the higher the likelihood of losing track of it. Forgotten databases and rising data storage costs can easily lead to whole networks of websites and user data being accidentally (or intentionally) destroyed. Putting so much faith in private companies to preserve data indefinitely is not viable, explained Victor.
Public data isn’t really as public as you think it is. And older, user-generated content is often the first domino to fall. For example, a massive amount of media on MySpace was lost in a botched server migration that came to light in 2019. In 2021, Yahoo Answers similarly erased 16 years of user-generated content in its shutdown. Surprisingly, users have little ownership or control of their data on forums and social media. A ban on Twitter, for example, could result in inactive links within posts.
When an article points to a website, this creates an implicit dependency on an external organization to keep the domain live, said Victor. The second dependency is on the domain owner to keep content unchanged. “If we can talk about content and how to reference it in a different way, maybe we can create references that are much more resilient,” suggested Victor.
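The two dependencies Victor describes map onto two distinct failures: the link stops resolving (linkrot), or it resolves to something other than what was cited (content drift). A minimal sketch of telling them apart, assuming the citing article recorded a SHA-256 digest of the page at the time of citation. The function names, the example URLs and the in-memory stand-in for the web are all hypothetical and chosen for illustration only.

```python
import hashlib
from typing import Callable, Optional


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def check_reference(
    url: str,
    expected_digest: str,
    fetch: Callable[[str], Optional[bytes]],
) -> str:
    """Classify a cited link as 'ok', 'rotted' or 'drifted'.

    `fetch` is an assumed helper that returns the page body,
    or None when the URL no longer resolves.
    """
    body = fetch(url)
    if body is None:
        return "rotted"   # first dependency failed: the address is dead
    if sha256_hex(body) != expected_digest:
        return "drifted"  # second dependency failed: the content changed
    return "ok"


# Hypothetical stand-in for the live web: URL -> current page bytes.
fake_web = {
    "https://example.com/ruling": b"original text",
    "https://example.com/report": b"edited text",
}


def fake_fetch(url: str) -> Optional[bytes]:
    return fake_web.get(url)


cited = sha256_hex(b"original text")
for link in ("https://example.com/ruling",
             "https://example.com/report",
             "https://example.com/gone"):
    print(link, check_reference(link, cited, fake_fetch))
```

A location-based URL can detect these failures only after the fact. A reference built on the digest itself would make the check part of the lookup.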
How Web 3.0 Can Help
One way to solve the linkrot issue is to create a ‘fingerprint’ for web resources. This could be a canonical reference to specific pieces of content. Organizations would store and retrieve content based on a hash of that content itself, Victor explained. Such a system would likely require an incentive-based, non-exploitative, decentralized network where web users own their data and control who profits from it.
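To make the fingerprint idea concrete, here is a minimal sketch of a content-addressed store in which the key for each item is a hash of its bytes. It assumes plain SHA-256 hex digests and an in-memory dictionary. The `ContentStore` class is hypothetical, and real networks use richer identifier formats and persistent, distributed storage.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: the key is derived from the content."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        """Store the content and return its fingerprint."""
        digest = hashlib.sha256(content).hexdigest()
        self._blobs[digest] = content
        return digest

    def get(self, digest: str) -> bytes:
        """Return the content for a fingerprint, verifying it on the way out."""
        content = self._blobs[digest]  # raises KeyError if unknown
        if hashlib.sha256(content).hexdigest() != digest:
            raise ValueError("stored content does not match its fingerprint")
        return content


store = ContentStore()
ref = store.put(b"Text of a cited court filing")
print(ref)             # the reference a citing article would record
print(store.get(ref))  # the same bytes, wherever the store happens to live
```

Because the reference is computed from the content, the same bytes always yield the same reference, and any change to the bytes yields a different one. The reference says nothing about where the content lives, which is what lets storage move between hosts without breaking citations.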
Some call this concept Web 3.0. Victor describes Web 3.0 as a broad classification of distributed technologies and tools that enable a peer-to-peer internet underpinned by blockchain.
For example, one open source project spearheading progress on this front is IPFS, the InterPlanetary File System. IPFS is a peer-to-peer hypermedia protocol that identifies content by its cryptographic hash rather than by its location. The project is “designed to preserve and grow humanity’s knowledge by making the web upgradeable, resilient and more open.” With IPFS, a client requests content using the hash and retrieves it from any peer that holds a copy. Filecoin, a permissionless storage network that adds a blockchain-based incentive layer, is built on top of IPFS.
Benefits
Decentralization fits the original goal of the open web, allowing creators to own and share content horizontally. If you can decouple data storage and retrieval from a specific location, it gives you an impressive ability to rearchitect modern communication, said Victor. This could have a number of benefits:
- Redundancy: First off, a major benefit is the redundancy of data. Because distributed nodes hold copies of a resource in multiple locations, the data stays available globally and persists over time even if individual hosts disappear.
- Incentives: An incentive-based decentralized network could use block rewards to subsidize costs, helping fund storage initiatives. Building on the blockchain can also make it easier to accept payments, which is great for donation-based data hosting groups.
- Efficiency: By using smart contracts, you can programmatically handle storage requests. “Blockchain gives you higher-level primitives to run this stuff automatically,” said Victor. This helps tie on-chain activities to real-world storage, Victor added.
- Low-cost: It can be cheaper to store content on distributed networks than with a typical cloud provider. Victor estimated that storage could cost 1/1000 of what Amazon charges. A distributed environment could also bring speed gains; similar to torrenting, seeding from distributed sources increases speed by orders of magnitude, said Victor.
- Communal ownership: Such a model could spur more communal ownership for some content. This could benefit public-good assets that don’t have a clear owner. According to Victor, this aspect especially benefits universities and open source data initiatives that rely on public support to function.
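The redundancy benefit in the list above can be sketched in a few lines: copies of a resource are placed on several nodes, and a reader accepts the first copy whose hash matches the reference. This is an illustrative toy, assuming SHA-256 digests and in-memory nodes. The `Node`, `publish` and `retrieve` names are hypothetical and do not correspond to any real network's API.

```python
import hashlib
from typing import List, Optional


class Node:
    """Toy storage node that can go offline."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.blobs: dict[str, bytes] = {}

    def store(self, digest: str, content: bytes) -> None:
        self.blobs[digest] = content

    def fetch(self, digest: str) -> Optional[bytes]:
        # An offline node answers nothing, like a dead web server.
        if not self.online:
            return None
        return self.blobs.get(digest)


def publish(nodes: List[Node], content: bytes) -> str:
    """Replicate the content to every node and return its digest."""
    digest = hashlib.sha256(content).hexdigest()
    for node in nodes:
        node.store(digest, content)
    return digest


def retrieve(nodes: List[Node], digest: str) -> bytes:
    """Return the first copy that matches the digest, skipping bad ones."""
    for node in nodes:
        content = node.fetch(digest)
        if content is not None and hashlib.sha256(content).hexdigest() == digest:
            return content
    raise LookupError("no reachable node holds a valid copy")


nodes = [Node("a"), Node("b"), Node("c")]
ref = publish(nodes, b"public-good dataset")
nodes[0].online = False               # one host disappears
nodes[1].blobs[ref] = b"altered copy"  # another serves changed content
print(retrieve(nodes, ref))           # still served, from node "c"
```

The hash does double duty here: it locates the content and it lets the reader reject a copy that has drifted, so a single honest, reachable node is enough to keep the reference alive.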
A Compelling Problem
Constructing a new peer-to-peer Internet faces significant barriers. Namely, a decentralized structure requires a large group effort, and this community action may be difficult to encourage, let alone coordinate. Because many different decentralized projects crowd the scene, it will also require consolidation around open standards and protocols. Shifting pre-existing architectures to this new style could also introduce roadblocks.
“A lot of content on the internet becomes a public good,” says Victor. Yet, perpetual access to this content is fickle, even for well-established online communities. Some research has found the average lifespan of a website to be only 100 days. Hosting fees or accidental failures are often root causes for the loss of data.
The fragility of the web is an issue that will take not only innovative development effort, but a radically new mindset to tackle. “It’s a really compelling problem,” added Victor. Using blockchain-based smart contracts to store canonical references for content is one proposed solution. But Web 3.0 is in a primitive stage, and to progress, it will take significant community action.
Doing so in a way that is sustainable will be the trick. Open standards and blockchain-based protocols could provide the technical means for such a network to flourish. And a cryptocurrency backing could help incentivize participation and provide a means of funding. This decentralized, non-ephemeral storage could help preserve the web’s most precious resources in perpetuity.