Who are they?
URLTeam is an arm of Archive Team, solely dedicated to collecting shortened URLs.
It is unusual to see a long-term archiving or preservation project; once a collection or a grab is completed, that’s that. Yet URLTeam, who have taken on a task with no apparent end date, have endured for over 10 years, growing into a community in their own right.
What do they do?
Circa 2009, Scumola of Archive Team noted how shortened URLs had proliferated on a little website called Twitter. Twitter was then, and still is now, infamous for its character limitations. To free up space, users began sharing shortened links. And as other users discovered this trick, it only spread. This led to a paradigm shift in the web ecosystem. Links became a lot more unrecognizable, both to referrers and referees.
Archivists, too, were vexed. The traditional approach to web archiving had been to target a particular domain or subdomain, with URLs which followed a pattern. Now how could they expect to save posts from blogs, forum threads or stories from news sites, if URLs were coming to them from TinyURL or bit.ly? Thus, URLTeam was born out of an effort to catalogue said short URLs.
Short URLs are loose connections. You cannot actually shorten the domain that someone else has registered; that is to say, one cannot rename google to gugle.
The secret that URL shorteners employ is to generate short URLs on YOUR OWN (or third-party) servers, which can be made to redirect to longer URLs. When someone connects to your server with the shortened address, you just redirect them to the associated full address in your database. So any time you visit a short link, you are going to be visiting (at least) two websites.
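To make the lookup-and-redirect idea concrete, here is a minimal sketch in Python. It is not how any particular shortener is implemented: the `SHORT_LINKS` table, the `abc123` code and the `resolve` function are all made up for illustration, and a real service would sit behind a web server and a proper database.

```python
from typing import Optional, Tuple

# Hypothetical in-memory "database" mapping short codes to full URLs.
SHORT_LINKS = {
    "abc123": "https://example.com/some/very/long/path?with=parameters",
}


def resolve(code: str) -> Tuple[int, Optional[str]]:
    """Return the HTTP status and Location target a shortener might send back."""
    target = SHORT_LINKS.get(code)
    if target is None:
        # Unknown short code: there is nothing to redirect to.
        return 404, None
    # Known short code: answer with a permanent redirect to the full address.
    return 301, target
```

Calling `resolve("abc123")` gives back a 301 together with the long address, which is the second website your browser then goes on to visit.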
Old links dying or websites shutting down is a given. Yet adding more redirects is going to lengthen the chain to get to your final destination. As the saying goes, the chain is only as strong as the weakest link. Should the URL shortening service shut down, all of its short URLs will break, even if the actual site being redirected to is still online. So that is reason enough to hoard short URLs. That is what URLTeam does.
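The weakest-link problem can be sketched with a toy model of the web. Everything here is an assumption for illustration: the `follow_chain` helper, the `short.example` and `blog.example` addresses, and the convention that a URL missing from the dictionary is offline.

```python
from typing import Dict, List, Optional


def follow_chain(url: str, web: Dict[str, Optional[str]], max_hops: int = 10) -> List[str]:
    """Follow redirects through a toy web and return every URL visited.

    `web` maps a URL to the URL it redirects to, or to None if it is a final page.
    A URL that is missing from `web` is treated as offline.
    """
    chain = [url]
    while len(chain) <= max_hops:
        if url not in web:
            raise LookupError(f"dead link at {url}")
        next_url = web[url]
        if next_url is None:
            return chain  # reached the final destination
        url = next_url
        chain.append(url)
    raise RuntimeError("too many redirects")


# The shortener and the destination are both online.
web = {
    "https://short.example/abc": "https://blog.example/post",
    "https://blog.example/post": None,
}
print(follow_chain("https://short.example/abc", web))

# The shortener shuts down; the blog post itself is untouched.
del web["https://short.example/abc"]
try:
    follow_chain("https://short.example/abc", web)
except LookupError as err:
    print(err)
```

In the second call the blog post is still reachable by its full address, but anyone holding only the short link is stuck, unless somebody saved the mapping beforehand.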
How do they do it?
URLTeam’s approach is to decompose the problem. Even if we cannot possibly crawl every link shared on the internet, nor every final redirect, at any time there are going to be a relatively small number of URL shortening services. So URLTeam begins by hunting for said URL shorteners. If we can recognize ow.ly or goo.gl links, that’s a start.
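Recognizing short links can be as simple as checking a link's host against a list of known shorteners. The sketch below is only an illustration of that idea and not URLTeam's actual code; the `KNOWN_SHORTENERS` set and the `is_short_url` function are hypothetical, and the list holds just the services named in this post.

```python
from urllib.parse import urlsplit

# Illustrative list only; a real list of shorteners would be far longer.
KNOWN_SHORTENERS = {"ow.ly", "goo.gl", "bit.ly", "tinyurl.com"}


def is_short_url(url: str) -> bool:
    """Return True if the URL's host belongs to a known shortening service."""
    host = urlsplit(url).hostname or ""  # hostname is lowercased by urlsplit
    if host.startswith("www."):
        host = host[4:]
    return host in KNOWN_SHORTENERS
```

For example, `is_short_url("https://ow.ly/abc")` is true, while a link that merely mentions a shortener in its path, such as `https://example.com/ow.ly`, is not matched.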
Once a new shortening service is added, the next step would be crawling the web for short URLs. Since URL shorteners are almost universally linkable on any site, this is going to have to be a very broad crawl, akin to a crawl for building a search engine. This brings us to the URLTeam Tracker which oversees the distribution of the crawling jobs to hundreds of volunteering archivists. You can even see live stats on their page!
Collected links are finally shipped to the 301works collection on the Internet Archive. The 301works collection also houses link databases donated by URL shortening services, so if you happen to own a URL shortening service at risk of shutting down you might want to get in contact with them.
How do I sign up?
Communication happens on #[email protected]. General info can be found on their Wiki page.
If you want to hop right in, you can simply download and run an Archive Team Warrior, and then select the URLTeam project. You can also run the project using a Docker container by following the instructions here.
Now if you’ll excuse me, I have to cut this short so I can cut some shortcuts.
Looking to discover other archiving communities? Just follow Data Horde’s Twitter List and check out our past Community Spotlights.
The URLTeam is a collection of assholes, with no regard for anyone but themselves. You can dress it up as “We’re doing the world a service” all you want, but the fact that they are proud of blatantly ignoring standardized mechanisms in place for service owners to opt out of such things speaks volumes to their true intentions. They don’t care who they harm, or how they harm, as long as they get their precious data dump.