URLTeam is an arm of Archive Team dedicated solely to collecting shortened URLs.
Long-term archiving and preservation projects are unusual; once a collection or a grab is completed, that’s that. Yet URLTeam, which has taken on a task with no apparent end date, has endured for over 10 years, growing into a community in its own right.
Circa 2009, Scumola of Archive Team noted how shortened URLs had proliferated on a little website called Twitter. Twitter was then, and still is now, infamous for its character limit. To free up space, users began sharing shortened links, and as other users discovered the trick, it only spread. This led to a paradigm shift in the web ecosystem: links became far less recognizable, both to those sharing them and to those following them.
Archivists, too, were vexed. The traditional approach to web archiving had been to target a particular domain or subdomain: URLs which followed a pattern. Now how could they expect to save blog posts, forum threads or stories from news sites, if URLs were coming to them through TinyURL or bit.ly? Thus, URLTeam was born out of an effort to catalogue these short URLs.
Short URLs are loose connections. You cannot actually shorten a domain that someone else has registered; that is to say, one cannot rename google to gugle.
The trick that URL shorteners employ is to generate short URLs on their own (or third-party) servers, which can be made to redirect to longer URLs. When someone connects to the server with a shortened address, they are simply redirected to the full address associated with it in the shortener’s database. So any time you visit a short link, you are actually visiting (at least) two websites.
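To make the mechanism concrete, here is a minimal sketch of a shortener’s redirect logic in Python, using only the standard library. The short code and destination below are placeholders, not any real service’s data.

```python
# Minimal sketch of a URL shortener's redirect logic.
# The LINKS table stands in for the shortener's real database.
from http.server import BaseHTTPRequestHandler, HTTPServer

LINKS = {
    "/abc123": "https://example.com/a/very/long/article/path",  # placeholder mapping
}

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        target = LINKS.get(self.path)
        if target:
            # 301 Moved Permanently: send the visitor on to the long URL
            self.send_response(301)
            self.send_header("Location", target)
        else:
            self.send_response(404)
        self.end_headers()

if __name__ == "__main__":
    # Visiting http://localhost:8000/abc123 now bounces you to the long URL.
    HTTPServer(("localhost", 8000), Redirector).serve_forever()
```

Notice that the mapping lives entirely in the shortener’s database; nothing about the short address itself tells you where it leads.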
Old links dying and websites shutting down is a given, and every added redirect lengthens the chain to your final destination. As the saying goes, a chain is only as strong as its weakest link. Should the URL shortening service shut down, all of its short URLs break, even if the actual sites being redirected to are still online. That is reason enough to hoard short URLs, and that is what URLTeam does.
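Hoarding a short URL, in practice, means recording where it leads while the shortener is still alive. Here is a minimal sketch of that idea in Python; the short link is a placeholder.

```python
# Sketch: resolve a short URL and record its final destination
# before the shortening service disappears.
import urllib.request

def resolve(short_url: str) -> str:
    """Follow redirects and return the final destination URL."""
    # urlopen follows HTTP redirects automatically; the response's
    # .url attribute holds the address we actually ended up at.
    with urllib.request.urlopen(short_url) as response:
        return response.url

if __name__ == "__main__":
    short = "https://tinyurl.com/example"  # hypothetical short link
    print(short, "->", resolve(short))
```

With the (short, long) pair saved, the destination stays reachable even after the shortener goes dark.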
URLTeam’s approach is to decompose the problem. Even if we cannot possibly crawl every link shared on the internet, nor every final redirect, at any given time there are only a relatively small number of URL shortening services. So URLTeam begins by hunting for said URL shorteners. If we can recognize ow.ly or goo.gl links, that’s a start.
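As a rough illustration, recognizing links from known shorteners can be as simple as a pattern match. The domain list below is illustrative, not URLTeam’s actual registry.

```python
# Sketch: spot links from known URL shorteners in a page's HTML.
import re

SHORTENER_DOMAINS = ["tinyurl.com", "bit.ly", "ow.ly", "goo.gl"]  # illustrative list

pattern = re.compile(
    r"https?://(?:" + "|".join(map(re.escape, SHORTENER_DOMAINS)) + r")/[A-Za-z0-9_-]+"
)

html = 'Check this out: <a href="https://ow.ly/abc12">link</a>'
print(pattern.findall(html))  # ['https://ow.ly/abc12']
```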
Once a new shortening service is added, the next step is crawling the web for short URLs. Since shortened links can appear on virtually any site, this has to be a very broad crawl, akin to one for building a search engine. This brings us to the URLTeam Tracker, which oversees the distribution of crawling jobs to hundreds of volunteering archivists. You can even see live stats on their page!
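To give a feel for how such a distributed effort might fit together, here is a hypothetical sketch of a worker talking to a central tracker. The tracker address, endpoints and batch format are invented for illustration; this is not the real URLTeam Tracker API.

```python
# Hypothetical sketch of a tracker-coordinated worker:
# fetch a batch of work, resolve it, report the results back.
import json
import urllib.request

TRACKER = "https://tracker.example.org"  # hypothetical tracker address

def get_batch() -> dict:
    """Ask the tracker for a batch of short URLs to resolve."""
    with urllib.request.urlopen(f"{TRACKER}/get_work") as r:
        return json.load(r)  # e.g. {"service": "example.ly", "urls": [...]}

def report(results: dict) -> None:
    """Send resolved short -> long URL pairs back to the tracker."""
    req = urllib.request.Request(
        f"{TRACKER}/report",
        data=json.dumps(results).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

The real Tracker handles the bookkeeping of who is working on what, so volunteers only need to run a client and let it churn.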
Collected links are finally shipped to the 301works collection on the Internet Archive. The 301works collection also houses link databases donated by URL shortening services, so if you happen to own a URL shortening service at risk of shutting down, you might want to get in contact with them.
Communication happens on #urlteam on hackint. General info can be found on their wiki page.
If you want to hop right in, you can simply download and run an Archive Team Warrior and then select the URLTeam project. You can also run the project in a Docker container by following the instructions here.
Now if you’ll excuse me, I have to cut this short so I can cut some shortcuts.
Looking to discover other archiving communities? Just follow Data Horde’s Twitter List and check out our past Community Spotlights.
Shutdowns
In about a week’s time, Google Drive’s security update for shared links will be rolling out on September 13. Google is introducing a new resource key to shared links, aiming to make link sharing more secure. In short, this means that many links shared before September 13 will no longer function beyond that date, rendering many shared files inaccessible.
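For illustration, the shape of the change looks roughly like this; the IDs below are placeholders, not real files.

```python
# Illustrative only: how the resource key changes a shared link's shape.
file_id = "1a2b3c4d5e"            # hypothetical file ID
resource_key = "0-AbCdEfGhIjKl"   # hypothetical resource key

old_link = f"https://drive.google.com/file/d/{file_id}/view?usp=sharing"
new_link = old_link + f"&resourcekey={resource_key}"

# Accounts that never opened the file before the rollout will need
# the resourcekey-bearing link to reach it.
print(new_link)
```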
New Projects
On the subject of Google Drive, recall that accounts which have previously accessed a file will retain access. So we at Data Horde developed a tool for just that purpose: Google-Drivac allows you to “vaccinate” a particular Google account against a list of links.
Just log into a Google account, copy some cookies, gather a list of links you want to retain access to (either through crawling, or maybe through another account) and presto!
You can find Google-Drivac on our GitHub at https://github.com/Data-Horde/gdrivac.
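For the curious, the underlying idea can be sketched in a few lines: visit each link while authenticated, so the account registers as having accessed the file. To be clear, this is a conceptual sketch of the idea and not Google-Drivac’s actual code; the cookie string and link are placeholders.

```python
# Conceptual sketch of "vaccination": request each shared link with a
# logged-in account's cookies so the account counts as having accessed it.
import urllib.request

COOKIES = "SID=...; HSID=...; SSID=..."  # placeholder: copy from a logged-in browser session

def vaccinate(links):
    for url in links:
        req = urllib.request.Request(url, headers={"Cookie": COOKIES})
        with urllib.request.urlopen(req) as resp:
            print(url, resp.status)

vaccinate(["https://drive.google.com/file/d/1a2b3c4d5e/view"])  # hypothetical link
```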
In the meantime, for those hoarders among you who would prefer a hard grab, Archive Team is working on doing just that. You can follow development of the Google Drive Warrior on #[email protected].
Last but not least, some of you might recall the panic surrounding Mediafire allegedly deleting accounts from earlier this year. While the threat had been greatly exaggerated, the scare did bring with it some innovation.
Zopolis from the Media Flare project has developed the Automatic Mediafire Link Archiver (AMLA), a Chrome extension which automatically records Mediafire links on websites you visit, in the background. These links are forwarded to the public database of Ajay’s MediaFire URL Submitter, whose contents Archive Team periodically grabs. All website activity is tracked by default, so if you’re keen on your privacy you should restrict its permissions in the extension settings.
AMLA can be downloaded from here: https://github.com/Zopolis4/AMLA, and if you are into URL hunting, be sure to visit the Media Flare Discord server.