Community Spotlight: URLTeam
https://datahorde.org/community-spotlight-urlteam/ (Mon, 01 Nov 2021)

Who are they?

URLTeam is an arm of Archive Team, solely dedicated to collecting shortened URLs.
It is unusual to see a long-term archiving or preservation project; once a collection or a grab is completed, that’s that. Yet URLTeam, which has taken on a task with no apparent end date, has endured for over 10 years, growing into a community in its own right.

What do they do?

Circa 2009, Scumola of Archive Team noted how shortened URLs had proliferated on a little website called Twitter. Twitter was then, and still is now, infamous for its character limits. To free up space, users began sharing shortened links, and as other users discovered the trick, it only spread. This led to a paradigm shift in the web ecosystem: links became a lot less recognizable, both to those sharing them and to those following them.

(Comic: xkcd #1698, "Theft Quadrants", https://xkcd.com/1698/)

Archivists, too, were vexed. The traditional approach to web archiving had been to target a particular domain or subdomain, i.e. URLs that followed a pattern. Now how could they expect to save posts from blogs, forum threads, or stories from news sites, if URLs were coming to them from TinyURL or bit.ly? Thus, URLTeam was born out of an effort to catalogue these short URLs.


Short URLs are loose connections. You cannot actually shorten a domain that someone else has registered; that is to say, one cannot rename google to gugle.

The secret URL shorteners employ is to generate short URLs on their own (or third-party) servers, which can be made to redirect to longer URLs. When someone connects to the shortener’s server with the shortened address, it simply redirects them to the associated full address in its database. So any time you visit a short link, you are visiting (at least) two websites.
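To make the mechanism concrete, here is a minimal sketch of a shortener’s core redirect logic, using only the Python standard library; the short code and destination below are made up for illustration:

```python
# Minimal sketch of a URL shortener's core logic (hypothetical data).
from http.server import BaseHTTPRequestHandler, HTTPServer

# The shortener's "database": short code -> full destination URL.
SHORT_LINKS = {
    "/abc123": "https://example.com/some/very/long/article-url",
}

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        target = SHORT_LINKS.get(self.path)
        if target:
            # An HTTP 301 tells the visitor's browser to go to the full URL.
            self.send_response(301)
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_error(404)

HTTPServer(("localhost", 8000), Redirector).serve_forever()
```

The entire "service" is just a lookup table plus an HTTP 301; lose the table, and every short link pointing at it dies with it.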

Old links dying or websites shutting down is a given, and every added redirect lengthens the chain to your final destination. As the saying goes, a chain is only as strong as its weakest link. Should a URL shortening service shut down, every one of its short URLs will break, even if the actual sites being redirected to are still online. That is reason enough to hoard short URLs, and that is what URLTeam does.
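Hoarding, here, means recording the short-to-long mapping while the shortener is still alive. One way to capture a single mapping, sketched with the third-party requests package (the example short link is hypothetical):

```python
# Record a short URL's destination without following the redirect.
# Hypothetical example link; requires `pip install requests`.
import requests

short_url = "https://tinyurl.com/example-code"
resp = requests.head(short_url, allow_redirects=False, timeout=10)
if resp.status_code in (301, 302, 307, 308):
    # The Location header holds the full destination URL.
    print(short_url, "->", resp.headers["Location"])
```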

How do they do it?

URLTeam’s approach is to decompose the problem. We cannot possibly crawl every link shared on the internet, let alone every final redirect, but at any given time there are only a relatively small number of URL shortening services. So URLTeam begins by hunting for said URL shorteners. If we can recognize ow.ly or goo.gl links, that’s a start.
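Recognizing them is mostly a matter of matching hostnames against a list of known services. A toy version follows; the service list here is a tiny sample, not URLTeam’s real one:

```python
# Toy shortener detector: pull URLs out of text and keep only those
# whose hostname is on a (sample) list of known shortening services.
import re
from urllib.parse import urlparse

KNOWN_SHORTENERS = {"ow.ly", "goo.gl", "bit.ly", "tinyurl.com"}

def find_short_links(text):
    for url in re.findall(r"https?://\S+", text):
        if urlparse(url).hostname in KNOWN_SHORTENERS:
            yield url

print(list(find_short_links("see https://bit.ly/abc123 and https://example.com/page")))
```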

Once a new shortening service is added, the next step is crawling the web for its short URLs. Since short URLs can turn up on practically any site, this has to be a very broad crawl, akin to a crawl for building a search engine. This brings us to the URLTeam Tracker, which oversees the distribution of crawling jobs to hundreds of volunteering archivists. You can even see live stats on their page!
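The tracker’s real protocol lives in URLTeam’s own software, but the division of labor can be sketched like this; every endpoint and JSON shape below is hypothetical, purely to illustrate the claim-work-report cycle:

```python
# HYPOTHETICAL worker loop, for illustration only; the real URLTeam
# tracker protocol differs. A volunteer claims a batch of short codes,
# resolves each one, and reports the mappings back to the tracker.
import requests

TRACKER = "https://tracker.example.org"  # hypothetical tracker address

while True:
    job = requests.get(f"{TRACKER}/claim").json()  # e.g. {"service": "bit.ly", "codes": [...]}
    results = {}
    for code in job["codes"]:
        resp = requests.head(f"https://bit.ly/{code}", allow_redirects=False, timeout=10)
        if resp.status_code in (301, 302):
            results[code] = resp.headers["Location"]
    requests.post(f"{TRACKER}/report", json=results)  # ship the findings back
```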

Collected links are finally shipped to the 301works collection on the Internet Archive. The 301works collection also houses link databases donated by URL shortening services themselves, so if you happen to run a URL shortening service at risk of shutting down, you might want to get in contact with them.

How do I sign up?

Communication happens on #[email protected]. General info can be found on their Wiki page.

If you want to hop right in, you can simply download and run an Archive Team Warrior and then select the URLTeam project. You can also run the project in a Docker container by following the instructions here.

Now if you’ll excuse me, I have to cut this short so I can cut some shortcuts.


Looking to discover other archiving communities? Just follow Data Horde’s Twitter List and check out our past Community Spotlights.

This Week in Archiving 09/06/2021
https://datahorde.org/this-week-in-archiving-09-06-2021/ (Mon, 06 Sep 2021)

Everything you need to know about the upcoming Google Drive Security Update, and a few pleasant surprises from the archiving scene to boot!

Shutdowns

In about a week’s time, on September 13, Google Drive’s Security Update for shared links will roll out. Google is introducing a new resource key to shared links, aiming to make link sharing more secure. In short, many links shared before September 13 will no longer function beyond that date, rendering many shared files inaccessible.
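For a concrete picture of the change, here is roughly what the two link shapes look like; the file ID and resource key below are made up, and the exact parameter layout can vary:

```python
# Hypothetical IDs, for illustration only.
# Before the update, a shared link identified the file by its ID alone:
old_link = "https://drive.google.com/file/d/1AbCdEfGhIjKlMnOp/view?usp=sharing"

# After the update, links to affected files must also carry a resource key;
# for accounts that never opened the file, the ID-only form stops working:
new_link = "https://drive.google.com/file/d/1AbCdEfGhIjKlMnOp/view?usp=sharing&resourcekey=0-AbC123xYz"
```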

What you need to know! A few gotchas to this update:

  • If any Google account has accessed a shared link prior to September 13, it will retain full access to said file/drive beyond that date, unless access is revoked by the owner. More on this later.
  • Drive accounts can view their affected files and even opt out of the security update altogether, provided they are not Family Link accounts. However, opting out is possible only after September 13, not before; see https://support.google.com/drive/answer/10729743 for details.
  • Don’t worry about your petitions getting blocked; Google Docs, Sheets and Slides are exempt from the update. To be clear, this does not mean all .docx, .xlsx and .pptx files; it only means that Google Workspace‘s native files will not be affected. Links to folders containing Docs, Sheets or Slides do not appear to be exempt, however. For further details see https://support.google.com/a/answer/10685032

New Projects

Follow the Google Drive Countdown on Data Horde Twitter to keep up!

On the subject of Google Drive, recall that accounts which have previously accessed a file will retain access. So we at Data Horde developed a tool for exactly that purpose: Google-Drivac allows you to “vaccinate” a particular Google account against a list of links.

Just log into a Google account, copy some cookies, gather a list of links you want to retain access to (either through crawling, or perhaps from another account) and presto!

You can find Google-Drivac on our GitHub at https://github.com/Data-Horde/gdrivac.
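Under the hood, the idea is simply to make the account “touch” each file while authenticated, so Google registers the access before the deadline. A conceptual sketch follows; this is not Google-Drivac’s actual code or interface, and the cookie names are assumptions:

```python
# Conceptual sketch only; NOT Google-Drivac's real implementation.
# Visit each shared link with the account's session cookies so the
# account counts as having accessed the file before September 13.
import requests

# Session cookies copied from a logged-in browser (names assumed).
cookies = {"SID": "...", "HSID": "...", "SSID": "..."}

with open("links.txt") as f:
    for link in f:
        r = requests.get(link.strip(), cookies=cookies, timeout=30)
        print(r.status_code, link.strip())
```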

In the meantime, for those hoarders among you who would prefer a hard grab, Archive Team is working on doing just that. You can follow development of the Google Drive Warrior on #[email protected].


Last but not least, some of you might recall the panic earlier this year surrounding MediaFire allegedly deleting accounts. While the threat had been greatly exaggerated, the scare did bring some innovation with it.

Zopolis from the Media Flare project has developed the Automatic Mediafire Link Archiver (AMLA), a Chrome extension which automatically records MediaFire links on websites you visit, in the background. These links are forwarded to Ajay’s MediaFire URL Submitter public database, whose contents Archive Team periodically grabs. All website activity is tracked by default, so if you’re keen on your privacy you should restrict its permissions in the extension settings.
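For a flavor of what that background recording amounts to, here is a toy stand-in written in Python rather than as a browser extension; the page URL is hypothetical and the regex is a simplification of whatever AMLA actually matches:

```python
# Toy illustration of AMLA's job: scan a page's HTML for MediaFire links.
# The real extension runs in the browser and matches more URL shapes.
import re
import requests

html = requests.get("https://example.com/some-forum-thread").text  # hypothetical page
links = set(re.findall(r"https?://(?:www\.)?mediafire\.com/\S+", html))
for link in sorted(links):
    print(link)
```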

AMLA can be downloaded here: https://github.com/Zopolis4/AMLA, and if you are into URL hunting, be sure to visit the Media Flare Discord server.
