Community Spotlight: Archive Team

themadprogramer — Sat, 14 Mar 2020 13:00:20 +0000

Who are they?

If you’re here right now, chances are you’ve heard of the name “Archive Team” before. They might not be the largest internet archiving group, but they are certainly the most influential.

What do they do?

Save your stuff. “You” means everyone, and “stuff” means anything from links on ancient forums, to news reports people will forget, to music videos on dying platforms. Archive Team mostly focusses on extracting web content, often outsourcing its later distribution to the Internet Archive. If a website is reported to be shutting down soon, it’s only a matter of time before they catch wind of it.

How do they do it?

For most cases they have a standard solution which anyone can download, known as the Warrior (2). It runs in a virtual machine and downloads content from a website that might not be able to keep that content online (1). A Tracker (3) server keeps track of what has been collected and what still needs to be collected. The downloaded content is then sent to Servers (4) run by dedicated volunteers from the Archive Team, for temporary storage. The final destination will usually be the Internet Archive (5) once the content goes offline for good.
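To make the Tracker's bookkeeping a little more concrete, here is a toy sketch in Python. It is not Archive Team's actual tracker code: the `Tracker` class, its method names and the idea that the tracker hands out one item at a time are all assumptions made for illustration. It only models the "what is collected, what is still to be collected" part described above.

```python
from collections import deque


class Tracker:
    """Toy tracker (hypothetical): remembers which items are to do, in progress, and done."""

    def __init__(self, items):
        self.todo = deque(items)  # items nobody has picked up yet
        self.claimed = {}         # item -> name of the warrior working on it
        self.done = set()         # items that have been fully collected

    def request_item(self, warrior):
        """Hand the next outstanding item to a warrior, or None if nothing is left."""
        if not self.todo:
            return None
        item = self.todo.popleft()
        self.claimed[item] = warrior
        return item

    def mark_done(self, item):
        """Record that a previously handed-out item has been collected."""
        if item not in self.claimed:
            raise KeyError(f"{item!r} was never handed out")
        del self.claimed[item]
        self.done.add(item)

    def remaining(self):
        """Number of items not yet collected (waiting or in progress)."""
        return len(self.todo) + len(self.claimed)
```

A warrior would call `request_item`, download whatever the item points at, ship it off to a storage server, and then call `mark_done`, so the tracker always knows what is left.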

How do I sign up?

Archive Team is entirely composed of volunteers. Although members maintain a small presence on Twitter and Discord, their main hub is https://archiveteam.org, a wiki where they keep track of ongoing projects and have links to resources such as the aforementioned Warrior. For more “real-time” communication you can find them on the #archiveteam channel on EFnet:
http://chat.efnet.org:9090/?channels=%23archiveteam

Looking to discover other archiving communities? Just follow Data Horde’s Twitter List and check out our other Community Spotlights.

How the Annotation Worker …Worked

glmdgrielson — Sat, 29 Feb 2020 18:00:00 +0000

So the annotation thing. You remember that, right? Well, here is how the worker seemed to function. Note that I’m getting this information from a cursory glance at the code (and from chatting with one of the devs). I know it works because I had three of them running at any given time. But how? Uh, *shrug*

Let’s get started, shall we? The worker’s code (at omarroth/archive) starts by creating an instance of a new Worker class. This is our basic worker.

The run function creates a BatchProcess and calls its run method. *sigh* So what does that do? Well, it asks the server for a batch, pulls it up from a database, and retrieves the annotations for each video in the batch …which is done in yet another class, this one called AnnotationProcess.

So what does AnnotationProcess do? It makes a request to YouTube to get the annotations. (The URL in the repository was changed after the fact. By me. Interesting.) How the worker gets its videos is the clever part: to make sure the worker is functioning properly, there is a trust system. A fresh worker won’t actually get a new batch; it’ll get one that’s already been verified. As it gives more valid responses, it’s more likely to get a new video. This way, the likelihood of getting garbage data is minimized, which is important for an archival project.
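Here is a rough sketch of what such a trust system could look like, written in Python rather than the project's own language. This is my guess at the idea, not the real server logic: the `WorkerTrust` class, the `pick_batch_kind` function and the smoothing constant of 5 are all made up for illustration.

```python
import random


class WorkerTrust:
    """Hypothetical trust record for a single worker."""

    def __init__(self):
        self.valid = 0  # responses that matched already-verified data
        self.total = 0  # all responses seen from this worker

    def record(self, was_valid):
        """Note one more response from the worker."""
        self.total += 1
        if was_valid:
            self.valid += 1

    def chance_of_new_batch(self):
        # Assumed formula: 0 for a fresh worker, creeping towards 1 as valid
        # responses pile up. The +5 keeps early trust low.
        return self.valid / (self.total + 5)


def pick_batch_kind(trust, rng=random):
    """Return "new" or "verified" depending on how trusted the worker is."""
    if rng.random() < trust.chance_of_new_batch():
        return "new"
    return "verified"
```

With this formula a brand-new worker always gets a "verified" batch, since its chance of a new one is exactly zero, and every bad response drags the chance back down.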

Once all the videos in a batch have been downloaded, they’re verified with the server and then uploaded to DigitalOcean Spaces, a cloud storage service. This goes on ad infinitum until YouTube decides to pull the plug.
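Putting the pieces together, the whole loop might be sketched like this in Python. Again, this is only my reading of the flow, not the actual worker: `run_worker` and the four callables it takes (`get_batch`, `fetch_annotations`, `verify`, `upload`) are hypothetical stand-ins for the server, YouTube and the storage service.

```python
def run_worker(get_batch, fetch_annotations, verify, upload, max_batches=None):
    """Fetch a batch, download annotations per video, verify, then upload.

    All four callables are hypothetical stand-ins supplied by the caller.
    Returns the number of batches processed.
    """
    processed = 0
    while max_batches is None or processed < max_batches:
        video_ids = get_batch()
        if not video_ids:  # nothing left to hand out, so stop
            break
        # One annotation download per video in the batch.
        results = {vid: fetch_annotations(vid) for vid in video_ids}
        if verify(results):  # only upload what the server accepted
            upload(results)
        processed += 1
    return processed
```

Passing the network-facing parts in as plain functions means the loop can be tried out with fake data, without poking YouTube or any storage service.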

And that is what (I think) the annotation worker did.

– glmdgrielson
