archive – Data Horde https://datahorde.org Join the Horde! Fri, 29 Apr 2022 10:54:17 +0000 en-US hourly 1 https://wordpress.org/?v=6.4.3 https://datahorde.org/wp-content/uploads/2020/04/cropped-DataHorde_Logo_small-32x32.png archive – Data Horde https://datahorde.org 32 32 This Week In Archiving 07/12/2021 https://datahorde.org/this-week-in-archiving-07-12-2021/ https://datahorde.org/this-week-in-archiving-07-12-2021/#respond Mon, 12 Jul 2021 23:07:41 +0000 https://datahorde.org/?p=2521 In Memoriam

Public Interest crusader and Internet Archive partner Sherwin Siy was reported to have passed on 7 July, 2021, at the age of 40. Through the span of his career, Sherwin Siy served as Public Knowledge‘s VP of Legal Affairs, helped shape the Public Policy of Wikimedia and had a long history of cooperating with the EFF.

Sherwin Siy’s contribution to the Internet Archive was making sense of how the Archive handled TV news, from a legal perspective, as reported by Lila Bailey, his former partner in law. Suffice it to say, news archives on IA such as the Third Eye news chyron collection might never have come to be, were it not for his efforts.

Updates

Bluemaxima’s Flashpoint, the webgame preservation project/community, finally has a dedicated submission website at https://fpfss.unstable.life/web. Previously, game submissions were made through their Discord server and had to await a laborious approval/rejection process. Let us hope that this change will make archiving efforts a whole lot easier. You can read more about the how-to of it here.

Again on the subject of Flashpoint, the community has taken it upon themselves to preserve interactive “YouTube games”. As our longtime readers know, once upon a time YouTube had an annotation system which could be used to link videos together. Some folks went out of their way to make Choose Your Own Adventure-style games, where depending on your choices, you would be taken to another video. But after the feature’s removal in 2019, many channels unlisted their annotation-intensive videos, and now, with YouTube’s plans to forcibly private old unlisted videos, these games are at risk of extinction. To help out with the Flashpoint project to preserve these video games, head over to the #youtube-games channel on the Flashpoint Discord Server.

What if the Earth were Hollow? Collab between Vsauce and MinutePhysics
Click here to watch with annotations.

With 11 days to go before the delisting of unlisted videos, archivists are working around the clock to hunt down unlisted videos. All the projects we discussed last week are still in motion; from Archive Team’s metadata scraping on #[email protected], to subreddit frenzies for harvesting videos linked on the filmot.com index.

Distributed YouTube Archive

Another noteworthy project is the #youtube-unlisted project on the Distributed YouTube Archive. A major bottleneck for Archive Team, and other groups, has been the archiving of raw video files, as the Internet Archive and Google Cloud are not suited for a sudden influx of large video files. The DYA project aims to mitigate this by splitting the task of storage between contributors. If a video is requested for download, the contributors who have made a copy of that video share their copy. While this might seem like a tedious process, it means that anyone with spare space can contribute to the storage, without needing TBs of hardware.

Finally, Omniarchive is holding a competition to collect Minecraft related unlisted videos.

What do I get out of it?
We will be giving 1 month of Discord Nitro, as well as a unique Discord role, and credit on the Omniarchive index, to the three users who submit the highest number of unique, valid videos.

How can I help?
Simply all you need to do is post as many UNLISTED Minecraft videos uploaded on or before 31st December 2011 into #unlisted-videos as you can possibly find. That’s it. If you find an unlisted Minecraft video uploaded anywhere from 2009-2011, post it! We’ll handle the rest using a few scripts to filter out any duplicates and other unwanted links. Check out this link for a detailed guide on how you can find such videos: https://docs.google.com/document/d/1HMeoH7XOpUvqBkUD1jBy5MfzgzK4X4wgUpzV2YIZ4Fs/view

HalfOfAKebab, Omniarchive Mod
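
The scripts Omniarchive mentions were not published in this post, so here is only a rough guess at what the "filter out any duplicates and other unwanted links" step could look like. It assumes video IDs are 11 characters long and only recognises youtube.com/watch and youtu.be links; the function names are made up for illustration. It does not check whether a video is unlisted or when it was uploaded.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: YouTube video IDs are 11 characters of letters, digits, "-" and "_".
VIDEO_ID = re.compile(r"^[A-Za-z0-9_-]{11}$")


def extract_video_id(link):
    """Return the video ID from a YouTube link, or None if it is not one."""
    parsed = urlparse(link.strip())
    host = parsed.netloc.lower()
    # Treat www.youtube.com and m.youtube.com the same as youtube.com.
    if host.startswith("www.") or host.startswith("m."):
        host = host.split(".", 1)[1]
    if host == "youtu.be":
        candidate = parsed.path.lstrip("/")
    elif host == "youtube.com" and parsed.path == "/watch":
        candidate = parse_qs(parsed.query).get("v", [""])[0]
    else:
        return None
    return candidate if VIDEO_ID.match(candidate) else None


def unique_video_ids(links):
    """Keep the first occurrence of each valid video ID, in submission order."""
    seen = set()
    result = []
    for link in links:
        video_id = extract_video_id(link)
        if video_id and video_id not in seen:
            seen.add(video_id)
            result.append(video_id)
    return result
```

Two different links to the same video collapse into one entry, and anything that is not a recognisable YouTube link is dropped.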

Are you involved with an archiving project related to YouTube’s unlisted videos, or not? Reach out to us at [email protected] so we can give you a shoutout!

https://datahorde.org/this-week-in-archiving-07-12-2021/feed/ 0
Help Archive Team Save Yahoo! Answers! https://datahorde.org/help-archive-team-save-yahoo-answers/ https://datahorde.org/help-archive-team-save-yahoo-answers/#comments Thu, 22 Apr 2021 02:35:47 +0000 https://datahorde.org/?p=2207 Yahoo! Answers is shutting down on May 4th, 2021, taking nearly 15 years worth of content with it!

Archive Team is trying to save as much of it as possible, and you can help!

By setting up the Archive Team Warrior and letting it run in the background, you can back up questions and answers from Yahoo! Answers and make them available in the Internet Archive Wayback Machine. The Archive Team Warrior is easy to set up and uses very few of your system resources. The Archive Team Warrior can work on up to 6 items concurrently.

Advanced users can also run the project with Docker using the atdr.meo.ws/archiveteam/yahooanswers-grab Docker image, which can easily be deployed on large networks and allows for running projects at a higher concurrency rate per container (maximum 20 concurrent items, though users running the project with this many concurrent items might be rate-limited by Yahoo!).
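
For anyone scripting a deployment, a small helper along these lines could assemble the command for one container. Treat it as a sketch: the image name and the 20-item ceiling come from the paragraph above, but the container arguments (a --concurrent option followed by a nickname) are an assumption on my part, so check the project page on the Archive Team Wiki before relying on them. The function name is hypothetical too.

```python
import shlex

IMAGE = "atdr.meo.ws/archiveteam/yahooanswers-grab"  # image name from the article
MAX_CONCURRENT = 20  # per-container ceiling mentioned in the article


def build_docker_command(nickname, concurrent=1):
    """Build an argument list for starting one project container."""
    if not nickname:
        raise ValueError("a nickname is required")
    if not 1 <= concurrent <= MAX_CONCURRENT:
        raise ValueError(f"concurrent must be between 1 and {MAX_CONCURRENT}")
    return [
        "docker", "run", "--detach", "--restart", "unless-stopped",
        IMAGE,
        # Assumed container arguments; confirm them on the project page.
        "--concurrent", str(concurrent), nickname,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(build_docker_command("example-nick", concurrent=4)))
```

Starting low and raising the concurrency gradually is the safer choice, given the rate-limiting warning above.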

If you need any help or have any questions about the project, please feel free to refer to the project page on the Archive Team Wiki or ask in Archive Team’s IRC channel for the Yahoo! Answers project. (Please be patient and stay connected if your question isn’t immediately answered so you don’t miss any responses.)

https://datahorde.org/help-archive-team-save-yahoo-answers/feed/ 1
The Open Library Explorer! A new way to browse the Internet Archive https://datahorde.org/the-open-library-explorer-a-new-way-to-browse-the-internet-archive/ https://datahorde.org/the-open-library-explorer-a-new-way-to-browse-the-internet-archive/#respond Fri, 25 Dec 2020 18:17:14 +0000 https://datahorde.org/?p=1899 Are you looking for a change of pace this holiday season? How about some reading? Now I’m sure you are all trying to fill up your progress bars on Goodreads, but here’s a challenge: Read a book you have never heard of before!

This might not be the best time to go to a bookstore or library, but as you have probably heard, the Internet Archive is a great virtual library in its own right. The only trouble is that browsing books on the Internet Archive can sometimes be a real chore. Just try finding a good fiction book: what a nightmare! So in an attempt to emulate the wonder and joy of browsing through the rows and shelves of a real library, the Internet Archive has recently introduced the Open Library Explorer.

Not only is the OLE more visually pleasing and easier to navigate, it is also very customizable. Want to travel through time? You can apply filters to ensure all books are from a specific time period. Want to rearrange the categories? You can switch between Dewey Decimal Classification and Library of Congress Classification with the push of a button. You can even switch the book layout to replace the front covers with spines for added realism. But don’t take my word for it, watch this guide made by Drini Cami, the head developer of the OLE!

Drini Cami is a Computer Science undergraduate at the University of Waterloo, interested in Human-Computer Interaction. Drini himself is an avid book-reader, funnily enough. I can imagine this is a project he’s deeply invested in.


Then what are you waiting for? Go on and Explore the Open Library for yourself! As for me, I will be reading Find it Fast. Readers who know me personally are more than familiar with my obsession with going out of my way to hunt down the most bizarre trivia, so here’s hoping this will help me hone my craft!

Have some good fun, watch yourself and Happy Holidays to all!

https://datahorde.org/the-open-library-explorer-a-new-way-to-browse-the-internet-archive/feed/ 0
Search for Sound: A New Feature on Internet Archive https://datahorde.org/search-for-sound-a-new-feature-on-internet-archive/ https://datahorde.org/search-for-sound-a-new-feature-on-internet-archive/#respond Wed, 12 Aug 2020 23:00:00 +0000 https://datahorde.org/?p=1083 If you’ve been browsing the Internet Archive recently, you might have noticed a new search option called “Search radio transcripts”. You can now search through radio broadcasts as if looking up something in a book, it’s pretty neat!

How’s it work? Software is used to transcribe a broadcast into text, similar to how OCR programs can transcribe the text in a book. Once a broadcast has a transcript available, you’ll be able to search its contents; effectively being able to search through sound.
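
To make the idea a bit more concrete, here is a toy version of "search through sound" once the transcripts exist. This is not how the Internet Archive implements it (the post doesn't say); it is just a minimal inverted index, with made-up function names, that maps each word to the broadcasts whose transcript contains it.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a transcript and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(transcripts):
    """Map each word to the set of broadcast IDs whose transcript contains it."""
    index = defaultdict(set)
    for broadcast_id, text in transcripts.items():
        for word in tokenize(text):
            index[word].add(broadcast_id)
    return index


def search(index, query):
    """Return the IDs of broadcasts containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)
```

The speech-to-text step is the hard part, and it is also where the inaccuracies mentioned below creep in: a word the software mishears never makes it into the index.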

The feature does seem to be in a WIP stage, so you might encounter some inaccuracies. Also of note is that only broadcasts from the past 5 years or so are available, from selected collections.

Still, there are over two million recordings waiting for you to hear their story! You can view the full “catalogue” of radio broadcasts which support this feature in the aptly named Radio Archive collection.

https://datahorde.org/search-for-sound-a-new-feature-on-internet-archive/feed/ 0
Internet Archive in Trouble https://datahorde.org/internet-archive-in-trouble/ https://datahorde.org/internet-archive-in-trouble/#comments Sat, 06 Jun 2020 12:00:00 +0000 https://datahorde.org/?p=693 So, you might have heard, but in case you haven’t, the Internet Archive is being sued.

A Timeline of The National Emergency Library Controversy

This is a problem. IA is not only the best means we have of preserving the internet through the Wayback Machine, it also houses books (the reason it’s getting sued in the first place), classic (and by that I mean DOS) video games, and oh yeah, all of our annotation data. Losing this site would be very, very bad.

So what’s the lawsuit about? Basically, Hachette, HarperCollins, Penguin Random House, and Wiley are trying to say that the Internet Archive is piracy, plain and simple. They say that IA does not deserve to call itself a library, which according to the state of California, it is. And basically, that IA does not deserve to exist. Thanks, publishers.

Giving the document a careful look through, a couple of things strike me as curious. One, the lawsuit mentions five people also being sued despite the publishers not actually knowing who they are. Good one. They also insist on referring to it as a “library”, air quotes included, rather than an archive. And as I mentioned before, according to California, it is a library.

Unfortunately, we haven’t been able to get any word about this from the Archive’s point of view, either due to an unknown gag order or just fear for its safety.

– glmdgrielson, writing his second article on the new site

https://datahorde.org/internet-archive-in-trouble/feed/ 1
Community Spotlight: Internet Archive https://datahorde.org/community-spotlight-internet-archive/ https://datahorde.org/community-spotlight-internet-archive/#respond Sat, 28 Mar 2020 12:00:48 +0000 https://datahorde.org/?p=206 Who are they?
Internet Archive defines itself as a “digital library”. Much like in most libraries nowadays, you can find more than just books there, from old manuscripts and journals to offline websites or floppy shareware… IA’s mission is Universal Access to All Knowledge.

What do they do?

First of all, IA works to digitize new material, such as books or VHS tapes that probably haven’t made it onto the internet yet.

They host a number of collections, which are often curated by libraries or educational institutions such as the New York Public Library and the University of Toronto.

And then there’s the Wayback Machine which started it all! It allows you to capture snapshots of webpages. As the name suggests, it works like a time machine, allowing you to view past versions of websites or even those which are no longer online.

They also have a whole bunch of other projects, including one which allows users to borrow rare books from libraries and keep a 14-day e-book version. See https://archive.org/projects/ for more information.

Outside View of the Internet Archive
Taken from: https://archive.org/about/

How do they do it?
Although it may come as a surprise, the Internet Archive has a physical location. Both the physical archives (books and similar materials) and the virtual archives (servers and digitization equipment) are located inside a former Christian Science church.

Most of their work comes out of here, although they are known to often collaborate with other libraries/archives or acquire collections from different collectors.

Servers inside of the Internet Archive,
Image taken from: https://thecontentwrangler.com/2016/05/13/payback-machine-internet-archive-continues-inform-serve-inspire/

How do I sign up?

If you would like to work for the Internet Archive at their physical location you could check out https://archive.org/about/jobs.php.

That being said, anyone can browse the archives* and you can start an account if you’d like to upload items of your own from anywhere in the world**. You heard right! All you have to do to contribute to the Internet Archive is sign up right from the comfort of your home.

So what are you waiting for? Become an Internet Archiver today! https://archive.org/account/signup


*Browsing certain material (generally sensitive or graphic content) might require you to sign in with a registered account.
**Using the Wayback Machine to make captures of websites won’t require a registered account.

Looking to discover other archiving communities? Just follow Data Horde’s Twitter List and check out our other Community Spotlights.

https://datahorde.org/community-spotlight-internet-archive/feed/ 0
Community Spotlight: Archive Team https://datahorde.org/community-spotlight-archive-team/ https://datahorde.org/community-spotlight-archive-team/#comments Sat, 14 Mar 2020 13:00:20 +0000 https://datahorde.org/?p=33  

Who are they?

If you’re here right now, chances are you’ve heard of the name “Archive Team” before. They might not be the largest internet archiving group, but they are certainly the most influential.

What do they do?

Save your stuff. And by “your” we mean everyone’s: from links on ancient forums, to news reports people will forget, to music videos on dying platforms. Archive Team mostly focusses on extracting web content, often outsourcing its later distribution to the Internet Archive. If a website is reported to be shutting down some time soon, it’ll only be a matter of time before they catch wind of it.

How do they do it?

Archiveteam warrior infrastructure.png

For most cases they have a standard solution, which anyone can download, known as the Warrior (2). It works by downloading website contents from a website that might not be able to maintain its content (1) to a virtual machine. This content is then passed to a Tracker (3) server which keeps track of what is collected and what else is to be collected. These are then sent to Servers (4) run by dedicated volunteers from the Archive Team, for temporary storage. The final destination will usually be the Internet Archive (5) once the content goes offline for good.
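
The Tracker’s bookkeeping role can be pictured with a few lines of code. This is a deliberately tiny sketch and not Archive Team’s actual tracker: the class and method names are invented, and it only captures the idea of handing out items to Warriors and remembering which ones are finished.

```python
from collections import deque


class Tracker:
    """Hands out work items and records which ones are finished."""

    def __init__(self, items):
        self.todo = deque(items)  # items nobody has picked up yet
        self.claimed = {}         # item -> name of the warrior working on it
        self.done = set()         # items that have been collected

    def claim(self, warrior):
        """Give the next unclaimed item to a warrior, or None if none remain."""
        if not self.todo:
            return None
        item = self.todo.popleft()
        self.claimed[item] = warrior
        return item

    def complete(self, warrior, item):
        """Mark an item finished, but only for the warrior that claimed it."""
        if self.claimed.get(item) != warrior:
            raise ValueError("item was not claimed by this warrior")
        del self.claimed[item]
        self.done.add(item)
```

A real tracker also has to cope with Warriors that vanish halfway through an item, which this sketch leaves out.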

How do I sign up?

Archive Team is entirely composed of volunteers. Although members maintain a small presence on Twitter and Discord, their main hub is https://archiveteam.org, a wiki where they keep track of ongoing projects and have links to resources such as the aforementioned Warrior. For more “real-time” communication you can find them on the #archiveteam channel on EFnet:
http://chat.efnet.org:9090/?channels=%23archiveteam


Looking to discover other archiving communities? Just follow Data Horde’s Twitter List and check out our other Community Spotlights.

https://datahorde.org/community-spotlight-archive-team/feed/ 6
How the Annotation Worker …Worked https://datahorde.org/how-the-annotation-worker-worked/ https://datahorde.org/how-the-annotation-worker-worked/#respond Sat, 29 Feb 2020 18:00:00 +0000 So the annotation thing. You remember that, right? Well, here is how the worker seemed to function. Note that I’m getting this information from a brief cursory glance (and chatting with one of the devs). I know it works because I had three of them running at any given time. But how? Uh, *shrug*

Let’s get started, shall we? The worker’s code (at omarroth/archive) starts by creating a new Worker object. This is our basic worker.

The run function creates a BatchProcess and calls its run. *sigh* So what does that do? Well it asks the server for a batch, pulls it up from a database, and retrieves the annotations for each of them …which is done in yet another class, this one called AnnotationProcess.

So what does AnnotationProcess do? It does a request to YouTube to get the annotations. (The URL in the repository was changed after the fact. By me. Interesting.) How it gets those annotations is interesting: to make sure the worker is functioning properly, there is a trust system. A fresh worker won’t actually get a new batch; it’ll get one that’s already been verified. As it gives more valid responses, it’s more likely to get a new video. This way, the likelihood of getting garbage data is minimized, which is important for an archival project.
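
I never saw the server side of the trust system, so the following is only a guess at its general shape and not the code from omarroth/archive. The class, the function and the PRIOR constant are all invented for illustration: a fresh worker starts with a score of zero and so always gets an already-verified batch, and the score creeps up as valid responses accumulate.

```python
import random

PRIOR = 5  # hypothetical damping constant: trust is earned slowly


class WorkerTrust:
    """Tracks how many of a worker's responses turned out to be valid."""

    def __init__(self):
        self.valid = 0
        self.total = 0

    def record(self, was_valid):
        """Record the outcome of checking one response from the worker."""
        self.total += 1
        if was_valid:
            self.valid += 1

    def score(self):
        """Return a value in [0, 1); a fresh worker scores 0.0."""
        return self.valid / (self.total + PRIOR)


def pick_batch_kind(trust, rng=random):
    """Return "new" with probability equal to the score, else "verified"."""
    return "new" if rng.random() < trust.score() else "verified"
```

With these made-up numbers, a worker that has returned five valid responses in a row gets a new batch about half the time, and one bad response pulls the score back down.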

Once all the videos in a batch have been downloaded, they’re verified with the server and then uploaded to DigitalOcean Spaces, a cloud storage service. This goes on ad infinitum until YouTube decides to pull the plug.
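
Put together, the whole cycle might be pictured like this. To be clear, this is my own paraphrase of the flow described above and not a port of the real worker: every function name is hypothetical, and the network calls (asking the server for a batch, requesting annotations from YouTube, verifying, uploading) are passed in as plain functions so the loop itself stays visible.

```python
def process_batch(video_ids, fetch_annotations):
    """Fetch annotations for every video in a batch (the AnnotationProcess step)."""
    return {video_id: fetch_annotations(video_id) for video_id in video_ids}


def run_worker(get_batch, fetch_annotations, verify, upload, max_batches):
    """Repeat the request, fetch, verify, upload cycle up to max_batches times."""
    uploaded = 0
    for _ in range(max_batches):
        video_ids = get_batch()
        if not video_ids:
            break  # the server has nothing left to hand out
        results = process_batch(video_ids, fetch_annotations)
        if verify(results):
            upload(results)
            uploaded += 1
    return uploaded
```

The real worker had no max_batches limit, of course; it is only there so the sketch terminates.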

And that is what (I think) the annotation worker did.

– glmdgrielson

https://datahorde.org/how-the-annotation-worker-worked/feed/ 0