Twitter in Trouble? Why you should Archive your Tweets (Mon, 05 Dec 2022)
https://datahorde.org/twitter-in-trouble-why-you-should-archive-your-tweets/

Twitter has seen some radical restructuring since Elon Musk’s acquisition over a month ago. Now is as good a time as ever to talk about what options you have for archiving or preserving your Twitter content.


This new era of Twitter has been quite turbulent, to say the least. More than half of the workforce has been fired or has quit, and site functionality is becoming unstable, as reported by the Seattle Times. Mastodon has emerged as a serious Twitter alternative. In fact, some of those who have departed Twitter now have their own Mastodon instance over at macaw.social. Personally, I am excited about the rise of Mastodon as an alternative, as I have been posting Data Horde updates over at @[email protected] for about two years now.

So, why not leave Twitter behind and move on? Twitter does allow you to request a copy of your personal data: Tweets and all. But it’s probably hard to leave a site that you have been on for over a decade, especially when requesting your personal archive is not even working correctly. Many people have reported that archive requests are being ignored or processed with delay. On a test account, we at Data Horde found that it took over 3 days to receive a personal archive.

Tweeters complaining about being unable to export personal archives: view snapshot at archive.is

In 2022 this is a big deal, not only for archivists but also legally. Article 15 of the GDPR obliges companies to provide a copy of collected data to users (i.e. data subjects) upon request. Outside of Europe, California’s CCPA has a similar clause protecting the right to know.

There are repercussions for not respecting these rules. Recently another messaging app, Discord, was fined 800,000 Euros by the French regulator CNIL for failing to respect data retention periods and the security of personal data. That was actually a reduced fine, given Discord’s conciliatory attitude. If Twitter does not up their game, they may meet a similar fate, if not a worse one.

Now that I have your attention, I would like to direct it to the help page on how to request a personal archive from Twitter: https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive . Even if the process is a bit unstable, this is what you need to follow to save a copy of your Tweets.

The Twitter archive is big and burly, but not perfect. Johan van der Knijff recently wrote a blogpost on some shortcomings, such as the t.co URL shortener, and some workarounds: https://www.bitsgalore.org/2022/11/20/how-to-preserve-your-personal-twitter-archive
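If you want to expand the t.co links in your own archive, a short script can follow each redirect and record the original URL. Here is a minimal sketch in Python using the requests library; t.co normally answers with an HTTP redirect whose Location header holds the long URL, and the example link below is a made-up placeholder:

    import requests

    def resolve_tco(short_url):
        # Ask the shortener for the redirect target without
        # downloading the destination page itself.
        resp = requests.head(short_url, allow_redirects=False, timeout=10)
        # The original long URL is (usually) in the Location header.
        return resp.headers.get("Location", short_url)

    # Hypothetical example link, for illustration only:
    print(resolve_tco("https://t.co/abc123"))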


Oh, and by the way, it gets worse: Elon Musk has also stated an interest in purging inactive accounts and their Tweet history.

Archive Snapshot: https://archive.ph/hcKsV

This might not seem like a big deal, except to the one or two of our readers who periodically scrape politician accounts off of https://ballotpedia.org. Yet it is actually a serious turning point. Currently, Twitter does not purge inactive accounts, except in the event of death or incapacitation and by special request.

In 2019 there was an attempted Twitter policy change to expire accounts which had not been logged into for 6 months. This sparked outrage across the platform from those who saw it as unfair to the memory of inactive accounts. In particular, fans of deceased K-Pop artist Kim Jong-hyun, otherwise known as Jonghyun (김종현/종현), came to the defence of his legacy, overturning the attempt altogether. Going back on that decision would go against all of that heritage: people’s heritage, Twitter’s heritage, web heritage. Alas, this is the projected course of things; even if we cannot prevent it, it is perhaps our duty to protest that it is wrong.


What about the extreme scenario of a total collapse of Twitter? What does that mean for web history? Well, the good news is that people have been thinking about this since long before this year.

Already in 2010 the Library of Congress announced that they would be copying the entire internal archive of Twitter, starting from March 2006.

Archive Snapshot: https://web.archive.org/web/20161208074132/https://twitter.com/librarycongress/statuses/12169442690

There are also many smaller grabs on the Internet Archive and archive.today, some of which you have seen linked above. Special mention goes to Archive Team’s periodic Twitter Stream archive.

Last but not least, you can help! The Internet Archive is collecting Tweet dumps from people as we speak: https://archive.org/services/wayback-gsheets/archive-your-tweets . Whether you just want extra insurance for your back-up, or want to contribute to the wealth of the web, you can use the above tool to upload your Tweets to the Internet Archive for generations to come.

Pulling Rank: The Legacy of Alexa Internet (Fri, 29 Apr 2022)
https://datahorde.org/pulling-rank-the-legacy-of-alexa-internet/

Alexa Internet and the Internet Archive, two seemingly unrelated entities, have been partners ever since their inception. Alexa’s sunset, scheduled for 1 May 2022, is therefore also a loss for the web archiving community. As a small send-off to Alexa, here is the story of two twins who grew apart together.


Today, the internet has become such a big part of our lives that it’s hard to imagine a time without it. Yet only 30 years ago, the internet was hardly accessible to anyone. Not in the sense that it wasn’t affordable; rather, what could be called the internet wasn’t very interconnected. You had separate networks: ARPANET, which was heavily linked to the US’s military-industrial complex; FidoNet, a worldwide network connecting BBSs; USENET, newsgroups mostly adopted on university campuses… Each network had a particular use-case and was often restricted to a particular demographic. It wouldn’t be until the vision of an “open web” that a common internet would emerge.

In the early 90s, many disillusioned DARPA contractors began leaving ARPANET on an exodus to San Francisco, synergising with the city’s pre-established tech ecosystem. Maybe it was the advent of new protocols such as Gopher and the World Wide Web. Perhaps it was the growing Free Software Movement. Not to mention the gravitation towards the technology clusters of Silicon Valley or the Homebrew Computer Club. It was more than happenstance that California, and the San Francisco Bay Area in particular, had become home to a lot of network engineering experts.

The tricky question wasn’t how to get the internet to more people, it was how to do it the fastest. Many small companies, startups, and even NGOs popped up in San Francisco to address the different challenges of building a massive network: from laying wires for infrastructure, to law firms dealing with the bureaucracy. Of course, there were also companies tackling the software problems on top of the hardware.

Alexa Internet Logo (1997)

One such company was Alexa Internet, founded by Bruce Gilliat and Brewster Kahle. Alexa started as a recommendation system, to help users find relevant sites without having to manually search for everything. On every page, users would get a toolbar showing them “recommended links”. You may think of these recommended webpages like suggested videos on YouTube or songs on Spotify. Alexa was “free to download” and came with ads.

Those recommendations had to come from somewhere, and Alexa wasn’t just randomised or purely user-based. Their secret was collecting snapshots of webpages through a certain crawler named ia_archiver (more on that later). This way they were able to collect stats and metrics on the webpages themselves, over time. This is how Alexa’s most well-known feature, Alexa Rank, came to be: which sites are the most popular, in which categories, and when? Over time, this emphasis on web analytics became Alexa’s competitive advantage.

Alexa was a successful business that only kept growing, but founder Brewster Kahle had something of an ulterior motive. He was also in the midst of starting a non-profit organisation called the Internet Archive. ia_archiver did, in fact, stand for internetarchive_archiver. All the while Alexa was amassing this web data, it was also collecting it for long-term preservation at this up-and-coming Internet Archive. In fact, one can tell the two were interlinked ideas from the very start, as the name Alexa was an obvious nod to the Library of Alexandria. At one point, Alexa (not the Internet Archive) made a donation of web data to the US Library of Congress, as a bit of a publicity stunt to show the merit of what they were doing.

[For your video], there is this robot sort of going and archiving the web, which I think is somewhat interesting towards your web history. It’s a different form. You’re doing an anecdotal history. The idea is to be able to collect the source materials so that historians and scholars will be able to do a different job than you are now.

Brewster Kahle, teasing his vision for the Internet Archive in a 1996 interview by Marc Weber (Computer History Museum). Fast-forward to 31:53 in the video below.
Tim Požar and Brewster Kahle CHM Interview by Marc Weber; October 29 1996.
Mirror on Internet Archive: https://archive.org/details/youtube-u2h2LHRFbNA

For the first few years, Alexa and the IA enjoyed this dualistic nature: one side the for-profit company, the other a charitable non-profit, both committed to taking meta-stats on the wider internet. This came to a turning point in 1999, when Amazon decided to acquire Alexa Internet (not the smart home product) for approximately US$250 million. Alexa needed growth and the IA needed funding, so it was a happy day for everyone, even if it meant that the two would no longer act as a single entity.

Kahle left the company to focus on the IA, and former partner Gilliat ended up becoming the CEO of Alexa. An arrangement was reached so that even after the acquisition, Alexa would continue donating crawled data to supply the Internet Archive. Their collaborator Tim Požar, who you might recognize from the ’96 interview above, would remain at Alexa for some time as a backend engineer. Much of what Požar did was ensure that Alexa’s crawled data continued to be rerouted to the Internet Archive. Many of these data dumps are now visible under the IA’s Alexa crawls collection.

Starting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period.

Afterwards, the IA and Alexa went their separate ways. The Internet Archive expanded to non-web digital collections as well. Books, in particular. The web archive part was dubbed the Wayback Machine.

By 2001, the Internet Archive was no longer a private collection but was made open to the public for browsing. The Internet Archive really lived up to its name and became the de facto hub for archiving on the web. Ever since, the IA has continued to attract not only readers, but also contributors who keep growing the collections.


As for Alexa, Amazon’s bet paid off, as they dominated web analytics for years to come. Alexa rankings became the standard metric when comparing web traffic, cited, for example, on Wikipedia. Alexa listed some public stats free to all, but remained profitable thanks to a tiered subscription system. If you needed to know the 100 largest blog sites in a given country, Alexa was your friend. Then you could pay a few dollars extra to find out which countries were visiting your competitors the most. Alexa was great, so long as you were interested in websites.

Alexa was born in a very different web: a web of sites. Yet today’s web is a web of apps. Social media, streaming services… The statistics of this web of apps are kept by centralised app markets such as Google Play and Apple’s App Store. Alexa tried to adapt; for example, they changed traffic stats to be based less on crawl data across the entire web and more on shares posted to Twitter and Reddit. Sadly these changes were not impactful enough to save Alexa from obsolescence.

(Google Trends graph showing the rise and fall of Alexa Rank; alternative link.)

Amazon telegraphed their intent to either adapt or shut down by gradually dropping features over the past few months. For example, they replaced Browse by Category with a narrower Articles by Topic. Finally, the service closure was announced in December 2021.

So what will happen now? The closing of Alexa is different from most shutdowns because it’s not only the loss of data itself, but of a data stream. Alexa was, indeed, at one time a web-crawling powerhouse, yet it was no longer uncontested. We still have, for example, Common Crawl, whose corpus, interestingly, is hosted on Amazon’s cloud. As for the Internet Archive, they have many partners and collaborators to continue crawling the web, so they won’t be alone.

Alexa was also valuable in its own right. Though there are new competitors in web analytics, you won’t see many investigating global or regional popularity, or different categories; there aren’t very many services interested in overall web traffic, as opposed to single-site analytics. On top of this, Alexa ran for 25 years. That’s a quarter of a century of historical data on which sites rose and fell, available almost nowhere else. Almost.

Just as Alexa helped the Internet Archive grow, from this point on the Internet Archive shall reciprocate by keeping the memory of Alexa alive: not just the sites crawled by Alexa, but also snapshots of the public statistics Alexa gathered.

If you have an Alexa account, you can also help! Users can export Alexa data by following the instructions here! You can bet any and all data would be very valuable, either on the Internet Archive or elsewhere. Please make sure you act quickly, as there isn’t much time left until May 1.

Community Spotlight: URLTeam (Mon, 01 Nov 2021)
https://datahorde.org/community-spotlight-urlteam/

Who are they?

URLTeam is an arm of Archive Team, solely dedicated to collecting shortened URLs.
It is unusual to see a long-term archiving or preservation project; once a collection or a grab is completed, that’s that. Yet URLTeam, who have taken on a task with no apparent end date, have endured for over 10 years, growing into a community in their own right.

What do they do?

Circa 2009, Scumola of Archive Team noted how shortened URLs had proliferated on a little website called Twitter. Twitter was then, and still is now, infamous for its character limits. To free up space, users began sharing shortened links, and as other users discovered this trick, it only spread. This led to a paradigm shift in the web ecosystem: links became a lot more unrecognizable, both to referrers and referees.

Theft Quadrants (https://xkcd.com/1698/)

Archivists, too, were vexed. The traditional approach to web archiving had been to target a particular domain or subdomain: URLs which followed a pattern. Now how could they expect to save posts from blogs, forum threads or stories from news sites, if URLs were coming to them from TinyURL or bit.ly? Thus, URLTeam was born out of an effort to catalogue said short URLs.


Short URLs are loose connections. You cannot actually shorten a domain that someone else has registered; that is to say, one cannot rename google to gugle.

The secret that URL shorteners employ is to generate short URLs on their own (or third-party) servers, which can be made to redirect to longer URLs. When someone connects to the shortener’s server with the shortened address, it simply redirects them to the associated full address in its database. So any time you visit a short link, you are going to be visiting (at least) two websites.

Old links dying or websites shutting down is a given, yet adding more redirects lengthens the chain to your final destination. As the saying goes, a chain is only as strong as its weakest link. Should the URL shortening service shut down, all of its short URLs will break, even if the actual sites being redirected to are still online. That is reason enough to hoard short URLs, and that is what URLTeam does.
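To make that fragility concrete, here is a toy sketch in Python of what a shortener boils down to: a lookup table living on someone else’s server. All of the names and URLs here are invented for illustration.

    # A toy model of a URL shortener's core: a table mapping
    # short codes to full destination URLs.
    redirects = {
        "abc123": "https://example.com/a/very/long/path/to/an/article",
    }

    def follow(short_code):
        # Visiting https://sho.rt/abc123 triggers an HTTP redirect to
        # whatever this table says. If the shortener's database goes
        # offline, the lookup fails even though example.com may still
        # be up; the short link is then permanently broken.
        return redirects.get(short_code)

    print(follow("abc123"))  # the real destination
    print(follow("zzz999"))  # None: a dead short link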

How do they do it?

URLTeam’s approach is to decompose the problem. Even if we cannot possibly crawl every link shared on the internet, nor every final redirect, at any given time there are going to be a relatively small number of URL shortening services. So URLTeam begins by hunting for said URL shorteners. If we can recognize ow.ly or goo.gl links, that’s a start.
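Recognizing shortener links in crawled text is mostly pattern matching against a list of known shortener domains. A rough Python sketch, with an abbreviated, hypothetical domain list (URLTeam tracks far more):

    import re

    # Abbreviated list of known shortener hosts, for illustration.
    SHORTENER_HOSTS = ["bit.ly", "ow.ly", "goo.gl", "tinyurl.com", "t.co"]

    SHORTENER_RE = re.compile(
        r"https?://(?:" + "|".join(map(re.escape, SHORTENER_HOSTS)) + r")/\S+"
    )

    page_text = "See https://bit.ly/3abcde and https://example.com/full-post"
    print(SHORTENER_RE.findall(page_text))  # ['https://bit.ly/3abcde']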

Once a new shortening service is added, the next step is crawling the web for short URLs. Since URL shorteners are almost universally linkable on any site, this has to be a very broad crawl, akin to a crawl for building a search engine. This brings us to the URLTeam Tracker, which oversees the distribution of crawling jobs to hundreds of volunteering archivists. You can even see live stats on their page!

Collected links are finally shipped to the 301works collection on the Internet Archive. The 301works collection also houses link databases donated by URL shortening services, so if you happen to own a URL shortening service at risk of shutting down you might want to get in contact with them.

How do I sign up?

Communication happens on #[email protected]. General info can be found on their Wiki page.

If you want to hop right in, you can simply download and run an Archive Team Warrior, and then select the URLTeam project. You can also run the project using a Docker container by following the instructions here.

Now if you’ll excuse me, I have to cut this short so I can cut some shortcuts.


Looking to discover other archiving communities? Just follow Data Horde’s Twitter List and check out our past Community Spotlights.

Help Archive Team Archive public Google Drive files before September 13! (Sat, 11 Sep 2021)
https://datahorde.org/help-archive-team-archive-public-google-drive-files-before-september-13/

On September 13, Google is going to start requiring longer URLs to access many Google Drive files, breaking links to public files across the web unless users opt out! Because of this, Archive Team has launched a project to archive as many publicly-available Google Drive files as possible and make them accessible on the Internet Archive Wayback Machine. (Note that video files are not included at this time due to their size.)

You can help! Simply follow the steps to download and run an Archive Team Warrior, and then select the Google Drive project. (You can also run the project in a Docker container, with atdr.meo.ws/archiveteam/google-drive-grab as the image address.)

Additionally, people with lists of public Google Drive file URLs are encouraged to share them so they can be archived.

In order to stay up-to-date with the project and be reachable in case of an issue, project contributors are encouraged to connect and stay connected to the project discussion channel, #googlecrash on irc.hackint.org, also available through webchat.

Archiving progress statistics for this project are available on the Archive Team project tracker, and source code is available on GitHub.

Thousands of Classic Winamp Skins on the Internet Archive! (Wed, 04 Aug 2021)
https://datahorde.org/thousands-of-classic-winamp-skins-on-the-internet-archive/

Remember Winamp, the old music player? Built-in equalizer, user-made skins and an unholy amount of ad spam? If yes, then you will absolutely adore the Winamp Skin Collection on the Internet Archive!

One thing sadly lost in the transition from desktop to mobile was user customizability. Winamp, in particular, had a community dedicated to designing and sharing custom skins. Though Winamp has been obsoleted to some extent, the aesthetic in these skins is certainly a lot more timeless. Preserving that artistic value is the Winamp Skin Collection, with over 72 thousand community-made Winamp skins waiting for your viewing pleasure on the Internet Archive.

Skin: V2 Technics White Resting by johnnyg0
Song: Puppies’ Space Station by Saria Lemes

The best part? You can run it from your browser, even on your smartphone! The skins in this collection are set up to load Webamp, an open-source HTML5 rewrite of Winamp 2.

From stereo-themed skins to futuristic devices, and from game fan-art to original designs, you’re certain to find the perfect skin for all of your llama-whipping needs! Also be sure to check out the Winamp Skin Museum and the Butterchurn Visualizer!

How to recover your Yahoo! Groups from the Internet Archive (Mon, 31 May 2021)
https://datahorde.org/how-to-recover-your-yahoo-groups-from-the-internet-archive/

Yahoo! Groups, once upon a time a hub to many online communities, was shut down in 2020. Yahoo! Groups used to host mailing lists going as far back as 1997, and perhaps you may have once been a part of it yourself. Users were offered a Get Your Data tool to download their messages and other data prior to the shutdown, but many people were unable to respond on such short notice.

Thankfully, owing to the efforts of the Save Yahoo Groups Project and Archive Team, the data of many groups has been preserved. If you missed out on the GYD tool, you might still be able to retrieve your groups’ data by following the steps below.


To begin, can you remember your group’s name? If yes, the following steps will go by a lot faster; if not, you might want to make a list of potential names. Was the name of your group Fireflylovers, or Firefliers, or LoversofFF? Write down all likely candidates.

For demonstration’s sake let’s search for data on NFforKids, a non-fiction writing group.

Let’s perform a metadata search to see when NFforKids was started. Head over to the Yahoo Groups Metadata Collection page on the Internet Archive. Ignoring the no-preview warning, either click on Show all files or scroll down until you see DOWNLOAD OPTIONS on the right side of the page.

Click on COMMA-SEPARATED VALUES to reveal a list of files. Since NFforKids starts with an N, if it does exist, it will be indexed under master_N.csv. Download this CSV file to your device.

You can now open this CSV file using Excel or another spreadsheet program. Search for NFforKids to find the corresponding information row. What do you know? NFforKids was started on 11 June 2000. You can scroll across this row to find the group’s primary language, the category of the group, whether the group was public, and more!
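If the spreadsheet feels unwieldy (these master files are large), the same lookup is a few lines of Python. A sketch that assumes the group name appears in the first column; check the header row of the actual file, as the column layout may differ:

    import csv

    def find_group(csv_path, group_name):
        # Scan the master CSV for the row describing the given group.
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                # Assumption: the group name sits in the first column.
                if row and row[0].lower() == group_name.lower():
                    return row
        return None

    print(find_group("master_N.csv", "NFforKids"))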

If you weren’t able to find metadata on your group, it’s time to pull up that list I told you to make above. Fall back to the other candidates and try another name. If the first letter (or two) of this second name is different, you will need to download the corresponding CSV file before resuming your search.

Please note that while the Yahoo! Groups collections on the Internet Archive are thorough, they are NOT exhaustive. It is entirely possible that data on your group might have been missed. That being said, the metadata collection sports a whopping 1.1 million groups. Even if you weren’t able to find your group in the first round, it is quite possible that you simply misremembered the name, so keep on trying!


Once you have confirmed the name of your group, and that it has been catalogued in the Metadata Collection, you can then download the corresponding TAR file, which contains even more details. Again, if we’re looking for a group called NFforKids, we’ll be looking for the first two letters: that’s NF.tar for NFforKids.

If you’re on Mac or Linux, you should be able to open this .tar file to reveal a folder titled media. If you’re on Windows, you can use 7-zip to open it. This TAR file contains the same information as the CSV, plus additional details: did the group have spam filtering, was media sharing allowed, or was the group text-only? You might even find the URLs for group images, although unfortunately most of those links are now dead.
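On any platform with Python installed, the standard library can also unpack the TAR for you; a quick sketch:

    import tarfile

    # Extract NF.tar into a folder named NF; the 'media' directory
    # inside holds the per-group metadata files.
    with tarfile.open("NF.tar") as tar:
        tar.extractall(path="NF")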

The Cover for the Star Trek: New Frontier Fanfiction group, one of few group covers preserved in the metadata collection.

Stats are fine and dandy, but what about messages and activity? If your group was restricted, tough luck: you’ll need to find a member who made a GYD copy before the shutdown. This is where our luck with NFforKids runs out, seeing as the group’s chats were not public. For the final step, let’s switch to a public group whose history is visible. We’ll go with nfwritersontheirwayup. Messages in this group were visible to all subscribers, so archivists were able to grab its contents.

Raw data collections are stored in assorted, non-alphabetic batches. To see if a group has its raw data available on the Internet Archive, simply query subject:"yahoo groups" nfwritersontheirwayup. If you get any results, your group’s raw data is most likely located here. You can double-check the item description to be sure that nfwritersontheirwayup is indeed included in the batch.
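The same query can be scripted with the internetarchive Python library, which is handy if you are hunting for several groups at once. A sketch:

    from internetarchive import search_items  # pip install internetarchive

    # Find Internet Archive items whose metadata mentions the group.
    query = 'subject:"yahoo groups" nfwritersontheirwayup'
    for result in search_items(query):
        print(result["identifier"])  # item(s) likely holding the raw data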

Pop open the WEB ARCHIVE GZ download option from the left side of the page. Scroll down until you see nfwritersontheirwayup.bcqkJvN.warc.gz and proceed to download. To unpack this gzip you can use the gzip -d nfwritersontheirwayup.bcqkJvN.warc.gz command on Unix systems, or good old 7-zip on Windows.

Last but not least, you’ll need a WARC viewer. If this is your first time with WARCs, replayweb.page is very straightforward and runs right out of your browser. Simply upload the WARC contents of the group and voilà, you can now navigate through the group’s chat logs.
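If you prefer to inspect the WARC programmatically instead, the warcio library can iterate over its records (it reads the .warc.gz directly, so no prior gzip -d is needed). A minimal sketch:

    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    # List every captured URL in the group's WARC file.
    with open("nfwritersontheirwayup.bcqkJvN.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))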


Recovering your Yahoo! Groups from yesteryear is as simple as that. Got any questions? Or perhaps you have made some worthwhile discoveries while group hunting? Comment below!

Help Archive Team Save Yahoo! Answers! (Thu, 22 Apr 2021)
https://datahorde.org/help-archive-team-save-yahoo-answers/

Yahoo! Answers is shutting down on May 4th, 2021, taking nearly 15 years’ worth of content with it!

Archive Team is trying to save as much of it as possible, and you can help!

By setting up the Archive Team Warrior and letting it run in the background, you can back up questions and answers from Yahoo! Answers and make them available in the Internet Archive Wayback Machine. The Warrior is easy to set up, uses very few of your system resources, and can work on up to 6 items concurrently.

Advanced users can also run the project with Docker using the atdr.meo.ws/archiveteam/yahooanswers-grab image, which can easily be deployed on large networks and allows for running the project at a higher concurrency per container (a maximum of 20 concurrent items, though users running with that many concurrent items might be rate-limited by Yahoo!).

If you need any help or have any questions about the project, please feel free to refer to the project page on the Archive Team Wiki or ask in Archive Team’s IRC channel for the Yahoo! Answers project. (Please be patient and stay connected if your question isn’t immediately answered so you don’t miss any responses.)

Internet Archive Scholar: A Collection of 25 Million Publications (Tue, 09 Mar 2021)
https://datahorde.org/internet-archive-scholar-a-collection-of-25-million-publications/

We all know that finding scholarly articles is a pain, given that a good chunk of them are behind some sort of paywall, or require some sort of account, or are straight up just gone.

…hrm, “straight up gone”, you say? Well, if there’s one place to check for straight up gone stuff, it’s the Internet Archive. And wouldn’t you know it, there’s a new project from the Archive: Internet Archive Scholar. Fresh with a very 90s VHS-style logo, Internet Archive Scholar lets you search up any of the scholarly papers stored on the massive archive at your convenience.

Did you know that every year, 100,000 or so publications are “publish”ed with no back-ups? Stat source: Fatcat’s coverage visualization tool.

IA Scholar started as an Archive Lab project built on Fatcat, a publicly editable catalogue of open publications. IA Scholar had been in closed beta since September 2020 and, having now matured, it officially enters open beta today with a grand announcement you can read on the IA’s blog.

Be sure to also check out Archive Lab for other crazy projects supported by the IA. Have a look at GifCities, the GeoCities-era GIF search engine; it even has a Twitter bot that tweets GIFs periodically.

Community Spotlight: The Hidden Palace (Sat, 30 Jan 2021)
https://datahorde.org/community-spotlight-hidden-palace/

Who are they?

The Hidden Palace is a group of video game preservationists who hunt down video game prototypes, cut features and other game development media. You could say they are out to find out how games change throughout their development cycle, and which elements and mechanics actually make it to the final product.

The name comes from the Hidden Palace Zone from Sonic 2, an unused area in the original release.

What do they do?

For about 15 years, the Hidden Palace has amassed a collection of over 1000 development builds for various games on a multitude of systems. A good portion of these have been mirrored on the Internet Archive, where you can try them out for yourself via MAME.

More than that, the Hidden Palace is about analyzing differences between different game builds, that is to say different stages of development. The Hidden Palace also frequently cooperates with The Cutting Room Floor to document features which have been cut from the final release. Take a look at their recent joint-update on the elusive Sonic 1 Mega Drive Prototype!

How do they do it?

It’s one thing to hunt down an obscure product, but where do you find a game that was never really released? Good candidates are developers or testers who may have had access to earlier versions of the game. Next come hobbyists or other preservationists who may have acquired a prototype from the above options. At this stage it’s likely that a prototype will go up for auction.

In any case, contributors to the Hidden Palace ship prototypes they have come into possession of, so that they may be dumped and/or scanned. If you are interested in contributing yourself, get in touch with [email protected] and also have a look at their contribution page (they really value your confidentiality).

How do I sign up?

Even if you can’t travel to the Hidden Palace’s preservation studio yourself, there is a lot you can do to help. Just join the Wiki; there are plenty of to-dos to fill out!

Or if you would prefer talking and meeting with people, perhaps the Hidden Palace Discord Server is for you!

Then what are you waiting for? Go forth, and discover the next Hidden Palace Zone, today!


Looking to discover other archiving communities? Just follow Data Horde’s Twitter List and check out our other Community Spotlights.

The Open Library Explorer! A new way to browse the Internet Archive (Fri, 25 Dec 2020)
https://datahorde.org/the-open-library-explorer-a-new-way-to-browse-the-internet-archive/

Are you looking for a change of pace this holiday season? How about some reading? Now I’m sure you are all trying to fill up your progress bars on Goodreads, but here’s a challenge: read a book you have never heard of before!

This might not be the best time to go to a bookstore or library, but as you have probably heard, the Internet Archive is a great virtual library in its own right. Only sometimes browsing books on the Internet Archive can be a real chore; just try finding a good fiction book, what a nightmare! So, in an attempt to emulate the wonder and joy of browsing through the rows and shelves of a real library, the Internet Archive has recently introduced the Open Library Explorer.

Not only is the OLE more visually pleasing and easier to navigate, it is also very customizable. Want to travel through time? You can apply filters to ensure all books are from a specific time period. Want to rearrange the categories? You can switch between Dewey Decimal Classification and Library of Congress Classification at the push of a button. You can even switch the book layout to replace the front covers with spines for added realism. But don’t take my word for it, watch this guide made by Drini Cami, the head developer of the OLE!

Drini Cami is a Computer Science undergraduate at the University of Waterloo, interested in Human-Computer Interaction. Drini himself is an avid book-reader, funnily enough. I can imagine this is a project he’s deeply invested in.


Then what are you waiting for? Go on and explore the Open Library for yourself! As for me, I will be reading Find it Fast; readers who know me personally are more than familiar with my obsession with going out of my way to hunt down the most bizarre trivia. Here’s hoping this will help me hone my craft!

Have some good fun, watch yourselves, and Happy Holidays to all!
