preservation – Data Horde (https://datahorde.org)

Twitter to Begin Purging Inactive Accounts
Sat, 10 Dec 2022 | https://datahorde.org/twitter-to-begin-purging-inactive-accounts-10-12-22/

Edit (May 8, 2023): Elon Musk has announced that the account purge has come into effect. Read on to learn what you can do if this affects you.

Archived Snapshot: https://archive.md/PYJ2E

Yesterday, Twitter CEO Elon Musk announced a purge of inactive accounts. Musk cited the reasoning as “freeing [up] the name space” for users who might want a new handle. Musk then went on to assure users that these “1.5 billion accounts” would be accounts which have not tweeted or logged in for the last few years.

Archived Snapshot: https://archive.ph/9tEMm

This is very much an expected move. You might recall from our blogpost a while ago that Musk had expressed interest in purging accounts early in November. It should also be noted that Twitter’s previous management had attempted a similar policy change in 2019, to expire accounts which had not been logged into for 6 months. That policy change failed due to outrage and protest across the platform.

Archived Snapshot: https://archive.ph/hcKsV

Well then, should you be worried? Probably not for your own account. Some have expressed concern over the accounts of loved ones or deceased celebrities, which more easily fit these criteria. If you have such concerns, we can recommend a useful utility from our friend JustAnotherArchivist, called snscrape. snscrape allows you to save public tweets from accounts on Twitter. It also works for a few other websites like Facebook and Reddit.

The code is available at https://github.com/JustAnotherArchivist/snscrape. If you have Python 3 installed, all you need to do is:

pip3 install snscrape

From your terminal or command prompt, the following command will save a local archive of Elon Musk’s tweets:

snscrape --jsonl twitter-user elonmusk > muskytweets.json

And if you want another account, just substitute the username and a file to save to:

snscrape --jsonl twitter-user RetroTechDreams > RTD_tweets.json
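The --jsonl flag writes one tweet per line as a JSON object, which makes the dump easy to post-process. As a rough sketch (field names like date and content are assumptions and may vary between snscrape versions), here is how you could count and preview the saved tweets in Python:

import json

# Read the JSONL dump produced by snscrape (one JSON object per line).
with open("muskytweets.json", "r", encoding="utf-8") as f:
    tweets = [json.loads(line) for line in f if line.strip()]

print(f"Saved {len(tweets)} tweets")

# Preview the first five entries.
# NOTE: "date" and "content" are assumed field names; check your snscrape version.
for tweet in tweets[:5]:
    print(tweet.get("date"), "-", (tweet.get("content") or "")[:80])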

I might write a more detailed tutorial on snscrape if people are interested.

But for the time being, spread the word! Save whichever endangered accounts are valuable to you and be sure to tell all of your friends about snscrape.

Archive95: The Old Man Internet
Thu, 21 Jul 2022 | https://datahorde.org/archive95-the-old-man-internet/

The internet is kind of old. To be fair, so is the Internet Archive and its Wayback Machine. But IA isn’t older than the internet (how could it be?), so there are some things that could slip through the cracks. Things before its founding in 1996, for example.

Then along comes Archive95, an archive of the pre-IA internet of 1995. It primarily draws on two sources, the World Wide Web Directory and the German-language Einblick ins Internet, to give an impression of an era when the web was small and monitors were bulky as heck.

– glmdgrielson, a young whippersnapper

Interview with Hubz of Gaming Alexandria
Mon, 18 Apr 2022 | https://datahorde.org/interview-with-hubz-of-gaming-alexandria/

Hello, here’s another interview, this time with our head overlord Hubz of Gaming Alexandria.

glmdgrielson: So, first question, what is Gaming Alexandria?
Hubz: At its core it’s both a Discord community and a separate website dedicated to preserving various aspects of video games, such as scans, interviews, unreleased games, YouTube videos, etc. It mainly started as a site where I could share high quality scans but has grown thanks to many people joining up with various skills to help expand the website. The Discord community is really an entity unto itself at this point, where lots of gaming historians/preservationists have come together to share their works and also help each other out when needed with various projects. I love getting to see all the passion in everybody’s projects that they put forth and the willingness of the community to offer help when asked.

g: Tell me more about this community. I’m active in the server, but what does it look like from your end?
H: From an admin standpoint I have access to all the channels, which include the private #staff and #mods channels where we discuss upcoming articles or projects for the site as well as handling the occasional argument or bad apple in the chat. Dylan Mansfeld (DillyDylan) handles a lot of great articles on undumped/prototype games that were previously unreleased. Ethan Johnson writes for his own blog (https://thehistoryofhowweplay.wordpress.com/) and Gaming Alexandria at times and is our editor, so he glances through and cleans up all the articles that get posted. Jonas Rosland, who is the Executive Director of Hit Save (https://hitsave.org/), an NPO I’m a board member of, does a lot of thankless technical work behind the scenes, including a NAS he has set up not only for the staff of the website to store project files but for the community at large, which is a huge help. Wietse van Bruggen (Densy) handles a lot of the moderation of the chat and has been a huge help keeping the Discord community friendly and clean with his balanced moderation style. Last but not least there is Stefan Gancer (Gazimaluke), who did the original site redesign and has been a great idea man for ways to improve the site and community as time has gone on. For me personally, I try to keep up with all the chat in the channels (though it can be tough at times!) just to have an idea of what’s going on and to see what I can help with or connect people to further projects, as well as post my scans and projects as they’re completed. Thanks to the rest of the staff I rarely have to step in and moderate, which is very nice!

g: I’m going to skip over the omission of Norm and ask about the history of how the site has evolved.
H: LOL yes Norm is a menace to society and must be stopped.

Editor’s note: Hubz has a mock rivalry with Norm, a.k.a. the Gaming Historian, which is a frequent running gag on the server. I do not believe there is actual malice.

The website itself started officially on October 23rd, 2015 and was just a basic text website that I could easily upload to in order to share my scans; it was very barebones. The reason I wanted to get high quality scans out was due to using an emulator frontend called Hyperspin. For popular systems it had a lot of decent quality artwork for boxes. But for lesser known systems it was sorely lacking, and that triggered my OCD and made me realize that scanning stuff in high resolution was something that needed to be done. Slowly but surely, I met others, such as Densy, who wanted to scan in high quality and have their stuff hosted, and they would submit scans. At some point I got involved with the VGPC discord and met Kirkland, who had been quietly doing something similar with his collection, and collaborated with him and others on establishing scanning standards to use going forward, so there would be some level of consistent quality among those willing to follow them, which eventually led to what is now the https://scanning.guide/. In late 2018 the site was graciously redone by Gazimaluke and relaunched in the design you see now. We started branching out into actual articles written by our staff and releasing prototypes and unreleased games that we came across. The site continues doing this to this day, though we are branching out into more guest authors from the community posting interviews and articles as well in the near future.

g: As well as hosting my site, for which I am grateful. So, what is the day-to-day like for you?
H: On the scanning side I try to get at least one magazine done daily. Doesn’t always happen but, in general, I debind a magazine the night before, then in the morning scan it in before leaving for work. If work gets slow I work on processing the scans, or else I’ll do it later that night and get them uploaded to the site and the Internet Archive.

g: Interesting. So how big do you think your archive is by this point?
H: Archive upload-wise I’m probably right around 2900 items if you count stuff that was removed lol. Then there’s a bunch on the site that wasn’t done to the higher scanning standards I go by now that’s not on the archive. So I’d guess in the 3000-4000 item range currently.

g: Do you know how big it is in terms of filesize?
H: Let me see real quick…
Looks like 2.5TB which is another reason I’m so thankful to have the Internet Archive to host my scans on due to the space and bandwidth that would be required otherwise.
The site alone usually has about half a TB of traffic per month so I can only imagine what it would be like if the magazine scans were also hosted directly on it.

g: Neat. Is there anything interesting that you got to be a part of due to GA that you would like to share?
H: Biggest thing is probably working with The Video Game History Foundation on scanning their extensive magazine collection so digital copies can be provided along with physical copies at their library. Being able to leverage the Internet Archive so people all over the world can easily access the magazines I’ve scanned that they might not have been able to easily otherwise is a great feeling personally for me. So many of these things are quite difficult to acquire and expensive as time goes on so having them as an ally in the preservation world is a godsend. There’s been lots of other connections and other projects I’ve worked on as well but I won’t ramble forever on that. Not only is Gaming Alexandria a tight community that likes to help each other out but there’s plenty of other preservation groups like VGHF, TCRF, and Hidden Palace just to name a few and we all get along great and try to push preservation forward together.
There’s so much work that needs to be done that we need all the help we can get and we need to support each other any way we can I think.

g: True that. Last question for now: anything that you would recommend to a would-be archivist?
H: I think it’s a good idea to preserve what interests you, which seems to go without saying, but I mean it more from a sense of not only going after what is popular. While you might not get much fanfare initially for the more obscure stuff it’s likely you’ll be the only one doing it and it’s important it’s being done. If you do good work for long enough it will get noticed, and to make good work easier it’s best to go with what you’re passionate about. The other thing I would suggest is not beating yourself up or comparing your output to others. Do what you can when you want to, this is a hobby after all. If you make yourself miserable trying to do something your output will naturally suffer or you might even burn out and stop altogether. Like I said before, we need all the help we can get, so try to avoid that if at all possible.

g: Thank you for being here, overlord Hubz. It’s been good talking to you.
H: No problem! Thanks for the interview. 🙂

– glmdgrielson, being a very good minion interviewer

Stuck in the Desert, or Video Strike Team
Mon, 28 Feb 2022 | https://datahorde.org/stuck-in-the-desert-or-video-strike-team/

This is an interview with Sokar, of the Video Strike Team, conducted over IRC. The VST is an archival group of a rather small scope: preserving a particular stream, Desert Bus For Hope.

Desert Bus For Hope is a yearly charity stream, running under the premise that the more money is received, the longer the stream goes on, and the more the organizers have to play the dullest video game imaginable. So dull, in fact, that Desert Bus has never been officially released. This year’s fundraiser gave us a stream that is exactly an hour under one week: 6 days and 23 hours! So this was a very long stream with a lot of data to preserve. Here follows the story of how that happens.

Note: DBx refers to the iteration of Desert Bus for Hope. For example, this year, 2021, was DB15. Also, I have only minimally modified our interview, by adding in links where applicable and making minor spelling corrections. 

glmdgrielson: So first off, outside of the VST, what are you up to?

Sokar: I do video editing and Linux server security / software support, and various other (computer related) consulting things for “real work”.

g: So you started off with just the poster for DB6, according to the site, correct? How did that work?

S: We didn’t actually start doing the interactive postermaps till DB8, then I worked backwards to do all the previous ones (still not done).
The VST itself started formally during DB6.

g: That’s when Graham contacted MasterGunner, who presumably contacted you, correct?

S: Tracking the run live in some way was a confluence of ideas between me, Lady, and other members of the chat at the time. Graham knew how to get ahold of Gunner about making live edits because he was one of the people who helped with the DB5 torrent.
I honestly don’t remember how most of the DB6 VST crew was put together, it was very last minute.

g: Do you know anything about how that torrent was made?

S: The first DB5 torrent?

g: Yes.

S: Kroze (one of the chat mods) was physically at DB5 and brought a blank external HDD with him specifically for recording the entire stream. After the run, Fugi and dave_random worked together to create the torrent (with all the files split into 15-minute chunks); I wanna say the torrent file was initially distributed via Fugi’s server.
DB5 was the first time the entire run was successfully recorded.
LRR had previously toyed with the idea (DB3, but ended up doing clips instead) and steamcastle attempted to record all of DB4 but was unsuccessful.

g: And DB6 was the first year the VST existed. What was that first year like?

S: The first year was VERY short-handed: we only had 14 people, and a LOT of the “night” shifts were either just me by myself or me and BillTheCat.
We really didn’t know what we were doing; the first rendition of the DB6 sheet didn’t even have end times for events.
There was just “Start Time”, “Event Type”, “Description” and “Video Link”.
At some point we (the VST) will just re-spreadsheet the entire run, because we were so short-handed we missed a lot of things. When I went back to make the DB6 postermap I think I ended up uploading ~17(ish) new videos, because that was how many poster events weren’t even on the sheet.

g: What sort of equipment or software did you use back then?

S: We used Google Sheets (and still do, but not in the same way anymore), and then all the “editing” was done via Twitch’s Highlight system at the time, which had a checkbox to auto-upload the video to YouTube.
Then there were a few people with YouTube access that could enable monetization and other things like that.
Twitch’s Highlight editor (especially at the time we used it (DB6/DB7)) was extremely painful to use on very long VODs; there was no “seek by time”. You had to use the slider and kinda position it where you wanted and then just wait and be quick on the cut button.
We didn’t actually start capturing the run ourselves until Twitch’s overzealous VOD muting happened (2014-08-06) and we had to figure out a new way of doing things.

g: And just two years down the line, you had to start making your own tools. What was that like?

S: When that happened we had roughly 3 months to figure out what to do. dave_random put in a ton of time figuring out how to capture the run (using livestreamer, which has since been forked to streamlink). The way it worked during DB8 was that the video would get uploaded to YouTube with a couple of minutes on either side, then the video editors would go in and edit the video using YouTube’s editor.
Then we found out that there is a limit tied to YouTube’s editor: you can only have a set number of videos “editing” at once, then you get locked out of the editor for a while. We (the VST and DesertBus in general) always end up being an edge case.
MasterGunner wrote the first version of our own editor so we could edit the video before it got sent to YouTube.
The VST website itself also didn’t exist till DB9; a lot of the poster revisions archive only exists because J and I kept copies of all the revisions.
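Editor’s note: for readers curious what a bare-bones capture with streamlink (the maintained fork of livestreamer mentioned above) looks like, here is a minimal sketch using streamlink’s Python API. It is purely illustrative and is not the VST’s actual tooling; the channel URL and output path are placeholders.

import streamlink

# Look up the available stream qualities for a channel (placeholder URL).
# If the channel is offline, this dictionary will be empty.
streams = streamlink.streams("https://www.twitch.tv/desertbus")
stream = streams["best"]  # pick the highest-quality variant

# Copy the raw stream to disk in chunks; stop with Ctrl+C.
with stream.open() as source, open("capture.ts", "wb") as out:
    while True:
        data = source.read(8192)
        if not data:
            break
        out.write(data)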

g: After DB9 is when you started trying to backup the previous years, right?

S: Yea, so (internally) the VST had talked about archival problems over the years, and when Anubis169 went to DB9 (in person) to volunteer, he also went with the express purpose of grabbing as many of the Desert Bus files as he could find at the time.
When he got back home, he and I went over the files he managed to get and he sent me a copy of everything he grabbed. I also spent the time trying to figure out how uStream had stored all the DB1 and DB2 clips, then downloaded a copy of all of them.
It turned out to be a very good time to do that, since a few years later IBM bought uStream and deleted all archives.

g: So that looks to be all of the history questions I have. Now for the fun part: describe the process of archiving a Bus.

S: As in as it currently stands?
As in “how did this year work”?

g: Yes. How would the process of archival go as it currently stands?

S: well, that’s a hard one, haha

g: Not surprised, given the scope of the event we’re dealing with.

S: For old stuff: I already (also) flew to Victoria to get the missing DB3 and DB4 files, which was successful. The next time I go it will be to recover old prize data (I’m in the process of making a full prize archive).
For what we “regularly” capture setting up for a new run goes pretty much like this:
The current version of the wubloader (our capture architecture), re-written by ekimekim and chrusher after DB12, is used by ekim all year, so he regularly works on it and fixes it to work around anything Twitch changes.
~3 months before the run we will put out the signup form to the internal VST place; a week or so after that it goes out to the IRC channel and the LRR discord (in the desertbus channel).
During about 2 of those 3 months I’ll finish up any new stuff for the VST website I’m working on, so they are ready for the run.
The VST Org. Committee has meetings during the year to talk about any changes we want to make to any of the internal tools of our external facing stuff, the first of which usually happens in June for a new run.
Sorry, some of this is out of order.

g: You’re fine.

S: If we need to inform regular VST members of some major changes we’ve made, we schedule meetings over some form of video chat for them to sign up for, and then do a quick check over everything new so we can get any questions answered and have everyone on the same page (usually about 30min per session).
New people will get a separate training session that’s usually about 90-120 min in length. New people always start off as “spreadsheeters”; we don’t rotate in new editors until they’ve been around for a couple years and kind of have a feel for what we do.
For setting up the VST website for the run, there’s a separate “front page” for when the run is live, and also the head node is dropped back to being non-public and we stand up an 8-node, globally located DNS cluster to handle the load. It runs on a 5-minute update cycle, because late-run, when there is a new poster revision, a full update and sync takes about 3 & 1/2 minutes.
For setting up a “new year” on the VST site, there’s an amount of manual work, but it’s only about 3 hours or so, really depends on how many of the other things we track are setup at that point.

g: Other stuff being things like the charts, the clock, chat stats?

S: The clock is pretty easy. The chat stats require the chat capture be enabled and going. The graphs require that the donation capture is going already, so that can’t be set up till donations reset. The gamejam page can’t be set up till Famout gets the gamejam on itch.io set up, and the gameshows page can’t be set up till Noy2222 actually knows what gameshows he’s doing this year. The spreadsheet page can’t be set up until all the Google Docs spreadsheets are set up. The posters page requires that Lunsford has the poster that they’re drawing set up somewhere for us to query. And the animated poster evolution page requires 3 poster revisions before that works at all. The postermap page is updated manually when I have time to draw/trace and then import the new postermap (ImageMap) of the poster Lunsford has drawn (still not done with this year’s yet).
For standing up our capture infrastructure: there are at minimum 2 nodes on “hardware” (as in non-virtualized) that are “editing” nodes, only one of which actually uploads to the YouTube channel. After that, (usually) all the other nodes are virtualized and (this year) were provided by 6 different people; these are completely separate from the VST website nodes.
We also always try to make sure all the capture nodes are geographically distributed so a random network outage can’t hurt us, and so if one node misses a segment the other 7 can fill in the blank.
Once all of those are stood up and working, they’re all imported into the monitoring dashboard so we know if one of them has a problem. Usually we have all the capture (and website) hardware stood up about 1 week before the run starts. Then we have time to test it and ekimekim and chrusher (Wubloader), ElementalAlchemist (who coded the new version of thrimbletrimmer, our editor), and myself (website) have time to fix any bugs / finish any new features. At that point all the approved (new and old) VST members will also get an invite to the private sheet. Also, we invite any new VST members to the private chat space we use during the run (self-hosted Zulip).

We also spend a lot of time working on the schedule (as part of the signup form people tell us their available hours), people are limited to a max of 6 hour shifts, so scheduling ~60 people over a week where we try to maintain ~8 active people on the private spreadsheet is actually quite complex. ekimekim created a python script to create an initial rough guess, we then have a VST Org meeting to smooth things out. The resulting (schedule) spreadsheet is then given to everyone on the VST so they can check for errors in their personal schedule, and then (for during the run) the schedule’s csv is fed in to a zulip bot that announces who’s going on/off shift. Also, once I have the VST website nodes setup I give J access to one (geographically) near him, that he also uses for his own capture of the chat, twitch, and poster revisions, that way if the VST website head-node misses something we have a backup copy with the stuff J sets up as well.
I think that’s it, everything I’m thinking of now is post-run stuff. Oh, J also runs a capture of all of the Prize data that we preserve for the (upcoming) prize archive.

g: Well, that’s one heck of a process. Mind going into the tech used, like Wubloader and thrimbletrimmer?

S: Sure, wubloader is an ekimekim/chrusher-coded Python 3 project that is a custom HLS capture (as in we capture every 2-second long .ts segment Twitch sends out when the stream is going). It uses PostgreSQL for backend databases, nginx for web, FFmpeg for doing the actual video editing, and Docker for easier node deployment. It uses the Google Docs API for interaction with the private sheet and the YouTube API for uploading to YouTube / managing the playlists.
Thrimbletrimmer (now coded by ElementalAlchemist) uses HLS.js and a bunch of custom JavaScript and HTML for the editing interface. It can make multiple cuts (so we can cut the middle out of a video) and has the ability to add chapter markers to the description if we want to do that on a longer video.
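Editor’s note: to give a rough idea of the kind of HLS segment capture described above, here is a minimal, hypothetical sketch using the m3u8 and requests libraries. It only mirrors the general idea (fetch a media playlist, download each short .ts segment); the real wubloader does far more, including multi-node backfilling and database bookkeeping, and the playlist URL here is a placeholder.

import os
import m3u8
import requests
from urllib.parse import urljoin

PLAYLIST_URL = "https://example.com/stream/index.m3u8"  # placeholder, assumes a media (not master) playlist
OUT_DIR = "segments"
os.makedirs(OUT_DIR, exist_ok=True)

# Fetch the media playlist and walk its segment list.
playlist = m3u8.load(PLAYLIST_URL)
for segment in playlist.segments:
    segment_url = urljoin(PLAYLIST_URL, segment.uri)
    name = os.path.join(OUT_DIR, os.path.basename(segment.uri))
    if os.path.exists(name):
        continue  # already captured on a previous pass
    resp = requests.get(segment_url, timeout=30)
    resp.raise_for_status()
    with open(name, "wb") as f:
        f.write(resp.content)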

g: So the upload process is done by Thrimbletrimmer?

S: When someone makes an edit in Thrimbletrimmer, it talks to thrimshim, which then passes the actual edits on to the wubloader, which does the edit and uploads the video to YouTube.
thrimshim is a piece of the wubloader that is kind of like an API to all the data in the wubloader.
So when a video is marked in the private sheet for upload, there is a link to Thrimbletrimmer that has a UUID on it, which Thrimbletrimmer passes to thrimshim so it knows which video segments correspond to the requested video. On the way back it’s like “edit this uuid with the following edits, here’s the video title and description”.
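Editor’s note: to make that hand-off concrete, a request of that shape might look roughly like the sketch below. This is purely a hypothetical illustration; the endpoint, field names and payload are invented for the example and are not the actual thrimshim API.

import requests

# Hypothetical edit request: "edit this uuid with the following cuts,
# here's the video title and description". Endpoint and fields are invented.
edit = {
    "uuid": "0b0e7f2a-0000-0000-0000-000000000000",  # placeholder UUID from the sheet link
    "cuts": [
        {"start": "01:02:03", "end": "01:10:45"},    # keep this range, drop the rest
    ],
    "title": "DB2021 - Example Event",
    "description": "Cut by the Video Strike Team.",
}

resp = requests.post("https://example.org/thrimshim/edit", json=edit, timeout=30)
resp.raise_for_status()
print("Edit submitted:", resp.status_code)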

g: So what about the Twitch chat? How do you grab that?

S: Twitch chat is captured in 2 ways: via irssi (a Unix command-line IRC client), which both J and I run captures with, and (this year) via a capture ekimekim coded up that also grabs all the metadata for each chat message.
So before the run starts, J and I just set up our irssi sessions on 2 respective servers and leave them running in screen. ekimekim runs his custom capture off 2 of the wubloader nodes.

g: So how has this setup evolved over time?

S: For chat capture or video capture?

g: Both.

S: Chat capture has largely stayed the same; old (pre-DB6) chat capture was just done with whatever IRC program the person making the capture used (mIRC or IceChat).
Video capture has changed quite a bit. The first version of the wubloader (DB8) [coded by dave_random] was done with livestreamer (saved to mp4 files) and only did rough cuts. The 2nd version (DB9-12) came with Thrimbletrimmer (coded by MasterGunner), which did specific cuts, but also still used livestreamer as the capture source. During DB12 we discovered Twitch had implemented a “24-hour watch limit”, which caused both capture nodes to miss part of Ash & Alex’s driver intro. Starting with DB13, ekimekim and Chrusher implemented a custom home-grown capture method that attaches directly to the HLS stream and resets itself every so often to avoid the 24-hour watch limit.
The new capture method saves all the 2-second long .ts files as they come out, and each node fills in for any other node that got a partial or missed segment; now the capture nodes are a cluster instead of independent.
The editing process has gone from using Twitch highlights -> using YouTube’s editor -> using a custom editor coded by MasterGunner -> using a further improved editor coded by ElementalAlchemist.
Compared to using Twitch or YouTube’s editor, the ones coded by MasterGunner and ElementalAlchemist are an amazing improvement, and much less buggy.

g: Anything else you want to add? Advice for somebody considering a similar archival project? Other than “don’t”?

S: Honestly: “Start on the first year of the event”, “Ask us (the VST) for advice”, “Preserve everything, backtracking to get something you missed is always more painful”
“Don’t try to do it by yourself”
The VST only works because of all the people involved and learning from the mistakes we’ve made over the years.

g: Any closing thoughts before I wrap up this interview?

S: All of this would never have happened if LoadingReadyRun hadn’t put “First Annual” on the website banner back in 2007 as a joke.

g: Thank you for your time!

– glmdgrielson, along for the eight hour, mind-numbingly dull drive

Community Spotlight: URLTeam
Mon, 01 Nov 2021 | https://datahorde.org/community-spotlight-urlteam/

Who are they?

URLTeam is an arm of Archive Team, solely dedicated to collecting shortened URLs.
It is unusual to see a long-term archiving or preservation project: once a collection or a grab is completed, that’s that. Yet URLTeam, who have taken on a task with no apparent end date, have endured for over 10 years, growing into a community in their own right.

What do they do?

Circa 2009, Scumola of Archive Team noted how shortened URLs had proliferated on a little website called Twitter. Twitter was then, and still is now, infamous for its character limitations. To free up space, users began sharing shortened links. And as other users discovered this trick, it only spread. This led to a paradigm shift in the web ecosystem. Links became a lot more unrecognizable, both to referrers and to referees.

xkcd #1698: Theft Quadrants (https://xkcd.com/1698/)

Archivists too, were vexed. The traditional approach to web archiving had been to target a particular domain or subdomain, URLs which followed a pattern. Now how could they expect to save posts from blogs, forum threads or stories from news sites, if URLs were coming to them from TinyURL or bit.ly? Thus, URLTeam was born out of an effort to catalogue said short URLs.


Short URLs are loose connections. You cannot actually shorten the domain that someone else has registered, that is to say one cannot rename google to gugle.

The secret that URL shorteners employ is to generate short URLs on YOUR OWN (or 3rd-party) servers, which can be made to redirect to longer URLs. When someone connects to your server with the shortened address, you just redirect them to the associated full address in your database. So any time you visit a short link, you are going to be visiting (at least) two websites.

Old links dying or websites shutting down is a given. Yet adding more redirects is going to lengthen the chain to get to your final destination. As the saying goes, a chain is only as strong as its weakest link. Should a URL shortening service shut down, all of its short URLs will break, even if the actual sites being redirected to are still online. So that is reason enough to hoard short URLs. That is what URLTeam does.
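To see the redirect mechanism in action, here is a small sketch that resolves a short URL by reading the Location header of the shortener’s response, without following the redirect all the way. The example short link is a placeholder.

import requests

def resolve_short_url(short_url):
    """Return the Location header a URL shortener redirects to, if any."""
    # allow_redirects=False keeps the shortener's own response,
    # so we can inspect where it points instead of silently following it.
    resp = requests.head(short_url, allow_redirects=False, timeout=10)
    if resp.status_code in (301, 302, 303, 307, 308):
        return resp.headers.get("Location")
    return None

# Placeholder example; substitute any short link you want to check.
print(resolve_short_url("https://tinyurl.com/example"))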

How do they do it?

URLTeam’s approach is to decompose the problem. Even if we cannot possibly crawl every link shared on the internet, nor every final redirect, at any time there are going to be a relatively small number of URL shortening services. So URLTeam begins by hunting for said URL Shorteners. If we can recognize ow.ly or goo.gl links, that’s a start.

Once a new shortening service is added, the next step would be crawling the web for short URLs. Since URL shorteners are almost universally linkable on any site, this is going to have to be a very broad crawl, akin to a crawl for building a search engine. This brings us to the URLTeam Tracker which oversees the distribution of the crawling jobs to hundreds of volunteering archivists. You can even see live stats on their page!

Collected links are finally shipped to the 301works collection on the Internet Archive. The 301works collection also houses link databases donated by URL shortening services, so if you happen to own a URL shortening service at risk of shutting down you might want to get in contact with them.

How do I sign up?

Communication happens on URLTeam’s IRC channel on hackint (irc.hackint.org). General info can be found on their Wiki page.

If you want to hop right in, you can simply download and run an Archive Team Warrior, and then select the URLTeam project. You can also run the project using a Docker container by following the instructions here.

Now if you’ll excuse me, I have to cut this short so I can cut some shortcuts.


Looking to discover other archiving communities? Just follow Data Horde’s Twitter List and check out our past Community Spotlights.

YouTube’s Community Posts are tragically killing the Discussion Tab
Tue, 21 Sep 2021 | https://datahorde.org/youtubes-community-posts-are-tragically-killing-the-discussion-tab/

A few days ago YouTube announced that millions of channels would gain access to Community Posts next month. Unfortunately, many people have missed the footnote that they are removing the similar Discussion feature which was already available to all channels. Read on to learn more about how this update will affect YouTube.

Image Credits: YouTube Blog

Community Posts? What are those?

YouTube’s Community Posts were introduced in 2017, as a way for channels to interact with their viewers and fans. These are typically short texts, images, GIFs or polls; you might think of them as Tweets native to YouTube. They appear both on the channel itself, under the so-called Community Tab, and also in the recommendations on the YouTube Homepage and the Subscriptions Feed. Initially, Community Posts were available only to a number of big channels, such as grav3yardgirl and Vidya Vox.

After some initial success, the feature gradually became available to more channels, albeit not all, circa 2019. You can learn more about how it works on YouTube’s Creator Academy. Earlier this year, The Spiffing Brit made a video which brought a lot of attention to Community Posts, with some exaggerations, pointing out that they were especially prominent on the mobile-view and they could be used to amass a lot of subscribers very quickly.

For reference, a short community post with only a GIF can take up around 80% of the screen!
Image Credit: Pringus McDingus

What is new this October is that Community Posts will be rolling out to all channels above 500 subscribers, beginning on October 12, 2021. Still not all channels, though we are getting there. Then what is the catch? This update will also remove the old Discussion tab, which was formerly available to all channels.

There’s a Discussion Tab? I have never seen it before?

YouTube first introduced comments on videos sometime in the Summer of 2005. Comments were immediately a hit, but there was a problem: people were not just commenting on the videos themselves, they were sometimes asking the uploader general questions. Unfortunately, channels had to divide their attention between all of their uploads to keep up with their commenters. So some time around April of 2006, YouTube introduced Channel Comments, because somebody had had the genius idea of making a channel-wide comment section. It took the form of a long comment thread appearing at the bottom of channel profiles.

Most Recent Channel Comments on jawed‘s channel from 2011, courtesy of the Wayback Machine.

Now these Channel Comments have been around almost since YouTube started, but today they are known as Channel Discussions and have their own, specialised Discussion Tab. Even at this very moment, if you look into a channel’s Discussion, you might catch a glimpse of comments from over a decade ago carrying on memories long forgotten!

Right now, on 21 September 2021, you can still see comments from 13 years ago on YouTube veteran ozwalled’s channel discussion.

Sadly, like many legacy features on YouTube, Discussions were never fully implemented on the mobile apps and they still remain out of view even today. If you mostly browse YouTube on the app, it is very likely that you have never seen any Discussion before.

Why is YouTube removing Discussions? Can’t they co-exist with Community Posts?

In a sense, Discussions were the predecessor to Community Posts. Yet they certainly have not received the same care and attention in recent years, with Discussions still showing hints of the Google+ transition attempted a few years ago. In short, Discussions have been obsoleted.

Unlike Community Posts, which were engineered specifically to take advantage of the UI of the mobile app, Discussions remained largely a desktop-only feature. In a mobile-user dominated web, YouTube finds itself with a significant share of mobile users. There are contradictory reports on whether YouTube has more mobile or desktop users, with Statista reporting a ~20%-80% global share for mobile vs. desktop users in early 2021 and Comscore reporting a ~70-30% split in the US/UK in 2017. Still, YouTube has made it clear that they are designing with a mobile-first mindset. They have set a track record in removing features missing from mobile, such as video annotations in 2019, and the community contributions editor in 2020. Thus it comes as no surprise that Discussions should come next.

Spammy comments at the top of smosh’s channel comments on September 30, 2007.

It is also safe to assume that the incredibly slow roll-out of the Community Posts feature to all channels is entirely deliberate. Channel Discussions had a few kinks which needed hammering out, such as often being a frequent target for spam comments. In case you have not noticed, Community Posts are oriented in such a way that they emphasize the channel itself, over commenters. Spam comments are often hidden a screen or two behind the Community Posts themselves, which are mostly kept clean at the channel’s own discretion. They are easier to moderate for channels, and more readable for commenters; that’s a win-win from YouTube’s perspective.

Could YouTube have fixed the way Discussions were organised? Probably, given that they have preserved the channel comments from 15 years ago, through countless layout changes to channels. Yet it seems that YouTube is determined to remove the feature, no matter what.

If you don’t yet have access to Community posts you can use Comments on your videos to engage your viewers.

Team YouTube on the Google Support Page, having seemingly forgotten the original reason channel comments were added in the first place.

Interestingly, Discussions were once available on the mobile version of the YouTube website, but even those have been removed, some time earlier this year. If that was not self-sabotage, it certainly was a sign that the design team had been tasked with removing Discussions for good.

From a practical standpoint, it does make sense for YouTube to remove Discussions now, with Community Posts as an adequate replacement. Yet why YouTube would break a years-long tradition of preserving these ancient comments is anyone’s guess.

Which channels have a discussion tab?

Even channels started as recently as September 2021 have had Discussions at some point. That sentence is in the present perfect, because many channels which used to have a Discussion section no longer have it:

  • Big channels such as Vsauce or eddsworld, who have already switched to Community Posts, have their Discussion history hidden. Not deleted, hidden. There is at least a 30-day rule which allows commenters to view, edit or delete comments they have made on the Discussion page of a channel which has recently switched to Community Posts, from their Google Account’s My Activity Section.
Memoriam comments on Edd Gould‘s Discussion Page, when he posthumously passed a million subscribers in 2016, once again, courtesy of the Wayback Machine.
  • At some point, the switch to Community Posts was mandated for channels above a certain subscriber count. The current requirement applies to channels strictly above 1000 subscribers. This update to Community Posts is forced even if the channel has been inactive for years, such as in the case of Fred or, more tragically, geriatric1927 or KKD1247.

Do I have a discussion tab?

  • If it exists, your channel’s Discussion page will be located at https://www.youtube.com/channel/{CHANNEL ID GOES HERE}/discussion or https://www.youtube.com/user/{USER ID GOES HERE}/discussion.
  • Following COPPA compliance changes introduced in 2020, channels which have declared their content to be “Made for Kids”, have all comments disabled across their channel. This includes Discussions. Luckily, past comments are not deleted. Switching the channel out of the “Made for Kids” option will restore all hidden comments intact.
  • The road to October 12 will see any other old channels which go over 1000 subscribers lose their Discussion tab as well. That means anyone who passes 1000 subscribers gets the update early! Global removal will begin right after October 12, and Community tabs will roll out to channels above 500 subs within an estimated time of a week. Anyone with fewer than 500 subscribers will no longer have a Discussion tab, nor will they receive a Community tab replacement. Please note once more that the 30-day rule allows commenters to view, edit or delete comments they have made on the Discussion page of a channel which has recently switched to Community Posts.

Concluding Remarks

While Community Posts present an overall upgrade to the YouTube user experience, especially for mobile users, the removal of Channel Discussions is yet another dent in the internet’s short memory span. Seeing people who you would never have guessed had ever talked to each other commenting on one another’s Discussions is, in a sense, like scrolling through the threads of an ancient forum. Ironically, the Community Posts still lack this sense of community, instead emphasising a dichotomy between channels and viewers.

Relevant to archivists and data hoarders, a non-API endpoint for Discussions is unknown at this time. Furthermore, given that the mandated update has already hidden the Discussions for many channels, options are severely limited. Existing archives besides Wayback Machine snapshots remain undiscovered. Perhaps the interconnectivity of the Discussions between older channels might prove useful in crawling out other channels which still have a Discussion tab. One thing is for certain: this will go down as a very unusual moment in YouTube’s history…
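If there are Discussion pages you personally care about, one stopgap is to request Wayback Machine snapshots of their URLs (following the patterns listed above) before the removal. A minimal sketch, assuming the public Save Page Now endpoint and using placeholder channel IDs, might look like this; note there is no guarantee a saved snapshot will render every comment:

import requests

# Placeholder channel IDs; replace with channels whose Discussion tab you want saved.
CHANNEL_IDS = ["UCxxxxxxxxxxxxxxxxxxxxxx"]

for channel_id in CHANNEL_IDS:
    discussion_url = f"https://www.youtube.com/channel/{channel_id}/discussion"
    # Ask the Wayback Machine's public Save Page Now endpoint to take a snapshot.
    resp = requests.get(f"https://web.archive.org/save/{discussion_url}", timeout=120)
    print(discussion_url, "->", resp.status_code)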

On a final, albeit related, note: YouTube’s video attributions, which were expected to be removed in September, went offline yesterday, on September 20. One of the last captures of the attributions endpoint was taken at 5:19 (GMT) on Sep 20, and the endpoint was reported to have shut down by 20:25.

Help Archive Team Archive public Google Drive files before September 13!
Sat, 11 Sep 2021 | https://datahorde.org/help-archive-team-archive-public-google-drive-files-before-september-13/

On September 13, Google is going to start requiring longer URLs to access many Google Drive files, breaking links to public files across the web unless users opt out! Because of this, Archive Team has launched a project to archive as many publicly-available Google Drive files as possible and make them accessible on the Internet Archive Wayback Machine. (Note that video files are not included at this time due to their size.)

You can help! Simply follow the steps to download and run an Archive Team Warrior, and then select the Google Drive project. (You can also run the project using a Docker container, with atdr.meo.ws/archiveteam/google-drive-grab as the image address.)

Additionally, people with lists of public Google Drive file URLs are encouraged to share them so they can be archived.

In order to stay up-to-date with the project and be reachable in case of an issue, project contributors are encouraged to connect and stay connected to the project discussion channel, #googlecrash on irc.hackint.org, also available through webchat.

Archiving progress statistics for this project are available on the Archive Team project tracker, and source code is available on GitHub.

YouTube Attributions to be removed in September
Sat, 28 Aug 2021 | https://datahorde.org/youtube-attributions-to-be-removed-in-september/

On August 18, YouTube quietly announced that due to “low usage”, they will be removing video attribution pages. One version of the announcement said that this will happen in “early September” and another said “after September”. YouTube instead recommends using the description to attribute videos.

Video attribution pages were intended to list which videos were used to make the current video. This created a network of videos, connecting remixes/compilations/shorter versions of videos with their original source videos. These pages also helped ensure that credit was given to the original authors of video clips, even if the original uploader might have forgotten to do so.

Until some point between 2017 and 2019, video attribution pages also listed the videos that used the current video. The attributions were automatically associated with a video when someone used the online YouTube video editor to add a Creative Commons-licensed clip to their video. If a video had attributions, a link to its attributions page would automatically be placed below its description. On the mobile YouTube app, this link would open the attributions page in the user’s web browser, but more recently all of the attributions links in the mobile app would open the channel that claimed the “Attribution” custom URL.

The video attributions page is one of the oldest pages on YouTube, and is believed to be the last page on YouTube that still uses the old, pre-polymer layout. In fact, the HTML content of the attribution web pages (excluding headers, footers, and video thumbnail overlays) has not been modified since 2011!

No formal archival efforts have been initiated as of this time, but it is anticipated that one will start soon.

This Week in Archiving 08/09/2021
Mon, 09 Aug 2021 | https://datahorde.org/this-week-in-archiving-08-09-2021/

Introducing the Institute for Digital Heritage, Heritrix 3.4 and Crawling@Home: a project for building the largest image-text dataset, ever!

New Projects

History buffs had better follow the Institute for Digital Heritage, which just launched last week. It’s a new organization aiming to turn cultural heritage into digital heritage. They help museums and cultural institutions digitize and open up their collections, with partners like FloatScans, who specialize in 2D/3D scanning. Among its ranks are scholars in digital humanities, culture studies and related fields, primarily from Europe and Africa.


Got any processing power to spare? Crawling@Home is a project to match text and images, en masse, from the >3 PB Common Crawl dataset. When completed it will be the world’s largest image-text pair dataset, to fuel next-generation machine learning models. Current stats are available on their homepage.

u/-Archivist has made a tutorial on how to join C@H (periodically updated) which you can find on r/datahoarder. If you have Docker installed, go ahead and execute docker pull arkseal/cah-worker && docker run --shm-size=2g -e NAME={username}-TE arkseal/cah-worker replacing {username} with a nickname of your own.

Updates

Speaking of web crawling, the Internet Archive’s crawler Heritrix3 received an extensive update last week, adding in a browser-based link extractor, ExtractorChrome. While it is currently limited to grabbing <a> and <area> links, the team plans to extend it to crawl intercepted sub-requests, possibly mimicking JavaScript behavior.

https://twitter.com/anjacks0n/status/1422496424021856275

Last but not least, The-Archive has updated their website. Whether you’re looking for Strategy Guides for retro gaming or Milkdrop Plugins for Winamp Wednesday, The-Archive has much to offer.

Thousands of Classic Winamp Skins on the Internet Archive!
Wed, 04 Aug 2021 | https://datahorde.org/thousands-of-classic-winamp-skins-on-the-internet-archive/

Remember Winamp, the old music player? Built-in equalizer, user-made skins and an unholy amount of ad spam? If yes, then you will absolutely adore the Winamp Skin Collection on the Internet Archive!

One thing sadly lost in the transition from desktop to mobile was user-customizability. Winamp, in particular, had a community dedicated to designing and sharing custom skins. Though Winamp has been obsoleted to some extent, the aesthetic in these skins is certainly a lot more timeless. Preserving that artistic value is the Winamp Skin Collection, with over 72 thousand community-made Winamp skins waiting for your viewing pleasure on the Internet Archive.
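If you would rather browse or mirror the skins programmatically, the internetarchive Python library can help. The sketch below assumes the collection identifier is winampskins, which is an assumption worth double-checking on the collection page before running it:

from internetarchive import search_items, download

COLLECTION = "winampskins"  # assumed identifier; verify on the Internet Archive page

# List the first ten items in the collection.
results = search_items(f"collection:{COLLECTION}")
for i, result in enumerate(results):
    print(result["identifier"])
    if i >= 9:
        break

# Download one item (all of its files) into the current directory, e.g.:
# download("some-skin-identifier", verbose=True)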

Skin: V2 Technics White Resting by johnnyg0
Song: Puppies’ Space Station by Saria Lemes

The best part? You can run it from your browser, even on your smartphone! The skins in this collection are set up to load Webamp, an open-source HTML5 rewrite of Winamp 2.

From stereo-themed skins to futuristic devices, and from game fan-art to original designs, you’re certain to find the perfect skin for all of your llama-whipping needs! Also be sure to check out the Winamp Skin Museum and the Butterchurn Visualizer!
