community – Data Horde https://datahorde.org Join the Horde! Thu, 16 Nov 2023 09:41:02 +0000 Without being exploited: What archivists should learn from the XeNTaX forums aftermath https://datahorde.org/without-being-exploited-what-archivists-should-learn-from-the-xentax-forums-aftermath/ https://datahorde.org/without-being-exploited-what-archivists-should-learn-from-the-xentax-forums-aftermath/#comments Thu, 16 Nov 2023 09:40:28 +0000 https://datahorde.org/?p=2923 Some 6 months ago, in May 2023, a post was made on r/DataHoarder announcing that the XeNTaX wiki and forum were shutting down due to financial considerations. As with any forum shutdown, much panic ensued. However, of the few people I have spoken to about this shutdown, no one really seemed to be aware of XeNTaX before this.

Depending on where you look online, you may be led to believe XeNTaX is/was a company, supposedly a foundation and definitely a website. Yes, there is a XeNTaX website, xentax.org, distinct from the XeNTaX forums at forum.xentax.com. In actuality, XeNTaX has its roots in the Dutch demoscene and it has just kept reincarnating.

A Xentax song composed for the X’98 compo

XeNTaX started as a team of two, Mr. Mouse and Captain Corney, who were hacking/modding Commodore 64 games. XeNTaX grew into a much wider community over time because Mr. Mouse and Captain Corney wanted to be able to focus on retrocomputing and to support others working on similar projects. To this end, XeNTaX developed MultiEx Commander, a tool for unarchiving 100+ retro game formats, certainly no longer limited to the C64.


On October 6, XeNTaX made a more upfront shutdown announcement[Wayback] with the shutdown being scheduled for the end-of-year. While there was still some possibility of a buyout or handover, it was unlikely. Instead, the XeNTaX community was encouraged to join the XeNTaX Discord server. Again, no surprises there: it has become fairly routine for old forums to retire to Discord which offers free hosting and a ton of features.

With this announcement, a second wave shot out. Word got out once again leading to several mass archiving efforts. However, this upset the staff enough to issue a warning on the Discord, with an emphasis on Data Privacy and consent. To quote Mr. Mouse:

Note: Members of the Xentax Forum have agreed to terms of the Forum and any public information. They have not agreed for their information being used on other sites. You may wish to look into the subject of data privacy. As such, while you’ve leeched my posts, I did not agree for those being hosted somewhere else. So remove my posts.

Remember to ensure approval from people before you put their stuff up that they did not agree to. In this age of data privacy and consent that is very important. As for Wayback Machine, they have a process that enables removal of pages if asked and are usually collaborative.

XeNTax Discord

This was a remarkable reaction because two things are being said here. First is the obvious point on data privacy and consent, but second is an undertone of leeching off of previous work and exploitation. The fact that the XeNTaX forums have shut down does not mean that the staff and contributors have quit completely. They are still around and will frown upon their work being plagiarised now just as much as they would have while the forums were alive. And that is an issue most fellow archivists and hoarders have been fairly negligent of.


Amidst the archiving craze focussed on preserving the record, there was also a second preservation effort going on: an effort to preserve community. Although the XeNTaX Discord server offered a solution, many did wish for an independent forum. There was even a short GoFundMe campaign to see if maintenance costs could be crowdsourced.

The shutdown date was pulled a bit forward to November 3, 2023 as members were instructed to relocate to a new forum, Reshax, per the updated XeNTaX forum banner[Wayback]. In fact, when the forum did first shut down it began immediately redirecting to Reshax.

I’ve reached an agreement with Mr. Mouse, the owner of the Xentax forum, to promote ResHax and breathe new life into the slowly declining forum. Additionally, I’ll make an effort to bring tools from their site to ours. Once their forum becomes inactive, I’ll attempt to persuade Mr. Mouse to redirect the domain to our forum, ensuring that all users can find a new home here

Reshax admin michalss, “What about Xentax and Zenhax ?” on ResHax, Wayback Snapshot.

michalss also lamented the recent death of the sister community Zenhax, which was abandoned due to the owner losing interest. And this could have been the end of the story, but people kept begging, asking “where are the tools, where are the assets?”…


On November 8, XeNTaX Discord Admin Richard Whitehouse came out with an announcement, later also shared on his homepage: Reshax and XeNTaX had reached an alternative agreement. From this point on, Reshax would be free to focus on reverse engineering however they pleased, and XeNTaX members would be free to continue the tools and projects that they were already making. Whitehouse paints a picture of how he believes the XeNTaX community has been unfairly taken advantage of, and that this was a destructive force.

Many developers stopped sharing their findings and specifications (myself included) because they started to see their work exploited. By companies, which is morally reprehensible (and sometimes in direct violation of a given license/copyright) and serves to devalue the entire skillset associated with the labor. By other developers, who are socially positioned to exploit the labor in some other way. By people who just want to rip content to turn around and sell it, or claim false credit for it. In conjunction with unhealthy ego competition, this exploitation has made it impossible to create a culture of trust and sharing between developers.

We want to create an environment where developers are safe to work together without being exploited, and where developers feel valued by fellow developers enough to not feel the need to engage in pathetic ego-based assertions of skill. We want people to be fueled by their creative ambitions and technical fascinations, not their social standing. We want to create a culture beyond what Open Source can achieve under the constraints of our current socioeconomic systems. No matter how many people are left standing in the end, this is where we’re going.

Richard Whitehouse

On r/DataHoarder and other venues, the XeNTaX forum shutdown was treated as nothing more than a lost cause. There was once a XeNTaX, now there isn’t; we must therefore uphold the memory through downloading all we can. But to the alive and well XeNTaX community, these forum dumps were nothing more than an intensification of the routine stealing of their work they had grown sick of. Whitehouse’s open letter, which I have only abridged here, makes it clear whom the Discord staff consider a XeNTaX contributor: someone willing to invest time and effort to learn, as opposed to internet passersby who ask for something, take it and move on.

To further hammer in the point, Mr. Mouse issued another announcement on November 12 imploring members not to share full backups of the XeNTaX Forum on the XeNTaX Discord server. Once again, the Internet Archive and the Wayback Machine were exempted as special cases, but otherwise it was not allowed. This, however, did attract some internal protest from guild members, as one might gather from the reactions to the message.

This goes to show that the Internet Archive has built up enough of a reputation to not merely be regarded as leechers and pirates, and that’s a good thing. There is, however, an implication here that websites just find their way onto the Internet Archive, when in fact there are automation processes and groups like Archive Team who facilitate this. Thus we find ourselves in a Catch-22, where if something has landed on the Internet Archive it is deemed legitimate, but if it is stuck in transit it was stolen unfairly.

This is a paradox that underpins the challenge of being an archivist today: success means being invisible and that your archives are never widely distributed. Does that perhaps sound familiar? It’s the exact same situation the XeNTaX community finds itself in. They would rather preserve their tools and assets internally, circulating them on a need-to-know basis, than have them out in the open. This ensures that the community retains its knowledge, but also controls it. It’s self-determination against potential exploitation.


The XeNTaX situation is not over, and hopefully it will not be over any time soon. The XeNTaX forums might be gone, but XeNTaX lives on. And I believe it sets a good example: Archivism as a hobby or profession is something which should prevail within every community, instead of the interventionist culture from 3rd parties that we have grown accustomed to today.

But that reversal we have is warranted. Many times communities do vanish or are made to vanish, whether it’s subtitlers on YouTube or artists who can no longer use Macromedia Flash. Oftentimes, these communities do not have an obvious way of preserving their memories; the decision is out of their control and attempts at preservation necessitate challenging authority, ad hoc solutions and technical expertise (often from outside).

Whether you define yourself as an archivist, a hoarder, a pirate, a cracker, an archaeologist or whatever, it is a must that you understand where the files come from. You don’t have to obey all of the wishes of the original creator, but you have to respect them. Especially if they’re still alive and kicking. The costs couldn’t kill XeNTaX, but from the looks of it archivists almost did.

Interview with Hubz of Gaming Alexandria https://datahorde.org/interview-with-hubz-of-gaming-alexandria/ https://datahorde.org/interview-with-hubz-of-gaming-alexandria/#respond Mon, 18 Apr 2022 09:09:30 +0000 https://datahorde.org/?p=2719 Hello, here’s another interview, this time with our head overlord Hubz of Gaming Alexandria.

glmdgrielson: So, first question, what is Gaming Alexandria?
Hubz: At its core it’s both a Discord community and a separate website dedicated to preserving various aspects of video games, such as scans, interviews, unreleased games, YouTube videos etc. It mainly started as a site where I could share high quality scans but has grown thanks to many people joining up with various skills to help expand the website. The Discord community itself is really an entity unto itself at this point where lots of gaming historians/preservationists have come together to share their works and also help each other out when needed with various projects. I love getting to see all the passion in everybody’s projects that they put forth and the willingness of the community to offer help when asked.

g: Tell me more about this community. I’m active in the server, but what does it look like from your end?
H: From an admin standpoint I have access to all the channels which include the private #staff and #mods channels where we discuss upcoming articles or projects for the site as well as handling the occasional argument or bad apple in the chat. Dylan Mansfeld (DillyDylan) handles a lot of great articles on undumped/prototype games that were previously unreleased. Ethan Johnson writes for his own blog (https://thehistoryofhowweplay.wordpress.com/) and Gaming Alexandria at times and is our editor so he glances through and cleans up all the articles that get posted. Jonas Rosland, who is the Executive Director of Hit Save (https://hitsave.org/), the NPO I’m a board member of, does a lot of thankless technical work behind the scenes, including a NAS he has set up for not only the staff of the website to store project files but the community at large, which is a huge help. Wietse van Bruggen (Densy) handles a lot of the moderation of the chat and has been a huge help keeping the Discord community friendly and clean with his balanced moderation style. Last but not least there is Stefan Gancer (Gazimaluke) who did the original site redesign and has been a great idea man for ways to improve the site and community as time has gone on. For me personally I try to keep up with all the chat in the channels (though it can be tough at times!) just to have an idea of what’s going on and seeing what I can help with or connect people to further projects as well as post my scans and projects as they’re completed. Thanks to the rest of the staff I rarely have to step in and moderate which is very nice!

g: I’m going to skip over the omission of Norm and ask about the history of how the site has evolved.
H: LOL yes Norm is a menace to society and must be stopped.

Editor’s note: Hubz has a mock rivalry with Norm, a.k.a. the Gaming Historian, which is a frequent running gag on the server. I do not believe there is actual malice.

The website itself started officially on October 23rd, 2015 and was just a basic text website that I could easily upload to in order to share my scans, it was very barebones. The reason I wanted to get high quality scans out was due to using an emulator frontend called Hyperspin. For popular systems it had a lot of decent quality artwork for boxes. But for lesser known systems it was sorely lacking and that triggered my OCD and made me realize that scanning stuff in high resolution was something that needed to be done. Slowly, but surely, I met others that wanted to scan in high quality and have their stuff hosted and they would submit stuff such as Densy. At some point I got involved with the VGPC discord and met Kirkland who had been quietly doing something similar with his collection and collaborated with him and others on establishing scanning standards to use going forward to have some level of consistent quality with those that were willing to do it which eventually led to what is the https://scanning.guide/. In late 2018 the site was graciously redone by Gazimaluke and relaunched in the design you see now. We started branching out into actual articles written by our staff and releasing prototypes and unreleased games that we came across. The site continues doing this to this day, though we are branching out into more guest authors from the community posting interviews and articles as well in the near future.

g: As well as hosting my site, for which I am grateful. So, what is the day to day like for you?
H: Day to day on the scanning I try to get at least one magazine done daily. Doesn’t always happen but, in general, I debind a magazine the night before, then in the morning scan it in before leaving for work. If work gets slow I work on processing the scans, or else I’ll do it later that night and get them uploaded to the site and the Internet Archive.

g: Interesting. So how big do you think your archive is by this point?
H: Archive upload-wise I’m probably right around 2900 items if you count stuff that was removed lol. Then there’s a bunch on the site that wasn’t done to the higher scanning standards I go by now that’s not on the archive. So I’d guess in the 3000-4000 item range currently.

g: Do you know how big it is in terms of filesize?
H: Let me see real quick…
Looks like 2.5TB which is another reason I’m so thankful to have the Internet Archive to host my scans on due to the space and bandwidth that would be required otherwise.
The site alone usually has about half a TB of traffic per month so I can only imagine what it would be like if the magazine scans were also hosted directly on it.

g: Neat. Is there anything interesting that you got to be a part of due to GA that you would like to share?
H: Biggest thing is probably working with The Video Game History Foundation on scanning their extensive magazine collection so digital copies can be provided along with physical copies at their library. Being able to leverage the Internet Archive so people all over the world can easily access the magazines I’ve scanned that they might not have been able to easily otherwise is a great feeling personally for me. So many of these things are quite difficult to acquire and expensive as time goes on so having them as an ally in the preservation world is a godsend. There’s been lots of other connections and other projects I’ve worked on as well but I won’t ramble forever on that. Not only is Gaming Alexandria a tight community that likes to help each other out but there’s plenty of other preservation groups like VGHF, TCRF, and Hidden Palace just to name a few and we all get along great and try to push preservation forward together.
There’s so much work that needs to be done that we need all the help we can get and we need to support each other any way we can I think.

g: True that. Last question for now: anything that you would recommend to a would-be archivist?
H: I think it’s a good idea to preserve what interests you, which seems to go without saying, but I mean it more from a sense of not only going after what is popular. While you might not get much fanfare initially for the more obscure stuff it’s likely you’ll be the only one doing it and it’s important it’s being done. If you do good work for long enough it will get noticed, and to make good work easier it’s best to go with what you’re passionate about. The other thing I would suggest is not beating yourself up or comparing your output to others. Do what you can when you want to, this is a hobby after all. If you make yourself miserable trying to do something your output will naturally suffer or you might even burn out and stop altogether. Like I said before, we need all the help we can get, so try to avoid that if at all possible.

g: Thank you for being here, overlord Hubz. It’s been good talking to you.
H: No problem! Thanks for the interview. 🙂

– glmdgrielson, being a very good minion interviewer

YouTube’s Community Posts are tragically killing the Discussion Tab https://datahorde.org/youtubes-community-posts-are-tragically-killing-the-discussion-tab/ https://datahorde.org/youtubes-community-posts-are-tragically-killing-the-discussion-tab/#comments Tue, 21 Sep 2021 19:44:15 +0000 https://datahorde.org/?p=2641 A few days ago YouTube announced that millions of channels would gain access to Community Posts next month. Unfortunately, many people have missed the footnote that they are removing the similar Discussion feature which was already available to all channels. Read on to learn more about how this update will affect YouTube.

Index
Image Credits: YouTube Blog

Community Posts? What are those?

YouTube’s Community Posts were introduced in 2017, as a way for channels to interact with their viewers and fans. These are typically short texts, images, GIFs or polls; you might think of them as Tweets native to YouTube. They appear both on the channel itself, under the so-called Community Tab, and also in the recommendations on the YouTube Homepage and the Subscriptions Feed. Initially, Community Posts were available only to a number of big channels, such as grav3yardgirl and Vidya Vox.

After some initial success, the feature gradually became available to more channels, albeit not all, circa 2019. You can learn more about how it works on YouTube’s Creator Academy. Earlier this year, The Spiffing Brit made a video which brought a lot of attention to Community Posts, with some exaggerations, pointing out that they were especially prominent on the mobile-view and they could be used to amass a lot of subscribers very quickly.

For reference, a short community post with only a GIF can take up around 80% of the screen!
Image Credit: Pringus McDingus

What is new this October, is that Community Posts will be rolling out to all channels above 500 subscribers, beginning on October 12, 2021. Still not all channels, though we are getting there. Then what is the catch? This update will also remove the old discussion tab which was, formerly, available to all channels.

There’s a Discussion Tab? I have never seen it before?

YouTube first introduced comments on videos sometime in the Summer of 2005. Comments were immediately a hit, but there was a problem; people were not just commenting about the videos themselves, they were sometimes asking the uploader general questions. Unfortunately, channels had to divide their attention between all of their uploads to keep up with their commenters. So some time around April of 2006, YouTube introduced Channel Comments, because somebody had had the genius idea of making a channel-wide comment section. It took the form of a long comment thread appearing at the bottom of channel profiles.

Most Recent Channel Comments on jawed‘s channel from 2011, courtesy of the Wayback Machine.

Now these Channel Comments have almost been around since YouTube started, but today they are known as Channel Discussions and have their own, specialised Discussion Tab. Even at this very moment, if you look into a channel’s Discussion, you might catch a glimpse of comments from over a decade ago carrying on memories long forgotten!

Right now, on 21 September 2021, you can still see comments from 13 years ago on YouTube veteran ozwalled’s channel discussion.

Sadly, like many legacy features on YouTube, Discussions were never fully implemented on the mobile apps and they still remain out of view even today. If you mostly browse YouTube on the app, it is very likely that you have never seen any Discussion before.

Why is YouTube removing Discussions? Can’t they co-exist with Community Posts?

In a sense, Discussions were the predecessor to Community Posts. Yet they certainly have not received the same care and attention in recent years, with Discussions still showing hints of the Google+ transition attempted a few years ago. In short, Discussions have been obsoleted.

Unlike Community Posts which were engineered specifically to take advantage of the UI of the mobile app, Discussions remained largely a desktop-only feature. In a mobile-user dominated web, YouTube finds itself with a significant share of mobile users. There are contradictory reports on whether or not YouTube has more mobile or desktop users, with Statista reporting a ~20%-80% global share for mobile vs. desktop users in early 2021 and Comscore reporting a ~70-30% split in the US/UK in 2017. Still, YouTube has made it clear that they are designing with a mobile-first mindset. They have set a track record in removing features missing from mobile, such as video annotations in 2019, and the community contributions editor in 2020. Thus it comes as no surprise that Discussions should come next.

Spammy comments at the top of smosh’s channel comments on September 30, 2007.

It is also safe to assume that the incredibly slow roll-out of the Community Posts feature to all channels is entirely deliberate. Channel Discussions had a few kinks which needed hammering out, such as being a frequent target for spam comments. In case you have not noticed, Community Posts are oriented in such a way that they emphasize the channel itself, over commenters. Spam comments are often hidden a screen or two behind the Community Posts themselves, which are mostly kept clean at the channel’s own discretion. They are easier to moderate for channels, and more readable for commenters; that’s a win-win from YouTube’s perspective.

Could YouTube have fixed the way Discussions were organised? Probably, given that they have preserved the channel comments from 15 years ago, through countless layout changes to channels. Yet it seems that YouTube is determined to remove the feature, no matter what.

If you don’t yet have access to Community posts you can use Comments on your videos to engage your viewers.

Team YouTube on the Google Support Page, having seemingly forgotten the original reason channel comments were added in the first place.

Interestingly, Discussions were once available on the mobile version of the YouTube website, but even those have been removed, some time earlier this year. If that was not self-sabotage, it certainly was a sign that the design team had been tasked with removing Discussions for good.

From a practical standpoint, it does make sense for YouTube to remove Discussions now with Community Posts as an adequate replacement. Yet the answer to why YouTube would break a years-long tradition of preserving these ancient comments is anyone’s guess.

Which channels have a discussion tab?

Even channels started as recently as September 2021 have had Discussions at some point. That sentence is in the present perfect, because many channels which used to have a Discussion section no longer have it:

  • Big Channels such as Vsauce, or eddsworld who have already switched to Community Posts have their Discussion history hidden. Not deleted, hidden. There is at least a 30-day rule which allows commenters to view, edit or delete comments they have made on the Discussion page of a channel who has recently switched to Community Posts, from their Google Account’s My Activity Section.
Memoriam comments on Edd Gould‘s Discussion Page, when he posthumously passed a million subscribers in 2016, once again, courtesy of the Wayback Machine.
  • At some point, the switch to Community Posts was mandated to channels above a certain subscriber count. The current requirement applies to channels strictly above 1000 subscribers. This update to Community Posts is forced even if the channel has been inactive for years, such as in the case of Fred or, more tragically, geriatric1927 or KKD1247.

Do I have a discussion tab?

  • If it exists, your channel’s Discussion page will be located at https://www.youtube.com/channel/{CHANNEL ID GOES HERE}/discussion or https://www.youtube.com/user/{USER ID GOES HERE}/discussion.
  • Following COPPA compliance changes introduced in 2020, channels which have declared their content to be “Made for Kids”, have all comments disabled across their channel. This includes Discussions. Luckily, past comments are not deleted. Switching the channel out of the “Made for Kids” option will restore all hidden comments intact.
  • On the road to October 12, any other old channels which go over 1000 subscribers will lose their Discussion tab as well. That means anyone who passes 1000 subscribers gets the update early! Global removal will begin right after October 12, and Community tabs will roll out to channels above 500 subs within an estimated time of a week. Anyone with fewer than 500 subscribers will no longer have a Discussion tab, nor will they receive a Community tab replacement. Please note once more that the 30-day rule allows commenters to view, edit or delete comments they have made on the Discussion page of a channel which has recently switched to Community Posts.
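
For anyone who wants to check several channels at once, the two URL patterns listed above can be generated programmatically. The following is only a minimal sketch: the function name `discussion_urls` and the example IDs are hypothetical, and since Discussions are being removed there is no guarantee that any URL built this way still resolves.

```python
from urllib.parse import quote

BASE = "https://www.youtube.com"


def discussion_urls(channel_id=None, user_id=None):
    """Build candidate Discussion page URLs from the two patterns in the article.

    Both arguments are optional; a URL is produced for each ID that is given.
    The IDs are percent-encoded so that unusual characters cannot break the path.
    """
    urls = []
    if channel_id:
        urls.append(f"{BASE}/channel/{quote(channel_id, safe='')}/discussion")
    if user_id:
        urls.append(f"{BASE}/user/{quote(user_id, safe='')}/discussion")
    return urls


# Hypothetical IDs, for illustration only.
for url in discussion_urls(channel_id="UC_example_id", user_id="exampleuser"):
    print(url)
```

Whether a given page actually exists still has to be checked by visiting it, as channels which have switched to Community Posts have their Discussion history hidden.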

Concluding Remarks

While Community Posts present an overall upgrade to the YouTube user experience, especially for mobile users, the removal of Channel Discussions is yet another dent to the internet’s short memory span. Seeing people who you would have never thought to have ever talked to each other commenting on one another’s Discussions, is in a sense like scrolling through the threads of an ancient forum. Ironically, the Community Posts still lack this sense of community, instead emphasising a dichotomy between channels and viewers.

Relevant to archivists and data hoarders, a non-API endpoint for Discussions is unknown at this time. Furthermore, given that the mandated update has already hidden the Discussions for many channels, options are severely limited. Existing archives besides Wayback Machine snapshots remain undiscovered. Perhaps the interconnectivity of the Discussions between older channels might prove to be useful in crawling out other channels which still have a Discussion tab. One thing is for certain: this will go down as a very unusual moment in YouTube’s history…

On a final, albeit related note, YouTube’s video attributions, which were expected to be removed in September, went offline yesterday on September 20. One of the last captures of the attributions endpoint was taken at 5:19 (GMT) on Sep 20, and the endpoint was reported to have shut down by 20:25.

Help Archive Team Archive public Google Drive files before September 13! https://datahorde.org/help-archive-team-archive-public-google-drive-files-before-september-13/ https://datahorde.org/help-archive-team-archive-public-google-drive-files-before-september-13/#comments Sat, 11 Sep 2021 00:50:54 +0000 https://datahorde.org/?p=2637 On September 13, Google is going to start requiring longer URLs to access many Google Drive files, breaking links to public files across the web unless users opt out! Because of this, Archive Team has launched a project to archive as many publicly-available Google Drive files as possible and make them accessible on the Internet Archive Wayback Machine. (Note that video files are not included at this time due to their size.)

You can help! Simply follow the steps to download and run an Archive Team Warrior, and then select the Google Drive project. (You can also run the project using a Docker container using atdr.meo.ws/archiveteam/google-drive-grab as the image address.)

Additionally, people with lists of public Google Drive file URLs are encouraged to share them so they can be archived.

In order to stay up-to-date with the project and be reachable in case of an issue, project contributors are encouraged to connect and stay connected to the project discussion channel, #googlecrash on irc.hackint.org, also available through webchat.

Archiving progress statistics for this project are available on the Archive Team project tracker, and source code is available on GitHub.

YouTube was made for Reuploads https://datahorde.org/youtube-was-made-for-reuploads/ https://datahorde.org/youtube-was-made-for-reuploads/#comments Wed, 28 Jul 2021 08:57:00 +0000 https://datahorde.org/?p=2548 The term reupload refers to a new upload of a file previously shared on the web, with minor alterations. Though somewhat a stigmatized term nowadays, reuploads can bridge the past and present, if and when the original version of something becomes unavailable.

YouTube is a platform and a community which live off of reuploads. One might even go so far as to say that reuploads have been a key to YouTube’s success and reuploads themselves have been a product of YouTube. With recent events in mind, now is as good a time as ever to re-examine the mutually beneficial relationship between YouTube and the practice of reuploading.


In 2005, YouTube started off as a small video-sharing site. At the time few people would have been able to predict that it would grow to be the 2nd most popular site on the web and yet here we are! One factor Co-founder Jawed Karim attributes their success to is timeliness. In particular, he thinks YouTube came at a time when clip sharing became very common. To quote from a talk he gave in 2006:

The “clip culture” you see now is basically this demand that you can find any video at any time and you can share it with other people, or you can share your own videos with other people. […]

There were a couple of events in 2004 that kind of fueled this. One was this [wardrobe malfunction]. So this, of course, happened on television, but it only happened once and never again. And so for anyone who wanted to see it after that, well they had to find it online. The other big event I remember is this [Stewart on Crossfire] interview. And you know this was also shown on television once and not after that. Everyone was talking about it, but people who missed it really wanted to be in on the joke so they would try to find it online…

Jawed Karim, r | p 2006: YouTube: From Concept to Hypergrowth (25:15)

YouTube was able to meet this clip demand, acting as a universal replay button for any clip people could imagine. It’s no coincidence that obscure/lost media fanatics were flocking to the site soon after its launch. From Sesame Street shorts to TV pilots, old footage quickly piled up! YouTube had an entire subculture of video remixes called YouTube Poops, which were made from recycled clips from old TV shows and games.

Alas this clip culture was both the boon and bane of early YouTube. As users uploaded these clips liberally, some of the owners and rights holders of the original source material of said clips came to view this practice as copyright infringement. This tension led to the infamous Viacom vs. YouTube case in 2007, where media giant Viacom sued YouTube and Google for $1 billion in alleged damages! If you are looking for a good summary, EmpLemon did a video on it a few years ago.

Viacom did not actually win the case; in fact, it came to light that they had taken advantage of clip culture for a stealth marketing campaign of their own. But the whole ordeal had lasting effects on YouTube. In an attempt to appease intellectual property owners, YouTube introduced their Content ID system, then called Video ID, for automatically detecting copyright-infringing videos.

(Video Identification ~ YouTube Advertisers. If the above video is unavailable please use this Wayback Snapshot)

All of a sudden, videos on YouTube became a whole lot more volatile. This automated system did not only take down a lot of infringing material, but it also hit false positives, matching short-length clips, remixes and video reviews as well. At one point you would have been lucky to have had a few of your videos deleted, as opposed to having your whole channel terminated for seemingly having one too many copyright strikes. Yet clip culture on YouTube has somehow been able to endure, even beyond this era.


You might be wondering how frequently videos on YouTube are being deleted. To put things into perspective, Archive Team ran a video survey between 2009 and 2010 to collect metadata on over 105 million public YouTube videos. By August 2010, 4 million items in this collection had been deleted, or 4.4%. This year, in 2021, a fellow Data Horde member investigated how many of the videos in this collection were still available. They estimated from a subset* of the 2009-2010 collection that an astounding 52% had been deleted, 4% had been made private, and about 44% remained viewable on the platform!

* The estimate was performed by crawling 50,239,844 videos from the dataset over the last 3 years.

The term reupload probably first entered the YouTube lexicon when users began uploading new, higher quality versions of videos on their channel as YouTube kept introducing higher caps to video quality. These YouTube upgrades came around the same time as Content ID, so you will find cases where the reupload of a video has survived but the original has been deleted.

It wasn’t just the video makers themselves who were reuploading, though. Soon other users also began reuploading downloaded copies they had made of their favorite YouTube videos. This was not merely due to fans appreciating content from their fellow YouTubers, but also due to the fact that the frequent channel terminations could deny the original uploader the right to reupload their channel’s videos in the first place.

YTPMV Remix: Planet Freedom, original by Igiulamam, reuploaded by oiramapap

Ironically, the term reupload was soon associated with degradation in quality as people began reuploading videos over and over again. There’s even a Gizmodo article about it from 2010. There have also been people who have complained about their work being reuploaded without permission or credit or, worse yet, plagiarised. Clearly, reuploads are a great power that came with great responsibility. Still, many diligent channels are dedicated to preserving the memory of original content through its reuploads.

The fear of such memories being lost through mass-deletions looms over YouTube, even today. Early Content ID was certainly not the last disaster to plague YouTube videos. Hacker pranks, copyright trolls, the Adpocalypse and Elsagate controversies have all taken their toll on many unfortunate channels. Today, we once more find ourselves on the brink of a scene similar to a mass-deletion, with the mass-privating of unlisted videos uploaded prior to 2017.


A few years ago it was discovered that YouTube video IDs were being generated according to a certain pattern and it was thus theoretically possible to predict video links. This presented a problem for unlisted videos, which were meant to be videos that were to be shared by link only.

Unlisted videos are a tricky subject; on one hand, a video might be unlisted, rather than privated, to make it easier to share with friends. On the other hand, many YouTubers also unlist videos such as outtakes, early revisions of videos, stream archives or off-topic content that might not fit their channel’s niche. Such videos are linked to, in video descriptions, pinned comments or Tweets. So while some unlisted videos aren’t meant for everyone’s eyes, other unlisted videos are only hidden from the channel interface and search results. Yet an exploit is an exploit, and URL predictability could be a serious problem for certain videos.

Some action certainly had to be taken here, so in 2017 the video ID formula was changed into something less predictable, which was definitely a step in the right direction. What is happening today, four and a half years later, is a security update to set a sizable number of unlisted videos uploaded prior to that date to private. Thus, several million videos have suddenly been virtually deleted, as they are no longer accessible to anyone but the channel owner. While this decision will secure potentially private content for many channels, it is also a great loss for inactive channels that unlisted videos liberally and were not able to opt out of the decision.

Our Unlisted Video Countdown on Twitter

On the bright side, channels which are still active can set their videos to public at a later date. In fact, YouTube goes so far as to encourage these channels to re-upload their own videos to be able to take advantage of the new URL system. Except, it’s not just the original uploaders and video makers who are reuploading. Reuploads from other users who had previously downloaded unlisted videos are also starting to pop up, the same as it ever was.

YTPMV Remix: 00000000.restored.wmv.
Original upload by HOZKINS, reuploaded by IAMGOOMBA, re-reuploaded by aydenrw.

With tools like youtube-dl or Reddit’s SaveVideo, the YouTube community is pulling together to salvage whatever they can from old unlisted videos. And they are only getting better: Archive Team’s unlisted video project hit over 200TB of data. As videos die off, here are some folks desperately trying to revive them, trying to uphold what one might call their online heritage.

A few days ago one of the oldest videos on YouTube was made public from unlisted. It was originally uploaded on April 29, 2005. Titled Premature Baldness, it too is a reupload and final memento from a chasebrown.com which is no longer recognizable. A whisper to remind us that while invoking the right to be forgotten we ought not to neglect, on the other hand, a right to be remembered…

]]>
https://datahorde.org/youtube-was-made-for-reuploads/feed/ 1
YouTube Community Contributions Archive Now Available: A Look at the Stats https://datahorde.org/youtube-community-contributions-archive-now-available-a-look-at-the-stats/ https://datahorde.org/youtube-community-contributions-archive-now-available-a-look-at-the-stats/#respond Fri, 05 Mar 2021 22:22:55 +0000 https://datahorde.org/?p=2091 The YouTube Community Contributions Archive is now available on the Internet Archive! You can download the entire collection, or simply search for and download files for a particular video. The collection is composed of 4096 ZIP archives which contain 406,394 folders and 1,361,998 files. Compressed, the collection is 3.83GB, and once decompressed, the collection is 9.46GB.

YouTube Community Contributions allowed users to create and translate closed captions/subtitles, titles, and descriptions of YouTube videos uploaded by channels who enabled the feature. Users could optionally choose to be credited for their captioning contributions.

While over 50 million videos were scanned for community contributions data, community contributions data was found for only 406,394 videos, indicating that the feature was used on only a small portion of the videos on YouTube. Some videos had YouTube Community Contributions enabled, but only had captions or metadata that was provided by the uploader. This accounted for 198,609 videos, meaning that 207,785 videos in the collection had community-contributed captions or metadata, further indicating that few videos on YouTube received community contributions. This means that approximately 0.4% of the videos that were scanned while creating this archive had community-contributed captions or metadata. This was likely because the community contributions feature was hard to discover in the YouTube interface, which limited the number of people who were aware of the feature.
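As a quick sanity check on the figures above, here is a minimal Python sketch of the arithmetic. The scanned total is an assumption: the article only says "over 50 million", so 50,000,000 is used as a round lower bound, which makes the computed share a slight overestimate.

```python
# Figures quoted in the article.
videos_with_data = 406_394   # videos where any contributions data was found
uploader_only = 198_609      # videos whose data was provided only by the uploader

# Assumption: the article only says "over 50 million" videos were scanned,
# so this round figure is a lower bound rather than the exact count.
videos_scanned = 50_000_000

# Videos that had at least some community-contributed captions or metadata.
community_videos = videos_with_data - uploader_only

# Share of all scanned videos that had community contributions.
community_share = community_videos / videos_scanned

print(f"{community_videos:,} videos, {community_share:.1%} of those scanned")
```

Under that assumption the subtraction gives the 207,785 videos quoted above, and the share rounds to the same 0.4%.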

Breaking down these numbers further, 80,746 videos had community-contributed draft metadata, 127,164 had community-contributed draft captions, 38,440 videos had community-contributed published metadata, 93,499 videos had community-contributed published captions, 179,366 videos had uploader-provided published metadata, and 225,466 videos had uploader-provided published captions.

YouTube Community Contributions allowed those who contributed captions to optionally be credited for their published work. 38,939 videos had credits for published captions created by the community. While captioning credits became inaccessible two weeks before the rest of the community contributions data became inaccessible, the number of videos that had captioning credits was still a considerably low number. It is estimated that, had the credits remained accessible until the rest of the community contributions feature was made inaccessible, about 80 thousand videos would have been found to have had credits.

The community contributions feature supported 196 languages, though not all languages were used equally. Below is a chart of the 25 most popular supported languages, and the number of videos that contain at least 1 file for each language (graphing all of the languages did not display well). This chart includes uploader-provided content.

When the query excludes the uploader-provided content, we see significant shifts in the 25 most popular supported languages.

This shift indicates that community contributions were often used to translate content.

A look at the language distribution of the collected metadata, including uploader-provided metadata, appears to be similar to the distribution of languages in the overall collection.

A look at just the community-provided metadata provides a slightly different distribution of data.

The distribution of captioning languages, including uploader-provided captions, is similar to the collection overall.

The distribution of captioning languages, excluding uploader-provided captions, also resembles the overall collection.

It is also interesting to look at the distribution of the draft community captions and metadata that were collected in comparison to the published community captions and metadata.

The published community contributions data appears to be more evenly distributed across languages compared to the draft community contributions data.

Some users contributed many captions and were credited for their work on many videos. In total, 83,563 channels appeared in our credits collection. On average, a channel was credited on 1.47 caption tracks. 55 channels were credited for more than 50 caption tracks, and 14 channels were credited for more than 100 caption tracks! The top three channels which were credited on the most caption tracks in our collection created 255, 522, and 912 caption tracks, respectively.

Thank you to everyone who contributed to this project! Additional details about the collection itself are available in the Internet Archive item description. If you have any additional questions, please feel free to join the project Discord server!

]]>
https://datahorde.org/youtube-community-contributions-archive-now-available-a-look-at-the-stats/feed/ 0
Yahoo! Groups Archive Metadata Now Available https://datahorde.org/yahoo-groups-archive-metadata-now-available/ https://datahorde.org/yahoo-groups-archive-metadata-now-available/#comments Sun, 06 Dec 2020 13:40:00 +0000 https://datahorde.org/?p=1849 After months of work and preparation, the metadata for over 1.1 million Yahoo! Groups retrieved by Archive Team’s Python script as well as from other grabs has been organized and is now available on the Internet Archive. Special thanks to Doranwen for organizing this data.

Yahoo! Groups’ mailing lists, which are the last remaining part of Yahoo! Groups, will be shutting down in 10 days, on December 15, 2020. However, since group content is no longer accessible to the public, there is little left to archive.

Next year, volunteers will be needed to sort and organize the full group data so related groups can be uploaded to the Internet Archive together. This will make it easier to access and browse archives for multiple groups related to similar topics.

For more information about Yahoo! Groups, please see Doranwen’s blog or our Yahoo! Groups articles.

]]>
https://datahorde.org/yahoo-groups-archive-metadata-now-available/feed/ 10
A New Breed of Digital Archiving and Preservation https://datahorde.org/a-new-breed-of-digital-archiving-and-preservation/ https://datahorde.org/a-new-breed-of-digital-archiving-and-preservation/#respond Thu, 05 Nov 2020 23:05:42 +0000 https://datahorde.org/?p=1711 There it is, because someone thought it ought to be out there. Perhaps a story you read, a picture you saw or even a game you played… It was there because someone poured their heart and soul into making it, and it mattered.

Alas we find ourselves in an age where everyone collectively suffers from short-term memory loss. All that is ever on our minds is what’s relevant -here and now- and everything that is irrelevant is as good as imaginary.

The digital archivist or preservationist’s job is, ultimately, to save those things that matter. To that end they go to great lengths: downloading terabytes of data, reverse engineering decade-old websites or even hunting down the source code for the most obscure software.

But it is not an easy job and it certainly is not getting any easier. Things are disappearing at too fast a rate for even the most attentive archivists to be able to keep up. Today, the digital archivist is fighting against currents they cannot overcome, with outdated wisdom. As it were, the archivist ought to find a way to swim with these currents, by taking advantage of tools and options that would have previously been unavailable to them.

In honor of Digital Preservation Day, I myself, as a digital archivist, would like to offer my own two cents for the next generation, that is to say a new breed of digital archivists. The three As: Adaptability, Acceptance and Acknowledgment.


Adaptability:

Keeping track of what is being retired, what websites are dying and mobilizing as quickly as possible!
flat lay shot of tools
Photo by Miguel Á. Padriñán on Pexels.com

This past September deserves to go down in history as Shutdown September, seeing how many websites were shut down that month. And honestly, archivists were barely able to keep up.

  • Archive Team only got to work on archiving the massive Chinese social media site Tencent Weibo about 10 days after the shutdown announcement. While 248 TB is an impressive feat, it’s only a fraction of the web content on Tencent Weibo. A lot more could have been grabbed had action been taken sooner.
  • A similar case was the shutdown of Naver Matome, a kind of Japanese tumbleblog. Despite the early shutdown announcement in July, Archive Team’s archival project only began months later in September, with only about a week to spare.
  • YouTube recently removed their community contributions feature and Data Horde started a project to save unpublished drafts. Although the feature removal was known for 2 months, it took us weeks to notice that drafts were at risk and even longer to note that YouTube had restricted the feature last year, leading to many drafts never being published. While we were lucky in that the drafts remained accessible for a month beyond the expected deadline, we might not have been as lucky.

Clearly, there is a need for a watchdog, or two, or three, to be able to inform preservation groups of websites which are closing down, or features being retired before the last minute. While tech news sites like the Verge might occasionally report on shutdowns, these are generally restricted to English websites. As for archivists, there are mainly three outlets:

  1. Archive Team’s Watchlist page.
  2. The Internet Archive Blog, if they are involved.
  3. And us, Data Horde, whenever we find out about a shutdown.

Other than that, it’s a matter of luck if a shutdown announcement makes it to the top page on Hacker News or Reddit. This is unreliable and we need to do something about it.

For starters, we need to monitor not only individual websites, but also massive platforms like YouTube, Twitter and Reddit. We need our own unofficial, open documentation that notes planned changes when they are hinted at in tweets or in blogs, prior to official announcements, to be able to mobilize dynamically.

As for websites, especially non-English ones, we need to make it easier for non-archivists who are concerned to be able to reach out to us. Which brings us to the second A: Acceptance…


Acceptance:

Come as you are, we do not just need programmers and librarians, anyone can contribute to preservation and everyone should!
multicolored umbrella
Photo by Sharon McCutcheon on Pexels.com

The archiving community is, by and large, in favor of collaboration and open source. But currently this only applies among archivists and preservationists. Even with the source code and tools out in the open, the average person will most likely lack the technical knowledge to understand what the hey they’re looking at.

As difficult as it might be to admit, digital archiving is not very well known. Even the words preservation or conservation evoke ancient manuscripts, or endangered animals. Digital preservation is far from the first thing to come to mind and this degree of obscurity is not something to be desired.

Even if this obscurity comes with a sense of pride from the joy of being one of a select few who know this craft, it has also come back to bite us. There is a good reason that a lot of people who discovered the Internet Archive through its National Emergency Library experiment earlier this year weren’t the biggest of fans. Naturally, people were more inclined to trust the verdict of their favorite authors or speculate rather than to go and read what the NEL actually was and how it was justified.

If we, as preservationists, don’t promote our own work, why should other people? For all they know, we are all just rogue people with malicious intent. Then are we to grow old reclusively in our obscure hobby? Seeing how few of us archivists there are out there, I find it sad how many of us have made a name for ourselves as “grumpy old men”. We don’t have to be vagabonds, and we shouldn’t be. Because if we choose to alienate ourselves from the rest of the world, amateurs will develop their own archiving techniques to take our place.

  • The Save Yahoo Groups/Yahoo Geddon project was initially led by fandom. While Yahoo Groups might not mean a lot in 2020, many fan groups trace their origins to old mailing lists, some of which Yahoo had later acquired. When it was announced that all public groups would be privated, those people knew what was at stake; the history, the works, the memories of two decades.
    So they blindly charged in. None of them were proper digital preservationists, even if some members might have had an affinity for it. But they organized and developed their own method of hunting down and tracking downloads for ancient groups that the world had forgotten.
  • Another similar project is BlueMaxima’s Flashpoint, a massive effort to preserve Flash and other multimedia web-content which will (or might) break in the future due to incompatibility. Initial volunteers to the project had some programming knowledge and were motivated to preserve games from their childhood. But they discovered that if they laid down a clear path, other people in a similar situation would be more than happy to contribute.
    They developed their own tools for downloading and curating games into their Flashpoint collection so volunteers would not have to start from scratch. And they did not shy away from letting their project be publicized.

Both of these projects have come a long way, in large part due to the sheer enthusiasm to initiate these projects and later due to extensive help from seasoned archivists who took note of these ideas and supported and nurtured them.

The bottom line is, we need to not shoo people away, but embrace them. If an apprenticeship system seems too degrading, at the very least we, as archivists, should take note from Save Yahoo Groups and Flashpoint when it comes to writing our tutorials and publicizing our projects.

We shouldn’t merely lurk on IRC chatrooms, we need to be able to reach the same people we’re trying to help, even if it’s on social media. DEF CON is cool, but wouldn’t it be nice if we could stand on our own two legs and had a convention of our own?

And when people come to us, asking for help, with no credentials whatsoever, we need to learn how to help. Which brings us to the third A: Acknowledgment.


Acknowledgment:

Understanding our circumstances and constraints, recognizing that we are not all on equal footing and that we might have very different goals and respecting one another.
photo of people near wooden table
Photo by fauxels on Pexels.com

It is entirely possible for a seasoned archivist to encounter someone who is much less knowledgeable in archiving, or even programming, asking for help. And as someone who dedicates their time to observing different archiving communities, I can acknowledge that every group has their own focus and policy.

  • Mature archiving groups generally have a lot of tutorials and hardware they are more than happy to share, for their particular focus. Here, let’s acknowledge that every group has their own particular focus: information on a lost commercial is likelier to show up on the Lost Media Wiki, while news on a delisted game is more Dead Game News territory.
  • Semimature archiving groups generally function as cliques; they have the expertise but are constrained by resources. Anyone expecting help from higher-ups within the group should first make a name for themselves, to prove that their ideas are worth the high council’s time.
  • Some other groups are one-man armies; they find a very specific niche and document whatever they can. They might look to expand, or insist their archive is only for the fun of it.

So mature and small groups are comparatively stable, whereas semimature groups are considerably volatile. This is not really something we can change, so instead we should learn to acknowledge it. Some groups are always itching to lend a hand but other groups really won’t be willing to give outsiders, who have not proven their value and sincerity, the time of day.

Next we have to acknowledge expenses. Preservation is expensive, digital or otherwise. For some communities digitizing physical material is not easy, as the incomplete digitizations prior to the unfortunate fire at the National Museum of Brazil go to show. It is a harsh reality that we do not receive the necessary subsidies from states, often relying on donations.

Again, mature groups are self-sustaining. For every member who retires, there will be new volunteers joining in. They also often have some form of donation system to maintain and expand their hardware. As for archiving groups which are still developing, they will have their ups and downs. A single Tweet or Reddit post can bring in tens of new people overnight. Conversely, it’s not uncommon for splinter groups to emerge if their focus diverges from the majority. Money and hardware might be supplied through donations, but again it could always be cut off. Most notably, The Eye has been experiencing a lot of downtime and struggling to cover costs these past few months. Of course, if you can’t cover the costs in the first place you’re better off avoiding anything ambitious.

As much as we archivists advocate for open access, it’s a harsh reality that we are more often than not unable to actually provide this. People outside of archiving circles have already discovered how to turn a profit out of archived material, through ad revenue or paywalls. Take getdaytrends.com as an example, where you can look up Twitter trends per day, while they cover their costs through ads. Then perhaps some of us smaller archiving groups could survive through such practices, until we can reach a more mature stage.

But as a maxim: Access should be opened chronologically, so that nothing which has been opened before is closed off. (Ex. uploads from the last year are behind a paywall, but on 1 January 2021, uploads from 1 January 2020 go open access.)
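To make the maxim concrete, here is a minimal sketch of such a rolling embargo in Python. The function names and the one-year default are hypothetical, chosen only to mirror the example above; no archive mentioned here is known to implement it this way.

```python
from datetime import date


def open_access_date(uploaded: date, embargo_years: int = 1) -> date:
    """Return the date on which an upload leaves the paywall."""
    try:
        return uploaded.replace(year=uploaded.year + embargo_years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year, so this sketch
        # releases such uploads on 1 March instead (an arbitrary choice).
        return date(uploaded.year + embargo_years, 3, 1)


def is_open_access(uploaded: date, today: date, embargo_years: int = 1) -> bool:
    """True once the embargo on an upload has expired."""
    return today >= open_access_date(uploaded, embargo_years)


# The example from the text: on 1 January 2021, uploads from 1 January 2020 open up.
print(is_open_access(date(2020, 1, 1), today=date(2021, 1, 1)))
# An upload made one day later is still behind the paywall on that date.
print(is_open_access(date(2020, 1, 2), today=date(2021, 1, 1)))
```

Because the release date depends only on the upload date and the calendar only moves forward, an item that has opened never closes again, provided the embargo length is never increased after the fact.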

Let’s talk a bit about a community which actually is trying something similar to this, namely OldGameMags. OldGameMags was only a small group of people who’d met each other over the internet and had been pooling resources to gather old game magazines. And they were barely receiving any donations. So they came up with a clever idea: To gain access you either had to make a one-time donation, or actually send in a magazine to be scanned into their collection.

Except not everyone was happy with this. Back in June, OldGameMags and the Internet Archive were at a stand-off. OldGameMags proprietor Kiwi discovered that some of their magazine scans were being uploaded to the Internet Archive and, surprisingly, this was a repeated offense. While the Internet Archive is a massive, well-funded and well-known archiving platform, OldGameMags is a small group who gathered everything they had with blood, sweat and tears. From a utilitarian point of view, the IA volunteer who was mirroring OGM might have found a home which could host these with a much lower burden and make them accessible for a cheaper cost. But on the other hand, if they continued mirroring everything on OGM, there would no longer be any reason for people to join OGM. This was their passion and from OGM’s point of view, it was as good as stealing (even if a donation had been made). Eventually the two sides were able to reach a settlement, but if IA and OGM had been able to better communicate their intentions, such a conflict of interest might have been resolved a lot sooner.

Then finally, as it stands, we must acknowledge one another. We’re all in this preservation business together after all, even if our goals might differ. To reiterate, most of us don’t have states, companies or foundations subsidizing our efforts. So we ought to support one another, technically and financially.

This is why we need arbiters such as the DPC, SPN and IIPC to bring us together. We need to do what we can to preserve the preservationists, because nobody else will! The best way to do that is to acknowledge one another, working synergistically and doing our best to not offend one another.


shallow focus photography of hourglass
Photo by Jordan Benton on Pexels.com

Preservation means preserving the past, yet we do it not for the past, but for the future. Then what right do we have to object so vehemently to getting with the times? The world is changing, and it’s high time we caught up.

We need adaptability to achieve dynamism, to keep up with the rapid decay of digital information. We need acceptance to foster new ideas and train new archivists. And we need acknowledgment to protect and support one another.

So as we enter a new decade, may these words usher in a new era. A new era for a new breed of digital archiving and preservation…

]]>
https://datahorde.org/a-new-breed-of-digital-archiving-and-preservation/feed/ 0
We Just Rescued Thousands of Unpublished YouTube Captions https://datahorde.org/we-just-rescued-thousands-of-unpublished-youtube-captions/ https://datahorde.org/we-just-rescued-thousands-of-unpublished-youtube-captions/#respond Fri, 30 Oct 2020 21:33:41 +0000 https://datahorde.org/?p=1690 Community contributions were a feature on YouTube which allowed viewers to provide translations and captions for their favorite channels. Last year, YouTube realized that the feature had some problems and so began restricting it. And this year, believing the feature to be broken beyond salvation, they decided to axe it for good.

Unfortunately, in the process they were going to be getting rid of caption drafts, some of which were complete but stuck in review. So, Data Horde initiated a project to grab as many of these unpublished captions as possible, with a lot of assistance from Archive Team.

Although the feature was officially removed on September 28, we were able to continue accessing caption drafts for a whole month, until the endpoint was cut off at around 8 PM (UTC), October 28. In total, we scanned and pooled nearly 52 million items for drafts, including videos, channels, playlists, and mix playlists. We also have two or three other bulky collections which were retrieved manually by archivists. In the coming days we will be working on organizing these drafts, with the hopes of giving them a collection on the Internet Archive.

We also have a few other ideas in mind for what to do with this massive collection of captions, so stay tuned these next couple of days to find out! In the meantime, check out our YouTube Captioner’s Toolkit page for information on alternatives for the retired community captions feature.

]]>
https://datahorde.org/we-just-rescued-thousands-of-unpublished-youtube-captions/feed/ 0
October Status Update on the Save Yahoo Groups! Project https://datahorde.org/october-status-update-on-the-save-yahoo-groups-project/ https://datahorde.org/october-status-update-on-the-save-yahoo-groups-project/#respond Thu, 15 Oct 2020 23:00:35 +0000 https://datahorde.org/?p=1631 Last November, Yahoo announced that they would be shutting down many key features on the ancient Yahoo Groups. There was a major project to rescue data, led by Archive Team and fandoms who traced their origins to Yahoo Groups. In fact we had written all about it back in January:

The story did not end there however. So let’s talk about what has transpired since…


Although we had reported 30 January as the final deadline, Yahoo continued to accept Get My Data (GMD) requests for about a week after that, so active efforts only ceased around then. Then came the waiting game, as it took a few more weeks for some of those GMD requests to process.

By late February, most of the volunteers had disbanded or moved on to other projects. But there was still much to be done. For one thing, people had rushed so much to grab everything that they could, that a lot of these group files were a total mess, not made any better by how Yahoo’s GMD exports worked. So the remaining volunteers stuck around to label their massive collection.

Doranwen, one of the leads on the Yahoo-Geddon (aka Save Yahoo Groups) project, frequently documented their progress during this time.

A few numbers and random other bits of info:

~2 TB of fandom data saved (that I know of, for now)
~200,000 confirmed fandom groups saved in some fashion
~2,000 Sims groups saved* …

*The only reason I know the Sims number is because I was tracking those groups on Google spreadsheets in order to find all of them and get volunteers to join them. For other fandoms it’s impossible to give any sort of number at this point (although I know there was a ton of LOTR, HP, Buffy, and Westlife, lol). Yahoo’s categorization was terrible and a group name doesn’t always give good clues as to whether it’s fandom/non-fandom. Getting that sort of data will take a good deal of time and work.

Doranwen, The end of Yahoo Groups – a few thoughts & stats

Another issue was that the collection was not actually unified. Archive Team had also archived a bunch of data, so the Yahoo-Geddon team continued to label those batch by batch for a few more months.

It truly is endless!!

Yahoo-Geddon volunteer, 14 July 2020

Yet another reason the Yahoo-Geddon team was taking so long was how meticulous they were. They worked to curate this collection not only for the sake of archiving and tracing the history of fandom, but also to provide a rich dataset that researchers might want to use in the future.

-[Stage] 4.5b: Remember that we got a bunch of groups from scrounging the links of other groups for new groups to join? Some of the commands used to process that data generated “groups” that never existed (with http: stuck at the end, apostrophes or commas in them, etc.). Also one stage of the spreadsheet work ended up with a certain number of groups getting a duplicate version added to the spreadsheet with _dupe after the name.

So for this stage I send the spreadsheets to my assistant who runs a script against them to find groups with punctuation in them or _dupe at the end. A very very tiny number of very old (grandfathered from who knows which list service) groups actually legitimately have periods in their names, but in most cases groups with periods never existed either.

This process is fairly quick for each letter but varies greatly in what has to be done, as sometimes group folders are affected (and some punctuation marks Yahoo simply ignored everything from that mark onwards and treated the letters before it as a group name).

Yahoo Groups metadata processing steps, stage 4.5b
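
The check described in stage 4.5b could be sketched roughly as follows. This is not the Yahoo-Geddon team's actual script, which is not shown here; the function name, the reason labels, and the assumption that legitimate group names contain only letters, digits, underscores, and hyphens are all hypothetical and for illustration only. Names containing periods are flagged separately for manual review, since a small number of them were real.

```python
import re

# Assumption (hypothetical): a legitimate group name uses only these characters.
ALLOWED = re.compile(r"[A-Za-z0-9_-]+")
DUPE_SUFFIX = "_dupe"


def flag_group(name: str) -> list[str]:
    """Return the reasons a spreadsheet entry looks suspicious (empty if none)."""
    reasons = []
    base = name

    # Entries duplicated during spreadsheet work carry a "_dupe" suffix.
    if name.endswith(DUPE_SUFFIX):
        reasons.append("dupe-suffix")
        base = name[: -len(DUPE_SUFFIX)]

    # Periods are usually bogus but occasionally legitimate: review by hand.
    if "." in base:
        reasons.append("period-needs-review")

    # Anything else outside the allowed set (colons, apostrophes, commas, ...)
    # suggests a "group" produced by a faulty link-scraping command.
    if not ALLOWED.fullmatch(base.replace(".", "")):
        reasons.append("punctuation")

    return reasons


if __name__ == "__main__":
    for candidate in ["buffyfans", "lotr_fic_dupe", "simshttp:", "old.list"]:
        print(candidate, flag_group(candidate))
```

Returning a list of reasons instead of a single yes/no keeps the manual step the quote mentions: entries flagged only for a period can be checked by a person before anything is deleted.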

Sadly, Yahoo!, blind as ever to Yahoo-Geddon’s efforts, has decided to permanently shut down Yahoo Groups. Although Yahoo Groups had retained only its bare-bones features, this will put an end to some decade-old mailing lists…

On a related note, an interesting discovery Yahoo-Geddon made is that Yahoo has not actually deleted archives, photos, and files, but has only removed public access to them.

The files are still there, from what I can tell! They’ve just blocked us from getting to them.

The monthly reminder emails with attachments are still coming in – and the attachments come from files in the files sections. Clearly those were never removed.

Which means that Yahoo could have chosen to grant us access to all of that for a full year before closing Groups entirely, but did not.

via the Save Yahoo Groups Discord server

It just goes to show that curation is one half of archiving/preservation… If you would like to learn more or even participate in the Yahoo Groups dissection, check out the Save Yahoo Groups Discord server: https://discord.com/invite/DyCNddf

]]>
https://datahorde.org/october-status-update-on-the-save-yahoo-groups-project/feed/ 0