project – Data Horde https://datahorde.org Join the Horde! Sun, 20 Jun 2021 22:55:06 +0000 en-US hourly 1 https://wordpress.org/?v=6.4.3 https://datahorde.org/wp-content/uploads/2020/04/cropped-DataHorde_Logo_small-32x32.png project – Data Horde https://datahorde.org 32 32 YouTube Community Contributions Archive Now Available: A Look at the Stats https://datahorde.org/youtube-community-contributions-archive-now-available-a-look-at-the-stats/ https://datahorde.org/youtube-community-contributions-archive-now-available-a-look-at-the-stats/#respond Fri, 05 Mar 2021 22:22:55 +0000 https://datahorde.org/?p=2091 The YouTube Community Contributions Archive is now available on the Internet Archive! You can download the entire collection, or simply search for and download files for a particular video. The collection is composed of 4096 ZIP archives which contain 406,394 folders and 1,361,998 files. Compressed, the collection is 3.83GB, and once decompressed, the collection is 9.46GB.

YouTube Community Contributions allowed users to create and translate closed captions/subtitles, titles, and descriptions of YouTube videos uploaded by channels who enabled the feature. Users could optionally choose to be credited for their captioning contributions.

While over 50 million videos were scanned for community contributions data, such data was found for only 406,394 videos, indicating that the feature was used on only a small portion of the videos on YouTube. Of these, 198,609 videos had YouTube Community Contributions enabled but carried only captions or metadata provided by the uploader, which leaves 207,785 videos in the collection with community-contributed captions or metadata. That is approximately 0.4% of the videos that were scanned while creating this archive. This was likely because the community contributions feature was hard to discover in the YouTube interface, which limited the number of people who were aware of it.
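The percentages above can be re-derived from the quoted counts. The snippet below is only a rough check: the exact number of scanned videos is not given, so it assumes the lower bound of 50 million, and the variable names are illustrative.

```python
# Figures quoted in the article.
videos_with_data = 406_394   # videos with any community contributions data
uploader_only = 198_609      # videos with uploader-provided content only

# Assumption: "over 50 million" is treated as exactly 50 million,
# so the first share below is an upper estimate.
scanned_lower_bound = 50_000_000

community_contributed = videos_with_data - uploader_only
share_of_scanned = community_contributed / scanned_lower_bound
share_of_collection = community_contributed / videos_with_data

print(community_contributed)          # 207785
print(f"{share_of_scanned:.2%}")      # 0.42%
print(f"{share_of_collection:.1%}")   # 51.1%
```

Under that assumption, roughly half of the videos in the collection carry community-contributed content, while the share of all scanned videos stays below half a percent.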

Breaking down these numbers further, 80,746 videos had community-contributed draft metadata, 127,164 had community-contributed draft captions, 38,440 videos had community-contributed published metadata, 93,499 videos had community-contributed published captions, 179,366 videos had uploader-provided published metadata, and 225,466 videos had uploader-provided published captions.

YouTube Community Contributions allowed those who contributed captions to optionally be credited for their published work. 38,939 videos had credits for published captions created by the community. Captioning credits became inaccessible two weeks before the rest of the community contributions data, so this count is incomplete; even so, the number of videos with captioning credits was low. It is estimated that, had the credits remained accessible until the rest of the community contributions feature was made inaccessible, about 80 thousand videos would have been found to have credits.

The community contributions feature supported 196 languages, though not all languages were used equally. Below is a chart of the 25 most popular supported languages, and the number of videos that contain at least 1 file for each language (graphing all of the languages did not display well). This chart includes uploader-provided content.

When the query excludes the uploader-provided content, we see significant shifts in the 25 most popular supported languages.

This shift indicates that community contributions were often used to translate content.

A look at the language distribution of the collected metadata, including uploader-provided metadata, appears to be similar to the distribution of languages in the overall collection.

A look at just the community-provided metadata shows a slightly different distribution.

The distribution of captioning languages, including uploader-provided captions, is similar to the collection overall.

The distribution of captioning languages, excluding uploader-provided captions, also resembles the overall collection.

It is also interesting to look at the distribution of the draft community captions and metadata that were collected in comparison to the published community captions and metadata.

The published community contributions data appears to be more evenly distributed across languages compared to the draft community contributions data.

Some users contributed many captions and were credited for their work on many videos. In total, 83,563 channels appeared in our credits collection. On average, a channel was credited on 1.47 caption tracks. 55 channels were credited for more than 50 caption tracks, and 14 channels were credited for more than 100 caption tracks! The top three channels which were credited on the most caption tracks in our collection created 255, 522, and 912 caption tracks, respectively.

Thank you to everyone who contributed to this project! Additional details about the collection itself are available in the Internet Archive item description. If you have any additional questions, please feel free to join the project Discord server!

]]>
https://datahorde.org/youtube-community-contributions-archive-now-available-a-look-at-the-stats/feed/ 0
A Correction on MediaFire’s Account Removal Policy https://datahorde.org/a-correction-on-mediafires-account-removal-policy/ https://datahorde.org/a-correction-on-mediafires-account-removal-policy/#respond Mon, 25 Jan 2021 00:24:16 +0000 https://datahorde.org/?p=1984 A month ago we reported claims that MediaFire would be purging inactive users in January. At the time we urged readers to log into their Mediafire accounts to reaffirm their activity and to consider looking into alternative file hosting services.

However, in light of recent events, Data Horde would like to issue a correction. Mediafire co-founder Tom Langridge was kind enough to comment on the situation, stating that there had, in fact, been no change of policy whatsoever.

https://twitter.com/tlangridge/status/1349515837351231488

Although, as stated previously, Mediafire’s data retention policy has been updated with more concise wording, Langridge’s comment would indicate that this change was made purely out of formality.


This raises the question: why was there a sudden influx of “warning e-mails” in the first place? The warning e-mails aren’t anything new, but the sudden increase in their frequency put many archivists on alert. One possible explanation is that this isn’t Mediafire’s fault at all!


Even in 2021, Mediafire is one of the most visited websites on the web. Interestingly, though, search interest for Mediafire has steadily gone down for years.

This suggests that disproportionately more of Mediafire’s traffic comes from visitors who are downloading files they have been linked to, rather than from users who are actively uploading new files. If anything, uploaders (or potential uploaders) appear to be searching for Mediafire far less frequently. So perhaps the cause of the warning e-mail influx was a dip in overall uploader activity.

Mediafire account removals and file deletions might be, and as far as we know are, happening; but not because of any particular change in Mediafire’s policy.


So where does that leave us? Archive Team is continuing with their #mediaonfire project, since it’s better to be proactive than reactive in cases such as this:

As for Data Horde, we have recently initiated a project of our own called MediaFlare. We have a simple link submission form where anyone can report important files worth preserving, which we collect into a public database. Archivists and data hoarders can then mirror files listed in this database to ensure their survival for years to come.

Many online communities rely on old uploads to Mediafire, even after accounts have gone inactive. With MediaFlare we aim to prevent situations like this:

I noticed this yesterday, with a fellow preservationist asking me to grab a bunch of their files off of MediaFire and some were blocked and removed already. Scans for crying out loud.

via Jonas Roland of Hit Save!

Again, we apologize for misreading MediaFire’s policy change. But that being said, it is still very much a good idea to log back into old accounts. And in the case where this is no longer possible, we hope that you would be willing to assist us on the MediaFlare project, so that we can preserve valuable files for years to come!

]]>
https://datahorde.org/a-correction-on-mediafires-account-removal-policy/feed/ 0
Log into your Account: Mediafire to Purge Inactive Users! https://datahorde.org/log-into-your-account-mediafire-to-purge-inactive-users/ https://datahorde.org/log-into-your-account-mediafire-to-purge-inactive-users/#comments Thu, 24 Dec 2020 19:12:07 +0000 https://datahorde.org/?p=1890 Mediafire has been a popular file sharing and storage site for over a decade. Users simply upload their files to the system and can share them with the world via a link. Though Mediafire might have been more prominent in the era before the likes of Dropbox and Google Drive, 14 years of history is certainly nothing to look down on.

Every month, more than 150 million people use MediaFire to quickly and securely store, organize and share all their personal and professional data in the cloud.

Mediafire about page

Sadly, recent changes in Mediafire’s file retention policy are threatening this long history. The policy has introduced a new definition for Abandoned Accounts. Free registered* users who haven’t logged into their account for 8 months will have their account considered “abandoned”. These accounts will be given a 15-day grace period, during which Mediafire will warn the user through two e-mails to log back in if they want to keep their account; failing to do so will result in deletion of the account and its data.

* free, as in non-premium registered users. Failing to verify an email address on a free account will lead to a termination within 14 days.

Interestingly, the file retention policy page previously used more ambiguous language, and this “8 month” criterion has been there for a few months now. The last revision on the page was in May, so assuming the first 8-month countdown started then, the first wave of account deletions will be triggered on January 11, 2021. Accounting for the 15-day grace period, we can put the hard deadline at January 26, 2021.
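The deadline arithmetic can be sketched as follows. The exact revision date is not stated; May 11, 2020 is assumed here only because it is the date that lands on January 11 after eight months. The `add_months` helper is hypothetical, written for this illustration because the standard library has no month-offset function. Note that eight months after May 2020 falls in January 2021.

```python
import calendar
from datetime import date, timedelta


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`,
    clamping the day to the length of the target month."""
    total = start.month - 1 + months
    year = start.year + total // 12
    month = total % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


# Assumption: the policy page was last revised on May 11, 2020.
last_revision = date(2020, 5, 11)

# 8 months of inactivity marks an account as abandoned.
abandoned_on = add_months(last_revision, 8)

# A 15-day grace period follows before deletion.
hard_deadline = abandoned_on + timedelta(days=15)

print(abandoned_on)   # 2021-01-11
print(hard_deadline)  # 2021-01-26
```

The same calculation applies to any individual account by replacing `last_revision` with the date of that account's last login.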

So if you have a Mediafire account, we highly recommend that you log in today, and also once every couple of months for good measure, to keep your files safe!


What about users who’ve lost access to their accounts (forgotten password, change in email addresses)? It would seem that the policy makes no provision for these accounts: they qualify as abandoned even if the files they uploaded a decade ago are still being accessed to this day.

Content is retained in accounts based on account activity and not individual file activity.

Current Mediafire Account Retention Policy

In particular many modding communities, such as Mario Kart Modders, are heavily reliant on files which have been kept safe on Mediafire for years. As is to be expected, Archive Team has sprung into action, and is trying to set up a large-scale project to archive Mediafire files, aptly named #mediaonfire.

But there’s something that makes archiving Mediafire files particularly challenging. As you will recall, Mediafire files are shared by links, and Mediafire doesn’t really have something akin to a search function. This means that the only way to find a particular file is to find the link where it has been shared. A high-res background might have been shared on a forum, old podcasts on a personal website… Any way you slice it, it’s going to take a lot of time and effort to track even some of these links down.

The good news is that new solutions are being proposed every day. Pyxia’s mf-dl tool allows you to bulk-download files if you already know their links.


So spread the word! Tell your friends to log in to their Mediafire accounts, consider migrating your files to safer services, and make a copy of any Mediafire links or files you come across. You never know what might happen!

]]>
https://datahorde.org/log-into-your-account-mediafire-to-purge-inactive-users/feed/ 1
We Just Rescued Thousands of Unpublished YouTube Captions https://datahorde.org/we-just-rescued-thousands-of-unpublished-youtube-captions/ https://datahorde.org/we-just-rescued-thousands-of-unpublished-youtube-captions/#respond Fri, 30 Oct 2020 21:33:41 +0000 https://datahorde.org/?p=1690 Community contributions were a feature on YouTube which allowed viewers to provide translations and captions for their favorite channels. Last year, YouTube realized that the feature had some problems and so began restricting it. And this year, believing the feature to be broken beyond salvation, they decided to axe it for good.

Unfortunately, in the process they were going to be getting rid of caption drafts, some of which were complete but stuck in review. So, Data Horde initiated a project to grab as many of these unpublished captions as possible, with a lot of assistance from Archive Team.

Although the feature was officially removed on September 28, we were able to continue accessing caption drafts for a whole month, until the endpoint was cut off at around 8 PM (UTC) on October 28. In total, we scanned and pooled nearly 52 million items for drafts, including videos, channels, playlists, and mix playlists. We also have two or three other bulky collections which were retrieved manually by archivists. In the coming days we will be working on organizing these drafts, with the hope of giving them a collection on the Internet Archive.

We also have a few other ideas in mind for what to do with this massive collection of captions, so stay tuned these next couple of days to find out! In the meantime, check out our YouTube Captioner’s Toolkit page for information on alternatives to the retired community captions feature.

]]>
https://datahorde.org/we-just-rescued-thousands-of-unpublished-youtube-captions/feed/ 0
Help Archive YouTube’s Community Contributions! https://datahorde.org/help-archive-youtubes-community-contributions/ https://datahorde.org/help-archive-youtubes-community-contributions/#respond Sat, 26 Sep 2020 00:27:21 +0000 https://datahorde.org/?p=1478 YouTube is removing their community contributions feature on September 28. In case you haven’t already heard, that’s the feature which allows viewers to add captions/subtitles, translated titles and video descriptions on videos. And YouTube seems to be pretty insistent on removing the feature, despite massive backlash.

Now, although YouTube have given their word to keep published community captions (and other contributions) online, there’s a small detail many people have overlooked. Last year, YouTube restricted the feature so that only uploaders could publish contributions. As such, there are a great many unpublished captions and title/description translations stuck in review. Furthermore, no information has been given on the fate of Caption Credits (people who opted to have their name shown).

Although unpublished on videos, these contributions are still visible in the community captions editor. So for the last few days we have been developing a tool to archive all this data! We have finally reached a mature enough stage that anyone reading this can now run the “YouTube Community Contribution Archiver” (YCCA) on their computer, to help us collect as many of these contribution drafts as we can:

https://github.com/Data-Horde/ytcc-archive

Ideally, channels would publish the contributions on their own videos, not only from a moral standpoint but also because our archiving method loses some information (formatting, stylization, authors of unpublished captions, etc.). So beyond archiving these, we’ve also done our best to reach out to content creators across YouTube.

The good news is that we won’t be archiving these for naught: projects such as YouTubexternal CC will likely be a new home for these captions and other content which have been trapped for so long.

We also have a Discord server where we are coordinating all these efforts, so feel free to hop on board if you have any questions or want to just meet the team!

Discord

Good Luck archiving! Click here to view current stats.

For further context on how we wound up in this predicament, check out our YouTube CC History series:

Part 1: Unusual Beginnings on Google Video

Part 2: Pioneering Online Accessibility

Part 3: Scaling the Waterfall, Captions for All

Part 4: The Untold Story of why YouTube is removing Community Contributions

]]>
https://datahorde.org/help-archive-youtubes-community-contributions/feed/ 0