The YouTube Community Contributions Archive is now available on the Internet Archive! You can download the entire collection, or simply search for and download the files for a particular video. The collection consists of 4,096 ZIP archives containing 406,394 folders and 1,361,998 files. It is 3.83 GB compressed and 9.46 GB once decompressed.
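If you download some or all of the ZIP archives, a short script can locate the files for one video without unpacking everything. The following is a minimal sketch, not a description of the archive's actual layout: it assumes the ZIP files sit together in one local directory and that the video ID appears somewhere in each member path (for example, as the folder name). The function name `find_video_files` and the directory name in the usage note are hypothetical.

```python
import zipfile
from pathlib import Path


def find_video_files(archive_dir, video_id):
    """Return (zip file name, member path) pairs whose path mentions video_id."""
    matches = []
    for zip_path in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            for member in archive.namelist():
                # Assumption: the video ID is part of the member path.
                if video_id in member:
                    matches.append((zip_path.name, member))
    return matches
```

Calling `find_video_files("downloads", "SOME_VIDEO_ID")` would then return every matching entry, which can be extracted with `zipfile.ZipFile.extract`. Check the Internet Archive item description for the real folder structure before relying on this.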
YouTube Community Contributions allowed users to create and translate closed captions/subtitles, titles, and descriptions for YouTube videos uploaded by channels that had enabled the feature. Users could choose to be credited for their captioning contributions.
Although more than 50 million videos were scanned for community contributions data, such data was found for only 406,394 of them. Of these, 198,609 videos had YouTube Community Contributions enabled but contained only captions or metadata provided by the uploader, leaving 207,785 videos in the collection with community-contributed captions or metadata. That is approximately 0.4% of the videos scanned while creating this archive, which indicates that the feature was used on only a small portion of the videos on YouTube. This was likely because the feature was hard to discover in the YouTube interface, which limited the number of people who were aware of it.
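The arithmetic behind these figures can be checked directly. The sketch below uses only the numbers quoted above. It treats 50 million as the number of videos scanned, which is an assumption, since the text says only "over 50 million"; the resulting share is therefore an upper bound.

```python
# Figures quoted in the text above.
videos_with_data = 406_394   # videos with any community contributions data
uploader_only = 198_609      # feature enabled, but only uploader-provided content
videos_scanned = 50_000_000  # assumption: "over 50 million" taken as a lower bound

# Videos with at least some community-contributed captions or metadata.
community_contributed = videos_with_data - uploader_only
share = community_contributed / videos_scanned

print(community_contributed)  # 207785
print(f"{share:.2%}")         # 0.42%
```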
Breaking down these numbers further, 80,746 videos had community-contributed draft metadata, 127,164 had community-contributed draft captions, 38,440 had community-contributed published metadata, 93,499 had community-contributed published captions, 179,366 had uploader-provided published metadata, and 225,466 had uploader-provided published captions. These categories overlap, since a single video can have more than one kind of data.
Contributors could choose to be credited for their published captions, and 38,939 videos had credits for published captions created by the community. Captioning credits became inaccessible two weeks before the rest of the community contributions data did, so this count is incomplete, but the number of videos with captioning credits was low regardless. It is estimated that, had the credits remained accessible until the rest of the community contributions feature was made inaccessible, about 80,000 videos would have been found to have credits.
The community contributions feature supported 196 languages, though not all languages were used equally. Below is a chart of the 25 most popular supported languages and the number of videos that contain at least one file for each language (graphing all of the languages did not display well). This chart includes uploader-provided content.
When the query excludes the uploader-provided content, we see significant shifts in the 25 most popular supported languages.
This shift indicates that community contributions were often used to translate content.
The language distribution of the collected metadata, including uploader-provided metadata, appears to be similar to the distribution of languages in the overall collection.
A look at just the community-provided metadata shows a slightly different distribution.
The distribution of captioning languages, including uploader-provided captions, is similar to the collection overall.
The distribution of captioning languages, excluding uploader-provided captions, also resembles the overall collection.
It is also interesting to look at the distribution of the draft community captions and metadata that were collected in comparison to the published community captions and metadata.
The published community contributions data appears to be more evenly distributed across languages compared to the draft community contributions data.
Some users contributed many captions and were credited for their work on many videos. In total, 83,563 channels appeared in our credits collection. On average, a channel was credited on 1.47 caption tracks. 55 channels were credited on more than 50 caption tracks, and 14 channels were credited on more than 100 caption tracks! The three most-credited channels in our collection were credited on 255, 522, and 912 caption tracks.
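Summary figures like these can be derived by counting credits per channel. The sketch below shows one way to do so, under stated assumptions: the credits are represented as a list of `(channel, caption_track)` pairs, which is a hypothetical format chosen for illustration and not the archive's actual file format, and the function name `credit_stats` is likewise made up.

```python
from collections import Counter


def credit_stats(credits):
    """Summarise a list of (channel, caption_track) credit pairs."""
    # Count how many caption tracks each channel was credited on.
    per_channel = Counter(channel for channel, _track in credits)
    counts = list(per_channel.values())
    return {
        "channels": len(counts),
        "mean_tracks": sum(counts) / len(counts) if counts else 0.0,
        "over_50": sum(1 for n in counts if n > 50),
        "over_100": sum(1 for n in counts if n > 100),
        # Most-credited channels first (descending order).
        "top_three": per_channel.most_common(3),
    }


# Toy data, not taken from the archive.
sample = [("chan_a", "v1:en"), ("chan_a", "v2:en"), ("chan_b", "v1:fr")]
stats = credit_stats(sample)
print(stats["channels"], stats["mean_tracks"])  # 2 1.5
```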
Thank you to everyone who contributed to this project! Additional details about the collection itself are available in the Internet Archive item description. If you have any additional questions, please feel free to join the project Discord server!