news – Data Horde https://datahorde.org Join the Horde! Sun, 01 Oct 2023 21:22:06 +0000
Data Log 2023-09-17 Unity Platform Runtime Fee Controversy https://datahorde.org/data-log-2023-09-17-unity-fee-controversy/ https://datahorde.org/data-log-2023-09-17-unity-fee-controversy/#respond Sun, 17 Sep 2023 23:07:03 +0000 https://datahorde.org/?p=2901 The Unity Engine is a popular 3D engine for making games and other interactive media. In this episode of Data Log, glmdgrielson and madpro talk about how game designers and gamers are upset with the Unity platform’s new payment scheme.

]]>
https://datahorde.org/data-log-2023-09-17-unity-fee-controversy/feed/ 0
Data Log 2023-01-26 What is Archiving? https://datahorde.org/2023-01-26-what-is-archiving/ https://datahorde.org/2023-01-26-what-is-archiving/#respond Thu, 26 Jan 2023 23:37:35 +0000 https://datahorde.org/?p=2876 The first ever episode of Data Log: The Archiver’s Favorite Podcast. Learn about what archiving is and how to join the archiving community!

]]>
https://datahorde.org/2023-01-26-what-is-archiving/feed/ 0
Twitter in Trouble? Why you should Archive your Tweets https://datahorde.org/twitter-in-trouble-why-you-should-archive-your-tweets/ https://datahorde.org/twitter-in-trouble-why-you-should-archive-your-tweets/#comments Mon, 05 Dec 2022 17:04:49 +0000 https://datahorde.org/?p=2852 Twitter has seen some radical restructuring since Elon Musk’s acquisition over a month ago. Now is as good a time as ever to talk about the options you have for archiving or preserving your Twitter content.


This new era of Twitter has been quite turbulent, to say the least. More than half of the workforce has been fired or has quit, and site functionality is becoming unstable, as reported by the Seattle Times. Mastodon has emerged as a serious Twitter alternative. In fact, some of those who have departed Twitter now have their own Mastodon instance over at macaw.social. Personally, I am excited about the rise of Mastodon as an alternative, as I have been posting Data Horde updates over at @[email protected] for about two years now.

So, why not leave Twitter behind and move on? Twitter does allow you to request a copy of your personal data: Tweets and all. But it’s probably hard to leave a site that you have been on for over a decade, especially when requesting your personal archive is not even working correctly. Many people have reported that archive requests are being ignored or processed with delays. On a test account, we at Data Horde found that it took over 3 days to receive a personal archive.

Tweeters complaining about being unable to export personal archives: view snapshot at archive.is

In 2022 this is a big deal, not only for archivists but also legally. Article 15 of the GDPR gives users (i.e. data subjects) the right to obtain a copy of the data collected about them upon request. Outside of Europe, California’s CCPA has a similar clause protecting the right to know.

There are repercussions for not respecting these rules. Recently another messaging app, Discord, was fined 800 000 Euros by the French regulator CNIL for failing to respect data retention periods and the security of personal data. That was actually a reduced fine, given Discord’s conciliatory attitude. If Twitter does not up their game, they may meet a similar fate, if not a worse one.

Now that I have your attention, I would like to direct it to the help page on how to request a personal archive from Twitter: https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive . Even if a bit unstable, this is what you need to follow to save a copy of your Tweets.

The Twitter archive is big and burly but not perfect. Johan van der Knijff recently wrote a blog post on some of its shortcomings, such as the t.co URL shortener, along with some workarounds: https://www.bitsgalore.org/2022/11/20/how-to-preserve-your-personal-twitter-archive


Oh, and by the way. It gets worse: Elon Musk has also stated interest in purging inactive accounts and their Tweet history.

Archive Snapshot: https://archive.ph/hcKsV

This might not seem like a big deal, except to the one or two of our readers who periodically scrape politician accounts off of https://ballotpedia.org. Yet it is actually a serious turning point. Currently, Twitter does not purge inactive accounts, except in the event of death or incapacitation and by special request.

In 2019 there was an attempted Twitter policy change to expire accounts which had not been logged into for 6 months. This sparked outrage across the platform among those who saw this as unfair to the memory of inactive accounts. In particular, fans of deceased K-Pop artist Kim Jong-hyun, otherwise known as Jonghyun (김종현/종현), came to the defence of his legacy, overturning the attempt altogether. Going back on this decision would go against all of that heritage: people’s heritage, Twitter’s heritage, web heritage. Alas, this is the projected course of things. Even if we cannot prevent it, it is perhaps our duty to protest why it is wrong.


What about the extreme scenario of a total collapse of Twitter? What does that mean for web history? Well, the good news is that people have been thinking about this since long before this year.

Already in 2010 the Library of Congress announced that they would be copying the entire internal archive of Twitter, starting from March 2006.

Archive Snapshot: https://web.archive.org/web/20161208074132/https://twitter.com/librarycongress/statuses/12169442690

There are also many smaller grabs on the Internet Archive and archive.today, some of which you have seen linked above. Special mention goes to Archive Team’s periodic Twitter Stream archive.

Last but not least, you can help! The Internet Archive is collecting Tweet dumps from people as we speak: https://archive.org/services/wayback-gsheets/archive-your-tweets Whether you just want extra insurance for your back-up, or to contribute to the wealth of the web you can help by using the above tool to upload your Tweets to the Internet Archive for generations to come.

]]>
https://datahorde.org/twitter-in-trouble-why-you-should-archive-your-tweets/feed/ 1
Action Script 3 now supported in the Ruffle Emulator https://datahorde.org/action-script-3-now-supported-in-the-ruffle-emulator/ https://datahorde.org/action-script-3-now-supported-in-the-ruffle-emulator/#comments Tue, 30 Aug 2022 01:00:35 +0000 https://datahorde.org/?p=2826 Flash, once the web’s sweetheart in games and animation, has today fallen into obscurity. Since the end-of-life two years ago, Flash media has become virtually unplayable. But things are changing with emulators like Ruffle.


Not all Flash media is the same. You see, the interactivity in Flash relies on a language called ActionScript. In 2006, ActionScript 3 came out with new features. Alas, it was backwards-incompatible with AS2, and so not everyone was too keen on it. It wasn’t until the release of Flash Professional CC (2013) that authors were required to switch to AS3.

This has made Flash emulation quite a challenge. Understandably, Flash emulators have had to choose between prioritising AS2 and 3. For example the emulator Shumway focussed on AS2 (with some AS3 support) and Lightspark focussed on AS3. Unfortunately, Shumway hasn’t been updated in ages and Lightspark isn’t browser-based.

On the other hand, Ruffle, a relative newcomer to the Flash emulation scene, has been picking up speed. Written in Rust and sporting WebAssembly, it runs wicked fast and cross-platform! Though Ruffle’s focus, too, has been mostly in one direction, namely AS2, they have started to make progress on AS3 as well. Below is an abridged version of an announcement shared on the Ruffle Discord Server by our friend Nosamu.


The first few ActionScript 3 games are finally playable in Ruffle, Demos below! One of the first fully-playable games is Not To Scale, a simple but clever photo puzzle! You can try it out right now on Newgrounds: https://www.newgrounds.com/portal/view/575849/format/flash?emulate=flash

Even more exciting, the beautiful minigolf game Wonderputt is now mostly playable with the Ruffle desktop app! The first hole is quite tricky due to collision bugs, but a fix is in progress, along with web performance improvements.

Watch our #announcements channel for updates in the coming weeks! As always, you can download Ruffle from https://ruffle.rs/#downloads.

But wait, there’s more – Ruffle web builds now have a fancy loading animation! If you own a website, now is the perfect time to update Ruffle! No longer will your visitors be greeted with a blank white screen while waiting for Ruffle to load. Check out the animation:

Also, if you’d like to add your own flair to the loading screen or disable it altogether, there are a few customization options: preloader, --preloader-background, and --logo-display. For more information, see our wiki.

And finally, we’re looking for help developing an official Ruffle app for Android! If you have experience with Rust development targeting Android, please check out @szőlő’s WIP repository: https://github.com/torokati44/ruffle-android and join the development thread: Native Android App.


Do you have a favorite Flash game you just wish you could play right now? It’s not emulation, but Bluemaxima’s Flashpoint collection might be able to run what you are looking for right now. Be sure to also check out our Flash Player Emergency kit for more tips on Flash after its end-of-life.

]]>
https://datahorde.org/action-script-3-now-supported-in-the-ruffle-emulator/feed/ 18
All TechnologyGuide Forums shutting down January 31 https://datahorde.org/all-technologyguide-forums-shutting-down-january-31/ https://datahorde.org/all-technologyguide-forums-shutting-down-january-31/#comments Wed, 26 Jan 2022 00:55:00 +0000 https://datahorde.org/?p=2743 A member of our Discord server came in to notify us of a good number of potential closings scheduled for January 31. So that’s less than a week left to take action! The TechnologyGuide forum network, including the likes of NotebookReview.com and TabletPCReview, is shutting down due to a corporate decision. Oh dear!

The announcement came from longtime tech reviewer and moderator Charles Jefferies. As he reminisces over the rise and fall of NotebookReview and her sister sites, which have today only been reduced to unfrequented forums, he implores what remains of the once strong gearhead community to decide on what platform to migrate to in their exodus.

On behalf of the small but dedicated volunteer staff here, we wish you the best. We wish we could send you off a little more gracefully. Please enjoy the remaining time.

Best,
Charles Jefferies & the NBR Moderation Team

So without further ado, the complete list of sites closing down in the TechnologyGuide network is as follows:

It should also be noted that Archive Team has been made aware of the situation as well. Yet we are not aware of any archivebot/grab-site grabs at this time. Be sure to comment or reach out to us at [email protected] if you have any archives you would like to share, and to stay tuned to catch up on any updates.

]]>
https://datahorde.org/all-technologyguide-forums-shutting-down-january-31/feed/ 1
Silent Selene: Touhou Scoreboard Royalflare Saved! https://datahorde.org/silent-selene-touhou-scoreboard-royalflare-saved/ https://datahorde.org/silent-selene-touhou-scoreboard-royalflare-saved/#respond Thu, 20 Jan 2022 10:45:00 +0000 https://datahorde.org/?p=2728 Maribel Hearn, self-proclaimed nerd and shoot ’em up game enthusiast, has mirrored an archive of the Touhou Project Scoreboard Royalflare (ロイヤルフレア), expected to shut down at the end of this month. Though a cause for celebration, there has been little fanfare. So here we are to tell you all about it!


For the uninitiated, Touhou Project is a series of shoot ’em up games, known for its dense bullet patterns, characters, music, and developer. There’s just one developer, by the way, Jun’ya Ota, better known as ZUN. It is also notable for having a wide variety of fanworks.

As for why we’re here, the games are also notable for their scoring system. Top runs are tracked and ranked by score, rather than time. The place you would find top scores is, fittingly, called a scoreboard. A popular Touhou Project scoreboard was Royalflare (ロイヤルフレア), which has been online for 15 years, since 2007. You can find scores for runs for 16 different games, in various categories.


Royalflare shutdown announcement from January 3, 2022

Earlier this January, Royalflare suspended replay uploads and announced that it would be closing its doors at the end of the month. The site owner was distraught, stating that there had been run submissions with faked names, faked replays and even some doxxing. This led them to conclude that the site no longer served its role as a reliable scoreboard. Royalflare was left read-only for the duration of January, to shut down thereafter.

We would like to thank the many players who have contributed to the site over the years since it was founded in 2007.

Despite the routine, albeit short-lived, outrage on Twitter and Reddit, it didn’t take long for the internet to forget about the shutdown. Yet, as is often the case in communities with large fandoms, Touhou players were not going to forget so soon. In particular, Maribel Hearn, who already had a personal website dedicated to shoot ’em up games, announced that he had begun mass-downloading every replay available on Royalflare to his own site.

Interestingly, this apparently wasn’t his first hustle either, as he had also made an archive of the former Gensokyo.org scoreboard before it shut down in 2019. Voile! Mari’s Royalflare Archive is now fully operational and even upgraded with some search improvements. Now go out and shoot the bullet to spread the word!


Mari’s story goes to show that the web doesn’t just need dedicated archiving communities, but also dedicated communities, archiving. The people best suited for archiving Touhou scoreboards, in terms of understanding what they are and how they are categorized, are probably going to be Touhou players. If anything, they are the most likely to be paying attention to shutdown deadlines.

As web archivists, we should not just fish for people but we should try to teach them how to fish. Perhaps, this tutorial on how to use the Wayback Machine and wget might someday help another Touhou fan, or the fan of some other community 😉
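For a taste of what such a tutorial covers, here is a minimal Python sketch. It only builds strings and does not touch the network. The helper names (save_url, snapshot_url, wget_mirror_command) are made up for this illustration, and it assumes the Wayback Machine's familiar /save/ and /web/<timestamp>/ URL layouts; the wget options listed are standard ones, but which ones suit a given site is a judgment call.

```python
from datetime import datetime

WAYBACK = "https://web.archive.org"


def save_url(page_url):
    """URL that asks the Wayback Machine to capture page_url (assumed /save/ layout)."""
    return f"{WAYBACK}/save/{page_url}"


def snapshot_url(page_url, when):
    """URL of a capture of page_url at (or nearest to) the given datetime."""
    # Wayback timestamps are 14 digits: YYYYMMDDhhmmss
    return f"{WAYBACK}/web/{when:%Y%m%d%H%M%S}/{page_url}"


def wget_mirror_command(site_url, warc_name):
    """Argument list for a polite wget mirror that also writes a WARC file."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--page-requisites",   # also fetch images, CSS and scripts
        "--convert-links",     # rewrite links for offline browsing
        "--no-parent",         # stay below the starting directory
        "--wait=1",            # pause between requests
        f"--warc-file={warc_name}",
        site_url,
    ]


if __name__ == "__main__":
    # example.org is a placeholder, not a real scoreboard
    print(save_url("https://example.org/scores/"))
    print(" ".join(wget_mirror_command("https://example.org/scores/", "scores")))
```

The command is returned as a list so that it can be handed to subprocess.run without shell quoting issues, or simply printed and pasted into a terminal.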

]]>
https://datahorde.org/silent-selene-touhou-scoreboard-royalflare-saved/feed/ 0
YouTube’s Community Posts are tragically killing the Discussion Tab https://datahorde.org/youtubes-community-posts-are-tragically-killing-the-discussion-tab/ https://datahorde.org/youtubes-community-posts-are-tragically-killing-the-discussion-tab/#comments Tue, 21 Sep 2021 19:44:15 +0000 https://datahorde.org/?p=2641 A few days ago YouTube announced that millions of channels would gain access to Community Posts next month. Unfortunately, many people have missed the footnote that they are removing the similar Discussion feature which was already available to all channels. Read on to learn more about how this update will affect YouTube.

Image Credits: YouTube Blog

Community Posts? What are those?

YouTube’s Community Posts were introduced in 2017, as a way for channels to interact with their viewers and fans. These are typically short texts, images, GIFs or polls; you might think of them as Tweets native to YouTube. They appear both on the channel itself, under the so-called Community Tab, and also in the recommendations on the YouTube Homepage and the Subscriptions Feed. Initially, Community Posts were available only to a number of big channels, such as grav3yardgirl and Vidya Vox.

After some initial success, the feature gradually became available to more channels, albeit not all, circa 2019. You can learn more about how it works on YouTube’s Creator Academy. Earlier this year, The Spiffing Brit made a video which brought a lot of attention to Community Posts, with some exaggerations, pointing out that they were especially prominent on the mobile-view and they could be used to amass a lot of subscribers very quickly.

For reference, a short community post with only a GIF can take up around 80% of the screen!
Image Credit: Pringus McDingus

What is new this October, is that Community Posts will be rolling out to all channels above 500 subscribers, beginning on October 12, 2021. Still not all channels, though we are getting there. Then what is the catch? This update will also remove the old discussion tab which was, formerly, available to all channels.

There’s a Discussion Tab? I have never seen it before?

YouTube first introduced comments on videos sometime in the Summer of 2005. Comments were immediately a hit, but there was a problem; people were not just commenting about the videos themselves, they were sometimes asking the uploader general questions. Unfortunately, channels had to divide their attention between all of their uploads to keep up with their commenters. So some time around April of 2006, YouTube introduced Channel Comments, because somebody had had the genius idea of making a channel-wide comment section. It took the form of a long comment thread appearing at the bottom of channel profiles.

Most Recent Channel Comments on jawed‘s channel from 2011, courtesy of the Wayback Machine.

Now these Channel Comments have been around almost since YouTube started, but today they are known as Channel Discussions and have their own, specialised Discussion Tab. Even at this very moment, if you look into a channel’s Discussion, you might catch a glimpse of comments from over a decade ago carrying on memories long forgotten!

Right now, on 21 September 2021, you can still see comments from 13 years ago on YouTube veteran ozwalled’s channel discussion.

Sadly, like many legacy features on YouTube, Discussions were never fully implemented on the mobile apps and they still remain out of view even today. If you mostly browse YouTube on the app, it is very likely that you have never seen any Discussion before.

Why is YouTube removing Discussions? Can’t they co-exist with Community Posts?

In a sense, Discussions were the predecessor to Community Posts. Yet they certainly have not received the same care and attention in recent years, with Discussions still showing hints of the Google+ transition attempted a few years ago. In short, Discussions have been obsoleted.

Unlike Community Posts, which were engineered specifically to take advantage of the UI of the mobile app, Discussions remained largely a desktop-only feature. In a mobile-user dominated web, YouTube finds itself with a significant share of mobile users. There are contradictory reports on whether YouTube has more mobile or desktop users, with Statista reporting a ~20%-80% global share for mobile vs. desktop users in early 2021 and Comscore reporting a ~70%-30% split in the US/UK in 2017. Still, YouTube has made it clear that they are designing with a mobile-first mindset. They have set a track record in removing features missing from mobile, such as video annotations in 2019, and the community contributions editor in 2020. Thus it comes as no surprise that Discussions should come next.

Spammy comments at the top of smosh’s channel comments on September 30, 2007.

It is also safe to assume that the incredibly slow roll-out of the Community Posts feature to all channels is entirely deliberate. Channel Discussions had a few kinks which needed hammering out, such as being a frequent target for spam comments. In case you have not noticed, Community Posts are oriented in such a way that they emphasize the channel itself over commenters. Spam comments are often hidden a screen or two behind the Community Posts themselves, which are mostly kept clean at the channel’s own discretion. They are easier to moderate for channels, and more readable for commenters; that’s a win-win from YouTube’s perspective.

Could YouTube have fixed the way Discussions were organised? Probably, given that they have preserved the channel comments from 15 years ago, through countless layout changes to channels. Yet it seems that YouTube is determined to remove the feature, no matter what.

If you don’t yet have access to Community posts you can use Comments on your videos to engage your viewers.

Team YouTube on the Google Support Page, having seemingly forgotten the original reason channel comments were added in the first place.

Interestingly, Discussions were once available on the mobile version of the YouTube website, but even those have been removed, some time earlier this year. If that was not self-sabotage, it certainly was a sign that the design team had been tasked with removing Discussions for good.

From a practical standpoint, it does make sense for YouTube to remove Discussions now, with Community Posts as an adequate replacement. Yet why YouTube would break a years-long tradition of preserving these ancient comments is anyone’s guess.

Which channels have a discussion tab?

Even channels started as recently as September 2021 have had Discussions at some point. That sentence is in the present perfect, because many channels which used to have a Discussion section no longer have it:

  • Big channels such as Vsauce or eddsworld, which have already switched to Community Posts, have their Discussion history hidden. Not deleted, hidden. There is at least a 30-day rule which allows commenters to view, edit or delete comments they have made on the Discussion page of a channel that has recently switched to Community Posts, from their Google Account’s My Activity Section.
Memoriam comments on Edd Gould’s Discussion Page, when he posthumously passed a million subscribers in 2016, once again, courtesy of the Wayback Machine.
  • At some point, the switch to Community Posts was mandated to channels above a certain subscriber count. The current requirement applies to channels strictly above 1000 subscribers. This update to Community Posts is forced even if the channel has been inactive for years, such as in the case of Fred or, more tragically, geriatric1927 or KKD1247.

Do I have a discussion tab?

  • If it exists, your channel’s Discussion page will be located at https://www.youtube.com/channel/{CHANNEL ID GOES HERE}/discussion or https://www.youtube.com/user/{USER ID GOES HERE}/discussion.
  • Following COPPA compliance changes introduced in 2020, channels which have declared their content to be “Made for Kids”, have all comments disabled across their channel. This includes Discussions. Luckily, past comments are not deleted. Switching the channel out of the “Made for Kids” option will restore all hidden comments intact.
  • The road to October 12 will see any other old channels which go over 1000 subscribers lose their Discussion tab as well. That means anyone who passes 1000 subscribers gets the update early! Global removal will begin right after October 12, and Community tabs will roll out to channels above 500 subs within an estimated time of a week. Anyone with fewer than 500 subscribers will no longer have a Discussion tab, nor will they receive a Community tab replacement. Please note once more that the 30-day rule allows commenters to view, edit or delete comments they have made on the Discussion page of a channel that has recently switched to Community Posts.
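The URL patterns and subscriber thresholds listed above can be summed up in a rough Python sketch. The function names are hypothetical, and the rules are simplified from this post rather than taken from any YouTube API: the roll-out is treated as a hard cut-off after October 12 (it is actually gradual), and a channel with exactly 500 subscribers is assumed not to count as "above 500".

```python
from datetime import date

ROLLOUT_DATE = date(2021, 10, 12)  # date given in this post


def discussion_url(identifier, legacy_user=False):
    """Build a Discussion page URL from a channel ID or a legacy user name."""
    kind = "user" if legacy_user else "channel"
    return f"https://www.youtube.com/{kind}/{identifier}/discussion"


def expected_tabs(subscribers, on, made_for_kids=False):
    """Guess which tabs a channel shows on a given date, per the rules above.

    Simplifications (assumptions, not YouTube's documented behaviour):
    the change is modelled as instant, and thresholds are strict.
    """
    if on <= ROLLOUT_DATE:
        # Before the global removal only channels above 1000 are switched.
        community = subscribers > 1000
        discussion = not community
    else:
        # Afterwards Discussions are gone for everyone.
        community = subscribers > 500
        discussion = False
    if made_for_kids:
        # Comments, including Discussions, are hidden (not deleted).
        discussion = False
    return {"community": community, "discussion": discussion}


if __name__ == "__main__":
    print(discussion_url("CHANNEL_ID_GOES_HERE"))
    print(expected_tabs(800, date(2021, 9, 21)))
    print(expected_tabs(800, date(2021, 10, 20)))
```

A result of discussion: True only means the tab has not been taken away by these rules; whether the page actually exists still has to be checked at the URL.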

Concluding Remarks

While Community Posts present an overall upgrade to the YouTube user experience, especially for mobile users, the removal of Channel Discussions is yet another dent to the internet’s short memory span. Seeing people who you would have never thought to have ever talked to each other commenting on one another’s Discussions, is in a sense like scrolling through the threads of an ancient forum. Ironically, the Community Posts still lack this sense of community, instead emphasising a dichotomy between channels and viewers.

Relevant to archivists and data hoarders, a non-API endpoint for Discussions is unknown at this time. Furthermore, given the mandated update having already hidden the Discussions for many channels, options are severely limited. Existing archives besides Wayback Machine snapshots remain undiscovered. Perhaps the interconnectivity of the Discussions between older channels might prove to be useful in crawling for other channels which still have a Discussion tab. One thing is for certain: this will go down as a very unusual moment in YouTube’s history…

On a final, albeit related note, YouTube’s video attributions, which were expected to be removed in September, went offline yesterday on September 20. One of the last captures of the attributions endpoint was taken at 5:19 (GMT) on Sep 20, and the endpoint was reported to have shut down by 20:25.

]]>
https://datahorde.org/youtubes-community-posts-are-tragically-killing-the-discussion-tab/feed/ 8
Help Archive Team Archive public Google Drive files before September 13! https://datahorde.org/help-archive-team-archive-public-google-drive-files-before-september-13/ https://datahorde.org/help-archive-team-archive-public-google-drive-files-before-september-13/#comments Sat, 11 Sep 2021 00:50:54 +0000 https://datahorde.org/?p=2637 On September 13, Google is going to start requiring longer URLs to access many Google Drive files, breaking links to public files across the web unless users opt out! Because of this, Archive Team has launched a project to archive as many publicly-available Google Drive files as possible and make them accessible on the Internet Archive Wayback Machine. (Note that video files are not included at this time due to their size.)

You can help! Simply follow the steps to download and run an Archive Team Warrior, and then select the Google Drive project. (You can also run the project using a Docker container using atdr.meo.ws/archiveteam/google-drive-grab as the image address.)

Additionally, people with lists of public Google Drive file URLs are encouraged to share them so they can be archived.

In order to stay up-to-date with the project and be reachable in case of an issue, project contributors are encouraged to connect and stay connected to the project discussion channel, #googlecrash on irc.hackint.org, also available through webchat.

Archiving progress statistics for this project are available on the Archive Team project tracker, and source code is available on GitHub.

]]>
https://datahorde.org/help-archive-team-archive-public-google-drive-files-before-september-13/feed/ 2
This Week in Archiving 09/06/2021 https://datahorde.org/this-week-in-archiving-09-06-2021/ https://datahorde.org/this-week-in-archiving-09-06-2021/#respond Mon, 06 Sep 2021 19:16:18 +0000 https://datahorde.org/?p=2615 Everything you need to know about the upcoming Google Drive Security Update, and a few pleasant surprises from the Archiving scene to boot!

Shutdowns

In about a week’s time, Google Drive’s Security Update for shared links will be rolling out on September 13. Google is introducing a new resource key to shared links, aiming to make link sharing more secure. In short, this means that many links shared before September 13 will no longer function beyond that date, rendering many shared files inaccessible.

What you need to know! A few gotchas to this update:

  • If any Google account has accessed a shared-link prior to September 13, they will retain full-access to said file/drive beyond that date, unless access is revoked by the owner. More on this later.
  • Drive accounts can view their affected files and even opt out of the security update altogether, if they are not a Family Link account. However, this opt-out is only possible after September 13, not before; see https://support.google.com/drive/answer/10729743 for details.
  • Don’t worry about your petitions getting blocked; Google Docs, Sheets and Slides are exempted from the update. To be clear, this does not mean all .docx, .xlsx and .pptx files; it only means that Google Workspace‘s native files will not be affected. Links to folders containing Docs, Sheets or Slides do not seem to be exempted either. For further details see https://support.google.com/a/answer/10685032
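To triage a list of shared links ahead of the update, something like the following Python sketch may help. It assumes that updated links carry the new key as a resourcekey query parameter, which this post does not spell out, and the function names and FILE_ID links are placeholders made up for the example.

```python
from urllib.parse import parse_qs, urlparse


def has_resource_key(url):
    """True if the link carries a non-empty resourcekey query parameter (assumed name)."""
    query = parse_qs(urlparse(url).query)
    return bool(query.get("resourcekey"))


def split_links(links):
    """Partition links into (with_key, without_key), keeping input order."""
    with_key, without_key = [], []
    for link in links:
        (with_key if has_resource_key(link) else without_key).append(link)
    return with_key, without_key


if __name__ == "__main__":
    # Placeholder links; FILE_ID and the key value are not real.
    links = [
        "https://drive.google.com/file/d/FILE_ID/view?usp=sharing",
        "https://drive.google.com/file/d/FILE_ID/view?usp=sharing&resourcekey=0-EXAMPLE",
    ]
    keyed, unkeyed = split_links(links)
    print(len(keyed), "links already carry a key;", len(unkeyed), "may need attention")
```

A link without a key is not necessarily broken, since the exemptions and opt-outs listed above still apply; the split is only a first pass for deciding what to look at.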

New Projects

Follow the Google Drive Countdown on Data Horde Twitter to keep up!

On the subject of Google Drive, recall that accounts which have previously accessed a file will be able to retain access. So we at Data Horde developed a tool just for that purpose. Google-Drivac allows you to “vaccinate” a particular Google account against a list of links.

Just log into a Google account, copy some cookies, gather a list of links you want to retain access to (either through crawling, or through maybe another account) and presto!

You can find Google-Drivac on our GitHub at https://github.com/Data-Horde/gdrivac.

In the meantime, for those hoarders among you who would prefer a hard-grab, Archive Team is working on doing just that. You can follow development on the Google Drive Warrior on #[email protected]


Last but not least, some of you might recall the panic from earlier this year surrounding Mediafire allegedly deleting accounts. While the threat had been greatly exaggerated, the scare did bring with it some innovation.

Zopolis from the Media Flare project has developed the Automatic Mediafire Link Archiver (AMLA). This is a Chrome extension which automatically records Mediafire links on websites you visit, in the background. These links are then forwarded to Ajay’s MediaFire URL Submitter public database and Archive Team who periodically grab its contents. All website activity is tracked by default, so if you’re keen on your privacy you should also restrict its permissions in the extension settings.

AMLA can be downloaded from here: https://github.com/Zopolis4/AMLA, and if you are into URL hunting be sure to visit the Media Flare Discord server.

]]>
https://datahorde.org/this-week-in-archiving-09-06-2021/feed/ 0
This Week in Archiving 08/09/2021 https://datahorde.org/this-week-in-archiving-08-09-2021/ https://datahorde.org/this-week-in-archiving-08-09-2021/#respond Mon, 09 Aug 2021 15:25:53 +0000 https://datahorde.org/?p=2580 Introducing the Institute for Digital Heritage, Heritrix3.4 and Crawling@Home: a project for building the largest image-text dataset, ever!

New Projects

History buffs better follow the Institute for Digital Heritage which just launched last week. It’s a new organization aiming to turn cultural heritage into digital heritage. They help museums and cultural institutions digitize and open up their collections, with partners like FloatScans who specialize in 2D/3D scanning. Among its ranks are scholars in digital humanities, culture studies and related fields primarily from Europe and Africa.


Got any processing power to spare? Crawling@Home is a project to match text and images, en masse, from the >3 PB Common Crawl dataset. When completed it will be the world’s largest image-text pair dataset, to fuel next-generation machine learning models. Current stats are available on their homepage.

u/-Archivist has made a tutorial on how to join C@H (periodically updated) which you can find on r/datahoarder. If you have Docker installed, go ahead and execute docker pull arkseal/cah-worker && docker run --shm-size=2g -e NAME={username}-TE arkseal/cah-worker replacing {username} with a nickname of your own.

Updates

Speaking of web crawling, the Internet Archive’s crawler Heritrix3 received an extensive update last week, adding a browser-based link extractor, ExtractorChrome. While currently limited to grabbing <a> and <area> links, the team plan to extend it for crawling intercepted sub-requests, possibly mimicking Javascript behavior.

https://twitter.com/anjacks0n/status/1422496424021856275

Last but not least, The-Archive has updated their website. Whether you’re looking for Strategy Guides for retro gaming or Milkdrop Plugins for Winamp Wednesday, The-Archive has much to offer.

]]>
https://datahorde.org/this-week-in-archiving-08-09-2021/feed/ 0