This Week in Archiving 08/09/2021

By themadprogramer | August 9, 2021 | Weekly Summary

Introducing the Institute for Digital Heritage, Heritrix 3.4, and Crawling@Home, a project for building the largest image-text dataset ever!

New Projects

History buffs had better follow the Institute for Digital Heritage, which launched last week. It’s a new organization aiming to turn cultural heritage into digital heritage. They help museums and cultural institutions digitize and open up their collections, with partners like FloatScans, who specialize in 2D/3D scanning. Among its ranks are scholars in digital humanities, culture studies and related fields, primarily from Europe and Africa.

Got any processing power to spare? Crawling@Home is a project to match text and images, en masse, from the >3 PB Common Crawl dataset. When completed, it will be the world’s largest image-text pair dataset, intended to fuel next-generation machine learning models. Current stats are available on their homepage.

u/-Archivist has made a tutorial on how to join C@H (periodically updated), which you can find on r/datahoarder. If you have Docker installed, go ahead and run the following command, replacing {username} with a nickname of your own:

docker pull arkseal/cah-worker && docker run --shm-size=2g -e NAME={username}-TE arkseal/cah-worker

Updates

Speaking of web crawling, the Internet Archive’s crawler Heritrix3 received an extensive update last week, adding a browser-based link extractor called ExtractorChrome. While it is currently limited to grabbing <a> and <area> links, the team plans to extend it to crawl intercepted sub-requests, possibly mimicking JavaScript behavior.

https://twitter.com/anjacks0n/status/1422496424021856275
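For readers wondering what "grabbing <a> and <area> links" amounts to, here is a rough Python sketch. To be clear, this is not Heritrix code: Heritrix is written in Java and ExtractorChrome works through a browser, whereas this sketch only parses static HTML with Python's standard library. The names AnchorAreaExtractor and extract_links, as well as the example.org URL, are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorAreaExtractor(HTMLParser):
    """Hypothetical helper: collect href targets from <a> and <area> only."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Ignore every element other than <a> and <area>.
        if tag not in ("a", "area"):
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs found in <a> and <area> href attributes."""
    parser = AnchorAreaExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<a href="/about">About</a><area href="map.html"><img src="logo.png">'
    print(extract_links(sample, "https://example.org/"))
    # prints ['https://example.org/about', 'https://example.org/map.html']
```

Note that the <img> source is skipped: anything a page loads or links through other elements or through scripts falls outside this kind of extraction, which is presumably the gap the planned sub-request crawling is meant to close.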

Last but not least, The-Archive has updated their website. Whether you’re looking for Strategy Guides for retro gaming or Milkdrop Plugins for Winamp Wednesday, The-Archive has much to offer.