This Week in Archiving 08/09/2021

Introducing the Institute for Digital Heritage, Heritrix3.4 and Crawling@Home: a project for building the largest image-text dataset, ever!

New Projects

History buffs should follow the Institute for Digital Heritage, which launched last week. It’s a new organization aiming to turn cultural heritage into digital heritage. They help museums and cultural institutions digitize and open up their collections, with partners like FloatScans, who specialize in 2D/3D scanning. Among its ranks are scholars in digital humanities, culture studies and related fields, primarily from Europe and Africa.

Got any processing power to spare? Crawling@Home is a project to match text and images, en masse, from the >3 PB Common Crawl dataset. When completed, it will be the world’s largest image-text pair dataset, built to fuel next-generation machine learning models. Current stats are available on their homepage.

u/-Archivist has made a tutorial on how to join C@H (periodically updated), which you can find on r/datahoarder. If you have Docker installed, go ahead and execute the following, replacing {username} with a nickname of your own:

    docker pull arkseal/cah-worker && docker run --shm-size=2g -e NAME={username}-TE arkseal/cah-worker
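If you would rather script the nickname substitution than edit the command by hand, here is a minimal Python sketch. It only assembles the same two Docker commands quoted above and, by default, prints them without running anything. The function names (build_commands, start_worker) and the dry_run switch are hypothetical helpers made up for this illustration; they are not part of the C@H tooling.

```python
import shlex
import subprocess

IMAGE = "arkseal/cah-worker"  # image name taken from the command quoted above


def build_commands(nickname):
    """Return the pull and run commands as argument lists (hypothetical helper)."""
    pull = ["docker", "pull", IMAGE]
    run = ["docker", "run", "--shm-size=2g", "-e", f"NAME={nickname}-TE", IMAGE]
    return [pull, run]


def start_worker(nickname, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_commands(nickname):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failure, like && in the shell version.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    start_worker("alice")  # "alice" is a placeholder nickname
```

Call start_worker("your-nickname", dry_run=False) once the printed commands look right.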


Speaking of web crawling, the Internet Archive’s crawler Heritrix3 received an extensive update last week, adding a browser-based link extractor called ExtractorChrome. While it is currently limited to grabbing <a> and <area> links, the team plans to extend it to crawl intercepted sub-requests and possibly mimic JavaScript behavior.
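To get a feel for what "grabbing <a> and <area> links" means, here is a rough Python sketch that pulls the href targets of those two elements out of a page. It is not how ExtractorChrome works: that extractor drives a real browser, whereas this one only parses static HTML with the standard library. The LinkExtractor class and extract_links function are names invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect href targets from <a> and <area> elements only."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "area"):
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return absolute URLs for every <a href> and <area href> in the HTML."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = """
    <a href="/about">About</a>
    <map><area href="map/region1.html" shape="rect"></map>
    <img src="logo.png">
    """
    for link in extract_links(sample, "https://example.org/index.html"):
        print(link)
```

The <img> tag in the sample is skipped on purpose, since only <a> and <area> are in scope. Links that a page adds with scripts after loading would be invisible to a static parser like this, which is the gap a browser-based extractor is meant to close.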

Last but not least, The-Archive has updated their website. Whether you’re looking for Strategy Guides for retro gaming or Milkdrop Plugins for Winamp Wednesday, The-Archive has much to offer.

