This Week in Archiving 08/09/2021
Data Horde, Mon, 09 Aug 2021
https://datahorde.org/this-week-in-archiving-08-09-2021/

Introducing the Institute for Digital Heritage, Heritrix3.4 and Crawling@Home: a project for building the largest image-text dataset, ever!

New Projects

History buffs should follow the Institute for Digital Heritage, which launched last week. It’s a new organization aiming to turn cultural heritage into digital heritage. They help museums and cultural institutions digitize and open up their collections, with partners like FloatScans, who specialize in 2D/3D scanning. Among its ranks are scholars in digital humanities, culture studies and related fields, primarily from Europe and Africa.


Got any processing power to spare? Crawling@Home is a project to match text and images, en masse, from the >3 PB Common Crawl dataset. When completed it will be the world’s largest image-text pair dataset, to fuel next-generation machine learning models. Current stats are available on their homepage.

u/-Archivist has made a tutorial on how to join C@H (periodically updated) which you can find on r/datahoarder. If you have Docker installed, go ahead and execute docker pull arkseal/cah-worker && docker run --shm-size=2g -e NAME={username}-TE arkseal/cah-worker replacing {username} with a nickname of your own.
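To make the substitution step harder to get wrong, here is a minimal sketch that only prints the command from the tutorial for a given nickname, so it can be reviewed before anything runs. The function name `cah_command` and the nickname `horde_fan` are placeholders of our own, not part of the C@H tutorial; the image name and flags are copied from the command above.

```shell
# Hypothetical helper (not from the C@H tutorial): prints the worker
# command for a given nickname instead of running it.
cah_command() {
    nickname="$1"
    # --shm-size=2g gives the container 2 GB of shared memory (/dev/shm);
    # -e NAME=... passes the worker name as an environment variable.
    printf 'docker pull arkseal/cah-worker && docker run --shm-size=2g -e NAME=%s-TE arkseal/cah-worker\n' "$nickname"
}

# "horde_fan" is a placeholder nickname; substitute your own.
cah_command "horde_fan"
```

Once the printed line looks right, paste it into a terminal to start the worker. Check the tutorial on r/datahoarder first, since it is updated periodically and the command may have changed.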

Updates

Speaking of web crawling, the Internet Archive’s crawler Heritrix3 received an extensive update last week, adding a browser-based link extractor called ExtractorChrome. While it is currently limited to grabbing <a> and <area> links, the team plans to extend it to crawl intercepted sub-requests, possibly mimicking JavaScript behavior.

https://twitter.com/anjacks0n/status/1422496424021856275

Last but not least, The-Archive has updated their website. Whether you’re looking for Strategy Guides for retro gaming or Milkdrop Plugins for Winamp Wednesday, The-Archive has much to offer.
