file – Data Horde

How to Archive or Scrape MediaFire Files using mf-dl

themadprogramer — Thu, 24 Jun 2021 14:10:54 +0000

MediaFire is home to millions of files! Its generous upload limits appeal to visual artists who can upload their work in higher resolutions, composers and remixers who want to host their WIP music off-platform, and really anyone who wants to upload big .zip files.

Unfortunately, MediaFire doesn’t have a search/discovery feature, relying entirely on search engine traffic and external linking. There are a lot of undiscovered things on MediaFire, and Pyxia’s mf-dl tool is one of the first tools we have for exploring it. Read on to learn how to install and use mf-dl to easily download MediaFire files and crawl undiscovered corners of the internet!

Installation

mf-dl follows the usual steps for setting up a Python tool:

Install Python 3 if you don’t already have it on your device.
Clone the mf-dl repository from https://gitgud.io/Pyxia/mf-dl.git using a git client. Alternatively, download the repo and unzip it.
Using a terminal, cd into the mf-dl directory and run python3 -m pip install -r requirements.txt.

Downloading Files with `mfdl.py`

mfdl.py is a bulk-downloader for MediaFire links. You may have found these links yourself, copied them from your bookmarks or possibly scraped them beforehand. At any rate, mfdl.py will download the contents and metadata for a list of links that have already been collected.

The input is a sequence of links, which can come from any file in which the links are separated by spaces, new-lines or commas. Ideally, you might want to use a spreadsheet-friendly CSV file. For this tutorial, copy the table below into Excel, or another spreadsheet editor, and save it as links.csv.

https://www.mediafire.com/file/y1s9a51a941h7b8/The_Wasteland_%2528MP3%2529.mp3/file

https://www.mediafire.com/folder/xb49obyqfut8d/Merlin_Das_Trevas

https://www.mediafire.com/file/ngteu63n26rhncj/readmetxt4999251.zip/file

links.csv
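
Since links may be separated by spaces, new-lines or commas, it can be handy to check what a links file contains before handing it to mfdl.py. The sketch below is not mf-dl's own parser. split_links is a hypothetical helper that only mirrors the separator rule described above.

```python
import re

def split_links(text):
    """Split raw text into links on commas and any whitespace."""
    # A run of commas and/or whitespace counts as a single separator.
    return [token for token in re.split(r"[,\s]+", text) if token]

# The three tutorial links, deliberately mixed across all three separators.
sample = (
    "https://www.mediafire.com/file/y1s9a51a941h7b8/The_Wasteland_%2528MP3%2529.mp3/file,\n"
    "https://www.mediafire.com/folder/xb49obyqfut8d/Merlin_Das_Trevas "
    "https://www.mediafire.com/file/ngteu63n26rhncj/readmetxt4999251.zip/file\n"
)
links = split_links(sample)
print(len(links), "links found")
```

If the count differs from what you expect, the file probably contains a separator other than the three listed.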

Next we will need an output directory to save mf-dl’s grabs. mf-dl does not have permission to create new directories, so you will have to create a new folder if the destination doesn’t already exist. For demonstration’s sake, we will create an output directory under mf-dl.
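
Any file manager or mkdir will do for this step. If you would rather script it, here is a minimal sketch using only the Python standard library. ensure_output_dir is a hypothetical helper name, not part of mf-dl.

```python
from pathlib import Path

def ensure_output_dir(path):
    """Create the directory (and any missing parents) unless it already exists."""
    out = Path(path)
    # exist_ok=True makes repeated calls harmless.
    out.mkdir(parents=True, exist_ok=True)
    return out
```

Calling ensure_output_dir("output") from inside the mf-dl folder gives mfdl.py a destination it can write to.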

If you have been following along, your mf-dl folder should look a little something like this.

To run mfdl.py, execute the following command from inside your terminal and mfdl.py will begin downloading the contents of the input links into the output directory.

> python3 mfdl.py output links.csv

Protip #1: Increasing Download Throughput
mfdl.py can download several files concurrently. By default, mfdl.py runs 6 threads, which means that it will initiate 6 simultaneous downloads at a time. If you have high network bandwidth, you might want to increase the number of threads to maximize your download speed. Or, if MediaFire is upset with your frequent downloads and is throwing CAPTCHAs your way, you can decrease your thread count. Use this modified version of the mfdl.py call to change your thread count.

> python3 mfdl.py --threads NEWTHREADCOUNT output links.csv
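
To picture what a thread count means in practice, here is a minimal sketch of a capped worker pool. It is not mfdl.py's actual code: run_with_threads and fake_download are hypothetical names, and the download is simulated with a short sleep.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def run_with_threads(jobs, worker, threads=6):
    """Run worker(job) for every job, with at most `threads` running at once."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(worker, jobs))

stats = {"active": 0, "peak": 0}
lock = threading.Lock()

def fake_download(link):
    """Stand-in for a real download: sleeps briefly and records concurrency."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.05)
    with lock:
        stats["active"] -= 1
    return link

results = run_with_threads([f"link-{i}" for i in range(20)], fake_download, threads=6)
print(len(results), "jobs done, peak concurrency:", stats["peak"])
```

The peak never exceeds the thread count, which is why lowering it is a reasonable response to CAPTCHAs and raising it only helps while bandwidth is the bottleneck.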

Protip #2: Multiple Input Files
All arguments after the output directory are treated as input files. If you have links split across several files, you can simply append them to the end of the command.

> python3 mfdl.py output links.csv links2.csv links3.csv
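
If you launch mfdl.py from another script, the argument order can be assembled programmatically. This is a sketch under the assumption that the order shown in the commands above is all that matters. build_mfdl_command is a hypothetical helper, and it uses the current interpreter (sys.executable) in place of python3.

```python
import sys

def build_mfdl_command(output_dir, input_files, threads=None):
    """Return the argument list for one mfdl.py run."""
    command = [sys.executable, "mfdl.py"]
    if threads is not None:
        # --threads goes before the positional arguments, as in the calls above.
        command += ["--threads", str(threads)]
    # The output directory comes first; everything after it is an input file.
    command.append(output_dir)
    command += list(input_files)
    return command

cmd = build_mfdl_command("output", ["links.csv", "links2.csv", "links3.csv"], threads=4)
print(" ".join(cmd[1:]))
```

Passing the list to subprocess.run(cmd, check=True) from inside the mf-dl folder would then start the download.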

Scraping MediaFire links with `web_crawler.py`

web_crawler.py is a utility for discovering new MediaFire links. That’s right, links, not files. web_crawler.py does not download the corresponding files, so we will later need to feed the links it outputs into mfdl.py.

Setting up web_crawler.py is a bit more straightforward. All we need is a seed URL to initiate the crawl. Any site with downloadables will make for a nice link farm. In this case we’ll be using the Minecraft Pocket Edition downloads site https://mcpedl.com/ as our seed.

To run web_crawler.py, execute the following. Note that web_crawler.py will run indefinitely as new links are discovered, until its execution is interrupted.

> python3 web_crawler.py https://mcpedl.com/ links_found.txt

Protip #1: Feeding Back Links
You can feed links found using web_crawler.py into mfdl.py with

> python3 mfdl.py output links_found.txt

In fact, if you’re familiar with Crontab, you can schedule periodic mfdl.py jobs to download new links as they are added to links_found.txt. This way, you can continue to download new links without ever stopping web_crawler.py.
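
A periodic job would otherwise hand every link to mfdl.py again on each run. One way around that is to keep a record of links already handled and pass only the new ones along. The sketch below assumes links_found.txt holds whitespace-separated links. collect_new_links, links_seen.txt and links_batch.txt are hypothetical names, not part of mf-dl.

```python
from pathlib import Path

def collect_new_links(found_path, seen_path, batch_path):
    """Write links not handled by an earlier run to batch_path and return them."""
    found = Path(found_path).read_text().split()
    seen_file = Path(seen_path)
    seen = set(seen_file.read_text().split()) if seen_file.exists() else set()

    new_links = []
    for link in found:
        if link not in seen:
            seen.add(link)  # also drops duplicates within the same file
            new_links.append(link)

    # The batch file holds only this run's new links, one per line.
    Path(batch_path).write_text("".join(link + "\n" for link in new_links))
    # Remember what has been handled so the next scheduled run skips it.
    with seen_file.open("a") as handle:
        for link in new_links:
            handle.write(link + "\n")
    return new_links
```

A scheduled job could call collect_new_links("links_found.txt", "links_seen.txt", "links_batch.txt") and then run python3 mfdl.py output links_batch.txt whenever the returned list is not empty.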

Protip #2: Depth Control
You can limit web_crawler.py’s search by specifying a filter. If you want to keep your search to mcpedl.com, ignoring out-links to Facebook etc., you can pass --filter https://mcpedl.com.

> python3 web_crawler.py --filter https://mcpedl.com https://mcpedl.com/ links_found.txt

Alternatively, you can specify the --regex option if you would rather filter with regular expressions instead.
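
How web_crawler.py applies these options internally is not covered here, so the following is only an illustration of the two styles of rule. It assumes a filter keeps URLs that start with the given string and a regex keeps URLs that the pattern matches anywhere. Both helper names and the sample page URLs are made up for the example.

```python
import re

def keep_by_prefix(urls, prefix):
    """Keep URLs that begin with the given prefix (assumed filter behaviour)."""
    return [url for url in urls if url.startswith(prefix)]

def keep_by_regex(urls, pattern):
    """Keep URLs that the regular expression matches anywhere."""
    compiled = re.compile(pattern)
    return [url for url in urls if compiled.search(url)]

# Made-up sample URLs, only to show which ones survive each rule.
urls = [
    "https://mcpedl.com/example-page-1/",
    "https://www.facebook.com/example",
    "https://mcpedl.com/example-page-2/",
]
print(keep_by_prefix(urls, "https://mcpedl.com"))
print(keep_by_regex(urls, r"^https://mcpedl\.com/example-page-\d+/$"))
```

A prefix is enough to stay on one site. A regular expression is worth the extra effort when only certain kinds of pages should be followed.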

Protip #3: Thread Control
web_crawler.py can also run on multiple threads, 6 by default. You can choose the maximum number of threads you want to use by, again, specifying the --threads option.

> python3 web_crawler.py --threads NEWTHREADCOUNT https://mcpedl.com/ links_found.txt

Have any more questions? To learn more about MediaFire archiving, check out the MediaFlare project!

A Correction on MediaFire’s Account Removal Policy

themadprogramer — Mon, 25 Jan 2021 00:24:16 +0000

A month ago we had reported claims that MediaFire would be purging inactive users in January. At the time we had urged readers to log into their Mediafire accounts to reaffirm their activity and to consider looking into alternative file hosting services.

However, in light of recent events, Data Horde would like to issue a correction. Mediafire co-founder Tom Langridge was kind enough to comment on the situation, stating that there had, in fact, been no change of policy whatsoever.

https://twitter.com/tlangridge/status/1349515837351231488

Although, as stated previously, Mediafire’s data retention policy has been updated with more concise wording, Langridge’s comment would indicate that this change was made purely out of formality.

This raises the question: why was there ever a sudden influx of “warning e-mails” in the first place? The warning e-mails aren’t anything new, but it’s the sudden increase in frequency which had many archivists alarmed. One possible explanation might be that this isn’t Mediafire’s fault at all!

Even in 2021, Mediafire is one of the most visited websites on the web. But interestingly, search interest for Mediafire has steadily gone down for years.

This suggests that disproportionately more of Mediafire’s traffic comes from visitors who are downloading files they have been linked to, rather than users who are actively uploading new files. If anything, uploaders (or potential uploaders) appear to be searching for Mediafire far less frequently. So perhaps the cause of the warning e-mail influx was a dip in overall uploader activity.

Mediafire account removals and file deletions might be, and as far as we know are, happening; but not because of any particular change in Mediafire’s policy.

So where does that leave us? Archive Team is continuing with their #mediaonfire project, since it’s better to be proactive than reactive in cases such as this:

Too much depends on you and we are backing you up anyway, but I appreciate the effort
— Jason Scott (@textfiles) January 14, 2021

As for Data Horde, we have recently initiated a project of our own called MediaFlare. We have a simple link submission form where anyone can report important files worth preserving, which we collect into a public database. Archivists and data hoarders can then mirror files listed in this database to ensure their survival for years to come.

Many online communities rely on old uploads to Mediafire, even after accounts have gone inactive. With MediaFlare we aim to prevent situations like this:

I noticed this yesterday, with a fellow preservationist asking me to grab a bunch of their files off of MediaFire and some were blocked and removed already. Scans for crying out loud.
via Jonas Roland of Hit Save!

Again, we apologize for misreading MediaFire’s policy change. But that being said, it is still very much a good idea to log back into old accounts. And in the case where this is no longer possible, we hope that you would be willing to assist us on the MediaFlare project, so that we can preserve valuable files for years to come!

Log into your Account: Mediafire to Purge Inactive Users!

themadprogramer — Thu, 24 Dec 2020 19:12:07 +0000

Mediafire has been a popular file sharing and storage site for over a decade. Users simply upload their files to the system and can share them with the world via a link. Though Mediafire might have been more prominent in the era before the likes of Dropbox and Google Drive, 14 years of history is certainly nothing to look down on.

Every month, more than 150 million people use MediaFire to quickly and securely store, organize and share all their personal and professional data in the cloud.
Mediafire about page

Sadly, recent changes in Mediafire’s file retention policy are threatening this long history. The policy has introduced a new definition for Abandoned Accounts. Free registered* users who haven’t logged into their account for 8 months will have their account considered “abandoned”. These accounts will be given a 15-day grace period, during which Mediafire will warn the user through two e-mails to log back in if they want to keep their account. Failing to comply will result in deletion of the account and its data.

* free, as in non-premium registered users. Failing to verify an email address on a free account will lead to a termination within 14 days.

Interestingly, the file retention policy page previously used more ambiguous language, and this “8 month” criterion has been there for a few months now. The last revision on the page was in May, so assuming the first 8 month countdown has just been initiated, the first wave of account deletions will be triggered on January 11, 2021. Accounting for the 15-day grace period, we can forecast the hard deadline to be January 26, 2021.
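
The deadline arithmetic can be checked with a short sketch. The exact revision date of the policy page is not given here, so May 11, 2020 is an assumption, inferred from the January 11 date above. add_months is a hypothetical helper.

```python
from datetime import date, timedelta

def add_months(start, months):
    """Shift a date by whole months, keeping the day of the month.

    Fine for days 1-28; later days would need clamping to the month length.
    """
    index = start.month - 1 + months
    return start.replace(year=start.year + index // 12, month=index % 12 + 1)

revision = date(2020, 5, 11)                     # assumed revision date
first_wave = add_months(revision, 8)             # end of the 8 month window
hard_deadline = first_wave + timedelta(days=15)  # end of the grace period

print(first_wave.isoformat(), hard_deadline.isoformat())  # 2021-01-11 2021-01-26
```

Counting 8 months from a revision in May 2020 crosses the year boundary, so both dates fall in January 2021.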

So if you have a Mediafire account, we highly recommend that you log in today, and also once every couple of months for good measure, to keep your files safe!

What about users who’ve lost access to their accounts (forgotten password, change in email addresses)? It would seem that there is no term in the policy to account for these accounts: they qualify as abandoned even if the files they uploaded a decade ago are still being accessed to this day.

Content is retained in accounts based on account activity and not individual file activity.
Current Mediafire Account Retention Policy

In particular many modding communities, such as Mario Kart Modders, are heavily reliant on files which have been kept safe on Mediafire for years. As is to be expected, Archive Team has sprung into action, and is trying to set up a large-scale project to archive Mediafire files, aptly named #mediaonfire.

But there’s something that makes archiving Mediafire files particularly challenging. As you will recall, Mediafire files are shared by links and Mediafire doesn’t really have something akin to a search function. This means that the only way to find a particular file is to find the link where it has been shared. A high res background might have been shared on a forum, old podcasts on a personal website… Any way you slice it, it’s going to take a lot of time and effort to track even some of these links down.

The good news is that new solutions are being proposed every day. Pyxia’s mf-dl tool allows you to bulk-download the file contents if you know the links for yourself.

So spread the word! Tell your friends to log in to their Mediafire accounts, consider migrating your files to safer services and make a copy of any Mediafire links or files you come across. You never know what might happen!