How to Download Files With Wget
Wget is a great tool for automating the task of downloading entire websites, files, or anything that needs to mimic a traditional web browser. This article covers many of the things you can use wget for.
If wget isn't installed, you can use either apt or yum to install it:
Installing Wget on Debian, Ubuntu
$ sudo apt install wget
Installing Wget on RHEL, CentOS
$ sudo yum install wget
Installing Wget on Windows
There is a Windows binary for wget, but we've found that Cygwin works much better and provides other useful tools as well.
Basic Download with Wget
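For the most part you should be able to just download a file, but if the URL is https you might run into certificate problems. In that case use the --no-check-certificate flag. Keep in mind that this flag turns off certificate verification, so use it only for hosts you trust.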
$ wget --no-check-certificate https://wordpress.org/latest.tar.gz
Download File into Different Name and Location
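Maybe you want to save a file under a different name (-O) or in a different location (-P)? By default wget saves the file to the current working directory under its original file name. Note that -O takes a path of its own and ignores -P, so to change both the name and the location, put the directory in the -O argument:
$ wget -O /tmp/wp.tgz --no-check-certificate https://wordpress.org/latest.tar.gz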
Bulk Download List of Files in wget
If you need to download several files at once using wget, you can use the -i flag with a text file that lists one URL per line:
$ wget --no-check-certificate -i fileslist.txt
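As a minimal sketch of the list format, the snippet below writes a two-line list and wraps the download in a small function. The second URL and the function name fetch_list are assumptions made for illustration; they are not part of the example above.

```shell
# Write the list: one URL per line (the second URL is only illustrative).
cat > fileslist.txt <<'EOF'
https://wordpress.org/latest.tar.gz
https://wordpress.org/latest.zip
EOF

# Hypothetical helper: download every URL named in the given list file.
fetch_list() {
    wget -i "$1"
}
```

Calling `fetch_list fileslist.txt` would then fetch each URL in turn.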
Change User Agent in wget
If a site rejects requests that identify themselves as wget, you can change the user agent so that the requests look like they come from a regular browser:
$ wget --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:61.4) Gecko/20100101 Firefox/61.4" http://mozillaonly.site/
Download Entire Website
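Though you might need to fiddle with cookies, host spanning, recursion depth, domain lists, and the other more advanced flags, you should start with a basic download of an entire website, using the "mirror" flag (-m) together with the "local browsing" flags that convert links (-k) and fetch page requisites such as images and stylesheets (-p):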
$ wget -m -k -p https://awesomesite.com
Tip: You might also need to gunzip the files if they are compressed.
Rate Limit Wget Downloads
It is rude to blindly torch a server's resources. It is polite (and won't set off as many alarms) to request resources at a more respectable rate. Many site administrators block wget because, by default, people do not behave nicely. Here is how to be more polite when using wget:
$ wget --limit-rate=20k --wait=60 --random-wait --mirror site.com
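To get a feel for what these settings cost in time, here is a rough sketch that estimates the total pause for a crawl. It assumes that --random-wait varies each pause between 0.5 and 1.5 times the --wait value; the function name wait_bounds and the request count of 100 are made up for illustration.

```shell
# Hypothetical helper: rough lower and upper bounds, in seconds, on the
# total time spent pausing during a crawl.
#   $1 = number of requests, $2 = the --wait value in seconds
# Assumes every pause falls between 0.5 and 1.5 times the --wait value.
wait_bounds() {
    requests=$1
    pause=$2
    echo "$(( requests * pause / 2 )) $(( requests * pause * 3 / 2 ))"
}

wait_bounds 100 60   # prints "3000 9000"
```

In other words, 100 requests at --wait=60 spend somewhere between 50 minutes and two and a half hours just waiting, so a smaller wait may be more practical for large sites.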
Use Passwords with wget
This works only with HTTP authentication (the browser-style username and password prompt), not with login forms. Here are the flags for supplying a user and password:
$ wget --http-user=USER --http-password=PASS URL
Use wget to Check for Broken Links
If you are scanning a site, it's polite to wait 1 second between grabs. The following will spider a site and look for broken links, writing the results to a wget.log file.
$ wget --spider -o wget.log -e robots=off --wait 1 -r -p https://grimoire.jamesfraze.com/
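Once the spider run has finished, the log still has to be searched for the failures. The sketch below is one way to do that; it assumes the usual wget 1.x log layout, in which every request starts with a line of the form "--DATE TIME--  URL" and a dead link is followed by a line containing "broken link!!!". The function name broken_links is hypothetical, and the pattern may need adjusting for other wget versions or locales.

```shell
# Hypothetical helper: list the URLs that wget marked as broken in a
# spider log. Relies on the assumed log layout described above.
broken_links() {
    awk '
        /^--[0-9]/       { url = $NF }   # remember the most recent request URL
        /broken link!!!/ { print url }   # report it once wget flags it as broken
    ' "$1" | sort -u
}
```

Running `broken_links wget.log` would then print each broken URL once.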
Download MP3 files from Directory
It may be useful to limit your downloads to a specific directory and its subdirectories. The --no-parent flag will help with this. Here is an example to download mp3 files from a directory:
$ wget --level=1 --recursive --no-parent --accept mp3,MP3 http://site.com/mp3/
Download all Pictures from Website using Wget
This example will put all of the jpg, gif, png, and jpeg files from site.com/images into the /tmp/pictures folder:
$ wget --directory-prefix=/tmp/pictures --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://site.com/images/
Scan list of sites for New PDFs
Sometimes, there are particular files you are interested in and ONLY those files. Wouldn’t it be nice to monitor multiple websites for these files all at once and keep a local copy for easy browsing at your leisure? You can surely do this, though it might not provide the site owner with the ad revenue or metrics that they desire:
$ wget -r --level=1 -H --timeout=1 -nd -N -np --accept=pdf,PDF -e robots=off -i reportsites.txt
Using wget with login cookies
You can have wget collect the cookies itself, or you can log in with a browser, export the cookie file manually, and have wget load it. I was able to use this to get past a WordPress password prompt on a membership site.
$ wget --cookies=on --save-cookies cookies.txt --keep-session-cookies --post-data 'user=me&password=123' http://s.com/login.php
$ wget --cookies=on --load-cookies cookies.txt --keep-session-cookies http://s.com/protecteddir
Populate Cache Using Wget
WordPress has plugins that cache. There are also Squid proxies and a plethora of other caching mechanisms. If you want to preload your caches (whatever they are), you can do it with wget:
$ wget -o /dev/null -r --delete-after http://dynamicsite.com
Use wget Through Proxy
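We often open a SOCKS proxy over ssh to a remote server to bypass firewalls (ssh user@remote -D 7070). After the proxy is set up, we point Firefox and its SOCKS proxy config at 127.0.0.1:7070. You can send wget through a proxy in a similar way, with one caveat: wget reads the http_proxy and https_proxy variables and expects an HTTP proxy there, so a raw SOCKS port like the one above typically needs an HTTP-to-SOCKS bridge in front of it: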
$ export http_proxy="http://127.0.0.1:7070"
$ wget [normal wget usage, but now it's going through proxy]
If you use a different proxy, then just export it appropriately and your wget will pick it up from the environment.
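If you would rather not export the proxy for the whole session, a small wrapper can set it for a single command. This is only a sketch: the function name with_proxy is hypothetical, and it sets https_proxy as well on the assumption that https URLs should use the same proxy.

```shell
# Hypothetical helper: run one command with the proxy variables set,
# without exporting them for the rest of the shell session.
#   $1 = proxy URL, remaining arguments = the command to run
with_proxy() {
    proxy=$1
    shift
    http_proxy="$proxy" https_proxy="$proxy" "$@"
}
```

For example, `with_proxy http://127.0.0.1:7070 wget https://wordpress.org/latest.tar.gz` sends just that one download through the proxy.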
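Download with wget Using Timestamps
This isn't so much a feature of wget as it is of the shell, but working hand in hand they let you take a dynamic site and pull periodic data from it, saving it as sequential snapshots. The backslashes before each % are needed only when the command runs from a crontab entry, where an unescaped % has a special meaning; in an interactive shell they are harmless.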
$ wget --output-document=results_$(date +\%Y\%m\%d\%H).gif http://dynamicsite.com/stats
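The same idea can be split into two small functions, which makes the file name easy to check before anything is downloaded. This is a sketch only; the function names snapshot_name and take_snapshot are hypothetical.

```shell
# Hypothetical helper: build the snapshot file name from the current
# year, month, day, and hour, in the form results_YYYYMMDDHH.gif.
snapshot_name() {
    printf 'results_%s.gif\n' "$(date +%Y%m%d%H)"
}

# Hypothetical helper: fetch the stats page into the timestamped file.
take_snapshot() {
    wget --output-document="$(snapshot_name)" http://dynamicsite.com/stats
}
```

Running `take_snapshot` once an hour would then build up one file per hour, since two runs within the same hour write to the same name.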