Information Technology Grimoire

Version 0.0.1

IT Notes from various projects because I forget, and hopefully they help you too.

wget

Wget is a great tool for automating the task of downloading entire websites, files, or anything that needs to mimic a traditional web browser. This article covers many of the ways you can use wget.

If wget isn’t installed, you can use either apt or yum to install it:

Installing Wget on Debian, Ubuntu

sudo apt install wget

Installing Wget on RHEL, CentOS

sudo yum install wget

Installing Wget on Windows

There is a Windows binary for wget, but we’ve found that Cygwin works much better and provides other useful tools as well.

wget flags

  • --keep-session-cookies
  • --save-cookies file.txt
  • --load-cookies file.txt
  • --http-user=user
  • --http-password=password
  • --no-cache
  • --user-agent=
  • --no-check-certificate
  • --secure-protocol=SSLv3|TLSv1
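
Several of these can be combined in a single call; a rough example (the host and credentials below are placeholders):

wget --keep-session-cookies --save-cookies cookies.txt \
     --http-user=user --http-password=password \
     --user-agent="Mozilla/5.0" --no-check-certificate \
     https://example.com/members/index.html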

wget exit codes

  • 0 No problems occurred.
  • 1 Generic error code.
  • 2 Parse error—for instance, when parsing command-line options, the .wgetrc or .netrc…
  • 3 File I/O error.
  • 4 Network failure.
  • 5 SSL verification failure.
  • 6 Username/password authentication failure.
  • 7 Protocol errors.
  • 8 Server issued an error response.
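
These are exit statuses, so a script can branch on them. A minimal sketch (the URL is a placeholder):

wget -q https://example.com/file.tar.gz
status=$?
if [ "$status" -ne 0 ]; then
    echo "wget failed with exit code $status" >&2
fi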

View headers

wget -SO- -T 5 -t 1 http://google.com

Grep the Stream

wget -SO- -T 5 -t 1 http://google.com 2>&1 | egrep "expression"
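
For instance, to pull just the status line and content type out of the headers (adjust the expression to taste):

wget -SO- -T 5 -t 1 http://google.com 2>&1 | egrep -i "HTTP/|Content-Type"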

Basic Download with Wget

For the most part you should be able to just download a file, but if it’s HTTPS you might run into certificate problems. In that case, use the --no-check-certificate flag.

wget --no-check-certificate https://wordpress.org/latest.tar.gz

Download File into Different Name and Location

Maybe you want to download the file under a different name (-O) or into a different directory (-P)? By default wget saves the file in the current working directory under its original name. One caveat: -O takes a literal path and ignores -P, so to rename and relocate in one shot just give -O the full path; use -P on its own when wget should keep the original file name.

wget -O /tmp/wp.tgz --no-check-certificate https://wordpress.org/latest.tar.gz

Bulk Download List of Files in wget

If you need to download several files at once with wget, use the -i flag with a text file that lists one URL per line:

wget --no-check-certificate -i fileslist.txt
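
fileslist.txt is plain text with one URL per line, for example (these URLs are placeholders):

https://wordpress.org/latest.tar.gz
https://example.com/backups/backup.zip
https://example.com/images/logo.png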

Change User Agent in wget

If by chance they do not like wget hammering their website, you can change the user agent so they don’t know:

wget --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:61.4) Gecko/20100101 Firefox/61.4" http://mozillaonly.site/

Download Entire Website

Though you might need to fiddle with cookies, host spanning, recursion depth, domain restrictions, and the other more advanced flags, you should start with a basic download of an entire website using the mirror and local-browsing flags:

wget -m -k -p https://awesomesite.com

or my favorite:

wget --recursive --page-requisites --html-extension --convert-links --no-parent --wait=2 --random-wait https://site.com

Tip: You might also need to gunzip the files if they are compressed.
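
A quick way to spot and unpack a gzipped page (a rough sketch; gunzip wants the .gz suffix before it will touch the file):

file index.html
mv index.html index.html.gz && gunzip index.html.gz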

Rate Limit Wget Downloads

It is rude to blindly torch a server’s resources. It is polite (and won’t set off as many alarms) to request resources at a more respectable rate. Many site administrators will block wget because, by default, people do not behave nicely. Here is how to be more polite when using wget:

wget --limit-rate=20k --wait=60 --random-wait --mirror site.com

Use Passwords with wget

This only works with basic auth, but here are the flags for passing a username and password for HTTP authentication:

wget --http-user=USER --http-password=PASS URL
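
If you would rather not leave the password in your shell history or the process list, wget can prompt for it instead with --ask-password:

wget --http-user=USER --ask-password URL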

Spider a Site for Broken Links

If you are scanning a site, it’s polite to wait one second between requests. The following will spider a site looking for broken links and dump the results to a wget.log file.

wget --spider -o wget.log -e robots=off --wait 1 -r -p https://site.com/
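
When the crawl finishes, the broken links can be fished out of wget.log; the exact log wording varies by wget version, but grepping for the error status usually does the trick:

grep -B 2 "404 Not Found" wget.log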

Download MP3 files from Directory

It may be useful to limit your downloads to a specific directory and its subdirectories. The --no-parent flag helps with this. Here is an example that downloads MP3 files from a directory:

wget --level=1 --recursive --no-parent --accept mp3,MP3 http://site.com/mp3/

Download all Pictures from Website using Wget

This example puts all of the jpg, gif, png, and jpeg files from site.com/images into the /tmp/pictures folder:

wget --directory-prefix=/tmp/pictures --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://site.com/images/

Scan list of sites for New PDFs

Sometimes, there are particular files you are interested in and ONLY those files. Wouldn’t it be nice to monitor multiple websites for these files all at once and keep a local copy for easy browsing at your leisure? You can surely do this, though it might not provide the site owner with the ad revenue or metrics that they desire:

wget -r --level=1 -H --timeout=1 -nd -N -np --accept=pdf,PDF -e robots=off -i reportsites.txt
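
reportsites.txt holds one site URL per line, and the whole command drops nicely into cron for a nightly run (paths below are placeholders; -o /dev/null just keeps cron from mailing the full log):

0 2 * * * cd /data/reports && wget -r --level=1 -H --timeout=1 -nd -N -np --accept=pdf,PDF -e robots=off -i /data/reports/reportsites.txt -o /dev/null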

Using wget with login cookies

You can have wget handle cookies itself, or you can log in with a browser and hand wget the cookie file after you create it manually. I was able to use this to get past the password prompt on a recent WordPress membership site.

wget --cookies=on --save-cookies cookies.txt --keep-session-cookies --post-data 'user=me&password=123' http://s.com/login.php
wget --cookies=on --load-cookies cookies.txt --keep-session-cookies http://s.com/protecteddir

Populate Cache Using Wget

WordPress has plugins that cache. There are also Squid proxies and a plethora of other caching mechanisms. If you want to preload your caches (whatever they are), you can do it with wget:

wget -o /dev/null -r --delete-after http://dynamicsite.com

Use wget Through Proxy

We use a SOCKS proxy quite a bit, opened with ssh to a remote server to bypass firewalls (ssh user@remote -D 7070). After the tunnel is up, we point Firefox at 127.0.0.1:7070 in its SOCKS proxy settings. Note that wget itself does not speak SOCKS, so that tunnel won’t work for it directly; if you have an HTTP proxy available (proxyhost:3128 below is a placeholder), you could send wget through it like this:

export http_proxy="http://proxyhost:3128"
wget [normal wget usage, but now it's going through the proxy]

If you use a different proxy, then just export it appropriately and your wget will pick it up from the environment.
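
For the SOCKS tunnel itself (ssh -D), one common workaround is to wrap wget in proxychains; this assumes proxychains is installed, its config points at the tunnel, and the URL below is a placeholder:

proxychains wget https://internal.site/report.pdf

The relevant line in /etc/proxychains.conf (or proxychains4.conf) would be something like: socks5 127.0.0.1 7070.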

Timestamped Downloads with wget

This isn’t so much a feature of wget as of the shell, but working hand in hand they let you pull periodic data from a dynamic site, saving each grab as a sequential snapshot.

wget --output-document=results_$(date +\%Y\%m\%d\%H).gif http://dynamicsite.com/stats
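
Dropped into a crontab, that same command grabs a fresh snapshot every hour; the backslash-escaped % signs are exactly what cron requires (the path and URL are placeholders):

0 * * * * wget -q --output-document=/data/snaps/results_$(date +\%Y\%m\%d\%H).gif http://dynamicsite.com/stats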