dt.iki.fi
9.6.2025 16.2.2016 linux, download

Scraping the internet with wget

Download all files of (a) certain type(s) from a website

This will download all playlists for all soma.fm radio stations into the current folder, omitting the low-quality 32 and 24 kbit/s stations. It should work with other sites containing playlists.

If you get a “Forbidden” error for certain folders, it might help to start at the root of the site.

wget --recursive --no-directories --level=2 --accept pls,PLS,m3u,M3U \
--reject "*32.pls","*24.pls" \
https://somafm.com/ ; rm robots.txt

Breaking it down:

  • --recursive Recursive retrieval. The default maximum depth is 5.
  • --no-directories Do not create a hierarchy of directories when retrieving recursively. All files get saved to the current directory, without clobbering (if a name shows up more than once, the filenames get extensions .n).
  • --level=2 Specify the maximum recursion depth. Use it if you know at which level the files are, otherwise you might get huge amounts of pointlessly transferred data (in this example ~500 kB with level 2 and 47 MB without).
  • --accept, --reject Comma-separated lists of file name suffixes or patterns (simple wildcards, not regexes) to accept or reject.
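The accept/reject patterns are ordinary shell-style globs, so you can try a pattern against a file name with the shell's own case matching before handing it to wget (a standalone sketch; the station file names are just examples):

```shell
# Shell-style glob matching -- the same kind of pattern wget's
# --accept/--reject options use (simple wildcards, not regexes).
matches() {
    case "$1" in
        $2) echo "match" ;;
        *)  echo "no match" ;;
    esac
}
matches "groovesalad32.pls" "*32.pls"   # would be hit by --reject "*32.pls"
matches "groovesalad.pls" "*32.pls"
```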

If the server kicks you out or blocks you for an unreasonable amount of time, try appending these options:

  • --wait=1 Wait the specified number of seconds between retrievals.
  • --random-wait Causes the time between requests to vary between 0.5 and 1.5 times the --wait value.
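To see what a given --wait value translates to when --random-wait is active, the delay range can be computed directly (a quick awk illustration, not something wget prints):

```shell
# --random-wait picks each delay between 0.5*wait and 1.5*wait seconds.
wait_secs=2   # the value you would pass to --wait
awk -v w="$wait_secs" 'BEGIN { printf "%.1f to %.1f seconds\n", 0.5 * w, 1.5 * w }'
```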

This will download a large collection of transparent tiles from Transparent Textures:

wget -nd -r -l 2 -A png http://www.transparenttextures.com/

Download a directory tree’s content from a website

wget --recursive --no-parent --reject="index.html*" "$URL"
  • --recursive Recursive retrieval. The default maximum depth is 5, counted from the starting directory, not the site root (see --level=n if you need to change this).
  • --no-parent Won't go up from that directory, only descend further into it. However, the (mostly empty) parent directory structure is preserved.
  • --reject Comma-separated list of file name suffixes or patterns (simple wildcards, not regexes) to reject.

To also dispense with the useless empty directory tree (something like www.example.com/interesting/project/source/dir_containing_desired_data) you can use --no-host-directories and --cut-dirs.
In this example, adding --no-host-directories --cut-dirs=3 will leave you with a folder named dir_containing_desired_data that contains the desired data (including any subfolders it might contain).
However, since such folders usually have non-unique names like src or util, it might be better to drop --no-host-directories and increase the --cut-dirs count by 1, so everything lands under the (unique) host-name folder instead.
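The right --cut-dirs number can be worked out by counting the URL's path components. A sketch of that arithmetic, using the hypothetical URL from above:

```shell
# Count the directory components in a URL's path.
URL="http://www.example.com/interesting/project/source/dir_containing_desired_data/"
str="${URL%/}"      # strip the trailing slash, if any
str="${str#*://}"   # strip the protocol
# Splitting on "/" gives host + path components, so NF-1 is the path depth.
echo "$str" | awk -F/ '{ print NF - 1 }'
```

Here the result is 4: pass it directly as --cut-dirs when --no-host-directories is omitted, or subtract 1 when it is used.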

Here’s a little script that automates the last example:

#!/bin/sh

type awk >/dev/null 2>&1 || {
    echo "awk is required." >&2
    exit 1
}
test "$#" -eq 1 || {
    echo "Please provide exactly one URL to download the last component recursively." >&2
    exit 1
}

URL="$1"

# For counting directory components
# remove trailing slash, if any
str="${URL%/}"
# and the protocol (*:// handles both http:// and https://)
str="${str#*://}"

wget --recursive --no-parent --reject="index.html*" --cut-dirs="$(echo "$str" | awk -F/ '{print NF-1}')" "$URL"

robots="${str%%/*}/robots.txt"
if test -r "$robots"; then rm "$robots"; fi
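The robots.txt cleanup at the end assumes wget saved the file inside the host-name folder (which it does when --no-host-directories is not used); the path is derived like this (a standalone sketch with a hypothetical URL):

```shell
# Derive where wget left robots.txt: the host is everything before
# the first "/" of the protocol-stripped URL (hypothetical example).
str="www.example.com/interesting/project/source/dir_containing_desired_data"
robots="${str%%/*}/robots.txt"
echo "$robots"
```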