Avoiding robots.txt in wget

I occasionally use the wget utility with the -m option to download a mirror of an entire website. This is very handy, but wget respects the robots.txt file, so it won’t mirror a site if robots.txt disallows it.
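Under the hood, -m is just a shorthand for wget's mirroring options (recursion, timestamping, infinite depth). According to the GNU wget manual, spelling it out should look roughly like this:

wget -r -N -l inf --no-remove-listing http://www.mysite.com/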

Obviously, you should respect other sites' downloading restrictions, but there are times when you have a valid reason to ignore them (when it's your own site, for instance, and you don't want to change robots.txt on the live server). In that case, here's what to do. First, run wget with the -m option; it will download the robots.txt file and then quit. Next, edit the downloaded robots.txt, changing Disallow to Allow where necessary, and save it. Then change the permissions on that file to 444 (read-only). Finally, run the same wget -m command again.
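The edit itself is usually trivial. Here's a minimal before/after sketch, assuming a robots.txt that blocks everything with a single rule (your real file will differ):

Before:
User-agent: *
Disallow: /

After:
User-agent: *
Allow: /

(Leaving the value empty, as in "Disallow:", also permits everything.)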

On the second run, the read-only permissions will prevent wget from overwriting your edited robots.txt with the server's version that disallows mirroring, so it will go on happily mirroring the rest of the site.

Here is the sequence of commands (replace vi with your editor of choice):

wget -m http://www.mysite.com/
vi www.mysite.com/robots.txt (edit and save)
chmod 444 www.mysite.com/robots.txt
wget -m http://www.mysite.com/
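If you'd rather skip the interactive edit, something along these lines should also work. It's only a sketch, and it assumes a simple robots.txt where blindly turning every Disallow into Allow is acceptable:

# first pass: grabs robots.txt, then stops
wget -m http://www.mysite.com/
# rewrite the rules (assumes flipping Disallow to Allow is what you want)
sed 's/^Disallow:/Allow:/' www.mysite.com/robots.txt > /tmp/robots.txt
mv /tmp/robots.txt www.mysite.com/robots.txt
# make the file read-only so the second pass can't overwrite it
chmod 444 www.mysite.com/robots.txt
# second pass: mirrors the rest of the site
wget -m http://www.mysite.com/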