robots.txt Tutorial - Block Bad Bots
Some bots ignore robots.txt files entirely, since they don't care whether
you want them on your website or not.
These bots can be blocked with a .htaccess file instead.
1. Block robots via .htaccess
We can't block robots by name here; instead, we block them by matching the beginning of their User-Agent string:
# Flag requests whose User-Agent begins with a known spambot name
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^ExtractorPro" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
# Deny GET and POST requests from any flagged robot
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
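Note that Order, Allow, and Deny are Apache 2.2 directives. If your server
runs Apache 2.4, where these directives are deprecated, an equivalent deny
block might look like this sketch (keep the same SetEnvIfNoCase lines above
it; mod_setenvif and the standard authorization modules are assumed to be
loaded):
<RequireAll>
# Serve everyone except requests carrying the bad_bot flag
Require all granted
Require not env bad_bot
</RequireAll>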
The SetEnvIfNoCase lines above flag every robot on our spambot list.
To block another robot, add a line for it near the top of the block:
SetEnvIfNoCase User-Agent "^User-Agent" bad_bot
Replace User-Agent with the robot's actual User-Agent string, as it appears
in your log files. Here's a sample log entry:
xyz.net - - [07/Mar/2003:11:28:35] "GET / HTTP/1.0" 403 - "-" "Teleport 1.28"
Here, the User-Agent is Teleport 1.28. The ^
character in the SetEnvIfNoCase lines anchors the pattern to the beginning
of the User-Agent string.
Any User-Agent starting with Teleport is flagged and denied, regardless
of version number or trailing text.
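For instance, the Teleport line behaves like this (the User-Agent strings
in the comments are hypothetical examples):
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
# Flags: "Teleport 1.28", "Teleport Pro/1.29", "teleport" (matching is case-insensitive)
# Does not flag: "Mozilla/4.0 (compatible; Teleport)", because Teleport is not at the start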
2. Tool to create the .htaccess file
This tool can create .htaccess files for you, blocking some of the robots
discussed in this tutorial.
You can also enter up to six custom User-Agent strings to block from your
site. Enter one per box.
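For example, if you entered WebCopier and EmailSpider (two hypothetical
names) in the boxes, the generated file might include lines like these in
addition to the standard list:
SetEnvIfNoCase User-Agent "^WebCopier" bad_bot
SetEnvIfNoCase User-Agent "^EmailSpider" bad_bot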