Block URLs with robots.txt
A robots.txt file is a text file that stops web crawler software, such as Googlebot, from crawling certain pages of your site. The file is essentially a list of commands, such Allow and Disallow, that tell web crawlers which URLs they can or cannot retrieve. So, if a URL is disallowed in your robots.txt, that URL and its contents won’t appear in Google Search results.
How robots.txt works
It works likes this: a robot wants to vists a Web site URL, say http://www.example.com/welcome.html. Before it does so, it firsts checks for http://www.example.com/robots.txt, and finds:
The “User-agent: *” means this section applies to all robots. The “Disallow: /” tells the robot that it should not visit any pages on the site.
There are two important considerations when using /robots.txt:
robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
the /robots.txt file is a publicly available file. Anyone can see what sections of your server you don’t want robots to use.
So don’t try to use /robots.txt to hide information.
How to create a /robots.txt file
When a robot looks for the “/robots.txt” file for URL, it strips the path component from the URL (everything from the first single slash), and puts “/robots.txt” in its place.
For example, for “http://www.example.com/shop/index.html, it will remove the “/shop/index.html”, and replace it with “/robots.txt”, and will end up with “http://www.example.com/robots.txt”.
What to put in it
The “/robots.txt” file is a text file, with one or more records. Usually contains a single record looking like this:
In this example, three directories are excluded.
The syntax for using the keywords is as follows:
User-agent: [the name of the robot the following rule applies to]
Disallow: [the URL path you want to block]
Allow: [the URL path in of a subdirectory, within a blocked parent directory, that you want to unblock]
Note that you need a separate “Disallow” line for every URL prefix you want to exclude — you cannot say “Disallow: /cgi-bin/ /tmp/” on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.
Note also that globbing and regular expression are not supported in either the User-agent or Disallow lines. The ‘*’ in the User-agent field is a special value meaning “any robot”. Specifically, you cannot have lines like “User-agent: *bot*”, “Disallow: /tmp/*” or “Disallow: *.gif”.
What you want to exclude depends on your server. Everything not explicitly disallowed is considered fair game to retrieve. Here follow some examples:
To exclude all robots from the entire server
To allow all robots complete access
(or just create an empty “/robots.txt” file, or don’t use one at all)
To exclude all robots from part of the server
To exclude a single robot
To allow a single robot
To exclude all files except one
This is currently a bit awkward, as there is no “Allow” field. The easy way is to put all files to be disallowed into a separate directory, say “stuff”, and leave the one file in the level above this directory:
Alternatively you can explicitly disallow all disallowed pages: