Robots.txt Tutorial for Web Crawlers

Robots.txt implements the Robots Exclusion Protocol, the standard that tells web crawlers which files and directories they should not crawl. It is a plain text file named robots.txt, placed in the root directory of the website. Almost every search engine spider or crawler looks for this file and follows the instructions in it. If the file is missing from the root directory or is left blank, the crawler assumes that every link may be downloaded and indexed. The file uses a simple format of "field: value" lines, and instructions can be given either for a specific crawler or as defaults for all crawlers.
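
Because the file always sits in the root directory, a crawler can work out its location from any page URL on the site. The following Python sketch shows one way to do that. The helper name robots_url and the host example.com are assumptions made for this illustration, not part of any standard API.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    robots_url is a hypothetical helper name used only for illustration.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# example.com is a placeholder host.
print(robots_url("https://example.com/private/restricted.html"))
```

Whatever directory the page is in, the result points at the root of the same host.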

User-agent
This field names the search engine crawler that the instructions apply to. For example, to give special instructions to Google's crawler, "googlebot", write:

User-agent: googlebot

To make a default entry for all robots, use the wildcard character "*":

User-agent: *

Disallow
This field tells the crawler which file or directory it must not crawl. Each User-agent entry can have one or more Disallow fields, each on its own line. The value is a path that starts at the root of the site, so it begins with "/". For example, to keep crawlers away from the file restricted.html in the root directory, write:

Disallow: /restricted.html

If the file is not in the root directory but in a directory called "private", use the following entry:

Disallow: /private/restricted.html

To disallow a whole directory, write:

Disallow: /private/

Values are matched as prefixes of the path. If Disallow: /private is used (without the trailing slash), both /private.html and /private/restricted.html are restricted.

If the Disallow: value is left blank, all files can be crawled. To restrict the whole website, use Disallow: /.
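
The Disallow rules described in this section amount to a plain prefix test on the URL path. The following Python sketch illustrates that test. The helper name is_disallowed and the extra path /index.html are assumptions made for this example, and the wildcard extensions that some crawlers support are not modeled.

```python
def is_disallowed(path: str, disallow_values: list[str]) -> bool:
    """Return True if path starts with any non-empty Disallow value.

    is_disallowed is a hypothetical helper name used only for illustration.
    """
    # An empty value ("Disallow:") blocks nothing, so it is skipped.
    return any(rule and path.startswith(rule) for rule in disallow_values)


# /index.html is an extra sample path, not taken from the tutorial text.
PATHS = ("/private.html", "/private/restricted.html", "/index.html")

for rule in ("/private", "/private/", "/", ""):
    blocked = [p for p in PATHS if is_disallowed(p, [rule])]
    print(repr(rule), "blocks", blocked)
```

Running it shows that "/private" catches both /private.html and /private/restricted.html, "/private/" catches only the file inside the directory, "/" catches everything, and the blank value catches nothing.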

Some Examples
The following example specifies that no robot should visit or download any file on the website:

User-agent: *
Disallow: /

The following example specifies that every robot can download every file on the website:

User-agent: *
Disallow:

The following example specifies that no robot should crawl the file /private/restricted.html, the /temp/ directory, or the file /restricted.html:

User-agent: *
Disallow: /private/restricted.html
Disallow: /temp/
Disallow: /restricted.html
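
To show how an entry with several Disallow lines breaks down, here is a minimal Python sketch that groups the Disallow values under each User-agent name. The function parse_groups is a hypothetical helper written for this tutorial; it handles only the two fields described here and is not a complete robots.txt parser.

```python
def parse_groups(text: str) -> dict[str, list[str]]:
    """Map each User-agent name to the Disallow values listed under it.

    parse_groups is a hypothetical helper; it ignores every field other
    than User-agent and Disallow.
    """
    groups: dict[str, list[str]] = {}
    current: list[str] = []  # agents named by the entry being read
    in_rules = False         # True once the entry has a Disallow line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent after rules starts a new entry
                current, in_rules = [], False
            current.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field == "disallow" and current:
            in_rules = True
            for agent in current:
                groups[agent].append(value)
    return groups


SAMPLE = """\
User-agent: *
Disallow: /private/restricted.html
Disallow: /temp/
Disallow: /restricted.html
"""

print(parse_groups(SAMPLE))
```

For the example above, the result has a single key, "*", holding the three Disallow values in the order they were written.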

The following example specifies that one robot is barred from the whole site while all other robots may crawl it:

User-agent: cybercracker
Disallow: /

User-agent: *
Disallow:
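
A file like this can be checked before it is published. The sketch below uses Python's standard urllib.robotparser module to parse the last example and ask whether a given robot may fetch a path. The crawler name "somebot" is an assumption standing in for any robot other than cybercracker.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: cybercracker
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
# parse() takes the file content as a sequence of lines.
parser.parse(ROBOTS_TXT.splitlines())

# "somebot" is a hypothetical crawler name standing in for any other robot.
for agent in ("cybercracker", "somebot"):
    allowed = parser.can_fetch(agent, "/private/restricted.html")
    print(agent, "allowed" if allowed else "blocked")
```

With these rules, cybercracker is reported as blocked and somebot as allowed, which matches the intent of the example.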