Robots.txt Tutorial
Robots.txt is the standard (the Robots Exclusion Protocol) for telling web crawlers which files and
directories should not be crawled. It is a plain text file named robots.txt, placed in the root directory
of the website. Almost every search engine spider or crawler looks for this file and follows the
instructions in it, although compliance is voluntary. If the file is missing from the root directory or is
left blank, the crawler assumes that every link may be downloaded and indexed. The file uses a simple
format of "field: value" lines. Instructions can be given to a specific crawler, or default settings can
be made for all crawlers.
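The behaviour described above can be sketched in Python. This is only an illustration, assuming the
standard-library module urllib.robotparser stands in for a crawler; real crawlers may differ in detail.
The site www.example.com, the agent name "examplebot" and the helper names robots_url and is_allowed are
hypothetical. The sketch shows where the file is looked up and that a blank file restricts nothing.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url."""
    # A leading "/" makes urljoin resolve against the root of the site.
    return urljoin(page_url, "/robots.txt")


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical page URL, used only for illustration.
page = "https://www.example.com/docs/index.html"

print(robots_url(page))                      # https://www.example.com/robots.txt
# A blank robots.txt contains no rules, so nothing is disallowed.
print(is_allowed("", "examplebot", page))    # True
```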
User-agent
This field names the crawler that the rules which follow apply to. For example, to give special
instructions to the Google search engine crawler "googlebot", write:
User-agent: googlebot
To make a default entry for all robots, use the wildcard character "*" as follows:
User-agent: *
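As a rough illustration of how a named entry and the default entry interact, the sketch below again
assumes Python's urllib.robotparser as the crawler. The /drafts/ path, the site and the agent name
"otherbot" are made up for the example, and the Disallow lines it uses are explained in the next section.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one entry for googlebot, one default entry.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /drafts/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://www.example.com/drafts/notes.html"

# googlebot matches its own entry; any other name falls back to "*".
googlebot_ok = parser.can_fetch("googlebot", url)
otherbot_ok = parser.can_fetch("otherbot", url)

print(googlebot_ok, otherbot_ok)   # False True
```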
Disallow
This field tells the crawler which file or directory must not be crawled. Each User-agent entry can have
one or more Disallow fields, each on its own line. The value is a URL path and must begin with "/". For
example, to keep the crawler away from the file restricted.html in the root directory, write:
Disallow: /restricted.html
If the file is not in the root directory but in a directory called "private", use the following
entry:
Disallow: /private/restricted.html
To disallow a whole directory, write:
Disallow: /private/
If Disallow: /private is used (without the trailing slash), both /private.html and
/private/restricted.html are restricted, because the value is matched as a prefix of the URL path.
If Disallow: is left blank, every file may be crawled. To restrict the whole website, use Disallow: /
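The matching rules above can be tried out with a short Python sketch. It assumes urllib.robotparser from
the standard library, which compares each Disallow value against the start of the URL path; other
crawlers may support extra syntax. The helper name allowed, the site and the agent name "examplebot" are
hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed(disallow_value: str, path: str) -> bool:
    """Check path against a single Disallow rule that applies to all robots."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_value}"])
    # Hypothetical site and agent name, used only for illustration.
    return parser.can_fetch("examplebot", "https://www.example.com" + path)


paths = ("/private.html", "/private/restricted.html", "/index.html")

# Print one line per (rule, path) pair to compare the four Disallow values.
for value in ("/private", "/private/", "", "/"):
    for path in paths:
        print(f"Disallow: {value!r:12} {path:28} allowed={allowed(value, path)}")
```

With "/private" both /private.html and /private/restricted.html are refused, while "/private/" refuses
only the file inside the directory.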
Some Examples
The following example tells all robots not to visit or download any file on the website:
User-agent: *
Disallow: /
The following example specifies that every robot can download every file on the website:
User-agent: *
Disallow:
The following example tells all robots not to crawl the file /private/restricted.html, the /temp/
directory, or the file /restricted.html in the root directory:
User-agent: *
Disallow: /private/restricted.html
Disallow: /temp/
Disallow: /restricted.html
The following example blocks one robot from the whole site and permits all other robots to crawl
it:
User-agent: cybercracker
Disallow: /
User-agent: *
Disallow:
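The four examples above can be checked with a small Python sketch. It assumes urllib.robotparser from the
standard library as a stand-in for a crawler. The site, the agent name "examplebot", the test paths and
the names can_fetch, BLOCK_ALL, ALLOW_ALL, BLOCK_SOME and BLOCK_ONE are hypothetical and exist only for
this illustration.

```python
from urllib.robotparser import RobotFileParser

SITE = "https://www.example.com"  # hypothetical site


def can_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, SITE + path)


# The four robots.txt files from the examples, in the same order.
BLOCK_ALL = "User-agent: *\nDisallow: /"
ALLOW_ALL = "User-agent: *\nDisallow:"
BLOCK_SOME = (
    "User-agent: *\n"
    "Disallow: /private/restricted.html\n"
    "Disallow: /temp/\n"
    "Disallow: /restricted.html"
)
BLOCK_ONE = "User-agent: cybercracker\nDisallow: /\nUser-agent: *\nDisallow:"

# (robots.txt, agent, path, expected result)
checks = [
    (BLOCK_ALL, "examplebot", "/index.html", False),
    (ALLOW_ALL, "examplebot", "/index.html", True),
    (BLOCK_SOME, "examplebot", "/private/restricted.html", False),
    (BLOCK_SOME, "examplebot", "/temp/cache.txt", False),
    (BLOCK_SOME, "examplebot", "/restricted.html", False),
    (BLOCK_SOME, "examplebot", "/private/other.html", True),
    (BLOCK_ONE, "cybercracker", "/index.html", False),
    (BLOCK_ONE, "examplebot", "/index.html", True),
]

results = [can_fetch(txt, agent, path) for txt, agent, path, _ in checks]

for (txt, agent, path, expected), got in zip(checks, results):
    status = "ok" if got == expected else "MISMATCH"
    print(f"{agent:13} {path:26} allowed={got!s:5} {status}")
```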