Robots.txt File

What is robots.txt?

Robots.txt is a text (not HTML) file you put on your site to tell search robots which pages you would like them not to visit. Robots.txt is by no means mandatory for search engines, but well-behaved search engines generally obey what they are asked not to do. It is important to clarify that robots.txt is not a way of preventing search engines from crawling your site (it is not a firewall or a kind of password protection); putting up a robots.txt file is rather like putting a "Please do not enter" note on an unlocked door. That is why, if you have really sensitive data, it is naive to rely on robots.txt to protect it from being indexed and displayed in search results.

The format and semantics of the "/robots.txt" file are as follows: the file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form "<field>:<optionalspace><value><optionalspace>". The field name is case insensitive.
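
As an illustration of how a compliant crawler consults this file, the short sketch below uses Python's standard urllib.robotparser module; the site URL is just a placeholder, not a real address:

from urllib import robotparser

# Hypothetical example: would a given robot be allowed to fetch a page?
rp = robotparser.RobotFileParser()
rp.set_url("http://www.yoursite.com/robots.txt")  # placeholder site
rp.read()  # download and parse the robots.txt file

# can_fetch() returns True if the named user agent may crawl the URL
print(rp.can_fetch("*", "http://www.yoursite.com/temp/page.html"))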

What does this do?

The robots.txt file is a good way to prevent pages from getting indexed. However, not every site can use it. The only robots.txt file that spiders will read is the one in the top-level directory of your server. This means you can only use it if you run your own domain.

How is it created?

The file itself is a simple text file, which can be created in Notepad or any other text editor. It needs to be saved to the root directory of your site, that is, the directory where your home page or index page is. Remember to use all lower case for the filename: "robots.txt", not "Robots.TXT".

What is its content?

The structure of a robots.txt file is pretty simple (and barely flexible): it is just a list of user agents and disallowed files and directories.

Each entry has just two lines:

User-Agent: [Spider or Bot name]

Disallow: [Directory or File Name]

The Disallow line can be repeated for each directory or file you want to exclude, and the whole entry can be repeated for each spider or bot you want to exclude.

User-agent: names the search engine crawler that the record applies to, and Disallow: lists the files and directories to be excluded from indexing.
In addition to User-agent: and Disallow: entries, you can include comment lines; just put the # sign at the beginning of the line:

# All user agents are disallowed from seeing the /temp directory.

User-agent: *

Disallow: /temp/

Where is this placed?

The robots.txt file is placed in the top-level directory of your web server.

When a robot looks for the "/robots.txt" file for a URL, it strips the path component from the URL (everything from the first single slash) and puts "/robots.txt" in its place.

For example, for "http://www.yoursite.com/product/index.html", it will remove "/product/index.html", replace it with "/robots.txt", and end up with "http://www.yoursite.com/robots.txt".

So, as a web site owner you need to put it in the right place on your web server for that resulting URL to work. Usually that is the same place where you put your web site’s main “index.html” welcome page. Where exactly that is, and how to put the file there, depends on your web server software.
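
As a rough sketch of that substitution, the following Python lines (using only the standard urllib.parse module, with a placeholder page URL) show how the robots.txt location is derived from any page URL:

from urllib.parse import urlsplit, urlunsplit

page = "http://www.yoursite.com/product/index.html"  # placeholder page URL
parts = urlsplit(page)
# keep the scheme and host, drop the path and put "/robots.txt" in its place
robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
print(robots_url)  # prints http://www.yoursite.com/robots.txt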

What value does robots.txt add to the website?

The "/robots.txt" file is a text file with one or more records, and usually contains a single record, as shown below.
Its value is that it keeps the listed pages and directories of the website from getting indexed.

User-agent: *

Disallow: /cgi-bin/

Disallow: /tmp/

Disallow: /~mark/

In this example, three directories are excluded.

Note: You need a separate “Disallow” line for every URL prefix you want to exclude — you cannot say “Disallow: /cgi-bin/ /tmp/” on a single line. Also, you may not have blank lines in a record, as they are used to delimit multiple records.

Note: Globbing and regular expressions are not supported in either the User-agent or Disallow lines. The '*' in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like "User-agent: *bot*", "Disallow: /tmp/*" or "Disallow: *.gif".

To exclude all robots from the entire server

If you want to exclude all the search engine spiders from your entire domain, you would write just the following into the robots.txt file:

User-agent: *

Disallow: /

To allow all robots complete access

If you want to allow all the search engine spiders access to your entire domain, you would write just the following into the robots.txt file:

User-agent: *

Disallow:

Once again you use the wildcard, '*', to let all spiders know they are welcome. The second line, Disallow:, is simply left empty; that means you are disallowing nothing.

To exclude all robots from part of the server

User-agent: *

Disallow: /cgi-bin/

Disallow: /tmp/

Disallow: /junk/

To exclude a single robot

If you want to keep a specific search engine spider from indexing your site, do this:

User-agent: BadBot

Disallow: /

To allow a single robot

If you want to allow only a specific search engine spider to index your site, you allow that spider everything and disallow every other robot:

User-agent: Google
Disallow:

User-agent: *
Disallow: /

To exclude all files except one

As there is no “Allow” field, the easy way is to put all files to be disallowed into a separate directory, say “files”, and leave the one file in the level above this directory:

User-agent: *

Disallow: /~mark/files/

Alternatively, you can explicitly disallow each of the pages you do not want indexed:

User-agent: *

Disallow: /~mark/junk.html

Disallow: /~mark/foo.html

Disallow: /~mark/bar.html

To exclude a file from an individual Search Engine

You have a file, personelfile.html, in a directory called ‘personel’ that you do not wish to be indexed by Google. You know that the spider that Google sends out is called ‘Googlebot’. You would add these lines to your robots.txt file:

User-Agent: Googlebot

Disallow: /personel/personelfile.html

Exclude a section of your site from all spiders and bots

You are building a new section of your site in a directory called 'sect' and do not wish it to be indexed before you are finished. In this case you do not need to specify each robot that you wish to exclude; you can simply use the wildcard character, '*', to exclude them all.

User-Agent: *

Disallow: /sect/

Note that there is a forward slash at the beginning and end of the directory name, indicating that you do not want any files in that directory indexed.

Getting More Complicated

If you have a more complex set of requirements, you are going to need a robots.txt file with a number of different commands. You need to be quite careful when creating such a file; you do not want to accidentally lock spiders out of areas you really want indexed.
Let's take quite a complex scenario. You want most spiders to index most of your site, with the following exceptions:

  1. You want none of the files in your cgi-bin indexed at all, nor do you want any of the FrontPage-specific folders indexed, e.g. _private, _themes, _vti_cnf and so on.
  2. You want to exclude your entire site from a single search engine, let's say AltaVista.
  3. You do not want any of your images to appear in the Google Image Search index.
  4. You want to present a different version of a particular page to Lycos and Google.
    (A word of caution here: there are a lot of question marks over the use of ‘doorway pages’ in this fashion. This is not the place for a discussion of them, but if you are using this technique you should do some research on it first.)

Let’s take this one in stages!
1. First you would ban all search engines from the directories you do not want indexed at all:

User-agent: *

Disallow: /cgi-bin/

Disallow: /_borders/

Disallow: /_derived/

Disallow: /_fpclass/

Disallow: /_overlay/

Disallow: /_private/

Disallow: /_themes/

Disallow: /_vti_bin/

Disallow: /_vti_cnf/

Disallow: /_vti_log/

Disallow: /_vti_map/

Disallow: /_vti_pvt/

Disallow: /_vti_txt/

It is not necessary to create a separate record for each directory; it is quite acceptable to just list them as above.

2. The next thing we want to do is prevent AltaVista from getting in there at all. The AltaVista bot is called Scooter.

User-Agent: Scooter

Disallow: /

This entry can be thought of as an amendment to the first entry, which allowed all bots in everywhere except the listed directories. We are now saying that all bots can index the whole site apart from the directories specified in step 1, except Scooter, which can index nothing.

3. Now you want to keep Google away from those images. Google grabs these images with a separate bot from the one that indexes pages generally, called Googlebot-Image. You have a couple of choices here:

User-Agent: Googlebot-Image

Disallow: /images/

That will work if you are very organized and keep all your images strictly in the images folder.

User-Agent: Googlebot-Image

Disallow: /

This one will prevent the Google image bot from indexing any of your images, no matter where they are in your site.

4. Finally, you have two pages called content1.html and content2.html, which are optimized for Google and Lycos respectively. So, you want to hide content1.html from Lycos (the Lycos spider is called T-Rex):

User-Agent: T-Rex

Disallow: /content1.html

and content2.html from Google:

User-Agent: Googlebot

Disallow: /content2.html
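
Putting the four steps together, one possible complete robots.txt file for this scenario would look like the example below. (Keep in mind that, under the original standard, a robot follows only the single record that names it, so if you also want a named robot such as Googlebot to skip the directories from step 1, those Disallow lines would have to be repeated inside its record.)

User-agent: *
Disallow: /cgi-bin/
Disallow: /_borders/
Disallow: /_derived/
Disallow: /_fpclass/
Disallow: /_overlay/
Disallow: /_private/
Disallow: /_themes/
Disallow: /_vti_bin/
Disallow: /_vti_cnf/
Disallow: /_vti_log/
Disallow: /_vti_map/
Disallow: /_vti_pvt/
Disallow: /_vti_txt/

User-agent: Scooter
Disallow: /

User-agent: Googlebot-Image
Disallow: /images/

User-agent: T-Rex
Disallow: /content1.html

User-agent: Googlebot
Disallow: /content2.html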
