The robots.txt file is used to provide useful information to search engine spiders. The file can serve a number of purposes, but it is primarily used to tell a spider which files and directories should NOT be indexed, which is why it is also called the robots exclusion file.
The file must be placed in a web site’s root directory and must be named robots.txt (no variations), in all lower-case characters. The root directory is the only place a spider will look for the file.
A record in the robots.txt file starts with a User-agent name that is recognized by a particular spider. All legitimate spiders have a user agent signature that identifies who they are, and they use that name to check whether the robots exclusion file contains specific instructions for them.
Most robots.txt files start out like this:
User-agent: *
Disallow: /images/
In this case the asterisk is a wildcard character that matches all spiders, so the User-agent line applies to every spider. The second line tells all spiders to ignore the /images/ directory. Together, these two lines form a record that all spiders follow.
If you want to exclude multiple directories and the search page, the robots.txt might look like this:
User-agent: *
Disallow: /images/
Disallow: /flash/
Disallow: /search.php
The example above shows a single record. To end a record, insert a blank line between one record and the next. The following robots exclusion file contains two records: one for all spiders and another for Google’s main spider, which is called Googlebot. For more about Google’s use of the robots.txt file, visit Using a robots.txt file to control access to your site.
User-agent: *
Disallow: /images/
Disallow: /flash/

User-agent: Googlebot
Disallow: /search.php
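As a quick sanity check, Python’s standard-library urllib.robotparser can show how a compliant spider interprets a two-record file like this one. The domain and the SomeBot name below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with two records, as described above.
robots_txt = """\
User-agent: *
Disallow: /images/
Disallow: /flash/

User-agent: Googlebot
Disallow: /search.php
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A spider with no record of its own falls back to the * record.
print(rp.can_fetch("SomeBot", "https://example.com/images/logo.png"))    # False
# Googlebot has its own record, so only that record applies to it.
print(rp.can_fetch("Googlebot", "https://example.com/images/logo.png"))  # True
print(rp.can_fetch("Googlebot", "https://example.com/search.php"))       # False
```

Note that once a spider finds a record addressed to it by name, it follows only that record and ignores the generic * record.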
If you want the path to start at the root directory, the excluded file or directory must begin with a slash. If you omit the leading slash, a spider may exclude any file or directory whose path contains the text to the right of the Disallow directive.
To invite spiders to index all of a site, with no exceptions, use the following record. It applies to all spiders and excludes nothing.
User-agent: *
Disallow:
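A minimal check with urllib.robotparser confirms that an empty Disallow excludes nothing (the bot name and URL are hypothetical):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# An empty Disallow value means "exclude nothing" for all spiders.
rp.parse(["User-agent: *", "Disallow:"])

print(rp.can_fetch("AnyBot", "https://example.com/private/page.html"))  # True
```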
If you want to exclude a particular spider from indexing your site, use a single slash, which indicates the root directory, to the right of the Disallow directive. For example, if you do not want MSN to index any of your site, the record would look like the following. Be careful with this method because it removes your entire site from that search engine’s index.
User-agent: MSNBot
Disallow: /
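The same parser can verify that this record shuts out only the named spider while leaving everyone else unaffected (the URL is made up):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# A single slash disallows the entire site for MSNBot only.
rp.parse(["User-agent: MSNBot", "Disallow: /"])

print(rp.can_fetch("MSNBot", "https://example.com/index.html"))    # False
# No record applies to other spiders, so nothing is excluded for them.
print(rp.can_fetch("Googlebot", "https://example.com/index.html")) # True
```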
A Few Things to Know About the Robots Exclusion File
- All directory paths and file names are case-sensitive. “Disallow: /flash/” is not the same as “Disallow: /Flash/”. Use the exact upper- and lower-case combinations used by the actual files and directories.
- The robots.txt file does nothing to block a bad spider. A spider’s use of the file is completely voluntary, so it will do nothing to stop scrapers or other spiders with malicious intent. A bad spider will simply ignore the robots.txt file, so including a long list of bad spiders and blocking their access to your site is useless.
- All the major search engines use the file, but they only read it periodically (how often is not defined), so any changes you make will likely take a while to show results.
- A blank line delineates a record. Do not break up a record into sections for readability.
- If you disallow a directory, all subdirectories under that directory and all files in the subdirectory are excluded from indexing.
- The robots.txt file is a text file that is visible to anyone on the web. Do not exclude directories or files that you do not want hackers to find, such as a hidden administration area. Anything you put in this file will be plainly visible to anyone who wants to view it.
- It is a good idea to validate the robots.txt file to avoid errors. There are many good free validators on the web.
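Two of the points above, case sensitivity and subdirectory inheritance, are easy to demonstrate with urllib.robotparser (bot name and URLs are hypothetical):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /flash/"])

# Paths are case-sensitive: /Flash/ is not covered by a /flash/ rule.
print(rp.can_fetch("AnyBot", "https://example.com/Flash/intro.swf"))      # True
# Disallowing a directory also covers every subdirectory beneath it.
print(rp.can_fetch("AnyBot", "https://example.com/flash/old/intro.swf"))  # False
```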
For more information about how the robots.txt file works, visit robotstxt.org
Other Uses for the robots.txt File
A few years ago Google, Yahoo, and MSN announced that they all recognize a standard way to let their spiders know where you have placed an XML sitemap. If you have added a sitemap to your site, modify the following line to use your domain name and add the new record to the robots.txt file. Make sure that you create a blank line between this record and any other records you have created.
Sitemap: http://www.yourdomain.com/sitemap.xml
The robots.txt file should be included with every web site because it gives spiders specific directions regarding how they should index a site. If you do not want to exclude any files or directories, just use the record provided above to invite all spiders to index all of your site.
Joseph Carringer says
1) What is a short list of recommended files or folders to disallow in robots.txt? i.e. /images, /styles, /cgibin, /subdomains, /scripts etc.
2) If you disallow subdomain site folders in the root directory of the main site folder, will they still be accessible to the robots through their individual URLs?
Joseph Carringer says
One follow up:
If I want my images to be indexed by google images I should not disallow my images, correct?
You do not need to disallow anything other than directories that you do not want spiders to index. If there are no links on your site to a scripts directory and it does not appear in any URL found on your site, then you do not have to list that directory. The only way a spider will get into a directory is when someone links to it or it appears in a path in a URL.
One important rule is to never include directories or folders in the robots.txt file that reveal a secret area, such as the administration area of a site. If you want to hide a directory, putting it in the robots.txt file means anyone can easily find it.
You are correct. Do not disallow the images directory if you want Google to index your images. I always block the images directory because too many people think that anything they find in Google Images is free to use. It is not. The overwhelming majority of images in Google’s index are copyrighted images taken from web sites without permission. I am a bit surprised that there hasn’t been a legal challenge to that. However, the way to prevent it is to block the images directory.