How to Block Spiders From Visiting and Indexing Your Site: Robots.txt File Tutorial
There are reasons you might not want your Web site to be indexed by search engines. More likely, there are simply certain pages that you don't want indexed by the major search engines.
For instance, maybe you constructed an elaborate direct marketing site that requires the visitor to enter through your main page and then proceed through a highly structured series of links that lead them to a buying decision.
Visitors who entered through one of the internal pages would only be confused, and they would be less likely to buy a product or service.
Whatever your reason, there is a standard that you can implement that will keep most of the major search engine spiders from indexing your Web site.
Here's how to block the spiders. Create a file called "robots.txt" that includes the following code:
User-agent: *
Disallow: /
The first line specifies which agents (spiders or other robots) should read this file and adhere to the instructions in the lines that follow; the asterisk means "every agent." The second line stipulates which files or directories the agent should not read or index. Disallow values are matched as path prefixes, so "/" covers every page on the site. Some spiders also accept "*" as a wildcard inside the path, as in "/*", but the original robots.txt convention does not define it, so the plain "/" is the safer way to say "everything."
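As a quick sanity check, the rules above can be fed to the robots.txt parser in Python's standard library to see how a well-behaved spider would interpret them. This is only a sketch: the agent name "ExampleSpider" and the helper `is_allowed` are made up for illustration, and real spiders may parse the file somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# The block-everything rules from the text, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "ExampleSpider" is a hypothetical agent name, not a real crawler.
    for url in (
        "http://www.yourcompany.com/",
        "http://www.yourcompany.com/products/index.htm",
    ):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, "ExampleSpider", url) else "blocked")
```

Both URLs should be reported as blocked, since every path on the site begins with "/".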
The robots.txt file must be placed in the root directory
of your Web site. What this means is that if you are hosting
your Web site using one of the free services and your address
looks something like this:
http://members.aol.com/Joesmith/home.htm
you cannot use the robots.txt file to keep out the spiders,
since you don't have a primary domain name of your own. The
primary domain name is aol.com, and America Online will
probably not allow you to block all the search engine spiders
from indexing its site and the Web sites of the 11 million
other subscribers.
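To see why a subdirectory site cannot supply its own file, it helps to work out where a spider will look. The sketch below assumes the usual convention that the file is requested from the root of the host; the function name `robots_txt_url` is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt location a spider would check for `page_url`.

    Only the scheme and host are kept; the page's own path is discarded,
    because the file is always looked up at the root of the host.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("http://members.aol.com/Joesmith/home.htm"))
    # prints http://members.aol.com/robots.txt
```

A file uploaded to the Joesmith directory never matches that address, so spiders following the convention will not read it.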
The robots.txt file could look like this if there were
specific directories and files that you don't want the search
engines to index:
User-agent: *
Disallow: /clients/
Disallow: /products/
Disallow: /pressrelations/
Disallow: /surveys/survey.htm
In the above example the robots.txt file asks the search
engine spiders to omit all pages within the following directories:
http://www.yourcompany.com/clients/
http://www.yourcompany.com/products/
http://www.yourcompany.com/pressrelations/
And the following specific page:
http://www.yourcompany.com/surveys/survey.htm
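If the list of excluded paths changes often, the file can be generated instead of typed by hand. The following is a minimal sketch under that assumption; `build_robots_txt` is a hypothetical helper, and the paths are written as plain path prefixes with one directive per line.

```python
def build_robots_txt(disallowed_paths, agent="*"):
    """Build robots.txt text with one Disallow line per path.

    Each path must start with "/" because Disallow values are matched
    against the path portion of the URL, beginning at the site root.
    """
    lines = [f"User-agent: {agent}"]
    for path in disallowed_paths:
        if not path.startswith("/"):
            raise ValueError(f"path must start with '/': {path!r}")
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = build_robots_txt(
        ["/clients/", "/products/", "/pressrelations/", "/surveys/survey.htm"]
    )
    print(text, end="")
```

The printed text can be saved as robots.txt and uploaded to the root directory of the site.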
If you are one of the millions of people hosting a Web site
on America Online's server or one of the other free or subdirectory
Web site services and you can't place a robots.txt file in
their root directory, you can use a META tag that some of
the spiders honor:
<META NAME="ROBOTS" CONTENT="NOINDEX">
You will need this META tag on every page in your Web site
that you don't want indexed. If your Web site has 30 or 40
pages (or more), adding it by hand will take a lot of time.
Here's another reason to buy a good HTML editor like Luckman's
WebEdit or Allaire's HomeSite.
These programs allow you to do a global search and replace
and add an HTML tag to every Web page that you open in the
program. As with all META tags, this META tag goes at the
top of your HTML document, between the <HEAD> and </HEAD> tags.
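If you don't have such an editor, a short script can do the same global edit. The sketch below is only one way to do it: the function names are made up for illustration, it assumes every page has an opening HEAD tag, and it should be tried on a copy of the site first.

```python
import re
import sys
from pathlib import Path

META_TAG = '<META NAME="ROBOTS" CONTENT="NOINDEX">'

# Opening <HEAD> tag in any letter case; the \s guard avoids matching <header>.
HEAD_OPEN = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)
# Any existing robots META tag, so a page is never tagged twice.
HAS_ROBOTS_META = re.compile(r'''<meta\b[^>]*\bname\s*=\s*["']?\s*robots''', re.IGNORECASE)


def add_noindex(html: str) -> str:
    """Insert the robots META tag right after the opening <HEAD> tag."""
    if HAS_ROBOTS_META.search(html):
        return html  # already tagged
    match = HEAD_OPEN.search(html)
    if match is None:
        return html  # no <HEAD> tag found; leave the page untouched
    return html[: match.end()] + "\n" + META_TAG + html[match.end():]


def add_noindex_to_site(root: str) -> int:
    """Tag every .htm/.html file under `root`; return how many files changed."""
    changed = 0
    for path in Path(root).rglob("*.htm*"):
        if not path.is_file():
            continue
        # latin-1 maps every byte to one character, so the round trip is lossless.
        original = path.read_bytes().decode("latin-1")
        updated = add_noindex(original)
        if updated != original:
            path.write_bytes(updated.encode("latin-1"))
            changed += 1
    return changed


if __name__ == "__main__" and len(sys.argv) == 2:
    print(add_noindex_to_site(sys.argv[1]), "file(s) updated")
```

Run it with the folder that holds your pages as the only argument. Running it a second time changes nothing, because pages that already carry a robots META tag are skipped.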