Robots.txt is an important file for every online publishing system, not only WordPress. Actually, I have a really not-cool (but still kind of funny) story regarding this file. For something like a year, ads hadn’t been displaying on any of the archive pages or category pages of my blog, all because of one small robots.txt mistake.
This is extreme but true. In my case, the equation was: bad robots.txt = less money in my pocket. In the end, if you don’t see any other reason to learn how robots.txt works, I hope money is a good enough reason by itself.
Elements of robots.txt
The file has a simple structure. It doesn’t use any fancy syntax or complex formulas. It simply consists of a number of records, i.e. entries. Each record holds information about what areas of a site a certain robot can and cannot access.
This means that you can speak to an individual robot directly, by its name. But you can also speak to all robots at the same time, which is actually the most common way of using robots.txt.
The file can have just one record or multiple records. Having an empty robots.txt file won’t get you in trouble either, but it can confuse search engines, so if you’re creating the file, make sure to put something in it.
Each record of a robots.txt file consists of at least two lines:
• the line identifying the robot, aka the User-agent line
• at least one Disallow line
The User-agent line holds the name of the robot that the record describes the access policy for. The Disallow line holds the URL that is not to be accessed by that robot.
Each record can have multiple Disallow lines, but only one User-agent line. Here are some examples of different robots.txt files:
User-agent: * # speaking to every robot
Disallow: /block-this
In this example the User-agent line uses the “*” character to address every robot, and then the Disallow line prevents access to the “block-this” sub-URL. As you can see, Disallow lines don’t have to use full URLs; they take partial paths, relative to the domain root. On the example-site.com domain, the record above blocks http://example-site.com/block-this along with everything beneath it, such as http://example-site.com/block-this/page.html.
Another thing you can see in the example is the usage of comments. Everything after the “#” character, up to the end of the line, is treated as a comment and ignored by robots. You can also stack multiple Disallow lines in one record, for example to prevent access to four areas of a site.
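Such a four-area record might look like this (the directory names here are just placeholders for illustration, not anything WordPress-specific):

```
User-agent: *       # speaking to every robot
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
Disallow: /private/
```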
Blocking the whole site
You have to be careful not to prevent access to the whole blog altogether. This is an example that blocks the whole site:
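The record in question is just a wildcard User-agent line followed by a lone “/”:

```
User-agent: *
Disallow: /
```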
The sole “/” character in the Disallow line reads as the root directory of the blog and all that follows. It’s very unlikely that you’d use such a record intentionally on any blog.
Making the whole site accessible
This next example is a totally different one, as it grants all robots access to all areas of a site:
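The record granting full access:

```
User-agent: *
Disallow:
```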
As you can see, this is achieved by leaving the Disallow line empty.
Speaking to specific robots
If you want to, you can speak to Google’s robot only, by using:
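The User-agent line naming Google’s main crawler:

```
User-agent: Googlebot
```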
And then following it with a set of Disallow lines just for this one single robot.
There are, of course, other names you can use in the User-agent line. Names like: Googlebot-Mobile, Googlebot-Image, Lycos, Mediapartners-Google (which stands for AdSense), and others. You can visit the Robots Database for a more complete list.
One possible use for addressing a single robot is removing your site from Google Images, by blocking Googlebot-Image entirely.
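That record would be:

```
User-agent: Googlebot-Image
Disallow: /
```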
Dealing with URL strings and filenames
By using a combination of words, the “*” character, and the “$” character you can block pretty much any type of URL. For example, you can block every URL with a question mark in it.
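Here’s such a record, addressed to all robots:

```
User-agent: *
Disallow: /*?
```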
URLs with question marks are usually generated dynamically, therefore their content is probably not highly valuable to the search engines and, in the end, might have a bad influence on the overall rankings. A common example of such a situation is a page presenting all internal search results (if you have a search field somewhere on your blog).
Another idea is to exclude all files of a specific type. For example mp3 files that might be sitting on your blog. Or PDF documents, or XLS sheets, or images, and so on.
To get it done you can use a line like this:
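Here’s a record blocking all mp3 files, addressed to all robots:

```
User-agent: *
Disallow: /*.mp3$
```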
What it does is prevent a given robot from indexing all destinations that begin with your domain name, followed by any string, and end with “.mp3”. The “$” character matches the end of a URL, so it’s great for blocking all kinds of files and URL strings.
The Allow line
Apart from telling robots what areas of a blog can’t be accessed, you can use the Allow line to give those robots a free pass, so to speak, to some parts of those restricted areas.
Let me give you an example. First of all, there’s no point in using Allow lines to list all the areas of your blog that you want to get ranked. That isn’t what the Allow line was designed for. Essentially, if you want a given area to be accessed and ranked, just don’t create a Disallow line for it … it’s as simple as that.
Now, here’s where the Allow line might come into play. You can use it when you want to allow access to certain documents that are in a directory that has already been blocked. An example:
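A sketch (the directory and file names are made up for illustration):

```
User-agent: *
Allow: /downloads/special-report.pdf  # this one file stays accessible...
Disallow: /downloads/                 # ...while the rest of the directory is blocked
```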
What’s important is to place the Allow line before the Disallow line. Otherwise, it won’t work, and the file will remain blocked.
There’s one more type of line that you can place in a robots.txt file. This directive/line is not part of a record; it stands on its own. Its purpose is to point out where on your blog the sitemap file is. Example:
Such a line is usually placed at the end of robots.txt.
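Assuming the example-site.com domain used earlier, with a sitemap sitting in the root directory:

```
Sitemap: http://example-site.com/sitemap.xml
```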
How to create robots.txt for WordPress
The easiest way is of course to do it by hand – create the file in Notepad (or any text editor) and then upload it to your blog’s root directory via FTP.
There are many areas that are unique to WordPress and need to be treated separately in robots.txt… there are things you should always block, things you should never block, things to block depending on your blog’s configuration, lines handling duplicate content, and so on. I explain all this and more in the other post.
If you don’t have the time to visit it, that’s OK too. In that case, check out this template that I’ve prepared. (This is actually the one I created after identifying the AdSense-money problem.)
Robots.txt: a template for WordPress
User-agent: * # speaking to every robot
# Disallow: /tag/ # uncomment if you’re not using tags
# Disallow: /category/ # uncomment if you’re not using categories
# Disallow: /author/ # uncomment for single user blogs
Disallow: /2009/ # the year your blog was born
Disallow: /2012/ # and so on, one line per archive year
Disallow: /index.php # separate directive for the main script file of WP
Disallow: /*? # search results
(Note: some adjustments should probably be made before using this template on your blog.)
Lastly, I’m wondering, did you hand-craft your current robots.txt file or are you using the default one?