Using a Robots.txt File with WordPress
When a search engine robot crawls your site to index its contents, the first thing it does is look for a robots.txt file. A robots.txt file contains specific instructions that tell the bot what it can and can’t look at. Since bots are set to crawl everything by default anyway, a robots.txt file is essentially the means through which you apply a set of specific crawling restrictions.
WordPress automatically generates a “virtual” robots.txt file the moment you publish your first post; what WordPress doesn’t do, however, is include any restrictions in the file it creates. That is something you need to do manually. If you’re wondering whether it’s worth taking control of your robots.txt file and setting up restrictions yourself, the short answer is “yes”. There are SEO, security, and server performance gains to be made that make doing so well worth the minimal effort involved.
From an SEO standpoint, WordPress lets people find the same content through multiple paths. For example, a single post might be reachable through one or more categories, one or more tags, or even a date archive. Search engines cross-check for duplicate content, and when they find the same post via multiple paths, that post is treated as duplicate content. Different search engines handle duplicates differently, but however it gets handled, the likely result is search ranking issues that are best avoided. In some cases your content won’t get indexed as it should, in others it won’t get indexed at all, and you may also find elements of your site displayed in search results that you don’t really want the general public to see.
From a performance viewpoint, failing to provide restriction instructions means any and all crawlers will index everything on your site, which can lead to a lot of needless bandwidth usage. Despite claims from many hosts that they offer unlimited bandwidth, the harsh reality is that “unlimited” only applies as long as your usage stays low. High bandwidth usage equates to higher CPU usage and a corresponding reduction in server performance, and many people have had their sites shut down by their hosts for precisely this reason. Excess bandwidth consumed by bot crawling can also slow your site down for your normal human visitors. And if you’re not concerned about your host or your visitors, consider your search rankings: it’s no secret that site speed is now a factor in Google’s search ranking algorithm.
From a security viewpoint, there are good robots and there are bad robots. Not only do the bad ones contribute to your bandwidth usage, they might be spamming and stealing your content or crawling for private contact information to exploit. Your robots.txt file can be used not only to restrict access to certain areas, but also to stop many of these bad bots from accessing any of your content at all. That said, many bad robots simply ignore such instructions, so this is only a partial solution.
Since a basic knowledge of what to do with your robots.txt file helps in all of these areas, it pays to take a little time to understand how it works and what you need to do to apply a few useful restrictions.
WordPress’ auto-generated robots.txt file
WordPress does handle a basic robots.txt function, but it doesn’t create a physical robots.txt file. Unless you manually save a robots.txt file to your root domain, you won’t see one there when you look via FTP. Instead, WordPress runs code that generates a ‘virtual’ robots.txt file the moment you publish your first post (not before). Once it does, you can visit http://www.yourdomain.com/robots.txt and you will see the contents of the virtual robots.txt file as if a real physical file were stored there. Basically it’s a fake, but it does the job.
Note that WordPress won’t bother creating a virtual robots.txt file if you’ve already created a real, physical robots.txt file manually and saved it on your root domain. I’ll assume you haven’t done this and that WordPress has done it for you.
The virtual file WordPress generates will either give the crawlers complete open access to your site or completely restrict them, depending on what your site visibility setting is under the Privacy Settings section in your WordPress Admin.
If you have the “I would like my site to be visible to everyone, including search engines (like Google, Bing, Technorati) and archivers” option selected, your robots.txt file will look something like this:
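In a default install, the open-access version is just a blanket user-agent line with an empty Disallow (the exact output can vary slightly between WordPress versions):

User-agent: *
Disallow: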
If you have the “I would like to block search engines, but allow normal visitors” option selected, your robots.txt file will look something like this:
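Again allowing for minor differences between versions, the blocking version reads:

User-agent: *
Disallow: /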
In the first example the bot has complete access because nothing specific has been disallowed.
In the second, the “/” is a command that informs crawlers you don’t want to give them any access at all.
Note that “no access” means the crawler is told you don’t want it to crawl the content of your site, so it won’t. However, if any URL from your site appears on another website, that URL can still be indexed by the crawler and may appear in the relevant search engine’s results irrespective of what your robots.txt file specifies. The best solution to this issue is to specify crawling restrictions page by page and post by post. You can do that with Yoast’s Robots Meta Plugin (but expanding on that is for another post, as here I’m talking specifically about the robots.txt file).
Now, as far as WordPress is concerned, that’s its job done with the robots.txt file. You won’t find WordPress playing around with robots.txt any more than what we’ve already discussed as driven by the options under Privacy Settings.
Some WordPress plugins may make changes to the auto-generated file, though. Perhaps the most common is the Google XML Sitemaps Generator plugin, which can add the location of the sitemap it generates to the robots.txt file. If allowed, it writes the sitemap address so that the file looks something like this:
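Assuming your sitemap sits at the default address the plugin uses, the combined file then reads something along these lines:

User-agent: *
Disallow:

Sitemap: http://www.yourdomain.com/sitemap.xml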
This change is a good start because it essentially points crawlers to the sitemap first. However, if no robots.txt instructions existed in the first place, crawlers from the major search engines would most likely locate the sitemap anyway. So the main value of this file is that it prevents ‘file not found’ errors when crawlers request robots.txt specifically, and it gives search engine crawlers a heads up about where to find your sitemap quickly (which, in turn, helps them index your site more accurately).
How to take control of your robots.txt file
The best way is to manually create a robots.txt file and FTP it to your root domain. I say this is the best way because doing it manually avoids the need for plugins, and the fewer plugins you’re running the better your site performance will be. If that’s the route you want to take, I’ll assume you know how to FTP a file to your root domain, in which case your main question will be what to put in the file. You can find the answer to that below.
If you’re not comfortable with FTP (or if you are but still want a fast solution) then the next easiest approach is to install the excellent PC-Robots.txt plugin. This plugin handles nearly everything with a simple install and also provides a manual window through which you can edit the entries in the virtual robots.txt file.
By default PC-Robots.txt disallows a ton of those bad bots I previously talked about (at least, the bad ones that obey such a restriction request). You can look at our own robots.txt file to see the dozens that it includes by default.
Aside from the bad bots, PC-Robots.txt also adds the following restrictions:
User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Disallow: /wp-login.php
Disallow: /wp-register.php
If you’re inclined to worry about losing search engine relevancy because of these restrictions, don’t. What these restrictions do is prevent needless indexing of duplicate content as well as the extraneous indexing of content you’d never want to display in search engines anyway. For example, we allowed SpiderWeb Press to run without crawler restrictions. Take a look at these Google search results for our own site that came about as a result of Google’s own bot having no restriction on what it was indexing:
Aside from looking as ugly as sin, what possible benefit is offered to a user of Google who searches and stumbles upon these search listings? Absolutely zero. The only thing this really achieves is to make you look unprofessional.
The above search results are resolved by disallowing the directory those stray files live in. The exact entry depends on which directory was being indexed, but as an illustration (assuming the listings came from plugin files), it would look like this:
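# illustrative example only – blocks crawlers from the plugins directory
User-agent: *
Disallow: /wp-content/plugins/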
Moving along, be aware that when you install the PC Robots.txt plugin it will overwrite your previous Google XML Sitemaps entry, so you’ll have to rebuild your sitemap to restore that entry. Note that Google XML Sitemaps only adds its entry to the virtual robots.txt file; if you’ve put a manual file in your root domain, you’ll need to add the sitemap entry to the .txt file you FTP’d yourself.
To add a manual entry to the virtual robots.txt file is easy. Go to your WordPress Admin panel and click on Settings > PC Robots.txt.
Doing so will display an editing window containing the current contents of the virtual robots.txt file. Then just add the entry you want.
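For example, to restore the sitemap reference that Google XML Sitemaps had added (assuming the default sitemap address), you’d add a line like:

Sitemap: http://www.yourdomain.com/sitemap.xml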
Should you add additional entries to the virtual robots.txt file generated by PC-Robots.txt?
Peter Coughlin is the author of this plugin and I think what he’s included is pretty much spot on. I doubt it will hinder your desired results if you do nothing to change it.
There are, however, a few schools of thought that may argue otherwise. In particular, two commonly recommended disallows relate to images and tags, again usually for server performance related reasons:
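On a typical WordPress install those recommendations amount to adding something like the following under the existing User-agent: * section (the exact paths depend on your uploads directory and tag base, so treat these as examples rather than drop-in rules):

# example only – block image uploads and tag archives
Disallow: /wp-content/uploads/
Disallow: /tag/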
I personally don’t apply image restrictions because I feel the potential loss of image SEO outweighs the load images place on the server. I’ve never experienced an issue with allowing images to be crawled, and I have operated plenty of WordPress blogs that received enormous volumes of traffic directly via Google Images, so that in itself is enough for me to leave well enough alone. I also tend to leave tags alone: with everything else disallowed, I’ve found that allowing tags has not hindered my search rankings and in some cases has possibly helped.
However, as I mentioned earlier, disallowing indexing of specific posts or pages isn’t foolproof since those pages or posts can be indexed independently via external links. To combat this I think a better approach is to provide explicit instructions in the meta of your individual pages and posts. Once again you can do that with Yoast’s Robots Meta Plugin.
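For reference, what such a plugin writes into the head of a page or post is the standard robots meta tag; a noindex,follow value, for example, keeps that page out of the index while still letting crawlers follow its links:

<meta name="robots" content="noindex,follow" />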
There are, however, a few additional disallows that I think are a good idea to add to your virtual robots.txt file. Specifically:
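On a standard WordPress setup these would look something like the lines below, added under the existing User-agent: * section (adjust to suit your own permalink structure):

Disallow: /feed/
Disallow: /comments/
Disallow: /author/
Disallow: /trackback/
Disallow: /20*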
I see no benefit in indexing feeds, comments, author archives, trackbacks, or date archives at all. That’s needless duplication that detracts from what you really want ranked.
Note that /20* is a quick way to disallow all date archives for years starting with 20.
That’s about all you really need to know to sort out your robots.txt file pretty well. I’m sure there are a few other tweaks you can perform here and there if you really want to (including specific disallowing of certain file types) but in terms of essentials what I’ve included above is sure to improve the performance of your site, reduce lag on the server, and deliver a noticeable benefit to your search engine rankings.
Finally, once you modify the auto-populated entries in the PC-Robots.txt edit screen, I highly recommend saving a copy of your amendments as a .txt file on your hard drive. If you deactivate the plugin while the ‘delete saved settings’ option is checked (which can happen by accident), you’ll lose your changes, so take the precaution of keeping a copy of your settings handy and safe for just such an event.