The Internet is an active landscape with something going on all the time. Often you hope your visitors are people seeking what your site offers, but sometimes they are search engine ‘bots’ crawling your site. These bots can and do consume quite a bit of your website hosting bandwidth and CPU resources.

Bots without some rules will hammer your site mercilessly in an attempt to fully index it. Controlling them and limiting their ability to hammer away at your site is handled through a special file called robots.txt that resides in the root of your website.

In this blog post, Rochen support technician Andrew Brown discusses how to set up robots.txt to limit the impact these powerful software bots can have on your site.

‘Bots’ should only use their power for Right and Good – Right??

Bots are advantageous (operating under the assumption that you want people to see your website) because they index your pages for SEO purposes. Having all of them show up at once, or having them request pages every few seconds, can be detrimental to your site.

There is no rule that says indexing has to happen that quickly, and when multiple crawlers index your site at once it can cause very serious resource problems for your server, depending on the content being indexed.

Limiting Bot Behavior with Robots.txt

First, we will need to create a file called “robots.txt” (without quotes) inside our document root.

For example, the document root for the primary domain of your cPanel account will be /home/YourUsernameHere/public_html.

We need to create the following file if it doesn’t exist:

/home/YourUsernameHere/public_html/robots.txt

You can create this file through your cPanel File Manager, SSH, or even via FTP. It is also worth noting that this file can be created locally (in a Notepad document, for example) and then uploaded to the document root (/home/YourUsernameHere/public_html) of your website.
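If you prefer the command line, here is a minimal sketch of creating an empty robots.txt over SSH, assuming the default cPanel document root shown above (adjust the path to match your own account):

# change to your document root (assumed default cPanel path)
cd /home/YourUsernameHere/public_html
# create an empty robots.txt if it does not already exist
touch robots.txt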

After the robots.txt file has been created in the document root of the site, we will need to add the following content to it:

#ROBOTS START
User-agent: *
Disallow:
Crawl-delay: 30
#ROBOTS END

The above entry will make it so that all bots that honor the directives found inside robots.txt files fetch a page only once every 30 seconds, rather than as quickly as they possibly can. Because the Disallow line is left empty, no content is blocked from the bots. If you want to block content from being indexed in addition to limiting crawl speed, simply add Disallow statements as needed. An example of this:

#ROBOTS START
User-agent: *
Disallow: /cgi-bin/
Disallow: /admin/
Crawl-delay: 30
#ROBOTS END

It does not matter which order you put the Crawl-delay and Disallow statements in.
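For instance, crawlers that honor robots.txt treat the following file the same as the example above, even though the Crawl-delay line comes first:

#ROBOTS START
User-agent: *
Crawl-delay: 30
Disallow: /cgi-bin/
Disallow: /admin/
#ROBOTS END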

In conclusion, it is a good idea to limit crawlers, especially as your site expands. If your site is built with a scripting language such as PHP, it becomes even more important, as each page a crawler fetches runs PHP code and consumes CPU and memory, so an aggressive crawl can quickly exhaust your hosting account’s resources.

###

Andrew Brown is a support technician with Rochen’s 24/7 Support Team focused on providing technical support to Rochen’s hosting customers.

 
