
Tricks To Stop Search Engines From Crawling Your Website


Posted on: Saturday June 20, 2020

Reading Time: 6 minutes

To keep your website discoverable in search results, search engine bots, commonly referred to as crawlers or spiders, visit your website looking for new text and links so they can update their search index.

  • How to control search engine crawlers with a robots.txt file

Website owners can control how search engines crawl a website by using a robots.txt file. Whenever a search engine bot arrives at a website, it looks for the robots.txt file first and then follows the rules it finds there.

  • Edit or create a robots.txt file

The robots.txt file must live at the root of your site. If your domain were example.com, it would be found:

  • On your website:

https://example.com/robots.txt

  • On your server:

/home/userna5/public_html/robots.txt

If you don’t already have one, you can create a new plain-text file and name it robots.txt.
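
For illustration only, a newly created robots.txt could start with a single rule block like the one below; the /example-folder/ path is just a placeholder for whatever you want to restrict:

User-agent: *
Disallow: /example-folder/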

  • Search engine User-agents

The most common robots.txt rule is based on the user-agent of the search engine crawler. Crawlers use a user-agent string to identify themselves while indexing; here are some common examples:

  • Top 3 US search engine User-agents:

(a) Googlebot (Google)

(b) Slurp (Yahoo)

(c) bingbot (Bing)

  • Commonly blocked search engine User-agents (a combined example follows this list):

(a) AhrefsBot

(b) Baiduspider

(c) Ezooms

(d) MJ12bot

(e) YandexBot
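
As an illustrative sketch (not a recommendation for every site), several of the commonly blocked user-agents listed above could be blocked in a single robots.txt like this:

User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: YandexBot
Disallow: /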

  • Search engine crawler access via robots.txt file

A few options are available when it comes to restricting how your website gets crawled with the robots.txt file. A User-agent: rule specifies which user-agent the rules that follow apply to, and * is a wildcard matching any user-agent.

Disallow: sets the files or folders that are not allowed to be crawled.

  • Set a crawl delay for all search engines:

A website with 1,000 pages could potentially be indexed by search engines in just a few minutes. However, this can also cause high system resource usage, since all of those pages are loaded in a short period of time.

A Crawl-delay: of 30 seconds would allow crawlers to index your entire 1,000-page website in roughly 8.3 hours (1,000 pages × 30 seconds = 30,000 seconds).

A Crawl-delay: of 500 seconds would allow crawlers to index your entire 1,000-page website in about 5.8 days (1,000 pages × 500 seconds = 500,000 seconds).

You can set the Crawl-delay: for all search engines at once with:

User-agent: *

Crawl-delay: 30

  • Allow all search engines to crawl your website:

By default, search engines should be able to crawl your website, but you can also explicitly state that they are allowed with:

User-agent: *

Disallow:

  • Disallow all search engines from crawling your website:

You can disallow all search engines from crawling your website with these rules:

User-agent: *

Disallow: /

  • Disallow one particular search engine from crawling your website:

You can disallow just one specific search engine from crawling your website, with these rules:

User-agent: Baiduspider

Disallow: /

  • Disallow all search engines from particular folders:

If we had a few directories like /cgi-bin/, /private/, and /tmp/ that we didn’t want bots to crawl, we could use this:

User-agent: *

Disallow: /cgi-bin/

Disallow: /private/

Disallow: /tmp/

  • Disallow all search engines from particular files:

If we had files like contactus.htm, index.htm, and store.htm that we didn’t want bots to crawl, we could use this:

User-agent: *

Disallow: /contactus.htm

Disallow: /index.htm

Disallow: /store.htm

  • Disallow all search engines but one:

If we only wanted to allow Googlebot access to our /private/ directory and disallow all other bots, we could use:

User-agent: *

Disallow: /private/

User-agent: Googlebot

Disallow:

When Googlebot reads our robots.txt file, it will see that it is not disallowed from crawling any directories.
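
Putting it together, a single robots.txt can combine several of the rules above. The sketch below (folder names reused from the earlier examples, purely for illustration) sets a crawl delay for everyone, keeps all bots out of two folders, and blocks Baiduspider entirely:

User-agent: *
Crawl-delay: 30
Disallow: /cgi-bin/
Disallow: /private/

User-agent: Baiduspider
Disallow: /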

Do you need an SEO audit for your website? Contact us now!