|
ArticleSEOPreventCrawl
HOWTO Prevent your content from being crawled or from appearing in search results
First, preventing your page from being crawled is not the same as preventing the URL from appearing in search results. Even if you block search engines from crawling your content, they may still have signals about the URL, such as the anchor text used in links to it across the web, and therefore enough information to show the URL in search results.

Decide what your goal is: preventing the URL from being crawled, or preventing it from appearing in search results. In most cases, the rules for preventing crawling are simpler than those for keeping a page out of the index. Also, if your information is truly sensitive, it's best not to rely on search engine behavior. Consider more secure options, such as password protection.

Preventing content from being crawled

Using the robots.txt file

Google, Microsoft, and Yahoo! have formalized the Robots Exclusion Protocol (REP) such that they all follow similar guidelines for not crawling the documents you disallow in your robots.txt. This method is often the simplest way to prevent your content from being crawled. To verify that your robots.txt is blocking the correct content (as well as allowing the right content to be crawled), you can use the free robots.txt analyzer in Google Webmaster Tools.
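As a rough local complement to an online analyzer, the sketch below checks a set of robots.txt rules with Python's standard `urllib.robotparser` module. The rules, the `example.com` URLs, and the `build_parser` helper are illustrative assumptions, not taken from any real site, and the standard-library parser may not match every search engine's interpretation of the protocol.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative assumptions.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /drafts/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A URL under a disallowed path should be reported as not fetchable.
blocked = parser.can_fetch("Googlebot", "https://example.com/private/report.pdf")

# A URL outside the disallowed paths should remain fetchable.
allowed = parser.can_fetch("Googlebot", "https://example.com/index.html")

print("private report crawlable:", blocked)
print("home page crawlable:", allowed)
```

Checking both a blocked and an allowed URL mirrors the advice above: the file should stop the right content from being crawled without accidentally blocking pages you want crawled.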
Preventing content from appearing in search results

To prevent your content from appearing in search results, you need to prevent it from being indexed. In the three major search engines, indexing is controlled on a per-URL basis (not by pattern matching, as robots.txt does for crawl prevention). Note that a crawler can only see a NOINDEX directive if it is allowed to fetch the URL, so do not also block that URL in robots.txt.

For a URL not to be indexed, you can send a NOINDEX directive in the HTTP response header. This works well for PDF files or spreadsheets:

X-Robots-Tag: noindex

For HTML files, you can insert a NOINDEX meta tag in the head of the page:

<meta name="robots" content="noindex, nofollow">
|
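The header approach can be sketched with Python's built-in `http.server` module. This is a minimal illustration, not a production setup: the extension list and the names `NOINDEX_EXTENSIONS`, `extra_headers`, and `NoIndexHandler` are assumptions made for the example. In practice the header is usually set in the web server or application configuration.

```python
import posixpath
from http.server import SimpleHTTPRequestHandler

# Assumed set of non-HTML file types to keep out of the index.
NOINDEX_EXTENSIONS = {".pdf", ".xls", ".xlsx", ".csv"}


def extra_headers(url_path: str) -> dict:
    """Return the extra response headers for a requested URL path."""
    path = url_path.split("?", 1)[0]  # ignore any query string
    extension = posixpath.splitext(path)[1].lower()
    if extension in NOINDEX_EXTENSIONS:
        return {"X-Robots-Tag": "noindex"}
    return {}


class NoIndexHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds X-Robots-Tag for selected file types."""

    def end_headers(self):
        # Add the header just before the header block is finalized.
        for name, value in extra_headers(self.path).items():
            self.send_header(name, value)
        super().end_headers()
```

To try it locally, pass `NoIndexHandler` to `http.server.HTTPServer` and request a PDF from the served directory. The response should include the `X-Robots-Tag: noindex` header, while HTML pages are served without it and would rely on the meta tag instead.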