Google’s John Mueller recently “liked” a tweet by search marketing consultant Barry Adams of Polemic Digital that concisely explained the purpose of the robots.txt exclusion protocol. The exchange revived an old topic and, perhaps, offered a new way to think about it.
Google Can Index Blocked Pages
The discussion started when a publisher tweeted that Google had indexed a website that was blocked by robots.txt. John Mueller responded:
“URLs can be indexed without being crawled, if they’re blocked by robots.txt – that’s by design.
Usually, this comes from links from somewhere; considering the number, it might be from within your site somewhere.”
How Robots.txt Works
Barry (@badams) tweeted:
“Robots.txt is a crawl management tool, not an index management tool.”
Robots.txt is often treated as a way to keep a page out of Google’s index. In reality, it only controls which pages Google crawls. So if a page on the same site or elsewhere links to a blocked page, Google can still index that page’s URL, even though it never sees the content.
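For example, a disallow rule like the one below (the path is purely illustrative, not from the site in the discussion) tells compliant crawlers not to fetch that page, but it says nothing about whether the URL may appear in search results:

# Block all crawlers from fetching this page
User-agent: *
Disallow: /example-private-page/

If another page links to /example-private-page/, Google can index that URL based on the link alone, without ever retrieving its content.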
Barry further explained how to keep a page out of Google’s index:
“Use meta robots directives or X-Robots-Tag HTTP headers to prevent indexing – and (counter-intuitively) allow Googlebot to crawl those pages you don’t want it to index, so it sees those directives.”
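As a sketch of the second option, the noindex directive can be sent as an HTTP response header, which is useful for non-HTML files such as PDFs where a meta tag can’t be added (the status line and content type below are illustrative):

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex

As Barry points out, this only works if the URL is not disallowed in robots.txt; Googlebot has to be allowed to fetch the response in order to see the header.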
NoIndex Meta Tag
The noindex meta tag keeps a crawled page out of Google’s index. It doesn’t stop the page from being crawled, but it does ensure the page stays out of search results, which makes it a more reliable way to keep a webpage from being indexed than a robots.txt disallow.
John Mueller mentioned in a tweet from August 2018:
“…if you want to prevent them from indexing, I’d use the noindex robots meta tag instead of robots.txt disallow.”
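In HTML, the directive is a single tag in the page’s head, along these lines:

<!DOCTYPE html>
<html>
<head>
  <!-- Tells crawlers not to include this page in their index -->
  <meta name="robots" content="noindex">
  <title>A page kept out of the index</title>
</head>
<body>
  <p>Visitors can still reach this page; it just won't be listed in search results.</p>
</body>
</html>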
Robots Meta Tag Has Many Uses
The robots meta tag is versatile and can serve as a stopgap until a better solution is available. For instance, a publisher was struggling to serve 404 response codes because the AngularJS framework kept returning a 200 status for missing pages. The publisher tweeted for help:
"Hi @JohnMu I’m having many troubles with managing 404 pages in AngularJS, always giving me a 200 status on them. Any way to solve it? Thanks”
John Mueller suggested using a robots noindex meta tag, which would make Google drop that 200 response code page from the index and treat it as a soft 404.
"I’d make a normal error page and just add a noindex robots meta tag to it. We’ll call it a soft-404, but that’s fine there.”
So even though the webpage returns a 200 response code (indicating it was served successfully), the robots meta tag keeps it out of Google’s index, and Google treats it much like a not-found page, a soft 404.
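A minimal sketch of one way to apply that advice in a client-rendered app: when the router resolves a URL to the not-found view, inject the noindex directive into the document head so the rendered page carries it (the function name and the hook into the router are hypothetical; adapt them to your setup):

// Call this from whatever handler renders the "not found" view,
// e.g. the catch-all route in the app's router (hypothetical hook).
function markAsSoft404() {
  // Adds <meta name="robots" content="noindex"> to the rendered page
  var meta = document.createElement('meta');
  meta.name = 'robots';
  meta.content = 'noindex';
  document.head.appendChild(meta);
}

The server still answers with a 200 status, but once Google renders the page and sees the noindex, it treats the URL as a soft 404 and keeps it out of the index.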
Official Description of Robots Meta Tag
According to the official documentation by the World Wide Web Consortium (W3C), this is what the Robots Meta Tag accomplishes:
"Robots and the META element
The META element allows HTML authors to tell visiting robots whether a document may be indexed or used to harvest more links.”
The W3C describes robots.txt as follows:
“When a Robot visits a Website, it first checks for …robots.txt. If it can find this document, it will analyze its contents to see if it is allowed to retrieve the document.”
The W3C frames robots.txt as a gatekeeper: it determines which files are retrieved, that is, crawled, by a robot that honors the robots.txt exclusion protocol.
Barry Adams was correct in describing the robots.txt exclusion as a tool for managing crawling, not indexing.
It might help to picture robots.txt as a security guard at the entrance to your site, controlling which pages Googlebot is allowed to visit rather than which pages appear in search results. That framing can make otherwise puzzling Googlebot behavior, such as blocked pages turning up in the index, easier to understand.