Google’s John Mueller addressed a question regarding why Google indexes pages that are blocked from crawling by robots.txt and why it’s safe to overlook related Search Console reports about those crawls.
Bot Traffic to Query Parameter URLs
Rick Horst raised a concern about bots generating links to non-existent query parameter URLs (?q=xyz) pointing at pages that carry noindex meta tags and are also blocked in robots.txt. The issue was that Google attempts to crawl these URLs, is blocked by robots.txt (and therefore never sees the noindex tag), and then reports them in Google Search Console as “Indexed, though blocked by robots.txt.”
The question posed was:
“But here’s the big question: why would Google index pages when they can’t even see the content? What’s the advantage in that?”
Mueller confirmed that if Google cannot crawl a page, it cannot see the noindex tag. He also advised not to worry about those URLs appearing for a site: search, because regular users won’t see those results.
He explained:
“Yes, you’re correct: if we can’t crawl the page, we can’t see the noindex. However, if we can’t crawl the pages, there’s not much for us to index. So while some of these pages might appear with a targeted site:-query, the average user won’t see them, so I wouldn’t fuss over it. Noindex is also fine (without robots.txt disallow), which just means the URLs will be crawled (and show up in the Search Console report for crawled/not indexed — neither status will cause issues to the rest of the site). The key is to avoid making them crawlable and indexable.”
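To make the mechanics concrete, here is a minimal Python sketch using the standard library’s urllib.robotparser. The example.com URL is hypothetical, and a plain path prefix stands in for the wildcard query-parameter pattern, which that parser doesn’t support. It is not Googlebot’s actual logic; it only illustrates why a disallowed URL’s noindex tag is never seen: the fetch is skipped entirely, so the crawler only ever knows the bare URL.

```python
# Minimal sketch: a robots.txt disallow means the page body is never fetched,
# so any noindex meta tag on it can never be read.
# The domain and paths below are hypothetical examples.
from urllib import robotparser

rules = robotparser.RobotFileParser()
# Stand-in robots.txt rules, parsed from a list instead of fetched from a site.
rules.parse([
    "User-agent: *",
    "Disallow: /search",
])

url = "https://example.com/search?q=xyz"
if rules.can_fetch("Googlebot", url):
    print("Allowed: the page is fetched, so a noindex meta tag would be seen.")
else:
    # The fetch never happens, so any noindex on the page stays invisible;
    # the bare URL can still end up indexed from links alone.
    print("Blocked by robots.txt: the page body (and its noindex) is never read.")
```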
Takeaways:
1. Confirmation of Limitations of Site: Search
Mueller’s answer highlights the limitations of using the site: search operator for diagnostic purposes: it is disconnected from the regular search index and is a separate thing altogether.
John Mueller commented on the site search operator in 2021:
“A site: query is not meant to be complete or used for diagnostic purposes. It limits the results to a certain website, composed of the word site, a colon, and the website’s domain. This query isn’t a comprehensive collection of all the pages from that website.”
The site: operator doesn’t reflect Google’s actual search index, which makes it unreliable for understanding which pages Google has indexed or how they rank, a limitation it shares with other advanced search operators.
2. Noindex Tag Without Robots.txt
In situations like this one, where bots link to non-existent query parameter pages that Google discovers, a noindex tag works fine without a robots.txt disallow. Because Google can crawl the URL, it can read the directive and keep the page out of the search index, which is the preferred way to keep a page out of Google’s results.
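As a rough illustration (not Google’s actual parser), the Python sketch below shows how a crawler that is allowed to fetch a page could detect a noindex directive in a robots meta tag; the HTML snippet is a hypothetical example of such an auto-generated page.

```python
# Minimal sketch of detecting a noindex directive in fetched HTML.
# The markup below is a hypothetical example page.
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from <meta name="robots" ...> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.extend(
                d.strip().lower() for d in (attrs.get("content") or "").split(",")
            )

html = """
<html>
  <head><meta name="robots" content="noindex, follow"></head>
  <body>Auto-generated search results page</body>
</html>
"""

parser = RobotsMetaParser()
parser.feed(html)
# The directive is only honored because the page was crawlable in the first place.
print("noindex" if "noindex" in parser.directives else "indexable")
```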
3. No Negative Effect from Noindex Tag in Search Console
URLs with a noindex tag will generate a “crawled/not indexed” entry in Search Console, but per Mueller this status won’t negatively impact the rest of the website; it simply records that Google crawled the page and chose not to index it.
4. Handling of URLs with Noindex Tags and Robots.txt
If robots.txt blocks Googlebot from crawling a page, it cannot read the noindex tag, so the URL may still be indexed based on discovery through links alone. Google’s guidance is that for the noindex rule to work, the page must be accessible to the crawler.
5. Difference Between Site: Searches and Regular Searches
Site: searches are limited to a specific domain and are separate from the main search index, making them unsuitable for diagnosing indexing issues.
Conclusion:
John Mueller’s answer, along with the surrounding online discussion, clarifies how noindex and robots.txt interact during crawling and indexing, and why these Search Console reports are generally safe to ignore in this scenario.
Featured Image by Shutterstock/Krakenimages.com