Google’s John Mueller Explains Why Google Crawls Non-Existent Pages

In a recent Webmaster Central Hangout, Google’s John Mueller shed light on why Google crawls non-existent pages and what that means for your crawl budget. Website owners often wish Google would focus its crawling efforts on existing pages, since crawling pages that don’t exist seems inefficient. A web publisher asked about blocking Googlebot from crawling non-existent pages, and Mueller’s answer provides a deeper understanding of Google’s 404 crawls.

Non-existent pages are commonly called 404 pages because the server returns a 404 status code when a requested page can’t be found. A 404 indicates the server can’t locate the page, while a 410 signals that the page was intentionally removed and isn’t coming back.
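For illustration, here is a minimal sketch of that distinction using Python’s standard http.server module. The paths and handler are hypothetical, not tied to any particular site; the point is simply that a server answers 410 for URLs it knows were deliberately removed and 404 for anything else it can’t find:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical example: paths the site owner knows were deliberately removed
REMOVED_PATHS = {"/old-promo", "/discontinued-product"}

class StatusDemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            # Page exists: respond with 200 OK
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Home page")
        elif self.path in REMOVED_PATHS:
            # Deliberately removed and not coming back: 410 Gone
            self.send_error(410, "Gone")
        else:
            # Unknown URL: 404 Not Found
            self.send_error(404, "Not Found")

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), StatusDemoHandler).serve_forever()
```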

Google occasionally performs what is known as a 404 crawl. John Mueller highlighted three reasons why Google crawls these non-existent pages:

1. Google uses excess crawl capacity to verify URLs that previously existed, in case they reappear.
2. Crawling 404 pages indicates Google has ample capacity to scrutinize more URLs on your site.
3. There is no need to block 404 pages from crawling to save your crawl budget, as these crawls do not affect your crawl capacity.

### Google Remembers 404 Pages

Even if a web page is no longer in Google’s index, Google remembers that it once existed and periodically checks the old URL to see whether the page has returned. Matt Cutts explained in 2014 that this serves as a safeguard in case a web publisher mistakenly removes a web page and it becomes operational again.

Here’s how Matt Cutts explained Google’s handling of 404 and 410 error pages in 2014:

> “200 might mean everything went totally fine. 404 means the page was not found. 410 typically means gone, as in the page is not found and we do not expect it to come back. So 410 has a little more of a connotation that this page is permanently gone.”
>
> “So the short answer is that we do sometimes treat 404s and 410s a little bit differently, but for the most part you shouldn’t worry about it.”
>
> “If a page is gone and you think it’s temporary, go ahead and use a 404. If a page is gone and you know no other page should substitute it, then use a 410.”

He noted that website owners sometimes mistakenly declare pages as gone, leading Google to build safeguards ensuring it doesn’t prematurely drop pages that may need to be retained.

Mueller further elaborated:

> “We understand that these are 404 or 410 pages or at least shouldn’t be indexed. But we know about those pages. And every now and then when… we have nothing better to do on this website, we’ll go off and double-check on those URLs.”

He added that this process does not affect your crawl capacity. It’s more a sign that Google can handle more URLs, and it double-checks old ones just in case they’ve reappeared.

### Distinctive Points in John Mueller’s Statement

Mueller’s remarks add new dimensions to our understanding of why Googlebot crawls 404 pages, indicating it’s a positive sign of ample crawl budget. His comments on 410 pages are also noteworthy, because they differ somewhat from what Matt Cutts described.

### Are Googlebot 404 Crawls a Positive Indicator?

Today, we learn from Mueller that Google crawling 404 pages is beneficial, but his thoughts on how Google handles 410 pages deviate from Cutts’ outline. Notably, a 410 code signifies a permanently removed page, and some publishers would prefer Google to honor that status and not revisit the URL, especially for spam pages.

Matt Cutts on 410 pages in 2014:

> “If we see a 410, then the crawling system says OK, we assume the webmaster knows what they’re doing, because they deliberately indicated that this page is gone. … But we’ll still go back and recheck to ensure those pages are truly gone.”

An older statement from 2011 indicated that Google treated 404 and 410 errors similarly; however, Cutts later made clear that there are subtle differences.

### Google Adheres to Standards for 410 Error Responses

The official standard suggests that links to a resource returning a 410 response should be deleted by clients with link-editing capabilities. However, it does not require that clients never revisit the URL. The 410 response assists web maintenance by signaling that the resource is intentionally unavailable.
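As a rough illustration of that distinction, the sketch below shows an assumed, simplified link-maintenance script (not anything Google has described) that drops a link only when the server answers 410, while keeping 404 URLs around for a later recheck. The URLs are placeholders:

```python
import urllib.error
import urllib.request

# Hypothetical list of outbound links a site maintains
links = [
    "https://example.com/",
    "https://example.com/removed-page",
]

def status_of(url: str) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is on the exception
        return err.code

kept = []
for url in links:
    code = status_of(url)
    if code == 410:
        # 410 Gone: the standard suggests link-editing clients delete the link
        print(f"Removing link (410 Gone): {url}")
    else:
        # A 404 (or anything else) may be temporary, so keep it and recheck later
        kept.append(url)

print("Links kept:", kept)
```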

### Conclusion from John Mueller’s Comments

Some web publishers might worry about Google crawling non-existent pages that return 404 or 410, but Mueller’s insights show it’s actually a good sign: it indicates Google has enough crawl budget to analyze your entire site, making blocking unnecessary.

