Understanding crawl budget is an often overlooked part of SEO. A two-year-old post my team wrote about the topic is practically ancient history in the SEO industry. In this article, I’ll explain how our understanding of crawl budget has changed in the past couple of years, what’s stayed the same, and what it all means for your crawl budget optimization efforts.
What is Crawl Budget and Why Does it Matter?
Computer programs designed to collect information from web pages are called web spiders, crawlers, or bots. These can be malicious (e.g., hacker spiders) or beneficial (e.g., search engine and web service spiders). For example, my company’s backlink index is built using a spider called BLEXBot, which crawls up to 7.5 billion pages daily, gathering backlink data.
When we talk about crawl budget, we’re referring to the frequency with which search engine spiders crawl your web pages. According to Google, crawl budget is a combination of your crawl rate limit (i.e., limits that ensure bots like Googlebot don’t crawl your pages so often that it hurts your server) and your crawl demand (i.e., how much Google wants to crawl your pages).
Optimizing your crawl budget means increasing how often spiders can “visit” each page, collect information, and send that data to other algorithms in charge of indexing and evaluating content quality. Simply put, the better your crawl budget, the faster your information will be updated in search engine indexes when you make changes to your site.
But don’t worry. Unless you’re running a large-scale website (millions or billions of URLs), you will likely never need to worry about crawl budget.
So why bother with crawl budget optimization? Because even if you don’t need to improve your crawl budget, these tips include a lot of good practices that improve the overall health of your site.
And, as John Mueller explains, a leaner site can bring benefits like higher conversions, even if those improvements aren’t guaranteed to impact a page’s rank in SERPs.
What’s Stayed the Same?
In a Google Webmaster Hangout on Dec. 14, 2018, John was asked about how one could determine their crawl budget. He explained that it’s tough to pin down because crawl budget is not an external-facing metric.
He also said:
“[Crawl budget] is something that changes quite a bit over time. Our algorithms are very dynamic and they try to react fairly quickly to changes that you make on your website … it’s not something that’s assigned one time to a website.”
He illustrates this with a few examples:
- You could reduce your crawl budget by doing something like misconfiguring your CMS. Googlebot might notice how slow your pages are and slow down crawling within a day or two.
- You could increase your crawl budget if you improved your website (by moving to a CDN or serving content more quickly). Googlebot would notice and your crawl demand would go up.
This is consistent with what we knew about crawl budget a couple of years ago. Many best practices for optimizing crawl budget are also equally applicable today:
1. Don’t block important pages
Make sure that all of your important pages are crawlable. Content won’t provide you with any value if your .htaccess and robots.txt are inhibiting search bots’ ability to crawl essential pages. Conversely, you can use a script to direct search bots away from unimportant pages.
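As an illustration, a robots.txt along these lines (the paths are hypothetical placeholders) keeps bots away from low-value sections while leaving the rest of the site crawlable:

```
# Hypothetical robots.txt: block low-value sections, allow everything else
User-agent: *
Disallow: /cart/
Disallow: /internal-search/
Allow: /

Sitemap: https://www.example.com/sitemap.xml
```

Just be careful: a single overly broad Disallow rule here can hide important pages from every search engine, so double-check these rules before deploying them.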
2. Stick to HTML whenever possible
Googlebot has gotten better at crawling rich media files like JavaScript, Flash, and XML, but other search engine bots still struggle with these files. It’s recommended to avoid these files in favor of plain HTML whenever possible. You may also want to provide search engine bots with text versions of pages that rely heavily on these rich media files.
3. Fix long redirect chains
Each redirected URL squanders a little bit of your crawl budget. Try to limit the number of redirects you have on your website and use them no more than twice in a row.
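To sketch what this looks like in practice, here is a hypothetical Apache configuration where a chain of redirects is collapsed into single hops (paths are made up for illustration):

```
# Bad: a chain - /old-page -> /newer-page -> /final-page costs two fetches
Redirect 301 /old-page /newer-page
Redirect 301 /newer-page /final-page

# Better: point every legacy URL straight at the final destination
Redirect 301 /old-page /final-page
Redirect 301 /newer-page /final-page
```

The second version delivers bots (and users) to the destination in one request instead of two.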
4. Tell Googlebot about URL parameters
If your CMS generates lots of dynamic URLs, then you may be wasting your crawl budget. To inform Googlebot about URL parameters that don’t impact page content, add parameters to your Google Search Console.
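To see why this matters, consider a set of hypothetical URLs like these, which could all serve the same product listing; unless you tell Google that the parameters don’t change the content, each variant may be crawled separately:

```
https://www.example.com/shoes?sessionid=abc123
https://www.example.com/shoes?sort=price&sessionid=xyz789
https://www.example.com/shoes
```

Declaring sessionid as a parameter that doesn’t affect page content lets Googlebot skip the duplicates and spend its budget on unique pages.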
5. Correct HTTP errors
404 and 410 pages do use your crawl budget. It’s in your best interest to search for HTTP errors and fix them ASAP.
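If a page is gone for good, returning a 410 (“Gone”) rather than a generic 404 tells bots to stop requesting it sooner. In Apache, for example, this can be done with a one-line rule (the path is hypothetical):

```
# Permanently removed page: signal "Gone" so bots drop it faster
Redirect gone /discontinued-product
```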
6. Keep your sitemap up to date
A clean, up-to-date XML sitemap helps search bots understand where internal links lead and how your site is structured.
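A minimal sitemap entry looks like the following (URLs and dates are placeholders); keeping the lastmod value accurate helps bots prioritize recently changed pages:

```
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/important-page/</loc>
    <lastmod>2019-01-15</lastmod>
  </url>
</urlset>
```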
7. Use rel="canonical" to avoid duplicate content
You can use rel="canonical" to tell bots which URL is the main version of a page.
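For instance, a parameterized duplicate of a page (hypothetical URLs below) can point bots at the main version with a single tag in its head element:

```
<!-- On https://www.example.com/shoes?sort=price, a duplicate view -->
<link rel="canonical" href="https://www.example.com/shoes" />
```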
8. Use hreflang tags to indicate country/language
Bots use hreflang tags to understand localized versions of your pages. You can use HTML tags, HTTP headers, or your sitemap to indicate localized pages to Google.
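With the HTML tag approach, each localized page lists every version of itself, including its own; the URLs here are hypothetical:

```
<!-- Hypothetical localized versions of the same page -->
<link rel="alternate" hreflang="en-us" href="https://www.example.com/page/" />
<link rel="alternate" hreflang="de" href="https://www.example.com/de/page/" />
<link rel="alternate" hreflang="x-default" href="https://www.example.com/page/" />
```

The x-default entry tells Google which version to show users who don’t match any listed language or region.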
What’s Changed?
Two main things have changed since we wrote that original article in 2017.
First, I no longer recommend RSS feeds. They had a small resurgence but are not widely used now.
Second, an earlier experiment suggested a strong correlation between external links and crawl budget. When we revisited the study, the correlation turned out to be much looser, suggesting Google’s algorithm has grown more sophisticated.
However, links remain one of the most important signals that search engines use to judge relevancy and quality. While link building may not be essential for improving your crawl budget, it should remain a priority for your SEO efforts.