Google explains how crawling works in 2026

Gary Illyes from Google shared new details about Googlebot, Google's crawling ecosystem, how downloads work, and how bytes are processed.
The article is called Inside Googlebot: demystifying crawls, downloads, and the bytes we process.
Googlebot. Google has more than one crawler; it operates many crawlers for many purposes. So referring to Googlebot as a single crawler may not be very accurate anymore. Google has listed many of its crawlers and user agents here.
Fetch limits. Google recently talked about its crawl limits. Now, Gary Illyes has gone into more detail. He said:
- Googlebot currently downloads up to 2MB from any URL (except PDFs).
- This means it only keeps the first 2MB of the resource, including the HTTP headers.
- For PDF files, the limit is 64MB.
- Image and video crawlers typically have much higher thresholds, which depend heavily on the product they are crawling for.
- For any other crawlers that do not specify a limit, the default is 15MB regardless of content type.
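The byte caps above can be sketched as a simple lookup. This is purely illustrative, assuming the figures Illyes shared; the names `GOOGLEBOT_LIMITS` and `fetch_limit` are hypothetical, and the real crawler's internal rules are more nuanced than a per-content-type table.

```python
# Hypothetical table of per-fetch byte caps, encoding only the public
# figures: 2MB for regular pages, 64MB for PDFs, 15MB as a fallback.
DEFAULT_LIMIT = 15 * 1024 * 1024

GOOGLEBOT_LIMITS = {
    "text/html": 2 * 1024 * 1024,         # regular pages
    "application/pdf": 64 * 1024 * 1024,  # PDFs get a higher cap
}

def fetch_limit(content_type: str) -> int:
    """Return the byte cap that would apply to a resource of this type."""
    return GOOGLEBOT_LIMITS.get(content_type, DEFAULT_LIMIT)

print(fetch_limit("text/html"))        # 2097152
print(fetch_limit("application/pdf"))  # 67108864
print(fetch_limit("video/mp4"))        # 15728640 (fallback)
```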
And what happens when Google crawls?
- Partial downloads: If your HTML file is larger than 2MB, Googlebot does not reject the page. Instead, it stops the download exactly at the 2MB cutoff. Note that the limit includes the HTTP headers.
- Processing cutoff: That downloaded portion (the first 2MB of bytes) is passed to Google's indexing systems and the Web Rendering Service (WRS) as if it were the complete file.
- Invisible bytes: Any bytes after the 2MB limit are completely ignored. They are not downloaded, rendered, or indexed.
- Subresource fetches: Every resource referenced in the HTML (except media, fonts, and a few rare file types) is fetched by WRS via Googlebot, just like the parent HTML. Each URL gets its own separate byte counter and does not count toward the parent page's size.
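The cutoff behavior described above can be sketched as a capped stream read: the fetcher stops at the limit and hands the truncated bytes onward as if they were the whole file. This is a minimal sketch, not Googlebot's actual code; `capped_read` is a hypothetical helper.

```python
import io

TWO_MB = 2 * 1024 * 1024

def capped_read(stream: io.BufferedIOBase, cap: int = TWO_MB) -> bytes:
    """Read at most `cap` bytes; anything beyond the cap is never read."""
    chunks, remaining = [], cap
    while remaining > 0:
        chunk = stream.read(min(65536, remaining))
        if not chunk:  # resource ended before the cap
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

# A 3MB "page" is silently truncated to exactly 2MB; nothing is rejected,
# the trailing 1MB simply never reaches downstream processing.
page = io.BytesIO(b"x" * (3 * 1024 * 1024))
body = capped_read(page)
print(len(body))  # 2097152
```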
How Google processes these bytes. Once the crawler fetches these bytes, it forwards them to WRS, the Web Rendering Service. “WRS processes JavaScript and uses client-side code similar to a modern browser to understand the final visual and textual state of the page. The rendering pulls in and uses JavaScript and CSS files, and processes XHR requests to better understand the textual content and layout of the page (it does not request images or videos). For each requested resource, Google’s fetch limits also apply.”
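The resource categories mentioned in that quote can be summarized as a rough filter: WRS pulls in JavaScript, CSS, and XHR responses, but skips images, videos, and (per the list above) fonts. The names `wrs_fetches` and the category sets below are assumptions for illustration; real WRS behavior is more nuanced.

```python
# Resource types WRS fetches during rendering vs. ones it skips,
# per the categories described in the article. Illustrative only.
RENDER_FETCHED = {"script", "stylesheet", "xhr"}
RENDER_SKIPPED = {"image", "video", "font"}

def wrs_fetches(resource_type: str) -> bool:
    """Would rendering pull in a subresource of this type?"""
    return resource_type in RENDER_FETCHED

resources = ["script", "image", "stylesheet", "video", "xhr"]
print([r for r in resources if wrs_fetches(r)])
# ['script', 'stylesheet', 'xhr']
```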
Best practices. Google has listed these best practices:
- Keep your HTML lean: Move heavy CSS and JavaScript to external files. While the initial HTML document counts toward the 2MB limit, external scripts and stylesheets are downloaded separately (subject to their own limits).
- The order matters: Place your most important elements – such as meta tags, title elements, canonical tags, and important structured data – at the top of the HTML document. This makes them less likely to fall below the cutoff.
- Monitor your server logs: Check your server’s response times. If your server is struggling to serve bytes, Google’s crawlers will automatically back off to avoid overloading your infrastructure, which will reduce your crawl rate.
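One way to sanity-check the "order matters" advice is to verify that your critical head elements fall inside the first 2MB Googlebot would actually download. This is a rough sketch under that assumption; `critical_tags_within_cap` is a hypothetical helper, and checking for `</head>` is a simplification (you could search for specific tags instead).

```python
TWO_MB = 2 * 1024 * 1024

def critical_tags_within_cap(html: bytes, cap: int = TWO_MB) -> bool:
    """True if the closing </head> tag (and thus any title, meta, and
    canonical tags placed inside it) sits within the first `cap` bytes."""
    return b"</head>" in html[:cap].lower()

# A page whose head comes first passes, no matter how large the body is.
lean = (b"<html><head><title>ok</title></head><body>"
        + b"x" * (3 * 1024 * 1024) + b"</body></html>")
print(critical_tags_within_cap(lean))  # True
```

A page that buries its head section behind 2MB of other markup would fail this check, meaning those tags would be invisible to indexing.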
A podcast. Google also published a podcast on the topic; here it is:
Search Engine Land is owned by Semrush. We are committed to providing the highest quality of marketing articles. Unless otherwise stated, the content of this page is written by an employee or paid contractor of Semrush Inc.
