Google actively manages its index by removing low-quality pages.
In this article, we'll explain insights from Google's patent "Managing URLs" (US7509315B1) on how Google might manage its search index.
We'll break down the concepts of importance thresholds, crawl priority, and the deletion process, and show how you can use this information to spot page quality issues on your website.
So, let's dive in.
Google’s search index is a series of massive databases that store information.
To quote Google’s official documentation How Google Search organizes information:
“The Google Search index covers hundreds of billions of webpages and is well over 100,000,000 gigabytes in size. It’s like the index in the back of a book - with an entry for every word seen on every webpage we index.”
When you do a search in Google, the list of websites returned in Google’s search results comes from its search index.
Google builds its search results in a three-step process:

1. Crawling: Googlebot discovers and downloads pages from the web.
2. Indexing: Google analyses each page and stores it in the search index.
3. Serving: Google retrieves and ranks indexed pages in response to a search query.
If a website’s page is Not Indexed, it cannot be served in Google’s search results.
Any SEO professional or company can view the indexing state of their website’s pages in the Page Indexing report in Google Search Console.
Google actively removes pages from its search index.
This isn't new territory. SEO professionals and Googlers have been talking about this for the past decade. A few examples:
Gary Illyes has mentioned in an interview at SERP Conf that Google actively removes pages from its index:
“And in general, also the the general quality of the site, that can matter a lot of how many of these crawled but not indexed, you see in Search Console. If the number of these URLs is very high that could hint at a general quality issues. And I've seen that a lot uh since February, where suddenly we just decided that we are de-indexing a vast amount of URLs on a site just because the perception, or our perception of the site has changed.”
Martin Splitt explained in an official Google Search Central video that Google actively removes pages from its index:
“The other far more common reason for pages staying in "Discovered-- currently not indexed" is quality, though. When Google Search notices a pattern of low-quality or thin content on pages, they might be removed from the index and might stay in Discovered.”
At Indexing Insight, our 'Crawled - currently not indexed' research identified that nearly 80% of Not Indexed pages with this index coverage state were previously indexed pages that Google actively removed.
The active removal of Indexed pages is so prevalent in Google's index that we created a new report called: 'Crawled - previously indexed'.
This new index coverage state helps customers identify indexed pages that are actively being removed from Google Search results.
But how does Google decide which pages are deindexed?
A Google patent called "Managing URLs" (US7509315B1) might hold the answer to how a search giant like Google manages its mammoth Search Index.
Any database (like a Search Index) has limits.
According to the Google patent "Managing URLs" (US7509315B1), any search index comes with limits for the number of pages that can be efficiently indexed.
According to the patent, there are two limits to managing a search engine's index effectively: a soft limit, which triggers index maintenance once the number of indexed pages exceeds it, and a hard limit on the total size of the index. These two limits work together to ensure Google's index remains manageable while prioritizing high-ranking documents.
However, reaching this limit doesn't mean a search engine stops crawling entirely.
Instead, it continues to crawl new pages but only indexes those deemed "important" enough based on query-independent metrics (e.g. PageRank, according to the patent).
This leads us to an interesting concept: the importance threshold.
The importance threshold is a benchmark score mentioned in the Google patent.
It defines the minimum importance score a new page must have to be indexed once the initial limit has been reached: only pages with an importance score equal to or higher than this threshold are added to the index.
Importance Threshold vs Page Quality: In this article, 'importance threshold' and 'page quality' mean the same thing. Google employees use 'page quality threshold' to describe index management, while the patent uses 'importance threshold.'
This ensures that a search engine index prioritizes indexing the most important content. Based on the patent, there are two main methods for determining the importance threshold:
All known pages are ranked according to their importance.
The threshold is implicitly defined by the importance rank of the lowest-ranking page currently in the index.
For example, suppose a search engine has 1,000,000 pages indexed. It ranks (sorts) those pages by their calculated importance scores. If the lowest importance rank in the list is 3, then the importance threshold for the Search Index is 3.
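The first method can be sketched in a few lines of Python. This is a minimal illustration of the logic the patent describes, not Google's actual implementation; the page URLs and importance ranks below are hypothetical.

```python
# Minimal sketch of the "implicit threshold" method: the threshold is
# the importance rank of the lowest-ranking page currently in the index.

def implicit_threshold(indexed_pages: dict[str, int]) -> int:
    """Return the importance rank of the weakest indexed page."""
    return min(indexed_pages.values())

def should_index(new_page_rank: int, indexed_pages: dict[str, int]) -> bool:
    """A new page is only admitted if it meets or beats the threshold."""
    return new_page_rank >= implicit_threshold(indexed_pages)

# Hypothetical index of three pages with their importance ranks
index = {"/home": 9, "/pricing": 7, "/old-blog-post": 3}
print(implicit_threshold(index))  # 3
print(should_index(5, index))     # True
print(should_index(2, index))     # False
```

Note that the threshold isn't stored anywhere: it falls out of whatever is currently indexed, which is why it shifts as pages enter and leave the index.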
The system would use a histogram representing the distribution of importance ranks.
The threshold is calculated by analyzing the histogram and identifying the importance rank corresponding to the desired index size limit.
For example, suppose a search engine has an index limit of 1,000,000 pages. If the histogram shows that 800,000 pages have an importance rank of 6 or higher, but including the pages ranked 5 would push the total over the limit, the importance threshold would be set at 6.
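The histogram method can be sketched as follows. Again, this is a minimal illustration under our reading of the patent; the rank distribution and the 1,000,000-page limit are hypothetical numbers chosen to match the example above.

```python
def histogram_threshold(histogram: dict[int, int], index_limit: int) -> int:
    """Walk importance ranks from highest to lowest, accumulating page
    counts, and return the lowest rank that still fits within the limit."""
    total = 0
    threshold = max(histogram)
    for rank in sorted(histogram, reverse=True):
        if total + histogram[rank] > index_limit:
            break  # including this rank would exceed the index limit
        total += histogram[rank]
        threshold = rank
    return threshold

# Hypothetical distribution: 800,000 pages have a rank of 6 or higher,
# and adding the 400,000 rank-5 pages would exceed the limit.
ranks = {8: 300_000, 7: 200_000, 6: 300_000, 5: 400_000}
print(histogram_threshold(ranks, index_limit=1_000_000))  # 6
```

The appeal of this approach is that the system only needs an aggregate count of pages per rank, not a full sorted list of every URL, which matters at the scale of hundreds of billions of pages.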
The number of indexed pages can fluctuate due to the importance threshold.
This is due to the dynamic nature of both the importance threshold and the importance rankings of individual pages.
You can see this sort of process in action in the Page Indexing report in GSC.
According to the patent, three main factors cause these fluctuations:
When new pages have an importance rank above the current threshold, they're added to the index.
This can cause the total number of pages to exceed the soft limit, triggering an increase in the importance threshold and potentially removing existing pages with lower importance.
Gary Illyes actually confirmed this process happens in Google’s Search Index.
Poor-quality content (lower importance rank) will be actively removed when higher-quality content needs to be added to the index.
Existing pages are removed from the index because they drop below the page quality / importance threshold benchmark score.
If an existing page's importance rank drops below the unimportance threshold (due to content updates, link structure changes, or poor user engagement from session logs), it is marked as Not Indexed, even if it was previously above the importance threshold.
At Indexing Insight, our tool can pick up when indexed pages are actively removed from Google search results in our ‘crawled - previously indexed’ report.
Gary Illyes confirmed this process at SERP Conf in February 2024, in the interview quoted earlier.
In the interview, Gary confirms that signals are tracked over time and can be used to decide whether to remove pages from Google's search results.
Pages with importance ranks close to the threshold are particularly susceptible to fluctuations.
As the threshold and importance ranks dynamically adjust, these pages might be repeatedly crawled, added to the index, and then deleted.
This creates oscillations in the index size.
The patent describes using a buffer zone to mitigate these oscillations, setting the unimportance threshold slightly lower than the importance threshold for crawling.
This reduces the likelihood of repeatedly crawling and deleting pages near the threshold.
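The buffer zone amounts to a simple hysteresis rule: new pages are judged against the importance threshold, while existing pages are only removed if they fall below the lower unimportance threshold. The sketch below illustrates the idea under our reading of the patent; the threshold values and buffer size are hypothetical.

```python
IMPORTANCE_THRESHOLD = 6  # rank a NEW page needs to enter the index
BUFFER = 1                # buffer zone between the two thresholds
UNIMPORTANCE_THRESHOLD = IMPORTANCE_THRESHOLD - BUFFER  # removal cutoff

def stays_or_enters_index(currently_indexed: bool, rank: int) -> bool:
    """Existing pages are judged against the lower removal cutoff, so
    pages hovering near the line aren't repeatedly added and deleted."""
    if currently_indexed:
        return rank >= UNIMPORTANCE_THRESHOLD
    return rank >= IMPORTANCE_THRESHOLD

print(stays_or_enters_index(False, 6))  # True  -> admitted to the index
print(stays_or_enters_index(True, 5))   # True  -> kept (inside the buffer)
print(stays_or_enters_index(True, 4))   # False -> removed
```

Without the buffer, a page oscillating between rank 5 and rank 6 would be indexed and deleted on every re-evaluation; with it, the page stays put until its rank moves decisively.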
Gary Illyes again confirms that a similar system operates in Google's Search Index: pages very close to the quality threshold can fall out of the index, only to be crawled and indexed again (and then fall back out of the index).
The patent explains why your page’s indexing states can change.
At Indexing Insight, we have noticed using our first-party data that indexing state in GSC can indicate the crawl priority of a website.
The Google patent (US7509315B1) explains why this happens using two systems:
These are the soft limit on index size and the importance threshold discussed earlier. The two work together to determine not only which pages stay indexed, but also which URLs get crawled.
For example, if Google’s Search Index has a soft limit of 1,000,000 URLs, and the system detects they have hit that target, then Google will start to increase the importance threshold.
When the importance threshold increases, indexed pages below the new threshold are removed. This also impacts the crawling and indexing of new pages.
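This maintenance step can be sketched like so. It is a simplified illustration of our reading of the patent, not Google's implementation; the soft limit and page ranks are hypothetical.

```python
def enforce_soft_limit(index: dict[str, int], soft_limit: int):
    """When the index exceeds its soft limit, keep only the highest-ranking
    pages; the new threshold is the rank of the weakest survivor."""
    if len(index) <= soft_limit:
        return index, min(index.values())
    ranked = sorted(index.items(), key=lambda kv: kv[1], reverse=True)
    survivors = ranked[:soft_limit]
    new_threshold = survivors[-1][1]
    return dict(survivors), new_threshold

# Hypothetical four-page index with a soft limit of three pages
index = {"/a": 9, "/b": 7, "/c": 5, "/d": 3}
index, threshold = enforce_soft_limit(index, soft_limit=3)
print(threshold)      # 5 -> the raised importance threshold
print("/d" in index)  # False -> removed for falling below it
```

The key point is that "/d" wasn't removed because it got worse: it was removed because the index filled up and the bar for staying in rose.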
The quality threshold impacts which URLs get prioritised for crawling.
Here's how it works when the soft limit is reached:
The patent explains why pages move from the 'Crawled - currently not indexed' state to the 'URL is unknown to Google' state in Google Search Console.
It's all about their importance rank relative to the current threshold.
At Indexing Insight, our 190-Day Indexing Rule research has uncovered this process in action. We've identified that Google's index forgets a page URL approximately 190 days after Googlebot's last crawl.
The Google patent reveals that a gradual decline in importance rank or page quality over time causes a page URL to transition from 'Submitted and indexed' to 'URL is unknown to Google'.
Our research has identified that this decline can be grouped into three different time buckets and that index coverage states can be mapped to the removal process:
Gary Illyes from Google confirmed that Google's Search Index does "forget" URLs over time based on their signals, and that eventually these forgotten page URLs have zero crawl priority.
Understanding Google's index management can directly impact your SEO success.
Here are 4 tips to act on the information in this article:
Check your indexing states on a weekly or monthly basis.
In Google Search Console, don't just look at the top-level Indexed and Not Indexed totals in the Page Indexing report. Make sure you dig into the Not Indexed report trends.
When checking the Not Indexed reports, pay attention to the following trends:
These shifts can indicate that Google is actively deprioritising or removing your content from the index based on importance thresholds.
In Indexing Insight, you can easily track the separate Not Indexed report trends and identify Indexed pages that are actively being removed from search results.
Google constantly evaluates page quality relative to other pages in the index.
This means:
This explains why a page that was indexed for years can suddenly get deindexed. Its importance rank may have remained static while the threshold increased.
SEO professionals need to evaluate the page quality of indexed pages.
In Indexing Insight, you can download ALL the monitored pages in the Indexed report to evaluate the page quality of your website.
URLs with frequent indexing state changes (oscillating between Indexed and Not Indexed) are likely near the importance threshold.
These fluctuating pages should be prioritised for improvement.
At Indexing Insight, we have created a specific report to identify pages which are jumping between Indexed and Not Indexed states.
This report is called: Index Fluctuations.
This report can help identify exactly which pages are on the edge of the page quality threshold score.
Google continually reassesses page importance.
SEO teams need to regularly audit their website content to identify small incremental improvements in page quality.
To maintain and improve your indexing, it’s important to:
At Indexing Insight, you can easily view ALL of your Indexed and Not Indexed pages in our Index Coverage report.
This report shows all indexing data for a website and can be combined with other SEO data points to support content audits.
Google actively removes pages from its index.
In this blog, we used the Google patent "Managing URLs" (US7509315B1) to explain the set of processes that may manage Google's Search Index.
The patent sheds some light on how your pages can be actively removed.
This blog post has given you a deeper understanding of how Google works and what you should be doing to help get your pages indexed.