
How Google Manages its Index


Adam Gent

Google actively manages its index by removing low-quality pages.

In this article, we'll explain insights from Google's patent "Managing URLs" (US7509315B1) on how Google might manage its search index.

We'll break down the concepts of importance thresholds, crawl priority, and the deletion process, and show how you can use this information to spot page quality issues on your website.

So, let's dive in.

What is the Search Index?

Google’s search index is a series of massive databases that store information.

To quote Google’s official documentation How Google Search organizes information:

“The Google Search index covers hundreds of billions of webpages and is well over 100,000,000 gigabytes in size. It’s like the index in the back of a book - with an entry for every word seen on every webpage we index.”

When you do a search in Google, the list of websites returned in Google’s search results comes from its search index.

Google builds its search results in a three-step process:

  1. Crawling: Automated web crawlers discover and download content.
  2. Indexing: The content is analysed and stored in a massive database.
  3. Serving: Eligible content is served in Google's search results.

If a website’s page is Not Indexed, it cannot be served in Google’s search results.

Any SEO professional or company can view the indexing state of their website’s pages in the Page Indexing report in Google Search Console.

Google Search Index Quality

Google actively removes pages from its search index.

This isn't new territory. SEO professionals and Googlers have been talking about this for the past decade. A few examples:

Gary Illyes has mentioned in an interview at SERP Conf that Google actively removes pages from its index:

“And in general, also the the general quality of the site, that can matter a lot of how many of these crawled but not indexed, you see in Search Console. If the number of these URLs is very high that could hint at a general quality issues. And I've seen that a lot uh since February, where suddenly we just decided that we are de-indexing a vast amount of URLs on a site just because the perception, or our perception of the site has changed.”

Martin Splitt explained in an official Google Search Central video that Google actively removes pages from its index:

“The other far more common reason for pages staying in "Discovered-- currently not indexed" is quality, though. When Google Search notices a pattern of low-quality or thin content on pages, they might be removed from the index and might stay in Discovered.”

At Indexing Insight, our 'Crawled - currently not indexed' research identified that nearly 80% of Not Indexed pages with this index coverage state are previously indexed pages that Google has actively removed.

The active removal of Indexed pages is so prevalent in Google's index that we created a new report called 'Crawled - previously indexed'.

This new index coverage state helps customers identify indexed pages that are actively being removed from Google Search results.

But how does Google decide which pages are deindexed?

A Google patent called "Managing URLs" (US7509315B1) might hold the answer to how a search giant like Google manages its mammoth Search Index.

Search Index Limit

Any database (like a Search Index) has limits.

According to the Google patent "Managing URLs" (US7509315B1), any search index comes with limits for the number of pages that can be efficiently indexed.

There are two different limits to managing a search engine’s index effectively:

  1. Soft Limit: This limit sets a target for the number of pages to be indexed.
  2. Hard Limit: This limit acts as a ceiling to prevent the index from growing excessively large.

These two limits work together to ensure Google's index remains manageable while prioritizing high-ranking documents.

However, reaching this limit doesn't mean a search engine stops crawling entirely.

Instead, it continues to crawl new pages but only indexes those deemed "important" enough based on query-independent metrics (e.g. PageRank, according to the patent).

This leads us to an interesting concept: the importance threshold.

The Importance Threshold

The importance threshold is a benchmark score mentioned in the Google patent.

It determines whether a new page gets indexed after the initial limit has been reached. Only pages with an importance score equal to or higher than this threshold are added to the index.

💡

Importance Threshold vs Page Quality: In this article, 'importance threshold' and 'page quality' mean the same thing. Google employees use 'page quality threshold' to describe index management, while the patent uses 'importance threshold.'

This ensures that a search engine index prioritizes indexing the most important content. Based on the patent, there are two main methods for determining the importance threshold:

  1. Ranking Comparison Method
  2. Histogram-Based Method

1) Ranking Comparison Method

All known pages are ranked according to their importance.

The threshold is implicitly defined by the importance rank of the lowest-ranking page currently in the index.

For example, suppose a search engine has 1,000,000 pages indexed. It would rank (sort) those pages by their calculated importance scores. If the lowest importance rank in the list is 3, then the importance threshold for the Search Index is 3.
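To make the ranking comparison method concrete, here is a minimal Python sketch. The URLs and rank values are made up for illustration; the patent describes the idea, not an implementation.

```python
# Hypothetical sketch of the ranking comparison method. The URLs and
# rank values are illustrative examples, not taken from the patent.

def ranking_comparison_threshold(indexed_pages):
    """The threshold is implicitly the importance rank of the
    lowest-ranking page currently in the index."""
    # Sort indexed pages by importance rank, highest first.
    ranked = sorted(indexed_pages, key=lambda page: page["rank"], reverse=True)
    # The last page in the sorted list sets the threshold.
    return ranked[-1]["rank"]

index = [
    {"url": "/guide", "rank": 9},
    {"url": "/blog/post-1", "rank": 6},
    {"url": "/tag/misc", "rank": 3},  # lowest-ranked page in the index
]
print(ranking_comparison_threshold(index))  # → 3
```

Any new page would then need an importance rank of at least 3 to enter this index.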

2) Histogram-Based Method

The system would use a histogram representing the distribution of importance ranks.

The threshold is calculated by analyzing the histogram and identifying the importance rank corresponding to the desired index size limit.

For example, suppose a search engine has an index limit of 1,000,000 pages. If the histogram shows that 800,000 pages have an importance rank of 6 or higher, and including the next rank down would exceed the limit, the importance threshold would be set to 6.
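The histogram method can be sketched as a cumulative walk from the highest rank down, stopping before the limit is exceeded. The rank distribution below is a made-up example chosen to match the article's numbers:

```python
# Hypothetical sketch of the histogram-based method. The distribution
# is illustrative, not taken from the patent.

def histogram_threshold(histogram, index_limit):
    """histogram maps importance rank -> page count. Return the lowest
    rank whose cumulative count (walking from the top rank down) still
    fits within the index size limit."""
    total = 0
    threshold = max(histogram)
    for rank in sorted(histogram, reverse=True):
        if total + histogram[rank] > index_limit:
            break  # including this rank would overflow the index
        total += histogram[rank]
        threshold = rank
    return threshold

# 800,000 pages have a rank of 6 or higher; including rank 5 would
# push the index past the 1,000,000-page limit.
histogram = {8: 300_000, 7: 250_000, 6: 250_000, 5: 300_000}
print(histogram_threshold(histogram, index_limit=1_000_000))  # → 6
```

The advantage over sorting every known URL is that the system only needs the counts per rank, not a full ranked list.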

Importance Threshold Fluctuates

The number of indexed pages can fluctuate due to the importance threshold.

This is due to the dynamic nature of both the importance threshold and the importance rankings of individual pages.

You can see this sort of process in action in the Page Indexing report in GSC.

According to the patent, three main factors cause these fluctuations:

  1. New High-Importance Pages: When new pages have an importance rank above the current threshold, they're added to the index.
  2. Importance Rank Changes: Existing pages are removed from the index when their importance rank drops below the unimportance threshold.
  3. Oscillations Near Threshold: Pages with importance ranks close to the threshold are particularly susceptible to fluctuations.

1) New High-Importance Pages

When new pages have an importance rank above the current threshold, they're added to the index.

This can cause the total number of pages to exceed the soft limit, triggering an increase in the importance threshold and potentially removing existing pages with lower importance.

Gary Illyes actually confirmed this process happens in Google’s Search Index.

Poor-quality content (lower importance rank) will be actively removed when higher-quality content needs to be added to the index.

2) Page Quality Changes

Existing pages are removed from the index when they drop below the page quality (importance) threshold benchmark score.

If an existing page's importance rank drops below the unimportance threshold (due to content updates, link structure changes, or poor user engagement from session logs), it is marked as Not Indexed, even if it was previously above the importance threshold.

At Indexing Insight, our tool can pick up when indexed pages are actively removed from Google search results in our ‘crawled - previously indexed’ report.

Gary Illyes confirmed this process at SERP Conf in February 2024:

“And in general, also the the general quality of the site, that can matter a lot of how many of these crawled but not indexed, you see in Search Console. If the number of these URLs is very high that could hint at a general quality issues. And I've seen that a lot uh since February, where suddenly we just decided that we are de-indexing a vast amount of URLs on a site just because the perception, or our perception of the site has changed.”

In the interview, Gary confirms that signals are tracked over time and that these signals can be used to decide whether to remove pages from Google's search results.

3) Oscillations Near Threshold

Pages with importance ranks close to the threshold are particularly susceptible to fluctuations.

As the threshold and importance ranks dynamically adjust, these pages might be repeatedly crawled, added to the index, and then deleted.

This creates oscillations in the index size.

The patent describes using a buffer zone to mitigate these oscillations, setting the unimportance threshold slightly lower than the importance threshold for crawling.

This reduces the likelihood of repeatedly crawling and deleting pages near the threshold.
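A minimal sketch of this buffer-zone (hysteresis) logic, assuming illustrative threshold values that are not from the patent:

```python
# Sketch of the buffer zone the patent describes: pages are only deleted
# when they fall below an unimportance threshold set slightly lower than
# the importance threshold used for crawling and indexing.
# Both threshold values and the example ranks are made-up assumptions.

IMPORTANCE_THRESHOLD = 6    # needed for a new page to be crawled and indexed
UNIMPORTANCE_THRESHOLD = 5  # indexed pages are only deleted below this

def decide(page_rank, currently_indexed):
    if not currently_indexed:
        # New pages must clear the higher bar to enter the index.
        return "index" if page_rank >= IMPORTANCE_THRESHOLD else "skip"
    # Indexed pages are only removed if they fall below the lower bar,
    # so pages hovering near the threshold aren't repeatedly deleted.
    return "keep" if page_rank >= UNIMPORTANCE_THRESHOLD else "delete"

print(decide(5.5, currently_indexed=True))   # → keep (inside the buffer zone)
print(decide(5.5, currently_indexed=False))  # → skip
print(decide(4.0, currently_indexed=True))   # → delete
```

Because the deletion bar sits below the indexing bar, a page whose rank drifts between 5 and 6 stays in the index instead of being repeatedly added and removed.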

Gary Illyes again confirms that a similar system happens in Google’s Search Index, indicating that pages very close to the quality threshold can fall out of the index.

But it can then be crawled and indexed again (and then fall back out of the index).

Indexing states: Why do they change?

The patent explains why your page’s indexing states can change.

At Indexing Insight, we have noticed, using our first-party data, that the indexing states in GSC can indicate the crawl priority of a website's pages.

The Google patent (US7509315B1) explains why this happens using two systems:

  1. Soft vs Hard Limits
  2. Importance threshold and crawl priority

1) Soft vs Hard Limits

There are two different limits to managing a search engine’s index effectively:

  1. Soft Limit: This limit sets a target for the number of pages to be indexed.
  2. Hard Limit: This limit acts as a ceiling to prevent the index from growing excessively large.

These two limits work together to ensure Google's index remains manageable while prioritizing high-ranking documents.

For example, if Google’s Search Index has a soft limit of 1,000,000 URLs, and the system detects it has hit that target, then Google will start to increase the importance threshold.

When you increase the importance threshold, indexed pages below that threshold are removed. This impacts the crawling and indexing of new pages.
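The soft-limit behaviour described above can be sketched like this; the tiny limit and rank values are illustrative only:

```python
# Sketch of soft-limit enforcement: once the index exceeds its soft
# limit, the importance threshold rises and pages below it are evicted.
# The limit and ranks are made-up values for demonstration.

def enforce_soft_limit(index, soft_limit):
    """index maps URL -> importance rank. Keep only the soft_limit
    highest-ranking pages; the lowest surviving rank is, in effect,
    the new importance threshold."""
    if len(index) <= soft_limit:
        return index
    # Sort by importance rank, highest first, and keep the top slice.
    survivors = sorted(index.items(), key=lambda kv: kv[1], reverse=True)
    return dict(survivors[:soft_limit])

index = {"/a": 9, "/b": 8, "/c": 7, "/d": 6, "/e": 5, "/f": 4}
index = enforce_soft_limit(index, soft_limit=5)
print(sorted(index))  # → ['/a', '/b', '/c', '/d', '/e']  ("/f" was evicted)
```

The lowest-ranking page is dropped first, which is exactly the "existing pages with lower importance are removed" behaviour the patent describes.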

2) Importance threshold and crawl priority

The quality threshold impacts which URLs get prioritised for crawling.

Here's how it works when the soft limit is reached:

  1. Only pages with an importance rank equal to or greater than the current threshold are crawled and indexed.
  2. As the threshold dynamically adjusts, page URLs' crawl and indexing priority changes.
  3. URLs with an importance rank far below the threshold have zero crawl priority.
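As a rough illustration of those three rules, here is a hypothetical priority function; the margin and return values are assumptions for the sketch, not figures from the patent:

```python
# Hypothetical mapping of importance rank (relative to the current
# threshold) onto crawl priority. The cutoff margin and priority
# values are illustrative assumptions.

def crawl_priority(importance_rank, threshold, far_below_margin=2):
    if importance_rank >= threshold:
        # Eligible for crawling and indexing; higher rank crawls sooner.
        return importance_rank - threshold + 1
    if importance_rank <= threshold - far_below_margin:
        # Far below the threshold: zero crawl priority. Eventually the
        # URL is "forgotten" (URL is unknown to Google).
        return 0
    # Just below the threshold: low priority, may oscillate in and out.
    return 0.5

threshold = 6
print(crawl_priority(8, threshold))  # → 3
print(crawl_priority(5, threshold))  # → 0.5
print(crawl_priority(3, threshold))  # → 0
```

In this sketch, a page whose rank decays from 8 down to 3 passes through exactly the progression of states the patent implies: crawled and indexed, then borderline, then never recrawled.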

The patent explains why pages move from the "Crawled - currently not indexed" state to the "URL is unknown to Google" state in Google Search Console.

It's all about their importance rank relative to the current threshold.

At Indexing Insight, our 190-Day Indexing Rule research has uncovered this process in action. We've identified that Google's index forgets a page URL approximately 190 days after Googlebot's last crawl.

The Google patent reveals that a gradual decline in importance rank or page quality over time causes a page URL to transition from 'Submitted and indexed' to 'URL is unknown to Google'.

Our research has identified that this decline can be grouped into three different time buckets and that index coverage states can be mapped to the removal process.

Gary Illyes from Google confirmed that Google’s Search Index does “forget” URLs over time based on the signals. And that eventually these forgotten page URLs have zero crawl priority.

What does this mean for you (as an SEO)?

Understanding Google's index management can directly impact your SEO success.

Here are four tips for acting on the information in this article:

  1. Monitor your indexing states
  2. Focus on quality over quantity
  3. Identify content that's at risk
  4. Regularly audit and improve existing content

1) Monitor your indexing states

Check your indexing states on a weekly or monthly basis.

In Google Search Console, don't just look at the top-level Indexed and Not Indexed totals in the Page Indexing report...

...make sure you are digging into the Not Indexed report trends.

When checking the Not Indexed reports, pay attention to how the trends shift over time.

These shifts can indicate that Google is actively deprioritising or removing your content from the index based on importance thresholds.

In Indexing Insight, you can easily track the separate Not Indexed report trends and identify Indexed pages that are actively being removed from search results.

2) Focus on content quality over quantity

Google constantly evaluates page quality relative to other pages in the index.

This means:

  1. Higher-quality pages push lower-quality pages out of the index.
  2. The relative quality score of your pages continues to worsen if you don’t improve them.
  3. Your content isn't just competing against its previous versions but against all new content being created.

This explains why a page that was indexed for years can suddenly get deindexed. Its importance rank may have remained static while the threshold increased.

SEO professionals need to evaluate the page quality of indexed pages.

In Indexing Insight, you can download ALL the monitored pages in the Indexed report to evaluate the page quality of your website.

3) Identify At Risk Content

URLs with frequent indexing state changes (oscillating between Indexed and Not Indexed) are likely near the importance threshold.

These fluctuating pages need to be prioritised for improvement.

At Indexing Insight, we have created a specific report to identify pages which are jumping between Indexed and Not Indexed states.

This report is called: Index Fluctuations.

This report can help identify exactly which pages are on the edge of the page quality threshold score.

4) Regularly Audit and Improve Existing Content

Google continually reassesses page importance.

SEO teams need to regularly audit their website content to identify small incremental improvements in page quality.

To maintain and improve your indexing, it’s important to:

  1. Track and monitor the index state of important pages.
  2. Perform regular content audits focusing on content with low engagement.
  3. Update and improve existing content rather than just creating new pages.
  4. Monitor internal links and user engagement metrics as they influence importance.
  5. Identify and improve indexed pages that are at risk of being removed by Google.

At Indexing Insight, you can easily view ALL of your Indexed and Not Indexed pages in our Index Coverage report.

This report shows all indexing data for a website and can be combined with other SEO data points to support content audits.

Summary

Google actively removes pages from its index.

In this blog post, we used the Google patent (US7509315B1) to explain the processes that may help manage Google's Search Index.

The patent sheds some light on how your pages can be actively removed.

This blog post has given you a deeper understanding of how Google works and what you should be doing to help get your pages indexed.



Adam Gent

SEO Product Manager and Technical SEO. I’m currently an independent consultant who works with organisations to plan, scope and execute SEO projects that drive results.
