Google inadvertently made internal documents about its Search algorithms accessible on GitHub, revealing details on how it ranks and displays web results. These documents, part of the "Google API Content Warehouse," were available from March 27 to May 7, 2024, and included over 2,500 pages of API documentation. Although Google has pulled these documents, copies remain accessible because a third-party service indexed them.
The leak, publicized by Rand Fishkin of SparkToro, has stirred the SEO community because it suggests discrepancies between Google's public statements and its actual practices in ranking search results. The incident follows Google's March update, which aimed to prioritize genuine content for users over search-engine-optimized pages. Google has not yet issued a public response regarding the leak.
The leaked Google API Content Warehouse documentation suggests that a variety of factors are considered in Google Search's ranking algorithms. Some of these factors include:
siteAuthority: This is a feature that Google calculates and uses in the Q* ranking system, indicating that Google does measure sitewide authority despite claims to the contrary.
NavBoost: This system uses click data to influence rankings, contradicting Google's denials about clicks affecting search results. It identifies trending search demand by analyzing the number of searches for a given keyword, the number of clicks on a search result, and the split between long clicks and short clicks.
hostAge: This attribute is used to sandbox new sites, a practice Google has publicly denied. The sandbox feature segregates new or untrusted sites to prevent fresh spam from ranking highly in search results.
User Behavior: Google's Panda algorithm uses a scoring modifier based on user behavior and external links, applied at various levels such as domain, subdomain, and subdirectory.
Anchor Mismatch: Links with irrelevant anchor text are demoted in rankings.
SERP Demotion: Pages showing poor user satisfaction in the SERP are demoted.
Exact Match Domains: These receive less value in rankings.
Product Review Demotion: Likely related to the recent product reviews update.
Location Demotions: "Global" and "super global" pages can be demoted to favor locally relevant content.
It's important to note that while the leaked documentation shows what factors Google Search might take into consideration when ordering search results, it does not reveal the importance or "weight" of each factor in the final ranking.
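To make the long-click versus short-click distinction attributed to NavBoost more concrete, here is a minimal Python sketch. It is an illustration only: the leaked documentation does not describe how clicks are classified or combined, so the `Click` record, the `LONG_CLICK_SECONDS` threshold, and both function names are hypothetical. The sketch only counts clicks and deliberately does not turn those counts into a ranking score, since the weights are unknown.

```python
from dataclasses import dataclass

# Hypothetical threshold: the leaked documentation does not say how long a
# click must last to count as "long"; 30 seconds is an arbitrary placeholder.
LONG_CLICK_SECONDS = 30.0


@dataclass
class Click:
    url: str
    dwell_seconds: float  # time spent before returning to the results page


def summarize_clicks(clicks):
    """Count long and short clicks per URL (illustrative only)."""
    summary = {}
    for click in clicks:
        counts = summary.setdefault(click.url, {"long": 0, "short": 0})
        kind = "long" if click.dwell_seconds >= LONG_CLICK_SECONDS else "short"
        counts[kind] += 1
    return summary


def long_click_share(counts):
    """Fraction of clicks that were long; 0.0 when there are no clicks."""
    total = counts["long"] + counts["short"]
    return counts["long"] / total if total else 0.0


clicks = [
    Click("a.example", 95.0),
    Click("a.example", 4.0),
    Click("b.example", 2.5),
]
summary = summarize_clicks(clicks)
print(long_click_share(summary["a.example"]))  # 0.5
```

A share like this is one plausible way to express "user satisfaction" with a result, but whether Google computes anything similar, and how much it would matter, cannot be determined from the leak.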
The leaked Google API Content Warehouse documents have led the SEO community to allege several contradictions between the internal API documentation and Google's public statements about its search algorithms. Some of these contradictions include:
Use of clicks and user data: The documents reveal that Google uses a system called NavBoost, which employs click-driven measures to adjust rankings. This contradicts Google's previous denials of using click data to influence search rankings.
Site-wide authority measurement: Google has publicly denied measuring site-wide authority, but the leaked documents mention a feature called "siteAuthority," which suggests that it does.
Sandbox feature for new websites: Google has previously denied the existence of a sandbox feature that segregates new or untrusted sites. However, the leaked documents mention a "hostAge" attribute used to sandbox new sites, contradicting this denial.
Use of Chrome data: Google representatives have repeatedly stated that Chrome data is not used in search rankings. However, the leaked documents suggest that Chrome data may be used to rank websites.
These contradictions have led to concerns about the transparency and accuracy of Google's public statements regarding its search algorithms. The SEO community is now calling for greater scrutiny of Google's claims and a more accurate representation of how its search engine operates.