
Behind the Scenes of Indexing
Key Insights from Google’s Deep Dive Asia Pacific 2025 (Day 2 - Part 1)
Day 2 of Google’s Search Central Live: Deep Dive Asia Pacific 2025 shifted focus from crawling to indexing: what happens after content is discovered, how it’s evaluated, and ultimately how it’s selected (or not) to appear in Search. With speakers offering rare insight into the inner workings of Google’s indexing systems, this was a technical yet essential session for anyone managing organic performance in 2025.
For business owners and marketers wearing multiple hats, understanding indexing isn’t just for developers. It’s about ensuring the content you’ve worked hard to create is not just discoverable, but actually findable in Search results.
Let’s unpack what was shared.
Indexing in 2025: More Complex Than Ever
“Indexing is way more complex than we give it credit.” - Gary Illyes
In 1998, Google could rely on plain HTML. Today, JavaScript frameworks, dynamic content, and evolving spam tactics have transformed the challenge entirely.
Google’s indexing process begins with a messy reality: HTML is rarely clean. Before any content can be considered for indexing, it must be parsed, rendered, deduplicated, and evaluated. The system parses the HTML into a Document Object Model (DOM), identifying key areas such as navigation, main content, and page structure. This helps Google understand what a page is about and whether it deserves a place in the index.
Importantly, Google doesn’t simply dump a page into the index. It tokenises the content (splitting it into individual words or phrases) and maps those tokens to URLs in a structure called the posting list, a highly efficient system built for search recall.
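To make the posting-list idea concrete, here is a minimal sketch of an inverted index in Python. The pages, tokeniser, and data structure are purely illustrative; Google’s real posting lists are far more sophisticated (token positions, weights, compression, and more).

```python
import re
from collections import defaultdict

def tokenise(text: str) -> list[str]:
    """Naive tokenisation: lowercase the text and keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_posting_list(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of URLs whose content contains it."""
    postings: dict[str, set[str]] = defaultdict(set)
    for url, content in pages.items():
        for token in tokenise(content):
            postings[token].add(url)
    return postings

# Hypothetical pages, purely for demonstration.
pages = {
    "https://example.com/red-shoes": "Red running shoes for trail running",
    "https://example.com/blue-shoes": "Blue running shoes on sale",
}

postings = build_posting_list(pages)
print(postings["running"])  # both URLs contain the token "running"
print(postings["blue"])     # only the blue-shoes page
```

Looking up a query term is then a simple set lookup rather than a scan of every page, which is why this kind of structure scales so well for recall.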
Canonicals, Deduplication, and Choosing the Right Page
A major theme was canonicalisation: the process of selecting a representative version of duplicate content. Rather than indexing every version of a page, Google clusters similar pages and chooses a “canonical” based on signals like rel-canonical tags, redirects, sitemaps, and user experience factors.
This process is machine-learned and competitive. “Canonicalisation is a bidding system,” Gary noted. Each page competes using its available signals; pages with poor user experience, errors, or ambiguous signals may lose out.
Why does this matter? Because duplication is not just a technical issue. It wastes crawl and index resources and can dilute your site’s presence in Search. Using clean redirects, consistent rel-canonical tags, and clear language localisation (via hreflang) helps Google choose the right page - yours.
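If you want to sanity-check these signals on your own pages, a small script can list the canonical and hreflang tags present in the raw HTML. The sketch below uses only Python’s standard library; the URL is a placeholder, and it only sees server-rendered tags, not ones injected by JavaScript.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkSignalParser(HTMLParser):
    """Collects rel="canonical" and hreflang alternate links from a page."""
    def __init__(self):
        super().__init__()
        self.canonical = None
        self.hreflangs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        if "canonical" in rel:
            self.canonical = a.get("href")
        elif "alternate" in rel and a.get("hreflang"):
            self.hreflangs.append((a["hreflang"], a.get("href")))

# Placeholder URL - swap in a page from your own site.
url = "https://example.com/product"
html = urlopen(url).read().decode("utf-8", errors="ignore")

parser = LinkSignalParser()
parser.feed(html)
print("canonical:", parser.canonical)
print("hreflang :", parser.hreflangs)
```

If the canonical printed here isn’t the page you want representing the cluster, that’s the first signal to fix.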
Robots.txt vs Meta Robots: Controlling Crawling and Indexing
Google reiterated the distinction between crawling controls and indexing directives:
- Robots.txt governs what Googlebot can access (crawling).
- Meta robots tags govern what can be included in the index or displayed in Search (indexing and serving).
Meta robots tags - such as noindex, nofollow, and unavailable_after - must appear in the <head> section of a page that is accessible (i.e. not blocked by robots.txt), or else they’ll be ignored.
The unavailable_after directive was highlighted as particularly useful for temporary content like event pages or limited-time offers. Once the date has passed, Google will automatically de-index the page.
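To see the two controls side by side, the sketch below first asks robots.txt whether a URL is crawlable and only then looks for meta robots directives, mirroring the order Google works in. It uses Python’s standard library; the URL is a placeholder and the example directives in the final comment are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

class MetaRobotsParser(HTMLParser):
    """Collects the content of <meta name="robots"> / <meta name="googlebot"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() in ("robots", "googlebot"):
            self.directives.append(a.get("content", ""))

# Placeholder URL for illustration.
url = "https://example.com/limited-time-offer"
parts = urlparse(url)

# 1. Crawling control: is the URL blocked by robots.txt?
rp = RobotFileParser()
rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
rp.read()
crawlable = rp.can_fetch("Googlebot", url)
print("crawlable:", crawlable)

# 2. Indexing control: only meaningful if the page is crawlable,
#    because the directives on a blocked page can never be seen.
if crawlable:
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    parser = MetaRobotsParser()
    parser.feed(html)
    print("meta robots:", parser.directives)  # e.g. ['noindex', 'unavailable_after: 2025-12-31']
```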
JavaScript and Indexing: Avoid the Pitfalls
JavaScript-powered websites remain a challenge for indexing, especially when essential elements like meta tags or links are generated client-side. Google and community speakers shared several common issues and their solutions:
- Meta tags modified by JavaScript: Set them in raw HTML instead.
- JS-dependent internal links: Use proper <a> tags with href attributes.
- Blocked resources: Ensure robots.txt allows crawling of essential scripts and endpoints.
- Soft 404s: If your app shows an error but returns a 200 HTTP status, Google may interpret it as valid (but thin) content. Return a real 404 or 410 status instead (a quick detection sketch follows at the end of this section).
Googlebot renders pages as well as crawling them, but it doesn’t process URL fragments (e.g. #tab2) and won’t grant permission prompts (for location, notifications, and so on) the way a user’s browser might. Expect these limitations and optimise accordingly.
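A rough way to spot soft 404s is to fetch a URL you know doesn’t exist and compare the status code with what the body actually says. The sketch below is a simplified heuristic, not how Google detects soft 404s; the URL and error phrases are placeholders you would adapt to your own templates.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

# Hypothetical phrases that suggest the page body is really an error message.
ERROR_PHRASES = ("page not found", "nothing here", "0 results")

def check_soft_404(url: str) -> None:
    try:
        with urlopen(url) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="ignore").lower()
    except HTTPError as e:
        # A real 404/410 status is exactly what Google expects for a missing page.
        print(f"{url}: returns {e.code} - a proper error status")
        return

    if status == 200 and any(phrase in body for phrase in ERROR_PHRASES):
        print(f"{url}: returns 200 but reads like an error page - likely a soft 404")
    else:
        print(f"{url}: returns {status}")

# Placeholder URL that shouldn't exist on your site.
check_soft_404("https://example.com/this-page-does-not-exist")
```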
Image Indexing and AI Content
Googlebot-Image operates separately from the main crawler, meaning media indexing is asynchronous. Image quality doesn’t affect crawl budget, but the accessibility and context of images are key.
- Use valid <img> elements with descriptive alt text - this is the strongest signal for image indexing (see the audit sketch after this list).
- Images only referenced in CSS (e.g. background-image) are unlikely to be indexed.
- Image sitemaps are strongly recommended.
- AI-generated images are acceptable if they add user value. If they’re purely decorative or low-quality, consider disallowing their crawl via robots.txt.
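A quick alt-text audit can be done with a few lines of standard-library Python. The sketch below lists <img> elements whose alt attribute is missing or empty; the URL is a placeholder, and it only inspects server-rendered HTML.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class ImgAltAuditor(HTMLParser):
    """Flags <img> elements with a missing or empty alt attribute."""
    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        a = dict(attrs)
        if not (a.get("alt") or "").strip():
            self.missing_alt.append(a.get("src", "(no src)"))

# Placeholder URL - point this at one of your own pages.
url = "https://example.com/"
html = urlopen(url).read().decode("utf-8", errors="ignore")

auditor = ImgAltAuditor()
auditor.feed(html)
print("images without alt text:", auditor.missing_alt)
```

Keep in mind that an intentionally empty alt is fine for purely decorative images; the point of the audit is to catch meaningful images that lack a description.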
A key question answered: Does image quality affect crawl budget? No. But large, slow-loading images can delay subsequent fetches. Fortunately, Googlebot opens multiple connections to avoid bottlenecks.
What Makes the Index? And What Doesn’t?
Not everything Google crawls makes it into the index. Index selection is the final step before ranking, and it’s shaped by multiple signals including:
- Country and language (to ensure broad representation across global users)
- Content quality, trustworthiness, and utility
- Spam signals (Google flags over 40 billion spammy pages every day)
- Freshness (calculated during indexing, applied during ranking)
- Page purpose and usability (main content must help the user)
Pages marked as noindex, expired (unavailable_after), or clearly spammy are excluded. So are soft 404s, those misleading error-like pages with thin or no real content.
If you see “Crawled - currently not indexed” in Search Console, it’s often a sign that Google is still evaluating your page’s usefulness. Focus on improving internal links, content clarity, and user value.
What This Means for You: Key Takeaways
- Fix JavaScript traps: Ensure content and links are visible in raw HTML. Don’t rely on dynamic rendering alone.
- Be precise with signals: A page blocked by robots.txt can’t show Google its meta robots tags, so noindex only works on pages that remain crawlable.
- Use canonical tags wisely: Ambiguity or misuse can result in the wrong page being indexed.
- Clean up soft 404s: Return accurate HTTP status codes and ensure error pages don’t masquerade as valid.
- Leverage image indexing: Use descriptive alt text and ensure images are crawlable through standard HTML.
- Review Search Console: “Discovered - currently not crawled” could indicate crawl budget limits or low perceived quality.
- Trust the fundamentals: Google still indexes content based on clarity, intent alignment, and usefulness.
Looking Ahead
Indexing might be one of the least visible parts of SEO, but it’s foundational to everything that comes after. As Google’s systems grow more complex, the opportunity for marketers is to match that sophistication with clarity, structure, and quality signals that reinforce what their content is about and why it matters.
If you’re unsure whether your content is being indexed correctly or need help resolving crawl and canonical issues, Altitude Search offers a free SEO Health Check. It’s designed to help businesses scale visibility step by step - no jargon, just practical insight.