Here is a quick story about a page indexing "fix" that caused trouble instead of improving the website. A client came to me after a third party had worked on their page indexing issues: a novice technical SEO expert who had set everything to index. He tried to clear the Google Search Console issues "Blocked by robots.txt" and "Indexed, though blocked by robots.txt" by making every URL indexable.
As you can see in the screenshot below, the site has a massive number of indexed pages and millions of not-indexed pages. Guess what? All of these pages are spam. Because the so-called technical SEO expert allowed every URL to be indexed, spammers took the opportunity and flooded the site with unwanted pages.

Common Page Indexing Scenarios: When to Index vs. No-Index
Understanding which pages should be indexed is critical. Here’s a comprehensive breakdown:
Pages You SHOULD Index
Primary Content Pages:
– Homepage
– Main product/service pages
– Category pages (with unique content)
– Blog posts and articles
– Landing pages with original content
– About, Contact, and key informational pages
Why? These pages provide value to searchers and represent your core content.
Pages You Should NOT Index
Search Result Pages:
– Internal site search results (?s=keyword, ?q=search-term)
– Filtered results (?color=blue&size=large)
– Sorted views (?sort=price-asc)
Why? These create infinite URL combinations that dilute your crawl budget and create thin content issues.
Utility Pages:
– Login/logout pages
– Checkout and cart pages
– Thank you pages
– User account dashboards
– Admin panels
Why? No search value for external users and can expose sensitive areas.
Technical Pages:
– Staging/development URLs
– Test pages
– Duplicate content with URL parameters
– Printer-friendly versions
– AMP duplicates (use canonical instead)
Why? These are technical duplicates that confuse search engines.
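To make the lists above concrete, here is a minimal Python sketch that flags URLs which probably should not be indexed. The parameter names and path prefixes are assumptions taken from the examples in this section (`?s=`, `?q=`, `?color=`, `?sort=`, login, cart, and so on), not a complete rule set, so adjust them to your own site.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical rule set based on the examples above; adapt it to your site.
NOINDEX_PARAMS = {"s", "q", "color", "size", "sort"}
NOINDEX_PATH_PREFIXES = ("/search", "/login", "/logout", "/cart",
                         "/checkout", "/account", "/admin")


def should_index(url: str) -> bool:
    """Return False when the URL looks like a search, filter, sort, or utility page."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    # Any search/filter/sort parameter makes the URL a no-index candidate.
    if NOINDEX_PARAMS.intersection(params):
        return False
    path = parts.path.lower()
    # Match the prefix itself or anything below it, but not "/searchlight".
    return not any(path == p or path.startswith(p + "/")
                   for p in NOINDEX_PATH_PREFIXES)


if __name__ == "__main__":
    for candidate in ("https://example.com/blog/seo-tips",
                      "https://example.com/?s=casino",
                      "https://example.com/shop?color=blue&size=large"):
        print(candidate, "->", "index" if should_index(candidate) else "no-index")
```

A check like this is only a first filter for an audit; the final decision still depends on whether the page offers unique content.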
It is not always necessary to index everything; in fact, Google does not index everything, as its official documentation on page indexing explains.
The website had been blocking its search pages (URLs with ?q=search-term) from crawling through robots.txt. However, someone changed that setting to unblock the search pages so they could be indexed. This decision was wrong: Google doesn’t index everything anyway, and now the client has a huge number of spam pages indexed, with many more listed as not indexed in the page indexing report.

Platform-Specific Indexing Control
WordPress: Controlling What Gets Indexed
Using Yoast SEO:
- Edit the page/post you want to no-index
- Scroll to the Yoast SEO meta box
- Click the gear icon → Advanced
- Set Allow search engines to show this page in search results? to No
- Update the page
Using Rank Math:
- Edit the page
- Find the Rank Math meta box
- Click the Advanced tab
- Toggle Robots Meta to No Index
Bulk No-Index for Post Types:
Go to SEO → Search Appearance → [Post Type] and set Show [type] in search results to No for:
– Media/Attachments
– Tags (if thin content)
– Author archives (for single-author blogs)
robots.txt for Search Pages:
# Disallow search result pages for all crawlers
User-agent: *
Disallow: /*?s=
Disallow: /search/
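Before deploying rules like these, it helps to check which paths they actually match. The sketch below is a simplified matcher for Google-style patterns (`*` as a wildcard, `$` as an end anchor), written by hand because Python's built-in `urllib.robotparser` does plain prefix matching and, as far as I know, does not expand wildcards. It ignores `Allow` rules and rule precedence, so treat it as a rough test aid rather than a full robots.txt parser.

```python
import re
from typing import Iterable


def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into an anchored regular expression."""
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # Escape literal text and turn every '*' into "match anything".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored_end else ""))


def is_disallowed(path: str, disallow_patterns: Iterable[str]) -> bool:
    """Return True if any Disallow pattern matches the path plus query string."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)


if __name__ == "__main__":
    rules = ["/*?s=", "/search/"]
    for path in ("/?s=casino", "/blog/?s=pills", "/search/shoes", "/products/shoes"):
        print(path, "->", "blocked" if is_disallowed(path, rules) else "allowed")
```

Note that `/*?s=` only matches when `s` is the first query parameter; a URL such as `/shop?page=2&s=x` would need an extra `/*&s=` rule.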
Shopify: Managing Index Settings
No-Index Product Variants:
Shopify automatically canonicalizes product variants to the main product page. Verify that your theme outputs the canonical tag, `<link rel="canonical" href="{{ canonical_url }}">`, which usually sits in the `<head>` of `theme.liquid`.
No-Index Collections with Filters:
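Add this inside the `<head>` of your theme’s layout file (`theme.liquid`), because a robots meta tag only takes effect in the head:
{% if template contains 'collection' and current_tags %}
<meta name="robots" content="noindex, follow">
{% endif %}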
Block Search Pages in robots.txt:
Edit your robots.txt.liquid file:
Disallow: /search
Disallow: /*?q=
Disallow: /collections/*+
WooCommerce: Product Variations & Filters
No-Index Filtered Shop Pages:
Install Yoast WooCommerce SEO addon, then:
- Go to SEO → Search Appearance → WooCommerce
- Enable No-index for filtered shop pages
Handle Product Variations:
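WooCommerce doesn’t create separate URLs for variations (unlike Shopify), but you should still make sure your canonical tags are correct:
<?php
// In functions.php or a custom plugin.
// Uses WordPress core's get_canonical_url filter; SEO plugins that output
// their own canonical tag provide their own filters.
add_filter('get_canonical_url', 'custom_canonical_url', 10, 2);
function custom_canonical_url($canonical_url, $post) {
    // Products always use their clean permalink as the canonical URL.
    if ($post->post_type === 'product') {
        return get_permalink($post->ID);
    }
    return $canonical_url;
}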
How to Recover from Indexing Mistakes
If you’ve accidentally indexed thousands of unwanted pages (like the example in our case study), here’s your recovery process:
Step 1: Stop the Bleeding (Immediate)
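Block Further Indexing:
- Add no-index meta tags to affected page types
- Update robots.txt to disallow problematic URL patterns (keep already-indexed URLs crawlable until they drop out, because Google cannot see a no-index tag on a page it is blocked from crawling)
- Remove sitemap references to spam pages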
Example robots.txt update:
User-agent: *
# Block search pages
Disallow: /*?s=
Disallow: /search/
# Block filter parameters
Disallow: /*?filter=
Disallow: /*&filter=
# Block session IDs
Disallow: /*?sid=
Disallow: /*sessionid=
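Once the no-index tags from Step 1 are deployed, it is worth confirming that affected pages really serve them. The following Python sketch checks an HTML document and its response headers for a `noindex` directive, either in a robots meta tag or in an `X-Robots-Tag` header. Fetching the page is left out on purpose; the function name and the sample inputs are illustrative assumptions.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(html: str, headers: Optional[Dict[str, str]] = None) -> bool:
    """Return True if the response tells crawlers not to index the page."""
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            if "noindex" in [d.strip().lower() for d in value.split(",")]:
                return True
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex(sample))
```

In practice you would run this over a sample of search and filter URLs after each deployment.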
Step 2: Remove Spam URLs from Google’s Index
For Small Batches (<100 URLs):
- Go to Google Search Console → Removals
- Click New Request
- Enter the URL or URL prefix pattern
- Submit (temporary removal for 6 months)
For Large Batches (1000s of URLs):
You cannot submit thousands of individual URLs for removal in GSC, but you can speed up de-indexing:
- Ensure proper no-index tags are in place
- Submit an updated sitemap (without spam URLs)
- Wait for natural de-indexing (can take 2-4 weeks)
- Do not rely on URL parameter handling in GSC: Google retired the URL Parameters tool in 2022, so parameters like ?s= or ?filter= have to be controlled with no-index tags and robots.txt rules instead
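Cleaning the sitemap by hand is not realistic at this scale, so here is a rough Python sketch that drops spam-looking entries from a sitemap before you resubmit it. The parameter names mirror the robots.txt examples above and are assumptions; the function works on the XML text, so reading and writing the file is up to you.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qs

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed spam parameters, mirroring the robots.txt rules in this guide.
SPAM_PARAMS = {"s", "q", "filter", "sid", "sessionid"}


def strip_spam_urls(sitemap_xml: str) -> str:
    """Return the sitemap XML without <url> entries that look like spam pages."""
    ET.register_namespace("", SITEMAP_NS)  # keep the default namespace on output
    root = ET.fromstring(sitemap_xml)
    for url_el in list(root.findall(f"{{{SITEMAP_NS}}}url")):
        loc = url_el.findtext(f"{{{SITEMAP_NS}}}loc", default="").strip()
        parts = urlsplit(loc)
        params = parse_qs(parts.query, keep_blank_values=True)
        if SPAM_PARAMS.intersection(params) or parts.path.startswith("/search/"):
            root.remove(url_el)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        f'<urlset xmlns="{SITEMAP_NS}">'
        "<url><loc>https://example.com/products/shoes</loc></url>"
        "<url><loc>https://example.com/?s=casino</loc></url>"
        "</urlset>"
    )
    print(strip_spam_urls(sample))
```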
Step 3: Monitor Progress
Track De-Indexing:
Run these search operators weekly (the result counts are estimates, so watch the trend rather than the exact number):
site:yoursite.com inurl:?s=
site:yoursite.com inurl:/search/
GSC Page Indexing Report (formerly Coverage):
Monitor the not-indexed reasons for decreases in:
– Duplicate without user-selected canonical
– Crawled – currently not indexed
Step 4: Prevent Future Issues
Set Up Alerts:
Create a monitoring system to catch issues early:
- Weekly GSC Email Reports – Enable in Settings
- Monthly Coverage Audits – Check for new exclusion patterns
- Crawl Budget Analysis – Check whether Googlebot wastes time on junk pages
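For the crawl budget check, server access logs are the most direct source. The sketch below counts how many Googlebot requests hit junk URLs versus useful ones. It assumes the common combined log format and identifies Googlebot by the user-agent string only, which can be spoofed, so treat the numbers as an approximation. The junk parameter list is again an assumption.

```python
import re
from collections import Counter
from typing import Iterable
from urllib.parse import urlsplit, parse_qs

# Matches the request part of a combined-format log line, e.g. "GET /path HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (?P<target>\S+) HTTP/[\d.]+"')
JUNK_PARAMS = {"s", "q", "filter", "sid", "sessionid"}  # assumed junk parameters


def googlebot_crawl_summary(log_lines: Iterable[str]) -> Counter:
    """Count Googlebot requests to junk URLs versus useful URLs."""
    counts: Counter = Counter()
    for line in log_lines:
        if "Googlebot" not in line:
            continue  # only user-agent matching; no reverse DNS verification
        match = REQUEST_RE.search(line)
        if not match:
            continue
        parts = urlsplit(match.group("target"))
        params = parse_qs(parts.query, keep_blank_values=True)
        is_junk = bool(JUNK_PARAMS.intersection(params)) or parts.path.startswith("/search/")
        counts["junk" if is_junk else "useful"] += 1
    return counts


if __name__ == "__main__":
    import sys
    summary = googlebot_crawl_summary(sys.stdin)
    total = sum(summary.values()) or 1
    print(f"junk: {summary['junk']} ({summary['junk'] / total:.0%}), useful: {summary['useful']}")
```

If a large share of Googlebot requests lands in the junk bucket, the robots.txt rules or internal links need another look.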
Create Documentation:
Document your indexing rules so future team members don’t reverse your fixes:
✅ Always Index: Products, blog posts, core pages
❌ Never Index: Search results, filters, session URLs
⚠️ Conditional: Category pages (only with unique content >300 words)
Real-World Case Study: Recovering from 2.3M Indexed Spam Pages
The Problem: A client came to us after a previous SEO expert changed their robots.txt to allow all search pages to be indexed. Result:
– Before: ~15,000 legitimate pages indexed
– After bad change: 2.3M pages indexed (mostly spam)
– Traffic impact: 67% drop in organic traffic over 3 months
Our Recovery Process:
Week 1:
– Blocked search URLs in robots.txt
– Added no-index meta tags to search template
– Removed spam URLs from XML sitemap
Week 2-4:
– Submitted 500 removal requests (GSC limit)
– Monitored de-indexing progress
– Fixed internal links pointing to search pages
Results:
– Month 1: Down to 1.8M indexed pages
– Month 2: Down to 800K indexed pages
– Month 3: Back to 18K indexed pages (3K were legitimate new content)
– Traffic recovery: 89% of original traffic restored
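To put those monthly figures in perspective, this small Python calculation shows what share of the 2.3M peak had been removed at each checkpoint. It only uses the numbers reported above; the variable names are illustrative.

```python
PEAK_INDEXED = 2_300_000  # indexed pages after the bad change
monthly_counts = {"Month 1": 1_800_000, "Month 2": 800_000, "Month 3": 18_000}


def percent_removed(peak: int, current: int) -> float:
    """Share of the peak index that has been de-indexed, in percent."""
    return round((peak - current) / peak * 100, 1)


for label, count in monthly_counts.items():
    print(f"{label}: {percent_removed(PEAK_INDEXED, count)}% of the peak index removed")
```

The curve is not linear: roughly a fifth of the spam disappeared in the first month, and most of the cleanup showed up in months two and three.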
Key Lesson: Never index pages that accept user-generated parameters. If a previous expert suggests this, get a second opinion.
So what would be the right approach to fix page indexing issues?
I always suggest hiring an SEO expert who can evaluate your website and make decisions based on the pages reported in the page indexing report.
If pages are excluded from indexing, either through robots.txt or a meta robots tag, check whether each page actually needs to be indexed before changing anything.
Ideally, we should not index search pages or any pages that accept user-generated search terms, like the many spammy URLs I shared above.
That is exactly what happened with this client, and it left a huge number of unwanted pages in the index.
Please share if you have any questions.
Decision Framework: Should This Page Be Indexed?
Use this flowchart for every questionable page:
Does the page provide unique value to searchers?
├─ Yes → Does it have substantial content (>200 words)?
│  ├─ Yes → Does it duplicate another page?
│  │  ├─ No → ✅ INDEX IT
│  │  └─ Yes → Set canonical to main version, no-index duplicate
│  └─ No → ❌ NO-INDEX (thin content)
└─ No → Is it a utility page (login, checkout, etc.)?
   ├─ Yes → ❌ NO-INDEX
   └─ No → Is it generated by URL parameters?
      ├─ Yes → ❌ NO-INDEX + Block in robots.txt
      └─ No → Consult with SEO expert
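The same flowchart can be written as a small Python function, which is handy if you want to run it over a crawl export. The `Page` fields are hypothetical names for the questions in the flowchart; how you populate them (word counts, duplicate detection) depends on your own tooling.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Answers to the flowchart questions for one page (hypothetical field names)."""
    provides_unique_value: bool
    word_count: int = 0
    duplicates_another_page: bool = False
    is_utility_page: bool = False
    generated_by_url_parameters: bool = False


def indexing_decision(page: Page) -> str:
    """Walk the decision flowchart and return the recommended action."""
    if page.provides_unique_value:
        if page.word_count <= 200:  # not "substantial content (>200 words)"
            return "noindex (thin content)"
        if page.duplicates_another_page:
            return "canonical to main version, noindex duplicate"
        return "index"
    if page.is_utility_page:
        return "noindex"
    if page.generated_by_url_parameters:
        return "noindex + block in robots.txt"
    return "consult an SEO expert"


if __name__ == "__main__":
    print(indexing_decision(Page(provides_unique_value=True, word_count=850)))
    print(indexing_decision(Page(provides_unique_value=False, generated_by_url_parameters=True)))
```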
Quick Reference: Indexing Best Practices by Page Type
| Page Type | Index? | Method | Notes |
|---|---|---|---|
| Homepage | ✅ Yes | Default | Always index |
| Product pages | ✅ Yes | Default | Main product URLs only |
| Product variants (colors) | ❌ No | Canonical | Point to main product |
| Category pages | ✅ Yes | Conditional | Only if unique content >300 words |
| Search results | ❌ No | robots.txt + meta | Never index |
| Filtered results | ❌ No | robots.txt + meta | Never index |
| Pagination (page=2) | ⚠️ Maybe | rel="next/prev" | Or canonical to page 1; note that Google no longer uses rel="next/prev" as an indexing signal |
| Blog posts | ✅ Yes | Default | Always index |
| Tag archives | ⚠️ Maybe | Conditional | Only if curated with unique content |
| Author archives | ⚠️ Maybe | Conditional | Multi-author sites only |
| 404 pages | ❌ No | Status code | Returns 404 automatically |
| Login/Register | ❌ No | Meta no-index | Utility pages |
| Cart/Checkout | ❌ No | Meta no-index | Utility pages |
| Thank you pages | ❌ No | Meta no-index | Conversion pages |
| AMP versions | ❌ No | Canonical | Point to HTML version |




