The Challenge
A large US eCommerce client — a master site managing multiple category verticals with tens of thousands of product pages — had seen organic traffic stagnate despite consistent content investment. The initial brief was to "improve SEO performance," but the root cause wasn't content quality. It was accumulated technical debt: years of platform changes, developer decisions made without SEO consultation, and CMS configurations that had silently broken fundamental crawlability.
When we ran the first crawl, the scale of the problem became clear. Googlebot was spending crawl budget on thousands of low-value parameterised URLs (filter combinations, sort orders, pagination variants) while the actual category and product pages that generated revenue were either miscanonicalized, blocked by robots.txt edge cases, or caught in redirect chains from previous migrations.
Audit Methodology
A technical SEO audit at this scale cannot be done manually. I built a structured audit process using three data layers:
- Crawl data (Screaming Frog): full crawl of all crawlable URLs, capturing HTTP status, canonical tags, meta robots directives, indexability status, and internal link counts. Exported to BigQuery for analysis at scale.
- Log file analysis: server-side access logs for a 30-day window, filtered for Googlebot, with crawl frequency per URL type cross-referenced against organic traffic contribution. This revealed the budget allocation problem directly: Googlebot was crawling filter/sort URLs at 4× the frequency of actual category pages.
- GSC Coverage report: mapped the "Excluded" URLs (canonicalised away, noindex, crawled but not indexed) against the crawl data to identify pages that were both internally linked and excluded. This is the worst pattern, because it wastes crawl budget while the excluded page contributes nothing to link equity consolidation.
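The log-file layer can be sketched roughly as follows. This is a minimal illustration, not the audit's actual pipeline (which ran in BigQuery): it assumes Combined Log Format access logs, and the parameter names in FILTER_PARAMS and the /product/ path prefix are hypothetical stand-ins for the site's real URL scheme. It also trusts the user-agent string, whereas a production check would verify Googlebot by reverse DNS.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Hypothetical filter/sort parameter names; the real site's keys are not given.
FILTER_PARAMS = {"colour", "size", "brand", "sort"}

# Combined Log Format: quoted request line, status, bytes, quoted referrer, quoted user agent.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def classify(path):
    """Bucket a request path into a coarse URL type."""
    parts = urlsplit(path)
    params = set(parse_qs(parts.query, keep_blank_values=True))
    if params & FILTER_PARAMS:
        return "filter_or_sort"
    if "page" in params:
        return "pagination"
    if parts.path.startswith("/product/"):  # assumed product URL prefix
        return "product"
    return "category"


def googlebot_hits_by_type(lines):
    """Count requests per URL type for lines whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("ua"):
            counts[classify(match.group("path"))] += 1
    return counts
```

Dividing the filter_or_sort count by the category count gives the kind of crawl-frequency ratio described above.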
Key Issues Found and Fixed
Canonical misconfiguration at scale
The CMS was canonicalizing paginated category pages (/category?page=2) back to the root category (/category), which was the intended behaviour. But it was also applying the same logic to filtered views (/category?colour=red). The problem: filtered pages were still internally linked from the faceted navigation, so Googlebot was crawling them, processing the canonical, and spending crawl budget without gaining anything. The fix: noindex plus a robots.txt disallow for parameterised filter combinations, with a carefully scoped wildcard pattern (robots.txt supports only the * and $ wildcards, not full regular expressions) to avoid blocking legitimate paginated URLs. The two directives have to be sequenced, because Googlebot cannot see a noindex on a URL it is not allowed to crawl.
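One way to check the scope of such a disallow pattern before deploying it is to test it against sample URLs. The sketch below is an assumption-laden illustration, not the rule set used on the client site: the patterns in RULES are hypothetical, and the matcher is a simplified take on Google-style robots.txt matching (* matches any run of characters, a trailing $ anchors the end, the longest matching pattern wins, and allow wins a tie).

```python
import re
from typing import List, Tuple

# Hypothetical rules for illustration; the site's real parameters are not given.
RULES = [
    ("disallow", "/*?*colour="),
    ("disallow", "/*?*sort="),
]


def rule_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into an anchored regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_allowed(url_path: str, rules: List[Tuple[str, str]]) -> bool:
    """Return True if url_path (path plus query string) may be crawled."""
    best_kind, best_pattern = "allow", ""  # no matching rule means allowed
    for kind, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if rule_to_regex(pattern).match(url_path):
            longer = len(pattern) > len(best_pattern)
            tie_allow = len(pattern) == len(best_pattern) and kind == "allow"
            if longer or tie_allow:
                best_kind, best_pattern = kind, pattern
    return best_kind == "allow"
```

With these example rules, /category?colour=red is blocked while /category?page=2 stays crawlable. A paginated URL that also carries a filter, such as /category?page=2&colour=red, is blocked as well, which is the sort of edge case worth checking explicitly.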
Duplicate content clusters
The site had product pages accessible via three URL paths: the canonical product URL, the category-breadcrumb path, and a legacy URL structure from a previous platform that still resolved via a 301 chain. The 301 chains were mostly two hops, with a few three-hop cases. Consolidating these reduced the site's crawlable URL count by ~18% and concentrated link signals onto canonical product URLs.
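Flattening redirect chains can be sketched as below, assuming the redirects have already been exported as a source-to-target mapping (for example from crawl data). The function names and example paths are hypothetical, and this is an illustration of the idea rather than the script used in the audit.

```python
def resolve_chain(url, redirects, max_hops=10):
    """Follow url through a {source: target} map; return (final_url, hops)."""
    seen = {url}
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError("redirect loop or chain longer than max_hops")
        seen.add(url)
    return url, hops


def flatten(redirects):
    """Return a map in which every source points directly at its final target."""
    return {src: resolve_chain(src, redirects)[0] for src in redirects}


# Hypothetical two-hop chain: legacy URL -> breadcrumb path -> canonical URL.
example = {
    "/old/p1": "/cat/p1",
    "/cat/p1": "/product/p1",
}
```

Here flatten(example) maps both /old/p1 and /cat/p1 straight to /product/p1, so each legacy URL resolves in a single hop.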
Revenue pages in crawl budget shadow
Several high-converting category pages had been inadvertently disallowed in a robots.txt update six months prior — a developer had added a blanket rule to block a staging subdirectory but accidentally used a pattern that also matched a subset of live category paths. These pages were still indexed (Google had cached pre-disallow versions) but were not being recrawled, meaning fresh content wasn't being indexed. Fixing the robots.txt rule and resubmitting for indexing recovered these pages within three crawl cycles.
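A regression check for this failure mode can be as small as the sketch below, which uses Python's standard urllib.robotparser to list which known revenue paths a robots.txt draft would block. The rules and paths are hypothetical examples of an over-broad prefix, not the client's actual file. The standard-library parser does simple prefix matching and does not implement Google's wildcard extensions, so it only suits plain prefix rules like these.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, paths, user_agent="Googlebot"):
    """Return the subset of paths that robots_txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(user_agent, p)]


# Hypothetical over-broad rule: meant for /staging/, but the prefix also
# matches a live category path that begins with the same characters.
BROKEN = "User-agent: *\nDisallow: /stag\n"
FIXED = "User-agent: *\nDisallow: /staging/\n"

# Hypothetical mix of one live category, one staging URL, one unrelated category.
PATHS = ["/stage-lighting/", "/staging/build/", "/sofas/"]
```

Running blocked_paths(BROKEN, PATHS) flags the live /stage-lighting/ category alongside the staging URL, while blocked_paths(FIXED, PATHS) flags only /staging/build/. A check like this could run against a list of revenue URLs before every robots.txt change.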
The Outcome
Within 90 days of implementing the fixes, organic sessions to category pages increased 34% as crawl budget was reallocated from filter URLs to revenue pages. GSC Coverage showed a 41% reduction in "Crawled but not indexed" URLs. The ongoing SEO QA system introduced after the audit caught three regressions in the following six months before they affected rankings.
Tools & Stack
Screaming Frog · BigQuery (log file analysis) · Google Search Console · Apache robots.txt · Python (canonical audit scripts) · Ahrefs (link equity mapping) · Google Analytics