Soft-block Detection

One of the nastiest scraping failures looks like success:

  • the server returns 200 OK
  • your request library is happy
  • your parser produces output
  • your dataset is quietly wrong

That is a soft block.

Instead of a hard 403 or 429, you get HTML that is technically valid but operationally useless:

  • “access denied” pages
  • consent interstitials
  • login walls
  • JavaScript placeholders
  • challenge pages

If you do not detect these responses, your scraper will lie to you.

This guide shows a practical soft-block detection strategy you can drop into a Python scraper today.

Detect bad HTML before it becomes bad data

A proxy layer helps reduce failures, but it does not remove the need for validation. ProxiesAPI keeps fetches steadier; your scraper still needs to prove the response is real.


What a soft block looks like

Soft blocks usually share one or more of these traits:

  • the HTML is much smaller than a real page
  • expected DOM anchors are missing
  • keywords like “captcha”, “access denied”, or “enable JavaScript” appear
  • the title is generic instead of target-specific
  • your record count suddenly drops to zero

The key mindset:

Treat 200 OK as untrusted until the content passes validation.

That single habit prevents a lot of bad downstream data.
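The generic-title trait from the list above is cheap to check. Here is a minimal sketch, assuming you know a few terms that a real page title on your target should contain. `extract_title`, `title_looks_generic`, and `GENERIC_TITLES` are illustrative names, and the title list is a guess to tune per target:

```python
import re

# Assumed examples of titles that block or challenge pages tend to use.
GENERIC_TITLES = {"just a moment...", "access denied", "sign in", "robot check"}

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str) -> str:
    """Return the first <title> text with whitespace collapsed, or ''."""
    match = TITLE_RE.search(html)
    return " ".join(match.group(1).split()) if match else ""


def title_looks_generic(html: str, expected_terms: list) -> bool:
    title = extract_title(html).lower()
    if not title or title in GENERIC_TITLES:
        return True
    # A real page title should mention at least one target-specific term.
    return not any(term.lower() in title for term in expected_terms)
```

For a product catalogue, `expected_terms` might be the shop name or the product category. A missing title counts as generic here, which is a deliberate choice you may want to relax.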


Step 1: Separate fetching from validation

Do not mix parsing logic directly into the request call. First fetch, then validate, then parse.

from __future__ import annotations

import requests

TIMEOUT = (10, 30)  # (connect, read) timeouts in seconds
UA = "Mozilla/5.0 (compatible; ProxiesAPIGuidesBot/1.0; +https://www.proxiesapi.com/)"

session = requests.Session()
session.headers.update({"User-Agent": UA})


def fetch_html(url: str) -> tuple[int, str]:
    r = session.get(url, timeout=TIMEOUT)
    return r.status_code, r.text

This small separation makes the rest of the reliability layer easier to reason about.
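Because `requests` follows redirects by default, a login wall often arrives as a 200 whose final URL is not the one you asked for. Here is a hedged sketch of a check you could run on `r.url` alongside the status and body. `redirected_to_wall` and the path hints are assumptions, not a complete list:

```python
from urllib.parse import urlsplit

# Assumed path fragments that suggest a login, consent, or challenge page.
WALL_PATH_HINTS = ("/login", "/signin", "/consent", "/challenge", "/captcha")


def redirected_to_wall(requested_url: str, final_url: str) -> bool:
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    # Same host and path: no redirect worth worrying about.
    if (requested.netloc, requested.path) == (final.netloc, final.path):
        return False
    path = final.path.lower()
    return any(hint in path for hint in WALL_PATH_HINTS)
```

Inside a fetch function you would pass the original `url` and `r.url`. `r.history` holds the intermediate redirect responses if you want to log them.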


Step 2: Add cheap generic heuristics

Start with a few low-cost checks that work on many targets.

import re

SOFT_BLOCK_PATTERNS = [
    r"access denied",
    r"verify you are human",
    r"enable javascript",
    r"captcha",
    r"unusual traffic",
    r"temporarily blocked",
    r"request unsuccessful",
]


def looks_soft_blocked(html: str) -> bool:
    if not html:
        return True

    if len(html) < 1500:
        return True

    low = html.lower()
    return any(re.search(pattern, low) for pattern in SOFT_BLOCK_PATTERNS)

These rules are intentionally simple. The 1,500-character floor and the phrase list are starting points to tune per target. Even so, they catch a surprising number of fake-success responses before your parser ever runs.
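One caveat with keyword matching on raw HTML: legitimate pages can contain words like "captcha" or "enable JavaScript" inside `<script>` or `<noscript>` tags. Here is a hedged variation, using only the standard library, that matches phrases against visible text only. `VisibleText` and `visible_text_matches` are illustrative names, and the phrase list is shortened for the example:

```python
import re
from html.parser import HTMLParser

BLOCK_RE = re.compile(r"access denied|verify you are human|captcha|unusual traffic")
HIDDEN_TAGS = {"script", "style", "noscript", "template"}


class VisibleText(HTMLParser):
    """Collects text that is not inside script, style, noscript, or template tags."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0  # how many hidden tags we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.parts.append(data)


def visible_text_matches(html: str) -> bool:
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split()).lower()
    return bool(BLOCK_RE.search(text))
```

The trade-off: this cuts false positives, but it will miss placeholder pages whose only message sits in `<noscript>`, so keep the size check and anchor validation alongside it.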


Step 3: Add target-specific anchor checks

Generic heuristics are not enough on their own.

The strongest validation is domain-aware validation:

  • what must be present on a real page?
  • what would never be missing on a successful response?

Example for a product page:

from bs4 import BeautifulSoup


def validate_product_page(html: str) -> bool:
    soup = BeautifulSoup(html, "lxml")
    title = soup.select_one("h1")
    price = soup.select_one("[data-testid='price'], .price, .product-price")
    return bool(title and price)

Example for a search result page:

def validate_search_results(html: str) -> bool:
    soup = BeautifulSoup(html, "lxml")
    cards = soup.select(".result, .search-result, li.result-row")
    return len(cards) >= 3

These checks are far stronger than scanning for the word “captcha”.
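Once you have more than one validator, you need a way to pick the right one per URL. Here is a minimal sketch that routes by path prefix. `pick_validator` and `ROUTES` are hypothetical, and the lambdas are stand-ins so the example runs on its own:

```python
from typing import Callable, Optional
from urllib.parse import urlsplit

Validator = Callable[[str], bool]

# Stand-in validators for illustration; in a real scraper these would be
# the BeautifulSoup-based product and search validators.
ROUTES = {
    "/product/": lambda html: "price" in html,
    "/search": lambda html: html.count("result") >= 3,
}


def pick_validator(url: str, routes: dict) -> Optional[Validator]:
    """Return the validator whose path prefix matches the URL, longest prefix first."""
    path = urlsplit(url).path
    for prefix in sorted(routes, key=len, reverse=True):
        if path.startswith(prefix):
            return routes[prefix]
    return None
```

Returning `None` for unknown paths means such pages get only the generic checks. Whether that is acceptable depends on how strict you want the crawl to be.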


Step 4: Fail fast and retry cleanly

If validation fails, do not parse anyway. Treat it like a retryable fetch failure.

import random
import time


def backoff(attempt: int, base: float = 0.8, cap: float = 20.0) -> float:
    exp = min(cap, base * (2 ** (attempt - 1)))
    return exp + random.uniform(0, exp * 0.2)


def fetch_validated(url: str, validate_fn=None, attempts: int = 4) -> str:
    last_reason = None

    for attempt in range(1, attempts + 1):
        status, html = fetch_html(url)

        if status in (429, 500, 502, 503, 504):
            last_reason = f"retryable status {status}"
        elif status != 200:
            # Other non-200 codes (403, 404, ...) are not worth retrying here.
            raise RuntimeError(f"non-200 status: {status}")
        elif looks_soft_blocked(html):
            last_reason = "soft-block heuristics matched"
        elif validate_fn and not validate_fn(html):
            last_reason = "anchor validation failed"
        else:
            return html

        # Retryable failure: wait before the next attempt, but not after the last one.
        if attempt < attempts:
            time.sleep(backoff(attempt))

    raise RuntimeError(f"failed after {attempts} attempts: {last_reason}")

This is the production habit that matters:

  • soft blocks are failures
  • failures get logged
  • retries are explicit
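To make the retry cost concrete, here is the delay schedule the `backoff` formula produces with its defaults (base 0.8 s, doubling each attempt, capped at 20 s), before jitter. `base_delay` is a hypothetical helper that repeats the formula without the random term:

```python
def base_delay(attempt: int, base: float = 0.8, cap: float = 20.0) -> float:
    # Same exponential formula as backoff(), minus the random jitter.
    return min(cap, base * (2 ** (attempt - 1)))


schedule = [round(base_delay(n), 1) for n in range(1, 7)]
print(schedule)  # [0.8, 1.6, 3.2, 6.4, 12.8, 20.0]
```

Jitter adds up to 20% on top of each value. The cap only starts to matter from the sixth attempt, so with the default of four attempts it never applies.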

Step 5: Log the failure reason

If you only log “parse failed”, you learn nothing.

Track specific reasons:

  • tiny HTML
  • known block phrase
  • missing anchor
  • status 429
  • status 503

That lets you answer:

  • Is the site rate-limiting?
  • Did the layout change?
  • Are we getting challenge pages?

A tiny example:

def classify_soft_block(html: str) -> str | None:
    if not html:
        return "empty_html"
    if len(html) < 1500:
        return "tiny_html"

    low = html.lower()
    for marker in ["captcha", "access denied", "enable javascript", "verify you are human"]:
        if marker in low:
            return f"marker:{marker}"

    return None

Granular logs make your retry policy smarter over time.
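One way to make those reasons countable is sketched below, assuming reason strings shaped like the ones above (`tiny_html`, `marker:captcha`). `record_failure` and `failure_counts` are illustrative names:

```python
import logging
from collections import Counter

logger = logging.getLogger("scraper.softblock")
failure_counts = Counter()


def record_failure(url: str, reason: str) -> None:
    # Group "marker:captcha" and "marker:access denied" under "marker".
    bucket = reason.split(":", 1)[0]
    failure_counts[bucket] += 1
    logger.warning("validation failed url=%s reason=%s", url, reason)
```

At the end of a run, `failure_counts.most_common()` tells you whether the dominant problem is tiny pages, block phrases, or missing anchors.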


Step 6: Use success metrics that reveal soft blocks

Do not monitor only request success rate.

Also monitor:

  • parsed item count per run
  • median HTML length
  • share of pages failing validation
  • percent of pages with missing anchors

Why this matters: a crawl can show 99% HTTP success while its useful-data success rate is collapsing.

That is exactly how soft blocks hide in production.
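Here is a minimal sketch of collecting those numbers during a run. `RunStats` is a hypothetical helper, not part of any library, and it assumes you call `record` once per fetched page:

```python
from dataclasses import dataclass, field
from statistics import median


@dataclass
class RunStats:
    html_lengths: list = field(default_factory=list)
    pages_ok: int = 0
    pages_failed_validation: int = 0
    items_parsed: int = 0

    def record(self, html: str, valid: bool, item_count: int = 0) -> None:
        self.html_lengths.append(len(html))
        if valid:
            self.pages_ok += 1
            self.items_parsed += item_count
        else:
            self.pages_failed_validation += 1

    def summary(self) -> dict:
        total = self.pages_ok + self.pages_failed_validation
        return {
            "pages": total,
            "validation_failure_share": self.pages_failed_validation / total if total else 0.0,
            "median_html_length": median(self.html_lengths) if self.html_lengths else 0,
            "items_parsed": self.items_parsed,
        }
```

A run where every request returned 200 but `validation_failure_share` sits at 0.5 is the pattern to alert on.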


Example: wrapping a real parser safely

Here is a minimal end-to-end pattern:

from bs4 import BeautifulSoup


def parse_titles(html: str) -> list[str]:
    soup = BeautifulSoup(html, "lxml")
    return [el.get_text(" ", strip=True) for el in soup.select("h2, h3")]


def scrape_page(url: str) -> list[str]:
    html = fetch_validated(url, validate_fn=validate_search_results)
    items = parse_titles(html)
    if not items:
        raise RuntimeError("validated page parsed zero items")
    return items

That extra validation step is often the difference between a scraper that “works on my laptop” and one that survives real traffic.


Using ProxiesAPI

Soft-block detection still matters when you use a proxy API. A better network layer reduces failures; it does not eliminate the need to validate content.

curl "http://api.proxiesapi.com/?key=API_KEY&url=https://example.com"

Python wrapper:

from urllib.parse import urlencode


def fetch_via_proxiesapi(target_url: str, api_key: str) -> tuple[int, str]:
    api_url = "http://api.proxiesapi.com/?" + urlencode({
        "key": api_key,
        "url": target_url,
    })
    r = session.get(api_url, timeout=TIMEOUT)
    return r.status_code, r.text

Then run the same validation pipeline against the returned HTML.
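One detail worth seeing: `urlencode` percent-encodes the target URL, which matters as soon as the target has its own query string. Without encoding, `&page=2` below would be read as a parameter of the API call rather than part of the target. A small illustration, using a made-up target URL:

```python
from urllib.parse import urlencode

target = "https://example.com/search?q=shoes&page=2"
api_url = "http://api.proxiesapi.com/?" + urlencode({"key": "API_KEY", "url": target})
print(api_url)
# http://api.proxiesapi.com/?key=API_KEY&url=https%3A%2F%2Fexample.com%2Fsearch%3Fq%3Dshoes%26page%3D2
```

If you call the API with curl instead, the same encoding applies to the `url` parameter.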


Common mistakes

1) Only checking status codes

This catches hard blocks, not soft ones.

2) Using only keyword matching

“captcha” detection is useful, but it misses layout drift and placeholder pages.

3) Parsing before validation

Once junk HTML reaches the parser, it is too easy to accidentally emit empty or malformed records.

4) Not tracking result counts

A sudden drop to zero items is often your first sign of trouble.
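Here is a hedged sketch of that check, comparing the current run against the median of recent runs. `count_dropped` and the 0.5 ratio are assumptions to adjust for how noisy your target is:

```python
from statistics import median


def count_dropped(current: int, recent_counts: list, ratio: float = 0.5) -> bool:
    """True when the current item count falls below ratio * median of recent runs."""
    if not recent_counts:
        return current == 0  # no baseline yet: only flag an empty run
    baseline = median(recent_counts)
    return current < baseline * ratio
```

With recent counts of 120, 118, and 125, a run that yields 40 items is flagged, while one that yields 110 is not.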


My default soft-block detection stack is:

  1. status-code handling
  2. HTML-size threshold
  3. keyword heuristics
  4. target-specific anchor validation
  5. retries with backoff
  6. monitoring on parsed output, not just transport success

That combination is lightweight, effective, and easy to explain to future-you.

If you add only one reliability layer to an existing scraper, make it this one.
