Goals and constraints
Scrapers are HTTP clients that extract content faster or more broadly than you permit. Defenses trade off false positives (blocking real users or legitimate SEO crawlers), engineering cost, and added latency. Measure baseline traffic before tightening rules, then tune using the 429/Retry-After patterns described in rate limiting and throttling.
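A server-side token bucket is one concrete way to drive that tuning loop, since it naturally produces the Retry-After hint a 429 response should carry. This is a minimal in-process sketch; the rate and capacity values are illustrative, not recommendations:

```python
import time

class TokenBucket:
    """Per-client token bucket: refills `rate` tokens/sec, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> tuple[bool, float]:
        """Returns (allowed, retry_after_seconds) for one request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True, 0.0
        # Out of tokens: tell the client when one token will be available.
        return False, (1 - self.tokens) / self.rate

# On deny, respond 429 and set Retry-After to the returned number of seconds.
bucket = TokenBucket(rate=5, capacity=10)
allowed, retry_after = bucket.allow()
```

In production this state usually lives in a shared store (e.g. Redis) keyed per client, but the accounting is the same.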
Layered controls
- Edge rate limits and bot scores: CDNs/WAFs classify ASNs, JA3/TLS fingerprints, and request pacing; challenge or block high-risk buckets.
- Authenticated or signed fetches: For APIs, require tokens or HMAC-signed requests so anonymous bulk extraction cannot impersonate your web UI.
- Proof-of-work / CAPTCHA / Turnstile: Adds friction for anonymous automation; keep challenges accessible and localized.
- Honeypots and canary URLs: Links or endpoints invisible to real users that only bots hit; use them to feed blocklists with low false positives.
- Good-bot hygiene: Verify Googlebot using reverse DNS plus forward DNS confirmation (Google publishes the verification steps); do not blanket-block datacenter IPs without allowlists for known monitors.
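The signed-fetch idea above can be sketched with an HMAC over the method, path, and a timestamp. The secret handling, message layout, and 300-second replay window here are assumptions for illustration, not a fixed protocol:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # hypothetical shared secret, never shipped to anonymous clients

def sign(method: str, path: str, timestamp: str, secret: bytes = SECRET) -> str:
    """Sign the request line so anonymous bulk clients cannot forge API calls."""
    msg = f"{method}\n{path}\n{timestamp}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify(method: str, path: str, timestamp: str, signature: str,
           max_skew: float = 300.0) -> bool:
    """Server-side check: reject stale timestamps, then compare in constant time."""
    if abs(time.time() - float(timestamp)) > max_skew:
        return False  # limits replay of captured signatures
    expected = sign(method, path, timestamp)
    return hmac.compare_digest(expected, signature)
```

Clients would send the timestamp and signature as headers; `hmac.compare_digest` avoids leaking the signature through timing differences.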
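Good-bot verification per Google's published procedure (reverse DNS, domain check, forward confirmation) might look like the sketch below; error handling is simplified and the accepted domain suffixes are the commonly documented ones:

```python
import socket

def verify_googlebot(ip: str) -> bool:
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse (PTR) lookup
    except OSError:
        return False
    # The PTR name must sit under a Google-controlled domain.
    if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
        return False
    try:
        # Forward lookup of that name must resolve back to the original IP.
        addrs = {info[4][0] for info in socket.getaddrinfo(host, None)}
    except OSError:
        return False
    return ip in addrs
```

Cache positive results; doing two DNS lookups on every request adds latency and load on your resolver.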
robots.txt (RFC 9309) is advisory—compliant crawlers honor it; abusive ones ignore it—so rely on technical enforcement for assets you must protect.
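On the compliant-crawler side, evaluating robots.txt rules is a few lines with Python's stdlib parser; the rules below are a made-up example:

```python
from urllib import robotparser

# Parse rules directly; a real crawler would fetch /robots.txt first.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
])

rp.can_fetch("*", "https://example.com/private/report")  # False
rp.can_fetch("*", "https://example.com/public/page")     # True
```

This is purely cooperative: it tells well-behaved crawlers what to skip, and does nothing against the abusive ones the surrounding controls exist for.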
IP-centric limits
Shared CGNAT and corporate egress mean an IP alone is a noisy signal; combine it with a session, API key, or device attestation where possible. For IP reputation context, check how IPs present externally.
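One way to combine those signals is to key rate limits on the strongest identifier available and fall back to the IP with a coarser budget. The identifiers and per-minute thresholds here are illustrative assumptions:

```python
from typing import Optional

def rate_limit_key(ip: str, api_key: Optional[str],
                   session_id: Optional[str]) -> tuple[str, int]:
    """Pick a rate-limit bucket key and its requests-per-minute budget.

    Prefers stable per-client identifiers over the (possibly shared) IP.
    """
    if api_key:
        # Authenticated API traffic: generous per-key limit.
        return f"key:{api_key}", 600
    if session_id:
        # Logged-in web session: moderate per-session limit.
        return f"session:{session_id}", 300
    # Anonymous traffic: the IP may hide many users behind CGNAT,
    # so keep the limit coarse but nonzero.
    return f"ip:{ip}", 60
```

A request then consumes from the bucket named by the returned key, so one abusive session behind a shared IP does not exhaust the budget of everyone else behind it.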