Web Scraping for Lead Lists
Purpose
This recipe produces a raw lead list of 200-2,000 contacts scraped from public web sources (business directories, conference attendee pages, G2 review profiles, and GitHub contributor lists), with full legal-compliance documentation. The output must pass through an enrichment pipeline before any outreach.
Prerequisites
- ICP definition from the persona-builder recipe
- Python 3.10+ with requests, beautifulsoup4, pandas
- GitHub token (if scraping contributors) — github.com/settings/tokens
- Legal review of legitimate interest basis for target jurisdictions
Constraints
- GDPR: Public visibility does not remove data-protection obligations. Fines reach up to EUR 20 million or 4% of global annual turnover, whichever is higher. [src1]
- CNIL fined KASPR EUR 240,000 for scraping 160M LinkedIn contacts. [src6]
- Robots.txt is increasingly enforced under the GDPR and the EU DSA. [src2]
- CCPA: businesses must honor California residents' deletion requests within 45 days. [src4]
- GitHub API: 5,000 requests/hour authenticated, 60/hour unauthenticated. [src5]
Tool Selection Decision
| Path | Tools | Cost | Speed | Output Quality |
|---|---|---|---|---|
| A: No-Code | Browse AI / Octoparse | $0-$39/mo | 1 hr/100 leads | Basic |
| B: Python Simple | BeautifulSoup + requests | $0 | 1 hr/200 leads | Good |
| C: Scrapy + API | Scrapy + GitHub API | $0 | 30 min/500 leads | High |
| D: Apify | Apify marketplace | $49/mo | 15 min/1000 leads | Highest |
Execution Flow
Step 1: Legal Compliance Check
Duration: 15 min | Tool: Browser + Python
Check robots.txt for each target source. Document legitimate interest basis. Record which paths are disallowed. [src2]
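The robots.txt check can be sketched with the standard library's urllib.robotparser: fetch each source's /robots.txt once, parse the body, and log the verdict per target URL (robots_allows is an illustrative helper name, not a library function):

```python
from urllib.robotparser import RobotFileParser

def robots_allows(robots_txt: str, url: str, user_agent: str = "lead-scraper") -> bool:
    """Parse a robots.txt body and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Record each (url, allowed) pair in the compliance log and skip anything disallowed.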
Step 2: Scrape GitHub Contributors
Duration: 15-30 min | Tool: GitHub REST API
Extract contributor profiles from target repositories. Get name, company, email (if public), bio, location, GitHub URL. Rate limit: 5,000 req/hr. [src5]
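The GitHub calls can be sketched with requests against the documented REST endpoints; fetch_contributors and profile_to_lead are illustrative helper names, and the output keys mirror this recipe's schema:

```python
import requests

API = "https://api.github.com"

def fetch_contributors(owner: str, repo: str, token: str, per_page: int = 100):
    """Yield contributor login names for one repository, page by page.
    Authenticated requests get 5,000/hour; stay well under that."""
    headers = {"Authorization": f"Bearer {token}",
               "Accept": "application/vnd.github+json"}
    page = 1
    while True:
        resp = requests.get(f"{API}/repos/{owner}/{repo}/contributors",
                            headers=headers,
                            params={"per_page": per_page, "page": page},
                            timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        yield from (c["login"] for c in batch)
        page += 1

def profile_to_lead(profile: dict) -> dict:
    """Map a GET /users/{login} response onto this recipe's output fields.
    Email and company are often null; downstream enrichment fills the gaps."""
    return {
        "name": profile.get("name") or profile.get("login", ""),
        "company": (profile.get("company") or "").lstrip("@").strip(),
        "email": profile.get("email") or "",
        "location": profile.get("location") or "",
        "source_url": profile.get("html_url", ""),
        "data_source": "github",
    }
```

Each login then needs one GET /users/{login} call for the full profile, so budget roughly one request per lead plus one per contributor page.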
Step 3: Scrape Directory Listings
Duration: 30-60 min | Tool: Python + BeautifulSoup
Parse business directory pages with custom CSS selectors. Extract company names, locations, and profile URLs. Handle anti-scraping protections with delays and User-Agent rotation.
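A minimal sketch of the directory scraper, assuming hypothetical CSS selectors and User-Agent strings (inspect the real directory to set card_sel, name_sel, and loc_sel):

```python
import random
import time

import requests
from bs4 import BeautifulSoup

# Placeholder desktop User-Agent strings rotated between requests.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_get(url: str, min_delay: float = 2.0, max_delay: float = 5.0) -> str:
    """Fetch a page with a randomized delay and a rotated User-Agent."""
    time.sleep(random.uniform(min_delay, max_delay))
    resp = requests.get(url, headers={"User-Agent": random.choice(USER_AGENTS)},
                        timeout=30)
    resp.raise_for_status()
    return resp.text

def parse_listings(html: str, card_sel: str = "div.listing",
                   name_sel: str = ".company-name", loc_sel: str = ".location"):
    """Pull company name, location, and profile URL from each directory card.
    The selector defaults are placeholders for whatever the real site uses."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select(card_sel):
        name = card.select_one(name_sel)
        loc = card.select_one(loc_sel)
        link = card.select_one("a[href]")
        rows.append({
            "company": name.get_text(strip=True) if name else "",
            "location": loc.get_text(strip=True) if loc else "",
            "source_url": link["href"] if link else "",
        })
    return rows
```

Keeping the fetch and the parse separate lets you re-run selector fixes against cached HTML without re-hitting the site.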
Step 4: Merge, Deduplicate, and Tag Sources
Duration: 10-15 min | Tool: Python (pandas)
Combine all sources, deduplicate by name+company, remove entries without names, tag each record with data source and consent basis.
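The merge step reduces to a few pandas operations; merge_and_dedupe is an illustrative helper, and the consent_basis default is an assumed label matching this recipe's legitimate-interest framing:

```python
import pandas as pd

def merge_and_dedupe(frames: list[pd.DataFrame], sources: list[str],
                     consent_basis: str = "legitimate_interest_b2b") -> pd.DataFrame:
    """Concatenate per-source frames, tag provenance, drop nameless rows,
    and deduplicate case-insensitively on (name, company)."""
    tagged = []
    for df, source in zip(frames, sources):
        df = df.copy()
        df["data_source"] = source           # which scraper produced the row
        df["consent_basis"] = consent_basis  # legal basis recorded per record
        tagged.append(df)
    merged = pd.concat(tagged, ignore_index=True)
    merged = merged[merged["name"].fillna("").str.strip() != ""]
    key = (merged["name"].str.lower().str.strip() + "|"
           + merged["company"].fillna("").str.lower().str.strip())
    return merged.loc[~key.duplicated(keep="first")].reset_index(drop=True)
```

keep="first" means source order is priority order, so pass the richest source first.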
Step 5: Generate Compliance Report
Duration: 5 min
Create JSON compliance report documenting robots.txt checks, legal basis, suppression list status, and source details.
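The report can be assembled from plain dicts and serialized with the standard library's json module; the field names here are an assumption matching the items listed above:

```python
import json
from datetime import datetime, timezone

def build_compliance_report(sources: list[dict],
                            suppression_checked: bool = True) -> dict:
    """Assemble the per-run compliance report: timestamp, legal basis,
    suppression-list status, and per-source robots.txt verdicts."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "legal_basis": "legitimate_interest_b2b",  # assumed label; confirm with legal
        "suppression_list_checked": suppression_checked,
        "sources": [
            {
                "url": s["url"],
                "robots_txt_allowed": s["allowed"],
                "records_collected": s["count"],
            }
            for s in sources
        ],
    }
```

Write the result next to the CSV (e.g. json.dump to compliance_report.json) so the two artifacts travel together.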
Output Schema
CSV with columns: name, company, title, email, location, source_url, data_source, consent_basis. Expected 200-2,000 rows, deduplicated on name+company.
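Writing the CSV in a fixed column order keeps the downstream enrichment pipeline stable; a minimal sketch using the standard library's csv module (write_leads is an illustrative name):

```python
import csv

COLUMNS = ["name", "company", "title", "email", "location",
           "source_url", "data_source", "consent_basis"]

def write_leads(path: str, leads: list[dict]) -> None:
    """Write lead dicts in the schema's column order; missing fields
    become empty cells and unexpected keys are dropped."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS,
                                restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(leads)
```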
Quality Benchmarks
| Metric | Minimum | Good | Excellent |
|---|---|---|---|
| Name completeness | > 80% | > 90% | > 98% |
| Company available | > 50% | > 70% | > 85% |
| Email (pre-enrichment) | > 10% | > 25% | > 40% |
| Compliance docs | Complete | Complete | Complete + legal review |
| Duplicate rate | < 15% | < 8% | < 3% |
Error Handling
| Error | Cause | Recovery |
|---|---|---|
| 403 Forbidden | Anti-scraping protection | Add delays, rotate User-Agent, use Apify |
| 429 Rate Limited | Too many requests | Wait for retry-after, reduce frequency |
| CAPTCHA | Bot detection | Switch to manual or API alternative |
| Empty results | CSS selectors outdated | Inspect page, update selectors |
| Legal cease-and-desist | ToS/GDPR violation | Stop immediately, delete data, consult legal |
Cost Breakdown
| Component | Free Tier | Paid Tier | At Scale |
|---|---|---|---|
| Python + BS4 | $0 | $0 | $0 |
| GitHub API | $0 | $0 | $0 |
| Apify (optional) | $5/mo | $49/mo | $499/mo |
| Proxy (optional) | $0 | $30/mo | $100+/mo |
| Total/500 leads | $0 | $79/mo | $599+/mo |
Anti-Patterns
Wrong: Scraping without legal basis documentation
CNIL fined KASPR EUR 240,000 for assuming public data is free to collect. [src6]
Correct: Document legitimate interest before scraping
Create compliance report per source. Only scrape B2B professional data. Maintain suppression lists. [src1]
Wrong: Ignoring robots.txt and site ToS
Robots.txt is increasingly treated as enforceable under the GDPR and the EU DSA. [src2]
Correct: Check robots.txt programmatically before scraping
Log compliance for each URL. Skip disallowed paths.
When This Matters
Use when the agent needs to build a lead list from public web sources without paying for data providers. Best for small batches from niche sources like conference attendees, open-source contributors, or industry directories. Always pair with enrichment pipeline.