Web Scraping for Lead Lists
Purpose
This recipe produces a raw lead list of 200-2,000 contacts scraped from public web sources (business directories, conference attendee pages, G2 review profiles, and GitHub contributor lists), with full legal-compliance documentation. The output must pass through an enrichment pipeline before any outreach.
Prerequisites
- ICP definition from the persona-builder recipe
- Python 3.10+ with requests, beautifulsoup4, pandas
- GitHub token (if scraping contributors) — github.com/settings/tokens
- Legal review of legitimate interest basis for target jurisdictions
Constraints
- GDPR: Public visibility does not remove data-protection obligations. Fines reach up to EUR 20 million or 4% of global annual turnover, whichever is higher. [src1]
- CNIL fined KASPR EUR 240,000 for scraping 160M LinkedIn contacts. [src6]
- Robots.txt is increasingly enforced under the GDPR and the EU DSA. [src2]
- CCPA: businesses must honor California residents' deletion requests within 45 days. [src4]
- GitHub API: 5,000 requests/hour authenticated, 60/hour unauthenticated. [src5]
Tool Selection Decision
| Path | Tools | Cost | Speed | Output Quality |
|---|---|---|---|---|
| A: No-Code | Browse AI / Octoparse | $0-$39/mo | 1 hr/100 leads | Basic |
| B: Python Simple | BeautifulSoup + requests | $0 | 1 hr/200 leads | Good |
| C: Scrapy + API | Scrapy + GitHub API | $0 | 30 min/500 leads | High |
| D: Apify | Apify marketplace | $49/mo | 15 min/1000 leads | Highest |
Execution Flow
Step 1: Legal Compliance Check
Duration: 15 min | Tool: Browser + Python
Check robots.txt for each target source. Document legitimate interest basis. Record which paths are disallowed. [src2]
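The robots.txt check can be sketched with the standard library's urllib.robotparser: fetch each source's /robots.txt once, parse the body, and log the verdict per target URL (robots_allows is an illustrative helper name, not a library function):

```python
from urllib.robotparser import RobotFileParser

def robots_allows(robots_txt: str, url: str, user_agent: str = "lead-scraper") -> bool:
    """Parse a robots.txt body and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Record each (url, allowed) pair in the compliance log and skip anything disallowed.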
Step 2: Scrape GitHub Contributors
Duration: 15-30 min | Tool: GitHub REST API
Extract contributor profiles from target repositories. Get name, company, email (if public), bio, location, GitHub URL. Rate limit: 5,000 req/hr. [src5]
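The GitHub calls can be sketched with requests against the documented REST endpoints; fetch_contributors and profile_to_lead are illustrative helper names, and the output keys mirror this recipe's schema:

```python
import requests

API = "https://api.github.com"

def fetch_contributors(owner: str, repo: str, token: str, per_page: int = 100):
    """Yield contributor login names for one repository, page by page.
    Authenticated requests get 5,000/hour; stay well under that."""
    headers = {"Authorization": f"Bearer {token}",
               "Accept": "application/vnd.github+json"}
    page = 1
    while True:
        resp = requests.get(f"{API}/repos/{owner}/{repo}/contributors",
                            headers=headers,
                            params={"per_page": per_page, "page": page},
                            timeout=30)
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        yield from (c["login"] for c in batch)
        page += 1

def profile_to_lead(profile: dict) -> dict:
    """Map a GET /users/{login} response onto this recipe's output fields.
    Email and company are often null; downstream enrichment fills the gaps."""
    return {
        "name": profile.get("name") or profile.get("login", ""),
        "company": (profile.get("company") or "").lstrip("@").strip(),
        "email": profile.get("email") or "",
        "location": profile.get("location") or "",
        "source_url": profile.get("html_url", ""),
        "data_source": "github",
    }
```

Each login then needs one GET /users/{login} call for the full profile, so budget roughly one request per lead plus one per contributor page.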
Step 3: Scrape Directory Listings
Duration: 30-60 min | Tool: Python + BeautifulSoup
Parse business directory pages with custom CSS selectors. Extract company names, locations, and profile URLs. Handle anti-scraping protections with delays and User-Agent rotation.
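A minimal sketch of the directory scraper, assuming hypothetical CSS selectors and User-Agent strings (inspect the real directory to set card_sel, name_sel, and loc_sel):

```python
import random
import time

import requests
from bs4 import BeautifulSoup

# Placeholder desktop User-Agent strings rotated between requests.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def polite_get(url: str, min_delay: float = 2.0, max_delay: float = 5.0) -> str:
    """Fetch a page with a randomized delay and a rotated User-Agent."""
    time.sleep(random.uniform(min_delay, max_delay))
    resp = requests.get(url, headers={"User-Agent": random.choice(USER_AGENTS)},
                        timeout=30)
    resp.raise_for_status()
    return resp.text

def parse_listings(html: str, card_sel: str = "div.listing",
                   name_sel: str = ".company-name", loc_sel: str = ".location"):
    """Pull company name, location, and profile URL from each directory card.
    The selector defaults are placeholders for whatever the real site uses."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select(card_sel):
        name = card.select_one(name_sel)
        loc = card.select_one(loc_sel)
        link = card.select_one("a[href]")
        rows.append({
            "company": name.get_text(strip=True) if name else "",
            "location": loc.get_text(strip=True) if loc else "",
            "source_url": link["href"] if link else "",
        })
    return rows
```

Keeping the fetch and the parse separate lets you re-run selector fixes against cached HTML without re-hitting the site.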
Step 4: Merge, Deduplicate, and Tag Sources
Duration: 10-15 min | Tool: Python (pandas)
Combine all sources, deduplicate by name+company, remove entries without names, tag each record with data source and consent basis.
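The merge step reduces to a few pandas operations; merge_and_dedupe is an illustrative helper, and the consent_basis default is an assumed label matching this recipe's legitimate-interest framing:

```python
import pandas as pd

def merge_and_dedupe(frames: list[pd.DataFrame], sources: list[str],
                     consent_basis: str = "legitimate_interest_b2b") -> pd.DataFrame:
    """Concatenate per-source frames, tag provenance, drop nameless rows,
    and deduplicate case-insensitively on (name, company)."""
    tagged = []
    for df, source in zip(frames, sources):
        df = df.copy()
        df["data_source"] = source           # which scraper produced the row
        df["consent_basis"] = consent_basis  # legal basis recorded per record
        tagged.append(df)
    merged = pd.concat(tagged, ignore_index=True)
    merged = merged[merged["name"].fillna("").str.strip() != ""]
    key = (merged["name"].str.lower().str.strip() + "|"
           + merged["company"].fillna("").str.lower().str.strip())
    return merged.loc[~key.duplicated(keep="first")].reset_index(drop=True)
```

keep="first" means source order is priority order, so pass the richest source first.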
Step 5: Generate Compliance Report
Duration: 5 min
Create JSON compliance report documenting robots.txt checks, legal basis, suppression list status, and source details.
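The report can be assembled from plain dicts and serialized with the standard library's json module; the field names here are an assumption matching the items listed above:

```python
import json
from datetime import datetime, timezone

def build_compliance_report(sources: list[dict],
                            suppression_checked: bool = True) -> dict:
    """Assemble the per-run compliance report: timestamp, legal basis,
    suppression-list status, and per-source robots.txt verdicts."""
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "legal_basis": "legitimate_interest_b2b",  # assumed label; confirm with legal
        "suppression_list_checked": suppression_checked,
        "sources": [
            {
                "url": s["url"],
                "robots_txt_allowed": s["allowed"],
                "records_collected": s["count"],
            }
            for s in sources
        ],
    }
```

Write the result next to the CSV (e.g. json.dump to compliance_report.json) so the two artifacts travel together.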
Output Schema
CSV with columns: name, company, title, email, location, source_url, data_source, consent_basis. Expected 200-2,000 rows, deduplicated on name+company.
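Writing the CSV in a fixed column order keeps the downstream enrichment pipeline stable; a minimal sketch using the standard library's csv module (write_leads is an illustrative name):

```python
import csv

COLUMNS = ["name", "company", "title", "email", "location",
           "source_url", "data_source", "consent_basis"]

def write_leads(path: str, leads: list[dict]) -> None:
    """Write lead dicts in the schema's column order; missing fields
    become empty cells and unexpected keys are dropped."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS,
                                restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(leads)
```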
Quality Benchmarks
| Metric | Minimum | Good | Excellent |
|---|---|---|---|
| Name completeness | > 80% | > 90% | > 98% |
| Company available | > 50% | > 70% | > 85% |
| Email (pre-enrichment) | > 10% | > 25% | > 40% |
| Compliance docs | Complete | Complete | Complete + legal review |
| Duplicate rate | < 15% | < 8% | < 3% |
Error Handling
| Error | Cause | Recovery |
|---|---|---|
| 403 Forbidden | Anti-scraping protection | Add delays, rotate User-Agent, use Apify |
| 429 Rate Limited | Too many requests | Wait for retry-after, reduce frequency |
| CAPTCHA | Bot detection | Switch to manual or API alternative |
| Empty results | CSS selectors outdated | Inspect page, update selectors |
| Legal cease-and-desist | ToS/GDPR violation | Stop immediately, delete data, consult legal |
Cost Breakdown
| Component | Free Tier | Paid Tier | At Scale |
|---|---|---|---|
| Python + BS4 | $0 | $0 | $0 |
| GitHub API | $0 | $0 | $0 |
| Apify (optional) | $5/mo | $49/mo | $499/mo |
| Proxy (optional) | $0 | $30/mo | $100+/mo |
| Total/500 leads | $0 | $79/mo | $599+/mo |
Anti-Patterns
Wrong: Scraping without legal basis documentation
CNIL fined KASPR EUR 240,000 for assuming public data is free to collect. [src6]
Correct: Document legitimate interest before scraping
Create compliance report per source. Only scrape B2B professional data. Maintain suppression lists. [src1]
Wrong: Ignoring robots.txt and site ToS
Robots.txt is increasingly treated as enforceable under the GDPR and the EU DSA. [src2]
Correct: Check robots.txt programmatically before scraping
Log compliance for each URL. Skip disallowed paths.
When This Matters
Use when the agent needs to build a lead list from public web sources without paying for data providers. Best for small batches from niche sources like conference attendees, open-source contributors, or industry directories. Always pair with enrichment pipeline.