Web Scraping for Lead Lists

Type: Execution Recipe | Confidence: 0.86 | Sources: 6 | Verified: 2026-03-11

Purpose

This recipe produces a raw lead list of 200-2,000 contacts scraped from public web sources — business directories, conference attendee pages, G2 review profiles, and GitHub contributor lists — with full legal compliance documentation. Output requires enrichment pipeline processing before outreach.

Prerequisites

Constraints

Tool Selection Decision

| Path | Tools | Cost | Speed | Output Quality |
|---|---|---|---|---|
| A: No-Code | Browse AI / Octoparse | $0-$39/mo | 1 hr / 100 leads | Basic |
| B: Python Simple | BeautifulSoup + requests | $0 | 1 hr / 200 leads | Good |
| C: Scrapy + API | Scrapy + GitHub API | $0 | 30 min / 500 leads | High |
| D: Apify | Apify marketplace | $49/mo | 15 min / 1,000 leads | Highest |

Execution Flow

Step 1: Legal Compliance Check

Duration: 15 min | Tool: Browser + Python

Check robots.txt for each target source. Document legitimate interest basis. Record which paths are disallowed. [src2]
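The stdlib `urllib.robotparser` can make this check programmatic so each allow/deny decision is logged before any request goes out. A minimal sketch; the robots.txt body and paths below are hypothetical examples:

```python
from urllib.robotparser import RobotFileParser

def check_robots(robots_txt: str, paths: list[str], agent: str = "*") -> dict:
    """Parse a robots.txt body and report which paths are allowed.

    Returns {path: allowed} so each decision can be recorded in the
    compliance report before any page is fetched.
    """
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {p: rp.can_fetch(agent, p) for p in paths}

# Hypothetical robots.txt body for a directory site
robots = """User-agent: *
Disallow: /private/
Allow: /companies/
"""
decisions = check_robots(robots, ["/companies/acme", "/private/admin"])
```

In practice the body comes from fetching `https://<host>/robots.txt` once per source; the resulting dict feeds directly into the Step 5 report.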

Step 2: Scrape GitHub Contributors

Duration: 15-30 min | Tool: GitHub REST API

Extract contributor profiles from target repositories. Get name, company, email (if public), bio, location, GitHub URL. Rate limit: 5,000 req/hr with an authenticated token (60 req/hr anonymous). [src5]
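A stdlib-only sketch of this step, assuming the standard GitHub REST endpoints (`/repos/{owner}/{repo}/contributors` plus each contributor's user record); the field mapping matches the output schema below:

```python
import json
import urllib.request

API = "https://api.github.com"

def profile_fields(user: dict) -> dict:
    # Keep only the lead-list fields from a GitHub user object.
    return {
        "name": user.get("name"),
        "company": user.get("company"),
        "email": user.get("email"),      # present only if the user made it public
        "location": user.get("location"),
        "github_url": user.get("html_url"),
    }

def fetch_contributors(owner: str, repo: str, token: str = ""):
    # Authenticated requests get 5,000 req/hr; anonymous only 60.
    def get(url: str) -> dict:
        req = urllib.request.Request(url, headers={
            "Accept": "application/vnd.github+json",
            **({"Authorization": f"Bearer {token}"} if token else {}),
        })
        with urllib.request.urlopen(req, timeout=30) as resp:
            return json.load(resp)

    for c in get(f"{API}/repos/{owner}/{repo}/contributors"):
        yield profile_fields(get(c["url"]))  # c["url"] points at the full user record
```

The contributors endpoint returns only login and URL; the second request per contributor is what recovers name, company, and public email.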

Step 3: Scrape Directory Listings

Duration: 30-60 min | Tool: Python + BeautifulSoup

Parse business directory pages with custom CSS selectors. Extract company names, locations, and profile URLs. Handle anti-scraping protections with delays and User-Agent rotation.
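A sketch of the parsing and politeness pieces using BeautifulSoup. The CSS selectors (`div.listing`, `a.company-name`, `span.location`) are hypothetical and must be adapted to the target site's markup:

```python
import random
import time
from bs4 import BeautifulSoup

# Rotating among a few real browser UA strings lowers block rates.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def parse_listings(html: str) -> list[dict]:
    """Extract company name, location, and profile URL from one directory page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.listing"):          # hypothetical selector
        link = card.select_one("a.company-name")     # hypothetical selector
        loc = card.select_one("span.location")       # hypothetical selector
        if link:
            rows.append({
                "company": link.get_text(strip=True),
                "location": loc.get_text(strip=True) if loc else None,
                "source_url": link.get("href"),
            })
    return rows

def polite_headers() -> dict:
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_delay() -> None:
    time.sleep(random.uniform(2, 5))  # 2-5 s between page fetches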

Step 4: Merge, Deduplicate, and Tag Sources

Duration: 10-15 min | Tool: Python (pandas)

Combine all sources, deduplicate by name+company, remove entries without names, tag each record with data source and consent basis.
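The merge step above can be sketched with pandas; column names follow the output schema, and each input frame is assumed to already carry its `data_source` and `consent_basis` tags from scrape time:

```python
import pandas as pd

def merge_and_dedupe(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Combine per-source frames, drop nameless rows, dedupe on name+company."""
    df = pd.concat(frames, ignore_index=True)
    df = df.dropna(subset=["name"])
    # Case-insensitive key so "Acme" and "ACME " collapse into one record
    key = (df["name"].str.lower().str.strip() + "|"
           + df["company"].fillna("").str.lower().str.strip())
    return df.loc[~key.duplicated()].reset_index(drop=True)
```

Keeping the first occurrence per key preserves whichever source was scraped first; sort by source priority before deduping if one source's fields are more trustworthy.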

Step 5: Generate Compliance Report

Duration: 5 min

Create JSON compliance report documenting robots.txt checks, legal basis, suppression list status, and source details.
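A minimal shape for that report, assuming each source dict carries the robots.txt decision and legal basis recorded in Step 1 (field names here are illustrative, not a fixed spec):

```python
import json
from datetime import datetime, timezone

def compliance_report(sources: list[dict], suppression_checked: bool) -> str:
    """Serialize the compliance evidence gathered in Steps 1-4 as JSON."""
    report = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "legal_basis": "legitimate_interest (B2B professional data)",
        "suppression_list_checked": suppression_checked,
        "sources": sources,  # one entry per target, incl. robots.txt decision
    }
    return json.dumps(report, indent=2)
```

Store the report alongside the lead CSV so the two artifacts travel together.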

Output Schema

CSV with columns: name, company, title, email, location, source_url, data_source, consent_basis. Expected 200-2,000 rows, deduplicated on name+company.
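A sketch of emitting that schema with the stdlib `csv` module; `extrasaction="ignore"` drops any scraper-specific extra fields, and missing fields stay empty:

```python
import csv
import io

COLUMNS = ["name", "company", "title", "email", "location",
           "source_url", "data_source", "consent_basis"]

def write_leads(rows: list[dict]) -> str:
    """Render lead records into the output schema as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```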

Quality Benchmarks

| Metric | Minimum | Good | Excellent |
|---|---|---|---|
| Name completeness | > 80% | > 90% | > 98% |
| Company available | > 50% | > 70% | > 85% |
| Email (pre-enrichment) | > 10% | > 25% | > 40% |
| Compliance docs | Complete | Complete | Complete + legal |
| Duplicate rate | < 15% | < 8% | < 3% |

Error Handling

| Error | Cause | Recovery |
|---|---|---|
| 403 Forbidden | Anti-scraping protection | Add delays, rotate User-Agent, use Apify |
| 429 Rate Limited | Too many requests | Wait for Retry-After, reduce frequency |
| CAPTCHA | Bot detection | Switch to manual or API alternative |
| Empty results | CSS selectors outdated | Inspect page, update selectors |
| Legal cease-and-desist | ToS/GDPR violation | Stop immediately, delete data, consult legal |
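The 429 row can be handled automatically by honoring the `Retry-After` header and backing off exponentially when it is absent; a stdlib sketch (function names are illustrative):

```python
import time
import urllib.error
import urllib.request

def retry_wait(headers, fallback: float) -> float:
    # Honor Retry-After when the server sends one; otherwise use our backoff.
    return float(headers.get("Retry-After", fallback))

def fetch_with_backoff(url: str, max_tries: int = 4) -> bytes:
    delay = 5.0
    for attempt in range(max_tries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_tries - 1:
                raise  # 403/CAPTCHA call for a strategy change, not retries
            time.sleep(retry_wait(err.headers, delay))
            delay *= 2
    raise RuntimeError("unreachable")
```

Only 429 is retried; a 403 or CAPTCHA is a signal to change approach per the table, not to hammer the endpoint.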

Cost Breakdown

| Component | Free Tier | Paid Tier | At Scale |
|---|---|---|---|
| Python + BS4 | $0 | $0 | $0 |
| GitHub API | $0 | $0 | $0 |
| Apify (optional) | $5/mo | $49/mo | $499/mo |
| Proxy (optional) | $0 | $30/mo | $100+/mo |
| Total / 500 leads | $0 | $49/mo | $599/mo |

Anti-Patterns

Wrong: Scraping without legal basis documentation

CNIL fined KASPR EUR 240,000 for assuming public data is free to collect. [src6]

Correct: Document legitimate interest before scraping

Create compliance report per source. Only scrape B2B professional data. Maintain suppression lists. [src1]

Wrong: Ignoring robots.txt and site ToS

Robots.txt is increasingly enforced as a contract under GDPR and DSA. [src2]

Correct: Check robots.txt programmatically before scraping

Log compliance for each URL. Skip disallowed paths.

When This Matters

Use when the agent needs to build a lead list from public web sources without paying for data providers. Best for small batches from niche sources like conference attendees, open-source contributors, or industry directories. Always pair with enrichment pipeline.

Related Units