This recipe produces a raw lead list of 200-2,000 contacts scraped from public web sources (business directories, conference attendee pages, G2 review profiles, and GitHub contributor lists), with full legal-compliance documentation. The output requires enrichment-pipeline processing before outreach.
| Path | Tools | Cost | Speed | Output Quality |
|---|---|---|---|---|
| A: No-Code | Browse AI / Octoparse | $0-$39/mo | 1 hr/100 leads | Basic |
| B: Python Simple | BeautifulSoup + requests | $0 | 1 hr/200 leads | Good |
| C: Scrapy + API | Scrapy + GitHub API | $0 | 30 min/500 leads | High |
| D: Apify | Apify marketplace | $49/mo | 15 min/1000 leads | Highest |
Duration: 15 min | Tool: Browser + Python
Check robots.txt for each target source. Document legitimate interest basis. Record which paths are disallowed. [src2]
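The check can be done with the standard library's `urllib.robotparser`; the robots.txt content and paths below are illustrative, not from a real site:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, path: str, ua: str = "*") -> bool:
    """Return True if `ua` may fetch `path` under the given robots.txt text."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(ua, path)

# Illustrative robots.txt content
ROBOTS = """User-agent: *
Disallow: /private/
Allow: /directory/
"""

# Record both the path and the decision for the compliance log
audit = {"path": "/directory/companies",
         "allowed": allowed(ROBOTS, "/directory/companies")}
```

In the live pipeline, fetch each site's `/robots.txt`, run every candidate path through `allowed()`, and persist the `audit` records for the compliance report.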
Duration: 15-30 min | Tool: GitHub REST API
Extract contributor profiles from target repositories: name, company, email (if public), bio, location, and GitHub URL. Rate limit: 5,000 req/hr authenticated (60/hr unauthenticated). [src5]
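A sketch of the two-call pattern (list contributors, then fetch each full profile); the schema mapping in `to_lead` is an assumption matching the CSV columns later in this recipe:

```python
import requests

API = "https://api.github.com"

def to_lead(user: dict) -> dict:
    """Map a GitHub user profile to this recipe's lead schema (assumed columns)."""
    return {
        "name": user.get("name"),
        "company": user.get("company"),
        "email": user.get("email"),       # present only if the user made it public
        "location": user.get("location"),
        "bio": user.get("bio"),
        "source_url": user.get("html_url"),
        "data_source": "github",
        "consent_basis": "legitimate_interest",
    }

def fetch_contributor_leads(owner: str, repo: str, token: str) -> list[dict]:
    """List contributors, then fetch each full profile (two API calls per lead)."""
    headers = {"Accept": "application/vnd.github+json",
               "Authorization": f"Bearer {token}"}
    resp = requests.get(f"{API}/repos/{owner}/{repo}/contributors",
                        headers=headers, params={"per_page": 100}, timeout=30)
    resp.raise_for_status()
    leads = []
    for c in resp.json():
        user = requests.get(c["url"], headers=headers, timeout=30).json()
        leads.append(to_lead(user))
    return leads
```

Each lead costs one extra profile call, so 100 contributors consume ~101 requests against the 5,000 req/hr budget.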
Duration: 30-60 min | Tool: Python + BeautifulSoup
Parse business directory pages with custom CSS selectors. Extract company names, locations, and profile URLs. Handle anti-scraping protections with delays and User-Agent rotation.
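A minimal sketch of the parse-plus-politeness pattern; the CSS selectors and User-Agent strings are placeholders for a hypothetical directory, so inspect the real page and adjust:

```python
import random
import time

import requests
from bs4 import BeautifulSoup

# Small pool of desktop User-Agent strings to rotate through (illustrative)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]

def parse_listings(html: str) -> list[dict]:
    """Extract company, location, and profile URL from one directory page.

    Selectors below are assumptions for a hypothetical directory layout.
    """
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.listing"):
        name = card.select_one("h3.company-name")
        loc = card.select_one("span.location")
        link = card.select_one("a.profile-link")
        rows.append({
            "company": name.get_text(strip=True) if name else None,
            "location": loc.get_text(strip=True) if loc else None,
            "source_url": link["href"] if link else None,
        })
    return rows

def fetch_page(url: str) -> str:
    """GET with a rotated User-Agent and a polite random delay."""
    resp = requests.get(url,
                        headers={"User-Agent": random.choice(USER_AGENTS)},
                        timeout=30)
    resp.raise_for_status()
    time.sleep(random.uniform(2, 5))  # delay between requests to avoid 403/429
    return resp.text
```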
Duration: 10-15 min | Tool: Python (pandas)
Combine all sources, deduplicate by name+company, remove entries without names, tag each record with data source and consent basis.
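The merge step can be sketched with pandas; the sample frames are illustrative, and the normalization (strip + lowercase before deduplication) is an assumption to catch near-duplicates:

```python
import pandas as pd

def merge_sources(frames: list) -> pd.DataFrame:
    """Concatenate per-source frames, drop nameless rows, dedupe on name+company."""
    df = pd.concat(frames, ignore_index=True)
    df = df.dropna(subset=["name"])
    # Normalize before deduplication so "Acme " and "acme" collapse together
    key = (df["name"].str.strip().str.lower() + "|"
           + df["company"].fillna("").str.strip().str.lower())
    return df.loc[~key.duplicated()].reset_index(drop=True)

# Illustrative per-source frames, already tagged at collection time
github = pd.DataFrame({
    "name": ["Ada Lovelace", None],
    "company": ["Acme", "Beta"],
    "data_source": ["github", "github"],
    "consent_basis": ["legitimate_interest"] * 2,
})
directory = pd.DataFrame({
    "name": ["ada lovelace ", "Grace Hopper"],
    "company": ["ACME", "Navy"],
    "data_source": ["directory", "directory"],
    "consent_basis": ["legitimate_interest"] * 2,
})
merged = merge_sources([github, directory])
```

Tagging `data_source` and `consent_basis` at collection time, rather than after the merge, keeps each record traceable to its robots.txt check.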
Duration: 5 min
Create JSON compliance report documenting robots.txt checks, legal basis, suppression list status, and source details.
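A sketch of the report structure; the field names and figures are illustrative, so adapt them to whatever your legal review requires:

```python
import json
from datetime import datetime, timezone

# Illustrative report structure (field names are assumptions, not a standard)
report = {
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "legal_basis": "legitimate_interest",  # GDPR Art. 6(1)(f) for B2B contact data
    "suppression_list_applied": True,
    "sources": [
        {
            "name": "github_api",
            "robots_txt_checked": True,
            "disallowed_paths_skipped": [],
            "records_collected": 450,   # example figure
        },
    ],
}

with open("compliance_report.json", "w") as f:
    json.dump(report, f, indent=2)
```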
CSV with columns: name, company, title, email, location, source_url, data_source, consent_basis. Expected 200-2,000 rows, deduplicated on name+company.
| Metric | Minimum | Good | Excellent |
|---|---|---|---|
| Name completeness | > 80% | > 90% | > 98% |
| Company available | > 50% | > 70% | > 85% |
| Email (pre-enrichment) | > 10% | > 25% | > 40% |
| Compliance docs | Complete | Complete | Complete + legal |
| Duplicate rate | < 15% | < 8% | < 3% |
| Error | Cause | Recovery |
|---|---|---|
| 403 Forbidden | Anti-scraping protection | Add delays, rotate User-Agent, use Apify |
| 429 Rate Limited | Too many requests | Wait for retry-after, reduce frequency |
| CAPTCHA | Bot detection | Switch to manual or API alternative |
| Empty results | CSS selectors outdated | Inspect page, update selectors |
| Legal cease-and-desist | ToS/GDPR violation | Stop immediately, delete data, consult legal |
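The 429 and 403 recovery rows above can be handled with one backoff wrapper; the retry policy here is a sketch, not a universal recipe:

```python
import time

import requests

def retry_wait(headers, attempt: int) -> int:
    """Honor a numeric Retry-After header; otherwise back off exponentially."""
    try:
        return int(headers.get("Retry-After", ""))
    except (TypeError, ValueError):
        return 2 ** attempt

def get_with_backoff(url: str, max_retries: int = 3, **kwargs) -> requests.Response:
    """GET that sleeps and retries on 403/429, raising after max_retries."""
    for attempt in range(max_retries):
        resp = requests.get(url, timeout=30, **kwargs)
        if resp.status_code in (403, 429):
            time.sleep(retry_wait(resp.headers, attempt))
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")
```

Repeated 403s after backoff usually mean active anti-scraping; per the table, switch to Apify or an API alternative rather than escalating the evasion.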
| Component | Free Tier | Paid Tier | At Scale |
|---|---|---|---|
| Python + BS4 | $0 | $0 | $0 |
| GitHub API | $0 | $0 | $0 |
| Apify (optional) | $5/mo | $49/mo | $499/mo |
| Proxy (optional) | $0 | $30/mo | $100+/mo |
| Total/500 leads | $0 | $49/mo | $599/mo |
CNIL fined KASPR EUR 240,000 for assuming public data is free to collect. [src6]
Create compliance report per source. Only scrape B2B professional data. Maintain suppression lists. [src1]
Robots.txt is increasingly enforced as a contract under GDPR and DSA. [src2]
Log compliance for each URL. Skip disallowed paths.
Use when the agent needs to build a lead list from public web sources without paying for data providers. Best for small batches from niche sources such as conference attendees, open-source contributors, or industry directories. Always pair with an enrichment pipeline.