MVP Pipeline Build

Type: Execution Recipe · Confidence: 0.85 · Sources: 5 · Verified: 2026-03-29

Purpose

This recipe builds the minimum viable signal intelligence pipeline: cron-scheduled data ingestion, LLM classification using validated taxonomy, company enrichment, PDF dossier generation, and email delivery with tracking. No UI required. Target: functional in 2-4 weeks with >2x conversion improvement over cold outreach. [src1, src4]

Prerequisites

Constraints

Tool Selection Decision

Which stack?
├── Python + Claude API (recommended)
│   └── PATH A: Best classification accuracy
├── Python + GPT-4 API
│   └── PATH B: Good accuracy, wider ecosystem
├── Python + local LLM
│   └── PATH C: No API costs, lower accuracy
└── Node.js + any LLM API
    └── PATH D: Alternative runtime
| Path | Stack | Monthly Cost | Classification Quality | Setup Complexity |
| --- | --- | --- | --- | --- |
| A: Python + Claude | Python, Claude API, Clearbit | $300-$800 | Excellent | Low |
| B: Python + GPT-4 | Python, GPT-4 API, Clearbit | $300-$800 | Good | Low |
| C: Python + Local | Python, Llama/Mixtral | $100-$500 | Adequate | High |
| D: Node.js | Node.js, any LLM | $300-$800 | Varies | Low |

Execution Flow

Step 1: Data Ingestion Layer (Week 1)

Duration: 3-5 days · Tool: Python + requests/httpx + cron

Build API integrations for 3-5 data sources. Implement rate limiting, retry logic, date-filtered queries. Store raw data as JSON/SQLite. Schedule via cron. Add logging per source. [src2]

Verify: Each source pulls data for 3 consecutive runs. Error handling logs failures without crashing. · If failed: Check auth, rate limits, data format. Test with curl first.
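The ingestion step above can be sketched roughly as follows. The recipe names requests/httpx; this sketch uses only the standard library to stay dependency-free, and the function and table names (`fetch_with_retry`, `store_raw`, `raw_signals`) are illustrative, not from the recipe.

```python
import json
import sqlite3
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, max_retries=4, base_delay=1.0):
    """GET a JSON endpoint, backing off exponentially on 429/5xx."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return json.loads(resp.read())
        except urllib.error.HTTPError as exc:
            if exc.code == 429 or exc.code >= 500:
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
                continue
            raise  # 4xx other than 429: fail fast, likely an auth/query bug
    raise RuntimeError(f"{url}: exhausted {max_retries} retries")

def store_raw(db_path, source, records):
    """Append raw records as JSON rows; keep ingestion dumb and lossless."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_signals "
        "(source TEXT, pulled_at REAL, payload TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_signals VALUES (?, ?, ?)",
        [(source, time.time(), json.dumps(r)) for r in records],
    )
    conn.commit()
    conn.close()
```

A cron entry such as `0 6 * * * python ingest.py` then runs the pull once per day; each source gets its own log line so failures are attributable.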

Step 2: LLM Classification Module (Week 1-2)

Duration: 3-5 days · Tool: Python + LLM API + taxonomy JSON

Load taxonomy, construct classification prompts with taxonomy rules as system context and raw records as input. Parse structured JSON output. Apply composite scoring. Filter above threshold. Log all classifications. Include 2-3 few-shot examples. [src2]

Verify: >70% accuracy on 100 test records. Output parsing >95% success. · If failed: Refine prompt structure and few-shot examples.
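The prompt assembly and output parsing described above might look like this minimal sketch. The exact prompt wording, JSON response shape, and helper names are assumptions for illustration; the taxonomy JSON itself comes from the workshop output.

```python
import json

def build_prompt(taxonomy, record, examples):
    """Assemble a classification prompt: taxonomy rules as system context,
    2-3 few-shot examples, then the raw record to classify."""
    system = (
        "Classify the record into the taxonomy below. "
        'Respond with JSON only: {"category": ..., "confidence": ...}\n'
        + json.dumps(taxonomy)
    )
    shots = "\n".join(
        f"Record: {json.dumps(ex['record'])}\nAnswer: {json.dumps(ex['answer'])}"
        for ex in examples
    )
    user = f"{shots}\nRecord: {json.dumps(record)}\nAnswer:"
    return system, user

def parse_classification(raw_output):
    """Extract the JSON object from the model reply; return None when
    unparseable so the caller can retry or log the failure."""
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end == -1:
        return None
    try:
        return json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return None
```

Parsing by locating the outermost braces tolerates models that wrap the JSON in prose, which is what pushes parse success above the 95% target.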

Step 3: Enrichment Layer (Week 2)

Duration: 2-3 days · Tool: Python + Clearbit/Apollo API

Query enrichment API for classified companies: firmographics, contacts, technology stack. Handle partial data gracefully. Cache results. Track coverage percentage. [src3]

Verify: Enrichment returns data for >60% of companies. Contacts for >50%. · If failed: Add secondary source or manual LinkedIn lookup.
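The caching requirement above can be sketched as a thin SQLite layer in front of the enrichment API, so repeat companies never hit the paid endpoint twice. The class name, table name, and TTL are illustrative assumptions.

```python
import json
import sqlite3
import time

class EnrichmentCache:
    """SQLite-backed cache keyed by company domain."""

    def __init__(self, db_path, ttl_days=30):
        self.conn = sqlite3.connect(db_path)
        self.ttl = ttl_days * 86400
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS enrichment "
            "(domain TEXT PRIMARY KEY, fetched_at REAL, payload TEXT)"
        )

    def get_or_fetch(self, domain, fetch_fn):
        row = self.conn.execute(
            "SELECT fetched_at, payload FROM enrichment WHERE domain = ?",
            (domain,),
        ).fetchone()
        if row and time.time() - row[0] < self.ttl:
            return json.loads(row[1])   # cache hit: no API call
        data = fetch_fn(domain)         # cache miss: call Clearbit/Apollo
        self.conn.execute(
            "INSERT OR REPLACE INTO enrichment VALUES (?, ?, ?)",
            (domain, time.time(), json.dumps(data)),
        )
        self.conn.commit()
        return data
```

Coverage tracking then reduces to counting non-empty payloads against total classified companies per run.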

Step 4: Dossier Generation (Week 2-3)

Duration: 2-3 days · Tool: Python + LLM API + PDF generation

Design 1-2 page PDF template. LLM generates narrative summary with outreach angle. Populate with structured data + narrative. Include signal evidence with source links. [src4]

Verify: Sample reviewed by sales team lead. Readable, clear evidence, actionable. · If failed: Iterate template with sales input.
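One way to structure the template step is to render the dossier as HTML first and convert that to PDF with a tool such as WeasyPrint (an assumption; the recipe does not name a PDF library). The function below is an illustrative sketch of the populate step only.

```python
def render_dossier_html(company, signals, narrative):
    """Assemble the 1-2 page dossier body: company header, LLM-written
    narrative with outreach angle, then signal evidence with source links."""
    evidence = "".join(
        f'<li>{s["summary"]} (<a href="{s["url"]}">source</a>)</li>'
        for s in signals
    )
    return (
        f"<h1>{company['name']}</h1>"
        f"<p>{narrative}</p>"
        f"<h2>Signal evidence</h2><ul>{evidence}</ul>"
    )
```

Keeping the layout in HTML/CSS rather than a programmatic PDF canvas makes the "iterate template with sales input" loop much faster.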

Step 5: Delivery and Tracking (Week 3)

Duration: 2-3 days · Tool: Python + SendGrid/Resend

Configure email templates, batch delivery scheduling, open/click tracking, delivery logging, unsubscribe mechanism. [src4]

Verify: Test emails delivered to 5 recipients. Tracking functional. PDFs render. · If failed: Check SPF/DKIM/DMARC for spam issues.
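The recipe's tools here are SendGrid or Resend; the stdlib sketch below shows the same shape (compose with PDF attachment, deliver a batch, log per-recipient failures) using `smtplib`, which is a deliberate substitution to keep the example self-contained. Names and the SMTP settings are illustrative.

```python
import smtplib
from email.message import EmailMessage

def build_dossier_email(sender, recipient, subject, body, pdf_bytes, pdf_name):
    """Compose one dossier email with the PDF attached."""
    msg = EmailMessage()
    msg["From"], msg["To"], msg["Subject"] = sender, recipient, subject
    msg.set_content(body)
    msg.add_attachment(
        pdf_bytes, maintype="application", subtype="pdf", filename=pdf_name
    )
    return msg

def send_batch(messages, host, port=587, user=None, password=None):
    """Deliver a batch over one SMTP connection; collect failures
    per recipient instead of aborting the whole batch."""
    failures = []
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        if user:
            smtp.login(user, password)
        for msg in messages:
            try:
                smtp.send_message(msg)
            except smtplib.SMTPException as exc:
                failures.append((msg["To"], str(exc)))
    return failures
```

With SendGrid/Resend the open/click tracking comes from the provider; with raw SMTP you would need a tracking pixel or link redirects, which is another reason the recipe favors a provider.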

Step 6: Pipeline Orchestration and Monitoring (Week 3-4)

Duration: 2-3 days · Tool: Cron + logging + alerting

Wire modules together: ingest → classify → enrich → generate → deliver. Add health monitoring, alerting on failures, tracking spreadsheet. [src5]

Verify: Full pipeline runs 3 consecutive days without intervention. Alerts fire on simulated failures. · If failed: Add retry logic, increase timeouts.
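The wiring above amounts to running the stages in order, passing each stage's output to the next, and alerting on the first failure so a broken upstream stage never feeds garbage downstream. A minimal sketch (names and alert mechanism are assumptions):

```python
import logging
import traceback

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_pipeline(stages, alert_fn):
    """Run ingest -> classify -> enrich -> generate -> deliver.

    `stages` is a list of (name, callable) pairs; each callable takes the
    previous stage's output. `alert_fn` is whatever fires your alert
    (email, Slack webhook, etc.).
    """
    data = None
    for name, stage in stages:
        try:
            data = stage(data)
            log.info("stage %s ok", name)
        except Exception as exc:
            alert_fn(f"stage {name} failed: {exc}\n{traceback.format_exc()}")
            return None  # stop: downstream stages need valid input
    return data
```

The per-stage try/except is also the fix for the "silent pipeline failure" row in the error table below: every failure either succeeds loudly or alerts loudly.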

Output Schema

{
  "output_type": "mvp_pipeline",
  "format": "deployed code + documentation",
  "sections": [
    {"name": "ingestion_module", "type": "object", "description": "Data source integrations with cron"},
    {"name": "classification_module", "type": "object", "description": "LLM classification with taxonomy"},
    {"name": "enrichment_module", "type": "object", "description": "Company/contact enrichment"},
    {"name": "dossier_generator", "type": "object", "description": "PDF generation with narrative"},
    {"name": "delivery_module", "type": "object", "description": "Email delivery with tracking"},
    {"name": "monitoring", "type": "object", "description": "Health metrics and alerting"}
  ]
}

Quality Benchmarks

| Quality Metric | Minimum Acceptable | Good | Excellent |
| --- | --- | --- | --- |
| Pipeline uptime (daily runs) | > 90% | > 95% | > 99% |
| Classification accuracy | > 70% | > 80% | > 90% |
| Enrichment coverage | > 60% | > 75% | > 90% |
| Dossier delivery rate | > 95% | > 98% | > 99% |
| Conversion vs cold outreach | > 1.5x | > 2x | > 3x |

If below minimum: Identify bottleneck module, increase logging detail, fix before scaling.

Error Handling

| Error | Likely Cause | Recovery Action |
| --- | --- | --- |
| Data source returns 429 | Rate limit exceeded | Exponential backoff; reduce frequency |
| LLM returns unparseable output | Prompt format issue | Output validation + retry with stricter format |
| Enrichment coverage < 50% | Small/private companies | Add secondary source or manual lookup |
| Emails in spam | Domain reputation | Verify SPF/DKIM/DMARC; warm up domain |
| Silent pipeline failure | Insufficient error handling | Try/catch per module with alerting |
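The "validation + retry with stricter format" recovery can be sketched as a small wrapper around the LLM call; the suffix wording and function name are illustrative assumptions.

```python
import json

STRICT_SUFFIX = "\nReturn ONLY a JSON object. No prose, no code fences."

def classify_with_retry(call_llm, prompt, max_attempts=3):
    """Call the model, validate that the reply parses as JSON, and retry
    with a stricter format instruction appended when it does not."""
    for attempt in range(max_attempts):
        suffix = "" if attempt == 0 else STRICT_SUFFIX
        raw = call_llm(prompt + suffix)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue
    return None  # caller logs the record and skips it
```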

Cost Breakdown

| Component | Lean ($5K-$8K) | Standard ($8K-$12K) | Full ($12K-$18K) |
| --- | --- | --- | --- |
| Data ingestion | $1.5K-$2.5K | $2.5K-$4K | $4K-$6K |
| LLM classification | $1K-$1.5K | $1.5K-$2.5K | $2.5K-$4K |
| Enrichment | $500-$1K | $1K-$1.5K | $1.5K-$2.5K |
| Dossier generation | $800-$1.2K | $1.2K-$2K | $2K-$3K |
| Delivery + tracking | $500-$800 | $800-$1.2K | $1.2K-$2K |
| Monitoring | $500-$800 | $800-$1.2K | $1.2K-$1.5K |
| Total build | $5K-$8K | $8K-$12K | $12K-$18K |
| Monthly running | $300-$500 | $500-$1K | $1K-$2K |

Anti-Patterns

Wrong: Building a UI before validating the pipeline

Spending weeks on a dashboard before confirming dossier value. Result: pretty interface, worthless content. [src1]

Correct: Email + PDF for MVP, UI only after validation

Sales teams live in email. Add UI only after confirming >2x conversion improvement.

Wrong: Custom classification rules instead of taxonomy

Writing ad hoc logic instead of using workshop output. Result: developer assumptions, not domain expertise. [src2]

Correct: Implement the validated taxonomy exactly

Load the JSON schema from the workshop. Modifications only after revalidation with domain expert.

Wrong: Overengineering data storage for MVP

PostgreSQL, Redis, message queues for 50 records/day. Result: 3x build time, delayed validation. [src3]

Correct: JSON files + SQLite for MVP

Flat files or SQLite. Migrate to proper database only after volume justifies complexity.

When This Matters

Use when building the technical pipeline that converts identified signals into delivered sales dossiers. This is Phase 3 of the Signal Stack engagement — it implements the validated taxonomy into a functional automated system.

Related Units