MVP Pipeline Build
How do you build a Signal Stack MVP pipeline in 4 weeks with LLM classification?
Purpose
This recipe builds the minimum viable signal intelligence pipeline: cron-scheduled data ingestion, LLM classification against the validated taxonomy, company enrichment, PDF dossier generation, and email delivery with tracking. No UI required. Target: a functional pipeline in 2-4 weeks that delivers a >2x conversion improvement over cold outreach. [src1, src4]
Prerequisites
- Validated signal taxonomy as JSON schema — classification rules, scoring weights, threshold
- API credentials for top 3-5 data sources
- Enrichment API credentials with sufficient monthly quota
- Email delivery service configured with domain verification
- Technical resource available for 2-4 weeks (Python proficiency)
- Cron hosting environment ready
Constraints
- No UI for MVP — email + PDF only. UI is scope creep. [src1]
- Functional within 4 weeks. If the build exceeds 6 weeks, the scope is too large.
- Human-in-the-loop for first 100 dossiers — no automated delivery before calibration.
- Must use validated taxonomy JSON — do not improvise classification rules. [src2]
- Track conversion rate vs cold outreach baseline from day 1. [src4]
Tool Selection Decision
Which stack?
├── Python + Claude API (recommended)
│   └── PATH A: Best classification accuracy
├── Python + GPT-4 API
│   └── PATH B: Good accuracy, wider ecosystem
├── Python + local LLM
│   └── PATH C: No API costs, lower accuracy
└── Node.js + any LLM API
    └── PATH D: Alternative runtime
| Path | Stack | Monthly Cost | Classification Quality | Setup Complexity |
|---|---|---|---|---|
| A: Python + Claude | Python, Claude API, Clearbit | $300-$800 | Excellent | Low |
| B: Python + GPT-4 | Python, GPT-4 API, Clearbit | $300-$800 | Good | Low |
| C: Python + Local | Python, Llama/Mixtral | $100-$500 | Adequate | High |
| D: Node.js | Node.js, any LLM | $300-$800 | Varies | Low |
Execution Flow
Step 1: Data Ingestion Layer (Week 1)
Duration: 3-5 days · Tool: Python + requests/httpx + cron
Build API integrations for 3-5 data sources. Implement rate limiting, retry logic, date-filtered queries. Store raw data as JSON/SQLite. Schedule via cron. Add logging per source. [src2]
Verify: Each source pulls data for 3 consecutive runs. Error handling logs failures without crashing. · If failed: Check auth, rate limits, data format. Test with curl first.
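The retry logic above can be sketched as a small wrapper around each source's fetch call. This is a minimal sketch, not a definitive implementation: `fetch_with_retry` is a hypothetical name, and in production you would catch specific HTTP errors (and honor `Retry-After` on 429s) rather than bare `Exception`.

```python
import random
import time

def fetch_with_retry(fetch, max_retries=4, base_delay=1.0):
    """Call `fetch` (a zero-arg callable returning parsed records),
    retrying on failure with exponential backoff plus jitter.
    Re-raises the last error if all retries are exhausted."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:  # production code should catch specific HTTP errors
            if attempt == max_retries - 1:
                raise
            # back off: 1s, 2s, 4s, ... plus jitter to avoid thundering herd
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

Wrap each source's `requests`/`httpx` call in this before wiring it to cron, so transient rate limits don't crash the run.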
Step 2: LLM Classification Module (Week 1-2)
Duration: 3-5 days · Tool: Python + LLM API + taxonomy JSON
Load taxonomy, construct classification prompts with taxonomy rules as system context and raw records as input. Parse structured JSON output. Apply composite scoring. Filter above threshold. Log all classifications. Include 2-3 few-shot examples. [src2]
Verify: >70% accuracy on 100 test records. Output parsing >95% success. · If failed: Refine prompt structure and few-shot examples.
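The parse-score-filter path can be sketched as follows. The taxonomy shape shown here is hypothetical — your workshop-validated JSON will differ — and the LLM call itself is omitted; this covers only validating the model's JSON output, applying composite scoring, and filtering against the threshold.

```python
import json

# Hypothetical taxonomy shape -- load the real workshop-validated JSON instead.
TAXONOMY = {
    "signals": {"hiring_surge": 0.4, "funding_event": 0.6},
    "threshold": 0.5,
}

def parse_classification(raw):
    """Parse the LLM's JSON output; return None (triggering a retry
    with a stricter format instruction) if it is unparseable."""
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if "signals" not in out or "company" not in out:
        return None
    return out

def composite_score(out, taxonomy=TAXONOMY):
    """Weighted sum of detected signals, using taxonomy scoring weights."""
    weights = taxonomy["signals"]
    return sum(weights.get(s, 0.0) for s in out["signals"])

def passes_threshold(out, taxonomy=TAXONOMY):
    return composite_score(out, taxonomy) >= taxonomy["threshold"]
```

Records where `parse_classification` returns None go back to the LLM; records below threshold are logged but not enriched.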
Step 3: Enrichment Layer (Week 2)
Duration: 2-3 days · Tool: Python + Clearbit/Apollo API
Query enrichment API for classified companies: firmographics, contacts, technology stack. Handle partial data gracefully. Cache results. Track coverage percentage. [src3]
Verify: Enrichment returns data for >60% of companies. Contacts for >50%. · If failed: Add secondary source or manual LinkedIn lookup.
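Caching and graceful partial-data handling can be sketched with a SQLite-backed lookup. `get_enrichment` and its cache schema are assumptions for illustration; `fetch_enrichment` stands in for your Clearbit/Apollo wrapper.

```python
import json
import sqlite3

def get_enrichment(domain, fetch_enrichment, db_path="enrich_cache.db"):
    """Return cached enrichment for a company domain, or fetch and cache it.
    `fetch_enrichment` wraps the Clearbit/Apollo call and may return
    partial data (missing keys) -- callers must treat every field as
    optional rather than failing the record."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS cache (domain TEXT PRIMARY KEY, data TEXT)")
    row = conn.execute("SELECT data FROM cache WHERE domain = ?", (domain,)).fetchone()
    if row:
        conn.close()
        return json.loads(row[0])
    data = fetch_enrichment(domain) or {}  # tolerate empty responses
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?)", (domain, json.dumps(data)))
    conn.commit()
    conn.close()
    return data
```

Caching keeps you inside the monthly quota when the same company fires signals repeatedly, and makes coverage-percentage tracking a simple query over the cache table.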
Step 4: Dossier Generation (Week 2-3)
Duration: 2-3 days · Tool: Python + LLM API + PDF generation
Design 1-2 page PDF template. LLM generates narrative summary with outreach angle. Populate with structured data + narrative. Include signal evidence with source links. [src4]
Verify: Sample reviewed by sales team lead. Readable, clear evidence, actionable. · If failed: Iterate template with sales input.
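Template population can be sketched with the stdlib before committing to a PDF library. The dossier layout below is a placeholder — the real template comes out of iteration with the sales team — and the actual PDF rendering step (e.g. via WeasyPrint or ReportLab) is omitted.

```python
from string import Template

# Placeholder layout -- iterate the real template with the sales team lead.
DOSSIER_TMPL = Template("""\
$company -- Signal Dossier
Score: $score
Signals: $signals
Why now: $narrative
Evidence: $evidence_links
""")

def render_dossier(company, score, signals, narrative, evidence_links):
    """Fill the dossier template with structured enrichment data plus the
    LLM-written narrative. The rendered text/HTML is what gets handed to
    a PDF renderer; PDF conversion itself is not shown here."""
    return DOSSIER_TMPL.substitute(
        company=company,
        score=f"{score:.2f}",
        signals=", ".join(signals),
        narrative=narrative.strip(),
        evidence_links="; ".join(evidence_links),
    )
```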
Step 5: Delivery and Tracking (Week 3)
Duration: 2-3 days · Tool: Python + SendGrid/Resend
Configure email templates, batch delivery scheduling, open/click tracking, delivery logging, unsubscribe mechanism. [src4]
Verify: Test emails delivered to 5 recipients. Tracking functional. PDFs render. · If failed: Check SPF/DKIM/DMARC for spam issues.
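Assembling the dossier email can be sketched with the stdlib `email` package; with SendGrid/Resend you would pass the equivalent fields to their API instead, but this version also works directly with `smtplib` for the 5-recipient test send.

```python
from email.message import EmailMessage

def build_dossier_email(sender, recipient, subject, body, pdf_bytes, pdf_name):
    """Assemble a dossier email with the PDF attached. Delivery is not
    shown: pass the message to smtplib.SMTP(...).send_message(msg), or
    map these fields onto your delivery provider's API."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    msg.add_attachment(pdf_bytes, maintype="application",
                       subtype="pdf", filename=pdf_name)
    return msg
```

Open/click tracking and unsubscribe handling come from the delivery service, not this message object; SPF/DKIM/DMARC are configured on the sending domain.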
Step 6: Pipeline Orchestration and Monitoring (Week 3-4)
Duration: 2-3 days · Tool: Cron + logging + alerting
Wire modules together: ingest → classify → enrich → generate → deliver. Add health monitoring, alerting on failures, tracking spreadsheet. [src5]
Verify: Full pipeline runs 3 consecutive days without intervention. Alerts fire on simulated failures. · If failed: Add retry logic, increase timeouts.
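The per-module error handling can be sketched as a stage runner. This is one possible design, assuming stages run sequentially and a failed stage should halt the run (a half-processed batch should not reach delivery); `alert` stands in for whatever alerting hook you wire up.

```python
import logging

def run_pipeline(stages, alert=logging.error):
    """Run named pipeline stages in order, feeding each stage's output to
    the next. A stage failure triggers `alert` and halts the run instead
    of failing silently."""
    result, data = {"completed": [], "failed": None}, None
    for name, stage in stages:
        try:
            data = stage(data)
            result["completed"].append(name)
        except Exception as exc:
            alert(f"pipeline stage '{name}' failed: {exc}")
            result["failed"] = name
            break
    return result
```

Invoke this from the cron entry point with `stages=[("ingest", ...), ("classify", ...), ("enrich", ...), ("generate", ...), ("deliver", ...)]` and write `result` to the tracking spreadsheet.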
Output Schema
```json
{
  "output_type": "mvp_pipeline",
  "format": "deployed code + documentation",
  "sections": [
    {"name": "ingestion_module", "type": "object", "description": "Data source integrations with cron"},
    {"name": "classification_module", "type": "object", "description": "LLM classification with taxonomy"},
    {"name": "enrichment_module", "type": "object", "description": "Company/contact enrichment"},
    {"name": "dossier_generator", "type": "object", "description": "PDF generation with narrative"},
    {"name": "delivery_module", "type": "object", "description": "Email delivery with tracking"},
    {"name": "monitoring", "type": "object", "description": "Health metrics and alerting"}
  ]
}
```
Quality Benchmarks
| Quality Metric | Minimum Acceptable | Good | Excellent |
|---|---|---|---|
| Pipeline uptime (daily runs) | > 90% | > 95% | > 99% |
| Classification accuracy | > 70% | > 80% | > 90% |
| Enrichment coverage | > 60% | > 75% | > 90% |
| Dossier delivery rate | > 95% | > 98% | > 99% |
| Conversion vs cold outreach | > 1.5x | > 2x | > 3x |
If below minimum: Identify bottleneck module, increase logging detail, fix before scaling.
Error Handling
| Error | Likely Cause | Recovery Action |
|---|---|---|
| Data source returns 429 | Rate limit exceeded | Exponential backoff; reduce frequency |
| LLM returns unparseable output | Prompt format issue | Output validation + retry with stricter format |
| Enrichment coverage < 50% | Small/private companies | Add secondary source or manual lookup |
| Emails in spam | Domain reputation | Verify SPF/DKIM/DMARC; warm up domain |
| Silent pipeline failure | Insufficient error handling | Try/catch per module with alerting |
Cost Breakdown
| Component | Lean ($5K-$8K) | Standard ($8K-$12K) | Full ($12K-$18K) |
|---|---|---|---|
| Data ingestion | $1.5K-$2.5K | $2.5K-$4K | $4K-$6K |
| LLM classification | $1K-$1.5K | $1.5K-$2.5K | $2.5K-$4K |
| Enrichment | $500-$1K | $1K-$1.5K | $1.5K-$2.5K |
| Dossier generation | $800-$1.2K | $1.2K-$2K | $2K-$3K |
| Delivery + tracking | $500-$800 | $800-$1.2K | $1.2K-$2K |
| Monitoring | $500-$800 | $800-$1.2K | $1.2K-$1.5K |
| Total build | $5K-$8K | $8K-$12K | $12K-$18K |
| Monthly running | $300-$500 | $500-$1K | $1K-$2K |
Anti-Patterns
Wrong: Building a UI before validating the pipeline
Spending weeks on a dashboard before confirming dossier value. Result: pretty interface, worthless content. [src1]
Correct: Email + PDF for MVP, UI only after validation
Sales teams live in email. Add UI only after confirming >2x conversion improvement.
Wrong: Custom classification rules instead of taxonomy
Writing ad hoc logic instead of using workshop output. Result: developer assumptions, not domain expertise. [src2]
Correct: Implement the validated taxonomy exactly
Load the JSON schema from the workshop. Modifications only after revalidation with domain expert.
Wrong: Overengineering data storage for MVP
PostgreSQL, Redis, message queues for 50 records/day. Result: 3x build time, delayed validation. [src3]
Correct: JSON files + SQLite for MVP
Flat files or SQLite. Migrate to proper database only after volume justifies complexity.
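At MVP volume, raw storage can be as simple as appending JSON blobs to a single SQLite table — a sketch, with hypothetical table and function names:

```python
import json
import sqlite3

def store_raw(records, source, db_path="signals.db"):
    """Append raw source records as JSON blobs. No schema design needed
    at tens of records per day; add real tables only when volume
    justifies the complexity."""
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS raw
                    (id INTEGER PRIMARY KEY, source TEXT,
                     pulled_at TEXT DEFAULT CURRENT_TIMESTAMP, data TEXT)""")
    conn.executemany("INSERT INTO raw (source, data) VALUES (?, ?)",
                     [(source, json.dumps(r)) for r in records])
    conn.commit()
    total = conn.execute("SELECT COUNT(*) FROM raw").fetchone()[0]
    conn.close()
    return total
```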
When This Matters
Use when building the technical pipeline that converts identified signals into delivered sales dossiers. This is Phase 3 of the Signal Stack engagement — it implements the validated taxonomy into a functional automated system.