Architecture Overview

This document describes the internal architecture of the demo data generator.

Project Structure

People.AI-Data-Generator/
├── src/demo_gen/              # Main source code
│   ├── __init__.py
│   ├── cli.py                 # CLI interface (Click)
│   ├── config.py              # Configuration schema (Pydantic)
│   ├── logger.py              # Structured logging (JSONL + summary)
│   ├── state_store.py         # SQLite idempotency tracker
│   ├── sf_client.py           # Salesforce API wrapper
│   ├── activity_planner.py    # Deterministic activity generation
│   ├── content_gen.py         # LLM content generation
│   ├── scorecard_client.py    # Scorecard creation & population
│   └── runner.py              # Main orchestration logic
├── tests/                     # Unit tests
├── runs/                      # Generated run logs (gitignored)
├── demo.example.yaml          # Example configuration
├── .env.example               # Example environment variables
├── pyproject.toml             # Package metadata
├── setup.py                   # Package setup
├── requirements.txt           # Dependencies
├── Makefile                   # Convenience commands
├── README.md                  # User documentation
├── QUICKSTART.md              # Getting started guide
└── ARCHITECTURE.md            # This file

Module Responsibilities

CLI Layer (cli.py)

Purpose: Command-line interface and user interaction

Commands:

run - Full pipeline execution
dry-run - Preview without changes
status - View run summary
reset - Cleanup run data
smoke - Single-opp validation

Dependencies: runner.py, config.py, logger.py

Configuration (config.py)

Purpose: Load, validate, and resolve configuration

Key Classes:

DemoGenConfig - Main configuration schema (Pydantic)
ResolvedConfig - Runtime-resolved config with run metadata
Various sub-configs for each section

Validation:

HTTPS URLs required
Coverage targets 0-1
Required fields present

Orchestration (runner.py)

Purpose: Main pipeline coordination

Flow:

Query opportunities (via sf_client)
For each opportunity:
- Plan activities (via activity_planner)
- Generate content (via content_gen, optional)
- Create Salesforce records (via sf_client)
- Create scorecards (via scorecard_client)
- Log events (via logger)
- Track state (via state_store, optional)

Methods:

run() - Main pipeline
smoke_test() - Single-opp test
cleanup_run() - Reset helper

Salesforce Client (sf_client.py)

Purpose: Abstract Salesforce API interactions

Operations:

Query opportunities (SOQL)
Create Events (meetings)
Create Tasks (emails)
Query Contacts
Tag records (for cleanup)
Delete by run ID

Authentication: OAuth via simple-salesforce library

Activity Planner (activity_planner.py)

Purpose: Generate deterministic activity plans

Key Features:

Seeded random generation
Configurable min/max counts
Past/future time windows
Realistic subject lines
Participant role assignment

Output: ActivityPlan with lists of PlannedMeeting and PlannedEmail

Content Generator (content_gen.py)

Purpose: Generate realistic text via LLM

Capabilities:

Meeting notes (agenda/summary based on timing)
Email bodies
Scorecard answers

Fallback: Returns None if LLM fails; caller uses heuristics

Scorecard Client (scorecard_client.py)

Purpose: Create and populate scorecards

Templates:

MEDDICC (currently)
Extensible for others

Modes:

heuristic - Rule-based answers
llm - AI-generated answers
hybrid - LLM with heuristic fallback

Scoring: Coverage + confidence weighted average

State Store (state_store.py)

Purpose: Track created records for idempotency

Schema:

opportunities - Selected opps
activities - Created meetings/emails (by signature)
scorecards - Created scorecards
scorecard_answers - Populated answers

Signature: MD5 hash of (type, timestamp, subject)

Logger (logger.py)

Purpose: Structured event and error logging

Outputs:

events.jsonl - One JSON per action
errors.jsonl - One JSON per error
run.json - Run metadata
summary.json - Final statistics

Classes:

DemoGenLogger - Full logging
DryRunLogger - No-op for dry runs

Data Flow

                    ┌─────────────┐
                    │   CLI       │
                    │  (cli.py)   │
                    └──────┬──────┘
                           │
                           ▼
                    ┌─────────────┐
                    │  Config     │
                    │(config.py)  │
                    └──────┬──────┘
                           │
                           ▼
                    ┌─────────────┐
                    │   Runner    │◄───────────┐
                    │ (runner.py) │            │
                    └──────┬──────┘            │
                           │                   │
        ┌──────────────────┼──────────────────┼────────────┐
        │                  │                   │            │
        ▼                  ▼                   ▼            ▼
┌──────────────┐   ┌──────────────┐   ┌──────────┐  ┌──────────┐
│  SF Client   │   │Activity      │   │ Content  │  │Scorecard │
│ (sf_client)  │   │  Planner     │   │   Gen    │  │ Client   │
└──────┬───────┘   └──────┬───────┘   └────┬─────┘  └────┬─────┘
       │                  │                 │             │
       │                  └─────────────────┴─────────────┘
       │                            │
       ▼                            ▼
┌──────────────┐           ┌──────────────┐
│ Salesforce   │           │  OpenAI API  │
│     API      │           │              │
└──────────────┘           └──────────────┘

        Logging (logger.py) ──► events.jsonl, summary.json
        State (state_store.py) ──► state.sqlite

Key Design Decisions

1. Deterministic Generation

Why: Reproducibility and debugging

Same seed → same activity plan
Predictable demo environments
Easy to verify what changed

How: Seeded random number generator keyed by seed:opp_id

2. Idempotency

Why: Safe reruns without duplicates

Options:

External state (SQLite) - No Salesforce changes needed
Tag-based - Requires custom field, enables cleanup

Trade-off: External state more compatible, tag-based more powerful

3. Dry-Run Mode

Why: Preview before execution, validate config

Implementation: dry_run flag propagates through all clients

SF client returns mock data
No writes to Salesforce or state DB
Logger is no-op variant

4. LLM Optional

Why: Cost control, offline operation, reliability

Fallback: Heuristic content if LLM fails or disabled

Still creates activities with subjects
Uses template-based scorecard answers

5. Structured Logging

Why: Debugging, auditing, analytics

Format: JSONL for easy parsing, streaming

One event per line
Greppable, jq-friendly
Summary JSON for quick stats

Extension Points

Adding a New Scorecard Template

Create class inheriting ScorecardTemplate in scorecard_client.py
Define questions with IDs and categories
Add to _load_templates() method
Optionally add heuristic answers

Adding a New Activity Type

Extend ActivityPlan dataclass in activity_planner.py
Add generation logic to create_plan()
Add creation method to sf_client.py
Add processing to _create_activities() in runner.py

Custom Content Generation

Replace or extend ContentGenerator:

Could add Azure OpenAI support
Could integrate with other LLMs
Could use templates with variable substitution

Testing Strategy

Unit Tests

Config validation: Ensure constraints enforced
Deterministic generation: Same seed → same output
Bounds checking: Min/max respected

Integration Tests (Manual)

Smoke test: demo-gen smoke validates end-to-end
Dry run: Preview without side effects
Idempotency: Run twice, verify no duplicates

Production Validation

Run smoke test with single opp
Verify in Salesforce
Wait for People.ai ingestion
Confirm activities appear
Then run full generation

Dependencies

Core

click - CLI framework
pyyaml - Config parsing
pydantic - Schema validation
simple-salesforce - SF API client
openai - LLM integration
python-dotenv - Environment loading
rich - Terminal formatting

Dev

pytest - Testing
black - Code formatting
ruff - Linting

Performance Considerations

Concurrency

Configurable via --concurrency flag
Default: 5 parallel API calls
Implemented with ThreadPoolExecutor at opportunity level
Future: finer-grained concurrency for per-activity writes

Rate Limiting

Salesforce: 15K API calls/day (typical)
OpenAI: Depends on tier
Safety: --max-opps hard cap

Scalability

Current design handles:

100-200 opportunities comfortably
5-20 activities per opp
Total: ~2000 API calls per run

For larger scale:

Batch API calls
Implement true concurrency
Add progress bars
Consider async/await pattern

Security

Credentials

Never commit .env to git
Use environment variables only
Support OAuth (preferred) and JWT

Salesforce Permissions

Minimal required:

Read: Opportunity, Account, Contact
Create: Event, Task
API Enabled

LLM API

API key via environment only
No PII sent to OpenAI
Opportunity names/stages only (demo data)

Future Enhancements

Phase 2

Progress bars (rich.progress)
People.ai API verification
More scorecard templates (BANT, MEDDPICC)
Custom activity subjects via config

FilesExpand file tree

ARCHITECTURE.md

Latest commit

History