Customer Unification Agent

AI-powered agent for cross-platform customer record reconciliation and unification.

Project Overview

This agent automatically detects and unifies duplicate customer records across multiple e-commerce platforms (Shopify, Stripe, etc.) using probabilistic record linkage (Splink). It produces confidence-scored match predictions, routes high-confidence pairs to auto-merge and borderline pairs to a human review queue, and surfaces business ROI metrics via a Streamlit dashboard.

Live Page

A static project page lives in docs/index.html and is published via GitHub Pages. It mirrors the dashboard aesthetic — KPI cards, Plotly charts, Mermaid pipeline diagram, and match statistics — using the real numbers from the latest run.

Setup

1. Create Virtual Environment

python -m venv .venv

2. Activate Virtual Environment

On macOS/Linux:

source .venv/bin/activate

On Windows:

.venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

Usage

Generate Synthetic Data

python generate_synthetic_data.py

Creates:

shopify_customers.csv — 425 Shopify customer records
stripe_customers.csv — 425 Stripe customer records
ground_truth.csv — True customer identities for validation

Add Hard Cases (optional)

python add_hard_cases.py

Appends 20 intentionally tricky duplicates (name variations, phone formatting differences, address abbreviations) to create *_with_hard_cases.csv variants.

Run the Matching Engine

python matching_engine.py

Runs the full Splink probabilistic record linkage pipeline and writes:

auto_merge_matches.csv — Pairs with ≥95% confidence (safe to merge automatically)
review_queue_matches.csv — Pairs with 75–95% confidence (flagged for human review)
validation_metrics.json — Precision/recall against ground truth

Test Hard Cases

python test_hard_cases.py

Runs the matching pipeline on the hard-cases dataset and reports how many tricky duplicates were caught vs. missed.

Preview Dashboard Metrics (CLI)

python analyze_metrics.py

Launch the ROI Dashboard

streamlit run dashboard.py

Opens a live Streamlit dashboard showing:

Summary KPIs — total records, duplicates found, unique customers, match precision; with vs-last-run delta badges
Hidden Value — combined customer lifetime value, avg per unified customer, Shopify/Stripe revenue split
Top Customers — searchable top-10 table with show-all expander and CSV download; numeric currency columns
Value Distribution — bar chart with per-bin drill-down (click a range to see the customer list + download)
Actionable Insights — VIP retargeting and cross-sell opportunity cards
Match Quality — confidence score histogram with dynamic threshold annotation line

Sidebar controls:

Confidence Threshold slider — filters all sections live (min = lowest match score in data)
Customer Type filter — All / Shopify Primary / Stripe Primary

Project Structure

customer-unification-agent/
├── config.py                           # All thresholds, file paths, and tuning parameters
├── data.py                             # Schema-validated data loading (single source of truth for I/O)
├── matching_engine.py                  # Splink probabilistic matching pipeline
├── metrics.py                          # Shared business KPI calculations
├── dashboard.py                        # Streamlit ROI dashboard
├── generate_synthetic_data.py          # Synthetic data generator (500 customers, 70% cross-platform)
├── add_hard_cases.py                   # Appends hard-to-match duplicate scenarios
├── test_hard_cases.py                  # Evaluates matching performance on hard cases
├── analyze_metrics.py                  # CLI preview of dashboard metrics
├── quick_check.py                      # Quick sanity-check script
├── requirements.txt                    # Python dependencies
│
├── shopify_customers.csv               # Generated Shopify records
├── stripe_customers.csv                # Generated Stripe records
├── ground_truth.csv                    # True customer identities
├── shopify_customers_with_hard_cases.csv
├── stripe_customers_with_hard_cases.csv
├── ground_truth_with_hard_cases.csv
│
├── auto_merge_matches.csv              # Output: high-confidence matches (383 pairs)
├── review_queue_matches.csv            # Output: borderline matches (9 pairs)
├── validation_metrics.json             # Output: precision/recall metrics
└── dashboard_state.json                # Output: persisted KPI snapshot for delta badges

How It Works

flowchart TD
    A[Shopify Customers] --> C[Blocking]
    B[Stripe Customers] --> C
    C -->|"Fast rules: email, phone, zipcode"| D[Candidate Pairs]
    D --> E[Comparison]
    E -->|"Email · Name · Phone · Address · Zipcode"| F[Splink Training]
    F -->|"EM algorithm → m/u probabilities"| G[Prediction]
    G --> H{Match Probability}
    H -->|"≥ 0.95"| I[Auto-Merge]
    H -->|"0.75 – 0.95"| J[Human Review Queue]
    H -->|"< 0.75"| K[Discarded]
    I --> L[Unified Customer Record]
    J --> M[Analyst Decision] --> L
    L --> N[ROI Dashboard]

Blocking — Candidate pairs are generated using fast blocking rules (exact email, email prefix, phone, zipcode) to avoid O(n²) comparisons.
Comparison — Each candidate pair is scored across five signals:
- Email (Levenshtein distance ≤1, ≤2)
- Name (Jaro-Winkler similarity ≥0.9, ≥0.8)
- Phone (Levenshtein distance ≤2, ≤5)
- Address (Levenshtein distance ≤5, ≤10)
- Zipcode (exact match)
Training — Splink estimates match/non-match probabilities via random sampling (u-probabilities) and EM algorithm (m-probabilities).
Prediction — Each pair receives a match_probability score between 0 and 1.
Categorization — Pairs above 0.95 are auto-merged; pairs between 0.75–0.95 go to the review queue.

Key Configuration (`config.py`)

Parameter	Default	Description
`AUTO_MERGE_THRESHOLD`	0.95	Confidence above which records auto-merge
`REVIEW_THRESHOLD`	0.75	Confidence above which records go to review queue
`DEFAULT_N_CUSTOMERS`	500	Number of customers in synthetic data
`DEFAULT_CROSS_PLATFORM_RATE`	0.70	Fraction of customers present on both platforms
`VIP_SPEND_THRESHOLD`	$2,000	Minimum combined spend to qualify as a VIP

Status

Milestone	Status
Generate synthetic data	✅ Done
Implement Splink matching engine	✅ Done
Build merge logic with confidence thresholds	✅ Done
Create Streamlit review dashboard	✅ Done
Build ROI metrics layer	✅ Done
Hard case stress testing	✅ Done
Real Shopify/Stripe API integration	⏳ Not started
Database persistence (PostgreSQL)	⏳ Not started

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Customer Unification Agent

Project Overview

Live Page

Setup

1. Create Virtual Environment

2. Activate Virtual Environment

3. Install Dependencies

Usage

Generate Synthetic Data

Add Hard Cases (optional)

Run the Matching Engine

Test Hard Cases

Preview Dashboard Metrics (CLI)

Launch the ROI Dashboard

Project Structure

How It Works

Key Configuration (`config.py`)

Status

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
docs		docs
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
RESULTS.md		RESULTS.md
add_hard_cases.py		add_hard_cases.py
analyze_metrics.py		analyze_metrics.py
config.py		config.py
dashboard.py		dashboard.py
data.py		data.py
generate_synthetic_data.py		generate_synthetic_data.py
matching_engine.py		matching_engine.py
metrics.py		metrics.py
quick_check.py		quick_check.py
requirements.txt		requirements.txt
test_hard_cases.py		test_hard_cases.py
validation_metrics.json		validation_metrics.json

Folders and files

Latest commit

History

Repository files navigation

Customer Unification Agent

Project Overview

Live Page

Setup

1. Create Virtual Environment

2. Activate Virtual Environment

3. Install Dependencies

Usage

Generate Synthetic Data

Add Hard Cases (optional)

Run the Matching Engine

Test Hard Cases

Preview Dashboard Metrics (CLI)

Launch the ROI Dashboard

Project Structure

How It Works

Key Configuration (config.py)

Status

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Key Configuration (`config.py`)

Packages