May 2026 · 12 min read

Building an Automated Experiment Analysis Agent with Claude Code

What used to take a data scientist 1–2 days now runs autonomously in under 15 minutes.

Claude Code · Python · SQL · Snowflake · A/B Testing · Statistics · Plotly · Slack

The Problem

Every A/B experiment at a product company ends the same way: a data scientist spends 1–2 days doing repetitive, largely mechanical work.

  • Writing SQL to pull cohort and outcome data from a data warehouse
  • Running statistical tests (t-tests, z-tests, Bonferroni correction) in a notebook
  • Building results tables and charts by hand
  • Writing a stakeholder narrative
  • Setting up monitoring to watch metrics daily

At companies running 15–20 experiments per quarter, this adds up to weeks of DS time per year spent on work that follows the same pattern every time. It's a solved problem we keep re-solving.

The Solution: A Claude Code Agent

Instead of building a traditional application, I built a Claude Code workflow agent — a system where Claude Code acts as the orchestrating brain, calling specialist skills and Python tools to produce the full results package.

The key insight: an experiment design document (XPD) already contains everything needed to automate the analysis. You just need an agent smart enough to read it and drive the pipeline.

What the agent produces from a single prompt:

  • An interactive HTML dashboard with Plotly charts and a timeline slider
  • A DS working document (.docx) with 9 sections including embedded charts
  • Daily Slack monitoring with intelligent summaries
  • All intermediate artifacts (results.json, snapshots.json, data_quality.json)

Architecture

The most important design decision: three distinct layers with clear separation of concerns.

[Architecture flow] The user pastes the experiment design doc to Claude Code (the orchestrator), which invokes the Methodology Skill (what to do), the SQL Writer Skill (how to implement), and the Data Warehouse MCP Tool (runs the SQL). Python tool scripts then produce the artifacts: run_stats.py → results.json, build_snapshots → snapshots.json, render_report → HTML, generate_gdoc → .docx.
  • Layer 1 — Methodology Skill: A markdown-based skill loaded before any SQL is written. It enforces non-negotiable standards: the correct randomization unit, required data quality filters, and the right statistical tests.
  • Layer 2 — SQL Writer Skill: A specialized skill that knows the data warehouse schema deeply. Given the methodology spec, it finds the right tables, writes production-quality SQL, and has it reviewed against a requirements checklist.
  • Layer 3 — Python Tools: Pure computation — no LLM calls. Each script has a single responsibility and can be called directly or imported as a module (sketched below).
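For a concrete sense of the Layer 3 shape, here is a minimal sketch of what a tool script like run_stats.py might look like; the function names and the results payload are assumptions, not the project's actual code:

    # Illustrative Layer-3 tool: single responsibility, no LLM calls,
    # usable both as a CLI and as an importable module.
    import argparse
    import json

    def compute_results(input_csv: str) -> dict:
        # Pure computation lives here; the statistics are elided in this sketch.
        return {"source": input_csv, "comparisons": []}

    def main() -> None:
        parser = argparse.ArgumentParser(description="Run experiment statistics")
        parser.add_argument("input_csv")
        parser.add_argument("--out", default="output/results.json")
        args = parser.parse_args()
        with open(args.out, "w") as f:
            json.dump(compute_results(args.input_csv), f, indent=2)

    if __name__ == "__main__":  # called directly by the agent or the cron job...
        main()                  # ...or imported: from run_stats import compute_results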

The Two Clarifying Questions

The agent asks exactly two questions before running:

1. Date confirmation: “I found these dates in the XPD: Start: [date] · End: [date]. Are these correct? If the end date isn't set yet, provide your best estimate — the daily monitoring cron stops automatically 3 days after this date.”

2. Variant cutoff detection: The agent scans the doc for language like “variant removed”, “stopped”, “final round”. If detected: “I see [variant] was removed on [date]. Should I analyze both phases with separate tabs in the timeline slider?”

Everything else is automatic. No manual intervention.
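The agent does the cutoff scan by reading the document itself, but a deterministic equivalent of the trigger is easy to picture; the phrase list below comes from this article, everything else is illustrative:

    # Rough sketch of the variant-cutoff trigger, assuming plain-text XPD input.
    import re

    CUTOFF_PHRASES = re.compile(r"variant removed|stopped|final round", re.IGNORECASE)

    def detect_variant_cutoff(xpd_text: str) -> bool:
        # True if the design doc hints that a variant was dropped mid-experiment.
        return bool(CUTOFF_PHRASES.search(xpd_text))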

Statistical Methodology

The agent implements a rigorous methodology that's easy to get wrong manually.

  • Randomization unit: The agent loads a bucketing guide before writing any SQL — visitor-level vs. user-level experiments use different join keys, a common source of silent bugs in manual analysis.
  • Multiple comparison correction: Testing 3 variants × 4 metrics = 12 comparisons. Without correction, the probability of at least one false positive at α = 0.05 is ~46% (1 − 0.95¹²). The agent applies Bonferroni correction automatically.
  • Statistical tests by metric type: Two-proportion z-test for rate metrics; Welch's t-test for mean metrics. Winsorization applied to revenue metrics to reduce outlier influence.
  • Data quality checks: SRM detection via chi-square test; contamination check for users appearing in multiple variants. (Sketched below.)
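For concreteness, here is a minimal sketch of that testing logic using scipy and statsmodels (the libraries named in the stack below); the toy numbers and function names are illustrative, not the agent's actual code:

    # Per-metric tests plus Bonferroni and an SRM check; toy data throughout.
    import numpy as np
    from scipy import stats
    from statsmodels.stats.proportion import proportions_ztest
    from statsmodels.stats.multitest import multipletests

    def rate_metric_pvalue(successes, exposures):
        # Two-proportion z-test on [control, variant] success counts.
        _, p = proportions_ztest(count=successes, nobs=exposures)
        return p

    def mean_metric_pvalue(control, variant):
        # Welch's t-test: no equal-variance assumption across arms.
        _, p = stats.ttest_ind(control, variant, equal_var=False)
        return p

    rng = np.random.default_rng(0)
    p_values = [
        rate_metric_pvalue(np.array([4_210, 4_630]), np.array([100_000, 100_000])),
        mean_metric_pvalue(rng.normal(10, 3, 5_000), rng.normal(10.2, 3, 5_000)),
        # ...one p-value per (variant, metric) pair, 12 in total
    ]

    # Bonferroni correction across all comparisons at once.
    reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

    # SRM check: do observed exposures match the planned 50/50 split?
    observed = np.array([100_000, 99_400])
    _, srm_p = stats.chisquare(observed, f_exp=observed.sum() * np.array([0.5, 0.5]))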

The Timeline Slider

The most technically interesting output is the timeline slider — an interactive feature that lets stakeholders drag through the experiment's history to see when each metric became statistically significant.

The technical challenge: computing daily cumulative p-values requires running Welch's t-test at each intermediate sample size. For a 21-day experiment with 3 variants, that's 63 test computations.

  • A lightweight SQL query (just exposure counts) runs once per day and appends to a CSV
  • Python accumulates cumulative sums and computes running variance using the online formula: variance = (sum_sq - sum²/n) / (n-1) (see the sketch after this list)
  • The timeline is embedded as JSON in the HTML — no server needed, fully self-contained
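Under those assumptions, each day's p-value reduces to a handful of arithmetic operations on the running sums. A minimal sketch, with illustrative names and the standard Welch-Satterthwaite degrees of freedom:

    # Welch's t-test computed from cumulative (n, sum, sum_sq) per arm.
    import math
    from scipy import stats

    def welch_p_from_sums(n1, s1, ss1, n2, s2, ss2):
        m1, m2 = s1 / n1, s2 / n2
        # Running variance via the online formula: (sum_sq - sum^2/n) / (n - 1)
        v1 = (ss1 - s1 * s1 / n1) / (n1 - 1)
        v2 = (ss2 - s2 * s2 / n2) / (n2 - 1)
        t = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
        # Welch-Satterthwaite degrees of freedom
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )
        return 2 * stats.t.sf(abs(t), df)  # two-sided p-value

    # One call per (variant, day), e.g. a mean metric on day 14:
    p = welch_p_from_sums(n1=50_000, s1=510_000.0, ss1=9.8e6,
                          n2=50_000, s2=505_000.0, ss2=9.7e6)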

Early p-values are unreliable due to noisy variance estimates during a partial traffic ramp. The slider makes this visible — stakeholders can see the p-value oscillate before stabilizing, which builds intuition about why “peeking” at experiments early is dangerous.

Handling Failure Gracefully

Large data warehouse queries on big experiments can hit timeout limits. Instead of failing silently, the agent follows a try-first / prompt-DS fallback pattern:

1. Claude Code attempts to run SQL via data warehouse MCP tool
2. ✅ Success → continues automatically
3. ❌ Timeout → prompts DS with exact instructions:
   "Please run sql/q1_all_metrics.sql in your Snowflake worksheet
    and save the results as output/stats_spec_raw.csv. Then say 'continue'."
4. DS provides the CSV → agent resumes the rest of the pipeline automatically

This pattern — try autonomously, fall back to human-in-the-loop with exact instructions — is broadly useful for any agent that interacts with external systems.
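The shape of the pattern, sketched as plain Python (run_query stands in for a hypothetical warehouse client; in the real system Claude Code drives this flow, not a script):

    # Try-first / prompt-DS fallback; `run_query` is a placeholder, not a real API.
    import pandas as pd

    def fetch_with_fallback(run_query, sql_path, fallback_csv):
        try:
            return run_query(sql_path, timeout_s=600)
        except TimeoutError:
            print(f"Please run {sql_path} in your Snowflake worksheet and save "
                  f"the results as {fallback_csv}. Press Enter to continue.")
            input()
            return pd.read_csv(fallback_csv)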

Daily Monitoring

A local cron job runs every morning and:

  • Queries the data warehouse for the latest daily snapshot
  • Re-renders the HTML report (timeline slider gets a new data point)
  • Regenerates the DS working document
  • Posts an intelligent Slack summary to a monitoring channel
  • Removes itself from crontab 3 days after the experiment's end date (sketched below)
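The self-removal step is the only part that edits system state; a rough sketch, assuming the cron line is tagged with a unique marker comment (marker and date are illustrative):

    # Drop this job from the crontab once the 3-day grace period has passed.
    import subprocess
    from datetime import date, timedelta

    MARKER = "# exp-monitor:checkout-test"   # hypothetical tag on the cron line
    END_DATE = date(2026, 5, 1)              # experiment end date from the XPD

    def maybe_self_remove():
        if date.today() <= END_DATE + timedelta(days=3):
            return  # still inside the monitoring window
        current = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
        kept = "\n".join(l for l in current.stdout.splitlines() if MARKER not in l)
        subprocess.run(["crontab", "-"], input=kept + "\n", text=True)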

The Slack summary is a 2–3 sentence narrative rather than a dump of raw numbers: “Large variant crossed significance today on primary metric — worth a close look before making a ship decision.”
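Posting the narrative is a single Incoming Webhook call (URL is a placeholder; the summary text is written by the agent upstream):

    # Post the daily narrative to the monitoring channel via Incoming Webhook.
    import requests

    WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
    summary = "Large variant crossed significance today on primary metric."
    requests.post(WEBHOOK_URL, json={"text": summary}, timeout=10)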

A scheduled Claude Code remote agent also polls the Slack channel hourly for “show me [experiment-name]” requests and replies in thread with current results from Google Drive — no server needed.

Key Technical Decisions

  • No Anthropic API key needed. Claude Code itself is the AI layer — the Python scripts are pure computation tools with no LLM calls. Zero per-run cost beyond the Claude subscription.
  • Self-contained HTML. The report is a single .html file with all data embedded as JSON. No server, no database, no dependencies. Plotly loaded from CDN. The timeline slider is pure client-side JavaScript. (Sketched after this list.)
  • Direct function calls, not subprocess. The daily cron imports Python modules directly rather than spawning subprocesses. Faster, better error handling, no shell injection surface.
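The embedding trick behind the self-contained HTML, sketched with toy data (the TIMELINE variable name is illustrative):

    # Build a self-contained report: Plotly from CDN, timeline data inlined as JSON.
    import json
    import plotly.graph_objects as go

    days = list(range(1, 22))
    pvalues = [0.40, 0.31, 0.18, 0.12, 0.09, 0.07] + [0.05] * 15  # toy data
    fig = go.Figure(go.Scatter(x=days, y=pvalues, name="cumulative p-value"))

    html = fig.to_html(include_plotlyjs="cdn", full_html=True)
    # Inline the slider's data so the client-side JS needs no server round-trips.
    html = html.replace("</body>",
                        f"<script>const TIMELINE = {json.dumps(pvalues)};</script></body>")
    with open("report.html", "w") as f:
        f.write(html)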

Results on a Real Experiment

I validated the agent on a live multi-variant experiment — three treatment variants against a control, run across millions of users. Details are confidential, but the structural findings illustrate the agent's value:

  • Primary metric: no variant reached significance in the 4-way phase — all INVESTIGATE
  • Variant A: metric A significant in the expected direction · metric B significant in the opposite direction
  • Variant B: inverse pattern — clear trade-off between the two secondary metrics
  • A focused 2-way follow-on test isolated the best-performing variant and reached primary metric significance

The agent also caught a methodological bug before the analysis ran: the daily snapshots query was using the wrong deduplication key, producing a subtly different cohort. The SQL requirements checklist surfaced it automatically.

Task                 Manual      Agent
SQL writing          2–3 hours   ~5 minutes
Stats + notebook     2–3 hours   ~2 minutes
Report + narrative   2–3 hours   ~3 minutes
Total                1–2 days    ~15 minutes

What I Learned

  • Separation between methodology and implementation is critical. The biggest bug risk isn't the statistics — it's the SQL. Using a methodology skill as the spec before invoking a SQL-writing skill caught multiple issues that would have been silent bugs.
  • Human-in-the-loop fallbacks are more important than perfect automation. An agent that fails silently is worse than no agent at all.
  • Claude Code workflows are real agents. The system has tools, makes decisions, composes specialist skills, and produces autonomous output from a single input.
  • The timeline slider exposed something I hadn't anticipated. Early p-values are genuinely volatile during partial traffic ramps — viscerally clear when you can drag a slider and watch it happen.

Stack

  • Orchestration: Claude Code (Claude Sonnet 4.6)
  • Data warehouse: Snowflake (via MCP tool)
  • Statistics: Python — scipy, statsmodels (Bonferroni via multipletests)
  • Visualizations: Plotly (embedded in self-contained HTML)
  • Document generation: python-docx
  • Slack: Incoming Webhooks + Claude Code scheduled remote agent
  • Storage: Google Drive (for Slack bot data)