Running MCP Servers (Stdio, Streamable HTTP)
What is MCP
MCP (Model Context Protocol) is an open standard from Anthropic for connecting LLMs to tools, data, and prompts. Define an integration once; any MCP-aware host – Claude Code, Claude Desktop, claude.ai, Cursor, custom agents – can use it. You skip the per-host function-calling glue. The model picks tools based on their descriptions and JSON schemas; your server runs them.
A tool is just a name, a description, an input schema, and a handler. The host calls tools/list on connect, the model emits tools/call, your handler returns structured data. Treat the description as prompt engineering – bad description, wrong tool picked.
Two MCP servers at Atlassian
ops-sherpa – stdio transport, runs locally inside Claude Code. An on-call engineer types "Debug soft signups around 10:30am PT today" and gets back a ranked list of likely culprits. Tools wrap SignalFx (metrics), Splunk (logs and backend errors), Sentry (frontend errors), Statsig (feature gates and experiments), and Spinnaker (deploys). The orchestrator tool, correlate_incident, parses the time window, fans out in parallel, and ranks: a deploy or gate change inside the window beats a generic error spike outside it.
growth-mcp – Streamable HTTP transport, internal-only. Self-serve analytics for PMs, analysts, designers, and engineers. Prompts like "week-over-week traffic on the Bitbucket pricing page" or "how long does this experiment need to run for significance?" The brain is Databricks Genie, which converts natural language to SQL against curated tables.
A deployment note worth flagging: growth-mcp is reachable from both Claude Code and claude.ai because Streamable HTTP is what claude.ai's Custom Connectors speak. ops-sherpa is Claude Code only – stdio requires launching a local subprocess, which claude.ai cannot do.
Hands-on: defining the servers
Claude Code reads ~/.claude.json (user scope) or a checked-in .mcp.json at the repo root for team-shared setups:
{
"mcpServers": {
"ops-sherpa": {
"type": "stdio",
"command": "npx",
"args": [
"--registry=https://artifactory.internal.atlassian.com/api/npm/npm-internal/",
"-y",
"@atlassian-internal/ops-sherpa-mcp"
],
"env": {
"SIGNALFX_TOKEN": "${SIGNALFX_TOKEN}",
"SPLUNK_HOST": "https://splunk.internal.atlassian.com",
"SPLUNK_TOKEN": "${SPLUNK_TOKEN}",
"SENTRY_AUTH_TOKEN": "${SENTRY_TOKEN}",
"STATSIG_CONSOLE_KEY": "${STATSIG_KEY}",
"SPINNAKER_API_TOKEN": "${SPINNAKER_TOKEN}",
"OPS_SHERPA_DEFAULT_TZ": "America/Los_Angeles"
}
},
"growth-mcp": {
"type": "http",
"url": "https://growth-mcp.internal.atlassian.com/mcp",
"headers": {
"Authorization": "Bearer ${GROWTH_MCP_TOKEN}"
}
}
}
}
A few practical notes:
- ops-sherpa is published to internal Artifactory and pulled with
npx -yfrom our private npm registry. No global install, always the pinned version, rotation is a re-publish. - All
${VAR}values are read from environment variables. We store them as plaintext in shell profile – not ideal, on the roadmap. Authorization: Bearer <token>is the standard HTTP header for "here's my access token."Beareris just the scheme name (RFC 6750) saying it's an OAuth-style token.- To get one, engineers visit our internal SSO portal, log in, click Generate token for growth-mcp, copy the 1-year token, and paste it into the env var.
- claude.ai uses Settings → Connectors → Add custom connector, with the same OAuth flow on first use.
Code of the agent
A subagent in Claude Code is a markdown file. Mine for incident triage – ~/.claude/agents/debug-incident.md:
---
name: debug-incident
description: Triage a production incident by correlating metrics, logs, errors, recent deploys, and flag changes. Use whenever an on-call mentions a time window and a symptom.
tools:
- mcp__ops-sherpa__correlate_incident
- mcp__ops-sherpa__query_signalfx_metric
- mcp__ops-sherpa__splunk_search
- mcp__ops-sherpa__sentry_issues
- mcp__ops-sherpa__statsig_recent_changes
- mcp__ops-sherpa__spinnaker_recent_deploys
---
You are an incident triage specialist. Given a symptom and a time window:
1. Normalize the time window. Default TZ is PT. Reject windows wider than 6 hours.
2. Call `correlate_incident` first.
3. If signal is weak, fan out to the individual tools in parallel.
4. Return ONLY: top 3 ranked causes, each with probability, evidence, and suggested action (revert / disable gate / no-op).
Be terse. The on-call wants the answer, not the journey.
Invoke as @debug-incident soft signups, around 10:30am PT today.
Calling ops-sherpa
Direct from chat works for quick questions:
"What deployed in
signup-servicebetween 10:00 and 11:00 PT today?"
Claude Code calls spinnaker_recent_deploys, returns a clean list. Done.
For real triage, use the agent: @debug-incident soft signups, around 10:30am PT today. The agent normalizes the window, calls correlate_incident, and returns something like:
- Spinnaker deploy
signup-servicev412 at 10:27 PT (3 min before symptom). Probability 0.62. Action: roll back.- Statsig gate
new_signup_flowflipped to 50% at 10:24 PT. Probability 0.28. Action: disable.- Sentry: 14× spike in
TypeErroron/signup. Probability 0.10, likely downstream symptom.
Triage time went from ~10 minutes of tab-switching to seconds.
Calling growth-mcp
From claude.ai a PM types "week-over-week traffic on the Bitbucket pricing page, by country". growth-mcp forwards to a Genie Space, gets back SQL + result rows + a summary, returns the structured data to the model, which presents it in the user's voice.
Making Genie smarter over time
The MCP layer is thin – a few hundred lines of TypeScript. The Genie Space is the brain, and Genie is only as smart as the metadata you give it. The work is curation.
Setup. Create a Genie Space, attach a small set of tables (we use 12 for growth-traffic: page views, signups, attribution, experiment exposures, product dimensions, a few aggregates). Genie reads three inputs to decide what SQL to write: Unity Catalog metadata on those tables, example SQL queries registered with the Space, and a short text instructions block.
Metadata. Every table and column needs a description that tells Genie when to use it and how it relates to others:
COMMENT ON TABLE growth.analytics.landing_page_daily IS
'Daily aggregated traffic per landing page. One row per (page_id, date).
Use this for week-over-week and trend analysis. For event-level queries
use growth.events.page_views instead.';
The first time we wrote descriptions like that, our wrong-table rate dropped by roughly half.
Example SQL queries. (question, SQL) pairs registered with the Space. The strongest knob – when Genie sees a similar question, it adapts the example instead of inventing SQL. We seed 8–12 per Space:
- "week-over-week traffic for landing page X" → CTE (Common Table Expression) with the right date arithmetic.
- "signups by product last 7 days" → canonical join between
signupsandproduct_dim. - "exposure count for experiment X" → canonical filter on
experiment_exposures.
Joins are declared explicitly via join_specs so Genie does not guess. Text instructions stay short and only cover what SQL cannot express: "Interpret 'last week' as the prior ISO week, Monday–Sunday. Round percentages to two decimals."
The improvement loop. Treat the Space as code:
- Benchmark questions live in a Git repo alongside expected SQL.
- A nightly job replays them through the Genie API and diffs generated SQL against expected. Drift triggers an alert.
- When a user reports a wrong answer, the question becomes a new example SQL – or a metadata fix on the underlying table. Either way, every future user benefits.
- Major schema changes re-run benchmarks before announcement.
The MCP server rarely changes. The Space configuration changes every week, and that is where the actual product lives.
What worked
- Stdio for local power tools. Trivial distribution via internal
npx, zero infrastructure, inherits the engineer's local credentials. - Streamable HTTP for shared services. One curated Genie Space, central observability and audit, no PAT (Personal Access Token) sprawl.
- Tool descriptions as prompts. Rewriting
landing_page_metricsfrom "queries landing page data" to "use when the user asks about a specific page; refers togrowth.analytics.landing_page_dailyjoined toexperiment_exposuresonpage_id" dropped wrong-table errors dramatically. - Curated Genie Space over generic text-to-SQL. Investing in metadata and example SQLs paid back faster than building anything custom.
- Subagents for token discipline. Heavy multi-step work behind an agent keeps the parent thread lean (more on this below).
What didn't
- Upstream rate limits. During a real incident, 50 engineers all hitting
splunk_searchin parallel hit per-user concurrent search caps. Cache identical searches at the tool layer with a short TTL (Time To Live) and dedupe in-flight duplicates (single-flight). Pseudocode: