cMCP 0.4.1
Model Context Protocol library in pure C11
How we get cMCP to a quality bar where an LLM agent (butlerbot, Claude Code, any MCP host) can rely on it without a human in the loop. This is not a feature roadmap — see `TODO.md` for that. This is the quality track: the work that turns "the suite is green" into "an autonomous agent can drive this for hours and survive whatever the wire throws at it."
An agent that picks up a tool through cMCP has no second chance to catch your bug: it sees the result, makes a decision, takes action. The bar is therefore stricter than "passes tests":
`2026-XX-XX`, cMCP must either speak it too or fail cleanly enough that the agent moves on.

All seven axes of this plan are substantively closed. What's now in place:
- `make conformance`).
- `make valgrind`, `make test-asan` (`-fsanitize=address,undefined -fno-sanitize-recover=all`), and `make test-tsan` (`-fsanitize=thread`). Warning-clean under `-Wall -Wextra -Wpedantic`.
- `2025-06-18` shipped: tools, resources, prompts, sampling, roots, elicitation, structured output + `resource_link` + `title`, logging, pagination, cancel + progress. Pin advanced to `2025-11-25` in Tier 6.1.
- `filesystem-mcp` (bundled, no external deps; `fs_write` hardened against symlink-leaf sandbox escape after a Tier-5 playbook P0), `crag-mcp` (separate build, real workload), `echo-server`, plus `tools/cmcp-tee/` for transparent wire capture.
- `cmcp_err_t` distinguishes transport, protocol, parse, schema, timeout, cancelled, unsupported, etc.
- `make test-asan` + `make test-tsan` run on every push (`fail-fast: false`).
- `json`, `rpc`, `schema`, `http`) with curated seed corpora; `make fuzz-smoke` runs each 60s, ~32M execs/min total, zero findings. The HTTP request parser was extracted to `src/http_parser.c` for harness-level driving.
- `tests/test_hostile_peer.c`: 9 cases / 70 assertions exercising malicious-side behaviour against both client and server. Wired into the CI matrix automatically.
- `tests/soak/soak_driver.c` + `run.sh` orchestrator with `/proc`-sampled RSS/FD/Threads + ring-buffered p50/p99 latency and awk drift criteria (RSS ≤15% growth, FDs non-growing, Threads flat, p99 ≤2× baseline). Opt-in via `make soak` / `make soak-churn`.
- `echo`, `filesystem`, `crag`) driven from Claude Code, ~10 tasks each. First pass surfaced one P0 (filesystem symlink-leaf escape: fixed, regression test added, fixture archived) and four description tightenings; zero protocol-level cMCP bugs.
- `conformance/replay/replay.py` + per-fixture registry. `make replay` is a new CI lane that fails on any frame mismatch against the recorded `dir:"out"` frames, with per-fixture masks for legitimately variable fields.
- `scripts/check-spec-version.sh` + weekly `spec-drift.yml` workflow. Was firing against the `2025-11-25` cut while the `2025-06-18` pin held; resolved by Tier 6.1. Upgrade workflow documented in `docs/spec-version-upgrade.md`.

What's still on the runtime/budget side, not engineering:
`cmcp_transport_http_connect`, same metrics + drift criteria.

Each axis below has a name, the gap from current state, concrete actions, and an acceptance criterion that defines "done."
Current state. make conformance cross-checks against the pinned TS SDK at one moment in time. The protocol version is hardcoded in include/cmcp.h (CMCP_PROTOCOL_VERSION "2025-11-25", advanced in Tier 6.1 from 2025-06-18).
Gap. Conformance is one-shot — nothing alerts us when the MCP spec releases a new revision or when our local interpretation drifts from the reference SDK's. A server upgraded to a future spec version won't immediately tell us why it stopped working.
Actions.
1. `make conformance` to a GitHub Actions workflow (or local pre-commit hook) on every push to `main`. Fail the build on any check regression.
2. `scripts/check-spec-version.sh` that fetches the latest version date from modelcontextprotocol.io/specification/ and exits non-zero if it differs from `CMCP_PROTOCOL_VERSION`. Run weekly via cron or GitHub schedule; open an issue automatically on mismatch.
3. `conformance/fixtures/*.jsonl`. On TS SDK upgrade, diff old vs new transcripts; this surfaces subtle spec drift the per-test asserts might miss.

Acceptance criterion. Every push to `main` runs the full conformance battery; a spec-version mismatch alerts within 7 days of upstream release.
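The comparison at the heart of the spec-version check can be sketched in C. This is a minimal illustration, assuming both versions are ISO `YYYY-MM-DD` dates (so plain string comparison orders them); `spec_date_valid` and `spec_drift` are hypothetical names, and fetching the upstream date is left to the script described above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Pinned revision as stated in this plan (include/cmcp.h). */
#define CMCP_PROTOCOL_VERSION "2025-11-25"

/* The pin itself must be exactly 10 characters plus the terminator. */
static_assert(sizeof(CMCP_PROTOCOL_VERSION) == 11, "pin must be YYYY-MM-DD");

/* Hypothetical helper: 1 if s has the shape YYYY-MM-DD, else 0. */
int spec_date_valid(const char *s)
{
    if (s == NULL || strlen(s) != 10) return 0;
    for (int i = 0; i < 10; i++) {
        if (i == 4 || i == 7) {
            if (s[i] != '-') return 0;
        } else if (!isdigit((unsigned char)s[i])) {
            return 0;
        }
    }
    return 1;
}

/*
 * Hypothetical helper:
 *   0  upstream matches the pin
 *   1  upstream is newer than the pin (drift: open an issue)
 *  -1  upstream is older than the pin
 *   2  either input is malformed (treat as a failed check)
 */
int spec_drift(const char *pinned, const char *upstream)
{
    if (!spec_date_valid(pinned) || !spec_date_valid(upstream)) return 2;
    int c = strcmp(upstream, pinned);  /* ISO dates sort lexicographically */
    return (c > 0) - (c < 0);
}
```

Any non-zero result maps naturally onto a non-zero exit status, and the malformed case keeps a broken fetch from being mistaken for "no drift".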
Current state. valgrind clean across 21 binaries. No sanitizer builds.
Gap. Valgrind catches a useful subset (definite leaks, invalid reads/writes via dynamic instrumentation) but misses:
Actions.
1. **`make test-asan`** *(2 hours)*. New target: rebuild everything with `-fsanitize=address,undefined -fno-omit-frame-pointer -fno-sanitize-recover=all` and run the existing 21 binaries. Fix findings; document any false-positive suppression in `tests/asan.supp`.
2. **`make test-tsan`** *(2 hours setup + however long the findings take)*. `-fsanitize=thread`. Expect findings around: writer mutex / `notify_mu` / `inflight_mu` interactions, the per-completion mutex/cv dance, the HTTP slot mailbox. Each finding either (a) reflects a real bug (fix), (b) is benign-by-design (annotate with `__tsan_acquire/release` or document why), or (c) is in third-party code (suppress).
3. `-fanalyzer` weekly run *(half day setup)*. The static analyser catches some classes valgrind never will (e.g., NULL deref on error paths). It is slow, so don't gate CI on it; run weekly.
4. `test-asan` + `test-tsan` into the CI conformance gate from Axis 1 *(folded into that work)*.

Acceptance criterion. `make test-asan` and `make test-tsan` both pass with zero unsuppressed findings. CI runs both on every push.
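When a TSan finding turns out to be an unsynchronised hand-off, one possible fix is to express the hand-off with C11 acquire/release atomics, which ThreadSanitizer models without extra annotations. The sketch below is a hypothetical single-producer/single-consumer slot; `slot_t`, `slot_init`, `slot_post`, and `slot_take` are illustrative names and not cMCP's actual HTTP slot mailbox, whose implementation is not shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-element mailbox: one producer thread, one consumer thread. */
typedef struct {
    void       *payload;  /* plain field, published via `ready` */
    atomic_bool ready;    /* true while a payload is waiting */
} slot_t;

void slot_init(slot_t *s)
{
    assert(s != NULL);
    s->payload = NULL;
    atomic_init(&s->ready, false);
}

/* Producer side: returns false if the previous payload was not taken yet. */
bool slot_post(slot_t *s, void *p)
{
    assert(s != NULL && p != NULL);  /* NULL is reserved for "empty" */
    /* Acquire pairs with the consumer's release, so the consumer's read of
     * the old payload happens-before we overwrite it. */
    if (atomic_load_explicit(&s->ready, memory_order_acquire)) return false;
    s->payload = p;
    /* Release publishes the payload write to whoever observes ready == true. */
    atomic_store_explicit(&s->ready, true, memory_order_release);
    return true;
}

/* Consumer side: returns the payload, or NULL if the slot is empty. */
void *slot_take(slot_t *s)
{
    assert(s != NULL);
    if (!atomic_load_explicit(&s->ready, memory_order_acquire)) return NULL;
    void *p = s->payload;
    atomic_store_explicit(&s->ready, false, memory_order_release);
    return p;
}
```

This only holds for exactly one producer and one consumer; with more writers the check-then-store in `slot_post` would itself be a race, and a mutex or compare-and-swap would be needed instead.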
Current state. Tests cover happy path, schema rejection, transport EIO from peer crash. No mutation-based fuzzing.
Gap. The biggest attack surfaces are parsers:
- `src/json.c`: hand-rolled JSON parser. A malformed embedded string from a server could trigger UB the tests don't reach.
- `src/rpc.c`: JSON-RPC framing. Batch arrays, weird ID types, oversized strings.
- `src/schema.c`: schema validator. Deeply nested schemas, schema loops, malformed types arrays.
- `src/transport_http.c`: HTTP request parser. Header injection, oversized Content-Length, missing Content-Length, malformed chunked encoding.

Also: behaviour against an adversarial peer (one that lies about caps, sends notifications shaped like responses, replies with an `id` we never sent, or violates its own declared `output_schema`).
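To make the Content-Length cases concrete, here is a minimal sketch of a strict length parser that separates missing, malformed, and oversized values. It is not the code in cMCP's HTTP parser; `parse_content_length`, `cl_status_t`, and the 16 MiB `MAX_BODY_BYTES` cap are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative cap on request bodies (an assumption, not a cMCP constant). */
#define MAX_BODY_BYTES ((uint64_t)16 * 1024 * 1024)

typedef enum { CL_OK = 0, CL_MISSING, CL_MALFORMED, CL_TOO_LARGE } cl_status_t;

/*
 * Parse the value of a Content-Length header. `val` points at the value
 * bytes (not NUL-terminated), `len` is their count. Only ASCII digits are
 * accepted, so signs, whitespace, and CR/LF smuggled into the value are
 * all rejected as malformed.
 */
cl_status_t parse_content_length(const char *val, size_t len, uint64_t *out)
{
    assert(out != NULL);
    if (val == NULL || len == 0) return CL_MISSING;

    uint64_t n = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)val[i];
        if (c < '0' || c > '9') return CL_MALFORMED;
        n = n * 10 + (uint64_t)(c - '0');
        /* Checking the cap on every digit keeps n far below UINT64_MAX,
         * so the multiplication above can never wrap. */
        if (n > MAX_BODY_BYTES) return CL_TOO_LARGE;
    }
    *out = n;
    return CL_OK;
}
```

Returning a distinct status per failure lets the caller map each case to its own documented error path instead of one generic parse failure.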
Actions.
1. `fuzz/` with one harness per parser:
   - `fuzz_json_parse.c` → `cmcp_json_parse(data, len)`
   - `fuzz_rpc_parse.c` → `cmcp_rpc_parse(data, len, ...)`
   - `fuzz_schema_validate.c` → both schema-parse and validate
   - `fuzz_http_parse_request.c` → the HTTP server's request parser

   Seed corpus from existing test fixtures; build with `-fsanitize=address,undefined,fuzzer`. Initial 24-hour run per target.
2. `tests/test_hostile_peer.c`: simulate a server that
   - `result` and `error` in the same response
   - `-32600` or `CMCP_EINVAL`, not OOM.

Acceptance criterion. All four fuzz harnesses run 24h with zero crashes / hangs / leaks. Hostile-peer suite passes; every case exits via a documented error path.
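The shape of one such harness can be sketched as follows. `LLVMFuzzerTestOneInput` is the standard libFuzzer entry point; `toy_parse` is a stand-in written for this example (a bracket-balance check with a depth limit), not cMCP's parser. A real harness would call `cmcp_json_parse(data, len)` instead, whose exact signature is not shown in this plan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_DEPTH 64  /* illustrative nesting limit */

/*
 * Stand-in parser (NOT cMCP code): returns 0 if every '[' and '{' is closed
 * by the matching bracket within TOY_MAX_DEPTH levels, -1 otherwise.
 * It ignores string literals; it exists only to give the harness a target.
 */
int toy_parse(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint8_t stack[TOY_MAX_DEPTH];
    size_t depth = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t c = data[i];
        if (c == '[' || c == '{') {
            if (depth == TOY_MAX_DEPTH) return -1;  /* reject, don't overflow */
            stack[depth++] = c;
        } else if (c == ']' || c == '}') {
            if (depth == 0) return -1;
            uint8_t open = stack[--depth];
            if ((c == ']') != (open == '[')) return -1;
        }
    }
    return depth == 0 ? 0 : -1;
}

/*
 * libFuzzer calls this once per generated input. The parse result is
 * deliberately ignored: rejecting bad input is fine, only crashes, hangs,
 * and sanitizer reports count as findings. The function must return 0.
 */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    (void)toy_parse(data, size);
    return 0;
}
```

Built with clang and the `-fsanitize=address,undefined,fuzzer` flags named above, libFuzzer supplies `main` and drives the entry point; the seed corpus directory is passed as a command-line argument to the resulting binary.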
Current state. None of the existing tests run for more than a few seconds. We don't know what hours of continuous use does.
Gap. Likely fault classes that only show under load:
Actions.
1. **`tests/soak/run.sh`** *(2 days)*. Driver script that:
   - `examples/echo-server`
   - `add` and `echo` in a loop with random small arguments
   - `/proc/<pid>/status` (`VmRSS`, `FDSize`, `Threads`) for both client and server processes
   - `soak-history.csv` so regressions are visible over time.

Acceptance criterion. A 6-hour `tests/soak/run.sh` and a 30-minute HTTP soak both pass the criteria above. Connect/disconnect churn shows zero fd or thread growth.
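The sampling and drift check can be sketched in C. This is a minimal illustration, assuming the usual `Key:<whitespace>value` layout of `/proc/<pid>/status` and using the thresholds quoted earlier in this plan (RSS ≤15% growth, FDs non-growing, Threads flat, p99 ≤2× baseline). `status_field`, `soak_sample_t`, and `soak_drift_ok` are hypothetical names, not the shipped driver's API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One sample of the metrics the soak run tracks (hypothetical layout). */
typedef struct {
    long   rss_kb;   /* VmRSS in kB */
    long   fds;      /* FDSize as reported by /proc */
    long   threads;  /* Threads */
    double p99_ms;   /* p99 latency over the sampling window */
} soak_sample_t;

/*
 * Find the line starting with "key:" in the text of /proc/<pid>/status and
 * parse the integer after the colon. Returns 0 on success, -1 if the key is
 * absent or its value is not a number.
 */
int status_field(const char *text, const char *key, long *out)
{
    assert(text != NULL && key != NULL && out != NULL);
    size_t klen = strlen(key);

    for (const char *line = text; line != NULL && *line != '\0'; ) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            /* %ld skips the tab/space padding before the number. */
            return sscanf(line + klen + 1, "%ld", out) == 1 ? 0 : -1;
        }
        line = strchr(line, '\n');
        if (line != NULL) line++;  /* step past the newline */
    }
    return -1;
}

/* Returns 1 if `now` stays within the drift criteria relative to `base`. */
int soak_drift_ok(const soak_sample_t *base, const soak_sample_t *now)
{
    assert(base != NULL && now != NULL);
    /* Integer arithmetic keeps the 15% boundary exact. */
    int rss_ok = (long long)now->rss_kb * 100 <= (long long)base->rss_kb * 115;
    return rss_ok
        && now->fds <= base->fds              /* FDs non-growing */
        && now->threads == base->threads      /* Threads flat */
        && now->p99_ms <= 2.0 * base->p99_ms; /* p99 <= 2x baseline */
}
```

A driver would read the status file into a buffer once per sampling interval, fill a `soak_sample_t`, and compare it against the baseline taken after warm-up.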
Current state. Nobody has watched an LLM actually drive cMCP.
Gap. This is the axis no offline test substitutes for. The specific failure modes:
- `description` fields too terse / too verbose / ambiguous → the model picks the wrong tool or formats arguments badly.
- `inputSchema` overly strict → model retries fail in confusing ways.

Actions.

1. `filesystem-mcp` (over a sandbox dir), `crag-mcp` (over a real corpus), `echo-server` (smoke test) to `~/.claude/mcp_servers.json` (or the equivalent path). Document exact config in `docs/agent-validation.md`.

Acceptance criterion. Weekly playbook run achieves ≥ 9/10 task completion across all three servers, with every remaining failure explicitly classified as "model limitation, not tool issue."
Priority order if I were picking. Numbers are rough effort estimates including the test-it-back-and-fix cycle.
| # | Axis | Effort | Why this order |
|---|---|---|---|
| 1 | Sanitizers (Axis 2.1 + 2.2) | 2 days | Highest signal per hour. Permanent benefit on every test run thereafter. Catches bugs you didn't know existed. |
| 2 | Agent-in-the-loop (Axis 5.1 + 5.2) | 2.5 days + continuous | Free signal you can start today. Discovers failure classes no offline test ever will. The earlier you start, the more cycles of feedback you get. |
| 3 | CI gate (Axis 1.1) | 1 day | Consolidates 1 and 2 so regressions get caught automatically. Cheap once 1 + 2 exist; expensive if you defer it. |
| 4 | Fuzzing (Axis 3.1) | 3 days + 4× 24h runs | Most likely to find real bugs in the parsers — the smallest, best-defined attack surface. |
| 5 | Hostile-peer suite (Axis 3.2) | 2 days | Less likely to find bugs than fuzzing, but covers a different category (semantics, not syntax). |
| 6 | Soak (Axis 4.1 + 4.2 + 4.3) | 3.5 days + nightly runtime | Slowest to give feedback. Important but defer until the cheaper axes are clear. |
| 7 | Spec drift watch (Axis 1.2 + 1.3) | 2.5 days | Future-proofing. Only matters once MCP cuts another revision; not urgent today. |

Total: ~16 person-days of focused work, plus continuous overhead for agent-in-the-loop runs and nightly soak.

Each axis above gets one entry in `TODO.md` under a new "Tier 5 — agentic readiness" section. As tasks land, the acceptance criterion above is the bar for marking them done. Cross-linked from the CHANGELOG once shipped.