MatchSpec

Evaluation framework for AI systems.

Most teams evaluate AI output by reading it and deciding if it "looks good." MatchSpec replaces that with Go code: you write specs that define correctness, run them against model output, and get a pass/fail result you can gate a deployment on.
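To make that concrete, here is a minimal sketch of the idea in plain Go. The `Spec` struct, `Check` predicate, and `Run` helper are illustrative names invented for this example, not MatchSpec's actual API; see the package documentation for the real types.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Spec pairs a human-readable name with a predicate over model output.
// (Illustrative type, not MatchSpec's real API.)
type Spec struct {
	Name  string
	Check func(output string) bool
}

// Run evaluates every spec against one output and reports pass/fail.
func Run(output string, specs []Spec) bool {
	pass := true
	for _, s := range specs {
		ok := s.Check(output)
		fmt.Printf("%-25s %v\n", s.Name, ok)
		if !ok {
			pass = false
		}
	}
	return pass
}

func main() {
	output := `{"answer": "4", "confidence": 0.92}` // model output under test

	specs := []Spec{
		{"contains an answer field", func(o string) bool { return strings.Contains(o, `"answer"`) }},
		{"stays under 200 chars", func(o string) bool { return len(o) <= 200 }},
	}

	if !Run(output, specs) {
		os.Exit(1) // non-zero exit lets a CI job block the deployment
	}
}
```

The non-zero exit code is what makes the result gateable: a CI step that runs the eval fails the pipeline whenever a spec fails.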

```sh
go get github.com/greynewell/matchspec
```

Why evals matter

Fine-tuned models silently degrade. Prompt changes break things that used to work. Without automated evaluation, you find out from users. Eval-Driven Development makes evaluation the first step, not the last — you define what correct means before you ship anything.

MatchSpec is how you implement that workflow. It handles spec definitions, the test harness, and pass/fail thresholds. It speaks the same MIST protocol as InferMux, SchemaFlux, and TokenTrace, so eval results flow through the same infrastructure as the rest of your stack.
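As an illustration of how a pass/fail threshold can gate a deployment, the sketch below runs one spec over a batch of outputs and exits non-zero when the pass rate drops below 90%. The spec, the sample outputs, and the threshold value are assumptions made for this example; MatchSpec's own harness and threshold configuration may differ.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// checkJSONAnswer is a stand-in spec: the output must contain an "answer" field.
// (Illustrative only; not MatchSpec's real API.)
func checkJSONAnswer(output string) bool {
	return strings.Contains(output, `"answer"`)
}

func main() {
	outputs := []string{ // a batch of model outputs captured from a test run
		`{"answer": "4"}`,
		`{"answer": "Paris"}`,
		`Sorry, I can't help with that.`,
	}

	passed := 0
	for _, o := range outputs {
		if checkJSONAnswer(o) {
			passed++
		}
	}

	rate := float64(passed) / float64(len(outputs))
	const threshold = 0.90 // require 90% of outputs to satisfy the spec

	fmt.Printf("pass rate: %.2f (threshold %.2f)\n", rate, threshold)
	if rate < threshold {
		os.Exit(1) // fail the CI job so the deployment is blocked
	}
}
```

Thresholds like this are useful when a spec is expected to hold for most, but not necessarily all, outputs; a hard gate on every single output is just the special case of a 100% threshold.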