LLM features often ship with acceptance criteria like “sounds helpful” or “no hallucinations.” That feels reasonable until you need to decide whether a change improved things, introduced regressions, or made one customer segment unhappy.
The good news is you do not need a research team or a large labeling budget to make quality measurable. You need a small set of realistic examples, a rubric that reflects your product’s risks, and a repeatable way to run checks whenever you change prompts, retrieval, or model settings.
This post outlines a lightweight approach for turning “vibe-based quality” into testable checks a small team can run weekly or on every release.
What you’re actually testing (and why prompts are not enough)
In traditional software, requirements map to deterministic behavior. With LLMs, behavior is probabilistic and heavily dependent on context. That does not make quality untestable, but it does change what your “unit” of testing is.
For most LLM features, you are testing a system, not a prompt. Common system components include:
- Inputs: user message, conversation history, structured fields, attachments converted to text.
- Retrieval: documents, records, or snippets fetched and injected into the prompt as context.
- Instructions: the system prompt, policies, tone guidance, and formatting rules.
- Tools: functions the model can call, such as order lookup or ticket creation.
- Output constraints: required structure (for example, JSON fields or length limits) and content the model must never produce.
Because the system is the product, acceptance criteria should measure outcomes that matter to users and the business. “Model is GPT-X” is not an outcome. “Answer includes the correct return-window and links to the right policy section” is.
Define success with a simple rubric
A rubric is a short set of scoring rules that transforms a subjective judgment into something consistent. The goal is not perfect objectivity. The goal is repeatability, so changes can be compared over time.
A practical rubric structure
Start with 3 to 6 criteria. For each one, define what “pass” means and what “fail” looks like. Keep it anchored to observable behavior.
- Correctness: every stated fact matches the source of truth; fail if any key fact is wrong.
- Completeness: the answer addresses everything the user asked; fail if a question is ignored.
- Grounding: claims are supported by the provided context; fail if the answer invents details.
- Safety and policy: no prohibited claims and no requests for data you should not collect; fail on any violation.
- Format and actionability: the output matches the required structure and tells the user what happens next; fail if it is unparseable or leaves the user stuck.
Then decide your scoring model. Two lightweight options:
- Binary pass/fail per criterion (fastest for small teams).
- 0/1/2 scale per criterion (useful if you want “acceptable but imperfect” as a middle state).
Finally, set thresholds. For example: “Correctness must be pass for 95% of cases, Safety must be pass for 100%.” The point is to encode which failures are release blockers.
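As a minimal sketch, such thresholds can be encoded as plain data and checked in a few lines. The criterion names and rates below are illustrative, not a fixed schema:

```python
# Illustrative release thresholds: minimum pass rate per criterion.
THRESHOLDS = {
    "correctness": 0.95,
    "safety_policy": 1.00,
}

def release_blocked(pass_rates: dict) -> list:
    """Return the criteria whose pass rate falls below its threshold."""
    return [
        name for name, minimum in THRESHOLDS.items()
        if pass_rates.get(name, 0.0) < minimum
    ]

# Example: safety passed on 49 of 50 cases (0.98), which blocks release.
print(release_blocked({"correctness": 0.96, "safety_policy": 0.98}))
# → ['safety_policy']
```

Keeping thresholds as data (rather than burying them in scripts) makes it easy to version them alongside the rubric.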
Build a small test set that stays representative
Your test set should reflect real use, not ideal use. A tiny but representative set beats a large synthetic set that never happens in your product.
Here is a concrete example: a small ecommerce business adds an LLM support assistant that drafts replies about shipping, returns, and product compatibility. A good test set might include:
- Short questions (“Where’s my order?”) and long messy ones (multiple issues in one message).
- Requests where the assistant should ask for missing info (order number, email, product variant).
- Cases where policy matters (return window exceptions, damaged goods process).
- Edge cases (international shipping restrictions, discontinued products).
- Style expectations (empathetic tone for complaints, concise tone for simple status checks).
Keep it small at first: 30 to 80 cases is enough to catch most regressions if the cases are diverse.
What each test case should contain
Store test cases as structured records so they can be replayed. Even a simple JSON file or database table works. Each case should include:
- Input message (and optionally prior turns if your feature is conversational).
- Allowed context (documents, order record, product specs) if retrieval or tools are involved.
- Expected facts (the few key points that must be true).
- Forbidden claims (things the assistant must not say).
- Expected action (draft response only, create ticket, ask clarifying question).
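A single test case following the fields above might look like this. The field names are one possible schema, not a standard, and the policy details are invented for illustration:

```python
import json

# One test case as a plain dict; a JSON file of these is enough to start.
case = {
    "id": "returns-007",
    "input": "I bought boots 40 days ago and they fell apart. Can I return them?",
    "context": ["returns_policy.md#damaged-goods"],  # allowed retrieval sources
    "expected_facts": [
        "damaged goods are covered beyond the standard return window",
        "customer should provide the order number",
    ],
    "forbidden_claims": ["refund is already approved"],
    "expected_action": "ask_clarifying_question",
}

print(json.dumps(case, indent=2))
```

Because the record is plain data, the same file can feed manual review sessions and automated runs.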
Run evaluations like software tests
Once you have a rubric and a test set, treat evaluation as a test run that happens repeatedly, not as a one-time launch activity. The goal is to detect regressions when you change any of these:
- Prompt instructions or formatting
- Retrieval settings or document sources
- Tool behavior (for example, the “lookup order” function)
- Model choice and decoding settings
A lightweight workflow is: replay your test set through the system, score outputs, then decide if the build passes thresholds.
```yaml
eval_run:
  dataset: support_assistant_v1
  build: prompt_12 + retrieval_3 + model_A
  checks:
    - correctness: must_pass >= 0.95
    - safety_policy: must_pass == 1.00
    - format_json: must_pass >= 0.98
  outputs:
    - metrics_summary
    - failure_examples_top_20
```
You can score some criteria automatically (formatting, presence of required fields, banned phrases). Others will be human-scored or LLM-assisted with human spot checks. The trick is to keep the combined process cheap enough to run often.
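The automatic checks for mechanical criteria can be very small. The sketch below scores outputs for JSON validity and banned phrases, then aggregates pass rates; the check names, required fields, and phrases are illustrative assumptions:

```python
import json

# Phrases the assistant must never emit (illustrative).
BANNED_PHRASES = ["your refund has been approved", "guaranteed"]

def check_format_json(output: str) -> bool:
    """Pass if the output parses as JSON with the fields we require."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return "reply" in parsed and "action" in parsed

def check_safety_policy(output: str) -> bool:
    """Pass if no banned phrase appears (case-insensitive)."""
    lowered = output.lower()
    return not any(p in lowered for p in BANNED_PHRASES)

def pass_rates(outputs: list) -> dict:
    """Fraction of outputs passing each automatic check."""
    checks = {"format_json": check_format_json,
              "safety_policy": check_safety_policy}
    return {name: sum(fn(o) for o in outputs) / len(outputs)
            for name, fn in checks.items()}

outputs = [
    '{"reply": "Please share your order number.", "action": "ask_clarifying_question"}',
    "Your refund has been approved!",  # fails both checks
]
print(pass_rates(outputs))
# → {'format_json': 0.5, 'safety_policy': 0.5}
```

Checks like these cost nothing to run on every build, which frees the human-review budget for the judgment-heavy criteria.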
A copyable checklist for “done”
Use this as a release checklist for an LLM feature change:
- Test set updated with any new feature behavior and at least 5 new real user-like cases.
- Rubric reviewed: are the criteria still aligned with risk (safety, policy, correctness)?
- Eval run completed on the full test set.
- Thresholds met, or deviations documented and approved.
- Top failure examples reviewed and categorized (prompt issue, retrieval issue, tool issue, unclear product rule).
- Monitoring plan updated (what will you sample and review after release, and how often?).
Human review without the bottleneck
Human judgment is still the best tool for criteria like “helpful” and “appropriate tone.” The problem is cost and time. The solution is to use humans where they add the most value.
A practical approach for small teams:
- Full human scoring on a small “golden set” (for example, 20 cases) that represents your highest-risk scenarios.
- LLM-assisted scoring on the broader set (for example, 60 cases), with explicit instructions based on your rubric.
- Human spot checks of a rotating sample of the LLM-scored outputs (for example, 10 per run) to keep the scorer honest.
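Picking the rotating spot-check sample can be as simple as a seeded random draw over the LLM-scored results. Seeding by run number keeps each run's sample reproducible while rotating across runs; the names here are illustrative:

```python
import random

def spot_check_sample(case_ids: list, run_number: int, k: int = 10) -> list:
    """Deterministically sample k case IDs to hand-review for this run."""
    rng = random.Random(run_number)  # per-run seed: rotating but reproducible
    return sorted(rng.sample(case_ids, min(k, len(case_ids))))

ids = [f"case-{i:03d}" for i in range(60)]
print(spot_check_sample(ids, run_number=7, k=3))
```

Reproducibility matters here: if a spot check uncovers a scoring problem, you can rerun the exact same sample after fixing it.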
Over time, your test set becomes a knowledge asset: it captures what your product considers “good,” not just what a model happens to output in a demo.
Key Takeaways
- Acceptance criteria for LLM features should describe outcomes (facts, actions, constraints), not prompts or models.
- A small rubric plus a representative test set can catch most regressions without heavy infrastructure.
- Automate what you can (format, required fields, forbidden claims) and reserve humans for judgment calls.
- Track failures by root cause (prompt, retrieval, tools, policy ambiguity) to make fixes faster.
Common mistakes
- Testing only “happy paths.” Your model will look great until real users paste messy, multi-part, emotional requests. Include those on purpose.
- Using one giant subjective score. “Quality: 7/10” does not tell you what to fix. Split scoring into criteria that map to concrete changes.
- Changing the rubric every run. If you move the goalposts, metrics stop being comparable. Evolve the rubric slowly, and version it.
- Ignoring retrieval failures. Many “hallucinations” are actually “missing context” problems. Track whether the needed source content was available.
- Not saving failure examples. Numbers alone do not drive improvements. Keep the worst outputs and revisit them after fixes.
When NOT to do this
Formal acceptance criteria and repeated eval runs are valuable, but not always the first step. Consider keeping things lightweight if:
- The feature is internal-only and low impact, such as brainstorming copy where outputs are always edited by a person.
- You are still exploring the product shape, and requirements change daily. In that phase, capture example failures, but do not over-invest in thresholds.
- You cannot define a source of truth, and users themselves disagree on what “correct” means. Start by clarifying product rules or offering choices to users.
Even then, it is still worth writing down a minimal “must not do” list (for example, do not claim refunds are approved, do not request sensitive information) and testing those constraints.
Conclusion
LLM quality becomes manageable when you turn it into a set of repeatable checks. A rubric defines what matters, a small test set keeps you honest, and a simple pass/fail threshold makes releases less stressful.
If you only do one thing: start collecting representative examples and agree on what “pass” means for correctness and policy. Everything else gets easier once those two are explicit.
FAQ
How big should my test set be?
Start with 30 to 80 cases. If you can only do 20, do 20, but ensure they cover different customer intents, edge cases, and failure modes. Add a few new cases whenever you learn something from production.
Do I need automated scoring to be “real” evaluation?
No. Manual scoring on a small set is already a major improvement over informal reviews. Automation helps you run checks more often, but the bigger win is consistency and versioned expectations.
What if my team disagrees on what “good” looks like?
That is exactly what a rubric is for. Write down the criteria, score a handful of outputs independently, then compare and resolve differences. The rubric becomes a shared contract between product, engineering, and support.
How do I handle cases where multiple answers could be acceptable?
Focus scoring on required facts and forbidden claims, not exact wording. If tone matters, define acceptable ranges (for example, must acknowledge frustration, must offer next steps) rather than specific phrases.
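One way to score “required facts, not exact wording” is normalized substring matching. This sketch is deliberately naive (a real product may want fuzzier matching or a judge model), but it shows the shape of the check:

```python
def score_answer(answer: str, required_facts: list, forbidden_claims: list) -> dict:
    """Pass only if every required fact appears and no forbidden claim does."""
    text = " ".join(answer.lower().split())  # normalize case and whitespace
    missing = [f for f in required_facts if f.lower() not in text]
    violations = [c for c in forbidden_claims if c.lower() in text]
    return {"pass": not missing and not violations,
            "missing": missing,
            "violations": violations}

result = score_answer(
    "You can return items within 30 days. Could you share your order number?",
    required_facts=["30 days", "order number"],
    forbidden_claims=["refund approved"],
)
print(result["pass"])
# → True
```

Returning the missing facts and violations, not just a boolean, makes failures immediately diagnosable.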
How often should I run evaluations?
At minimum: whenever you change prompts, retrieval, tools, or model settings. Many teams also run a scheduled check (weekly or monthly) to catch drift from upstream sources, new content, or evolving user behavior.