Traditional software acceptance criteria assume determinism. You click a button, you get a predictable response. AI features are different: they are probabilistic, sensitive to phrasing, and prone to confident mistakes. That makes “done” harder to define and easier to argue about.
Small teams feel this most. You do not have a dedicated evaluation group or time for long research cycles, yet you still need a reliable product. The goal is not perfection. The goal is to make quality explicit, testable, and repeatable enough that you can ship and maintain the feature.
This playbook shows how to write acceptance criteria for AI features that work like engineering constraints: they describe what must be true, how you will check it, and what happens when it is not true.
Why acceptance criteria matter for AI outputs
AI features fail in ways that normal acceptance criteria do not catch. The UI can “work” while the content is misleading, unsafe, or simply unhelpful. Without clear criteria, teams tend to fall back on subjective review: “Looks good to me.” That approach does not scale and usually breaks at the worst time, such as a rushed release.
Well-written AI acceptance criteria give you:
- Shared expectations between product, engineering, and reviewers, including what “good enough” means.
- Testability through examples and rubrics so you can evaluate quality without debating taste.
- Change control when prompts, models, or retrieval sources change.
- Risk control by defining what the system must refuse, redact, or escalate.
The best criteria connect to user value. “The model uses a temperature of 0.2” is not an acceptance criterion. “The output is concise, cites the provided sources, and never invents a policy” is.
Key Takeaways
- Write AI acceptance criteria as observable properties of the output, not implementation details.
- Use a rubric with a small set of dimensions (accuracy, completeness, safety, format) and clear pass thresholds.
- Maintain a mini evaluation set of real inputs and rerun it whenever prompts, models, or data sources change.
- Add criteria for what the system does when uncertain: ask a question, refuse, or route to a human.
Define the job: inputs, outputs, and boundaries
Before you can accept an AI feature, you need to define its job in plain language. This is the equivalent of an API contract: what goes in, what comes out, and what it is not allowed to do.
1) Input contract
List the inputs the AI is allowed to use. This includes user text, selected records, uploaded documents, and any retrieved knowledge base content. Be explicit about what is not available. If the model cannot access live systems, say so.
- What fields are provided? Are they optional?
- How long can the input be before you truncate or summarize it?
- Are there any data types you must remove (emails, phone numbers, secrets)?
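The removal and truncation rules above can be sketched as a preprocessing step. Everything here is illustrative: the limit, the function name, and the regex patterns are placeholders, not a complete PII scrubber.

```python
import re

# Hypothetical contract values -- replace with your own input limits.
MAX_INPUT_CHARS = 8000
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def prepare_input(text: str) -> str:
    """Redact disallowed data types, then truncate to the contract limit."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    text = PHONE_RE.sub("[REDACTED_PHONE]", text)
    return text[:MAX_INPUT_CHARS]
```

The point is that the input contract is enforced in code before the model ever sees the text, rather than relying on the prompt to ignore sensitive fields.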
2) Output contract
Define the output shape and constraints. This is where acceptance criteria become concrete: required sections, maximum length, tone guidelines, and any structured fields that downstream systems rely on.
- Required format (bullets, headings, JSON, or a fixed template)
- Length constraints (for readability and cost control)
- What must be present (action items, risks, next steps)
- What must never be present (sensitive data, invented citations)
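An output contract like this is checkable before anything reaches the user or a downstream system. In this sketch, the field names, length limit, and forbidden markers are all assumed placeholders for your own contract:

```python
# Hypothetical contract: field names, limit, and markers are examples only.
REQUIRED_FIELDS = {"summary", "action_items", "next_steps"}
MAX_OUTPUT_CHARS = 1500
FORBIDDEN_MARKERS = ["ssn", "password"]  # content that must never appear

def check_output_contract(payload: dict) -> list[str]:
    """Return contract violations; an empty list means the output passes."""
    violations = []
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        violations.append(f"missing fields: {sorted(missing)}")
    body = " ".join(str(v) for v in payload.values())
    if len(body) > MAX_OUTPUT_CHARS:
        violations.append("output exceeds length limit")
    for marker in FORBIDDEN_MARKERS:
        if marker.lower() in body.lower():
            violations.append(f"forbidden content: {marker}")
    return violations
```

Returning a list of violations, rather than a boolean, gives reviewers labels they can tally during evaluation.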
3) Boundaries and refusal behavior
AI acceptance criteria should include “what happens when the AI cannot do the job.” This is where you prevent silent failures.
- If the input is missing required context, the system asks 1 to 3 clarifying questions.
- If the request conflicts with policy, the system refuses and explains the limitation.
- If confidence is low (based on heuristics you define), the system labels the output as a draft and requests review.
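The three boundary rules above can be captured in a small dispatch function. The behavior names and the 0.6 threshold below are illustrative stand-ins for whatever heuristics your team defines:

```python
def decide_behavior(has_required_context: bool,
                    policy_conflict: bool,
                    confidence: float) -> str:
    """Map boundary conditions to the contracted behavior (a sketch)."""
    if policy_conflict:
        return "refuse_with_explanation"
    if not has_required_context:
        return "ask_clarifying_questions"  # 1 to 3 questions per the contract
    if confidence < 0.6:                   # heuristic threshold you define
        return "label_as_draft"
    return "answer"
```

Making this an explicit, ordered decision prevents the silent-failure case: there is no path where the system generates a confident answer from missing context.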
Write testable criteria with a rubric
A rubric turns “quality” into a small number of scored dimensions, which makes it easier to review, easier to compare across versions, and easier to explain to stakeholders. Keep it simple: 3 to 6 dimensions is usually enough.
A rubric pattern you can reuse
Pick dimensions that reflect user harm and user value. Common ones are:
- Factual accuracy: claims are supported by provided input or retrieved sources.
- Completeness: required sections are present; key points are not omitted.
- Faithfulness: the summary reflects the input without changing meaning.
- Safety and privacy: no sensitive leakage; compliant tone and content.
- Format correctness: conforms to the template so downstream systems do not break.
Define a score scale (for example 0 to 2, or 1 to 5) and a pass threshold. Then translate that into acceptance criteria such as “At least 90% of evaluation items score 2 out of 2 for format correctness, and 0 items contain restricted content.”
Rubric (0-2)
- Accuracy: 0 incorrect, 1 minor issue, 2 fully supported by input
- Completeness: 0 missing required sections, 1 partial, 2 complete
- Safety/Privacy: 0 violation, 1 borderline, 2 clean
Pass rules
- Safety/Privacy: no 0 scores allowed
- Overall: average >= 1.6 across items
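The pass rules above translate directly into a release-gate check. This sketch assumes each evaluation item is scored as a dict of dimension scores, and reads “average >= 1.6” as the mean over all dimension scores across items:

```python
def passes_release_gate(scores: list[dict]) -> bool:
    """Apply the pass rules: no Safety/Privacy zeros, overall average >= 1.6."""
    if any(item["safety"] == 0 for item in scores):
        return False  # safety violations are non-negotiable blockers
    all_values = [v for item in scores for v in item.values()]
    return sum(all_values) / len(all_values) >= 1.6
```

Checking the non-negotiable rule first mirrors how the rubric is written: no averaging can rescue a safety violation.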
The rubric is also a writing tool. When you struggle to define a dimension, it often means the feature’s job is still unclear. That is useful feedback before you build more.
Design a lightweight evaluation loop
Acceptance criteria only help if you actually check them. The good news is you can build a lightweight loop without heavy infrastructure.
A copyable evaluation checklist
- Collect 20 to 50 real inputs that represent the feature’s normal usage. Include a few tricky edge cases.
- Freeze the inputs as your “mini evaluation set” and store them in a private place accessible to the team.
- Define your rubric (3 to 6 dimensions) and write examples of what “2 out of 2” looks like.
- Run the feature on the evaluation set and capture the outputs.
- Score a sample (start with 10 items) to calibrate reviewer interpretation, then score the full set.
- Record failures with short labels (hallucination, missing section, privacy leak, wrong tone, wrong action item).
- Turn top failures into criteria (new guardrails, template changes, clarifying question behavior).
- Rerun the set whenever you change prompts, model versions, retrieval sources, or post-processing.
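The run-and-score steps of this checklist can be wired together in a few lines. Here `generate` and `score` are stand-ins for your feature under test and your rubric scorer, and the record shape is an assumption:

```python
import statistics

def run_eval(eval_set: list[dict], generate, score):
    """Run the frozen evaluation set through the feature and record scores."""
    results = []
    for item in eval_set:
        output = generate(item["input"])
        results.append({"id": item["id"], "output": output,
                        "scores": score(output)})
    # Per-dimension averages, so you can compare runs across changes.
    averages = {
        dim: statistics.mean(r["scores"][dim] for r in results)
        for dim in results[0]["scores"]
    }
    return results, averages
```

Because the loop takes the feature and scorer as parameters, the same harness reruns unchanged when you swap prompts, models, or retrieval sources.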
If your team is very small, you can alternate reviewers and only score a rotating subset each release. The important part is consistency: the same inputs, the same rubric, tracked over time.
A concrete example: AI-generated meeting summaries
Imagine you are building an internal feature that takes a meeting transcript and produces a summary for a project channel. The value is saving time and creating shared clarity, but the risk is high: a wrong decision or misattributed action item can cause real confusion.
Acceptance criteria you could ship with
- Format: Output contains exactly four sections in order: “Decisions”, “Action Items”, “Open Questions”, “Risks”. Each section has 1 to 8 bullets.
- Faithfulness: Every bullet must be traceable to the transcript content. If the transcript is ambiguous, the bullet must be labeled “Unconfirmed”.
- Action items: Each action item includes an owner if one is mentioned. If none is mentioned, owner is “Unassigned” and the system adds one clarifying question.
- Safety/privacy: No personal data beyond first names already present in the transcript. No speculation about performance, intent, or emotions.
- Length: Total output under 2200 characters to keep it readable in chat tools.
- Failure behavior: If the transcript is under 200 words, the system asks for more context rather than generating a “best guess” summary.
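Several of these criteria are mechanically checkable. A minimal sketch covering only the section, ordering, and length rules (bullet-count and faithfulness checks are omitted here):

```python
REQUIRED_SECTIONS = ["Decisions", "Action Items", "Open Questions", "Risks"]
MAX_SUMMARY_CHARS = 2200

def check_summary(text: str) -> list[str]:
    """Check section presence, section order, and total length."""
    problems = []
    if len(text) > MAX_SUMMARY_CHARS:
        problems.append("over 2200 characters")
    positions = [text.find(section) for section in REQUIRED_SECTIONS]
    if -1 in positions:
        problems.append("missing required section")
    elif positions != sorted(positions):
        problems.append("sections out of order")
    return problems
```

Checks like this run on every generation, not just during evaluation, so a format regression is caught before the summary reaches the channel.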
Now tie those to a rubric and a mini evaluation set. For example, you might accept launch when 45 out of 50 summaries pass format and privacy checks, and the remaining 5 are minor completeness issues. You also add a non-negotiable rule: privacy violations are release blockers.
This example also highlights a practical mindset: acceptance criteria can require the AI to be honest about uncertainty. For many business features, “I cannot confirm from the input” is a better outcome than a confident fabrication.
Common mistakes to avoid
- Criteria that describe implementation instead of outcomes. Users do not care that you used retrieval or a certain model. They care that answers are grounded and formatted correctly.
- One giant criterion called “must be accurate”. Break accuracy down into observable rules: no invented names, no invented policy, no numbers unless present in input.
- No definition for “uncertain” cases. If you do not specify what to do when context is missing, the system will improvise. Improvisation is the default failure mode.
- Ignoring the downstream consumer. If another system parses the output, acceptance criteria must include strict formatting and a plan for validation.
- Overfitting to a few happy-path examples. Include edge cases: messy inputs, contradictory notes, partial records, and “user tries to trick the system” prompts.
When not to ship an AI feature
Some problems are poor fits for probabilistic output, at least without heavy controls. Consider not shipping (or narrowing scope) when:
- You cannot define what “correct” means in a way reviewers can agree on. If you cannot write a rubric, you probably do not have a stable product requirement.
- The cost of a wrong answer is too high and you cannot add verification steps. High-stakes decisions require strong checks, not just better prompting.
- You lack reliable source data. If the feature depends on retrieving policies or product facts that are not maintained, the system will decay and acceptance criteria will fail often.
- You cannot log or review outputs due to constraints, yet you still need to detect failure patterns. Without feedback, quality will drift unnoticed.
A useful compromise is to ship the AI as a draft generator with explicit review steps, rather than as an authoritative automation.
Conclusion
Acceptance criteria for AI features work best when they look like product constraints: clear output contracts, explicit boundaries, and a rubric that turns subjective review into repeatable scoring. You do not need a research department to do this. You need a small evaluation set, a lightweight loop, and a willingness to treat “uncertainty behavior” as a first-class requirement.
When you can say “this feature passes because it meets these measurable criteria,” you can ship faster and sleep better.
FAQ
How many acceptance criteria should an AI feature have?
Aim for 6 to 12 criteria total, grouped into themes: format, accuracy or faithfulness, safety or privacy, and failure behavior. If you have more than that, you probably need to simplify the feature or move details into a rubric.
Do I need automated evaluation to use this approach?
No. Manual scoring with a rubric is enough to start, especially for small teams. The key is consistency: same inputs, same rubric, and recorded results so you can compare across changes.
What if reviewers disagree on rubric scores?
That is normal early on. Do a short calibration session: review 5 outputs together, discuss disagreements, and refine the rubric language with examples. The goal is to reduce ambiguity, not to “win” an argument.
How often should we rerun the evaluation set?
Rerun it whenever you change anything that can affect outputs: prompts, model versions, retrieval sources, post-processing rules, or templates. For stable systems, a periodic rerun also helps catch drift in upstream data quality.