Three scope anti-patterns that kill AI projects before kickoff
Over-promising accuracy, deferring success criteria, and confusing outputs with outcomes. How we write scopes that don't fall apart on contact.
Accuracy promises without measurement
We see this constantly: "The model must achieve 95% accuracy." No definition of accuracy. No held-out test set. No annotation protocol. No plan for edge cases.
Accuracy is not a number you declare. It's a number you measure against labelled data that reflects the production distribution. If the scope doesn't specify who labels, how disagreements resolve, or what the evaluation harness looks like, the target is decorative.
Example from a recent brief: "Sentiment classifier must be 90% accurate." We asked: accurate on what data? Three-class or five-class labels? Does neutral count as correct when the ground-truth label is slightly positive? They hadn't thought about it. The 90% was a number someone felt sounded credible.
What works instead: "We'll label 500 representative examples using a rubric you approve. You'll label 50 blind duplicates. If inter-annotator agreement (Cohen's kappa) is below 0.8, we revise the rubric together. The model must match the consensus labels on a 100-example held-out set at ≥85%." Now both parties know what done means.
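The acceptance rule in that scope is simple enough to sketch in Python. This is a minimal illustration, not a prescribed evaluation harness: the function names (cohen_kappa, accuracy, scope_gate) are hypothetical, and the only numbers taken from the scope are the 0.8 kappa floor and the 85% accuracy target.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(a)
    # Observed agreement: share of items where both annotators agree.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


def accuracy(predictions, consensus):
    """Share of held-out items where the model matches the consensus label."""
    return sum(p == c for p, c in zip(predictions, consensus)) / len(consensus)


def scope_gate(vendor_labels, client_labels, predictions, consensus,
               min_kappa=0.8, min_accuracy=0.85):
    """Apply the two checks in order: rubric agreement first, then the model."""
    if cohen_kappa(vendor_labels, client_labels) < min_kappa:
        return "revise rubric"
    if accuracy(predictions, consensus) >= min_accuracy:
        return "accept"
    return "reject"
```

The ordering is the point: the accuracy number is only evaluated once the two parties have shown they agree on what the labels mean.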
Success criteria that aren't criteria
"The system should improve team productivity." Fine — how do you know if it did?
"Users should find it intuitive." Measured how?
"It should reduce manual work." By Tuesday or by 2027?
Vague success criteria are a way to avoid the conversation about what actually matters. If the client can't define success, the project can't fail, but it also can't succeed. You end up with a thing that technically works and that nobody uses, and both sides feel vaguely disappointed.
Real criteria are specific, measurable, and time-bound. "Support agents resolve 15% more tickets per shift within four weeks of launch, measured via Zendesk exports." Or: "Finance team stops manually tagging 80% of transactions, validated by comparing tagged volume in week -1 vs week +4."
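A criterion written this way can be checked mechanically. The sketch below assumes the Zendesk export has already been reduced to a list of tickets resolved per shift for the baseline period and for the post-launch window; the function names and that input shape are hypothetical, and a plain comparison of means ignores seasonality and noise, which a real measurement plan should address.

```python
from statistics import mean


def relative_lift(baseline_per_shift, post_launch_per_shift):
    """Relative change in mean tickets resolved per shift."""
    base = mean(baseline_per_shift)
    if base <= 0:
        raise ValueError("baseline mean must be positive")
    return (mean(post_launch_per_shift) - base) / base


def criterion_met(baseline_per_shift, post_launch_per_shift, target_lift=0.15):
    """True if the post-launch window clears the agreed lift."""
    return relative_lift(baseline_per_shift, post_launch_per_shift) >= target_lift


# Hypothetical numbers for illustration only, not real client data.
baseline = [20, 19, 21, 20]
after_launch = [24, 25, 23, 24]
print(criterion_met(baseline, after_launch))
```

If nobody can say where the two input lists would come from, the criterion is not measurable yet, whatever the wording suggests.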
If the client pushes back — "we don't know the exact number yet" — that's fine. Commit to discovery. Write the scope in two phases: week one is instrumentation and baselining. Week two onward is the build, now that you both know what the baseline is and what improvement is worth.
The hard conversation about success criteria belongs in the scope, not in the retro six months later.
Outputs vs outcomes
A scope that lists features is not a scope. It's a shopping list.
"Deliverables: NLP pipeline, API endpoint, React dashboard." Okay — and these do what, exactly? A pipeline that runs isn't the same as a pipeline that changes how the business works.
We once inherited a project where the previous vendor delivered everything in the statement of work. The model ran. The API returned JSON. The dashboard rendered charts. The client was furious, because none of it reduced the thing they actually cared about: time spent triaging inbound documents. The scope had described outputs. It hadn't described the outcome.
Outcome-driven scopes start with the behaviour change or business metric, then work backwards. "Analysts spend ≤30 seconds per document instead of ≥4 minutes." Now you can ask: what has to be true for that to happen? Maybe it's auto-classification. Maybe it's better search. Maybe it's just a better default sort order. The scope becomes a hypothesis about how to move the number, not a list of technologies.
Include the outputs, obviously — someone has to build the API — but frame them as steps toward the outcome, not the finish line.
What a clean scope looks like
We write scopes in three sections.
Problem: What's broken or slow today? Quantify it. "Sales engineers manually enrich 120 leads/day from partial form data. Takes 8 minutes per lead. Error rate unknown but anecdotally high."
Success: What's different when this works? "Enrichment happens automatically for 80% of leads. Remaining 20% flagged for human review with pre-filled suggestions. Sales engineers spend under 2 min/lead on reviewed cases. Measured via Salesforce activity logs."
Scope: What we're building, what we're not, and how we'll know if it worked. Includes the measurement plan, the data pipeline, the annotation process if relevant, and the decision points. "If auto-enrichment accuracy is below 70% after week 3, we pivot to a review-assist UI instead of full automation."
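The numbers in this example are worth working through, because they show what the scope is actually buying. The sketch below uses only figures from the three sections above, plus one labelled assumption: that the automatically enriched 80% need no human time at all. The function names are hypothetical.

```python
def manual_minutes_per_day(leads, minutes_per_lead, human_share=1.0):
    """Daily human minutes spent on the share of leads that people touch."""
    return leads * human_share * minutes_per_lead


def week3_decision(auto_enrichment_accuracy, pivot_below=0.70):
    """The decision point from the scope, expressed as a rule."""
    if auto_enrichment_accuracy < pivot_below:
        return "pivot to review-assist UI"
    return "continue with full automation"


# Problem: 120 leads/day at 8 minutes each, all handled manually.
today = manual_minutes_per_day(120, 8)
# Success: 20% flagged for review at under 2 minutes each (an upper bound).
# Assumption: the automated 80% cost zero human minutes.
target_ceiling = manual_minutes_per_day(120, 2, human_share=0.20)

print(f"today: {today / 60:.1f} hours/day")
print(f"target: under {target_ceiling:.0f} minutes/day")
print(week3_decision(0.65))
```

That is 16 hours of manual enrichment per day against a ceiling of roughly 48 minutes if the success criteria hold, which is the kind of gap both parties can verify from the activity logs.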
The goal is a document that both parties can read in three months and agree on whether the thing happened. No ambiguity. No "it depends what you mean by accurate."
If your scope draft doesn't let you do that, it's not done yet.