An AI vendor is best evaluated before the demo — in writing, by a short brief of six to eight questions that force the vendor to describe, in their own words and on their own letterhead, their data assumptions, their failure modes, and the shape of the integration. The demo, if it happens at all, then comes second. This is not an antagonistic move. It is an efficiency. The pitch meeting is specifically engineered to surface what the vendor has rehearsed and to skip what the vendor has not. A written brief, replied to in writing, produces a record the internal team can read, compare, and hand to the next engineer on the evaluation. The call becomes a conversation between two documents rather than a performance.
What the written brief contains
Six to eight questions is the right number. Fewer and the vendor can answer in slogans. More and the reply becomes a brochure. The studio groups the brief into three clusters, each carrying two or three questions the vendor must answer together or not at all.
The first cluster is the system and its failures. What data does the system need from the operator, in what shape, to reach the performance the marketing describes — not the happy-path answer, the real answer, with the columns, the volume, the labelling, the historical window, and the refresh cadence. And, in the same breath, what happens when a query falls outside the training distribution. Does the system decline, hallucinate, hand off, or log and continue. We want the failure mode named rather than euphemised, and we want it named in writing. Named failure modes are the difference between a system an operator can govern and one an operator has to trust.
The second cluster is the operational boundary. Where does the operator's data physically live once it enters the system — which region, which processor, which sub-processors, with what retention, and under what deletion guarantees. What does a production incident look like from the operator's side — who is called, what do they see, what service level is contractually attached to that call, and what is the recorded median time to acknowledge and to resolve over the past two quarters. And at the other end of the relationship, what does a clean exit look like — what does the operator retain, in what format, over what window, if the engagement ends in twenty-four months. A vendor that cannot answer these on a Wednesday will not answer them on a Thursday either.
The third cluster is trust, and it is the one vendors most often fumble. If the underlying model is swapped, deprecated, or re-trained, how is the customer notified, what regression testing is run, and what is the customer's right to remain on the prior version. And — the closing question, the one the vendor is least ready for — what is the vendor not good at, in their own words. The reply the studio has learned to trust most is the one that begins, candidly, "we would not be the right vendor for". A vendor that cannot name a single situation where they would decline has, by implication, declined to think about the question.
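For teams that run several of these evaluations, the brief can be kept as structured data so each written reply can be checked for coverage before the follow-up call. The sketch below is illustrative only — the cluster names and question wording are paraphrased from this essay, not a canonical template — but it enforces the one rule that matters here: a cluster is answered together or not at all, and a partially answered cluster is itself a signal.

```python
# Illustrative sketch: the written brief as structured data. Question
# wording is paraphrased, not a canonical template.

BRIEF = {
    "system and its failures": [
        "What data, in what shape, does the system need to reach the quoted performance?",
        "What happens when a query falls outside the training distribution?",
    ],
    "operational boundary": [
        "Where does our data physically live, with which sub-processors and what retention?",
        "What does a production incident look like from our side, under what service level?",
        "What do we retain, in what format, if the engagement ends in twenty-four months?",
    ],
    "trust": [
        "If the model is swapped or re-trained, how are we notified and what regression testing runs?",
        "What is the vendor not good at, in their own words?",
    ],
}

def coverage(reply: dict) -> dict:
    """Classify each cluster of the brief as answered, partial, or unanswered.

    `reply` maps question text to the vendor's written answer. A cluster
    marked "partial" is the load-bearing result: its questions were meant
    to be answered together or not at all.
    """
    report = {}
    for cluster, questions in BRIEF.items():
        answered = sum(1 for q in questions if reply.get(q, "").strip())
        if answered == len(questions):
            report[cluster] = "answered"
        elif answered == 0:
            report[cluster] = "unanswered"
        else:
            report[cluster] = "partial"
    return report
```

A reply that comes back "partial" on the trust cluster — the model-change question addressed, the weakness question skipped — surfaces in the report exactly the way an omission should: as a fact on the record rather than an impression from a call.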
Why the demo comes second, or not at all
A live demo is a carefully edited film. The dataset is the vendor's, or the vendor's synthesised approximation of a customer's, and the prompts are the ones the vendor's team has iterated against for weeks. None of this is dishonest. It is simply the wrong evidence.
The right evidence is what the vendor writes when no one is performing. The reply to the brief is a document — with a date, an author, and a signature — that the internal team can attach to the decision record. Six months from now, when a mid-level engineer reads back through why the vendor was chosen, the reply will still be there. The demo will not.
A demo answers the question the vendor wants asked. A brief answers the question the buyer needs asked.
We have found, in practice, that the meeting after the brief is shorter, more technical, and more useful than the demo that precedes the brief. Two senior readers — one on each side — work through the reply. The vendor's own engineers, often, are in the room for the first time. The commercial team is quieter. The trade-offs the vendor has already named in writing can now be pressed on. The trade-offs they have not named can now be raised.
Sometimes the brief is enough on its own. The studio has seen operators decline to take the demo at all after reading two vendors' replies side by side. Not because the demo would have been uninformative, but because the reply had already answered the question. An hour of senior time saved, on both sides, and a better artefact to file.
What the vendor's reply tells you
Three readings matter, in order.
The first is the content. Does the reply describe a system the operator's team can actually integrate with, on the operator's data, within the operator's regulatory shape. Content is important and usually the easiest layer to assess. Most operators can read for content without help.
The second is the texture. Written replies have a tell. A vendor whose engineering team has read the brief replies in a different voice from a vendor whose sales team has dictated the reply to the engineering team. The former is specific, willing to say what the system cannot do, and comfortable with qualification. The latter is smooth, generous with superlatives, and reluctant to name a failure mode. The operator learns, from the texture of the reply, which team is going to pick up the phone at three in the morning.
The third is the omission. What the reply does not address is often the load-bearing answer. A vendor whose reply pivots away from the exit question, the sub-processor question, or the model-change question is telling the operator — in the politest possible language — that the answer the vendor does have is one the vendor would prefer not to put on letterhead. That is itself a decision-grade signal.
The studio's stance is narrow, and we will state it here without softening. We do not evaluate AI vendors in demos. We evaluate them on paper, and the studio writes the questions. If a vendor declines to reply in writing within a reasonable window, that is the evaluation. The engagement on the operator's side is then to decline the vendor, politely, and move on. The demo was never the evidence. It was the theatre around the evidence. An operator who has already done the work of knowing what they need does not need the theatre to make the decision.