Microsoft prefers to land Copilot at scale through what is described as a pilot. The structure favored by the account team is six months, three hundred seats, no controls, no measurement plan, and a renewal posture that treats the pilot as evidence of commitment regardless of outcome. The right pilot is ninety days, two cohorts, a measurement plan defined before deployment, and contract language that lets the buyer step back to zero or scale to the full population with the contracted unit price held intact. Without that structure the pilot becomes a one way ratchet into a renewal commitment the data did not earn.
The single cohort pilot produces unfalsifiable claims. A two cohort design with one treatment group and one matched control group is the only structure that produces an audit defensible benefit number. The control group does not receive Copilot for the duration of the pilot.
One hundred fifty users selected against three role profiles: sales engineers, account managers, and analysts. The cohort receives M365 Copilot licenses plus a structured onboarding sequence over the first three weeks. Activity telemetry from the Copilot dashboard plus role specific output measurement runs the full ninety days.
One hundred fifty users selected from the same three role profiles, with comparable tenure, output history, and seniority distribution. No Copilot license. Same output measurement. The control group is what makes the treatment number defensible.
The measurement plan is defined in writing before any Copilot seat is provisioned. Once usage data starts to flow, post hoc metric selection is no longer credible. Four metrics carry the weight of the pilot.
Role specific output count per active week. Proposals shipped, decks delivered, customer responses sent, analysis deliverables completed. Measured against the control group baseline.
Sampled blind review against rubric by two reviewers per cohort. Quality cannot decline as volume rises or the benefit is illusory. The review covers ten percent of outputs from each cohort weekly.
The share of treatment cohort users who used Copilot in at least four of the previous five working days. The number must hold above sixty percent in week eight or the population fit is wrong.
Self reported plus manager observed hours redirected away from drafting and toward higher value work. The metric is softer than the first three but it surfaces the qualitative shift.
Data leakage events, over permissioned response incidents, prompt injection attempts. The number must trend down across the ninety days as the governance program matures.
Hours redirected multiplied by loaded cost per role multiplied by population. Compared against the all in pilot cost including tenant readiness and onboarding. The single composite ROI number.
The pilot agreement carries three protections that distinguish it from a stealth full commitment. Each is straightforward to negotiate before the seats are provisioned and almost impossible to obtain after.
At day ninety one the pilot population can step down to zero seats. No automatic renewal. No transition fee. The clause is what allows the data to determine the decision.
The contracted per user price holds for any scale up decision in the twelve months following the pilot. The pilot is not used by Microsoft as a list price reset opportunity at the scale order.
Telemetry exports, governance logs, and tenant readiness work belong to the buyer at any termination point. The pilot does not become a stranded investment if the decision is to defer.
The two cohort design template, the six metric measurement plan, the role specific selection rubric, and the three contract clauses that keep the pilot a measurement program rather than a soft commit. Sent on request.
A pilot designed before the renewal becomes the data that the renewal turns on. A pilot designed after the renewal becomes a justification for a commitment already made.