How to Run a Pre Mortem on Your Metrics Before They Run Your Business

If you have ever watched a metric go from “interesting” to “career limiting” in one quarter, you already know the problem.

A leader picks a number off a dashboard. Someone turns it into an OKR. Then it quietly becomes policy: staffing plans, performance reviews, budget approvals, even which customer segments get help first. Months later you discover the metric was missing a channel, or the definition drifted, or the target was “achieved” by doing something that made customers furious. Congratulations, you did not improve performance. You improved your ability to hit a scoreboard.

That is why a metrics pre mortem is worth your time. It is the same idea as the classic pre mortem exercise: assume the thing failed, then list the reasons why. The difference is you are applying it to a KPI before it has the power to steer the business. I like the framing from “If You’re Writing a Post Mortem, It’s Too Late” because it is painfully true for dashboards too: once a metric is wired into incentives, undoing the damage is political work, not analytics work [1].

Here is the concrete scenario to keep in your head as you read: Support leadership decides First Response Time will drive staffing and individual performance. The goal sounds reasonable: respond faster. The unintended outcome is also common: agents game the clock with empty replies, the backlog shifts to “hard” queues, escalations rise, and CSAT drops. The metric did not just measure support. It rewired it.

You can prevent most of that with a short, disciplined dashboard pre mortem. Not a weeks-long governance program. One focused meeting, the right people, and a few guardrails that make the number earn the right to be used.

Name the decision your metric will control (and its blast radius)

Support teams rarely get hurt by “bad metrics.” They get hurt by metrics that are allowed to make decisions they were never qualified to make.

The moment a metric turns into policy is usually subtle. Someone says, “Let’s watch it weekly.” Then, “Let’s set a target.” Then, “Let’s tie it to comp plans.” Then, “Why is Branch B worse than Branch A?” Metrics accelerate bad assumptions because they look objective, and humans love outsourcing judgment to anything that comes with a decimal point.

Start your metrics pre mortem by naming the decision the number will control. Use a one sentence template like this:

Decision statement template: “If [the metric] moves by [this much] for [this long], then we will [take this action].”

Example for First Response Time: “If First Response Time for Priority 1 email worsens by 20 percent for two weeks, we will add weekend coverage and reroute two agents from billing to Priority 1.” Now you can actually evaluate whether the metric is fit for that decision.
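If it helps to see how mechanical a good decision statement is, here is a minimal sketch of the First Response Time example as a rule. The function name, the baseline value, and the weekly readings are all hypothetical, and "worsens by 20 percent" is assumed to mean "at least 20 percent above an agreed baseline."

```python
def should_trigger(weekly_frt, baseline, worsen_pct=0.20, weeks=2):
    """Return True if the last `weeks` readings are all at least
    `worsen_pct` worse (higher) than the agreed baseline."""
    if len(weekly_frt) < weeks:
        return False  # not enough history to satisfy "for this long"
    limit = baseline * (1 + worsen_pct)
    return all(value >= limit for value in weekly_frt[-weeks:])


# Hypothetical weekly First Response Time in minutes, Priority 1 email.
history = [60, 62, 75, 78]
print(should_trigger(history, baseline=60))  # True: two straight weeks above 72
```

If you cannot fill in the baseline, the percentage, and the duration, the decision statement is not finished yet.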

Next, scope the blast radius. This is where most teams get honest fast, because it forces you to say who will feel the consequences.

A quick checklist that works in real operations:

Money and careers: Will this touch comp plans, bonuses, performance ratings, promotions, or PIPs?
Staffing and scheduling: Will this drive hiring, staffing levels, shift bids, or on call expectations?
Prioritization: Will this influence what product bugs get escalated, what queues get headcount, or which customer segments get faster lanes?
Comparisons: Will branches or teams be ranked, and will leaders act on those rankings?
Escalations: Will this trigger exec escalations, incident style responses, or “all hands” cleanups?

Finally, choose the metric’s job to be done. Most problems come from mixing these up.

A metric can be a diagnostic (helps you learn), a target (people will optimize it), an alarm (alerts you to abnormal conditions), or an allocation tool (decides where resources go). First Response Time might be a good alarm. It might be a decent allocation tool for staffing. It is often a terrible individual performance target without counter metrics.

Common mistake number one: teams skip the decision and jump straight to the target. Do the opposite. When the decision is clear, the right target often becomes obvious, or you realize you should not have a target at all.

Run the 45-minute metrics pre-mortem: who’s in the room, what you decide, what you write down

Workflow element	Best for	Advantages	Risks	Recommended when
Define stop-the-line thresholds	Establishing clear boundaries for acceptable metric performance	Enables rapid response to critical issues; reduces emotional decision-making	Thresholds set too conservatively or too aggressively; false positives ignored	Metrics tied to critical user experience or business health
Assumption Ledger (definition, coverage, comparability, automation limits, known biases)	Documenting metric health and potential failure modes	Creates a single source of truth; identifies data quality issues early	Can become bureaucratic overhead if not actively maintained	Any metric used for critical business decisions
Clear decision outputs and ownership for next steps	Ensuring accountability and follow-through	Prevents analysis paralysis; drives concrete actions	Unclear ownership leads to inaction; decisions made without full context	Every pre-mortem meeting
Workflow table (copy into doc)	Providing a structured, shareable record of the pre-mortem	Easy to share and reference; ensures consistency in documentation	Table becomes a dumping ground without clear summaries	Documenting pre-mortem outcomes for broader team visibility
Identify counter-metrics	Preventing unintended side effects and gaming of the primary metric	Provides a balanced view of impact; acts as an early warning system	Too many metrics to track; difficulty defining true counter-metrics	Any metric with potential for behavioral manipulation or narrow focus
Timeboxed agenda (45 min total)	Standardizing the pre-mortem process across teams	Ensures all key steps are covered; respects participants' time; repeatable	Rushed critical discussions; superficial analysis if not well facilitated	Launching any new metric, OKR, or dashboard

You do not need a committee. You need a small group that represents reality.

Roles that matter more than titles:

First, the decision owner. This is the person who will actually act when the number moves. If nobody owns the decision, the metric becomes dashboard decor.

Second, support operations. They know the routing rules, the queue design, and the ways work gets “reclassified” when pressure hits.

Third, a frontline rep. This role is often missing, and it changes everything. They can tell you exactly how a target will be gamed, usually within 30 seconds.

Fourth, a QA or calibration lead. Also often missing. They are the bridge between “ticket closed” and “ticket actually resolved.”

Fifth, a data steward. Not a generic analyst. Someone who understands what is included, excluded, and how definitions can drift when tools or tags change.

Inputs to bring so the meeting stays concrete:

Bring the current metric definition, a list of channels and queues, a handful of recent tickets across priorities, and the proposed targets or alert thresholds. If you have a dashboard screenshot, bring it. People argue less when they are looking at the same artifact.

Here is a timeboxed agenda that fits into 45 minutes and still produces a decision.

5 minutes: confirm the decision statement and blast radius
5 minutes: read the metric definition out loud, including exclusions
7 minutes: check coverage and denominators by channel, queue, tag, and priority
8 minutes: walk through sample tickets and ask, “Would this be counted the way we think?”
7 minutes: stress test proposed targets and thresholds against seasonality and mix
8 minutes: brainstorm failure modes and gaming paths, then pick the top three
3 minutes: choose guardrails, counter metrics, and monitoring owners
2 minutes: decide ship, ship with guardrails, or pause and fix

The core artifact is the Assumption Ledger. Think of it as a pre commitment to what must be true for the number to be trustworthy.

Assumption Ledger fields you actually need:

Definition and calculation summary in plain language
Coverage and exclusions, including channels and queues
Comparability notes: what makes team comparisons valid or invalid
Automation limits: what tooling can and cannot reliably detect
Known biases: including incentives and process artifacts
Monitoring signals: what leading indicators you will watch
Change triggers: what events force a reapproval

A filled in example entry for First Response Time as a target:

Definition: “Minutes from customer message creation to first human reply, excluding auto responders.”

Coverage: “Includes email and web form. Excludes chat and social. Excludes tickets created by internal teams.”

Comparability: “Valid within the same channel and priority. Not valid across regions until routing rules and hours of coverage match.”

Automation limits: “Tool cannot reliably label a reply as meaningful. Empty ‘we got it’ replies count unless QA flags them.”

Known biases: “Agents can send quick low value replies to stop the clock. Managers can reclassify tickets to lower priority to protect the metric.”

Monitoring signals: “Spike in one touch replies, spike in reopens, increase in escalation rate, drop in CSAT response rate.”

Change triggers: “New chat channel added, routing rules changed, ticket tagging taxonomy updated.”
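If your team keeps ledgers somewhere more structured than a doc, the fields map cleanly onto a small record type. This is only a sketch: the class name, field names, and the `missing_fields` helper are hypothetical, and the example values are abbreviated from the entry above.

```python
from dataclasses import dataclass, field


@dataclass
class AssumptionLedger:
    metric: str
    definition: str
    coverage: str
    comparability: str
    automation_limits: str
    known_biases: str
    monitoring_signals: list[str] = field(default_factory=list)
    change_triggers: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of fields left empty, so an incomplete ledger is visible."""
        return [name for name, value in vars(self).items() if not value]


frt = AssumptionLedger(
    metric="First Response Time",
    definition="Minutes from customer message creation to first human reply",
    coverage="Email and web form; excludes chat, social, internal tickets",
    comparability="Valid within the same channel and priority",
    automation_limits="Tool cannot label a reply as meaningful",
    known_biases="Quick low-value replies; reclassifying priority",
    monitoring_signals=["one-touch replies", "reopens", "escalation rate"],
)
print(frt.missing_fields())  # ['change_triggers']
```

The useful part is the empty-field check: a ledger with a blank "change triggers" entry is a metric nobody has agreed to revisit.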

Now make the process repeatable. Copy the workflow table at the top of this section into a doc and run it before every new dashboard KPI or OKR.

One more practical tip: end the meeting with a visible decision. “Ship” means it can be used for the stated decision. “Ship with guardrails” means it is allowed only with specific counter metrics and monitoring. “Pause and fix” means you are not allowed to set targets or rank teams until coverage and definitions are repaired.

Common mistake number two: leadership runs these meetings without frontline or QA, because it feels faster. It is faster, the same way skipping brakes makes your car lighter.

Schedule a 45 minute pre mortem before your next OKR planning cycle and run this exact workflow. You will never regret spending one meeting preventing a quarter of metric whiplash.

Failure modes to brainstorm: the ways a metric becomes a liar

Most “bad metrics” are not malicious. They are incomplete. They tell a narrow truth, and everyone mistakes it for the whole truth.

Your job in a metric pre mortem is to assume the metric will lie and then name the specific ways it will happen. For each failure mode, you want three things: what causes it, how you will notice, and what you will do about it.

Definition drift is the quietest killer. “First Response Time” becomes “first response during business hours” after a tooling change. Or “backlog” silently switches from “open tickets” to “open and pending tickets.”

How you will notice: a step change right after a process or tooling update, especially if volume did not change.

Guardrail: require a definition change log and freeze targets for a cycle after any definition change.

Coverage gaps are where dashboards get their most convincing lies. If a channel is missing, the metric can improve while customers suffer.

Illustrative lie with numbers: your dashboard shows 1,000 tickets this week, and First Response Time improved from 2 hours to 1.5 hours. Great. Except 300 chat conversations are not counted after a channel integration changed, and chat is where your longest waits live. In reality you handled 1,300 contacts, and the combined response experience got worse.

How you will notice: sudden drops in total counted volume, or a mismatch between staffing workload and “ticket count.” Frontline will often say, “It feels busier but the dashboard says it is quieter.” Believe them.

Guardrail: track a simple coverage metric alongside every KPI. “Percent of contacts included” is boring, which is exactly why it saves you.
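To make the coverage example concrete, here is a rough sketch of the arithmetic. The ticket counts and the 1.5 hour figure come from the example above; the chat response time is not given anywhere, so the 5 hour value is purely an assumption for illustration, as are the function names.

```python
def coverage_pct(counted, total):
    """Percent of all contacts that the KPI actually includes."""
    return 100.0 * counted / total


def blended_mean(groups):
    """Volume-weighted mean over (count, mean) pairs."""
    total = sum(count for count, _ in groups)
    return sum(count * mean for count, mean in groups) / total


counted_tickets, missing_chats = 1000, 300
print(round(coverage_pct(counted_tickets, counted_tickets + missing_chats), 1))  # 76.9

ASSUMED_CHAT_FRT_HOURS = 5.0  # hypothetical; not a figure from the example
combined = blended_mean([(counted_tickets, 1.5), (missing_chats, ASSUMED_CHAT_FRT_HOURS)])
print(round(combined, 2))  # 2.31
```

Under that assumption the dashboard shows 1.5 hours while the blended experience is about 2.3 hours, and the only visible warning is coverage sitting near 77 percent.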

Reopens and duplicates are the place where “resolution” stops meaning resolution. If your metric rewards closing tickets quickly, you will create a second wave of work.

Concrete example: a team optimizes for Time to Resolution and gets it from 48 hours down to 24. Two weeks later reopens rise from 6 percent to 14 percent, and duplicate tickets rise because customers give up and start over. Your customers experienced more work, not less.

How you will notice: spikes in reopens, spikes in repeat contact rate, and more “where is my refund” follow ups that should not exist.

Guardrail: never ship a resolution speed target without a reopen counter metric. If you can only afford one counter metric, pick reopens.

Routing and queue effects are the most common support metrics pitfalls because the metric ends up measuring your process, not your performance. If you move a triage step, or change which queue gets first touch, First Response Time can swing without any change in agent behavior.

How you will notice: the metric changes, but QA quality scores and customer outcomes do not. Or the “fast” queue gets faster while backlog age grows in the “hard” queue.

Guardrail: segment by queue and priority first. If leaders insist on an overall number, show it only with a clear breakdown and an explanation of routing changes.

Seasonality and mix shift are where teams accidentally punish the wrong people. The work changes, but the metric makes it look like performance did.

Illustrative lie with numbers: in March, 60 percent of tickets are password resets and your CSAT is 92. In April, you launch a billing change and now 40 percent of tickets are disputes. CSAT falls to 86. If you treat that as performance decline, you will “coach” agents for a product and policy problem.

How you will notice: ticket mix changes, tag usage shifts, or priority distribution changes. A good leading indicator is “top 10 reasons for contact” changing quickly.

Guardrail: require mix context for any metric tied to performance, and pause branch comparisons when the mix changes materially.

One more failure mode that deserves explicit airtime: response rate collapse. If you use CSAT as a target without caring about response rate, people will learn to avoid surveys, or to route unhappy customers away from the survey path.

How you will notice: CSAT improves while response rate declines, or while complaint volume rises.

Guardrail: treat CSAT and response rate as a pair. If response rate drops below a threshold, do not celebrate the score.
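One way to express the CSAT and response rate pairing is a small check like the sketch below. The 10 percent floor and the wording of the verdicts are placeholders, not recommendations; pick a threshold from your own survey history.

```python
def csat_verdict(csat_now, csat_prev, rate_now, rate_prev, min_rate=0.10):
    """Judge a CSAT reading together with its survey response rate."""
    if rate_now < min_rate:
        return "do not report: response rate below floor"
    if csat_now > csat_prev and rate_now < rate_prev:
        return "investigate: score up while response rate down"
    return "ok"


# Hypothetical readings: CSAT rose from 88 to 92 while responses fell.
print(csat_verdict(92, 88, rate_now=0.15, rate_prev=0.20))
```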

A practical tip that works across all of these: put “how you will notice” in the Assumption Ledger as a leading indicator, not a retrospective. If the first time you learn the metric lied is during quarterly planning, you did not run a pre mortem. You ran a eulogy.

Branch/team rollups without fake winners: make comparisons earn the right to exist

Rankings are intoxicating. A league table feels decisive, and it gives leaders a simple story. It also creates fake winners when the underlying work is not comparable.

Goodhart’s law shows up fast in branch comparisons: once teams know they are being ranked, they optimize for the rank. That is not evil. It is rational. The problem is the optimization often targets what is easy to measure, not what is valuable.

So make comparisons earn the right to exist. Use a comparability gate, and treat it as pass or fail.

A comparability checklist with explicit criteria:

Same work: at least 80 percent of contact reasons overlap, or you segment by reason.
Same rules: the definition of “first response,” “resolution,” and “business hours” is identical.
Same tooling: same routing logic, same macros, same automation, same survey trigger rules.
Same done definition: a closed ticket means the same thing, including how reopens are handled.
Similar coverage: channels and queues included are materially the same, or you compare within a single channel.

If any of those fail, you do not rank. You cohort.

Cohorting means you compare within segments that share conditions, like “email only,” “Priority 1 only,” “billing disputes only,” or “regions with 24 by 5 coverage.” It is less dramatic than a league table, and far more useful.
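The gate is simple enough to sketch as a pass or fail function. Everything here is illustrative: the function names are made up, the branch mixes are hypothetical, and "80 percent of contact reasons overlap" is interpreted as volume-weighted overlap (the sum of the smaller share for each reason), which is one reasonable reading among several.

```python
def reason_overlap(mix_a, mix_b):
    """Volume-weighted overlap: sum of the smaller share for each reason."""
    reasons = set(mix_a) | set(mix_b)
    return sum(min(mix_a.get(r, 0.0), mix_b.get(r, 0.0)) for r in reasons)


def comparability_gate(mix_a, mix_b, same_rules, same_tooling,
                       same_done, similar_coverage, min_overlap=0.80):
    """Return ("rank" or "cohort", list of failed checks)."""
    checks = {
        "same work": reason_overlap(mix_a, mix_b) >= min_overlap,
        "same rules": same_rules,
        "same tooling": same_tooling,
        "same done definition": same_done,
        "similar coverage": similar_coverage,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return ("rank" if not failed else "cohort"), failed


# Hypothetical contact-reason shares for two branches.
branch_a = {"billing": 0.40, "other": 0.60}
branch_b = {"billing": 0.10, "other": 0.90}
print(comparability_gate(branch_a, branch_b, True, True, True, True))
# ('cohort', ['same work'])
```

Returning the failed checks matters as much as the verdict, because they tell you which segments to cohort by.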

Three common local differences that create phantom performance:

One, routing. Branch A routes simple tickets to a fast lane and escalates hard tickets immediately. Branch B keeps more work in house. Branch A looks faster and “better,” while actually pushing complexity elsewhere.

Two, macros and policy. One team uses a macro that counts as a reply and stops the First Response Time clock with minimal effort. Another team writes bespoke replies. The metric rewards the macro, not the customer experience.

Three, escalation paths. If one branch has a direct product engineer escalation route, resolution is faster. Another branch waits in a shared backlog. You are comparing org design, not agent performance.

Normalization can help, but it has tradeoffs, and you should name them.

Per contact normalization is simple, but it hides complexity. Per resolution can reward teams that close aggressively. Per agent hour is often the closest to staffing reality, but it depends on accurate time accounting and consistent schedules.

Here is a worked example of a rollup that flips best and worst after normalization.

Branch East handles 1,000 contacts with an average Time to Resolution of 18 hours. Branch West handles 600 contacts with an average of 14 hours. West looks “best.” Leaders start asking East what they are doing wrong.

Now adjust for mix. East has 40 percent billing disputes that require back office approvals. West has 10 percent. When you compare only non-billing contacts, East averages 10 hours and West averages 12. The winner flipped. The “best branch” was mostly the branch with easier work.
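You can sanity-check the flip with the weighted-average identity: overall average = billing share × billing average + (1 − billing share) × non-billing average. The sketch below solves that for the billing average each branch would need for the numbers above to hold. The function name is hypothetical, and the implied billing figures are derived here, not stated in the example.

```python
def implied_billing_mean(overall, billing_share, non_billing_mean):
    """Solve overall = share * billing + (1 - share) * non_billing for billing."""
    return (overall - (1 - billing_share) * non_billing_mean) / billing_share


east_billing = implied_billing_mean(overall=18, billing_share=0.40, non_billing_mean=10)
west_billing = implied_billing_mean(overall=14, billing_share=0.10, non_billing_mean=12)

print(round(east_billing, 1))  # 30.0 hours
print(round(west_billing, 1))  # 32.0 hours
```

So the example is internally consistent: East is faster on non-billing work (10 versus 12 hours) and no slower on billing disputes (about 30 versus 32 hours), yet looks worse overall purely because it carries four times the share of slow tickets.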

Decision rule: require a minimum comparability threshold before ranking. A good default is this: you can publish a league table only if the comparability gate passes and the difference remains after segmenting by at least two major drivers, usually channel and top contact reason. If the rank changes materially after segmentation, you are not ranking performance. You are ranking mix.

A practical tip for leaders: if you must show a single rollup number, show it as a range with context, not as a trophy. The goal of support metrics is to allocate attention and resources, not to crown a monthly champion.

Ship the metric with guardrails: counter-metrics, sampling checks, and stop-the-line thresholds

A metric without guardrails is a steering wheel bolted to the wrong axle. You can turn it enthusiastically and still drive into a ditch.

When you ship a metric, ship a bundle. Here is a guardrail bundle template you can copy:

Primary metric: the one you intend to use for the decision.

Counter metrics, pick two to four: these are the “do not break the business to hit the target” signals. For support, strong defaults are reopens, escalation rate, repeat contact rate, and CSAT response rate. For speed metrics, add QA quality score if you have it.

Sampling plan: a lightweight audit that checks whether the metric still means what you think it means.

Stop the line thresholds: explicit triggers that freeze targets or disable alerts when trust is compromised.

Monitoring owner and cadence: who checks what, and when.

Counter metrics are not “nice to have.” They are the price of using a target. If you use First Response Time as a target, watch at least one quality signal and one workload signal.

Sampling checks are where you catch drift before the dashboard does. Keep it operationally feasible.

A concrete sampling design that real teams can run: review 10 tickets per queue per week, stratified by channel and priority. For each ticket, confirm it was counted correctly, confirm the response was meaningful, and note whether the customer had to come back. This catches definition drift, macro gaming, and routing artifacts.
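Here is a minimal sketch of that sampling design, assuming tickets are available as dictionaries with `queue`, `channel`, and `priority` keys (a hypothetical shape; adapt it to whatever your help desk export gives you). It cycles through the channel and priority strata within each queue so one busy stratum cannot fill the whole sample.

```python
import random
from collections import defaultdict


def weekly_sample(tickets, per_queue=10, seed=None):
    """Pick up to `per_queue` tickets per queue, cycling through
    (channel, priority) strata so no single stratum dominates."""
    rng = random.Random(seed)
    by_queue = defaultdict(lambda: defaultdict(list))
    for t in tickets:
        by_queue[t["queue"]][(t["channel"], t["priority"])].append(t)

    sample = {}
    for queue, strata in by_queue.items():
        pools = list(strata.values())
        for pool in pools:
            rng.shuffle(pool)  # random order within each stratum
        picked = []
        while len(picked) < per_queue and any(pools):
            for pool in pools:  # round-robin across strata
                if pool and len(picked) < per_queue:
                    picked.append(pool.pop())
        sample[queue] = picked
    return sample


# Hypothetical tickets: one queue, two strata.
demo = [{"id": i, "queue": "billing", "channel": "email",
         "priority": "P1" if i % 2 else "P2"} for i in range(40)]
print(len(weekly_sample(demo, seed=7)["billing"]))  # 10
```

A queue with fewer than ten tickets simply returns everything it has, which is usually what you want for low-volume queues.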

Stop the line thresholds give you permission to pause without arguing. Define them in advance.

Two stop the line examples tied to common failures:

First trigger: coverage drop. If included contact volume drops by more than 10 percent week over week without a known business reason, freeze targets and investigate coverage before leaders react to the KPI.

Second trigger: definition change. If routing rules, business hours, or survey triggers change, disable alerts and pause performance use for one cycle. You can still observe the metric, but you cannot punish or reward people based on it.
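Both triggers are easy to write down ahead of time. The sketch below is one way to do it; the function names are hypothetical, and the 10 percent default comes from the first trigger above.

```python
def coverage_drop_trigger(prev_volume, curr_volume, known_reason=False, max_drop=0.10):
    """True if included volume fell more than `max_drop` week over week
    with no known business reason."""
    if prev_volume <= 0:
        return False  # no baseline week to compare against
    drop = (prev_volume - curr_volume) / prev_volume
    return drop > max_drop and not known_reason


def freeze_targets(prev_volume, curr_volume, definition_changed, known_reason=False):
    """Stop the line if either trigger fires."""
    return definition_changed or coverage_drop_trigger(prev_volume, curr_volume, known_reason)


# Hypothetical week: included contacts fall from 1,300 to 1,000.
print(freeze_targets(1300, 1000, definition_changed=False))  # True (about a 23 percent drop)
```

The point of encoding it is not automation for its own sake. It is that nobody has to win an argument before the targets get paused.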

Automation versus human judgment is a real tradeoff. Automation is safe when it detects objective events reliably, like ticket creation timestamps or queue assignments. It needs human review when it tries to infer meaning, like whether a reply was helpful or whether a ticket is truly resolved.

If you are tempted to fully automate guardrails, read something like the “KPI Watchdog” idea for monitoring slips, but treat it as an assistant, not a judge [2]. The point is faster detection, not robotic accountability.

Monitoring cadence that a support ops team can actually run:

Weekly: watch the leading indicators (reopens, escalation rate, repeat contact rate, and percent of contacts included).

Monthly: do a calibration review, check sample tickets, confirm definitions still match reality.

Quarterly: reapprove the metric for target and comparison use, especially if the org changed routing, channels, or staffing model.

Pick your top three operational metrics this week and add guardrails to each. Set at least one stop the line trigger per metric so you can pause targets without a debate.

Decision handoff: the one-page memo that keeps future you from re-learning the same lesson

Even good pre mortems fail if the outcome lives in someone’s head. Metrics outlive org charts. Your future self deserves a paper trail.

Write a one page memo that travels with the metric. Do not make it pretty. Make it durable.

A copyable outline:

Metric name and job to be done: diagnostic, target, alarm, or allocation
Decision statement: what action it controls and the blast radius
Definition in plain language: what counts, what does not, business hours rules
Scope and coverage: channels, queues, priorities included and excluded
Known biases and gaming paths: the top three ways it can mislead
Guardrails: counter metrics, sampling plan, and stop the line thresholds
Comparability status: whether branch or team ranking is allowed, and under what segments
Owner and change control: who can edit definitions, thresholds, rollups
Reapproval triggers: what changes force a rerun of the pre mortem
Current decision: ship, ship with guardrails, or pause and fix, plus what happens next

Examples of triggers that should force metric reapproval: adding a new support channel, changing routing logic, or changing survey delivery rules. Any of those can rewrite the denominator overnight.

Here is a short leadership script that prevents accidental weaponization:

“We are using First Response Time to guide staffing decisions, not to reward fast but unhelpful replies. If the number conflicts with what customers and frontline teams are experiencing, we pause targets and investigate coverage and quality signals first.”

Monday plan, keep it simple and real:

First action: pick one metric that currently influences behavior and run a 45 minute metrics pre mortem on it.

Three priorities:

write the decision statement and blast radius,
fill the Assumption Ledger for definition, coverage, and comparability,
ship a guardrail bundle with counter metrics and stop the line thresholds.

Production bar: by end of week, you should have one metric with a one page memo, named owners, and at least one sampling check scheduled. Do not overcomplicate it. You are building the habit that keeps metrics working for you, instead of the other way around.

Download or copy the Assumption Ledger and the one page metric memo template into your team doc, then use it before your next dashboard change or OKR cycle.

Sources

[1] runtimedecisions.com
[2] falkster.com
