When Metrics Get Gamed: How to Pick Signals People Cannot Easily Fake

Support metrics that can’t be gamed are rarely the prettiest numbers on a dashboard. They’re the signals that stay honest under pressure because they’re hard to manipulate, easy to audit, paired with counter-metrics, and tied to real weekly decisions.

Lucía Ferrer

Spot the moment a metric becomes a target (and stops being truth)

A support dashboard can look like a control room. Lots of lights, lots of confidence, and sometimes absolutely no idea whether the customer is actually better off.

Here’s the tell: the week leadership starts celebrating a number, the number starts drifting away from reality.

That isn’t because your team is full of villains. It’s because incentives change behavior, even for well-meaning people. People take the shortest path to what gets praised. That’s the core of Goodhart’s Law: when a measure becomes a target, it stops being a good measure. If you want a quick refresher (and a slightly grim chuckle), this is a solid read: [1]

This is where teams get burned: the dashboard “improves,” the business relaxes, and then a quarter later you realize churn, refunds, or brand sentiment quietly got worse. At that point you’re not fixing a metric problem. You’re digging out of a trust problem.

The operator’s rule: “If a label change can move the number, it will.”

Concrete example. A team is measured on MTTR and backlog age. Pressure ramps, and suddenly “Solved” becomes “Closed as duplicate,” “Closed as user error,” or “Closed pending customer.” MTTR drops from 42 hours to 18 hours in two weeks. Backlog looks healthier. Customers aren’t happier. The same problems are still there, just wearing a different hat.

That’s the moment to stop debating intent and start redesigning signals.

One detail that matters: gaming often starts as “helpful cleanup.” Someone thinks, “This is basically duplicate-ish,” does a small status change, and moves on. Multiply that by a whole team and a month of pressure, and you’ve built a measurement system that no longer represents reality.

Two categories of gaming: reclassification vs pushing work downstream

Most support metric gaming falls into two buckets.

Reclassification: you change tags, status, severity, channel, or even what counts as a ticket. The work stays the same, the number improves.

Pushing work downstream: you “resolve” by moving the customer to another queue, another channel, a community forum, a form, or an engineering backlog with no owner. Your number improves because the work is now someone else’s problem.

A subtle downstream push looks reasonable in isolation: “We routed this to billing,” “We sent them to docs,” “We asked for more info.” Any of those can be correct. Gaming shows up when routing becomes the default way to protect a metric rather than the best way to solve the customer’s problem.

Practical way to spot it: don’t look for one event; look for a trail. For the same issue, do you see a ticket close followed by a new conversation in another channel within a few days? That pattern is the smoke.
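If you want to automate the trail check, here is a minimal sketch, assuming you can export closures and later contacts with a customer ID, a topic tag, a channel, and timestamps (all column names are illustrative):

```python
import pandas as pd

# Illustrative export: one frame of closed tickets, one of later contacts
closes = pd.DataFrame({
    "customer_id": [1, 2],
    "topic": ["billing", "login"],
    "channel": ["email", "email"],
    "closed_at": pd.to_datetime(["2024-03-01", "2024-03-02"]),
})
new_contacts = pd.DataFrame({
    "customer_id": [1, 3],
    "topic": ["billing", "login"],
    "channel": ["chat", "chat"],
    "opened_at": pd.to_datetime(["2024-03-03", "2024-03-10"]),
})

WINDOW_DAYS = 5  # "within a few days" -- tune to your own reopen window

# Join each close to later contacts from the same customer on the same topic
trail = closes.merge(new_contacts, on=["customer_id", "topic"], suffixes=("_close", "_new"))
trail["gap_days"] = (trail["opened_at"] - trail["closed_at"]).dt.days
suspect = trail[
    trail["gap_days"].between(0, WINDOW_DAYS)
    & (trail["channel_close"] != trail["channel_new"])  # reappeared in a different channel
]
print(suspect[["customer_id", "topic", "channel_close", "channel_new", "gap_days"]])
```

The flagged rows are not proof of gaming; they are the sample you pull first.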

A quick diagnostic: what changed—customer work, team behavior, or measurement?

Decision rule: if a metric improves but you can’t point to a specific customer-facing change that should cause it, assume the measurement got contaminated.

Customer-facing changes are concrete things like: “We shipped a fix,” “We removed a broken step,” “We clarified pricing,” “We improved routing so the right specialist gets the case earlier,” “We updated internal docs so answers are consistent.” If you can’t name one, you probably improved the reporting, not the experience.

This article’s promise is simple: pick support metrics that can’t be gamed easily because they’re cross-checked, paired with counter-signals, and backed by a lightweight audit loop. You want signals that are hard to fake and easy to defend.

Where support metrics get gamed first: speed, volume, touches, and deflection (with the usual loopholes)

Support teams usually don’t set out to game metrics. They optimize. And if you optimize one number in isolation, you’ll get polished noise fast.

The fastest way to find contamination is to start with the metric families that get gamed first. Not because they’re “bad metrics,” but because they’re easy to move with behavior that doesn’t help customers.

Speed metrics: how ‘fast’ becomes ‘shallow’ (FRT, MTTR, SLA)

Speed metrics like first response time, time to resolution, and SLA compliance are useful. They’re also easy to “win” without solving anything.

Typical loopholes:

  • Premature closure: close quickly, wait for the customer to reopen.
  • Clock-stopping replies: send a fast first response that’s technically a response but functionally useless.
  • Rerouting to reset timers: move the conversation into a new ticket, new channel, or queue that doesn’t count.

Counter-signals that expose it:

  • Reopen rate and repeat contact rate. If MTTR improves while reopens climb, you didn’t get faster; you got better at ending conversations.
  • Escalation rate. If SLAs look great but escalations spike, you may be rushing past the hard cases.

What real improvement looks like: speed gains with stable or improving “did it stick” outcomes. Faster routing plus better internal docs can reduce time to resolution while reopens stay flat or fall.

One definition that saves a lot of drama: write down what “first response” means. If auto-acks count, people will use auto-acks. If only a human message that addresses the question counts, you’ll push the org toward actual help. Metrics follow definitions the way water follows gravity.
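To make that definition concrete, here is a minimal sketch of first response time where only a human agent message stops the clock, assuming a message log with an `is_auto` flag (the flag and column names are illustrative):

```python
import pandas as pd

messages = pd.DataFrame({
    "ticket_id": [101, 101, 101, 102, 102],
    "sender":    ["system", "agent", "customer", "system", "agent"],
    "is_auto":   [True, False, False, True, False],
    "sent_at":   pd.to_datetime([
        "2024-03-01 09:00", "2024-03-01 10:30", "2024-03-01 11:00",
        "2024-03-01 09:05", "2024-03-02 09:05",
    ]),
})
tickets = pd.DataFrame({
    "ticket_id": [101, 102],
    "created_at": pd.to_datetime(["2024-03-01 08:55", "2024-03-01 09:00"]),
})

# First *human* agent reply per ticket -- auto-acks do not stop the clock
human_replies = messages[(messages["sender"] == "agent") & (~messages["is_auto"])]
first_reply = (
    human_replies.groupby("ticket_id")["sent_at"].min()
    .rename("first_human_reply").reset_index()
)

frt = tickets.merge(first_reply, on="ticket_id", how="left")
frt["frt_hours"] = (frt["first_human_reply"] - frt["created_at"]).dt.total_seconds() / 3600
print(frt[["ticket_id", "frt_hours"]])  # ticket 102 shows ~24h, not the 5-minute auto-ack
```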

Volume metrics: closing more by narrowing what counts as a ticket

Ticket volume and closures look objective, which makes them politically powerful and operationally dangerous.

Common loopholes:

  • Splitting and merging to shape counts: split one issue into three to inflate productivity, or merge many into one to hide demand.
  • Redefining what counts as “support”: move requests to “success,” “ops,” or “product feedback.”
  • Category dumping: anything messy becomes “other” or “user error,” keeping dashboards pretty and useless.

Counter-signals that expose it:

  • Contact rate per active customer (or per order). Demand doesn’t disappear just because you renamed it.
  • Backlog composition and severity drift. If “high severity” volume falls but backlog age rises, you may be relabeling work.

What real improvement looks like: volume drops paired with fewer customer pain signals (and not just fewer tickets). Example: a product fix that eliminates password reset failures should reduce tickets and also reduce repeat contacts on that topic.

Tradeoff to name out loud: volume metrics are still useful for staffing and forecasting. Just don’t let them become your proxy for customer health. “Demand” and “quality” are different realities.
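The contact-rate counter-signal is cheap to compute. A minimal sketch, assuming you can count contacts and active customers per month (table and column names are illustrative):

```python
import pandas as pd

tickets = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "month": ["2024-02", "2024-03", "2024-03", "2024-02", "2024-03", "2024-03"],
})
active_customers = pd.Series({"2024-02": 400, "2024-03": 410}, name="active")

# Contact rate = contacts per active customer, per month.
# Renaming tickets as "success requests" can't move this unless those
# contacts stop being counted anywhere -- which is exactly the point.
contacts = tickets.groupby("month").size().rename("contacts")
rate = pd.concat([contacts, active_customers], axis=1)
rate["contacts_per_active_customer"] = rate["contacts"] / rate["active"]
print(rate)
```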

Touches and handle time: optimizing the visible effort, not the outcome

Touches per ticket, average handle time, and “messages sent” are tempting because they feel like efficiency. They also attract superficial behavior.

Common loopholes:

  • Cherry-picking easy work to keep handle time low.
  • Over-templating: customers get a wall of canned text that looks productive and feels dismissive.
  • Excessive internal handoffs: many people “touch” the ticket, nobody owns the outcome.

Counter-signals that expose it:

  • Transfers and escalations. If handle time improves but transfers increase, you’re measuring motion, not progress.
  • Customer effort signals, especially “did you have to contact us again?”

What real improvement looks like: handle time goes down because issues are easier to solve (better tooling, clearer policies, fewer edge cases), while repeat contact falls.

One warning: handle time often becomes a silent coaching weapon (“be quicker”) even when leadership thinks they’re using it only for planning. If you track it, assume it will shape behavior. Pair it with an outcome metric so coaching doesn’t accidentally become “get them off the phone.”
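One way to enforce that pairing is to never generate a handle-time report without the outcome column next to it. A small per-agent sketch, assuming illustrative column names and a precomputed repeat-contact flag:

```python
import pandas as pd

tickets = pd.DataFrame({
    "agent": ["ana", "ana", "ben", "ben", "ben"],
    "handle_minutes": [12, 9, 5, 4, 6],
    "repeat_within_7d": [False, False, True, True, False],
})

# Show handle time and "did it stick?" side by side so "be quicker"
# never becomes the only message an agent hears in coaching.
report = tickets.groupby("agent").agg(
    avg_handle_minutes=("handle_minutes", "mean"),
    repeat_contact_rate=("repeat_within_7d", "mean"),
)
print(report)  # ben looks fast until you read the repeat-contact column
```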

Deflection: hiding demand vs actually resolving it

Deflection is where dashboards go to lie with a straight face. “Tickets down” is not proof of “customers self-served.” It might just mean you made support harder to reach.

Common loopholes:

  • Pushing customers to community or documentation with no success check.
  • Adding form friction: more required fields, fewer entry points, longer menus.
  • Closing “duplicate” aggressively because you want fewer open items.

Counter-signals that expose it:

  • Successful self-serve rate: did the customer complete the task without coming back soon for the same issue?
  • Complaint indicators: social mentions, app reviews, chargebacks, sales objections.

What real improvement looks like: deflection that reduces effort. A strong help center article that genuinely resolves setup issues should reduce tickets and reduce repeat contact for that topic.

For a broader perspective on how single-metric evaluation gets gamed, this piece is worth your time: [2]

Red flags: simultaneous improvements that shouldn’t co-exist

Two combinations that should raise your eyebrows fast:

  • Faster resolution and higher reopens (classic premature closure).
  • Higher deflection and rising escalations (customers are stuck, then arrive angrier and later).

A third that shows up in more mature orgs: volume down, backlog age up. Often taxonomy drift, not fewer problems.

Extra “tell”: a sudden drop in variance. Real support systems are messy. If a metric becomes magically smooth overnight, it’s often because someone found a way to make the measurement cleaner, not the experience.

Use a decision matrix to score candidate signals for gameability, auditability, and decision-use

| Strategy | Best for | Advantages | Risks | Recommended when |
| --- | --- | --- | --- | --- |
| Implement guardrails (sampling, QA, exception reviews) | Maintaining data honesty, detecting gaming early | Adds layers of verification, builds trust in data | Resource-intensive, can be perceived as micromanagement | Any critical metric, or when high stakes are involved |
| Define decision, then find signals | Strategic alignment, avoiding vanity metrics | Ensures metrics serve a purpose, reduces gaming by design | Can be slow if decision is unclear, requires strong leadership | Starting new initiatives or re-evaluating core metrics |
| Communicate metric purpose clearly | Aligning teams, fostering understanding | Reduces misinterpretation, encourages ethical behavior | Can be overlooked, requires consistent reinforcement | Introducing new metrics or onboarding new team members |
| Score candidate signals (Gameability, Auditability, Decision-Use) | Systematic metric selection, operational decisions | Repeatable, defensible, reduces political debate | Requires clear scoring criteria, can be subjective without training | Selecting new support signals or refining existing ones |
| Drop metrics with high gameability + low auditability | Maintaining data integrity, preventing perverse incentives | Eliminates easily faked or misleading signals | May remove familiar but flawed metrics, resistance from stakeholders | Any metric review, especially if gaming is suspected |
| Pair metrics (e.g., speed + quality) | Balancing incentives, preventing single-metric gaming | Creates checks and balances, provides a holistic view | Can increase reporting complexity, requires careful weighting | Metrics are prone to gaming or create unintended side effects |
| Regularly review and sunset metrics | Keeping metrics relevant, reducing dashboard clutter | Ensures metrics remain useful, adapts to changing goals | Resistance to change, loss of historical data context | Annually or when strategic priorities shift |

Use this table in a meeting to keep the conversation out of “my favorite KPI vs your favorite KPI.” You’re not picking numbers; you’re picking signals with known strengths, known loopholes, and a plan to keep them honest.

Start with decisions, not dashboards: what do you need the signal to change?

A support metric is only useful if it changes behavior in a way you actually want.

Examples of real decisions:

  • Should we staff weekends?
  • Should we invest in knowledge base work or in product fixes?
  • Is a new routing rule working?
  • Do we escalate a product issue now?
  • Where do we coach: which queue, which topic, which shift?

A useful litmus test: if two reasonable leaders look at the metric and still can’t agree what action to take next, the metric may be mood lighting, not a signal.

If the metric doesn’t map to a decision you’ll make within a week or two, it’s probably vanity (or at least “nice to know,” not “run the business”).

Score 1–5 on: gameability, auditability, decision-use, and time-to-detect gaming

Definitions that work in customer support:

Gameability: how easily can a reasonable person move the number without improving customer outcomes?

Auditability: how easy is it to verify with spot checks, samples, and clear definitions?

Decision-use: does this metric reliably tell you what to do next, or does it just tell you how you feel?

Time to detect gaming: if someone starts “optimizing,” how quickly would you notice via a counter-signal or audit?

Keep scoring anchored in real loopholes you’ve seen (or can plausibly imagine). Abstract scoring turns political fast. Concrete scoring (“we can reset the timer by rerouting”) stays grounded.

Prefer “cross-validated” signals: two independent ways to check the same reality

A single metric is a single point of failure. Cross-validation means you can check one reality in at least two ways that don’t share the same loopholes.

Thinking Loop frames these as “canary metrics,” with a useful warning: the canary can also lie if you don’t understand the cage. [3]

Support examples:

  • Time to resolution becomes more trustworthy when paired with repeat contact.
  • Deflection becomes more trustworthy when paired with successful self-serve.

Decision rule: if both metrics can be improved by the same trick, they’re not independent enough. You want a pair where the “easy cheat” for one metric makes the other look worse.

Shortlist: keep only what you can act on weekly

Most teams keep everything, then trust nothing.

Instead, aim for a shortlist that covers speed, quality, demand, and outcomes—without requiring a committee to interpret.

A simple way to keep it operational: for each candidate metric, write the weekly decision it should drive in one sentence. If you can’t write the sentence, you don’t have a metric; you have trivia.

Sample scoring to keep your team consistent:

First response time: gameability 4 (auto replies and clock-stopping are common), auditability 4 (timestamps are easy to check), decision-use 3 (staffing/routing more than quality).

Repeat contact rate: gameability 1 (hard to reduce without solving real problems), auditability 3 (identity and channel stitching can be messy but auditable), decision-use 5 (points directly to fixes and coaching).
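If you want the scoring to be repeatable rather than a meeting vibe, keep it in a small structure. A sketch using the sample scores above and the decision matrix's "high gameability + low auditability" drop rule (the thresholds are illustrative, not a standard):

```python
from dataclasses import dataclass

@dataclass
class SignalScore:
    name: str
    gameability: int   # 1 = hard to game, 5 = easy to game
    auditability: int  # 1 = hard to verify, 5 = easy to verify
    decision_use: int  # 1 = mood lighting, 5 = drives a weekly decision

candidates = [
    SignalScore("first_response_time", gameability=4, auditability=4, decision_use=3),
    SignalScore("repeat_contact_rate", gameability=1, auditability=3, decision_use=5),
]

# Drop rule: high gameability plus low auditability. Agree on the
# thresholds (here >= 4 and <= 2) before anyone scores anything.
for s in candidates:
    verdict = "drop" if (s.gameability >= 4 and s.auditability <= 2) else "keep"
    print(f"{s.name}: {verdict}")
```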

Two reminders that prevent backsliding:

  • Communicate metric purpose clearly. If people don’t know why a metric exists, they’ll invent their own reason—usually “this is how I get yelled at.”
  • Regularly review and sunset metrics. New automations, new channels, and new policies change gameability over time. Re-score your core signals at least annually, and sooner after major workflow changes.

Build paired metrics that cancel loopholes: speed + quality, efficiency + outcomes

Anti-gaming support metrics are usually paired.

You don’t solve gaming by picking the one perfect KPI. You solve it by making it hard to win one metric while quietly losing the one that represents customer reality.

Paired metrics are a seatbelt. Not glamorous, but they keep you from going through the windshield when incentives hit the brakes.

Pairing pattern #1: speed metrics with ‘did it stick?’ (reopens, repeat contact)

Target: move faster without getting sloppier.

Loophole prevented: premature closure, shallow first replies, routing games.

A practical pairing: time to resolution + repeat contact rate.

What to do when they diverge: if MTTR improves but repeat contact rises, treat it like an incident, not a debate. Pull a sample, look at closure reasons, check whether macros are being used to rush, and check whether a workflow is pushing customers into new tickets.

Threshold guidance: don’t obsess over tiny movements. In most support orgs, a sustained 2–3 week increase in repeat contact (even a small one) matters more than a one-day spike. Speed metrics move fast; stickiness moves slower. That’s normal.
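A minimal weekly divergence check, assuming you roll the pair up by week (the numbers and the three-week rule are illustrative; tune them to your volume):

```python
import pandas as pd

weekly = pd.DataFrame({
    "week": pd.to_datetime(["2024-02-05", "2024-02-12", "2024-02-19", "2024-02-26"]),
    "mttr_hours": [42, 30, 22, 18],
    "repeat_contact_rate": [0.08, 0.09, 0.10, 0.11],
}).set_index("week")

# Divergence = speed keeps improving while "did it stick?" keeps degrading.
# A sustained run matters more than a one-day spike, so require N weeks in a row.
N_WEEKS = 3
mttr_improving = weekly["mttr_hours"].diff() < 0
repeats_rising = weekly["repeat_contact_rate"].diff() > 0
diverging = (mttr_improving & repeats_rising).astype(int).rolling(N_WEEKS).sum() == N_WEEKS

if diverging.any():
    print("Treat as an incident: pull a closure-reason sample this week.")
```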

Common mistake: pairing speed with CSAT and calling it done. CSAT isn’t useless, but it’s easier to bias. If you want a quality counter-signal, repeat contact is harder to sweet-talk.

Pairing pattern #2: deflection with ‘successful self-serve’ (not just fewer tickets)

Target: reduce demand by helping customers help themselves.

Loophole prevented: hiding contact options, pushing to community, adding form friction.

A practical pairing: deflection rate + successful self-serve rate.

Successful self-serve means the customer got the answer and didn’t come back through any channel for the same issue within a defined window.

If you can’t measure it perfectly, don’t give up. Start with an auditable proxy—“help center view followed by no ticket from that customer for X days”—and validate it with sampling.
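A minimal sketch of that proxy, assuming you have help center view events and ticket creations keyed by customer (column names are illustrative, and the window is the assumption you validate by sampling):

```python
import pandas as pd

views = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "viewed_at": pd.to_datetime(["2024-03-01", "2024-03-01", "2024-03-02"]),
})
tickets = pd.DataFrame({
    "customer_id": [2],
    "created_at": pd.to_datetime(["2024-03-03"]),
})

WINDOW_DAYS = 7  # "no ticket for X days" -- the proxy's main assumption

# A view counts as successful self-serve if no ticket from that customer
# follows within the window. Hand-check a sample of these every week.
joined = views.merge(tickets, on="customer_id", how="left")
came_back = (joined["created_at"] - joined["viewed_at"]).dt.days.between(0, WINDOW_DAYS)
joined["self_served"] = ~came_back
print(joined["self_served"].mean())  # successful self-serve rate for this sample
```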

What to do when they diverge: if deflection rises and successful self-serve doesn’t, assume you created friction. Go walk the customer path. If contacting support feels like a maze, you didn’t deflect; you delayed.

Tradeoff to say out loud: the best self-serve success metric can be slow to compute and messy across channels. A simpler proxy you can audit weekly often beats a perfect metric that arrives quarterly and starts arguments.

Pairing pattern #3: productivity with customer harm checks (escalations, severity drift)

Target: increase throughput without abandoning hard work.

Loophole prevented: cherry-picking easy tickets, downgrading severity, “other” dumping.

A practical pairing: closures per agent + escalation rate + severity-adjusted backlog age.

What to do when they diverge: if productivity goes up while escalations rise and high-severity backlog gets older, you have “fast hands, slow outcomes.” Rebalance routing, protect time for complex work, and coach on escalation criteria.
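"Severity-adjusted backlog age" sounds abstract until you compute it. One common approach is to weight open-ticket age by severity; the weights below are illustrative, not a standard:

```python
import pandas as pd

open_tickets = pd.DataFrame({
    "ticket_id": [1, 2, 3, 4],
    "severity": ["high", "high", "low", "low"],
    "age_days": [14, 9, 3, 2],
})

# Illustrative weights: a 10-day-old high-severity ticket should hurt more
# than a 10-day-old low-severity one. Agree on weights before reporting this.
weights = {"high": 3.0, "medium": 2.0, "low": 1.0}
w = open_tickets["severity"].map(weights)
severity_adjusted_backlog_age = (open_tickets["age_days"] * w).sum() / w.sum()
print(round(severity_adjusted_backlog_age, 1))  # rises when old high-severity work piles up
```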

Tradeoff worth stating: productivity metrics can energize teams, but they can also punish specialists who take the hardest cases. Harm checks let you keep the motivation without training people to dodge tough tickets.

Guardrails that make pairs work: definitions, sampling triggers, and exception queues

Pairs only work if people trust the definitions.

Write down what counts as a reopen, what window defines repeat contact, and what “self-serve success” means right now. Then add triggers that force human review when the pair breaks.

An “exception queue” should be legible to any lead: “tickets closed within 10 minutes,” “three contacts in seven days,” “severity changed after assignment,” “deflected to docs then escalated.” These aren’t accusations. They’re where you look first.
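Those triggers are simple enough to encode so the queue builds itself. A sketch, assuming illustrative fields on each ticket record:

```python
import pandas as pd

tickets = pd.DataFrame({
    "ticket_id": [1, 2, 3, 4, 5],
    "minutes_to_close": [7, 300, 45, 60, 90],
    "contacts_last_7d": [1, 3, 1, 1, 1],
    "severity_changed_after_assignment": [False, False, True, False, False],
    "deflected_then_escalated": [False, False, False, True, False],
})

# Exception triggers -- patterns to review first, not accusations.
rules = {
    "closed_within_10_min": tickets["minutes_to_close"] <= 10,
    "three_contacts_in_7_days": tickets["contacts_last_7d"] >= 3,
    "severity_changed_after_assignment": tickets["severity_changed_after_assignment"],
    "deflected_to_docs_then_escalated": tickets["deflected_then_escalated"],
}

exception_queue = tickets[pd.concat(rules, axis=1).any(axis=1)]
print(exception_queue["ticket_id"].tolist())  # the list a lead reviews this week
```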

Keep the paired set small. Once you get past about 6–8 core metrics, people start optimizing the reporting instead of the work. The goal is a scoreboard, not a novel.

Guardrails and review loops: sampling, QA spot-checks, and exception reviews that keep numbers honest

You can’t prevent metric gaming with metric choice alone. You need a review loop.

Not a heavy compliance-theater loop. Just enough auditing that the easiest path is the honest path.

This is where teams often hesitate because “audit” sounds like “gotcha.” Done well, guardrails do the opposite: they protect frontline teams from being judged on contaminated data, and they protect leadership from making decisions on fantasy numbers.

Sampling strategy: what to review, how often, and who reviews it

A concrete sampling plan that scales:

Review 5% of solved tickets weekly, stratified by channel (email/chat/phone), severity, and agent tenure (new/experienced). Add a small topical slice for your highest-volume contact reason.
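A minimal sketch of that stratified pull, assuming solved tickets carry channel, severity, and agent-tenure fields (names and data are illustrative):

```python
import pandas as pd

solved = pd.DataFrame({
    "ticket_id": range(1, 401),
    "channel": ["email", "chat", "phone", "email"] * 100,
    "severity": ["low", "high"] * 200,
    "agent_tenure": ["new", "experienced"] * 200,
})

# Sample 5% per stratum so quiet channels and new agents still get reviewed,
# instead of the sample drifting toward whatever is most common.
weekly_sample = solved.groupby(["channel", "severity", "agent_tenure"]).sample(
    frac=0.05, random_state=7
)
print(len(weekly_sample), "tickets to review this week")
```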

What to check in each sampled ticket: correct categorization, closure reason, whether the customer’s actual question was answered, whether policy was applied consistently, whether escalation criteria were met, and whether the resolution would prevent repeat contact.

Who reviews: rotate a lead + a peer reviewer. Rotation reduces bias and spreads what “good” looks like.

Two real warnings (this is where teams get burned):

  • If QA becomes a second performance review system, people will learn to game QA.
  • If definition changes aren’t logged, you’ll spend hours arguing about “what happened” when the answer is “we changed the measuring stick.” Keep a short definition change log with dates for fields, automations, routing rules, and category definitions.

Exception queues: define ‘suspicious’ tickets and route them for audit

Exception reviews are how you catch gaming early without auditing everything.

Define “suspicious” as patterns, not accusations:

  • Closed quickly, then reopened.
  • Severity downgraded after assignment.
  • Marked duplicate with no linked parent.
  • Deflected to community/docs, then the customer comes back escalated.

Route these to a small weekly exception review. Keep it short. Decide whether the pattern is instrumentation, routing, training, or policy.

Common mistake: teams build an exception queue, review it twice, then stop because “it’s too busy.” If you can’t sustain it weekly, shrink it until you can. A tiny loop that happens every week beats a beautiful one that happens when someone remembers.

Failure modes: how goodharted systems evolve (and how to stay ahead)

Three failure modes show up repeatedly once metrics matter:

Sampling bias: you keep reviewing the same kind of tickets because they’re easiest to read. Early warning: QA scores stable while escalations rise in a channel you don’t sample (social/app store).

Survey gaming: CSAT improves because the survey is sent only after “happy path” resolutions. Early warning: CSAT up, repeat contact flat or rising, response rate falling.

Ticket taxonomy drift: categories become a junk drawer. Early warning: “other” grows, severity distribution shifts suddenly, or one team’s “billing issue” is another team’s “account issue.”

For a crisp articulation of how honest metrics die under pressure, this Medium piece is worth bookmarking: [4]

When the number improves but customer outcomes don’t: a triage-and-reset flow

When you hit divergence, don’t argue about motives. Run a short triage.

Start with: “Did the world change, or did our measurement change?” Then check the usual suspects:

  • Instrumentation: definition changes, routing rules, tagging changes, new automations, new channels.
  • Routing: work moved to a different queue, channel, or team (including “we put it into a backlog”).
  • Behavior: agents started closing faster, escalating less, or pushing to self-serve differently.
  • Adjustments: clarify definitions, tighten exception triggers, retrain closure criteria, rebalance pairs.

Light humor, because we all need it: metrics are like a bathroom scale. It can tell you something useful, but if you start weighing your shoes, you’re not getting healthier—you’re getting creative.

End this section with a real action, not a slogan: run a 30-day shadow metrics pilot and schedule the first exception review now, before incentives kick in.

Roll out new signals without creating a new incentive to cheat (and how to defend them upward)

Rolling out new support metrics is change management disguised as analytics.

If you announce “we will measure X” and people hear “we will punish you for X,” they’ll optimize X. You want the opposite: support metrics that can’t be gamed, because they guide decisions and improvement instead of triggering defensive behavior.

This is where teams get burned: they launch a shiny new KPI, attach consequences too quickly, and then spend months untangling the behaviors it created. It’s hard to unteach a habit once it becomes the safest way to survive performance conversations.

How to announce metrics: what’s measured vs what’s rewarded

Be explicit about the difference between what you monitor and what you reward.

Early on, most metrics should be monitored only. Rewarding too early is how you manufacture gaming.

Common mistake: tying compensation to a single metric because it’s “clear.” Clarity is good. Single-metric incentives are not.

Use language like: “We’re measuring this to learn, not to punish. If it becomes a performance metric later, it will only be used as part of a paired set, with an audit plan.” That sentence removes fear—and fear is rocket fuel for metric manipulation.

Pilot first: baseline, shadow period, then accountability

A rollout sequence that works:

Baseline: capture current values and variance by team, channel, and severity.

Shadow period: track the new paired metrics for 30 days with no performance consequences. Watch for definition confusion, weird edge cases, and sudden ‘too good to be true’ improvements.

Calibrate: adjust definitions, add exception triggers, agree on tolerances.

Commit: only then use metrics for accountability, and even then, use pairs.

During the shadow period, ask the team: “How could someone make this number look good without helping customers?” You’re not inviting cheating—you’re doing threat modeling for metrics. Frontline teams usually spot loopholes leadership never sees.

Language for leadership: ‘decision-use + audit plan’ beats ‘industry benchmarks’

A short defense script you can use upward:

“We’re moving to support metrics that can’t be gamed easily. Each metric is tied to a weekly decision, paired with a counter-signal, and backed by a sampling and exception review plan. If numbers improve without customer outcomes improving, we’ll detect it quickly and correct course.”

If someone asks for benchmarks: “Benchmarks are fine for context, but they don’t protect us from gaming. Auditability does.”

Benchmarks pull everyone toward “hit the number.” Decision-use and auditability pull everyone toward “make the system better.”

A final checklist: what to lock before you tie anything to performance

Lock three things before you attach consequences:

Definitions (especially reopen rules, repeat contact windows, and what counts as deflection).

Paired metrics (so nobody can win by pushing work downstream).

Review loops (weekly sampling and an exception queue).

Monday plan to make this real: pick one currently celebrated metric and write down its easiest loophole. Then set three priorities for the week: pair that metric with a counter-signal, define one exception trigger, and schedule a 30-minute weekly review with a named owner.

Production bar: by Friday, you should have one paired metric trend you trust and a small sample of audited tickets that proves you can catch the first signs of gaming.

Your next step: copy the decision matrix table and the weekly review agenda described above into your own doc, then run the shadow period before you reward anything.

Sources

  1. stocksignal.me
  2. medium.com
  3. medium.com
  4. medium.com