When Metrics Improve but Outcomes Get Worse: Diagnosing Goodhart Problems Before They Spread

Run a 48-hour containment plan when KPIs improve but escalations/churn spike

You’ve seen it: first response time drops, tickets closed per day climbs, and the dashboard looks like a victory lap. Meanwhile escalations rise, refunds tick up, and your frontline managers quietly start using the phrase “customers are madder than the chart.”

That contradiction is the early smoke of Goodhart’s law in support metrics: when a measure becomes a target, it stops being a reliable measure. This doesn’t require bad actors. It only requires pressure and an incentive that makes the cheapest win attractive.

Treat this moment like an incident. Not because you love drama, but because the wrong next move turns a localized problem into a company-wide “this is how we do support now” habit.

Send a contradiction snapshot to stop the argument from becoming vibes:

Last 14 days vs prior 14 days (illustrative): First response time improved from 3h 10m to 58m. Time to resolution improved from 42h to 30h. Tickets closed per day improved from 1,150 to 1,420. Meanwhile escalations to Tier 2 worsened from 6.2% to 10.8% and refund requests worsened from 1.1% to 2.0%. If churn is in the picture, add it, but flag the lag.

The snapshot isn’t proof. It’s the shared “we agree reality is weird” anchor.
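If you want the snapshot to carry relative changes rather than raw before/after pairs, a few lines of arithmetic are enough. This is a minimal sketch using the illustrative numbers above (3h 10m converted to 190 minutes); the metric names and the `lower_is_better` flag are assumptions made for the example, not fields from any particular helpdesk tool.

```python
# Illustrative snapshot values from the text; names and structure are hypothetical.
SNAPSHOT = {
    # metric: (prior 14 days, last 14 days, True if lower is better)
    "first_response_minutes": (190, 58, True),
    "time_to_resolution_hours": (42, 30, True),
    "tickets_closed_per_day": (1150, 1420, False),
    "tier2_escalation_pct": (6.2, 10.8, True),
    "refund_request_pct": (1.1, 2.0, True),
}


def relative_change(prior, current):
    """Change from prior to current, expressed as a fraction of prior."""
    return (current - prior) / prior


def snapshot_rows(snapshot):
    """Return (metric, percent change, 'improved' or 'worsened') for each metric."""
    rows = []
    for name, (prior, current, lower_is_better) in snapshot.items():
        change = relative_change(prior, current)
        # A drop is good only when lower is better; a rise is good otherwise.
        improved = (change < 0) == lower_is_better
        rows.append((name, round(change * 100, 1), "improved" if improved else "worsened"))
    return rows


for row in snapshot_rows(SNAPSHOT):
    print(row)
```

Seeing "first response time down about 70%, escalation rate up about 74%" side by side tends to end the "is this really a problem?" debate faster than two separate charts.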

Now contain for 48 hours. The goal is to prevent propagation while you classify what you’re looking at.

First: pause conclusions and pause incentives, not customer work. Keep staffing, queues, and SLAs running. But if you recently launched a new closure quota, target, or bonus tied to the improving KPI, freeze the incentive for two days while you investigate. Once pay is attached, people stop learning and start lawyering.

Second: pick the primary outcome you care about right now, and state its expected lag window in plain English. Escalations, refunds, reopens, and repeat contacts usually react in days. Churn and renewals often react in weeks. This one sentence prevents the most expensive leadership mistake: ripping out a change that is working because the trailing metric hasn’t caught up.

Third: build a contradiction packet that is intentionally small.

Metrics (5): the improving KPI; one adjacent operational KPI; three downstream indicators (pick from escalation rate, reopen rate, repeat contact within 7 days, refunds, complaint volume).

Artifacts (5): three recent escalated threads; one QA review; one customer survey comment or verbatim complaint.

The trick is to keep it small enough that someone will actually read it.

Fourth: timebox the investigation around change + cohort.

One person: “what changed?” Policies, macros, routing rules, staffing mix, channel mix, deflection flows.

One person: “who got worse?” Split by channel, ticket type, customer tier, lifecycle (new vs existing), severity.

One person: “how does it feel?” Read the artifacts and summarize the failure pattern without editorializing.

Decision rule: do not change targets broadly until you can say, in one sentence, which cohort got worse and what behavior likely shifted.

The leadership line that keeps this calm is:

“The metric improved; the outcome worsened. Plausible causes are gaming, measurement drift, or real improvement with lag. We’ll run one disconfirming check for each in 48 hours.”

Classify the mismatch fast: gaming vs measurement drift vs real improvement with lag

Teams burn a week debating intent when they should be classifying the failure mode. On a dashboard, three different problems can look identical.

Gaming: the system finds the cheapest path to the target, usually by pushing work downstream or off-metric. It can be rational, not malicious.

Measurement drift: the number changed because the definition, eligibility, sampling, or channel mix changed—not because the customer experience improved. This is the streetlight problem in metric form: we measure what’s easy and then act like it’s the whole truth [1].

Lag: the KPI is a leading indicator and the outcome is trailing. You can improve operations today and still see churn look ugly for a while. If you need a clean primer on the difference, this is solid [2].

Run one disconfirming test per hypothesis. The point is to prove yourself wrong quickly.

Gaming: run a downstream-shift check. Pull a small random sample of tickets that were “won” quickly after the KPI push. Look for visible artifacts: more macro-only replies, thinner internal notes, more transfers, more “customer will follow up” language, more “solved” statuses with unresolved threads. If first response time improved dramatically, check whether the first replies are placeholders.

This is a classic pattern:

A macro like “Thanks for reaching out—can you share your order number?” becomes the first reply on almost everything because it wins first response time. Macro usage spikes. First replies become uniform. Reopens and repeat contacts creep up. No one had to plan this. The metric made it the best move.

Another common tell: disposition codes drift toward “customer education” because it closes cleanly, while escalation reasons shift toward “issue not resolved.” That’s not a disposition problem. That’s a metric problem.
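The placeholder check is easy to make concrete. Here is a rough sketch, assuming you can export the text of first replies for a sample of tickets before and after the KPI push. The function name, the macro text, and the sample replies are all made up for illustration, and exact-match detection is the simplest possible heuristic, not a recommendation.

```python
def placeholder_share(first_replies, placeholder_macros):
    """Fraction of first replies that are nothing but a known placeholder macro."""
    # Normalize case and whitespace so trivial edits to a macro still match.
    known = {" ".join(m.lower().split()) for m in placeholder_macros}
    if not first_replies:
        return 0.0
    hits = sum(1 for reply in first_replies if " ".join(reply.lower().split()) in known)
    return hits / len(first_replies)


# Hypothetical macro and reply samples.
MACROS = ["Thanks for reaching out. Can you share your order number?"]

before_push = [
    "I can see the sync job failed at 02:14. I have restarted it; orders should appear within the hour.",
    "Thanks for reaching out. Can you share your order number?",
    "Your refund was issued today and should post in 3 to 5 business days.",
    "I have reset your password link. Please check your inbox.",
]
after_push = [
    "Thanks for reaching out. Can you share your order number?",
    "thanks for reaching out.  Can you share your order number?",
    "Thanks for reaching out. Can you share your order number?",
    "I have escalated this outage to engineering and will update you by 3pm.",
]

print(placeholder_share(before_push, MACROS), placeholder_share(after_push, MACROS))
```

A jump in this share alongside a drop in first response time is not proof of gaming, but it tells you which tickets to read first.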

Measurement drift: run a meaning-changed check. Ask three boring questions that save you from a month of confusion:

Did we change what counts?

Did we change when we count it?

Did we change who gets counted?

Typical drift patterns:

CSAT inflation because survey timing or eligibility changed. If CSAT only triggers on “solved” tickets and you started auto-solving more, your sample just became more selective.

Channel mix shift. If more volume moved from email to chat, first response time often improves simply because chat is staffed differently, not because the experience got better.

Disconfirm drift by holding cohorts stable. If the KPI improvement disappears once you compare like-for-like (same channel, same ticket type, same tier), you don’t have an experience improvement. You have a reporting improvement. For a clear explanation of how Goodhart effects break measurement, this is a useful reference [3].
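To see how a mix shift alone can move a headline number, here is a small like-for-like sketch. The figures are invented: per-channel first response time is held constant while volume moves from email to chat. Re-weighting the current period by the prior period's volume mix is one simple way to hold cohorts stable; it is an assumption of this example, not the only valid adjustment.

```python
def weighted_mean(pairs):
    """pairs: iterable of (value, weight). Returns the weight-averaged value."""
    pairs = list(pairs)
    total_weight = sum(weight for _, weight in pairs)
    return sum(value * weight for value, weight in pairs) / total_weight


def blended(period):
    """Headline KPI: per-cohort values weighted by that period's own volume."""
    return weighted_mean((c["frt_min"], c["tickets"]) for c in period.values())


def mix_adjusted(period, reference):
    """Same per-cohort values, re-weighted by the reference period's volume mix."""
    return weighted_mean((period[k]["frt_min"], reference[k]["tickets"]) for k in reference)


# Hypothetical data: nothing changed inside either channel, only the mix.
prior = {"email": {"frt_min": 240, "tickets": 800}, "chat": {"frt_min": 5, "tickets": 200}}
current = {"email": {"frt_min": 240, "tickets": 400}, "chat": {"frt_min": 5, "tickets": 600}}

print(blended(prior), blended(current))  # headline FRT drops from 193.0 to 99.0 minutes
print(mix_adjusted(current, prior))      # 193.0: no like-for-like improvement
```

The headline improves by almost half while no customer in either channel waited a minute less. That is what a reporting improvement looks like.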

Lag: run a leading-should-move-first check. If you truly improved resolution quality, nearer-term outcomes should improve before churn does. Look at repeat contact, escalations, and refunds over the next 1–2 weeks. If those are improving while churn is still worsening, you may be inside the lag window, not inside a failure.

Two warnings that keep teams honest:

Don’t confuse productivity with progress. “We closed 20% more tickets” is productivity. Progress is “customers needed less help and trusted the answer.” Support KPIs often reward motion.

Don’t treat “gaming” as a moral accusation. It’s usually incentive design. Goodhart’s law is the cobra effect wearing a dashboard: you offer a bounty for dead cobras and end up funding cobra farms. A quick, memorable explanation is here [4].

If you need a research-oriented source to cite internally, this preprint is a good starting point [5].

Decision rule: don’t change targets until you can name the most likely category (gaming vs drift vs lag) and you’ve ruled out at least one other category with a check.

Audit any “improving” KPI before you scale it: metric → behavior → customer impact

Assignment strategy	Best for	Advantages	Risks	Recommended when
KPI-to-behavior mapping table	Common support KPIs (e.g., speed, volume, resolution)	Quickly identifies likely gaming behaviors and guardrail metrics	Requires ongoing updates as new gaming tactics emerge	Weekly ops review; proactive risk assessment for existing KPIs
Paired metrics (e.g., speed + quality)	KPIs prone to speed-over-thoroughness gaming	Makes gaming one metric expensive by penalizing the other	Can create conflicting incentives if not balanced carefully	High-volume, time-sensitive tasks where quality is critical
Tradeoff analysis (e.g., speed vs. trust)	Situations where conflicting goals are inherent	Explicitly acknowledges and manages inevitable compromises	Can lead to analysis paralysis if not focused on key decisions	Designing new processes or setting targets for complex customer journeys
Exception: Lagging indicators only	Outcomes that are difficult to game directly (e.g., churn, revenue)	Reflects true business impact, less prone to Goodhart's Law	Slow to react, difficult to influence directly in the short term	As ultimate measures of success, balanced with leading indicators
4-step audit workflow (default)	Any KPI showing unexpected improvement, especially new metrics or targets	Systematic, repeatable, identifies root cause (gaming, drift, or real gain)	Can be slow if not practiced, requires cross-functional input	First sign of 'looks better, feels worse' pattern; before scaling any KPI
Transparency & 'no hiding' constraints	Preventing data manipulation or selective reporting	Builds trust, makes gaming visible and accountable	Can be perceived as micromanagement if not framed correctly	High-stakes metrics, leadership visibility, or compliance requirements
Stable cohorts / A/B testing	Evaluating impact of process changes or new incentives	Isolates true impact from gaming or external factors	Requires careful setup, can be slower to yield results	Before company-wide rollout of new targets or compensation structures

Scaling a “win” is where Goodhart failures metastasize. A local improvement becomes a company narrative, then a comp plan, then a permanent source of weird behavior.

Before you scale any improving KPI, do a quick audit that forces one translation: metric → behavior → customer impact.

Use the table above as your operating menu. It’s not theory. It’s a set of assignment strategies that help you match the metric risk to the right control, especially when the dashboard looks better and the customer experience feels worse.

To make that table practical (and not a slide that gets admired and ignored), tie it to a default audit workflow you can run in a weekly ops review.

Step 1: Translate the KPI into a behavior contract. Ask: “If I’m an agent under time pressure trying to hit this number, what would I do differently today?” If you can’t answer, you don’t understand the incentive you’re about to create.

Step 2: Name the easiest win paths—including the one that harms customers. There’s always a harmful shortcut, and it’s usually the easiest.

Time to resolution can improve by closing tickets quickly, even if the customer comes back.

Deflection can improve by hiding contact options, even if the customer ends up refunding.

Step 3: Attach two guardrails by design: one quality guardrail and one downstream customer outcome. A speed metric paired only with another speed metric is how you build a fast-moving disaster.

Step 4: Verify the measurement pipeline so the number still means what you think it means. Definitions, eligibility, sampling, channel mix. Dashboards don’t lie out of spite. They lie out of obedience.

Now use the assignment strategies explicitly:

Use a KPI-to-behavior mapping table when you have common KPIs (first response time, or FRT; handle time; closure rate) and you want to predict the “cheap wins” before they show up.

Use paired metrics when you already know speed will eat quality. Make gaming one metric expensive by penalizing the other.

Use tradeoff analysis when the conflict is real and permanent (speed vs trust, deflection vs access). You’re not solving the conflict; you’re choosing where the organization will land.

Use lagging indicators only as the ultimate scoreboard (churn, revenue), but don’t rely on them for daily steering.

Use transparency and “no hiding” constraints when the organization’s first instinct is to make the metric look good rather than make the customer experience good.

Use stable cohorts and A/B testing when you’re about to roll a change out broadly and you can’t afford a false win.

A practical warning: do this audit before you attach compensation. Once a metric is tied to pay, people optimize it like their rent depends on it (because it does). And then you’ll be shocked—shocked—that they did exactly what the system rewarded.

Recognize “looks better, feels worse” patterns early (and identify what breaks first)

Support teams don’t usually wake up and decide to game metrics. The workflow just adapts. Like water finding a crack, it finds the lowest-resistance path to the number you keep repeating in meetings.

Four patterns show up repeatedly when Goodhart’s law hits customer support KPIs. The key is knowing what breaks first, because the earliest warning is rarely the headline KPI.

Speed over quality: fast first replies that defer real work.

This is the first response time trap. FRT improves; time to resolution and escalations worsen. What breaks first is in the thread itself: more “did you read my message?” and “this didn’t answer my question.”

A small fix that doesn’t become a policy novel: require that first replies include one specific next step that moves the issue forward, not just an acknowledgment. You can still be fast—just not empty.

The vignette makes it obvious:

Customer: “My integration stopped syncing last night. Orders are missing.”

Agent (2 minutes later): “Thanks for reaching out. Can you confirm your account email and share a screenshot?”

Customer (3 hours later): “It is the same email as this ticket. Screenshot attached. Can you fix it?”

Agent (next day): “Thanks. Have you tried reconnecting?”

Customer (same day): “Yes. This is urgent. Please escalate.”

The dashboard celebrates. The customer escalates.

Closure pressure: solved rate up while repeat contacts and reopens climb.

This is the “we’re efficient” mirage. Leaders push tickets closed per day or solved rate; agents avoid complex work because complex work punishes their numbers.

What breaks first is repeat contact volume—often best measured as customers contacting you again within 7 days about the same issue. If your org doesn’t have a stable definition, align on one and keep it stable. A useful reference on measurement problems (and why definitions drift) is here [6].
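If you don't have a stable definition yet, it helps to write one down as code so nobody can quietly reinterpret it. The sketch below uses one possible definition, which is an assumption of this example: a solved ticket counts as a repeat contact if the same customer opens another ticket on the same issue within 7 days of the solve time. The field names are illustrative, not a real helpdesk schema.

```python
from datetime import datetime, timedelta


def repeat_contact_rate(tickets, window_days=7):
    """Share of solved tickets followed by another ticket from the same customer
    on the same issue, opened within window_days of the solve time."""
    window = timedelta(days=window_days)
    solved = [t for t in tickets if t["solved_at"] is not None]
    if not solved:
        return 0.0
    repeats = 0
    for t in solved:
        for other in tickets:
            if other is t:
                continue
            same_thread = (other["customer"], other["issue"]) == (t["customer"], t["issue"])
            gap = other["opened_at"] - t["solved_at"]
            if same_thread and timedelta(0) <= gap <= window:
                repeats += 1
                break  # count each solved ticket at most once
    return repeats / len(solved)


# Hypothetical tickets for illustration.
tickets = [
    {"customer": "a", "issue": "sync", "opened_at": datetime(2024, 3, 1), "solved_at": datetime(2024, 3, 2)},
    {"customer": "a", "issue": "sync", "opened_at": datetime(2024, 3, 5), "solved_at": datetime(2024, 3, 6)},
    {"customer": "b", "issue": "billing", "opened_at": datetime(2024, 3, 3), "solved_at": datetime(2024, 3, 3)},
    {"customer": "b", "issue": "billing", "opened_at": datetime(2024, 3, 20), "solved_at": None},
]

print(repeat_contact_rate(tickets))  # one of three solved tickets came back within 7 days
```

Matching on an issue tag is the weak point here: real data usually needs a looser rule (same product area, same order, or any contact at all). Whatever you choose, the point is that the window and the matching rule are explicit and stay fixed.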

A targeted corrective action: add a “confirm resolution” expectation for a subset where harm is highest (high-severity issues, high-value customers). You don’t need it everywhere. You need it where regret is expensive.

CSAT distortion: scores rise while verbatims get angrier.

This one burns teams because it looks like a win and feels like chaos. It often happens through drift: survey timing changes, eligibility changes, channel mix changes.

Early signals:

Response rate drops while score rises.

Verbatims get more negative while the average stays high.

That pattern isn’t “customers are complicated.” It’s usually sampling.

If you need a clean explanation of how numbers stop telling the truth, this is worth sharing with ops leaders who treat CSAT like scripture [7]. For a Goodhart-flavored lens on broken metrics, this complements it [3].

A simple operational rule: pin CSAT send rules for the quarter. If you change them, annotate the dashboard and treat the trend as discontinuous until you rebuild baseline.

Deflection backlash: fewer tickets but more escalations, refunds, and churn.

Bots are cheap. Tickets are expensive. So deflection always looks tempting. But if you deflect by making support harder to reach, demand doesn’t vanish. It mutates.

What breaks first:

Escalation reasons shift toward “can’t contact support.”

Refund requests rise.

Social complaints become more frequent.

The fix is a mindset, not a maze: deflection should be “containment with dignity.” Deflect low-stakes, low-emotion issues. Keep escalation routes obvious for billing, access, and outages. If your bot hides the escape hatch, you aren’t reducing demand. You’re fermenting it.

A light analogy that lands: grading a restaurant only on how fast it serves food will absolutely improve service speed. It will also produce a lot of food that people regret ordering.

Decision rule across all four patterns: your smoke alarms are reopens, repeat contact, escalation reasons shifting, and refund requests—not the headline KPI.

Design guardrails that make gaming expensive: paired metrics, stable cohorts, and “no hiding” constraints

Goodhart problems spread when a local metric win gets promoted to a universal target. The answer isn’t “stop measuring.” It’s measuring like the number will be attacked—because it will, even unintentionally.

Start simple: one primary KPI per objective, plus two guardrails.

If the objective is speed to first touch, FRT can be primary. Guardrails should include one quality measure and one downstream outcome. A practical bundle: FRT + repeat contact within 7 days + QA pass rate on the first reply.

If the objective is clean closure, solved rate can be primary. Guardrails: reopen rate + escalation rate.

If the objective is deflection, deflection rate can be primary. Guardrails: refunds + escalations.

Guardrails aren’t about being fancy. They’re about making the cheap win expensive.
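One way to make a bundle operational is to evaluate the primary KPI and its guardrails together, so the primary can never be reported alone. This is a sketch under assumptions: the `Metric` class, the 5% tolerance, and the sample numbers are all hypothetical, and a real threshold should come from your own baseline variance.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    prior: float
    current: float
    lower_is_better: bool
    tolerance: float = 0.05  # assumed: relative worsening allowed before a guardrail trips

    def relative_worsening(self):
        """Positive when the metric moved in the bad direction, negative when it improved."""
        change = (self.current - self.prior) / self.prior
        return change if self.lower_is_better else -change


def evaluate_bundle(primary, guardrails):
    """Report the primary KPI only in the context of its guardrails."""
    tripped = [g.name for g in guardrails if g.relative_worsening() > g.tolerance]
    if tripped:
        return "pause target: guardrail tripped (" + ", ".join(tripped) + ")"
    if primary.relative_worsening() < 0:
        return "primary improved with guardrails intact"
    return "no improvement on primary"


# Hypothetical bundle: FRT + repeat contact within 7 days + QA pass rate on the first reply.
frt = Metric("frt_minutes", prior=190, current=58, lower_is_better=True)
repeat_contact = Metric("repeat_contact_7d_pct", prior=8.0, current=11.0, lower_is_better=True)
qa_pass = Metric("qa_pass_first_reply_pct", prior=92.0, current=91.0, lower_is_better=False)

print(evaluate_bundle(frt, [repeat_contact, qa_pass]))
```

Here the FRT win is real on paper, but the bundle reports a tripped guardrail first, which is the order you want leaders to read it in.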

If you need a readable refresher on why metrics backfire and how to make them work for you, this is useful [8]. If you need a memorable cautionary story about targets creating weird behavior, the nails quota story is a classic [9].

Next: stabilize meaning with cohorts. Support metrics are extremely sensitive to mix changes. Without cohorting, you’re not measuring improvement—you’re measuring what showed up.

Three cohorts that commonly change metric meaning:

Channel: email vs chat vs phone. FRT and handle time are not comparable across channels without context.

Ticket type: billing vs bugs vs how-to. Deflection is fine for “where is my invoice,” dangerous for “I can’t access my account.”

Customer tier and lifecycle: free vs paid, enterprise vs self-serve, new vs existing. A rising escalation rate in enterprise is a different incident than a rise in free.

Now add “no hiding” constraints—simple rules that prevent the easiest forms of metric corruption.

Reopens count, and they count for long enough to matter. If you only count reopens within 24 hours, you encourage slow rolling. A 7-day reopen window often matches customer reality better.

Escalations stay visible on the originating team’s dashboard. If work is pushed downstream to protect an upstream metric, you want that behavior to show up without detective work.

Lightweight QA sampling is independent of the KPI. It doesn’t need to be heavy. It needs to be consistent enough that agents believe quality is real, not a poster on the wall. The streetlight problem is exactly why sampling reality matters [1]. If you’re justifying QA as a defense against metric gaming, this research-oriented source helps [5].

Here’s the tradeoff leaders avoid saying out loud: sometimes you accept worse headline KPIs to buy better long-term outcomes. A slightly slower first response that actually progresses the issue can reduce overall contacts, reduce escalations, and improve retention. Optimize only for speed and you’ll win the week and lose the quarter.

Make guardrails operational by putting them into the weekly ops rhythm. Don’t just stare at the number—read five tickets from the worst cohort. If a guardrail breaks, the response should be automatic: pause the target increase, narrow to the cohort, and run a small audit sample.

One practical tip that prevents false celebrations: annotate the dashboard when workflows change. New macros, routing changes, bot flow tweaks, staffing mix shifts. If change events aren’t visible next to the trend line, your org will keep mistaking coincidence for improvement.

Make the call and stop propagation: revert, revise, or reinforce based on evidence thresholds

Once you have the contradiction packet and at least one disconfirming test per hypothesis, you need to decide. The worst outcome is letting a broken metric spread because it feels politically easier than admitting uncertainty.

Use three calls: revert, revise, or reinforce.

Revert when harm indicators are spiking now, or when customer harm is clear and the KPI improvement does not survive cohort control.

Example: you launched a closure target. Solved rate rose. Repeat contacts within 7 days rose sharply in your highest-value tier. A safe, reversible move is to pause the closure target for a week and watch reopen rate, escalation rate, and backlog age. If the harm stabilizes without backlog age ballooning, you learned something quickly.

Revise when the KPI is directionally useful but the incentive is too sharp. Keep the metric, but add guardrails, narrow it to stable cohorts, or remove it from compensation. In other words: blunt the knife.

Reinforce when leading indicators improve, harm indicators stay flat, and the improvement holds under stable cohorts. If churn is the only thing worsening, you may be inside a lag window. In many businesses, churn/renewals can lag meaningful support improvements by 4–10 weeks. That’s not permission to ignore churn; it’s a reason not to panic-react.
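The three calls can be written down as a small decision function so the team argues about the inputs rather than the verdict. This is one reading of the rules above, not a formal policy: the four boolean inputs are simplifications, and deciding whether harm is "clear" or a cohort comparison "survives" is still a human judgment.

```python
def make_the_call(harm_spiking_now, harm_clear, survives_cohort_control, leading_improving):
    """Map the evidence to revert, revise, or reinforce (one interpretation of the rules)."""
    # Revert: harm is spiking now, or harm is clear and the gain vanishes like-for-like.
    if harm_spiking_now or (harm_clear and not survives_cohort_control):
        return "revert"
    # Reinforce: leading indicators improve, harm stays flat, gain holds under stable cohorts.
    if leading_improving and not harm_clear and survives_cohort_control:
        return "reinforce"
    # Everything in between: keep the metric but blunt the incentive.
    return "revise"


print(make_the_call(harm_spiking_now=False, harm_clear=True,
                    survives_cohort_control=True, leading_improving=True))  # revise
```

Note what is deliberately missing from the inputs: churn. If churn is the only thing worsening, it does not force a revert on its own, which is how the lag window gets respected.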

When you report back, avoid blame. Blame triggers defensiveness, and defensiveness destroys measurement.

A leadership-friendly summary:

“Over the last two weeks, our primary KPI improved materially, but two downstream outcomes worsened. This can happen due to behavior shifting to hit the KPI, measurement drift in how the KPI is calculated, or a real improvement where the business outcome lags. We ran quick checks for each and found the highest likelihood is X. We recommend reverting or revising one target, adding two guardrails, and reviewing a small ticket sample weekly until the metrics realign.”

Then do a 7-day correction sprint that stays focused.

Pause the single target most likely causing harm (closure quota, aggressive deflection push, etc.).

Add one guardrail bundle to the main dashboard (example: FRT + repeat contact + QA pass rate).

Review a tight sample set: 25 tickets from the worst cohort, including five escalations. Summarize the pattern in plain language.

Lock definitions for the quarter: what counts as resolved, what triggers CSAT, what counts as an escalation, and what repeat contact means.

Share one “what we learned” note with the team focused on system incentives and constraints.
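For the ticket review in the sprint above, pull the sample at random rather than letting someone pick the memorable tickets. A minimal sketch, assuming each ticket record carries a cohort label and an escalation flag; the field names, cohort labels, and fixed seed are illustrative, and this version draws exactly five escalated tickets rather than at least five.

```python
import random


def review_sample(tickets, cohort, total=25, escalated=5, seed=0):
    """Random review sample from one cohort with a fixed number of escalated tickets."""
    rng = random.Random(seed)  # fixed seed so the same sample can be re-pulled later
    in_cohort = [t for t in tickets if t["cohort"] == cohort]
    esc = [t for t in in_cohort if t["escalated"]]
    rest = [t for t in in_cohort if not t["escalated"]]
    picked_esc = rng.sample(esc, min(escalated, len(esc)))
    picked_rest = rng.sample(rest, min(total - len(picked_esc), len(rest)))
    return picked_esc + picked_rest


# Hypothetical ticket export.
tickets = [
    {"id": i,
     "cohort": "enterprise-chat" if i % 2 == 0 else "free-email",
     "escalated": i % 10 == 0}
    for i in range(200)
]

sample = review_sample(tickets, "enterprise-chat")
print(len(sample), sum(t["escalated"] for t in sample))
```

If the cohort has fewer tickets than requested, the function returns what exists instead of failing, so a thin cohort shows up as a short sample rather than a crash.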

One reminder, because this is where teams get burned: don’t punish the behavior you rewarded. If you pushed hard for faster closure and people closed faster, the system worked. Now you change the system.

Monday plan: build the contradiction packet in 30 minutes with support ops and one frontline manager. Your priorities for the week are to classify the mismatch (gaming vs drift vs lag), add two guardrails to the KPI improving fastest, and run a cohort-based ticket sample review in your ops meeting. By Friday, you should be able to say which cohort is impacted, which behavior changed, and whether you’ll revert, revise, or reinforce.

Primary CTA: Run the Goodhart audit on your top 3 “improving” KPIs this week and add two guardrails.

Secondary CTA: Create a one-page metric contract for each team-level target and review it in your weekly ops meeting.

Sources

[1] datafield.dev
[2] howtothink.ai
[3] whennotesfly.com
[4] whennotesfly.com
[5] figshare.com
[6] whennotesfly.com
[7] robert-tang.com
[8] yourstoryhaven.com
[9] arcticdba.se
