[{"data":1,"prerenderedAt":47},["ShallowReactive",2],{"/en/blog/when-metrics-lie-how-to-spot-proxy-measures-that-push-teams-in-the-wrong-directi":3,"/en/blog/when-metrics-lie-how-to-spot-proxy-measures-that-push-teams-in-the-wrong-directi-surround":38},{"id":4,"locale":5,"translationGroupId":6,"availableLocales":7,"alternates":8,"_path":9,"path":9,"title":10,"description":11,"date":12,"modified":12,"meta":13,"seo":23,"topicSlug":28,"tags":29,"body":31,"_raw":36},"c278eeff-a09b-4dd5-a620-c000fd34ca37","en","9ebab642-1b2f-4716-a042-1f914942e9dd",[5],{"en":9},"/en/blog/when-metrics-lie-how-to-spot-proxy-measures-that-push-teams-in-the-wrong-directi","When Metrics Lie: How to Spot Proxy Measures That Push Teams in the Wrong Direction","Support dashboards can look green while customers and agents insist things are worse. Learn a practical workflow to audit proxy metrics in customer support, triangulate with outcomes like reopens and escalations, correct for mix shifts, and add guardrails so teams optimize real resolution quality.","2026-05-04T09:22:13.569Z",{"date":12,"badge":14,"authors":17},{"label":15,"color":16},"New","primary",[18],{"name":19,"description":20,"avatar":21},"Lucía Ferrer","Calypso AI · Clear, expert-led guides for operators and buyers",{"src":22},"https://api.dicebear.com/9.x/personas/svg?seed=calypso_expert_guide_v1&backgroundColor=b6e3f4,c0aede,d1d4f9,ffd5dc,ffdfbf",{"title":24,"description":25,"ogDescription":25,"twitterDescription":25,"canonicalPath":9,"robots":26,"schemaType":27},"When Metrics Lie: How to Spot Proxy Measures That Push","Support dashboards can look green while customers and agents insist things are worse. 
Learn a practical workflow to audit proxy metrics in customer support,","index,follow","BlogPosting","decision_systems_researcher",[30],"when-metrics-lie-how-to-spot-proxy-measures-that-push-teams-in-the-wrong-directi",{"toc":32,"children":34,"html":35},{"links":33},[],[],"\u003Ch2>Your dashboard is green—so why do customers and agents say support is worse?\u003C/h2>\n\u003Cp>If you run support operations long enough, you will eventually hit the strangest kind of incident: nothing looks broken, yet everyone is tense. The dashboard is green. SLA hit rate is up. First response time is down. Closure volume is up. Average handle time is down. The weekly review feels like a victory lap.\u003C/p>\n\u003Cp>Then you talk to customers and hear something else. They are not saying “thanks for the faster first reply.” They are saying “I keep repeating myself,” “I got bounced between people,” “you marked it solved but it is not solved,” and “I feel like I am arguing with a form.” Agents tell you they are exhausted and that the work feels messier, not cleaner. Managers start spending their time on exceptions and escalations. You start wondering whether you are measuring the wrong thing, or measuring the right thing in a way that no longer matches reality.\u003C/p>\n\u003Cp>That is where proxy metrics in customer support get dangerous.\u003C/p>\n\u003Cp>A proxy metric is a number that is easier to measure than the thing you truly care about, so it becomes a stand in. It is not automatically bad. It is often useful. But it is a guess. An outcome metric is closer to the real customer result you want, like “the issue is resolved correctly,” “the customer did not need to contact us again,” or “the customer feels confident using the product.”\u003C/p>\n\u003Cp>A concrete example shows the trap quickly. You tighten your first response SLA and celebrate hitting 95 percent. Agents learn the fastest way to improve the number: reply quickly, even if the reply is thin. 
The workflow now rewards quick touches and quick transfers. A week later, reopen rate climbs from 7 percent to 12 percent and escalations climb as well. You got faster at beginning the conversation. You did not get better at finishing it.\u003C/p>\n\u003Cp>This is Goodhart’s Law wearing a headset: when a measure becomes a target, it stops being a good measure. If you want a simple explanation of why this happens across many systems, not just support, this overview is useful: \u003Ca href=\"#ref-1\" title=\"arcticdba.se — arcticdba.se\">[1]\u003C/a>\u003C/p>\n\u003Cp>The promise of this article is practical: a repeatable operator workflow to identify which KPIs are acting as harmful proxies, what behaviors they are creating, and what to do next. You will not leave with “delete all metrics” or “trust CSAT only.” You will leave with an audit method, a triangulation set, and guardrails that tell you when dashboards are safe to act on and when human review is mandatory.\u003C/p>\n\u003Ch2>Run the proxy-metric audit: map metric → behavior → workload → customer outcome\u003C/h2>\n\u003Ctable>\n\u003Cthead>\n\u003Ctr>\n\u003Cth>Control\u003C/th>\n\u003Cth>Where it lives\u003C/th>\n\u003Cth>What to set\u003C/th>\n\u003Cth>What breaks if it’s wrong\u003C/th>\n\u003C/tr>\n\u003C/thead>\n\u003Ctbody>\u003Ctr>\n\u003Ctd>Set: Proxy Chain Audit: Workload → Customer Outcome\u003C/td>\n\u003Ctd>Customer journey maps, research\u003C/td>\n\u003Ctd>Verify workload drives customer value. E.g., Efficient triage → Faster resolution.\u003C/td>\n\u003Ctd>Busy teams, no customer benefit. Wasted resources.\u003C/td>\n\u003C/tr>\n\u003Ctr>\n\u003Ctd>Set: Anomaly Detection: Context Shifts\u003C/td>\n\u003Ctd>Monitoring system, team comms\u003C/td>\n\u003Ctd>Alerts for changes in workload, segments, product features.\u003C/td>\n\u003Ctd>Invalid comparisons. 
Good performance looks bad, bad looks good.\u003C/td>\n\u003C/tr>\n\u003Ctr>\n\u003Ctd>Set: Proxy Chain Audit: Metric → Behavior\u003C/td>\n\u003Ctd>Metric definitions, team wiki\u003C/td>\n\u003Ctd>Map metric to intended action. E.g., FRT → Agent responds quickly.\u003C/td>\n\u003Ctd>Teams hit numbers, not goals. &#39;Green&#39; dashboards hide bad outcomes.\u003C/td>\n\u003C/tr>\n\u003Ctr>\n\u003Ctd>Set: Proxy Chain Audit: Behavior → Workload\u003C/td>\n\u003Ctd>Process docs, workflow diagrams\u003C/td>\n\u003Ctd>Connect behavior to required tasks/effort. E.g., Quick response → Efficient triage.\u003C/td>\n\u003Ctd>Teams find metric shortcuts. Burnout from misaligned effort.\u003C/td>\n\u003C/tr>\n\u003Ctr>\n\u003Ctd>Set: Guardrail: CSAT for Speed Metrics (e.g., FRT, AHT)\u003C/td>\n\u003Ctd>Dashboard (next to FRT/AHT)\u003C/td>\n\u003Ctd>Minimum CSAT score. If FRT improves but CSAT drops, investigate.\u003C/td>\n\u003Ctd>Agents rush, customers frustrated. Repeat contacts.\u003C/td>\n\u003C/tr>\n\u003Ctr>\n\u003Ctd>Set: Guardrail: QA for Throughput Metrics (e.g., Closures, AHT)\u003C/td>\n\u003Ctd>Dashboard (next to Closures/AHT)\u003C/td>\n\u003Ctd>Minimum QA score. If closures increase but QA drops, investigate.\u003C/td>\n\u003Ctd>Agents close tickets prematurely. Re-opens, churn.\u003C/td>\n\u003C/tr>\n\u003Ctr>\n\u003Ctd>Set: Decision Rule: Trust, Demote, Fence, Retire\u003C/td>\n\u003Ctd>Metric governance, team review\u003C/td>\n\u003Ctd>Criteria for each outcome based on audit/guardrail results.\u003C/td>\n\u003Ctd>Teams chase misleading metrics. Inability to adapt.\u003C/td>\n\u003C/tr>\n\u003C/tbody>\u003C/table>\n\u003Cp>When a KPI and lived experience disagree, teams often do one of two unhelpful things. They argue in circles, trading anecdotes until the meeting ends. 
Or they pick a side based on hierarchy, which is just arguing with a nicer font.\u003C/p>\n\u003Cp>You need a shared method that turns “this feels off” into a diagnosis and a decision.\u003C/p>\n\u003Cp>Call it the Proxy Chain Audit. The idea is simple: every metric creates pressure. Pressure shapes behavior. Behavior changes workload mechanics like queueing, handoffs, and escalation load. Those mechanics change customer outcomes. If you cannot tell that story end to end, the metric is not ready to be used for high stakes decisions.\u003C/p>\n\u003Cp>Run the audit with a small group: a support ops lead, a frontline manager, one tenured agent who actually works the queue, and someone who sees quality signals like QA or escalations. This is where teams get burned: if the room is only dashboard people, the result will sound logical and still be wrong, because the incentives and shortcuts live on the floor.\u003C/p>\n\u003Cp>Here is a workflow table you can copy into your monthly metric review.\u003C/p>\n\u003Cp>Two worked examples show what “mapping the chain” looks like in practice.\u003C/p>\n\u003Cp>First example: a speed metric, first response time.\u003C/p>\n\u003Cp>Intended outcome: customers feel acknowledged quickly and trust the issue is being handled.\u003C/p>\n\u003Cp>Behavior it rewards: “touch it fast.” Agents send a quick acknowledgement, or they take the simplest next action that stops the clock. In some workflows that means a canned reply. In others it means moving the ticket to a different queue to get it out of their view.\u003C/p>\n\u003Cp>Workload mechanics it changes: handoffs rise, because the fastest way to hit a response target is to route quickly. Backlog age can quietly worsen, because you are increasing the number of touches without increasing the number of true resolutions. 
The queue becomes noisier, which makes it harder for agents to find the work that actually needs deep attention.\u003C/p>\n\u003Cp>Outcome validation: if first response time improves and reopen rate rises, you did not improve “felt heard.” You improved “received a message.” Also look at repeat contact rate, meaning the customer contacts you again for the same issue or account within a short window. A meaningful “felt heard” improvement should reduce repeat contacts, not increase them.\u003C/p>\n\u003Cp>Decision rule: keep first response time as an operational health metric, but fence it with a counter metric that makes shallow replies expensive. One simple fence is “first response time plus repeat contact within 7 days.” Another is “first response time plus reopen rate.” If first response time improves while repeat contacts increase, the fence triggers a review and you stop celebrating the speed win.\u003C/p>\n\u003Cp>Second example: a throughput metric, closures per agent or average handle time.\u003C/p>\n\u003Cp>Intended outcome: resolve efficiently so more customers get help with the same staffing.\u003C/p>\n\u003Cp>Behavior it rewards: rushing to an ending. Agents learn to avoid complex tickets, push for the customer to “try again,” or close tickets that feel uncertain. If the system allows it, you will see a rise in “solved pending customer” states that act like a polite exit ramp.\u003C/p>\n\u003Cp>Workload mechanics it changes: escalations rise because triage becomes shallow. Specialists become the dumping ground for uncertainty. The escalation backlog grows, which does not always show up in frontline dashboards, so leadership sees “efficiency gains” that are really cost shifting. Reopens rise because premature closure pulls work into a second cycle.\u003C/p>\n\u003Cp>Outcome validation: watch escalation rate per 100 tickets, QA sampling results, and reopen rate. 
If average handle time drops 20 percent in two weeks and escalations jump, you did not get faster at resolution. You got faster at exiting the conversation.\u003C/p>\n\u003Cp>Decision rule: if you keep average handle time at all, fence it with quality. Make it impossible to “win” handle time while losing QA. This is also where teams get burned: they use average handle time to rank individuals. You end up rewarding the agent who avoids complexity and punishing the agent who takes the hard work that prevents churn and outages.\u003C/p>\n\u003Cp>Now decide what to do with the proxy metric. You have four options.\u003C/p>\n\u003Col>\n\u003Cli>\u003Cp>Keep it when the chain makes sense and it tracks with outcomes.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Demote it when it is still useful for awareness but not reliable for comparisons or incentives.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Fence it when the metric is operationally useful but easy to game. A fence is a counter metric that changes the incentive so shortcuts stop looking attractive.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Retire it when the chain is broken and the workarounds are more predictable than the customer benefit.\u003C/p>\n\u003C/li>\n\u003C/ol>\n\u003Cp>A common mistake at this stage is thinking you need to pick one hero metric. You do not. You need one story that holds together. If the metric story is “we respond faster, therefore support is better,” you are going to keep rediscovering the same pain. If the story is “we respond fast and we resolve correctly, proven by fewer repeat contacts and fewer reopens,” you can move faster without fooling yourself.\u003C/p>\n\u003Cp>Another common mistake is trusting averages. Averages are polite. The customer anger lives in the tail. Always look at backlog age bands and a high percentile time to resolution. 
If your median looks great but the 90th percentile is getting worse, your dashboard is giving you a comforting bedtime story while the queue quietly turns into a horror novel.\u003C/p>\n\u003Cp>Keep these controls visible as you run audits and set guardrails.\u003C/p>\n\u003Ch2>Triangulate signals: pair leading proxies with lagging outcomes so you can’t ‘win’ by gaming one number\u003C/h2>\n\u003Cp>Once you see how proxy metrics in customer support create pressure, the next move is obvious: stop betting the decision on one number.\u003C/p>\n\u003Cp>The goal is not to drown everyone in metrics. The goal is to build a minimum triangulation set so no one can “win” the dashboard while customers lose the experience.\u003C/p>\n\u003Cp>Think of it like driving with a speedometer only. You can keep the needle in the right place and still run out of fuel, overheat the engine, or miss the fact that the road is covered in ice. Speed is real. It just is not the whole system.\u003C/p>\n\u003Cp>A practical minimum triangulation set has four parts.\u003C/p>\n\u003Col>\n\u003Cli>\u003Cp>Speed: first response time, response SLA hit rate, or time to first reply.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Throughput: time to resolution, solved volume, or closure rate.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Quality: reopen rate, repeat contact rate, escalation rate, and a light QA signal.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Customer sentiment: CSAT trends and complaint themes from verbatims.\u003C/p>\n\u003C/li>\n\u003C/ol>\n\u003Cp>If you can only afford one quality metric beyond escalations, choose repeat contact rate within 7 to 14 days. Reopens are useful, but they depend on workflow. A team can reduce reopens by making re opening harder, which is not the kind of innovation you want. Repeat contact is harder to “fix” without actually helping.\u003C/p>\n\u003Cp>Here are concrete pairings you can put next to each other on a dashboard. The point is not decoration. 
The point is tension. If the proxy improves but the counter metric worsens, you have to pause and explain.\u003C/p>\n\u003Col>\n\u003Cli>\u003Cp>Response SLA hit rate paired with reopen rate.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>First response time paired with repeat contact rate within 7 days.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Time to resolution paired with escalation rate per 100 tickets.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Closures per agent paired with QA pass rate from a weekly sample.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Average handle time paired with transfer rate or handoff count.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Backlog size paired with backlog age bands, especially tickets older than 3 days and 7 days.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Chat concurrency paired with CSAT and transfer rate.\u003C/p>\n\u003C/li>\n\u003C/ol>\n\u003Cp>Notice what these pairings do. They keep speed and throughput, but they force you to pay for them with quality if you are cutting corners.\u003C/p>\n\u003Cp>Now add leading indicators, because lagging outcomes like CSAT often arrive late. In many B2B environments, CSAT response rates are low and the feedback comes in waves. If you wait for CSAT to confirm what agents already feel, you will be late.\u003C/p>\n\u003Cp>Good leading indicators include backlog age bands, repeat contact rate, escalation backlog, and the share of tickets that require more than one agent touch. They show you friction before customers fill out surveys.\u003C/p>\n\u003Cp>A scenario where leading indicators prevent a bad call looks like this.\u003C/p>\n\u003Cp>Week 0: leadership asks for a cost reduction plan. Your dashboard shows response SLA is green and average handle time is down. The temptation is to reduce coverage or freeze hiring.\u003C/p>\n\u003Cp>Week 1: nothing explodes, but backlog age shifts. Tickets older than 3 days rise from 4 percent to 9 percent. Escalation backlog rises from 18 to 35 open items. 
A small QA sample finds more “customer asked again” notes.\u003C/p>\n\u003Cp>Week 2: CSAT has not moved yet because volume is low. Meanwhile, repeat contact rate within 7 days rises from 11 percent to 16 percent.\u003C/p>\n\u003Cp>If you act only on the green SLA and lower handle time, you cut staff right as the system is building debt. If you act on the leading indicators, you pause the cost decision, investigate, and often find that the “efficiency gain” came from shallow triage, more transfers, and more unresolved tickets aging in quiet corners.\u003C/p>\n\u003Cp>This is also why a blended “overall” dashboard tile can be risky. When the work shifts, the blend becomes a hiding place. A strong practice is to review the same triangulation set by channel and by severity band at least once a month. You do not need to do it daily. You just need to do it often enough that mix shifts do not get a head start.\u003C/p>\n\u003Cp>When signals split, do not treat it as confusion. Treat it as a diagnosis tool. Here are rules of thumb that work well in real queues.\u003C/p>\n\u003Col>\n\u003Cli>\u003Cp>Speed improves and reopens increase: you are getting faster at acknowledging, not resolving. Look for canned replies, premature routing, and “touch fast” behaviors.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Average handle time drops and escalations rise: you are pushing uncertainty to specialists. Look at triage notes and escalation reasons.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Closure rate rises and repeat contacts rise: customers are coming back because the issue is not done. Look for “solved pending customer” patterns and closure reasons.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Response SLA is green and backlog age worsens: you are meeting the letter of the SLA while building a long tail. 
Look for how the SLA clock is defined and where tickets can pause.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>CSAT stays flat and agent sentiment worsens: assume agents are seeing the loophole before customers do. Ask what changed in the workflow or target.\u003C/p>\n\u003C/li>\n\u003C/ol>\n\u003Cp>If you want extra framing on why metrics can create perverse incentives even when everyone is acting in good faith, this is a good general read: \u003Ca href=\"#ref-2\" title=\"whennotesfly.com — whennotesfly.com\">[2]\u003C/a>\u003C/p>\n\u003Ch2>Failure modes: when team/branch comparisons lie because the work changed (not the performance)\u003C/h2>\n\u003Cp>Support teams love comparisons, and so does leadership. It feels fair. It feels objective. It produces neat rankings.\u003C/p>\n\u003Cp>It also regularly produces nonsense.\u003C/p>\n\u003Cp>The most common reason proxy metrics in customer support mislead is not that agents are gaming. It is that the work is not comparable. Volume mix changes. Channel mix changes. Severity mix changes. Coverage changes. If you treat those as “performance,” you will praise the wrong team, punish the wrong team, and teach everyone to chase the easiest work.\u003C/p>\n\u003Cp>Start with a volume mix example.\u003C/p>\n\u003Cp>Team A closes 44 tickets per agent per day with a median handle time of 8 minutes.\u003C/p>\n\u003Cp>Team B closes 29 tickets per agent per day with a median handle time of 15 minutes.\u003C/p>\n\u003Cp>On a leaderboard, Team A looks like a dream and Team B looks like a problem.\u003C/p>\n\u003Cp>Now stratify by case type.\u003C/p>\n\u003Cp>Within “password and access,” Team A and Team B both average about 8 to 9 minutes and have similar reopen rates.\u003C/p>\n\u003Cp>Within “billing disputes,” Team B handles a larger share, takes longer, but has fewer escalations and fewer repeat contacts. Team A handles fewer of these and escalates more often.\u003C/p>\n\u003Cp>The raw KPI story was not performance. 
It was distribution.\u003C/p>\n\u003Cp>You do not need complex modeling to fix this. You need the discipline to compare within like for like buckets. Start with three to five case types that represent most volume, and compare teams within those. If you have severity tags, compare within severity bands as well.\u003C/p>\n\u003Cp>Channel mix is the next distortion, and it quietly breaks many speed metrics.\u003C/p>\n\u003Cp>Chat, email, and phone behave differently. In chat, first response time is part of the experience because the customer is waiting in real time. In email, first response time matters, but the quality of the next message often matters more than the speed of the first. On phone, handle time includes identity verification, empathy, and time spent navigating internal tools while the customer waits.\u003C/p>\n\u003Cp>If one team works mostly chat and another works mostly email, comparing first response time is like comparing ovens by how quickly they preheat. It tells you something, but not whether dinner tastes good.\u003C/p>\n\u003Cp>A simple normalization approach that works well operationally is this:\u003C/p>\n\u003Col>\n\u003Cli>\u003Cp>Report speed and throughput metrics separately by channel first.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Only then show a blended “overall” number, and label it clearly as blended.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>If the channel mix shifts materially, pause cross team comparisons until you can explain the change.\u003C/p>\n\u003C/li>\n\u003C/ol>\n\u003Cp>Severity mix is the distortion that creates the most resentment.\u003C/p>\n\u003Cp>High severity work should take longer. If it does not, that is not efficiency. That is risk. When your most experienced agents are handling critical cases, you want them to do the deep work, write clear notes, coordinate across functions, and prevent recurrence. 
Those behaviors inflate handle time and time to resolution, and they often improve real outcomes like fewer escalations later and fewer repeat contacts.\u003C/p>\n\u003Cp>If you compare raw time to resolution across teams without severity bands, you teach people to avoid severity work. This is where teams get burned because the incentive is subtle. No one says “avoid important work.” The metrics say “avoid the work that makes you look slow.”\u003C/p>\n\u003Cp>Shift coverage and queue dynamics also distort comparisons in ways that look like individual performance.\u003C/p>\n\u003Cp>The team that works the spike inherits the worst waiting times and the angriest customers. If you rank by CSAT or by time to first reply without adjusting for shift and queue conditions, you will conclude that the spike team is worse, when they are simply standing under the waterfall.\u003C/p>\n\u003Cp>Two practical fixes help quickly.\u003C/p>\n\u003Cp>First, compare within similar time windows. If you review weekly, look at day of week and hour bands when you diagnose a team gap.\u003C/p>\n\u003Cp>Second, include queue health context like backlog age bands and arrival rate, so you see who carried the surge.\u003C/p>\n\u003Cp>Now the question everyone asks: how do you tell mix shift from gaming?\u003C/p>\n\u003Cp>They can look similar at first, but patterns usually differ.\u003C/p>\n\u003Cp>Mix shift tends to move several metrics together in a way that makes sense. Higher severity mix usually increases time to resolution and handle time, and it can increase escalations for legitimate reasons. It might also increase “thank you” verbatims when the work is done well.\u003C/p>\n\u003Cp>Gaming tends to improve the targeted metric in a clean, sudden way, while nearby quality metrics worsen. Closure rate jumps, but repeat contacts and reopens also jump. First response time drops, but handoffs and transfers rise. 
Escalation backlog grows while frontline metrics look great.\u003C/p>\n\u003Cp>When comparisons are invalid, do not force them. Pause ranking. Add context notes to the dashboard. Make staffing decisions using stratified views by channel and severity. If leadership insists on a single ranking, push back with a clear statement: you can rank teams, or you can be fair, but you cannot do both with unadjusted proxies.\u003C/p>\n\u003Cp>For a general look at metric anti patterns that show up in many orgs, not just support, this resource is a helpful reference: \u003Ca href=\"#ref-3\" title=\"kpitree.co — kpitree.co\">[3]\u003C/a>\u003C/p>\n\u003Ch2>Decision guardrails: when to trust dashboards vs when to require a human review (and what that review must include)\u003C/h2>\n\u003Cp>Dashboards are excellent for visibility. They are not a substitute for judgment. The problem is that judgment is often applied randomly, which is how you get politics disguised as “intuition.”\u003C/p>\n\u003Cp>Guardrails solve that. Guardrails are explicit rules that say when the numbers are safe to act on and when you must slow down and review what is actually happening in the queue.\u003C/p>\n\u003Cp>A simple trust rubric for any metric has three questions.\u003C/p>\n\u003Cp>Stability: does it behave consistently when the environment is stable, or does it swing wildly with small operational changes?\u003C/p>\n\u003Cp>Interpretability: can you explain what moved and why without a two day investigation?\u003C/p>\n\u003Cp>Susceptibility to gaming: can a reasonable agent move the number without improving the customer outcome?\u003C/p>\n\u003Cp>Metrics that are unstable, hard to interpret, and easy to game can still be displayed, but they should not drive incentives, staffing cuts, or performance rankings.\u003C/p>\n\u003Cp>Next, define review triggers. 
A trigger is an observable pattern that forces a human review before you make a decision or celebrate a win.\u003C/p>\n\u003Cp>Here are eight triggers that work well in real support ops, each tied to a specific risk.\u003C/p>\n\u003Col>\n\u003Cli>\u003Cp>SLA improves while CSAT drops. Risk mitigated: speed wins masking quality regression.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>First response time improves while repeat contacts rise. Risk mitigated: shallow acknowledgements and unresolved issues.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Closure rate increases while reopen rate increases. Risk mitigated: premature closure and ticket ping pong.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Average handle time drops sharply after a target change. Risk mitigated: rushed triage and hidden escalation load.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Escalation backlog grows while frontline KPIs look stable. Risk mitigated: cost shifting to specialists and delayed customer pain.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Ticket mix shifts materially by channel, severity, or product area. Risk mitigated: invalid comparisons and wrong staffing moves.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Policy changes that affect eligibility or required steps. Risk mitigated: attributing policy driven friction to agent performance.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Tooling or workflow changes that alter routing, forms, or definitions. Risk mitigated: trend discontinuities that make “improvement” meaningless.\u003C/p>\n\u003C/li>\n\u003C/ol>\n\u003Cp>This is where teams get burned: they treat a green dashboard as permission to make irreversible changes. For example, cutting headcount because response SLA is green is a classic trap. A green SLA can coexist with a growing backlog tail, rising repeat contacts, and an escalation pile that will explode next month. Headcount changes are slow to reverse. 
Your guardrail should require a review packet before any staffing reduction when quality counters are moving in the wrong direction.\u003C/p>\n\u003Cp>So what does “human review” mean, operationally? It should not mean “a manager has a feeling.” It should mean a small, repeatable packet that combines quantitative slices and qualitative evidence.\u003C/p>\n\u003Cp>A minimum review packet can be:\u003C/p>\n\u003Col>\n\u003Cli>\u003Cp>A four to eight week trend view of the proxy metric and its counter metrics, sliced by channel and severity.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Backlog age bands and a high percentile time to resolution, not just the median.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Escalation volume and escalation backlog, with top escalation reasons.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>A small QA sample focused on the behavior you suspect, for example first responses that are too thin, or closures that happen after one back and forth.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Customer verbatims from detractors and passives, not just promoters.\u003C/p>\n\u003C/li>\n\u003Cli>\u003Cp>Agent feedback collected with one direct question: “What behavior does this metric encourage that makes outcomes worse?” If you hear the same thing from multiple tenured agents, you have found the incentive.\u003C/p>\n\u003C/li>\n\u003C/ol>\n\u003Cp>You can keep the review lightweight. The point is to make it consistent and hard to ignore.\u003C/p>\n\u003Cp>Finally, name the tradeoffs explicitly, because unspoken tradeoffs become blind spots.\u003C/p>\n\u003Cp>Support is always balancing speed, quality, and cost. If you push speed hard, you risk quality regressions through shallow triage, more handoffs, and more repeat contacts. If you push cost hard, you risk longer waits and higher customer effort. If you push quality hard, you risk slower throughput and the need for more staffing.\u003C/p>\n\u003Cp>The problem is not that tradeoffs exist. 
The problem is pretending they do not, then acting surprised when reopens and escalations rise.\u003C/p>\n\u003Cp>A tasteful reality check: optimizing support on a single KPI is like trying to judge a movie by how fast you can watch it. You might finish quickly, but you will miss the plot.\u003C/p>\n\u003Cp>If you want additional perspective on how metrics can undermine good decision making even when everyone is acting rationally, this is a solid read: \u003Ca href=\"#ref-4\" title=\"interwebicly.com — interwebicly.com\">[4]\u003C/a>\u003C/p>\n\u003Ch2>Do this next: a 30-day metric reset that fixes incentives without whiplash\u003C/h2>\n\u003Cp>You can fix misleading proxy metrics in customer support without a massive dashboard rebuild or a culture war. What you need is a short, time boxed reset with clear outputs, and one communication move that keeps agents from feeling like the scoreboard changes every week.\u003C/p>\n\u003Cp>Week 1: pick one suspect KPI and run the Proxy Chain Audit end to end.\u003C/p>\n\u003Cp>Choose the KPI that is driving the most behavior right now. Common candidates are response SLA hit rate, first response time, closures per agent, and average handle time. Run the audit with a small group and produce one output artifact: a one page summary that states the intended outcome, the observed behavior, the workload effects, the outcome validation, and your decision to keep, demote, fence, or retire.\u003C/p>\n\u003Cp>Week 2: add counter metrics and stop using invalid comparisons.\u003C/p>\n\u003Cp>Add one or two fences, not ten. If you fence first response time, add repeat contact within 7 days and handoff count. If you fence average handle time, add escalation rate and a QA sample score. Then pause any cross team leaderboard until you can compare within channel and severity bands. Publish a short dashboard note that says what changed and what comparisons are no longer valid. 
That small honesty prevents a lot of internal damage.\u003C/p>\n\u003Cp>Week 3: calibrate targets with a pilot team and document exceptions.\u003C/p>\n\u003Cp>Pick one team to pilot the new guardrails for two weeks. Watch for unintended behavior. If you see a rise in canned replies or a spike in “pending customer” closures, do not blame the team. Adjust the fence. Document exceptions like outage periods, major releases, or policy changes so you do not punish teams for doing the right work during abnormal weeks.\u003C/p>\n\u003Cp>Week 4: publish a metric contract and schedule the recurring review.\u003C/p>\n\u003Cp>A metric contract is a one page reference that lists definitions, slices you will use for fairness, the fences that sit next to each proxy metric, and the review triggers that force human review. Share it with agents and managers in plain language, including what you are optimizing for and what you are not. When people understand why metrics changed, they stop inventing their own stories about it.\u003C/p>\n\u003Cp>Then put it on the calendar: a monthly metric review that uses the triangulation set and the guardrail triggers. Also set the rule that any mix shift, policy change, or tooling change triggers an off cycle review. That is when proxy measures are most likely to start lying.\u003C/p>\n\u003Cp>If you do one thing this week, do this: copy the Proxy Chain Audit workflow into a doc, pick one KPI that looks green but feels wrong, and run the chain with two tenured agents in the room. The goal is not perfection. 
The goal is to stop rewarding shortcuts and start rewarding outcomes you can defend.\u003C/p>\n\u003Ch2>Sources\u003C/h2>\n\u003Col>\n\u003Cli>\u003Ca href=\"https://www.arcticdba.se/posts/goodharts-law\">arcticdba.se\u003C/a> — arcticdba.se\u003C/li>\n\u003Cli>\u003Ca href=\"https://whennotesfly.com/concepts/metrics-measurement-evaluation/why-metrics-often-mislead\">whennotesfly.com\u003C/a> — whennotesfly.com\u003C/li>\n\u003Cli>\u003Ca href=\"https://kpitree.co/guides/strategy-culture/metric-anti-patterns\">kpitree.co\u003C/a> — kpitree.co\u003C/li>\n\u003Cli>\u003Ca href=\"https://interwebicly.com/blog/metrics-that-ruin-good-judgment\">interwebicly.com\u003C/a> — interwebicly.com\u003C/li>\n\u003C/ol>\n",{"body":37},"## Your dashboard is green—so why do customers and agents say support is worse?\n\nIf you run support operations long enough, you will eventually hit the strangest kind of incident: nothing looks broken, yet everyone is tense. The dashboard is green. SLA hit rate is up. First response time is down. Closure volume is up. Average handle time is down. The weekly review feels like a victory lap.\n\nThen you talk to customers and hear something else. They are not saying “thanks for the faster first reply.” They are saying “I keep repeating myself,” “I got bounced between people,” “you marked it solved but it is not solved,” and “I feel like I am arguing with a form.” Agents tell you they are exhausted and that the work feels messier, not cleaner. Managers start spending their time on exceptions and escalations. You start wondering whether you are measuring the wrong thing, or measuring the right thing in a way that no longer matches reality.\n\nThat is where proxy metrics in customer support get dangerous.\n\nA proxy metric is a number that is easier to measure than the thing you truly care about, so it becomes a stand in. It is not automatically bad. It is often useful. But it is a guess. 
An outcome metric is closer to the real customer result you want, like “the issue is resolved correctly,” “the customer did not need to contact us again,” or “the customer feels confident using the product.”\n\nA concrete example shows the trap quickly. You tighten your first response SLA and celebrate hitting 95 percent. Agents learn the fastest way to improve the number: reply quickly, even if the reply is thin. The workflow now rewards quick touches and quick transfers. A week later, reopen rate climbs from 7 percent to 12 percent and escalations climb as well. You got faster at beginning the conversation. You did not get better at finishing it.\n\nThis is Goodhart’s Law wearing a headset: when a measure becomes a target, it stops being a good measure. If you want a simple explanation of why this happens across many systems, not just support, this overview is useful: [[1]](#ref-1 \"arcticdba.se — arcticdba.se\")\n\nThe promise of this article is practical: a repeatable operator workflow to identify which KPIs are acting as harmful proxies, what behaviors they are creating, and what to do next. You will not leave with “delete all metrics” or “trust CSAT only.” You will leave with an audit method, a triangulation set, and guardrails that tell you when dashboards are safe to act on and when human review is mandatory.\n\n## Run the proxy-metric audit: map metric → behavior → workload → customer outcome\n\n| Control | Where it lives | What to set | What breaks if it’s wrong |\n| --- | --- | --- | --- |\n| Set: Proxy Chain Audit: Workload → Customer Outcome | Customer journey maps, research | Verify workload drives customer value. E.g., Efficient triage → Faster resolution. | Busy teams, no customer benefit. Wasted resources. |\n| Set: Anomaly Detection: Context Shifts | Monitoring system, team comms | Alerts for changes in workload, segments, product features. | Invalid comparisons. Good performance looks bad, bad looks good. 
|\n| Set: Proxy Chain Audit: Metric → Behavior | Metric definitions, team wiki | Map metric to intended action. E.g., FRT → Agent responds quickly. | Teams hit numbers, not goals. 'Green' dashboards hide bad outcomes. |\n| Set: Proxy Chain Audit: Behavior → Workload | Process docs, workflow diagrams | Connect behavior to required tasks/effort. E.g., Quick response → Efficient triage. | Teams find metric shortcuts. Burnout from misaligned effort. |\n| Set: Guardrail: CSAT for Speed Metrics (e.g., FRT, AHT) | Dashboard (next to FRT/AHT) | Minimum CSAT score. If FRT improves but CSAT drops, investigate. | Agents rush, customers frustrated. Repeat contacts. |\n| Set: Guardrail: QA for Throughput Metrics (e.g., Closures, AHT) | Dashboard (next to Closures/AHT) | Minimum QA score. If closures increase but QA drops, investigate. | Agents close tickets prematurely. Re-opens, churn. |\n| Set: Decision Rule: Trust, Demote, Fence, Retire | Metric governance, team review | Criteria for each outcome based on audit/guardrail results. | Teams chase misleading metrics. Inability to adapt. |\n\nWhen a KPI and lived experience disagree, teams often do one of two unhelpful things. They argue in circles, trading anecdotes until the meeting ends. Or they pick a side based on hierarchy, which is just arguing with a nicer font.\n\nYou need a shared method that turns “this feels off” into a diagnosis and a decision.\n\nCall it the Proxy Chain Audit. The idea is simple: every metric creates pressure. Pressure shapes behavior. Behavior changes workload mechanics like queueing, handoffs, and escalation load. Those mechanics change customer outcomes. If you cannot tell that story end to end, the metric is not ready to be used for high stakes decisions.\n\nRun the audit with a small group: a support ops lead, a frontline manager, one tenured agent who actually works the queue, and someone who sees quality signals like QA or escalations. 
This is where teams get burned: if the room is only dashboard people, the result will sound logical and still be wrong, because the incentives and shortcuts live on the floor.\n\nThe control table at the top of this section doubles as a workflow you can copy into your monthly metric review.\n\nTwo worked examples show what “mapping the chain” looks like in practice.\n\nFirst example: a speed metric, first response time.\n\nIntended outcome: customers feel acknowledged quickly and trust the issue is being handled.\n\nBehavior it rewards: “touch it fast.” Agents send a quick acknowledgement, or they take the simplest next action that stops the clock. In some workflows that means a canned reply. In others it means moving the ticket to a different queue to get it out of their view.\n\nWorkload mechanics it changes: handoffs rise, because the fastest way to hit a response target is to route quickly. Backlog age can quietly worsen, because you are increasing the number of touches without increasing the number of true resolutions. The queue becomes noisier, which makes it harder for agents to find the work that actually needs deep attention.\n\nOutcome validation: if first response time improves and reopen rate rises, you did not improve “felt heard.” You improved “received a message.” Also look at repeat contact rate, meaning the customer contacts you again for the same issue or account within a short window. A meaningful “felt heard” improvement should reduce repeat contacts, not increase them.\n\nDecision rule: keep first response time as an operational health metric, but fence it with a counter metric that makes shallow replies expensive. 
One simple fence is “first response time plus repeat contact within 7 days.” Another is “first response time plus reopen rate.” If first response time improves while repeat contacts increase, the fence triggers a review and you stop celebrating the speed win.\n\nSecond example: a throughput metric, closures per agent or average handle time.\n\nIntended outcome: resolve efficiently so more customers get help with the same staffing.\n\nBehavior it rewards: rushing to an ending. Agents learn to avoid complex tickets, push for the customer to “try again,” or close tickets that feel uncertain. If the system allows it, you will see a rise in “solved pending customer” states that act like a polite exit ramp.\n\nWorkload mechanics it changes: escalations rise because triage becomes shallow. Specialists become the dumping ground for uncertainty. The escalation backlog grows, which does not always show up in frontline dashboards, so leadership sees “efficiency gains” that are really cost shifting. Reopens rise because premature closure pulls work into a second cycle.\n\nOutcome validation: watch escalation rate per 100 tickets, QA sampling results, and reopen rate. If average handle time drops 20 percent in two weeks and escalations jump, you did not get faster at resolution. You got faster at exiting the conversation.\n\nDecision rule: if you keep average handle time at all, fence it with quality. Make it impossible to “win” handle time while losing QA. This is also where teams get burned: they use average handle time to rank individuals. You end up rewarding the agent who avoids complexity and punishing the agent who takes the hard work that prevents churn and outages.\n\nNow decide what to do with the proxy metric. You have four options.\n\n1. Keep it when the chain makes sense and it tracks with outcomes.\n\n2. Demote it when it is still useful for awareness but not reliable for comparisons or incentives.\n\n3. 
Fence it when the metric is operationally useful but easy to game. A fence is a counter metric that changes the incentive so shortcuts stop looking attractive.\n\n4. Retire it when the chain is broken and the workarounds are more predictable than the customer benefit.\n\nA common mistake at this stage is thinking you need to pick one hero metric. You do not. You need one story that holds together. If the metric story is “we respond faster, therefore support is better,” you are going to keep rediscovering the same pain. If the story is “we respond fast and we resolve correctly, proven by fewer repeat contacts and fewer reopens,” you can move faster without fooling yourself.\n\nAnother common mistake is trusting averages. Averages are polite. The customer anger lives in the tail. Always look at backlog age bands and a high percentile time to resolution. If your median looks great but the 90th percentile is getting worse, your dashboard is giving you a comforting bedtime story while the queue quietly turns into a horror novel.\n\nKeep these controls visible as you run audits and set guardrails.\n\n## Triangulate signals: pair leading proxies with lagging outcomes so you can’t ‘win’ by gaming one number\n\nOnce you see how proxy metrics in customer support create pressure, the next move is obvious: stop betting the decision on one number.\n\nThe goal is not to drown everyone in metrics. The goal is to build a minimum triangulation set so no one can “win” the dashboard while customers lose the experience.\n\nThink of it like driving with a speedometer only. You can keep the needle in the right place and still run out of fuel, overheat the engine, or miss the fact that the road is covered in ice. Speed is real. It just is not the whole system.\n\nA practical minimum triangulation set has four parts.\n\n1. Speed: first response time, response SLA hit rate, or time to first reply.\n\n2. Throughput: time to resolution, solved volume, or closure rate.\n\n3. 
Quality: reopen rate, repeat contact rate, escalation rate, and a light QA signal.\n\n4. Customer sentiment: CSAT trends and complaint themes from verbatims.\n\nIf you can only afford one quality metric beyond escalations, choose repeat contact rate within 7 to 14 days. Reopens are useful, but they depend on workflow. A team can reduce reopens by making reopening harder, which is not the kind of innovation you want. Repeat contact is harder to “fix” without actually helping.\n\nHere are concrete pairings you can put next to each other on a dashboard. The point is not decoration. The point is tension. If the proxy improves but the counter metric worsens, you have to pause and explain.\n\n1. Response SLA hit rate paired with reopen rate.\n\n2. First response time paired with repeat contact rate within 7 days.\n\n3. Time to resolution paired with escalation rate per 100 tickets.\n\n4. Closures per agent paired with QA pass rate from a weekly sample.\n\n5. Average handle time paired with transfer rate or handoff count.\n\n6. Backlog size paired with backlog age bands, especially tickets older than 3 days and 7 days.\n\n7. Chat concurrency paired with CSAT and transfer rate.\n\nNotice what these pairings do. They keep speed and throughput, but they force you to pay for them with quality if you are cutting corners.\n\nNow add leading indicators, because lagging outcomes like CSAT often arrive late. In many B2B environments, CSAT response rates are low and the feedback comes in waves. If you wait for CSAT to confirm what agents already feel, you will be late.\n\nGood leading indicators include backlog age bands, repeat contact rate, escalation backlog, and the share of tickets that require more than one agent touch. They show you friction before customers fill out surveys.\n\nA scenario where leading indicators prevent a bad call looks like this.\n\nWeek 0: leadership asks for a cost reduction plan. 
Your dashboard shows response SLA is green and average handle time is down. The temptation is to reduce coverage or freeze hiring.\n\nWeek 1: nothing explodes, but backlog age shifts. Tickets older than 3 days rise from 4 percent to 9 percent. Escalation backlog rises from 18 to 35 open items. A small QA sample finds more “customer asked again” notes.\n\nWeek 2: CSAT has not moved yet because volume is low. Meanwhile, repeat contact rate within 7 days rises from 11 percent to 16 percent.\n\nIf you act only on the green SLA and lower handle time, you cut staff right as the system is building debt. If you act on the leading indicators, you pause the cost decision, investigate, and often find that the “efficiency gain” came from shallow triage, more transfers, and more unresolved tickets aging in quiet corners.\n\nThis is also why a blended “overall” dashboard tile can be risky. When the work shifts, the blend becomes a hiding place. A strong practice is to review the same triangulation set by channel and by severity band at least once a month. You do not need to do it daily. You just need to do it often enough that mix shifts do not get a head start.\n\nWhen signals split, do not treat it as confusion. Treat it as a diagnosis tool. Here are rules of thumb that work well in real queues.\n\n1. Speed improves and reopens increase: you are getting faster at acknowledging, not resolving. Look for canned replies, premature routing, and “touch fast” behaviors.\n\n2. Average handle time drops and escalations rise: you are pushing uncertainty to specialists. Look at triage notes and escalation reasons.\n\n3. Closure rate rises and repeat contacts rise: customers are coming back because the issue is not done. Look for “solved pending customer” patterns and closure reasons.\n\n4. Response SLA is green and backlog age worsens: you are meeting the letter of the SLA while building a long tail. Look for how the SLA clock is defined and where tickets can pause.\n\n5. 
CSAT stays flat and agent sentiment worsens: assume agents are seeing the loophole before customers do. Ask what changed in the workflow or target.\n\nIf you want extra framing on why metrics can create perverse incentives even when everyone is acting in good faith, this is a good general read: [[2]](#ref-2 \"whennotesfly.com — whennotesfly.com\")\n\n## Failure modes: when team/branch comparisons lie because the work changed (not the performance)\n\nSupport teams love comparisons, and so does leadership. It feels fair. It feels objective. It produces neat rankings.\n\nIt also regularly produces nonsense.\n\nThe most common reason proxy metrics in customer support mislead is not that agents are gaming. It is that the work is not comparable. Volume mix changes. Channel mix changes. Severity mix changes. Coverage changes. If you treat those as “performance,” you will praise the wrong team, punish the wrong team, and teach everyone to chase the easiest work.\n\nStart with a volume mix example.\n\nTeam A closes 44 tickets per agent per day with a median handle time of 8 minutes.\n\nTeam B closes 29 tickets per agent per day with a median handle time of 15 minutes.\n\nOn a leaderboard, Team A looks like a dream and Team B looks like a problem.\n\nNow stratify by case type.\n\nWithin “password and access,” Team A and Team B both average about 8 to 9 minutes and have similar reopen rates.\n\nWithin “billing disputes,” Team B handles a larger share, takes longer, but has fewer escalations and fewer repeat contacts. Team A handles fewer of these and escalates more often.\n\nThe raw KPI story was not performance. It was distribution.\n\nYou do not need complex modeling to fix this. You need the discipline to compare within like for like buckets. Start with three to five case types that represent most volume, and compare teams within those. 
If you have severity tags, compare within severity bands as well.\n\nChannel mix is the next distortion, and it quietly breaks many speed metrics.\n\nChat, email, and phone behave differently. In chat, first response time is part of the experience because the customer is waiting in real time. In email, first response time matters, but the quality of the next message often matters more than the speed of the first. On phone, handle time includes identity verification, empathy, and time spent navigating internal tools while the customer waits.\n\nIf one team works mostly chat and another works mostly email, comparing first response time is like comparing ovens by how quickly they preheat. It tells you something, but not whether dinner tastes good.\n\nA simple normalization approach that works well operationally is this:\n\n1. Report speed and throughput metrics separately by channel first.\n\n2. Only then show a blended “overall” number, and label it clearly as blended.\n\n3. If the channel mix shifts materially, pause cross team comparisons until you can explain the change.\n\nSeverity mix is the distortion that creates the most resentment.\n\nHigh severity work should take longer. If it does not, that is not efficiency. That is risk. When your most experienced agents are handling critical cases, you want them to do the deep work, write clear notes, coordinate across functions, and prevent recurrence. Those behaviors inflate handle time and time to resolution, and they often improve real outcomes like fewer escalations later and fewer repeat contacts.\n\nIf you compare raw time to resolution across teams without severity bands, you teach people to avoid severity work. This is where teams get burned because the incentive is subtle. 
No one says “avoid important work.” The metrics say “avoid the work that makes you look slow.”\n\nShift coverage and queue dynamics also distort comparisons in ways that look like individual performance.\n\nThe team that works the spike inherits the worst waiting times and the angriest customers. If you rank by CSAT or by time to first reply without adjusting for shift and queue conditions, you will conclude that the spike team is worse, when they are simply standing under the waterfall.\n\nTwo practical fixes help quickly.\n\nFirst, compare within similar time windows. If you review weekly, look at day of week and hour bands when you diagnose a team gap.\n\nSecond, include queue health context like backlog age bands and arrival rate, so you see who carried the surge.\n\nNow the question everyone asks: how do you tell mix shift from gaming?\n\nThey can look similar at first, but patterns usually differ.\n\nMix shift tends to move several metrics together in a way that makes sense. Higher severity mix usually increases time to resolution and handle time, and it can increase escalations for legitimate reasons. It might also increase “thank you” verbatims when the work is done well.\n\nGaming tends to improve the targeted metric in a clean, sudden way, while nearby quality metrics worsen. Closure rate jumps, but repeat contacts and reopens also jump. First response time drops, but handoffs and transfers rise. Escalation backlog grows while frontline metrics look great.\n\nWhen comparisons are invalid, do not force them. Pause ranking. Add context notes to the dashboard. Make staffing decisions using stratified views by channel and severity. 
If leadership insists on a single ranking, push back with a clear statement: you can rank teams, or you can be fair, but you cannot do both with unadjusted proxies.\n\nFor a general look at metric anti patterns that show up in many orgs, not just support, this resource is a helpful reference: [[3]](#ref-3 \"kpitree.co — kpitree.co\")\n\n## Decision guardrails: when to trust dashboards vs when to require a human review (and what that review must include)\n\nDashboards are excellent for visibility. They are not a substitute for judgment. The problem is that judgment is often applied randomly, which is how you get politics disguised as “intuition.”\n\nGuardrails solve that. Guardrails are explicit rules that say when the numbers are safe to act on and when you must slow down and review what is actually happening in the queue.\n\nA simple trust rubric for any metric has three questions.\n\nStability: does it behave consistently when the environment is stable, or does it swing wildly with small operational changes?\n\nInterpretability: can you explain what moved and why without a two day investigation?\n\nSusceptibility to gaming: can a reasonable agent move the number without improving the customer outcome?\n\nMetrics that are unstable, hard to interpret, and easy to game can still be displayed, but they should not drive incentives, staffing cuts, or performance rankings.\n\nNext, define review triggers. A trigger is an observable pattern that forces a human review before you make a decision or celebrate a win.\n\nHere are eight triggers that work well in real support ops, each tied to a specific risk.\n\n1. SLA improves while CSAT drops. Risk mitigated: speed wins masking quality regression.\n\n2. First response time improves while repeat contacts rise. Risk mitigated: shallow acknowledgements and unresolved issues.\n\n3. Closure rate increases while reopen rate increases. Risk mitigated: premature closure and ticket ping pong.\n\n4. 
Average handle time drops sharply after a target change. Risk mitigated: rushed triage and hidden escalation load.\n\n5. Escalation backlog grows while frontline KPIs look stable. Risk mitigated: cost shifting to specialists and delayed customer pain.\n\n6. Ticket mix shifts materially by channel, severity, or product area. Risk mitigated: invalid comparisons and wrong staffing moves.\n\n7. Policy changes that affect eligibility or required steps. Risk mitigated: attributing policy driven friction to agent performance.\n\n8. Tooling or workflow changes that alter routing, forms, or definitions. Risk mitigated: trend discontinuities that make “improvement” meaningless.\n\nThis is where teams get burned: they treat a green dashboard as permission to make irreversible changes. For example, cutting headcount because response SLA is green is a classic trap. A green SLA can coexist with a growing backlog tail, rising repeat contacts, and an escalation pile that will explode next month. Headcount changes are slow to reverse. Your guardrail should require a review packet before any staffing reduction when quality counters are moving in the wrong direction.\n\nSo what does “human review” mean, operationally? It should not mean “a manager has a feeling.” It should mean a small, repeatable packet that combines quantitative slices and qualitative evidence.\n\nA minimum review packet can be:\n\n1. A four to eight week trend view of the proxy metric and its counter metrics, sliced by channel and severity.\n\n2. Backlog age bands and a high percentile time to resolution, not just the median.\n\n3. Escalation volume and escalation backlog, with top escalation reasons.\n\n4. A small QA sample focused on the behavior you suspect, for example first responses that are too thin, or closures that happen after one back and forth.\n\n5. Customer verbatims from detractors and passives, not just promoters.\n\n6. 
Agent feedback collected with one direct question: “What behavior does this metric encourage that makes outcomes worse?” If you hear the same thing from multiple tenured agents, you have found the incentive.\n\nYou can keep the review lightweight. The point is to make it consistent and hard to ignore.\n\nFinally, name the tradeoffs explicitly, because unspoken tradeoffs become blind spots.\n\nSupport is always balancing speed, quality, and cost. If you push speed hard, you risk quality regressions through shallow triage, more handoffs, and more repeat contacts. If you push cost hard, you risk longer waits and higher customer effort. If you push quality hard, you risk slower throughput and the need for more staffing.\n\nThe problem is not that tradeoffs exist. The problem is pretending they do not, then acting surprised when reopens and escalations rise.\n\nA tasteful reality check: optimizing support on a single KPI is like trying to judge a movie by how fast you can watch it. You might finish quickly, but you will miss the plot.\n\nIf you want additional perspective on how metrics can undermine good decision making even when everyone is acting rationally, this is a solid read: [[4]](#ref-4 \"interwebicly.com — interwebicly.com\")\n\n## Do this next: a 30-day metric reset that fixes incentives without whiplash\n\nYou can fix misleading proxy metrics in customer support without a massive dashboard rebuild or a culture war. What you need is a short, time boxed reset with clear outputs, and one communication move that keeps agents from feeling like the scoreboard changes every week.\n\nWeek 1: pick one suspect KPI and run the Proxy Chain Audit end to end.\n\nChoose the KPI that is driving the most behavior right now. Common candidates are response SLA hit rate, first response time, closures per agent, and average handle time. 
Run the audit with a small group and produce one output artifact: a one page summary that states the intended outcome, the observed behavior, the workload effects, the outcome validation, and your decision to keep, demote, fence, or retire.\n\nWeek 2: add counter metrics and stop using invalid comparisons.\n\nAdd one or two fences, not ten. If you fence first response time, add repeat contact within 7 days and handoff count. If you fence average handle time, add escalation rate and a QA sample score. Then pause any cross team leaderboard until you can compare within channel and severity bands. Publish a short dashboard note that says what changed and what comparisons are no longer valid. That small honesty prevents a lot of internal damage.\n\nWeek 3: calibrate targets with a pilot team and document exceptions.\n\nPick one team to pilot the new guardrails for two weeks. Watch for unintended behavior. If you see a rise in canned replies or a spike in “pending customer” closures, do not blame the team. Adjust the fence. Document exceptions like outage periods, major releases, or policy changes so you do not punish teams for doing the right work during abnormal weeks.\n\nWeek 4: publish a metric contract and schedule the recurring review.\n\nA metric contract is a one page reference that lists definitions, slices you will use for fairness, the fences that sit next to each proxy metric, and the review triggers that force human review. Share it with agents and managers in plain language, including what you are optimizing for and what you are not. When people understand why metrics changed, they stop inventing their own stories about it.\n\nThen put it on the calendar: a monthly metric review that uses the triangulation set and the guardrail triggers. Also set the rule that any mix shift, policy change, or tooling change triggers an off cycle review. 
That is when proxy measures are most likely to start lying.\n\nIf you do one thing this week, do this: copy the Proxy Chain Audit workflow into a doc, pick one KPI that looks green but feels wrong, and run the chain with two tenured agents in the room. The goal is not perfection. The goal is to stop rewarding shortcuts and start rewarding outcomes you can defend.\n\n## Sources\n\n1. [arcticdba.se](https://www.arcticdba.se/posts/goodharts-law) — arcticdba.se\n2. [whennotesfly.com](https://whennotesfly.com/concepts/metrics-measurement-evaluation/why-metrics-often-mislead) — whennotesfly.com\n3. [kpitree.co](https://kpitree.co/guides/strategy-culture/metric-anti-patterns) — kpitree.co\n4. [interwebicly.com](https://interwebicly.com/blog/metrics-that-ruin-good-judgment) — interwebicly.com\n",[39,43],{"_path":40,"path":40,"title":41,"description":42},"/en/blog/stop-arguing-about-data-a-simple-workflow-for-agreeing-on-what-is-true","Stop Arguing About Data: A Simple Workflow for Agreeing on What is True","A practical truth sync workflow for support teams to align on support metrics, catch definition drift, and produce decision grade current truth in 30 to 60 minutes without rebuilding your data stack.",{"_path":44,"path":44,"title":45,"description":46},"/en/blog/from-messy-signals-to-trustworthy-calls-the-weekly-decision-workflow-that-actual","From Messy Signals to Trustworthy Calls: The Weekly Decision Workflow That Actually Holds Up","A practical weekly support decision workflow for operators who need defensible calls from noisy tickets, branch level performance numbers, and escalations. Learn how to gate weak signals, converge on a shared picture, decide with clear rules, and follow up with owners and kill criteria.",1778614419556]