Research, signal design, and decision systems

Our “north star” KPI still moves, but it no longer predicts revenue or retention like it used to. How do we debug whether the metric is broken?

Elena Marín

Answer

Treat this as a prediction failure first, not a dashboard problem. Freeze the KPI definition, reproduce it from raw events to the executive chart, and then audit tracking, pipelines, and outcome alignment in that order. In many teams the metric is still “correct,” but the relationship to value changed because mix, product, pricing, or the lead time changed. Your goal is to classify which of those happened and pick the smallest fix that restores trust.

When your north star stops being a compass

Support and product leaders feel this as whiplash: the KPI is still moving, teams are still shipping, but revenue and retention stop following. Suddenly every review turns into a debate about data, not decisions. The fastest way out is to debug like an operator: isolate what changed, prove the number end to end, then test whether the business relationship broke even if the data is fine.

Below is a practical sequence you can run without turning your org into a statistics lab.

Triage: what exactly changed (level, prediction, or lead time)?

Start by naming the failure mode precisely. “The KPI is off” is not precise enough to fix.

There are three common patterns.

First, the level changed: the KPI jumped or dropped around a date, often right after a release, tracking change, pipeline change, or backfill. That points to measurement.

Second, the KPI still trends smoothly, but prediction broke: it no longer separates good outcomes from bad ones. That points to alignment, mix shifts, or product meaning.

Third, the relationship is intact, but the lead time changed: revenue or retention now follows later than it used to. That points to sales cycle length, onboarding, packaging, or how you define the outcome.

Practical tip: pick two reference windows and treat them like “before” and “after.” For example, the 8 weeks before the break and the 8 weeks after. Make everyone use the same windows while debugging, otherwise you will argue about different pictures.

Also define what “working” means in one line. Example: “A one decile increase in the KPI within 14 days of signup should still correspond to higher 90 day retention and higher expansion rate.” This is the contract you are validating, not just the chart.

Freeze the KPI definition and reproduce the number end to end

Before you touch anything else, freeze the KPI definition as a written spec and stop editing it in place. Strong teams treat metric definitions as versioned artifacts, not living folklore. KPI Tree’s debugging frameworks emphasize this “freeze, then trace” discipline because it prevents the most expensive failure mode: fixing the wrong thing while the definition drifts under your feet.

Your frozen definition should include:

  1. The exact event or events used.

  2. Required properties and filters.

  3. The unit of analysis, such as user, account, workspace.

  4. Time window and timezone.

  5. Deduping rules.

  6. Identity rules, such as how anonymous activity is stitched to logged in users.

  7. Any exclusions, such as internal users, bots, test tenants.
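One way to make the frozen definition hard to edit in place is to store it as an immutable, versioned object. The sketch below is only an illustration: `MetricSpec`, its field names, and every value in `SPEC_V1` are hypothetical, not part of any particular metric layer.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class MetricSpec:
    """A written, versioned KPI definition. All values below are hypothetical."""
    name: str
    version: str
    events: Tuple[str, ...]        # 1. exact event or events used
    filters: Tuple[str, ...]       # 2. required properties and filters
    unit: str                      # 3. user, account, or workspace
    window_days: int               # 4. time window
    timezone: str                  # 4. timezone
    dedupe_key: Tuple[str, ...]    # 5. deduping rule
    identity_rule: str             # 6. how anonymous activity is stitched
    exclusions: Tuple[str, ...]    # 7. internal users, bots, test tenants


SPEC_V1 = MetricSpec(
    name="weekly_active_projects",
    version="1.0.0",
    events=("project_edited",),
    filters=("plan is not null",),
    unit="account",
    window_days=7,
    timezone="UTC",
    dedupe_key=("event_id",),
    identity_rule="stitch anonymous_id to user_id at first login",
    exclusions=("internal_users", "bots", "test_tenants"),
)

# A change never edits v1 in place; it produces a new, explicitly versioned spec.
SPEC_V2 = replace(SPEC_V1, version="2.0.0", window_days=14)
```

Because the dataclass is frozen, an accidental `SPEC_V1.window_days = 14` raises an error instead of silently changing the definition everyone is debugging against.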

Now reproduce the KPI end to end. Recompute it from raw events, then compare it at each step: raw logs, cleaned events, warehouse tables, metric layer, BI dashboard. Calypso’s step by step checks are useful here because they force you to validate each hop instead of trusting the final chart.
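The hop by hop comparison can be as simple as lining up the recomputed count at each stage and finding the first place it diverges. This is a minimal sketch: the hop names, the counts, and the 1 percent tolerance are assumptions for illustration.

```python
def first_divergence(counts_by_hop, tolerance=0.01):
    """Return the first hop whose count differs from the previous hop
    by more than `tolerance` (relative), or None if all hops agree."""
    hops = list(counts_by_hop.items())
    for (_, prev), (name, cur) in zip(hops, hops[1:]):
        if prev == 0:
            if cur != 0:
                return name
            continue
        if abs(cur - prev) / prev > tolerance:
            return name
    return None


# Hypothetical KPI counts for one day, recomputed at each hop in order.
counts = {
    "raw_logs": 10_000,
    "cleaned_events": 9_950,
    "warehouse_table": 9_950,
    "metric_layer": 12_400,   # the number changes here
    "bi_dashboard": 12_400,
}

print(first_divergence(counts))  # metric_layer
```

Run it for one date inside each reference window. If the divergent hop is the same in both, you have found a definition issue; if it only appears after the break, you have found the change.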

Practical tip: create a “trace sample” of 20 entities. Pick 10 that should count and 10 that should not, then manually validate whether they do. This catches definition and identity errors faster than staring at aggregates.
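A trace sample check is easy to script once someone has labeled the entities by hand. In this sketch the entity ids are made up and the sample is shortened to six entities; `check_trace_sample` is a hypothetical helper name.

```python
def check_trace_sample(expected, counted_ids):
    """Compare hand-labeled expectations with what the metric counted.

    expected: dict of entity id -> True if a human says it should count.
    counted_ids: set of entity ids the metric actually counted.
    Returns (missed, extra): ids wrongly excluded and ids wrongly included.
    """
    missed = sorted(e for e, should in expected.items()
                    if should and e not in counted_ids)
    extra = sorted(e for e, should in expected.items()
                   if not should and e in counted_ids)
    return missed, extra


# Hypothetical labels: three that should count, three that should not.
expected = {"acct_1": True, "acct_2": True, "acct_3": True,
            "acct_4": False, "acct_5": False, "acct_6": False}
counted = {"acct_1", "acct_2", "acct_5"}

missed, extra = check_trace_sample(expected, counted)
print("should count but did not:", missed)  # ['acct_3']
print("counted but should not:", extra)     # ['acct_5']
```

Each id in either list is a concrete case to walk through the definition by hand.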

Common mistake: teams start by “adjusting the KPI” to make it line up with revenue again. That is like unplugging the fire alarm because it is annoying. Instead, freeze the metric, prove the number, then decide whether you need a new metric version or a new value proxy.

Instrumentation audit: event semantics, schemas, identity, and client behavior

If the metric level shifted, assume tracking first. Most “broken metric” incidents are boring in the best way: an event changed meaning, a property stopped being sent, or an SDK update doubled events.

Look for semantic drift.

If the event name stayed the same, did the meaning change? A button click event used to represent “completed onboarding,” then the UI changed and now it represents “opened onboarding.” Same event, different value.

Look for schema drift.

A required property goes null, a new enum value appears, or a default changes. If your KPI filters on a property that quietly changed, the count can look stable while the population changes.

Look for duplicates and missingness.

Mobile backgrounding, offline queues, retries, and idempotency bugs can create duplicates. Consent changes can create missingness. Bots can create “activity” that looks like humans.
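Two quick numbers cover most of this: the share of events that are repeats of an already seen id, and the share of events missing a property the KPI depends on. The sketch assumes each event carries an `event_id` and a `plan` property; both names are hypothetical.

```python
from collections import Counter


def duplicate_rate(events, key="event_id"):
    """Share of events that repeat an id already seen."""
    if not events:
        return 0.0
    counts = Counter(e[key] for e in events)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(events)


def null_rate(events, prop):
    """Share of events where `prop` is missing or None."""
    if not events:
        return 0.0
    return sum(1 for e in events if e.get(prop) is None) / len(events)


# Hypothetical events for one day.
events = [
    {"event_id": "a", "plan": "pro"},
    {"event_id": "a", "plan": "pro"},   # retry sent twice
    {"event_id": "b", "plan": None},    # property stopped being sent
    {"event_id": "c", "plan": "free"},
    {"event_id": "c", "plan": "free"},  # retry sent twice
]

print(duplicate_rate(events))     # 0.4
print(null_rate(events, "plan"))  # 0.2
```

Compute both per day and per source across the before and after windows; a step change in either is usually more informative than the absolute value.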

Look for identity issues.

If user ids, account ids, or anonymous ids are stitched differently, your KPI may inflate or deflate. A classic symptom is a sudden change in the ratio of anonymous activity to logged in activity, or a sudden increase in one user having many device ids.
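Both symptoms can be tracked with a small summary per reference window. This is a rough sketch that assumes events carry `user_id` (None when anonymous) and `device_id`; the field names and sample data are hypothetical.

```python
from collections import defaultdict


def identity_summary(events):
    """Anonymous share of events and the largest device count for any one user."""
    if not events:
        return {"anonymous_share": 0.0, "max_devices_per_user": 0}
    anonymous = sum(1 for e in events if e["user_id"] is None)
    devices = defaultdict(set)
    for e in events:
        if e["user_id"] is not None:
            devices[e["user_id"]].add(e["device_id"])
    return {
        "anonymous_share": anonymous / len(events),
        "max_devices_per_user": max((len(d) for d in devices.values()), default=0),
    }


# Hypothetical "before" and "after" windows.
before = [
    {"user_id": None, "device_id": "d0"},
    {"user_id": "u1", "device_id": "d1"},
    {"user_id": "u1", "device_id": "d1"},
    {"user_id": "u1", "device_id": "d1"},
]
after = [
    {"user_id": None, "device_id": "d0"},
    {"user_id": None, "device_id": "d9"},
    {"user_id": "u1", "device_id": "d1"},
    {"user_id": "u1", "device_id": "d2"},
]

print(identity_summary(before))  # {'anonymous_share': 0.25, 'max_devices_per_user': 1}
print(identity_summary(after))   # {'anonymous_share': 0.5, 'max_devices_per_user': 2}
```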

Ask one operational question that support leaders understand: “If I take a real customer ticket and look up their activity, does the system tell a coherent story?” If not, the KPI is probably counting ghosts.

Pipeline and data quality checks: ingestion, transformations, and backfills

If tracking looks correct, move to the pipeline. KPI Tree’s “why did my metric change” diagnostic framing is helpful here because it forces you to check the plumbing before you hypothesize user behavior changes.

Check ingestion completeness.

Compare event volume by day and by source. If you have web and mobile, a drop in one source may be masked by growth in the other. Also check late arriving data, especially if your KPI is computed daily and your pipeline has variable delay.

Check transformations.

Common transformation failures include timezone shifts, partitioning mistakes, incremental model bugs, and joins that change cardinality. A single join that turns one row into many can quietly break counts.
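A join that changes cardinality is easy to detect if you count rows before and after. The sketch below simulates an inner join in plain Python; the table contents and the `account_id` key are assumptions for illustration.

```python
from collections import Counter


def join_fanout(left_rows, right_rows, key):
    """Return (rows before, rows after) for an inner join of left onto right."""
    right_counts = Counter(r[key] for r in right_rows)
    joined = sum(right_counts.get(row[key], 0) for row in left_rows)
    return len(left_rows), joined


# Hypothetical fact rows and a dimension table with an accidental duplicate.
events = [{"account_id": "A"}, {"account_id": "A"}, {"account_id": "B"}]
accounts = [
    {"account_id": "A", "plan": "pro"},
    {"account_id": "A", "plan": "pro"},   # duplicate dimension row
    {"account_id": "B", "plan": "free"},
]

before_rows, after_rows = join_fanout(events, accounts, "account_id")
print(before_rows, after_rows)  # 3 5
```

If the count after the join is larger than before, a supposedly one to one lookup is fanning out; if it is smaller, rows are being dropped for lack of a match.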

Check deduping and backfills.

If you recently changed dedupe logic or ran a backfill, you may have rewritten history. That can make the KPI look stable today while the historical relationship to retention is now computed on a different base.

Check your outcome source of truth.

If revenue recognition logic changed, if churn definition changed, or if a billing system migration happened, your “revenue” column may have moved under you. Many teams blame the KPI when the outcome table changed.

Practical tip: add three quick monitors even before the incident is resolved: freshness, volume, and null rate on the KPI’s critical events and properties. You can remove them later if you hate peace and quiet.
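The three monitors fit in one small function. Treat this as a sketch under stated assumptions: the thresholds (6 hours of lag, 70 percent of baseline volume, 5 percent nulls), the `ts` and `plan` fields, and the baseline of 10 events per day are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def run_monitors(events, now, baseline_daily_volume, prop,
                 max_lag=timedelta(hours=6),
                 min_volume_ratio=0.7,
                 max_null_rate=0.05):
    """Return the names of the monitors that fire for the last 24 hours."""
    if not events:
        return ["freshness", "volume"]
    alerts = []
    # Freshness: how old is the newest event?
    if now - max(e["ts"] for e in events) > max_lag:
        alerts.append("freshness")
    # Volume: last 24 hours compared with a known baseline.
    recent = [e for e in events if e["ts"] > now - timedelta(days=1)]
    if len(recent) < min_volume_ratio * baseline_daily_volume:
        alerts.append("volume")
    # Null rate on a property the KPI filters on.
    if recent:
        nulls = sum(1 for e in recent if e.get(prop) is None)
        if nulls / len(recent) > max_null_rate:
            alerts.append("null_rate")
    return alerts


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
# Eight hypothetical events in the last 8 hours; the two newest lost `plan`.
events = [
    {"ts": now - timedelta(hours=h), "plan": None if h <= 2 else "pro"}
    for h in range(1, 9)
]

print(run_monitors(events, now, baseline_daily_volume=10, prop="plan"))  # ['null_rate']
```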

Outcome alignment: confirm the lead to lag window and attribution to revenue or retention

Sometimes the metric is fine, and your expectation of timing is what broke.

Confirm the lead to lag window.

If your north star activity used to precede expansion within 30 days, and now deals take 60 days, the same KPI can still be predictive but you are looking too early. Run a simple lag sweep: compare KPI measured in week 1 to revenue or retention measured at weeks 4, 8, and 12.
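A lag sweep does not need a model. One simple version, sketched here, splits accounts at the median week 1 KPI and compares the retention gap between the high and low groups at each lag. The field names `kpi_week1` and `retained`, and the four sample accounts, are hypothetical.

```python
from statistics import median


def lag_sweep(accounts, lags=(4, 8, 12)):
    """Retention gap (high KPI minus low KPI) at each lag, in weeks."""
    cut = median(a["kpi_week1"] for a in accounts)
    high = [a for a in accounts if a["kpi_week1"] > cut]
    low = [a for a in accounts if a["kpi_week1"] <= cut]
    if not high or not low:
        raise ValueError("KPI values do not split into two groups")

    def rate(group, lag):
        return sum(a["retained"][lag] for a in group) / len(group)

    return {lag: rate(high, lag) - rate(low, lag) for lag in lags}


# Hypothetical accounts: everyone is still around at week 4,
# but low KPI accounts fall away by weeks 8 and 12.
accounts = [
    {"kpi_week1": 1, "retained": {4: True, 8: True, 12: False}},
    {"kpi_week1": 2, "retained": {4: True, 8: False, 12: False}},
    {"kpi_week1": 8, "retained": {4: True, 8: True, 12: True}},
    {"kpi_week1": 9, "retained": {4: True, 8: True, 12: True}},
]

print(lag_sweep(accounts))  # {4: 0.0, 8: 0.5, 12: 1.0}
```

In this made up data the KPI looks useless at week 4 and clearly separates outcomes by week 12, which is the "looking too early" pattern.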

Confirm the outcome definition did not change.

Retention is especially slippery. Did you switch from logo retention to revenue retention? Did you redefine “active customer” or change your churn grace period? Did you start counting downgrades differently?

Confirm attribution assumptions.

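If you are attributing revenue to accounts, but your KPI is computed at user level, you need a stable mapping from users to accounts. If seat counts changed, the same user level KPI now maps to a different number of dollars per account, and the relationship can look diluted for that reason alone.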

This is where north star metric guidance often gets misread. A north star is a proxy for value, not the value itself. When the business model or measurement of value shifts, the proxy can lose its predictive power even if it remains well defined.

Segment and mix analysis: the KPI may still work but only for some cohorts

This is the most common “we were not wrong, we were averaged” situation. The KPI can remain predictive inside segments, but your user mix changed.

Break down the relationship by segment.

Useful cuts include acquisition channel, plan tier, geo, device, industry, lifecycle stage, and sales assisted versus self serve. The goal is not to find 50 segments. The goal is to find the one segment whose weight changed and whose KPI behavior differs.

Then check weights.

If a low quality channel grew from 10 percent to 40 percent, your KPI can move while revenue per unit of KPI falls. Your KPI is still measuring activity, but activity is now coming from different users.

Use reweighting.

A simple technique is to reweight the new period to the old segment distribution. If the KPI to revenue relationship “returns” under old weights, you have a mix shift story.
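As a worked example of reweighting, take the channel that grew from 10 percent to 40 percent. The sketch assumes the weights are each segment's share of KPI volume and that you already have revenue per KPI unit by segment for the new period; the segment names and rates are invented.

```python
def blended(rate_by_segment, weights):
    """Weighted average of a per-segment rate under a given segment mix."""
    total = sum(weights.values())
    return sum(rate_by_segment[s] * w / total for s, w in weights.items())


# Hypothetical segment mix before and after the break.
old_weights = {"organic": 0.9, "paid_social": 0.1}
new_weights = {"organic": 0.6, "paid_social": 0.4}

# Hypothetical revenue per KPI unit, measured in the new period.
new_rates = {"organic": 10.0, "paid_social": 2.0}

observed = blended(new_rates, new_weights)     # 10*0.6 + 2*0.4
reweighted = blended(new_rates, old_weights)   # 10*0.9 + 2*0.1

print(round(observed, 2), round(reweighted, 2))  # 6.8 9.2
```

Here the per-segment rates are applied to both mixes, so the whole gap between 6.8 and 9.2 comes from composition. If the reweighted number still sits well below the old period's actual value, mix alone does not explain the disconnect.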

Analyze Acquisition Channel Performance: validate whether new volume is lower intent, lower fit, or simply earlier in the journey.

Examine Lifecycle Stage Shifts: check whether users are stalling at activation, not failing the product overall.

Segment Breakdown: find the few cohorts where the KPI lost its link to outcomes.

Compare Segment Weights: quantify whether composition alone explains the disconnect.

Product and process changes: when the metric is correct but no longer represents value

Now assume the number is correct and the pipeline is healthy. The remaining explanation is that the product, process, or business model changed, so your KPI is no longer the best proxy for value.

Inventory the changes since the break.

Include onboarding flows, paywalls, trial length, pricing and packaging, promotions, support interventions, and sales motion. These changes can create “cheap KPI” behavior, where users can generate the north star activity without reaching the real outcome.

Examples you will recognize:

Your KPI counts “projects created,” but templates auto create projects, so the KPI rises without intent.

Your KPI counts “messages sent,” but notifications or automation now send messages on behalf of users.

Your KPI counts “tickets resolved,” but you introduced aggressive deflection, so resolution counts shift without improving retention.

This is where guidance like Amplitude’s good versus bad north star discussion is practical: if the metric can be gamed, automated, or inflated without user value, it will eventually decouple. A north star is supposed to measure value delivery, not just motion.

Tasteful humor, because you deserve it: a metric that can be generated by a bot is not a north star, it is a night light.

If this is the case, you can do one of two things.

Option one is to refine the KPI so it requires evidence of value, such as “projects created that are used by two collaborators within 7 days.”

Option two is to keep the KPI but add guardrails, such as quality, retention, or revenue per KPI unit.

Revalidate predictiveness: lightweight statistical checks that operators can run

You do not need a research team to sanity check predictiveness. You need a few stable, repeatable tests.

Run a quintile lift check.

Bucket accounts or users into five groups based on KPI in a fixed window, such as first 14 days. Compare subsequent retention or revenue across the buckets. A healthy proxy usually shows monotonic lift: each higher KPI bucket has outcomes at least as good as the bucket below it, give or take some noise between adjacent buckets.
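A quintile lift check is a sort, a split, and five averages. The sketch below uses equal sized buckets by rank, which is a simplification: entities with tied KPI values can land in different buckets. The sample data is made up.

```python
def quintile_lift(entities, n_buckets=5):
    """entities: list of (kpi, retained) pairs, retained as 0 or 1.
    Returns the retention rate per bucket, lowest KPI bucket first."""
    if len(entities) < n_buckets:
        raise ValueError("need at least one entity per bucket")
    ranked = sorted(entities, key=lambda e: e[0])
    n = len(ranked)
    rates = []
    for b in range(n_buckets):
        bucket = ranked[b * n // n_buckets:(b + 1) * n // n_buckets]
        rates.append(sum(r for _, r in bucket) / len(bucket))
    return rates


def is_monotonic(rates):
    """True if no bucket has a worse rate than the bucket below it."""
    return all(a <= b for a, b in zip(rates, rates[1:]))


# Hypothetical (KPI in first 14 days, retained at day 90) pairs.
entities = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1),
            (6, 0), (7, 1), (8, 1), (9, 1), (10, 1)]

rates = quintile_lift(entities)
print(rates)                # [0.0, 0.5, 0.5, 1.0, 1.0]
print(is_monotonic(rates))  # True
```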

Run a rolling window stability check.

Compute the lift each month or each quarter. If the relationship broke, you will see the lift collapse or become noisy.
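For the rolling check, one number per period is enough. This sketch uses the retention gap between the top and bottom fifth by KPI as the "lift"; that choice, the period labels, and the data are assumptions for illustration.

```python
def top_bottom_gap(pairs):
    """Retention of the top fifth by KPI minus retention of the bottom fifth.
    pairs: list of (kpi, retained) with retained as 0 or 1."""
    ranked = sorted(pairs, key=lambda p: p[0])
    k = max(1, len(ranked) // 5)

    def rate(group):
        return sum(r for _, r in group) / len(group)

    return rate(ranked[-k:]) - rate(ranked[:k])


def rolling_lift(cohorts):
    """cohorts: dict of period label -> list of (kpi, retained) pairs."""
    return {period: top_bottom_gap(pairs) for period, pairs in cohorts.items()}


# Hypothetical monthly signup cohorts.
cohorts = {
    "2024-01": [(1, 0), (2, 0), (3, 1), (4, 1), (5, 1)],
    "2024-02": [(1, 1), (2, 0), (3, 1), (4, 0), (5, 1)],
}

print(rolling_lift(cohorts))  # {'2024-01': 1.0, '2024-02': 0.0}
```

A lift that drops from 1.0 to 0.0 in tiny samples like these means nothing on its own; with real cohort sizes, a sustained collapse across several periods is the signal to look for.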

Run a simple calibration check.

Pick a threshold, such as “KPI at least X.” Track what percent of those entities retain. If that percent drops materially post change, your proxy degraded.
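The calibration check is a single filtered average, run once per reference window. The threshold of 5 and both samples below are hypothetical.

```python
def retention_above_threshold(pairs, threshold):
    """Share of entities with KPI >= threshold that retained.
    pairs: list of (kpi, retained) with retained as 0 or 1.
    Returns None if nothing meets the threshold."""
    hits = [r for kpi, r in pairs if kpi >= threshold]
    return sum(hits) / len(hits) if hits else None


# Hypothetical "before" and "after" windows.
before = [(5, 1), (6, 1), (7, 1), (8, 0), (2, 0)]
after = [(5, 0), (6, 1), (7, 0), (9, 1), (1, 0)]

print(retention_above_threshold(before, 5))  # 0.75
print(retention_above_threshold(after, 5))   # 0.5
```

The same threshold used to mean three in four retained and now means one in two, which is the degradation this check is designed to surface.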

Keep it honest.

Do not p-hack by trying 30 windows until something looks significant. Decide your windows first, then inspect.

Also separate correlation from causation.

You are validating usefulness as a leading indicator, not proving it causes revenue. That distinction keeps you from overreacting to a short term shock.

Decision tree: classify the root cause and pick the fix

Once you have run the sequence above, classify what you found. Most fixes fall into one of five buckets.

  1. Definition mismatch. Different teams or tools compute different versions. Fix by writing a metric spec, versioning it, and making one canonical source.

  2. Tracking bug or semantic drift. Events changed meaning, properties disappeared, identity stitching broke. Fix the instrumentation, then backfill or annotate the break so historical comparisons remain interpretable.

  3. Pipeline bug or data quality regression. Ingestion gaps, join explosions, dedupe issues, timezone shifts, backfill rewrite. Fix the pipeline and add monitors so you catch it next time.

  4. Outcome definition changed. Revenue, churn, or account mapping changed. Align the sources of truth and document the new outcome definition before re-judging the KPI.

  5. The business relationship changed. Mix shifted, lead time changed, or product changes made the KPI less representative of value. Recalibrate the expected lead time, segment the KPI, add guardrails, or replace the metric with a better proxy.
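The five buckets can be written down as an ordered lookup so every incident gets classified the same way. This is only a sketch: the finding keys are hypothetical, and falling through to bucket five assumes you have actually run and passed the four earlier checks.

```python
def classify_root_cause(findings):
    """Map audit findings to one of the five buckets, in diagnostic order.

    findings: dict of hypothetical flag name -> bool, set during the audit.
    """
    ordered = [
        ("definition_mismatch", "1. Definition mismatch"),
        ("tracking_issue", "2. Tracking bug or semantic drift"),
        ("pipeline_issue", "3. Pipeline bug or data quality regression"),
        ("outcome_changed", "4. Outcome definition changed"),
    ]
    for key, label in ordered:
        if findings.get(key):
            return label
    # The number is correct and the outcome is stable, so what remains
    # is mix, lead time, or product meaning.
    return "5. The business relationship changed"


print(classify_root_cause({"pipeline_issue": True, "outcome_changed": True}))
# 3. Pipeline bug or data quality regression
print(classify_root_cause({}))
# 5. The business relationship changed
```

The order matters: a measurement problem is fixed before anyone reinterprets the business relationship on top of bad data.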

The “stop doing this” guidance that saves the most time: do not change the KPI weekly while you are diagnosing. Freeze, diagnose, then decide whether you need a new metric version.

Prevent recurrence: metric contracts, monitoring, and governance

Once trust is dented, prevention matters as much as the fix.

Start with metric contracts.

Treat the KPI like an API contract: schema plus semantics. If an event name or property meaning changes, require an explicit version bump and a changelog entry. KPI Tree’s metric debugging guidance is consistent on this point: stable definitions make root cause analysis possible.

Add monitoring that reflects how metrics fail in reality.

At minimum, monitor data freshness, event volume, duplicate rate, and null rates on critical properties. Then add one monitor for “relationship health,” such as the rolling lift of KPI quintiles to retention.

Assign ownership and escalation.

Instrumentation needs a clear owner, usually product analytics or data engineering, with an escalation path. When a release changes event semantics, it should create the same level of alertness as a production incident, because it is a decision making incident.

Finally, simplify what you standardize first.

If you do nothing else, standardize the KPI definition, identity rules, and outcome definitions in one place, and make changes versioned and reviewable. That alone prevents the next quarter from turning into a detective novel written in SQL.

| Option | Best for | What you gain | What you risk | Choose if |
| --- | --- | --- | --- | --- |
| Analyze Acquisition Channel Performance | Understanding if new users from specific sources are behaving differently | Reveal if a new channel brings lower quality users or if an old one declined | Ignoring post-acquisition behavior changes or downstream impacts | You've recently scaled up or down specific acquisition channels |
| Examine Lifecycle Stage Shifts | Understanding if users are getting stuck or dropping off at new points | Identify if onboarding, activation, or retention stages are impacted | Overlooking external factors influencing user behavior at different stages | The metric decline is concentrated in specific user journey phases |
| Segment Breakdown | Identifying specific user groups where the metric is failing | Pinpoint affected user cohorts (e.g., new users, specific geo) | Over-segmentation leading to noisy data or false positives | Overall metric trend is stable but you suspect underlying shifts |
| Investigate Product Changes by Segment | Connecting metric changes to recent feature releases or experiments | Identify features that disproportionately affect certain user groups | Missing non-product related factors (e.g., marketing, seasonality) | You have recent product changes that could impact specific user segments |
| Compare Segment Weights | Detecting changes in the composition of your user base | Understand if a segment's growth/decline is driving the metric change | Misinterpreting correlation as causation; not addressing root cause | You observe a shift in overall metric but individual segment metrics are stable |
| Reweight to Prior Mix | Isolating the impact of segment mix shifts from other factors | Determine if the metric would 'recover' with the old user distribution | Masking real product issues if the mix shift is a symptom, not the cause | Segment weights have changed significantly and you want to quantify their impact |

Last updated: 2026-05-20 | Calypso

Tags

how-to-debug-a-broken-metric