Most HR leaders know something is wrong with their 9-box calibration. The meeting happens. Boxes get filled. The results sit in a spreadsheet. Nothing changes.
This isn't a facilitation problem. The issues run deeper. This article covers the specific ways the 9-box breaks down in practice, what companies are moving toward instead, and how to get better outcomes from your next calibration cycle, whether or not you replace the grid.
The 9-box was never designed for what we use it for
The 9-box traces back to McKinsey's work with GE in the 1970s. The original tool plotted business units on a matrix of industry attractiveness and competitive strength, a capital allocation tool for a conglomerate deciding where to invest.
Someone mapped the logic to people: performance on one axis, potential on the other. GE adopted it aggressively under Jack Welch, with annual forced rankings and genuine consequences at each box position. Bottom 10% left the company. High potentials got groomed for senior roles.
Most companies kept the grid, dropped the teeth, and called it talent management.
Without the rigor that made it work at GE, the 9-box becomes a categorization exercise instead of a decision framework. You produce placements, not actions. That's where the problems start.
The three core problems with 9-box calibration
1. Boxes are sticky
Once someone lands in a box, they tend to stay there. Next year's calibration uses last year's placement as the starting point. The conversation shifts from "where does the evidence put them today?" to "do we have any reason to move them?"
That's a high bar for change. Most placements survive it intact, not because the person didn't develop, but because inertia is easier than scrutiny.
This problem hits potential ratings hardest. Potential is speculative by nature. You're predicting someone's ceiling based on limited observation. Those predictions get locked in early, often in the first year or two of a person's tenure, and rarely get revisited with fresh eyes.
Research from Korn Ferry found that high-potential designations made early in someone's career rarely change afterward, regardless of what that person actually does. The label shapes how managers see them, not the other way around.
2. Visibility bias dominates
The 9-box is supposed to be objective. It isn't. It represents manager perception. And manager perception tracks visibility.
Who gets rated as high potential? The people managers see clearly. Visible contributions: speaking in meetings, working directly with senior leaders, producing work that's easy to observe and quantify.
The engineer who identifies and quietly patches an architectural flaw that would have caused a production outage in six months? Invisible. The analyst whose report three other teams use to make better decisions? Invisible. The project manager who keeps a messy cross-functional initiative from collapsing through sheer coordination effort? Visible only in her absence.
Loud contributions systematically beat foundational ones in calibration rooms where no one is specifically looking for quiet work.
This isn't a manager character flaw; it's a structural problem with how most calibration sessions run. They rely on memory and impression rather than documented evidence gathered throughout the year.
3. The grid produces placements, not plans
Even a well-run 9-box calibration produces a placement, not an action.
The meeting ends. The grid is documented. Then what?
HR leaders who've run dozens of calibration cycles describe the same pattern: development plans discussed in the meeting don't materialize. People in low-potential boxes don't get development conversations because the placement implies the investment isn't worth it. High potentials don't get the stretch assignments they were supposed to get because everyone forgets to create them.
The 9-box tells you where people are in the matrix. It doesn't tell you what to do next, who owns what, or how to measure whether anything changed.
What companies are using instead
Skills-based assessment
The clearest alternative tracks what employees can actually do against what the role requires, instead of predicting where someone might go.
Skills-based assessment solves the stickiness problem by design. If the framework tracks observed capabilities, a person who builds new skills shows up differently in the next cycle automatically. There's no "reason to move them" threshold to clear. The data updates because the capabilities did.
Microsoft and Unilever have both described moves toward skills-based talent architecture. The shift is especially useful for deployment and succession decisions, where knowing someone's specific skill set matters more than their general potential label.
The tradeoff: skills-based assessment requires a functioning skills taxonomy and a way to assess against it. That's more infrastructure than most companies have ready.
Evidence-based calibration (on top of the existing grid)
A lighter-weight alternative keeps the 9-box but adds an evidence layer before the meeting.
Managers submit a brief for each direct report: two or three specific examples supporting their performance rating, one specific observation informing the potential rating. The calibration session reviews the evidence, not just the conclusions.
This adds preparation time, but it substantially reduces the number of placements that survive on vibes alone. A manager who has to cite a specific example of how someone handled an ambiguous situation is less likely to fall back on "she just seems high potential."
It also creates a paper trail that makes placements easier to challenge in the room.
Structured question frameworks
Some organizations drop the grid entirely and run structured calibration conversations with a question bank instead.
Rather than asking "where does this person fit?", a facilitator asks:
- What did this person accomplish this cycle that was harder than it looked?
- If this person left tomorrow, what would break?
- When did you last put them in an unfamiliar situation? What happened?
- Is this rating based on what you've seen this year or what you thought when you first met them?
Questions like these force managers to retrieve specific evidence rather than reconstruct a general impression. The conversation is slower, but the outputs tend to be more defensible.
How to get better outcomes without replacing the grid
If replacing the 9-box isn't on the table, you can still significantly improve calibration outcomes by changing how you run the conversation.
Before the meeting
The most important calibration work happens before anyone sits down together.
Require managers to submit evidence briefs for each direct report. Flag anyone whose box hasn't changed in two or more cycles; those placements need explicit scrutiny, not passive renewal. Identify anyone who changed roles since the last calibration; prior placements may not apply.
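If your calibration history lives in a simple export, the flagging steps above can be automated in a few lines. The sketch below is a minimal illustration, not a prescribed implementation; the record shape and field names ("employee", "cycle", "box", "role") are assumptions about what an HRIS export might contain.

```python
# Sketch: flag placements that need extra scrutiny before calibration.
# Assumes a list of per-cycle records; all field names are illustrative.
from collections import defaultdict

history = [
    {"employee": "A. Chen", "cycle": 2022, "box": "high/high", "role": "PM"},
    {"employee": "A. Chen", "cycle": 2023, "box": "high/high", "role": "PM"},
    {"employee": "A. Chen", "cycle": 2024, "box": "high/high", "role": "PM"},
    {"employee": "B. Ortiz", "cycle": 2023, "box": "mid/low", "role": "Analyst"},
    {"employee": "B. Ortiz", "cycle": 2024, "box": "mid/low", "role": "Senior Analyst"},
]

# Group each person's records in cycle order.
by_person = defaultdict(list)
for rec in sorted(history, key=lambda r: r["cycle"]):
    by_person[rec["employee"]].append(rec)

sticky, role_changed = [], []
for name, recs in by_person.items():
    last_two = recs[-2:]
    # Same box in the two most recent cycles: needs explicit scrutiny.
    if len(last_two) == 2 and last_two[0]["box"] == last_two[1]["box"]:
        sticky.append(name)
    # Role changed since the last calibration: prior placement may not apply.
    if len(recs) >= 2 and recs[-1]["role"] != recs[-2]["role"]:
        role_changed.append(name)

print("Review for stickiness:", sticky)
print("Role changed since last calibration:", role_changed)
```

Running this against a real export just means swapping the hard-coded list for your HRIS data; the point is that "flag unchanged boxes" is a mechanical check, so no one has to rely on memory to spot them.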
During the meeting
Use questions that challenge placements instead of confirming them.
For any unchanged placement, ask: "If we didn't know where they placed last year, where would we put them today?" That resets the starting point from "assume prior placement is correct" to "start from evidence."
For potential ratings, require specific examples. "She just has that leadership quality" is not evidence. "She took ownership of the product launch when the PM left, defined the scope in week one, and shipped on time" is evidence.
Enforce a time limit per person (20 minutes works well) and require a "next step" for every placement before moving on. If there's no action, the placement isn't doing anything useful yet.
After the meeting
Close the loop. Assign next steps with owners and deadlines. Schedule check-ins at 60 days to review whether the actions happened. If they didn't, revisit the placement; a calibration with no follow-through is expensive theater.
Get the full playbook
If you want the full framework, including pre-meeting manager brief templates, a facilitator checklist, a complete question bank, and the 20-minute per-person structure, we've put it in a free PDF.
Get The 9-Box Escape Playbook →
It covers everything above in more detail, plus escape paths for people stuck in low boxes and a complete calibration conversation structure you can use in your next cycle.
What makes calibration defensible
The standard for good calibration isn't consensus. It's that placements follow from evidence multiple people in the room have seen and can cite.
"She's a high potential" is not defensible.
"She took over a struggling project in Q3, identified the stakeholder misalignment within two weeks, and got it back on track without escalation, that pattern is what we're calling high potential" is defensible.
The difference is specificity. Specific examples are harder to confuse with reputation, personality, or who happens to present well in the room.
Building that evidence base before calibration happens, and using it in the conversation rather than reconstructing impressions on the fly, is the single biggest change most companies can make to their calibration process.
Related reading
- 9 Box Grid: The Complete Guide, how the framework was designed to work and how to run it well
- 360-Degree Feedback Guide, an evidence-gathering approach that complements calibration
- Performance Review Mistakes, common failure modes in annual review cycles
