
How to Run Fair Performance Calibration Sessions (And Fix the Bias That Derails Them)

Most calibration sessions are decided before they start — by recency bias, halo effects, and the loudest voice in the room. Here's how to run calibration that produces ratings you can actually defend.

Performance calibration sessions should be straightforward: managers review employees' work, compare notes, and agree on ratings. In practice, they're among the most politically charged conversations in any company. Ratings drift based on who argues the loudest. Employees who worked hard all year lose out to colleagues who happened to close a big project in the final weeks. And the whole thing moves too fast for anyone to catch the bias.

This guide covers how to run calibration sessions that produce ratings you can actually defend, and the specific data, structure, and checks that make the difference.

Why calibration sessions go wrong

Most calibration failures come down to three problems: recency bias, halo effects, and managers who walk in with their minds already made up.

Recency bias is the default

Managers remember what happened last week, not last January. In a 12-month review cycle, this means 11 months of work are effectively invisible unless you have data that makes them visible.

The employee who shipped something big in November gets rated higher than the colleague who quietly solved problems all year and had a slow Q4. Neither rating reflects actual performance over time. Without a running record of what actually happened throughout the year, calibration sessions become a memory contest. The last two months win.

Halo and horn effects distort everything

If a manager likes someone, that impression colors every data point. A missed deadline becomes "they were dealing with a tough situation." If the relationship is strained, the opposite happens. A strong deliverable becomes "yeah, but they were kind of difficult to work with."

These aren't character flaws. They're how human memory works. The brain fills gaps with priors. The practical fix is to rate on behaviors and outputs, not impressions, and to anchor ratings to specific evidence rather than overall gut feel.

Politics show up as ratings

In many companies, managers enter calibration knowing what rating they want to give and then construct the argument. The session becomes a negotiation, not an assessment. This happens for real reasons: managers want to retain their best people, avoid hard conversations, or protect their team's compensation budget.

Structuring the session so evidence comes before ratings changes this dynamic. When data is presented first, the conversation starts from facts, not advocacy.

What data you need before walking into the room

A calibration session without data is a conversation about feelings. Here's what actually moves the needle.

Goal completion data

Which goals did each person have? Did they complete them? If you don't have structured goal data going into calibration, the conversation defaults to memory. Whoever can recall the most recent examples wins, regardless of whether those examples are representative.

Feedback from peers and stakeholders

Managers see roughly 20-30% of what their direct reports actually do. The rest happens in cross-functional projects, client relationships, and peer collaboration. 360 feedback fills that gap.

Collect it before calibration, not after. Many companies run 360s as a formality after ratings are already set, which defeats the purpose.

A running performance log

A quarterly log of notable moments, good and bad, gives managers something concrete to reference at year-end. Three to five entries per employee per quarter is enough to reconstruct 12 months of performance without relying on memory.

If you don't have this going into calibration, the sessions will be dominated by whoever happened to do something memorable recently.
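
If the log lives in a spreadsheet or a lightweight tool, the completeness check is easy to automate. Here's a minimal sketch in Python; the entry fields and the three-entries-per-quarter threshold are illustrative, not a prescribed schema:

```python
# A minimal sketch of a running performance log and a coverage check.
# Field names and the 3-per-quarter threshold are illustrative assumptions.
from dataclasses import dataclass
from datetime import date
from collections import Counter

@dataclass
class LogEntry:
    employee: str
    when: date
    note: str        # one concrete observation, good or bad
    direction: str   # "positive" or "development"

def quarters_missing_coverage(entries: list[LogEntry], year: int,
                              min_per_quarter: int = 3) -> list[int]:
    """Return the quarters (1-4) with fewer than min_per_quarter entries."""
    counts = Counter((e.when.month - 1) // 3 + 1
                     for e in entries if e.when.year == year)
    return [q for q in range(1, 5) if counts[q] < min_per_quarter]

# Flag gaps before calibration so memory doesn't have to fill them.
log = [LogEntry("ana", date(2024, 2, 10), "Unblocked the launch", "positive"),
       LogEntry("ana", date(2024, 11, 3), "Led the Q4 migration", "positive")]
print(quarters_missing_coverage(log, 2024))  # -> [1, 2, 3, 4] (all under 3)
```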

Manager-level distribution data

Before calibration starts, pull each manager's proposed rating distribution. If one manager consistently rates 70% of their team as "Exceeds" and another rates 60% as "Meets," you have a calibration problem before the meeting even starts. Surface it in advance so it can be addressed as part of the calibration conversation, not as a surprise.
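
If proposed ratings live in a spreadsheet, this pull takes a few lines. Here's a rough sketch in Python with pandas; the column names and the 15-point flag threshold are assumptions for illustration, not a standard:

```python
# A sketch of the pre-meeting distribution pull, assuming a simple table of
# proposed ratings. Column names and threshold are illustrative.
import pandas as pd

ratings = pd.DataFrame({
    "manager":  ["kim"] * 10 + ["raj"] * 10,
    "proposed": ["Exceeds"] * 7 + ["Meets"] * 3      # kim: 70% Exceeds
              + ["Exceeds"] * 2 + ["Meets"] * 8,     # raj: 20% Exceeds
})

# Share of each rating per manager, and the company-wide baseline.
per_manager = (ratings.groupby("manager")["proposed"]
                      .value_counts(normalize=True)
                      .unstack(fill_value=0))
baseline = ratings["proposed"].value_counts(normalize=True)

# Flag managers whose "Exceeds" share sits far from the company average.
gap = (per_manager["Exceeds"] - baseline["Exceeds"]).abs()
print(per_manager.round(2))
print(gap[gap > 0.15])   # -> both kim and raj deviate by 0.25
```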

How to structure a fair calibration session

Pre-work (one week before)

Share the rating rubric in writing. Every manager should know what "Exceeds" looks like before they walk in, not during the session. Ask each manager to submit their proposed ratings and written rationale for each employee. Distribute the submissions so managers can see each other's ratings in advance. This surfaces outliers before the meeting rather than during it.

The session itself

Start by reviewing the rating distribution you're targeting. If the company uses a calibration curve, show it. Then move through each employee at about three to four minutes per person:

  1. Manager presents key accomplishments, goal completion rate, one example of impact, and one development area
  2. Peers or stakeholders add context if they have cross-functional visibility
  3. Manager proposes a tentative rating against the rubric

No debating during the review phase. Collect tentative ratings, flag disagreements, and move on. Come back to the disagreements after all employees have been reviewed.

When you return to flagged disagreements, start with the evidence: "What specific examples are you drawing on?" and "Is this pattern consistent across the year, or concentrated in one period?"

What the facilitator's job actually is

The calibration facilitator is not neutral. Their job is to push back on unsupported ratings, surface bias patterns, and make sure the final distribution reflects evidence rather than advocacy.

That means asking for evidence when a manager makes a claim without backing it up, flagging when a manager has rated their entire team consistently high or low without explanation, and calling out halo patterns: "You've described everything about this person positively. What's one genuine development area?"

The calibration bias checklist

Before finalizing any rating, run through these checks:

  • Recency test: Is this rating based on the full review period, or the last 60 days? Can the manager point to examples from each quarter?
  • Halo/horn test: Is every data point about this person in the same direction? Can the manager name one weakness for an "Exceeds" employee? One strength for a "Below" employee?
  • Consistency test: If two people did similar work, did they get similar ratings? If not, what's the specific difference?
  • Documentation test: Is there written documentation to support ratings at the tails? If someone is Below Expectations, was that communicated to them before calibration?
  • Demographic check: After ratings are finalized, does the distribution show unexplained gaps by gender, ethnicity, or tenure?
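
Some of these checks require human judgment, but the mechanical ones can be screened automatically before the session. Here's a rough sketch of that pre-screen, assuming each employee's record carries a count of quarters with logged evidence and a documentation flag; the column names and tail-rating list are invented for the example:

```python
# A sketch of the mechanical subset of the checklist, run before the session.
# Column names and the tail-ratings set are illustrative assumptions.
import pandas as pd

TAIL_RATINGS = {"Top Performer", "Below Expectations"}

def checklist_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per employee with boolean flags for two checks."""
    flags = pd.DataFrame(index=df.index)
    # Recency test: evidence should span the year, not just the last quarter.
    flags["recency"] = df["quarters_with_evidence"] < 4
    # Documentation test: tail ratings need written support on file.
    flags["documentation"] = df["proposed"].isin(TAIL_RATINGS) & ~df["has_docs"]
    return df[["employee", "proposed"]].join(flags)

reviews = pd.DataFrame({
    "employee": ["ana", "ben"],
    "proposed": ["Top Performer", "Meets Expectations"],
    "quarters_with_evidence": [2, 4],
    "has_docs": [False, True],
})
print(checklist_flags(reviews))
# ana is flagged on both checks: evidence covers only 2 of 4 quarters, and
# the tail rating has no written documentation behind it.
```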

Rating level definitions

Without shared definitions, calibration sessions devolve into managers arguing from different premises. These definitions give everyone the same starting point. Customize them for your company, but have them in writing before the session starts.

  • Top Performer: Consistently delivered results above what the role requires. Proactively raised the standard for their team. Would represent a real loss if they left. Evidence required: 3+ specific examples of above-and-beyond output, with context for why it exceeded expectations.
  • Strong Performer: Delivered all expected results and occasionally exceeded. Solid, reliable, growing in the role. Evidence required: goal completion data, 2+ examples of strong output.
  • Meets Expectations: Delivered what the role requires. Consistent and dependable. Most of a healthy team should be here. Evidence required: goal completion data, manager summary.
  • Developing: Delivered some but not all expected results. Clear growth path ahead with the right support. Evidence required: documented examples of gaps, documented conversations with the employee.
  • Below Expectations: Significant gaps that were communicated to the employee before this review cycle. If this is the first time the employee hears about a performance problem, this rating should not be used. Evidence required: documentation of prior conversations, PIP or equivalent, specific examples.

How AI changes the calibration equation

AI doesn't eliminate bias from calibration. But it does surface patterns that humans miss in the moment.

During a calibration session, AI can flag when a manager's proposed ratings sit significantly above or below the company average for comparable roles. This doesn't override the manager's judgment, but it prompts the question: "Here's how your distribution compares to peers. What's driving the difference?"

After ratings are submitted, AI can scan for demographic gaps faster than manual review. A 15-point gap in average ratings between male and female employees at the same level and tenure is worth investigating. AI finds it in seconds; spreadsheet review might miss it entirely.
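
The scan itself is simple arithmetic once ratings are mapped to a numeric scale. Here's a minimal sketch with pandas; a real analysis would also control for tenure and use proper statistical tests rather than a raw threshold, and all names below are illustrative:

```python
# A sketch of the post-calibration demographic scan, assuming ratings have
# been mapped to a 0-100 score. Column names are illustrative.
import pandas as pd

final = pd.DataFrame({
    "level":  ["L4", "L4", "L4", "L4", "L5", "L5"],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "score":  [62, 80, 65, 78, 90, 88],
})

# Average score by gender within each level, then the within-level gap.
by_group = final.groupby(["level", "gender"])["score"].mean().unstack()
by_group["gap"] = (by_group["M"] - by_group["F"]).abs()
print(by_group)

# Flag any level where the gap exceeds 15 points, the threshold used above.
print(by_group[by_group["gap"] > 15])   # -> L4 (gap of 15.5)
```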

Written feedback is another area where AI adds real value. Research shows that written performance feedback uses different language depending on employee demographics. Women receive more feedback about communication style; men receive more feedback about outcomes. AI can flag these patterns in real time as managers write feedback, before it reaches the employee.
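
As a toy illustration of the idea (not how any particular product implements it), even a crude keyword screen can catch a feedback draft that is all style and no outcomes. The two word lists below are invented for the example; production systems use far richer language models:

```python
# A toy keyword-based screen for style-heavy, outcome-free feedback.
# Both term lists are invented for illustration only.
import re

STYLE_TERMS = {"abrasive", "bubbly", "helpful", "communication", "tone"}
OUTCOME_TERMS = {"shipped", "delivered", "revenue", "launched", "results"}

def language_balance(feedback: str) -> dict:
    """Count style vs. outcome terms and flag drafts with no outcomes."""
    words = set(re.findall(r"[a-z']+", feedback.lower()))
    style = len(words & STYLE_TERMS)
    outcome = len(words & OUTCOME_TERMS)
    return {"style_terms": style, "outcome_terms": outcome,
            "flag": style > 0 and outcome == 0}

draft = "Great communication and a helpful tone throughout the project."
print(language_balance(draft))
# -> {'style_terms': 3, 'outcome_terms': 0, 'flag': True}
```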

What AI can't do: make the final call. It surfaces patterns and prompts questions. The human conversation, grounded in evidence, still determines the rating.

How Confirm supports calibration

Confirm is built specifically for this problem. The platform captures performance signals throughout the year, so when calibration comes around, managers have a running record of outputs, feedback, and growth moments rather than 12 months of memory.

During the review process, Confirm's AI flags language patterns in written feedback that may reflect bias. Managers see the flags in real time as they write, not after the feedback has been delivered. Calibration analytics show how each manager's rating distribution compares to peers, prior cycles, and company norms.

Every rating decision in Confirm is documented, timestamped, and traceable. If an employee or auditor asks why someone received a specific rating, the answer is in the system.

Want to see how Confirm supports calibration? Request a demo →

The full calibration checklist

Before calibration:

  • Rating rubric distributed in writing to all managers
  • Managers submitted written rationale for each proposed rating
  • Distribution data pulled and shared with the facilitator
  • Prior cycle ratings available for comparison
  • 360 feedback collected and distributed to managers

During calibration:

  • Evidence-first structure: data is presented before debate begins
  • Facilitator pushed back on unsupported ratings
  • Each rating tested against the bias checklist
  • Rationale documented for all ratings at the tails

After calibration:

  • Demographic distribution reviewed for unexplained gaps
  • Documentation finalized before managers communicate ratings to employees
  • Employees informed of their development focus, not just the rating number

Calibration done well is a competitive advantage. The companies that get this right retain better performers, make more defensible decisions, and build trust with employees over time. The ones that treat it as a box-checking exercise inherit the problems that come with it.

See Confirm in action

See why forward-thinking enterprises use Confirm to make fairer, faster talent decisions and build high-performing teams.
