Quality (SPC / Cpk / MSA)

Evaluate Rater Consistency using Intraclass Correlation (ICC)

When you have multiple people scoring the same items, you need to know whether their ratings align. Use this method to determine whether your scoring system is consistent across different judges or observers. It translates the agreement level into simple, actionable categories, helping you decide whether your measurement process is dependable.

Understanding Rater Agreement

In many professional settings, you rely on human judgment to evaluate quality or performance. When multiple individuals provide these assessments, discrepancies can lead to uncertainty. The intraclass correlation (ICC) method helps you quantify how consistently these raters agree with one another.

Rather than comparing raters to one another pairwise, this method partitions the overall variance in your dataset: it separates the true differences between the items being measured from the noise created by rater inconsistency.
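
As a concrete illustration of that split, here is a minimal sketch using a made-up ratings matrix and the standard Shrout and Fleiss ICC(2,1) formula (two-way random effects, absolute agreement, single rating); the numbers are invented, and Lattice's engine may use a different configuration.

```python
import numpy as np

# Made-up example: 5 items (rows) each scored by the same 3 raters (columns).
ratings = np.array([
    [7.0, 8.0, 7.5],
    [5.0, 5.5, 5.0],
    [9.0, 8.5, 9.0],
    [4.0, 4.5, 5.0],
    [6.5, 7.0, 6.5],
])
n, k = ratings.shape

grand = ratings.mean()
row_means = ratings.mean(axis=1)   # per-item means: the "true differences"
col_means = ratings.mean(axis=0)   # per-rater means: systematic rater bias

# Two-way decomposition of the total variability.
ss_items = k * ((row_means - grand) ** 2).sum()
ss_raters = n * ((col_means - grand) ** 2).sum()
ss_error = ((ratings - grand) ** 2).sum() - ss_items - ss_raters

ms_items = ss_items / (n - 1)
ms_raters = ss_raters / (k - 1)
ms_error = ss_error / ((n - 1) * (k - 1))

# Shrout-Fleiss ICC(2,1): two-way random effects, absolute agreement, single rating.
icc = (ms_items - ms_error) / (
    ms_items + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
)
print(f"ICC(2,1) = {icc:.3f}")
```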

Reliability in Your Measurements

The primary benefit of using the intraclass correlation (ICC) is that it tells you how trustworthy your data is. A low score suggests that your rater training might be insufficient or that the scoring criteria are too subjective, leading to inconsistent outcomes.

When the results show high reliability, you can be confident that the scores generated are a reflection of the subject matter rather than the specific person who happened to perform the evaluation.

Interpreting the Results

Lattice provides clear interpretations based on the calculated coefficient. A high value confirms that your measurement process is stable and objective. If the value falls into the 'poor' or 'moderate' range, it acts as a signal to review your rating guidelines or provide additional training to your team.

By looking at the confidence intervals provided alongside the main result, you can also see the range of uncertainty in your estimate, ensuring you aren't over-relying on a single point estimate when your data is limited.
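
If you want to see that kind of interval outside of Lattice, one option is the open-source pingouin package, which reports a 95% confidence interval next to each ICC variant it computes. Using pingouin here is our assumption for illustration, not part of the Lattice toolchain, and the data below is invented.

```python
import pandas as pd
import pingouin as pg

# Invented long-format ratings: one row per (item, rater) pair.
df = pd.DataFrame({
    "item":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater": ["A", "B", "C"] * 4,
    "score": [7, 8, 7, 5, 6, 5, 9, 9, 8, 4, 5, 4],
})

result = pg.intraclass_corr(data=df, targets="item", raters="rater", ratings="score")

# Each ICC variant comes with a 95% confidence interval next to the point estimate.
print(result[["Type", "ICC", "CI95%"]])
```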

1 · Intent → method

An LLM picks reliability_icc from a fixed catalog.

2 · Method → numbers

A deterministic Python engine runs the math. Same input → same output.

3 · Numbers → plain language

A second LLM translates the result into your domain’s vocabulary.
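
A rough end-to-end sketch of those three steps might look like the following; pick_method, run_engine, and narrate are hypothetical names invented for this illustration, the two LLM stages are stubbed out, and the engine body uses a one-way ICC(1,1) purely for brevity rather than whatever configuration Lattice actually runs.

```python
import numpy as np

def pick_method(intent: str) -> str:
    # Step 1 (LLM, stubbed out here): map free-form intent to a catalog entry.
    return "reliability_icc"

def run_engine(method: str, ratings: np.ndarray) -> dict:
    # Step 2 (deterministic): pure arithmetic, so identical input always
    # yields identical output. One-way ICC(1,1) is used here for brevity.
    n, k = ratings.shape
    ms_between = k * ratings.mean(axis=1).var(ddof=1)   # between-item mean square
    ms_within = ratings.var(axis=1, ddof=1).mean()      # within-item mean square
    icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    return {"method": method, "icc": float(icc), "n_items": n, "n_raters": k}

def narrate(result: dict) -> str:
    # Step 3 (LLM, stubbed out here): restate the numbers in the caller's vocabulary.
    return f"Across {result['n_raters']} raters, agreement came out at {result['icc']:.2f}."

ratings = np.array([[7, 8, 7.5], [5, 5.5, 5.0], [9, 8.5, 9.0], [4, 4.5, 5.0]])
print(narrate(run_engine(pick_method("Are my reviewers scoring consistently?"), ratings)))
```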

  • What do the intraclass correlation (ICC) scores actually mean?

    Following common guidance, scores below 0.50 indicate poor agreement, 0.50 to 0.75 show moderate agreement, 0.75 to 0.90 are good, and 0.90 or above reflect excellent reliability in your scoring system (a small code sketch of these bands appears after this list).

  • Why does Lattice show different types of ICC?

    Different forms of the intraclass correlation (ICC) account for whether raters are treated as randomly selected or as a fixed, specifically chosen set, and whether you are analyzing a single rating or the average of several ratings. We focus on standard, well-established configurations to suit most evaluation needs.
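
As a small companion to the bands quoted in the first question, a direct translation into code might look like this; interpret_icc is our name for illustration, not part of Lattice.

```python
def interpret_icc(icc: float) -> str:
    # Cutoffs mirror the common guidance quoted above.
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"

print(interpret_icc(0.82))  # -> good
```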

Tool input schema

Schema for reliability_icc not exported yet (run pnpm export:registry).