Expected Calibration Error (ECE)
The standard metric for measuring classifier calibration quality: the weighted average of the absolute difference between confidence and accuracy across confidence bins.
ECE quantifies how far a model's confidence values deviate from the actual frequencies of correct predictions.
Explanation
ECE divides predictions into confidence bins (e.g., 0-10%, 10-20%, ...), measures the absolute difference between average confidence and observed accuracy within each bin, and then averages these gaps weighted by the number of predictions per bin. Perfect calibration gives an ECE of 0.
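A minimal sketch of this binning procedure, assuming NumPy and equal-width bins (the function name compute_ece and the toy numbers are illustrative, not part of any standard library):

```python
import numpy as np

def compute_ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins.

    confidences: predicted probability of the predicted class, shape (n,)
    correct:     1 if the prediction was right, else 0, shape (n,)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Bin index for each prediction; clip so confidence == 1.0 falls in the last bin.
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue                          # empty bins contribute nothing
        avg_conf = confidences[mask].mean()   # average confidence in the bin
        accuracy = correct[mask].mean()       # observed accuracy in the bin
        weight = mask.mean()                  # fraction of all predictions in the bin
        ece += weight * abs(accuracy - avg_conf)
    return ece

# Toy usage: a slightly overconfident model.
conf = np.array([0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6])
hits = np.array([1,    1,   0,    1,   0,    1,   0,    1  ])
print(round(compute_ece(conf, hits, n_bins=10), 3))
```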
Marketing Relevance
ECE is the standard calibration metric for ML deployments where predicted probabilities drive decisions, from lead scoring to churn prediction.
Example
A model with an ECE of 0.15 is, averaged across bins, 15 percentage points away from the observed frequency of correct predictions: a clear signal that it needs recalibration.
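To make the arithmetic concrete, here is a hedged two-bin illustration; the bin weights, confidences, and accuracies are invented for the example:

```python
# Two hypothetical bins covering all predictions (illustrative numbers only).
# Bin A: 60% of predictions, average confidence 0.90, observed accuracy 0.78
# Bin B: 40% of predictions, average confidence 0.70, observed accuracy 0.52
ece = 0.6 * abs(0.78 - 0.90) + 0.4 * abs(0.52 - 0.70)
print(round(ece, 3))  # 0.144 -> roughly a 14-15 percentage point gap on average
```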
Common Pitfalls
ECE is sensitive to the number and placement of bins; adaptive ECE (equal-mass bins) or KDE-based variants are more robust. ECE alone isn't enough: also inspect reliability diagrams.
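As a hedged sketch of one adaptive variant, using equal-mass (quantile) bins so that every bin holds roughly the same number of predictions (the function name adaptive_ece is illustrative):

```python
import numpy as np

def adaptive_ece(confidences, correct, n_bins=10):
    """ECE with equal-mass bins: each bin holds roughly the same number of predictions."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Sort predictions by confidence and split into n_bins chunks of near-equal size.
    order = np.argsort(confidences)
    ece = 0.0
    for chunk in np.array_split(order, n_bins):
        if chunk.size == 0:
            continue
        avg_conf = confidences[chunk].mean()
        accuracy = correct[chunk].mean()
        ece += (chunk.size / confidences.size) * abs(accuracy - avg_conf)
    return ece
```

Equal-mass bins avoid the nearly empty high-confidence bins that can dominate or destabilize the equal-width version.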
Origin & History
Naeini et al. (2015) formalized ECE and proposed a binning-based calibration method. Guo et al. (2017) used ECE to show that modern deep neural networks are systematically miscalibrated. Adaptive-binning variants followed from 2019 onward.
Comparisons & Differences
Expected Calibration Error (ECE) vs. Brier Score
The Brier Score measures overall probabilistic prediction quality (accuracy and calibration combined); ECE isolates calibration exclusively.
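A small self-contained sketch computing both metrics on the same toy binary predictions (assuming NumPy; ECE is taken here over positive-class probabilities, and all numbers are illustrative):

```python
import numpy as np

# Binary outcomes and one model's predicted probabilities for the positive class.
y = np.array([1, 0, 1, 1, 0, 1, 0, 0])
p = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.9, 0.1, 0.4])

# Brier score: mean squared error between predicted probability and outcome.
# It rewards both being right (accuracy) and being honest about uncertainty (calibration).
brier = np.mean((p - y) ** 2)

# ECE over the same predictions: compare average predicted probability with the
# observed frequency of the positive class within equal-width probability bins.
n_bins = 5
bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
ece = sum(
    np.mean(bins == b) * abs(y[bins == b].mean() - p[bins == b].mean())
    for b in range(n_bins)
    if np.any(bins == b)
)

print(f"Brier score: {brier:.3f}  ECE: {ece:.3f}")
```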