Confidence Intervals for the True Classification Error Conditioned on the Estimated Error
Academic Article


Abstract

Bias and variance for small-sample error estimation are typically posed in terms of statistics for the distributions of the true and estimated errors. On the other hand, a salient practical issue asks, given an error estimate, what can be said about the true error? This question relates to the joint distribution of the true and estimated errors, specifically, the conditional expectation of the true error given the error estimate. A critical issue is that of confidence bounds for the true error given the estimate. We consider the joint distribution of the true error and the estimated error, assuming a random feature-label distribution. From it, we derive the marginal distributions, the conditional expectation of the estimated error given the true error, the conditional expectation of the true error given the estimated error, the conditional variance of the true error given the estimated error, and the 95% upper confidence bound for the true error given the estimated error. Numerous classification and estimation rules are considered across a number of models. Massive simulation is used for continuous models and analytic results are derived for discrete classification. We also consider a breast-cancer study to illustrate how the theory might be applied in practice. Although specific results depend on the classification rule, error-estimation rule, and model, some general trends are seen: (I) if the true error is small (large), then the conditional estimated error is generally high (low)-biased; (II) the conditional expected true error tends to be larger (smaller) than the estimated error for small (large) estimated errors; and (III) the confidence bounds tend to be well above the estimated error for low error estimates, becoming much less so for large estimates.
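The joint-sampling scheme the abstract describes can be sketched with a small Monte Carlo experiment. The model below is an illustrative assumption, not the paper's exact setup: a random feature-label distribution is generated by drawing random class means for two unit-variance Gaussian classes, a nearest-mean threshold classifier is trained on a small sample, and the resubstitution error estimate is paired with the exact true error of the trained rule. Conditioning on the estimated error then yields the conditional expectation and a 95% upper confidence bound for the true error.

```python
# Hedged Monte Carlo sketch of the setup described in the abstract.
# The Gaussian model, nearest-mean classifier, and resubstitution
# estimator are illustrative assumptions, not the paper's exact choices.
import math
import numpy as np

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def sample_joint_errors(n_trials=20000, n=20, seed=0):
    """Return paired (true error, estimated error) samples.

    Each trial draws a random feature-label distribution (random class
    means), trains a nearest-mean threshold classifier on n points per
    class, and records the resubstitution estimate alongside the exact
    true error of the trained rule under the generating Gaussians.
    """
    rng = np.random.default_rng(seed)
    true_errs = np.empty(n_trials)
    est_errs = np.empty(n_trials)
    for i in range(n_trials):
        mu0 = rng.normal(-0.5, 0.3)           # random class-0 mean
        mu1 = rng.normal(0.5, 0.3)            # random class-1 mean
        x0 = rng.normal(mu0, 1.0, n)
        x1 = rng.normal(mu1, 1.0, n)
        t = (x0.mean() + x1.mean()) / 2.0     # decision threshold
        # Resubstitution estimate: error rate on the training sample,
        # labeling x <= t as class 0 and x > t as class 1.
        est_errs[i] = (np.sum(x0 > t) + np.sum(x1 <= t)) / (2.0 * n)
        # Exact true error of that fixed rule under the true Gaussians.
        true_errs[i] = 0.5 * ((1.0 - phi(t - mu0)) + phi(t - mu1))
    return true_errs, est_errs

def conditional_stats(true_errs, est_errs, est_value, halfwidth=0.025):
    """Conditional mean and 95% upper bound of the true error, given
    that the estimated error falls in a small bin around est_value."""
    sel = np.abs(est_errs - est_value) <= halfwidth
    cond = true_errs[sel]
    return cond.mean(), np.quantile(cond, 0.95)

if __name__ == "__main__":
    true_errs, est_errs = sample_joint_errors()
    m, ub = conditional_stats(true_errs, est_errs, 0.25)
    print(f"E[true | est ~ 0.25] = {m:.3f}, 95% upper bound = {ub:.3f}")
```

Binning the estimated error, as `conditional_stats` does, is a crude stand-in for the exact conditional distributions derived in the paper, but it suffices to exhibit the reported trends, e.g. that the conditional expected true error and its 95% upper bound sit well above small error estimates.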