Cross-Validation and the Estimation of Conditional Probability Densities
Academic Article
Overview
abstract
Many practical problems, especially some connected with forecasting, require nonparametric estimation of conditional densities from mixed data. For example, given an explanatory data vector X for a prospective customer, with components that could include the customer's salary, occupation, age, sex, marital status, and address, a company might wish to estimate the density of the expenditure, Y, that could be made by that person, basing the inference on observations of (X,Y) for previous clients. Choosing appropriate smoothing parameters for this problem can be tricky, not least because plug-in rules take a particularly complex form in the case of mixed data. An obvious difficulty is that there exists no general formula for the optimal smoothing parameters. More insidiously, and more seriously, it can be difficult to determine which components of X are relevant to the problem of conditional inference. For example, if the jth component of X is independent of Y, then that component is irrelevant to estimating the density of Y given X, and ideally should be dropped before conducting inference. In this article we show that cross-validation overcomes these difficulties. It automatically determines which components are relevant and which are not, through assigning large smoothing parameters to the latter and consequently shrinking them toward the uniform distribution on the respective marginals. This effectively removes irrelevant components from contention, by suppressing their contribution to estimator variance; they already have very small bias, a consequence of their independence of Y. Cross-validation also yields important information about which components are relevant: the relevant components are precisely those that cross-validation has chosen to smooth in a traditional way, by assigning them smoothing parameters of conventional size.
Indeed, cross-validation produces asymptotically optimal smoothing for relevant components, while eliminating irrelevant components by oversmoothing. In the problem of nonparametric estimation of a conditional density, cross-validation comes into its own as a method with no obvious peers.
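
The oversmoothing mechanism described in the abstract can be illustrated with a small numerical sketch. It is not the authors' procedure: it assumes continuous covariates only, Gaussian product kernels, a coarse bandwidth grid, and leave-one-out likelihood cross-validation in place of the criterion used in the article. The function name `loo_log_likelihood`, the simulated data, and the grid values are all hypothetical choices made for illustration.

```python
import itertools
import numpy as np


def loo_log_likelihood(X, y, hx, hy):
    """Mean leave-one-out log-likelihood of a kernel conditional density estimate.

    Illustrative assumption: Gaussian product kernel on the covariates and a
    Gaussian kernel on the response. hx holds one bandwidth per covariate.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    hx = np.asarray(hx, dtype=float)
    # Covariate weights; their normalizing constants cancel in the ratio below.
    dx = (X[:, None, :] - X[None, :, :]) / hx
    wx = np.exp(-0.5 * np.sum(dx ** 2, axis=2))
    np.fill_diagonal(wx, 0.0)  # leave observation i out of its own estimate
    # Properly normalized Gaussian kernel on the response.
    dy = (y[:, None] - y[None, :]) / hy
    ky = np.exp(-0.5 * dy ** 2) / (hy * np.sqrt(2.0 * np.pi))
    dens = np.sum(wx * ky, axis=1) / np.sum(wx, axis=1)
    return float(np.mean(np.log(np.maximum(dens, 1e-300))))


# Simulated data (hypothetical): Y depends on X[:, 0] only; X[:, 1] is irrelevant.
rng = np.random.default_rng(0)
n = 200
X = rng.uniform(size=(n, 2))
y = np.sin(2.0 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(n)

# Coarse grid; 1e6 stands in for an "infinite" bandwidth.
grid = [0.05, 0.1, 0.2, 0.5, 1e6]
hy_grid = [0.1, 0.2, 0.4]
best_score, best_h1, best_h2, best_hy = max(
    (loo_log_likelihood(X, y, [h1, h2], hy), h1, h2, hy)
    for h1, h2 in itertools.product(grid, grid)
    for hy in hy_grid
)
print("selected bandwidths:", best_h1, best_h2, best_hy)

# A very large bandwidth makes a covariate's kernel weights essentially
# constant, so the criterion matches the one obtained by dropping it.
score_oversmoothed = loo_log_likelihood(X, y, [0.1, 1e6], 0.2)
score_dropped = loo_log_likelihood(X[:, :1], y, [0.1], 0.2)
print(score_oversmoothed, score_dropped)
```

The last two printed values agree to numerical precision, which is the sense in which a very large smoothing parameter removes a component from the estimator. Whether the grid search actually selects the large bandwidth for the second covariate depends on the simulated sample, so no particular selection is claimed here.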