Topics in Measurement Error Analysis and High-Dimensional Binary Classification Thesis

abstract

  • We propose novel methods to tackle two problems, both of which have a crucial impact on applications in public health: model misspecification with measurement error, and high-dimensional binary classification. The first problem arises in epidemiological practice. Epidemiologists often categorize a continuous risk predictor because categorization is thought to be more robust and interpretable, even when the true risk model is not categorical. Their goal is thus to fit the categorical model and interpret its parameters. We address the question: in the presence of measurement error and categorization, how can we do what epidemiologists want, namely estimate the parameters of the categorical model that would have been estimated if the true predictor were observed? We develop a general methodology for such an analysis and illustrate it in linear and logistic regression. Simulation studies are presented, and the methodology is applied to a nutrition data set. Alternative approaches are also discussed. In the second project, we consider the problem of high-dimensional classification between two groups with unequal covariance matrices. Rather than estimating the full quadratic discriminant rule, we propose to perform simultaneous variable selection and linear dimension reduction on the original data, and then to apply quadratic discriminant analysis on the reduced space. In contrast to quadratic discriminant analysis, the proposed framework does not require estimation of precision matrices and scales linearly with the number of measurements, making it especially attractive for high-dimensional datasets. We support the methodology with theoretical guarantees on variable selection consistency and with empirical comparisons against competing approaches. We apply the method to gene expression data from breast cancer patients and confirm the crucial importance of the ESR1 gene in differentiating estrogen receptor status.
Further, we provide software support for the proposed methodology: we develop two R packages, CCP and DAP, and present two vignettes as long-format illustrations of their usage.
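
The first problem in the abstract can be made concrete with a small simulation. The sketch below is not the thesis's methodology and does not use the CCP or DAP packages; it only illustrates, under assumed settings, why categorizing an error-prone predictor targets different categorical parameters than categorizing the true predictor. The linear outcome model, the error standard deviation of 0.8, the use of quartiles, and the helper name `category_means` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Assumed data-generating process (illustrative only):
x = rng.normal(size=n)                   # true continuous predictor
w = x + rng.normal(scale=0.8, size=n)    # error-prone measurement of x
y = 1.0 + 0.5 * x + rng.normal(size=n)   # true model is linear, not categorical


def category_means(pred, outcome, k=4):
    """Fit a categorical (cell-means) linear model: cut `pred` into k
    quantile categories and return the mean outcome in each category."""
    inner_probs = np.linspace(0.0, 1.0, k + 1)[1:-1]
    edges = np.quantile(pred, inner_probs)
    cats = np.searchsorted(edges, pred)  # category index 0..k-1
    return np.array([outcome[cats == j].mean() for j in range(k)])


# Parameters the analyst wants: categories built from the true predictor.
target = category_means(x, y)
# Parameters obtained when categories are built from the mismeasured one.
naive = category_means(w, y)

target_contrast = target[-1] - target[0]  # top vs. bottom quartile
naive_contrast = naive[-1] - naive[0]

print("target category means:", np.round(target, 3))
print("naive category means: ", np.round(naive, 3))
print("top-vs-bottom contrast (target, naive):",
      round(target_contrast, 3), round(naive_contrast, 3))
```

In this setup the naive top-versus-bottom contrast is attenuated relative to the target, because measurement error moves observations across category boundaries. Closing that gap is the kind of correction the abstract describes.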

author list (cited authors)

  • Wang, T.

complete list of authors

  • Wang, Tianying

publication date

  • July 2018