Wang, Tianying (2018-08). Topics in Measurement Error Analysis and High-Dimensional Binary Classification. Doctoral Dissertation.

abstract

  • We propose novel methods to tackle two problems, misspecified models with measurement error and high-dimensional binary classification, both of which have a crucial impact on applications in public health.
    The first problem arises in epidemiological practice. Epidemiologists often categorize a continuous risk predictor because categorization is thought to be more robust and interpretable, even when the true risk model is not categorical. Their goal is thus to fit the categorical model and interpret its parameters. We address the question: with measurement error and categorization, how can we do what epidemiologists want, namely estimate the parameters of the categorical model that would have been estimated if the true predictor were observed? We develop a general methodology for such an analysis and illustrate it in linear and logistic regression. Simulation studies are presented, and the methodology is applied to a nutrition data set. Alternative approaches are also discussed.
    For the second project, we consider high-dimensional classification between two groups with unequal covariance matrices. Rather than estimating the full quadratic discriminant rule, we propose to perform simultaneous variable selection and linear dimension reduction on the original data, and then apply quadratic discriminant analysis on the reduced space. In contrast to standard quadratic discriminant analysis, the proposed framework does not require estimation of precision matrices and scales linearly with the number of measurements, making it especially attractive for high-dimensional datasets. We support the methodology with theoretical guarantees on variable selection consistency and with empirical comparisons against competing approaches. We apply the method to gene expression data from breast cancer patients and confirm the crucial importance of the ESR1 gene in differentiating estrogen receptor status.
    Further, we provide software support for the proposed methodology. We develop two R packages, CCP and DAP, and present two vignettes as long-format illustrations of their usage.

publication date

  • August 2018