Miao, Jingang (2014-08). New Advances in Logistic Regression for Handling Missing and Mismeasured Data with Applications in Biostatistics. Doctoral Dissertation.
Thesis
As a probabilistic statistical classification model, logistic regression (or logit regression) is widely used to model the outcome of a categorical dependent variable based on one or more predictor variables (features). We study two problems related to logistic regression with applications in biostatistics.
In the first problem, we study multivariate disease classification in the presence of partially missing disease traits. In modern cancer epidemiology, diseases are classified based on pathologic and molecular traits, and different combinations of these traits give rise to many disease subtypes. The effect of predictor variables can be measured by fitting a polytomous logistic model to such data. The differences (heterogeneity) among the relative risk parameters associated with the subtypes are of great interest for understanding disease etiology. Because of this heterogeneity, a change in a risk factor may shift the prevalence of one subtype more than that of another. Estimating the heterogeneity parameters is difficult when disease trait information is only partially observed and the number of disease subtypes is large. We consider a robust semiparametric approach based on the pseudo conditional likelihood for estimating these heterogeneity parameters. Through simulation studies, we compare the robustness and efficiency of our approach with those of the maximum likelihood approach. The method is then applied to data from the American Cancer Society Cancer Prevention Study (CPS) II Nutrition Cohort. Weight gain was associated with the risk of breast cancer, and the association varied by disease subtype.
In the second problem, we use a semiparametric Bayesian method to handle measurement errors. In nutritional epidemiological studies, nutrient intakes are often measured via food frequency questionnaires and 24-hour dietary recalls. Due to self-reporting, recall error, and other reasons, the measured nutrient intakes can contain a substantial amount of noise. While independence between the measurement error and the true predictor is likely to be a reasonable assumption for the main effects of the predictors, it is not tenable for the interaction effect of two predictors measured with error. Although there are a number of flexible methods for handling additive, homogeneous measurement error in predictors in logistic regression models, relatively little attention has been paid to measurement error that depends on the unobserved predictor. Therefore, we propose a semiparametric Bayesian method for handling this unorthodox measurement error scenario in logistic regression models that include an interaction term. The proposed method is also designed to handle partially missing values of the error-prone surrogate variables. Through simulation studies, we assess some operating characteristics of the proposed method and compare it with the simulation extrapolation and regression calibration methods. Our method has smaller biases than the other methods. In addition, we analyze the NHANES data and assess the association between some important nutrients and high cholesterol level. Total fat and protein reinforce each other's association with the risk of having a high cholesterol level.
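To make the measurement error problem concrete, the following minimal sketch simulates the simpler case mentioned above: additive, homogeneous error that is independent of the true predictor. It is not the semiparametric Bayesian method of the dissertation. The sample size, coefficients, error variance, and the helper `fit_logistic` are all assumptions chosen for illustration. The sketch fits a logistic regression once on the true predictor and once on its noisy surrogate, so the two slope estimates can be compared.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson; returns [b0, b1].

    Hypothetical helper written for this sketch only.
    """
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # fitted probabilities
        grad = X.T @ (y - p)                       # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # observed information
        beta = beta + np.linalg.solve(hess, grad)  # Newton step
    return beta

rng = np.random.default_rng(0)
n = 20000                                  # assumed sample size
x_true = rng.normal(size=n)                # unobserved true predictor
w = x_true + rng.normal(size=n)            # surrogate: additive error, independent of x_true
prob = 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x_true)))  # assumed true model, slope 1.0
y = rng.binomial(1, prob)

beta_true = fit_logistic(x_true, y)        # fit using the true predictor
beta_naive = fit_logistic(w, y)            # naive fit using the noisy surrogate

print("slope with true predictor:", beta_true[1])
print("slope with surrogate:     ", beta_naive[1])
```

Under these assumptions the naive slope is pulled toward zero relative to the slope obtained from the true predictor. Methods such as regression calibration and simulation extrapolation aim to correct this kind of bias; error that depends on the unobserved predictor, as studied in the second problem, is not covered by this sketch.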