Innovative approaches for analyzing SEER breast cancer data
The Surveillance, Epidemiology and End Results (SEER) Program is a premier source for cancer statistics in the United States. Proper and efficient use of the resources available from the SEER Program is of public and national interest. We therefore propose innovative methods for estimating 5-year survival probability, identifying important predictors of survival, and estimating the effect of predictor variables on the survival time of cancer patients using the SEER data. In particular, we consider breast cancer survival data, as breast cancer is the most common type of cancer among women. Modeling survival time in terms of several disease characteristics and demographic factors is challenging because the data are censored and the model involves many parameters (a high-dimensional problem). In Aim A, we consider an accelerated failure time (AFT) type model and propose a nonparametric Bayesian solution to this problem. The solution involves modeling the mean in terms of many parameters corresponding to the disease characteristics and demographic factors, and modeling the variance as a smooth nonparametric function of the mean. The nonparametric error distribution of the AFT model is handled via a constrained Dirichlet process prior. Because the mean involves a large number of parameters, a variable selection technique is adopted to reduce the effective dimension of the problem. The main innovation is treating the AFT model from a realistic and general perspective, which has not been done before. Many of the disease characteristics in the SEER database contain a significant proportion of missing values. Excluding subjects with a missing value in any disease characteristic may distort the conclusions, and would definitely reduce the power to detect a potential association between survival time and the predictor variables.
In Aim B, we propose a semiparametric method for handling a missing predictor variable in the linear transformation model, a semiparametric model that contains the proportional hazards model and the proportional odds model as two special cases. The main innovation of this part is how we handle the missing data and make inference about a finite-dimensional parameter in the presence of an infinite-dimensional parameter. Finally, our proposed methods permit a useful and accurate interpretation of the results of the analysis from a modern epidemiological perspective. Our models are broad, and we seek a distribution-free procedure to estimate the model parameters either in the presence of many predictors or in the presence of a missing predictor.
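
For orientation, the two model classes named above can be written schematically as follows. This is only a sketch in standard textbook notation, not the exact specification used in the project. The symbols $T_i$ (survival time), $C_i$ (censoring time), $x_i$ and $Z_i$ (predictor vectors), $\beta$, $\sigma(\cdot)$, $G$, and $H$ are notational assumptions introduced here. The abstract does not state the form of the constraint on the Dirichlet process prior, so it is left unspecified.

```latex
% Aim A (assumed notation): AFT-type model with mean-dependent variance.
% Observed data are (Y_i, \delta_i) because of right censoring.
\begin{align*}
  Y_i &= \min(T_i, C_i), \qquad \delta_i = \mathbf{1}\{T_i \le C_i\}, \\
  \log T_i &= \mu_i + \sigma(\mu_i)\,\varepsilon_i, \qquad \mu_i = x_i^{\top}\beta, \\
  \varepsilon_i \mid G &\overset{\text{iid}}{\sim} G, \qquad
  G \sim \text{constrained Dirichlet process prior}, \\
  % 5-year survival probability implied by the model (T measured in years)
  P(T_i > 5 \mid x_i) &= 1 - G\!\left(\frac{\log 5 - \mu_i}{\sigma(\mu_i)}\right).
\end{align*}

% Aim B (assumed notation): linear transformation model.
% H is an unknown increasing function (the infinite-dimensional parameter);
% \beta is the finite-dimensional parameter of interest.
\begin{align*}
  H(T_i) &= -\beta^{\top} Z_i + \varepsilon_i, \\
  P(\varepsilon_i > s) = \exp(-e^{s})
    &\;\Rightarrow\; \text{proportional hazards model}, \\
  P(\varepsilon_i > s) = \frac{1}{1 + e^{s}}
    &\;\Rightarrow\; \text{proportional odds model}.
\end{align*}
```

Under the first error distribution, the cumulative hazard is $\exp\{H(t)\}\exp(\beta^{\top} Z_i)$, and under the second, the odds of failure by time $t$ are $\exp\{H(t) + \beta^{\top} Z_i\}$. This is why both models arise as special cases of the linear transformation model.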