Lu, Meng (2015-12). Probabilistic Models for Aggregate Analysis of Non-Gaussian Data in Biomedicine. Doctoral Dissertation.
Aggregate association analysis is a popular approach in genome-wide association studies (GWAS) that tests the association between a trait of interest and regions of functionally related genes. It has the advantage of capturing missing heritability from the joint effects of correlated genetic variants while providing a better understanding of disease etiology from a systematic perspective. However, traditional methods lose power on biomedical data with non-Gaussian data types. We proposed new statistical models that derive more accurate aggregated signals, and thereby increase power, by accounting for these special data types. Based on general exponential family distribution assumptions, we developed supervised logistic PCA and supervised categorical PCA for pathway-based GWAS and rare variant analysis. A general framework, sparse exponential family PCA (SePCA), was further developed for interpretable aggregate analysis of various types of biomedical data. We derived an efficient algorithm that finds the optimal aggregated signals by solving the equivalent dual problem with closed-form updating rules. SePCA was extended to aggregate association analysis at hierarchical levels, from groups to individual variables, for better biological interpretation. Both simulation studies and real-world applications demonstrated that our methods achieve higher power in association analysis and population stratification by properly accounting for the correlations among the non-Gaussian variables in biomedical data. Another analytic issue in aggregate analysis is that biomedical data often have special stratified structures arising from experimental designs intended to address confounding. We extended SePCA to low-rank and full-rank matched models to account for these stratified data structures. The simulation study demonstrated their capability of reconstructing more relevant principal components (PCs) for the signals of interest compared to standard exponential family PCA (ePCA). 
A sparse low-rank matched PCA model outperforms existing Bayesian methods in detecting differentially expressed genes in a benchmark spike-in gene study with technical replicates. In summary, our proposed statistical models for non-Gaussian biomedical data derive more accurate and robust aggregated signals that help reveal the underlying biological principles of human disease. Beyond bioinformatics, these probabilistic models also have rich applications in data mining, computer vision, and the social sciences.
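
To make the idea of exponential family PCA on non-Gaussian data concrete, the following is a minimal sketch of plain (unsupervised, non-sparse) logistic PCA for binary data. It is an illustration only and not the dissertation's method: the supervised and sparse variants and the dual algorithm with closed-form updates are not reproduced here. The function name `logistic_pca`, its parameters, the block-wise step scaling, and the simulated data are all assumptions made for this example. The model treats each 0/1 entry as Bernoulli with natural parameter `Theta = U @ V.T + b` and fits it by gradient descent on the negative log-likelihood.

```python
import numpy as np


def logistic_pca(X, rank=2, lr=0.2, n_iter=1000, seed=0):
    """Fit a low-rank Bernoulli (logistic) PCA by gradient descent.

    X is an (n_samples, n_features) array of 0/1 values. The matrix of
    natural parameters (logits) is modelled as Theta = U @ V.T + b.
    Returns the scores U, loadings V, per-feature offsets b, and the
    mean negative log-likelihood recorded at each iteration.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = 0.01 * rng.standard_normal((n, rank))  # low-dimensional scores
    V = 0.01 * rng.standard_normal((d, rank))  # loadings
    b = np.zeros(d)                            # per-feature offset
    losses = []
    for _ in range(n_iter):
        theta = U @ V.T + b
        # Bernoulli negative log-likelihood: log(1 + exp(theta)) - x * theta
        losses.append(np.mean(np.logaddexp(0.0, theta) - X * theta))
        # Gradient w.r.t. theta is sigmoid(theta) - X (stable sigmoid via tanh)
        resid = 0.5 * (1.0 + np.tanh(0.5 * theta)) - X
        grad_U = resid @ V / d      # scaled by d (illustrative step-size choice)
        grad_V = resid.T @ U / n    # scaled by n (illustrative step-size choice)
        grad_b = resid.mean(axis=0)
        U -= lr * grad_U
        V -= lr * grad_V
        b -= lr * grad_b
    return U, V, b, losses


# Simulated binary data with a rank-2 logit structure (hypothetical example).
rng = np.random.default_rng(1)
Z = rng.standard_normal((200, 2))
W = rng.standard_normal((30, 2))
prob = 0.5 * (1.0 + np.tanh(0.5 * 2.0 * Z @ W.T))
X = (rng.random((200, 30)) < prob).astype(float)

U, V, b, losses = logistic_pca(X, rank=2)
print("initial loss:", losses[0], "final loss:", losses[-1])
```

The rows of `U` play the role of aggregated signals that could then be tested for association with a trait. Replacing the Bernoulli log-likelihood with another exponential family term would give the corresponding ePCA variant under the same low-rank assumption.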