III: Small: Collaborative Research: Combinatorial Collaborative Clustering for Simultaneous Patient Stratification and Biomarker Identification


  • Modern high-throughput sequencing (HTS) technologies produce rich, high-dimensional biomedical data. In HTS studies of complex, dynamic, stochastic, and heterogeneous life and disease systems, the number of features per sample is typically much larger than the number of samples. Such data impose significant statistical and computational challenges, but they also bring unique opportunities for collaborative research that translates them into clinical precision medicine. This project will develop novel Bayesian methods and computational tools for combinatorial collaborative clustering, targeting two fundamental biomedical applications: tumor stratification and predictive biomarker identification. Unlike existing black-box algorithms for these tasks, the proposed Bayesian combinatorial collaborative clustering framework performs tumor stratification and biomarker identification simultaneously for specific tumor subtypes, so that a mechanistic understanding of the heterogeneity of complex diseases can be obtained. The captured interrelationships between molecular profile patterns and disease subtypes may provide deep insights into cellular disease mechanisms and could support the development of personalized disease prognosis and therapeutic strategies. The interdisciplinary nature of this project, together with the planned curriculum development and outreach activities, will provide excellent training opportunities for both undergraduate and graduate students, equipping them with the quantitative skills needed for biomedical research on big biomedical data of unprecedented scale.
The core of this project is the theoretical and computational foundation of a novel Bayesian statistical framework for translating existing large-scale, publicly available biomedical datasets, such as TCGA (The Cancer Genome Atlas) and ICGC (International Cancer Genome Consortium), into precision (personalized) disease diagnosis and prognosis.
A new class of binary and count data analysis models will be developed for Combinatorial Collaborative Clustering (CCC) based on modern HTS data to achieve reproducible and accurate tumor stratification and biomarker identification. Here "combinatorial" means that each cluster is defined over a subset of features, selected from all possible feature combinations via novel combinatorial analysis, and "collaborative" means that each cluster is collaboratively defined by how its members express their selected subset of features. First, rather than defining cluster centers and a distance metric to stratify patients based on all features, CCC stratifies patients while simultaneously identifying, as biomarkers, the cluster-specific features that show similar profile patterns. Because the predictive likelihood of a sample under a patient cluster is calculated over a small subset of features selected from tens of thousands, CCC alleviates "the curse of dimensionality" and substantially improves reproducibility. Second, it enables natural integration of mixed-type HTS data by linking the various data types to latent counts. Finally, the proposed count-modeling-based inference algorithms compute only over non-zero elements and therefore yield extremely efficient analytic methods for sparse matrices, which are common in HTS data. In addition to its theoretical and computational merit, CCC provides a flexible probabilistic computational framework for identifying and characterizing tumor subtypes or subclones, which can lead to more effective personalized prognosis and therapeutic design.
The proposed CCC methods will first be evaluated on the TCGA and ICGC data and then applied in collaborative research with the principal investigator's ongoing biomedical collaborators on cancer and immunological disease studies.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
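
The abstract's claim that count-based inference needs to compute only over non-zero elements can be illustrated with a small sketch. It assumes a plain Poisson factorization, x_ij ~ Poisson(sum_k theta_ik * phi_jk), which is an illustrative stand-in and not the project's actual CCC model; the function names and the dictionary-of-nonzeros input format are likewise hypothetical. Under that assumption, the terms involving the counts vanish wherever x_ij = 0, and the sum of all rates factorizes, so the log-likelihood never has to visit the zero cells.

```python
import math


def poisson_loglik_sparse(nonzeros, theta, phi):
    """Poisson log-likelihood that visits only the non-zero counts.

    Assumed (illustrative) model: x_ij ~ Poisson(sum_k theta[i][k] * phi[j][k]).
    nonzeros maps (sample i, feature j) -> positive count; absent cells are zero.
    """
    n_factors = len(theta[0])
    # The sum of all rates factorizes:
    # sum_ij lambda_ij = sum_k (sum_i theta_ik) * (sum_j phi_jk),
    # so it costs O((N + P) * K) instead of O(N * P * K).
    total_rate = sum(
        sum(row[k] for row in theta) * sum(row[k] for row in phi)
        for k in range(n_factors)
    )
    loglik = -total_rate
    # x * log(rate) - log(x!) is zero whenever x == 0, so only non-zeros matter.
    for (i, j), x in nonzeros.items():
        rate = sum(t * p for t, p in zip(theta[i], phi[j]))
        loglik += x * math.log(rate) - math.lgamma(x + 1)
    return loglik


def poisson_loglik_dense(nonzeros, theta, phi):
    """Reference version that loops over every cell, used only as a check."""
    loglik = 0.0
    for i, theta_i in enumerate(theta):
        for j, phi_j in enumerate(phi):
            x = nonzeros.get((i, j), 0)
            rate = sum(t * p for t, p in zip(theta_i, phi_j))
            loglik += x * math.log(rate) - rate - math.lgamma(x + 1)
    return loglik


# Toy data: 3 samples, 4 features, 2 latent factors, 3 non-zero counts.
theta = [[0.5, 1.0], [2.0, 0.1], [0.3, 0.7]]
phi = [[1.0, 0.2], [0.4, 0.9], [0.6, 0.6], [0.1, 1.5]]
nonzeros = {(0, 1): 3, (1, 0): 5, (2, 3): 1}

sparse_value = poisson_loglik_sparse(nonzeros, theta, phi)
dense_value = poisson_loglik_dense(nonzeros, theta, phi)
```

The two functions agree up to floating-point rounding, but the sparse version's cost grows with the number of non-zero entries rather than with the full size of the sample-by-feature matrix.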

date/time interval

  • 2018 - 2021