Scalable Methods For Classification Of Heterogeneous High-Dimensional Data
- View All
Recent technological advances have enabled routine collection of large-scale high-dimensional data in the biomedical fields. For example, in cancer research it is common to use multiple high-throughput technology platforms to measure genotype, gene expression levels, and methylation levels. One of the main challenges in the analysis of such data is the identification of key biological measurements that can be used to classify the subject into a known cancer subtype. While significant progress has been made in the development of computationally efficient classification methods to address this challenge, existing methods do not adequately take into account the heterogeneity across the cancer subtypes and the mixed types of measurements (binary/count/continuous) across technology platforms. As such, existing methods may fail to identify relevant biological patterns. The goal of this project is to develop new classification methods that explicitly take into account the type and heterogeneity of measurements. While the primary focus is on methodology, high priority will be given to computational considerations and software development to encourage dissemination and ensure ease of use for domain scientists.Regularized linear discriminant methods are commonly used for simultaneous classification and variable selection due to their interpretability and computational efficiency. These methods, however, rely on unrealistic assumptions of equality of group-covariance matrices and normality of measurements. This project aims to address the limitations present in current discriminant approaches, and has three objectives: (1) to develop computationally efficient quadratic classification rules that perform variable selection; (2) to generalize the discriminant analysis framework to non-normal measurements; (3) to develop a classification framework for mixed type data coming from multiple technology platforms collected on the same set of subjects. The key methodological innovation is the combination of sparse low-rank singular value decomposition, which enables computational efficiency, with geometric interpretation of linear discriminant analysis, which allows for the construction of nonlinear classification rules by redefining the space for discrimination.