Arslan, Emre (2018-12). A Novel Bayesian Rank-Based Framework for the Classification of High-Dimensional Biological Data. Doctoral Dissertation. Thesis

abstract

  • Statistical analysis of high-dimensional biological data is the central component of "personalized medicine" and "translational bioinformatics." Two major barriers limit the application of the extracted information in clinical studies: small sample sizes and a lack of biological interpretability caused by the complex classification boundaries of current algorithms.
    Motivated by these barriers, this dissertation introduces novel algorithms for the statistical analysis of high-dimensional biological data. We first introduce a novel predictive model that extends the top-scoring pair algorithm to a Bayesian setting. We test its performance on several real datasets and various simulated data scenarios and show that the proposed method has the best overall performance. Besides achieving high accuracy on real and simulated datasets, the proposed algorithm has the potential to discover gene markers that other algorithms may miss.
    We also propose the Bayesian Top-Scoring Pair (BTSP) as a feature selection method. We compare it with many well-known feature selection methods by combining each feature selection method with different well-known classifiers, and we evaluate all feature selection methods on different datasets and for different numbers of genes. The proposed BTSP algorithm has the best overall accuracy.
    Finally, we introduce a novel biological pathway data-based algorithm (BTSPP), which uses all pairwise interactions at the gene level and the pathway level. We apply the proposed method and well-known pathway data-based algorithms to different real datasets, compare how accurately they classify independent test sets, and show the superiority of the proposed algorithm. We also assess the ability of these pathway data-based algorithms, over-representation analysis (ORA), and gene set enrichment analysis (GSEA) to find biologically validated pathways related to diseases. The proposed pathway analysis method has the potential to find the biologically validated pathways, whereas the others cannot detect them.
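For readers unfamiliar with the top-scoring pair idea that the abstract builds on, the following is a minimal sketch of the classical (non-Bayesian) rule only. It is not the dissertation's BTSP or BTSPP method, whose details are not given in this record. The function names `fit_tsp` and `predict_tsp`, the 0/1 label encoding, and the toy data are assumptions made for illustration. The sketch scores every gene pair by how differently the ordering "gene i is expressed below gene j" occurs in the two classes, keeps the best-scoring pair, and classifies new samples from that single comparison.

```python
import numpy as np

def fit_tsp(X, y):
    """Classical top-scoring pair on a small expression matrix.

    X: (n_samples, n_genes) expression values; y: binary labels (0 or 1).
    Returns (i, j, vote_if_less, score). Hypothetical helper, illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # less[s, i, j] is True when gene i is below gene j in sample s.
    # This needs n_samples * n_genes^2 booleans, so it suits toy data only.
    less = X[:, :, None] < X[:, None, :]
    p0 = less[y == 0].mean(axis=0)  # frequency of X_i < X_j in class 0
    p1 = less[y == 1].mean(axis=0)  # frequency of X_i < X_j in class 1
    score = np.abs(p0 - p1)
    i, j = np.unravel_index(np.argmax(score), score.shape)
    # Observing X_i < X_j votes for the class in which it is more frequent.
    vote_if_less = int(p1[i, j] > p0[i, j])
    return int(i), int(j), vote_if_less, float(score[i, j])

def predict_tsp(model, X):
    """Predict labels from the single ordering learned by fit_tsp."""
    i, j, vote_if_less, _ = model
    X = np.asarray(X, dtype=float)
    less = X[:, i] < X[:, j]
    return np.where(less, vote_if_less, 1 - vote_if_less)

# Toy data (made up): gene 0 is below gene 1 in class 0 and above it in class 1.
X_train = np.array([[1, 2, 5], [1, 3, 5], [2, 1, 5], [3, 1, 5]])
y_train = np.array([0, 0, 1, 1])
model = fit_tsp(X_train, y_train)
predictions = predict_tsp(model, [[0, 9, 4], [9, 0, 4]])
```

Because the decision depends only on the relative order of two expression values, the resulting rule is easy to read biologically, which is the interpretability concern the abstract raises.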

publication date

  • December 2018