The Model-Based Study of the Effectiveness of Reporting Lists of Small Feature Sets Using RNA-Seq Data. Academic Article


  • Ranking feature sets for phenotype classification based on gene expression is a challenging problem in cancer bioinformatics. When the number of samples is small, all feature selection algorithms are known to be unreliable, producing significant error, and error estimators suffer from varying degrees of imprecision. The problem is compounded by the fact that classification accuracy depends on how the measurement technology transforms the underlying phenomena into data. Because next-generation sequencing technologies amount to a nonlinear transformation of the actual gene or RNA concentrations, they can produce data that are less discriminative than the actual gene expression levels. In this study, we compare the performance of ranking feature sets derived from a model of RNA-Seq data with that of a multivariate normal model of gene concentrations using three measures: (1) ranking power, (2) length of extensions, and (3) Bayes features. This model-based study examines the effectiveness of reporting lists of small feature sets using RNA-Seq data, together with the effects of different model parameters and error estimators. The results demonstrate that the general trends of the parameter effects on the ranking power of the underlying gene concentrations are preserved in the RNA-Seq data, whereas the power of finding a good feature set becomes weaker when gene concentrations are transformed by the sequencing machine.
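
    The comparison described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' model or code. It assumes, purely for illustration, that log-concentrations are independent Gaussians per gene and that a read count is Poisson-distributed with mean `depth * exp(log-concentration)`. The names `simulate`, `rank_features`, and `compare`, the parameter values, the nearest-mean classifier, and the holdout error are all hypothetical choices standing in for the paper's classifiers, error estimators, and three ranking measures.

    ```python
    import math
    import random


    def poisson(lam, rng):
        """Draw one Poisson(lam) variate (Knuth's method for modest lam)."""
        if lam > 500:  # normal approximation avoids exp(-lam) underflow
            return max(0, round(rng.gauss(lam, math.sqrt(lam))))
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1


    def simulate(n, deltas, depth, rng, sigma=0.5, base=2.0):
        """Return labels, log-concentrations, and read counts for n samples.

        Assumption: class 1 shifts gene j's mean log-concentration by deltas[j].
        """
        labels, conc, counts = [], [], []
        for i in range(n):
            y = i % 2  # alternate classes so both are equally represented
            x = [rng.gauss(base + y * d, sigma) for d in deltas]
            labels.append(y)
            conc.append(x)
            counts.append([poisson(depth * math.exp(v), rng) for v in x])
        return labels, conc, counts


    def holdout_error(train, test, j, transform):
        """Holdout error of a nearest-mean classifier on single feature j."""
        (ytr, xtr), (yte, xte) = train, test
        means = []
        for c in (0, 1):
            vals = [transform(x[j]) for y, x in zip(ytr, xtr) if y == c]
            means.append(sum(vals) / len(vals))
        wrong = 0
        for y, x in zip(yte, xte):
            v = transform(x[j])
            predicted_one = abs(v - means[1]) < abs(v - means[0])
            wrong += predicted_one != (y == 1)
        return wrong / len(yte)


    def rank_features(train, test, n_features, transform=lambda v: v):
        """Rank features from lowest to highest holdout error."""
        errs = [holdout_error(train, test, j, transform) for j in range(n_features)]
        return sorted(range(n_features), key=lambda j: errs[j]), errs


    def compare(seed=0, n_train=20, n_test=2000, depth=0.5):
        """Rank the same genes from concentrations and from simulated counts."""
        rng = random.Random(seed)
        deltas = [1.0, 0.6, 0.3, 0.1, 0.0]  # hypothetical class separations
        ytr, ctr, ntr = simulate(n_train, deltas, depth, rng)
        yte, cte, nte = simulate(n_test, deltas, depth, rng)
        conc_rank, conc_err = rank_features((ytr, ctr), (yte, cte), len(deltas))
        seq_rank, seq_err = rank_features(
            (ytr, ntr), (yte, nte), len(deltas), transform=math.log1p
        )
        return {"conc_rank": conc_rank, "conc_err": conc_err,
                "seq_rank": seq_rank, "seq_err": seq_err}


    if __name__ == "__main__":
        result = compare(seed=0)
        print("ranking from concentrations:", result["conc_rank"])
        print("ranking from read counts:   ", result["seq_rank"])
    ```

    Calling `compare` with different `depth` or `n_train` values gives a rough feel for how sequencing depth and sample size could affect the agreement between the two rankings; the numbers it prints are properties of this toy setup only and say nothing about the results reported in the article.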

published proceedings

  • Cancer Inform

altmetric score

  • 0.25

author list (cited authors)

  • Kim, E., Ivanov, I., Hua, J., Lampe, J. W., Hullar, M. A., Chapkin, R. S., & Dougherty, E. R.

citation count

  • 1

complete list of authors

  • Kim, Eunji||Ivanov, Ivan||Hua, Jianping||Lampe, Johanna W||Hullar, Meredith AJ||Chapkin, Robert S||Dougherty, Edward R

publication date

  • January 2017