Feature selection for high-dimensional integrated data Conference Paper

abstract

  • Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of feature selection in which only a subset of the predictors, X_t, depends on the multidimensional variate Y, and the remaining predictors constitute a "noise set" X_u independent of Y. Using Monte Carlo simulations, we investigated the relative performance of two methods, thresholding and singular-value decomposition (SVD), in combination with stochastic optimization to determine "empirical bounds" on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods on a recent infant intestinal gene expression and metagenomics dataset. Copyright 2012 by the Society for Industrial and Applied Mathematics.

name of conference

  • Proceedings of the 2012 SIAM International Conference on Data Mining

published proceedings

  • Proceedings of the 2012 SIAM International Conference on Data Mining

author list (cited authors)

  • Zheng, C., Schwartz, S., Chapkin, R. S., Carroll, R. J., & Ivanov, I.

citation count

  • 1

complete list of authors

  • Zheng, Charles; Schwartz, Scott; Chapkin, Robert S; Carroll, Raymond J; Ivanov, Ivan

publication date

  • April 2012