What should be expected from feature selection in small-sample settings.

abstract

MOTIVATION: High-throughput technologies for rapid measurement of vast numbers of biological variables offer the potential for highly discriminatory diagnosis and prognosis; however, high dimensionality together with small samples creates the need for feature selection, while at the same time making feature-selection algorithms less reliable. Feature selection must typically be carried out from among thousands of gene-expression features and in the context of a small sample (small number of microarrays). Two basic questions arise: (1) Can one expect feature selection to yield a feature set whose error is close to that of an optimal feature set? (2) If a good feature set is not found, should it be expected that good feature sets do not exist?

RESULTS: The two questions translate quantitatively into questions concerning conditional expectation. (1) Given the error of an optimal feature set, what is the conditionally expected error of the selected feature set? (2) Given the error of the selected feature set, what is the conditionally expected error of the optimal feature set? We address these questions using three classification rules (linear discriminant analysis, linear support vector machine and k-nearest-neighbor classification) and feature selection via sequential floating forward search and the t-test. We consider three feature-label models and patient data from a study concerning survival prognosis for breast cancer. With regard to the two focus questions, there is similarity across all experiments: (1) One cannot expect to find a feature set whose error is close to optimal, and (2) the inability to find a good feature set should not lead to the conclusion that good feature sets do not exist. In practice, the latter conclusion may be more immediately relevant, since when faced with the common occurrence that a feature set discovered from the data does not give satisfactory results, the experimenter can draw no conclusions regarding the existence or nonexistence of suitable feature sets.

AVAILABILITY: http://ee.tamu.edu/~edward/feature_regression/
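The first question in the abstract (how the error of a selected feature set compares with that of an optimal one) can be illustrated with a small simulation. The sketch below is not the authors' code or experimental setup. It assumes a hypothetical synthetic model (two Gaussian classes, 200 features, only the first 5 of which differ in mean), uses t-test ranking as the selection method, and substitutes a simple nearest-class-mean classifier for the classification rules named in the abstract. All function names and parameter values are illustrative assumptions.

```python
import random
import statistics


def make_data(n_per_class, n_features, n_informative, shift, rng):
    """Two Gaussian classes; only the first n_informative features differ in mean.
    (Hypothetical model, not one of the paper's feature-label models.)"""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
                   for j in range(n_features)]
            X.append(row)
            y.append(label)
    return X, y


def t_scores(X, y):
    """Absolute Welch t-statistic of each feature between class 0 and class 1."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
        scores.append(abs(statistics.mean(a) - statistics.mean(b)) / se)
    return scores


def select_top(scores, k):
    """Indices of the k features with the largest scores."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def nearest_mean_error(X_train, y_train, X_test, y_test, features):
    """Test error of a nearest-class-mean classifier restricted to `features`."""
    means = {}
    for label in (0, 1):
        rows = [r for r, lab in zip(X_train, y_train) if lab == label]
        means[label] = [statistics.mean(r[j] for r in rows) for j in features]
    wrong = 0
    for row, lab in zip(X_test, y_test):
        dist = {c: sum((row[j] - m) ** 2 for j, m in zip(features, means[c]))
                for c in (0, 1)}
        pred = 0 if dist[0] <= dist[1] else 1
        wrong += pred != lab
    return wrong / len(y_test)


rng = random.Random(0)
# Small training sample (15 per class), large test sample to estimate true error.
X_train, y_train = make_data(15, 200, 5, 1.0, rng)
X_test, y_test = make_data(500, 200, 5, 1.0, rng)

selected = select_top(t_scores(X_train, y_train), 5)  # chosen from the data
oracle = list(range(5))                               # truly informative set

err_selected = nearest_mean_error(X_train, y_train, X_test, y_test, selected)
err_oracle = nearest_mean_error(X_train, y_train, X_test, y_test, oracle)
print("selected features:", sorted(selected))
print("error with selected set:", err_selected)
print("error with oracle set:  ", err_oracle)
```

Repeating the run over many seeds and averaging `err_selected` for each level of `err_oracle` would give a rough analogue of the conditional expectation the abstract describes, under the assumptions stated above.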

authors

Dougherty, Edward

published proceedings

Bioinformatics

altmetric score

3

author list (cited authors)

Sima, C., & Dougherty, E. R.

citation count

70

complete list of authors

Sima, Chao||Dougherty, Edward R

publication date

October 2006

publisher

Oxford University Press (OUP)

published in

Bioinformatics
