Vu, Thang (2011-05). The Bootstrap in Supervised Learning and its Applications in Genomics/Proteomics. Doctoral Dissertation.

abstract

  • Small sample size is a prevalent problem in genomics and proteomics today. The bootstrap, a resampling method that aims to use data more efficiently, is one approach to overcoming limited sample size. This dissertation studies the application of the bootstrap to two problems of supervised learning with small-sample data: estimation of the misclassification error of Gaussian discriminant analysis, and the bagging ensemble classification method.
    Estimating the misclassification error of discriminant analysis is a classical problem in pattern recognition and has many important applications in biomedical research. Bootstrap error estimation has been shown empirically to be one of the best estimation methods in terms of root mean squared (RMS) error. In the first part of this work, we conduct a detailed analytical study of bootstrap error estimation for the Linear Discriminant Analysis (LDA) classification rule under Gaussian populations. We derive exact formulas for the first and second moments of the zero bootstrap and convex bootstrap estimators, as well as their cross moments with the resubstitution estimator and the true error. Based on these results, we obtain exact formulas for the bias, the variance, and the RMS error of the deviation of these bootstrap estimators from the true error. This includes the moments of the popular .632 bootstrap estimator. Moreover, we obtain the optimal weights for the unbiased and minimum-RMS convex bootstrap estimators. In the univariate case, all the expressions involve Gaussian distributions, whereas in the multivariate case, the results are written in terms of bivariate doubly non-central F distributions.
    In the second part of this work, we conduct an extensive empirical investigation of bagging, which is an application of the bootstrap to ensemble classification. We investigate the performance of bagging in the classification of small-sample gene-expression data and protein-abundance mass spectrometry data, as well as the accuracy of small-sample error estimation with this ensemble classification rule. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally improves the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that the improvement was not sufficient to beat the performance of single stable, non-overfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors (3NN). Furthermore, the ensemble method did not significantly improve the performance of these stable classifiers. We give an explicit definition of the out-of-bag estimator that is intended to remove estimator bias by carefully formulating how the error count is normalized. We then investigate the performance of error estimation for bagging of common classification rules, including LDA, 3NN, and CART, applied to both synthetic and real patient data. The error estimators compared are resubstitution, leave-one-out, cross-validation, basic bootstrap, .632 bootstrap, .632+ bootstrap, bolstering, and semi-bolstering, in addition to the out-of-bag estimator. The numerical experiments indicated that the performance of the out-of-bag estimator is very similar to that of leave-one-out; in particular, the out-of-bag estimator is slightly pessimistically biased. The performance of the other estimators is consistent with their performance with the corresponding single classifiers, as reported in other studies. The results of this work are expected to provide helpful guidance to practitioners who are interested in applying the bootstrap in supervised learning applications.
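To make the estimators named in the abstract concrete, the following is a minimal Monte Carlo sketch of the zero bootstrap and convex bootstrap error estimators for an LDA classifier. It is not the dissertation's analytical derivation, which gives exact moments rather than simulated values. The function names, the synthetic data, the equal-priors LDA, the use of a pseudo-inverse for the pooled covariance, and the pooling of left-out errors across bootstrap replicates are all assumptions made for illustration. A weight of 0.632 gives the .632 bootstrap estimator.

```python
import numpy as np

def fit_lda(X, y):
    """Plain LDA with pooled covariance and equal priors (illustrative)."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled sample covariance of the two classes
    S = ((X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)) / (len(X) - 2)
    # Pseudo-inverse guards against a singular covariance in small resamples
    w = np.linalg.pinv(S) @ (m1 - m0)
    b = -0.5 * w @ (m0 + m1)
    return w, b

def predict_lda(model, X):
    w, b = model
    return (X @ w + b > 0).astype(int)

def zero_bootstrap_error(X, y, n_boot=200, seed=0):
    """Error on points left out of each bootstrap sample, pooled over replicates."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors, tested = 0, 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample n points with replacement
        out = np.setdiff1d(np.arange(n), idx)     # points not drawn in this replicate
        yb = y[idx]
        # Skip degenerate replicates (no left-out points or a nearly empty class)
        if out.size == 0 or min(np.sum(yb == 0), np.sum(yb == 1)) < 2:
            continue
        model = fit_lda(X[idx], yb)
        errors += np.sum(predict_lda(model, X[out]) != y[out])
        tested += out.size
    return errors / tested

def convex_bootstrap_error(X, y, weight=0.632, **kwargs):
    """Convex combination: (1 - weight) * resubstitution + weight * zero bootstrap."""
    resub = np.mean(predict_lda(fit_lda(X, y), X) != y)
    zero = zero_bootstrap_error(X, y, **kwargs)
    return (1 - weight) * resub + weight * zero

# Synthetic two-class Gaussian data (assumed for the example only)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(1.0, 1.0, (20, 2))])
y = np.repeat([0, 1], 20)
print(convex_bootstrap_error(X, y, weight=0.632))
```

Setting `weight=0` recovers the resubstitution estimate and `weight=1` the zero bootstrap estimate; the abstract's optimal weights would replace the fixed 0.632 here.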

publication date

  • May 2011