Hardcastle, Mark Jeffery (2015-12). The Effect of Selective Data Omission on Type I Error Rates: A Simulation Study. Master's Thesis.

abstract

  • There are no widely accepted guidelines or standards for identifying and removing outlying data in empirical research. Researchers sometimes face significant incentives to discover particular results, and they have been observed to use flexibility in outlier omission to selectively omit data in search of statistically significant findings. The degree to which this practice can affect the credibility of research findings is unknown. This study uses Monte Carlo simulation to estimate the propensity of certain types of selective outlier omission to inflate type I error rates in regression models. Simulations are designed to analyze a posttest-only control group design with no underlying intervention effect, such that any statistically significant finding represents a type I error. Omission of observations is simulated in an exploratory manner: observations are omitted and regressions are rerun iteratively until either a type I error is made or a maximum trimming threshold is reached, whichever occurs first. Omission of observations based on z-score thresholds, a common research practice in some disciplines, is simulated. Additionally, omission from only one tail of the data (simulating the removal of only "disconfirming" observations) is analyzed. Simulations are performed using a variety of sample sizes and with samples drawn from several underlying population distributions. In all simulations, type I error rates are inflated: they range from 7.86% to 100%, compared to the expected 5% in the absence of data omission.
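
The iterative one-tail omission procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's code, and it makes several assumptions that the abstract does not specify: both groups are drawn from a standard normal population, only the treatment group is trimmed, one observation is dropped per step from the tail opposing the initially observed direction, and the p-value uses a normal approximation rather than the t distribution. The names `p_value`, `one_trial`, and `type_i_error_rate` are hypothetical, and the z-score threshold variant is not covered.

```python
import math
import random
from statistics import NormalDist

ALPHA = 0.05  # nominal significance level


def _mean_var(xs):
    """Sample mean and unbiased sample variance of a list of numbers."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def p_value(control, treatment):
    """Two-sided p-value for the difference between group means.

    The statistic is the pooled-variance t statistic, which equals the OLS
    slope t statistic when the outcome is regressed on a group dummy.
    Assumption: the p-value comes from a normal approximation, which is
    only reasonable for fairly large samples.
    """
    n_c, n_t = len(control), len(treatment)
    m_c, v_c = _mean_var(control)
    m_t, v_t = _mean_var(treatment)
    pooled_var = ((n_c - 1) * v_c + (n_t - 1) * v_t) / (n_c + n_t - 2)
    std_err = math.sqrt(pooled_var * (1 / n_c + 1 / n_t))
    z = (m_t - m_c) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def one_trial(rng, n, max_trim):
    """Return True if selective omission ends in a false positive."""
    # Both groups come from the same population, so there is no true effect.
    control = [rng.gauss(0, 1) for _ in range(n)]
    treatment = sorted(rng.gauss(0, 1) for _ in range(n))
    # Assumed rule: trim the treatment tail that opposes the observed direction.
    drop_low = sum(treatment) >= sum(control)
    max_drops = min(int(max_trim * n), n - 2)
    for drops in range(max_drops + 1):  # drops == 0 is the untrimmed test
        kept = treatment[drops:] if drop_low else treatment[:n - drops]
        if p_value(control, kept) < ALPHA:
            return True  # stop as soon as significance appears
    return False


def type_i_error_rate(n_sims=1000, n=50, max_trim=0.1, seed=0):
    """Share of null simulations that end with p < ALPHA."""
    rng = random.Random(seed)
    return sum(one_trial(rng, n, max_trim) for _ in range(n_sims)) / n_sims


if __name__ == "__main__":
    print("no omission:       ", type_i_error_rate(max_trim=0.0))
    print("up to 10% trimmed: ", type_i_error_rate(max_trim=0.1))
```

Because trimming consumes no random numbers, calling `type_i_error_rate` with the same `seed` and different `max_trim` values compares the two rules on identical samples, so the trimmed rate can never fall below the untrimmed one in this sketch.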

publication date

  • December 2015