Modeling Site-Specific Nucleotide Biases Affecting Himar1 Transposon Insertion Frequencies in TnSeq Data Sets
Academic Article


  • TnSeq is a widely used methodology for determining gene essentiality, conditional fitness, and genetic interactions in bacteria. The Himar1 transposon is restricted to insertions at TA dinucleotides, but otherwise, few site-specific biases have been identified. As a result, most analytical approaches assume that insertions are expected to be randomly distributed among TA sites in nonessential regions. However, through analysis of Himar1 transposon libraries in Mycobacterium tuberculosis, we demonstrate that there are site-specific biases that affect the frequency of insertion of the Himar1 transposon at different TA sites. We use machine learning and statistical models to characterize patterns in the nucleotides surrounding TA sites that correlate with high or low insertion counts. We then develop a quantitative model based on these patterns that can be used to predict the expected counts at each TA site based on nucleotide context, which can explain up to half of the variance in insertion counts. We show that these insertion preferences exist in Himar1 TnSeq data sets from other mycobacterial and nonmycobacterial species. We present an improved method for identification of essential genes, called TTN-Fitness, that can better distinguish true biological fitness effects by comparing observed counts to expected counts based on our site-specific model of insertion preferences. Compared to previous essentiality methods, TTN-Fitness can make finer distinctions among genes whose disruption causes a fitness defect (or advantage), separating them out from the large pool of nonessentials, and is able to classify many smaller genes (with few TA sites) that were previously characterized as uncertain. 
IMPORTANCE When using the Himar1 transposon to create transposon insertion mutant libraries, it is known that the transposon is restricted to insertions at TA dinucleotide sites throughout the genome, and the absence of insertions is used to infer which genes are essential (or conditionally essential) in a bacterial organism. It is widely assumed that insertions in nonessential regions are otherwise random, and this assumption is used as the basis of several methods for statistical analysis of TnSeq data. In this paper, we show that the nucleotide sequence surrounding TA sites influences the magnitude of insertion counts, and these Himar1 insertion preferences (sequence biases) can partially explain why some sites have higher counts than others. We use a predictive model based on these preferences to make improved estimates of the fitness effects of genes, which helps make finer distinctions among the phenotypes and biological consequences of disrupting nonessential genes.
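To make the idea of "nucleotide context around a TA site" concrete, here is a minimal Python sketch. It is not the TTN-Fitness model or the authors' code. The function names, the 4-nucleotide flank width, and the simple per-position averaging are all illustrative assumptions. The sketch locates TA sites, extracts the flanking nucleotides, one-hot encodes them (the kind of feature vector a regression could consume), and computes how far the mean log count for each (position, nucleotide) pair deviates from the overall mean.

```python
from collections import defaultdict
from math import log10


def ta_sites(genome):
    """Return 0-based start positions of every TA dinucleotide."""
    return [i for i in range(len(genome) - 1) if genome[i:i + 2] == "TA"]


def flanking(genome, pos, flank=4):
    """Return `flank` nucleotides on each side of the TA at `pos`.

    The TA itself is omitted. Returns None when the window would run
    off either end of the sequence. The flank width is an assumption.
    """
    start, end = pos - flank, pos + 2 + flank
    if start < 0 or end > len(genome):
        return None
    return genome[start:pos] + genome[pos + 2:end]


def one_hot(window):
    """Encode a window as 4 indicator values (A, C, G, T) per nucleotide."""
    return [1 if base == nt else 0 for base in window for nt in "ACGT"]


def marginal_effects(genome, counts, flank=4):
    """Average deviation of log10(count + 1) per (offset, nucleotide).

    `counts` maps TA-site position -> observed insertion count.
    This is plain per-position averaging, not a jointly fitted
    regression, so correlated positions are not disentangled.
    """
    observed = []
    for pos in ta_sites(genome):
        window = flanking(genome, pos, flank)
        if window is not None and pos in counts:
            observed.append((window, log10(counts[pos] + 1)))
    mean = sum(value for _, value in observed) / len(observed)

    sums, sizes = defaultdict(float), defaultdict(int)
    for window, value in observed:
        for offset, base in enumerate(window):
            sums[(offset, base)] += value - mean
            sizes[(offset, base)] += 1
    effects = {key: sums[key] / sizes[key] for key in sums}
    return mean, effects
```

A positive effect for a given (offset, nucleotide) pair means sites with that nucleotide at that offset tended to have higher counts than average in the toy data. Comparing observed counts against context-based expectations of this kind is the general idea the abstract describes, although the published model is more sophisticated than this sketch.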

author list (cited authors)

  • Choudhery, S., Brown, A. J., Akusobi, C., Rubin, E. J., Sassetti, C. M., & Ioerger, T. R.

publication date

  • January 1, 2021 11:11 AM