The high-throughput gene prediction of more than 1,700 eukaryote genomes using the software package EukMetaSanity
Institutional Repository Document

abstract

  • Gene prediction and annotation for eukaryotic genomes are challenging, with large data demands and complex computational requirements. For most eukaryotes, genomes are recovered from specific target taxa. However, it is now feasible to reconstruct or sequence hundreds of metagenome-assembled genomes (MAGs) or single-amplified genomes directly from the environment. To meet this forthcoming wave of eukaryotic genome generation, we introduce EukMetaSanity, which combines state-of-the-art tools into three pipelines specifically designed for extensive parallelization on high-performance computing infrastructure. EukMetaSanity performs an automated taxonomy search against a protein database of 1,482 species to identify phylogenetically compatible proteins to be used in downstream gene prediction. We present the results for intron, exon, and gene locus prediction for 112 genomes collected from NCBI, including fungi, plants, and animals, along with 1,669 MAGs, and demonstrate that EukMetaSanity can provide reliable preliminary gene predictions for a single target taxon or at scale for hundreds of MAGs. EukMetaSanity is freely available at https://github.com/cjneely10/EukMetaSanity.

author list (cited authors)

  • Neely, C. J., Hu, S. K., Alexander, H., & Tully, B. J.

complete list of authors

  • Neely, Christopher J; Hu, Sarah K; Alexander, Harriet; Tully, Benjamin J

Book Title

  • bioRxiv