Optimal reference selection for genome assembly using the minimum description length principle


  • Background and Objectives: Reference-assisted assembly uses a reference sequence as a model to guide the assembly of a novel genome. The standard method for choosing the best reference counts the reads of the novel genome that align to each candidate reference sequence and selects the candidate with the highest count. This work explores the use of the minimum description length (MDL) principle and its variants in identifying the optimal reference sequence for genome assembly. Methods: MDL is relevant because genome assembly is an inference problem: the task is to infer the novel genome from the read data obtained by sequencing. MDL identifies the model that best describes the data; within the comparative assembly framework, this means finding the reference sequence that best describes the set of reads. Three variants of MDL are examined for selecting the optimal reference sequence for comparative assembly: two-part MDL, sophisticated MDL, and minimax regret. Results: The proposed scheme, based on sophisticated MDL, is shown to work for four types of mutation: SNPs, insertions, inversions, and deletions. It chooses the reference sequence with the fewest SNPs, insertions, and deletions, and it detects and rectifies all inversions. Conclusions: The MDL scheme was compared with the standard method of counting the reads that align to the reference sequence. A high count of aligned reads is a necessary condition for the optimal reference sequence, but not a sufficient one. The proposed MDL scheme therefore subsumes the standard method by defining it in inverted form, as counting the reads that do not align to the reference sequence.
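The contrast drawn in the abstract between read counting and a description-length criterion can be illustrated with a toy two-part code. The sketch below is not the authors' scheme: the cost model (2 bits per base, a flag bit per read, a position, and a list of substitutions), the ungapped `best_hit` aligner, the `max_mismatches` threshold, and all sequences are assumptions chosen only to show how two references can tie on the number of aligned reads while differing in total description length.

```python
import math

def best_hit(read, ref):
    """Best ungapped placement of read on ref: (mismatches, position)."""
    best = (len(read) + 1, -1)
    for pos in range(len(ref) - len(read) + 1):
        window = ref[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm < best[0]:
            best = (mm, pos)
    return best

def description_length(reads, ref, max_mismatches=2):
    """Toy two-part code length in bits, L(ref) + L(reads | ref).

    Hypothetical cost model, not taken from the paper.
    Returns (total_bits, number_of_unaligned_reads).
    """
    model_bits = 2.0 * len(ref)          # reference stored at 2 bits per base
    data_bits = 0.0
    unaligned = 0
    for read in reads:
        data_bits += 1.0                 # flag: aligned or stored raw
        mm, pos = best_hit(read, ref)
        if pos >= 0 and mm <= max_mismatches:
            n_positions = len(ref) - len(read) + 1
            data_bits += math.log2(n_positions)          # where the read sits
            data_bits += math.log2(max_mismatches + 1)   # how many substitutions
            # each substitution: offset within the read plus the new base
            data_bits += mm * (math.log2(len(read)) + math.log2(3))
        else:
            unaligned += 1
            data_bits += 2.0 * len(read)  # unaligned read stored verbatim
    return model_bits + data_bits, unaligned

def select_reference(reads, refs):
    """Name of the candidate reference with the shortest description length."""
    return min(refs, key=lambda name: description_length(reads, refs[name])[0])

# Made-up example: "near" differs from the novel genome by 1 SNP, "mid" by 4.
novel = "ACGTTGCAAGGCTTACCGATGGCATTCAGG"
refs = {
    "near": "ACGTTGCAAGGCATACCGATGGCATTCAGG",
    "mid":  "ACCTTGCAAGGCATACCGATGGGATTCTGG",
}
reads = [novel[i:i + 10] for i in range(0, 21, 5)]  # five overlapping 10-mers

for name, ref in refs.items():
    bits, unaligned = description_length(reads, ref)
    print(name, "unaligned reads:", unaligned, "bits:", round(bits, 1))
print("selected:", select_reference(reads, refs))
```

In this example every read aligns to both candidates, so counting aligned (or unaligned) reads cannot separate them, while the description length is shorter for the candidate that needs fewer substitutions to explain the reads. Insertions, deletions, and inversions are not modelled here.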

published proceedings

  • Qatar Foundation Annual Research Forum Volume 2012 Issue 1

author list (cited authors)

  • Wajid, B., Serpedin, E., Qaraqe, M., Nounou, H., Nounou, M., Chouchane, L., & Mohamed, N.

citation count

  • 1

complete list of authors

  • Wajid, Bilal; Serpedin, Erchin; Qaraqe, Marwa; Nounou, Hazem; Nounou, Mohamed; Chouchane, Lotfi; Mohamed, Nady

publication date

  • January 2012