Optimal reference selection for genome assembly using the minimum description length principle

abstract

Background and Objectives: Reference-assisted assembly uses a reference sequence as a model to assist in the assembly of a novel genome. The standard method for identifying the best reference sequence counts the number of reads from the novel genome that align to each candidate reference and chooses the reference with the highest count. This work explores the use of the minimum description length (MDL) principle and its variants, including two-part MDL and sophisticated MDL, in identifying the optimal reference sequence for genome assembly. Methods: MDL is relevant to genome assembly because assembly is an inference problem: the task is to infer the novel genome from the read data obtained by sequencing. MDL seeks the model that best describes the data; within the comparative assembly framework, this means finding the reference sequence that best describes the set of reads. This work explores the potential of three variants of MDL (two-part MDL, sophisticated MDL and minimax regret) for selecting the optimal reference sequence for comparative assembly. Results: The proposed scheme, based on sophisticated MDL, has been shown to work successfully for four types of mutations: SNPs, insertions, inversions and deletions. It chooses the reference sequence with the smaller number of SNPs, insertions and deletions, and it is able to detect all inversions and rectify them. Conclusions: The work compared the MDL scheme with the standard method of counting the number of reads that align to the reference sequence and found that, although a high count of aligned reads is a necessary condition for the optimal reference sequence, it is not a sufficient one. The proposed MDL scheme therefore encompasses the standard method by defining it in inverted form, as counting the number of reads that do not align to the reference sequence.

published proceedings

Qatar Foundation Annual Research Forum Volume 2012 Issue 1

author list (cited authors)

Wajid, B., Serpedin, E., Qaraqe, M., Nounou, H., Nounou, M., Chouchane, L., & Mohamed, N.

citation count

1

complete list of authors

Wajid, Bilal; Serpedin, Erchin; Qaraqe, Marwa; Nounou, Hazem; Nounou, Mohamed; Chouchane, Lotfi; Mohamed, Nady

publication date

January 2012

publisher

Hamad bin Khalifa University Press (HBKU Press)

published in

Qatar Foundation Annual Research Forum Proceedings Journal
