Minimum Description Length Based Selection of Reference Sequences for Comparative Assemblers Conference Paper

abstract

  • Genome sequences are the most basic, yet most essential, pieces of data in all biological analysis. A genome sequence is the solution to the genome assembly problem, which reconstructs the entire sequence from a set of reads that are unordered and very short. The genome assembly problem is therefore quite complex and is broadly divided into de novo and comparative assembly. Comparative assembly uses a reference sequence, closely related to the unassembled genome, to determine the relative order of the reads with respect to one another, and then joins them together to form the sequence. This paper explores all variants of Minimum Description Length (MDL) to find the best reference sequence for comparative assembly. The paper examined two-part MDL, Sophisticated MDL, and MiniMax Regret, and found that Sophisticated MDL performs better than two-part MDL; MiniMax Regret, owing to the nature of the problem, was unsuitable. The proposed scheme is prior-free and can be incorporated in the data preprocessing stage of all comparative assemblers, allowing the assembly process to make use of the best reference sequence available. © 2011 IEEE.
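
    The idea of scoring candidate references by description length can be illustrated with a minimal sketch of a two-part code. This is not the paper's actual formulation: the cost model (2 bits per reference base, ungapped best-placement matching, and a fixed per-mismatch cost) and the names `best_mismatches`, `two_part_mdl`, and `select_reference` are all assumptions chosen for illustration only.

```python
import math


def best_mismatches(read, ref):
    """Fewest mismatches over all ungapped placements of read on ref."""
    n, m = len(ref), len(read)
    if m > n:
        raise ValueError("read is longer than the reference")
    return min(
        sum(a != b for a, b in zip(read, ref[i:i + m]))
        for i in range(n - m + 1)
    )


def two_part_mdl(reads, ref):
    """Toy two-part code length in bits: L(ref) + L(reads | ref).

    Assumed cost model (hypothetical, not taken from the paper):
      - the reference costs 2 bits per base over {A, C, G, T};
      - each read costs its placement, its mismatch count, and for
        every mismatch an offset within the read plus a substitute base.
    """
    model_bits = 2 * len(ref)
    data_bits = 0.0
    for read in reads:
        m = len(read)
        k = best_mismatches(read, ref)
        placements = len(ref) - m + 1
        data_bits += math.log2(placements)   # where the read sits
        data_bits += math.log2(m + 1)        # mismatch count, 0..m
        data_bits += k * (math.log2(m) + math.log2(3))  # offset + base
    return model_bits + data_bits


def select_reference(reads, candidates):
    """Return the name of the candidate with the shortest total code."""
    return min(candidates, key=lambda name: two_part_mdl(reads, candidates[name]))
```

    Under these assumptions, a reference that explains the reads with fewer mismatches yields a shorter total code and is preferred; a longer reference pays a higher model cost, so it must explain the reads correspondingly better to be selected.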

name of conference

  • 2011 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS)

published proceedings

  • 2011 IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS)

author list (cited authors)

  • Wajid, B., & Serpedin, E.

citation count

  • 5

complete list of authors

  • Wajid, Bilal; Serpedin, Erchin

publication date

  • December 2011