Chin, See Loong (2003-12). Incomplete gene structure prediction with almost 100% specificity. Master's Thesis.
The goals of gene prediction using computational approaches are to determine gene location and the corresponding functionality of the coding region. A subset of gene prediction is the gene structure prediction problem, which is to define the exon-intron boundaries of a gene. Gene prediction follows two general approaches: statistical patterns identification and sequence similarity comparison. Similarity based approaches have gained increasing popularity with the recent vast increase in genomic data in GenBank. The proposed gene prediction algorithm is a similarity based algorithm which capitalizes on the fact that similar sequences bear similar functions. The proposed algorithm, like most other similarity based algorithms, is based on dynamic programming. Given a genomic DNA, X = x1 xn and a closely related cDNA, Y = y1 yn, these sequences are aligned with matching pairs stored in a data set. These indexes of matching sets contain a large jumble of all matching pairs, with a lot of cross over indexes. Dynamic programming alignment is again used to retrieve the longest common non-crossing subsequence from the collection of matching fragments in the data set. This algorithm was implemented in Java on the Unix platform. Statistical comparisons were made against other software programs in the field. Statistical evaluation at both the DNA and exonic level were made against Est2genome, Sim4, Spidey, and Fgenesh-C. The proposed gene structure prediction algorithm, by far, has the best performance in the specificity category. The resulting specificity was greater than 98%. The proposed algorithm also has on par results in terms of sensitivity and correlation coeffcient. The goal of developing an algorithm to predict exonic regions with a very high level of correctness was achieved.