Integrating sample-driven and pattern-driven approaches in motif finding
- Additional Document Info
- View All
The problem of finding conserved motifs given a set of DNA sequences is among the most fundamental problems in computational biology, with important applications in locating regulatory sites from co-expressed genes. Traditionally, two classes of approaches are used to address the problem: sample-driven approaches focus on finding the locations of the motif instances directly, while pattern-driven approaches focus on finding a consensus string or a profile directly to model the motif. We propose an integrated approach by formulating the motif finding problem as the problem of finding large cliques in k-partite graphs, with the additional requirement that there exists a string s (which may not appear in the given sample) that is close to every motif instance included in such a clique. In this formulation, each clique represents the locations of the motif instances, while the existence of string s ensures that these instances are derived from a common motif pattern. The combined approach provides a better formulation to model motifs than using cliques alone, and the use of k-partite graphs allows the development of a fast and exact divide-and-conquer approach to handle the cases when almost every sequence contains a motif instance. Computational experiments show that this approach is feasible even on the most difficult motif finding problems of moderate size. When many sequences do not contain a motif instance, we complement the above approach by an optimized branch-and-bound algorithm that is much faster than standard clique finding approaches. We will discuss how to further generalize the formulation to better model biological reality. © Springer-Verlag 2004.
author list (cited authors)
Sze, S. H., Lu, S. J., & Chen, J. E.