Gujjula, Krishna Reddy (2018-08). Map2Peak: A Novel Perspective on ChIP-Seq Data Analysis Pipeline. Doctoral Dissertation.

abstract

  • The research in this dissertation focuses on developing a novel methodology for ChIP-Seq dataset analysis. Despite its advances, the standard ChIP-Seq data analysis pipeline, i.e., read mapping followed by peak calling, has the following shortcomings: 1. The majority of a ChIP-Seq dataset consists of background reads, so computational effort is wasted on mapping reads that play no role in forming the true peaks. 2. Computational effort is also wasted on aligning control reads that do not map to ChIP-enriched genomic regions. 3. Multi-mappable reads are often discarded during read mapping, which reduces the power to identify peaks in repeat elements of the genome. We present Map2Peak, a novel tool aimed at mitigating these drawbacks. Map2Peak takes unmapped ChIP-Seq and control reads as input and outputs peaks at twice the speed of the standard workflow. Map2Peak intertwines partial read mapping and peak calling in a five-phase algorithm. It models the fragment count information obtained during the early stages of ChIP read mapping (Phase 1) as a 2-component Poisson mixture model and then applies the expectation-maximization algorithm to identify ChIP-enriched regions (Phase 2). The remaining ChIP reads and the majority of control reads are then restricted to map exactly only to the much shorter pseudo-genome composed of the ChIP-enriched regions (Phases 3 and 4). The mapping information is then used to call peaks on the pseudo-genome (Phase 5). Our results show that the peaks called by Map2Peak encompass most of the peaks called by the standard workflow (88%-96%), along with some novel motif-justifiable peaks that the standard workflow does not detect, and that the majority (90%) of the background reads are discarded. Moreover, Map2Peak implicitly resolves the alignment location of some multi-mappable reads, which increases the power to call peaks in repeat elements of the genome.
Map2Peak provides researchers with an ultrafast peak caller that uses the whole ChIP-Seq dataset, without discarding multi-mappable reads, to identify peaks and makes efficient use of control datasets for peak calling. Map2Peak is available at https://kianfar.engr.tamu.edu/map2peak/.
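
To illustrate the Phase 2 idea described in the abstract, the following is a minimal sketch of fitting a 2-component Poisson mixture with expectation-maximization. It is not taken from Map2Peak itself: the function names, the initialization, the per-window count input, and the 0.5 posterior cutoff are all assumptions made for illustration, and the dissertation's actual procedure may differ.

```python
import math


def poisson_logpmf(k, lam):
    """Log of the Poisson probability mass function at count k with rate lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def fit_poisson_mixture(counts, n_iter=200, tol=1e-9):
    """Fit a background/enriched Poisson mixture to fragment counts with EM.

    `counts` is a hypothetical list of fragment counts, one per genomic window.
    Returns (pi_enriched, lam_background, lam_enriched, posteriors), where
    posteriors[i] is the estimated probability that window i is enriched.
    """
    n = len(counts)
    mean = sum(counts) / n
    # Assumed initialization: background below the mean, enriched above it.
    lam_bg = max(0.5 * mean, 1e-3)
    lam_en = max(2.0 * mean, 1e-2)
    pi_en = 0.1
    prev_ll = -math.inf
    post = [0.0] * n

    for _ in range(n_iter):
        # E-step: posterior probability of the enriched component per window,
        # computed in log space for numerical stability.
        ll = 0.0
        for i, k in enumerate(counts):
            a = math.log(1.0 - pi_en) + poisson_logpmf(k, lam_bg)
            b = math.log(pi_en) + poisson_logpmf(k, lam_en)
            m = max(a, b)
            log_norm = m + math.log(math.exp(a - m) + math.exp(b - m))
            post[i] = math.exp(b - log_norm)
            ll += log_norm

        # M-step: update the mixing weight and both Poisson rates.
        w = sum(post)
        pi_en = min(max(w / n, 1e-6), 1.0 - 1e-6)
        lam_en = max(sum(p * k for p, k in zip(post, counts)) / max(w, 1e-12), 1e-6)
        lam_bg = max(
            sum((1.0 - p) * k for p, k in zip(post, counts)) / max(n - w, 1e-12),
            1e-6,
        )

        # Stop once the log-likelihood no longer improves.
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll

    return pi_en, lam_bg, lam_en, post


def enriched_windows(counts, cutoff=0.5):
    """Indices of windows whose enriched posterior exceeds an assumed cutoff."""
    _, _, _, post = fit_poisson_mixture(counts)
    return [i for i, p in enumerate(post) if p > cutoff]
```

Calling `enriched_windows` on a list of per-window counts returns the indices of windows that the mixture assigns to the higher-rate component. Under the workflow the abstract describes, regions of this kind would form the pseudo-genome used in the later phases.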

publication date

  • August 2018