Wu, Di (2018-12). Structured Sparsity Learning for Coevolution-Based Protein Contact Prediction. Master's Thesis.

abstract

  • Residue coevolution refers to the biological assumption that residue pairs covary during evolution if they form a contact within a protein or across a protein-protein interface. Under this assumption, such covariance can be used to predict residue contacts within or between protein sequences. The increasing availability of protein sequence data allows for wider applicability and also demands more accurate approaches. Current methods model sequence data with Markov random fields and use maximum likelihood estimation to infer residue contacts. They mainly target the accuracy of contact prediction, under the premise that more accurate 2D contact prediction yields a better 3D structure. This is correct but not the whole picture, since the patterns of predicted 2D contacts also have a significant impact on 3D structure reconstruction. For example, contacts between long-distance residue pairs generally help more than contacts between adjacent residue pairs. Moreover, current methods consistently produce predictions that concentrate in certain areas. To directly target 3D structure prediction, we introduce a new method that exploits more types of data, such as secondary structure data and fold type information, to characterize the desired sparsity patterns of contact prediction in a biologically meaningful way. It then uses multiple structured sparsity regularization models, including group LASSO and group dispersive sparsity, to enforce such sparsity patterns. By explicitly accounting for and promoting structured sparsity, the method improves 3D structure prediction.

publication date

  • December 2018