Marten van Dijk Consultant, Inventor, Researcher, Applied Mathematician, & Computer Scientist
Protein Folding:
Accurately predicting the secondary structure of proteins from their amino acid sequence is an important
problem. The identification of basic secondary structure elements -- alpha helices, beta strands, and
coils -- is a critical prerequisite for many tertiary structure predictors, which consider the complete
three-dimensional protein structure. A broad array of approaches to secondary structure prediction exists,
including statistical techniques, neural networks, Hidden Markov Models (HMMs), Support Vector Machines
(SVMs), nearest neighbor methods, and energy minimization. Neural networks are among the most popular
methods in use today, delivering a pointwise prediction accuracy (Q_3) of about 77% and a segment overlap
measure (SOV) of about 74%.
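As a concrete illustration of the pointwise measure, Q_3 is the fraction of residues whose predicted state
(helix, strand, or coil) matches the observed state. The sketch below assumes one-letter state codes (H, E, C)
and uses two made-up example strings; the function name q3 is ours, not from [1,2].

```python
def q3(predicted: str, observed: str) -> float:
    """Return the fraction of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    # Count residue positions where both strings carry the same state letter.
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical 16-residue example (H = helix, E = strand, C = coil).
observed = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHHCCCEEECCC"
print(f"Q3 = {q3(predicted, observed):.3f}")
```

The SOV measure is not shown here: it scores overlapping segments rather than individual residues and
needs a longer definition.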
To improve the long-term performance of secondary structure prediction, it will likely be necessary to
develop a cost model that mirrors the underlying biological constraints. While neural networks offer good
performance today, their operation is largely opaque. Often containing up to 10,000 parameters and relying
on complex layers of non-linear perceptrons, neural networks offer little insight into the patterns learned.
Moreover, they mask the shortcomings of the underlying models, making it tedious and ad hoc to improve
them. The largest improvements in neural network prediction accuracy have come from the integration of
homologous sequence alignments rather than from specific changes to the underlying cost model.
Of the approaches developed to date, HMMs offer perhaps the most natural representation of protein
secondary structure. An HMM consists of a finite set of states with learned transition probabilities
between states. In biological terms, each transition corresponds to a local folding event, with the most
likely sequence of states corresponding to the lowest-energy protein structure. HMMs generally contain
hundreds of parameters, one to two orders of magnitude fewer than neural networks. In addition to
providing a tractable model that can be reasoned about, the reduction in parameters lessens the risk of
overfitting. However, the leading HMM methods to date have not exceeded a Q_3 value of 75%,
and SOV scores are often unreported.
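The "most likely sequence of states" of an HMM is typically found with the Viterbi algorithm. The sketch
below is a toy two-state (helix/coil) model over a two-letter alphabet, not the model of [1,2]: the state
names, the alphabet (A for alanine, G for glycine), and all probabilities are made-up assumptions chosen
only to show the mechanics. Working with log-probabilities makes the path score additive, which is what
allows it to be read as a negated energy.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the highest-scoring state path for the observation string `obs`."""
    # best[s] = best log-score of any path that ends in state s at the current position
    best = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = predecessor of state s on the best path at position t + 1
    for symbol in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            # Choose the predecessor that maximizes the score of reaching s.
            p = max(states, key=lambda r: prev[r] + log_trans[r][s])
            best[s] = prev[p] + log_trans[p][s] + log_emit[s][symbol]
            pointers[s] = p
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


# Toy parameters (assumed for illustration only).
states = "HC"
log_start = {"H": math.log(0.5), "C": math.log(0.5)}
log_trans = {
    "H": {"H": math.log(0.9), "C": math.log(0.1)},
    "C": {"H": math.log(0.1), "C": math.log(0.9)},
}
log_emit = {
    "H": {"A": math.log(0.8), "G": math.log(0.2)},
    "C": {"A": math.log(0.3), "G": math.log(0.7)},
}

print(viterbi("AAAAGGGG", states, log_start, log_trans, log_emit))
```

The running time is linear in the sequence length and quadratic in the number of states, so decoding
stays cheap for models with only a handful of states.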
In [1,2], we focus on improving the prediction accuracy of HMM-based methods, thereby advancing the goal
of achieving a state-of-the-art predictor while maintaining an intuitive and biophysically motivated cost
model. Our technique relies on Hidden Markov Support Vector Machines (HM-SVMs), a recent innovation
in the field of machine learning. While HM-SVMs share the prediction structure of HMMs, the learning
algorithm is more powerful. Unlike the expectation-maximization algorithms typically used to train HMMs,
training with an SVM allows for a discriminative learning function, a soft margin criterion, and
bi-directional influence of features on parameters.
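For orientation, the soft margin criterion can be written as the generic structured SVM training problem
below. This is the standard textbook form, given here as an assumption; it is not necessarily the exact
variant used in [1,2]. The symbols are our notation: x_i is a training sequence, y_i its true state
labeling, \Phi(x, y) a feature vector counting the transitions and emissions used by labeling y, w the
parameter vector, \xi_i the slack variables, and C the trade-off constant.

```latex
\[
\begin{aligned}
\min_{w,\;\xi}\quad & \tfrac{1}{2}\lVert w\rVert^{2} + \frac{C}{n}\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad & w^{\top}\Phi(x_i, y_i) - w^{\top}\Phi(x_i, y) \;\ge\; 1 - \xi_i
  \qquad \forall i,\ \forall y \ne y_i, \\
& \xi_i \ge 0 \qquad \forall i .
\end{aligned}
\]
```

Each constraint asks the true labeling to outscore an incorrect one by a margin, which is what makes the
training discriminative; the slack variables allow some constraints to be violated at a cost. Prediction
still uses the HMM structure, since the labeling that maximizes w^T \Phi(x, y) is the most likely state path.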
Using the HM-SVM approach, we develop a simple 7-state HMM for predicting alpha helices and coils. The
HMM contains 302 parameters, representing the energetic benefit of each residue being in the middle of a
helix or in a specific position relative to the N- or C-cap. Our technique does not depend on any
homologous sequence alignments. Applied to a database of all-alpha proteins, our predictor achieves a
Q_alpha value of 77.6% and an SOV_alpha score of 73.4%. These performance numbers are among the best
for techniques that do not rely on multiple sequence alignments.
[1] B. Gassend, C.W. O'Donnell, W. Thies, A. Lee, M. van Dijk, and S. Devadas, Predicting secondary
structure of all-helical proteins using hidden Markov support vector machines, PRIB 2006, pp. 93-104, 2006.
[2] B. Gassend, C.W. O'Donnell, G.E. Suh, W. Thies, A. Lee, M. van Dijk, and S. Devadas, Learning
biophysically-motivated parameters for alpha helix prediction, BMC Bioinformatics 8(5), p. S3, 2007.
Poster at 10th Annual International Conference on Research in Computational Molecular Biology
(RECOMB 2006), 2006.