Superior
protein structure prediction
Software
RAPTOR
Benchmarks: Lindahl's and Fischer et al.'s
Fischer et al.'s benchmark set consists of 68 target sequences and 301 templates.
RAPTOR ranks the correct template first for 56 of the 68 targets, a top-1
recognition rate of about 82%. The fold recognition performance of RAPTOR was
further tested on Lindahl's benchmark set, which consists of 976 protein
sequences. Threading them all against all gives 976 × 975 threading pairs. We
measured RAPTOR's performance at three similarity levels: fold, superfamily and
family. Results are shown in Table 1 (data for the other methods are taken from
Shi et al.'s paper). Prediction correctness is assessed based on the SCOP
classification.
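The two counts quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function names are illustrative and are not part of RAPTOR, and the pair count assumes that self-pairs are excluded, as the product 976 × 975 suggests.

```python
# Counts quoted in the text above.
FISCHER_TARGETS = 68
FISCHER_TOP1 = 56
LINDAHL_SEQUENCES = 976


def top1_rate(correct: int, total: int) -> float:
    """Fraction of targets whose correct template is ranked first."""
    return correct / total


def all_against_all_pairs(n: int) -> int:
    """Ordered (query, template) pairs among n sequences, excluding self-pairs."""
    return n * (n - 1)


print(f"Fischer top-1 rate: {top1_rate(FISCHER_TOP1, FISCHER_TARGETS):.1%}")
print(f"Lindahl threading pairs: {all_against_all_pairs(LINDAHL_SEQUENCES):,}")
```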
Table 1. The performance of RAPTOR at three different similarity levels
As shown in Table 1, RAPTOR performs better than other methods at all
similarity levels (especially the fold level). At the family level, RAPTOR's
recognition performance is comparable to that of FUGUE, the best method at the
family and superfamily levels other than RAPTOR. We may conclude that a
strict treatment of pairwise interactions is necessary for fold and
superfamily level recognition. At the family level, sequence (or profile)
alignment alone can attain satisfactory results.
Specific Examples
We now present several structure prediction examples generated by RAPTOR
in CAFASP3 and LiveBench6. The experimental structures of most CAFASP3
targets have not yet been released for publication, so we also chose some
targets from LiveBench.
Figure 2 (taken from CAFASP3's website, generated by RasMol and MaxSub)
presents the superimposition of the experimental structure (grey) and
RAPTOR's predicted structure (black) of T0136_1. According to MaxSub's
evaluation, 17 of 54 servers generated correct fold recognitions for this
target, and RAPTOR produced the best alignment among them. MaxSub could
superimpose a segment of 118 residues (the sequence length is 144) of the
predicted structure onto the experimental structure with an RMSD of a
mere 1.9 Å.
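For readers unfamiliar with the measure, the sketch below shows how an RMSD over paired residue coordinates is computed. It assumes the two structures have already been superimposed. MaxSub itself also searches for the largest superimposable subset of residues and performs the superposition, and neither step is reproduced here. The coordinates are made-up toy values, not data from T0136_1.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired (x, y, z) coordinates.

    Assumes both lists are already superimposed and in the same residue order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy example (hypothetical coordinates): every atom is displaced by 1 unit in y.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0), (7.6, 1.0, 0.0)]
print(rmsd(experimental, predicted))

# Fraction of the target covered by the superimposed segment quoted in the text.
coverage = 118 / 144
print(f"{coverage:.1%}")
```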
Fig. 2. The superimposition of the experimental structure (grey) and the
predicted structure (black) of CAFASP3 target T0136_1.
The following two figures were generated with RasMol from the evaluation
results of LiveBench6. Figure 3 shows an almost perfect prediction for
target 1ll8A. The alignment accuracy score measured by MaxSub is more
than 9 on a 10-point scale. Figure 4 presents a good structure prediction
for target 1j53A, with an alignment accuracy score of more than 6.
Given the length of the target sequence, this is a very successful
prediction.
Fig. 3. The experimental structure (left) and the predicted structure
(right) of 1kvzA.
Fig. 4. The experimental structure (left) and the predicted structure
(right) of 1j53A.
Computing Efficiency Issues
A key advantage of our algorithm is that the memory requirement is
only about O(|E| n^2), where E is the edge set of the contact graph of
a protein template structure and n is the query sequence length. The
observed memory usage is 100 to 200 MB for most threading pairs. In
practice, the computing time does not increase exponentially with
target sequence size. Figure 5 shows the CPU time of threading 100
sequences (chosen randomly from Lindahl's benchmark), with lengths
ranging from 25 to 572, against a typical template, 119l, of length 162
(CPU time was measured on a single 400 MHz MIPS R12000 CPU of a
Silicon Graphics Origin 3800 system with 20 GB of RAM). The computing
time of our algorithm increases very slowly with sequence size. In
fact, we found that for real protein data, our relaxed linear programs
directly output integral solutions 99% of the time and generated only
a few branch nodes when the solution was fractional.
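To make the O(|E| n^2) bound concrete, the sketch below evaluates the product |E| · n^2 for given inputs. It illustrates the scaling only and is not RAPTOR's actual memory accounting: the function name, the edge count of 30 and the figure of 8 bytes per term are hypothetical values chosen for illustration.

```python
def pairwise_term_count(num_contact_edges: int, query_length: int) -> int:
    """Evaluate |E| * n^2, the quantity the memory bound grows with."""
    return num_contact_edges * query_length ** 2


# Hypothetical inputs: the edge count and bytes per term are assumptions,
# not values reported for RAPTOR or for template 119l.
HYPOTHETICAL_EDGES = 30
HYPOTHETICAL_BYTES_PER_TERM = 8

for n in (25, 162, 572):  # query lengths mentioned in the text
    terms = pairwise_term_count(HYPOTHETICAL_EDGES, n)
    megabytes = terms * HYPOTHETICAL_BYTES_PER_TERM / 1e6
    print(f"n={n:4d}  terms={terms:>10,}  approx. {megabytes:.1f} MB")
```

Under this bound, doubling the query length quadruples the term count, while adding contact edges to the template increases it only linearly.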
Figure 6 shows the CPU time used for the prediction of each
CAFASP3/CASP5 target sequence. There were 62 targets in total and 3236
protein templates in our template database. CPU time increased very
slowly with sequence size, except for one target (t0174), which took
about 45 hours. After careful inspection, we found that there were 30
templates, each of which took about 15 hours of threading time. These
templates warrant further examination.
Conclusions
In this paper, we have presented performance benchmarks of the
software package RAPTOR, which adopts a novel integer programming
approach to treat pairwise interactions rigorously in protein
threading. Experimental results show that RAPTOR performs very well in
terms of alignment accuracy and fold recognition for FR targets. As
for computational efficiency, RAPTOR is also much better than
algorithms that treat the pairwise potentials strictly when dealing
with templates with complex interaction topology and long sequences.
Fig. 5. CPU time of threading 100 sequences to template 119l
(1s=0.01s).
Fig. 6. CPU time of threading 62 CAFASP3 target sequences to 3236
templates.
References
- J. Xu, M. Li, D. Kim and Y. Xu, RAPTOR: Optimal Protein Threading by
Linear Programming, Journal of Bioinformatics and Computational
Biology, Vol. 1, No. 1 (2003) 95-117
- J. Xu and M. Li, Assessing RAPTOR's New Linear Programming Approach
for Fold Recognition in CAFASP3, Proteins: Structure, Function, and
Genetics, 53(S6): 579-584. 2003
 