RAPTOR User Manual 4.0 for Linux

Introduction

What is Homology Modeling (HM)?

Suppose you know the amino acid sequence of a target protein and you want to know its three-dimensional (3D) structure, which has yet to be solved experimentally by X-ray crystallography or NMR. An underlying premise of homology modeling is that when a set of proteins are homologous, their 3D structures are more conserved than their sequences. The homology modeling method constructs the 3D structure for a target sequence by using the homologous proteins of the target.

General Procedures to Create Homologous Models

  • Homologue selection: Identify one or several homologous proteins from a structure database (e.g., PDB). Computer tools such as PSI-BLAST can be used for this step.
  • Sequence alignment: Build a multiple sequence alignment among the target sequence and the selected homologous sequences.
  • Core determination: Identify the most conserved segments (cores) and variable segments (loops) in the multiple sequence alignment.
  • Core modeling: Predict coordinates of core residues of the target sequence from those of the known structure(s).
  • Loop modeling: Predict conformations for the loops in the target sequence.
  • Side chain packing: Construct the side chain coordinates.
  • Refinement and evaluation: Measure the quality of the predicted structure using software.

Does Homology Modeling Always Work?

Given a target sequence, if no homologous proteins are found in the structure database, you cannot use homology modeling. In practice, when the sequence identity in the alignment is below 25%, the homology is insignificant and you cannot expect to obtain a good model from homology modeling.
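
The 25% rule of thumb above can be checked on a pairwise alignment with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, gaps are assumed to be written as "-", and identity is computed over columns where neither sequence has a gap (other definitions of sequence identity use a different denominator).

```python
def percent_identity(aligned_target: str, aligned_template: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == gap or b == gap:
            continue  # gap columns are not counted
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def homology_modeling_is_promising(identity_percent: float, threshold: float = 25.0) -> bool:
    """Apply the 25% rule of thumb quoted in the text (the threshold is adjustable)."""
    return identity_percent >= threshold
```

For an alignment that falls below the threshold, fold recognition may be the better choice.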

Why Fold Recognition (Protein Threading)?

Fold recognition is based on the observation that the number of distinct structures does not grow as fast as the PDB as a whole: 90% of the new structures submitted to the PDB in the past several years have structural folds similar to some structures already in the PDB. Currently, there are more than 1000 folds.
Protein threading predicts protein structures by using statistical knowledge of the relationship between the structure and the sequence. The prediction is made by “threading” each amino acid of the target sequence to a position in the template structure; evaluation is performed with respect to how well the target fits the template. After the best-fit template is selected, the model is built on the alignment with the chosen template.

Fold Recognition involves the following procedures:

  • Preparation
      • The construction of a structure template database: Select protein structures from the PDB as structural templates.
      • The design of a scoring function: Design a good scoring function to measure the fitness between a target sequence and a template. A good scoring function should consider: mutation potential, environment fitness potential, pair-wise potential, secondary structure compatibility and gap penalties. The quality of the scoring function is closely related to the prediction accuracy.

  • Given a Target Sequence
      • Threading alignment: Align the target sequence with each structure template by optimizing the designed scoring function. If there are ‘N’ structure templates in the database, there will be ‘N’ alignments after this step.
      • Ranking alignment: All the obtained alignments are ranked by using various measuring methods and the best alignment is identified.
      • Model building: Build the structural model from the selected alignment as homology modeling does, i.e. core determination, core modeling, loop modeling, side-chain packing.

Fold recognition is most effective for hard targets that homology modeling cannot handle.
In practice, when the sequence identity is below 25%, fold recognition can in many cases still give reasonably accurate predictions.
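
The ranking step in the procedure above can be illustrated with a toy scoring function. The sketch below is not RAPTOR's scoring function: the weights, the term values and the template names are all hypothetical, and it assumes that a lower total score means a better fit (the scoring function is minimized).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TermScores:
    """Score terms for one target-template alignment (values are made up)."""
    mutation: float
    environment_fitness: float
    pairwise: float
    secondary_structure: float
    gap_penalty: float


# Hypothetical weights; a real threading program tunes these carefully.
WEIGHTS: Dict[str, float] = {
    "mutation": 1.0,
    "environment_fitness": 1.0,
    "pairwise": 1.0,
    "secondary_structure": 1.0,
    "gap_penalty": 1.0,
}


def total_score(terms: TermScores, weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted sum of all score terms."""
    return sum(weight * getattr(terms, name) for name, weight in weights.items())


def rank_templates(alignments: Dict[str, TermScores]) -> List[str]:
    """Return template names ordered from best (lowest total score) to worst."""
    return sorted(alignments, key=lambda name: total_score(alignments[name]))
```

With 'N' templates in the database, `alignments` would hold 'N' entries, one per threading alignment, and the first name returned is the selected template.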

What is RAPTOR?

RAPTOR (RApid Protein Threading predictOR) is a protein threading software package developed by Dr. Jinbo Xu and Dr. Ming Li. It applies novel linear programming techniques to the protein threading problem and has achieved great success. RAPTOR minimizes the scoring function (i.e., seeks the optimal alignment between sequence and template) using integer programming. The scoring function used by RAPTOR rigorously takes the pair-wise contact potential into account. The threading problem is formulated as a large-scale integer programming problem, and RAPTOR can find a globally optimal alignment. As a result, RAPTOR can produce high-accuracy alignments and is most effective for hard targets.
RAPTOR has been consistently ranked in the top tier in recent CASP experiments (CASP5, CASP6, CASP7). In CASP5, RAPTOR was ranked number one, and the RAPTOR paper was voted the “most innovative paper” by peers in the research community.

Installation

Note that this installation guide is for Linux only.

Computing Requirements

To run RAPTOR, the PC must have at least 512 MB of memory. For time efficiency, multiple high-speed CPUs are preferred. The RAPTOR package will take up to 3 GB of space on the hard drive.

Required Files:

RAPTOR1.tar.gz        Executable and Template Library
Install_script1.sh    Install Script 1
RAPTOR2.tar.gz        RefSeq Database used by PSI-BLAST
Install_script2.sh    Install Script 2

How to Install RAPTOR

First create a temporary directory on your hard drive. Copy all the installation files to the temporary directory and enter that directory. You may need to run “chmod u+x *.sh” to make the two script files executable.
Then run the two install scripts in order (listed above as Install_script1.sh and Install_script2.sh). This will install RAPTOR in the specified directory.

If You Do Not Have RAPTOR2.tar.gz (or Want to Download REFSEQ or NR Database by Yourself)

PSI-BLAST is used internally by RAPTOR. The database searched by PSI-BLAST can be either NR or REFSEQ; REFSEQ is a representative subset of NR and about half its size. By default, RAPTOR comes with REFSEQ, which is compressed in RAPTOR2.tar.gz. Optionally, you can download REFSEQ or NR yourself and install it manually, which is quite straightforward.

To do so, install RAPTOR1.tar.gz first. Then download NR or REFSEQ yourself.
Here are the instructions for downloading the NR database:
  • Download nr.00.tar.gz and nr.01.tar.gz to a directory.
  • Uncompress them in that directory. You will obtain a set of files whose names start with “nr.00.” or “nr.01.”.
  • Move those files to RAPTOR/data/nr/.
  • Specify the NR database path in the configuration panel. For example, if the NR database is installed at /home/usr/RAPTOR/data/nr, set the “PSI-BLAST Database” field in the “Advanced” tab of the configuration panel to “/home/usr/RAPTOR/data/nr/nr”.
Note that you need to specify both the path and the file prefix for the NR database.

Alternatively, you can download the REFSEQ database, which is much smaller than NR.
Here are the instructions for downloading the REFSEQ database:
  • Download refseq_protein.tar.gz to a directory.
  • Uncompress the file. You will obtain a set of files whose names start with “refseq_protein”.
  • Move those files to RAPTOR/data/REFSEQ.
  • Specify the database path in the configuration panel. For example, if the REFSEQ database is installed in /home/usr/RAPTOR/data/REFSEQ/, set the “PSI-BLAST Database” field in the “Advanced” tab of the configuration panel to /home/usr/RAPTOR/data/REFSEQ/refseq_protein.
Note that you need to specify both the path and the file prefix for the REFSEQ database.
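
Because the field takes a path plus a file prefix rather than a plain directory, it is easy to get wrong. The following Python sketch builds the field value and lists the files that match it, so you can confirm the prefix before entering it in the configuration panel. The helper names are hypothetical and are not part of RAPTOR.

```python
from pathlib import Path
from typing import List


def psiblast_database_field(db_dir: str, prefix: str) -> str:
    """Join the database directory and file prefix, e.g. .../data/nr + nr."""
    return str(Path(db_dir) / prefix)


def database_files(field_value: str) -> List[str]:
    """List the files in the database directory whose names start with the prefix."""
    target = Path(field_value)
    if not target.parent.is_dir():
        return []
    return sorted(
        entry.name
        for entry in target.parent.iterdir()
        if entry.is_file() and entry.name.startswith(target.name)
    )
```

For example, `psiblast_database_field("/home/usr/RAPTOR/data/nr", "nr")` gives the value quoted above for the NR database. An empty list from `database_files` suggests that the directory or the prefix is wrong.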

Registration

When you run the install scripts, a registration window will pop up after the installation is finished; enter the key obtained from BSI to register. If you do not register during the installation, the registration window will pop up again before you run a protein sequence.

Organization of Directories (and Important files)

RAPTOR

bin\                  Binaries
Data\
Fssp\                 Template FSSP Files
Parameters\
fssp.list             Template List
RAPTOR.conf           Configuration File of RAPTOR
GuiProperties.conf    Configuration File of the GUI
Ip-files\             Parameter Files used in IP
nocore-files\         Parameter Files used in NoCore
nocore2-files\        Parameter Files used in NPCore
pdb\                  Template pdb Files
WEIGHTS\              Parameter Files used by Support Vector Machine

Quick Tour

Load Sequence

To test RAPTOR, you can load a test sequence and run it. Click “File” in the menu and select “Load Sequence/XML File”. In the file browser, go to RAPTOR/data/seq/ and load one test sequence into the workspace. You will then see an icon on the left panel, and the content of the sequence will be displayed in a window on the right.

Run Sequence

Next, select “Run” in the menu and choose “Run Selected” from the dropdown menu. A configuration panel will pop up. The only option that you may need to change is the path of the database used by PSI-BLAST, which depends on how you installed the database. Click the “Advanced” tab and find the “Database for PSI-BLAST” field. If you have installed the NR database, the path should be [home directory]/RAPTOR/data/nr/nr. If you have installed the RefSeq database, the path should be [home directory]/RAPTOR/data/RefSeq/refseq_protein. Here [home directory] is the path of your home directory.
If you want to build 3D structures, you need to install Modeller yourself, then enable Modeller and give its path in the configuration panel.

Click “Run” and RAPTOR will start to run. It takes about one hour to run one sequence, depending on the sequence length. After the run is finished, a tabbed window will appear on the right. It shows the PSP matrix obtained by PSI-BLAST, the predicted secondary structure, the ranking list of templates and all the alignments.

Menu System

Launch RAPTOR

In RAPTOR/, run RAPTOR_GUI.sh to launch RAPTOR GUI.
The navigation panel is on the left and the output display panel is on the right, as shown below.

File

File->Load
You can load a sequence file (.seq) or an output file (.xml).

File->Close Selected
You can close the output windows for the selected sequence.

File->Close All
Close the windows for all the sequences in the workspace.

File->Delete Output
Delete the XML file for the selected sequence.

File->Exit
Exit the GUI.

Edit

Edit->RAPTOR Config
This will pop up a configuration panel where you can set up the configuration of RAPTOR.

Run

Run->Run Selected
This will pop up the configuration panel and after you press “Run” the sequence will be run.

Run->Run All
This will pop up the configuration panel and after you press “Run” all the sequences in the work space will be run.

Window

This lets you select a different window from the drop-down menu.

Help

This will launch a browser to allow you to read this manual or visit the BSI website.

Configuration Panel

Basic Options

Threading Method
There are three threading methods available in RAPTOR: NoCore, NPCore and IP. You can select to run one, two or all of them in a run.

3D Modeling
You can let RAPTOR call Modeller automatically after performing the threading. Select the check box and locate the Modeller program in the file browser. For example, the path could be /home/usr/modeller8v2/bin/mod8v2 on Linux, or c:\modeller8v2\bin\mod8v2 on Windows. If you prefer to do 3D modeling with ICM Pro, RAPTOR can also output ICM Pro input files: select the check box and specify an output path.

Output Path
This is the directory in which RAPTOR will be run and all the output files will be stored.

Output Files
You will need to specify how many templates are saved in the XML output file. If you save too many, the file will take up too much disk space.

Keep Raw Files
You can choose to keep or remove RAPTOR raw output files.

Advanced Options

Template Settings
List Path - The path of the template list, a text file that stores the names of all the templates in the template library.
FSSP Path - The directory where all the .fssp files are stored.
PSM Path - The directory where all the .psm files are stored.
PDB Path - The directory where all the trimmed .pdb files are stored.

Database for PSI-BLAST
If you use NR database, it should be [nr path]/nr
If you use RefSeq database, it should be [refseq path]/refseq_protein
Example: if all the NR files are in /home/usr/RAPTOR/data/NR, then this field should read: /home/usr/RAPTOR/data/NR/nr
If the RefSeq files are in /home/usr/RAPTOR/data/RefSeq/, then this field should read: /home/usr/RAPTOR/data/RefSeq/refseq_protein

PDB File Viewer
This is the viewer that will be called automatically in RAPTOR. A RasMol viewer comes with RAPTOR.

Template Ranking Method
RAPTOR supports two template ranking methods: Support Vector Machine (SVM) and Z-score. Normally, you should use SVM.
For very long or very short sequences, Z-score may give a better result.

Navigation Panel and Output Panel

Navigation Panel

The left-hand side is the navigation panel. Each sequence is represented by an icon. After you run RAPTOR, the RAPTOR output is represented by a separate icon. You can browse different sequences and their outputs by clicking different icons in the navigation panel.

Output Window

PSI-BLAST Profile

The output window is composed of a set of tab windows. The first tab window is the PSI-BLAST profile. It is a 20-row matrix, with each row corresponding to one amino acid.
The number of columns equals the length of the query sequence. Thus each residue in the query sequence has a 20-element vector associated with it. Each element represents the frequency with which a certain amino acid occurs at that position in the multiple sequence alignment obtained from the PSI-BLAST output.

The frequency ranges from 0 to 1. To make the profile easier to read, the frequency range is divided into 10 segments, and each segment is represented by a color. In this way, the matrix is displayed as a rectangle composed of many small square cells, where the color of each cell is determined by the frequency. You can easily distinguish conserved residues from non-conserved residues by their colors.
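
To make the layout of this matrix concrete, here is a minimal Python sketch that builds a 20-row frequency matrix from a small multiple sequence alignment and maps each frequency to one of 10 segments. It is an illustration only: it uses raw residue counts, whereas PSI-BLAST applies its own weighting, and the row order, the gap character "-" and the function names are assumptions.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row order: one row per amino acid


def frequency_profile(alignment: List[str]) -> List[List[float]]:
    """Return a 20 x L matrix of residue frequencies, one column per position."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = [[0.0] * length for _ in AMINO_ACIDS]
    for col in range(length):
        # Gaps and unknown symbols are ignored when counting.
        residues = [row[col] for row in alignment if row[col] in AMINO_ACIDS]
        if not residues:
            continue
        for r, aa in enumerate(AMINO_ACIDS):
            profile[r][col] = residues.count(aa) / len(residues)
    return profile


def color_segment(frequency: float, segments: int = 10) -> int:
    """Map a frequency in [0, 1] to a segment index from 0 to segments - 1."""
    return min(int(frequency * segments), segments - 1)
```

A column where one row has frequency 1.0 (segment 9) marks a fully conserved position.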

Secondary Structure

Different colors are used to represent helices, beta sheets and loops.

Some acronyms:
AA       amino acid
PHD      PhiPred predicted secondary structure
E        Beta Strand
H        Helices
Space    Loops
Rel      Confidence of predicted secondary structure type
PrE      Chance of being beta strand (0 to 10)
PrH      Chance of being helix (0 to 10)
PrL      Chance of being loop (0 to 10)

Rank by Score

Top Window

Each method is represented by a folder icon. If you double-click it, the templates will be displayed, ranked by their E-values. The smaller the E-value, the better. Other scores used internally are also displayed.

Table fields:
tName: template name
eValue: E value
tLen: template length
sLen: target length
Score: alignment score
mScore: mutation score
fScore: environmental fitness score
gScore: gap score
ssScore: secondary structure score
pScore: pairwise score
cScore: contact capacity score
SVMout: score output by the Support Vector Machine
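
If you copy rows of this table into your own scripts, the same ordering (smallest E-value first) can be reproduced as below. This is a minimal sketch: the rows are represented as plain dictionaries keyed by the field names above, and the template names and values in the usage example are made up.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[str, float]]


def rank_by_evalue(rows: List[Row]) -> List[Row]:
    """Return the rows ordered by E-value, best (smallest) first."""
    return sorted(rows, key=lambda row: float(row["eValue"]))


# Hypothetical rows for illustration only.
example_rows: List[Row] = [
    {"tName": "templateA", "eValue": 3.0e-2},
    {"tName": "templateB", "eValue": 1.5e-8},
    {"tName": "templateC", "eValue": 4.0e-4},
]
best = rank_by_evalue(example_rows)[0]["tName"]
```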

Bottom Window

If you click a template, its alignment will be displayed in a drop down window. The color of the template is consistent with its actual secondary structure and the color of the target is consistent with its predicted secondary structure.

If you click “View 3D structure with RasMol”, a RasMol window will pop up and the structure will be displayed. If you click “Export pdb file”, a file browser will pop up and you can save the 3D structure in a pdb file.

If you click the “Functional Annotation” tab, a window will drop down and show the functional information extracted from the template pdb file. If you click the template name, a browser will pop up and connect to the RCSB PDB website.

Alignments

The left side of the toolbar allows you to select some session(s) and specify how many templates you want to display. The right side of the toolbar allows you to compare any two alignments. To specify an alignment, use the method name and the alignment's rank.

Error

This window displays the errors that occurred during the run. Due to incorrectly generated template files or other reasons, the target sequence may not be able to be threaded to some templates. These templates can be ignored and will not have significant influence on the threading results, considering their number is so small.

Using RAPTOR

Input File and Output File

RAPTOR accepts FASTA format sequence files as input. To load a sequence file, click the “File” menu and select “Load File”. In the popup file browser, select the right file filter to display all .seq files. Here is an example of a FASTA format sequence:

>2acy(len=98)
AEGDTLISVDYEIFGKVQGVFFRKYTQAEGKKLGLVGWVQNTDQGTVQG
QLQGPASKVRHMQEWLETKGSPKSHIDRASFHNEKVIVKLDYTDFQIVK

The default suffix for sequence files is “.seq”. If the file you loaded does not have the right suffix, “.seq” will be appended to the file name.
The output of RAPTOR is stored in an XML file. You can load an XML file saved by RAPTOR and display its content. To load an XML file, click the “File” menu and select “Load File”. In the popup file browser, select the right file filter to display all .xml files.
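
Before loading a sequence file, you may want to check that it is well-formed FASTA. The sketch below is a generic FASTA reader, not RAPTOR's own parser, and the function name is hypothetical. It returns each header (without the leading ">") together with the sequence joined across lines.

```python
from typing import List, Optional, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records: List[Tuple[str, str]] = []
    header: Optional[str] = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


EXAMPLE = """>2acy(len=98)
AEGDTLISVDYEIFGKVQGVFFRKYTQAEGKKLGLVGWVQNTDQGTVQG
QLQGPASKVRHMQEWLETKGSPKSHIDRASFHNEKVIVKLDYTDFQIVK
"""

for name, sequence in parse_fasta(EXAMPLE):
    print(name, len(sequence))
```

Comparing the printed length with the length stated in the header is a quick way to spot a truncated file.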

All the raw files of RAPTOR are stored in a directory whose name is the sequence name in the output directory. Suppose the sequence name is XXXX.
Here is the structure of directory XXXX.

XXXX
PSP              PSI-BLAST Output Files
SS               PSI-PRED Output Files
[method name]
MODEL            Alignment Files        .pir File
OUT              Ranking Files          .scoreRank File
                 Modeller Output        Modeller output .pdb File
                 ICM Pro Input          ICM Pro Input Files

The structure of output directory:
PSP   PSI-BLAST Output Files
SS    Secondary Structure Prediction Output Files
[method name]
Temporarily Store Threading Output

Here [method name] can be NoCore, NPCore, or IP. Directories enclosed in <> are only generated when the corresponding checkbox is selected and the path is specified in the configuration panel.

PSI-BLAST Database

In RAPTOR, PSI-BLAST is used to generate the position-specific matrix (sequence profile) of a target sequence. By default, PSI-BLAST uses the NR database, but NR is very large (1 GB after compression). An alternative database is RefSeq, a curated non-redundant sequence database of genomes, transcripts and proteins maintained by NCBI. RefSeq is much smaller, about half the size of NR. We compared the two, and the profiles obtained from them are almost the same, so you can always use RefSeq in place of NR. The NR database can be downloaded from ftp://ftp.ncbi.nih.gov/blast/db/nr.00.tar.gz and ftp://ftp.ncbi.nih.gov/blast/db/nr.01.tar.gz

RefSeq can be downloaded from ftp://ftp.ncbi.nih.gov/blast/db/refseq_protein.tar.gz.
After uncompressing, you will obtain a set of index files. Put them in a directory and specify the path in the configuration panel.

Threading Methods

Dynamic Programming vs. Integer Programming
RAPTOR has three threading methods available: NoCore, NPCore, and IP. NoCore and NPCore both use dynamic programming to optimize the scoring function. IP uses integer programming to optimize the scoring function. The difference is that if a scoring function considers pair-wise contacts, dynamic programming can only find a locally optimal solution, while integer programming can find the globally optimal solution. Most other threading servers are based on dynamic programming; RAPTOR’s integer programming approach is unique.
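
For readers unfamiliar with dynamic programming alignment, the sketch below shows the classic global alignment recurrence (Needleman-Wunsch) in Python. It is not the NoCore or NPCore algorithm: it scores each aligned position independently with hypothetical match, mismatch and gap values, and it maximizes a similarity score. With such position-independent scores the recurrence is exact. Once the score of one aligned pair depends on where other residues are placed (pair-wise contacts), this simple recurrence no longer guarantees the global optimum, which is the limitation described above.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b under position-independent scoring."""
    # prev[j] holds the best score for aligning the current prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]
```

The table has (len(a) + 1) x (len(b) + 1) cells, so the running time grows with the product of the two sequence lengths.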

NoCore vs. NPCore
NoCore and NPCore are both based on dynamic programming. The difference is that in NPCore, the template and target are first divided into cores before threading. A core is a conserved segment of a protein. NoCore and NPCore are very effective for easy targets.

Running One Sequence with Different Methods
IP’s running time is longer than that of NoCore and NPCore. Thus, given a target sequence, you can run NoCore first. If the prediction is not good, try NPCore. If neither gives a good prediction, try IP. This will save you much time. Of course, you can also run more than one method at a time. RAPTOR can keep the output of up to three methods in the XML file. When you run NPCore after running NoCore, the output will be automatically inserted into the XML file. If you run NoCore a second time with a different configuration, the old result in the XML file will be overwritten by the new result.

The first step of RAPTOR is to run PSI-BLAST. The PSI-BLAST output is stored in PSP/ under the output directory. If you have already run NoCore and then run NPCore, the program finds those files and skips PSI-BLAST, which saves running time.
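
To see in advance whether a run is likely to reuse earlier PSI-BLAST results, you can check the PSP/ directory yourself. The Python sketch below only tests that PSP/ exists under the output directory and is not empty. Which files RAPTOR actually looks for is not documented here, so treat this as a rough, hypothetical check.

```python
from pathlib import Path


def psiblast_output_present(output_dir: str) -> bool:
    """Return True if output_dir/PSP exists and contains at least one entry."""
    psp_dir = Path(output_dir) / "PSP"
    return psp_dir.is_dir() and any(psp_dir.iterdir())
```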

Judging Prediction Quality from Alignment

First, you can compare the actual secondary structure of the template with the predicted secondary structure of the query sequence. As the accuracy of secondary structure prediction is around 80%, this is an important measure of the prediction quality. Then you can look at the gaps in the alignment. The fewer and the shorter the gaps, the better the prediction quality. Ending gaps can normally be ignored. Sometimes, the ending gaps may be very long. This means the program can only give a good prediction for part of the query sequence.

What if the ending gaps are too long? In many cases, a long sequence has more than one domain, so the ending gaps may be very long. You can cut the sequence into domains first and run each domain with RAPTOR.
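
The gap checks described above can be automated for one aligned sequence. The Python sketch below is a hypothetical helper that assumes gaps are written as "-". It reports the two ending gaps separately from the internal gaps, since ending gaps can normally be ignored.

```python
import re
from typing import Dict, List, Union


def gap_summary(aligned: str, gap: str = "-") -> Dict[str, Union[int, List[int]]]:
    """Summarize leading, trailing and internal gap lengths of an aligned sequence."""
    leading = len(aligned) - len(aligned.lstrip(gap))
    trailing = len(aligned) - len(aligned.rstrip(gap))
    core = aligned.strip(gap)  # the part between the ending gaps
    internal = [len(m.group(0)) for m in re.finditer(re.escape(gap) + "+", core)]
    return {"leading": leading, "trailing": trailing, "internal_gaps": internal}
```

Fewer and shorter entries in `internal_gaps` suggest a better alignment. Very large `leading` or `trailing` values suggest that only part of the sequence was matched and that splitting it into domains may help.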

Using Modeller

If you are an academic user, you can download Modeller for free. You need to register to get a license key in order to install Modeller. After you install it, you also need to specify the Modeller path in the configuration panel, e.g., /home/usr/modeller8v2/bin/mod8v2 under Linux and C:\modeller8v2\bin\mod8v2 under Windows. As Modeller8v2 uses Python internally, it may give the following error message while running, due to a bug in Python: “'import site' failed; use -v for traceback”. You can ignore it.

Customizing Templates

RAPTOR/data/parameters/fssp.list stores the names of all the templates in the template library. If you are interested in specific templates, you can save their names in another file and specify its path in the configuration panel. You can also create your own template library. You need a pdb file, from which you generate a PSM file and an fssp file. Then put the PSM file in RAPTOR/data/PSM and the fssp file in RAPTOR/data/fssp.

Using RasMol

The default viewer for pdb files is RasMol. The default display mode is cartoon. The structure is colored according to the secondary structure. You can rotate the structure by pressing the left mouse button and dragging. To move the structure, press the right mouse button and drag. To shrink or enlarge the display, hold the “shift” key, press the right mouse button and drag. For a full reference of RasMol, you can visit http://www.umass.edu/microbio/rasmol/.

If you run RAPTOR on a remote machine, rasmol_32BIT may not work properly with the X server. Instead, you need to set the viewer in the configuration panel to a wrapper shell program “rasmol” which will launch RasMol.
If you want to use a viewer other than RasMol, please contact us and we can customize it for you.

RAPTOR Reference List

Feng Jiao, Jinbo Xu, Libo Yu, Dale Schuurmans. Protein Fold Recognition Using Gradient Boost Algorithm. Accepted by CSB 2006.

Jinbo Xu. Protein Fold Recognition by Predicted Alignment Accuracy. ACM/IEEE Transactions on Computational Biology and Bioinformatics, 2(2):157-165. 2005.

Jinbo Xu, Ming Li, Dongsup Kim, Ying Xu. RAPTOR: optimal protein threading by linear programming. Journal of Bioinformatics and Computational Biology 1:1(2003) 95-117.

Jinbo Xu and Ming Li. Assessment of RAPTOR's linear programming approach in CAFASP3. Proteins: Structure, Function, and Genetics, 53(S6): 579--584, Oct. 2003. Invited paper for CASP5, voted by peers as the "most innovative method in CASP5".

Bioinformatics Solutions Inc.

Technical Support
Email: raptor@bioinfor.com
Phone: 1-519-8858288 ext. 16