Superior homology search software

PatternHunter

PatternHunter's Technology

Why PatternHunter is better

In the 1980s, BLAST was designed to speed up the Smith-Waterman algorithm for homology search by trading sensitivity for speed. Today, neither BLAST nor Smith-Waterman can keep pace with the exponential growth of genomics data. PatternHunter (Bioinformatics, 18(3):440-445, 2002; Journal of Bioinformatics and Computational Biology, 2004) uses modern homology search technology invented by BSI founders, including optimized multiple spaced seeds. With these new algorithms and ideas, PatternHunter is changing the way homology search is done: one no longer needs to trade sensitivity for speed. PatternHunter can approach Smith-Waterman sensitivity while running thousands of times faster.

Spaced seeds

Because of the large sizes of databases and queries, comparing each position in the query with each position in the database, as the Smith-Waterman algorithm does, is too computationally intensive. For better speed, homology search relies on heuristic methods.

One heuristic method, used in BLAST, takes a short, consecutive sequence of letters as a "seed". An exact match of this seed (a "hit") is a hint that there may be a longer match surrounding it, so BLAST searches for homologies only in the regions around hits. PatternHunter also uses seeds, but its seeds are discontinuous: the required letters are separated by positions that need not match. By adjusting the relative positions of the required letters, the seed can be optimized to increase sensitivity.

The relative positions of the letters are denoted by a 0-1 string. For example, in the seed model "111010010100110111", a "1" means the letter at that position is required to match, and a "0" means the letter at that position is not required to match. The number of 1s is called the weight of the seed.
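The 0-1 notation maps directly onto a few lines of code. The following is a minimal sketch, not PatternHunter's implementation; the function names are made up for illustration.

```python
SEED = "111010010100110111"  # the seed model quoted in the text


def seed_weight(seed: str) -> int:
    """Number of positions that are required to match (the 1s)."""
    return seed.count("1")


def required_offsets(seed: str) -> list[int]:
    """0-based offsets, within the seed, of the required positions."""
    return [i for i, c in enumerate(seed) if c == "1"]


print(seed_weight(SEED), len(SEED))  # 11 18
print(required_offsets(SEED))        # [0, 1, 2, 4, 7, 9, 12, 13, 15, 16, 17]
```

This seed has weight 11 spread over a span of 18 positions.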

For example, the following homology can be "hit" (detected) by the spaced seed above.

    GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT…
    || ||||||||| ||||| || |||||   ||||||
    GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT…
           111010010100110111
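To make the notion of a hit concrete, the sketch below slides a seed along the ungapped alignment shown above and reports every start column at which all required positions match. It uses only the 36 letters printed in the example; the function name `seed_hits` is an illustrative assumption, not part of PatternHunter.

```python
def seed_hits(seed: str, a: str, b: str) -> list[int]:
    """Return the 0-based start columns where `seed` hits the ungapped alignment of a and b."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    n = min(len(a), len(b))
    return [
        start
        for start in range(n - len(seed) + 1)
        if all(a[start + i] == b[start + i] for i in required)
    ]


TOP = "GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT"
BOTTOM = "GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT"

print(seed_hits("111010010100110111", TOP, BOTTOM))  # [7]
print(seed_hits("11111111111", TOP, BOTTOM))         # []
```

The spaced seed hits once, starting at the eighth column, while a consecutive seed of the same weight finds nothing in these 36 columns.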

Why spaced seeds are better than consecutive seeds

There are two factors that affect the performance of a seed: the selectivity and the sensitivity.

  • Selectivity determines the search speed: more required matches (more 1s) in a seed mean fewer hits and a faster search.
  • Sensitivity determines the search quality: not all homologies can be hit by a given seed. For example, the seed 11111111111 cannot hit the alignment above, because the longest run of consecutive matches there is only 9 letters. We want to optimize the seed so that, on average, the number of homologies it hits is maximized.
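The link between weight and selectivity can be illustrated with a small calculation. It rests on a simplifying assumption that is not stated in the text: letters of unrelated DNA are independent and uniformly distributed over A, C, G and T, so each required position matches by chance with probability 1/4. The function name is hypothetical.

```python
def random_hit_probability(seed: str, alphabet_size: int = 4) -> float:
    """Chance that a seed hits at one random pair of positions (uniform-letter assumption)."""
    weight = seed.count("1")
    return (1.0 / alphabet_size) ** weight


spaced = "111010010100110111"
consecutive = "11111111111"

# Under this model only the weight matters, not the arrangement of the 1s.
print(random_hit_probability(spaced) == random_hit_probability(consecutive))  # True

# Each additional required match divides the chance-hit rate by four.
print(random_hit_probability("1" * 12) / random_hit_probability("1" * 11))  # 0.25
```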


Two seeds with the same weight generate approximately the same number of hits (Bioinformatics, 18(3):440-445, 2002). That is, a spaced seed and a consecutive seed of the same weight have very similar selectivity. However, the spaced seed has better sensitivity. When a consecutive seed finds a hit, a second hit at the next position of the homology is very likely, because it requires only one more letter match (see the following figure). The second hit is redundant, because one hit is enough to find the homology.


     TTGACCTCACC? 
     |||||||||||? 
     TTGACCTCACC? 
     11111111111 
      11111111111
    

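Hits of a spaced seed at neighboring positions are more independent. Shifting the seed 111010010100110111 by one position requires six additional letter matches, so a second hit in the same homology is much less likely. See the following figure, where "?" marks a position that the first hit does not constrain: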


    CAA?A??A?C??TA?TGG? 
    |||?|??|?|??||?|||? 
    CAA?A??A?C??TA?TGG? 
    111010010100110111 
     111010010100110111
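The two figures can be checked by counting how many required positions of the shifted seed are not already covered by the first hit. This is a minimal sketch with a hypothetical function name.

```python
def extra_matches_for_shifted_hit(seed: str, shift: int = 1) -> int:
    """Required positions of the shifted seed that the first hit does not already cover."""
    first = {i for i, c in enumerate(seed) if c == "1"}
    shifted = {i + shift for i in first}
    return len(shifted - first)


print(extra_matches_for_shifted_hit("11111111111"))         # 1
print(extra_matches_for_shifted_hit("111010010100110111"))  # 6
```

The consecutive seed needs one more matching letter for a second hit, while the spaced seed needs six.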
    

Therefore, with approximately the same number of hits, a spaced seed detects more homologies.
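The sensitivity comparison can be explored with a small Monte Carlo experiment. The model is an assumption made for illustration: a homology is an ungapped region of 64 positions in which each position matches independently with probability 0.7. The region length, similarity level, trial count and function names are all illustrative choices, and the estimates carry sampling noise.

```python
import random


def seed_mask(seed: str) -> int:
    """Encode the seed as a bit mask: bit i is set where position i is required."""
    return sum(1 << i for i, c in enumerate(seed) if c == "1")


def hits_region(seed: str, region: int, length: int) -> bool:
    """True if the seed hits anywhere in a region whose match pattern is `region`."""
    mask = seed_mask(seed)
    return any(
        ((region >> start) & mask) == mask
        for start in range(length - len(seed) + 1)
    )


def estimate_sensitivity(seed, length=64, similarity=0.7, trials=5000, rng_seed=0):
    """Fraction of simulated homologous regions that the seed hits at least once."""
    rng = random.Random(rng_seed)  # fixed seed: every call sees the same regions
    detected = 0
    for _ in range(trials):
        # Bit i is 1 when position i of the simulated region is a match.
        region = sum(1 << i for i in range(length) if rng.random() < similarity)
        detected += hits_region(seed, region, length)
    return detected / trials


spaced = estimate_sensitivity("111010010100110111")
consecutive = estimate_sensitivity("11111111111")
print(f"spaced: {spaced:.3f}  consecutive: {consecutive:.3f}")
```

Because both calls use the same random seed, the two seed models are evaluated on identical simulated regions, which makes the comparison paired rather than independent.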

The multiple seed technique and why it increases sensitivity

As explained before, any given seed may fail to detect some homologies. Because different seeds tend to fail on different homologies, using several seeds simultaneously can significantly improve the success rate. However, it is very important to optimize the combination of seeds so that their detection abilities complement each other.
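The effect of adding a seed can be sketched with a simulation in which a region counts as detected if any seed in the set hits it. The second seed below is an arbitrary weight-11 pattern chosen only for illustration: it is not optimized and is not one of PatternHunter's seeds. The random-region model (64 positions, each matching independently with probability 0.7) is likewise an assumption.

```python
import random


def hits_region(seed: str, matches: list[bool]) -> bool:
    """True if every required position of the seed matches at some start column."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    return any(
        all(matches[start + i] for i in required)
        for start in range(len(matches) - len(seed) + 1)
    )


def detection_rate(seeds, length=64, similarity=0.7, trials=2000, rng_seed=0):
    """Fraction of simulated regions hit by at least one seed in `seeds`."""
    rng = random.Random(rng_seed)  # fixed seed: every call sees the same regions
    detected = 0
    for _ in range(trials):
        matches = [rng.random() < similarity for _ in range(length)]
        detected += any(hits_region(s, matches) for s in seeds)
    return detected / trials


FIRST = "111010010100110111"   # seed from the text
SECOND = "110110001100101111"  # hypothetical second seed, not optimized

single = detection_rate([FIRST])
pair = detection_rate([FIRST, SECOND])
print(f"one seed: {single:.3f}  two seeds: {pair:.3f}")
```

On the same simulated regions, the pair can never detect fewer homologies than the first seed alone. How much it gains depends on how complementary the seeds are, which is why the combination has to be optimized. Each added seed also produces additional hits to process, so the gain in sensitivity is paid for in search time.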