Superior homology search software
PatternHunter

PatternHunter's Technology

Why PatternHunter is better

In the 1980s, BLAST was designed to speed up the Smith-Waterman algorithm for homology search by trading sensitivity for speed. Today, BLAST and Smith-Waterman can no longer keep pace with the exponential growth of genomic data. PatternHunter (Bioinformatics, 18(3):440-445, 2002; Journal of Bioinformatics and Computational Biology, 2004) uses modern homology search technology invented by BSI founders. One such technology is optimized multiple spaced seeds. With new algorithms and ideas, PatternHunter is changing the way homology search is done. One no longer needs to trade sensitivity for speed. PatternHunter can approach Smith-Waterman sensitivity and yet run thousands of times faster.

Spaced seeds

Because of the large sizes of databases and queries, comparing each position in the query with each position in the database, as the Smith-Waterman algorithm does, is too computationally intensive. For better speed, heuristic methods have been used in homology search. One heuristic method, used in BLAST, takes a short, continuous sequence of letters as a "seed". An exact match of this seed is a hint that there may be a longer match surrounding it. BLAST therefore searches for homologies only in regions that contain hits.

PatternHunter also uses seeds; the difference is that PatternHunter uses a discontinuous sequence of letters as its seed. By adjusting the relative positions of the letters in the discontinuous sequence, we can optimize the seed to increase sensitivity. The relative positions of the letters are denoted by a 0-1 string. For example, in the seed model "111010010100110111", a "1" means the letter at that position is required to match, and a "0" means the letter at that position is not required to match. The number of 1s is called the weight of the seed. For example, the following homology can be "hit" (detected) by the above-mentioned spaced seed.
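The seed model described above can be sketched in a few lines of Python. This is a minimal illustration, not PatternHunter's implementation: the function name `seed_hits` and the two example sequences are hypothetical, and the sketch assumes an ungapped alignment of two equal-length strings. The sequences were built to agree at every "1" position of the seed and to differ at every "0" position.

```python
def seed_hits(seed, query, target):
    """Return the offsets at which `seed` hits an ungapped alignment.

    A hit at offset `start` means query and target agree at every
    position where the seed has a "1"; "0" positions are ignored.
    """
    if len(query) != len(target):
        raise ValueError("this sketch assumes equal-length, ungapped sequences")
    required = [i for i, c in enumerate(seed) if c == "1"]
    hits = []
    for start in range(len(query) - len(seed) + 1):
        if all(query[start + i] == target[start + i] for i in required):
            hits.append(start)
    return hits


SEED = "111010010100110111"   # seed model from the text
weight = SEED.count("1")      # number of required matches

# Hypothetical sequences: identical at the "1" positions, different at the "0" positions.
query = "CAATACGATCGGTACTGG"
target = "CAAGATTAGCAATAGTGG"

spaced_hits = seed_hits(SEED, query, target)
consecutive_hits = seed_hits("1" * weight, query, target)
print(weight, spaced_hits, consecutive_hits)
```

In this example the spaced seed reports a hit at offset 0, while a consecutive seed of the same weight finds nothing, because the longest run of matching letters is only three.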
Why spaced seeds are better than consecutive seeds

Two factors affect the performance of a seed: selectivity and sensitivity. Two seeds with the same weight will generate approximately the same number of hits (Bioinformatics, 18(3):440-445, 2002). That is, a spaced seed and a consecutive seed with the same weight will have very similar selectivity. However, the spaced seed will have better sensitivity. This is because when a consecutive seed finds a hit, a second hit at the next position of the homology is very likely, since it requires only one more letter match (see the following figure). The second hit is redundant because only one hit is required to find the homology.

    TTGACCTCACC?
    |||||||||||?
    TTGACCTCACC?
    11111111111
     11111111111

Hits of a spaced seed are more independent of one another, so it is less likely that a single homology produces more than one hit. See the following figure:

    CAA?A??A?C??TA?TGG?
    |||?|??|?|??||?|||?
    CAA?A??A?C??TA?TGG?
    111010010100110111
     111010010100110111

Therefore, with approximately the same number of hits, a spaced seed will detect more homologies.

The multiple seed technique and why it increases sensitivity

As explained above, any given seed may fail to detect some homologies. Because different seeds tend to fail on different homologies, using several different seeds simultaneously can significantly improve the success rate. However, it is very important to optimize the combination of seeds so that their detection abilities complement one another.
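The sensitivity argument above can be explored with a small simulation. The following is a rough sketch under stated assumptions, not the method used to optimize PatternHunter's seeds: homologous regions are modeled as 64 positions that each match independently with probability 0.7, the helper names are hypothetical, and `SECOND` is an arbitrary, non-optimized weight-11 seed chosen only to illustrate combining seeds.

```python
import random


def has_hit(seed, matches):
    """True if `seed` hits anywhere in a list of per-position match flags."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    return any(
        all(matches[start + i] for i in required)
        for start in range(len(matches) - len(seed) + 1)
    )


def estimate_sensitivity(seed_sets, length=64, similarity=0.7,
                         trials=2000, rng_seed=0):
    """Estimate, for each seed set, the fraction of random regions hit.

    Every seed set is evaluated on the same simulated regions, so the
    estimates are directly comparable.
    """
    rng = random.Random(rng_seed)
    counts = [0] * len(seed_sets)
    for _ in range(trials):
        region = [rng.random() < similarity for _ in range(length)]
        for k, seeds in enumerate(seed_sets):
            if any(has_hit(s, region) for s in seeds):
                counts[k] += 1
    return [c / trials for c in counts]


CONSECUTIVE = "1" * 11                # consecutive seed of weight 11
SPACED = "111010010100110111"         # spaced seed from the text, weight 11
SECOND = "11101100011010111"          # hypothetical second seed, weight 11

cons, spaced, multi = estimate_sensitivity(
    [[CONSECUTIVE], [SPACED], [SPACED, SECOND]]
)
print(f"consecutive: {cons:.3f}  spaced: {spaced:.3f}  two seeds: {multi:.3f}")
```

Because the two-seed set contains the single spaced seed, its estimate can never be lower on the same regions. Note that a pair of seeds also produces more random hits than one seed, so this last comparison is not made at equal selectivity.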