ZOOM's Technology
Several techniques are used to improve speed while maintaining high sensitivity. The spaced seed method was developed to speed up DNA similarity search in the PatternHunter software (Bioinformatics 18: 440-445, 2002[1]). A spaced seed is a fixed pattern such as 11*1**11111*1**1111, in which each 1 marks a position that must match and each * marks a position that may or may not match. The number of 1-positions is called the weight of the seed. In the following figure, the pattern is aligned with a DNA similarity so that all the 1-positions fall on matches. The similarity is therefore hit (detected) by the spaced seed.
GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT
|| ||||||||| |||||| | ||||||   ||||||
GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT
        11*1**11111*1**1111
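The hit test illustrated by the figure can be sketched in a few lines of Python. This is only an illustration of the definition, not PatternHunter's or ZOOM's implementation; the function names `seed_weight` and `find_seed_hits` are hypothetical, and the sketch assumes an ungapped alignment of two equal-length sequences.

```python
def seed_weight(seed):
    """Weight of a spaced seed: the number of required-match ('1') positions."""
    return seed.count("1")


def find_seed_hits(seq_a, seq_b, seed):
    """Return every 0-based offset at which the seed hits the alignment.

    A hit means that the two sequences agree at every '1' position of the
    seed; '*' positions are ignored. Assumes an ungapped alignment.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    required = [i for i, c in enumerate(seed) if c == "1"]
    hits = []
    for start in range(len(seq_a) - len(seed) + 1):
        if all(seq_a[start + i] == seq_b[start + i] for i in required):
            hits.append(start)
    return hits


# The two sequences and the seed from the figure.
SEQ_A = "GAGTACTCAACACCAACATTAGTGGGCAATGGAAAAT"
SEQ_B = "GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT"
SEED = "11*1**11111*1**1111"

if __name__ == "__main__":
    print(seed_weight(SEED))                       # weight of the seed
    print(find_seed_hits(SEQ_A, SEQ_B, SEED))      # offsets where the spaced seed hits
    print(find_seed_hits(SEQ_A, SEQ_B, "1" * 13))  # consecutive seed of the same weight
```

On this example the spaced seed hits at a single offset, while a consecutive seed of the same weight finds no hit, because the longest run of matches in the alignment is shorter than 13.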
Different spaced seeds have different hit probabilities in a randomly sampled similarity. In PatternHunter, one or several optimized spaced seeds are determined in advance. To find all high-scoring local alignments between two long DNA sequences, PatternHunter first finds all the hits and then performs extensions near the hits. This saves the computing time that would otherwise be spent on most of the low-scoring local alignments, because they usually do not produce a hit. Consequently, the speed is greatly improved. The main difference between PatternHunter and BLAST is that BLAST used a consecutive seed (without the * positions in the middle), resulting in lower sensitivity.
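The difference in hit probability between a spaced seed and a consecutive seed can be estimated with a small simulation. The sketch below is a Monte Carlo estimate under stated assumptions, not the optimization procedure used by PatternHunter: each position of a similarity region matches independently with a fixed probability, the region length (64) and similarity (0.7) are illustrative choices, and `hit_probability` is a hypothetical name. The spaced seed used here is the weight-11 seed 111*1**1*1**11*111 from the PatternHunter paper [1].

```python
import random


def hit_probability(seed, region_len, similarity, trials=5000, rng=None):
    """Estimate the probability that `seed` hits a random similarity region.

    Each of the `region_len` positions is a match with probability
    `similarity`, independently of the others (a simplifying assumption).
    """
    rng = rng or random.Random(0)
    required = [i for i, c in enumerate(seed) if c == "1"]
    n_starts = region_len - len(seed) + 1
    hits = 0
    for _ in range(trials):
        matches = [rng.random() < similarity for _ in range(region_len)]
        # The region is hit if the seed fits at any offset.
        if any(all(matches[s + i] for i in required) for s in range(n_starts)):
            hits += 1
    return hits / trials


SPACED = "111*1**1*1**11*111"  # weight 11, from the PatternHunter paper
CONSECUTIVE = "1" * 11         # consecutive seed of the same weight

if __name__ == "__main__":
    print("spaced     :", hit_probability(SPACED, 64, 0.7))
    print("consecutive:", hit_probability(CONSECUTIVE, 64, 0.7))
```

Both seeds have the same weight, so they produce a similar number of random hits, but the spaced seed should detect a noticeably larger fraction of the simulated similarities.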
Researchers[2] have extended the spaced seed strategy to short read mapping and carried out specialized optimization for this new application area. Low memory consumption and high throughput are the two main goals of ZOOM. The improved spaced seed design gives ZOOM high speed and 100% sensitivity for a wide range of read lengths and mismatch numbers.
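What 100% sensitivity means for a given read length and mismatch number can be checked by brute force on small cases. The sketch below is not ZOOM's seed design algorithm; it only tests whether a given seed set hits every possible placement of up to k mismatches in a read. The function name `is_fully_sensitive` and the toy seeds are hypothetical, chosen so that the enumeration stays small.

```python
from itertools import combinations


def is_fully_sensitive(seeds, read_len, max_mismatches):
    """Check whether a seed set hits a read under every mismatch pattern.

    Enumerates every set of at most `max_mismatches` mismatch positions in
    a read of length `read_len` and verifies that at least one seed, at
    some offset, has all its '1' positions on matching bases.
    """
    seed_offsets = [[i for i, c in enumerate(s) if c == "1"] for s in seeds]
    for k in range(max_mismatches + 1):
        for mismatches in combinations(range(read_len), k):
            bad = set(mismatches)
            hit = any(
                all(start + i not in bad for i in offsets)
                for seed, offsets in zip(seeds, seed_offsets)
                for start in range(read_len - len(seed) + 1)
            )
            if not hit:
                return False  # this mismatch pattern escapes every seed
    return True


if __name__ == "__main__":
    # A weight-4 consecutive seed on an 8-base read tolerates one mismatch ...
    print(is_fully_sensitive(["1111"], 8, 1))
    # ... but two mismatches in the middle (positions 3 and 4) defeat it.
    print(is_fully_sensitive(["1111"], 8, 2))
```

The cost of this check grows combinatorially with read length and mismatch number, so it is only practical for verifying small designs.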
ZOOM supports the mapping of paired-end reads. The mapping information for a pair is reported and collected only when the mapping distance between the two reads falls within a specified range. Experiments show that the pairing information helps to identify the true mapping positions and contributes significantly to mapping accuracy.
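The distance filter for paired-end reads can be sketched as follows. This is a simplified illustration and not ZOOM's code: the function name `pair_mappings` and its parameters are hypothetical, positions are plain integer coordinates on a single reference sequence, and strand and orientation checks are left out.

```python
def pair_mappings(positions_1, positions_2, min_dist, max_dist):
    """Keep only the position pairs whose distance lies in [min_dist, max_dist].

    `positions_1` and `positions_2` are the candidate mapping positions of
    the two reads of a pair on the same reference sequence.
    """
    pairs = []
    for p1 in positions_1:
        for p2 in positions_2:
            if min_dist <= abs(p2 - p1) <= max_dist:
                pairs.append((p1, p2))
    return pairs


if __name__ == "__main__":
    # Read 1 maps to two places, read 2 maps to two places; only one
    # combination is consistent with the allowed distance range.
    print(pair_mappings([100, 5000], [350, 90000], min_dist=150, max_dist=400))
```

A read that maps to several candidate positions on its own can often be placed uniquely once its mate is taken into account, as in the example above.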
ZOOM also uses the quality scores of Illumina/Solexa reads. Low-quality bases in Illumina/Solexa reads are recognized, and reads are mapped relying only on the high-quality bases. For ABI SOLiD data, sequencing errors in color space can be corrected, and polymorphisms in base space and sequencing errors are marked separately.
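One simple way to rely only on high-quality bases is to ignore mismatches at low-quality positions when comparing a read with the reference. The sketch below illustrates that idea and is not ZOOM's actual scoring scheme: the function name `count_high_quality_mismatches` and the default threshold of 20 are assumptions made for the example, and quality scores are taken as already-decoded integers.

```python
def count_high_quality_mismatches(read, qualities, reference, min_quality=20):
    """Count mismatches between `read` and `reference` at high-quality bases.

    Positions whose quality score is below `min_quality` are skipped, so a
    likely sequencing error does not count against the mapping.
    """
    if not (len(read) == len(qualities) == len(reference)):
        raise ValueError("read, qualities and reference must have equal length")
    return sum(
        1
        for base, quality, ref_base in zip(read, qualities, reference)
        if quality >= min_quality and base != ref_base
    )


if __name__ == "__main__":
    read = "ACGTAC"
    reference = "ACGAAT"
    qualities = [30, 30, 30, 5, 30, 30]
    # The mismatch at index 3 has low quality and is ignored; the one at
    # index 5 has high quality and is counted.
    print(count_high_quality_mismatches(read, qualities, reference))
```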
WORKFLOW FOR ILLUMINA-SOLEXA DATA
WORKFLOW FOR ABI SOLID DATA
References:
[1] Ma, B., Tromp, J., and Li, M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics, 18(3), 440-445.
[2] Lin, H., Zhang, Z., Zhang, M. Q., Ma, B., and Li, M. (2008) ZOOM! Zillions of oligos mapped. Bioinformatics.
 