Abstract
Pyrosequencing is among the emerging sequencing techniques, capable of generating up to 100,000 overlapping reads in a single run. It is much faster and cheaper than established sequencing techniques such as Sanger sequencing. However, the reads generated by pyrosequencing are short and contain numerous errors. Before these reads can be used in any subsequent analysis, they must be aligned. Existing multiple sequence alignment methods cannot be used because they do not take into account the positions of the reads with respect to the genome and are highly inefficient for large numbers of sequences. The common practice has therefore been to use either simple pairwise alignment, despite its poor accuracy on error-prone pyroreads, or computationally expensive techniques based on sequential gap propagation. In this paper, we develop a computationally efficient method based on domain decomposition, referred to as pyro-align, to align such large numbers of reads. The proposed algorithm aligns the erroneous reads accurately and is orders of magnitude faster than any existing method. The accuracy of the alignment is confirmed by the consensus obtained from the multiple alignments.
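As an illustration of the final point, a consensus can be read off a multiple alignment by majority vote in each column. The sketch below is a simplified assumption for illustration only, not the pyro-align implementation: the function name column_consensus, the gap symbol "-", and the toy reads are hypothetical, and the reads are assumed to be already aligned and padded to equal length.

```python
from collections import Counter


def column_consensus(aligned_reads, gap="-"):
    """Return the majority-vote consensus of equal-length aligned reads.

    Gaps take part in the vote; a column in which the gap symbol wins
    contributes nothing to the consensus.
    """
    if not aligned_reads:
        return ""
    width = len(aligned_reads[0])
    if any(len(read) != width for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # Most frequent symbol in this column (bases and gaps alike).
        winner = Counter(column).most_common(1)[0][0]
        if winner != gap:
            consensus.append(winner)
    return "".join(consensus)


# Hypothetical toy alignment: one substitution, one insertion, one deletion.
reads = [
    "ACGT-ACGT",
    "ACGTTACGT",
    "ACCT-ACGT",
    "ACGT-AC-T",
]
print(column_consensus(reads))
```

In this toy input each error occurs in only one read, so the vote in every column recovers the underlying sequence. Real pyroreads cover different regions of the genome, so a practical version would also need to distinguish padding outside a read from true gaps.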
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Saeed, F., Khokhar, A., Zagordi, O., Beerenwinkel, N. (2009). Multiple Sequence Alignment System for Pyrosequencing Reads. In: Rajasekaran, S. (eds) Bioinformatics and Computational Biology. BICoB 2009. Lecture Notes in Computer Science, vol 5462. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00727-9_34
DOI: https://doi.org/10.1007/978-3-642-00727-9_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00726-2
Online ISBN: 978-3-642-00727-9
eBook Packages: Computer Science (R0)