Efficient Selection of Unique and Popular Oligos for Large EST Databases

Zheng, Jie; Close, Timothy J.; Jiang, Tao; Lonardi, Stefano

doi:10.1007/3-540-44888-8_28

Jie Zheng⁷,
Timothy J. Close⁸,
Tao Jiang⁷ &
…
Stefano Lonardi⁷

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2676))

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

637 Accesses
5 Citations

Abstract

EST databases have grown exponentially in recent years and now represent the largest collection of genetic sequences. An important application of these databases is that they contain information useful for the design of gene-specific oligonucleotides (or simply, oligos) that can be used in PCR primer design, microarray experiments, and genomic library screening. In this paper, we study two complementary problems concerning the selection of short oligos, e.g., 20–50 bases, from a large database of tens of thousands of EST sequences: (i) selection of oligos each of which appears (exactly) in one EST sequence but does not appear (exactly or approximately) in any other EST sequence and (ii) selection of oligos that appear (exactly or approximately) in many ESTs. The first problem is called the unique oligo problem and has applications in PCR primer and microarray probe designs. The second is called the popular oligo problem and is useful in screening genomic libraries (such as BAC libraries) for gene-rich regions. We present an efficient algorithm to identify all unique oligos in the ESTs and an efficient heuristic algorithm to enumerate the most popular oligos. By taking into account the distribution of the frequencies of the words in the EST database, the algorithms have been carefully engineered to achieve remarkable running times on regular PCs. Each of the algorithms takes only a couple of hours (on a 1.2 GHz CPU, 1 GB RAM machine) to run on a dataset 28 Mbases of barley ESTs from the HarvEST database. We present simulation results on synthetic data and a preliminary analysis of the barley EST database.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Adams, M.D., Kelley, J.M., Gocayne, J.D., Dubnick, M., Polymeropoulos, M.H., Xiao, H., Merril, C.R., Wu, A., Olde, B., Moreno, R.F., Kerlavage, A.R., McCombie, W.R., Venter, J.C.: Complementary DNA sequencing: Expressed sequence tags and human genome project. Science 252 (1991) 1651–1656
Article Google Scholar
Boguski, M., Lowe, T., Tolstoshev, C.: dbEST-database for “expressed sequence tags”. Nat. Genet. 4 (1993) 332–3
Article Google Scholar
Yu, Y., Tomkins, J.P., Waugh, R., Frisch, D.A., Kudrna, D., Kleinhofs, A., Brueggeman, R.S., Muehlbauer, G.J., Wise, R.P., Wing, R.A.: A bacterial artificial chromosome library for barley (Hordeum vulgare L.) and the identification of clones containing putative resistance genes. Theor Appl Genet 101 (2000) 1093–1099
Article Google Scholar
Close, T., Wing, R., Kleinhofs, A., Wise, R.: Genetically and physically anchored EST resources for barley genomics. Barley Genetics Newsletter 31 (2001) 29–30
Google Scholar
Michalek, W., Weschke, W., Pleissner, K., Graner, A.: Est analysis in barley defines a unigene set comprising 4,000 genes. Theor Appl Genet 104 (2002) 97–103
Article Google Scholar
Huang, X., Madan, A.: CAP3: A DNA sequence assembly program. Genome Research 9 (1999) 868–877
Article Google Scholar
Han, C., Sutherland, R., Jewett, P., Campbell, M., Meincke, L., Tesmer, J., Mundt, M., Fawcett, J., Kim, U., Deaven, L., Doggett, N.: Construction of a BAC contig map of chromosome 16q by two-dimensional overgo hybridization. Genome Research 104 (2000) 714–721
Article Google Scholar
Barakat, A., Carels, N., Bernardi, G.: The distribution of genes in the genomes of gramineae. Proc. Natl. Acad. Sci. U.S.A. 94 (1997) 6857–6861
Article Google Scholar
Rahmann, S.: Rapid large-scale oligonucleotide selection for microarrays. In: Proceedings of the First IEEE Computer Society Bioinformatics Conference (CSB’02), IEEE Press (2002)
Google Scholar
Bailey, T.L., Elkan, C.: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning 21 (1995) 51–80
Google Scholar
Jonassen, I., Collins, J.F., Higgins, D.G.: Finding flexible patterns in unaligned protein sequences. Protein Science 4 (1995) 1587–1595
Article Google Scholar
Jonassen, I.: Efficient discovery of conserved patterns using a pattern graph. Comput. Appl. Biosci. 13 (1997) 509–522
Google Scholar
Rigoutsos, I., Floratos, A.: Combinatorial pattern discovery in biological sequences: The Teiresias algorithm. Bioinformatics 14 (1998) 55–67
Article Google Scholar
Hertz, G., Stormo, G.: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15 (1999) 563–577
Article Google Scholar
Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., Wootton, J.C.: Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262 (1993) 208–214
Article Google Scholar
Neuwald, A., Liu, J., Lawrence, C.: Gibbs motif sampling: Detecting bacterial outer membrane protein repeats. Protein Science 4 (1995) 1618–1632
Google Scholar
Pevzner, P.A., Sze, S.H.: Combinatorial approaches to finding subtle signals in DNA sequences. In: Proc. of the International Conference on Intelligent Systems for Molecular Biology, AAAI press, Menlo Park, CA (2000) 269–278
Google Scholar
Keich, Pevzner: Finding motifs in the twilight zone. In: Annual International Conference on Computational Molecular Biology, Washington, DC (2002) 195–204
Google Scholar
Tompa, M., Buhler, J.: Finding motifs using random projections. In: Annual International Conference on Computational Molecular Biology, Montreal, Canada (2001) 67–74
Google Scholar
Buhler, J., Tompa, M.: Finding motifs using random projections. J. Comput. Bio. 9 (2002) 225–242
Article Google Scholar
Apostolico, A., Bock, M.E., Lonardi, S.: Monotony of surprise and large-scale quest for unusual words (extended abstract). In Myers, G., Hannenhalli, S., Istrail, S., Pevzner, P., Waterman, M., eds.: Proc. of Research in Computational Molecular Biology (RECOMB), Washington, DC (2002)
Google Scholar
Eskin, E., Pevzner, P.A.: Finding composite regulatory patterns in DNA sequences. In: Proc. of the International Conference on Intelligent Systems for Molecular Biology, AAAI press, Menlo Park, CA (2002) Bioinformatics S181–S188
Google Scholar
Li, F., Stormo, G.D.: Selection of optimal DNA oligos for gene expression arrays. Bionformatics 17 (2001) 1067–1076
Article Google Scholar
Rouillard, J.M., Herbert, C.J., Zuker, M.: Oligoarray: Genome-scale oligonucleotide design for microarrays. Bioinformatics 18 (2002) 486–487
Article Google Scholar
Swofford, D. In: PAUP: Phylogenetic Analysis Using Parsimony version 4.0 beta 10. Sinauer Associates, Sunderland, Massachusetts (2002)
Google Scholar
Wesselink, J.J., de la Iglesia, B., James, S.A., Dicks, J.L., Roberts, I.N., Rayward-Smith, V.J.: Determining a unique defining DNA sequence for yeast species using hashing techniques. Bionformatics 18 (2002) 1004–1010
Article Google Scholar
Ito, M., Shimizu, K., Nakanishi, M., Hashimoto, A.: Polynomial-time algorithms for computing characteristic strings. In Crochemore, M., Gusfield, D., eds.: Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching. Number 807 in Lecture Notes in Computer Science, Asilomar, CA, Springer-Verlag, Berlin (1994) 274–288
Google Scholar

Download references

Author information

Authors and Affiliations

Dept. of Computer Science & Engineering, University of California, Riverside, CA, 92521
Jie Zheng, Tao Jiang & Stefano Lonardi
Department of Botany & Plant Sciences, University of California, Riverside, CA, 92521
Timothy J. Close

Authors

Jie Zheng
View author publications
You can also search for this author in PubMed Google Scholar
Timothy J. Close
View author publications
You can also search for this author in PubMed Google Scholar
Tao Jiang
View author publications
You can also search for this author in PubMed Google Scholar
Stefano Lonardi
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Depto. de Ciencias de la Computación, Universidad de Chile, Blanco Encalada 2120, Santiago, 6511224, Chile
Ricardo Baeza-Yates
Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana, Edificio “B”, ciudad universitaria, Morelia Michoacán, Mexico
Edgar Chávez
Université de Marne-la-Vallée, 77454, Marne-la-Vallée Cedex 2, France
Maxime Crochemore

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zheng, J., Close, T.J., Jiang, T., Lonardi, S. (2003). Efficient Selection of Unique and Popular Oligos for Large EST Databases. In: Baeza-Yates, R., Chávez, E., Crochemore, M. (eds) Combinatorial Pattern Matching. CPM 2003. Lecture Notes in Computer Science, vol 2676. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44888-8_28

Download citation

DOI: https://doi.org/10.1007/3-540-44888-8_28
Published: 27 May 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40311-1
Online ISBN: 978-3-540-44888-4
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics