Efficient Selection of Unique and Popular Oligos for Large EST Databases

  • Jie Zheng
  • Timothy J. Close
  • Tao Jiang
  • Stefano Lonardi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2676)


EST databases have grown exponentially in recent years and now represent the largest collection of genetic sequences. An important application of these databases is that they contain information useful for the design of gene-specific oligonucleotides (or simply, oligos) that can be used in PCR primer design, microarray experiments, and genomic library screening. In this paper, we study two complementary problems concerning the selection of short oligos, e.g., 20–50 bases, from a large database of tens of thousands of EST sequences: (i) selection of oligos each of which appears (exactly) in one EST sequence but does not appear (exactly or approximately) in any other EST sequence and (ii) selection of oligos that appear (exactly or approximately) in many ESTs. The first problem is called the unique oligo problem and has applications in PCR primer and microarray probe designs. The second is called the popular oligo problem and is useful in screening genomic libraries (such as BAC libraries) for gene-rich regions. We present an efficient algorithm to identify all unique oligos in the ESTs and an efficient heuristic algorithm to enumerate the most popular oligos. By taking into account the distribution of the frequencies of the words in the EST database, the algorithms have been carefully engineered to achieve remarkable running times on regular PCs. Each of the algorithms takes only a couple of hours (on a 1.2 GHz CPU, 1 GB RAM machine) to run on a dataset 28 Mbases of barley ESTs from the HarvEST database. We present simulation results on synthetic data and a preliminary analysis of the barley EST database.


Hash Table Table Entry UPGMA Tree Computational Molecular Biology Unique Oligos 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.


  1. 1.
    Adams, M.D., Kelley, J.M., Gocayne, J.D., Dubnick, M., Polymeropoulos, M.H., Xiao, H., Merril, C.R., Wu, A., Olde, B., Moreno, R.F., Kerlavage, A.R., McCombie, W.R., Venter, J.C.: Complementary DNA sequencing: Expressed sequence tags and human genome project. Science 252 (1991) 1651–1656CrossRefGoogle Scholar
  2. 2.
    Boguski, M., Lowe, T., Tolstoshev, C.: dbEST-database for “expressed sequence tags”. Nat. Genet. 4 (1993) 332–3CrossRefGoogle Scholar
  3. 3.
    Yu, Y., Tomkins, J.P., Waugh, R., Frisch, D.A., Kudrna, D., Kleinhofs, A., Brueggeman, R.S., Muehlbauer, G.J., Wise, R.P., Wing, R.A.: A bacterial artificial chromosome library for barley (Hordeum vulgare L.) and the identification of clones containing putative resistance genes. Theor Appl Genet 101 (2000) 1093–1099CrossRefGoogle Scholar
  4. 4.
    Close, T., Wing, R., Kleinhofs, A., Wise, R.: Genetically and physically anchored EST resources for barley genomics. Barley Genetics Newsletter 31 (2001) 29–30Google Scholar
  5. 5.
    Michalek, W., Weschke, W., Pleissner, K., Graner, A.: Est analysis in barley defines a unigene set comprising 4,000 genes. Theor Appl Genet 104 (2002) 97–103CrossRefGoogle Scholar
  6. 6.
    Huang, X., Madan, A.: CAP3: A DNA sequence assembly program. Genome Research 9 (1999) 868–877CrossRefGoogle Scholar
  7. 7.
    Han, C., Sutherland, R., Jewett, P., Campbell, M., Meincke, L., Tesmer, J., Mundt, M., Fawcett, J., Kim, U., Deaven, L., Doggett, N.: Construction of a BAC contig map of chromosome 16q by two-dimensional overgo hybridization. Genome Research 104 (2000) 714–721CrossRefGoogle Scholar
  8. 8.
    Barakat, A., Carels, N., Bernardi, G.: The distribution of genes in the genomes of gramineae. Proc. Natl. Acad. Sci. U.S.A. 94 (1997) 6857–6861CrossRefGoogle Scholar
  9. 9.
    Rahmann, S.: Rapid large-scale oligonucleotide selection for microarrays. In: Proceedings of the First IEEE Computer Society Bioinformatics Conference (CSB’02), IEEE Press (2002)Google Scholar
  10. 10.
    Bailey, T.L., Elkan, C.: Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning 21 (1995) 51–80Google Scholar
  11. 11.
    Jonassen, I., Collins, J.F., Higgins, D.G.: Finding flexible patterns in unaligned protein sequences. Protein Science 4 (1995) 1587–1595CrossRefGoogle Scholar
  12. 12.
    Jonassen, I.: Efficient discovery of conserved patterns using a pattern graph. Comput. Appl. Biosci. 13 (1997) 509–522Google Scholar
  13. 13.
    Rigoutsos, I., Floratos, A.: Combinatorial pattern discovery in biological sequences: The Teiresias algorithm. Bioinformatics 14 (1998) 55–67CrossRefGoogle Scholar
  14. 14.
    Hertz, G., Stormo, G.: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15 (1999) 563–577CrossRefGoogle Scholar
  15. 15.
    Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., Wootton, J.C.: Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262 (1993) 208–214CrossRefGoogle Scholar
  16. 16.
    Neuwald, A., Liu, J., Lawrence, C.: Gibbs motif sampling: Detecting bacterial outer membrane protein repeats. Protein Science 4 (1995) 1618–1632Google Scholar
  17. 17.
    Pevzner, P.A., Sze, S.H.: Combinatorial approaches to finding subtle signals in DNA sequences. In: Proc. of the International Conference on Intelligent Systems for Molecular Biology, AAAI press, Menlo Park, CA (2000) 269–278Google Scholar
  18. 18.
    Keich, Pevzner: Finding motifs in the twilight zone. In: Annual International Conference on Computational Molecular Biology, Washington, DC (2002) 195–204Google Scholar
  19. 19.
    Tompa, M., Buhler, J.: Finding motifs using random projections. In: Annual International Conference on Computational Molecular Biology, Montreal, Canada (2001) 67–74Google Scholar
  20. 20.
    Buhler, J., Tompa, M.: Finding motifs using random projections. J. Comput. Bio. 9 (2002) 225–242CrossRefGoogle Scholar
  21. 21.
    Apostolico, A., Bock, M.E., Lonardi, S.: Monotony of surprise and large-scale quest for unusual words (extended abstract). In Myers, G., Hannenhalli, S., Istrail, S., Pevzner, P., Waterman, M., eds.: Proc. of Research in Computational Molecular Biology (RECOMB), Washington, DC (2002)Google Scholar
  22. 22.
    Eskin, E., Pevzner, P.A.: Finding composite regulatory patterns in DNA sequences. In: Proc. of the International Conference on Intelligent Systems for Molecular Biology, AAAI press, Menlo Park, CA (2002) Bioinformatics S181–S188Google Scholar
  23. 23.
    Li, F., Stormo, G.D.: Selection of optimal DNA oligos for gene expression arrays. Bionformatics 17 (2001) 1067–1076CrossRefGoogle Scholar
  24. 24.
    Rouillard, J.M., Herbert, C.J., Zuker, M.: Oligoarray: Genome-scale oligonucleotide design for microarrays. Bioinformatics 18 (2002) 486–487CrossRefGoogle Scholar
  25. 25.
    Swofford, D. In: PAUP: Phylogenetic Analysis Using Parsimony version 4.0 beta 10. Sinauer Associates, Sunderland, Massachusetts (2002)Google Scholar
  26. 26.
    Wesselink, J.J., de la Iglesia, B., James, S.A., Dicks, J.L., Roberts, I.N., Rayward-Smith, V.J.: Determining a unique defining DNA sequence for yeast species using hashing techniques. Bionformatics 18 (2002) 1004–1010CrossRefGoogle Scholar
  27. 27.
    Ito, M., Shimizu, K., Nakanishi, M., Hashimoto, A.: Polynomial-time algorithms for computing characteristic strings. In Crochemore, M., Gusfield, D., eds.: Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching. Number 807 in Lecture Notes in Computer Science, Asilomar, CA, Springer-Verlag, Berlin (1994) 274–288Google Scholar

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Jie Zheng
    • 1
  • Timothy J. Close
    • 2
  • Tao Jiang
    • 1
  • Stefano Lonardi
    • 1
  1. 1.Dept. of Computer Science & EngineeringUniversity of CaliforniaRiverside
  2. 2.Department of Botany & Plant SciencesUniversity of CaliforniaRiverside

Personalised recommendations