Genome Identification and Classification by Short Oligo Arrays

  • Stanislav Angelov
  • Boulos Harb
  • Sampath Kannan
  • Sanjeev Khanna
  • Junhyong Kim
  • Li-San Wang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3240)


We explore the problem of designing oligonucleotides that help locate organisms along a known phylogenetic tree. We develop a suffix-tree-based algorithm to find such short sequences efficiently. Our algorithm requires O(Nm) time and O(N) space in the worst case, where m is the number of genomes classified by the phylogeny and N is their total length. We implemented our algorithm and used it to find these discriminating sequences in both small and large phylogenies. We believe our algorithm will have wide applications, including high-throughput classification and identification, oligo array design that optimally differentiates genes in gene families, and markers for closely related strains and populations. It will also have scientific significance as a new way to assess the confidence in a given classification.
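To make the notion of a discriminating sequence concrete, here is a minimal brute-force sketch in Python. It is not the suffix-tree algorithm described above and makes no attempt to match its O(Nm) time bound. It assumes, purely for illustration, that a sequence discriminates an internal node when it is a substring of fixed length k that occurs in every genome below that node and in no genome outside it. The tree encoding (nested pairs of leaf names) and the names `kmers`, `leaves`, and `discriminating_kmers` are hypothetical.

```python
from typing import Dict, List, Set, Tuple, Union

# Assumed encoding: a leaf is a genome name, an internal node is a pair of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def leaves(tree: Tree) -> List[str]:
    """Return the leaf names below a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def discriminating_kmers(
    tree: Tree, genomes: Dict[str, str], k: int
) -> Dict[Tuple[str, ...], Set[str]]:
    """For each internal node, collect the k-mers shared by every genome
    below the node and absent from every genome outside it (assumed definition)."""
    index = {name: kmers(seq, k) for name, seq in genomes.items()}
    all_names = set(genomes)
    result: Dict[Tuple[str, ...], Set[str]] = {}

    def visit(node: Tree) -> None:
        if isinstance(node, str):
            return
        inside = leaves(node)
        # k-mers present in every genome under this node ...
        shared = set.intersection(*(index[name] for name in inside))
        # ... minus anything that also occurs in a genome outside the node.
        for name in all_names - set(inside):
            shared -= index[name]
        result[tuple(inside)] = shared
        for child in node:
            visit(child)

    visit(tree)
    return result


if __name__ == "__main__":
    toy_genomes = {"a": "ACGTAC", "b": "ACGTTT", "c": "GGGTTT"}
    toy_tree = (("a", "b"), "c")
    for clade, probes in discriminating_kmers(toy_tree, toy_genomes, 3).items():
        print(clade, sorted(probes))
```

This sketch materializes every k-mer of every genome, so it is only practical for toy inputs. The suffix-tree approach in the abstract exists precisely to avoid that cost and to handle substrings without fixing their length in advance.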







Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Stanislav Angelov (1)
  • Boulos Harb (1)
  • Sampath Kannan (1)
  • Sanjeev Khanna (1)
  • Junhyong Kim (2)
  • Li-San Wang (2)

  1. Department of Computer and Information Science, School of Engineering and Applied Sciences, University of Pennsylvania, Philadelphia, USA
  2. Department of Biology, School of Arts and Sciences, University of Pennsylvania, Philadelphia, USA
