The TELLTALE dynamic hypertext environment: Approaches to scalability

  • Claudia Pearce
  • Ethan Miller
Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1326)

Abstract

Methods and tools for finding documents relevant to a user's needs in document corpora can be found in the information retrieval, library science, and hypertext communities. Typically, these systems provide retrieval capabilities for fairly static corpora; their algorithms depend on the language for which they are written (e.g., English); and they do not perform well when presented with misspelled words or text that has been degraded by optical character recognition (OCR). In this chapter, we present the TELLTALE system. TELLTALE is a dynamic hypertext environment that provides full-text search through a hypertext-style user interface. By using several techniques based on n-grams (n-character sequences of text), it handles text corpora that may be garbled by OCR or transmission errors and that may contain languages other than English. We identify the methods and techniques that we have applied to the n-gram data structures, and we discuss algorithms that we used to enhance the scalability of the TELLTALE Dynamic Hypertext System.
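
To make the n-gram idea concrete, the following is a minimal sketch of comparing texts by their overlapping character n-grams. It is not TELLTALE's implementation: the function names `ngram_profile` and `cosine_similarity`, the choice of n = 5, the whitespace and case normalization, and the use of plain cosine similarity over raw counts are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 5) -> Counter:
    """Count overlapping n-character sequences in text (hypothetical helper)."""
    # Assumption: fold case and collapse whitespace so that layout
    # differences do not produce spurious n-grams.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two n-gram count vectors."""
    # Counter returns 0 for n-grams that are missing from b.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


clean = ngram_profile("optical character recognition")
garbled = ngram_profile("optica1 charactcr recogniti0n")  # simulated OCR errors
unrelated = ngram_profile("hash tables and posting lists")

# The garbled text should still score closer to the clean text than
# an unrelated phrase does.
print(cosine_similarity(clean, garbled) > cosine_similarity(clean, unrelated))
```

Because each single-character error disturbs only the few n-grams that overlap it, the garbled string above still shares about half of its 5-grams with the clean one, while the unrelated phrase shares none. No dictionary or language-specific rule is involved, which is why the same comparison applies unchanged to text in other languages.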

Keywords

  • Hash Table
  • Optical Character Recognition
  • Text Retrieval
  • Posting List
  • Similarity Link
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 1997

Authors and Affiliations

  • Claudia Pearce (1)
  • Ethan Miller (2)
  1. U.S. Department of Defense, USA
  2. University of Maryland Baltimore County, USA