Extraction-Based Text Categorization: Generating Domain-Specific Role Relationships Automatically

  • Chapter
Natural Language Information Retrieval

Part of the book series: Text, Speech and Language Technology (TLTB, volume 7)

Abstract

In previous work, we developed several algorithms that use information extraction techniques to achieve high-precision text categorization. The relevancy signatures algorithm classifies texts using extraction patterns, and the augmented relevancy signatures algorithm classifies texts using extraction patterns and semantic features associated with role fillers (Riloff and Lehnert, 1994). These algorithms relied on hand-coded training data, including annotated texts and a semantic dictionary. In this chapter, we describe two advances that significantly improve the practicality of our approach. First, we explain how the extraction patterns can be generated automatically using only preclassified texts as input. Second, we present the word-augmented relevancy signatures algorithm that uses lexical items to represent domain-specific role relationships instead of semantic features. Using these techniques, we can automatically build text categorization systems that benefit from domain-specific natural language processing.
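
The abstract gives only a high-level description of the word-augmented relevancy signatures algorithm, so the following Python sketch is one possible reading of the idea, not the authors' implementation. It assumes that an extraction step has already turned each text into a set of (extraction pattern, role-filler word) pairs, and that a pair counts as a reliable signature when it occurs often enough in the preclassified training texts and is strongly associated with the relevant class. The function names, the thresholds `min_freq` and `min_prob`, and the sample patterns are all hypothetical.

```python
from collections import Counter


def train_signatures(texts, min_freq=2, min_prob=0.8):
    """Select (pattern, filler_word) pairs that reliably signal relevance.

    texts: list of (features, is_relevant) tuples, where features is a set
    of (pattern, filler_word) pairs assumed to come from an extraction step.
    min_freq and min_prob are illustrative thresholds, not published values.
    """
    total = Counter()      # number of training texts containing each pair
    relevant = Counter()   # number of relevant training texts containing it
    for features, is_relevant in texts:
        for feature in set(features):
            total[feature] += 1
            if is_relevant:
                relevant[feature] += 1
    # Keep pairs that are frequent enough and mostly seen in relevant texts.
    return {
        feature
        for feature, count in total.items()
        if count >= min_freq and relevant[feature] / count >= min_prob
    }


def classify(features, signatures):
    """Label a text relevant if it contains at least one kept signature."""
    return any(feature in signatures for feature in features)


# Invented toy data: each text is reduced to its extracted pairs.
training = [
    ({("<subject> was murdered", "mayor")}, True),
    ({("<subject> was murdered", "mayor"),
      ("bombing of <object>", "embassy")}, True),
    ({("bombing of <object>", "embassy")}, True),
    ({("<subject> was killed", "soldier")}, False),
    ({("<subject> was killed", "soldier")}, True),
]
signatures = train_signatures(training)
```

With this toy data, a new text containing the pair `("bombing of <object>", "embassy")` is classified as relevant, while one containing only `("<subject> was killed", "soldier")` is not, because that pair appears in relevant and irrelevant training texts equally often. Requiring a high conditional probability favors precision over recall, in keeping with the high-precision goal stated in the abstract.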

References

  • Borko, H. and Bernick, M. (1963). Automatic Document Classification. J. ACM 10 (2), pp. 151–162.

  • Fuhr, N.; Hartmann, S.; Lustig, G.; Schwantner, M.; and Tzeras, Konstadinos (1991). AIR/X – A Rule-Based Multistage Indexing System for Large Subject Fields. In Proceedings of RIAO 91, pp. 606–623.

  • Goodman, M. (1991). Prism: A Case-Based Telex Classifier. In Proceedings of the Second Annual Conference on Innovative Applications of Artificial Intelligence, AAAI Press, pp. 25–37.

  • Hayes, Philip J. and Weinstein, Steven P. (1991). Construe-TIS: A System for Content-Based Indexing of a Database of News Stories. In Proceedings of the Second Annual Conference on Innovative Applications of Artificial Intelligence. AAAI Press, pp. 49–64.

  • Hoyle, W. (1973). Automatic Indexing and Generation of Classification Systems by Algorithm. Information Storage and Retrieval 9 (4), pp. 233–242.

  • Huffman, S. (1996). Learning information extraction patterns from examples. In Wermter, Stefan; Riloff, Ellen; and Scheler, Gabriele, editors 1996, Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Springer-Verlag, Berlin, pp. 246–260.

  • Kim, J. and Moldovan, D. (1993). Acquisition of Semantic Patterns for Information Extraction from Corpora. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, Los Alamitos, CA. IEEE Computer Society Press, pp. 171–176.

  • Lehnert, W. (1991). Symbolic/Subsymbolic Sentence Analysis: Exploiting the Best of Two Worlds. In Barnden, J. and Pollack, J., editors 1991, Advances in Connectionist and Neural Computation Theory, Vol. 1. Ablex Publishers, Norwood, NJ. pp. 135–164.

  • Maron, M. (1961). Automatic Indexing: An Experimental Inquiry. J. ACM 8 (3), pp. 404–417.

  • MUC-3 Proceedings, (1991). Proceedings of the Third Message Understanding Conference (MUC-3). Morgan Kaufmann, San Mateo, CA.

  • MUC-4 Proceedings, (1992). Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann, San Mateo, CA.

  • MUC-5 Proceedings, (1993). Proceedings of the Fifth Message Understanding Conference (MUC-5). Morgan Kaufmann, San Francisco, CA.

  • Riloff, E. and Lehnert, W. (1994). Information Extraction as a Basis for High-Precision Text Classification. ACM Transactions on Information Systems 12 (3), pp. 296–333.

  • Riloff, E. (1994). Information Extraction as a Basis for Portable Text Classification Systems. Ph.D. Dissertation, CMPSCI Technical Report 95–04, Department of Computer Science, University of Massachusetts, Amherst, MA.

  • Riloff, E. (1995). Little Words Can Make a Big Difference for Text Classification. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 130–136.

  • Riloff, E. (1996a). An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains. Artificial Intelligence 85, pp. 101–134.

  • Riloff, E. (1996b). Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence. The AAAI Press/MIT Press, pp. 1044–1049.

  • Salton, G., editor (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ.

  • Soderland, S.; Fisher, D.; Aseltine, J.; and Lehnert, W. (1995). CRYSTAL: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1314–1319.

  • Stanfill, C. and Waltz, D. (1986). Toward Memory-Based Reasoning. Communications of the ACM 29 (12), pp. 1213–1228.

  • Turtle, Howard and Croft, W. Bruce (1991). Efficient Probabilistic Inference for Text Retrieval. In Proceedings of RIAO 91, pp. 644–661.

Copyright information

© 1999 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Riloff, E., Lorenzen, J. (1999). Extraction-Based Text Categorization: Generating Domain-Specific Role Relationships Automatically. In: Strzalkowski, T. (eds) Natural Language Information Retrieval. Text, Speech and Language Technology, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2388-6_7

  • DOI: https://doi.org/10.1007/978-94-017-2388-6_7

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-5209-4

  • Online ISBN: 978-94-017-2388-6

  • eBook Packages: Springer Book Archive
