Extraction-Based Text Categorization: Generating Domain-Specific Role Relationships Automatically

Riloff, Ellen; Lorenzen, Jeffrey

doi:10.1007/978-94-017-2388-6_7

Part of the book series: Text, Speech and Language Technology (TLTB, volume 7)

Abstract

In previous work, we developed several algorithms that use information extraction techniques to achieve high-precision text categorization. The relevancy signatures algorithm classifies texts using extraction patterns, and the augmented relevancy signatures algorithm classifies texts using extraction patterns and semantic features associated with role fillers (Riloff and Lehnert, 1994). These algorithms relied on hand-coded training data, including annotated texts and a semantic dictionary. In this chapter, we describe two advances that significantly improve the practicality of our approach. First, we explain how the extraction patterns can be generated automatically using only preclassified texts as input. Second, we present the word-augmented relevancy signatures algorithm that uses lexical items to represent domain-specific role relationships instead of semantic features. Using these techniques, we can automatically build text categorization systems that benefit from domain-specific natural language processing.
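The abstract does not spell out how extraction patterns are turned into classification decisions. As a rough sketch only, assume each text has already been reduced to the set of extraction patterns that fired in it, that a pattern is kept as a "signature" when it is frequent enough and occurs mostly in relevant training texts, and that a new text is labelled relevant when it contains at least one kept signature. The pattern strings, the function names `train_signatures` and `classify`, and the thresholds `min_prob` and `min_freq` are all hypothetical and are not taken from the chapter.

```python
from collections import Counter

def train_signatures(texts, min_prob=0.8, min_freq=2):
    """Select patterns that reliably indicate relevance.

    texts: iterable of (patterns, is_relevant) pairs, where patterns is the
    set of extraction patterns that fired in one preclassified text.
    Both thresholds are illustrative assumptions.
    """
    total = Counter()      # number of texts in which each pattern fired
    relevant = Counter()   # number of relevant texts in which it fired
    for patterns, is_relevant in texts:
        for p in set(patterns):
            total[p] += 1
            if is_relevant:
                relevant[p] += 1
    # Keep a pattern if it is frequent enough and its estimated
    # probability of relevance meets the threshold.
    return {p for p, n in total.items()
            if n >= min_freq and relevant[p] / n >= min_prob}

def classify(patterns, signatures):
    """Label a text relevant if any of its patterns is a kept signature."""
    return any(p in signatures for p in patterns)

# Toy preclassified corpus (invented for illustration).
training = [
    ({"murder_of <victim>", "was_killed <subject>"}, True),
    ({"murder_of <victim>"}, True),
    ({"was_killed <subject>"}, False),
    ({"was_killed <subject>", "attack_on <target>"}, True),
    ({"attack_on <target>"}, True),
]
signatures = train_signatures(training)
```

With this toy corpus, `was_killed <subject>` is discarded because it fires in a non-relevant text too often, so a text containing only that pattern is not labelled relevant. Favoring a small set of reliable signatures over broad coverage is one plausible way to get the high-precision behavior the abstract describes.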

References

Borko, H. and Bernick, M. (1963). Automatic Document Classification. J. ACM 10 (2), pp. 151–162.
Fuhr, N.; Hartmann, S.; Lustig, G.; Schwantner, M.; and Tzeras, Konstadinos (1991). AIR/X–A Rule-Based Multistage Indexing System for Large Subject Fields. In Proceedings of RIAO 91, pp. 606–623.
Goodman, M. (1991). Prism: A Case-Based Telex Classifier. In Proceedings of the Second Annual Conference on Innovative Applications of Artificial Intelligence, AAAI Press, pp. 25–37.
Hayes, Philip J. and Weinstein, Steven P. (1991). Construe-TIS: A System for Content-Based Indexing of a Database of News Stories. In Proceedings of the Second Annual Conference on Innovative Applications of Artificial Intelligence. AAAI Press, pp. 49–64.
Hoyle, W. (1973). Automatic Indexing and Generation of Classification Systems by Algorithm. Information Storage and Retrieval 9 (4), pp. 233–242.
Huffman, S. (1996). Learning information extraction patterns from examples. In Wermter, Stefan; Riloff, Ellen; and Scheler, Gabriele, editors 1996, Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Springer-Verlag, Berlin, pp. 246–260.
Kim, J. and Moldovan, D. (1993). Acquisition of Semantic Patterns for Information Extraction from Corpora. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, Los Alamitos, CA. IEEE Computer Society Press, pp. 171–176.
Lehnert, W. (1991). Symbolic/Subsymbolic Sentence Analysis: Exploiting the Best of Two Worlds. In Barnden, J. and Pollack, J., editors 1991, Advances in Connectionist and Neural Computation Theory, Vol. 1. Ablex Publishers, Norwood, NJ. pp. 135–164.
Maron, M. (1961). Automatic Indexing: An Experimental Inquiry. J. ACM 8, pp. 404–417.
MUC-3 Proceedings, (1991). Proceedings of the Third Message Understanding Conference (MUC-3). Morgan Kaufmann, San Mateo, CA.
MUC-4 Proceedings, (1992). Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann, San Mateo, CA.
MUC-5 Proceedings, (1993). Proceedings of the Fifth Message Understanding Conference (MUC-5). Morgan Kaufmann, San Francisco, CA.
Riloff, E. and Lehnert, W. (1994). Information Extraction as a Basis for High-Precision Text Classification. ACM Transactions on Information Systems 12 (3), pp. 296–333.
Riloff, E. (1994). Information Extraction as a Basis for Portable Text Classification Systems. Ph.D. Dissertation, CMPSCI Technical Report 95–04, Department of Computer Science, University of Massachusetts, Amherst, MA.
Riloff, E. (1995). Little Words Can Make a Big Difference for Text Classification. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 130–136.
Riloff, E. (1996a). An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains. Artificial Intelligence 85, pp. 101–134.
Riloff, E. (1996b). Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence. The AAAI Press/MIT Press, pp. 1044–1049.
Salton, G., editor (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ.
Soderland, S.; Fisher, D.; Aseltine, J.; and Lehnert, W. (1995). CRYSTAL: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1314–1319.
Stanfill, C. and Waltz, D. (1986). Toward Memory-Based Reasoning. Communications of the ACM 29 (12), pp. 1213–1228.
Turtle, Howard and Croft, W. Bruce (1991). Efficient Probabilistic Inference for Text Retrieval. In Proceedings of RIAO 91, pp. 644–661.

Author information

Authors and Affiliations

Department of Computer Science, University of Utah, Salt Lake City, UT, 84112, USA
Ellen Riloff & Jeffrey Lorenzen

Editor information

Editors and Affiliations

General Electric, Research & Development, 12301, Schenectady, NY, USA
Tomek Strzalkowski

About this chapter

Cite this chapter

Riloff, E., Lorenzen, J. (1999). Extraction-Based Text Categorization: Generating Domain-Specific Role Relationships Automatically. In: Strzalkowski, T. (eds) Natural Language Information Retrieval. Text, Speech and Language Technology, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2388-6_7

DOI: https://doi.org/10.1007/978-94-017-2388-6_7
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5209-4
Online ISBN: 978-94-017-2388-6
eBook Packages: Springer Book Archive