Abstract
In previous work, we developed several algorithms that use information extraction techniques to achieve high-precision text categorization. The relevancy signatures algorithm classifies texts using extraction patterns, and the augmented relevancy signatures algorithm classifies texts using extraction patterns and semantic features associated with role fillers (Riloff and Lehnert, 1994). These algorithms relied on hand-coded training data, including annotated texts and a semantic dictionary. In this chapter, we describe two advances that significantly improve the practicality of our approach. First, we explain how the extraction patterns can be generated automatically using only preclassified texts as input. Second, we present the word-augmented relevancy signatures algorithm that uses lexical items to represent domain-specific role relationships instead of semantic features. Using these techniques, we can automatically build text categorization systems that benefit from domain-specific natural language processing.
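The abstract does not say how extraction patterns are turned into classification decisions. As a rough sketch only, assume a threshold scheme: a pattern counts as a "relevancy signature" if it occurs in at least `min_freq` training texts and at least `min_prob` of those texts are relevant, and a new text is classified as relevant if it contains any such signature. The function names, thresholds, and string-encoded patterns below are hypothetical illustrations, not the authors' implementation. A word-augmented variant could use `(pattern, filler_word)` tuples as features in the same code.

```python
from collections import Counter

def learn_signatures(training, min_prob=0.8, min_freq=2):
    """Select features that reliably indicate relevant texts.

    training: iterable of (features, is_relevant) pairs, where features
    is a collection of hashable items (hypothetical pattern encodings).
    """
    total = Counter()     # number of texts containing each feature
    relevant = Counter()  # number of relevant texts containing each feature
    for features, is_relevant in training:
        for f in set(features):  # count each feature once per text
            total[f] += 1
            if is_relevant:
                relevant[f] += 1
    return {
        f for f, n in total.items()
        if n >= min_freq and relevant[f] / n >= min_prob
    }

def classify(features, signatures):
    """Label a text relevant if it contains at least one signature."""
    return any(f in signatures for f in features)

# Toy, made-up training data: pattern names are illustrative only.
training = [
    ({"<subject> was bombed", "<subject> was killed"}, True),
    ({"<subject> was bombed"}, True),
    ({"<subject> was killed"}, False),
    ({"<subject> was killed", "<subject> was born"}, False),
]
signatures = learn_signatures(training)
```

Under these assumptions, requiring only a single matching signature favors precision over recall, which is in line with the high-precision goal stated in the abstract.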
References
Borko, H. and Bernick, M. (1963). Automatic Document Classification. J. ACM 10 (2), pp. 151–162.
Fuhr, N.; Hartmann, S.; Lustig, G.; Schwantner, M.; and Tzeras, Konstadinos (1991). AIR/X–A Rule-Based Multistage Indexing System for Large Subject Fields. In Proceedings of RIAO 91, pp. 606–623.
Goodman, M. (1991). Prism: A Case-Based Telex Classifier. In Proceedings of the Second Annual Conference on Innovative Applications of Artificial Intelligence, AAAI Press, pp. 25–37.
Hayes, Philip J. and Weinstein, Steven P. (1991). Construe-TIS: A System for Content-Based Indexing of a Database of News Stories. In Proceedings of the Second Annual Conference on Innovative Applications of Artificial Intelligence. AAAI Press, pp. 49–64.
Hoyle, W. (1973). Automatic Indexing and Generation of Classification Systems by Algorithm. Information Storage and Retrieval 9 (4), pp. 233–242.
Huffman, S. (1996). Learning information extraction patterns from examples. In Wermter, Stefan; Riloff, Ellen; and Scheler, Gabriele, editors 1996, Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, Springer-Verlag, Berlin, pp. 246–260.
Kim, J. and Moldovan, D. (1993). Acquisition of Semantic Patterns for Information Extraction from Corpora. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, Los Alamitos, CA. IEEE Computer Society Press, pp. 171–176.
Lehnert, W. (1991). Symbolic/Subsymbolic Sentence Analysis: Exploiting the Best of Two Worlds. In Barnden, J. and Pollack, J., editors 1991, Advances in Connectionist and Neural Computation Theory, Vol. 1. Ablex Publishers, Norwood, NJ. pp. 135–164.
Maron, M. (1961). Automatic Indexing: An Experimental Inquiry. J. ACM 8, pp. 404–417.
MUC-3 Proceedings, (1991). Proceedings of the Third Message Understanding Conference (MUC-3). Morgan Kaufmann, San Mateo, CA.
MUC-4 Proceedings, (1992). Proceedings of the Fourth Message Understanding Conference (MUC-4). Morgan Kaufmann, San Mateo, CA.
MUC-5 Proceedings, (1993). Proceedings of the Fifth Message Understanding Conference (MUC-5). Morgan Kaufmann, San Francisco, CA.
Riloff, E. and Lehnert, W. (1994). Information Extraction as a Basis for High-Precision Text Classification. ACM Transactions on Information Systems 12 (3), pp. 296–333.
Riloff, E. (1994). Information Extraction as a Basis for Portable Text Classification Systems. Ph.D. Dissertation, CMPSCI Technical Report 95–04, Department of Computer Science, University of Massachusetts, Amherst, MA.
Riloff, E. (1995). Little Words Can Make a Big Difference for Text Classification. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 130–136.
Riloff, E. (1996a). An Empirical Study of Automated Dictionary Construction for Information Extraction in Three Domains. Artificial Intelligence 85, pp. 101–134.
Riloff, E. (1996b). Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of the Thirteenth National Conference on Artificial Intelligence. The AAAI Press/MIT Press, pp. 1044–1049.
Salton, G., editor (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ.
Soderland, S.; Fisher, D.; Aseltine, J.; and Lehnert, W. (1995). CRYSTAL: Inducing a conceptual dictionary. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pp. 1314–1319.
Stanfill, C. and Waltz, D. (1986). Toward Memory-Based Reasoning. Communications of the ACM 29 (12), pp. 1213–1228.
Turtle, Howard and Croft, W. Bruce (1991). Efficient Probabilistic Inference for Text Retrieval. In Proceedings of RIAO 91, pp. 644–661.
Copyright information
© 1999 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Riloff, E., Lorenzen, J. (1999). Extraction-Based Text Categorization: Generating Domain-Specific Role Relationships Automatically. In: Strzalkowski, T. (eds) Natural Language Information Retrieval. Text, Speech and Language Technology, vol 7. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2388-6_7
DOI: https://doi.org/10.1007/978-94-017-2388-6_7
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5209-4
Online ISBN: 978-94-017-2388-6
eBook Packages: Springer Book Archive