Abstract
Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Previous work in text mining has focused on the word or tag level. This paper presents an approach to performing text mining at the term level. The mining process starts by preprocessing the document collection and extracting terms from the documents. Each document is then represented by a set of terms and annotations characterizing the document. Terms and additional higher-level entities are then organized in a hierarchical taxonomy. We describe the Term Extraction module of the Document Explorer system and report an experimental evaluation performed on a set of 52,000 documents published by Reuters in the years 1995–1996.
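To make the idea of representing each document by a set of terms concrete, the following is a minimal sketch, not the Term Extraction module of Document Explorer. It assumes a much simpler technique than the paper's: candidate terms are runs of one or two adjacent words containing no stop word, and a candidate is kept only if it appears in at least `min_doc_freq` documents. The stop-word list, the function names, the threshold, and the sample documents are all hypothetical, and the taxonomy step is omitted.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would rely on linguistic filtering.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "by", "for", "is", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_terms(tokens, max_len=2):
    """Yield runs of 1..max_len adjacent tokens that contain no stop word."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if not any(tok in STOP_WORDS for tok in gram):
                yield " ".join(gram)


def extract_terms(documents, min_doc_freq=2):
    """Return one term set per document, keeping only candidates
    that occur in at least min_doc_freq documents."""
    per_doc = [set(candidate_terms(tokenize(doc))) for doc in documents]
    doc_freq = Counter(term for terms in per_doc for term in terms)
    vocabulary = {term for term, df in doc_freq.items() if df >= min_doc_freq}
    return [terms & vocabulary for terms in per_doc]


# Made-up sample documents, for illustration only.
docs = [
    "The central bank raised interest rates.",
    "Interest rates fell after the central bank meeting.",
    "Oil prices rose sharply.",
]
term_sets = extract_terms(docs)
```

Here the first two documents share multi-word candidates such as "central bank" and "interest rates", while the third document shares nothing with the others and ends up with an empty term set. A document-frequency threshold is only one possible filter; it stands in for whatever scoring the actual module uses.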
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Feldman, R. et al. (1998). Text mining at the term level. In: Żytkow, J.M., Quafafou, M. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1998. Lecture Notes in Computer Science, vol 1510. Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0094806
DOI: https://doi.org/10.1007/BFb0094806
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-65068-3
Online ISBN: 978-3-540-49687-8
eBook Packages: Springer Book Archive