Abstract
This paper gives an overview of the state of the art in data mining and text mining approaches that are useful for bibliometrics and patent databases. It explains the basics of data mining in a non-technical manner and introduces basic approaches from statistics and machine learning to clarify the groundwork of data mining methods. Text mining is introduced as a special case of data mining. Data and text mining applications that are especially useful for bibliometrics and for querying patent databases are reviewed, and three case studies are described.
Copyright information
© 2004 Kluwer Academic Publishers
About this chapter
Cite this chapter
Leopold, E., May, M., Paaß, G. (2004). Data Mining and Text Mining for Science & Technology Research. In: Moed, H.F., Glänzel, W., Schmoch, U. (eds) Handbook of Quantitative Science and Technology Research. Springer, Dordrecht. https://doi.org/10.1007/1-4020-2755-9_9
DOI: https://doi.org/10.1007/1-4020-2755-9_9
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-2702-4
Online ISBN: 978-1-4020-2755-0
eBook Packages: Humanities, Social Sciences and Law; History (R0)