Data Mining and Text Mining for Science & Technology Research

  • Edda Leopold
  • Michael May
  • Gerhard Paaß


This paper gives an overview of the state of the art in data mining and text mining approaches that are useful for bibliometrics and patent databases. It explains the basics of data mining in a non-technical manner. Basic approaches from statistics and machine learning are introduced to clarify the groundwork of data mining methods. Text mining is introduced as a special case of data mining. Data and text mining applications that are especially useful for bibliometrics and for querying patent databases are reviewed, and three case studies are described.


Support Vector Machine, Data Mining, Support Vector Machine Classifier, Text Mining, Latent Semantic Analysis





Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Edda Leopold (1)
  • Michael May (1)
  • Gerhard Paaß (1)
  1. Fraunhofer Institut Autonome Intelligente Systeme, Sankt Augustin, Germany
