Skip to main content

Data Mining and Text Mining for Science & Technology Research

  • Chapter
Handbook of Quantitative Science and Technology Research

Abstract

The goal of the paper is to give an overview on the state of the art of data mining and text mining approaches which are useful for bibliometrics and patent databases. The paper explains the basics of data mining in a non-technical manner. Basic approaches from statistics and machine learning are introduced in order to clarify the groundwork of data mining methods. Text mining is introduced as a special case of data mining. Data and text mining applications especially useful for bibliometrics and querying of patent databases are reviewed and three case studies are described.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 429.00
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 549.00
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book
USD 549.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  • Andrews, R., Geva S. (1994). Rule extraction from a constrained error backpropagation MLP. Australian Conference on Neural Networks, Brisbane, Queensland 1994 (pp. 9–12).

    Google Scholar 

  • Baeza-Yates, R., Ribeiro-Neto, B. (1999). Modern information retrieval. Addison-Wesley.

    Google Scholar 

  • Chen, H.H. (2002). Multilingual summarization and question answering. Workshop on Multilingual Summarization and Question Answering, COLING’02, Taipeh, Taiwan 2002.

    Google Scholar 

  • Chitashvili, R.J., Baayen, R.H. (1993). Word frequency distributions. In G. Altmann, L. Hřebíček (Eds.), Quantitative Text Analysis (pp. 54–135). Wvt: Trier.

    Google Scholar 

  • Deerwester, S., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A. (1990). Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41 (6), 391–407.

    Article  Google Scholar 

  • Dempster, A.P., Laird, N.M., Rubin, D.B. (1997). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39, 1–38.

    Google Scholar 

  • Diederich, J., Kindermann, J., Leopold, E., Paaß, G. (2003). Authorship attribution with Support Vector Machines. Applied Intelligence, 19 (1–2), 109–123.

    Google Scholar 

  • Dumais, S., Platt, J., Heckerman, D., Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management (pp. 148–155). ACM.

    Google Scholar 

  • Gövert, B., Lalmas, M., Fuhr, N. (1999). A probabilistic description-oriented approach for categorising Web documents. In Proceedings of CIKM-99, 8th ACM International Conference on Information and Knowledge Management, Kansas City, Missouri, 1999 (pp. 475–482). ACM.

    Google Scholar 

  • Guiter, H. (1974). Les rélations fréquence — longueur — sens des mots (langues romanes et anglais), In XIV congresso internazionale di linguistica e filologia romanza (pp. 373–381). Napoli.

    Google Scholar 

  • Hahn, U., Reimer, U. (1999). Knowledge-based text summarization. In: I. Mani, M. T. Maybury (Eds.), Advances in Automated Text Summarization (pp. 215–232). Cambridge, London: MIT-Press.

    Google Scholar 

  • Hand, D., Mannila, H., Smyth, P (2001). Principles of data mining. MIT Press.

    Google Scholar 

  • Hartigan, J.A. (1975). Clustering algorithms. New York: John Wiley.

    Google Scholar 

  • Hastie T., Tibshirani, R., Friedman, J. (2001). The elements of statistical learning. New York: Springer.

    Google Scholar 

  • Hofman, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42, 177–196.

    Google Scholar 

  • Holmes, D.I. (1998). The evolution of stylometry in Humanities Scholarship. Literary and Linguistic Computing, 13 (3), 111–117.

    Article  Google Scholar 

  • Holmes, D.I., Forsyth, R.S. (1995). The Federalist revisited: New directions in authorship attribution. Literary and Linguistic Computing, 10 (2), 111–127.

    Article  Google Scholar 

  • Kohonen, T. (1980). Content-adressable memories. Springer.

    Google Scholar 

  • Kohonen, T. (1995). Self-organising Maps. Springer.

    Google Scholar 

  • Kosala, R. Blockeel, H. (2000). Web mining research: A Survey. In P.S. Bradley, S. Sarawagi, U.M. Fayyad (Eds.), SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, ACM, 2 (pp. 1–15). ACM Press.

    Google Scholar 

  • Kraaij, W., Spitters, M., Hulth, A. (2002). Headline extraction based on a combination of uniand multidocument summarization techniques. In Proceedings of the ACL workshop on Automatic Summarization/Document Understanding Conference DUC 2002, June 2002, Philadelphia, USA.

    Google Scholar 

  • Joachims, T. (1998a). Making large-scale SVM learning practical, Technical report University of Dortmund.

    Google Scholar 

  • Joachims, T. (1998b). Text categorization with Support Vector Machines: learning with many relevant features. Proceedings of the 10th European Conference on Machine Learning, Springer Lecture Notes in Computer Science, Vol. 1398 (pp. 137–142). Springer.

    Google Scholar 

  • Landauer, T.K., Dumais, S.T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104 (2), 211–240.

    Article  Google Scholar 

  • Lang, K. (1995). Newsweeder: Learning to filter netnews. In A. Prieditis, S. Russell (Eds.), Proceedings of the 12th International Conferrence on Machine Learning (pp. 331–339). San Francisco: Morgan Kaufmann Publishers.

    Google Scholar 

  • Leopold, E., Kindermann, J. (2002). Text categorization with Support Vector Machines. How to represent texts in input space? Machine Learning, 46, 423–444.

    Article  Google Scholar 

  • Lowe, D., Matthews, R. (1995). Shakespeare vs. Fletcher: A stylometric analysis by radial basis functions. Computers and the Humanities, 29, 449–461.

    Article  Google Scholar 

  • Manning, C.D., Schütze, H.(1999). Foundations of statistical natural language processing. Cambridge MA, London: MIT Press.

    Google Scholar 

  • Mitchell, Tom (1997). Machine Learning. Boston et al.: McGraw-Hill.

    Google Scholar 

  • Mladenic, D., Grobelnik M. (1999). Feature selection for unbalanced class distribution and naive Bayes. In I. Bratko, S. Dzeroski (Eds.), Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999) (pp. 258–267). San Francisco: Morgan Kaufmann.

    Google Scholar 

  • Neumann, G., Schmeier, S. (2002). Shallow natural language technology and text mining. Künstliche Intelligenz, 2002 (2), 23–26.

    Google Scholar 

  • Neumann, G., Piskorski, J. (2002). A Shallow text processing core engine. Computational Intelligence, 18 (3), 451–476.

    Article  Google Scholar 

  • Nigam, K., McCallum, A.K., Thrun, S., Mitchel, T. (1999). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39 (1/2), 103–134.

    Google Scholar 

  • Paaß, G., Leopold, E., Larson, M., Kindermann, J., Eickeler, S. (2002). SVM Classification using sequences of phonemes and syllables. Tapio Elomaa & Heikki Mannila & Hannu Toivonen (Eds.), Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD 2002); August 19–23, 2002 Helsinki, Finland, Lecture Notes in Artificial Intelligence 2431 (pp. 373–384) Berlin, Heidelberg: Springer.

    Google Scholar 

  • Porter, M.F. (1980). An algorithm for suffix stripping. Program (Automated Library and Information Systems), 14 (3), 130–137.

    Google Scholar 

  • Rudman, J. (1998). The state of authorship attribution studies: some problems and solutions. Computers and the Humanities, 31, 351–365.

    Google Scholar 

  • Salton, G., McGill, M.J. 1983. Introduction to modern information retrieval. New York: McGraw Hill.

    Google Scholar 

  • Shapire, R.E., Singer, Y. (2000). BoosTexter: a boosting based system for text categorization. Machine Learning, 39, 135–168.

    Google Scholar 

  • Sparck-Jones, K. (1999). Automatic summarizing: factors and directions. In I. Mani, M.T. Maybury (Eds.), Advances in Automated Text Summarization.

    Google Scholar 

  • Srivastava, J., Cooley, R., Deshpande, M., Tan, P.-N. (2000). Web usage mining: discovery and applications of usage patterns from web data, SIGKDD Exploratins, 1 (2), 12–23.

    Google Scholar 

  • Stö ber, K., Wagner, P., Helbit, J., Köster, S., Stall, D., Thomae, M., Blauert, J., Hess, W., Hoffmann, R., Mangold, H. (2000). Speech synthesis by multilevel selection and concatenation of units from large speech Corpora. In: W. Wahlster (Ed.), Verb-mobil. Springer, 2000.

    Google Scholar 

  • Stricker, M., Vichot, F., Dreyfus, G., Wolinski, F. (2000). Vers la conception de filtres ďinformations efficaces. In Reconnaissance des Formes et Intelligence Artificielle (RFIA’ 2000) (pp. 129–137).

    Google Scholar 

  • Thisted, R., Efron, B. (1987). Did Shakespeare write a newly discovered poem? Biometrika, 74 (3), 445–55.

    Google Scholar 

  • Thisted, R. (1988). Elements of statistical computing. London: Chapman&Hall.

    Google Scholar 

  • Towsey, M., Diederich, J., Schellhammer, I., Chalup, S., Brugman, C. (1998). Natural language learning by recurrent neural networks: A comparison with probabilistic approaches. Computational natural language learning conference. Australian Natural Language Processing Fortnight. Sydney: Macquarie University, 15–17 Jan 1998.

    Google Scholar 

  • Tweedie, F.J., Singh, S., Holmes, D.I. (1996). Neural network applications in stylometry: the federalist paper. Computers and the Humanities, 30, 1–10.

    Article  Google Scholar 

  • van Rijsbergen, C.J. (1979). Information Retrieval. London, Boston: Butterworths.

    Google Scholar 

  • Vapnik, V.N. (1998). Statistical Learning Theory. New York et al.: Wiley & Sons.

    Google Scholar 

  • Weiss, S.M., Apt, C., Damerau, F., Johnson, D.E., Oles, F.J., Goetz, T., Hampp, T. (1999). Maximizing textmining performance. IEEE Intelligent Systems, 14 (4), 63–69.

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2004 Kluwer Academic Publishers

About this chapter

Cite this chapter

Leopold, E., May, M., Paaß, G. (2004). Data Mining and Text Mining for Science & Technology Research. In: Moed, H.F., Glänzel, W., Schmoch, U. (eds) Handbook of Quantitative Science and Technology Research. Springer, Dordrecht. https://doi.org/10.1007/1-4020-2755-9_9

Download citation

Publish with us

Policies and ethics