Abstract
While the amount of structured data published on the Web keeps growing (fostered in particular by the Linked Open Data initiative), the Web still consists mainly of unstructured, in particular textual, content and is therefore a Web for human consumption. Thus, an important question is which techniques, whether semantic or not, are most suitable for enabling people to effectively access the large body of unstructured information available on the Web. While the hope is that semantic technologies can be combined with standard Information Retrieval approaches to enable more accurate retrieval, some researchers have argued against this view. They claim that only data-driven or inductive approaches are applicable to tasks requiring the organization of unstructured (mainly textual) data for retrieval purposes. We argue that the dichotomy between data-driven/inductive and semantic approaches is in fact a false one. We further argue that bottom-up or inductive approaches can be successfully combined with top-down or semantic approaches, and we illustrate this for a number of tasks such as Ontology Learning, Information Retrieval, Information Extraction and Text Mining.
Notes
- 1.
While it is true that fuzzy and non-monotonic extensions to description logics and OWL have been proposed, we take a purist view here and treat OWL as a non-fuzzy and monotonic logic.
- 2.
- 3.
- 4.
- 5.
TFIDF is a widely used statistical weight that measures how characteristic a term is for a document relative to a corpus. For a specific term and document, the TFIDF value is the product of the term frequency (TF), i.e. the number of occurrences of the term in the given document, and the inverse document frequency (IDF), i.e. the inverse of the number of documents in the corpus that contain the term (in practice usually log-scaled).
- 6.
Further extensions such as those by Bloehdorn and Moschitti [5] combine this idea with more complex so-called tree kernel functions for text structure.
- 7.
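The TFIDF definition in note 5 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `tfidf`, the toy corpus, and the `log_scaled` switch are assumptions introduced here. By default the sketch uses the plain reciprocal of the document frequency, as described in the note; the log-scaled variant, log(N/df), is the form more commonly used in practice.

```python
import math
from collections import Counter


def tfidf(term, document, corpus, log_scaled=False):
    """Return the TFIDF weight of `term` in `document`.

    `document` is a list of tokens and `corpus` is a list of such
    token lists (both are hypothetical inputs for this sketch).
    """
    # Term frequency: number of occurrences of the term in the document.
    tf = Counter(document)[term]
    # Document frequency: number of corpus documents containing the term.
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        # The term occurs nowhere in the corpus, so it carries no weight.
        return 0.0
    if log_scaled:
        # Common practical variant: log of (corpus size / document frequency).
        idf = math.log(len(corpus) / df)
    else:
        # Literal reading of the note: inverse of the document frequency.
        idf = 1.0 / df
    return tf * idf


# Toy corpus of three tokenized documents (illustrative only).
corpus = [
    "semantic web ontology web".split(),
    "text mining web".split(),
    "ontology learning text".split(),
]

print(tfidf("web", corpus[0], corpus))                   # 1.0
print(tfidf("web", corpus[0], corpus, log_scaled=True))
```

In the toy corpus, "web" occurs twice in the first document and appears in two of the three documents, so the unscaled weight is 2 × 1/2 = 1.0. A term that is frequent in one document but rare across the corpus receives a higher weight than a term that is spread evenly over all documents.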
References
Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In: Proceedings of SIGMOD Conference, pp. 207–216 (1993)
Basili, R., Moschitti, A., Pazienza, M.T., Zanzotto, F.M.: A contrastive approach to term extraction. In: Proceedings of the 4th Terminology and Artificial Intelligence Conference (TIA), May, pp. 119–128 (2001)
Berners-Lee, T., Hendler, J., Lassila, O.: The semantic web. Scientific American (May Issue) (2001)
Bloehdorn, S., Hotho, A.: Text classification by boosting weak learners based on terms and concepts. In: Proceedings of the 4th IEEE International Conference on Data Mining (ICDM)
Bloehdorn, S., Moschitti, A.: Combined syntactic and semantic kernels for text classification. In: Amati, G., Carpineto, C., Romano, G. (eds.) Proceedings of the 29th European Conference on Information Retrieval (ECIR), Rome, Italy, pp. 307–318. Springer, Berlin (2007)
Bloehdorn, S., Basili, R., Cammisa, M., Moschitti, A.: Semantic kernels for text classification based on topological measures of feature similarity. In: Proceedings of the 6th IEEE International Conference on Data Mining (ICDM), Hong Kong, China. IEEE Comput. Soc., Los Alamitos (2006)
Bloehdorn, S., Cimiano, P., Hotho, A.: Learning ontologies to improve text clustering and classification. In: Spiliopoulou, M., Kruse, R., Nürnberger, A., Borgelt, C., Gaul, W. (eds.) Proceedings of the 29th Annual Conference of the German Classification Society (GfKl), Magdeburg, Germany, 2005, pp. 334–341. Springer, Berlin (2006)
Bloehdorn, S., Cimiano, P., Duke, A., Haase, P., Heizmann, J., Thurlow, I., Völker, J.: Ontology-based question answering for digital libraries. In: Proceedings of the 11th European Conference on Research and Advanced Technologies for Digital Libraries (ECDL), September 2007. Lecture Notes in Computer Science, vol. 4675. Springer, Berlin (2007). ISBN 978-3-540-74850-2
Blohm, S., Cimiano, P.: Using the web to reduce data sparseness in pattern-based information extraction. In: Proceedings of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Warsaw, Poland, pp. 18–29. Springer, Berlin (2007)
Blohm, S., Cimiano, P., Stemle, E.: Harvesting relations from the web—quantifying the impact of filtering functions. In: Proceedings of the 22nd Conference on Artificial Intelligence (AAAI), pp. 1316–1323. AAAI Press, Menlo Park (2007)
Blohm, S., Buza, K., Cimiano, P., Schmidt-Thieme, L.: Relation extraction for the semantic web with taxonomic sequential patterns. In: Sugumaran, V., Gulla, J.A. (eds.) Applied Semantic Web Technologies. Taylor & Francis, London (2011, to appear)
Bonino, D., Corno, F.: Self-similarity metric for index pruning in conceptual vector space models. In: DEXA Workshops, pp. 225–229. IEEE Comput. Soc., Los Alamitos (2008)
Brants, T., Popat, A., Xu, P.J.D., Och, F.J.: Large language models in machine translation. In: Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2007)
Brewster, C., Ciravegna, F., Wilks, Y.: Background and foreground knowledge in dynamic ontology construction. In: Proceedings of the SIGIR Semantic Web Workshop (2003)
Brin, S.: Extracting patterns and relations from the world wide web. In: Selected Papers from the International Workshop on the World Wide Web and Databases (WebDB), London, UK, pp. 172–183. Springer, Berlin (1999). ISBN 3-540-65890-4
Brunzel, M.: The XTREEM methods for ontology learning from web documents. In: Buitelaar, P., Cimiano, P. (eds.) Ontology Learning and Population: Bridging the Gap Between Text and Knowledge, January. Frontiers in Artificial Intelligence and Applications, vol. 167, pp. 3–26. IOS Press, Amsterdam (2008)
Buitelaar, P., Cimiano, P., Magnini, B.: Ontology Learning from Text: Methods, Evaluation and Applications, July. Frontiers in Artificial Intelligence, vol. 123. IOS Press, Amsterdam (2005)
Chodorow, M., Byrd, R.J., Heidorn, G.E.: Extracting semantic hierarchies from a large on-line dictionary. In: Proceedings of the 23rd Annual Meeting on Association for Computational Linguistics (ACL), pp. 299–304. Association for Computational Linguistics, Stroudsburg (1985)
Cimiano, P.: Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer, Berlin (2006). ISBN 978-0-387-30632-2
Cimiano, P.: Ontology learning and population from text. PhD thesis, Universität Karlsruhe (TH), Germany (2006)
Cimiano, P., Völker, J.: Text2Onto—a framework for ontology learning and data-driven change discovery. In: Montoyo, A., Munoz, R., Metais, E. (eds.) Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems (NLDB), Alicante, Spain, June. Lecture Notes in Computer Science, vol. 3513, pp. 227–238. Springer, Berlin (2005)
Cimiano, P., Wenderoth, J.: Automatic acquisition of ranked qualia structures from the web. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), June, pp. 888–895 (2007)
Cimiano, P., Handschuh, S., Staab, S.: Towards the self-annotating web. In: Proceedings of the 13th International World Wide Web Conference (WWW), May, pp. 462–471. ACM, New York (2004). ISBN 1-58113-844-X
Cimiano, P., Hotho, A., Staab, S.: Comparing conceptual, divisive and agglomerative clustering for learning taxonomies from text. In: de Mántaras, R.L., Saitta, L. (eds.) Proceedings of the 16th European Conference on Artificial Intelligence (ECAI), Valencia, Spain, pp. 435–439. IOS Press, Amsterdam (2004). ISBN 1-58603-452-9
Cimiano, P., Hotho, A., Staab, S.: Learning concept hierarchies from text corpora using formal concept analysis. Journal of Artificial Intelligence Research 24, 305–339 (2005)
Cimiano, P., Ladwig, G., Staab, S.: Gimme the context: context-driven automatic semantic annotation with C-PANKOW. In: Ellis, A., Hagino, T. (eds.) Proceedings of the 14th International World Wide Web Conference (WWW), Chiba, Japan, May, pp. 332–341. ACM, New York (2005)
Cimiano, P., Pivk, A., Schmidt-Thieme, L., Staab, S.: Learning taxonomic relations from heterogeneous sources of evidence. In: Buitelaar, P., Cimiano, P., Magnini, B. (eds.) Ontology Learning from Text: Methods, Evaluation and Applications, July. Frontiers in Artificial Intelligence, vol. 123, pp. 59–73. IOS Press, Amsterdam (2005)
Cimiano, P., Schultz, A., Sizov, S., Sorg, P., Staab, S.: Explicit versus latent concept models for cross-language information retrieval. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1513–1518 (2009)
Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. Journal of the American Society for Information Science 41(6), 391–407 (1990)
Drouin, P.: Detection of domain specific terminology using corpora comparison. In: Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), pp. 79–82. European Language Resources Association, Paris (2004)
Drumm, C., Schmitt, M., Do, H.H., Rahm, E.: Quickmig: automatic schema matching for data migration projects. In: CIKM, pp. 107–116 (2007)
Dumais, S., Letsche, T., Littman, M., Landauer, T.: Automatic cross-language retrieval using latent semantic indexing. In: Proceedings of the AAAI Symposium on Cross-Language Text and Speech Retrieval (1997)
Ehrig, M.: Ontology Alignment: Bridging the Semantic Gap. Semantic Web and Beyond: Computing for Human Experience, vol. 4. Springer, Berlin (2007). ISBN 978-0-387-36501-5
Evans, R.: A framework for named entity recognition in the open domain. In: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP), pp. 137–144 (2003)
Feldman, R., Dagan, I.: Knowledge discovery in texts (KDT). In: Fayyad, U.M., Uthurusamy, R. (eds.) Proceedings of the First International Conference on Knowledge Discovery (KDD 1995), Montreal, Quebec, Canada, August 20–21, pp. 112–117. AAAI Press, Menlo Park (1995)
Fellbaum, C.: WordNet. An Electronic Lexical Database. MIT Press, Cambridge (1998)
Firth, J.R.: A Synopsis of Linguistic Theory, 1930–1955. Studies in Linguistic Analysis, pp. 1–32 (1957)
Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), pp. 1606–1611 (2007)
Gärdenfors, P.: Conceptual Spaces: The Geometry of Thought. MIT Press, London (2000)
Giesbrecht, E.: In search of semantic compositionality in vector spaces. In: ICCS, pp. 173–184 (2009)
Gonzalo, J., Verdejo, F., Chugur, I., Cigarran, J.: Indexing with WordNet synsets can improve text retrieval. In: Proceedings of the COLING/ACL ’98 Workshop on Usage of WordNet for NLP, Montreal, Canada, pp. 38–44 (1998)
Guthrie, L., Slator, B.M., Wilks, Y., Bruce, R.: Is there content in empty heads? In: Proceedings of the 13th Conference on Computational Linguistics (COLING), Morristown, NJ, USA, pp. 138–143. Association for Computational Linguistics, Stroudsburg (1990). ISBN 952-90-2028-7
Haase, P., Völker, J.: Ontology learning and reasoning—dealing with uncertainty and inconsistency. In: da Costa, P.C.G., d’Amato, C., Fanizzi, N., Laskey, K.B., Laskey, K.J., Lukasiewicz, T., Nickles, M., Pool, M. (eds.) Uncertainty Reasoning for the Semantic Web I. Lecture Notes in Artificial Intelligence, vol. 5327. Springer, Berlin (2008). ISBN 978-3-540-89764-4. ISWC International Workshop, URSW 2005–2007. Revised Selected and Invited Papers
Haase, P., Schnizler, B., Broekstra, J., Ehrig, M., Harmelen, F., Mika, M., Plechawski, M., Pyszlak, P., Siebes, R., Staab, S., Tempich, C.: Bibster—a semantics-based bibliographic peer-to-peer system. Journal of Web Semantics 2(1), 99–103 (2005)
Haase, P., Stojanovic, N., Sure, Y., Völker, J.: Personalized information retrieval in bibster, a semantics-based bibliographic peer-to-peer system. In: Tochtermann, K., Maurer, H. (eds.) Proceedings of the 5th International Conference on Knowledge Management (I-KNOW), July, pp. 104–111 (2005). JUCS, July
Halevy, A.Y., Norvig, P., Pereira, F.: The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2), 8–12 (2009)
Hall, J., Nilsson, J., Nivre, J., Megyesi, B., Nilsson, M., Saers, M.: Single malt or blended? A study in multilingual parser optimization. In: Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL) (2007)
Harris, Z.: Linguistic transformations for information retrieval. In: Proceedings of the International Conference on Scientific Information, vol. 2, Washington, DC (1959)
Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 14th Conference on Computational Linguistics, vol. 2. Association for Computational Linguistics, Stroudsburg (1992)
Hotho, A.: Clustern mit Hintergrundwissen. Dissertationen zur Künstlichen Intelligenz, vol. 286. Akademische Verlagsgesellschaft, Berlin (2004). In German. Originally published as PhD thesis, Universität Karlsruhe (TH), Karlsruhe, Germany (2004)
Hotho, A., Staab, S., Stumme, G.: Explaining text clustering results using semantic structures. In: Principles of Data Mining and Knowledge Discovery, 7th European Conference, PKDD 2003, Dubrovnik, Croatia, September 22–26, 2003. Lecture Notes in Computer Science, pp. 217–228. Springer, Berlin (2003)
Hotho, A., Staab, S., Stumme, G.: Ontologies improve text document clustering. In: Proc. of the ICDM 03, The 2003 IEEE International Conference on Data Mining, pp. 541–544 (2003)
Hotho, A., Nürnberger, A., Paaß, G.: A brief survey of text mining. LDV Forum—GLDV Journal for Computational Linguistics and Language Technology 20(1), 19–62 (2005). ISSN 0175-1336
Jaimes, A., Smith, J.R.: Semi-automatic, data-driven construction of multimedia ontologies. In: Proceedings of the International Conference on Multimedia and Expo (ICME), Washington, DC, USA, pp. 781–784. IEEE Comput. Soc., Los Alamitos (2003). ISBN 0-7803-7965-9
Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River (1988)
Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Computing Surveys 31(3), 264–323 (1999)
Jäschke, R., Hotho, A., Schmitz, C., Ganter, B., Stumme, G.: Discovering shared conceptualizations in folksonomies. Journal of Web Semantics 6(1), 38–53 (2008). ISSN 1570-8268
Kashyap, V., Ramakrishnan, C., Thomas, C., Sheth, A.: TaxaMiner: an experimentation framework for automated taxonomy bootstrapping. International Journal of Web and Grid Services 1(2), 240–266 (2005). ISSN 1741-1106
Katz, S.M., Gauvain, J.L., Lamel, L.F., Adda, G., Mariani, J.: Estimation of probabilities from sparse data for the language model component of a speech recognizer. International Journal of Pattern Recognition and Artificial Intelligence 8 (1987)
Kavalec, M., Svátek, V.: A study on automated relation labelling in ontology learning. In: Buitelaar, P., Cimiano, P., Magnini, B. (eds.) Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications, vol. 123, pp. 44–58. IOS Press, Amsterdam (2005)
Landauer, T.K., Dumais, S.T.: A solution to Plato’s problem: the latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review 104, 211–240 (1997)
Li, M., Du, X.-y., Wang, S.: Learning ontology from relational database. In: Proceedings of the 4th International Conference on Machine Learning and Cybernetics, pp. 3410–3415 (2005)
Lund, K., Burgess, C.: Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instrumentation, and Computers, 203–220 (1996)
Mädche, A.: Ontology learning for the semantic web. PhD thesis, Universität Karlsruhe (TH), Germany (2001)
Mädche, A., Staab, S.: Discovering conceptual relations from text. In: Horn, W. (ed.) Proceedings of the 14th European Conference on Artificial Intelligence (ECAI), August, pp. 321–325. IOS Press, Amsterdam (2000)
Mädche, A., Volz, R.: The text-to-onto ontology extraction and maintenance system. In: Workshop on Integrating Data Mining and Knowledge Management at the 1st International Conference on Data Mining (ICDM) (2001)
Meilicke, C., Völker, J., Stuckenschmidt, H.: Debugging mappings between lightweight ontologies. In: Proceedings of the 16th International Conference on Knowledge Engineering and Knowledge Management (EKAW), September. Lecture Notes in Artificial Intelligence, pp. 93–108. Springer, Berlin (2008). Best Paper Award!
Miller, G.A.: WordNet: a lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
Moench, E., Ullrich, M., Schnurr, H.-P., Angele, J.: Semanticminer—ontology-based knowledge retrieval. Journal of Universal Computer Science 9(7), 682–696 (2003)
Müller, C., Gurevych, I.: Using Wikipedia and Wiktionary in domain-specific information retrieval. In: Working Notes of the Annual CLEF Meeting (2008)
Newbold, N., Vrusias, B., Gillam, L.: Lexical ontology extraction using terminology analysis: automating video annotation. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odjik, J., Piperidis, S., Tapias, D. (eds.) Proceedings of the 6th International Language Resources and Evaluation (LREC), Marrakech, Morocco, May. ELRA, Paris (2008)
Ogata, N., Collier, N.: Ontology express: statistical and non-monotonic learning of domain ontologies from text. In: Proceedings of the Workshop on Ontology Learning and Population (OLP) at the 16th European Conference on Artificial Intelligence (ECAI), August (2004)
Papka, R., Allan, J.: On-line new event detection using single pass clustering. Technical report, University of Massachusetts, Amherst, MA, USA (1998)
Riloff, E., Jones, R.: Learning dictionaries for information extraction by multi-level bootstrapping. In: AAAI ’99/IAAI ’99: Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference Innovative Applications of Artificial Intelligence, pp. 474–479. American Association for Artificial Intelligence, Menlo Park (1999). ISBN 0-262-51106-1
Sabou, M.: Building web service ontologies. PhD thesis, Vrije Universiteit Amsterdam, The Netherlands (2006)
Sahlgren, M.: An introduction to random indexing. In: Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE) (2005)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management 24(5), 513–523 (1988)
Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York (1983)
Sanchez, D.: Domain ontology learning from the web. PhD thesis, Universitat Politècnica de Catalunya, Spain (2007)
Schmitz, C., Hotho, A., Jäschke, R., Stumme, G.: Mining association rules in folksonomies. In: Batagelj, V., Bock, H.-H., Ferligoj, A., Ziberna, A. (eds.) Data Science and Classification (Proc. IFCS 2006 Conference), Ljubljana, July. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 261–270. Springer, Berlin (2006). ISBN 978-3-540-34415-5. doi:10.1007/3-540-34416-0_28
Schütze, H.: Word space. In: Hanson, S., Cowan, J., Giles, C. (eds.) Advances in Neural Information Processing Systems 5. Morgan Kaufmann, San Mateo (1993)
Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1), 1–47 (2002)
Simperl, E., Tempich, C., Vrandečić, D.: A methodology for ontology learning. In: Buitelaar, P., Cimiano, P. (eds.) Ontology Learning and Population: Bridging the Gap Between Text and Knowledge, January. Frontiers in Artificial Intelligence and Applications, vol. 167, pp. 225–249. IOS Press, Amsterdam (2008)
Sorg, P., Cimiano, P.: Cross-lingual information retrieval with explicit semantic analysis. In: Working Notes of the Annual CLEF Meeting (2008)
Sorg, P., Cimiano, P.: An experimental comparison of explicit semantic analysis implementations for cross-language retrieval. In: Proceedings of 14th International Conference on Applications of Natural Language to Information Systems (NLDB), Saarbrücken (2009)
Stojanovic, N.: On the role of the librarian agent in ontology-based knowledge management systems. Journal of Universal Computer Science 9(7), 697–718 (2003)
Sure, Y., Hitzler, P., Eberhart, A., Studer, R.: The semantic web in one day. IEEE Intelligent Systems 20(3), 85–87 (2005). ISSN 1541-1672. doi:10.1109/MIS.2005.54
Völker, J.: Learning expressive ontologies. PhD thesis, Universität Karlsruhe (TH), Germany (2008)
Völker, J., Rudolph, S.: Lexico-logical acquisition of OWL DL axioms—an integrated approach to ontology refinement. In: Medina, R., Obiedkov, S. (eds.) Proceedings of the 6th International Conference on Formal Concept Analysis (ICFCA), February. Lecture Notes in Artificial Intelligence, vol. 4933, pp. 62–77. Springer, Berlin (2008)
Völker, J., Rudolph, S.: Fostering web intelligence by semi-automatic OWL ontology refinement. In: Proceedings of the 7th International Conference on Web Intelligence (WI), December. IEEE Press, New York (2008). Regular paper
Völker, J., Vrandečić, D., Sure, Y.: Automatic evaluation of ontologies (AEON). In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) Proceedings of the 4th International Semantic Web Conference (ISWC), November. Lecture Notes in Computer Science, vol. 3729, pp. 716–731. Springer, Berlin (2005)
Völker, J., Hitzler, P., Cimiano, P.: Acquisition of OWL DL axioms from lexical resources. In: Franconi, E., Kifer, M., May, W. (eds.) Proceedings of the 4th European Semantic Web Conference (ESWC), June. Lecture Notes in Computer Science, vol. 4519, pp. 670–685. Springer, Berlin (2007)
Völker, J., Vrandečić, D., Sure, Y., Hotho, A.: Learning disjointness. In: Franconi, E., Kifer, M., May, W. (eds.) Proceedings of the 4th European Semantic Web Conference (ESWC), June. Lecture Notes in Computer Science, vol. 4519, pp. 175–189. Springer, Berlin (2007)
Völker, J., Vrandečić, D., Sure, Y., Hotho, A.: AEON—an approach to the automatic evaluation of ontologies. Journal of Applied Ontology 3(1–2), 41–62 (2008). Special Issue on Ontological Foundations of Conceptual Modeling
Widdows, D.: Semantic vector products: some initial investigations. In: Proceedings of the Second AAAI Symposium on Quantum Interaction (QI) (2008)
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Bloehdorn, S. et al. (2011). Combining Data-Driven and Semantic Approaches for Text Mining. In: Fensel, D. (eds) Foundations for the Web of Information and Services. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-19797-0_7
DOI: https://doi.org/10.1007/978-3-642-19797-0_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-19796-3
Online ISBN: 978-3-642-19797-0
eBook Packages: Computer Science, Computer Science (R0)