Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning

Chemudugunta, Chaitanya; Holloway, America; Smyth, Padhraic; Steyvers, Mark

doi:10.1007/978-3-540-88564-1_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5318)

Included in the following conference series:

International Semantic Web Conference

Abstract

Human-defined concepts are fundamental building blocks in constructing knowledge bases such as ontologies. Statistical learning techniques provide an alternative, automated approach to concept definition, driven by data rather than prior knowledge. In this paper we propose a probabilistic modeling framework that combines human-defined concepts and data-driven topics in a principled manner. The methodology is based on statistical topic models (also known as latent Dirichlet allocation models). We demonstrate the utility of this general framework in two ways. We first illustrate how the methodology can be used to automatically tag Web pages with concepts from a known set of concepts, without any need for labeled documents. We then perform a series of experiments that quantify how combining human-defined semantic knowledge with data-driven techniques leads to better language models than can be obtained with either alone.
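The abstract does not give the model's equations, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes that each human-defined concept behaves like a topic whose word distribution is restricted to the concept's own word list, while data-driven topics may emit any word, and it fits both with a small collapsed Gibbs sampler. The function name `concept_topic_gibbs`, the symmetric priors `alpha` and `beta`, and the toy documents are hypothetical choices made for this example.

```python
import random
from collections import defaultdict


def concept_topic_gibbs(docs, concepts, n_topics=2, alpha=0.1, beta=0.01,
                        n_iters=200, seed=0):
    """Toy collapsed Gibbs sampler over free topics plus word-restricted concepts.

    docs: list of non-empty token lists.
    concepts: dict mapping a concept name to the set of words it may emit.
    Requires n_topics >= 1 so that every word has at least one component.
    Returns (labels, theta, z): component names, per-document shares of
    tokens per component in the final sample, and per-token assignments.
    """
    rng = random.Random(seed)
    vocab = {w for doc in docs for w in doc}
    names = sorted(concepts)
    labels = [f"topic_{t}" for t in range(n_topics)] + names
    # Assumption: topics may emit any word, concepts only their listed words.
    allowed = [vocab] * n_topics + [concepts[c] & vocab for c in names]
    n_comp = len(labels)

    n_dk = [[0] * n_comp for _ in docs]               # document-component counts
    n_kw = [defaultdict(int) for _ in range(n_comp)]  # component-word counts
    n_k = [0] * n_comp                                # tokens per component

    def candidates(w):
        # Components that are permitted to generate word w.
        return [k for k in range(n_comp) if w in allowed[k]]

    def bump(d, k, w, delta):
        n_dk[d][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Random initial assignment that already respects the restrictions.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.choice(candidates(w))
            z[d].append(k)
            bump(d, k, w, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, z[d][i], w, -1)  # remove the token's current assignment
                ks = candidates(w)
                # Standard LDA-style conditional, normalised over each
                # component's own allowed vocabulary size.
                weights = [
                    (n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                    / (n_k[j] + beta * len(allowed[j]))
                    for j in ks
                ]
                k = rng.choices(ks, weights=weights)[0]
                z[d][i] = k
                bump(d, k, w, +1)

    theta = [[n_dk[d][k] / len(doc) for k in range(n_comp)]
             for d, doc in enumerate(docs)]
    return labels, theta, z


# Hypothetical toy data: two short "pages" and two hand-written concepts.
docs = [["goal", "team", "match", "news"], ["stock", "market", "bank", "news"]]
concepts = {"sports": {"goal", "team", "match"},
            "finance": {"stock", "market", "bank"}}
labels, theta, z = concept_topic_gibbs(docs, concepts)
for doc_theta in theta:
    print({name: round(p, 2) for name, p in zip(labels, doc_theta)})
```

Under these assumptions, the concept columns of `theta` can be read as unsupervised concept tags for each document: no labeled documents are used, only the concept word lists. The free topics absorb words, such as "news" here, that no concept covers.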

Author information

Authors and Affiliations

Department of Computer Science, University of California, Irvine, Irvine, CA
Chaitanya Chemudugunta, America Holloway & Padhraic Smyth
Department of Cognitive Science, University of California, Irvine, Irvine, CA
Mark Steyvers

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Wright State University, Colonel Glenn Way 3640, 454350001, Dayton, USA
Amit Sheth
Institut für Informatik, Universität Koblenz-Landau, Universitätsstr. 1, 56016, Koblenz, Germany
Steffen Staab
BBN Technologies, 48103, Ann Arbor, USA
Mike Dean
DoCoMo Communications Laboratories Europe GmbH, 80687, Munich, Germany
Massimo Paolucci
Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, S1 4DP, Sheffield, UK
Diana Maynard
CSEE Department, UMBC, 1000 Hilltop Circle, MD 21250, Baltimore, USA
Timothy Finin
Department of Computer Science and Engineering, Wright State University, 3640 Colonel Glenn Highway, OH 45435, Dayton, USA
Krishnaprasad Thirunarayan

Cite this paper

Chemudugunta, C., Holloway, A., Smyth, P., Steyvers, M. (2008). Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning. In: Sheth, A., et al. The Semantic Web - ISWC 2008. ISWC 2008. Lecture Notes in Computer Science, vol 5318. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88564-1_15

DOI: https://doi.org/10.1007/978-3-540-88564-1_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88563-4
Online ISBN: 978-3-540-88564-1
eBook Packages: Computer Science, Computer Science (R0)
