Abstract
Human-defined concepts are fundamental building blocks in constructing knowledge bases such as ontologies. Statistical learning techniques provide an alternative, automated approach to concept definition, driven by data rather than prior knowledge. In this paper we propose a probabilistic modeling framework that combines human-defined concepts with data-driven topics in a principled manner. The methodology is based on applications of statistical topic models (also known as latent Dirichlet allocation models). We demonstrate the utility of this general framework in two ways. We first illustrate how the methodology can be used to automatically tag Web pages with concepts from a known set, without any need for labeled documents. We then perform a series of experiments that quantify how combining human-defined semantic knowledge with data-driven techniques leads to better language models than can be obtained with either alone.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chemudugunta, C., Holloway, A., Smyth, P., Steyvers, M. (2008). Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning. In: Sheth, A., et al. The Semantic Web - ISWC 2008. ISWC 2008. Lecture Notes in Computer Science, vol 5318. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88564-1_15
DOI: https://doi.org/10.1007/978-3-540-88564-1_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88563-4
Online ISBN: 978-3-540-88564-1
eBook Packages: Computer Science, Computer Science (R0)