Injecting Structured Data to Generative Topic Model in Enterprise Settings

Xiao, Han; Wang, Xiaojie; Du, Chao

doi:10.1007/978-3-642-05224-8_29

Han Xiao, Xiaojie Wang & Chao Du

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5828)

Included in the following conference series:

Asian Conference on Machine Learning


Abstract

Enterprises have steadily accumulated both structured and unstructured data as computing resources have improved. However, previous research on enterprise data mining often treats these two kinds of data independently and overlooks their mutual benefits. We explore an approach for incorporating a common type of structured data (i.e., the organigram) into a generative topic model. Our approach, the Partially Observed Topic model (POT), considers not only the unstructured words but also the structured information in its generation process. By integrating the structured data implicitly, the topic mixture over each document is partially observed during the Gibbs sampling procedure. This allows POT to learn topics in a targeted, directed manner, which makes it easy to tune and suitable for end-use applications. We evaluate the proposed model on a real-world dataset and show improved expressiveness over traditional LDA. In the task of document classification, POT also demonstrates more discriminative power than LDA.
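The abstract does not spell out how the organigram enters the sampler, so the following is only a rough illustration, not the authors' algorithm. It assumes one plausible reading of "partially observed": for some documents, structured data narrows the set of topics a document may use, and a standard collapsed Gibbs sampler for LDA then samples each word's topic from that narrowed set. The function name `gibbs_partially_observed`, the `allowed` argument, and the hyperparameter values are all hypothetical and do not come from the paper.

```python
import random


def gibbs_partially_observed(docs, allowed, n_topics, vocab_size,
                             alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA with per-document topic restrictions.

    docs:    list of documents, each a list of word ids in [0, vocab_size)
    allowed: list parallel to docs; allowed[d] is the list of topic ids that
             document d may use (an ASSUMED stand-in for organigram
             information), or None when nothing is observed for d.
    """
    rng = random.Random(seed)
    all_topics = list(range(n_topics))
    # Candidate topics per document: restricted if observed, else all topics.
    cand = [all_topics if a is None else list(a) for a in allowed]

    n_dk = [[0] * n_topics for _ in docs]               # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                # words per topic

    # Random initialisation, respecting the restrictions.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.choice(cand[d])
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Standard LDA conditional, evaluated only on allowed topics.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in cand[d]
                ]
                k = rng.choices(cand[d], weights=weights)[0]
                # Record the new assignment.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw


# Toy usage: documents 0 and 1 have an observed topic, document 2 does not.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 2, 1, 3]]
allowed = [[0], [1], None]
z, n_dk, n_kw = gibbs_partially_observed(docs, allowed, n_topics=2, vocab_size=4)
```

In this sketch the restricted documents anchor each topic to a known meaning, while unrestricted documents are free to mix topics, which is one way to picture learning topics in a directed manner.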



Author information

Authors and Affiliations

Technische Universität München, D-85748, Garching bei München, Germany
Han Xiao
Beijing University of Posts and Telecommunications, 100876, Beijing, China
Han Xiao & Xiaojie Wang
Beihang University, 100191, Beijing, China
Chao Du


Editor information

Editors and Affiliations

National Laboratory for Novel Software Technology, Nanjing University, 22 Hankou Road, 210093, Nanjing, China
Zhi-Hua Zhou
The Institute of Scientific and Industrial Research, Osaka University, 8-1 Mihogaoka, 567, Osaka, Ibaraki, Japan
Takashi Washio


About this paper

Cite this paper

Xiao, H., Wang, X., Du, C. (2009). Injecting Structured Data to Generative Topic Model in Enterprise Settings. In: Zhou, ZH., Washio, T. (eds) Advances in Machine Learning. ACML 2009. Lecture Notes in Computer Science, vol 5828. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-05224-8_29


DOI: https://doi.org/10.1007/978-3-642-05224-8_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-05223-1
Online ISBN: 978-3-642-05224-8
eBook Packages: Computer Science, Computer Science (R0)
