Abstract
Enterprises have steadily accumulated both structured and unstructured data as computing resources have improved. However, previous research on enterprise data mining often treats these two kinds of data independently and so misses the benefits each can bring to the other. We explore an approach that incorporates a common type of structured data, the organigram (organizational chart), into a generative topic model. Our approach, the Partially Observed Topic model (POT), considers not only the unstructured words but also the structured information in its generative process. Because the structured data is integrated implicitly, the topic mixture of each document is partially observed during Gibbs sampling. This lets POT learn topics in a targeted, directed way, which makes the model easy to tune and well suited to end-use applications. We evaluate the model on a real-world dataset and show improved expressiveness over traditional LDA. In a document classification task, POT also demonstrates more discriminative power than LDA.
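The abstract does not spell out how the organigram enters the sampler, so the following is only a minimal sketch of the general idea of "partially observed" topics, not the authors' algorithm. It is a generic collapsed Gibbs sampler for LDA in which some topic assignments are clamped to known values and never resampled. The function name `gibbs_partially_observed`, the `observed` mask, the hyperparameter defaults, and the toy corpus are all hypothetical; in particular, the assumption that structured data yields per-token topic labels is ours.

```python
import random


def gibbs_partially_observed(docs, observed, num_topics, vocab_size,
                             alpha=0.1, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA with some topic assignments clamped.

    docs:     list of documents, each a list of integer word ids.
    observed: same shape as docs; a topic id where the topic is assumed
              known (hypothetically, derived from structured data), or
              None where it must be inferred.
    """
    rng = random.Random(seed)
    ndk = [[0] * num_topics for _ in docs]               # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                                # tokens per topic

    # Initialise: use the observed topic where given, otherwise a random one.
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for i, w in enumerate(doc):
            k = observed[d][i]
            if k is None:
                k = rng.randrange(num_topics)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                if observed[d][i] is not None:
                    continue  # clamped assignment: never resampled
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                # Standard collapsed-LDA conditional, up to a constant.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta)
                    / (nk[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    return z, ndk, nkw


# Toy corpus (hypothetical): the first token of the first two documents
# has a known topic; every other assignment is inferred.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
observed = [[0, None, None, None], [1, None, None, None], [None] * 4]
z, ndk, nkw = gibbs_partially_observed(docs, observed,
                                       num_topics=2, vocab_size=4)
```

Because the clamped tokens contribute to the count tables from the first sweep onward, they pull the remaining assignments toward the labelled topics, which is one plausible reading of learning topics "directionally".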
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xiao, H., Wang, X., Du, C. (2009). Injecting Structured Data to Generative Topic Model in Enterprise Settings. In: Zhou, ZH., Washio, T. (eds) Advances in Machine Learning. ACML 2009. Lecture Notes in Computer Science, vol 5828. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-05224-8_29
DOI: https://doi.org/10.1007/978-3-642-05224-8_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-05223-1
Online ISBN: 978-3-642-05224-8
eBook Packages: Computer Science (R0)