Abstract
In this paper we introduce and evaluate a technique for applying latent Dirichlet allocation to supervised semantic categorization of documents. In our setup, for every category an own collection of topics is assigned, and for a labeled training document only topics from its category are sampled. Thus, compared to the classical LDA that processes the entire corpus in one, we essentially build separate LDA models for each category with the category-specific topics, and then these topic collections are put together to form a unified LDA model. For an unseen document the inferred topic distribution gives an estimation how much the document fits into the category.
We use this method for Web document classification. Our key results are 46% decrease in 1-AUC value in classification accuracy over tf.idf with SVM and 43% over the plain LDA baseline with SVM. Using a careful vocabulary selection method and a heuristic which handles the effect that similar topics may arise in distinct categories the improvement is 83% over tf.idf with SVM and 82% over LDA with SVM in 1-AUC.
Supported by the EU FP7 project LiWA - Living Web Archives and by grants OTKA NK 72845, ASTOR NKFP 2/004/05.
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., Harshman, R.A.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990)
Hofmann, T.: Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning 42(1), 177–196 (2001)
Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(5), 993–1022 (2003)
Bhattacharya, I., Getoor, L.: A latent dirichlet model for unsupervised entity resolution. In: SIAM International Conference on Data Mining (2006)
Xing, D., Girolami, M.: Employing Latent Dirichlet Allocation for fraud detection in telecommunications. Pattern Recognition Letters 28(13), 1727–1734 (2007)
Elango, P., Jayaraman, K.: Clustering Images Using the Latent Dirichlet Allocation Model (2005), http://www.cs.wisc.edu/~pradheep/
Sivic, J., Russell, B., Efros, A., Zisserman, A., Freeman, W.: Discovering Objects and their Localization in Images. In: Tenth IEEE International Conference on Computer Vision, ICCV 2005, vol. 1 (2005)
Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005. CVPR 2005, vol. 2 (2005)
Wei, X., Croft, W.: LDA-based document models for ad-hoc retrieval. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 178–185 (2006)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
Blei, D., Griffiths, T., Jordan, M., Tenenbaum, J.: Hierarchical topic models and the nested Chinese restaurant process. In: Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference, p. 17. Bradford Book (2004)
Teh, Y., Jordan, M., Beal, M., Blei, D.: Hierarchical dirichlet processes. Journal of the American Statistical Association 101(476), 1566–1581 (2006)
Biró, I., Szabó, J., Benczúr, A.: Latent Dirichlet Allocation in Web Spam Filtering. In: Proc. 4th AIRWeb (2008)
Heinrich, G.: Parameter estimation for text analysis. Technical report (2004)
Griffiths, T., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences 101(suppl. 1), 5228–5235 (2004)
Minka, T., Lafferty, J.: Expectation-propagation for the generative aspect model. In: Uncertainty in Artificial Intelligence, UAI (2002)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, San Francisco (2005)
Forman, G., Guyon, I., Elisseeff, A.: An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research 3(7-8), 1289–1305 (2003)
Li, J., Sun, M.: Scalable Term Selection for Text Categorization. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 774–782 (2007)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bíró, I., Szabó, J. (2009). Latent Dirichlet Allocation for Automatic Document Categorization. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2009. Lecture Notes in Computer Science(), vol 5782. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04174-7_28
Download citation
DOI: https://doi.org/10.1007/978-3-642-04174-7_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04173-0
Online ISBN: 978-3-642-04174-7
eBook Packages: Computer ScienceComputer Science (R0)