Latent Dirichlet Allocation for Automatic Document Categorization

Bíró, István; Szabó, Jácint

doi:10.1007/978-3-642-04174-7_28

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5782)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

In this paper we introduce and evaluate a technique for applying latent Dirichlet allocation (LDA) to supervised semantic categorization of documents. In our setup, each category is assigned its own collection of topics, and for a labeled training document only topics from its category are sampled. Thus, whereas classical LDA processes the entire corpus as a whole, we essentially build a separate LDA model for each category with category-specific topics, and then combine these topic collections into a unified LDA model. For an unseen document, the inferred topic distribution estimates how well the document fits each category.
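The idea of category-restricted topic sampling can be illustrated with a minimal sketch. The abstract does not say how inference is done, so the collapsed Gibbs sampler below, the function names `train` and `category_scores`, and the hyperparameters `alpha`, `beta`, and `topics_per_cat` are all assumptions made for illustration, not the authors' implementation. Training samples each token's topic only from the topics owned by the document's category; an unseen document is then sampled over the union of all topics, and its score for a category is the total topic mass on that category's topics.

```python
import random
from collections import defaultdict


def train(docs, labels, topics_per_cat=2, alpha=0.5, beta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling where a document may only use its category's topics.

    docs: list of token lists; labels: one category name per document.
    All hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    cats = sorted(set(labels))
    # Category number i owns the contiguous block of topic ids below.
    topic_range = {c: list(range(i * topics_per_cat, (i + 1) * topics_per_cat))
                   for i, c in enumerate(cats)}
    n_topics = len(cats) * topics_per_cat
    vocab_size = len({w for doc in docs for w in doc})
    n_tw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    n_t = [0] * n_topics                                # tokens per topic
    n_dt = [defaultdict(int) for _ in docs]             # document-topic counts

    def bump(d, w, t, delta):
        n_tw[t][w] += delta
        n_t[t] += delta
        n_dt[d][t] += delta

    # Random initial assignment, restricted to the document's own category.
    z = [[rng.choice(topic_range[lab]) for _ in doc] for doc, lab in zip(docs, labels)]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            bump(d, w, t, +1)

    for _ in range(iters):
        for d, (doc, lab) in enumerate(zip(docs, labels)):
            allowed = topic_range[lab]
            for i, w in enumerate(doc):
                bump(d, w, z[d][i], -1)
                weights = [(n_dt[d][k] + alpha) * (n_tw[k][w] + beta)
                           / (n_t[k] + vocab_size * beta) for k in allowed]
                z[d][i] = rng.choices(allowed, weights)[0]
                bump(d, w, z[d][i], +1)
    return {"n_tw": n_tw, "n_t": n_t, "topic_range": topic_range,
            "vocab_size": vocab_size}


def category_scores(doc, model, alpha=0.5, beta=0.01, iters=30, seed=0):
    """Infer topics for an unseen document over ALL topics, then sum per category."""
    rng = random.Random(seed)
    n_tw, n_t, vocab_size = model["n_tw"], model["n_t"], model["vocab_size"]
    topics = list(range(len(n_t)))

    def phi(k, w):
        # Smoothed word probability under topic k, held fixed during inference.
        return (n_tw[k].get(w, 0) + beta) / (n_t[k] + vocab_size * beta)

    z = [rng.choice(topics) for _ in doc]
    n_k = [0] * len(topics)
    for t in z:
        n_k[t] += 1
    for _ in range(iters):
        for i, w in enumerate(doc):
            n_k[z[i]] -= 1
            weights = [(n_k[k] + alpha) * phi(k, w) for k in topics]
            z[i] = rng.choices(topics, weights)[0]
            n_k[z[i]] += 1
    total = len(doc) + alpha * len(topics)
    theta = [(n_k[k] + alpha) / total for k in topics]
    return {c: sum(theta[k] for k in ks) for c, ks in model["topic_range"].items()}
```

In this sketch the per-category scores sum to 1 and could serve as features for a downstream classifier; the scores come from a single Gibbs sample, so a more careful version would average over several samples.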

We use this method for Web document classification. Our key results are a 46% decrease in 1-AUC (one minus the area under the ROC curve, used here as the classification error measure) relative to tf.idf with SVM and a 43% decrease relative to the plain LDA baseline with SVM. With a careful vocabulary selection method and a heuristic that handles similar topics arising in distinct categories, the improvement in 1-AUC is 83% over tf.idf with SVM and 82% over LDA with SVM.
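To make the metric concrete, the following sketch computes a relative decrease in 1-AUC. The function name `relative_error_reduction` is hypothetical, and the AUC values in the usage line are invented purely to show the arithmetic; they are not figures from the paper, which reports only the percentages.

```python
def relative_error_reduction(auc_baseline, auc_new):
    """Fractional decrease in 1 - AUC when moving from a baseline to a new model."""
    err_base = 1.0 - auc_baseline
    err_new = 1.0 - auc_new
    if err_base <= 0:
        raise ValueError("baseline AUC must be below 1")
    return (err_base - err_new) / err_base


# Hypothetical AUC values, chosen only to illustrate the calculation:
# the error 1 - AUC falls from 0.10 to 0.054, a 46% relative decrease.
print(round(relative_error_reduction(0.90, 0.946), 2))  # prints 0.46
```

Measuring the change relative to 1-AUC rather than AUC emphasizes gains when the baseline AUC is already close to 1.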

Supported by the EU FP7 project LiWA - Living Web Archives and by grants OTKA NK 72845, ASTOR NKFP 2/004/05.

Author information

Authors and Affiliations

Data Mining and Web Search Research Group, Computer and Automation Research Institute of the Hungarian Academy of Sciences, Budapest, Hungary
István Bíró & Jácint Szabó

Editor information

Editors and Affiliations

NICTA, Locked Bag 8001, Canberra, 2601, Australia and Helsinki Institute of IT, Finland
Wray Buntine
Dept. of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, 1000, Ljubljana, Slovenia
Marko Grobelnik & Dunja Mladenić
The Centre for Computational Statistics and Machine Learning, Department of Computer Science, University College London, Gower St., WC1E 6BT, London, UK
John Shawe-Taylor

About this paper

Cite this paper

Bíró, I., Szabó, J. (2009). Latent Dirichlet Allocation for Automatic Document Categorization. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2009. Lecture Notes in Computer Science, vol 5782. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04174-7_28

DOI: https://doi.org/10.1007/978-3-642-04174-7_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04173-0
Online ISBN: 978-3-642-04174-7
eBook Packages: Computer Science, Computer Science (R0)
