Novel Unsupervised Features for Czech Multi-label Document Classification

Brychcín, Tomáš; Král, Pavel

doi:10.1007/978-3-319-13647-9_8

Tomáš Brychcín^22,23 &
Pavel Král^22,23

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 8856))

Included in the following conference series:

Mexican International Conference on Artificial Intelligence

1762 Accesses
4 Citations

Abstract

This paper deals with automatic multi-label document classification in the context of a real application for the Czech News Agency. The main goal of this work consists in proposing novel fully unsupervised features based on an unsupervised stemmer, Latent Dirichlet Allocation and semantic spaces (HAL and COALS). The proposed features are integrated into the document classification task. Another interesting contribution is that these two semantic spaces have never been used in the context of document classification before. The proposed approaches are evaluated on a Czech newspaper corpus. We experimentally show that almost all proposed features significantly improve the document classification score. The corpus is freely available for research purposes.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

References

Berger, A.L., Pietra, V.J.D., Pietra, S.A.D.: A maximum entropy approach to natural language processing. Computational linguistics 22(1), 39–71 (1996)
Google Scholar
Blei, D.M., Ng, A.Y., Jordan, M.I., Lafferty, J.: Latent dirichlet allocation. Journal of Machine Learning Research 3, 2003 (2003)
Google Scholar
Bratko, A., Filipič, B.: Exploiting structural information for semi-structured document categorization. In: Information Processing and Management, pp. 679–694 (2004)
Google Scholar
Brychcín, T., Konopík, M.: Semantic spaces for improving language modeling. Computer Speech & Language 28(1), 192 (2014)
Article Google Scholar
Brychcín, T., Konopík, M.: Hps: High precision stemmer. Information Processing & Management 51(1), 68–91 (2015), http://www.sciencedirect.com/science/article/pii/S0306457314000843
Article Google Scholar
Chandrasekar, R., Srinivas, B.: Using syntactic information in document filtering: A comparative study of part-of-speech tagging and supertagging (1996)
Google Scholar
Della Pietra, S., Della Pietra, V., Lafferty, J.: Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(4), 380–393 (1997), http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=588021
Article Google Scholar
Forman, G.: An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research 3, 1289–1305 (2003)
MATH Google Scholar
Galavotti, L., Sebastiani, F., Simi, M.: Experiments on the use of feature selection and negative evidence in automated text categorization. In: Borbinha, J.L., Baker, T. (eds.) ECDL 2000. LNCS, vol. 1923, pp. 59–68. Springer, Heidelberg (2000), http://dl.acm.org/citation.cfm?id=646633.699638
Chapter Google Scholar
Gomez, J.C., Moens, M.-F.: Pca document reconstruction for email classification. Computer Statistics and Data Analysis 56(3), 741–751 (2012)
Article MathSciNet Google Scholar
Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101(Suppl. 1), 5228–5235 (2004)
Article Google Scholar
Habernal, I., Ptáček, T., Steinberger, J.: Sentiment analysis in czech social media using supervised machine learning. In: Proceedings of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 65–74. Association for Computational Linguistics, Atlanta (2013)
Google Scholar
Hrala, M., Král, P.: Multi-label document classification in czech. In: Habernal, I., Matousek, V. (eds.) TSD 2013. LNCS, vol. 8082, pp. 343–351. Springer, Heidelberg (2013)
Google Scholar
Hrala, M., Král, P.: Evaluation of the document classification approaches. In: Burduk, R., Jackowski, K., Kurzynski, M., Wozniak, M., Zolnierek, A. (eds.) CORES 2013. Advances in Intelligent Systems and Computing, vol. 226, pp. 875–884. Springer, Heidelberg (2013)
Chapter Google Scholar
Jurgens, D., Stevens, K.: The s-space package: An open source package for word space models. System Papers of the Association of Computational Linguistics (2010)
Google Scholar
Karypis, G.: Cluto - a clustering toolkit (2003), www.cs.umn.edu/~karypis/cluto
Konkol, M.: Brainy: A machine learning library. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2014, Part II. LNCS, vol. 8468, pp. 490–499. Springer, Heidelberg (2014)
Chapter Google Scholar
Lim, C.S., Lee, K.J., Kim, G.C.: Multiple sets of features for automatic genre classification of web documents. Information Processing and Management 41(5), 1263–1276 (2005), http://www.sciencedirect.com/science/article/pii/S0306457304000676
Article Google Scholar
Lund, K., Burgess, C.: Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods Instruments and Computers 28(2), 203–208 (1996)
Article Google Scholar
McCallum, A.K.: Mallet: A machine learning for language toolkit (2002), http://mallet.cs.umass.edu
Moschitti, A., Basili, R.: Complex linguistic features for text classification: A comprehensive study. In: McDonald, S., Tait, J.I. (eds.) ECIR 2004. LNCS, vol. 2997, pp. 181–196. Springer, Heidelberg (2004), http://dx.doi.org/10.1007/978-3-540-24752-4_14
Chapter Google Scholar
Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.: Text Classification from Labeled and Unlabeled Documents Using EM. Mach. Learn. 39(2-3), 103–134 (2000), http://dx.doi.org/10.1023/A:1007692713085
Article MATH Google Scholar
Powers, D.: From precision, recall and f-measure to roc., informedness, markedness & correlation. Journal of Machine Learning Technologies 2(1), 37–63 (2011)
MathSciNet Google Scholar
Ramage, D., Hall, D., Nallapati, R., Manning, C.D.: Labeled lda: A supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, EMNLP 2009, vol. 1, pp. 248–256. Association for Computational Linguistics, Stroudsburg (2009), http://dl.acm.org/citation.cfm?id=1699510.1699543
Google Scholar
Rohde, D.L.T., Gonnerman, L.M., Plaut, D.C.: An improved method for deriving word meaning from lexical co-occurrence. Cognitive Psychology 7, 573–605 (2004)
Google Scholar
Sebastiani, F.: Machine learning in automated text categorization. ACM computing surveys (CSUR) 34(1), 1–47 (2002)
Article Google Scholar
Tsoumakas, G., Katakis, I.: Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM) 3(3), 1–13 (2007)
Article Google Scholar
Wong, A.K., Lee, J.W., Yeung, D.S.: Using complex linguistic features in context-sensitive text classification techniques. In: Proceedings of 2005 International Conference on Machine Learning and Cybernetics, vol. 5, pp. 3183–3188. IEEE (2005)
Google Scholar
Yun, J., Jing, L., Yu, J., Huang, H.: A multi-layer text classification framework based on two-level representation model. Expert Systems with Applications 39(2), 2035–2046 (2012)
Article Google Scholar
Zhu, S., Ji, X., Xu, W., Gong, Y.: Multi-labelled classification using maximum entropy method. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 274–281. ACM (2005)
Google Scholar

Download references

Author information

Authors and Affiliations

Dept. of Computer Science & Engineering, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
Tomáš Brychcín & Pavel Král
NTIS - New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic
Tomáš Brychcín & Pavel Král

Authors

Tomáš Brychcín
View author publications
You can also search for this author in PubMed Google Scholar
Pavel Král
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Centro de Investigación en Computación, Instituto Politécnico Nacional, Av. Juan Dios Bátiz s/n, Col. Nueva Industrial Vallejo, 07738, Mexico City, Mexico
Alexander Gelbukh
Área Académica de Computación y Electrónica, Carretera Pachuca-Tulancingo, Universidad Autónoma del Estado de Hidalgo, Km. 4.5, Col. Carboneras, Mineral de la Reforma, 42180, Hidalgo, Mexico
Félix Castro Espinoza
Facultad de ciencias, Universidad Autónoma Nacional de México, Ciudad Universitaria, México DF, Mexico
Sofía N. Galicia-Haro

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Brychcín, T., Král, P. (2014). Novel Unsupervised Features for Czech Multi-label Document Classification. In: Gelbukh, A., Espinoza, F.C., Galicia-Haro, S.N. (eds) Human-Inspired Computing and Its Applications. MICAI 2014. Lecture Notes in Computer Science(), vol 8856. Springer, Cham. https://doi.org/10.1007/978-3-319-13647-9_8

Download citation

DOI: https://doi.org/10.1007/978-3-319-13647-9_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13646-2
Online ISBN: 978-3-319-13647-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics