Text Categorization Improvement via User Interaction

Atroszko, Jakub; Szymański, Julian; Gil, David; Mora, Higinio

doi:10.1007/978-3-319-91262-2_24

Conference paper
First Online: 11 May 2018

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10842)

Abstract

In this paper, we propose an approach to improving text categorization through interaction with the user. The quality of categorization is defined in terms of the distribution of objects belonging to each class when projected onto a self-organizing map (SOM). For the experiments, we use articles and categories from a subset of Simple Wikipedia. We test three approaches to text representation. As a baseline, we use Bag-of-Words with Term Frequency-Inverse Document Frequency (TF-IDF) weighting, against which we evaluate two neural representations of words and documents: Word2Vec and Paragraph Vector. In each representation, we identify the subsets of features that are most useful for differentiating classes. These features are presented to the user, whose selection makes it possible to increase the coherence of articles that belong to the same category and thus are close on the SOM.
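
The baseline representation and the idea of surfacing class-discriminating features can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the function names `tfidf` and `rank_features`, and the scoring rule (mean TF-IDF weight inside a class minus mean weight outside it) are all assumptions chosen for brevity. The SOM projection and the neural representations are not covered here.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency times log inverse document frequency.
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def rank_features(vectors, labels, target):
    """Rank terms by mean weight inside `target` minus mean weight outside.

    Hypothetical scoring rule, used only to illustrate how candidate
    features could be ordered before being shown to a user.
    """
    inside = [v for v, y in zip(vectors, labels) if y == target]
    outside = [v for v, y in zip(vectors, labels) if y != target]
    vocab = {t for v in vectors for t in v}

    def mean(group, term):
        if not group:
            return 0.0
        return sum(v.get(term, 0.0) for v in group) / len(group)

    scores = {t: mean(inside, t) - mean(outside, t) for t in vocab}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))


# Toy corpus (assumed data, not taken from Simple Wikipedia).
docs = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the planet orbits the star".split(),
    "the star is a distant sun".split(),
]
labels = ["animals", "animals", "space", "space"]

vectors = tfidf(docs)
top_space = rank_features(vectors, labels, "space")[:3]
print(top_space)
```

In an interactive setting, a ranked list such as `top_space` would be the candidate set offered to the user, and only the features the user accepts would be kept in the document vectors.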

Author information

Authors and Affiliations

Department of Computer Systems Architecture, Gdansk University of Technology, Gdańsk, Poland
Jakub Atroszko & Julian Szymański
Department of Computer Science Technology and Computation, University of Alicante, Alicante, Spain
David Gil & Higinio Mora

Corresponding author

Correspondence to Julian Szymański.

Editor information

Editors and Affiliations

Częstochowa University of Technology, Częstochowa, Poland
Leszek Rutkowski
Częstochowa University of Technology, Częstochowa, Poland
Rafał Scherer
Częstochowa University of Technology, Częstochowa, Poland
Marcin Korytkowski
University of Alberta, Edmonton, AB, Canada
Witold Pedrycz
AGH University of Science and Technology, Kraków, Poland
Ryszard Tadeusiewicz
University of Louisville, Louisville, KY, USA
Jacek M. Zurada

About this paper

Cite this paper

Atroszko, J., Szymański, J., Gil, D., Mora, H. (2018). Text Categorization Improvement via User Interaction. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2018. Lecture Notes in Computer Science, vol 10842. Springer, Cham. https://doi.org/10.1007/978-3-319-91262-2_24

DOI: https://doi.org/10.1007/978-3-319-91262-2_24
Published: 11 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91261-5
Online ISBN: 978-3-319-91262-2
eBook Packages: Computer Science, Computer Science (R0)
