Effects of Evolutionary Linguistics in Text Classification

Efremova, Julia; García, Alejandro Montes; Zhang, Jianpeng; Calders, Toon

doi:10.1007/978-3-319-25789-1_6

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9449)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing

Abstract

We perform an empirical study to explore the role of evolutionary linguistics in the text classification problem. We conduct experiments on a real-world collection of more than 100,000 Dutch historical notary acts. The document collection spans over six centuries. Over such a long period some lexical terms changed significantly. Person names, professions and other information changed over time as well. Standard text classification techniques that ignore the temporal information of the documents might not produce optimal results in our case. Therefore, we analyse the temporal aspects of the corpus. We explore the effect of training and testing the model on different time periods. We use time periods that correspond to the main historical events and also apply clustering techniques to create time periods in a data-driven way. All experiments show a strong time dependency in our corpus. Exploiting this dependence, we extend standard classification techniques by combining different models trained on particular time periods and achieve overall accuracy above 90 % and macro-average indicators above 63 %.
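
The idea of combining models trained on particular time periods can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny naive Bayes classifier, the `PeriodEnsemble` class, the single boundary year 1800, and the four toy documents are all assumptions invented for this example, and routing each document to the model of its own period is only one simple way of combining period-specific models.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(docs, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


class PeriodEnsemble:
    """One classifier per time period, plus a time-agnostic fallback model."""

    def __init__(self, boundaries):
        # Hypothetical boundary years; [1800] gives "before 1800" and "1800 onwards".
        self.boundaries = sorted(boundaries)

    def period_of(self, year):
        return sum(year >= b for b in self.boundaries)

    def fit(self, docs, years, labels):
        grouped = defaultdict(lambda: ([], []))
        for text, year, label in zip(docs, years, labels):
            period_docs, period_labels = grouped[self.period_of(year)]
            period_docs.append(text)
            period_labels.append(label)
        self.models = {p: NaiveBayes().fit(d, l) for p, (d, l) in grouped.items()}
        self.fallback = NaiveBayes().fit(docs, labels)  # ignores time entirely
        return self

    def predict(self, text, year):
        model = self.models.get(self.period_of(year), self.fallback)
        return model.predict(text)


# Invented toy data (not taken from the notary-act corpus).
docs = ["akte transport huis", "testament erfgenaam huis",
        "akte verkoop huis", "transport goederen schip"]
years = [1650, 1660, 1850, 1860]
labels = ["sale", "will", "sale", "shipping"]

ensemble = PeriodEnsemble(boundaries=[1800]).fit(docs, years, labels)
early = ensemble.predict("transport huis", 1700)
late = ensemble.predict("transport", 1870)
global_guess = ensemble.fallback.predict("transport")
print(early, late, global_guess)
```

In this toy setting the token "transport" points to different classes before and after the assumed boundary, so the period-specific models and the time-agnostic fallback can disagree on the same text. That disagreement is the kind of time dependency the abstract refers to.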

Acknowledgments

Supported by the Mining Social Structures from Genealogical Data project (project no. 640.005.003), part of the CATCH program funded by the Netherlands Organization for Scientific Research (NWO).

Author information

Authors and Affiliations

Eindhoven University of Technology, Eindhoven, The Netherlands
Julia Efremova, Alejandro Montes García, Jianpeng Zhang & Toon Calders
Université Libre de Bruxelles, Brussels, Belgium
Toon Calders

Corresponding author

Correspondence to Julia Efremova.

Editor information

Editors and Affiliations

Research Group on Mathematical Linguistics, Rovira i Virgili University, Tarragona, Spain
Adrian-Horia Dediu
Research Group on Mathematical Linguistics, Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary
Klára Vicsi

Cite this paper

Efremova, J., García, A.M., Zhang, J., Calders, T. (2015). Effects of Evolutionary Linguistics in Text Classification. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science, vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_6

DOI: https://doi.org/10.1007/978-3-319-25789-1_6
Published: 17 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25788-4
Online ISBN: 978-3-319-25789-1
eBook Packages: Computer Science, Computer Science (R0)
