Abstract
We perform an empirical study to explore the role of evolutionary linguistics in the text classification problem. We conduct experiments on a real-world collection of more than 100,000 Dutch historical notary acts. The document collection spans more than six centuries. Over such a long period, some lexical terms changed significantly. Person names, professions and other information changed over time as well. Standard text classification techniques, which ignore the temporal information of the documents, might not produce optimal results in our case. Therefore, we analyse the temporal aspects of the corpus. We explore the effect of training and testing the model on different time periods. We use time periods that correspond to the main historical events, and we also apply clustering techniques to create time periods in a data-driven way. All experiments show a strong time dependency in our corpus. Exploiting this dependency, we extend standard classification techniques by combining different models trained on particular time periods, and achieve overall accuracy above 90 % and macro-average indicators above 63 %.
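The idea of combining models trained on particular time periods can be illustrated with a minimal sketch. This is not the implementation used in the paper: the period boundaries in `PERIODS`, the tiny naive Bayes classifier, the helper names (`period_of`, `train_per_period`, `classify`) and the toy records are all assumptions made for illustration. The sketch trains one classifier per period and routes each document to the model of the period its year falls in.

```python
import math
from collections import Counter, defaultdict

# Hypothetical period boundaries (start year inclusive, end year exclusive).
# The paper derives its periods from historical events or clustering;
# these two ranges are placeholders only.
PERIODS = [(1400, 1700), (1700, 2000)]


def period_of(year):
    """Return the index of the period that contains the given year."""
    for index, (start, end) in enumerate(PERIODS):
        if start <= year < end:
            return index
    raise ValueError(f"year {year} is outside all periods")


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # documents per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            tokens = text.lower().split()
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_per_period(records):
    """Train one model per period from (year, text, label) records."""
    grouped = defaultdict(lambda: ([], []))
    for year, text, label in records:
        docs, labels = grouped[period_of(year)]
        docs.append(text)
        labels.append(label)
    return {p: NaiveBayes().fit(d, l) for p, (d, l) in grouped.items()}


def classify(models, year, text):
    """Route the document to the model trained on its own period."""
    return models[period_of(year)].predict(text)


# Invented toy records, not taken from the corpus.
toy_records = [
    (1650, "koop huys erf", "sale"),
    (1650, "testament erfgenaam", "will"),
    (1850, "koop huis erf", "sale"),
    (1850, "testament erfgenaam legaat", "will"),
]
models = train_per_period(toy_records)
```

Calling `classify(models, 1660, "koop huys")` then uses only the model trained on the earlier period, so vocabulary that is specific to one period does not have to compete with spellings from another.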
Acknowledgments
Mining Social Structures from Genealogical Data project (project no. 640.005.003), part of the CATCH program funded by the Netherlands Organization for Scientific Research (NWO).
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Efremova, J., García, A.M., Zhang, J., Calders, T. (2015). Effects of Evolutionary Linguistics in Text Classification. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science, vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_6
DOI: https://doi.org/10.1007/978-3-319-25789-1_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25788-4
Online ISBN: 978-3-319-25789-1
eBook Packages: Computer Science, Computer Science (R0)