Effects of Evolutionary Linguistics in Text Classification

  • Conference paper
Statistical Language and Speech Processing (SLSP 2015)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9449)

Abstract

We perform an empirical study to explore the role of evolutionary linguistics in the text classification problem. We conduct experiments on a real-world collection of more than 100,000 Dutch historical notary acts. The document collection spans over six centuries. Over such a long time period, some lexical terms changed significantly. Person names, professions and other information changed over time as well. Standard text classification techniques, which ignore the temporal information of the documents, might not produce optimal results in our case. Therefore, we analyse the temporal aspects of the corpus. We explore the effect of training and testing the model on different time periods. We use time periods that correspond to the main historical events and also apply clustering techniques to create time periods in a data-driven way. All experiments show a strong time dependency in our corpus. Exploiting this dependence, we extend standard classification techniques by combining different models trained on particular time periods and achieve overall accuracy above 90 % and macro-average indicators above 63 %.
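
The idea of combining models trained on particular time periods can be illustrated with a minimal sketch. It assumes scikit-learn, a TF-IDF plus logistic regression pipeline, a single hypothetical period boundary at the year 1800, and a few made-up toy documents. None of these choices are taken from the paper: the abstract does not specify the features, classifiers or period boundaries actually used, and the function names below are illustrative only.

```python
from bisect import bisect_right
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical period boundary (an assumption, not from the paper).
BOUNDARIES = [1800]


def period_of(year, boundaries=BOUNDARIES):
    """Return the index of the time period containing `year`."""
    return bisect_right(boundaries, year)


def train_period_models(texts, labels, years):
    """Fit one independent text classifier per time period."""
    buckets = defaultdict(lambda: ([], []))
    for text, label, year in zip(texts, labels, years):
        docs, targets = buckets[period_of(year)]
        docs.append(text)
        targets.append(label)
    return {
        period: make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(docs, targets)
        for period, (docs, targets) in buckets.items()
    }


def predict_by_period(models, texts, years):
    """Route each document to the model trained on its own period."""
    return [
        str(models[period_of(year)].predict([text])[0])
        for text, year in zip(texts, years)
    ]


# Made-up toy documents, for illustration only (not from the corpus).
train_texts = [
    "sale of a house and land",
    "last will and testament of the widow",
    "purchase of a farm property",
    "inheritance divided among the heirs",
]
train_labels = ["sale", "will", "sale", "will"]
train_years = [1650, 1700, 1850, 1900]

models = train_period_models(train_texts, train_labels, train_years)
predictions = predict_by_period(
    models,
    ["sale of land", "heirs receive the inheritance"],
    [1720, 1880],
)
print(predictions)
```

In this sketch each document is classified only by the model of its own period. A document whose period has no trained model would raise a `KeyError`, so a real system would need a fallback such as a model trained on the whole collection.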

Notes

  1. http://www.bhic.nl/.

  2. http://scikit-learn.org/.

  3. http://goo.gl/YZvP9q.

Acknowledgments

Mining Social Structures from Genealogical Data project (project no. 640.005.003), part of the CATCH program funded by the Netherlands Organization for Scientific Research (NWO).

Author information

Corresponding author

Correspondence to Julia Efremova.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Efremova, J., García, A.M., Zhang, J., Calders, T. (2015). Effects of Evolutionary Linguistics in Text Classification. In: Dediu, AH., Martín-Vide, C., Vicsi, K. (eds) Statistical Language and Speech Processing. SLSP 2015. Lecture Notes in Computer Science, vol 9449. Springer, Cham. https://doi.org/10.1007/978-3-319-25789-1_6

  • DOI: https://doi.org/10.1007/978-3-319-25789-1_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25788-4

  • Online ISBN: 978-3-319-25789-1

  • eBook Packages: Computer Science, Computer Science (R0)
