
Authorship Verification on Short Text Samples Using Stylometric Embeddings

  • Johannes Jasper
  • Philipp Berger
  • Patrick Hennig
  • Christoph Meinel
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11179)

Abstract

Given the increasing amount of textual data published online and our inability to reliably identify a person by their writing style, impersonation in social media applications has become a real-world problem. This work explores how deep learning and metric learning techniques can be applied to authorship verification: given a collection of text samples by one author and another document of unknown origin, determine whether the new document was written by the same author. Using fastText word embeddings, deep LSTMs, and triplet loss, we propose a system that learns stylometric embeddings of documents and measures their stylistic distance. Unlike most approaches, which work on entire documents, our system works on very short text samples of 1–3 sentences, comparable in length to typical social media posts. We evaluated our approach on the PAN 2014 challenge on authorship verification for English text. The presented system outperforms competing approaches in the PAN 2014 challenge when using 10 or more short text samples.
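
The metric-learning idea in the abstract can be illustrated with a minimal sketch. It assumes each text sample has already been mapped to a fixed-length embedding vector; the toy 2-D vectors, the Euclidean distance, the margin of 1.0, the decision threshold, and the function names `triplet_loss` and `same_author` are illustrative assumptions, not details taken from the paper. The loss shown is the common hinge form of triplet loss, which may differ from the exact formulation the authors used.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 1.0) -> float:
    """Hinge-style triplet loss (margin value is an assumption).

    The loss is zero once the anchor is closer to the positive sample
    (same author) than to the negative sample (different author) by at
    least `margin`.
    """
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


def same_author(known: Sequence[Vector], unknown: Vector,
                threshold: float) -> bool:
    """Hypothetical verification rule: accept the unknown sample if its
    mean distance to the known author's samples is within `threshold`."""
    mean_dist = sum(euclidean(k, unknown) for k in known) / len(known)
    return mean_dist <= threshold


if __name__ == "__main__":
    # Toy embeddings standing in for learned stylometric embeddings.
    known_samples = [[0.0, 0.0], [0.0, 2.0]]
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
    print(same_author(known_samples, [0.0, 1.0], threshold=1.5))
    print(same_author(known_samples, [3.0, 4.0], threshold=1.5))
```

In a trained system the embeddings would come from the learned network rather than being written by hand, and the threshold would be tuned on held-out data.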

Keywords

  • Authorship identification
  • Natural language processing
  • Metric learning
  • Embedding
  • Deep learning

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Johannes Jasper (1)
  • Philipp Berger (1)
  • Patrick Hennig (1)
  • Christoph Meinel (1)
  1. Hasso Plattner Institute, University of Potsdam, Potsdam, Germany
