Authorship Verification on Short Text Samples Using Stylometric Embeddings

  • Conference paper
Analysis of Images, Social Networks and Texts (AIST 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11179)

Abstract

Given the increasing amount of textual data published online and our inability to reliably identify a person by their writing style, impersonation on social media has become a real-world problem. This work explores how deep learning and metric learning techniques can be applied to authorship verification: given a collection of text samples by one author and another document of unknown origin, determine whether the new document was written by the same author. Using fastText word embeddings, deep LSTMs, and triplet loss, we propose a system that learns stylometric embeddings of documents and measures their stylistic distance. Unlike most approaches, which work on entire documents, our system works on very short text samples of 1–3 sentences, a length comparable to typical social media posts. We evaluated our approach on the PAN 2014 challenge on authorship verification for English text. The presented system outperforms competing approaches in the PAN 2014 challenge when using 10 or more short text samples.
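
As an illustration of the metric-learning idea named in the abstract, the following is a minimal sketch of a triplet loss and of a distance-based verification decision. It is not the authors' implementation: the Euclidean distance, the `margin` default, the mean-distance decision rule, and the `threshold` parameter are all assumptions made for this example, and the embeddings are plain lists of floats standing in for the output of a trained encoder.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    """Euclidean distance between two embeddings of equal dimension."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor: Vector, positive: Vector, negative: Vector,
                 margin: float = 1.0) -> float:
    """Hinge-style triplet loss.

    The loss is zero once the anchor is closer to the positive sample
    (same author) than to the negative sample (different author) by at
    least `margin`. The margin value is an assumption of this sketch.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def same_author(known: Sequence[Vector], unknown: Vector,
                threshold: float) -> bool:
    """Hypothetical decision rule: accept the unknown sample if its mean
    distance to the known author's embeddings is at most `threshold`."""
    if not known:
        raise ValueError("at least one known sample is required")
    mean_distance = sum(euclidean(k, unknown) for k in known) / len(known)
    return mean_distance <= threshold


if __name__ == "__main__":
    # Toy 2-D embeddings; real stylometric embeddings would come from a model.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
    print(same_author([[0.0, 0.0], [0.0, 2.0]], [0.0, 1.0], threshold=1.5))
```

In this sketch, training would push the loss toward zero so that samples by the same author cluster together, and verification then reduces to comparing a distance against a threshold tuned on held-out data.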

Notes

  1. https://fasttext.cc/docs/en/pretrained-vectors.html

Author information

Corresponding author

Correspondence to Johannes Jasper.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Jasper, J., Berger, P., Hennig, P., Meinel, C. (2018). Authorship Verification on Short Text Samples Using Stylometric Embeddings. In: van der Aalst, W., et al. Analysis of Images, Social Networks and Texts. AIST 2018. Lecture Notes in Computer Science, vol 11179. Springer, Cham. https://doi.org/10.1007/978-3-030-11027-7_7

  • DOI: https://doi.org/10.1007/978-3-030-11027-7_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-11026-0

  • Online ISBN: 978-3-030-11027-7

  • eBook Packages: Computer Science, Computer Science (R0)
