Deep Dive into Authorship Verification of Email Messages with Convolutional Neural Network

  • Conference paper
Information Management and Big Data (SIMBig 2018)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 898)

Abstract

Authorship verification is the task of determining whether a specific individual wrote a given text, which reduces naturally to a binary classification problem. This paper deals with the authorship verification of short email messages. Hereafter, we use “message” to mean the content transmitted by email. The proposed method implements the binary classification with a sequence-to-sequence (seq2seq) model and trains a convolutional neural network (CNN) on positive examples (written by the “target” user) and negative examples (written by “someone else”). Unlike previously published works, which represent text by numerous stylometric features, the proposed method requires neither advanced text preprocessing nor explicit feature extraction. All messages are submitted to the CNN “as is,” after replacing every word by its ID number and padding to the maximal length. The CNN learns the most appropriate features with backpropagation and then performs classification. Experiments performed on the Enron dataset using the TensorFlow framework show that the CNN classifier verifies message authorship very accurately.
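The input preparation described in the abstract (words replaced by ID numbers, messages padded to the maximal length, binary labels for the target user versus everyone else) can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: whitespace tokenization, the reserved `PAD_ID` and `UNK_ID` values, and the helper names `build_vocab`, `encode`, and `label` are all assumptions made for the example.

```python
from collections import Counter

PAD_ID = 0  # assumed ID reserved for padding positions
UNK_ID = 1  # assumed ID for words missing from the vocabulary


def build_vocab(messages):
    """Map every distinct whitespace-separated token to an integer ID."""
    counts = Counter(tok for msg in messages for tok in msg.split())
    # IDs start at 2 because 0 and 1 are reserved above.
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=2)}


def encode(messages, vocab, max_len=None):
    """Replace words by IDs and right-pad every message to one length."""
    ids = [[vocab.get(tok, UNK_ID) for tok in msg.split()] for msg in messages]
    if max_len is None:
        # Default to the longest message, as in "padding to the maximal length".
        max_len = max(len(seq) for seq in ids)
    return [seq[:max_len] + [PAD_ID] * (max_len - len(seq)) for seq in ids]


def label(authors, target):
    """1 for messages written by the target user, 0 for someone else."""
    return [1 if author == target else 0 for author in authors]


# Hypothetical toy data, not taken from the Enron dataset.
messages = ["please call me", "call me back today please"]
authors = ["alice", "bob"]

vocab = build_vocab(messages)
encoded = encode(messages, vocab)
labels = label(authors, target="alice")
```

Every row of `encoded` has the same length, so the rows can be stacked into a single integer matrix and passed to an embedding layer.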

Notes

  1. https://securelist.com/spam-and-phishing-in-q1-2017/78221/

  2. We kept the default settings of the CNN model in the TensorFlow framework, which are as follows: number of embedding dimensions is 128; filter sizes are 3, 4, and 5; number of filters is 128; dropout probability is 0.5; L2 regularization lambda is 0; batch size is 64.

  3. The best accuracy of 89% for 40 users from the Enron dataset was reported in [5].

  4. We ran our model with 500 epochs.

  5. Obtained from training on one of the users.
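To make the settings in note 2 concrete, the sketch below records them in a small configuration object and derives a few layer sizes from them. It assumes a Kim-style sentence-classification CNN as in references [10] and [2]: stride-1 convolutions without padding, max-over-time pooling, and 128 filters for each filter size. The notes list only the hyperparameter values, so the architecture, the class name `CnnConfig`, and the helper functions are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CnnConfig:
    # Values taken from note 2; the field names are hypothetical.
    embedding_dim: int = 128
    filter_sizes: tuple = (3, 4, 5)
    num_filters: int = 128  # assumed to mean filters per filter size
    dropout_prob: float = 0.5
    l2_reg_lambda: float = 0.0
    batch_size: int = 64


def conv_output_length(seq_len, filter_size):
    """Positions produced by a stride-1 convolution without padding."""
    return seq_len - filter_size + 1


def pooled_feature_count(cfg):
    """Vector size after max-over-time pooling and concatenation."""
    return len(cfg.filter_sizes) * cfg.num_filters


def conv_parameter_count(cfg):
    """Weights plus biases over all convolution filters."""
    return sum(
        h * cfg.embedding_dim * cfg.num_filters + cfg.num_filters
        for h in cfg.filter_sizes
    )


cfg = CnnConfig()
features = pooled_feature_count(cfg)
conv_params = conv_parameter_count(cfg)
```

Under these assumptions the classifier layer sees a fixed-size feature vector regardless of message length, because max-over-time pooling keeps one value per filter.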

References

  1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation OSDI 2016, pp. 265–283. USENIX Association, Berkeley (2016). http://dl.acm.org/citation.cfm?id=3026877.3026899

  2. Britz, D.: Understanding convolutional neural networks for NLP (2015)

  3. Brocardo, M.L., Traore, I., Woungang, I.: Authorship verification of e-mail and tweet messages applied for continuous authentication. J. Comput. Syst. Sci. 81(8), 1429–1440 (2015)

  4. Brocardo, M.L., Traore, I., Woungang, I., Obaidat, M.S.: Authorship verification using deep belief network systems. Int. J. Commun. Syst. 30(12), e3259 (2017)

  5. Chen, X., Hao, P., Chandramouli, R., Subbalakshmi, K.P.: Authorship similarity detection from email messages. In: Perner, P. (ed.) MLDM 2011. LNCS (LNAI), vol. 6871, pp. 375–386. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23199-5_28

  6. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)

  7. Desmedt, Y.: Man-in-the-middle attack. In: van Tilborg, H.C.A. (ed.) Encyclopedia of Cryptography and Security. Springer, Boston (2005). https://doi.org/10.1007/0-387-23483-7

  8. El Bouanani, S.E.M., Kassou, I.: Authorship analysis studies: a survey. Int. J. Comput. Appl. (0975 – 8887) 86(12), 22–29 (2014)

  9. Iqbal, F., Khan, L.A., Fung, B., Debbabi, M.: E-mail authorship verification for forensic investigation. In: Proceedings of the 2010 ACM Symposium on Applied Computing, pp. 1591–1598. ACM (2010)

  10. Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751 (2014)

  11. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: 2017 Proceedings of ICLR (2017)

  12. Klimt, B., Yang, Y.: The Enron corpus: a new dataset for email classification research. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 217–226. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30115-8_22

  13. Koppel, M., Schler, J.: Authorship verification as a one-class classification problem. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 62. ACM (2004)

  14. Li, J.S., Chen, L.C., Monaco, J.V., Singh, P., Tappert, C.C.: A comparison of classifiers and features for authorship authentication of social networking messages. Concurr. Comput.: Pract. Exp. 29(14), e3918 (2017)

  15. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)

  16. Nirkhi, S.M., Dharaskar, R., Thakare, V.: Authorship identification using generalized features and analysis of computational method. Trans. Mach. Learn. Artif. Intell. 3(2), 41 (2015)

  17. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)

  18. Polychronakis, M., Provos, N.: Ghost turns zombie: exploring the life cycle of web-based malware. LEET 8, 1–8 (2008)

  19. Zhang, Y., Wallace, B.: A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015)

Acknowledgments

The author is grateful to Vlad Vavilin and Mark Mishaev for implementing and running the experiments using the TensorFlow framework.

Author information

Corresponding author

Correspondence to Marina Litvak.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Litvak, M. (2019). Deep Dive into Authorship Verification of Email Messages with Convolutional Neural Network. In: Lossio-Ventura, J., Muñante, D., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2018. Communications in Computer and Information Science, vol 898. Springer, Cham. https://doi.org/10.1007/978-3-030-11680-4_14

  • DOI: https://doi.org/10.1007/978-3-030-11680-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-11679-8

  • Online ISBN: 978-3-030-11680-4

  • eBook Packages: Computer Science, Computer Science (R0)
