Deep Dive into Authorship Verification of Email Messages with Convolutional Neural Network

Litvak, Marina

doi:10.1007/978-3-030-11680-4_14

Part of the book series: Communications in Computer and Information Science (CCIS, volume 898)

Included in the following conference series:

Annual International Symposium on Information Management and Big Data

Abstract

Authorship verification is the task of determining whether a specific individual did or did not write a text, which reduces naturally to a binary classification problem. This paper addresses the authorship verification of short email messages. Hereafter, “message” refers to the content transmitted by email. The proposed method performs the binary classification with a sequence-to-sequence (seq2seq) model and trains a convolutional neural network (CNN) on positive examples (written by the “target” user) and negative examples (written by “someone else”). Unlike previously published work, which represents text by numerous stylometric features, the proposed method requires neither advanced text preprocessing nor explicit feature extraction. All messages are submitted to the CNN “as is,” after padding to the maximal length and replacing each word with its ID number. The CNN learns the most appropriate features through backpropagation and then performs classification. Experiments on the Enron dataset using the TensorFlow framework show that the CNN classifier verifies message authorship very accurately.
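The input preparation described in the abstract (replacing words with ID numbers and padding to the maximal length) can be illustrated with a minimal sketch. It is not the author's implementation: whitespace tokenization, the reserved padding ID 0, the helper names `build_vocab` and `encode`, and the two toy messages are all assumptions made for illustration.

```python
from typing import Dict, List

PAD_ID = 0  # assumption: ID 0 is reserved for padding


def build_vocab(messages: List[str]) -> Dict[str, int]:
    """Assign each distinct whitespace-separated token an ID, starting at 1."""
    vocab: Dict[str, int] = {}
    for message in messages:
        for token in message.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def encode(messages: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """Replace words with their IDs and right-pad to the longest message."""
    max_len = max(len(message.split()) for message in messages)
    encoded = []
    for message in messages:
        ids = [vocab[token] for token in message.split()]
        encoded.append(ids + [PAD_ID] * (max_len - len(ids)))
    return encoded


# Toy data (hypothetical): label 1 = written by the target user, 0 = someone else.
messages = ["please send the report", "thanks"]
labels = [1, 0]

vocab = build_vocab(messages)
encoded = encode(messages, vocab)
print(encoded)  # [[1, 2, 3, 4], [5, 0, 0, 0]]
```

Every encoded message has the same length, so the batch forms a rectangular integer matrix that an embedding layer can consume directly, with no stylometric features computed beforehand.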

Notes

1. https://securelist.com/spam-and-phishing-in-q1-2017/78221/
2. We kept the default settings of the CNN model in the TensorFlow framework, which are as follows: number of embedding dimensions is 128; filter sizes are 3, 4, and 5; number of filters is 128; dropout probability is 0.5; L2 regularization lambda is 0; batch size is 64.
3. The best accuracy of 89% for 40 users from the Enron dataset was reported in [5].
4. We ran our model for 500 epochs.
5. Obtained from training on one of the users.
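
The settings listed in Notes 2 and 4 can be collected in a small configuration object to show how they determine the layer sizes. This is a sketch under stated assumptions rather than the model used in the paper: it assumes a Kim-style architecture (parallel convolutions with stride 1 and no padding, followed by max-over-time pooling), which the notes do not spell out, and the names `CnnConfig`, `conv_output_lengths`, `pooled_feature_size`, and `batches_per_epoch` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class CnnConfig:
    """Hyperparameter values as listed in Notes 2 and 4."""
    embedding_dim: int = 128
    filter_sizes: Tuple[int, ...] = (3, 4, 5)
    num_filters: int = 128        # filters per filter size
    dropout_prob: float = 0.5
    l2_reg_lambda: float = 0.0
    batch_size: int = 64
    num_epochs: int = 500


def conv_output_lengths(cfg: CnnConfig, seq_len: int) -> Dict[int, int]:
    """Positions each filter size covers (assumes stride 1, no padding)."""
    if seq_len < max(cfg.filter_sizes):
        raise ValueError("padded message is shorter than the largest filter")
    return {k: seq_len - k + 1 for k in cfg.filter_sizes}


def pooled_feature_size(cfg: CnnConfig) -> int:
    """Feature vector length after max-over-time pooling (assumed)."""
    return cfg.num_filters * len(cfg.filter_sizes)


def batches_per_epoch(cfg: CnnConfig, num_examples: int) -> int:
    """Number of mini-batches needed to cover the training set once."""
    return math.ceil(num_examples / cfg.batch_size)


cfg = CnnConfig()
print(conv_output_lengths(cfg, seq_len=50))  # {3: 48, 4: 47, 5: 46}
print(pooled_feature_size(cfg))              # 384
print(batches_per_epoch(cfg, 1000))          # 16
```

The padded length of 50 and the training-set size of 1000 are illustrative values only. Under the assumed architecture, each message is reduced to a 384-dimensional vector (3 filter sizes times 128 filters) before the final binary classification layer.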

References

Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation OSDI 2016, pp. 265–283. USENIX Association, Berkeley (2016). http://dl.acm.org/citation.cfm?id=3026877.3026899
Britz, D.: Understanding convolutional neural networks for NLP (2015)
Brocardo, M.L., Traore, I., Woungang, I.: Authorship verification of e-mail and tweet messages applied for continuous authentication. J. Comput. Syst. Sci. 81(8), 1429–1440 (2015)
Brocardo, M.L., Traore, I., Woungang, I., Obaidat, M.S.: Authorship verification using deep belief network systems. Int. J. Commun. Syst. 30(12), e3259 (2017)
Chen, X., Hao, P., Chandramouli, R., Subbalakshmi, K.P.: Authorship similarity detection from email messages. In: Perner, P. (ed.) MLDM 2011. LNCS (LNAI), vol. 6871, pp. 375–386. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-23199-5_28
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)
Desmedt, Y.: Man-in-the-middle attack. In: van Tilborg, H.C.A. (ed.) Encyclopedia of Cryptography and Security. Springer, Boston (2005). https://doi.org/10.1007/0-387-23483-7
El Bouanani, S.E.M., Kassou, I.: Authorship analysis studies: a survey. Int. J. Comput. Appl. (0975 – 8887) 86(12), 22–29 (2014)
Iqbal, F., Khan, L.A., Fung, B., Debbabi, M.: E-mail authorship verification for forensic investigation. In: Proceedings of the 2010 ACM Symposium on Applied Computing, pp. 1591–1598. ACM (2010)
Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751 (2014)
Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: 2017 Proceedings of ICLR (2017)
Klimt, B., Yang, Y.: The Enron corpus: a new dataset for email classification research. In: Boulicaut, J.-F., Esposito, F., Giannotti, F., Pedreschi, D. (eds.) ECML 2004. LNCS (LNAI), vol. 3201, pp. 217–226. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30115-8_22
Koppel, M., Schler, J.: Authorship verification as a one-class classification problem. In: Proceedings of the Twenty-First International Conference on Machine learning, p. 62. ACM (2004)
Li, J.S., Chen, L.C., Monaco, J.V., Singh, P., Tappert, C.C.: A comparison of classifiers and features for authorship authentication of social networking messages. Concurr. Comput.: Pract. Exp. 29(14), e3918 (2017)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Nirkhi, S.M., Dharaskar, R., Thakare, V.: Authorship identification using generalized features and analysis of computational method. Trans. Mach. Learn. Artif. Intell. 3(2), 41 (2015)
Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
Polychronakis, M., Provos, N.: Ghost turns zombie: exploring the life cycle of web-based malware. LEET 8, 1–8 (2008)
Zhang, Y., Wallace, B.: A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015)

Acknowledgments

The author is grateful to Vlad Vavilin and Mark Mishaev for implementing the method and running the experiments using the TensorFlow framework.

Author information

Authors and Affiliations

Shamoon College of Engineering, Beer Sheva, Israel
Marina Litvak

Corresponding author

Correspondence to Marina Litvak.

Editor information

Editors and Affiliations

Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
Juan Antonio Lossio-Ventura
Fondazione Bruno Kessler, Trento, Italy
Denisse Muñante
Facultad de Ingeniería, University of the Pacific, Jesús María, Lima, Peru
Hugo Alatrista-Salas

About this paper

Cite this paper

Litvak, M. (2019). Deep Dive into Authorship Verification of Email Messages with Convolutional Neural Network. In: Lossio-Ventura, J., Muñante, D., Alatrista-Salas, H. (eds) Information Management and Big Data. SIMBig 2018. Communications in Computer and Information Science, vol 898. Springer, Cham. https://doi.org/10.1007/978-3-030-11680-4_14

DOI: https://doi.org/10.1007/978-3-030-11680-4_14
Published: 08 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-11679-8
Online ISBN: 978-3-030-11680-4
eBook Packages: Computer Science, Computer Science (R0)