A Profile-Based Authorship Attribution Approach to Forensic Identification in Chinese Online Messages

  • Conference paper
Intelligence and Security Informatics (PAISI 2016)

Part of the book series: Lecture Notes in Computer Science (LNSC, volume 9650)

Abstract

With the popularity of Internet technologies and applications, inappropriate or illegal online messages have become a problem for society. Authorship attribution for anonymous online messages aims to identify the author from a group of potential suspects in support of forensic investigation. Most previous work has focused on extracting various writing-style features and applying machine learning algorithms to identify the author. However, Chinese online messages contain not only Chinese characters but also English characters, special symbols, emoticons, slang, and so on, which makes it difficult for word segmentation techniques to segment them correctly. Moreover, online messages are usually short, and the performance of traditional machine learning algorithms degrades greatly on short samples. This paper first presents a profile-based authorship attribution approach for Chinese online messages. N-gram techniques are employed to extract frequent sequences, and the category frequency feature selection method is used to filter out common frequent sequences. The profile-based method represents each suspect as a category profile. An unknown illegal message is then attributed to the most likely author by comparing its similarity with the suspects' profiles. Experiments on BBS, blog, and e-mail datasets show that the proposed approach can identify authors effectively, and that it achieves better attribution results than two instance-based benchmark methods.
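The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the n-gram length, the category-frequency threshold, or the similarity measure, so character bigrams, the hypothetical `max_category_freq` parameter, and cosine similarity are all assumptions chosen for illustration, as are the function names.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Count overlapping character n-grams; no word segmentation is needed."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def build_profiles(messages_by_author, n=2, max_category_freq=1):
    """Build one n-gram profile per author, then drop widely shared n-grams.

    max_category_freq is a hypothetical threshold (an assumption): an n-gram
    is kept only if it occurs in at most this many authors' profiles.
    """
    profiles = {}
    for author, messages in messages_by_author.items():
        profile = Counter()
        for message in messages:
            profile.update(char_ngrams(message, n))
        profiles[author] = profile

    # Category frequency: number of author profiles that contain each n-gram.
    category_freq = Counter()
    for profile in profiles.values():
        category_freq.update(profile.keys())

    return {
        author: Counter({g: c for g, c in profile.items()
                         if category_freq[g] <= max_category_freq})
        for author, profile in profiles.items()
    }


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (assumed measure)."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(message, profiles, n=2):
    """Return the author whose profile is most similar to the message."""
    query = char_ngrams(message, n)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))


# Toy data, invented for illustration only.
suspects = {
    "A": ["哈哈哈 好的~~", "哈哈 ok~~"],
    "B": ["嗯嗯,收到!!", "嗯,知道了!!"],
}
profiles = build_profiles(suspects)
print(attribute("哈哈 好~~", profiles))
```

Because each suspect is represented by a single profile accumulated over all of their messages, a short message is compared against pooled evidence rather than against individual short training samples, which is the contrast with instance-based methods that the abstract draws.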

Acknowledgments

This work was supported by grants from the Department of Education of Hebei Province (No. QN20131150) and the Program of Study Abroad for Young Teachers by Agricultural University of Hebei. The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation.

Author information

Corresponding authors

Correspondence to Jianbin Ma or Bing Xue.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Ma, J., Xue, B., Zhang, M. (2016). A Profile-Based Authorship Attribution Approach to Forensic Identification in Chinese Online Messages. In: Chau, M., Wang, G., Chen, H. (eds) Intelligence and Security Informatics. PAISI 2016. Lecture Notes in Computer Science, vol 9650. Springer, Cham. https://doi.org/10.1007/978-3-319-31863-9_3

  • DOI: https://doi.org/10.1007/978-3-319-31863-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31862-2

  • Online ISBN: 978-3-319-31863-9

  • eBook Packages: Computer Science, Computer Science (R0)
