Universality of Stylistic Traits in Texts

Creativity and Universality in Language

Part of the book series: Lecture Notes in Morphogenesis (LECTMORPH)

Abstract

The style of documents is an important property that can be used as a discriminant factor in text mining applications. Among the many measures proposed to quantify writing style, some features can be characterized as universal, in the sense that they can be easily extracted from any kind of text in practically any natural language and provide accurate results when used in style-based text categorization tasks. In this paper we examine whether such universal stylometric features remain effective under difficult scenarios where the topic and/or genre of the documents used in the training phase differ from those of the questioned documents. Based on a series of experiments in authorship attribution, we demonstrate that character n-gram features are reliable and effective, provided that an appropriate number of features is used. It is also shown that when the number of candidate authors increases, the representation dimensionality should also increase to improve classification results.
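To make the notion of character n-gram features concrete, the following is a minimal sketch of how such a representation might be built, assuming overlapping n-grams, a vocabulary made of the most frequent n-grams in the training texts, and relative-frequency weighting. The function names, the default n = 3, and the `num_features` parameter are illustrative assumptions, not the setup used in the chapter; `num_features` stands in for the representation dimensionality discussed above.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(documents, n=3, num_features=1000):
    """Keep the num_features most frequent n-grams over the training texts.

    Assumption: frequency-based selection; other selection criteria are possible.
    """
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc, n))
    return [gram for gram, _ in counts.most_common(num_features)]


def vectorize(text, vocabulary, n=3):
    """Represent a text by the relative frequency of each vocabulary n-gram."""
    grams = char_ngrams(text, n)
    counts = Counter(grams)
    total = len(grams) or 1  # avoid division by zero for very short texts
    return [counts[gram] / total for gram in vocabulary]
```

The resulting vectors could then be passed to any standard classifier. Because extraction needs no tokenizer or language-specific tool, the same code applies to text in any language, which is the sense of "universal" used in the abstract.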

Author information

Corresponding author

Correspondence to Efstathios Stamatatos.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Stamatatos, E. (2016). Universality of Stylistic Traits in Texts. In: Degli Esposti, M., Altmann, E., Pachet, F. (eds) Creativity and Universality in Language. Lecture Notes in Morphogenesis. Springer, Cham. https://doi.org/10.1007/978-3-319-24403-7_9

  • DOI: https://doi.org/10.1007/978-3-319-24403-7_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-24401-3

  • Online ISBN: 978-3-319-24403-7

  • eBook Packages: Social Sciences (R0)
