Abstract
The style of documents is an important property that can be used as a discriminant factor in text mining applications. Among the many measures proposed to quantify writing style, some features can be characterized as universal, in the sense that they can be easily extracted from any kind of text in practically any natural language and provide accurate results in style-based text categorization tasks. In this paper we examine whether such universal stylometric features remain effective under difficult scenarios where the topic and/or genre of the training documents differ from those of the questioned documents. Based on a series of experiments in authorship attribution, we demonstrate that character n-gram features are reliable and effective, provided that an appropriate number of features is used. We also show that when the number of candidate authors increases, the representation dimensionality should also increase to improve classification results.
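As a rough illustration of the kind of feature discussed above, the following minimal sketch extracts character n-grams, keeps the k most frequent ones over a set of training texts, and represents a document by their relative frequencies. It is not the chapter's implementation: the function names, the choice of n = 3, and the default k = 1000 are assumptions made only for this example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_vocabulary(texts, n=3, k=1000):
    """Keep the k most frequent character n-grams over the training texts.

    k plays the role of the representation dimensionality (an assumed default).
    """
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text, n))
    return [gram for gram, _ in counts.most_common(k)]


def vectorize(text, vocabulary, n=3):
    """Relative frequency of each vocabulary n-gram in the text."""
    grams = char_ngrams(text, n)
    counts = Counter(grams)
    total = len(grams) or 1  # avoid division by zero for very short texts
    return [counts[gram] / total for gram in vocabulary]
```

The resulting vectors can be passed to any standard classifier. Varying k is one way to explore how the number of features interacts with the number of candidate authors.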
Copyright information
© 2016 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Stamatatos, E. (2016). Universality of Stylistic Traits in Texts. In: Degli Esposti, M., Altmann, E., Pachet, F. (eds) Creativity and Universality in Language. Lecture Notes in Morphogenesis. Springer, Cham. https://doi.org/10.1007/978-3-319-24403-7_9
DOI: https://doi.org/10.1007/978-3-319-24403-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24401-3
Online ISBN: 978-3-319-24403-7
eBook Packages: Social Sciences, Social Sciences (R0)