A New Document Author Representation for Authorship Attribution

  • Adrián Pastor López-Monroy
  • Manuel Montes-y-Gómez
  • Luis Villaseñor-Pineda
  • Jesús Ariel Carrasco-Ochoa
  • José Fco. Martínez-Trinidad
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7329)


This paper proposes a novel representation for Authorship Attribution (AA) based on Concise Semantic Analysis (CSA), a technique that has been used successfully in Text Categorization (TC). Our approach, called Document Author Representation (DAR), builds document vectors in a space of authors by computing the relationship between textual features and authors. To evaluate it, we compare the proposed representation with conventional approaches and previous work on the c50 corpus. We found that DAR is well suited to AA tasks: it performs well on imbalanced data and achieves comparable or better accuracy.
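
The idea of building document vectors in a space of authors can be illustrated with a minimal sketch. The weighting scheme, the function names (`term_author_weights`, `document_vector`), and the toy data are assumptions made for illustration only; the abstract does not state the formulas, so this should not be read as the authors' exact method.

```python
from collections import Counter, defaultdict
import math


def term_author_weights(docs, authors):
    """Map each term to a vector with one weight per author.

    Assumed weighting (illustrative, not taken from the paper): each
    training document adds log2(1 + tf / doc_length) to the entry of
    its own author, and each term vector is then scaled to sum to 1.
    """
    author_ids = sorted(set(authors))
    index = {a: i for i, a in enumerate(author_ids)}
    weights = defaultdict(lambda: [0.0] * len(author_ids))
    for tokens, author in zip(docs, authors):
        for term, tf in Counter(tokens).items():
            weights[term][index[author]] += math.log2(1 + tf / len(tokens))
    # Scale each term vector so that very frequent terms do not dominate.
    normalized = {}
    for term, vec in weights.items():
        total = sum(vec)
        normalized[term] = [w / total for w in vec]
    return author_ids, normalized


def document_vector(tokens, weights, n_authors):
    """Average the author-space vectors of the document's known terms."""
    vec = [0.0] * n_authors
    known = [t for t in tokens if t in weights]
    for term in known:
        for i, w in enumerate(weights[term]):
            vec[i] += w
    if known:
        vec = [v / len(known) for v in vec]
    return vec


# Toy training set: two documents by author "A", one by author "B".
docs = [
    ["the", "sea", "was", "calm"],
    ["markets", "fell", "sharply"],
    ["the", "sea", "rose"],
]
authors = ["A", "B", "A"]

author_ids, W = term_author_weights(docs, authors)
vec = document_vector(["the", "sea", "fell"], W, len(author_ids))
print(author_ids, vec)
```

In this sketch each document ends up with one dimension per author, so the vector length depends on the number of candidate authors rather than on the vocabulary size. Any standard classifier could then be trained on such vectors; that choice is likewise left open here.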


Keywords: Authorship Attribution, Author Identification, Document Representation, Semantic Analysis, Text Classification



Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Computer Science Department, National Institute for Astrophysics, Optics and Electronics, Tonantzintla, Mexico (all authors)
