Example of Application of n-grams: Authorship Attribution Using Syllables

Syntactic n-grams in Computational Linguistics

Part of the book series: SpringerBriefs in Computer Science ((BRIEFSCOMPUTER))

Abstract

As described in the previous chapters, the mainstream of modern computational linguistics is based on the application of machine learning methods. We formulate a problem as a classification task, represent the objects formally using features and their values (constructing a vector space model), and then apply well-known classification algorithms. In this pipeline, the crucial question is how to select the features. For example, we can use words, word n-grams (sequences of words), or character n-grams (sequences of characters) as features. An interesting question arises: can we use syllables as features? This is very rarely done in computational linguistics, although syllables reflect a certain linguistic reality. This chapter explores this possibility for the authorship attribution task; it follows our research paper [99]. Note that syllables are somewhat similar to character n-grams in that they consist of several characters and are not too long.
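To make the pipeline concrete, the following minimal sketch turns texts into syllable-based count vectors. It is an illustration under stated assumptions, not the chapter's method: the function names (`naive_syllables`, `syllable_ngrams`, `to_vectors`) are hypothetical, and the syllable splitter is a crude vowel-group heuristic standing in for a real hyphenation tool.

```python
import re
from collections import Counter


def naive_syllables(word):
    """Split a word into rough syllable-like chunks.

    Assumption: each chunk is a run of consonants followed by a run of
    vowels; trailing consonants are attached to the last chunk. This is a
    heuristic placeholder, not a linguistically accurate syllabifier.
    """
    word = word.lower()
    chunks = re.findall(r"[^aeiouy]*[aeiouy]+", word)
    if not chunks:
        # No vowel found: treat the whole word as one chunk.
        return [word] if word else []
    tail = word[sum(len(c) for c in chunks):]
    chunks[-1] += tail
    return chunks


def syllable_ngrams(text, n=1):
    """Count n-grams of syllables, formed within each word of the text."""
    feats = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        syl = naive_syllables(word)
        for i in range(len(syl) - n + 1):
            feats["-".join(syl[i:i + n])] += 1
    return feats


def to_vectors(docs, n=1):
    """Build a shared vocabulary and one count vector per document."""
    counts = [syllable_ngrams(d, n) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[f] for f in vocab] for c in counts]


vocab, vectors = to_vectors(["banana", "nana"])
print(vocab)    # feature names (syllables)
print(vectors)  # one row of counts per document
```

The resulting vectors could then be passed to any standard classifier, with one class per candidate author. Forming n-grams within word boundaries is a design choice of this sketch.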

Notes

  1. http://pan.webis.de [last access: 27.12.2016]. All other URLs in this document were also verified on this date.

  2. For syllables, space-prefix and space-suffix categories are omitted since they correspond to prefix and suffix categories.

  3. http://CulturaCollectiva.com

  4. http://www.nltk.org

  5. https://pypi.python.org/pypi/Pyphen

  6. https://pypi.python.org/pypi/PyHyphen

  7. https://pypi.python.org/pypi/hyphenate

  8. http://icon.shef.ac.uk/Moby/mhyph.html

  9. https://www.howmanysyllables.com; https://ahdictionary.com; http://www.dictionary.com

  10. http://es.oslin.org/syllables.php

  11. Programming of the method was performed by H. J. Hernández and E. López.

Bibliography

  1. Abbasi, A., Chen, H.: Applying authorship analysis to extremist-group web forum messages. IEEE Intelligent Systems, Vol. 20, No. 5, pp. 67–75 (2005)

  2. Argamon, S., Whitelaw, C., Chase, P., Hota, S.R., Garg, N., Levitan, S.: Stylistic text classification using functional lexical features. Journal of the American Society for Information Science and Technology, Vol. 58, No. 6, pp. 802–822 (2007)

  3. Burrows, J.: Word-patterns and story-shapes: The statistical analysis of narrative style. Literary and Linguistic Computing, Vol. 2, No. 2, pp. 61–70 (1987)

  4. Daelemans, W.: Explanation in computational stylometry. In: Proceedings of the 14th International Conference on Intelligent Text Processing and Computational Linguistics, pp. 451–462 (2013)

  5. Diederich, J., Kindermann, J., Leopold, E., Paass, G.: Authorship attribution with support vector machines. Applied Intelligence, Vol. 19, No. 1–2, pp. 109–123 (2003)

  6. Feng, L., Jansche, M., Huenerfauth, M., Elhadad, N.: A comparison of features for automatic readability assessment. In: Proceedings of the 23rd International Conference on Computational Linguistics, pp. 276–284 (2010)

  7. Fucks, W.: On the mathematical analysis of style. Biometrika, Vol. 39, No. 1–2, pp. 122–129 (1952)

  8. Gómez-Adorno, H., Sidorov, G., Pinto, D., Markov, I.: A graph based authorship identification approach. Working Notes Papers of the CLEF 2015 Evaluation Labs, Vol. 1391 (2015)

  9. Grieve, J.: Quantitative authorship attribution: A history and an evaluation of techniques. MSc dis. Simon Fraser University (2005)

  10. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA Data Mining Software: An Update. SIGKDD Explorations, Vol. 11, No. 1, pp. 10–18 (2009)

  11. Holmes, D.: Authorship attribution. Computers and the Humanities, Vol. 28, No. 2, pp. 87–106 (1994)

  12. Jarvis, S., Bestgen, Y., Pepper, S.: Maximizing classification accuracy in native language identification. In: Proceedings of the 8th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 111–118 (2013)

  13. Juola, P.: Authorship Attribution. Foundations and Trends in Information Retrieval, Vol. 1, No. 3, pp. 233–334 (2006)

  14. Kestemont, M.: Function words in authorship attribution. From black magic to theory? In: Proceedings of the 3rd Workshop on Computational Linguistics for Literature, pp. 59–66 (2014)

  15. Koppel, M., Winter, Y.: Determining if two documents are written by the same author. Journal of the American Society for Information Science and Technology, Vol. 65, No. 1, pp. 178–187 (2014)

  16. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, Vol. 5, pp. 361–397 (2004)

  17. Luyckx, K., Daelemans, W.: Authorship attribution and verification with many authors and limited data. In: Proceedings of the 22nd International Conference on Computational Linguistics, pp. 513–520 (2008)

  18. Markov, I., Baptista, J., Pichardo-Lagunas, O.: Authorship attribution in Portuguese using character n-grams. Acta Polytechnica Hungarica, Vol. 14, No. 3, pp. 59–78 (2017)

  19. Markov, I., Gómez-Adorno, H., Posadas-Durán, J.-P., Sidorov, G., Gelbukh, A.: Author profiling with doc2vec neural network-based document embeddings. In: Proceedings of the 15th Mexican International Conference on Artificial Intelligence, LNAI, Vol. 10062, pp. 117–131 (2017)

  20. Markov, I., Gómez-Adorno, H., Sidorov, G.: Language- and subtask-dependent feature selection and classifier parameter tuning for author profiling. Working Notes Papers of the CLEF 2017 Evaluation Labs, Vol. 1866 (2017)

  21. Markov, I., Stamatatos, E., Sidorov, G.: Improving cross-topic authorship attribution: The role of pre-processing. In: Proceedings of the 18th International Conference on Computational Linguistics and Intelligent Text Processing (2017)

  22. McNamara, D., Louwerse, M., McCarthy, P., Graesser, A.: Coh-Metrix: Capturing linguistic features of cohesion. Discourse Processes, Vol. 47, No. 4, pp. 292–330 (2010)

  23. Mendenhall, T.: The characteristic curves of composition. Science, Vol. 9, No. 214, pp. 237–249 (1887)

  24. Mosteller, F., Wallace, D.L.: Inference and Disputed Authorship: The Federalist. Reading, MA: Addison-Wesley Publishing Company (1964) (Reprinted: Stanford: Center for the Study of Language and Information (2008))

  25. Pentel, A.: Effect of different feature types on age based classification of short texts. In: Proceedings of the 6th International Conference on Information, Intelligence, Systems and Applications, pp. 1–7 (2015)

  26. Posadas-Durán, J.-P., Gómez-Adorno, H., Sidorov, G., Batyrshin, I., Pinto, D., Chanona-Hernandez, L.: Application of the distributed document representation in the authorship attribution task for small corpora. Soft Computing, Vol. 21, No. 3, pp. 627–639 (2016)

  27. Qian, T., Liu, B., Chen, L., Peng, Z.: Tri-training for authorship attribution with limited training data. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 345–351 (2014)

  28. Sapkota, U., Solorio, T., Montes-y-Gómez, M., Bethard, S., Rosso, P.: Cross-topic authorship attribution: Will out-of-topic data help? In: Proceedings of the 25th International Conference on Computational Linguistics, pp. 1228–1237 (2014)

  29. Sapkota, U., Bethard, S., Montes-y-Gómez, M., Solorio, T.: Not all character n-grams are created equal: A study in authorship attribution. In: Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL: Human Language Technologies, pp. 93–102 (2015)

  30. Sidorov, G.: Automatic Authorship Attribution Using Syllables as Classification Features. Rhema, Vol. 1, pp. 62–81 (2018)

  31. Stamatatos, E.: A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, Vol. 60, No. 3, pp. 538–556 (2009)

  32. Stamatatos, E.: On the robustness of authorship attribution based on character n-gram features. Journal of Law & Policy, Vol. 21, pp. 427–439 (2013)

  33. Stamatatos, E., Daelemans, W., Verhoeven, B., Stein, B., Potthast, M., Juola, P., Sánchez-Pérez, M.A., Barrón-Cedeño, A.: Overview of the author identification task at PAN 2014. Working Notes of CLEF 2014 - Conference and Labs of the Evaluation Forum, pp. 877–897 (2014)

  34. Stamatatos, E., Daelemans, W., Verhoeven, B., Juola, P., López-López, A., Potthast, M., Stein, B.: Overview of the author identification task at PAN 2015. Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum (2015)

  35. Stamatatos, E., Kokkinakis, G., Fakotakis, N.: Automatic text categorization in terms of genre and author. Computational Linguistics, Vol. 26, No. 4, pp. 471–495 (2000)

  36. Van Halteren, H.: Linguistic profiling for author recognition and verification. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics (2004)

Copyright information

© 2019 The Author(s), under exclusive licence to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Sidorov, G. (2019). Example of Application of n-grams: Authorship Attribution Using Syllables. In: Syntactic n-grams in Computational Linguistics. SpringerBriefs in Computer Science. Springer, Cham. https://doi.org/10.1007/978-3-030-14771-6_6

  • DOI: https://doi.org/10.1007/978-3-030-14771-6_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-14770-9

  • Online ISBN: 978-3-030-14771-6

  • eBook Packages: Computer Science (R0)
