Computer-Generated Text Detection Using Machine Learning: A Systematic Review

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9612)

Abstract

Computer-generated (artificial) text is now abundant on the web, ranging from basic random word salads to content produced by web scraping. In this paper, we present a short version of a systematic review of existing automated methods for distinguishing natural texts from artificially generated ones. The methods were selected according to a set of criteria, and we provide a summary of the methods considered. Comparisons, wherever possible, use common evaluation measures and control for differences in experimental set-up.

Author information

Corresponding author

Correspondence to Daria Beresneva.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Beresneva, D. (2016). Computer-Generated Text Detection Using Machine Learning: A Systematic Review. In: Métais, E., Meziane, F., Saraee, M., Sugumaran, V., Vadera, S. (eds) Natural Language Processing and Information Systems. NLDB 2016. Lecture Notes in Computer Science, vol 9612. Springer, Cham. https://doi.org/10.1007/978-3-319-41754-7_43

  • DOI: https://doi.org/10.1007/978-3-319-41754-7_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41753-0

  • Online ISBN: 978-3-319-41754-7

  • eBook Packages: Computer Science, Computer Science (R0)
