Computer-Generated Text Detection Using Machine Learning: A Systematic Review

Beresneva, Daria

doi:10.1007/978-3-319-41754-7_43

Conference paper
First Online: 17 June 2016

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9612)

Abstract

Computer-generated, or artificial, text is now abundant on the web, ranging from basic random word salads to content produced by web scraping. In this paper, we present a short version of a systematic review of existing automated methods for distinguishing natural texts from artificially generated ones. The methods were selected according to certain criteria, and we provide a summary of those considered. Wherever possible, comparisons use common evaluation measures and control for differences in experimental set-up.

Author information

Authors and Affiliations

Moscow Institute of Physics and Technology, Russian Academy of National Economy and Public Administration, Anti-Plagiat Research, Moscow, Russia
Daria Beresneva

Corresponding author

Correspondence to Daria Beresneva.

Editor information

Editors and Affiliations

Conservatoire National des Arts et Métiers, Paris, France
Elisabeth Métais
University of Salford, Salford, United Kingdom
Farid Meziane
University of Salford, Salford, United Kingdom
Mohamad Saraee
Oakland University, Rochester, Michigan, USA
Vijayan Sugumaran
University of Salford, Salford, United Kingdom
Sunil Vadera

About this paper

Cite this paper

Beresneva, D. (2016). Computer-Generated Text Detection Using Machine Learning: A Systematic Review. In: Métais, E., Meziane, F., Saraee, M., Sugumaran, V., Vadera, S. (eds) Natural Language Processing and Information Systems. NLDB 2016. Lecture Notes in Computer Science, vol 9612. Springer, Cham. https://doi.org/10.1007/978-3-319-41754-7_43

DOI: https://doi.org/10.1007/978-3-319-41754-7_43
Published: 17 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41753-0
Online ISBN: 978-3-319-41754-7
eBook Packages: Computer Science, Computer Science (R0)