Abstract
Fake scientific papers have recently attracted interest within the academic community after fake papers were identified in the digital libraries of major academic publishers [8]. Detecting and removing these papers is important for many reasons. We investigate the use of similarity search for detecting fake scientific papers by comparing several methods for signature construction and similarity scoring, and we describe a pseudo-relevance feedback technique that can be used to improve the effectiveness of these methods. Experiments on a dataset of 40,000 computer science papers show that precision, recall and MAP scores of 0.96, 0.99 and 0.99, respectively, can be achieved, thereby demonstrating the usefulness of similarity search in detecting fake scientific papers and ranking them highly.
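The abstract does not say which signature or scoring methods were compared, so the following is only an illustrative sketch of the general pipeline it names: build a signature per document, score similarity against a known fake paper, optionally apply pseudo-relevance feedback, and evaluate the ranking. It assumes word-shingle signatures and Jaccard similarity, and a simple feedback step that merges the top-ranked documents' shingles into the query; these choices and all function names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles of a text (an assumed signature)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity of two shingle sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_signature(sig: Set[Shingle], corpus: Dict[str, str], k: int) -> List[str]:
    """Rank document ids by decreasing similarity to a query signature."""
    scores = {d: jaccard(sig, shingles(t, k)) for d, t in corpus.items()}
    # Ties are broken by document id so the ordering is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rank(query: str, corpus: Dict[str, str], k: int = 3) -> List[str]:
    """Similarity search: rank the corpus against one known fake paper."""
    return rank_by_signature(shingles(query, k), corpus, k)


def rank_with_feedback(query: str, corpus: Dict[str, str],
                       top_n: int = 1, k: int = 3) -> List[str]:
    """Assumed feedback step: treat the top_n results as relevant,
    merge their shingles into the query signature, and rank again."""
    expanded = shingles(query, k)
    for doc_id in rank(query, corpus, k)[:top_n]:
        expanded |= shingles(corpus[doc_id], k)
    return rank_by_signature(expanded, corpus, k)


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranking; MAP is its mean over queries."""
    hits, total = 0, 0.0
    for position, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0
```

In use, `rank` or `rank_with_feedback` would be called with the text of a known fake paper as the query, and `average_precision` would be averaged over several such queries to obtain a MAP score.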
References
[1] Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic clustering of the Web. Computer Networks and ISDN Systems 29(8–13), 1157–1166 (1997)
[2] Butler, D.: Investigating journals: The dark side of publishing. Nature 495(7442), 433–435 (2013)
[3] Gad-el Hak, M.: Publish or perish - an ailing enterprise? Physics Today 57(3), 61–62 (2004)
[4] Labbé, C., Labbé, D.: Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science? Scientometrics 94(1), 379–396 (2012)
[5] Manku, G., Jain, A., Sarma, A.D.: Detecting near-duplicates for web crawling. In: WWW, pp. 141–149 (2007)
[6] Medelyan, O., Frank, E., Witten, I.H.: Human-competitive tagging using automatic keyphrase extraction. In: EMNLP, vol. 3, pp. 1318–1327 (2009)
[7] Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B.: Overview of the 6th international competition on plagiarism detection. In: CLEF (2014)
[8] Van Noorden, R.: Publishers withdraw more than 120 gibberish papers. Nature, February 2014
[9] Williams, K., Giles, C.L.: Near duplicate detection in an academic digital library. In: DocEng, pp. 91–94 (2013)
[10] Xiong, J., Huang, T.: An effective method to identify machine automatically generated paper. In: KESE, pp. 101–102. IEEE (2009)
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Williams, K., Giles, C.L. (2015). On the Use of Similarity Search to Detect Fake Scientific Papers. In: Amato, G., Connor, R., Falchi, F., Gennaro, C. (eds) Similarity Search and Applications. SISAP 2015. Lecture Notes in Computer Science, vol 9371. Springer, Cham. https://doi.org/10.1007/978-3-319-25087-8_32
DOI: https://doi.org/10.1007/978-3-319-25087-8_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25086-1
Online ISBN: 978-3-319-25087-8
eBook Packages: Computer Science, Computer Science (R0)