Wikipedia Text Reuse: Within and Without

  • Milad Alshomary
  • Michael Völske
  • Tristan Licht
  • Henning Wachsmuth
  • Benno Stein
  • Matthias Hagen
  • Martin Potthast
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11437)

Abstract

We study text reuse related to Wikipedia at scale by compiling the first corpus of text reuse cases within Wikipedia as well as without (i.e., reuse of Wikipedia text in a sample of the Common Crawl). To discover reuse beyond verbatim copy and paste, we employ state-of-the-art text reuse detection technology, scaling it for the first time to process the entire Wikipedia as part of a distributed retrieval pipeline. We further report on a pilot analysis of the 100 million reuse cases inside, and the 1.6 million reuse cases outside Wikipedia that we discovered. Text reuse inside Wikipedia gives rise to new tasks such as article template induction, fixing quality flaws, or complementing Wikipedia’s ontology. Text reuse outside Wikipedia yields a tangible metric for the emerging field of quantifying Wikipedia’s influence on the web. To foster future research into these tasks, and for reproducibility’s sake, the Wikipedia text reuse corpus and the retrieval pipeline are made freely available.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Paderborn University, Paderborn, Germany
  2. Bauhaus-Universität Weimar, Weimar, Germany
  3. Martin-Luther-Universität Halle-Wittenberg, Halle, Germany
  4. Leipzig University, Leipzig, Germany
