
WHAD: Wikipedia historical attributes data

Historical structured data extraction and vandalism detection from the Wikipedia edit history

Abstract

This paper describes the generation of temporally anchored infobox attribute data from the Wikipedia history of revisions. By mining (attribute, value) pairs from the revision history of the English Wikipedia we are able to collect a comprehensive knowledge base that contains data on how attributes change over time. When dealing with the Wikipedia edit history, vandalic and erroneous edits are a concern for data quality. We present a study of vandalism identification in Wikipedia edits that uses only features from the infoboxes, and show that we can obtain, on this dataset, an accuracy comparable to a state-of-the-art vandalism identification method that is based on the whole article. Finally, we discuss different characteristics of the extracted dataset, which we make available for further study.
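The mining step described in the abstract can be illustrated with a minimal sketch. It is not the authors' extraction pipeline: the hand-rolled template scanner, the function names `extract_infobox` and `attribute_changes`, and the sample revisions with their timestamps are all assumptions made for illustration. The sketch reads the first infobox of each revision, then compares consecutive revisions to emit time-stamped (attribute, old value, new value) records.

```python
import re
from typing import Dict, Iterable, List, Optional, Tuple

Change = Tuple[str, str, Optional[str], Optional[str]]


def extract_infobox(wikitext: str) -> Dict[str, str]:
    """Return {attribute: value} for the first {{Infobox ...}} template, or {}."""
    match = re.search(r"\{\{\s*infobox", wikitext, re.IGNORECASE)
    if match is None:
        return {}
    depth = 0
    i = match.start()
    field_start = None
    parts: List[str] = []
    while i < len(wikitext):
        two = wikitext[i:i + 2]
        if two in ("{{", "[["):      # entering a template or a link
            depth += 1
            i += 2
            continue
        if two in ("}}", "]]"):      # leaving a template or a link
            depth -= 1
            if depth == 0:           # end of the infobox itself
                if field_start is not None:
                    parts.append(wikitext[field_start:i])
                break
            i += 2
            continue
        if wikitext[i] == "|" and depth == 1:   # top-level field separator
            if field_start is not None:
                parts.append(wikitext[field_start:i])
            field_start = i + 1
        i += 1
    attributes: Dict[str, str] = {}
    for part in parts:
        if "=" in part:
            key, value = part.split("=", 1)
            attributes[key.strip().lower()] = value.strip()
    return attributes


def attribute_changes(revisions: Iterable[Tuple[str, str]]) -> List[Change]:
    """Compare consecutive (timestamp, wikitext) revisions, oldest first.

    Returns (timestamp, attribute, old_value, new_value) tuples; None marks
    an attribute that was absent before or after the edit.
    """
    previous: Dict[str, str] = {}
    changes: List[Change] = []
    for timestamp, text in revisions:
        current = extract_infobox(text)
        for attr in sorted(set(previous) | set(current)):
            old, new = previous.get(attr), current.get(attr)
            if old != new:
                changes.append((timestamp, attr, old, new))
        previous = current
    return changes


if __name__ == "__main__":
    # Hypothetical revisions; the timestamps are invented for the example.
    sample = [
        ("2010-01-01T00:00:00Z",
         "{{Infobox musical artist\n| name = Lil Jon\n| origin = [[Atlanta]]\n}}"),
        ("2010-01-02T00:00:00Z",
         "{{Infobox musical artist\n| name = Lil Jon\n| origin = Montreal\n}}"),
    ]
    for change in attribute_changes(sample):
        print(change)
```

The scanner only tracks `{{`/`}}` and `[[`/`]]` nesting, so it ignores many corner cases of real wiki markup; a production parser would need to handle comments, `<nowiki>` sections, and malformed templates.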


Notes

  1.

    As of March 2012, there were more than 85,000 active contributors working on more than 21,000,000 articles in more than 280 languages. The English Wikipedia contained more than 3.9 million articles. Ref: http://en.wikipedia.org/wiki/Wikipedia:About.

  2.

    Wikimedia Deutschland—Gesellschaft zur Förderung Freien Wissens e.V.

  3.

    Wikimedia Toolserver, http://toolserver.org. The dataset is available for download at http://toolserver.org/~RENDER/toolkit/downloads/. Additional information can be obtained at http://alfonseca.org/eng/research/whad.html.

  4.

    http://nlp.cs.qc.cuny.edu/kbp/2011/KBP2011_TaskDefinition.pdf.

  5.

    http://www.nist.gov/tac/2011/.

  6.

    The corpus is freely available at http://www.uni-weimar.de/cms/medien/webis/research/corpora/pan-wvc-11.html.

  7.

    Notice that the issue of detecting other kinds of incorrect data is out of the scope of this work; some lines of development that we are investigating to address this open research question are discussed in Sect. 5.

  8.

    Wikipedia makes database downloads available, including those of the full edit history of every article. All text content is released under a double license: the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). For details on the different download options, see: http://en.wikipedia.org/wiki/Wikipedia:Database_download.

  9.

    Specifically, the download of the English Wikipedia with its full edit history that we have used for this research, and newer versions available later, is distributed at http://dumps.wikimedia.org/enwiki.

  10.

    See http://flex.sourceforge.net/.

  11.

    MediaWiki, Markup spec http://www.mediawiki.org/wiki/Markup_spec, retrieved February 1, 2012.

  12.

    The number of edits skipped because of parse failures is negligible: 119.

  13.

    Some other metadata is kept, see Sect. 4.1 for more details.

  14.

    Removing, for instance, edits whose textual content is too long or too short, or edits that were rapidly reverted.

  15.

    http://crowdflower.com/.

  16.

    http://aws.amazon.com/mturk/.

  17.

    A Wikipedia diff page shows the difference between two versions of a page.

  18.

    There exists a file history for image files, but it is not immediately available from the diff page.

  19.

    As of March 2011, Wikipedia contained over 3.9 million articles. Source: http://en.wikipedia.org/wiki/Special:Statistics.

  20.

    Observe that not all of them are valid infobox names, as many are in fact editors' errors or vandalism.

  21.

    The high frequency of the “french commune” infobox might be surprising, but it has a simple explanation. The commune is the lowest level of administrative division in France, and can range from a large city to a small village. As of January 9, 2008, there were 36,781 communes in France, and through the collaborative effort of a group of editors, most of them have an article, following a common template that defines the specific “french commune” infobox. See http://en.wikipedia.org/wiki/Communes_of_France and http://en.wikipedia.org/wiki/Wikipedia:WikiProject_French_communes. Similar reasons make “settlement” the most frequent infobox.

  22.

    A notable exception is volatility, which is defined in Stvilia et al. (2005) as the median revert time.
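As a rough illustration of the volatility measure mentioned in note 22, the sketch below computes a median revert time. It assumes that the revert time of an edit is the elapsed time between the edit and the revision that reverted it; the exact operationalization in Stvilia et al. (2005) is not given here, and the function name `volatility_seconds` and the sample timestamps are hypothetical.

```python
from datetime import datetime
from statistics import median
from typing import List, Optional, Tuple


def volatility_seconds(
    revert_pairs: List[Tuple[datetime, datetime]]
) -> Optional[float]:
    """Median time, in seconds, between an edit and the revision reverting it.

    revert_pairs holds (edit_time, revert_time) tuples. Returns None when no
    reverted edits are available.
    """
    durations = [(reverted - edited).total_seconds()
                 for edited, reverted in revert_pairs]
    return median(durations) if durations else None


if __name__ == "__main__":
    # Invented timestamps: three edits reverted after 1, 2, and 10 minutes.
    pairs = [
        (datetime(2011, 3, 1, 12, 0, 0), datetime(2011, 3, 1, 12, 1, 0)),
        (datetime(2011, 3, 1, 13, 0, 0), datetime(2011, 3, 1, 13, 2, 0)),
        (datetime(2011, 3, 1, 14, 0, 0), datetime(2011, 3, 1, 14, 10, 0)),
    ]
    print(volatility_seconds(pairs))
```

Using the median rather than the mean keeps a few long-lived vandal edits from dominating the measure.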

References

  1. Adler, B. T., De Alfaro, L., & Pye, I. (2010). Detecting Wikipedia vandalism using WikiTrust—Lab report for PAN at CLEF 2010. In Notebook Papers of CLEF 2010 Labs and Workshops.

  2. Adler, B. T., De Alfaro, L., Mola-Velasco, S. M., Rosso, P., & West, A. G. (2011). Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In A. Gelbukh (Ed.), Computational linguistics and intelligent text processing, Lecture Notes in Computer Science, Vol. 6609, Berlin: Springer, pp. 277–288.

  3. Ahn, D., Jijkoun, V., Mishne, G., Müller, K., de Rijke, M., & Schlobach, S. (2004). Using Wikipedia at the TREC QA track. In Proceedings of TREC 2004.

  4. Anderka, M., & Stein, B. (2012). Overview of the 1st international competition on quality flaw prediction in Wikipedia. In P. Forner, J. Karlgren, & C. Womser-Hacker (Eds.), CLEF 2012 Evaluation Labs and Workshop—Working Notes Papers.

  5. Arazy, O., & Nov, O. (2010). Determinants of Wikipedia quality: The roles of global and local contribution inequality. In Proceedings of the 2010 ACM conference on computer supported cooperative work, CSCW ’10, ACM, New York, NY, USA, pp. 233–236.

  6. Auer, S., & Lehmann, J. (2007). What have Innsbruck and Leipzig in common? Extracting semantics from Wiki content. In Proceedings of the 4th European conference on the semantic web: Research and applications, ESWC ’07, pp. 503–517.

  7. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., & Ives, Z. (2007). DBpedia: A nucleus for a web of open data. In The semantic web, 6th international semantic web conference, ISWC ’07, Springer, pp. 722–735.

  8. Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. In Proceedings of the international joint conference on artificial intelligence, IJCAI ’07.

  9. Boguraev, B., Pustejovsky, J., Ando, R., & Verhagen, M. (2007). TimeBank evolution as a community resource for TimeML parsing. Language Resources and Evaluation 41, 91–115.

  10. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. (2008). Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on management of data, New York, NY, USA, pp. 1247–1250.

  11. Chin, S. C., Street, W. N., Srinivasan, P., & Eichmann, D. (2010). Detecting Wikipedia vandalism with active learning and statistical language models. In Proceedings of the 4th workshop on information credibility, WICOW ’10, ACM, New York, NY, USA, pp. 3–10.

  12. Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 107–113.

  13. Ferschke, O., Zesch, T., & Gurevych, I. (2011). Wikipedia revision toolkit: Efficiently accessing wikipedia’s edit history. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human language technologies. System demonstrations, Portland, OR, USA, pp. 97–102.

  14. Fleiss, J. L., Levin, B., & Paik, M. C. (2004). The measurement of interrater agreement (pp. 598–626). New York: Wiley.

  15. Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th international joint conference on artificial intelligence, IJCAI ’07, pp. 1606–1611.

  16. Geiger, R. S., & Ribes, D. (2010). The work of sustaining order in Wikipedia: The banning of a vandal. In Proceedings of the 2010 ACM conference on computer supported cooperative work, CSCW ’10, ACM, New York, NY, USA, pp. 117–126.

  17. Hoffmann, R., Zhang, C., & Weld, D. S. (2010). Learning 5,000 relational extractors. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics, ACL ’10, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 286–295.

  18. Hu, X., Zhang, X., Lu, C., Park, E. K., & Zhou, X. (2009). Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’09, ACM, New York, NY, USA, pp. 389–396.

  19. Itakura, K. Y., & Clarke, C. L. A. (2009). Using dynamic markov compression to detect vandalism in the Wikipedia. In Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval, SIGIR ’09, ACM, New York, NY, USA, pp. 822–823.

  20. Lange, D., Böhm, C., & Naumann, F. (2010). Extracting structured information from Wikipedia articles to populate infoboxes. In Proceedings of the 19th ACM international conference on information and knowledge management, CIKM ’10, pp. 1661–1664.

  21. Milne, D., & Witten, I. H. (2008). Learning to link with Wikipedia. In Proceedings of the 17th ACM conference on information and knowledge management, CIKM ’08, ACM, New York, NY, USA, pp. 509–518.

  22. Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP: Volume 2—Volume 2, Association for Computational Linguistics, ACL ’09, Stroudsburg, PA, USA, pp. 1003–1011.

  23. Mola-Velasco, S. (2010). Wikipedia vandalism detection through machine learning: Feature review and new proposals. In Notebook papers of CLEF 2010 labs and workshops.

  24. Nguyen, D. P. T., Matsuo, Y., & Ishizuka, M. (2007). Exploiting syntactic and semantic information for relation extraction from Wikipedia. In IJCAI workshop on Text-Mining & Link-Analysis, TextLink ’07.

  25. Nguyen, T., Moreira, V., Nguyen, H., Nguyen, H., & Freire, J. (2011). Multilingual schema matching for Wikipedia infoboxes. Proceedings of the VLDB Endowment 5(2), 133–144.

  26. Ponzetto, S. P., & Strube, M. (2007). Deriving a large scale taxonomy from Wikipedia. In Proceedings of the 22nd national conference on artificial intelligence (Vol. 2), AAAI Press, pp. 1440–1445.

  27. Potthast, M. (2010). Crowdsourcing a Wikipedia vandalism corpus. In Proceeding of the 33rd international ACM SIGIR conference on research and development in information retrieval, SIGIR ’10, ACM, New York, NY, USA, pp. 789–790.

  28. Potthast, M., & Holfeld, T. (2011). Overview of the 2nd international competition on Wikipedia vandalism detection. In V. Petras, P. Forner & P. Clough (Eds.), Notebook papers of CLEF 11 labs and workshops.

  29. Potthast, M., Stein, B., & Gerling, R. (2008). Automatic vandalism detection in Wikipedia. In Proceedings of the IR research, 30th European conference on advances in information retrieval, ECIR’08, Springer, Berlin, pp. 663–668.

  30. Potthast, M., Stein, B., & Holfeld, T. (2010). Overview of the 1st international competition on Wikipedia vandalism detection. In Notebook papers of CLEF 2010 labs and workshops.

  31. Smets, K., Goethals, B., & Verdonk, B. (2008). Automatic vandalism detection in Wikipedia: Towards a machine learning approach. In WikiAI’08: Proceedings of the workshop on Wikipedia and Artificial Intelligence: An evolving synergy.

  32. Stvilia, B., Twidale, M. B., Smith, L. C., & Gasser, L. (2005). Assessing information quality of a community-based encyclopedia. In Proceedings of the international conference on information quality, ICIQ 2005, pp. 442–454.

  33. Suchanek, F. M., Kasneci, G., & Weikum, G. (2007). YAGO: A core of semantic knowledge. In Proceedings of the 16th international conference on world wide web, WWW ’07, ACM, New York, NY, USA, pp. 697–706.

  34. Verhagen, M., Gaizauskas, R., Schilder, F., Hepple, M., Moszkowicz, J., & Pustejovsky, J. (2009). The TempEval challenge: Identifying temporal relations in text. Language Resources and Evaluation 43, 161–179.


  35. Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H., & Studer, R. (2006). Semantic Wikipedia. In Proceedings of the 15th international conference on world wide web, WWW ’06, ACM, New York, NY, USA, pp. 585–594.

  36. Voss, J. (2005). Measuring Wikipedia. In Proceedings of the international conference of the international society for scientometrics and informetrics (ISSI), Vol. 10, Stockholm.

  37. Wang, Y., Zhu, M., Qu, L., Spaniol, M., & Weikum, G. (2010). Timely YAGO: harvesting, querying, and visualizing temporal knowledge from Wikipedia. In Proceedings of the 13th international conference on extending database technology, EDBT ’10, ACM, New York, NY, USA, pp. 697–700.

  38. West, A. G., & Lee, I. (2011). Multilingual vandalism detection using language-independent and ex post facto evidence—Notebook for pan at clef 2011. In CLEF (Notebook papers/labs/workshop).

  39. West, A. G., Kannan, S., & Lee, I. (2010). Detecting Wikipedia vandalism via spatio-temporal analysis of revision metadata. Tech. rep., University of Pennsylvania, New York, NY, USA.

  40. Wilkinson, D. M., & Huberman, B. A. (2007). Cooperation and quality in Wikipedia. In Proceedings of the 2007 international symposium on Wikis, WikiSym ’07, ACM, New York, NY, USA, pp. 157–164.

  41. Wu, F., & Weld, D. S. (2007). Autonomously semantifying Wikipedia. In Proceedings of the sixteenth ACM conference on conference on information and knowledge management, CIKM ’07, ACM, New York, NY, USA, pp. 41–50.

  42. Wu, F., & Weld, D. S. (2010). Open information extraction using Wikipedia. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics, ACL ’10, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 118–127.

  43. Wu, Q., Irani, D., Pu, C., & Ramaswamy, L. (2010). Elusive vandalism detection in Wikipedia: a text stability-based approach. In Proceedings of the 19th ACM international conference on information and knowledge management, ACM, pp. 1797–1800.

  44. Xu, S., Yang, S., & Lau, F. C. M. (2010). Keyword extraction and headline generation using novel word features. In Proceedings of the twenty-fourth AAAI conference on artificial intelligence, AAAI 2010, AAAI Press.

  45. Yamangil, E., & Nelken, R. (2008). Mining Wikipedia revision histories for improving sentence compression. In ACL 2008, Proceedings of the 46th annual meeting of the Association for Computational Linguistics, June 15–20, 2008, Columbus, Ohio, USA, Short Papers, pp. 137–140.

  46. Yatskar, M., Pang, B., Danescu-Niculescu-Mizil, C., & Lee, L. (2010). For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In Proceedings of the conference of the north American chapter of the Association for Computational Linguistics, NAACL, pp. 365–368.

  47. Ye, S., Chua, T. S., & Lu, J. (2009). Summarizing definition from Wikipedia. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP: Volume 1—Volume 1, Association for Computational Linguistics, Stroudsburg, PA, USA, ACL ’09, pp. 199–207.

  48. Zanzotto, F. M., & Pennacchiotti, M. (2010). Expanding textual entailment corpora from Wikipedia using co-training. In Proceedings of the COLING-Workshop on the peoples web meets NLP: collaboratively constructed semantic resources.

  49. Zeng, H., Alhossaini, M. A., Ding, L., Fikes, R., & McGuinness, D. L. (2006). Computing trust from revision history. In Proceedings of the 2006 international conference on privacy, security and trust: Bridge the gap between PST technologies and business services, PST ’06, ACM, New York, NY, USA.

  50. Zhang, Q., Suchanek, F. M., Yue, L., & Weikum, G. (2008). TOB: Timely ontologies for business relations. In 11th international workshop on the web and databases, WebDB.


Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 257790; the Spanish Ministry of Science and Innovation project Holopedia (TIN2010-21128-C02); and the Regional Government of Madrid MA2VICMR (S2009/TIC1542).

Author information


Corresponding author

Correspondence to Enrique Alfonseca.

Additional information

This work was partially done while the second author was visiting Google Switzerland GmbH.

Appendix: Manual rating instructions


Instructions

Wikipedia is an on-line encyclopedia to which many users contribute by editing its entries. Wikipedia entries sometimes contain one or several small boxes with structured data called Infoboxes. For example, the Wikipedia entry for United States has a small box at the right hand side containing the name of the country, its flag and seal, motto, anthem, capital, and other facts about the country. We’ll call each of these lines in the infobox an attribute.

If you want to read more about Wikipedia Infoboxes, you can see this page.

Wikipedia keeps logs of all the edits done by each contributor during the past many years. This allows us to explore the past changes for each entry. For example, this page shows a particular edit that was done to the entry “Articles of Confederation”. In this example, the contributor modified the value of the attribute “writer”. This attribute is the one that is used in the infobox line specifying who the authors were. This particular contributor edited the value of the writer from just “Continental Congress” to a new value of an insulting nature. This is a clear case of vandalism. For the purposes of this evaluation, we consider that a contribution is vandalic if either:

  • It is adding insulting or obscene content.

  • It is plainly false.

If a page contained a correct value and a user replaces it with an incorrect value, we assume that the edit is vandalism. For example, look at this page. The value of the origin (birth place) of Lil Jon was changed from Montreal to Atlanta. The correct value for this attribute is Atlanta. You can click on the “Previous edit” link to see that Montreal was added in replacement of the correct value Atlanta. For these reasons, we’ll say that the page was initially correct, Montreal was added in a vandal edit, and the change in the shown page is fixing the vandalism by reverting the value to the previous correct value Atlanta.

You will be shown below the name of an entry, the time when it was changed, name of the attribute in the infobox, the old value of the attribute, and the new value of the attribute. The task is to reply to the questions below to identify possible cases of incorrect values or vandalic actions.


About this article

Cite this article

Alfonseca, E., Garrido, G., Delort, JY. et al. WHAD: Wikipedia historical attributes data. Lang Resources & Evaluation 47, 1163–1190 (2013). https://doi.org/10.1007/s10579-013-9232-5


Keywords

  • Wikipedia
  • Infobox
  • Attributes
  • Temporal data