Abstract
This paper describes the generation of temporally anchored infobox attribute data from the Wikipedia history of revisions. By mining (attribute, value) pairs from the revision history of the English Wikipedia we are able to collect a comprehensive knowledge base that contains data on how attributes change over time. When dealing with the Wikipedia edit history, vandalic and erroneous edits are a concern for data quality. We present a study of vandalism identification in Wikipedia edits that uses only features from the infoboxes, and show that we can obtain, on this dataset, an accuracy comparable to a state-of-the-art vandalism identification method that is based on the whole article. Finally, we discuss different characteristics of the extracted dataset, which we make available for further study.
Notes
As of March 2012, there were more than 85,000 active contributors working on more than 21,000,000 articles in more than 280 languages. The English Wikipedia contained more than 3.9 million articles. Ref: http://en.wikipedia.org/wiki/Wikipedia:About.
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Wikimedia Toolserver, http://toolserver.org. The dataset is available for download at http://toolserver.org/~RENDER/toolkit/downloads/. Additional information can be obtained at http://alfonseca.org/eng/research/whad.html.
The corpus is freely available at http://www.uni-weimar.de/cms/medien/webis/research/corpora/pan-wvc-11.html.
Notice that the issue of detecting other kinds of incorrect data is out of the scope of this work; some lines of development that we are investigating to address this open research question are discussed in Sect. 5.
Wikipedia makes database downloads available, including those of the full edit history of every article. All text content is released under a double license: the Creative Commons Attribution-ShareAlike 3.0 License (CC-BY-SA) and the GNU Free Documentation License (GFDL). For details on the different download options, see: http://en.wikipedia.org/wiki/Wikipedia:Database_download.
Specifically, the download of the English Wikipedia with its full edit history that we have used for this research, and newer versions available later, is distributed at http://dumps.wikimedia.org/enwiki.
MediaWiki, Markup spec http://www.mediawiki.org/wiki/Markup_spec, retrieved February 1, 2012.
The number of edits skipped because of parse failures is negligible: 119.
Some other metadata is kept, see Sect. 4.1 for more details.
Removing, for instance, edits whose textual content is too long or too short, or edits that were rapidly reverted.
A Wikipedia diff page shows the difference between two versions of a page.
There exists a file history for image files, but it is not immediately available from the diff page.
As of March, 2011, the total number of Wikipedia pages is over 3.9 million articles. Source: http://en.wikipedia.org/wiki/Special:Statistics.
Observe that not all of them are valid infobox names, as many are in fact editors' errors or vandalism.
The high frequency of the "french commune" infobox might be surprising, but it has a simple explanation. The commune is the lowest level of administrative division in France, and can range from a large city to a small village. As of January 9, 2008, there were 36,781 communes in France, and through the collaborative effort of a group of editors, most of them have an article, following a common template that defines the specific "french commune" infobox. See http://en.wikipedia.org/wiki/Communes_of_France and http://en.wikipedia.org/wiki/Wikipedia:WikiProject_French_communes. Similar reasons make "settlement" the most frequent infobox.
A notable exception is volatility, which is defined in Stvilia et al. (2005) as the median revert time.
References
Adler, B. T., De Alfaro, L., & Pye, I. (2010). Detecting Wikipedia vandalism using WikiTrust - Lab report for PAN at CLEF 2010. In Notebook Papers of CLEF 2010 Labs and Workshops.
Adler, B. T., De Alfaro, L., Mola-Velasco, S. M., Rosso, P., & West, A. G. (2011). Wikipedia vandalism detection: Combining natural language, metadata, and reputation features. In A. Gelbukh (Ed.), Computational linguistics and intelligent text processing, Lecture Notes in Computer Science, Vol. 6609, Berlin: Springer, pp. 277–288.
Ahn, D., Jijkoun, V., Mishne, G., Müller, K., de Rijke, M., & Schlobach, S. (2004). Using Wikipedia at the TREC QA track. In Proceedings of TREC 2004.
Anderka, M., & Stein, B. (2012). Overview of the 1st international competition on quality flaw prediction in Wikipedia. In P. Forner, J. Karlgren, & C. Womser-Hacker (Eds.), CLEF 2012 Evaluation Labs and Workshop - Working Notes Papers.
Arazy, O., & Nov, O. (2010). Determinants of Wikipedia quality: The roles of global and local contribution inequality. In Proceedings of the 2010 ACM conference on computer supported cooperative work, CSCW '10, ACM, New York, NY, USA, pp. 233–236.
Auer, S., & Lehmann, J. (2007). What have Innsbruck and Leipzig in common? Extracting semantics from Wiki content. In Proceedings of the 4th European conference on the semantic web: Research and applications, ESWC '07, pp. 503–517.
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., & Ives, Z. (2007). DBpedia: A nucleus for a web of open data. In The semantic web, 6th international semantic web conference, ISWC '07, Springer, pp. 722–735.
Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., & Etzioni, O. (2007). Open information extraction from the web. In Proceedings of the international joint conference on artificial intelligence, IJCAI '07.
Boguraev, B., Pustejovsky, J., Ando, R., & Verhagen, M. (2007). TimeBank evolution as a community resource for TimeML parsing. Language Resources and Evaluation 41, 91–115.
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. (2008). Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on management of data, New York, NY, USA, pp. 1247–1250.
Chin, S. C., Street, W. N., Srinivasan, P., & Eichmann, D. (2010). Detecting Wikipedia vandalism with active learning and statistical language models. In Proceedings of the 4th workshop on information credibility, WICOW '10, ACM, New York, NY, USA, pp. 3–10.
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 107–113.
Ferschke, O., Zesch, T., & Gurevych, I. (2011). Wikipedia revision toolkit: Efficiently accessing Wikipedia's edit history. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human language technologies. System demonstrations, Portland, OR, USA, pp. 97–102.
Fleiss, J. L., Levin, B., & Paik, M. C. (2004). The measurement of interrater agreement (pp. 598–626). New York: Wiley.
Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the 20th international joint conference on artificial intelligence, IJCAI '07, pp. 1606–1611.
Geiger, R. S., & Ribes, D. (2010). The work of sustaining order in Wikipedia: The banning of a vandal. In Proceedings of the 2010 ACM conference on computer supported cooperative work, CSCW '10, ACM, New York, NY, USA, pp. 117–126.
Hoffmann, R., Zhang, C., & Weld, D. S. (2010). Learning 5,000 relational extractors. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics, ACL '10, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 286–295.
Hu, X., Zhang, X., Lu, C., Park, E. K., & Zhou, X. (2009). Exploiting Wikipedia as external knowledge for document clustering. In Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '09, ACM, New York, NY, USA, pp. 389–396.
Itakura, K. Y., & Clarke, C. L. A. (2009). Using dynamic Markov compression to detect vandalism in the Wikipedia. In Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval, SIGIR '09, ACM, New York, NY, USA, pp. 822–823.
Lange, D., Böhm, C., & Naumann, F. (2010). Extracting structured information from Wikipedia articles to populate infoboxes. In Proceedings of the 19th ACM international conference on information and knowledge management, CIKM '10, pp. 1661–1664.
Milne, D., & Witten, I. H. (2008). Learning to link with Wikipedia. In Proceedings of the 17th ACM conference on information and knowledge management, CIKM '08, ACM, New York, NY, USA, pp. 509–518.
Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP: Volume 2 - Volume 2, Association for Computational Linguistics, ACL '09, Stroudsburg, PA, USA, pp. 1003–1011.
Mola-Velasco, S. (2010). Wikipedia vandalism detection through machine learning: Feature review and new proposals. In Notebook papers of CLEF 2010 labs and workshops.
Nguyen, D. P. T., Matsuo, Y., & Ishizuka, M. (2007). Exploiting syntactic and semantic information for relation extraction from Wikipedia. In IJCAI workshop on Text-Mining & Link-Analysis, TextLink '07.
Nguyen, T., Moreira, V., Nguyen, H., Nguyen, H., & Freire, J. (2011). Multilingual schema matching for Wikipedia infoboxes. Proceedings of the VLDB Endowment 5(2), 133–144.
Ponzetto, S. P., & Strube, M. (2007). Deriving a large scale taxonomy from Wikipedia. In Proceedings of the 22nd national conference on artificial intelligence (Vol. 2), AAAI Press, pp. 1440–1445.
Potthast, M. (2010). Crowdsourcing a Wikipedia vandalism corpus. In Proceeding of the 33rd international ACM SIGIR conference on research and development in information retrieval, SIGIR '10, ACM, New York, NY, USA, pp. 789–790.
Potthast, M., & Holfeld, T. (2011). Overview of the 2nd international competition on Wikipedia vandalism detection. In V. Petras, P. Forner & P. Clough (Eds.), Notebook papers of CLEF 11 labs and workshops.
Potthast, M., Stein, B., & Gerling, R. (2008). Automatic vandalism detection in Wikipedia. In Proceedings of the IR research, 30th European conference on advances in information retrieval, ECIR '08, Springer, Berlin, pp. 663–668.
Potthast, M., Stein, B., & Holfeld, T. (2010). Overview of the 1st international competition on Wikipedia vandalism detection. In Notebook papers of CLEF 2010 labs and workshops.
Smets, K., Goethals, B., & Verdonk, B. (2008). Automatic vandalism detection in Wikipedia: Towards a machine learning approach. In WikiAI '08: Proceedings of the workshop on Wikipedia and Artificial Intelligence: An evolving synergy.
Stvilia, B., Twidale, M. B., Smith, L. C., & Gasser, L. (2005). Assessing information quality of a community-based encyclopedia. In Proceedings of the international conference on information quality, ICIQ 2005, pp. 442–454.
Suchanek, F. M., Kasneci, G., & Weikum, G. (2007). YAGO: A core of semantic knowledge. In Proceedings of the 16th international conference on world wide web, WWW '07, ACM, New York, NY, USA, pp. 697–706.
Verhagen, M., Gaizauskas, R., Schilder, F., Hepple, M., Moszkowicz, J., & Pustejovsky, J. (2009). The TempEval challenge: Identifying temporal relations in text. Language Resources and Evaluation 43, 161–179.
Völkel, M., Krötzsch, M., Vrandecic, D., Haller, H., & Studer, R. (2006). Semantic Wikipedia. In Proceedings of the 15th international conference on world wide web, WWW '06, ACM, New York, NY, USA, pp. 585–594.
Voss, J. (2005). Measuring Wikipedia. In Proceedings of the international conference of the international society for scientometrics and informetrics (ISSI), Vol. 10, Stockholm.
Wang, Y., Zhu, M., Qu, L., Spaniol, M., & Weikum, G. (2010). Timely YAGO: harvesting, querying, and visualizing temporal knowledge from Wikipedia. In Proceedings of the 13th international conference on extending database technology, EDBT '10, ACM, New York, NY, USA, pp. 697–700.
West, A. G., & Lee, I. (2011). Multilingual vandalism detection using language-independent and ex post facto evidence - Notebook for PAN at CLEF 2011. In CLEF (Notebook papers/labs/workshop).
West, A. G., Kannan, S., & Lee, I. (2010). Detecting Wikipedia vandalism via spatio-temporal analysis of revision metadata? Tech. rep., University of Pennsylvania, New York, NY, USA.
Wilkinson, D. M., & Huberman, B. A. (2007). Cooperation and quality in Wikipedia. In Proceedings of the 2007 international symposium on Wikis, WikiSym '07, ACM, New York, NY, USA, pp. 157–164.
Wu, F., & Weld, D. S. (2007). Autonomously semantifying Wikipedia. In Proceedings of the sixteenth ACM conference on information and knowledge management, CIKM '07, ACM, New York, NY, USA, pp. 41–50.
Wu, F., & Weld, D. S. (2010). Open information extraction using Wikipedia. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics, ACL '10, Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 118–127.
Wu, Q., Irani, D., Pu, C., & Ramaswamy, L. (2010). Elusive vandalism detection in Wikipedia: a text stability-based approach. In Proceedings of the 19th ACM international conference on information and knowledge management, ACM, pp. 1797–1800.
Xu, S., Yang, S., & Lau, F. C. M. (2010). Keyword extraction and headline generation using novel word features. In Proceedings of the twenty-fourth AAAI conference on artificial intelligence, AAAI 2010, AAAI Press.
Yamangil, E., & Nelken, R. (2008). Mining Wikipedia revision histories for improving sentence compression. In ACL 2008, Proceedings of the 46th annual meeting of the Association for Computational Linguistics, June 15–20, 2008, Columbus, Ohio, USA, Short Papers, pp. 137–140.
Yatskar, M., Pang, B., Danescu-Niculescu-Mizil, C., & Lee, L. (2010). For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In Proceedings of the conference of the North American chapter of the Association for Computational Linguistics, NAACL, pp. 365–368.
Ye, S., Chua, T. S., & Lu, J. (2009). Summarizing definition from Wikipedia. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP: Volume 1 - Volume 1, Association for Computational Linguistics, Stroudsburg, PA, USA, ACL '09, pp. 199–207.
Zanzotto, F. M., & Pennacchiotti, M. (2010). Expanding textual entailment corpora from Wikipedia using co-training. In Proceedings of the COLING-Workshop on the people's web meets NLP: collaboratively constructed semantic resources.
Zeng, H., Alhossaini, M. A., Ding, L., Fikes, R., & McGuinness, D. L. (2006). Computing trust from revision history. In Proceedings of the 2006 international conference on privacy, security and trust: Bridge the gap between PST technologies and business services, PST '06, ACM, New York, NY, USA.
Zhang, Q., Suchanek, F. M., Yue, L., & Weikum, G. (2008). TOB: Timely ontologies for business relations. In 11th international workshop on the web and databases, WebDB.
Acknowledgments
The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 257790; the Spanish Ministry of Science and Innovation project Holopedia (TIN2010-21128-C02); and the Regional Government of Madrid MA2VICMR (S2009/TIC1542).
Additional information
This work was partially done while the second author was visiting Google Switzerland GmbH.
Appendix: Manual rating instructions
1.1 Instructions
Wikipedia is an on-line encyclopedia to which many users contribute by editing the entries. Wikipedia entries sometimes contain one or several small boxes with structured data called Infoboxes. For example, the Wikipedia entry for United States has a small box at the right hand side containing the name of the country, its flag and seal, motto, anthem, capital, and other facts about the country. We'll call each of these lines in the infobox attributes.
If you want to read more about Wikipedia Infoboxes, you can see this page.
Wikipedia keeps logs of all the edits done by each contributor during the past many years. This allows us to explore the past changes for each entry. For example, this page shows a particular edit that was done to the entry "Articles of Confederation". In this example, the contributor modified the value of the attribute "writer". This attribute is the one that is used in the infobox line specifying who the authors were. This particular contributor edited the value of the writer from just "Continental Congress" to a new value of an insulting nature. This is a clear case of vandalism. For the purposes of this evaluation, we consider that a contribution is vandalic if either:
- It is adding insulting or obscene content.
- It is plainly false.
If a page contained a correct value and a user replaces it with an incorrect value, we assume that the edit is vandalism. For example, look at this page. The value of the origin (birth place) of Lil Jon was changed from Montreal to Atlanta. The correct value for this attribute is Atlanta. You can click on the "Previous edit" link to see that Montreal was added in replacement of the correct value Atlanta. For these reasons, we'll say that the page was initially correct, Montreal was added in a vandal edit, and the change in the shown page is fixing the vandalism by reverting the value to the previous correct value Atlanta.
You will be shown below the name of an entry, the time when it was changed, name of the attribute in the infobox, the old value of the attribute, and the new value of the attribute. The task is to reply to the questions below to identify possible cases of incorrect values or vandalic actions.
Alfonseca, E., Garrido, G., Delort, JY. et al. WHAD: Wikipedia historical attributes data. Lang Resources & Evaluation 47, 1163–1190 (2013). https://doi.org/10.1007/s10579-013-9232-5
DOI: https://doi.org/10.1007/s10579-013-9232-5