Language Resources and Evaluation, Volume 47, Issue 4, pp 1163–1190

WHAD: Wikipedia historical attributes data

Historical structured data extraction and vandalism detection from the Wikipedia edit history


  • E. Alfonseca
    • Google Research Zurich
  • Guillermo Garrido
    • NLP & IR Group, UNED
  • Jean-Yves Delort
    • Google Research Zurich
  • Anselmo Peñas
    • NLP & IR Group, UNED
Original Paper

DOI: 10.1007/s10579-013-9232-5

Cite this article as:
Alfonseca, E., Garrido, G., Delort, J. et al. Lang Resources & Evaluation (2013) 47: 1163. doi:10.1007/s10579-013-9232-5


This paper describes the generation of temporally anchored infobox attribute data from the Wikipedia history of revisions. By mining (attribute, value) pairs from the revision history of the English Wikipedia we are able to collect a comprehensive knowledge base that contains data on how attributes change over time. When dealing with the Wikipedia edit history, vandalic and erroneous edits are a concern for data quality. We present a study of vandalism identification in Wikipedia edits that uses only features from the infoboxes, and show that we can obtain, on this dataset, an accuracy comparable to a state-of-the-art vandalism identification method that is based on the whole article. Finally, we discuss different characteristics of the extracted dataset, which we make available for further study.
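The mining step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `parse_infobox` and `attribute_timeline`, the simplified template parsing, and the toy revisions are all assumptions made for illustration. The sketch extracts (attribute, value) pairs from the infobox of each revision and records the interval of revisions over which each value was observed.

```python
import re
from typing import Dict, List, Tuple


def parse_infobox(wikitext: str) -> Dict[str, str]:
    """Return {attribute: value} for the first {{Infobox ...}} template, or {}.

    Hypothetical helper: a simplified parser, not a full wikitext grammar.
    """
    match = re.search(r"\{\{\s*infobox", wikitext, re.IGNORECASE)
    if match is None:
        return {}
    depth = 0                    # nesting depth of {{ }} and [[ ]]
    fields: List[str] = []
    current: List[str] = []
    i = match.start()
    while i < len(wikitext):
        pair = wikitext[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            if depth > 1:        # keep nested markup as part of the value
                current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            if depth == 0:       # end of the infobox template
                break
            current.append(pair)
            i += 2
        elif wikitext[i] == "|" and depth == 1:
            fields.append("".join(current))   # top-level field separator
            current = []
            i += 1
        else:
            current.append(wikitext[i])
            i += 1
    fields.append("".join(current))
    attributes: Dict[str, str] = {}
    for field in fields[1:]:     # fields[0] is the template name
        if "=" in field:
            key, value = field.split("=", 1)
            if key.strip() and value.strip():
                attributes[key.strip().lower()] = value.strip()
    return attributes


Span = Tuple[str, str, str, str]  # (attribute, value, first_seen, last_seen)


def attribute_timeline(revisions: List[Tuple[str, str]]) -> List[Span]:
    """Turn (timestamp, wikitext) revisions into temporally anchored spans."""
    open_spans: Dict[str, List[str]] = {}   # attribute -> [value, first, last]
    closed: List[Span] = []
    for timestamp, wikitext in sorted(revisions):
        current = parse_infobox(wikitext)
        # Close spans whose value changed or disappeared in this revision.
        for attr in list(open_spans):
            value, first, last = open_spans[attr]
            if current.get(attr) != value:
                closed.append((attr, value, first, last))
                del open_spans[attr]
        # Extend spans that persist, open spans for new values.
        for attr, value in current.items():
            if attr in open_spans:
                open_spans[attr][2] = timestamp
            else:
                open_spans[attr] = [value, timestamp, timestamp]
    closed.extend((a, v, f, l) for a, (v, f, l) in open_spans.items())
    return sorted(closed, key=lambda span: (span[0], span[2]))


# Toy revisions (invented for illustration, ISO dates sort chronologically).
revisions = [
    ("2010-01-01", "{{Infobox officeholder | office = Senator}}"),
    ("2011-01-01", "{{Infobox officeholder | office = Senator}}"),
    ("2012-01-01", "{{Infobox officeholder | office = President}}"),
]
timeline = attribute_timeline(revisions)
```

In a timeline of this shape, a value that is observed in only one revision before being reverted is a natural candidate for closer inspection, although the sketch itself makes no attempt to classify edits as vandalism.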


Keywords: Wikipedia, Infobox, Attributes, Temporal data

Copyright information

© Springer Science+Business Media Dordrecht 2013