Abstract
Automatic estimation of the quality of Web documents is a challenging task, especially because the definition of quality heavily depends on the individuals who define it, on the context where it applies, and on the nature of the tasks at hand. Our long-term goal is to allow automatic assessment of Web document quality tailored to specific user requirements and context. This process relies on identifying document characteristics that indicate quality. In this paper, we investigate these characteristics as follows: (1) we define features of Web documents that may be indicators of quality; (2) we design a procedure for automatically extracting those features; (3) we develop a Web application that presents these results to niche users, in order to check the relevance of these features as quality indicators and to collect quality assessments; (4) we analyse users’ qualitative assessments of Web documents to refine our definition of the features that determine quality, and to establish their relative weight in the overall quality, i.e., in the summarizing score users attribute to a document, which determines whether it meets their standards or not. Hence, our contribution is threefold: a Web application for nichesourcing quality assessments; a curated dataset of Web document assessments; and a thorough analysis of the quality assessments collected by means of two case studies involving experts (journalists and media scholars). The dataset obtained is limited in size but highly valuable because of the expertise of the assessors who provided it. Our analyses show that: (1) it is possible to automate the process of Web document quality estimation with high accuracy; (2) document features shown in isolation are poorly informative to users; and (3) for the tasks we propose (i.e., choosing Web documents to use as a source for writing an article on the vaccination debate), the most important quality dimensions are accuracy, trustworthiness, and precision.
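The abstract describes combining extracted document features with weights (derived from expert assessments) into a single summarizing quality score. A minimal illustrative sketch of that idea follows; the feature names, values, and weights are hypothetical and not taken from the paper:

```python
def quality_score(features, weights):
    """Weighted average of feature values, each normalized to [0, 1].

    `features` maps feature name -> extracted value; `weights` maps
    feature name -> importance learned from expert assessments.
    """
    total_weight = sum(weights[name] for name in features)
    if total_weight == 0:
        return 0.0
    return sum(features[name] * weights[name] for name in features) / total_weight

# Illustrative feature values for one Web document.
doc_features = {"accuracy": 0.9, "trustworthiness": 0.8, "precision": 0.7}
# Illustrative weights, e.g. fitted to the experts' summarizing scores.
feature_weights = {"accuracy": 0.5, "trustworthiness": 0.3, "precision": 0.2}

score = quality_score(doc_features, feature_weights)
```

A document would then meet a user's standards when its score exceeds a user-specific threshold; the paper's contribution is precisely in grounding such weights in expert assessments rather than fixing them a priori.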
Notes
- 1.
The tool is running at http://webq3.herokuapp.com, the code is available at https://github.com/davideceolin/webq.
- 2.
The dataset is available at https://github.com/davideceolin/WebQ-Analyses.
- 10.
The dataset is available at https://goo.gl/cLDTtS.
- 11.
The questionnaire is available at http://goo.gl/forms/2pIjjpIp0PtyPxd72.
- 12.
The questionnaire is available at http://goo.gl/forms/ZwvaqDidGeC8FCXm1.
Acknowledgements
This work was supported by the Amsterdam Academic Alliance Data Science (AAA-DS) Program Award to the UvA and VU Universities. We thank the students of the UvA journalism course and the RMeS summer school participants for taking part in our user studies.
Copyright information
© 2016 Springer International Publishing AG
Cite this paper
Ceolin, D., Noordegraaf, J., Aroyo, L. (2016). Capturing the Ineffable: Collecting, Analysing, and Automating Web Document Quality Assessments. In: Blomqvist, E., Ciancarini, P., Poggi, F., Vitali, F. (eds) Knowledge Engineering and Knowledge Management. EKAW 2016. Lecture Notes in Computer Science(), vol 10024. Springer, Cham. https://doi.org/10.1007/978-3-319-49004-5_6
Print ISBN: 978-3-319-49003-8
Online ISBN: 978-3-319-49004-5