Probabilistically Ranking Web Article Quality Based on Evolution Patterns

Han, Jingyu; Chen, Kejia; Jiang, Dawei

doi:10.1007/978-3-642-34179-3_8

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 7600)

Abstract

User-generated content (UGC) is created, updated, and maintained by various web users, and its data quality is a major concern for all of them. We observe that each Wikipedia page usually goes through a series of revision stages, gradually approaching a relatively steady quality state, and that articles of different quality classes exhibit specific evolution patterns. We propose to assess the quality of web articles by Learning Evolution Patterns (LEP). First, each article’s revision history is mapped into a state sequence using a Hidden Markov Model (HMM). Second, evolution patterns are mined for each quality class, and each quality class is characterized by a set of quality corpora. Finally, an article’s quality is determined probabilistically by comparing the article with the quality corpora. Our experimental results demonstrate that the LEP approach can capture a web article’s quality precisely.
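
The three steps named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the hidden states, the observation symbols, every probability, the two class labels, and the toy revision histories are hypothetical values chosen for illustration. Pattern mining is replaced here by plain state-bigram counting, and the final comparison assumes a uniform prior over quality classes with add-one smoothing.

```python
import math
from collections import Counter

# Hypothetical toy HMM: hidden states are editing stages, observations are
# coarse revision types. None of these values come from the paper.
STATES = ["building", "polishing", "stable"]
START = {"building": 0.8, "polishing": 0.15, "stable": 0.05}
TRANS = {
    "building":  {"building": 0.6, "polishing": 0.3, "stable": 0.1},
    "polishing": {"building": 0.1, "polishing": 0.6, "stable": 0.3},
    "stable":    {"building": 0.05, "polishing": 0.15, "stable": 0.8},
}
EMIT = {
    "building":  {"big_add": 0.7, "small_edit": 0.2, "revert": 0.1},
    "polishing": {"big_add": 0.2, "small_edit": 0.7, "revert": 0.1},
    "stable":    {"big_add": 0.05, "small_edit": 0.35, "revert": 0.6},
}

def viterbi(observations):
    """Step 1: most likely hidden-state sequence for one revision history."""
    scores = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
              for s in STATES}
    back = []  # one back-pointer table per observation after the first
    for obs in observations[1:]:
        prev, scores, pointers = scores, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            scores[s] = (prev[best] + math.log(TRANS[best][s])
                         + math.log(EMIT[s][obs]))
            pointers[s] = best
        back.append(pointers)
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

def bigrams(path):
    """Adjacent state pairs, used as a simple stand-in for mined patterns."""
    return list(zip(path, path[1:]))

def build_corpus(paths):
    """Step 2: count state bigrams over all sequences of one quality class."""
    return Counter(bg for path in paths for bg in bigrams(path))

def class_posteriors(path, corpora):
    """Step 3: smoothed likelihood per class, normalised to probabilities."""
    vocab = len(STATES) ** 2  # number of possible state bigrams
    log_scores = {}
    for label, counts in corpora.items():
        total = sum(counts.values())
        log_scores[label] = sum(
            math.log((counts[bg] + 1) / (total + vocab)) for bg in bigrams(path)
        )
    top = max(log_scores.values())  # subtract the max for numerical stability
    weights = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}

# Hypothetical labelled revision histories for two quality classes.
histories = {
    "high": [["big_add", "big_add", "small_edit", "small_edit", "revert", "revert"],
             ["big_add", "small_edit", "small_edit", "revert", "revert", "revert"]],
    "low":  [["big_add", "big_add", "big_add", "big_add"],
             ["big_add", "small_edit", "big_add", "big_add"]],
}
corpora = {label: build_corpus([viterbi(h) for h in hs])
           for label, hs in histories.items()}

new_article = ["big_add", "small_edit", "small_edit", "revert", "revert"]
posteriors = class_posteriors(viterbi(new_article), corpora)
print(max(posteriors, key=posteriors.get), posteriors)
```

Working in log space keeps long revision histories from underflowing, and normalising the per-class scores yields a probability for each class rather than a single hard label, which is what allows articles to be ranked.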

Author information

Authors and Affiliations

School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing, 210003, P.R. China
Jingyu Han & Kejia Chen
School of Computing, National University of Singapore, Singapore, 119077
Dawei Jiang

Editor information

Editors and Affiliations

Institut de Recherche en Informatique de Toulouse (IRIT), Paul Sabatier University, 118 route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain
FAW - Institute for Applied Knowledge Processing, University of Linz, Altenbergerstraße 69, 4040, Linz, Austria
Josef Küng
FAW, University of Linz, Altenbergerstraße 69, 4040, Linz, Austria
Roland Wagner
Marriott School, Brigham Young University, 784 TNRB, 84602, Provo, UT, USA
Stephen W. Liddle
Software Competence Center Hagenberg, Softwarepark 21, 4232, Hagenberg, Austria
Klaus-Dieter Schewe
School of Information Technology and Electrical Engineering, University of Queensland, 4072, Brisbane, QLD, Australia
Xiaofang Zhou

Cite this paper

Han, J., Chen, K., Jiang, D. (2012). Probabilistically Ranking Web Article Quality Based on Evolution Patterns. In: Hameurlain, A., Küng, J., Wagner, R., Liddle, S.W., Schewe, KD., Zhou, X. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems VI. Lecture Notes in Computer Science, vol 7600. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34179-3_8

DOI: https://doi.org/10.1007/978-3-642-34179-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34178-6
Online ISBN: 978-3-642-34179-3
eBook Packages: Computer Science, Computer Science (R0)
