Abstract
As it becomes easier to add information to the web through HTML pages, wikis, blogs, and other documents, it becomes harder to distinguish accurate or trustworthy information from inaccurate or untrustworthy information. Beyond such information, we must also anticipate web spam, where spammers publish false facts and scams to deliberately mislead users. Creating an effective spam detection method is a challenge. In this paper, we use the notion of content trust for spam detection and treat it as a ranking problem. Evidence is used to define the features of spam web pages, and machine learning techniques are employed to combine the evidence into a highly efficient and reasonably accurate spam detection algorithm. Experiments on real web data show that the proposed method performs very well in practice.
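To make the ranking formulation concrete, the following is a minimal Python sketch, not the authors' implementation. The three evidence features (distinct-word ratio, compression ratio, average word length), all function names, and the perceptron-style pairwise update are assumptions chosen for illustration; the abstract does not say which features or which learner the paper uses.

```python
import zlib


def evidence_features(text):
    """Map page text to a small evidence vector (hypothetical features)."""
    words = text.split()
    n = len(words)
    if n == 0:
        return [0.0, 0.0, 0.0]
    raw = text.encode("utf-8")
    # Share of distinct words: low values hint at keyword stuffing.
    distinct_ratio = len({w.lower() for w in words}) / n
    # Compressed size over raw size: repetitive text compresses well.
    compress_ratio = len(zlib.compress(raw)) / len(raw)
    # Average word length, scaled down to be comparable to the ratios.
    avg_word_len = sum(len(w) for w in words) / n / 10.0
    return [distinct_ratio, compress_ratio, avg_word_len]


def score(weights, features):
    """Linear trust score: higher means more trustworthy."""
    return sum(w * f for w, f in zip(weights, features))


def train_pairwise(trusted, spam, epochs=50, lr=0.1):
    """Perceptron-style pairwise ranking (an assumed learner).

    Whenever a trusted page does not outscore a spam page, move the
    weights toward the difference of their feature vectors.
    """
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for good in trusted:
            for bad in spam:
                if score(weights, good) - score(weights, bad) <= 0:
                    weights = [
                        w + lr * (g - b)
                        for w, g, b in zip(weights, good, bad)
                    ]
    return weights


def rank_pages(weights, pages):
    """Return page texts ordered from least to most trusted."""
    return sorted(pages, key=lambda p: score(weights, evidence_features(p)))
```

Under these assumptions, training on a handful of labelled pages and calling `rank_pages` places the most suspicious pages first, which is the sense in which spam detection becomes a ranking problem rather than a yes/no classification.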
Please use the following format when citing this chapter: Wang, W. and Zeng, G., 2007, in IFIP International Federation for Information Processing, Volume 238, Trust Management, eds. Etalle, S., Marsh, S., (Boston: Springer), pp. 139–152
Copyright information
© 2007 International Federation for Information Processing
About this paper
Cite this paper
Wang, W., Zeng, G. (2007). Content Trust Model for Detecting Web Spam. In: Etalle, S., Marsh, S. (eds) Trust Management. IFIPTM 2007. IFIP International Federation for Information Processing, vol 238. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-73655-6_10
DOI: https://doi.org/10.1007/978-0-387-73655-6_10
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-73654-9
Online ISBN: 978-0-387-73655-6
eBook Packages: Computer Science, Computer Science (R0)