Abstract
Creating an effective spam detection method is a challenging task. Traditional approaches usually treat it as a binary classification problem. In this paper, however, we argue that it is more appropriate to apply the notion of content trust and to treat spam detection as a ranking or ordinal regression problem. Evidence is used to define the features of spam web pages, and machine learning techniques are employed to combine the evidence into a highly efficient and reasonably accurate detection algorithm. Experiments on real web data show that the proposed method performs very well in practice.
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, W., Zeng, G., Sun, M., Gu, H., Zhang, Q. (2007). EviRank: An Evidence Based Content Trust Model for Web Spam Detection. In: Chang, K.CC., et al. Advances in Web and Network Technologies, and Information Management. APWeb WAIM 2007. Lecture Notes in Computer Science, vol 4537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72909-9_34
DOI: https://doi.org/10.1007/978-3-540-72909-9_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72908-2
Online ISBN: 978-3-540-72909-9
eBook Packages: Computer Science (R0)