EviRank: An Evidence Based Content Trust Model for Web Spam Detection

  • Conference paper
Advances in Web and Network Technologies, and Information Management (APWeb 2007, WAIM 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4537)

Abstract

Creating an effective spam detection method is a challenging task. Prior work usually treats it as a binary classification problem. In this paper, however, we argue that it is more appropriate to apply the notion of content trust and to treat spam detection as a ranking or ordinal regression problem. Evidence is used to define the features of spam web pages, and machine learning techniques are employed to combine the evidence into a highly efficient and reasonably accurate detection algorithm. Experiments on real web data show that the proposed method performs very well in practice.
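To illustrate what treating spam detection as ranking rather than binary classification can look like, here is a minimal sketch. It is not the paper's algorithm: the abstract does not name the learner, so a simple perceptron trained on pairwise feature differences stands in for it. The two evidence features, the three trust levels, and all function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical evidence features per page (assumed, not from the paper):
# [fraction of visible text, keyword repetition rate]
features = [
    [0.9, 0.1], [0.8, 0.2],   # trust level 2: trusted
    [0.5, 0.5], [0.4, 0.6],   # trust level 1: suspicious
    [0.1, 0.9], [0.2, 0.8],   # trust level 0: spam
]
levels = [2, 2, 1, 1, 0, 0]


def pairwise_differences(features, levels):
    """Turn graded pages into (difference vector, sign) training pairs."""
    pairs = []
    for i, j in combinations(range(len(features)), 2):
        if levels[i] == levels[j]:
            continue  # equal levels carry no ordering information
        diff = [a - b for a, b in zip(features[i], features[j])]
        sign = 1 if levels[i] > levels[j] else -1
        pairs.append((diff, sign))
    return pairs


def train_ranker(features, levels, epochs=100):
    """Perceptron on pairwise differences; returns a weight vector."""
    w = [0.0] * len(features[0])
    pairs = pairwise_differences(features, levels)
    for _ in range(epochs):
        mistakes = 0
        for diff, sign in pairs:
            score = sum(wi * di for wi, di in zip(w, diff))
            if sign * score <= 0:  # pair is ordered wrongly (or tied)
                w = [wi + sign * di for wi, di in zip(w, diff)]
                mistakes += 1
        if mistakes == 0:
            break  # every pair is ordered correctly
    return w


def trust_score(w, x):
    """Higher score means the page is ranked as more trustworthy."""
    return sum(wi * xi for wi, xi in zip(w, x))


w = train_ranker(features, levels)
scores = [trust_score(w, x) for x in features]
ranking = sorted(range(len(features)), key=lambda k: scores[k], reverse=True)
```

The learned scores order pages by trust level instead of assigning a hard spam or non-spam label, so a threshold can be chosen afterwards, or the ordering can be used directly.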

Editor information

Kevin Chen-Chuan Chang, Wei Wang, Lei Chen, Clarence A. Ellis, Ching-Hsien Hsu, Ah Chung Tsoi, Haixun Wang

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, W., Zeng, G., Sun, M., Gu, H., Zhang, Q. (2007). EviRank: An Evidence Based Content Trust Model for Web Spam Detection. In: Chang, K.C.-C., et al. Advances in Web and Network Technologies, and Information Management. APWeb 2007, WAIM 2007. Lecture Notes in Computer Science, vol 4537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72909-9_34

  • DOI: https://doi.org/10.1007/978-3-540-72909-9_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72908-2

  • Online ISBN: 978-3-540-72909-9

  • eBook Packages: Computer Science, Computer Science (R0)
