Abstract
Web spamming is now a serious problem for search engines. It not only degrades the quality of search results by intentionally promoting undesirable web pages to users, but also causes the search engine to waste a significant amount of computational and storage resources on processing useless information. In this paper, we present a machine learning approach to spam detection that adopts the ant colony optimization algorithm. We first construct a directed graph corresponding to web hosts and their aggregated hyperlinks. Then, we train a classifier by employing ants to walk along paths in the graph. Each ant starts from an individual non-spam host and then decides to follow a link to the next host with a probability based on both a heuristic function and the pheromone trail. Relying on the approximate isolation principle of a good set, we reward an ant that discovers a good path, i.e., a sequence of non-spam hosts, by giving it additional energy so that it can walk longer. In contrast, if the ant discovers any spam, it is penalized by reducing its remaining walking steps. Finally, the classification rules are constructed by choosing the common overlapping characteristic features of all non-spam hosts along the discovered paths. Experiments on the WEBSPAM-UK2007 dataset show that our approach classifies spam and non-spam hosts more accurately than several rule-based classification baselines.
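The ant walk described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the transition rule (probability proportional to pheromone^alpha times heuristic^beta, as in standard ant colony optimization), the parameters `alpha`, `beta`, `energy`, `reward`, `penalty`, and `max_steps`, and the choice that an ant does not step onto a spam host are all assumptions made for illustration. The function names `choose_next_host` and `ant_walk` are hypothetical.

```python
import random


def choose_next_host(current, graph, pheromone, heuristic,
                     alpha=1.0, beta=1.0, rng=random):
    """Pick an out-neighbour of `current` at random.

    Assumed rule (standard ACO, not taken from the paper): the probability
    of each link is proportional to pheromone^alpha * heuristic^beta.
    """
    neighbours = graph.get(current, [])
    if not neighbours:
        return None
    weights = [
        (pheromone.get((current, n), 1.0) ** alpha)
        * (heuristic.get(n, 1.0) ** beta)
        for n in neighbours
    ]
    if sum(weights) <= 0:
        return None
    return rng.choices(neighbours, weights=weights, k=1)[0]


def ant_walk(start, graph, labels, pheromone, heuristic,
             energy=3, reward=1, penalty=2, max_steps=100, rng=random):
    """Walk from a non-spam host and return the path of non-spam hosts.

    Each attempted step costs one unit of energy. Reaching a non-spam host
    adds `reward` energy (a longer walk); hitting a spam host subtracts
    `penalty` energy (a shorter walk). As an assumption of this sketch,
    the ant stays where it is when the chosen host is spam.
    """
    path = [start]
    current = start
    steps = 0
    while energy > 0 and steps < max_steps:
        nxt = choose_next_host(current, graph, pheromone, heuristic, rng=rng)
        if nxt is None:
            break  # dead end: no outgoing links to follow
        energy -= 1
        steps += 1
        if labels.get(nxt) == "spam":
            energy -= penalty
        else:
            energy += reward
            path.append(nxt)
            current = nxt
    return path


# Tiny hypothetical host graph: a -> b -> c, and x -> s where s is spam.
graph = {"a": ["b"], "b": ["c"], "x": ["s"]}
labels = {"a": "nonspam", "b": "nonspam", "c": "nonspam",
          "x": "nonspam", "s": "spam"}
good_path = ant_walk("a", graph, labels, pheromone={}, heuristic={})
cut_short = ant_walk("x", graph, labels, pheromone={}, heuristic={})
```

In this toy graph the first ant follows the only available links through non-spam hosts until it reaches a dead end, while the second ant loses its energy as soon as it selects the spam host. The rule-construction step (choosing features common to the hosts on the discovered paths) is not sketched here because the abstract does not specify the features involved.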
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Manaskasemsak, B., Jiarpakdee, J., Rungsawang, A. (2014). Adaptive Learning Ant Colony Optimization for Web Spam Detection. In: Murgante, B., et al. Computational Science and Its Applications – ICCSA 2014. ICCSA 2014. Lecture Notes in Computer Science, vol 8584. Springer, Cham. https://doi.org/10.1007/978-3-319-09153-2_48
DOI: https://doi.org/10.1007/978-3-319-09153-2_48
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09152-5
Online ISBN: 978-3-319-09153-2
eBook Packages: Computer Science, Computer Science (R0)