Analyzing the Impact of Unbalanced Data on Web Spam Classification

  • J. Fdez-Glez
  • D. Ruano-Ordás
  • F. Fdez-Riverola
  • J. R. Méndez
  • R. Pavón
  • R. Laza
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 373)


Web spam remains a serious threat to search engines, because the presence of illegitimate pages can severely degrade the quality of their results. Several works have tried to reduce the impact of spam content. Whatever the approach, every proposal has had to deal with a corpus in which the number of legitimate pages and the number of web sites with spam content differ enormously. Unbalanced data is a well-known problem in many practical applications of machine learning, and it has a significant effect on the performance of standard classifiers. Focusing on web spam detection, this work has a two-fold objective: to evaluate the effect of the class imbalance ratio on popular classifiers such as Naïve Bayes, SVM and C5.0, and to assess how their performance can be improved when different types of techniques are combined in an unbalanced scenario.
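To make the notions of class imbalance ratio and sampling concrete, the following is a minimal sketch and not the authors' method. It assumes the ratio is defined as majority-class count divided by minority-class count, and it uses plain random under-sampling as one possible sampling technique; the abstract does not say which techniques the paper combines. The function names and the toy 95/5 corpus are hypothetical.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Majority-class count divided by minority-class count (assumed definition)."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_undersample(samples, labels, seed=0):
    """Randomly discard examples so every class shrinks to the minority-class size."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    # Every class is reduced to the size of the smallest one.
    target = min(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, target):
            balanced.append((sample, label))

    # Shuffle so the classes are not grouped together in the output.
    rng.shuffle(balanced)
    return [s for s, _ in balanced], [l for _, l in balanced]


# Toy corpus with made-up proportions: 95 legitimate hosts, 5 spam hosts.
labels = ["legit"] * 95 + ["spam"] * 5
samples = list(range(len(labels)))

ratio_before = imbalance_ratio(labels)
balanced_samples, balanced_labels = random_undersample(samples, labels)
ratio_after = imbalance_ratio(balanced_labels)
```

Under-sampling discards majority-class examples, so it trades training data for balance. Over-sampling and cost-sensitive learning are common alternatives for dealing with unbalanced data.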


web spam detection; unbalanced data; sampling techniques; ensemble of classifiers




Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • J. Fdez-Glez (1)
  • D. Ruano-Ordás (1)
  • F. Fdez-Riverola (1)
  • J. R. Méndez (1)
  • R. Pavón (1)
  • R. Laza (1)

  1. Dept. Informática, University of Vigo, Escuela Superior de Ingeniería Informática, Edificio Politécnico, Ourense, Spain
