
Extracting Novel Features for E-Commerce Page Quality Classification

  • Jing Wang
  • Lanfen Lin
  • Feng Wang
  • Penghua Yu
  • Jiaolong Liu
  • Xiaowei Zhu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8346)

Abstract

There are a huge number of web pages describing the same product on e-commerce websites, and their quality varies greatly. Therefore, there is a growing need for automated, accurate and efficient quality classification methods. Several link-based, click-based and content-based approaches have been proposed to evaluate the quality of pages for general search engines. However, these methods consider only the surface features of the HTML documents. Moreover, features such as link relations have drawbacks when applied to e-commerce pages, because the hypothesis that links imply endorsement does not always hold in an e-commerce environment. In this paper, we propose two kinds of features that directly indicate the quality of content. We analyze each page's content structure using a corpus of labeled texts, and evaluate property completeness with the help of an ontology. We then combine these features with other features commonly used in the literature, and apply several learning methods to classify pages as good or bad. Experiments on real e-commerce pages show that the proposed novel features can greatly improve the accuracy of classification.
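As a rough illustration of the property completeness idea mentioned above, the sketch below scores a page by the fraction of expected product properties it actually states. This is an assumption made for illustration only, not the authors' method: the function name `property_completeness`, the toy property list `CAMERA_PROPERTIES`, and the case-insensitive name matching are all hypothetical, and a real ontology would carry far richer structure.

```python
# Hypothetical sketch: NOT the paper's implementation.
# Assumes an ontology can be reduced to a list of expected property names
# for a product category, and that properties were already extracted from the page.

def property_completeness(page_properties, expected_properties):
    """Return the fraction of expected properties that appear on the page."""
    expected = {p.lower() for p in expected_properties}
    if not expected:
        return 0.0  # nothing to compare against
    found = {p.lower() for p in page_properties}
    return len(found & expected) / len(expected)


# Assumed toy "ontology" entry for one product category.
CAMERA_PROPERTIES = ["brand", "model", "resolution", "sensor size", "price"]

# Properties extracted from a hypothetical product page.
page = {"Brand": "Acme", "Model": "X100", "Price": "299"}

score = property_completeness(page.keys(), CAMERA_PROPERTIES)
print(score)  # 3 of the 5 expected properties are present
```

A score like this could then be used as one numeric feature alongside the other features fed to a classifier.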

Keywords

E-commerce · Page quality · Semantic · Content analysis · Information extraction

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Jing Wang
    • 1
  • Lanfen Lin
    • 1
  • Feng Wang
    • 1
  • Penghua Yu
    • 1
  • Jiaolong Liu
    • 1
  • Xiaowei Zhu
    • 1
  1. College of Computer Science, Zhejiang University, Hangzhou, China
