
Research on Web Page Classification Method Based on Query Log

Journal of Shanghai Jiaotong University (Science)

Abstract

Web page classification is an important task in many areas of Internet information retrieval, such as directory classification and vertical search. Query-log-based methods, a lightweight form of Web page classification, avoid crawling Web content and are therefore relatively efficient, but the sparsity of user click data makes it difficult to build a classifier from the log directly. To solve this problem, we explore the semantic relations among queries through word embedding and propose three improved graph-structure classification algorithms. First, to reflect the semantic relevance between queries, we map each user query into a low-dimensional space according to its query vector. Then, we calculate a uniform resource locator (URL) vector for each page according to the relationship between queries and URLs. Finally, we use the improved label propagation algorithm (LPA) and the bipartite graph expansion algorithm to classify the unlabeled Web pages. Experiments show that our methods improve the F1-value by about 20% over other Web page classification methods based on query logs.
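
The three steps in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the improved LPA and the bipartite graph expansion algorithm are not described on this page, so plain label propagation with clamped seeds stands in for them. The embedding values, the click log, the URLs, and all function names (`query_vector`, `url_vectors`, `propagate_labels`) are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

# Toy word embeddings (hypothetical values; a real system would load a trained model).
WORD_VECS = {
    "cheap": [0.9, 0.1], "flights": [1.0, 0.0],
    "airline": [0.8, 0.2], "tickets": [0.9, 0.2],
    "pasta": [0.0, 1.0], "recipe": [0.1, 0.9], "easy": [0.2, 0.8],
}

# Toy click log (hypothetical): (query, clicked URL, click count).
CLICKS = [
    ("cheap flights", "air.example/deals", 5),
    ("airline tickets", "air.example/book", 3),
    ("pasta recipe", "food.example/pasta", 4),
    ("easy recipe", "food.example/quick", 2),
]

# A few labeled pages; the rest are classified by propagation.
SEEDS = {"air.example/deals": "travel", "food.example/pasta": "cooking"}


def query_vector(query):
    """Step 1: embed a query as the mean of its known word vectors."""
    dim = len(next(iter(WORD_VECS.values())))
    vecs = [WORD_VECS[w] for w in query.split() if w in WORD_VECS]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def url_vectors(clicks):
    """Step 2: embed a URL as the click-weighted mean of its query vectors."""
    sums, totals = {}, defaultdict(float)
    for query, url, count in clicks:
        qv = query_vector(query)
        acc = sums.setdefault(url, [0.0] * len(qv))
        for i, x in enumerate(qv):
            acc[i] += count * x
        totals[url] += count
    return {u: [x / totals[u] for x in acc] for u, acc in sums.items()}


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def propagate_labels(vectors, seeds, classes, iterations=20):
    """Step 3: plain label propagation over a cosine-similarity graph.

    Seed pages keep their labels; every other page repeatedly takes the
    similarity-weighted average of its neighbours' class scores.
    """
    urls = sorted(vectors)
    scores = {u: {c: 1.0 if seeds.get(u) == c else 0.0 for c in classes}
              for u in urls}
    for _ in range(iterations):
        updated = {}
        for u in urls:
            if u in seeds:
                updated[u] = scores[u]  # clamp labeled pages
                continue
            weights = {v: max(cosine(vectors[u], vectors[v]), 0.0)
                       for v in urls if v != u}
            total = sum(weights.values())
            if total == 0.0:
                updated[u] = scores[u]
                continue
            updated[u] = {c: sum(w * scores[v][c] for v, w in weights.items()) / total
                          for c in classes}
        scores = updated
    return {u: max(classes, key=lambda c: scores[u][c]) for u in urls}


labels = propagate_labels(url_vectors(CLICKS), SEEDS, ["travel", "cooking"])
print(labels)
```

On this toy data the two unlabeled pages share no query with a labeled page, so a click graph alone could not connect them. The embedding similarity between their queries supplies the missing edges, which is the kind of sparsity problem the abstract describes.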



Author information


Corresponding author

Correspondence to Yixing Ma (马祎星).


About this article


Cite this article

Ye, F., Ma, Y. Research on Web Page Classification Method Based on Query Log. J. Shanghai Jiaotong Univ. (Sci.) 23, 404–410 (2018). https://doi.org/10.1007/s12204-017-1899-0


  • DOI: https://doi.org/10.1007/s12204-017-1899-0
