Subtopic Ranking Based on Block-Level Document Analysis

  • Conference paper
Web Information Systems and Technologies (WEBIST 2016)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 292)

Abstract

We propose methods for ranking subtopics of a keyword query. Subtopics are themselves keyword queries that specialize or disambiguate the search intent behind the original query. Information on subtopics helps search systems generate diversified search results, which matters when the submitted query can be interpreted in multiple ways. For search result diversification, it is important to rank subtopics by their intent probabilities, that is, the probabilities that users need information on them. Our subtopic ranking methods use the hierarchical structure of documents in the corpus. This structure consists of nested logical blocks with headings: a block is a part of a document, and its heading describes the topic of that part. All our methods rest on two assumptions about this structure. First, hierarchical headings in a document represent the hierarchical topics discussed in the document. Second, authors write more content about subtopics with higher intent probabilities. Based on these assumptions, our methods score each subtopic by the total size of the blocks whose hierarchical headings represent the subtopic. We develop our methods as follows. We first propose four methods to score a subtopic on a document, four methods to integrate subtopic scores across multiple documents, and two methods to sort subtopics by their scores. We then combine these methods, which yields 4 × 4 × 2 = 32 subtopic ranking methods in total. We evaluated these methods on the data set for the subtopic mining subtask of the NTCIR-10 INTENT-2 task. The results indicate that our methods generate rankings that are statistically significantly better than the query completion snapshots of major commercial search engines.
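The core scoring idea in the abstract can be illustrated with a minimal sketch. It is not one of the paper's 32 methods: the `Block` class, the term-matching rule (every subtopic term must appear in the block's own or ancestor headings), the use of character counts as block size, the choice to count only the outermost matching block, and summation across documents are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical logical block: a heading, its own text, nested blocks."""
    heading: str
    text: str = ""
    children: List["Block"] = field(default_factory=list)

    def size(self) -> int:
        # Assumption: block size = characters of text, nested blocks included.
        return len(self.text) + sum(c.size() for c in self.children)


def score_on_document(root: Block, subtopic: str) -> int:
    """Total size of the outermost blocks whose hierarchical headings
    (own heading plus ancestor headings) contain every subtopic term."""
    terms = set(subtopic.lower().split())

    def visit(block: Block, inherited: frozenset) -> int:
        words = inherited | frozenset(block.heading.lower().split())
        if terms <= words:
            # Count this block once; its size already covers nested blocks.
            return block.size()
        return sum(visit(c, words) for c in block.children)

    return visit(root, frozenset())


def rank_subtopics(docs: List[Block], subtopics: List[str]) -> List[Tuple[str, int]]:
    """Assumed integration: sum per-document scores, then sort descending."""
    totals = {s: sum(score_on_document(d, s) for d in docs) for s in subtopics}
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy document: "jaguar" with a long "car" section and a short "animal" one.
doc = Block("jaguar", "intro", [
    Block("car", "x" * 40, [Block("price", "y" * 10)]),
    Block("animal", "z" * 15),
])
ranking = rank_subtopics([doc], ["jaguar animal", "jaguar car", "jaguar car price"])
print(ranking)
```

In this toy document, "jaguar car" ranks first because its block, including the nested price block, is the largest, which mirrors the second assumption that authors write more about subtopics with higher intent probability.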

Notes

  1. http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

  2. http://research.nii.ac.jp/ntcir/tools/ntcireval-en.html

  3. https://github.com/trec-web/trec-web-2014

  4. http://www.lemurproject.org/clueweb09/

  5. https://github.com/tmanabe/HEPS

  6. http://lucene.apache.org/

Acknowledgment

This work was supported by JSPS KAKENHI Grant Numbers 13J06384 and 26540163.

Corresponding author

Correspondence to Tomohiro Manabe.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Manabe, T., Tajima, K. (2017). Subtopic Ranking Based on Block-Level Document Analysis. In: Monfort, V., Krempels, KH., Majchrzak, T., Traverso, P. (eds) Web Information Systems and Technologies. WEBIST 2016. Lecture Notes in Business Information Processing, vol 292. Springer, Cham. https://doi.org/10.1007/978-3-319-66468-2_5

  • DOI: https://doi.org/10.1007/978-3-319-66468-2_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-66467-5

  • Online ISBN: 978-3-319-66468-2

  • eBook Packages: Computer Science, Computer Science (R0)
