Abstract
We propose methods for ranking subtopics of a keyword query. Subtopics are themselves keyword queries that specialize and/or disambiguate the search intent behind the original query. Information on subtopics is useful for search systems in generating diversified search results. Search result diversification is important when there are multiple ways to interpret the submitted query. In search result diversification, it is important to rank subtopics by their intent probabilities, that is, the probabilities that users need information on the subtopics. Our subtopic ranking methods use the hierarchical structure of documents in the corpus. The hierarchical structure of a document consists of nested logical blocks with headings. A block is a part of a document, and its heading describes the topic of that part. All our methods are based on two assumptions related to this structure. First, hierarchical headings in a document represent hierarchical topics discussed in the document. Second, authors write more content about subtopics with higher intent probabilities. Based on these assumptions, our methods score each subtopic by the total size of the blocks whose hierarchical headings represent the subtopic. We develop our methods as follows. We first propose four methods to score a subtopic on a document, four methods to integrate subtopic scores over multiple documents, and two methods to sort subtopics based on their scores. We then combine these methods, which yields 4 × 4 × 2 = 32 subtopic ranking methods in total. We evaluate these methods on the data set for the subtopic mining subtask of the NTCIR-10 INTENT-2 task. The results indicate that our methods generate rankings that are statistically significantly better than the query completion snapshots from major commercial search engines.
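To make the block-size idea concrete, the following is a minimal sketch and not one of the 32 methods themselves, since the abstract does not specify them. It assumes, purely for illustration, that block size is measured in characters, that a block represents a subtopic when every subtopic term appears in the block's heading or in an ancestor heading, that per-document scores are integrated by summation, and that subtopics are sorted by descending total score. The names Block, block_size, score_on_document, and rank_subtopics are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A logical block: a heading, its own text, and nested child blocks."""
    heading: str
    text: str = ""
    children: list = field(default_factory=list)


def block_size(block):
    """Size in characters of a block, including all nested blocks (assumed measure)."""
    return len(block.text) + sum(block_size(c) for c in block.children)


def score_on_document(root, subtopic):
    """Total size of the outermost blocks whose hierarchical headings cover the subtopic.

    Assumed matching rule: every subtopic term occurs in the block's heading
    or in one of its ancestors' headings.
    """
    terms = set(subtopic.lower().split())

    def visit(block, ancestor_words):
        path_words = ancestor_words + block.heading.lower().split()
        if terms <= set(path_words):
            # Count the whole block once; nested blocks are already included.
            return block_size(block)
        return sum(visit(child, path_words) for child in block.children)

    return visit(root, [])


def rank_subtopics(documents, subtopics):
    """Integrate scores by summation (an assumption) and sort by descending score."""
    totals = {s: sum(score_on_document(d, s) for d in documents) for s in subtopics}
    return sorted(subtopics, key=lambda s: totals[s], reverse=True)


# Two toy documents for the query "apple" (made-up data).
doc_a = Block("apple", children=[
    Block("iphone", "x" * 300, [Block("price", "y" * 100)]),
    Block("fruit", "z" * 150),
])
doc_b = Block("apple", children=[Block("fruit", "w" * 400)])

ranking = rank_subtopics(
    [doc_a, doc_b],
    ["apple iphone", "apple fruit", "apple iphone price"],
)
print(ranking)
```

In this toy corpus the "fruit" blocks total 550 characters across both documents, the "iphone" block totals 400 including its nested "price" block, and the "price" block alone totals 100, so the sketch ranks the subtopics in that order.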
Acknowledgment
This work was supported by JSPS KAKENHI Grant Numbers 13J06384 and 26540163.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Manabe, T., Tajima, K. (2017). Subtopic Ranking Based on Block-Level Document Analysis. In: Monfort, V., Krempels, KH., Majchrzak, T., Traverso, P. (eds) Web Information Systems and Technologies. WEBIST 2016. Lecture Notes in Business Information Processing, vol 292. Springer, Cham. https://doi.org/10.1007/978-3-319-66468-2_5
DOI: https://doi.org/10.1007/978-3-319-66468-2_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66467-5
Online ISBN: 978-3-319-66468-2
eBook Packages: Computer Science, Computer Science (R0)