Subtopic Ranking Based on Block-Level Document Analysis

Manabe, Tomohiro; Tajima, Keishi

doi:10.1007/978-3-319-66468-2_5

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 292)

Included in the following conference series:

International Conference on Web Information Systems and Technologies

Abstract

We propose methods for ranking subtopics of a keyword query. Subtopics are themselves keyword queries that specialize and/or disambiguate the search intent behind the original query. Information on subtopics is useful for search systems that generate diversified search results. Search result diversification is important when there are multiple ways to interpret the submitted query. For diversification, it is important to rank subtopics by their intent probabilities, that is, the probabilities that users need information on the subtopics. Our subtopic ranking methods use the hierarchical structure of documents in the corpus. This structure consists of nested logical blocks with headings. A block is a part of a document, and its heading describes the topic of that part. All our methods rest on two assumptions about this structure. First, hierarchical headings in a document represent the hierarchical topics discussed in the document. Second, authors write more content about subtopics with higher intent probabilities. Based on these assumptions, our methods score each subtopic by the total size of the blocks whose hierarchical headings represent the subtopic. We develop our methods as follows. We first propose four methods to score a subtopic on a document, four methods to integrate subtopic scores over multiple documents, and two methods to sort subtopics based on their scores. We then combine these methods, which results in 32 subtopic ranking methods in total (4 × 4 × 2). We evaluated these methods on the data set for the subtopic mining subtask of the NTCIR-10 INTENT-2 task. The results indicated that our methods generated rankings that were statistically significantly better than the query completion snapshots of major commercial search engines.
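The core idea of the abstract, scoring a subtopic by the total size of the blocks whose hierarchical headings represent it, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not one of the chapter's 32 methods: the `Block` class, measuring block size in characters, treating a subtopic as "represented" when all of its terms appear in the headings on the path from the root to the block, integrating over documents by plain summation, and sorting by descending score.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Hypothetical logical block: a heading, its own text, nested blocks."""
    heading: str
    text: str = ""
    children: list = field(default_factory=list)

    def size(self) -> int:
        # Assumed size measure: characters of text, nested blocks included.
        return len(self.text) + sum(child.size() for child in self.children)


def score_on_document(root: Block, subtopic: str) -> int:
    """Total size of the outermost blocks whose heading path covers the subtopic."""
    terms = set(subtopic.lower().split())

    def visit(block: Block, inherited: frozenset) -> int:
        # Headings are hierarchical: a block inherits its ancestors' headings.
        path_terms = inherited | set(block.heading.lower().split())
        if terms <= path_terms:
            # Count the whole block once; its descendants are already included.
            return block.size()
        return sum(visit(child, path_terms) for child in block.children)

    return visit(root, frozenset())


def rank_subtopics(documents: list, subtopics: list) -> list:
    """Integrate per-document scores by summation (assumed), sort descending."""
    totals = {
        s: sum(score_on_document(doc, s) for doc in documents)
        for s in subtopics
    }
    return sorted(subtopics, key=lambda s: totals[s], reverse=True)


# Toy corpus for the ambiguous query "jaguar"; text lengths stand in for content.
doc1 = Block("Jaguar", "x" * 10, [
    Block("Car models", "x" * 300, [Block("Prices", "x" * 50)]),
    Block("Animal", "x" * 100),
])
doc2 = Block("Jaguar animal facts", "x" * 200)

ranking = rank_subtopics([doc1, doc2],
                         ["jaguar prices", "jaguar animal", "jaguar car"])
print(ranking)
```

On this toy corpus the subtopic "jaguar car" scores 350, "jaguar animal" 300, and "jaguar prices" 50, so the ranking follows the amount of content written under the matching headings, in line with the second assumption above.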

Acknowledgment

This work was supported by JSPS KAKENHI Grant Numbers 13J06384 and 26540163.

Author information

Tomohiro Manabe
Present address: Yahoo Japan Corporation, Chiyoda, Tokyo, 102-0094, Japan

Authors and Affiliations

Graduate School of Informatics, Kyoto University, Sakyo, Kyoto, 606-8501, Japan
Tomohiro Manabe & Keishi Tajima

Corresponding author

Correspondence to Tomohiro Manabe.

Editor information

Editors and Affiliations

University of Paris, Paris, France
Valérie Monfort
Department of Information Systems and Databases, RWTH Aachen University, Aachen, Germany
Karl-Heinz Krempels
Faculty of Social Sciences, University of Agder, Kristiansand, Norway
Tim A. Majchrzak
Center for Information Technology, FBK-ICT IRST, Trento, Italy
Paolo Traverso

About this paper

Cite this paper

Manabe, T., Tajima, K. (2017). Subtopic Ranking Based on Block-Level Document Analysis. In: Monfort, V., Krempels, KH., Majchrzak, T., Traverso, P. (eds) Web Information Systems and Technologies. WEBIST 2016. Lecture Notes in Business Information Processing, vol 292. Springer, Cham. https://doi.org/10.1007/978-3-319-66468-2_5

DOI: https://doi.org/10.1007/978-3-319-66468-2_5
Published: 08 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-66467-5
Online ISBN: 978-3-319-66468-2
eBook Packages: Computer Science, Computer Science (R0)
