Abstract
In this paper, we first propose passage extraction algorithms for building indices, with the aim of returning more accurate passages as query answers. Second, we propose a basic and an improved result combination method, which merge the results retrieved from different indices in order to select and combine relevant passages as output. For passage extraction, we propose three new algorithms: paragraphParsed, sentenceParsed and wordSentenceParsed. For result combination, we propose a novel method that uses factor analysis to generate a better baseline result for combination by finding hidden common factors that can be used to estimate the importance of keywords and keyword associations. Finally, we report experimental results that confirm the effectiveness and superiority of the factor analysis based method for result combination. Our proposed approaches achieve excellent results on the TREC 2006 and 2007 Genomics data sets, suggesting a promising avenue for constructing high-performance information retrieval systems in biomedicine.
Notes
Here, “error” is a statistical term for the amount by which an individual differs from the average for the common factors.
Acknowledgements
This research is supported in part by a research grant from the Natural Sciences & Engineering Research Council (NSERC) of Canada and by the Early Researcher Award/Premier’s Research Excellence Award. We would like to thank Ming Zhong and Luo Si for their contributions in the early stage of this project. The authors are also grateful to the anonymous reviewers for their constructive comments, which have helped improve the quality of the paper.
Cite this article
Hu, Q., Huang, J.X. Passage extraction and result combination for genomics information retrieval. J Intell Inf Syst 34, 249–274 (2010). https://doi.org/10.1007/s10844-009-0097-4