Passage extraction and result combination for genomics information retrieval

Hu, Qinmin; Huang, Jimmy Xiangji

doi:10.1007/s10844-009-0097-4

Published: 15 July 2009

Journal of Intelligent Information Systems, Volume 34, pages 249–274 (2010)

Abstract

In this paper, we first propose passage extraction algorithms for building indices, with the aim of returning more accurate passages as query answers. Second, we propose a basic result combination method and an improved one, both of which combine the results retrieved from the different indices in order to select and merge relevant passages as output. For passage extraction, we propose three new algorithms: paragraphParsed, sentenceParsed and wordSentenceParsed. For result combination, we propose a novel method that uses factor analysis to generate a better baseline result for combination by finding hidden common factors that can be used to estimate the importance of keywords and keyword associations. Finally, we report experimental results that confirm the effectiveness and superiority of the factor-analysis-based method for result combination. Our proposed approaches achieve excellent results on the TREC 2006 and 2007 Genomics data sets and offer a promising avenue for constructing high-performance information retrieval systems in biomedicine.
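The two ideas in the abstract, indexing a document at more than one passage granularity and then merging the ranked results from the different indices, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the naive splitting rules, and the use of min-max normalisation with score summation (the generic CombSUM scheme) are all assumptions made here for illustration. In particular, it does not reproduce the factor-analysis-based combination the paper proposes.

```python
import re
from collections import defaultdict


def paragraph_passages(text):
    """Split text into paragraph-level passages on blank lines (assumed rule)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def sentence_passages(text):
    """Split text into sentence-level passages with a naive punctuation rule."""
    passages = []
    for paragraph in paragraph_passages(text):
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            if sentence.strip():
                passages.append(sentence.strip())
    return passages


def min_max_normalize(scores):
    """Rescale one run's scores to [0, 1] so that runs become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every passage as equally strong.
        return {pid: 1.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def combine_runs(runs):
    """CombSUM: sum each passage's normalized score across all runs."""
    combined = defaultdict(float)
    for scores in runs:
        for pid, score in min_max_normalize(scores).items():
            combined[pid] += score
    # Highest combined score first; ties broken by passage id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical scores for the same passages retrieved from two indices.
    run_a = {"p1": 10.0, "p2": 5.0, "p3": 0.0}
    run_b = {"p2": 2.0, "p3": 1.0}
    print(combine_runs([run_a, run_b]))
```

In this toy example, passage `p2` ranks first because both runs retrieve it with a relatively high score. A real system would also need to map passages from different granularities onto common identifiers before merging, which the sketch leaves out.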

Notes

The “error” is a statistical term that means the amount by which an individual differs from what is average for the common factors.
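For context, the common factor model is usually written in the following textbook form. The notation is an assumption made here for illustration and is not taken from the article:

```latex
x_i = \mu_i + \sum_{j=1}^{m} \lambda_{ij} f_j + \varepsilon_i, \qquad i = 1, \dots, p
```

Here $x_i$ is the $i$-th observed variable, $\mu_i$ its mean, $f_1, \dots, f_m$ the hidden common factors, $\lambda_{ij}$ the loading of variable $i$ on factor $j$, and $\varepsilon_i$ the error term, that is, the part of $x_i$ that the common factors do not account for.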

Acknowledgements

This research is supported in part by the research grant from the Natural Sciences & Engineering Research Council (NSERC) of Canada and the Early Researcher Award/Premier’s Research Excellence Award. We would like to thank Ming Zhong and Luo Si for their contributions at the early stage of this project. The authors are also grateful to the anonymous reviewers for their constructive comments, which have helped improve the quality of the paper.

Author information

Authors and Affiliations

Department of Computer Science & Engineering, York University, Toronto, Ontario, Canada
Qinmin Hu
School of Information Technology, York University, Toronto, Ontario, Canada
Jimmy Xiangji Huang

Corresponding author

Correspondence to Jimmy Xiangji Huang.

About this article

Cite this article

Hu, Q., Huang, J.X. Passage extraction and result combination for genomics information retrieval. J Intell Inf Syst 34, 249–274 (2010). https://doi.org/10.1007/s10844-009-0097-4

Received: 23 November 2008
Revised: 29 June 2009
Accepted: 30 June 2009
Published: 15 July 2009
Issue Date: June 2010
DOI: https://doi.org/10.1007/s10844-009-0097-4
