A Language Modeling Approach to Search Distributed Text Databases

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2903)

Abstract

As the number and diversity of distributed information sources on the Internet increase exponentially, it is difficult for users to know which databases are appropriate to search. Given language models that describe the content of each database, database selection services can help locate the databases relevant to the user’s information need. In this paper, we propose a database selection approach based on statistical language modeling. The basic idea is that, for databases categorized into a topic hierarchy, individual language models are estimated at different search stages, and the databases are then ranked by their similarity to the query according to the estimated language models. Two-stage smoothed language models are presented to circumvent the inaccuracy caused by word sparseness. Experimental results demonstrate that such a language modeling approach is competitive with current state-of-the-art database selection approaches.
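The ranking idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything below is an assumption: the category model is smoothed against a global model with Jelinek-Mercer interpolation, the database model is then smoothed against that category model with a Dirichlet prior, and databases are ranked by query log-likelihood. The function names, the parameters `lam` and `mu`, and the flat database-to-category mapping are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def jm_smooth(word, counts, background, lam):
    """Jelinek-Mercer: mix the maximum-likelihood estimate with a background model."""
    total = sum(counts.values())
    p_ml = counts[word] / total if total else 0.0
    return (1.0 - lam) * p_ml + lam * background(word)


def dirichlet_smooth(word, counts, prior, mu):
    """Dirichlet prior: add mu pseudo-counts distributed according to the prior model."""
    total = sum(counts.values())
    return (counts[word] + mu * prior(word)) / (total + mu)


def rank_databases(query, databases, category_of, lam=0.5, mu=10.0):
    """Rank database names by the log-likelihood of the query terms.

    databases:   name -> Counter of term counts (assumed available per database)
    category_of: name -> category label (a flat stand-in for a topic hierarchy)
    """
    global_counts = Counter()
    cat_counts = {}
    for name, counts in databases.items():
        global_counts.update(counts)
        cat_counts.setdefault(category_of[name], Counter()).update(counts)
    g_total = sum(global_counts.values())
    vocab = len(global_counts)

    def p_global(w):
        # Add-one smoothing so unseen query words keep a nonzero probability.
        return (global_counts[w] + 1) / (g_total + vocab + 1)

    scores = {}
    for name, counts in databases.items():
        cat = cat_counts[category_of[name]]

        def p_cat(w, cat=cat):
            # Stage 1 (assumed): category model smoothed by the global model.
            return jm_smooth(w, cat, p_global, lam)

        # Stage 2 (assumed): database model smoothed by the category model.
        scores[name] = sum(
            math.log(dirichlet_smooth(w, counts, p_cat, mu)) for w in query
        )
    return sorted(scores, key=scores.get, reverse=True)
```

Backing off from database to category to global counts is one way to keep sparse databases from receiving zero probability for query words they happen not to contain. The smoothing scheme actually used in the paper may differ.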

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yang, H., Zhang, M. (2003). A Language Modeling Approach to Search Distributed Text Databases. In: Gedeon, T.D., Fung, L.C.C. (eds) AI 2003: Advances in Artificial Intelligence. AI 2003. Lecture Notes in Computer Science, vol 2903. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24581-0_17

  • DOI: https://doi.org/10.1007/978-3-540-24581-0_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-20646-0

  • Online ISBN: 978-3-540-24581-0

  • eBook Packages: Springer Book Archive
