Computational Information Retrieval

  • Conference paper
Foundations of Statistical Inference

Part of the book series: Contributions to Statistics (CONTRIB.STAT.)

Abstract

The main goal of this note is to introduce the notion of collection-dependent “same context words”. Two (or more) words are “same context words” if they occur in the same (or similar) contexts across a given text collection. Each word w in the collection is associated with a profile P(w), the set of words occurring in sentences that contain w. We introduce a distance function on the set of profiles and use it to cluster words. Words contained in the same cluster are “same context words”. We select “same context words” for several text collections and briefly discuss further possible applications of the introduced concepts to a number of information-retrieval-related problems.
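The pipeline described in the abstract (profiles, a distance between profiles, clustering) can be illustrated with a minimal sketch. The abstract does not specify the distance function or the clustering algorithm, so the choices below are assumptions made only for illustration: Jaccard distance between profiles, single-linkage grouping under a threshold, naive sentence tokenization, and the exclusion of w itself from P(w). The function names `build_profiles`, `jaccard_distance`, and `cluster_words` are hypothetical and do not come from the paper.

```python
import re
from collections import defaultdict


def build_profiles(sentences):
    """Map each word w to its profile P(w): words sharing a sentence with w."""
    profiles = defaultdict(set)
    for sentence in sentences:
        # Naive tokenization (assumption): lowercase alphabetic runs only.
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        for w in words:
            # Assumption: w itself is left out of P(w), so that two words
            # with identical contexts get identical profiles.
            profiles[w] |= words - {w}
    return dict(profiles)


def jaccard_distance(a, b):
    """Jaccard distance between two sets (an assumed stand-in distance)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_words(profiles, threshold=0.5):
    """Group words whose profiles lie within `threshold` of each other.

    Single-linkage grouping via union-find; this is an illustrative
    choice, not the clustering method used in the paper.
    """
    words = sorted(profiles)
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the root, compressing the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, u in enumerate(words):
        for v in words[i + 1:]:
            if jaccard_distance(profiles[u], profiles[v]) <= threshold:
                parent[find(u)] = find(v)

    clusters = defaultdict(set)
    for w in words:
        clusters[find(w)].add(w)
    return list(clusters.values())


if __name__ == "__main__":
    toy_collection = [
        "the cat sleeps",
        "the dog sleeps",
        "stocks fell sharply",
        "shares fell sharply",
    ]
    for cluster in cluster_words(build_profiles(toy_collection), threshold=0.0):
        print(sorted(cluster))
```

On this toy collection with a threshold of 0, "cat" and "dog" share a cluster because their profiles coincide, as do "stocks" and "shares"; every other word stays in a cluster of its own. A real collection would call for stemming, stop-word removal, and a scalable clustering method in place of the quadratic pairwise comparison used here.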




Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kogan, J. (2003). Computational Information Retrieval. In: Haitovsky, Y., Ritov, Y., Lerche, H.R. (eds) Foundations of Statistical Inference. Contributions to Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-642-57410-8_3

  • DOI: https://doi.org/10.1007/978-3-642-57410-8_3

  • Publisher Name: Physica, Heidelberg

  • Print ISBN: 978-3-7908-0047-0

  • Online ISBN: 978-3-642-57410-8

  • eBook Packages: Springer Book Archive
