Computational Information Retrieval

Kogan, Jacob

doi:10.1007/978-3-642-57410-8_3

Part of the book series: Contributions to Statistics (CONTRIB.STAT.)

Abstract

The main goal of this note is to introduce the notion of collection-dependent “same context words”. Two (or more) words are “same context words” if they occur in the same (or a similar) context across a given text collection. Each word w in the collection is associated with a profile P(w), the set of words occurring in sentences that contain w. We introduce a distance function on the set of profiles and use it to cluster words. Words contained in the same cluster are “same context words”. We select “same context words” for several text collections and briefly discuss possible further applications of the introduced concepts to a number of problems related to information retrieval.
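
The construction outlined in the abstract can be illustrated with a minimal sketch. The abstract does not say which distance function or clustering method the chapter uses, so everything below is an assumption made for illustration only: the Jaccard distance between profiles, the threshold-based grouping, the choice to leave w out of its own profile, and all function names (`build_profiles`, `jaccard_distance`, `cluster_words`) are hypothetical and not taken from the chapter.

```python
import re
from collections import defaultdict

def build_profiles(sentences):
    """Map each word w to P(w), the set of words in sentences that contain w.

    Assumption: w itself is left out of P(w), so that two words used in
    identical sentences get identical profiles.
    """
    profiles = defaultdict(set)
    for sentence in sentences:
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        for w in words:
            profiles[w] |= words - {w}
    return dict(profiles)

def jaccard_distance(a, b):
    """Jaccard distance between two sets (an assumed stand-in distance)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def cluster_words(profiles, threshold):
    """Group words whose profiles are linked by distances <= threshold.

    Assumption: clusters are connected components of the "close enough"
    graph, computed with a small union-find structure.
    """
    words = sorted(profiles)
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the root, shortening the path on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, u in enumerate(words):
        for v in words[i + 1:]:
            if jaccard_distance(profiles[u], profiles[v]) <= threshold:
                parent[find(u)] = find(v)

    clusters = defaultdict(set)
    for w in words:
        clusters[find(w)].add(w)
    return list(clusters.values())

# A toy collection: "car" and "automobile" appear in the same context.
SENTENCES = [
    "The car stopped at the red light",
    "The automobile stopped at the red light",
    "The cat chased a small mouse",
]

profiles = build_profiles(SENTENCES)
clusters = cluster_words(profiles, threshold=0.2)
same_context = [c for c in clusters if len(c) > 1]
```

In this toy collection the profiles of "car" and "automobile" coincide, so their distance is zero and they fall into one cluster, while every other word remains alone at the chosen threshold. On a real collection the threshold, the tokenization and the sentence splitting would all need tuning.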

Author information

Authors and Affiliations

Department of Mathematics and Statistics, University of Maryland, Baltimore County, Baltimore, USA
Jacob Kogan

Editor information

Editors and Affiliations

Department of Statistics, Hebrew University, Jerusalem, 91905, Israel
Yoel Haitovsky & Yaacov Ritov
Department of Mathematical Stochastics, University of Freiburg, Eckerstraße 1, Freiburg, 79104, Germany
Hans Rudolf Lerche

Kogan, J. (2003). Computational Information Retrieval. In: Haitovsky, Y., Ritov, Y., Lerche, H.R. (eds) Foundations of Statistical Inference. Contributions to Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-642-57410-8_3

DOI: https://doi.org/10.1007/978-3-642-57410-8_3
Publisher Name: Physica, Heidelberg
Print ISBN: 978-3-7908-0047-0
Online ISBN: 978-3-642-57410-8
eBook Packages: Springer Book Archive
