Abstract
The main goal of this note is to introduce the notion of collection-dependent “same context words”. Two or more words are “same context words” if they occur in the same or similar contexts across a given text collection. Each word w in the collection is associated with a profile P(w), the set of words occurring in sentences that contain w. We introduce a distance function on the set of profiles and use it to cluster words. Words contained in the same cluster are “same context words”. We select “same context words” for several text collections and briefly discuss possible further applications of the introduced concepts to a number of information-retrieval-related problems.
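The pipeline described in the abstract (profiles, a distance on profiles, clustering) can be illustrated with a minimal Python sketch. The abstract does not say which distance function or clustering algorithm is used, so the Jaccard distance and the single-link threshold grouping below are assumptions chosen for brevity, not the chapter's method. The function names, the whitespace tokenizer, and the toy sentences are likewise hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_profiles(sentences):
    """Map each word w to P(w): the set of words in sentences containing w."""
    profiles = defaultdict(set)
    for sentence in sentences:
        # Assumption: naive lowercase whitespace tokenization.
        words = set(sentence.lower().split())
        for w in words:
            profiles[w] |= words
    return dict(profiles)


def jaccard_distance(p, q):
    """Jaccard distance between two profiles (an assumed choice of distance)."""
    union = p | q
    if not union:
        return 0.0
    return 1.0 - len(p & q) / len(union)


def same_context_clusters(profiles, threshold):
    """Group words whose profiles are within `threshold` of each other.

    Single-link grouping via union-find; a stand-in for the clustering
    step, which the abstract leaves unspecified.
    """
    parent = {w: w for w in profiles}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for a, b in combinations(sorted(profiles), 2):
        if jaccard_distance(profiles[a], profiles[b]) <= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for w in profiles:
        clusters[find(w)].add(w)
    return list(clusters.values())


# Toy collection (hypothetical data for illustration only).
sentences = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "stocks fell sharply today",
    "shares fell sharply today",
]
profiles = build_profiles(sentences)
clusters = same_context_clusters(profiles, threshold=0.5)
```

On this toy collection the words split into two groups, one drawn from the first two sentences and one from the last two, so "cat" and "dog" land together, as do "stocks" and "shares". On a real collection, stop-word removal and stemming would likely be needed before the profiles become informative.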
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kogan, J. (2003). Computational Information Retrieval. In: Haitovsky, Y., Ritov, Y., Lerche, H.R. (eds) Foundations of Statistical Inference. Contributions to Statistics. Physica, Heidelberg. https://doi.org/10.1007/978-3-642-57410-8_3
DOI: https://doi.org/10.1007/978-3-642-57410-8_3
Publisher Name: Physica, Heidelberg
Print ISBN: 978-3-7908-0047-0
Online ISBN: 978-3-642-57410-8
eBook Packages: Springer Book Archive