This chapter reviews search and cluster mining algorithms based on vector space modeling (VSM). The first part of the review considers two methods for addressing polysemy and synonymy problems in very large data sets: latent semantic indexing (LSI) and principal component analysis (PCA). The second part focuses on methods for finding minor clusters. Until recently, the study of minor clusters has been relatively neglected, even though they may represent rare but significant types of events or special types of customers. A new algorithm for finding minor clusters is introduced. It addresses some difficult issues in database analysis, such as accommodating cluster overlap, automatically labeling clusters based on their document contents, and giving the user control over the trade-off between speed of computation and quality of results. Implementation studies with news articles from the Reuters and Los Angeles Times TREC datasets show the effectiveness of the algorithm compared to previous methods.
© 2008 Springer-Verlag London Limited
Kobayashi, M., Aono, M. (2008). Vector Space Models for Search and Cluster Mining. In: Berry, M.W., Castellanos, M. (eds) Survey of Text Mining II. Springer, London. https://doi.org/10.1007/978-1-84800-046-9_6
DOI: https://doi.org/10.1007/978-1-84800-046-9_6
Publisher Name: Springer, London
Print ISBN: 978-1-84800-045-2
Online ISBN: 978-1-84800-046-9
eBook Packages: Computer Science (R0)