Vector Space Models for Search and Cluster Mining

Chapter in Survey of Text Mining II

This chapter reviews some search and cluster mining algorithms based on vector space modeling (VSM). The first part of the review considers two methods that address polysemy and synonymy problems in very large data sets: latent semantic indexing (LSI) and principal component analysis (PCA). The second part focuses on methods for finding minor clusters. Until recently, the study of minor clusters has been relatively neglected, even though they may represent rare but significant types of events or special types of customers. A new algorithm for finding minor clusters is introduced. It addresses some difficult issues in database analysis, such as accommodation of cluster overlap, automatic labeling of clusters based on their document contents, and a user-controlled trade-off between speed of computation and quality of results. Implementation studies with news articles from the Reuters and Los Angeles Times TREC datasets show the effectiveness of the algorithm compared to previous methods.
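As a rough illustration of how LSI can address synonymy, the sketch below projects a toy term-by-document matrix onto its leading singular vectors and compares a query against documents in the reduced space. It is not the chapter's implementation: the five-term vocabulary, the three documents, the rank k = 2, and the helper names `lsi_fit` and `cosine` are all assumptions made for this example.

```python
import numpy as np

def lsi_fit(term_doc, k):
    """Rank-k LSI via SVD of a term-by-document matrix (hypothetical helper).

    Returns U_k (terms x k) and the k-dimensional document coordinates,
    one row per document, taken from the columns of S_k V_k^T.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]
    docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T
    return U_k, docs_k

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Toy vocabulary (rows): car, automobile, engine, flower, petal.
# Toy documents (columns): d0 = {car, engine}, d1 = {automobile, engine},
# d2 = {flower, petal}. These are assumed data, not from the chapter.
A = np.array([
    [1.0, 0.0, 0.0],  # car
    [0.0, 1.0, 0.0],  # automobile
    [1.0, 1.0, 0.0],  # engine
    [0.0, 0.0, 1.0],  # flower
    [0.0, 0.0, 1.0],  # petal
])

q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # query containing only "car"

# Raw vector space: the query shares no term with d1, so the cosine is 0.
raw_sim_d1 = cosine(q, A[:, 1])

# LSI space: project the query with U_k^T and compare with document coordinates.
U_k, docs_k = lsi_fit(A, k=2)
q_k = U_k.T @ q
lsi_sim_d1 = cosine(q_k, docs_k[1])  # close to 1 on this toy matrix
lsi_sim_d2 = cosine(q_k, docs_k[2])  # close to 0 on this toy matrix
```

On this toy matrix the query "car" has zero similarity to the "automobile" document in the raw term space, yet the two become nearly collinear after projection, because both co-occur with "engine". Real collections are far larger and sparser, so a sparse truncated SVD solver would normally replace the dense `numpy.linalg.svd` call used here.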

Copyright information

© 2008 Springer-Verlag London Limited

About this chapter

Cite this chapter

Kobayashi, M., Aono, M. (2008). Vector Space Models for Search and Cluster Mining. In: Berry, M.W., Castellanos, M. (eds) Survey of Text Mining II. Springer, London. https://doi.org/10.1007/978-1-84800-046-9_6

  • DOI: https://doi.org/10.1007/978-1-84800-046-9_6

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84800-045-2

  • Online ISBN: 978-1-84800-046-9

  • eBook Packages: Computer Science, Computer Science (R0)
