Cluster-Preserving Dimension Reduction Methods for Efficient Classification of Text Data

  • Peg Howland
  • Haesun Park

Abstract

In today’s vector space information retrieval systems, dimension reduction is imperative for efficiently manipulating the massive quantity of data. To be useful, this lower-dimensional representation must be a good approximation of the original document set given in its full space. Toward that end, we present mathematical models, based on optimization and a general matrix rank reduction formula, which incorporate a priori knowledge of the existing structure. From these models, we develop new methods for dimension reduction based on the centroids of data clusters. We also adapt and extend the discriminant analysis projection, which is well known in pattern recognition. The result is a generalization of discriminant analysis that can be applied regardless of the relative dimensions of the term-document matrix.
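
As a rough illustration of what dimension reduction based on cluster centroids can look like, the sketch below reduces each document to its least-squares coordinates with respect to the cluster centroids. The abstract does not spell out the algorithm, so this is an assumption about one plausible formulation, not the authors' exact method; the function name `centroid_reduce`, the toy matrix, and the cluster labels are all hypothetical.

```python
import numpy as np

def centroid_reduce(A, labels):
    """Hypothetical sketch of centroid-based dimension reduction.

    A      : m x n term-document matrix (each column is a document).
    labels : length-n sequence of cluster ids, assumed known a priori.
    Returns the m x k centroid matrix C and the k x n reduced representation Y.
    """
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    # Column j of C is the mean of the documents in cluster j.
    C = np.column_stack([A[:, labels == c].mean(axis=1) for c in clusters])
    # Reduced coordinates: solve min_Y ||C Y - A||_F by least squares.
    Y, *_ = np.linalg.lstsq(C, A, rcond=None)
    return C, Y

# Toy data: 6 terms, 8 documents, 3 clusters (all values are made up).
rng = np.random.default_rng(0)
A = rng.random((6, 8))
labels = [0, 0, 0, 1, 1, 1, 2, 2]
C, Y = centroid_reduce(A, labels)
print(C.shape, Y.shape)
```

Here each document drops from 6 dimensions (terms) to 3 (one coordinate per cluster), so the reduced dimension equals the number of clusters rather than a rank chosen by truncation.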

Keywords

Discriminant Analysis; Singular Value Decomposition; Dimension Reduction; Singular Vector; Misclassification Rate

Copyright information

© Springer Science+Business Media New York 2004
