TMG: A MATLAB Toolbox for Generating Term-Document Matrices from Text Collections

  • D. Zeimpekis
  • E. Gallopoulos
Chapter

Summary

A wide range of computational kernels in data mining and information retrieval from text collections involve techniques from linear algebra. These kernels typically operate on data that are presented in the form of large sparse term-document matrices (tdms). We present TMG, a research and teaching toolbox for the generation of sparse tdms from text collections and for the incremental modification of these tdms by means of additions or deletions. The toolbox is written entirely in MATLAB, a popular problem-solving environment that is powerful in computational linear algebra, in order to streamline document preprocessing and prototyping of algorithms for information retrieval. Several design issues that concern the use of MATLAB's sparse infrastructure and data structures are addressed. We illustrate the use of the tool in numerical explorations of the effect of stemming and different term-weighting policies on the performance of querying and clustering tasks.
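To make the notion of a sparse term-document matrix concrete, the following is a minimal sketch in Python. It is not TMG's interface or implementation (TMG is written in MATLAB); the function name `build_tdm`, the whitespace tokenization, and the raw-count weighting are all assumptions made for illustration. Stemming, stop-word removal, and term weighting are omitted.

```python
from collections import Counter

def build_tdm(docs):
    """Hypothetical helper: build a sparse term-document matrix.

    Returns (terms, columns), where `terms` is the sorted vocabulary and
    `columns[j]` maps a term's row index to its raw count in document j.
    Only nonzero entries are stored, which is what makes the matrix sparse.
    """
    # Assumption: lowercase and split on whitespace; no stemming or stop list.
    tokenized = [doc.lower().split() for doc in docs]
    terms = sorted({tok for toks in tokenized for tok in toks})
    row_of = {term: i for i, term in enumerate(terms)}
    columns = [dict(Counter(row_of[tok] for tok in toks)) for toks in tokenized]
    return terms, columns

docs = ["sparse matrix methods", "text retrieval with sparse matrix"]
terms, columns = build_tdm(docs)
```

Each document becomes one column, and each distinct term one row. Adding a document in this sketch means appending a column and, if it introduces new terms, extending the vocabulary, which hints at why incremental updates need care in a real tool.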

Keywords

  • Information Retrieval
  • Singular Value Decomposition
  • Vector Space Model
  • Inverted Index
  • Query Answering
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • D. Zeimpekis (1)
  • E. Gallopoulos (1)
  1. Department of Computer Engineering and Informatics, University of Patras, Patras, Greece
