This chapter details the process of converting documents into an analysis-ready term-document representation. Preprocessed text documents are first transformed into an inverted index for demonstrative purposes. The inverted index is then converted into a term-document or document-term matrix. The chapter concludes with descriptions of different weighting schemes that can be applied to the term-document representation.
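The pipeline summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's own implementation: the toy documents, the function names (`build_inverted_index`, `to_term_document_matrix`, `tfidf`), and the choice of raw term frequency multiplied by a natural-log inverse document frequency are all hypothetical, and the chapter may define its weights differently.

```python
import math
from collections import defaultdict

# Hypothetical documents, assumed to be already preprocessed (tokenized, lowercased).
docs = {
    "d1": ["text", "mining", "text"],
    "d2": ["data", "mining"],
    "d3": ["text", "data", "data", "index"],
}

def build_inverted_index(docs):
    """Map each term to a {doc_id: raw term frequency} posting dictionary."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok][doc_id] = index[tok].get(doc_id, 0) + 1
    return dict(index)

def to_term_document_matrix(index, doc_ids):
    """Return sorted terms and a matrix with one row per term, one column per document."""
    terms = sorted(index)
    matrix = [[index[t].get(d, 0) for d in doc_ids] for t in terms]
    return terms, matrix

def tfidf(matrix, n_docs):
    """Weight each count by log(N / df); one common tf-idf variant (an assumption here)."""
    weighted = []
    for row in matrix:
        df = sum(1 for tf in row if tf > 0)  # number of documents containing the term
        idf = math.log(n_docs / df)
        weighted.append([tf * idf for tf in row])
    return weighted

doc_ids = sorted(docs)
index = build_inverted_index(docs)
terms, tdm = to_term_document_matrix(index, doc_ids)
dtm = [list(col) for col in zip(*tdm)]  # document-term matrix is the transpose
weights = tfidf(tdm, len(doc_ids))
```

In this sketch the document-term matrix is simply the transpose of the term-document matrix, so either orientation can be derived from the same inverted index. A term that appears in every document would receive a weight of zero under this particular weighting.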
Keywords: Inverted index, Term-document matrix, Document-term matrix, Term frequency, Document frequency, Term frequency-inverse document frequency, Inverse document frequency, Weighting, Term weighting, Document weighting, Log frequency
- Jessup, E. R., & Martin, J. H. (2001). Taking a new look at the latent semantic analysis approach to information retrieval. Computational Information Retrieval, 2001, 121–144.
- For more about the term-document representation of text data, see Berry et al. (1999) and Manning et al. (2008).