Abstract
We closed the previous chapter by introducing the "curse of dimensionality", which refers to the sparseness problem that typically affects models involving a very large number of variables, i.e. high-dimensional spaces. In practice, this problem is alleviated by dimensionality reduction techniques, which reduce the sparseness of the data representation by projecting the original model into a new space of lower dimensionality. Several different approaches to dimensionality reduction exist, and their use constitutes a very common practice in data mining applications; indeed, almost every standard data mining method or procedure involves some sort of dimensionality reduction. In this chapter we will focus our attention on three basic methods for dimensionality reduction. First, in Sect. 9.1, vocabulary pruning and merging methods will be presented. Then, in Sect. 9.2, the linear transformation approach to dimensionality reduction will be introduced and discussed. In Sect. 9.3, the use of non-linear projection methods for dimensionality reduction will be described. Finally, some relevant references to other important and commonly used methods are provided in the Further Reading section at the end of the chapter.
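As a minimal illustration of the first family of methods (vocabulary pruning, Sect. 9.1), the following sketch removes stop words and rare terms from a toy vocabulary. The chapter itself works in MATLAB®; this is a hedged Python sketch, and the corpus and stop word list are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Toy corpus and stop word list (hypothetical, for illustration only)
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat and the dog played",
]
stop_words = {"the", "on", "and"}

# Count term frequencies over the whole corpus
counts = Counter(word for doc in documents for word in doc.split())

# Prune: drop stop words and terms occurring fewer than 2 times
pruned_vocabulary = sorted(
    term for term, freq in counts.items()
    if term not in stop_words and freq >= 2
)
print(pruned_vocabulary)  # ['cat', 'dog', 'sat']
```

Each document is now represented over three dimensions instead of nine; real applications use the same idea with much larger vocabularies and corpus-specific frequency thresholds.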
Notes
- 1.
Several stop word lists for many different languages can easily be found on the web just by querying for "stop words". For instance, for a general-purpose English stop word list, you can check: http://www.textfixer.com/resources/common-english-words.txt. Accessed 16 July 2011.
- 2.
In practical applications, with probably few exceptions, it is not customary to implement dimensionality reductions as drastic as this one. Generally, reduced space dimensionalities will be around 100–500 dimensions. Here, we are using an exaggeratedly low dimensionality for illustrative purposes.
- 3.
Although this can vary significantly from word to word, the general trend is to observe a more direct dependency on pair-wise co-occurrences in the high-dimensional space. On the other hand, in low-dimensional spaces, similarity scores among terms depend more on context similarities than on simple word co-occurrences.
- 4.
A similar example is also presented on the documentation page of the MATLAB® multidimensional scaling function: http://www.mathworks.com/help/toolbox/stats/briu08r-1.html. Accessed 16 September 2011.
- 5.
Different versions and/or initializations of the mdscale algorithm can produce different rotations, offsets and scaling factors; if you are not able to reproduce the results shown in Fig. 9.5, you will have to play for a while with the values of variables rotang, scale1, scale2, offset1 and offset2 until you get an appropriate fit of the cities into the map.
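One cheap way of obtaining a reduced space of the size mentioned in note 2 is random mapping (Kaski 1998, cited in the References): each original dimension is assigned a random direction in the target space, and documents are projected by summing their weighted directions. A minimal Python sketch under hypothetical sizes (1000 original dimensions, 100 target dimensions); the book itself works in MATLAB®:

```python
import random

random.seed(0)  # fixed seed for reproducibility

orig_dim, target_dim = 1000, 100  # hypothetical sizes; 100-500 is typical in practice

# One random Gaussian direction in the target space per original dimension
projection = [[random.gauss(0.0, 1.0) for _ in range(target_dim)]
              for _ in range(orig_dim)]

def project(sparse_vector):
    """Map a {dimension_index: weight} sparse vector into the reduced space."""
    reduced = [0.0] * target_dim
    for index, weight in sparse_vector.items():
        row = projection[index]
        for j in range(target_dim):
            reduced[j] += weight * row[j]
    return reduced

# A document represented by a few non-zero term weights (hypothetical)
doc = {3: 1.0, 250: 2.0, 999: 0.5}
reduced_doc = project(doc)
print(len(reduced_doc))  # 100
```

With high probability, pairwise distances in the original space are approximately preserved after such a projection, which is why random mapping is a popular baseline against more expensive linear transformations.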
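The term similarities discussed in note 3 are typically measured as the cosine of the angle between term vectors, both before and after dimensionality reduction. The following self-contained Python sketch uses hypothetical vectors: two near-synonyms that never co-occur get zero similarity in the raw term-by-document space, while (hypothetical) reduced-space coordinates shaped by shared contexts place them close together.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical term-by-document count vectors in the original space:
# "car" and "automobile" never co-occur, so their raw similarity is zero...
car = [2, 0, 1, 0]
automobile = [0, 3, 0, 1]
print(cosine_similarity(car, automobile))  # 0.0

# ...whereas hypothetical reduced-space coordinates, shaped by similar
# contexts rather than direct co-occurrence, are nearly parallel:
car_reduced = [1.2, 0.8]
automobile_reduced = [1.0, 0.9]
print(cosine_similarity(car_reduced, automobile_reduced))  # close to 1
```

This is exactly the behavior the note describes: high-dimensional similarities track pairwise co-occurrence, while low-dimensional similarities track context similarity.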
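The manual adjustment described in note 5 amounts to applying a rotation, a per-axis scaling and an offset to the 2-D coordinates returned by multidimensional scaling. As a hedged Python sketch of that transformation (parameter names mirror the rotang, scale1, scale2, offset1 and offset2 variables mentioned in the note; the input coordinates are hypothetical, not actual mdscale output):

```python
from math import cos, sin, radians

def fit_to_map(points, rotang, scale1, scale2, offset1, offset2):
    """Rotate 2-D points by rotang degrees, then scale and shift each axis."""
    c, s = cos(radians(rotang)), sin(radians(rotang))
    fitted = []
    for x, y in points:
        xr = c * x - s * y  # rotation
        yr = s * x + c * y
        fitted.append((scale1 * xr + offset1, scale2 * yr + offset2))
    return fitted

# Hypothetical MDS output for three cities, rotated 90 degrees into place
mds_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(fit_to_map(mds_points, 90.0, 1.0, 1.0, 0.0, 0.0))
```

Trying a few values of the rotation angle, scales and offsets in a loop like this is the "play for a while" the note refers to; the MDS solution is only determined up to such transformations.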
References
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Cox TF, Cox MAA (2001) Multidimensional scaling. Chapman and Hall, Boca Raton
Deerwester S, Dumais S, Landauer T, Furnas G, Beck L (1988) Improving information retrieval with latent semantic indexing. In: Proceedings of the 51st Annual Meeting of the American society for information science, 25:36–40, Atlanta, GA
Fodor IK (2002) A survey of dimension reduction techniques. U.S. Department of Energy, Lawrence Livermore National Laboratory, UCRL-ID-148494
Golub GH, Kahan W (1965) Calculating the singular values and pseudo-inverse of a matrix. J Soc Ind Appl Math: Numer Anal 2(2):205–224
Gorsuch RL (1983) Factor analysis. Lawrence Erlbaum, Hillsdale
Griffiths T, Steyvers M, Tenenbaum JB (2007) Topics in semantic representation. Psychol Rev 114(2):211–244
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313:504–507
Hofmann T (1999) Probabilistic latent semantic analysis. In: Proceedings of the 15th conference on uncertainty in artificial intelligence
Hull DA (1996) Stemming algorithms: a case study for detailed evaluation. J Am Soc Inf Sci 47(1):70–84
Hyvärinen A (1999) Survey on independent component analysis. Neural Comput Surv 2:94–128
Jolliffe IT (2002) Principal component analysis. Springer-Verlag, New York
Karlgren J (1999) Stylistic experiments in information retrieval. In: Strzalkowski (ed) Natural language information retrieval. Kluwer, Dordrecht, pp 147–166
Kaski S (1998) Dimensionality reduction by random mapping: fast similarity computation for clustering. In: Proceedings IEEE international joint conference on neural networks, pp 413–418
Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480
Koskenniemi KM (1983) Two-level morphology: a general computational model for word-form recognition and production. Technical report, University of Helsinki, Helsinki, Finland
Landauer T, Dumais S (1997) A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychol Rev 104(2):211–240
Madsen RE, Sigurdsson S, Hansen LK, Larsen J (2004) Pruning the vocabulary for better context recognition. In: Proceedings of the 17th international conference on pattern recognition
Matsuda Y, Yamaguchi K (2005) An efficient MDS algorithm for the analysis of massive document collections. In: Khosla R et al (eds) KES 2005, LNAI 3682:1015–1021. Springer, Berlin
Pearson K (1901) On lines and planes of closest fit to systems of points in space. Phil Mag 2(6):559–572
Wälchli B, Cysouw M (2012) Lexical typology through similarity semantics: Toward a semantic map of motion verbs. Linguistics 50(3):671–710
© 2013 Springer Science+Business Media New York
Cite this chapter: Banchs, R.E. (2013). Dimensionality Reduction. In: Text Mining with MATLAB®. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-4151-9_9
Print ISBN: 978-1-4614-4150-2
Online ISBN: 978-1-4614-4151-9