## Abstract

We have seen that a blob of data can be translated so that it has zero mean, then rotated so the covariance matrix is diagonal. In this coordinate system, we can set some components to zero, and get a representation of the data that is still accurate. The rotation and translation can be undone, yielding a dataset that is in the same coordinates as the original, but lower dimensional. The new dataset is a good approximation to the old dataset. All this yields a really powerful idea: we can choose a small set of vectors, so that each item in the original dataset can be represented as the mean vector plus a weighted sum of this set. This representation means we can think of the dataset as lying on a low dimensional space inside the original space. It’s an experimental fact that this model of a dataset is usually accurate for real high dimensional data, and it is often an extremely convenient model. Furthermore, representing a dataset like this very often suppresses noise—if the original measurements in your vectors are noisy, the low dimensional representation may be closer to the true data than the measurements are.