Abstract
In this paper we apply different information distortion techniques to a set of classical books written in English. We study the impact of these distortions on the Kolmogorov complexity of the books and on the clustering by compression technique, which is based on the Normalized Compression Distance (NCD). We show how to decrease the complexity of the books by introducing several modifications into them, and we use a clustering error measure to quantify how well the information contained in each book is maintained. We find experimentally that the best way to keep the clustering error low is to modify the most frequent words. We explain the details of these distortions and compare them with other kinds of modification, such as random word distortions and infrequent word distortions. Finally, we present some phenomenological explanations of the empirical results.
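As a rough illustration of the quantities named in the abstract, the following sketch computes the NCD in its standard form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size. It is a minimal sketch under stated assumptions, not the authors' implementation: the choice of Python's `lzma` module as the compressor, the helper names `compressed_size`, `ncd` and `distort_most_frequent`, and the scheme of overwriting words with a filler character are all assumptions made for illustration, since the abstract does not say how the words are modified.

```python
import lzma
import re
from collections import Counter


def compressed_size(data: bytes) -> int:
    """Approximate C(data) by the length of its LZMA-compressed form."""
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distort_most_frequent(text: str, n_words: int, filler: str = "*") -> str:
    """Hypothetical distortion: overwrite the n_words most frequent words.

    Each occurrence is replaced by a run of `filler` of the same length,
    so the text keeps its length and word boundaries. This masking scheme
    is an assumption, not something stated in the abstract.
    """
    words = re.findall(r"[A-Za-z]+", text.lower())
    targets = {w for w, _ in Counter(words).most_common(n_words)}

    def mask(match: re.Match) -> str:
        word = match.group(0)
        return filler * len(word) if word.lower() in targets else word

    return re.sub(r"[A-Za-z]+", mask, text)
```

Under these assumptions, a distorted book could be compared with another through `ncd(distort_most_frequent(a, 100).encode(), distort_most_frequent(b, 100).encode())`, and the resulting pairwise distances passed to a clustering step. Real compressors only approximate Kolmogorov complexity, so the values depend on the compressor and on the size of the inputs.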
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Granados, A., Cebrián, M., Camacho, D., Rodríguez, F.B. (2008). Evaluating the Impact of Information Distortion on Normalized Compression Distance. In: Barbero, Á. (eds) Coding Theory and Applications. Lecture Notes in Computer Science, vol 5228. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87448-5_8
DOI: https://doi.org/10.1007/978-3-540-87448-5_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87447-8
Online ISBN: 978-3-540-87448-5
eBook Packages: Computer Science (R0)