Abstract
Traditional wisdom has it that the better a theory compresses the learning data concerning some phenomenon under investigation, the better we learn and generalize, and the better the theory predicts unknown data. This belief is vindicated in practice but apparently has not been rigorously proved in a general setting. Making these ideas rigorous involves the length of the shortest effective description of an individual object: its Kolmogorov complexity. In a previous paper we have shown that optimal compression is almost always a best strategy in hypothesis identification (an ideal form of the minimum description length (MDL) principle). Although the single best hypothesis does not necessarily give the best prediction, we demonstrate that compression is nonetheless almost always the best strategy in prediction methods in the style of R. Solomonoff.
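To fix notation for readers new to this style of prediction (a standard formulation following Solomonoff [22, 23] and Li and Vitányi [14], not a restatement of this paper's results): Solomonoff's universal prior gives a string x the total weight of all programs that make a fixed universal monotone machine U output a sequence beginning with x, and prediction proceeds by conditioning,

    M(x) = \sum_{p\,:\,U(p)=x\ast} 2^{-|p|}, \qquad M(y \mid x) = \frac{M(xy)}{M(x)},

and Solomonoff's convergence theorem [23] bounds the total expected squared prediction error against any computable source \mu by a constant depending only on \mu:

    \sum_{n \ge 1} \mathbf{E}_{\mu}\bigl( M(0 \mid x_{1:n-1}) - \mu(0 \mid x_{1:n-1}) \bigr)^2 \;\le\; \frac{\ln 2}{2}\, K(\mu).

Since M, like Kolmogorov complexity itself, is uncomputable, any executable illustration must substitute a real compressor. The following toy sketch is ours, not the authors'; it uses Python's zlib as a crude computable stand-in for description length and prefers the candidate continuation that adds the least compressed length to the observed data. All identifiers in it are illustrative.

    import zlib

    def compressed_length(s: bytes) -> int:
        # Compressed size in bytes: a crude, computable upper-bound proxy
        # for the (uncomputable) Kolmogorov complexity of s.
        return len(zlib.compress(s, 9))

    def best_continuation(history: bytes, candidates: list) -> bytes:
        # Prediction by compression: prefer the continuation under which
        # the extended sequence compresses best, i.e. the one that adds
        # the least description length beyond the regularities already
        # present in the history.
        return min(candidates, key=lambda c: compressed_length(history + c))

    history = b"01" * 200                 # a highly regular, periodic source
    candidates = [b"01" * 25,             # continues the period
                  b"10" * 25,             # phase-shifted continuation
                  b"11" * 25]             # breaks the pattern
    for c in candidates:
        print(c[:6], compressed_length(history + c))
    print("preferred continuation:", best_continuation(history, candidates)[:6])

Real compressors work at byte granularity, so single-symbol comparisons often tie; comparing longer candidate continuations, as above, makes the compressed-length differences visible. The sketch captures the slogan "compress better, predict better" only informally, whereas the paper's results concern the ideal, uncomputable setting.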
Paul Vitányi is also affiliated with the University of Amsterdam. He was supported by NSERC through International Scientific Exchange Award ISE0125663, by the European Union through NeuroCOLT ESPRIT Working Group Nr. 8556, and by NWO through NFI Project ALADDIN under contract number NF 62–376. Ming Li was supported in part by NSERC operating grant OGP-046506, ITRC, a CGAT grant, and the Steacie Fellowship; he was on sabbatical leave from the Department of Computer Science, University of Waterloo (Email: mli@math.uwaterloo.ca).
References
J.L. Doob, Stochastic Processes, Wiley, 1953.
P. Gács, On the symmetry of algorithmic information, Soviet Math. Dokl., 15 (1974) 1477–1480. Correction: ibid., 15 (1974) 1480.
P. Gács, On the relation between descriptional complexity and algorithmic probability, Theoret. Comput. Sci., 22(1983), 71–93.
A.N. Kolmogorov, Three approaches to the quantitative definition of information, Problems Inform. Transmission 1:1 (1965) 1–7.
L.A. Levin, On the notion of a random sequence, Soviet Math. Dokl., 14(1973), 1413–1416.
M. Li and P.M.B. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag, New York, 1993.
M. Li and P.M.B. Vitányi, Computational machine learning in theory and praxis, in: J. van Leeuwen (Ed.), Computer Science Today, Lecture Notes in Computer Science, Vol. 1000, Springer-Verlag, Heidelberg, 1995, 518–535.
P.M.B. Vitányi and M. Li, Ideal MDL and its relation to Bayesianism, in: Proc. ISIS: Information, Statistics and Induction in Science, World Scientific, Singapore, 1996, 282–291.
P. Martin-Löf, The definition of random sequences, Inform. Contr., 9(1966), 602–619.
J.J. Rissanen, Modeling by the shortest data description, Automatica-J.IFAC 14 (1978) 465–471.
J.J. Rissanen, Stochastic Complexity in Statistical Inquiry, World Scientific, Singapore, 1989.
J.J. Rissanen, Fisher information and stochastic complexity, IEEE Trans. Inform. Theory, IT-42:1(1996), 40–47.
J. Segen, Pattern-Directed Signal Analysis, PhD Thesis, Carnegie-Mellon University, Pittsburgh, 1980.
R.J. Solomonoff, A formal theory of inductive inference, Part 1 and Part 2, Inform. Contr., 7(1964), 1–22, 224–254.
R.J. Solomonoff, Complexity-based induction systems: comparisons and convergence theorems, IEEE Trans. Inform. Theory IT-24 (1978) 422–432.
A.M. Turing, On computable numbers, with an application to the Entscheidungsproblem, Proc. London Math. Soc., Ser. 2, 42(1936), 230–265; Correction, ibid., 43(1937), 544–546.
R. von Mises, Grundlagen der Wahrscheinlichkeitsrechnung, Mathemat. Zeitsch., 5(1919), 52–99.
V. Vovk, Minimum description length estimators under the universal coding scheme, in: P. Vitányi (Ed.), Computational Learning Theory, Proc. 2nd European Conf. (EuroCOLT '95), Lecture Notes in Artificial Intelligence, Vol. 904, Springer-Verlag, Heidelberg, 1995, pp. 237–251; Learning about the parameter of the Bernoulli model, J. Comput. System Sci., to appear.
C.S. Wallace and D.M. Boulton, An information measure for classification, Computer Journal 11 (1968) 185–195.
C.S. Wallace and P.R. Freeman, Estimation and inference by compact coding, J. Royal Stat. Soc., Series B, 49 (1987) 240–251. Discussion: ibid., 252–265.
K. Yamanishi, A Randomized Approximation of the MDL for Stochastic Models with Hidden Variables, Proc. 9th ACM Comput. Learning Conference, ACM Press, 1996.
A.K. Zvonkin and L.A. Levin, The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms, Russian Math. Surveys 25:6 (1970) 83–124.
Copyright information
© 1997 Springer-Verlag Berlin Heidelberg
Cite this paper
Vitányi, P., Li, M. (1997). On prediction by data compression. In: van Someren, M., Widmer, G. (eds) Machine Learning: ECML-97. ECML 1997. Lecture Notes in Computer Science, vol 1224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-62858-4_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-62858-3
Online ISBN: 978-3-540-68708-5