Children learn to name the objects they see by forming general associations between the words they hear and the images arriving at their retinas. Discriminative neural network models can also be taught to classify objects, but to do so they require more information about how images pair with words (i.e., supervised data) than the brain seems to receive. We propose that the brain exploits unsupervised learning on raw sensory input to compensate for the scarcity of supervised data in its environment. Here we show that artificial neural networks that first develop a statistical model of the world in an unsupervised fashion can learn good image-word pairings using dramatically less supervised data. This idea may help to explain how the brain learns to solve sensorimotor problems for which little feedback is available about the success of selected actions.