For my undergraduate physics degree, many years ago, the final set of exams contained a general paper in which we students could be asked anything.
The sky is dark at night: explain. Or: How many molecules of Caesar’s last breath do you inhale each time you take a breath? Or: Derive a lower limit on the proton lifetime from the fact of your own existence. The purpose wasn’t to test whether we could remember physics facts—books were allowed into the exam hall, so we could look up whatever facts our books made available. Rather, the exam tested our ability to strip down a problem to its essence and apply mathematical arguments to reach a reasonable, if not exact, conclusion. (This was in the years B.I., before the internet. Nowadays, if smartphones were allowed, students could google the answer even to seemingly random problems such as those above.) One of the questions on my final paper essentially required us to recreate the argument made by Professor Wallhausen. I aced the paper, so I’ve had a soft spot for the concept of a universal library ever since.
I first came across the concept in fiction in “The Library of Babel” by Jorge Luis Borges. The Borges tale was published in 1941, almost four decades after “The Universal Library”, but it’s far more celebrated than Lasswitz’s story. (Incidentally, Borges wrote his story while he was working—unhappily—as a librarian.) Where the scientifically trained Lasswitz took a rigorously mathematical approach to the question of a universal library, the philosophically inclined Borges took a more metaphysical approach. In “The Library of Babel”, Borges imagines a universe filled with planes of interlocking hexagonal rooms, each room having four of its walls lined with bookshelves. Spiral staircases connect the planes. Each book on the shelves is different. For Borges, each book contains 410 pages; there are 40 lines per page, 80 characters per line, and 25 different possible typographical symbols. If you run the numbers in the same way as Lasswitz did, it’s easy to calculate that Borges’ Library of Babel contains about 1.95 × 10^1,834,097 different books. This is a huge number, incomparably bigger than the number of particles in the observable universe. It is, however, vastly smaller than the number of all possible books, calculated by Lasswitz, which is 10^2,000,000. It’s difficult to comprehend the size of the numbers contained in “The Universal Library” and “The Library of Babel”. Perhaps a comparison with some real-world attempts at a universal library can put them into perspective.
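If you want to check that figure, a few lines of Python reproduce it from nothing more than the page, line, and character counts Borges specifies (a quick calculation of my own, in the spirit of Lasswitz):

```python
import math

# Borges specifies 410 pages per book, 40 lines per page,
# 80 characters per line, and 25 possible symbols.
chars_per_book = 410 * 40 * 80          # 1,312,000 characters
# The exact count, 25 ** chars_per_book, is an integer with
# over 1.8 million digits; logarithms give its size without
# having to write it out.
exponent = chars_per_book * math.log10(25)
mantissa = 10 ** (exponent - math.floor(exponent))
print(f"about {mantissa:.3f} x 10^{math.floor(exponent):,} books")
# -> about 1.956 x 10^1,834,097 books
```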
The original universal library was the ancient Library of Alexandria, the tragic destruction of which through fire meant manuscripts of immense cultural significance were lost to the ages. The library’s index was also lost in the conflagration, so it’s not known for certain how many books were housed at Alexandria—experts suggest the number of scrolls would have been between 40,000 and 400,000. If the number of scrolls were at the top end of the range then the Library of Alexandria would have housed a significant fraction of the ancient world’s knowledge. (It’s interesting to note that an important function of a library is to classify and organize knowledge. An ancient library wouldn’t necessarily have been organized in the same way as a modern library, because the ancients viewed the world in a quite different way to us. What we might classify as poetry, for example, the ancients might have classified as natural science.) Moving forward in time, the present British Library contains some of the world’s most significant books and manuscripts, items that are priceless. In addition to the quality of its collections, the British Library stands out in terms of quantity: it has more books than any other library except the American Library of Congress. The LOC, the world’s largest library, has 32 million books and many more millions of photographs, maps, and manuscripts. Of course, even the British Library and the Library of Congress (see the first figure below) now have a rival: the internet. The internet can be thought of as a library containing not just text, but images, sounds, videos, and simulations. (Indeed, one website even provides a simulation of Borges’ Library of Babel, offering a disorientating glimpse of what the Library contains; see the second figure below. You’ll struggle to find anything of interest in it, however. As both Lasswitz and Borges emphasised, there’s a problem with indexing a universal library.)
Reading rooms of the Library of Congress (top) and the British Library (bottom). The LOC and BL are among the largest libraries in the world, but their combined storage capacity would be insufficient to house the printed output of the Web—let alone the unimaginable vastness of the Universal Library (Credit: top—Carol M. Highsmith; bottom—Diliff)
Jonathan Basile, creator of the online universal library, standing in front of a page of text from the library (Credit: Alan Levine)
The Library of Alexandria, the British Library, the Library of Congress, and the internet. If you were to collect all the items contained in these libraries and throw in all the items from all the other libraries, public and private, that people have put together throughout history—well, the collection would fill only a tiny fraction of Borges’ Library of Babel. It would be vanishingly small compared to Lasswitz’s Universal Library. The real world is much smaller than the world of mathematical possibility. Nevertheless, you can’t deny that our technological civilization is producing data at an unprecedented rate. And this opens up numerous challenges and opportunities. Let’s see why.
Suppose we wanted to print out all the text that appears on the Web. How many books would we need? It’s impossible to give a definitive answer, of course, but we can make an informed guess.
In 2014, scientists estimated that the internet housed 1 billion websites; the number of sites fluctuates because new websites are created and old websites are retired, but a round billion is a reasonable figure. The number of websites by itself doesn’t help us, because each site can contain multiple pages, but it’s possible to account for that. In 2016, scientists estimated that there were 4.66 billion web pages. (These estimates ignored material on the so-called “Deep Web”—a corner of the internet which, for many different reasons, both legitimate and illegitimate, is not indexed by search engines. By definition, it’s difficult to calculate the size of the Deep Web, but experts estimate that it contains orders of magnitude more material than appears in traditionally indexed pages. For simplicity, though, let’s agree to omit the Deep Web from our calculations.) In the spirit of Professor Wallhausen, let’s suppose the internet contains 5 billion web pages and that each web page, if printed out on paper, corresponds to 10 book pages. (I have no idea how realistic this estimate is, but it doesn’t seem too unreasonable.) In this case, if you were to make a hard copy of the Web you’d end up with 50 billion printed pages. If we assume an average book contains 500 pages, then we know how big a library would have to be in order to house the “Surface” Web: the library would have to hold 100 million books. Neither the British Library nor the Library of Congress would suffice.
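In the spirit of those exam questions I mentioned earlier, here is the same back-of-envelope calculation as a few lines of Python. Every input is one of the guesses made above, so the output is an order-of-magnitude sketch, not a measurement.

```python
# Fermi estimate: how many 500-page books would a printout
# of the Surface Web fill? All inputs are rough guesses.
web_pages = 5_000_000_000          # ~5 billion indexed web pages
book_pages_per_web_page = 10       # a guess at printed length
pages_per_book = 500               # a typical book

printed_pages = web_pages * book_pages_per_web_page
books = printed_pages // pages_per_book
print(f"{printed_pages:,} printed pages -> {books:,} books")
# -> 50,000,000,000 printed pages -> 100,000,000 books
```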
But our online world consists of more than just static webpages. With their tweets, Twitter users generate the equivalent of about 25,000 500-page books each day; Facebook users share 78 million links each day; around the world, about 200 billion emails are sent each day. It’s as if the general public is regularly filling the Library of Congress with content. And of course digital content isn’t restricted to textual material of the sort that interested Professor Wallhausen: there are graphics, maps, photos, songs, videos, simulations … all sorts of information is now in digital form. Text-based data constitutes only a small fraction of what is stored on the internet. It’s worth reiterating that any numbers we use to capture the size of the internet can’t compare with the ungraspably large numbers discussed by Lasswitz in “The Universal Library”. But by most real-world standards we are surely justified in saying the internet is big, and getting bigger with each passing year. In order to better quantify this, though, and understand why this trend carries with it both challenges and opportunities, we need to look at how computer scientists measure storage requirements for data.
Ultimately, computers work with binary digits—bits: 0 or 1, on or off, north or south. When discussing data, however, a more useful unit is the byte, which is eight binary digits long. Most computers use a byte to represent a single character—letter, number, typographical symbol. In the early days of personal computing, when people worried about how much memory was inside their machine and about the file size of their documents, units such as kilobyte and megabyte entered the common parlance. (Note that the terms have two different but equally valid definitions, so there is some confusion here. A kilobyte can be 1000 bytes, as the name implies; but in computing it’s convenient to work in powers of 2, so a kilobyte is often 2^10 = 1024 bytes. A megabyte can be 1,000,000 bytes; but it can also be 2^20 = 1,048,576 bytes. The difference between the two definitions increases as the size increases, but for the purposes of this discussion we needn’t be concerned: we are interested in orders of magnitude, not in precise numbers.) Anyway, as many aspects of computer technology advanced along the exponential curve known as Moore’s Law—with a doubling every 18 months or so—people began talking about the gigabyte (a billion bytes) and then the terabyte (a trillion bytes). My current computer contains a terabyte hard disk, a luxury unthinkable back when home computers typically came with a 360-kilobyte floppy disk drive. The pattern whereby each named unit of data is one thousand (or 1024) times greater than the previous unit continues: after the terabyte we have the petabyte, then the exabyte, zettabyte, and yottabyte. I’ve even seen mention of the brontobyte and geobyte. Phew.
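To see how the two definitions drift apart, a short Python loop suffices. It tabulates each prefix under both readings; the gap grows from about 2 percent at the kilobyte to over 20 percent at the yottabyte.

```python
# Decimal (SI) versus binary readings of the storage prefixes.
# The relative gap widens with each step up the scale.
prefixes = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

for n, name in enumerate(prefixes, start=1):
    decimal = 1000 ** n        # e.g. a "kilobyte" as 10^3 bytes
    binary = 1024 ** n         # e.g. a "kilobyte" as 2^10 bytes
    gap = (binary - decimal) / decimal * 100
    print(f"{name}byte: 10^{3*n} vs 2^{10*n} (binary is {gap:.1f}% larger)")
# kilobyte: 2.4% larger ... yottabyte: about 20.9% larger
```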
For someone who remembers having to transfer data from computer to computer on 5.25-inch floppy disks, a unit such as the zettabyte seems ludicrously inappropriate for any realistic computing task. And yet last year, as I write, internet traffic exceeded a zettabyte; this year there’s been even more traffic; next year it will be greater still. Individuals, businesses, universities, research teams … it seems as if the world is becoming a factory for generating data. As mentioned above, handling such a flood of data presents challenges—conceptual, technical, and ethical. But if we can tame the deluge then the opportunities are immense.
Let me give just one example of the challenges and opportunities of so-called Big Data. The example happens to come from astronomy, but I could have taken an example from other areas of science—or from healthcare, retail, technology … indeed, from most aspects of human endeavour.
We are entering a golden age of observational astronomy and cosmology. Consider, for example, the Large Synoptic Survey Telescope (LSST). When this wonderful telescope commences operations in 2022 it will consist essentially of three very large mirrors, behind which a 3.2-gigapixel digital camera will take 15-second exposures of the sky every 20 seconds. Scientists expect the camera to generate 1.28 petabytes of data every year. Human astronomers simply won’t be able to process that amount of data: there aren’t enough eyes and brains for the task. And the LSST is just one of many observatories—operating not just throughout the electromagnetic spectrum, but also using gravitational waves and particle detectors. Some have already seen “first light”; some are soon to come online. Each of them will generate such vast quantities of data that human astronomers would drown if they tried to process it manually. But if data scientists could store and index the observational data in an efficient way, then machines would be able to mine the data for us—and make discoveries much more quickly than humans would be able to do. Indeed, machines might make serendipitous discoveries that humans themselves would miss.
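As a sanity check on those figures, here is a Fermi estimate of the raw pixel flow, in the same spirit as the earlier web calculation. The bytes-per-pixel value and the observing hours are my own assumptions rather than LSST specifications, so only ballpark agreement should be expected.

```python
# Order-of-magnitude check on the LSST data rate.
# Assumed values (not official specs): 2 bytes per pixel,
# ~8 hours of observing per night, ~300 usable nights per year.
pixels = 3.2e9                      # 3.2-gigapixel camera
bytes_per_pixel = 2                 # assumption: 16-bit raw pixels
exposure_cadence_s = 20             # one exposure every 20 seconds

bytes_per_exposure = pixels * bytes_per_pixel          # ~6.4 GB
exposures_per_night = 8 * 3600 / exposure_cadence_s    # ~1,440
bytes_per_year = bytes_per_exposure * exposures_per_night * 300

print(f"~{bytes_per_year / 1e15:.1f} petabytes of raw pixels per year")
# -> ~2.8 petabytes: the same ballpark as the 1.28 PB quoted
#    above, before any compression or on-site processing.
```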
This approach is already bearing fruit.
As I began to write this commentary, astronomers published a paper explaining how they trained an AI (the same sort of algorithm that Google DeepMind used to beat the world Go champion, as discussed in the previous chapter) to search for gravitational lenses. A gravitational lens can be seen when light rays from a distant galaxy are bent by the gravitational influence of an intervening galaxy; instead of a small disk, we see the light of the distant galaxy smeared into arcs. The AI was trained to recognise known gravitational lenses and then asked to find lenses in a much larger data set containing millions of astronomical images. The AI quickly discovered 56 new gravitational lenses. At present, a large element of human intervention remains in this work. Eventually, though, there’ll be no need for visual inspection by humans. The discovery process will speed up immensely.
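For readers curious about what “training an AI” involves in practice, here is a minimal, illustrative sketch of a lens/not-lens image classifier in Python with Keras. The cutout size, the layer sizes, and the random stand-in data are placeholders of my own; this is not the network or training set used in the published search.

```python
# Toy sketch of a convolutional lens classifier; all sizes and
# data below are illustrative placeholders, not published values.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64, 64, 1)),           # single-band image cutout
    layers.Conv2D(16, 3, activation="relu"),  # learn local, arc-like features
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),    # P(image contains a lens)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Stand-in data: real work would use labelled survey cutouts
# (typically augmented with simulated lenses).
images = np.random.rand(100, 64, 64, 1).astype("float32")
labels = np.random.randint(0, 2, size=(100, 1))
model.fit(images, labels, epochs=1, batch_size=32, verbose=0)
scores = model.predict(images[:5], verbose=0)  # rank candidates by score
```

Once trained on genuine labelled images, a network like this can score millions of cutouts far faster than any human team, leaving astronomers to inspect only the top-ranked candidates.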
The same approach will, I’m sure, be taken in all those other areas of endeavour I mentioned above, all those other areas in which Big Data is being generated. In other words, artificial intelligence will be applied to them. This will be the future of artificial intelligence: not androids walking around with us but AIs analysing data to help us make scientific discoveries, guide political decisions, improve human health. Asimov’s robot stories are typically remembered as being about androids—machines in human form. But he also wrote stories about a supercomputer called Multivac. The all-powerful Multivac was essentially a machine that had learned to navigate a useful corner of the Universal Library and excelled at Big Data problems. It acted as humanity’s guide. Perhaps humans and machines will go forward together—with humans asking the questions and machines providing the answers?
I have to end this chapter with mention of Asimov’s personal favourite of his own stories: “The Last Question”. In the story, a technician asks Multivac a question involving the basic laws of physics: can the universal increase in entropy be reversed? Multivac ponders, then replies: “Insufficient data for meaningful answer”. But Multivac doesn’t forget the question, and considers it through the aeons. I won’t spoil the story for you, except to say that Multivac does eventually present an answer.
The notion explored in “The Last Question”—whether it’s possible, even for a super-advanced AI, to circumvent the laws of physics—leads us nicely into Chapter 8. The next story asks: is it possible to travel faster than light?