Everyone knows the saying, “a picture is worth a thousand words.” But what if I told you that some pictures could be worth 500 billion words? That claim comes from a study by researchers at Harvard, MIT, and other institutions, who set out to understand how human culture and history evolve over time using a massive collection of digitized books.
Reading all the books ever written seemed like an incredible, yet impractical, way to understand history. However, thanks to Google’s digitization project, millions of books have been scanned, offering a practical and awesome way to analyze this vast trove of information with computational methods.
An estimated 129 million unique books have been published over the centuries. Google has scanned 15 million of these. From that collection, the researchers gathered metadata, filtered out low-quality scans and records, and ended up with 5 million high-quality books, amounting to 500 billion words. This dataset is an enormous resource to mine for cultural insights.
Instead of releasing full texts, which could lead to lawsuits, the researchers released statistical data about the books. They created time series of how often each word and phrase appears in each year. This produced a huge table of two billion lines, each representing an element of culture through what they called “n-grams”: sequences of n consecutive words, so a single word is a 1-gram and a two-word phrase is a 2-gram.
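To make the idea of a per-year frequency table concrete, here is a minimal sketch in Python. It is not the researchers' pipeline: the whitespace tokenizer, the function names, and the two-sentence toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def yearly_ngram_table(books, n=1):
    """Map each publication year to a Counter of n-gram occurrences."""
    table = defaultdict(Counter)
    for year, text in books:
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        table[year].update(ngrams(tokens, n))
    return table

def relative_frequency(table, phrase, year):
    """Share of that year's n-grams that match the phrase."""
    total = sum(table[year].values())
    return table[year][phrase] / total if total else 0.0

# Hypothetical toy corpus, invented for illustration (not real data)
books = [
    (1900, "the town throve and the trade throve"),
    (2000, "the town thrived and the trade thrived too"),
]
table = yearly_ngram_table(books, n=1)
```

Calling relative_frequency(table, "thrived", 2000) then gives the share of words in the year-2000 text that are "thrived". Dividing by the yearly total matters because far more books are published in some years than in others, so raw counts alone would be misleading.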
N-grams can show cultural trends. For example, comparing the usage of “thrived” with “throve” over time reveals how language evolves. Visualization tools and n-gram viewers allow anyone to explore these massive datasets and uncover fascinating historical insights.
Another intriguing example: when looking for “influenza” in the data, spikes appear during major flu epidemics, correlating with historical events. Even abstract concepts show noticeable trends. Mentions of a particular year, such as 1950, reveal public interest that bubbles up and then bursts over time.
Tracking famous individuals in the data reveals that careers peak differently among actors, authors, politicians, and scientists. Mathematicians, unfortunately, don’t get the same level of public attention.
The dataset has also uncovered instances of censorship. For example, mentions of the artist Marc Chagall plummeted in Germany during the Nazi era, only to rebound after World War II. This suppression was identifiable as a statistical aberration in the data.
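One simple way to flag such an aberration is to compare how often a name appears inside a suspect period with how often it appears outside it. The sketch below assumes that approach; it is not the measure the researchers used, and the function name and frequency values are hypothetical.

```python
from statistics import mean

def suppression_ratio(series, start, end):
    """Compare mean frequency inside [start, end] with the mean outside it.

    series: dict mapping year -> relative frequency of a name.
    Returns a value well below 1.0 when the name is mentioned far less
    inside the window than outside it.
    """
    inside = [f for y, f in series.items() if start <= y <= end]
    outside = [f for y, f in series.items() if y < start or y > end]
    if not inside or not outside or mean(outside) == 0:
        raise ValueError("need nonzero data outside the window and data inside it")
    return mean(inside) / mean(outside)

# Hypothetical frequencies (mentions per million words), invented for illustration
series = {1925: 4.0, 1930: 4.0, 1935: 1.0, 1940: 1.0, 1950: 4.0, 1955: 4.0}
ratio = suppression_ratio(series, 1933, 1945)
```

With these made-up numbers, the name appears a quarter as often inside the window as outside it. A real analysis would also need to rule out other explanations, such as an ordinary rise and fall in fame.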
This new field of “culturomics” mirrors genomics but applies large-scale data tools to human culture, using digitized records to study trends and shifts over time. The n-gram viewer, created by Google engineers, allows the public to explore this data, making cultural analysis accessible to everyone.
The vast historical record being digitized is transforming our understanding of culture. As more cultural artifacts like manuscripts, newspapers, and paintings are digitized, our grasp of human history and culture will only deepen, offering new ways to explore our past and present. Thanks to these efforts, we can now examine and understand cultural shifts with precision like never before.