Culturomics – a new discipline?

In the middle of this month, researchers from Harvard, MIT and Google published “Quantitative Analysis of Culture Using Millions of Digitized Books” in Science. They dub their technique “culturomics”: “the application of high-throughput data collection and analysis to the study of human culture,” using the vast word corpus collected by Google Books. The research is summarized below, with links to the free online article and research tools at the end of this entry.

We can all participate! Try it out and report interesting results in a comment to this page.

Abstracted from: Quantitative Analysis of Culture Using Millions of Digitized Books
Jean-Baptiste Michel,* Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, Erez Lieberman Aiden*
Science, v. 331, issue 6014, 14 Jan. 2011

The authors identify a new methodology/discipline for the statistical study of culture, “culturomics.” Culturomics is the application of high-throughput data collection and analysis to the study of human culture. Culturomic results are a new type of evidence in the humanities.

Over 15 million books have been digitized by Google [~12% of all books ever published]. A corpus of 5,195,769 digitized books was chosen, containing ~4% of all books ever published, including 500 billion words, in English (361 billion), French (45 billion), Spanish (45 billion), German (37 billion), Chinese (13 billion), Russian (35 billion), and Hebrew (2 billion). The sequence of letters in this corpus is 1000 times longer than the human genome: if you wrote it out in a straight line, it would reach to the Moon and back 10 times over.

The statistical analysis of the corpus relied on usage frequencies of two kinds of text units. A 1-gram is a string of characters uninterrupted by a space; this includes words, numbers, even typos. An n-gram is a sequence of 1-grams: phrases, titles, etc. “N” was restricted to a maximum of 5, and the study was limited to n-grams occurring at least 40 times in the corpus. Usage frequency is computed by dividing the number of instances of the n-gram in a given year by the total number of words in the corpus in that year.
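The usage-frequency definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline: the function names `ngrams` and `usage_frequency` and the toy corpus are invented here, text is split on any whitespace, and the 40-occurrence cutoff is not applied.

```python
from collections import Counter


def ngrams(text, n):
    """Return the n-grams of a text; a 1-gram is any run of characters
    not interrupted by whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def usage_frequency(phrase, texts_by_year):
    """Map each year to (instances of phrase) / (total words in that year)."""
    n = len(phrase.split())
    freq = {}
    for year, texts in texts_by_year.items():
        total_words = sum(len(t.split()) for t in texts)
        hits = sum(Counter(ngrams(t, n))[phrase] for t in texts)
        freq[year] = hits / total_words if total_words else 0.0
    return freq


# Toy corpus, invented for illustration (not data from the study).
toy_corpus = {
    1900: ["the old mill by the river", "the river ran dry"],
    1950: ["a new mill opened"],
}

print(usage_frequency("the river", toy_corpus))
```

In the toy corpus, the 2-gram “the river” occurs twice among the ten words dated 1900 and not at all in 1950, so its usage frequency falls from 0.2 to 0.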

Analysis and sampling of the corpus produced a range of results. The number of words in the English lexicon was estimated as 544,000 in 1900, 597,000 in 1950, and 1,022,000 in 2000. Comparisons with standard dictionaries show that 52% of the English lexicon in books consists of lexical “dark matter” undocumented in standard references. Culturomic tools will aid lexicographers in at least two ways: (i) finding low-frequency words that they do not list, and (ii) providing accurate estimates of current frequency trends to reduce the lag between changes in the lexicon and changes in the dictionary.

The 1-gram year names (e.g., “1951”) were plotted for each year between 1875 and 1975. The amplitude of the plots rose every year: precise dates became increasingly common. There is also a greater focus on the present. We are forgetting our past faster with each passing year.

A list of 147 inventions was divided into time-resolved cohorts based on the 40-year interval in which they were first invented (1800–1840, 1840–1880, and 1880–1920). The inventions from the earliest cohort (1800–1840) took over 66 years from invention to widespread impact (frequency >25% of peak). Since then, the cultural adoption of technology has become more rapid: the 1840–1880 invention cohort was widely adopted within 50 years; the 1880–1920 cohort within 27 years.

People, too, rise to prominence, only to be forgotten (22). Fame can be tracked by measuring the frequency of a person’s name. The rise to fame of the most famous people of different eras was compared. For every year from 1800 to 1950, a cohort consisting of the 50 most famous people born in that year was constructed. Each cohort had a pre-celebrity period, followed by a rapid rise to prominence, a peak, and a slow decline. The age of peak celebrity has been consistent over time: about 75 years after birth. Fame comes sooner and rises faster, yet this fame is increasingly short-lived. Thus, people are getting more famous than ever before but are being forgotten more rapidly than ever. Occupational choices also affect the rise to fame: actors tend to become famous earliest, scientists take much longer, and mathematicians rarely become famous.
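The “widespread impact” measure used for the invention cohorts above (frequency >25% of peak) can be illustrated with a short sketch. It assumes a single frequency series per invention, stored as a year-to-frequency mapping; the function name `years_to_widespread_impact` and the toy series are hypothetical, and any smoothing or cohort averaging the authors applied is not reproduced here.

```python
def years_to_widespread_impact(series, invention_year, threshold=0.25):
    """Years from invention until frequency first exceeds threshold * peak.

    series: dict mapping year -> usage frequency.
    Returns None if the series is empty or the threshold is never exceeded.
    """
    if not series:
        return None
    peak = max(series.values())
    for year in sorted(series):
        if year >= invention_year and series[year] > threshold * peak:
            return year - invention_year
    return None


# Toy frequency series, invented for illustration (arbitrary units).
toy_series = {1800: 0.0, 1820: 1.0, 1840: 2.0, 1860: 3.0, 1880: 8.0, 1900: 6.0}

print(years_to_widespread_impact(toy_series, invention_year=1800))
```

Here the peak is 8.0, so the threshold is 2.0; the first year that strictly exceeds it is 1860, giving 60 years from invention to widespread impact.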

Suppression of a person or an idea leaves quantifiable fingerprints. The impact of censorship on a person’s cultural influence in Nazi Germany was examined; the results show declining lexical reference to authors and artists whose “undesirable”, “degenerate” work was banned from libraries and museums and publicly burned.

Additional brief observations include:
(i) Peaks in “influenza” correspond with dates of known pandemics, suggesting the value of culturomic methods for historical epidemiology;
(ii) Trajectories for “the North”, “the South”, and finally “the enemy” reflect how polarization of the states preceded the descent into the Civil War;
(iii) In the battle of the sexes, the “women” are gaining ground on the “men”;
(iv) “féminisme” made early inroads in France, but the United States proved to be a more fertile environment in the long run;
(v) “Galileo”, “Darwin”, and “Einstein” may be well-known scientists, but “Freud” is more deeply ingrained in our collective subconscious;
(vi) Interest in “evolution” was waning until “DNA” came along.

My own explorations show that:
“intelligent design” peaks in 1880 and 2000; “evolution” peaks about 1920, then declines until World War II, after which its frequency climbs again;
“Anasazi” is rare until the late 1930s and peaks about 1995, while “Ancestral Pueblo” is rare until about 1995 when its frequency skyrockets;
“Mogollon” peaks about 1940; “Hohokam” in 1940 and 1980;
“American” begins to appear in 1760, has minor peaks in 1776 and 1782, and then increases steadily with additional peaks at 1859, 1918, 1942, and 1970.

The full data set, which comprises over two billion culturomic trajectories, is available for download or exploration at www.culturomics.org, which has links to the full article, and at ngrams.googlelabs.com, where words and phrases can be plotted and compared in the Ngram viewer, which also has chronological links to the chosen word or phrase.

The parameters and procedures for the Ngram viewer are described at: http://ngrams.googlelabs.com/info