Can physicists produce insights about language that have eluded linguists and English professors? That possibility was put to the test this week when a team of physicists published a paper drawing on Google’s massive collection of scanned books. They claim to have identified universal laws governing the birth, life course and death of words.
The paper marks an advance in a new field dubbed “Culturomics”: the application of data-crunching to subjects typically considered part of the humanities. Last year a group of social scientists and evolutionary theorists, plus the Google Books team, showed off the kinds of things that could be done with Google’s data, which include the contents of five-million-plus books, dating back to 1800.
Published in Science, that paper gave the best-yet estimate of the true number of words in English—a million, far more than any dictionary has recorded (the 2002 Webster’s Third New International Dictionary has 348,000). More than half of the language, the authors wrote, is “dark matter” that has evaded standard dictionaries.