Making these United States safe for Ngrams

Last month's court decision upholding the legality of the Google Books search engine also kept a powerful research tool available to the public.

December 5, 2013

A side effect of Judge Denny Chin's dismissal of a long-running suit by the Authors Guild against Google Books – discussed in this space last week – was that he made it safe for the Ngram Viewer.

The Google Ngram Viewer is a powerful search tool. It appears, though, that not many people know about Ngrams, or so I infer from a Google News search that yielded only 65 results. The Ngram Viewer is an application that lets you trace, over time, how often particular search terms appear in the "corpus" of books Google has scanned into its database.
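For readers curious about what such a tool does under the hood, here is a minimal sketch in Python. It is not Google's code: the tiny corpus is invented for illustration, and the function names (`ngrams`, `relative_frequency`) are hypothetical. It assumes that a phrase's frequency in a given year is its count divided by the count of all same-length word sequences published that year.

```python
from collections import Counter

# Hypothetical toy corpus of (publication year, text) pairs, invented for illustration.
TOY_CORPUS = [
    (1850, "the united states are a union and the united states are growing"),
    (1850, "the united states is young"),
    (1900, "the united states is a power and the united states is growing"),
    (1900, "the united states are many"),
]


def ngrams(text, n):
    """Return every run of n consecutive words in the text, lowercased."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def relative_frequency(corpus, phrase):
    """Map each year to the share of that year's n-grams that match the phrase."""
    n = len(phrase.split())
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        grams = ngrams(text, n)
        totals[year] += len(grams)
        hits[year] += grams.count(phrase.lower())
    # Skip years with no n-grams of this length to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}


freq_is = relative_frequency(TOY_CORPUS, "the united states is")
freq_are = relative_frequency(TOY_CORPUS, "the united states are")
```

Plotting `freq_is` and `freq_are` against the year would give two lines of the kind the viewer draws. With this made-up data, the plural form leads in 1850 and the singular leads in 1900.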

The sample terms that come up by default when you go to the Ngram Viewer are Albert Einstein, Sherlock Holmes, and Frankenstein. (Go figure.) The search produces three colored lines on the fever chart, as this kind of graph is known, one for each name. The lines cross and recross. But they show that Frankenstein had an early lead, and by 2000 was once again far ahead of the other two.


It all takes somewhat less effort than a quick check of an online movie schedule. Frankenstein, et al., aside, the poster child for the Ngram Viewer is "the United States" itself.

As Judge Chin wrote in his ruling in the Authors Guild suit: "Using Google Books ... researchers can track the frequency of references to the United States as a single entity ('the United States is') versus references to the United States in the plural ('the United States are') and how that usage has changed over time.

"The ability to determine how often different words or phrases appear in books at different times 'can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology,' " the judge added, quoting one of the scholars who has weighed in on this topic.

The commonplace observation among those who tune into this sort of thing is that it wasn't until after the Civil War that the United States became truly singular in a grammatical sense. An amicus brief submitted on behalf of Google in the Authors Guild case included an Ngram that showed the red line ("United States are") trending downward over time, crossing the blue line ("United States is") in the late 1870s, and then dribbling away into statistical insignificance in the lower right-hand corner.
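Finding the point where one line overtakes another is simple to express in code. The sketch below is an illustration only: the numbers are invented, not taken from the amicus brief or from Google's data, and `first_crossover` is a hypothetical helper name. It assumes two series sampled in the same years and reports the first year in which the rising series is above the falling one after having been at or below it.

```python
def first_crossover(years, falling, rising):
    """Return the first year where `rising` exceeds `falling` after
    having been at or below it, or None if that never happens."""
    was_below = False
    for year, f, r in zip(years, falling, rising):
        if r <= f:
            was_below = True
        elif was_below:
            return year
    return None


# Invented sample values, one per decade, for illustration only.
YEARS = [1850, 1860, 1870, 1880, 1890]
ARE = [5.0, 4.5, 3.8, 2.9, 2.0]  # stands in for "United States are"
IS = [2.0, 2.6, 3.3, 3.5, 4.1]   # stands in for "United States is"

crossing = first_crossover(YEARS, ARE, IS)
```

Because the answer depends on the exact search terms, on any smoothing applied, and on how finely the years are sampled, different charts can reasonably put the crossing in different decades.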

Ben Zimmer, in a post for "Language Log," included an Ngram built from slightly different search terms that shows the crossing point coming about 10 years later. Your mileage may vary, too.


I've just tried a search for "the United States is, the United States are" and have gotten two lines that start out twisted together like strands of a rope but then separate around 1830, with "the United States is" taking the lead. Each line also shows two humps, one for each of the two world wars, when the name of the nation was presumably mentioned more often in books.

Ngrams are available to adjudicate less cosmic matters as well. I've just done a quick check of "e-mail" versus "email" to find out just how much of a fuddy-duddy I am to continue to prefer the hyphen, and I feel positively vindicated: The hyphenated version of the term has a commanding lead – for now, at least.

Ngrams are a useful tool, and I look forward to learning more about them.