Feature

Corpus Linguistics Is Changing How Courts Interpret the Law

In June 1986, President Ronald Reagan nominated William Rehnquist to replace the resigning Warren Burger as chief justice of the Supreme Court. It was a polarizing nomination and everything from the recent introduction of television cameras to the Senate floor to the upcoming midterm elections conspired to heighten the drama.  

Over four momentous days, Rehnquist was questioned sharply by Democratic senators exercising their constitutional power of “advice and consent.” They asked Rehnquist about his work as an elections lawyer, about his financial holdings, and about his membership in a men-only club in Washington, D.C.

After a 65-33 vote, the nomination went forward and Rehnquist was sworn in as the new chief justice on September 26. That same day, Reagan’s other appointee to the Supreme Court, Antonin Scalia, was sworn in as an associate justice. With no conflict at all, Scalia had been unanimously confirmed by a vote of 98 to 0 after a brief pro forma hearing, with his large family seated, smiling, behind him.  

Photograph of Justice William Rehnquist being sworn in while his wife holds the bible. Antonin Scalia and his wife observe, along with President Ronald Reagan.

At the White House on September 26, 1986, Associate Justice William Rehnquist is sworn in by Justice Warren Burger as the new chief justice of the Supreme Court. From left: Antonin and Maureen Scalia, Natalie Rehnquist, and President Ronald Reagan. 

―National Archives

The elevation of a new chief justice marks the beginning of a new era, but the quiet confirmation of Justice Scalia was a major story in its own right. Scalia, a quotable writer and a bold legal mind, would prove especially influential as the prime mover behind a new school of thought that even critics allow has been reshaping legal interpretation.

This school has gone by various names, including originalism and textualism. The two are related but not identical. Originalism names a philosophical commitment to the original linguistic meaning of legislation as understood at the time of its passage. Textualism refers to a method of seeking the meaning of laws in the words and sentences of the law itself. Stripped of elaboration, the basic idea of textualism is almost comically straightforward: To understand the law, read the law.  

It’s important to know what textualism rules out: legislative history. Scalia believed that legislators sought to game future litigation by intentionally generating a paper trail favoring their preferred interpretation of the laws they passed. So all the floor debate, committee reports, and other contextual materials a historian might seize upon are set aside by the textualist. The law as passed, in its final wording, is considered the only authoritative expression, the one text that needs to be consulted.

This way of thinking can and does invite challenges, some of which concern the act of reading itself. For example, legal language can be extremely dense. The idea that one need only read a few lines of a contract or legal code to grasp their meaning seems improbable on its face. And even on those happy occasions when legal text seems clear and unambiguous, other difficulties can make the path to understanding and agreement a rocky one.  

Readings of the same text can differ for ideological reasons or because of different takes on historical context, or because of differing assumptions over authors’ intentions, or because of reading errors, or because of personal biases, or because of many other factors such as the passage of time and changes in the meanings of words. Any one of these problems by itself can leave courts and citizens unsure of what a law is saying.  

Interpretive problems have a long history in legal thought and practice. And they are old hat in the humanities. But recent developments have brought these disciplines together in a new field called “corpus linguistics and the law.” It is a methodology and a movement that has already affected scores of decisions in state courts, federal courts, and even the Supreme Court. Taking advantage of the vast amounts of linguistic data available in online sources, this approach seeks to better assess the possible meanings of disputed words and phrases whose interpretation can prove decisive in legal disputes. 

To learn more about this high-tech mashup of legal argument and linguistics research, I read a stack of law journal articles and court decisions recommended to me by lawyers and linguists interested in the subject. And, in late 2023, I attended the eighth annual conference on law and corpus linguistics at Brigham Young University’s law school.

A digital corpus is a collection of text like LexisNexis or Google Books but tailored to encompass specific domains of language. A single book or a whole library of books (or newspapers or magazines or radio transcripts, or social media posts or any other collection of words) might be made into a corpus.  

BYU is home to several important corpora (the plural of corpus), including a major one supported by grants from the National Endowment for the Humanities. Readers of this magazine may recall learning about the Corpus of Historical American English (COHA), for which the linguist Mark Davies received NEH funding. COHA contains 475 million words drawn from more than 100,000 newspapers, magazines, and books (fiction and nonfiction) from the 1820s to the 2010s. COHA calls itself “the largest structured corpus of historical English.” The Corpus of Contemporary American English (COCA), also created by Mark Davies, may be the most widely used corpus of English: With more than one billion words, it brings together news sources, spoken text, academic literature, television and movie subtitles, blogs, and other online materials from 1990 to 2019. BYU has also brought out several more specialized corpora such as COFEA, the Corpus of Founding Era American English, and several corpora focused on legal and legislative history that were built by David Armond, assistant dean for information technology, and several others at the law school. 

BYU Law has become a hotbed of thought on how corpus linguistics can help lawyers and judges better approach questions of meaning and interpretation. It was home to the first-ever course in corpus linguistics and the law in 2013. One of its professors and the organizer of the conference, Thomas R. Lee, is a former Utah Supreme Court judge and the coauthor of numerous law journal articles on the legal relevance of corpus linguistics. Lee is also the author of important judicial opinions that helped draw legal attention to corpus linguistics. Recently, an article he coauthored with other leading lights of the movement was cited in Justice Amy Coney Barrett’s concurring opinion in Moore v. United States, a Supreme Court case that hinged in part on the meaning of the word income. 

A photograph of the illuminated BYU law school at dusk with mountains in the background.

The J. Reuben Clark Law School, aka BYU Law, in Provo, Utah, where the Law and Corpus Linguistics Conference takes place.  

―Mark A. Philbrick / BYU Photo

The BYU conference began with an orientation for those new to corpus linguistics. A small group of us received training on performing simple searches while instruction touched on a few classic law cases in which an ambiguous word or phrase played an outsized role in the outcome. The rest of the conference was your typical academic showcase for papers on a wide range of possible applications, including corpus research into the nature of legal language, the language of terms of service contracts, the decline of trademark names through genericization (which is when a trademark name like “escalator” becomes the generic word everyone uses for what might otherwise be called, say, a moving staircase), fiduciary law, and much else. 

Once or twice, in informal settings at the conference, the subject of politics arose, as corpus linguistics and the law has come to be associated with judicial conservatives. But liberals, too, have explored corpus linguistics. Neal Goldfarb, a lawyer who has become a serious student of corpus linguistics, told me at the conference that he believes corpus research on the Founding era has the potential to unsettle current law around the Second Amendment. Goldfarb’s research has been cited in Supreme Court opinions, including Justice Breyer’s dissent in New York State Rifle and Pistol Association v. Bruen (2022). That dissent also cited another amicus brief based on corpus research whose lead author, Dennis Baron, is a prominent linguist working at the University of Illinois.

Little about the conference smacked of politics, though. Far more remarkable, to me at least, was the high concentration of people who were comfortable talking about the slippery intersection of law and language without losing their footing or their audience. The conversation did sometimes become a little too abstract to follow easily, at which point you had to sit tight and wait until the theoretical and the empirical came back together. Then you could finally see why this field has become so interesting and, potentially, so important. 

Much of the scholarship in legal corpus linguistics is accomplished by revisiting old cases and subjecting court findings to new evidence derived from corpus research. And among the many cases that scholars have revisited, few stand out as much as Muscarello v. United States.

Frank J. Muscarello was arrested in Louisiana in 1995, after trying to sell eight pounds of high-grade marijuana to an undercover federal agent. At the scene of the arrest, a .38 caliber handgun was discovered in the locked glove compartment of Muscarello’s pickup truck. He pleaded guilty to three charges: two counts of distribution and one count of using or carrying a firearm while committing a drug-related crime.  

The last count relates to a section of the federal criminal code describing “any person who . . . during and in relation to any crime of violence or drug trafficking crime . . . uses or carries a firearm.”

Around the time that Muscarello was being processed, however, the Supreme Court handed down a ruling that cast a new light on his case. In Bailey v. United States, the court held that “uses” in the phrase “uses or carries a firearm” should not be interpreted vaguely to mean having a gun somewhere in one’s possession. Muscarello’s lawyers saw an opening and challenged the “uses or carries” part of his conviction.  

With using set aside, Muscarello’s appeal turned on the meaning of the word carry. Was Muscarello carrying a firearm by having one in his locked glove compartment? The defense argued that it was a stretch. The prosecution countered that Muscarello was, indeed, carrying a firearm simply by having one in his car. 

The Supreme Court took up the question in 1998. In its opinion, the Court held that “the phrase ‘carries a firearm’ applies to a person who knowingly possesses and conveys firearms in a vehicle, including in the locked glove compartment or trunk of a car.”  

The logic of the decision was as much a matter of law as of language. The specific sense of carrying in one’s car, the Court said, was part of the “first, or basic, meaning” of carry.

The Court’s opinion and dissent presented evidence from a shelf of dictionaries and other sources, from the King James Bible to Sesame Street. The 5-4 majority decision prominently quoted the Oxford English Dictionary’s first definition of carry: “convey, originally by cart or wagon, hence in any vehicle.” Meanwhile, the Court noted, the other relevant meaning of carry (as in to carry on or with one’s person) was only the twenty-sixth meaning listed in the OED.

This use of a dictionary may seem unexceptional but is, in fact, a key example of why Muscarello v. United States has been so scrutinized. An important complaint that critics of the case make is that the difference between the OED’s first definition and its twenty-sixth definition was not what it seemed. 

As Stephen Mouritsen, then a law student at BYU, explained in an influential law review note, the OED had divided its 42 definitions of carry into two groups. The first 24 were all examples of “to transport, convey.” The other 18 were examples of “to support, sustain.”  

Explaining the division in a usage note, the OED said that all the meanings for the first group were now largely expressed using the word take. This was a major qualification, saying, in effect, that people no longer used the word carry to mean “convey, originally by cart or wagon, hence in any vehicle.” 

So the Court’s conclusion that the word carry included to carry in one’s car as part of its “ordinary meaning” was not supported by the dictionary it cited.  

Now, this distinction between two senses of the word carry may seem especially fine. It may sound like minutiae, hair-splitting, or even the kind of technicality that sometimes allows confirmed criminals to escape judgment.  

After all, Muscarello pleaded guilty to selling drugs, and eight pounds of marijuana is not a trivial amount. But the drug possession alone might not have led to much of a punishment. It was the additional sentence Muscarello received for “carrying” a firearm that significantly increased his time in prison.

“The gun charge had the teeth in it,” said J. Garrison Jordan, one of the Louisiana attorneys who represented Muscarello, in a phone interview. Without it, Jordan believed, Muscarello’s original sentence might have been no heavier than probation. 

But because of the gun charge, Muscarello’s sentence was increased to six years, which was then knocked down to thirty months for reasons Jordan was not free to explain. That may not sound especially harsh for selling illegal drugs with a firearm in the glove compartment, but at the time of his arrest Muscarello was sixty-nine years old. As a result of the Supreme Court’s dubious interpretation of the word carry, Muscarello, a former court bailiff with a large family, was still in prison when he suffered a heart attack and died—his mandatory minimum sentence having turned into a life sentence.  

Standing in front of a large bookshelf, Antonin Scalia and C. Ronald Ellington inspect a passage in a book.

Supreme Court justice Antonin Scalia consults a book with University of Georgia Law School dean C. Ronald Ellington.  

―Digital Public Library of America / Digital Library of Georgia

“We’re all textualists now,” said Supreme Court Justice Elena Kagan in an interview she gave at Harvard University in 2015. This now famous comment came in answer to a series of questions about Justice Scalia’s influence on the Court.

Before Scalia’s appointment, said Kagan, law schools hardly acknowledged the topic of how to read and interpret statutory language. There were no courses in statutory interpretation and little instruction on the topic. Nor had Scalia’s signature proposition that legal interpretation begin and end with the text itself so far changed how lawyers argued and judges judged. Once it did, however, the exact meaning of countless words such as carry became profoundly more important to judicial decision-making. 

One consequence of this new emphasis on individual words and phrases has been a marked increase in the use of dictionaries in court opinions. As James J. Brudney and Lawrence Baum reported in the William & Mary Law Review, dictionary use at the Supreme Court, starting in 1986, soared by 400 percent. And with this steady rise in the use of dictionaries in court came a steady rise in the criticism of such use. 

Some of the critics have shown an unusual sensitivity to the shortcomings of dictionaries, noting, correctly, that dictionary definitions tend to describe words in a general way that often only suggests what the word in question means in the sentence where it was found. “Dictionaries, by their very nature,” wrote Rickie Sonpal in the Fordham Law Review in 2003, “do not provide the precise meaning of a word as it is used in a particular context.”  

A general definition can be helpful, especially to a reader who has never encountered an unusual word before and has little or no other information at hand. But for a specialized reader examining the subtle shades of difference between one word or one sense and another, a general definition may be inadequate. Worse yet, it may be inadequate while the reasons for its inadequacy may not be clear, leaving the truth hidden behind a double veil of obscurity. And when it comes to words and contexts made more difficult or foreign by the passage of time, the difficulties increase yet again. As Sonpal, a former linguistics major, noted, dictionaries can be unreliable guides to the historical meaning of a term, especially if they are not historical dictionaries but instead general dictionaries that just happen to be old. Indeed, a hornets’ nest of interpretive difficulties, peculiar to the history of lexicography, awaits the curious user of old dictionaries. 

Consider, for example, the fact that it was once common for lexicographers to copy material word for word from earlier dictionaries. It would be very easy to quote Noah Webster’s 1828 American Dictionary of the English Language and not realize that the true source of a definition was Samuel Johnson’s 1755 dictionary, from which Webster liberally copied. Now factor in the added complication that Johnson, writing several decades earlier, and from London instead of western Massachusetts, quoted few living authorities. So instead of finding evidence of early nineteenth-century American English, the unwitting user of Webster’s 1828 dictionary may come away with definitions or quotations drawn from eighteenth- or even seventeenth-century British English. Meanwhile, the unstated differences in meaning and context could prove decisive in a twenty-first-century courtroom in need of far more specialized information.

Photo of American Dictionary of the English language opened to the title page, with an illustration of Noah Webster on the left.

Like many old dictionaries, Noah Webster’s American Dictionary of the English Language recycled quotations and definitions from earlier works.  

―Wikimedia 

At BYU, I spoke with a few of the key figures in this scene, including Neal Goldfarb, whose work on corpus linguistics and the law was featured as early as 2011 on Language Log, an important academic blog for linguists. Back then, Goldfarb had submitted an amicus brief in FCC v. AT&T, a Supreme Court case that hinged on the meaning of the word personal in the Freedom of Information Act. Justice Ruth Bader Ginsburg cited Goldfarb’s research directly in oral argument. And Goldfarb’s brief, which drew evidence from COHA and COCA, seems clearly to have influenced the Court’s unanimous decision, written by Chief Justice John Roberts. Goldfarb had no formal background in linguistics, but in the mid-1990s he happened to read The Language Instinct by Steven Pinker. It made a big impression and almost helped him in court when he noticed a syntactical ambiguity in a judge’s instruction to a jury, though he ended up losing that appeal. The feisty lawyer is something of an outsider among the legal minds at BYU, more of a solo operator but just as committed to the overall idea of the movement.

Stephen Mouritsen, now a lawyer with Kirkland & Ellis, wrote one of the most cited papers criticizing how judges use dictionaries, “The Dictionary Is Not a Fortress,” which takes its title from a comment by Judge Learned Hand: “It is one of the surest indexes of a mature and developed jurisprudence not to make a fortress out of the dictionary; but to remember that statutes always have some purpose or object to accomplish, whose sympathetic and imaginative discovery is the surest guide to their meaning.” Several people at the conference recalled reading Mouritsen’s paper, published in 2010 in the BYU Law Review, and only then understanding how corpus linguistics might affect legal work. Gordon Smith, former dean of the BYU law school, had called it “the best student comment I have ever read.” In a keynote at the 2023 conference, Smith said that Mouritsen’s paper had pointed the way to this whole new area of legal work. 

Mouritsen is quiet, observant, and carefully dressed. As we talk about dictionaries, he quotes from memory from Dictionaries: The Art and Craft of Lexicography, a standard reference work not often quoted by lawyers.  

He introduced Tom Lee, then a law professor for whom Mouritsen worked as a research assistant, to corpus linguistics and helped him see the value in this work. When Lee joined the Utah Supreme Court, Mouritsen became his law clerk and worked with him on the first case in which Justice Lee made use of corpus linguistics research, Baby E.Z. v. T.I.Z., in 2011.  

But before all this, Mouritsen was a graduate student in linguistics who was usually broke and thus always on the lookout for research gigs, which is how he ended up working as a research assistant for Mark Davies on his Frequency Dictionary of Spanish, which drew on a 20-million-word corpus. On his own, Mouritsen made a frequency dictionary of Arabic newsprint. In short, Mouritsen came to his knowledge of both dictionaries and corpora by working on or with them. When he read Muscarello as a law student in 2009 in a course called “Law and Logic,” having shifted his professional ambitions toward law, he had, he says, a “visceral” reaction to how Supreme Court justices used dictionaries in their opinions.

Writing about Muscarello, Mouritsen was especially critical of the Court’s fixation on the ranking of senses in the dictionaries it consulted, arguing that Justice Breyer’s opinion for the majority was guilty of the “sense-ranking fallacy.” The Court wrongly assumed that the first definition of carry was the most essential or “primary” sense of carry when, in the OED, the first sense given is simply the earliest sense so far documented.  

But such criticism did not explain how one might answer the linguistic question at issue in Muscarello: When late twentieth-century speakers of American English used the word carry, did they always mean to carry on or with one’s person? It was certainly possible to use the word carry to mean to carry in one’s car, but one would need to look at a large number of real-world examples to see if either sense so predominated in current usage that the other sense could be ruled out as a reasonable interpretation. 

Mouritsen, in his paper, showed how corpus linguistics could provide such evidence. Turning to the Corpus of Contemporary American English, he searched for instances in which the verb carry occurred in the same context as firearm, gun, handgun, rifle, or pistol. In those sentences where the reference was clear, the meaning of to carry on one’s person accounted for 143 instances to 3 instances of the meaning to carry in one’s car. Using the Corpus of Historical American English to focus on carry in the 1960s (the federal code in Muscarello became law in 1968), Mouritsen found many instances of to carry on one’s person and no instances of to carry in one’s car. 

Drawing of a man carrying a satchel and gun slung over his shoulder.

Centuries ago, the word carry could be defined as “to convey . . . by cart or wagon.” More recently, the word carry refers primarily to the act of carrying on or with one’s person.

―Artvee, A Man with a Gun by William Evans, 1840, watercolor  

This was an especially lopsided distribution showing that, at least when American English speakers write and talk about carrying guns without further specification, they almost always mean to carry on one’s person. And for this reason, the Supreme Court majority was, again, mistaken in saying that to carry in one’s car was part of the “first, or basic, meaning” of carry.
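
For readers curious about what such a search involves mechanically, here is a minimal sketch in Python. It is not Mouritsen’s method or the actual search interface of any BYU corpus. The sentences and sense labels are invented for illustration, the window size of four words is an arbitrary assumption, and in real research the sense of each hit is coded by human readers rather than supplied in advance.

```python
import re
from collections import Counter

# The verb forms and noun list mirror the search described in the article.
# Everything else (sentences, labels, window size) is hypothetical.
CARRY_FORMS = {"carry", "carries", "carried", "carrying"}
GUN_NOUNS = {"firearm", "gun", "handgun", "rifle", "pistol"}
GUN_NOUNS_WITH_PLURALS = GUN_NOUNS | {noun + "s" for noun in GUN_NOUNS}


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def has_cooccurrence(sentence, window=4):
    """Return True if a form of 'carry' falls within `window` tokens of a gun noun."""
    tokens = tokenize(sentence)
    verb_positions = [i for i, t in enumerate(tokens) if t in CARRY_FORMS]
    noun_positions = [i for i, t in enumerate(tokens) if t in GUN_NOUNS_WITH_PLURALS]
    return any(abs(v - n) <= window for v in verb_positions for n in noun_positions)


# Invented toy corpus: (sentence, sense label a human coder might assign).
toy_corpus = [
    ("The officer carried a pistol in a holster on his belt.", "on_person"),
    ("She was carrying a rifle over her shoulder.", "on_person"),
    ("He carried the groceries into the kitchen.", None),
    ("They carried two handguns in the trunk of the car.", "in_vehicle"),
    ("The gun show opened on Saturday.", None),
]

# Step 1: keep only sentences where 'carry' and a gun noun appear together.
hits = [(text, label) for text, label in toy_corpus if has_cooccurrence(text)]

# Step 2: tally the senses assigned to those hits.
sense_counts = Counter(label for _, label in hits)

for sense, count in sense_counts.items():
    print(sense, count)
```

The point of the sketch is the division of labor: the computer narrows a large body of text to the relevant sentences, and people then judge what each one means before anything is counted.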

Had the Court accounted for such evidence, it might have concluded that, at the very least, it was working from an ambiguous statute. In such cases, a centuries-old principle called the “rule of lenity” could have been invoked, giving the defendant the benefit of the doubt. Muscarello’s sentence could have been much lighter, as his lawyer believed it would have been.

This one discrete example of corpus linguistics research also shows why a dictionary is such a limited tool when it comes to determining frequency and other patterns of actual usage. Using a dictionary to look up the individual words that constitute an idiomatic phrase like carrying a firearm is likely to tell you very little about the phrase itself and the specifics of its probable meaning. Even an expansive historical dictionary like the OED is not going to offer much more than representative examples of a word or phrase as it has been used over time.  

Corpus linguistics, by comparison, is like a backstage pass to the raw data that one might use to write dictionary definitions. And instead of the small stack of handpicked examples that lexicographers long worked from, corpus research yields hundreds if not thousands or even tens of thousands of examples of words and phrases in their native environment. It then allows for not only close-in microscopic examination of individual words and phrases; it allows for large-scale research on how we actually use such words and phrases, illuminating in much finer detail not only what native speakers mean by them but also what they understand and expect others to understand. 

As with a lot of scholarly work on the border of politics, the use of corpus linguistics in court has given rise to hot debate. Practical questions have been raised, such as, How can judges who don’t seem to know how to use a dictionary be expected to master these far more complicated tools? If all judges understood, however, that different dictionaries ordered their definitions differently and that basic methods and limitations are often well-explained in a dictionary’s introduction, the use of dictionaries in court might quickly improve. 

As for how corpus linguistics research can be assimilated into legal practice and court proceedings, there are several answers. Findings from corpus linguistics can be, and are, presented in lawyers’ briefs and through expert testimony. Training is increasingly available too, through which judges can learn more about this kind of research and how to evaluate the testimony of experts. Other factors may help people get the gist without great effort. The power of corpora is more easily grasped, for instance, if you have spent any time reading about the large language models behind recent advances in artificial intelligence.

Tom Lee, whom I also met at the conference and talked with recently, is especially focused on various methodological challenges possibly haunting the way forward for law and corpus linguistics. These are evidentiary problems that go by a few different names, one of which is the “frequency fallacy.”  

A word’s relative frequency alone is not a certain guide to finding a word’s “ordinary meaning” (itself a debated term). More than one sense or usage may fall comfortably within its ordinary meaning. And depending on how well the corpus is designed, a meaning may show up more or less often than justified by its actual usage. Meanwhile, the frequency of one sense or usage may mislead you to think that other, less frequent meanings aren’t also common, persistent, and relevant to a word’s general usage.  

A painting of two blue pitta birds standing in front of foliage.

If legal corpus linguistics had a mascot, it would likely be the rare and elusive blue pitta of South Asia, which has become associated with certain types of evidentiary problems. 

―Wikimedia

Law professors Lawrence M. Solan and Tammy Gales have probed some of these issues with the example of the blue pitta, a bird not found in North America. If one searched for this rare bird in a corpus of American English, it might not appear at all. In a more appropriate corpus, the blue pitta would more likely show up, if not in exact proportion with nearby bird populations then perhaps at a level reflecting the greater familiarity with this species in Asia, where the blue pitta can be found. As with the quirks of different styles of English, the blue pitta problem underlines the need for researchers to use appropriate corpora for their searches. 

To address these and related obstacles that may keep a term from surfacing or otherwise misrepresent its usage in a given corpus, Lee and his frequent coauthor Jesse Egbert, a professor of applied linguistics at Northern Arizona University, have been working to expand their thinking to help capture other words and specific examples that fall within the meaning of a search term.  

The challenge is that specific words are at the very heart of corpus linguistics research, yet they can also mislead. You might search for birds, say, and never find the blue pitta, but this doesn’t mean the blue pitta is not a bird. So, Tom Lee says, the act of searching needs to better negotiate between the overall category of birds and every possible example of a bird, or else risk missing the very thing you hope to find.