Quantity and quality in linguistics

"This book advises you to be wary of forecasters who say that the science is not very important to their jobs, or scientists who say that forecasting is not very important to their jobs! These activities are essentially and intimately related. A forecaster who says he doesn't care about the science is like the cook who says he doesn't care about food. What distinguishes science, and what makes a forecast, is that it is concerned with the objective world. What makes forecasts fail is when our concern only extends as far as the method, maxim, or model."

A great quote from a great book: Nate Silver's The signal and the noise. The art and science of prediction. I read this book over the summer and thoroughly enjoyed it.[1] Nate Silver is a statistician who runs and writes for the highly recommended blog FiveThirtyEight. (As a quick aside: before reading the book, all I knew about Silver was that he had received an honorary doctoral degree from my university, so I assumed that he was a statistics professor at some Ivy League university. It turns out he started his career as a baseball analyst and online poker player.)

Back to the quote: I think it hits the nail squarely on the head when it comes to characterizing the relationship between statistics and scientific theory. More specifically, if you replace 'forecasters' with 'quantitative linguists' and 'science' with 'theoretical linguistics', you arrive at the motto that characterizes my current thinking about the field. Theoretical linguistics has to acknowledge and embrace the wealth of quantitative data, methods, and techniques that is out there, but at the same time, pure number crunching uninformed by insights from theoretical linguistics—no matter how methodologically or mathematically impressive—is not going to lead to a deeper understanding of natural language.

Let's make this a bit more concrete. A book that I've drawn a lot of inspiration from lately is Marco René Spruit's 2008 PhD dissertation Quantitative perspectives on syntactic variation in Dutch dialects. It is an excellent, impressive, and innovative piece of work, in which the empirical results of (the first half of) the SAND-project are analysed from a quantitative point of view. However, because Spruit is not a linguist (he's a computer scientist by training), his book is a perfect illustration of the dichotomy introduced above. For instance, when looking for associations between 485 syntactic variables (associations of the type If a dialect has property A, what are the odds that it also has property B?), he finds no fewer than 10,730 of them with an accuracy of 90 percent or higher. And when either the antecedent or the consequent is allowed to contain a disjunction, that number even goes up to 56,267,729 (yes, fifty-six million!). With numbers like these you need someone to separate the wheat from the chaff, and that someone, guess what, is a theoretical linguist. In Spruit's own words:

"From a statistical perspective many more linguistically interesting variable associations can be expected to surface upon closer investigation. The explorations described above merely attempt to indicate the great potential of association rule mining as a meaningful contribution to linguistic theory in general and syntactic theory in particular. (..) However, every approach will require extensive consultation with syntactic theorists to meaningfully interpret the data."
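For readers who haven't come across association rule mining before, here is a minimal sketch of the basic idea in Python. It is not Spruit's actual procedure: the toy dialects and properties are made up, the function names are my own, and I'm assuming that his 'accuracy' corresponds to what the data-mining literature usually calls the confidence of a rule, i.e. the proportion of dialects with property A that also have property B.

```python
from itertools import permutations

# Hypothetical toy data (not from the SAND-project): each dialect is
# mapped to the set of syntactic properties it displays.
dialects = {
    "dialect_1": {"A", "B", "C"},
    "dialect_2": {"A", "B"},
    "dialect_3": {"A", "B", "C"},
    "dialect_4": {"B"},
    "dialect_5": {"C"},
}


def confidence(antecedent, consequent, data):
    """Share of dialects with `antecedent` that also have `consequent`.

    Returns None when no dialect has the antecedent, because the rule
    cannot be evaluated in that case.
    """
    with_antecedent = [props for props in data.values() if antecedent in props]
    if not with_antecedent:
        return None
    hits = sum(1 for props in with_antecedent if consequent in props)
    return hits / len(with_antecedent)


def mine_rules(data, threshold=0.9):
    """List all rules 'if A then B' whose confidence reaches the threshold."""
    variables = sorted(set().union(*data.values()))
    rules = []
    # Ordered pairs, since 'if A then B' and 'if B then A' are different rules.
    for a, b in permutations(variables, 2):
        score = confidence(a, b, data)
        if score is not None and score >= threshold:
            rules.append((a, b, score))
    return rules


rules = mine_rules(dialects)
for a, b, score in rules:
    print(f"if {a} then {b}: confidence {score:.2f}")
```

Even this simple version has to inspect every ordered pair of variables, which for 485 variables already means 485 × 484 = 234,740 candidate rules. Nothing in the procedure tells you which of the surviving rules are linguistically meaningful.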

It is this synthesis of quantitative-statistical and formal-theoretical approaches that I've been pursuing in a number of recent talks, and there's more to come, so stay tuned.

  1. I'll probably end up reading it a second time, because by the side of the pool I was too lazy to read all—or rather, any—of the footnotes.