Quantity and quality in linguistics

"This book advises you to be wary of forecasters who say that the science is not very important to their jobs, or scientists who say that forecasting is not very important to their jobs! These activities are essentially and intimately related. A forecaster who says he doesn't care about the science is like the cook who says he doesn't care about food. What distinguishes science, and what makes a forecast, is that it is concerned with the objective world. What makes forecasts fail is when our concern only extends as far as the method, maxim, or model."

A great quote from a great book: Nate Silver's The signal and the noise. The art and science of prediction. I read this book over the summer and thoroughly enjoyed it. Nate Silver is a statistician who runs and writes for the highly recommended blog FiveThirtyEight. (As a quick aside: before reading the book, all I knew about Silver was that he had received an honorary doctoral degree from my university, so I assumed that he was a statistics professor at some Ivy League university, but it turns out he started his career as a baseball analyst and online poker player.)

Back to the quote: I think it hits the nail squarely on the head when it comes to characterizing the relationship between statistics and scientific theory. More specifically, if you replace 'forecasters' with 'quantitative linguists' and 'science' with 'theoretical linguistics', you arrive at the motto that characterizes my current thinking about the field. Theoretical linguistics has to acknowledge and embrace the wealth of quantitative data, methods, and techniques that is out there, but at the same time, pure number crunching uninformed by insights from theoretical linguistics—no matter how methodologically or mathematically impressive—is not going to lead to a deeper understanding of natural language.

Let's make this a bit more concrete. A book that I've drawn a lot of inspiration from lately is Marco René Spruit's 2008 PhD dissertation Quantitative perspectives on syntactic variation in Dutch dialects. It is an excellent, impressive, and innovative piece of work, in which the empirical results of (the first half of) the SAND-project are analysed from a quantitative point of view. However, because Spruit is not a linguist (he's a computer scientist by training), his book forms the perfect illustration of the dichotomy introduced above. For instance, when looking for associations between 485 syntactic variables (associations of the type If a dialect has property A, what are the odds that it also has property B?), he finds no fewer than 10,730 of them with an accuracy of 90 percent or higher. And when either the antecedent or the consequent is allowed to contain a disjunction, that number even goes up to 56,267,729 (yes, fifty-six million!). With numbers like these you need someone to separate the wheat from the chaff, and that someone, guess what, is a theoretical linguist. In Spruit's own words:

"From a statistical perspective many more linguistically interesting variable associations can be expected to surface upon closer investigation. The explorations described above merely attempt to indicate the great potential of association rule mining as a meaningful contribution to linguistic theory in general and syntactic theory in particular. (..) However, every approach will require extensive consultation with syntactic theorists to meaningfully interpret the data."
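To make the notion of an association "with an accuracy of 90 percent" a bit more tangible, here is a minimal sketch in Python. It is not Spruit's implementation: the dialect names, the property labels A, B and C, and the functions `rule_accuracy` and `mine_rules` are all invented for illustration. The sketch also assumes that accuracy means what association rule mining usually calls confidence, i.e. the share of dialects with property A that also have property B.

```python
from itertools import permutations

# Hypothetical toy data: each dialect is mapped to the set of syntactic
# properties it exhibits. Names and values are invented for illustration.
dialects = {
    "dialect_1": {"A", "B"},
    "dialect_2": {"A", "B", "C"},
    "dialect_3": {"A", "B"},
    "dialect_4": {"B", "C"},
    "dialect_5": {"A", "C"},
}


def rule_accuracy(data, antecedent, consequent):
    """Share of dialects with `antecedent` that also have `consequent`.

    Returns None when no dialect has the antecedent at all.
    """
    with_antecedent = [props for props in data.values() if antecedent in props]
    if not with_antecedent:
        return None
    hits = sum(1 for props in with_antecedent if consequent in props)
    return hits / len(with_antecedent)


def mine_rules(data, threshold):
    """Return all single-property rules (A -> B) at or above `threshold`."""
    properties = sorted(set().union(*data.values()))
    rules = []
    for a, b in permutations(properties, 2):
        acc = rule_accuracy(data, a, b)
        if acc is not None and acc >= threshold:
            rules.append((a, b, acc))
    return rules


# The threshold is lowered to 0.75 here only because the toy data set is tiny.
for a, b, acc in mine_rules(dialects, threshold=0.75):
    print(f"{a} -> {b}: {acc:.2f}")
```

Note that the loop checks every ordered pair of properties. With 485 variables that already amounts to 485 × 484 = 234,740 candidate rules before any disjunctions are allowed, which gives some sense of why the output needs a linguist to sift through it.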

It is this synthesis between quantitative-statistical and formal-theoretical approaches that I've been pursuing in a number of recent talks, and there's more to come, so stay tuned.

  1. I'll probably end up reading it a second time, because by the side of the pool I was too lazy to read all—or rather, any—of the footnotes.

Gmail UI woes

A couple of months ago, our IT-department configured the university's firewall to block all outgoing SMTP-traffic (except for its own Exchange server of course). As a consequence, I've been using the Gmail web interface a lot lately, and Oh! My! God! has it been driving me crazy. I'm no designer by any stretch of the imagination, but I'm guessing that whoever is responsible for this jumbled mess wasn't getting straight A's in designer school either. Every day there are numerous aspects of mail.google.com that confuse, irritate, and bewilder me, but for the sake of keeping the amount of complaining on this blog to a minimum, let me pick out my two main grievances. First off, take a look at the following pair:

Pop quiz, hotshot: which of these is the back button and which is the reply button? If I had a eurocent for every time I've mixed these two up, well, I'd have a lot of eurocents. And to add insult to injury: in the view where these two buttons pop up together—the detailed view of a single (thread of) message(s)—the back button is superfluous, because the left-hand column also contains a button to take you to the inbox, which is exactly what the back button does. But hey, it's a good thing replying to messages is not something one does frequently in an email client, right?

A second thing one rarely does is write new messages. In order to help you execute this obscure task, Google has devised this beauty:

Oh, where to start? There are so many things wrong here. First off, it doesn't look like a button at all. Here's what it looks like in context:

It doesn't look anything like the other buttons in this column. If anything, it looks like the title of the column, something that's not even clickable. I'm guessing the reasoning here was: "Composing a new message is a very common task, so let's make a Big Fat Red Button for it", but the effect has been the exact opposite; by making it stand out so prominently, it completely disappears and becomes invisible.

Looks aside, though, let's focus on what the button says. The Dutch verb opstellen is the literal (Google Translate-style) translation of English 'to compose'. Aha, exactly what someone who wants to compose a new message needs, right? Nope. You see, the verb opstellen is only rarely used in combination with mail or e-mail. You don't have to take my word for it; we can look at some numbers. I did a couple of searches in the Corpus of Contemporary Dutch, a corpus of over 70 million words from (among others) newspapers, journals, legal documents, television news broadcasts, novels, and internet texts. The verb opstellen (in any of its inflectional forms) occurs 24,650 times, the noun e-mail 12,264 times, and mail 8,320 times. The question now is to what extent these sets overlap. It turns out that (e-)mail co-occurs with opstellen only a meagre 16 (sixteen!) times. Compare this to the numbers for sturen 'to send' and schrijven 'to write':

verb         mail    e-mail    total
opstellen      11         5       16
sturen       1706      1409     3115
schrijven     621       771     1392

What does this mean? Well, it means that Google should have chosen a different name for their button. The verb they've put on there is only very rarely associated with the action connected to the button, which makes it unintuitive and hard to use. That said, though, I think the problem is more fundamental than the choice of verb. Even if the numbers in the table had been reversed, I still think the button would have been poorly designed. The most informative part of the verb phrase een nieuw bericht opstellen 'to compose a new message' is the direct object een nieuw bericht 'a new message', not the verb. So if anything, that's what should have been on the button: nieuw 'new' or nieuw bericht 'new message'. It would sidestep the whole issue of which verb to use, and instead would focus much more directly on the result the user is trying to achieve.
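The raw counts can also be related to how frequent each verb is overall. The following is a small sketch in Python that does just that, using the co-occurrence totals from the table and the corpus-wide verb frequencies reported in this post (24,650 for opstellen; 72,022 for sturen and 209,487 for schrijven, as given in the methodological footnote). The names `hits_with_mail`, `verb_totals` and `share_with_mail` are made up for this example, and the hits for sturen and schrijven were not manually verified, so the resulting percentages are only indicative.

```python
# Co-occurrences with "mail" or "e-mail" (the "total" column of the table).
hits_with_mail = {"opstellen": 16, "sturen": 3115, "schrijven": 1392}

# Overall frequency of each verb in the corpus, as reported in the post.
verb_totals = {"opstellen": 24650, "sturen": 72022, "schrijven": 209487}


def share_with_mail(verb):
    """Fraction of a verb's occurrences that co-occur with (e-)mail."""
    return hits_with_mail[verb] / verb_totals[verb]


for verb in hits_with_mail:
    print(f"{verb:<10} {share_with_mail(verb):.2%}")
```

On these figures roughly 0.06% of the occurrences of opstellen involve (e-)mail, against about 4.3% for sturen and about 0.7% for schrijven, so the gap does not disappear once the overall frequency of the verbs is taken into account.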

Anyway, the good news is that I will be changing workplaces soon—more on that in a future post—at which point my days of Gmail-web-interface-suffering will be over. I can't wait.

  1. Both nouns are used interchangeably in Dutch.

  2. A quick word on my methodology: for both mail and e-mail I did two searches: one with the noun preceding the verb and one with the noun following the verb. In both cases I allowed for anywhere between 0 and 20 optional intervening words. For opstellen the number of hits was so small that I was able to manually verify all of them and throw out any false positives. For sturen and schrijven I didn't do that because there were too many hits. Note that sturen occurs 72,022 times in the whole corpus, and schrijven 209,487 times.
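The kind of search described in this footnote can be sketched in a few lines of Python. This is not the query interface of the corpus, which is not shown here; it is a hypothetical stand-in that works on a single tokenized sentence. The function `count_cooccurrences`, the example sentence, and the (incomplete) list of verb forms are all assumptions made for illustration.

```python
# Illustrative, incomplete list of inflected forms of "opstellen".
# Separated forms such as "stel ... op" are not covered by this sketch.
OPSTELLEN_FORMS = {"opstellen", "opstel", "opstelt", "opstelde", "opstelden", "opgesteld"}
MAIL_FORMS = {"mail", "e-mail"}


def count_cooccurrences(tokens, noun_forms, verb_forms, max_gap=20):
    """Count noun/verb pairs separated by at most `max_gap` other tokens.

    Both orders (noun before verb, verb before noun) are counted, which
    mirrors the two searches per noun described in the footnote.
    """
    noun_positions = [i for i, tok in enumerate(tokens) if tok.lower() in noun_forms]
    verb_positions = [i for i, tok in enumerate(tokens) if tok.lower() in verb_forms]
    count = 0
    for n in noun_positions:
        for v in verb_positions:
            intervening = abs(n - v) - 1
            if 0 <= intervening <= max_gap:
                count += 1
    return count


# Invented example sentence, not taken from the corpus.
sentence = "Ik moet nog een mail opstellen".split()
print(count_cooccurrences(sentence, MAIL_FORMS, OPSTELLEN_FORMS))
```

A window-based count like this says nothing about whether the noun is actually an argument of the verb, which is why the manual check for false positives matters for the small opstellen set.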

A reviewer's review

I won't name names. Technically, it is not even possible for me to name names, because the review process is double-blind. However, as many of you know, between theory and practice there is only a quick Google search or a sly glance at the document properties, and very often you have a pretty good idea of who's behind the review. In this particular case, however, even a six-year-old could have figured it out.

I've spent a considerable portion of the day wading through a badly written, poorly structured 60+ page maze of a linguistics paper. Why? Because a reviewer thought that the one footnote I had previously devoted to this publication didn't do it justice. As I was reading, however, it became clear that most of the other comments also implicitly referred to this paper. I needed to add some stuff on adverbs? Turns out the paper had a whole section devoted to adverbs. Discussion of languages X and Y was missing? Turns out languages X and Y were the centerpiece of this paper. More on cross-linguistic variation? Bingo! A subsection on cross-linguistic variation. It pretty soon became clear that 90% of the 'major issues' pointed out in the review could be paraphrased as Refer More Extensively To My Paper!

Now, don't get me wrong, I'm a big fan of peer review in academia. I don't think there's a single publication of mine that didn't get better in some way thanks to reviewers' comments. (Even the review which prompted me to write this post contained some valid points and pointed out a number of weaknesses in the paper.) At the same time, however, oftentimes a paper also gets worse in some respects because a reviewer is trying to push his own agenda. So let me try and lay down Three Simple Ground Rules for Reviewing:

  1. Be specific: this is my number one pet peeve when it comes to reviews. You think a paper sucks? Fine, give specific, concrete, clearly formulated arguments to back that up. A whole lot of literature is missing? Give the full bibliographic info of at least one of those missing references. There's a problem in section 2? Give the page and line number of where the problem occurs. Vagueness in a review is not only utterly useless to the author, it's also annoying for the editor, who will have no way of knowing whether you're being vague because you didn't feel like doing the review and didn't invest the time into it, or whether the paper is really deficient in some fundamental way.
  2. Structure your review: editors don't have the time to read all the papers that are submitted to their journal. This means your review should help guide their decision and in this respect, structure is golden: a clear, concise overall judgment of the paper at the start of your review, a numbered, structured list of the major issues, and a possibly longer list with smaller issues. No lengthy summary of the paper—the editor can read the abstract—and no long swaths of rambling prose; this text isn't about (showing how brilliant) you (are), it's about evaluating the paper as objectively as possible.
  3. Be prepared to think along: this is probably the most controversial of the three, but I feel that if a paper starts out from a number of assumptions or axioms, the reviewer should think along inside that framework, unless he has (concrete, specific, clearly formulated, see point 1) arguments for rejecting those assumptions. Very often a reviewer simply disagrees with some assumption (because he adheres to some other flavour of linguistics) and starts bitching about it, trying to push his own agenda.

Call me naive, idealistic, or just plain stupid, but I think that a couple of simple rules like this—if properly adhered to of course—could really improve the reviewing process. It might even make it possible to get rid of the anonymity of peer review, but that's a whole nother can of worms.

  1. Actually, I feel reviews should always contain a full list of references (excluding the ones that were mentioned in the original paper). It's a matter of common courtesy and professionalism.

Service announcement

I've been busy behind the scenes getting the site more up to date, and am happy to share some of the fruits of my labor today. First off, my list of publications now reflects its most current state, and I've added downloadable versions of everything going back to 2008. If you want something from 2007 or earlier, either let me know or be patient: I plan to add them in due time.

Secondly, I've added a new subsection called Talks, which contains—you'll never guess—a complete list of all my talks. Here too, I've added downloadable versions of handouts and slides going back to 2008. This new subsection should give you a good idea of what I'm currently doing research-wise, especially since my talk-to-paper conversion rate isn't always as high or as fast as I'd like it to be. You'll notice quite a bit of verb clusters and (reverse) dialectometry in my most recent presentations; more on that in a later post. In the meantime, comments or questions about any of this (or even some good ole fashioned hecklin') are most welcome.

  1. Well, almost everything: for books I simply link to their webpage.

Adverbs and scope

Today's random linguistic observation: an adverb that is extraposed in an adverbial subclause can take scope over the complementizer of that subclause. Take a look at this pair:

  1. omdat 	hij	waarschijnlijk	slaapt
    because	he 	probably 	sleeps
    'because he's probably sleeping'
  2. omdat 	hij	slaapt waarschijnlijk
    because	he 	sleeps	probably
    'probably because he's sleeping'

As is clear from the English translations, in (1) the adverb waarschijnlijk 'probably' scopes below omdat 'because', while in (2) it scopes higher. Now, that an extraposed adverb in Dutch can scope relatively high should come as no surprise, but as high as the complementizer? There are two obvious ways of interpreting this contrast: (i) waarschijnlijk can indeed adjoin as high as CP and from that position it takes scope over omdat, or (ii) omdat is not as high up in the structure as one might think (e.g. it occupies a low projection in a split CP-system), allowing an extraposed adverb to take scope over it.

Unless of course the real answer is secret option number three: you'll notice that the examples in (1) and (2) don't contain a matrix clause. What if waarschijnlijk is not so much extraposed from the adverbial clause as it is a matrix adverb? Let's take a look at a more complete example:

  3. Ze	hebben 	hem	ontslagen 	
    they 	have 	him	fired 	
    omdat   hij	stal 	waarschijnlijk.
    because he 	stole	probably
    'They probably fired him because he stole.'

Forget all my earlier talk about waarschijnlijk adjoining all the way up to CP or omdat being base-generated lower than the highest C-position: maybe waarschijnlijk is just a matrix adverb in examples like (2) and (3) and never has any relationship with the adverbial clause. That would immediately explain its high scope, but it wouldn't mean we're out of the woods yet. For one, the combination of because-clause and adverb seems to form a constituent, as they can be fronted together to the pre-V2-position:

  4. Omdat   hij	stal 	waarschijnlijk
    because he 	stole	probably
    hebben 	ze	hem	ontslagen. 	
    have	they	him	fired
    'They probably fired him because he stole.'

This would seem to bring these examples in line with a construction discussed by Sjef Barbiers in the mid nineties and which I also briefly dabbled in (not in ‘Nam of course) ten years ago in unfinished and unpublished work, whereby both an argument and an adverb precede the finite verb in a declarative main clause:

  5. De  krant 	gisteren  meldde   het voorval  niet.
    the newspaper 	yesterday reported the incident not
    'The newspaper didn't report the incident yesterday.'

What makes this example similar to the data discussed above is that the adverb gisteren 'yesterday' takes clausal rather than nominal scope. In Sjef's analysis the DP de krant 'the newspaper' is in the specifier of the adverb gisteren 'yesterday', while I tried to argue that (5) is a genuine case of V3. What both of us agreed upon, though, is that the adverb originates in the clausal spine, so it looks like option number three might be the right one to go for after all.

  1. If you're an old-fashioned guy like me, you might even think that this is because every projection up to CP is right-headed and so in (2) you're adjoining at least as high as IP.

  2. In fact, if this is the right solution, it must adjoin as high as CP, because to my ear, the readings indicated are the only ones that are available.

  3. The relevant ground for comparison is the DP de krant van gisteren 'yesterday's newspaper' (lit. the newspaper of yesterday), where the adverb does not take clausal scope.