In this post, our heroine -- spurred on by her godly pursuit of science and a bevy of caffeinated drinks -- compares the standard approach to language to intelligent design. It might get noodly.
Pick one:
Does language "emerge" full-blown in children, guided by a hierarchy of inbuilt grammatical rules for sentence formation and comprehension?
Or is language better described as a learned system of conventions -- one that is grounded in statistical regularities that give the appearance of a rule-like architecture, but which belie a far more nuanced and intricate structure?
For over half a century now, many scientists have believed that the second of these possibilities is a non-starter.
No one's quite sure why -- but it might be because Chomsky told them it was impossible.
In a seminal 1963 paper, Miller and Chomsky argued that the computational problems facing any stochastic model of language were insoluble. Without a hierarchical framework of rules already in place to lay the groundwork, they argued, language would be unlearnable from the available input.
Since 1963, language models have become increasingly sophisticated. Problems that then seemed intractable in computational linguistics no longer do. Yet in psychology, there is still widespread belief that the ‘grammar’ of a language cannot be learned, and thus must be hardwired. Indeed, the field is rife with arguments that it would be ‘logically impossible’ to learn language without such innate structures.
These arguments hinge on key assumptions that do not stand up to either empirical or theoretical scrutiny. It's certainly possible that some piece of linguistic machinery could be hardwired – this is a legitimate question, clearly worth investigating. But there is absolutely no logical requirement that it be so.
To begin to dismantle Miller and Chomsky’s critique, we first need to address its content. To précis: the pair take a hypothetical stochastic model of language and show that, for that model to accurately and precisely estimate the probability of a reasonable proportion of English sentences, it would need to be exposed to simply astronomical amounts of linguistic input -- far more than a human language learner could ever hope to encounter in the span of a lifetime. On this basis, the pair conclude that language is unlearnable from the input. This then becomes known -- in the history of science -- as one of the classic 'poverty of the stimulus' arguments.
“Just how large must n and V be in order to give a satisfactory model? Consider a perfectly ordinary sentence: The people who called and wanted to rent your house when you go away next year are from California. In this sentence there is a grammatical dependency extending from the second word (the plural subject people) to the seventeenth word (the plural verb are). In order to reflect this particular dependency, therefore, n must be at least 15 words. We have not attempted to explore how far n can be pushed and still appear to stay within the bounds of common usage, but the limit is surely greater than 15 words; and the vocabulary must have at least 1000 words. Taking these conservative values of n and V, therefore, we have V^n = 10^45 parameters to cope with, far more than we could estimate even with the fastest digital computers.”
The suggestion is that a functional language model would need to estimate at least 10^45 parameters -- and thus be exposed to a correspondingly astronomical amount of input -- to parse even a simple sentence. Miller and Chomsky emphasize that this is a highly conservative estimate, based on a language of only 1000 words, which is, of course, only a small fraction of the vocabulary of an adult speaker. But regardless, the figure is huge! To put it into perspective, the average lifetime consists of 2.2 billion seconds. For a person to hear 10^45 words within her lifespan, she would have to process nearly half an undecillion words a second. The point is: it ain’t gonna happen. Consequently -- conclude the daring duo -- language simply must be unlearnable from the input.
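The back-of-the-envelope numbers are easy to check. A minimal sketch, using the V = 1000 and n = 15 from the quoted passage and the 2.2-billion-second lifespan above:

```python
# Miller and Chomsky's "conservative" estimate: a vocabulary of
# V = 1000 words, with dependencies spanning n = 15 words.
V, n = 1000, 15
parameters = V ** n                  # transition probabilities to estimate
assert parameters == 10 ** 45

# An average lifetime is roughly 2.2 billion seconds.
lifetime_seconds = 2.2e9
words_per_second = parameters / lifetime_seconds
print(f"{words_per_second:.1e} words per second")  # ~4.5e+35
```

That works out to roughly 4.5 × 10^35 words a second -- about half an undecillion, give or take.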
The numbers are impressive, no doubt. But is the argument grounded in anything real?
Turns out -- no, it’s not. The problem is that Miller and Chomsky never consider the possibility that another, more psychologically realistic model of probabilistic learning might not suffer from the same problem. The underlying -- and unjustified -- assumption is that learning could only work in this way. But there are many reasons to think that human learning would not proceed in the way that their model does -- namely, the last forty years of learning theory!
Before we get to that, however, let's waltz a step back for a moment, and consider how the argument is set up in the first place. Step 1: Miller and Chomsky describe a possible probabilistic model of language. Step 2: They show that it can't possibly account for how language is learned. Step 3: They conclude that language can't possibly be learned.
I'm sorry -- is it just me, or does this sound shockingly like an argument from intelligent design? "This is too complex -- evolution couldn't possibly explain it" versus "This is too complex -- we couldn't possibly learn it."
Isn't the point of science to figure out whether our models can explain things? And to build better models if they can't? If I tried to model quantum mechanics with play-dough and failed, that wouldn't be a strike against quantum mechanics. (Kidding. Kind of.)
But in all seriousness, there are at least two goals of modeling in cognitive science: 1) to discover the best computational method of accounting for a given phenomenon, and 2) to discover the best account that is also psychologically plausible.
The goal has never been to rule out a whole class of models on the basis of one ill-starred example. Because -- quite frankly -- models don't deal in 'logical possibilities.' They are not mathematical or logical proofs. Step 3 in Miller and Chomsky's paper is a pseudo-scientific non sequitur.
The Trouble with the Model
I'm sure you're wondering -- what was wrong with the model Miller and Chomsky used? And how do we know for certain that they were working with the wrong model and the wrong set of assumptions?
For starters, the Markov model they use automatically assigns a probability of zero to any sentence (or other string of words) that it hasn't encountered yet. Which is to say: if the model hasn't been exposed to that particular string, then the string is deemed ‘ungrammatical.’ This is why such massive amounts of input are needed to guarantee that no potentially ‘grammatical’ sentences are ruled out. As Miller and Chomsky point out:
“We know that the sequences produced by n-limited Markov sources cannot converge on the set of grammatical utterances as n increases because there are many grammatical sentences that are never uttered and so could not be represented in any estimation of transitional probabilities.”
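This failure mode is easy to demonstrate. Here's a minimal sketch -- my own toy illustration, not Miller and Chomsky's construction -- of a bigram Markov model trained by raw maximum likelihood. Any sentence containing a transition absent from the training data gets probability zero, no matter how 'grammatical' it is:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Estimate raw bigram transition probabilities from a corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = sentence.split()
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1
    return {w1: {w2: c / sum(nxt.values()) for w2, c in nxt.items()}
            for w1, nxt in counts.items()}

def sentence_prob(model, sentence):
    """Probability of a sentence; unseen transitions get zero."""
    words = sentence.split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= model.get(w1, {}).get(w2, 0.0)
    return p

model = train_bigram(["the dog barks", "the cat sits"])
print(sentence_prob(model, "the dog barks"))  # 0.5 (seen)
print(sentence_prob(model, "the dog sits"))   # 0.0 -- grammatical, but unseen
```

And scaling the window up to n = 15, as in the quoted passage, only makes things worse: the table of transitions to estimate grows as V^n, which is exactly the explosion Miller and Chomsky describe.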
Fair enough. Markov models aren't up to snuff when it comes to explaining the whole of language learning. It's not clear, however, what this is supposed to tell us -- for one, how psychologically plausible are these models to begin with? Well -- if we take the model 'literally' -- i.e., if we assume that this is actually how human learning mechanisms work -- then our expectation should be that children will be stuck parroting back sentences they've heard before, since all other sentences will have been ruled 'ungrammatical' until proven otherwise. But -- do we really think this model captures the extent of our learning capabilities? And is this actually the best a 'probabilistic' model can do?
No (and no again). It's easy to see why: think of a non-linguistic skill of yours that you're really good at -- say, cooking. Now imagine that as you were learning to cook, you were only able to learn precisely what you were taught, and never able to move outside those bounds. For example, if you were given a recipe, you could only cook at that temperature, with those specific ingredients, in that order, and so on. And you could only ever make the recipes you'd already made -- never anything else.
Does that seem like a fair way to describe what a top chef does?
Why then should we expect that language learning would be so limited?
Stimulus generalization is ubiquitous in nature. Even goldfish do it. There are many reasons to think that a child can generalize from what she has already learned: for example, when she learns how to use the word 'dog,' she might at the same time gain some insight into how to use words like 'cat.'
If you think about it, using one word like another isn't so different from substituting Tapatio for Tabasco in a recipe -- sure, they're not quite the same flavor, but until we really learn to distinguish the differences between them, we may use them interchangeably in our cooking. Once we can tell the difference, of course, we'll know that Tapatio is a great sauce for salsa, while Tabasco is better for chips, but that takes time and -- well -- trial and error!
In a similar manner, children will, over the course of several years, learn that while both cats and dogs 'sit on mats' and 'walk outside,' only cats 'meow' and only dogs go 'woof woof.' (Funny enough, very small children raised with dogs have a tendency to call just about every animal they encounter a "dog," just as those brought up in households with Subarus may amusingly call every car they encounter a "Subaru.")
I want to save a more in-depth treatment of early language learning for a later post, but the simple point here is: we're fully capable of both contextual generalization and discrimination learning in other domains. Why would these powerful learning mechanisms not be available to us as we learn language? And why should we trust in the failure of models that don't begin to approximate those general learning capabilities?
The bottom line is: we shouldn't.
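To make the contrast concrete: even add-one (Laplace) smoothing -- a crude textbook fix, and certainly not a psychologically realistic model of generalization -- already stops unseen-but-plausible transitions from being ruled 'ungrammatical.' A toy sketch (again, my own illustration):

```python
from collections import Counter

corpus = ["the dog barks", "the cat sits"]
bigrams = Counter((w1, w2) for s in corpus
                  for w1, w2 in zip(s.split(), s.split()[1:]))
word_counts = Counter(w for s in corpus for w in s.split())
vocab_size = len(word_counts)

def smoothed_prob(w1, w2):
    # Add-one smoothing: every possible transition gets a small
    # baseline count, so nothing is assigned probability zero.
    return (bigrams[(w1, w2)] + 1) / (word_counts[w1] + vocab_size)

print(smoothed_prob("dog", "barks"))  # seen transition: 1/3
print(smoothed_prob("dog", "sits"))   # unseen transition: 1/6 -- small, but not zero
```

Smoothing is the bluntest instrument in the statistical toolbox, and it already escapes the parroting trap. Richer similarity-based generalization -- of the Tapatio-for-Tabasco kind -- does far better still.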
Do we really want to say that phonemes are 'innate'?
I haven't yet addressed how we know -- with all but certainty -- that the model Miller and Chomsky used had to be a poor approximation of human learning capabilities. It has to do with phonemes.
Experiments have shown that people are remarkably sensitive to the transitional probabilities between phonemes in their native languages, both when speaking and when listening to speech. If Miller and Chomsky’s assessment of probabilistic learning is correct, then the problem of "parameter estimation" should apply not only to learning the probabilities between words, but also to learning the probabilities between phonemes. Given that people do learn to predict phonemes, Miller and Chomsky's logic would force us to conclude that not only must ‘grammar’ be innate, but the particular distribution of phonemes in English (and every other language) must be innate as well.
We only get to this absurdist conclusion because Miller & Chomsky's argument mistakes philosophical logic for science (which is, of course, exactly what intelligent design does). So what's the difference between philosophical logic and science? Here's the answer, in Einstein's words, "No amount of experimentation can ever prove me right; a single experiment can prove me wrong."
(Rule #264: Models don't make for good premises.)
Attributions and Amusements
In line with Einstein's comment, Scholz & Pullum (2006) have this to say about what the nativist research program should amount to:
“…The research program of linguistic nativism aims to show, proposition by proposition and mechanism by mechanism, that very little knowledge of syntactic structure is acquired or learned from sensory stimuli. Thus the discovery of one (or even a few) language-specialized cognitive mechanisms does not resolve the partisan nativist/non-nativist dispute.” (in Irrational Nativist Exuberance)
This post draws heavily on the literature reviews in Yarlett (2007), Ramscar, Yarlett, Dye, Denny & Thorpe (2010) and Yarlett, Ramscar & Dye (in submission). Much of the original scholarship on Miller and Chomsky is Dan Yarlett's. If you would like a copy of the first or the third paper, please email me.
Rescorla, R. A. (1988). Pavlovian conditioning: It's not what you think it is. American Psychologist, 43(3), 151-160. PMID: 3364852
Bannard, C., Lieven, E., & Tomasello, M. (2009). Modeling children's early grammatical knowledge. Proceedings of the National Academy of Sciences of the United States of America, 106(41), 17284-17289. PMID: 19805057
Ramscar, M., Yarlett, D., Dye, M., Denny, K., & Thorpe, K. (2010). The effects of feature-label-order and their implications for symbolic learning. Cognitive Science, 34(7), 909-957. doi: 10.1111/j.1551-6709.2009.01092.x
Scholz, B., & Pullum, G. (2006). Irrational nativist exuberance. In Contemporary Debates in Cognitive Science, 59-80.