I've been at a lot of conferences lately. Heck, I'm headed to another one today! At a lot of these conferences, you end up in a large room full of people. Maybe it's the bar, maybe it's a giant ballroom, maybe it's a huge poster session floor. It's absolutely echoing with the sound of dozens to thousands of voices all talking at once. And yet, through all of it, you can somehow focus on the person in front of you! Sometimes, as I navigate my way through another happy hour or poster session, I really wonder how that works.
Over time, and through many crowded conversations, I've noticed something. When my attention begins to wander away from the person I'm talking to (not necessarily because I'm bored, but usually because I'm tired), I end up looking around...and I have a much harder time hearing the person I'm supposed to hear! How does that work? How do you focus on the one person you need to talk to at the cocktail party?
Zion Golumbic et al. "Visual Input Enhances Selective Speech Envelope Tracking in Auditory Cortex at a 'Cocktail Party'." Journal of Neuroscience, 2013.
There's quite a lot involved in filtering out the unimportant conversations at a cocktail party. First, there's the matter of attention. We know that the voice you're attending to in a crowded space elicits larger neural responses than the voices you're ignoring. But picking out the voice from the crowd is more than just attention.
We process speech information in many ways. It's not just a matter of miscellaneous sounds; our brains also detect the temporal patterns of speech. These temporal patterns, the slow rise and fall of loudness as syllables and words go by, form what we call the "speech envelope" or "temporal envelope", and they are necessary for tracking spoken words. Scientists think that these temporal envelopes allow us to break the continuous flow of speech up into more "digestible" units, helping with word comprehension and overall comprehension of the phrase.
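To make the "envelope" idea concrete, here's a toy Python sketch (my own illustration with made-up numbers, not anything from the paper): build a noisy, speech-like signal whose loudness swells and fades at a syllable-like ~4 Hz rate, then rectify and smooth it. The slow modulation you recover is the temporal envelope.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 1000  # sample rate in Hz (hypothetical)
t = np.arange(0, 2, 1 / fs)

# Speech-like signal: a noise "carrier" amplitude-modulated at a ~4 Hz
# syllable rate -- that slow modulation is the temporal envelope.
true_envelope = 0.5 * (1 + np.sin(2 * np.pi * 4 * t))
signal = true_envelope * rng.standard_normal(t.size)

# Recover the envelope: rectify, then smooth with a 50 ms moving average.
win = int(0.05 * fs)
envelope = np.convolve(np.abs(signal), np.ones(win) / win, mode="same")

# The recovered envelope should closely track the true 4 Hz modulation.
r = np.corrcoef(envelope, true_envelope)[0, 1]
print(round(r, 2))
```

The fast wiggles of the raw waveform are mostly irrelevant here; it's this slow loudness contour that the auditory cortex appears to lock onto.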
But we don't just rely on sound processing. We also rely on visual cues. Facial movements during speaking also elicit neural responses and are correlated with the speech envelope; in fact, they precede it by about 150 ms. So the authors hypothesize that visual input from the speaking face might help you "predict" what you are about to hear, easing processing of the words.
To examine this, the authors of the study used magnetoencephalography (MEG), which records the magnetic fields produced by the natural electrical activity of your brain (don't worry, you're not Magneto, the currents are very, very tiny). What they were looking for were patterns in the MEG signal correlated with speech envelope tracking. A response that reliably followed the attended speaker's envelope would mean the subjects were tracking well, while a weaker, less consistent response would indicate that they might be losing the speech thread.
So they recorded MEG from 13 subjects while they tracked a conversation. They could get:
- One person speaking (male or female) alone.
- Two people speaking (male and female) at the same time, while they tracked one of the voices.
- One person speaking along with a video of that person speaking.
- Two people speaking along with video of both of them speaking.
The subjects had to report on what the speaker they were told to attend to was saying. And it turns out that the visual input helped a lot:
You can see that when the participants were listening to one voice at a time, they had an easy time giving correct responses, with or without the video. But when they were listening in the "cocktail party" situation (a very small cocktail party...), audio alone made it harder, and they gave fewer correct responses. When they had a face to go with the voice, though, correct responses increased substantially.
When they looked at the MEG responses, they saw something similar.
What you can see here are average measures of "phase dissimilarity". Phase dissimilarity compares the phase pattern of the MEG response across different stimuli: a larger dissimilarity means the response is reliably locked to one specific speech stream rather than responding the same way to everything, and the pattern of that locked response follows the speech envelope. You can see on the left that for the lone speaker, the dissimilarity was high. In the cocktail party situation, the dissimilarity remained high, but ONLY if there was a face to go with the voice. When there was no face, the dissimilarity was low and the speech envelope was more difficult to discern. This means that having the visual input of the face can increase your auditory cortex's ability to track what that person is saying.
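For a rough intuition about phase locking (this is a toy demonstration, not the paper's actual analysis pipeline), here's a Python sketch: when simulated trials are all locked to the same 4 Hz envelope rhythm, their phases at that frequency line up across trials; when the phase is random from trial to trial, they don't.

```python
import numpy as np

rng = np.random.default_rng(1)
fs, n_trials = 100, 20  # hypothetical sample rate and trial count
t = np.arange(0, 2, 1 / fs)
freq = 4  # envelope-band frequency of interest (Hz)

def phase_coherence(trials, freq, t):
    # Project each trial onto a complex sinusoid at `freq` to get its phase,
    # then measure how aligned those phases are across trials (0 to 1).
    phases = np.angle(trials @ np.exp(-2j * np.pi * freq * t))
    return np.abs(np.mean(np.exp(1j * phases)))

# "Tracking" trials: responses phase-locked to the same 4 Hz rhythm, plus noise.
locked = np.sin(2 * np.pi * freq * t) \
    + 0.5 * rng.standard_normal((n_trials, t.size))
# "Not tracking" trials: same rhythm, but a random phase on every trial.
jittered = np.sin(2 * np.pi * freq * t + rng.uniform(0, 2 * np.pi, (n_trials, 1))) \
    + 0.5 * rng.standard_normal((n_trials, t.size))

print(phase_coherence(locked, freq, t) > phase_coherence(jittered, freq, t))
```

In this toy, the phase-locked trials give a coherence near 1 and the jittered ones a value near zero, which is the flavor of difference the dissimilarity measure picks up between attended-with-face and no-face conditions.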
Why does this help? It could be that the visual input helps you maintain attention. The visual input could also help you predict what is to be said next and aid auditory processing that way. It would be interesting to see this study carried out in a larger "cocktail party" of more than two people. And it would also be interesting to try to dissociate the maintenance of attention from the predictions of facial cues (maybe have a video with just a still face you have to focus on, versus one that is saying what you are hearing?). But regardless of how it's working, it definitely helps. And the next time you need to pick out a voice in the crowd, look to the eyes!
Zion Golumbic, E., Cogan, G., Schroeder, C., & Poeppel, D. (2013). Visual Input Enhances Selective Speech Envelope Tracking in Auditory Cortex at a "Cocktail Party". Journal of Neuroscience, 33(4), 1417-1426. DOI: 10.1523/JNEUROSCI.3675-12.2013