I wanted to write a post on a new fMRI paper that looks really interesting. But in attempting to do so I felt the need to condense some of my own cloudy thoughts on fMRI. Think of this as one part explanation, one part rant, and one part thinking aloud.
Functional magnetic resonance imaging (fMRI) has become a popular tool for human neuroscience and experimental psychology. But its popularity masks several major issues of interpretation that call into question many of the generalizations fMRI researchers would like to make. These generalizations often lead to overzealous and premature brain-centric redefinitions of high-level concepts such as thinking, emotion, love, and pleasure.
Sensationalist elements in the media run with these redefinitions, taking advantage of secular society’s respect for science in order to promote an unfounded reductionist attitude towards psychological and cultural phenomena. A reductionist attitude feeds off two often contradictory human desires: one for simple, intuitive explanations, and the other for controversial, novel solutions. This biased transfer of ideas from the laboratory to the newspaper, the blog and the self-help book is responsible for a rash of fallacious oversimplifications, such as the use of dopamine as a synonym for “pleasure”. (It correlates with not-so-pleasurable events too.)
Neuroscientific redefinitions of high-level phenomena, even when inspired by an accurate scientific picture, often fall prey to the “mereological fallacy”, i.e., mistaking parts of a phenomenon for the whole. Bits of muscle don’t dance, people do. And a free-floating brain doesn’t think or feel, a whole organism does.
But before dealing with the complex philosophical, sociological and semantic issues posed by neuroscience, we must be sure that we understand what the experimental techniques are actually telling us. fMRI experiments are usually interpreted as indicating which parts of the brain “light up” during a particular task, event, or subjective experience. For instance, a recent news headline informs us that “Speed-Dating Lights Up Key Brain Areas”. The intuitive simplicity of such statements masks a hornets’ nest of interpretation problems.
What does an fMRI picture represent? fMRI results only reach statistical significance if the studies are carried out in a group of at least 5-10 people, so the “lighting up” is reliable only after pooling data [But see the edit below]. And in addition to averaging the data from multiple subjects, usually the experimenter must also average multiple trials for each subject. So the fMRI heat-map is an average of averages. Far from being a snapshot of brain activity as it evolves through time, an fMRI heat-map is like a blurry composite photograph produced by superimposing several long exposures of similar, non-identical things.
As if this were not problematic enough, a baseline must be subtracted from the results of an fMRI scan of a single person before further analysis. The brain is never quiescent, so to find out about activity in a particular region during a particular event or task, the experimenter must design a control that is identical to the task except with respect to the phenomenon of interest. Imagine a boat with three people on it, sailing on choppy seas. If an observer watching calmly from the shore wants to understand how each person is moving on the boat, she must first factor out the overall motion of the boat caused by the waves. Only then can the observer determine, say, that one person on the boat is jumping up and down, the other is swaying from side to side, and the third is trying to be as still as possible. The subtractions used in fMRI studies are like the factoring out of the motion of the boat — they allow the experimenter to zoom in on the activities of interest and ignore the choppy sea that is the brain’s baseline activity.
Here’s another analogy that might help (me) understand what’s happening with fMRI averaging and subtracting. Let’s say you want to understand the technique of baseball batters that allows them to successfully hit a particular kind of pitch. You take videos of 20 batters hitting, say, a fastball. The fastball is the “task”, and each attempt to hit the ball is a “trial”. Suppose each batter makes 100 swings, and around 50 connect with the ball. The rest are strikes. So there are 50 hits and 50 misses. You take the videos for the 50 hits and average them, so you get a composite or superimposed video for each batter. Then you take the videos of the 50 strikes, and average those. This is the control. Now you “subtract” these two averaged videos, for each batter, getting a video that would presumably show a series of ghostly images of floating, morphing body parts — highlighting only what was different about the batter’s technique when he made contact with the ball versus when he didn’t. In other words, if the batter’s torso moves in exactly the same way whether he hits or misses, then in the video the torso will be subtracted out, and only the head, arms and legs will be visible. Finally, you pool together the subtracted videos for all 20 batters and average them. Now you have a single video that shows the average difference in batting technique between successful hits and misses. If you’ve done everything right, you have some idea of which batting techniques tend on average to work against fastballs.
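To make the recipe concrete, here’s a toy numpy sketch of the average-then-subtract pipeline (all numbers are invented; the “voxels” are just stand-ins for body parts or brain regions):

```python
import numpy as np

rng = np.random.default_rng(0)

n_subjects, n_trials, n_voxels = 20, 50, 6

# Invented data: per-subject "task" and "control" trials (hits and
# misses in the batting analogy), each trial a vector of intensities.
task = rng.normal(loc=1.0, scale=1.0, size=(n_subjects, n_trials, n_voxels))
control = rng.normal(loc=0.0, scale=1.0, size=(n_subjects, n_trials, n_voxels))

# Step 1: average over trials, separately for each subject.
task_mean = task.mean(axis=1)        # shape (n_subjects, n_voxels)
control_mean = control.mean(axis=1)

# Step 2: subtract the control average from the task average,
# per subject -- the "factoring out the boat's motion" step.
contrast = task_mean - control_mean  # shape (n_subjects, n_voxels)

# Step 3: average the per-subject contrasts across the whole group.
group_map = contrast.mean(axis=0)    # shape (n_voxels,)
print(group_map)
```

Each entry of `group_map` is the group-averaged task-minus-control difference — the kind of number behind each blob in a heat-map.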
But consider what may be misleading about this video. Perhaps there are two different techniques or strategies for hitting a fastball. The averaged video will only show a kind of midway point between them. Basically, individual differences can get blurred out by averaging. So sometimes the batting technique that seems to be suggested by the average doesn’t really exist — it can be an artifact of averaging, rather than a picture of an actual trend. Good statistical practices help experimenters avoid artifacts, but as the task and the stats become more complicated, the scope for misunderstanding and misuse expands. In other words, every mathematical angel is shadowed by its very own demon.
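Here’s a toy illustration of that blurring, assuming two invented swing-timing strategies:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented example: 10 batters swing "early" (timing around -1.0) and
# 10 swing "late" (timing around +1.0); no one swings near timing 0.
early = rng.normal(-1.0, 0.1, size=10)
late = rng.normal(+1.0, 0.1, size=10)
timings = np.concatenate([early, late])

# The group average lands near 0 -- a "technique" no batter actually uses.
mean_timing = timings.mean()
print(round(mean_timing, 2))

# No individual swing is anywhere near the average:
near_mean = np.abs(timings - mean_timing) < 0.5
print(near_mean.any())
```

The average lies in an empty region between the two real clusters — a picture of a trend that exists for no one.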
Another interpretation issue has to do with what the subtraction means. In the case of the missing torso, you can assert that the difference between success and failure at hitting a fastball does not depend on the torso’s movement, since it’s the same regardless of what the batter does. This does not mean, however, that the torso has nothing to do with batting. After all, we know the torso is what links up everything else and provides the crucial central services to the arms and legs. So if a brain region doesn’t light up in an fMRI study, this doesn’t mean that it has no role to play in the task being studied. It may in fact be as central to the task as the torso is!
But the problems associated with averaging and subtraction crop up in all forms of data analysis, so they’re among the inevitable hazards that go with experimental science. The central question that plagues fMRI interpretation is not mathematical or philosophical, it’s physiological. What neural phenomenon does fMRI measure, exactly? It seems a partial answer may have been found, which I’ll touch on, hopefully, in the next post.
- William Uttal’s book The New Phrenology (which I have only read a chapter or so of) describes how the “localization of function” thread that runs through fMRI and other neuroscientific approaches may be a misguided return to that notorious Victorian pseudoscience. Here is a précis of the book.
- This New Yorker piece deals with many of the issues with fMRI, and also links to related resources including a New York Times article.
- Neuroscience may not be as misleading to the public as was originally thought. Adding fMRI pictures or neurobabble to an argument was thought to make people surrender their logical faculties, but a recent study suggests that the earlier finding may have been flawed.
- For a vivid illustration of the potential effects of averaging, check out this art project. The artist averaged every Playboy centerfold, grouping them by decade, producing one blurry image each for the ’60s, ’70s, ’80s, and ’90s. Don’t worry, it’s very SFW.
- You can also play around with averaging faces here.
EDIT: In the comment section Kayle made some very important corrections and clarifications:
“You say that fMRI results only reach statistical significance if the studies are carried out in groups. This is not quite right. Almost all fMRI studies begin with a “first-level analysis,” which are single-subject statistics. This way you can contrast different conditions for a single subject. With large differences, small variability, and enough trials, robust maps can be created. This is done for surgical planning when doctors are considering how much brain they can resect surrounding a tumor without endangering someone’s ability to move or talk. When examining mean differences between groups, however, you need to examine results from multiple people (“second-level analysis”). Again, this is not specific to fMRI. The rule of thumb goes something like this: Most people are interested in being able to detect differences with effect sizes of about 1 SD or above. To do this with some confidence (Type II p < 0.20) you need about 10 to 30 observations per group.”
” Every fMRI result you’ve seen includes single-person single-session results that generally are not reported because most people aren’t interested.”
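To make Kayle’s rule of thumb concrete, here’s a quick Monte Carlo sketch of my own (the numbers are toy choices, not from the comment): with a 1 SD effect, simulated two-sample t-tests cross 80% power (Type II error below 0.20) somewhere between 10 and 30 observations per group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def power_two_sample_t(n_per_group, effect_size_sd=1.0, alpha=0.05, n_sims=5000):
    """Estimate the power of a two-sample t-test by simulation."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect_size_sd, 1.0, n_per_group)
        _, p = stats.ttest_ind(a, b)
        hits += p < alpha
    return hits / n_sims

# Power climbs through ~0.80 between 10 and 30 per group for a 1 SD effect.
for n in (5, 10, 17, 30):
    print(n, round(power_two_sample_t(n), 2))
```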
What’s your take on the neuro-economics field? People working here wish to trace elements of “utility” to brain “readings”, to give empirical content to the notion of utility. It seems a little difficult to digest, and one point of observation I often encounter is that such readings are hard to map onto any “feeling”; the data are really noisy.
As you might expect, I’m quite suspicious of neuro-econ. It’s a trendy sort of field, and makes precisely the kind of mereological fallacy-type errors I mention here. They like dopamine because they think it will tell them about “value” or “preference”, but the data are not so simple. Noise per se is not the issue, it’s the physical meaning of the data that is subject to a variety of interpretations. There are so many links and nonlinearities connecting brain areas that these abstract ideas like utility — or beauty, or political leaning — tend to involve a gross overextension of the inferred role of some brain area. Each experiment is like a 2D shadow of some complex 3D object. Or even higher D. 🙂 One shouldn’t try to infer too much from a single set of dimensionality-reducing projections.
Great metaphors, but your criticisms are not specific to fMRI. In your hypothetical example, the torso is involved in batting but not differently for hits and misses. If, instead of doing signal averaging, you examined the relationship of height and hand size to batting (using t-tests), you might find that height is related but not find the same for hand size. (As with all traditional frequentist inferential parametric statistics, not finding significance for hand size could be due to comparatively high variability or a small effect.) Even if the relationship between hand size and batting ability is really very small, it doesn’t mean hands aren’t used for batting. As you say, only what is different about the batter’s technique is highlighted, but this is true for any t-test. Next, individual differences get blurred out with any statistic. This applies to the hands and height example, but also to the often-reported statistic that men are better at mental 3D block rotation than women and women are better at word generation than men. Those results are true for the average, not universally (just as you could say that blue whales are bigger than humans, but men are not universally taller than women). I think your metaphors are great; if you ever teach statistics for any field, you should employ them.
You say that fMRI results only reach statistical significance if the studies are carried out in groups. This is not quite right. Almost all fMRI studies begin with a “first-level analysis,” which are single-subject statistics. This way you can contrast different conditions for a single subject. With large differences, small variability, and enough trials, robust maps can be created. This is done for surgical planning when doctors are considering how much brain they can resect surrounding a tumor without endangering someone’s ability to move or talk. When examining mean differences between groups, however, you need to examine results from multiple people (“second-level analysis”). Again, this is not specific to fMRI. The rule of thumb goes something like this: Most people are interested in being able to detect differences with effect sizes of about 1 SD or above. To do this with some confidence (Type II p < 0.20) you need about 10 to 30 observations per group.
Also, your post refers to the most common method of fMRI analysis, task-based GLM. However, I think that ICA and other multivariate approaches that take into account the relationships of voxels to each other are the best methods and are being utilized more and more as the field advances. Further, new methods are looking at temporal dynamics instead of looking at the average state of the brain over the time spent in the scanner.
I'm excited to read your next post!
Thanks for the comments and criticism Kayle! Very helpful! And feel free to use these metaphors … and if you improve on them, let me know!
I did realize as I was writing this that the data analysis criticisms are not specific to fMRI, hence the last paragraph. I’m more interested in the relationship between blood flow, oxygen, metabolism, and firing, which I intend to talk about in the next post. But the details you provide are definitely worth mentioning clearly.
What I was trying to get at here is the very first level of processing, which nowadays is almost pre-processing. The inter-voxel relationships (where the real action is!) aren’t even dealt with here — they’d probably sink/strike out both my metaphors! 🙂
As a point of clarification — ICA and other methods still need to use subtraction (or normalization) before doing anything else, right?
Also, re: single person MRI. What you mention is structural, right? I’d be excited to see a study done just on one person — perhaps longitudinally over weeks and months, on the same task. Do the stats allow that?
Re: temporal dynamics. You’d still need to average across trials though right? If you see any new methods that a non-specialist might be interested in (and understand!), share on g+ etc!
In fMRI methods, “pre-processing” refers to a specific set of procedures used to reduce artifacts and improve data, such as slice-time correction, temporal and spatial smoothing, and co-registration.
ICA does not necessarily use subtraction or normalization.
I wasn’t referring to structural MRI. Every fMRI result you’ve seen includes single-person single-session results that generally are not reported because most people aren’t interested. I’d be happy to show you data on one subject. Longitudinal analyses are easy to do, but you have to be careful because there are a lot of potential pitfalls (eg, you can’t use a different scanner).
Yes, for temporal dynamics you need to average across observations (trials or TRs). The time window you choose plays a big role in the dynamic you’re investigating. It’s an awesome new part of fMRI! However, to get good results you need many observations (short TR) and high SNR from your magnet.
Cool! Good to know. Very curious about how ICA manages without subtraction or normalization. Isn’t there a need to find a zero point for any measurement? How else does one calibrate?
I’ve quoted you on some key issues in the form of an edit at the bottom of the post. Must try to be fastidious about these things!
I’m still learning about ICA when it comes to fMRI, but if I find out more about zeroing, subtraction, or normalization, I’ll let you know. My understanding though is that you use the time series for each voxel with the BOLD values reported by the scanner. But at this point, there is more that I don’t understand about ICA than there is that I do.
Thank you for your posts and blog. Are you a neuroscience student or professor? I’d love to know more about you. My daughter is getting her Ph.D. in cog. sci at UCI and I have shared your blog with her and am sending your G+ id.
Sorry I took so long to reply! I’m a neuroscience postdoc — I completed my PhD in 2011. I’d be happy to answer any questions your daughter has…though I’m not really a cognitive scientist. (The differences between fields may seem “academic”, but they affect what one ends up reading and studying. 🙂 )