How reliable are psychiatric diagnoses?

This post, which mostly provides useful citations, was originally written as an answer to the following Quora question:

Is ADHD one of the only psychiatric conditions that can be diagnosed objectively?

It really depends on what you mean by “objective”, but the answer is “probably not”.

Since we do not understand the underlying causes of ADHD — or any major psychiatric disorder — we diagnose them based on clusters of symptoms.

In the United States and several other countries, a large number of psychiatrists use a book called Diagnostic and Statistical Manual of Mental Disorders (DSM).

DSM (now in its 5th, revised edition, DSM-5) essentially uses a system of checklists to enable a clinician to assess whether a person has a given disorder. This is a controversial book for various reasons, but for now it is what most psychiatrists use.

Instead of the complex philosophical question of ‘objectivity’, the usefulness of DSM can be assessed using statistical measures of ‘reliability’.

Given that one clinician uses DSM-5 to give the diagnosis of ADHD, how likely is another clinician to do so using the DSM-5? Measures of “test-retest reliability” capture this probability.

Here is a paper that explains the statistical measurement of reliability in some detail:

DSM-5: How Reliable Is Reliable Enough? [pdf]

There are conflicting reports on the reliability of DSM-5, but here is one paper that reports statistical assessments:

DSM-5 field trials in the United States and Canada, Part II: test-retest reliability of selected categorical diagnoses.

“There were a total of 15 adult and eight child/adolescent diagnoses for which adequate sample sizes were obtained to report adequately precise estimates of the intraclass kappa. Overall, five diagnoses were in the very good range (kappa=0.60–0.79), nine in the good range (kappa=0.40–0.59), six in the questionable range (kappa=0.20–0.39), and three in the unacceptable range (kappa values <0.20). Eight diagnoses had insufficient sample sizes to generate precise kappa estimates at any site.”


“Two were in the very good (kappa=0.60–0.79) range: autism spectrum disorder and ADHD.”

For more on the quantity reported here, kappa, see this paper:

Interrater reliability: the kappa statistic

Kappa can in principle range from −1 to 1, but in reliability studies it usually falls between 0 and 1. Zero means that the raters (clinicians in this case) agreed no more often than would be expected by chance, and 1 means there was perfect agreement.
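To make the idea of chance-corrected agreement concrete, here is a minimal sketch of Cohen's kappa, the simplest two-rater version of the statistic. The field trials quoted above report an intraclass kappa, which is computed somewhat differently, and the ten patient ratings below are invented purely for illustration.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability of matching labels if each rater
    # labelled independently at their own base rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1:
        raise ValueError("kappa undefined: both raters used one identical label")
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical data: two clinicians assess the same 10 patients for ADHD.
clinician_1 = ["ADHD", "ADHD", "ADHD", "ADHD", "none",
               "none", "none", "none", "none", "none"]
clinician_2 = ["ADHD", "ADHD", "ADHD", "none", "ADHD",
               "none", "none", "none", "none", "none"]

kappa = cohens_kappa(clinician_1, clinician_2)
print(round(kappa, 3))  # 0.583
```

In this made-up example the clinicians agree on 8 of 10 patients, yet kappa is only about 0.58, because roughly half of that agreement would be expected by chance alone.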

As I said before, the DSM is controversial — and not just because of reliability issues. Here is a sampling of papers and popular articles on the general topic:

Academic articles

Reliability in Psychiatric Diagnosis with the DSM: Old Wine in New Barrels

“However, the standards for evaluating κ-statistics have relaxed substantially over time. In the early days of systematic reliability research, Spitzer and Fleiss [4] suggested that in psychiatric research κ-values ≥0.90 are excellent; values between 0.70 and 0.90 are good, while values ≤0.70 are unacceptable. In 1977, Landis and Koch [5] proposed the frequently used thresholds: values ≥0.75 are excellent; values between 0.40 and 0.75 indicate fair to good reliability, and values ≤0.40 indicate poor reliability. More recently, Baer and Blais [6] suggested that κ-values >0.70 are excellent; values between 0.60 and 0.70 are good; values between 0.41 and 0.59 are questionable, and values ≤0.40 are poor. Considering these standards, the norms used in the DSM-5 field trial are unacceptably generous.”

The Reliability of Psychiatric Diagnoses: Point—Our psychiatric Diagnoses are Still Unreliable

“Today, 26 years later, did the DSM system succeed in improving the reliability of psychiatric diagnoses? Two answers exist. The DSM did improve the reliability of psychiatric diagnoses at the research level. If a researcher or a clinician can afford to spend 2 to 3 hours per patient using the DSM criteria and a structured interview or a rating scale, the reliability would improve. [13] For psychiatrists and clinicians, who live in a world without hours to spare, the reliability of psychiatric diagnoses is still poor. [2,3]”

Diagnostic Issues and Controversies in DSM-5: Return of the False Positives Problem.

“The fifth revision of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) was the most controversial in the manual’s history. This review selectively surveys some of the most important changes in DSM-5, including structural/organizational changes, modifications of diagnostic criteria, and newly introduced categories. It analyzes why these changes led to such heated controversies, which included objections to the revision’s process, its goals, and the content of altered criteria and new categories. The central focus is on disputes concerning the false positives problem of setting a valid boundary between disorder and normal variation. Finally, this review highlights key problems and issues that currently remain unresolved and need to be addressed in the future, including systematically identifying false positive weaknesses in criteria, distinguishing risk from disorder, including context in diagnostic criteria, clarifying how to handle fuzzy boundaries, and improving the guidelines for “other specified” diagnosis.”

Popular articles

The DSM-5 Controversy

“You will need to display fewer and fewer symptoms to get labeled with certain disorders, for example Attention Deficit Disorder and Generalized Anxiety Disorder. Children will have more and more mental disorder labels available to pin on them. These are clearly boons to the mental health industry but are they legitimate additions to the manual that mental health professionals use to diagnose their clients?”

DSM 5 Is Guide Not Bible—Ignore Its Ten Worst Changes

“This is the saddest moment in my 45 year career of studying, practicing, and teaching psychiatry. The Board of Trustees of the American Psychiatric Association has given its final approval to a deeply flawed DSM 5 containing many changes that seem clearly unsafe and scientifically unsound. My best advice to clinicians, to the press, and to the general public – be skeptical and don’t follow DSM 5 blindly down a road likely to lead to massive over-diagnosis and harmful over-medication. Just ignore the ten changes that make no sense.”

Normal or Not? New Psychiatric Manual Stirs Controversy

“Among the flashpoints: Asperger’s disorder will be folded into autism spectrum disorder; grief will no longer exempt someone from a diagnosis of depression; irritable children who throw frequent temper tantrums can be diagnosed with disruptive mood dysregulation disorder.

“One prominent critic has been Allen Frances, a professor emeritus of psychiatry at Duke University who chaired the DSM-IV task force.

“Frances charges that through a combination of new disorders and lowered thresholds, the DSM-5 is expanding the boundaries of psychiatry to encompass many whom he describes as the “worried well.””


What are the limits of neuroscience?

[My answer to a recent Quora question.]

There are two major problems with neuroscience:

  1. Weak philosophical foundations when dealing with mental concepts
  2. Questionable statistical analyses of experimental results

1. Neuroscience needs a bit of philosophy

Many neuroscientific results are presented without sufficiently nuanced  philosophical knowledge. This can lead to cartoonish and potentially harmful conceptions of the brain, and by extension, of human behavior, psychology, and culture. Concepts related to the mind are among the hardest to pin down, and yet some neuroscientists give the impression that there are no issues that require philosophical reflection.

Because of a certain disdain for philosophy (and sometimes even psychology!), some neuroscientists end up drawing inappropriate inferences from their research, or distorting the meaning of their results.

One particularly egregious example is the “double subject fallacy”, which was recently discussed in an important paper:

“Me & my brain”: exposing neuroscience’s closet dualism.

Here’s the abstract of the paper:

Our intuitive concept of the relations between brain and mind is increasingly challenged by the scientific world view. Yet, although few neuroscientists openly endorse Cartesian dualism, careful reading reveals dualistic intuitions in prominent neuroscientific texts. Here, we present the “double-subject fallacy”: treating the brain and the entire person as two independent subjects who can simultaneously occupy divergent psychological states and even have complex interactions with each other, as in “my brain knew before I did.” Although at first, such writing may appear like harmless, or even cute, shorthand, a closer look suggests that it can be seriously misleading. Surprisingly, this confused writing appears in various cognitive-neuroscience texts, from prominent peer-reviewed articles to books intended for lay audience. Far from being merely metaphorical or figurative, this type of writing demonstrates that dualistic intuitions are still deeply rooted in contemporary thought, affecting even the most rigorous practitioners of the neuroscientific method. We discuss the origins of such writing and its effects on the scientific arena as well as demonstrate its relevance to the debate on legal and moral responsibility.

[My answer to the earlier question raises related issues: What are the limits of neuroscience with respect to subjectivity, identity, self-reflection, and choice?]

2. Neuroscience needs higher data analysis standards

On a more practical level, neuroscience is besieged by problems related to bad statistics. The data in neuroscience (and all “complex system” science) are extremely noisy, so increasingly sophisticated statistical techniques are deployed to extract meaning from them. This sophistication means that fewer and fewer neuroscientists actually understand the math behind the statistical methods they employ. This can create a variety of problems, including incorrect inferences. Scientists looking for “sexy” results can use poorly understood methods to show ‘significant’ effects where there really is only a random fluke. (The more methods you use, the more chances you create for finding a random “statistically significant” effect. This kind of thing has been called “torturing the data until it confesses”.)
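The parenthetical point about trying many methods can be sketched numerically. The snippet below assumes a 5% significance threshold and, unrealistically, that the analyses are independent of one another (different analyses of the same data set are usually correlated, so treat the numbers as a rough guide). It relies on the fact that when there is no real effect, p-values are uniformly distributed between 0 and 1.

```python
import random

ALPHA = 0.05  # conventional significance threshold (an assumption here)

def familywise_error_rate(n_analyses, alpha=ALPHA):
    """Chance that at least one of n independent null analyses looks 'significant'."""
    return 1 - (1 - alpha) ** n_analyses

def simulated_rate(n_analyses, n_studies=20_000, alpha=ALPHA, seed=0):
    """Monte Carlo check: draw uniform p-values, as expected under a true null."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # A "study" yields a fluke if any of its analyses crosses the threshold.
        if any(rng.random() < alpha for _ in range(n_analyses)):
            hits += 1
    return hits / n_studies

for k in (1, 5, 20):
    print(k, round(familywise_error_rate(k), 3), round(simulated_rate(k), 3))
```

Under these assumptions, a researcher who tries 20 different analyses on pure noise has roughly a 64% chance of finding at least one “significant” result.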

Chance effects are unreproducible, and this is a major problem for many branches of science. Replication is central to good science, so when it frequently fails to occur, then we know there are problems with research and with how it is reviewed and published. Many times there is a “flash in the pan” at a laboratory that turns out to be fool’s gold.

See these articles for more:

Bad Stats Plague Neuroscience

Voodoo Correlations in Social Neuroscience

The Dangers of Double Dipping (Voodoo IV)

Erroneous analyses of interactions in neuroscience: a problem of significance.

Fixing Science, Not Just Psychology – Neuroskeptic

The Replication Problem in the Brain Sciences

Quora: What are the limits of neuroscience?

Brains, Boats & Baseball bats — some thoughts on fMRI

I wanted to write a post on a new fMRI paper that looks really interesting. But in attempting to do so I felt the need to condense some of my own cloudy thoughts on fMRI. Think of this as one part explanation, one part rant, and one part thinking aloud.

Functional magnetic resonance imaging (fMRI) has become a popular tool for human neuroscience and experimental psychology. But its popularity masks several major issues of interpretation that call into question many of the generalizations fMRI researchers would like to make. These generalizations often lead to overzealous and premature brain-centric redefinitions of high-level concepts such as thinking, emotion, love, and pleasure.

Sensationalist elements in the media run with these redefinitions, taking advantage of secular society’s respect for science in order to promote an unfounded reductionist attitude towards psychological and cultural phenomena. A reductionist attitude feeds off two often contradictory human desires: one for simple, intuitive explanations, and the other for controversial, novel solutions. This biased transfer of ideas from the laboratory to the newspaper, the blog and the self-help book is responsible for a rash of fallacious oversimplifications, such as the use of dopamine as a synonym for “pleasure”. (It correlates with not-so-pleasurable events too.)

Neuroscientific redefinitions of high-level phenomena, even when inspired by an accurate scientific picture, often fall prey to the “mereological fallacy”, i.e., mistaking a part of a phenomenon for the whole. Bits of muscle don’t dance, people do. And a free-floating brain doesn’t think or feel, a whole organism does.

But before dealing with the complex philosophical, sociological and semantic issues posed by neuroscience, we must be sure that we understand what the experimental techniques are actually telling us. fMRI experiments are usually interpreted as indicating which parts of the brain “light up” during a particular task, event, or subjective experience. For instance, a recent news headline informs us that “Speed-Dating Lights Up Key Brain Areas”. The intuitive simplicity of such statements masks a hornets’ nest of interpretation problems.

What does an fMRI picture represent? fMRI results only reach statistical significance if the studies are carried out in a group of at least 5-10 people, so the “lighting up” is reliable only after pooling data [But see the edit below]. And in addition to averaging the data from multiple subjects, usually the experimenter must also average multiple trials for each subject. So  the fMRI heat-map is an average of averages. Far from being a snapshot of the brain activity as it evolves through time, an fMRI heat-map is like a blurry composite photograph produced by superimposing several long-exposures of similar, non-identical things.

As if this were not problematic enough, the results of an fMRI scan of a single person must be subtracted from a baseline before further analysis. The brain is never quiescent, so to find out about activity in a particular region during a particular event or task, the experimenter must design a control that is identical to the task except with respect to the phenomenon of interest. Imagine a boat with three people on it, sailing on choppy seas. If an observer watching calmly from the shore wants to understand how each person is moving on the boat, she must first factor out the overall motion of the boat caused by the waves. Only then can the observer determine, say,  that one person on the boat is jumping up and down, the other is swaying from side to side, and the third is trying to be as still as possible. The subtractions used in fMRI studies are like the factoring out of the motion of the boat — they allow the experimenter to zoom in on the activities of interest and ignore the choppy sea that is the brain’s baseline activity.

Here’s another analogy that might help (me) understand what’s happening with fMRI averaging and subtracting. Let’s say you want to understand the technique of baseball batters that allows them to successfully hit a particular kind of pitch. You take videos of 20 batters hitting, say, a fastball. The fastball is the “task”, and each attempt to hit the ball is a “trial”. Suppose each batter makes 100 swings, and around 50 connect with the ball. The rest are strikes. So there are 50 hits and 50 misses. You take the videos for the 50 hits and average them, so you get a composite or superimposed video for each batter. Then you take the videos of the 50 strikes, and average those. This is the control. Now you “subtract” these two averaged videos, for each batter, getting a video that would presumably show a series of ghostly images of floating, morphing body parts — highlighting only what was different about the batter’s technique when he made contact with the ball versus when he didn’t.  In other words, if the batter’s torso moves in exactly the same way whether he hits or misses, then in the video the torso will be subtracted out, and only the head, arms and legs will be visible. Finally, you pool together the subtracted videos for all 20 batters and average them. Now you have a single video that shows the average difference in batting technique between successful hits and misses. If you’ve done everything right, you have some idea of which batting techniques tend on average to work against fastballs.
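The batting analogy can be written out as a toy simulation. Everything in it is invented for illustration: the two “body parts”, the movement amplitudes, and the noise level are assumptions, not measurements. The point is only to show the order of operations: average trials within each condition, subtract, then average across batters.

```python
import random
from statistics import mean

rng = random.Random(42)
N_BATTERS, N_TRIALS = 20, 50  # 50 hits and 50 misses per batter, as in the analogy

def swing(outcome):
    """One hypothetical swing: movement amplitude per body part (arbitrary units)."""
    return {
        # The torso moves a lot, but identically for hits and misses.
        "torso": 10.0 + rng.gauss(0, 1),
        # The arms move more on hits than on misses (the effect of interest).
        "arms": (6.0 if outcome == "hit" else 4.0) + rng.gauss(0, 1),
    }

def average(swings):
    """Average a list of swings, body part by body part."""
    return {part: mean(s[part] for s in swings) for part in swings[0]}

per_batter_contrasts = []
for _ in range(N_BATTERS):
    hits = average([swing("hit") for _ in range(N_TRIALS)])
    misses = average([swing("miss") for _ in range(N_TRIALS)])
    # Subtraction: keep only what differs between hits and misses for this batter.
    per_batter_contrasts.append({part: hits[part] - misses[part] for part in hits})

# Average of averages: pool the subtracted results across all batters.
group_contrast = average(per_batter_contrasts)
print({part: round(value, 2) for part, value in group_contrast.items()})
```

The torso ends up close to zero in the final result even though it moves more than the arms on every single swing, while the arms show a difference of about 2 units.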

But consider what may be misleading about this video. Perhaps there are two different techniques or strategies for hitting a fastball. The averaged video will only show a kind of midway point between them. Basically, individual differences can get blurred out by averaging. So sometimes the batting technique that seems to be suggested by the average doesn’t really exist: it can be an artifact of averaging, rather than a picture of an actual trend. Good statistical practices help experimenters avoid artifacts, but as the task and the stats become more complicated, the scope for misunderstanding and misuse expands. In other words, every mathematical angel is shadowed by its very own demon.
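A tiny numerical sketch of this artifact, using made-up numbers: suppose one technique variable (say, stance angle in arbitrary units) is summarized per batter, and the batters split evenly between two strategies.

```python
from statistics import mean

# Hypothetical per-batter values: ten batters use an "open" strategy near +30,
# ten use a "closed" strategy near -30.
open_stance = [28, 29, 30, 30, 31, 31, 32, 29, 30, 30]
closed_stance = [-28, -29, -30, -30, -31, -31, -32, -29, -30, -30]
all_batters = open_stance + closed_stance

group_average = mean(all_batters)
# How far is the closest real batter from the "average technique"?
nearest_gap = min(abs(x - group_average) for x in all_batters)

print(group_average)  # 0
print(nearest_gap)    # 28
```

The group average is 0, a stance that no batter in the sample comes anywhere near using. Looking at the distribution of individual values before averaging is one simple way to catch this.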

Another interpretation issue has to do with what the subtraction means. In the case of the missing torso, you can assert that the difference between success and failure at hitting a fastball does not depend on the torso’s movement, since it’s the same regardless of what the batter does. This does not mean, however, that the torso has nothing to do with batting. After all, we know the torso is what links up everything else and provides the crucial central services to the arms and legs. So if a brain region doesn’t light up in an fMRI study, this doesn’t mean that it has no role to play in the task being studied. It may in fact be as central to the task as the torso is!

But the problems associated with averaging and subtraction crop up in all forms of data analysis, so they’re among the inevitable hazards that go with experimental science. The central question that plagues fMRI interpretation is not mathematical or philosophical; it’s physiological. What neural phenomenon does fMRI measure, exactly? It seems a partial answer may have been found, which I’ll touch on, hopefully, in the next post.



  • William Uttal’s book The New Phrenology (which I have only read a chapter or so of) describes how the “localization of function” thread that runs through fMRI and other neuroscientific approaches may be a misguided return to that notorious Victorian pseudoscience. Here is a precis of the book.
  • This New Yorker piece deals with many of the issues with fMRI, and also links to related resources including a New York Times article.
  • Neuroscience may not be as misleading to the public as was originally thought. Adding fMRI pictures or neurobabble was thought to make people surrender their logical faculties, but a recent study suggests that the earlier one may have been flawed.
  • For a vivid illustration of the potential effects of averaging, check out this art project. The artist averaged every Playboy centerfold, grouping them by decade, producing one blurry image each for the 60s, 70s, 80s, and 90s. Don’t worry, it’s very SFW.
  • You can also play around with averaging faces here.


EDIT: In the comment section Kayle made some very important corrections and clarifications:

“You say that fMRI results only reach statistical significance if the studies are carried out in groups. This is not quite right. Almost all fMRI studies begin with a “first-level analysis,” which are single-subject statistics. This way you can contrast different conditions for a single subject. With large differences, small variability, and enough trials, robust maps can be created. This is done for surgical planning when doctors are considering how much brain they can resect surrounding a tumor without endangering someone’s ability to move or talk. When examining mean differences between groups, however, you need to examine results from multiple people (“second-level analysis”). Again, this is not specific to fMRI. The rule of thumb goes something like this: Most people are interested in being able to detect differences with effect sizes of about 1 SD or above. To do this with some confidence (Type II p < 0.20) you need about 10 to 30 observations per group.”


“Every fMRI result you’ve seen includes single-person single-session results that generally are not reported because most people aren’t interested.”
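Kayle's rule of thumb can be roughly checked with the standard normal-approximation formula for comparing two group means, n per group ≈ 2(z₁₋α/₂ + z₁₋β)² / d², where d is the effect size in SD units. The two-sided significance level of 0.05 is my assumption (the comment does not state one), and the approximation slightly underestimates what an exact t-test calculation would give.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size_sd, alpha=0.05, type_ii_error=0.20):
    """Approximate observations per group for a two-group mean comparison.

    Normal approximation; alpha is an assumed two-sided significance level.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(1 - type_ii_error)
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect_size_sd ** 2)

for d in (1.0, 0.8):
    print(d, n_per_group(d))
# 1.0 16
# 0.8 25
```

Both values fall inside the quoted range of 10 to 30 observations per group. Smaller effects need far more: halving the effect size quadruples the required sample.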