I should have put that in quotes because it actually appears in the title of this new paper published in Neuropsychopharmacology:
Hart AB, de Wit H, Palmer AA. Candidate gene studies of a promising intermediate phenotype: failure to replicate. Neuropsychopharmacology. 2013 Apr;38(5):802-16. doi: 10.1038/npp.2012.245. Epub 2012 Dec 3. [PubMed]
We previously conducted a series of 12 candidate gene analyses of acute subjective and physiological responses to amphetamine in 99-162 healthy human volunteers (ADORA2A, SLC6A3, BDNF, SLC6A4, CSNK1E, SLC6A2, DRD2, FAAH, COMT, OPRM1). Here, we report our attempt to replicate these findings in over 200 additional participants ascertained using identical methodology. We were unable to replicate any of our previous findings.
The team, with de Wit's lab expert on the human phenotyping and drug-response side and Palmer's lab expert on the genetics, has been after genetic differences that mediate differential response to amphetamine for some time. The overall program, which has a human end and a mouse end, has been fairly prolific.
In terms of human results, they have previously reported effects as varied as:
-association of an adenosine receptor gene polymorphism with degree of anxiety in response to amphetamine
-association of a dopamine transporter gene promoter polymorphism with feeling the drug effect and diastolic blood pressure
-association of casein-kinase I epsilon gene polymorphisms with feeling the drug effect
-association of fatty acid amide hydrolase (FAAH) with Arousal and Fatigue responses to amphetamine
-association of mu 1 opioid receptor gene polymorphisms with Amphetamine scale subjective report in response to amphetamine
There were a dozen in total and for the most part the replication attempt with a new group of subjects failed to confirm the prior observation. The Discussion is almost plaintive at the start:
This study is striking because we were attempting to replicate apparently robust findings related to well-studied candidate genes. We used a relatively large number of new participants for the replication, and their data were collected and analyzed using identical procedures. Thus, our study did not suffer from the heterogeneity in phenotyping procedures implicated in previous failures to replicate other candidate gene studies (Ho et al, 2010; Mathieson et al, 2012). The failure of our associations to replicate suggests that most or all of our original results were false positives.
The authors then go on to discuss a number of obvious issues that may have led to the prior "false positives".
-variation in the ethnic makeup of the various samples: one reanalysis using ancestry as a covariate didn't change their prior results.
-power in genome-wide association studies is low because the effect sizes (the contribution to variance) of rare alleles are small. They point out that candidate gene studies continue to report large effect sizes that are very unlikely in the broad scheme of things...and therefore comparatively likely to be false positives.
-multiple comparisons. They point out that not even all of their prior papers applied multiple-comparisons corrections against the inflation of alpha (the false positive rate, in essence), and they certainly did no such thing across the 12 findings that were reported in a number of independent publications. As they note, the adjusted p value threshold for the "322 primary tests performed in this study" (i.e., the same number included in the several papers which they were trying to replicate) would be 0.00015.
-publication bias. This discussion covers the usual (ignoring all the negative outcomes), but the interesting thing is the confession about something many of us (yes, me) do that isn't really addressed by the formal correction procedures for multiple comparisons.
Similarly, we sometimes considered several alternative methods for calculating phenotypes (eg, peak change score summarization vs area under the curve, which tend to be highly but incompletely correlated). It seems very likely that the candidate gene literature frequently reflects this sort of publication bias, which represents a special case of uncorrected multiple testing.
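To make the multiple-testing arithmetic concrete, here is a minimal sketch in Python. It rests on several assumptions of mine, not the authors': that the quoted 0.00015 is a Bonferroni-style threshold at a per-test alpha of 0.05, that the tests can be treated as independent for the family-wise calculation (they surely are not, since many phenotypes are correlated), and that picking between two phenotype summaries can be caricatured as picking between two correlated z-statistics. The function names and the correlation of 0.8 are purely illustrative.

```python
import random

ALPHA = 0.05    # conventional per-test significance level (assumed, not stated in the post)
N_TESTS = 322   # "322 primary tests performed in this study"


def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that holds the family-wise error rate at alpha."""
    return alpha / n_tests


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive if every null is true.

    Assumes the tests are independent, which is a simplification here.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def flexible_phenotype_false_positive_rate(rho, n_trials=100_000,
                                           z_crit=1.96, seed=0):
    """Simulate reporting whichever of two correlated summaries 'works'.

    Under the null, z_peak and z_auc are standard normal with correlation
    rho (stand-ins for peak change score vs area under the curve). A trial
    counts as a false positive if either one exceeds the two-sided cutoff.
    """
    rng = random.Random(seed)
    noise_scale = (1.0 - rho ** 2) ** 0.5
    hits = 0
    for _ in range(n_trials):
        z_peak = rng.gauss(0.0, 1.0)
        z_auc = rho * z_peak + noise_scale * rng.gauss(0.0, 1.0)
        if abs(z_peak) > z_crit or abs(z_auc) > z_crit:
            hits += 1
    return hits / n_trials


threshold = bonferroni_threshold(ALPHA, N_TESTS)
fwer = familywise_error_rate(ALPHA, N_TESTS)
expected_false_positives = ALPHA * N_TESTS
inflated_rate = flexible_phenotype_false_positive_rate(rho=0.8)

print(f"Bonferroni threshold:            {threshold:.6f}")
print(f"Expected false positives at .05: {expected_false_positives:.1f}")
print(f"Family-wise error rate:          {fwer:.8f}")
print(f"False-positive rate, best of 2:  {inflated_rate:.3f}")
```

The threshold works out to about 0.000155, in line with the 0.00015 quoted above, and at an uncorrected 0.05 you would expect roughly 16 of 322 true-null tests to come up "significant" by chance. The simulation makes the quoted point about alternative phenotype calculations: even with two highly correlated summaries, reporting whichever one crosses the line pushes the nominal 5% false-positive rate noticeably higher.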
This is a fascinating read. The authors make no bones about the fact that they've found that no fewer than 12 findings that they have published were false positives. Not wrong...not fraudulent. Let us be clear. We must assume they were published with peer review, analysis techniques and sample sizes that were (and are?) standard for the field.
But they are not true.
The authors offer up solutions of larger sample sizes, better corrections for multiple comparisons and a need for replication. Of these, the last one seems the best and most likely solution. Like it or not, research funding is limited and there will always be a sliding scale. At first we have pilot experiments or even anecdotal observations to put us on the track. We do one study, limited by the available resources. Positive outcomes justify throwing more resources at the question. Interesting findings can stimulate other labs to join the party. Over time, the essential features of the original observation or finding are either confirmed or consigned to the bin of "likely false alarm".
This is how science progresses. So while we can use experiences like this to define what is a target sample size and scope for a real experiment, I'm not sure that we can ever overcome the problems of publication bias and cherry picking results from amongst multiple analyses of a dataset. At first, anyway. The way to overcome it is for the lab or field to hold a result in mind as tentatively true and then proceed to replicate it in different ways.
UPDATE: I originally forgot to put in my standard disclaimer that I'm professionally acquainted with one or more of the authors of this work.
Hart, A., de Wit, H., & Palmer, A. (2012). Candidate Gene Studies of a Promising Intermediate Phenotype: Failure to Replicate. Neuropsychopharmacology, 38(5), 802-816. DOI: 10.1038/npp.2012.245