Show me the data, Jerry!!!!!!

Sep 03 2013 Published by under #FWDAOTI, Conduct of Science, Tribe of Science

Today's Twittsplosion was brought to you by @mbeisen:

[embedded tweet not preserved]

he then elaborated

[embedded tweet not preserved]

and

[embedded tweet not preserved]

and

[embedded tweet not preserved]
There was a great deal of distraction in there from YHN, MBE and the Twitterati. But these are the ones that get at the issue I was responding to. I think the last one here shows that I was basically correct about what he meant at the outset.

I also agree that it would be GREAT if all authors of papers had deposited all of their raw data, carefully annotated, commented and described (curated, in a word) with all of the things that I might eventually want to know. That would be kickass.

And I have had NUMEROUS frustrations that I cannot tell, even from the Methods sections, what was done, how the data were selected and groomed, etc., in many critical papers.

It isn't because I assume fraud but rather that I find that when it comes to behaving animals in laboratory studies that details matter. Unfortunately we all wish to overgeneralize from published reports....the authors want to imply they have reported a most universal TRUTH and other investigators wish to believe it so that they don't have to sweat the details.

This is never true in science, as much as we want to pretend.

Science is ever only a description of what has occurred under these specific conditions. Period. Including the ones we've bothered to describe in the Methods and those we have not bothered to describe. Including the conditions we don't even realize might have contributed.

Let us take our usual behavioral pharmacology model, the 10 m Hedgerow BunnyHopper assay. The gold standard, of course. And everyone knows it is trivial to speed up the BunnyHopping with a pretreatment of amphetamine.

However, we've learned over the years that the time of day matters.

Until...finally....in its dotage, the Dash Lab fesses up. The PI allows a trainee to publish the warts. And to compare the basic findings, done at nighttime in naive bunnies, with what you get during the dawn/dusk period. In Bunnies who have seen the Dash arena before. And maybe they are hungry for clover now. And they've had a whiff of fox without ever seeing the little blighters before.

And it turns out these minor methodological changes actually matter.

We also know that dose response curves can be individual for amphetamine and if the dose is too high the Bunny just stims (and gets eaten by the fox). Perhaps this dose threshold is not identical across individuals, so we're just going to chop off the highest dose because half of them were eaten after that dose. Wait...individuals? Why can't we show the individuals? Because maybe a quarter are speeded up by 4X and a quarter by 10X and now that there are these new genetic data on Bunny myocytes under stressors as diverse as....

So why do the new papers just report the effects of single doses of amphetamine in the context of this fancy transcranial activation of vector-delivered channelrhodopsin in motor cortex? Where are the training data? What time of day were they run? How many Bunnies were aced out of the study because the ReaChR expression was too low? I want to do a correlation, dammit! and a multivariate analysis that includes my favorite myocyte epigenetic markers! Say, how come these damn authors aren't required to bank genomic DNA from every damn animal they run just so I can ask for it and do a whole new analysis?

After all, the taxpayers paid for it!

I can go on, and on and on with arguments for what "raw" data need to be included in all BunnyHopping papers from now into eternity. Just so that I can perform my pet analyses of interest.

The time and cost and sheer effort involved is of no consequence because of course it is magically unicorn fairy free time that makes it happen. Also, there would never be any such thing as a protracted argument with people who simply prefer the BadgerDigger assay and have wanted to hate on BunnyHopping since the 70s. Naaah. One would never get bogged down in irrelevant stuff better suited for review articles by such a thing. Never would one have to re-describe why this was actually the normal distribution of individual Hopping speeds and deltas with amphetamine.

What is most important here is that all scientists focus on the part of their assays and data that I am interested in.

Just in case I read their paper and want to write another one from their data.

Without crediting them, of course. Any such requirement is, frankly my dear, gauche.

42 responses so far

  • Geeka says:

    For a lot of research there isn't a universal data format. Reporting raw data but not the exact experimental conditions is going to throw another wrench into the works.

    For example, if you report raw flow data, you can report which laser and which filters you used, but good luck if you really want to compare to something you did, because the age of the laser and those filters can be off by +/- nm. Not to mention which software you use, etc.

    Move on to getting numbers from images. We'd get tons of papers with a 'my analysis algorithm is better than yours' angle. I'm not sure that those incremental changes are worth it.

    I'm for data transparency, but I spent 5 years making my model system for my PhD. Sure, you can have the constructs and the protocols, but you don't get the time and results for nothing.

  • drugmonkey says:

    But see, Geeka, you should be forced to put all of your "raw data" in a public repository! Simple, right?

  • Dr Becca says:

    Despite MBE and DrDrA's strong voices to the contrary, all data are not the same. I can totally see that with respect to genetic raw data there's a lot that can be done with it, and the authors who generated it would be happy to dump it into a database somewhere and let the data miners go to town.

    But those kinds of endless possibilities aren't there in behavior (or, as superkash pointed out, ephys) studies. If someone were to take my Excel spreadsheets from a behavioral pharm experiment, they'd be getting one or two data points per animal. All they could really do is run a different set of stats, which, fine, maybe they pull something new out. But if they published those results without putting me or my grad students on the paper, that feels wrong - like they're taking credit for designing and running an experiment that they didn't actually do.

  • Ben says:

    Show me an example of someone asking you to collect data you haven't already collected, or reformat your data files before depositing them, or put forward *any* effort apart from depositing your data files in a public repository. Presumably you have data files in some format that you could deposit along with your published research. You can put them on Dryad in 5 minutes.

    If someone wants new data that you didn't collect, they can repeat your experiment and collect it themselves. If someone wants the data behind the specific results you've already published, though, I really don't see the problem or inconvenience.

  • drugmonkey says:

    Presumably you have data files in some format that you could deposit along with your published research.

    Sure, MedAssociates generates some really nice files. Which, one may or may not have annotated outside of the lab book. And which may or may not have a custom structure that requires a few macros to extract into a usable spreadsheet. All of which may or may not be designed mostly to accord with a specific statistical analysis package. And those filenames might be a tad cryptic for anyone outside of the lab to sort through.

    Oh, and maybe those Excel sheets that you use to generate graphs have a lot of other data in there that ISN'T going into this paper. Or details that you'd rather not get into the wrong hands, like the name of the tech running the study.

    I really don't see the problem or inconvenience.

    Your lack of imagination and/or experience outside of your specific datasets/assays/models doesn't make you right.

  • Ben says:

    If your argument for not sharing is "it's too hard because my lab's work isn't done in a reproducible way," I think the solution is pretty obvious!

    You just described a bunch of personal failings that make it harder to share your data. You don't do reproducible work. You don't use logical file naming or version control. You don't organize your data (everything for multiple papers lives in a single Excel spreadsheet). I would suggest making these things more of a priority and attending a workshop on reproducibility in research - it will not only benefit others by enabling you to share your data (and code), but you'll be able to go back to your own work in a couple months (when you may have forgotten your own ad hoc file naming conventions) without so much difficulty. Software Carpentry exists for this kind of training: http://www.software-carpentry.org

    I hope you, like many of us, can get your work into a state that enables you to share it without so much difficulty.

  • drugmonkey says:

    No, genius, it is that the way we do things is not set up for this novel interest of the OpenEverything wackaloon. There were no incentives. Would it be "better" in some abstract way? maybe. But it takes time and planning. If there is no reason or reward for doing so, it isn't going to get done.

    and in fact one needs to realize that if there isn't any practical *reason* to do obsessive curating, then doing so is a waste of taxpayer dollars.

    You don't do reproducible work.

    I don't think you have any idea what that phrase means.

    You just described a bunch of personal failings that make it harder to share your data.

    Do tell how *your* interests become *my* "failings"?

    when you may have forgotten your own ad hoc file naming conventions

    Perhaps if you didn't waste all your apparently limited brainpower on your awesome curation you would have some room in there to remember? Just a thought.

    I hope you, like many of us, can get your work into a state that enables you to share it without so much difficulty.

    Do tell what use of my raw data you require? Feel free to use the BunnyHopper example.....

  • zb says:

    For a starter, I want to be able to click on any mean reported in a paper and see the list of values used to calculate the mean. That's a simple one, and possible now in a way that was difficult in the days when data had to be shared on paper. Then, at the very least, I can see if the t-test you applied and the p value depended on two points in your data (and, yes, I know, really simplistic statistics, but still being used in some studies).

    Yes, some data is going to be more difficult to share, but we should be working towards protocols for sharing those data, and, yes, partially because the taxpayers have already paid for it.

    Oh, and the best reason is that the studies will be better if they are more shareable.
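A toy sketch of the kind of fragility zb describes above, with all numbers invented for illustration (this is nobody's actual data, and the Welch t statistic is computed by hand rather than with a stats package):

```python
# Invented example: a group difference that rests on two animals.
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's two-sample t statistic (unequal variances)."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / (va / len(a) + vb / len(b)) ** 0.5

control = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2]
treated = [10.4, 10.0, 10.2, 9.9, 14.5, 15.1]  # last two are high fliers

t_all = welch_t(treated, control)       # ~1.64 with the fliers included
t_trim = welch_t(treated[:4], control)  # ~0.56 once they are dropped

print(round(t_all, 2), round(t_trim, 2))
```

Neither comparison is significant here, but the apparent effect shrinks roughly threefold when two values are excluded, which is exactly the sort of thing a reader can only check when the underlying values are available.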

  • It is interesting that any time one of the OPEN SCIENZS!111!!11!1ELEBNTY!!1!1 enthusiasts gets personally inconvenienced by common scientific practice, it is a SYSTEMIC ETHICAL DISASTER!!!1111!!11!!ELEBNYT!!11!!1! that must be remedied by imposing additional burdens on other people.

  • jipkin says:

    "For a starter, I want to be able to click on any mean reported in a paper and see the list of values used to calculate the mean."

    This, this, a thousand times this.

    Clearly there's an arbitrary line to be drawn with how much data should/needs to be shown. But can we all agree to include what zb has written here? In the digital age, it can't be that hard for journals to require that all "means" be supported by the raw values used to calculate them. These can simply be included as tables in the supplementary online material.

  • You don't need to click on any shittio. Just use box plots.

  • jipkin says:

    It's not about the interface - you're right. I hate hate HATE bar graphs. complete bullshit format most of the time. Box-n-whiskers (with outliers individually shown) is almost always better, agreed. Better yet, a scatter plot with all the individual points jittered within a column and the box-n-whiskers overlaying those points. fuck aesthetics, show me the data!
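For what it's worth, the plot jipkin describes takes only a few lines in most plotting tools. A rough matplotlib sketch, with made-up group names and values:

```python
import random

import matplotlib
matplotlib.use("Agg")  # draw off-screen, no display needed
import matplotlib.pyplot as plt

random.seed(1)
groups = {
    "vehicle": [random.gauss(10, 1) for _ in range(12)],
    "amphetamine": [random.gauss(14, 2) for _ in range(12)],
}

fig, ax = plt.subplots()
for i, values in enumerate(groups.values(), start=1):
    # jitter each point horizontally within its column so none overlap
    x = [i + random.uniform(-0.15, 0.15) for _ in values]
    ax.scatter(x, values, alpha=0.6)
# overlay the box-and-whiskers summary at the same x positions (1, 2, ...)
ax.boxplot(list(groups.values()), showfliers=False)
ax.set_xticks([1, 2], labels=list(groups))
ax.set_ylabel("hop speed (arbitrary units)")
fig.savefig("hops.png")
```

With `showfliers=False` the outliers are not drawn twice: every point, outlier or not, already appears in the jittered scatter.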

  • drugmonkey says:

    Then, at the very least, I can see if the t-test you applied and the p value depended on two points in your data

    Why? For the most part if you are familiar with the kind of study that is being run, taking a look at the n, means, error bars and the stats tells you what you need to know.

    If that knowledge is not enough, then ....? what are you after, exactly?

    The single outlier thing rarely drives the stats IME. Maybe the mean diff but the error bar should help with that...

  • zb says:

    "If that knowledge is not enough, then ....? what are you after, exactly? "

    error checking? trust but verify? For example access to the XL file uncovered the data selection error in the Reinhart & Rogoff paper: http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04deaa6388b1/publication/566/. Innocent mistakes get made (let's assume innocence, because we don't need to assume malfeasance), and these errors are propagated when the error supports the biases of the authors. That's reason enough to push towards more open data whenever we can.

  • zb says:

    How often does that kind of error affect the conclusions? Well, do we have any idea at all?

  • jipkin says:

    DM, agree or disagree:

    for the vast majority of bar plots (or most mean vs mean comparisons) the aggregate cost to the scientists and journals of uploading the raw values and hosting them is far lower than the aggregate benefit of catching the rare (?) mistakes or misanalyses.

    The argument being that for most studies the raw data won't be of much help, but occasionally it will be of great help (error catching, other measurements of variance, additional tests that might be useful, and who knows what else).

  • DJMH says:

    If you're making your graphs in Excel, I don't want to see your data anyhow.

  • DrugMonkey says:

    Oh? And what is the problem with Excel?

  • SteveTodd says:

    "Oh? And what is the problem with Excel?"

    It's not open access? BOOM!

  • miko says:

    From the other side, there are plenty of people in labs I've worked in, doing the same experiments I am doing, whose data I would never in a million years bother to re-analyze or integrate into some downstream analysis because I think they are sucky experimentalists.

    All these computational folks who keep whining about everyone formatting their data to make it easier for them to do meta-analyses or their gee-whiz monte carlo non-linear booleo-bayesio regresso-max model they made while cranked on energy drinks or whatever the fuck they are doing on 80 different data sets from 30 different labs: they have the mother of all GIGO problems. Which is reason #193 to ignore them.

  • Jonathan says:

    "Oh? And what is the problem with Excel?"

    Graphpad does better pharmacology graphs.

  • DJMH says:

    Don't you have a side-bar quote from Isis along the lines of, "I am going to pretend that your graph is great, whereas in reality it makes me die a little on the inside"? That's all Excel graphs, ever.

  • drugmonkey says:

    That is not actually an answer, DJMH.

  • drugmonkey says:

    Graphpad does better pharmacology graphs.

    Ok, so there's one application out of the vast array of graphs that a scientist might need to make...

  • zb says:

    But complaining about making the data that goes into a mean (and, yes sometimes there are means, relatively simple ones, that represent a significant part of the data, and aren't even terribly crunched, except for the way the experiment was designed, which, presumably was in the paper), is just circling wagons style opposition.

    I agree that I'm not hugely sympathetic to the computational folks who think they're going to do Google style data analysis on behavioral data (and, any associated pharma, neuro, molecular, . . . . ) data. My issue with just linking to those files is the complexity of describing the behavior and how the imperfections of that complexity might come into play when someone starts crunching the data for a variable for which it was not designed (which can't be the mean reported in the paper). There's too high a likelihood of ascribing meaning to noise (which gets treated as noise in the original analysis).

    But refusing to release simple data, on the grounds of a slippery slope to being required to release complex data I think increases not decreases the probability that bad decisions will be made (especially when the self-interest argument -- which sounds like a whine -- that you won't get credit for the data sounds like the main reason).

  • drugmonkey says:

    especially when the self-interest argument -- which sounds like a whine -- that you won't get credit for the data

    You people are totally delusional.

  • dsks says:

    Excel's alright. The main problem with it is that it takes a bit of initial learnin' and setting up (macros etc) to make it do the things idiot-proof apps like Prism let you do at the click of a button. I've warmed to the former after seeing what some of my undergrads can do with it, and it has the advantage of being so ubiquitous that you don't have to shell out for multiple licenses. I still use Igor Pro for manuscript figures, but to be honest that's only to take advantage of pooling multiple figures into one file.

    As far as presentation is concerned, you can easily duplicate the graph styles of Prism and Igor in Excel, it's not hard at all.

  • The Other Dave says:

    Ha Ha. This whole discussion stems from a naive misunderstanding of science. Newsflash: Scientists don't just collect data and the conclusions are not self-evident. Science is a process for inferring general rules based on observation and experience. The 'raw data' (even assuming the 'raw data' are in any usable universal format, which they never are) by themselves are useless without the reasoning too. And reasoning doesn't fit into an Excel sheet. That's why scientists write papers, instead of merely uploading to PLoS Excel Sheets.

    And besides, anyone who is using Excel is a moron anyway. I mean, seriously.

  • dr24hours says:

    My career is not about your convenience. That simple.

  • Bashir says:

    I use Excel all the time. Does that opt me out of the whole Open Access thing? Well that was easy.

  • DJMH says:

    Excel still defaults to gray backgrounds, black grid lines, purple and light blue terrible symbols, and shadows. Of *course* you can fix these things, but any program whose defaults are that bad self-disqualifies as a graph making go-to.

    It's a perfectly fine spreadsheet program, though.

  • Juan Lopez says:

    How many times will a sharp-eyed, click-happy reader find an error or uncover an abuse of averaging? Probably few. Errors, cheats and tricks are usually deeper than that.
    I buy more into the argument that further analysis may be useful. But to think that people will be able to just understand a "raw" file is naive. By the time data are ready for averaging, they have often gone through a lot of processing.

    To disregard the effort needed to make it all available online, or the issues with credit, is naive.

  • dsks says:

    "And besides, anyone who is using Excel is a moron anyway. I mean, seriously."

    Oh, the irony of the above comment being so soon followed by the below comment...

    "Excel still defaults to gray backgrounds, black grid lines, purple and light blue terrible symbols, and shadows. Of *course* you can fix these things, but any program whose defaults are that bad self-disqualifies as a graph making go-to."

    Takes about five min to draft a template and save it as a default. After that, your all time fave graph stylez are but a few clicks away.

    Even a moron can do it apparently ;)

  • The Other Dave says:

    dsks: My opinion about Excel has nothing to do with its ease of use. Excel is like the stick that primates and little children pick up and love before they discover real tools.

  • The Other Dave says:

    Please note, when reading the above, that I recognize that little children are primates.

  • Nat says:

    On the aesthetics of data, I'm with DJMH. Regardless of the ease of changing the Excel template, most people don't.

    So the dictum:

    "If you're making your graphs in Excel, I don't want to see your data anyhow."

    is going to be true in the majority of cases.

    As far as uploading and curating the raw data files, I'm with DM. Besides, if you have a question that might be answered by the raw data, why not email the authors? Did Eisen even try that, or could he just not be bothered?

  • dsks says:

    "On the aesthetics of data, I'm with DJMH. Regardless of the ease of changing the Excel template, the most people don't."

    I've been using Igor Pro and Illustrator for ten years, and I probably will do so for another ten, but my last undergrad put a poster together for a summer research conference with Excel and ppt and her graphs looked exactly the same as mine. It was a timely epiphany for me because I was about to shell out for some Prism licenses just to avoid the time-consuming process of training undergrads to build figures in Igor.

    Let's be honest... part of this is that we just feel dirty using something as plebeian as an MS Office app for doing Teh H@rd Scienze :)

  • MolecularGeek says:

    Graphs from Excel are ugly by default, but the bigger problem is that Microsoft has an abysmal track record of using broken implementations of statistical methods (for that matter, anything requiring greater precision than most financial calculations). Consider this paper:
    http://www.sciencedirect.com/science/article/pii/S0167947308001606
    Yes, it doesn't address Office 2010, but that isn't particularly a problem, given that they reference problems identified with statistical functions in 1994 that still hadn't been addressed by 2008. Microsoft fundamentally doesn't care enough about the science/technical/engineering market to make sure that the functions they provide are accurate. Yes, Excel is a convenient data-entry tool, but for any sort of serious analysis it has enough shortcomings that relying upon it hurts one's credibility. Use Prism, Igor, SAS, SPSS, or Minitab; if you don't have money for a license, R is free. Yes, it's more likely that someone already knows the basics of Excel, but that's like saying you should use a steak knife for dissections instead of a scalpel and probe because everyone already knows how to use the knife. The right tool for the job, please.
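One well-documented class of failure behind papers like the one MolecularGeek cites is the one-pass "textbook" variance formula, which cancels catastrophically when values share a large offset. A sketch with invented numbers (this is the algorithm class at issue, not Excel's actual code):

```python
def naive_variance(xs):
    """One-pass formula E[x^2] - (E[x])^2: fast, but numerically fragile."""
    n = len(xs)
    s = sum(xs)
    sq = sum(x * x for x in xs)
    return (sq - s * s / n) / (n - 1)

def two_pass_variance(xs):
    """Subtract the mean first, then sum squared deviations: stable."""
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

# Four readings riding on a large offset; the true sample variance is 30.
data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]

print(two_pass_variance(data))  # 30.0
print(naive_variance(data))     # nowhere near 30: the subtraction has
                                # destroyed all the significant digits
```

The subtraction of two nearly equal ~1e18 quantities is what wipes out the answer; the two-pass form never forms those huge intermediates, which is why serious statistics packages avoid the one-pass formula.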

  • tim says:

    i'm no anti-excel zealot but it's not a very expressive tool. most of the graphs that are easy to produce do not portray data in very interesting or information-rich ways, and most of the graphs that are produced with excel are the graphs that are easy to produce with excel. some very common and very useful plots, like box+dot plots or histograms, require black magic. if excel is the only tool someone ever learns (which is not at all uncommon), they're forced to solve all of their data visualization problems with a very small toolbox, and that constrains the way they think about the possibilities in their data. call it an extension of the sapir-whorf hypothesis. in my work, i find that the most useful charts tend to be easier to produce with tools other than excel.

    but, hell, i'd be happy if people were even using excel effectively. we need to bake pivottables into the high school curriculum. maybe undergrad stats.

  • Juan Lopez says:

    tim: you make a good point about the possibilities of tools other than Excel.

    The snobbery against Excel would be funny if it weren't a real problem. R is awesome indeed, but surely it's not as easy to hack some jobs together in R as it is in Excel.

    An advantage of Excel is that pretty much everyone has access to it and knows how to use it. We make worksheets and send them to collaborators for them to work with, and they love it. I am not worried that Microsoft may have miscoded something, since we use other tools for the calculations or code it ourselves. Telling a collaborator that they need to learn a new package, or that they must pay a license "because Excel graphics are ugly and I don't want to change the default" doesn't go far. If the project is more complicated than can be done well with Excel, we send them R modules, or custom programs written in C or java. In those cases, they do have to learn something new.

    It is not just about using "the right tool for the job", but the simplest tool that will get the job done right. I have seen too many people misuse SAS, Matlab and R (and C, Java and machine language) to think that power is all that matters. I have also seen beautiful graphics that were wrong.

    It's not the tool, but what you make with it.

  • Isabel says:

    "All these computational folks who keep whining about everyone formatting their data to make it easier for them to do meta-analyses or their gee-whiz monte carlo non-linear booleo-bayesio regresso-max model they made while cranked on energy drinks or whatever the fuck they are doing on 80 different data sets from 30 different labs: they have the mother of all GIGO problems. Which is reason #193 to ignore them."

    Haha, great comment Miko.

    I think if we are all going to be sharing our data, there should be a rule that everyone and every lab that uses it has to produce and contribute original data also. It isn't so much the "not getting credit" that puts me off about this shiny new field; it's that some of those folks have placed themselves above those who produce. It's a lot of work and takes up a LOT of time to grow organisms, to do field work, lab work etc. So it's an advantage to not have to do all this work. Those who do all this work are even referred to as "stamp collectors", since the real language of science is math, of course.
