I enjoyed this short article by Andrew Gelman on the blog he keeps with several of his colleagues. The main point, which I agree with 100%, is that the examples in our stats and data analysis textbooks never seem to match what we see in the real world. The examples always seem to work! The data is clean and perfectly manicured for the analysis. I get it! The idea is to convey how different analytical approaches work. The rub is that once you get to the real world and look at your own data, you end up saying, “Uh. Wait… what is this? What do I do now?!”
The post got me thinking about something else, though. Something that really frustrates me. If the examples we see in textbooks don’t reflect the data problems we face in the real world, what about the examples we read about in applied sport science research? How closely do those examples reflect what we see in the real world?
At the risk of upsetting some colleagues, I’ll go ahead and say it:
I’m not convinced that the research we read in publications completely represents the real world either!
How can that be? This is applied science! Isn’t the real world THE research?
Well, yes and no. Yes, the data was collected in the real world, with real athletes, and in a sport setting. But often, reading a paper, looking at the aim and the conclusions, and then parsing through the methods section to see how they handled the data leaves me scratching my head.
My entire day revolves around looking at data, and I can tell you: the real world is very messy.
- How things get collected
- How things get saved
- How things get loaded
- How things get logged
There are potential hiccups all along the way, no matter how stringent you are about maintaining sound data collection practices! (A quick audit along these lines is sketched below.)
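To make that concrete, here is a minimal sketch of the kind of sanity checks I mean before trusting anything downstream. The file name and column names (athlete_id, session_date, total_distance) are made up for illustration, not a real dataset.

```python
import pandas as pd

# Hypothetical export of daily training loads; the file and column
# names here are placeholders, not a real dataset.
df = pd.read_csv("training_loads.csv", parse_dates=["session_date"])

# Missing values per column
print(df.isna().sum())

# Duplicate athlete/day logs (e.g., a session that got saved twice)
print(df.duplicated(subset=["athlete_id", "session_date"]).sum())

# Impossible or suspicious readings from technology flub-ups
print((df["total_distance"] < 0).sum())
print(df["total_distance"].describe())
```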
In research, the problem is that the data often gets touched up in some manner to create an analysis clean enough for publication. Missing values need to be handled a certain way (sometimes those rows get dropped, sometimes values get imputed), class imbalance can be an issue, errant values and outliers from technology flub-ups are a very real thing, data entry issues arise, etc. These things are all problematic and, if not identified prior to analysis, can seriously compromise the findings. I do believe that most people recognize these problems and would agree with me that they are very real issues. However, it is less about knowing that the problems exist and more about figuring out what to do about them.

Often the methods section glosses over these details (I get it, word counts for journals can be a pain in the butt) and simply produces a result that, on paper at least, seems overly optimistic. As I read through the results section, without any details about the data processing, I frequently say to myself, “No way. There is no way this effect is as real as they report. I can’t reproduce this without knowing how they cleaned up their data to observe this effect.”
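To show what I mean, here is a toy example with made-up numbers (not anyone’s real data): the same dataset, run through two perfectly defensible cleaning pipelines, produces noticeably different “effects.” Unless the paper tells me which path was taken, I have no way to judge the result.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a messy real-world export: a pre/post measure
# with some missing values and one data entry glitch (all values invented).
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "pre":  rng.normal(30, 4, 50),
    "post": rng.normal(31, 4, 50),
})
df.loc[df.index % 7 == 0, "post"] = np.nan   # sprinkle in missing values
df.loc[3, "post"] = 310                      # a decimal-place entry error

# Choice 1: drop incomplete rows, keep the errant value
d1 = df.dropna()

# Choice 2: flag the errant value as missing, then mean-impute
d2 = df.copy()
d2.loc[d2["post"] > 100, "post"] = np.nan
d2["post"] = d2["post"].fillna(d2["post"].mean())

print("drop rows, keep outlier:", round((d1["post"] - d1["pre"]).mean(), 2))
print("impute, trim outlier:  ", round((d2["post"] - d2["pre"]).mean(), 2))
```

Neither choice is “wrong” on its face; the point is that the reported effect depends on the choice, and that choice is exactly the detail that tends to get glossed over.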
Maybe someone should publish a paper about how crappy their data is in the applied setting? Maybe we should just be more transparent about the data cleaning processes we go through, so that we aren’t overly bullish on our findings and are more realistic about the things we can say with confidence in the applied setting?
Does anyone else feel this way? Maybe I’m wrong and being pessimistic, and this isn’t as big an issue as I believe it to be? Is the data we see in publications truly representative of the real world?