In the modern era a new breed of scientist has emerged on the scene: the data analysis specialist. Similarly, there is a new breed of journalist, the “data journalist”, with my favourite being Chad Skelton, formerly of the Vancouver Sun. These practitioners are skilled in mining existing data sources and in the use of statistical, data analysis and presentation tools. They have filled an important skill gap in the research (and journalism) community and provide value for most teams dealing with large volumes of data. That being said, I have one serious concern about these data analysts: it is terribly easy to become fascinated with the available computational power, and when used absent underlying theory these tools can easily be misused to generate erroneous or spurious conclusions.
As I have written previously, my graduate work involved developing systems to allow data collected by governmental scientists to be evaluated for its reliability and quality; stored in information systems; and made available for subsequent re-use by other researchers. I carried out my research in an era before the wide availability of the computer statistics programs in use today. As a consequence, all my statistical calculations were done with a calculator and statistical look-up tables. The difficulty of completing these analyses taught us some very important lessons, the most important being that you should be very sure of what you are testing before you start the test, and that your test should have acceptance criteria determined before the analysis gets underway.
These lessons, however, appear to be getting lost in our modern era. In an era when a Mann-Kendall calculation took over an hour to complete, you didn’t just run a dozen analyses to see what would happen. Instead, you identified your hypothesis and your alternative hypothesis based on an understanding (or hypothesis) of the theoretical underpinning of your topic, then assembled your data accordingly. Now that the same analyses can be conducted in milliseconds, analysts, it is joked, will keep running tests until they find a result they like. This practice was playfully ridiculed in a very clever xkcd comic on the topic.
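To see why “running tests until you find one you like” is so dangerous, here is a quick illustration of my own (the numbers are simulated, not drawn from any real study): if you run enough tests on pure noise, the standard p < 0.05 threshold guarantees that some of them will come back “significant” by chance alone.

```python
import random

# Under the null hypothesis a p-value is uniformly distributed on [0, 1],
# so a test run on data containing no real effect "succeeds" at p < 0.05
# about 1 time in 20. Simulate a thousand such null tests and count them.
random.seed(42)

def null_p_value():
    # Stand-in for any statistical test run on data with no real effect.
    return random.random()

n_tests = 1000
false_positives = sum(null_p_value() < 0.05 for _ in range(n_tests))
rate = false_positives / n_tests
print(f"{false_positives} of {n_tests} null tests were 'significant' (~{rate:.0%})")
```

Run a dozen analyses on the same dataset and the odds are good that at least one will cross the threshold; that is the whole joke in the jelly-bean comic.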
Another important lesson apparently lost is the recognition of what a p-value actually represents. As described in this article in Nature, when Ronald Fisher introduced the p-value in the 1920s he did not mean for it to be a definitive test but rather one tool in the researcher’s tool belt. Nowadays there is an entire edifice in science built on the importance of achieving a p-value less than 0.05 (see another xkcd comic which makes fun of that idea). The problem is that a low p-value is not proof of anything. A p-value simply provides the probability, assuming the null hypothesis is true, of getting results at least as extreme as the ones you observed. A really clear write-up on the topic is provided in this link. Unfortunately, even practicing scientists have a really hard time explaining what a p-value represents. So why is a p-value not always a useful tool? Well, the answer has to do with what it is trying to prove and what it cannot prove.
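To make that definition concrete, here is a little simulation of my own (a sketch, not anything from the articles linked above). Suppose you flip a coin 100 times and see 60 heads; the two-sided p-value is simply the fraction of fair-coin experiments that come out at least that lopsided.

```python
import random

# The p-value, computed the brute-force way: assume the null hypothesis
# (a fair coin) is true, repeat the experiment many times, and count how
# often the result is at least as extreme as the one observed.
random.seed(0)

observed_heads = 60
n_flips, n_sims = 100, 20_000

extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    # Two-sided: 60-or-more heads OR 60-or-more tails both count.
    if heads >= observed_heads or heads <= n_flips - observed_heads:
        extreme += 1

p_value = extreme / n_sims
print(f"Estimated two-sided p-value: {p_value:.3f}")
```

Note what this number does and does not say: it tells you how surprising 60 heads would be *if* the coin were fair. It says nothing about whether the coin really is biased.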
For readers not familiar with the language, science historically identified two major potential types of errors in hypothesis testing. A Type I error (a false positive) involves claiming that a hypothesis is correct when in reality it is false. A Type II error (a false negative) involves claiming that a hypothesis is incorrect when it is actually correct. A p-value helps you establish the likelihood of a Type I error only. It does nothing to help avoid Type II errors and carries absolutely no information about whether your results are “right” or “wrong”. Remember, in science all results are right since they represent observations. It is just that some observations help support a hypothesis while others help disprove it.
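Both error rates can be estimated by simulation. The sketch below is my own invented illustration (a toy coin-flipping test, not anything from the analyses discussed in this post): a naive test that declares a coin biased whenever it shows 60 or more heads in 100 flips.

```python
import random

# A toy decision rule: reject "the coin is fair" if >= 60 heads in 100 flips.
random.seed(1)

def n_heads(p, flips=100):
    # Number of heads from flipping a coin with heads-probability p.
    return sum(random.random() < p for _ in range(flips))

n_sims = 5000
threshold = 60

# Type I error: the coin IS fair, but the test rejects fairness anyway.
type1 = sum(n_heads(0.5) >= threshold for _ in range(n_sims)) / n_sims

# Type II error: the coin IS biased (p = 0.6), but the test fails to say so.
type2 = sum(n_heads(0.6) < threshold for _ in range(n_sims)) / n_sims

print(f"Type I rate ~ {type1:.3f}, Type II rate ~ {type2:.3f}")
```

The asymmetry is the point: this rule keeps false positives rare while missing a genuinely biased coin a large fraction of the time, and the p-value only ever speaks to the first of those two numbers.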
Recently the science has advanced to recognize at least two more types of error: Type III (wrong model, right answer) and Type IV (right model, wrong answer). Both these types of errors build on how we understand the process knowledge that was involved in collecting the data used to test our hypotheses. Process knowledge is an understanding of how a process/ecological system works, and might also be described as the conceptual model underlying your data collection scheme. If you collect data absent a model or hypothesis then you risk committing a Type III error (also known as an error of the third kind) in that you may actually obtain the right answer to the wrong question. This happens all the time in science and is so common that there is a great web site dedicated to the more entertaining examples of these spurious correlations.
So I’ve talked a lot of theory, but what about that cautionary tale I promised in my blog title? Well this week we have all been talking about Leonardo DiCaprio and his comment about Chinooks. In response to my last blog post on the topic, I was directed by an activist to a web site where a young data analyst had done an assessment of Chinook occurrences in Calgary. This analysis provides an excellent example of how a data analysis can look really good on paper but can subsequently risk coming to conclusions unsupported by the data used in the analysis.
In this analysis the young scientist described pulling weather data from the Calgary International airport and creating a simple assessment tool to identify Chinooks. He then counted how many of his identified Chinooks happened each year from 1907 through 2015 and then he did some data crunching. Now I am simplifying here since he didn’t really do that exactly. What he did was to take the data and put it into R (a free software environment for statistical computing and graphics) where he then applied a bunch of statistical tests. As described at his blog, all the data is available for download except that it isn’t. You see, the critical data (a simple table that expresses how many Chinooks were observed in each year under each scenario) was never actually generated. I’m really not sure what to think about this because I am an old-school type and that table represents the entire meat of the argument. That being said this seems to be a common practice these days which is why, being an old-school type, I simply printed up his graphic and reconstructed his data tables by hand in order to do some old-school analyses on the data.
You might ask why I went through this effort. Well, the answer is simple: I looked at his outputs, in the form of his graphs, and the reported results did not appear consistent with the data presented in them. Specifically, in his second figure (T Diff = 7 C, Daily Max T = 2 C) his 1907 dataset is presented with a 95% confidence interval of approximately 9.5-12 Chinooks a year. In 2015 he has a 95% confidence interval of about 10-13 Chinooks a year. In between there are two observable trends: a distinct decrease continuing until about 1960 (about 8-11 Chinooks a year) followed by an equally distinct increase from 1960 through to 2012. The problem is that his analysis claims to cover the entire span from 1907 to 2012, and the 1907 and 2012 confidence intervals overlap significantly. In his conclusion he identified a trend at p < 0.01, which didn’t seem right to me.
Running the numbers myself, I observed that his first assumption (T Diff = 5 C, Daily Max T = 5 C) did indeed show a significant increase in occurrences of Chinooks (using his definition of a Chinook) between 1907 and 2012, but the same was not true for the T Diff = 7 C, Daily Max T = 2 C dataset, which did not show the same significance (p > 0.1). What really confused me was that during his analysis he broke the dataset into smaller chunks, checking to see whether each was significant (he showed that smaller chunks had significant trends in them as well).
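For readers who want to replicate this kind of check, the Mann-Kendall test I mentioned earlier is simple enough to code straight from its textbook definition. The sketch below is mine, and the annual counts in it are invented for illustration; they are not the Chinook data.

```python
import math

def mann_kendall(x):
    """Minimal Mann-Kendall trend test (no correction for ties)."""
    n = len(x)
    # S counts pairs that increase over time minus pairs that decrease.
    s = sum((x[j] > x[i]) - (x[j] < x[i])
            for i in range(n) for j in range(i + 1, n))
    # Normal approximation for the variance of S.
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    # Two-sided p-value: erfc(|z|/sqrt(2)) equals 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p

# Hypothetical annual counts with a mild upward drift (NOT the real data):
counts = [8, 9, 7, 10, 9, 11, 10, 12, 11, 13, 12, 14]
s, z, p = mann_kendall(counts)
print(f"S = {s}, z = {z:.2f}, p = {p:.4f}")
```

A real analysis would add the tie correction and, for annual counts like these, would also need to worry about autocorrelation; the point here is only that the test itself is no longer the hour-long obstacle it was in my day.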
The problem with this approach is that there was no scientific basis for the chunking he chose. He looked at the occurrences by decade. Now I cannot think of a single non-human process that depends on a ten-year cycle starting at years ending with a zero in the Western dating system. Nature just doesn’t work that way, but I see this sort of thing all the time with folks who are playing with data: they superimpose human constructs on non-human systems. Realistically, this is mostly a quibble, but it raises more questions. Ultimately, had he stopped at this point, irrespective of the chunking, I would have praised him for a well-conducted analysis. He showed an increasing trend over the century-plus timescale. Unfortunately, he didn’t stop there.
Now let’s remember, in his blog he indicated that he was trying to demonstrate that Chinooks have increased, but more importantly, his blog indicates that he was trying to provide statistical support for Leo DiCaprio’s suggestion that climate change formed the basis for this increase. The problem lies with the fact that, as is clear from the analyses conducted to date, none of the analyses incorporated climate data. As he described it, his null hypothesis was:
He has shown, depending on your definition of a Chinook, that Chinooks have either increased by about 4 a year (T Diff = 5 C, Daily Max T = 5 C assumptions) or by about 1 a year (T Diff = 7 C, Daily Max T = 2 C assumptions) over the course of the period examined. However, he did no analyses to link this result to climate change. Certainly he hypothesized a reasonable mechanism by which climate change may play a role in Chinook occurrence, but he did not then do the obvious analyses to test his hypothesis, and even a cursory look at the data suggests that this hypothesis is unlikely to hold up.
As is well recognized in the climate community, the first real burst of observed climate change happened from the 1880s through the 1940s. This was followed by a reduction in the rate of increase in the 1960s and then a solid upward movement in the 1970s. There was a subsequent reduction in the rate of increase over the last 15-18 years, with the current year looking to bust that trend and possibly return us to a higher rate of increase. Looking at this trend, the question that must be asked is: where is the commensurate variation in the Chinook data? The answer is that it is not there. When temperatures were spiking in the 1920s and 30s, the Chinook occurrence rate was plummeting.
Our data analyst’s working theory was that the Chinooks were likely driven by Pacific SSTs [Pacific sea surface temperatures], and that looks like a realistic explanation for the period after 1970 but makes no sense for the years from 1907 through 1950. From 1910 through 1950 (when the sea surface temperature increase spiked) the Chinook occurrence shows a strong downward trend. The start of the short-term decrease in sea surface anomalies, meanwhile, is associated with the beginning of the increase in Chinooks in this assessment.
So what happened in this analysis? A very interesting post about Chinook occurrences appears to have gone sideways when the analyst appears to have forgotten what he was testing in the first place. To be clear, for all I know the two processes are indeed linked. The problem is that in this case the analysis for one hypothesis was incorrectly used to justify a totally different hypothesis. Unfortunately the degree of confidence in the first trend did not carry through to the second. He got the right answer to the wrong question.
The thing for young data analysts to learn from this example is not to attempt to stretch the significance of your results unless you have the data, and the underlying theory, to support that stretch. Feel free to massage and play with your data if you must, but restrict your conclusions to the hypotheses tested. You don’t want to be the person claiming that US spending on science, space and technology correlates with suicides by hanging, strangulation and suffocation, even with a correlation of 99.79% (r = 0.99789).
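The mechanics behind those spurious correlations are easy to demonstrate. In this sketch of my own (both series are invented noise riding on an upward drift, standing in for any two unrelated quantities that happen to grow over time), two completely independent series correlate almost perfectly simply because both trend with time; difference the data to strip out the trend and the correlation all but vanishes.

```python
import math
import random
import statistics

random.seed(7)

# Two independent series that share nothing except an upward drift in time.
years = list(range(30))
series_a = [2.0 * t + random.gauss(0, 3) for t in years]
series_b = [1.5 * t + random.gauss(0, 3) for t in years]

def pearson(xs, ys):
    # Pearson correlation, computed by hand from its definition.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r_raw = pearson(series_a, series_b)

# Year-over-year differences remove the shared time trend; what is left
# is the independent noise, which should barely correlate at all.
diff_a = [b - a for a, b in zip(series_a, series_a[1:])]
diff_b = [b - a for a, b in zip(series_b, series_b[1:])]
r_detrended = pearson(diff_a, diff_b)

print(f"raw r = {r_raw:.3f}, detrended r = {r_detrended:.3f}")
```

A sky-high r between two things that both happen to grow with time is evidence of a shared clock, not a shared cause.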
As I mentioned earlier, I only became aware of the Chinook blog post because an activist online was using it to justify a political statement. The activist apparently lacked the technical expertise to question the data and has been broadcasting it widely. In my case he used it to try to derail a discussion I was having on climate change and, based on his enthusiasm for the post, I am betting I was not the only one reading that blog post this week.
Author’s Note: For more reading on the topic of p-values I suggest you try the Vox piece written about a week-and-a-half after this was posted titled: An unhealthy obsession with p-values is ruining science. It goes into more detail on the topics brought up in the middle of my post.