In Part Three we started looking at time-series that are autocorrelated, which means each value has a relationship to one or more previous values in the time-series. This is unlike the simple statistical models of independent events.
And in Part Two we have seen how to test whether a sample comes from a population of a stated mean value. The ability to run this test is important and in Part Two the test took place for a population of independent events.
The theory that allows us to accept or reject hypotheses to a certain statistical significance does not work properly with serially correlated data (not without modification).
Here is a nice example from Wilks:
Figure 1
Remember that (usually) with statistical test we don’t actually know the whole population – that’s what we want to find out about. Instead, we take a sample and attempt to find out information about the population.
Take a look at figure 1 – the lighter short horizontal lines are the means (the “averages”) of a number of samples. If you compare the top and bottom graphs you see that the distribution of the means of samples is larger in the bottom graph. This bottom graph is the timeseries with autocorrelation.
What this means is that if we take a sample from a time-series and apply the standard Student-t test to find out whether it came from a population of mean = μ, we will calculate that it didn’t come from a mean that it actually did come from too many times. So a 95% test will incorrectly reject the hypothesis a lot more than 5%.
To demonstrate this, here is the % of false rejections (“Type I errors”) as the autocorrelation parameter increases, when a standard Student-t test is applied:
Figure 2
The test was done with Matlab, with a time-series population of 100,000, Gaussian (“normal distribution”) errors, and samples of 100 taken 10,000 times (in each case a random start point was chosen then the next 100 points were taken as a sample – this was repeated 10,000 times). When the time-series is generated with no serial correlation, the hypothesis test works just fine. As the autocorrelation increases (as we move to the right of the graph), the hypothesis test starts creating more false fails.
With AR(1) autocorrelation – the simplest model of autocorrelation – there is a simple correction that we can apply. This goes under different names like effective sample size and variance inflation factor.
For those who like details, instead of the standard deviation of the sample means:
s = σ/√n
we derive:
s = σ.√[(1+ρ)/n.(1-ρ)], where ρ = autocorrelation parameter.
Repeating the same test with the adjusted value:
Figure 3
We see that Type I errors start to get above our expected values at higher values of autocorrelation. (I’m not sure whether that actually happens with an infinite number of tests and true random samples).
Note as well that the tests above were done using the known value of the autocorrelation parameter (this is like having secret information which we don’t normally have).
So I re-ran the tests using the derived autocorrelation parameter from the sample data (regressing the time-series against the same time-series with a one time step lag) – and got similar, but not identical results and apparently more false fails.
Curiosity made me continue (tempered by the knowledge of the large time-wasting exercise I had previously engaged in because of a misplaced bracket in one equation), so I rewrote the Matlab program to allow me to test some ideas a little further. It was good to rewrite because I was also wondering whether having one (long) time-series generated with lots of tests against it was as good as repeatedly generating a time-series and carrying out lots of tests each time.
So this following comparison had a time-series population of 100,000 events, samples of 100 items for each test, repeated for 100 tests, then the time-series regenerated – and this done 100 times. So 10,000 tests across 100 different populations – first with the known autoregression parameter, then with the estimated value of this parameter from the sample in question:
Figure 4 – Each sample size = 100
The correct value of rejected tests should be 5% no matter what the autoregression parameter.
The rewritten program allows us to test for the effect of sample size. The following graph uses the known value of autogression parameter in the test, a time-series population of 100,000, drawing samples out 1000 times from each population, and repeating through 10 populations in total:
Figure 5 – Using known value of autoregression parameter in Student T-test
Remembering that all of the lines should be horizontal on 5%, we can see that the largest sample population of 1000 is the most resistant to higher autoregression parameters.
This reminded me that the equation for the variance inflation factor (shown earlier) is in fact an approximation. The correct formula (for those who like to see such things):
Figure 6
So I adjusted the variance inflation factor in the program and reran.
I’m really starting to slow things down now – because in each single hypothesis test we are estimating the autoregression parameter, ρ, by a lag-1 correlation, then with this estimate we have to calculate the above circled formula, which requires summing the equation from 1 through to the number of samples. So in the case of n=1000 that’s 1000 calculations, all summed, then used in a Student-t test. And this is done in each case for 1000 tests per population x 10 populations.. thank goodness for Matlab which did it in 18 minutes. (And apologies to readers trying to follow the detail – in the graphics I show the autoregression parameter as φ, when I meant to use ρ, no idea why..)
Fortunately, the result turns out almost identical to using the approximation (the graph using the approximation is not shown):
Figure 7 – Using estimated autoregression parameter
So unless I have made some kind of mistake (quite possible), I take this to mean that the sampling uncertainty in the autoregression parameter adds uncertainty to the Student T-test, which can’t be corrected for easily.
With large samples, like 1000, it appears to work just fine. With time-series data from the climate system we have to take what we can get and mostly it’s not 1000 points.
We are still considering a very basic model – AR(1) with normally-distributed noise.
In the next article I hope to cover some more complex models, as well as the results from this kind of significance test if we assume AR(1) with normally-distributed noise yet actually have a different model in operation..
References
Statistical Methods in the Atmospheric Sciences, 3rd edition, Daniel Wilks, Academic Press (2011)
Taking Serial Correlation into Account in Tests of the Mean, Zwiers & von Storch, Journal of Climate (1995)
I could be mistaken, but I believe the breakdown in your analytic equations is do to a bias error in the estimation of the standard deviation that depends on N. (For large N your Monte Carlo results should converge asymptotically to the analytic formulae.) IMO, this is a good motivation for going away from the formulae in computing the confidence interval, and using a Monte Carlo based approach instead. That and with a Monte Carlo, you aren’t locked into normal distributions (or stationary noise, etc.)
Wikipedia’s article on error of the mean has a reference to Bence that relates to this, which I downloaded and briefly read over.
James R. Bence (1995) Analysis of short time series: Correcting for autocorrelation. Ecology 76(2): 628 – 639
He does similar comparisons to what you did. I didn’t look at the article closely to see how much agreement there was being your results and his, but it might be a good article to look at to track down the breakdown of the formal results for small N.
Thanks for the article – I’ve had a read and very interesting. I’ll try his revised calculation out. This is the “bias-corrected estimator suggested by Doran et al (1992)”.
For the standard deviation of the sample I am using the 1/(N-1) version, i.e. Bessel’s correction. In Matlab this is simply the function std().
I was meaning to comment on this earlier and got side tracked by RL.
Wikipedia has a good writeup on why Bessel’s correction doesn’t fix the problem with the bias in the standard deviation.
Take home part:
The problem arises because of taking the square-root of a random variable, and it is that which introduces the bias. Since the square-root is a “compressive nonlinearity”, that causes the square-root of an unbiased estimator of variance (Bessel’s variance) to be biased low.
It is this which leads to an underestimation of the true error of the mean, and in turn to a too-high of a false rejection rate.
SOD,
You once wrote, why did the editor post an article at tAV? You may remember that it was an obviously confused individual with a post that would be shredded by the readers. I think you may have answered your own question with these baby step posts into stats.
The fun is in the puzzle and the intent is for clarity to those who don’t have the backgrounds to find it.
[…] the last article we saw some testing of the simplest autoregressive model AR(1). I still have an outstanding issue raised by one commenter relating to the hypothesis testing that was introduced, and I hope to come […]
I’ll right away snatch your rss as I can’t to find your e-mail subscription hyperlink or
newsletter service. Do you’ve any? Kindly allow me understand in order that
I could subscribe. Thanks.