
Archive for the ‘Statistics’ Category

There are many classes of systems but in the climate blogosphere world two ideas about climate seem to be repeated the most.

In camp A:

We can’t forecast the weather two weeks ahead, so what chance have we got of forecasting climate 100 years from now?

And in camp B:

Weather is an initial value problem, whereas climate is a boundary value problem. On the timescale of decades, every planetary object has a mean temperature mainly given by the power of its star according to Stefan-Boltzmann’s law combined with the greenhouse effect. If the sources and sinks of CO2 were chaotic and could quickly release and sequester large fractions of gas perhaps the climate could be chaotic. Weather is chaotic, climate is not.

Of course, like any complex debate, simplified statements don’t really help. So this article kicks off with some introductory basics.

Many inhabitants of the climate blogosphere already know the answer to this question, and with much conviction. A reminder for new readers that on this blog opinions are not so interesting, although occasionally entertaining. So instead, try to explain what evidence there is for your opinion. And, as suggested in About this Blog:

And sometimes others put forward points of view or “facts” that are obviously wrong and easily refuted.  Pretend for a moment that they aren’t part of an evil empire of disinformation and think how best to explain the error in an inoffensive way.

Pendulums

The equation for a simple pendulum is “non-linear”, although there is a simplified version of the equation, often used in introductions, which is linear. However, the number of variables involved is only two:

  • angle
  • speed

and this isn’t enough to create a “chaotic” system.

If we have a double pendulum, one pendulum attached at the bottom of another pendulum, we do get a chaotic system. There are some nice visual simulations around, which St. Google might help interested readers find.

If we have a forced damped pendulum like this one:


Figure 1 – the blue arrows indicate that the point O is being driven up and down by an external force

– we also get a chaotic system.

What am I talking about? What is linear & non-linear? What is a “chaotic system”?

Digression on Non-Linearity for Non-Technical People

Common experience teaches us about linearity. If I pick up an apple in the supermarket it weighs about 0.15 kg or 150 grams (also known in some countries as “about 5 ounces”). If I take 10 apples the collection weighs 1.5 kg. That’s pretty simple stuff. Most of our real world experience follows this linearity and so we expect it.

On the other hand, if I was near a very cold black surface held at 170K (-103ºC) and measured the radiation emitted it would be 47 W/m². Then if I double the temperature of this surface to 340K (67ºC), what would I measure? 94 W/m²? Seems reasonable – double the absolute temperature and get double the radiation.. But it’s not correct.

The right answer is 758 W/m², which is 16x the amount. Surprising, but most actual physics, engineering and chemistry is like this. Double a quantity and you don’t get double the result.

It gets more confusing when we consider the interaction of other variables.

Let’s take riding a bike [updated thanks to Pekka]. Once you get above a certain speed most of the resistance comes from the wind so we will focus on that. Typically the wind resistance increases as the square of the speed. So if you double your speed you get four times the wind resistance. Work done = force x distance moved, so with no head wind power input has to go up as the cube of speed (note 4). This means you have to put in 8x the effort to get 2x the speed.

On Sunday you go for a ride and the wind speed is zero. You get to 25 km/hr (16 miles/hr) by putting a bit of effort in – let’s say you are producing 150W of power (I have no idea what the right amount is). You want your new speedo to register 50 km/hr – so you have to produce 1,200W.

On Monday you go for a ride and the wind speed is 20 km/hr into your face. Probably should have taken the day off.. Now with 150W you get to only 14 km/hr, it takes almost 500W to get to your basic 25 km/hr, and to get to 50 km/hr it takes almost 2,400W. No chance of getting to that speed!

On Tuesday you go for a ride and the wind speed is the same so you go in the opposite direction and take the train home. Now with only 6W you get to go 25 km/hr, to get to 50km/hr you only need to pump out 430W.

In mathematical terms it’s quite simple: F = k(v-w)², Force = (a constant, k) x (road speed – wind speed) squared. Power, P = Fv = kv(v-w)². But notice that the effect of the “other variable”, the wind speed, has really complicated things.

To double your speed on the first day you had to produce eight times the power. To double your speed the second day you had to produce almost five times the power. To double your speed the third day you had to produce just over 70 times the power. All with the same physics.
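For anyone who wants to check the arithmetic, here is a minimal Matlab sketch that reproduces the numbers above from P = kv(v-w)². The 150W figure is just the guess used in the example, and speeds are left in km/hr, so k is simply a fitted constant rather than a physically meaningful drag coefficient.

```matlab
% Power needed to hold road speed v (km/hr) with wind speed w (km/hr),
% using P = k*v*(v-w)^2 from the example above. Here w is the wind speed
% in the direction of travel, so a headwind is a negative w.
k = 150 / (25 * 25^2);          % chosen so that 25 km/hr in still air takes 150 W

power = @(v, w) k .* v .* (v - w).^2;

days  = {'no wind ', 'headwind', 'tailwind'};
winds = [0, -20, 20];           % Sunday, Monday (20 km/hr headwind), Tuesday (20 km/hr tailwind)
for i = 1:3
    P25 = power(25, winds(i));
    P50 = power(50, winds(i));
    fprintf('%s: 25 km/hr needs %6.0f W, 50 km/hr needs %6.0f W (ratio %.1f)\n', ...
            days{i}, P25, P50, P50/P25);
end
```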

The real problem with non-linearity isn’t keeping track of these kinds of numbers. You get used to the fact that real science – real world relationships – has these kinds of factors and you come to expect them. And you have an equation that makes calculating them easy. And you have computers to do the work.

No, the real problem with non-linearity (the real world) is that many of these equations link together and solving them is very difficult and often only possible using “numerical methods”.

It is also the reason why something like climate feedback is very difficult to measure. Imagine measuring the change in power required to double speed on the Monday. It’s almost 5x, so you might think the relationship is something like the square of speed. On Tuesday it’s about 70 times, so you would come up with a completely different relationship. In this simple case we know that wind speed is a factor, we can measure it, and so we can “factor it out” when we do the calculation. But in a more complicated system, if you don’t know the “confounding variables”, or the relationships, what are you measuring? We will return to this question later.

When you start out doing maths, physics, engineering.. you do “linear equations”. These teach you how to use the tools of the trade. You solve equations. You rearrange relationships using equations and mathematical tricks, and these rearranged equations give you insight into how things work. It’s amazing. But then you move to “nonlinear” equations, aka the real world, which turns out to be mostly insoluble. So nonlinear isn’t something special, it’s normal. Linear is special. You don’t usually get it.

..End of digression

Back to Pendulums

Let’s take a closer look at a forced damped pendulum. Damped, in physics terms, just means there is something opposing the movement. We have friction from the air and so over time the pendulum slows down and stops. That’s pretty simple. And not chaotic. And not interesting.

So we need something to keep it moving. We drive the pivot point at the top up and down and now we have a forced damped pendulum. The equation that results (note 1) has the massive number of three variables – position, speed and now time to keep track of the driving up and down of the pivot point. Three variables seems to be the minimum to create a chaotic system (note 2).

As we increase the ratio of the forcing amplitude to the length of the pendulum (β in note 1) we can move through three distinct types of response:

  • simple response
  • a “chaotic start” followed by a deterministic oscillation
  • a chaotic system

This is typical of chaotic systems – certain parameter values or combinations of parameters can move the system between quite different states.

Here is a plot (note 3) of position vs time for the chaotic system, β=0.7, with two initial conditions, only different from each other by 0.1%:

Forced damped harmonic pendulum, b=0.7: Start angular speed 0.1; 0.1001

Figure 2

It’s a little misleading to view the angle like this because it is in radians and so needs to be mapped between 0-2π (but then we get a discontinuity on a graph that doesn’t match the real world). We can map the graph onto a cylinder plot but it’s a mess of reds and blues.

Another way of looking at the data is via the statistics – so here is a histogram of the position (θ), mapped to 0-2π, and angular speed (dθ/dt) for the two starting conditions over the first 10,000 seconds:

Histograms for 10,000 seconds

Figure 3

We can see they are similar but not identical (note the different scales on the y-axis).

That might be due to the shortness of the run, so here are the results over 100,000 seconds:

Histogram for 100,000 seconds

Figure 4

As we increase the timespan of the simulation the statistics of two slightly different initial conditions become more alike.

So if we want to know the state of a chaotic system at some point in the future, very small changes in the initial conditions will amplify over time, making the result unknowable – or no different from picking the state from a random time in the future. But if we look at the statistics of the results we might find that they are very predictable. This is typical of many (but not all) chaotic systems.

Orbits of the Planets

The orbits of the planets in the solar system are chaotic. In fact, even 3-body systems moving under gravitational attraction have chaotic behavior. So how did we land a man on the moon? This raises the interesting questions of timescales and amount of variation. Planetary movement – for our purposes – is extremely predictable over a few million years. But over 10s of millions of years we might have trouble predicting exactly the shape of the earth’s orbit – eccentricity, time of closest approach to the sun, obliquity.

However, it seems that even over a much longer time period the planets will still continue in their orbits – they won’t crash into the sun or escape the solar system. So here we see another important aspect of some chaotic systems – the “chaotic region” can be quite restricted. So chaos doesn’t mean unbounded.

According to Cencini, Cecconi & Vulpiani (2010):

Therefore, in principle, the Solar system can be chaotic, but not necessarily this implies events such as collisions or escaping planets..

However, there is evidence that the Solar system is “astronomically” stable, in the sense that the 8 largest planets seem to remain bound to the Sun in low eccentricity and low inclination orbits for time of the order of a billion years. In this respect, chaos mostly manifest in the irregular behavior of the eccentricity and inclination of the less massive planets, Mercury and Mars. Such variations are not large enough to provoke catastrophic events before extremely large time. For instance, recent numerical investigations show that for catastrophic events, such as “collisions” between Mercury and Venus or Mercury failure into the Sun, we should wait at least a billion years.

And bad luck, Pluto.

Deterministic, non-Chaotic, Systems with Uncertainty

Just to round out the picture a little, even if a system is not chaotic and is deterministic we might lack sufficient knowledge to be able to make useful predictions. If you take a look at figure 3 in Ensemble Forecasting you can see that with some uncertainty of the initial velocity and a key parameter the resulting velocity of an extremely simple system has quite a large uncertainty associated with it.

This case is quantitatively different of course. By obtaining more accurate values of the starting conditions and the key parameters we can reduce our uncertainty. Small disturbances don’t grow over time to the point where our calculation of a future condition might as well just be selected from a random time in the future.

Transitive, Intransitive and “Almost Intransitive” Systems

Many chaotic systems have deterministic statistics. So we don’t know the future state beyond a certain time. But we do know that a particular position, or other “state” of the system, will lie within a given range for x% of the time, taken over a “long enough” timescale. These are transitive systems.

Other chaotic systems can be intransitive. That is, for a very slight change in initial conditions we can have a different set of long term statistics. So the system has no “statistical” predictability. Lorenz 1968 gives a good example.

Lorenz introduces the concept of almost intransitive systems. This is where, strictly speaking, the statistics over infinite time are independent of the initial conditions, but the statistics over “long time periods” are dependent on the initial conditions. And so he also looks at the interesting case (Lorenz 1990) of moving between states of the system (seasons), where we can think of the precise starting conditions each time we move into a new season moving us into a different set of long term statistics. I find it hard to explain this clearly in one paragraph, but Lorenz’s papers are very readable.

Conclusion?

This is just a brief look at some of the basic ideas.

Other Articles in the Series

Part Two – Lorenz 1963

References

Chaos: From Simple Models to Complex Systems, Cencini, Cecconi & Vulpiani, Series on Advances in Statistical Mechanics – Vol. 17 (2010)

Climatic Determinism, Edward Lorenz (1968) – free paper

Can chaos and intransitivity lead to interannual variability?, Edward Lorenz, Tellus (1990) – free paper

Notes

Note 1 – The equation is easiest to “manage” after the original parameters are transformed so that tω → t. That is, the period of the external driving is T0 = 2π in the transformed time base.

Then:

d²θ/dt² + γ’·dθ/dt + (α + β·cos t)·sin θ = 0

where θ = angle, γ’ = γ/ω, α = g/(Lω²), β = h0/L;

these parameters are based on γ = viscous drag coefficient, ω = angular frequency of the driving, g = acceleration due to gravity = 9.8 m/s², L = length of the pendulum, and h0 = amplitude of the driving of the pivot point

Note 2 – This is true for continuous systems. Discrete systems can be chaotic with fewer variables.

Note 3 – The results were calculated numerically using Matlab’s ODE (ordinary differential equation) solver, ode45.
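For readers who want to reproduce this kind of run, here is a minimal Matlab sketch using ode45. It assumes the standard form of the parametrically driven, damped pendulum, d²θ/dt² + γ’·dθ/dt + (α + β·cos t)·sin θ = 0, with the parameters of note 1; the damping and α values below are illustrative choices, not necessarily those used for the figures.

```matlab
% Forced damped pendulum, driven pivot: two runs with initial angular
% speeds differing by 0.1%, as in the beta = 0.7 example in the article.
gamma_p = 0.1;    % gamma' - illustrative damping value
alpha   = 1.0;    % g/(L*omega^2) - illustrative value
beta    = 0.7;    % h0/L, ratio of driving amplitude to pendulum length

rhs = @(t, y) [ y(2); -gamma_p*y(2) - (alpha + beta*cos(t))*sin(y(1)) ];

tspan = [0 10000];
[t1, y1] = ode45(rhs, tspan, [0; 0.1]);      % start angular speed 0.1
[t2, y2] = ode45(rhs, tspan, [0; 0.1001]);   % start angular speed 0.1001

plot(t1, y1(:,1), 'b', t2, y2(:,1), 'r');
xlabel('time'); ylabel('\theta (radians)');
legend('start speed 0.1', 'start speed 0.1001');
```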

Note 4 – Force = k(v-w)², where k is a constant, v = velocity, w = wind speed. Work done = Force x distance moved, so Power, P = Force x velocity.

Therefore:

P = kv(v-w)²

If we know k, v & w we can find P. If we have P, k & w and want to find v it is a cubic equation that needs solving.

Read Full Post »

I don’t think this is a simple topic.

The essence of the problem is this:

Can we measure the top of atmosphere (TOA) radiative changes and the surface temperature changes and derive the “climate sensitivity” from the relationship between the two parameters?

First, what do we mean by “climate sensitivity”?

In simple terms this parameter should tell us how much more radiation (“flux”) escapes to space for each 1°C increase in surface temperature.

Climate Sensitivity Is All About Feedback

Climate sensitivity is all about trying to discover whether the climate system has positive or negative feedback.

If the average surface temperature of the earth increased by 1°C and the radiation to space consequently increased by 3.3 W/m², this would be approximately “zero feedback”.

Why is this zero feedback?

If somehow the average temperature of the surface of the planet increased by 1°C – say due to increased solar radiation – then as a result we would expect a higher flux into space. A hotter planet should radiate more. If the increase in flux = 3.3 W/m² it would indicate that there was no negative or positive feedback from this solar forcing (note 1).

Suppose the flux increased by 0. That is, the planet heated up but there was no increase in energy radiated to space. That would be positive feedback within the climate system – because there would be nothing to “rein in” the increase in temperature.

Suppose the flux increased by 5 W/m². In this case it would indicate negative feedback within the climate system.

The key value is the “benchmark” no feedback value of 3.3 W/m². If the value is above this, it’s negative feedback. If the value is below this, it’s positive feedback.

Essentially, the higher the radiation to space as a result of a temperature increase the more the planet is able to “damp out” temperature changes that are forced via solar radiation, or due to increases in inappropriately-named “greenhouse” gases.

Consider the extreme case where as the planet warms up it actually radiates less energy to space – clearly this will lead to runaway temperature increases (less energy radiated means more energy absorbed, which increases temperatures, which leads to even less energy radiated..).

As a result we measure sensitivity as W/m².K, which we read as “Watts per meter squared per Kelvin” – and a 1K change is the same as a 1°C change.

Theory and Measurement

In many subjects researchers’ notation converges on a conventional usage, but in the realm of climate sensitivity everyone has apparently adopted their own. As a note for non-mathematicians, there is nothing inherently wrong with this, but it makes each paper confusing, especially for newcomers and probably for everyone.

I mostly adopt the Spencer & Braswell 2008 terminology in this article (see reference and free link below). I do change their α (climate sensitivity) into λ (which everyone else uses for this value) mainly because I had already produced a number of graphs with λ before starting to write the article..

The model is a very simple 1-dimensional model of the temperature deviation of the ocean mixed layer, from the first law of thermodynamics:

C.∂T/∂t = F + S ….[1]

where C = heat capacity of the ocean, T = temperature anomaly, t = time, F = total top of atmosphere (TOA) radiative flux anomaly, S = heat flux anomaly into the deeper ocean

What does this equation say?

Heat capacity times change in temperature equals the net change in energy

- this is a simple statement of energy conservation, the first law of thermodynamics.

The TOA radiative flux anomaly, F, is a value we can measure using satellites. T is average surface temperature, which is measured around the planet on a frequent basis. But S is something we can’t measure.

What is F made up of?

Let’s define:

F = N + f – λT ….[1a]

where N = random fluctuations in radiative flux, f = “forcings”, and λT is the all important climate response or feedback.

The forcing f is, for the purposes of this exercise, defined as something added into the system which we believe we can understand and estimate or measure. This could be solar increases/decreases, it could be the long term increase in the “greenhouse” effect due to CO2, methane and other gases. For the purposes of this exercise it is not feedback. Feedback includes clouds and water vapor and other climate responses like changing lapse rates (atmospheric temperature profiles), all of which combine to produce a change in radiative output at TOA.

And an important point is that for the purposes of this theoretical exercise, we can remove f from the measurements because we believe we know what it is at any given time.

N is an important element. Effectively it describes the variations in TOA radiative flux due to the random climatic variations over many different timescales.

The climate sensitivity is the value λT, where λ is the value we want to find.

Noting the earlier comment about our assumed knowledge of ‘f’ (note 2), we can rewrite eqn 1:

C.∂T/∂t = – λT + N + S ….[2]

remembering that – λT + N = F is the radiative value we measure at TOA

Regression

If we plot F (measured TOA flux) vs T we can estimate λ from the slope of the least squares regression.

However, there is a problem with the estimate:

λ (est) = Cov[F,T] / Var[T] ….[3]

          = Cov[- λT + N, T] / Var[T]

where Cov[a,b] = covariance of a with b, and Var[a]= variance of a

Forster & Gregory 2006

This oft-cited paper (reference and free link below) calculates the climate sensitivity as 2.3 ± 1.3 W/m².K, using measured ERBE data from 1985-1996.

Their result indicates positive feedback, or at least, a range of values which sit mainly in the positive feedback space.

On the method of calculation they say:

This equation includes a term that allows F to vary independently of surface temperature.. If we regress (- λT+ N) against T, we should be able to obtain a value for λ. The N terms are likely to contaminate the result for short datasets, but provided the N terms are uncorrelated to T, the regression should give the correct value for λ, if the dataset is long enough..

[Terms changed to SB2008 for easier comparison, and emphasis added].

Simulations

Like Spencer & Braswell, I created a simple model to demonstrate why measured results might deviate from the actual climate sensitivity.

The model is extremely simple:

  • a “slab” model of the ocean of a certain depth
  • daily radiative noise (normally distributed with mean=0, and standard deviation σN)
  • daily ocean flux noise (normally distributed with mean=0, and standard deviation σS)
  • radiative feedback calculated from the temperature and the actual climate sensitivity
  • daily temperature change calculated from the daily energy imbalance
  • regression of the whole time series to calculate the “apparent” climate sensitivity

In this model, the climate sensitivity, λ = 3.0 W/m².K.

In some cases the regression is done with the daily values, and in other cases the regression is done with averaged values of temperature and TOA radiation across time periods of 7, 30 & 90 days. I also put a 30-day low pass filter on the daily radiative noise in one case (before “injecting” into the model).

Some results are based on 10,000 days (about 30 years), with 100,000 days (300 years) as a separate comparison.

In each case the estimated value of λ is calculated from the mean of 100 simulation results. The 2nd graph shows the standard deviation, σλ, of these simulation results, which is a useful guide to the likely spread of measured results of λ (if the massive oversimplifications within the model were true). The vertical axis (for the estimate of λ) is the same in each graph for easier comparison, while the vertical axis for the standard deviation changes according to the results, due to the large changes in this value.
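Before looking at the results, here is a minimal Matlab sketch of a single run of this kind of model. The ocean depth, noise levels and monthly averaging below are illustrative choices, not the exact settings used for the figures.

```matlab
% Slab-ocean model: C*dT/dt = -lambda*T + N + S, stepped daily, then
% regression of the "measured" TOA flux (-lambda*T + N) against T.
lambda = 3.0;                % "true" climate sensitivity, W/m^2.K
depth  = 50;                 % mixed layer depth, m (illustrative)
C      = 4.2e6 * depth;      % heat capacity, J/(m^2.K), approx. rho*cp*depth
dt     = 86400;              % one day, in seconds
ndays  = 10000;              % about 30 years
N = 1.0 * randn(ndays,1);    % daily radiative noise, W/m^2 (illustrative sigma)
S = 0.5 * randn(ndays,1);    % daily flux noise into the deeper ocean (illustrative sigma)

T = zeros(ndays,1);  F = zeros(ndays,1);  F(1) = N(1);
for t = 2:ndays
    T(t) = T(t-1) + dt*(F(t-1) + S(t-1))/C;   % yesterday's imbalance warms or cools the slab
    F(t) = -lambda*T(t) + N(t);               % today's "measured" TOA flux anomaly
end

% Average into months before regressing - this is what introduces the bias
m  = floor(ndays/30);
Tm = mean(reshape(T(1:30*m), 30, m))';
Fm = mean(reshape(F(1:30*m), 30, m))';
p  = polyfit(Tm, Fm, 1);
fprintf('true lambda = %.2f, estimated lambda = %.2f\n', lambda, -p(1));
```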

First, the variation as the number of time steps changes and as the averaging period changes from 1 (no averaging) through to 90 days. Remember that the “real” value of λ = 3.0:

Figure 1

Second, the estimate as the standard deviation of the radiative flux is increased, and the ocean depth ranges from 20-200m. The daily temperature and radiative flux are averaged over each month before the regression calculation is carried out:

Figure 2

As figure 2, but for 100,000 time steps (instead of 10,000):

Figure 3

Third, the estimate as the standard deviation of the radiative flux is increased, and the ocean depth ranges from 20-200m. The regression calculation is carried out on the daily values:

Figure 4

As figure 4, but with 100,000 time steps:

Figure 5

Now against averaging period and also against low pass filtering of the “radiative flux noise”:

Figure 6

As figure 6 but with 100,000 time steps:

Figure 7

Now with the radiative “noise” as an AR(1) process (see Statistics and Climate – Part Three – Autocorrelation), vs the autoregressive parameter φ and vs the number of averaging periods: 1 (no averaging), 7, 30, 90 with 10,000 time steps (30 years):

Figure 8

And the same comparison but with 100,000 timesteps:

Figure 9

Discussion of Results

If we consider first the changes in the standard deviation of the estimated value of climate sensitivity we can see that the spread in the results is much higher in each case when we consider 30 years of data vs 300 years of data. This is to be expected. However, given that in the 30-year cases σλ is similar in magnitude to λ we can see that doing one estimate and relying on the result is problematic. This of course is what is actually done with measurements from satellites where we have 30 years of history.

Second, we can see that mostly the estimates of λ tend to be lower than the actual value of 3.0 W/m².K. The reason is quite simple and is explained mathematically in the next section which non-mathematically inclined readers can skip.

In essence, it is related to the idea in the quote from Forster & Gregory. If the radiative flux noise is uncorrelated to temperature then the estimates of λ will be unbiased. By the way, remember that by “noise” we don’t mean instrument noise, although that will certainly be present. We mean the random fluctuations due to the chaotic nature of weather and climate.

If we refer back to Figure 1 we can see that when the averaging period = 1, the estimates of climate sensitivity are equal to 3.0. In this case, the noise is uncorrelated to the temperature because of the model construction. Slightly oversimplifying, today’s temperature is calculated from yesterday’s noise. Today’s noise is a random number unrelated to yesterday’s noise. Therefore, no correlation between today’s temperature and today’s noise.

As soon as we average the daily data into monthly results which we use to calculate the regression then we have introduced the fact that monthly temperature is correlated to monthly radiative flux noise (note 3).

This is also why Figures 8 & 9 show a low bias for λ even with no averaging of daily results. These figures are calculated with autocorrelation for the radiative flux noise. This means that past values of flux are correlated to current values – and so once again, daily temperature will be correlated with daily flux noise. This is also the case where low pass filtering is used to create the radiative noise data (as in Figures 6 & 7).

Maths

x = slope of the line from the linear regression

x = Cov[- λT + N, T] / Var[T] ….[3]

It’s not easy to read equations with complex terms in the numerator and denominator on the same line, so breaking it up:

Cov[- λT + N, T] = E[ (- λT + N)T ] – E[- λT + N]·E[T], where E[a] = expected value of a

= -λ·E[T²] + E[NT] + λ·E[T]·E[T] – E[N]·E[T]

= -λ { E[T²] – (E[T])² } + E[NT] – E[N]·E[T] …. [4]

And

Var[T] = E[T²] – (E[T])² …. [5]

So

x = -λ + { E[NT] – E[N].E[T] } / { E[T²] – (E[T])² } …. [6]

And we see that the regression slope is always biased if N is correlated with T. If the expected value of N = 0, the last term in the numerator drops out, but E[NT] ≠ 0 unless N is uncorrelated with T.

Note of course that we will use the negative of the slope of the line to estimate λ, and so estimates of λ will be biased low.

As a note for the interested student, why is it that some of the results show λ > 3.0?

Murphy & Forster 2010

Murphy & Forster picked up the challenge from Spencer & Braswell 2008 (reference below but no free link unfortunately). The essence of their paper is that using more realistic values for the radiative noise and the ocean mixed layer depth, the error in the calculation of λ is very small:

From Murphy & Forster (2010)

Figure 10

The value ba on the vertical axis is a normalized error term (rather than the estimate of λ).

Evaluating their arguments requires more work on my part, especially analyzing some CERES data, so I hope to pick that up in a later article. [Update, Spencer has a response to this paper on his blog, thanks to Ken Gregory for highlighting it]

Linear Feedback Relationship?

One of the biggest problems with the idea of climate sensitivity, λ, is the idea that it exists as a constant value.

From Stephens (2005), reference and free link below:

The relationship between global-mean radiative forcing and global-mean climate response (temperature) is of intrinsic interest in its own right. A number of recent studies, for example, discuss some of the broad limitations of (1) and describe procedures for using it to estimate Q from GCM experiments (Hansen et al. 1997; Joshi et al. 2003; Gregory et al. 2004) and even procedures for estimating [λ] from observations (Gregory et al. 2002).

While we cannot necessarily dismiss the value of (1) and related interpretation out of hand, the global response, as will become apparent in section 9, is the accumulated result of complex regional responses that appear to be controlled by more local-scale processes that vary in space and time.

If we are to assume gross time–space averages to represent the effects of these processes, then the assumptions inherent to (1) certainly require a much more careful level of justification than has been given. At this time it is unclear as to the specific value of a global-mean sensitivity as a measure of feedback other than providing a compact and convenient measure of model-to-model differences to a fixed climate forcing (e.g., Fig. 1).

[Emphasis added and where the reference to "(1)" is to the linear relationship between global temperature and global radiation].

If, for example, λ is actually a function of location, season & phase of ENSO.. then clearly measuring overall climate response is a more difficult challenge.

Conclusion

Measuring the relationship between top of atmosphere radiation and temperature is clearly very important if we want to assess the all-important climate sensitivity.

Spencer & Braswell have produced a very useful paper which demonstrates some obvious problems with deriving the value of climate sensitivity from measurements. Although I haven’t attempted to reproduce their actual results, I have done many other model simulations to demonstrate the same problem.

Murphy & Forster have produced a paper which claims that the actual magnitude of the problem demonstrated by Spencer & Braswell is quite small in comparison to the real value being measured (as yet I can’t tell whether their claim is correct).

The value called climate sensitivity might be a variable (i.e., not a constant value) and it might turn out to be much harder to measure than it really seems (and already it doesn’t seem easy).

Articles in this Series

Measuring Climate Sensitivity – Part Two – Mixed Layer Depths

Measuring Climate Sensitivity – Part Three – Eddy Diffusivity

References

The Climate Sensitivity and Its Components Diagnosed from Earth Radiation Budget Data, Forster & Gregory, Journal of Climate (2006)

Potential Biases in Feedback Diagnosis from Observational Data: A Simple Model Demonstration, Spencer & Braswell, Journal of Climate (2008)

On the accuracy of deriving climate feedback parameters from correlations between surface temperature and outgoing radiation, Murphy & Forster, Journal of Climate (2010)

Cloud Feedbacks in the Climate System: A Critical Review, Stephens, Journal of Climate (2005)

Notes

Note 1 – The reason why the “no feedback climate response” = 3.3 W/m².K is a little involved but is mostly due to the fact that the overall climate is radiating around 240 W/m² at TOA.

Note 2 – This is effectively the same as saying f=0. If that seems alarming I note in advance that the exercise we are going through is a theoretical exercise to demonstrate that even if f=0, the regression calculation of climate sensitivity includes some error due to random fluctuations.

Note 3 – If the model had one random number for last month’s noise which was used to calculate this month’s temperature then the monthly results would also be free of correlation between the temperature and radiative noise.

Read Full Post »

In the last article we saw some testing of the simplest autoregressive model AR(1). I still have an outstanding issue raised by one commenter relating to the hypothesis testing that was introduced, and I hope to come back to it at a later stage.

Different Noise Types

Before we move onto more general AR models, I did do some testing of the effectiveness of the hypothesis test for AR(1) models with different noise types.

The testing shown in Part Four used Gaussian noise (a “normal distribution”), and the theory applied is apparently only valid for Gaussian noise, so I tried a uniform noise distribution and also a Gamma noise distribution:

Figure 1

The Gaussian and uniform distributions produce the same results. The Gamma noise result isn’t shown because it was also the same.

A Gamma distribution can be quite skewed, which was why I tried it – here is the Gamma distribution that was used (with the same variance as the Gaussian, and shifted to produce the same mean = 0):

Figure 2

So in essence I have found that the tests work just as well when the noise component is uniformly distributed or Gamma distributed as when it has a Gaussian distribution (normal distribution).

Hypothesis Testing of AR(1) Model When the Model is Actually AR(2)

The next idea I was interested to try was to apply the hypothesis testing from Part Three on an AR(2) model, when we assume incorrectly that it is an AR(1) model.

Remember that the hypothesis test is quite simple – we produce a series with a known mean, extract a sample, and then, using the sample, find out how many times the test rejects the (true) hypothesis that the mean equals its actual value:

Figure 3

As we can see, the test, which should reject only 5% of the time, rejects a much higher proportion as φ2 increases. This simple test is just by way of introduction.

Higher Order AR Series

The AR(1) model is very simple. As we saw in Part Three, it can be written as:

x_t – μ = φ(x_{t-1} – μ) + ε_t

where x_t = the next value in the sequence, x_{t-1} = the last value in the sequence, μ = the mean, ε_t = a random quantity and φ = the auto-regression parameter

[Minor note, the notation is changed slightly from the earlier article]

In non-technical terms, the next value in the series is made up of a random element plus a dependence on the last value – with the strength of this dependence being the parameter φ.

The more general autoregressive model of order p, AR(p), can be written as:

x_t – μ = φ1(x_{t-1} – μ) + φ2(x_{t-2} – μ) + .. + φp(x_{t-p} – μ) + ε_t

where φ1..φp = the series of auto-regression parameters

In non-technical terms, the next value in the series is made up of a random element plus a dependence on the last few values. So of course, the challenge is to determine the order p, and then the parameters φ1..φp.

There is a bewildering array of tests that can be applied, so I started simply. With some basic algebraic manipulation (not shown – but if anyone is interested I will provide more details in the comments), we can produce a series of linear equations known as the Yule-Walker equations, which allow us to calculate φ1..φp from the estimated lag correlations.

If you look back to Figure 2 in Part Three you see that by regressing the time series with itself moved by k time steps we can calculate the lag-k correlation, rk, for k=1, 2, 3, etc. So we estimate r1, r2, r3, etc., from the sample of data that we have, and then solve the Yule-Walker equations to get φ1..φp.
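As a concrete illustration of the procedure, here is a minimal Matlab sketch: generate a synthetic AR(2) series, estimate the lag correlations from the data, and solve the Yule-Walker equations for the parameters. The parameter values are illustrative, not the ones used for the figures below.

```matlab
% Generate a synthetic AR(2) series and recover phi1, phi2 via Yule-Walker
phi = [0.5, -0.3];                  % "true" AR(2) parameters (illustrative)
n   = 100000;
x   = zeros(n,1);
e   = randn(n,1);
for t = 3:n
    x(t) = phi(1)*x(t-1) + phi(2)*x(t-2) + e(t);
end
x = x(10001:end);                   % discard the first 10% as spin-up

p = 2;                              % assumed order of the AR model
r = zeros(p,1);                     % sample lag-k correlations r1..rp
for k = 1:p
    c    = corrcoef(x(1:end-k), x(1+k:end));
    r(k) = c(1,2);
end

% Yule-Walker: R * phi_hat = r, where R is the Toeplitz matrix of lag correlations
R       = toeplitz([1; r(1:p-1)]);
phi_hat = R \ r;
fprintf('estimated parameters: %.3f  %.3f\n', phi_hat);
```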

First of all I played around with simple AR(2) models. The results below are for two different sample sizes.

A population of 90,000 is created (actually 100,000 then the first 10% is deleted), and then a sample is randomly selected 10,000 times from this population. For each sample, the Yule-Walker equations are solved (each of 10,000 times) and then the results are averaged.

In these results I normalized the mean and standard deviation of the parameters by the original values (later I decided that made it harder to see what was going on and reverted to just displaying the actual sample mean and sample standard deviation):

Figure 4

Notice that the sample size of 1,000 produces very accurate results in the estimation of φ1 & φ2, with a small spread. The sample size of 50 appears to produce a low bias in the calculated results, especially for φ2, which is no doubt due to not reading the small print somewhere..

Here is a histogram of the results, showing the spread across φ1 & φ2 – note the values on the axes, the sample size of 1000 produces a much tighter set of results, the sample size of 50 has a much wider spread:

Figure 5

Then I played around with a more general model. With this model I send in AR parameters to create the population, but can define a higher order of AR to test against, to see how well the algorithm estimates the AR parameters from the samples.

In the example below the population is created as AR(3), but tested as if it might be an AR(4) model. The AR(3) parameters (shown on the histogram in the figure below) are φ1= 0.4, φ2= 0.2, φ3= -0.3.

The estimation seems to cope quite well as φ4 is estimated at about zero:

Figure 6

The histogram of results for the first two parameters, note again the difference in values on the axes for the different sample sizes:

Figure 7

[The reason for the finer detail on this histogram compared with figure 5 is just the discovery of the Matlab parameters for 3d histograms].

Rotating the histograms around in 3d appears to confirm a bell-curve. Something to test formally at a later stage.

Here’s an example of a process which is AR(5) with φ1= 0.3, φ2= 0, φ3= 0, φ4= 0, φ5= 0.4; tested against AR(6):

Figure 8

And the histogram of estimates of φ1 & φ2:

Figure 9

ARMA

We haven’t yet seen ARMA models – auto-regressive moving average models. And we haven’t seen MA models – moving average models with no auto-regressive behavior.

What is an MA or “moving average” model?

The term in the moving average is a “linear filter” on the random elements of the process. So instead of ε_t as the “uncorrelated noise” in the AR model, we have ε_t plus a weighted sum of earlier random elements. The MA process, of order q, can be written as:

x_t – μ = ε_t + θ1·ε_{t-1} + θ2·ε_{t-2} + .. + θq·ε_{t-q}

where θ1..θq = the series of moving average parameters

The term “moving average” is a little misleading, as Box and Jenkins also comment.

Why is it misleading?

Because for AR (auto-regressive) and MA (moving average) and ARMA (auto-regressive moving average = combination of AR & MA) models the process is stationary.

This means, in non-technical terms, that the mean of the process is constant through time. That doesn’t sound like “moving average”.

So think of “moving average” as a moving average (filter) of the random elements, or noise, in the process. By their nature these will average out over time (because if the average of the random elements = 0, the average of the moving average of the random elements = 0).

An example of this in the real world might be a chemical introduced randomly into a physical process – this is the ε_t term – but because the chemical gets caught up in pipework and valves, the actual value of the chemical released into the process at time t is the sum of a proportion of the current value released plus a proportion of earlier values released. Examples of the terminology used for the various processes:

  • AR(3) is an autoregressive process of order 3
  • MA(2) is a moving average process of order 2
  • ARMA(1,1) is a combination of AR(1) and MA(1)
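To make the three process types concrete, here is a minimal Matlab sketch that generates a short example of each from the same set of random innovations; the parameter values are arbitrary illustrations.

```matlab
% Example AR(1), MA(1) and ARMA(1,1) series built from the same noise
n   = 200;
e   = randn(n,1);            % the innovations - uncorrelated Gaussian noise
phi = 0.6;  theta = 0.4;     % illustrative parameters

x_ar   = zeros(n,1);         % AR(1):     x(t) = phi*x(t-1) + e(t)
x_ma   = zeros(n,1);         % MA(1):     x(t) = e(t) + theta*e(t-1)
x_arma = zeros(n,1);         % ARMA(1,1): x(t) = phi*x(t-1) + e(t) + theta*e(t-1)
for t = 2:n
    x_ar(t)   = phi*x_ar(t-1) + e(t);
    x_ma(t)   = e(t) + theta*e(t-1);
    x_arma(t) = phi*x_arma(t-1) + e(t) + theta*e(t-1);
end

plot(1:n, [x_ar, x_ma, x_arma]);
legend('AR(1)', 'MA(1)', 'ARMA(1,1)');
xlabel('time step');
```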

References

Time Series Analysis: Forecasting & Control, 3rd Edition, Box, Jenkins & Reinsel, Prentice Hall (1994)

Read Full Post »

In Part Three we started looking at time-series that are autocorrelated, which means each value has a relationship to one or more previous values in the time-series. This is unlike the simple statistical models of independent events.

And in Part Two we have seen how to test whether a sample comes from a population of a stated mean value. The ability to run this test is important and in Part Two the test took place for a population of independent events.

The theory that allows us to accept or reject hypotheses to a certain statistical significance does not work properly with serially correlated data (not without modification).

Here is a nice example from Wilks:

From Wilks (2011)

Figure 1

Remember that (usually) with a statistical test we don’t actually know the whole population – that’s what we want to find out about. Instead, we take a sample and attempt to find out information about the population.

Take a look at figure 1 – the lighter short horizontal lines are the means (the “averages”) of a number of samples. If you compare the top and bottom graphs you see that the distribution of the means of samples is larger in the bottom graph. This bottom graph is the timeseries with autocorrelation.

What this means is that if we take a sample from a time-series and apply the standard Student-t test to find out whether it came from a population of mean = μ, we will conclude too often that it did not come from that population – even when it actually did. So a 95% test will incorrectly reject the hypothesis a lot more than 5% of the time.

To demonstrate this, here is the % of false rejections (“Type I errors”) as the autocorrelation parameter increases, when a standard Student-t test is applied:

Figure 2

The test was done with Matlab, with a time-series population of 100,000, Gaussian (“normal distribution”) errors, and samples of 100 taken 10,000 times (in each case a random start point was chosen then the next 100 points were taken as a sample – this was repeated 10,000 times). When the time-series is generated with no serial correlation, the hypothesis test works just fine. As the autocorrelation increases (as we move to the right of the graph), the hypothesis test starts creating more false fails.

With AR(1) autocorrelation – the simplest model of autocorrelation – there is a simple correction that we can apply. This goes under different names like effective sample size and variance inflation factor.

For those who like details, instead of the standard deviation of the sample means:

s = σ/√n

we derive:

s = σ·√[(1+ρ) / (n·(1-ρ))], where ρ = the autocorrelation parameter.
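Applied to a single sample, the adjustment looks something like the following minimal Matlab sketch. The value of ρ is treated as known here (in practice it would be estimated from the sample, e.g. by a lag-1 regression), and the critical value is an approximation.

```matlab
% Test "does this sample come from a population of mean mu0?" with the
% AR(1) correction to the standard error (the approximate formula above).
rho = 0.6;          % AR(1) parameter (illustrative; normally estimated from the data)
mu0 = 0;            % hypothesized population mean
n   = 100;

x = zeros(n,1);  x(1) = mu0;    % one AR(1) sample with true mean mu0
for t = 2:n
    x(t) = mu0 + rho*(x(t-1) - mu0) + randn;
end

s_naive    = std(x)/sqrt(n);                        % assumes independent data
s_adjusted = std(x)*sqrt((1+rho)/(n*(1-rho)));      % variance inflation for AR(1)

t_naive    = (mean(x) - mu0)/s_naive;
t_adjusted = (mean(x) - mu0)/s_adjusted;
crit       = 1.98;  % approx. two-sided 95% critical value for ~100 degrees of freedom

fprintf('naive t = %.2f, adjusted t = %.2f (reject if |t| > %.2f)\n', ...
        t_naive, t_adjusted, crit);
```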

Repeating the same test with the adjusted value:

Figure 3

We see that Type I errors start to get above our expected values at higher values of autocorrelation. (I’m not sure whether that actually happens with an infinite number of tests and true random samples).

Note as well that the tests above were done using the known value of the autocorrelation parameter (this is like having secret information which we don’t normally have).

So I re-ran the tests using the derived autocorrelation parameter from the sample data (regressing the time-series against the same time-series with a one time step lag) – and got similar, but not identical results and apparently more false fails.

Curiosity made me continue (tempered by the knowledge of the large time-wasting exercise I had previously engaged in because of a misplaced bracket in one equation), so I rewrote the Matlab program to allow me to test some ideas a little further. It was good to rewrite because I was also wondering whether having one (long) time-series generated with lots of tests against it was as good as repeatedly generating a time-series and carrying out lots of tests each time.

So this following comparison had a time-series population of 100,000 events, samples of 100 items for each test, repeated for 100 tests, then the time-series regenerated – and this done 100 times. So 10,000 tests across 100 different populations – first with the known autoregression parameter, then with the estimated value of this parameter from the sample in question:

Figure 4 – Each sample size = 100

The correct proportion of rejected tests should be 5%, no matter what the autoregression parameter is.

The rewritten program allows us to test for the effect of sample size. The following graph uses the known value of the autoregression parameter in the test, a time-series population of 100,000, drawing samples out 1000 times from each population, and repeating through 10 populations in total:

Figure 5 – Using known value of autoregression parameter in Student T-test

Remembering that all of the lines should be horizontal on 5%, we can see that the largest sample size of 1000 is the most resistant to higher autoregression parameters.

This reminded me that the equation for the variance inflation factor (shown earlier) is in fact an approximation. The correct formula (for those who like to see such things):

from Zwiers & von Storch (1995)

Figure 6

So I adjusted the variance inflation factor in the program and reran.

I’m really starting to slow things down now – because in each single hypothesis test we are estimating the autoregression parameter, ρ, by a lag-1 correlation, then with this estimate we have to calculate the above circled formula, which requires summing the equation from 1 through to the number of samples. So in the case of n=1000 that’s 1000 calculations, all summed, then used in a Student-t test. And this is done in each case for 1000 tests per population x 10 populations.. thank goodness for Matlab which did it in 18 minutes. (And apologies to readers trying to follow the detail – in the graphics I show the autoregression parameter as φ, when I meant to use ρ, no idea why..)

Fortunately, the result turns out almost identical to using the approximation (the graph using the approximation is not shown):

Figure 7 – Using estimated autoregression parameter

So unless I have made some kind of mistake (quite possible), I take this to mean that the sampling uncertainty in the autoregression parameter adds uncertainty to the Student T-test, which can’t be corrected for easily.

With large samples, like 1000, it appears to work just fine. With time-series data from the climate system we have to take what we can get and mostly it’s not 1000 points.

We are still considering a very basic model – AR(1) with normally-distributed noise.

In the next article I hope to cover some more complex models, as well as the results from this kind of significance test if we assume AR(1) with normally-distributed noise yet actually have a different model in operation..

References

Statistical Methods in the Atmospheric Sciences, 3rd edition, Daniel Wilks, Academic Press (2011)

Taking Serial Correlation into Account in Tests of the Mean, Zwiers & von Storch, Journal of Climate (1995)

Read Full Post »

In the first two parts we looked at some basic statistical concepts, especially the idea of sampling from a distribution, and investigated how this question is answered:  Does this sample come from a population of mean = μ?

If we can answer this abstract-looking question then we can consider questions such as:

  • “how likely is it that the average temperature has changed over the last 30 years?”
  • “is the temperature in Boston different from the temperature in New York?”

It is important to understand the assumptions under which we are able to put % probabilities on the answers to these kind of questions.

The statistical tests so far described rely upon each event being independent from every other event. Typical examples of independent events in statistics books are:

  • the toss of a coin
  • the throw of a die
  • the measurement of the height of a resident of Burkina Faso

In each case, the result of one measurement does not affect any other measurement.

If we measure the max and min temperatures in Ithaca, NY today, and then measure it tomorrow, and then the day after, are these independent (unrelated) events?

No.

Here is the daily maximum temperature for January 1987 for Ithaca, NY:

Data from Wilks (2011)

Figure 1

Now we want to investigate how values on one day are correlated with values on another day. So we look at the correlation of the temperature on each day with progressively larger lags in days. The correlation goes by the inspiring and memorable name of the Pearson product-moment correlation coefficient.

This correlation is the value commonly known as “r”.

So for k=0 we are comparing each day with itself, which obviously has a perfect correlation. And for k=1 we are comparing each day with the one afterwards – and finding the (average) correlation. For k=2 we are comparing 2 days afterwards. And so on. Here are the results:

Figure 2

As you can see, the autocorrelation decreases as the number of days increases, which is intuitively obvious. And by the time we get to more than 5 days, the correlation has decreased to zero.

By way of comparison, here is a random (normally distributed) series with the same mean and standard deviation as the Ithaca temperature values:

Figure 3

And the autocorrelation values:

Figure 4

As you would expect, the correlation of each value with the next value is around zero. The reason it is not exactly zero is just the randomness associated with only 31 values.
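The lag-k calculation itself is straightforward. Here is a minimal Matlab sketch, using a random Gaussian series as a stand-in for the temperature data.

```matlab
% Lag-k autocorrelation: correlate the series with itself shifted by k steps
x    = randn(31,1);       % stand-in series (31 "daily" values)
kmax = 10;
r    = zeros(kmax+1,1);
for k = 0:kmax
    c      = corrcoef(x(1:end-k), x(1+k:end));
    r(k+1) = c(1,2);
end
stem(0:kmax, r.');
xlabel('lag k (days)'); ylabel('autocorrelation r_k');
```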

Digression: Time-Series and Frequency Transformations

Many people will be new to the concept of how time-series values convert into frequency plots – the Fourier transform. For those who do understand this subject, skip forward to the next sub-heading..

Suppose we have a 50Hz sine wave. If we plot amplitude against time we get the first graph below.

Figure 5

If we want to investigate the frequency components we do a Fourier transform and we get the 2nd graph below. That simply tells us the obvious fact that a 50Hz signal is a 50Hz signal. So what is the point of the exercise?

What about if we have the time-based signal shown in the next graph – what can we tell about its real source?

Figure 6

When we see the frequency transform in the 2nd graph we can immediately tell that the signal is made up of two sine waves – one at 50Hz and one at 120Hz – along with some noise. It’s not really possible to deduce that from looking at the time-domain signal (not for ordinary people anyway).

Frequency transforms give us valuable insights into data.
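For readers who want to reproduce this kind of plot, here is a minimal Matlab sketch along the same lines: a 50 Hz plus 120 Hz signal with added noise, transformed into the frequency domain. The amplitudes, noise level and sampling rate are arbitrary choices.

```matlab
% Time-domain signal: 50 Hz + 120 Hz sine waves plus Gaussian noise
Fs = 1000;                     % sampling frequency, Hz
t  = 0:1/Fs:1-1/Fs;            % 1 second of data
x  = sin(2*pi*50*t) + 0.7*sin(2*pi*120*t) + 0.5*randn(size(t));

% One-sided amplitude spectrum via the FFT
n  = length(x);
X  = fft(x);
P  = abs(X/n);                 % two-sided amplitude
P1 = 2*P(1:n/2+1);             % fold onto one side (rough treatment of DC and Nyquist)
f  = Fs*(0:n/2)/n;

plot(f, P1);
xlabel('frequency (Hz)'); ylabel('amplitude');
```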

Just as a last point on this digression, in figure 5, why isn’t the frequency plot a perfect line at 50Hz? If the time-domain data went from zero to infinity, the frequency plot would be that perfect line. In figure 5, the time-domain data actually went from zero to 10 seconds (not all of which was plotted).

Here we see the frequency transform for a 50Hz sine wave over just 1 second:

Figure 7

For people new to frequency transforms it probably doesn’t seem clear why this happens but by having a truncated time series we have effectively added other frequency components – from the 1 second envelope surrounding the 50 Hz sine wave. If this last point isn’t clear, don’t worry about it.

Autocorrelation Equations and Frequency

The simplest autocorrelation model is the first-order autoregression, or AR(1) model.

The AR(1) model can be written as:

x_{t+1} – μ = φ(x_t – μ) + ε_{t+1}

where x_{t+1} = the next value in the sequence, x_t = the last value in the sequence, μ = the mean, ε_{t+1} = a random quantity and φ = the auto-regression parameter

In non-technical terms, the next value in the series is made up of a random element plus a dependence on the last value – with the strength of this dependence being the parameter φ.

It appears that there is some confusion about this simple model. Recently, referencing an article via Bishop Hill, Doug Keenan wrote:

To draw that conclusion, the IPCC had to make an assumption about the global temperature series. The assumption that it made is known as the “AR1” assumption (this is from the statistical concept of “first-order autoregression”). The assumption implies, among other things, that only the current value in a time series has a direct effect on the next value. For the global temperature series, it means that this year’s temperature affects next year’s, but temperatures in previous years do not. For example, if the last several years were extremely cold, that on its own would not affect the chance that next year will be colder than average. Hence, the assumption made by the IPCC seems intuitively implausible.

[Update - apologies to Doug Keenan for misunderstanding his point - see his comment below ]

The confusion in the statement above is that mathematically the AR1 model does only rely on the last value to calculate the next value – you can see that in the formula above. But that doesn’t mean that there is no correlation between earlier values in the series. If day 2 has a relationship to day 1, and day 3 has a relationship to day 2, clearly there is a relationship between day 3 and day 1 – just not as strong as the relationship between day 3 and day 2 or between day 2 and day 1.

(And it is easy to demonstrate with a lag-2 correlation of a synthetic AR1 series – the 2-day correlation is not zero).
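Here is a minimal Matlab sketch of that demonstration – generate a synthetic AR1 series and compute its lag-1 and lag-2 correlations. For an AR1 process the lag-2 correlation comes out at roughly φ², not zero; the value of φ below is illustrative.

```matlab
% Lag-1 and lag-2 correlation of a synthetic AR1 series
phi = 0.6;                          % autoregression parameter (illustrative)
n   = 100000;
x   = zeros(n,1);
for t = 2:n
    x(t) = phi*x(t-1) + randn;
end

c1 = corrcoef(x(1:end-1), x(2:end));   r1 = c1(1,2);
c2 = corrcoef(x(1:end-2), x(3:end));   r2 = c2(1,2);
fprintf('lag-1 = %.3f (phi = %.1f),  lag-2 = %.3f (phi^2 = %.2f)\n', r1, phi, r2, phi^2);
```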

Well, more for another article when we look at the various autoregression models.

For now we will consider the simplest model, AR1, to learn a few things about time-series data with serial correlation.

Here are some synthetic time-series with different autoregression parameters (the value φ in the equation) and Gaussian (= normal, or the “bell-shaped curve”) noise. The Gaussian noise is the same in each series – with a standard deviation = 5.

I’ve used long time-series to make the frequency characteristics clearer (later we will see the same models over a shorter time period):

Figure 8

The value <x> is the mean. Note that the standard deviation (sd) of the data gets larger as the autoregressive parameter increases. DW is the Durbin-Watson statistic which we will probably come back to at a later date.

When φ = 0, this is the same as each data value being completely independent of every other data value.

Now the frequency transformation (using a new dataset to save a little programming time on my part):

Figure 9

The first graph in the panel, with φ=0, is known as “white noise”. This means that the energy per unit frequency doesn’t change with frequency. As the autoregressive parameter increases you can see that the energy shifts to lower frequencies. This is known as “red noise”.

Here are the same models over 100 events (instead of 10,000) to make the time-based characteristics easier to see:

Figure 10

As the autoregression parameter increases you can see that the latest value is more likely to be influenced by the previous value.

The equation is also sometimes known as a red noise process because a positive value of the parameter φ averages or smoothes out short-term fluctuations in the serially independent series of innovations, ε, while affecting the slower variations much less strongly. The resulting time series is called red noise by analogy to visible light depleted in the shorter wavelengths, which appears reddish..

It is evident that the most erratic point to point variations in the uncorrelated series have been smoothed out, but the slower random variations are essentially preserved. In the time domain this smoothing is expressed as positive serial correlation. From a frequency perspective, the resulting series is “reddened”.

From Wilks (2011).

There is more to cover on this simple model but the most important point to grasp is that data which is serially correlated has different statistical properties than data which is a series of independent events.

Luckily, we can still use many standard hypothesis tests but we need to make allowance for the increase in the standard deviation of serially correlated data over independent data.

References

Statistical Methods in the Atmospheric Sciences, 3rd edition, Daniel Wilks, Academic Press (2011)

Read Full Post »

In Part One we raced through some basics, including the central limit theorem which is very handy.

This theorem tells us that even if we don’t know the type of distribution of a population we can say something very specific about the mean of a sample from that population (subject to some caveats).

Even though this theorem is very specific and useful it is not the easiest idea to grasp conceptually. So it is worth taking the time to think about it – before considering the caveats..

What do we know about Samples taken from Populations?

Usually we can’t measure the entire “population”. So we take a sample from the population. If we do it once and measure the mean (= “the average”) of that sample, then repeat again and again, and then plot the “distribution” of those means of the samples we get the graph on the right:

Figure 1

- and the graph on the right follows a normal distribution.

We know the probabilities associated with normal distributions, so this means that even if we have just ONE sample – the usual case – we can assess how likely it is that it comes from a specific population.

Here is a demonstration..

Using Matlab I created a population – the uniform distribution on the left of figure 1. Then I took a random sample from the population. Note that in real life you don’t know the details of the actual population, this is what you are trying to ascertain via statistical methods.

Figure 2

Each sample was 100 items. The test was made using the known probabilities of the normal distribution – “is this sample from a population of mean = 10?” And for a statistical test we can’t get a definite yes or no. We can only get a % likelihood. So a % threshold was set – you can see in figure 2, it was set at 95%.

Basically we are asking, "is there a 95% likelihood that this sample was drawn from a population with a mean of 10?"

The exercise of

a) extracting a random sample of 100 items, and

b) carrying out the test

- was repeated 100,000 times

Even though the sample was drawn from the actual population every single time, 5% of the time (4.95% to be precise) the test rejected the sample as coming from this population. This is to be expected. Statistical tests can only give answers in terms of a probability.
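The original exercise was done in Matlab; here is a minimal Python sketch of the same idea. It assumes a two-sided z-test and the uniform 9–11 population from Part One (the exact population behind the figures isn't restated here, but the ~5% Type I rate doesn't depend on that detail):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

pop_mean = 10.0
pop_sd = 2.0 / np.sqrt(12.0)            # standard deviation of a uniform [9, 11] population (an assumption here)
n, trials, confidence = 100, 100_000, 0.95

z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)    # two-sided critical value, about 1.96

samples = rng.uniform(9.0, 11.0, size=(trials, n))   # every sample really is drawn from the population
z = (samples.mean(axis=1) - pop_mean) / (pop_sd / np.sqrt(n))
false_rejections = np.mean(np.abs(z) > z_crit)

print(f"false rejection (Type I) rate: {false_rejections:.2%}")   # about 5%, by construction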

All we have done is confirm that a test with a 95% threshold gives us 95% correct answers and 5% incorrect answers. We do get incorrect answers. So why not increase the level of confidence in the test by raising the threshold?

Ok, let’s try it. Let’s increase the threshold to 99%:

Figure 3

Nice. Now we get just under 1% false rejections. We have improved our ability to tell whether or not a sample is drawn from a specific population!

Or have we?

Unfortunately there is no free lunch, especially in statistics.

Reducing the Risk of One Type of Error Increases the Risk of the Other

In every case here we happen to know that the sample really was drawn from the population. Suppose we don't know this – the usual situation. The wider we cast the net, the more likely we are to conclude that a sample was drawn from a population when in fact it was not.

I’ll show some examples shortly, but here is a good summary of the problem – along with the terminology of Type I and Type II errors – note that H0 is the hypothesis that the sample was drawn from the population in question:

From Brase & Brase (2009)

Figure 4

What we have been doing by moving from 95% to 99% certainty is reducing the possibility of making a Type I error = thinking that the sample does not come from the population in question when it actually does. But in doing so we have been increasing the possibility of making a Type II error = thinking that the sample does come from the population when it does not.

So now let’s widen the Matlab example – we have added an alternative population and are drawing samples out of that as well.

So first – as before – we take samples from the main population and use the statistical test to find out how good it is at determining whether the samples do come from this population. Then second, we take samples from the alternative population and use the same test to see whether it makes the mistake of thinking the samples come from the original population.

Figure 5

As before, the % of false rejections is about what we would expect for a test at the 95% confidence level (note the number of tests was reduced to 10,000, for no particular reason).

But now we see the % of “false acceptance” – where a sample from an alternative population is assessed to see whether it came from the original population. This error is – in this case – around 4%.

Now we increase the significance level to 99%:

Figure 6

Of course, the number of false rejections (type I error) has dropped to 1%. Excellent.

But the number of false accepts (type II error) has increased from 4% to 13%. Bad news.
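Here is a rough sketch of the same tradeoff. The exact populations behind these figures aren't specified in the text, so the percentages below won't match them; purely for illustration I have assumed both populations are uniform with a width of 2 (like the 9–11 population from Part One), with the alternative centred at 10.1 rather than 10.3 so that both error types are clearly visible:

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Assumed details (not given in the text): both populations uniform with width 2,
# the original centred on 10.0 and the alternative on 10.1, two-sided z-test.
pop_mean, alt_mean = 10.0, 10.1
pop_sd = 2.0 / np.sqrt(12.0)
n, trials = 100, 10_000
se = pop_sd / np.sqrt(n)

def error_rates(confidence):
    z_crit = stats.norm.ppf(1 - (1 - confidence) / 2)
    # Samples genuinely from the original population: rejecting them is a Type I error.
    z0 = (rng.uniform(pop_mean - 1, pop_mean + 1, (trials, n)).mean(axis=1) - pop_mean) / se
    # Samples from the alternative population: accepting them as "the original" is a Type II error.
    z1 = (rng.uniform(alt_mean - 1, alt_mean + 1, (trials, n)).mean(axis=1) - pop_mean) / se
    return np.mean(np.abs(z0) > z_crit), np.mean(np.abs(z1) <= z_crit)

for conf in (0.95, 0.99):
    type_i, type_ii = error_rates(conf)
    print(f"{conf:.0%} confidence: Type I ~ {type_i:.1%}, Type II ~ {type_ii:.1%}")

Tightening the threshold from 95% to 99% cuts the Type I rate from roughly 5% to roughly 1%, while the Type II rate rises – the same tradeoff as in figures 5 and 6, even though the exact percentages differ.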

Now let’s demonstrate why it is that we can’t know in advance how likely Type II errors are. In the following example, the mean of the alternative population has moved to 10.5 (from 10.3):

Figure 7

So no Type II errors. And we widen the test to 99%:

Figure 8

Still no Type II errors. So we widen the test further to 99.9%:

Figure 9

Finally we get some Type II errors. But because the population we are drawing the samples from is different enough from the population we are testing for (our hypothesis), the statistical test is very effective. The "power of the test" – the probability of correctly rejecting the hypothesis when it is false, which is one minus the probability of a Type II error – is, in this case, very high. This is also why we can't know the Type II error rate in advance: it depends on the characteristics of the alternative population, which we generally don't know.

So, in summary, when you see a test "at the 5% significance level" (i.e. 95% confidence), or at the "1% significance level" (99% confidence), you have to understand that the more impressive the significance level, the more likely it is that a false result has been accepted – that is, the risk of a Type II error has gone up.

Increasing the Sample Size

As the sample size increases, the spread of the distribution of "the mean of the sample" gets smaller. I know, stats sounds like gobbledygook..

Let’s see a simple example to demonstrate what is a simple idea turned into incomprehensible English:

Figure 10

As you increase the size of the sample, you reduce the spread of the "sampling means" (the standard deviation of the sample means is σ/√n, so it shrinks as the sample size n grows), and this means that separating truth from fiction becomes easier.

It isn't always possible to increase the sample size (for example, there are only so many months of satellite temperature measurements), but when it is possible, it makes it easier to determine whether or not a sample is drawn from a given population.
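A quick sketch of the effect, again assuming the uniform 9–11 population for illustration – the spread of the sample means shrinks like σ/√n as the sample size n grows:

import numpy as np

rng = np.random.default_rng(4)
pop_sd = 2.0 / np.sqrt(12.0)                     # uniform [9, 11] population (assumed)

for n in (10, 100, 1000):
    means = rng.uniform(9.0, 11.0, size=(10_000, n)).mean(axis=1)
    print(f"sample size {n:4d}: spread of sample means = {means.std():.4f}  (sigma/sqrt(n) = {pop_sd / np.sqrt(n):.4f})")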

Student T-test vs Normal Distribution test

What is a student t-test? It sounds like something “entry level” that serious people don’t bother with..

Actually it is a test developed by William Sealy Gosset just over 100 years ago. He had to publish under a pen name – "Student", hence the name of the test – because his employer, the Guinness brewery, regarded statistics as one of its trade secrets..

In the tests shown earlier we had to know the standard deviation of the population from which the sample was drawn. Often we don't know this, so the population standard deviation has to be estimated from the sample itself – and we still want to test how likely it is that the sample was drawn from a population with a certain mean.

The principle is the same, but the process is slightly different.
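As a minimal sketch of the difference (assuming the same uniform 9–11 population, and a small sample where it matters most): the z-test below uses the known population standard deviation, while scipy's one-sample t-test estimates the standard deviation from the sample itself.

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
sample = rng.uniform(9.0, 11.0, 30)              # one small sample from the (assumed) uniform [9, 11] population

# z-test: the population standard deviation is known
pop_sd = 2.0 / np.sqrt(12.0)
z = (sample.mean() - 10.0) / (pop_sd / np.sqrt(len(sample)))
p_z = 2 * stats.norm.sf(abs(z))                  # two-sided p-value from the normal distribution

# t-test: the standard deviation is estimated from the sample itself (Gosset's test)
t_stat, p_t = stats.ttest_1samp(sample, popmean=10.0)

print(f"z-test p-value: {p_z:.3f}    t-test p-value: {p_t:.3f}")

The t-distribution has fatter tails than the normal distribution, so for small samples the t-test is more cautious about declaring significance; as the sample size grows the two tests converge.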

More in the next article, and hopefully we get to the concept of autocorrelation.

In all the basic elements we have covered so far we have assumed that each element in a sample and in a population is unrelated to any other element – that is, independent events. Unfortunately, in the atmosphere and in climate this assumption is generally not true (there may be circumstances where it holds, but usually it does not).

Read Full Post »

I am very much a novice with statistics. Until recently I have avoided stats in climate, but of course, I keep running into climate science papers which introduce some statistical analysis.

So I decided to get up to speed and this series is aimed at getting me up to speed as well as, hopefully, providing some enlightenment to the few people around who know less than me about the subject. In this series of articles I will ask questions that I hope people will answer, and also I will make confident statements that will turn out to be completely or partially wrong – I expect knowledgeable readers to put us all straight when this happens.

One of the reasons I have avoided stats is that I have found it difficult to understand the correct application of the ideas from statistics to climate science. And I have had a suspicion (that I cannot yet prove and may be totally wrong about) that some statistical analysis of climate is relying on unproven and unstated assumptions. But all that is for the road ahead.

First, a few basics. They will be sketchy basics – to avoid it being part 30 by the time we get to interesting stuff – and so if there are questions about the very basics, please ask.

In this article:

  • independence, or independent events
  • the normal distribution
  • sampling
  • central limit theorem
  • introduction to hypothesis testing

Independent Events

A lot of elementary statistics ideas are based around the idea of independent events. This is an important concept to understand.

One example would be flipping a coin. The value I get this time is totally independent of the value I got last time. Even if I have just flipped 5 heads in a row, assuming I have a normal unbiased coin, I have a 50/50 chance of getting another head.

Many people, especially people with "systems for beating casinos", don't understand this point. Although there is only a 1/2⁵ = 1/32 ≈ 3% chance of getting 5 heads in a row, once it has happened the chance of getting one more head is 50%. Many people will calculate the chance – in advance – of getting 6 heads in a row (1/2⁶ ≈ 1.6%) and say that because 5 heads have already been flipped, the probability of getting the 6th head is 1.6%. Completely wrong.

Another way to think about this interesting subject is that the chance of getting H T H T H T H T is just as unlikely as getting H H H H H H H H. Both have a 1/2⁸ = 1/256 ≈ 0.4% chance.

On the other hand, getting 4 heads and 4 tails out of 8 throws – in any order, rather than the specific orders above – is much more likely: 70/256, or about 27%.
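These numbers are easy to check with a quick sketch:

from math import comb

print(0.5 ** 5)               # 5 heads in a row: 1/32, about 3%
print(0.5 ** 8)               # any one specific sequence of 8 flips: 1/256, about 0.4%
print(comb(8, 4) / 2 ** 8)    # 4 heads and 4 tails in any order: 70/256, about 27%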

If you send 100 people to the casino for a night, most will lose “some of their funds”, a few will lose “a lot”, and a few will win “a lot”. That doesn’t mean the winners have any inherent skill, it is just the result of the rules of chance.

A bit like fund managers who set up 20 different funds, then after a few years most have done “about the same as the market”, some have done very badly and some have done well. The results from the best performers are published, the worst performers are “rolled up” into the best funds and those who understand statistics despair of the standards of statistical analysis of the general public. But I digress..

Normal Distributions and “The Bell Curve”

The well-known normal distribution describes a lot of phenomena, most of them unrelated to climate. The normal distribution is also known as a Gaussian distribution.

For example, if we measure the weights of male adults in a random country we might get a normal distribution that looks like this:

Figure 1

Essentially there is a grouping around the “mean” (= arithmetic average) and outliers are less likely the further away they are from the mean.

Many distributions match the normal distribution closely. And many don’t. For example, rainfall statistics are not Gaussian.

The two parameters that describe the normal distribution are:

  • the mean
  • the standard deviation

The mean is the well-known concept of the average (note that "the average" is a less technical term than "the mean"), and is therefore very familiar to non-statistics people.

The standard deviation is a measure of the spread of the population. In the example of figure 1 the standard deviation = 30. A normal distribution has 68% of its values within 1 standard deviation of the mean – so in figure 1 this means that 68% of the population are between 140-200 lbs. And 95% of its values are within 2 standard deviations of the mean – so 95% of the population are between 110-230 lbs.
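These percentages can be checked directly from the normal distribution. A minimal sketch, using the mean of 170 lbs and standard deviation of 30 implied by figure 1:

from scipy import stats

mean, sd = 170.0, 30.0                                        # mean and standard deviation implied by figure 1
within_1sd = stats.norm.cdf(200, mean, sd) - stats.norm.cdf(140, mean, sd)
within_2sd = stats.norm.cdf(230, mean, sd) - stats.norm.cdf(110, mean, sd)
print(f"within 1 standard deviation:  {within_1sd:.1%}")      # about 68%
print(f"within 2 standard deviations: {within_2sd:.1%}")      # about 95%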

Sampling

If there are 300M people in a country and we want to find out their weights it is a lot of work. A lot of people, a lot of scales, and a lot of questions about privacy. Even under a dictatorship it is a ton of work.

So the idea of “a sample” is born.. We measure the weights of 100 people, or 1,000 people and as a result we can make some statements about the whole population.

The population is the total collection of “things” we want to know about.

The sample is our attempt to measure some aspects of "the population" without as much work as measuring the complete population.

There are many useful statistical relationships between samples and populations. One of them is the central limit theorem.

Central Limit Theorem

Let me give an example, along with some “synthetic data”, to help get this idea clear.

I have a population of 100,000 with a uniform distribution between 9 and 11. I have created this population using Matlab.

Now I take a random sample of 100 out of my population of 100,000. I measure the mean of this sample. Now I take another random sample of 100 (out of the same population) and measure the mean. I do this many, many times (100,000 times in the example below). What does the sampling distribution of the mean look like?

Figure 2

Alert readers will see that the sampling distribution of the means – right side graph – looks just like the “bell curve” of the normal distribution. Yet the original population is not a normal distribution.

It turns out that regardless of the population distribution, if you have enough items in your sample, you get a normal distribution when you plot the means of the samples.

The mean of this normal distribution (the sampling distribution of the mean) is the same as the mean of the population, and its standard deviation is s = σ/√n, where σ = the standard deviation of the population and n = the number of items in each sample.

This is the central limit theorem – in non-technical language – and is the reason why the normal distribution takes on such importance in statistical analysis. We will see more in considering hypothesis testing..
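The figures above were produced in Matlab; here is a minimal Python sketch of the same experiment (a uniform population between 9 and 11, samples of 100 taken repeatedly, with 10,000 repetitions rather than 100,000 to keep it quick):

import numpy as np

rng = np.random.default_rng(6)

population = rng.uniform(9.0, 11.0, 100_000)                  # uniform population between 9 and 11
sample_means = np.array([
    rng.choice(population, 100).mean()                        # mean of one random sample of 100 (with replacement, for speed)
    for _ in range(10_000)
])

print(f"population mean: {population.mean():.3f}, population sd: {population.std():.3f}")
print(f"mean of the sample means: {sample_means.mean():.3f}")
print(f"sd of the sample means:   {sample_means.std():.3f}   (sigma/sqrt(n) = {population.std() / np.sqrt(100):.3f})")

The spread of the sample means comes out very close to σ/√n, even though the population itself looks nothing like a normal distribution.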

Hypothesis Testing

We have zoomed through many important statistical ideas and for people new to the concepts, probably too fast. Let’s ask this one question:

If we have a sample, can we assess how likely it is that it was drawn from a particular population?

Let’s pose the problem another way:

The original population is unknown to us. How do we determine the characteristics of the original population from the sample we have?

Because the probabilities around the normal distribution are very well understood, and because the sampling distribution of the mean is a normal distribution, this means that if we have just one sample we can calculate the probability that it came from a population of specified mean and specified standard deviation.
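As a purely hypothetical worked example of that calculation (the numbers are made up for illustration): suppose a sample of 100 items has a mean of 10.12, and we test it against a population with mean 10 and standard deviation 0.58 (roughly the uniform 9–11 population above):

import numpy as np
from scipy import stats

sample_mean, n = 10.12, 100            # hypothetical sample mean and size, made up for illustration
pop_mean, pop_sd = 10.0, 0.58          # hypothesised population

z = (sample_mean - pop_mean) / (pop_sd / np.sqrt(n))   # how many standard errors the sample mean is from the hypothesised mean
p = 2 * stats.norm.sf(abs(z))                          # two-sided probability of a sample mean at least this far away
print(f"z = {z:.2f}, p-value = {p:.3f}")               # z ~ 2.1, p ~ 0.04: unlikely (but not impossible) at the 95% level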

More in the next article in the series.

Read Full Post »
