Wednesday, October 27, 2010

Extracting Signal From Drought Noise


In this post, I want to take up a particularly striking aspect of the Dai paper (and some other earlier papers on drought).  To do so, though, I first need to fill readers in a little bit on a beautiful piece of statistical machinery called Principal Component Analysis.  We will not delve into the mathematics of it here - I refer you to the Wikipedia page, which will be inscrutable to non-technical readers but is a good foundation for the mathematically inclined.  However, readers need to get at least a little intuition for what principal components mean, and I will try to supply that.

Suppose we have some data, and the data has more than one dimension.  The simplest case is points in a plane - two dimensional data.  Suppose the data form a cloud like the fuzzy grey dots in this picture:

Here each fuzzy grey dot has both an x-value (on the lower axis) and a y-value (on the left axis).  However, it's clear that the most interesting direction, as far as this data is concerned, is neither of the original axes, but rather, the long arrow that points a little bit east of north east.  The data form a long thin cloud in that direction.  So if you wanted to capture the most interesting/important features of the data, you could just keep one number - how far along in the direction of that long arrow a particular point was.  You'd lose something if you dropped the distance along the short arrow, but not as much as if you just used the x or y value.

So the long arrow here is known as the first principal component, and the short arrow is the second principal component.  If you have the value of both components (for a given point), you know everything about where the point is (you can recreate the original x and y values - I'll spare you the math).  But these are more interesting directions to think about this particular data set in: in some sense they capture the natural features of the data.
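For readers who would rather see this in code than in arrows, here is a minimal sketch using NumPy.  The cloud is synthetic, and the 40 degree tilt, the spreads, and all the variable names are assumptions made up for illustration - none of it comes from the picture above.  It finds the two principal components, keeps "one number" per point, and checks that keeping both numbers recovers the original x and y values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic long thin cloud: wide spread along one direction, narrow along the other.
n_points = 500
long_axis = rng.normal(0.0, 3.0, n_points)    # spread along the long arrow (assumed)
short_axis = rng.normal(0.0, 0.5, n_points)   # spread along the short arrow (assumed)
angle = np.deg2rad(40.0)                      # tilt of the long arrow from the x-axis (assumed)
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
points = np.column_stack([long_axis, short_axis]) @ rotation.T + np.array([5.0, 2.0])

# Centre the cloud, then diagonalize its 2x2 covariance matrix.
mean = points.mean(axis=0)
centered = points - mean
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]             # re-sort so the biggest comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1, pc2 = eigvecs[:, 0], eigvecs[:, 1]       # long arrow, short arrow (unit vectors)

# Position of every point along each arrow.
scores = centered @ eigvecs

# Both numbers together recreate the original x and y values.
reconstructed = scores @ eigvecs.T + mean

# Keeping only the distance along the long arrow loses a little...
one_number = np.outer(scores[:, 0], pc1) + mean
err_pc1_only = np.mean(np.sum((points - one_number) ** 2, axis=1))

# ...but less than keeping only the x value (and replacing y by its average).
err_x_only = np.mean(centered[:, 1] ** 2)

print("variance along PC1 and PC2:", eigvals)
print("mean squared error keeping only PC1:", err_pc1_only)
print("mean squared error keeping only x:  ", err_x_only)
```

Note that the sign of each arrow is arbitrary: the algorithm may just as well return the long arrow pointing southwest instead of northeast, and nothing of substance changes.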

Now, you can do the same thing in three dimensions.  Googling around, I couldn't find a really great illustration, but this one (from here) will have to serve:

Here we have a 3d cloud of data.  The two red arrows represent the first (longer) and second (shorter) principal components (which between them define a pinkish plane).  The data lie mostly near the pink plane, and the third principal component, which is not shown, would stick out from the plane at right angles to the others.  So again, the most important feature of a given data point is its position along the first principal component.  If we wanted to know more about it, its position along the second principal component would be the next most useful thing to have.  And finally, if we wanted to know everything there was to know in the original data, we'd need to keep the third principal component too.

Now, we humans with our monkey++ brains can only think in three dimensions.  But computers don't care, and if you let them run long enough, they will happily crunch through data in four, five, or in general N dimensions.  And so Principal Component Analysis - aka PCA - has been used for a huge variety of problems that are trying to look for significant patterns in data - finding "average" faces, genetic analysis, etc, etc.  Indeed, one of my youthful research forays was exploring the possibility of using this technique in tracking Internet malfeasors down (a paper which is now up to 228 citations, pretty good for a technique that I now judge to be completely impractical).

In any case, now imagine taking the world and dividing it up into square cells, 2.5° on a side, and computing the Palmer Drought Severity Index for each square, in a particular year.  Consider that collection of numbers for that particular year to be one data point in a very high dimensional space (the number of dimensions being equal to the number of 2.5° cells required to cover the world - 1296, except we'd probably economize somehow near the poles).  Now, if we have all the years from 1900 to 2008, that forms a cloud of 109 points in this very high dimensional space.  And the computer will cheerfully crunch the mathematical algorithm to produce the principal components for that data cloud (it's called diagonalizing the covariance matrix, but you don't need to know that to appreciate what it's doing).  And so it will figure out the direction of the first principal component, the second principal component, and the other 1294 principal components (though after the first few, things are going to get noisy as hell with only 109 points in 1296 dimensions, but never mind that).  And it will tell us how big each principal component is (how preferentially the data tends to line up along that particular direction - ie how long and thin the data cloud looks in that particular direction).
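As a rough sketch of what the computer is doing, here is the same calculation on made-up numbers.  To be clear about the assumptions: the "PDSI" table below is random noise plus one invented trend pattern, not real data, and I use NumPy's singular value decomposition rather than literally building and diagonalizing the covariance matrix (the two give the same directions, and the SVD is cheaper when there are far more cells than years).  This is not necessarily how the analysis in the paper was carried out.

```python
import numpy as np

rng = np.random.default_rng(1)
n_years, n_cells = 109, 1296     # 1900-2008, and the cell count quoted above

# Made-up stand-in for the PDSI table: one row per year, one column per grid cell.
trend_pattern = rng.normal(size=n_cells)          # hypothetical spatial pattern
trend_amount = np.linspace(-1.0, 1.0, n_years)    # hypothetical slow trend over the years
pdsi = np.outer(trend_amount, trend_pattern) + rng.normal(size=(n_years, n_cells))

# Subtract each cell's long-term average so the cloud is centred on the origin.
anomalies = pdsi - pdsi.mean(axis=0)

# SVD of the centred table: the rows of vt are the principal component directions,
# and the squared singular values are proportional to the covariance eigenvalues.
u, s, vt = np.linalg.svd(anomalies, full_matrices=False)

components = vt                          # components[0] is the first principal component
explained = s**2 / np.sum(s**2)          # fraction of total variation along each direction

print("fraction of variation in the first five components:", explained[:5])
print("size of the last component relative to the first:", s[-1] / s[0])
```

One thing the sketch makes visible: once the average has been subtracted, 109 points can only spread out in at most 108 independent directions, so the last component here comes out with essentially zero size, and any directions beyond that carry no variation at all.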

Ok, but if we have a "direction" in a 1296 dimensional space, what the hell do we do with that?  Well, we map it back into the original grid cells, and show a map that has a different color for how much a given cell's PDSI aligns with a particular principal component.
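Mapping a direction back onto the grid is just a matter of undoing the flattening.  In the sketch below, the 36 by 36 layout is purely hypothetical (it is simply one grid shape that gives 1296 cells), and the component is a random stand-in rather than anything computed from real data.

```python
import numpy as np

n_lat, n_lon = 36, 36                          # hypothetical layout: 36 * 36 = 1296 cells
rng = np.random.default_rng(2)

# Stand-in for one principal component: one number per grid cell, unit length overall.
component = rng.normal(size=n_lat * n_lon)
component /= np.linalg.norm(component)

# Undo the flattening: each row is a latitude band, each column a longitude band.
loading_map = component.reshape(n_lat, n_lon)

# Cell number k in the flat list lands at this row and column of the map.
k = 500
row, col = divmod(k, n_lon)
print("cell", k, "-> row", row, "col", col, "value", loading_map[row, col])
```

The resulting 2D array is what would be handed to a plotting routine and colored by value: strongly positive cells move together with the component, strongly negative cells move against it.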

Now, before I show you the result of that, I wish to stress that this technique is a purely statistical technique for number-crunching.  The algorithm is just thinking about data points in a high-dimensional space - it doesn't know anything about temperatures or rainfall or clouds or droughts or carbon dioxide.  In short, there is no actual climate physics in it whatsoever.  In no sense is this a climate model - just a statistical technique for analyzing the existing observations.

Ok, so here's what the first principal component looks like in Dai's paper:

So, the most important pattern the algorithm finds in the data involves drought in the Amazon, the western US, much of Africa, the Mediterranean, eastern Australia, eastern China.  Wait, you're probably thinking, this seems a bit familiar.... Yes, let me jog your memory:


This is the 1950-2008 trend map we saw yesterday.  If you look at Australia, Scandinavia, Africa, you'll see the pattern is very similar.  The US is a bit different, in that the PCA first component finds significantly more southwest drought than the trend.  We'll get to that in a minute.  But otherwise, these two maps are qualitatively similar.

So in a sense, that PCA algorithm seems to be saying "the most important thing that I'm finding going on in this data is that there is a trend toward drought in thus and so places".  And if we now look at how much of this particular principal component there is in each year (how far along the long arrow each grey dot is), we get this graph:

So this particular pattern was very slightly increasing (but bumpy) before about 1950, and then started to take off.
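A time series like this is obtained by projecting each year's map onto the component.  Here is a sketch on invented data, in which a hypothetical pattern is switched on as a ramp after 1950; the shape and size of the ramp, the noise, and the variable names are all assumptions for illustration, not a reconstruction of the actual analysis.

```python
import numpy as np

rng = np.random.default_rng(3)
years = np.arange(1900, 2009)
n_cells = 1296

# Invented data: a spatial pattern that is absent before 1950 and ramps up afterwards.
amount = np.where(years < 1950, 0.0, (years - 1950) / 58.0 * 3.0)
pattern = rng.normal(size=n_cells)
pdsi = np.outer(amount, pattern) + rng.normal(size=(years.size, n_cells))

anomalies = pdsi - pdsi.mean(axis=0)
_, s, vt = np.linalg.svd(anomalies, full_matrices=False)

# How far along the first principal component each year sits.
pc1 = vt[0]
scores = anomalies @ pc1

# The sign of a component is arbitrary, so orient it so that later years score higher.
if np.corrcoef(years, scores)[0, 1] < 0:
    pc1, scores = -pc1, -scores

# Compare the straight-line slope of the scores before and after 1950.
early = years < 1950
slope_early = np.polyfit(years[early], scores[early], 1)[0]
slope_late = np.polyfit(years[~early], scores[~early], 1)[0]

# Share of the total variation carried by this one direction.
share = s[0] ** 2 / np.sum(s ** 2)

print("slope before 1950:", slope_early)
print("slope from 1950 on:", slope_late)
print("share of total variation in PC1:", share)
```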

So that's the single most important thing going on in the data (and the "PC1, 7.1%" label at the top of the graph essentially says that this particular direction accounts for 7.1% of the total variation in the data).  Now, what about the second principal component?  In a way, that's even more fascinating:


 Ok, a pattern of wetness in the southwestern US, dryness in Brazil, Indonesia, Australia and Southern Africa.  What is this?  How about this description:


and this:


These are from the Wikipedia page on the El-Nino Southern Oscillation.  And if you stare at the map above while reading it, I think you'll agree that the description matches very well with the second principal component map.  And indeed, if you look at the temporal pattern of how much the years line up with this El-Nino like pattern, you get this:

The black line is the amount of the second principal component in the PDSI data.  The red line is a commonly used indicator of how strong the El-Nino is (it's actually the atmospheric pressure difference between Darwin Australia and Tahiti), only shifted by six months.  You can see that there is a pretty good degree of agreement between these lines - they aren't exactly the same, but clearly are capturing qualitatively the same phenomenon.
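To give a feel for what "shifted by six months" means in practice, here is a sketch of a lagged correlation on synthetic monthly series.  Everything in it is made up: the smooth "index" is just averaged random noise standing in for an El-Nino indicator, the "PC2" series is that index delayed by six months with extra noise added, and `lagged_correlation` is a helper written for this illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_months = 600
true_lag = 6                                  # delay built into the fake data, in months

# Fake smooth index: white noise run through a 12-month moving average.
raw = rng.normal(size=n_months + true_lag + 11)
index_full = np.convolve(raw, np.ones(12) / 12, mode="valid")   # length n_months + true_lag

# Fake PC2 series: the index as it was six months earlier, plus some noise.
pc2 = index_full[:n_months] + 0.1 * rng.normal(size=n_months)
index_now = index_full[true_lag:]             # the index at the same calendar month as pc2


def lagged_correlation(a, b, lag):
    """Correlation between a[t] and b[t - lag], for lag >= 0."""
    if lag == 0:
        return np.corrcoef(a, b)[0, 1]
    return np.corrcoef(a[lag:], b[:-lag])[0, 1]


lags = list(range(0, 13))
corrs = [lagged_correlation(pc2, index_now, lag) for lag in lags]
best_lag = lags[int(np.argmax(corrs))]

for lag, corr in zip(lags, corrs):
    print(f"lag {lag:2d} months: correlation {corr:.2f}")
print("best lag:", best_lag)
```

With real data one would scan the lags in the same way and see where the agreement between the two lines peaks; here the peak should land on the six months that were built in.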

As a scientist, I find this amazing.  You feed this PDSI data into a completely physics-agnostic statistical algorithm, and it comes out with "The most important thing going on here is an overall trend to drying in certain places" that roughly matches what climate models say global warming will do, and then it adds "And the second most important thing going on is this oscillation back and forth between drought in some other places" which very closely matches the well known El-Nino phenomenon.

So let's match up three things now: the picture of where climate models say drought will be (top), the picture of the first principal component (middle), and the picture of the drying trend of the last 60 years (bottom):

Stare and stare.  Now, there's a great deal of uncertainty here.  The climate models are clearly not perfect.  Extrapolating existing trends could completely fail to foresee non-linear reorganizations of the climate.  But suppose we had to bet (and really, we do, don't we?).  I'd make two points: the places that are very dry in all three pictures don't look to me like a great bet (that would be you, Mediterranean and eastern Australian farmers).  And in the regions where there is uncertainty, my instinct would be to bet on the middle picture as capturing the best idea of regional drought patterns.

The differences between the second and the third picture are particularly interesting in North America.  As you can see, the main difference is that the PCA says the southwestern US is drying, but the trend picture shows that much more weakly.  And in the comment I referenced yesterday:
Some of these regions, such as the United States, have fortunately avoided prolonged droughts during the last 50 years mainly due to decadal variations in ENSO and other climate modes, but people living in these regions may see a switch to persistent severe droughts in the next 20–50 years, depending on how ENSO and other natural variability modulate the GHG-induced drying.
Dai is basically saying parts of the US have been protected by the fact that there's been a trend towards more El-Nino from 1950 to the early 1990s - with the resulting wet southwestern winter tendency offsetting the global warming signal - but it's not clear that will continue.  And indeed, it hasn't been so much so in the last ten years and there's been a lot of really nasty western wildfires, as well as epic beetle outbreaks:


Having recently left California, I guess I have voted with my feet.

I still have the bit between my teeth on this drought stuff.  Tomorrow I'll try to take a deeper look at the PDSI and how much confidence it's reasonable to have in it as an indicator.

Note: This post is part of the Future of Drought Series on Early Warning.

10 comments:

Greg said...

Excellent exposition, Stuart. You have talent as a teacher!

I look forward to tomorrow's tutorial.

Stuart Staniford said...

Greg:

Why thank you! I was trying...

Amusingly, I have always pretty much not liked to teach, taught only one class in my time at the university, and that was a major factor in my electing to pursue industrial research of various flavors, rather than stay in the university system. The issue being that I get bored very quickly once I understand something, and can't stand regurgitating it again. Blogging is ok, because I can write it up almost as soon as I've figured something out.

porsena said...

Should we attach much importance to the finding that the first and second principal components account for only 12% of the variance between them?

Gary said...

Beautiful!

Just out of curiosity, can you "subtract out" the second principal component and plot what is left? I'm wondering if the US starts looking more like the models.

Second question - do you know if typical climate models include ENSO effects - and if so how would they get the phasing correct? Is it weather or is it climate!

Stuart Staniford said...

Porsena:

I think the implication is that at this stage, there's still a lot of just random noise (aka weather) going on as well. Presumably, over time, the global warming signal will get stronger and stronger (not that it will be stronger every single year, but over the decades).

Stuart Staniford said...

Gary:

One could subtract it out (technically project it out), but not so easy to do by cut and paste from the existing figures in the paper!

Also - I think ENSO simulation is still a very active area of research in climate models and they don't have it down. Eg, here's an upcoming workshop on the subject:

El Niño - Southern Oscillation (ENSO) is the dominant mode of interannual climate variability with worldwide weather and societal impacts. Because ENSO involves a complex interplay of numerous ocean and atmospheric processes, accurately modelling this climate phenomenon with coupled General Circulation Models (GCMs) and understanding and anticipating its behaviour on seasonal to decadal and longer time scales still poses a formidable challenge. Over the past few years, new promising methods have emerged which can improve ENSO simulation, for example by combining ENSO theoretical frameworks and GCM modelling or by using initialised hindcasts and by utilising the recent wealth of high-quality observations to understand errors and their growth in forecast systems. By focusing on the very key processes affecting ENSO dynamics, these new approaches have a strong potential to accelerate progress and improve representation of ENSO in complex climate models. Not only can these new methods help address the question of whether ENSO is changing in a changing climate, but potentially they can also improve reliability of centennial-scale climate projections.
The World Climate Research Program (WCRP) has long recognised the central importance of an improved understanding and predictability of ENSO by encouraging coordinated research in tropical climate variability via its different expert Panels (Pacific, Indian, Seasonal to Interannual Prediction,...) The next Coordinated Model Intercomparison Project (CMIP5), which will feed into the IPCC 5th Assessment, provides a new opportunity to evaluate current research in the process-based evaluation of ENSO in GCMs.
Main goal
The workshop aims to present and discuss emerging new methods for the process-based evaluation of ENSO in GCMs, their use in multi-model assessments and identify future directions and associated challenges.
Specific goals
a) To make an inventory of the existing approaches to evaluate ENSO processes in GCMs; compare, contrast and discuss the relative merits of these approaches as for example applied to CMIP3;
b) To review the potential of methods bridging ENSO theoretical frameworks and GCM modelling; and,
c) To review the observing system and available data for the evaluation of ENSO in GCMs.

Stuart Staniford said...

Ah - looks like that detailed examination of PDSI is going to have to wait another day. The algorithm is really surprisingly arcane.

Gary said...

Stuart,

I've been thinking a bit more about your results. I'm wondering if your methodology allows you to "calibrate" the strength of the principal components against one another. In this case, the warming-drying trend compared to ENSO effects. Qualitatively, many people can relate to the odd weather during El Nino years. Yet here is something that is more significant in terms of the strength of the principal component.
Looking at the first principal component, you see some noisy oscillations until 1960 or so. What happens if you only look at data before that time? Which component is most important on the truncated data set? How about just in data since that time? I guess what I'm asking is how does the magnitude of the principal component change for various long term averages? Clearly, if the trends are largely oscillatory changes like ENSO, the magnitude of the principal component representing that behavior should be roughly the same magnitude as long as you average over several periods. However, with the onset of a unidirectional trend you might expect to see the principal component representing this slowly rise out of the mud. Rather than being just 7% of the signal since 1900, it might represent 20% of the variations today.

Sam Norton said...

Stuart, are you reading Judith Curry's stuff (eg: http://judithcurry.com/2010/10/24/overconfidence-in-ipccs-detection-and-attribution-part-iii/ ) There's some overlap with what you like to explore, and I find her a very lucid and honest thinker - so I think you'd enjoy engaging with her ;)

Stuart Staniford said...

Gary:

Just to be clear, the PDSI principal component analysis is from the Dai review - I didn't reproduce it myself.