Remove Climatological Mean Annual Cycle and Detrend Data

This tutorial shows how to use CDAT to remove the climatological mean annual cycle and detrend data - a common procedure applied prior to detailed climate data analysis of monthly anomalies.

The data considered in this notebook are monthly-mean surface air temperatures gridded over the United States and spanning the years 1850 - 1990. The original dataset is complete, but it is artificially modified in this notebook by "masking" some values, yielding an incomplete dataset with some values "missing" (as is often encountered in analysis of climate data). The analysis procedure involves three major steps:

  1. Remove the climatological annual cycle yielding monthly-mean departures.
  2. Spatially average over the domain.
  3. Remove the time-mean and the linear trend.

When there are missing values in the dataset (as in the sample calculations below), the final detrended time-series will depend on the order in which these steps are executed. Here we examine options for detrending the data, and we show that slightly different results are generated depending on the order in which operations are performed. More sophisticated treatments (not discussed here) involving appropriately weighting samples and collections of samples should be considered for datasets that only sparsely cover the time and space domains.

To download this Jupyter Notebook, right click on the link and choose "Download Linked File As..." or "Save Link as...". Remember where you saved the downloaded file, which should have an .ipynb extension. (You'll need to launch the Jupyter notebook or JupyterLab instance from the location where you store the notebook file.)

Table of Contents

Prepare Notebook and Data

Back to Top

Download Data

The following CMIP5 NetCDF file contains Near-Surface Air Temperature data in degrees Kelvin (K) over North America. It is downloaded to the directory where this notebook is stored.

Open Data File, Extract Variable

The following two lines of code open the file just downloaded to your local computer (filename), extract data for the Near-Surface Air Temperature (tas), and assign it to the variable data.

The following line of code uses the .info() method to allow us to take a quick look at the structure of the temperature data stored in the data variable.

There are 1680 different time values which are measured as "days since 1850-01-01 00:00:00". The range of the time values is the difference between the last value (51084.5) and the first value (15.5) which equals 51069 days. If we divide this range (51069) by the number of intervals in the dataset (1680-1), we get (51069/(1680-1)) = 30.416 days, which is the average time duration for each data point. This tells us that we are working with monthly data.

The data cover 13 latitude bands and 16 longitude values over the United States (latitudes ~25.6 to ~48.3 and longitudes -123.75 to -67.5).

Data Exploration

Back to Top

First, to get a feel for the data, let's spatially average the data over the entire domain to create a time series of the mean temperature for the region. In creating this time series, the averager function will take the temperature data for the entire region and spatially average it to yield a single temperature value as a function of time (i.e., the latitude and longitude dimensions are removed by this action, as shown with the .shape method.)

In the cell above, we have specified that averaging over the longitude and latitude dimensions (x and y) should be performed by weighting by grid-cell area. Note that the "combinewts" option should also be included for correct area-weighting.

In the next line of code, let's plot this time series.

The figure above shows that the surface temperature averaged over the U.S. is characterized by a pronounced annual cycle and a long term trend with some year to year variability. The goal of the remainder of this analysis is remove both the climatological mean annual cycle and the long term trend. This procedure leads to a filtered time series representing monthly temperature anomalies characterizing variability that cannot be explained by annual cycle forcing or any long-term changes in climate forcing.

But first, let's look at a numerical example, which illustrates that the order in which you perform operations can make a difference when a dataset is incomplete (i.e., when a dataset includes "masked" or "missing" data).

Order of Operations Matters

Numerical Example

Back to Top

If data are missing from a dataset, the order of operations can matter. The following is a numerical example of averaging values two different ways.

Let's say we have the following dataset, which is a function of X and Y:

Y1 Y2 Y3 Y4
X1 3 4 - 7
X2 - 5 - -
X3 1 2 5 5
X4 - - 6 4

Creating this dataset using a numpy array and using 999 where no values exist yields:

Masking the 999 values leads to the following:

If we average over Y first, then average over X, we get (4.667 + 5.000 + 3.250 + 5.000) / 4 = 4.479

Y1 Y2 Y3 Y4 Average
X1 3 4 - 7 4.667
X2 - 5 - - 5.000
X3 1 2 5 5 3.250
X4 - - 6 4 5.000
Average 4.479

Verifying this with code gives:

If we average over X first, then over Y, we get (2.000 + 3.667 + 5.500 + 5.333) / 4 = 4.125

Y1 Y2 Y3 Y4 Average
X1 3 4 - 7
X2 - 5 - -
X3 1 2 5 5
X4 - - 6 4
Average 2.000 3.667 5.500 5.333 4.125

Verifying this with code yields:

Finally, if we average using all 16 cells at once (but, of course, exclude those with missing data), we get (3 + 4 + 7 + 5 + 1 + 2 + 5 + 5 + 6 + 4) / 10 = 4.200

We get three different overall means (4.479, 4.125, or 4.200) depending on our processing choices (specifically the order of operations). (Note that with appropriate weighting, which is not done here, a consistent mean can be obtained, independent of the order of operations. From the first example of averaging over Y, then X, since the total number of values in the dataset is 10, the proper weighting would be: 4.667 * 3/10 + 5.000 * 1/10 + 3.250 * 4/10 + 5.000 * 2/10 = 4.200.)

When additional processing steps are involved, ordering can affect results in more complex ways, as in the next example.

Removing the Climatological Annual Cycle

Back to Top

Before detrending a time series, it is often best to filter it by removing the climatological mean annual cycle; in fact, this may be a requirement for particular types of analyses. In the case of a time series that does not span an integral/non-fractional representative number of years, an accurate determination of the trend of interest may require removal of the climatological mean annual cycle. To see why, consider a temperature time series starting in January and ending in July a year and a half later (i.e., not an integral number of years). Over Northern Hemisphere continents with a large annual cycle, the ending temperature would be much higher than the beginning temperature simply due to the usual seasonal changes in temperature. A linear fit to the time-series would result in a trend that would not reflect any real change in climate but just the particulars of the time period treated. To avoid such artificial trends, one should first remove the climatological annual cycle.

The surface temperature data considered earlier has no missing data, but more often than not observational datasets are incomplete. For purposes of illustration, we therefore will first modify the original surface temperature dataset by designating certain values as "missing". Specifically, we'll treat as "missing" (i.e., delete or mask) all data values that are within 7 degrees of the maximum temperature (data.max()-7) and store the result in datamskd.

We now consider what order to carry out the two-step operation needed to remove the climatological mean annual cycle:

  1. Remove the climatological annual cycle yielding monthly-mean departures.
  2. Spatially average over the domain.

We examine the two ordering options, with steps performed in the order shown above and then in the reverse order.

Processing Option 1: Remove the annual cycle, then spatially average to create a single time series

Back to Top

First at each location (grid cell) we will remove the climatological mean annual cycle to produce monthly mean departures or anomalies relative to the climatological annual cycle. Then we will spatially average the anomalies to produce a time-series of regional-mean anomalies.

In the next line of code, the ANNUALCYCLE.departures calculates the average monthly temperature value for each of the 12 months in a year over the complete time period for each location in the input data file and determines the departure of each temperature (at each time and location) from this average (i.e., "climatological") monthly value.

For example, once an average January value for the entire dataset has been calculated, each January value in the dataset is subtracted from that average January value to yield a series of January departures for each year in the data set. Since there are 1680 months in the dataset, there are 1680/12 = 140 years of data, and therefore 140 January departures. Since there are 140 Februaries, 140 Marches, etc. there are 140 departures x 12 months = 1680 departures for each location in the dataset, as the .shape method shows (1680 departure values by 13 latitude bands, by 16 longitude values).

Now that we have a time series of monthly departures at each grid cell, we can spatially average them over the entire domain to obtain a single area-weighted time-series for the regional-mean monthly anomalies:

Using the .shape method below verifies that the resulting spatially averaged data no longer have latitude and longitude information.

Let's plot the resulting time series of the departures (i.e. the result of removing the annual cycle before averaging spatially).

Notice how, with the annual cycle removed, it is easier to see the trend and which months are anomalously warm or cold (compared to the climatological mean temperature).

It should be noted that with the order of operations executed under this option, we cannot expect the mean of the anomalies for a given month of the year summed over all years to be exactly zero. In the case considered here, for example, the climatological monthly mean anomalies are:

Although for any individual cell the anomalies do sum to zero for each month of the year, when the anomalies are spatially averaged and there are missing values this cannot be guaranteed. This is because when there are missing values, the order of averaging data sequentially over two dimensions can depend on the order it is done, as demonstrated earlier. This means that under Processing Option 1, additional care must be taken in calculating the anomalies. Although applying appropriate weighting during the averaging procedures (over time and over space) can remedy the problem, an easier solution is to remove a mean value (over all years for a given month of the year) from that month's temperature anomaly (datmskd_departures_ts) to obtain anomalies (datmskd_departures_ts_corrected) with means of zero:

The time series mean should now be zero (within the limits of machine precision):

Numpy calculates an unweighted mean by default, whereas the cdutils averager method calculates a weighted mean by default.

In this case, both the weighted and unweighted means are essentially zero (within the limits of machine precision, as mentioned above), but the following lines of code prove that numpy calculates an unweighted mean (which is the same as cdutil.averager with the weights set to "unweighted").

and cdutils calculates a weighted mean by default:

In the plot below, the mean shown in the upper left corner is cdutil's weighted mean.

Processing Option 2: Spatially average data to obtain a single time-series, then remove the annual cycle

Back to Top

Now let's reverse the order of performing the operations. We first calculate the spatially-averaged time series and then remove the annual cycle.

We calculate the time series characterizing area-mean temperature for the masked dataset by spatially averaging the temperature values over all latitude and longitude locations to give a single global temperature time series. (Again, the .shape method, shows we are looking at 1680 values with no latitude or longitude, as expected.)

Let's take a look at this time series. Note that the trend and the annual cycle are similar but not identical to what we saw with the unmasked data. In particular the maximum value of the spatial mean (considering all times) is lower than before (300.556 now compared to 301.658 for the unmasked data). This is because temperatures at individual grid cells that are within 7 K of the maximum temperature have been eliminated from consideration ("masked").

Next we'll remove the annual cycle using the ANNUALCYCLE.departures method as we did in executing Option 1 above. Again, the method calculates an average temperature value for each month of the year and determines the departure of the temperature at each time value from the average monthly temperature, effectively removing the annual cycle from the data.

The time series mean should be zero (within the limits of machine precision):