Evaluation of Extreme Temperatures Over Australia in the Historical Simulations of CMIP5 and CMIP6 Models
Abstract
Historical simulations of models participating in the sixth phase of the Coupled Model Intercomparison Project (CMIP6) are evaluated over 10 Australian regions for their performance in simulating extreme temperatures, among which three models with initial-condition large ensembles (LEs) are used to estimate the effects of internal variability. Based on two observational data sets, the Australian Water Availability Project (AWAP) and the Berkeley Earth Surface Temperatures (BEST), we first analyze the models' abilities in simulating the probability distributions of daily maximum and minimum temperature (TX and TN), followed by the spatial patterns and temporal variations of the extreme indices, as defined by the Expert Team on Climate Change Detection and Indices (ETCCDI). Overall, the CMIP6 models are comparable to CMIP5, with modest improvements shown in CMIP6. Compared to CMIP5, the CMIP6 ensemble tends to have narrower interquartile model ranges for some cold extremes, as well as narrower ensemble ranges in temporal trends for most indices. Over southeast, tropical, and southern regions, both CMIP ensembles generally exhibit relatively large deficiencies in simulating temperature extremes. We also confirm that internal variability can affect the trends of the extremes and there is uncertainty in representing the irreducible variability among different LEs in CMIP6. Furthermore, the evaluation based on Perkins' skill score (PSS) and root-mean-square error (RMSE) in the three LEs does not directly correlate with the ranges of the trends for extreme temperatures. The findings of this study are useful in informing and interpreting future projections of temperature-related extremes over Australia.
Key Points
-
The assessment of the probability distributions of daily maximum and minimum temperature makes the evaluation of extremes more robust
-
Temperature extremes over Australia are broadly similar in CMIP5 and CMIP6
-
There are differences in estimating internal variability across multiple CMIP6 models
1 Introduction
Extreme temperatures pose severe threats to human society and the natural environment, affecting human health, energy consumption, agriculture, and ecosystems (Intergovernmental Panel on Climate Change [IPCC], 2012). During recent decades, distinct warming trends have been documented (e.g., Donat et al., 2013; Perkins-Kirkpatrick & Lewis, 2020) and attributed to anthropogenic influence (e.g., Diffenbaugh et al., 2017; Fischer & Knutti, 2015; Min et al., 2011), which may further change the severity of these impacts (IPCC, 2013). In Australia, observations also show clear warming trends in extreme temperatures, which are represented by most global climate models (GCMs) relatively well (e.g., Alexander & Arblaster, 2009, 2017). However, the Australian climate is highly variable (e.g., Herold et al., 2018; Westra et al., 2016), making the analysis of extremes over different subregions more complex. The climate over Australia is related to a variety of physical mechanisms and teleconnections to modes of internal climate variability. For example, the frequency of heatwaves over southern and northern parts of Australia can be influenced by the El Niño-Southern Oscillation (ENSO); and for southeastern Australia, there is a positive correlation between the Southern Annular Mode (SAM) and heatwave frequency (Perkins et al., 2015). In this study, we utilize state-of-the-art climate models to investigate whether there are improvements in simulating extreme temperatures and the effects of internal variability over different Australian regions.
To understand how extreme temperatures evolve in past, present, and future climates, GCMs are the main tools available. The GCMs in the sixth phase of the Coupled Model Intercomparison Project (CMIP6; Eyring et al., 2016), organized by the Working Group on Coupled Modelling (WGCM) of the World Climate Research Programme (WCRP), recently became available and will contribute to the Intergovernmental Panel on Climate Change (IPCC) Sixth Assessment Report (AR6). Compared to the previous phase, CMIP5 (Taylor et al., 2012), the models in CMIP6 generally have finer model resolution and improved physical processes (Eyring et al., 2016; Stouffer et al., 2017). However, the improvements in model configuration may not always lead to better simulations. For example, recent studies (e.g., Meehl et al., 2020; Tokarska et al., 2020; Zelinka et al., 2020) have shown that equilibrium climate sensitivity (ECS), a measure of how much global surface temperature changes once equilibrium is reached in response to an instantaneous doubling of CO2, has a greater range in CMIP6 (1.8°C–5.6°C) than in CMIP5, which is likely due to the representation of cloud feedbacks and cloud-aerosol interactions.
Extreme temperature can be measured in many ways. The Expert Team on Climate Change Detection and Indices (ETCCDI), organized by the joint World Meteorological Organization (WMO) Commission on Climatology/WCRP project on Climate Variability and Predictability/Joint Technical Commission for Oceanography and Marine Meteorology, defines 16 core indices (Zhang et al., 2011), which are based on daily-scale data and usually describe extremes on an annual basis. Compared to other indices or methods that describe temperature extremes, such as extreme value theory (e.g., Coles, 2001; Kharin et al., 2007; Kharin et al., 2013; Perkins et al., 2014; Zwiers et al., 2011) and the frequency of record-breaking high or low monthly temperatures (Meehl et al., 2009), the ETCCDI indices are consistent, widely used, and easy to interpret (e.g., Alexander & Arblaster, 2017; Kim et al., 2020; Klein Tank et al., 2009; Sillmann et al., 2013; Zhang et al., 2011). In a global study using these extreme indices, Sillmann et al. (2013) found that the inter-model spread for extreme temperatures decreases in CMIP5, compared to CMIP3. In an updated analysis of Sillmann et al. (2013), Kim et al. (2020) concluded that there are limited improvements for CMIP6 models in simulating temperature extremes, both globally and regionally; some systematic biases (e.g., the cold bias in cold extremes over high-latitude regions) still exist. In Australia, there are distinct warming trends in CMIP5 models for most locations, but cold extremes are generally overestimated, and warm extremes underestimated (Alexander & Arblaster, 2017). CMIP6 has not yet been analyzed in terms of the ETCCDI indices over Australian subregions, nor have the CMIP5 and CMIP6 indices been compared.
Furthermore, to obtain reliable information on regional climate, it is critical to determine the effects of internal variability and its influence on the evaluation of extreme temperatures over different Australian regions. Internal variability emerges from the chaotic nature of the climate system, contributing to the observed fluctuations in temporal variations of temperature and hydro-meteorological variables (Deser, Knutti, et al., 2012; Deser et al., 2020; Deser, Phillips, et al., 2012; Lehner et al., 2020; Sippel et al., 2020; Xie et al., 2015). Although internal variability can be predicted over limited lead times (e.g., the predictability of ENSO), on longer timescales it is fundamentally unpredictable and irreducible (e.g., Bengtsson & Hodges, 2019; Deser, Knutti, et al., 2012; Deser et al., 2020; Deser, Phillips, et al., 2012; Lehner et al., 2020; Lorenz, 1963; Shepherd, 2014; Xie et al., 2015). Previous studies on regional climate have shown that this inherently irreducible variability can contribute to climate changes and mask the forced responses over multidecadal timescales (e.g., Dai & Bloecker, 2019; Deser, Knutti, et al., 2012; Deser et al., 2020; Deser, Phillips, et al., 2012; Hawkins & Sutton, 2009; Hegerl et al., 2015; Hu & Deser, 2013; Perkins & Fischer, 2013; Perkins-Kirkpatrick et al., 2017). For example, Perkins and Fischer (2013) concluded that internal variability could result in larger or smaller trends in heatwaves than observed over Australia. Neglecting the effects of the irreducible variability may make evaluations of regional climate misleading (Dai & Bloecker, 2019).
In addition to internal variability, there are two other major sources of uncertainty, which are sampled by the multimodel ensembles of CMIP5 and CMIP6. The first is the uncertainty from radiative forcing, including natural external forcing (e.g., solar variability and volcanic sulfur emissions) and anthropogenic influences (e.g., greenhouse gas [GHG] emissions and land-use change), which exists because the effective radiative forcing can differ among the models within each CMIP ensemble; the second is the climate response uncertainty (previously termed "model uncertainty"), which occurs because different model structures may exhibit diverse climate responses to the same radiative forcing (e.g., Lehner et al., 2020; McKinnon & Deser, 2018; Stouffer et al., 2017). Consequently, a multimodel ensemble is not ideal for isolating and quantifying the effects of internal variability (Lehner et al., 2020).
In contrast, a single-model initial-condition large ensemble (SMILE; hereafter LE) is well suited to evaluating the influences of internal variability, as external forcing and model structure are identical among the members. The spread across the members within an LE, resulting from slightly different initial conditions, is considered to be induced by internal variability, and the multi-member average represents the forced response (e.g., Dai & Bloecker, 2019; Deser, Knutti, et al., 2012; Deser et al., 2020; Deser, Phillips, et al., 2012; Lehner et al., 2020; Maher et al., 2020; Schlunegger et al., 2020). However, recent studies show that different LEs, most of which are built upon CMIP5-class models, can produce different estimates of internal variability, altering the relative importance of internal variability compared to other sources of uncertainty (e.g., Lehner et al., 2020; Maher et al., 2020; Schlunegger et al., 2020). To date, estimates of the influence of internal variability on regional temperature extremes based upon CMIP6-class models have not been documented.
Since CMIP6 has not been analyzed in terms of the ETCCDI indices over Australian regions and the effects of internal variability on extreme temperatures need to be taken into consideration, the aims of this study are to compare the performance of CMIP6 models with CMIP5 in simulating temperature extremes over Australian regions and to investigate how internal variability influences the evaluation of extreme temperatures. The study is organized as follows: Section 2 introduces the observed and model data. The methods are summarized in Section 3. Section 4 describes the results, and the discussion and conclusions are presented in Section 5.
2 Data
Historical simulations from the CMIP5 and CMIP6 multimodel ensembles are analyzed, among which three CMIP6 models provide initial-condition large ensembles (LEs) used to estimate the effects of internal variability:
-
CanESM5-LE: 25 members used, r(1–25)i1p1f1.
-
CNRM-CM6-1-LE: 30 members used, r(1–30)i1p1f2.
-
MIROC6-LE: 50 members used, r(1–50)i1p1f1.
As suggested by previous studies (Alexander & Arblaster, 2017; Sillmann et al., 2013; Srivastava et al., 2020), there are large differences between observational data sets. To robustly validate the simulated results produced by the models from CMIP5 and CMIP6, the Australian Water Availability Project (AWAP; Jones et al., 2009) and the Berkeley Earth Surface Temperatures (BEST; Rohde, Muller, Jacobsen, Muller, et al., 2013; Rohde, Muller, Jacobsen, Perlmutter, et al., 2013) are employed here.
AWAP is generated by the Bureau of Meteorology (BOM), the Bureau of Rural Sciences, and the Commonwealth Scientific and Industrial Research Organization, and aims to understand the terrestrial water balance of Australia and the responses of land surface changes to climate variability and change (Jones et al., 2009). The gridded data set includes rainfall, temperature, vapor pressure, solar exposure, and the normalized difference vegetation index (NDVI) at a horizontal resolution of 0.05° × 0.05° (approximately 5 × 5 km2) over the period 1911–present. Although analyses over data-sparse regions (e.g., central Western Australia) should be treated with caution, as the station network has changed over time (e.g., stations coming in and out of use) (Alexander & Arblaster, 2017; King et al., 2013), AWAP is a high-quality observed data set over Australia (King et al., 2017) and serves as the primary reference data set in this study.
As a global observational data set, BEST is also analyzed in this study; it provides daily maximum and minimum temperatures from 1880 to the present (Rohde, Muller, Jacobsen, Muller, et al., 2013; Rohde, Muller, Jacobsen, Perlmutter, et al., 2013). Compared to other global observational data sets (e.g., the Global Precipitation Climatology Project [GPCP] and the Global Historical Climatology Network [GHCN]), the Berkeley data have a relatively high resolution (1° × 1°) and cover a longer period. Moreover, more station records (around 37,000) are incorporated into the data set, compared to 5,000–7,000 records for other global data sets. Although Berkeley Earth claims to address some major concerns (e.g., data selection, data adjustment, poor station quality, and the urban heat island effect) systematically and objectively, some issues remain. In an analysis over Canada, Way et al. (2017) found that monthly minimum temperatures in BEST show larger biases and a systematic underestimation of warming compared to monthly maximum temperatures, suggesting that some inhomogeneities in the raw data are not accounted for by Berkeley's algorithm. This study also provides an opportunity to check the validity of BEST in measuring temperature extremes over Australia.
3 Methods and Data Processing
3.1 Perkins' Skill Score
Following Perkins et al. (2007), the PSS measures the overlap between an observed and a simulated probability distribution by summing, over all bins, the minimum of the two relative frequencies: PSS = Σ min(Z_m, Z_o), where Z_m and Z_o are the simulated and observed frequencies in each bin. A score of 1 (or 100%) indicates that the model reproduces the observed distribution perfectly, whereas a score near 0 indicates negligible overlap.
In this study, since the definitions of the ETCCDI indices are based on TX and TN, it is necessary to examine the models' ability to simulate the distributions of TX and TN before calculating the indices. Consequently, we use the PSS to assess the overall similarity between the observed and simulated distributions (e.g., Kumar et al., 2014; Lewis, 2018; Perkins et al., 2007).
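The PSS calculation can be sketched as follows; this is a minimal illustration (function and variable names are ours, not from the study), binning two daily temperature samples on a common set of bin edges and summing the per-bin minimum frequency:

```python
import numpy as np

def perkins_skill_score(obs, model, bin_width=0.5):
    """Perkins' skill score: summed overlap of two empirical PDFs.

    obs, model: 1-D arrays of daily temperatures (deg C).
    bin_width: bin size in deg C (0.5 deg C, as used in this study).
    Returns a score in [0, 1]; 1 means identical distributions.
    """
    obs = np.asarray(obs, float)
    model = np.asarray(model, float)
    # Common bin edges spanning both samples.
    lo = np.floor(min(obs.min(), model.min()))
    hi = np.ceil(max(obs.max(), model.max()))
    edges = np.arange(lo, hi + bin_width, bin_width)
    # Relative frequency in each bin, then the summed per-bin minimum.
    f_obs = np.histogram(obs, bins=edges)[0] / obs.size
    f_mod = np.histogram(model, bins=edges)[0] / model.size
    return float(np.minimum(f_obs, f_mod).sum())
```

A perfectly matching distribution scores 1, while a model distribution shifted warm or cold relative to observations loses overlap in the tails and scores below 1.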
3.2 Extreme Temperature Indices
The ETCCDI indices used in this study are outlined in Table S3. The indices defined in Zhang et al. (2011) can be classified into four categories: absolute indices (e.g., hottest day [TXx]), threshold-based indices (e.g., frost days [FD]), percentile indices (e.g., cold days [TX10p]), and duration indices (e.g., cold spell duration index [CSDI]). Since the definitions of growing season length (GSL) and ice days (ID) are not suitable over most of Australia (Alexander & Arblaster, 2017), they are excluded here. Furthermore, compared to previous studies (e.g., Alexander & Arblaster, 2017; Sillmann et al., 2013), the bootstrap resampling procedure proposed by Zhang et al. (2005) is also applied to the calculations of warm spell duration index (WSDI) and CSDI, and the spells crossing year boundaries are taken into consideration.
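To make the index categories concrete, the sketch below computes one absolute index (TXx) and one threshold-based index (FD) from a single grid box's daily series; the function names and input layout are illustrative assumptions, not the ETCCDI reference implementation:

```python
import numpy as np

def annual_txx(tx, years):
    """TXx (hottest day): annual maximum of daily maximum temperature."""
    tx, years = np.asarray(tx, float), np.asarray(years)
    return {int(y): float(tx[years == y].max()) for y in np.unique(years)}

def annual_fd(tn, years):
    """FD (frost days): annual count of days with daily TN < 0 deg C."""
    tn, years = np.asarray(tn, float), np.asarray(years)
    return {int(y): int((tn[years == y] < 0.0).sum()) for y in np.unique(years)}
</n```

Percentile and duration indices (e.g., TX10p, WSDI) additionally require base-period percentiles and, for the base period itself, the Zhang et al. (2005) bootstrap, which is omitted here for brevity.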
We use 30-year climatologies to investigate spatial patterns. For temporal variations, the trends of the time series of the ETCCDI indices are estimated with the Theil-Sen estimator, and the Mann-Kendall nonparametric test is used for significance testing (e.g., Alexander & Arblaster, 2009; Dey et al., 2019).
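A minimal sketch of this trend estimation, assuming an annual index series at yearly resolution: SciPy provides the Theil-Sen slope directly, and since SciPy has no dedicated Mann-Kendall routine, the p-value of Kendall's tau between time and the series is used here as a stand-in significance test (the Mann-Kendall statistic is built from the same concordant/discordant pair counts):

```python
import numpy as np
from scipy import stats

def index_trend(values, years, alpha=0.05):
    """Theil-Sen slope of an annual index series, with significance.

    Returns (slope per year, significant at level alpha?).
    """
    values, years = np.asarray(values, float), np.asarray(years, float)
    slope, intercept, lo_slope, hi_slope = stats.theilslopes(values, years)
    tau, p_value = stats.kendalltau(years, values)  # Mann-Kendall stand-in
    return float(slope), bool(p_value < alpha)
```

For the regional results in Section 4, this would be applied to each region-averaged index series over 1950–2005 (CMIP5) or 1950–2014 (CMIP6).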
3.3 Model Performance Metric
The root-mean-square error (RMSE) between the simulated and observed climatologies of the ETCCDI indices over the base period 1961–1990 is used as the overall performance metric for each model over each region.
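As a simple sketch (unweighted, with NaN used for masked grid boxes; the study's exact spatial weighting is not specified here), the regional RMSE can be computed as:

```python
import numpy as np

def regional_rmse(model_clim, obs_clim):
    """RMSE between simulated and observed climatological index fields.

    model_clim, obs_clim: arrays of 30-yr climatological index values over
    a region's grid boxes, with NaN where grid boxes are masked out.
    """
    d = np.asarray(model_clim, float) - np.asarray(obs_clim, float)
    return float(np.sqrt(np.nanmean(d ** 2)))
```

Lower values indicate that a model's climatological pattern is closer to the observational reference.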
3.4 Data Processing
To investigate the extreme temperatures over Australia in more detail, Australia is divided into nine subregions shown in Table S4 and Figure S1, based on a study by Perkins et al. (2014) and the BOM (http://www.bom.gov.au/climate/change/about/temp_timeseries.shtml). Together with the whole continent, ten regions were determined according to climatological and geographical conditions, abbreviated AUS (Australia), NA (Northern Australia), SA (Southern Australia), SEA (South East Australia), MEA (Middle Eastern Australia), TA (Tropical Australia), SWA (South West Australia), SSA (Southern South Australia), CAU (Central Australia), and MWA (Mid-Western Australia). Since there has been an increase in in-situ observations since 1950, the analysis is carried out from 1950, and the base period is 1961–1990, which is commonly used and allows for a standardized quantification.
The observed and simulated data sets of TX and TN are first regridded to 1° × 1° resolution using bilinear interpolation; the calculations of the extreme indices are then performed. It is noted that reversing this order of operations may have significant effects on the resulting gridded values (e.g., Avila et al., 2015; Chen & Knutson, 2008; Herold et al., 2017; Zhang et al., 2011). For example, indices sensitive to resolution choice (e.g., maximum 1-day precipitation amount) are substantially altered when the order of operations is changed (Herold et al., 2017). In addition, following the practice in King et al. (2015), grid boxes containing less than 75% land are masked out. To calculate the trends, we first average each ETCCDI index over the regions; the Theil-Sen estimator and the Mann-Kendall nonparametric test are then applied to the time series of the extreme indices.
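The masking and regional-averaging steps can be sketched as below; the function name, argument layout, and use of a cos(latitude) area weight are our illustrative assumptions (the study does not spell out its weighting):

```python
import numpy as np

def masked_regional_mean(field, land_frac, lats, threshold=0.75):
    """Mask grid boxes with < 75% land, then take a cos(lat)-weighted mean.

    field, land_frac: 2-D (lat, lon) arrays on the common 1-degree grid;
    lats: 1-D latitudes of the rows.
    """
    field = np.asarray(field, float)
    # Keep only grid boxes that are mostly land and have data.
    valid = (np.asarray(land_frac, float) >= threshold) & ~np.isnan(field)
    # Area weight: grid-box area shrinks with cos(latitude).
    w = np.broadcast_to(
        np.cos(np.deg2rad(np.asarray(lats, float)))[:, None], field.shape
    )
    return float((np.where(valid, field, 0.0) * w).sum() / w[valid].sum())
```

Applying this to each year of an index field yields the regional time series to which the trend estimators are then applied.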
In Section 4, each model is first evaluated for TX and TN using the PSS, which is based on the probability distributions between observations and simulations over the regions for the period 1950–2005. Second, the ETCCDI indices are analyzed in terms of the spatial patterns during the base period, which are evaluated by RMSE. The temporal variations for the extreme indices in CMIP5 (1950–2005) and CMIP6 (1950–2014) are then investigated (the boxplots of the trends for the observed and simulated data from 1950 to 2005 are indicated in Figure S11). Finally, based on the three LEs in CMIP6, the influences of internal variability on the evaluation of extreme temperatures in the single-member multimodel ensembles are estimated.
4 Results
4.1 Probability Distributions and PSS
Figures 1-4 show the probability distributions of TX and TN and their PSSs over the Australian regions during the period 1950–2005 for AWAP, BEST, CMIP6, and CMIP5 models. Bin sizes of 0.5°C were used. For the probability distributions of TX (Figure 1), the two observations are generally comparable over the regions, though there are slight differences between AWAP and BEST over the regions SWA and SSA. In contrast, the probability distributions of TN (Figure 2) in the two observed data sets show larger differences over most regions (except NA). Overall, for TN, BEST tends to have right-shifted distributions (warmer-side tails), with higher peaks over the northern regions and lower peaks over the southern regions, compared to AWAP.

Probability distributions of daily maximum temperature (TX) during the period 1950–2005 over Australian regions for AWAP (black), BEST (yellow), the multimodel medians in CMIP6 (red), and CMIP5 (blue). Shading denotes the full range across the models in each CMIP ensemble.

Same as Figure 1, but for daily minimum temperature (TN).

Perkins' skill scores (PSSs) for probability distributions of TX in the CMIP6 (colored circles) and CMIP5 (black asterisks) models over Australian regions, relative to AWAP and BEST. The black squares and triangles are the multimodel means from CMIP6 and CMIP5, respectively.

Same as Figure 3, but for TN.
For both TX and TN, the multimodel medians in CMIP6 and CMIP5 are generally similar over all regions (Figures 1 and 2). Compared to AWAP, the medians of the two CMIP ensembles for the probability distributions of TX tend to overestimate the lower tails and slightly underestimate the upper tails in Figure 1. For TN (Figure 2), the lower tails are underestimated and the upper tails overestimated. Furthermore, the medians in CMIP6 and CMIP5 are more analogous to AWAP than to BEST, and the medians fit the observations much better for TX than for TN. The model spread, as measured by the full range of the multimodel ensemble in each CMIP, tends to be larger in the upper tails and narrower in the lower tails for CMIP6 compared to CMIP5 (Figure 1). This suggests that more models in CMIP6 tend to show warmer distributions. In particular, for the probability distributions of TX in CMIP6 models, the larger spread in the upper tail is mainly caused by the three models CanESM5, MIROC6, and MRI-ESM2 (not shown).
In Figures 3 and 4, compared to both observations, the multimodel means of PSSs in the CMIP6 and CMIP5 models are generally around 90%, which implies that both CMIP ensembles simulate the daily-scale extreme temperatures similarly and relatively well. Lower multimodel mean PSSs are found for TX over TA (∼83%) and for TN over SEA (∼82%), TA (∼84%), and SSA (∼84%). Also, over most regions shown in Figure 4 (e.g., AUS, NA, and MEA), higher scores relative to AWAP do not correspond to higher scores relative to BEST, suggesting that the two observed data sets differ significantly and that TN in BEST may be biased. This further implies that AWAP is more appropriate for the evaluation of TN and cold extremes over Australian regions, and results for cold extremes based on TN in BEST should be interpreted cautiously in the following sections. For the model spreads of PSSs, the full ranges for the probability distributions of TX and TN in CMIP6 are commonly wider than in CMIP5 over the regions. This could be because several models in CMIP6, such as MIROC6 and NorCPM1, show relatively low scores. It is also noted that the models with higher resolution (e.g., MRI-ESM2-0) do not generally show higher scores than those with relatively coarse resolution (e.g., FGOALS-g3; Figures 3 and 4). As changes in temperature may be more related to large-scale meteorological patterns (Grotjahn et al., 2016), the relatively lower PSSs in some higher-resolution models may result from the generation of unrealistic local details (e.g., soil moisture) in simulations (Lau & Nath, 2012).
In general, models in CMIP6 and CMIP5 can perform quite differently depending on whether AWAP or BEST is the reference, especially for TN; and the multimodel means and spreads of PSSs over most regions in CMIP6 are comparable to those in CMIP5, though the multimodel means are typically slightly lower in CMIP6 for both TX and TN over most regions (Figures 3 and 4). This is because some models in CMIP6, which usually produce lower scores, collectively reduce the ensemble mean. Compared to AWAP (Figure 3), MIROC6, NorCPM1, IPSL-CM6A-LR, and CanESM5 usually have lower scores. Of those, NorCPM1 and IPSL-CM6A-LR have cold shifts, while warm shifts occur for MIROC6 and CanESM5 (not shown). In contrast, the PSSs of MIROC-ES2L, MIROC6, MPI-ESM1-2-HR, and NorESM2-LM for the probability distributions of TN are usually lower over Australian regions (Figure 4), and all of these have warm shifts (not shown). It is interesting to note that the model MIROC-ES2L typically has lower PSSs in Figure 4 but relatively higher scores in Figure 3, implying that MIROC-ES2L tends to simulate higher temperatures over Australia. The results based on PSSs suggest that when using historical simulations from the above models to calculate extremes, the results should be interpreted with caution.
4.2 Spatial Patterns of Climatologies
Examining the extreme temperature indices averaged over the period 1961–1990 helps us to determine the magnitude and spatial distributions of model bias. The 30-year climatologies of TXx, TNn (coldest night), and diurnal temperature range (DTR) for the observations and the medians from the CMIP6 and CMIP5 models are shown in Figures 5-7, as well as their differences. The climatological patterns of other indices, including coldest day (TXn), warmest night (TNx), WSDI, CSDI, summer days (SU), tropical nights (TR), and FD, are shown in Figures S2–S8. Except for DTR (Figure 7), AWAP and BEST exhibit similar patterns for the other temperature indices (Figures 5, 6, and S2–S8). Overall, compared to AWAP, the magnitudes in BEST for most indices are higher over most parts of Australia, although the absolute values of TXx (Figure 5) and FD (Figure S8) in BEST are generally lower. The negligible variations of DTR in BEST (Figure 7b) should be treated with caution; they can result from the biased TN shown in Figures 2 and 4, and may also be related to the minimization step in Berkeley's homogenization algorithm, which minimizes the mean square of the local weather term and thereby suppresses regional differences to some extent (Rohde, Muller, Jacobsen, Muller, et al., 2013; Rohde, Muller, Jacobsen, Perlmutter, et al., 2013). This issue needs further investigation but is beyond the scope of our study. In Figures 5-7, the differences between AWAP and BEST are largest over the southwest for TXx, the north for TNn, and central Australia for DTR, which suggests that the evaluation of the models can differ considerably depending on the observed data set used. The observational uncertainty can be of the same magnitude as the model biases (e.g., CSDI) or even larger (e.g., TNx in Figure S3 and TR in Figure S7).

Spatial patterns of the 30-year climatological TXx (1961–1990) over Australia for (a) AWAP, (b) BEST, (d) the multimodel medians in CMIP5 (termed “CMIP5_Median”) and (g) CMIP6 (termed “CMIP6_Median”); and the biases for (c) AWAP–BEST, (e) CMIP5_Median–AWAP, (f) CMIP5_Median–BEST, (h) CMIP6_Median–AWAP and (i) CMIP6_Median–BEST.

Same as Figure 5, but for TNn.

Same as Figure 5, but for DTR.
The observed climatological indices are reasonably well represented by the models from CMIP6 and CMIP5. However, similar to CMIP5, systematic errors still exist in the CMIP6 multimodel medians. As shown in Figures 5-7 and S2–S8, the distinct differences are usually located over the eastern part of tropical Australia, southeast and western Australia. For example, for TXx, there are cold biases over southwest Australia and warm biases over southeast Australia (Figure 5h). In general, relative to AWAP, the multimodel medians of CMIP6 appear to show improvements for some indices (e.g., TXx, TXn, CSDI, and SU), compared to CMIP5.
To investigate the regional performance of CMIP6 models, box-and-whisker plots are employed to show the ETCCDI indices over Australian regions (Figure 8). The boxes indicate the interquartile model spreads (range between the 25th and 75th percentiles), the black lines within the boxes are the multimodel medians, the whiskers extend to the edges of 1.5 × interquartile ranges, and “outlier points” that fall outside of the whiskers are denoted by diamonds. Except for DTR, BEST exhibits broadly higher values than AWAP over most regions (Figures 8b–8i). However, for TXx (Figure 8a) and FD (Figure 8j), the magnitudes of indices in BEST are generally lower than AWAP. Moreover, the differences between the observational data sets can be comparable to the interquartile range of the models from CMIP6 and CMIP5 over most regions for many indices (except TXx, TXn, SU, and FD). This implies that based on different observational data, the model evaluation results may differ, which is consistent with previous studies (e.g., Kim et al., 2020; Sillmann et al., 2013; Srivastava et al., 2020). For future studies over Australian regions, we advocate the use of AWAP over BEST, as both TX and TN are more realistic in AWAP.

Boxplots of the 30-year climatologies of the 10 ETCCDI indices for the CMIP6 (red) and CMIP5 (blue) models over Australian regions. The boxes indicate the interquartile spreads (ranges between the 25th and 75th percentiles), the black lines within the boxes are the multimodel medians, the whiskers extend to the edges of 1.5 × interquartile ranges and “outliers” outside of the whiskers are denoted by diamonds. The circles represent the indices in AWAP (black) and BEST (yellow).
Compared to AWAP, the multimodel medians of CMIP6 tend to overestimate the duration indices (i.e., WSDI and CSDI) over all Australian regions. Among the absolute and threshold indices, TXx, TXn, DTR, SU, and FD are commonly underestimated by CMIP6 over most regions (except TXx, TXn, and SU over SEA); while the medians in CMIP6 models overestimate TNx, TNn, and TR. There are relatively higher biases between AWAP and the medians in the CMIP6 models over some regions such as SEA, MEA, TA, and CAU, compared to other regions.
For the comparison between CMIP6 and CMIP5 models, the multimodel medians and interquartile model ranges are analyzed and shown to be broadly comparable. The distinct differences for the medians are among the absolute indices. For TXx and TNx, the medians in CMIP6 models are higher than CMIP5. In contrast, for TXn and TNn, CMIP6 shows lower values over most regions (except for TXn over the regions CAU and MWA). The interquartile model ranges in CMIP6 tend to be lower than CMIP5 for TNn, WSDI, and CSDI over most regions, which indicates that the model uncertainty in CMIP6 may be reduced. However, over some regions such as NA, TA, and MEA, the interquartile range tends to be larger for some indices, compared to other regions, suggesting that models simulating the extremes over these regions may have more uncertainty.
4.3 Metric Evaluation
With respect to AWAP, the RMSEs for the CMIP6 and CMIP5 models are used to assess the models' overall performance in simulating the extreme temperature indices averaged over the base period 1961–1990 for the Australian regions (Figure 9; RMSEs based on BEST are shown in Figure S9). The medians of the RMSEs in the two ensembles are commonly higher for the indices over tropical and eastern Australia (Figure 9). Similar to previous studies (e.g., Srivastava et al., 2020), the models do not perform consistently well across the regions (not shown). This demonstrates that there is large variability in the performance of the models in simulating different indices over different regions, suggesting that different indices or regions are influenced by different processes. For example, in CMIP6, the model MIROC-ES2L has higher RMSEs across all regions for TNn, while its performance in simulating TXn is relatively better than that of other models (lower RMSEs). Overall, the RMSEs in CMIP6 indicate that the models need further improvement over the regions MEA, TA, CAU, and MWA. The models HadGEM3-GC31-MM, HadGEM3-GC31-LL, and GFDL-CM4 are commonly among the best performers, while NorCPM1, NorESM2-LM, and MIROC6 tend to show higher RMSEs (not shown). However, the evaluation of NorESM2-LM needs to be interpreted cautiously, since its performance as estimated by the PSSs is broadly modest among the CMIP6 models.

Boxplots of RMSEs for the 14 ETCCDI indices in the CMIP6 (red) and CMIP5 (blue) models over Australian regions, with respect to AWAP. The boxes indicate the interquartile spreads (ranges between the 25th and 75th percentiles), the black lines within the boxes are the multimodel medians, the whiskers extend to the edges of 1.5 × interquartile ranges and “outliers” outside of the whiskers are denoted by diamonds. The colored squares represent the multimodel means of RMSEs calculated from CMIP6 and CMIP5 with respect to AWAP and BEST, termed “CMIP6_AWAP_Mean” (purple), “CMIP5_AWAP_Mean” (orange), “CMIP6_BEST_Mean” (green), and “CMIP5_BEST_Mean” (cyan), respectively.
Compared to the RMSEs in CMIP5 models, there are some improvements shown in CMIP6. Usually, for some cold extremes (e.g., TNn, warm nights [TN90p], cold nights [TN10p], CSDI, and FD), the interquartile model ranges are commonly narrower in CMIP6. For TXx and SU, the means and medians of RMSEs in CMIP6 are generally lower than CMIP5.
4.4 Temporal Variations
The time series of the anomalies and the actual values of the extreme temperature indices averaged over Australia (10°S–45°S, 110°E–155°E) are shown in Figures 10 and S10, respectively. Furthermore, boxplots of the trends over Australian regions are displayed in Figures 11 and S11, and the percentage of models showing trends of the ETCCDI indices significant at the 95% confidence level is summarized in Table 1.

Figure 10. Time series of the anomalies for the 14 ETCCDI indices averaged over Australia (10°S–45°S, 110°E–155°E) from 1950 to 2014 for AWAP (black), BEST (yellow), CMIP6 (multimodel mean: red solid; multimodel median: red dashed), and CMIP5 (multimodel mean: blue solid; multimodel median: blue dashed). Shading indicates the full range across the models for each CMIP ensemble.

Figure 11. Boxplots of the trends of the 14 ETCCDI indices calculated from 1950 to 2014 for the CMIP6 (red) and CMIP5 (blue) models over Australian regions. The boxes indicate the interquartile spreads (ranges between the 25th and 75th percentiles), the black lines within the boxes are the multimodel medians, the whiskers extend to the edges of 1.5 × interquartile ranges, and "outliers" beyond the whiskers are denoted by diamonds. The circles represent the trends in AWAP (black) and BEST (yellow).
Table 1. Percentage of Models With Trends of the ETCCDI Indices (1950–2014) Significant at the 95% Confidence Level Over Australian Regions

| Region | CMIP Phase | TXx | TXn | TNx | TNn | DTR | TX90p | TX10p | TN90p | TN10p | WSDI | CSDI | SU | TR | FD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUS | CMIP6 | 58.1 | 38.7 | 83.9 | 80.6 | 16.1 | 77.4 | 80.6 | 100.0 | 100.0 | 77.4 | 96.8 | 74.2 | 96.8 | 71.0 |
| | CMIP5 | 53.8 | 15.4 | 61.5 | 61.5 | 11.5 | 65.4 | 46.2 | 92.3 | 96.2 | 57.7 | 76.9 | 46.2 | 84.6 | 42.3 |
| NA | CMIP6 | 61.3 | 25.8 | 87.1 | 67.7 | 22.6 | 80.6 | 71.0 | 96.8 | 100.0 | 80.6 | 96.8 | 64.5 | 100.0 | 41.9 |
| | CMIP5 | 46.2 | 7.7 | 61.5 | 50.0 | 7.7 | 65.4 | 42.3 | 92.3 | 92.3 | 61.5 | 73.1 | 46.2 | 84.6 | 30.8 |
| SA | CMIP6 | 41.9 | 41.9 | 58.1 | 80.6 | 12.9 | 64.5 | 74.2 | 96.8 | 100.0 | 51.6 | 80.6 | 54.8 | 80.6 | 67.7 |
| | CMIP5 | 34.6 | 34.6 | 42.3 | 57.7 | 15.4 | 42.3 | 38.5 | 80.8 | 88.5 | 30.8 | 73.1 | 38.5 | 69.2 | 34.6 |
| SEA | CMIP6 | 12.9 | 51.6 | 38.7 | 58.1 | 16.1 | 54.8 | 80.6 | 93.5 | 90.3 | 32.3 | 67.7 | 45.2 | 67.7 | 64.5 |
| | CMIP5 | 19.2 | 38.5 | 19.2 | 38.5 | 19.2 | 38.5 | 57.7 | 73.1 | 84.6 | 15.4 | 34.6 | 26.9 | 42.3 | 34.6 |
| MEA | CMIP6 | 25.8 | 35.5 | 41.9 | 71.0 | 9.7 | 61.3 | 48.4 | 90.3 | 90.3 | 51.6 | 80.6 | 51.6 | 80.6 | 61.3 |
| | CMIP5 | 30.8 | 26.9 | 34.6 | 38.5 | 19.2 | 42.3 | 42.3 | 80.8 | 80.8 | 42.3 | 61.5 | 50.0 | 73.1 | 38.5 |
| TA | CMIP6 | 61.3 | 22.6 | 87.1 | 67.7 | 19.4 | 77.4 | 67.7 | 96.8 | 96.8 | 80.6 | 93.5 | 58.1 | 93.5 | 3.2 |
| | CMIP5 | 34.6 | 11.5 | 61.5 | 30.8 | 7.7 | 65.4 | 50.0 | 80.8 | 84.6 | 61.5 | 57.7 | 38.5 | 76.9 | 3.8 |
| SWA | CMIP6 | 51.6 | 35.5 | 41.9 | 67.7 | 6.5 | 74.2 | 71.0 | 96.8 | 100.0 | 45.2 | 74.2 | 41.9 | 74.2 | 35.5 |
| | CMIP5 | 50.0 | 38.5 | 38.5 | 50.0 | 19.2 | 42.3 | 42.3 | 61.5 | 84.6 | 19.2 | 46.2 | 23.1 | 53.8 | 15.4 |
| SSA | CMIP6 | 45.2 | 41.9 | 29.0 | 54.8 | 16.1 | 54.8 | 74.2 | 87.1 | 90.3 | 35.5 | 58.1 | 41.9 | 64.5 | 29.0 |
| | CMIP5 | 26.9 | 30.8 | 26.9 | 34.6 | 15.4 | 38.5 | 38.5 | 69.2 | 80.8 | 15.4 | 46.2 | 26.9 | 38.5 | 26.9 |
| CAU | CMIP6 | 58.1 | 19.4 | 74.2 | 64.5 | 16.1 | 67.7 | 51.6 | 93.5 | 90.3 | 71.0 | 77.4 | 54.8 | 83.9 | 32.3 |
| | CMIP5 | 38.5 | 7.7 | 61.5 | 53.8 | 7.7 | 53.8 | 26.9 | 80.8 | 84.6 | 46.2 | 50.0 | 38.5 | 76.9 | 34.6 |
| MWA | CMIP6 | 74.2 | 25.8 | 87.1 | 67.7 | 9.7 | 77.4 | 54.8 | 96.8 | 100.0 | 67.7 | 80.6 | 58.1 | 90.3 | 22.6 |
| | CMIP5 | 61.5 | 19.2 | 53.8 | 50.0 | 7.7 | 57.7 | 26.9 | 84.6 | 80.8 | 57.7 | 61.5 | 38.5 | 65.4 | 11.5 |
- Abbreviations: CMIP5/CMIP6, fifth/sixth phase of the Coupled Model Intercomparison Project; ETCCDI, Expert Team on Climate Change Detection and Indices.
As shown in Figure 10, the temporal variations of the extremes in the two observational data sets are quite similar, and they are reasonably well captured by both CMIP ensembles. However, for some indices, the differences between AWAP and BEST are substantial. For example, the differences between the two observations for TR (Figure S10m) can be as large as the total inter-model range, further indicating that the observational uncertainty can be quite large, likely due to the aforementioned erroneous TN data in BEST. Consistent with Alexander and Arblaster (2017), the temporal variations of TNx, TNn, and TR in AWAP are close to the lower end of the model spread in CMIP6 and CMIP5, while the observed TXn, DTR, and FD tend to be at the upper end (Figure S10). In terms of the model spread, some outliers shown in CMIP5 are corrected in CMIP6 (e.g., the outliers in TN10p and CSDI produced by GFDL-ESM2G in 1964).
The trends of the temperature indices over each region in the observed and simulated data are displayed in Figure 11, and Tables S5 and S6 additionally show the trends and their significance for the indices in the observations. For all the temperature indices, the warming trends in BEST are generally higher than in AWAP over most regions, with the lower BEST warming trends usually located over SSA, CAU, and MWA, which are data-sparse regions. Again, the differences between the observations can be as large as the interquartile model range (e.g., TN10p). Compared to the medians of the CMIP5 models, the medians in CMIP6 are commonly closer to AWAP (e.g., TXx, warm days [TX90p], and SU). Moreover, both the full spreads and the interquartile model ranges tend to be narrower in CMIP6, and a larger portion of CMIP6 models show significant trends compared to CMIP5 (Table 1). This implies that the model uncertainty in CMIP6 may be somewhat reduced, or that neither of the single-member multimodel ensembles in CMIP5 and CMIP6 samples the three kinds of uncertainties reasonably. In Section 4.5, we therefore use the three LEs only to investigate the effects of internal variability, as three models are unlikely to be enough to cover the spread of model uncertainty. For the trends calculated over the period 1950–2005 (Figure S11), the interquartile ranges in CMIP6 are still broadly narrower, while the multimodel medians in CMIP6 indicate lower warming rates than CMIP5. Additionally, the interquartile model ranges in CMIP6 and CMIP5 are usually larger over NA, TA, and MWA for some indices (e.g., TNn, TX90p, TN90p, WSDI, and CSDI), compared to other regions (Figures 11 and S11).
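The trend calculations above can be sketched as an ordinary least-squares fit with a t-test on the slope (a common, simple choice; the paper's exact trend and significance method may differ, and the function name and synthetic series below are ours):

```python
import numpy as np
from scipy import stats

def decadal_trend(years, series, alpha=0.05):
    """Least-squares trend of an annual index series, expressed per
    decade, with a two-sided t-test on the slope at level alpha.
    A sketch only; the paper's exact method is not specified here.
    """
    res = stats.linregress(years, series)
    return 10.0 * res.slope, bool(res.pvalue < alpha)

# Illustrative synthetic TXx series: 0.02 degC/yr warming plus noise
years = np.arange(1950, 2015)
rng = np.random.default_rng(1)
txx = 0.02 * (years - years[0]) + rng.normal(0.0, 0.3, years.size)
trend, significant = decadal_trend(years, txx)
print(f"trend = {trend:+.2f} degC/decade, significant at 95%: {significant}")
```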
4.5 Internal Variability
The ranges of the probability distributions and PSSs for TX and TN in CanESM5-LE, CNRM-CM6-1-LE, and MIROC6-LE are shown in Figures S12–S15. For both TX and TN, the ranges across the members of each LE are quite small (Figures S12–S15), likely because the large amount of daily data makes the influence of internal variability indistinct. The ranges of the PSSs in the three LEs are generally broader for TX (Figure S14) than for TN (Figure S15), and the effects of internal variability in MIROC6-LE are generally larger than in the other LEs (Figure S14). Over NA, MEA, TA, CAU, and MWA, the ranges in the PSSs (based on both observations) for MIROC6-LE are slightly more than 5%, relatively large compared to other regions. For CNRM-CM6-1-LE, the PSSs of TX span larger ranges over NA and TA, while those for CanESM5-LE do so over CAU and MWA. Figures S16–S19 exhibit the 30-year climatologies and RMSEs for TXx and TNn in each LE. Since the 30-year averaging also reduces the magnitude of internal variability, the situation is analogous to that shown in Figures S14 and S15. This demonstrates that the evaluation of extreme temperatures over Australia in the single-member multimodel ensembles of CMIP5 and CMIP6, based on PSS and RMSE, is reasonable, as it is only slightly influenced by internal variability.
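The PSS measures the overlap between two binned empirical probability distributions: it sums, over all bins, the smaller of the two relative frequencies, so identical distributions score 1. A minimal sketch (the bin count and function name are our choices, not the paper's):

```python
import numpy as np

def perkins_skill_score(obs, model, bins=50):
    """Perkins skill score: overlap of two empirical PDFs.

    PSS = sum over bins of min(f_obs, f_model), where f_* are the
    relative frequencies in each common bin; 1 means the distributions
    coincide, 0 means no overlap. The bin count is a tunable choice.
    """
    lo = min(obs.min(), model.min())
    hi = max(obs.max(), model.max())
    edges = np.linspace(lo, hi, bins + 1)
    f_obs = np.histogram(obs, bins=edges)[0] / obs.size
    f_mod = np.histogram(model, bins=edges)[0] / model.size
    return float(np.minimum(f_obs, f_mod).sum())

# Identical samples give a PSS of 1 (up to float rounding)
rng = np.random.default_rng(0)
x = rng.normal(25.0, 5.0, 10_000)
print(perkins_skill_score(x, x))
```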
In Figure 12, although the multi-member means of the TXx trends (i.e., the forced response) increase at a similar rate of ∼0.2°C per decade for all the LEs over the regions, internal variability can alter the regional forced responses, and its effect varies with region and LE. Generally, the TXx trends in all LEs span larger ranges over SEA, MEA, and SSA, while the ranges of the trends are relatively narrow over TA and SWA. Across the LEs, the ranges in MIROC6-LE are much wider, exceeding those in CNRM-CM6-1-LE by a factor of ∼3 or more across the Australian regions. For example, over SSA, the range of TXx trends is ∼0.7°C per decade in MIROC6-LE but ∼0.2°C per decade in CNRM-CM6-1-LE. It is interesting to note that the ranges in MIROC6-LE can be larger than the inter-model spread in CMIP6 (Figure 11) over certain regions (e.g., SEA and MEA), indicating that the inter-model spreads in both CMIP ensembles do not fully represent the range of the major uncertainties. For TNn (Figure 13), all the LEs show relatively comparable ranges over the regions, although the differences in the forced responses among the LEs are slightly larger than those for TXx.
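The decomposition used here — the multi-member mean of an LE's trends as the forced response, the member spread as a measure of internal variability — can be sketched minimally (illustrative only; the function name and synthetic numbers are ours):

```python
import numpy as np

def forced_and_internal(member_trends):
    """Split an initial-condition large ensemble's trends into a
    forced-response estimate (multi-member mean) and an internal-
    variability range (max minus min across members).

    member_trends: 1-D array-like, one trend per ensemble member.
    """
    trends = np.asarray(member_trends, dtype=float)
    forced = float(trends.mean())
    spread = float(trends.max() - trends.min())
    return forced, spread

# Illustrative: 50 members scattered around a 0.2 degC/decade forced trend
rng = np.random.default_rng(2)
members = 0.2 + rng.normal(0.0, 0.05, 50)
forced, spread = forced_and_internal(members)
print(f"forced response ~ {forced:.2f} degC/decade, member range {spread:.2f}")
```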

Figure 12. Boxplots showing the ranges of the TXx trends (°C decade−1) over Australian regions for CanESM5-LE (cyan), CNRM-CM6-1-LE (purple), and MIROC6-LE (green). The boxes indicate the interquartile spreads (ranges between the 25th and 75th percentiles), the black lines and the white squares within the boxes are the multi-member medians and means respectively, the whiskers extend to the edges of 1.5 × interquartile ranges, and "outliers" beyond the whiskers are denoted by diamonds. The horizontal dashed lines represent the trends from the first realizations in the three LEs, respectively.

Figure 13. Same as Figure 12, but for TNn.
Consistent with recent studies (e.g., Lehner et al., 2020; Maher et al., 2020; Schlunegger et al., 2020), the estimates of internal variability differ among CanESM5-LE, CNRM-CM6-1-LE, and MIROC6-LE. Additionally, the evaluation based on PSS and RMSE in the three LEs does not directly correlate with the ranges of the trends for TXx and TNn. For example, despite the high RMSEs and low PSSs in MIROC6-LE (relative to AWAP), its TXx trends exhibit much broader ranges, while its TNn trends are more comparable to those of the other LEs over the regions. In Figures S20 and S21, we plot the TXx and TNn trends for the first 25 members of each LE, which show ranges similar to those in Figures 12 and 13. As suggested by recent studies (e.g., Sippel et al., 2020; Thompson et al., 2015; Xie et al., 2015), model errors likely lead to the differences in the estimates of internal variability among the LEs, which needs further research. Although this model uncertainty highlights that the representation of internal variability itself varies considerably across the three LEs, internal variability remains important in contributing to the inter-model spread of the trends in the two CMIP ensembles.
5 Discussion and Conclusions
This study examines the performance of the newly released CMIP6 models in simulating the 30-year climatologies and time series of extreme temperature indices over Australian regions, with an analysis estimating the influence of internal variability on the evaluation of the extremes based on the single-member multimodel ensembles. Using two observational data sets, AWAP and BEST, for verification, the historical simulations from 31 CMIP6 models are compared with those from 26 CMIP5 models. Since the extreme temperature indices are defined from TX and TN, we also use the PSS to evaluate the models' ability to simulate the probability distributions of TX and TN, from which we expect more robust conclusions. Among the CMIP6 models, three LEs are employed to estimate the effects of internal variability.
Similar to previous studies, the observational uncertainty (e.g., between AWAP and BEST in the spatial pattern of DTR and the time series of TNn) can be substantial. This demonstrates that multiple observational or reanalysis data sets should be employed in evaluation studies of climate models (e.g., Alexander & Arblaster, 2017; Herold et al., 2017; Kim et al., 2020; Sillmann et al., 2013; Srivastava et al., 2020), which also benefits the identification of biases in, and improvements to, the data sets themselves. For example, compared to AWAP, the spatial pattern of DTR in BEST is smoother, which can be influenced by the erroneous TN; it may also be due to the step in the BEST algorithm that minimizes the square of the local weather term. Moreover, while AWAP and BEST use a comparable number of Australian stations (Jones et al., 2009; Rohde, Muller, Jacobsen, Muller, et al., 2013; Rohde, Muller, Jacobsen, Perlmutter, et al., 2013), the interpolation procedures in BEST are more complex. Thus, given their different underpinning methods, it is not surprising that these observational products yield different ETCCDI values. This study indicates that TN in BEST is biased and is not an optimal choice for evaluation over Australia.
Although the performance of the CMIP6 and CMIP5 models in simulating extreme temperatures is comparable, there are some improvements in CMIP6. For TXx and SU, the multimodel means and medians of the RMSEs in CMIP6 are generally lower. In terms of model ranges, the interquartile ranges of the RMSEs for some cold extremes (e.g., TNn, TN90p, TN10p, CSDI, and FD) are usually narrower in CMIP6, as are the interquartile ranges of the temporal trends. In addition, given the results from the PSS, the RMSEs for some individual models need to be interpreted with caution. For example, as MIROC-ES2L is much better at simulating TX than TN, its relatively low RMSEs for some cold extremes are doubtful. Moreover, the low PSSs and high RMSEs for NorCPM1 confirm that it is among the worst performers in simulating extreme heat.
This study confirms that there are model differences in estimating internal variability (e.g., Lehner et al., 2020; Maher et al., 2020; Schlunegger et al., 2020) and further shows that the ranges of the trends may not be related to the performance evaluated by PSS and RMSE (e.g., the ranges of the TNn trends among the three LEs). Although the ensemble sizes needed to sufficiently quantify the magnitude of internal variability remain to be determined, model biases likely drive the different estimates, and addressing this issue requires more attention. Also, since the thermodynamic response to external forcing is better understood than the dynamic response (the factor that primarily regulates the atmospheric circulation; Shepherd, 2014), this may be related to the different relative ranges of the TXx and TNn trends in MIROC6-LE. Another potential way to derive reasonable representations of internal variability is to combine observationally based LEs with model LEs (McKinnon & Deser, 2018; McKinnon et al., 2017). In addition, as suggested by Xie et al. (2015), process-based evaluation methods should be developed, which may help to better understand internal variability in different models.
Over the regions SEA, TA, and SSA, both CMIP ensembles usually show relatively large deficiencies in simulating temperature extremes. As documented in previous studies, over southeast Australia the SAM and the Madden-Julian Oscillation are two important factors related to extremes (Parker et al., 2014; Perkins et al., 2015); in southern Australia, a positive relationship is generally assumed between the Indian Ocean Dipole (IOD) and extreme events (White et al., 2014); and TA can be influenced by the South Pacific convergence zone, tropical cyclones, and ENSO (Perkins et al., 2015; Vincent et al., 2011). Compared to other regions, the narrower ranges of the TXx trends over TA indicated by the three LEs need further investigation. As suggested by Meehl et al. (2020), the range of model results may not represent the full uncertainty; the small ranges may arise because all the LEs miss the same processes or feedbacks. With improvements in understanding the physical processes and reducing model biases, the models' performance in simulating extreme temperatures and estimating the uncertainty can be further improved over the Australian regions. Despite the different estimates among the LEs and across the regions, internal variability can still play an important role in contributing to the inter-model spread. Even after model errors are further reduced, estimating the effects of internal variability will remain essential for adaptation planning on regional scales.
This study provides an assessment of the CMIP6 models' ability to simulate extremes, first analyzing the probability distributions of the daily-scale variables and then calculating the extreme indices. However, it should be recognized that, as more CMIP6 models become available, the conclusions may change to some extent. In the future, remote sensing data may also be assimilated into the observational products, so that robust conclusions can be obtained over data-sparse regions such as western Australia.
Acknowledgments
The authors would like to thank two anonymous reviewers for their constructive comments, which have improved the quality of this manuscript. The authors thank Lisa V. Alexander for feedback and comments and Zeke Hausfather for discussions about Berkeley Earth surface temperature data sets. This research/project was undertaken with the assistance of resources and services from the National Computational Infrastructure (NCI), which is supported by the Australian Government. The authors further acknowledge the World Climate Research Programme’s Working Group on Coupled Modelling, which is responsible for CMIP and coordinated CMIP5 and CMIP6. The authors thank the climate modeling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and the multiple funding agencies who support CMIP and ESGF. The authors also thank the Bureau of Meteorology, the Bureau of Rural Sciences and the Commonwealth Scientific and Industrial Research Organization for providing the Australian Water Availability Project (AWAP) data. Sarah E. Perkins-Kirkpatrick is supported by ARC (Grant number FT170100106) and CLEX (Grant number CE170100023).
Conflict of Interest
The authors declare no financial or other conflicts of interests that could have appeared to influence the work reported in this paper.
Open Research
Data Availability Statement
The AWAP data (Jones et al., 2009) are available from the BOM (http://www.bom.gov.au/metadata/catalogue/19115/ANZCW0503900567#distribution-information). The BEST data set is obtained from http://berkeleyearth.org/data, and the methodological details are provided in the references: Rohde, Muller, Jacobsen, Muller, et al. (2013) and Rohde, Muller, Jacobsen, Perlmutter, et al. (2013). The CMIP6 and CMIP5 outputs can be downloaded from the Earth System Grid Federation (https://esgf-node.llnl.gov/search/cmip6/ and https://esgf-node.llnl.gov/search/cmip5/). Code to reproduce the ETCCDI indices is available at https://doi.org/10.5281/zenodo.4903200.