A Large Ensemble Approach to Quantifying Internal Model Variability Within the WRF Numerical Model

The Weather Research and Forecasting (WRF) community model is widely used to explore cross‐scale atmospheric features. Although WRF uncertainty studies exist, these usually involve ensembles where different physics options are selected (e.g., the boundary layer scheme) or adjusting individual parameters. Uncertainty from perturbing initial conditions, which generates internal model variability (IMV), has rarely been considered. Moreover, many off‐line WRF research studies generate conclusions based on a single model run without addressing any form of uncertainty. To demonstrate the importance of IMV, or noise, we present a 4‐month case study of summer 2018 over London, UK, using a 244‐member initial condition ensemble. Simply by changing the model start time, a median 2‐m temperature range or IMV of 1.2 °C was found (occasionally exceeding 8 °C). During our analysis, episodes of high and low IMV were found for all variables explored, explained by a relationship with the boundary condition data. Periods of slower wind speed input contained increased IMV, and vice versa, which we hypothesis is related to how strongly the boundary conditions influence the nested region. We also show the importance of IMV effects for the uncertainty of derived variables like the urban heat island, whose median variation in magnitude is 1 °C. Finally, a realistic ensemble size to capture the majority of WRF IMV is also estimated, essential considering the high computational overheads (244 members equaled 140,000 CPU hours). We envisage that highlighting considerable IMV in this repeatable manner will help advance best practices for the WRF and wider regional climate modeling community.


Introduction
Regional climate models (RCMs), through dynamic downscaling over limited areas, allow small-scale features to be resolved at high resolutions without the large computational overheads of global climate models (GCMs). They are typically employed to explore subgrid-scale processes, such as urban heat islands (UHIs) or heavy precipitation events. RCM boundary conditions, taken from GCMs or reanalysis (observation-based) products, should largely constrain internal model variability (IMV) or noises; the overall RCM climatology will be comparable to the boundary conditions. However, initial states, code approximations to mathematical formula, and nonlinear physical processes may drive RCM IMV (Christensen et al., 2001). Despite this, IMV has yet to be fully investigated by the RCM community (Giorgi, 2019;Laprise et al., 2012). Moreover, RCM studies often draw conclusions without exploring model uncertainty (e.g., El Afandi et al., 2013;Sertel et al., 2011;Tse et al., 2018;Zhang et al., 2011;Zhou et al., 2016). Considering that IMV is reflected in RCM output (Giorgi & Bi, 2000), results from such single-run studies should at best be seen as contingent. Furthermore, where differences between single model runs and observations are interpreted as error, these may in fact be due to imperfect initial conditions (Laprise et al., 2012). Here, we quantify IMV in the Weather Research and Forecasting (WRF) model (Skamarock et al., 2008), perhaps the most widely used mesoscale numerical weather prediction model, with a global community of over 36,000 users and code contributors. WRF, as a modeling environment, has been exploited for a wide range of applications, ranging from air quality to hydrology (Powers et al., 2017). WRF is also well connected to UHI research, the nocturnal warming caused by the built environment, through coupling with several urban models (Chen et al., 2011).
Fortunately, some initial condition ensembles, commonly found in weather forecasting and climate modeling (Kay et al., 2015), have started to address IMV within RCMs (Bonekamp et al., 2018;Crétat et al., 2011;Fathalli et al., 2018;Giorgi & Bi, 2000;Laprise et al., 2012;Laux et al., 2017;Lucas-Picher et al., 2008). However, this use of ensembles has not fully caught up to the RCM community (Laprise et al., 2012). This may be due to a lack of IMV awareness, something we aim to address in this study, or the availability of computational requirements to run these ensembles. Computational requirements may be addressed through access to new technology platforms, such as cloud computing (Simm et al., 2018;Siuta et al., 2016). Conversely, increased computing resources may conflict between increasing resolution and resolving subgrid-scale processes and ensembles (Berner et al., 2017).
This study aims to improve upon existing IMV knowledge by significantly increasing the number of ensemble members, allowing us to address: (i) the degree of IMV for various atmospheric fields, (ii) the number of simulations required to capture IMV, and (iii) what conditions leads to higher IMV. This is achieved through modifying initial conditions (i.e., simply changing start times) while keeping model configuration (i.e., parameters and physics) and lateral boundary conditions constant. Furthermore, we test if IMV has a knock-on effect to the magnitude of the UHI, a common application of WRF modeling. While our attention is focused on the WRF model because of its large user base, our approach is applicable to any RCM. Furthermore, the resulting IMV quantifications may impact optional coupled model components, for example, WRF-Chem (Grell et al., 2005).

WRF Configuration
WRF Version 3.9.1.1 was configured using three nested domains, each containing 100 × 100 grid cells and increasing in resolution from 27 to 9 to 3 km ( Figure 1a). The model was centered over London, a city historically well observed in terms of urban climate (e.g., Jones & Lister, 2009). All model physics and dynamics options were kept as default, except setting the model time step to 120 s and turning urban physics on. For this, the single-layer urban canopy model (Kusaka et al., 2001) was used to account for radiation processes within an urban canyon (e.g., trapping) and friction caused by increased drag from buildings, before passing heat and momentum fluxes back to WRF. To complement single-layer urban canopy model, urban parameters specific to London were used (Loridan et al., 2013). Two-way nesting that allows a domain to feed information back to its parent domain is kept on. While the use of a triple two-way nested domain will significantly increase the degrees of freedom in the model, this is a commonly used setup in WRF studies. WRF default land use (Moderate Resolution Imaging Spectroradiometer satellite classification) was kept for simplicity ( Figure 1b). As the default land use only contains a single urban category, we recommend using location-specific land use for more detailed investigations. For initial and lateral boundary conditions, NCEP (National Centers for Environmental Prediction, 2000) Final Operational Model Global Tropospheric Analyses data at 1°horizontal and 6-hourly temporal resolution were used (https://rda.ucar. edu/datasets/ds083.2/). Sea surface temperatures (SSTs) were updated during the simulation with data also provided by NCEP. The SST update is not "on" by default in WRF, and instead, the model uses a fixed value for the whole simulation. While SSTs may not be too influential for a short timescale or continental setups, due to the coastal location of our domain, it has a significant impact on our simulations (tests not shown). Switching on the SST update also updated other time-dependent variables during the simulation, including vegetation fraction, leaf area index, and albedo. Although NCEP is commonly used in WRF studies (see Loridan et al., 2013), a variety of alternatives are available (see http://www2.mmm.ucar.edu/wrf/users/download/free_data.html). To account for model spin-up, we disregarded the first 24 hr of each simulation.

Observations for Evaluation
For comparisons between our simulations and observations, we confined the analysis to a subset of the inner domain (see the dashed box in Figure 1b Table S1 in the supporting information. These locations were chosen based on consistent data records, spatial spread, and small variations in altitude. Furthermore, as the focus of this study is not fine tuning WRF to a specific place, we took the mean air temperature and precipitation of all stations to represent a spatial mean within the area defined by the dashed box in Figure 1b. This simplified spatial mean was also calculated using the same dashed box for each ensemble member. This box represents 40 grid cells or 120 km in each direction. A time series for observed variables are provided in Figure S1.

Ensemble Framework
We created a 244-member ensemble with run lengths ranging from 2 to 4 months. WRF ensemble members were started at 6-hr intervals covering a 2-month period from 1 May 2018 to 1 June 2018 (i.e., 244 members from 61 days × 4 starts per day), illustrated in Figure 2. Although this method differs from more traditional ensemble perturbations, like Giorgi and Bi (2000), WRF simulation lengths typically range from days to months, and start times are usually decided on an ad hoc basis. For each ensemble member, all WRF preprocessing steps were repeated, to ensure each member had a unique set of initial conditions. Bash scripts were used to automatically run the preprocessing, update the WRF namelist for each member and submit these to a high-performance computer (HPC). During the simulations, each member contained identical boundary (b) D03 land use: urban (gray) and mixed rural categories (green). The dashed box around London represents the analysis zone in which the spatial means for both the observations and ensemble members were calculated. The black squares show the location of the observation records for temperature and precipitation, and black triangles for two stations that contained temperature only, further detailed in supporting information Table S1. conditions ending 1 September 2018, hence creating a 2-month period when all members were running. The summer 2018 period was chosen because of its notable (near heat wave) warmth that became one of the joint-hottest summers on record in the United Kingdom (https://www.metoffice.gov.uk/about-us/press-office/news/weather-and-climate/2018/end-of-summer-stats). This case is likely to contain higher IMV than an equivalent winter case due to the increased frequency of stable weather patterns (Giorgi & Bi, 2000).
The ensemble configuration required over 140,000 CPU hours and had the potential to generate almost 30-TB output data. We therefore only saved a small fraction of the output variables: 2-m air temperature, 10-m wind (u and v components), surface pressure, precipitation, boundary layer height, sensible and latent heat fluxes, and 2-m specific humidity. All ensemble members were run on a HPC at Lancaster University, using 32 CPUs for each simulation, and running 10 simulations concurrently. The longest (4 months) member took approximately 24 hr to run and shortest (2 months) 12 hr. Excluding the preprocessing steps and queue times, the 244-member ensemble took 37 days to complete.

Bit-for-Bit Reproducibility
Different hardware options were available on Lancaster University's HPC, which could either be specified at job submission or left unspecified in order to reduce queuing times. In unspecified mode, the scheduler would automatically assign our WRF jobs to the first available hardware type. Due to the number of simulations, we ideally needed to run these in unspecified mode to reduce queue times. However, different hardware types and compilers have been known to produce different results (R. , largely due to floating point rounding errors. To test this, we conducted a 2-month test simulation specifying (i) two different hardware types with each run using 32 CPUs and (ii) a different number of CPUs, 32 versus 40, on each of these two hardware types. All four test simulations used the same initial conditions and boundary conditions, as well as Intel compilers with default flags as set by the WRF model installation scripts. Different compilers (e.g., ifort vs. gfortan) may also cause variability (R. ; however, we do not test this here, using only the Intel compiler for our ensembles. Our tests showed no difference in simulations using different hardware types using the same number of CPUs. We did find significant variability in results between 32 and 40 CPUs, and the same amount of variability was found between the different number of CPU cases on both hardware types. This likely relates to how WRF splits domains into patches for each CPU. For example, 32 CPUs may contain 8 × 4 patches and 40 CPUs 10 × 4 patches. Therefore, our ensembles were permitted to run on unspecified hardware types but only using 32 CPUs, in the knowledge that the variability between the simulations would just be due to IMV.

Results and Discussion
The ensemble mean temperature, wind speed, pressure, and precipitation time series are provided in Figure S1. The ensemble mean temperature bias compared to the observations is shown in Figure 3a. The mean bias during the 4-month period is 0.04°C (s.d. 1.87°C). Figure 3b shows the root-mean-square error (RMSE) between the simulated and observed 2-m temperature for each ensemble member over a 2-month period from 1 July to 1 September (i.e., when all 244 members were running), where the mean RMSE is 1.16°C. Our results show that, in terms of model performance against observations, the difference between RMSE only slightly improves as members are started nearer the 2-month analysis period. Using the R package segmented (https://cran.r-project.org/web/packages/segmented/segmented.pdf), we detected a breakpoint (see Figure 3b), occurring at member 138 (near the beginning of June); however, mean RMSE before and after this change represents less than a 3% reduction.

IMV and the Impact on the Derived UHI
The spread or bias relative to the ensemble mean 2-m temperature (i.e., the IMV) is shown in Figure 4. Note the spread is calculated from the spatial mean of the analysis box centered over London (see extent in Figure 1b). The magnitude of the IMV increases as more members start, and by the start of July all members are included. From Figure 4 it is clear that notable differences exist between ensemble members. To explore this, at each hour in the 4-month analysis period, the range between the ensemble members are calculated using the spatial means from the analysis box. We consider the range between these values to represent WRF IMV. The IMV time series for temperature, surface pressure, wind speed, and precipitation are presented in Figures 5a-5d. Across the 4-month period, the median temperature range is 1.20°C (interquartile range, IQR, 0.67, 2.21°C) and maximum range over 8°C. This temperature range is significantly higher than previous IMV studies (Laux et al., 2017), perhaps explained by the significantly larger ensemble size used in our study. It is also comparable to the spread created using large multiphysics ensembles, for example, a 3-month spatial mean up to 3°C (Jerez et al., 2013) or maximum range up to 10°C (Stegehuis et al., 2015). Moreover, (i) a spatial mean is used to calculate variability; therefore, at the individual grid cell level variability will be higher; (ii) at the start of the 4-month period, less members are running; therefore, some of the variability will be missed in the calculation (i.e., an underestimation). Variability for the other variables we outputted from WRF (not plotted) are listed in Table 1. Note, these variables only represent a small subset of all possible WRF output as we were limited by storage.
To further emphasize IMV, cumulative precipitation covering a 2-month period, starting from 1 July 2018 when all 244 members are running, is plotted in Figure 6. When all simulations are complete (1 September 2018), the total precipitation at the end of this 2-month period ranges between 84.17 and 106.30 mm. The actual observed precipitation, taken as the spatial mean of 14 weather stations (see the black line in Figure 6), had a cumulative total of 84.00 mm, placing this at bottom of the ensemble range. For comparison, the ensemble mean cumulative precipitation is 94.41 mm (s.d. 4.86 mm). While this is not far from   Figure 6 shows a significant overestimation in July, followed by the model missing rainfall events in August, that effectively allows (cumulatively) the observations to catch up to the model. Spatially, we show the 2-m temperature field for a single time slice (20 August 2018 at midnight) for several ensemble members in Figure 7a, effectively depicting London's UHI. Even though the ensemble members are started only a few hours apart, large spatial differences between the temperature fields (i.e., the UHI pattern) exist. Aside from city size and form (fixed in these simulations), UHI intensity is predominantly controlled by wind speed and cloud cover; proxies for stability (Pasquill & Smith, 1984), as captured in many observational studies (e.g., Tomlinson et al., 2012). Therefore, based on the considerable variability already found for wind speed (see Figures 5b and Table 1), it is not surprising that temperature  Note. UHI is derived as the mean of the urban grid cells less the rural grid cells within the analysis box.
fields depicting the UHI also exhibited large IMV between ensemble members.
We then classify rural areas as all grid cells not containing urban or water, shown as green in Figure 1b. Separate means for all urban and all rural grid cells within the analysis box are calculated for each member. Prior to the UHI calculation, temperatures were adjusted to sea level to account for altitude changes (0-217 m) within the domain using the mean environmental lapse rate (0.0065°C/m). Across the 4-month modeling period, there is considerable IMV for UHI intensity with a median of 0.84°C (IQR 0.46, 1.49°C). Split diurnally (based on daily sunrise and sunset times), larger IMV is found at night, 1.00°C median versus 0.74°C during the day. These UHI ranges created through modifying the start time are greater in magnitude to that found by improving urban parameters (e.g., 0.6°C improvement in modeled nighttime UHI was found by Touchaei & Wang, 2015). An example UHI spread for this single time slice (20 August 2018 at midnight) that contains all ensemble members is shown in Figure 7b. Considering a large number UHI studies use single-run methods, these results show considerable UHI variability that should really be accounted for in future model based UHI studies.

IMV Relationship With Boundary Conditions
It is evident in Figure 5 that IMV levels waxed and waned during the 4-month modeling period. For example, sustained high IMV is found mid-May to mid-June, with a median temperature range of 3.04°C (IQR 1.88, 4.46°C). This is also the period where the maximum temperature range over 8°C occurs and where differences between the ensemble mean and observations are highest (see Figure S1). Note that by mid-May only 60 members are running; therefore, IMV during this period is likely underestimated. Between the variables plotted in Figures 5a-5d, we note a similarity in the periods where either high or low IMV occur. We show the relationships between some of the outputted WRF variables in Figure 8, with r values up to 0.85 (Spearman's rank).
Fast-moving boundary conditions, that is, a high through flow of weather driven by large-scale circulations, will cause conditions within the WRF domain to be more strongly forced by these external inputs (Giorgi & Bi, 2000;Laprise et al., 2012). However, if conditions at the domain boundaries are not active, like persistent high pressure, WRF will have more freedom to generate its own weather: Hence, there will be a higher IMV as the model is more sensitive to initial conditions. We demonstrate this effect in our results by calculating the mean of the maximum wind speed input from each of the first 10 model levels (surface to 700 mb or 10 km) within the Outer Domain 1 (Figure 1a), using the coarse (1°) 6-hourly boundary condition data from NCEP. This cross-domain wind speed is a proxy for the "refreshing" capability of the boundary conditions imposed onto the internally generated conditions inside of WRF. Figure 9a shows the comparison between the temperature range for our ensemble and the input wind speed. We immediately see that lower input wind speeds correspond to higher temperature IMV across the ensemble. We further show this relationship as a scatterplot in Figure 9b. Here we find a significant relationship between input wind speed and modeled 2-m temperature (r = −0.66, Spearman's rank). Therefore, we find slow-moving boundary conditions are linked to higher levels of IMV in WRF; these results are in agreement with earlier studies (e.g., Giorgi & Bi, 2000). A cross-correlation function (Figure 9c) was used to test preceding input wind speed effects on IMV, finding peak correlations at zero lags. This conveys that IMV at a given time does not relate to earlier input wind conditions. We note the secondary lags likely relate to synoptic regimes, that is, the main high wind speed (mid-May to mid-June) or low wind speed (mid-June to early July) periods. Although we do not test effects of seasonality in this study, considering the relationship with input wind speed, we would expect findings similar to Giorgi and Bi (2000) where higher IMV was found in summer due to less influence from prevailing westerly winds. Overall, these findings will have direct implications for UHI modeling, where periods of calm conditions are usually chosen due to their relationship with maximum UHI intensity.

Number of Ensemble Members Required to Capture IMV
Running 244 members to create a WRF ensemble is computationally demanding, and here we were limited to a small subset of the total WRF output due to storage limits. We therefore explore the minimum number of ensemble members required to capture the majority of IMV contained in all 244 members using three methods: (i) connected, (ii) sequenced, and (iii) sampled.
For the "connected" method, Figure 10a shows the median temperature range across the 2-month period of July and August as more ensemble members are included, moving back in time from the 2-month analysis window start time (1 July 2018); the ensemble starts with a single member and finishes with 244. On the left-hand side of Figure 10, by only including a single member the temperature range is unsurprisingly zero, whereas the median range is highest when all 244 members are included (0.95°C). A marked change in gradient for the mean temperature range is apparent in Figure 10a, and the R package "segmented" (https:// cran.r-project.org/web/packages/segmented/segmented.pdf) was applied to automatically detect this breakpoint. Occurring after 16 members, this accounts for approximately half (47%) of the IMV generated with all 244 simulations. From here in we refer to this "relative" IMV as rIMV (i.e., variability relative to the ensemble containing all 244 simulations), because the gradient does not plateau by 244 members suggesting that the "true" IMV is greater than captured by our simulations. Following this breakpoint, to account for 75% rIMV, we would need to include 107 members.
For the second "sequenced" method we explore using the members to greater effect, why constrict to 16 members in a row? To test this, we compared six ensemble member sequences, at 1, 2, 4, 8, 16, and 32 member intervals. Each of these new ensembles contain 244, 122, 61, 30, 15, and 8 members, respectively. Results from these sequences are plotted in Figure 10b. To further explain this sequenced method, take the purple Figure 9. (a) Ensemble temperature range (IMV) shown as the blue line and the dashed red line shows the Domain 1 (outer) mean of the maximum wind speed input from each of the first 10 input model levels (surface to 700 mb or 3 km), taken from NCEP prior to running WRF. Each line is a 24-hr running mean. (b) The relationship between the variables in (a). Note for both plots, the temperature range data were reduced to fit the 6-hourly NCEP boundary condition times. (c) Cross-correlation function between the variables in (a). The dashed blue lines are the 95% confidence bounds. Figure 8. Relationships between ensemble spatial mean ranges of (left) surface pressure and temperature, (middle) wind speed and temperature, and (right) surface pressure and wind speed. triangles in Figure 10b by way of an example. In total, there are 61 purple triangles. The last purple triangle represents the range in temperature or IMV from 61 members taken at 4-member intervals (or daily) between ensemble members 1 and 244. At the thirtieth purple triangle in Figure 10b, this represents the temperature range taken across the first 30 members in a 4-member sequence between 1 and 244; the new ensemble contains original members 1, 5, 9, 13, … 117. Using one less ensemble member (15) than the first connected-member approach (16), we increase the amount of rIMV from 47% to 65%. More dramatically, we only require 30 members to now explain 75% rIMV (whereas 107 members were needed previously). This number of ensemble members is still considerably larger than used in previous IMV studies (Bonekamp et al., 2018;Crétat et al., 2011;Fathalli et al., 2018;Laprise et al., 2012;Laux et al., 2017;Lucas-Picher et al., 2008). To explain the relationship between the maximum median temperature range captured by each of the six sequences, nonlinear regression is used, shown as the black line in Figure 10b. The resulting equation y = 0.937 * x/(8.5 + x) may be used (for our WRF configuration) to estimate the likely temperature rIMV from a given number of sequenced ensemble members.
The third "sampled" method uses different arrangements of the ensemble members; that is, for 5 members out of 244 cases, members could be 3, 32, 87, 144, 201, or any other random combination, randomized using the R sample function. We also test the member sampling period, individual weeks (9 total), months and an all-member case (i.e., 2 months). We limit the number of random configurations for all tests to 702 (maximum combinations for a 2-member case over a week) and only create ensembles up to 27 members (i.e., a week). The results are shown in Figure 10c for the median temperature range for all cases. The solid lines depict members taken within individual weeks of the 2-month starting window and show a 0.15°C IMV variation between weeks when 27 members are included. This variation does not appear to relate to proximity to the analysis window. For the month cases, higher IMV occurs for our configuration when sampling ensemble members in May than June. Overall however, we find the highest IMV when sampling across all 244 members, with IMV at 0.69°C, when using 27 members. For the all-member case, we also test using 30 members to make direct comparisons with the "sequenced" approach, finding 73% rIMV.
We make two deductions from these approaches: (i) the size and period chosen for sampling affect total IMV and (ii) "sequenced" and "sampled" approaches produce similar rIMV. There are caveats to recommending a minimum number of ensemble members to capture IMV. Our results are specific to this study and may not be transferable due to both the unique WRF setups in terms of location, domain sizes, resolutions, physics packages, and parameters and the fact that the gradient on the slopes in Figure 10a do not plateau after 244 members are running. Therefore, we surmise that 244 members do not capture all WRF IMV, hence relative IMV. This poses a challenge: Hypothetically, even if we ran enough simulations to capture 100% of IMV, we are still limited to tiny subspace of model uncertainty. Different physics packages, domain sizes, resolutions, and initial condition data will all trigger a slight variation on the IMV we found. Furthermore, for the sequencing method, the further back in time the sequence starts, the more computational power and time is required. Figure 10. (a) Minimum number of ensemble members to represent the majority of WRF IMV for temperature using the "connected" approach. Breakpoint analysis was used to detect the change in gradient at 16 members, shown as the black line. (b) Minimum number of simulations required to represent the majority of WRF IMV for temperature using the "sequenced" approach. Six ensemble member sequences were created from members 1 to 244, at 1, 2, 4, 8, 16, and 32 member intervals. Each of the new ensembles contained 244, 122, 61, 30, 15, and 8 members, respectively. The black line represents the nonlinear regression for the maximum range at each of the six sequences. (c) "Sampled" method. The solid lines represent IMV from week long sampling periods (1-9) within the 2-month initialization period. The points are from member samples within monthlong periods May (triangles) and June (squares) and both months (circles). Note that for (a)-(c), these relationships are unique to our model setup, in our case, three domains centered over London.

Journal of Geophysical Research: Atmospheres
Although we do not test in this study, IMV may also be introduced by perturbing initial conditions, reducing the time before the analysis period.

Ensemble Member Similarities
In the previous section, we found higher IMV when members were started in May rather than in June. To investigate this further, we used Ward's (Ward, 1963) hierarchical clustering to test if there were any similarity between the ensemble members. Ward's method effectively minimizes variance (using Euclidian distance) at each step, that is, merging the two closest members into a new cluster, before repeating this process until all members are moved into a single cluster. The analysis was conducted on the second 2 months only (July and August) when all the members were running. The results are shown as a dendrogram in Figure S2 where two main clusters are detected. The ensemble mean temperature for each of these clusters is shown in Figure 11a where differences between peak daytime temperatures are found. The differences are largest at the beginning of the 2-month period (up to 1°C) and decrease toward the end. The start times of the ensemble members are provided in Figure 11b. We find the majority of Cluster 1 members to start in May and Cluster 2 in July. This difference is also reflected earlier in Figure 3b where we find reductions, albeit very small, in RMSE between observations and members started in June rather than May. Possible reasons for these effects may include fixed monthly model values for leaf area index, albedo and green fraction, or long soil moisture memory effects (Sörensson & Berbery, 2015).

Conclusions
A large ensemble of 244 members was used to quantify IMV within the WRF RCM for a case study covering southeast United Kingdom during June and July 2018. Using a largely default WRF model configuration, we changed model start times (initial conditions) to generate IMV, an often overlooked feature in RCMs. The simulations were completed on a local HPC at Lancaster University, taking 37 days to complete using 10 simultaneous 32 CPU jobs.
Defining the IMV as the range in a given variable between all ensemble members at each time step, we found a median IMV for the 2-m temperature of 1.20°C. IMV was also found to fluctuate considerably with a maximum exceeding 8°C. This IMV was found to have knock-on effects on the derived UHI for London, with median nocturnal UHI ensemble range of 1°C. This is a common application of WRF, yet most similar studies use only a single model run and may only change a single parameter, such as albedo, to replicate cooling from green roofs. Such studies do not substantially address the model sensitivity in the UHI calculation, and on the basis of our results, we would therefore question their validity. Further analysis demonstrated the relationship of IMV to the boundary conditions. Coarse resolution input wind speed from the NCEP boundary condition data was statistically linked to temperature IMV within the inner domain, with higher boundary condition wind speeds reducing the IMV (r = −0.66). Effectively, fast-moving boundary conditions are efficient at refreshing conditions inside WRF, whereas under slow-moving boundary conditions, nonlinear (chaotic) processes within WRF are more likely to dominate.
Since running 244 ensemble members is computationally demanding, we proposed several methods to calculate a minimum number of simulations required to capture the majority of IMV (while acknowledging that 244 members will not capture all of the IMV). An analysis of IMV showed a natural breakpoint in terms of ensemble size, with large increases in IMV from increasing size up to that point and minimal increases after. For our simulations, 75% of the IMV was accounted for using 30 members in a sequence across a 2-month period. Additionally, no benefit to using randomly selected ensemble members was found, except noting the period in which ensemble members were selected could modify IMV.
Overall, our results are applicable to all WRF and indeed any RCM studies and will have knock-on implications for coupled WRF components like chemistry or hydrology. However, exact levels of IMV will be unique

Journal of Geophysical Research: Atmospheres
to individual study configurations. In addition to IMV generated through initial conditions, WRF uncertainty may also be created by changing, for example, study location, domain sizes, nest resolutions and ratios, one-way or two-way nesting, time of year, initial and boundary condition products, number of vertical levels, domain boundary relaxation zones, and use of techniques such as observation nudging. This list is by no means exhaustive as IMV could also be created by the hardware and software environment in which the model is hosted. Testing all these sources of variability will be extremely computationally demanding. However, we are excited by new cloud-computing resources that will allow for fast-finishing ensembles. To conclude, that WRF generates a large degree of IMV is not a bad problem per se so long that it is recognized and accounted for in future studies.