A Diffusion-Based Uncertainty Quantification Method to Advance E3SM Land Model Calibration
Abstract
Calibrating land surface models and accurately quantifying their uncertainty are crucial for improving the reliability of simulations of complex environmental processes. This, in turn, advances our predictive understanding of ecosystems and supports climate-resilient decision-making. Traditional calibration methods, however, face challenges of high computational costs and difficulties in accurately quantifying parameter uncertainties. To address these issues, we develop a diffusion-based uncertainty quantification (DBUQ) method. Unlike conventional generative diffusion methods, which are computationally expensive and memory-intensive, DBUQ innovates by formulating a parameterized generative model and approximates this model through supervised learning, which enables quick generation of parameter posterior samples to quantify its uncertainty. DBUQ is effective, efficient, and general-purpose, making it suitable for site-specific ecosystem model calibration and broadly applicable for parameter uncertainty quantification across various earth system models. In this study, we applied DBUQ to calibrate the Energy Exascale Earth System Model land model at the Missouri Ozark AmeriFlux forest site. Results indicated that DBUQ produced accurate parameter posterior distributions similar to those from Markov Chain Monte Carlo sampling but with 30 times less computing time. This significant improvement in efficiency suggests that DBUQ can enable rapid, site-level model calibration at a global scale, enhancing our predictive understanding of climate impacts on terrestrial ecosystems.
Key Points
- A novel diffusion model-based uncertainty quantification method was developed for efficient model calibration
- This method produced accurate parameter posterior distributions comparable to those from Markov Chain Monte Carlo sampling, but 30 times faster
- The method performs amortized Bayesian inference and can be broadly applied to accelerate earth system model calibration
Plain Language Summary
Land surface models are essential for simulating environmental processes and aiding climate-resilient decision-making. Traditionally, calibrating these models has been costly and time-consuming. To address this, we developed a new method called diffusion-based uncertainty quantification (DBUQ), which is faster and less memory-intensive than previous methods. Using supervised learning, DBUQ quickly generates samples to accurately approximate parameter posterior distributions. We tested this method on the Energy Exascale Earth System Model land model at the Missouri Ozark AmeriFlux forest site, finding that it can produce results similar to traditional methods but 30 times faster. This efficiency suggests that DBUQ could revolutionize the calibration of land surface models globally, enhancing our predictive understanding of climate impacts on ecosystems.
1 Introduction
Land surface models, such as the Energy Exascale Earth System Model (E3SM) Land Model (ELM) (Golaz et al., 2019; Leung et al., 2020), are essential tools for understanding how ecosystems respond to climate change. This understanding is crucial for developing strategies to mitigate and adapt to ongoing and future change. ELM simulates key processes such as water dynamics, energy exchanges, and biogeochemical cycles occurring on terrestrial surfaces (Lu & Ricciuto, 2019). It involves a large number of parameters (Ricciuto et al., 2018), many of which are not measured; default values are usually taken either from surveys of broadly defined plant functional types or from benchmarking the model simulations against global data sets (Y. Q. Luo et al., 2012; White et al., 2000). However, because climate, soil, and vegetation types differ between geographic regions, assigning uniform values to these site-specific parameters results in inaccurate model simulations at individual sites (Gu et al., 2016; Lu et al., 2018). Additionally, deterministic parameter values at a single site do not account for parameter uncertainty, where different parameter sets can produce similar model simulations (Lu et al., 2017). Therefore, an efficient uncertainty quantification method is required to enable site-by-site parameter estimation, thus improving the model's predictability and advancing our understanding of climate impacts on ecosystems globally.
Markov Chain Monte Carlo (MCMC) sampling is a widely adopted method for estimating parameter uncertainty (Hararuk et al., 2014; Vrugt, 2016; Ziehn et al., 2012). It involves generating a series of samples from the posterior probability density function (PDF) to quantify uncertainty. Ideally, with sufficient iterations, these samples converge to the true PDF. However, MCMC is notoriously computationally intensive, often necessitating hundreds of thousands or even millions of model evaluations, which are not fully parallelizable (Lu et al., 2012). Rapidly quantifying uncertainty in land surface modeling is crucial, as it underpins informed decision-making, adaptive management, and effective risk management and resource allocation in the face of pressing climate change. To mitigate these computational demands, machine learning (ML)-based uncertainty quantification methods have been introduced (Mo et al., 2017, 2019; Zhang et al., 2013). Some researchers employ ML to create fast surrogate models, accelerating model evaluations during MCMC sampling (Asher et al., 2015; Dunbar et al., 2021; Weber et al., 2020; Xi et al., 2017). Others leverage generative modeling techniques like normalizing flows to solve the uncertainty quantification problems directly (Bao et al., 2023, 2024; Khorashadizadeh et al., 2023; Liu et al., 2024; Lu et al., 2022; Papamakarios et al., 2021; Yang et al., 2023). Surrogate modeling demands a globally accurate surrogate across the entire parameter space and requires a new MCMC simulation whenever site-specific likelihood functions vary. Normalizing flows, in turn, hinge on an invertible neural network (NN) structure, which requires costly computation of inverse mappings and Jacobian determinants. These limitations hinder the effective application of both approaches to ELM parameter estimation, underscoring the need for more efficient methods.
Recently, diffusion models have achieved great success in solving inverse problems, particularly in image processing applications such as image synthesis (Cai et al., 2020; Dhariwal & Nichol, 2021; Ho et al., 2020, 2022; Meng et al., 2022; Song & Ermon, 2019; Song et al., 2021), image denoising (Ho et al., 2020; Kawar et al., 2021; S. Luo & Hu, 2021; Sohl-Dickstein et al., 2015), image enhancement (Kim et al., 2021; Li et al., 2022; Saharia et al., 2023; Whang et al., 2022), and image segmentation (Amit et al., 2022; Baranchuk et al., 2022; Brempong et al., 2022; Graikos et al., 2022). Many image processing challenges addressed in these studies can be approached as either deterministic or probabilistic inverse problems. However, there is a notable distinction between these inverse problems in image processing and the model calibration tasks discussed in this study. In image processing, the aim is often to enhance the quality of each individual image produced by diffusion models, such as by reducing noise, improving resolution, or enhancing the realism of synthetic images. Achieving these objectives requires precisely controlled diffusion paths. In contrast, model calibration focuses on the statistical properties of generated parameter posterior samples, such as mean and covariance, which do not necessitate such finely adjusted diffusion paths.
Inspired by recent advancements in diffusion models and acknowledging their limitations for scientific model calibration, this study introduces a novel diffusion-based uncertainty quantification (DBUQ) method designed for efficient parameter estimation. Diffusion models are a class of generative ML models used to generate samples from a given distribution (Baldassari et al., 2023). The process begins by sampling from a prescribed noise distribution—typically a standard Gaussian distribution—and then iteratively transforming the Gaussian sample via a learned denoiser, until it approximates a sample from the target distribution (e.g., the parameter posterior distribution of the ELM in this study). The differences in the denoising process result in various diffusion models, with the score-based diffusion method standing out for its solid theoretical foundation and its capability to produce high-quality samples. This method uses a NN to learn the score function and then repeatedly solves a reverse stochastic differential equation (SDE) using the learned score function to draw target samples. Score-based diffusion models have been used to provide prior samples within the MCMC framework (Baldassari et al., 2024; Qiao et al., 2024; Sun et al., 2023). However, there are several unresolved challenges when using these diffusion models for Bayesian inference. First, these methods are computationally intensive due to the iterative reverse process required to generate each sample and the necessity of precise score estimation at every iteration. Second, most existing diffusion models are trained using unsupervised learning of the score function due to the lack of labeled data mapping the initial and terminal states of the underlying stochastic process. The unsupervised learning of the score function requires storing numerous stochastic paths of the forward SDE, further escalating both computational costs and memory requirements.
To address these challenges, our DBUQ method implements a training-free score estimation using the Monte Carlo estimator for direct approximation of the score function. With the estimated score function, it then generates labeled data by solving an ordinary differential equation (ODE), instead of the computationally expensive reverse SDE. This labeled data is then used to train a simple NN to learn a sample generator through supervised learning. After training, this NN quickly generates target posterior samples when evaluated on observational data. The DBUQ method is not only reliable and computationally efficient but also performs amortized Bayesian inference (Zammit-Mangion et al., 2024). It can swiftly produce corresponding parameter posterior samples for any given observations, facilitating rapid, site-level parameter estimation on a global scale. Additionally, the application of DBUQ is not confined to specific types of forward models, allowing for its wide-ranging use across various earth system models. In this effort, we demonstrate DBUQ's effectiveness by applying it to the calibration of eight parameters in the ELM, using 5 years of latent heat flux measurements from the Missouri Ozark AmeriFlux forest site. We evaluate DBUQ's performance by comparing its estimated parameter posterior distributions with those derived from MCMC sampling.
The main contributions of this study are as follows:
- We develop a novel DBUQ method to improve earth system model calibration and parameter uncertainty quantification.
- Our DBUQ method addresses the high computational costs of traditional uncertainty quantification methods such as MCMC sampling and of some modern ML methods, including normalizing flows and diffusion generative models.
- Applications show that DBUQ produces accurate parameter posterior distributions similar to those from MCMC sampling but requires 30 times less computing time. This significant improvement in efficiency suggests that DBUQ can enable rapid, site-level model calibration at a global scale.
2 Diffusion-Based Uncertainty Quantification (DBUQ) Method
In the following sections, we first introduce score-based diffusion models in Section 2.1. We then detail each step of our DBUQ method, including the Monte Carlo estimation of the score function in Section 2.2 and our supervised learning strategy for generating target parameter posterior samples in Section 2.3. Lastly, in Section 2.4, we illustrate our DBUQ method using a one-dimensional, bimodal distribution.
2.1 The Score-Based Diffusion Model
On the other hand, the reverse-time SDE in Equation 6 acts as a denoiser, which can transform the terminal distribution to the initial distribution p(Z0). Then, given samples from the standard Gaussian distribution, we can solve the reverse-time SDE in Equation 6 to generate target samples to quantify parameter posterior uncertainty. The conventional score-based diffusion model (Ho et al., 2020; Kawar et al., 2021; S. Luo & Hu, 2021; Sohl-Dickstein et al., 2015; Song et al., 2021) uses a NN to learn the score function S(Zt, t) by minimizing a score matching loss (e.g., Equation 7 in Song et al. (2021)), and then, for each Gaussian sample, it solves the reverse-time SDE in Equation 6 to obtain one target sample. This method is computationally intensive because generating each target sample requires solving the iterative reverse process, and this process must be repeated to generate the large number of target samples needed for posterior distribution approximation. Additionally, when estimating the score function, the NN is trained in an unsupervised manner due to the lack of labeled data. This unsupervised learning requires storing numerous stochastic paths of the forward SDE, further escalating both computational costs and memory requirements.
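For orientation, the forward and reverse-time dynamics of a score-based diffusion model are commonly written in the generic form below (following Song et al., 2021). This is a sketch in assumed notation: b(t) and σ(t) denote drift and diffusion coefficients, and W_t and \bar{W}_t denote forward and backward Brownian motions. The exact coefficients and equation numbering used in this paper may differ.

```latex
\begin{align*}
  dZ_t &= b(t)\, Z_t\, dt + \sigma(t)\, dW_t,
       \qquad Z_0 \sim p(Z_0), \quad t \in [0, 1]
       && \text{(forward SDE)} \\
  dZ_t &= \left[ b(t)\, Z_t - \sigma^2(t)\, S(Z_t, t) \right] dt + \sigma(t)\, d\bar{W}_t,
       \qquad S(Z_t, t) = \nabla_{z} \log p_t(Z_t)
       && \text{(reverse-time SDE)}
\end{align*}
```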
In the following, we introduce our DBUQ method to improve the computational efficiency. Briefly, DBUQ first uses a Monte Carlo estimator to approximate the score function (Section 2.2). It then trains a NN through supervised learning to learn the sample generator F in Equation 3, based on labeled data produced by solving a reverse-time ODE, and finally evaluates F to quickly generate the target samples (Section 2.3).
2.2 Estimating the Score Function Using a Monte Carlo Estimator
2.3 Supervised Learning of the Generative Model to Produce Parameter Posterior Samples
In this section, we describe how to leverage the score function estimated in Section 2.2 to generate parameter posterior samples for uncertainty quantification. First, we solve the reverse-time ODE in the diffusion model based on the estimated score function to generate a labeled data set. Next, we train a feedforward NN on these labeled pairs to learn the generative model F. Lastly, given an observation y, we evaluate the trained F at numerous samples Z from the standard Gaussian to generate the same large number of samples of X from the target distribution p(X|Y = y).
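The labeled-data step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the noise schedule α_t = 1 − t and β_t² = t, a Gaussian observation likelihood with standard deviation `noise_std` for weighting the prior samples, and a hypothetical one-dimensional forward model y = x². All function and variable names are illustrative. Under this assumed schedule, the probability-flow ODE drift b(t)z − ½σ²(t)S(z, t) reduces to (z − (1 + t)x̄)/(2t), where x̄ is the weighted mean of the prior samples.

```python
import numpy as np

def weighted_prior_mean(z, t, y, x_prior, y_prior, noise_std):
    """Monte Carlo estimate of E[X | Z_t = z, Y = y] from prior samples."""
    alpha, beta2 = 1.0 - t, t  # assumed schedule: Z_t | X ~ N(alpha * X, beta2)
    log_w = (-(z[:, None] - alpha * x_prior[None, :]) ** 2 / (2.0 * beta2)
             - (y[:, None] - y_prior[None, :]) ** 2 / (2.0 * noise_std ** 2))
    log_w -= log_w.max(axis=1, keepdims=True)  # numerically stable softmax
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)
    return w @ x_prior

def solve_reverse_ode(z1, y, x_prior, y_prior, noise_std, n_steps=200):
    """Explicit Euler integration of the reverse-time ODE from t = 1 to t = 0."""
    z = z1.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt  # t runs over 1, 1 - dt, ..., dt (never exactly 0)
        x_bar = weighted_prior_mean(z, t, y, x_prior, y_prior, noise_std)
        # Estimated score: S = -(z - alpha * x_bar) / beta2.
        # With the assumed schedule, b z - 0.5 sigma^2 S = (z - (1 + t) x_bar) / (2 t).
        drift = (z - (1.0 + t) * x_bar) / (2.0 * t)
        z = z - dt * drift  # step backward in time
    return z

rng = np.random.default_rng(0)
x_prior = rng.uniform(-2.0, 2.0, size=2000)  # hypothetical prior samples
y_prior = x_prior ** 2                       # hypothetical forward model

M = 300
z_m = rng.standard_normal(M)                 # terminal states at t = 1
y_m = rng.choice(y_prior, size=M)            # observations drawn from the prior set
x_m = solve_reverse_ode(z_m, y_m, x_prior, y_prior, noise_std=0.2)
labeled = np.column_stack([y_m, z_m, x_m])   # rows (y_m, z_m, x_m) for training F
```

Because the ODE is deterministic, each pair (y_m, z_m) maps to a single x_m, which is what makes the triples usable as regression targets.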

Illustration of why the ordinary differential equation (ODE) model in Equation 18, instead of the stochastic differential equation (SDE) model in Equation 6, can be used to generate the labeled data for the supervised learning of the generator F in Equation 3. Although both the ODE model (top) and the SDE model (bottom) can map the standard Gaussian distribution at the state of T = 1 to the target distribution at the state of T = 0, the relationship between the state at T = 1 and the state at T = 0 is completely different for the two models. For the SDE model, the relationship between the states at T = 1 and T = 0 is purely random (the bottom right plot) because of the stochastic transport. This makes the reverse SDE in Equation 6 unsuitable for generating labeled data, since a neural network (NN) cannot learn such randomness through regression. In comparison, the ODE model defines a very smooth function between the states at T = 1 and T = 0 (the top right plot). This smooth relationship suggests that the ODE model can be reliably used to generate the labeled data for the supervised learning of the generative model F.
In practice, we can use the following strategies to draw the samples of ym from the prior data set, depending on its size. When the prior data set has a large number of samples, we can choose the samples of ym as a subset of its Y samples under the condition that M ≤ J. When the prior data set has a small number of samples and a large amount of labeled data is needed for the NN training to avoid over-fitting, that is, M ≥ J, we can use a slight modification of the procedure described in Sections 2.2 and 2.3 to generate samples from the marginal distribution p(Y) using the Y samples in the prior data set. The marginal distribution is relatively easy to learn because it has no conditional dependence. The low requirement on the size of the prior data set for generating parameter posterior samples makes DBUQ particularly suitable for earth system model calibration, as these simulations are typically so computationally intensive that generating large prior sample sets is impractical.
Our DBUQ Method for Parameter Uncertainty Quantification
Input: Prior sample set ;
Output: Trained generative model F(Y, Z; θ);
Procedure:
1. for m = 1, …, M
   end
2. Train a NN to approximate the generative model X = F(Y, Z; θ) using the labeled training data.
Generate parameter posterior samples: for a given observation y, evaluate the trained F at standard Gaussian samples Z to generate parameter posterior samples to approximate the target distribution p(X|Y = y).
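The training and sampling steps can be sketched with a small network written directly in NumPy. This is a hedged illustration rather than the authors' code: the network size, learning rate, and iteration count are arbitrary choices, and the labeled data come from a hypothetical smooth map x = y + 0.5z (so that p(X|Y = y) is Gaussian with mean y and standard deviation 0.5) instead of from a reverse-time ODE solve.

```python
import numpy as np

def train_generator(inputs, targets, hidden=32, lr=1e-2, n_iter=3000, seed=0):
    """Fit a one-hidden-layer tanh network with full-batch Adam on a squared-error loss."""
    rng = np.random.default_rng(seed)
    n, n_in = inputs.shape
    params = [rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, hidden)), np.zeros(hidden),
              rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1)), np.zeros(1)]
    m = [np.zeros_like(p) for p in params]  # Adam first-moment estimates
    v = [np.zeros_like(p) for p in params]  # Adam second-moment estimates
    for step in range(1, n_iter + 1):
        W1, b1, W2, b2 = params
        h = np.tanh(inputs @ W1 + b1)
        err = (h @ W2 + b2)[:, 0] - targets
        # Backpropagate the loss mean(err ** 2)
        d_out = (2.0 / n) * err[:, None]
        d_h = (d_out @ W2.T) * (1.0 - h ** 2)
        grads = [inputs.T @ d_h, d_h.sum(axis=0), h.T @ d_out, d_out.sum(axis=0)]
        for i, g in enumerate(grads):
            m[i] = 0.9 * m[i] + 0.1 * g
            v[i] = 0.999 * v[i] + 0.001 * g ** 2
            m_hat = m[i] / (1.0 - 0.9 ** step)
            v_hat = v[i] / (1.0 - 0.999 ** step)
            params[i] = params[i] - lr * m_hat / (np.sqrt(v_hat) + 1e-8)
    return params

def generator(params, y, z):
    """Evaluate the trained generative model X = F(Y, Z; theta)."""
    W1, b1, W2, b2 = params
    h = np.tanh(np.column_stack([y, z]) @ W1 + b1)
    return (h @ W2 + b2)[:, 0]

rng = np.random.default_rng(0)
y_m = rng.uniform(0.0, 3.0, size=1000)
z_m = rng.standard_normal(1000)
x_m = y_m + 0.5 * z_m  # hypothetical labeled data (stand-in for the ODE output)

theta = train_generator(np.column_stack([y_m, z_m]), x_m)

# Amortized inference: for a new observation, only forward passes are needed.
z_new = rng.standard_normal(10_000)
x_post = generator(theta, np.full(10_000, 1.5), z_new)
```

Once `theta` is stored, posterior samples for any other observation are obtained by changing only the first argument passed to `generator`.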
2.4 An Illustrative Example of DBUQ
The prior data set in Equation 4 comprises 1,000 samples of X drawn from the prior and the corresponding 1,000 samples of Y generated by evaluating Equation 21 at the samples of X. The prior data set is shown in Figure 2 (left). To generate the labeled data in Equation 19, we solve the reverse-time ODE in Equation 18 using the explicit Euler method with 500 time steps. The generated labeled data are plotted in Figure 2 (middle). Next, we train a simple NN with 100 hidden neurons using the labeled data to learn the generator F. We train the NN using the Adam optimizer with a learning rate of 0.001 for 2,000 epochs. The response surface of the trained F(Y, Z; θ) is given in Figure 2 (right).

Illustration of the proposed diffusion-based uncertainty quantification method. Left: the prior data set generated based on the model in Equation 21. Middle: the labeled data in Equation 19 obtained by solving the reverse-time ordinary differential equation (ODE) of the diffusion model. The use of the reverse-time ODE ensures a smooth functional relationship between (Y, Z) and X, which makes it reliable to perform supervised learning to train the generator F(Y, Z; θ). Right: the response surface of the trained F(Y, Z; θ) using the labeled data, which successfully captures the functional relationship between (Y, Z) and X.
After training the NN to estimate the generator F, we can evaluate F for any given observation y to quickly generate the posterior samples of X. Here we test the performance of our DBUQ method at two observations, y = 1 and y = 2.25, each of which gives a bi-modal posterior distribution p(X|Y = y) for the model defined in Equation 21. Figure 3 shows the exact densities p(X|Y = y) at the two observations as red lines. For the approximation, we evaluate the trained F(Y, Z; θ) at 10,000 standard Gaussian samples of Z to generate the posterior samples of X for the two observations and plot their approximated posterior density functions as blue lines. The approximated distributions are very similar to the exact ones; the K-L divergences between the exact and approximated distributions are 0.036 and 0.055 for y = 1 and y = 2.25, respectively, which demonstrates the high accuracy of our DBUQ method.
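A K-L divergence of this kind can be approximated on a grid, as sketched below. The sketch makes several assumptions that the text does not specify: the direction KL(exact || approximate), a histogram density estimate for the generated samples, and a standard Gaussian as a stand-in for both the exact density and the sample generator. The helper names are illustrative.

```python
import numpy as np

def kl_divergence(p, q, dx, eps=1e-12):
    """Riemann-sum approximation of KL(p || q) for densities on a uniform grid."""
    p = p / (p.sum() * dx)  # renormalize both densities on the grid
    q = q / (q.sum() * dx)
    return float(np.sum(p * np.log((p + eps) / (q + eps))) * dx)

def histogram_density(samples, edges):
    """Piecewise-constant density estimate of the samples on the given bin edges."""
    density, _ = np.histogram(samples, bins=edges, density=True)
    return density

rng = np.random.default_rng(0)
samples = rng.standard_normal(10_000)  # stand-in for samples produced by F

edges = np.linspace(-4.0, 4.0, 41)
mid = 0.5 * (edges[:-1] + edges[1:])
dx = edges[1] - edges[0]

exact = np.exp(-0.5 * mid ** 2) / np.sqrt(2.0 * np.pi)  # stand-in exact density
approx = histogram_density(samples, edges)
kl = kl_divergence(exact, approx, dx)
print(f"KL(exact || approx) = {kl:.4f}")
```

The estimate depends on the bin width and the sample size, so values computed this way are only comparable when the same grid is used for every case.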

Illustration of the accuracy of the trained generative model F(Y, Z; θ) in estimating the posterior distribution p(X|Y) for different observations. The left and right columns show the posterior p(X|Y) for y = 1 and y = 2.25, respectively. The red line represents the exact posterior distribution and the blue line represents the distribution approximated by the trained generative model F(Y, Z; θ). We observe that diffusion-based uncertainty quantification can accurately capture the multi-modal posterior distributions, and that the generative model F(Y, Z; θ) needs to be trained only once to approximate the posterior distributions for different observations.
These bi-modal distributions in Figure 3 can be common in earth system modeling when quantifying model parameter uncertainty for a given observation. This occurs because complex earth system models, with their significant structural and model parameter uncertainties, are likely to produce multiple parameter optima that fit the observation similarly, resulting in multi-modality in the parameter posterior distribution. Standard uncertainty quantification methods, such as MCMC, typically face challenges with these multimodal distributions, particularly when the modes are significantly separated, as in the y = 2.25 case. Such methods may either miss a mode or require extremely long simulation times to achieve convergence. This example illustrates that our DBUQ method can accurately generate samples to approximate multimodal posterior distributions, promising its effective applications in earth system modeling for parameter uncertainty quantification.
DBUQ not only accurately approximates the posterior distributions but also shows computational efficiency in generating these posterior samples. It takes less than 2 min to approximate the two distributions in Figure 3. Specifically, generating the labeled data takes 36 s, training the NN to learn the generator F takes 58 s, and evaluating F to generate 10,000 posterior samples of X and approximate its density function takes less than a second. These computations were performed on a workstation equipped with an Nvidia RTX A5000 GPU. Additionally, DBUQ performs amortized Bayesian inference. Once the generator F has been trained, it produces corresponding posterior samples for any given new observations in nearly no time (less than a second), which further enhances the computational efficiency of our DBUQ method.
3 Application of DBUQ for ELM Calibration
The DBUQ method can be generally applied for model calibration and parameter uncertainty quantification based on limited model simulation samples in the parameter prior space. Here we apply DBUQ to calibrate the ELM at the Missouri Ozark AmeriFlux site (Baldocchi et al., 2001) and quantify the parameter uncertainty for its eight sensitive parameters. The Missouri Ozark site has been collecting eddy covariance data since 2004 (Gu et al., 2016). The site consists of second-growth oak-hickory forests with a mean annual precipitation of 1,083 mm. It is subject to frequent drought, most notably during the observation period in 2007 and 2012. We use the annual average latent heat fluxes (LH) collected at the site from 2006 to 2010 (i.e., five model output quantities) for ELM parameter estimation. The observation data were provided by Gu et al. (2016), who used default parameter values for the ELM simulations and found significant discrepancies from the observations. This study investigates whether parameter estimation can enhance model prediction performance. ELM involves more than 60 parameters. According to the sensitivity analysis of Ricciuto et al. (2018), seven model parameters were responsible for more than 80% of the variation in the LH. These seven parameters are a factor controlling rooting distribution with depth (rootb_par [0.5, 4.5], m⁻¹), the specific leaf area at the top of the canopy (slatop [0.01, 0.06], m² gC⁻¹), the fraction of leaf nitrogen in RuBisCO (flnr [0.1, 0.4]), the fine root carbon:nitrogen ratio (frootcn [25, 65], gC gN⁻¹), the fine root to leaf allocation ratio (froot_leaf [0.3, 1.8]), the base rate of maintenance respiration (br_mr [1.5e−6, 4.5e−6], gC m⁻² s⁻¹ gN⁻¹), and the critical day length to initiate autumn senescence (crit_dayl [35,000, 55,000], s). Along with one additional parameter, crit_onset_gdd ([500, 1,300], °C day), a total of eight parameters are estimated.
The parameter crit_onset_gdd, which represents the number of accumulated growing degree days needed to initiate spring leaf-out, was previously hard coded as a constant in ELM. However, given a recent analysis highlighting the importance of phenology for carbon uptake (Xia et al., 2015), we include this important parameter for calibration here. The prior ranges of the eight parameters are listed above after the parameter names.
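The prior ranges can be collected in a small lookup table from which prior parameter sets are drawn before running the ELM. The sketch below assumes independent uniform sampling within each range, which the text does not state explicitly, and the helper names are illustrative. Rescaling to the unit cube is an optional convenience, since the ranges span many orders of magnitude.

```python
import numpy as np

# Prior ranges as listed in the text (units are given with the parameter names)
PRIOR_RANGES = {
    "rootb_par": (0.5, 4.5),
    "slatop": (0.01, 0.06),
    "flnr": (0.1, 0.4),
    "frootcn": (25.0, 65.0),
    "froot_leaf": (0.3, 1.8),
    "br_mr": (1.5e-6, 4.5e-6),
    "crit_dayl": (35_000.0, 55_000.0),
    "crit_onset_gdd": (500.0, 1_300.0),
}
LOW = np.array([lo for lo, _ in PRIOR_RANGES.values()])
HIGH = np.array([hi for _, hi in PRIOR_RANGES.values()])

def sample_prior(n, rng):
    """Draw n parameter sets uniformly within the prior ranges (assumed prior)."""
    return rng.uniform(LOW, HIGH, size=(n, len(PRIOR_RANGES)))

def to_unit_cube(x):
    """Rescale parameters to [0, 1] so that all dimensions share one scale."""
    return (x - LOW) / (HIGH - LOW)

x_prior = sample_prior(1000, np.random.default_rng(0))  # inputs for 1,000 ELM runs
u_prior = to_unit_cube(x_prior)
print(x_prior.shape)  # (1000, 8)
```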
Given the observed LH data y, we aim to estimate the posterior distribution p(X|Y = y) of the eight parameters X. We start by generating 1,000 sample pairs from the ELM simulations within the predefined parameter prior ranges. Then, we construct the training data set for the NN learning of the generator F by employing mini-batch-based Monte Carlo estimation to approximate the score function and using reverse-time ODE for generating training data based on the estimated score function. Finally, we input 2,000 standard Gaussian samples into the trained F to produce 2,000 samples of the parameter posterior, thereby approximating its posterior distribution.
We assess the performance of DBUQ by comparing its estimated posterior distributions with those obtained from MCMC sampling, a method widely used in earth sciences for model calibration. Each Markov chain necessitates numerous evaluations of the ELM to achieve convergence. Given that each ELM simulation takes approximately 2–3 hr, the extensive computing time and the high number of model evaluations render executing MCMC directly on the ELM computationally prohibitive. To enhance computational efficiency and obtain sampling results within a reasonable timeframe, we perform MCMC using a surrogate model of the ELM. Specifically, we construct a surrogate model from the 1,000 prior samples using a NN, and then perform the MCMC sampling on the surrogate using the EMCEE algorithm (Foreman-Mackey et al., 2013). As illustrated in Appendix A, the surrogate model yields simulations that are comparable to those from the ELM model. We initiated 20 chains, each generating 50,000 samples. Using trace plots to monitor convergence, we discarded the first 40,000 samples of each chain as burn-in. From the remaining 10,000 samples, we selected every 100th sample as effective or independent samples (producing 2,000 samples in total) to approximate the parameter posterior distribution.
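The burn-in and thinning arithmetic described above can be checked with array slicing. The array below is a zero-filled placeholder for the sampler output, and the (chain, step, parameter) layout is an assumption made for this sketch.

```python
import numpy as np

n_chains, n_steps, n_params = 20, 50_000, 8
burn_in, thin = 40_000, 100

# Placeholder standing in for the stored chains: (chain, step, parameter)
chains = np.zeros((n_chains, n_steps, n_params), dtype=np.float32)

kept = chains[:, burn_in::thin, :]               # 100 thinned samples per chain
posterior_samples = kept.reshape(-1, n_params)   # pool all chains together
print(posterior_samples.shape)  # (2000, 8)
```

With emcee, the `get_chain` method of `EnsembleSampler` accepts `discard`, `thin`, and `flat` arguments that perform the same reduction on its own chain layout.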
4 Results
We apply the DBUQ method to both synthetic and real observations to assess its accuracy and performance. In the synthetic case, we randomly select an ELM-simulated LH sample from the data set as a synthetic observation and use its corresponding parameter sample as the synthetic “truth” for accuracy evaluation. In the real-data case, the parameters are calibrated using the real observations from the forest site. In the real-data case, where the actual parameter values are unknown, we assess DBUQ's performance by comparing its approximated posterior PDF with that of the MCMC approximation. Conversely, in the synthetic case, where the “true” parameter values are known, DBUQ's accuracy is further evaluated by comparing its posterior PDF not only with the MCMC approximation but also against the “true” values. DBUQ performs amortized Bayesian inference; once the trained generator F is obtained, it generates parameter posterior samples for both synthetic and real observations swiftly by evaluating F. This contrasts with the MCMC simulation, which requires separate setups and executions for each case, doubling the computational time of an already expensive single run. In the following, we first discuss the synthetic case and then move to the real-data case. Besides evaluating the parameter posterior PDFs, we also analyze the prediction uncertainty of LH caused by the parameter uncertainty.
Figure 4 illustrates the parameter estimation results for the DBUQ and MCMC methods in the synthetic case. Notably, both methods produce posterior PDFs for marginal and joint distributions that are similar in two key aspects: the accuracy of estimating the “true” parameter values with high probability, and the resemblance in the shapes of these PDFs. This similarity extends to their consistency with our domain knowledge. For instance, the width of the marginal distributions signifies the uncertainty level, while the rotation angle in the joint distributions indicates the parameter correlation. LH is sensitive to the parameter rootb_par, as demonstrated by the distinctly narrow shape of its posterior PDF, which indicates that it is effectively constrained by the LH observations. Furthermore, the parameters slatop and flnr, known to be inherently positively correlated, maintain this relationship in their estimated joint distributions.

Parameter posterior distributions estimated by the diffusion-based uncertainty quantification (DBUQ) and Markov Chain Monte Carlo (MCMC) methods in the synthetic case, where the red color highlights the synthetic “true” value and the x-axis range represents the parameter prior distribution range.
For nonlinear problems like the ELM, without known analytical solutions for parameter posteriors, the precise shape of their PDFs is unknown. Both DBUQ and MCMC are approximate methods. Although MCMC can converge to nearly accurate PDFs with sufficient samples, there is no practical guarantee that its estimated distributions represent the true underlying distributions. To assess our parameter estimation accuracy more reliably, we analyze the posterior PDFs of the simulated LH. The underlying rationale is that if the posterior uncertainty of the parameter is effectively captured, then the generated LH posterior samples should closely encompass the observed value. In real-world applications, observational data usually represent the ground truth, making it reasonable to use these observations to evaluate the quality of parameter uncertainty quantification. As depicted in Figure 5a, the LH samples derived from both DBUQ and MCMC methods exhibit a tight distribution around the synthetic “true” observations, with their quantiles showing remarkable similarity. This pattern underscores the efficacy of DBUQ in quantifying parameter uncertainty and solving inverse problems with precision.

Boxplots summarizing latent heat flux (LH) predictions from the parameter posterior samples of the diffusion-based uncertainty quantification (DBUQ) and Markov Chain Monte Carlo (MCMC) methods. “True” in (a) represents the synthetic observation; Obs in (b) is the real observation.
In the real-data case, the DBUQ method reuses the trained generative model F to compute parameter posterior distributions based on the actual observation data from the forest site. The parameter estimation results are shown in Figure 6 and the generated LH posterior samples are summarized in Figure 5b along with the MCMC approximations. These figures demonstrate that DBUQ's results are comparable with those of MCMC, both in terms of parameter uncertainty quantification and the prediction of the LH. The parameter posteriors generated by both methods exhibit analogous PDF shapes in both marginal and joint distributions. Moreover, it is noteworthy that the LH prediction samples produced by DBUQ and MCMC not only mirror each other but also effectively encapsulate the observed data (Figure 5b), reflecting reasonable uncertainty quantification and accurate predictions.

Parameter posterior distributions estimated by the diffusion-based uncertainty quantification (DBUQ) and Markov Chain Monte Carlo (MCMC) methods in the real-data case.
Notably, DBUQ not only produces results similar to those obtained through MCMC sampling but also achieves this with about 30 times less computational time. After building the surrogate model, the MCMC sampling takes about 5 hr. In contrast, the entire implementation of DBUQ takes less than 10 min: estimating the score function and solving the ODE take about 5 min, and training the NN to estimate the generative model F takes another 4 min. The time used to evaluate F to generate the 2,000 parameter posterior samples is negligible (less than a second). All computations were conducted on an Nvidia RTX A5000 GPU without parallelization. More importantly, DBUQ performs amortized Bayesian inference; that is, after F is learned, for any observation y, we can quickly evaluate F to approximate p(X|Y = y) without re-calculating the score function or re-solving the ODE. In contrast, an MCMC simulation must be rerun for each new observation because its likelihood function changes.
To further demonstrate the amortized Bayesian inference capability of DBUQ and assess its generalization, we evaluated its generative model F using a distinct set of synthetic observations. The findings, detailed in Appendix A, reaffirm DBUQ's accuracy in parameter estimation. The application of DBUQ across two synthetic cases and one real-data case totals 10 min, with training the generator F accounting for approximately 9 min. Evaluating F with the three “observations” takes only a second. Each of the three applications using MCMC requires 5 hr, totaling 15 hr to generate the three sets of parameter posterior distributions. Additionally, once F is trained in DBUQ, this generative model can be stored and used for parameter uncertainty quantification with any new observations, whereas MCMC requires saving numerous parameter posterior samples for each simulation to approximate uncertainty.
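The effect of amortization on total cost follows directly from the timings quoted above. The short calculation below uses only those rounded figures (about 5 hr per MCMC run, about 10 min for DBUQ whether one or three observations are processed), so the resulting ratios are approximate; the variable names are illustrative.

```python
# Timings reported in the text, in minutes (rounded values).
MCMC_PER_CASE_MIN = 5 * 60   # about 5 hr per MCMC run on the surrogate
DBUQ_SINGLE_CASE_MIN = 10    # upper bound for one full DBUQ run
DBUQ_THREE_CASES_MIN = 10    # F is trained once; evaluation takes about a second

def speedup(mcmc_minutes, dbuq_minutes):
    """Ratio of MCMC time to DBUQ time."""
    return mcmc_minutes / dbuq_minutes

single_case = speedup(MCMC_PER_CASE_MIN, DBUQ_SINGLE_CASE_MIN)
three_cases = speedup(3 * MCMC_PER_CASE_MIN, DBUQ_THREE_CASES_MIN)
print(f"single case: {single_case:.0f}x, three cases: {three_cases:.0f}x")
# prints "single case: 30x, three cases: 90x"
```

Because the DBUQ cost is dominated by the one-time training of F, the ratio grows roughly linearly with the number of observations processed.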
5 Discussion
-
Reliable supervised learning addresses computing challenges. Traditional diffusion models often employ a NN to learn the score function S(Zt, t) in Equation 12, which, while enhancing the quality of individual image samples, incurs high computational costs due to the need to store data for all paths, as depicted in Figure 1. In contrast, our DBUQ method requires only the storage of the initial and terminal states for each path, reducing memory demands. Moreover, DBUQ uniquely generates labeled data for training the generator F, facilitating reliable supervised learning for the first time within the diffusion model training paradigm, which typically relies on unsupervised learning.
-
Efficient generation of posterior samples. Traditional diffusion models require solving the reverse-time SDE in Equation 6 over a large number of time steps to generate a single sample of the posterior distribution. This process must be repeated many times to produce enough samples for distribution approximation, significantly increasing the computational cost. In contrast, DBUQ reduces computational time by directly evaluating the generator F, which is trained by supervised learning and is typically a simple feedforward NN, to generate multiple samples at once.
-
High computational efficiency on high-dimensional problems. DBUQ is highly scalable to the dimensions of observation and parameter variables due to its use of a NN to learn the relationship X|Y ≈ F(Y, Z; θ). In this framework, an increase in the number of parameters (X) and observation variables (Y) would typically expand the NN size, possibly requiring more labeled data for training. However, DBUQ efficiently generates labeled data by solving the ODE and can further reduce the demand for labeled data through dimension reduction techniques that simplify the NN structure (Lu & Ricciuto, 2019). In contrast, the computational cost of using MCMC methods to approximate parameter posterior distributions increases significantly with the dimensionality of the parameters.
-
Low requirement for physical model simulation samples. In many earth system modeling applications, generating model simulation samples can be very computationally expensive. In this study, for example, each ELM simulation takes about 2–3 hr, rendering direct MCMC analysis impractical and necessitating the use of a surrogate model. However, constructing an accurate surrogate also demands a large number of ELM simulation samples. In contrast, DBUQ is less dependent on the size of the model simulation samples. We use Dprior to generate labeled data, which are then used to train the NN that approximates the generator F. If the labeled data size is smaller than the model simulation sample size, we can supplement it by generating samples from the marginal distribution p(Y).
-
Amortized Bayesian inference. DBUQ performs amortized Bayesian inference by training NNs to learn the relationship X|Y ≈ F(Y, Z; θ). Once trained, the NN can rapidly provide Bayesian inferences for new data. This technique significantly reduces the time and resources needed for site-level parameter estimation and hypothesis evaluation across various data sets, making it particularly advantageous for earth system model calibration and data assimilation. Rather than saving all the parameter posterior samples for uncertainty quantification, DBUQ stores the trained generative model F. By evaluating F, we can efficiently calibrate models at various sites and assimilate data to enhance predictions at specific locations, thereby improving earth system predictability globally.
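The labeled-data generation mentioned in the first point above can be sketched on a toy problem. The example below is a simplified illustration under several assumptions that are not taken from this paper: it is one-dimensional and unconditional (no observation Y), it uses an Ornstein-Uhlenbeck forward process (`alpha`, `beta2`) rather than the specific forms in Equations 6 and 12, and the horizon, grid, and function names are hypothetical. It shows the two ingredients described in the text: a Monte Carlo estimate of the score from prior samples, and an explicit Euler solve of the reverse-time ODE in which only the terminal state Z and the recovered initial state X of each path are kept.

```python
import numpy as np

T_END, T_MIN, N_STEPS = 10.0, 0.01, 400  # assumed time horizon and Euler grid

def alpha(t):
    return np.exp(-0.5 * t)        # signal scale of the assumed OU process

def beta2(t):
    return 1.0 - np.exp(-t)        # noise variance of the assumed OU process

def mc_score(z, t, prior_samples):
    """Monte Carlo score of p_t(z) = mean_j N(z; alpha(t) x_j, beta2(t))."""
    diff = alpha(t) * prior_samples[None, :] - z[:, None]   # shape (paths, J)
    log_w = -0.5 * diff**2 / beta2(t)
    log_w -= log_w.max(axis=1, keepdims=True)               # stabilize exp
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)
    return (w * diff).sum(axis=1) / beta2(t)

def generate_labeled_pairs(prior_samples, n_pairs, rng):
    """Solve the reverse-time ODE and keep only (terminal, initial) states."""
    z_terminal = rng.standard_normal(n_pairs)   # Z at t = T_END
    z = z_terminal.copy()
    times = np.linspace(T_END, T_MIN, N_STEPS + 1)
    for t_now, t_next in zip(times[:-1], times[1:]):
        # Probability-flow ODE drift for dZ = -0.5 Z dt + dW.
        drift = -0.5 * z - 0.5 * mc_score(z, t_now, prior_samples)
        z = z + (t_next - t_now) * drift        # explicit Euler, dt < 0
    x_initial = z / alpha(T_MIN)
    return z_terminal, x_initial                # intermediate states discarded

rng = np.random.default_rng(0)
prior = rng.normal(1.0, 0.5, size=300)          # toy stand-in for prior samples
z_T, x_0 = generate_labeled_pairs(prior, 300, rng)
print(f"prior mean/std: {prior.mean():.2f}/{prior.std():.2f}; "
      f"generated mean/std: {x_0.mean():.2f}/{x_0.std():.2f}")
```

The pairs (`z_T`, `x_0`) play the role of labeled data: a regression model fitted to map Z to X would then act as the generator, which is what allows supervised training without storing the intermediate states of each path.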
As a novel development, the current DBUQ method exhibits some limitations that necessitate further enhancements. First, DBUQ's effectiveness, like most uncertainty quantification methods, depends on the quality of the prior distribution; specifically, the high-probability region of the posterior distribution should fall within the support of the prior. However, in practice, the mode of the posterior may lie in the tails of both the prior and the likelihood, especially when these are significantly misaligned, potentially causing DBUQ to miss the mode of the posterior. One potential improvement could be to update the prior samples using Langevin dynamics, driven by the gradient of the log-likelihood function. Second, DBUQ's performance evaluation has primarily relied on numerical experiments, which do not provide rigorous mathematical validation. The method's approximation performance needs systematic analysis concerning various factors, including the number of prior samples, the duration of reverse-time ODE solutions, the volume of labeled data, and the NN architecture, such as the number of layers and neurons. Third, a rigorous mathematical analysis of DBUQ's approximation error is necessary. A key issue is understanding how Monte Carlo (MC) error accumulates when solving the reverse-time ODE. Intuitively, one might expect the total error in solving the ODE to be the sum of the MC errors at each time step multiplied by the number of time steps. However, our numerical results suggest that MC errors from different time steps may partially cancel each other out, indicating that the total error does not necessarily escalate with an increase in the number of time steps. Detailed theoretical error analysis is required to clarify this behavior.
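The first suggested improvement, moving prior samples toward the high-likelihood region with Langevin dynamics, can be sketched as follows. This is a generic unadjusted Langevin update and not an implemented component of DBUQ; the scalar Gaussian likelihood, the step size, the number of steps, and the names `langevin_update` and `grad` are all assumptions chosen for illustration. How many steps to take, and how to account for the resulting change in the effective prior, are left open here.

```python
import numpy as np

def langevin_update(samples, grad_log_lik, step_size, n_steps, rng):
    """Unadjusted Langevin steps driven by the log-likelihood gradient."""
    x = np.array(samples, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * grad_log_lik(x) + np.sqrt(2.0 * step_size) * noise
    return x

# Toy setting (assumed): scalar parameter, Gaussian likelihood N(y_obs | x, 0.5^2).
y_obs, lik_std = 3.0, 0.5

def grad(x):
    """Gradient of the Gaussian log-likelihood with respect to x."""
    return (y_obs - x) / lik_std**2

rng = np.random.default_rng(0)
prior_samples = rng.standard_normal(2000)       # prior centered far from y_obs
shifted = langevin_update(prior_samples, grad, step_size=0.01, n_steps=50, rng=rng)
print(f"mean before: {prior_samples.mean():.2f}, after: {shifted.mean():.2f}")
```

A small number of steps only nudges the samples toward the likelihood mode; running the chain to convergence would make the samples follow the likelihood alone and discard the prior information.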
6 Conclusion and Future Work
We develop a DBUQ method and evaluate its performance for ELM calibration at one AmeriFlux site. Our evaluations in both synthetic and real-data applications underscore DBUQ's robust capability to quantify parameter uncertainty accurately. The resultant posterior distributions can be reasonably explained with our domain knowledge, aligning with both synthetic benchmarks and predictive insights. A comparative analysis reveals that the posterior estimates generated by DBUQ are similar to those derived from the MCMC approximations, yet DBUQ drastically reduces the computational burden, requiring about 30 times less computational time. Moreover, once the generative model is trained, our DBUQ method quickly generates parameter posterior samples for any given new observations. This rapid inverse modeling and uncertainty quantification capability makes site-level model calibration on a global scale far more practical, marking a step toward more efficient and timely assessments of climate impacts on our earth systems.
Looking ahead, we plan to further enhance the proposed DBUQ method and broaden its applications to advance earth system modeling. We will undertake a comprehensive mathematical analysis to solidify DBUQ's theoretical foundations. Additionally, we will conduct further tests on its amortized Bayesian inference capabilities using time-dependent problems, particularly where the observation data consist of time series. Moreover, we aim to expand DBUQ's application across additional AmeriFlux sites, thereby improving the global predictability of ELM and deepening our understanding of Earth's climate dynamics and their impacts.
Acknowledgments
This research is supported by D. Lu's Early Career Project, sponsored by the Office of Biological and Environmental Research in the U.S. Department of Energy (DOE). It is also supported by the Office of Advanced Scientific Computing Research, Applied Mathematics program in DOE under the contract ERKJ387. Additionally, F. Bao would like to acknowledge support from the U.S. National Science Foundation through project DMS-2142672. Oak Ridge National Laboratory is operated by UT-Battelle, LLC, for the DOE under Contract DE-AC05-00OR22725.
Appendix A: Additional Results and Supporting Materials
A1 Surrogate Accuracy
This section provides the surrogate modeling results. To reduce the computational costs of the MCMC simulation and for a fair comparison with our DBUQ method, we performed the MCMC on the surrogate model, which was constructed using the 1,000 prior samples. Specifically, we used 900 samples to train the NN surrogate and evaluated the surrogate accuracy on the remaining 100 samples. As demonstrated in Figure A1 below, the surrogate model yields results comparable to those of ELM for both training and unseen testing samples, indicating its accurate performance.

From the 1,000 prior samples, we used 900 samples to train the neural network surrogate model and evaluated the model performance on the remaining 100 samples. Both the training and testing data sets demonstrate the surrogate's high prediction accuracy.
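The 900/100 hold-out protocol described above can be sketched as follows. Everything in this example other than the split sizes is an assumption: the data are synthetic rather than ELM output, the number of input parameters (8) is arbitrary, linear least squares is used as a simple placeholder for the NN surrogate, and the coefficient of determination is one possible accuracy measure among several.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination between targets and predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
# Synthetic stand-in for 1,000 (parameter, output) pairs; not ELM data.
params = rng.uniform(-1.0, 1.0, size=(1000, 8))
true_coef = rng.normal(size=8)
outputs = params @ true_coef + 0.05 * rng.standard_normal(1000)

# Hold out 100 of the 1,000 samples for testing.
perm = rng.permutation(1000)
train_idx, test_idx = perm[:900], perm[900:]

# Linear least squares as a simple placeholder for the NN surrogate.
design = np.column_stack([params, np.ones(1000)])
coef, *_ = np.linalg.lstsq(design[train_idx], outputs[train_idx], rcond=None)

r2_train = r2_score(outputs[train_idx], design[train_idx] @ coef)
r2_test = r2_score(outputs[test_idx], design[test_idx] @ coef)
print(f"R^2 train: {r2_train:.3f}, R^2 test: {r2_test:.3f}")
```

Comparable scores on the training and held-out sets are the kind of evidence that the surrogate generalizes to unseen parameter samples.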
A2 Results of the Second Synthetic Case
This section provides the results of another synthetic case, where we select a different sample from the data set as the synthetic “truth” than the one presented in the main text. The following Figure A2 summarizes the parameter estimation results from the DBUQ and the MCMC and Figure A3 shows the corresponding LH predictions. This synthetic case once again demonstrates DBUQ's competence in accurate inverse modeling and parameter uncertainty quantification.

Parameter posterior distributions estimated by the diffusion-based uncertainty quantification (DBUQ) and Markov Chain Monte Carlo (MCMC) in the second synthetic case.

Prediction results of latent heat fluxes (LH) in the second synthetic case.
Open Research
Data Availability Statement
The code of the proposed DBUQ method for reproducing the results in the paper is available in Liu (2024).