Volume 11, Issue 4 e2021EF002571
Research Article
Open Access

A Random Forest in the Great Lakes: Stream Nutrient Concentrations Across the Transboundary Great Lakes Basin

Nandita B. Basu

Corresponding Author

Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, ON, Canada

Water Institute, University of Waterloo, Waterloo, ON, Canada

Department of Earth and Environmental Sciences, University of Waterloo, Waterloo, ON, Canada

Correspondence to:

N. B. Basu,

[email protected]

Contribution: Conceptualization, Methodology, Supervision, Funding acquisition, Visualization, Writing - original draft, Writing - review & editing

J. Dony

Department of Civil and Environmental Engineering, University of Waterloo, Waterloo, ON, Canada

Contribution: Data curation, Formal analysis, Methodology, Writing - original draft

K. J. Van Meter

Department of Geography, The Pennsylvania State University, University Park, PA, USA

Contribution: Conceptualization, Writing - review & editing

Samuel J. Johnston

Computer Science Program, University of Waterloo, Waterloo, ON, Canada

Contribution: Formal analysis, Methodology, Visualization

Anita T. Layton

Computer Science Program, University of Waterloo, Waterloo, ON, Canada

Department of Applied Mathematics, University of Waterloo, Waterloo, ON, Canada

Department of Biology, Cheriton School of Computer Science, and School of Pharmacy, University of Waterloo, Waterloo, ON, Canada

Contribution: Funding acquisition, Methodology, Writing - review & editing

First published: 06 April 2023


Excess nutrient inputs from agricultural and urban sources have accelerated eutrophication and increased the incidence of algal blooms in the Great Lakes Basin (GLB). Lake basin management to address these threats relies on understanding the key drivers of pollution. Here, we use a random forest machine learning model to leverage information from 202 monitored streams in the GLB to predict seasonal and annual flow-weighted concentrations of nitrogen and phosphorus, as well as nutrient ratios across the GLB. Land use (agricultural and urban land) and land management (tile drainage and wetland density) emerge as the two most important predictors for dissolved inorganic nitrogen (DIN; NO₃⁻ + NO₂⁻) and soluble reactive phosphorus (SRP; PO₄³⁻), while soil type and wetland density are more important for particulate P (PP). Partial dependence plots demonstrate increasing nutrient concentrations with increasing tile density and decreasing wetland density. In addition, increasing tile and livestock densities and decreasing forest cover correspond to higher SRP:Total Phosphorus (TP) ratios. Seasonally, the highest proportions of SRP occur in summer and fall. Higher livestock densities are also correlated with increasing N:P (DIN:TP) ratios. Livestock operations can contribute to the buildup of soil nutrients from excess manure application, while increasing subsurface drainage can provide transport pathways for dissolved nutrients. Given that both SRP:TP and the N:P ratios are strong predictors of harmful algal blooms, our study highlights the importance of livestock management, drainage management, and wetland restoration in efforts to address eutrophication in intensively managed landscapes.

Key Points

  • Random forest model, developed using data from 202 streams, identifies various land use and management controls on nutrients

  • High livestock densities correspond with greater proportions of bioavailable P and higher N:P ratios, and thus greater risk of blooms

  • The highest proportions of bioavailable P occur seasonally in summer and fall, and in the Lake Erie Basin

Plain Language Summary

While attempts have been made to improve water quality and reduce algal blooms in the Great Lakes Basin, we still have a limited understanding of where the greatest inputs of nutrients to the lakes are coming from and why. In the current study, we have used nitrogen and phosphorus concentration data from over 200 monitoring stations around the Great Lakes to model daily, seasonal, and annual concentrations and to link these concentration magnitudes to a variety of watershed characteristics. Our results show that land use and land management are important predictors of nitrogen and phosphorus concentrations, with tile drainage emerging as a key driver of higher nitrate and soluble phosphorus concentrations. For particulate phosphorus, however, our results show that soil type and wetland density are more important predictors. Higher tile drainage densities and livestock densities were found to be associated with higher N:P ratios and higher ratios of soluble to total phosphorus. We also found that concentrations of soluble reactive phosphorus are highest during the summer and fall months in watersheds dominated by agriculture, which is in contrast to seasonal patterns observed in less-impacted watersheds. Understanding watershed drivers of nutrient concentrations is critical for managing and improving water quality.

1 Introduction

Excess nitrogen (N) and phosphorus (P) inputs from agricultural intensification and urbanization have contributed to the eutrophication of inland and coastal waters (Anderson et al., 2002; Basu et al., 2022; Schindler, 2006; V. H. Smith et al., 2006; Van Meter et al., 2018), resulting in high environmental, social, and economic costs (Dodds et al., 2009; Moss et al., 2011; Pretty et al., 2003). Globally, billions of dollars are lost each year due to costs associated with eutrophication (Dodds et al., 2009; Pretty et al., 2003). The Great Lakes Basin (GLB) is particularly vulnerable to water quality threats due to its highly populated urban areas as well as areas of intensive agricultural land use (EC & OMECC, 2018b; EC & USEPA, 2017). Excessive nutrient inputs have caused eutrophication problems in various regions of the drainage basin, including most of Lake Erie, Muskegon Bay and Green Bay in Lake Michigan, Saginaw Bay and Georgian Bay in Lake Huron, and Hamilton Harbor, Oswego Harbor, and Bay of Quinte in Lake Ontario (EC & OMECC, 2018b; EC & USEPA, 2017). In addition to eutrophication and the occurrence of harmful algal blooms (HABs), nuisance algae like Cladophora have led to the fouling of beaches and shorelines as well as the clogging of water intakes for drinking and cooling systems in nearshore regions of Lake Michigan, Lake Ontario, and Lake Erie (Bootsma et al., 2015; EC & USEPA, 2017).

Water quality degradation has prompted the development of several national and binational efforts to reduce nutrient loading to the Great Lakes, such as the binational Great Lakes Water Quality Agreement (GLWQA), the Great Lakes Restoration Initiative led by the US Environmental Protection Agency, and the Great Lakes Nutrient Initiative led by Environment and Climate Change Canada. The GLWQA calls for a 40% reduction in the annual total phosphorus (TP) loads, and a 40% reduction in springtime (March–July) soluble reactive phosphorus (SRP) loads, from the 2008 levels, in some priority tributaries of Lake Erie by 2025 (EC & OMECC, 2018a; GLWQA NAS (Great Lakes Water Quality Agreement Nutrients Annex Subcommittee), 2019; US EPA, 2018). These goals to improve water quality in Lake Erie have spurred binational efforts to reduce both P and N loads. However, a key to setting future targets for nutrient reduction is being able to accurately estimate current nutrient loads. A significant obstacle to making these estimates lies in a lack of data availability. Only a fraction of all watersheds in the Great Lakes drainage basin are monitored for water quality. For example, in the Lake Erie Basin it is estimated that approximately 28% of the watersheds are not monitored (Maccoux et al., 2016), and this number is likely greater for the other lakes given Erie is one of the most studied basins (Figure 1). Further, even when watersheds are monitored, there are often significant data gaps, with most watersheds having, on average, 8–10 data points a year (Van Meter et al., 2020). How do we go from such temporally and spatially sparse data to providing robust estimates of loads to the lakes, as well as identifying hotspots in nutrient sources across the basin to target management? These challenges are not unique to the GLB, but are similarly present in watersheds across the world as we grapple with increasing eutrophication risks in inland and coastal waters.

Figure 1. The Great Lakes Basin, divided by colors into its five major subbasins. Water quality monitoring stations used in the current study are indicated with red markers. The dashed line indicates the US-Canada Border.

In the GLB, significant work has been done to address these questions. Robertson et al. (2019) used the SPARROW model to quantify N and P loads in the GLB. Maccoux et al. (2016) estimated total and soluble reactive P in monitored tributaries of the Lake Erie basin using the stratified Beale's Ratio Estimator for monitored watersheds, coupled with a unit area load (UAL) approach (Rathke & McCrae, 1989) for unmonitored watersheds. They found that non-point sources contributed to 50% of the SRP and 70% of the TP loads to Lake Erie, highlighting the importance of accurately estimating non-point source loads. While Maccoux et al. (2016) focused only on Lake Erie, Dolan and Chapra (2012) used the same methodology to estimate TP loads from 1994 to 2008 for all of the Great Lakes.

While significant progress has been made in the last decade towards our understanding of nutrient dynamics in the GLB, key knowledge gaps remain. First, the goal of most of these studies has been to estimate watershed P loads for lake management and modeling. They have not, however, focused on identifying key watershed controls on the non-point source loads, controls that can be used as important levers for watershed management. Second, studies thus far have focused primarily on P species; however, recent research shows that the ratios of N and P can be critical controls on the growth of HABs and trophodynamics (Glibert et al., 2011; Saaltink et al., 2014). Third, these studies have focused primarily on annual loads, while it is well established that the occurrence and intensity of algal blooms in the lakes is more strongly impacted by seasonal loads. And finally, while previous studies have used simple proximity approaches to quantify loads in unmonitored watersheds, newer machine-learning approaches and the availability of large spatial data sets allow us to provide more robust, temporally resolved estimates across the basin.

Here, we propose a novel methodology to model the biogeochemical signatures of watersheds across a large transboundary drainage basin, the GLB, based on a machine-learning approach (Dony, 2020). The overall objective of this study is (a) to better our understanding of climate, landscape and management controls on annual and seasonal nutrient concentrations and ratios across all monitored watersheds of the GLB, and (b) to predict N and P concentrations, loads, and ratios in monitored and unmonitored watersheds across the GLB.

2 Methods

2.1 Site Description

The Great Lakes drainage basin lies on the Canada-United States border and covers an area of 520,000 km2, with 59% of the land area in the United States, and 41% in the province of Ontario (MacDonagh-Dumler et al., 2003; Neff et al., 2005). The lake basin is home to over 33 million people and not only provides drinking water to millions of Americans and Canadians (EC & USEPA, 2017; MacDonagh-Dumler et al., 2003), but also is heavily relied upon for transportation, fishing, industry, agriculture, and recreation. The geology of the GLB varies widely, from granitic bedrock in northern parts of the basin, with a thin cover of acidic soils and mostly conifer dominated forests, to glacial deposits with deeper and more fertile soils in the southern part of the basin (Neff et al., 2005; Shear & Wittig, 1995). The south is also home to most of the agricultural and urban centers of the basin. Climate varies widely across the GLB, with higher temperatures in the south and higher precipitation in the east (Neff et al., 2005; Shear & Wittig, 1995).

2.2 Data Sources and Site-Selection Criteria

A cross-border study such as the one carried out in the present work is frequently hindered by data set incompatibilities. Data is often collected at different spatial scales and/or at different frequencies, making it necessary to evaluate the compatibility of the data sets and to establish harmonization approaches, as needed. Below, we describe the various sources of Canadian and U.S. data, and, when necessary, provide a discussion of potential incompatibilities as well as necessary harmonization approaches.

Water quality data for Canada was obtained from the Provincial Water Quality Monitoring Network (PWQMN) (Ontario Ministry of Environment, 2016), while that for the United States was obtained from the United States Geological Survey (USGS) and from the Water Quality eXchange (WQX) and the Storage and Retrieval Data Warehouse (STORET) databases. Daily discharge data for Canada was obtained from the Water Survey of Canada (WSC), while discharge data for the United States was obtained from the USGS database (USGS, 2022). Additionally, nutrient water quality and discharge data was obtained from the National Center for Water Quality Research at Heidelberg University (Heidelberg University, 2020). Water quality monitoring stations were selected based on the following criteria: (a) proximity to a USGS flow-monitoring station (US) or a WSC station (Canada), with a difference in drainage area between flow and water quality stations <15%, and (b) availability of at least 40 records between 2000 and 2016. Based on these criteria, we identified 170 stations (109 in Canada and 61 in US) with available data on nitrate and nitrite (hereafter referred to as DIN), 174 stations (109 in Canada and 65 in US) with orthophosphate-P (PO4-P) data (hereafter referred to as SRP), and TP data. We focused on nitrate and nitrite data, given the greater significance of DIN for the productivity of downstream ecosystems (Bergström, 2010; Ptacnik et al., 2010); however, the accompanying data set includes TN.
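The two site-selection criteria can be illustrated with a minimal sketch. The `Station` record, its field names, and the three example stations are hypothetical, and normalizing the drainage-area difference by the flow station's area is an assumption, since the text does not state the denominator.

```python
from dataclasses import dataclass


@dataclass
class Station:
    station_id: str       # hypothetical identifier
    wq_area_km2: float    # drainage area at the water quality station
    flow_area_km2: float  # drainage area at the paired flow station
    sample_years: list    # calendar year of each water quality record


def meets_criteria(s, max_area_diff=0.15, min_records=40, start=2000, end=2016):
    # Criterion (a): relative drainage-area mismatch below 15%
    # (dividing by the flow station's area is an assumption).
    area_diff = abs(s.wq_area_km2 - s.flow_area_km2) / s.flow_area_km2
    # Criterion (b): at least 40 records within 2000-2016, inclusive.
    n_records = sum(start <= year <= end for year in s.sample_years)
    return area_diff < max_area_diff and n_records >= min_records


stations = [
    Station("A", 510.0, 500.0, [2000 + i % 17 for i in range(60)]),  # passes both
    Station("B", 700.0, 500.0, [2000 + i % 17 for i in range(60)]),  # 40% area mismatch
    Station("C", 505.0, 500.0, [2005] * 12),                         # only 12 records
]
selected = [s.station_id for s in stations if meets_criteria(s)]
print(selected)
```

In this toy example only station "A" satisfies both criteria.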

Land use data for Canada and the US were obtained from the Annual Crop Inventory (2015), and the National Land Cover Database, respectively (Homer et al., 2015). Both data sets were developed using decision-tree based approaches to Landsat-8 and RADARSAT-2 satellite images, providing predictions of land cover at a 30-m scale. While there are some differences in methodology, the products have comparable levels of reliability. We used the metric percent developed land as the sum of the agricultural and urban land cover in the NLCD database, following Mooney et al. (2020). Tile drainage data for Canada was obtained from the Tile Drainage GIS Layer (OMAFRA, 2015), and that for the US from the map of subsurface drains on agricultural land (Nakagaki et al., 2016). For the US, the tile drainage data set is based on reported county-scale data regarding the extent of surface tile drains, downscaled to a resolution of 30-m based on geospatial data sets of cropland and poorly drained soil. In contrast, for Canada, the data is obtained from the Tile Drainage Project, which is based on actual reporting from drainage contractors regarding the area over which new drainage systems are installed. While the estimation approaches vary, each data set represents the best available current data and gives us a best estimate of tile-drained area in Great Lakes watersheds.

Population density data was obtained from Statistics Canada (2012), and the United States Census Bureau (2016). Gridded precipitation and air temperature data were obtained from the WorldClim database at 1 km resolution (Fick & Hijmans, 2017). Slope data for the United States was obtained from the USGS StreamStats Program, which is based on USGS digital elevation model (DEM) data, available at a 10-m scale (U.S. Geological Survey, 2016). Slope data for Canada was obtained using the Ontario Flow Assessment Tool, which is based on 30-m DEM data (Ontario Ministry of Natural Resources and Forestry, 2015). While the US estimates are based on higher-resolution data, for both the US and Canada the slope is aggregated to the watershed scale, and we therefore do not anticipate significantly different slope estimates at the two resolutions.

Soil data for the United States was obtained from the Soil Survey Geographic Database by the United States Department of Agriculture (USDA-NASS, 2012; Wieczorek, 2014). Soil data for Canada was obtained from the National Soil Database (NSDB) and from the Harmonized World Soil Database (Fischer et al., 2008). Livestock density data for cattle, swine and chickens were obtained from the Food and Agriculture Organization of the United Nations (FAO) as part of their Gridded Livestock of the World v 2.01 (Robinson et al., 2014) and were converted into a single “Livestock Equivalent Density” using animal unit coefficients from Statistics Canada (Government of Canada-Statistics Canada, 2001), based on the amount of manure each animal type would produce to fertilize a standardized acreage of cropland.

2.3 Data Processing for Random Forest Model Development

Daily discharge and sparse water quality data were used to estimate daily stream concentrations by running the Weighted Regressions on Time, Discharge, and Season (WRTDS) model using the EGRET package in R (Hirsch et al., 2010). WRTDS outputs were used to develop annual and seasonal flow-weighted concentrations (Cf), estimated as the ratio between the total load and the total flow over the same annual or seasonal timescale (Equation 1). Data from Heidelberg University was not processed using WRTDS, as measurements were at the daily scale and thus interpolations were not needed.

$$C_f = \frac{\sum_{i} Q_i C_i}{\sum_{i} Q_i} \qquad (1)$$

where Qi = daily flows (L³/T), Ci = daily concentrations (M/L³) estimated using WRTDS, and Cf = flow-weighted concentrations (M/L³) estimated at the seasonal or annual scales. For the seasonal scale analysis, Cf was estimated for each of the four seasons, with seasons being defined as winter (December, January, February), spring (March, April, May), summer (June, July, August), and fall (September, October, November).
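As an illustration of the flow-weighted concentration described above (total load divided by total flow over the same period), the following is a minimal sketch using only the Python standard library. The function names and the three-day example record are hypothetical; the season definitions follow the text.

```python
from collections import defaultdict
from datetime import date

# Season definitions as given in the text.
SEASONS = {12: "winter", 1: "winter", 2: "winter",
           3: "spring", 4: "spring", 5: "spring",
           6: "summer", 7: "summer", 8: "summer",
           9: "fall", 10: "fall", 11: "fall"}


def flow_weighted_concentration(flows, concs):
    """Cf = sum(Q_i * C_i) / sum(Q_i) over the days supplied."""
    total_flow = sum(flows)
    if total_flow == 0:
        raise ValueError("total flow is zero; Cf is undefined")
    return sum(q * c for q, c in zip(flows, concs)) / total_flow


def seasonal_cf(days, flows, concs):
    """Group daily records by season, then compute Cf within each group."""
    groups = defaultdict(lambda: ([], []))
    for d, q, c in zip(days, flows, concs):
        season_flows, season_concs = groups[SEASONS[d.month]]
        season_flows.append(q)
        season_concs.append(c)
    return {season: flow_weighted_concentration(qs, cs)
            for season, (qs, cs) in groups.items()}


# Hypothetical daily record: two winter days and one summer day.
days = [date(2010, 1, 15), date(2010, 1, 16), date(2010, 7, 1)]
flows = [1.0, 3.0, 2.0]   # e.g., m^3/s
concs = [2.0, 4.0, 5.0]   # e.g., mg/L

annual = flow_weighted_concentration(flows, concs)
by_season = seasonal_cf(days, flows, concs)
print(annual, by_season)
```

Because high-flow days carry more weight, the winter value (3.5) sits closer to the concentration on the higher-flow day than the simple mean (3.0) would.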

We further determined the annual molar SRP:TP ratios and the N:P ratios to understand the drivers of nutrient ratios. For N:P ratios, we used DIN and TP magnitudes in moles/year, since this ratio has been shown to be one of the most reliable predictors of nutrient limitation in lakes (Bergström, 2010; Ptacnik et al., 2010). DIN is particularly relevant because it represents the largest bioavailable pool of N, while the best proxy for P bioavailability is TP, which includes both dissolved P and particulate P (PP) (Ptacnik et al., 2010).
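A short sketch of the unit conversion behind the molar ratios is given here. It assumes that DIN is reported as mg N/L and TP and SRP as μg P/L (mass of the element, not of the ion), and it works with concentrations rather than annual loads; for a common flow volume the two give the same ratio. The function names and example values are hypothetical.

```python
N_MOLAR_MASS = 14.007  # g/mol, atomic mass of nitrogen
P_MOLAR_MASS = 30.974  # g/mol, atomic mass of phosphorus


def molar_n_to_p(din_mg_n_per_l, tp_ug_p_per_l):
    """Molar DIN:TP ratio from mass concentrations (assumed units above)."""
    din_mmol_per_l = din_mg_n_per_l / N_MOLAR_MASS
    tp_mmol_per_l = (tp_ug_p_per_l / 1000.0) / P_MOLAR_MASS
    return din_mmol_per_l / tp_mmol_per_l


def srp_to_tp(srp_ug_p_per_l, tp_ug_p_per_l):
    """SRP:TP ratio; both terms are P, so mass and molar ratios coincide."""
    return srp_ug_p_per_l / tp_ug_p_per_l


# Hypothetical stream: 1.0 mg N/L of DIN, 50 ug P/L of TP, 20 ug P/L of SRP.
n_to_p = molar_n_to_p(1.0, 50.0)
srp_fraction = srp_to_tp(20.0, 50.0)
print(round(n_to_p, 1), srp_fraction)
```

The example illustrates why molar N:P ratios are roughly 2.2 times larger than the corresponding mass ratios.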

2.4 Random Forest Modeling Framework

We developed Random Forest (RF) models to predict seasonal and annual flow-weighted concentrations of DIN, SRP, PP and molar ratios of SRP:TP and DIN:TP (hereafter referred to as N:P), as a function of various climate, landscape and management factors within the monitored watersheds (Carlisle et al., 2009; Dugan et al., 2020; K. King et al., 2019; Read et al., 2015; Shen et al., 2020). The RF regression algorithm was implemented in the ensemble functions module from the machine learning python package, scikit-learn (Pedregosa et al., 2011). The RF modeling framework was chosen due to its non-parametric nature, ability to handle noisy, nonlinear, and intercorrelated data, and its robustness to overfitting (Breiman, 2001; Liaw & Wiener, 2002; Meinshausen, 2006; Solomatine & Ostfeld, 2008; Tyralis et al., 2019). Random Forest models generate fitted binary decision trees based on random, independent sampling (with replacement) of variables and data to minimize error (Breiman, 2001). This random independent sampling for each decision tree, also known as “Bootstrapping”, helps reduce noise in the data and overfitting through averaging, while also reducing intercorrelation between decision trees (Shen et al., 2020).

The RF models were developed by using a staged approach, beginning with feature selection, followed by hyperparameter tuning, and finally a training and a testing stage. Models were first specified to use the default parameter values provided by the scikit-learn implementation (Pedregosa et al., 2011). Since sampling was conducted randomly, seed values were standardized in random number generators for sampling reproducibility during RF model training and validation. We performed feature selection using the recursive feature elimination (RFE) algorithm implemented in the scikit-learn feature selection module (Pedregosa et al., 2011). RFE first trains the default RF model on the initial set of features, and calculates feature importance estimates using the RF model's feature importance attribute (Pedregosa et al., 2011). The least important feature is pruned from the current set, and the procedure is repeated until a specified number of features (the k parameter) is reached (Pedregosa et al., 2011). The RFE model's k value was selected during hyperparameter tuning, from the range of 6–11, or all features. The model used 11 different climate and land use variables as predictors: percent wetland, percent area in tile drainage, livestock density (equivalent headcount/km2), percent developed land, percent forest, drainage area (km2), precipitation (mm/yr), slope (%), population density (count/km2), silt and clay percent.

Hyperparameter tuning was performed using 5-fold cross validation, implemented in the scikit-learn model selection function GridSearchCV (Pedregosa et al., 2011), which performs an exhaustive search over a specified grid of parameter values, and returns the combination of parameters that minimizes the model's mean-squared error. The RF hyperparameters tuned by this process included the fraction of data to sample for each tree, the minimum number of samples required to be at leaf nodes of the trees, the maximum depth of the decision trees, and the fraction of features to consider when determining the best split in decision trees. After performing feature selection and hyperparameter tuning, we trained the model using data from all the watersheds. Furthermore, “Out of Bag” (OOB) data, data that is not sampled and used in model development, was used to provide estimates of unbiased model performance and validation (Breiman, 2001).
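The staged workflow of feature selection, grid search, and out-of-bag validation might be sketched in scikit-learn as follows. This is an illustration under stated assumptions, not the study's code: the data are synthetic, the grid is deliberately small, and chaining RFE and the final forest in a single `Pipeline` is a design choice made here so that the number of retained features (`n_features_to_select`, the "k" of the text) is tuned alongside the forest parameters. `max_samples` and `max_depth` could be added to the grid in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the watershed data: 120 "watersheds", 11 predictors,
# of which only the first two drive the response (hypothetical).
rng = np.random.default_rng(42)
X = rng.uniform(size=(120, 11))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

pipeline = Pipeline([
    # RFE repeatedly drops the feature with the lowest RF importance.
    ("rfe", RFE(RandomForestRegressor(n_estimators=10, random_state=0))),
    # Final forest; oob_score=True stores an out-of-bag R^2 after fitting.
    ("rf", RandomForestRegressor(n_estimators=30, oob_score=True, random_state=0)),
])

# Small illustrative grid; fixed seeds keep the search reproducible.
param_grid = {
    "rfe__n_features_to_select": [6, 11],
    "rf__min_samples_leaf": [1, 5],
    "rf__max_features": [0.5, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5,
                      scoring="neg_mean_squared_error")
search.fit(X, y)

best_rf = search.best_estimator_.named_steps["rf"]
oob_r2 = best_rf.oob_score_
print(search.best_params_, round(oob_r2, 3))
```

GridSearchCV refits the best configuration on all rows, which mirrors the text's final training on all watersheds, while the OOB score provides a performance estimate from samples each tree did not see.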

We used two metrics, the variable importance factor (VIF) and partial dependence (PD), to identify dominant controls on the mean annual and seasonal flow-weighted concentrations (Cf) across monitored watersheds. The VIF quantifies how model performance changes when the values of a single variable are perturbed; variables that cause greater changes in model performance are considered more important (Grömping, 2009; Tyralis et al., 2019). Partial-dependence plots illustrate the marginal effect of explanatory variables on the predicted outcome in a trained model, while other variables are held constant (Friedman, 2001). PD plots for RF models were implemented using the plot PD tool in scikit-learn's inspection library (Pedregosa et al., 2011).
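To make the two diagnostics concrete, the sketch below implements permutation-based importance and one-dimensional partial dependence in plain Python for any prediction function. The toy linear `toy_model` and the synthetic data are hypothetical stand-ins for a trained RF model; the study itself used scikit-learn's tools.

```python
import random


def mse(model, X, y):
    return sum((model(row) - target) ** 2 for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, col, n_repeats=10, seed=0):
    """Mean increase in MSE when the values in column `col` are shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        increases.append(mse(model, X_perm, y) - baseline)
    return sum(increases) / n_repeats


def partial_dependence(model, X, col, grid):
    """Average prediction with column `col` fixed at each grid value."""
    curve = []
    for value in grid:
        preds = [model(row[:col] + [value] + row[col + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve


def toy_model(row):
    # Hypothetical "trained model": feature 2 has no effect on the output.
    return 2.0 * row[0] - 1.0 * row[1]


data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(3)] for _ in range(50)]
y = [toy_model(row) for row in X]

importances = [permutation_importance(toy_model, X, y, col) for col in range(3)]
pd_curve = partial_dependence(toy_model, X, 0, [0.0, 1.0])
print(importances, pd_curve)
```

Shuffling the unused third feature leaves the error unchanged, so its importance is zero, and the partial-dependence curve for the first feature rises by its coefficient of 2 across the unit interval.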

2.5 Model Predictions Across the GLB

Final RF models were used to predict seasonal and annual flow-weighted concentrations across 159 HUC-8 watersheds across the GLB (GLC, 2017). Flow-weighted concentrations were converted to nutrient loads using annual area-discharge regression relationships developed from the monitored watersheds for the period 2000 to 2016. Annual nutrient loads were estimated to determine the export to the Great Lakes from tributary sources.
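The conversion from predicted concentration to load can be sketched as follows. The text does not give the functional form of the area-discharge regression, so a power law fitted in log-log space is assumed here purely for illustration; the function names, units, and example watersheds are likewise hypothetical.

```python
import math


def fit_power_law(areas_km2, discharges_m3_per_yr):
    """Least-squares fit of Q = a * A**b in log-log space (assumed form)."""
    xs = [math.log(a) for a in areas_km2]
    ys = [math.log(q) for q in discharges_m3_per_yr]
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    a = math.exp(y_mean - b * x_mean)
    return a, b


def annual_load_tonnes(cf_mg_per_l, area_km2, a, b):
    """Load = Cf * Q, with Q predicted from drainage area."""
    discharge_m3 = a * area_km2 ** b
    grams = cf_mg_per_l * discharge_m3  # 1 mg/L equals 1 g/m^3
    return grams / 1e6                  # grams to metric tonnes


# Hypothetical monitored watersheds with 0.3 m/yr of runoff.
areas = [100.0, 400.0, 1600.0]
discharges = [3.0e7, 1.2e8, 4.8e8]
a, b = fit_power_law(areas, discharges)

# Predicted load for a 1,000 km^2 watershed with Cf = 2 mg/L.
load = annual_load_tonnes(2.0, 1000.0, a, b)
print(round(a), round(b, 3), round(load, 1))
```

With these inputs the fitted exponent is 1 and the predicted load is about 600 tonnes per year.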

3 Results and Discussion

3.1 Nutrient Concentrations and Ratios Across the Great Lakes Basin

Mean annual flow-weighted concentrations and nutrient ratios ranged widely across the GLB, from 0.06–9.5 mg/L for DIN, 2–296 μg/L for SRP, 6–808 μg/L for PP, 0.03–0.74 for SRP:TP ratio, and 1–124 for the N:P ratio. A subset of the 202 stations analyzed in this study (133 stations for DIN, 114 stations for SRP, 130 stations for PP, 106 stations for SRP:TP, and 109 stations for N:P) had flux bias values between ±0.15 and were used for further analysis in the RF models. The flux bias statistic quantifies the bias in the estimated flux, relative to the observed flux, on days when concentration data is available, with a positive value indicating over-estimation and a negative value indicating underestimation (Hirsch et al., 2010).
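The screening step can be illustrated with a small sketch of a flux bias calculation and the ±0.15 filter. Normalizing the difference by the total estimated flux is an assumption made here; the sign convention follows the text (positive values indicate over-estimation). Site names and flux values are hypothetical.

```python
def flux_bias(estimated_flux, observed_flux):
    """(estimated - observed) / estimated, summed over sampled days.

    Dividing by the estimated total is an assumption of this sketch.
    """
    est_total = sum(estimated_flux)
    obs_total = sum(observed_flux)
    return (est_total - obs_total) / est_total


def within_tolerance(bias, tolerance=0.15):
    return abs(bias) <= tolerance


# Hypothetical fluxes (e.g., kg/day) on days with concentration samples.
sites = {
    "site_1": ([10.0, 20.0, 30.0], [12.0, 18.0, 25.0]),  # slight over-estimate
    "site_2": ([10.0, 10.0], [5.0, 5.0]),                # strong over-estimate
}
biases = {name: flux_bias(est, obs) for name, (est, obs) in sites.items()}
kept = [name for name, bias in biases.items() if within_tolerance(bias)]
print(biases, kept)
```

Only "site_1", with a bias of about 0.08, would pass the screen in this example.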

3.2 Random Forest Model Validation

The RF models performed adequately in both the training and the test data sets (Figure 2). Models for DIN performed best, followed by those for SRP and PP (Figure 2). The predictive ability of the RF models for nutrient ratios is lower than that for the individual nutrients. Nutrient ratios are more difficult to predict than individual concentrations, as they are most likely affected by more local effects, including intra-annual variability in precipitation, local nutrient-cycling conditions, and management practices such as tillage (Glibert & Burkholder, 2011).

Figure 2. Random Forest model predictions plotted against observed data for (a) dissolved inorganic nitrogen, (b) soluble reactive phosphorus (SRP), (c) particulate P, (d) SRP:TP, and (e) N:P. The “X” markers refer to training data, while the “O” markers refer to out-of-bag (OOB) validation data. The dotted line is the 1:1 line.

To compare our results with previous estimates of SRP and TP loads in the GLB, we multiplied the annual flow-weighted nutrient concentrations by the annual discharge. We found our loads for Lake Erie to compare well with existing estimates, although our loads were somewhat larger (Figure S1 in Supporting Information S1) (Dolan & Chapra, 2012; Maccoux et al., 2016). This is interesting, given that we focus our analysis only on riverine discharges of nutrients, while previous estimates also considered direct point source inputs to the lake. Our higher values can be attributed to a multitude of methodological differences. First, we used the WRTDS methodology, which can more effectively capture the time-varying relationships between concentration and discharge than previous methodologies. Second, we used the RF model to estimate loads from ungaged watersheds, which is more rigorous than the methodology used in the Maccoux et al. (2016) study, in which ungaged watersheds were assigned nutrient loads based on proximity to gaged watersheds.

3.3 Dominant Controls on Nutrient Concentrations

We found the percent tile-drained area (TD), percent developed land use (defined as the sum of agricultural and urban land) (DEV), percent wetland area (WET), and the percent forested land (FOR) to be the top four predictor variables for the annual Cf of the two dissolved nutrients: DIN and SRP (Figures 3a and 3b). Partial-dependence plots provide further insight into the relationships, with both DIN and SRP increasing with developed land (Figures 4b and 4f) and tile-drained percent (Figures 4a and 4h), and decreasing with forested (Figures 4e and 4h) and wetland percent (Figures 4c and 4h). The importance of the predictor variables was found to be similar at the seasonal scale, highlighting the pervasive influence of these landscape predictors (Figure S2 in Supporting Information S1). Our finding of a positive relationship between developed land and dissolved N and P concentrations is consistent with others in the GLB and across the world highlighting the role of anthropogenic land use on stream nutrient concentrations (Basu et al., 2010; Ilampooranan et al., 2022; Mooney et al., 2020; Stets et al., 2020). The positive relationship between tile drains and dissolved nutrient concentrations highlights the role of subsurface drainage networks in bypassing the nutrient-filtering abilities of the soil and contributing to increasing concentrations of dissolved nutrients (Basu et al., 2011; Blann et al., 2009; Guan et al., 2011; K. W. King et al., 2015; M. Macrae et al., 2021; Schilling et al., 2012; Van Meter et al., 2020; Zanardo et al., 2012). Field-scale and watershed-scale studies across North America, Europe and New Zealand highlight the potential of subsurface drainage to increase export of dissolved N and P (Basu et al., 2011; Blann et al., 2009; Boland-Brien et al., 2014; K. W. King et al., 2015; Kleinman et al., 2015; M. L. Macrae et al., 2019; Sloan et al., 2016, 2017; Thompson et al., 2011; Williams et al., 2015a).
Our study corroborates these field and watershed-scale studies, and highlights that the patterns persist across >100 GLB watersheds at the regional scale. Concentrations of DIN and SRP correlate inversely with wetland area, with increases in wetland area contributing to decreasing concentrations up to a threshold of about 20% wetland area (Figures 4c and 4j); above this threshold, however, wetlands have minimal effect. Interestingly, the regions with low wetland density are also regions with a high percent developed area, suggesting that the presence of wetlands in more agricultural and urban areas has the potential to provide the greatest ecosystem services with respect to water quality improvements. Wetlands remove nutrients and protect downstream waters (Cheng & Basu, 2017; Cheng et al., 2020, 2022; Creed et al., 2017; Golden et al., 2017; Hansen et al., 2018; Jordan et al., 2011; Marton et al., 2015; Thorslund et al., 2017; Van Meter & Basu, 2015); however, empirical studies at the watershed scale often find little or no influence of wetland cover on riverine nitrate due to confounding effects of land use and other variables (Powers et al., 2013; Strayer et al., 2003). The PD plots control for the variability of the other variables and thus allow us to clarify the role of wetlands in controlling dissolved N and P concentrations in streams.

Figure 3. Predictor variable importance for annual flow-weighted concentrations of (a) dissolved inorganic nitrogen (DIN), (b) soluble reactive phosphorus (SRP), and (c) particulate P, and molar ratios of (d) SRP:TP and (e) DIN:TP. The variable importance estimates are calculated by permutation, with higher values on the x-axis corresponding to greater importance.

Figure 4. Partial-dependence plots for the top ranked predictor variables in the Random Forest models. The y-axis of each plot indicates the average of all the possible model predictions for a given x predictor value. Results for nutrient concentrations (dissolved inorganic nitrogen [DIN], soluble reactive phosphorus [SRP] and particulate P) and nutrient ratios (SRP:TP and DIN:TP) are presented in five rows, with the columns corresponding to the top five predictor variables. The histograms of the x predictor values are also presented in gray to highlight the effect of data density on the plotted relationships.

In contrast to the dissolved solutes, the strongest predictors of PP concentrations are the mean air temperature (TEMP), the percent wetland area (WET), and the percent silt and clay content in watershed soils (SLTC) (Figure 3c). Higher temperatures and a higher silt and clay content were found to correlate with higher PP concentrations (Figures 4k and 4m), while greater wetland area correlates with lower concentrations (Figure 4l). While, from a mechanistic perspective, temperature should not have a major impact on PP concentrations, the strong positive relationship between PP and temperature above mean annual temperatures of 8°C can most likely be attributed to there being more intensive agriculture in warmer, southern watersheds, leading to higher PP concentrations (Figure 4k). The percent developed land (DEV) was still found to be a major predictor of particulate P concentrations, but its importance was somewhat less (Figure 4n) than that for the dissolved solutes. It is possible that the percent developed land predictor is not able to capture the intensity of the agricultural activity as well as the mean air temperature gradient, contributing to this somewhat surprising finding. While the relationship between the percent wetland area and PP concentration is similar to that seen for DIN and SRP concentrations (Figures 4c, 4j, and 4i), the relative importance of wetlands to PP concentrations is greater, likely due to the more active role of wetlands in settling out sediments from runoff. Soil type was also found to be a stronger predictor of PP concentrations compared with the dissolved solutes, with PP increasing with increasing percent silt and clay soils (Figures 3c and 4m), highlighting the greater role of fine-soil erosion in elevating downstream PP (Dolph et al., 2019).
It is also interesting that tile density is not one of the top 5 predictors for PP, possibly because dissolved solutes are more dominantly transported through the subsurface pathway (Blann et al., 2009), and thus their concentrations are more affected by the presence of tile drains. In contrast, PP is dominantly transported through surface pathways (K. W. King et al., 2015), and thus PP concentrations are more controlled by soil type.

3.4 Dominant Controls on Nutrient Ratios

Of course, many of the strong predictors of concentration are also strong predictors of nutrient ratios (SRP:TP and N:P), but with some key differences. Specifically, the % forested area, tile drainage densities, and livestock densities emerge as the top three predictors for the SRP:TP ratio (Figure 3d), with higher ratios being associated with lower forest cover and greater livestock and tile densities (Figures 4p–4r). Regions of high livestock density have often been associated with high soil test P from excess manure buildup in soils (Glibert, 2020; Long et al., 2018; Reid et al., 2019; Y. Zhang et al., 2021), and high soil P levels contribute to elevated dissolved P in runoff (Duncan et al., 2017; K. W. King et al., 2015; M. Macrae et al., 2021; Ni et al., 2020). Increased tile drainage has also been linked with an increased proportion of SRP export from agricultural fields to streams in the Maumee and Sandusky watersheds in the Western Lake Erie Basin (Jarvie et al., 2017; K. W. King et al., 2015; M. Macrae et al., 2021).

The top two predictor variables for N:P ratios are livestock density and slope (Figures 4u–4w), with lower slope magnitudes and greater livestock densities correlating with higher N:P ratios (Figures 4v and 4w). Areas with steeper slopes contribute to PP in runoff and thus potentially lower N:P ratios (K. W. King et al., 2015). Increases in N:P ratios with increasing livestock density possibly arise due to over-application of N fertilizers in areas with manure application to compensate for the low N:P ratio of manure (Penuelas et al., 2013; Sardans et al., 2012), leading to N legacy buildup and transport of dissolved N through subsurface pathways (Basu et al., 2011, 2022; Jones et al., 2019). This hypothesis is consistent with both US and global-scale analyses indicating that N inputs from agriculture and livestock production are increasing at a faster rate than P inputs, contributing to elevated N:P ratios in freshwater and marine systems (L. Bouwman et al., 2013; A. F. Bouwman et al., 2017; Glibert, 2020; Penuelas et al., 2013; Sardans et al., 2012). Higher N:P ratios are concerning, given they are often associated with HABs, which have been increasing in frequency and intensity in the last decade (Glibert et al., 2014).

3.5 Spatial and Seasonal Patterns in Nutrient Concentrations and Loads

The RF models are used to predict annual flow-weighted nutrient concentrations for 159 monitored and unmonitored HUC-8 watersheds in the GLB (GLC, 2017) (Figure 5). Predicted concentrations are the highest in the southwestern drainage basin of Lake Erie and along the eastern side of the Huron-Erie corridor. High N and P concentrations in the Lake Erie drainage basin are consistent with previous studies relating high nutrient concentrations to eutrophication and algal blooms in the lake (EC & USEPA, 2017; Kane et al., 2014; D. R. Smith et al., 2015). Watersheds in the Lake Erie Basin have the largest modeled annual loads for DIN, SRP, and TP of all the Great Lakes (Figure 6), while watersheds in the Lake Superior Basin have the lowest loads. The hot spots of high N and P concentration are found in areas of high developed land use (>80%), high tile drainage densities (>40%), and low wetland density (<5%). Watersheds with high N and P concentrations also have the highest area-normalized loads, as seen in the load seasonality plots in Figure 6.

Modeled annual Cf for (a) dissolved inorganic nitrogen (DIN), (b) soluble reactive phosphorus (SRP), (c) particulate P, (d) Total Phosphorus (TP) and ratios (e) SRP:TP and (f) DIN:TP (N:P) for the Laurentian Great Lakes watersheds.

Seasonal patterns of nutrient loads for dissolved inorganic nitrogen, soluble reactive phosphorus and Total Phosphorus for the five Great Lakes. Most of the lakes have a consistent seasonal regime of higher loads in winter and spring, followed by lower loads in summer and fall.

Seasonal patterns in nutrient loads play a critical role in driving algal blooms in downstream waters (Van Meter et al., 2020). Seasonally, the loads to the five lakes were found to be the highest in spring and winter, while the lowest loads occur during summer and fall months, driven by seasonal variations in discharge (Figure 6). As can be seen in the figure, this seasonal variability differs from lake to lake and across the different solute types. For example, in Lake Erie, DIN and TP loads are approximately three times higher in spring (March–May) than in fall (September–November), while SRP loads are only approximately 30% higher in spring than in fall. In contrast, in Lake Superior, the spring loads of DIN are actually lower than those in winter, and the TP loads are approximately 18 times greater in spring than in fall. As has also been shown in an analysis of seasonal nutrient concentration regimes across the GLB (Van Meter et al., 2020), this variability can lead to large seasonal changes in nutrient ratios, as described in the next section.

3.6 Spatial and Seasonal Patterns in Nutrient Ratios

Bioavailable P fractions (SRP:TP) were found to range from a low of 0.16 in the more pristine watersheds draining into Lake Superior and the northern shores of Lake Huron to a high of 0.32 in watersheds in the Lake Erie Basin (Figure 5d and Table 1). Lake Erie watersheds are dominated by agricultural land use and have high livestock densities that possibly contribute to a greater buildup of bioavailable P and larger SRP:TP ratios (Jarvie et al., 2017). In contrast, along northern Lake Superior and the Georgian Bay, erosion of more pristine forested soils leads to higher PP fractions. Seasonally, SRP:TP ratios are highest in fall, followed by summer and winter, with the lowest ratios being observed in spring (Table 1). This seasonal pattern is likely due to more particulate runoff during the spring freshet, which decreases the proportion of bioavailable P (K. W. King et al., 2015; Lozier et al., 2017; M. L. Macrae et al., 2019). In contrast, subsurface flow processes dominate in fall and summer, contributing to higher bioavailable P fractions (K. W. King et al., 2015). Higher bioavailable P fractions in fall and summer can also be attributed to internal loading from in-stream reservoirs, which often become anoxic during the summer months, contributing to higher releases of dissolved P from sediments (Van Meter et al., 2020).

Table 1. Median and Interquartile Range (IQR) of Seasonal and Annual SRP:TP Ratios Across the Great Lakes Drainage Basin
SRP:TP  Erie            Huron           Michigan        Ontario         Superior
Season  Med.    IQR     Med.    IQR     Med.    IQR     Med.    IQR     Med.    IQR
Annual  0.25    0.10    0.20    0.050   0.23    0.070   0.23    0.072   0.17    0.026
Fall    0.32    0.060   0.24    0.076   0.29    0.104   0.28    0.070   0.22    0.011
Winter  0.23    0.070   0.25    0.041   0.25    0.054   0.24    0.060   0.27    0.106
Spring  0.22    0.11    0.19    0.04    0.18    0.071   0.21    0.089   0.16    0.033
Summer  0.32    0.10    0.23    0.06    0.25    0.071   0.27    0.069   0.19    0.016
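
As an illustration of the summary statistics reported in Table 1, the following Python sketch computes a median and an interquartile range for a set of ratios. The input values are hypothetical and are not taken from the study, and the `median_and_iqr` helper and the "inclusive" quartile method are assumptions; the study does not state which quartile convention was used.

```python
from statistics import median, quantiles


def median_and_iqr(values):
    """Return (median, IQR), where IQR is the third minus the first quartile."""
    # The "inclusive" method treats the data as spanning the full range of
    # interest; other quartile conventions can give slightly different IQRs.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


# Hypothetical SRP:TP ratios for five watersheds in one lake basin and season.
srp_tp = [0.18, 0.22, 0.25, 0.27, 0.33]
med, iqr = median_and_iqr(srp_tp)
```

Reporting the IQR alongside the median conveys how much the ratio varies among watersheds within a basin, which a single central value would hide.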

Median N:P ratios were found to range from 30 in watersheds draining to Lake Superior to 81 in those draining to Lake Michigan (Table 2, Figure 5e). N:P ratios are higher in the watersheds draining to the southern shores of Lake Huron and the eastern shores of Lake Michigan, near Green Bay (Figure 5e), while watersheds along southwestern Lake Erie and in the eastern watersheds of the Huron-Erie corridor have lower ratios. Higher N:P ratios in the Lake Michigan (38–81) and Lake Ontario watersheds (55–89) are likely related to the greater urban influence in the watersheds draining to these lakes. Wastewater treatment plant efficiencies for P removal are often much greater than N removal efficiencies, leading to elevated ratios (Oleszkiewicz et al., 2015). To distinguish between areas of N versus P limitation, we used the N:P thresholds identified by Keck and Lepori (2012), who found that predictions of nutrient limitation are uncertain in freshwater streams and rivers except at extreme N:P ratios, with N:P ratios >100 associated with a higher probability of P limitation. Thus, the more urban watersheds in the Ontario and Michigan basins are more P limited compared to the more agricultural watersheds in the Lake Erie Basin, or pristine watersheds in the Lake Superior Basin. Seasonally, the highest N:P ratios are observed in fall and winter, while ratios are lower in summer and spring (Table 2). Higher N:P ratios in fall and winter can be attributed to greater contributions from subsurface flow pathways, while lower ratios in spring and summer can be attributed to greater surface runoff of PP in spring and increased biological processes in summer. Higher ratios in fall in the more urban watersheds in Ontario and Michigan may also be caused by internal P loading from stormwater ponds (Orihel et al., 2017; Van Meter et al., 2020) and the greater influence of wastewater discharge, which is impacted by the greater efficiencies of wastewater P removal (Oleszkiewicz et al., 2015).

Table 2. Median and Interquartile Range (IQR) of Seasonal and Annual N:P Ratios Across the Great Lakes Drainage Basin
N:P     Erie            Huron           Michigan        Ontario         Superior
Season  Med.    IQR     Med.    IQR     Med.    IQR     Med.    IQR     Med.    IQR
Annual  28.2    19.1    26.1    17.8    25.6    20.9    45.3    18.7    18.8    7.68
Fall    47.0    48.8    80.1    55.5    81.4    50      79.4    16.5    36.0    5.89
Winter  37.0    42.2    83.9    25.6    80.1    47      89.0    31.8    67.6    32.1
Spring  37.4    36.8    48.6    19.9    37.7    24      61.7    21.3    37.9    11.7
Summer  35.6    29.1    45.9    25.8    44.0    27      55.2    24.0    29.9    22.4

3.7 Implications

Our findings have significant implications for managing water quality in the GLB as well as other freshwater streams and lakes globally. Overall, our analysis demonstrates that watersheds with high levels of developed land use, high tile-drainage densities, and low wetland densities are more likely to have higher concentrations of dissolved nutrients (dissolved inorganic N and soluble reactive P), while higher concentrations of PP occur in areas with more silt-clay soils and lower wetland cover. Given the strong relationship between riverine inputs of the more bioavailable SRP and the increasing frequency and intensity of algal blooms (Baker et al., 2014; Daloğlu et al., 2012; Jarvie et al., 2017; Kane et al., 2014; Michalak et al., 2013; Stow et al., 2015), our study highlights how wetland loss and increases in subsurface drainage in agricultural landscapes may be contributing to increasing bloom occurrences in the Lake Erie Basin (Ho et al., 2019; Hou et al., 2022; Paltsev & Creed, 2022). While increasing SRP trends have been attributed to increases in tile drainage densities in major watersheds like the Maumee and the Sandusky in the Western Lake Erie Basin (Daloğlu et al., 2012; Jarvie et al., 2017; Kane et al., 2014; Michalak et al., 2013; Stow et al., 2015), we use a much larger data set to show that similar patterns persist across the entire basin. We find an increase in the SRP:TP ratio corresponding to increased livestock and tile-drainage densities, and attribute it to increased legacy soil P buildup due to overapplication of manure, as well as dissolved nutrient transport through subsurface pathways (Long et al., 2018; Miralha et al., 2022; Reid & Schneider, 2019; Reid et al., 2019; Van Meter et al., 2021; Van Staden et al., 2022). Seasonally, the bioavailable SRP fraction is highest in the warmer fall and summer months, making it a critical driver of increased bloom abundance (Paerl & Huisman, 2008).

Most nutrient management efforts focus on limiting nutrients, with P reduction strategies being the most common in freshwater systems (Schindler et al., 2016) and N reduction strategies in marine systems (Rabalais et al., 2001). However, it is increasingly recognized that the concept of limiting nutrients alone is not adequate to predict biological responses in receiving waters (Glibert, 2017; Glibert & Burkholder, 2011; Glibert et al., 2014; Heil et al., 2007). Changing N:P ratios have been known to alter algal biodiversity in both freshwater and marine systems (Glibert, 2017), and there is increasing evidence of increases in Microcystis blooms in direct proportion to increasing N:P ratios in lakes around the world, including Lake Taihu in China and impoundments on the Huron River, Michigan, USA (Glibert, 2017; Glibert et al., 2014; Lehman, 2007). In the current study, we find increased N:P ratios corresponding to increased livestock density and lower slopes, and attribute it to increased legacy soil N due to overapplication of manure, and dissolved nutrient transport through subsurface pathways (Basu et al., 2022; Williams et al., 2015a). Our range of N:P ratios in the GLB is similar to that of other human-dominated rivers around the world, with ratios ranging from 30 to 100 in the Ruhr River catchment in Germany (Westphal et al., 2020), 16–80 in coastal watersheds across the US (Oelsner & Stets, 2019), 16–30 for the Baltic Sea Drainage Basin (Saaltink et al., 2014), 35–67 for the Seine, Somme and Scheldt Rivers in Western Europe (Thieu et al., 2009), and 13–44 for US streams (Maranger et al., 2018). These ratio values are much greater than the 16:1 Redfield ratio, which is commonly used as a marker of phosphorus limitation in marine systems, and they are consistent with other global studies that have documented increasing N:P ratios in human-impacted catchments from increased use of N fertilizers compared to P fertilizers, and the more efficient treatment of P in urban wastewaters (Glibert, 2020; Oleszkiewicz et al., 2015).
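
Comparison against the Redfield ratio presumes that N:P is expressed on a molar basis. The following Python sketch shows that conversion under explicit assumptions: that concentrations are reported in mg/L as N and as P (this section does not state the units), and that the `molar_np_ratio` helper and the example concentrations are hypothetical rather than drawn from the study.

```python
# Standard atomic masses in g/mol.
N_MOLAR_MASS = 14.007
P_MOLAR_MASS = 30.974

# Molar N:P reference value (Redfield ratio, 16:1).
REDFIELD_NP = 16.0


def molar_np_ratio(din_mg_n_per_l, tp_mg_p_per_l):
    """Convert mass concentrations (mg N/L, mg P/L) to a molar N:P ratio."""
    if tp_mg_p_per_l <= 0:
        raise ValueError("TP concentration must be positive")
    moles_n = din_mg_n_per_l / N_MOLAR_MASS  # mmol N per litre
    moles_p = tp_mg_p_per_l / P_MOLAR_MASS   # mmol P per litre
    return moles_n / moles_p


# Hypothetical stream: 2.0 mg N/L of DIN and 0.1 mg P/L of TP.
ratio = molar_np_ratio(2.0, 0.1)
above_redfield = ratio > REDFIELD_NP
```

Because phosphorus is roughly 2.2 times heavier per mole than nitrogen, a mass-based N:P ratio understates the molar ratio by about that factor, so the basis of any reported ratio should be checked before comparing it with the values cited above.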

Given the strong relationship between both high N:P and SRP:TP ratios and increasing algal blooms, including HABs (Glibert et al., 2014; Hou et al., 2022; Michalak et al., 2013), our findings have critical implications for nutrient management in the GLB and other watersheds around the world. First, the linkage between high SRP:TP ratios and high livestock densities suggests the importance of legacy nutrient sources from manure accumulation to current water quality. The presence of such legacies within the GLB suggests the potential for accessing legacy nutrient sources to grow crops with reduced fertilizer application (Ascott et al., 2021; Basu et al., 2022; Chang et al., 2021; Ilampooranan et al., 2019; Liu et al., 2019; Van Meter et al., 2016, 2018, 2021; T. Q. Zhang et al., 2020), thus improving water quality without compromising crop yield. Indeed, a study on paired wheat and canola fields in Manitoba showed that a decrease in fertilizer P applications reduced soil P levels and dissolved P concentrations, but with no impact on crop yield over a >10-year period (Liu et al., 2019). Another long-term study in Oxford County, ON showed that decreasing fertilizer applications by 50% in the capture zone of a drinking water well reduced nitrate concentrations significantly but without a reduction in crop yields over a 10-year timeframe (Basu et al., 2022; Rudolph et al., 2015). It has been argued that addressing legacy nutrient sources is one of the most important pieces in sustainable N and P management, both in the GLB and across the world (Basu et al., 2022; Condron et al., 2013; Haygarth et al., 2014; Powers et al., 2016; Sharpley et al., 2013; Wironen et al., 2018).

With the increasing magnitude and intensity of precipitation in the lower Great Lakes region (Norton, 2019; Singh & Basu, 2022; Williams et al., 2015a), it is likely that more land will be brought under tile drainage (Arbuckle et al., 2015; Williams & King, 2020), increasing the risk of SRP export (Bartolai et al., 2015; Muenich et al., 2016). Farmer survey data highlight that increased drainage is likely to be one of the main approaches to adapt to climate change and extreme precipitation events (Arbuckle et al., 2015). Conservation strategies to reduce SRP export include nutrient management to address P sources (Guo et al., 2020; K. W. King et al., 2015; M. Macrae et al., 2021; Osterholz et al., 2020), drainage water management practices that hold back water seasonally in the tile drains and reduce export (Williams et al., 2015b), and restoring wetlands, which can intercept outflow from tiles and reduce nutrient export to downstream waters (Basu et al., 2022; Cheng et al., 2020; Golden et al., 2017; Hansen et al., 2018; M. Macrae et al., 2021). Many of these strategies do not effectively reduce nutrient export during extreme rainfall events, and thus innovations in watershed conservation practices become critically important for water quality improvement (Williams et al., 2015a).

Finally, our work highlights the potential for manure management as an effective strategy and possibly a “low hanging fruit” of nutrient control in agricultural landscapes, for reducing both N:P and SRP:TP ratios. A modeling study in the Lake Erie Basin highlighted that appropriate manure management is a much more effective strategy than reductions in fertilizer application due to the buildup of legacy P sources in soils (Van Meter et al., 2021). While manure can potentially be effectively re-used in crop fields as a fertilizer source, one major challenge of such practices arises from the spatial disconnect between crop fields and livestock-dense areas, often making such recycling economically unfeasible (Long et al., 2018; Metson et al., 2016). Re-use of manure for bio-energy generation has tremendous potential for addressing the profitability gap, providing a renewable energy source, as well as improving water quality (Akram et al., 2019; Glibert, 2020; Metson et al., 2020).

4 Conclusions

Anthropogenic nutrient pollution contributes to eutrophication and increasing frequency and intensity of algal blooms in freshwater lakes around the world, including small and large lakes in the GLB (Anderson et al., 2002; Basu et al., 2022; Glibert et al., 2014; Jarvie et al., 2017; Scavia et al., 2014). Managing nutrient pollution requires understanding the climate, landscape and human factors that contribute to increasing concentrations and loads of the different nutrient species. Here, we use a machine-learning approach to assess the spatial drivers of nutrient concentrations within the GLB using flow-weighted concentration (Cf) as a biogeochemical watershed signature.

From a methodological perspective, our work has two unique contributions. First, while most studies focus on developing statistical models for estimating nutrient loads, we use the flow-weighted concentration as a biogeochemical signature that is more robust against year-to-year variations in flow, and allows us to better clarify the land use and land management controls on stream concentrations. Second, we use PD plots to explore the marginal effects of explanatory variables for stream nutrient concentrations, allowing us to uncover dependencies that are missed in simpler regression approaches due to confounding effects of correlated variables. From a watershed and lake management perspective, both in the GLB and other freshwater systems around the world, our findings highlight that the critical predictor variables are different for the different nutrient species and ratios, which vary seasonally and spatially. Thus, watershed management efforts must be targeted to the species and ratios of interest. We further highlight the critical role that livestock and drainage water management can play in decreasing N:P ratios and SRP:TP ratios, thus mitigating nutrient pollution and the occurrence of HABs in the GLB. We find increasing tile drainage and livestock density to contribute to increased SRP:TP ratios, while increasing livestock density contributes to increased N:P ratios. Our finding of increasing SRP:TP and N:P ratios with livestock density is novel, and alludes to the possible buildup of legacy nutrient stores that act as a source to subsurface transport pathways through tile drains. Finally, our results highlight the critical role wetlands play in reducing both N and P concentrations, especially in intensive agricultural areas with low wetland densities, suggesting wetland restoration as an effective strategy for water quality improvement.
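
The robustness of the flow-weighted concentration to variations in flow can be shown with a small Python sketch. It assumes the usual definition of Cf as total load divided by total discharge over the period, which is not restated in this section; the helper functions and the sample values are hypothetical, and units are left schematic.

```python
def total_load(conc, flow):
    """Sum of concentration times discharge over all samples."""
    return sum(c * q for c, q in zip(conc, flow))


def flow_weighted_concentration(conc, flow):
    """Cf = sum(C_i * Q_i) / sum(Q_i), assuming paired samples."""
    if len(conc) != len(flow):
        raise ValueError("conc and flow must have the same length")
    total_flow = sum(flow)
    if total_flow <= 0:
        raise ValueError("total flow must be positive")
    return total_load(conc, flow) / total_flow


# Hypothetical paired samples: concentration and discharge (schematic units).
conc = [1.0, 2.0, 4.0]
flow = [10.0, 30.0, 60.0]
cf = flow_weighted_concentration(conc, flow)

# A hypothetical wetter year: same concentrations, every discharge doubled.
wet_flow = [2.0 * q for q in flow]
cf_wet = flow_weighted_concentration(conc, wet_flow)
load_ratio = total_load(conc, wet_flow) / total_load(conc, flow)
```

In this idealized case the load doubles in the wetter year while Cf is unchanged, which illustrates why Cf isolates land use and management signals better than load does. In real streams concentrations also shift with flow, so the invariance is only approximate.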

The estimated loads presented here are key to understanding the current conditions of the Great Lakes and potential paths forward in nutrient management. They offer insight into the magnitude of the underlying driver of modern-day eutrophication challenges. While TP loads, primarily from point sources, have been reduced due to actions associated with the 1978 GLWQA, water quality challenges from nutrient pollution still persist in the Great Lakes (Robertson & Saad, 2011). By better quantifying spatially and seasonally varying nutrient loads, managers can be better informed regarding the performance of past reduction strategies, while also establishing a baseline to develop and adapt new strategies to achieve future targets. Such insights from the Great Lakes region are also relevant to watersheds around the world, particularly as a changing climate puts more watersheds at risk of increased nutrient runoff from intensively managed landscapes.


We acknowledge funding support from the Lake Futures Grant, which is a part of the Global Water Futures Project, funded through the Canada First Research Excellence Fund (to N. B. Basu, J. W. Dony and S. J. Johnston), the Canada Research Chair Program (to N. B. Basu), the Canada 150 Research Chair program (to A. T. Layton and S. J. Johnston) and the Natural Sciences and Engineering Research Council of Canada, via Discovery awards and an Alliance Grant (to N. B. Basu and A. T. Layton) and Pennsylvania State University Startup funds (to K. J. Van Meter). We acknowledge valuable feedback from two anonymous reviewers and the editorial team, which significantly helped improve the quality of the work.

    Data Availability Statement

    The compiled data set used in this research is available through the CUAHSI Hydroshare site at https://doi.org/10.4211/hs.b94e2da3b5094cdfa679ad31fe7fb09d (N. B. Basu et al., 2021).