Volume 128, Issue 5, e2022JF006810
Research Article
Open Access

Mapping Landslide Susceptibility Over Large Regions With Limited Data

J. B. Woodard (1), B. B. Mirus (1), M. M. Crawford (2), D. Or (3, 4), B. A. Leshchinsky (5), K. E. Allstadt (1), N. J. Wood (6)

1. U.S. Geological Survey, Geologic Hazards Science Center, Golden, CO, USA
2. Kentucky Geological Survey, University of Kentucky, Lexington, KY, USA
3. Division of Hydrologic Sciences, Desert Research Institute, Reno, NV, USA
4. Department of Environmental Systems Science, Soil and Terrestrial Environmental Physics, ETH Zürich, Zürich, Switzerland
5. Department of Forest Engineering, Resources and Management, Oregon State University, Corvallis, OR, USA
6. U.S. Geological Survey, Western Geographic Science Center, Portland, OR, USA

Correspondence to: J. B. Woodard, [email protected]

First published: 04 May 2023

Abstract

Landslide susceptibility maps indicate the spatial distribution of landslide likelihood. Modeling susceptibility over large or diverse terrains remains a challenge due to the sparsity of landslide data (mapped extent of known landslides) and the variability in triggering conditions. Several different strategies for sampling the landslide locations used to train a susceptibility model have been employed to mitigate this challenge. However, to our knowledge, no study has systematically evaluated how different sampling strategies alter a model's predictor effects (i.e., how a predictor value influences the susceptibility output), which are critical to explaining differences in model outputs. Here, we introduce a statistical framework that examines the variation in predictor effects and the model accuracy (measured using receiver operator characteristics) to highlight why certain sampling strategies are more effective than others. Specifically, we apply our framework to an array of logistic regression models trained on landslide inventories collected at sub-regional scales over four terrains across the United States. Results show significant variations in predictor effects depending on the inventory used to train the models. The inconsistent predictor effects cause low accuracies when testing models on inventories outside the domain of the training data. Grouping test and training sets according to physiographic and ecological characteristics, which are thought to share similar triggering mechanisms, does not improve model accuracy. We also show that using limited landslide data distributed uniformly over the entire modeling domain is better than using dense but spatially isolated data to train a model for applications over large regions.

Key Points

  • We use a statistical framework to investigate the influence of data sampling strategies on landslide susceptibility model performance

  • The framework shows that the predictor data effects on output probability vary drastically with the sampling strategy used

  • The best sampling strategy we evaluate uses landslide data sampled uniformly from the entire modeling domain

Plain Language Summary

Landslide susceptibility maps show which areas in a region are more prone to landsliding than others. These maps are created from attributes of mapped landslides. The variation in landslide attributes and amount of landslide data required makes it difficult to map landslide susceptibility accurately over large regions. It is unclear whether any previously proposed methods to overcome these difficulties produce accurate susceptibility maps. Here, we develop a framework that evaluates the effectiveness of the following methods: using landslide data sets from only a few locations where data are readily available, applying models only to regions presumed to have landslide attributes similar to the regions used to develop the models, or gathering a few uniformly distributed (i.e., spread approximately equally) landslide data points. We show that the wide variation in landslide attributes over large regions reduces the accuracy of landslide susceptibility models that are developed using data from only a few locations. Restricting model application to regions with presumed similar attributes does not improve model performance. However, using a limited landslide data set that covers the entire region produces accurate susceptibility maps.

1 Introduction

Landslides (defined here as any form of mass wasting, including debris flows, rock falls, or rotational slides) occur naturally across the world and cause substantial losses in life, infrastructure, property, and economies (Froude & Petley, 2018; Kirschbaum et al., 2015; Mirus et al., 2020; National Research Council, 1985; Varnes & IAEG Commission on Landslides, 1984), with the highest disaster risk occurring among the world's most vulnerable populations (Hallegatte et al., 2017). As the climate continues to change, the frequency and intensity of severe weather events are expected to increase in some parts of the world (Pendergrass & Knutti, 2018), which may result in increased landslides and associated losses (Froude & Petley, 2018; Haque et al., 2019; IPCC, 2019; Kirschbaum et al., 2015). As such, resources from many countries have been devoted to studying landslides and mitigating future losses. Products from these studies include hazard maps (e.g., Kirschbaum & Stanley, 2018; Micu et al., 2023; Nowicki Jessee et al., 2018), susceptibility maps (e.g., Crawford et al., 2021; Huang et al., 2020; Hughes & Schulz, 2020), early warning systems (e.g., Baum & Godt, 2010; Guzzetti et al., 2020), and emergency response plans (e.g., Godt et al., 2022; Wooten et al., 2017). Although these efforts have improved our knowledge of and response to landslides, additional work would be beneficial to further reduce landslide effects.

Susceptibility maps provide critical information on the spatial pattern and likelihood of landslide occurrence given the local terrain conditions (Reichenbach et al., 2018). In contrast, hazard maps quantify the timing and magnitude of landslides, and risk maps measure the expected losses from landslides. Regional-scale or larger (>1,000 km2) maps are fundamental for mitigating future losses from landslides by providing uniform information across scales relevant for land management, planning, infrastructure, and emergency response decisions (Godt et al., 2022). Several methods for categorizing landslide susceptibility have been used, including geomorphic mapping, heuristic methods, physically based methods, and data-driven statistical models (Reichenbach et al., 2018). Statistical methods are generally preferred when assessing landslide susceptibility over large regions due to their ability to provide estimates without the prohibitively detailed and extensive data necessary for the parameterization and evaluation of physically based methods. Statistical methods facilitate leveraging large and complex data sets that are often incomplete while outputting accurate results (Korup & Stolle, 2014). These models require the attributes (e.g., slope, soil thickness) of the modeling domain (i.e., the area where the model is applied) and landslide inventories that identify areas with geomorphic evidence of landsliding. The models output probabilities that indicate the relative level of landslide susceptibility within the modeling domain by estimating the probability of a location containing a mapped landslide. Since the 1980s, hundreds of papers have been published that evaluate the use of statistical models (Reichenbach et al., 2018). Common types of statistical models include logistic regression (Budimir et al., 2015), random forest (Chen et al., 2017; Tanyu et al., 2021; Trigila et al., 2015), generalized additive models (Bordoni et al., 2020; Steger et al., 2022), and deep learning (Thi Ngo et al., 2021). Often, the overall accuracies among these model types are comparable (Chen et al., 2017; Pradhan, 2013; Reichenbach et al., 2018; Trigila et al., 2015; Wang et al., 2019; Youssef et al., 2016). However, logistic regression is the most commonly used due, in part, to its simplicity and ease of implementation (Steger et al., 2016, 2017).

Most landslide susceptibility models (LSSMs) are, by design, dictated by input data. Here, LSSM refers exclusively to data-driven statistical methods. This reliance presents a fundamental challenge when trying to apply LSSMs at regional scales or greater due to the chronic lack of consistent, accurate, and representative landslide inventories over the entire modeling domain. The landslide inventories used to train the models should outline areas within the modeling domain with geomorphic evidence of landsliding. Despite the proliferation of new automated landslide mapping techniques that use increasingly available remote sensing data (e.g., Benz & Blum, 2019; Ghorbanzadeh et al., 2021; Nagendra et al., 2022), the landslide inventories required for LSSMs are still lacking over most of the world.

Previous attempts to create susceptibility maps over large areas have used several methods to model regions with little or no landslide data (Von Ruette et al., 2011). Van Den Eeckhaut et al. (2012) and Broeckx et al. (2018) created susceptibility maps over the European and African continents, respectively, using an inventory of landslide and non-landslide (i.e., no geomorphic evidence of landsliding) locations they describe as uniformly distributed in space. That is, the coverage of landslide locations is approximately equal across the area of interest. Van Den Eeckhaut et al. (2012) used an inventory of 1,340 landslides and Broeckx et al. (2018) used an inventory of 18,050 landslides. Although the inventories used were admittedly limited in terms of landslides per unit area, the authors argued that a uniformly distributed landslide inventory with samples from a range of different environments (i.e., across the entire modeling domain) creates an accurate LSSM. Stanley and Kirschbaum (2017) developed a heuristic fuzzy logic approach for modeling susceptibility on a global scale. The fuzzy logic approach combines heuristic assumptions about the effects of different predictors on susceptibility with the measured effects derived from the available landslide inventories to make an LSSM. A predictor is an environmental attribute (e.g., slope, soil thickness) used by the LSSM to evaluate landslide susceptibility. A predictor effect measures the change in the model outcome (i.e., the probability of landslide occurrence) due to a change in the predictor value and is often referred to as weights or coefficients, depending on the model used. The fuzzy logic method, in theory, helps overcome some of the data shortage problems by forcing the LSSM to include the expected effects of the predictors. The landslide inventory used by Stanley and Kirschbaum (2017) included one globally distributed database of 1,194 landslides derived largely from media and citizen scientist reports and eight higher-density inventories totaling 61,704 landslides mapped over targeted areas around the world that included individual U.S. states and a hurricane-affected area. Hervás (2007) and Hervás et al. (2010) outlined a framework for using a heuristic index-based approach to evaluate landslide susceptibility on regional scales with limited data. The index-based approach requires a user to assign a relative weight to each predictor based on their assumed effects on susceptibility. The user may then either use the purely heuristic weights or adjust them by evaluating their effectiveness on the available landslide data using different methods (e.g., analytic hierarchy process). As of 2018, the index-based approach was the second most used model for landslide susceptibility, comprising 29% of all publications on the subject (Reichenbach et al., 2018). Many authors have tried to refine this approach by defining different subregions of the mapping area of interest according to physiographic or climate attributes (Bălteanu et al., 2020; Günther et al., 2013; Malet et al., 2009; Wilde et al., 2018). By subdividing the modeled domain according to these attributes, the predictor effects may be better constrained due to similar triggering mechanisms and are expected to result in more accurate LSSMs. Despite these refinements, the index-based models remain largely dependent on expert opinion to determine susceptibility and may not account for variations in landslide characteristics obtained from data-driven approaches. 
Notwithstanding the variable approaches for building LSSMs with sparse and incomplete landslide inventories, the relative effectiveness of these approaches is still an open research question.

The purpose of these different methods for creating susceptibility maps is to obtain a representative model that accurately locates areas of mapped landslides over the domain of interest. For data-driven techniques, the level of model representation is largely dictated by the selection of landslide location data used to train the model (i.e., data sampling or model training strategy). A few notable studies have explored the impacts of different sampling strategies in detail. Tanyas et al. (2019) examined 25 earthquake-induced landslide inventories from around the world to evaluate which inventories were representative of others using logistic regression model performance metrics and k-means cluster analysis on the predictor data sets from each inventory. They show that the level of representativeness of a given event inventory varied across the other inventories and that grouping data sets by predictor similarities (k-means) did not significantly improve their representativeness. Petschko et al. (2013) analyzed the effects of dividing the study domain into distinct homogeneous subdomains with geological similarities for creating a regional-scale susceptibility map. They demonstrate that the most influential predictors were inconsistent between models trained on data from the different subdomains. This suggests differences in the level of similarity in landslide characteristics between each subdomain. However, the authors do not explore this phenomenon any further. Additionally, Kornejady et al. (2017) explored the differences between using random sampling and a Mahalanobis distance-dependent sampling strategy (Tsangaratos & Benardos, 2014). This technique prevents a purely random selection of data for model training and testing, instead creating a training data set that increases the variance in the landslide predictor values compared to most random sampling strategies. This approach improved the model's representativeness as reflected in an increase in model performance compared to random sampling. Despite this work, questions persist about the best sampling strategies to use over large regions and why some techniques perform better than others. The lack of consensus hinders the formulation of a consistent sampling strategy.

Validation procedures of susceptibility models generally use receiver operator characteristics (ROCs) (Ayalew & Yamagishi, 2005), qualitative evaluation based on expert opinion (Bălteanu et al., 2020), and comparisons to subregional-scale maps (Van Den Eeckhaut et al., 2012). The latter two methods are qualitative, preventing objective analysis of the accuracy of the LSSMs. The ROCs provide reproducible estimates of the effectiveness of the model at fitting the available data. However, these metrics are often measured without an objective evaluation of how factors within the model affect its outputs (Budimir et al., 2015; Reichenbach et al., 2018).

The purpose of this study is to evaluate how to improve susceptibility model performance over broad geographic regions when extrapolating to areas with limited landslide data. We do this by employing a statistical framework (workflow) that examines the predictor effects within LSSMs, which allows us to better understand how different sampling strategies affect the model outputs. This approach provides a more detailed understanding of the impact of different input data on model behavior and performance than previous efforts at evaluating data sampling strategies. We carry out several illustrative experiments to help determine the effectiveness of different sampling strategies for modeling susceptibility over large and diverse regions. The results of this study can help determine best practices and mitigate the misrepresentation of landslide susceptibility that results from models with low accuracies.

2 Methods

We employ a framework for understanding LSSM output and evaluating different sampling strategies for characterizing landslide susceptibility over large and diverse terrains. We compile landslide and predictor data from four regions across the United States with varying degrees of environmental similarity. Our data preprocessing steps are outlined in Figure 1 and follow the recommendations of previous work (e.g., Budimir et al., 2015; Chang et al., 2019; Ozturk et al., 2021; Segoni et al., 2020). After processing the data, LSSMs are created using Bayesian logistic regression (Das et al., 2012). Implementing the common logistic regression model within a Bayesian framework incorporates prior information to constrain the logistic coefficients and allows the explicit treatment of the uncertainty in the model's predictor coefficients and probability output (Korup, 2021; Korup & Stolle, 2014). After the different LSSMs are trained, we compare their logistic regression coefficients to measure the predictor effects derived from each model. We also estimate the ROC area under the curve (AUC) metric for each LSSM when applied to different data sets. Comparing the variation in predictor effects between the models with the estimated ROC-AUC values helps determine why some sampling strategies produce more accurate LSSMs than others.

Figure 1. Workflow of landslide susceptibility mapping.

We use our framework to carry out an array of experiments that illustrate three commonly used landslide data sampling techniques. First, we study the effects of applying a susceptibility model trained on data collected across diverse regions to areas with no data. We simulate this by training LSSMs on all the landslide inventories except one, testing the LSSM on the omitted site, and repeating so that each site is left out once. Second, we test whether limiting model development and application to regions with shared physiographic and ecologic characteristics improves model performance. Our selected landslide data sets include inventories with varying degrees of ecological and physiographic similarity. By training models on a single inventory and testing them on another inventory with shared ecological and physiographic properties, we can determine whether restricting model training and application based on these attributes improves model performance. Third, we evaluate the effectiveness of using a limited but uniformly distributed landslide inventory to develop LSSMs. That is, we intentionally do not include every known landslide within an area (limited) but instead randomly sample from a compilation of all landslides across the study areas (uniformly distributed) using a Mersenne-Twister random number generator (Matsumoto & Nishimura, 1998). To do this, we train a model on 5% of all the available landslide data from the compiled inventories and then test the model on the remaining 95% of the data. All the data processing and modeling are carried out in ArcGIS Pro (Esri Inc., 2021) and R (R Core Team, 2016).
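For concreteness, the leave-one-out training groups from the first experiment can be assembled as in the following R sketch; the site labels and the representation of each group as a character vector are illustrative, not the authors' code.

    # Minimal sketch: build the leave-one-site-out training groups
    sites <- c("Magoffin", "Doddridge", "Macon", "Elkhorn")
    groups <- lapply(sites, function(s) setdiff(sites, s))
    names(groups) <- paste0("All-", sites)
    # e.g., groups[["All-Doddridge"]] contains Magoffin, Macon, and Elkhorn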

2.1 Physiographic and Ecological Divisions

We use ecological regions (ecoregions) and physiographic provinces to subdivide the continental United States for susceptibility model tests (Figure 2). Level II ecoregions divide North America into 50 ecologically distinct areas at the subcontinent scale (Omernik & Griffith, 2014). Ecoregions are identified by analyzing spatial patterns of factors that affect the local ecosystem. Factors include geology, landforms, soils, vegetation, climate, land use, wildlife, and hydrology. The 25 physiographic provinces of the contiguous United States group areas with common topographic and geologic characteristics (Fenneman & Johnson, 1946). In contrast to ecoregions, physiographic provinces are distinguished by homogenous landforms that result from geologic structures and do not consider variations in climate or vegetation (Fenneman, 1917). Ecoregions and physiographic provinces may delimit regions with similar landslide triggering mechanisms (e.g., rainfall, spring thaw) and terrain attributes (Bălteanu et al., 2020; Günther et al., 2013; Malet et al., 2009; Wilde et al., 2018).

Figure 2. (a) Map of the level II ecoregions and physiographic provinces of the contiguous United States. Ecoregions (Omernik & Griffith, 2014) are colored according to the legend, while the physiographic provinces (Fenneman & Johnson, 1946) are outlined and labeled on the map. The locations of the four sites analyzed in this study are colored in red. (b) Zoomed-in map showing the location of the Elkhorn Ridge Wilderness. (c) Zoomed-in portion of the study sites in the eastern United States.

We chose four locations to explore the effects of physiographic and ecological grouping on LSSM performance (Figure 2). These locations were chosen for their variable levels of physiographic and ecological similarity and the systematic approaches used to compile their landslide inventories (see Section 2.2). Magoffin County, Kentucky, is in the Appalachian Plateaus physiographic province and the Ozark/Ouachita-Appalachian Forests ecoregion. Doddridge County, West Virginia, shares the same physiographic and ecological divisions as Magoffin County. Macon County, North Carolina, shares the same ecoregion as the previous two but is in the Blue Ridge physiographic province. Lastly, the Elkhorn Ridge Wilderness, California, is in the Pacific Border physiographic province and the Marine West Coast Forest ecoregion. Attributes of these ecoregions and physiographic provinces are shown in Table 1.

Table 1. Physiographic Province and Level II Ecoregion Attributes

Physiographic province:
Name                | Geology                                                  | Topography
Appalachian Plateau | Mostly undeformed Paleozoic sedimentary rocks            | Steep rugged terrain
Blue Ridge          | Extensively deformed Precambrian metamorphic rock        | Steep rugged terrain
Pacific Border      | Extensively deformed and faulted rocks of variable ages  | Steep rugged terrain

Level II ecoregion:
Name                               | Precipitation range (mm per year) | Vegetation
Ozark/Ouachita-Appalachian Forests | 900–1,500                         | Low mountain forests
Marine West Coast                  | 650–5,000                         | Coniferous forests

2.2 Landslide Inventory

We compiled existing inventories for Magoffin County (Crawford, 2023; Crawford et al., 2021), Doddridge County (Kite et al., 2019), Macon County (Wooten et al., 2017), and the Elkhorn Ridge Wilderness (Wills et al., 2016). The different inventories are publicly available and consist of polygons and points of landslide features (i.e., head scarp, flanks, toe slopes, and hummocky topography) apparent in base maps (e.g., slope, hillshade, curvature, contour) derived from digital elevation models (DEMs), aerial photography, and field investigations. Details of the different inventories are shown in Table 2. All four inventories were mapped by experienced geologists using a well-defined and systematic approach to landslide identification, as detailed in each reference above, with high-resolution DEMs or aerial imagery. As such, all the mapped locations of landslides considered in this study meet or exceed the criteria for good confidence (or level 3) defined by Mirus et al. (2020), meaning the landslide features are at or near (within the resolution and accuracy limits of the identification tools) their mapped locations. Although no landslide inventory is perfect, the dense coverage of mapped landslides and the systematic approaches of the different mapping teams make these inventories highly suitable for our study objectives. Shaded relief maps of sections of the counties with landslide points plotted are shown in Supporting Information S1 (Figures S1–S4).

Table 2. Landslide Inventory Attributes

Location | Mapping tools | Landslide data format | Landslide count | Location area (km2) | Landslide density (km−2) | Reference
Magoffin County, Kentucky | 1.5-m DEM and derivatives, aerial photography, field reconnaissance | Polygon (total affected area) | 2,003 | 800 | 2.5 | Crawford et al. (2021)
Doddridge County, West Virginia | 3-m DEM and derivatives, landslides verified by two independent surveyors | Point (at head scarp) | 1,731 | 829 | 2.09 | Kite et al. (2019)
Macon County, North Carolina | 6-m DEM and derivatives, aerial photography, geologic maps, field reconnaissance | Point (at head scarp) and Polygon (total affected area or deposit) | 640 | 1,347 | 0.48 | Wooten et al. (2017)
Elkhorn Ridge Wilderness, California | Variable-resolution DEMs and their derivatives, aerial photography, field reconnaissance, previous geologic data | Point (if too small to map on 1:24,000 map) and Polygon (total affected area or deposit) | 3,087 | 711 | 4.34 | Wills et al. (2016)

Landslide locations are standardized to points because of the inconsistent data formats between the inventories. We convert landslides mapped as polygons to points by finding the highest elevation point within the polygon. In cases where multiple pixels have the same maximum elevation, we select the pixel with the highest slope. The point of highest elevation within the polygon will more closely approximate the location of the landslide head scarp. The LSSMs also require training data that contain non-landslide points, which represent areas without any signs of landslide occurrence. To extract the non-landslide points, we randomly sample areas outside the mapped landslide polygons. For landslides originally mapped as points, we sample locations outside a buffer with a radius derived from the average area of the polygons within the same data set, where possible. All landslides in the Magoffin data set are mapped as polygons, so there are no points to buffer. Macon and Elkhorn use radii of 158 and 150 m, respectively. For Doddridge County, only point data of landslide head scarps exist, so we use the radius derived from Magoffin County (45 m) due to the shared physiographic province and ecoregion. Buffering the landslide point data helps prevent sampling landslide locations as non-landslide locations. Importantly, work by Zhu et al. (2017) and Nowicki Jessee et al. (2018) showed that varying the buffer size does not significantly affect susceptibility model output.
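The conversion rule can be sketched in R as follows, assuming the DEM, slope raster, and polygon mask are held as aligned matrices (an illustrative representation; the original processing was done in ArcGIS Pro):

    # Return the index of the highest-elevation cell inside a landslide
    # polygon, breaking elevation ties by the steepest slope
    polygon_to_point <- function(dem, slp, mask) {
      cells <- which(mask)                         # cells inside the polygon
      top <- cells[dem[cells] == max(dem[cells])]  # highest-elevation cells
      top[which.max(slp[top])]                     # tie-break on slope
    }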

The sampling ratio between landslide and non-landslide points can have significant effects on model outcomes due to potential sampling bias (King & Zeng, 2001; Nad'o & Kaňuch, 2018; Oommen et al., 2011). To help mitigate these effects, we followed the methods outlined by Oommen et al. (2011) and King and Zeng (2001) to detect the most appropriate sampling ratio. A frequentist logistic regression model is run on a series of sampling ratios of non-landslide to landslide points ranging from 100:1 to 1:1. Values of recall, precision, and the weighted harmonic mean of precision and recall (F-measure) are then calculated for the landslide and non-landslide classes. The sampling ratio that shows the most consistent recall, precision, and F-measure values within each class has the least sampling bias. We determine this by measuring the standard deviation of the three metrics within each class and finding the ratio with the smallest Euclidean distance of the two class standard deviations (i.e., treating the two standard deviations as a vector and taking its length). In this study, a sampling ratio of 1:1 proved best for mitigating sampling bias.
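Our reading of this selection criterion, expressed as a hedged R sketch (the bookkeeping and function names are illustrative, not the authors' code):

    # metrics[[i]]: 2 x 3 matrix of recall, precision, and F-measure for the
    # i-th candidate ratio (rows: landslide and non-landslide classes)
    pick_ratio <- function(ratios, metrics) {
      d <- sapply(metrics, function(m) {
        class_sd <- apply(m, 1, sd)  # SD of the three metrics per class
        sqrt(sum(class_sd^2))        # length of the two-class SD vector
      })
      ratios[which.min(d)]           # ratio with the most consistent metrics
    }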

2.3 Model Predictors

We compile an array of predictors to differentiate between landslide and non-landslide locations (Table S1 in Supporting Information S1). Predictors are chosen based on their effectiveness in other studies for determining landslide susceptibility (e.g., Budimir et al., 2015). They are designed to characterize the local geology (e.g., lithology), soil (e.g., soil thickness), topography (e.g., slope), hydrology (e.g., flow accumulation), anthropogenic impacts (e.g., proximity to roads), climate (e.g., mean annual precipitation), weather (e.g., precipitation frequency), and seismology (e.g., peak horizontal acceleration), all of which have the potential to influence the driving and resistive forces that affect slope stability. The raw predictor data are available in a variety of different resolutions and formats (i.e., vector and raster). Thus, all data are converted to raster and resampled to the same resolution as the DEMs used to derive the topographic predictors (i.e., 10 m) using the nearest neighbor interpolation method. We use a 10-m DEM from the U.S. Geological Survey 3D Elevation Program database (U.S. Geological Survey, 2019) because it is the finest resolution available over all the study sites, and coarser resolutions would obscure the predictor values that led to ground failure. Categorical predictors are converted to model matrices with K−1 different categories (McElreath, 2020, Section 5.4.2), where K is the number of categories within a given predictor. The Soil Survey Geographic Database (SSURGO) (U.S. Department of Agriculture, 2021b) is a high-resolution soil database but has some null data points within the study regions. Thus, we fill the null data with values from the coarser-resolution STATSGO data set (U.S. Department of Agriculture, 2021a). To increase computational efficiency in the susceptibility models, we standardize the compiled predictors of each location included in training the LSSM to have a mean of zero and a standard deviation of one (Kruschke, 2015, Section 17.2.1.1). Herein, we refer to the combined data sets of the locations used to train a particular model as a training group. Although some of the included predictors are not available in many parts of the world, the overall findings of our analysis are pertinent to any study that uses machine learning or other statistical methods to assess landslide susceptibility.
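Because test data are later standardized with the training group's statistics (Section 2.5.1), the standardization step amounts to the following R sketch (variable names are illustrative):

    # Standardize training predictors to zero mean and unit SD, and reuse the
    # training means/SDs on any test set (train_X, test_X: numeric matrices)
    mu  <- colMeans(train_X)
    sds <- apply(train_X, 2, sd)
    train_Z <- scale(train_X, center = mu, scale = sds)
    test_Z  <- scale(test_X,  center = mu, scale = sds)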

Correlation between predictors can cause inaccurate estimates of the measured predictor effects, which makes meaningful comparisons between the LSSMs difficult. Thus, we use the variance inflation factor (VIF), a measure of collinearity between predictors in the LSSM (James et al., 2013), to eliminate the predictors that are most correlated with others (Hong et al., 2015) using an iterative approach for each training group. For each iteration, we run a frequentist logistic regression model and eliminate the predictors in the highest tenth percentile of VIF values greater than five, continuing until all predictors have a VIF value less than five. A VIF value of five is a conservative threshold for eliminating collinearity within a statistical model (James et al., 2013). Using an iterative approach allows us to account for variations in the VIF values from changes in the predictor combinations. After the correlated predictors are removed from the LSSMs, predictors are matched between training groups by eliminating predictors absent in any member of the training groups. This allows us to analyze the variation in effects of the same predictors between training groups.
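A minimal sketch of the iterative VIF elimination, assuming the vif() function from the car package and continuous predictors (the text does not specify the authors' implementation):

    library(car)  # for vif(); an assumed tool, not stated in the text
    drop_collinear <- function(dat, response) {
      preds <- setdiff(names(dat), response)
      repeat {
        fit <- glm(reformulate(preds, response), data = dat, family = binomial)
        v <- car::vif(fit)
        high <- v[v > 5]
        if (length(high) == 0) return(preds)  # all VIFs are now below five
        # drop predictors in the highest tenth percentile of VIFs above five
        preds <- setdiff(preds, names(high[high >= quantile(high, 0.9)]))
      }
    }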

2.4 Modeling Strategy

Landslide susceptibility models are created and tested on the following training groups (Figure 3):
  1. Two models trained on a random sampling of half of the Magoffin landslide data set (Magoffin 1 and Magoffin 2) (Figure 3a). This training group acts as the control by measuring the model accuracies when the training and test data are from the same area.

  2. One training group for each location individually (four in total) (Figure 3b). Training LSSMs at one location and testing them on another will determine whether grouping data sets according to physiographic or ecological attributes is a meaningful division for creating broad-scale LSSMs.

  3. A training group that consists of data from all the locations (All) (Figure 3c). This training group is tested on the training group for each individual location to evaluate how the model behavior changes when it is trained on an aggregated data set compared to models developed on a subset of the data.

  4. Four training groups with all the data except one location (e.g., All-Doddridge indicates all locations except Doddridge) (Figure 3d). These training groups are tested on the withheld location. This experiment evaluates the expected model accuracies when applying an LSSM trained on a compilation of data from diverse regions to an area with no data.

Figure 3. Schematic examples of the different modeling strategies described in Sections 2.4 and 2.6. Red blobs illustrate the landslide training data, and the labels indicate the locations where the trained model is applied (i.e., tested).

The models trained on these groups are evaluated by comparing the measured predictor effects between the models and by testing each model on the other training groups' data and measuring its performance using ROC-AUC.

2.4.1 Logistic Regression

We use Bayesian logistic regression to model landslide susceptibility. Logistic regression estimates the log-odds, or logit function $\eta(P) = \ln[P/(1-P)]$, of a binary outcome (i.e., landslide occurrence or no landslide occurrence) given some predictor input data, where P is the probability of there being a mapped landslide. The logistic regression model for observation i and M input predictors is expressed as follows:

$$\eta(P_i) = \beta_0 + \sum_{m=1}^{M} X_{i,m}\,\beta_m \qquad (1)$$

where $X$ is an N by M predictor data matrix, N is the number of observations, $\beta$ is the coefficient vector of length M, $\beta_0$ is the intercept, and $\eta$ is the logit (log-odds) function. The logistic coefficient of a given predictor variable provides the change in log-odds with a unit change in the predictor variable. We use logistic regression because it is one of the most commonly used models for predicting landslide susceptibility (Reichenbach et al., 2018). Implementing a statistical analysis of the different logistic regression coefficients between the training groups allows us to statistically evaluate changes in predictor effects on landslide susceptibility.

Bayesian logistic regression incorporates uncertainty into the model by using probability distributions of the model parameters. While the frequentist methodology estimates the probability of the true value of a parameter being within a given range using confidence intervals established by the data, the Bayesian framework assumes the data to be fixed and knowledge about the parameter to be a distribution (van de Schoot et al., 2021). Bayesian methods incorporate prior knowledge about the parameters of interest, provide uncertainty estimates of the probability outputs, and offer greater flexibility in post-processing, which facilitates more transparent and interpretable results (Das et al., 2012; Korup, 2021; Loche et al., 2022). These benefits help prevent model users from making unjustified conclusions about the data compared to traditional frequentist models (Wasserstein & Lazar, 2016).

The basis of Bayesian analysis is that the probability of the unobserved parameter(s) of interest ($\theta$) given some data ($x$) is expressed by the posterior probability $P(\theta \mid x)$:

$$P(\theta \mid x) = \frac{P(\theta)\,P(x \mid \theta)}{P(x)} \qquad (2)$$

where $P(\theta)$ is the prior probability and $P(x \mid \theta)$ is the likelihood function. The prior probability is the estimated probability of $\theta$ before $x$ is observed. The likelihood function is the probability of observing $x$ for a given $\theta$ and is given by Equation 1. The posterior probability distributions using a logistic regression model have no analytical solution; thus, a Markov Chain Monte Carlo approach is used to numerically estimate $P(\theta \mid x)$ (Kruschke, 2015). We assume independent modeling regions and provide Gaussian priors with a mean of zero and standard deviations of 2.5 and 10 for the parameter coefficients ($\beta$) and intercepts ($\beta_0$), respectively. As the input data are large and standardized, these priors are considered weakly informative (Gelman et al., 2013). That is, with the amount of data we use in these models (Table 2), the likelihood function will dominate and the priors will not heavily affect the posterior probabilities. Posterior probabilities are estimated using the statistical software Stan (Stan Development Team, 2017) through the computational environment R (R Core Team, 2016). Stan is run with four chains at 4,000 iterations each and a 1,000-iteration warmup (burn-in) to omit the unrepresentative initial values of the model before it converges on the representative parameter distributions. Diagnostics run on the Markov chains indicate that they were well mixed (i.e., provide a representative sampling of the posterior distribution).
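The model specification can be approximated with the rstanarm interface to Stan, as in the sketch below; the data frame and the binary landslide column (slide) are illustrative, and the authors may instead have coded the model directly in Stan.

    library(rstanarm)  # Bayesian applied regression modeling via Stan
    fit <- stan_glm(
      slide ~ .,                        # landslide indicator vs. all predictors
      data   = train_data,              # standardized training group
      family = binomial(link = "logit"),
      prior           = normal(0, 2.5), # Gaussian prior on coefficients
      prior_intercept = normal(0, 10),  # Gaussian prior on the intercept
      chains = 4, iter = 4000, warmup = 1000
    )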

2.5 Model Evaluation

In addition to the widely used ROC metrics, we use an analysis of regression coefficients to understand the logistic regression model performance. Each method is explained in detail in the following sections. Although ROC metrics can provide meaningful insights into the overall model performance, they do not elucidate the controls of a given model's performance. A better understanding of the model by analyzing the predictor effects provides valuable information about why certain LSSMs perform better in different scenarios. This will inform modelers on the effects of the input data and the limits of a model's utility when applied to different settings.

2.5.1 Receiver Operator Characteristics

The AUC metric of ROC is used to evaluate the performance of each model applied to different training data. The ROC curves compare the true positive rate against the false-positive rate (see Oommen et al., 2011, for an overview). AUC values near one indicate perfect model accuracy (i.e., every landslide and non-landslide from the data is modeled correctly), whereas AUC values near 0.5 indicate the model classification is equivalent to random guessing. Generally, values from 0.5 to 0.6, 0.6 to 0.7, 0.7 to 0.8, 0.8 to 0.9, and 0.9 to 1.0 are classified as poor, average, good, very good, and excellent performance, respectively (Yesilnacar, 2005). Importantly, before applying the trained model on the test data sets, predictors are standardized using the means and standard deviations of the model's training group. We measure the ROC-AUC for each LSSM both on its training data set and for every permutation of the other training group pairs. These comparisons allow us to first evaluate how accurate the model is at recreating its training data and then determine how accurate the model is when applied to other landslide data sets. We use the posterior distributions of $\beta_0$ and $\beta$ to obtain distributions of ROC-AUC for each LSSM comparison.
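Because ROC-AUC is equivalent to the Mann-Whitney rank statistic, the AUC for each posterior draw can be computed with a few lines of base R; a minimal sketch:

    # AUC via the rank-sum identity: labels are 0/1 indicators of mapped
    # landslides, scores are the modeled landslide probabilities
    roc_auc <- function(labels, scores) {
      r  <- rank(scores)  # average ranks handle tied scores
      n1 <- sum(labels == 1)
      n0 <- sum(labels == 0)
      (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
    }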

2.5.2 Logistic Coefficient Comparison

Raw coefficients of uncorrelated predictors cannot be directly compared between LSSMs (Mood, 2010) without first converting the logistic coefficients to a measure of probability changes (e.g., average marginal effects, AME). In brief, the fixed variance of the logit function (Equation 1) requires that any variance in the unobserved response must be accommodated by a change in the logistic coefficients ($\beta$). A detailed explanation and proof of this concept is found in Mood (2010). The AME measures the average change in landslide occurrence probabilities attributed to a given predictor (m) and is given by

$$\mathrm{AME}_m = \beta_m \frac{1}{N} \sum_{i=1}^{N} f(z_i) \qquad (3)$$

$$f(z_i) = \frac{e^{z_i}}{\left(1 + e^{z_i}\right)^2} \qquad (4)$$

where $z_i = \beta_0 + \sum_{m=1}^{M} X_{i,m}\,\beta_m$ is the linear combination of the predictors with their coefficients ($\beta$) for the ith observation, N is the number of observations, and $f$ is the logistic probability density function given by the derivative of the logistic cumulative distribution function. Large magnitudes of AME indicate that the given predictor has a major influence on landslide susceptibility, whereas small AME magnitudes indicate that the predictor has only minor influence. The AME sign shows if the probability decreases (negative sign) or increases (positive sign) with an increase in the predictor value. In summary, AME distributions can be used to directly compare logistic LSSMs trained on different data.
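Equations 3 and 4 translate directly to R; the sketch below evaluates the AME of predictor m for a single posterior draw of the intercept and coefficients (function names are ours):

    # Logistic density f(z) (Equation 4) and the AME of predictor m (Equation 3)
    logistic_pdf <- function(z) exp(z) / (1 + exp(z))^2
    ame_m <- function(X, beta0, beta, m) {
      z <- beta0 + as.vector(X %*% beta)  # linear predictor per observation
      beta[m] * mean(logistic_pdf(z))     # average marginal effect
    }

Applying ame_m() across all posterior draws yields the AME distributions summarized in Figure 4.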

2.6 Limited Data Experiment

We simulate the effects of using a limited landslide inventory that is uniformly distributed in space to model landslide susceptibility over large areas by subsampling all the landslide data previously described (Sections 2.2 and 2.3). We randomly sample 5% of the compiled landslide and non-landslide inventory data using a 1:1 sampling ratio from all the study sites to optimize and train a Bayesian logistic regression model (Figure 3e). We then evaluate each model's ROC-AUC metric on all the landslide data not used in the model training phase (i.e., 95% of all the landslide data) using the means of the posterior distributions of the model parameters. We also analyze the models' AME distributions to compare them with the ROC-AUC metrics. We iterate this procedure 100 times to estimate the variation in model performance from different training data.
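A condensed sketch of one way to run this experiment in R, reusing roc_auc() from Section 2.5.1; fit_lssm() and the slide column are placeholders for the Bayesian fit and the binary landslide indicator (R's default random number generator is the Mersenne-Twister):

    # 100 iterations of the 5%/95% split; aucs collects the test ROC-AUCs
    aucs <- replicate(100, {
      idx <- sample(nrow(dat), size = round(0.05 * nrow(dat)))
      fit <- fit_lssm(dat[idx, ])  # train on the 5% sample (placeholder)
      p   <- predict(fit, newdata = dat[-idx, ], type = "response")
      roc_auc(dat$slide[-idx], p)  # test on the remaining 95%
    })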

3 Results

3.1 Predictor Effects

Eleven predictors were included in the LSSMs after the correlation and matching phases of the workflow (Figure 1). These predictors were soil thickness (Thick), available water holding capacity (AWC), slope, transformed aspect (Aspect; McCune & Keon, 2002), topographic roughness with 30- and 100-m windows (Rough30 and Rough100), flow length (FlowLength), flow accumulation (FlowAcc), topographic wetness index (TWI), proximity to rivers (ProxRivers), and proximity to roads (ProxRoads). Figure S5 in Supporting Information S1 shows that the non-standardized distributions of the slope, Rough30, and Rough100 predictors for landslides generally have elevated values compared to non-landslide locations. Most of the other predictors show little variation between landslide and non-landslide locations. Many of the predictors show different distributions between training groups.

Posterior distributions of the AMEs show large variation between locations for the logistic regression LSSMs (Figure 4). There is no consistent sign (i.e., direction) within many of the predictors, indicating that an average unit increase in predictor values may increase or decrease the probability of detecting a mapped landslide, depending on the location. For instance, Macon County shows a negative AME value for the slope, indicating that, on average, steeper slopes decrease the chances of capturing mapped landslides in that area, whereas in all the other study areas, increased slope values increase the chances of capturing a mapped landslide. In addition to the inconsistent AME signs, the AME magnitudes are highly variable. This indicates highly variable predictor importance between locations. The most consistent AME distributions are between the split Magoffin data distributions (Magoffin1 and Magoffin2) and the combined Magoffin data (Magoffin1+Magoffin2). The extreme AME values for FlowAcc are due to the highly skewed distribution of that predictor (Figure S5 in Supporting Information S1). However, the influence of the extreme values is minimized in the models due to their paucity. A more in-depth statistical analysis and interpretation of the AME posterior distributions is presented in Supporting Information S1 (Text S1, Figures S6 and S7). This analysis confirms that most of the predictors' AME distributions are credibly (95% credibility interval) different between locations. Finally, Figure 4 shows that many predictor AME distributions overlap with zero (e.g., Aspect), indicating that these predictors consistently have minimal influence on model performance. However, these predictors are influential in a minority of the models (e.g., Aspect for the Doddridge model). Other predictors show distribution magnitudes that generally have little overlap with zero (e.g., Slope and Rough30) suggesting that these predictors are consistently highly influential on the model outputs.

Figure 4. Posterior distributions of the parameter average marginal effects (AME) from the logistic regression model, plotted as boxplots. AME measures the average change in landslide occurrence probabilities attributed to a given predictor. The box hinges show the first and third quartiles, the whiskers extend to 1.5 times the inter-quartile range, and the vertical bars show the median values of the distributions.

3.2 Receiver Operator Characteristics

The ROC-AUC values show notable variations across the different model comparisons. All models perform satisfactorily (≥0.6) under resubstitution (i.e., when the same data are used to train and validate the model) (Figure 5). The Magoffin County self-comparisons show higher AUC values (mean of 0.70) compared to the other models with independent training and test data. By independent, we mean that no data are shared between the training and test data sets (e.g., the trained All-Locationx model evaluations), in contrast to ROC-AUC results from non-independent training and test data sets (e.g., the resubstitution model evaluations). However, comparisons of models within the same ecological or physiographic regions show no increase in model performance compared with areas outside a shared region. Models that include all the data and are applied to any specific location also perform better (average of 0.61) than models with independent training and test data. However, when the test location data are omitted from the compiled training set, model performance decreases (average of 0.55). The lack of overlap between many of the AUC distributions indicates that many of the differences in model performance are statistically meaningful (i.e., exceed random noise).

Figure 5. Area under the curve (AUC) distributions for receiver operator characteristics (ROC) calculated for various model experiments using logistic regression. The training data column indicates the (non-limited) data used to train the model, and the test data column indicates where the model was applied for measuring the ROC-AUC. Results are organized according to ecoregion and physiographic comparisons. All-Locationx indicates that the model was trained on data from every location except Locationx. The box hinges show the first and third quartiles, the whiskers extend to 1.5 times the inter-quartile range, and the vertical bars show the median values of the distributions.

3.3 Limited Data Experiment

Landslide susceptibility models developed with a limited but uniformly distributed landslide data set perform relatively well compared to models trained on areas outside their test areas (Figure 5). Figure 6 shows the ROC-AUC scores (gray dots) of the 100 model iterations trained on 5% of the compiled landslide inventory from all the study sites and tested on the remaining 95%. The average ROC-AUC value of the model iterations is 0.63. This is lower than the Magoffin self-comparisons, which used 50% of the landslide data, but higher than most of the other comparisons with independent training and test data sets (Figure 5). Even the minimum AUC score (0.615) from the limited data runs is higher than the AUC values of most of the All-Locationx and the ecoregion-segregated models. The LSSMs trained with limited data generally show predictor effects (i.e., AME distributions) consistent with those of the LSSMs trained using all the data (Figure 7). The magnitudes of the posterior AME values also reflect the influence of predictors in the combined domain. The high average magnitudes of slope, Rough30, and AWC suggest that, on average, these predictors are influential in the combined domain. In contrast, the predictors FlowLength, Thick, Aspect, FlowAcc, ProxRoads, ProxRivers, and TWI all have AME values that are indistinguishable from zero at high probability. This suggests that, on average, these parameters have little influence on the model output over the combined domain.

Figure 6. Kernel density plot of ROC-AUC scores from iterations of the logistic regression LSSMs trained on 5% of all the data and tested on the remaining 95%. The black line shows the kernel density, the red bar shows the mean ROC-AUC, and the gray dots show the mean ROC-AUC values for each model iteration, with vertical spacing set to facilitate visualization.

Figure 7. Posterior distributions of the logistic regression coefficient average marginal effect (AME) values from the limited-data experiment iterations. Multi-colored lines show the AME distributions for each of the 100 iterations, the bold black distribution shows the average distribution of all the iterations, and the bold red line shows the posterior AME distribution when using all the landslide data to train the model. Note that the center (i.e., mean) of all iterations (bold black line) is near the center (i.e., mean) of the posterior distribution developed using all the landslide data (bold red line).

4 Discussion

Our results illustrate why certain sampling (or model training) strategies designed for creating LSSMs for regional or greater scales perform better than others by examining the resultant variation in predictor effects. Through our analysis, we demonstrate the following:
  1. LSSMs trained with extensive local landslide data may still perform poorly when applied to other regions with no landslide data;

  2. Using a model in areas with shared physiographic provinces and level II ecological divisions does not guarantee improved performance;

  3. Data uniformly distributed (i.e., spread approximately equally) in extent but limited in quantity can produce relatively accurate LSSMs compared to the previous two approaches.

Thus, focusing efforts on obtaining uniformly distributed landslide inventories over the entire modeling domain will likely produce better results than using a few high-density but spatially isolated landslide inventories. Using spatially isolated landslide inventories to infer landslide susceptibility in data-poor regions has been common practice for regional and global applications of landslide susceptibility (Reichenbach et al., 2018; Stanley & Kirschbaum, 2017). Below we explore each of these points in detail.

4.1 Model Accuracy in Limited-Data Regions

We show that LSSMs are very sensitive to the local conditions of the different landslide inventories and that this sensitivity manifests in the predictor effects (Figure 4). Thus, applying LSSMs trained on a location with different predictor effects from the test location is likely to yield poor overall landslide susceptibility characterization, as indicated by the ROC-AUC scores (Figure 5). This variation in local conditions may indicate differences in the triggering events responsible for landsliding at our study locations. Although the landslide inventories used in this study do not include trigger mechanisms, the negative AME values estimated for the slope predictor at Macon County may indicate that landslides in that area are predominantly rainfall-triggered, whereas the other study areas may include more (or some) earthquake-triggered landslides (Marc et al., 2018; Meunier et al., 2008; Rault et al., 2019). Earthquake-triggered landslides generally cluster toward ridge crests, where slopes are the steepest, due to topographic amplification effects. Alternatively, this difference could reflect the influence of human activity on more accessible slopes within Macon County (Wooten et al., 2017). Finally, our framework explains why different methods used to determine landslide susceptibility over regions with limited data may not perform well.

The observed variable effects of the model predictors between locations indicate that applying LSSMs to limited-data regions may lead to spurious local results, as measured by the ROC-AUC and the divergence in predictor effects. While poor model performance when models are applied outside of their training domain is a commonly reported finding (e.g., Tanyas et al., 2019), our analysis highlights why this sampling strategy is ineffective. Applying LSSMs to regions where the model was not trained is often done to evaluate the versatility of the model. In most studies, this is carried out by dividing the landslide data set by random sampling (∼60% of publications), temporal attributes (∼20% of publications), or location (∼15% of publications) (Reichenbach et al., 2018). Thus, in most cases, the test data set is spatially near the training data set. The comparison between the split Magoffin data sets simulates the variation in local effects expected when testing LSSMs on areas spatially near, or overlapping, the training data (Figure 4). Although the local effects are not identical in sign and magnitude, they are relatively consistent, resulting in higher ROC-AUC values compared to other model evaluations that do not include the training data (Figure 5). However, when studies develop susceptibility models for areas that include regions with little or no landslide data, there is often no way to effectively evaluate the model performance over these regions. The low ROC-AUC scores of models applied to areas omitted from the training data indicate the model accuracies that might be expected when applying LSSMs to limited-data regions (Figure 5). A model applied to limited-data regions will likely omit important variations in predictor effects needed to accurately model landslide susceptibility (Figure 4).

Our observed differences in predictor effects between locations may partially explain the commonly poor results from heuristic and physically based susceptibility methods when applied to diverse terrains (e.g., Fusco et al., 2021). Like data-driven statistical methods, physically based methods are calibrated on available data from local observations that may not manifest the full range of attributes responsible for slope failure within the study site. Additionally, heuristic approaches (e.g., fuzzy logic and index-based methods) assume fixed predictor effects across the entire modeling domain. Thus, the susceptibility models developed using these methods may perform poorly when applied to areas with different landslide attributes (i.e., predictor values).

4.2 Ecological and Physiographic Divisions

We show that using continental-scale physiographic and ecological divisions to restrict where models are trained and applied sometimes produces maps worse than random susceptibility assignments (i.e., ROC-AUC values less than 0.5; Figure 5). Previous studies that use this approach assume that restricting the region where a model is trained and applied will lead to more uniformity in the predictor effects across the restricted domain and more accurate model performance. The logic for such restrictions appears sound because areas with similar climate and terrain are expected to have similar triggering mechanisms and landslide attributes. However, predictor effects are too diverse within subregions of the level II ecoregions and physiographic provinces for these divisions to improve the models' performances at the 10-m pixel scale used herein. Applying models to data sets with the same physiographic and ecological attributes as the training data did not help constrain the predictor effects between the training groups (Figure 4). This prevented any improvements in LSSM performance (Figure 5). It is possible that the physiographic and ecological divisions used herein are too broad to segregate the landslide inventories into representative groups or that they do not properly capture the environmental attributes that control slope stability. Work by Tanyas et al. (2019) attempted to segregate the modeling domain by clustering locations with similar landslide predictor values and also found negligible improvement in model performance compared to an aggregated model. Effective means of segregating a modeling domain into representative subdomains remains an open research question (Kornejady et al., 2017; Loche et al., 2022; Petschko et al., 2013; Tanyas et al., 2019). Future work could use our proposed framework to evaluate whether using more localized subdivisions reduces the variation of predictor effects in LSSMs.

4.3 Uniformly Distributed Landslide Inventory

A limited but uniformly distributed landslide inventory over the modeling domain performs relatively well (Figure 6); however, omitting any area from the model training phase may lead to poor susceptibility characterization of the withheld area despite assumed environmental similarities (Figure 5). The ROC-AUC values for the logistic regression models in the limited data experiment indicate above-average performance compared to the other model experiments with independent training and test data (Figure 6). Figure 7 illustrates the reason for this good performance: the predictor effects are relatively consistent between LSSMs trained on all the data (bold red lines) and LSSMs trained on only a small sample (multi-colored lines). In contrast, Figure 4 shows differences in the magnitudes and scales of the AME distributions between models. By using uniformly distributed training data, the model converges on the most representative coefficients over the whole domain, but at the expense of poorer accuracy at some smaller spatial scales whose predictor effects diverge from the average of all the data sets (see the Trained on All section of Figure 5; Figures 4 and 7). Other empirical observations in machine learning applications indicate that having at least 10 events (landslides) for every predictor is sufficient to estimate the predictors' coefficients within a prediction model (Moons et al., 2014; Pavlou et al., 2016; Peduzzi et al., 1996). As we sample 5% of the available data (373 landslides), we have ∼33 landslides for every predictor in our models, which may explain why the limited data experiments performed well on the 95% of data left for cross-validation.
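The sketch below (synthetic data; the predictor names, sample fraction, and coefficients are illustrative assumptions, not our model) shows one way to check both points numerically: it repeatedly fits a logistic regression on small uniform samples, compares each sample's average marginal effects (AME) against the full-data fit in the spirit of Figure 7, and reports the events-per-variable count underlying the 10-events-per-predictor rule of thumb.

```python
# Sketch only: AME stability under repeated small uniform samples, plus
# events-per-variable (EPV) for each sample.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20000
X = pd.DataFrame({"slope": rng.uniform(0, 45, n),
                  "wetness": rng.normal(0, 1, n)})
y = (rng.random(n) < 1 / (1 + np.exp(-(-3 + 0.08 * X["slope"]
                                       + 0.5 * X["wetness"])))).astype(int)

def ame(model, X):
    # AME of predictor j in a logistic model: mean over points of
    # beta_j * p * (1 - p).
    p = model.predict_proba(X)[:, 1]
    return model.coef_[0] * np.mean(p * (1 - p))

full = LogisticRegression(max_iter=1000).fit(X, y)
print("full-data AME:", np.round(ame(full, X), 4))

frac = 0.05  # illustrative 5% uniform sample, echoing the experiment above
for seed in range(5):
    idx = np.random.default_rng(seed).choice(n, int(frac * n), replace=False)
    m = LogisticRegression(max_iter=1000).fit(X.iloc[idx], y.iloc[idx])
    epv = y.iloc[idx].sum() / X.shape[1]  # events (landslides) per predictor
    print(f"sample {seed}: AME = {np.round(ame(m, X.iloc[idx]), 4)}, "
          f"EPV = {epv:.0f}")
```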

While restricting the training data to so few landslides greatly limits how well the full range of possible environmental conditions, and hence external domains, is represented, using uniformly distributed but limited data is better than using more data that do not cover the entire modeling domain. For example, the large drop in ROC-AUC scores in regions not included in the compiled data sets illustrates the consequences of not having a uniformly distributed landslide inventory when using statistical LSSMs (Figure 5). The omission of any area with predictor effects different from the training data could result in very poor susceptibility characterization of that area. Regional-scale or larger landslide susceptibility models trained on spatially constrained inventories are unlikely to represent the full range of predictor effects necessary to accurately characterize landslide susceptibility over data-poor areas. Although some studies have found that LSSMs can perform well outside their training domain (e.g., Von Ruette et al., 2011), the reason for the good performance is often not explored in depth. Applying the framework used herein for analyzing predictor effects would allow a better understanding of the controls on LSSM performance in other studies. In summary, our results indicate that when modeling susceptibility over broad extents, randomly sampling landslide and non-landslide locations over the entire modeling domain captures a greater range of predictor effects and will generally improve model performance over using dense but spatially separated training data or grouping data sets by assumed environmental controls.

4.4 Improving Landslide Susceptibility Models at Regional Scales

We cannot exclude the possibility that differences in the landslide inventories or our sampling strategies contribute to some of the observed model variations between locations. The landslide inventories used were collected by different teams, with unique objectives, at different times, and using various methods (Table 2). This inconsistency is common for any inventory, regardless of scale or extent. However, restricting our study to landslide inventories that use well-defined and systematic approaches with high-resolution DEMs or aerial imagery, coupled with the systematization procedures implemented herein (Figure 1), minimizes the effects of inconsistencies in the landslide inventories (Sections 2.2 and 2.3). The lack of any specific location showing consistently divergent results supports the idea that we are accurately characterizing the relative controls on landslide susceptibility as described by the measured predictors, rather than artifacts caused by differences in the landslide mapping methodologies.

Although many challenges regarding the representativeness of data for susceptibility studies remain, our analysis offers several promising avenues for improvement. First, LSSMs applied to areas lacking a landslide inventory may poorly characterize susceptibility there because of differences in landslide characteristics not represented in the LSSM. The implications of the assumptions used to extrapolate susceptibility models across the entire modeling domain (including data-limited regions) need to be carefully explained to end-users. Second, extrapolating models developed in other regions based on the assumption that landslide triggering mechanisms and attributes are similar (e.g., due to shared ecoregions and physiographic provinces) may lead to spurious results. For instance, although Macon, Doddridge, and Magoffin share similar physiographic and/or ecological environments (Figure 2) that would suggest relatively consistent predictor effects, our analysis indicates otherwise (Figure 4). Third, the use of heuristic methods in other studies explicitly assumes specific predictor effects that may not match local observations. This is consistent with observations that national-scale susceptibility maps tend to under-represent the hazard in moderately sloping terrain (Mirus et al., 2020). Finally, our results indicate that using a uniformly distributed landslide inventory with widespread representation across the modeling domain will likely produce the most accurate susceptibility maps at regional or larger scales. By exposing the model to as many variations in predictor values as possible, the model will be more robust and produce more accurate results on diverse terrains (Figure 5) (Halevy et al., 2009). Thus, when attempting to create susceptibility models where no data currently exist, efforts focused on gathering a uniformly distributed sample of landslide and non-landslide locations across the entire study site, even if the inventory is limited, would be more useful than trying to gather spatially limited but more complete inventories.

5 Conclusion

Accurate LSSMs for large and diverse terrains are needed worldwide. Here, we use a statistical framework to evaluate the influence of several common sampling (or model training) strategies on model parameters and performance. This approach provides a more detailed understanding of the impacts of different input data on model behavior and performance than previous efforts. We emphasize that the choice of sampling strategy can have drastic impacts on the predictor effects within the model, which influence the representativeness of the model for new (i.e., unsampled) domains. For example, sampling a few spatially dense but isolated landslide inventories scattered throughout the modeling domain generally provides very poor representation of areas with limited landslide data. Additionally, limiting model development and application to regions within the same physiographic provinces and level II ecoregions does not help constrain the predictor effects or improve model performance. Finally, using a limited but uniformly distributed landslide inventory can create accurate landslide susceptibility maps over the same domain where the training data were gathered. In summary, our results illustrate the diverse conditions that can render terrain susceptible to landslides across geologic settings and some of the challenges in creating representative susceptibility maps over these settings.

Acknowledgments

We appreciate the insights from three anonymous reviewers and the constructive suggestions from Oliver Korup and Eric Thompson, which helped us improve the manuscript. This work was funded by the U.S. Geological Survey. Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

Data Availability Statement

The Doddridge County landslide inventory data are available at https://services1.arcgis.com/cTNi34MxOdcfum3A/arcgis/rest/services/LandslideIncidenceUser/FeatureServer and the Magoffin County inventory is available at https://doi.org/10.13023/kgs.data.2022.01 (Crawford, 2023). The other landslide inventories are compiled within the USGS Landslide inventories across the United States Version 2 via https://doi.org/10.5066/P9FZUX6N (Belair et al., 2022). Data sources for the predictor data are shown in Supporting Information S1 (Table S1). The Stan model and data required to run the model are deposited at https://doi.org/10.5066/P959G9JN.