Crowdsourcing Methods for Data Collection in Geophysics: State of the Art, Issues, and Future Directions
Abstract
Data are essential in all areas of geophysics. They are used to better understand and manage systems, either directly or via models. Given the complexity and spatiotemporal variability of geophysical systems (e.g., precipitation), a lack of sufficient data is a perennial problem, which is exacerbated by various drivers, such as climate change and urbanization. In recent years, crowdsourcing has become increasingly prominent as a means of supplementing data obtained from more traditional sources, particularly due to its relatively low implementation cost and ability to increase the spatial and/or temporal resolution of data significantly. Given the proliferation of different crowdsourcing methods in geophysics and the promise they have shown, it is timely to assess the state of the art in this field, to identify potential issues and map out a way forward. In this paper, crowdsourcing-based data acquisition methods that have been used in seven domains of geophysics, including weather, precipitation, air pollution, geography, ecology, surface water, and natural hazard management, are discussed based on a review of 162 papers. In addition, a novel framework for categorizing these methods is introduced and applied to the methods used in the seven domains of geophysics considered in this review. This paper also features a review of 93 papers dealing with issues that are common to data acquisition methods in different domains of geophysics, including the management of crowdsourcing projects, data quality, data processing, and data privacy. In each of these areas, the current status is discussed and challenges and future directions are outlined.
Key Points
- Different crowdsourcing-based methods for acquiring geophysical data are reviewed and categorized across seven domains of geophysics
- Project management, data quality, data processing, and privacy issues have hampered wider uptake of crowdsourcing methods for practical applications
- Future applications of crowdsourcing methods require public education, engagement strategies and incentives, technology developments, and government support
1 Introduction
1.1 Importance of Data
The availability of sufficient and high-quality data is vitally important for activities in a broad range of areas within geophysics (Assumpção et al., 2018). As shown in Figure 1, data are used, either directly or via models, for a variety of purposes (Eggimann et al., 2017; Montanari et al., 2013; See et al., 2016), such as developing increased understanding of physical systems or processes (e.g., the weather); geophysical event prediction (e.g., rainfall, earthquakes); natural resources management (e.g., river systems); impact assessment (e.g., air pollution); infrastructure system planning, design, and operation (e.g., water supply systems); and the management of natural hazards (e.g., floods). In addition, data are also used in the model development process itself (See, Schepaschenko, et al., 2015), as well as to inform us about deficits in our models, thereby fostering improved understanding and forming the basis of scientific discovery (Del Giudice et al., 2016). It should be noted that the examples in Figure 1 are not meant to be exhaustive, but to demonstrate the wide range of purposes for which geophysical data can be used.
In relation to models (Figure 1), data are used for both model building (model setup, calibration, and validation) and executing models, as illustrated in Figure 2. For example, in the case of flood models, different types of data are required, including topography and land cover during model setup; high water marks for calibration and validation; and water levels/discharges, provided by gauging at the flooding area boundary, during the use of models (Assumpção et al., 2018).
1.2 Challenges
As mentioned in section 1.1, the availability of adequate geophysical data is vital in a range of applications in geophysics. However, a lack of such data has restricted many research and application activities. For example, models have often been developed with limited data (Reis et al., 2015), and consequently these models are not used in practical applications due to a lack of confidence in their performance (Assumpção et al., 2018). This is particularly true in relation to extreme events, such as floods and earthquakes, as the available data for simulating/predicting such events are significantly rarer than those available for more frequent events (Panteras & Cervone, 2018). The issue of data deficiency has taken on even greater importance in recent years, as real-time system operations and integrated management are becoming increasingly important in many domains within geophysics, which requires an increased amount of data with high spatiotemporal resolution (Muller et al., 2015). Consequently, how to efficiently and effectively collect sufficient amounts of data is one of the key questions that needs to be addressed urgently in the area of geophysics (See, Perger, et al., 2015). The key challenges associated with traditional data collection methods are summarized below (see Figure 3):
- Spatial and temporal resolution: Many geophysical processes are highly spatially and temporally variable (e.g., recent research has found that precipitation intensity within a single storm event can vary by up to 30% across a spatial region with an extent of 3–5 km; Muller et al., 2015), but most existing data collection methods are not able to capture this variation adequately.
- Cost: Traditional means of collecting data (e.g., fixed monitoring stations, paying people for data collection) are expensive, limiting the amount of data that can be collected within the constraints of available resources.
- Accessibility: Many locations where data are needed are difficult to access from a physical perspective, or the services needed for data collection (e.g., electricity) are not available.
- Availability: In many instances, data are needed in real time (e.g., infrastructure management, natural hazard management), but traditional means of data collection and transmission are unable to make the data available when needed.
- Uncertainty: There can be large uncertainty surrounding the quality of the data provided by traditional means.
- Dimensionality: As mentioned in section 1.1, collecting the different types of data needed for application areas that require a higher degree of social interaction can be a challenge.
For example, some of the challenges associated with weather data are due to the fact that they are traditionally obtained through ground gauges and stations, which are usually sparsely distributed with low density (Kidd et al., 2014; Lorenz & Kunstmann, 2012). This low density has long been an impediment to more accurate real-time weather prediction and management (Bauer et al., 2015), but further increases in their density would be difficult to achieve because of a lack of availability of candidate locations and high maintenance costs (B. Mahoney et al., 2010; Muller et al., 2013). Radar and satellites have also been used to monitor weather data, but the spatial and/or temporal resolution of the data obtained is often insufficient for many applications (e.g., real-time management and operation) and characterized by high levels of uncertainty (Thorndahl et al., 2017).
Another example of some of the challenges associated with traditional data collection methods relates to the mapping of geographical features such as buildings, road networks, and land cover, which has traditionally been undertaken by national mapping agencies. In many cases the data have not been made openly available or are only available at a cost. There is also a need to increase the amount of in situ or reference data needed for different applications, for example, observations of land cover for training classification algorithms or collection of ground data to validate maps or model outputs (See et al., 2016).
Finally, challenges arise from the lack of data availability caused by the failure or loss of equipment, for example, during natural disasters. To overcome this limitation in the field of flood management, remote sensing and social media are being used increasingly for obtaining topographic information and flood extent. However, to enable effective applications, the data must be obtained in a timely fashion (Cervone et al., 2016; Gobeyn et al., 2015), or they may need to be obtained at a high spatial resolution, for example, to capture cross sections. In both cases, there may be too much uncertainty in the data (Grimaldi et al., 2016). These challenges are exacerbated by a number of broader drivers, including the following:
- Climate Change: This increases the spatial and temporal variability, as well as the uncertainty, of many geophysical processes (e.g., precipitation; Zheng, Westra, et al., 2015), therefore requiring data collection at a greater spatiotemporal resolution. This increases cost and can present challenges related to accessibility.
- Urbanization: This can increase the spatial variability of a number of geophysical variables (e.g., due to the urban heat island effect; Arnfield, 2003; Burrows & Richardson, 2011), as well as increasing system complexity. This is likely to increase the cost, uncertainty, and the dimensionality associated with data collection.
- Community Expectation: Increased community expectations around levels of service provided by infrastructure systems (e.g., water supply) and levels of protection from natural hazards can increase the spatial and temporal resolution of the data required, as well as the speed with which they need to be made available (e.g., as a result of real-time operations; Muller et al., 2015). This is also likely to increase the cost and dimensionality of data collection efforts.
For example, the above drivers can have a significant impact on the acquisition of in situ precipitation data, the majority of which are currently collected through ground gauges and stations that are sparsely distributed around the world (Westra et al., 2014). However, these are unlikely to meet the growing data demands associated with the management of water systems, which is becoming increasingly complex due to climate change and rapid urbanization (Montanari et al., 2013). This problem has been exacerbated in recent years as real-time water system operations and management are being adopted increasingly in many cities around the world. These real-time systems require substantially increased amounts of precipitation data with high spatiotemporal resolution (Eggimann et al., 2017), which themselves are becoming more variable as a result of climate change (e.g., Berg et al., 2013; Wasko & Sharma, 2015; Zheng, Westra, et al., 2015).
1.3 Crowdsourcing
Over the past decade, crowdsourcing has emerged as a promising approach to addressing some of the growing challenges associated with data collection. Crowdsourcing was traditionally used as a problem-solving model (Brabham, 2008) or as a particular method of task distribution and outsourcing (Howe, 2006), but it can now be considered as one type of citizen science, which is regarded as the involvement of citizens in science, ranging from data collection to hypothesis generation (Bonney, 2009). Although the terms crowdsourcing and citizen science have appeared in the literature only relatively recently, citizens have been involved in data collection and science for more than a century, for example, through manual reporting of rainfall to weather services and participation in the National Audubon Society's Christmas Bird Count.
Citizen science can be categorized into four levels according to the extent of public involvement in scientific activities, as illustrated in Figure 4 (Estellés-Arolas & González-Ladrón-De-Guevara, 2012; Haklay, 2013). In essence, these four levels can be thought of as representing a trajectory of increasing citizen involvement in the scientific process. As part of this trajectory, crowdsourcing is referred to as Level 1, as it provides the foundations for the three more advanced forms of citizen science, where its implementation is underpinned by a network of citizen volunteers (Haklay, 2013). The second level is distributed intelligence, which relies on the cognitive ability of the participants for data analysis, for example, in projects such as Galaxy Zoo (Lintott et al., 2008) or mPING (Elmore et al., 2014). In the third level (participatory science), citizen input is used to determine what data need to be collected, requiring citizens to assist in research problem definition (Haklay, 2013). The last level (Level 4) is extreme citizen science, which engages citizens as scientists to participate heavily in research design, data collection, and result interpretation. As a consequence, participants not only offer data but also provide collaborative intelligence (Haklay, 2013).
In practice, a limited number of participants have the ability to provide integrated designs for research projects due to their lack of knowledge of the research gaps to be addressed (Buytaert et al., 2014). This is especially the case in the domain of geoscience, as significant professional knowledge is required to enable research design in this area (Haklay, 2013). Therefore, it has been difficult to develop the levels of trust required to enable common citizens to participate in all aspects of the research process within geoscience. This substantially limits the practical utilization of citizen science (especially Levels 3–4) in many professional domains, such as floods, earthquakes, and precipitation within the geophysical domain, hampering its wider promotion (Buytaert et al., 2014). Consequently, this review is restricted to crowdsourcing (i.e., Level 1 citizen science).
Crowdsourcing was originally defined by Howe (2006) as the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call. More specifically, crowdsourcing has traditionally been used as an outsourcing method, but it can now be considered as an approach to collecting data through the participation of the general public, therefore requiring the active involvement of citizens (Bonney, 2009). However, more recently, this definition has been relaxed somewhat to also include data collected from public sensor networks, that is, opportunistic sensing (Mccabe et al., 2017) and the Internet of Things (IoT; Sethi & Sarangi, 2017), as well as from sensors installed and maintained by private citizens (Muller et al., 2015). In addition, with the advent of data mining, the data do not necessarily have to be collected for the purpose for which they are ultimately used. For example, precipitation data can be extracted from commercial microwave links (CMLs) with the aid of data mining techniques (Doumounia et al., 2014). Hence, for the purpose of this paper, we include opportunistic sensing (Krishnamurthy & Poor, 2014; Messer, 2018; Uijlenhoet et al., 2017) within the broader term crowdsourcing to recognize the fact that there is a spectrum to the data collection process; this spectrum reflects the degree of citizen or crowd participation, ranging from 100% to 0%.
In recent years, crowdsourcing has been made possible by rapid developments in information technology (Buytaert et al., 2014), which has assisted with data acquisition, data transmission, and data storage, all of which are required to enable the data to be used in an efficient manner, as illustrated in the crowdsourcing data chain shown in Figure 5. For example, in the instance where citizens count the number of birds as part of ecological studies, technology is not needed for data collection. However, the collected data only become useful if they can be transmitted cheaply and easily via the internet or mobile phone networks and are made accessible via dedicated online repositories or social media platforms. In other instances, technology might also be used to acquire data via smart phones in addition to enabling data transmission, or dedicated sensor networks may be used, for example, through IoT. In fact, the crowdsourcing data chain has clear parallels with a three-layer IoT architecture (Sethi & Sarangi, 2017). The data acquisition layer in Figure 5 is similar to the perception layer in IoT, which collects information through sensors; the data transmission and storage layers in Figure 5 have similar functions to the IoT network layer, which transmits and processes data; and the IoT application layer corresponds to the data usage layer in Figure 5.
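To make this parallel concrete, the correspondence between the crowdsourcing data chain and the IoT layers can be sketched in a few lines of code. The following is a minimal illustration only; the class and function names are hypothetical and do not correspond to any particular system discussed in this review.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Observation:
    """A single crowdsourced record, e.g., a citizen bird count or a sensor reading."""
    variable: str   # e.g., "bird_count" or "air_temperature"
    value: float
    lat: float
    lon: float
    timestamp: str  # ISO 8601, kept as a string for simplicity

class CrowdsourcingDataChain:
    """Acquisition -> transmission -> storage -> usage (Figure 5), mirroring the
    IoT perception -> network -> application layers (Sethi & Sarangi, 2017)."""

    def __init__(self) -> None:
        self.repository: List[Observation] = []  # storage layer (online repository)

    def acquire(self, sensor: Callable[[], Observation]) -> Observation:
        # Acquisition/perception layer: a citizen report or an instrument reading.
        return sensor()

    def transmit(self, obs: Observation) -> None:
        # Transmission/network layer: stands in for an internet or mobile upload.
        self.repository.append(obs)

    def use(self) -> float:
        # Usage/application layer: e.g., averaging observations as a model input.
        return sum(o.value for o in self.repository) / len(self.repository)

chain = CrowdsourcingDataChain()
reading = chain.acquire(lambda: Observation("air_temperature", 21.3, 52.52, 13.40, "2018-06-01T12:00:00Z"))
chain.transmit(reading)
print(f"Mean air temperature: {chain.use():.1f} degC")
```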
Crowdsourcing methods enable a number of the challenges outlined in section 1.2 (see Figure 3) to be addressed. For example, due to the wide availability of low-cost and ubiquitous sensors (either dedicated or as part of smart phones or other personal devices) used by a large number of citizens, as well as the sensors' ability to almost instantaneously transmit and store/share the acquired data, data can be collected at a greater spatial and temporal resolution and at a lower cost than with the aid of a professional monitoring network. It is noted that data obtained using crowdsourcing methods are often not as accurate as those obtained from official measurement stations, but they typically possess much higher spatiotemporal resolution than traditional ground-based observations (Buytaert et al., 2014). This makes crowdsourcing a potentially important complementary source of information, or, in some situations, the only available source of information that can provide valuable observations.
In many instances, this wide availability also increases data accessibility, as dedicated data collection stations do not have to be established at particular sites. Data availability is generally also increased, as data can be transmitted and shared in real time, often through distributed networks that also increase reliability, especially in disaster situations (McSeveny & Waddington, 2017). Finally, given the greater ease and lower cost with which different types of data can be collected, crowdsourcing techniques also increase the dimensionality of the data that can be collected, which is especially important when dealing with application areas that require a higher degree of social interaction, such as the management of infrastructure systems or natural hazards (Figure 1).
In relation to the use of crowdsourcing methods for the collection of weather data, measurements from amateur gauges and weather stations can now be assimilated in real time (Agüera-Pérez et al., 2014; Bell et al., 2013), and new, low-cost sensors have been developed and integrated to allow a larger number of citizens to be involved in the monitoring of weather (Muller et al., 2013). Similarly, other geophysical data can now be collected more cheaply and with a greater spatial and temporal resolution with the assistance of citizens, including data on ecological variables (Chandler et al., 2016; Donnelly et al., 2014), temperature (Meier et al., 2017), and other atmospheric observations (McKercher et al., 2017). These crowdsourced data are often used as an important supplement to official data sources for system management.
In the field of geography, the mapping of features such as buildings, road networks, and land cover can now be undertaken by citizens as a result of advances in Web 2.0 and global positioning system (GPS)-enabled mobile technology, which has blurred the once clear-cut distinction between map producer and consumer (Coleman et al., 2009). In a seminal paper, Goodchild (2007) coined the phrase Volunteered Geographic Information (VGI). Similar to the idea of crowdsourcing, VGI refers to the idea of citizens as sensors, collecting vast amounts of georeferenced data. These data can complement existing authoritative databases from national mapping agencies, provide a valuable source of research data, and even have considerable commercial value. OpenStreetMap (OSM) is an example of a highly successful VGI application (Neis & Zielstra, 2014), which was originally driven by users in the United Kingdom wanting access to free topographic information, for example, buildings, roads, and physical features; at the time, these data were only available from the U.K. Ordnance Survey at a considerable cost. Since then, OSM has expanded globally and works strongly within the humanitarian field, mobilizing citizen mappers during disaster events to provide rapid information to first responders and nongovernmental organizations working on the ground (Soden & Palen, 2014). Another strong motivator behind crowdsourcing in geography has been the need to increase the amount of in situ or reference data needed for different applications, for example, observations of land cover for training classification algorithms or collection of ground data to validate maps or model outputs (See et al., 2016). The development of new resources such as Google Earth and Bing Maps has also made many of these crowdsourcing applications possible, for example, visual interpretations of very high resolution satellite imagery (Fritz et al., 2012).
1.4 Contribution of This Paper
This paper reviews recent progress in the approaches used within the data acquisition step of the crowdsourcing data chain (Figure 5) in the geophysical sciences and engineering. The main contributions include (i) a categorization of different crowdsourcing data acquisition methods and a comprehensive summary of how these have been applied in a number of domains in the geosciences over the past two decades; (ii) a detailed discussion on potential issues associated with the application of crowdsourcing data acquisition methods in the selected areas of the geosciences, as well as a categorization of approaches for dealing with these; and (iii) identification of future research needs and directions in relation to crowdsourcing methods used for data acquisition in the geosciences. The review will cover a broad range of application areas (e.g., see Figure 1) within the domain of geophysics (see section 2.1) and should therefore be of significant interest to a broad audience, such as academics and engineers in the area of geophysics, government departments, decision-makers, and even sensor manufacturers. In addition to its potentially significant contributions to the literature, this review is also timely because crowdsourcing in the geophysical sciences is nearly ready for practical implementation, primarily due to rapid developments in information technologies over the past few years (Muller et al., 2015). This is supported by the fact that a large number of crowdsourcing techniques have been reported in the literature in this area (see section 3).
While there have been previous reviews of crowdsourcing approaches, this paper goes significantly beyond the scope and depth of those attempts. Buytaert et al. (2014) summarized previous work on citizen science in hydrology and water resources, Muller et al. (2015) performed a review of crowdsourcing methods applied to climate and atmospheric science, and Assumpção et al. (2018) focused on the crowdsourcing techniques used for flood modeling and management. Our review provides a significantly more up-to-date account of developments in crowdsourcing methods across a broader range of application areas in geosciences, including weather, precipitation, air pollution, geography, ecology, surface water, and natural hazard management. In addition, this review also provides a categorization of data acquisition methods and systematically elaborates on the potential issues associated with the implementation of crowdsourcing techniques across different problem domains, which has not been explored in previous reviews.
The remainder of this paper is structured as follows. First, an overview of the proposed methodology is provided, including details of which domains of geophysics are covered, how the reviewed papers were selected, and how the different crowdsourcing data acquisition methods were categorized. Next, an overview of the reviewed publications is provided, which is followed by detailed reviews of the applications of different crowdsourcing data acquisition methods in the different domains of geophysics. Subsequently, a discussion is presented regarding some of the issues that have to be overcome when applying these methods, as well as state-of-the-art methods to address them. Finally, the implications arising from this review are provided in terms of research needs and future directions.
2 Review Methodology
2.1 Geophysical Domains Reviewed
In order to cover a broad spectrum of geophysical domains, a number of atmospheric (weather, precipitation, air quality) and terrestrial variables (geographic, ecological, surface water) are included in this review. This is because crowdsourcing has often been implemented in these geophysical domains, as demonstrated by the results of a preliminary search of the relevant literature through the Web of Science database using the keyword crowdsourcing (Thomson Reuters, 2016). This also shows that these domains are of great importance within geophysics. In addition, data acquisition in relation to natural hazard management (e.g., floods, fires, earthquakes, hurricanes) is also included, as the impact of extreme events is becoming increasingly important and because their management requires a high degree of social interaction (Figure 1). A more detailed rationale for the inclusion of the above domains is provided below. While these domains were selected to cover a broad range of areas in geophysics, by necessity, they do not cover the full spectrum. However, given the diversity of the domains included in the review, the outcomes are likely to be more broadly applicable.
Weather is included as detailed monitoring of weather-related data at a high spatiotemporal resolution is crucial for a series of research and practical problems (Niforatos et al., 2016). Solar radiation, cloud cover, and wind data are direct inputs to weather models (Chelton & Freilich, 2005). Snow cover and depth data can be used as input for hydrological modeling of snow-fed rivers (Parajka & Blöschl, 2008), and they can also be used to estimate snow erosion on mountain ridges (Parajka et al., 2012). Moreover, wind data are used extensively in the efficient management and prediction of wind power production (Agüera-Pérez et al., 2014).
Precipitation is covered here as it is a research domain that has been studied extensively for a long period of time. This is because precipitation is a critical factor in floods and droughts, which have had devastating impacts worldwide (Westra et al., 2014). In addition, precipitation is an important input required for the development, calibration, validation, and use of many hydrological models. Therefore, precipitation data are essential for many models related to floods and droughts, as well as to water resource management, planning, and operation (Hallegatte et al., 2013).
Air quality is included due to pressing air pollution issues around the world (Y. N. Zhang et al., 2011), especially in developing countries (Erickson, 2017; Jiang et al., 2015). The availability of detailed atmospheric data at a high spatiotemporal resolution is critical for the analysis of air quality, as poor air quality can have negative impacts on health (Snik et al., 2015). A good spatial coverage of air quality data can significantly improve the awareness and preparedness of citizens in mitigating their personal exposure to air pollution, and hence the availability of air quality data is an important contributor to enabling the protection of public health (Castell et al., 2015).
The subset of geography considered in this review is focused on the mapping and collection of data about features on the Earth's surface, both natural and man-made, as well as georeferenced data more generally. This is because these data are vital for a range of other areas of geophysics, such as impact assessment (e.g., location of vulnerable populations in the case of air pollution); infrastructure system planning, design, and operation (e.g., location and topography of households in the case of water supply); natural hazard management (e.g., topography of the landscape in terms of flood management); and ecological monitoring (e.g., deforestation).
Ecological data acquisition is included as it has been clearly acknowledged that ecosystems are being threatened around the world by climate change, as well as other factors, such as illegal wildlife trade, habitat loss, and human-wildlife conflicts (Can et al., 2017; Donnelly et al., 2014). Therefore, it is of great importance to have sufficient high quality data for a range of ecosystems, aimed at building solid and fundamental knowledge on their underlying processes, as well as enabling biodiversity observation, phenological monitoring, natural resource management, and environmental conservation (Groom et al., 2017; Mckinley et al., 2016; van Vliet et al., 2014).
Data on surface water systems, such as rivers and lakes, are vital for their management and protection, as well as their usage for irrigation and water supply. For example, water quality data are needed to improve the management effectiveness (e.g., monitoring) of surface water systems, which is particularly the case for urban rivers, many of which have been polluted (T. Zhang et al., 2016). Water depth or velocity data in rivers or lakes are also important, as they can be used to derive flows, or indirectly to represent the water quality and ecology within these systems. Therefore, sourcing surface water data with a good temporal and spatial resolution is necessary for enabling the protection of these aquatic environments (Tauro et al., 2018).
Natural hazards, such as floods, wildfires, earthquakes, tsunamis, and hurricanes, are causing significant losses worldwide, both in terms of lives lost and economic costs (McMullen & Lytle, 2012; Newman et al., 2017; Wen et al., 2013; Westra et al., 2014). Data are needed to support all stages of natural hazard management, including preparedness and response (Anson et al., 2017). Examples of such data include real-time information on the location, extent, and changes in hazards, as well as information on their impacts (e.g., losses, missing persons), to assist with the development of situational awareness (Akhgar et al., 2017; Stern, 2017), assess damage and suffering (Akhgar et al., 2017), and justify actions prior, during, and after disasters (Stern, 2017). In addition, data, and models developed with such data, are needed to identify risks and the impact of different risk reduction strategies (Anson et al., 2017; Newman et al., 2017).
2.2 Papers Selected for Review
The papers to be reviewed were selected using the following steps: (i) first, we identified crowdsourcing-related papers in influential geophysics-related journals, such as Nature, Bulletin of the American Meteorological Society, Water Resources Research, and Geophysical Research Letters, to ensure that high-quality papers are included in the review; (ii) we then checked the reference lists of these papers to identify additional crowdsourcing-related publications; and (iii) finally, crowdsourcing was used as the keyword to identify geophysics-related publications through the Web of Science database (Thomson Reuters, 2016). While it is unlikely that all crowdsourcing-related papers have been included in this review, we believe that the selected publications provide a good representation of progress in the use of crowdsourcing techniques in geophysics. An overview of the papers obtained using the above approach is given in section 3.
2.3 Categorization of Crowdsourcing Data Acquisition Methods
As mentioned in section 1.4, one of the primary objectives of this review is to ascertain which crowdsourcing data acquisition methods have been applied in different domains of geophysics. To this end, the categorization of different crowdsourcing methods shown in Figure 6 is proposed. As can be seen, it is suggested that all data acquisition methods have two attributes, namely, how the data were generated (i.e., data generation agent) and for what purpose the data were generated (i.e., data type).
Data generation agents can be divided into two categories (Figure 6): citizens and instruments. In this categorization, if citizens are the data generating agents, no instruments are used for data collection, with only the human senses acting as sensors. Examples of this would be counting the number of fish in a river, mapping buildings, or identifying objects/boundaries within satellite imagery. In contrast, the instruments category does not have any active human input during data collection, but the instruments are installed and maintained by citizens, as would be the case with collecting data from a network of automatic rain gauges operated by citizens, or sourcing data from distributed computing environments (e.g., Mechanical Turk; Buhrmester et al., 2011). As mentioned in section 1.3, while this category does not fit within the original definition of crowdsourcing (i.e., sourcing data from communities), such passive data collection methods have been considered under the umbrella of crowdsourcing methods more recently (Bigham et al., 2014; Muller et al., 2015), especially if data are transmitted via the internet or mobile phone networks and stored/shared in online repositories. As shown in Figure 6, some data acquisition methods require active input from both citizens and instruments. An example of this would be the measurement of air quality by citizens with the aid of their smart phones.
Data types can also be divided into two categories (Figure 6): intentional and unintentional. If a data acquisition method belongs to the intentional category, the data were intentionally collected for the purpose for which they are ultimately used. For example, if citizens collect air quality data using sensors on their smart devices as part of a study on air pollution, then the data were acquired for the purpose for which they are ultimately used. In contrast, for data acquisition methods belonging to the unintentional category, the data were not intentionally collected for the geophysical analysis purposes for which they are ultimately used. An example of this is the generation of data via social media platforms, such as Facebook, as part of which people might make a text-based post about the weather for the purposes of updating their personal status, but which might form part of a database of similar posts that can be mined for the purposes of gaining a better understanding of underlying weather patterns (Niforatos et al., 2014). Another example is the data on precipitation intensity collected by the windshields of cars (Nashashibi et al., 2011). While these data are collected to control the operation of windscreen wipers, a database of such information could be mined to support the development of precipitation models. Yet another example is the determination of the spatial distribution of precipitation from microwave links that are primarily used for telecommunications purposes (Messer et al., 2006).
As shown in Figure 6, in some instances, intentional and unintentional data types can both be used as part of the same crowdsourcing approach. For example, river level data can be obtained by combining observations of river levels by citizens with information obtained by mining relevant social media posts. Alternatively, more accurate precipitation data could be obtained by combining data from citizen-owned gauges with those extracted from microwave networks, or air quality data could be improved by combining data obtained from personal devices operated by citizens with data mined from social media posts.
As data acquisition methods have two attributes (i.e., data generation agent and data type), each of which has two categories that can also be combined (giving three options per attribute), there are 3 × 3 = 9 possible categories of data acquisition methods, as shown in Table 1. Examples of each of these categories, based on the illustrations given above, are also shown.
| Agent: Citizens | Agent: Instruments | Type: Intentional | Type: Unintentional | Examples |
|---|---|---|---|---|
| X | | X | | Counting the number of fish, mapping buildings |
| X | | | X | Social media text data |
| X | | X | X | River level data from combining citizen reports and social media text data |
| | X | X | | Automatic rain gauges |
| | X | | X | Microwave data |
| | X | X | X | Precipitation data from citizen-owned gauges and microwave data |
| X | X | X | | Citizens measure air quality with sensors |
| X | X | | X | People driving cars that collect rainfall data on windshields |
| X | X | X | X | Air quality data from citizens collected using sensors, gauges, and social media |
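Because each attribute admits three options (one category, the other, or both), the nine categories in Table 1 follow directly from a Cartesian product. The short sketch below simply enumerates them; it is illustrative only and does not appear in any of the cited works.

```python
from itertools import product

# The two attributes of a data acquisition method (section 2.3), each with two
# categories that can also be combined, giving 3 x 3 = 9 categories (Table 1).
AGENTS = ("citizens", "instruments", "citizens+instruments")
DATA_TYPES = ("intentional", "unintentional", "intentional+unintentional")

for i, (agent, data_type) in enumerate(product(AGENTS, DATA_TYPES), start=1):
    print(f"Category {i}: agent = {agent:22s} data type = {data_type}")
# Prints the nine combinations corresponding to the rows of Table 1.
```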
3 Overview of Reviewed Publications
Based on the process outlined in section 2.2, 255 papers were selected for review, of which 162 are concerned with the applications of crowdsourcing methods and 93 are primarily concerned with the issues related to their applications. Figure 7 presents an overview of these selected papers. As shown in this figure, very limited work was published in the selected journals before 2010, with a rapid increase in the number of papers from that year onward (2010–2017), to the point where about 34 papers on average were published per year from 2014 to 2017. This implies that crowdsourcing has become an increasingly important research topic in recent years. This can be attributed to the fact that information technology has developed in an unprecedented manner since 2010, and hence a broad range of inexpensive, yet robust, sensors and platforms (e.g., smart phones, social media, telecommunication microwave links) has become available for collecting geophysical data (Buytaert et al., 2014). These collected data have the potential to overcome the problems associated with limited data availability, as discussed previously, creating opportunities for research at unprecedented scales (Dickinson et al., 2012) and leading to a surge in relevant studies.
Figure 8 presents the distribution of the affiliations of the coauthors of the 255 publications included in this review. As shown, universities and research institutions have clearly dominated the development of crowdsourcing technology reported in these papers. Interestingly, government departments have demonstrated significant interest in this area (Conrad & Hilchey, 2011), as indicated by the fact that they have been involved in a total of 38 publications (14.9%), of which 10 and 7 are in collaboration with universities and private or public research institutions, respectively. As shown in Figure 8, industry has closely collaborated with universities and research institutions on crowdsourcing, as all of their publications (22 in total, 8.6%) have been coauthored with researchers from these sectors. These results show that developments and applications of crowdsourcing techniques have been mainly reported by universities and research institutions thus far. However, it should be noted that, unlike most research conducted by universities, not all progress made by industry on crowdsourcing is reported in journal papers (Hut et al., 2014; Jongman et al., 2015; Kutija et al., 2014; Michelsen et al., 2016).
In addition to the distribution of affiliations, it is also meaningful to understand how active crowdsourcing-related research is in different countries, which is shown in Figure 9. It should be noted that only the country of the leading author is considered in this figure. As reflected by the 255 papers reviewed, the United States has performed the most extensive research in the crowdsourcing domain, followed by the United Kingdom, Canada, and some other European countries, particularly Germany and France. In contrast, China, Japan, Australia, and India have made limited attempts to develop or apply crowdsourcing methods in geophysics. In addition, many other countries have not published any crowdsourcing-related efforts so far. This may be partly attributed to the economic status of different countries, as a mature and efficient information network is a prerequisite for the development and application of crowdsourcing techniques (Buytaert et al., 2014).
As stated previously, one of the features of this review is that it assesses papers in terms of both application area and generic issues that cut across application areas. The split between these two categories for the 255 papers reviewed is shown in Figure 10. As can be seen from this figure, crowdsourcing techniques have been widely used to collect precipitation data (15% of the reviewed papers) and data for natural hazard management (17%). This is likely because precipitation data and data for natural hazard management are highly spatially distributed, and hence are more likely to benefit from crowdsourcing techniques for data collection (Eggimann et al., 2017). In terms of potential issues that exist within the applications of crowdsourcing approaches, project management, data quality, data processing, and privacy have been increasingly recognized as problems based on our review, and hence they are considered here (Figure 10). A review of these issues, as one of the important focuses of this paper, not only offers insight into potential problems and solutions that cut across different problem domains but also provides guidance for the future development of crowdsourcing techniques.
4 Review of Crowdsourcing Data Acquisition Methods Used
4.1 Weather
Currently, crowdsourced weather data mainly come from four sources: (i) human estimation; (ii) automated amateur gauges and weather stations; (iii) CMLs; and (iv) sensors integrated with vehicles, portable devices, and existing infrastructure. For the first category of data source, citizens are heavily involved in providing qualitative or categorical descriptions of the weather conditions based on their observations. For instance, citizens are encouraged to classify their estimations of air temperature and wind speed into three classes (low, medium, and high) for their surrounding regions, as well as to predict short-term weather variables in the near future (Niforatos et al., 2014; Niforatos, Vourvopoulos, et al., 2015). These estimations have been compared against the records from authorized weather stations, and results showed that the two data sources matched reasonably well in terms of the levels of the variables (e.g., low or high temperature; Niforatos, Fouad, et al., 2015). These estimates are transmitted to their corresponding authorized databases with the aid of different types of apps, which have greatly facilitated the wider uptake of this type of crowdsourcing method. While this type of crowdsourcing project is simple to implement, the data collected are only subjective estimates.
To provide quantitative measurements of weather variables, low-cost amateur gauges and weather stations have been installed and managed by citizens to source relevant data. This type of crowdsourcing method has been made possible by the availability of affordable and user-friendly weather stations over the past decade (Muller et al., 2013). For example, in the United Kingdom and Ireland, the Weather Observation Website and Weather Underground have been developed to accept weather reports from amateur observers, and by early spring 2012, over 400 and 1,350 amateurs were regularly uploading their weather data (temperature, wind, pressure, and so on) to the Weather Observation Website and Weather Underground, respectively (Bell et al., 2013). Agüera-Pérez et al. (2014) compiled wind data from 198 citizen-owned weather stations and successfully estimated the regional wind field with high accuracy, while high-density temperature data have been collected through citizen-owned automatic weather stations (Chapman et al., 2016; Wolters & Brandsma, 2012; Young et al., 2014) and used in urban climate research in recent years (Meier et al., 2017).
Alternatively, weather data can also be quantitatively estimated by analyzing the transmitted and received signal levels of commercial cellular communication networks, which are typically installed by telecommunication companies or other private entities, and whose electromagnetic waves are attenuated by atmospheric influences. For instance, during fog conditions, the attenuation of microwave links was found to be related to the fog liquid water content, which enabled the use of commercial cellular communication network attenuation data to monitor fog at a high spatiotemporal resolution (David et al., 2015), in addition to their wider applications in estimating rainfall intensity, as discussed in section 4.2.
In more recent years, a large amount of weather data has been obtained from sensors that are available in cars, mobile phones, and telecommunication infrastructure. For example, automobiles are equipped with a variety of sensors, including cameras, impact sensors, wiper sensors, and sun sensors, which could all be used to derive weather data such as humidity, solar radiation, and pavement temperature (B. Mahoney et al., 2010; W. P. Mahoney & O'Sullivan, 2013). Similarly, modern smartphones are also equipped with a number of sensors, which enables them to be used to measure air temperature, atmospheric pressure, and relative humidity (Anderson et al., 2012; Madaus & Mass, 2016; Mass & Madaus, 2014; Mcnicholas & Mass, 2018; Sosko & Dalyot, 2017). More specifically, smartphone batteries, as well as smartphone-interfaced wireless sensors, have been used to indicate air temperature in surrounding regions (B. Mahoney et al., 2010; Majethisa et al., 2015). In addition to automobiles and smartphones, some research has been carried out to investigate the potential of transforming other vehicles into moving sensors for measuring air temperature and atmospheric pressure (Anderson et al., 2012; Overeem, Leijnse, et al., 2013). For instance, bicycles equipped with thermometers were employed to collect air temperature data in remote regions (Cassano, 2014; Melhuish & Pedder, 2012).
Researchers have also discussed the possibility of integrating automatic weather sensors with microwave transmission towers, and transmitting the collected data through wireless communication networks (Vishwarupe et al., 2016). These sensors have the potential to form an extensive infrastructure system for monitoring weather, thereby enabling better management of weather related issues (e.g., heat waves).
4.2 Precipitation
A number of crowdsourcing methods have been developed to collect precipitation data over the past two decades. These methods can be divided into four categories based on the means by which precipitation data are collected: (i) citizens, (ii) CMLs, (iii) moving cars, and (iv) low-cost sensors. In methods belonging to the first category, precipitation data are collected and reported by individual citizens. Based on the papers reviewed in this study, the first official report of this approach dates back to the year 2000 (Doesken & Weaver, 2000), when a volunteer network composed of local residents was established to provide records of rainfall for disaster assessment after a devastating flooding event in Colorado. These residents voluntarily reported rainfall estimates that were collected using their own simple, home-made equipment (e.g., precipitation gauges). These data showed that rainfall intensity within this storm event was highly spatially variable, highlighting the importance of access to precipitation data with a high spatial resolution for flood management. In recognition of this, research communities have suggested the development of an official volunteer network with the aid of local residents, aimed at routinely collecting rainfall and other meteorological parameters, such as snow and hail (Cifelli et al., 2005; Elmore et al., 2014; Reges et al., 2016). More recent examples include citizen reporting of precipitation type (e.g., hail, rain, drizzle) based on their observations to calibrate radar precipitation estimation (Elmore et al., 2014), and the use of automatic personal weather stations, which measure and provide precipitation data with high accuracy (De Vos et al., 2017).
In addition to precipitation data collection by citizens, many studies have explored the potential of other ways of estimating precipitation, with a typical example being the use of CMLs, which are generally operated by telecommunication companies. This is mainly because CMLs are often spatially distributed within cities and hence can potentially be used to collect precipitation data with good spatial coverage. More specifically, precipitation attenuates the electromagnetic signals transmitted between antennas within the CML network. This attenuation can be calculated from the difference between the received powers with and without precipitation and is a measure of the path-averaged precipitation intensity (Overeem et al., 2011). Based on our review, Upton et al. (2005) were probably the first to suggest the use of CMLs for rainfall estimation, and Messer et al. (2006) were the first to actually use data from CMLs to estimate rainfall. This was followed by more detailed studies by Leijnse et al. (2007), Zinevich et al. (2009), and Overeem et al. (2011), in which relationships between precipitation-induced signal attenuation and precipitation intensity were developed. The accuracy of such relationships has been subsequently investigated in many studies (Doumounia et al., 2014; Rayitsfeld et al., 2012). Results show that while quantitative precipitation estimates from CMLs might be regionally biased, possibly due to antenna wetting and systematic disturbances from the built environment, they could match reasonably well with precipitation observations overall (Chwala et al., 2016; Fencl, Rieckermann, Sykora, et al., 2015; Fencl, Rieckermann, Vojtěch, 2015; Mercier et al., 2015; Rios Gaona et al., 2015). This implies that the use of communication networks to estimate precipitation is promising, as it provides an important supplement to traditional measurements using ground gauges and radars (Fencl et al., 2017; Gosset et al., 2015). This is supported by the fact that precipitation data estimated from microwave links have been widely used to enable flood forecasting and management (Overeem, Robinson, et al., 2013) and urban storm water runoff modeling (Pastorek et al., 2017).
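The essence of this retrieval can be expressed compactly. The sketch below assumes the power-law relation k = aR^b between specific attenuation k (dB/km) and rain rate R (mm/h) that underpins the studies cited above; the coefficient values, function name, and example numbers are illustrative assumptions only, as a and b depend on the frequency and polarization of the link.

```python
def cml_rain_rate(p_dry_dbm: float, p_wet_dbm: float, length_km: float,
                  a: float = 0.33, b: float = 1.1) -> float:
    """Estimate the path-averaged rain rate (mm/h) along a commercial microwave link.

    Rain-induced attenuation is taken as the drop in received power relative to a
    dry-weather baseline. With the power law k = a * R**b relating specific
    attenuation k (dB/km) to rain rate R (mm/h), inversion gives R = (k/a)**(1/b).
    The default a and b are illustrative; real values depend on link frequency
    and polarization.
    """
    attenuation_db = p_dry_dbm - p_wet_dbm       # total rain-induced attenuation (dB)
    if attenuation_db <= 0.0:
        return 0.0                               # no attenuation -> no rain detected
    k = attenuation_db / length_km               # specific attenuation (dB/km)
    return (k / a) ** (1.0 / b)                  # invert the power law

# Example: a 5 km link whose received power drops from -40 dBm (dry) to -47 dBm (wet)
print(f"Estimated rain rate: {cml_rain_rate(-40.0, -47.0, 5.0):.1f} mm/h")
```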
In parallel with the development of microwave link based methods, some studies have been undertaken to utilize moving cars for the collection of precipitation data. This is theoretically possible with the aid of windshield sensors, wipers, and in-vehicle cameras (Gormer et al., 2009; Haberlandt & Sester, 2010; Nashashibi et al., 2011). For example, precipitation intensity can be estimated through its positive correlation with wiper speed, as illustrated in the sketch below. To demonstrate the feasibility of this approach for practical implementation, laboratory experiments and computer simulations have been performed, and the results showed that estimated data could generally represent the spatial properties of precipitation (Rabiei et al., 2012, 2013, 2016). In more recent years, an interesting and preliminary attempt has been made to identify rainy days and sunny days with the aid of in-vehicle audio clips from smartphones installed in cars (Guo et al., 2016). However, such a method is unable to estimate rainfall intensity and hence has not been used in practice thus far.
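As a simple illustration of the wiper-speed approach, the following sketch fits a linear relation between wiper speed and gauge-measured rainfall intensity and then applies it to new wiper observations. The data and the linear form are assumptions made purely for illustration; they are not the calibration procedure used in the cited studies.

```python
# Synthetic calibration data: co-located wiper speed (cycles/min) and
# reference gauge rainfall intensity (mm/h). Values are invented for illustration.
wiper_speed = [10.0, 20.0, 30.0, 45.0, 60.0]
gauge_mm_h  = [2.0, 5.5, 9.0, 14.5, 20.0]

# Ordinary least-squares fit of intensity = slope * speed + intercept.
n = len(wiper_speed)
mean_w = sum(wiper_speed) / n
mean_r = sum(gauge_mm_h) / n
slope = (sum((w - mean_w) * (r - mean_r) for w, r in zip(wiper_speed, gauge_mm_h))
         / sum((w - mean_w) ** 2 for w in wiper_speed))
intercept = mean_r - slope * mean_w

def rainfall_from_wiper(speed_cpm: float) -> float:
    """Estimate rainfall intensity (mm/h) from an observed wiper speed."""
    return max(0.0, slope * speed_cpm + intercept)

print(f"Wiper speed 40 cycles/min -> {rainfall_from_wiper(40.0):.1f} mm/h")
```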
As alternatives to the crowdsourcing methods mentioned above, low-cost sensors are also able to provide precipitation data (Trono et al., 2012). Typical examples include (i) home-made acoustic disdrometers, which are generally installed in cities at a high spatial density, where precipitation intensity is identified by the acoustic strength of raindrops, with larger acoustic strength corresponding to stronger precipitation intensity (De Jong, 2010); (ii) acoustic sensors installed on umbrellas that can be used to measure precipitation intensity on rainy days (Hut et al., 2014); (iii) cameras and videos (e.g., surveillance cameras) that are employed to detect raindrops with the aid of data processing methods (Allamano et al., 2015; Minda & Tsuda, 2012); and (iv) smartphones with built-in sensors that can collect precipitation data (Alfonso et al., 2015).
4.3 Air Quality
Crowdsourcing methods for the acquisition of air quality data can be divided into three main categories: (i) citizen-owned in situ sensors, (ii) mobile sensors, and (iii) information obtained from social media. An example of the application of the first approach is presented by Gao et al. (2015), who validated the performance of seven Portable University of Washington Particle sensors deployed in Xi'an, China, to detect fine particulate matter (PM2.5). Similarly, Jiao et al. (2015) integrated commercially available technologies to create the Village Green Project, a durable, solar-powered air monitoring park bench that measures real-time ozone and PM2.5. More recently, Miskell et al. (2017) demonstrated that crowdsourced approaches with the aid of low-cost and citizen-owned sensors can increase the temporal and spatial resolution of air quality networks. Furthermore, Schneider et al. (2017) mapped real-time urban air quality (NO2) by combining crowdsourced observations from low-cost air quality sensors with time-invariant data from a local-scale dispersion model in the city of Oslo, Norway.
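To give a flavor of how crowdsourced observations can be combined with a model field, the sketch below corrects a model baseline with inverse-distance-weighted sensor residuals. This is a deliberately simplified stand-in for the geostatistical data fusion applied by Schneider et al. (2017), not their actual method; the sensor locations and values are synthetic.

```python
import math

# Synthetic low-cost sensor network: (x_km, y_km, observed NO2, model NO2 at site).
sensors = [
    (1.0, 1.0, 48.0, 40.0),
    (3.0, 2.0, 35.0, 30.0),
    (2.0, 4.0, 22.0, 26.0),
]

def fused_no2(x: float, y: float, model_value: float, power: float = 2.0) -> float:
    """Correct a dispersion model field with inverse-distance-weighted residuals."""
    weighted_residual, weight_sum = 0.0, 0.0
    for sx, sy, observed, modeled in sensors:
        d = math.hypot(x - sx, y - sy)
        if d < 1e-6:
            return observed                      # exactly at a sensor: use its reading
        w = d ** -power
        weighted_residual += w * (observed - modeled)  # residual = obs - model
        weight_sum += w
    return model_value + weighted_residual / weight_sum

# Fused estimate at an unmonitored location where the model alone says 32 ug/m3.
print(f"Fused NO2: {fused_no2(2.0, 2.0, 32.0):.1f} ug/m3")
```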
Typical examples of the use of mobile sensors for the measurement of air quality over the past few years include the work of B. Yang et al. (2016), where a low-cost mobile platform was designed and implemented to measure air quality, and that of Munasinghe et al. (2017), who demonstrated how a miniature microcontroller-based handheld device was developed to collect hazardous gas levels (CO, SO2, NO2) using semiconductor sensors. In addition to moving platforms, sensors have also been integrated with smartphones and vehicles to measure air quality, with the aid of hardware and software support (Honicky et al., 2008). Application examples include smartphones with built-in sensors used to measure air quality (CO, O3, and NO2) in urban environments (Oletic & Bilas, 2013) and smartphones with a corresponding app in the Netherlands used to measure aerosol properties (Snik et al., 2015). In relation to vehicles equipped with sensors for air quality measurement, examples include Elen et al. (2012), who used a bicycle for mobile air quality monitoring, and Bossche et al. (2015), who used a bicycle equipped with a portable black carbon sensor to collect black carbon measurements in Antwerp, Belgium. In these applications, bicycles are equipped with compact air quality measurement devices to monitor ultrafine particle number counts, particulate mass, and black carbon concentrations at a high temporal resolution (up to 1 s), with each measurement automatically linked to its geographical location and time of acquisition using GPS and Internet time (Elen et al., 2012). Subsequently, Castell et al. (2015) demonstrated that data gathered from sensors mounted on mobile modes of transportation could be used to mitigate citizen exposure to air pollution, while Apte et al. (2017) applied moving platforms with the aid of Google Street View cars to collect air pollution data (black carbon) at a reasonably high resolution.
The potential of acquiring air quality data from social media has also been explored recently. For instance, Jiang et al. (2015) successfully reproduced dynamic changes in air quality in Beijing by analyzing the spatiotemporal trends in geotagged social media messages. Following a similar approach, Sachdeva et al. (2017) assessed the air quality impacts caused by wildfire events with the aid of data sourced from social media, while Ford et al. (2017) explored the use of daily social media posts from Facebook regarding smoke, haze, and air quality to assess population-level exposure in the western United States. Analysis of social media data has also been used to assess air pollution exposure. For example, Sun et al. (2017) estimated the inhaled dose of pollutants (PM2.5) during a single cycling or pedestrian trip using Strava Metro data and GIS technologies in Glasgow, United Kingdom, demonstrating the potential of using such data for the assessment of average air pollution exposure during active travel, and Sun and Mobasheri (2017) investigated associations between cycling purpose and air pollution exposure at a large scale.
4.4 Geography
Crowdsourcing methods in geography can be divided into three types: (i) those that involve the intentional participation of citizens; (ii) those that harvest existing sources of information or involve mobile sensors; and (iii) those that integrate crowdsourced data with authoritative databases. Citizen-based crowdsourcing has been widely used for collaborative mapping, which is exemplified by the OSM application (Heipke, 2010; Neis et al., 2011; Neis & Zielstra, 2014). There are numerous papers on OSM in the geographical literature; see Mooney and Minghini (2017) for a good overview. The Collabmap platform is another example of a collaborative mapping application, which is focused on emergency planning; volunteers use satellite imagery from Google Maps and photographs from Google StreetView to digitize potential evacuation routes. Within geography, citizens are often trained to provide data through in situ collection. For example, volunteers were trained to map the spatial extent of the surface flow along the San Pedro River in Arizona using paper maps and GPS units (Turner & Richter, 2011). This low-cost solution has allowed for continuous monitoring of the river that would not have been possible without the volunteers, and the crowdsourced maps have been used for research and conservation purposes. In a similar way, volunteers were asked to go to specific locations and classify the land cover and land use, documenting each location with geotagged photographs with the aid of a mobile app called FotoQuest (Laso Bayas et al., 2016).
In addition to citizen-based approaches, crowdsourcing within geography can be conducted through various low-cost sensors, such as mobile phones and social media. For example, Heipke (2010) presented an example from TomTom, which uses data from mobile phones and locations of TomTom users to provide live traffic information and improved navigation. Subsequently, Fan et al. (2016) developed a system called CrowdNavi to ingest GPS traces for identifying local driving patterns. This local knowledge was then used to improve navigation in the final part of a journey, for example, within a campus, which has proven problematic for applications such as Google Maps and commercial satnavs. Social media has also been used as a form of crowdsourcing of geographical data over the past few years. Examples include the use of Twitter data from a specific event in 2012 to demonstrate how the data can be analyzed in space and time, as well as through social connections (Crampton et al., 2013), and the collection of Twitter data as part of the Global Twitter Heartbeat project (Leetaru et al., 2013). These collected Twitter data were used to demonstrate different spatial, temporal, and linguistic patterns using the subset of georeferenced tweets, among several other analyses.
In parallel with the development of citizen and low-cost sensor-based crowdsourcing methods, a number of approaches have also been developed to integrate crowdsourced data with authoritative data sources. Craglia et al. (2012) showed how data from social media (Twitter and Flickr) can be used to plot clusters of fire occurrence through their CONtextual Analysis of Volunteered Information system. Using data from France, they demonstrated that the majority of fires identified by the European Forest Fires Information System (EFFIS) were also identified by processing social media data through their system; moreover, additional fires not picked up by EFFIS were identified through this approach. In the application by Rice et al. (2013), crowdsourced data from both citizen-based and low-cost sensor-based methods were combined with authoritative data to create an accessibility map for blind and partially sighted people. The authoritative database contained permanent obstacles (e.g., curbs and sloped walkways), while crowdsourced data were used to complement the authoritative map with information on transitory obstacles, such as temporary barriers or the presence of large crowds. This application demonstrates how diverse sources of information can be combined to produce a better final information product for users.
4.5 Ecology
Crowdsourcing approaches to obtaining ecological data can be broadly divided into three categories: (i) ad hoc volunteer-based methods; (ii) structured volunteer-based methods; and (iii) methods using technological advances. Ad hoc volunteer-based methods have typically been used to observe a certain type or group of species (Donnelly et al., 2014). An early example dates back to 1966, when the Breeding Bird Survey was launched with the aid of a large number of volunteers (Sauer et al., 2009). The records from this project have become a primary source of data for avian research in North America and have underpinned analyses of bird population counts and their changes over time (Geissler & Noon, 1981; Link & Sauer, 1998; Sauer et al., 2003). Similarly, well-trained recreational divers voluntarily surveyed fish populations in California between 1997 and 2011 (Wolfe & Pattengill-Semmens, 2013), and the results were used to develop a fish database reporting the density variations of 18 different fish species. More recently, local residents were encouraged to monitor surface algal blooms in a lake in Finland from 2011 to 2013, and the results showed that such a crowdsourcing method can provide more reliable data on bloom frequency and intensity than the traditional satellite remote sensing approach (Kotovirta et al., 2014). Subsequently, many citizens voluntarily participated in a research project to assist in the identification of species richness in groundwater, and citizen engagement proved very beneficial in estimating amphipod diversity in Switzerland (Fiser et al., 2017). A crowdsourcing approach also assisted in identifying a 75% decline in flying insects in Germany over the last 27 years (Hallmann et al., 2017).
While simple to implement, the ad hoc volunteer-based crowdsourcing methods mentioned above are often not well designed in terms of their monitoring strategy, and hence the data collected may not fully represent the underlying properties of the species being investigated. In recognition of this, a network named eBird has been developed to create and sustain a global avian biological network (Sullivan et al., 2009), in which the monitoring locations have been formally designed and optimized. As a result, the collected data possess greater integrity than data obtained using crowdsourcing methods whose monitoring networks develop on a more ad hoc basis. Based on the data obtained from the eBird network, many models have been developed to account for variations in observation density (Fink et al., 2013) and to map hemisphere-wide species distributions (Fink et al., 2014), thereby enabling better understanding of broad-scale spatiotemporal processes in conservation and sustainability science. In a similar way, a network called PhragNet has been developed and applied to investigate the invasion of Phragmites australis (common reed), and the collected data have successfully identified environmental and plant community associations between the Phragmites invasion and patterns of management responses (Hunt et al., 2017).
In addition to these volunteer-based crowdsourcing methods, novel techniques have been increasingly employed to collect ecological data as a result of rapid developments in information technology (Teacher et al., 2013). For instance, a global hybrid forest map has been developed by combining remote sensing data, observations from volunteer-based crowdsourcing methods, and traditional measurements performed by governments (Schepaschenko et al., 2015). More recently, social media has been used to observe dolphins in the Hellenic Seas of the Mediterranean, and the collected data showed high consistency with the available literature on dolphin distributions (Giovos et al., 2016).
4.6 Surface Water
Crowdsourcing-based data collection methods in the surface water domain fall into three main groups: (i) citizen observations, (ii) the use of dedicated instruments, and (iii) the use of images or videos. Of these, citizen observations represent the most straightforward means of sourcing data, typically water depth. Examples include a software package designed to enable the collection of water levels via text messages from local citizens (Fienen & Lowry, 2012) and a crowdsourced database built for collecting stream stage measurements, where text messages from citizens were transmitted to a server that stored and displayed the data on the web (Lowry & Fienen, 2013). More recently, a local community was encouraged to gather time series of river stage (Walker et al., 2016), and a crowdsourced database was implemented as a low-cost method to assess water quantity within the Sondu River catchment in Kenya, where citizens were invited to read water levels and transmit them, together with the station number, to the database via a simple text message from their cell phones (Weeser et al., 2016). As the collection of water quality data generally requires specialist equipment, crowdsourcing efforts in this field have relied on citizens to provide water samples that can then be analyzed. Examples include the estimation of the spatial distribution of nitrogen solutes via a crowdsourcing campaign in which citizens provided samples at different locations, the investigation of watershed health (water quality assessment) with the aid of samples collected by local citizens (Jollymore et al., 2017), and the monitoring of fecal indicator bacteria concentrations in waterbodies of the greater New York City area using water samples collected by local citizens.
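A minimal sketch of the server-side step common to the text message-based systems above (e.g., Fienen & Lowry, 2012; Weeser et al., 2016) is given below: parse an incoming message into a station identifier and a water level. The message format and field names are assumptions for illustration, not the formats used in those projects.

```python
import re
from datetime import datetime, timezone

# Assumed message format: "<station id> <water level in cm>", e.g., "RGS04 127".
MSG_PATTERN = re.compile(r"^\s*(?P<station>[A-Za-z0-9]+)\s+(?P<level_cm>\d+(\.\d+)?)\s*$")

def parse_sms(body: str):
    """Return (station, level_cm, received_utc) or None if malformed."""
    m = MSG_PATTERN.match(body)
    if m is None:
        return None  # malformed reports are set aside for manual review
    return (m.group("station"),
            float(m.group("level_cm")),
            datetime.now(timezone.utc).isoformat())

print(parse_sms("RGS04 127"))    # valid report
print(parse_sms("hello there"))  # rejected: does not match the format
```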
An example of the use of instruments for obtaining crowdsourced surface water data is given in Sahithi (2016), who showed that a mobile app and lake monitoring kit can be used to measure the physical properties of water samples. Another application is given in Castilla et al. (2015), who showed that data from 250 water bodies across 13 cities, measured by trained citizens with the aid of instruments, can be used to successfully assess elevated phytoplankton densities in urban and peri-urban freshwater ecosystems.
The use of crowdsourced images and videos has increased in popularity with developments in smartphones and other personal devices, in conjunction with the increased ability to share them. For example, Secchi depth and turbidity (water quality parameters) of rivers have been monitored using images taken via mobile phones (Toivanen et al., 2013), and water levels have been determined using projected geometry and augmented reality to analyze three different images of a river's surface taken by citizens at the same location with smartphones, together with the corresponding GPS locations (Demir et al., 2016). More recently, Tauro and Salvatori (2017) developed a system with lasers and an Internet protocol camera equipped with two optical modules to acquire surface velocity data for the Tiber River; Kampf et al. (2018) proposed the CrowdWater project to measure stream levels with the aid of multiple photos taken at the same site but at different times; and Leeuw and Boss (2018) introduced HydroColor, a mobile application that uses a smartphone's camera and auxiliary sensors to measure the remote sensing reflectance of natural water bodies.
Crowdsourced data can also be combined with other types of data to improve data quality. For example, Kampf et al. (2018) developed a Stream Tracker with the goal of improving intermittent stream mapping and monitoring using satellite and aircraft remote sensing, in-stream sensors, and crowdsourced observations of streamflow presence and absence. The crowdsourced data were used to fill in information on streamflow intermittence anywhere that people regularly visited streams, for example, during a hike or bike ride, or when passing by while commuting.
4.7 Natural Hazard Management
The crowdsourcing data acquisition methods used to support natural hazard management can be divided into three broad classes: (i) the use of low-cost sensors; (ii) the active provision of dedicated information by citizens; and (iii) the mining of relevant data from social media databases. Low-cost sensors are generally used to obtain information about the hazard itself. The use of such sensors is becoming more prevalent, particularly in the field of flood management, where they have been used to obtain water levels (Liu et al., 2015) or velocities (Braud et al., 2014; Le Coz et al., 2016; Tauro & Salvatori, 2017) in rivers. The latter can also be obtained with the use of autonomous small boats (Sanjou & Nagasaka, 2017).
Active provision of data by citizens can also be used to better understand the location, extent, and severity of natural hazards and has been aided by recent technological advances, not only in the acquisition of data but also in their transmission and storage, making them more accessible and usable. In the area of flood management, Alfonso et al. (2010) tested a system in which citizens sent their readings of water level rulers via text message. Since then, other studies have adopted similar approaches (Lowry & Fienen, 2013; McDougall, 2011; McDougall & Temple-Watts, 2012) and have adapted them to new technologies, such as website upload (Degrossi et al., 2014; Starkey et al., 2017). Kutija et al. (2014) developed an approach in which water levels are extracted from images of floods submitted by citizens. Such an approach has also been used to obtain flood extent (Yu et al., 2016) and velocity (Le Coz et al., 2016).
Another means by which citizens can actively provide data for natural hazard management is collaborative mapping. For example, as mentioned in section 4.4, the Collabmap platform can be used to crowdsource evacuation routes for natural hazard events. As part of this approach, citizens are involved in one of five microtasks related to the development of maps of evacuation routes: building identification, building verification, route identification, route verification, and completion verification (Ramchurn et al., 2013). In another example, citizens used a web-based GIS application to indicate the position of ditches and to modify the attributes of existing ditch systems on maps, which were used as inputs to a flood model for inland excess water hazard management (Juhász et al., 2016).
The mining of data from social media databases and image/video repositories has received significant attention in natural hazard management (Alexander, 2008; Goodchild & Glennon, 2010; Horita et al., 2013) and can be used to signal and detect hazards, to document and learn from what is happening, and to support disaster response activities (Houston et al., 2015). To date, this approach has been used primarily for hazard response activities to improve situational awareness (Anson et al., 2017; Horita et al., 2013), owing to the speed and robustness with which information is made available, its low cost, and the fact that it can provide text, image/video, and locational information (McSeveny & Waddington, 2017; Middleton et al., 2014; Stern, 2017). However, it can also provide large amounts of data from which to learn from past events (Stern, 2017), as was the case for the 2013 Colorado floods, where social media data were analyzed to better understand damage mechanisms and prevent future damage (Dashti et al., 2014).
Owing to the relative ease of data access, the most common platforms for obtaining relevant information are Twitter and Flickr. For example, Twitter data can be analyzed to detect the occurrence of natural hazard events (Li et al., 2012), as demonstrated by applications to floods (Palen et al., 2010; Smith et al., 2015) and earthquakes (Sakaki et al., 2013), as well as the location of such events, as shown for earthquakes (Sakaki et al., 2013), floods (Vieweg et al., 2010), fires (Vieweg et al., 2010), storms (Smith et al., 2015), and hurricanes (Kryvasheyeu et al., 2016). The location of wildfires has also been obtained by analyzing data from VGI services such as Flickr (Craglia et al., 2012; Goodchild & Glennon, 2010).
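As a stylized picture of event detection from social media streams, the sketch below flags time windows in which counts of hazard-related posts spike well above the recent baseline. This simple threshold rule is an invented stand-in for, and much simpler than, the probabilistic detectors of, for example, Sakaki et al. (2013).

```python
from statistics import mean, stdev

def detect_bursts(window_counts, z_threshold=3.0, baseline_len=24):
    """Flag time windows whose post counts spike above the recent baseline.

    `window_counts` is a list of counts of hazard-keyword posts per window
    (e.g., per hour). A window is flagged when its count exceeds the
    baseline mean by `z_threshold` standard deviations.
    """
    alerts = []
    for i in range(baseline_len, len(window_counts)):
        baseline = window_counts[i - baseline_len:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (window_counts[i] - mu) / sigma > z_threshold:
            alerts.append(i)
    return alerts

# Hourly counts of posts containing, e.g., "flood": a quiet day, then a spike.
counts = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3, 5, 4, 3, 2, 4, 3, 4, 5, 3, 2, 4, 3, 4, 3, 48, 60]
print(detect_bursts(counts))  # -> [24, 25]
```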
Data obtained from analyzing social media databases and image/video repositories can also be used to assess the impact of natural disasters. This can include determination of the spatial extent (Brouwer et al., 2017; Cervone et al., 2016; Jongman et al., 2015; Rosser et al., 2017) and impact/damage (de Albuquerque et al., 2015; Jongman et al., 2015; Kryvasheyeu et al., 2016; Vieweg et al., 2010) of floods, as well as the damage/injury arising from fires (Vieweg et al., 2010), hurricanes (Kryvasheyeu et al., 2016; Middleton et al., 2014; Yuan & Liu, 2018), tornadoes (Kryvasheyeu et al., 2016; Middleton et al., 2014), earthquakes (Kryvasheyeu et al., 2016), and mudslides (Kryvasheyeu et al., 2016).
Social media data can also be used to obtain information about the hazard itself. Examples include the determination of water levels (Aulov et al., 2014; de Albuquerque et al., 2015; Eilander et al., 2016; Jongman et al., 2015; Kongthon et al., 2012; Li et al., 2017; Smith et al., 2015; Vieweg et al., 2010) and water velocities (Le Boursicaud et al., 2016), including the use of such data to evaluate the stability of a person immersed in a flood (Milanesi et al., 2016). The analysis of Twitter data has also provided a range of other information relevant to natural hazard management, including information on traffic and road conditions during floods (de Albuquerque et al., 2015; Kongthon et al., 2012; Vieweg et al., 2010) and typhoons (Declan, 2013), as well as information on damaged and intact buildings and the locations of key infrastructure, such as hospitals, during Typhoon Haiyan in the Philippines (Declan, 2013). Goodchild and Glennon (2010) were able to use VGI services such as Flickr to obtain maps of the locations of emergency shelters during the Santa Barbara wildfires in the United States.
Different types of crowdsourced data can also be combined with other types of data and simulation models to improve natural hazard management. Other types of data can be used to verify the quality and improve the usefulness of outputs obtained by analyzing social media data. For example, Middleton et al. (2014) used published information to verify the quality of maps of flood extent resulting from Hurricane Sandy, and of damage extent resulting from the Oklahoma tornado, obtained by analyzing the geospatial information contained in tweets. In contrast, de Albuquerque et al. (2015) used authoritative data on water levels from 185 stations at 15-min resolution, as well as information on drainage direction, to identify the tweets that provided the most relevant information for improving situational awareness during the 2013 floods on the River Elbe in Germany. Other data types can also be combined with crowdsourced data to improve the usefulness of the outputs. For example, Jongman et al. (2015) combined near-real-time satellite data with near-real-time Twitter data on the location, timing, and impacts of floods for case studies in Pakistan and the Philippines to improve humanitarian response. McDougall and Temple-Watts (2012) combined high quality aerial imagery, LiDAR data, and publicly available volunteered geographic imagery (e.g., from Flickr) to reconstruct flood extents and obtain information on depth of inundation for the 2011 Brisbane floods in Australia.
With regard to the combination of crowdsourced data with models, Juhász et al. (2016) used data on the location of channels and ditches provided by citizens as one of the inputs to an online hydrological model for visualizing areas at potential risk of flooding under different scenarios. Alternatively, Smith et al. (2015) developed an approach that uses data from Twitter to identify when a storm event occurs, triggering simulations from a hydrodynamic flood model in the correct location, and to validate the model outputs, whereas Aulov et al. (2014) used data from tweets and Instagram images for the real-time validation of a process-driven storm surge model for Hurricane Sandy in the United States.
4.8 Summary of Crowdsourcing Methods Used
The different crowdsourcing-based data acquisition methods discussed in sections 4.1 to 4.7 can be broadly classified into four major groups: citizen observations, instruments, social media, and integrated methods (Table 2). As can be seen, the methods belonging to these groups cover all nine categories of crowdsourcing data acquisition methods defined in Table 1. Interestingly, six out of the nine possible methods have been used in the domain of natural hazard management (Table 2), which is primarily due to the widespread use of social media and integrated methods in this domain.
| Methods | Subcategory | Data agent | Data type | Weather | Precipitation | Air quality | Geography | Ecology | Surface water | Natural hazard management |
|---|---|---|---|---|---|---|---|---|---|---|
| Citizens | Citizen observation | CI | IT | Temperature, wind (Elmore et al., 2014; Niforatos, Vourvopoulos, et al., 2015) | Rainfall, snow, hail (Illingworth et al., 2014) | | Land cover and geospatial database (Fritz et al., 2012; Neis & Zielstra, 2014) | Fish and algal bloom (Kotovirta et al., 2014; Wolfe & Pattengill-Semmens, 2013) | Stream stage (Weeser et al., 2016) | Flooded area and evacuation routes (Ramchurn et al., 2013; Yu et al., 2016) |
| Instruments | In situ (automatic stations, microwave links, etc.) | IS | IT | Wind and temperature (Chapman et al., 2016) | Rainfall (De Vos et al., 2017) | PM2.5, ozone (Jiao et al., 2015) | | | Shale gas and heavy metal (Jalbert & Kinchy, 2015) | |
| Instruments | In situ (automatic stations, microwave links, etc.) | IS | UIT | Fog (David et al., 2015) | Rainfall (Fencl et al., 2017) | | | | | |
| Instruments | Mobile (phones, cameras, vehicles, bicycles, etc.) | CI + IS | IT | Temperature and humidity (Majethia et al., 2015; Sosko & Dalyot, 2017) | Rainfall (Allamano et al., 2015; Guo et al., 2016) | NO, NO2, black carbon (Apte et al., 2017) | Land cover (Laso Bayas et al., 2016) | Dolphin count (Giovos et al., 2016) | Suspended sediment and dissolved organic matter (Leeuw & Boss, 2018) | Water level and velocity (Liu et al., 2015; Sanjou & Nagasaka, 2017) |
| Instruments | Mobile (phones, cameras, vehicles, bicycles, etc.) | CI + IS | UIT | | Rainfall (P. Yang & Ng, 2017) | Particulate matter (Sun et al., 2017) | | | | |
| Social media | Text-based | CI | UIT | | | | | | | Flooded area (Brouwer et al., 2017) |
| Social media | Multimedia (text, images, videos, etc.) | CI + IS | UIT | | | Smoke dispersion (Sachdeva et al., 2017) | Location of tweets (Leetaru et al., 2013) | Tiger count (Can et al., 2017) | Water level (Michelsen et al., 2016) | Disaster detection (Sakaki et al., 2013); damage (Yuan & Liu, 2018) |
| Integrated | Multiple sources | CI | IT + UIT | | | | | | | Flood extent and level (Wang et al., 2018) |
| Integrated | Multiple sources | IS | IT + UIT | | Rainfall (Haese et al., 2017) | | | | | |
| Integrated | Multiple sources | CI + IS | IT + UIT | | | | Accessibility mapping (Rice et al., 2013) | | Water quantity (Deutsch et al., 2005) | Inundated area (Le Coz et al., 2016) |

- Note. CI = Citizen; IS = Instrument; IT = intentional; UIT = unintentional.
Of the four major groups of methods shown in Table 2, citizen observations have been used most broadly across the different domains of geophysics reviewed. This is, at least partly, because of the relatively low cost of this crowdsourcing approach, as it does not rely on monitoring equipment and sensors. Based on the categorization introduced in Figure 6, this approach uses citizens (through their senses, such as sight) as data generation agents, and the data type belongs to the intentional category. As part of this approach, local citizens have reported qualitative levels of temperature, wind, rain, snow, and hail based on their subjective perceptions, as well as land cover, algal blooms, stream stage, flooded areas, and evacuation routes based on their readings and counts.
While citizen observation-based methods are simple to implement, the resulting data might not be sufficiently accurate for particular applications. This limitation can be overcome by using instruments. As shown in Table 2, instruments used for crowdsourcing generally belong to one of two categories: in situ sensors/stations (installed and maintained by citizens, rather than authoritative agencies) or mobile devices. For methods belonging to the former category, instruments are used as data generation agents, but the data type can be either intentional or unintentional. Typical in situ instruments for the intentional collection of data include automatic weather stations used to obtain wind and temperature data, gauges used to measure rainfall intensity, and sensors used to measure air quality (PM2.5 and ozone), shale gas and heavy metals in rivers, and water levels during flooding events. An example of unintentional data collection is the use of commercial microwave links to estimate fog and rainfall intensity, as these links were not installed to provide such data.
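The physical basis of the microwave link method is the approximately power law relationship between rain-induced specific attenuation k (in dB/km) and rain rate R (in mm/h), k = aR^b, where a and b depend on frequency and polarization. A minimal sketch of inverting this relationship is shown below; the coefficient values are illustrative placeholders rather than tabulated ITU-R values.

```python
def rain_rate_from_attenuation(path_loss_db, length_km, a=0.24, b=1.06):
    """Invert the power law k = a * R**b for rain rate R (mm/h).

    `path_loss_db` is the rain-induced attenuation over the whole link
    (total loss minus the dry-weather baseline); k is the loss per km.
    The coefficients a and b depend on frequency and polarization; the
    values here are illustrative placeholders, not ITU-R tabulated values.
    """
    k = path_loss_db / length_km  # specific attenuation, dB/km
    return (k / a) ** (1.0 / b)

# A 5 km link showing 6 dB of excess attenuation during a storm.
print(f"{rain_rate_from_attenuation(6.0, 5.0):.1f} mm/h")
```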
Instruments belonging to the mobile category generally require both citizens and instruments as data generation agents (Table 2). This is because such sensors are either attached to citizens themselves or to vehicles operated by citizens (although this is likely to change in future as the use of autonomous vehicles becomes more common). However, as is the case for the in situ category, data types can be either intentional or unintentional. As can be seen from Table 2, methods belonging to the intentional category have been used across all domains of geophysics considered in this review. Examples include the use of mobile phones, cameras, cars, and people on bikes measuring variables such as temperature, humidity, rainfall, air quality, land cover, dolphin numbers, suspended sediment, dissolved organic matter, water level, and water velocity. Examples from the unintentional data type include the identification of rainy days through audio clips collected from smartphones installed in cars (Guo et al., 2016) and the general assessment of air pollution exposure with the aid of traces and duration of outdoor cycling activities (Sun et al., 2017). It should be noted that there are also cases where different instruments can be combined to collect/estimate data. For example, weather stations and microwave links were jointly used to estimate wind and humidity by Vishwarupe et al. (2016).
Crowdsourced data obtained from social media or image/video repositories belong to the unintentional data type category, as they are mined from information not shared for the purposes for which they are ultimately used. However, the data generation agent can be either citizens alone or citizens in combination with instruments (Table 2). As most of the information that is useful from a geophysics perspective contains images or spatial information that requires the use of instruments (e.g., mobile phones), there are few examples where citizens are the sole data generation agent, such as the analysis of text-based information from Twitter or Facebook to obtain maps of flooded areas to aid natural hazard management (Table 2). Applications where both citizens and instruments generate the data to be analyzed are more widespread, including the estimation of smoke dispersion after fire events, the determination of the geographical locations where tweets were authored, the identification of the number of tigers around the world to aid tiger conservation, the estimation of water levels, the detection of earthquake events, and the identification of critically affected areas and damage from hurricanes.
In parallel with the development of the three types of methods mentioned above, there is also growing interest in integrating various crowdsourced data, typically aimed at improving data coverage or enabling cross-validation. As shown in Table 2, such integrated methods can involve both categories of data generation agents and both categories of data types. Examples include the development of accessibility mapping for people with disabilities, water quantity estimation, and the estimation of inundated areas. An example where citizens are the only data generation agent but both data types are used is the integration of citizen observations transmitted through a dedicated mobile app with Twitter data to show flood extent and water level for disaster management (Wang et al., 2018).
As discussed in sections 4.1 to 4.7, these crowdsourcing methods can be also integrated with data from authoritative databases or with models to further improve the spatiotemporal resolution of the data being collected. Another aim of such hybrid approaches is to enable the crowdsourcing data to be validated. Examples include gauged rainfall data integrated with data estimated from microwave links (Fencl et al., 2017; Haese et al., 2017), stream mapping through combining mobile app data and satellite remote sensing data (Kampf et al., 2018), and the validation of the quality of water level data derived from tweets using authoritative data (de Albuquerque et al., 2015).
5 Review of Issues Associated With Crowdsourcing Applications
5.1 Management of Crowdsourcing Projects
5.1.1 Background
The managerial, organizational, and social aspects of crowdsourcing applications are as important and challenging as the development of the data processing and modeling technologies that ingest the resulting data. Hence, there is a growing body of literature on how to design, implement, and manage crowdsourcing projects. As the core component of crowdsourcing projects is the participation of the crowd, engaging and motivating the public has become a primary consideration in the management of crowdsourcing applications, and a range of strategies is emerging to address this aspect of project design. At the same time, many authors argue that the design of crowdsourcing efforts, in terms of spatial scale and participant selection, is a trade-off between cost, time, accuracy, and research objectives. Another key set of methods related to project design revolves around data collection, that is, data protocols and standards, as well as the development of optimal spatiotemporal sampling strategies for a given application. When using low-cost sensors and smartphones, additional methods are needed to address calibration and environmental conditions. Finally, we consider methods for the integration of various crowdsourced data into further applications, which is one of the main categories of crowdsourcing methods that emerged from the review (see Table 2) but also warrants consideration from the perspective of managing crowdsourcing projects.
5.1.2 Current Status
There are four main categories of methods associated with the management of crowdsourcing applications, as outlined in Table 3. A number of studies have been conducted to help understand which methods are effective for engaging and motivating participants in crowdsourcing applications, particularly as many applications need to attract a large number of participants (Alfonso et al., 2015; Buytaert et al., 2014). Groom et al. (2017) argue that the users of crowdsourced data should acknowledge the citizens who were involved in the data collection in ways that matter to them. If the monitoring is over a long time period, mechanisms must be put in place to ensure sustained participation (Theobald et al., 2015), which can pose challenges for the implementation of crowdsourcing projects; consequently, many crowdsourcing projects are best suited to cases where continuous data gathering is not the main objective.
| Methods | Typical references | Key comments |
|---|---|---|
| Engagement strategies for motivating participation in crowdsourcing | Buytaert et al. (2014), Alfonso et al. (2015), Groom et al. (2017), Theobald et al. (2015), Donnelly et al. (2014), Kobori et al. (2015), Roy et al. (2016), Can et al. (2017), Elmore et al. (2014), Vogt and Fischer (2014), Fritz et al. (2017) | • Understanding of the motivations of citizens to guide the design of crowdsourcing projects • Adoption of best practice from projects across multiple domains, for example, training, good communication and feedback, targeting existing communities, volunteer recognition systems, and social interaction • Incentives, for example, micropayments and gamification |
| Data collection protocols and standards | Kobori et al. (2015), Vogt and Fischer (2014), Honicky et al. (2008), Anderson et al. (2012), Wolters and Brandsma (2012), Overeem, Robinson, et al. (2013), Majethia et al. (2015), Buytaert et al. (2014) | • Simple, usable data collection protocols • Better protocols and methods for the deployment of low-cost and vehicle sensors • Data standards and interoperability, for example, the OGC Sensor Observation Service |
| Sample design for data collection | Doesken and Weaver (2000), De Vos et al. (2017), Chacon-Hurtado et al. (2017), Davids et al. (2017) | • Sampling design strategies, for example, for precipitation and streamflow monitoring, that is, spatial distribution and temporal frequency • Adapting existing sample design frameworks to crowdsourced data |
| Assimilation and integration of crowdsourced data | Mazzoleni et al. (2017), Schneider et al. (2017), Panteras and Cervone (2018), Bell et al. (2013), Muller (2013), Haese et al. (2017), Chapman et al. (2015), Liberman et al. (2014), Doumounia et al. (2014), Allamano et al. (2015), Overeem et al. (2016) | • Assimilation of crowdsourced data in flood forecasting models, flood and air quality mapping, numerical weather prediction, and simulation of precipitation fields • Dense urban monitoring networks for assessment of crowdsourced data and integration into smart city applications • Methods for working with existing infrastructure for data collection and transmission |

- Note. OGC = Open Geospatial Consortium.
Considerable experience has been gained in setting up successful citizen science projects for biodiversity monitoring in Ireland, which can inform crowdsourcing project design and implementation. Donnelly et al. (2014) provide a checklist of criteria, including the need to devise a plan for participant recruitment and retention. They also recognize that training needs must be assessed and the necessary resources provided, for example, through workshops and training videos. To sustain participation, they provide comprehensive newsletters to their volunteers, as well as regular workshops to further train and engage participants. Involving schools is also a way to improve participation, particularly when data become a required element of the desired scientific activities, for example, saving tigers (Can et al., 2017; Donnelly et al., 2014; Roy et al., 2016). Other experiences, from Japan, the United Kingdom, and the United States, are reported by Kobori et al. (2015), who suggested that existing communities with an interest in the application area should be targeted, some form of volunteer recognition system should be implemented, and tools for facilitating positive social interaction between volunteers should be used. They also suggest that front-end evaluation involving interviews and focus groups with the target audience can be useful for understanding the research interests and motivations of the participants, which can inform application design. Experience in the collection of precipitation data through the mPING mobile app has shown that the simplicity of the application and immediate feedback to the user were key elements of success in attracting large numbers of volunteers (Elmore et al., 2014). This more general need to communicate with volunteers has been touched upon by several researchers (e.g., Donnelly et al., 2014; Kobori et al., 2015; Vogt & Fischer, 2014). Finally, different incentives should be considered as a way to increase volunteer participation, ranging from the addition of gamification or competitive elements to micropayments, for example, through the use of platforms such as Amazon Mechanical Turk, where appropriate (Fritz et al., 2017).
A second set of methods related to the management of crowdsourcing applications revolves around data collection protocols and data standards. Kobori et al. (2015) recognize that complex data collection protocols or inconvenient sampling locations can be barriers to citizen participation and hence suggest that data protocols should be simple. Vogt and Fischer (2014) similarly note that the usability of their protocol for monitoring urban trees is an important element of the project. Clear protocols are also needed for collecting data from vehicles, low-cost sensors, and smartphones in order to deal with inconsistencies in the conditions of the equipment, such as the running speed of the vehicles, the operating system version of the smartphones, the condition of batteries, the sensor environment (i.e., whether sensors are indoors or outdoors, or whether a smartphone is carried in a pocket or handbag), and a lack of calibration or correction for sensor drift (Anderson et al., 2012; Honicky et al., 2008; Majethia et al., 2015; Overeem, Leijnse, et al., 2013; Wolters & Brandsma, 2012). The quality of crowdsourced atmospheric data is therefore highly susceptible to disturbances caused by user behavior, user movements, and other interfering factors. One approach for tackling these problems is to record the environmental conditions along with the sensor measurements, which can then be used to correct the observations. Finally, data standards and interoperability are important considerations, which are discussed by Buytaert et al. (2014) in relation to sensors. The Open Geospatial Consortium Sensor Observation Service is one example where work is progressing on sensor data standards.
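The suggestion above of recording environmental conditions alongside each sensor value can be sketched as a simple record layout plus a screening rule. The metadata fields and the screening criterion below are illustrative assumptions, not a published protocol.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SmartphoneReading:
    """A raw smartphone sensor value plus the context needed to correct it
    (illustrative fields, not a published schema)."""
    variable: str               # e.g., "battery_temperature"
    raw_value: float
    unit: str
    indoors: Optional[bool]     # None if unknown
    in_pocket: Optional[bool]   # phone carried in a pocket or handbag?
    charging: bool              # charging warms the battery sensor
    os_version: str             # software differences affect readings
    speed_m_s: Optional[float]  # e.g., vehicle speed for car-mounted sensors

def usable_for_air_temperature(r: SmartphoneReading) -> bool:
    """A crude screening rule: keep only outdoor, uncovered, idle readings."""
    return r.indoors is False and r.in_pocket is False and not r.charging

r = SmartphoneReading("battery_temperature", 31.5, "degC",
                      indoors=False, in_pocket=False, charging=False,
                      os_version="Android 8.0", speed_m_s=0.0)
print(usable_for_air_temperature(r))  # True: passes the screening rule
```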
Another set of methods that needs to be considered in the design of a crowdsourcing application is the identification of an appropriate sample design for the data collection. For example, methods have been developed for determining the optimal spatial density and locations for precipitation monitoring (Doesken & Weaver, 2000). Although a denser precipitation observation network is more likely to capture the underlying characteristics of the precipitation field, it comes with significantly increased effort to organize and maintain such a large volunteer network (De Vos et al., 2017). Hence, the sample design and the corresponding trade-off need to be considered in the design of crowdsourcing applications. Chacon-Hurtado et al. (2017) present a generic framework for designing a rainfall and streamflow sensor network, including the use of model outputs; such a framework could be extended to include crowdsourced precipitation and streamflow data. The temporal frequency of sampling also needs to be considered. Davids et al. (2017) investigated the effect of lower frequency sampling of streamflow, similar to that which would be produced by citizen monitors. By subsampling 7 years of data from 50 stations in California, they found that even with lower temporal frequency, the information would be useful for monitoring, with reliability increasing for less flashy catchments.
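The effect of sampling frequency can be illustrated by thinning a synthetic high-frequency record and measuring the information lost, loosely mimicking the subsampling experiment of Davids et al. (2017); the error metric and the synthetic record below are illustrative choices, not the authors' procedure.

```python
def subsample(series, every_n):
    """Keep every n-th observation, e.g., daily -> weekly for every_n=7."""
    return series[::every_n]

def mean_flow_error(full, every_n):
    """Relative error in the mean flow when only a subsample is available."""
    sub = subsample(full, every_n)
    full_mean = sum(full) / len(full)
    sub_mean = sum(sub) / len(sub)
    return abs(sub_mean - full_mean) / full_mean

# A synthetic 'daily streamflow' record with a storm peak (arbitrary units);
# flashier records lose more information when sampled infrequently.
daily_flow = [5, 5, 6, 7, 40, 25, 12, 8, 6, 5, 5, 5, 6, 5] * 10

for n in (2, 7, 14):  # every 2 days, weekly, fortnightly
    print(f"sampling every {n:2d} days -> relative error in mean flow: "
          f"{mean_flow_error(daily_flow, n):.1%}")
```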
The final set of methods that needs to be considered when developing and implementing a crowdsourcing application is how the crowdsourced data will be used, that is, integrated or assimilated into monitoring and forecasting systems. For example, Mazzoleni et al. (2017) investigated the assimilation of crowdsourced data directly into flood forecasting models. They developed a method that deals specifically with the heterogeneous nature of the data by updating the model states and covariance matrices as the crowdsourced data became available. Their results showed that model performance increased with the addition of crowdsourced observations, highlighting the benefits of this data stream. In the area of air quality, Schneider et al. (2017) used a data fusion method to assimilate NO2 measurements from low-cost sensors with spatial outputs from an air quality model. Although the results were generally good, the accuracy varied based on a number of factors including uncertainties in the low-cost sensor measurements. Other methods are needed for integrating crowdsourced data with ground-based station data and remote sensing since these different data inputs have varying spatiotemporal resolutions. An example is provided by Panteras and Cervone (2018), who combined Twitter data with satellite imagery to improve the temporal and spatial resolution of probability maps of surface flooding produced during four phases of a flooding event in Charleston, South Carolina. The value of the crowdsourced data was demonstrated during the peak of the flood in phase two when no satellite imagery was available.
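As a stylized picture of how a crowdsourced observation can update a model state, the sketch below applies a scalar Kalman-style update in which a noisier observation receives a smaller gain and therefore nudges the model less. This is a textbook update rule used for illustration, not the actual scheme of Mazzoleni et al. (2017).

```python
def assimilate(state, state_var, obs, obs_var):
    """Scalar Kalman-style update of a model state with one observation.

    A large `obs_var` (an uncertain crowdsourced report) yields a small
    gain, so the observation nudges the model only slightly.
    """
    gain = state_var / (state_var + obs_var)
    new_state = state + gain * (obs - state)
    new_var = (1.0 - gain) * state_var
    return new_state, new_var

# Forecast water level 2.0 m (variance 0.09 m^2); a citizen report of
# 2.6 m is assumed far noisier (variance 0.25 m^2) than an official gauge.
print(assimilate(2.0, 0.09, 2.6, 0.25))  # crowdsourced: small correction
print(assimilate(2.0, 0.09, 2.6, 0.01))  # official: large correction
```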
Another area of ongoing research is the assimilation of data from amateur weather stations into numerical weather prediction, providing both high resolution data for initial surface conditions and local correction of outputs. For example, Bell et al. (2013) compared crowdsourced data from amateur weather stations with official meteorological stations in the United Kingdom and found good correspondence for some variables, indicating that assimilation is possible. Muller (2013) showed that crowdsourced snow depth interpolated for 1 day correlated well with a radar map, while Haese et al. (2017) showed that merging data from existing weather observation networks with crowdsourced data from CMLs yields a more complete picture of weather conditions; both data sources clearly have potential value for forecasting models. Finally, Chapman et al. (2015) presented the details of a high resolution urban monitoring network in Birmingham, describing many potential applications, from assimilation of the data into numerical weather prediction models, to acting as a test bed for assessing crowdsourced atmospheric data, to linking with various smart city applications.
Some crowdsourcing methods depend on existing infrastructure or facilities for data collection, as well as infrastructure for data transmission (Liberman et al., 2014). For example, the utilization of microwave links for rainfall estimation is greatly affected by the frequency and length of the available links (Leijnse et al., 2007; Zinevich et al., 2009), and moving-car and low-cost sensor-based methods are heavily influenced by the availability of such cars and sensors (Allamano et al., 2015). One approach to tackling this issue is the development of hybrid methods that integrate multiple existing crowdsourcing approaches to provide precipitation data with improved reliability (Liberman et al., 2014; P. Yang & Ng, 2017).
5.1.3 Challenges and Future Directions
There is considerable experience being amassed from crowdsourcing applications across multiple domains in geophysics. This collective best practice in the design, implementation, and management of crowdsourcing applications should be harnessed and shared between disciplines rather than duplicating efforts. In many ways, this review represents a way of signposting important developments in this field for the benefit of multiple research communities. Moreover, new conferences and journals focused on crowdsourcing and citizen science will facilitate a more integrated approach to solving problems of a similar nature experienced in different disciplines. Engagement and motivation will continue to be a key challenge. In particular, it is important to recognize that participation will always be biased, that is, subject to the 90:9:1 rule, which states that 90% of participants will simply view the data generated, 9% will provide some data from time to time, and the majority of the data will be collected by the remaining 1% of volunteers. Although different crowdsourcing applications will have different percentages and varying degrees of success in mitigating this bias, it is critical to gain a better understanding of participant motivations and then design projects that meet these motivations. Ongoing research in the field of governance can help to identify bottlenecks in the operational implementation of crowdsourcing projects by evaluating citizen participation mechanisms (Wehn et al., 2015).
On the data collection side, some of the challenges related to the deployment of low-cost and mobile sensors may be solved through improving the reliability of the sensors in the future (McKercher et al., 2017). However, an ongoing challenge that hinders the wider collection of atmospheric observations from the public is that outdoor measurement facilities are often vulnerable to environmental damage (Chapman et al., 2016; Melhuish & Pedder, 2012). There are technical challenges arising from the lack of data standards and interoperability for data sharing (Panteras & Cervone, 2018), particularly in domains where multiple types of data are collected and integrated within a single application. This will continue to be a future challenge, but there are several open data standards emerging that could be used for integrating data from multiple sources and sensors, for example, WaterML or SWE (Sensor Web Enablement), which are being championed by the Open Geospatial Consortium.
Another key future direction will be the development of more operational systems that integrate intentional and unintentional crowdsourcing, particularly as the value of such data for enhancing existing authoritative databases becomes more and more evident. Much of the research reported in this review presents the results of dedicated, one-time-only experiments that, as discussed in section 3, are in most cases restricted to research projects and the academic environment. Even in research projects dedicated to citizen observatories that include local partnerships, there is limited demonstration of changes in management procedures and structures, and little technological uptake. Hence, crowdsourcing needs to be operationalized, and there are many challenges associated with this. For example, amateur weather stations are often clustered in urban areas or areas with higher population density, they have not necessarily been calibrated or recalibrated for drift, they are not always placed in optimal locations at a particular site, and they often lack metadata (Bell et al., 2013). Chapman et al. (2015) touched upon a wide range of issues related to the urban monitoring network in Birmingham, from site discontinuation due to lack of engagement to more technical problems associated with connectivity, signal strength, and battery life. Greater use of unintentional sensing through cars, wearable technologies, and the IoT may be one solution for gathering data in ways that are less intrusive and require less effort from citizens in the future. There are difficult challenges associated with data assimilation, but this will clearly remain an area of continued research focus. Hydrological model updating, both offline and in real time, which has often not been possible due to a lack of gauging stations, could play a bigger role in the future given the availability of new data sources, while the development of new methods for handling noisy data will most likely result in significant improvements in meteorological forecasting.
5.2 Data Quality
5.2.1 Background
The uncertain quality of data obtained from crowdsourcing, and whether such data are acceptable for a given use, is one of the primary issues raised by potential users (Foody et al., 2013; Steger et al., 2017; Walker et al., 2016). These users include not only scientists but also natural resource managers, local and regional authorities, communities, and businesses, among others. Given the large quantities of crowdsourced data that are currently available (and will continue to come from crowdsourcing in the future), it is important to document the quality of the data so that users can decide whether the available crowdsourced data are fit for purpose, in the same way that they would judge data from professional sources. Crowdsourced data are subject to the same types of errors as professional data, each of which requires methods for quality assessment. These errors include observational and sampling errors; lack of completeness, for example, only 1% to 2% of Twitter data are currently geotagged (Das & Kim, 2015; Middleton et al., 2014; Morstatter et al., 2013; Palen & Anderson, 2016); and issues related to trust and credibility, for example, for data from social media (Schmierbach & Oeldorf-Hirsch, 2012; Sutton et al., 2008), where information may be deliberately or even unintentionally erroneous, potentially endangering lives when used in a disaster response context (Akhgar et al., 2017). In addition, there are social and political challenges, such as an initial lack of trust in crowdsourced data (Buytaert et al., 2014; McCray, 2006). For governmental organizations, this lack of trust could be driven by fear of having current data collections invalidated, by the need to process overwhelming amounts of data of varying quality (McCray, 2006), or by cultural characteristics that inhibit public participation.
5.2.2 Current Status
From the literature, it is clear that research on finding optimal ways to improve the accuracy of crowdsourced data is taking place in different disciplines within geophysics and beyond, yet there are clear similarities in the approaches used, as outlined in Table 4. Seven distinct types of approaches have been identified, while an eighth relates to the quantification of uncertainty more generally. Typical references that demonstrate these different methods are also provided.
| Methods | Typical references | Key comments |
|---|---|---|
| Comparison with an expert or gold standard data set | Goodchild and Li (2012), Comber et al. (2013), Foody et al. (2013), Kazai et al. (2013), See et al. (2013), Leibovici et al. (2015), Jollymore et al. (2017), Steger et al. (2017), Walker et al. (2016) | • Direct comparison of professionally collected data with crowdsourced data to assess quality using different quantitative metrics |
| Comparison against an alternative source of data | Leibovici et al. (2015), Walker et al. (2016) | • Use of another data set as a proxy for expert data, for example, rainfall from satellites for comparison with crowdsourced rainfall measurements • Model-based validation, that is, validation of crowdsourced data against model outputs |
| Combining multiple observations | Comber et al. (2013), Foody et al. (2013), Kazai et al. (2013), See et al. (2013), Swanson et al. (2016) | • Use of majority voting or another consensus-based method to combine multiple observations of crowdsourced data • Latent class analysis to look at the relative performance of individuals • Use of certainty metrics and bootstrapping to determine the number of volunteers needed to reach a given accuracy |
| Crowdsourced peer review | Goodchild and Li (2012) | • Use of citizens to crowdsource information about the quality of other citizen contributions |
| Automated checking | Leibovici et al. (2015), Walker et al. (2016), Castillo et al. (2011) | • Look for errors in formatting and consistency and assess whether the data are within acceptable limits (numerically or spatially) • Train a classifier to determine the level of credibility of information from Twitter |
| Methods from different disciplines | Leibovici et al. (2015), Walker et al. (2016), Fonte et al. (2017) | • Quality control procedures from the World Meteorological Organization (WMO) • Double mass check • ISO 19157 standard for assessing spatial data quality • Bespoke systems such as the COBWEB quality assurance system |
| Measures of credibility (of information and users) | Castillo et al. (2011), Westerman et al. (2012), Kongthon et al. (2012) | • Credibility measures based on different features, for example, user-based features such as number of followers, message-based features such as length of messages and sentiments, and propagation-based features such as retweets |
| Quantification of uncertainty of data and model predictions | Rieckermann (2016) | • Identify potential sources of uncertainty in crowdsourced data and construct credible measures of uncertainty to improve scientific analysis and practical decision making |

- Note. COBWEB = Citizen Observatory WEB.
The first method in Table 4 involves the comparison of crowdsourced data with data collected by experts or existing authoritative databases; this is referred to as comparison with a gold standard data set. It is also one of seven methods that comprise the Citizen Observatory WEB (COBWEB) quality assurance system (Leibovici et al., 2015). An example is the gold standard data set collected by experts using the Geo-Wiki crowdsourcing system (Fritz et al., 2012). In postprocessing data collected through a Geo-Wiki crowdsourcing campaign, See et al. (2013) showed that volunteers with some background in the topic (i.e., remote sensing or geospatial sciences) outperformed volunteers with no background when classifying land cover, but this difference in performance decreased over time as less experienced volunteers improved. Using this same data set, Comber et al. (2013) employed geographically weighted regression to produce surfaces of crowdsourced reliability statistics for Western and Central Africa. Gold standard data sets have also been used in crowdsourcing via the Amazon Mechanical Turk system to examine various drivers of performance (Kazai et al., 2013), in species identification in East Africa (Steger et al., 2017), in hydrological monitoring (Walker et al., 2016) and water quality monitoring (Jollymore et al., 2017), and to show how rainfall estimates can be improved with CMLs (Pastorek et al., 2017). Although this is clearly one of the most frequently used methods, Goodchild and Li (2012) argue that some authoritative data, for example, topographic databases, may be out of date, so other methods should be used to complement the gold standard approach.
The second category in Table 4 is the comparison of crowdsourced data with alternative sources of data, referred to as model-based validation in the COBWEB system (Leibovici et al., 2015). An illustration of this approach is given by Walker et al. (2016), who examined the correlation and bias between rainfall data collected by the community and satellite-based rainfall and reanalysis products as one of several quality checks. Combining multiple observations at the same location is another approach for improving the quality of crowdsourced data. Reaching consensus at a given location is similar to the idea of replicability, which is a key characteristic of data quality. Crowdsourced data collected at the same location can be combined using a consensus-based approach such as majority voting (Kazai et al., 2013; See et al., 2013), or latent class analysis can be used to determine the relative performance of different individuals from such a data set (Foody et al., 2013). Other methods have been developed for crowdsourced data on species occurrence. In the Snapshot Serengeti project, citizens identified species from more than 1.5 million photographs taken by camera traps. Using bootstrapping and comparison of the accuracy of a subset of the data against a gold standard data set, researchers determined that 90% accuracy could be reached with five volunteers per photograph, rising to 95% with 10 volunteers (Swanson et al., 2016).
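A minimal sketch of consensus by majority voting, together with a crude simulation of how consensus accuracy grows with the number of volunteers per item, is given below. It is in the spirit of, but far simpler than, the bootstrap analysis of Swanson et al. (2016); the labels and the per-volunteer accuracy figure are invented.

```python
import random
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among volunteer classifications."""
    return Counter(labels).most_common(1)[0][0]

def accuracy_vs_crowd_size(p_correct, n_volunteers, true_label="wildebeest",
                           wrong_label="buffalo", trials=10_000, seed=42):
    """Estimate how often a majority of n volunteers recovers the truth,
    assuming each volunteer is independently correct with prob p_correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        votes = [true_label if rng.random() < p_correct else wrong_label
                 for _ in range(n_volunteers)]
        if majority_vote(votes) == true_label:
            hits += 1
    return hits / trials

for n in (1, 5, 10):
    print(f"{n:2d} volunteers -> consensus accuracy "
          f"{accuracy_vs_crowd_size(0.8, n):.3f}")
```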
The fourth category is crowdsourced peer review, or what Goodchild and Li (2012) refer to as the crowdsourcing approach. They argue that the crowd can be used to validate data from individuals and even correct any errors. Trusted individuals in a self-organizing hierarchy may also take on this role of data validation and correction in what Goodchild and Li (2012) refer to as the social approach; examples of such hierarchies of trusted individuals already exist in applications such as OSM and Wikipedia. Automated checking of the data, the fifth category of approaches, can be undertaken in numerous ways and forms part of two different validation routines in the COBWEB system (Leibovici et al., 2015): one that looks for simple errors or mistakes in data entry and a second that carries out further checks based on validity. In the analysis by Walker et al. (2016), the crowdsourced data undergo a number of tests: checks for formatting errors; consistency tests, for example, whether observations are consistent with previous observations recorded over time; and tolerance tests, that is, whether the data lie within acceptable upper and lower limits. Simple checks like these can easily be automated.
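The formatting, consistency, and tolerance tests described above can be caricatured in a few lines of code; the thresholds and the rainfall example below are placeholders, not the values used by Walker et al. (2016).

```python
def qc_report(value_str, previous, lower=0.0, upper=500.0, max_jump=100.0):
    """Run simple automated checks on one crowdsourced rainfall value (mm).

    Returns a list of failed checks; an empty list means the value passed.
    `lower`/`upper` and `max_jump` are illustrative tolerance thresholds.
    """
    failures = []
    try:
        value = float(value_str)              # formatting check
    except ValueError:
        return ["format: not a number"]
    if not (lower <= value <= upper):         # tolerance check
        failures.append(f"tolerance: {value} outside [{lower}, {upper}]")
    if previous is not None and abs(value - previous) > max_jump:
        failures.append("consistency: implausible jump from previous value")
    return failures

print(qc_report("12.5", previous=10.0))    # [] -> passes all checks
print(qc_report("999", previous=10.0))     # fails tolerance (and consistency)
print(qc_report("12,5mm", previous=10.0))  # fails the formatting check
```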
The next method in Table 4 refers to a general set of approaches derived from different disciplines. For example, Walker et al. (2016) use the quality control procedures recommended by the World Meteorological Organization to assess crowdsourced data, many of which also fall under the types of automated approaches available for data quality checking. The World Meteorological Organization also recommends a completeness test, that is, whether there are missing data that may affect further processing of the data, which is clearly context dependent. Another test that is specific to streamflow and rainfall is the double mass check (Walker et al., 2016), whereby cumulative values are compared with those from a nearby station to check for consistency. Within VGI and geography, there are international standards for assessing spatial data quality (ISO 19157), which break quality down into several components, such as positional accuracy, thematic accuracy, and completeness, as outlined in Fonte et al. (2017). In addition, other VGI-specific quality indicators are discussed, such as the quality of the contributors or consideration of the socioeconomics of the areas being mapped. Finally, the COBWEB system described by Leibovici et al. (2015) is another example that has several generic elements, but also some that are specific to VGI, for example, the use of spatial relationships to assess positional accuracy using the mobile device.
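The double mass check mentioned above compares cumulative totals at a candidate gauge against those at a trusted nearby station; a roughly constant ratio suggests consistency, whereas a drifting ratio flags a change in exposure or instrumentation. A minimal sketch with invented numbers:

```python
from itertools import accumulate

def double_mass_ratios(candidate, reference):
    """Ratio of cumulative sums; a drifting ratio flags an inconsistency."""
    cum_c = list(accumulate(candidate))
    cum_r = list(accumulate(reference))
    return [c / r for c, r in zip(cum_c, cum_r)]

# Monthly rainfall totals (mm). The crowdsourced gauge under-reads by
# roughly half from month 5 onward (e.g., the gauge became sheltered).
reference    = [80, 95, 60, 110, 70, 90, 100, 85]
crowdsourced = [78, 97, 58, 112, 36, 44, 52, 41]

for month, ratio in enumerate(double_mass_ratios(crowdsourced, reference), 1):
    print(f"month {month}: cumulative ratio = {ratio:.2f}")
```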
When dealing with data from social media, for example, Twitter, methods have been proposed for determining the credibility (or believability) in the information. Castillo et al. (2011) developed an automated approach for determining the credibility of tweets by testing different message-based (e.g., length of the message), user-based (e.g., number of followers), topic-based (e.g., number and average length of tweets associated with a given topic), and propagation-based (i.e., retweeting) features. Using a supervised classifier, an overall accuracy of 86% was achieved. Westerman et al. (2012) examined the relationship between credibility and the number of followers on Twitter and found an inverted U-shaped pattern, that is, having too few or too many followers decreases credibility, while credibility increased as the gap between the number of followers and the number followed by a given source decreased. Kongthon et al. (2012) applied the measures of Westerman et al. (2012) but found that retweets were a better indicator of credibility than the number of followers. Quantifying these types of relationships can help to determine the quality of information derived from social media. The final approach listed in Table 1 is the quantification of uncertainty, although the methods summarized in Rieckermann (2016) are not specifically focused on crowdsourced data. Instead, the author advocates the importance of reporting a reliable measure of uncertainty, of either observations or predictions of a computer model, to improve scientific analysis, such as parameter estimation, or decision making in practical applications.
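The feature-based classification idea can be sketched as follows; the feature values, labels, and the choice of a random forest (standing in for the supervised classifier of the original study) are illustrative assumptions only.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative feature vectors per tweet, loosely following the feature
# groups above: [message length, follower count, tweets on topic, retweets].
# Values and labels are made up; a real training set would be much larger.
X = [[120, 1500, 40, 12],
     [15, 10, 2, 0],
     [95, 600, 18, 4],
     [20, 25, 1, 0]]
y = [1, 0, 1, 0]  # 1 = credible, 0 = not credible (human-assigned)

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(clf.predict([[80, 900, 25, 5]]))  # predicted credibility of a new tweet
```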
5.2.3 Challenges and Future Directions
Handling concerns over crowdsourced data quality will remain a major challenge for the foreseeable future. Walker et al. (2016) highlight the lack of examples of rigorous validation of crowdsourced data from community-based hydrological monitoring programs. In the area of wildlife ecology, the quality of crowdsourced data varies considerably by species and ecosystem (Steger et al., 2017), while experience with crowd-based visual interpretation of very high resolution satellite imagery shows there is still room for improvement (See et al., 2013). To make progress on this front, more studies are needed that continue to evaluate the quality of crowdsourced data and, in particular, how to improve it, for example, through additional training and the use of stricter protocols, which is closely related to the management of crowdsourcing projects (section 5.1). Quality assurance systems such as those developed in COBWEB may also provide tools that facilitate quality control across multiple disciplines. More of these types of tools will undoubtedly be developed in the near future.
Another concern with crowdsourced data collection is the irregular intervals in time and space at which the data are gathered. To collect continuous records, volunteers must be willing to take measurements at specific locations, for example, at every monitoring station, which may not be possible. Moreover, measurements during extreme events, for example, during a storm, may not be available, as fewer volunteers are willing to undertake such tasks. However, studies show that even incidental and opportunistic observations can be invaluable when regular monitoring at large spatial scales is infeasible (Hochachka et al., 2012).
Another important factor in crowdsourcing environmental data, which is also a requirement for data sharing systems, is data heterogeneity. Granell et al. (2016) highlight two general approaches for homogenizing environmental data: (i) standardization to define common specifications for interfaces, metadata, and data models, which is also discussed briefly in section 5.1, and (ii) mediation to adapt and harmonize heterogeneous interfaces, meta-models, and data models. The authors also call for reusable Specific Enablers in the environmental informatics domain as possible solutions to share and mediate collected data in environmental and geospatial fields. Such Specific Enablers include geo-referenced data collection applications, tagging tools, mediation tools (mediators and harvesters), fusion applications for heterogeneous data sources, event detection and notification, and geospatial services. Moreover, test beds are also important for enabling generic applications of crowdsourcing methods. For instance, regions with good reference data (e.g., dedicated Urban Meteorological Networks) can be used to optimize and validate retrieval algorithms for crowdsourced data. Ideally, these test beds would be available for different climates, so that improved algorithms can subsequently be applied to other regions with similar climates but where there is a lack of good reference data.
5.3 Data Processing
5.3.1 Background
In the 1970s, following a catastrophic flood event that resulted in 145 fatalities and considerable damage, an automated flood detection system consisting of around 20 stream and rain gauges was installed in Boulder County. After that, the Automated Local Evaluation in Real-Time system spread to larger geographical regions with more instrumentation (around 145 stations), and internet access was added in 1998 (Stewart, 1999). Two decades later, we have entered an entirely new era of big data, including novel sources of information such as crowdsourcing. This has necessitated the development of new and innovative data processing methods (Vatsavai et al., 2012). Crowdsourced data, in particular, can be noisy and unstructured, thus requiring specialized methods that turn these data sources into useful information. For example, it can be difficult to find relevant information in a timely manner in the large volumes of data from sources such as Twitter (Barbier et al., 2012; Goolsby, 2009). Because some of these data are collected over space and time, often in large volumes over short periods (Vatsavai et al., 2012), processing methods are also needed that are specifically designed to handle spatial and temporal autocorrelation, as well as spatial scales that can vary considerably between applications, for example, from a single lake to monitoring at the national level. The need to record background environmental conditions along with data observations can also increase data volumes. The next section provides an overview of different processing methods that are being used to handle these new data streams.
5.3.2 Current Status
The different processing methods that have been used with crowdsourced data are summarized in Table 5, along with typical examples from the literature. As crowdsourced data are often unstructured and incomplete, they are typically processed using a range of different methods in a single workflow, from initial filtering (preprocessing) to data mining (postprocessing).
Methods | Typical references | Key comments
---|---|---
Passive crowdsourced data processing methods, for example, Twitter, Flickr | Houston et al. (2015), Barbier et al. (2012), Imran et al. (2015), Granell and Ostermann (2016), Rosser et al. (2017), Cervone et al. (2016), Braud et al. (2014), Le Coz et al. (2016), Tauro and Salvatori (2017) | Methods for acquiring the data (through APIs); methods for filtering the data, for example, natural language processing, stop word removal, filtering for duplication and irrelevant information, feature extraction, and geotagging; processing crowdsourced videos through velocimetry techniques
Web-based technologies | Vitolo et al. (2015) | Use of web services to process environmental big data, that is, SOAP, REST; Web Processing Services (WPS) to create data processing workflows
Spatiotemporal data mining algorithms and geospatial methods | Hochachka et al. (2012), Sun and Mobasheri (2017), Cervone et al. (2016), Granell and Ostermann (2016), Barbier et al. (2012), Imran et al. (2015), Vatsavai et al. (2012) | Spatial autoregressive models, Markov random field classifiers, and mixture models; different soft and hard classifiers; spatial clustering for hotspot analysis
Enhanced tools for data collection | Kim et al. (2013) | New generation of mobile app authoring tools to simplify the technical process, for example, the Sensr system
One increasingly used source of unintentional crowdsourced data is Twitter, particularly in a disaster-related context. Houston et al. (2015) undertook a comprehensive literature review of social media and disasters in order to understand how the data are used and in what phase of the event. Fifteen distinct functions were identified from the literature and described in more detail, for example, sending and receiving requests for help and documenting and learning about an event. Some simple methods mentioned within these different functions included mapping the evolution of tweets over an event, the use of heat maps, and building a Twitter listening tool that can be used to dispatch responders to a person in need. The latter tool requires reasonably sophisticated methods for filtering the data, which are described in detail by Barbier et al. (2012) and Imran et al. (2015). For example, both papers describe different methods for data preprocessing: stop word removal, filtering for duplication and off-topic messages, feature extraction, and geotagging are common techniques for working with Twitter (or other text-based) information. Once the data are preprocessed, a series of other data mining methods can be applied, including a variety of hard and soft clustering techniques, different classification methods, and Markov models. These methods can be used, for example, to categorize the data, detect new events, or examine the evolution of an event over time.
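A minimal sketch of the preprocessing steps just described is given below, assuming a simplified record structure (real Twitter API payloads differ) and a deliberately tiny stop word list.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "at", "of", "and", "to", "in"}  # illustrative subset

def preprocess(tweets):
    """Minimal pipeline: drop duplicates, strip URLs and punctuation,
    remove stop words, and keep only geotagged messages."""
    seen, cleaned = set(), []
    for t in tweets:
        if t["text"] in seen or t.get("coordinates") is None:
            continue  # filter duplicates and messages without a geotag
        seen.add(t["text"])
        text = re.sub(r"https?://\S+", "", t["text"].lower())  # strip URLs
        tokens = [w for w in re.findall(r"[a-z]+", text) if w not in STOP_WORDS]
        cleaned.append({"tokens": tokens, "coordinates": t["coordinates"]})
    return cleaned

tweets = [{"text": "Flooding at the High Street http://t.co/x",
           "coordinates": (51.5, -0.1)}]
print(preprocess(tweets))  # -> tokens ['flooding', 'high', 'street'] with geotag
```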
An example that puts these different methods into practice is provided by Cervone et al. (2016), who show how Twitter can be used to identify hotspots of flooding. The hotspots are then used to task the acquisition of very high resolution satellite imagery from DigitalGlobe. By combining the imagery with other sources of information, such as the road network and the classification of satellite and aerial imagery for flooded areas, it was possible to provide a damage assessment of the transport infrastructure and determine which roads were impassable due to flooding. A different flooding example is described by Rosser et al. (2017), who used another source of social media, that is, geotagged photographs from Flickr. These photographs are used with a very high resolution digital terrain model to create cumulative viewsheds, which are then fused with Landsat images classified for areas of water using a Bayesian probabilistic method to create a map of likely inundation. Even when data are contributed intentionally by citizens and their instruments, the type of data being collected may require additional processing, as is the case for velocity, where velocimetry-based methods are usually applied to videos (Braud et al., 2014; Le Coz et al., 2016; Tauro & Salvatori, 2017).
The review by Granell and Ostermann (2016) also focuses on the area of disasters, but they undertook a comprehensive review of papers that have used any type of VGI (both intentional and unintentional) in a disaster context. Of the processing methods used, they identified six key types: descriptive, explanatory, methodological, inferential, predictive, and causal. Of the 59 papers reviewed, the majority used descriptive and explanatory methods. The authors argue that much of the work in this area is technology or data driven, rather than human or application centric, both of which require more complex analytical methods.
Web-based technologies are increasingly being employed for processing environmental big data, including crowdsourced information (Vitolo et al., 2015), for example, using web services such as SOAP, which sends data encoded in XML, and Representational State Transfer (REST), where resources are addressed by Uniform Resource Identifiers. Data processing is then undertaken through Web Processing Services (WPS), with different frameworks available that can apply existing or bespoke data processing operations (an illustrative request is sketched below). These types of Environmental Virtual Observatories promote the idea of workflows that chain together processes and facilitate scientific reproducibility and traceability. The paper provides an example of an Environmental Virtual Observatory that supports the development of different hydrological models, from ingesting the data to producing maps and graphics of the model outputs, a framework into which crowdsourced data could easily fit (Hill et al., 2011).
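As an illustration of invoking such a service over REST, the sketch below issues a key-value Execute request in the general style of the OGC WPS interface; the endpoint URL, process identifier, and inputs are hypothetical placeholders, not a real service.

```python
import requests  # third-party HTTP library

# Hypothetical WPS endpoint; everything below is a placeholder.
BASE = "https://example.org/wps"

resp = requests.get(BASE, params={
    "service": "WPS",
    "request": "Execute",
    "identifier": "interpolate_rainfall",  # hypothetical process name
    "datainputs": "observations=crowd_rain.csv;method=kriging",
})
resp.raise_for_status()
print(resp.text)  # the service would return a result or status document
```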
Other crowdsourcing projects such as eBird contain millions of bird observations over space and time, which requires methods that can handle nonstationarity in both dimensions. Hochachka et al. (2012) have developed a spatiotemporal exploratory model for species prediction, which integrates randomized mixture models capturing local effects that are then scaled up to larger areas. They have also developed semiparametric approaches to occupancy detection models, which represent the true occupancy status of a species at a given location. Combining standard site occupancy models with boosted regression trees, this semiparametric approach produced better estimates of occupancy probability than traditional models (a much-simplified sketch of the boosted-tree component is given after this paragraph). Vatsavai et al. (2012) also recognize the need for spatiotemporal data mining algorithms for handling big data. They outline three different types of models that could be used for crowdsourced data: spatial autoregressive models, Markov random field classifiers, and mixture models like those used by Hochachka et al. (2012). They then show how different models can be used across a variety of domains in geophysics and informatics, touching upon challenges related to the use of crowdsourced data from social media and mobility applications, including GPS traces and cars as sensors.
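The following sketch illustrates only the boosted-tree component, fitting boosted trees to site covariates and detection records; it omits the occupancy-model machinery that corrects for imperfect detection, and all values are made up.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative site covariates (e.g., habitat index, survey effort) and
# detection records (1 = species detected); values are invented.
X = [[0.2, 3], [0.8, 5], [0.5, 2], [0.9, 6], [0.1, 1], [0.7, 4]]
y = [0, 1, 0, 1, 0, 1]

brt = GradientBoostingClassifier(random_state=0).fit(X, y)
# Predicted probability that the species occupies a new site
print(brt.predict_proba([[0.6, 3]])[0, 1])
```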
When working with GPS traces, other types of data processing methods are needed. Using cycling data from Strava, a website and mobile app that citizens use to upload their cycling and running routes, Sun and Mobasheri (2017) examined exposure to air pollution on cycling journeys in Glasgow. Using a spatial clustering algorithm (AMOEBA, A Multidirectional Optimum Ecotope-Based Algorithm) to display hotspots of cycle journeys, in combination with calculations of instantaneous exposure to particulate matter (PM2.5 and PM10), they were able to show that cycle journeys for noncommuting purposes had less exposure to harmful pollutants than commuting journeys (a simple clustering sketch is given after this paragraph). Finally, there are new methods for simplifying the data collection process through mobile devices. The Sensr system is an example of a new generation of mobile application authoring tools that allows users to build a simple data collection app without requiring any programming skills (Kim et al., 2013). The authors demonstrate how such apps were successfully built for air quality monitoring, documenting illegal dumping in catchments, and detecting invasive species, illustrating the generic nature of such a solution for collecting and processing crowdsourced data.
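A simple density-based clustering of GPS points conveys the flavor of such hotspot analysis; DBSCAN is used here as a stand-in for AMOEBA, and the coordinates and parameters are made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# GPS points from cycling traces (lon, lat); coordinates are invented.
points = np.array([
    [-4.251, 55.864], [-4.252, 55.865], [-4.250, 55.864],  # dense cluster
    [-4.300, 55.900],                                       # isolated point
])

# eps is expressed in degrees here purely for brevity
labels = DBSCAN(eps=0.005, min_samples=3).fit_predict(points)
print(labels)  # points labeled -1 are noise; others share a hotspot id
```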
5.3.3 Challenges and Future Directions
Tulloch (2013) argued that one of the main challenges of crowdsourcing is not the recruitment of participants but rather handling and making sense of the large volumes of data coming from this new information stream. Hence, the challenges associated with processing crowdsourced data are similar to those of big data. Although crowdsourced data may not always be big in terms of volume, they have the potential to be, given the proliferation of mobile phones and social media for capturing videos and images. Crowdsourced data are also heterogeneous in nature and therefore require methods that can handle very noisy data in such a way as to produce useful information for different applications, where the utility for disaster-related applications is clearly evident. Much of the data are georeferenced and temporally dynamic, which requires methods that can handle spatial and temporal autocorrelation or correct for biases in observations in both space and time. In recent years, there have been advances in data mining, in particular in the realm of deep learning (Najafabadi et al., 2015), which should help solve some of these data issues. From the literature, it is clear that much attention is being paid to developing new or modified methods to handle these different data-related challenges, which will undoubtedly dominate much of the future research in this area.
At the same time, we should ensure that the time and effort of volunteers are used optimally. For example, where relevant, the data collected by citizens should be used to train deep learning algorithms, for example, to recognize features in images. Hence, parallel developments should be encouraged, that is, train algorithms to learn what humans can do from the crowdsourced data collected, and use humans for tasks that algorithms cannot yet solve. However, training such algorithms still requires a sufficiently large training data set, which can be quite laborious to generate. Rai et al. (2018) showed how distributed intelligence (Level 2 of Figure 4), recruited using Amazon Mechanical Turk, can be used to generate a large training data set for identifying green storm water infrastructure in Flickr and Instagram images. More widespread use of such tools will be needed to enable rapid processing of large crowdsourced image and video data sets.
5.4 Data Privacy
5.4.1 Background
The guiding principle of privacy protection is to collect as little private data as possible (Mooney et al., 2017). However, advances in information and communication technologies in the late 20th and early 21st centuries have created the technological basis for an unprecedented increase in the types and amounts of data collected, particularly those obtained through crowdsourcing. Furthermore, there is a strong push by various governments to open data for the benefit of society. These developments have also raised many privacy, legal, and ethical issues (Mooney et al., 2017). For example, in addition to participatory (volunteered) crowdsourcing, where individuals provide their own observations and can choose what they want to report, methods for nonvolunteered (opportunistic) harvesting of data from sensors on citizens' mobile phones can raise serious privacy concerns. The main worry is that, without suitable protection mechanisms, mobile phones can be transformed into miniature spies, possibly revealing private information about their owners (Christin et al., 2011). Johnson et al. (2017) argue that for open data, it is the government's role to ensure that methods are in place for the anonymization or aggregation of data to protect privacy, as well as to conduct the necessary privacy, security, and risk assessments. The key concern for individuals is the limited control over personal data, which can open up the possibility of a range of negative or unintended consequences (Bowser et al., 2015).
Despite these potential consequences, there is no commonly accepted definition of privacy. Mitchell and Draper (1983) defined privacy as the right of human beings to decide for themselves which aspects of their lives they wish to reveal to or withhold from others. Christin et al. (2011) focused more narrowly on the issue of information privacy, defining it as the guarantee that participants maintain control over the release of their sensitive information. They go further to include the protection of information that can be inferred both from the sensor readings and from the interactions of users with the participatory sensing system. These privacy issues could be addressed through technological solutions, legal frameworks, and a set of universally acceptable research ethics practices and norms (Table 6).
Methods | Typical references | Key comments
---|---|---
Legal framework | Rak et al. (2012) | Methods from the perspective of the operator, the contributor, and the user of the data product; Creative Commons, General Data Protection Regulation (GDPR), INSPIRE; highlights the risks of accidental or unlawful destruction, loss, alteration, or unauthorized disclosure of personal data
Technological solutions | Christin et al. (2011), Calderoni et al. (2015), Shen et al. (2016) | Methods from the perspective of sensing, transmitting, and processing; Bloom filters; provides tailored sensing and user control of preferences, anonymous task distribution, anonymous and privacy-preserving data reporting, privacy-aware data processing, and access control and audit
Ethics practices and norms | Alexander (2008), Sula (2016) | Places special emphasis on the ethics of social media; involves participants more fully in the research process; no collection of any information that should not be made public; informs participants of their status and provides them with opportunities to correct or remove data about themselves; communicates research broadly through relevant channels
Crowdsourcing activities, which can encompass both VGI and harvested data, also raise a variety of legal issues, from intellectual property to liability, defamation, and privacy (Scassa, 2013). Mooney et al. (2017) argued that these issues are not well understood by all of the actors in VGI. Akhgar et al. (2017) also emphasized legal considerations relating to privacy and data protection, particularly in the application of social media to crisis management. Social media also come with inherent problems of trust and misuse, ethical and legal issues, and the potential for information overload (Andrews, 2017). Finally, in addition to the positive side of social media, Alexander (2008) indicated the need for awareness of their potential for negative developments, such as disseminating rumors, undermining authority, and promoting terrorist acts. The use of crowdsourced data on commercial platforms can also raise issues of data ownership and control (Scassa, 2016). Therefore, licensing conditions for the use of crowdsourced data should be in place that allow sharing of data and protect not only individual privacy but also the data products, services, or applications that are created by crowdsourcing (Groom et al., 2017).
Ethical practices and protocols for researchers and practitioners who collect crowdsourced data are also an important topic for discussion and debate on privacy. Bowser et al. (2017) reported that the attitudes of researchers engaged in crowdsourcing are dominated by an ethic of openness. This, in turn, encourages crowdsourcing volunteers to share their information and focuses attention on the personal and collective benefits that motivate and accompany participation. Ethical norms are often seen as soft law, although the recognition and application of these norms can give rise to enforceable legal obligations (Scassa et al., 2015). The same researchers also state that codes of research ethics serve as a normative framework for the design of research projects and that compliance with research norms can shape how information is collected. These codes influence from whom data are collected, how they are represented and disseminated, how crowdsourcing volunteers are engaged with the project, and where the projects are housed.
5.4.2 Current Status
Judge and Scassa (2010) and Scassa (2013) identified a series of potential legal issues from the perspective of the operator, the contributor, and the user of the data product, service, or application that is created using VGI. However, the scholarly literature is mostly focused on the technology, with little attention given to legal concerns (Cho, 2014). Cho (2014) also identified the lack of a legal framework and governance structure whereby technology, networked governance, and the provision of legal protections may be combined to mitigate liability. Rak et al. (2012) claimed that nontransparent, inconsistent, and producer-proprietary licenses have often been identified as a major barrier to the sharing of data, and a clear need for harmonized geo-licenses is increasingly being recognized. They gave the example of the framework used by the Creative Commons organization, which offers flexible copyright licenses for creative works such as text articles, music, and graphics (http://creativecommons.org). A recent attempt to provide a legal framework for data protection and privacy for citizens is the General Data Protection Regulation (GDPR), as shown in Table 6. The GDPR (http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv:OJ.L_.2016.119.01.0001.01.ENG) particularly highlights the risks of accidental or unlawful destruction, loss, alteration, unauthorized disclosure of, or access to, personal data transmitted, stored, or otherwise processed, which may in particular lead to physical, material, or nonmaterial damage. The GDPR, however, may also pose questions for another EU directive, INSPIRE (http://inspire.ec.europa.eu/), which is designed to create an infrastructure to encourage data interoperability and sharing; the two thus appear to have partly opposing objectives, with the former focusing on privacy and the latter encouraging interoperability and data sharing.
Technological solutions (Table 6) involve the provision of tailored sensing and user control of preferences, anonymous task distribution, anonymous and privacy-preserving data reporting, privacy-aware data processing, and access control and audit (Christin et al., 2011). An example of a technological solution for controlling location sharing and preserving the privacy of crowdsourcing participants is presented by Calderoni et al. (2015). They describe a spatial Bloom filter that allows privacy-preserving location queries by encoding a list of sensitive areas and points located in a geographic region of arbitrary size. This can then be used to detect the presence of a person within a predetermined area of interest, or his/her proximity to points of interest, but not the exact position. Despite technological solutions providing the necessary conditions for preserving privacy, the adoption of location-based services has lagged behind expectations. Fodor and Brem (2015) investigated how privacy influences the adoption of these services. They found that it is not sufficient to analyze user adoption through technology-based constructs alone; privacy concerns, the size of the crowdsourcing organization, and perceived reputation also play a significant role. Shen et al. (2016) also employ a Bloom filter to protect privacy while allowing controlled location sharing in mobile online social networks.
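The membership structure underlying such approaches can be sketched with a basic Bloom filter over discretized location cells; this shows only the core idea, not the full privacy-preserving protocol of the papers above, and the cell identifiers are hypothetical.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over discretized location cells, sketching
    the idea behind spatial Bloom filters (Calderoni et al., 2015)."""
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _indices(self, item):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, cell):
        for idx in self._indices(cell):
            self.bits[idx] = 1

    def query(self, cell):
        # may return false positives, but never reveals the stored cells
        return all(self.bits[idx] for idx in self._indices(cell))

sensitive = BloomFilter()
sensitive.add("grid_52.37_4.90")  # hypothetical cell id of a sensitive area
print(sensitive.query("grid_52.37_4.90"))  # True: within a sensitive area
print(sensitive.query("grid_48.85_2.35"))  # almost surely False
```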
Sula (2016) refers to The Ethics of Fieldwork, which identifies over 30 ethical questions that arise in research, such as the prediction of possible harms, leading questions, and the availability of raw materials to other researchers. Through these questions, he examines ethical issues concerning crowdsourcing and Big Data in the areas of participant selection, invasiveness, informed consent, privacy/anonymity, exploratory research, algorithmic methods, dissemination channels, and data publication. He concludes that Big Data introduces big challenges for research ethics, but that keeping to traditional research ethics should suffice in crowdsourcing projects.
5.4.3 Challenges and Future Directions
The issues of privacy, ethics, and legality in crowdsourcing have not received widespread or in-depth treatment by the research community, and thus these issues are still not well understood. The main challenge going forward is to create a better understanding of privacy, ethics, and legality among all of the actors in crowdsourcing (Mooney et al., 2017). Developing laws that regulate the use of technology, the governance of crowdsourced information, and protection for all involved is undoubtedly a significant challenge for researchers, policy makers, and governments (Cho, 2014). The recent introduction of the GDPR in the EU provides an excellent example of the effort being made in that direction, although it can only be seen as one significant step toward harmonizing the licensing of data and protecting the privacy of people who provide crowdsourced information. Norms from traditional research ethics need to be reexamined by researchers, as they can be built into enforceable legal obligations. Despite advances in solutions for preserving the privacy of volunteers involved in crowdsourcing, technological challenges will remain a significant direction for future research (Christin et al., 2011). For example, the development of new architectures for preserving privacy in typical sensing applications, and of new countermeasures to privacy threats, represents a major technological challenge.
6 Conclusions and Future Directions
This review contributes to knowledge development with regard to what crowdsourcing approaches are applied within seven specific domains of geophysics, and where similarities and differences exist. This was achieved by developing a new approach to categorizing the methods used in the papers reviewed based on whether the data were acquired by citizens and/or by instruments and whether they were obtained in an intentional and/or unintentional manner, resulting in nine different categories of data acquisition methods. The results of the review indicate that methods belonging to these categories have been used to varying degrees in the different domains of geophysics considered. For instance, within the area of natural hazard management, six out of the nine categories have been implemented. In contrast, only three of the categories have been used for the acquisition of ecological data based on the papers selected for review. In addition to the articulation and categorization of different crowdsourcing data acquisition methods in different domains of geophysics, this review also offers insights into the challenges and issues that exist within their practical implementations by considering four issues that cut across different methods and application domains, including crowdsourcing project management, data quality, data processing, and data privacy.
- Crowdsourcing can be considered an important supplementary data source, complementing traditional data collection approaches, while in some developing countries crowdsourcing may even play the role of a traditional measuring network due to the lack of a formally established observation network (Sahithi, 2016). This can take the form of increased spatial and temporal coverage, which is particularly relevant for natural hazard management, for example, for floods and earthquakes. Crowdsourcing methods are expected to develop rapidly in the near future with the aid of continuing developments in information technology, such as smart phones, cameras, and social media, as well as in response to increasing public awareness of environmental issues. In addition, the sensors used for data collection are expected to increase in reliability and stability, as will the methods for processing noisy data coming from these sensors. This in turn will further facilitate the continued development and wider application of crowdsourcing methods in the future.
- Successful applications of crowdsourcing methods should not only rely on developments in information technology but also foster the participation of the general public through active engagement strategies, both in terms of attracting large numbers of participants and in fostering sustained participation. This requires improved cooperation between academics and relevant government departments for outreach activities, awareness raising, and intensive public education to engage a broad and reliable volunteer network for data collection. A successful example of this is the River Chief project in China, where each river is assigned to a few local residents who take ownership and voluntarily monitor the pollution discharged by local manufacturers and businesses (T. Zhang et al., 2016). This project has markedly improved urban water quality, enabled the government to economize on monitoring equipment, and involved citizens in a positive environmental outcome.
- Different types of incentives should be considered as a way of engaging more participants while potentially improving the quality of data collected through various crowdsourcing methods. A small amount of compensation or another type of benefit can significantly enhance the sense of responsibility of participants. However, such engagement strategies should be well designed, either led by government agencies or thoroughly embedded in the process.
- There are already instances where data from crowdsourcing methods fall into the category of Big Data and therefore pose the same data processing challenges. Efficient processing is needed to enable near real-time system operation and management. The development of data processing methods for crowdsourced data is an area to which future attention should be directed, as such methods will be crucial for the successful application of crowdsourcing in the future.
- Data integration and assimilation are an important future direction for improving the quality and usability of crowdsourced data. For example, various crowdsourced data sets can be integrated to enable cross validation, and crowdsourced data can also be assimilated with data from authoritative sensors to enable successful applications, for example, in numerical models and forecasting systems. Such integration and assimilation not only improve confidence in data quality but also enable improved spatiotemporal precision.
- Data privacy is an increasingly critical issue in the implementation of crowdsourcing methods, one that has not been well recognized thus far. To avoid malicious use of the data, complaints, or even lawsuits, it is time for governments and policy makers to develop appropriate laws to regulate the use of technology and the governance of crowdsourced information. This will provide an important basis for the sustainable development of crowdsourcing methods.
- Much of the research reported here falls under proof of concept, which equates to a Technology Readiness Level (TRL) of 3 (Olechowski et al., 2015). However, there are clearly some areas in which crowdsourcing and opportunistic sensing are currently more promising than others and already have higher TRLs. For example, amateur weather stations are already providing data for numerical weather prediction, where the future potential of integrating these additional crowdsourced data with nowcasting systems is immense. Opportunistic sensing of precipitation from CMLs is also an area of intense interest, as evidenced by the growing literature on this topic, while other crowdsourced precipitation applications tend to be much more localized and linked to individual projects. Low-cost air quality sensing is already a growth area with commercial exploitation and high TRLs, driven by smart city applications and the increasing desire to measure personal exposure to pollutants, but the accuracy of these sensors still needs further improvement. In geography, OSM is the most successful example of sustained crowdsourcing; it also allows commercial exploitation due to the open licensing of the data, which contributes to its success. In natural hazard management, OSM and other crowdsourced data are becoming essential sources of information to aid disaster response; beyond the many proof of concept applications and research advances, operational applications are starting to appear and will become mainstream before long. Species identification (and, to a lesser extent, phenology) is the most successful ecological application of crowdsourcing, with a number of successful projects that have been in place for several years. Unlike other areas in the geosciences, there is less commercial potential in the data, and success is instead due to an engaged citizen science community.
- While this paper mainly focuses on the review of crowdsourcing methods applied to the seven areas within geophysics, the techniques, potential issues, as well as future directions derived from this paper can be easily extended to other domains. Meanwhile, many of the issues and challenges faced by the different domains reviewed here are similar, indicating the need for greater multidisciplinary research and sharing of best practices.
Acknowledgments
Feifei Zheng and Tuqiao Zhang are funded by The National Key Research and Development Program of China (2016YFC0400600), The National Natural Science Foundation of China (51708491), National Science and Technology Major Project for Water Pollution Control and Treatment (2017ZX07201004), and the Funds for International Cooperation and Exchange of the National Natural Science Foundation of China (51761145022). Holger Maier would like to acknowledge funding from the Bushfire and Natural Hazards Cooperative Research Centre. Linda See is partly funded by the ENSUF/FFG-funded FloodCitiSense project (860918), the FP7 ERC CrowdLand grant (617754), and the Horizon2020 LandSense project (689812). Thaine H. Assumpção and Ioana Popescu are partly funded by the Horizon 2020 European Union project SCENT (Smart Toolbox for Engaging Citizens into a People-Centric Observation Web), under grant 688930. Some research efforts have been undertaken by Dimitri P. Solomatine in the framework of the WeSenseIt project (EU grant 308429), and grant 17-77-30006 of the Russian Science Foundation. The paper is theoretical and no data are used.