Volume 55, Issue 7 p. 5202-5211
Commentary
Open Access

Balancing Open Science and Data Privacy in the Water Sciences

Corresponding Author

Samuel C. Zipper

Department of Civil Engineering, University of Victoria, Victoria, British Columbia, Canada

Correspondence to: S. C. Zipper

Kaitlin Stack Whitney

Science, Technology, and Society Department, Rochester Institute of Technology, Rochester, NY, USA

Jillian M. Deines

Department of Earth Systems Science and Center on Food Security and the Environment, Stanford University, Stanford, CA, USA

Kevin M. Befus

Department of Civil and Architectural Engineering, University of Wyoming, Laramie, WY, USA

Udit Bhatia

Civil and Environmental Engineering, Northeastern University, Boston, MA, USA

Department of Civil Engineering, Indian Institute of Technology, Gandhinagar, India

Sam J. Albers

Groundwater, Hydrology and Hydrometric Programs, Ministry of Environment and Climate Change Strategy, British Columbia Provincial Government, Victoria, British Columbia, Canada

Janice Beecher

Institute of Public Utilities, Michigan State University, East Lansing, MI, USA

Christa Brelsford

Oak Ridge National Laboratory, Oak Ridge, TN, USA

Margaret Garcia

School of Sustainable Engineering and the Built Environment, Arizona State University, Tempe, AZ, USA

Tom Gleeson

Department of Civil Engineering, University of Victoria, Victoria, British Columbia, Canada

Frances O'Donnell

Department of Civil Engineering, Auburn University, Auburn, AL, USA

David Resnik

National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, USA

Edella Schlager

School of Government and Public Policy, University of Arizona, Tucson, AZ, USA

First published: 22 May 2019
Citations: 38
This article has been contributed to by US Government employees and their work is in the public domain in the USA.

Abstract

Open science practices such as publishing data and code are transforming water science by enabling synthesis and enhancing reproducibility. However, as research increasingly bridges the physical and social science domains (e.g., socio-hydrology), there is the potential for well-meaning researchers to unintentionally violate the privacy and security of individuals or communities by sharing sensitive information. Here we identify the contexts in which privacy violations are most likely to occur, such as working with high-resolution spatial data (e.g., from remote sensing), consumer data (e.g., from smart meters), and/or digital trace data (e.g., from social media). We also suggest practices for identifying and addressing privacy concerns at the individual, institutional, and disciplinary levels. We strongly advocate that the water science community continue moving toward open science and socio-environmental research and that progress toward these goals be rooted in open and ethical data management.

Key Points

  • Natural scientists have little guidance to deal with privacy concerns for open science, which are inherent in socio-environmental research
  • Hydrology data with potential privacy concerns include high-resolution spatial data, consumer data, and digital trace data
  • Scientists should continue to share data openly while proactively addressing privacy concerns via ethical data management and sharing

1 Emerging and Intersecting Trends

Widespread adoption of open science practices such as sharing data via public repositories advances water science by enabling new types of synthesis-based science and promoting reproducibility (Gil et al., 2016; Munafò et al., 2017; Powers & Hampton, 2018). In the earth sciences, this push is led by the American Geophysical Union's policy to make data and code available for all papers under the Findable, Accessible, Interoperable, and Reusable (FAIR) standards (Stall et al., 2017; Wilkinson et al., 2016). However, gradual adoption of open science practices is converging with two other trends: (i) growth in research investigating the relationships between humans and the water cycle as part of a broader movement of socio-environmental research including socio-hydrology and enhanced collaboration with social scientists (Flint et al., 2017; Konar et al., 2019; Sivapalan et al., 2012; Srinivasan et al., 2017; Wagener et al., 2010); and (ii) exponential growth in computing power and sensor technology allowing data collection and analyses with unprecedented spatial and temporal granularity. These advances are essential for understanding the water cycle of the Anthropocene, and we unequivocally encourage continued progress along these paths within the water science community.

At the intersection of open science, socio-environmental research, and high-resolution data, however, there is an emerging potential to violate the privacy of uninformed and/or nonconsenting individuals and communities (Grossman et al., 2015; Hartter et al., 2013). Researchers have a responsibility to acknowledge and anticipate the risk inherent in open data and accordingly minimize harm to stakeholders potentially impacted by their research. While the natural inclination of many well-meaning researchers (many of the present authors included) is to focus on the societal benefits of data sharing, there are also potential risks arising from unintended applications of open data. These risks can magnify when researchers lack cultural understanding of and sensitivity toward communities to which they do not belong. In some cases, people or companies in positions of power have taken advantage of open data at the expense of the intended beneficiaries of the shared data (Donovan, 2012; Gurstein, 2011; McClean, 2011). For instance, the digitization of land records in Karnataka, India, was promoted as a tool to democratize access to information, but instead allowed wealthy landowners with more financial resources to consolidate power and capitalize on these new data (Donovan, 2012). As seen through the lens of environmental justice, these concerns are particularly acute when working with historically disadvantaged groups such as impoverished communities and indigenous peoples (Brugge & Missaghian, 2006; Christen, 2015; Radin, 2017).

Though data-sharing mandates make exemptions for potentially sensitive data sets, natural scientists are rarely trained in navigating ethical, privacy, and data security issues. Our primary objective here is to highlight potential privacy concerns specific to hydrology at the intersection of open science, socio-environmental research, and high-resolution data, and secondarily to recommend practices for water science researchers interested in adopting open science principles.

2 Sensitive Data

Sensitive data include private or personal information as well as information that, whether in isolation or combined with other data sets, can be linked to specific, nonconsenting individuals or communities. Researchers should be cautious when their research meets the definition of “human subject research”, defined in the United States as including “a living individual about whom a research investigator… obtains data through 1) intervention or interaction with the individual, or 2) identifiable private information” (32 C.F.R. 219.102(f)). However, the definition of “human subject research” focuses on the individual subjects of research, and there may also be situations where concerns arise about data sets dealing with communities, households, or other units. This may include third parties who are not the specific human subjects—for example, family members of study participants (Resnik & Sharp, 2006)—or broader communities even when individual data are protected (Radin, 2017).

We highlight three general categories of data used by the water science community where privacy concerns are most likely to arise: high-resolution spatial data, consumer data, and digital trace data (Figure 1). We argue that harm to individuals or communities will infrequently come directly from the researchers publishing studies on the data themselves but rather from third parties who could use the data for profit, coercion, or regulatory action (Lagos & Polonetsky, 2013); analogously, poachers have used species location data from scientific papers for wildlife trafficking (Lindenmayer & Scheele, 2017).

Figure 1. Examples of data types with potential privacy concerns, and recommended practices. Left: High spatial resolution data as provided by the U.S. Department of Agriculture's CropScape portal for the Cropland Data Layers (Han et al., 2012), which uses satellite data to map agricultural land use and crop types at 30 m resolution. Center: Example high temporal resolution household water use data from a smart meter with annotated information that can be inferred. Right: Example tweet including potentially concerning information (in this case, travel patterns).

2.1 High-Resolution Spatial Data

High-resolution spatial data include satellite data (and derived products), outputs of hydrological models, and other geospatial data sets. Geospatial data are commonly used in the hydrologic sciences, and unmanned aerial vehicles (i.e., drones; Kelleher et al., 2018), traffic/surveillance cameras (Jiang et al., 2019; Leitão et al., 2018), and increasing access to satellite data are likely to make these data less costly to collect and more widely available. Despite not meeting traditional definitions of human subject research, this type of data could be sensitive at the individual and community levels (Rissman et al., 2017). For example, 30% of Iowa farmers surveyed felt that collecting geospatial data on private land was an invasion of privacy (Arbuckle, 2013).

At the individual level, high-resolution spatial data can be used to track or identify private activities. For example, a farmer's operations, finances, and land valuation may be inferred by mapping agricultural practices such as cropping patterns, management, and productivity (Deines et al., 2017, 2019; Kang et al., 2016; Seifert et al., 2018; Zipper et al., 2015, 2017). Similar data sets containing information on illegal or quasi-legal activities such as marijuana cultivation could be used by law enforcement agencies (Bauer et al., 2015; Butsic et al., 2017, 2018). Analysis of wastewater at specific points can reveal information about the activities and health of either individuals or communities via chemical tracers of illegal drugs, prescription medicine, or other biomarkers (Choi et al., 2018; Hall et al., 2012). At the community level, high-resolution hydrological data such as that produced by flood risk studies, for example, following hurricanes (Bin & Landry, 2013) or wildfire (Mueller et al., 2009), can lower property values. Similarly, sharing household level water quality data may negatively impact property values or insurance rates at the individual and neighborhood levels; this was a concern in Flint, Michigan, following the water crisis. Water infrastructure locational data can be sensitive due to the potential for threats to water safety and quality (Copeland & Cody, 2007; van Leuven, 2011). Also potentially concerning are culturally or ecologically sensitive geospatial information, which can lead to resource degradation and harm from ecotourism (Lindenmayer & Scheele, 2017; Lunghi et al., 2019; McCoy, 2017; de Noronha Vaz, 2008).

Given that many geospatial data sets quantify features of the land surface that could be observed by someone on the ground (e.g., land cover and irrigation practices), it is challenging to draw the line between properties of the landscape and private information. The notion of a “reasonable expectation of privacy” for people, a legal standard in the United States and the European Union among other regions, can come into conflict with the growing abundance of high-resolution spatial data, and satellite and aerial image data sets may pose privacy and liability risks to individuals (Craig, 2007). Some courts have ruled on issues that may apply conceptually. For example, the decision in United States of America v. Vargas (2014) held that an individual had an expectation of privacy in and around the front yard of their home and thus that surveillance in this area was a violation of their rights. With similar types of data in a research context, there is no clear-cut answer or deciding body, but legal rights and protections might still apply and the ethical implications remain.

2.2 Consumer Data

Potentially sensitive consumer data include household consumption of water or electricity, or other variables that are of sufficient spatial or temporal resolution to be identified with and provide information about an individual or household (McKenna et al., 2012). While these data often have a spatial component to them, they are distinct from the previous category in that they quantify resource consumption (Helveston, 2015). The potential to monetize consumer information raises issues of data ownership, along with privacy.

Consumer data gaining traction in the water sciences are derived from “smart meters,” which are electricity or water meters that can transmit data back to the utility at hourly or finer temporal resolutions. Smart water meters are relatively less common than smart electricity meters (Cominola et al., 2015) but are potentially valuable for understanding water use, promoting conservation, and managing water supply in urban areas (Britton et al., 2013; Cardell-Oliver et al., 2016). However, data provided by smart meters can also reveal household-level activity, namely, when residents are home and using energy or water (Cole & Stewart, 2013; Molina-Markham et al., 2010; Sankar et al., 2013).

While water-related research is often fairly unintrusive, consumer water data can enable undesired surveillance. Meter data may be used in law enforcement (Douris, 2017) and searched by the police without a warrant (Naperville Smart Meter Awareness v. City of Naperville, 2018), for example, to identify illegal marijuana grow operations (US7402993B2, 2008). Some cities publicize the highest water users during droughts to “name and shame” consumers into conserving water resources (Glionna, 2015; Horwath, 2015). Regardless of its perceived efficacy, this practice violates personal privacy and allowable choice, and it may not be necessary if less individualized tools for shaping consumption behavior (such as pricing and information campaigns) are in place.

2.3 Digital Trace Data

Digital trace data include deliberate online activities (e.g., social media and Web browsing) as well as Web-enabled technologies (e.g., the “Internet of Things”) and can be divided into two groups: passively and actively contributed. Passively contributed data are posted to the internet without intent or knowledge of their potential scientific use (most social media data), while actively contributed data are contributed to a specific project (most crowd-sourced citizen science research). Both types of data have been used for hydrologic research. Examples of studies using passively contributed data include generating long-term water level records from YouTube videos (Michelsen et al., 2016), estimating snowpack from public Web images (Giuliani et al., 2016), and reconstructing crop planting dates from Twitter postings (Zipper, 2018). Examples of studies using actively contributed data include citizen science projects focused on streamflow monitoring (Fienen & Lowry, 2012; Lowry & Fienen, 2013), storm identification (Zhou & Xu, 2017), and flood extent mapping (le Coz et al., 2016; Yu et al., 2016).

Both actively and passively collected data can violate individual or community privacy (Wu, 2013). Data derived from social media present particular challenges. The State of New York recently allowed insurance companies to use social media data to help determine customer premiums (Scism, 2019). While research is permitted within Twitter's terms of service, the lack of comfort and awareness among users highlights both the public's growing unease with researchers using digital trace data, and the fact that individuals often accept user agreements that they do not read or fully understand (Bashir et al., 2015; Editorial Board, 2019). Although social media data are increasingly used in environmental research (Daume, 2016; Zipper, 2018), only 17% of respondents in a recent survey indicated that they were comfortable with their tweets being used without being informed (Fiesler & Proferes, 2018). Moreover, many of these data are contributed by minors, who have additional legal protections and are challenging to screen for.

The ethical responsibility of researchers may thus call for a higher standard than either the letter of law or terms of service to protect individual privacy rights, autonomy, and well-being (Ghermandi & Sinclair, 2019). The good intentions of researchers cannot prevail over the interests of human subjects; even a sense of social purpose should not be used to rationalize circumventing ethical requirements and procedures. Given the pace of technological change, and lagging governmental regulation, self-regulation by the scientific community is needed.

3 Addressing Privacy Concerns With Open and Ethical Data Management

Despite challenges, we do not suggest that data should never be openly shared. Rather, our goal is to encourage water scientists to practice open and ethical data management in which researchers recognize and address privacy and security considerations prior to collecting data and proactively plan for data sharing throughout the research process (Meyer, 2018). Guidance can be drawn from disciplines including medical science, utilities research, computer science, economics, psychology, and law as well as previous work integrating biophysical and social aspects of water science (Flint et al., 2017). Given the diversity of data used across these fields, accepted practices will vary widely (Lupia & Elman, 2014), and we focus on broadly applicable general principles, which may be relevant to the water science community. A recent synthesis proposed a decision tree for biodiversity data, which considers the potential benefits and risks of sharing (Tulloch et al., 2018); we present a similar approach in Figure 2. However, legal constraints vary (Klass & Wilson, 2016), as only 58% of countries currently have data privacy legislation (United Nations, 2018), and researchers should consider their local context.

Figure 2. Potential decision tree that researchers can use to evaluate practices for sharing their data. The yellow boxes indicate actions that researchers should take to protect privacy.

3.1 Institutional and Community Resources

First and foremost, we encourage water scientists to consult available institutional resources. Prior to beginning a study, investigators should evaluate whether it could be classified as human subject research (Figure 2). Institutional review boards in the United States, research ethics committees in the European Union, and their equivalents in other nations and the private sector set requirements for obtaining informed consent from research subjects and stipulations for protecting data confidentiality and privacy (Resnik, 2018). Colleagues can also be an invaluable resource; by collaborating with social scientists with experience navigating these issues, hydrologists can codevelop research topics, methods, and data management plans, which ask and answer socio-environmental questions in an ethical and reproducible manner (Flint et al., 2017). Additional resources found at many institutions include legal counsels, privacy or information officers, research librarians, and research ethicists. Researchers and their institutions may need to enter into agreements to ensure protection of data provided by others, such as meter data collected by utilities. As data privacy and security issues evolve, so will public opinion and regulatory policies about which researchers need to be aware.

There is also a need to think beyond the individual when sharing data that may lead to harm for a group of individuals or a community. Dickert and Sugarman (2005) suggest a community consultation process, which is well suited to the water sciences (Figure 2): (1) Prior to beginning a project, researchers should identify potential risks to individuals and the community; (2) the community being studied should benefit in some way; (3) potentially affected parties should be given opportunity to shape the project; and (4) communities share in the responsibility for the project. These steps require meaningful engagement with the stakeholder community prior to the onset of research to identify potential benefits and harms, which can then be addressed collaboratively. Although desirable, it may be prohibitive to obtain individual consent, in which case this process might be conducted at the community level via consultation with elected representatives, community leaders, and open public meetings, such as town halls, as well as focus groups and opinion surveys. The challenge is to establish community-level authority and rules for decision-making.

Additional concerns arise when affected communities include Indigenous peoples. Sovereign nations often have their own research protocols, which may be more stringent than institutional requirements (Brugge & Missaghian, 2006). Some Indigenous nations or people consider data collected on their land to be tribal property and do not permit these data to be shared openly (Chief et al., 2016). The emerging concept of Indigenous data sovereignty asserts that Indigenous groups have jurisdiction over the collection, ownership, and downstream use of data collected by or about their own peoples or land (Rainie et al., 2017). Thus, a collaborative approach should guide the entire research process (David-Chavez & Gavin, 2018), including discussions about data security, ownership, and sharing (Chief, 2018; Whyte, 2017).

To meet the diverse needs of the water research community, legal frameworks should be informed and supplemented by community, professional, and scientific standards, and vice versa. At the funding stage, many agencies require the submission of data management plans, and these should be required to address potential privacy and security concerns prior to the onset of research. At the publication stage, journals could augment data sharing requirements by requiring a written data privacy and security statement as part of the submission process; similar recommendations have been made by the wildlife research community to deal with inconsistent standards across institutional boundaries (Field et al., 2019). At the archiving stage, community data repositories (such as the Consortium of Universities for the Advancement of Hydrologic Science, Inc. HydroShare portal) can develop data privacy guidelines and require researchers to submit data privacy statements; even the simple step of requiring users to affirm that submitted data are legally allowed and do not contain personally identifiable information can be effective (King, 2007). To further assist early-career scientists, responsible human subjects training should be integrated into graduate programs in the water sciences, and departmental handbooks and protocols should include information about institutional resources to improve both technical and ethical data literacy. As in other areas of ethical training, opportunities or requirements for continuing education should also be provided. Finally, standards should be enforced, and breaches should be penalized.

3.2 Sharing Private Data

Ethical data sharing requires transforming data via aggregation or other means to ensure that they are no longer identifiable at a level that jeopardizes privacy and cannot be “deanonymized” when combined with other data sets (Helveston, 2015; Wu, 2013). All anonymization techniques will inherently cause a loss in the information content and utility of the data (Antonatos et al., 2018). To minimize the effect of this loss and meet FAIR standards, it is critical to also include detailed information about the anonymization procedure via metadata and sharing code, ideally using open-source tools integrating version control for transparency, to allow for interoperability and usability by other researchers (Bakker, 2019; Lowndes et al., 2017; Stagge et al., 2019). When possible, researchers should leave jurisdiction of sensitive data to the agencies responsible for collecting and warehousing these data; where there is no such organization, they should provide synthetic examples of the data so that others can understand and replicate the anonymization procedure.

Spatially identifiable information can be stripped from data prior to publication without compromising reproducibility if the spatial location is not critical to the study. McKenna et al. (2012) suggest, for example, that smart meter data can be used without compromising individual privacy by aggregating data to sufficiently coarse spatial or temporal scales so that individual activities cannot be inferred. Alternately, where the spatial relation among data points is important but absolute geographic coordinates are not, geographic coordinates can be scaled to preserve relative relationships between points (Stack Whitney et al., 2016) or data can be converted to a nonspatial network with mapped relationships between nodes (individuals) and elements (data points; Figure 1). A network perspective can yield insights about characteristics of water systems without revealing information about individual users (Barabási & Albert, 1999; Perelman & Ostfeld, 2011). Where spatial location is critical, aggregation is necessary. For example, urban water use data are often aggregated to the census block or coarser for research purposes (Brelsford & Abbott, 2017; Breyer et al., 2012, 2018). Other high-resolution data providing evidence of water conditions or human activity (including water use, water quality impairment, and illegal activities) may also require aggregation (Hall et al., 2012; Prichard et al., 2014). Aggregation protects individual privacy but limits the ability of researchers to explore fine-scale spatial and behavioral dynamics.
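As a rough illustration of two of the transformations described above, the following Python sketch aggregates synthetic hourly meter readings to daily totals per spatial unit and converts absolute coordinates to relative ones. It is a minimal sketch under stated assumptions, not a method taken from the cited studies: the record layout, the function names, the `min_households` suppression threshold, and the `scale` factor are all hypothetical choices made for this example, and appropriate values would depend on the data set and its local context.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_meter_readings(readings, min_households=5):
    """Sum household readings to daily totals per block.

    readings: iterable of (household_id, block_id, iso_timestamp, litres).
    This record layout is an assumption made for the example.
    Block-days with fewer than min_households contributing households
    are dropped so that a total cannot be traced to a single home.
    """
    totals = defaultdict(float)
    households = defaultdict(set)
    for household_id, block_id, timestamp, litres in readings:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        key = (block_id, day)
        totals[key] += litres
        households[key].add(household_id)
    return {
        key: round(total, 3)
        for key, total in totals.items()
        if len(households[key]) >= min_households
    }


def to_relative_coordinates(points, scale=1.0):
    """Shift points so their centroid is the origin, then scale uniformly.

    Ratios of distances between points are preserved; absolute
    locations are discarded. The scale factor is a hypothetical
    parameter that would not be published with the data.
    """
    if not points:
        return []
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return [((x - cx) * scale, (y - cy) * scale) for x, y in points]


# Synthetic example data (not real households or locations).
readings = [
    ("h1", "block_A", "2019-05-01T06:00:00", 40.0),
    ("h1", "block_A", "2019-05-01T19:00:00", 60.0),
    ("h2", "block_A", "2019-05-01T07:00:00", 80.0),
    ("h3", "block_B", "2019-05-01T07:00:00", 55.0),
]
daily = aggregate_meter_readings(readings, min_households=2)

points = [(500000.0, 4100000.0), (500100.0, 4100000.0), (500000.0, 4100200.0)]
relative = to_relative_coordinates(points, scale=0.5)
```

In this synthetic case, `block_B` is suppressed because only one household contributes to it. Relative coordinates can still be matched to real locations when the point pattern is distinctive, so a transformation like this is best treated as one safeguard among several.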

Digital trace data are particularly challenging to anonymize, since social media platforms such as Twitter are searchable; even if a researcher strips identifying information (such as user names) from the database, data can easily be deanonymized via searching for the text or observing network structure (Ayers et al., 2018). In most studies, data at the individual level are unnecessary, since researchers are primarily interested in population-level statistics, and derived statistics can be extracted from the data set and shared without the accompanying raw data. Even more directly, the metric quantified from each piece of digital trace data could be shared. For example, a study using tweets to study the timing of irrigation could share the date, county, and crop-type mentioned without sharing the specific field-level geolocation or raw tweet text.
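The irrigation example above could be sketched in Python as follows. This is a hedged illustration only: the input fields (`user`, `text`, `created_at`, `lat`, `lon`, `county`, `crop`) are a hypothetical schema for records a researcher has already classified, not the format of any platform's API, and the function names are invented for this example.

```python
from collections import Counter


def derive_shareable_records(raw_posts):
    """Keep only date, county, and crop type from each classified post.

    User names, raw text, and point coordinates are never copied into
    the output. Records are sorted so that their order does not mirror
    the order of the raw database.
    """
    records = [
        {
            "date": post["created_at"][:10],  # assumes ISO 8601 timestamps
            "county": post["county"],
            "crop": post["crop"],
        }
        for post in raw_posts
    ]
    return sorted(records, key=lambda r: (r["date"], r["county"], r["crop"]))


def count_by_county_and_crop(records):
    """Population-level summary: number of posts per (county, crop)."""
    return Counter((r["county"], r["crop"]) for r in records)


# Synthetic posts (hypothetical users, text, and locations).
raw_posts = [
    {"user": "@user_b", "text": "Pivots running on the beans today",
     "created_at": "2018-06-03T09:30:00", "lat": 38.1, "lon": -100.9,
     "county": "county_B", "crop": "soybean"},
    {"user": "@user_a", "text": "First irrigation of the corn",
     "created_at": "2018-06-02T14:05:00", "lat": 37.9, "lon": -100.7,
     "county": "county_A", "crop": "corn"},
    {"user": "@user_c", "text": "Watering corn already",
     "created_at": "2018-06-02T18:40:00", "lat": 37.8, "lon": -100.6,
     "county": "county_A", "crop": "corn"},
]

shareable = derive_shareable_records(raw_posts)
summary = count_by_county_and_crop(shareable)
```

Counties or dates represented by very few posts may still be identifying when combined with other data sets, so further aggregation or suppression of small groups may be warranted before release.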

4 Conclusions

Increased adoption of open science principles and availability of high-resolution data are transforming socio-environmental and socio-hydrological science for the better. At the convergence of these trends are emerging challenges related to ensuring reproducibility without inadvertently causing harm to individuals or communities. As new data sources and interdisciplinary research continue to grow, self-reflection as a community is necessary to ensure that privacy and security are dealt with proactively to maintain trust in the hydrologic sciences among all stakeholders and the public we serve.

Acknowledgments

This paper arose from a workshop at the Santa Fe Institute that was supported by the National Science Foundation under grant 1735884. No data were used in this manuscript but if any were, we definitely would have openly published them!