Data Sharing and Scientific Impact in Eddy Covariance Research
Abstract
Do the benefits of data sharing outweigh its perceived costs? This is a critical question, and one with the potential to change culture and behavior. Dai et al. (2018, https://doi.org/10.1002/2017JG004277) examined how data sharing is related to scientific impact in the field of eddy covariance (EC) and found that data sharers are disproportionately high-impact researchers, and vice versa; they also note strong regional differences in EC data-sharing norms. The current policies and restrictions of EC journals and repositories are highly uneven. Incentivizing data sharing and enhancing computational reproducibility are critical next steps for EC, ecology, and science more broadly.
Plain Language Summary
The raw data that scientists generate—typically from experiments and/or observations—have traditionally been kept private, but research funders, journals, and scientists themselves are pushing for change, arguing that taxpayer-funded data should be “open” (available to all) on both moral and practical grounds. Do influential scientists share more, and does data sharing lead to higher impact? This commentary discusses a recent paper examining these issues in a particular field of ecology and earth system science. Current policies and restrictions are highly uneven, and it is critical to provide proper credit for scientists who freely share their data.
Key Points
- Data sharers are disproportionately high-impact researchers, but the causality is uncertain
- Institutional support for data sharing in eddy covariance research is uneven
- Incentivizing data sharing and enhancing computational reproducibility are critical next steps
1 Introduction
As science becomes increasingly collaborative and data-intensive (Adams, 2012), practices with respect to data sharing and archiving are changing quickly. Journals are more frequently specifying and enforcing data access and deposition policies (Nosek et al., 2015) aimed at ensuring scientific reproducibility and transparency. Funding agencies such as the U.S. Department of Energy, National Science Foundation, National Aeronautics and Space Administration, and the U.K. Wellcome Trust increasingly require detailed data management plans, open access to primary data, and use of established repositories, although enforcement of these rules is inconsistent (Borgman, 2012). Finally, growing numbers of scientists are pushing for open science and data on moral and political grounds, arguing that taxpayer-funded research must always be available to the public, that is, the very people who paid for it (Neylon, 2012).
One particular concern is that data that are not deposited or shared will inevitably be lost as postdocs move on, researchers retire, computers crash, or files are simply misplaced. This issue is particularly troublesome for ecosystem and global change ecology, as climate change, disturbance, and land-use changes make ecological data effectively irreproducible: the exact same system state will never recur in the future (Wolkovich et al., 2012). The problem of data loss has been examined empirically: Vines et al. (2014), for example, found that the odds of a data set being available post-publication fell by 17% each year over 20 years. This is consistent with ecology-specific studies finding that only 1–10% of data can be acquired a few years after publication (Reichman et al., 2011; Wolkovich et al., 2012).
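To make the 17% figure concrete, the short sketch below compounds a 17% annual decline in the odds of availability; over 20 years the odds fall to roughly 2.4% of their starting value (0.83 raised to the 20th power). The starting probability of 0.9 is purely an illustrative assumption, not a value reported by Vines et al. (2014), and the function names are hypothetical.

```python
def odds_multiplier(years: int, annual_decline: float = 0.17) -> float:
    """Factor applied to the initial odds after `years` of compounding decline."""
    return (1.0 - annual_decline) ** years


def probability_after(p0: float, years: int, annual_decline: float = 0.17) -> float:
    """Convert a starting probability to odds, apply the decline, convert back."""
    odds0 = p0 / (1.0 - p0)
    odds = odds0 * odds_multiplier(years, annual_decline)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # p0 = 0.9 is an assumed starting availability, chosen only for illustration.
    for years in (0, 5, 10, 20):
        print(years, round(probability_after(0.9, years), 3))
```

Note that the decline applies to odds, not to probability directly, which is why the conversion step is needed before and after compounding.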
There are thus powerful arguments in favor of data sharing and deposition, but also, for many researchers, significant concerns. These include being “scooped,” not receiving sufficient credit and other “adverse use” cases (Fecher et al., 2015), and a perceived lack of time and resources needed to prepare data for publication or deposition (Tenopir et al., 2011). Although one can now easily assign Digital Object Identifiers to data sets and publish descriptive “data papers,” it is more difficult to give proper credit to researchers who, for example, contribute to global databases. Meta-analyses and other synthesis efforts fundamentally rely on such broad collections of published data, and it is thus critical that researchers' efforts are adequately valued and cited (Kueffer et al., 2011).
Do the benefits of data sharing outweigh its perceived costs? This is a critical question, and one with the potential to change culture and behavior (Fecher et al., 2015). Open-access articles that make their data freely available have been found to have higher citation counts (Piwowar et al., 2007). This citation/impact advantage may be less strong in ecological fields (Norris et al., 2008), perhaps because ecological data have historically tended to be small and highly variable in form and have been collected by individual researchers, not teams or networks. However, the emergence of data-sharing networks and public archives in fields such as eddy covariance (EC; e.g., FLUXNET; Baldocchi et al., 2001) and macroecology (e.g., NEON; Schimel et al., 2007) is changing this paradigm (Michener, 2015). Thus, a critical question is to what extent data sharing enables or impedes individual success in these disciplines, as well as advancing science more broadly.
2 Analyzing Data Sharing and Impact in EC
Dai et al. (2018) examined the question of how data sharing is related to scientific impact, arguing that EC researchers in many ways pioneered data sharing in modern ecology and thus the field deserves particular attention. Taking an algorithmic (and thus reproducible) approach to the problem, the authors extracted researchers' names from major EC data portals (FLUXNET, AmeriFlux, AsiaFlux, EFDC, and OzFlux) and classified them according to the degree of data sharing based on site data availability. These data were matched against Web of Science bibliometric data, and the resulting data set was used to compute broader scientific impact, not simply citation counts (Li et al., 2014).
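The name-matching step described above can be sketched as follows. This is not the authors' code: the normalization rule (ASCII surname plus first initial), the "Surname, Given" input layout, and every name in the usage example are assumptions made only for illustration.

```python
import unicodedata


def _to_ascii(text: str) -> str:
    """Strip accents and case so that 'Müller' and 'Muller' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode().strip().lower()


def normalize(name: str) -> str:
    """Reduce 'Surname, Given Names' to a 'surname g' key (assumed rule)."""
    surname, _, given = name.partition(",")
    initial = _to_ascii(given)[:1]
    return f"{_to_ascii(surname)} {initial}".strip()


def match_portal_to_authors(portal_names, author_names):
    """Return portal researchers whose key also appears among paper authors."""
    author_keys = {normalize(a) for a in author_names}
    return [p for p in portal_names if normalize(p) in author_keys]


if __name__ == "__main__":
    # Hypothetical names, not drawn from any real portal or bibliography.
    portal = ["Müller, Anna", "Smith, John"]
    authors = ["Muller, A.", "Lee, K."]
    print(match_portal_to_authors(portal, authors))
```

A key this coarse will merge distinct people who share a surname and initial, which is one reason automated matching of this kind carries uncertainty.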
The analysis by Dai et al. (2018) has a number of specific and interesting findings. First, 44% of EC sites had observation data directly available online. Second, somewhere between 8% (the percentage of paper authors whose names appeared in public data portals) and 64% (the percentage of portal researchers associated with open data sharing) can be regarded as open data sharers. The wide gap between these two estimates comes from uncertainty surrounding authorial roles: if an author publishes an EC paper but their name does not appear as a data sharer in an online portal, did they ever have original data to share (as opposed to being a secondary user)? This is difficult to resolve algorithmically.
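The two bounds differ because they use different denominators, as the sketch below illustrates with sets of researcher identifiers. The function name and the toy identifiers are hypothetical, and the toy numbers are not those of Dai et al. (2018).

```python
def sharer_fractions(paper_authors, portal_researchers, open_sharers):
    """Return (lower, upper) estimates of the fraction of open data sharers.

    lower: paper authors who appear in a data portal, over all paper authors.
    upper: portal researchers linked to open data, over all portal researchers.
    """
    if not paper_authors or not portal_researchers:
        raise ValueError("both populations must be non-empty")
    lower = len(paper_authors & portal_researchers) / len(paper_authors)
    upper = len(open_sharers & portal_researchers) / len(portal_researchers)
    return lower, upper


if __name__ == "__main__":
    # Toy identifiers only: five paper authors, two portal researchers, one open sharer.
    authors = {"a", "b", "c", "d", "e"}
    portal = {"a", "f"}
    open_data = {"a"}
    print(sharer_fractions(authors, portal, open_data))
```

The lower bound counts every paper author as a potential data holder, while the upper bound considers only people already known to hold site data, so neither isolates the population of interest exactly.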
Third, Dai et al. (2018) reported that significant majorities of the most influential top 30 and top 100 EC researchers were data sharers. One third of the identified data sharers were in the top 900 most influential researchers, while only 3% of data sharers were found in the bottom 900; this is a striking association, although it does not by itself establish a causal link between data sharing and impact. Finally, Dai et al. (2018) noted regional differences, with the AsiaFlux and ChinaFlux communities notably less oriented to data sharing than their counterparts in North America and Europe.
3 Context, Questions, and Future Work
One way to provide context for the findings of Dai et al. (2018) is to look at two other critical components of the scientific infrastructure supporting the EC field: journals and the global data-sharing network FLUXNET. To what extent are they supporting, enabling, and/or requiring a culture and expectation of data sharing and openness?
Examining the 1,000 most cited EC articles over the last 30 years, and specifically the data policies of the 20 dominant journals in this list, reveals a wide discrepancy in policies surrounding data and code availability (Figure 1). Nine of these 20 journals “encourage” post-publication data and code sharing, while five “require” it. Only one requires data deposition as a condition of publication. Five have no policy at all, including the journal with both the highest impact factor and second-highest number of EC publications. The leading journals in this field thus present a very mixed picture, and most do not require any pre-publication or post-publication data sharing. This is troubling, given the importance of journal policies as a motivator of researcher data-sharing behavior (Fecher et al., 2015, and references therein).
A different picture emerges when one looks at FLUXNET2015, the latest data release from the global FLUXNET network (Baldocchi et al., 2001). These data cover 1991–2014, depending on site, and about 82% of the data-years are freely available (Figure 2). It is important to note however that many EC data, in particular from Asia, do not appear at all in FLUXNET2015; this is consistent with the overall estimate of Dai et al., noted above, that 44% of EC sites have data directly available online. Nonetheless, this compares favorably with many ecological databases outside of EC. For example, the TRY database (Kattge et al., 2011) documentation notes, somewhat ambiguously, that “more than 50% of the trait records are publicly available (open access)” (https://www.try-db.org/TryWeb/Prop0.php), while U.S. Forest Inventory and Analysis (Woudenberg et al., 2010) spatial data are legally restricted. In contrast, other databases such as the global Fine Root Ecology Database (Iversen et al., 2017) and the Soil Respiration Database (Bond-Lamberty & Thomson, 2010) are entirely open and unencumbered.
The Dai et al. (2018) results emphasize a number of unanswered questions and potential future work. One obvious issue, and a classically ecological one, revolves around causality versus correlation: does data sharing truly increase influence in EC and related fields? Or do influential researchers have the security (in career and reputation) to share more freely? The increasing use of standardized authorial contribution taxonomies (e.g., http://docs.casrai.org/CRediT) that can be parsed by algorithmic tools, such as those used by Dai et al. (2018), should help resolve this question. In addition, a follow-up analysis might look at sharing patterns over time, for example, by reconstructing the degree of sharing practiced by influential researchers early in their careers.
A second and critical issue revolves around computational reproducibility: the sharing of code in addition to data. This is a hard (and often expensive) problem, as computing environments change and become obsolete; common software is seldom cited; hardware and environmental settings are rarely discussed; and well-documented workflows are the exception, not the norm (Stodden et al., 2018). For example, I applaud that Dai et al. (2018) archived the data and source code used for their analysis, both in the journal supporting information and in a GitHub repository, and (critically) provided software version numbers; this puts their level of computational reproducibility far ahead of that of most other studies (Stodden et al., 2018; Thornton et al., 2005). Nonetheless, fully reproducing the Dai et al. (2018) computational environment and workflow would be challenging, and it will become close to impossible as time passes. The use of best practices (Marwick et al., 2017; Wilson et al., 2014) can ameliorate, but not currently solve, this fundamental problem of scientific computing.
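As one small illustration of the version-recording practice mentioned above, the sketch below captures the interpreter, operating system, and package versions alongside an analysis. It uses only the Python standard library; Python is chosen here for illustration and is not necessarily the language used by Dai et al. (2018), and the function name and package list are assumptions.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Collect interpreter, OS, and installed package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # The package list is an assumption; list whatever the analysis imports.
    print(json.dumps(describe_environment(["numpy", "pandas"]), indent=2))
```

Archiving such a record with the data and code documents the environment but does not preserve it, so it mitigates rather than removes the obsolescence problem.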
Broad-scale, fully supported data sharing in EC and ecology more broadly will require institutional support at all levels and a degree of community buy-in and culture shift that has not yet occurred (Hampton et al., 2015). Progress in these areas has been uneven and sporadic (Michener, 2015), but there is no doubt that the tools, infrastructure, and incentives for researchers are all increasing. As many previous authors have noted, it will be critical to incentivize data sharing by formalizing a process that currently credits researchers unpredictably, with acknowledgments, citations, or co-authorship depending on the case. We must also be careful that the cost structures of open access and open data science do not impose new barriers on scientists from less wealthy institutions or regions of the world (Shieber, 2009).
Finally, the technological and cultural changes exemplified by scientific data sharing are ongoing. We should continue to experiment with and support innovative approaches, for example, preregistration of experiments (Wagenmakers & Dutilh, 2016) or fully “open” experiments with real-time data availability (Bond-Lamberty et al., 2016). The payoff from such a successful transition and transformation will be fairer, faster science that enables the study of fundamentally new questions and brings new light to old questions. Documenting and understanding our current behavior (Dai et al., 2018) are fundamental steps along the way.
Acknowledgments
This research was supported by the U.S. Department of Energy, Office of Science, Biological and Environmental Research as part of the Terrestrial Ecosystem Sciences Program. The Pacific Northwest National Laboratory is operated for DOE by the Battelle Memorial Institute under contract DE-AC05-76RL01830.