Water science must be Open Science


While much recent attention has focused on Open Access publishing, this is only a small part of Open Science. According to the 2015 FOSTER taxonomy (ref. 19), Open Science integrates Open Access, Open Data, Open Source, and Open Reproducible Research (all of which we will touch on here, see Fig. 1a), while UNESCO and others have extended this further (e.g., ref. 6). Open Data is commonly associated with the ‘FAIR Principles’ (ref. 20), which describe how to make data findable, accessible, interoperable, and reusable. The FAIR principles were introduced in 2016 and provide vital guidance that can be applied irrespective of whether the data itself is strictly open or not (ref. 21). Note that the FAIR principles do not enforce Open Access, i.e., FAIR data is not automatically Open Data. Conversely, Open Data that is neither FAIR nor managed (see Fig. 1b) can easily be useless data. Thus, the combination of Open and FAIR data is extremely important. However, even Open Access publishing combined with Open and FAIR data does not necessarily make the research reproducible and reusable, as discussed further below.

Fig. 1: The many elements of Open Science.

a, Open Science (centre, blue) and the four elements of Open Science pointed out by UNESCO most pertinent to this article (orange-ish circles with text). The remaining elements of Open Science described by UNESCO were removed for space reasons. They are represented in the ‘…’ circle along with the smaller decorative bubbles to show that Open Science covers many facets, big and small (ref. 6). b, FAIR data vs. Open Data vs. Managed Data. Image modified from ref. 36. Managed data means that the data has in some way been collected, stored, organized and maintained. There is a large proportion of managed data that is neither FAIR nor Open, along with a large proportion of unmanaged Open Data. Since both cases are difficult to include in reproducible workflows, scientists and journals alike should be working on expanding the intersection between FAIR and Open Data.


Open Access is the subset of Open Science that includes principles and practices for distributing research outputs online, free of cost or other access barriers (refs. 22, 23). This includes, for instance, Open Access publications (e.g., the dissemination of research as so-called Green, Gold or Diamond Open Access) and the use of preprint servers to access earlier versions of research articles.

Open Data refers to the availability of the data behind the published research, typically hosted in either institutional or domain-specific data repositories (e.g., HydroShare for hydrological data; ref. 24), or generic repositories such as Zenodo or FigShare. For Open Access publications and Open Data, appropriate license conditions should be stipulated, so that the conditions of re-use are clear. Creative Commons (CC) licenses are commonly used, with CC0 (public domain) and CC-BY (re-use with attribution) being the most permissive. Other restrictions on CC licenses can cause problems for downstream use. For instance, the ‘ND’ (no derivatives) clause forbids re-use for derivative works, i.e., any actual re-use other than re-distribution of the original work, while ‘NC’ (non-commercial use only) can prevent commercial companies (e.g., instrument vendors) from integrating Open Data into vendor-provided instrument libraries that could be used by researchers. The ‘SA’ (share-alike) clause can enforce a license on downstream users that they may not be able to comply with, thus preventing integration of Open Data in other open projects (due to incompatible licenses). While Open Data is an important starting point, it is of only limited use without appropriate metadata and sufficient FAIRness to make the data findable, accessible, interoperable and reusable. In the era of ‘big data’, it is now relatively easy to create a quick dump of data, but curation and FAIRification of data require a concerted effort, which may necessitate either incentives (carrot) or mandates (stick). The Global Natural Products Social Molecular Networking (GNPS) ecosystem (ref. 25) is a prime example of incentivising Open Data sharing. Starting primarily as a mass spectral data repository for metabolomics, the developers have consistently added features and functionality over the years to add value to the repository and increase motivation for deposition. For example, MASST (ref. 26) has enabled discovery of the neurotoxin domoic acid and analogues within marine samples and food such as ocean-caught mackerel.
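The effect of the CC license clauses described above can be made concrete with a small Python sketch. It is illustrative only: the function names `restrictive_clauses` and `permits` are hypothetical helpers, not part of any Creative Commons tooling, and the sketch assumes license identifiers written in the common hyphenated form (e.g., `CC-BY-NC-SA`). It does not attempt to model share-alike compatibility between different licenses, which needs case-by-case checking.

```python
# Illustrative sketch only: these helpers are hypothetical and not part of
# any official Creative Commons software.

# Restrictive clauses named in the text and the downstream effect of each.
CLAUSE_EFFECTS = {
    "ND": "no derivatives: only re-distribution of the original work",
    "NC": "non-commercial use only: excludes e.g. vendor-provided libraries",
    "SA": "share-alike: derived works must carry the same license",
}


def restrictive_clauses(license_id: str) -> list[str]:
    """Return the ND/NC/SA clauses found in an identifier such as 'CC-BY-NC-SA'."""
    # Normalise case and separators, then split into tokens (CC, BY, NC, ...).
    tokens = license_id.upper().replace(" ", "-").split("-")
    return [token for token in tokens if token in CLAUSE_EFFECTS]


def permits(license_id: str, use: str) -> bool:
    """Check whether a license allows a 'derivative' or 'commercial' use."""
    blocking_clause = {"derivative": "ND", "commercial": "NC"}
    return blocking_clause[use] not in restrictive_clauses(license_id)


if __name__ == "__main__":
    for lic in ("CC0", "CC-BY", "CC-BY-NC-SA", "CC-BY-ND"):
        clauses = restrictive_clauses(lic)
        print(lic, "->", clauses if clauses else "no restrictive clauses")
        for clause in clauses:
            print("   ", clause, ":", CLAUSE_EFFECTS[clause])
```

A check of this kind could be run over the license field of a dataset's metadata before integrating it into a workflow, so that ND- or NC-licensed data are flagged early rather than discovered at publication time.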

Open Source software and code refer to the public availability of source code (ref. 27), i.e., sets of computer instructions ranging from data processing scripts and algorithms to full-blown numerical models, desktop applications, or even operating systems. The purpose of open source is to provide transparency and, most importantly, re-usability and adaptability of the code, with a common aim of collaborative development. Open Source licenses are designed to explicitly cover code sharing, so they are generally preferred over CC licenses for code, with common examples including GPL, Apache and MIT (ref. 27). Suitable code repositories with version control and issue tracking are indispensable for collaborative open source developments, with common platforms including GitHub, GitLab, Bitbucket and more. For all three above-mentioned aspects of Open Science, i.e., Open Access, Open Data and Open Source, the generation of permanent identifiers such as a Digital Object Identifier (DOI; ref. 28) is an integral aspect of FAIR and vital to preserve the discoverability and lifetime of such projects.

Finally, Open Reproducible Research is a culmination of all three aspects above. With systems such as RMarkdown and Jupyter Notebooks, it is now possible to have fully compilable research outputs and reproducible manuscripts. The Journal of Open Source Software even accepts submissions as GitHub pull requests and compiles the entire submission on their system; one example relevant to water research is patRoon 2.0 (ref. 29). Renku, the ‘open-source knowledge infrastructure for collaborative and reproducible data science’, facilitates traceability and reproducibility of complex workflows involving networks of interconnected code, data and figure files. It does so by automatic provenance tracking of output files and the creation of a version-controlled git repository containing all information, including the computational environment.
