Building Your Data Structure: FAIR Data
Obtaining access to the right data is a first, essential step in any Data Science endeavour. But what makes the data “right”?
Differences between datasets
Datasets differ not only in content, but in how the data is collected, structured and displayed. For example, how national image archives store and annotate their data is not necessarily how meteorologists store their weather data, nor how forensic experts store information on potential suspects. Problems arise when researchers from one field need to use a dataset from another. This disparity between datasets hinders their re-use, alone or in combination, in new contexts.
The FAIR data principles provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. The emphasis is placed on the ability of computational systems to find, access, interoperate, and reuse data with no or minimal human intervention. Launched at a Lorentz workshop in Leiden in 2014, the principles were quickly endorsed and adopted by a broad range of stakeholders (e.g. European Commission, G7, G20) and have been cited widely since their publication in 2016 [1]. The FAIR principles are agnostic of any specific technological implementation, which has contributed to their broad adoption and endorsement.
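To make the idea of machine-actionable metadata concrete, here is a minimal sketch of a dataset description in the widely used schema.org "Dataset" vocabulary. Every field value (the name, DOI, URLs) is invented for illustration; real FAIR implementations would choose their own vocabularies and identifiers, since the principles do not prescribe a technology.

```python
import json

# A hypothetical metadata record, loosely following schema.org's "Dataset"
# vocabulary. Rich, machine-readable metadata like this is what lets
# software find and assess a dataset without human intervention.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example weather observations",           # Findable: descriptive metadata
    "identifier": "https://doi.org/10.1234/example",  # Findable: persistent ID (invented DOI)
    "license": "https://creativecommons.org/licenses/by/4.0/",  # Reusable: clear licence
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "https://example.org/weather.csv",  # Accessible: standard protocol
        "encodingFormat": "text/csv",                     # Interoperable: open format
    },
}

# Serialised as JSON-LD, the record can be indexed by crawlers and catalogues.
print(json.dumps(metadata, indent=2))
```

A search engine or data catalogue that understands this vocabulary can index the record automatically, which is exactly the "minimal human intervention" the principles aim for.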
Why do we need datasets that can be used in new contexts?
Ensuring that data sources can be (re)used in many different contexts can lead to unexpected results. For example, combining mental health data with weather data can establish a correlation between mental states and weather conditions. The original data resources were not created with this reuse in mind; applying the FAIR principles to these datasets is what makes such an analysis possible.
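The key enabler for this kind of cross-domain reuse is that both datasets expose a shared, unambiguous key. The toy sketch below joins two invented datasets (daily mood scores and sunshine hours) on an ISO date; all values are hypothetical and only illustrate the mechanics of combining independently collected data.

```python
from datetime import date

# Hypothetical records: neither dataset was created with the other in mind,
# but both key their observations on an unambiguous date, so they can be
# combined. All values are invented for illustration.
mood_scores = {          # e.g. from a mental-health survey
    date(2020, 3, 1): 6.1,
    date(2020, 3, 2): 5.4,
    date(2020, 3, 3): 6.8,
}
sunshine_hours = {       # e.g. from a meteorological archive
    date(2020, 3, 1): 7.5,
    date(2020, 3, 2): 1.2,
    date(2020, 3, 3): 9.0,
}

# Inner join on the shared key: only dates present in both datasets survive,
# yielding (mood, sunshine) pairs ready for correlation analysis.
combined = {
    d: (mood_scores[d], sunshine_hours[d])
    for d in mood_scores.keys() & sunshine_hours.keys()
}
print(combined)
```

Without interoperable identifiers (dates here, but equally patient IDs or geographic codes), this join would require error-prone manual matching.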
FAIRness in the current crisis
A pressing example of the importance of FAIR data is the current COVID-19 pandemic. Many patients worldwide have been admitted to hospitals and intensive care units. While global efforts are moving towards effective treatments and a COVID-19 vaccine, there is still an urgent need to combine all the available data. This includes information from distributed multimodal patient datasets that are stored at local hospitals in many different, and often unstructured, formats.
Learning about the disease and its stages, and which drugs may or may not be effective, requires combining many data resources, including SARS-CoV-2 genomics data, relevant scientific literature, imaging data, and various biomedical and molecular data repositories.
One of the issues that needs to be addressed is combining privacy-sensitive patient information with open viral data at the patient level, where these datasets typically reside in very different repositories (often hospital bound) without easily mappable identifiers. This underscores the need for federated and local data solutions, which lie at the heart of the FAIR principles.
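A federated setup can be sketched in a few lines: each site keeps its privacy-sensitive records local and answers queries only with aggregates. The hospitals, patients, and variant labels below are all hypothetical; this is a toy illustration of the pattern, not any specific VODAN or hospital system.

```python
# Toy sketch of federated analysis: raw patient records stay at each
# "hospital", and only aggregate counts leave the site. All data invented.
hospital_a_records = [
    {"patient_id": "A-001", "variant": "B.1.1.7", "icu": True},
    {"patient_id": "A-002", "variant": "B.1.1.7", "icu": False},
]
hospital_b_records = [
    {"patient_id": "B-001", "variant": "B.1.351", "icu": True},
]

def local_icu_count(records, variant):
    """Runs inside the hospital: individual rows never leave this function."""
    return sum(1 for r in records if r["variant"] == variant and r["icu"])

# The central analysis sees only the aggregated answer from each site.
total = (local_icu_count(hospital_a_records, "B.1.1.7")
         + local_icu_count(hospital_b_records, "B.1.1.7"))
print(total)
```

Bringing the question to the data, rather than the data to the question, is the design choice that lets privacy-sensitive patient information contribute to analyses alongside open viral data.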
Examples of concerted efforts to build an infrastructure of FAIR data to combat COVID-19 and future virus outbreaks include the VODAN initiative [2] and the COVID-19 data portal organised by the European Bioinformatics Institute and the ELIXIR network [3].
FAIR data in Amsterdam
Many scientific and commercial applications require the combination of multiple sources of data for analysis. While a digital infrastructure and (financial) incentives are required for data owners to share their data, we will only unlock the full potential of existing data archives when we can also find the datasets we need and use the data within them.
The FAIR data principles let us describe individual datasets more precisely and ease their re-use in diverse applications beyond the sciences for which they were originally developed. Amsterdam provides fertile ground for finding partners with appropriate expertise for developing both digital and hardware infrastructures.
Author: Jaap Heringa
Source: Amsterdam Data Science