Making the sum greater than the parts – FAIR Data

Jaap Heringa

Obtaining access to the right data is a first, essential step in any Data Science endeavour. But what makes the data “right”?

The difference in datasets

Every dataset can be different, not only in terms of content, but in how the data is collected, structured and displayed. For example, how national image archives store and annotate their data is not necessarily how meteorologists store their weather data, nor how forensic experts store information on potential suspects. Problems arise when researchers from one field need to use a dataset from a different field: these disparities hinder the re-use of (multiple) datasets in new contexts.

The FAIR data principles provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. The emphasis is placed on the ability of computational systems to find, access, interoperate, and reuse data with no or minimal human intervention. Launched at a Lorentz workshop in Leiden in 2014, the principles were quickly endorsed and adopted by a broad range of stakeholders (e.g. European Commission, G7, G20) and have been cited widely since their publication in 2016 [1]. The FAIR principles are agnostic of any specific technological implementation, which has contributed to this broad uptake.
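To make "no or minimal human intervention" concrete, here is a minimal sketch of a machine-readable dataset description that a program can check. The `DatasetRecord` class, its field names, the `missing_fields` helper, and the example identifier and URL are all hypothetical illustrations; the FAIR principles themselves do not prescribe any particular schema or technology.

```python
from dataclasses import dataclass


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    identifier: str   # persistent identifier, so the dataset can be found
    access_url: str   # where a standard protocol can retrieve the data
    media_type: str   # an open, well-described format, for interoperability
    license: str      # explicit terms that tell others how they may reuse it


def missing_fields(record: DatasetRecord) -> list[str]:
    """Return the names of fields that were left empty."""
    return [name for name, value in vars(record).items() if not value.strip()]


# Made-up example record with the license left blank.
record = DatasetRecord(
    identifier="doi:10.0000/example",
    access_url="https://example.org/data/weather.csv",
    media_type="text/csv",
    license="",
)

print(missing_fields(record))
```

A check like this can run automatically over a whole catalogue, which is the kind of task that becomes possible once descriptions are structured rather than free text.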

Why do we need datasets that can be used in new contexts?

Ensuring that data sources can be (re)used in many different contexts can lead to unexpected results. For example, combining data on depression with weather data can establish a correlation between mental states and weather conditions. The original data resources were not created with this reuse in mind; applying FAIR principles to these datasets, however, makes this analysis possible.
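The kind of reuse described here can be sketched in a few lines, assuming both datasets expose a shared, consistently formatted key (here an ISO date). The variable names and all values below are invented for illustration and carry no scientific meaning; the point is only that a common key makes the join trivial.

```python
from math import sqrt

# Hypothetical daily values from two unrelated sources, keyed by ISO date.
mood_by_date = {
    "2020-01-01": 4.0, "2020-01-02": 6.0, "2020-01-03": 5.0, "2020-01-04": 8.0,
}
sun_hours_by_date = {
    "2020-01-02": 3.0, "2020-01-03": 2.0, "2020-01-04": 6.0, "2020-01-05": 1.0,
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Only dates present in both datasets can be compared.
shared_dates = sorted(mood_by_date.keys() & sun_hours_by_date.keys())
r = pearson(
    [mood_by_date[d] for d in shared_dates],
    [sun_hours_by_date[d] for d in shared_dates],
)
print(shared_dates, round(r, 3))
```

If the two sources had used different date formats or undocumented identifiers, the intersection step would silently return nothing, which is the practical cost of non-interoperable data.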

FAIRness in the current crisis

A pressing example of the importance of FAIR data is the current COVID-19 pandemic. Many patients worldwide have been admitted to hospitals and intensive care units. While global efforts are moving towards effective treatments and a COVID-19 vaccine, there is still an urgent need to combine all the available data. This includes information from distributed multimodal patient datasets that are stored at local hospitals in many different, and often unstructured, formats.

Learning about the disease and its stages, and which drugs may or may not be effective, requires combining many data resources, including SARS-CoV-2 genomics data, relevant scientific literature, imaging data, and various biomedical and molecular data repositories.

One of the issues that needs to be addressed is combining privacy-sensitive patient information with open viral data at the patient level, where these datasets typically reside in very different repositories (often hospital bound) without easily mappable identifiers. This underscores the need for federated and local data solutions, which lie at the heart of the FAIR principles.
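As a rough illustration of the federated idea, the sketch below lets each site compute a small summary locally, so that only aggregates, not patient-level records, leave the hospital. The function names and numbers are hypothetical, and this is not a description of how any specific initiative works; real deployments need further safeguards, since aggregates over very small groups can still reveal information about individuals.

```python
def local_summary(values):
    """Runs inside one site: reduce raw records to a count and a sum."""
    return {"n": len(values), "total": sum(values)}


def pooled_mean(summaries):
    """Runs at the coordinator: combine per-site summaries into one mean."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / n


# Made-up per-patient measurements that never leave their own site.
hospital_a = [3, 5, 4]
hospital_b = [10, 6]

summaries = [local_summary(hospital_a), local_summary(hospital_b)]
print(pooled_mean(summaries))
```

The coordinator obtains the same mean it would get from pooling all records, without ever seeing an individual value.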

Examples of concerted efforts to build an infrastructure of FAIR data to combat COVID-19 and future virus outbreaks include the VODAN initiative [2] and the COVID-19 data portal organised by the European Bioinformatics Institute and the ELIXIR network [3].

FAIR data in Amsterdam

Many scientific and commercial applications require the combination of multiple sources of data for analysis. While a digital infrastructure and (financial) incentives are required for data owners to share their data, we will only be able to unlock the full potential of existing data archives when we are also able to find the datasets needed and use the data within them.

The FAIR data principles allow us to better describe individual datasets and allow easier re-use in many diverse applications beyond the sciences for which they were originally developed. Amsterdam provides fertile ground for finding partners with appropriate expertise for developing both digital and hardware infrastructures.

References

[1] M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, … & B. Mons. The FAIR guiding principles for scientific data management and stewardship. Scientific Data 3 (2016), Article No. 160018. doi: 10.1038/sdata.2016.18
[2] https://www.go-fair.org/implementation-networks/overview/vodan/
[3] https://www.covid19dataportal.org/
