Making the sum greater than the parts – FAIR Data

The difference in datasets

Every dataset can be different, not only in terms of content, but in how the data is collected, structured and displayed. For example, how national image archives store and annotate their data is not necessarily how meteorologists store their weather data, nor how forensic experts store information on potential suspects. The problem occurs when researchers from one field need to use a dataset from a different field. The disparity in datasets is not conducive to the re-use of (multiple) datasets in new contexts. The FAIR data principles provide guidelines to improve the Findability, Accessibility, Interoperability, and Reuse of digital assets. The emphasis is placed on the ability of computational systems to find, access, interoperate, and reuse data with no or minimal human intervention. Launched at a Lorentz workshop in Leiden in 2014, the principles quickly became endorsed and adopted by a broad range of stakeholders (e.g. European Commission, G7, G20) and have been cited widely since their publication in 2016 [1]. The FAIR principles are agnostic of any specific technological implementation, which has contributed to their broad adoption and endorsement.

Why do we need datasets that can be used in new contexts?

Ensuring that data sources can be (re)used in many different contexts can lead to unexpected results. For example, combining mental depression data with weather data can establish a correlation between mental states and weather conditions. The original data resources were not created with this reuse in mind, however, applying FAIR principles to these datasets makes this analysis possible.

FAIRness in the current crisis

A pressing example of the importance of FAIR data is the current COVID-19 pandemic. Many patients worldwide have been admitted to hospitals and intensive care units. While global efforts are moving towards effective treatments and a COVID-19 vaccine, there is still an urgent need to combine all the available data. This includes information from distributed multimodal patient datasets that are stored at local hospitals in many different, and often unstructured, formats. Learning about the disease and its stages, and which drugs may or may not be effective, requires combining many data resources, including SARS-CoV-2 genomics data, relevant scientific literature, imaging data, and various biomedical and molecular data repositories. One of the issues that needs to be addressed is combining privacy-sensitive patient information with open viral data at the patient level, where these datasets typically reside in very different repositories (often hospital bound) without easily mappable identifiers. This underscores the need for federated and local data solutions, which lie at the heart of the FAIR principles. Examples of concerted efforts to build an infrastructure of FAIR data to combat COVID-19 and future virus outbreaks are in the VODAN initiative [2], the COVID-19 data portal organised by the European Bioinformatics Institute and the ELIXIR network [3].

FAIR data in Amsterdam

Many scientific and commercial applications require the combination of multiple sources of data for analysis. While providing a digital infrastructure and (financial) incentives are required for data owners to share their data, we will only be able to unlock the full potential of existing data archives when we are also able to find the datasets needed and use the data within them. The FAIR data principles allow us to better describe individual datasets and allow easier re-use in many diverse applications beyond the sciences for which they were originally developed. Amsterdam provides fertile ground for finding partners with appropriate expertise for developing both digital and hardware infrastructures.


  1. M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg, G. Appleton, M. Axton, A. Baak, … & B. Mons. The FAIR guiding principles for scientific data management and stewardship. Scientific Data 3(2016), Article No.160018. doi: 10.1038/sdata.2016.18

A Practitioner’s Perspective on Fairness in AI

The use of machine learning models to support decision making can have harmful consequences on individuals, particularly, when such models are used in critical applications in the social domain such as justice, health, education and resource allocation (e.g. welfare). These harms can be caused by different types of biases that can originate at any point of the algorithmic pipeline. In order to prevent discrimination and algorithmically reinforced bias, the Fair AI community has been developing best practices and new methods to ‘fix the bias’ in the data and therefore ensure fair outcomes.

Fair AI: an active community

We encounter many “AI gone wrong” narratives in critical domains, ranging from hospital triage algorithms systematically discriminating against minorities¹, to welfare fraud detection algorithms that disproportionately fail among sub-populations². In response to these issues, fairness-driven solutions are emerging, for example, by deriving new algorithms that ensure fairness, or by identifying conditions under which certain fairness constraints can be guaranteed. New frameworks are also being developed to draw attention to best practices on how to avoid bias during the modelling process³. There are also a number of user-friendly tools⁴ that allow for testing a system against possible bias. Once bias is detected, it can be mitigated by either modifying the training data, or by constraining the inner-workings of machine learning algorithms to satisfy a given fairness definition, for example, ensuring that two groups get the same outcome distributions⁵.

Ensuring Fair AI in practice is challenging

In our experience as practicing data scientists, the existing tools that deal with bias and ensuring fairness in machine learning suffer from an important limitation⁶: their validity and performance has been tested under unrealistic scenarios. Typically, the validity and performance of these solutions are tested on publicly available datasets such as COMPAS or the German credit datasets⁷ where the sources of bias are easily identifiable since they are linked to correlations with sensitive attributes such as gender, ethnicity, or age, which are available in the data sets. This means that the data can be corrected or the outcomes of a given model can be altered to fix the bias that originates from these correlations. In practice, sensitive attributes cannot be included directly in the modeling process, for legal or for confidentiality reasons, but other features that correlate with them might be used in the model, leading to indirect discrimination. These correlations can be unknown at the time of development, especially when sensitive attributes are missing. It is therefore challenging and sometimes impossible to apply fairness corrections in real-world models. Another challenge practitioners need to shoulder is that testing whether an algorithm is fair or not requires the knowledge of a desirable ground truth against which one can evaluate fairness of the outcomes. Furthermore, as practitioners, we need to acknowledge the broader context of high-impact decision making systems and for -ultimately- beneficial fairness interventions, we need to investigate the long-term impacts of these systems.

The way forward

So what can we do to make fair AI achievable in practice? Mobilise the AI community to provide more realistic datasets. To achieve this, one approach is to facilitate (confidentiality-preserving) data exchange between industry and academia. Making relevant and realistic data sets available to the fair AI community will not only open new research avenues, it will also help create tangible solutions to fairness issues that can be used by practitioners, and consequently help solve the bias and discrimination issues. Encourage tackling some of the real-world challenges. Consider bias bounties⁸ where developers are offered monetary incentives to test the systems for bias in a similar way to the bug bounties offered in security software. Embrace inclusive communication between stakeholders and AI developers. Interpreting the real-world impact of an AI-driven solution can be challenging and requires open collaboration between the technical and functional teams throughout the AI lifecycle. We are undoubtedly living through crucial times in terms of developments in the fair AI field. In order to foster more fairness-enabling practices and to facilitate bridging the gap between theory and practice, we invite all interested readers to share their experiences, ideas and suggestions with us on this dedicated slack channel using the link below: We are building an inclusive community, whether you work for a company, startup or a university, whether you are a concerned citizen, policy or law expert, or a student interested in Fair AI, join us!


  6. Note that we are not addressing here the more general issue of fixing an identified source of bias algorithmically, instead of addressing the actual societal issues leading to the bias, as we want to focus here on the province of the data scientists, in the real world.
  7. Popular fairness-datasets: