Artificial intelligence, Replicability, Reproducibility

Estimating Reproducibility of AI Research

  • Iris Kisjes
Being able to reproduce research is a key aspect of creating knowledge. If a study can be reproduced by another lab, the validity of its findings is confirmed. This is particularly important in AI research, given the open questions around explainable and trustworthy AI.

There are a number of different ways to refer to reproducibility. In this piece we are referring to replicability, using the standard ACM definition: research that reuses the data and/or analysis of a study in the hope of getting the same results. Replication is considered a cornerstone of wider scientific reproducibility, in which multiple different studies should attain the same results. Concerns have been raised about the ease of replicating the results of AI papers. Making replication easier will help to raise public confidence in the field.

Why is the replication of research reported in a paper so difficult?

Many factors affect the replicability of AI research. Replicating a study is difficult because it requires access to, and operational knowledge of, software, data, specific settings (hyperparameters) and more. Another contributing factor is not technical but social: there is little incentive to spend time replicating someone else’s research. Researchers get no scientific credit for it, and it costs time and resources without contributing to their next promotion.

While making the process of replication easier would be ideal, some indication of how replicable a published study is, without actually having to replicate it, would already help. To this end, various initiatives within the AI and wider computer science community are exploring how AI studies can be made more replicable.

Determining the replicability of AI

During his recent talks at the AAAI Reproducibility of AI workshop and at Evaluating Evaluation of AI Systems (Meta-Eval 2020), O. E. Gundersen provided a high-level overview of the methods used in AI research (Table 1).

Table 1

Furthermore, Gundersen discussed the current state of replicability of AI research. One of his experiments investigated whether the presence of specific sections in a research paper could be used as a proxy for its replicability, by classifying certain information about the methods, the experiments and the data (see Table 2).

Table 2
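The idea of using documentation as a proxy can be illustrated with a minimal sketch. It assumes a paper is described by three yes/no flags (method, data and experiment documented). The label R3 for papers that document only the method follows this piece; the mappings used here for R1 (method, data and experiment) and R2 (method and data) are assumptions, and the names `PaperDocumentation` and `documentation_level` are hypothetical. It is not the scoring procedure used in the study itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaperDocumentation:
    """Hypothetical record of what a paper documents."""
    method: bool
    data: bool
    experiment: bool


def documentation_level(doc: PaperDocumentation) -> Optional[str]:
    """Return an illustrative documentation level, or None if no method is described.

    Assumed mapping (only R3 = method only is stated in the article):
      R1: method, data and experiment documented
      R2: method and data documented
      R3: method documented, data missing
    """
    if not doc.method:
        return None
    if doc.data and doc.experiment:
        return "R1"
    if doc.data:
        return "R2"
    # Method without data: treated as R3 here, even if the experiment is described.
    return "R3"


if __name__ == "__main__":
    example = PaperDocumentation(method=True, data=True, experiment=False)
    print(documentation_level(example))
```

A real assessment would rely on many finer-grained variables per category rather than three flags; the sketch only shows how the presence of documentation can be turned into a coarse label.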

O. E. Gundersen and Sigbjørn Kjensmo sampled 400 papers, 100 from each of AAAI 2014, AAAI 2016, IJCAI 2013 and IJCAI 2016, and assessed the documentation provided by each study (see Figure 1).

Figure 1

Table 3

O. E. Gundersen and Sigbjørn Kjensmo then sampled 30 of the most highly cited of these 400 papers, based on Scopus data. After filtering out those containing only a method section (R3 papers, see Table 3), they tried to replicate the remaining 22 papers, spending a maximum of 40 hours on each. The outcomes, as a share of the 30 sampled papers, were:

  • 20% Successfully replicated the research
  • 13% Achieved partial success
  • 23% Failed
  • 17% Obtained no result
  • 27% Were filtered out due to lack of data and experimental information
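As a quick consistency check, the percentages above can be converted back into paper counts. This is a minimal sketch: the counts are inferred by rounding against the sample of 30 and are not reported in this piece, and the names `to_counts` and `outcome_percentages` are illustrative.

```python
# Sample size taken from the text; percentages taken from the list above.
SAMPLE_SIZE = 30

outcome_percentages = {
    "replicated": 20,
    "partial success": 13,
    "failed": 23,
    "no result": 17,
    "filtered out": 27,
}


def to_counts(percentages, total):
    """Convert whole-number percentages to the nearest number of papers."""
    return {name: round(pct * total / 100) for name, pct in percentages.items()}


counts = to_counts(outcome_percentages, SAMPLE_SIZE)

# Papers on which a replication was actually attempted (everything not filtered out).
attempted = sum(n for name, n in counts.items() if name != "filtered out")

print(counts)
print("attempted:", attempted)
```

Under this rounding the counts are 6, 4, 7, 5 and 8 papers, which sum to 30, with 22 attempted replications, in line with the figures given in the text.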

The top six aspects of failure identified were: 

  • Aspect of implementation not described or ambiguous (R2)
  • Aspect of experiment not described or ambiguous (R2)
  • Not all hyperparameters are specified (R2)
  • Mismatch between data in paper and available online (R1+R2)
  • Method code shared; experiment code not shared (R1)
  • Method not described with sufficient detail (R2)

Given the increasing speed with which new AI systems and techniques are developed, an automated measure for assessing the quality of a research result is likely to increase confidence in published research. Just as international bodies develop quality standards in many fields, understanding which AI studies are likely to be replicable is a first step towards the development of quality standards in AI. O. E. Gundersen’s research provides a useful, automated means of assessing replicability, although any such measure will unavoidably produce false negatives and false positives. Its emphasis on documentation may help the AI community mature its culture around completeness, improving the replicability of AI research and therefore confidence in the field.


Reproducibility vs. Replicability: A Brief History of a Confused Terminology, Hans E. Plesser, Front. Neuroinform., 18 Jan 2018

O. E. Gundersen’s presentation:

O. E. Gundersen and Sigbjørn Kjensmo. “State of the art: Reproducibility in artificial intelligence.” Thirty-second AAAI conference on artificial intelligence. 2018.

O. E. Gundersen, Yolanda Gil, and David W. Aha. “On reproducible AI: Towards reproducible research, open science, and digital scholarship in AI publications.” AI magazine 39.3 (2018): 56-68.

O. E. Gundersen (2019). Standing on the Feet of Giants — Reproducibility in AI. AI Magazine, 40(4), 9-23. DOI:

R. Isdahl and O. E. Gundersen, “Out-of-the-Box Reproducibility: A Survey of Machine Learning Platforms,” 2019 15th International Conference on eScience (eScience), San Diego, CA, USA, 2019, pp. 86-95. DOI: 10.1109/eScience.2019.00017
