Big data, E-values, Machine Learning
Safe & Flexible Testing of COVID-19 Treatments: flexibly combining results from many trials around the world
The BCG Trials
The BCG vaccine has been in use since the 1920s as an effective vaccination against tuberculosis. There is substantial evidence that it provides a temporary boost to the immune system. Therefore, many virologists think that, within a limited period after vaccination, the chance of a (serious) corona infection would go down substantially. If this is true then, given that the vaccine is cheap and a lot of it is in store worldwide, it could be useful to bridge the period until a real vaccine against COVID-19 becomes available. While the BCG vaccine is not a specific corona medication, it could serve as a temporary and partial fix.
To test this hypothesis, several research groups around the world have initiated large-scale randomized clinical trials.
The first one was initiated by M. Netea (Radboud) and M. Bonten (Utrecht), but soon thereafter trials started in Australia, the US, South Africa, Hungary and other places.
In each trial, a number of hospital workers get the BCG vaccine (treatment) and a number get a placebo vaccination (control). It is then checked, during a certain follow-up period, how many in each group get a COVID-19 diagnosis and/or are themselves hospitalized. If the differences are ‘significant’, we decide that the BCG vaccine probably has an effect. If the effect is large enough and in the right direction, we should start giving the BCG vaccine to all hospital workers.
Safe Testing and E-Values
The difference between our method and traditional ones mostly lies in how one measures the evidence provided by the data. Traditionally, one uses the p-value, and one concludes ‘significance’ if p < 0.05 (or some other predetermined level).
P-values are notoriously problematic when it comes to early stopping and optional continuation. With standard p-value-based methods one must specify a sampling plan in advance (such as “test 500 subjects with the medication; test 500 with a placebo”). If you stop early (e.g. because the results look futile, or extremely strong), the results become uninterpretable. If, after 1000 subjects, the results look promising but inconclusive, you cannot simply go on and test some more.
Combining the data and calculating p-values as if the new data had been fixed in advance gives wildly wrong results, usually overstating the evidence in favor of the treatment. Very briefly, this is because the p-value for sample size n is defined so that, if there is no effect, the probability of concluding ‘significant’ at time n is no more than 0.05; but the probability that there is some sample size n at which the p-value for time n leads you to conclude ‘significant’ is much larger than 0.05. While methods to deal with optional stopping exist, they always require one to specify a maximum number of subjects in advance. After that, no further continuation is possible…
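A quick simulation illustrates this inflation. The sketch below is purely illustrative (a one-sample z-test on made-up data, nothing to do with the actual trials): a researcher who peeks at the p-value at 20 interim moments and stops as soon as p < 0.05 ends up ‘rejecting’ far more often than 5% of the time, even though there is no effect at all.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

def ever_significant(n_max=1000, alpha=0.05, looks=20):
    """One simulated study under the null (no effect): compute a
    two-sided z-test p-value at 20 interim looks and report whether
    ANY look came out 'significant' at level alpha."""
    x = rng.normal(0.0, 1.0, n_max)                 # null: mean 0, sd 1
    for n in np.linspace(n_max // looks, n_max, looks, dtype=int):
        z = x[:n].mean() * sqrt(n)                  # z-statistic (sd known)
        p = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
        if p < alpha:
            return True                             # 'significant' at some look
    return False

rate = np.mean([ever_significant() for _ in range(2000)])
print(f"fraction 'significant' under the null: {rate:.2f}")  # well above 0.05
```

With more frequent peeking the inflation only gets worse; with unlimited peeking, a ‘significant’ result is eventually guaranteed.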
Instead of p-values, our safe tests use E-values. These allow for unlimited optional continuation – by multiplying E-values, one can effortlessly add data without compromising statistical guarantees. In particular, evidence in favour of the treatment will not be inflated. They also allow for easy visualization of the evidence obtained so far (more on this below).
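To make this concrete, here is a minimal sketch of one of the simplest possible E-values – a likelihood ratio for coin-flip data – and of optional continuation by multiplication. This toy construction is my illustration, not the actual test statistic used in the BCG meta-analysis, and the parameters p0 and p1 are made up:

```python
import numpy as np

rng = np.random.default_rng(1)

def e_value(batch, p0=0.5, p1=0.6):
    """Likelihood ratio of a fixed alternative (success probability p1)
    against the null (p0) for a batch of 0/1 outcomes: under the null,
    its expected value is 1, which makes it an E-value."""
    ones = int(batch.sum())
    zeros = len(batch) - ones
    return (p1 / p0) ** ones * ((1 - p1) / (1 - p0)) ** zeros

# Optional continuation: as each new batch of data comes in, multiply
# its E-value into the running product. You may stop, or continue, at
# any moment: under the null, the probability that the product EVER
# exceeds 1/alpha is at most alpha, no matter when you stop.
combined = 1.0
for _ in range(5):                        # data arrive in five batches
    batch = rng.integers(0, 2, 100)       # simulated under the null (p = 0.5)
    combined *= e_value(batch)
print(combined)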
Safe Testing for ALL-IN Meta-Analysis in the BCG Trials
In the BCG trial setting, we want to combine results online, making a first recommendation as soon as the combined data is really convincing either way. Also, if new trials want to join our common effort at a later stage, they are welcome, as long as they are properly organized. This is like an extreme form of optional continuation. With standard methods (p-values), it is very difficult to do this while keeping Type I error probabilities under control; with E-values, we can do it.
Figure 1
We will soon start this ALL-IN (Anytime Live and Leading INterim) meta-analysis for the worldwide BCG trials. Alexander Ly, a researcher from the UvA, and Judith ter Schure and Rosanne Turner from my group have worked very hard to make this possible. In Figure 1 you can see the type of graph (filled with fake data for the time being) that will be generated from the combined data, updated each day an interim result comes in. The yellow, red and orange lines represent E-values coming in from the different trials – a line that begins further to the right corresponds to a trial that starts later. The blue line is the combined E-value, obtained by multiplying the individual E-values – an E-value of 1 is neutral; the larger the E-value, the more evidence that the treatment is effective. The accumulated evidence on a given calendar day is simply the height of the blue curve on that day.
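Computing the blue combined line from the individual trial lines really is just multiplication. A minimal sketch, with made-up E-values (like the fake data in the figure) and NaN marking interims before a trial had joined:

```python
import numpy as np

# Hypothetical E-values reported by two trials at four successive
# interim analyses; the numbers are invented for illustration.
trial_e = np.array([
    [1.2,    1.5, 2.0, 2.6],   # trial A
    [np.nan, 1.1, 1.3, 1.8],   # trial B joins at the second interim
])

# The combined E-value at each interim: multiply the trials' E-values,
# counting a not-yet-started trial as the neutral E-value 1.
combined = np.prod(np.where(np.isnan(trial_e), 1.0, trial_e), axis=0)
print(combined)
```

A trial that has not yet produced data contributes the neutral value 1, so new trials can join at any time without disturbing the running product.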
Thinking Big
Applied sciences such as medicine and psychology suffer from a replicability crisis: even in top journals, a devastatingly large fraction of published results cannot be reproduced. The continued use of p-values outside the (very narrow) domain where they really work plays a significant part in this. In particular, in contrast to traditional tests with p-values, safe tests and E-values make a valid combination of results easy. Safe tests, as well as the closely related always-valid confidence intervals, could thus really help us achieve better replicability.
We hope that someday they will be employed on a much grander scale. Yet for this, a lot of work still has to be done: for more sophisticated applications in the medical sciences we would need safe tests and confidence intervals for logistic regression, Cox regression, and more. We know from the underlying theory that these can all be developed; it is ‘just’ a matter of implementation. But I put ‘just’ in quotation marks since, for example, we may run into computational issues.
Big vs Small Data, Machine Learning vs Statistics
Talking about the bigger picture: is this modern Data Science, with big data? Is this Machine Learning? Or is this traditional statistics? The answer is not at all clear-cut: it’s big data, coming from around the world and covering tens of thousands of hospital workers. It’s also small data, since the large majority of hospital workers in both the treatment and control groups don’t get corona at all, and hence do not give us any useful information! It’s the most traditional type of statistics – hypothesis testing, the branch of statistics that was developed almost 100 years ago. Yet it uses new methods for this old area. The new methods are still old-fashioned in that they give precise error guarantees. I consider that a very good thing.
While there are close connections to Bayesian statistical methods, much work in this direction has been developed outside of mainstream statistical research. Interestingly, Volodya Vovk (Royal Holloway, University of London) and Aaditya Ramdas (Carnegie Mellon University), who are pioneering similar research, and I all work at the intersection of machine learning and statistics. Machine learning ideas often provide a fresh look. Even more interestingly, similar ideas were already put forward around 1970 by the famous statistician Herbert Robbins (Columbia University) and his students – but they did not catch on at the time, perhaps because there was no ‘reproducibility crisis’, and development of these initial ideas stopped in the mid-1970s.
But, is it working?
Thinking big is all very nice, but right now the most important question is of course whether the BCG vaccine actually helps to reduce COVID-19. Some of the trials are well under way now – but, unlike Judith ter Schure, the official meta-trial statistician, I am not personally allowed to see the interim individual or combined results until the bar for ‘significance’ or ‘futility/harm’ has been reached. So for the time being I’ll just have to wait and bite my nails…
This blog was updated in November 2020 to change “S-values” to “E-values” per the author’s request.