# Safe & Flexible Testing of COVID-19 Treatments: flexibly combining results from many trials around the world

Recently, the Machine Learning group at CWI has developed methods for statistical hypothesis testing that are safer and more flexible than traditional ones. Until March 2020 most of this work was theoretical, but now the time has come to put “safe testing” to the test: in a collaboration with UMC Utrecht and Radboud UMC, our method is being used to combine the results of several COVID-19-related clinical trials that are currently running in hospitals in various countries.

### The BCG Trials

The BCG vaccine has been in use since the 1920s as an effective vaccination against tuberculosis. There is substantial evidence that it provides a temporary boost of the immune system. Therefore, many virologists think that, within a limited period after vaccination, the chance of a (serious) corona infection would go down substantially. If this is true, given that the vaccine is cheap and a lot of it is in store world-wide, it could be useful to bridge the period until a real vaccine against COVID-19 becomes available. While the BCG vaccine is not a specific corona medication, it could serve as a temporary and partial fix.

To test this hypothesis, several research groups around the world have initiated large scale randomized clinical trials.

The first one was initiated by M. Netea (Radboud) and M. Bonten (Utrecht), and soon thereafter trials were started in Australia, the US, South Africa, Hungary and other places.

In each trial, a number of hospital workers get the BCG vaccine (treatment), and a number get a placebo vaccination (control). It is then checked, during a certain follow-up time, how many in each group get a COVID-19 diagnosis and/or are themselves hospitalized. If the difference between the groups is ‘significant’, we decide that the BCG vaccine probably has an effect. If the effect is large enough and in the right direction, we should start giving the BCG vaccine to all hospital workers.

### Safe Testing and S-Values

The difference between our method and traditional ones mostly lies in how one measures the evidence provided by the data. Traditionally, one uses the p-value, and one concludes ‘significance’ if p < 0.05 (or some other predetermined level).

P-values are notoriously problematic when it comes to things like *early stopping* and *optional continuation*. With standard p-value based methods one must specify a sampling plan in advance (such as “test 500 subjects with medication; test 500 with placebo”). If you stop early (for example because the results look futile or extremely strong), the results become uninterpretable. If, after 1000 subjects, the results look *promising but inconclusive*, you cannot simply go on and test some more.

Combining the old and new data and calculating a p-value as if the total sample size had been fixed in advance gives wildly wrong results, usually overstating the evidence in favor of the treatment. Very briefly, this is because the p-value for sample size *n* is defined such that, if there is no effect, the probability that you conclude ‘significant’ *at time n* is no more than 0.05. But the probability that there is *some* sample size *n* at which, using the p-value for time *n*, you conclude ‘significant’ is much larger than 0.05. While methods to deal with optional stopping exist, they always require one to specify a maximum number of subjects. After that, no more continuation is possible…
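The inflation caused by repeatedly checking a p-value can be seen in a small simulation. The sketch below is purely illustrative and is not the analysis code used for the BCG trials: it assumes standard normal outcomes with no true effect and a simple two-sided z-test, and the function names and default settings are made up for this example. It compares the false positive rate when testing once at the planned sample size with the rate when declaring ‘significant’ at the first sample size where p < 0.05.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value of a z-statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_max=500, n_sims=1000, alpha=0.05, seed=1):
    """Simulate data with NO effect and compare two analysis strategies.

    Returns (rate when testing only at n_max,
             rate when peeking after every single observation).
    """
    rng = random.Random(seed)
    fixed_hits = 0    # 'significant' at the planned sample size only
    peeking_hits = 0  # 'significant' at SOME sample size up to n_max
    for _ in range(n_sims):
        total = 0.0
        ever_significant = False
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)           # outcome with true mean 0
            p = two_sided_p(total / math.sqrt(n))  # p-value 'for time n'
            if p < alpha:
                ever_significant = True
        fixed_hits += int(p < alpha)               # p is now the value at n_max
        peeking_hits += int(ever_significant)
    return fixed_hits / n_sims, peeking_hits / n_sims


if __name__ == "__main__":
    fixed, peeking = false_positive_rates()
    print(f"test once at n_max: {fixed:.3f}")
    print(f"peek at every n:    {peeking:.3f}")
```

Under these assumptions the first rate should come out close to 0.05, while the second should be several times larger, which is exactly the problem described above.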

Instead of p-values, our safe tests use S-values. These allow for unlimited optional continuation – by multiplying S-values, one can effortlessly add data without compromising statistical guarantees. In particular, evidence in favour of the treatment will not be inflated. They also allow for easy visualization of the evidence obtained so far (more on this below).
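To illustrate the multiplication idea, here is a minimal sketch under strong simplifying assumptions. It uses a plain likelihood ratio between two normal distributions as a stand-in S-value, which is not the S-value used in the BCG trials, and it assumes the rejection threshold 1/α that is standard in the safe testing literature. The alternative mean `mu_alt`, the batch sizes and all function names are hypothetical choices for this example.

```python
import math
import random


def s_value(batch, mu_alt=0.3):
    """Likelihood ratio of N(mu_alt, 1) against the null N(0, 1) for one batch."""
    log_lr = sum(mu_alt * x - 0.5 * mu_alt ** 2 for x in batch)
    return math.exp(log_lr)


def ever_rejects(n_batches=100, batch_size=10, alpha=0.05, rng=random):
    """Generate null data, multiply batch S-values, and peek after every batch."""
    running = 1.0  # an S-value of 1 is neutral
    for _ in range(n_batches):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        running *= s_value(batch)   # adding data = multiplying S-values
        if running >= 1 / alpha:    # assumed threshold: reject once S >= 1/alpha
            return True
    return False


def null_rejection_rate(n_sims=1000, seed=2):
    """Fraction of null simulations that EVER cross the threshold."""
    rng = random.Random(seed)
    return sum(ever_rejects(rng=rng) for _ in range(n_sims)) / n_sims


if __name__ == "__main__":
    print(f"false positive rate with unlimited peeking: {null_rejection_rate():.3f}")
```

In contrast to the repeated p-value check, the fraction of null simulations that ever cross the threshold should stay at or below roughly α = 0.05, up to simulation noise, no matter how often one looks at the running product.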

### Safe Testing for ALL-IN Meta-Analysis in the BCG Trials

In the BCG trial setting, we want to combine results on-line, making a first recommendation as soon as the combined data is really convincing either way. Also, if new trials want to join our common effort at a later stage, they are welcome, as long as they are properly organized. This is like an extreme form of optional continuation. With standard methods (p-values), it is very difficult to do this while keeping Type I-error probabilities under control; with S-values, we can do it.

#### Figure 1

We will soon start this *ALL-IN (Anytime Live and Leading INterim) meta-analysis* for the worldwide BCG trials. Alexander Ly, a researcher from UvA, and Judith ter Schure and Rosanne Turner from my group have worked very hard to make this possible. In Figure 1 you can see the type of graphs (filled in with fake data for the time being) that will be generated from the combined data, updated each day an interim result comes in. The yellow, red and orange lines represent S-values coming in from different trials; some lines begin later because the corresponding trials start later. The blue line is the combined S-value, obtained by multiplying the individual S-values. An S-value of 1 is neutral; the larger the S-value, the more evidence that the treatment is effective. The accumulated evidence at a given calendar day is simply the height of the blue curve at that day.
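The bookkeeping behind such a combined curve can be sketched in a few lines. This is only an illustration with made-up numbers: the input format, the function name and the choice to carry a trial's last reported S-value forward on days without a new interim result are all assumptions for this example, not a description of the actual ALL-IN software. A trial that has not started yet contributes the neutral value 1.

```python
from math import prod


def combined_s_values(trials, n_days):
    """Combine per-trial S-value curves into one curve per calendar day.

    trials: dict mapping a trial name to (start_day, s_by_day), where
            s_by_day is a non-empty list holding the trial's cumulative
            S-value on each day since it started.
    """
    combined = []
    for day in range(n_days):
        per_trial = []
        for start_day, s_by_day in trials.values():
            idx = day - start_day
            if idx < 0:
                per_trial.append(1.0)  # trial not started yet: neutral
            else:
                # no newer interim result: carry the last value forward
                per_trial.append(s_by_day[min(idx, len(s_by_day) - 1)])
        combined.append(prod(per_trial))  # the 'blue line' for this day
    return combined


if __name__ == "__main__":
    fake_trials = {
        "trial A": (0, [1.0, 2.0, 4.0]),  # starts on day 0
        "trial B": (1, [0.5, 3.0]),       # joins one day later
    }
    print(combined_s_values(fake_trials, n_days=4))
```

With these fake inputs the combined curve stays at 1 for the first two days and rises once both trials report S-values above 1.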

### Thinking Big

Applied sciences such as medicine and psychology suffer from a *replicability crisis*: even in top journals, a devastatingly large fraction of published results are irreproducible. The continued use of p-values outside the (very narrow) domain where they really work plays a significant part in this. In particular, in contrast to traditional tests with p-values, with safe tests and S-values, a valid combination of results is easy. Safe tests, as well as the closely related *always-valid* confidence intervals, could thus really help us with achieving better replicability.

We hope that someday they will be employed at a much grander scale. Yet, for this, a lot of work still has to be done: for more sophisticated applications in the medical sciences we would need safe tests and confidence intervals for logistic regression, Cox regression, and more. We know from the underlying theory that these can all be developed; it is ‘just’ a matter of implementation. But I put ‘just’ in quotation marks, since, for example, we may run into computational issues.

### Big vs Small Data, Machine Learning vs Statistics

Talking about the bigger picture: is this modern Data Science, with big data? Is this Machine Learning? Or is this traditional statistics? The answer is not at all clear-cut: it’s big data, coming from around the world, covering tens of thousands of hospital workers. It’s also small data, since the large majority of hospital workers in both treatment and control groups don’t get corona at all, and hence do not give us any useful information! It’s the most traditional type of statistics: hypothesis testing, the branch of statistics that was developed almost 100 years ago. Yet it uses new methods for this old area. The new methods are still old-fashioned in that they give precise error guarantees. I consider that a very good thing.

While there are close connections to Bayesian statistical methods, much work in this direction has been developed outside of mainstream statistical research: interestingly, Volodya Vovk (Royal Holloway, University of London) and Aaditya Ramdas (Carnegie Mellon University), who are pioneering similar research, and I all work at the intersection of machine learning and statistics. Machine learning ideas often provide a fresh look. Even more interestingly, similar ideas were already put forward around 1970 by the famous statistician Herbert Robbins (Columbia University) and his students, but they did not catch on at all at the time, perhaps because there was no ‘reproducibility crisis’, and development of these initial ideas stopped in the mid 1970s.

### But, is it working?

Thinking big is all very nice, but right now, the most important question is of course whether the BCG vaccine actually helps to reduce COVID-19. Some of the trials are well under way now – but, unlike Judith ter Schure, the official meta-trial statistician, I am not personally allowed to see the interim individual/combined results until the bar for ‘significance’ or ‘futility/harm’ has been achieved. So for the time being I’ll just have to wait and bite my nails…