Testing baseline tables in trials for signs of fraud

Posted by Adrian Barnett on Wednesday, April 23, 2025

When fraudsters make up research data, they can make mistakes. Real data are rich and complex, whilst fraudsters are on a get-rich-quick scheme and make slapdash errors.

One mistake they make is in randomised trials, where it’s standard to have a baseline table that compares the randomised groups. As the groups are randomised, the summary statistics should be similar. Some fraudsters have no sense of ‘similar’ and have created data where the groups are nearly identical; others have gone too far the other way and created implausible differences between the groups.

Carlisle’s test

The approach to testing baseline tables has been led by John Carlisle, who has used his test to spot multiple fraudulent papers. His test uses the distribution of p-values for the between-group differences and tests whether the p-values follow a uniform distribution, as they should under the null hypothesis that the groups were correctly randomised.
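Carlisle’s published method is more involved than this, but a minimal sketch of the core idea is a uniformity test on the p-values extracted from one trial’s baseline table. The p-values below are a hypothetical stand-in, and the choice of a Kolmogorov-Smirnov test is mine for illustration, not necessarily Carlisle’s:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical p-values from the between-group comparisons in one baseline table
baseline_pvalues = rng.uniform(size=12)

# Under correct randomisation the p-values should be Uniform(0, 1);
# a very small p-value from the uniformity test flags a suspicious table
ks_stat, ks_p = stats.kstest(baseline_pvalues, 'uniform')
print(f'KS statistic = {ks_stat:.3f}, p-value = {ks_p:.3f}')
```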

A Bayesian test

I created an alternative Bayesian test that examines the symmetry of the t-statistics rather than the distribution of p-values. P-values are bounded between 0 and 1, which creates floor and ceiling effects, and bounded distributions are always harder to work with than unbounded ones.
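My actual test uses a Bayesian model, but the underlying idea can be sketched without it: calculate a t-statistic for each row of the baseline table from the published summary statistics, then check whether those t-statistics are symmetric around zero, as they should be under correct randomisation. The summary statistics below are made up, and the Wilcoxon symmetry check is a simple stand-in for the Bayesian model:

```python
import numpy as np
from scipy import stats

# Hypothetical baseline table rows: (mean1, sd1, n1, mean2, sd2, n2)
rows = [
    (62.1, 10.2, 100, 61.8, 9.9, 100),     # age
    (81.4, 12.5, 100, 80.9, 13.1, 100),    # weight
    (130.2, 15.0, 100, 131.1, 14.6, 100),  # systolic blood pressure
    (5.6, 1.1, 100, 5.5, 1.2, 100),        # cholesterol
]

# Welch t-statistic for each baseline row
t_stats = np.array([
    (m1 - m2) / np.sqrt(s1**2 / n1 + s2**2 / n2)
    for m1, s1, n1, m2, s2, n2 in rows
])

# Simple check of symmetry around zero; in practice many more rows are
# needed, and my test replaces this step with a Bayesian model
print(stats.wilcoxon(t_stats))
```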

Carlisle’s uniform test is also prone to false positives, for example where there are no between-group differences but some of the data are skewed, or the summary statistics are rounded.

A key advantage of my test is that it can deal with categorical as well as continuous data, whereas Carlisle’s test can only use continuous data. This greatly increases the available data as over half the summary statistics in a sample of over 2,000 trials were percentages (see Table 4 here).

Some authors have used Carlisle’s test with categorical variables, but as Daniel Tausk pointed out, this creates false positives. To illustrate the problem, Tausk simulated two randomised groups of 100 patients each, with a binary variable that had a 5% chance of being “Yes”. As shown in Figure 1 below, Fisher’s exact test is conservative, rejecting less often than the ideal type 1 error, whereas the t-test sits almost perfectly on the ideal diagonal line.


Figure 1: Cumulative distribution functions for Fisher’s exact p-value and a t-test versus the uniform distribution (diagonal line). 100 patients per group with a binary variable probability of “yes” of 0.05. The data were created with no difference between the randomised groups.
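A sketch of this simulation, assuming a two-sided Fisher’s exact test and a standard two-sample t-test on the 0/1-coded data, summarised as rejection rates at the 5% level rather than the full cumulative distributions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, prob, n_sims = 100, 0.05, 2000  # patients per group, P("Yes"), simulations

fisher_p, ttest_p = [], []
for _ in range(n_sims):
    g1 = rng.binomial(1, prob, size=n)  # group 1, coded 0/1
    g2 = rng.binomial(1, prob, size=n)  # group 2, coded 0/1
    table = [[g1.sum(), n - g1.sum()], [g2.sum(), n - g2.sum()]]
    fisher_p.append(stats.fisher_exact(table)[1])
    ttest_p.append(stats.ttest_ind(g1, g2).pvalue)

# With no true difference the rejection rate should be close to 5%;
# Fisher's exact test falls well short, the t-test is close to nominal
print('Fisher rejects at 5% level:', np.mean(np.array(fisher_p) <= 0.05))
print('t-test rejects at 5% level:', np.mean(np.array(ttest_p) <= 0.05))
```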

The ability of the t-test to deal with categorical data was demonstrated by D’Agostino and colleagues back in 1988, who showed that the independent-samples t-test beats Fisher’s exact test and the Chi-squared test for 2×2 categorical data. This will be a surprise to the “statistics by recipe” crowd, who insist on choosing tests using flow charts and/or dogma.
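One way to build intuition, using hypothetical counts, is to see how close the t-test on 0/1-coded data gets to the Chi-squared test on the same 2×2 table:

```python
import numpy as np
from scipy import stats

# Hypothetical counts: 18/100 "Yes" in group 1, 9/100 "Yes" in group 2
g1 = np.repeat([1, 0], [18, 82])
g2 = np.repeat([1, 0], [9, 91])

t_res = stats.ttest_ind(g1, g2)
chi2, chi_p, dof, expected = stats.chi2_contingency([[18, 82], [9, 91]],
                                                    correction=False)

print(f't-test p = {t_res.pvalue:.4f}')
print(f'Chi-squared p = {chi_p:.4f}')
```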

Because Fisher’s exact test p-values deviate from the uniform distribution, using Carlisle’s test on them will lead to an excess of false positives when the null hypothesis is true. This is shown in Figure 2 (again reproducing Tausk’s paper) for simulated data with 20 categorical variables, each with a 0.5 chance of being “Yes”. As before, there are two groups with 100 patients per group.


Figure 2: Cumulative distributions for Stouffer’s test for uniform p-values and my Bayesian test for illustration. 100 patients per group, 20 binomial variables, and probability of “yes” equal to 0.5. The data were created with no difference between the randomised groups.
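A sketch of the Stouffer’s side of this simulation (the Bayesian side needs my full model). The clipping of extreme p-values and the two-sided combined test are my assumptions for illustration, not necessarily the exact implementation used elsewhere:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, prob, n_vars, n_sims, alpha = 100, 0.5, 20, 1000, 0.05

def stouffer(pvals):
    # Textbook Stouffer's method: convert p-values to z-scores, sum and
    # standardise; clip to avoid infinite z when a Fisher p-value equals 1
    z = stats.norm.ppf(np.clip(pvals, 1e-6, 1 - 1e-6))
    z_comb = z.sum() / np.sqrt(len(pvals))
    return 2 * stats.norm.sf(abs(z_comb))  # two-sided combined p-value

false_positives = 0
for _ in range(n_sims):
    pvals = []
    for _ in range(n_vars):
        y1 = rng.binomial(n, prob)  # "Yes" count in group 1
        y2 = rng.binomial(n, prob)  # "Yes" count in group 2
        pvals.append(stats.fisher_exact([[y1, n - y1], [y2, n - y2]])[1])
    false_positives += stouffer(np.array(pvals)) <= alpha

# With no true group differences this should be close to 5%, but the
# conservative Fisher p-values push it much higher
print(f'False-positive rate: {false_positives / n_sims:.3f}')
```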

I’ve added my Bayesian test to Figure 2, although the posterior probability from a Bayesian test does not behave like a standard p-value. The plot still shows that the Bayesian test often gives a low posterior probability when the null hypothesis is true, which is useful.

My test is available online here, and for papers on PubMed Central there’s some automated code that tries to extract the baseline table.