Increasing concerns about the trustworthiness of research have prompted calls to scrutinise studies’ Individual Participant Data (IPD), but guidance on how to do this has been lacking. The integrity package was developed to screen randomised controlled trials (RCTs) for integrity issues. It guides decision-making by classifying a trial as having no concerns, some concerns requiring further information, or major concerns warranting exclusion from evidence synthesis or publication.

Data Preparation

Since the functionality is implemented in R, first import the data set into R. A variety of functions in base R and in CRAN packages can do this.
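As a minimal sketch of the import step, assuming the data were exported as a CSV file, base R's read.csv is sufficient. The file contents and column names below are made up for illustration, and a temporary file stands in for a real export.

```r
# Write a tiny illustrative CSV to a temporary file (hypothetical columns and values).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("infant_ID,mat_age,treatment_cat",
             "1,36,2",
             "2,18,1",
             "3,20,1"), csv_path)

# Import the file as a data frame and inspect the column types.
trial_data <- read.csv(csv_path)
str(trial_data)
```

Other formats follow the same pattern with a different reader, such as read_excel from the readxl package for Excel files.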

An accompanying YAML file also needs to be written to describe the expected characteristics of each column.

The top-level elements must use specific names. participantID, enrollment, baseline, intervention and outcome are mandatory; the others only need to be specified if there is a column to annotate.

For an example of the expected contents and structure, view the YAML file corresponding to the bundled data set. Its location on your computer is returned by system.file("extdata", "variables.yaml", package = "integrity").

Integrity Checks

The checks are categorised into several domains.

Domain 1: Unusual or Repeated Patterns

Item 1: Repeating patterns across baseline variables.
Item 2: Repeating patterns within baseline variables.
Item 3: Repeating patterns across baseline variables for rare outcome.
Item 4: Bias in terminal digits.
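To illustrate the idea behind Item 4, the sketch below tests whether the last digits of a measurement are uniformly distributed, using a chi-squared goodness-of-fit test from base R. This is one common approach and not necessarily the test that integrity applies; the birthweights are simulated.

```r
set.seed(1)
# Simulated birthweights in grams (illustrative only, not the iCOMP data).
birthweight <- round(rnorm(200, mean = 1800, sd = 400))

# Extract the terminal digit of each value and count all ten possible digits.
terminal_digit <- factor(abs(birthweight) %% 10, levels = 0:9)
digit_counts <- table(terminal_digit)

# Goodness-of-fit test against a uniform distribution over the digits 0 to 9.
digit_test <- chisq.test(digit_counts, p = rep(0.1, 10))
digit_test$p.value
```

A small p-value would suggest digit preference, for example rounding to 0 or 5, which may be innocent measurement practice or may merit a closer look.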

Domain 2: Baseline Characteristics

Item 5: Excessively homogeneous distribution of binary baseline variables.
Item 6: Excessive imbalances of continuous baseline variables between groups.
Item 7: Excessive imbalances of categorical baseline variables between groups.
Item 8: Differential variability of numerical baseline characteristics between groups.
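The comparisons behind Items 6 and 8 can be sketched with standard base R tests: Welch's t-test for a difference in means and an F test for a difference in variances. These are stand-ins chosen for illustration, not a statement of the tests used by integrity, and the data are simulated.

```r
set.seed(2)
# Simulated maternal ages for two treatment arms (illustrative only).
baseline <- data.frame(
  treatment = factor(rep(c(1, 2), times = c(50, 70))),
  mat_age   = c(rnorm(50, mean = 29, sd = 7), rnorm(70, mean = 30, sd = 7))
)

# Item 6 idea: compare the group means with Welch's t-test.
mean_test <- t.test(mat_age ~ treatment, data = baseline)

# Item 8 idea: compare the group variances with an F test.
variance_test <- var.test(mat_age ~ treatment, data = baseline)

c(means = mean_test$p.value, variances = variance_test$p.value)
```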

Domain 3: Correlations

Item 9: Expected correlations between variables (e.g. height and weight).
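As a rough sketch of this check, a pair of variables that should be related can be tested with cor.test from base R. The data below are simulated with a positive relationship built in; the choice of Pearson correlation is an assumption made for illustration.

```r
set.seed(3)
# Simulated gestational age (weeks) and birthweight (grams), positively related by construction.
GA_weeks <- sample(28:33, size = 120, replace = TRUE)
birthweight <- 150 * (GA_weeks - 28) + rnorm(120, mean = 1200, sd = 200)

# Pearson correlation test; an expected pair showing no association would be flagged for review.
correlation_test <- cor.test(GA_weeks, birthweight)
correlation_test$estimate
```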

Domain 4: Date Violations

Item 10: Randomisation dates outside of the study period.
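A date check of this kind amounts to a simple comparison against the enrolment window. In the sketch below the window matches the enrol_start and enrol_end values of the bundled data set, while the randomisation dates are made up.

```r
# Illustrative randomisation dates (made up) and the study window.
rand_date   <- as.Date(c("2019-03-21", "2020-07-17", "2019-02-14", "2020-09-01"))
enrol_start <- as.Date("2019-03-02")
enrol_end   <- as.Date("2020-08-09")

# Flag dates that fall before enrolment opened or after it closed.
outside_period <- rand_date < enrol_start | rand_date > enrol_end
which(outside_period)
```

Here the third and fourth dates are flagged, one for preceding the start of enrolment and one for following its end.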

Domain 5: Participant Randomisation

Item 11: Deviation from randomness of allocation of participants to treatments over time.
Item 12: Deviation from randomness of allocation on days of the week.
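One way to look at Item 12 is to cross-tabulate treatment allocation by day of the week and test for an association. The sketch below uses a chi-squared test with a simulated p-value; both the data and the choice of test are assumptions for illustration rather than a description of integrity's internals.

```r
set.seed(4)
# Simulated randomisation dates over one year and random allocation to two arms.
rand_date <- as.Date("2019-03-02") + sample(0:364, size = 120, replace = TRUE)
treatment <- sample(c(1, 2), size = 120, replace = TRUE)

# ISO weekday number (1 = Monday, ..., 7 = Sunday); avoids locale-dependent day names.
weekday <- format(rand_date, "%u")
allocation_by_day <- table(weekday, treatment)

# Test whether allocation depends on the day of the week (Monte Carlo p-value).
day_test <- chisq.test(allocation_by_day, simulate.p.value = TRUE, B = 2000)
day_test$p.value
```

Under genuine randomisation, allocation should not depend on the weekday, so a very small p-value would be a reason to investigate further.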

Domain 6: Internal Consistency

Item 13: Impossible or implausible values, e.g. Age at Menarche for a male participant.

Domain 7: External Consistency

Item 14: Discrepancies between summary statistics calculated from data set and those presented in the corresponding journal article.

Domain 8: Data Plausibility

Item 15: Too few missing data values or missing data overly similar between treatment groups.
Item 16: Implausible event rates based on expert knowledge.
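For Item 15, a natural first step is to tabulate missingness by treatment group and compare the proportions. The following sketch does this with base R on made-up values; how integrity decides that missingness is too low or too similar is not shown here.

```r
# Illustrative binary outcome in two arms with a few missing values (made up).
treatment <- factor(rep(c(1, 2), times = c(50, 70)))
IVH <- rep(c(0, 1), length.out = 120)
IVH[c(3, 17, 60, 95, 110)] <- NA

# Cross-tabulate missingness by arm, then compute the proportion missing in each arm.
missing_by_arm <- table(treatment, missing = is.na(IVH))
prop_missing <- prop.table(missing_by_arm, margin = 1)[, "TRUE"]
prop_missing
```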

Based on the YAML file, only checks that are relevant to the data set will be executed.

Case Study: Cord Management at Preterm Birth

The data set bundled with this package is an extract from the iCOMP study. The study's main goal was to determine the optimal umbilical cord management strategy at preterm birth, such as milking or delayed cord clamping.

Data Loading and Preparation

The data is in a Microsoft Excel file. There is one sheet.

library(readxl)
examplePath <- system.file("extdata", "dataset.xlsx", package = "integrity")
dataset <- read_excel(examplePath)
dataset[1:5, ]
## # A tibble: 5 × 18
##   infant_ID rand_date           mat_age blood_loss treatment_cat GA_weeks
##       <dbl> <dttm>                <dbl>      <dbl>         <dbl>    <dbl>
## 1         1 2019-03-21 00:00:00      36        200             2       30
## 2         2 2020-07-17 00:00:00      18        200             1       28
## 3         3 2019-06-14 00:00:00      20        300             1       32
## 4         4 2019-10-08 00:00:00      30        500             2       29
## 5         5 2019-03-02 00:00:00      34        400             1       32
## # ℹ 12 more variables: birthweight <dbl>, sex <dbl>, hospital_days <dbl>,
## #   temp <dbl>, inf_transfusion_any <dbl>, Hct <dbl>, CLD <dbl>, IVH <dbl>,
## #   NEC <dbl>, inf_death <dbl>, enrol_start <dttm>, enrol_end <dttm>

The sample identifiers can be seen, as well as the first few clinical covariates. At this stage, categorical variables which only have one distinct value should be removed. This data has no such variables.
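Removing single-valued categorical variables could be done along the following lines. The data frame and its column names are hypothetical, and the sketch treats factor and character columns as categorical.

```r
# Illustrative data frame; `site` takes a single value and carries no information.
toy <- data.frame(
  sex  = factor(c(1, 2, 2, 1)),
  site = factor(c("A", "A", "A", "A")),
  temp = c(36.7, 36.9, 36.5, 37.0)
)

# Identify factor or character columns with at most one distinct non-missing value.
is_categorical <- vapply(toy, function(x) is.factor(x) || is.character(x), logical(1))
n_distinct     <- vapply(toy, function(x) length(unique(x[!is.na(x)])), integer(1))
constant_cols  <- names(toy)[is_categorical & n_distinct <= 1]

# Drop them before running the checks.
toy <- toy[, setdiff(names(toy), constant_cols), drop = FALSE]
names(toy)
```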

The variable types and expectations need to be defined. The metadata representation language YAML is used for this purpose.

library(yaml)
example_path <- system.file("extdata", "variables.yaml", package = "integrity")
dataset_info <- read_yaml(example_path)

On your computer, the data set is located at the path stored in examplePath and the YAML file at the path stored in example_path.

Running Checks

Simply provide the data frame and the data information to run_checks. The first step, which happens automatically, is data checking and cleaning: it ensures that all variables defined in the YAML file are present in the data set, converts any variables annotated as factors that are not yet factors, and removes any columns that consist entirely of missing values.
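To make the three cleaning steps concrete, here is a minimal sketch of equivalent logic in base R. The helper prepare_data and its arguments are hypothetical and are not part of integrity; the annotations are passed as plain character vectors rather than in the package's YAML structure.

```r
# Hypothetical helper mirroring the three steps described above; not integrity's code.
prepare_data <- function(data, expected_vars, factor_vars) {
  # 1. Every annotated variable must exist in the data set.
  missing_vars <- setdiff(expected_vars, names(data))
  if (length(missing_vars) > 0) {
    stop("Variables not found in data: ", paste(missing_vars, collapse = ", "))
  }
  # 2. Convert variables annotated as factors that are not yet factors.
  for (v in factor_vars) {
    if (!is.factor(data[[v]])) data[[v]] <- factor(data[[v]])
  }
  # 3. Drop columns that consist entirely of missing values.
  all_missing <- vapply(data, function(x) all(is.na(x)), logical(1))
  data[, !all_missing, drop = FALSE]
}

# Toy data: `notes` is entirely missing and `sex` is numeric but annotated as a factor.
toy <- data.frame(sex = c(1, 2, 1), mat_age = c(36, 18, 20), notes = NA)
cleaned <- prepare_data(toy, expected_vars = c("sex", "mat_age"), factor_vars = "sex")
str(cleaned)
```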

library(integrity)
result <- run_checks(dataset, dataset_info)
## Repeating pattern within each baseline algorithm in development
## No duplicate combinations found of: sex, mat_age, GA_weeks, birthweight
names(result)
## [1] "check_table"   "images"        "summary_table"

This creates a list of three result types.

Firstly, there is a check table with Pass or Fail statuses based on appropriate statistical tests.

head(result[["check_table"]])
##                          Domain                           Item Status
## 1  Unusual or Repeated Patterns             Repeated Baselines   Fail
## 3  Unusual or Repeated Patterns    Consecutive Baseline Binary   Fail
## 7                  Correlations      Unexpectedly Uncorrelated   Fail
## 10              Date Violations Implausible Randomisation Date   Fail
## 11       Internal Inconsistency                Implausible Day   Fail
## 12       Internal Inconsistency                Implausible Day   Fail
##                                                                     Details
## 1          sex:1, mat_age:30, GA_weeks:33, birthweight:1568 occurs 2 times.
## 3  Variable sex has statistically significant runs of values using χ² test.
## 7                                                     GA_weeks, birthweight
## 10                                                      Participants 38, 49
## 11                                       All participants start on Saturday
## 12                                              5 randomisation on Saturday
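Assuming the check table is an ordinary data frame with the columns printed above, the failures can be summarised with base R. The sketch below rebuilds the six printed rows as a stand-in for result[["check_table"]] so that it runs on its own.

```r
# Stand-in for result[["check_table"]], using the columns and rows printed above.
check_table <- data.frame(
  Domain = c("Unusual or Repeated Patterns", "Unusual or Repeated Patterns",
             "Correlations", "Date Violations",
             "Internal Inconsistency", "Internal Inconsistency"),
  Item   = c("Repeated Baselines", "Consecutive Baseline Binary",
             "Unexpectedly Uncorrelated", "Implausible Randomisation Date",
             "Implausible Day", "Implausible Day"),
  Status = "Fail"
)

# Keep failed checks only and count them per domain.
failed <- subset(check_table, Status == "Fail")
fails_per_domain <- table(failed$Domain)
fails_per_domain
```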

There are some interesting issues which may be examined further. Next is a list of four images. Here, the unexpected lack of correlation between gestational age and birthweight is shown.

names(result[["images"]])
## [1] "Terminal Digits"       "timeAndSize"           "Cumulative Allocation"
## [4] "Days"
result[["images"]][["timeAndSize"]]

Finally, there is a list of clinical summary tables: one for the measurements and one for the missingness.

result[["summary_table"]]
Characteristic         1, N = 50¹                               2, N = 70¹
infant_ID              66 (35)                                  57 (34)
rand_date              2019-12-07 06:43:12 (14840940.1535606)   2019-10-29 01:42:51.428571 (13740220.8211391)
mat_age                29 (7)                                   30 (7)
blood_loss             298 (169)                                266 (171)
GA_weeks
    28                 3 (6.0%)                                 6 (8.6%)
    29                 3 (6.0%)                                 3 (4.3%)
    30                 3 (6.0%)                                 10 (14%)
    31                 6 (12%)                                  10 (14%)
    32                 19 (38%)                                 13 (19%)
    33                 16 (32%)                                 28 (40%)
birthweight            1,835 (421)                              1,757 (361)
sex
    1                  27 (54%)                                 35 (50%)
    2                  23 (46%)                                 35 (50%)
hospital_days          30 (20)                                  36 (24)
temp                   36.72 (0.65)                             36.84 (0.58)
inf_transfusion_any    9 (18%)                                  14 (20%)
Hct                    53.7 (5.8)                               54.0 (5.7)
CLD
    0                  40 (80%)                                 45 (64%)
    1                  1 (2.0%)                                 16 (23%)
    2                  9 (18%)                                  7 (10%)
    3                  0 (0%)                                   2 (2.9%)
IVH
    0                  35 (73%)                                 52 (74%)
    1                  13 (27%)                                 18 (26%)
    Unknown            2                                        0
NEC
    0                  46 (92%)                                 67 (96%)
    1                  4 (8.0%)                                 3 (4.3%)
inf_death
    0                  49 (98%)                                 65 (94%)
    1                  1 (2.0%)                                 4 (5.8%)
    Unknown            0                                        1
enrol_start
    2019-03-02         50 (100%)                                70 (100%)
enrol_end
    2020-08-09         50 (100%)                                70 (100%)

¹ Mean (SD); n (%)

Computing Environment

This vignette was executed on the following computing system:

sessionInfo()
## R Under development (unstable) (2026-02-04 r89376 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=C                       LC_CTYPE=English_Australia.utf8   
## [3] LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Australia.utf8    
## 
## time zone: Australia/Sydney
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] integrity_1.0 yaml_2.3.12   readxl_1.4.5 
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.57          bslib_0.10.0       ggplot2_4.0.2     
##  [5] rstatix_0.7.3      lattice_0.22-9     vctrs_0.7.1        tools_4.6.0       
##  [9] generics_0.1.4     tibble_3.3.1       pkgconfig_2.0.3    Matrix_1.7-4      
## [13] RColorBrewer_1.1-3 S7_0.2.1           gt_1.3.0           lifecycle_1.0.5   
## [17] compiler_4.6.0     farver_2.1.2       stringr_1.6.0      janitor_2.2.1     
## [21] carData_3.0-6      snakecase_0.11.1   litedown_0.9       htmltools_0.5.9   
## [25] sass_0.4.10        Formula_1.2-5      pillar_1.11.1      car_3.1-5         
## [29] ggpubr_0.6.3       jquerylib_0.1.4    tidyr_1.3.2        cachem_1.1.0      
## [33] abind_1.4-8        nlme_3.1-168       commonmark_2.0.0   tidyselect_1.2.1  
## [37] digest_0.6.39      stringi_1.8.7      gtsummary_2.5.0    dplyr_1.2.0       
## [41] purrr_1.2.1        labeling_0.4.3     splines_4.6.0      fastmap_1.2.0     
## [45] grid_4.6.0         cli_3.6.5          magrittr_2.0.4     cards_0.7.1       
## [49] broom_1.0.12       withr_3.0.2        scales_1.4.0       backports_1.5.0   
## [53] cardx_0.3.2        lubridate_1.9.5    timechange_0.4.0   rmarkdown_2.30    
## [57] otel_0.2.0         ggsignif_0.6.4     cellranger_1.1.0   evaluate_1.0.5    
## [61] knitr_1.51         markdown_2.0       mgcv_1.9-4         rlang_1.1.7       
## [65] glue_1.8.0         xml2_1.5.2         rstudioapi_0.18.0  jsonlite_2.0.0    
## [69] R6_2.6.1           fs_1.6.7