Randomized clinical trial quality has improved over time but is still not good enough: an analysis of 176,620 randomized controlled trials published between 1966 and 2018

Background: Many randomized controlled trials (RCTs) are biased and difficult to reproduce due to methodological flaws and poor reporting. There is increasing attention to responsible research practices, including reporting guidelines, but it is unknown whether these efforts have improved RCT quality (i.e., reduced risk of bias). We therefore mapped trends over time in trial publication, trial registration, reporting according to CONSORT, and characteristics of publications and authors. Methods: Meta-information of 176,620 RCTs published between 1966 and 2018 was extracted. Risk of bias probability (four domains: random sequence generation, allocation concealment, blinding of patients/personnel, and blinding of outcome assessment) was assessed using validated risk-of-bias machine learning tools. In addition, trial registration and reporting according to CONSORT were assessed with automated searches. Characteristics were extracted related to publication (number of authors, journal impact factor, medical discipline) and authors (gender and Hirsch-index). Findings: The annual number of published RCTs substantially increased over four decades, accompanied by increases in the number of authors (5.2 to 7.8), institutions (2.9 to 4.8), female authors (20 to 42%, first authorship; 17 to 29%, last authorship), and Hirsch-indices (10 to 14, first authorship; 16 to 28, last authorship). Risk of bias remained present in most RCTs but decreased over time for the domains allocation concealment (63 to 51%), random sequence generation (57 to 36%), and blinding of outcome assessment (58 to 52%). Trial registration (37 to 47%) and mention of CONSORT (1 to 20%) rapidly increased in the latest period. In journals with a higher impact factor (>10), risk of bias was consistently lower, and trial registration and mention of CONSORT were more frequent. Interpretation: The likelihood of bias in RCTs has generally decreased over the last decades. This may be driven by increased knowledge and improved education, augmented by mandatory trial registration, and more stringent reporting guidelines and journal requirements. Nevertheless, relatively high probabilities of bias remain, particularly in journals with lower impact factors. This emphasizes that further improvement of RCT registration, conduct, and reporting is still urgently needed.


Introduction
Randomized controlled trials (RCTs) are the primary source of evidence on the efficacy and safety of clinical interventions, and systematic reviews and clinical guidelines synthesize their results. Unfortunately, many RCTs have severe methodological flaws and results are often biased. 1 Strikingly, the majority of RCT findings have inflated estimates and have problems with randomization, allocation concealment, and blinding. 2,3 Recently, it was shown that over 40% of RCTs were at high risk of bias which could have been easily avoided. 4 Moreover, poor reporting prevents the adequate assessment of RCT quality and limits reproducibility. 5 Avoidable sources of waste and inefficiency in clinical research were estimated to be as high as 85%. 6 Some time ago, the CONSORT criteria were introduced to improve RCT reporting, and mandatory RCT registration was put forward by the International Committee of Medical Journal Editors (ICMJE). 7,8 More recently, The Lancet published a series on increasing value and reducing waste in medical research which proposed meaningful steps towards more high-quality research, including improved methodology and reporting, and reduction of unpublished negative findings. 5,9 Additional actions to improve RCT quality and transparency include trial tracker initiatives aimed at reducing non-publication of clinical trials, 10 and fostering responsible research practices. At the most recent World Conference on Research Integrity, the Hong Kong Principles were proposed to further stimulate responsible research practices by including them in researcher assessments. 11 Even though these actions and initiatives have undoubtedly contributed to awareness that the quality of RCTs needs to improve, the question remains whether real progress has been made in reducing the extent of avoidable waste in clinical research. In other words, have these initiatives and measures improved the quality, transparency, and reproducibility of RCTs?
Several studies have assessed the quality of reporting and risk of bias in RCTs, 12 but most are relatively small and limited to specific medical disciplines or time periods. Nevertheless, based on 20,920 RCTs from Cochrane reviews, there are indications that poor reporting and inadequate methods have decreased over time. 13 However, large-scale evidence on trends of RCT characteristics and quality across medical disciplines over time is currently lacking. This is surprising in view of the importance of valid and reliable evidence from RCTs for patient care. Therefore, this study aimed to provide a comprehensive and unbiased analysis of developments in the clinical trial landscape between 1966 and 2018 based on 176,620 full-text RCT publications.

medRxiv preprint, this version posted April 27, 2020; doi: https://doi.org/10.1101/2020.04.22.20072371. This preprint was not certified by peer review. The copyright holder is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

Methods
The protocol for this analysis was registered prior to study conduct, 14 the database and scripts are available through GitHub (see Data Sharing), and the results are disseminated through the medRxiv pre-print server.

Selection of RCTs and extraction of characteristics
RCTs were identified via MEDLINE (Nov 20, 2017) starting with all publications indicated as 'randomized controlled trial' using the query "randomized controlled trial[pt] NOT (animals[mh] NOT humans[mh])". The initial search did not include a time window. Non-English language, non-randomized, animal, pilot, and feasibility studies were subsequently excluded (see Supplementary Methods text for details on the selection procedure). We collected the Portable Document Format (PDF) file for all available RCTs across publishers in journals covered by the library subscription of our institution, and converted these PDFs to structured text in XML format using publicly available software (Grobid, available via GitHub). By linking information from the MEDLINE database, the full-text publication, and data from Scopus and Web of Science, we extracted metadata on the number of authors, author gender, number of countries and institutions of (co-)authors, and the Hirsch (H)-index of the first and last authors (see Supplementary Table S1 for details). Moreover, we extracted the journal impact factor (JIF) at the time of publication. We also quantified the frequency of predefined positive, negative, and neutral words (25 words in each category) in titles and abstract texts as previously published (see Supplementary Methods for details). 15 Time was stratified in 5-year periods, as behavioral changes are expected to occur at a relatively slow pace, with the relatively few trials published before 1990 merged into one stratum.
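The time stratification described above can be illustrated with a minimal Python sketch. The text states only that 5-year periods were used and that trials before 1990 were merged into one stratum; the exact period boundaries, the label format, and the function name `period_label` are assumptions made for illustration, not the authors' script.

```python
def period_label(year: int, first_cut: int = 1990, width: int = 5,
                 last_year: int = 2018) -> str:
    """Map a publication year to a stratum label.

    Assumed boundaries: everything before `first_cut` is one stratum,
    followed by consecutive `width`-year periods; the final period is
    truncated at `last_year`.
    """
    if year < first_cut:
        return f"<{first_cut}"
    start = first_cut + ((year - first_cut) // width) * width
    end = min(start + width - 1, last_year)
    return f"{start}-{end}"


# Example: label a handful of publication years.
years = [1966, 1989, 1990, 1994, 2004, 2018]
labels = [period_label(y) for y in years]
```

With these assumed boundaries, the last stratum covers only 2015 to 2018, so it is shorter than the others.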

Risk of Bias assessment
For every included full-text RCT, risk of bias assessment was performed automatically using the machine learning assessment of RobotReviewer. 16 This tool is optimized for large-scale characterizations 17,18 and is algorithmically based on a large sample of human-rated risk of bias reports and extracted support texts from trial publications covering the full RCT spectrum. The level of agreement between RobotReviewer and human raters was similar for most domains (human-human agreement: min/max 71-85%, average 79%; human-RobotReviewer agreement: min/max 39-91%, average 65%). 20,21 Of the seven risk of bias domains described by Cochrane, 19 we assessed four: random sequence generation and allocation concealment (i.e., selection bias), blinding of participants and personnel (i.e., performance bias), and blinding of outcome assessment (i.e., detection bias). Publication bias and outcome reporting bias were outside the scope of our analysis.

Analysis of trial registration and CONSORT statement use
To check for trial registration, we extracted trial registration numbers from the abstract and full-text publication and searched for the corresponding trial registration number in two online databases: WHO's International Clinical Trials Registry Platform, composed of worldwide primary and partner registries, and the ClinicalTrials.gov trial registry. 20 We checked all full-text publications for at least one mention of the words "Consolidated Standards of Reporting Trials" or "CONSORT".
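A minimal Python sketch of this kind of automated text search is given below. It is not the authors' code: the two registry identifier patterns are illustrative assumptions covering only two registries, the case-insensitive CONSORT match is an assumed reading of the search, and the function names are hypothetical. The subsequent lookup of each identifier in the registries is not shown.

```python
import re

# Illustrative patterns only (assumptions): ClinicalTrials.gov identifiers
# ("NCT" + 8 digits) and ISRCTN identifiers ("ISRCTN" + 8 digits).
REGISTRY_PATTERNS = {
    "ClinicalTrials.gov": re.compile(r"\bNCT\d{8}\b"),
    "ISRCTN": re.compile(r"\bISRCTN\d{8}\b"),
}

# Assumed case-insensitive search for the full name or the acronym.
CONSORT_PATTERN = re.compile(
    r"consolidated standards of reporting trials|\bconsort\b",
    re.IGNORECASE,
)


def find_registration_numbers(text: str) -> list:
    """Return the sorted, de-duplicated registry identifiers found in text."""
    found = set()
    for pattern in REGISTRY_PATTERNS.values():
        found.update(pattern.findall(text))
    return sorted(found)


def mentions_consort(text: str) -> bool:
    """True if the text contains at least one CONSORT mention."""
    return CONSORT_PATTERN.search(text) is not None


sample = ("Registered at ClinicalTrials.gov (NCT01234567). "
          "Reporting follows the CONSORT statement.")
ids = find_registration_numbers(sample)
has_consort = mentions_consort(sample)
```

A case-insensitive match of this kind also flags the ordinary English word "consort", so a plain text search cannot separate the two uses without further context.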

Analysis related to Journal Impact Factor
Even though the journal impact factor (JIF; the average number of times a journal's articles were cited in other articles over two years) is not a very suitable indicator of journal quality, 21 no unbiased alternatives exist. In our study, we therefore used the JIF as a proxy to identify journals with high publication standards and high rejection rates. For each individual trial, we selected the JIF of the year before trial publication. We used a JIF threshold of 10 as the primary cutoff based on JIF distributions (see Supplementary Table S2 and Supplementary Figure S1) and previous evidence that this cutoff is sensitive to differences in RCT quality. 13 However, we also performed sensitivity analyses for JIF cutoff thresholds at 3 and 5.

Analyses related to medical disciplines
We assigned RCTs to medical disciplines based on the journal category (Web of Science). 7 As a secondary analysis, we examined medical disciplines separately. 9 Medical disciplines with fewer than 4,000 RCTs in our sample were assigned to the category 'Other'.
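The grouping of small disciplines into 'Other' can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the function name and the toy data are hypothetical, and only the threshold of 4,000 RCTs and the label 'Other' come from the text.

```python
from collections import Counter


def collapse_small_disciplines(disciplines, min_count=4000, other_label="Other"):
    """Relabel disciplines with fewer than `min_count` RCTs as `other_label`.

    `disciplines` holds one discipline name per RCT.
    """
    counts = Counter(disciplines)
    return [d if counts[d] >= min_count else other_label for d in disciplines]


# Toy example with a threshold of 2 instead of 4,000.
toy = ["oncology", "oncology", "anesthesiology", "oncology", "dermatology"]
collapsed = collapse_small_disciplines(toy, min_count=2)
```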
were very close to mean values, the data is presented as mean ± 95% confidence intervals.

Risk of bias and reporting: relation with journal impact factor
The risk of bias in allocation concealment was consistently lower in trials published in journals with JIF larger than 10 (P < 0.001; Figure 4A). This also applied to randomization and blinding of participants and personnel and outcome assessment, even though the results were less [...] bias and increased registration and mentioning of the CONSORT Statement in journals with higher JIF (Supplementary Figures S2 and S3).

Risk of bias and reporting: relation with medical discipline
Risk of bias patterns substantially differed across medical disciplines (Supplementary Table S1). Lowest probabilities of bias were found in RCTs within the field of anesthesiology (27% randomization bias, 43% allocation concealment bias, 45% risk of bias due to insufficient blinding of participants and personnel, 45% bias in blinding of outcome assessment) (Supplementary Figure S4). The field of oncology had the highest levels of trial registration (43.4%) and mention of the CONSORT Statement (30.3%) (Supplementary Figure S5).

Discussion
We analyzed a total of 176,620 full-text publications of RCTs from the last four decades and [...] 2017 that found improvements in reporting and methods over time for sequence generation and allocation concealment. 13 Notwithstanding these improvements, it is also clear that there is still a pressing need to further improve the quality of RCTs. The average risk in each of the bias domains remains generally high (around 50%), and bias related to blinding of participants and personnel is increasing over time, which may be due to more pragmatic or non-drug RCTs being performed. Moreover, despite the requirement of trial registration for publication since 2004, a substantial percentage of RCTs published in 2017 were still not registered. Furthermore, many RCTs do not mention the CONSORT guidelines in their full text, particularly those in journals with lower impact factors.
Despite the accessibility of reporting guidelines, researchers are generally not required to adhere to them and, more problematically, requirements are not strictly enforced and non-compliance with items on the reporting guideline is not sanctioned. 4,24 There is still a long way to go to further improve the quality and reliability of RCTs, and the rather slow progress may be due to the complex nature of conducting RCTs. Better education, enforcement, and (dis)incentives may be unavoidable. Additionally, making data sets available according to the FAIR principles will arguably improve the situation. 25 Depending on one's expectations and future goals, the interpretation can be either optimistic or pessimistic: optimistic because, over the past decades, there has been considerable improvement in RCT conduct and reporting, but pessimistic because the improvements are proceeding at a rather slow pace. From our analyses, it also appears that journals with higher JIF generally publish RCTs with lower scores on risk of bias domains. Our results confirm previous results showing that a higher JIF (higher than 10) is associated with a lower proportion of trials at unclear or high risk of bias in Cochrane reviews. 13 Even though JIFs are not a very suitable measure of journal quality, our results are in line with previous studies showing that a higher JIF is related to higher RCT quality. 26 Finally, there are large differences across medical disciplines in risk of bias scores across domains, which cannot readily be explained.
There are several strengths and limitations inherent to our approach of automated extraction of full-text RCT publications. The automated and uniform approach yielded an unprecedentedly large and rich data source concerning RCTs from the last forty years, which is available for further study (see https://github.com/wmotte/frrp for the data), covering a large proportion of all published RCTs included in PubMed. Nevertheless, there are several limitations. First, risk of bias is inherently difficult to assess. Experts' assessments of trials show that labeling the same trials for different Cochrane reviews resulted in substantial differences. 27,28 Probabilities assigned with machine learning are based on a large set of human-assigned labels, and a direct comparison shows that computerized assessment reaches 71.0% agreement. 29 Second, we did not investigate all aspects of methodological rigor. In our study, we did not check for forms of attrition bias (e.g., incomplete outcome data) or reporting bias (e.g., selective outcome reporting), both of which would require a direct comparison between the trial registration and the actual trial publication report. Third, even though the CONSORT Statement was introduced to improve RCT reporting, 30 the rapid increase of RCTs that mention following the CONSORT guideline does not guarantee adherence, and reporting quality can remain suboptimal. 12 We were not able to automatically correct for the conventional and non-abbreviated use of the word 'consort'. This may have slightly increased our CONSORT Statement percentages and explains the very low but non-zero values in the earliest stratum.
Our comprehensive picture of RCT quality provides quantitative insight into the current state and trends over time. With many thousands of RCTs being published each year and thousands of clinical trials currently recruiting patients, this can help us to better understand the current situation but also to find solutions for further improvement. These could include a more stringent adoption of measures to enforce transparent and credible trial publication, but also fine-tuning of stricter registration regulations. In conclusion, our comprehensive analyses of a large body of full-text RCTs show that there is a slow and gradual improvement of RCT quality over the last decades. While RCTs certainly face challenges in relation to their quality and reproducibility and there is still ample room for improvement, our study is a first step in showing that all efforts that have been made to improve RCT practices may be paying off.

Role of the funding source
This study was funded by ZonMw, which had no influence on the study design; the collection, analysis, or interpretation of data; the writing of the report; or the decision to submit the manuscript for publication. CHV, WMO, and HJL had full access to all the data in the study and, together with the writing group, made the final decision to submit for publication.

Data sharing
The risk of bias characterization was done with large-batch customized Python scripts (version 3; https://github.com/wmotte/robotreviewer_prob). The data management and analyses used R (version 3.6.1). All data, including code and risk of bias data, are available at https://github.com/wmotte/frrp.


Data collection procedures
Step 1. Identification of all human randomized controlled trials (RCTs). The Entrez API enables access to the PubMed database and was used via R Statistical Software using the query: "randomized controlled trial[pt] NOT (animals[mh] NOT humans[mh])". Basic information (including authors, affiliations, journal, trial registry numbers, language, and funding agency) which is indexed by the PubMed database was downloaded.
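As an illustration of Step 1, the following Python sketch builds a request URL for the NCBI E-utilities ESearch endpoint using the query from the text. The authors used R; this fragment is not their script. The paging values and the helper name `build_esearch_url` are assumptions, and no network request is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query string as given in the text.
QUERY = "randomized controlled trial[pt] NOT (animals[mh] NOT humans[mh])"


def build_esearch_url(term: str, retstart: int = 0, retmax: int = 100) -> str:
    """Build a PubMed ESearch URL; retstart/retmax page through the hits."""
    params = {
        "db": "pubmed",
        "term": term,
        "retstart": retstart,
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url(QUERY)

# Round-trip check: decoding the URL gives back the original query string.
decoded_term = parse_qs(urlsplit(url).query)["term"][0]
```

Fetching the URL returns the PubMed identifiers of matching records, which can then be used to download the indexed basic information.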
Step 2. Filtering identified studies. Not all automatically identified studies were RCTs, despite the query in Step 1. To exclude potential contamination of the data by non-randomized, pilot, and feasibility studies, a stricter selection of studies was made based on the title and abstract.
Articles were automatically excluded in the following conditions:
-When the title contained: "study protocol", "study design", "protocol for", "pilot" or "feasibility"
-When the abstract contained: "pilot study" or "feasibility study"
-When the title or abstract did not contain: "random" (in title and abstract) OR "assign" OR "allocat*" OR "placebo" OR "double-blind" (in abstract)
-When language was other than English
Step 3. Downloading PDFs of the remaining studies after Step 1 and 2. We used R scripts to download the PDF of each publication via the website of the respective publisher [...]
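The exclusion rules of Step 2 can be sketched in Python as below. This is one possible reading, not the authors' implementation: it assumes case-insensitive substring matching, that "random" is searched in both title and abstract while the other design terms are searched in the abstract only, and that English is coded as "eng" or "english". The function name `keep_article` is hypothetical.

```python
TITLE_EXCLUDE = ("study protocol", "study design", "protocol for",
                 "pilot", "feasibility")
ABSTRACT_EXCLUDE = ("pilot study", "feasibility study")
# "allocat" stands in for the wildcard "allocat*" via substring matching.
ABSTRACT_INCLUDE = ("assign", "allocat", "placebo", "double-blind")


def keep_article(title: str, abstract: str, language: str) -> bool:
    """Return True if the article passes all Step 2 filters."""
    title_l, abstract_l = title.lower(), abstract.lower()

    # Assumed language coding: "eng" or "english".
    if language.lower() not in ("eng", "english"):
        return False
    if any(term in title_l for term in TITLE_EXCLUDE):
        return False
    if any(term in abstract_l for term in ABSTRACT_EXCLUDE):
        return False

    # Require at least one indication of a randomized design.
    has_random = "random" in title_l or "random" in abstract_l
    has_design_term = any(term in abstract_l for term in ABSTRACT_INCLUDE)
    return has_random or has_design_term


kept = keep_article("A randomized trial of drug X",
                    "Patients were allocated to drug X or placebo.", "eng")
```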


Gender of first and last author
Based on the first name of an author, the Genderize API (https://genderize.io/) estimates the probability that this person is male or female. The gender with the highest probability was assigned to the author's name.

Proportion of female co-authors
The putative gender of all authors was determined and combined into a proportion of all authors of the publication at issue.
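The assignment rule and the resulting proportion can be illustrated with a short Python sketch. It does not call the Genderize service; the input is a hypothetical list with the estimated probability of "female" per author (None when no estimate is available). Treating a probability of exactly 0.5 as unknown and leaving unknown authors out of the denominator are assumptions not stated in the text.

```python
def assign_gender(p_female):
    """Assign the gender with the highest probability.

    Returns None when no estimate is available or when the estimate
    is exactly 0.5 (assumed handling of ties).
    """
    if p_female is None or p_female == 0.5:
        return None
    return "female" if p_female > 0.5 else "male"


def proportion_female(p_female_by_author):
    """Proportion of authors assigned 'female' among authors with an assignment."""
    genders = [assign_gender(p) for p in p_female_by_author]
    known = [g for g in genders if g is not None]
    if not known:
        return None
    return sum(g == "female" for g in known) / len(known)


# Hypothetical probabilities for a four-author publication.
share = proportion_female([0.9, 0.2, 0.8, None])
```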

Number of authors
Continuous number.

Number of countries
Total number of countries of (co-)authors.

Number of institutions
Total number of institutions of (co-)authors.

H(irsch)-index of first and last author
The H-index of the first and last author at time of publication was obtained from the Scopus web portal.

C. Journal
Medical discipline
Categories as downloaded from Web of Science InCites Journal Citation Reports, shortened list. If a journal was listed in multiple categories, the more specific category prevailed (e.g. cardiovascular vs. general medicine, or neurology vs. oncology).

Journal impact factor of the year before publication
JIF was extracted for each journal from Web of Science data for the period 1997-2016, as no earlier time points were available. The JIFs of 1997 were assigned to RCTs published before 1996.
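The choice of JIF year can be sketched as follows. The text specifies the JIF of the year before publication and the fallback to the 1997 values for early trials; clamping every out-of-range year to the available 1997-2016 window, including late publication years, is an assumption of this sketch, as are the function names. The thresholds of 10 (primary) and 3 and 5 (sensitivity analyses) follow the main text.

```python
def jif_lookup_year(publication_year: int, first_available: int = 1997,
                    last_available: int = 2016) -> int:
    """Year whose JIF is used for a trial published in `publication_year`.

    Uses the year before publication, clamped to the window for which
    JIF data are available (assumed handling at both ends).
    """
    target = publication_year - 1
    return max(first_available, min(target, last_available))


def is_high_impact(jif: float, threshold: float = 10.0) -> bool:
    """True if the JIF exceeds the cutoff (primary cutoff: 10)."""
    return jif > threshold


# Example: a trial published in 2005 uses the 2004 JIF.
year_used = jif_lookup_year(2005)
```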
'blinding outcome': bias in blinding of outcome assessment.