Accuracy and clinical effectiveness of risk prediction tools for pressure injury occurrence: An umbrella review

Pressure injuries (PIs) pose a substantial healthcare burden and incur significant costs worldwide.


INTRODUCTION
Pressure injuries (PIs), also known as pressure ulcers or decubitus ulcers, have an estimated global prevalence of 12.8% among hospitalised adults, 1 and place a significant burden on healthcare systems (estimated at $26.8 billion per year in the US alone 2 ). PIs are most common in individuals with reduced mobility, limited sensation, poor circulation, or compromised skin integrity, and can affect those in community settings and long-term care as well as hospital settings. Effective prevention of PI requires multicomponent preventive strategies such as mattresses, overlays, and other support systems, nutritional supplementation, repositioning, dressings, creams, lotions, and cleansers. 3 It is therefore important to correctly identify those most at risk of PI to allow timely and targeted implementation of preventive measures, to reduce harm and consequently burden to healthcare systems. 4 Numerous clinical assessment scales (e.g. Braden 5 6 , Norton 7 and Waterlow 8 ) and statistical risk prediction models for assessing the risk of PI are available; however, many are limited by reliance on subjective clinical judgment and do not appear to meet basic standards for the development or reporting of risk prediction models. 9 Nevertheless, many such tools are in routine clinical usage. For example, in certain hospitals and long-term care settings in the US, healthcare professionals must conduct mandatory risk assessments for PI for all patients for the purposes of risk stratification and clinical triage.
Despite the apparent lack of sound methods for development and validation (including external validation) of available risk prediction tools, there is a considerable body of evidence evaluating their clinical utility, much of which has been synthesised in systematic reviews and meta-analyses. 9 Clinical utility includes both prognostic accuracy and clinical effectiveness. Prognostic accuracy is estimated by applying a numeric threshold above (or below) which there is a greater risk of PI, with study results presented using accuracy metrics such as sensitivity, specificity or the area under the receiver operating characteristic (ROC) curve. 10 Resulting accuracy is driven not only by the nominated threshold for defining participants as at low or high risk for PI but by other study factors including population and setting. 11 Clinical effectiveness, or the ability of a tool to impact on health outcomes such as the incidence or severity of PI, is related both to the accuracy of the tool (or its ability to correctly identify those most likely to develop PI) and to the uptake and implementation of the tool in practice. Demonstrating a change in health outcomes as a result of use of a risk prediction tool is vital to encourage implementation. 12 Using an umbrella review approach, we aimed to provide a comprehensive overview of available systematic reviews that consider the prognostic accuracy and clinical effectiveness of PI risk prediction tools.
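As a minimal sketch of how a threshold turns a risk score into sensitivity and specificity, the Python function below dichotomises scores and tabulates them against observed PI outcomes. The function name, scores, outcomes and thresholds are all hypothetical and chosen for illustration; the sketch assumes that lower scores indicate higher risk, which holds for some scales but not all.

```python
def accuracy_at_threshold(scores, developed_pi, threshold, low_score_is_high_risk=True):
    """Return (sensitivity, specificity) for one threshold.

    scores: risk scale totals, one per patient.
    developed_pi: True if the patient later developed a PI.
    """
    tp = fp = fn = tn = 0
    for score, event in zip(scores, developed_pi):
        # Classify the patient as high risk according to the scale's direction.
        if low_score_is_high_risk:
            flagged = score <= threshold
        else:
            flagged = score >= threshold
        if flagged and event:
            tp += 1
        elif flagged:
            fp += 1
        elif event:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical cohort of six patients (illustrative values only).
scores = [10, 12, 14, 16, 18, 20]
developed_pi = [True, True, False, True, False, False]

for threshold in (12, 14, 16):
    sens, spec = accuracy_at_threshold(scores, developed_pi, threshold)
    print(f"threshold <= {threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this toy cohort, moving the threshold changes both metrics, which illustrates why an accuracy estimate is only interpretable alongside the threshold that produced it.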

Protocol registration and reporting of findings
We followed Cochrane guidance for conducting umbrella reviews 13 , and 'Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy Studies' (PRISMA-DTA) reporting guidelines 14 (see Appendix 1). The protocol was registered on Open Science Framework (https://osf.io/tepyk).

Literature search
Electronic searches of MEDLINE and Embase via Ovid, and CINAHL Plus via EBSCO, from inception to January 2023 were developed and conducted by an experienced information specialist (AC), employing well-established systematic review and prognostic search filters, [15][16][17] combined with appropriate keywords related to PIs. Simplified supplementary searches in EPISTEMONIKOS and Google Scholar were also undertaken (see Appendix 2 for further details). Screening of search results and full texts was conducted independently and in duplicate by two reviewers (BH, JD, YT, KS), with disagreements resolved by a third reviewer.

Eligibility criteria for this umbrella review
Published English-language systematic reviews of risk prediction tools developed for adult patients at risk of PI in any setting were included. Clinical risk assessment scales and models developed using statistical or machine learning (ML) methods were eligible (models exclusively using pressure sensor data were not considered). Risk prediction tools could be applied by any healthcare professional using any threshold for classifying patients as high or low risk and using any PI classification system 18-21 as a reference standard. For prognostic accuracy, we required accuracy metrics, such as sensitivity and specificity, to be presented but did not require full 2x2 classification tables to be reported. Reviews on diagnosing or staging suspected or existing PIs were excluded.
To be considered 'systematic', reviews were required to report a thorough search of at least two electronic databases and at least one other indication of systematic methods (e.g. explicit eligibility criteria, formal quality assessment of included studies, adequate data presentation for reproducibility of results, or review stages (e.g. search screening) conducted independently in duplicate).

Data extraction and quality assessment
Data extraction forms (Appendix 3) were informed by the CHARMS checklist (CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies) and Cochrane Prognosis group template. 22 23 Data extraction items included review characteristics, number of studies and participants, study quality and results.
The methodological quality of included systematic reviews was assessed using AMSTAR-2 (A Measurement Tool to Assess Systematic Reviews) 24 , adapted for systematic reviews of risk prediction models (Appendix 4). Our adapted AMSTAR-2 contains six critical items, and limitations in any of these items reduce the overall validity of a review. 24 Quality assessment and data extraction were conducted by one reviewer and checked by a second (BH, JD, KS), with disagreements resolved by consensus.

Synthesis methods
Reviews about prognostic accuracy and clinical effectiveness of risk prediction tools were considered separately. Review methods and results were tabulated and a narrative synthesis provided. Prognostic accuracy results from reviews including a statistical synthesis were tabulated according to risk prediction tool.
Considerable overlap in risk prediction tools and included primary studies was noted between reviews. For risk prediction tools that were included in multiple meta-analyses, we focused our synthesis on the review(s) with the most recent search date or most comprehensive (based on number of included studies) and most robust estimate of prognostic accuracy (judged according to the appropriateness of the meta-analytic method used, e.g. use of recommended hierarchical approaches for test accuracy data 25 ). The prognostic accuracy of risk prediction tools that were included in three or fewer reviews was reported only if an appropriate method of statistical synthesis 13 was used.
For clinical effectiveness results, reviews with the most recent search date or most comprehensive overview of available studies and that at least partially met more of the AMSTAR-2 criteria 24 were prioritised for narrative synthesis.

Characteristics of included reviews
A total of 110 records were selected for full-text assessment from 6302 unique records. We could obtain the full text of 104 publications, of which 23 reviews met all eligibility criteria (Figure 1); 16 reported accuracy data [24][25][26][27][28][29][30][31][32][33][34][35][36][37][38][39] and 10 reported clinical effectiveness data 25 29 34 40-46 (three reported both accuracy and effectiveness data 25 29 34 ). [...] 1), three included any age group, and nine (39%) did not report any age restrictions. Six reviews (6/23, 26%) specified only populations without PIs at baseline for inclusion. Acute care was the most common setting across both review questions, 5/16 (31%) and 4/10 (40%) for accuracy and effectiveness reviews, respectively. Quality assessment tools varied, with QUADAS-2 (n=7) or QUADAS (n=2) being most common for reviews of accuracy (9/16, 56%). One accuracy review 30 reported use of both QUADAS-2 and PROBAST tools in their methods, but only reported QUADAS-2 results.
Table 1 footnotes: [...] restricted to aged >60 years; B one review 29 states either prospective or retrospective data eligible for Research Question 1, but prospective only for Research Question 2, hence 0.5 added to each category; C including databases, bibliographies or registries; D reviews may fall into multiple categories, therefore total number within domain not necessarily equal to N (100%); E one review 30 reported use of PROBAST in methods, but did not present any PROBAST results; F one review conducts univariate meta-analysis for single estimate, e.g. AUC 31 , RR 32
Reviews of accuracy predominantly focused on studies using any (5/16, 31%) or pre-specified (8/16, 50%) risk assessment tools or scales; one included only ML-based prediction models. 30 A total of 63 risk prediction tools were reported across the reviews, including 24 ML models. The number of included risk prediction tools in a single review ranged from one [34][35][36][37][38][39] to 28 32 . Only two reviews reported eligibility criteria related to the development or validation of the risk prediction tools. One 33 (6%) excluded evaluation studies that used the same data that was used to develop the tool and the other 29 included only "validated risk assessment instruments"; however, this was not further defined and the review included studies reporting the original development of risk prediction tools.
The majority (13/16, 81%) of accuracy reviews conducted a statistical synthesis of data; however, only two utilised currently recommended hierarchical approaches for the meta-analysis of test accuracy data, 36 40 seven conducted univariate meta-analysis of individual accuracy measures (e.g. sensitivity and specificity separately, or AUC 31 , RR 32 or odds ratio 33 ) and four did not clearly report the type of analysis approach used.
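To make the distinction concrete, the sketch below shows one common form of univariate pooling: a DerSimonian-Laird random-effects meta-analysis of a single proportion (for example sensitivity) on the logit scale. It is an illustration under stated assumptions rather than the method of any included review; the function name and example counts are hypothetical, and no continuity correction is applied, so every study must have at least one event and one non-event. Because sensitivity and specificity are pooled separately, this approach ignores their correlation across studies, which is the limitation that the recommended hierarchical (bivariate or HSROC) models address.

```python
import math


def pool_logit_proportions(events, totals):
    """DerSimonian-Laird random-effects pooling of proportions on the logit scale.

    events[i] / totals[i] is a study-level proportion, e.g. sensitivity
    (true positives / patients who developed a PI).
    Returns (pooled proportion, lower 95% limit, upper 95% limit, tau^2).
    """
    # Logit-transformed proportions and their approximate within-study variances.
    y = [math.log(x / (n - x)) for x, n in zip(events, totals)]
    v = [1 / x + 1 / (n - x) for x, n in zip(events, totals)]

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1 / vi for vi in v]
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))

    # Method-of-moments estimate of between-study variance, truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c) if c > 0 else 0.0

    # Random-effects weights and pooled logit with a Wald 95% interval.
    w_re = [1 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    def expit(z):
        return 1 / (1 + math.exp(-z))

    return expit(pooled), expit(pooled - 1.96 * se), expit(pooled + 1.96 * se), tau2


# Hypothetical true-positive counts and numbers of patients who developed a PI.
print(pool_logit_proportions([30, 22, 41], [40, 35, 50]))
```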
Of the 10 systematic reviews evaluating the clinical effectiveness of risk prediction tools, two only considered the reliability of risk assessment scales 41 42 and eight considered effects on patient outcomes (one of which also considered tool reliability 43 ). More than half of reviews (6, 60%) compared use of PI risk assessment scales to clinical judgement alone or 'standard care'. The number of included studies ranged from one 44 to 20 45 and the sample sizes of primary studies ranged from one (one subject and 110 raters, in an inter-rater reliability study 46 ) to 3,027 patients. Reported outcomes included the incidence of PIs (7/10), preventative interventions prescribed (5/10) and interrater reliability (3/10) (reported in Appendix 5). One (Cochrane) review used the Cochrane RoB tool for quality assessment of included studies and three used JBI (n=2) or CASP (n=1) tools. Due to heterogeneity in study design, risk prediction tools and outcomes evaluated, none of the included reviews provided any form of statistical synthesis of study results.
Of the 16 accuracy reviews, four (25%) 30 35 36 40 used an appropriate method of quality assessment of included studies (i.e. QUADAS or QUADAS-2 dependent on publication year) and presented judgements per study. Of the 10 effectiveness reviews, two (20%) 29 47 used an appropriate method of quality assessment (the Cochrane tool for assessing risk of bias 48 and criteria consistent with the AHRQ Methods Guide for Effectiveness and Comparative Effectiveness Reviews 49 , respectively) and provided judgements per study. Four reviews either reported quality assessment results per study (n=3 41 45 50 ) or were considered to use an appropriate quality assessment tool (n=1 33 ) (AMSTAR-2 criterion partially met).
Of the accuracy reviews that included a statistical synthesis, 31% (4/13) 31 32 36 40 used an appropriate method of meta-analysis and investigated sources of heterogeneity. Two reviews 36 40 used recommended hierarchical approaches to meta-analysis of test accuracy data (the bivariate model 36 and hierarchical summary ROC (HSROC) model 40 ) and one 31 calculated summary AUC using random effects meta-analysis. 32 Compared to the reviews of accuracy, reviews of effectiveness more commonly provided adequate descriptions of primary studies (8/10, 80% vs 4/16, 25%) (Figure 2). No other major differences across review questions were noted.

Results from reviews evaluating the prognostic accuracy of risk prediction models
Five of 16 accuracy reviews were prioritised for narrative synthesis (Tables 2-3) and are reported below according to risk prediction tool. Four of the five reviews did not include development study estimates within their meta-analyses, but this information could not be ascertained for the review 30 of ML-based models. None of these reviews assessed the quality of development methods for the prediction tools considered in their statistical syntheses.

Braden, and modified Braden scales
The most recent and largest review 36 noted a high risk of bias for the 'index test' section of the QUADAS-2 assessment in approximately a third of included studies, but failed to provide details or reasons for this assessment.

Cubbin & Jackson scale
Zhang and colleagues 40 included six studies evaluating the original Cubbin & Jackson scale 54 (800 patients). Summary sensitivity and specificity were both reported as 0.84 (95% CIs 0.59, 0.95 and 0.66, 0.93, respectively), 40 suggesting that this represents the point on the HSROC curve where sensitivity equals specificity, particularly as reported thresholds ranged from 24 to 34. The review authors concluded that although the accuracy of the Cubbin & Jackson scale was higher than the EVARUCI scale and the Braden scale, low quality of evidence and significant heterogeneity limit the strength of conclusions that can be drawn.
Table 2 (excerpt): Zhang 40 (2021): 4 tools evaluated in meta-analyses; studies not described according to prediction tool; all prospective; QUADAS-2: overall judgement was "not so satisfactory".

Norton scale
Park and colleagues 53 pooled data from seven studies (2,899 participants) evaluating the Norton scale, across thresholds ranging from <14 to <16. They reported summary sensitivity of 0.75 (95% CI 0.70, 0.79) and specificity 0.57 (95% CI 0.55, 0.59). A further four reviews presented statistically synthesised results for the Norton scale (Appendix 5), including one review by Chou and colleagues 29 which included nine studies (5,444 participants) but only reported median values for accuracy parameters.
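As a worked illustration of what such summary estimates would imply in practice, the sketch below converts sensitivity and specificity into likelihood ratios and predictive values using Bayes' theorem. The prevalence of 12.8% is an assumption borrowed from the global estimate quoted in the Introduction, not a figure from the Norton studies, and the summary estimates themselves were pooled across thresholds, so the output should be read as arithmetic only and not as a clinical result.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV, LR+, LR-) via Bayes' theorem."""
    # Expected proportions of the population in each cell of the 2x2 table.
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return ppv, npv, lr_pos, lr_neg


# Summary estimates reported for the Norton scale; the prevalence is an
# assumption taken from the global estimate quoted in the Introduction.
ppv, npv, lr_pos, lr_neg = predictive_values(0.75, 0.57, 0.128)
print(f"PPV={ppv:.3f}, NPV={npv:.3f}, LR+={lr_pos:.2f}, LR-={lr_neg:.2f}")
```

Under these assumptions, roughly one in five patients flagged as high risk would go on to develop a PI, which shows how a modest specificity limits the value of a positive result when the outcome is uncommon.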

Waterlow scale
Although Zhang and colleagues 40

Machine learning algorithms
Qu and colleagues 30 conducted separate meta-analyses of 25 studies according to ML algorithm type (Table 2). The review rated critically low on AMSTAR-2 items, with only 6/15 (40%) criteria fulfilled, and reported using Bayesian DTA meta-analysis. The review did not restrict inclusion to external evaluations of the models, and the authors did not report which estimates were sourced from development data or external data. The summary AUC for the five algorithms ranged from 0.82 (95% CI 0.79, 0.85; 9 studies with 97,815 participants) for neural network-based models to 0.95 (95% CI 0.93, 0.97; 7 studies with 161,334 participants) for random forest models (Table 3). The latter approach also had the highest summary specificity 0.96 (95% CI 0.80, 0.99), with sensitivity 0.72 (95% CI 0.26, 0.95). The highest summary sensitivity was observed for support vector machine models (0.81, 95% CI 0.69, 0.90) with summary specificity 0.81 (95% CI 0.59, 0.93) (9 studies, 152,068 participants). The remaining algorithms had summary sensitivities ranging from 0.66 (decision tree models) to 0.73 (neural network models) (Table 3). Two additional ML algorithms evaluated in the included studies (Bayesian networks and LOS (abbreviation not explained)) had too few studies to allow meta-analysis (Appendix 5).
Beyond the results covered by our five prioritised reviews, three further modifications of the Braden scale were evaluated in statistical syntheses: Braden modified by Kwong 60 , the 4-factor model 61 and 'extended Braden' 61 , revealing variable performance with high uncertainty. 32 29 42 Another two modified versions of the Norton scale (by Ek 62 , and by Bienstein 63 ) were also included in one review's meta-analyses 32 , but only risk ratios were reported. Three additional scales (revised "Jackson & Cubbin" 64 , EMINA 65 and PSPS 66 ) were evaluated in one statistical synthesis each. 32 29 Full details can be found in Appendix 5 Table S4.
Appendix 5 Table S5 reports data for another 17 risk prediction tools, each associated with a single primary study (therefore not covered in detail in the text above), and another two tools, Sunderland 67 and RAPS 68 , which are assessed in two primary studies each.

Results from reviews evaluating the clinical effectiveness of risk prediction models
Table 4 provides an overview of results from four 29 45 47 50 69 of the 10 reviews reporting clinical effectiveness, including one Cochrane review 47 which identified two randomised controlled trials (RCTs) of risk prediction tools and assessed risk of bias using the Cochrane RoB tool 48 . The remaining reviews used broader eligibility criteria for study inclusion and a range of different quality assessment tools, with some reviews reaching varying conclusions about the methodological quality of the same studies. Given the overlap in study inclusion between reviews, a summary of the included comparative studies is provided below.
One individually randomised trial (Webster and colleagues 70 ) and one cluster randomised trial (Saleh and colleagues 71 ) were considered to be at high risk of bias by the Cochrane review authors. The individually randomised trial 70 was included in three additional reviews 29 44 47 50 , each of which considered the trial to be 'good quality' 29 , 'valid' 44 , or 'high quality' 50 . The trial was conducted in 1,231 hospital inpatients and found no evidence of a difference in PI incidence between patients assessed with either the Waterlow scale or Ramstadius tool compared with clinical judgment alone (RR 1.10, 95% CI 0.68, 1.81 for Waterlow and RR 0.79, 95% CI 0.46, 1.35 for Ramstadius). The trial further showed no evidence of a difference in patient management or in PI severity when using a risk assessment tool compared to clinical judgement.
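For readers less familiar with how such risk ratios and confidence intervals are derived, the sketch below computes an RR and an approximate 95% CI from event counts using the standard normal approximation on the log scale. The counts are hypothetical and are not taken from any trial discussed here; the function name is ours, and the trial authors may have used a different interval method.

```python
import math


def risk_ratio(events_tool, n_tool, events_control, n_control):
    """Risk ratio with a 95% CI from the normal approximation on the log scale."""
    rr = (events_tool / n_tool) / (events_control / n_control)
    # Standard error of log(RR) for two independent binomial samples.
    se_log_rr = math.sqrt(
        1 / events_tool - 1 / n_tool + 1 / events_control - 1 / n_control
    )
    lower = math.exp(math.log(rr) - 1.96 * se_log_rr)
    upper = math.exp(math.log(rr) + 1.96 * se_log_rr)
    return rr, lower, upper


# Hypothetical counts, not taken from any trial discussed here.
rr, lower, upper = risk_ratio(10, 100, 20, 100)
print(f"RR={rr:.2f} (95% CI {lower:.2f}, {upper:.2f})")
```

An interval that spans 1, as in this hypothetical example and in both comparisons reported for the trial above, is what is meant by "no evidence of a difference".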
The cluster randomised trial 71 was considered to be of poor methodological quality in two reviews. 29 47 The trial included 521 patients at a military hospital and compared nurse training with mandatory use of the Braden scale, to nurse training and optional use of the Braden scale, to no training. No evidence of a difference in PI incidence was observed between groups: incidence rates were 22%, 22% and 15% (p=0.38), for the three groups respectively.
In both reviews by Lovegrove and colleagues, 45 50 an uncontrolled comparison study 72 was included. The study assessed the clinical effectiveness of the Maelor scale, 73 and was rated as high quality within the most recent review. 50 Preventive strategies and PI prevalence were compared across two sites, an Irish hospital that used the Maelor scale (121 patients) and a Norwegian hospital that used nurses' clinical judgement (59 patients). A higher rate of preventive strategies, as well as a lower PI prevalence (12% vs. 54%), was reported for the Irish hospital. However, these results are likely to be highly confounded by inherent differences in population and setting.
A non-randomised study by Gunningberg and colleagues 74 was included in two reviews, one of which is reported in Table 4 33 69 and was considered to be of relatively high quality. The study was conducted in 124 patients in emergency and orthopaedic units and compared the use of a pressure ulcer risk alarm sticker for patients with a modified Norton Score of <21 (indicating high-risk patients) to standard care. No significant difference in the incidence of pressure ulcers between the Norton scale and standard care groups was observed.
A non-randomised study 75 conducted in 233 hospice inpatients was included in three reviews, 29 33 69 one of which is reported in Table 4. 69 The study met six of eight quality criteria used by Health Quality Ontario. 69 Use of a modified version of the Norton scale (Norton modified by Bale), in conjunction with standardised use of preventive interventions based on risk score, was found to be associated with lower risk of pressure ulcers when compared with nurses' clinical judgment alone (RR 0.11, 95% CI 0.03, 0.46). The lack of randomisation limits the reliability of this result, and review authors report that the modified Norton scale had not been validated.
Finally, a 'before-and-after' study 76 of 181 patients in various hospital settings was included in two reviews. 33 69 Health Quality Ontario considered the study to meet all quality criteria. 69 Use of the Norton scale with additional training for staff was associated with significant differences in the number of preventative interventions prescribed compared to standard care (18.96 vs. 10.75, respectively). Preventative interventions were also introduced earlier in the intervention group (on day 1, 61% vs. 50%, P<0.002 for Norton and usual care, respectively). However, no significant difference in the incidence of PIs was detected between the groups.

Table 4 (excerpt):
• Compared a strategy that gave high-risk patients (based on modified Norton score) a risk alarm sticker to standard care. No significant difference between the groups in the incidence of PIs (Gunningberg 1999 74 ).
• Compared a strategy where patients received a pressure support system allocated according to the modified Norton scale to one where the nurse chose whether to give a special mattress. Using the scale significantly reduced the incidence of PIs (22.4% vs. 2.5%, p<0.0001) (Bale 1995 75 ).
• Compared the Norton scale with training to standard care. There was a significant difference in the number of preventative interventions (18.96 vs. 10.75, for Norton and usual care respectively). Interventions were used earlier for Norton vs. usual care (on day 1, 61% vs. 50%, p<0.002). No significant difference in the incidence of PIs between the groups (Hodge 1990 76 ).
AHRQ - Agency for Healthcare Research; CASP - Critical Appraisal Skills Checklist; CI - confidence interval; ICU - intensive care unit; JBI - Joanna Briggs Institute; PI - pressure injury; RCT - randomised controlled trial

DISCUSSION
This umbrella review summarises data from a total of 23 systematic reviews of studies evaluating the clinical utility of a total of 63 PI risk prediction tools. Despite the large number of available reviews, quality assessment using an adaptation of AMSTAR-2 suggested that the majority were conducted to a relatively poor standard or did not meet reporting standards for systematic reviews. 14 26 Of the 15 items included in AMSTAR-2, only two (for accuracy reviews) and four (for effectiveness reviews) criteria were more consistently met (more than 60% of reviews scoring 'Yes'). All other criteria were fully met by less than half of reviews. The primary studies included in the reviews were particularly poorly described in the accuracy reviews, making it difficult to determine exactly what was evaluated and in whom. In particular, the source of the data was poorly reported. Only one review 33 explicitly restricted to accuracy estimates from external validations, and only one review 29 described whether estimates were sourced from training or external data. The extent to which we could reliably describe and comment on the content of the reviews is limited and high-quality evidence for the accuracy and clinical effectiveness of PI risk prediction models may be lacking.

Prognostic accuracy of risk prediction models
Of the 16 reviews focused on the predictive accuracy of included models, only two used appropriate methods for both quality assessment and statistical synthesis of accuracy data 36 40 , one of which 36 evaluated only the Braden scale. Only one review 33 pre-specified the exclusion of studies reporting tool development only, one review restricted to "validated risk assessment instruments" only 29 , and none of the reviews discussed the importance of appropriate validation of prediction models. Only two reviews conducted meta-analyses at different cut-offs for determination of high risk 29 36 ; the remaining reviews combined data regardless of the threshold used. Combining data across different thresholds to estimate summary sensitivity and specificity is discouraged as it yields clinically uninterpretable and non-generalisable estimates, because the estimates do not relate to a particular threshold. 25 Results of meta-analyses suggested that risk prediction scales have moderate sensitivities and somewhat lower specificities, typically in the range of around 70% to 85% for sensitivity and as low as 30% to 40% for specificity for some tools. Without a detailed review of the primary study publications for these models, it is not possible to assess which, if any, of these risk assessment scales might outperform the others. It seems that few studies directly comparing the accuracy of different tools are available.
For the ML-based models, one review 30 meta-analysed accuracy data by algorithm type. The results of the meta-analyses are not informative for clinical practice but may be a useful way of identifying which ML algorithms may be more suited to the data. Results suggested that specificities for random forest or decision tree models could reach 90% or above with associated sensitivities in the range of 66% to 72%; however, relatively wide confidence intervals around these summary estimates reflect considerable variation in model performance. Moreover, some of these estimates came from internal validations within model development studies, and may not be transferable to other settings. 77 Authors should make it clear where accuracy estimates are derived from to avoid overinterpretation of results.

Clinical effectiveness of risk prediction scales
Prediction models, like any test used for diagnostic or prognostic purposes, require evaluation in the care pathway to identify the extent to which their use can impact on health outcomes. 78 Of the 10 reviews assessing clinical effectiveness of PI risk prediction tools, the only primary studies suggesting potential patient benefits from the use of risk prediction tools 72 75 76 were non-randomised and are likely to be at high risk of bias. In contrast, two randomised trials 70 71 (both considered at high risk of bias by the Cochrane review 47 ) suggest that use of structured risk assessment tools does not reduce the incidence of PIs. We should recognise that effectiveness outcomes largely depend on the availability and efficacy of preventative measures, and conclusions regarding the clinical effectiveness of these risk assessment tools cannot be confidently drawn from the limited evidence available. All reviews included studies that assessed the use of risk assessment scales developed by experts, and no evidence is available evaluating the clinical effectiveness of empirically derived prediction models or ML algorithms.

Other existing evidence
Moore and colleagues 47 recently updated their review (published after our search was conducted 79 ) and reported no new randomised trials that assessed the effect of risk assessment tools on PI incidence.
We have separately reviewed the available evidence 9 for the development and validation of risk prediction tools for PI occurrence. Almost half (52/116, 45%) of available tools were developed using ML methods (as defined by review authors), 40% (46/116) were based on clinical expertise or unclear methods, and only 18 (16%) were identified as having used statistical modelling methods. The reviews varied in methodological quality and reporting; however, the reporting of prediction model development in the original primary studies appears to be poor. For example, across all prediction tools identified, the internal validation approach was unclear and unidentifiable for 70% (81/116) of tools, and only one review identified and included external validation studies (n=7 studies). ML-based models may have potential for identifying those at risk of PI, as suggested by one review 30 included in this umbrella review. However, it is important to consider the lack of transparency in reporting of model development methods and model performance, and the concerning lack of model validation in populations outside of the original model development sample. 9

Strengths and limitations
We have conducted the first umbrella review that summarises the prognostic accuracy and clinical effectiveness of prediction models for risk of PI. We followed Cochrane guidance 13 , with a highly sensitive search strategy designed by an experienced information specialist. Although we excluded non-English publications due to time and resource constraints, where possible these publications were used to identify additional eligible risk prediction models. To some extent, our review is limited by the use of AMSTAR-2 for quality assessment of included reviews. AMSTAR-2 was not designed for assessing systematic reviews of diagnostic or prognostic studies. Although we made some adaptations, many of the existing and amended criteria relate to the quality of reporting of the reviews as opposed to methodological quality. There is scope for further work to establish criteria for assessing systematic reviews of prediction models.
The primary limitation of our study lies in the limited detail available on risk prediction tools and their performance within the included systematic reviews. To ensure comprehensive model identification, we adopted a broad definition of 'systematic', potentially influencing the depth of information provided in the reviews, and the reporting quality in many primary studies contributing to these reviews may be suboptimal. Notably, excluding ML-based models, over half of the existing risk prediction tools were published prior to 2000, before the publication of original versions of reporting guidelines for diagnostic accuracy studies 80 and risk prediction models. 81

CONCLUSIONS
In conclusion, this umbrella review comprehensively summarises the prognostic accuracy and clinical effectiveness of risk prediction tools for developing PIs. The included systematic reviews used poor methodology and reporting, limiting our ability to reliably describe and evaluate their content. ML-based models demonstrated potential, with high specificity reported for some models. Wide confidence intervals highlight the variability in current evaluations, and external validation of ML tools may be lacking. The prognostic accuracy of clinical scales and statistically derived prediction models has a substantial range of specificities and sensitivities, motivating further model development with high quality data and appropriate statistical methods.
Regarding clinical effectiveness, whether risk prediction tools reduce PI incidence remains unclear due to the overall uncertainty and potential biases in available studies. This underscores the need for further research in this critical area, once promising prediction tools have been developed and appropriately validated. In particular, the clinical impact of newer ML-based models currently remains largely unexplored. Despite these limitations, our umbrella review provides valuable insights into the current state of PI risk prediction tools, emphasising the need for robust research methods to be used in future evaluations.

Table 1. Summary of included systematic review characteristics

Item 5 - Study selection in duplicate?; Item 6 - Data extraction in duplicate?; Item 7 - Excluded studies list (with justifications)?; Item 8 - Included studies description adequate?; Item 9 - Assessment of RoB/quality satisfactory?; Item 10 - Studies' sources of funding reported?; Item 11 - Appropriate statistical synthesis method?; Item 12 - Assessment of impact of RoB on synthesised results?; Item 13 - Assessment of impact of RoB on review results?; Item 14 - Discussion/investigation of heterogeneity?; Item 15 - Conflicts of interest reported?; N/A - Not Applicable; RoB - Risk of Bias; QA - quality assessment. Further details on AMSTAR items are given in Appendix 4, and results per review are given in Appendix 5.

Table 2. Findings related to prognostic accuracy, by model: Characteristics and quality of studies included within reviews

Table 3. Summary estimates of accuracy parameters (main results from statistical syntheses), by prediction tool
* as reported in review's text. However, the table reports a mixture of female and male participants for all studies, with a mean female proportion of 50.73%.
AHCPR - Agency for Health Care Policy and Research; CI - confidence interval; CVD - cardiovascular disease; DTA - diagnostic test accuracy; EPUAP - European Pressure Ulcer Advisory Panel; (H)SROC - (hierarchical) summary receiver operating characteristic curve; ICU - intensive care unit; LTC(F) - long-term care (facility); ML - machine learning; N - number of participants; n - number of studies; NMA - network meta-analysis; NS - not stated; NPUAP - National Pressure Ulcer Advisory Panel; PI - pressure injury; PPPU - Panel for the Prediction and Prevention of Pressure Ulcers; QUADAS - Quality Assessment of Diagnostic Accuracy Studies; RCT - randomised controlled trial; RoB - risk of bias; TDCPS - Torrance Developmental Classification of Pressure Sore.

Table 4. Systematic reviews evaluating clinical effectiveness