Introducing the EMPIRE Index: A novel, value-based metric framework to measure the impact of medical publications

Article-level measures of publication impact (alternative metrics or altmetrics) can help authors and other stakeholders assess engagement with their research and the success of their communication efforts. The wide variety of altmetrics can make interpretation and comparative assessment difficult; available summary tools are either narrowly focused or do not reflect the differing values of metrics from a stakeholder perspective. We created the EMPIRE (EMpirical Publication Impact and Reach Evaluation) Index, a value-based, multi-component metric framework for medical publications. Metric weighting and grouping were informed by a statistical analysis of 2891 Phase III clinical trial publications and by a panel of stakeholders who provided value assessments. The EMPIRE Index comprises three component scores (social, scholarly, and societal impact), each incorporating related altmetrics indicating a different aspect of engagement with the publication. These are averaged to provide a total impact score and benchmarked so that a score of 100 equals the mean scores of Phase III clinical trial publications in the New England Journal of Medicine (NEJM) in 2016. Predictor metrics are defined to estimate likely long-term impact. The social impact component correlated strongly with the Altmetric Attention Score and the scholarly impact component correlated modestly with CiteScore, with the societal impact component providing unique insights. Analysis of fresh metrics collected 1 year after the initial dataset, including an independent sample, showed that scholarly and societal impact scores continued to increase, whereas social impact scores did not. Analysis of NEJM notable articles showed that observational studies had the highest total impact and component scores, except for societal impact, for which surgical studies had the highest score. 
The EMPIRE Index provides a richer assessment of publication value than standalone traditional and alternative metrics and may enable medical researchers to assess the impact of publications easily and to understand what characterizes impactful research.

In summary, during framework construction, a large set of publications was generated to gain an in-depth understanding of the statistical characteristics of altmetrics in a relevant sample. Publications of Phase III clinical trials were chosen for analysis because these studies typically require a high investment of resources and personnel and are most likely to have an impact on clinical practice. In addition, they are likely to be rich in metrics; the mean number of metric counts has a substantial effect on the size of the intercorrelation observed in a publication sample [22]. A series of statistical [...]

[...] Phase III sample were acquired on June 7, 2020 (approximately 1 year after the original metrics were acquired).

To enable analysis of temporal changes, both the updated reference sample and the prospective Phase III sample were divided into 12-month subsamples (May 1 to April 30) based on publication dates provided by Altmetric Explorer. Publications with a publication date before May 1, 2016 according to Altmetric Explorer were excluded.

NEJM notable articles sample
An additional independent sample was identified with which to assess framework performance in other types of clinical research, especially the utility of the societal impact component. Annually, the editor of the NEJM curates a selection of articles published in the journal that year that they believe have practice-changing potential ('notable articles'). We identified all of these articles for the years 2016, 2017, 2018, and 2019 [24-27], and obtained altmetrics for them on January 8, 2020. Articles were classified by the authors under a broad typology: interventional (studies describing an intervention with a medical treatment intended for clinical practice), observational (prospective and retrospective non-interventional studies), innovative (publications describing novel techniques or assays), and surgical.

Acquisition of altmetrics and other metrics
Data for all publications were obtained from the five sources listed below.
• Altmetric Explorer [6]: This was the primary source for altmetrics data as well as publication dates.
• PlumX [5]: In addition to a wide range of metrics similar to those provided by Altmetric Explorer, PlumX provided some unique metrics such as citations in articles classified by Medline's indexers as 'clinical practice guideline' (PubMed guidelines).
• Pubstrat Journal Database [28]: This was scraped to determine JIFs for journals identified by Altmetric Explorer in the acquired datasets.
[...]

In addition to these standard metrics, original tweets and retweets (provided by Altmetric.com) were obtained for the reference Phase III sample.

In a similar way to the exploratory analysis of Costas et al. (2015), an 'altmetrics-driven' universe of publications was created in which all publications had at least one altmetric or citation (via Altmetric Explorer) [12]. Costas et al. noted that this approach did not meaningfully affect the precision of altmetrics as predictive tools for citations, but did reduce the zero inflation that can confound statistical analysis.

Statistical analysis
Analyses were conducted in Microsoft Excel using the Analyse-it plugin (Analyse-it Software, Ltd., Leeds, United Kingdom). Descriptive statistics were obtained and Spearman rank correlations between individual altmetrics were calculated. In addition, exploratory factor analysis was used to provide insights into how best to group similar metrics. Factor analysis assumes that latent or underlying factors exist that causally influence the observations. For the purposes of metric [...]

NOMF003 EMPIRE Index development, 6 April 2020. medRxiv preprint doi: https://doi.org/10.1101/2021.04.13.21255419; this version posted April 20, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.

The basis of each predictor score was a multiple linear regression of the altmetrics included in the predictor against the total impact score in the reference Phase III sample. Weightings for each metric were calculated as follows:

[...]

where β is the regression coefficient from the linear regression, sum_m is the sum total of the incidence of the target metric in the reference sample, and sum_{m1,m2,…} is the sum total of all metrics included in the predictor score.
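The regression step described above can be illustrated with a minimal sketch. It covers only the estimation of the β coefficients and the variance explained (r²), not the subsequent weighting calculation. The original analyses were run in Microsoft Excel with Analyse-it; the pure-Python least-squares solver, the function names, the choice of two predictors (tweets and news mentions), and the synthetic data below are all assumptions made for illustration.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: one list of predictor values per publication (hypothetical metrics).
    targets: the total impact score of each publication.
    Returns [intercept, beta_1, ..., beta_k].
    """
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


def r_squared(rows, targets, coefficients):
    """Proportion of variance in the targets explained by the fitted model."""
    predicted = [
        coefficients[0] + sum(b * v for b, v in zip(coefficients[1:], row))
        for row in rows
    ]
    mean_y = sum(targets) / len(targets)
    ss_res = sum((y - p) ** 2 for y, p in zip(targets, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in targets)
    return 1 - ss_res / ss_tot


# Synthetic data: each row is [tweets, news mentions] for one publication,
# and the targets follow 2 + 3 * tweets + 0.5 * news exactly.
rows = [[1, 0], [0, 1], [2, 1], [3, 4], [5, 2]]
targets = [5.0, 2.5, 8.5, 13.0, 18.0]
coefficients = fit_linear_regression(rows, targets)
fit_quality = r_squared(rows, targets, coefficients)
```

With real altmetrics the fit would be far from exact, and a statistics package would normally be preferred over a hand-written solver.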

The initial search found 3498 Phase III clinical publications, of which altmetrics for 3450 were identifiable by PlumX and 2891 by Altmetric Explorer. The analysis set comprised 2891 articles with at least one metric identified by Altmetric Explorer, of which eight were unavailable in the PlumX dataset. Publication metric characteristics of this sample are shown in S1 Table. Several altmetrics had a very low density and were therefore excluded from further analysis (e.g. Weibo, LinkedIn, Google+, Pinterest, Q&A, peer review, video, and syllabi mentions). Some altmetrics were retained despite a low density as they were thought to provide unique insights relevant to the objectives (policy, patent, F1000Prime, Wikipedia, and guideline [from PlumX] mentions). Some metrics of high relevance (abstract and publication views and downloads) were discarded because the quality of the data was inconsistent; in particular, many papers had numerous citations and Mendeley readers without recorded views or downloads, suggesting that coverage was incomplete.

Journal-level metrics were not included in the EMPIRE Index total impact score or component scores, but they were considered potential components of predictor scores. Given that the coverage obtained with CiteScore was higher than with the other two journal-level metrics examined (JIF and Scimago Journal Ranking; S1 [...]

Three-factor analysis was conducted on the full range of metrics selected for inclusion (S3 Table). Two-factor analysis was also carried out on a subset of metrics excluding those with low incidence (policy document, PubMed guideline, and patent mentions) (S4 Table). These analyses revealed consistent groupings, such as Mendeley readers with Dimensions citations, and news, blog, and [...] (Table 1).

Weightings derived from the statistical approach were revised to reflect findings from stakeholder value assessments. The selected weightings and their contribution to the total impact score in the reference sample are shown in Table 2. In general, the approach taken was to balance the weighting such that the percentage contribution to scores in publications in the reference Phase III sample resembled the stakeholder value, while acknowledging relative importance (e.g. of news articles vs blogs) and prevalence (e.g. when Wikipedia entries were too infrequent to make a meaningful contribution without greatly inflated weighting relative to the value accorded by the stakeholder panel). To combine statistical and value-based weighting effectively, some related metrics were considered as combined entities (e.g. Twitter and Facebook mentions were allocated a combined 20% of points by stakeholders, and contributed a combined 17.7% to the total impact score in the reference Phase III sample).
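The balancing described above compares each metric's percentage contribution to the summed scores in a sample with the share of value that stakeholders assigned to it. A minimal sketch of that contribution calculation follows. The function name, the three metrics chosen, the weights, and the counts are hypothetical and are not the values in Table 2.

```python
def contribution_shares(counts_by_publication, weights):
    """Percentage of the summed weighted score attributable to each metric.

    counts_by_publication: one dict of metric counts per publication.
    weights: points awarded per count of each metric (hypothetical values).
    """
    totals = {metric: 0.0 for metric in weights}
    for counts in counts_by_publication:
        for metric, weight in weights.items():
            # A metric absent from a publication contributes nothing.
            totals[metric] += weight * counts.get(metric, 0)
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("sample has no weighted metric counts")
    return {metric: 100 * total / grand_total for metric, total in totals.items()}


# Hypothetical weights and a two-publication sample.
weights = {"news": 10, "twitter": 1, "citations": 5}
sample = [
    {"news": 2, "twitter": 20, "citations": 4},
    {"twitter": 10, "citations": 6},
]
shares = contribution_shares(sample, weights)
```

In this toy sample the shares can be compared directly with a stakeholder allocation; a weight would be raised or lowered when the observed share drifts too far from the value the panel assigned.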

Table 2. Weighting assigned to metrics included in the social, scholarly, and societal impact scores, along with their contribution to total impact scores in the reference sample.


Predictor scores
The variance in total impact scores explained by each predictor score was moderately high (early predictor vs total impact score, r² = 0.56; intermediate predictor vs total impact score, r² = 0.65; S2 Fig). An overall predictor score can be calculated as the average of early and intermediate predictor scores. The variance in total impact scores explained by the overall predictor score was also moderate (overall predictor vs total impact score, r² = 0.69). Weightings calculated for each of the variables in the predictor score are shown in Table 3.

In total, 74 Phase III publications from the NEJM published in 2016 were identified for the benchmark sample. The non-adjusted overall predictor score was selected as the benchmark for predictor scores, and the non-adjusted total impact score was selected for total, social, scholarly, and societal impact scores (Table 4).

The non-adjusted total impact score is the sum of the social, scholarly, and societal impact scores. The adjusted total impact score is the average of the adjusted component scores.

Dividing the non-adjusted total benchmark by 3 before applying it to the component scores had the effect of upscaling them so that the adjusted total impact score represents the mean of the components (rather than the sum, as in the non-adjusted total impact score). EMPIRE Index scores are calculated by dividing the non-adjusted score of interest by the appropriate benchmark and multiplying by 100.
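The scaling arithmetic can be checked with a short sketch. It assumes, as described above, that the benchmark is the mean non-adjusted total impact score of the benchmark sample and that one third of it is applied to each component. The function names and all numbers are hypothetical and are not values from Table 4.

```python
from statistics import mean


def total_benchmark(benchmark_totals):
    """Mean non-adjusted total impact score of the benchmark sample."""
    return mean(benchmark_totals)


def empire_scores(social, scholarly, societal, benchmark):
    """Scale non-adjusted scores so that the benchmark mean maps to 100."""
    # One third of the total benchmark is applied to each component.
    component_benchmark = benchmark / 3
    adjusted = {
        "social": 100 * social / component_benchmark,
        "scholarly": 100 * scholarly / component_benchmark,
        "societal": 100 * societal / component_benchmark,
    }
    # The non-adjusted total is the sum of the three components.
    adjusted["total"] = 100 * (social + scholarly + societal) / benchmark
    return adjusted


# Hypothetical non-adjusted totals for a three-paper benchmark sample.
benchmark = total_benchmark([90, 120, 150])
scores = empire_scores(social=20, scholarly=30, societal=10, benchmark=benchmark)
component_mean = mean(scores[name] for name in ("social", "scholarly", "societal"))
```

Here `component_mean` equals `scores["total"]`, which is the property the division by 3 is intended to guarantee: the adjusted total is the mean, not the sum, of the adjusted components.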

Characterization in samples used in development
The distributions of scores in the 1H and 2H subsets of the reference sample, and in the benchmark sample, are shown in Fig 3. Of note, social impact scores were lower and societal impact scores were higher in 1H than in 2H. Predictor scores were higher than total impact scores in the reference Phase III sample but not in the benchmark NEJM Phase III sample, and median social impact scores were closer to median total impact scores in the benchmark NEJM Phase III sample than in the reference Phase III sample.

The correlations between component scores, the AAS, and CiteScore are shown in Table 5. Correlations between component scores were relatively low, the greatest being between social and scholarly impact scores. The social impact score correlated strongly with the AAS, and both social and scholarly impact scores correlated modestly with CiteScore. However, the societal impact score is quite distinct from the AAS, CiteScore, and the other component scores. Although predictor scores were moderately successful at predicting the total impact score, they were only weakly related to the societal impact score.

[...]

Innovative studies had notably low societal impact, indicating that they were infrequently referenced in guidelines or policy documents (Fig 6).

[...] Impact, and the Inverse Altmetric Impact [18,38]. First, the value-based approach to the weighting and grouping of metrics recognizes that simple statistical associations may be sample-dependent and may not relate to conceptual underpinnings. Second, the EMPIRE Index is specifically [...] between metrics in various disciplines [10,15,19,20] and, given that the value of each metric is inherently subjective, this value is unlikely to be consistent across scholarly disciplines. Third, the EMPIRE Index is scaled against a clearly defined, relevant benchmark, because interpretation of a novel composite metric is difficult without such a reference point.

Such are the potential advantages of the EMPIRE Index. However, its utility is dependent on the robustness of the selection, grouping, and weighting of metrics, and of the benchmarking, as well as its performance in the evaluation of suitable publications. In the process of investigating these factors, a series of results of broad interest to the altmetrics community were generated. These will be discussed in the sections that follow.

The reference dataset was selected to provide a sample rich in altmetrics. PlumX identified at least one metric for 99% of our sample, while the figure was 83% for Altmetric. This result compares favorably with that of previous work [12,15,19,20,43,44], most likely indicating the increasing volume of altmetric activity. One important metric not included was article views and downloads. Although these data were provided by PlumX, we found them to be patchy, with many articles reporting metrics such as tweets or Mendeley readers but no page views or downloads on the EBSCO information service. This resulted in weak and spurious correlations (data not shown), similar to the findings of Maggio et al. (2018) [45].

Similar to previous investigators, we found that news, blog, Twitter, and Facebook mentions, Mendeley readers, and Dimensions citations were the most common metrics in our sample. These metrics were included in our analysis, as well as additional metrics that, although rare, provided [...] in line with findings from previous research [46-48]. Correlations were also found between Twitter and Facebook mentions, and between news/blog and social media mentions, which again aligns with previous observations [47,48].
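The pairwise associations discussed here are Spearman rank correlations. Altmetric counts are heavily tied (many publications share a count of zero), so ranks need tie handling. The following minimal sketch assigns tied values their average rank and then takes the Pearson correlation of the ranks. The function names and the counts are hypothetical; the original analysis used the Analyse-it plugin for Microsoft Excel rather than this code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1  # mean of 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical Twitter and Facebook mention counts for five publications.
twitter_mentions = [0, 0, 3, 10, 25]
facebook_mentions = [0, 1, 1, 4, 9]
rho = spearman(twitter_mentions, facebook_mentions)
```

When most publications score zero on a metric, as with the rarer metrics in this sample, nearly all ranks are tied and the coefficient carries little information.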

No meaningful correlations were found between mentions in F1000Prime articles, policy documents, guidelines, patents, or Wikipedia articles and other metrics. These metrics have not previously been widely studied, and the low correlations observed may reflect their very small coverage: over 90% of publications score zero on these metrics. However, it has previously been reported that F1000Prime recommendations were more closely correlated with Mendeley readers and Dimensions citations than with Twitter mentions [49].

Pairwise correlations can give useful insights into relationships between different metrics, but for the purposes of reducing data into composite scores it is helpful to understand the shared variance between multiple metrics. The exploratory factor analysis in our study produced findings consistent with those reported in previous literature [10-13]. Separating articles into those that were older (1H) and younger (2H) showed that citations (including policy and guideline mentions) and Mendeley readers consistently grouped into one factor; news, blogs, Wikipedia, and F1000Prime mentions grouped into a second factor; and Twitter (and, usually, Facebook) mentions comprised a third factor. A two-factor analysis excluding policy document, guideline, and patent mentions confirmed that Mendeley readers and Dimensions citations formed a separate group from the remaining metrics.

Each altmetric represents a different action on the part of an audience; this has implications for how we understand the meaning of individual metrics [4] and whether these statistical associations represent meaningful groupings. For example, much remains unknown about the motivation for tweeting, given that most tweets are empty of context [50] and content [51]. Often all that is certain is that the tweeter felt the research interesting enough to broadcast. Social media platforms are known to be used mostly by the general public, so a central motivation for scholars to tweet is likely to be to communicate and explain their work to lay people [52]. This may be particularly true of publications in biomedical sciences, which attain greater Twitter interest than those in other scholarly disciplines [52]. Twitter communities linked through publication tweets tend to be led by organizational accounts associated with well-known journals or leading scholars [53], although at least half of sharing on social media is likely to be non-academic [54,55].

Reference manager data have been suggested as an alternative to download counts as a source of readership evidence [3]. Although Mendeley users often add articles to their library with the intention of citing them, many also add them for professional or teaching purposes, which may explain why some articles have many readers but few citations.

Interestingly, articles rated in F1000Prime reviews as 'good for teaching' received higher Twitter scores, but not higher Mendeley scores, than those that were not rated this way. The reverse was true for articles considered a 'technical advance' [56].

The weighting of metrics in the EMPIRE Index was based on three considerations: the prevalence of metrics in the reference sample (highly prevalent metrics were weighted less), the need for each component to make a substantial contribution to the total impact score, and the value given to each metric as an indicator of impact. As a result, the weighting is quite different from other approaches based on purely statistical considerations.

Several approaches have determined weighting by regressing altmetrics on citations. These typically result in, for example, higher weighting given to blog posts and Mendeley readers than to news articles (because blog posts are relatively uncommon) [15,37,45,57]. Because the target variable is journal citations, each Mendeley save or F1000 citation may be weighted in a similar way to or higher than a policy document citation [37,57]. Ortega has developed weightings based on principal component analysis and also on inverse prevalence (so that the rarest metrics receive the highest weighting). The two approaches create very different weightings: for example, a news article carries half the weight of a publication citation in the Weighted Altmetric Impact, but eight times the weight of a publication citation in the Inverse Altmetric Impact [18,38]. These statistical approaches give very different results from the weighting developed for the EMPIRE Index.

Predictor scores
Given that some altmetrics accumulate early, there is long-standing interest in the use of a limited selection of rapidly accumulating altmetrics to identify publications likely to have high long-term impact. Earlier work has employed multivariate regression with citations as a measure of long-term impact [9,15,37,45,57,58] but, as we have seen, citations are only one of several measures of long-term impact.

Among common metrics, tweets and news articles accumulate most rapidly after publication, while Mendeley readers, blogs, and F1000Prime articles increase more gradually [3,34,35]. Wikipedia and policy document mentions can, like article citations, take well over a year to accumulate [34,59]. The EMPIRE Index addresses this by using two predictor scores: early and intermediate. The early predictor score also uses CiteScore, a journal-based metric. CiteScore, in this context, can be thought of as a proxy for the exposure an article is likely to have; it has previously been shown that combining citations over the first year with JIFs accurately predicts future citations [59,60].

Predictor scores are a purely statistical construct, so the weighting is quite different from that of the EMPIRE Index itself; however, the weighting is also different from methods employed in previous work using citations as a target. In the EMPIRE Index predictor scores, Mendeley readers carry less weight relative to news articles than in the earlier studies that used statistically based weighting with only citations as a target. This most likely reflects the broader basis of the EMPIRE Index compared with citation-only targets.

The reasonably strong relationship between predictor scores and the total impact score in the reference Phase III sample is to be expected, given that they share many of the same metrics. However, the weak correlation with the societal impact score indicates that the predictor scores will lack precision in identifying high-impact publications (given the importance of the contribution of societal impact to the total impact). Further work using longitudinal datasets is required to improve these predictor scores.

The responsiveness and utility of the EMPIRE Index were evaluated in several ways. Averages and distributions of scores in the reference Phase III sample and the benchmark NEJM sample were explored, showing that both samples had similar social and scholarly metrics and the latter had far higher societal metrics. Because the scores were scaled to the benchmark NEJM sample, this resulted in predictor scores lacking sensitivity for lower-impact publications (i.e. although they retained precision for identifying higher-impact articles, they tended to overpredict the impact of lower-impact articles uniformly).

The social score was shown to be closely correlated with the AAS. The AAS weights metrics in a way that users of the Altmetric Explorer dashboard cannot reproduce: news outlets are weighted in a proprietary (and undisclosed) tier system, while retweets are assigned only 75% of the weight of original tweets [6]. The high correlation between the social score and the AAS thus reassures users that these nuances make little difference. [...]

[...] publication. Both scholarly and societal impact scores continued to increase, and further follow-up is needed to identify the point at which these scores plateau.

Finally, an independent dataset was investigated: articles selected by NEJM editors for their practice-changing potential. These papers had substantially higher societal impact than the benchmark set of NEJM Phase III articles, supporting the sensitivity of the societal impact component in identifying practice-changing publications. Furthermore, innovative articles were found to have relatively low societal impact, indicating that although these are of interest to scholars and wider society, they do not directly feed into clinical practice changes. Conversely, articles on surgery had a high impact on practice even though social and academic interest was low.

Weaknesses
Although the EMPIRE Index provides advantages over existing metric approaches, it has some potential weaknesses. For example, grouping and value weighting have a large subjective component that may not reflect the value assigned to metrics by others. However, the transparent nature of the approach will hopefully stimulate further debate and discussion around the inherent subjectivity and allow for future refinements.

The analyses conducted were based on a closely defined subset of medical publications, in terms of both content (Phase III trials) and publication date. As metrics evolve over time owing to changes in the way audiences engage with publications or technical advances in the way metrics are recorded, these original analyses and assumptions may not apply. They may also not apply to other publication types or study designs, and may vary across disease areas. Predictor scores are based on results of cross-sectional, rather than longitudinal, analyses; further follow-up will allow these scores to be refined and improved. Furthermore, benchmarking to very high-impact articles results in predictor scores that tend to overestimate the final impact of more typical articles.


The EMPIRE Index is a novel metric framework incorporating three component scores that respond to different types of publication impact: social, scholarly, and societal. Whereas the social impact score is similar to the AAS and the scholarly impact score is closely linked to (but broader than) article citations, the societal impact score reflects a key and distinct aspect of publication impact. In a similar way to the AAS, the EMPIRE Index weights metrics subjectively to reflect their value from the user's perspective as well as by prevalence. Unlike the AAS, it is designed for a limited subject area (medicine) and weights and benchmarks the metrics accordingly. It also has a clear, transparent explanation of the scoring system, and provides predictor scores to give an early estimate of likely future impact.

Several potential uses are envisaged for the EMPIRE Index. Because it provides a richer assessment of publication value than standalone traditional and alternative metrics, it will enable individuals involved in medical research to assess the impact of related publications easily and to understand what characterizes impactful research. It can also be used to assess the effectiveness of communications around publications and publication enhancements such as infographics and explanatory videos. Fuller validation of the EMPIRE Index requires additional prospective and cross-sectional studies, which are ongoing.
