Skip to main content
medRxiv
  • Home
  • About
  • Submit
  • ALERTS / RSS
Advanced Search

A Comparative Analysis of System Features Used in the TREC-COVID Information Retrieval Challenge

View ORCID ProfileJimmy Chen, View ORCID ProfileWilliam R. Hersh
doi: https://doi.org/10.1101/2020.10.15.20213645
Jimmy Chen
1School of Medicine, Oregon Health & Science University, Portland, OR, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for Jimmy Chen
  • For correspondence: chenjim@ohsu.edu
William R. Hersh
2Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • ORCID record for William R. Hersh
  • Abstract
  • Full Text
  • Info/History
  • Metrics
  • Data/Code
  • Preview PDF
Loading

Abstract

The COVID-19 pandemic has resulted in a rapidly growing quantity of scientific publications from journal articles, preprints, and other sources. The TREC-COVID Challenge was created to evaluate information retrieval methods and systems for this quickly expanding corpus. Based on the COVID-19 Open Research Dataset (CORD-19), several dozen research teams participated in over 5 rounds of the TREC-COVID Challenge. While previous work has compared IR techniques used on other test collections, there are no studies that have analyzed the methods used by participants in the TREC-COVID Challenge. We manually reviewed team run reports from Rounds 2 and 5, extracted features from the documented methodologies, and used a univariate and multivariate regression-based analysis to identify features associated with higher retrieval performance. We observed that fine-tuning datasets with relevance judgments, MS-MARCO, and CORD-19 document vectors was associated with improved performance in Round 2 but not in Round 5. Though the relatively decreased heterogeneity of runs in Round 5 may explain the lack of significance in that round, fine-tuning has been found to improve search performance in previous challenge evaluations by improving a system’s ability to map relevant queries and phrases to documents. Furthermore, term expansion was associated with improvement in system performance, and the use of the narrative field in the TREC-COVID topics was associated with decreased system performance in both rounds. These findings emphasize the need for clear queries in search. While our study has some limitations in its generalizability and scope of techniques analyzed, we identified some IR techniques that may be useful in building search systems for COVID-19 using the TREC-COVID test collections.

INTRODUCTION

Since the World Health Organization declared the Coronavirus Disease 2019 (COVID-19) a public health emergency,1 there has been explosive growth in scientific knowledge about this novel virus. Consequently, the use of preprints and fast-track publication policies has resulted in a significant increase in the number of COVID-19 related publications over a short period of time.2,3 Information retrieval (IR, also known as search) systems are the tool usually employed to manage access to large corpora of literature.4 The efficacy of IR systems is often assessed in challenge evaluations that provide reusable test collections, such as those led by the National Institutes of Standards and Technology (NIST) in the Text REtrieval Conference (TREC).5

To address the need for system evaluation in this rapidly changing information environment, NIST sponsored the TREC-COVID Challenge.6 As with most IR challenge evaluations, test collections of documents, topics for searching, and relevance judgments were developed.7 The document collections were derived from snapshots of the COVID-19 Open Research Dataset (CORD-19), a regularly updated dataset of manuscripts consisting of coronavirus-related research gathered from various sources including journal articles, PubMed references, arXiv, medRxiv, and bioRxiv.3 Based on the the TREC framework, the TREC-COVID Challenge tasked researchers to evaluate IR systems over 5 rounds to retrieve manuscripts relevant to topics about COVID-19 from CORD-19, with the goal of building a reusable test collection for further research.8 Every 2-3 weeks, researchers implemented and evaluated retrieval systems with un updated CORD-19 dataset and the addition of new search topics.9 While there exists previous work examining techniques used in other IR test collections,10–12 there have not yet been any studies comparing the methods and systems used by participants in the TREC-COVID Challenge.

The purpose of this study was to compare performance in different approaches used in the TREC-COVID Challenge by: 1) developing a taxonomy to characterize IR techniques and system characteristics used, and 2) applying this taxonomy to identify features of IR systems associated with higher performance. Using run reports from Round 2 and Round 5, we designed a taxonomy and evaluated its features using a univariate and multivariate regression analysis. In this study, we assessed how certain methodologies were associated with higher retrieval performance and discussed the implications and limitations of our analysis.

METHODS

Dataset and Features

The TREC-COVID challenge6 occurred over 5 rounds in 2020 on the rapidly growing CORD19 dataset,3(p19) with 30 initial topics in Round 1 and 5 new topics each subsequent round. Each topic consisted of three fields: (1) a short query statement that a user might enter, (2) a longer question field more thoroughly expressing the information need of the topic, and (3) a narrative field describing what would constitute a relevant document. After Round 1, relevance judgments, consisting of IDs of manuscripts assessed by human assessors as relevant, partially relevant, or irrelevant, were made available for previously unassessed manuscripts after each round. Table 1 summarizes the CORD19 corpora, topics, number of teams, runs, and judgments present in each round of the TREC-COVID challenge.

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 1.

Overview of the TREC-COVID challenge. Over 5 rounds, research teams implemented information retrieval (IR) systems to search the growing CORD-19 dataset. After each round, new topics and relevance judgments of manuscripts from previous iterations of the CORD-19 dataset were released for use in subsequent rounds.

Reports for all submitted runs, including those from Rounds 2 and 5, were made publicly available from the TREC-COVID Challenge (https://ir.nist.gov/covidSubmit/archive.html) and were manually reviewed by J.C. We chose to review reports from Rounds 2 and 5 because we wanted to compare methodologies used in two different rounds where feedback methods from topics from previous rounds were available. Each run report was written as a textual description of the methodology used to produce the run in whatever detail the submitting team provided. An example run report is shown (Figure 1).

Figure 1.
  • Download figure
  • Open in new tab
Figure 1. Example Run Report from Round 2.

During submission of a run, participants were encouraged to provide a methodological description of each submitted run. This run description, along with the run ID, topic types, and performance metrics, were reported in a publicly available repository of archived results (https://ir.nist.gov/covidSubmit/archive.html). These run reports were manually reviewed for features.

The following features were extracted for each run in the reports from Round 2 and Round 5:

  • Text used (i.e., title and abstract only, paragraph-based indices, or full-text)

  • Type of query (i.e., any combination of the query, question, and narrative from the TREC-COVID topic fields)

  • Any query pre-processing (i.e., stemming, removing stop words)

  • Query term expansion (addition of terms not originally provided in each topic)

  • Manual review methods (i.e., human interventions including the use of human assessors in Continuous Active Learning)13

  • Any weighted ranking system used (i.e. non-neural scoring functions such as BM2514 and term frequency–inverse document frequency, or TF-IDF15)

  • Any ranking model that used a neural architecture (including deep transformer models such as BERT,16 SciBERT,17 T518 as well as DeepRank,19 a neural network that attempts to simulate humans in relevance judgments)

  • Other techniques (machine learning models such as SVM, logistic regression, custom scoring functions, otherwise known as term proximity scores. Custom search methods were also included in this category including ReQ-ReC,20 a double-loop retrieval system)

  • Dataset used to fine tune any system (i.e., MS-MARCO, a large dataset of annotated documents based off 100,000 Bing queries,21 CORD-19 dataset transformed into document vectors, and relevance judgments from previous rounds)

  • Fusion of multiple runs into a single run (including use of reciprocal rank fusion,22 COMB fusion methods23)

  • Re-ranking implemented, defined as whether a second system (most commonly a neural network) was used to refine an initial scoring system.

  • Pseudo-relevance feedback, or system-generated relevance feedback based on an initial query.

  • How/if human-generated relevance feedback, or relevance judgements, from the previous round(s) were used.

  • Runs filtered by date. Removing documents published before 2020 (or when the pandemic began to gain widespread notice) had been previously suggested by McAvaney et al. in their post-hoc analysis of their neural re-ranking system as a possible method to improve performance.24

These features were selected to encompass a broad set of commonly used techniques in ad-hoc retrieval. Of note, TREC challenges typically do not occur over multiple rounds; thus, the addition of relevance judgments was a novel addition to the TREC-COVID challenge. Since the length of reports varied at the researcher’s discretion, many reports likely had some number of missing features. To minimize the impact of these null features, we assumed that runs that did not provide information about the type of text or query used in their system likely searched on the full-text using the query subfield from each topic. Features extracted from reports were either characterized as a binary feature (used or not used in the system) or a categorical feature (a description of feature for the system, i.e., BM25 as the weighted system used). Categorical features were later one-hot encoded, or converted into binary features over multiple columns, prior to input into our regression analysis. The extracted features and their encoding is shown in Table 2

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 2. Taxonomy Features Extracted from Run Reports.

Features extracted from run reports from Round 2 and Round 5 are listed. Features were either extracted as a binary feature or a categorical feature (in which a short description specifying the feature was provided). Prior to input into univariate and multivariate analysis, categorical features were one-hot encoded, or converted into binary variables over multiple columns.

We included all runs that contained more than 1 extracted feature to ensure a reasonably large enough and useful dataset for analysis. We excluded unusually poorly performing runs that likely represented poor system or method implementations. The exclusion threshold for these runs was defined as an average performance of less than 0.2 across all 5 performance metrics used to evaluate run performance in the TREC-COVID Challenge. Performance metrics reported by NIST and used in our analysis included: precision at K documents (P@K), normalized discounted cumulative gain at K documents (NDCG@K), rank-based precision with depth = 5 (RBP (p = 0.5)), binary preference (bpref), and mean average precision (MAP). In Round 2, the depth of documents for P@K and NDCG@K were 5 and 10 respectively, while in Round 5, the depth of documents for P@K and NDCG@K were both 20. These changes were made out of concern for inflated performance when evaluating precision on a small number of documents.6 For each run, these performance metrics were computed as the mean performance across all topics in the round.

Univariate and Multivariate Analysis

All data analysis and pre-processing was performed using R (version 4.0.2)25 using the glmnet package.26 For each round, 5 univariate linear regressions were created using all extracted features as the independent variables, and each of the 5 performance metrics (NDCG@K, P@K, RBP, bpref, and MAP) as the dependent variable. Coefficients and standard errors were calculated for each feature, and p-values were extracted for each feature coefficient, with significance defined as p < 0.05. Features that met the threshold for significance in the univariate regression were subsequently input into a multivariate linear regression. Overall, positive coefficients were interpreted to be associated with higher performance. Therefore, features which remained significant after both univariate and multivariate regression were likely associated with high performance in the TREC-COVID challenge.

RESULTS

Round 2 consisted of 136 runs submitted by 51 teams (with a permitted maximum of 3 runs submitted per team), and Round 5 consisted of 126 runs submitted by 28 teams (with a permitted maximum of 8 runs submitted per team). The topics in Round 2 included 30 previous (i.e., from Round 1) topics with relevance judgments, and 5 new topics. The topics in Round 5 included 45 previous (i.e., from Rounds 1-4) topics and 5 new topics. Overall, 110 runs from 42 teams met inclusion criteria in Round 2 and 111 runs from 23 teams met inclusion criteria in Round 5. The proportion of manual (defined as involving human intervention), feedback (defined as using judgments from prior rounds), and automatic (defined as neither feedback nor manual) runs varied between runs. In Round 2, the majority of the runs were categorized as automatic runs; in round 5, the majority of the runs were characterized as feedback runs. These findings are summarized in Table 3

View this table:
  • View inline
  • View popup
  • Download powerpoint
Table 3. Distribution of Included Runs in Rounds 2 and 5.

The number of included runs and the number of participating teams is included in this table. Runs were either sub-categorized as feedback (defined as using judgments from prior rounds), manual (involving human intervention), and automatic (neither manual nor feedback) runs.

Significant features for the 5 univariate regressions each for Round 2 and Round 5 are shown in Figure 2 and varied depending on the performance metric used. In Round 2, query term expansion (n = 37 runs), fine-tuning of ranking systems on MS-MARCO (n = 18 runs), Round 1 judgments (n = 9 runs), or document vectors formed by the CORD-19 dataset (n = 9 runs) were associated with higher performance across most, if not all, performance metrics. Use of ReQ-Rec (n = 3 runs submitted by 1 team), and narrative text in the query (n = 28 runs) were associated with decreased performance across the majority of performance metrics. In Round 5, use of the question text in the query (n = 32 runs) and TF-IDF vectors were associated with increased performance (n = 14 runs), whereas the use of neural networks, narrative text in the query (n = 67 runs), and proximity score (n = 2 runs) were associated with decreased performance across all performance metrics.

Figure 2.
  • Download figure
  • Open in new tab
Figure 2. Significant Features after Univariate Regression Analysis in Rounds 2 and 5.

Univariate analysis was performed on features extracted from reports from Rounds 2 and 5. Features that were significant after input into a univariate linear regression are shown for the following performance metrics from Rounds 2 and 5 respectively: binary preference (A and F), mean average precision (B and G), normalized distributive cumulative gain (C and H), precision @ k documents (D and I), and rank-based precision (E and J). The count, or number of times that the feature occurred in our extracted dataset, is displayed adjacent to the feature name. These significant features were subsequently input into a multivariate regression to determine which features were independently associated with performance.

Significant features from multivariate regressions on the 5 different performance metrics in Rounds 2 and 5 are shown in Figure 3. After features found to be significant on univariate regression were input into a multivariate regression, the following features remained significantly associated with increased performance in Round 2 with the majority of performance metrics: term expansion (n = 37), ranking system fine-tuning on CORD-19 vectors (n = 9), MS-MARCO (n = 18), and Round 1 judgments (n =9). Using ReQ-Rec (n = 3) remained significantly associated with decreased performance. In Round 5, using the question text to formulate the query (n = 60) and TF-IDF vector weighting (n =14) were associated with increased performance, while a custom proximity score (n = 2) as a scoring function was associated with decreased performance. As seen in Round 2, using feedback in Round 5 (n = 59) was associated with increased performance when runs were evaluated on RBP.

Figure 3.
  • Download figure
  • Open in new tab
Figure 3. Significant Features after Multivariate Regression Analysis in Rounds 2 and 5.

Features that were found to be significant in univariate regression were further input into a second, multivariate regression. Significant features were reported for the following performance metrics in Rounds 2 and 5 respectively: binary preference (A and F), mean average precision (B and G), normalized distributive cumulative gain (C and H), precision @ k documents (D and I), and rank-based precision (E and J). Depending on the coefficients, these features were concluded to be significantly associated with increased or decreased performance in the TREC-COVID challenge.

DISCUSSION

This study aimed to develop a taxonomy of features to evaluate techniques associated with higher performance in runs submitted to Rounds 2 and 5 of the TREC-COVID Challenge. The key findings were: 1) fine-tuning ranking systems using relevance judgments resulted in significant improvement in performance, particularly in Round 2, and 2) query formulation is an important component of successful search.

Our first key finding was that fine-tuning ranking systems using relevance judgments resulted in significant improvement in performance, particularly in Round 2. Unlike previous TREC challenges, rapid turnout of relevance judgments over multiple rounds resulted in opportunities to fine-tune ranking systems for improved performance. Many of the runs labelled as feedback runs (n = 41 in Round 2 and n = 65 in round 5) employed fine-tuning, though a small portion specifically used the relevance judgments specifically in fine-tuning their ranking systems. Other teams who fine-tuned on similar datasets, including the vectorized CORD19 dataset (represented as Dataset.for-FineTuning_CORD19) and MS-MARCO also achieved comparable levels of improvement when compared to systems that did not use fine-tuning on these specific datasets. The noted improvement of fine-tuning on an annotated dataset has been reported in other TREC challenges, most notably the usage of MS-MARCO by Nogueira et al to refine a neural network system that vastly outperformed other runs in the TREC CAR challenge.27 Interestingly, the benefits of fine-tuning systems did not persist into Round 5 (with the exception of evaluation on RBP) despite more prevalent use of neural systems and feedback runs.

Since the TREC-COVID Challenge brought together a mix of research teams with varying experience in IR challenge evaluations, along with the short time between rounds (2-3 weeks), the absence of significance with fine-tuning on previous round judgments may be explained by implementation differences between teams, as many teams implemented variations of the popular sequence of an initial weighted system (most commonly BM25) followed by a neural re-ranker (i.e., BERT with or without fine-tuning on MS-MARCO or previous relevance judgments).28,29 However, since we one-hot encoded other techniques, our linear regressions may have overrepresented individual techniques that few teams used (including ReQ-ReC in Round 2, and proximity scoring in Round 5). Future work may be needed to validate the performance of other techniques compared with the standard weighted and neural pipelines. Furthermore, since we chose to set neural networks as a binary variable, there may be opportunities to explore how different architectures influence performance in the TREC-COVID challenge.

Our second key finding was that query formulation was an important component of successful search. While most teams used the query and question fields in formulating an input query, several teams (n = 28 and 32 in Rounds 2 and 5 respectively) chose to use the narrative portion of the topic, which was associated with decreased performance in both rounds. Because the narrative contained freehand descriptions qualifying each topic, these descriptive fields were noisy. For example, topics 33 and 34 contained the phrase “excluding…,” with subsequent wording describing what not to search. Furthermore, vocabulary used in the topics designed by the TREC-COVID organizers were not consistently used in manuscripts included in the CORD-19 dataset (i.e. differences in how COVID-19 was named: SARS-CoV 2, coronavirus) and may have adversely affected search performance for those who did not expand their queries to include such terms.

In fact, many of the successful runs from teams that used baseline runs from Anserini30 employed a query preprocessing tool produced by University of Delaware that used SciSpacy31 to lemmatize and remove non-stop words. Runs generated by Anserini comparing standard addition of various topic fields with and without the “Udel” method consistently showed improvements in retrieved relevant documents no matter what topics were used to construct the query and which indices were used.32 This approach was taken further either manually by certain teams (i.e., OHSU) or automatically, as seen in approaches in initial iterations of Covidex.33 In fact, adapting the queries to better represent document representations, or minimize query-document mismatch, has long been researched and includes work using relevance judgments34 and query expansion.35 Novel methods have focused on reverse: adjusting documentation representations to better represent queries - for example, Doc2Query36 was employed most commonly in Round 5 by one team, though this technique was not shown to be significantly associated with high performance in our study. However, the team that incorporated this technique submitted runs that were widely variable in performance, and may have used other features not found to be significant in our taxonomy. The importance of defining relevant terms in queries has also been reflected in human user studies, in which previous work by Hersh et. al has demonstrated the importance of search ability and domain knowledge of the user in a biomedical search task.10

LIMITATIONS

This study had several limitations that future work could address. First, the instructions for describing methodologies in the run reports varied in detail. As such, the data used for this study were only as complete as what was provided in the reports. This not only presented a challenge to building our taxonomy, but also meant that important features may not have been (and likely were not) reported. In the future, teams should document methodologies that promote reproducibility or publish their results in reports as is done in the regular TREC challenges.

Second, it was difficult to capture run-specific differences between runs submitted by the same team, as team-specific features were often not provided. This had important implications in runs submitted in Round 5, where teams were allowed to submit up to 8 runs. While many runs submitted from the same team were largely similar (and often performed similarly), our methodology was not well-suited to capture nuances such as hyperparameter tuning that were likely small adjustments to otherwise similar methods and pipelines. We sought to characterize runs broadly, rather than capture each individual technique and adjustment in each run, since features built around individual techniques were subject to bias. However, to find a balance between granularity vs. breadth of techniques, we attempted to take into account differences between runs (even from the same team) using a one-hot encoded column of other techniques that we thought were unique enough to warrant specific inclusion. Future directions for this work may include identifying how to best capture adjustments between runs using similar techniques that result in different performances.

Third, our study was retrospective and limited in scope. While we studied techniques associated with performance on the CORD-19 dataset, these techniques may not be generalizable to other test collections. This was reflected in our work, where our regressions overfitted particularly performances with teams that used unique methodologies (i.e., associated a feature with significantly low or high performance despite a low number of teams employing this feature, such as ReQ-ReC20 or Proximity score). As mentioned above, implementation may have had a role in this “significantly” poor performance. Future work will be needed to prospectively evaluate these unique techniques across different developers and users.

CONCLUSION

Using multivariate regression analysis, we developed and evaluated a taxonomy of features IR systems associated with high performance in the TREC-COVID Challenge. While our multivariate analysis demonstrates the utility of relevance feedback and the need for well-defined queries, it remains unclear which broad methodologies are associated with high performance in the TREC-COVID test collection. While our study has limitations in generating specific, prospective generalizations about IR systems and techniques, our work broadly showcases general techniques that may be useful in building search systems for COVID-19, and serves as a springboard for future work on TREC-COVID and related test collections.

Data Availability

N/A

Footnotes

  • Contact Information: Jimmy Chen, BA, School of Medicine - MD Program 3181 SW Sam Jackson Park Rd., Oregon Health & Science University, Portland, OR, 97239, Email: jimmyschen94{at}gmail.com

  • Financial Support: None.

  • Conflicts of Interest: Jimmy Chen and William Hersh have no conflicts of interest to disclose.

REFERENCES

  1. 1.↵
    Statement on the second meeting of the International Health Regulations (2005) Emergency Committee regarding the outbreak of novel coronavirus (2019-nCoV). Accessed September 8, 2020. https://www.who.int/news-room/detail/30-01-2020-statement-on-the-second-meeting-of-the-international-health-regulations-(2005)-emergency-committee-regarding-the-outbreak-of-novel-coronavirus-(2019-ncov)
  2. 2.↵
    Palayew A, Norgaard O, Safreed-Harmon K, Andersen TH, Rasmussen LN, Lazarus JV. Pandemic Publishing Poses a New COVID-19 Challenge. Nat Hum Behav. 2020;4(7):666– 669. doi:10.1038/s41562-020-0911-0
    OpenUrlCrossRef
  3. 3.↵
    Wang LL, Lo K, Chandrasekhar Y, et al. CORD-19: The COVID-19 Open Research Dataset. ArXiv200410706 Cs. Published online July 10, 2020. Accessed September 8, 2020. http://arxiv.org/abs/2004.10706
  4. 4.↵
    Hersh, W. Information Retrieval: A Biomedical and Health Perspective. 4th ed.; 2020. doi:10.1007/978-3-030-47686-1
    OpenUrlCrossRef
  5. 5.↵
    Voorhees EM HD. TREC: Experiment and Evaluation in Information Retrieval. Cambridge, MA: The MIT Press (Digital Libraries and Electronic Publishing series); 2005.
  6. 6.↵
    TREC-COVID Home. Accessed October 13, 2020. https://ir.nist.gov/covidSubmit/
  7. 7.↵
    Roberts K, Alam T, Bedrick S, et al. TREC-COVID: Rationale and Structure of an Information Retrieval Shared Task for COVID-19. J Am Med Inform Assoc. Published online May 4, 2020. doi:10.1093/jamia/ocaa091
    OpenUrlCrossRef
  8. 8.↵
    Voorhees E, Alam T, Bedrick S, et al. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. ArXiv200504474 Cs. Published online May 9, 2020. Accessed September 8, 2020. http://arxiv.org/abs/2005.04474
  9. 9.↵
    Roberts, K, Alam T, Bedrick S, et al. Searching for Answers in a Pandemic: An Overview of TREC-COVID Submitted to Journal of Biomedical Informatics COVID-19 special issue. J Biomed Inform COVID-19 Peical Issue. Published online October 15, 2020.
  10. 10.↵
    Hersh WR, Crabtree MK, Hickam DH, et al. Factors associated with success in searching MEDLINE and applying evidence to answer clinical questions. J Am Med Inform Assoc JAMIA. 2002;9(3):283–293. doi:10.1197/jamia.m0996
    OpenUrlCrossRef
  11. 11.
    Roberts K, Simpson M, Demner-Fushman D, Voorhees E, Hersh W. State-of-the-art in biomedical literature retrieval for clinical cases: a survey of the TREC 2014 CDS track. Inf Retr J. 2016;19(1):113–148. doi:10.1007/s10791-015-9259-x
    OpenUrlCrossRef
  12. 12.↵
    Rekapalli HK, Cohen AM, Hersh WR. A comparative analysis of retrieval features used in the TREC 2006 Genomics Track passage retrieval task. AMIA Annu Symp Proc AMIA Symp. 2007;2007:620–624.
    OpenUrl
  13. 13.↵
    Cormack GV, Grossman MR. Autonomy and Reliability of Continuous Active Learning for Technology-Assisted Review. ArXiv150406868 Cs. Published online April 26, 2015. Accessed October 14, 2020. http://arxiv.org/abs/1504.06868
  14. 14.↵
    Beaulieu MM, Gatford M, Huang X, Robertson S, Walker S, Williams P. Okapi at TREC-5. In: The Fifth Text REtrieval Conference (TREC-5). The Fifth Text REtrieval Conference (TREC–5). Gaithersburg, MD: NIST; 1997:143–165. Accessed October 13, 2020. https://www.microsoft.com/en-us/research/publication/okapi-at-trec-5/
  15. 15.↵
    Rajaraman A, Ullman JD, eds. Data Mining. In: Mining of Massive Datasets. Cambridge University Press; 2011:1–17. doi:10.1017/CBO9781139058452.002
    OpenUrlCrossRef
  16. 16.↵
    Devlin J, Chang M-W, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv181004805 Cs. Published online May 24, 2019. Accessed October 14, 2020. http://arxiv.org/abs/1810.04805
  17. 17.↵
    Beltagy I, Lo K, Cohan A. SciBERT: A Pretrained Language Model for Scientific Text. ArXiv190310676 Cs. Published online September 10, 2019. Accessed October 14, 2020. http://arxiv.org/abs/1903.10676
  18. 18.↵
    Tang R, Nogueira R, Zhang E, et al. Rapidly Bootstrapping a Question Answering Dataset for COVID-19. ArXiv200411339 Cs. Published online April 23, 2020. Accessed May 4, 2020. http://arxiv.org/abs/2004.11339
  19. 19.↵
    Pang L, Lan Y, Guo J, Xu J, Xu J, Cheng X. DeepRank: A New Deep Architecture for Relevance Ranking in Information Retrieval. Proc 2017 ACM Conf Inf Knowl Manag. Published online November 6, 2017:257-266. doi:10.1145/3132847.3132914
    OpenUrlCrossRef
  20. 20.↵
    Li C, Wang Y, Resnick P, Mei Q. ReQ-ReC: High Recall Retrieval with Query Pooling and Interactive Classification. In: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. SIGIR ‘14. Association for Computing Machinery; 2014:163–172. doi:10.1145/2600428.2609618
    OpenUrlCrossRef
  21. 21.↵
    Bajaj P, Campos D, Craswell N, et al. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. ArXiv161109268 Cs. Published online October 31, 2018. Accessed October 11, 2020. http://arxiv.org/abs/1611.09268
  22. 22.↵
    Cormack GV, Clarke CLA, Buettcher S. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ‘09. Association for Computing Machinery; 2009:758–759. doi:10.1145/1571941.1572114
    OpenUrlCrossRef
  23. 23.↵
    Shaw JA, Fox EA. Combination of Multiple Searches. In: THE SECOND TEXT RETRIEVAL CONFERENCE (TREC-2.; 1994:243–252.
  24. 24.↵
    MacAvaney S, Cohan A, Goharian N. SLEDGE: A Simple Yet Effective Baseline for Coronavirus Scientific Knowledge Search. ArXiv200502365 Cs. Published online May 6, 2020. Accessed May 7, 2020. http://arxiv.org/abs/2005.02365
  25. 25.↵
    R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; 2020. https://www.R-project.org/
  26. 26.↵
    Friedman J, Hastie T, Tibshirani R. Regularization Paths for Generalized Linear Models via Coordinate Descent. J Stat Softw. 2010;33(1):1–22.
    OpenUrlCrossRefPubMedWeb of Science
  27. 27.↵
    Nogueira R, Cho K. Passage Re-ranking with BERT. ArXiv190104085 Cs. Published online April 14, 2020. Accessed May 4, 2020. http://arxiv.org/abs/1901.04085
  28. 28.↵
    Mitra B, Craswell N. An Introduction to Neural Information Retrieval. Found Trends Inf Retr. 2018;13:1-126.
    OpenUrl
  29. 29.↵
    Dehghani M, Zamani H, Severyn A, Kamps J, Croft WB. Neural Ranking Models with Weak Supervision. ArXiv170408803 Cs. Published online May 29, 2017. Accessed October 13, 2020. http://arxiv.org/abs/1704.08803
  30. 30.↵
    Yang P, Fang H, Lin J. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ‘17. Association for Computing Machinery; 2017:1253–1256. doi:10.1145/3077136.3080721
    OpenUrlCrossRef
  31. 31.↵
    Neumann M, King D, Beltagy I, Ammar W. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. Proc 18th BioNLP Workshop Shar Task. Published online 2019:319-327. doi:10.18653/v1/W19-5034
    OpenUrlCrossRef
  32. 32.↵
    Castorini. A Lucene toolkit for replicable information retrieval research. GitHub. Accessed October 13, 2020. https://github.com/castorini/anserini
  33. 33.↵
    Zhang E, Gupta N, Tang R, et al. Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset. ArXiv200707846 Cs. Published online July 14, 2020. Accessed October 11, 2020. http://arxiv.org/abs/2007.07846
  34. 34.↵
    1. Salton G
    Rocchio JJ. Relevance feedback in information retrieval. In: Salton G, ed. The Smart Retrieval System - Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall; 1971:313–323.
  35. 35.↵
    Voorhees EM. Query Expansion Using Lexical-Semantic Relations. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ‘94. Springer-Verlag; 1994:61–69.
  36. 36.↵
    Nogueira R, Yang W, Lin J, Cho K. Document Expansion by Query Prediction. ArXiv190408375 Cs. Published online September 24, 2019. Accessed September 20, 2020. http://arxiv.org/abs/1904.08375
View Abstract
Back to top
PreviousNext
Posted October 20, 2020.
Download PDF
Data/Code
Email

Thank you for your interest in spreading the word about medRxiv.

NOTE: Your email address is requested solely to identify you as the sender of this article.

Enter multiple addresses on separate lines or separate them with commas.
A Comparative Analysis of System Features Used in the TREC-COVID Information Retrieval Challenge
(Your Name) has forwarded a page to you from medRxiv
(Your Name) thought you would like to see this page from the medRxiv website.
CAPTCHA
This question is for testing whether or not you are a human visitor and to prevent automated spam submissions.
Share
A Comparative Analysis of System Features Used in the TREC-COVID Information Retrieval Challenge
Jimmy Chen, William R. Hersh
medRxiv 2020.10.15.20213645; doi: https://doi.org/10.1101/2020.10.15.20213645
Digg logo Reddit logo Twitter logo CiteULike logo Facebook logo Google logo Mendeley logo
Citation Tools
A Comparative Analysis of System Features Used in the TREC-COVID Information Retrieval Challenge
Jimmy Chen, William R. Hersh
medRxiv 2020.10.15.20213645; doi: https://doi.org/10.1101/2020.10.15.20213645

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero
  • Tweet Widget
  • Facebook Like
  • Google Plus One

Subject Area

  • Health Informatics
Subject Areas
All Articles
  • Addiction Medicine (62)
  • Allergy and Immunology (142)
  • Anesthesia (44)
  • Cardiovascular Medicine (409)
  • Dentistry and Oral Medicine (68)
  • Dermatology (47)
  • Emergency Medicine (141)
  • Endocrinology (including Diabetes Mellitus and Metabolic Disease) (171)
  • Epidemiology (4818)
  • Forensic Medicine (3)
  • Gastroenterology (177)
  • Genetic and Genomic Medicine (671)
  • Geriatric Medicine (70)
  • Health Economics (188)
  • Health Informatics (621)
  • Health Policy (314)
  • Health Systems and Quality Improvement (200)
  • Hematology (85)
  • HIV/AIDS (155)
  • Infectious Diseases (except HIV/AIDS) (5287)
  • Intensive Care and Critical Care Medicine (327)
  • Medical Education (91)
  • Medical Ethics (24)
  • Nephrology (73)
  • Neurology (677)
  • Nursing (41)
  • Nutrition (112)
  • Obstetrics and Gynecology (126)
  • Occupational and Environmental Health (203)
  • Oncology (439)
  • Ophthalmology (138)
  • Orthopedics (36)
  • Otolaryngology (89)
  • Pain Medicine (35)
  • Palliative Medicine (15)
  • Pathology (128)
  • Pediatrics (193)
  • Pharmacology and Therapeutics (129)
  • Primary Care Research (84)
  • Psychiatry and Clinical Psychology (771)
  • Public and Global Health (1800)
  • Radiology and Imaging (322)
  • Rehabilitation Medicine and Physical Therapy (138)
  • Respiratory Medicine (255)
  • Rheumatology (86)
  • Sexual and Reproductive Health (69)
  • Sports Medicine (61)
  • Surgery (100)
  • Toxicology (23)
  • Transplantation (28)
  • Urology (37)