Evaluation of a Popular Large Language Model in Orthopedic Literature Review: Comparison to Previously Published Reviews

Document Type: Research Paper

Authors

1 Rothman Orthopaedic Institute, Thomas Jefferson University, Philadelphia, PA, USA.

2 Google Search AI, Google, Mountain View, CA, USA.

DOI: 10.22038/abjs.2025.84896.3874

Abstract

Objectives: Large language models (LLMs) may improve the process of conducting systematic literature reviews. Our aim was to evaluate the utility of one popular LLM chatbot, Chat Generative Pre-trained Transformer (ChatGPT), in systematic literature reviews when compared to traditionally conducted reviews.



Methods: We identified five systematic reviews published in the Journal of Bone and Joint Surgery from 2021 to 2022. For each review, we retrieved the clinical question, methodology, and included studies. We evaluated ChatGPT’s performance on three tasks. (1) For each published systematic review’s core clinical question, ChatGPT designed a relevant database search strategy. (2) ChatGPT screened the abstracts of the articles identified by that search strategy for inclusion in the review. (3) For one systematic review, ChatGPT reviewed the full manuscript of each article that passed abstract screening to identify those that met the inclusion criteria. We compared ChatGPT’s performance on each of these three tasks with the previously published systematic reviews.



Results: The search strategies designed by ChatGPT captured a median of 91% (interquartile range [IQR], 84% to 94%) of the articles included in the published systematic reviews. After screening the abstracts of the retrieved articles, ChatGPT captured a median of 75% (IQR, 70% to 79%) of the articles included in the published systematic reviews. On in-depth screening of manuscripts, ChatGPT captured only 55% of target publications; however, this improved to 100% on review of the manuscripts that ChatGPT identified on this step. Qualitative analysis of ChatGPT’s performance highlighted the importance of prompt design and engineering.



Conclusion: Using published reviews as a gold standard, ChatGPT demonstrated the ability to replicate fundamental tasks of orthopedic systematic review. Cautious, supervised use of this general-purpose LLM may aid the process of systematic literature review. Further study and discussion regarding the role of LLMs in literature review are needed.
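
The capture percentages reported in the Results can be read as the share of a published review's included articles that ChatGPT also retained at a given step. The following is a minimal sketch of that calculation, not the authors' analysis code. The function names `capture_rate` and `summarize`, the article identifiers, and the choice of quartile method are all assumptions made for illustration; the abstract does not state how the quartiles were computed.

```python
from statistics import median, quantiles


def capture_rate(retrieved, gold):
    """Fraction of gold-standard articles that also appear in `retrieved`.

    `gold` holds the articles included in a published review and
    `retrieved` holds the articles kept at one step (search or screening).
    Both are collections of hypothetical article identifiers.
    """
    gold = set(gold)
    if not gold:
        raise ValueError("gold-standard set must not be empty")
    return len(gold & set(retrieved)) / len(gold)


def summarize(rates):
    """Return (median, first quartile, third quartile) of capture rates.

    Assumption: inclusive quartiles; other quartile conventions would
    give slightly different values on small samples.
    """
    q1, _, q3 = quantiles(rates, n=4, method="inclusive")
    return median(rates), q1, q3


# Hypothetical identifiers for one review, for illustration only.
published = {"A", "B", "C", "D"}
kept_by_model = {"A", "B", "C", "X"}
rate = capture_rate(kept_by_model, published)  # 3 of 4 gold articles kept
```

Computing one rate per review and passing the list of rates to `summarize` yields a median and IQR in the same form as those reported above.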


Articles in Press, Accepted Manuscript
Available Online from 22 April 2025
  • Receive Date: 31 December 2024
  • Revise Date: 24 February 2025
  • Accept Date: 06 March 2025