Evaluation of a Popular Large Language Model in Orthopedic Literature Review: Comparison to Previously Published Reviews

Yao, Jie J.; Lopez, Ryan D.; Rizk, Adam A.; Aggarwal, Manan; Namdari, Surena

doi:10.22038/abjs.2025.84896.3874

Document Type: RESEARCH PAPER

Authors

Jie J. Yao ¹

Ryan D. Lopez ¹

Adam A. Rizk ¹

Manan Aggarwal ²

Surena Namdari ¹

¹ Rothman Orthopaedic Institute, Thomas Jefferson University, Philadelphia, PA, USA

² Google Search AI, Google, Mountain View, CA, USA

Abstract

Objectives: Large language models (LLMs) may improve the process of conducting systematic literature reviews. Our aim was to evaluate the utility of one popular LLM chatbot, Chat Generative Pre-trained Transformer (ChatGPT), in systematic literature reviews when compared to traditionally conducted reviews.
Methods: We identified five systematic reviews published in the Journal of Bone and Joint Surgery from 2021 to 2022. We retrieved the clinical questions, methodologies, and included studies for each review. We evaluated ChatGPT’s performance on three tasks. (1) For each published systematic review’s core clinical question, ChatGPT designed a relevant database search strategy. (2) ChatGPT screened the abstracts of those articles identified by that search strategy for inclusion in a review. (3) For one systematic review, ChatGPT reviewed each individual manuscript identified after screening to identify those that fit inclusion criteria. We compared the performance of ChatGPT on each of these three tasks to the previously published systematic reviews.
Results: ChatGPT's search strategies captured a median of 91% (interquartile range [IQR], 84% to 94%) of the articles included in the published systematic reviews. After abstract screening, ChatGPT retained a median of 75% (IQR, 70% to 79%) of the articles included in the published systematic reviews. On in-depth screening of manuscripts, ChatGPT captured only 55% of target publications; however, this improved to 100% on review of the manuscripts that ChatGPT identified on this step. Qualitative analysis of ChatGPT's performance highlighted the importance of prompt design and engineering.
Conclusion: Using published reviews as a gold standard, ChatGPT demonstrated the ability to replicate fundamental tasks of orthopedic systematic review. With cautious use and supervision, this general-purpose LLM may aid the process of systematic literature review. Further study and discussion of the role of LLMs in literature review are needed.
Level of evidence: III
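
The capture percentages reported in the Results can be read as the share of a published review's included articles that also appear in ChatGPT's output, summarized across reviews as a median with quartiles. The following is a minimal Python sketch of that arithmetic, not the authors' analysis code. The function names, the placeholder article identifiers, and the choice of the "inclusive" quartile method are all assumptions; the abstract does not state how the IQR was computed.

```python
from statistics import median, quantiles


def capture_rate(retrieved, gold):
    """Fraction of gold-standard articles that also appear in the retrieved set."""
    gold = set(gold)
    if not gold:
        raise ValueError("gold-standard set must not be empty")
    return len(gold & set(retrieved)) / len(gold)


def summarize(rates):
    """Return (median, Q1, Q3) of per-review capture rates.

    Assumption: quartiles use the 'inclusive' method; other conventions
    give slightly different values on small samples.
    """
    q1, _, q3 = quantiles(rates, n=4, method="inclusive")
    return median(rates), q1, q3


# Placeholder identifiers only; these are not data from the study.
gold_standard = {"A", "B", "C", "D"}
chatgpt_output = {"A", "B", "C", "X", "Y"}
rate = capture_rate(chatgpt_output, gold_standard)  # 3 of 4 gold articles found
```

Articles retrieved by ChatGPT but absent from the published review (here "X" and "Y") do not affect this measure, which is why a broad search can score well on capture while still leaving many abstracts to screen.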

Keywords

ChatGPT

Large language models

Orthopedics

Systematic review

Subjects

General Orthopedics

Volume 13, Issue 8
August 2025
Pages 460-469

Receive Date 31 December 2024
Revise Date 24 February 2025
Accept Date 06 March 2025

The Archives of Bone and Joint Surgery
