From Algorithms to Academia: An Endeavor to Benchmark AI-Generated Scientific Papers against Human Standards

Document Type : RESEARCH PAPER

Authors

Foot & Ankle Research and Innovation Lab (FARIL), Department of Orthopaedic Surgery, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA

DOI: 10.22038/abjs.2024.80093.3669

Abstract

Objectives: The aim of this study is to quantitatively investigate the accuracy of text generated by AI 
large language models and to compare its readability and likelihood of being accepted to a scientific 
journal with those of human-authored papers on the same topics.
Methods: The study included two papers written by ChatGPT, two papers written by Assistant by scite, and two 
papers written by humans. Six independent reviewers were blinded to the authorship of each paper and 
assigned a grade to each subsection on a scale of 1 to 4. Additionally, each reviewer was asked to guess whether the 
paper was written by a human or AI and to explain their reasoning. The study authors also graded each AI-generated 
paper on the factual accuracy of its claims and citations.
Results: The human-written calcaneus fracture paper received the highest score (3.70/4), followed by the 
Assistant-written calcaneus fracture paper (3.02/4), the human-written ankle osteoarthritis paper (2.98/4), the 
ChatGPT calcaneus fracture paper (2.89/4), the ChatGPT ankle osteoarthritis paper (2.87/4), and the Assistant 
ankle osteoarthritis paper (2.78/4). The human calcaneus fracture paper received a statistically significantly 
higher rating than the ChatGPT calcaneus fracture paper (P = 0.028) and the Assistant calcaneus fracture paper 
(P = 0.043). The ChatGPT ankle osteoarthritis review showed 100% factual accuracy, the ChatGPT calcaneus 
fracture review was 97.46% factually accurate, the Assistant calcaneus fracture review was 95.56% accurate, and 
the Assistant ankle osteoarthritis review was 94.98% accurate. Regarding citations, the ChatGPT ankle 
osteoarthritis paper was 90% accurate, the ChatGPT calcaneus fracture paper was 69.23% accurate, the Assistant 
ankle osteoarthritis paper was 35.14% accurate, and the Assistant calcaneus fracture paper was 39.68% accurate.
Conclusion: While AI holds the promise of enhancing knowledge sharing, it must be used responsibly and in 
conjunction with comprehensive fact-checking procedures to maintain the integrity of scientific discourse.
Level of evidence: III
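
The factual and citation accuracy figures in the Results are percentages of checked items judged correct. As a rough illustration of that arithmetic, the following Python sketch computes such a percentage. The function name and the counts are hypothetical and chosen only for illustration; the abstract does not report the underlying counts or the exact grading procedure.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return the share of correct items as a percentage, rounded to 2 decimals.

    Hypothetical helper for illustration; not taken from the study.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return round(100.0 * correct / total, 2)


# Hypothetical counts: 18 of 20 checked citations verified as accurate.
citation_accuracy = accuracy_percent(18, 20)
print(f"Citation accuracy: {citation_accuracy}%")
```

The same calculation would apply to factual claims, with each claim counted as one checked item.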

