Methodology for Online Reliability Studies: A Primer for Orthopedic Surgeons

Claessen, Femke M.A.P.; Zwiers, Ruben; Guitton, Thierry G.; Doornberg, Job N.

doi:10.22038/abjs.2022.44387.2214

Document Type : CURRENT CONCEPTS REVIEW

Authors

¹ Haaglanden Medisch Centrum, Leidschendam, Orthotrauma Research Center Amsterdam, Amsterdam, the Netherlands

² Orthotrauma Research Center Amsterdam, Amsterdam, the Netherlands

³ Department of Plastic, Reconstructive, Hand and Burn Surgery, Martini Hospital, Burn Center Groningen, the Netherlands

⁴ Flinders Medical Center, Royal Adelaide Hospital, Adelaide, Australia

Abstract

In orthopedic surgery, a growing number of papers report online studies on the reliability of classification
systems. Useful classification systems need to be reliable and valid. Measurement of validity can be variable and is
prone to observer bias.
These online collaboratives derive adequate power to study reliability by having a large group of trained surgeons
review a small number of cases, in contrast to “classic” reliability studies, in which a small number of observers
evaluate many cases. Large online interobserver studies have several advantages: the survey takes less than 15 minutes
to complete, randomization is possible, and factors associated with reliability, accuracy, or decision-making can be
studied. This ‘handbook’ paper gives an overview of current methods for online reliability studies. We discuss the
study design, sample size calculation, statistical analysis of results, pitfalls, and limitations of the study design.
Level of evidence: V
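
The design described in the abstract (many observers, few cases) is typically summarized with a multi-rater agreement statistic. The abstract does not say which statistic the paper recommends, so the following is only a minimal sketch of one common choice, Fleiss' kappa, written in plain Python. The function name, the data layout (one list of labels per case), and the example ratings are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for several observers rating the same cases.

    ratings: list of cases; each case is a list of category labels,
    one label per observer (hypothetical layout chosen for this sketch).
    Every case must be rated by the same number of observers.
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two observers per case are required")
    if any(len(case) != n_raters for case in ratings):
        raise ValueError("every case needs the same number of observers")

    counts = [Counter(case) for case in ratings]
    categories = set().union(*counts)

    # Observed agreement: per-case proportion of agreeing observer pairs,
    # averaged over all cases.
    per_case = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_case) / n_cases

    # Chance agreement: sum of squared overall category proportions.
    total = n_cases * n_raters
    p_expected = sum((sum(c[k] for c in counts) / total) ** 2 for k in categories)

    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Illustrative data only: 3 cases, each classified by 5 observers.
    example = [
        ["I", "I", "I", "II", "I"],
        ["II", "II", "III", "II", "II"],
        ["III", "III", "III", "III", "I"],
    ]
    print(round(fleiss_kappa(example), 3))
```

With few cases and many observers, each case contributes many observer pairs to the observed agreement term, which is what lets a short survey still yield a usable estimate; the precision of that estimate is a separate question that a sample size calculation has to address.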

Main Subjects

General Orthopedics

Volume 11, Issue 2
February 2023
Pages 94-101
