Methodology for Online Reliability Studies: A Primer for Orthopedic Surgeons



1 Haaglanden Medisch Centrum, Leidschendam, Orthotrauma Research Center Amsterdam, Amsterdam, the Netherlands

2 Orthotrauma Research Center Amsterdam, Amsterdam, the Netherlands

3 Department of Plastic, Reconstructive, Hand and Burn Surgery, Martini Hospital, Burn Center Groningen, the Netherlands

4 Flinders Medical Center, Royal Adelaide Hospital, Adelaide, Australia


In orthopedic surgery, a growing number of papers report online studies on the reliability of classification 
systems. Useful classification systems need to be both reliable and valid. Measurement of validity can be variable and is 
prone to observer bias. 
Online collaboratives derive adequate power to study reliability by having a large group of trained surgeons 
review a small number of cases, in contrast to “classic” reliability studies, in which a small number of observers evaluate 
many cases. Large online interobserver studies have several advantages: the survey takes less than 15 minutes to complete, 
randomization is possible, and factors associated with reliability, accuracy, or decision-making can be studied. This 
“handbook” paper gives an overview of current methods for online reliability studies. We discuss study design, 
sample size calculation, statistical analysis of results, pitfalls, and limitations of the study design.
Level of evidence: V
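
To make the "many observers, few cases" design concrete, the following minimal sketch computes a chance-corrected agreement statistic for several observers who each classify the same small set of cases. It assumes Fleiss' kappa as the statistic, which is one common choice for multiple raters and is not prescribed by the abstract above. The function name `fleiss_kappa`, the category labels, and the toy ratings are hypothetical and for illustration only.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of cases, each a list of labels (one per observer).

    Assumes every case is rated by the same number of observers (at least two).
    Returns NaN when all ratings fall in a single category, because chance
    agreement is then 1 and kappa is undefined.
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(case) != n_raters for case in ratings):
        raise ValueError("each case needs the same number (>= 2) of ratings")

    counts = [Counter(case) for case in ratings]

    # Observed agreement per case: proportion of agreeing observer pairs.
    per_case = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_case) / n_cases

    # Chance agreement from the overall share of ratings in each category.
    shares = [
        sum(c[k] for c in counts) / (n_cases * n_raters) for k in categories
    ]
    p_chance = sum(s ** 2 for s in shares)

    if p_chance == 1.0:
        return float("nan")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy data (hypothetical): 6 observers classify 4 cases as type "A" or "B".
toy_ratings = [
    ["A", "A", "A", "A", "A", "B"],
    ["B", "B", "B", "B", "A", "B"],
    ["A", "A", "B", "B", "A", "B"],
    ["B", "B", "B", "B", "B", "B"],
]
print(round(fleiss_kappa(toy_ratings, ["A", "B"]), 3))
```

In a real study, the ratings would come from the survey export, and the point estimate would be reported with a confidence interval; this sketch covers the point estimate only.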

