Inter-rater reliability

Last updated: 2025-11-03

Checks: 7 passed, 0 warnings

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 213cab1

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 213cab1. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/rater.Rmd) and HTML (docs/rater.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	213cab1	Dave Tang	2025-11-03	{yardstick} can calculate Cohen’s Kappa
html	719e63e	Dave Tang	2025-10-08	Build site.
Rmd	466f85f	Dave Tang	2025-10-08	Cohen’s Kappa with random results
html	9a6ead6	Dave Tang	2025-10-06	Build site.
Rmd	47f2160	Dave Tang	2025-10-06	Manually calculate Fleiss’ Kappa
html	a8d6f16	Dave Tang	2025-10-06	Build site.
Rmd	59c9392	Dave Tang	2025-10-06	Inter-rater reliability

Introduction

Measures of inter-rater reliability (IRR) provide an index indicating how much agreement there is between raters/observers, correcting for agreement that would happen just by chance.

Packages

Install {irr}.

install.packages("irr")

Cohen’s Kappa

Cohen’s Kappa measures agreement between two raters who classify the same items into categories.

\(\kappa\) = 1 means perfect agreement.
\(\kappa\) = 0 means agreement no better than chance.
\(\kappa\) < 0 means agreement worse than chance, i.e., systematic disagreement.

\[ \kappa = \frac{P_o - P_e}{1 - P_e} \]

where

\(P_o\) = observed agreement (proportion of times both raters agree).
\(P_e\) = expected agreement by chance.

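cohen_kappa <- function(x, y) {
  stopifnot(length(x) == length(y))

  # Use a shared set of categories so the confusion matrix is square
  # even when one rater never uses a category
  lvls <- sort(union(as.character(x), as.character(y)))
  confusion_matrix <- table(x = factor(x, levels = lvls), y = factor(y, levels = lvls))
  n <- sum(confusion_matrix)

  # Observed agreement: proportion of items on the diagonal
  P_o <- sum(diag(confusion_matrix)) / n

  # Expected agreement: product of the two raters' marginal proportions,
  # summed over categories
  row_marginals <- rowSums(confusion_matrix) / n
  col_marginals <- colSums(confusion_matrix) / n
  P_e <- sum(row_marginals * col_marginals)

  kappa <- (P_o - P_e) / (1 - P_e)
  list(
    kappa = kappa,
    observed = P_o,
    expected = P_e,
    confusion_matrix = confusion_matrix
  )
}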

Agreement between two doctors on 50 patients.

	Doctor 2: Disease	Doctor 2: No Disease	Row Total
Doctor 1: Disease	15	5	20
Doctor 1: No Disease	10	20	30
Column Total	25	25	50

Total agreement = 15 + 20 = 35.

\[ P_o = \frac{35}{50} = 0.70 \quad \text{(70\% agreement observed)} \]

How much agreement would we expect just by chance, given how often each doctor says “Disease” versus “No Disease”?

Doctor 1 says “Disease”: 20/50 = 0.40
Doctor 1 says “No Disease”: 30/50 = 0.60
Doctor 2 says “Disease”: 25/50 = 0.50
Doctor 2 says “No Disease”: 25/50 = 0.50

Now multiply matching probabilities:

Chance both say “Disease” = 0.40 × 0.50 = 0.20
Chance both say “No Disease” = 0.60 × 0.50 = 0.30

Expected agreement = 0.20 + 0.30 = 0.50 (50%)

Now calculate Cohen’s Kappa manually:

\[ \kappa = \frac{P_o - P_e}{1 - P_e} = \frac{0.70 - 0.50}{1 - 0.50} = \frac{0.20}{0.50} = 0.40 \]
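The same arithmetic can be reproduced directly from the 2 × 2 table. This is a minimal sketch in base R; the object names (tab, p_o, p_e, kappa_manual) are chosen here for illustration and are not part of the original analysis.

```r
# 2 x 2 table from the worked example (rows: Doctor 1, columns: Doctor 2)
tab <- matrix(
  c(15,  5,
    10, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    doctor1 = c("Disease", "No Disease"),
    doctor2 = c("Disease", "No Disease")
  )
)

n <- sum(tab)

# observed agreement: diagonal cells over the total
p_o <- sum(diag(tab)) / n

# expected agreement: matching marginal proportions multiplied, then summed
p_e <- sum((rowSums(tab) / n) * (colSums(tab) / n))

kappa_manual <- (p_o - p_e) / (1 - p_e)
c(observed = p_o, expected = p_e, kappa = kappa_manual)
```

The three values should match the hand calculation: 0.70, 0.50, and 0.40.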

Using our function and irr::kappa2().

doc1 <- factor(c(rep('D', 15), rep('N', 20), rep('N', 10), rep('D', 5)))
doc2 <- factor(c(rep('D', 15), rep('N', 20), rep('D', 10), rep('N', 5)))

cohen_kappa(doc1, doc2)$kappa

[1] 0.4

irr::kappa2(data.frame(x = doc1, y = doc2))

 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 50 
   Raters = 2 
    Kappa = 0.4 

        z = 2.89 
  p-value = 0.00389

Using {yardstick}.

yardstick::kap_vec(doc1, doc2)

[1] 0.4
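As a quick check of the three reference values of \(\kappa\) (1, 0, and below 0), the sketch below applies the formula to three toy pairs of ratings. The helper kappa_two_raters() and the toy vectors are hypothetical and written only for this illustration; the helper follows the same formula as cohen_kappa() but is redefined so the example runs on its own.

```r
# Hypothetical helper: unweighted kappa = (P_o - P_e) / (1 - P_e)
kappa_two_raters <- function(x, y) {
  lvls <- sort(union(x, y))
  tab <- table(factor(x, levels = lvls), factor(y, levels = lvls))
  n <- sum(tab)
  p_o <- sum(diag(tab)) / n
  p_e <- sum((rowSums(tab) / n) * (colSums(tab) / n))
  (p_o - p_e) / (1 - p_e)
}

rater_a <- c(1, 1, 0, 0)

k_perfect  <- kappa_two_raters(rater_a, rater_a)        # identical ratings
k_chance   <- kappa_two_raters(rater_a, c(1, 0, 1, 0))  # agree on exactly half the items
k_opposite <- kappa_two_raters(rater_a, 1 - rater_a)    # disagree on every item

c(perfect = k_perfect, chance = k_chance, opposite = k_opposite)
```

With these inputs the three values are 1, 0, and -1. In all three cases the expected agreement is 0.5, so only the observed agreement drives the difference.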

Fleiss’ Kappa

Fleiss’ Kappa generalises Cohen’s Kappa to more than two raters. Each item is rated by the same number of raters, m (not necessarily the same individuals for every item). The agreement is computed per item, averaged across items, and then corrected for chance.

\[ \kappa = \frac{\bar{P} - \bar{P_e}}{1 - \bar{P_e}} \]

where

\(\bar{P}\) = mean observed agreement across items.
\(\bar{P_e}\) = mean expected agreement by chance.
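
For reference, the two averaged quantities can be written out in terms of counts. The notation is introduced here for illustration and is an assumption rather than the document's own: \(N\) is the number of items, \(m\) the number of raters per item, and \(n_{ij}\) the number of raters who assigned item \(i\) to category \(j\).

\[ p_j = \frac{1}{N m} \sum_{i=1}^{N} n_{ij}, \qquad P_i = \frac{1}{m(m - 1)} \left( \sum_{j} n_{ij}^2 - m \right) \]

\[ \bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad \bar{P_e} = \sum_{j} p_j^2 \]

Here \(p_j\) is the overall proportion of ratings in category \(j\), and \(P_i\) is the proportion of rater pairs that agree on item \(i\).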

Data:

Psychiatric diagnoses of n=30 patients provided by different sets of m=6 raters. Data were used by Fleiss (1971) to illustrate the computation of Kappa for m raters.

data(diagnoses, package = "irr")
dim(diagnoses)

[1] 30  6

Fleiss’ Kappa.

irr::kappam.fleiss(diagnoses)

 Fleiss' Kappa for m Raters

 Subjects = 30 
   Raters = 6 
    Kappa = 0.43 

        z = 17.7 
  p-value = 0

Manually calculate.

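# keep only the leading category number of each diagnosis label
ratings <- lapply(diagnoses, \(x) as.integer(sub("\\. .*", "", x))) |>
  as.data.frame() |>
  as.matrix()

# number of patients (items)
N <- nrow(ratings)
# number of raters per patient
n <- ncol(ratings)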

cats <- sort(unique(as.numeric(ratings)))

# build item × category counts
counts <- t(apply(ratings, 1, function(row) {
  tab <- table(factor(row, levels = cats))
  as.integer(tab)
}))
colnames(counts) <- cats
  
# category proportions across all items
p_j <- colSums(counts) / (N * n)

# agreement per item
P_i <- (rowSums(counts^2) - n) / (n * (n - 1))

# observed and expected agreement
P_bar <- mean(P_i)
P_e <- sum(p_j^2)

fkappa <- (P_bar - P_e) / (1 - P_e)
fkappa

[1] 0.4302445
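
The manual steps above can be wrapped into a small function and checked on toy data where the answer is easy to verify by hand. This is a sketch only: fleiss_kappa() and the two toy matrices are hypothetical names introduced here, and the function assumes a complete items × raters matrix with no missing ratings and at least two categories in use.

```r
# Hypothetical helper: Fleiss' Kappa from an items x raters matrix of category labels
fleiss_kappa <- function(ratings) {
  ratings <- as.matrix(ratings)
  N <- nrow(ratings)  # items
  m <- ncol(ratings)  # raters per item
  cats <- sort(unique(as.vector(ratings)))
  stopifnot(N >= 2, m >= 2, length(cats) >= 2, !anyNA(ratings))

  # items x categories matrix: how many raters chose each category per item
  counts <- vapply(cats, function(k) rowSums(ratings == k), numeric(N))

  p_j <- colSums(counts) / (N * m)                   # category proportions
  P_i <- (rowSums(counts^2) - m) / (m * (m - 1))     # agreement per item
  P_bar <- mean(P_i)                                 # mean observed agreement
  P_e <- sum(p_j^2)                                  # expected agreement by chance
  (P_bar - P_e) / (1 - P_e)
}

# every item rated unanimously
perfect <- matrix(c(1, 1, 1,
                    2, 2, 2,
                    1, 1, 1,
                    2, 2, 2), ncol = 3, byrow = TRUE)

# two unanimous items and two items with a 2-vs-1 split
mixed <- matrix(c(1, 1, 2,
                  1, 2, 2,
                  1, 1, 1,
                  2, 2, 2), ncol = 3, byrow = TRUE)

fleiss_kappa(perfect)
fleiss_kappa(mixed)
```

For the unanimous matrix the result is 1. For the mixed matrix, the mean observed agreement is 2/3 and the expected agreement is 1/2, which gives a Kappa of 1/3.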

Random

Since the expected number of agreements is taken into consideration, random guesses should result in a Kappa close to zero.

set.seed(1984)
replicate(
  100, 
  {
    a <- rbinom(n = 100, size = 1, prob = 0.5)
    b <- rbinom(n = 100, size = 1, prob = 0.5)
    irr::kappa2(data.frame(a, b))$value
  }
) |>
  mean()

[1] 0.003047436
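
The chance correction matters most when one category dominates. The sketch below is an illustrative variation on the simulation above, not part of the original analysis: both raters guess independently but say 1 about 90% of the time. The helper agreement_stats(), the 0.9 probability, and the 200 replicates are assumptions chosen for the example.

```r
# Hypothetical helper: raw agreement and unweighted Cohen's Kappa for two 0/1 vectors
agreement_stats <- function(a, b) {
  tab <- table(factor(a, levels = 0:1), factor(b, levels = 0:1))
  n <- sum(tab)
  p_o <- sum(diag(tab)) / n
  p_e <- sum((rowSums(tab) / n) * (colSums(tab) / n))
  c(raw = p_o, kappa = (p_o - p_e) / (1 - p_e))
}

set.seed(1984)
sims <- replicate(200, {
  # independent guesses with a strong preference for category 1
  a <- rbinom(n = 100, size = 1, prob = 0.9)
  b <- rbinom(n = 100, size = 1, prob = 0.9)
  agreement_stats(a, b)
})

# average raw agreement and average Kappa across simulations
sim_means <- rowMeans(sims)
sim_means
```

The raw agreement should come out high (its expected value is 0.9 × 0.9 + 0.1 × 0.1 = 0.82) even though the raters are guessing, while the average Kappa should stay close to zero.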

sessionInfo()

R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] yardstick_1.3.2 irr_0.84.1      lpSolve_5.6.23  lubridate_1.9.4
 [5] forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.4    
 [9] readr_2.1.5     tidyr_1.3.1     tibble_3.3.0    ggplot2_3.5.2  
[13] tidyverse_2.0.0 workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] sass_0.4.10        generics_0.1.4     stringi_1.8.7      hms_1.1.3         
 [5] digest_0.6.37      magrittr_2.0.3     timechange_0.3.0   evaluate_1.0.3    
 [9] grid_4.5.0         RColorBrewer_1.1-3 fastmap_1.2.0      rprojroot_2.0.4   
[13] jsonlite_2.0.0     processx_3.8.6     whisker_0.4.1      ps_1.9.1          
[17] promises_1.3.3     httr_1.4.7         scales_1.4.0       jquerylib_0.1.4   
[21] cli_3.6.5          rlang_1.1.6        withr_3.0.2        cachem_1.1.0      
[25] yaml_2.3.10        tools_4.5.0        tzdb_0.5.0         httpuv_1.6.16     
[29] vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.4    git2r_0.36.2      
[33] fs_1.6.6           pkgconfig_2.0.3    callr_3.7.6        pillar_1.10.2     
[37] bslib_0.9.0        later_1.4.2        gtable_0.3.6       glue_1.8.0        
[41] Rcpp_1.0.14        xfun_0.52          tidyselect_1.2.1   rstudioapi_0.17.1 
[45] knitr_1.50         farver_2.1.2       htmltools_0.5.8.1  rmarkdown_2.29    
[49] compiler_4.5.0     getPass_0.2-4