Last updated: 2025-03-29
Checks: 7 0
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200712)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version a1be60c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rproj.user/
Ignored: data/1M_neurons_filtered_gene_bc_matrices_h5.h5
Ignored: data/293t/
Ignored: data/293t_3t3_filtered_gene_bc_matrices.tar.gz
Ignored: data/293t_filtered_gene_bc_matrices.tar.gz
Ignored: data/5k_Human_Donor1_PBMC_3p_gem-x_5k_Human_Donor1_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
Ignored: data/5k_Human_Donor2_PBMC_3p_gem-x_5k_Human_Donor2_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
Ignored: data/5k_Human_Donor3_PBMC_3p_gem-x_5k_Human_Donor3_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
Ignored: data/5k_Human_Donor4_PBMC_3p_gem-x_5k_Human_Donor4_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
Ignored: data/97516b79-8d08-46a6-b329-5d0a25b0be98.h5ad
Ignored: data/Parent_SC3v3_Human_Glioblastoma_filtered_feature_bc_matrix.tar.gz
Ignored: data/brain_counts/
Ignored: data/cl.obo
Ignored: data/cl.owl
Ignored: data/jurkat/
Ignored: data/jurkat:293t_50:50_filtered_gene_bc_matrices.tar.gz
Ignored: data/jurkat_293t/
Ignored: data/jurkat_filtered_gene_bc_matrices.tar.gz
Ignored: data/pbmc20k/
Ignored: data/pbmc20k_seurat/
Ignored: data/pbmc3k/
Ignored: data/pbmc3k_seurat.rds
Ignored: data/pbmc4k_filtered_gene_bc_matrices.tar.gz
Ignored: data/pbmc_1k_v3_filtered_feature_bc_matrix.h5
Ignored: data/pbmc_1k_v3_raw_feature_bc_matrix.h5
Ignored: data/refdata-gex-GRCh38-2020-A.tar.gz
Ignored: data/seurat_1m_neuron.rds
Ignored: data/t_3k_filtered_gene_bc_matrices.tar.gz
Ignored: r_packages_4.4.1/
Untracked files:
Untracked: analysis/bioc_scrnaseq.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/fisher.Rmd
) and HTML
(docs/fisher.html
) files. If you’ve configured a remote Git
repository (see ?wflow_git_remote
), click on the hyperlinks
in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | a1be60c | Dave Tang | 2025-03-29 | Left justify by adding colon |
html | 6b4d189 | Dave Tang | 2025-03-29 | Build site. |
Rmd | 1661426 | Dave Tang | 2025-03-29 | Left justify |
html | 3a24351 | Dave Tang | 2025-03-29 | Build site. |
Rmd | cd487f9 | Dave Tang | 2025-03-29 | Label significant cluster in volcano plot |
html | c33357f | Dave Tang | 2025-03-29 | Build site. |
Rmd | e9e40c4 | Dave Tang | 2025-03-29 | Add some notes |
html | 4bccd40 | Dave Tang | 2025-03-29 | Build site. |
Rmd | 77bdfb3 | Dave Tang | 2025-03-29 | Volcano plot and cluster proportions |
html | 1b5016d | Dave Tang | 2025-03-29 | Build site. |
Rmd | a86eba5 | Dave Tang | 2025-03-29 | Per-cluster Fisher’s Exact Test |
html | 5477eb3 | Dave Tang | 2025-03-29 | Build site. |
Rmd | aa71d48 | Dave Tang | 2025-03-29 | Fisher’s Exact Test |
If you have more than two clusters, Fisher’s Exact Test becomes a generalised Fisher’s Exact Test (also known as Fisher-Freeman-Halton test), because the regular Fisher’s test is for 2x2 tables.
For larger tables, people often use:
Create cluster table.
treated <- c(50, 60, 40, 30, 80, 90, 20, 15, 70, 45)
untreated <- c(55, 50, 35, 25, 85, 95, 25, 20, 65, 40)
cluster_table <- rbind(treated, untreated)
colnames(cluster_table) <- paste0("cluster", 0:9)
rownames(cluster_table) <- c("Treated", "Untreated")
cluster_table
cluster0 cluster1 cluster2 cluster3 cluster4 cluster5 cluster6
Treated 50 60 40 30 80 90 20
Untreated 55 50 35 25 85 95 25
cluster7 cluster8 cluster9
Treated 15 70 45
Untreated 20 65 40
Chi-squared Test.
chisq_test <- chisq.test(cluster_table)
chisq_test
Pearson's Chi-squared test
data: cluster_table
X-squared = 3.9458, df = 9, p-value = 0.9149
Generalised Fisher’s Exact Test.
workspace
- an integer specifying the size of the
workspace used in the network algorithm. In units of 4 bytes. Only used
for non-simulated p-values larger than 2×22×2 tables. Since R version
3.5.0, this also increases the internal stack size which allows larger
problems to be solved, however sometimes needing hours. In such cases,
simulate.p.values=TRUE may be more reasonable.fisher_test <- fisher.test(cluster_table, workspace=2e8)
fisher_test
Fisher's Exact Test for Count Data
data: cluster_table
p-value = 0.917
alternative hypothesis: two.sided
Another cluster table.
treated2 <- c(200, 60, 40, 30, 80, 90, 20, 15, 70, 45)
untreated2 <- c(55, 50, 35, 30, 85, 95, 25, 20, 75, 40)
cluster_table2 <- rbind(treated2, untreated2)
colnames(cluster_table2) <- paste0("cluster", 0:9)
rownames(cluster_table2) <- c("Treated", "Untreated")
cluster_table2
cluster0 cluster1 cluster2 cluster3 cluster4 cluster5 cluster6
Treated 200 60 40 30 80 90 20
Untreated 55 50 35 30 85 95 25
cluster7 cluster8 cluster9
Treated 15 70 45
Untreated 20 75 40
Chi-squared Test; Chi-squared is valid when expected counts are reasonably large (> 5 cells per condition per cluster).
chisq_test2 <- chisq.test(cluster_table2)
chisq_test2
Pearson's Chi-squared test
data: cluster_table2
X-squared = 69.837, df = 9, p-value = 1.639e-11
Generalised Fisher’s Exact Test.
simulate.p.value
- a logical indicating whether to
compute p-values by Monte Carlo simulation, in larger than 2×22×2
tables.fisher_test2 <- fisher.test(cluster_table2, simulate.p.value = TRUE)
fisher_test2
Fisher's Exact Test for Count Data with simulated p-value (based on
2000 replicates)
data: cluster_table2
p-value = 0.0004998
alternative hypothesis: two.sided
Notes:
treated3 <- c(50, 60, 40, 30, 80, 90, 20, 15, 70, 45)
untreated3 <- untreated*3
cluster_table3 <- rbind(treated3, untreated3)
colnames(cluster_table3) <- paste0("cluster", 0:9)
rownames(cluster_table3) <- c("Treated", "Untreated")
chisq.test(cluster_table3)
Pearson's Chi-squared test
data: cluster_table3
X-squared = 5.8925, df = 9, p-value = 0.7506
Perform one Fisher’s Exact Test per cluster.
treated <- c(200, 60, 40, 30, 80, 90, 20, 15, 70, 45)
untreated <- c(55, 50, 35, 30, 85, 95, 25, 20, 75, 40)
clusters <- paste0("cluster", 0:9)
total_treated <- sum(treated)
total_untreated <- sum(untreated)
res <- purrr::map(seq_along(clusters), \(i){
contingency_table <- matrix(
c(
treated[i], total_treated - treated[i],
untreated[i], total_untreated - untreated[i]
),
nrow=2,
byrow=TRUE
)
fisher_res <- fisher.test(contingency_table)
list(
table = contingency_table,
stat = fisher_res
)
})
res[[1]]
$table
[,1] [,2]
[1,] 200 450
[2,] 55 455
$stat
Fisher's Exact Test for Count Data
data: contingency_table
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
2.631973 5.188538
sample estimates:
odds ratio
3.672694
As a table with multiple testing correction.
data.frame(
Cluster = clusters,
Treated = treated,
Untreated = untreated,
pvalue = purrr::map_dbl(res, \(x) x$stat$p.value),
odds_ratio = purrr::map_dbl(res, \(x) x$stat$estimate)
) -> res_df
res_df$padj <- p.adjust(res_df$pvalue, method = "BH")
res_df$log2_odds_ratio <- log2(res_df$odds_ratio)
res_df
Cluster Treated Untreated pvalue odds_ratio padj
1 cluster0 200 55 7.654036e-17 3.6726937 7.654036e-16
2 cluster1 60 50 7.625829e-01 0.9356553 7.625829e-01
3 cluster2 40 35 6.324680e-01 0.8900209 7.027423e-01
4 cluster3 30 30 3.519013e-01 0.7743725 5.027161e-01
5 cluster4 80 85 4.189083e-02 0.7019839 1.226881e-01
6 cluster5 90 95 2.924272e-02 0.7022974 1.226881e-01
7 cluster6 20 25 1.259970e-01 0.6161451 2.099949e-01
8 cluster7 15 20 1.215913e-01 0.5790270 2.099949e-01
9 cluster8 70 75 4.907522e-02 0.7002301 1.226881e-01
10 cluster9 45 40 5.716169e-01 0.8740708 7.027423e-01
log2_odds_ratio
1 1.87683857
2 -0.09595098
3 -0.16808884
4 -0.36890045
5 -0.51049016
6 -0.50984593
7 -0.69865806
8 -0.78829757
9 -0.51409895
10 -0.19417790
Volcano plot.
library(ggplot2)
alpha <- 0.05
res_df$Significant <- res_df$padj < alpha
ggplot(res_df, aes(x=log2_odds_ratio, y=-log10(pvalue))) +
geom_point(aes(colour=Significant), size=3) +
geom_text(
data=res_df[res_df$Significant,],
aes(label=Cluster),
nudge_y=1,
size=4
) +
geom_hline(yintercept = -log10(0.05), linetype="dashed", colour="red") +
scale_colour_manual(values=c("grey", "firebrick")) +
xlab("Log2 Odds Ratio") +
ylab("-log10(p-value)") +
ggtitle("Volcano Plot of Cluster Proportions (Treated vs Untreated)") +
theme_minimal()
Plot Cluster Proportions.
prop_df <- data.frame(
Cluster = rep(clusters, 2),
Condition = rep(c("Treated", "Untreated"), each=length(clusters)),
Proportion = c(treated / total_treated, untreated / total_untreated)
)
ggplot(prop_df, aes(x=Cluster, y=Proportion, fill=Condition)) +
geom_bar(stat="identity", position="dodge") +
ylab("Proportion of Cells") +
theme_minimal() +
theme(axis.title.x = element_blank())
Version | Author | Date |
---|---|---|
4bccd40 | Dave Tang | 2025-03-29 |
If using Seurat, just use
table(Idents(seurat_obj), seurat_obj$condition)
to create
the contingency table and carry out the steps as per above.
The odds ratio compares the odds of finding a cell in a specific cluster in treated cells versus in untreated cells.
For example:
Cluster A | Not Cluster A | Total | |
---|---|---|---|
Treated | 200 | 450 | 650 |
Untreated | 55 | 455 | 510 |
The odds of being in Cluster A:
\[ {Odds}_{\text{treated}} = \frac{200}{450} = 0.4444 \]
\[ \text{Odds}_{\text{untreated}} = \frac{55}{455} = 0.1209 \]
So the odds ratio is:
\[ \text{OR} = \frac{0.4444}{0.1209} \approx 3.6758 \]
res_df[1, ]
Cluster Treated Untreated pvalue odds_ratio padj
1 cluster0 200 55 7.654036e-17 3.672694 7.654036e-16
log2_odds_ratio Significant
1 1.876839 TRUE
Odds Ratio | Interpretation |
---|---|
OR = 1 | No difference between treated and untreated |
OR > 1 | Treated cells are enriched in this cluster |
OR < 1 | Treated cells are depleted in this cluster |
In the example above, OR ~ 3.67 means treated cells are ~3.67 times more likely to be in Cluster 0 than untreated cells.
Taking the log2
of the odds ratio is common because:
log2(OR = 2)
= 1
log2(OR = 0.5)
= -1
log2(OR) = 1
means cells are 2x more
likely to be in that cluster if treated.log2(OR) = -1
means cells are 2x less
likely (i.e., enriched in untreated).log2(OR) = 0
means no difference between treated and
untreated.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[5] purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] ggplot2_3.5.1 tidyverse_2.0.0 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] sass_0.4.9 generics_0.1.3 stringi_1.8.4 hms_1.1.3
[5] digest_0.6.37 magrittr_2.0.3 timechange_0.3.0 evaluate_1.0.1
[9] grid_4.4.1 fastmap_1.2.0 rprojroot_2.0.4 jsonlite_1.8.9
[13] processx_3.8.4 whisker_0.4.1 ps_1.8.1 promises_1.3.2
[17] httr_1.4.7 scales_1.3.0 jquerylib_0.1.4 cli_3.6.3
[21] rlang_1.1.4 munsell_0.5.1 withr_3.0.2 cachem_1.1.0
[25] yaml_2.3.10 tools_4.4.1 tzdb_0.4.0 colorspace_2.1-1
[29] httpuv_1.6.15 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
[33] git2r_0.35.0 fs_1.6.4 pkgconfig_2.0.3 callr_3.7.6
[37] pillar_1.10.1 bslib_0.8.0 later_1.3.2 gtable_0.3.6
[41] glue_1.8.0 Rcpp_1.0.13 highr_0.11 xfun_0.48
[45] tidyselect_1.2.1 rstudioapi_0.17.1 knitr_1.48 farver_2.1.2
[49] htmltools_0.5.8.1 labeling_0.4.3 rmarkdown_2.28 compiler_4.4.1
[53] getPass_0.2-4