Last updated: 2025-01-27
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version ab9aeab. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/pbmc3k.csv
Ignored: data/pbmc3k.csv.gz
Ignored: data/pbmc3k/
Ignored: r_packages_4.4.0/
Ignored: r_packages_4.4.1/
Untracked files:
Untracked: rsem.merged.gene_counts.tsv
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/gsea.Rmd) and HTML (docs/gsea.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | ab9aeab | Dave Tang | 2025-01-27 | RNA-seq expression data |
html | 415653c | Dave Tang | 2025-01-24 | Build site. |
Rmd | 954f4bf | Dave Tang | 2025-01-24 | Using the GSEA GUI tool |
GSEA:
Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
The GSEA download page requires your email address. For this post, I will download GSEA v4.3.3 for Windows.
Text file format for expression dataset.
The TXT format is a tab delimited file format that describes an expression dataset.
The first line contains the labels Name and Description followed by the identifiers for each sample in the dataset.
The Description column is intended to be optional, but there is currently a bug such that it is treated as required. We hope to fix this in a future release. If you have no descriptions available, a value of NA will suffice.
Line format: Name(tab)Description(tab)(sample 1 name)(tab)(sample 2 name)(tab) … (sample N name)
Example: Name Description DLBC1_1 DLBC2_1 … DLBC58_0
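As a minimal sketch of the layout described above, the following base R code writes a small expression matrix in the TXT format. The helper name write_gsea_txt and the toy gene and sample names are made up for illustration and are not part of GSEA; the Description column is filled with NA, as suggested above.

```r
# Hypothetical helper (not part of GSEA): write a numeric matrix, with gene
# identifiers as row names and samples as columns, to a tab-delimited file.
write_gsea_txt <- function(mat, file, description = "NA") {
  stopifnot(is.matrix(mat), !is.null(rownames(mat)), !is.null(colnames(mat)))
  out <- data.frame(
    Name = rownames(mat),
    Description = description,  # recycled for every row
    mat,
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
  # Tab-delimited, no quoting, no row-name column: the first line is the header.
  write.table(out, file = file, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(out)
}

# Toy data, made up for illustration only.
toy <- matrix(
  c(10.5, 0, 3.2, 8.1, 1.5, 2.9),
  nrow = 3,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("sample_1", "sample_2"))
)
toy_file <- tempfile(fileext = ".txt")
write_gsea_txt(toy, toy_file)
readLines(toy_file)
```

The same helper could be pointed at any matrix of normalised values, provided its row names hold the gene identifiers and its column names hold the sample identifiers.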
Download some example data, Lung_Michigan_collapsed.gct.
Collapsed refers to datasets whose identifiers (e.g. Affymetrix probe set IDs) have been replaced with symbols. In this process, all probe sets that map to a particular gene are summarized into a single expression vector by picking the maximum expression value in each sample. A utility to do this is included in the GSEA Java software.
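The collapsing step described above can be sketched in base R. This is only an illustration of the "maximum per gene, per sample" idea and not the GSEA utility itself; the probe IDs, gene symbols, values, and the probe_to_symbol mapping are all made up.

```r
# Toy probe-level expression matrix (made-up values).
probe_expr <- matrix(
  c(5, 9, 2,
    7, 1, 4),
  nrow = 3,
  dimnames = list(c("probe_1", "probe_2", "probe_3"), c("sample_1", "sample_2"))
)

# Hypothetical probe-to-symbol mapping: two probes map to GENE_A.
probe_to_symbol <- c(probe_1 = "GENE_A", probe_2 = "GENE_A", probe_3 = "GENE_B")

# Look up the symbol for each row of the matrix.
symbols <- probe_to_symbol[rownames(probe_expr)]

# For each sample (column), take the maximum value within each gene symbol.
collapsed <- apply(probe_expr, 2, function(x) tapply(x, symbols, max))
collapsed
```

Because the maximum is taken separately in each sample, the collapsed value for a gene can come from different probes in different samples. Note that apply() only returns a matrix here because there is more than one gene symbol.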
RNA-seq data. GSEA does not normalize RNA-seq data. RNA-seq data must be normalized for between-sample comparisons using an external normalization procedure (e.g. those in DESeq2 or Voom).
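Use {edgeR} to perform “between-sample” normalisation. There are two main types of normalisation in RNA-seq: within-sample normalisation, which makes genes comparable within one sample, and between-sample normalisation, which makes the same gene comparable across samples.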
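In the code that follows, count_mat contains the raw counts. Note that for calcNormFactors() and cpm() the defaults are method = "TMM" and normalized.lib.sizes = TRUE, respectively. In recent edgeR releases, normLibSizes() is the newer name for calcNormFactors().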
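To make the effect of normalized.lib.sizes = TRUE more concrete, the following base R sketch computes CPM values by hand: each library size is multiplied by its normalisation factor to give an effective library size, and the counts are divided by that and scaled to one million. The toy counts and the normalisation factors are made up for illustration; in practice the factors are estimated by edgeR (for example with TMM) rather than set by hand.

```r
# Toy raw counts (made-up values): genes in rows, samples in columns.
toy_counts <- matrix(
  c(100, 300, 600,
    200, 600, 1200),
  nrow = 3,
  dimnames = list(c("GENE_A", "GENE_B", "GENE_C"), c("sample_1", "sample_2"))
)

# Library size is the total count per sample.
lib_size <- colSums(toy_counts)

# Made-up normalisation factors standing in for the ones edgeR would estimate.
norm_factors <- c(sample_1 = 1.25, sample_2 = 0.8)

# Effective library size = library size * normalisation factor.
eff_lib_size <- lib_size * norm_factors

# CPM = count / effective library size * 1e6, applied column by column.
cpm_manual <- t(t(toy_counts) / eff_lib_size) * 1e6
cpm_manual
```

With normalisation factors of 1 for every sample, the same calculation reduces to plain counts per million of the raw library size.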
Use example data from https://zenodo.org/records/13970886.
library(edgeR)
Loading required package: limma
my_url <- 'https://zenodo.org/records/13970886/files/rsem.merged.gene_counts.tsv?download=1'
my_file <- 'rsem.merged.gene_counts.tsv'
if (!file.exists(my_file)) {  # download once; skip if the file is already present
  download.file(url = my_url, destfile = my_file)
}
gene_counts <- readr::read_tsv(my_file, show_col_types = FALSE)
gene_counts |>
  dplyr::select(-gene_id, -`transcript_id(s)`) |>  # drop the annotation columns
  as.matrix() -> count_mat
row.names(count_mat) <- gene_counts$gene_id
dge <- DGEList(counts = count_mat)
dge <- normLibSizes(dge, method = "TMM")  # estimate TMM normalisation factors
count_mat_norm <- cpm(dge, normalized.lib.sizes = TRUE)  # CPM on effective library sizes
Raw matrix.
head(count_mat)
ERR160122 ERR160123 ERR160124 ERR164473 ERR164550 ERR164551
ENSG00000000003 2.00 6.00 5.00 374.00 1637.00 650.00
ENSG00000000005 19.00 40.00 28.00 0.00 1.00 0.00
ENSG00000000419 268.24 273.78 428.81 489.00 637.00 879.00
ENSG00000000457 360.34 449.07 566.05 362.61 605.96 708.87
ENSG00000000460 155.66 184.93 264.95 85.39 312.04 239.13
ENSG00000000938 24.00 23.00 40.00 1181.00 423.00 3346.00
ERR164552 ERR164554
ENSG00000000003 1015.00 562.00
ENSG00000000005 0.00 0.00
ENSG00000000419 1157.00 729.00
ENSG00000000457 632.16 478.93
ENSG00000000460 147.84 156.07
ENSG00000000938 1249.00 1149.00
Between-sample normalised values.
head(count_mat_norm)
ERR160122 ERR160123 ERR160124 ERR164473 ERR164550
ENSG00000000003 0.2950283 0.7356763 0.4523046 27.751783 93.60454585
ENSG00000000005 2.8027686 4.9045087 2.5329056 0.000000 0.05718054
ENSG00000000419 39.5691916 33.5689100 38.7905451 36.285085 36.42400471
ENSG00000000457 53.1552434 55.0616934 51.2054012 26.906615 34.64912071
ENSG00000000460 22.9620503 22.6747700 23.9676195 6.336162 17.84261606
ENSG00000000938 3.5403392 2.8200925 3.6184366 87.633304 24.18736890
ERR164551 ERR164552 ERR164554
ENSG00000000003 28.74003 51.906072 31.316431
ENSG00000000005 0.00000 0.000000 0.000000
ENSG00000000419 38.86536 59.167808 40.622203
ENSG00000000457 31.34299 32.328022 26.687505
ENSG00000000460 10.57323 7.560388 8.696718
ENSG00000000938 147.94481 63.872595 64.025941
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] edgeR_4.4.1 limma_3.62.2 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] generics_0.1.3 sass_0.4.9 utf8_1.2.4 stringi_1.8.4
[5] lattice_0.22-6 hms_1.1.3 digest_0.6.37 magrittr_2.0.3
[9] evaluate_1.0.1 grid_4.4.1 fastmap_1.2.0 rprojroot_2.0.4
[13] jsonlite_1.8.9 processx_3.8.4 whisker_0.4.1 ps_1.8.1
[17] promises_1.3.0 httr_1.4.7 fansi_1.0.6 jquerylib_0.1.4
[21] cli_3.6.3 rlang_1.1.4 crayon_1.5.3 bit64_4.5.2
[25] withr_3.0.2 cachem_1.1.0 yaml_2.3.10 parallel_4.4.1
[29] tools_4.4.1 tzdb_0.4.0 dplyr_1.1.4 locfit_1.5-9.10
[33] httpuv_1.6.15 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
[37] git2r_0.35.0 stringr_1.5.1 fs_1.6.4 bit_4.5.0
[41] vroom_1.6.5 pkgconfig_2.0.3 callr_3.7.6 pillar_1.9.0
[45] bslib_0.8.0 later_1.3.2 glue_1.8.0 Rcpp_1.0.13
[49] statmod_1.5.0 xfun_0.48 tibble_3.2.1 tidyselect_1.2.1
[53] rstudioapi_0.17.1 knitr_1.48 htmltools_0.5.8.1 rmarkdown_2.28
[57] readr_2.1.5 compiler_4.4.1 getPass_0.2-4