Using the GSEA GUI tool

Last updated: 2025-01-27

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/gsea.Rmd) and HTML (docs/gsea.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	ab9aeab	Dave Tang	2025-01-27	RNA-seq expression data
html	415653c	Dave Tang	2025-01-24	Build site.
Rmd	954f4bf	Dave Tang	2025-01-24	Using the GSEA GUI tool

GSEA:

Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).

Download

The download page requires your email address. For this post, I will download GSEA v4.3.3 for Windows.

GSEA input data

Text file format for expression dataset.

The TXT format is a tab delimited file format that describes an expression dataset.
The first line contains the labels Name and Description followed by the identifiers for each sample in the dataset.
The Description column is intended to be optional, but there is currently a bug such that it is treated as required. We hope to fix this in a future release. If you have no descriptions available, a value of NA will suffice.

Line format: Name(tab)Description(tab)(sample 1 name)(tab)(sample 2 name)(tab) … (sample N name)
Example: Name Description DLBC1_1 DLBC2_1 … DLBC58_0
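The layout described above can be produced from R with base functions only. The following is a minimal sketch, not part of GSEA itself: the helper name write_gsea_txt, the toy matrix, and the temporary output file are all hypothetical, and the Description column is filled with NA as suggested above.

```r
# Hypothetical helper: write a numeric matrix (genes x samples) as a
# tab-delimited TXT file with Name and Description as the first two columns.
write_gsea_txt <- function(mat, file, description = "NA") {
  out <- data.frame(
    Name = rownames(mat),
    Description = description,  # recycled for every row
    mat,
    check.names = FALSE,        # keep sample names exactly as given
    stringsAsFactors = FALSE
  )
  write.table(out, file = file, sep = "\t", quote = FALSE,
              row.names = FALSE, col.names = TRUE)
  invisible(out)
}

# Toy expression matrix with made-up values
toy <- matrix(
  c(10.5, 0, 3.2, 7.1, 1, 4.4),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample_1", "sample_2"))
)

out_file <- tempfile(fileext = ".txt")
write_gsea_txt(toy, out_file)

# Inspect the header line and the first data line
readLines(out_file, n = 2)
```

The same helper could be applied to a normalised RNA-seq matrix, provided its row names hold the gene identifiers.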

Download some example data: Lung_Michigan_collapsed.gct.

Collapsed refers to datasets whose identifiers (i.e. Affymetrix probe set IDs) have been replaced with symbols. In this process, all probe sets that map to a particular gene are summarized into a single expression vector by picking the maximum expression value in each sample. A utility to do this is included in the GSEA Java software.
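To make the collapsing step concrete, here is a small base R sketch of the "maximum per sample" idea. It only illustrates the description above and is not the implementation used by the GSEA utility; the function name collapse_max, the probe IDs, and the values are made up.

```r
# Hypothetical helper: collapse probe-level rows to gene-level rows by taking,
# for each gene symbol, the maximum value in each sample.
collapse_max <- function(mat, symbols) {
  stopifnot(length(symbols) == nrow(mat))
  # Row indices grouped by gene symbol (groups are sorted alphabetically)
  groups <- split(seq_len(nrow(mat)), symbols)
  do.call(rbind, lapply(groups, function(idx) {
    apply(mat[idx, , drop = FALSE], 2, max)
  }))
}

# Toy probe-level matrix: two probes map to TP53, one to EGFR
probe_mat <- matrix(
  c(5, 9, 2, 8, 1, 3),
  nrow = 3,
  dimnames = list(c("probe_1", "probe_2", "probe_3"), c("s1", "s2"))
)
probe_symbols <- c("TP53", "TP53", "EGFR")

collapsed <- collapse_max(probe_mat, probe_symbols)
collapsed
```

Note that the maximum is taken per sample, so the collapsed TP53 row can combine values from different probes (here probe_2 for s1 and probe_1 for s2).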

Expression datasets

Preparing RNA-seq data:

RNA-seq data. GSEA does not normalize RNA-seq data. RNA-seq data must be normalized for between-sample comparisons using an external normalization procedure (e.g. those in DESeq2 or Voom).

Use {edgeR} to perform “between-sample” normalisation; there are two main types of normalisation in RNA-seq:

Between-sample normalisation to compare gene expression across samples and
Within-sample normalisation to account for differences in sequencing depth or gene length within a single sample.

Assume count_mat contains the raw counts. Note that for calcNormFactors() (named normLibSizes() in recent edgeR releases, which is the name used below) and cpm() the defaults are method = "TMM" and normalized.lib.sizes = TRUE, respectively.
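To show what normalized.lib.sizes = TRUE means in practice, the sketch below computes counts per million by hand in base R: each count is divided by the effective library size (raw library size multiplied by the normalisation factor) and scaled to one million. The counts and the normalisation factors here are made up for illustration; real factors would be estimated by edgeR.

```r
# Toy raw counts: 2 genes x 2 samples (made-up numbers)
counts <- matrix(
  c(10, 90, 20, 180),
  nrow = 2,
  dimnames = list(c("g1", "g2"), c("s1", "s2"))
)

# Raw library sizes are the column totals
lib_size <- colSums(counts)

# Hypothetical normalisation factors; in a real analysis these are estimated
# from the data (e.g. by TMM) rather than chosen by hand
norm_factors <- c(s1 = 0.8, s2 = 1.25)

# Effective library sizes combine the two
eff_lib_size <- lib_size * norm_factors

# Counts per million relative to the effective library sizes
cpm_manual <- sweep(counts, 2, eff_lib_size, "/") * 1e6
cpm_manual
```

With all normalisation factors equal to 1, this reduces to plain CPM, which corrects for sequencing depth only.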

Use example data from https://zenodo.org/records/13970886.

library(edgeR)

Loading required package: limma

# Example RSEM gene-level counts hosted on Zenodo
my_url <- 'https://zenodo.org/records/13970886/files/rsem.merged.gene_counts.tsv?download=1'
my_file <- 'rsem.merged.gene_counts.tsv'
# Download the file only if it is not already present
if (!file.exists(my_file)) {
  download.file(url = my_url, destfile = my_file)
}
gene_counts <- readr::read_tsv(my_file, show_col_types = FALSE)
# Drop the annotation columns and keep the per-sample counts as a matrix
count_mat <- gene_counts |>
  dplyr::select(-gene_id, -`transcript_id(s)`) |>
  as.matrix()
row.names(count_mat) <- gene_counts$gene_id

# Estimate TMM normalisation factors, then compute counts per million
# using the effective (normalised) library sizes
dge <- DGEList(counts = count_mat)
dge <- normLibSizes(dge, method = "TMM")
count_mat_norm <- cpm(dge, normalized.lib.sizes = TRUE)

Raw matrix.

head(count_mat)

                ERR160122 ERR160123 ERR160124 ERR164473 ERR164550 ERR164551
ENSG00000000003      2.00      6.00      5.00    374.00   1637.00    650.00
ENSG00000000005     19.00     40.00     28.00      0.00      1.00      0.00
ENSG00000000419    268.24    273.78    428.81    489.00    637.00    879.00
ENSG00000000457    360.34    449.07    566.05    362.61    605.96    708.87
ENSG00000000460    155.66    184.93    264.95     85.39    312.04    239.13
ENSG00000000938     24.00     23.00     40.00   1181.00    423.00   3346.00
                ERR164552 ERR164554
ENSG00000000003   1015.00    562.00
ENSG00000000005      0.00      0.00
ENSG00000000419   1157.00    729.00
ENSG00000000457    632.16    478.93
ENSG00000000460    147.84    156.07
ENSG00000000938   1249.00   1149.00

Between-sample normalised values.

head(count_mat_norm)

                 ERR160122  ERR160123  ERR160124 ERR164473   ERR164550
ENSG00000000003  0.2950283  0.7356763  0.4523046 27.751783 93.60454585
ENSG00000000005  2.8027686  4.9045087  2.5329056  0.000000  0.05718054
ENSG00000000419 39.5691916 33.5689100 38.7905451 36.285085 36.42400471
ENSG00000000457 53.1552434 55.0616934 51.2054012 26.906615 34.64912071
ENSG00000000460 22.9620503 22.6747700 23.9676195  6.336162 17.84261606
ENSG00000000938  3.5403392  2.8200925  3.6184366 87.633304 24.18736890
                ERR164551 ERR164552 ERR164554
ENSG00000000003  28.74003 51.906072 31.316431
ENSG00000000005   0.00000  0.000000  0.000000
ENSG00000000419  38.86536 59.167808 40.622203
ENSG00000000457  31.34299 32.328022 26.687505
ENSG00000000460  10.57323  7.560388  8.696718
ENSG00000000938 147.94481 63.872595 64.025941

sessionInfo()

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] edgeR_4.4.1     limma_3.62.2    workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] generics_0.1.3    sass_0.4.9        utf8_1.2.4        stringi_1.8.4    
 [5] lattice_0.22-6    hms_1.1.3         digest_0.6.37     magrittr_2.0.3   
 [9] evaluate_1.0.1    grid_4.4.1        fastmap_1.2.0     rprojroot_2.0.4  
[13] jsonlite_1.8.9    processx_3.8.4    whisker_0.4.1     ps_1.8.1         
[17] promises_1.3.0    httr_1.4.7        fansi_1.0.6       jquerylib_0.1.4  
[21] cli_3.6.3         rlang_1.1.4       crayon_1.5.3      bit64_4.5.2      
[25] withr_3.0.2       cachem_1.1.0      yaml_2.3.10       parallel_4.4.1   
[29] tools_4.4.1       tzdb_0.4.0        dplyr_1.1.4       locfit_1.5-9.10  
[33] httpuv_1.6.15     vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
[37] git2r_0.35.0      stringr_1.5.1     fs_1.6.4          bit_4.5.0        
[41] vroom_1.6.5       pkgconfig_2.0.3   callr_3.7.6       pillar_1.9.0     
[45] bslib_0.8.0       later_1.3.2       glue_1.8.0        Rcpp_1.0.13      
[49] statmod_1.5.0     xfun_0.48         tibble_3.2.1      tidyselect_1.2.1 
[53] rstudioapi_0.17.1 knitr_1.48        htmltools_0.5.8.1 rmarkdown_2.28   
[57] readr_2.1.5       compiler_4.4.1    getPass_0.2-4