Last updated: 2025-05-26
Checks: 7 0
Knit directory: muse/ 
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200712) was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 1d87e1e. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish or
wflow_git_commit). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    data/1M_neurons_filtered_gene_bc_matrices_h5.h5
    Ignored:    data/293t/
    Ignored:    data/293t_3t3_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/293t_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/5k_Human_Donor1_PBMC_3p_gem-x_5k_Human_Donor1_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor2_PBMC_3p_gem-x_5k_Human_Donor2_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor3_PBMC_3p_gem-x_5k_Human_Donor3_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor4_PBMC_3p_gem-x_5k_Human_Donor4_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/97516b79-8d08-46a6-b329-5d0a25b0be98.h5ad
    Ignored:    data/Parent_SC3v3_Human_Glioblastoma_filtered_feature_bc_matrix.tar.gz
    Ignored:    data/brain_counts/
    Ignored:    data/cl.obo
    Ignored:    data/cl.owl
    Ignored:    data/jurkat/
    Ignored:    data/jurkat:293t_50:50_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/jurkat_293t/
    Ignored:    data/jurkat_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/pbmc20k/
    Ignored:    data/pbmc20k_seurat/
    Ignored:    data/pbmc3k/
    Ignored:    data/pbmc3k_bpcells_mat/
    Ignored:    data/pbmc3k_seurat.rds
    Ignored:    data/pbmc4k_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/pbmc_1k_v3_filtered_feature_bc_matrix.h5
    Ignored:    data/pbmc_1k_v3_raw_feature_bc_matrix.h5
    Ignored:    data/refdata-gex-GRCh38-2020-A.tar.gz
    Ignored:    data/seurat_1m_neuron.rds
    Ignored:    data/t_3k_filtered_gene_bc_matrices.tar.gz
    Ignored:    r_packages_4.4.1/
    Ignored:    r_packages_4.5.0/
Untracked files:
    Untracked:  Nothobranchius_furzeri.Nfu_20140520.113.gtf.gz
    Untracked:  analysis/bioc_scrnaseq.Rmd
    Untracked:  bpcells_matrix/
    Untracked:  data/GCF_043380555.1-RS_2024_12_gene_ontology.gaf.gz
    Untracked:  m3/
    Untracked:  pbmc3k_before_filtering.rds
    Untracked:  pbmc3k_save_rds.rds
    Untracked:  rsem.merged.gene_counts.tsv
Unstaged changes:
    Modified:   analysis/cluster_profiler.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/fgsea.Rmd) and HTML
(docs/fgsea.html) files. If you’ve configured a remote Git
repository (see ?wflow_git_remote), click on the hyperlinks
in the table below to view the files as they were in that past version.
| File | Version | Author | Date | Message | 
|---|---|---|---|---|
| Rmd | 1d87e1e | Dave Tang | 2025-05-26 | Modify plotEnrichment plot | 
| html | 6f2ce9b | Dave Tang | 2025-01-24 | Build site. | 
| html | a24419b | Dave Tang | 2023-02-13 | Build site. | 
| Rmd | 380c988 | Dave Tang | 2023-02-13 | Fix leading edge gene count | 
| html | f488702 | Dave Tang | 2023-02-13 | Build site. | 
| Rmd | f90ae59 | Dave Tang | 2023-02-13 | fgsea analysis and summary | 
| html | b82b4fc | Dave Tang | 2023-02-01 | Build site. | 
| Rmd | fb91538 | Dave Tang | 2023-02-01 | fgsea example data | 
From the original paper describing the Gene Set Enrichment Analysis (GSEA):
The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L, in which case the gene set is correlated with the phenotypic class distinction.
To provide an example, consider carrying out RNA-seq experiments with two conditions, where one set of experiments are the control and another set are the treatment. A differential gene expression analysis is carried out and each gene is ranked by the fold-change between the control and treatment; genes that are up-regulated (fold-change > 1) are ranked at the top and genes that are down-regulated (fold-change < 1) are ranked at the bottom.
Besides inspecting the individual genes that are up- or down-regulated, you wish to find out whether a pathway of genes is up- or down-regulated. Therefore you carry out GSEA using a set of gene pathways (S) and check whether genes belonging to the same pathway tend to occur at the start or the end of the list of ranked genes (L).
First install fgsea.
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!require("fgsea", quietly = TRUE))
  BiocManager::install("fgsea")
library(fgsea)The example pathways are packaged with fgsea and can be
loaded with data(). The example pathways are stored in a
list.
data("examplePathways", package = "fgsea")
class(examplePathways)[1] "list"There are 1457 example pathways.
length(examplePathways)[1] 1457The first pathway 1221633_Meiotic_Synapsis contains Entrez Gene IDs that belong to this gene set.
examplePathways[1]$`1221633_Meiotic_Synapsis`
 [1] "12189"     "13006"     "15077"     "15078"     "15270"     "15512"    
 [7] "16905"     "16906"     "19357"     "20842"     "20843"     "20957"    
[13] "20962"     "21749"     "21750"     "22196"     "23856"     "24061"    
[19] "28113"     "50878"     "56739"     "57321"     "64009"     "66654"    
[25] "69386"     "71846"     "74075"     "77053"     "94244"     "97114"    
[31] "97122"     "97908"     "101185"    "140557"    "223697"    "260423"   
[37] "319148"    "319149"    "319150"    "319151"    "319152"    "319153"   
[43] "319154"    "319155"    "319156"    "319157"    "319158"    "319159"   
[49] "319160"    "319161"    "319565"    "320332"    "320558"    "326619"   
[55] "326620"    "360198"    "497652"    "544973"    "625328"    "667250"   
[61] "100041230" "102641229" "102641751" "102642045"The gene ranks are also packaged with fgsea but we will
re-generate the ranks based on the author’s
code to see how the ranks were created. Several other Bioconductor
packages are required to generate the ranks. The Reactome database is
also installed here for later use.
bioc_pac <- c("GEOquery", "limma", "org.Mm.eg.db", "reactome.db")
cran_pac <- c("data.table", "pheatmap")
install_pac <- function(x, repo){
  if (!require(x, quietly = TRUE, character.only = TRUE)){
    if(repo == "bioc"){
      BiocManager::install(x, character.only = TRUE)
    } else if (repo == "cran"){
      install.packages(x, character.only = TRUE)
    } else {
      stop("Unknown repo")
    }
  }
}
sapply(bioc_pac, install_pac, repo = "bioc")
Attaching package: 'generics'The following objects are masked from 'package:base':
    as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
    setequal, union
Attaching package: 'BiocGenerics'The following objects are masked from 'package:stats':
    IQR, mad, sd, var, xtabsThe following objects are masked from 'package:base':
    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
    unsplit, which.max, which.minWelcome to Bioconductor
    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.Setting options('download.file.method.GEOquery'='auto')Setting options('GEOquery.inmemory.gpl'=FALSE)
Attaching package: 'limma'The following object is masked from 'package:BiocGenerics':
    plotMA
Attaching package: 'S4Vectors'The following object is masked from 'package:utils':
    findMatchesThe following objects are masked from 'package:base':
    expand.grid, I, unname$GEOquery
NULL
$limma
NULL
$org.Mm.eg.db
NULL
$reactome.db
NULLsapply(cran_pac, install_pac, repo = "cran")
Attaching package: 'data.table'The following object is masked from 'package:IRanges':
    shiftThe following objects are masked from 'package:S4Vectors':
    first, second$data.table
NULL
$pheatmap
NULLLoad packages.
library(GEOquery)
library(limma)
library(org.Mm.eg.db)
library(data.table)
# for collapseBy
source("https://raw.githubusercontent.com/assaron/r-utils/master/R/exprs.R")The example gene ranks were created using mouse
microarray data, which we can download using getGEO
from the GEOquery package.
gse14308 <- getGEO("GSE14308")[[1]]Found 1 file(s)GSE14308_series_matrix.txt.gzConditions (are extracted from the title and) are added to the phenotypic data.
pData(gse14308)$condition <- sub("-.*$", "", gse14308$title)
pData(gse14308)[, c('platform_id', 'type', 'condition')]          platform_id type condition
GSM357839     GPL1261  RNA       Th2
GSM357841     GPL1261  RNA       Th2
GSM357842     GPL1261  RNA       Th1
GSM357843     GPL1261  RNA      Th17
GSM357844     GPL1261  RNA       Th1
GSM357845     GPL1261  RNA      Th17
GSM357847     GPL1261  RNA     Naive
GSM357848     GPL1261  RNA     Naive
GSM357849     GPL1261  RNA     iTreg
GSM357850     GPL1261  RNA     iTreg
GSM357852     GPL1261  RNA     nTreg
GSM357853     GPL1261  RNA     nTregfData retrieves information on features, which are
microarray probe sets.
feature_dat <- fData(gse14308)
colnames(feature_dat) [1] "ID"                               "GB_ACC"                          
 [3] "SPOT_ID"                          "Species Scientific Name"         
 [5] "Annotation Date"                  "Sequence Type"                   
 [7] "Sequence Source"                  "Target Description"              
 [9] "Representative Public ID"         "Gene Title"                      
[11] "Gene Symbol"                      "ENTREZ_GENE_ID"                  
[13] "RefSeq Transcript ID"             "Gene Ontology Biological Process"
[15] "Gene Ontology Cellular Component" "Gene Ontology Molecular Function"The collapseBy function is sourced from exprs.R
and the source is shown below but named as collapseBy_ so
we do not overwrite the sourced function. We will go through each line
of code to try to understand what the function is doing.
collapseBy_ <- function(es, factor, FUN=median) {
  ranks <- apply(exprs(es), 1, FUN)
  t <- data.frame(f=factor, i=seq_along(ranks), r=ranks)
  t <- t[order(t$r, decreasing=T), ]
  keep <- t[!duplicated(t$f) & !is.na(t$f),]$i
  res <- es[keep, ]
  fData(res)$origin <- rownames(res)
  rownames(res) <- factor[keep]
  res
}ranks contains the median probe intensity across all
samples.
ranks <- apply(exprs(gse14308), 1, median)
head(ranks)  1415670_at   1415671_at   1415672_at   1415673_at 1415674_a_at   1415675_at 
    3324.740     4933.035    11202.750     3239.865     4007.050     1830.655 The ENTREZ_GENE_ID, index, and median get saved into a
data frame. (I’ve used my_df here because t is
the name of the transpose function.) Next, the data frame is ordered
from highest intensity to lowest.
my_df <- data.frame(
  f=fData(gse14308)$ENTREZ_GENE_ID,
  i=seq_along(ranks),
  r=ranks
)
my_df <- my_df[order(my_df$r, decreasing=TRUE), ]
head(my_df)                                             f     i        r
1438859_x_at                             20090 23165 99737.85
1454859_a_at 65019 /// 100044627 /// 100862455 39154 99078.75
1416404_s_at 20055 /// 100039355 /// 100862433   735 97947.30
1455485_x_at                             22121 39780 97510.75
1422475_a_at                             20091  6781 97177.05
1435873_a_at               22121 /// 100504632 20179 96558.85A vector called keep is created to keep only rows with
non-duplicated and non-missing Entrez Gene IDs.
keep <- my_df[!duplicated(my_df$f) & !is.na(my_df$f),]$i
length(keep)[1] 21603Finally, keep is used to subset the data;
origin is created to store the original probe IDs before
replacing the row names with Entrez Gene IDs.
res <- gse14308[keep, ]
fData(res)$origin <- rownames(res)
rownames(res) <- fData(gse14308)$ENTREZ_GENE_ID[keep]
resExpressionSet (storageMode: lockedEnvironment)
assayData: 21603 features, 12 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM357839 GSM357841 ... GSM357853 (12 total)
  varLabels: title geo_accession ... condition (34 total)
  varMetadata: labelDescription
featureData
  featureNames: 20090 65019 /// 100044627 /// 100862455 ... 194227
    (21603 total)
  fvarLabels: ID GB_ACC ... origin (17 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 19144320 
Annotation: GPL1261 Therefore, the collapseBy function is used to calculate
the median intensity across samples/experiments, which is used to rank
features, and to remove features that are duplicated or have no Entrez
Gene ID.
Now that we know what collapseBy does, we can use
it.
es <- collapseBy(gse14308, fData(gse14308)$ENTREZ_GENE_ID, FUN=median)
esExpressionSet (storageMode: lockedEnvironment)
assayData: 21603 features, 12 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM357839 GSM357841 ... GSM357853 (12 total)
  varLabels: title geo_accession ... condition (34 total)
  varMetadata: labelDescription
featureData
  featureNames: 20090 65019 /// 100044627 /// 100862455 ... 194227
    (21603 total)
  fvarLabels: ID GB_ACC ... origin (17 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 19144320 
Annotation: GPL1261 Probe IDs that mapped to several Entrez Gene IDs and empty entries are also removed.
es <- es[!grepl("///", rownames(es)), ]
es <- es[rownames(es) != "", ]
dim(exprs(es))[1] 20770    12Quantile normalisation is carried out.
exprs(es) <- normalizeBetweenArrays(log2(exprs(es)+1), method="quantile")A design matrix is defined.
es.design <- model.matrix(~0+condition, data=pData(es))
es.design          conditioniTreg conditionNaive conditionnTreg conditionTh1
GSM357839              0              0              0            0
GSM357841              0              0              0            0
GSM357842              0              0              0            1
GSM357843              0              0              0            0
GSM357844              0              0              0            1
GSM357845              0              0              0            0
GSM357847              0              1              0            0
GSM357848              0              1              0            0
GSM357849              1              0              0            0
GSM357850              1              0              0            0
GSM357852              0              0              1            0
GSM357853              0              0              1            0
          conditionTh17 conditionTh2
GSM357839             0            1
GSM357841             0            1
GSM357842             0            0
GSM357843             1            0
GSM357844             0            0
GSM357845             1            0
GSM357847             0            0
GSM357848             0            0
GSM357849             0            0
GSM357850             0            0
GSM357852             0            0
GSM357853             0            0
attr(,"assign")
[1] 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$condition
[1] "contr.treatment"A linear model is fit given the design.
fit <- lmFit(es, es.design)A contrasts matrix is used to compute contrasts using our fitted linear model. Here we’re contrasting naive T-cells to T-helper 1 cells.
makeContrasts(
  conditionTh1-conditionNaive,
  levels=es.design
)                Contrasts
Levels           conditionTh1 - conditionNaive
  conditioniTreg                             0
  conditionNaive                            -1
  conditionnTreg                             0
  conditionTh1                               1
  conditionTh17                              0
  conditionTh2                               0fit2 <- contrasts.fit(
  fit,
  makeContrasts(
    conditionTh1-conditionNaive,
    levels=es.design
  )
)
fit2An object of class "MArrayLM"
$coefficients
       Contrasts
        conditionTh1 - conditionNaive
  20090                   0.011485524
  22121                  -0.003602942
  20091                   0.035573263
  67671                   0.092303053
  19241                   0.013452656
20765 more rows ...
$rank
[1] 6
$assign
[1] 1 1 1 1 1 1
$qr
$qr
          conditioniTreg conditionNaive conditionnTreg conditionTh1
GSM357839      -1.414214       0.000000       0.000000    0.0000000
GSM357841       0.000000      -1.414214       0.000000    0.0000000
GSM357842       0.000000       0.000000      -1.414214    0.0000000
GSM357843       0.000000       0.000000       0.000000   -1.4142136
GSM357844       0.000000       0.000000       0.000000    0.7071068
          conditionTh17 conditionTh2
GSM357839      0.000000            0
GSM357841      0.000000            0
GSM357842      0.000000            0
GSM357843      0.000000            0
GSM357844      1.414214            0
7 more rows ...
$qraux
[1] 1.0 1.0 1.0 1.0 1.5 1.0
$pivot
[1] 1 2 3 4 5 6
$tol
[1] 1e-07
$rank
[1] 6
$df.residual
[1] 6 6 6 6 6
20765 more elements ...
$sigma
     20090      22121      20091      67671      19241 
0.03820456 0.03984248 0.05377004 0.03935491 0.03240994 
20765 more elements ...
$cov.coefficients
                               Contrasts
Contrasts                       conditionTh1 - conditionNaive
  conditionTh1 - conditionNaive                             1
$stdev.unscaled
       Contrasts
        conditionTh1 - conditionNaive
  20090                             1
  22121                             1
  20091                             1
  67671                             1
  19241                             1
20765 more rows ...
$pivot
[1] 1 2 3 4 5 6
$genes
                ID    GB_ACC SPOT_ID Species Scientific Name Annotation Date
20090 1438859_x_at  AV037157                    Mus musculus     Oct 6, 2014
22121 1455485_x_at  AI324936                    Mus musculus     Oct 6, 2014
20091 1422475_a_at NM_016959                    Mus musculus     Oct 6, 2014
67671 1433472_x_at  AA050777                    Mus musculus     Oct 6, 2014
19241   1415906_at NM_021278                    Mus musculus     Oct 6, 2014
           Sequence Type Sequence Source
20090 Consensus sequence         GenBank
22121 Consensus sequence         GenBank
20091 Consensus sequence         GenBank
67671 Consensus sequence         GenBank
19241 Consensus sequence         GenBank
                                                                                                                                                                                                                                                                Target Description
20090                                                                              gb:AV037157 /DB_XREF=gi:4856822 /DB_XREF=AV037157 /CLONE=1600022O04 /FEA=EST /CNT=10 /TID=Mm.154915.5 /TIER=Stack /STK=9 /UG=Mm.154915 /LL=20090 /UG_GENE=Rps29 /UG_TITLE=ribosomal protein S29
22121                                                                         gb:AI324936 /DB_XREF=gi:4059365 /DB_XREF=mb49d08.x1 /CLONE=IMAGE:332751 /FEA=EST /CNT=24 /TID=Mm.13020.6 /TIER=Stack /STK=12 /UG=Mm.13020 /LL=22121 /UG_GENE=Rpl13a /UG_TITLE=ribosomal protein L13a
20091                                         gb:NM_016959.1 /DB_XREF=gi:8394217 /GEN=Rps3a /FEA=FLmRNA /CNT=122 /TID=Mm.6957.1 /TIER=FL+Stack /STK=78 /UG=Mm.6957 /LL=20091 /DEF=Mus musculus ribosomal protein S3a (Rps3a), mRNA. /PROD=ribosomal protein S3a /FL=gb:NM_016959.1
67671                                                            gb:AA050777 /DB_XREF=gi:1530594 /DB_XREF=mg72a01.r1 /CLONE=IMAGE:438504 /FEA=EST /CNT=215 /TID=Mm.43330.3 /TIER=Stack /STK=151 /UG=Mm.43330 /LL=67671 /UG_GENE=0610025G13Rik /UG_TITLE=RIKEN cDNA 0610025G13 gene
19241 gb:NM_021278.1 /DB_XREF=gi:10946577 /GEN=Tmsb4x /FEA=FLmRNA /CNT=360 /TID=Mm.142729.1 /TIER=FL+Stack /STK=172 /UG=Mm.142729 /LL=19241 /DEF=Mus musculus thymosin, beta 4, X chromosome (Tmsb4x), mRNA. /PROD=thymosin, beta 4, X chromosome /FL=gb:NM_021278.1 gb:BC018286.1
      Representative Public ID                     Gene Title Gene Symbol
20090                 AV037157          ribosomal protein S29       Rps29
22121                 AI324936         ribosomal protein L13A      Rpl13a
20091                NM_016959         ribosomal protein S3A1      Rps3a1
67671                 AA050777          ribosomal protein L38       Rpl38
19241                NM_021278 thymosin, beta 4, X chromosome      Tmsb4x
      ENTREZ_GENE_ID
20090          20090
22121          22121
20091          20091
67671          67671
19241          19241
                                                                                RefSeq Transcript ID
20090                                                                                      NM_009093
22121                                                                                      NM_009438
20091                                                                                      NM_016959
67671 NM_001048057 /// NM_001048058 /// NM_023372 /// XM_006534007 /// XM_006534008 /// XM_006534009
19241                                                                     NM_021278 /// XM_006528759
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Gene Ontology Biological Process
20090                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         0006412 // translation // not recorded
22121 0006351 // transcription, DNA-templated // inferred from electronic annotation /// 0006355 // regulation of transcription, DNA-templated // inferred from electronic annotation /// 0006412 // translation // inferred from electronic annotation /// 0006417 // regulation of translation // inferred from electronic annotation /// 0017148 // negative regulation of translation // inferred from direct assay /// 0017148 // negative regulation of translation // not recorded /// 0032496 // response to lipopolysaccharide // inferred from mutant phenotype /// 0032844 // regulation of homeostatic process // inferred from mutant phenotype /// 0048246 // macrophage chemotaxis // inferred from mutant phenotype /// 0060425 // lung morphogenesis // inferred from mutant phenotype /// 0071346 // cellular response to interferon-gamma // inferred from direct assay /// 0071346 // cellular response to interferon-gamma // not recorded /// 1901194 // negative regulation of formation of translation preinitiation complex // not recorded
20091                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      0002181 // cytoplasmic translation // not recorded /// 0006412 // translation // inferred from electronic annotation /// 0030154 // cell differentiation // inferred from electronic annotation /// 0043066 // negative regulation of apoptotic process // inferred from sequence or structural similarity /// 0097194 // execution phase of apoptosis // inferred from sequence or structural similarity
67671                                                                                                                                                                                                                                                                                                                                                                                          0001501 // skeletal system development // inferred from mutant phenotype /// 0001501 // skeletal system development // traceable author statement /// 0001503 // ossification // inferred from mutant phenotype /// 0006412 // translation // inferred from electronic annotation /// 0006417 // regulation of translation // inferred from mutant phenotype /// 0007605 // sensory perception of sound // inferred from mutant phenotype /// 0034463 // 90S preribosome assembly // inferred from direct assay /// 0042474 // middle ear morphogenesis // inferred from mutant phenotype /// 0048318 // axial mesoderm development // inferred from mutant phenotype
19241                                                                                                                                                                                                                                                                                                                                                                                                                      0007010 // cytoskeleton organization // inferred from electronic annotation /// 0014911 // positive regulation of smooth muscle cell migration // traceable author statement /// 0030036 // actin cytoskeleton organization // inferred from electronic annotation /// 0030334 // regulation of cell migration // inferred from direct assay /// 0042989 // sequestering of actin monomers // inferred from electronic annotation /// 0045893 // positive regulation of transcription, DNA-templated // traceable author statement /// 0051152 // positive regulation of smooth muscle cell differentiation // traceable author statement
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Gene Ontology Cellular Component
20090                                                                                                                                                                                   0005622 // intracellular // inferred from electronic annotation /// 0005737 // cytoplasm // not recorded /// 0005840 // ribosome // inferred from electronic annotation /// 0015935 // small ribosomal subunit // not recorded /// 0022627 // cytosolic small ribosomal subunit // not recorded /// 0030529 // ribonucleoprotein complex // inferred from electronic annotation /// 0070062 // extracellular vesicular exosome // not recorded
22121 0005576 // extracellular region // inferred from electronic annotation /// 0005634 // nucleus // not recorded /// 0005840 // ribosome // inferred from electronic annotation /// 0015934 // large ribosomal subunit // inferred from electronic annotation /// 0016020 // membrane // inferred from electronic annotation /// 0022625 // cytosolic large ribosomal subunit // not recorded /// 0030529 // ribonucleoprotein complex // inferred from direct assay /// 0030529 // ribonucleoprotein complex // not recorded /// 0097452 // GAIT complex // inferred from direct assay /// 0097452 // GAIT complex // not recorded
20091                                                                                                                                                                                               0005622 // intracellular // inferred from electronic annotation /// 0005634 // nucleus // inferred from direct assay /// 0005737 // cytoplasm // inferred from electronic annotation /// 0005829 // cytosol // inferred from direct assay /// 0005840 // ribosome // inferred from electronic annotation /// 0022627 // cytosolic small ribosomal subunit // not recorded /// 0030529 // ribonucleoprotein complex // not recorded
67671                                                                                                                                                                                                                                                                   0005622 // intracellular // inferred from electronic annotation /// 0005840 // ribosome // inferred from electronic annotation /// 0022625 // cytosolic large ribosomal subunit // not recorded /// 0030529 // ribonucleoprotein complex // inferred from electronic annotation /// 0033291 // eukaryotic 80S initiation complex // inferred from direct assay
19241                                                                                                                                                                                                                                                                                                                                                                                         0005634 // nucleus // inferred from direct assay /// 0005737 // cytoplasm // inferred from electronic annotation /// 0005829 // cytosol // inferred from direct assay /// 0005856 // cytoskeleton // inferred from electronic annotation
                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Gene Ontology Molecular Function
20090                                                                                                                                                                                                                                                                                                          0003735 // structural constituent of ribosome // not recorded /// 0008270 // zinc ion binding // not recorded /// 0046872 // metal ion binding // inferred from electronic annotation
22121 0003676 // nucleic acid binding // inferred from electronic annotation /// 0003677 // DNA binding // inferred from electronic annotation /// 0003729 // mRNA binding // inferred from direct assay /// 0003735 // structural constituent of ribosome // inferred from electronic annotation /// 0005125 // cytokine activity // inferred from electronic annotation /// 0044822 // poly(A) RNA binding // not recorded /// 0046872 // metal ion binding // inferred from electronic annotation
20091                                                                                                                                                                                                                                                                                        0003735 // structural constituent of ribosome // not recorded /// 0005515 // protein binding // inferred from physical interaction /// 0031369 // translation initiation factor binding // not recorded
67671                                                                                                                                                                                                                                                                                                                                                                                                           0003735 // structural constituent of ribosome // inferred from electronic annotation
19241                                                                                                                                                                                                                                                                                                                                                           0003779 // actin binding // inferred from electronic annotation /// 0005515 // protein binding // inferred from physical interaction
            origin
20090 1438859_x_at
22121 1455485_x_at
20091 1422475_a_at
67671 1433472_x_at
19241   1415906_at
20765 more rows ...
$Amean
   20090    22121    20091    67671    19241 
16.28272 16.27241 16.25759 16.23408 16.23061 
20765 more elements ...
$method
[1] "ls"
$design
          conditioniTreg conditionNaive conditionnTreg conditionTh1
GSM357839              0              0              0            0
GSM357841              0              0              0            0
GSM357842              0              0              0            1
GSM357843              0              0              0            0
GSM357844              0              0              0            1
          conditionTh17 conditionTh2
GSM357839             0            1
GSM357841             0            1
GSM357842             0            0
GSM357843             1            0
GSM357844             0            0
7 more rows ...
$contrasts
                Contrasts
Levels           conditionTh1 - conditionNaive
  conditioniTreg                             0
  conditionNaive                            -1
  conditionnTreg                             0
  conditionTh1                               1
  conditionTh17                              0
  conditionTh2                               0Finally, the differential expression analysis is carried out and the results saved. The results are ranked by limma’s moderated t-statistic and this creates the ranked list of genes.
fit2 <- eBayes(fit2)
names(topTable(fit2, adjust.method="BH", number=12000, sort.by = "B")) [1] "ID"                               "GB_ACC"                          
 [3] "SPOT_ID"                          "Species.Scientific.Name"         
 [5] "Annotation.Date"                  "Sequence.Type"                   
 [7] "Sequence.Source"                  "Target.Description"              
 [9] "Representative.Public.ID"         "Gene.Title"                      
[11] "Gene.Symbol"                      "ENTREZ_GENE_ID"                  
[13] "RefSeq.Transcript.ID"             "Gene.Ontology.Biological.Process"
[15] "Gene.Ontology.Cellular.Component" "Gene.Ontology.Molecular.Function"
[17] "origin"                           "logFC"                           
[19] "AveExpr"                          "t"                               
[21] "P.Value"                          "adj.P.Val"                       
[23] "B"                               de <- data.table(topTable(fit2, adjust.method="BH", number=12000, sort.by = "B"), keep.rownames = TRUE)
ranks <- de[order(t), list(rn, t)]
ranks           rn         t
       <char>     <num>
    1: 170942 -62.22877
    2: 109711 -49.47829
    3:  18124 -43.40540
    4:  12775 -41.16952
    5:  72148 -33.23463
   ---                 
11996:  58801  49.10222
11997:  13730  50.02863
11998:  15937  50.29120
11999:  12772  50.52651
12000:  80876  52.59930Load exampleRanks.
data("exampleRanks", package = "fgsea")
head(exampleRanks)   170942    109711     18124     12775     72148     16010 
-63.33703 -49.74779 -43.63878 -41.51889 -33.26039 -32.77626 The names of the vector are the Entrez Gene IDs and the values are the rank metric.
Compare with our results.
wanted <- head(names(exampleRanks))
ranks[rn %in% wanted]       rn         t
   <char>     <num>
1: 170942 -62.22877
2: 109711 -49.47829
3:  18124 -43.40540
4:  12775 -41.16952
5:  72148 -33.23463
6:  16010 -32.38200Visualise the six most significantly down- and up-regulated genes.
library(pheatmap)
my_sample <- pData(es)$condition == "Th1" | pData(es)$condition == "Naive" 
my_group <- data.frame(group = pData(es)$condition[my_sample])
row.names(my_group) <- colnames(exprs(es))[my_sample]
 
pheatmap(
  mat = es[c(head(de[order(t), 1])$rn, tail(de[order(t), 1])$rn), my_sample],
  annotation_col = my_group,
  cluster_rows = FALSE,
  cellwidth=25,
  cellheight=15,
  scale = "row"
)
The following section is based on the fgsea
tutorial but with my elaborations. The pathways are stored in
examplePathways and the ranked gene list in
exampleRanks.
The fgsea() function runs the pre-ranked gene set
enrichment analysis.
data(examplePathways)
data(exampleRanks)
set.seed(42)
system.time(
  fgseaRes <- fgsea(
    pathways = examplePathways, 
    stats = exampleRanks,
    minSize=15,
    maxSize=500
  )
)    user  system elapsed 
  4.324   0.137   3.312 class(fgseaRes)[1] "data.table" "data.frame"Top 6 enriched pathways; see the data.table Wiki
for more information on the data.table package.
head(fgseaRes[order(pval), ])                                           pathway         pval         padj
                                            <char>        <num>        <num>
1:                     5990979_Cell_Cycle,_Mitotic 6.690481e-27 3.920622e-24
2:                              5990980_Cell_Cycle 3.312565e-26 9.705816e-24
3:                    5991851_Mitotic_Prometaphase 8.470173e-19 1.654507e-16
4: 5992217_Resolution_of_Sister_Chromatid_Cohesion 2.176649e-18 3.188791e-16
5:                                 5991454_M_Phase 1.873997e-14 2.196325e-12
6:         5991599_Separation_of_Sister_Chromatids 8.733223e-14 8.529448e-12
     log2err        ES      NES  size  leadingEdge
       <num>     <num>    <num> <int>       <list>
1: 1.3422338 0.5594755 2.769070   317 66336, 6....
2: 1.3267161 0.5388497 2.705894   369 66336, 6....
3: 1.1239150 0.7253270 2.972690    82 66336, 6....
4: 1.1053366 0.7347987 2.957518    74 66336, 6....
5: 0.9759947 0.5576247 2.554076   173 66336, 6....
6: 0.9545416 0.6164600 2.670030   116 66336, 6....Number of significant pathways at padj < 0.01.
sum(fgseaRes[, padj < 0.01])[1] 76Plot the most significantly enriched pathway.
library(ggplot2)
plotEnrichment(
  examplePathways[[head(fgseaRes[order(pval), ], 1)$pathway]],
  exampleRanks,
  ticksSize = 0.2 # default
) +
  labs(title=head(fgseaRes[order(pval), ], 1)$pathway) -> p
p
Check out the object returned by plotEnrichment().
class(p)[1] "gg"     "ggplot"names(p) [1] "data"        "layers"      "scales"      "guides"      "mapping"    
 [6] "theme"       "coordinates" "facet"       "plot_env"    "layout"     
[11] "labels"     Look inside layers.
p$layers[[1]]
mapping: x = ~rank, y = ~ES 
geom_line: na.rm = FALSE, orientation = NA
stat_identity: na.rm = FALSE
position_identity 
[[2]]
mapping: x = ~rank, y = ~-spreadES/16, xend = ~rank, yend = ~spreadES/16 
geom_segment: arrow = NULL, arrow.fill = NULL, lineend = butt, linejoin = round, na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity 
[[3]]
mapping: yintercept = ~yintercept 
geom_hline: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity 
[[4]]
mapping: yintercept = ~yintercept 
geom_hline: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity 
[[5]]
mapping: yintercept = ~yintercept 
geom_hline: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity Change line width of geom_line.
p$layers[[1]]$aes_params$size <- 2
p
Plot the top 10 pathways enriched at the top and bottom of the ranked list, respectively.
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(
  examplePathways[topPathways],
  exampleRanks,
  fgseaRes, 
  gseaParam = 0.5
)
Since the genes were ranked according to their differential
expression significance and fold change, with the most significantly
down-regulated genes at the top and up-regulated genes at the bottom of
the list, the enriched gene sets provides us with some idea of the
function of these genes. For example, exampleRanks was
generated by testing naive T cells against T helper 1 cells, so the
enriched pathways may suggest how naive cells are differentiated into T
helper 1 cells.
Reactome pathways can also be used with fgsea and this database is available via Bioconductor (and was installed earlier in this post).
The reactomePathways() function returns a list of
Reactome pathways given a list of Entrez Gene IDs. In the example data,
the IDs are stored as the names in the exampleRanks
vector.
my_pathways <- reactomePathways(names(exampleRanks))'select()' returned 1:many mapping between keys and columns'select()' returned 1:1 mapping between keys and columnsOut of interest, let’s check the median number of genes in the pathways.
summary(sapply(my_pathways, length))   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    4.00   11.00   30.07   29.00 1250.00 The results are very similar to the example pathways, where the pathways are related to the cell cycle and mitosis.
set.seed(42)
system.time(
  fgsea_reactome <- fgsea(
    pathways = my_pathways, 
    stats = exampleRanks,
    minSize=15,
    maxSize=500
  )
)   user  system elapsed 
  3.813   0.212   2.228 head(fgsea_reactome[order(pval), ])                                   pathway         pval         padj   log2err
                                    <char>        <num>        <num>     <num>
1:                              Cell Cycle 8.200375e-28 5.986274e-25 1.3651796
2:                     Cell Cycle, Mitotic 6.852361e-27 2.501112e-24 1.3422338
3:                  Cell Cycle Checkpoints 1.080789e-20 2.629921e-18 1.1778933
4: Resolution of Sister Chromatid Cohesion 3.987687e-16 7.277529e-14 1.0376962
5:                    Mitotic Prometaphase 1.121259e-15 1.637039e-13 1.0175448
6:                                 S Phase 2.413133e-14 2.935978e-12 0.9759947
          ES      NES  size  leadingEdge
       <num>    <num> <int>       <list>
1: 0.5250345 2.661002   422 66336, 6....
2: 0.5440528 2.724387   357 66336, 6....
3: 0.6068232 2.830899   190 66336, 6....
4: 0.6806110 2.815193    91 66336, 6....
5: 0.5948613 2.697816   149 66336, 6....
6: 0.6295907 2.706269   107 12428, 5....Again quote the original paper:
Often it is useful to extract the core members of high scoring gene sets that contribute to the enrichment score. We define the leading-edge subset to be those genes in the gene set S that appear in the ranked list L at, or before, the point where the running sum reaches its maximum deviation from zero.
The fgsea results includes the leading edge, which
contains the genes that contributed to the enrichment score. To
illustrate, let’s take the leader edge genes from the most significant
Reactome pathway.
fgsea_reactome[order(pval),][1,]      pathway         pval         padj log2err        ES      NES  size
       <char>        <num>        <num>   <num>     <num>    <num> <int>
1: Cell Cycle 8.200375e-28 5.986274e-25 1.36518 0.5250345 2.661002   422
    leadingEdge
         <list>
1: 66336, 6....The set of leading edge genes is made up of 197 genes.
fgsea_reactome[order(pval),][1,]$leadingEdge[[1]]  [1] "66336"  "66977"  "15366"  "12442"  "107995" "66442"  "19361"  "12571" 
  [9] "12428"  "52276"  "54392"  "66311"  "215387" "67629"  "54124"  "12649" 
 [17] "72415"  "56150"  "57441"  "20877"  "67121"  "12615"  "66468"  "67849" 
 [25] "19053"  "73804"  "76044"  "20878"  "15270"  "13555"  "60411"  "60530" 
 [33] "17219"  "69270"  "12575"  "69263"  "12448"  "14211"  "20873"  "18005" 
 [41] "72119"  "71988"  "12189"  "72107"  "17215"  "12534"  "66156"  "208628"
 [49] "237911" "22390"  "68240"  "228421" "68014"  "269582" "19348"  "12236" 
 [57] "72151"  "18817"  "21781"  "18968"  "217653" "66934"  "272551" "227613"
 [65] "67141"  "67951"  "68612"  "68298"  "108000" "23834"  "106344" "242705"
 [73] "18141"  "223921" "26886"  "13557"  "26909"  "72787"  "268697" "72155" 
 [81] "56371"  "17535"  "107823" "12531"  "327762" "12567"  "229841" "67052" 
 [89] "16319"  "66634"  "171567" "26931"  "67203"  "381306" "12235"  "19891" 
 [97] "74470"  "72083"  "66570"  "17216"  "76308"  "19687"  "17218"  "102920"
[105] "29870"  "18973"  "16881"  "17463"  "68964"  "75786"  "19645"  "19075" 
[113] "26417"  "69736"  "19357"  "76816"  "70385"  "70645"  "22628"  "225182"
[121] "22627"  "52683"  "19076"  "18972"  "26932"  "12544"  "17997"  "26440" 
[129] "68549"  "12445"  "269113" "26444"  "19324"  "93765"  "103733" "59001" 
[137] "107976" "19179"  "12579"  "232987" "17420"  "219072" "26445"  "105988"
[145] "69745"  "18538"  "69928"  "11651"  "235559" "68097"  "57296"  "63955" 
[153] "19170"  "17246"  "17220"  "12144"  "208084" "50793"  "77605"  "18392" 
[161] "236930" "70024"  "66296"  "16906"  "109145" "71819"  "67733"  "50883" 
[169] "12447"  "75430"  "12532"  "68942"  "26442"  "19177"  "230376" "245688"
[177] "229776" "214901" "16563"  "16328"  "18971"  "77011"  "225659" "68953" 
[185] "100088" "66671"  "19718"  "78177"  "66578"  "59046"  "22194"  "52530" 
[193] "23996"  "101565" "16475"  "381280" "235661"The Cell Cycle contains 422 genes, which was indicated in the
fgsea result as size (but it’s always good to
double-check).
length(my_pathways[[fgsea_reactome[order(pval),][1,]$pathway]])[1] 422The Gene Set Enrichment Analysis (GSEA) has been around since 2005 and has become a routine step when analysing gene expression data. It differs from Gene Ontology enrichment analysis in that it considers all genes in contrast to taking only significantly differentially expressed genes.
The fgsea package allows one to conduct a pre-ranked
GSEA in R, which is one approach in a GSEA. A p-value is estimated by
permuting the genes in a gene set, which leads to randomly assigned gene
sets of the same size. Note that “This approach is not strictly accurate
because it ignores gene-gene correlations and will overestimate the
significance levels and may lead to false positives.” However, it is
still useful in getting an idea of the potential roles of the genes that
are up- and down-regulated. If your pre-ranked GSEA returns no
significant gene sets, you may still get an idea of what roles the up-
and down-regulated genes may be involved in by examining the leading
edge set. This set indicates the genes that contributed to the
enrichment score.
The example ranks in the fgsea package were ranked based
on the moderated t-statistic. Another
metric that you may consider is to take the signed fold change and
multiply it by the \(-log_{10}\)p-value. For example, if you
performed your differential expression analysis with edgeR,
you can simply multiply the signed fold change column to the \(-log_{10}\)p-value column. This metric is
based on the paper: Rank-rank
hypergeometric overlap: identification of statistically significant
overlap between gene-expression signatures.
sessionInfo()R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS
Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     
other attached packages:
 [1] ggplot2_3.5.2        pheatmap_1.0.12      data.table_1.17.2   
 [4] reactome.db_1.92.0   org.Mm.eg.db_3.21.0  AnnotationDbi_1.70.0
 [7] IRanges_2.42.0       S4Vectors_0.46.0     limma_3.64.0        
[10] GEOquery_2.76.0      Biobase_2.68.0       BiocGenerics_0.54.0 
[13] generics_0.1.4       fgsea_1.34.0         BiocManager_1.30.25 
[16] workflowr_1.7.1     
loaded via a namespace (and not attached):
 [1] DBI_1.2.3                   httr2_1.1.2                
 [3] rlang_1.1.6                 magrittr_2.0.3             
 [5] git2r_0.36.2                matrixStats_1.5.0          
 [7] compiler_4.5.0              RSQLite_2.3.11             
 [9] getPass_0.2-4               png_0.1-8                  
[11] callr_3.7.6                 vctrs_0.6.5                
[13] stringr_1.5.1               pkgconfig_2.0.3            
[15] crayon_1.5.3                fastmap_1.2.0              
[17] XVector_0.48.0              labeling_0.4.3             
[19] promises_1.3.2              rmarkdown_2.29             
[21] tzdb_0.5.0                  UCSC.utils_1.4.0           
[23] ps_1.9.1                    purrr_1.0.4                
[25] bit_4.6.0                   xfun_0.52                  
[27] cachem_1.1.0                GenomeInfoDb_1.44.0        
[29] jsonlite_2.0.0              blob_1.2.4                 
[31] later_1.4.2                 DelayedArray_0.34.1        
[33] BiocParallel_1.42.0         parallel_4.5.0             
[35] R6_2.6.1                    bslib_0.9.0                
[37] stringi_1.8.7               RColorBrewer_1.1-3         
[39] GenomicRanges_1.60.0        jquerylib_0.1.4            
[41] Rcpp_1.0.14                 SummarizedExperiment_1.38.1
[43] knitr_1.50                  R.utils_2.13.0             
[45] readr_2.1.5                 httpuv_1.6.16              
[47] rentrez_1.2.3               Matrix_1.7-3               
[49] tidyselect_1.2.1            rstudioapi_0.17.1          
[51] abind_1.4-8                 yaml_2.3.10                
[53] codetools_0.2-20            curl_6.2.3                 
[55] processx_3.8.6              lattice_0.22-6             
[57] tibble_3.2.1                withr_3.0.2                
[59] KEGGREST_1.48.0             evaluate_1.0.3             
[61] xml2_1.3.8                  Biostrings_2.76.0          
[63] pillar_1.10.2               MatrixGenerics_1.20.0      
[65] whisker_0.4.1               rprojroot_2.0.4            
[67] hms_1.1.3                   scales_1.4.0               
[69] glue_1.8.0                  tools_4.5.0                
[71] fs_1.6.6                    XML_3.99-0.18              
[73] fastmatch_1.1-6             cowplot_1.1.3              
[75] grid_4.5.0                  tidyr_1.3.1                
[77] GenomeInfoDbData_1.2.14     cli_3.6.5                  
[79] rappdirs_0.3.3              S4Arrays_1.8.0             
[81] dplyr_1.1.4                 gtable_0.3.6               
[83] R.methodsS3_1.8.2           sass_0.4.10                
[85] digest_0.6.37               SparseArray_1.8.0          
[87] farver_2.1.2                R.oo_1.27.1                
[89] memoise_2.0.1               htmltools_0.5.8.1          
[91] lifecycle_1.0.4             httr_1.4.7                 
[93] statmod_1.5.0               bit64_4.6.0-1