Last updated: 2025-01-24
Checks: 7 0
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200712)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 88fe3eb. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: data/pbmc3k.csv
Ignored: data/pbmc3k.csv.gz
Ignored: data/pbmc3k/
Ignored: r_packages_4.4.0/
Ignored: r_packages_4.4.1/
Untracked files:
Untracked: analysis/fgsea_edger.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/fgsea.Rmd
) and HTML
(docs/fgsea.html
) files. If you’ve configured a remote Git
repository (see ?wflow_git_remote
), click on the hyperlinks
in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
html | a24419b | Dave Tang | 2023-02-13 | Build site. |
Rmd | 380c988 | Dave Tang | 2023-02-13 | Fix leading edge gene count |
html | f488702 | Dave Tang | 2023-02-13 | Build site. |
Rmd | f90ae59 | Dave Tang | 2023-02-13 | fgsea analysis and summary |
html | b82b4fc | Dave Tang | 2023-02-01 | Build site. |
Rmd | fb91538 | Dave Tang | 2023-02-01 | fgsea example data |
From the original paper describing the Gene Set Enrichment Analysis (GSEA):
The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L, in which case the gene set is correlated with the phenotypic class distinction.
To provide an example, consider carrying out RNA-seq experiments with two conditions, where one set of experiments are the control and another set are the treatment. A differential gene expression analysis is carried out and each gene is ranked by the fold-change between the control and treatment; genes that are up-regulated (fold-change > 1) are ranked at the top and genes that are down-regulated (fold-change < 1) are ranked at the bottom.
Besides inspecting the individual genes that are up- or down-regulated, you wish to find out whether a pathway of genes is up- or down-regulated. Therefore you carry out GSEA using a set of gene pathways (S) and check whether genes belonging to the same pathway tend to occur at the start or the end of the list of ranked genes (L).
First install fgsea.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!require("fgsea", quietly = TRUE))
BiocManager::install("fgsea")
library(fgsea)
The example pathways are packaged with fgsea
and can be
loaded with data()
. The example pathways are stored in a
list.
data("examplePathways", package = "fgsea")
class(examplePathways)
[1] "list"
There are 1457 example pathways.
length(examplePathways)
[1] 1457
The first pathway 1221633_Meiotic_Synapsis contains Entrez Gene IDs that belong to this gene set.
examplePathways[1]
$`1221633_Meiotic_Synapsis`
[1] "12189" "13006" "15077" "15078" "15270" "15512"
[7] "16905" "16906" "19357" "20842" "20843" "20957"
[13] "20962" "21749" "21750" "22196" "23856" "24061"
[19] "28113" "50878" "56739" "57321" "64009" "66654"
[25] "69386" "71846" "74075" "77053" "94244" "97114"
[31] "97122" "97908" "101185" "140557" "223697" "260423"
[37] "319148" "319149" "319150" "319151" "319152" "319153"
[43] "319154" "319155" "319156" "319157" "319158" "319159"
[49] "319160" "319161" "319565" "320332" "320558" "326619"
[55] "326620" "360198" "497652" "544973" "625328" "667250"
[61] "100041230" "102641229" "102641751" "102642045"
The gene ranks are also packaged with fgsea
but we will
re-generate the ranks based on the author’s
code to see how the ranks were created. Several other Bioconductor
packages are required to generate the ranks. The Reactome database is
also installed here for later use.
bioc_pac <- c("GEOquery", "limma", "org.Mm.eg.db", "reactome.db")
cran_pac <- c("data.table", "pheatmap")
install_pac <- function(x, repo){
if (!require(x, quietly = TRUE, character.only = TRUE)){
if(repo == "bioc"){
BiocManager::install(x, character.only = TRUE)
} else if (repo == "cran"){
install.packages(x, character.only = TRUE)
} else {
stop("Unknown repo")
}
}
}
sapply(bioc_pac, install_pac, repo = "bioc")
'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
CRAN: https://p3m.dev/cran/__linux__/jammy/2024-10-30
Bioconductor version 3.20 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)
Installing package(s) 'GEOquery'
also installing the dependency 'rentrez'
Installation paths not writeable, unable to update packages
path: /usr/local/lib/R/library
packages:
boot, foreign, MASS, Matrix, nlme, survival
Attaching package: 'BiocGenerics'
The following object is masked from 'package:limma':
plotMA
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
table, tapply, union, unique, unsplit, which.max, which.min
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':
findMatches
The following objects are masked from 'package:base':
expand.grid, I, unname
$GEOquery
[1] "GEOquery"
$limma
NULL
$org.Mm.eg.db
NULL
$reactome.db
NULL
sapply(cran_pac, install_pac, repo = "cran")
Attaching package: 'data.table'
The following object is masked from 'package:IRanges':
shift
The following objects are masked from 'package:S4Vectors':
first, second
$data.table
NULL
$pheatmap
NULL
Load packages.
library(GEOquery)
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
library(limma)
library(org.Mm.eg.db)
library(data.table)
# for collapseBy
source("https://raw.githubusercontent.com/assaron/r-utils/master/R/exprs.R")
The example gene ranks were created using mouse
microarray data, which we can download using getGEO
from the GEOquery
package.
gse14308 <- getGEO("GSE14308")[[1]]
Found 1 file(s)
GSE14308_series_matrix.txt.gz
Conditions (are extracted from the title and) are added to the phenotypic data.
pData(gse14308)$condition <- sub("-.*$", "", gse14308$title)
pData(gse14308)[, c('platform_id', 'type', 'condition')]
platform_id type condition
GSM357839 GPL1261 RNA Th2
GSM357841 GPL1261 RNA Th2
GSM357842 GPL1261 RNA Th1
GSM357843 GPL1261 RNA Th17
GSM357844 GPL1261 RNA Th1
GSM357845 GPL1261 RNA Th17
GSM357847 GPL1261 RNA Naive
GSM357848 GPL1261 RNA Naive
GSM357849 GPL1261 RNA iTreg
GSM357850 GPL1261 RNA iTreg
GSM357852 GPL1261 RNA nTreg
GSM357853 GPL1261 RNA nTreg
fData
retrieves information on features, which are
microarray probe sets.
feature_dat <- fData(gse14308)
colnames(feature_dat)
[1] "ID" "GB_ACC"
[3] "SPOT_ID" "Species Scientific Name"
[5] "Annotation Date" "Sequence Type"
[7] "Sequence Source" "Target Description"
[9] "Representative Public ID" "Gene Title"
[11] "Gene Symbol" "ENTREZ_GENE_ID"
[13] "RefSeq Transcript ID" "Gene Ontology Biological Process"
[15] "Gene Ontology Cellular Component" "Gene Ontology Molecular Function"
The collapseBy
function is sourced from exprs.R
and the source is shown below but named as collapseBy_
so
we do not overwrite the sourced function. We will go through each line
of code to try to understand what the function is doing.
collapseBy_ <- function(es, factor, FUN=median) {
ranks <- apply(exprs(es), 1, FUN)
t <- data.frame(f=factor, i=seq_along(ranks), r=ranks)
t <- t[order(t$r, decreasing=T), ]
keep <- t[!duplicated(t$f) & !is.na(t$f),]$i
res <- es[keep, ]
fData(res)$origin <- rownames(res)
rownames(res) <- factor[keep]
res
}
ranks
contains the median probe intensity across all
samples.
ranks <- apply(exprs(gse14308), 1, median)
head(ranks)
1415670_at 1415671_at 1415672_at 1415673_at 1415674_a_at 1415675_at
3324.740 4933.035 11202.750 3239.865 4007.050 1830.655
The ENTREZ_GENE_ID
, index, and median get saved into a
data frame. (I’ve used my_df
here because t
is
the name of the transpose function.) Next, the data frame is ordered
from highest intensity to lowest.
my_df <- data.frame(
f=fData(gse14308)$ENTREZ_GENE_ID,
i=seq_along(ranks),
r=ranks
)
my_df <- my_df[order(my_df$r, decreasing=TRUE), ]
head(my_df)
f i r
1438859_x_at 20090 23165 99737.85
1454859_a_at 65019 /// 100044627 /// 100862455 39154 99078.75
1416404_s_at 20055 /// 100039355 /// 100862433 735 97947.30
1455485_x_at 22121 39780 97510.75
1422475_a_at 20091 6781 97177.05
1435873_a_at 22121 /// 100504632 20179 96558.85
A vector called keep
is created to keep only rows with
non-duplicated and non-missing Entrez Gene IDs.
keep <- my_df[!duplicated(my_df$f) & !is.na(my_df$f),]$i
length(keep)
[1] 21603
Finally, keep
is used to subset the data;
origin
is created to store the original probe IDs before
replacing the row names with Entrez Gene IDs.
res <- gse14308[keep, ]
fData(res)$origin <- rownames(res)
rownames(res) <- fData(gse14308)$ENTREZ_GENE_ID[keep]
res
ExpressionSet (storageMode: lockedEnvironment)
assayData: 21603 features, 12 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM357839 GSM357841 ... GSM357853 (12 total)
varLabels: title geo_accession ... condition (34 total)
varMetadata: labelDescription
featureData
featureNames: 20090 65019 /// 100044627 /// 100862455 ... 194227
(21603 total)
fvarLabels: ID GB_ACC ... origin (17 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 19144320
Annotation: GPL1261
Therefore, the collapseBy
function is used to calculate
the median intensity across samples/experiments, which is used to rank
features, and to remove features that are duplicated or have no Entrez
Gene ID.
Now that we know what collapseBy
does, we can use
it.
es <- collapseBy(gse14308, fData(gse14308)$ENTREZ_GENE_ID, FUN=median)
es
ExpressionSet (storageMode: lockedEnvironment)
assayData: 21603 features, 12 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM357839 GSM357841 ... GSM357853 (12 total)
varLabels: title geo_accession ... condition (34 total)
varMetadata: labelDescription
featureData
featureNames: 20090 65019 /// 100044627 /// 100862455 ... 194227
(21603 total)
fvarLabels: ID GB_ACC ... origin (17 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 19144320
Annotation: GPL1261
Probe IDs that mapped to several Entrez Gene IDs and empty entries are also removed.
es <- es[!grepl("///", rownames(es)), ]
es <- es[rownames(es) != "", ]
dim(exprs(es))
[1] 20770 12
Quantile normalisation is carried out.
exprs(es) <- normalizeBetweenArrays(log2(exprs(es)+1), method="quantile")
A design matrix is defined.
es.design <- model.matrix(~0+condition, data=pData(es))
es.design
conditioniTreg conditionNaive conditionnTreg conditionTh1
GSM357839 0 0 0 0
GSM357841 0 0 0 0
GSM357842 0 0 0 1
GSM357843 0 0 0 0
GSM357844 0 0 0 1
GSM357845 0 0 0 0
GSM357847 0 1 0 0
GSM357848 0 1 0 0
GSM357849 1 0 0 0
GSM357850 1 0 0 0
GSM357852 0 0 1 0
GSM357853 0 0 1 0
conditionTh17 conditionTh2
GSM357839 0 1
GSM357841 0 1
GSM357842 0 0
GSM357843 1 0
GSM357844 0 0
GSM357845 1 0
GSM357847 0 0
GSM357848 0 0
GSM357849 0 0
GSM357850 0 0
GSM357852 0 0
GSM357853 0 0
attr(,"assign")
[1] 1 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$condition
[1] "contr.treatment"
A linear model is fit given the design.
fit <- lmFit(es, es.design)
A contrasts matrix is used to compute contrasts using our fitted linear model. Here we’re contrasting naive T-cells to T-helper 1 cells.
makeContrasts(
conditionTh1-conditionNaive,
levels=es.design
)
Contrasts
Levels conditionTh1 - conditionNaive
conditioniTreg 0
conditionNaive -1
conditionnTreg 0
conditionTh1 1
conditionTh17 0
conditionTh2 0
fit2 <- contrasts.fit(
fit,
makeContrasts(
conditionTh1-conditionNaive,
levels=es.design
)
)
fit2
An object of class "MArrayLM"
$coefficients
Contrasts
conditionTh1 - conditionNaive
20090 0.011485524
22121 -0.003602942
20091 0.035573263
67671 0.092303053
19241 0.013452656
20765 more rows ...
$rank
[1] 6
$assign
[1] 1 1 1 1 1 1
$qr
$qr
conditioniTreg conditionNaive conditionnTreg conditionTh1
GSM357839 -1.414214 0.000000 0.000000 0.0000000
GSM357841 0.000000 -1.414214 0.000000 0.0000000
GSM357842 0.000000 0.000000 -1.414214 0.0000000
GSM357843 0.000000 0.000000 0.000000 -1.4142136
GSM357844 0.000000 0.000000 0.000000 0.7071068
conditionTh17 conditionTh2
GSM357839 0.000000 0
GSM357841 0.000000 0
GSM357842 0.000000 0
GSM357843 0.000000 0
GSM357844 1.414214 0
7 more rows ...
$qraux
[1] 1.0 1.0 1.0 1.0 1.5 1.0
$pivot
[1] 1 2 3 4 5 6
$tol
[1] 1e-07
$rank
[1] 6
$df.residual
[1] 6 6 6 6 6
20765 more elements ...
$sigma
20090 22121 20091 67671 19241
0.03820456 0.03984248 0.05377004 0.03935491 0.03240994
20765 more elements ...
$cov.coefficients
Contrasts
Contrasts conditionTh1 - conditionNaive
conditionTh1 - conditionNaive 1
$stdev.unscaled
Contrasts
conditionTh1 - conditionNaive
20090 1
22121 1
20091 1
67671 1
19241 1
20765 more rows ...
$pivot
[1] 1 2 3 4 5 6
$genes
ID GB_ACC SPOT_ID Species Scientific Name Annotation Date
20090 1438859_x_at AV037157 Mus musculus Oct 6, 2014
22121 1455485_x_at AI324936 Mus musculus Oct 6, 2014
20091 1422475_a_at NM_016959 Mus musculus Oct 6, 2014
67671 1433472_x_at AA050777 Mus musculus Oct 6, 2014
19241 1415906_at NM_021278 Mus musculus Oct 6, 2014
Sequence Type Sequence Source
20090 Consensus sequence GenBank
22121 Consensus sequence GenBank
20091 Consensus sequence GenBank
67671 Consensus sequence GenBank
19241 Consensus sequence GenBank
Target Description
20090 gb:AV037157 /DB_XREF=gi:4856822 /DB_XREF=AV037157 /CLONE=1600022O04 /FEA=EST /CNT=10 /TID=Mm.154915.5 /TIER=Stack /STK=9 /UG=Mm.154915 /LL=20090 /UG_GENE=Rps29 /UG_TITLE=ribosomal protein S29
22121 gb:AI324936 /DB_XREF=gi:4059365 /DB_XREF=mb49d08.x1 /CLONE=IMAGE:332751 /FEA=EST /CNT=24 /TID=Mm.13020.6 /TIER=Stack /STK=12 /UG=Mm.13020 /LL=22121 /UG_GENE=Rpl13a /UG_TITLE=ribosomal protein L13a
20091 gb:NM_016959.1 /DB_XREF=gi:8394217 /GEN=Rps3a /FEA=FLmRNA /CNT=122 /TID=Mm.6957.1 /TIER=FL+Stack /STK=78 /UG=Mm.6957 /LL=20091 /DEF=Mus musculus ribosomal protein S3a (Rps3a), mRNA. /PROD=ribosomal protein S3a /FL=gb:NM_016959.1
67671 gb:AA050777 /DB_XREF=gi:1530594 /DB_XREF=mg72a01.r1 /CLONE=IMAGE:438504 /FEA=EST /CNT=215 /TID=Mm.43330.3 /TIER=Stack /STK=151 /UG=Mm.43330 /LL=67671 /UG_GENE=0610025G13Rik /UG_TITLE=RIKEN cDNA 0610025G13 gene
19241 gb:NM_021278.1 /DB_XREF=gi:10946577 /GEN=Tmsb4x /FEA=FLmRNA /CNT=360 /TID=Mm.142729.1 /TIER=FL+Stack /STK=172 /UG=Mm.142729 /LL=19241 /DEF=Mus musculus thymosin, beta 4, X chromosome (Tmsb4x), mRNA. /PROD=thymosin, beta 4, X chromosome /FL=gb:NM_021278.1 gb:BC018286.1
Representative Public ID Gene Title Gene Symbol
20090 AV037157 ribosomal protein S29 Rps29
22121 AI324936 ribosomal protein L13A Rpl13a
20091 NM_016959 ribosomal protein S3A1 Rps3a1
67671 AA050777 ribosomal protein L38 Rpl38
19241 NM_021278 thymosin, beta 4, X chromosome Tmsb4x
ENTREZ_GENE_ID
20090 20090
22121 22121
20091 20091
67671 67671
19241 19241
RefSeq Transcript ID
20090 NM_009093
22121 NM_009438
20091 NM_016959
67671 NM_001048057 /// NM_001048058 /// NM_023372 /// XM_006534007 /// XM_006534008 /// XM_006534009
19241 NM_021278 /// XM_006528759
Gene Ontology Biological Process
20090 0006412 // translation // not recorded
22121 0006351 // transcription, DNA-templated // inferred from electronic annotation /// 0006355 // regulation of transcription, DNA-templated // inferred from electronic annotation /// 0006412 // translation // inferred from electronic annotation /// 0006417 // regulation of translation // inferred from electronic annotation /// 0017148 // negative regulation of translation // inferred from direct assay /// 0017148 // negative regulation of translation // not recorded /// 0032496 // response to lipopolysaccharide // inferred from mutant phenotype /// 0032844 // regulation of homeostatic process // inferred from mutant phenotype /// 0048246 // macrophage chemotaxis // inferred from mutant phenotype /// 0060425 // lung morphogenesis // inferred from mutant phenotype /// 0071346 // cellular response to interferon-gamma // inferred from direct assay /// 0071346 // cellular response to interferon-gamma // not recorded /// 1901194 // negative regulation of formation of translation preinitiation complex // not recorded
20091 0002181 // cytoplasmic translation // not recorded /// 0006412 // translation // inferred from electronic annotation /// 0030154 // cell differentiation // inferred from electronic annotation /// 0043066 // negative regulation of apoptotic process // inferred from sequence or structural similarity /// 0097194 // execution phase of apoptosis // inferred from sequence or structural similarity
67671 0001501 // skeletal system development // inferred from mutant phenotype /// 0001501 // skeletal system development // traceable author statement /// 0001503 // ossification // inferred from mutant phenotype /// 0006412 // translation // inferred from electronic annotation /// 0006417 // regulation of translation // inferred from mutant phenotype /// 0007605 // sensory perception of sound // inferred from mutant phenotype /// 0034463 // 90S preribosome assembly // inferred from direct assay /// 0042474 // middle ear morphogenesis // inferred from mutant phenotype /// 0048318 // axial mesoderm development // inferred from mutant phenotype
19241 0007010 // cytoskeleton organization // inferred from electronic annotation /// 0014911 // positive regulation of smooth muscle cell migration // traceable author statement /// 0030036 // actin cytoskeleton organization // inferred from electronic annotation /// 0030334 // regulation of cell migration // inferred from direct assay /// 0042989 // sequestering of actin monomers // inferred from electronic annotation /// 0045893 // positive regulation of transcription, DNA-templated // traceable author statement /// 0051152 // positive regulation of smooth muscle cell differentiation // traceable author statement
Gene Ontology Cellular Component
20090 0005622 // intracellular // inferred from electronic annotation /// 0005737 // cytoplasm // not recorded /// 0005840 // ribosome // inferred from electronic annotation /// 0015935 // small ribosomal subunit // not recorded /// 0022627 // cytosolic small ribosomal subunit // not recorded /// 0030529 // ribonucleoprotein complex // inferred from electronic annotation /// 0070062 // extracellular vesicular exosome // not recorded
22121 0005576 // extracellular region // inferred from electronic annotation /// 0005634 // nucleus // not recorded /// 0005840 // ribosome // inferred from electronic annotation /// 0015934 // large ribosomal subunit // inferred from electronic annotation /// 0016020 // membrane // inferred from electronic annotation /// 0022625 // cytosolic large ribosomal subunit // not recorded /// 0030529 // ribonucleoprotein complex // inferred from direct assay /// 0030529 // ribonucleoprotein complex // not recorded /// 0097452 // GAIT complex // inferred from direct assay /// 0097452 // GAIT complex // not recorded
20091 0005622 // intracellular // inferred from electronic annotation /// 0005634 // nucleus // inferred from direct assay /// 0005737 // cytoplasm // inferred from electronic annotation /// 0005829 // cytosol // inferred from direct assay /// 0005840 // ribosome // inferred from electronic annotation /// 0022627 // cytosolic small ribosomal subunit // not recorded /// 0030529 // ribonucleoprotein complex // not recorded
67671 0005622 // intracellular // inferred from electronic annotation /// 0005840 // ribosome // inferred from electronic annotation /// 0022625 // cytosolic large ribosomal subunit // not recorded /// 0030529 // ribonucleoprotein complex // inferred from electronic annotation /// 0033291 // eukaryotic 80S initiation complex // inferred from direct assay
19241 0005634 // nucleus // inferred from direct assay /// 0005737 // cytoplasm // inferred from electronic annotation /// 0005829 // cytosol // inferred from direct assay /// 0005856 // cytoskeleton // inferred from electronic annotation
Gene Ontology Molecular Function
20090 0003735 // structural constituent of ribosome // not recorded /// 0008270 // zinc ion binding // not recorded /// 0046872 // metal ion binding // inferred from electronic annotation
22121 0003676 // nucleic acid binding // inferred from electronic annotation /// 0003677 // DNA binding // inferred from electronic annotation /// 0003729 // mRNA binding // inferred from direct assay /// 0003735 // structural constituent of ribosome // inferred from electronic annotation /// 0005125 // cytokine activity // inferred from electronic annotation /// 0044822 // poly(A) RNA binding // not recorded /// 0046872 // metal ion binding // inferred from electronic annotation
20091 0003735 // structural constituent of ribosome // not recorded /// 0005515 // protein binding // inferred from physical interaction /// 0031369 // translation initiation factor binding // not recorded
67671 0003735 // structural constituent of ribosome // inferred from electronic annotation
19241 0003779 // actin binding // inferred from electronic annotation /// 0005515 // protein binding // inferred from physical interaction
origin
20090 1438859_x_at
22121 1455485_x_at
20091 1422475_a_at
67671 1433472_x_at
19241 1415906_at
20765 more rows ...
$Amean
20090 22121 20091 67671 19241
16.28272 16.27241 16.25759 16.23408 16.23061
20765 more elements ...
$method
[1] "ls"
$design
conditioniTreg conditionNaive conditionnTreg conditionTh1
GSM357839 0 0 0 0
GSM357841 0 0 0 0
GSM357842 0 0 0 1
GSM357843 0 0 0 0
GSM357844 0 0 0 1
conditionTh17 conditionTh2
GSM357839 0 1
GSM357841 0 1
GSM357842 0 0
GSM357843 1 0
GSM357844 0 0
7 more rows ...
$contrasts
Contrasts
Levels conditionTh1 - conditionNaive
conditioniTreg 0
conditionNaive -1
conditionnTreg 0
conditionTh1 1
conditionTh17 0
conditionTh2 0
Finally, the differential expression analysis is carried out and the results saved. The results are ranked by limma’s moderated t-statistic and this creates the ranked list of genes.
fit2 <- eBayes(fit2)
names(topTable(fit2, adjust.method="BH", number=12000, sort.by = "B"))
[1] "ID" "GB_ACC"
[3] "SPOT_ID" "Species.Scientific.Name"
[5] "Annotation.Date" "Sequence.Type"
[7] "Sequence.Source" "Target.Description"
[9] "Representative.Public.ID" "Gene.Title"
[11] "Gene.Symbol" "ENTREZ_GENE_ID"
[13] "RefSeq.Transcript.ID" "Gene.Ontology.Biological.Process"
[15] "Gene.Ontology.Cellular.Component" "Gene.Ontology.Molecular.Function"
[17] "origin" "logFC"
[19] "AveExpr" "t"
[21] "P.Value" "adj.P.Val"
[23] "B"
de <- data.table(topTable(fit2, adjust.method="BH", number=12000, sort.by = "B"), keep.rownames = TRUE)
ranks <- de[order(t), list(rn, t)]
ranks
rn t
<char> <num>
1: 170942 -62.22877
2: 109711 -49.47829
3: 18124 -43.40540
4: 12775 -41.16952
5: 72148 -33.23463
---
11996: 58801 49.10222
11997: 13730 50.02863
11998: 15937 50.29120
11999: 12772 50.52651
12000: 80876 52.59930
Load exampleRanks
.
data("exampleRanks", package = "fgsea")
head(exampleRanks)
170942 109711 18124 12775 72148 16010
-63.33703 -49.74779 -43.63878 -41.51889 -33.26039 -32.77626
The names of the vector are the Entrez Gene IDs and the values are the rank metric.
Compare with our results.
wanted <- head(names(exampleRanks))
ranks[rn %in% wanted]
rn t
<char> <num>
1: 170942 -62.22877
2: 109711 -49.47829
3: 18124 -43.40540
4: 12775 -41.16952
5: 72148 -33.23463
6: 16010 -32.38200
Visualise the six most significantly down- and up-regulated genes.
library(pheatmap)
my_sample <- pData(es)$condition == "Th1" | pData(es)$condition == "Naive"
my_group <- data.frame(group = pData(es)$condition[my_sample])
row.names(my_group) <- colnames(exprs(es))[my_sample]
pheatmap(
mat = es[c(head(de[order(t), 1])$rn, tail(de[order(t), 1])$rn), my_sample],
annotation_col = my_group,
cluster_rows = FALSE,
cellwidth=25,
cellheight=15,
scale = "row"
)
Version | Author | Date |
---|---|---|
b82b4fc | Dave Tang | 2023-02-01 |
The following section is based on the fgsea
tutorial but with my elaborations. The pathways are stored in
examplePathways
and the ranked gene list in
exampleRanks
.
The fgsea()
function runs the pre-ranked gene set
enrichment analysis.
data(examplePathways)
data(exampleRanks)
set.seed(42)
system.time(
fgseaRes <- fgsea(
pathways = examplePathways,
stats = exampleRanks,
minSize=15,
maxSize=500
)
)
user system elapsed
4.436 0.138 3.372
class(fgseaRes)
[1] "data.table" "data.frame"
Top 6 enriched pathways; see the data.table Wiki
for more information on the data.table
package.
head(fgseaRes[order(pval), ])
pathway pval padj
<char> <num> <num>
1: 5990979_Cell_Cycle,_Mitotic 6.690481e-27 3.920622e-24
2: 5990980_Cell_Cycle 3.312565e-26 9.705816e-24
3: 5991851_Mitotic_Prometaphase 8.470173e-19 1.654507e-16
4: 5992217_Resolution_of_Sister_Chromatid_Cohesion 2.176649e-18 3.188791e-16
5: 5991454_M_Phase 1.873997e-14 2.196325e-12
6: 5991599_Separation_of_Sister_Chromatids 8.733223e-14 8.529448e-12
log2err ES NES size leadingEdge
<num> <num> <num> <int> <list>
1: 1.3422338 0.5594755 2.769070 317 66336, 6....
2: 1.3267161 0.5388497 2.705894 369 66336, 6....
3: 1.1239150 0.7253270 2.972690 82 66336, 6....
4: 1.1053366 0.7347987 2.957518 74 66336, 6....
5: 0.9759947 0.5576247 2.554076 173 66336, 6....
6: 0.9545416 0.6164600 2.670030 116 66336, 6....
Number of significant pathways at padj < 0.01.
sum(fgseaRes[, padj < 0.01])
[1] 76
Plot the most significantly enriched pathway.
plotEnrichment(
examplePathways[[head(fgseaRes[order(pval), ], 1)$pathway]],
exampleRanks
) +
ggplot2::labs(title=head(fgseaRes[order(pval), ], 1)$pathway)
Version | Author | Date |
---|---|---|
f488702 | Dave Tang | 2023-02-13 |
Plot the top 10 pathways enriched at the top and bottom of the ranked list, respectively.
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(
examplePathways[topPathways],
exampleRanks,
fgseaRes,
gseaParam = 0.5
)
Version | Author | Date |
---|---|---|
f488702 | Dave Tang | 2023-02-13 |
Since the genes were ranked according to their differential
expression significance and fold change, with the most significantly
down-regulated genes at the top and up-regulated genes at the bottom of
the list, the enriched gene sets provides us with some idea of the
function of these genes. For example, exampleRanks
was
generated by testing naive T cells against T helper 1 cells, so the
enriched pathways may suggest how naive cells are differentiated into T
helper 1 cells.
Reactome pathways can also be used with fgsea and this database is available via Bioconductor (and was installed earlier in this post).
The reactomePathways()
function returns a list of
Reactome pathways given a list of Entrez Gene IDs. In the example data,
the IDs are stored as the names in the exampleRanks
vector.
my_pathways <- reactomePathways(names(exampleRanks))
'select()' returned 1:many mapping between keys and columns
'select()' returned 1:1 mapping between keys and columns
Out of interest, let’s check the median number of genes in the pathways.
summary(sapply(my_pathways, length))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 4.00 10.00 30.68 29.00 1261.00
The results are very similar to the example pathways, where the pathways are related to the cell cycle and mitosis.
set.seed(42)
system.time(
fgsea_reactome <- fgsea(
pathways = my_pathways,
stats = exampleRanks,
minSize=15,
maxSize=500
)
)
user system elapsed
4.196 0.160 2.372
head(fgsea_reactome[order(pval), ])
pathway pval padj log2err
<char> <num> <num> <num>
1: Cell Cycle 4.153686e-28 2.949117e-25 1.3727430
2: Cell Cycle, Mitotic 3.944662e-26 1.400355e-23 1.3267161
3: Cell Cycle Checkpoints 3.693033e-21 8.740177e-19 1.1866510
4: Resolution of Sister Chromatid Cohesion 3.827468e-16 6.793755e-14 1.0376962
5: Mitotic Prometaphase 1.110433e-15 1.576815e-13 1.0175448
6: S Phase 4.777872e-14 5.653815e-12 0.9653278
ES NES size leadingEdge
<num> <num> <int> <list>
1: 0.5226639 2.659352 430 66336, 6....
2: 0.5409376 2.712810 363 66336, 6....
3: 0.6022241 2.823320 199 66336, 6....
4: 0.6806110 2.834239 91 66336, 6....
5: 0.5948613 2.678654 149 66336, 6....
6: 0.6182229 2.651952 116 12428, 5....
Again quote the original paper:
Often it is useful to extract the core members of high scoring gene sets that contribute to the enrichment score. We define the leading-edge subset to be those genes in the gene set S that appear in the ranked list L at, or before, the point where the running sum reaches its maximum deviation from zero.
The fgsea
results includes the leading edge, which
contains the genes that contributed to the enrichment score. To
illustrate, let’s take the leader edge genes from the most significant
Reactome pathway.
fgsea_reactome[order(pval),][1,]
pathway pval padj log2err ES NES size
<char> <num> <num> <num> <num> <num> <int>
1: Cell Cycle 4.153686e-28 2.949117e-25 1.372743 0.5226639 2.659352 430
leadingEdge
<list>
1: 66336, 6....
The set of leading edge genes is made up of 198 genes.
fgsea_reactome[order(pval),][1,]$leadingEdge[[1]]
[1] "66336" "66977" "15366" "12442" "107995" "66442" "19361" "12571"
[9] "12428" "52276" "54392" "66311" "215387" "67629" "54124" "12649"
[17] "72415" "56150" "57441" "20877" "67121" "12615" "66468" "67849"
[25] "19053" "73804" "76044" "20878" "15270" "13555" "60411" "60530"
[33] "17219" "69270" "12575" "69263" "12448" "14211" "20873" "18005"
[41] "72119" "71988" "12189" "72107" "17215" "12534" "66156" "208628"
[49] "237911" "22390" "68240" "228421" "68014" "269582" "19348" "12236"
[57] "72151" "18817" "21781" "18968" "217653" "66934" "272551" "227613"
[65] "67141" "67951" "68612" "68298" "108000" "23834" "106344" "242705"
[73] "18141" "223921" "26886" "13557" "26909" "72787" "268697" "72155"
[81] "56371" "17535" "107823" "12531" "327762" "12567" "229841" "67052"
[89] "16319" "66634" "171567" "26931" "67203" "381306" "12235" "19891"
[97] "74470" "72083" "66570" "17216" "76308" "19687" "17218" "102920"
[105] "29870" "18973" "16881" "17463" "68964" "75786" "19645" "19075"
[113] "26417" "69736" "19357" "76816" "70385" "70645" "22628" "225182"
[121] "22627" "52683" "19076" "18972" "26932" "12544" "17997" "26440"
[129] "68549" "12445" "269113" "26444" "19324" "93765" "103733" "59001"
[137] "107976" "19179" "12579" "232987" "17420" "228769" "219072" "26445"
[145] "105988" "69745" "18538" "69928" "11651" "235559" "68097" "57296"
[153] "63955" "19170" "17246" "17220" "12144" "208084" "50793" "77605"
[161] "18392" "236930" "67151" "70024" "66296" "16906" "109145" "71819"
[169] "67733" "50883" "12447" "75430" "12532" "68942" "26442" "19177"
[177] "230376" "245688" "229776" "214901" "16563" "16328" "18971" "225659"
[185] "68953" "100088" "66671" "19718" "78177" "66578" "59046" "22194"
[193] "52530" "23996" "101565" "16475" "381280" "235661"
The Cell Cycle contains 430 genes, which was indicated in the
fgsea
result as size (but it’s always good to
double-check).
length(my_pathways[[fgsea_reactome[order(pval),][1,]$pathway]])
[1] 430
The Gene Set Enrichment Analysis (GSEA) has been around since 2005 and has become a routine step when analysing gene expression data. It differs from Gene Ontology enrichment analysis in that it considers all genes in contrast to taking only significantly differentially expressed genes.
The fgsea
package allows one to conduct a pre-ranked
GSEA in R, which is one approach in a GSEA. A p-value is estimated by
permuting the genes in a gene set, which leads to randomly assigned gene
sets of the same size. Note that “This approach is not strictly accurate
because it ignores gene-gene correlations and will overestimate the
significance levels and may lead to false positives.” However, it is
still useful in getting an idea of the potential roles of the genes that
are up- and down-regulated. If your pre-ranked GSEA returns no
significant gene sets, you may still get an idea of what roles the up-
and down-regulated genes may be involved in by examining the leading
edge set. This set indicates the genes that contributed to the
enrichment score.
The example ranks in the fgsea
package were ranked based
on the moderated t-statistic. Another
metric that you may consider is to take the signed fold change and
multiply it by the \(-log_{10}\)p-value. For example, if you
performed your differential expression analysis with edgeR
,
you can simply multiply the signed fold change column to the \(-log_{10}\)p-value column. This metric is
based on the paper: Rank-rank
hypergeometric overlap: identification of statistically significant
overlap between gene-expression signatures.
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] GEOquery_2.74.0 pheatmap_1.0.12 data.table_1.16.2
[4] reactome.db_1.89.0 org.Mm.eg.db_3.20.0 AnnotationDbi_1.68.0
[7] IRanges_2.40.1 S4Vectors_0.44.0 Biobase_2.66.0
[10] BiocGenerics_0.52.0 limma_3.62.2 fgsea_1.32.2
[13] BiocManager_1.30.25 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] DBI_1.2.3 httr2_1.0.7
[3] rlang_1.1.4 magrittr_2.0.3
[5] git2r_0.35.0 matrixStats_1.4.1
[7] compiler_4.4.1 RSQLite_2.3.9
[9] getPass_0.2-4 png_0.1-8
[11] callr_3.7.6 vctrs_0.6.5
[13] stringr_1.5.1 pkgconfig_2.0.3
[15] crayon_1.5.3 fastmap_1.2.0
[17] XVector_0.46.0 labeling_0.4.3
[19] utf8_1.2.4 promises_1.3.0
[21] rmarkdown_2.28 tzdb_0.4.0
[23] UCSC.utils_1.2.0 ps_1.8.1
[25] purrr_1.0.2 bit_4.5.0
[27] xfun_0.48 zlibbioc_1.52.0
[29] cachem_1.1.0 GenomeInfoDb_1.42.1
[31] jsonlite_1.8.9 blob_1.2.4
[33] highr_0.11 later_1.3.2
[35] DelayedArray_0.32.0 BiocParallel_1.40.0
[37] parallel_4.4.1 R6_2.5.1
[39] bslib_0.8.0 stringi_1.8.4
[41] RColorBrewer_1.1-3 GenomicRanges_1.58.0
[43] jquerylib_0.1.4 Rcpp_1.0.13
[45] SummarizedExperiment_1.36.0 knitr_1.48
[47] R.utils_2.12.3 readr_2.1.5
[49] httpuv_1.6.15 rentrez_1.2.3
[51] Matrix_1.7-0 tidyselect_1.2.1
[53] rstudioapi_0.17.1 abind_1.4-8
[55] yaml_2.3.10 codetools_0.2-20
[57] curl_6.0.1 processx_3.8.4
[59] lattice_0.22-6 tibble_3.2.1
[61] withr_3.0.2 KEGGREST_1.46.0
[63] evaluate_1.0.1 xml2_1.3.6
[65] Biostrings_2.74.1 pillar_1.9.0
[67] MatrixGenerics_1.18.1 whisker_0.4.1
[69] generics_0.1.3 rprojroot_2.0.4
[71] hms_1.1.3 ggplot2_3.5.1
[73] munsell_0.5.1 scales_1.3.0
[75] glue_1.8.0 tools_4.4.1
[77] XML_3.99-0.17 fs_1.6.4
[79] fastmatch_1.1-4 cowplot_1.1.3
[81] grid_4.4.1 tidyr_1.3.1
[83] colorspace_2.1-1 GenomeInfoDbData_1.2.13
[85] cli_3.6.3 rappdirs_0.3.3
[87] fansi_1.0.6 S4Arrays_1.6.0
[89] dplyr_1.1.4 gtable_0.3.6
[91] R.methodsS3_1.8.2 sass_0.4.9
[93] digest_0.6.37 SparseArray_1.6.1
[95] farver_2.1.2 R.oo_1.27.0
[97] memoise_2.0.1 htmltools_0.5.8.1
[99] lifecycle_1.0.4 httr_1.4.7
[101] statmod_1.5.0 bit64_4.5.2