Last updated: 2025-04-02

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version c442512. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    data/1M_neurons_filtered_gene_bc_matrices_h5.h5
    Ignored:    data/293t/
    Ignored:    data/293t_3t3_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/293t_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/5k_Human_Donor1_PBMC_3p_gem-x_5k_Human_Donor1_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor2_PBMC_3p_gem-x_5k_Human_Donor2_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor3_PBMC_3p_gem-x_5k_Human_Donor3_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor4_PBMC_3p_gem-x_5k_Human_Donor4_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/97516b79-8d08-46a6-b329-5d0a25b0be98.h5ad
    Ignored:    data/Parent_SC3v3_Human_Glioblastoma_filtered_feature_bc_matrix.tar.gz
    Ignored:    data/brain_counts/
    Ignored:    data/cl.obo
    Ignored:    data/cl.owl
    Ignored:    data/jurkat/
    Ignored:    data/jurkat:293t_50:50_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/jurkat_293t/
    Ignored:    data/jurkat_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/pbmc20k/
    Ignored:    data/pbmc20k_seurat/
    Ignored:    data/pbmc3k/
    Ignored:    data/pbmc3k_seurat.rds
    Ignored:    data/pbmc4k_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/pbmc_1k_v3_filtered_feature_bc_matrix.h5
    Ignored:    data/pbmc_1k_v3_raw_feature_bc_matrix.h5
    Ignored:    data/refdata-gex-GRCh38-2020-A.tar.gz
    Ignored:    data/seurat_1m_neuron.rds
    Ignored:    data/t_3k_filtered_gene_bc_matrices.tar.gz
    Ignored:    r_packages_4.4.1/

Untracked files:
    Untracked:  analysis/bioc_scrnaseq.Rmd
    Untracked:  rsem.merged.gene_counts.tsv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/gene_exp_pca.Rmd) and HTML (docs/gene_exp_pca.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd c442512 Dave Tang 2025-04-02 Explain results of prcomp
html 4d70d91 Dave Tang 2025-04-02 Build site.
Rmd a162500 Dave Tang 2025-04-02 Visualise genes contributing to PC1
html b9d5a11 Dave Tang 2025-04-01 Build site.
Rmd d1f7ed3 Dave Tang 2025-04-01 Aesthetics
html bbeb9ad Dave Tang 2025-04-01 Build site.
Rmd 5bbd819 Dave Tang 2025-04-01 Show only the head of data
html c9c5f07 Dave Tang 2025-04-01 Build site.
Rmd cf5f5ca Dave Tang 2025-04-01 PCA using DESeq2

DESeq2 is used to:

Estimate variance-mean dependence in count data from high-throughput sequencing assays and test for differential expression based on a model using the negative binomial distribution.

Installation

Install using BiocManager::install().

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("DESeq2")

Package version.

packageVersion("DESeq2")
[1] '1.46.0'

Count table

https://zenodo.org/records/13970886.

my_url <- 'https://zenodo.org/records/13970886/files/rsem.merged.gene_counts.tsv?download=1'
my_file <- 'rsem.merged.gene_counts.tsv'

if(file.exists(my_file) == FALSE){
  download.file(url = my_url, destfile = my_file)
}

gene_counts <- read_tsv("rsem.merged.gene_counts.tsv", show_col_types = FALSE)
head(gene_counts)
# A tibble: 6 × 10
  gene_id   `transcript_id(s)` ERR160122 ERR160123 ERR160124 ERR164473 ERR164550
  <chr>     <chr>                  <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 ENSG0000… ENST00000373020,E…        2         6         5      374       1637 
2 ENSG0000… ENST00000373031,E…       19        40        28        0          1 
3 ENSG0000… ENST00000371582,E…      268.      274.      429.     489        637 
4 ENSG0000… ENST00000367770,E…      360.      449.      566.     363.       606.
5 ENSG0000… ENST00000286031,E…      156.      185.      265.      85.4      312.
6 ENSG0000… ENST00000374003,E…       24        23        40     1181        423 
# ℹ 3 more variables: ERR164551 <dbl>, ERR164552 <dbl>, ERR164554 <dbl>

Metadata.

tibble::tribble(
  ~sample, ~run_id, ~group,
  "C2_norm", "ERR160122", "normal",
  "C3_norm", "ERR160123", "normal",
  "C5_norm", "ERR160124", "normal",
  "C1_norm", "ERR164473", "normal",
  "C1_cancer", "ERR164550", "cancer",
  "C2_cancer", "ERR164551", "cancer",
  "C3_cancer", "ERR164552", "cancer",
  "C5_cancer", "ERR164554", "cancer"
) -> my_metadata

my_metadata$group <- factor(my_metadata$group, levels = c('normal', 'cancer'))

Matrix.

gene_counts |>
  dplyr::select(starts_with("ERR")) |>
  mutate(across(everything(), as.integer)) |>
  as.matrix() -> gene_counts_mat

row.names(gene_counts_mat) <- gene_counts$gene_id

idx <- match(colnames(gene_counts_mat), my_metadata$run_id)
colnames(gene_counts_mat) <- my_metadata$sample[idx]

tail(gene_counts_mat)
                C2_norm C3_norm C5_norm C1_norm C1_cancer C2_cancer C3_cancer
ENSG00000293594       0       0       0       0         0         0         0
ENSG00000293595       3       5       3       0         0         0         0
ENSG00000293596       0       0       0       0         0         0         0
ENSG00000293597       1       2      11       1         2         3         1
ENSG00000293599       2       0       1       0         1         2         0
ENSG00000293600      45      59      85     561       789      1099       701
                C5_cancer
ENSG00000293594         0
ENSG00000293595         0
ENSG00000293596         0
ENSG00000293597         2
ENSG00000293599         0
ENSG00000293600       845

PCA

Create DESeqDataSet object.

lung_cancer <- DESeqDataSetFromMatrix(
  countData = gene_counts_mat,
  colData   = my_metadata,
  design    = ~ group
)
lung_cancer
class: DESeqDataSet 
dim: 63140 8 
metadata(1): version
assays(1): counts
rownames(63140): ENSG00000000003 ENSG00000000005 ... ENSG00000293599
  ENSG00000293600
rowData names(0):
colnames(8): C2_norm C3_norm ... C3_cancer C5_cancer
colData names(3): sample run_id group

Quickly estimate dispersion trend and apply a variance stabilizing transformation.

pas_vst <- vst(lung_cancer)
pas_vst
class: DESeqTransform 
dim: 63140 8 
metadata(1): version
assays(1): ''
rownames(63140): ENSG00000000003 ENSG00000000005 ... ENSG00000293599
  ENSG00000293600
rowData names(4): baseMean baseVar allZero dispFit
colnames(8): C2_norm C3_norm ... C3_cancer C5_cancer
colData names(4): sample run_id group sizeFactor

Plot PCA.

plotPCA(pas_vst, intgroup = "group") +
  theme_minimal()
using ntop=500 top features by variance

Version Author Date
b9d5a11 Dave Tang 2025-04-01
c9c5f07 Dave Tang 2025-04-01

PCA data.

pca_data <- plotPCA(pas_vst, intgroup = "group", returnData = TRUE)
using ntop=500 top features by variance
pca_data
                 PC1         PC2  group group.1      name
C2_norm   -117.37808   7.0149036 normal  normal   C2_norm
C3_norm   -116.03913 -13.2708426 normal  normal   C3_norm
C5_norm   -115.34425   6.2593420 normal  normal   C5_norm
C1_norm     69.82117   0.8856676 normal  normal   C1_norm
C1_cancer   67.53316  -2.1774785 cancer  cancer C1_cancer
C2_cancer   70.13468   0.7640921 cancer  cancer C2_cancer
C3_cancer   70.50438   0.1610337 cancer  cancer C3_cancer
C5_cancer   70.76807   0.3632821 cancer  cancer C5_cancer

Source code.

plotPCA_copied = function(object, intgroup="condition", ntop=500, returnData=FALSE, pcsToUse=1:2){
  message(paste0("using ntop=",ntop," top features by variance"))
  
  # calculate the variance for each gene
  rv <- rowVars(assay(object))

  # select the ntop genes by variance
  select <- order(rv, decreasing=TRUE)[seq_len(min(ntop, length(rv)))]

  # perform a PCA on the data in assay(x) for the selected genes
  pca <- prcomp(t(assay(object)[select,]))

  # the contribution to the total variance for each component
  percentVar <- pca$sdev^2 / sum( pca$sdev^2 )

  if (!all(intgroup %in% names(colData(object)))) {
    stop("the argument 'intgroup' should specify columns of colData(dds)")
  }

  # add the intgroup factors together to create a new grouping factor
  group <- if (length(intgroup) > 1) {
    intgroup.df <- as.data.frame(colData(object)[, intgroup, drop=FALSE])
    factor(apply( intgroup.df, 1, paste, collapse=":"))
  } else {
    colData(object)[[intgroup]]
  }

  # assembly the data for the plot
  pcs <- paste0("PC", pcsToUse)
  d <- data.frame(V1=pca$x[,pcsToUse[1]],
                  V2=pca$x[,pcsToUse[2]],
                  group=group, name=colnames(object), colData(object))
  colnames(d)[1:2] <- pcs
  
  if (returnData) {
    attr(d, "percentVar") <- percentVar[pcsToUse]
    return(d)
  }

  ggplot(data=d, aes_string(x=pcs[1], y=pcs[2], color="group")) +
    geom_point(size=3) + 
    xlab(paste0(pcs[1],": ",round(percentVar[pcsToUse[1]] * 100),"% variance")) +
      ylab(paste0(pcs[2],": ",round(percentVar[pcsToUse[2]] * 100),"% variance")) +
        coord_fixed()
}

Plot using the copied function.

plotPCA_copied(pas_vst, intgroup = "group") +
  theme_minimal()
using ntop=500 top features by variance
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

Version Author Date
b9d5a11 Dave Tang 2025-04-01
c9c5f07 Dave Tang 2025-04-01

Perform PCA as per plotPCA().

# calculate the variance for each gene
rv <- rowVars(assay(pas_vst))
head(rv)
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460 
      6.3038555       3.4528327       0.1236461       0.5966854       1.0711842 
ENSG00000000938 
      3.6559005 
# select the ntop genes by variance
ntop <- 500
topgenes <- order(rv, decreasing=TRUE)[seq_len(min(ntop, length(rv)))]
head(topgenes)
[1] 12679 12446 12359 20116 53333  8411
# perform a PCA on the data in assay(x) for the selected genes
pca <- prcomp(t(assay(pas_vst)[topgenes,]))
names(pca)
[1] "sdev"     "rotation" "center"   "scale"    "x"       
  1. sdev (Standard Deviations)
    • A numeric vector containing the standard deviations of the principal components.
    • These values correspond to the square roots of the eigenvalues of the covariance (or correlation) matrix.
    • They indicate the amount of variance explained by each principal component.
  2. rotation (Loadings)
    • A matrix where each column represents a principal component.
    • The rows correspond to the original variables, and the values indicate how much each variable contributes to each principal component.
    • This is sometimes referred to as the eigenvectors or loadings matrix.
  3. center
    • A numeric vector containing the means of the original variables if centering was performed.
    • If center = TRUE in prcomp(), each column of the input data was mean-centered before PCA.
    • If centering was not performed, this slot will be NULL.
  4. scale
    • A numeric vector containing the scaling values applied to the original variables if scaling was performed.
    • If scale = TRUE, each column of the input data was scaled to have unit variance before PCA.
    • If scaling was not performed, this slot will be NULL.
  5. x (Principal Component Scores)
    • A matrix containing the transformed data in the new principal component space.
    • The rows correspond to observations, and the columns correspond to the principal components.
    • This is obtained by projecting the original data onto the principal component directions.
loadings <- pca$rotation
loadings[1:10, 1:2]
                        PC1          PC2
ENSG00000169876 -0.06639949  0.008294131
ENSG00000168878  0.06304686  0.005076416
ENSG00000168484  0.05002038  0.117877848
ENSG00000205277 -0.05947887  0.023446665
ENSG00000275969 -0.05923209  0.013935605
ENSG00000143520 -0.05881607  0.007613250
ENSG00000087086  0.05824667  0.016638253
ENSG00000204849 -0.05843500  0.001321622
ENSG00000276040 -0.05730797  0.013343573
ENSG00000185775 -0.05689974 -0.001037869

The loadings matrix contains the coefficients that define how the original variables contribute to each principal component. Each element in the matrix tells you how much a given original variable contributes (positively or negatively) to a principal component.

When you square the loadings, you get the proportion of variance in a variable explained by each principal component. Squaring eliminates negative signs, so you lose the direction of influence. It is useful when you want to: (i) determine the relative importance of each variable in a principal component and (ii) see how much of the original variable’s variance is captured by each principal component.

contribution <- loadings^2
contribution[1:10, 1:2]
                        PC1          PC2
ENSG00000169876 0.004408893 6.879261e-05
ENSG00000168878 0.003974907 2.577000e-05
ENSG00000168484 0.002502039 1.389519e-02
ENSG00000205277 0.003537736 5.497461e-04
ENSG00000275969 0.003508441 1.942011e-04
ENSG00000143520 0.003459330 5.796158e-05
ENSG00000087086 0.003392674 2.768315e-04
ENSG00000204849 0.003414650 1.746686e-06
ENSG00000276040 0.003284204 1.780509e-04
ENSG00000185775 0.003237580 1.077171e-06

Gene contributing most to PC1.

sort(contribution[, 1], decreasing = TRUE) |> head()
ENSG00000169876 ENSG00000168878 ENSG00000205277 ENSG00000275969 ENSG00000143520 
    0.004408893     0.003974907     0.003537736     0.003508441     0.003459330 
ENSG00000204849 
    0.003414650 

Note that the order is similar, because plotPCA() already sorted the genes by their variance!

Gene contribution

Generate normalised counts for visualisation.

lung_cancer <- estimateSizeFactors(lung_cancer)
norm_counts <- counts(lung_cancer, normalized=TRUE)
head(norm_counts)
                   C2_norm    C3_norm    C5_norm    C1_norm    C1_cancer
ENSG00000000003   5.483606   13.47201   7.630305  320.18436 1020.9708328
ENSG00000000005  52.094259   89.81340  42.729708    0.00000    0.6236841
ENSG00000000419 734.803228  612.97645 653.154115  418.63677  397.2867565
ENSG00000000457 987.049112 1008.15541 863.750536  309.91107  377.3288661
ENSG00000000460 424.979479  413.14164 402.880109   72.76917  194.5894318
ENSG00000000938  65.803274   51.64270  61.042441 1011.06345  263.8183642
                C2_cancer C3_cancer C5_cancer
ENSG00000000003  315.4875 578.87186 345.53876
ENSG00000000005    0.0000   0.00000   0.00000
ENSG00000000419  426.6362 659.85689 448.21665
ENSG00000000457  343.6387 360.44041 293.89240
ENSG00000000460  116.0023  83.83661  95.91467
ENSG00000000938 1624.0327 712.32606 706.44846

Top 10 genes contributing to PC1.

sort(contribution[, 1], decreasing = TRUE) |>
  head(10) |>
  names() -> top_genes

norm_counts[top_genes, ] |>
  as.data.frame() |>
  tibble::rownames_to_column(var = "ensembl_gene_id") |>
  tidyr::pivot_longer(-ensembl_gene_id) |>
  ggplot(aes(name, ensembl_gene_id, fill = value)) +
  geom_tile() +
  theme_minimal() +
  theme(axis.title = element_blank())

Version Author Date
4d70d91 Dave Tang 2025-04-02

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] DESeq2_1.46.0               SummarizedExperiment_1.36.0
 [3] Biobase_2.66.0              MatrixGenerics_1.18.1      
 [5] matrixStats_1.5.0           GenomicRanges_1.58.0       
 [7] GenomeInfoDb_1.42.3         IRanges_2.40.1             
 [9] S4Vectors_0.44.0            BiocGenerics_0.52.0        
[11] lubridate_1.9.3             forcats_1.0.0              
[13] stringr_1.5.1               dplyr_1.1.4                
[15] purrr_1.0.2                 readr_2.1.5                
[17] tidyr_1.3.1                 tibble_3.2.1               
[19] ggplot2_3.5.1               tidyverse_2.0.0            
[21] workflowr_1.7.1            

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        farver_2.1.2            fastmap_1.2.0          
 [4] promises_1.3.2          digest_0.6.37           timechange_0.3.0       
 [7] lifecycle_1.0.4         processx_3.8.4          magrittr_2.0.3         
[10] compiler_4.4.1          rlang_1.1.4             sass_0.4.9             
[13] tools_4.4.1             utf8_1.2.4              yaml_2.3.10            
[16] knitr_1.48              S4Arrays_1.6.0          labeling_0.4.3         
[19] bit_4.5.0               DelayedArray_0.32.0     abind_1.4-8            
[22] BiocParallel_1.40.0     withr_3.0.2             grid_4.4.1             
[25] git2r_0.35.0            colorspace_2.1-1        scales_1.3.0           
[28] cli_3.6.3               rmarkdown_2.28          crayon_1.5.3           
[31] generics_0.1.3          rstudioapi_0.17.1       httr_1.4.7             
[34] tzdb_0.4.0              cachem_1.1.0            zlibbioc_1.52.0        
[37] parallel_4.4.1          XVector_0.46.0          vctrs_0.6.5            
[40] Matrix_1.7-0            jsonlite_1.8.9          callr_3.7.6            
[43] hms_1.1.3               bit64_4.5.2             locfit_1.5-9.12        
[46] jquerylib_0.1.4         glue_1.8.0              codetools_0.2-20       
[49] ps_1.8.1                stringi_1.8.4           gtable_0.3.6           
[52] later_1.3.2             UCSC.utils_1.2.0        munsell_0.5.1          
[55] pillar_1.10.1           htmltools_0.5.8.1       GenomeInfoDbData_1.2.13
[58] R6_2.5.1                rprojroot_2.0.4         vroom_1.6.5            
[61] evaluate_1.0.1          lattice_0.22-6          highr_0.11             
[64] httpuv_1.6.15           bslib_0.8.0             Rcpp_1.0.13            
[67] SparseArray_1.6.2       whisker_0.4.1           xfun_0.48              
[70] fs_1.6.4                getPass_0.2-4           pkgconfig_2.0.3