Getting started with harmony

Last updated: 2025-06-08

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: df82da8

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version df82da8. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rproj.user/
    Ignored:    data/1M_neurons_filtered_gene_bc_matrices_h5.h5
    Ignored:    data/293t/
    Ignored:    data/293t_3t3_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/293t_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/5k_Human_Donor1_PBMC_3p_gem-x_5k_Human_Donor1_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor2_PBMC_3p_gem-x_5k_Human_Donor2_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor3_PBMC_3p_gem-x_5k_Human_Donor3_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/5k_Human_Donor4_PBMC_3p_gem-x_5k_Human_Donor4_PBMC_3p_gem-x_count_sample_filtered_feature_bc_matrix.h5
    Ignored:    data/97516b79-8d08-46a6-b329-5d0a25b0be98.h5ad
    Ignored:    data/Parent_SC3v3_Human_Glioblastoma_filtered_feature_bc_matrix.tar.gz
    Ignored:    data/brain_counts/
    Ignored:    data/cl.obo
    Ignored:    data/cl.owl
    Ignored:    data/jurkat/
    Ignored:    data/jurkat:293t_50:50_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/jurkat_293t/
    Ignored:    data/jurkat_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/pbmc20k/
    Ignored:    data/pbmc20k_seurat/
    Ignored:    data/pbmc3k.h5ad
    Ignored:    data/pbmc3k/
    Ignored:    data/pbmc3k_bpcells_mat/
    Ignored:    data/pbmc3k_export.mtx
    Ignored:    data/pbmc3k_matrix.mtx
    Ignored:    data/pbmc3k_seurat.rds
    Ignored:    data/pbmc4k_filtered_gene_bc_matrices.tar.gz
    Ignored:    data/pbmc_1k_v3_filtered_feature_bc_matrix.h5
    Ignored:    data/pbmc_1k_v3_raw_feature_bc_matrix.h5
    Ignored:    data/refdata-gex-GRCh38-2020-A.tar.gz
    Ignored:    data/seurat_1m_neuron.rds
    Ignored:    data/t_3k_filtered_gene_bc_matrices.tar.gz
    Ignored:    r_packages_4.4.1/
    Ignored:    r_packages_4.5.0/

Untracked files:
    Untracked:  Nothobranchius_furzeri.Nfu_20140520.113.gtf.gz
    Untracked:  analysis/bioc_scrnaseq.Rmd
    Untracked:  bpcells_matrix/
    Untracked:  data/GCF_043380555.1-RS_2024_12_gene_ontology.gaf.gz
    Untracked:  m3/
    Untracked:  pbmc3k_before_filtering.rds
    Untracked:  pbmc3k_save_rds.rds
    Untracked:  rsem.merged.gene_counts.tsv

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/harmony.Rmd) and HTML (docs/harmony.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	df82da8	Dave Tang	2025-06-08	Install glmGamPoi
html	d806d92	Dave Tang	2025-06-05	Build site.
Rmd	a0049f6	Dave Tang	2025-06-05	Return harmony object
html	66d57ad	Dave Tang	2025-06-05	Build site.
Rmd	18c0aed	Dave Tang	2025-06-05	Include background information
html	dbb1ef7	Dave Tang	2025-02-17	Build site.
Rmd	f6fdf6f	Dave Tang	2025-02-17	PBMCs from 4 donors
html	f55033c	Dave Tang	2025-02-17	Build site.
Rmd	abc7ef4	Dave Tang	2025-02-17	Increase memory for future
html	4db2bdf	Dave Tang	2024-04-15	Build site.
Rmd	9a4bcb8	Dave Tang	2024-04-15	Using 10x filtered matrices
html	a940a7e	Dave Tang	2024-04-15	Build site.
Rmd	03f3f3b	Dave Tang	2024-04-15	Quieter
html	b174605	Dave Tang	2024-04-15	Build site.
Rmd	6c16dce	Dave Tang	2024-04-15	Seurat SCTransform workflow with harmony
html	5466912	Dave Tang	2024-04-15	Build site.
Rmd	3ec3367	Dave Tang	2024-04-15	Seurat SCTransform workflow
html	c3d7314	Dave Tang	2024-04-15	Build site.
Rmd	708f1ab	Dave Tang	2024-04-15	Using harmony with Seurat
html	085be81	Dave Tang	2024-04-14	Build site.
Rmd	95fa6fd	Dave Tang	2024-04-14	Create some plots
html	9abb7b6	Dave Tang	2024-04-14	Build site.
Rmd	72ffea9	Dave Tang	2024-04-14	Getting started with harmony

Introduction

Harmony helps address the following problem:

… it is challenging to analyze multiple scRNA-seq datasets together, particularly when they are assayed with different technologies, because biological and technical differences are interspersed.

The algorithm works by:

… projecting cells into a shared embedding in which cells group by cell type rather than dataset-specific conditions. Harmony simultaneously accounts for multiple experimental and biological factors.

Read the manuscript: Fast, sensitive and accurate integration of single-cell data with Harmony.

Harmony algorithm

Carry out PCA (or something equivalent) to embed cells into a space with reduced dimensionality.
Harmony accepts the cell coordinates in this reduced space and runs an iterative algorithm to adjust for dataset specific effects. 2a. Harmony uses fuzzy (regular k-means) clustering to assign each cell to multiple clusters, while a penalty term ensures that the diversity of datasets within each cluster is maximized. (I.e., the penalty term tries to keep multiple datasets in the same cluster.) 2b. Harmony calculates a global centroid for each cluster, as well as dataset-specific centroids for each cluster. 2c. Within each cluster, Harmony calculates a correction factor for each dataset based on the centroids. 2d. Finally, Harmony corrects each cell with a cell-specific factor: a linear combination of dataset correction factors weighted by the cell’s soft cluster assignments made in step 2a.
Harmony repeats steps 2a to 2d until convergence. The dependence between cluster assignment and dataset diminishes with each round.

Harmony algorithm details

Detailed Walkthrough of Harmony Algorithm.

Since I’m getting the following errors when using RunHarmony():

Warning: Quick-TRANSfer stage steps exceeded maximum
did not converge in 25 iterations

I’m interested in the details of the initial clustering.

Initializing the Harmony object also triggered initialization of all the clustering data structures. Harmony currently uses regular kmeans, with 10 random restarts, to find initial locations for the cluster centroids.

The “Quick-TRANSfer stage steps exceeded maximum” warning is explained in the Details section of ?kmeans:

In rare cases, when some of the points (rows of x) are extremely close, the algorithm may not converge in the “Quick-Transfer” stage, signalling a warning (and returning ifault = 4). Slight rounding of the data may be advisable in that case.

From ?RunHarmony.default, here are what I assume the default arguments:

theta = NULL
sigma = 0.1
lambda = 1
nclust = NULL
max_iter = 10
early_stop = TRUE
ncores = 1
plot_convergence = FALSE
.options = harmony_options()

From ?harmony_options:

alpha = 0.2
tau = 0
block.size = 0.05
max.iter.cluster = 20
epsilon.cluster = 0.001
epsilon.harmony = 0.01

Dependencies

Make sure {glmGamPoi} is installed for much faster estimation.

installed.packages() |>
  row.names() -> installed_packages

if(!'glmGamPoi' %in% installed_packages){
  BiocManager::install('glmGamPoi')
}

Quickstart

Following the quickstart tutorial.

install.packages("harmony")

Load required packages.

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(patchwork))
suppressPackageStartupMessages(library(harmony))
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(hdf5r))

{harmony} version.

packageVersion("harmony")

[1] '1.2.3'

Data

We library normalized the cells, log transformed the counts, and scaled the genes. Then we performed PCA and kept the top 20 PCs. The PCA embeddings and meta data are available as part of this package.

data(cell_lines)
V <- cell_lines$scaled_pcs
meta_data <- cell_lines$meta_data

str(cell_lines)

List of 2
 $ meta_data : tibble [2,370 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ cell_id     : chr [1:2370] "half_GTACGAACCACCAA" "t293_AGGTCATGCACTTT" "half_ATAGTTGACTTCTA" "half_GAGCGGCTTGCTTT" ...
  ..$ dataset     : chr [1:2370] "half" "t293" "half" "half" ...
  ..$ nGene       : int [1:2370] 1508 4009 3545 2450 2388 3762 3792 4089 3374 3023 ...
  ..$ percent_mito: num [1:2370] 0.0148 0.0232 0.0153 0.017 0.0601 ...
  ..$ cell_type   : chr [1:2370] "jurkat" "t293" "jurkat" "jurkat" ...
  ..- attr(*, ".internal.selfref")=<externalptr> 
 $ scaled_pcs:Classes 'data.table' and 'data.frame':    2370 obs. of  20 variables:
  ..$ X1 : num [1:2370] 0.00281 -0.01167 0.00933 0.00634 0.00855 ...
  ..$ X2 : num [1:2370] -0.00145 0.000877 -0.006972 -0.002518 0.007087 ...
  ..$ X3 : num [1:2370] -0.00639 0.000897 -0.002599 -0.00439 -0.002254 ...
  ..$ X4 : num [1:2370] 0.000282 0.001324 0.001882 0.000274 0.001679 ...
  ..$ X5 : num [1:2370] 0.00144 -0.00329 -0.0038 -0.0025 0.00455 ...
  ..$ X6 : num [1:2370] 0.000752 0.001303 -0.000347 0.000435 0.0003 ...
  ..$ X7 : num [1:2370] -0.00283 -0.00198 -0.00157 0.00136 -0.0016 ...
  ..$ X8 : num [1:2370] -0.000653 0.001625 -0.003272 -0.00263 -0.000263 ...
  ..$ X9 : num [1:2370] 0.001411 -0.000913 -0.001031 -0.001876 0.001389 ...
  ..$ X10: num [1:2370] -0.000417 -0.000175 -0.001623 -0.000425 0.000391 ...
  ..$ X11: num [1:2370] 0.001652 -0.000034 0.001241 -0.000458 -0.001444 ...
  ..$ X12: num [1:2370] 6.71e-05 3.76e-04 -7.61e-04 -6.52e-04 -2.44e-03 ...
  ..$ X13: num [1:2370] 0.000542 0.000219 -0.001502 -0.002067 -0.000907 ...
  ..$ X14: num [1:2370] 0.001223 0.001688 -0.000279 -0.000927 -0.000135 ...
  ..$ X15: num [1:2370] 0.002081 0.000386 -0.001141 0.001114 0.001015 ...
  ..$ X16: num [1:2370] 1.87e-03 -1.50e-03 5.99e-04 -1.98e-05 -1.25e-03 ...
  ..$ X17: num [1:2370] 0.000429 0.000259 0.001224 -0.001069 -0.001165 ...
  ..$ X18: num [1:2370] 0.00115 -0.00106 0.00145 0.00028 0.00111 ...
  ..$ X19: num [1:2370] -1.09e-03 4.11e-04 7.41e-05 9.33e-04 -1.76e-04 ...
  ..$ X20: num [1:2370] 0.000265 -0.00171 -0.000662 0.000365 0.000477 ...
  ..- attr(*, ".internal.selfref")=<externalptr>

Table of cell types.

table(cell_lines$meta_data$cell_type)


jurkat   t293 
  1266   1104

Table of datasets.

table(cell_lines$meta_data$dataset)


  half jurkat   t293 
   846    824    700

Analysis

Initially, the cells cluster by both dataset (left) and cell type (right). The quickstart guide uses the do_scatter() function, which is missing. We can simply plot the first two PCs using {ggplot2}.

Plot PC1 versus PC2.

my_df <- data.frame(PC1 = V$X1, PC2 = V$X2, dataset = meta_data$dataset, cell_type = meta_data$cell_type)

ggplot(my_df, aes(PC1, PC2, colour = dataset)) +
  geom_point() +
  theme_minimal() +
  ggtitle("Before harmony") -> p1

ggplot(my_df, aes(PC1, PC2, colour = cell_type)) +
  geom_point() +
  theme_minimal() -> p2

p1 + p2

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
f55033c	Dave Tang	2025-02-17
085be81	Dave Tang	2024-04-14

Let’s run Harmony to remove the influence of dataset-of-origin from the cell embeddings.

harmony_embeddings <- harmony::RunHarmony(
    V, meta_data, 'dataset', verbose=FALSE
)

my_df2 <- data.frame(PC1 = harmony_embeddings[, 1], PC2 = harmony_embeddings[, 2], dataset = meta_data$dataset, cell_type = meta_data$cell_type)

ggplot(my_df2, aes(PC1, PC2, colour = dataset)) +
  geom_point() +
  theme_minimal() +
  ggtitle("After harmony") -> p1

ggplot(my_df2, aes(PC1, PC2, colour = cell_type)) +
  geom_point() +
  theme_minimal() -> p2

p1 + p2

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
f55033c	Dave Tang	2025-02-17
c3d7314	Dave Tang	2024-04-15

Check out the harmony object.

harmonyObj <- harmony::RunHarmony(
    V, meta_data, 'dataset', verbose=FALSE, return_object = TRUE
)

str(harmonyObj, max.level = 2)

Reference class 'Rcpp_harmony' [package "harmony"] with 27 fields
 $ B                       : num 3
 $ B_vec                   : int 3
 $ E                       : num [1:79, 1:3] 7.14 10.02 12.6 9.51 12.29 ...
 $ K                       : num 79
 $ N                       : num 2370
 $ O                       : num [1:79, 1:3] 4.95 9.56 13.01 9.42 12.53 ...
 $ Phi                     :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
 $ Phi_moe                 :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
 $ Pr_b                    : num [1:3, 1] 0.357 0.348 0.295
 $ R                       : num [1:79, 1:2370] 8.35e-05 1.09e-04 1.06e-13 1.16e-11 6.28e-12 ...
 $ W                       : num [1:4, 1:20] 0.00 1.53e-04 8.74e-09 -1.53e-04 0.00 ...
 $ Y                       : num [1:20, 1:79] 0.5474 0.0345 -0.2684 -0.2462 0.1879 ...
 $ Z_corr                  : num [1:20, 1:2370] 0.004308 0.002214 -0.004278 0.000472 0.00155 ...
 $ Z_cos                   : num [1:20, 1:2370] 0.5154 0.2649 -0.5118 0.0565 0.1855 ...
 $ Z_orig                  : num [1:20, 1:2370] 0.002807 -0.00145 -0.00639 0.000282 0.001444 ...
 $ alpha                   : num 0.2
 $ d                       : num 20
 $ kmeans_rounds           : int [1:5] 10 7 6 5 5
 $ lambda                  : num [1:4, 1] 0 1 1 1
 $ max_iter_kmeans         : num 20
 $ objective_harmony       : num [1:6] 590 513 460 440 434 ...
 $ objective_kmeans        : num [1:34] 590 537 524 520 517 ...
 $ objective_kmeans_cross  : num [1:34] 475 454 453 453 453 ...
 $ objective_kmeans_dist   : num [1:34] 508 547 560 568 574 ...
 $ objective_kmeans_entropy: num [1:34] -393 -464 -489 -501 -509 ...
 $ sigma                   : num [1:79, 1] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
 $ theta                   : num [1:3, 1] 2 2 2
 and 23 methods, of which 9 are  possibly relevant:
   check_convergence, cluster_cpp, compute_objective, finalize,
   init_cluster_cpp, initialize, moe_correct_ridge_cpp,
   moe_ridge_get_betas_cpp, setup

Using harmony with Seurat

Following the Using harmony with Seurat tutorial, which describes how to use harmony in Seurat v5 single-cell analysis workflows. Also, it will provide some basic downstream analyses demonstrating the properties of harmonised cell embeddings and a brief explanation of the exposed algorithm parameters.

Data

For this demo, we will be aligning two groups of PBMCs Kang et al., 2017:

Control PBMCs
Stimulated PBMCs treated with interferon beta.

data("pbmc_stim")
str(pbmc.ctrl)

Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:558407] 1 3 8 19 33 40 51 53 89 100 ...
  ..@ p       : int [1:1001] 0 812 1248 1692 2463 2810 3314 4127 4660 5229 ...
  ..@ Dim     : int [1:2] 9015 1000
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:9015] "LINC00115" "NOC2L" "HES4" "ISG15" ...
  .. ..$ : chr [1:1000] "TCGCAAGAGCGATT-1" "GCAAACTGAACTGC-1" "TGATAAACTGGTAC-1" "GTTAAATGACTGTG-1" ...
  ..@ x       : num [1:558407] 1 1 1 1 1 2 1 3 1 2 ...
  ..@ factors : list()

str(pbmc.stim)

Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:571399] 3 33 53 118 128 138 144 154 171 208 ...
  ..@ p       : int [1:1001] 0 425 505 852 1403 2010 2325 2859 3435 3955 ...
  ..@ Dim     : int [1:2] 9015 1000
  ..@ Dimnames:List of 2
  .. ..$ : chr [1:9015] "LINC00115" "NOC2L" "HES4" "ISG15" ...
  .. ..$ : chr [1:1000] "ATCACTTGCTCGAA-1" "CCGGAGACTGTGAC-1" "CAAGCCCTGTTAGC-1" "GAGGTACTAACGGG-1" ...
  ..@ x       : num [1:571399] 1 2 3 1 8 1 1 1 3 1 ...
  ..@ factors : list()

The full dataset used for this vignette have been upload to Zenodo but currently does not work with newer versions of R.

Create Seurat object

Create a Seurat object with treatment conditions in the metadata.

pbmc <- CreateSeuratObject(
  counts = cbind(pbmc.stim, pbmc.ctrl),
  project = "Kang",
  min.cells = 5
)

pbmc@meta.data$stim <- c(rep("STIM", ncol(pbmc.stim)), rep("CTRL", ncol(pbmc.ctrl)))
pbmc

An object of class Seurat 
9015 features across 2000 samples within 1 assay 
Active assay: RNA (9015 features, 0 variable features)
 1 layer present: counts

Seurat SCTransform workflow

Using sctransform in Seurat.

pbmc_sct <- SCTransform(pbmc) |>
  RunPCA() |>
  FindNeighbors() |>
  RunUMAP(dims = 1:20) |>
  FindClusters()

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 2000
Number of edges: 64353

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8843
Number of communities: 15
Elapsed time: 0 seconds

DimPlot(pbmc_sct, reduction = "umap", group.by = "stim", pt.size = .1) + ggtitle("Seurat SCTransform workflow")

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
dbb1ef7	Dave Tang	2025-02-17
f55033c	Dave Tang	2025-02-17
5466912	Dave Tang	2024-04-15

Seurat workflow with harmony

Harmony works on an existing matrix with cell embeddings and outputs its transformed version with the datasets aligned according to some user-defined experimental conditions. By default, harmony will look up the pca cell embeddings and use these to run harmony. Therefore, it assumes that the Seurat object has these embeddings already precomputed.

We will run the Seurat workflow to generate the embeddings.

Here, using Seurat::NormalizeData(), we will be generating a union of highly variable genes using each condition (the control and stimulated cells). These features are going to be subsequently used to generate the 20 PCs with Seurat::RunPCA().

Note that the defaults for NormalizeData are:

normalization.method = “LogNormalize”
scale.factor = 10000

pbmc <- NormalizeData(pbmc, verbose = FALSE)

pbmc <- FindVariableFeatures(object = pbmc, selection.method = "vst", nfeatures = 2000)

Finding variable features for layer counts

cell_by_cond <- split(row.names(pbmc@meta.data), pbmc@meta.data$stim)

vfeatures <- lapply(cell_by_cond, function(cells){
  FindVariableFeatures(object = pbmc[, cells], selection.method = "vst", nfeatures = 2000) |>
    VariableFeatures()
})

Finding variable features for layer counts
Finding variable features for layer counts

VariableFeatures(pbmc) <- unique(unlist(vfeatures))
length(VariableFeatures(pbmc))

[1] 3236

Scale and perform PCA.

pbmc <- ScaleData(pbmc, verbose = FALSE) |>
  RunPCA(features = VariableFeatures(pbmc), npcs = 20, verbose = FALSE)

RunHarmony() is a generic function designed to interact with Seurat objects. To run harmony on a Seurat object after it has been normalised, only one argument needs to be specified which contains the batch covariate located in the metadata. For this vignette, further parameters are specified to align the dataset but the minimum parameters are shown in the snippet below and is not run.

pbmc <- RunHarmony(pbmc, "stim")

Here, we will be running harmony with some indicative parameters and plotting the convergence plot to illustrate some of the under the hood functionality. By setting plot_converge=TRUE, harmony will generate a plot with its objective showing the flow of the integration. Each point represents the cost measured after a clustering round. Different colors represent different Harmony iterations which is controlled by max_iter (assuming that early_stop=FALSE). Here max_iter=10 and up to 10 correction steps are expected. However, early_stop=TRUE so harmony will stop after the cost plateaus.

pbmc <- RunHarmony(pbmc, "stim", plot_convergence = TRUE, nclust = 50, max_iter = 10, early_stop = TRUE)

Transposing data matrix

Initializing state using k-means centroids initialization

Harmony 1/10

Harmony 2/10

Harmony 3/10

Harmony 4/10

Harmony 5/10

Harmony converged after 5 iterations

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
f55033c	Dave Tang	2025-02-17
c3d7314	Dave Tang	2024-04-15

RunHarmony has several parameters accessible to users which are outlined below.

object (required) - The Seurat object. This vignette assumes Seurat objects are version 5.
group.by.vars (required) - A character vector that specifies all the experimental covariates to be corrected/harmonized by the algorithm.

When using RunHarmony() with Seurat, harmony will look up the group.by.vars metadata fields in the Seurat Object metadata. For example, given the pbmc[["stim"]] exists as the stim condition, setting group.by.vars="stim" will perform integration of these samples accordingly. If you want to integrate on another variable, it needs to be present in Seurat object’s meta.data. To correct for several covariates, specify them in a vector: group.by.vars = c("stim", "new_covariate").

reduction.use - The cell embeddings to be used for the batch alignment. This parameter assumes that a reduced dimension already exists in the reduction slot of the Seurat object. By default, the pca reduction is used.
dims.use - Optional parameter which can use a name vector to select specific dimensions to be harmonised.
nclust - is a positive integer. Under the hood, harmony applies k-means soft-clustering. For this task, k needs to be determined. nclust corresponds to k. The harmonisation results and performance are not particularly sensitive for a reasonable range of this parameter value. If this parameter is not set, harmony will autodetermine this based on the dataset size with a maximum cap of 200. For dataset with a vast amount of different cell types and batches this pamameter may need to be determined manually.
sigma - a positive scalar that controls the soft clustering probability assignment of single-cells to different clusters. Larger values will assign a larger probability to distant clusters of cells resulting in a different correction profile. Single-cells are assigned to clusters by their euclidean distance $d$ to some cluster center $Y$ after cosine normalisation which is defined in the range [0,4]. The clustering probability of each cell is calculated as $e^{-\frac{d}{\sigma}}$ where $\sigma$ is controlled by the sigma parameter. Default value of sigma is 0.1 and it generally works well since it defines probability assignment of a cell in the range $[e^{-40}, e^0]$. Larger values of sigma restrict the dynamic range of probabilities that can be assigned to cells. For example, sigma=1 will yield a probabilities in the range of $[e^{-4}, e^0]$.
theta - theta is a positive scalar vector that determines the coefficient of harmony’s diversity penalty for each corrected experimental covariate. In challenging experimental conditions, increasing theta may result in better integration results. Theta is an expontential parameter of the diversity penalty, thus setting theta=0 disables this penalty while increasing it to greater values than 1 will perform more aggressive corrections in an expontential manner. By default, it will set theta=2 for each experimental covariate.
max_iter - The number of correction steps harmony will perform before completing the data set integration. In general, more iterations than necessary increases computational runtime especially which becomes evident in bigger datasets. Setting early_stop=TRUE may reduce the actual number of correction steps which will be smaller than max_iter.
early_stop - Under the hood, harmony minimizes its objective function through a series of clustering and integration tests. By setting early_stop=TRUE, when the objective function is less than 1e-4 after a correction step harmony exits before reaching the max_iter correction steps. This parameter can drastically reduce run-time in bigger datasets.
.options - A set of internal algorithm parameters that can be overriden. For advanced users only.

These parameters are Seurat-specific and do not affect the flow of the algorithm.

project_dim - Toggle-like parameter, by default project_dim=TRUE. When enabled, RunHarmony() calculates genomic feature loadings using Seurat’s ProjectDim() that correspond to the harmonized cell embeddings.
reduction.save - The new Reduced Dimension slot identifier. By default, reduction.save=TRUE. This option allows several independent runs of harmony to be retained in the appropriate slots in the SeuratObjects. It is useful if you want to try Harmony with multiple parameters and save them as e.g. ‘harmony_theta0’, ‘harmony_theta1’, ‘harmony_theta2’.

Miscellaneous parameters

These parameters help users troubleshoot harmony.

plot_convergence - Option that plots the convergence plot after the execution of the algorithm. By default FALSE. Setting it to TRUE will collect harmony’s objective value and plot it allowing the user to troubleshoot the flow of the algorithm and fine-tune the parameters of the dataset integration procedure.

Results

RunHarmony() returns the Seurat object which contains the harmonised cell embeddings in a slot named harmony. This entry can be accessed via pbmc@reductions$harmony. To access the values of the cell embeddings we can also use Embeddings.

harmony.embeddings <- Embeddings(pbmc, reduction = "harmony")
head(harmony.embeddings)

                 harmony_1   harmony_2   harmony_3 harmony_4    harmony_5
ATCACTTGCTCGAA-1 -6.480663  0.00952975 -3.37852867 -4.097951 -0.006347539
CCGGAGACTGTGAC-1 -6.899292  1.58604432 -6.18220626 -4.582268  5.405759241
CAAGCCCTGTTAGC-1 -6.779873 -1.16664039 -6.51093771  9.135969  0.218211823
GAGGTACTAACGGG-1 -7.462947 -0.21612621  0.05653332 -2.053192  2.198580034
CGCGGATGCCACAA-1 14.721144  4.51693614 -4.50692722 -1.592197  1.435251617
ACATGGTGCCTAAG-1 -7.611455 -0.15176634 -3.56837585 -2.897915  1.767036544
                   harmony_6  harmony_7  harmony_8  harmony_9   harmony_10
ATCACTTGCTCGAA-1  0.50611573  0.7028111 -1.3731536 -0.6795988  0.892197758
CCGGAGACTGTGAC-1 -2.61350881 -3.7921039 -4.4606791 -2.4551098  2.114050295
CAAGCCCTGTTAGC-1 -0.02887235 -1.6370781 -1.1577088 -1.3786499 -0.163219451
GAGGTACTAACGGG-1 -0.76738718  4.0403344  0.9611097 -2.6146619 -2.678083320
CGCGGATGCCACAA-1 -0.46887861 -0.7908545  1.6754028  2.1588659  2.176845715
ACATGGTGCCTAAG-1 -0.81507202 -0.1678781 -0.6548181  0.7715909  0.004886104
                 harmony_11 harmony_12 harmony_13  harmony_14 harmony_15
ATCACTTGCTCGAA-1 -0.4515260 -1.4668975 -0.4626245  0.22250854 -0.2398488
CCGGAGACTGTGAC-1  2.7404863 -3.3613318  0.8675147 -4.06737595 -4.4371434
CAAGCCCTGTTAGC-1 -0.6102676 -1.3375097  0.3194690  1.11547059  0.5792846
GAGGTACTAACGGG-1  0.9525904 -3.6856542 -0.3459815  0.65236151  1.2870529
CGCGGATGCCACAA-1  0.1827323 -0.1453718  4.3071518 -3.44419047  0.7401925
ACATGGTGCCTAAG-1  0.2992616 -2.0873166 -0.5291568  0.02118662 -0.6469339
                 harmony_16  harmony_17 harmony_18 harmony_19  harmony_20
ATCACTTGCTCGAA-1   0.652587 -0.62753433  2.3403474  0.4976432  0.03191453
CCGGAGACTGTGAC-1   3.508959 -9.64908448 -1.3600007 -1.4332550  4.14694537
CAAGCCCTGTTAGC-1   3.926326  2.35297117 -0.4837548 -1.2182239 -0.00394190
GAGGTACTAACGGG-1  -2.470791  0.10259285  0.2715829 -1.0776178 -3.49092397
CGCGGATGCCACAA-1  -1.093456 -0.74090071  2.9671890 -4.7123912  1.06377161
ACATGGTGCCTAAG-1  -0.922144 -0.06599774 -0.6820571  0.4381916 -0.30253665

After Harmony integration, we should inspect the quality of the harmonisation and contrast it with the unharmonised algorithm input. Ideally, cells from different conditions will align along the Harmonized PCs. If they are not, you could increase the theta value above to force a more aggressive fit of the dataset and rerun the workflow.

p1 <- DimPlot(object = pbmc, reduction = "harmony", pt.size = .1, group.by = "stim")
p2 <- VlnPlot(object = pbmc, features = "harmony_1", group.by = "stim",  pt.size = .1)

Warning: The `slot` argument of `FetchData()` is deprecated as of SeuratObject 5.0.0.
ℹ Please use the `layer` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
  Please report the issue at <https://github.com/satijalab/seurat/issues>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

Warning: `PackageCheck()` was deprecated in SeuratObject 5.0.0.
ℹ Please use `rlang::check_installed()` instead.
ℹ The deprecated feature was likely used in the Seurat package.
  Please report the issue at <https://github.com/satijalab/seurat/issues>.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

p1 + p2

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
f55033c	Dave Tang	2025-02-17
c3d7314	Dave Tang	2024-04-15

Plot Genes correlated with the Harmonized PCs

DimHeatmap(object = pbmc, reduction = "harmony", cells = 500, dims = 1:3)

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
f55033c	Dave Tang	2025-02-17
c3d7314	Dave Tang	2024-04-15

The harmonised cell embeddings generated by harmony can be used for further integrated analyses. In this workflow, the Seurat object contains the harmony reduction modality name in the method that requires it.

Perform clustering using the harmonized vectors of cells

pbmc <- FindNeighbors(pbmc, reduction = "harmony") |>
  FindClusters(resolution = 0.5)

Computing nearest neighbor graph

Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 2000
Number of edges: 71968

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8721
Number of communities: 9
Elapsed time: 0 seconds

TSNE visualisation of harmony embeddings.

pbmc <- RunTSNE(pbmc, reduction = "harmony")

p1 <- DimPlot(pbmc, reduction = "tsne", group.by = "stim", pt.size = .1) + ggtitle("tSNE using harmony")
p2 <- DimPlot(pbmc, reduction = "tsne", label = TRUE, pt.size = .1) + ggtitle("Cell clusters")
p1 + p2

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
f55033c	Dave Tang	2025-02-17
a940a7e	Dave Tang	2024-04-15
c3d7314	Dave Tang	2024-04-15

One important observation is to assess that the harmonised data contain biological states of the cells. Therefore by checking the following genes we can see that biological cell states are preserved after harmonisation.

Expression of gene panel heatmap in the harmonized PBMC dataset.

FeaturePlot(
  object = pbmc,
  features= c("CD3D", "SELL", "CREM", "CD8A", "GNLY", "CD79A", "FCGR3A", "CCL2", "PPBP"), 
  min.cutoff = "q9",
  cols = c("lightgrey", "blue"),
  pt.size = 0.5
)

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
f55033c	Dave Tang	2025-02-17
c3d7314	Dave Tang	2024-04-15

Similar to TSNE, we can run UMAP by passing the harmony reduction in the function.

pbmc <- RunUMAP(pbmc, reduction = "harmony",  dims = 1:20)

p1 <- DimPlot(pbmc, reduction = "umap", group.by = "stim", pt.size = .1)
p2 <- DimPlot(pbmc, reduction = "umap", label = TRUE,  pt.size = .1)
p1 + p2

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
f55033c	Dave Tang	2025-02-17
c3d7314	Dave Tang	2024-04-15

Seurat SCTransform workflow with harmony

Use SCTransform() instead of NormalizeData(), ScaleData(), and FindVariableFeatures().

pbmc <- CreateSeuratObject(
  counts = cbind(pbmc.stim, pbmc.ctrl),
  project = "Kang",
  min.cells = 5
)

pbmc@meta.data$stim <- c(rep("STIM", ncol(pbmc.stim)), rep("CTRL", ncol(pbmc.ctrl)))

pbmc <- SCTransform(pbmc) |>
  RunPCA(npcs = 20, verbose = FALSE)

pbmc <- RunHarmony(pbmc, "stim")

pbmc <- RunUMAP(pbmc, reduction = "harmony",  dims = 1:20)

DimPlot(pbmc_sct, reduction = "umap", group.by = "stim", pt.size = .1) + ggtitle("SCTransform without harmony") -> p1
DimPlot(pbmc, reduction = "umap", group.by = "stim", pt.size = .1) + ggtitle("SCTransform with harmony") -> p2

p1 + p2

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
dbb1ef7	Dave Tang	2025-02-17
f55033c	Dave Tang	2025-02-17
b174605	Dave Tang	2024-04-15

From 10x Genomics filtered matrices

Mix of Jurkat and 293t

Downloaded Jurkat, 293t, and mixture cells.

jurkat <- Seurat::Read10X("data/jurkat/filtered_matrices_mex/hg19/")
t293 <- Seurat::Read10X("data/293t/filtered_matrices_mex/hg19/")
half <- Seurat::Read10X("data/jurkat_293t/filtered_matrices_mex/hg19/")

colnames(jurkat) <- paste0('jurkat_', colnames(jurkat))
colnames(t293) <- paste0('t293_', colnames(t293))
colnames(half) <- paste0('half_', colnames(half))

seurat_obj <- CreateSeuratObject(
  counts = cbind(jurkat, t293, half),
  project = "Mix",
  min.cells = 5
)

seurat_obj@meta.data$dataset <- c(
  rep("jurkat", ncol(jurkat)),
  rep("t293", ncol(t293)),
  rep("half", ncol(half))
)

options(future.globals.maxSize = 1.5 * 1024^3)

seurat_obj <- SCTransform(seurat_obj) |>
  RunPCA(npcs = 20, verbose = FALSE)

seurat_obj <- RunUMAP(seurat_obj, dims = 1:20)

DimPlot(seurat_obj, reduction = "umap", group.by = "dataset", pt.size = .1) + ggtitle("Without harmony")

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
dbb1ef7	Dave Tang	2025-02-17
f55033c	Dave Tang	2025-02-17
4db2bdf	Dave Tang	2024-04-15

With harmony.

seurat_obj_harmony <- RunHarmony(seurat_obj, "dataset")
seurat_obj_harmony <- RunUMAP(seurat_obj_harmony, reduction = "harmony",  dims = 1:20)

DimPlot(seurat_obj_harmony, reduction = "umap", group.by = "dataset", pt.size = .1) + ggtitle("With harmony")

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
dbb1ef7	Dave Tang	2025-02-17
f55033c	Dave Tang	2025-02-17
4db2bdf	Dave Tang	2024-04-15

Plot XIST.

Human embryonic kidney 293 cells, also often referred to as HEK 293, HEK-293, 293 cells, are an immortalised cell line derived from HEK cells isolated from a female fetus in the 1970s.
The Jurkat cell line (originally called JM) was established in the mid-1970s from the peripheral blood of a 14-year-old boy with T cell leukemia.

FeaturePlot(
  object = seurat_obj_harmony,
  features= "XIST",
  cols = c("lightgrey", "blue"),
  pt.size = 0.5
)

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
dbb1ef7	Dave Tang	2025-02-17
f55033c	Dave Tang	2025-02-17
4db2bdf	Dave Tang	2024-04-15

PBMCs from 4 donors

Read HDF5 files into a list.

hdf5_files <- list.files(path = "data", pattern = "5k_Human", full.names = TRUE)

mats <- purrr::map(seq_along(hdf5_files), function(x){
  my_mat <- Seurat::Read10X_h5(hdf5_files[x])
  colnames(my_mat) <- paste0('donor', x, '_', colnames(my_mat))
  my_mat
})
str(mats, max.level = 1)

List of 4
 $ :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
 $ :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
 $ :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
 $ :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots

Create Seurat object using a list of matrices.

pbmc20k <- CreateSeuratObject(
  counts = mats,
  min.cells = 3,
  min.features = 200
)
pbmc20k

An object of class Seurat 
27385 features across 22061 samples within 1 assay 
Active assay: RNA (27385 features, 0 variable features)
 4 layers present: counts.1, counts.2, counts.3, counts.4

Create one count layer.

pbmc20k <- JoinLayers(pbmc20k)
pbmc20k

An object of class Seurat 
27385 features across 22061 samples within 1 assay 
Active assay: RNA (27385 features, 0 variable features)
 1 layer present: counts

Without Harmony.

options(future.globals.maxSize = 2 * 1024^3)

pbmc20k <- SCTransform(pbmc20k) |>
  RunPCA(npcs = 20, verbose = FALSE)

Running SCTransform on assay: RNA

vst.flavor='v2' set. Using model with fixed slope and excluding poisson genes.

Calculating cell attributes from input UMI matrix: log_umi

Variance stabilizing transformation of count matrix of size 25954 by 22061

Model formula is y ~ log_umi

Get Negative Binomial regression parameters per gene

Using 2000 genes, 5000 cells

Found 156 outliers - those will be ignored in fitting/regularization step

Second step: Get residuals using fitted parameters for 25954 genes

Computing corrected count matrix for 25954 genes

Calculating gene attributes

Wall clock passed: Time difference of 1.226104 mins

Determine variable features

Centering data matrix

Place corrected count matrix in counts slot

Set default assay to SCT

pbmc20k <- RunUMAP(pbmc20k, dims = 1:20, verbose = FALSE)

DimPlot(pbmc20k, reduction = "umap", group.by = "orig.ident", pt.size = .1) + ggtitle("Without harmony")

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
dbb1ef7	Dave Tang	2025-02-17

With harmony.

pbmc20k_harmony <- RunHarmony(pbmc20k, "orig.ident")
pbmc20k_harmony <- RunUMAP(pbmc20k_harmony, reduction = "harmony",  dims = 1:20, verbose = FALSE)

DimPlot(pbmc20k_harmony, reduction = "umap", group.by = "orig.ident", pt.size = .1) + ggtitle("With harmony")

Version	Author	Date
66d57ad	Dave Tang	2025-06-05
dbb1ef7	Dave Tang	2025-02-17

sessionInfo()

R version 4.5.0 (2025-04-11)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 24.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] future_1.58.0      hdf5r_1.3.12       Seurat_5.3.0       SeuratObject_5.1.0
 [5] sp_2.2-0           harmony_1.2.3      Rcpp_1.0.14        patchwork_1.3.0   
 [9] lubridate_1.9.4    forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4       
[13] purrr_1.0.4        readr_2.1.5        tidyr_1.3.1        tibble_3.2.1      
[17] ggplot2_3.5.2      tidyverse_2.0.0    workflowr_1.7.1   

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3          rstudioapi_0.17.1          
  [3] jsonlite_2.0.0              magrittr_2.0.3             
  [5] spatstat.utils_3.1-4        farver_2.1.2               
  [7] rmarkdown_2.29              fs_1.6.6                   
  [9] vctrs_0.6.5                 ROCR_1.0-11                
 [11] DelayedMatrixStats_1.30.0   spatstat.explore_3.4-3     
 [13] S4Arrays_1.8.1              htmltools_0.5.8.1          
 [15] SparseArray_1.8.0           sass_0.4.10                
 [17] sctransform_0.4.2           parallelly_1.45.0          
 [19] KernSmooth_2.23-26          bslib_0.9.0                
 [21] htmlwidgets_1.6.4           ica_1.0-3                  
 [23] plyr_1.8.9                  plotly_4.10.4              
 [25] zoo_1.8-14                  cachem_1.1.0               
 [27] whisker_0.4.1               igraph_2.1.4               
 [29] mime_0.13                   lifecycle_1.0.4            
 [31] pkgconfig_2.0.3             Matrix_1.7-3               
 [33] R6_2.6.1                    fastmap_1.2.0              
 [35] GenomeInfoDbData_1.2.14     MatrixGenerics_1.20.0      
 [37] fitdistrplus_1.2-2          shiny_1.10.0               
 [39] digest_0.6.37               S4Vectors_0.46.0           
 [41] ps_1.9.1                    rprojroot_2.0.4            
 [43] tensor_1.5                  RSpectra_0.16-2            
 [45] irlba_2.3.5.1               GenomicRanges_1.60.0       
 [47] beachmat_2.24.0             labeling_0.4.3             
 [49] progressr_0.15.1            spatstat.sparse_3.1-0      
 [51] timechange_0.3.0            httr_1.4.7                 
 [53] polyclip_1.10-7             abind_1.4-8                
 [55] compiler_4.5.0              bit64_4.6.0-1              
 [57] withr_3.0.2                 fastDummies_1.7.5          
 [59] R.utils_2.13.0              MASS_7.3-65                
 [61] DelayedArray_0.34.1         tools_4.5.0                
 [63] lmtest_0.9-40               httpuv_1.6.16              
 [65] future.apply_1.20.0         goftest_1.2-3              
 [67] R.oo_1.27.1                 glmGamPoi_1.20.0           
 [69] glue_1.8.0                  callr_3.7.6                
 [71] nlme_3.1-168                promises_1.3.3             
 [73] grid_4.5.0                  Rtsne_0.17                 
 [75] getPass_0.2-4               cluster_2.1.8.1            
 [77] reshape2_1.4.4              generics_0.1.4             
 [79] gtable_0.3.6                spatstat.data_3.1-6        
 [81] tzdb_0.5.0                  R.methodsS3_1.8.2          
 [83] data.table_1.17.4           hms_1.1.3                  
 [85] XVector_0.48.0              BiocGenerics_0.54.0        
 [87] spatstat.geom_3.4-1         RcppAnnoy_0.0.22           
 [89] ggrepel_0.9.6               RANN_2.6.2                 
 [91] pillar_1.10.2               spam_2.11-1                
 [93] RcppHNSW_0.6.0              later_1.4.2                
 [95] splines_4.5.0               lattice_0.22-6             
 [97] bit_4.6.0                   survival_3.8-3             
 [99] deldir_2.0-4                tidyselect_1.2.1           
[101] miniUI_0.1.2                pbapply_1.7-2              
[103] knitr_1.50                  git2r_0.36.2               
[105] gridExtra_2.3               IRanges_2.42.0             
[107] SummarizedExperiment_1.38.1 scattermore_1.2            
[109] stats4_4.5.0                RhpcBLASctl_0.23-42        
[111] xfun_0.52                   Biobase_2.68.0             
[113] matrixStats_1.5.0           UCSC.utils_1.4.0           
[115] stringi_1.8.7               lazyeval_0.2.2             
[117] yaml_2.3.10                 evaluate_1.0.3             
[119] codetools_0.2-20            cli_3.6.5                  
[121] uwot_0.2.3                  xtable_1.8-4               
[123] reticulate_1.42.0           processx_3.8.6             
[125] jquerylib_0.1.4             GenomeInfoDb_1.44.0        
[127] globals_0.18.0              spatstat.random_3.4-1      
[129] png_0.1-8                   spatstat.univar_3.1-3      
[131] parallel_4.5.0              dotCall64_1.2              
[133] sparseMatrixStats_1.20.0    listenv_0.9.1              
[135] viridisLite_0.4.2           scales_1.4.0               
[137] ggridges_0.5.6              crayon_1.5.3               
[139] rlang_1.1.6                 cowplot_1.1.3