This post follows the documentation for the R package {cellxgene.census}, which is part of CZ CELLxGENE Discover Census.

{cellxgene.census} provides an API to efficiently access the cloud-hosted Census single-cell data from R. In just a few seconds users can access any slice of Census data using cell or gene filters across hundreds of single-cell datasets.

Census data can be fetched in an iterative fashion for bigger-than-memory slices of data, or quickly exported to basic R structures, as well as {Seurat} or {SingleCellExperiment} objects for downstream analysis.
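For slices too large to hold in memory, the iterative route can be sketched as a loop over chunks. The helper below is hypothetical (count_rows_in_chunks is not part of either package). It assumes the reader returned by $read() exposes read_complete() and read_next(), which is an assumption about the {tiledbsoma} read iterators rather than something shown in this post, and it is exercised with a small stand-in reader so that it runs without a network connection.

```r
# Hypothetical helper: walk a chunked reader and total the rows seen,
# holding only one chunk in memory at a time.
# Assumption: `reader` exposes read_complete() and read_next().
count_rows_in_chunks <- function(reader) {
  total <- 0
  while (!reader$read_complete()) {
    chunk <- reader$read_next()   # one chunk of rows
    total <- total + nrow(chunk)  # replace with any per-chunk computation
  }
  total
}

# Stand-in reader that serves a local data frame in fixed-size chunks
make_toy_reader <- function(df, chunk_size) {
  position <- 0
  list(
    read_complete = function() position >= nrow(df),
    read_next = function() {
      rows <- seq(position + 1, min(position + chunk_size, nrow(df)))
      position <<- max(rows)  # remember how far we have read
      df[rows, , drop = FALSE]
    }
  )
}

toy_reader <- make_toy_reader(data.frame(cell = 1:10), chunk_size = 4)
total_rows <- count_rows_in_chunks(toy_reader)
print(total_rows)
```

With a real Census handle, the reader would presumably come from a call such as census$get("census_data")$get("homo_sapiens")$obs$read(column_names = "sex") in place of toy_reader.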

Install from the R-universe platform. On Ubuntu/Debian, you may first need to install the following libraries via APT:

  • libxml2-dev
  • libssl-dev
  • libcurl4-openssl-dev

In addition, you need cmake v3.21 or later. Install the {tiledbsoma} dependency first: it takes some time to compile and install, so it is better for any failure to surface early.
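Before starting a long compile, it can be worth confirming the cmake requirement from within R. This is a minimal sketch: cmake_meets_minimum is a hypothetical helper, and it assumes the first line of `cmake --version` contains a dotted version number.

```r
# Hypothetical helper: parse a line such as "cmake version 3.22.1"
# and compare it against the minimum required version (3.21).
cmake_meets_minimum <- function(version_line, minimum = "3.21") {
  # Extract the dotted version number, e.g. "3.22.1"
  version_string <- regmatches(
    version_line,
    regexpr("[0-9]+(\\.[0-9]+)+", version_line)
  )
  if (length(version_string) == 0) {
    return(FALSE)  # no version number found in the text
  }
  numeric_version(version_string) >= numeric_version(minimum)
}

# Query the locally installed cmake, if there is one on the PATH
if (nzchar(Sys.which("cmake"))) {
  first_line <- system2("cmake", "--version", stdout = TRUE)[1]
  message(first_line, " -> meets minimum: ", cmake_meets_minimum(first_line))
} else {
  message("cmake was not found on the PATH")
}
```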

install.packages('tiledbsoma', repos = c('', ''))

Then install {cellxgene.census}, which should be quick now that {tiledbsoma} is in place.

install.packages(
  'cellxgene.census',
  repos = c('', '')
)

Now install {Seurat}.

install.packages('Seurat')

Querying the metadata

Querying and fetching the single-cell data and cell/gene metadata

[1] '1.13.0'

The human cell metadata of the Census, for RNA assays, is located at census$get("census_data")$get("homo_sapiens")$obs. The mouse cell metadata is at census$get("census_data")$get("mus_musculus")$obs.

To learn what metadata columns are available for fetching and filtering we can directly look at the keys of the cell metadata.

census <- open_soma()
The stable Census release is currently 2023-12-15. Specify census_version = "2023-12-15" in future calls to open_soma() to ensure data consistency.
my_keys <- census$get("census_data")$get("homo_sapiens")$obs$colnames()
 [1] "soma_joinid"                             
 [2] "dataset_id"                              
 [3] "assay"                                   
 [4] "assay_ontology_term_id"                  
 [5] "cell_type"                               
 [6] "cell_type_ontology_term_id"              
 [7] "development_stage"                       
 [8] "development_stage_ontology_term_id"      
 [9] "disease"                                 
[10] "disease_ontology_term_id"                
[11] "donor_id"                                
[12] "is_primary_data"                         
[13] "self_reported_ethnicity"                 
[14] "self_reported_ethnicity_ontology_term_id"
[15] "sex"                                     
[16] "sex_ontology_term_id"                    
[17] "suspension_type"                         
[18] "tissue"                                  
[19] "tissue_ontology_term_id"                 
[20] "tissue_general"                          
[21] "tissue_general_ontology_term_id"         
[22] "raw_sum"                                 
[23] "nnz"                                     
[24] "raw_mean_nnz"                            
[25] "raw_variance_nnz"                        
[26] "n_measured_vars"                         

soma_joinid is a special SOMADataFrame column that is used for join operations. All of the keys can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for a priori.

For example, let’s see which values are available for sex. To do this we can read the cell metadata while fetching only the sex column; column_names is a character vector indicating which metadata columns to fetch.

census$get("census_data")$get("homo_sapiens")$obs$read(column_names = "sex")$concat() |> as.data.frame() |> unique()
1          male
224      female
3747640 unknown
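The leftmost numbers in output like this are row labels rather than counts, assuming the table was de-duplicated with base R's unique(): each distinct value keeps the label of the first row on which it appeared. A toy data frame (invented values, standing in for the real cell metadata) shows the same behaviour, along with table() as a way to get actual counts.

```r
# Toy stand-in for the cell metadata; the values are invented
toy_obs <- data.frame(
  sex = c("male", "male", "female", "male", "female", "unknown")
)

# unique() keeps the first occurrence of each value and its row label
distinct_sex <- unique(toy_obs)
print(distinct_sex)

# To count cells per value instead, tabulate the column
sex_counts <- table(toy_obs$sex)
print(sex_counts)
```

Here the distinct values keep row labels 1, 3 and 6, the positions at which "male", "female" and "unknown" first occur.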

With this information we can fetch all cell metadata for a specific sex value, for example “unknown”; value_filter is a string holding an R expression whose conditions select the rows to fetch.

census$get("census_data")$get("homo_sapiens")$obs$read(value_filter = "sex == 'unknown'")$concat() |> as.data.frame() -> sex_unknown
head(sex_unknown)

  soma_joinid                           dataset_id     assay
1     3747639 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2
2     3747640 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2
3     3747641 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2
4     3747642 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2
5     3747643 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2
6     3747644 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2
  assay_ontology_term_id  cell_type cell_type_ontology_term_id
1            EFO:0009899 fibroblast                 CL:0000057
2            EFO:0009899 fibroblast                 CL:0000057
3            EFO:0009899 fibroblast                 CL:0000057
4            EFO:0009899 fibroblast                 CL:0000057
5            EFO:0009899 fibroblast                 CL:0000057
6            EFO:0009899 fibroblast                 CL:0000057
  development_stage development_stage_ontology_term_id disease
1 human adult stage                     HsapDv:0000087  normal
2 human adult stage                     HsapDv:0000087  normal
3 human adult stage                     HsapDv:0000087  normal
4 human adult stage                     HsapDv:0000087  normal
5 human adult stage                     HsapDv:0000087  normal
6 human adult stage                     HsapDv:0000087  normal
  disease_ontology_term_id                     donor_id is_primary_data
1             PATO:0000461 Pagella_GSE161267_GSM4904134            TRUE
2             PATO:0000461 Pagella_GSE161267_GSM4904134            TRUE
3             PATO:0000461 Pagella_GSE161267_GSM4904134            TRUE
4             PATO:0000461 Pagella_GSE161267_GSM4904134            TRUE
5             PATO:0000461 Pagella_GSE161267_GSM4904134            TRUE
6             PATO:0000461 Pagella_GSE161267_GSM4904134            TRUE
  self_reported_ethnicity self_reported_ethnicity_ontology_term_id     sex
1                 unknown                                  unknown unknown
2                 unknown                                  unknown unknown
3                 unknown                                  unknown unknown
4                 unknown                                  unknown unknown
5                 unknown                                  unknown unknown
6                 unknown                                  unknown unknown
  sex_ontology_term_id suspension_type  tissue tissue_ontology_term_id
1              unknown            cell gingiva          UBERON:0001828
2              unknown            cell gingiva          UBERON:0001828
3              unknown            cell gingiva          UBERON:0001828
4              unknown            cell gingiva          UBERON:0001828
5              unknown            cell gingiva          UBERON:0001828
6              unknown            cell gingiva          UBERON:0001828
  tissue_general tissue_general_ontology_term_id raw_sum  nnz raw_mean_nnz
1         mucosa                  UBERON:0000344     547  329     1.662614
2         mucosa                  UBERON:0000344     982  563     1.744227
3         mucosa                  UBERON:0000344   12467 3809     3.273038
4         mucosa                  UBERON:0000344    1053  566     1.860424
5         mucosa                  UBERON:0000344     548  363     1.509642
6         mucosa                  UBERON:0000344     678  429     1.580420
  raw_variance_nnz n_measured_vars
1        14.559604           31602
2         5.315247           31602
3       109.305683           31602
4         7.430042           31602
5         2.410818           31602
6        11.379616           31602

We can combine column_names and value_filter to perform specific queries. For example, we can fetch the disease column for cells whose cell_type is “B cell” and whose tissue_general is “lung”.

cell_metadata_b_cell <- census$get("census_data")$get("homo_sapiens")$obs$read(
  value_filter = "cell_type == 'B cell' & tissue_general == 'lung'",
  column_names = "disease"
)

cell_metadata_b_cell <- as.data.frame(cell_metadata_b_cell$concat())

table(cell_metadata_b_cell)

chronic obstructive pulmonary disease                              COVID-19 
                                 6369                                  2729 
         hypersensitivity pneumonitis             interstitial lung disease 
                                   52                                   376 
                  lung adenocarcinoma             lung large cell carcinoma 
                                62351                                  1534 
             lymphangioleiomyomatosis         non-small cell lung carcinoma 
                                  133                                 17484 
  non-specific interstitial pneumonia                                normal 
                                  231                                 25461 
                pleomorphic carcinoma                             pneumonia 
                                 1210                                    50 
                  pulmonary emphysema                    pulmonary fibrosis 
                                 1512                                  6798 
                pulmonary sarcoidosis             small cell lung carcinoma 
                                    6                                   583 
         squamous cell lung carcinoma 
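
The combination of a row filter and a column selection can be mimicked on a local data frame, which may help when prototyping a query before sending it to the Census. The values below are invented, and base R's subset() is only a stand-in for value_filter and column_names, not the mechanism the package uses.

```r
# Toy stand-in for the cell metadata; the values are invented
toy_obs <- data.frame(
  cell_type      = c("B cell", "B cell", "T cell", "B cell", "B cell"),
  tissue_general = c("lung", "lung", "lung", "blood", "lung"),
  disease        = c("normal", "COVID-19", "normal", "normal", "normal")
)

# Same logical condition as the value_filter string, keeping one column
b_cell_lung <- subset(
  toy_obs,
  cell_type == "B cell" & tissue_general == "lung",
  select = disease
)

# Tabulate the diseases among the selected cells
disease_counts <- table(b_cell_lung$disease)
print(disease_counts)
```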

Querying expression data as Seurat

Use get_seurat() to apply the same kind of filtering and return a Seurat object. Its filtering arguments are:

  • obs_column_names: character vector indicating the columns to select for cell metadata.
  • obs_value_filter: expression with selection conditions to fetch cells meeting the criteria.
  • var_column_names: character vector indicating the columns to select for gene metadata.
  • var_value_filter: expression with selection conditions to fetch genes meeting the criteria.
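
Because the filter arguments are plain strings, they can be assembled from R vectors rather than typed by hand. The helper below is hypothetical (build_in_filter is not part of {cellxgene.census}) and assumes that the values contain no single quotes.

```r
# Hypothetical helper: build a "<column> %in% c('a', 'b', ...)" filter string
build_in_filter <- function(column, values) {
  # Refuse empty input and values that would break the quoting
  stopifnot(length(values) > 0, !any(grepl("'", values, fixed = TRUE)))
  quoted <- paste0("'", values, "'", collapse = ", ")
  sprintf("%s %%in%% c(%s)", column, quoted)
}

gene_ids <- c("ENSG00000161798", "ENSG00000188229")
gene_filter <- build_in_filter("feature_id", gene_ids)
print(gene_filter)
```

The resulting string can then be passed as var_value_filter. The same approach should work for obs_value_filter with a cell metadata column such as tissue_general.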

seurat_obj <- get_seurat(
  census, "Homo sapiens",
  obs_column_names = c("cell_type", "tissue_general", "disease", "sex"),
  var_value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
  obs_value_filter = "cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'"
)

saveRDS(object = seurat_obj, file = "data/lung_bcell.rds")

Reload the saved Seurat object and print its summary.

seurat_obj <- readRDS(file = "data/lung_bcell.rds")
seurat_obj
An object of class Seurat 
2 features across 2729 samples within 1 assay 
Active assay: RNA (2 features, 0 variable features)
 2 layers present: counts, data

Close the census

After use, the census object should be closed to release memory and other resources. This also closes all SOMA objects accessed via the top-level census. Closing can be automated using on.exit(census$close(), add = TRUE) immediately after census <- open_soma().
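
One way to make the closing automatic is to wrap the open/use/close cycle in a function, so that on.exit() fires whether the work succeeds or fails. This is a sketch under assumptions: with_census and its open_fn argument are hypothetical names introduced here, and the mock handle below exists only so the example runs without the package or a network connection.

```r
# Hypothetical helper: open a Census handle, run `task` on it, always close it.
# `open_fn` defaults to the real opener but can be swapped for a mock.
with_census <- function(task, open_fn = cellxgene.census::open_soma) {
  census <- open_fn()
  on.exit(census$close(), add = TRUE)  # runs on normal return and on error
  task(census)
}

# Mock handle that records whether close() was called
closed <- FALSE
mock_open <- function() {
  list(close = function() closed <<- TRUE)
}

# The task's value is returned and the handle is closed afterwards
result <- with_census(function(census) "done", open_fn = mock_open)
print(result)
print(closed)
```

With the real package installed, the call would look like with_census(function(census) census$get("census_data")$get("homo_sapiens")$obs$colnames()), leaving open_fn at its default.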


