Last updated: 2019-07-12


Introduction

Cell Ranger is a suite of four pipelines used to process Chromium single cell 3’ RNA-seq data. The workflow starts with demultiplexing the Illumina sequencer’s per-cycle base call (BCL) files into FASTQ files. This guide is for running the cellranger mkfastq pipeline, which augments Illumina’s bcl2fastq tool.

Requirements

The Cell Ranger toolkit can be used as soon as it has been downloaded and extracted. However, cellranger mkfastq also requires an installation of the bcl2fastq Conversion Software. On machete and razor, bcl2fastq is installed globally at /usr/local/bin/bcl2fastq. On stiletto, you will need to compile your own version and include it in your PATH; alternatively, add /home/dtang/bin to your PATH to use my compiled version (v2.20.0.422).

bcl2fastq

The bcl2fastq tool performs the conversion and demultiplexing in a single step. By default, the software outputs the demultiplexed and compressed FASTQ files in <run folder>/Data/Intensities/BaseCalls. If the Sample_Project column is specified in the sample sheet, the FASTQ files for that sample are placed in <run folder>/Data/Intensities/BaseCalls/<Project>.

Converting BCL to FASTQ

It is possible to use bcl2fastq directly on Chromium scRNA-seq data; however, cellranger mkfastq is the preferred option because it provides a number of additional features. For example, it accepts 10x sample index set names (e.g., SI-GA-A12) in the sample sheet.

Since cellranger mkfastq is a wrapper around bcl2fastq, many of the arguments for bcl2fastq are accepted by cellranger mkfastq. Below we run cellranger mkfastq on 10x scRNA-seq data that was sequenced on a NextSeq 500.

nice /home/dtang/working_data_01/src/cellranger/cellranger-2.1.1/cellranger mkfastq \
  --output-dir Unaligned \
  -R /mnt/remoteserv/switch/rundata/nextseq/Runs/180521_NB500898_040_HJ5J5BGX5 \
  --samplesheet=SampleSheet_NextSeq_2018_05_18_RL972.csv \
  --localcores=8 \
  --localmem=200

The arguments used are:

--output-dir = the directory to which the FASTQ files are written
-R = the directory that contains the flowcell’s Data folder
--samplesheet = the path to an Illumina Experiment Manager-compatible sample sheet that contains 10x sample index set names (e.g., SI-GA-A12) in the sample index column. All other information, such as sample names and lanes, should also be in the sample sheet
--localcores = the maximum number of cores the pipeline may request at one time
--localmem = the maximum amount of memory, in GB, the pipeline may request at one time

The cellranger mkfastq documentation page specifies that the --run parameter (the path of the BCL run folder) is required. We did not set it explicitly because specifying -R is equivalent to setting --run.
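
Because -R and --run refer to the same input, the command above can also be written with the long option. The sketch below only prints such a command rather than running it; mkfastq_cmd is a hypothetical helper (it is not part of Cell Ranger), it assumes cellranger is on your PATH, and the values are the ones used in the example above.

```shell
#!/bin/sh
# mkfastq_cmd is a hypothetical helper that prints (does not run) a
# cellranger mkfastq command, using --run in place of -R.
mkfastq_cmd() {
  # $1 = run folder, $2 = sample sheet, $3 = output dir,
  # $4 = max cores, $5 = max memory in GB
  printf 'cellranger mkfastq --run=%s --samplesheet=%s --output-dir=%s --localcores=%s --localmem=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

# Same values as the command shown earlier in this guide
mkfastq_cmd \
  /mnt/remoteserv/switch/rundata/nextseq/Runs/180521_NB500898_040_HJ5J5BGX5 \
  SampleSheet_NextSeq_2018_05_18_RL972.csv \
  Unaligned 8 200
```

Printing the command first makes it easy to review the paths and resource limits before starting a long-running job.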

Data folder

The directory named after your experiment (ExperimentName) contains the Data folder; pass this directory to the -R argument. Below is the directory structure generated by the Illumina MiniSeq or NextSeq system.

Figure from the bcl2fastq2 Conversion Software v2.19 User Guide

Resources

By default, Cell Ranger uses 90% of all available memory and cores on a system. This may cause problems on systems that limit the number of processes a user can spawn, because each core that Cell Ranger uses spawns 64 user processes. To find the maximum number of user processes on your system, run ulimit -a.

# on razor
ulimit -a | grep processes
max user processes              (-u) 1024

# on machete
ulimit -a | grep processes
max user processes              (-u) 1024

# on stiletto
ulimit -a | grep processes
max user processes              (-u) 4096

There are 24 cores on razor, so Cell Ranger will use 21 cores by default (90% of 24, rounded down). It will therefore try to spawn 1,344 processes (21 × 64), which is over the limit of 1,024, and problems may occur. The same applies to machete, which has 32 cores and a limit of 1,024 processes (28 cores, 1,792 processes), and to stiletto, which has 96 cores and a limit of 4,096 processes (86 cores, 5,504 processes).

To prevent this problem, use the argument --localcores to set the number of cores that Cell Ranger will use. In addition, use the argument --localmem to limit the amount of memory, so that Cell Ranger does not use too much of the system’s memory. The command shown earlier used --localcores=8, which corresponds to 512 processes (8 × 64) and is within the limit on all three machines.
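
The arithmetic in this section can be checked with a few lines of shell. This is a minimal sketch that assumes the figures given in this guide (90% of the cores, rounded down, and 64 user processes per core); the function names are hypothetical, and it ignores any other processes you already have running.

```shell
#!/bin/sh
# Hypothetical helpers; the 90% and 64-processes-per-core figures come from
# this guide and are not queried from Cell Ranger itself.

# Cores Cell Ranger would use by default (90% of the cores, rounded down).
default_cores() {
  echo $(( $1 * 90 / 100 ))
}

# User processes spawned for a given number of cores.
processes_for_cores() {
  echo $(( $1 * 64 ))
}

# Largest --localcores value that stays within a given process limit.
max_localcores() {
  echo $(( $1 / 64 ))
}

# razor: 24 cores and a limit of 1,024 user processes
cores=$(default_cores 24)
echo "default cores: $cores, processes: $(processes_for_cores "$cores")"
echo "largest --localcores within the limit: $(max_localcores 1024)"
```

For razor this reports 21 default cores, 1,344 processes, and an upper bound of 16 for --localcores. On your own machine the two inputs can be taken from the core count and from ulimit -u; leave some headroom below the bound, since every process you own counts toward the same limit.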

FASTQ output

The cellranger mkfastq output is stored in the directory specified by --output-dir, which in our case was Unaligned. The listing below shows the directory in which the command was run, which now contains Unaligned.

ls -lrt
total 368
-r--r--r-- 1 dtang     dtang      26330 May 18 11:44 RunParameters.xml
-r--r--r-- 1 dtang     dtang      28570 May 18 11:44 RunInfo.xml
-r--r--r-- 1 dtang     dtang         37 May 18 16:54 RTARead1Complete.txt
-r--r--r-- 1 dtang     dtang         37 May 18 19:19 RTARead2Complete.txt
-r--r--r-- 1 dtang     dtang         37 May 19 05:04 RTARead3Complete.txt
-r--r--r-- 1 dtang     dtang         47 May 19 05:10 RTAComplete.txt
-r--r--r-- 1 dtang     dtang        926 May 19 06:54 RunCompletionStatus.xml
drwxr-xr-x 2 dtang     dtang       4096 May 21 11:20 RTALogs
drwxr-xr-x 2 dtang     dtang         29 May 21 11:20 Logs
drwxr-xr-x 2 dtang     dtang       4096 May 21 11:20 InterOp
drwxr-xr-x 3 dtang     dtang         32 May 21 11:20 Data
drwxr-sr-x 2 dtang     listerlab     10 May 21 11:21 JahnviData
-rw-rw-r-- 1 dtang     listerlab 256014 May 21 16:41 copy.log
-rw-r--r-- 1 jpflueger listerlab    707 May 21 18:31 SampleSheet_plasmids_only.csv
drwxr-sr-x 6 jpflueger listerlab   4096 May 22 10:36 Unaligned_plasmids
-rw-r--r-- 1 dtang     listerlab    601 May 23 16:49 SampleSheet_NextSeq_2018_05_18_RL972.csv
-rwxr-xr-x 1 dtang     listerlab    338 May 23 18:36 run_mkfastq.sh
-rw-rw-r-- 1 dtang     listerlab   1113 May 23 18:36 __HJ5J5BGX5.mro
drwxrwsr-x 5 dtang     listerlab   4096 May 23 19:13 Unaligned
drwxrwsr-x 4 dtang     listerlab   4096 May 23 19:36 HJ5J5BGX5

Here are the contents of Unaligned.

ls -lrt Unaligned
total 9838856
-rw-rw-r-- 1 dtang listerlab  524058981 May 23 18:46 Undetermined_S0_L001_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  233889016 May 23 18:46 Undetermined_S0_L001_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 1724017202 May 23 18:46 Undetermined_S0_L001_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  515067774 May 23 18:54 Undetermined_S0_L002_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  224709936 May 23 18:54 Undetermined_S0_L002_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 1708252015 May 23 18:54 Undetermined_S0_L002_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  534937138 May 23 19:04 Undetermined_S0_L003_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  232040765 May 23 19:04 Undetermined_S0_L003_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 1786013401 May 23 19:04 Undetermined_S0_L003_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  547445480 May 23 19:12 Undetermined_S0_L004_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  245106496 May 23 19:12 Undetermined_S0_L004_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 1799381484 May 23 19:12 Undetermined_S0_L004_R2_001.fastq.gz
drwxrwsr-x 2 dtang listerlab       4096 May 23 19:12 Stats
drwxrwsr-x 3 dtang listerlab         25 May 23 19:13 Reports
drwxrwsr-x 3 dtang listerlab         78 May 23 19:36 plant_single_cell

The FASTQ files are stored in the directory plant_single_cell; this directory was created because the sample sheet specified plant_single_cell as the Sample_Project. FASTQ files are named with the sample name and the sample number; the sample number is a numeric assignment based on the order in which the samples are listed in the sample sheet.

In the example below, files are named after our Sample_ID, which was RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control. S1 is the sample number and indicates that this sample is the first sample listed for the run. L00X is the lane number; I1, R1, and R2 are the read types; and the last segment is always 001.

ls -lrt Unaligned/plant_single_cell/RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control/
total 59081404
-rw-rw-r-- 1 dtang listerlab  1036165630 May 23 19:28 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L002_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  1013524732 May 23 19:29 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L001_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  1040287181 May 23 19:29 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L004_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  1065899896 May 23 19:29 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L003_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  3104322672 May 23 19:29 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L002_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  3000015567 May 23 19:29 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L001_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  3067586227 May 23 19:30 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L004_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab  3217370578 May 23 19:30 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L003_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 11460856936 May 23 19:33 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L003_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 10609464389 May 23 19:34 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L001_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 10951709478 May 23 19:35 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L004_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 10932078782 May 23 19:35 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L002_R2_001.fastq.gz
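
The naming scheme described above can be unpacked programmatically. The sketch below is one possible way to do it with sed; parse_fastq_name is a hypothetical helper, and the pattern is an assumption based only on the file names shown in this guide.

```shell
#!/bin/sh
# parse_fastq_name is a hypothetical helper, not part of Cell Ranger or
# bcl2fastq. It splits a FASTQ file name into the components described above.
parse_fastq_name() {
  # Pattern: <sample name>_S<sample number>_L<lane>_<read type>_<segment>.fastq.gz
  # Names that do not match the pattern produce no output.
  basename "$1" | sed -E -n \
    's/^(.+)_S([0-9]+)_L([0-9]{3})_([IR][0-9])_([0-9]{3})\.fastq\.gz$/sample=\1 sample_number=\2 lane=\3 read=\4 segment=\5/p'
}

parse_fastq_name Unaligned/Undetermined_S0_L001_R1_001.fastq.gz
```

For the Undetermined file above this prints sample=Undetermined sample_number=0 lane=001 read=R1 segment=001. Sample names that contain underscores, such as our Sample_ID, are handled because the sample name is matched greedily up to the final _S<number> field.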

Quality control

In addition to the FASTQ files, bcl2fastq generates various summary files. If --stats-dir is not specified, the summary and statistics files are stored in a Stats folder by default. These include:

InterOp Files [used by the Sequencing Analysis Viewer (SAV)]
ConversionStats File (ConversionStats.xml)
DemultiplexingStats File (DemultiplexingStats.xml)
- Barcode Count
- PerfectBarcode Count
- OneMismatchBarcode Count
AdapterTrimming File
FastqSummary and DemuxSummary
HTML Reports
JSON File

ls -lrt Stats
total 12196
-rw-rw-r-- 1 dtang listerlab    12640 May 23 19:12 DemultiplexingStats.xml
-rw-rw-r-- 1 dtang listerlab    23586 May 23 19:12 FastqSummaryF1L1.txt
-rw-rw-r-- 1 dtang listerlab    26071 May 23 19:12 DemuxSummaryF1L1.txt
-rw-rw-r-- 1 dtang listerlab    23599 May 23 19:12 FastqSummaryF1L2.txt
-rw-rw-r-- 1 dtang listerlab    26057 May 23 19:12 DemuxSummaryF1L2.txt
-rw-rw-r-- 1 dtang listerlab    23615 May 23 19:12 FastqSummaryF1L3.txt
-rw-rw-r-- 1 dtang listerlab    26022 May 23 19:12 DemuxSummaryF1L3.txt
-rw-rw-r-- 1 dtang listerlab    23632 May 23 19:12 FastqSummaryF1L4.txt
-rw-rw-r-- 1 dtang listerlab    26107 May 23 19:12 DemuxSummaryF1L4.txt
-rw-rw-r-- 1 dtang listerlab   108870 May 23 19:12 Stats.json
-rw-rw-r-- 1 dtang listerlab 12107974 May 23 19:12 ConversionStats.xml
-rw-rw-r-- 1 dtang listerlab    36013 May 23 19:12 AdapterTrimming.txt

DemuxSummary

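The DemuxSummaryF1L#.txt file (where # indicates the lane number) is only created if the sample sheet contains at least one sample and the sample barcode is provided. The first part of the file reports, for each tile, the percentage of reads assigned to each sample; sample number 0 (None) corresponds to reads that were not assigned to any sample. In the output below, sample numbers 1 to 4 carry the same name, presumably because a 10x sample index set consists of four index sequences.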

head DemuxSummaryF1L1.txt 
SampleNumber    0       1       2       3       4
SampleName      None    RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control       RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control       RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control       RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control
L1T11101        14.28523        20.1733 27.09556        17.31634        21.12957
L1T11102        14.81065        19.9818 27.33801        16.71274        21.15679
L1T11103        14.42195        20.05802        27.6645 16.74197        21.11357
L1T11104        14.06736        20.15878        27.98941        16.61069        21.17377
L1T11105        13.70394        20.19893        28.19503        16.74543        21.15667
L1T11106        13.64402        20.03309        28.37754        16.70499        21.24036
L1T11107        13.15779        20.15206        28.52084        16.83371        21.3356
L1T11108        12.93981        20.24513        28.52892        16.90557        21.38056
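
To summarise the per-tile percentages across a whole lane, the tile rows can be averaged per sample number. The following is a minimal sketch using awk; mean_sample_percent is a hypothetical helper, and it assumes the layout shown above (whitespace-separated columns, one row per tile, and a second section that begins with a line starting with ###).

```shell
#!/bin/sh
# mean_sample_percent is a hypothetical helper. It assumes tile rows are
# named L<lane>T<tile> and that the second section starts with "###".
mean_sample_percent() {
  awk '
    /^###/ { exit }                      # stop at the unknown-index section
    $1 ~ /^L[0-9]+T[0-9]+$/ {            # tile rows only; skips the two header rows
      for (i = 2; i <= NF; i++) sum[i] += $i
      if (NF > ncol) ncol = NF
      tiles++
    }
    END {
      # Column 2 of the file holds sample number 0 (reads assigned to no sample)
      for (i = 2; i <= ncol; i++)
        printf "sample_number=%d mean_percent=%.2f\n", i - 2, sum[i] / tiles
    }
  ' "$1"
}

# Usage:
#   mean_sample_percent DemuxSummaryF1L1.txt
```

A high mean for sample number 0 would indicate that a large share of the lane could not be assigned to any sample.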

The second part of the file (denoted by Most Popular Unknown Index Sequences) contains the 1,000 most common unknown barcode sequences and the total number of reads observed with each barcode.

grep -A 9 "Most Popular" DemuxSummaryF1L2.txt
### Most Popular Unknown Index Sequences
### Columns: Index_Sequence Hit_Count
GGGGGGGG        11290680
CCAGGATA        158760
GGGGCGGC        124440
GGGGCGGG        123720
GGGGGGGC        88520
CGGGGGGG        86420
GGGCGGGG        81160
CGGTCCGC        79400

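Reads associated with these unknown barcode sequences are stored in files that begin with Undetermined_S0_, which are located in the directory specified by --output-dir. In the example above, the most frequent unknown index is GGGGGGGG; on two-colour instruments such as the NextSeq, runs of G typically arise when no signal is detected during the index read.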
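
To see how many reads ended up in an Undetermined file, the records can simply be counted. The sketch below assumes standard four-line FASTQ records and that gzip is available; count_fastq_reads is a hypothetical helper.

```shell
#!/bin/sh
# count_fastq_reads is a hypothetical helper: a FASTQ record spans four
# lines, so the number of reads is the line count divided by four.
count_fastq_reads() {
  gzip -dc < "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Usage (file name as in the listing of Unaligned above):
#   count_fastq_reads Unaligned/Undetermined_S0_L001_R1_001.fastq.gz
```

Running the same function on a sample's FASTQ file for the same lane and read type gives a figure to compare against. Note that each file has to be decompressed in full, so this can take a while on files of the sizes listed above.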

Examining the unknown barcode sequences is a good way to check that:

The indexes you used to create your FASTQ files were correct
There was no issue with the demultiplexing
There are no contamination issues
There are no further issues with the index reads of the run
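
One way to carry out this examination is to tabulate the most frequent unknown index sequences together with their share of the listed hits. This is a minimal sketch using awk; top_unknown_indexes is a hypothetical helper, and it assumes the DemuxSummary layout shown earlier (a "### Most Popular Unknown Index Sequences" line, a "### Columns" line, then one index sequence and hit count per line).

```shell
#!/bin/sh
# top_unknown_indexes is a hypothetical helper that reads the second part of
# a DemuxSummary file and reports the leading unknown index sequences.
top_unknown_indexes() {
  # $1 = DemuxSummary file, $2 = number of sequences to report (default 10)
  awk -v n="${2:-10}" '
    /^### Most Popular Unknown Index Sequences/ { in_section = 1; next }
    /^###/ { next }                      # skip the column description line
    in_section && NF == 2 {
      idx[++count] = $1; hits[count] = $2; total += $2
    }
    END {
      if (n > count) n = count
      # Percentages are relative to the listed unknown indexes only
      for (i = 1; i <= n; i++)
        printf "%s\t%d\t%.1f%%\n", idx[i], hits[i], 100 * hits[i] / total
    }
  ' "$1"
}

# Usage:
#   top_unknown_indexes DemuxSummaryF1L2.txt 8
```

An unknown index that closely resembles one of your expected index sequences, or that accounts for a large share of the hits, is worth checking against the sample sheet.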

sessionInfo()

R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.0.1   purrr_0.3.1    
[5] readr_1.3.1     tidyr_0.8.3     tibble_2.0.1    ggplot2_3.1.0  
[9] tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0       highr_0.7        cellranger_1.1.0 plyr_1.8.4      
 [5] pillar_1.3.1     compiler_3.5.2   git2r_0.24.0     workflowr_1.2.0 
 [9] tools_3.5.2      digest_0.6.18    lubridate_1.7.4  jsonlite_1.6    
[13] evaluate_0.13    nlme_3.1-137     gtable_0.2.0     lattice_0.20-38 
[17] pkgconfig_2.0.2  rlang_0.3.1      cli_1.0.1        rstudioapi_0.9.0
[21] yaml_2.2.0       haven_2.1.0      xfun_0.5         withr_2.1.2     
[25] xml2_1.2.0       httr_1.4.0       knitr_1.21       hms_0.4.2       
[29] generics_0.0.2   fs_1.2.6         rprojroot_1.3-2  grid_3.5.2      
[33] tidyselect_0.2.5 glue_1.3.0       R6_2.4.0         readxl_1.3.0    
[37] rmarkdown_1.11   modelr_0.1.4     magrittr_1.5     whisker_0.3-2   
[41] backports_1.1.3  scales_1.0.0     htmltools_0.3.6  rvest_0.3.2     
[45] assertthat_0.2.0 colorspace_1.4-0 stringi_1.3.1    lazyeval_0.2.1  
[49] munsell_0.5.0    broom_0.5.1      crayon_1.3.4

Generating FASTQs with cellranger mkfastq on 10x scRNA-seq data

Dave Tang

2019-07-12