Last updated: 2019-07-12
These are the previous versions of the R Markdown and HTML files.
| File | Version | Author | Date | Message |
|---|---|---|---|---|
| Rmd | 389bd42 | davetang | 2019-07-12 | First commit |
Cell Ranger is a suite of four pipelines used to process Chromium single cell 3’ RNA-seq data. The workflow starts with demultiplexing the Illumina sequencer’s per-cycle base call (BCL) files into FASTQ files. This guide is for running the cellranger mkfastq pipeline, which augments Illumina’s bcl2fastq tool.
The Cell Ranger toolkit can be used as soon as it has been downloaded and extracted. To use cellranger mkfastq, however, you also need an installation of the bcl2fastq Conversion Software. On machete and razor, bcl2fastq is installed globally at /usr/local/bin/bcl2fastq. On stiletto you will need to compile your own copy and include it in your PATH; alternatively, add /home/dtang/bin to your PATH to use my compiled version (v2.20.0.422).
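As a minimal sketch of the PATH step on stiletto, the snippet below prepends /home/dtang/bin only if it is not already present and then reports which bcl2fastq would be picked up. The helper name prepend_path is hypothetical and only used for illustration.

```shell
# Prepend a directory to PATH unless it is already listed there.
# (prepend_path is a hypothetical helper, not part of Cell Ranger or bcl2fastq.)
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) ;;               # already on PATH, nothing to do
    *) PATH="$1:$PATH" ;;
  esac
  export PATH
}

prepend_path /home/dtang/bin

# Report which bcl2fastq (if any) the shell will now find.
if command -v bcl2fastq >/dev/null 2>&1; then
  echo "bcl2fastq found at: $(command -v bcl2fastq)"
else
  echo "bcl2fastq not found on PATH" >&2
fi
```

To make the change permanent, the prepend_path call could go in a shell start-up file such as ~/.bashrc.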
The bcl2fastq tool performs the conversion and demultiplexing in a single step. By default, the software outputs the demultiplexed and compressed FASTQ files in <run folder>/Data/Intensities/BaseCalls. If the Sample_Project column is specified in the sample sheet, the FASTQ files for that sample are placed in <run folder>/Data/Intensities/BaseCalls/<Project>.
It is possible to use bcl2fastq directly on Chromium scRNA-seq data; however, cellranger mkfastq is the preferred option because it provides a number of additional features, such as accepting 10x sample index set names in the sample sheet.
Since cellranger mkfastq is a wrapper around bcl2fastq, many of the arguments for bcl2fastq are accepted by cellranger mkfastq. Below we run cellranger mkfastq on 10x scRNA-seq data that was sequenced on a NextSeq 500.
nice /home/dtang/working_data_01/src/cellranger/cellranger-2.1.1/cellranger mkfastq \
--output-dir Unaligned \
-R /mnt/remoteserv/switch/rundata/nextseq/Runs/180521_NB500898_040_HJ5J5BGX5 \
--samplesheet=SampleSheet_NextSeq_2018_05_18_RL972.csv \
--localcores=8 \
--localmem=200
The arguments used are:
- --output-dir = the directory that you want to output FASTQ files
- -R = the directory that contains a flowcell’s Data folder
- --samplesheet = the path to an Illumina Experiment Manager-compatible sample sheet which contains 10x sample index set names (e.g., SI-GA-A12) in the sample index column. All other information, such as sample names and lanes, should be in the sample sheet
- --localcores = set max cores the pipeline may request at one time
- --localmem = set max GB the pipeline may request at one time
On the cellranger mkfastq documentation page, it specifies that the --run parameter is required (this is the path of the BCL run folder). However, since we specified -R, this is the equivalent of setting --run.
The directory of your ExperimentName contains the Data folder and this should be the input to the -R argument. Below is the directory structure generated from the Illumina MiniSeq or NextSeq system.
Figure from the bcl2fastq2 Conversion Software v2.19 User Guide
Cell Ranger uses 90% of ALL available memory and cores on a system by default. This may cause problems on systems that limit the number of processes a user can spawn, because each core that Cell Ranger uses will spawn 64 user processes. To find out the maximum number of user processes on your system, run ulimit -a.
# on razor
ulimit -a | grep processes
max user processes (-u) 1024
# on machete
ulimit -a | grep processes
max user processes (-u) 1024
# on stiletto
ulimit -a | grep processes
max user processes (-u) 4096
There are 24 cores on razor, hence Cell Ranger will use 21 cores by default. This means that it will try to spawn 1,344 processes (21 x 64), which is over the limit of 1,024, and problems may occur. By the same arithmetic this may also be a problem on machete, which has 32 cores and a limit of 1,024 processes (28 x 64 = 1,792), and on stiletto, which has 96 cores and a limit of 4,096 processes (86 x 64 = 5,504).
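The arithmetic above can be reproduced with a short shell sketch. It assumes, as stated in this guide, that Cell Ranger uses 90% of the cores (rounded down) and that each core spawns 64 user processes. The function name estimate_processes is hypothetical.

```shell
# Estimate how many user processes Cell Ranger would spawn by default,
# assuming 90% of the cores are used (rounded down) and 64 processes per core.
# (estimate_processes is a hypothetical helper for illustration only.)
estimate_processes() {
  total_cores=$1
  used_cores=$(( total_cores * 90 / 100 ))   # integer division rounds down
  echo $(( used_cores * 64 ))
}

# Core counts for the three servers mentioned in this guide.
for host_cores in razor:24 machete:32 stiletto:96; do
  host=${host_cores%%:*}
  cores=${host_cores##*:}
  echo "$host: $(estimate_processes "$cores") processes"
done
```

Comparing each estimate with the max user processes value from ulimit shows whether the default settings would exceed the limit on that server.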
To prevent this problem, use the argument --localcores to set the number of cores that Cell Ranger will use; in addition set a limit to the amount of memory that will be used by using the argument --localmem to prevent using up too much of the system’s memory.
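One way to pick a value for --localcores is to divide the process limit by the 64 processes per core described above. The sketch below does this; max_safe_cores is a hypothetical helper, and the result is only an upper bound, since your other running processes also count towards the limit.

```shell
# Largest core count whose spawned processes stay within a given limit,
# assuming 64 user processes per core (the figure quoted in this guide).
# (max_safe_cores is a hypothetical helper for illustration only.)
max_safe_cores() {
  limit=$1
  per_core=${2:-64}
  echo $(( limit / per_core ))
}

# Query the current limit; fall back gracefully if it is not a plain number
# (e.g. "unlimited", or a shell that does not support ulimit -u).
limit=$(ulimit -u 2>/dev/null || echo unknown)
case "$limit" in
  ''|*[!0-9]*) echo "no numeric process limit reported: $limit" ;;
  *) echo "upper bound: --localcores=$(max_safe_cores "$limit")" ;;
esac
```

With a limit of 1,024 this gives 16 cores, so the --localcores=8 used in the command above leaves a comfortable margin.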
The cellranger mkfastq output will be stored in the directory specified by --output-dir, which in our case was Unaligned.
ls -lrt
total 368
-r--r--r-- 1 dtang dtang 26330 May 18 11:44 RunParameters.xml
-r--r--r-- 1 dtang dtang 28570 May 18 11:44 RunInfo.xml
-r--r--r-- 1 dtang dtang 37 May 18 16:54 RTARead1Complete.txt
-r--r--r-- 1 dtang dtang 37 May 18 19:19 RTARead2Complete.txt
-r--r--r-- 1 dtang dtang 37 May 19 05:04 RTARead3Complete.txt
-r--r--r-- 1 dtang dtang 47 May 19 05:10 RTAComplete.txt
-r--r--r-- 1 dtang dtang 926 May 19 06:54 RunCompletionStatus.xml
drwxr-xr-x 2 dtang dtang 4096 May 21 11:20 RTALogs
drwxr-xr-x 2 dtang dtang 29 May 21 11:20 Logs
drwxr-xr-x 2 dtang dtang 4096 May 21 11:20 InterOp
drwxr-xr-x 3 dtang dtang 32 May 21 11:20 Data
drwxr-sr-x 2 dtang listerlab 10 May 21 11:21 JahnviData
-rw-rw-r-- 1 dtang listerlab 256014 May 21 16:41 copy.log
-rw-r--r-- 1 jpflueger listerlab 707 May 21 18:31 SampleSheet_plasmids_only.csv
drwxr-sr-x 6 jpflueger listerlab 4096 May 22 10:36 Unaligned_plasmids
-rw-r--r-- 1 dtang listerlab 601 May 23 16:49 SampleSheet_NextSeq_2018_05_18_RL972.csv
-rwxr-xr-x 1 dtang listerlab 338 May 23 18:36 run_mkfastq.sh
-rw-rw-r-- 1 dtang listerlab 1113 May 23 18:36 __HJ5J5BGX5.mro
drwxrwsr-x 5 dtang listerlab 4096 May 23 19:13 Unaligned
drwxrwsr-x 4 dtang listerlab 4096 May 23 19:36 HJ5J5BGX5
Here are the contents of Unaligned.
ls -lrt Unaligned
total 9838856
-rw-rw-r-- 1 dtang listerlab 524058981 May 23 18:46 Undetermined_S0_L001_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 233889016 May 23 18:46 Undetermined_S0_L001_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 1724017202 May 23 18:46 Undetermined_S0_L001_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 515067774 May 23 18:54 Undetermined_S0_L002_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 224709936 May 23 18:54 Undetermined_S0_L002_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 1708252015 May 23 18:54 Undetermined_S0_L002_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 534937138 May 23 19:04 Undetermined_S0_L003_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 232040765 May 23 19:04 Undetermined_S0_L003_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 1786013401 May 23 19:04 Undetermined_S0_L003_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 547445480 May 23 19:12 Undetermined_S0_L004_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 245106496 May 23 19:12 Undetermined_S0_L004_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 1799381484 May 23 19:12 Undetermined_S0_L004_R2_001.fastq.gz
drwxrwsr-x 2 dtang listerlab 4096 May 23 19:12 Stats
drwxrwsr-x 3 dtang listerlab 25 May 23 19:13 Reports
drwxrwsr-x 3 dtang listerlab 78 May 23 19:36 plant_single_cell
The FASTQ files are stored in the directory plant_single_cell; this directory was created based on the sample sheet, which specified plant_single_cell as the Sample_Project. FASTQ files are named with the sample name and the sample number; the sample number is a numeric assignment based on the order of the sample list.
In the example below, files are named after our Sample_ID, which was RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control. The S1 refers to the sample number and in this case indicates that this sample is the first sample listed for the run. L00X is the lane number; I1, R1, and R2 are the read types; and the last segment is always 001.
ls -lrt Unaligned/plant_single_cell/RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control/
total 59081404
-rw-rw-r-- 1 dtang listerlab 1036165630 May 23 19:28 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L002_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 1013524732 May 23 19:29 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L001_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 1040287181 May 23 19:29 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L004_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 1065899896 May 23 19:29 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L003_I1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 3104322672 May 23 19:29 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L002_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 3000015567 May 23 19:29 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L001_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 3067586227 May 23 19:30 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L004_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 3217370578 May 23 19:30 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L003_R1_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 11460856936 May 23 19:33 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L003_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 10609464389 May 23 19:34 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L001_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 10951709478 May 23 19:35 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L004_R2_001.fastq.gz
-rw-rw-r-- 1 dtang listerlab 10932078782 May 23 19:35 RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L002_R2_001.fastq.gz
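The naming scheme described above can be taken apart in the shell. This is a minimal sketch that assumes every file name ends in _S<number>_L<lane>_<read>_001.fastq.gz, so the fields are peeled off from the right and underscores in the sample name do not matter. The function name parse_fastq_name is hypothetical.

```shell
# Split a bcl2fastq-style FASTQ file name into its components.
# Fields are removed from the right, so sample names containing
# underscores are handled correctly.
# (parse_fastq_name is a hypothetical helper for illustration only.)
parse_fastq_name() {
  name=$(basename "$1" .fastq.gz)
  segment=${name##*_};       name=${name%_*}   # always 001
  read_type=${name##*_};     name=${name%_*}   # I1, R1 or R2
  lane=${name##*_};          name=${name%_*}   # L001 .. L004
  sample_number=${name##*_}; sample_name=${name%_*}
  printf '%s\t%s\t%s\t%s\t%s\n' \
    "$sample_name" "$sample_number" "$lane" "$read_type" "$segment"
}

parse_fastq_name RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control_S1_L002_R2_001.fastq.gz
```

The output is one tab-separated line per file: sample name, sample number, lane, read type, and the trailing segment.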
In addition to the FASTQ files, bcl2fastq generates various summary files. If --stats-dir was not specified, summary and statistic files are stored in a Stats folder by default. These include:
ls -lrt Stats
total 12196
-rw-rw-r-- 1 dtang listerlab 12640 May 23 19:12 DemultiplexingStats.xml
-rw-rw-r-- 1 dtang listerlab 23586 May 23 19:12 FastqSummaryF1L1.txt
-rw-rw-r-- 1 dtang listerlab 26071 May 23 19:12 DemuxSummaryF1L1.txt
-rw-rw-r-- 1 dtang listerlab 23599 May 23 19:12 FastqSummaryF1L2.txt
-rw-rw-r-- 1 dtang listerlab 26057 May 23 19:12 DemuxSummaryF1L2.txt
-rw-rw-r-- 1 dtang listerlab 23615 May 23 19:12 FastqSummaryF1L3.txt
-rw-rw-r-- 1 dtang listerlab 26022 May 23 19:12 DemuxSummaryF1L3.txt
-rw-rw-r-- 1 dtang listerlab 23632 May 23 19:12 FastqSummaryF1L4.txt
-rw-rw-r-- 1 dtang listerlab 26107 May 23 19:12 DemuxSummaryF1L4.txt
-rw-rw-r-- 1 dtang listerlab 108870 May 23 19:12 Stats.json
-rw-rw-r-- 1 dtang listerlab 12107974 May 23 19:12 ConversionStats.xml
-rw-rw-r-- 1 dtang listerlab 36013 May 23 19:12 AdapterTrimming.txt
The DemuxSummaryF1L#.txt (where # indicates lane number) is only created if the sample sheet contains at least one sample and the sample barcode is provided. The first part of the file contains summary statistics on the percentage that each sample makes up for each tile.
head DemuxSummaryF1L1.txt
SampleNumber 0 1 2 3 4
SampleName None RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control RL972_2018_05_18_scRNAseq_root_col_cvi_ws_c24_sha_control
L1T11101 14.28523 20.1733 27.09556 17.31634 21.12957
L1T11102 14.81065 19.9818 27.33801 16.71274 21.15679
L1T11103 14.42195 20.05802 27.6645 16.74197 21.11357
L1T11104 14.06736 20.15878 27.98941 16.61069 21.17377
L1T11105 13.70394 20.19893 28.19503 16.74543 21.15667
L1T11106 13.64402 20.03309 28.37754 16.70499 21.24036
L1T11107 13.15779 20.15206 28.52084 16.83371 21.3356
L1T11108 12.93981 20.24513 28.52892 16.90557 21.38056
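To summarise the per-tile table rather than reading it row by row, the percentages in one column can be averaged with awk. The sketch below assumes tile rows start with an identifier of the form L<lane>T<tile>, as in the output above, and that column 2 holds SampleNumber 0 (SampleName None). The function name mean_tile_percent is hypothetical.

```shell
# Mean percentage across tiles for one column of a DemuxSummary file.
#   $1 = DemuxSummaryF1L#.txt file
#   $2 = column number (2 = SampleNumber 0, 3 = SampleNumber 1, ...)
# Only rows whose first field looks like a tile ID (e.g. L1T11101) are used.
# (mean_tile_percent is a hypothetical helper for illustration only.)
mean_tile_percent() {
  awk -v col="$2" '
    $1 ~ /^L[0-9]+T[0-9]+$/ { sum += $col; n++ }
    END { if (n > 0) printf "%.2f\n", sum / n }
  ' "$1"
}

# Example: average share of SampleNumber 0 on lane 1, if the file is present.
if [ -f DemuxSummaryF1L1.txt ]; then
  mean_tile_percent DemuxSummaryF1L1.txt 2
fi
```

Running it for each column and each lane gives a quick check that no lane has an unusually large share of reads assigned to SampleNumber 0.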
The second part of the file (denoted by Most Popular Unknown Index Sequences) contains the 1,000 most common unknown barcode sequences and the total number of reads observed with each barcode.
grep -A 9 "Most Popular" DemuxSummaryF1L2.txt
### Most Popular Unknown Index Sequences
### Columns: Index_Sequence Hit_Count
GGGGGGGG 11290680
CCAGGATA 158760
GGGGCGGC 124440
GGGGCGGG 123720
GGGGGGGC 88520
CGGGGGGG 86420
GGGCGGGG 81160
CGGTCCGC 79400
Reads associated with these unknown barcode sequences are stored in files that begin with Undetermined_S0_, which are stored in the directory specified by --output-dir.
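To see how much each unknown barcode contributes, the hit counts can be turned into percentages of all unknown reads listed in the file. This is a minimal sketch that assumes the section layout shown above (a "### Most Popular Unknown Index Sequences" header followed by two-column rows). The function name unknown_index_share is hypothetical.

```shell
# Print the top unknown index sequences with their hit counts and their
# share of all unknown-index reads listed in the file.
#   $1 = DemuxSummaryF1L#.txt file
#   $2 = number of rows to print (default 5)
# (unknown_index_share is a hypothetical helper for illustration only.)
unknown_index_share() {
  awk -v n="${2:-5}" '
    /^### Most Popular Unknown Index Sequences/ { in_section = 1; next }
    /^###/ { next }                                  # skip other header lines
    in_section && NF == 2 { barcode[++rows] = $1; hits[rows] = $2; total += $2 }
    END {
      if (total == 0) exit
      for (i = 1; i <= rows && i <= n; i++)
        printf "%s\t%d\t%.1f%%\n", barcode[i], hits[i], 100 * hits[i] / total
    }
  ' "$1"
}

# Example: top five unknown barcodes on lane 2, if the file is present.
if [ -f DemuxSummaryF1L2.txt ]; then
  unknown_index_share DemuxSummaryF1L2.txt 5
fi
```

The percentages are relative to the unknown barcodes listed in the file, not to all reads in the lane.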
Examining the unknown barcode sequences is a good way to ensure that:
sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.5
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.0.1 purrr_0.3.1
[5] readr_1.3.1 tidyr_0.8.3 tibble_2.0.1 ggplot2_3.1.0
[9] tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 highr_0.7 cellranger_1.1.0 plyr_1.8.4
[5] pillar_1.3.1 compiler_3.5.2 git2r_0.24.0 workflowr_1.2.0
[9] tools_3.5.2 digest_0.6.18 lubridate_1.7.4 jsonlite_1.6
[13] evaluate_0.13 nlme_3.1-137 gtable_0.2.0 lattice_0.20-38
[17] pkgconfig_2.0.2 rlang_0.3.1 cli_1.0.1 rstudioapi_0.9.0
[21] yaml_2.2.0 haven_2.1.0 xfun_0.5 withr_2.1.2
[25] xml2_1.2.0 httr_1.4.0 knitr_1.21 hms_0.4.2
[29] generics_0.0.2 fs_1.2.6 rprojroot_1.3-2 grid_3.5.2
[33] tidyselect_0.2.5 glue_1.3.0 R6_2.4.0 readxl_1.3.0
[37] rmarkdown_1.11 modelr_0.1.4 magrittr_1.5 whisker_0.3-2
[41] backports_1.1.3 scales_1.0.0 htmltools_0.3.6 rvest_0.3.2
[45] assertthat_0.2.0 colorspace_1.4-0 stringi_1.3.1 lazyeval_0.2.1
[49] munsell_0.5.0 broom_0.5.1 crayon_1.3.4