Last updated: 2024-10-02
Knit directory: bioinformatics_tips/
This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
The command set.seed(20200503) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
The results in this page were generated with repository version e469b2c. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rproj.user/
Untracked files:
Untracked: script/job_name.e2
Untracked: script/job_name.o2
Untracked: script/sge.sh
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/queuing.Rmd) and HTML (docs/queuing.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | e469b2c | Dave Tang | 2024-10-02 | Example Grid Engine script |
html | 7c2efdd | Dave Tang | 2024-05-23 | Build site. |
Rmd | 4924dc4 | Dave Tang | 2024-05-23 | Sun Grid Engine |
html | e7069bf | Dave Tang | 2023-06-27 | Build site. |
html | 3e6869c | davetang | 2020-08-15 | Build site. |
Rmd | 93ad1d0 | davetang | 2020-08-15 | Update |
html | c6a497c | davetang | 2020-06-21 | Build site. |
Rmd | 3b81b96 | davetang | 2020-06-21 | Queuing systems |
If you will be using a high-performance computing (HPC) cluster for your work, you should learn to use a batch-queuing system. These systems schedule, dispatch, and manage the execution of your jobs, and they also manage resource allocation.
See comparison of cluster software.
You can configure the PBS server by setting server attributes via the qmgr command:
Qmgr: set server <attribute> = <value>
The default configuration is shown below.
qmgr
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue workq
#
create queue workq
set queue workq queue_type = Execution
set queue workq enabled = True
set queue workq started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server flatuid = True
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server max_concurrent_provision = 5
set server max_job_sequence_id = 9999999
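The set server syntax shown above can also be passed to qmgr non-interactively through its -c option. The sketch below only assembles and prints such a command so that it can be inspected before anyone runs it; the helper name qmgr_set_cmd and the value 300 are hypothetical and chosen purely for illustration.

```shell
# Assemble (but do not execute) a qmgr command that sets one server attribute.
# qmgr_set_cmd is a hypothetical helper; running the printed command requires
# a PBS installation and manager privileges.
qmgr_set_cmd() {
  printf 'qmgr -c "set server %s = %s"\n' "$1" "$2"
}

# Example: shorten the scheduler iteration from the default of 600 seconds.
qmgr_cmd=$(qmgr_set_cmd scheduler_iteration 300)
echo "$qmgr_cmd"
```

Printing the command first keeps the example safe to run on any machine, including one without PBS installed.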
PBS.
Specific tasks.
Resources.
Commonly used options.
-N : specify job name
-S : specify shell
-q : specify queue name
-l : resource=value[,resource=value]…
-o : specify standard output stream path(s)
-e : specify standard error stream path(s)
-cwd : execute the job from the current working directory
-wd : specify working directory
Example script.
cat script/sge.sh
#!/usr/bin/env bash
set -euo pipefail
#$ -N job_name
#$ -q all.q
#$ -cwd
#$ -l h_rt=01:00:00
#$ -l h_rss=30720M,mem_free=30720M
#$ -S /bin/bash
export LANGUAGE=en_AU.UTF-8
printf "Hello World!\n"
Submit.
qsub sge.sh
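The options embedded as #$ lines in the script can also be given directly to qsub on the command line. As a rough sketch, the helper below prints the qsub call that mirrors some of the directives in script/sge.sh without submitting anything; sge_submit_cmd is a hypothetical name, and the memory request is left out to keep the example short.

```shell
# Print the qsub call that mirrors the #$ directives in script/sge.sh.
# Nothing is submitted; sge_submit_cmd is a hypothetical helper.
sge_submit_cmd() {
  job_name=$1
  queue=$2
  runtime=$3
  script=$4
  printf 'qsub -N %s -q %s -cwd -l h_rt=%s -S /bin/bash %s\n' \
    "$job_name" "$queue" "$runtime" "$script"
}

submit_cmd=$(sge_submit_cmd job_name all.q 01:00:00 script/sge.sh)
echo "$submit_cmd"
```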
sbatch job_script.slurm
squeue
scancel jobid
To list partitions type:
sinfo
It is important to use the correct system and partition for each part of a workflow. To list the limits of each partition use scontrol.
scontrol show partition
Use squeue to display the status of jobs in the local cluster; the larger the priority value, the higher the priority.
squeue
# queue for specific user
squeue -u dtang
# queue for specific partition and sorted by priority
squeue -p workq -S p
Individual job information.
scontrol show job jobid
SLURM needs to know two things from you:
Try to ask for the right amount of resources because:
You cannot submit an application directly to SLURM; SLURM executes a list of shell commands on your behalf. In batch mode, SLURM executes a job script which contains the commands as a bash or csh script. In interactive mode, type in the commands just like when you log in.
sbatch interprets directives in the script, which are written as comments and not executed. The directives correspond to sbatch command-line arguments.
Below is an example script.
#!/bin/bash -l
#SBATCH --partition=workq
#SBATCH --job-name=hostname
#SBATCH --account=director2120
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
#SBATCH --export=NONE
hostname
Use --export=NONE to start with a clean environment, which improves reproducibility and avoids contamination of the job's environment.
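One way to see what a clean environment means, without involving SLURM at all, is to compare a child shell that inherits an exported variable with one started through env -i. This is only a local analogy for the idea behind --export=NONE, not a description of how SLURM implements it; MY_SETTING is a hypothetical variable.

```shell
# A variable exported in the submitting shell (hypothetical name and value).
export MY_SETTING=from_login_shell

# A normal child process inherits the exported variable.
inherited=$(/bin/sh -c 'echo "${MY_SETTING:-unset}"')

# env -i starts the child with an empty environment, so the variable is gone.
clean=$(env -i /bin/sh -c 'echo "${MY_SETTING:-unset}"')

echo "inherited environment: $inherited"
echo "clean environment:     $clean"
```

A job that silently depends on a variable such as MY_SETTING behaves differently depending on who submitted it, which is the kind of contamination a clean environment avoids.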
Use sbatch to submit the job.
sbatch hostname.slurm
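Because the directives are ordinary comments, a job script can be dry-run with plain bash before it is submitted. The sketch below writes a small throwaway script, counts the #SBATCH lines that sbatch would read, and then runs the script locally to show that bash skips those lines; the job name demo and the echoed message are made up for this example.

```shell
# Write a throwaway job script to a temporary file.
job_script=$(mktemp)
cat > "$job_script" <<'EOF'
#!/bin/bash -l
#SBATCH --job-name=demo
#SBATCH --time=00:05:00
echo "job body ran"
EOF

# Count the directive lines that sbatch would read.
directive_count=$(grep -c '^#SBATCH' "$job_script")

# Run the script with plain bash: the #SBATCH lines are skipped as comments.
job_output=$(bash "$job_script")

echo "directives: $directive_count"
echo "output: $job_output"
rm -f "$job_script"
```

A local dry run like this catches syntax errors in the job body, but it does not check that the requested resources are valid on the cluster.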
Parallel applications are launched using srun.
Use salloc instead of sbatch for interactive jobs. Use -p to request a specific partition for the resource allocation. If not specified, the default behavior is to allow the SLURM controller to select the default partition as designated by the system administrator.
salloc --ntasks=16 --time=00:10:00
srun make -j 16
When specifying the number of threads, make sure you know the parallel programming model that is used by your library or software. The manner in which you issue the number of tasks may affect how your program runs. The arguments to pay attention to are:
--ntasks=# : Number of "tasks" (use with distributed parallelism).
--ntasks-per-node=# : Number of "tasks" per node (use with distributed parallelism).
--cpus-per-task=# : Number of CPUs allocated to each task (use with shared memory parallelism).
Therefore, using --cpus-per-task ensures that the requested CPUs are allocated on the same node, while using --ntasks may spread the allocation across multiple nodes. You may get by with simply specifying --ntasks, but you should do some testing with a smaller dataset first.
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --time=04:00:00
#SBATCH --partition=workq
#SBATCH --ntasks=16
#SBATCH --export=NONE
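For shared memory programs, one common pattern is to read the thread count inside the job script from SLURM_CPUS_PER_TASK, which SLURM sets when --cpus-per-task is requested, instead of hard-coding it. The sketch below assumes a fallback of one thread when the variable is absent; threads_for_job, my_tool, its --threads flag, and input.bam are all hypothetical placeholders.

```shell
# Return the thread count for the current job: SLURM_CPUS_PER_TASK when SLURM
# has set it, otherwise 1 (for example when testing on a login node).
threads_for_job() {
  echo "${SLURM_CPUS_PER_TASK:-1}"
}

threads=$(threads_for_job)

# my_tool and its --threads flag are placeholders for a multithreaded program;
# the command is printed rather than executed.
echo "my_tool --threads $threads input.bam"
```

Tying the thread count to the allocation keeps the program from starting more threads than the CPUs it was given.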
Use job arrays to run embarrassingly parallel jobs. In the example below, we are requesting that each array task be allocated 1 CPU (--ntasks=1) and 4 GB of memory (--mem=4G) for up to one hour (--time=01:00:00).
#!/bin/bash -l
#SBATCH --job-name=array
#SBATCH --partition=workq
#SBATCH --account=director2120
#SBATCH --array=0-3
#SBATCH --output=array_%A_%a.out
#SBATCH --error=array_%A_%a.err
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --mem=4G
#SBATCH --export=NONE
FILES=(1.bam 2.bam 3.bam 4.bam)
echo ${FILES[$SLURM_ARRAY_TASK_ID]}
Use bash arrays to store chromosomes, parameters, etc. for job arrays.
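As a minimal sketch of that idea, the script below stores a made-up list of chromosomes in a bash array, derives a matching --array range from the array length, and simulates the four array tasks locally. The chromosome names and the helper chrom_for_task are assumptions for illustration; on a cluster SLURM sets SLURM_ARRAY_TASK_ID for each task, so the loop would not be needed.

```shell
#!/usr/bin/env bash
# Hypothetical list of chromosomes; one array task handles one element.
CHROMS=(chr1 chr2 chr3 chr4)

# Value for --array so that task IDs match the bash array indices (0-based).
array_range="0-$(( ${#CHROMS[@]} - 1 ))"

# Map a task ID to its chromosome, failing loudly for out-of-range IDs.
chrom_for_task() {
  local task_id=$1
  if [ "$task_id" -lt 0 ] || [ "$task_id" -ge "${#CHROMS[@]}" ]; then
    echo "task ID $task_id is outside $array_range" >&2
    return 1
  fi
  echo "${CHROMS[$task_id]}"
}

echo "#SBATCH --array=$array_range"

# Simulate the array tasks locally; SLURM sets SLURM_ARRAY_TASK_ID itself.
for SLURM_ARRAY_TASK_ID in 0 1 2 3; do
  echo "task $SLURM_ARRAY_TASK_ID -> $(chrom_for_task "$SLURM_ARRAY_TASK_ID")"
done
```

Deriving the range from the array length keeps --array and the list in step when elements are added or removed, and the bounds check turns a mismatched task ID into an error instead of an empty value.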
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] DT_0.33 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[5] dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1
[9] tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 workflowr_1.7.1
loaded via a namespace (and not attached):
[1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 stringi_1.8.4
[5] hms_1.1.3 digest_0.6.35 magrittr_2.0.3 timechange_0.3.0
[9] evaluate_0.24.0 grid_4.4.0 fastmap_1.2.0 rprojroot_2.0.4
[13] jsonlite_1.8.8 processx_3.8.4 whisker_0.4.1 ps_1.7.6
[17] promises_1.3.0 httr_1.4.7 fansi_1.0.6 crosstalk_1.2.1
[21] scales_1.3.0 jquerylib_0.1.4 cli_3.6.2 rlang_1.1.4
[25] munsell_0.5.1 withr_3.0.0 cachem_1.1.0 yaml_2.3.8
[29] tools_4.4.0 tzdb_0.4.0 colorspace_2.1-0 httpuv_1.6.15
[33] vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4 git2r_0.33.0
[37] htmlwidgets_1.6.4 fs_1.6.4 pkgconfig_2.0.3 callr_3.7.6
[41] pillar_1.9.0 bslib_0.7.0 later_1.3.2 gtable_0.3.5
[45] glue_1.7.0 Rcpp_1.0.12 xfun_0.44 tidyselect_1.2.1
[49] rstudioapi_0.16.0 knitr_1.47 htmltools_0.5.8.1 rmarkdown_2.27
[53] compiler_4.4.0 getPass_0.2-4