Using the Bioconductor GenomicsRanges package

Last updated: 2023-02-13

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 1242888

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 1242888. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.1.2/
    Ignored:    r_packages_4.2.0/

Untracked files:
    Untracked:  analysis/cell_ranger.Rmd
    Untracked:  analysis/tss_xgboost.Rmd
    Untracked:  data/HG00702_SH089_CHSTrio.chr1.vcf.gz
    Untracked:  data/HG00702_SH089_CHSTrio.chr1.vcf.gz.tbi
    Untracked:  data/ncrna_NONCODE[v3.0].fasta.tar.gz
    Untracked:  data/ncrna_noncode_v3.fa

Unstaged changes:
    Modified:   analysis/graph.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/genomic_ranges.Rmd) and HTML (docs/genomic_ranges.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	1242888	Dave Tang	2023-02-13	GenomicRanges

From the introductory article for GenomicRanges:

The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. In the Bioconductor package hierarchy, it builds upon the IRanges (infrastructure) package and provides support for the BSgenome (infrastructure), Rsamtools (I/O), ShortRead (I/O & QA), rtracklayer (I/O), GenomicFeatures (infrastructure), GenomicAlignments (sequence reads), VariantAnnotation (called variants), and many other Bioconductor packages.

This package lays a foundation for genomic analysis by introducing three classes (GRanges, GPos, and GRangesList), which are used to represent genomic ranges, genomic positions, and groups of genomic ranges. This vignette focuses on the GRanges and GRangesList classes and their associated methods.

Installation

To being, install GenomicRanges.

if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

if (!require("GenomicRanges", quietly = TRUE))
  BiocManager::install("GenomicRanges")

library(GenomicRanges)

Getting started

The introduction article starts with creating a GRanges object:

The GRanges class represents a collection of genomic features that each have a single start and end location on the genome. This includes features such as contiguous binding sites, transcripts, and exons. These objects can be created by using the GRanges constructor function.

gr <- GRanges(
  seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
  ranges = IRanges(101:110, end = 111:120, names = letters[1:10]),
  strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
  score = 1:10,
  GC = seq(1, 0, length=10)
)
 
class(gr)

[1] "GRanges"
attr(,"package")
[1] "GenomicRanges"

The GRanges() function was used to specify a list of sequence names, their ranges, strand, score, and GC content. For the seqnames and strand, the Rle() function was used; RLE is an abbreviation for run-length encoding, which is a way of representing and compressing data. The function saves us from typing c("chr1", "chr2", "chr2", "chr2", "chr1", "chr1", "chr3", "chr3", "chr3", "chr3"), which would also work.

The IRanges() function was used to create a vector representation of sequence ranges; 10 ranges were created and named using the first ten letters of the alphabet.

Rle() was also used to indicate the strandedness of the ranges.

Metadata can also be added to a GRanges object and in the example, a score and the GC content were included.

The genomic coordinates (seqnames, ranges, and strand) are displayed on the left-hand side and the metadata columns (annotations) are displayed on the right-hand side.

gr

GRanges object with 10 ranges and 2 metadata columns:
    seqnames    ranges strand |     score        GC
       <Rle> <IRanges>  <Rle> | <integer> <numeric>
  a     chr1   101-111      - |         1  1.000000
  b     chr2   102-112      + |         2  0.888889
  c     chr2   103-113      + |         3  0.777778
  d     chr2   104-114      * |         4  0.666667
  e     chr1   105-115      * |         5  0.555556
  f     chr1   106-116      + |         6  0.444444
  g     chr3   107-117      + |         7  0.333333
  h     chr3   108-118      + |         8  0.222222
  i     chr3   109-119      - |         9  0.111111
  j     chr3   110-120      - |        10  0.000000
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

Each component of the genomic coordinates can be extracted using the accessor functions that have the same name as the column names. For example to retrieve the ranges we use the ranges() function.

ranges(gr)

IRanges object with 10 ranges and 0 metadata columns:
        start       end     width
    <integer> <integer> <integer>
  a       101       111        11
  b       102       112        11
  c       103       113        11
  d       104       114        11
  e       105       115        11
  f       106       116        11
  g       107       117        11
  h       108       118        11
  i       109       119        11
  j       110       120        11

Metadata is extracted using the mcols() function.

mcols(gr)

DataFrame with 10 rows and 2 columns
      score        GC
  <integer> <numeric>
a         1  1.000000
b         2  0.888889
c         3  0.777778
d         4  0.666667
e         5  0.555556
f         6  0.444444
g         7  0.333333
h         8  0.222222
i         9  0.111111
j        10  0.000000

BED to GRanges

The Browser Extensible Data (BED) format is quite ubiquitous in bioinformatics, so it is useful to know how a GRanges object can be created from a BED file.

We first create a data frame from a small BED file hosted on my web server and then we create a GRanges object using the data frame. Since BED is 0-based, we add one to make it 1-based.

data <- read.table(
  "https://davetang.org/file/granges.bed",
  col.names = c('chr','start','end','id','score','strand')
)
 
bed <- with(data, GRanges(chr, IRanges(start+1L, end), strand, score, refseq=id))
bed

GRanges object with 6 ranges and 2 metadata columns:
      seqnames            ranges strand |     score       refseq
         <Rle>         <IRanges>  <Rle> | <integer>  <character>
  [1]     chr1 66999825-67210768      + |         0    NM_032291
  [2]     chr1 33546714-33585995      + |         0    NM_052998
  [3]     chr1 48998527-50489626      - |         0    NM_032785
  [4]     chr1 16767167-16786584      + |         0 NM_001145278
  [5]     chr1 16767167-16786584      + |         0 NM_001145277
  [6]     chr1   8384390-8404227      + |         0 NM_001080397
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

The elementMetadata function can be used to fetch metadata.

elementMetadata(bed)

DataFrame with 6 rows and 2 columns
      score       refseq
  <integer>  <character>
1         0    NM_032291
2         0    NM_052998
3         0    NM_032785
4         0 NM_001145278
5         0 NM_001145277
6         0 NM_001080397

The with() function used to create the GRanges object is nice because it saves us some typing; we can directly refer to the column names.

bed <- with(
  data,
  GRanges(
    chr,
    IRanges(start+1, end),
    strand,
    score,
    refseq=id
  )
)
 
bed2 <- GRanges(
  data$chr,
  IRanges(data$start+1L, data$end),
  data$strand,
  score = data$score,
  refseq = data$id
)
 
identical(bed, bed2)

[1] TRUE

Let’s load another BED file to demonstrate how we can intersect ranges.

data2 <- read.table(
  "https://davetang.org/file/granges2.bed",
  col.names = c('chr','start','end','id','score','strand')
)
 
bed2 <- with(data2, GRanges(chr, IRanges(start+1, end), strand, score, refseq = id))
bed2

GRanges object with 4 ranges and 2 metadata columns:
      seqnames    ranges strand |     score      refseq
         <Rle> <IRanges>  <Rle> | <integer> <character>
  [1]     chr1  66999823      + |         0        1_no
  [2]     chr1  33585997      + |         0        2_no
  [3]     chr1  16786584      + |         0       3_yes
  [4]     chr1   8384390      + |         0       4_yes
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

We can use the intersect function to perform an intersect on the two GRanges objects; note that ignore.strand=FALSE is the default, which means the strand information is taken into account (but not illustrated in my example).

GenomicRanges::intersect(bed, bed2)

GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   8384390      +
  [2]     chr1  16786584      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that the metadata from both BED files is lost. Thanks to a suggestion by Daniel, we can use the subsetByOverlaps() function to find overlapping genomic ranges and returns results that retain the metadata. This function returns an additional result because when considering the refseq metadata, there are two unique ranges.

subsetByOverlaps(bed, bed2)

GRanges object with 3 ranges and 2 metadata columns:
      seqnames            ranges strand |     score       refseq
         <Rle>         <IRanges>  <Rle> | <integer>  <character>
  [1]     chr1 16767167-16786584      + |         0 NM_001145278
  [2]     chr1 16767167-16786584      + |         0 NM_001145277
  [3]     chr1   8384390-8404227      + |         0 NM_001080397
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Matching the overlaps

In the previous example, subsetByOverlaps() does not indicate which features overlapped, only that there is an overlap. In this section, we will demonstrate the use of the GenomicRanges functions countOverlaps(), findOverlaps(), queryHits(), and subjectHits().

my_ranges <- GRanges(
  seqnames = rep('chr1',6),
  ranges = IRanges(
    start = c(66999824,33546713,48998526,16767166,16767166,8384389),
    end = c(67210768,33585995,50489626,16786584,16786584,8404227)
  ),
  strand = c('+','+','-','+','+','+'),
  score = rep(0,6),
  refseq = c('NM_032291','NM_052998','NM_032785','NM_001145278','NM_001145277','NM_001080397')
)
 
my_snps <- GRanges(
  seqnames = rep('chr1',3),
  ranges = IRanges(
    start = c(66999900,33546793,8384389),
    end = c(66999901,33546794,8384390)
  ),
  strand = c('+','+','+'),
  id = c('snp1','snp2','snp3'),
  score = rep(0,3)
)

The function countOverlaps() counts the overlaps with respect to the first GRanges object.

countOverlaps(my_ranges, my_snps)

[1] 1 1 0 0 0 1

The function findOverlaps() returns the matching indexes between two GRanges objects; a subject and a query.

my_overlaps <- findOverlaps(
  subject = my_ranges,
  query = my_snps
)
my_overlaps

Hits object with 3 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         2           2
  [3]         3           6
  -------
  queryLength: 3 / subjectLength: 6

Finally, we create a data frame that displays the matches by their metadata; the functions queryHits() and subjectHits() are used to retrieve the match indexes from the findOverlaps() result and these indexes are used to retrieve the metadata. This process is similar to using the base R function match().

my_query <- queryHits(my_overlaps)
my_subject <- subjectHits(my_overlaps)
 
data.frame(
  subject = my_ranges[my_subject]$refseq,
  query = my_snps[my_query]$id
)

       subject query
1    NM_032291  snp1
2    NM_052998  snp2
3 NM_001080397  snp3

Using the Bioconductor GenomicsRanges package

2023-02-13

Installation

Getting started

BED to GRanges

Matching the overlaps

Further reading