CAGE data analysis

MODHEP workshop

Dave Tang
PhD candidate

Download

Welcome!

  • My name is Dave Tang; I was born in Hong Kong but raised in Papua New Guinea.
  • I'm a PhD candidate in bioinformatics at the VU University Amsterdam.
  • I also maintain a blog, where I write about bioinformatics.
  • I'm here to talk about analysing CAGE data.

About this presentation

  • I had a lot of difficulty deciding what to present in this talk.
  • I started off extremely ambiguous and believed that it is possible to show you how to analyse CAGE data in under two hours.
  • Then I realised that this isn't possible.
  • This presentation will provide an overview of analysing CAGE data
  • There are links on each slide for those who want to learn more.

Slidify

Promoter

Cap Analysis Gene Expression (CAGE)

CAGE protocol 1

Image source: me.

CAGE

Why do we perform CAGE?

A typical CAGE analysis pipeline

  • The output of CAGE is a set of short nucleotide sequences and their counts.
  • Typical steps in a CAGE analysis pipeline include:
  • Tag extraction
  • Quality control steps
  • Mapping tags to a reference genome
  • Tag clustering
  • Tag cluster annotation
  • Data analysis (e.g. differential expression analysis, co-expression networks, etc.)

ENCODE CAGE data

Download these BAM files into some directory:

path=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage
curl -O $path/wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz
curl -O $path/wgEncodeRikenCageHelas3CellPapAlnRep1.bam
curl -O $path/wgEncodeRikenCageHelas3CellPapAlnRep2.bam
curl -O $path/wgEncodeRikenCageHelas3CytosolPapAlnRep1.bam
curl -O $path/wgEncodeRikenCageHelas3CytosolPapAlnRep2.bam
curl -O $path/wgEncodeRikenCageHelas3NucleusPapAlnRep1.bam
curl -O $path/wgEncodeRikenCageHelas3NucleusPapAlnRep2.bam

Raw sequencing data

A fastq file contains all the reads from the sequencer.

gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | head -4
@HWUSI-EAS566_0007:5:1:965:4705#0/1
NAGCAGCAGGGGAGAGTGTCATGGAGGCCTACGAGC
+HWUSI-EAS566_0007:5:1:965:4705#0/1
BYZZY\\[\[^^IXMVUUUX\]\\X[ZZ\]``^__^

For each read, there are four lines of information.

  1. The first line is the read name and contains information on the read
  2. The second line is the raw sequence of the CAGE tag
  3. The third line can be used to store optional notes
  4. The fourth line denotes the quality of each base call, i.e. how sure the machine is that an A is an A.

Tag extraction

gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | head -4
@HWUSI-EAS566_0007:5:1:965:4705#0/1
NAGCAGCAGGGGAGAGTGTCATGGAGGCCTACGAGC
+HWUSI-EAS566_0007:5:1:965:4705#0/1
BYZZY\\[\[^^IXMVUUUX\]\\X[ZZ\]``^__^

gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | perl -nle 'if (($. - 2) % 4 == 0){ print substr($_,0,3)}' | sort | uniq -c | sort -k1,1rn | head -4
25596580 TAG
300377 TAT
158391 TCG
142412 CAG

The barcode for this library was TAG. The http://hannonlab.cshl.edu/fastx_toolkit/ contains tools for tag extraction. See also http://tagdust.sourceforge.net/, which uses Hidden Markov Models for tag extraction (and performs quality control steps).

Quality control

gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | head -4
@HWUSI-EAS566_0007:5:1:965:4705#0/1
NAGCAGCAGGGGAGAGTGTCATGGAGGCCTACGAGC
+HWUSI-EAS566_0007:5:1:965:4705#0/1
BYZZY\\[\[^^IXMVUUUX\]\\X[ZZ\]``^__^

We would remove this read because it contains an ambiguous base call, especially since it's in the barcode. The quality scores range from 3 to 40:

3------------------------------------40

BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh

Various different criteria to remove reads, e.g. remove reads with 10 base calls that have a quality less than 10, etc. The tool http://samstat.sourceforge.net/ can be used to calculate and visualise read qualities.

A note about quality scores

p <- function(q){ return(10^{-q/10}) }
p(10)
## [1] 0.1
plot(3:37, 1-p(3:37), xaxt='n', type='l', xlab='Quality score', ylab='Probability correct')
axis(1, 0:40); abline(h=0.9, v=10)

plot of chunk unnamed-chunk-1

Mapping

  • High-throughput sequencing results in millions to billions of reads
  • A computational strategy known as "indexing" speeds up mapping algorithms.
  • It works like a book index; an index of a large DNA sequence allows one to rapidly find shorter sequences embedded within it. See How to map billions of short reads onto genomes.
  • The confidence of mapping is indicated by the mapping quality, which uses the same scale as the base calling qualities.
  • Pipelines may remove reads that have a low mapping quality.
  • I have a post on mapping qualities.

SAM/BAM

A SAM/BAM file contains alignment information of a single read to a reference genome.

samtools view wgEncodeRikenCageHelas3CellPapAlnRep1.bam | head -3
HWUSI-EAS566_0009:3:16:8371:2418#0|TAG  16  chr1    16213   1   27M *   0   0   GCCATGCTCTGACAGTCTCAGTTGCAC fefffffffffffffdfefddffdddd NM:i:0  MD:Z:27 XP:Z:~~~~-----000077777,,,,9998:
HWUSI-EAS566_0007:5:11:4550:12891#0|TAG 16  chr1    16445   17  27M *   0   0   AGTTTGAAAACCACTATTTTATGAACC fefffffdffffffeffdffffeefff NM:i:0  MD:Z:27 XP:Z:{{~~~~~~~~~~LO?~
HWUSI-EAS566_0007:5:63:12130:4370#0|TAG 0   chr1    16446   17  27M *   0   0   GTTTGAAAACCACTATTTTATGAACCA fffffdfffffdffffffffffefffe NM:i:0  MD:Z:27 XP:Z:~~~~~~~~~~~~}}L~

If you are confused about the second column, have a look at http://davetang.org/muse/2014/03/06/understanding-bam-flags/.

CAGE defined transcription starting sites

CTSS

Image source: me.

Tag cluster annotation

Intersect/overlap tag clusters to transcript annotations

UCSC Genome Browser screenshot

Intermission

  • Any questions so far?

A few words on R before we begin

  • I have been using R on and off for a couple of years and it takes a while to get used to.
  • Honest confession: I am not very good with R (I keep a lot of documentation to make up for this).
  • I think it is worthwhile learning because a lot of genomics analysis packages are available via the Bioconductor project.
  • By default, data loaded into R is stored into memory; the bigger your data, the more memory will be used.

Bioconductor

  • From Wikipedia: Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.
  • Provides state of the art software to analyse various genomic datasets.
  • Has very well written guides, called vignettes, aimed at biologists.
  • To learn more take a look at these courses, which are provided by the Bioconductor team.

Loading data into R

Download the data for this paper (right-click on data and save as).

getwd()
## [1] "/Users/davetang/slidify_test"
setwd("/Users/davetang/slidify_test/data/")
data <- read.table("pnas_expression.txt", header=TRUE, sep="\t", stringsAsFactors = FALSE)
head(data, 5)
##        ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8  len
## 1 ENSG00000215696     0     0     0     0     0     0     0  330
## 2 ENSG00000215700     0     0     0     0     0     0     0 2370
## 3 ENSG00000215699     0     0     0     0     0     0     0 1842
## 4 ENSG00000215784     0     0     0     0     0     0     0 2393
## 5 ENSG00000212914     0     0     0     0     0     0     0  384

Working with data frames

class(data)
## [1] "data.frame"
dim(data)
## [1] 37435     9
data[1,]
##        ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8 len
## 1 ENSG00000215696     0     0     0     0     0     0     0 330

Removing a column

data_subset <- data[,-9]
dim(data_subset)
## [1] 37435     8
head(data_subset)
##        ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8
## 1 ENSG00000215696     0     0     0     0     0     0     0
## 2 ENSG00000215700     0     0     0     0     0     0     0
## 3 ENSG00000215699     0     0     0     0     0     0     0
## 4 ENSG00000215784     0     0     0     0     0     0     0
## 5 ENSG00000212914     0     0     0     0     0     0     0
## 6 ENSG00000212042     0     0     0     0     0     0     0

Naming the rows

row.names(data_subset) <- data_subset[,1]
data_subset <- data_subset[,-1]
head(data_subset)
##                 lane1 lane2 lane3 lane4 lane5 lane6 lane8
## ENSG00000215696     0     0     0     0     0     0     0
## ENSG00000215700     0     0     0     0     0     0     0
## ENSG00000215699     0     0     0     0     0     0     0
## ENSG00000215784     0     0     0     0     0     0     0
## ENSG00000212914     0     0     0     0     0     0     0
## ENSG00000212042     0     0     0     0     0     0     0

Plotting the rowSums

plot(density(log2(rowSums(data_subset))), xlab="Gene expression (log2)")

plot of chunk unnamed-chunk-6

Independent filtering

Differential expression

source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")
library(edgeR)
group <- c(rep("Control",4), rep("Test",3))
d <- DGEList(counts = data_subset, group=group)
#data normalisation
d <- calcNormFactors(d, method="TMM")
d <- estimateDisp(d)
et <- exactTest(d)
summary(de <- decideTestsDGE(et, p=0.05, adjust="BH"))
##    [,1]
## -1 2119
## 0  7215
## 1  2205

Differentially expressed genes

detags <- rownames(d)[as.logical(de)]
plotSmear(et, de.tags=detags)
abline(h = c(-2, 2), col = "blue")

plot of chunk unnamed-chunk-10

Differentially expression genes

topTags(et, n=5)
## Comparison of groups:  Test-Control 
##                    logFC    logCPM        PValue           FDR
## ENSG00000151503 5.789617  9.719792  0.000000e+00  0.000000e+00
## ENSG00000096060 4.977976  9.954372  0.000000e+00  0.000000e+00
## ENSG00000162772 3.289656  9.749036 6.589017e-263 2.534356e-259
## ENSG00000166451 4.658056  8.855415 8.833771e-254 2.548322e-250
## ENSG00000115648 2.571817 11.481226 3.797278e-236 8.763359e-233
data_subset['ENSG00000151503',]
##                 lane1 lane2 lane3 lane4 lane5 lane6 lane8
## ENSG00000151503    35    35    49    59  3307  3439  1224

Saving the results

my_genes <- data_subset[row.names(topTags(et, n=100)),]
head(my_genes, 3)
##                 lane1 lane2 lane3 lane4 lane5 lane6 lane8
## ENSG00000151503    35    35    49    59  3307  3439  1224
## ENSG00000096060    65    79   105   113  3975  3727  1451
## ENSG00000162772   172   204   250   304  2972  3269  1112
write.table(my_genes, file = "my_genes.tsv", quote = FALSE, sep = "\t")

Intermission 2

  • Any questions or comments?

Correlation

Pairwise comparisons

#a vs. b, a vs. c, b vs. c
choose(3,2)
## [1] 3
choose(1000, 2)
## [1] 499500
choose(30000, 2)
## [1] 449985000

Pairwise comparisons

plot(1000:30000, log2(choose(1000:30000, 2)), type='l', xlab="n", ylab="Pairwise comparisons (log2)")

plot of chunk unnamed-chunk-14

Hierarchical clustering

h <- hclust(dist(t(data_subset)))
plot(h)

plot of chunk unnamed-chunk-15

Distance measure

See http://davetang.org/muse/2013/08/15/distance-matrix-computation/.

dist(t(data_subset))
##           lane1     lane2     lane3     lane4     lane5     lane6
## lane2  4874.042                                                  
## lane3 11119.401  7469.846                                        
## lane4 12085.221  8346.255  3337.774                              
## lane5 27151.883 24875.433 22593.797 22219.560                    
## lane6 27339.010 25050.592 22690.310 22325.354  3573.774          
## lane8 12519.767 15546.218 21483.780 22298.779 28479.141 28722.565

Heatmap

CAGEr

The CAGEr package available on Bioconductor provides various methods for analysing CAGE data. To install the CAGEr package:

source("http://bioconductor.org/biocLite.R")
biocLite("CAGEr")

To load the CAGEr package:

library(CAGEr)

What CAGEr can do

Some more resources