CAGE data analysis

MODHEP workshop

Dave Tang
PhD candidate

Download

Welcome!

My name is Dave Tang; I was born in Hong Kong but raised in Papua New Guinea.
I'm a PhD candidate in bioinformatics at the VU University Amsterdam.
I also maintain a blog, where I write about bioinformatics.
I'm here to talk about analysing CAGE data.

About this presentation

I had a lot of difficulty deciding what to present in this talk.
I started off extremely ambiguous and believed that it is possible to show you how to analyse CAGE data in under two hours.
Then I realised that this isn't possible.
This presentation will provide an overview of analysing CAGE data
There are links on each slide for those who want to learn more.

Slidify

These slides were made using an R package called Slidify.
All the slides and the output are generated by R; it's like embedding analysis code in PowerPoint but also having PowerPoint perform the analysis.
This is the first time I've used Slidify for a presentation.
These slides can be viewed at http://davetang.github.io/cage_r
If you're interested in Slidify, have a look at http://davetang.org/muse/2014/12/12/making-slides-using-r/ and the links within.

Promoter

A promoter is a region of DNA that initiates transcription

Eukaryotic promoter

Source: http://www.nature.com/scitable/topicpage/dna-transcription-426

Cap Analysis Gene Expression (CAGE)

CAGE protocol 1

Image source: me.

CAGE

CAGE protocol 2

For the full protocol see http://www.ncbi.nlm.nih.gov/pubmed/22362160.

Why do we perform CAGE?

Quantify the expression of transcripts
Study promoter structure and usage
Discover new promoters
Discover regulatory elements
For CAGE versus RNA-Seq see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3975069/

A typical CAGE analysis pipeline

The output of CAGE is a set of short nucleotide sequences and their counts.
Typical steps in a CAGE analysis pipeline include:
Tag extraction
Quality control steps
Mapping tags to a reference genome
Tag clustering
Tag cluster annotation
Data analysis (e.g. differential expression analysis, co-expression networks, etc.)

ENCODE CAGE data

Download these BAM files into some directory:

path=http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage
curl -O $path/wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz
curl -O $path/wgEncodeRikenCageHelas3CellPapAlnRep1.bam
curl -O $path/wgEncodeRikenCageHelas3CellPapAlnRep2.bam
curl -O $path/wgEncodeRikenCageHelas3CytosolPapAlnRep1.bam
curl -O $path/wgEncodeRikenCageHelas3CytosolPapAlnRep2.bam
curl -O $path/wgEncodeRikenCageHelas3NucleusPapAlnRep1.bam
curl -O $path/wgEncodeRikenCageHelas3NucleusPapAlnRep2.bam

Raw sequencing data

A fastq file contains all the reads from the sequencer.

gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | head -4
@HWUSI-EAS566_0007:5:1:965:4705#0/1
NAGCAGCAGGGGAGAGTGTCATGGAGGCCTACGAGC
+HWUSI-EAS566_0007:5:1:965:4705#0/1
BYZZY\\[\[^^IXMVUUUX\]\\X[ZZ\]``^__^

For each read, there are four lines of information.

The first line is the read name and contains information on the read
The second line is the raw sequence of the CAGE tag
The third line can be used to store optional notes
The fourth line denotes the quality of each base call, i.e. how sure the machine is that an A is an A.

Tag extraction

gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | head -4
@HWUSI-EAS566_0007:5:1:965:4705#0/1
NAGCAGCAGGGGAGAGTGTCATGGAGGCCTACGAGC
+HWUSI-EAS566_0007:5:1:965:4705#0/1
BYZZY\\[\[^^IXMVUUUX\]\\X[ZZ\]``^__^

gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | perl -nle 'if (($. - 2) % 4 == 0){ print substr($_,0,3)}' | sort | uniq -c | sort -k1,1rn | head -4
25596580 TAG
300377 TAT
158391 TCG
142412 CAG

The barcode for this library was TAG. The http://hannonlab.cshl.edu/fastx_toolkit/ contains tools for tag extraction. See also http://tagdust.sourceforge.net/, which uses Hidden Markov Models for tag extraction (and performs quality control steps).

Quality control

gunzip -c wgEncodeRikenCageHelas3CellPapRawDataRep1.fastq.gz | head -4
@HWUSI-EAS566_0007:5:1:965:4705#0/1
NAGCAGCAGGGGAGAGTGTCATGGAGGCCTACGAGC
+HWUSI-EAS566_0007:5:1:965:4705#0/1
BYZZY\\[\[^^IXMVUUUX\]\\X[ZZ\]``^__^

We would remove this read because it contains an ambiguous base call, especially since it's in the barcode. The quality scores range from 3 to 40:

3------------------------------------40

BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh

Various different criteria to remove reads, e.g. remove reads with 10 base calls that have a quality less than 10, etc. The tool http://samstat.sourceforge.net/ can be used to calculate and visualise read qualities.

A note about quality scores

p <- function(q){ return(10^{-q/10}) }
p(10)

## [1] 0.1

plot(3:37, 1-p(3:37), xaxt='n', type='l', xlab='Quality score', ylab='Probability correct')
axis(1, 0:40); abline(h=0.9, v=10)

plot of chunk unnamed-chunk-1

Mapping

High-throughput sequencing results in millions to billions of reads
A computational strategy known as "indexing" speeds up mapping algorithms.
It works like a book index; an index of a large DNA sequence allows one to rapidly find shorter sequences embedded within it. See How to map billions of short reads onto genomes.
The confidence of mapping is indicated by the mapping quality, which uses the same scale as the base calling qualities.
Pipelines may remove reads that have a low mapping quality.
I have a post on mapping qualities.

SAM/BAM

A SAM/BAM file contains alignment information of a single read to a reference genome.

samtools view wgEncodeRikenCageHelas3CellPapAlnRep1.bam | head -3
HWUSI-EAS566_0009:3:16:8371:2418#0|TAG  16  chr1    16213   1   27M *   0   0   GCCATGCTCTGACAGTCTCAGTTGCAC fefffffffffffffdfefddffdddd NM:i:0  MD:Z:27 XP:Z:~~~~-----000077777,,,,9998:
HWUSI-EAS566_0007:5:11:4550:12891#0|TAG 16  chr1    16445   17  27M *   0   0   AGTTTGAAAACCACTATTTTATGAACC fefffffdffffffeffdffffeefff NM:i:0  MD:Z:27 XP:Z:{{~~~~~~~~~~LO?~
HWUSI-EAS566_0007:5:63:12130:4370#0|TAG 0   chr1    16446   17  27M *   0   0   GTTTGAAAACCACTATTTTATGAACCA fffffdfffffdffffffffffefffe NM:i:0  MD:Z:27 XP:Z:~~~~~~~~~~~~}}L~

If you are confused about the second column, have a look at http://davetang.org/muse/2014/03/06/understanding-bam-flags/.

CAGE defined transcription starting sites

CTSS

Image source: me.

Tag cluster annotation

Intersect/overlap tag clusters to transcript annotations

UCSC Genome Browser screenshot

Intermission

Any questions so far?

A few words on R before we begin

I have been using R on and off for a couple of years and it takes a while to get used to.
Honest confession: I am not very good with R (I keep a lot of documentation to make up for this).
I think it is worthwhile learning because a lot of genomics analysis packages are available via the Bioconductor project.
By default, data loaded into R is stored into memory; the bigger your data, the more memory will be used.

Bioconductor

From Wikipedia: Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.
Provides state of the art software to analyse various genomic datasets.
Has very well written guides, called vignettes, aimed at biologists.
To learn more take a look at these courses, which are provided by the Bioconductor team.

Loading data into R

Download the data for this paper (right-click on data and save as).

getwd()

## [1] "/Users/davetang/slidify_test"

setwd("/Users/davetang/slidify_test/data/")
data <- read.table("pnas_expression.txt", header=TRUE, sep="\t", stringsAsFactors = FALSE)
head(data, 5)

##        ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8  len
## 1 ENSG00000215696     0     0     0     0     0     0     0  330
## 2 ENSG00000215700     0     0     0     0     0     0     0 2370
## 3 ENSG00000215699     0     0     0     0     0     0     0 1842
## 4 ENSG00000215784     0     0     0     0     0     0     0 2393
## 5 ENSG00000212914     0     0     0     0     0     0     0  384

Working with data frames

class(data)

## [1] "data.frame"

dim(data)

## [1] 37435     9

data[1,]

##        ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8 len
## 1 ENSG00000215696     0     0     0     0     0     0     0 330

Removing a column

data_subset <- data[,-9]
dim(data_subset)

## [1] 37435     8

head(data_subset)

##        ensembl_ID lane1 lane2 lane3 lane4 lane5 lane6 lane8
## 1 ENSG00000215696     0     0     0     0     0     0     0
## 2 ENSG00000215700     0     0     0     0     0     0     0
## 3 ENSG00000215699     0     0     0     0     0     0     0
## 4 ENSG00000215784     0     0     0     0     0     0     0
## 5 ENSG00000212914     0     0     0     0     0     0     0
## 6 ENSG00000212042     0     0     0     0     0     0     0

Naming the rows

row.names(data_subset) <- data_subset[,1]
data_subset <- data_subset[,-1]
head(data_subset)

##                 lane1 lane2 lane3 lane4 lane5 lane6 lane8
## ENSG00000215696     0     0     0     0     0     0     0
## ENSG00000215700     0     0     0     0     0     0     0
## ENSG00000215699     0     0     0     0     0     0     0
## ENSG00000215784     0     0     0     0     0     0     0
## ENSG00000212914     0     0     0     0     0     0     0
## ENSG00000212042     0     0     0     0     0     0     0

Plotting the rowSums

plot(density(log2(rowSums(data_subset))), xlab="Gene expression (log2)")

plot of chunk unnamed-chunk-6

Independent filtering

dim(data_subset)

## [1] 37435     7

data_subset <- subset(data_subset, rowSums(data_subset)>50)
dim(data_subset)

## [1] 11539     7

See http://www.ncbi.nlm.nih.gov/pubmed/20460310.

Differential expression

source("http://bioconductor.org/biocLite.R")
biocLite("edgeR")

library(edgeR)
group <- c(rep("Control",4), rep("Test",3))
d <- DGEList(counts = data_subset, group=group)
#data normalisation
d <- calcNormFactors(d, method="TMM")
d <- estimateDisp(d)
et <- exactTest(d)
summary(de <- decideTestsDGE(et, p=0.05, adjust="BH"))

##    [,1]
## -1 2119
## 0  7215
## 1  2205

Differentially expressed genes

detags <- rownames(d)[as.logical(de)]
plotSmear(et, de.tags=detags)
abline(h = c(-2, 2), col = "blue")

plot of chunk unnamed-chunk-10

Differentially expression genes

topTags(et, n=5)

## Comparison of groups:  Test-Control 
##                    logFC    logCPM        PValue           FDR
## ENSG00000151503 5.789617  9.719792  0.000000e+00  0.000000e+00
## ENSG00000096060 4.977976  9.954372  0.000000e+00  0.000000e+00
## ENSG00000162772 3.289656  9.749036 6.589017e-263 2.534356e-259
## ENSG00000166451 4.658056  8.855415 8.833771e-254 2.548322e-250
## ENSG00000115648 2.571817 11.481226 3.797278e-236 8.763359e-233

data_subset['ENSG00000151503',]

##                 lane1 lane2 lane3 lane4 lane5 lane6 lane8
## ENSG00000151503    35    35    49    59  3307  3439  1224

Saving the results

my_genes <- data_subset[row.names(topTags(et, n=100)),]
head(my_genes, 3)

##                 lane1 lane2 lane3 lane4 lane5 lane6 lane8
## ENSG00000151503    35    35    49    59  3307  3439  1224
## ENSG00000096060    65    79   105   113  3975  3727  1451
## ENSG00000162772   172   204   250   304  2972  3269  1112

write.table(my_genes, file = "my_genes.tsv", quote = FALSE, sep = "\t")

Intermission 2

Any questions or comments?

Correlation

See http://davetang.org/muse/2012/01/31/creating-a-correlation-matrix-with-r/

Co-expression

Image source: adpated from http://www.ncbi.nlm.nih.gov/pubmed/20628352.

Pairwise comparisons

#a vs. b, a vs. c, b vs. c
choose(3,2)

## [1] 3

choose(1000, 2)

## [1] 499500

choose(30000, 2)

## [1] 449985000

Pairwise comparisons

plot(1000:30000, log2(choose(1000:30000, 2)), type='l', xlab="n", ylab="Pairwise comparisons (log2)")

plot of chunk unnamed-chunk-14

Hierarchical clustering

h <- hclust(dist(t(data_subset)))
plot(h)

plot of chunk unnamed-chunk-15

Distance measure

See http://davetang.org/muse/2013/08/15/distance-matrix-computation/.

dist(t(data_subset))

##           lane1     lane2     lane3     lane4     lane5     lane6
## lane2  4874.042                                                  
## lane3 11119.401  7469.846                                        
## lane4 12085.221  8346.255  3337.774                              
## lane5 27151.883 24875.433 22593.797 22219.560                    
## lane6 27339.010 25050.592 22690.310 22325.354  3573.774          
## lane8 12519.767 15546.218 21483.780 22298.779 28479.141 28722.565

Heatmap

See http://davetang.org/muse/2010/12/06/making-a-heatmap-with-r/.

#install.packages("gplots")
library(gplots)
heatmap.2(as.matrix(my_genes), scale="row", trace="none")

plot of chunk unnamed-chunk-17

CAGEr

The CAGEr package available on Bioconductor provides various methods for analysing CAGE data. To install the CAGEr package:

source("http://bioconductor.org/biocLite.R")
biocLite("CAGEr")

To load the CAGEr package:

library(CAGEr)

What CAGEr can do

Quality control steps
CAGE normalisation
CAGE tag clustering
Analysis of promoter shape
Expression clustering
Analysis of promoter shifting
Importantly, we can use the CAGEr package to prepare our data for further downstream analyses.
For more information: http://davetang.org/muse/2013/04/16/cage-analysis-using-the-r-bioconductor-package-cage/.

Some more resources

I highly recommend this book called Bioinformatics Data Skills. I wrote a book review on it (on my blog of course).
My collection of links to various introductory material related to bioinformatics.
My collection of links to resources for learning about R.
My collection of links to courses or course material related to bioinformatics.
Feel free to send me an email any time.
Thank you for your attention!