This is a test of the new tuxedo pipeline as described in Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown (Pertea et al., Nature Protocol, Aug. 2016). The data for this can be downloaded from: this site. In the new tuxedo pipeline, the mapper bowtie2 is replaced by HiSAT2. StringTie is then used to merge the files from HiSAT2 and computer the coverage per gene (previously done with cufflinks). Finally BallGown computes the differentially expressed genes (as cufflinks used to).
HiSAT2 is a spliced (or gapped) mapper. It is used for RNAseq mapping as the gene sequences may contain introns converted into gaps while aligning the RNAseq reads. HiSAT works on indexed reference files. To build this index, use:
hisat2-build reference output_dir
Many reference files can be used. They must be separated by commas on the command line. The indexes have already been built in the downloaded dataset. It is possible to provide the splicing sites (ss) and exons in the command line. They can be extracted from a gtf file giving information about genes and exons boundaries using tools provided with HiSAT2.
extract_splice_sites.py chrX_data/genes/chrX.gtf > chrX.ss
extract_exons.py chrX_data/genes/chrX.gtf > chrX.exon
hisat2-build --ss chrX.ss --exon chrX.exon chrX_data/genome/chrX.fa chrX_tran
Both –ss and –exon are optional arguments. This hisat2-build command took about 6min on the example dataset.
Here are all the HiSAT commands to map the reads from the 12 samples onto the indexed reference:
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S ERR188044_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188104_chrX_1.fastq.gz -2 chrX_data/samples/ERR188104_chrX_2.fastq.gz -S ERR188104_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188234_chrX_1.fastq.gz -2 chrX_data/samples/ERR188234_chrX_2.fastq.gz -S ERR188234_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188245_chrX_1.fastq.gz -2 chrX_data/samples/ERR188245_chrX_2.fastq.gz -S ERR188245_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188257_chrX_1.fastq.gz -2 chrX_data/samples/ERR188257_chrX_2.fastq.gz -S ERR188257_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188273_chrX_1.fastq.gz -2 chrX_data/samples/ERR188273_chrX_2.fastq.gz -S ERR188273_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188337_chrX_1.fastq.gz -2 chrX_data/samples/ERR188337_chrX_2.fastq.gz -S ERR188337_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188383_chrX_1.fastq.gz -2 chrX_data/samples/ERR188383_chrX_2.fastq.gz -S ERR188383_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188401_chrX_1.fastq.gz -2 chrX_data/samples/ERR188401_chrX_2.fastq.gz -S ERR188401_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188428_chrX_1.fastq.gz -2 chrX_data/samples/ERR188428_chrX_2.fastq.gz -S ERR188428_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188454_chrX_1.fastq.gz -2 chrX_data/samples/ERR188454_chrX_2.fastq.gz -S ERR188454_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR204916_chrX_1.fastq.gz -2 chrX_data/samples/ERR204916_chrX_2.fastq.gz -S ERR204916_chrX.sam --time
The mapping is fairly fast. Each command takes less than a minute. The generated files need then to be sorted and converted to smaller bam files. This is done using smatools.
samtools sort -@ 2 -o ERR188044_chrX.bam ERR188044_chrX.sam
samtools sort -@ 2 -o ERR188104_chrX.bam ERR188104_chrX.sam
samtools sort -@ 2 -o ERR188234_chrX.bam ERR188234_chrX.sam
samtools sort -@ 2 -o ERR188245_chrX.bam ERR188245_chrX.sam
samtools sort -@ 2 -o ERR188257_chrX.bam ERR188257_chrX.sam
samtools sort -@ 2 -o ERR188273_chrX.bam ERR188273_chrX.sam
samtools sort -@ 2 -o ERR188337_chrX.bam ERR188337_chrX.sam
samtools sort -@ 2 -o ERR188383_chrX.bam ERR188383_chrX.sam
samtools sort -@ 2 -o ERR188401_chrX.bam ERR188401_chrX.sam
samtools sort -@ 2 -o ERR188428_chrX.bam ERR188428_chrX.sam
samtools sort -@ 2 -o ERR188454_chrX.bam ERR188454_chrX.sam
samtools sort -@ 2 -o ERR204916_chrX.bam ERR204916_chrX.sam
This will generate 12 bam files (binary sam files) which are compressed, and here sorted, versions of the original sam files. The sam files can be removed
rm -f ERR*_chrX.sam
StringTie assembles RNAseq alignments from HiSat into potential transcripts. It can use a gtf file as a guide for the assembly.
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188044_chrX.gtf -l ERR188044 ERR188044_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188104_chrX.gtf -l ERR188104 ERR188104_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188234_chrX.gtf -l ERR188234 ERR188234_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188245_chrX.gtf -l ERR188245 ERR188245_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188257_chrX.gtf -l ERR188257 ERR188257_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188273_chrX.gtf -l ERR188273 ERR188273_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188337_chrX.gtf -l ERR188337 ERR188337_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188383_chrX.gtf -l ERR188383 ERR188383_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188401_chrX.gtf -l ERR188401 ERR188401_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188428_chrX.gtf -l ERR188428 ERR188428_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188454_chrX.gtf -l ERR188454 ERR188454_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR204916_chrX.gtf -l ERR204916 ERR204916_chrX.bam
On the example dataset, stringtie takes about 10s for each samples. It outputs new gtf files containing the coordinates of the rebuilt transcripts. All the StringTie results need to be merged into one unique transcript coordinate file. For this a list of the atomic gtf files need to be built first.
ls ERR*chrX.gtf > merge_list.txt
Now the files can be merged.
stringtie --merge -p 2 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf merge_list.txt
This will generate an unique gtf file called stringtie_merged.gtf. Some extra information can be obtained using the gffcompare command fromt the gffutilities package. This step is optional.
gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf
This will create few new files:
Code | Description |
---|---|
= | Predicted transcript has exactly the same introns as the reference transcript |
c | Predicted transcript is contained within the reference transcript |
j | Predicted transcript is a potential novel isoform that shares at least one splice junction with a reference transcript |
e | Predicted single-exon transcript overlaps a reference exon plus at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment |
i | Predicted transcript falls entirely within a reference intron |
o | Exon of predicted transcript overlaps a reference transcript |
p | Predicted transcript lies within 2 kb of a reference transcript (possible polymerase run-on fragment) |
r | Predicted transcript has >50% of its bases overlapping a soft-masked (repetitive) reference sequence |
u | Predicted transcript is intergenic in comparison with known reference transcripts |
x | Exon of predicted transcript overlaps reference but lies on the opposite strand |
s | Intron of predicted transcript overlaps a reference intron on the opposite strand |
This is also done using StringTie, but with different options.
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188044/ERR188044_chrX.gtf ERR188044_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188104/ERR188104_chrX.gtf ERR188104_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188234/ERR188234_chrX.gtf ERR188234_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188245/ERR188245_chrX.gtf ERR188245_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188257/ERR188257_chrX.gtf ERR188257_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188273/ERR188273_chrX.gtf ERR188273_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188337/ERR188337_chrX.gtf ERR188337_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188383/ERR188383_chrX.gtf ERR188383_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188401/ERR188401_chrX.gtf ERR188401_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188428/ERR188428_chrX.gtf ERR188428_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188454/ERR188454_chrX.gtf ERR188454_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR204916/ERR204916_chrX.gtf ERR204916_chrX.bam
The DEG computation is done using ballgown. The installation of ballgown can be done through bioconductor. The following commands are R commands that need to be launched in a R console (or in RStudio), unless specified otherwise.
source("https://bioconductor.org/biocLite.R")
biocLite("ballgown")
It is also recommended to install the following libraries: * RSkittleBrewer * genefilter * dplyr * devtools
install.packages(c("dplyr","devtools"))
devtools::install_github('alyssafrazee/RSkittleBrewer')
#source("https://bioconductor.org/biocLite.R")
biocLite("genefilter")
The libraries to load are:
library(ballgown)
library(RSkittleBrewer)
library(genefilter)
library(dplyr)
library(devtools)
On Ubuntu 16.04, the following libraries are mandatory to install devtools:
# In a terminal
sudo apt install build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev
The phenotype data (aka the samples information) need first to be loaded.
#setwd("RNAseq_protocol")
pheno_data <- read.csv("chrX_data/geuvadis_phenodata.csv")
Here is the content of the table:
ids | sex | population |
---|---|---|
ERR188044 | male | YRI |
ERR188104 | male | YRI |
ERR188234 | female | YRI |
ERR188245 | female | GBR |
ERR188257 | male | GBR |
ERR188273 | female | YRI |
ERR188337 | female | GBR |
ERR188383 | male | GBR |
ERR188401 | male | GBR |
ERR188428 | female | GBR |
ERR188454 | male | YRI |
ERR204916 | female | YRI |
To compute the FPKM, use the ballgown command.
bg_chrX <- ballgown(dataDir = "ballgown", samplePattern = "ERR", pData=pheno_data)
The dataDir parameter indicates to ballgown where is the directory containing the previously processed data. The parameter samplePattern indicates that the sample names start by “ERR”. Finally the parameter pData correspond to the data frame containing the sample information. The structure of the ballgown object is explained here. The expression values can be accessed using:
bg_chrX.expr <- expr(bg_chrX)
knitr::kable(bg_chrX.expr$trans[bg_chrX.expr$trans$gene_id == "MSTRG.1",1:12]) # expression values for all the transcripts from gene "MSTRG.1"
t_id | chr | strand | start | end | t_name | num_exons | length | gene_id | gene_name | cov.ERR188044 | FPKM.ERR188044 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | chrX | + | 255489 | 304287 | MSTRG.1.1 | 2 | 1597 | MSTRG.1 | . | 4.395328 | 23.96415 |
2 | 2 | chrX | + | 276324 | 303355 | NR_028057 | 9 | 2087 | MSTRG.1 | PLCXD1 | 0.000000 | 0.00000 |
3 | 3 | chrX | + | 276330 | 303355 | MSTRG.1.3 | 8 | 5281 | MSTRG.1 | . | 4.745659 | 25.87422 |
4 | 4 | chrX | + | 276330 | 303355 | MSTRG.1.4 | 7 | 5078 | MSTRG.1 | . | 0.000000 | 0.00000 |
6 | 6 | chrX | + | 281394 | 303355 | NM_018390 | 7 | 5304 | MSTRG.1 | PLCXD1 | 106.236000 | 579.21863 |
7 | 7 | chrX | + | 283484 | 303355 | MSTRG.1.6 | 8 | 1850 | MSTRG.1 | . | 0.000000 | 0.00000 |
The genes with a low read counts have to be discarded. It is also possible to apply a low variance filter. Here, we remove all the transcripts having a variance lower than 1.
bg_chrX_filt <- subset(bg_chrX, "rowVars(texpr(bg_chrX)) >1", genomesubset=TRUE)
results_transcripts <- stattest(bg_chrX_filt, feature="transcript", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")
Here are the first five rows of the returned object.
feature | id | fc | pval | qval |
---|---|---|---|---|
transcript | 1 | 0.9406676 | 0.7331365 | 0.9471063 |
transcript | 2 | 1.2079374 | 0.8671955 | 0.9746178 |
transcript | 3 | 1.0136519 | 0.9917717 | 0.9982377 |
transcript | 4 | 0.3889283 | 0.5259196 | 0.9222276 |
transcript | 5 | 0.6088188 | 0.3205852 | 0.9140348 |
Here, we will look for transcripts that are differentially expressed between sexes, while correcting for any differences in expression due to the population variable. getFC is set to TRUE so the Fold Change is also returned. It might not be correct to use this stattest method on small replicate numbers, and we may need to go to DESeq2.
It is the same process as for the transcripts, only the feature type has to change.
results_genes <- stattest(bg_chrX_filt, feature="gene", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")
Here are the first five rows of the returned object.
feature | id | fc | pval | qval |
---|---|---|---|---|
gene | MSTRG.1 | 0.5973154 | 0.0388305 | 0.4158518 |
gene | MSTRG.10 | 0.9281766 | 0.7988706 | 0.9594504 |
gene | MSTRG.1000 | 0.9371723 | 0.8815343 | 0.9803352 |
gene | MSTRG.1001 | 1.0448129 | 0.6157911 | 0.8963496 |
gene | MSTRG.1002 | 1.2207407 | 0.4158461 | 0.8389430 |
First, we need to add the gene information to the transcript_results data frame
results_transcripts <- data.frame(geneNames=ballgown::geneNames(bg_chrX_filt), geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)
Here are the first five rows of the returned object.
geneNames | geneIDs | feature | id | fc | pval | qval |
---|---|---|---|---|---|---|
. | MSTRG.1 | transcript | 1 | 0.9406676 | 0.7331365 | 0.9471063 |
PLCXD1 | MSTRG.1 | transcript | 2 | 1.2079374 | 0.8671955 | 0.9746178 |
. | MSTRG.1 | transcript | 3 | 1.0136519 | 0.9917717 | 0.9982377 |
. | MSTRG.1 | transcript | 4 | 0.3889283 | 0.5259196 | 0.9222276 |
. | MSTRG.2 | transcript | 5 | 0.6088188 | 0.3205852 | 0.9140348 |
Then the data has to be sorted by increasing P-Value.
results_transcripts <- arrange(results_transcripts,pval)
results_genes <- arrange(results_genes,pval)
Here are the first five rows of the genes results.
feature | id | fc | pval | qval |
---|---|---|---|---|
gene | MSTRG.531 | 0.0023957 | 0.00e+00 | 0.0000000 |
gene | MSTRG.141 | 3.1649422 | 1.70e-06 | 0.0008468 |
gene | MSTRG.530 | 0.0827409 | 7.00e-06 | 0.0023929 |
gene | MSTRG.616 | 7.3140193 | 1.13e-05 | 0.0028885 |
gene | MSTRG.620 | 9.0494508 | 5.03e-05 | 0.0102804 |
The log fold change (logFC) can be computed.
results_transcripts$logFC <- log2(results_transcripts$fc)
results_genes$logFC <- log2(results_genes$fc)
Here are the first five rows of the genes results.
feature | id | fc | pval | qval | logFC |
---|---|---|---|---|---|
gene | MSTRG.531 | 0.0023957 | 0.00e+00 | 0.0000000 | -8.705332 |
gene | MSTRG.141 | 3.1649422 | 1.70e-06 | 0.0008468 | 1.662179 |
gene | MSTRG.530 | 0.0827409 | 7.00e-06 | 0.0023929 | -3.595256 |
gene | MSTRG.616 | 7.3140193 | 1.13e-05 | 0.0028885 | 2.870664 |
gene | MSTRG.620 | 9.0494508 | 5.03e-05 | 0.0102804 | 3.177830 |
Finally, in order to keep only the statistically significant genes, it is possible to filter the data using a Q-Value threshold of 0.05.
subset(results_transcripts,results_transcripts$qval<0.05)
subset(results_genes,results_genes$qval<0.05)
volcano_plot <- function(data){
logfc.threshold <- 1
with(data, plot(logFC, -log10(qval), pch=20, main="Volcano plot"))
with(subset(data, qval<.05 ), points(logFC, -log10(qval), pch=20, col="red"))
with(subset(data, abs(logFC)>logfc.threshold), points(logFC, -log10(qval), pch=20, col="orange"))
with(subset(data, qval<.05 & abs(logFC)>logfc.threshold), points(logFC, -log10(qval), pch=20, col="green"))
}
volcano_plot(results_genes)
tropical= c('darkorange', 'dodgerblue', 'hotpink', 'limegreen', 'yellow')
palette(tropical)
fpkm = texpr(bg_chrX_filt,meas="FPKM")
fpkm = log2(fpkm+1)
par(mar = c(10,5,4,2) + 0.1)
boxplot(fpkm,col=as.numeric(pheno_data$sex),las=2,ylab='log2(FPKM+1)')
This bar plot shows the distribution of the FPKM colored the “sex” variable.
plot(fpkm[12,] ~ pheno_data$sex, border=c(1,2), main=paste(ballgown::geneNames(bg_chrX)[12],' : ', ballgown::transcriptNames(bg_chrX)[12]),pch=19, xlab="Sex", ylab='log2(FPKM+1)')
points(fpkm[12,] ~ jitter(as.numeric(pheno_data$sex)), col=as.numeric(pheno_data$sex))
plotTranscripts(ballgown::geneIDs(bg_chrX)[ballgown::geneNames(bg_chrX) == "XIST"], bg_chrX, main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234'))
library(RColorBrewer)
library(gplots)
my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 299)
mat_data <- data.matrix(fpkm)
heatmap.2(mat_data,
main = "FPKM",
notecol="black",
# density.info="none",
trace="none",
margins =c(12,9),
col=my_palette,
# breaks=col_breaks,
dendrogram="both")
gene_expression <- gexpr(bg_chrX)
FPKM.ERR188044 | FPKM.ERR188104 | FPKM.ERR188234 | FPKM.ERR188245 | FPKM.ERR188257 | |
---|---|---|---|---|---|
MSTRG.1 | 509.746960 | 319.605100 | 332.294280 | 302.473627 | 178.818228 |
MSTRG.10 | 21.423500 | 13.129211 | 14.111122 | 18.451681 | 10.181026 |
MSTRG.100 | 0.470175 | 0.000000 | 0.000000 | 0.361315 | 0.000000 |
MSTRG.1000 | 6.751377 | 3.257341 | 3.030485 | 2.217741 | 1.993606 |
MSTRG.1001 | 872.470466 | 682.913294 | 782.223726 | 861.644227 | 774.260158 |
Export to CSV
write.csv(gene_expression, "chrX_gene_expression_results.csv", row.names=FALSE)
Export all information in one spreadsheet
write.csv(merge(results_genes, gene_expression, by=0, all=T), "chrX_all_results.csv", row.names=TRUE)
write.csv(results_transcripts, "chrX_transcript_results.csv", row.names=FALSE)
write.csv(results_genes, "chrX_gene_results.csv", row.names=FALSE)
The StringTie output needs to be converted. It can be done using a python script.
#This is a shell command
prepDE.py -i ballgown -g stringtie_to_DESeq.genes.csv -t stringtie_to_DESeq.trans.csv -p ERR
Then the data can be loaded in DESeq2
library(DESeq2)
counts <- read.csv("stringtie_to_DESeq.genes.csv", row.names = 1)
sex <- as.factor(pheno_data$sex)
pop <- as.factor(pheno_data$population)
ids <- as.vector(pheno_data$ids)
coldata <- data.frame(row.names=ids, pop=pop, sex=sex)
bg_chrX.deseq <- DESeqDataSetFromMatrix(counts, coldata, design=~ sex + pop)
The design parameter indicates that there is two variables: sex and pop. The DEG computation is done using the DEseq function.
deseq <- DESeq(bg_chrX.deseq)
deseq.results <- results(deseq)
deseq.results<-deseq.results[order(deseq.results$padj),]
As the model varies a lot from BallGown. The results are very different. With a small number of replicates (<4?) it is usually advised to use a non-linear model method like DESeq2.
deseq.volcano_plot <- function(data){
logfc.threshold <- 1
with(data, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot"))
with(subset(data, padj<.05 ), points(log2FoldChange, -log10(padj), pch=20, col="red"))
with(subset(data, abs(log2FoldChange)>logfc.threshold), points(log2FoldChange, -log10(padj), pch=20, col="orange"))
with(subset(data, padj<.05 & abs(log2FoldChange)>logfc.threshold), points(log2FoldChange, -log10(padj), pch=20, col="green"))
}
deseq.volcano_plot(deseq.results)
It is finally possible to include a shrink correction on the log fold change. In this case, this removes all significant results.
deseq.results.lfcshrink <- lfcShrink(deseq, coef=2, res=deseq.results)
deseq.volcano_plot(deseq.results.lfcshrink)
library("VennDiagram")
# dev.off()
logfc.threshold <- 2
BG.genes <- as.vector(subset(results_genes, qval<=.05 & abs(logFC) >= logfc.threshold )$id)
DEseq2.genes <- row.names(subset(deseq.results, padj<=.05 & abs(log2FoldChange) >= logfc.threshold))
venn.list <- list(BG.genes, DEseq2.genes)
venn.plot <- venn.diagram(venn.list,
"VennDiagram.png",
imagetype = 'png',
height = 800,
width = 800,
resolution = 100,
fill=c("darkmagenta", "darkblue"),
alpha=c(0.5,0.5),
cex = 4,
cat.fontface=0.1,
category.names=c("BG", "DS2"),
main="Differentially expressed genes")
This creates a VennDiagram.png file:
library("DEGseq")
file <- file.path(getwd(), "stringtie_to_DESeq.genes.csv")
male.indexes <- as.vector(as.integer(row.names(subset(pheno_data, sex=="male")))+1)
female.indexes <- as.vector(as.integer(row.names(subset(pheno_data, sex=="female")))+1)
deg.geneexp.male <- readGeneExp(file=file, geneCol=1, valCol=male.indexes, header=T, sep=",")
deg.geneexp.female <- readGeneExp(file=file, geneCol=1, valCol=female.indexes, header=T, sep=",")
DEGexp(geneExpMatrix1 = deg.geneexp.male,
geneExpMatrix2 = deg.geneexp.female,
geneCol1 = 1,
expCol1 = c(2,3,4,5,6,7),
geneCol2 = 1,
expCol2 = c(2,3,4,5,6,7),
groupLabel1 = "Male",
groupLabel2 = "Female",
method = "LRT",
outputDir = "DEGseqResults")
degseq.result<-read.table(file=file.path("DEGseqResults","output_score.txt"),header = T,sep = "\t")
degseq.result$Signature2 <- (degseq.result$q.value.Storey.et.al..2003. <= 0.05 & abs(degseq.result$log2.Fold_change.) >= 2)
length(which(degseq.result$Signature2=="TRUE"))
degseq.volcano_plot <- function(data){
logfc.threshold <- 1
with(data, plot(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, main="Volcano plot"))
with(subset(data, q.value.Storey.et.al..2003.<.05 ), points(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, col="red"))
with(subset(data, abs(log2.Fold_change.)>logfc.threshold), points(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, col="orange"))
with(subset(data, q.value.Storey.et.al..2003.<.05 & abs(log2.Fold_change.)>logfc.threshold), points(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, col="green"))
}
degseq.volcano_plot(degseq.result)
DEGseq.genes <- as.vector(subset(degseq.result, q.value.Storey.et.al..2003.<=.05 & abs(log2.Fold_change.) >= logfc.threshold )$GeneNames)
venn.list <- list(BG.genes, DEseq2.genes, DEGseq.genes)
venn.plot <- venn.diagram(venn.list,
"VennDiagram3.png",
imagetype = 'png',
height = 800,
width = 800,
resolution = 100,
fill=c("darkmagenta", "darkblue", "darkgreen"),
alpha=c(0.5,0.5,0.5),
cex = 4,
cat.fontface=0.1,
category.names=c("BG", "DS2", "DEGseq"),
main="Differentially expressed genes")
This creates a VennDiagram3.png file:
It is very surprising that the three methods don’t overlap at all. I suspect that it is due to the low fold ratios and the low number of significant pvalues. Here is a plot showing the comparison between the fold ratios computed by DESeq2 and BallGown (log10 scale).
library("ggplot2")
BG.results_genes <- subset(results_genes, select=c("logFC","qval"))
row.names(BG.results_genes) <- results_genes$id
names(BG.results_genes) <- c("BG.log.fold.change","BG.qval")
DEseq2.results_genes <- subset(as.data.frame(deseq.results), select=c("log2FoldChange","padj"))
names(DEseq2.results_genes) <- c("DESeq2.log.fold.change","DESeq2.padj")
dt <- merge(BG.results_genes, DEseq2.results_genes, by=0, all=TRUE)
p <- ggplot(dt, aes(x=BG.log.fold.change, y=DESeq2.log.fold.change)) +
geom_point(shape=1) +
scale_y_log10() +
scale_x_log10()
ggExtra::ggMarginal(p, type="histogram")
Here is a comparison of the corrected p-values.
p1 <- ggplot(dt, aes(x=BG.qval, y=DESeq2.padj)) +
geom_point(shape=1)
ggExtra::ggMarginal(p1, type="histogram")