Introduction

This is a test of the new tuxedo pipeline as described in Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown (Pertea et al., Nature Protocol, Aug. 2016). The data for this can be downloaded from: this site. In the new tuxedo pipeline, the mapper bowtie2 is replaced by HiSAT2. StringTie is then used to merge the files from HiSAT2 and computer the coverage per gene (previously done with cufflinks). Finally BallGown computes the differentially expressed genes (as cufflinks used to).

HiSAT2

HiSAT2 is a spliced (or gapped) mapper. It is used for RNAseq mapping as the gene sequences may contain introns converted into gaps while aligning the RNAseq reads. HiSAT works on indexed reference files. To build this index, use:

hisat2-build reference output_dir

Many reference files can be used. They must be separated by commas on the command line. The indexes have already been built in the downloaded dataset. It is possible to provide the splicing sites (ss) and exons in the command line. They can be extracted from a gtf file giving information about genes and exons boundaries using tools provided with HiSAT2.

extract_splice_sites.py chrX_data/genes/chrX.gtf > chrX.ss
extract_exons.py chrX_data/genes/chrX.gtf > chrX.exon

hisat2-build --ss chrX.ss --exon chrX.exon chrX_data/genome/chrX.fa chrX_tran

Both –ss and –exon are optional arguments. This hisat2-build command took about 6min on the example dataset.

Here are all the HiSAT commands to map the reads from the 12 samples onto the indexed reference:

hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S ERR188044_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188104_chrX_1.fastq.gz -2 chrX_data/samples/ERR188104_chrX_2.fastq.gz -S ERR188104_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188234_chrX_1.fastq.gz -2 chrX_data/samples/ERR188234_chrX_2.fastq.gz -S ERR188234_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188245_chrX_1.fastq.gz -2 chrX_data/samples/ERR188245_chrX_2.fastq.gz -S ERR188245_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188257_chrX_1.fastq.gz -2 chrX_data/samples/ERR188257_chrX_2.fastq.gz -S ERR188257_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188273_chrX_1.fastq.gz -2 chrX_data/samples/ERR188273_chrX_2.fastq.gz -S ERR188273_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188337_chrX_1.fastq.gz -2 chrX_data/samples/ERR188337_chrX_2.fastq.gz -S ERR188337_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188383_chrX_1.fastq.gz -2 chrX_data/samples/ERR188383_chrX_2.fastq.gz -S ERR188383_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188401_chrX_1.fastq.gz -2 chrX_data/samples/ERR188401_chrX_2.fastq.gz -S ERR188401_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188428_chrX_1.fastq.gz -2 chrX_data/samples/ERR188428_chrX_2.fastq.gz -S ERR188428_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188454_chrX_1.fastq.gz -2 chrX_data/samples/ERR188454_chrX_2.fastq.gz -S ERR188454_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR204916_chrX_1.fastq.gz -2 chrX_data/samples/ERR204916_chrX_2.fastq.gz -S ERR204916_chrX.sam --time

The mapping is fairly fast. Each command takes less than a minute. The generated files need then to be sorted and converted to smaller bam files. This is done using smatools.

samtools sort -@ 2 -o ERR188044_chrX.bam ERR188044_chrX.sam
samtools sort -@ 2 -o ERR188104_chrX.bam ERR188104_chrX.sam
samtools sort -@ 2 -o ERR188234_chrX.bam ERR188234_chrX.sam
samtools sort -@ 2 -o ERR188245_chrX.bam ERR188245_chrX.sam
samtools sort -@ 2 -o ERR188257_chrX.bam ERR188257_chrX.sam
samtools sort -@ 2 -o ERR188273_chrX.bam ERR188273_chrX.sam
samtools sort -@ 2 -o ERR188337_chrX.bam ERR188337_chrX.sam
samtools sort -@ 2 -o ERR188383_chrX.bam ERR188383_chrX.sam
samtools sort -@ 2 -o ERR188401_chrX.bam ERR188401_chrX.sam
samtools sort -@ 2 -o ERR188428_chrX.bam ERR188428_chrX.sam
samtools sort -@ 2 -o ERR188454_chrX.bam ERR188454_chrX.sam
samtools sort -@ 2 -o ERR204916_chrX.bam ERR204916_chrX.sam

This will generate 12 bam files (binary sam files) which are compressed, and here sorted, versions of the original sam files. The sam files can be removed

rm -f ERR*_chrX.sam

StringTie

StringTie assembles RNAseq alignments from HiSat into potential transcripts. It can use a gtf file as a guide for the assembly.

stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188044_chrX.gtf -l ERR188044 ERR188044_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188104_chrX.gtf -l ERR188104 ERR188104_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188234_chrX.gtf -l ERR188234 ERR188234_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188245_chrX.gtf -l ERR188245 ERR188245_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188257_chrX.gtf -l ERR188257 ERR188257_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188273_chrX.gtf -l ERR188273 ERR188273_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188337_chrX.gtf -l ERR188337 ERR188337_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188383_chrX.gtf -l ERR188383 ERR188383_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188401_chrX.gtf -l ERR188401 ERR188401_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188428_chrX.gtf -l ERR188428 ERR188428_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188454_chrX.gtf -l ERR188454 ERR188454_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR204916_chrX.gtf -l ERR204916 ERR204916_chrX.bam

On the example dataset, stringtie takes about 10s for each samples. It outputs new gtf files containing the coordinates of the rebuilt transcripts. All the StringTie results need to be merged into one unique transcript coordinate file. For this a list of the atomic gtf files need to be built first.

ls ERR*chrX.gtf > merge_list.txt

Now the files can be merged.

stringtie --merge -p 2 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf merge_list.txt

This will generate an unique gtf file called stringtie_merged.gtf. Some extra information can be obtained using the gffcompare command fromt the gffutilities package. This step is optional.

gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf

This will create few new files:

Table 1

Code Description
= Predicted transcript has exactly the same introns as the reference transcript
c Predicted transcript is contained within the reference transcript
j Predicted transcript is a potential novel isoform that shares at least one splice junction with a reference transcript
e Predicted single-exon transcript overlaps a reference exon plus at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment
i Predicted transcript falls entirely within a reference intron
o Exon of predicted transcript overlaps a reference transcript
p Predicted transcript lies within 2 kb of a reference transcript (possible polymerase run-on fragment)
r Predicted transcript has >50% of its bases overlapping a soft-masked (repetitive) reference sequence
u Predicted transcript is intergenic in comparison with known reference transcripts
x Exon of predicted transcript overlaps reference but lies on the opposite strand
s Intron of predicted transcript overlaps a reference intron on the opposite strand

Compute transcript abundances and counts tables

This is also done using StringTie, but with different options.

stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188044/ERR188044_chrX.gtf ERR188044_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188104/ERR188104_chrX.gtf ERR188104_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188234/ERR188234_chrX.gtf ERR188234_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188245/ERR188245_chrX.gtf ERR188245_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188257/ERR188257_chrX.gtf ERR188257_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188273/ERR188273_chrX.gtf ERR188273_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188337/ERR188337_chrX.gtf ERR188337_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188383/ERR188383_chrX.gtf ERR188383_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188401/ERR188401_chrX.gtf ERR188401_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188428/ERR188428_chrX.gtf ERR188428_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188454/ERR188454_chrX.gtf ERR188454_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR204916/ERR204916_chrX.gtf ERR204916_chrX.bam

Differentially expressed genes (DEG) computation

The DEG computation is done using ballgown. The installation of ballgown can be done through bioconductor. The following commands are R commands that need to be launched in a R console (or in RStudio), unless specified otherwise.

source("https://bioconductor.org/biocLite.R")
biocLite("ballgown")

It is also recommended to install the following libraries: * RSkittleBrewer * genefilter * dplyr * devtools

install.packages(c("dplyr","devtools"))
devtools::install_github('alyssafrazee/RSkittleBrewer')

#source("https://bioconductor.org/biocLite.R")
biocLite("genefilter")

The libraries to load are:

library(ballgown)
library(RSkittleBrewer) 
library(genefilter)
library(dplyr)
library(devtools)

On Ubuntu 16.04, the following libraries are mandatory to install devtools:

# In a terminal
sudo apt install build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev

The phenotype data (aka the samples information) need first to be loaded.

#setwd("RNAseq_protocol")
pheno_data <- read.csv("chrX_data/geuvadis_phenodata.csv")

Here is the content of the table:

ids sex population
ERR188044 male YRI
ERR188104 male YRI
ERR188234 female YRI
ERR188245 female GBR
ERR188257 male GBR
ERR188273 female YRI
ERR188337 female GBR
ERR188383 male GBR
ERR188401 male GBR
ERR188428 female GBR
ERR188454 male YRI
ERR204916 female YRI

FPKM

To compute the FPKM, use the ballgown command.

bg_chrX <- ballgown(dataDir = "ballgown", samplePattern = "ERR", pData=pheno_data)

The dataDir parameter indicates to ballgown where is the directory containing the previously processed data. The parameter samplePattern indicates that the sample names start by “ERR”. Finally the parameter pData correspond to the data frame containing the sample information. The structure of the ballgown object is explained here. The expression values can be accessed using:

bg_chrX.expr <- expr(bg_chrX)
knitr::kable(bg_chrX.expr$trans[bg_chrX.expr$trans$gene_id == "MSTRG.1",1:12]) # expression values for all the transcripts from gene "MSTRG.1"
t_id chr strand start end t_name num_exons length gene_id gene_name cov.ERR188044 FPKM.ERR188044
1 1 chrX + 255489 304287 MSTRG.1.1 2 1597 MSTRG.1 . 4.395328 23.96415
2 2 chrX + 276324 303355 NR_028057 9 2087 MSTRG.1 PLCXD1 0.000000 0.00000
3 3 chrX + 276330 303355 MSTRG.1.3 8 5281 MSTRG.1 . 4.745659 25.87422
4 4 chrX + 276330 303355 MSTRG.1.4 7 5078 MSTRG.1 . 0.000000 0.00000
6 6 chrX + 281394 303355 NM_018390 7 5304 MSTRG.1 PLCXD1 106.236000 579.21863
7 7 chrX + 283484 303355 MSTRG.1.6 8 1850 MSTRG.1 . 0.000000 0.00000

The genes with a low read counts have to be discarded. It is also possible to apply a low variance filter. Here, we remove all the transcripts having a variance lower than 1.

bg_chrX_filt <- subset(bg_chrX, "rowVars(texpr(bg_chrX)) >1", genomesubset=TRUE)

Differentially expressed transcripts

results_transcripts <- stattest(bg_chrX_filt, feature="transcript", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")

Here are the first five rows of the returned object.

feature id fc pval qval
transcript 1 0.9406676 0.7331365 0.9471063
transcript 2 1.2079374 0.8671955 0.9746178
transcript 3 1.0136519 0.9917717 0.9982377
transcript 4 0.3889283 0.5259196 0.9222276
transcript 5 0.6088188 0.3205852 0.9140348

Here, we will look for transcripts that are differentially expressed between sexes, while correcting for any differences in expression due to the population variable. getFC is set to TRUE so the Fold Change is also returned. It might not be correct to use this stattest method on small replicate numbers, and we may need to go to DESeq2.

Differentially expressed genes

It is the same process as for the transcripts, only the feature type has to change.

results_genes <- stattest(bg_chrX_filt, feature="gene", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")

Here are the first five rows of the returned object.

feature id fc pval qval
gene MSTRG.1 0.5973154 0.0388305 0.4158518
gene MSTRG.10 0.9281766 0.7988706 0.9594504
gene MSTRG.1000 0.9371723 0.8815343 0.9803352
gene MSTRG.1001 1.0448129 0.6157911 0.8963496
gene MSTRG.1002 1.2207407 0.4158461 0.8389430

Data post processing

First, we need to add the gene information to the transcript_results data frame

results_transcripts <- data.frame(geneNames=ballgown::geneNames(bg_chrX_filt), geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)

Here are the first five rows of the returned object.

geneNames geneIDs feature id fc pval qval
. MSTRG.1 transcript 1 0.9406676 0.7331365 0.9471063
PLCXD1 MSTRG.1 transcript 2 1.2079374 0.8671955 0.9746178
. MSTRG.1 transcript 3 1.0136519 0.9917717 0.9982377
. MSTRG.1 transcript 4 0.3889283 0.5259196 0.9222276
. MSTRG.2 transcript 5 0.6088188 0.3205852 0.9140348

Then the data has to be sorted by increasing P-Value.

results_transcripts <- arrange(results_transcripts,pval)
results_genes <- arrange(results_genes,pval)

Here are the first five rows of the genes results.

feature id fc pval qval
gene MSTRG.531 0.0023957 0.00e+00 0.0000000
gene MSTRG.141 3.1649422 1.70e-06 0.0008468
gene MSTRG.530 0.0827409 7.00e-06 0.0023929
gene MSTRG.616 7.3140193 1.13e-05 0.0028885
gene MSTRG.620 9.0494508 5.03e-05 0.0102804

The log fold change (logFC) can be computed.

results_transcripts$logFC <- log2(results_transcripts$fc)
results_genes$logFC <- log2(results_genes$fc)

Here are the first five rows of the genes results.

feature id fc pval qval logFC
gene MSTRG.531 0.0023957 0.00e+00 0.0000000 -8.705332
gene MSTRG.141 3.1649422 1.70e-06 0.0008468 1.662179
gene MSTRG.530 0.0827409 7.00e-06 0.0023929 -3.595256
gene MSTRG.616 7.3140193 1.13e-05 0.0028885 2.870664
gene MSTRG.620 9.0494508 5.03e-05 0.0102804 3.177830

Finally, in order to keep only the statistically significant genes, it is possible to filter the data using a Q-Value threshold of 0.05.

subset(results_transcripts,results_transcripts$qval<0.05)
subset(results_genes,results_genes$qval<0.05)

Visualisation

Volcano plot

volcano_plot <- function(data){
  logfc.threshold <- 1
  with(data, plot(logFC, -log10(qval), pch=20, main="Volcano plot"))
  with(subset(data, qval<.05 ), points(logFC, -log10(qval), pch=20, col="red"))
  with(subset(data, abs(logFC)>logfc.threshold), points(logFC, -log10(qval), pch=20, col="orange"))
  with(subset(data, qval<.05 & abs(logFC)>logfc.threshold), points(logFC, -log10(qval), pch=20, col="green"))
}

volcano_plot(results_genes)

Distribution

tropical= c('darkorange', 'dodgerblue', 'hotpink', 'limegreen', 'yellow')
palette(tropical)
fpkm = texpr(bg_chrX_filt,meas="FPKM")
fpkm = log2(fpkm+1)
par(mar = c(10,5,4,2) + 0.1)
boxplot(fpkm,col=as.numeric(pheno_data$sex),las=2,ylab='log2(FPKM+1)')

This bar plot shows the distribution of the FPKM colored the “sex” variable.

Data variation for one gene

plot(fpkm[12,] ~ pheno_data$sex, border=c(1,2), main=paste(ballgown::geneNames(bg_chrX)[12],' : ', ballgown::transcriptNames(bg_chrX)[12]),pch=19, xlab="Sex", ylab='log2(FPKM+1)')
points(fpkm[12,] ~ jitter(as.numeric(pheno_data$sex)), col=as.numeric(pheno_data$sex))

Plot transcripts

plotTranscripts(ballgown::geneIDs(bg_chrX)[ballgown::geneNames(bg_chrX) == "XIST"], bg_chrX, main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234'))

Heatmap

library(RColorBrewer)
library(gplots)
my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 299)
mat_data <- data.matrix(fpkm)
heatmap.2(mat_data,
  main = "FPKM",
  notecol="black",
  # density.info="none",
  trace="none",
  margins =c(12,9),
  col=my_palette,
  # breaks=col_breaks,
  dendrogram="both")

Export

Gene expression - FPMK

gene_expression <- gexpr(bg_chrX)
FPKM.ERR188044 FPKM.ERR188104 FPKM.ERR188234 FPKM.ERR188245 FPKM.ERR188257
MSTRG.1 509.746960 319.605100 332.294280 302.473627 178.818228
MSTRG.10 21.423500 13.129211 14.111122 18.451681 10.181026
MSTRG.100 0.470175 0.000000 0.000000 0.361315 0.000000
MSTRG.1000 6.751377 3.257341 3.030485 2.217741 1.993606
MSTRG.1001 872.470466 682.913294 782.223726 861.644227 774.260158

Export to CSV

write.csv(gene_expression, "chrX_gene_expression_results.csv", row.names=FALSE)

Export all information in one spreadsheet

write.csv(merge(results_genes, gene_expression, by=0, all=T), "chrX_all_results.csv", row.names=TRUE)

Transcript and genes folds export

write.csv(results_transcripts, "chrX_transcript_results.csv", row.names=FALSE)
write.csv(results_genes, "chrX_gene_results.csv", row.names=FALSE) 

Comparison with DESeq2

DEG computation

The StringTie output needs to be converted. It can be done using a python script.

#This is a shell command
prepDE.py -i ballgown -g stringtie_to_DESeq.genes.csv -t stringtie_to_DESeq.trans.csv -p ERR

Then the data can be loaded in DESeq2

library(DESeq2)
counts <- read.csv("stringtie_to_DESeq.genes.csv", row.names = 1)

sex <- as.factor(pheno_data$sex)
pop <- as.factor(pheno_data$population)
ids <- as.vector(pheno_data$ids)

coldata <- data.frame(row.names=ids, pop=pop, sex=sex)

bg_chrX.deseq <- DESeqDataSetFromMatrix(counts, coldata, design=~ sex + pop)

The design parameter indicates that there is two variables: sex and pop. The DEG computation is done using the DEseq function.

deseq <- DESeq(bg_chrX.deseq)
deseq.results <- results(deseq)
deseq.results<-deseq.results[order(deseq.results$padj),]

As the model varies a lot from BallGown. The results are very different. With a small number of replicates (<4?) it is usually advised to use a non-linear model method like DESeq2.

Volcano plot

deseq.volcano_plot <- function(data){
  logfc.threshold <- 1
  with(data, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot"))
  with(subset(data, padj<.05 ), points(log2FoldChange, -log10(padj), pch=20, col="red"))
  with(subset(data, abs(log2FoldChange)>logfc.threshold), points(log2FoldChange, -log10(padj), pch=20, col="orange"))
  with(subset(data, padj<.05 & abs(log2FoldChange)>logfc.threshold), points(log2FoldChange, -log10(padj), pch=20, col="green"))
}

deseq.volcano_plot(deseq.results)

It is finally possible to include a shrink correction on the log fold change. In this case, this removes all significant results.

deseq.results.lfcshrink <- lfcShrink(deseq, coef=2, res=deseq.results)
deseq.volcano_plot(deseq.results.lfcshrink)

Venn diagrams comparing both methods

library("VennDiagram")
# dev.off()
logfc.threshold <- 2
BG.genes <- as.vector(subset(results_genes, qval<=.05 & abs(logFC) >= logfc.threshold )$id)
DEseq2.genes <- row.names(subset(deseq.results, padj<=.05 & abs(log2FoldChange) >= logfc.threshold))
venn.list <- list(BG.genes, DEseq2.genes)
venn.plot <- venn.diagram(venn.list, 
                          "VennDiagram.png", 
                          imagetype = 'png',
                          height = 800, 
                          width = 800,
                          resolution = 100,
                          fill=c("darkmagenta", "darkblue"), 
                          alpha=c(0.5,0.5), 
                          cex = 4, 
                          cat.fontface=0.1, 
                          category.names=c("BG", "DS2"), 
                          main="Differentially expressed genes")

This creates a VennDiagram.png file:

DEGSeq

library("DEGseq")

file <- file.path(getwd(), "stringtie_to_DESeq.genes.csv")
male.indexes <- as.vector(as.integer(row.names(subset(pheno_data, sex=="male")))+1)
female.indexes <- as.vector(as.integer(row.names(subset(pheno_data, sex=="female")))+1)
deg.geneexp.male <- readGeneExp(file=file, geneCol=1, valCol=male.indexes, header=T, sep=",")
deg.geneexp.female <- readGeneExp(file=file, geneCol=1, valCol=female.indexes, header=T, sep=",")
DEGexp(geneExpMatrix1 = deg.geneexp.male, 
       geneExpMatrix2 = deg.geneexp.female,
       geneCol1 = 1,
       expCol1 = c(2,3,4,5,6,7),
       geneCol2 = 1,
       expCol2 = c(2,3,4,5,6,7),
       groupLabel1 = "Male",
       groupLabel2 = "Female",
       method = "LRT",
       outputDir = "DEGseqResults")
degseq.result<-read.table(file=file.path("DEGseqResults","output_score.txt"),header = T,sep = "\t")
degseq.result$Signature2 <- (degseq.result$q.value.Storey.et.al..2003. <= 0.05  & abs(degseq.result$log2.Fold_change.) >= 2)
length(which(degseq.result$Signature2=="TRUE"))

Volcano plot:

degseq.volcano_plot <- function(data){
  logfc.threshold <- 1
  with(data, plot(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, main="Volcano plot"))
  with(subset(data, q.value.Storey.et.al..2003.<.05 ), points(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, col="red"))
  with(subset(data, abs(log2.Fold_change.)>logfc.threshold), points(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, col="orange"))
  with(subset(data, q.value.Storey.et.al..2003.<.05 & abs(log2.Fold_change.)>logfc.threshold), points(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, col="green"))
}

degseq.volcano_plot(degseq.result)

Venn diagram

DEGseq.genes <- as.vector(subset(degseq.result, q.value.Storey.et.al..2003.<=.05 & abs(log2.Fold_change.) >= logfc.threshold )$GeneNames)
venn.list <- list(BG.genes, DEseq2.genes, DEGseq.genes)
venn.plot <- venn.diagram(venn.list, 
                          "VennDiagram3.png", 
                          imagetype = 'png',
                          height = 800, 
                          width = 800,
                          resolution = 100,
                          fill=c("darkmagenta", "darkblue", "darkgreen"), 
                          alpha=c(0.5,0.5,0.5), 
                          cex = 4, 
                          cat.fontface=0.1, 
                          category.names=c("BG", "DS2", "DEGseq"), 
                          main="Differentially expressed genes")

This creates a VennDiagram3.png file:

It is very surprising that the three methods don’t overlap at all. I suspect that it is due to the low fold ratios and the low number of significant pvalues. Here is a plot showing the comparison between the fold ratios computed by DESeq2 and BallGown (log10 scale).

Final comparison

library("ggplot2")
BG.results_genes <- subset(results_genes, select=c("logFC","qval"))
row.names(BG.results_genes) <- results_genes$id
names(BG.results_genes) <- c("BG.log.fold.change","BG.qval")

DEseq2.results_genes <- subset(as.data.frame(deseq.results), select=c("log2FoldChange","padj"))
names(DEseq2.results_genes) <- c("DESeq2.log.fold.change","DESeq2.padj")

dt <- merge(BG.results_genes, DEseq2.results_genes, by=0, all=TRUE)

p <- ggplot(dt, aes(x=BG.log.fold.change, y=DESeq2.log.fold.change)) +
  geom_point(shape=1) +
  scale_y_log10() +
  scale_x_log10() 
ggExtra::ggMarginal(p, type="histogram")

Here is a comparison of the corrected p-values.

p1 <- ggplot(dt, aes(x=BG.qval, y=DESeq2.padj)) +
  geom_point(shape=1)
ggExtra::ggMarginal(p1, type="histogram")