RNASeq

Introduction

This is a test of the new tuxedo pipeline as described in Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown (Pertea et al., Nature Protocol, Aug. 2016). The data for this can be downloaded from: this site. In the new tuxedo pipeline, the mapper bowtie2 is replaced by HiSAT2. StringTie is then used to merge the files from HiSAT2 and computer the coverage per gene (previously done with cufflinks). Finally BallGown computes the differentially expressed genes (as cufflinks used to).

HiSAT2

HiSAT2 is a spliced (or gapped) mapper. It is used for RNAseq mapping as the gene sequences may contain introns converted into gaps while aligning the RNAseq reads. HiSAT works on indexed reference files. To build this index, use:

hisat2-build reference output_dir

Many reference files can be used. They must be separated by commas on the command line. The indexes have already been built in the downloaded dataset. It is possible to provide the splicing sites (ss) and exons in the command line. They can be extracted from a gtf file giving information about genes and exons boundaries using tools provided with HiSAT2.

extract_splice_sites.py chrX_data/genes/chrX.gtf > chrX.ss
extract_exons.py chrX_data/genes/chrX.gtf > chrX.exon

hisat2-build --ss chrX.ss --exon chrX.exon chrX_data/genome/chrX.fa chrX_tran

Both –ss and –exon are optional arguments. This hisat2-build command took about 6min on the example dataset.

Here are all the HiSAT commands to map the reads from the 12 samples onto the indexed reference:

hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188044_chrX_1.fastq.gz -2 chrX_data/samples/ERR188044_chrX_2.fastq.gz -S ERR188044_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188104_chrX_1.fastq.gz -2 chrX_data/samples/ERR188104_chrX_2.fastq.gz -S ERR188104_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188234_chrX_1.fastq.gz -2 chrX_data/samples/ERR188234_chrX_2.fastq.gz -S ERR188234_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188245_chrX_1.fastq.gz -2 chrX_data/samples/ERR188245_chrX_2.fastq.gz -S ERR188245_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188257_chrX_1.fastq.gz -2 chrX_data/samples/ERR188257_chrX_2.fastq.gz -S ERR188257_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188273_chrX_1.fastq.gz -2 chrX_data/samples/ERR188273_chrX_2.fastq.gz -S ERR188273_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188337_chrX_1.fastq.gz -2 chrX_data/samples/ERR188337_chrX_2.fastq.gz -S ERR188337_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188383_chrX_1.fastq.gz -2 chrX_data/samples/ERR188383_chrX_2.fastq.gz -S ERR188383_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188401_chrX_1.fastq.gz -2 chrX_data/samples/ERR188401_chrX_2.fastq.gz -S ERR188401_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188428_chrX_1.fastq.gz -2 chrX_data/samples/ERR188428_chrX_2.fastq.gz -S ERR188428_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR188454_chrX_1.fastq.gz -2 chrX_data/samples/ERR188454_chrX_2.fastq.gz -S ERR188454_chrX.sam --time
hisat2 -p 2 --dta -x chrX_data/indexes/chrX_tran -1 chrX_data/samples/ERR204916_chrX_1.fastq.gz -2 chrX_data/samples/ERR204916_chrX_2.fastq.gz -S ERR204916_chrX.sam --time

The mapping is fairly fast. Each command takes less than a minute. The generated files need then to be sorted and converted to smaller bam files. This is done using smatools.

samtools sort -@ 2 -o ERR188044_chrX.bam ERR188044_chrX.sam
samtools sort -@ 2 -o ERR188104_chrX.bam ERR188104_chrX.sam
samtools sort -@ 2 -o ERR188234_chrX.bam ERR188234_chrX.sam
samtools sort -@ 2 -o ERR188245_chrX.bam ERR188245_chrX.sam
samtools sort -@ 2 -o ERR188257_chrX.bam ERR188257_chrX.sam
samtools sort -@ 2 -o ERR188273_chrX.bam ERR188273_chrX.sam
samtools sort -@ 2 -o ERR188337_chrX.bam ERR188337_chrX.sam
samtools sort -@ 2 -o ERR188383_chrX.bam ERR188383_chrX.sam
samtools sort -@ 2 -o ERR188401_chrX.bam ERR188401_chrX.sam
samtools sort -@ 2 -o ERR188428_chrX.bam ERR188428_chrX.sam
samtools sort -@ 2 -o ERR188454_chrX.bam ERR188454_chrX.sam
samtools sort -@ 2 -o ERR204916_chrX.bam ERR204916_chrX.sam

This will generate 12 bam files (binary sam files) which are compressed, and here sorted, versions of the original sam files. The sam files can be removed

rm -f ERR*_chrX.sam

StringTie

StringTie assembles RNAseq alignments from HiSat into potential transcripts. It can use a gtf file as a guide for the assembly.

stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188044_chrX.gtf -l ERR188044 ERR188044_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188104_chrX.gtf -l ERR188104 ERR188104_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188234_chrX.gtf -l ERR188234 ERR188234_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188245_chrX.gtf -l ERR188245 ERR188245_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188257_chrX.gtf -l ERR188257 ERR188257_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188273_chrX.gtf -l ERR188273 ERR188273_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188337_chrX.gtf -l ERR188337 ERR188337_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188383_chrX.gtf -l ERR188383 ERR188383_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188401_chrX.gtf -l ERR188401 ERR188401_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188428_chrX.gtf -l ERR188428 ERR188428_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR188454_chrX.gtf -l ERR188454 ERR188454_chrX.bam
stringtie -p 2 -G chrX_data/genes/chrX.gtf -o ERR204916_chrX.gtf -l ERR204916 ERR204916_chrX.bam

On the example dataset, stringtie takes about 10s for each samples. It outputs new gtf files containing the coordinates of the rebuilt transcripts. All the StringTie results need to be merged into one unique transcript coordinate file. For this a list of the atomic gtf files need to be built first.

ls ERR*chrX.gtf > merge_list.txt

Now the files can be merged.

stringtie --merge -p 2 -G chrX_data/genes/chrX.gtf -o stringtie_merged.gtf merge_list.txt

This will generate an unique gtf file called stringtie_merged.gtf. Some extra information can be obtained using the gffcompare command fromt the gffutilities package. This step is optional.

gffcompare -r chrX_data/genes/chrX.gtf -G -o merged stringtie_merged.gtf

This will create few new files:

merged.stats: containing information about the new/missed introns, exons and loci
merged.annotated.gtf: a gtf files with information about each transcript provided in the field class_code. The class_codes are provided in table 1.
merged.loci
merged.stringtie_merged.gtf.refmap
merged.stringtie_merged.gtf.tmap
merged.tracking

Table 1

Code	Description
=	Predicted transcript has exactly the same introns as the reference transcript
c	Predicted transcript is contained within the reference transcript
j	Predicted transcript is a potential novel isoform that shares at least one splice junction with a reference transcript
e	Predicted single-exon transcript overlaps a reference exon plus at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment
i	Predicted transcript falls entirely within a reference intron
o	Exon of predicted transcript overlaps a reference transcript
p	Predicted transcript lies within 2 kb of a reference transcript (possible polymerase run-on fragment)
r	Predicted transcript has >50% of its bases overlapping a soft-masked (repetitive) reference sequence
u	Predicted transcript is intergenic in comparison with known reference transcripts
x	Exon of predicted transcript overlaps reference but lies on the opposite strand
s	Intron of predicted transcript overlaps a reference intron on the opposite strand

Compute transcript abundances and counts tables

This is also done using StringTie, but with different options.

stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188044/ERR188044_chrX.gtf ERR188044_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188104/ERR188104_chrX.gtf ERR188104_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188234/ERR188234_chrX.gtf ERR188234_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188245/ERR188245_chrX.gtf ERR188245_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188257/ERR188257_chrX.gtf ERR188257_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188273/ERR188273_chrX.gtf ERR188273_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188337/ERR188337_chrX.gtf ERR188337_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188383/ERR188383_chrX.gtf ERR188383_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188401/ERR188401_chrX.gtf ERR188401_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188428/ERR188428_chrX.gtf ERR188428_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR188454/ERR188454_chrX.gtf ERR188454_chrX.bam
stringtie -e -B -p 2 -G stringtie_merged.gtf -o ballgown/ERR204916/ERR204916_chrX.gtf ERR204916_chrX.bam

Differentially expressed genes (DEG) computation

The DEG computation is done using ballgown. The installation of ballgown can be done through bioconductor. The following commands are R commands that need to be launched in a R console (or in RStudio), unless specified otherwise.

source("https://bioconductor.org/biocLite.R")
biocLite("ballgown")

It is also recommended to install the following libraries: * RSkittleBrewer * genefilter * dplyr * devtools

install.packages(c("dplyr","devtools"))
devtools::install_github('alyssafrazee/RSkittleBrewer')

#source("https://bioconductor.org/biocLite.R")
biocLite("genefilter")

The libraries to load are:

library(ballgown)
library(RSkittleBrewer) 
library(genefilter)
library(dplyr)
library(devtools)

On Ubuntu 16.04, the following libraries are mandatory to install devtools:

# In a terminal
sudo apt install build-essential libcurl4-gnutls-dev libxml2-dev libssl-dev

The phenotype data (aka the samples information) need first to be loaded.

#setwd("RNAseq_protocol")
pheno_data <- read.csv("chrX_data/geuvadis_phenodata.csv")

Here is the content of the table:

ids	sex	population
ERR188044	male	YRI
ERR188104	male	YRI
ERR188234	female	YRI
ERR188245	female	GBR
ERR188257	male	GBR
ERR188273	female	YRI
ERR188337	female	GBR
ERR188383	male	GBR
ERR188401	male	GBR
ERR188428	female	GBR
ERR188454	male	YRI
ERR204916	female	YRI

FPKM

To compute the FPKM, use the ballgown command.

bg_chrX <- ballgown(dataDir = "ballgown", samplePattern = "ERR", pData=pheno_data)

The dataDir parameter indicates to ballgown where is the directory containing the previously processed data. The parameter samplePattern indicates that the sample names start by “ERR”. Finally the parameter pData correspond to the data frame containing the sample information. The structure of the ballgown object is explained here. The expression values can be accessed using:

bg_chrX.expr <- expr(bg_chrX)
knitr::kable(bg_chrX.expr$trans[bg_chrX.expr$trans$gene_id == "MSTRG.1",1:12]) # expression values for all the transcripts from gene "MSTRG.1"

	t_id	chr	strand	start	end	t_name	num_exons	length	gene_id	gene_name	cov.ERR188044	FPKM.ERR188044
1	1	chrX	+	255489	304287	MSTRG.1.1	2	1597	MSTRG.1	.	4.395328	23.96415
2	2	chrX	+	276324	303355	NR_028057	9	2087	MSTRG.1	PLCXD1	0.000000	0.00000
3	3	chrX	+	276330	303355	MSTRG.1.3	8	5281	MSTRG.1	.	4.745659	25.87422
4	4	chrX	+	276330	303355	MSTRG.1.4	7	5078	MSTRG.1	.	0.000000	0.00000
6	6	chrX	+	281394	303355	NM_018390	7	5304	MSTRG.1	PLCXD1	106.236000	579.21863
7	7	chrX	+	283484	303355	MSTRG.1.6	8	1850	MSTRG.1	.	0.000000	0.00000

The genes with a low read counts have to be discarded. It is also possible to apply a low variance filter. Here, we remove all the transcripts having a variance lower than 1.

bg_chrX_filt <- subset(bg_chrX, "rowVars(texpr(bg_chrX)) >1", genomesubset=TRUE)

Differentially expressed transcripts

results_transcripts <- stattest(bg_chrX_filt, feature="transcript", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")

Here are the first five rows of the returned object.

feature	id	fc	pval	qval
transcript	1	0.9406676	0.7331365	0.9471063
transcript	2	1.2079374	0.8671955	0.9746178
transcript	3	1.0136519	0.9917717	0.9982377
transcript	4	0.3889283	0.5259196	0.9222276
transcript	5	0.6088188	0.3205852	0.9140348

Here, we will look for transcripts that are differentially expressed between sexes, while correcting for any differences in expression due to the population variable. getFC is set to TRUE so the Fold Change is also returned. It might not be correct to use this stattest method on small replicate numbers, and we may need to go to DESeq2.

Differentially expressed genes

It is the same process as for the transcripts, only the feature type has to change.

results_genes <- stattest(bg_chrX_filt, feature="gene", covariate="sex", adjustvars = c("population"), getFC=TRUE, meas="FPKM")

Here are the first five rows of the returned object.

feature	id	fc	pval	qval
gene	MSTRG.1	0.5973154	0.0388305	0.4158518
gene	MSTRG.10	0.9281766	0.7988706	0.9594504
gene	MSTRG.1000	0.9371723	0.8815343	0.9803352
gene	MSTRG.1001	1.0448129	0.6157911	0.8963496
gene	MSTRG.1002	1.2207407	0.4158461	0.8389430

Data post processing

First, we need to add the gene information to the transcript_results data frame

results_transcripts <- data.frame(geneNames=ballgown::geneNames(bg_chrX_filt), geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)

Here are the first five rows of the returned object.

geneNames	geneIDs	feature	id	fc	pval	qval
.	MSTRG.1	transcript	1	0.9406676	0.7331365	0.9471063
PLCXD1	MSTRG.1	transcript	2	1.2079374	0.8671955	0.9746178
.	MSTRG.1	transcript	3	1.0136519	0.9917717	0.9982377
.	MSTRG.1	transcript	4	0.3889283	0.5259196	0.9222276
.	MSTRG.2	transcript	5	0.6088188	0.3205852	0.9140348

Then the data has to be sorted by increasing P-Value.

results_transcripts <- arrange(results_transcripts,pval)
results_genes <- arrange(results_genes,pval)

Here are the first five rows of the genes results.

feature	id	fc	pval	qval
gene	MSTRG.531	0.0023957	0.00e+00	0.0000000
gene	MSTRG.141	3.1649422	1.70e-06	0.0008468
gene	MSTRG.530	0.0827409	7.00e-06	0.0023929
gene	MSTRG.616	7.3140193	1.13e-05	0.0028885
gene	MSTRG.620	9.0494508	5.03e-05	0.0102804

The log fold change (logFC) can be computed.

results_transcripts$logFC <- log2(results_transcripts$fc)
results_genes$logFC <- log2(results_genes$fc)

Here are the first five rows of the genes results.

feature	id	fc	pval	qval	logFC
gene	MSTRG.531	0.0023957	0.00e+00	0.0000000	-8.705332
gene	MSTRG.141	3.1649422	1.70e-06	0.0008468	1.662179
gene	MSTRG.530	0.0827409	7.00e-06	0.0023929	-3.595256
gene	MSTRG.616	7.3140193	1.13e-05	0.0028885	2.870664
gene	MSTRG.620	9.0494508	5.03e-05	0.0102804	3.177830

Finally, in order to keep only the statistically significant genes, it is possible to filter the data using a Q-Value threshold of 0.05.

subset(results_transcripts,results_transcripts$qval<0.05)
subset(results_genes,results_genes$qval<0.05)

Visualisation

Volcano plot

volcano_plot <- function(data){
  logfc.threshold <- 1
  with(data, plot(logFC, -log10(qval), pch=20, main="Volcano plot"))
  with(subset(data, qval<.05 ), points(logFC, -log10(qval), pch=20, col="red"))
  with(subset(data, abs(logFC)>logfc.threshold), points(logFC, -log10(qval), pch=20, col="orange"))
  with(subset(data, qval<.05 & abs(logFC)>logfc.threshold), points(logFC, -log10(qval), pch=20, col="green"))
}

volcano_plot(results_genes)

Distribution

tropical= c('darkorange', 'dodgerblue', 'hotpink', 'limegreen', 'yellow')
palette(tropical)
fpkm = texpr(bg_chrX_filt,meas="FPKM")
fpkm = log2(fpkm+1)
par(mar = c(10,5,4,2) + 0.1)
boxplot(fpkm,col=as.numeric(pheno_data$sex),las=2,ylab='log2(FPKM+1)')

This bar plot shows the distribution of the FPKM colored the “sex” variable.

Data variation for one gene

plot(fpkm[12,] ~ pheno_data$sex, border=c(1,2), main=paste(ballgown::geneNames(bg_chrX)[12],' : ', ballgown::transcriptNames(bg_chrX)[12]),pch=19, xlab="Sex", ylab='log2(FPKM+1)')
points(fpkm[12,] ~ jitter(as.numeric(pheno_data$sex)), col=as.numeric(pheno_data$sex))

Plot transcripts

plotTranscripts(ballgown::geneIDs(bg_chrX)[ballgown::geneNames(bg_chrX) == "XIST"], bg_chrX, main=c('Gene XIST in sample ERR188234'), sample=c('ERR188234'))

Heatmap

library(RColorBrewer)
library(gplots)
my_palette <- colorRampPalette(c("green", "yellow", "red"))(n = 299)
mat_data <- data.matrix(fpkm)
heatmap.2(mat_data,
  main = "FPKM",
  notecol="black",
  # density.info="none",
  trace="none",
  margins =c(12,9),
  col=my_palette,
  # breaks=col_breaks,
  dendrogram="both")

Export

Gene expression - FPMK

gene_expression <- gexpr(bg_chrX)

	FPKM.ERR188044	FPKM.ERR188104	FPKM.ERR188234	FPKM.ERR188245	FPKM.ERR188257
MSTRG.1	509.746960	319.605100	332.294280	302.473627	178.818228
MSTRG.10	21.423500	13.129211	14.111122	18.451681	10.181026
MSTRG.100	0.470175	0.000000	0.000000	0.361315	0.000000
MSTRG.1000	6.751377	3.257341	3.030485	2.217741	1.993606
MSTRG.1001	872.470466	682.913294	782.223726	861.644227	774.260158

Export to CSV

write.csv(gene_expression, "chrX_gene_expression_results.csv", row.names=FALSE)

Export all information in one spreadsheet

write.csv(merge(results_genes, gene_expression, by=0, all=T), "chrX_all_results.csv", row.names=TRUE)

Transcript and genes folds export

write.csv(results_transcripts, "chrX_transcript_results.csv", row.names=FALSE)
write.csv(results_genes, "chrX_gene_results.csv", row.names=FALSE)

Comparison with DESeq2

DEG computation

The StringTie output needs to be converted. It can be done using a python script.

#This is a shell command
prepDE.py -i ballgown -g stringtie_to_DESeq.genes.csv -t stringtie_to_DESeq.trans.csv -p ERR

Then the data can be loaded in DESeq2

library(DESeq2)
counts <- read.csv("stringtie_to_DESeq.genes.csv", row.names = 1)

sex <- as.factor(pheno_data$sex)
pop <- as.factor(pheno_data$population)
ids <- as.vector(pheno_data$ids)

coldata <- data.frame(row.names=ids, pop=pop, sex=sex)

bg_chrX.deseq <- DESeqDataSetFromMatrix(counts, coldata, design=~ sex + pop)

The design parameter indicates that there is two variables: sex and pop. The DEG computation is done using the DEseq function.

deseq <- DESeq(bg_chrX.deseq)
deseq.results <- results(deseq)
deseq.results<-deseq.results[order(deseq.results$padj),]

As the model varies a lot from BallGown. The results are very different. With a small number of replicates (<4?) it is usually advised to use a non-linear model method like DESeq2.

Volcano plot

deseq.volcano_plot <- function(data){
  logfc.threshold <- 1
  with(data, plot(log2FoldChange, -log10(padj), pch=20, main="Volcano plot"))
  with(subset(data, padj<.05 ), points(log2FoldChange, -log10(padj), pch=20, col="red"))
  with(subset(data, abs(log2FoldChange)>logfc.threshold), points(log2FoldChange, -log10(padj), pch=20, col="orange"))
  with(subset(data, padj<.05 & abs(log2FoldChange)>logfc.threshold), points(log2FoldChange, -log10(padj), pch=20, col="green"))
}

deseq.volcano_plot(deseq.results)

It is finally possible to include a shrink correction on the log fold change. In this case, this removes all significant results.

deseq.results.lfcshrink <- lfcShrink(deseq, coef=2, res=deseq.results)
deseq.volcano_plot(deseq.results.lfcshrink)

Venn diagrams comparing both methods

library("VennDiagram")
# dev.off()
logfc.threshold <- 2
BG.genes <- as.vector(subset(results_genes, qval<=.05 & abs(logFC) >= logfc.threshold )$id)
DEseq2.genes <- row.names(subset(deseq.results, padj<=.05 & abs(log2FoldChange) >= logfc.threshold))
venn.list <- list(BG.genes, DEseq2.genes)
venn.plot <- venn.diagram(venn.list, 
                          "VennDiagram.png", 
                          imagetype = 'png',
                          height = 800, 
                          width = 800,
                          resolution = 100,
                          fill=c("darkmagenta", "darkblue"), 
                          alpha=c(0.5,0.5), 
                          cex = 4, 
                          cat.fontface=0.1, 
                          category.names=c("BG", "DS2"), 
                          main="Differentially expressed genes")

This creates a VennDiagram.png file:

DEGSeq

library("DEGseq")

file <- file.path(getwd(), "stringtie_to_DESeq.genes.csv")
male.indexes <- as.vector(as.integer(row.names(subset(pheno_data, sex=="male")))+1)
female.indexes <- as.vector(as.integer(row.names(subset(pheno_data, sex=="female")))+1)
deg.geneexp.male <- readGeneExp(file=file, geneCol=1, valCol=male.indexes, header=T, sep=",")
deg.geneexp.female <- readGeneExp(file=file, geneCol=1, valCol=female.indexes, header=T, sep=",")
DEGexp(geneExpMatrix1 = deg.geneexp.male, 
       geneExpMatrix2 = deg.geneexp.female,
       geneCol1 = 1,
       expCol1 = c(2,3,4,5,6,7),
       geneCol2 = 1,
       expCol2 = c(2,3,4,5,6,7),
       groupLabel1 = "Male",
       groupLabel2 = "Female",
       method = "LRT",
       outputDir = "DEGseqResults")

degseq.result<-read.table(file=file.path("DEGseqResults","output_score.txt"),header = T,sep = "\t")
degseq.result$Signature2 <- (degseq.result$q.value.Storey.et.al..2003. <= 0.05  & abs(degseq.result$log2.Fold_change.) >= 2)
length(which(degseq.result$Signature2=="TRUE"))

Volcano plot:

degseq.volcano_plot <- function(data){
  logfc.threshold <- 1
  with(data, plot(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, main="Volcano plot"))
  with(subset(data, q.value.Storey.et.al..2003.<.05 ), points(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, col="red"))
  with(subset(data, abs(log2.Fold_change.)>logfc.threshold), points(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, col="orange"))
  with(subset(data, q.value.Storey.et.al..2003.<.05 & abs(log2.Fold_change.)>logfc.threshold), points(log2.Fold_change., -log10(q.value.Storey.et.al..2003.), pch=20, col="green"))
}

degseq.volcano_plot(degseq.result)

Venn diagram

DEGseq.genes <- as.vector(subset(degseq.result, q.value.Storey.et.al..2003.<=.05 & abs(log2.Fold_change.) >= logfc.threshold )$GeneNames)
venn.list <- list(BG.genes, DEseq2.genes, DEGseq.genes)
venn.plot <- venn.diagram(venn.list, 
                          "VennDiagram3.png", 
                          imagetype = 'png',
                          height = 800, 
                          width = 800,
                          resolution = 100,
                          fill=c("darkmagenta", "darkblue", "darkgreen"), 
                          alpha=c(0.5,0.5,0.5), 
                          cex = 4, 
                          cat.fontface=0.1, 
                          category.names=c("BG", "DS2", "DEGseq"), 
                          main="Differentially expressed genes")

This creates a VennDiagram3.png file:

It is very surprising that the three methods don’t overlap at all. I suspect that it is due to the low fold ratios and the low number of significant pvalues. Here is a plot showing the comparison between the fold ratios computed by DESeq2 and BallGown (log10 scale).

Final comparison

library("ggplot2")
BG.results_genes <- subset(results_genes, select=c("logFC","qval"))
row.names(BG.results_genes) <- results_genes$id
names(BG.results_genes) <- c("BG.log.fold.change","BG.qval")

DEseq2.results_genes <- subset(as.data.frame(deseq.results), select=c("log2FoldChange","padj"))
names(DEseq2.results_genes) <- c("DESeq2.log.fold.change","DESeq2.padj")

dt <- merge(BG.results_genes, DEseq2.results_genes, by=0, all=TRUE)

p <- ggplot(dt, aes(x=BG.log.fold.change, y=DESeq2.log.fold.change)) +
  geom_point(shape=1) +
  scale_y_log10() +
  scale_x_log10() 
ggExtra::ggMarginal(p, type="histogram")

Here is a comparison of the corrected p-values.

p1 <- ggplot(dt, aes(x=BG.qval, y=DESeq2.padj)) +
  geom_point(shape=1)
ggExtra::ggMarginal(p1, type="histogram")

RNASeq

Pierre-Yves Dupont

Introduction

HiSAT2

StringTie

Compute transcript abundances and counts tables

Differentially expressed genes (DEG) computation

FPKM

Differentially expressed transcripts

Differentially expressed genes

Data post processing

Visualisation

Volcano plot

Distribution

Data variation for one gene

Plot transcripts

Heatmap

Export

Gene expression - FPMK

Transcript and genes folds export

Comparison with DESeq2

DEG computation

Volcano plot

Venn diagrams comparing both methods

DEGSeq

Volcano plot:

Venn diagram

Final comparison