Isoform-level differential expression analysis with Ballgown.
Install R package
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("ballgown")
'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
CRAN: https://cran.rstudio.com
Bioconductor version 3.19 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)
Warning message:
“package(s) not installed when version(s) same as or greater than current; use
`force = TRUE` to re-install: 'ballgown'”
Old packages: 'Matrix'
Understand the folder with example data
library(ballgown)
data_directory = system.file('extdata', package='ballgown') # automatically finds ballgown's installation directory
# examine data_directory:
data_directory
Attaching package: ‘ballgown’
The following object is masked from ‘package:base’:
structure
'/usr/local/lib/R/site-library/ballgown/extdata'
list.files(data_directory)
- 'annot.gtf.gz'
- 'hg19_genes_small.gtf.gz'
- 'sample01'
- 'sample02'
- 'sample03'
- 'sample04'
- 'sample05'
- 'sample06'
- 'sample07'
- 'sample08'
- 'sample09'
- 'sample10'
- 'sample11'
- 'sample12'
- 'sample13'
- 'sample14'
- 'sample15'
- 'sample16'
- 'sample17'
- 'sample18'
- 'sample19'
- 'sample20'
- 'tiny.genes.results.gz'
- 'tiny.isoforms.results.gz'
- 'tiny2.genes.results.gz'
- 'tiny2.isoforms.results.gz'
# make the ballgown object:
bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all')
bg
Mon Oct 21 17:42:05 2024
Mon Oct 21 17:42:05 2024: Reading linking tables
Mon Oct 21 17:42:05 2024: Reading intron data files
Mon Oct 21 17:42:05 2024: Merging intron data
Mon Oct 21 17:42:05 2024: Reading exon data files
Mon Oct 21 17:42:05 2024: Merging exon data
Mon Oct 21 17:42:05 2024: Reading transcript data files
Mon Oct 21 17:42:05 2024: Merging transcript data
Wrapping up the results
Mon Oct 21 17:42:05 2024
ballgown instance with 100 transcripts and 20 samples
Accessing assembly data
A ballgown object has six slots: structure, expr, indexes, dirs, mergedDate, and meas.
Exon, intron, and transcript structures are easily extracted from the main ballgown object:
Structure
structure(bg)$exon
GRanges object with 633 ranges and 2 metadata columns:
seqnames ranges strand | id transcripts
<Rle> <IRanges> <Rle> | <integer> <character>
[1] 18 24412069-24412331 * | 12 10
[2] 22 17308271-17308950 + | 55 25
[3] 22 17309432-17310226 + | 56 25
[4] 22 18121428-18121652 + | 88 35
[5] 22 18138428-18138598 + | 89 35
... ... ... ... . ... ...
[629] 22 51221929-51222113 - | 3777 1294
[630] 22 51221319-51221473 - | 3782 1297
[631] 22 51221929-51222162 - | 3783 1297
[632] 22 51221929-51222168 - | 3784 1301
[633] 6 31248149-31248334 * | 3794 1312
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
structure(bg)$intron
GRanges object with 536 ranges and 2 metadata columns:
seqnames ranges strand | id transcripts
<Rle> <IRanges> <Rle> | <integer> <character>
[1] 22 17308951-17309431 + | 33 25
[2] 22 18121653-18138427 + | 57 35
[3] 22 18138599-18185008 + | 58 35
[4] 22 18185153-18209442 + | 59 35
[5] 22 18385514-18387397 - | 72 41
... ... ... ... . ... ...
[532] 22 51216410-51220615 - | 2750 c(1294, 1297, 1301)
[533] 22 51220776-51221928 - | 2756 1294
[534] 22 51220780-51221318 - | 2757 1297
[535] 22 51221474-51221928 - | 2758 1297
[536] 22 51220780-51221928 - | 2759 1301
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
structure(bg)$trans
GRangesList object of length 100:
$`10`
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | id transcripts
<Rle> <IRanges> <Rle> | <integer> <character>
[1] 18 24412069-24412331 * | 12 10
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
$`25`
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | id transcripts
<Rle> <IRanges> <Rle> | <integer> <character>
[1] 22 17308271-17308950 + | 55 25
[2] 22 17309432-17310226 + | 56 25
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
$`35`
GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | id transcripts
<Rle> <IRanges> <Rle> | <integer> <character>
[1] 22 18121428-18121652 + | 88 35
[2] 22 18138428-18138598 + | 89 35
[3] 22 18185009-18185152 + | 90 35
[4] 22 18209443-18212080 + | 91 35
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
...
<97 more elements>
expr
The expr slot is a list that contains tables of expression data for the genomic features. These tables are very similar to the *_data.ctab Tablemaker output files. Ballgown implements the following syntax to access components of the expr slot:
*expr(ballgown_object_name, <EXPRESSION_MEASUREMENT>)
where * is either e for exon, i for intron, t for transcript, or g for gene, and is an expression-measurement column name from the appropriate .ctab file. Gene-level measurements are calculated by aggregating the transcript-level measurements for that gene. All of the following are valid ways to extract expression data from the bg ballgown object:
transcript_fpkm = texpr(bg, 'FPKM')
transcript_cov = texpr(bg, 'cov')
whole_tx_table = texpr(bg, 'all')
exon_mcov = eexpr(bg, 'mcov')
junction_rcount = iexpr(bg)
whole_intron_table = iexpr(bg, 'all')
gene_expression = gexpr(bg)
Indexes
pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10))
pData(bg)
id | group |
---|---|
<chr> | <dbl> |
sample01 | 1 |
sample02 | 1 |
sample03 | 1 |
sample04 | 1 |
sample05 | 1 |
sample06 | 1 |
sample07 | 1 |
sample08 | 1 |
sample09 | 1 |
sample10 | 1 |
sample11 | 0 |
sample12 | 0 |
sample13 | 0 |
sample14 | 0 |
sample15 | 0 |
sample16 | 0 |
sample17 | 0 |
sample18 | 0 |
sample19 | 0 |
sample20 | 0 |
Plotting transcript structures
plotTranscripts(gene='XLOC_000454', gown=bg, samples='sample12',
meas='FPKM', colorby='transcript',
main='transcripts from gene XLOC_000454: sample 12, FPKM')
It is also possible to plot several samples at once:
plotTranscripts('XLOC_000454', bg,
samples=c('sample01', 'sample06', 'sample12', 'sample19'),
meas='FPKM', colorby='transcript')
You can also make side-by-side plots comparing mean abundances between groups (here, 0 and 1):
plotMeans('XLOC_000454', bg, groupvar='group', meas='FPKM', colorby='transcript')
Differential expression analysis
Ballgown provides a wide selection of simple, fast statistical methods for testing whether transcripts are differentially expressed between experimental conditions or across a continuous covariate (such as time).
stat_results = stattest(bg, feature='transcript',
meas='FPKM', covariate='group',
getFC=TRUE)
results_transcripts <- data.frame(geneNames = geneNames(bg),
geneIDs = geneIDs(bg),
transcriptNames = transcriptNames(bg),
stat_results)
head(results_transcripts)
geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
10 | XLOC_000010 | TCONS_00000010 | transcript | 10 | 3.193499 | 0.01381576 | 0.10521233 | |
25 | XLOC_000014 | TCONS_00000017 | transcript | 25 | 1.549093 | 0.26773622 | 0.79114975 | |
35 | XLOC_000017 | TCONS_00000020 | transcript | 35 | 4.388626 | 0.01085070 | 0.08951825 | |
41 | XLOC_000246 | TCONS_00000598 | transcript | 41 | 1.440519 | 0.47108019 | 0.90253747 | |
45 | XLOC_000019 | TCONS_00000024 | transcript | 45 | 1.714340 | 0.08402948 | 0.48934813 | |
67 | XLOC_000255 | TCONS_00000613 | transcript | 67 | 2.518524 | 0.27317385 | 0.79114975 |
results_transcripts <- results_transcripts[order(results_transcripts$qval), ]
head(results_transcripts, 10)
geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
1225 | XLOC_000440 | TCONS_00001129 | transcript | 1225 | 5.67437122 | 1.035753e-05 | 0.001025395 | |
980 | XLOC_000179 | TCONS_00000452 | transcript | 980 | 6.25344921 | 2.514632e-05 | 0.001244743 | |
469 | XLOC_000101 | TCONS_00000244 | transcript | 469 | 119.29999938 | 2.398681e-04 | 0.007915648 | |
695 | XLOC_000354 | TCONS_00000883 | transcript | 695 | 0.21950959 | 3.302059e-04 | 0.008172596 | |
1012 | XLOC_000409 | TCONS_00001041 | transcript | 1012 | 0.24664434 | 1.527175e-03 | 0.030238073 | |
123 | XLOC_000029 | TCONS_00000059 | transcript | 123 | 0.01603345 | 2.097875e-03 | 0.034614939 | |
961 | XLOC_000176 | TCONS_00000435 | transcript | 961 | 4.96074399 | 2.736075e-03 | 0.038695918 | |
880 | XLOC_000531 | TCONS_00001277 | transcript | 880 | 29.40485236 | 3.272859e-03 | 0.040501628 | |
1063 | XLOC_000197 | TCONS_00000487 | transcript | 1063 | 3.12988187 | 4.555313e-03 | 0.050108442 | |
35 | XLOC_000017 | TCONS_00000020 | transcript | 35 | 4.38862590 | 1.085070e-02 | 0.089518247 |
results_transcripts[results_transcripts$geneIDs == "XLOC_000101", ]
geneNames | geneIDs | transcriptNames | feature | id | fc | pval | qval | |
---|---|---|---|---|---|---|---|---|
<chr> | <chr> | <chr> | <chr> | <chr> | <dbl> | <dbl> | <dbl> | |
469 | XLOC_000101 | TCONS_00000244 | transcript | 469 | 119.299999 | 0.0002398681 | 0.007915648 | |
477 | XLOC_000101 | TCONS_00000252 | transcript | 477 | 3.128184 | 0.4511803632 | 0.902537469 |
plotMeans('XLOC_000101', bg, groupvar='group', meas='FPKM', colorby='transcript')
Reference
https://www.bioconductor.org/packages/release/bioc/vignettes/ballgown/inst/doc/ballgown.html