Isoform-level differential expression analysis with Ballgown.

Install R package

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ballgown")

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.rstudio.com

Bioconductor version 3.19 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)

Warning message:
“package(s) not installed when version(s) same as or greater than current; use
  `force = TRUE` to re-install: 'ballgown'”
Old packages: 'Matrix'

Understand the folder with example data

library(ballgown)
data_directory = system.file('extdata', package='ballgown') # automatically finds ballgown's installation directory
# examine data_directory:
data_directory

Attaching package: ‘ballgown’


The following object is masked from ‘package:base’:

    structure

'/usr/local/lib/R/site-library/ballgown/extdata'

list.files(data_directory)

'annot.gtf.gz'
'hg19_genes_small.gtf.gz'
'sample01'
'sample02'
'sample03'
'sample04'
'sample05'
'sample06'
'sample07'
'sample08'
'sample09'
'sample10'
'sample11'
'sample12'
'sample13'
'sample14'
'sample15'
'sample16'
'sample17'
'sample18'
'sample19'
'sample20'
'tiny.genes.results.gz'
'tiny.isoforms.results.gz'
'tiny2.genes.results.gz'
'tiny2.isoforms.results.gz'

# make the ballgown object:
bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all')
bg

Mon Oct 21 17:42:05 2024

Mon Oct 21 17:42:05 2024: Reading linking tables

Mon Oct 21 17:42:05 2024: Reading intron data files

Mon Oct 21 17:42:05 2024: Merging intron data

Mon Oct 21 17:42:05 2024: Reading exon data files

Mon Oct 21 17:42:05 2024: Merging exon data

Mon Oct 21 17:42:05 2024: Reading transcript data files

Mon Oct 21 17:42:05 2024: Merging transcript data

Wrapping up the results

Mon Oct 21 17:42:05 2024




ballgown instance with 100 transcripts and 20 samples

Accessing assembly data

A ballgown object has six slots: structure, expr, indexes, dirs, mergedDate, and meas.

Exon, intron, and transcript structures are easily extracted from the main ballgown object:

Structure

structure(bg)$exon

GRanges object with 633 ranges and 2 metadata columns:
        seqnames            ranges strand |        id transcripts
           <Rle>         <IRanges>  <Rle> | <integer> <character>
    [1]       18 24412069-24412331      * |        12          10
    [2]       22 17308271-17308950      + |        55          25
    [3]       22 17309432-17310226      + |        56          25
    [4]       22 18121428-18121652      + |        88          35
    [5]       22 18138428-18138598      + |        89          35
    ...      ...               ...    ... .       ...         ...
  [629]       22 51221929-51222113      - |      3777        1294
  [630]       22 51221319-51221473      - |      3782        1297
  [631]       22 51221929-51222162      - |      3783        1297
  [632]       22 51221929-51222168      - |      3784        1301
  [633]        6 31248149-31248334      * |      3794        1312
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

structure(bg)$intron

GRanges object with 536 ranges and 2 metadata columns:
        seqnames            ranges strand |        id         transcripts
           <Rle>         <IRanges>  <Rle> | <integer>         <character>
    [1]       22 17308951-17309431      + |        33                  25
    [2]       22 18121653-18138427      + |        57                  35
    [3]       22 18138599-18185008      + |        58                  35
    [4]       22 18185153-18209442      + |        59                  35
    [5]       22 18385514-18387397      - |        72                  41
    ...      ...               ...    ... .       ...                 ...
  [532]       22 51216410-51220615      - |      2750 c(1294, 1297, 1301)
  [533]       22 51220776-51221928      - |      2756                1294
  [534]       22 51220780-51221318      - |      2757                1297
  [535]       22 51221474-51221928      - |      2758                1297
  [536]       22 51220780-51221928      - |      2759                1301
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

structure(bg)$trans

GRangesList object of length 100:
$`10`
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |        id transcripts
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]       18 24412069-24412331      * |        12          10
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

$`25`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |        id transcripts
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]       22 17308271-17308950      + |        55          25
  [2]       22 17309432-17310226      + |        56          25
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

$`35`
GRanges object with 4 ranges and 2 metadata columns:
      seqnames            ranges strand |        id transcripts
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]       22 18121428-18121652      + |        88          35
  [2]       22 18138428-18138598      + |        89          35
  [3]       22 18185009-18185152      + |        90          35
  [4]       22 18209443-18212080      + |        91          35
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

...
<97 more elements>

expr

The expr slot is a list that contains tables of expression data for the genomic features. These tables are very similar to the *_data.ctab Tablemaker output files. Ballgown implements the following syntax to access components of the expr slot:

*expr(ballgown_object_name, <EXPRESSION_MEASUREMENT>)

where * is either e for exon, i for intron, t for transcript, or g for gene, and is an expression-measurement column name from the appropriate .ctab file. Gene-level measurements are calculated by aggregating the transcript-level measurements for that gene. All of the following are valid ways to extract expression data from the bg ballgown object:

transcript_fpkm = texpr(bg, 'FPKM')
transcript_cov = texpr(bg, 'cov')
whole_tx_table = texpr(bg, 'all')
exon_mcov = eexpr(bg, 'mcov')
junction_rcount = iexpr(bg)
whole_intron_table = iexpr(bg, 'all')
gene_expression = gexpr(bg)

Indexes

pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10))

pData(bg)

A data.frame: 20 × 2
id	group
<chr>	<dbl>
sample01	1
sample02	1
sample03	1
sample04	1
sample05	1
sample06	1
sample07	1
sample08	1
sample09	1
sample10	1
sample11	0
sample12	0
sample13	0
sample14	0
sample15	0
sample16	0
sample17	0
sample18	0
sample19	0
sample20	0

Plotting transcript structures

plotTranscripts(gene='XLOC_000454', gown=bg, samples='sample12',
    meas='FPKM', colorby='transcript',
    main='transcripts from gene XLOC_000454: sample 12, FPKM')

png

It is also possible to plot several samples at once:

plotTranscripts('XLOC_000454', bg,
    samples=c('sample01', 'sample06', 'sample12', 'sample19'),
    meas='FPKM', colorby='transcript')

png

You can also make side-by-side plots comparing mean abundances between groups (here, 0 and 1):

plotMeans('XLOC_000454', bg, groupvar='group', meas='FPKM', colorby='transcript')

png

Differential expression analysis

Ballgown provides a wide selection of simple, fast statistical methods for testing whether transcripts are differentially expressed between experimental conditions or across a continuous covariate (such as time).

stat_results = stattest(bg, feature='transcript',
                   meas='FPKM', covariate='group',
                   getFC=TRUE)

results_transcripts <- data.frame(geneNames = geneNames(bg),
                                  geneIDs = geneIDs(bg),
                                  transcriptNames = transcriptNames(bg),
                                  stat_results)

head(results_transcripts)

A data.frame: 6 × 8
	geneNames	geneIDs	transcriptNames	feature	id	fc	pval	qval
	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>
10		XLOC_000010	TCONS_00000010	transcript	10	3.193499	0.01381576	0.10521233
25		XLOC_000014	TCONS_00000017	transcript	25	1.549093	0.26773622	0.79114975
35		XLOC_000017	TCONS_00000020	transcript	35	4.388626	0.01085070	0.08951825
41		XLOC_000246	TCONS_00000598	transcript	41	1.440519	0.47108019	0.90253747
45		XLOC_000019	TCONS_00000024	transcript	45	1.714340	0.08402948	0.48934813
67		XLOC_000255	TCONS_00000613	transcript	67	2.518524	0.27317385	0.79114975

results_transcripts <- results_transcripts[order(results_transcripts$qval), ]

head(results_transcripts, 10)

A data.frame: 10 × 8
	geneNames	geneIDs	transcriptNames	feature	id	fc	pval	qval
	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>
1225		XLOC_000440	TCONS_00001129	transcript	1225	5.67437122	1.035753e-05	0.001025395
980		XLOC_000179	TCONS_00000452	transcript	980	6.25344921	2.514632e-05	0.001244743
469		XLOC_000101	TCONS_00000244	transcript	469	119.29999938	2.398681e-04	0.007915648
695		XLOC_000354	TCONS_00000883	transcript	695	0.21950959	3.302059e-04	0.008172596
1012		XLOC_000409	TCONS_00001041	transcript	1012	0.24664434	1.527175e-03	0.030238073
123		XLOC_000029	TCONS_00000059	transcript	123	0.01603345	2.097875e-03	0.034614939
961		XLOC_000176	TCONS_00000435	transcript	961	4.96074399	2.736075e-03	0.038695918
880		XLOC_000531	TCONS_00001277	transcript	880	29.40485236	3.272859e-03	0.040501628
1063		XLOC_000197	TCONS_00000487	transcript	1063	3.12988187	4.555313e-03	0.050108442
35		XLOC_000017	TCONS_00000020	transcript	35	4.38862590	1.085070e-02	0.089518247

results_transcripts[results_transcripts$geneIDs == "XLOC_000101", ]

A data.frame: 2 × 8
	geneNames	geneIDs	transcriptNames	feature	id	fc	pval	qval
	<chr>	<chr>	<chr>	<chr>	<chr>	<dbl>	<dbl>	<dbl>
469		XLOC_000101	TCONS_00000244	transcript	469	119.299999	0.0002398681	0.007915648
477		XLOC_000101	TCONS_00000252	transcript	477	3.128184	0.4511803632	0.902537469

plotMeans('XLOC_000101', bg, groupvar='group', meas='FPKM', colorby='transcript')

png

Reference

https://www.bioconductor.org/packages/release/bioc/vignettes/ballgown/inst/doc/ballgown.html