Open In Colab

Isoform-level differential expression analysis with Ballgown.

This notebook is based on the reference: https://www.bioconductor.org/packages/release/bioc/vignettes/ballgown/inst/doc/ballgown.html

Install R package

## Took around 17 minutes
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ballgown")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.rstudio.com

Bioconductor version 3.22 (BiocManager 1.30.26), R 4.5.2 (2025-10-31)

Installing package(s) 'BiocVersion', 'ballgown'

also installing the dependencies ‘plogr’, ‘png’, ‘formatR’, ‘abind’, ‘SparseArray’, ‘RSQLite’, ‘KEGGREST’, ‘lambda.r’, ‘futile.options’, ‘S4Arrays’, ‘DelayedArray’, ‘MatrixGenerics’, ‘AnnotationDbi’, ‘annotate’, ‘futile.logger’, ‘snow’, ‘BH’, ‘locfit’, ‘bitops’, ‘Rhtslib’, ‘SummarizedExperiment’, ‘cigarillo’, ‘RCurl’, ‘rjson’, ‘BiocGenerics’, ‘genefilter’, ‘BiocParallel’, ‘matrixStats’, ‘edgeR’, ‘statmod’, ‘XML’, ‘XVector’, ‘Biostrings’, ‘Rsamtools’, ‘GenomicAlignments’, ‘BiocIO’, ‘restfulr’, ‘GenomicRanges’, ‘IRanges’, ‘S4Vectors’, ‘sva’, ‘limma’, ‘rtracklayer’, ‘Biobase’, ‘Seqinfo’


Old packages: 'digest', 'testthat'

Understand the folder with example data

library(ballgown)
data_directory = system.file('extdata', package='ballgown') # automatically finds ballgown's installation directory
# examine data_directory:
data_directory
Attaching package: ‘ballgown’


The following object is masked from ‘package:base’:

    structure

'/usr/local/lib/R/site-library/ballgown/extdata'

list.files(data_directory)
  1. 'annot.gtf.gz'
  2. 'hg19_genes_small.gtf.gz'
  3. 'sample01'
  4. 'sample02'
  5. 'sample03'
  6. 'sample04'
  7. 'sample05'
  8. 'sample06'
  9. 'sample07'
  10. 'sample08'
  11. 'sample09'
  12. 'sample10'
  13. 'sample11'
  14. 'sample12'
  15. 'sample13'
  16. 'sample14'
  17. 'sample15'
  18. 'sample16'
  19. 'sample17'
  20. 'sample18'
  21. 'sample19'
  22. 'sample20'
  23. 'tiny.genes.results.gz'
  24. 'tiny.isoforms.results.gz'
  25. 'tiny2.genes.results.gz'
  26. 'tiny2.isoforms.results.gz'
list.files(paste0(data_directory, "/sample01"))
  1. 'e_data.ctab'
  2. 'e2t.ctab'
  3. 'i_data.ctab'
  4. 'i2t.ctab'
  5. 't_data.ctab'
File Level Purpose
e2t.ctab Exon → Transcript Maps each exon to transcript(s)
i2t.ctab Intron → Transcript Maps each intron to transcript(s)
t_data.ctab Transcript Expression + structural data
e_data.ctab Exon Exon expression levels
i_data.ctab Intron Intron expression levels
list.files(file.path())
# make the ballgown object:
bg = ballgown(dataDir=data_directory, samplePattern='sample', meas='all')
bg
Thu Nov 13 22:22:51 2025

Thu Nov 13 22:22:51 2025: Reading linking tables

Thu Nov 13 22:22:51 2025: Reading intron data files

Thu Nov 13 22:22:51 2025: Merging intron data

Thu Nov 13 22:22:51 2025: Reading exon data files

Thu Nov 13 22:22:51 2025: Merging exon data

Thu Nov 13 22:22:51 2025: Reading transcript data files

Thu Nov 13 22:22:51 2025: Merging transcript data

Wrapping up the results

Thu Nov 13 22:22:51 2025




ballgown instance with 100 transcripts and 20 samples

Accessing assembly data

A ballgown object has six slots: structure, expr, indexes, dirs, mergedDate, and meas.

Exon, intron, and transcript structures are easily extracted from the main ballgown object:

Structure

structure(bg)$exon

GRanges object with 633 ranges and 2 metadata columns:
        seqnames            ranges strand |        id transcripts
           <Rle>         <IRanges>  <Rle> | <integer> <character>
    [1]       18 24412069-24412331      * |        12          10
    [2]       22 17308271-17308950      + |        55          25
    [3]       22 17309432-17310226      + |        56          25
    [4]       22 18121428-18121652      + |        88          35
    [5]       22 18138428-18138598      + |        89          35
    ...      ...               ...    ... .       ...         ...
  [629]       22 51221929-51222113      - |      3777        1294
  [630]       22 51221319-51221473      - |      3782        1297
  [631]       22 51221929-51222162      - |      3783        1297
  [632]       22 51221929-51222168      - |      3784        1301
  [633]        6 31248149-31248334      * |      3794        1312
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
structure(bg)$intron

GRanges object with 536 ranges and 2 metadata columns:
        seqnames            ranges strand |        id         transcripts
           <Rle>         <IRanges>  <Rle> | <integer>         <character>
    [1]       22 17308951-17309431      + |        33                  25
    [2]       22 18121653-18138427      + |        57                  35
    [3]       22 18138599-18185008      + |        58                  35
    [4]       22 18185153-18209442      + |        59                  35
    [5]       22 18385514-18387397      - |        72                  41
    ...      ...               ...    ... .       ...                 ...
  [532]       22 51216410-51220615      - |      2750 c(1294, 1297, 1301)
  [533]       22 51220776-51221928      - |      2756                1294
  [534]       22 51220780-51221318      - |      2757                1297
  [535]       22 51221474-51221928      - |      2758                1297
  [536]       22 51220780-51221928      - |      2759                1301
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
structure(bg)$trans

GRangesList object of length 100:
$`10`
GRanges object with 1 range and 2 metadata columns:
      seqnames            ranges strand |        id transcripts
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]       18 24412069-24412331      * |        12          10
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

$`25`
GRanges object with 2 ranges and 2 metadata columns:
      seqnames            ranges strand |        id transcripts
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]       22 17308271-17308950      + |        55          25
  [2]       22 17309432-17310226      + |        56          25
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

$`35`
GRanges object with 4 ranges and 2 metadata columns:
      seqnames            ranges strand |        id transcripts
         <Rle>         <IRanges>  <Rle> | <integer> <character>
  [1]       22 18121428-18121652      + |        88          35
  [2]       22 18138428-18138598      + |        89          35
  [3]       22 18185009-18185152      + |        90          35
  [4]       22 18209443-18212080      + |        91          35
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

...
<97 more elements>

expr

The expr slot is a list that contains tables of expression data for the genomic features. These tables are very similar to the *_data.ctab Tablemaker output files. Ballgown implements the following syntax to access components of the expr slot:

*expr(ballgown_object_name, <EXPRESSION_MEASUREMENT>)

where * is either e for exon, i for intron, t for transcript, or g for gene, and is an expression-measurement column name from the appropriate .ctab file. Gene-level measurements are calculated by aggregating the transcript-level measurements for that gene. All of the following are valid ways to extract expression data from the bg ballgown object:

transcript_fpkm = texpr(bg, 'FPKM')
transcript_cov = texpr(bg, 'cov')
whole_tx_table = texpr(bg, 'all')
exon_mcov = eexpr(bg, 'mcov')
junction_rcount = iexpr(bg)
whole_intron_table = iexpr(bg, 'all')
gene_expression = gexpr(bg)

Indexes

pData(bg) = data.frame(id=sampleNames(bg), group=rep(c(1,0), each=10))

pData(bg)
A data.frame: 20 × 2
idgroup
<chr><dbl>
sample011
sample021
sample031
sample041
sample051
sample061
sample071
sample081
sample091
sample101
sample110
sample120
sample130
sample140
sample150
sample160
sample170
sample180
sample190
sample200

Plotting transcript structures

plotTranscripts(gene='XLOC_000454', gown=bg, samples='sample12',
    meas='FPKM', colorby='transcript',
    main='transcripts from gene XLOC_000454: sample 12, FPKM')

png

It is also possible to plot several samples at once:

plotTranscripts('XLOC_000454', bg,
    samples=c('sample01', 'sample06', 'sample12', 'sample19'),
    meas='FPKM', colorby='transcript')

png


You can also make side-by-side plots comparing mean abundances between groups (here, 0 and 1):

plotMeans('XLOC_000454', bg, groupvar='group', meas='FPKM', colorby='transcript')

png

Differential expression analysis

Ballgown provides a wide selection of simple, fast statistical methods for testing whether transcripts are differentially expressed between experimental conditions or across a continuous covariate (such as time).

stat_results = stattest(bg, feature='transcript',
                   meas='FPKM', covariate='group',
                   getFC=TRUE)

results_transcripts <- data.frame(geneNames = geneNames(bg),
                                  geneIDs = geneIDs(bg),
                                  transcriptNames = transcriptNames(bg),
                                  stat_results)
head(results_transcripts)
A data.frame: 6 × 8
geneNamesgeneIDstranscriptNamesfeatureidfcpvalqval
<chr><chr><chr><chr><chr><dbl><dbl><dbl>
10XLOC_000010TCONS_00000010transcript103.1934990.013815760.10521233
25XLOC_000014TCONS_00000017transcript251.5490930.267736220.79114975
35XLOC_000017TCONS_00000020transcript354.3886260.010850700.08951825
41XLOC_000246TCONS_00000598transcript411.4405190.471080190.90253747
45XLOC_000019TCONS_00000024transcript451.7143400.084029480.48934813
67XLOC_000255TCONS_00000613transcript672.5185240.273173850.79114975
results_transcripts <- results_transcripts[order(results_transcripts$qval), ]
head(results_transcripts, 10)
A data.frame: 10 × 8
geneNamesgeneIDstranscriptNamesfeatureidfcpvalqval
<chr><chr><chr><chr><chr><dbl><dbl><dbl>
1225XLOC_000440TCONS_00001129transcript1225 5.674371221.035753e-050.001025395
980XLOC_000179TCONS_00000452transcript980 6.253449212.514632e-050.001244743
469XLOC_000101TCONS_00000244transcript469 119.299999382.398681e-040.007915648
695XLOC_000354TCONS_00000883transcript695 0.219509593.302059e-040.008172596
1012XLOC_000409TCONS_00001041transcript1012 0.246644341.527175e-030.030238073
123XLOC_000029TCONS_00000059transcript123 0.016033452.097875e-030.034614939
961XLOC_000176TCONS_00000435transcript961 4.960743992.736075e-030.038695918
880XLOC_000531TCONS_00001277transcript880 29.404852363.272859e-030.040501628
1063XLOC_000197TCONS_00000487transcript1063 3.129881874.555313e-030.050108442
35XLOC_000017TCONS_00000020transcript35 4.388625901.085070e-020.089518247
results_transcripts[results_transcripts$geneIDs == "XLOC_000101", ]
A data.frame: 2 × 8
geneNamesgeneIDstranscriptNamesfeatureidfcpvalqval
<chr><chr><chr><chr><chr><dbl><dbl><dbl>
469XLOC_000101TCONS_00000244transcript469119.2999990.00023986810.007915648
477XLOC_000101TCONS_00000252transcript477 3.1281840.45118036320.902537469
plotMeans('XLOC_000101', bg, groupvar='group', meas='FPKM', colorby='transcript')

png

Reference

https://www.bioconductor.org/packages/release/bioc/vignettes/ballgown/inst/doc/ballgown.html