Open In Colab

Isoform-level differential expression analysis with Ballgown.

Install R package

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ballgown")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
    CRAN: https://cran.rstudio.com

Bioconductor version 3.19 (BiocManager 1.30.25), R 4.4.1 (2024-06-14)

Installing package(s) 'BiocVersion', 'ballgown'

also installing the dependencies ‘plogr’, ‘png’, ‘formatR’, ‘abind’, ‘SparseArray’, ‘RSQLite’, ‘KEGGREST’, ‘lambda.r’, ‘futile.options’, ‘S4Arrays’, ‘DelayedArray’, ‘MatrixGenerics’, ‘AnnotationDbi’, ‘annotate’, ‘futile.logger’, ‘snow’, ‘BH’, ‘locfit’, ‘bitops’, ‘Rhtslib’, ‘SummarizedExperiment’, ‘RCurl’, ‘rjson’, ‘BiocGenerics’, ‘XVector’, ‘genefilter’, ‘BiocParallel’, ‘matrixStats’, ‘edgeR’, ‘statmod’, ‘XML’, ‘Biostrings’, ‘zlibbioc’, ‘Rsamtools’, ‘GenomicAlignments’, ‘BiocIO’, ‘restfulr’, ‘UCSC.utils’, ‘GenomeInfoDbData’, ‘GenomicRanges’, ‘IRanges’, ‘S4Vectors’, ‘sva’, ‘limma’, ‘rtracklayer’, ‘Biobase’, ‘GenomeInfoDb’


Old packages: 'gtable'
library(ballgown)

Attaching package: ‘ballgown’


The following object is masked from ‘package:base’:

    structure

Please upload your data generated by `tablemaker. Please refer to the section Running Tablemaker the link here for details.

You can also download a copy from the path below:

/scratch/zt1/project/bioi611/shared/output/bulkRNA_SE_tablemaker.tar.gz

Or you can download a copy via the link below:

https://umd0-my.sharepoint.com/:u:/g/personal/xie186_umd_edu/EYLz8khnMeRCmyK_YFDDXaQBP_4hzpAgs_nN-TNXghdQMQ?e=5By9ct

getwd()

'/content'


system("tar zxvf bulkRNA_SE_tablemaker.tar.gz")
data_directory = file.path(getwd(), "bulkRNA_SE_tablemaker")
data_directory

'/content/bulkRNA_SE_tablemaker'

# make the ballgown object:
bg = ballgown(dataDir = data_directory, samplePattern='N2_day', meas='all')
bg
Mon Oct 28 10:28:23 2024

Mon Oct 28 10:28:23 2024: Reading linking tables

Mon Oct 28 10:28:24 2024: Reading intron data files

Mon Oct 28 10:28:27 2024: Merging intron data

Mon Oct 28 10:28:27 2024: Reading exon data files

Mon Oct 28 10:28:33 2024: Merging exon data

Mon Oct 28 10:28:34 2024: Reading transcript data files

Mon Oct 28 10:28:38 2024: Merging transcript data

Wrapping up the results

Mon Oct 28 10:28:38 2024




ballgown instance with 60032 transcripts and 6 samples

Accessing assembly data

A ballgown object has six slots: structure, expr, indexes, dirs, mergedDate, and meas.

Exon, intron, and transcript structures are easily extracted from the main ballgown object:

structure(bg)$exon

GRanges object with 178766 ranges and 2 metadata columns:
           seqnames      ranges strand |        id transcripts
              <Rle>   <IRanges>  <Rle> | <integer> <character>
       [1]        I   3747-3909      - |         1           1
       [2]        I   4116-4358      - |         2           2
       [3]        I   5195-5296      - |         3           2
       [4]        I   6037-6327      - |         4           2
       [5]        I   9727-9846      - |         5           2
       ...      ...         ...    ... .       ...         ...
  [178762]    MtDNA 10348-10401      + |    178762       60028
  [178763]    MtDNA 10403-11354      + |    178763       60029
  [178764]    MtDNA 11356-11691      + |    178764       60030
  [178765]    MtDNA 11691-13272      + |    178765       60031
  [178766]    MtDNA 13275-13327      + |    178766       60032
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths
structure(bg)$intron

GRanges object with 116284 ranges and 2 metadata columns:
           seqnames            ranges strand |        id transcripts
              <Rle>         <IRanges>  <Rle> | <integer> <character>
       [1]        I         4359-5194      - |         1           2
       [2]        I         5297-6036      - |         2           2
       [3]        I         6328-9726      - |         3           2
       [4]        I        9847-10094      - |         4           2
       [5]        I       11562-11617      + |         5         3:4
       ...      ...               ...    ... .       ...         ...
  [116280]        X 17715112-17716973      + |    116280 59995:59996
  [116281]        X 17717088-17717170      + |    116281 59995:59996
  [116282]        X 17717279-17717327      + |    116282 59995:59996
  [116283]        X 17717444-17718427      + |    116283       59995
  [116284]        X 17717444-17718434      + |    116284       59996
  -------
  seqinfo: 6 sequences from an unspecified genome; no seqlengths
structure(bg)$trans

GRangesList object of length 60032:
$`1`
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |        id transcripts
         <Rle> <IRanges>  <Rle> | <integer> <character>
  [1]        I 3747-3909      - |         1           1
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths

$`2`
GRanges object with 5 ranges and 2 metadata columns:
      seqnames      ranges strand |        id transcripts
         <Rle>   <IRanges>  <Rle> | <integer> <character>
  [1]        I   4116-4358      - |         2           2
  [2]        I   5195-5296      - |         3           2
  [3]        I   6037-6327      - |         4           2
  [4]        I   9727-9846      - |         5           2
  [5]        I 10095-10230      - |         6           2
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths

$`3`
GRanges object with 5 ranges and 2 metadata columns:
      seqnames      ranges strand |        id transcripts
         <Rle>   <IRanges>  <Rle> | <integer> <character>
  [1]        I 11495-11561      + |         7         3:4
  [2]        I 11618-11689      + |         8         3:5
  [3]        I 14951-15160      + |         9         3:5
  [4]        I 16473-16585      + |        10     c(3, 6)
  [5]        I 16702-16793      + |        11           3
  -------
  seqinfo: 7 sequences from an unspecified genome; no seqlengths

...
<60029 more elements>

expr

The expr slot is a list that contains tables of expression data for the genomic features. These tables are very similar to the *_data.ctab Tablemaker output files. Ballgown implements the following syntax to access components of the expr slot:

*expr(ballgown_object_name, <EXPRESSION_MEASUREMENT>)

where * is either e for exon, i for intron, t for transcript, or g for gene, and is an expression-measurement column name from the appropriate .ctab file. Gene-level measurements are calculated by aggregating the transcript-level measurements for that gene. All of the following are valid ways to extract expression data from the bg ballgown object:

transcript_fpkm = texpr(bg, 'FPKM')
transcript_cov = texpr(bg, 'cov')
whole_tx_table = texpr(bg, 'all')
exon_mcov = eexpr(bg, 'mcov')
junction_rcount = iexpr(bg)
whole_intron_table = iexpr(bg, 'all')
gene_expression = gexpr(bg)
Warning message in .normarg_f(f, x):
“'NROW(x)' is not a multiple of split factor length”
Warning message in tlengths * tmeas:
“longer object length is not a multiple of shorter object length”

Indexes

sampleNames(bg)
  1. 'N2_day1_rep1'
  2. 'N2_day1_rep2'
  3. 'N2_day1_rep3'
  4. 'N2_day7_rep1'
  5. 'N2_day7_rep2'
  6. 'N2_day7_rep3'
pData(bg) = data.frame(id=sampleNames(bg), group=rep(c("young","old"), each=3))
pData(bg)
A data.frame: 6 × 2
idgroup
<chr><chr>
N2_day1_rep1young
N2_day1_rep2young
N2_day1_rep3young
N2_day7_rep1old
N2_day7_rep2old
N2_day7_rep3old

Plotting transcript structures

plotTranscripts(gene='WBGene00002054', gown=bg, samples='N2_day1_rep1',
    meas='FPKM', colorby='transcript',
    main='transcripts from gene XLOC_000454: sample 12, FPKM')

png

It is also possible to plot several samples at once:

plotTranscripts('WBGene00002054', bg,
    samples=c('N2_day1_rep1', 'N2_day7_rep1'),
    meas='FPKM', colorby='transcript')

png


You can also make side-by-side plots comparing mean abundances between groups (here, 0 and 1):

plotMeans('WBGene00002054', bg, groupvar='group', meas='FPKM', colorby='transcript')

png

Differential expression analysis

Ballgown provides a wide selection of simple, fast statistical methods for testing whether transcripts are differentially expressed between experimental conditions or across a continuous covariate (such as time).

stat_results = stattest(bg, feature='transcript',
                   meas='FPKM', covariate='group',
                   getFC=TRUE)

results_transcripts <- data.frame(geneNames = geneNames(bg),
                                  geneIDs = geneIDs(bg),
                                  transcriptNames = transcriptNames(bg),
                                  stat_results)
head(results_transcripts)
A data.frame: 6 × 8
geneNamesgeneIDstranscriptNamesfeatureidfcpvalqval
<chr><chr><chr><chr><chr><dbl><dbl><dbl>
1Y74C9A.6WBGene00023193Y74C9A.6 transcript10.30879350.3137594480.52196762
2homt-1 WBGene00022277Y74C9A.3.1 transcript20.69606740.0068861130.08724496
3nlp-40 WBGene00022276Y74C9A.2a.3transcript30.99999610.9485353190.96924051
4nlp-40 WBGene00022276Y74C9A.2a.1transcript40.42483820.0408627730.16836530
5nlp-40 WBGene00022276Y74C9A.2a.2transcript54.80070110.0503369420.18664213
6nlp-40 WBGene00022276Y74C9A.2b.1transcript60.62993000.1124777850.29368625
results_transcripts <- results_transcripts[order(results_transcripts$qval), ]
head(results_transcripts, 10)
A data.frame: 10 × 8
geneNamesgeneIDstranscriptNamesfeatureidfcpvalqval
<chr><chr><chr><chr><chr><dbl><dbl><dbl>
1940F48C1.8 WBGene00018600F48C1.8.1 transcript1940 2.10126065.213184e-060.01647351
3643F32H2.15 WBGene00284858F32H2.15 transcript3643 0.19120401.087705e-060.01647351
3811lin-41 WBGene00003026C12C8.3a.1 transcript3811 12.14724532.940965e-060.01647351
6887 WBGene00044425F54D12.11 transcript6887 0.87750277.411872e-060.01647351
8957ifb-2 WBGene00002054F10C1.7a.1 transcript8957 4.56250271.180041e-060.01647351
20015Y69A2AR.28WBGene00022099Y69A2AR.28.1transcript20015 0.60764037.905547e-060.01647351
20798str-185 WBGene00006228R08C7.7.1 transcript20798 0.96686318.516315e-060.01647351
23441unc-44 WBGene00006780B0350.2a.1 transcript23441 3.33190265.919439e-060.01647351
25280plp-1 WBGene00004046F45E4.2.1 transcript25280 0.70532677.087841e-060.01647351
25710 WBGene00023488T26A8.5 transcript25710 0.96328628.470675e-060.01647351
results_transcripts[results_transcripts$geneIDs == "WBGene00002054", ]
A data.frame: 3 × 8
geneNamesgeneIDstranscriptNamesfeatureidfcpvalqval
<chr><chr><chr><chr><chr><dbl><dbl><dbl>
8957ifb-2WBGene00002054F10C1.7a.1transcript89574.5625031.180041e-060.01647351
8956ifb-2WBGene00002054F10C1.7c.1transcript89563.5545433.838807e-040.04612121
8958ifb-2WBGene00002054F10C1.7e.1transcript89581.0096728.838138e-010.93058010
plotMeans('WBGene00002054', bg, groupvar='group', meas='FPKM', colorby='transcript')

png

write.csv(results_transcripts,
   file = "BIOI611_bulkRNA_SE_ballgown.csv",
   row.names = FALSE)
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] ballgown_2.36.0

loaded via a namespace (and not attached):
 [1] IRdisplay_1.1               blob_1.2.4                 
 [3] Biostrings_2.72.1           bitops_1.0-9               
 [5] fastmap_1.2.0               RCurl_1.98-1.16            
 [7] GenomicAlignments_1.40.0    XML_3.99-0.17              
 [9] digest_0.6.37               lifecycle_1.0.4            
[11] survival_3.7-0              statmod_1.5.0              
[13] KEGGREST_1.44.1             RSQLite_2.3.7              
[15] genefilter_1.86.0           compiler_4.4.1             
[17] rlang_1.1.4                 tools_4.4.1                
[19] utf8_1.2.4                  yaml_2.3.10                
[21] rtracklayer_1.64.0          S4Arrays_1.4.1             
[23] bit_4.5.0                   curl_5.2.3                 
[25] DelayedArray_0.30.1         repr_1.1.7                 
[27] RColorBrewer_1.1-3          abind_1.4-8                
[29] BiocParallel_1.38.0         pbdZMQ_0.3-13              
[31] BiocGenerics_0.50.0         grid_4.4.1                 
[33] stats4_4.4.1                fansi_1.0.6                
[35] xtable_1.8-4                edgeR_4.2.2                
[37] SummarizedExperiment_1.34.0 cli_3.6.3                  
[39] crayon_1.5.3                httr_1.4.7                 
[41] rjson_0.2.23                DBI_1.2.3                  
[43] cachem_1.1.0                zlibbioc_1.50.0            
[45] splines_4.4.1               parallel_4.4.1             
[47] AnnotationDbi_1.66.0        BiocManager_1.30.25        
[49] XVector_0.44.0              restfulr_0.0.15            
[51] matrixStats_1.4.1           base64enc_0.1-3            
[53] vctrs_0.6.5                 Matrix_1.7-1               
[55] jsonlite_1.8.9              sva_3.52.0                 
[57] IRanges_2.38.1              S4Vectors_0.42.1           
[59] bit64_4.5.2                 locfit_1.5-9.10            
[61] limma_3.60.6                annotate_1.82.0            
[63] glue_1.8.0                  codetools_0.2-20           
[65] GenomeInfoDb_1.40.1         BiocIO_1.14.0              
[67] GenomicRanges_1.56.2        UCSC.utils_1.0.0           
[69] pillar_1.9.0                htmltools_0.5.8.1          
[71] IRkernel_1.3.2              GenomeInfoDbData_1.2.12    
[73] R6_2.5.1                    evaluate_1.0.1             
[75] lattice_0.22-6              Biobase_2.64.0             
[77] png_0.1-8                   Rsamtools_2.20.0           
[79] memoise_2.0.1               Rcpp_1.0.13                
[81] uuid_1.2-1                  SparseArray_1.4.8          
[83] nlme_3.1-166                mgcv_1.9-1                 
[85] MatrixGenerics_1.16.0

Reference

https://www.bioconductor.org/packages/release/bioc/vignettes/ballgown/inst/doc/ballgown.html