Analysis of RNA-seq data using R
Navigation in the file system and read the count files
getwd()
list.files()
# Define base URL and files to download
base_url <- "https://raw.githubusercontent.com/Bix4UMD/BIOI611_lab/refs/heads/main/data/bulkRNA_SE_tab/"
files <- c("N2_day1_rep1.ReadsPerGene.out.tab",
"N2_day1_rep2.ReadsPerGene.out.tab",
"N2_day1_rep3.ReadsPerGene.out.tab",
"N2_day7_rep1.ReadsPerGene.out.tab",
"N2_day7_rep2.ReadsPerGene.out.tab",
"N2_day7_rep3.ReadsPerGene.out.tab"
)
# Download each file
for (f in files) {
download.file(paste0(base_url, f), destfile = f, method = "libcurl")
}
list.files()
file_paths <- list.files(pattern = "*.ReadsPerGene.out.tab")
file_paths
# Function to read the STAR ReadsPerGene.out.tab file
read_star_file <- function(file_path) {
# Read the file
df <- read.table(file_path, header = FALSE, stringsAsFactors = FALSE)
# Keep only the first (gene) and second (unstranded counts) columns
library(dplyr)
df <- df %>% select(V1, V2)
# Rename the columns for clarity (GeneID and counts for this sample)
colnames(df) <- c("GeneID", gsub(".ReadsPerGene.out.tab", "", basename(file_path)))
return(df)
}
# Read all files into a list of data frames
list_of_dfs <- lapply(file_paths, read_star_file)
# Merge all data frames by the GeneID column
merged_df <- Reduce(function(x, y) merge(x, y, by = "GeneID"), list_of_dfs)
merged_df <- merged_df[-c(1:4), ]
# Check the first few rows of the combined data frame
head(merged_df)
# Optionally, write the combined data frame to a CSV file
write.csv(merged_df, "combined_gene_counts.csv", row.names = FALSE)
class(list_of_dfs)
head(list_of_dfs[[2]])
rownames(merged_df) = merged_df$GeneID
head(merged_df)
# NULL reserved word representing empty
merged_df$GeneID = NULL
head(merged_df)
subset_df4test <- merged_df[, c("N2_day1_rep1", "N2_day1_rep2", "N2_day1_rep3")]
head(subset_df4test)
Check count matrix
Different samples have different total number of counts
as.data.frame(colSums(merged_df))
barplot(colSums(merged_df),
las = 2,
cex.names= 0.6)
# inkscape: an open source software for editing the graph saved in pdf
pdf("total_count_barplot.pdf")
barplot(colSums(merged_df),
las = 2,
cex.names= 0.6)
dev.off()
coldata <- colnames(merged_df)
coldata_df <- cbind(group = gsub("_rep\\d", "", coldata))
coldata_df
rownames(coldata_df) = coldata
coldata_df
Instsall required R packages
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!require("DESeq2", quietly = TRUE))
BiocManager::install("DESeq2")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
'getOption("repos")' replaces Bioconductor standard repositories, see
'help("repositories", package = "BiocManager")' for details.
Replacement repositories:
CRAN: https://cran.rstudio.com
Bioconductor version 3.22 (BiocManager 1.30.26), R 4.5.2 (2025-10-31)
Installing package(s) 'BiocVersion', 'DESeq2'
also installing the dependencies ‘XVector’, ‘formatR’, ‘abind’, ‘SparseArray’, ‘lambda.r’, ‘futile.options’, ‘Seqinfo’, ‘S4Arrays’, ‘DelayedArray’, ‘futile.logger’, ‘snow’, ‘BH’, ‘S4Vectors’, ‘IRanges’, ‘GenomicRanges’, ‘SummarizedExperiment’, ‘BiocGenerics’, ‘Biobase’, ‘BiocParallel’, ‘matrixStats’, ‘locfit’, ‘MatrixGenerics’, ‘RcppArmadillo’
if (!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes")
if (!requireNamespace("EnhancedVolcano", quietly = TRUE))
remotes::install_github("kevinblighe/EnhancedVolcano")
Load R packages
library(DESeq2)
library(EnhancedVolcano)
Run DESeq2 to identify DEG
dds <- DESeqDataSetFromMatrix(countData = merged_df,
colData = coldata_df,
design =~ group)
class(dds)
The DESeq() function normalizes the read counts,estimates dispersions, and fits the linear model, all in one go.
dds <- DESeq(dds)
Dispersion is a measure of spread or variability in the data. Variance, standard deviation, IQR, among other measures, can all be used to measure dispersion.
DESeq2 uses a specific measure of dispersion (α) related to the mean (μ) and variance of the data: Var = μ + α*μ^2.
sizeFactors(dds)
head(counts(dds, normalized = TRUE))
The function plotDispEsts shows the dispersion by mean of normalized counts. We expect the dispersion to decrease as the mean of normalized counts increases.
The functions shows:
-
black per-gene dispersion estimates
-
a red trend line representing the global relationship between dispersion and normalized count
-
blue 'shrunken' values moderating individual dispersion estimates by the global relationship
-
blue-circled dispersion outliers with high gene-wise dispersion that were not adjusted.
plotDispEsts(dds)
Plot normalized genes
The function plotCounts is used to plot normalized counts plus a pseudocount of 0.5 by default.
vsd <- vst(dds, blind=FALSE)
class(vsd)
plotPCA(vsd, intgroup = c("group"))
Extract DEG results using results function
res <- results(dds)
res
class(res)
mcols(res)$description
baseMean: mean of normalized counts for all sampleslog2FoldChange: log2 fold changelfcSE: standard errorstat: Wald statisticpvalue: Wald test p-valuepadj: BH adjusted p-values
If we used the p-value directly from the Wald test with a significance cut-off of p < 0.05, that means there is a 5% chance it is a false positives. Each p-value is the result of a single test (single gene). The more genes we test, the more we inflate the false positive rate. This is the multiple testing problem. For example, if we test 20,000 genes for differential expression, at p < 0.05 we would expect to find 1,000 genes by chance. If we found 3000 genes to be differentially expressed total, roughly one third of our genes are false positives. We would not want to sift through our “significant” genes to identify which ones are true positives.
DESeq2 helps reduce the number of genes tested by removing those genes unlikely to be significantly DE prior to testing, such as those with low number of counts and outlier samples (gene-level QC). However, we still need to correct for multiple testing to reduce the number of false positives, and there are a few common approaches:
-
Bonferroni: The adjusted p-value is calculated by: p-value * m (m = total number of tests). This is a very conservative approach with a high probability of false negatives, so is generally not recommended.
-
FDR/Benjamini-Hochberg: Benjamini and Hochberg (1995) defined the concept of FDR and created an algorithm to control the expected FDR below a specified level given a list of independent p-values. An interpretation of the BH method for controlling the FDR is implemented in DESeq2 in which we rank the genes by p-value, then multiply each ranked p-value by m/rank.
-
Q-value / Storey method: The minimum FDR that can be attained when calling that feature significant. For example, if gene X has a q-value of 0.013 it means that 1.3% of genes that show p-values at least as small as gene X are false positives In DESeq2, the p-values attained by the Wald test are corrected for multiple testing using the Benjamini and Hochberg method by default. There are options to use other methods in the results() function. The p-adjusted values should be used to determine significant genes. The significant genes can be output for visualization and/or functional analysis.
So what does FDR < 0.05 mean? By setting the FDR cutoff to < 0.05, we’re saying that the proportion of false positives we expect amongst our differentially expressed genes is 5%. For example, if you call 500 genes as differentially expressed with an FDR cutoff of 0.05, you expect 25 of them to be false positives.
Note on p-values set to NA: some values in the results table can be set to NA for one of the following reasons:
-
If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p value and adjusted p value will all be set to NA.
-
If a row contains a sample with an extreme count outlier then the p value and adjusted p value will be set to NA. These outlier counts are detected by Cook’s distance. Customization of this outlier filtering and description of functionality for replacement of outlier counts and refitting is described below
-
If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA. Description and customization of independent filtering is described below
plotMA(res, ylim=c(-2,2))
write.csv(res, file = "BIOI_bulkRNAseq_SE_DESeq2_res.csv")
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="group",
returnData=TRUE)
d
which.min(res$padj)
library("ggplot2")
ggplot(d, aes(x=group, y=count)) +
geom_point(position=position_jitter(w=0.1,h=0)) +
scale_y_log10(breaks=c(25,100,400))
EnhancedVolcano(res,
lab = rownames(res),
x = 'log2FoldChange',
y = 'pvalue')
Understand normalized count in DESeq2 (Optional)
Create a pseudo-reference sample (row-wise geometric mean)
The code below creates a new column called pseudo_reference that contains the average log-transformed expression value for each gene across all samples. This pseudo-reference is similar to calculating a "reference sample" to compare other samples.
log_data = log(merged_df)
head(log_data)
library(dplyr)
library(tibble) # rownames_to_column
log_data = log_data %>%
rownames_to_column('gene') %>%
mutate (pseudo_reference = rowMeans(log_data))
head(log_data)
table(log_data$pseudo_reference == "-Inf")
filtered_log_data = log_data %>% filter(pseudo_reference != "-Inf")
head(filtered_log_data)
filtered_log_data$pseudo_reference = exp(filtered_log_data$pseudo_reference)
head(filtered_log_data)
dim(log_data)
dim(filtered_log_data)
Calculate ratio between each sample and the pseudo-reference for each gene
This step calculates the fold change between each sample and the pseudo-reference for each gene.
ratio_data = sweep(exp(filtered_log_data[,2:7]), 1, filtered_log_data[,8], "/")
head(ratio_data)
Calculate scaling factor
The code below computes the median fold change for each sample across all genes.
scaling_factors = apply(ratio_data, 2, median, na.rm = TRUE)
scaling_factors
The 2 indicates that the function is applied to columns, i.e., for each sample.
Normalize the counts
This step below normalizes each sample by its scaling factors, making the data comparable across samples. The result is a normalized gene expression matrix.
manually_normalized = sweep(merged_df, 2, scaling_factors, "/")
head(manually_normalized)
hist(manually_normalized$N2_day1_rep1)
The code below shows that the size factors and the normalized read counts calculated by ourselves are the same as what DESeq2 function returns.
head(counts(dds, normalized = TRUE))
sizeFactors(dds)
SessionInfo
sessionInfo()