The following note is designed to give you an overview of comparative analyses on complex cell types that are possible using the Seurat integration procedure.
Install required R packages
# # Install the remotes package
# if (!requireNamespace("remotes", quietly = TRUE)) {
# install.packages("remotes")
# }
# # Install Seurat
# if (!requireNamespace("Seurat", quietly = TRUE)) {
# remotes::install_github("satijalab/seurat", "seurat5", quiet = TRUE)
# }
# # Install BiocManager
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# # Install SingleR package
# if (!require("hdf5r", quietly = TRUE)){
# BiocManager::install("hdf5r")
# }
# # Install SingleR package
# if (!require("presto", quietly = TRUE)){
# remotes::install_github("immunogenomics/presto")
# }
# # Install SingleR package
# if (!require("SingleR", quietly = TRUE)){
# BiocManager::install("SingleR")
# }
# if (!require("celldex", quietly = TRUE)){
# BiocManager::install("celldex")
# }
# if (!require("SingleCellExperiment", quietly = TRUE)){
# BiocManager::install("SingleCellExperiment")
# }
# if (!require("scater", quietly = TRUE)){
# BiocManager::install("scater")
# }
## Installing the R packages could take around 51 minutes
## To speed up this process, you can download the R lib files
## saved from a working Google Colab session
## https://drive.google.com/file/d/1EQvZnsV6P0eNjbW0hwYhz0P5z0iH3bsL/view?usp=drive_link
system("gdown 1EQvZnsV6P0eNjbW0hwYhz0P5z0iH3bsL")
system("md5sum R_lib4scRNA.tar.gz", intern = TRUE)
'5898c04fca5e680710cd6728ef9b1422 R_lib4scRNA.tar.gz'
## required by scater package
system("apt-get install libx11-dev libcairo2-dev") #, intern = TRUE)
system("tar zxvf R_lib4scRNA.tar.gz")
.libPaths(c("/content/usr/local/lib/R/site-library", .libPaths()))
.libPaths()
- '/content/usr/local/lib/R/site-library'
- '/usr/local/lib/R/site-library'
- '/usr/lib/R/site-library'
- '/usr/lib/R/library'
Load required R packages
library(Seurat)
library(dplyr)
library(SingleR)
library(celldex)
library(scater)
library(SingleCellExperiment)
Loading required package: SeuratObject
Loading required package: sp
Attaching package: ‘SeuratObject’
The following objects are masked from ‘package:base’:
intersect, t
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Loading required package: SummarizedExperiment
Loading required package: MatrixGenerics
Loading required package: matrixStats
Attaching package: ‘matrixStats’
The following object is masked from ‘package:dplyr’:
count
Attaching package: ‘MatrixGenerics’
The following objects are masked from ‘package:matrixStats’:
colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
colWeightedMeans, colWeightedMedians, colWeightedSds,
colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
rowWeightedSds, rowWeightedVars
Loading required package: GenomicRanges
Loading required package: stats4
Loading required package: BiocGenerics
Attaching package: ‘BiocGenerics’
The following objects are masked from ‘package:dplyr’:
combine, intersect, setdiff, union
The following object is masked from ‘package:SeuratObject’:
intersect
The following objects are masked from ‘package:stats’:
IQR, mad, sd, var, xtabs
The following objects are masked from ‘package:base’:
anyDuplicated, aperm, append, as.data.frame, basename, cbind,
colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
table, tapply, union, unique, unsplit, which.max, which.min
Loading required package: S4Vectors
Attaching package: ‘S4Vectors’
The following objects are masked from ‘package:dplyr’:
first, rename
The following object is masked from ‘package:utils’:
findMatches
The following objects are masked from ‘package:base’:
expand.grid, I, unname
Loading required package: IRanges
Attaching package: ‘IRanges’
The following objects are masked from ‘package:dplyr’:
collapse, desc, slice
The following object is masked from ‘package:sp’:
%over%
Loading required package: GenomeInfoDb
Loading required package: Biobase
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Attaching package: ‘Biobase’
The following object is masked from ‘package:MatrixGenerics’:
rowMedians
The following objects are masked from ‘package:matrixStats’:
anyMissing, rowMedians
Attaching package: ‘SummarizedExperiment’
The following object is masked from ‘package:Seurat’:
Assays
The following object is masked from ‘package:SeuratObject’:
Assays
Attaching package: ‘celldex’
The following objects are masked from ‘package:SingleR’:
BlueprintEncodeData, DatabaseImmuneCellExpressionData,
HumanPrimaryCellAtlasData, ImmGenData, MonacoImmuneData,
MouseRNAseqData, NovershternHematopoieticData
Loading required package: SingleCellExperiment
Loading required package: scuttle
Loading required package: ggplot2
list.files()
- 'R_lib4scRNA.tar.gz'
- 'sample_data'
- 'usr'
# https://drive.google.com/drive/folders/1lp6kSGFyYYAswfAyG07DgELQ2G2Ja51Q?usp=sharing
# Download "filtered_feature_bc_matrix.h5"
# Output of cellranger
system("gdown --folder 1lp6kSGFyYYAswfAyG07DgELQ2G2Ja51Q", intern = TRUE)
- 'Processing file 1ezH8P0iRpV9fsOOqrqStPxI5-q54Ge9B filtered_feature_bc_matrix_300min.h5'
- 'Processing file 1hu9c2BhKb5bEYXrPlG_mkSQ219eG-E2K filtered_feature_bc_matrix_400min.h5'
- 'Processing file 16bSSWAn5Fg_sZgajA4JbuZPoVgulGt_5 filtered_feature_bc_matrix_500min.h5'
system("md5sum ./cele_cellranger_mtx/*.h5", intern = TRUE)
- 'e0fd344696c5188e55aeb359efd7a8c1 ./cele_cellranger_mtx/filtered_feature_bc_matrix_300min.h5'
- 'd087ff62ba449586858c058117aa0438 ./cele_cellranger_mtx/filtered_feature_bc_matrix_400min.h5'
- 'efb8a9ef4898918e53a53878531f64ce ./cele_cellranger_mtx/filtered_feature_bc_matrix_500min.h5'
# Specify the directory containing the .h5 files
mtx_directory <- "./cele_cellranger_mtx"
# List all .h5 files in the specified directory
mtx_file_paths <- list.files(path = mtx_directory, pattern = "\\.h5$", full.names = TRUE)
# Print the file paths
mtx_file_paths
- './cele_cellranger_mtx/filtered_feature_bc_matrix_300min.h5'
- './cele_cellranger_mtx/filtered_feature_bc_matrix_400min.h5'
- './cele_cellranger_mtx/filtered_feature_bc_matrix_500min.h5'
# Read the files into a list
count_mtx_list <- lapply(mtx_file_paths, Read10X_h5)
names(count_mtx_list) <- c("300min", "400min", "500min")
names(count_mtx_list)
- '300min'
- '400min'
- '500min'
lapply(count_mtx_list, class)
- $`300min`
- 'dgCMatrix'
- $`400min`
- 'dgCMatrix'
- $`500min`
- 'dgCMatrix'
print(format(object.size(count_mtx_list), units = "MB"))
[1] "829.8 Mb"
sample_names <- names(count_mtx_list)
seurat_obj_list <- list()
for (i in seq_along(count_mtx_list)) {
seurat_obj <- CreateSeuratObject(counts = count_mtx_list[[i]],
project = sample_names[i])
seurat_obj_list[[sample_names[i]]] <- seurat_obj
}
Warning message:
“Feature names cannot have underscores ('_'), replacing with dashes ('-')”
Warning message:
“Feature names cannot have underscores ('_'), replacing with dashes ('-')”
Warning message:
“Feature names cannot have underscores ('_'), replacing with dashes ('-')”
seurat_obj_list
$`300min`
An object of class Seurat
19985 features across 25996 samples within 1 assay
Active assay: RNA (19985 features, 0 variable features)
1 layer present: counts
$`400min`
An object of class Seurat
19985 features across 37944 samples within 1 assay
Active assay: RNA (19985 features, 0 variable features)
1 layer present: counts
$`500min`
An object of class Seurat
19985 features across 14378 samples within 1 assay
Active assay: RNA (19985 features, 0 variable features)
1 layer present: counts
rm(count_mtx_list); gc();
used | (Mb) | gc trigger | (Mb) | max used | (Mb) | |
---|---|---|---|---|---|---|
Ncells | 10810136 | 577.4 | 19318554 | 1031.8 | 14556029 | 777.4 |
Vcells | 127080778 | 969.6 | 301745553 | 2302.2 | 301081234 | 2297.1 |
Access Seurat object
seurat_obj_list$'300min'@active.assay
'RNA'
class(seurat_obj_list$'300min'@meta.data)
head(seurat_obj_list$'300min'@meta.data, 4)
'data.frame'
orig.ident | nCount_RNA | nFeature_RNA | |
---|---|---|---|
<fct> | <dbl> | <int> | |
AAACCTGAGACAATAC-1 | 300min | 1630 | 803 |
AAACCTGAGACACTAA-1 | 300min | 3147 | 1365 |
AAACCTGAGACGCTTT-1 | 300min | 892 | 586 |
AAACCTGAGAGGGCTT-1 | 300min | 1666 | 1033 |
seurat_obj_list$'300min'@meta.data$orig.ident = "300min"
seurat_obj_list$'400min'@meta.data$orig.ident = "400min"
seurat_obj_list$'500min'@meta.data$orig.ident = "500min"
head(seurat_obj_list$'300min'@meta.data, 4)
orig.ident | nCount_RNA | nFeature_RNA | |
---|---|---|---|
<chr> | <dbl> | <int> | |
AAACCTGAGACAATAC-1 | 300min | 1630 | 803 |
AAACCTGAGACACTAA-1 | 300min | 3147 | 1365 |
AAACCTGAGACGCTTT-1 | 300min | 892 | 586 |
AAACCTGAGAGGGCTT-1 | 300min | 1666 | 1033 |
Layers(seurat_obj_list$'300min')
'counts'
seurat_obj_list$'300min'@version
[1] ‘5.0.2’
Data preprocessing
Ensembl biomart can be used to extract the mitochodria genes:
https://useast.ensembl.org/info/website/archives/assembly.html
Gene stable ID | Gene name |
---|---|
WBGene00000829 | ctb-1 |
WBGene00010957 | nduo-6 |
WBGene00010958 | WBGene00010958 |
WBGene00010959 | WBGene00010959 |
WBGene00010960 | atp-6 |
WBGene00010961 | nduo-2 |
WBGene00010962 | ctc-3 |
WBGene00010963 | nduo-4 |
WBGene00010964 | ctc-1 |
WBGene00010965 | ctc-2 |
WBGene00010966 | nduo-3 |
WBGene00010967 | nduo-5 |
# Define the mitochondria gene names as an R vector
mt_gene_names <- c(
"ctb-1", "nduo-6", "WBGene00010958", "WBGene00010959",
"atp-6", "nduo-2", "ctc-3", "nduo-4",
"ctc-1", "ctc-2", "nduo-3", "nduo-5"
)
mt_genes <- mt_gene_names[mt_gene_names %in% rownames(seurat_obj_list$'300min')]
mt_genes
- 'ctb-1'
- 'nduo-6'
- 'WBGene00010958'
- 'WBGene00010959'
- 'atp-6'
- 'nduo-2'
- 'ctc-3'
- 'nduo-4'
- 'ctc-1'
- 'ctc-2'
- 'nduo-3'
- 'nduo-5'
# Function to calculate percentage of mitochondrial genes
add_mt_percentage <- function(seurat_obj, mt_genes) {
# Calculate percentage of mitochondrial genes
seurat_obj$percent.mt <- PercentageFeatureSet(seurat_obj, features = mt_genes)
return(seurat_obj)
}
seurat_obj_list <- lapply(seurat_obj_list, add_mt_percentage, mt_genes = mt_genes)
qc_features <- c("nFeature_RNA", "nCount_RNA", "percent.mt")
VlnPlot(seurat_obj_list$'300min', features = qc_features, ncol = 3, pt.size=0)
VlnPlot(seurat_obj_list$'400min', features = qc_features, ncol = 3, pt.size=0)
VlnPlot(seurat_obj_list$'500min', features = qc_features, ncol = 3, pt.size=0)
Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
How to interpret QC plot
nFeature_RNA
: The number of unique features (genes) detected per cell.
Extremely high values could suggest potential doublets (two cells mistakenly captured as one), as two cells would have more unique genes combined.
Low number of detected genes - potential ambient mRNA (not real cells)
nCount_RNA
: The total number of RNA molecules (or unique molecular identifiers, UMIs) detected per cell.
Higher counts generally indicate higher RNA content, but they could also result from cell doublets. Cells with very low nCount_RNA might represent poor-quality cells with low RNA capture, while very high counts may also suggest doublets.
percent.mt
: The percentage of reads mapping to mitochondrial genes.
High mitochondrial content often indicates cell stress or apoptosis, as damaged cells tend to release mitochondrial RNA.
Filtering cells with high percent.mt
values is common to exclude potentially dying cells.
# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(seurat_obj_list$'300min', feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(seurat_obj_list$'300min', feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
Filter out potential doublets, empty droplets and dying cells
# Load necessary libraries
library(Seurat)
library(ggplot2)
# Define the function to calculate median and MAD values
calculate_thresholds <- function(seurat_obj) {
# Extract relevant columns
nFeature_values <- seurat_obj@meta.data$nFeature_RNA
nCount_values <- seurat_obj@meta.data$nCount_RNA
percent_mt_values <- seurat_obj@meta.data$percent.mt
# Calculate medians and MADs
nFeature_median <- median(nFeature_values, na.rm = TRUE)
nFeature_mad <- mad(nFeature_values, constant = 1, na.rm = TRUE)
nCount_median <- median(nCount_values, na.rm = TRUE)
nCount_mad <- mad(nCount_values, constant = 1, na.rm = TRUE)
percent_mt_median <- median(percent_mt_values, na.rm = TRUE)
percent_mt_mad <- mad(percent_mt_values, constant = 1, na.rm = TRUE)
# Calculate thresholds for horizontal lines
thresholds <- list(
nFeature_upper = nFeature_median + 4 * nFeature_mad,
nFeature_lower = nFeature_median - 4 * nFeature_mad,
nCount_upper = nCount_median + 4 * nCount_mad,
nCount_lower = nCount_median - 4 * nCount_mad,
percent_mt_upper = percent_mt_median + 4 * percent_mt_mad
)
return(thresholds)
}
# Define a function to filter Seurat objects
filter_seurat_obj <- function(seurat_obj) {
# Calculate thresholds
thresholds <- calculate_thresholds(seurat_obj)
# Apply filtering
seurat_obj <- subset(
seurat_obj,
subset = nFeature_RNA > thresholds$nFeature_lower &
nFeature_RNA < thresholds$nFeature_upper &
percent.mt < thresholds$percent_mt_upper
)
#
return(seurat_obj)
}
# Apply filtering to each Seurat object in the list
seurat_obj_list <- lapply(seurat_obj_list, filter_seurat_obj)
VlnPlot(seurat_obj_list$'300min', features = qc_features, ncol = 3, pt.size=0)
VlnPlot(seurat_obj_list$'400min', features = qc_features, ncol = 3, pt.size=0)
VlnPlot(seurat_obj_list$'500min', features = qc_features, ncol = 3, pt.size=0)
Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
RandomSubsetSeurat <- function(seurat_obj, subset_size, seed = NULL) {
# Optionally set a random seed for reproducibility
if (!is.null(seed)) {
set.seed(seed)
}
# Get all cell names
total_cells <- Cells(seurat_obj)
# Ensure subset size is not larger than the total number of cells
if (subset_size > length(total_cells)) {
stop("Subset size exceeds the total number of cells in the Seurat object.")
}
# Randomly sample a subset of cell names
subset_cells <- sample(total_cells, size = subset_size)
# Create a new Seurat object with the subsetted cells
subset_seurat_obj <- subset(seurat_obj, cells = subset_cells)
return(subset_seurat_obj)
}
⚠️ Important
This is included only to reduce the memory used. In real project, you don't want to perform this step. Instead, you should request larger memory computing resources.
seurat_obj_list <- lapply(seurat_obj_list,
RandomSubsetSeurat,
subset_size = 2000,
seed = 123)
seurat_obj_list
$`300min`
An object of class Seurat
19985 features across 2000 samples within 1 assay
Active assay: RNA (19985 features, 0 variable features)
1 layer present: counts
$`400min`
An object of class Seurat
19985 features across 2000 samples within 1 assay
Active assay: RNA (19985 features, 0 variable features)
1 layer present: counts
$`500min`
An object of class Seurat
19985 features across 2000 samples within 1 assay
Active assay: RNA (19985 features, 0 variable features)
1 layer present: counts
so_merged <- merge(seurat_obj_list$'300min',
c(seurat_obj_list$'400min', seurat_obj_list$'500min'),
add.cell.ids = c("300min", "400min", "500min"),
project = "scRNA_cele")
so_merged
Layers(so_merged)
table(so_merged$orig.ident)
An object of class Seurat
19985 features across 6000 samples within 1 assay
Active assay: RNA (19985 features, 0 variable features)
3 layers present: counts.300min, counts.400min, counts.500min
- 'counts.300min'
- 'counts.400min'
- 'counts.500min'
300min 400min 500min
2000 2000 2000
#rm(seurat_obj_list); gc();
Instead of using an arbitrary number, you can also use statistical algorithm to predict doublets and empty droplets to filter the cells, such as DoubletFinder
and EmptyDrops
.
Normalization, ccaling of the data and linear dimensional reduction
Normalization
After removing unwanted cells from the dataset, the next step is to normalize the data. By default, a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. In Seurat v5, Normalized values are stored in pbmc[["RNA"]]$data
.
While this method of normalization is standard and widely used in scRNA-seq analysis, global-scaling relies on an assumption that each cell originally contains the same number of RNA molecules.
Next, we identify a subset of features that show high variation across cells in the dataset—meaning they are highly expressed in some cells and lowly expressed in others. Prior work, including our own, has shown that focusing on these variable genes in downstream analyses can enhance the detection of biological signals in single-cell datasets.
The approach used in Seurat improves upon previous versions by directly modeling the inherent mean-variance relationship in single-cell data. This method is implemented in the FindVariableFeatures() function, which, by default, selects 2,000 variable features per dataset. These features will then be used in downstream analyses, such as PCA.
Scaling the data
Next, we apply a linear transformation (scaling
) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData()
function:
Shifts the expression of each gene, so that the mean expression across cells is 0 Scales the expression of each gene, so that the variance across cells is 1
This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate
The results of this are stored in pbmc[["RNA"]]$scale.data
By default, only variable features are scaled. You can specify the features argument to scale additional features.
# run standard anlaysis workflow
so_merged <- NormalizeData(so_merged)
so_merged <- FindVariableFeatures(so_merged)
so_merged <- ScaleData(so_merged)
so_merged <- RunPCA(so_merged)
Normalizing layer: counts.300min
Normalizing layer: counts.400min
Normalizing layer: counts.500min
Finding variable features for layer counts.300min
Finding variable features for layer counts.400min
Finding variable features for layer counts.500min
Centering and scaling data matrix
PC_ 1
Positive: noah-1, noah-2, dpy-2, dpy-3, mlt-11, col-76, mlt-8, dpy-7, dpy-10, hch-1
col-121, dpy-17, dpy-14, txdc-12.2, sym-1, C01H6.8, Y41D4B.6, sqt-3, Y23H5B.8, K02E10.4
acn-1, C05C8.7, C26B9.3, R148.5, C48E7.1, dsl-6, inx-12, cpg-24, R05D3.9, F37C4.4
Negative: ost-1, pat-10, D2092.4, mlc-3, let-2, unc-15, lev-11, emb-9, tni-1, set-18
tnt-3, sgn-1, test-1, hsp-12.1, unc-98, cpn-3, sgcb-1, F53F10.1, spp-15, mup-2
Y71F9AR.2, mlc-1, C29F5.1, mlc-2, clik-1, stn-2, mig-18, F21H7.3, unc-60, Y73F8A.26
PC_ 2
Positive: asp-4, enpl-1, C50F4.6, T02E9.5, his-24, atz-1, R07E5.17, C03C10.5, nphp-1, hil-3
tmem-231, mksr-2, tctn-1, fmi-1, jbts-14, osm-5, bbs-9, ift-81, fbxb-66, K02B12.2
mks-2, ift-20, K07C11.10, che-13, R01H2.8, ifta-2, F48E3.9, arl-3, ccep-290, tmem-17
Negative: unc-15, sgn-1, D2092.4, C29F5.1, let-2, lev-11, tnt-3, mlc-2, mup-2, mlc-1
tni-1, test-1, mig-18, clik-1, hsp-12.1, mlc-3, ttn-1, F21H7.3, sgca-1, emb-9
set-18, icl-1, ost-1, myo-3, unc-54, sgcb-1, hsp-12.2, unc-98, stn-2, Y71F9AR.2
PC_ 3
Positive: mks-2, arl-3, mksr-2, ift-81, ifta-2, dylt-2, tmem-231, K07C11.10, che-13, ift-20
ccep-290, bbs-5, osm-5, C33A12.4, ift-74, R01H2.8, Y102A11A.9, bbs-9, tctn-1, lgc-20
dyf-1, mks-1, T02G5.3, dyf-3, arl-13, nphp-1, tmem-17, Y17D7B.10, bbs-2, F01E11.3
Negative: his-24, Y37E3.30, atz-1, lbp-1, C01G6.3, cht-1, ttr-50, clec-266, Y65A5A.1, enpl-1
hil-3, clec-196, idh-1, dsl-3, fbxb-66, cpg-20, F53B3.5, Y71F9AL.7, fbn-1, E01G4.5
pmt-2, ZK512.1, cutl-2, Y43F8B.2, cpg-24, F55C9.5, Y71F9AL.6, W04H10.6, fbxb-101, lam-3
PC_ 4
Positive: arl-3, ifta-2, mks-2, mksr-2, tmem-231, ift-81, che-13, K07C11.10, ift-20, dylt-2
Y55D5A.1, bbs-9, ift-74, ccep-290, tctn-1, dsl-3, cutl-2, T02G5.3, R01H2.8, bbs-5
osm-5, nphp-1, dyf-1, lgc-20, dyf-3, arl-13, mks-1, Y102A11A.9, tmem-17, bbs-2
Negative: mab-7, wrt-2, abu-13, mlt-9, sups-1, clec-180, Y73E7A.8, C01G6.9, wrt-1, R07E3.6
Y54G2A.76, K08B12.1, mam-3, T19A5.3, glf-1, ZC449.1, H03E18.1, M03B6.3, ZK154.1, Y11D7A.9
pqn-32, F01D5.6, C35A5.11, F33D4.6, C52G5.2, grl-15, T03D8.6, F23H12.5, cut-6, ZC123.1
PC_ 5
Positive: F33H2.8, nhr-127, F37A4.3, bus-17, nhr-270, sams-1, bus-12, F18A11.2, W04H10.6, oac-51
elt-1, F54B11.10, T13H10.2, nhr-218, rocf-1, Y6D1A.2, txdc-12.1, T04G9.4, F32H2.6, subs-4
fbn-1, F13G3.3, T03G6.1, nhr-94, C35A5.5, fbxa-52, fasn-1, nstp-3, W03B1.3, bus-8
Negative: Y41D4B.6, B0205.4, C06C3.4, T01D1.8, F43D9.1, F11E6.9, F31C3.6, dsl-6, ttr-2, T19C4.1
F46G11.6, K02E10.4, best-14, nhx-1, ifa-3, T25B9.1, F46F11.7, Y45F10B.59, F15B9.8, C24B5.4
T19B10.5, clec-78, cpt-4, C49F5.13, srm-1, H41C03.1, B0393.5, C53A5.2, ttr-3, far-5
DimPlot(so_merged,
reduction = "pca",
split.by = 'orig.ident',
label.color = "black") +
NoLegend()
so_merged <- FindNeighbors(so_merged,
dims = 1:30, reduction = "pca")
so_merged <- FindClusters(so_merged,
resolution = 2,
cluster.name = "unintegrated_clusters")
Computing nearest neighbor graph
Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 6000
Number of edges: 197290
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8357
Number of communities: 37
Elapsed time: 0 seconds
You have several useful ways to visualize both cells and features that define the PCA, including VizDimReduction()
, DimPlot()
, and DimHeatmap()
.
DimHeatmap()
draws a heatmap focusing on a principal component. Both cells and genes are sorted by their principal component scores
Perform analysis without integration
To visualize and explore these datasets, Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP.
The goal of tSNE/UMAP is to learn underlying structure in the dataset, in order to place similar cells together in low-dimensional space. Therefore, cells that are grouped together within graph-based clusters determined above should co-localize on these dimension reduction plots.
so_merged <- RunUMAP(so_merged,
dims = 1:30,
reduction = "pca",
reduction.name = "umap.unintegrated")
Warning message:
“The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session”
22:47:06 UMAP embedding parameters a = 0.9922 b = 1.112
22:47:06 Read 6000 rows and found 30 numeric columns
22:47:06 Using Annoy for neighbor search, n_neighbors = 30
22:47:06 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
|
22:47:07 Writing NN index file to temp file /tmp/Rtmp9vMJuf/file43d447d49a3
22:47:07 Searching Annoy index using 1 thread, search_k = 3000
22:47:10 Annoy recall = 100%
22:47:11 Commencing smooth kNN distance calibration using 1 thread
with target n_neighbors = 30
22:47:15 Initializing from normalized Laplacian + noise (using RSpectra)
22:47:15 Commencing optimization for 500 epochs, with 241186 positive edges
22:47:25 Optimization finished
DimPlot(so_merged,
reduction = "umap.unintegrated",
split.by = c("orig.ident"))
FeaturePlot(so_merged, feature="ham-1", pt.size = 0.1,
split.by = c("orig.ident"))
FeaturePlot(so_merged, feature="egl-21", pt.size = 0.1,
split.by = c("orig.ident"))
FeaturePlot(so_merged, feature="dsl-3", pt.size = 0.1,
split.by = c("orig.ident"))
Perform integration
Seurat v5 enables streamlined integrative analysis using the IntegrateLayers function. The method currently supports five integration methods. Each of these methods performs integration in low-dimensional space, and returns a dimensional reduction (i.e. integrated.rpca) that aims to co-embed shared cell types across batches:
- Anchor-based CCA integration (method=CCAIntegration)
- Harmony (method=HarmonyIntegration)
- Anchor-based RPCA integration (method=RPCAIntegration)
- FastMNN (method= FastMNNIntegration)
- scVI (method=scVIIntegration)
Canonical correlation analysis: CCA
Reciprocal PCA: RPCA
CCAIntegration
integration method that is available in the Seurat package utilizes the canonical correlation analysis (CCA). This method expects “correspondences” or shared biological states among at least a subset of single cells across the groups.
so_merged_integ <- IntegrateLayers(object = so_merged,
method = CCAIntegration, orig.reduction = "pca",
new.reduction = "integrated.cca",
verbose = FALSE)
Once integrative analysis is complete, you can rejoin the layers - which collapses the individual datasets together and recreates the original counts
and data
layers. You will need to do this before performing any differential expression analysis. However, you can always resplit the layers in case you would like to reperform integrative analysis.
# re-join layers after integration
so_merged_integ[["RNA"]] <- JoinLayers(so_merged_integ[["RNA"]])
so_merged_integ <- FindNeighbors(so_merged_integ,
reduction = "integrated.cca", dims = 1:30)
so_merged_integ <- FindClusters(so_merged_integ, resolution = 2)
Computing nearest neighbor graph
Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
Number of nodes: 6000
Number of edges: 203509
Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8373
Number of communities: 37
Elapsed time: 1 seconds
so_merged_integ <- RunUMAP(so_merged_integ,
dims = 1:30,
reduction = "integrated.cca")
22:48:04 UMAP embedding parameters a = 0.9922 b = 1.112
22:48:04 Read 6000 rows and found 30 numeric columns
22:48:04 Using Annoy for neighbor search, n_neighbors = 30
22:48:04 Building Annoy index with metric = cosine, n_trees = 50
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
|
22:48:06 Writing NN index file to temp file /tmp/Rtmp9vMJuf/file43d7e00c0c4
22:48:06 Searching Annoy index using 1 thread, search_k = 3000
22:48:09 Annoy recall = 100%
22:48:10 Commencing smooth kNN distance calibration using 1 thread
with target n_neighbors = 30
22:48:13 Initializing from normalized Laplacian + noise (using RSpectra)
22:48:13 Commencing optimization for 500 epochs, with 243916 positive edges
22:48:24 Optimization finished
DimPlot(so_merged_integ,
reduction = "umap",
split.by = c("orig.ident"))
DefaultAssay(so_merged_integ)
'RNA'
markers <- FindAllMarkers(so_merged_integ)
Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
Calculating cluster 11
Calculating cluster 12
Calculating cluster 13
Calculating cluster 14
Calculating cluster 15
Calculating cluster 16
Calculating cluster 17
Calculating cluster 18
Calculating cluster 19
Calculating cluster 20
Calculating cluster 21
Calculating cluster 22
Calculating cluster 23
Calculating cluster 24
Calculating cluster 25
Calculating cluster 26
Calculating cluster 27
Calculating cluster 28
Calculating cluster 29
Calculating cluster 30
Calculating cluster 31
Calculating cluster 32
Calculating cluster 33
Calculating cluster 34
Calculating cluster 35
Calculating cluster 36
markers %>%
group_by(cluster) %>%
dplyr::filter(avg_log2FC > 1) %>%
slice_head(n = 2) %>%
ungroup() -> top2
DoHeatmap(so_merged_integ, features = top2$gene) + NoLegend()
Warning message in DoHeatmap(so_merged_integ, features = top2$gene):
“The following features were omitted as they were not found in the scale.data slot for the RNA assay: nmur-3, timp-1, acp-6, C14A4.6, skpo-2, mltn-13, srw-12, F59C6.8, T03G11.9, C02F5.2, C33D9.10, lev-9, C08F1.6, cank-26, egl-21”
head(markers)
p_val | avg_log2FC | pct.1 | pct.2 | p_val_adj | cluster | gene | |
---|---|---|---|---|---|---|---|
<dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <fct> | <chr> | |
F36H1.11 | 0.000000e+00 | 5.450659 | 0.570 | 0.048 | 0.000000e+00 | 0 | F36H1.11 |
egl-21 | 1.563079e-282 | 3.984563 | 0.577 | 0.058 | 3.123813e-278 | 0 | egl-21 |
T10B5.4 | 1.850416e-276 | 3.373848 | 0.792 | 0.138 | 3.698057e-272 | 0 | T10B5.4 |
madd-4 | 1.841569e-242 | 4.579571 | 0.420 | 0.030 | 3.680376e-238 | 0 | madd-4 |
F28E10.1 | 8.224195e-225 | 3.734330 | 0.746 | 0.167 | 1.643605e-220 | 0 | F28E10.1 |
C31H5.5 | 8.048185e-222 | 4.466041 | 0.411 | 0.034 | 1.608430e-217 | 0 | C31H5.5 |
FeaturePlot(so_merged_integ, reduction = "umap", feature="ham-1", pt.size = 0.1,
split.by = c("orig.ident"))
so_merged_integ
An object of class Seurat
19985 features across 6000 samples within 1 assay
Active assay: RNA (19985 features, 2000 variable features)
3 layers present: data, counts, scale.data
4 dimensional reductions calculated: pca, umap.unintegrated, integrated.cca, umap
FeaturePlot(so_merged_integ, reduction = "umap", feature="egl-21", pt.size = 0.1,
split.by = c("orig.ident"))
FeaturePlot(so_merged_integ, reduction = "umap", feature="dsl-3", pt.size = 0.1,
split.by = c("orig.ident"))
With the integrated Seurat object, you can perform cell type annotation using marker genes.
⚠️ Important
In this class, we will not perform cell type annotation for this dataset. With the notebook for analyzing the human PBMC data, you will be able to perform cell type annnotation if needed.
In the next section, you will learn how to perform trajectory and pseudotime analysis. You will directly utilize the annotation performed by the authors who generated this dataset in the Science paper.
The data can be obtained using the URLs below:
## count matrix
https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_expression.rds
## metadata
https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_colData.rds
## gene list
https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_rowData.rds
Save the Seurat object
saveRDS(so_merged_integ, file = "Seurat_object_10x_cele_final.rds")
remotes::install_github("10xGenomics/loupeR")
loupeR::setup()
Downloading GitHub repo 10xGenomics/loupeR@HEAD
promises (1.3.0 -> 1.3.2 ) [CRAN]
later (1.3.2 -> 1.4.1 ) [CRAN]
progressr (0.15.0 -> 0.15.1) [CRAN]
Installing 3 packages: promises, later, progressr
Installing packages into ‘/content/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
[36m──[39m [36mR CMD build[39m [36m─────────────────────────────────────────────────────────────────[39m
* checking for file ‘/tmp/Rtmp9vMJuf/remotes43d20adbcde/10XGenomics-loupeR-a169417/DESCRIPTION’ ... OK
* preparing ‘loupeR’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘loupeR_1.1.2.tar.gz’
Installing package into ‘/content/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Warning message in fun(libname, pkgname):
“Please call `loupeR::setup()` to install the Louper executable and to agree to the EULA before continuing”
Installing Executable
2024/12/02 22:51:21 Downloading executable
The LoupeR executable is subject to the 10x End User Software License, available at:
https://10xgen.com/EULA
Do you accept the End-User License Agreement
(y/yes or n/no): yes
EULA
library(loupeR)
create_loupe_from_seurat(so_merged_integ,
output_name = "Seurat_object_10x_cele_merged_integ",
force = TRUE)
2024/12/02 22:56:56 extracting matrix, clusters, and projections
2024/12/02 22:56:56 selected assay: RNA
2024/12/02 22:56:56 selected clusters: active_cluster orig.ident unintegrated_clusters seurat_clusters RNA_snn_res.2
2024/12/02 22:56:56 selected projections: umap.unintegrated umap
2024/12/02 22:56:56 validating count matrix
2024/12/02 22:56:56 validating clusters
2024/12/02 22:56:56 validating projections
2024/12/02 22:56:56 creating temporary hdf5 file: /tmp/Rtmp9vMJuf/file43d561ea60d.h5
2024/12/02 22:56:59 invoking louper executable
2024/12/02 22:56:59 running command: "/root/.local/share/R/loupeR/louper create --input='/tmp/Rtmp9vMJuf/file43d561ea60d.h5' --output='Seurat_object_10x_cele_merged_integ.cloupe' --force"
Reference
https://monashbioinformaticsplatform.github.io/Single-Cell-Workshop/pbmc3k_tutorial.html
https://bioinformatics.ccr.cancer.gov/docs/getting-started-with-scrna-seq/IntroToR_Seurat/
https://hbctraining.github.io/scRNA-seq/lessons/elbow_plot_metric.html
https://satijalab.org/seurat/articles/integration_introduction