Open In Colab

The following note is designed to give you an overview of comparative analyses on complex cell types that are possible using the Seurat integration procedure.

Install required R packages

# # Install the remotes package
# if (!requireNamespace("remotes", quietly = TRUE)) {
#   install.packages("remotes")
# }
# # Install Seurat
# if (!requireNamespace("Seurat", quietly = TRUE)) {
#     remotes::install_github("satijalab/seurat", "seurat5", quiet = TRUE)
# }
# # Install BiocManager
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")

# # Install SingleR package
# if (!require("hdf5r", quietly = TRUE)){
#     BiocManager::install("hdf5r")
# }
# # Install SingleR package
# if (!require("presto", quietly = TRUE)){
#     remotes::install_github("immunogenomics/presto")
# }
# # Install SingleR package
# if (!require("SingleR", quietly = TRUE)){
#     BiocManager::install("SingleR")
# }
# if (!require("celldex", quietly = TRUE)){
#     BiocManager::install("celldex")
# }
# if (!require("SingleCellExperiment", quietly = TRUE)){
#     BiocManager::install("SingleCellExperiment")
# }
# if (!require("scater", quietly = TRUE)){
#     BiocManager::install("scater")
# }
## Installing the R packages could take around 51 minutes
## To speed up this process, you can download the R lib files
## saved from a working Google Colab session
## https://drive.google.com/file/d/1EQvZnsV6P0eNjbW0hwYhz0P5z0iH3bsL/view?usp=drive_link
system("gdown 1EQvZnsV6P0eNjbW0hwYhz0P5z0iH3bsL")
system("md5sum R_lib4scRNA.tar.gz", intern = TRUE)

'5898c04fca5e680710cd6728ef9b1422 R_lib4scRNA.tar.gz'

## required by scater package
system("apt-get install libx11-dev libcairo2-dev") #, intern = TRUE)
system("tar zxvf R_lib4scRNA.tar.gz")
.libPaths(c("/content/usr/local/lib/R/site-library", .libPaths()))
.libPaths()
  1. '/content/usr/local/lib/R/site-library'
  2. '/usr/local/lib/R/site-library'
  3. '/usr/lib/R/site-library'
  4. '/usr/lib/R/library'

Load required R packages

library(Seurat)
library(dplyr)
library(SingleR)
library(celldex)
library(scater)
library(SingleCellExperiment)
Loading required package: SeuratObject

Loading required package: sp


Attaching package: ‘SeuratObject’


The following objects are masked from ‘package:base’:

    intersect, t



Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: SummarizedExperiment

Loading required package: MatrixGenerics

Loading required package: matrixStats


Attaching package: ‘matrixStats’


The following object is masked from ‘package:dplyr’:

    count



Attaching package: ‘MatrixGenerics’


The following objects are masked from ‘package:matrixStats’:

    colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
    colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
    colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
    colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
    colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
    colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
    colWeightedMeans, colWeightedMedians, colWeightedSds,
    colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
    rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
    rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
    rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
    rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
    rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
    rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
    rowWeightedSds, rowWeightedVars


Loading required package: GenomicRanges

Loading required package: stats4

Loading required package: BiocGenerics


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:dplyr’:

    combine, intersect, setdiff, union


The following object is masked from ‘package:SeuratObject’:

    intersect


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
    match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
    Position, rank, rbind, Reduce, rownames, sapply, saveRDS, setdiff,
    table, tapply, union, unique, unsplit, which.max, which.min


Loading required package: S4Vectors


Attaching package: ‘S4Vectors’


The following objects are masked from ‘package:dplyr’:

    first, rename


The following object is masked from ‘package:utils’:

    findMatches


The following objects are masked from ‘package:base’:

    expand.grid, I, unname


Loading required package: IRanges


Attaching package: ‘IRanges’


The following objects are masked from ‘package:dplyr’:

    collapse, desc, slice


The following object is masked from ‘package:sp’:

    %over%


Loading required package: GenomeInfoDb

Loading required package: Biobase

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.



Attaching package: ‘Biobase’


The following object is masked from ‘package:MatrixGenerics’:

    rowMedians


The following objects are masked from ‘package:matrixStats’:

    anyMissing, rowMedians



Attaching package: ‘SummarizedExperiment’


The following object is masked from ‘package:Seurat’:

    Assays


The following object is masked from ‘package:SeuratObject’:

    Assays



Attaching package: ‘celldex’


The following objects are masked from ‘package:SingleR’:

    BlueprintEncodeData, DatabaseImmuneCellExpressionData,
    HumanPrimaryCellAtlasData, ImmGenData, MonacoImmuneData,
    MouseRNAseqData, NovershternHematopoieticData


Loading required package: SingleCellExperiment

Loading required package: scuttle

Loading required package: ggplot2
list.files()
  1. 'R_lib4scRNA.tar.gz'
  2. 'sample_data'
  3. 'usr'
# https://drive.google.com/drive/folders/1lp6kSGFyYYAswfAyG07DgELQ2G2Ja51Q?usp=sharing
# Download "filtered_feature_bc_matrix.h5"
# Output of cellranger
system("gdown --folder 1lp6kSGFyYYAswfAyG07DgELQ2G2Ja51Q", intern = TRUE)
  1. 'Processing file 1ezH8P0iRpV9fsOOqrqStPxI5-q54Ge9B filtered_feature_bc_matrix_300min.h5'
  2. 'Processing file 1hu9c2BhKb5bEYXrPlG_mkSQ219eG-E2K filtered_feature_bc_matrix_400min.h5'
  3. 'Processing file 16bSSWAn5Fg_sZgajA4JbuZPoVgulGt_5 filtered_feature_bc_matrix_500min.h5'
system("md5sum ./cele_cellranger_mtx/*.h5", intern = TRUE)
  1. 'e0fd344696c5188e55aeb359efd7a8c1 ./cele_cellranger_mtx/filtered_feature_bc_matrix_300min.h5'
  2. 'd087ff62ba449586858c058117aa0438 ./cele_cellranger_mtx/filtered_feature_bc_matrix_400min.h5'
  3. 'efb8a9ef4898918e53a53878531f64ce ./cele_cellranger_mtx/filtered_feature_bc_matrix_500min.h5'
# Specify the directory containing the .h5 files
mtx_directory <- "./cele_cellranger_mtx"

# List all .h5 files in the specified directory
mtx_file_paths <- list.files(path = mtx_directory, pattern = "\\.h5$", full.names = TRUE)

# Print the file paths
mtx_file_paths
  1. './cele_cellranger_mtx/filtered_feature_bc_matrix_300min.h5'
  2. './cele_cellranger_mtx/filtered_feature_bc_matrix_400min.h5'
  3. './cele_cellranger_mtx/filtered_feature_bc_matrix_500min.h5'
# Read the files into a list
count_mtx_list <- lapply(mtx_file_paths, Read10X_h5)
names(count_mtx_list) <- c("300min", "400min", "500min")
names(count_mtx_list)
  1. '300min'
  2. '400min'
  3. '500min'
lapply(count_mtx_list, class)
$`300min`
'dgCMatrix'
$`400min`
'dgCMatrix'
$`500min`
'dgCMatrix'
print(format(object.size(count_mtx_list), units = "MB"))
[1] "829.8 Mb"
sample_names <- names(count_mtx_list)

seurat_obj_list <- list()
for (i in seq_along(count_mtx_list)) {
  seurat_obj <- CreateSeuratObject(counts = count_mtx_list[[i]],
                                   project = sample_names[i])
  seurat_obj_list[[sample_names[i]]] <- seurat_obj
}
Warning message:
“Feature names cannot have underscores ('_'), replacing with dashes ('-')”
Warning message:
“Feature names cannot have underscores ('_'), replacing with dashes ('-')”
Warning message:
“Feature names cannot have underscores ('_'), replacing with dashes ('-')”
seurat_obj_list
$`300min`
An object of class Seurat 
19985 features across 25996 samples within 1 assay 
Active assay: RNA (19985 features, 0 variable features)
 1 layer present: counts

$`400min`
An object of class Seurat 
19985 features across 37944 samples within 1 assay 
Active assay: RNA (19985 features, 0 variable features)
 1 layer present: counts

$`500min`
An object of class Seurat 
19985 features across 14378 samples within 1 assay 
Active assay: RNA (19985 features, 0 variable features)
 1 layer present: counts
rm(count_mtx_list); gc();
A matrix: 2 × 6 of type dbl
used(Mb)gc trigger(Mb)max used(Mb)
Ncells 10810136577.4 193185541031.8 14556029 777.4
Vcells127080778969.63017455532302.23010812342297.1

Access Seurat object

seurat_obj_list$'300min'@active.assay

'RNA'

class(seurat_obj_list$'300min'@meta.data)
head(seurat_obj_list$'300min'@meta.data, 4)

'data.frame'

A data.frame: 4 × 3
orig.identnCount_RNAnFeature_RNA
<fct><dbl><int>
AAACCTGAGACAATAC-1300min1630 803
AAACCTGAGACACTAA-1300min31471365
AAACCTGAGACGCTTT-1300min 892 586
AAACCTGAGAGGGCTT-1300min16661033
seurat_obj_list$'300min'@meta.data$orig.ident = "300min"
seurat_obj_list$'400min'@meta.data$orig.ident = "400min"
seurat_obj_list$'500min'@meta.data$orig.ident = "500min"
head(seurat_obj_list$'300min'@meta.data, 4)
A data.frame: 4 × 3
orig.identnCount_RNAnFeature_RNA
<chr><dbl><int>
AAACCTGAGACAATAC-1300min1630 803
AAACCTGAGACACTAA-1300min31471365
AAACCTGAGACGCTTT-1300min 892 586
AAACCTGAGAGGGCTT-1300min16661033
Layers(seurat_obj_list$'300min')

'counts'

seurat_obj_list$'300min'@version
[1] ‘5.0.2’

Data preprocessing

Ensembl biomart can be used to extract the mitochodria genes:

https://useast.ensembl.org/info/website/archives/assembly.html

Gene stable ID Gene name
WBGene00000829 ctb-1
WBGene00010957 nduo-6
WBGene00010958 WBGene00010958
WBGene00010959 WBGene00010959
WBGene00010960 atp-6
WBGene00010961 nduo-2
WBGene00010962 ctc-3
WBGene00010963 nduo-4
WBGene00010964 ctc-1
WBGene00010965 ctc-2
WBGene00010966 nduo-3
WBGene00010967 nduo-5
# Define the mitochondria gene names as an R vector
mt_gene_names <- c(
  "ctb-1", "nduo-6", "WBGene00010958", "WBGene00010959",
  "atp-6", "nduo-2", "ctc-3", "nduo-4",
  "ctc-1", "ctc-2", "nduo-3", "nduo-5"
)
mt_genes <- mt_gene_names[mt_gene_names %in% rownames(seurat_obj_list$'300min')]
mt_genes
  1. 'ctb-1'
  2. 'nduo-6'
  3. 'WBGene00010958'
  4. 'WBGene00010959'
  5. 'atp-6'
  6. 'nduo-2'
  7. 'ctc-3'
  8. 'nduo-4'
  9. 'ctc-1'
  10. 'ctc-2'
  11. 'nduo-3'
  12. 'nduo-5'
# Function to calculate percentage of mitochondrial genes
add_mt_percentage <- function(seurat_obj, mt_genes) {
  # Calculate percentage of mitochondrial genes
  seurat_obj$percent.mt <- PercentageFeatureSet(seurat_obj, features = mt_genes)
  return(seurat_obj)
}
seurat_obj_list <- lapply(seurat_obj_list, add_mt_percentage, mt_genes = mt_genes)
qc_features <- c("nFeature_RNA", "nCount_RNA", "percent.mt")
VlnPlot(seurat_obj_list$'300min', features = qc_features, ncol = 3, pt.size=0)
VlnPlot(seurat_obj_list$'400min', features = qc_features, ncol = 3, pt.size=0)
VlnPlot(seurat_obj_list$'500min', features = qc_features, ncol = 3, pt.size=0)
Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”

png

Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”

png

png

How to interpret QC plot

nFeature_RNA: The number of unique features (genes) detected per cell.

Extremely high values could suggest potential doublets (two cells mistakenly captured as one), as two cells would have more unique genes combined.

Low number of detected genes - potential ambient mRNA (not real cells)

nCount_RNA: The total number of RNA molecules (or unique molecular identifiers, UMIs) detected per cell.

Higher counts generally indicate higher RNA content, but they could also result from cell doublets. Cells with very low nCount_RNA might represent poor-quality cells with low RNA capture, while very high counts may also suggest doublets.

percent.mt: The percentage of reads mapping to mitochondrial genes.

High mitochondrial content often indicates cell stress or apoptosis, as damaged cells tend to release mitochondrial RNA.

Filtering cells with high percent.mt values is common to exclude potentially dying cells.

# FeatureScatter is typically used to visualize feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.

plot1 <- FeatureScatter(seurat_obj_list$'300min', feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(seurat_obj_list$'300min', feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2

png

Filter out potential doublets, empty droplets and dying cells

# Load necessary libraries
library(Seurat)
library(ggplot2)

# Define the function to calculate median and MAD values
calculate_thresholds <- function(seurat_obj) {
  # Extract relevant columns
  nFeature_values <- seurat_obj@meta.data$nFeature_RNA
  nCount_values <- seurat_obj@meta.data$nCount_RNA
  percent_mt_values <- seurat_obj@meta.data$percent.mt

  # Calculate medians and MADs
  nFeature_median <- median(nFeature_values, na.rm = TRUE)
  nFeature_mad <- mad(nFeature_values, constant = 1, na.rm = TRUE)

  nCount_median <- median(nCount_values, na.rm = TRUE)
  nCount_mad <- mad(nCount_values, constant = 1, na.rm = TRUE)

  percent_mt_median <- median(percent_mt_values, na.rm = TRUE)
  percent_mt_mad <- mad(percent_mt_values, constant = 1, na.rm = TRUE)

  # Calculate thresholds for horizontal lines
  thresholds <- list(
    nFeature_upper = nFeature_median + 4 * nFeature_mad,
    nFeature_lower = nFeature_median - 4 * nFeature_mad,
    nCount_upper = nCount_median + 4 * nCount_mad,
    nCount_lower = nCount_median - 4 * nCount_mad,
    percent_mt_upper = percent_mt_median + 4 * percent_mt_mad
  )

  return(thresholds)
}

# Define a function to filter Seurat objects
filter_seurat_obj <- function(seurat_obj) {
  # Calculate thresholds
  thresholds <- calculate_thresholds(seurat_obj)

  # Apply filtering
  seurat_obj <- subset(
    seurat_obj,
    subset = nFeature_RNA > thresholds$nFeature_lower &
             nFeature_RNA < thresholds$nFeature_upper &
             percent.mt < thresholds$percent_mt_upper
  )
  #
  return(seurat_obj)
}

# Apply filtering to each Seurat object in the list
seurat_obj_list <- lapply(seurat_obj_list, filter_seurat_obj)


VlnPlot(seurat_obj_list$'300min', features = qc_features, ncol = 3, pt.size=0)
VlnPlot(seurat_obj_list$'400min', features = qc_features, ncol = 3, pt.size=0)
VlnPlot(seurat_obj_list$'500min', features = qc_features, ncol = 3, pt.size=0)
Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”
Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”

png

Warning message:
“Default search for "data" layer in "RNA" assay yielded no results; utilizing "counts" layer instead.”

png

png

RandomSubsetSeurat <- function(seurat_obj, subset_size, seed = NULL) {
  # Optionally set a random seed for reproducibility
  if (!is.null(seed)) {
    set.seed(seed)
  }

  # Get all cell names
  total_cells <- Cells(seurat_obj)

  # Ensure subset size is not larger than the total number of cells
  if (subset_size > length(total_cells)) {
    stop("Subset size exceeds the total number of cells in the Seurat object.")
  }

  # Randomly sample a subset of cell names
  subset_cells <- sample(total_cells, size = subset_size)

  # Create a new Seurat object with the subsetted cells
  subset_seurat_obj <- subset(seurat_obj, cells = subset_cells)

  return(subset_seurat_obj)
}

⚠️ Important

This is included only to reduce the memory used. In real project, you don't want to perform this step. Instead, you should request larger memory computing resources.

seurat_obj_list <- lapply(seurat_obj_list,
                      RandomSubsetSeurat,
                      subset_size = 2000,
                      seed = 123)
seurat_obj_list
$`300min`
An object of class Seurat 
19985 features across 2000 samples within 1 assay 
Active assay: RNA (19985 features, 0 variable features)
 1 layer present: counts

$`400min`
An object of class Seurat 
19985 features across 2000 samples within 1 assay 
Active assay: RNA (19985 features, 0 variable features)
 1 layer present: counts

$`500min`
An object of class Seurat 
19985 features across 2000 samples within 1 assay 
Active assay: RNA (19985 features, 0 variable features)
 1 layer present: counts
so_merged <- merge(seurat_obj_list$'300min',
         c(seurat_obj_list$'400min', seurat_obj_list$'500min'),
         add.cell.ids = c("300min", "400min", "500min"),
         project = "scRNA_cele")
so_merged
Layers(so_merged)
table(so_merged$orig.ident)
An object of class Seurat 
19985 features across 6000 samples within 1 assay 
Active assay: RNA (19985 features, 0 variable features)
 3 layers present: counts.300min, counts.400min, counts.500min
  1. 'counts.300min'
  2. 'counts.400min'
  3. 'counts.500min'
300min 400min 500min 
  2000   2000   2000
#rm(seurat_obj_list); gc();

Instead of using an arbitrary number, you can also use statistical algorithm to predict doublets and empty droplets to filter the cells, such as DoubletFinder and EmptyDrops.

Normalization, ccaling of the data and linear dimensional reduction

Normalization

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. In Seurat v5, Normalized values are stored in pbmc[["RNA"]]$data.

While this method of normalization is standard and widely used in scRNA-seq analysis, global-scaling relies on an assumption that each cell originally contains the same number of RNA molecules.

Next, we identify a subset of features that show high variation across cells in the dataset—meaning they are highly expressed in some cells and lowly expressed in others. Prior work, including our own, has shown that focusing on these variable genes in downstream analyses can enhance the detection of biological signals in single-cell datasets.

The approach used in Seurat improves upon previous versions by directly modeling the inherent mean-variance relationship in single-cell data. This method is implemented in the FindVariableFeatures() function, which, by default, selects 2,000 variable features per dataset. These features will then be used in downstream analyses, such as PCA.


Scaling the data

Next, we apply a linear transformation (scaling) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function:

Shifts the expression of each gene, so that the mean expression across cells is 0 Scales the expression of each gene, so that the variance across cells is 1

This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate

The results of this are stored in pbmc[["RNA"]]$scale.data

By default, only variable features are scaled. You can specify the features argument to scale additional features.

# run standard anlaysis workflow
so_merged <- NormalizeData(so_merged)
so_merged <- FindVariableFeatures(so_merged)
so_merged <- ScaleData(so_merged)
so_merged <- RunPCA(so_merged)
Normalizing layer: counts.300min

Normalizing layer: counts.400min

Normalizing layer: counts.500min

Finding variable features for layer counts.300min

Finding variable features for layer counts.400min

Finding variable features for layer counts.500min

Centering and scaling data matrix

PC_ 1 
Positive:  noah-1, noah-2, dpy-2, dpy-3, mlt-11, col-76, mlt-8, dpy-7, dpy-10, hch-1 
       col-121, dpy-17, dpy-14, txdc-12.2, sym-1, C01H6.8, Y41D4B.6, sqt-3, Y23H5B.8, K02E10.4 
       acn-1, C05C8.7, C26B9.3, R148.5, C48E7.1, dsl-6, inx-12, cpg-24, R05D3.9, F37C4.4 
Negative:  ost-1, pat-10, D2092.4, mlc-3, let-2, unc-15, lev-11, emb-9, tni-1, set-18 
       tnt-3, sgn-1, test-1, hsp-12.1, unc-98, cpn-3, sgcb-1, F53F10.1, spp-15, mup-2 
       Y71F9AR.2, mlc-1, C29F5.1, mlc-2, clik-1, stn-2, mig-18, F21H7.3, unc-60, Y73F8A.26 
PC_ 2 
Positive:  asp-4, enpl-1, C50F4.6, T02E9.5, his-24, atz-1, R07E5.17, C03C10.5, nphp-1, hil-3 
       tmem-231, mksr-2, tctn-1, fmi-1, jbts-14, osm-5, bbs-9, ift-81, fbxb-66, K02B12.2 
       mks-2, ift-20, K07C11.10, che-13, R01H2.8, ifta-2, F48E3.9, arl-3, ccep-290, tmem-17 
Negative:  unc-15, sgn-1, D2092.4, C29F5.1, let-2, lev-11, tnt-3, mlc-2, mup-2, mlc-1 
       tni-1, test-1, mig-18, clik-1, hsp-12.1, mlc-3, ttn-1, F21H7.3, sgca-1, emb-9 
       set-18, icl-1, ost-1, myo-3, unc-54, sgcb-1, hsp-12.2, unc-98, stn-2, Y71F9AR.2 
PC_ 3 
Positive:  mks-2, arl-3, mksr-2, ift-81, ifta-2, dylt-2, tmem-231, K07C11.10, che-13, ift-20 
       ccep-290, bbs-5, osm-5, C33A12.4, ift-74, R01H2.8, Y102A11A.9, bbs-9, tctn-1, lgc-20 
       dyf-1, mks-1, T02G5.3, dyf-3, arl-13, nphp-1, tmem-17, Y17D7B.10, bbs-2, F01E11.3 
Negative:  his-24, Y37E3.30, atz-1, lbp-1, C01G6.3, cht-1, ttr-50, clec-266, Y65A5A.1, enpl-1 
       hil-3, clec-196, idh-1, dsl-3, fbxb-66, cpg-20, F53B3.5, Y71F9AL.7, fbn-1, E01G4.5 
       pmt-2, ZK512.1, cutl-2, Y43F8B.2, cpg-24, F55C9.5, Y71F9AL.6, W04H10.6, fbxb-101, lam-3 
PC_ 4 
Positive:  arl-3, ifta-2, mks-2, mksr-2, tmem-231, ift-81, che-13, K07C11.10, ift-20, dylt-2 
       Y55D5A.1, bbs-9, ift-74, ccep-290, tctn-1, dsl-3, cutl-2, T02G5.3, R01H2.8, bbs-5 
       osm-5, nphp-1, dyf-1, lgc-20, dyf-3, arl-13, mks-1, Y102A11A.9, tmem-17, bbs-2 
Negative:  mab-7, wrt-2, abu-13, mlt-9, sups-1, clec-180, Y73E7A.8, C01G6.9, wrt-1, R07E3.6 
       Y54G2A.76, K08B12.1, mam-3, T19A5.3, glf-1, ZC449.1, H03E18.1, M03B6.3, ZK154.1, Y11D7A.9 
       pqn-32, F01D5.6, C35A5.11, F33D4.6, C52G5.2, grl-15, T03D8.6, F23H12.5, cut-6, ZC123.1 
PC_ 5 
Positive:  F33H2.8, nhr-127, F37A4.3, bus-17, nhr-270, sams-1, bus-12, F18A11.2, W04H10.6, oac-51 
       elt-1, F54B11.10, T13H10.2, nhr-218, rocf-1, Y6D1A.2, txdc-12.1, T04G9.4, F32H2.6, subs-4 
       fbn-1, F13G3.3, T03G6.1, nhr-94, C35A5.5, fbxa-52, fasn-1, nstp-3, W03B1.3, bus-8 
Negative:  Y41D4B.6, B0205.4, C06C3.4, T01D1.8, F43D9.1, F11E6.9, F31C3.6, dsl-6, ttr-2, T19C4.1 
       F46G11.6, K02E10.4, best-14, nhx-1, ifa-3, T25B9.1, F46F11.7, Y45F10B.59, F15B9.8, C24B5.4 
       T19B10.5, clec-78, cpt-4, C49F5.13, srm-1, H41C03.1, B0393.5, C53A5.2, ttr-3, far-5
DimPlot(so_merged,
        reduction = "pca",
        split.by = 'orig.ident',
        label.color = "black") +
        NoLegend()

png

so_merged <- FindNeighbors(so_merged,
                           dims = 1:30, reduction = "pca")
so_merged <- FindClusters(so_merged,
                           resolution = 2,
                           cluster.name = "unintegrated_clusters")
Computing nearest neighbor graph

Computing SNN



Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 6000
Number of edges: 197290

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8357
Number of communities: 37
Elapsed time: 0 seconds

You have several useful ways to visualize both cells and features that define the PCA, including VizDimReduction(), DimPlot(), and DimHeatmap().

DimHeatmap() draws a heatmap focusing on a principal component. Both cells and genes are sorted by their principal component scores

Perform analysis without integration

To visualize and explore these datasets, Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP.

The goal of tSNE/UMAP is to learn underlying structure in the dataset, in order to place similar cells together in low-dimensional space. Therefore, cells that are grouped together within graph-based clusters determined above should co-localize on these dimension reduction plots.

so_merged <- RunUMAP(so_merged,
              dims = 1:30,
              reduction = "pca",
              reduction.name = "umap.unintegrated")
Warning message:
“The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session”
22:47:06 UMAP embedding parameters a = 0.9922 b = 1.112

22:47:06 Read 6000 rows and found 30 numeric columns

22:47:06 Using Annoy for neighbor search, n_neighbors = 30

22:47:06 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
|

22:47:07 Writing NN index file to temp file /tmp/Rtmp9vMJuf/file43d447d49a3

22:47:07 Searching Annoy index using 1 thread, search_k = 3000

22:47:10 Annoy recall = 100%

22:47:11 Commencing smooth kNN distance calibration using 1 thread
 with target n_neighbors = 30

22:47:15 Initializing from normalized Laplacian + noise (using RSpectra)

22:47:15 Commencing optimization for 500 epochs, with 241186 positive edges

22:47:25 Optimization finished
DimPlot(so_merged,
             reduction = "umap.unintegrated",
             split.by = c("orig.ident"))

png

FeaturePlot(so_merged, feature="ham-1", pt.size = 0.1,
             split.by = c("orig.ident"))

png

FeaturePlot(so_merged, feature="egl-21", pt.size = 0.1,
             split.by = c("orig.ident"))

png

FeaturePlot(so_merged, feature="dsl-3", pt.size = 0.1,
             split.by = c("orig.ident"))

png

Perform integration

Seurat v5 enables streamlined integrative analysis using the IntegrateLayers function. The method currently supports five integration methods. Each of these methods performs integration in low-dimensional space, and returns a dimensional reduction (i.e. integrated.rpca) that aims to co-embed shared cell types across batches:

  • Anchor-based CCA integration (method=CCAIntegration)
  • Harmony (method=HarmonyIntegration)
  • Anchor-based RPCA integration (method=RPCAIntegration)
  • FastMNN (method= FastMNNIntegration)
  • scVI (method=scVIIntegration)

Canonical correlation analysis: CCA

Reciprocal PCA: RPCA

CCAIntegration integration method that is available in the Seurat package utilizes the canonical correlation analysis (CCA). This method expects “correspondences” or shared biological states among at least a subset of single cells across the groups.

so_merged_integ <- IntegrateLayers(object = so_merged,
                method = CCAIntegration, orig.reduction = "pca",
                new.reduction = "integrated.cca",
                verbose = FALSE)

Once integrative analysis is complete, you can rejoin the layers - which collapses the individual datasets together and recreates the original counts and data layers. You will need to do this before performing any differential expression analysis. However, you can always resplit the layers in case you would like to reperform integrative analysis.

# re-join layers after integration
so_merged_integ[["RNA"]] <- JoinLayers(so_merged_integ[["RNA"]])

so_merged_integ <- FindNeighbors(so_merged_integ,
                         reduction = "integrated.cca", dims = 1:30)
so_merged_integ <- FindClusters(so_merged_integ, resolution = 2)
Computing nearest neighbor graph

Computing SNN



Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 6000
Number of edges: 203509

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.8373
Number of communities: 37
Elapsed time: 1 seconds
so_merged_integ <- RunUMAP(so_merged_integ,
                          dims = 1:30,
                          reduction = "integrated.cca")

22:48:04 UMAP embedding parameters a = 0.9922 b = 1.112

22:48:04 Read 6000 rows and found 30 numeric columns

22:48:04 Using Annoy for neighbor search, n_neighbors = 30

22:48:04 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
|

22:48:06 Writing NN index file to temp file /tmp/Rtmp9vMJuf/file43d7e00c0c4

22:48:06 Searching Annoy index using 1 thread, search_k = 3000

22:48:09 Annoy recall = 100%

22:48:10 Commencing smooth kNN distance calibration using 1 thread
 with target n_neighbors = 30

22:48:13 Initializing from normalized Laplacian + noise (using RSpectra)

22:48:13 Commencing optimization for 500 epochs, with 243916 positive edges

22:48:24 Optimization finished
DimPlot(so_merged_integ,
             reduction = "umap",
             split.by = c("orig.ident"))

png

DefaultAssay(so_merged_integ)

'RNA'

markers <- FindAllMarkers(so_merged_integ)
Calculating cluster 0

Calculating cluster 1

Calculating cluster 2

Calculating cluster 3

Calculating cluster 4

Calculating cluster 5

Calculating cluster 6

Calculating cluster 7

Calculating cluster 8

Calculating cluster 9

Calculating cluster 10

Calculating cluster 11

Calculating cluster 12

Calculating cluster 13

Calculating cluster 14

Calculating cluster 15

Calculating cluster 16

Calculating cluster 17

Calculating cluster 18

Calculating cluster 19

Calculating cluster 20

Calculating cluster 21

Calculating cluster 22

Calculating cluster 23

Calculating cluster 24

Calculating cluster 25

Calculating cluster 26

Calculating cluster 27

Calculating cluster 28

Calculating cluster 29

Calculating cluster 30

Calculating cluster 31

Calculating cluster 32

Calculating cluster 33

Calculating cluster 34

Calculating cluster 35

Calculating cluster 36
markers %>%
    group_by(cluster) %>%
    dplyr::filter(avg_log2FC > 1) %>%
    slice_head(n = 2) %>%
    ungroup() -> top2
DoHeatmap(so_merged_integ, features = top2$gene) + NoLegend()
Warning message in DoHeatmap(so_merged_integ, features = top2$gene):
“The following features were omitted as they were not found in the scale.data slot for the RNA assay: nmur-3, timp-1, acp-6, C14A4.6, skpo-2, mltn-13, srw-12, F59C6.8, T03G11.9, C02F5.2, C33D9.10, lev-9, C08F1.6, cank-26, egl-21”

png

head(markers)
A data.frame: 6 × 7
p_valavg_log2FCpct.1pct.2p_val_adjclustergene
<dbl><dbl><dbl><dbl><dbl><fct><chr>
F36H1.11 0.000000e+005.4506590.5700.048 0.000000e+000F36H1.11
egl-211.563079e-2823.9845630.5770.0583.123813e-2780egl-21
T10B5.41.850416e-2763.3738480.7920.1383.698057e-2720T10B5.4
madd-41.841569e-2424.5795710.4200.0303.680376e-2380madd-4
F28E10.18.224195e-2253.7343300.7460.1671.643605e-2200F28E10.1
C31H5.58.048185e-2224.4660410.4110.0341.608430e-2170C31H5.5
FeaturePlot(so_merged_integ, reduction = "umap", feature="ham-1", pt.size = 0.1,
             split.by = c("orig.ident"))

png

so_merged_integ
An object of class Seurat 
19985 features across 6000 samples within 1 assay 
Active assay: RNA (19985 features, 2000 variable features)
 3 layers present: data, counts, scale.data
 4 dimensional reductions calculated: pca, umap.unintegrated, integrated.cca, umap
FeaturePlot(so_merged_integ, reduction = "umap", feature="egl-21", pt.size = 0.1,
             split.by = c("orig.ident"))

png

FeaturePlot(so_merged_integ, reduction = "umap", feature="dsl-3", pt.size = 0.1,
             split.by = c("orig.ident"))

png

With the integrated Seurat object, you can perform cell type annotation using marker genes.

⚠️ Important

In this class, we will not perform cell type annotation for this dataset. With the notebook for analyzing the human PBMC data, you will be able to perform cell type annnotation if needed.

In the next section, you will learn how to perform trajectory and pseudotime analysis. You will directly utilize the annotation performed by the authors who generated this dataset in the Science paper.

The data can be obtained using the URLs below:

## count matrix
https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_expression.rds
## metadata
https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_colData.rds
## gene list
https://depts.washington.edu:/trapnell-lab/software/monocle3/celegans/data/packer_embryo_rowData.rds

Save the Seurat object

saveRDS(so_merged_integ, file = "Seurat_object_10x_cele_final.rds")
remotes::install_github("10xGenomics/loupeR")
loupeR::setup()
Downloading GitHub repo 10xGenomics/loupeR@HEAD



promises  (1.3.0  -> 1.3.2 ) [CRAN]
later     (1.3.2  -> 1.4.1 ) [CRAN]
progressr (0.15.0 -> 0.15.1) [CRAN]


Installing 3 packages: promises, later, progressr

Installing packages into ‘/content/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



── R CMD build ─────────────────────────────────────────────────────────────────
* checking for file ‘/tmp/Rtmp9vMJuf/remotes43d20adbcde/10XGenomics-loupeR-a169417/DESCRIPTION’ ... OK
* preparing ‘loupeR’:
* checking DESCRIPTION meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘loupeR_1.1.2.tar.gz’



Installing package into ‘/content/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Warning message in fun(libname, pkgname):
“Please call `loupeR::setup()` to install the Louper executable and to agree to the EULA before continuing”



Installing Executable



2024/12/02 22:51:21 Downloading executable



The LoupeR executable is subject to the 10x End User Software License, available at:
https://10xgen.com/EULA

Do you accept the End-User License Agreement
(y/yes or n/no): yes

EULA
library(loupeR)
create_loupe_from_seurat(so_merged_integ,
                output_name = "Seurat_object_10x_cele_merged_integ",
                force = TRUE)

2024/12/02 22:56:56 extracting matrix, clusters, and projections

2024/12/02 22:56:56 selected assay: RNA

2024/12/02 22:56:56 selected clusters: active_cluster orig.ident unintegrated_clusters seurat_clusters RNA_snn_res.2

2024/12/02 22:56:56 selected projections: umap.unintegrated umap

2024/12/02 22:56:56 validating count matrix

2024/12/02 22:56:56 validating clusters

2024/12/02 22:56:56 validating projections

2024/12/02 22:56:56 creating temporary hdf5 file: /tmp/Rtmp9vMJuf/file43d561ea60d.h5

2024/12/02 22:56:59 invoking louper executable

2024/12/02 22:56:59 running command: "/root/.local/share/R/loupeR/louper create --input='/tmp/Rtmp9vMJuf/file43d561ea60d.h5' --output='Seurat_object_10x_cele_merged_integ.cloupe' --force"

Reference

https://monashbioinformaticsplatform.github.io/Single-Cell-Workshop/pbmc3k_tutorial.html

https://bioinformatics.ccr.cancer.gov/docs/getting-started-with-scrna-seq/IntroToR_Seurat/

https://hbctraining.github.io/scRNA-seq/lessons/elbow_plot_metric.html

https://satijalab.org/seurat/articles/integration_introduction