Analysis of long read transcriptome data
In this lab, you are going to analyze the direct RNA sequencing data published in Genome Research in 2020. The title of the paper is: The full-length transcriptome of C. elegans using direct RNA sequencing.
Download the data
You can go to the Data access section of the paper here: https://genome.cshlp.org/content/30/2/299.full
- Go to: https://www.ncbi.nlm.nih.gov/
- Search PRJEB31791
- Click the SRA link
- Click the first item in the search results
- Click the link: ERP114391
- Right-click on the fastq files to obtain the FTP URLs
Then you can use wget to download the fastq files.
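For example (a sketch only; the placeholder accession below is not a real run ID, so use the FTP links you actually copied from the ENA page):
# Replace the placeholder URL with the FTP link copied from a fastq file entry on the ENA page
wget "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERRxxx/ERRxxxxxxx/ERRxxxxxxx.fastq.gz"
# Or put all copied URLs into a text file and download them in one go
wget -i fastq_urls.txt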
The data for the L1 and adult male samples have been downloaded and saved on the HPC cluster:
/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/L1_rep1.fastq.gz
/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/L1_rep2.fastq.gz
/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/male_rep1.fastq.gz
/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/male_rep2.fastq.gz
Analyze the data using wf-transcriptomes
wf-transcriptomes is a cDNA and RNA sequencing data analysis workflow that leverages long nanopore reads, providing a detailed view of the transcriptome.
Download the tool
https://github.com/epi2me-labs/wf-transcriptomes
Path: /scratch/zt1/project/bioi611/shared/software/wf-transcriptomes-1.4.0/main.nf
Install conda
Miniforge is a minimal installer for Conda specific to conda-forge. Miniforge allows users to install the conda package manager with the following features pre-configured: conda-forge set as the default (and only) channel.
# Remove any previous miniforge3 directory
rm -rf miniforge3
# Download the installer matching your OS and CPU architecture
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
# Run the interactive installer
bash Miniforge3-$(uname)-$(uname -m).sh
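The provided job scripts may already take care of the software environment. If you want your own copy of Nextflow (needed to launch the workflow manually), one option, sketched here as an assumption rather than a required step, is to install it into a conda environment:
# Activate the freshly installed Miniforge (default install location is ~/miniforge3)
source ~/miniforge3/bin/activate
# Create and activate an environment containing Nextflow from Bioconda
conda create -n nextflow -c bioconda nextflow
conda activate nextflow
nextflow -version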
Run wf-transcriptomes on the demo dataset
The workflow wf-transcriptomes ships with a demo dataset. This demo dataset can be used to test the workflow and help you understand the inputs and outputs.
The demo data can be found here:
ls /scratch/zt1/project/bioi611/shared/raw_data/wf-transcriptomes-demo/ |cat
chr20
differential_expression_fastq
gencode.v22.annotation.chr20.gff
gencode.v22.annotation.chr20.gff3
gencode.v22.annotation.chr20.gtf
hg38_chr20.fa
Homo_sapiens.GRCh38.109.gtf.gz
Homo_sapiens.GRCh38.cdna.all.fa.gz
Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
md5sums.txt
nextflow.config
ref_transcriptome.fasta
sample_sheet.csv
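Optionally, you can verify the demo files against the provided checksums (assuming md5sums.txt is in the standard md5sum format; run the check from inside the demo directory):
cd /scratch/zt1/project/bioi611/shared/raw_data/wf-transcriptomes-demo/
md5sum -c md5sums.txt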
You can analyze the demo data by submitting the job file below:
# Takes around 8 minutes to finish
sbatch /scratch/zt1/project/bioi611/shared/scripts/ONT_directRNA_wf_transcriptome_demo.sub
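If you are curious what the job file actually does, inspect it with cat. Conceptually it wraps a nextflow run call roughly like the sketch below; the parameter names follow the wf-transcriptomes documentation, but the exact options used in the course script are an assumption here, so always check the .sub file itself:
# A rough sketch only -- see the actual .sub file for the options used in class
nextflow run /scratch/zt1/project/bioi611/shared/software/wf-transcriptomes-1.4.0/main.nf \
    --fastq differential_expression_fastq \
    --ref_genome hg38_chr20.fa \
    --ref_annotation gencode.v22.annotation.chr20.gtf \
    --sample_sheet sample_sheet.csv \
    --de_analysis \
    --out_dir /scratch/zt1/project/bioi611/user/$USER/ONT_directRNA_demo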
The output folder is:
/scratch/zt1/project/bioi611/user/$USER/ONT_directRNA_demo
The documentation for the output files can be found here:
https://github.com/epi2me-labs/wf-transcriptomes?tab=readme-ov-file#outputs
Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
Title | File path | Description | Per sample or aggregated |
---|---|---|---|
workflow report | wf-transcriptomes-report.html | an HTML report document detailing the primary findings of the workflow | aggregated |
Per file read stats | fastq_ingress_results/reads/fastcat_stats/per-file-stats.tsv | A TSV with per file read stats, including all samples. | aggregated |
Read stats | fastq_ingress_results/reads/fastcat_stats/per-read-stats.tsv | A TSV with per read stats, including all samples. | aggregated |
Run IDs | fastq_ingress_results/reads/fastcat_stats/run_ids | List of run IDs present in reads. | aggregated |
Meta map json | fastq_ingress_results/reads/metamap.json | Metadata used in workflow presented in a JSON. | aggregated |
Concatenated sequence data | fastq_ingress_results/reads/{{ alias }}.fastq.gz | Per sample reads concatenated into one FASTQ file. | per-sample |
Assembled transcriptome | {{ alias }}_transcriptome.fas | Per sample assembled transcriptome. | per-sample |
Annotated assembled transcriptome | {{ alias }}_merged_transcriptome.fas | Per sample annotated assembled transcriptome. | per-sample |
Alignment summary statistics | {{ alias }}_read_aln_stats.tsv | Per sample alignment summary statistics. | per-sample |
GFF compare results. | {{ alias }}_gffcompare | All GFF compare output files. | per-sample |
Differential gene expression results | de_analysis/results_dge.tsv | This is a gene-level result file that describes genes and their probability of showing differential expression between experimental conditions. | aggregated |
Differential gene expression report | de_analysis/results_dge.pdf | Summary report of differential gene expression analysis as a PDF. | aggregated |
Differential transcript usage gene TSV | de_analysis/results_dtu_gene.tsv | This is a gene-level result file from DEXSeq that lists annotated genes and their probabilities of differential expression. | aggregated |
Differential transcript usage report | de_analysis/results_dtu.pdf | Summary report of differential transcript usage results as a PDF. | aggregated |
Differential transcript usage TSV | de_analysis/results_dtu_transcript.tsv | This is a transcript-level result file from DEXSeq that lists annotated genes and their probabilities of differential expression. | aggregated |
Differential transcript usage stageR TSV | de_analysis/results_dtu_stageR.tsv | This is the output from stageR and it shows both gene and transcript probabilities of differential expression. | aggregated |
Differential transcript usage DEXSeq TSV | de_analysis/results_dexseq.tsv | The complete output from the DEXSeq-analysis, shows both gene and transcript probabilities of differential expression. | aggregated |
Gene counts | de_analysis/all_gene_counts.tsv | Raw gene counts created by the Salmon tool, before filtering. | aggregated |
Gene counts per million | de_analysis/cpm_gene_counts.tsv | This file shows counts per million (CPM) of the raw gene counts to facilitate comparisons across samples. | aggregated |
Transcript counts | de_analysis/unfiltered_transcript_counts_with_genes.tsv | Raw transcript counts created by the Salmon tool, before filtering. Includes reference to the associated gene ID. | aggregated |
Transcript per million counts | de_analysis/unfiltered_tpm_transcript_counts.tsv | This file shows transcripts per million (TPM) of the raw counts to facilitate comparisons across samples. | aggregated |
Transcript counts filtered | de_analysis/filtered_transcript_counts_with_genes.tsv | Filtered transcript counts, used for differential transcript usage analysis. Includes a reference to the associated gene ID. | aggregated |
Transcript info table | {{ alias }}_transcripts_table.tsv | This file details each isoform that was reconstructed from the input reads. It contains a subset of columns from the .tmap output from gffcompare | per-sample |
Final non redundant transcriptome | de_analysis/final_non_redundant_transcriptome.fasta | Transcripts that were used for differential expression analysis including novel transcripts with the identifiers used for DE analysis. | aggregated |
Index of reference FASTA file | igv_reference/{{ ref_genome file }}.fai | Reference genome index of the FASTA file required for IGV config. | aggregated |
GZI index of the reference FASTA file | igv_reference/{{ ref_genome file }}.gzi | GZI Index of the reference FASTA file. | aggregated |
JSON configuration file for IGV browser | igv.json | JSON configuration file to be loaded in IGV for visualising alignments against the reference. | aggregated |
BAM file (minimap2) | BAMS/{{ alias }}.reads_aln_sorted.bam | BAM file generated from mapping input reads to the reference. | per-sample |
BAM index file (minimap2) | BAMS/{{ alias }}.reads_aln_sort.bam.bai | Index file generated from mapping input reads to the reference. | per-sample |
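As a quick sanity check on a per-sample alignment, you can run samtools on one of the BAM files (assuming samtools is available on the cluster; "sample_alias" is a placeholder for one of the aliases in sample_sheet.csv):
# "sample_alias" is a placeholder -- substitute one of your sample aliases
samtools flagstat /scratch/zt1/project/bioi611/user/$USER/ONT_directRNA_demo/BAMS/sample_alias.reads_aln_sorted.bam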
Run wf-transcriptomes on the direct RNA sequencing data from C. elegans
Based on the demo data, you can set up the input folder and run the workflow on the direct RNA sequencing data from the Genome Research paper.
# Takes around 14 minutes to finish
sbatch /scratch/zt1/project/bioi611/shared/scripts/ONT_directRNA_wf_transcriptome.sub
You can find the output files here:
/scratch/zt1/project/bioi611/user/$USER/ONT_directRNA
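A quick way to browse the results from the command line, using the paths listed in the output table above (the de_analysis files will only be present if differential expression was enabled in the job script):
cd /scratch/zt1/project/bioi611/user/$USER/ONT_directRNA
ls
# Top of the gene-level differential expression results
head de_analysis/results_dge.tsv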
Basecalling [Optional]
- Introduction to ONT Raw Data (FAST5/POD5)
In Oxford Nanopore sequencing, raw data captures the electrical signal generated as DNA or RNA molecules pass through a nanopore. This signal reflects variations in ionic current caused by the unique properties of each nucleotide. Key points about raw data:
- Ionic Current Signal: The primary measurement is the change in ionic current as each nucleotide interacts with the nanopore. This signal is captured continuously.
- MinKNOW Software: This software suite manages the sequencing process, capturing raw signals and translating them into "reads."
- File Formats:
  - POD5: This is the primary file format used in recent ONT sequencing runs, replacing the older FAST5 format.
Each read in these files corresponds to a single DNA or RNA strand. Understanding raw data is crucial because it represents the initial and most unprocessed form of information from ONT sequencing. However, it’s challenging to interpret without further processing.
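If you have POD5 files on hand and the pod5 Python package installed (an assumption; it is not part of the course software paths above), you can peek at the raw-read metadata before any basecalling:
# Assumes the pod5 tools are installed, e.g. via: pip install pod5
pod5 inspect summary reads.pod5   # per-file summary: read counts, run information
pod5 view reads.pod5 | head       # per-read table: read_id and other per-read fields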
- Base Calling and File Outputs (BAM/FASTQ)
After generating raw data, the next essential step is base calling, which translates the electrical signal into nucleotide sequences. This is where machine learning plays a critical role:
- Base Calling Process:
  - Signal Processing Techniques: ONT’s basecalling algorithms use advanced machine learning models to interpret the raw signal.
  - Output: Each ionic current pattern is mapped to a sequence of nucleotide bases (A, T, C, or G).
- Output File Formats:
  - BAM Files: These files contain sequence information along with potential modifications and alignment information. ONT typically structures BAM files with 4,000 reads per file by default.
  - FASTQ Files: This is the widely-used format for storing nucleotide sequences and their associated quality scores. Similar to BAM, ONT defaults to 4,000 reads per file in FASTQ format.
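For example, to count the reads in a basecalled FASTQ file or to pull sequences out of a basecalled BAM (the file names below are placeholders, and samtools is assumed to be available):
# Count reads in a gzipped FASTQ (4 lines per read)
echo $(( $(zcat reads.fastq.gz | wc -l) / 4 ))
# Extract FASTQ records from a basecalled BAM
samtools fastq basecalls.bam > basecalls.fastq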
Basecalling using Guppy
In Roach et al., 2020, RNA sequencing on the GridION platform was performed using ONT R9.4 flow cells and the standard MinKNOW protocol script (NC_48Hr_sequencing_FLO-MIN106_SQK-RNA001). The raw data is in FAST5 format.
Guppy is a data processing toolkit that contains the Oxford Nanopore Technologies' production basecalling algorithms and several bioinformatic post-processing features.
To basecall reads with Guppy, you will need to use the following command and options:
guppy_basecaller (or the fully-qualified path if using the archive installer)
- --input_path: Full or relative path to the directory where the raw read files are located. The path can be absolute or relative to the current working directory.
- --save_path: Full or relative path to the directory where the basecall results will be saved. This folder will be created if it does not exist, using the path you provide (e.g. if it is a relative path, it will be relative to the current working directory).
Then either:
- --config: configuration file containing Guppy parameters
or
- --flowcell: flow cell version
- --kit: sequencing kit version
Find the corresponding model
The kit and flow cell information should be clearly labelled on the corresponding boxes. Flow cells almost always start with "FLO" and kits almost always start with "SQK" or "VSK".
To see the supported flow cells and kits, run Guppy with the --print_workflows option:
/scratch/zt1/project/bioi611/shared/software/ont-guppy-cpu/bin/guppy_basecaller --print_workflows | grep 'FLO-MIN106' | grep 'SQK-RNA001'
flowcell kit barcoding config_name model version
FLO-MIN106 SQK-RNA001 rna_r9.4.1_70bps_hac 2020-09-07_rna_r9.4.1_minion_256_8f8fc47b
Basecalling using Guppy
/scratch/zt1/project/bioi611/shared/software/ont-guppy-cpu/bin/guppy_basecaller \
  -i test_data/ \
  --config /scratch/zt1/project/bioi611/shared/software/ont-guppy-cpu/data/rna_r9.4.1_70bps_hac.cfg \
  --save_path temp --recursive
Quality-Based Output Directories
Guppy categorizes reads based on their quality scores, storing them in separate folders for easy access and downstream processing.
- pass/: Contains reads with quality scores above a specified threshold (typically a Phred quality score of 7 or higher). These reads are considered high quality and are commonly used for downstream analyses. File Format: FASTQ files, each storing sequences and associated quality scores.
- fail/: Contains reads with quality scores below the specified threshold, indicating lower confidence in accuracy. These reads might be filtered out or re-processed depending on the study's goals. File Format: FASTQ files, similar to those in the pass directory, but typically excluded from final analyses.
Remember that the data you basecalled here is only a test dataset. The number of reads in pass/ and fail/ doesn't reflect the actual data.
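If you want to compare the two categories yourself, a simple read count works (a sketch; whether the files are plain .fastq or .fastq.gz depends on whether --compress_fastq was used):
# 4 FASTQ lines per read; the "temp" directory matches the --save_path used above
echo "pass: $(( $(cat temp/pass/*.fastq | wc -l) / 4 ))"
echo "fail: $(( $(cat temp/fail/*.fastq | wc -l) / 4 ))"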
The software dorado is now the recommended tool to perform basecalling on POD5 files.
FAST5 files can be converted to POD5 files using the tool below:
https://github.com/nanoporetech/pod5-file-format
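As a sketch of that route (the command forms follow the pod5 and dorado documentation, but the file and model names below are placeholders, so check the versions installed on your system):
# Convert FAST5 files into a single POD5 file (assumes the pod5 package is installed)
pod5 convert fast5 fast5_dir/*.fast5 --output converted.pod5
# Basecall with dorado; the model choice here is illustrative and must match your chemistry (e.g. RNA vs DNA kits use different models)
dorado basecaller hac converted.pod5 > calls.bam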