Analysis of long read transcriptome data
In this lab, you are going to analyze the direct RNA sequencing data published in Genome Research in 2020. The title of the paper is: The full-length transcriptome of C. elegans using direct RNA sequencing.
Download the data
You can go to the Data access section of the paper here: https://genome.cshlp.org/content/30/2/299.full
- Go to: https://www.ncbi.nlm.nih.gov/
- Search PRJEB31791
- Click the SRA link
- Click the first item in the search results
- Click the link: ERP114391
- Right-click on the fastq files to obtain the FTP URLs
Then you can use wget to download the fastq files.
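For example (a sketch only; the placeholder accession below is not a real run ID, so use the FTP links you actually copied from the ENA page):
# Replace the placeholder URL with the FTP link copied from a fastq file entry on the ENA page
wget "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERRxxx/ERRxxxxxxx/ERRxxxxxxx.fastq.gz"
# Or put all copied URLs into a text file and download them in one go
wget -i fastq_urls.txt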
The data for the L1 and adult male samples have been downloaded and saved on the HPC cluster:
/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/L1_rep1.fastq.gz
/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/L1_rep2.fastq.gz
/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/male_rep1.fastq.gz
/scratch/zt1/project/bioi611/shared/raw_data/ONT_directRNA/male_rep2.fastq.gz
Analyze the data using wf-transcriptomes
wf-transcriptomes is a cDNA and RNA sequencing data analysis workflow that leverages long nanopore reads, providing a detailed view of the transcriptome.
Download the tool
https://github.com/epi2me-labs/wf-transcriptomes
Path: /scratch/zt1/project/bioi611/shared/software/wf-transcriptomes-1.4.0/main.nf
Install conda
Miniforge is a minimal installer for Conda specific to conda-forge. Miniforge allows users to install the conda package manager with the following features pre-configured: conda-forge set as the default (and only) channel.
# Remove any previous miniforge3 directory
rm -rf miniforge3
# Download the installer matching your OS and CPU architecture
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
# Run the interactive installer
bash Miniforge3-$(uname)-$(uname -m).sh
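The provided job scripts may already take care of the software environment. If you want your own copy of Nextflow (needed to launch the workflow manually), one option, sketched here as an assumption rather than a required step, is to install it into a conda environment:
# Activate the freshly installed Miniforge (default install location is ~/miniforge3)
source ~/miniforge3/bin/activate
# Create and activate an environment containing Nextflow from Bioconda
conda create -n nextflow -c bioconda nextflow
conda activate nextflow
nextflow -version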
Run wf-transcriptomes on the demo dataset
The workflow wf-transcriptomes ships with a demo dataset. This demo dataset can be used to test the workflow and help you understand the inputs and outputs.
The demo data can be found here:
ls /scratch/zt1/project/bioi611/shared/raw_data/wf-transcriptomes-demo/ |cat
chr20
differential_expression_fastq
gencode.v22.annotation.chr20.gff
gencode.v22.annotation.chr20.gff3
gencode.v22.annotation.chr20.gtf
hg38_chr20.fa
Homo_sapiens.GRCh38.109.gtf.gz
Homo_sapiens.GRCh38.cdna.all.fa.gz
Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
md5sums.txt
nextflow.config
ref_transcriptome.fasta
sample_sheet.csv
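Optionally, you can verify the demo files against the provided checksums (assuming md5sums.txt is in the standard md5sum format; run the check from inside the demo directory):
cd /scratch/zt1/project/bioi611/shared/raw_data/wf-transcriptomes-demo/
md5sum -c md5sums.txt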
You can analyze the demo data by submitting the job file below:
# Takes around 8 minutes to finish
sbatch /scratch/zt1/project/bioi611/shared/scripts/ONT_directRNA_wf_transcriptome_demo.sub
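If you are curious what the job file actually does, inspect it with cat. Conceptually it wraps a nextflow run call roughly like the sketch below; the parameter names follow the wf-transcriptomes documentation, but the exact options used in the course script are an assumption here, so always check the .sub file itself:
# A rough sketch only -- see the actual .sub file for the options used in class
nextflow run /scratch/zt1/project/bioi611/shared/software/wf-transcriptomes-1.4.0/main.nf \
    --fastq differential_expression_fastq \
    --ref_genome hg38_chr20.fa \
    --ref_annotation gencode.v22.annotation.chr20.gtf \
    --sample_sheet sample_sheet.csv \
    --de_analysis \
    --out_dir /scratch/zt1/project/bioi611/user/$USER/ONT_directRNA_demo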
The output folder is:
/scratch/zt1/project/bioi611/user/$USER/ONT_directRNA_demo
The documentation for the output files can be found here:
https://github.com/epi2me-labs/wf-transcriptomes?tab=readme-ov-file#outputs
Output files may be aggregated including information for all samples or provided per sample. Per-sample files will be prefixed with respective aliases and represented below as {{ alias }}.
Title | File path | Description | Per sample or aggregated |
---|---|---|---|
workflow report | wf-transcriptomes-report.html | an HTML report document detailing the primary findings of the workflow | aggregated |
Per file read stats | fastq_ingress_results/reads/fastcat_stats/per-file-stats.tsv | A TSV with per file read stats, including all samples. | aggregated |
Read stats | fastq_ingress_results/reads/fastcat_stats/per-read-stats.tsv | A TSV with per read stats, including all samples. | aggregated |
Run IDs | fastq_ingress_results/reads/fastcat_stats/run_ids | List of run IDs present in reads. | aggregated |
Meta map json | fastq_ingress_results/reads/metamap.json | Metadata used in workflow presented in a JSON. | aggregated |
Concatenated sequence data | fastq_ingress_results/reads/{{ alias }}.fastq.gz | Per sample reads concatenated into one FASTQ file. | per-sample |
Assembled transcriptome | {{ alias }}_transcriptome.fas | Per sample assembled transcriptome. | per-sample |
Annotated assembled transcriptome | {{ alias }}_merged_transcriptome.fas | Per sample annotated assembled transcriptome. | per-sample |
Alignment summary statistics | {{ alias }}_read_aln_stats.tsv | Per sample alignment summary statistics. | per-sample |
GFF compare results. | {{ alias }}_gffcompare | All GFF compare output files. | per-sample |
Differential gene expression results | de_analysis/results_dge.tsv | This is a gene-level result file that describes genes and their probability of showing differential expression between experimental conditions. | aggregated |
Differential gene expression report | de_analysis/results_dge.pdf | Summary report of differential gene expression analysis as a PDF. | aggregated |
Differential transcript usage gene TSV | de_analysis/results_dtu_gene.tsv | This is a gene-level result file from DEXSeq that lists annotated genes and their probabilities of differential expression. | aggregated |
Differential transcript usage report | de_analysis/results_dtu.pdf | Summary report of differential transcript usage results as a PDF. | aggregated |
Differential transcript usage TSV | de_analysis/results_dtu_transcript.tsv | This is a transcript-level result file from DEXSeq that lists annotated genes and their probabilities of differential expression. | aggregated |
Differential transcript usage stageR TSV | de_analysis/results_dtu_stageR.tsv | This is the output from stageR and it shows both gene and transcript probabilities of differential expression. | aggregated |
Differential transcript usage DEXSeq TSV | de_analysis/results_dexseq.tsv | The complete output from the DEXSeq-analysis, shows both gene and transcript probabilities of differential expression. | aggregated |
Gene counts | de_analysis/all_gene_counts.tsv | Raw gene counts created by the Salmon tool, before filtering. | aggregated |
Gene counts per million | de_analysis/cpm_gene_counts.tsv | This file shows counts per million (CPM) of the raw gene counts to facilitate comparisons across samples. | aggregated |
Transcript counts | de_analysis/unfiltered_transcript_counts_with_genes.tsv | Raw transcript counts created by the Salmon tool, before filtering. Includes reference to the associated gene ID. | aggregated |
Transcript per million counts | de_analysis/unfiltered_tpm_transcript_counts.tsv | This file shows transcripts per million (TPM) of the raw counts to facilitate comparisons across samples. | aggregated |
Transcript counts filtered | de_analysis/filtered_transcript_counts_with_genes.tsv | Filtered transcript counts, used for differential transcript usage analysis. Includes a reference to the associated gene ID. | aggregated |
Transcript info table | {{ alias }}_transcripts_table.tsv | This file details each isoform that was reconstructed from the input reads. It contains a subset of columns from the .tmap output from gffcompare | per-sample |
Final non redundant transcriptome | de_analysis/final_non_redundant_transcriptome.fasta | Transcripts that were used for differential expression analysis including novel transcripts with the identifiers used for DE analysis. | aggregated |
Index of reference FASTA file | igv_reference/{{ ref_genome file }}.fai | Reference genome index of the FASTA file required for IGV config. | aggregated |
GZI index of the reference FASTA file | igv_reference/{{ ref_genome file }}.gzi | GZI Index of the reference FASTA file. | aggregated |
JSON configuration file for IGV browser | igv.json | JSON configuration file to be loaded in IGV for visualising alignments against the reference. | aggregated |
BAM file (minimap2) | BAMS/{{ alias }}.reads_aln_sorted.bam | BAM file generated from mapping input reads to the reference. | per-sample |
BAM index file (minimap2) | BAMS/{{ alias }}.reads_aln_sort.bam.bai | Index file generated from mapping input reads to the reference. | per-sample |
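As a quick sanity check on a per-sample alignment, you can run samtools on one of the BAM files (assuming samtools is available on the cluster; "sample_alias" is a placeholder for one of the aliases in sample_sheet.csv):
# "sample_alias" is a placeholder -- substitute one of your sample aliases
samtools flagstat /scratch/zt1/project/bioi611/user/$USER/ONT_directRNA_demo/BAMS/sample_alias.reads_aln_sorted.bam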
Run wf-transcriptomes on the direct RNA sequencing data from C. elegans
Based on the demo data, you can set up the input folder and run the workflow on the direct RNA sequencing data from the Genome Research paper.
# Takes around 14 minutes to finish
sbatch /scratch/zt1/project/bioi611/shared/scripts/ONT_directRNA_wf_transcriptome.sub
You can find the output files here:
/scratch/zt1/project/bioi611/user/$USER/ONT_directRNA
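A quick way to browse the results from the command line, using the paths listed in the output table above (the de_analysis files will only be present if differential expression was enabled in the job script):
cd /scratch/zt1/project/bioi611/user/$USER/ONT_directRNA
ls
# Top of the gene-level differential expression results
head de_analysis/results_dge.tsv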
Basecalling [Optional]
- Introduction to ONT Raw Data (FAST5/POD5)
In Oxford Nanopore sequencing, raw data captures the electrical signal generated as DNA or RNA molecules pass through a nanopore. This signal reflects variations in ionic current caused by the unique properties of each nucleotide. Key points about raw data:
- Ionic Current Signal: The primary measurement is the change in ionic current as each nucleotide interacts with the nanopore. This signal is captured continuously.
- MinKNOW Software: This software suite manages the sequencing process, capturing raw signals and translating them into "reads."
- File Formats:
  - POD5: This is the primary file format used in recent ONT sequencing runs, replacing the older FAST5 format.
Each read in these files corresponds to a single DNA or RNA strand. Understanding raw data is crucial because it represents the initial and most unprocessed form of information from ONT sequencing. However, it’s challenging to interpret without further processing.
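If you have POD5 files on hand and the pod5 Python package installed (an assumption; it is not part of the course software paths above), you can peek at the raw-read metadata before any basecalling:
# Assumes the pod5 tools are installed, e.g. via: pip install pod5
pod5 inspect summary reads.pod5   # per-file summary: read counts, run information
pod5 view reads.pod5 | head       # per-read table: read_id and other per-read fields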
- Base Calling and File Outputs (BAM/FASTQ)
After generating raw data, the next essential step is base calling, which translates the electrical signal into nucleotide sequences. This is where machine learning plays a critical role:
- Base Calling Process:
  - Signal Processing Techniques: ONT’s basecalling algorithms use advanced machine learning models to interpret the raw signal.
  - Output: Each ionic current pattern is mapped to a sequence of nucleotide bases (A, T, C, or G).
- Output File Formats:
  - BAM Files: These files contain sequence information along with potential modifications and alignment information. ONT typically structures BAM files with 4,000 reads per file by default.
  - FASTQ Files: This is the widely-used format for storing nucleotide sequences and their associated quality scores. Similar to BAM, ONT defaults to 4,000 reads per file in FASTQ format.
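For example, to count the reads in a basecalled FASTQ file or to pull sequences out of a basecalled BAM (the file names below are placeholders, and samtools is assumed to be available):
# Count reads in a gzipped FASTQ (4 lines per read)
echo $(( $(zcat reads.fastq.gz | wc -l) / 4 ))
# Extract FASTQ records from a basecalled BAM
samtools fastq basecalls.bam > basecalls.fastq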
Basecalling using Guppy
In Roach et al., 2020, RNA sequencing on the GridION platform was performed using ONT R9.4 flow cells and the standard MinKNOW protocol script (NC_48Hr_sequencing_FLO-MIN106_SQK-RNA001). The raw data is in FAST5 format.
Guppy is a data processing toolkit that contains the Oxford Nanopore Technologies' production basecalling algorithms and several bioinformatic post-processing features.
To basecall reads with Guppy, you will need to use the following command and options:
guppy_basecaller (or the fully-qualified path if using the archive installer)
- --input_path: Full or relative path to the directory where the raw read files are located. The path can be absolute or relative to the current working directory.
- --save_path: Full or relative path to the directory where the basecall results will be saved. This folder will be created if it does not exist, using the path you provide (e.g. if it is a relative path, it will be relative to the current working directory).
Then either:
- --config: configuration file containing Guppy parameters
or
- --flowcell: flow cell version
- --kit: sequencing kit version
Find the corresponding model
The kit and flow cell information should be clearly labelled on the corresponding boxes. Flow cells almost always start with "FLO" and kits almost always start with "SQK" or "VSK".
To see the supported flow cells and kits, run Guppy with the --print_workflows option:
/scratch/zt1/project/bioi611/shared/software/ont-guppy-cpu/bin/guppy_basecaller --print_workflows | grep 'FLO-MIN106' | grep 'SQK-RNA001'
flowcell kit barcoding config_name model version
FLO-MIN106 SQK-RNA001 rna_r9.4.1_70bps_hac 2020-09-07_rna_r9.4.1_minion_256_8f8fc47b
Basecalling using Guppy
/scratch/zt1/project/bioi611/shared/software/ont-guppy-cpu/bin/guppy_basecaller \
  -i test_data/ \
  --config /scratch/zt1/project/bioi611/shared/software/ont-guppy-cpu/data/rna_r9.4.1_70bps_hac.cfg \
  --save_path temp --recursive
Quality-Based Output Directories
Guppy categorizes reads based on their quality scores, storing them in separate folders for easy access and downstream processing.
- pass/: Contains reads with quality scores above a specified threshold (typically a Phred quality score of 7 or higher). These reads are considered high quality and are commonly used for downstream analyses. File Format: FASTQ files, each storing sequences and associated quality scores.
- fail/: Contains reads with quality scores below the specified threshold, indicating lower confidence in accuracy. These reads might be filtered out or re-processed depending on the study's goals. File Format: FASTQ files, similar to those in the pass directory, but typically excluded from final analyses.
Remember that the data you basecalled here is only a test dataset. The number of reads in pass/ and fail/ doesn't reflect the actual data.
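If you want to compare the two categories yourself, a simple read count works (a sketch; whether the files are plain .fastq or .fastq.gz depends on whether --compress_fastq was used):
# 4 FASTQ lines per read; the "temp" directory matches the --save_path used above
echo "pass: $(( $(cat temp/pass/*.fastq | wc -l) / 4 ))"
echo "fail: $(( $(cat temp/fail/*.fastq | wc -l) / 4 ))"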
The software dorado is now the recommended tool to perform basecalling on POD5 files.
FAST5 files can be converted to POD5 files using the tool below:
https://github.com/nanoporetech/pod5-file-format
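As a sketch of that route (the command forms follow the pod5 and dorado documentation, but the file and model names below are placeholders, so check the versions installed on your system):
# Convert FAST5 files into a single POD5 file (assumes the pod5 package is installed)
pod5 convert fast5 fast5_dir/*.fast5 --output converted.pod5
# Basecall with dorado; the model choice here is illustrative and must match your chemistry (e.g. RNA vs DNA kits use different models)
dorado basecaller hac converted.pod5 > calls.bam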