Sequence files background

We’ve extracted total RNA and shipped it to Azenta for sequencing… so what happens next? How do we go from working with microcentrifuge tubes in the lab to large files on the computer?

First, Azenta runs the total RNA through quality assessment and quality control (QAQC) steps to make sure its quantity and quality are sufficient for sequencing. Here is the QAQC report from Azenta:

The total RNA is then subjected to library prep, where the RNA is turned into cDNA.

The cDNA is what is actually sequenced, on an Illumina sequencer, with Poly-A selection and 20 million reads.

The raw FASTQ files come back demultiplexed.

Coding resources

This roadmap was built from the following resources and references:

Sam White
- Sam’s Notebook
  - QAQC with FastQC, MultiQC & Fastp
  - Genome alignment with HISAT2
- QAQC post
Babraham Bioinformatics Institute:
- Training Courses
Erin Chille:
- Mcapitata_OA_Developmental_Gene_Expression_Timeseries github repository from the [@chille2022] paper
Steven Roberts:
- Bioinformatics FISH 546 course at UW
- Lab Handbook
Savanah Liedholt:
- Differentially Expressed Genes from a Reference Genome
- SmallRNA processing and pipeline for viral community Analysis
Ariana Huffmyer:
- EarlyLifeHistory_Energetics TagSeq github repository
Sarah Tanja:
- FISH 546 Compendium

Pipeline birds-eye view

1. Receive raw FASTQ files

files are already demultiplexed
files have a .fastq.gz zipped format
files must be checked to make sure there were no errors in the transfer process (this is done with md5sum)
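The transfer check can also be sketched in Python, as a rough equivalent of running md5sum against the checksum list that ships with the data. This is a minimal sketch, assuming the sequencing provider supplies an md5sum-style manifest with one `<digest>  <filename>` pair per line; the function names and the manifest layout are hypothetical and not taken from Azenta's delivery.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to save memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_manifest(text):
    """Parse md5sum-style lines ('<digest>  <filename>') into {filename: digest}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' on the filename
        expected[name.lstrip("*")] = digest.lower()
    return expected


def verify_transfer(data_dir, manifest_text):
    """Return {filename: True/False}; False means missing or digest mismatch."""
    data_dir = Path(data_dir)
    results = {}
    for name, digest in parse_md5_manifest(manifest_text).items():
        target = data_dir / name
        results[name] = target.is_file() and md5_of_file(target) == digest
    return results
```

Any file that comes back `False` should be transferred again before moving on to QAQC.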

2. QAQC FASTQ files

Some great examples of previous QAQC scripts for Montipora capitata RNA-seq data were written by E. Chille (QAQC Script), Sam White (Notebook post) and A. Huffmyer (QAQC Script)

Quality check raw sequences with FastQC, synthesize a report with MultiQC
Clean up sequences with Fastp
- Trim sequence lengths
- Filter out bad quality reads
- Remove adapters & polyA tails
Check cleaned sequences with FastQC, synthesize a report with MultiQC
Repeat cleaning steps if needed
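The QAQC loop above can be sketched as a set of small Python helpers that build the command lines for FastQC, MultiQC and fastp. This is only a sketch under assumptions: the quality cutoff (20), minimum length (50), file names and directory names are placeholders, not values from the scripts linked above, and paired-end input is optional because the read layout is not stated here. The helpers only build the commands; running them requires the tools to be installed.

```python
import shlex


def fastqc_command(fastq_files, out_dir, threads=1):
    """Build a FastQC call that writes one report per input file to out_dir."""
    return ["fastqc", "--threads", str(threads), "--outdir", out_dir, *fastq_files]


def multiqc_command(report_dir, out_dir):
    """Build a MultiQC call that summarizes every report found in report_dir."""
    return ["multiqc", report_dir, "--outdir", out_dir]


def fastp_command(in1, out1, in2=None, out2=None,
                  min_quality=20, min_length=50, report_prefix="fastp"):
    """Build a fastp call that trims and filters reads (thresholds are assumptions)."""
    cmd = ["fastp", "--in1", in1, "--out1", out1]
    if in2 and out2:
        # paired-end: second read file plus overlap-based adapter detection
        cmd += ["--in2", in2, "--out2", out2, "--detect_adapter_for_pe"]
    cmd += [
        "--qualified_quality_phred", str(min_quality),  # base quality cutoff
        "--length_required", str(min_length),           # drop reads shorter than this
        "--trim_poly_x",                                 # trim polyX (incl. polyA) tails
        "--json", f"{report_prefix}.json",
        "--html", f"{report_prefix}.html",
    ]
    return cmd


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed
    steps = [
        fastqc_command(["raw/sample_R1.fastq.gz"], "qc_raw"),
        multiqc_command("qc_raw", "qc_raw"),
        fastp_command("raw/sample_R1.fastq.gz", "clean/sample_R1.fastq.gz"),
        fastqc_command(["clean/sample_R1.fastq.gz"], "qc_clean"),
        multiqc_command("qc_clean", "qc_clean"),
    ]
    for step in steps:
        print(shlex.join(step))
```

To actually run a step, pass the list to `subprocess.run(step, check=True)`; if the second MultiQC report still shows problems, adjust the fastp thresholds and repeat.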

3. Align to reference genome & assemble

Great examples of previous HISAT2 RNA-seq alignment in Sam White’s notebook here and Steven Roberts’s examples here

Obtain reference genome assembly & GFF annotation file
- Genome Version 3 [@stephens2022]
- GFF from Rutgers (or GFF fixed from AHuffmyer?)
- genomes, indexes, & feature tracks from Roberts Lab Handbook
Align sequences to genome
- using HISAT2 [@zhang_rapid_2021]
Sort alignments with samtools and assemble transcripts with StringTie 2
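The alignment and assembly step can be sketched as one command per tool: build a HISAT2 index, align reads, sort the alignments with samtools, then run StringTie against the GFF annotation. This is a sketch under assumptions, not the exact commands from the notebooks linked above: all file names are hypothetical, paired-end input is optional, and the `-e` and `-B` StringTie options are included on the assumption that the counts will later be collected with prepDE.py.

```python
import shlex


def hisat2_build_command(genome_fasta, index_base, threads=1):
    """Build the HISAT2 index from the reference genome FASTA."""
    return ["hisat2-build", "-p", str(threads), genome_fasta, index_base]


def hisat2_align_command(index_base, out_sam, reads1, reads2=None, threads=1):
    """Align reads to the index; --dta tailors output for transcript assemblers."""
    cmd = ["hisat2", "-p", str(threads), "--dta", "-x", index_base]
    if reads2:
        cmd += ["-1", reads1, "-2", reads2]   # paired-end reads
    else:
        cmd += ["-U", reads1]                 # single-end reads
    cmd += ["-S", out_sam]
    return cmd


def samtools_sort_command(in_sam, out_bam, threads=1):
    """Convert SAM to a coordinate-sorted BAM, which StringTie expects."""
    return ["samtools", "sort", "-@", str(threads), "-o", out_bam, in_sam]


def stringtie_command(in_bam, annotation_gff, out_gtf, threads=1):
    """Estimate abundances for annotated transcripts only (-e), with Ballgown tables (-B)."""
    return ["stringtie", in_bam, "-p", str(threads),
            "-G", annotation_gff, "-e", "-B", "-o", out_gtf]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed
    steps = [
        hisat2_build_command("genome.fasta", "genome_index"),
        hisat2_align_command("genome_index", "sample.sam", "clean/sample_R1.fastq.gz"),
        samtools_sort_command("sample.sam", "sample.sorted.bam"),
        stringtie_command("sample.sorted.bam", "genes.gff", "sample/sample.gtf"),
    ]
    for step in steps:
        print(shlex.join(step))
```

Keeping each tool in its own function makes it easy to loop over samples and to swap in a different GFF if the annotation needs fixing.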

4. Create gene expression count matrix

Python script prepDE.py from Steven’s Differentially Expressed Genes post
Another example by Sam here which uses python3
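prepDE.py can take a text file listing one sample ID and the path to that sample's StringTie GTF per line. Here is a minimal sketch of generating that list and the matching command, assuming every sample's GTF sits in one directory and is named `<sample>.gtf`; the helper names, directory layout and output file names are hypothetical, and the script name may differ (for example a python3 version) depending on the StringTie release.

```python
from pathlib import Path


def write_prepde_sample_list(gtf_dir, list_path, suffix=".gtf"):
    """Write '<sample_id> <path/to/sample.gtf>' lines for prepDE.py and return them."""
    lines = []
    for gtf in sorted(Path(gtf_dir).glob(f"*{suffix}")):
        sample_id = gtf.name[: -len(suffix)]  # file name without the suffix
        lines.append(f"{sample_id} {gtf}")
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines


def prepde_command(list_path,
                   gene_matrix="gene_count_matrix.csv",
                   transcript_matrix="transcript_count_matrix.csv",
                   script="prepDE.py"):
    """Build the prepDE.py call that turns StringTie GTFs into count matrices."""
    return [script, "-i", str(list_path), "-g", gene_matrix, "-t", transcript_matrix]
```

The gene count matrix written by this step is the table that gets loaded into R for DESeq2.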

5. Data exploration using R & DESeq2

DESeq2 vignette found here

6. Identify differentially expressed genes (DEGs)

DESeq2 vignette found here
Multifactorial DE Analysis