Welcome to fastq2EZbakR!

fastq2EZbakR is a Snakemake pipeline designed to process nucleotide recoding RNA-seq data (NR-seq; e.g., TimeLapse-seq, SLAM-seq, and TUC-seq). fastq2EZbakR provides output readily compatible with EZbakR, but is also designed to provide processed NR-seq data in a convenient form that you can work with as you see fit.

Where to go

Step 0: Check out the rest of the information on this page to get a sense of fastq2EZbakR's functionality and use cases.

Step 1: Read the Deployment documentation to get up and running with fastq2EZbakR, or the Slurm documentation for deploying fastq2EZbakR on an HPC system with a Slurm scheduler (e.g., Yale's HPC).

Step 2: Read the Configuration documentation to get details about all config parameters.

Step 3: Read about fastq2EZbakR's generalized feature assignment, one of the major benefits of using fastq2EZbakR.

Step 4: Read the Tips and Tricks documentation for helpful pointers that may make your life easier.

Step 5: Read about Output produced by fastq2EZbakR.

Step 6: Check out ancillary documentation about creating tracks and FAQs.

What fastq2EZbakR does

The input to fastq2EZbakR is either FASTQ files or aligned BAM files (the latter must have the not-always-standard MD tag). The main output of fastq2EZbakR is a so-called cB (counts binomial) file that will always include the following columns:

  • sample - Sample name
  • rname - Chromosome name
  • sj - Logical: TRUE if read contains exon-exon spliced junction
  • n - Number of reads with identical values in all of the other columns

In addition, columns reporting mutation counts and nucleotide counts will be included. For a standard NR-seq dataset (s4U labeling), that means tracking T-to-C mutation counts (column name: TC) and the number of reference Ts covered by a read (column name: nT). Finally, reads will be assigned to a set of annotated features, and columns will be included based on which of these feature assignment strategies you have activated in your particular pipeline run. The possibilities include:

  • GF: gene read was assigned to (any region of gene)
  • XF: gene read was assigned to (only exonic regions of gene)
  • exonic_bin: exonic bins as defined in DEXSeq paper
  • transcripts: set of transcripts a read is compatible with (i.e., its transcript equivalence class)
  • junction_start: 5' splice site of exon-exon junction (genomic coordinate)
  • junction_end: 3' splice site of exon-exon junction (genomic coordinate)
  • ei_junction_id: Numerical ID given to a given exon-intron junction
  • ee_junction_id: Numerical ID given to a given exon-exon junction

See Configuration and Feature Assignment for details about feature assignment strategies and how to select which to use.
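To make the cB layout concrete, here is a minimal Python sketch that reads a small cB-style table and computes a per-sample T-to-C mutation rate. The rows are made up for illustration, and the sketch assumes a comma-separated file with the GF feature column activated; check your own output for the exact delimiter and columns. Because each row stands for n reads, both TC and nT are weighted by n.

```python
import csv
import io
from collections import defaultdict

# Hypothetical cB rows for illustration only; real files come from the pipeline.
CB_TEXT = """sample,rname,sj,GF,TC,nT,n
WT_1,chr1,FALSE,geneA,0,20,50
WT_1,chr1,FALSE,geneA,1,25,10
WT_1,chr1,TRUE,geneB,2,30,5
WT_ctl,chr1,FALSE,geneA,0,22,40
"""


def mutation_rates(cb_text):
    """Return {sample: T-to-C mutations per covered reference T}."""
    muts = defaultdict(int)
    ts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(cb_text)):
        n = int(row["n"])  # number of reads sharing this row's values
        muts[row["sample"]] += int(row["TC"]) * n
        ts[row["sample"]] += int(row["nT"]) * n
    return {sample: muts[sample] / ts[sample] for sample in ts}


rates = mutation_rates(CB_TEXT)
for sample, rate in sorted(rates.items()):
    print(f"{sample}: {rate:.4f}")
```

A -label control sample would be expected to show a much lower rate than a labeled sample, which makes this kind of summary a quick sanity check.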

Advantages of fastq2EZbakR

fastq2EZbakR provides a number of unique functionalities not found in other established NR-seq data processing tools. These include:

  1. Flexible assignment of reads to genomic features.
  2. Quantification of any mutation type you are interested in. T-to-C mutation counting is the most common NR-seq application, but any combination of mutation types is fair game.
  3. A tidy, easy to work with representation of your mutational data in the form of the aforementioned cB file.
  4. Optional site-specific mutation counting (as was used here, for example). This has allowed fastq2EZbakR to support processing of non-NR-seq mutational probing RNA-seq datasets.
  5. Optional automatic downloading and processing of published data available on the Sequence Read Archive (SRA).
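As a rough illustration of what counting "any mutation type" means, the following Python sketch tallies every mismatch type along with the reference nucleotide counts for a single read. It is a simplified stand-in, not the pipeline's own code: the function name count_mutations is hypothetical, and it assumes the read and reference are already aligned, gap-free, and on the same strand.

```python
from collections import Counter


def count_mutations(ref, read):
    """Tally mismatch types and reference nucleotide coverage for one read.

    Assumes `ref` and `read` are equal-length, gap-free, same-strand strings.
    """
    if len(ref) != len(read):
        raise ValueError("ref and read must have the same length")
    muts = Counter()     # e.g. muts["TC"] = number of T-to-C mismatches
    covered = Counter()  # e.g. covered["nT"] = reference Ts covered by the read
    for r, q in zip(ref.upper(), read.upper()):
        if r == "N" or q == "N":
            continue  # skip ambiguous bases
        covered["n" + r] += 1
        if r != q:
            muts[r + q] += 1
    return muts, covered


muts, covered = count_mutations("ATTGCTTA", "ATCGCTCA")
print(muts["TC"], covered["nT"])
```

Keeping all mismatch types in one counter means the choice of which types to report can be deferred until the final table is built.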

The pipeline

Below is a simplified schematic of the major steps performed by fastq2EZbakR.

[Figure: fastq2EZbakR pipeline schematic]

fastq2EZbakR can take either FASTQ files or BAM files as input. If FASTQ files are provided, the following steps are run:

  1. FastQC is run on each FASTQ file to generate QC reports.
  2. Adapters are trimmed with fastp. This can be turned off (e.g., if you already did this before running fastq2EZbakR) by setting the skip_trimming parameter in your config.yaml file to True. If you have paired-end data, adapters can be automatically detected. If not, or if you don't want to rely on automatic detection, you can specify adapters in the fastp_adapters parameter of the config.yaml file.
  3. An alignment index is built (if the path specified in the indices parameter of the config doesn't already exist) using your provided genome FASTA file and annotation GTF file.
  4. Reads are aligned with your choice of aligner, specified in the aligner parameter of the config.yaml file. STAR and HISAT2 are currently implemented, but only STAR supports all possible feature assignment strategies.
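The conditional logic in the FASTQ steps above can be sketched in a few lines of Python. This is only an illustration of the decisions described in the list: the function name fastq_steps and the step labels are hypothetical and do not correspond to actual Snakemake rule names.

```python
def fastq_steps(skip_trimming=False, index_exists=False):
    """List the FASTQ-specific steps that would run, per the description above."""
    steps = ["FastQC"]
    if not skip_trimming:
        # Trimming is skipped when skip_trimming is True in config.yaml
        steps.append("trim adapters (fastp)")
    if not index_exists:
        # The index is only built if the configured path does not already exist
        steps.append("build alignment index")
    steps.append("align reads")
    return steps


print(fastq_steps())
print(fastq_steps(skip_trimming=True, index_exists=True))
```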

The following steps will run in either the FASTQ or BAM file input cases:

  1. BAM files are filtered and sorted by read name. Filtering removes unaligned reads and non-primary alignments. Sorting by read name is important for parallelizing mutation counting and is required by the feature assignment scripts.
  2. If you have -label control samples (e.g., -s4U control samples in a standard NR-seq experiment), these can be used to call SNPs. The called SNPs are then used to avoid false-positive mutation counts.
  3. Reads are assigned to features, using either featureCounts or custom scripts. See Feature Assignment for details of what feature assignment strategies exist. You can choose which ones to implement in a particular pipeline run under the features parameter in the config.yaml file.
  4. Mutations are counted with custom scripts. All mutation types are tallied in the intermediate files that fastq2EZbakR writes to the results/counts directory. You can specify which to include in the final processed cB file in the mut_tracks parameter in your config.yaml file.
  5. Feature assignment and mutation counts are merged.
  6. Colored sequencing tracks are created with STAR and IGVtools.
  7. The final cB file is created.
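Steps 5 and 7 above amount to joining two per-read tables and then collapsing identical rows into counts. The Python sketch below shows that idea on toy data. Everything in it is hypothetical (the read names, the "__no_feature" placeholder, and the function name build_cb), and the real pipeline works on much larger intermediate files with its own scripts.

```python
from collections import Counter

# Hypothetical per-read tables keyed by read name.
features = {
    "read1": {"GF": "geneA", "XF": "geneA"},
    "read2": {"GF": "geneA", "XF": "geneA"},
    "read3": {"GF": "geneB", "XF": "__no_feature"},  # placeholder label
}
mutations = {
    "read1": {"TC": 1, "nT": 20},
    "read2": {"TC": 1, "nT": 20},
    "read3": {"TC": 0, "nT": 18},
}


def build_cb(sample, features, mutations):
    """Join feature and mutation tables by read name, then count identical rows."""
    counts = Counter()
    for read, feat in features.items():
        mut = mutations.get(read)
        if mut is None:
            continue  # keep only reads present in both tables
        key = (sample, feat["GF"], feat["XF"], mut["TC"], mut["nT"])
        counts[key] += 1
    header = ("sample", "GF", "XF", "TC", "nT", "n")
    rows = [key + (n,) for key, n in sorted(counts.items())]
    return header, rows


header, rows = build_cb("WT_1", features, mutations)
for row in rows:
    print(dict(zip(header, row)))
```

Collapsing identical rows is what keeps the cB file compact: reads that share the same feature assignment and mutational content are stored once, with n recording how many there were.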

fastq2EZbakR's (brief) origin story

fastq2EZbakR is a rewrite of the original TimeLapse-seq pipeline developed by the Simon lab at Yale. The contributors to the original pipeline were Matthew Simon, Jeremy Schofield, Martin Machyna, Lea Kiefer, and Joshua Zimmer. A lot has changed since the initial creation of the pipeline, and fastq2EZbakR (developed by Isaac Vock) has a load of novel functionality. It is also an extension/rewrite of bam2bakR (also developed by Isaac), doing pretty much everything it does and then some.