nf-core/isoseq
Genome annotation with PacBio Iso-Seq. Takes raw subreads as input, generates Full-Length Non-Chimeric (FLNC) sequences, and produces a BED annotation.
You are viewing documentation for version 1.1.1. The latest stable release is 2.0.0.
Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- CCS - Generate CCS sequences
- LIMA - Remove primer sequences from CCS
- ISOSEQ REFINE - Detect and remove chimeric reads
- BAMTOOLS CONVERT - Convert BAM files into FASTA files
- TAMA POLYA CLEAN UP - Detect and trim polyA tails from reads
- GUNZIP - Decompress FLNC FASTAs (uLTRA path only)
- ULTRA or MINIMAP2 - Map FLNC reads to the genome
- BIOPERL - Remove spurious alignments (uLTRA path only, Issue #11)
- SAMTOOLS SORT - Sort alignments and convert SAM files into BAM files
- TAMA FILE LIST - Prepare list file for TAMA collapse
- TAMA COLLAPSE - Clean gene models
- TAMA MERGE - Merge all annotations into one for each sample with TAMA merge
- Pipeline information - Report metrics generated during the workflow execution
CCS
Output files
- `01_PBCCS/`
  - `<sample>.chunk<X>.bam`: The CCS sequences
  - `<sample>.chunk<X>.bam.pbi`: The PacBio index of the CCS file
  - `<sample>.chunk<X>.metrics.json.gz`: Statistics for each ZMW
  - `<sample>.chunk<X>.report.json`: General statistics about the generated CCS sequences in JSON format
  - `<sample>.chunk<X>.report.txt`: General statistics about the generated CCS sequences in TXT format
CCS generates Circular Consensus Sequences from the subreads. It reports the number of selected and discarded ZMWs and the reasons for rejection.
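For orientation, a `ccs` call on a single chunk might look like the sketch below; the file names, the chunk count and the quality threshold are illustrative assumptions, not the pipeline's exact parameters.

```bash
# Hedged sketch: build CCS reads from one chunk of a subreads BAM.
# File names, the 1/10 chunking and the --min-rq value are illustrative only.
ccs movie.subreads.bam sample.chunk1.bam \
    --chunk 1/10 \
    --min-rq 0.9 \
    --report-file sample.chunk1.report.txt
```

Chunking allows the per-chunk files to be processed in parallel and merged back into a single annotation per sample at the TAMA merge step.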
LIMA
Output files
- `02_LIMA/`
  - `<sample>.chunk<X>_flnc.json`: Metadata about the generated XML file
  - `<sample>.chunk<X>_flnc.lima.clips`: Clipped sequences
  - `<sample>.chunk<X>_flnc.lima.counts`: Statistics about detected primer pairs
  - `<sample>.chunk<X>_flnc.lima.guess`: Statistics about detected primer pairs
  - `<sample>.chunk<X>_flnc.lima.report`: Detailed statistics on primer pairs for each sequence
  - `<sample>.chunk<X>_flnc.lima.summary`: General statistics about selected and rejected sequences
  - `<sample>.chunk<X>_flnc.primer_5p--primer_3p.bam`: Selected sequences
  - `<sample>.chunk<X>_flnc.primer_5p--primer_3p.bam.pbi`: PacBio index of the selected sequences
  - `<sample>.chunk<X>_flnc.primer_5p--primer_3p.consensusreadset.xml`: Metadata of the selected sequences
LIMA cleans the generated CCS reads: it selects the sequences that contain a valid pair of primers and removes the primer sequences.
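A hedged sketch of the primer-removal step, assuming a primer FASTA named `primers.fasta` (the actual primer set is defined by the pipeline's parameters):

```bash
# Illustrative lima call in Iso-Seq mode: keep reads with a valid 5'/3' primer pair
# and clip the primers; --peek-guess lets lima infer which primer pair to use.
lima sample.chunk1.bam primers.fasta sample.chunk1_flnc.bam \
    --isoseq \
    --peek-guess
```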
ISOSEQ REFINE
Output files
- `03_ISOSEQ3_REFINE/`
  - `<sample>.chunk<X>.bam`: The selected sequences
  - `<sample>.chunk<X>.bam.pbi`: PacBio index of the selected sequences
  - `<sample>.chunk<X>.consensusreadset.xml`: Metadata
  - `<sample>.chunk<X>.filter_summary.json`: Number of Full-Length, Full-Length Non-Chimeric and Full-Length Non-Chimeric PolyA reads
  - `<sample>.chunk<X>.report.csv`: Primers and insert length of each read
ISOSEQ REFINE discards chimeric reads.
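A minimal sketch of the refine step, reusing the hypothetical lima output and primer FASTA shown above:

```bash
# Illustrative isoseq3 refine call: remove chimeric reads (reads with internal primers).
# Input/output names are hypothetical.
isoseq3 refine sample.chunk1_flnc.primer_5p--primer_3p.bam primers.fasta sample.chunk1.bam
```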
BAMTOOLS CONVERT
Output files
- `04_BAMTOOLS_CONVERT/`
  - `<sample>.chunk<X>.fasta`: The reads in FASTA format
BAMTOOLS CONVERT converts the reads from BAM format into FASTA format.
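A minimal sketch of the conversion, with hypothetical file names:

```bash
# Illustrative bamtools call: extract the read sequences from BAM as FASTA.
bamtools convert -format fasta -in sample.chunk1.bam -out sample.chunk1.fasta
```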
TAMA POLYA CLEAN UP
Output files
- `05_GSTAMA_POLYACLEANUP/`
  - `<sample>.chunk<X>_tama.fa.gz`: The polyA-tail-free reads
  - `<sample>.chunk<X>_polya_flnc_report.txt.gz`: Length of the removed tails
  - `<sample>.chunk<X>_tama_tails.fa.gz`: Sequences of the removed tails
GSTAMA_POLYACLEANUP (TAMA polyA cleanup) removes polyA tails from the selected reads.
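For orientation only, the call might resemble the sketch below; the script name and flags are assumptions based on the TAMA documentation and should be checked against your TAMA installation:

```bash
# Assumed invocation of TAMA's polyA cleanup script; -f (input FASTA) and
# -p (output prefix) are assumptions, not the pipeline's verified command line.
python tama_flnc_polya_cleanup.py -f sample.chunk1.fasta -p sample.chunk1_tama
```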
GUNZIP
Output files
- `06.1_GUNZIP/`
  - `<sample>.chunk<X>_tama.fa`: The uncompressed polyA-tail-free reads
GUNZIP decompresses the FLNC reads for alignment with uLTRA (gzipped input is not yet supported by uLTRA).
ULTRA or MINIMAP2
Output files
- `06.2_ULTRA/` or `06_MINIMAP2/`
  - `<sample>.chunk<X>.sam`: The aligned reads
MINIMAP2 or uLTRA aligns the reads to the genome.
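A hedged sketch of the minimap2 path (the uLTRA path instead runs uLTRA with the genome and its annotation); the preset and options are illustrative, not the pipeline's exact call:

```bash
# Illustrative spliced alignment of FLNC reads with minimap2:
# -ax splice:hq is the high-quality isoform preset, -uf forces the forward transcript strand.
minimap2 -ax splice:hq -uf genome.fa sample.chunk1_tama.fa > sample.chunk1.sam
```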
BIOPERL
Output files
- `06.3_PERL_BIOPERL/`
  - `<sample>.chunk<X>_filtered.sam`: The aligned reads with spurious alignments removed
Some CIGAR strings occasionally contain a spurious gap (N); this can happen when a GFF file converted to GTF is used as the annotation. BIOPERL removes the affected alignments. See Issue #11 in the uLTRA repository.
SAMTOOLS SORT
Output files
- `07_SAMTOOLS_SORT/`
  - `<sample>.chunk<X>_sorted.bam`: The sorted aligned reads
SAMTOOLS SORT sorts the aligned reads and converts the SAM file into a BAM file.
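A minimal sketch with hypothetical file names:

```bash
# Illustrative samtools call: coordinate-sort the SAM alignments and write BAM.
samtools sort -o sample.chunk1_sorted.bam sample.chunk1.sam
```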
TAMA COLLAPSE
Output files
- `08_GSTAMA_COLLAPSE/`
  - `<sample>.chunk<X>_collapsed.bed`: A bed12 file containing the final collapsed version of your transcriptome
  - `<sample>.chunk<X>_local_density_error.txt`: The log of filtering for local density error around the splice junctions
  - `<sample>.chunk<X>_polya.txt`: The reads with potential polyA truncation
  - `<sample>.chunk<X>_read.txt`: Information for all mapped reads from the input SAM/BAM file
  - `<sample>.chunk<X>_strand_check.txt`: Instances where the SAM flag strand information contradicted the GMAP strand information
  - `<sample>.chunk<X>_trans_read.bed`: A bed12 file showing the transcript model for each read based on the mapping prior to collapsing
  - `<sample>.chunk<X>_trans_report.txt`: Collapsing information for each transcript
  - `<sample>.chunk<X>_varcov.txt`: Coverage information for each variant detected
  - `<sample>.chunk<X>_variants.txt`: The variants called
TAMA Collapse is a tool that allows you to collapse redundant transcript models in your Iso-Seq data.
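For orientation, a collapse call could look like the sketch below; the flags follow the TAMA wiki (-s alignments, -f genome FASTA, -p output prefix, -x no_cap for non-capped libraries) and may differ from the pipeline's exact parameters:

```bash
# Assumed tama_collapse.py invocation; file names are hypothetical, and BAM input
# may require an additional TAMA option depending on the TAMA version.
python tama_collapse.py -s sample.chunk1_sorted.bam -f genome.fa -p sample.chunk1 -x no_cap
```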
TAMA FILE LIST
Output files
- `09_GSTAMA_FILELIST/`
  - `<sample>.tsv`: A TSV file listing the BED files to merge with TAMA merge
TAMA FILE LIST is an in-house script that generates the input file list for TAMA merge.
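As an illustration, the file list follows the four tab-separated columns described in the TAMA merge documentation (BED file, cap flag, merge priority for start/junctions/end, source name); the values below are hypothetical:

```bash
# Hypothetical content of <sample>.tsv listing two per-chunk collapsed BED files.
cat <<'EOF' > sample.tsv
sample.chunk1_collapsed.bed	no_cap	1,1,1	chunk1
sample.chunk2_collapsed.bed	no_cap	1,1,1	chunk2
EOF
```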
TAMA MERGE
Output files
- `10_GSTAMA_MERGE/`
  - `<sample>.bed`: The main merged annotation file
  - `<sample>_gene_report.txt`: A report of the genes from the merged file
  - `<sample>_merge.txt`: A bed12 file showing the coordinates of each input transcript matched to the merged transcript ID
  - `<sample>_trans_report.txt`: The source information for each merged transcript
TAMA Merge is a tool that allows you to merge multiple transcriptomes while maintaining source information. Here it merges the per-chunk annotations into a single annotation for each sample.
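A hedged sketch of the merge step using the file list above; -f (file list) and -p (output prefix) follow the TAMA documentation, and the names are illustrative:

```bash
# Assumed tama_merge.py invocation producing <sample>.bed and its report files.
python tama_merge.py -f sample.tsv -p sample
```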
MultiQC
Output files
- `multiqc/`
  - `multiqc_report.html`: A standalone HTML file that can be viewed in your web browser
  - `multiqc_data/`: Directory containing parsed statistics from the different tools used in the pipeline
  - `multiqc_plots/`: Directory containing static images from the report in various formats
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Pipeline information
Output files
- `pipeline_info/`
  - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`.
  - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email`/`--email_on_fail` parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.