nf-core/eager
A fully reproducible and state-of-the-art ancient DNA analysis pipeline
Version 2.2.0 (the latest stable release is 2.5.2).
Define where the pipeline should find input data, and additional metadata.
Either paths or URLs to FASTQ/BAM data (must be surrounded with quotes). For paired end data, the path must use '{1,2}' notation to specify read pairs. Alternatively, a path to a TSV file (ending .tsv) containing file paths and sequencing/sample metadata. Allows for merging of multiple lanes/libraries/samples. Please see documentation for template.
string
null
There are two possible ways of supplying input sequencing data to nf-core/eager.
The most efficient but more simplistic is supplying direct paths (with
wildcards) to your FASTQ or BAM files, with each file or pair being considered a
single library and each one run independently. TSV input requires creation of an
extra file by the user and extra metadata, but allows more powerful lane and
library merging.
Direct Input Method
This method is where you specify with --input the path locations of FASTQ
(optionally gzipped) or BAM file(s). This option is mutually exclusive to the
TSV input method, which is used for more complex input
configurations such as lane and library merging.
When using the direct method of --input you can specify one or multiple
samples in one or more directories. File names must be unique, even if
in different directories.
By default, the pipeline assumes you have paired-end data. If you want to run
single-end data you must specify --single_end.
For example, for a single set of FASTQs, or multiple paired-end FASTQ
files in one directory, you can specify:
--input 'path/to/data/sample_*_{1,2}.fastq.gz'
If you have multiple files in different directories, you can use additional
wildcards (*), e.g.:
--input 'path/to/data/*/sample_*_{1,2}.fastq.gz'
⚠️ It is not possible to run a mixture of single-end and paired-end
files in one run with the paths --input method! Please see the TSV input
method for possibilities.
Please note the following requirements:
- Valid file extensions: .fastq.gz, .fastq, .fq.gz, .fq, .bam.
- The path must be enclosed in quotes
- The path must have at least one * wildcard character
- When using the pipeline with paired end data, the path must use {1,2}
  notation to specify read pairs.
- File names must be unique; having files with the same name but in different
  directories is not sufficient
  - This can happen when a library has been sequenced across two sequencers on
    the same lane. Either rename the file, try a symlink with a unique name, or
    merge the two FASTQ files prior to input.
- Due to limitations of downstream tools (e.g. FastQC), sample IDs may be
  truncated after the first . in the name. Ensure file names are unique prior
  to this!
- For input BAM files you should provide a small decoy reference genome with
  pre-made indices, e.g. the human mtDNA or phiX genome, for the mandatory
  parameter --fasta in order to avoid long computational time for generating
  the index files of the reference genome, even if you do not actually need a
  reference genome for any downstream analyses.
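Duplicate file names are easy to miss when data sits in several directories, so a quick check before launching may help. The following is a minimal sketch and not part of nf-core/eager; the function name find_duplicate_names and the directory path in the usage comment are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nf-core/eager): print every input file name
# that occurs more than once below a data directory, ignoring the directory
# part of the path. Symlinks are included because they are also valid inputs.
find_duplicate_names() {
  data_dir="$1"
  find "$data_dir" \( -type f -o -type l \) \
    \( -name '*.fastq.gz' -o -name '*.fastq' -o -name '*.fq.gz' \
       -o -name '*.fq' -o -name '*.bam' \) \
    | sed 's#.*/##' \
    | sort \
    | uniq -d
}

# Usage (placeholder path): find_duplicate_names path/to/data
```

An empty result means no two files share a name; any name printed would need renaming, symlinking or merging as described in the list above.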
TSV Input Method
Alternatively to the direct input method, you can supply to --input
a path to a TSV file that contains paths to FASTQ/BAM files and
additional metadata. This allows for more complex procedures such as merging of
sequencing data across lanes, sequencing runs, sequencing configuration types,
and samples.
The use of the TSV --input method is recommended when performing
more complex procedures such as lane or library merging. You do not need to
specify --single_end, --bam, --colour_chemistry, --udg_type etc. when
using TSV input - this is defined within the TSV file itself. You can only
supply a single TSV per run (i.e. --input '*.tsv' will not work).
This TSV should look like the following:
| Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM |
|-------------|------------|------|------------------|--------|----------|--------------|---------------|----|----|-----|
| JK2782 | JK2782 | 1 | 4 | PE | Mammoth | double | full | https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz | https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz | NA |
| JK2802 | JK2802 | 2 | 2 | SE | Mammoth | double | full | https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz | https://github.com/nf-core/test-datasets/raw/eager/testdata/Mammoth/fastq/JK2802_AGAATAACCTACCA_L008_R2_001.fastq.gz.tengrand.fq.gz | NA |
A template can be taken from here.
⚠️ Cells must not contain spaces before or after strings, as this
will make the TSV unreadable by nextflow. Strings containing spaces should be
wrapped in quotes.
When using TSV input, nf-core/eager will merge FASTQ files of libraries with the
same Library_ID but different Lane values after adapter clipping (and
merging), assuming all other metadata columns are the same. If you have the same
Library_ID but with different SeqType, these will be merged directly after
mapping, prior to BAM filtering. Finally, it will also merge BAM files with the
same Sample_Name but different Library_ID after duplicate removal, but prior to
genotyping. Please see caveats to this below.
Column descriptions are as follows:
- Sample_Name: A text string containing the name of a given sample of which
  there can be multiple libraries. All libraries with the same sample name and
  same SeqType will be merged after deduplication.
- Library_ID: A text string containing a given library, of which there can be
  multiple sequencing lanes (with the same SeqType).
- Lane: A number indicating which lane the library was sequenced on. Files
  from the libraries sequenced on different lanes (and different SeqType) will
  be concatenated after read clipping and merging.
- Colour_Chemistry: A number indicating whether the Illumina sequencer the
  library was sequenced on was a 2 (e.g. Next/NovaSeq) or 4 (Hi/MiSeq) colour
  chemistry machine. This informs whether poly-G trimming (if turned on) should
  be performed.
- SeqType: A text string of either 'PE' or 'SE', specifying paired end (with
  both an R1 [or forward] and R2 [or reverse]) and single end data (only R1
  [forward], or BAM). This will affect lane merging if different per library.
- Organism: A text string of the organism name of the sample or 'NA'. This
  currently has no functionality and can be set to 'NA', but will affect
  lane/library merging if different per library.
- Strandedness: A text string indicating whether the library type is
  'single' or 'double'. This will affect lane/library merging if different per
  library.
- UDG_Treatment: A text string indicating whether the library was generated
  with UDG treatment - either 'full', 'half' or 'none'. Will affect lane/library
  merging if different per library.
- R1: A text string of a file path pointing to a forward or R1 FASTQ file.
  This can be used with the R2 column. File names must be unique, even if
  they are in different directories.
- R2: A text string of a file path pointing to a reverse or R2 FASTQ file,
  or 'NA' when single end data. This can be used with the R1 column. File names
  must be unique, even if they are in different directories.
- BAM: A text string of a file path pointing to a BAM file, or 'NA'. Cannot
  be specified at the same time as R1 or R2, both of which should be set to 'NA'.
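A sanity check of the TSV layout before a run can catch the most common formatting slips. The following is a minimal sketch, not an nf-core/eager feature: the function name check_tsv is hypothetical, and it only assumes the 11 columns described above, real tab separators, and no spaces at the start or end of a cell.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nf-core/eager): report rows that do not
# have exactly 11 tab-separated cells, or cells with leading/trailing spaces.
# Returns a non-zero exit status if any problem is found.
check_tsv() {
  awk -F '\t' '
    NF != 11 {
      printf "line %d: expected 11 columns, found %d\n", NR, NF
      bad = 1
    }
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^ | $/) {
          printf "line %d: column %d has a leading or trailing space\n", NR, i
          bad = 1
        }
      }
    }
    END { exit (bad ? 1 : 0) }
  ' "$1"
}

# Usage (placeholder file name): check_tsv samples.tsv && echo "TSV layout OK"
```

A file that was written with spaces instead of tabs is reported as having a single column per row, which is usually the quickest hint that an editor converted the tabs.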
For example, the following TSV table:
| Sample_Name | Library_ID | Lane | Colour_Chemistry | SeqType | Organism | Strandedness | UDG_Treatment | R1 | R2 | BAM |
|-------------|------------|------|------------------|---------|----------|--------------|---------------|----------------------------------------------------------------|----------------------------------------------------------------|-----|
| JK2782 | JK2782 | 7 | 4 | PE | Mammoth | double | full | data/JK2782_TGGCCGATCAACGA_L007_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA |
| JK2782 | JK2782 | 8 | 4 | PE | Mammoth | double | full | data/JK2782_TGGCCGATCAACGA_L008_R1_001.fastq.gz.tengrand.fq.gz | data/JK2782_TGGCCGATCAACGA_L008_R2_001.fastq.gz.tengrand.fq.gz | NA |
| JK2802 | JK2802 | 7 | 4 | PE | Mammoth | double | full | data/JK2802_AGAATAACCTACCA_L007_R1_001.fastq.gz.tengrand.fq.gz | data/JK2802_AGAATAACCTACCA_L007_R2_001.fastq.gz.tengrand.fq.gz | NA |
| JK2802 | JK2802 | 8 | 4 | SE | Mammoth | double | full | data/JK2802_AGAATAACCTACCA_L008_R1_001.fastq.gz.tengrand.fq.gz | NA | NA |
will have the following effects:
- After AdapterRemoval, and prior to mapping, FASTQ files from lane 7 and lane 8
  with the same SeqType (and all other metadata columns) will be
  concatenated together for each Library.
- After mapping, and prior to BAM filtering, BAM files with different
  SeqType (but with all other metadata columns the same) will be merged
  together for each Library.
- After duplicate removal, BAM files with Library_IDs with the same
  Sample_Name and the same UDG_Treatment will be merged together.
- If BAM trimming is turned on, all post-trimming BAMs (i.e. non-UDG and
  half-UDG) will be merged with UDG-treated (untreated) BAMs, if they have the
  same Sample_Name.
Note the following important points and limitations for setting up:
- The TSV must use actual tabs (not spaces) between cells.
- File names must be unique regardless of file path, due to risk of
  over-writing (see:
  https://github.com/nextflow-io/nextflow/issues/470).
  - If it is 'too late' and you already have duplicate file names, a workaround is
    to concatenate the FASTQ files together and supply this to a nf-core/eager
    run. The only downside is that you will not get independent FastQC results
    for each file.
- Lane IDs must be unique for each sequencing of each library.
  - If you have a library sequenced e.g. on Lane 8 of two HiSeq runs, you can
    give a fake lane ID (e.g. 20) for one of the FASTQs, and the libraries will
    still be processed correctly.
  - This also applies to the SeqType column, i.e. with the example above, if one
    run is PE and one run is SE, you need to give fake lane IDs to one of the
    runs as well.
- All BAM files must be specified as SE under SeqType.
  - You should provide a small decoy reference genome with pre-made indices, e.g.
    the human mtDNA or phiX genome, for the mandatory parameter --fasta in
    order to avoid long computational time for generating the index files of the
    reference genome, even if you do not actually need a reference genome for any
    downstream analyses.
- nf-core/eager will only merge multiple lanes of sequencing runs with the
  same single-end or paired-end configuration.
- Accordingly nf-core/eager will not merge lanes of FASTQs with BAM files
  (unless you use --run_convertbam), as only FASTQ files are lane-merged
  together.
- Same libraries that are sequenced on different sequencing configurations (i.e.
  single- and paired-end data) will be merged after mapping and will always
  be considered 'paired-end' during downstream processes.
  - Important: running DeDup in this context is not recommended, as PE and
    SE data at the same position will not be evaluated as duplicates.
    Therefore not all duplicates will be removed.
  - When you wish to run PE/SE data together, --dedupper markduplicates is
    therefore preferred.
  - An error will be thrown if you try to merge both PE and SE and also supply
    --skip_merging.
  - If you truly want to mix SE data and PE data but using mate-pair info for PE
    mapping, please run FASTQ preprocessing and mapping manually and supply BAM
    files for downstream processing by nf-core/eager.
  - If you regularly want to run the situation above, please leave a feature
    request on GitHub.
- DamageProfiler, NuclearContamination, MTtoNucRatio and PreSeq are performed on
  each unique library separately after deduplication (but prior to same-treated
  library merging).
- nf-core/eager functionality such as --run_trim_bam will be applied to only
  non-UDG (UDG_Treatment: none) or half-UDG (UDG_Treatment: half) libraries.
- Qualimap is run on each sample, after merging of libraries (i.e. your values
  will reflect the values of all libraries combined - after being damage trimmed
  etc.).
- Genotyping will typically be performed on each sample independently, as
  normally all libraries will have been merged together. However, if you have a
  mixture of single-stranded and double-stranded libraries, you will normally
  need to genotype separately. In this case you must give the SS and DS
  libraries distinct Sample_Name values; otherwise you will receive a file
  collision error in steps such as sexdeterrmine, and then you will need to
  merge these yourself. We will consider changing this behaviour in the future
  if there is enough interest.
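Since mixing PE and SE data for one library changes which deduplication tool is advisable, it can be useful to list the affected libraries before choosing --dedupper. The following is a minimal sketch under stated assumptions: the function name mixed_seqtype_libraries is hypothetical, the TSV has a header row, and the columns are in the order shown above (Library_ID in column 2, SeqType in column 5).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nf-core/eager): print each Library_ID that
# occurs with more than one SeqType value (e.g. both PE and SE) in the TSV.
mixed_seqtype_libraries() {
  awk -F '\t' '
    NR == 1 { next }                      # skip the header row
    { seen[$2 "\t" $5] = 1 }              # remember each library/SeqType pair
    END {
      for (key in seen) {
        split(key, parts, "\t")
        count[parts[1]]++                 # distinct SeqTypes per library
      }
      for (lib in count) {
        if (count[lib] > 1) print lib
      }
    }
  ' "$1" | sort
}

# Usage (placeholder file name): mixed_seqtype_libraries samples.tsv
```

Run on the four-row example table above, this would print JK2802, the only library with both a PE and an SE row.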
Specifies whether you have UDG treated libraries. Set to 'half' for partial treatment, or 'full' for UDG. If not set, libraries are assumed to have no UDG treatment ('none'). Not required for TSV input.
string
Defines whether Uracil-DNA glycosylase (UDG) treatment was used to remove DNA
damage on the sequencing libraries.
Specify 'none' if no treatment was performed. If you have partial UDG treated
data (Rohland et al. 2016), specify 'half'. If you have complete UDG treated
data (Briggs et al. 2010), specify 'full'.
When also using PMDtools, specifying 'half' will use a different model for DNA
damage assessment in PMDtools (PMDtools: --UDGhalf). Specify 'full' and the
PMDtools DNA damage assessment will use CpG context only (PMDtools: --CpG).
Default: 'none'.
Tip: You should provide a small decoy reference genome with pre-made indices, e.g.
the human mtDNA genome, for the mandatory parameter --fasta in order to
avoid long computational time for generating the index files of the reference
genome, even if you do not actually need a reference genome for any downstream
analyses.
Specifies that libraries are single stranded. Always affects MALTExtract but will be ignored by pileupCaller with TSV input. Not required for TSV input.
boolean
Indicates libraries are single stranded.
Currently only affects MALTExtract, where it will switch on damage patterns
calculation mode to single-stranded (MaltExtract: --singleStranded), and
genotyping with pileupCaller, where a different method is used (pileupCaller:
--singleStrandMode). Default: false
Only required when using the 'Path' method of --input
Specifies that the input is single end reads. Not required for TSV input.
boolean
By default, the pipeline expects paired-end data. If you have single-end data, specify this parameter on the command line when you launch the pipeline. It is not possible to run a mixture of single-end and paired-end files in one run.
Only required when using the 'Path' method of --input
Specifies which Illumina sequencing chemistry was used. Used to inform whether to poly-G trim if turned on (see below). Not required for TSV input. Options: 2, 4.
integer
4
Specifies which Illumina colour chemistry a library was sequenced with. This informs whether to perform poly-G trimming (if --complexity_filter_poly_g is also supplied). Only 2 colour chemistry sequencers (e.g. NextSeq or NovaSeq) can generate uncertain poly-G tails (due to 'G' being indicated via a no-colour detection). Default is '4' to indicate e.g. HiSeq or MiSeq platforms, which do not require poly-G trimming. Options: 2, 4. Default: 4
Only required when using the 'Path' method of --input.
Specifies that the input is in BAM format. Not required for TSV input.
boolean
Specifies that the input file type supplied to --input is BAM format. This will automatically also apply --single_end.
Only required when using the 'Path' method of --input.
Additional options regarding input data.
If the library is the result of SNP capture, path to a BED file containing SNP positions on the reference genome.
string
Can be used to set a path to a BED file (3/6 column format) of SNP positions of a reference genome, to calculate SNP captured libraries on-target efficiency. This should be used for array or in-solution SNP capture protocols such as 390K, 1240K, etc. If supplied, on-target metrics are automatically generated for you by qualimap.
Turns on conversion of an input BAM file into FASTQ format to allow re-preprocessing (e.g. AdapterRemoval etc.).
boolean
Allows you to convert an input BAM file back to FASTQ for downstream processing. Note this is required if you need to perform AdapterRemoval and/or polyG clipping.
If not turned on, BAMs will automatically be sent to post-mapping steps.
Specify locations of references and optionally, additional pre-made indices
Path or URL to a FASTA reference file (required if not iGenome reference). File suffixes can be: '.fa', '.fn', '.fna', '.fasta'.
string
You specify the full path to your reference genome here. The FASTA file can have any file suffix, such as .fasta, .fna, .fa, .FastA etc. You may also supply a gzipped reference file, which will be unzipped automatically for you.
For example:
--fasta '/<path>/<to>/my_reference.fasta'
If you don't specify appropriate --bwa_index, --fasta_index parameters, the pipeline will create these indices for you automatically. Note that you can save the indices created for you for later by giving the --save_reference flag.
You must select either a --fasta or --genome.
Name of iGenomes reference (required if not FASTA reference).
string
Alternatively to --fasta, the pipeline config files come bundled with paths to the Illumina iGenomes reference index files. If running with docker or AWS, the configuration is set up to use the AWS-iGenomes resource.
There are 31 different species supported in the iGenomes references. To run the pipeline, you must specify which to use with the --genome flag.
You can find the keys to specify the genomes in the iGenomes config file. Common genomes that are supported are:
- Human
  - --genome GRCh37
  - --genome GRCh38
- Mouse *
  - --genome GRCm38
- Drosophila *
  - --genome BDGP6
- S. cerevisiae *
  - --genome 'R64-1-1'
* Not bundled with nf-core/eager by default.
Note that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the iGenomes resource. See the Nextflow documentation for instructions on where to save such a file.
The syntax for this reference configuration is as follows:
params {
  genomes {
    'GRCh37' {
      fasta = '<path to the iGenomes genome fasta file>'
    }
    // Any number of additional genomes, key is used with --genome
  }
}
Directory / URL base for iGenomes references.
string
s3://ngi-igenomes/igenomes/
Do not load the iGenomes reference config.
boolean
Do not load igenomes.config when running the pipeline. You may choose this option if you observe clashes between custom parameters and those supplied in igenomes.config.
Path to directory containing pre-made BWA indices (i.e. everything before the endings '.amb' '.ann' '.bwt'. Most likely the same path as --fasta). If not supplied will be made for you.
string
If you want to use pre-existing bwa index indices, please supply the directory to the FASTA you also specified in --fasta. nf-core/eager will automagically detect the index files by searching for the FASTA filename with the corresponding bwa index file suffixes.
For example:
nextflow run nf-core/eager \
-profile test,docker \
--input '*{R1,R2}*.fq.gz' \
--fasta 'results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta' \
--bwa_index 'results/reference_genome/bwa_index/BWAIndex/'
bwa index does not give you an option to supply alternative suffixes/names for these indices. Thus, the file names generated by this command must not be changed, otherwise nf-core/eager will not be able to find them.
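Before pointing --bwa_index at a directory, it may be worth confirming that the index files really sit next to the FASTA under their original names. The following is a minimal sketch, not part of nf-core/eager: the function name check_bwa_index is hypothetical, and the suffix list (.amb, .ann, .bwt, .pac, .sa) reflects the files bwa index normally writes; only the first three are named in this documentation.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of nf-core/eager): check that every standard
# bwa index file exists alongside the FASTA, named <fasta>.<suffix>.
# Prints the missing files to stderr and returns non-zero if any are absent.
check_bwa_index() {
  fasta="$1"
  missing=0
  for suffix in amb ann bwt pac sa; do
    if [ ! -e "${fasta}.${suffix}" ]; then
      echo "missing: ${fasta}.${suffix}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (placeholder path):
# check_bwa_index results/reference_genome/bwa_index/BWAIndex/Mammoth_MT_Krause.fasta
```

If the check passes, the directory containing the FASTA is the value to give to --bwa_index.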
Path to directory containing pre-made Bowtie2 indices (i.e. everything before the endings e.g. '.1.bt2', '.2.bt2', '.rev.1.bt2'. Most likely the same value as --fasta). If not supplied will be made for you.
string
If you want to use pre-existing bt2 index indices, please supply the directory to the FASTA you also specified in --fasta. nf-core/eager will automagically detect the index files by searching for the FASTA filename with the corresponding bt2 index file suffixes.
For example:
nextflow run nf-core/eager \
-profile test,docker \
--input '*{R1,R2}*.fq.gz' \
--fasta 'results/reference_genome/bt2_index/BT2Index/Mammoth_MT_Krause.fasta' \
--bt2_index 'results/reference_genome/bt2_index/BT2Index/'
bowtie2-build does not give you an option to supply alternative suffixes/names for these indices. Thus, the file names generated by this command must not be changed, otherwise nf-core/eager will not be able to find them.
Path to samtools FASTA index (typically ending in '.fai'). If not supplied will be made for you.
string
If you want to use a pre-existing samtools faidx index, use this to specify the required FASTA index file for the selected reference genome. This should be generated by samtools faidx and has a file suffix of .fai
For example:
--fasta_index 'Mammoth_MT_Krause.fasta.fai'
Path to picard sequence dictionary file (typically ending in '.dict'). If not supplied will be made for you.
string
If you want to use a pre-existing picard CreateSequenceDictionary dictionary file, use this to specify the required .dict file for the selected reference genome.
For example:
--seq_dict 'Mammoth_MT_Krause.dict'
Specify to generate more recent '.csi' BAM indices. If your reference genome is larger than 3.5GB, this is recommended due to more efficient data handling with the '.csi' format over the older '.bai'.
boolean
This parameter is required to be set for large reference genomes. If your
reference genome is larger than 3.5GB, the samtools index calls in the
pipeline need to generate CSI indices instead of BAI indices to compensate
for the size of the reference genome (with samtools: -c). This parameter is
not required for smaller references (including the human hg19 or
grch37/grch38 references), but >4GB genomes have been shown to need CSI
indices. Default: off
If not already supplied by user, turns on saving of generated reference genome indices for later re-usage.
boolean
Use this if you do not have pre-made reference FASTA indices for bwa, samtools and picard. If you turn this on, the indices nf-core/eager generates for you will be saved in <your_output_dir>/results/reference_genomes. If not supplied, nf-core/eager generated index references will be deleted.
Modifies SAMtools index command: -c
Specify where to put output files and optional saving of intermediate files
The output directory where the results will be saved.
string
./results
The output directory where the results will be saved. By default will be made in the directory you run the command in under ./results.
Mode for publishing results in the output directory. Options: 'symlink', 'rellink', 'link', 'copy', 'copyNoFollow', 'move'.
string
copy
Nextflow mode for 'publishing' final results files i.e. how to move final files into your --outdir from working directories. Options: 'symlink', 'rellink', 'link', 'copy', 'copyNoFollow', 'move'. Default: 'copy'.
It is recommended to select copy (default) if you plan to regularly delete intermediate files from work/.
Turn this on if you want to keep trimmed reads.
boolean
true
Turn this on if you want to keep intermediate alignment files (SAM, BAM, non-dedupped BAM)
boolean
Less common options for the pipeline, typically set in a config file.
Display help text.
boolean
Workflow name of run, for future reference.
string
A custom name for the pipeline run. Unlike the core nextflow -name option with one hyphen, this parameter can be reused multiple times, for example if using -resume. Passed through to steps such as MultiQC and used for things like report filenames and titles.
Email address for completion summary.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
An email address to send a summary email to when the pipeline is completed.
Email address for completion summary, only when pipeline fails.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Set this parameter to your e-mail address to get a summary e-mail with details of the run if it fails. Normally would be the same as in --email but can be different. If set in your user config file (~/.nextflow/config) then you don't need to specify this on the command line for every run.
Note that this functionality requires either mail or sendmail to be installed on your system.
Send plain-text email instead of HTML.
boolean
File size limit when attaching MultiQC reports to summary emails.
string
25.MB
Do not use coloured log outputs.
boolean
Custom config file to supply to MultiQC.
string
Directory to keep pipeline Nextflow logs and reports.
string
${params.outdir}/pipeline_info
Set the top limit for requested resources for any single job.
Maximum number of CPUs that can be requested for any single job.
integer
16
Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1
Maximum amount of memory that can be requested for any single job.
string
128.GB
Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB'
Maximum amount of time that can be requested for any single job.
string
240.h
Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h'
Parameters used to describe centralised config profiles. These generally should not be edited.
Git commit id for Institutional configs.
string
master
Base directory for Institutional configs.
string
https://raw.githubusercontent.com/nf-core/configs/master
If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.
Institutional configs hostname.
string
Institutional config description.
string
Institutional config contact information.
string
Institutional config URL link.
string
The AWSBatch JobQueue that needs to be set when running on AWSBatch
string
The AWS Region for your AWS Batch job to run on
string
eu-west-1
Path to the AWS CLI tool
string
Skip any of the mentioned steps.
boolean
Turns off FastQC pre- and post-Adapter Removal, to speed up the pipeline. Use of this flag is most common when data has been previously pre-processed and the post-Adapter Removal mapped reads are being re-mapped to a new reference genome.
boolean
Turns off adapter trimming and paired-end read merging. Equivalent to setting both --skip_collapse and --skip_trim.
boolean
Turns off the computation of library complexity estimation.
boolean
Turns off duplicate removal methods DeDup and MarkDuplicates respectively. No duplicates will be removed on any data in the pipeline.
boolean
Turns off the DamageProfiler module to compute DNA damage profiles.
boolean
Turns off QualiMap and thus does not compute coverage and other mapping metrics.
Processing of Illumina two-colour chemistry data.
Turn on running poly-G removal on FASTQ files. Will only be performed on 2 colour chemistry machine sequenced libraries.
boolean
Performs a poly-G tail removal step in the beginning of the pipeline using fastp, if turned on. This can be useful for trimming poly-G tails from short-fragments sequenced on two-colour Illumina chemistry such as NextSeqs (where no-fluorescence is read as a G on two-colour chemistry), which can inflate reported GC content values.
Specify length of poly-g min for clipping to be performed.
integer
10
This option can be used to define the minimum length of a poly-G tail to begin low complexity trimming. By default, this is set to a value of 10 unless the user has chosen something specifically using this option.
Modifies fastp parameter: --poly_g_min_len
Options for adapter clipping and paired-end merging.
Specify adapter sequence to be clipped off (forward strand).
string
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
Defines the adapter sequence to be used for the forward read. By default, this is set to 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'.
Modifies AdapterRemoval parameter: --adapter1
Specify adapter sequence to be clipped off (reverse strand).
string
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
Defines the adapter sequence to be used for the reverse read in paired end sequencing projects. This is set to 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA' by default.
Modifies AdapterRemoval parameter: --adapter2
Specify read minimum length to be kept for downstream analysis.
integer
30
Defines the minimum read length that is required for reads after merging to be considered for downstream analysis. Default is 30.
Note that performing read length filtering at this step is not reliable for correct endogenous DNA calculation when you have a large percentage of very short reads in your library, such as those retrieved in single-stranded library protocols. When you have very few reads passing this length filter, it will artificially inflate your endogenous DNA by creating a very small denominator. In these cases it is recommended to set this to 0, and use --bam_filter_minreadlength instead, to filter out 'un-usable' short reads after mapping.
Modifies AdapterRemoval parameter: --minlength
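The effect of the small denominator described above can be seen with a little arithmetic. The following is a minimal sketch with entirely hypothetical read counts; it assumes endogenous DNA is reported as mapped reads divided by the reads that enter mapping, and that the same 500 reads map in both scenarios. The function name endogenous_pct is an illustration, not an nf-core/eager command.

```shell
#!/usr/bin/env bash
# Hypothetical read counts, chosen only to illustrate the effect.
raw_reads=1000000      # reads in the library before any length filtering
passing_reads=10000    # reads that survive a 30 bp length filter at clipping
mapped_reads=500       # reads that map to the reference

# Percentage of mapped reads relative to a chosen denominator.
# $1 = mapped reads, $2 = reads used as the denominator
endogenous_pct() {
  awk -v m="$1" -v d="$2" 'BEGIN { printf "%.2f\n", 100 * m / d }'
}

# Length filter applied before mapping: small denominator, inflated value.
endogenous_pct "$mapped_reads" "$passing_reads"   # prints 5.00

# Length filter set to 0 before mapping: all reads form the denominator.
endogenous_pct "$mapped_reads" "$raw_reads"       # prints 0.05
```

With these made-up numbers the same library would be reported at 5.00% or 0.05% endogenous DNA depending only on where the short reads are removed.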
Specify minimum base quality for trimming off bases.
integer
20
Defines the minimum read quality per base that is required for a base to be kept. Individual bases at the ends of reads falling below this threshold will be clipped off. Default is set to 20.
Modifies AdapterRemoval parameter: --minquality
Specify minimum adapter overlap required for clipping.
integer
1
Sets the minimum overlap between two reads when read merging is performed. Default is set to 1 base overlap.
Modifies AdapterRemoval parameter: --minadapteroverlap
Skip merging of forward and reverse reads. Only applicable for paired-end libraries.
boolean
Turns off the paired-end read merging.
For example:
--skip_collapse --input '*_{R1,R2}_*.fastq'
It is important to use the paired-end wildcard globbing as --skip_collapse can only be used on paired-end data!
⚠️ If you run this with --clip_readlength also set to something (as is by default), you may end up removing single reads from either the pair1 or pair2 file. These will NOT be mapped when aligning with either bwa or bowtie2, as both can only accept one (forward) or two (forward and reverse) FASTQs as input.
Modifies AdapterRemoval parameter: --collapse
Skip adapter and quality trimming.
boolean
Turns off adapter AND quality trimming.
For example:
--skip_trim --input '*.fastq'
⚠️ It is not possible to keep quality trimming (n or base quality) on
and skip adapter trimming.
⚠️ It is not possible to turn off only one of quality
trimming or n trimming, i.e. --trimns --trimqualities are both given
or neither. However, setting quality in --clip_min_read_quality to 0 would
theoretically turn off base quality trimming.
Modifies AdapterRemoval parameters: --trimns --trimqualities --adapter1 --adapter2
Skip quality base trimming (n, score, window) of 5 prime end.
boolean
Turns off quality based trimming at the 5p end of reads when any of the --trimns, --trimqualities, or --trimwindows options are used. Only 3p end of reads will be removed.
This also entirely disables quality based trimming of collapsed reads, since both ends of these are informative for PCR duplicate filtering. Described here.
Modifies AdapterRemoval parameters:
--preserve5p
Only use merged reads downstream (un-merged reads and singletons are discarded).
boolean
Specify that only merged reads are sent downstream for analysis.
Singletons (i.e. reads missing a pair), or un-merged reads (where there wasn't sufficient overlap) are discarded.
You may want to use this to ensure that only the best quality reads are used in your analysis, at the cost of potentially losing still-valid data (even if some reads have slightly lower quality). It is highly recommended when using --dedupper 'dedup' (see below).
Options for reference-genome mapping
Specify which mapper to use. Options: 'bwaaln', 'bwamem', 'circularmapper', 'bowtie2'.
string
Specify which mapping tool to use. Options are BWA aln ('bwaaln'), BWA mem ('bwamem'), CircularMapper ('circularmapper'), or Bowtie2 ('bowtie2'). BWA aln is the default and highly suited for short-read ancient DNA. BWA mem can be quite useful for modern DNA, but is rarely used in projects for ancient DNA. CircularMapper enhances the mapping procedure to circular references, using the BWA algorithm but utilizing an extend-remap procedure (see Peltzer et al 2016, Genome Biology for details). Bowtie2 is similar to BWA aln, and has recently been suggested to provide slightly better results under certain conditions (Poullet and Orlando 2020), as well as providing extra functionality (such as FASTQ trimming). Default is 'bwaaln'.
More documentation can be seen for each tool under:
Specify the -n parameter for BWA aln, i.e. amount of allowed mismatches in the alignment.
number
0.04
Configures the bwa aln -n parameter, defining how many mismatches are allowed in a read. By default set to 0.04 (following recommendations of Schubert et al. (2012 BMC Genomics)). If you're uncertain what to set, check out this Shiny App for more information on how to set this parameter efficiently.
Modifies bwa aln parameter:
-n
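When -n is given as a fraction, the BWA manual describes it as the fraction of missing alignments under a 2% uniform base error rate, so the number of mismatches actually allowed depends on read length. The sketch below illustrates that relationship with a Poisson model. It reflects my understanding of the calculation rather than BWA's verified source code; the function name and the 0.02 error rate argument are assumptions made for illustration.

```python
import math


def bwa_max_diff(read_length: int, missing_fraction: float = 0.04,
                 base_error: float = 0.02) -> int:
    """Smallest k such that P(more than k errors in a read) < missing_fraction.

    Assumes errors per read follow a Poisson distribution with mean
    read_length * base_error (an illustrative model, see lead-in).
    """
    expected_errors = read_length * base_error
    term = math.exp(-expected_errors)  # probability of exactly 0 errors
    cumulative = term
    for k in range(1, 1000):
        term *= expected_errors / k  # probability of exactly k errors
        cumulative += term
        if 1.0 - cumulative < missing_fraction:
            return k
    return 2  # fallback, not reached for realistic read lengths


if __name__ == "__main__":
    for length in (17, 38, 64, 100):
        print(length, bwa_max_diff(length))
```

Under this model, short ancient DNA reads are allowed only a few mismatches, and lowering the value passed to -n makes mapping more permissive.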
Specify the -k parameter for BWA aln, i.e. maximum edit distance allowed in a seed.
integer
2
Configures the bwa aln -k
parameter for the seeding phase in the mapping algorithm. Default is set to 2
.
Modifies BWA aln parameter:
-k
Specify the -l parameter for BWA aln i.e. the length of seeds to be used.
integer
1024
Configures the length of the seed used in bwa aln -l. The default of 1024 effectively turns seeding off, as recommended for ancient DNA by Schubert et al. (2012 BMC Genomics).
Note: Despite being recommended, turning off seeding can result in long runtimes!
Modifies BWA aln parameter:
-l
Specify the number of bases to extend reference by (circularmapper only).
integer
500
The number of bases to extend the reference genome with. By default this is set to 500
if not specified otherwise.
Modifies circulargenerator and realignsamfile parameter:
-e
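The idea behind extending a circular reference can be sketched as follows: reads spanning the artificial start/end junction can only align if some bases from the start are appended to the end, and positions beyond the original length are then mapped back. This is a minimal illustration under the assumption that the extension is taken from the start of the sequence; it is not CircularMapper's code and the function names are hypothetical.

```python
def extend_circular_reference(sequence: str, extension: int = 500) -> str:
    """Append the first `extension` bases of the sequence to its end."""
    if extension < 0:
        raise ValueError("extension must be non-negative")
    return sequence + sequence[:extension]


def to_original_coordinate(position: int, original_length: int) -> int:
    """Map a 0-based position on the extended reference back onto the original."""
    return position % original_length


if __name__ == "__main__":
    reference = "ACGTTG"
    extended = extend_circular_reference(reference, extension=2)
    print(extended)
    # Position 7 on the extended sequence corresponds to position 1 on the original.
    print(to_original_coordinate(7, len(reference)))
```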
Specify the FASTA header of the target chromosome to extend (circularmapper only).
string
MT
The chromosome in your FASTA reference that you'd like to be treated as circular. By default this is set to MT
but can be configured to match any other chromosome.
Modifies circulargenerator parameter:
-s
Turn on to filter off-target reads (circularmapper only).
boolean
If you want to filter out reads that don't map to a circular chromosome, turn this on. By default this option is turned off.
Specify the bowtie2 alignment mode. Options: 'local', 'end-to-end'.
string
The type of read alignment to use. Options are 'local' or 'end-to-end'. Local allows partial alignment of a read, with the ends of reads possibly 'soft-clipped' (i.e. remaining unaligned/ignored) if the soft-clipped alignment provides the best alignment score. End-to-end requires all nucleotides to be aligned. Default is 'local', following Cahill et al (2018) and Poullet and Orlando 2020.
Modifies Bowtie2 parameters:
--very-fast --fast --sensitive --very-sensitive --very-fast-local --fast-local --sensitive-local --very-sensitive-local
Specify the level of sensitivity for the bowtie2 alignment mode. Options: 'no-preset', 'very-fast', 'fast', 'sensitive', 'very-sensitive'.
string
The Bowtie2 'preset' to use. Options: 'no-preset', 'very-fast', 'fast', 'sensitive', or 'very-sensitive'. These strings apply to both --bt2_alignmode options. See the Bowtie2 manual for actual settings. Default is 'sensitive' (following Poullet and Orlando (2020), when running damaged-data without UDG treatment).
Modifies Bowtie2 parameters:
--very-fast --fast --sensitive --very-sensitive --very-fast-local --fast-local --sensitive-local --very-sensitive-local
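The way the alignment mode and sensitivity combine into one of the Bowtie2 preset flags listed above can be sketched like this. How the pipeline builds its command internally is an assumption here; the function name is hypothetical and only the flag names come from the list above.

```python
from typing import Optional

VALID_MODES = {"local", "end-to-end"}
VALID_PRESETS = {"no-preset", "very-fast", "fast", "sensitive", "very-sensitive"}


def bowtie2_preset_flag(alignmode: str, sensitivity: str) -> Optional[str]:
    """Return the Bowtie2 preset flag for a mode/sensitivity pair, or None for 'no-preset'."""
    if alignmode not in VALID_MODES:
        raise ValueError(f"unknown alignment mode: {alignmode!r}")
    if sensitivity not in VALID_PRESETS:
        raise ValueError(f"unknown sensitivity: {sensitivity!r}")
    if sensitivity == "no-preset":
        return None
    # Local-mode presets carry a '-local' suffix; end-to-end presets do not.
    suffix = "-local" if alignmode == "local" else ""
    return f"--{sensitivity}{suffix}"


if __name__ == "__main__":
    print(bowtie2_preset_flag("local", "sensitive"))
    print(bowtie2_preset_flag("end-to-end", "very-fast"))
```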
Specify the -N parameter for bowtie2 (mismatches in seed). This will override defaults from alignmode/sensitivity.
integer
0
The number of mismatches allowed in the seed during the seed-and-extend procedure of Bowtie2. This will override any values set with --bt2_sensitivity. Can either be 0 or 1. Default: 0 (i.e. use --bt2_sensitivity defaults).
Modifies Bowtie2 parameters:
-N
Specify the -L parameter for bowtie2 (length of seed substrings). This will override defaults from alignmode/sensitivity.
integer
0
The length of the seed sub-string to use during seeding. This will override any values set with --bt2_sensitivity. Default: 0 (i.e. use --bt2_sensitivity defaults: 20 for local and 22 for end-to-end).
Modifies Bowtie2 parameters:
-L
Specify number of bases to trim off from 5' (left) end of read before alignment.
integer
0
Number of bases to trim at the 5' (left) end of reads prior to alignment. May be useful when left-over sequencing artefacts of in-line barcodes are present. Default: 0
Modifies Bowtie2 parameters:
--trim5
Specify number of bases to trim off from 3' (right) end of read before alignment.
integer
0
Number of bases to trim at the 3' (right) end of reads prior to alignment. May be useful when left-over sequencing artefacts of in-line barcodes are present. Default: 0.
Modifies Bowtie2 parameters:
--trim3
Options for production of host-read removed FASTQ files for privacy reasons.
Turn on per-library creation of pre-AdapterRemoval FASTQ files without reads that mapped to reference (e.g. for public upload of privacy sensitive non-host data).
boolean
Create pre-Adapter Removal FASTQ files without reads that mapped to reference (e.g. for public upload of privacy sensitive non-host data)
Host removal mode. Remove mapped reads completely from FASTQ (remove) or just mask mapped reads sequence by N (replace).
string
Read removal mode. Remove mapped reads completely ('remove') or just replace the sequence of mapped reads with N ('replace').
Modifies extract_map_reads.py parameter:
-m
Options for quality filtering and how to deal with off-target unmapped reads.
Turn on filtering of mapping quality, read lengths, or unmapped reads of BAM files.
boolean
Turns on the bam filtering module for either mapping quality filtering or unmapped read treatment.
Minimum mapping quality for reads filter.
integer
0
Specify a mapping quality threshold for mapped reads to be kept for downstream analysis. By default keeps all reads and is therefore set to 0
(basically doesn't filter anything).
Modifies samtools view parameter:
-q
Specify minimum read length to be kept after mapping.
integer
0
Specify minimum length of mapped reads. This filtering will apply at the same time as mapping quality filtering.
If used instead of minimum length read filtering at AdapterRemoval, this can be useful to get more realistic endogenous DNA percentages, when most of your reads are very short (e.g. in single-stranded libraries) and would otherwise be discarded by AdapterRemoval (thus making an artificially small denominator for a typical endogenous DNA calculation). Note in this context you should not perform mapping quality filtering nor discarding of unmapped reads to ensure a correct denominator of all reads, for the endogenous DNA calculation.
Modifies filter_bam_fragment_length.py parameter:
-l
Defines whether to discard all unmapped reads, or to keep them in BAM format only, FASTQ format only, or both. Options: 'discard', 'bam', 'fastq', 'both'.
string
Defines how to proceed with unmapped reads: 'discard' removes all unmapped reads, 'keep' keeps both unmapped and mapped reads in the same BAM file, 'bam' keeps unmapped reads as a BAM file, 'fastq' keeps unmapped reads as a FASTQ file, and 'both' keeps both BAM and FASTQ files. Default is 'discard'. 'keep' is what would happen if --run_bam_filtering was not supplied.
Note that in all cases, if --bam_mapping_quality_threshold
is also supplied, mapping quality filtering will still occur on the mapped reads.
Modifies samtools view parameter:
-f4 -F4
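The -f4 and -F4 filters select on bit 0x4 of the SAM FLAG field, which marks a read as unmapped: -f 4 keeps only reads with the bit set, -F 4 keeps only reads without it. A minimal sketch of that bit test follows; the function names are hypothetical and this is not how the pipeline itself invokes samtools.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def is_unmapped(flag: int) -> bool:
    """True if the unmapped bit is set in a SAM FLAG value."""
    return bool(flag & UNMAPPED_FLAG)


def split_by_mapping_status(flags: list[int]) -> tuple[list[int], list[int]]:
    """Return (mapped, unmapped) FLAG values, mirroring `-F 4` and `-f 4`."""
    mapped = [flag for flag in flags if not is_unmapped(flag)]
    unmapped = [flag for flag in flags if is_unmapped(flag)]
    return mapped, unmapped


if __name__ == "__main__":
    # 20 = 16 (reverse strand) + 4 (unmapped); 77 = 64 + 8 + 4 + 1
    print(split_by_mapping_status([0, 4, 16, 20, 77]))
```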
Options for removal of PCR amplicon duplicates that can artificially inflate coverage.
Deduplication method to use. Options: 'markduplicates', 'dedup'.
string
Sets the duplicate read removal tool. By default uses markduplicates
from Picard. Alternatively an ancient DNA specific read deduplication tool dedup
(Peltzer et al. 2016) is offered.
This utilises both ends of paired-end data to remove duplicates (i.e. true exact duplicates, as markduplicates will over-zealously deduplicate anything with the same starting position even if the ends are different). DeDup should only be used on paired-end data, as suboptimal deduplication can occur if it is applied to either single-end data or a mix of single-end and paired-end data.
Note that if you run without the --mergedonly
flag for AdapterRemoval, DeDup will
likely fail. If you absolutely want to use both PE and SE data, you can supply the
--dedup_all_merged
flag to consider singletons to also be merged paired-end reads. This
may result in over-zealous deduplication.
Turn on treating all reads as merged reads.
boolean
Sets DeDup to treat all reads as merged reads. This is useful if reads are for example not prefixed with M_
in all cases. Therefore, this can be used as a workaround when also using a mixture of paired-end and single-end data, however this is not recommended (see above).
Modifies dedup parameter:
-m
Options for calculating library complexity (i.e. how many unique reads are present).
Specify the step size of Preseq.
integer
1000
Can be used to configure the step size of Preseq's c_curve method. Adjusting this can be useful when only shallow sequencing data, and thus few reads, are available for extrapolation.
Modifies preseq c_curve parameter:
-s
Options for calculating and filtering for characteristic ancient DNA damage patterns.
Specify length filter for DamageProfiler.
integer
100
Specifies the length filter for DamageProfiler. By default set to 100
.
Modifies DamageProfiler parameter:
-l
Specify number of bases of each read to consider for DamageProfiler calculations.
integer
15
Specifies the length of the read start and end to be considered for profile generation in DamageProfiler. By default set to 15
bases.
Modifies DamageProfiler parameter:
-t
Specify the maximum misincorporation frequency that should be displayed on damage plot. Set to 0 to 'autoscale'.
number
0.3
Specifies what the maximum misincorporation frequency should be displayed as, in the DamageProfiler damage plot. This is set to 0.30
(i.e. 30%) by default as this matches the popular mapDamage2.0 program. However, the default behaviour of DamageProfiler is to 'autoscale' the y-axis maximum to zoom in on any possible damage that may occur (e.g. if the damage is about 10%, the highest value on the y-axis would be set to 0.12). This 'autoscale' behaviour can be turned on by specifying the number to 0
. Default: 0.30
.
Modifies DamageProfiler parameter:
-yaxis_damageplot
Turn on PMDtools
boolean
Specifies to run PMDTools for damage based read filtering and assessment of DNA damage in sequencing libraries. By default turned off.
Specify range of bases for PMDTools to scan for damage.
integer
10
Specifies the range in which to consider DNA damage from the ends of reads. By default set to 10
.
Modifies PMDTools parameter:
--range
Specify PMDScore threshold for PMDTools.
integer
3
Specifies the PMDScore threshold to use in the pipeline when filtering BAM files for DNA damage. Only reads which surpass this damage score are considered for downstream DNA analysis. By default set to 3
if not set specifically by the user.
Modifies PMDTools parameter:
--threshold
Specify a path to reference mask for PMDTools.
string
Can be used to set a path to a reference genome mask for PMDTools.
Specify the maximum number of reads to consider for metrics generation.
integer
10000
The maximum number of reads used for damage assessment in PMDTools. Can be used to significantly reduce the amount of time required for damage assessment. Note that too low a value can produce incorrect results.
Modifies PMDTools parameter:
-n
Options for getting reference annotation statistics (e.g. gene coverages)
Turn on ability to calculate no. reads, depth and breadth coverage of features in reference.
boolean
Specifies to turn on the bedtools module, producing statistics for breadth (or percent coverage), and depth (or X fold) coverages.
Path to GFF or BED file containing positions of features in reference file (--fasta). Path should be enclosed in quotes.
string
Specify the path to a GFF/BED containing the feature coordinates (or any acceptable input for bedtools coverage
). Must be in quotes.
Options for trimming of aligned reads (e.g. to remove damage prior to genotyping).
Turn on BAM trimming. Will only run on non-UDG or half-UDG libraries
boolean
Turns on the BAM trimming method. Trims off [n]
bases from reads in the deduplicated BAM file. Damage assessment in PMDTools or DamageProfiler remains untouched, as data is routed through this independently. BAM trimming is typically performed to reduce errors during genotyping that can be caused by aDNA damage.
BAM trimming will only be performed on libraries indicated as --udg_type 'none' or --udg_type 'half'. Complete UDG treatment ('full') should have removed all damage. The number of bases that will be trimmed off can be set separately for libraries with --udg_type 'none' and 'half' (see --bamutils_clip_half_udg_left / --bamutils_clip_half_udg_right / --bamutils_clip_none_udg_left / --bamutils_clip_none_udg_right).
Note: additional artefacts such as barcodes or adapters that could potentially also be trimmed should be removed prior to mapping.
Specify the number of bases to clip off reads from 'left' end of read for half-UDG libraries.
integer
1
By default set to 1, which clips one base off the left end of reads from libraries whose UDG treatment is set to half. Note that reverse reads will automatically be clipped at the opposite side (left and right are automatically swapped for the reverse read).
Modifies bamUtil's trimBam parameter:
-L -R
Specify the number of bases to clip off reads from 'right' end of read for half-UDG libraries.
integer
1
By default set to 1, which clips one base off the right end of reads from libraries whose UDG treatment is set to half. Note that reverse reads will automatically be clipped at the opposite side (left and right are automatically swapped for the reverse read).
Modifies bamUtil's trimBam parameter:
-L -R
Specify the number of bases to clip off reads from 'left' end of read for non-UDG libraries.
integer
1
By default set to 1, which clips one base off the left end of reads from libraries whose UDG treatment is set to none. Note that reverse reads will automatically be clipped at the opposite side (left and right are automatically swapped for the reverse read).
Modifies bamUtil's trimBam parameter:
-L -R
Specify the number of bases to clip off reads from 'right' end of read for non-UDG libraries.
integer
1
By default set to 1, which clips one base off the right end of reads from libraries whose UDG treatment is set to none. Note that reverse reads will automatically be clipped at the opposite side (left and right are automatically swapped for the reverse read).
Modifies bamUtil's trimBam parameter:
-L -R
Turn on using softclip instead of hard masking.
boolean
By default, nf-core/eager uses hard clipping and sets clipped bases to N with quality ! in the BAM output. Turn this on to use soft-clipping instead, which masks the read ends using the CIGAR string.
Modifies bamUtil's trimBam parameter:
-c
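The default hard-masking behaviour described above (clipped bases set to N with quality !) can be sketched as follows. This is a simplified illustration, not bamUtil's implementation: it ignores strand handling and the CIGAR string, and the function name is hypothetical.

```python
def hard_mask_ends(sequence: str, qualities: str,
                   left: int = 1, right: int = 1) -> tuple[str, str]:
    """Replace `left`/`right` terminal bases with N and their qualities with '!'."""
    length = len(sequence)
    # Clamp the clip sizes so they never exceed the read length.
    left = min(max(left, 0), length)
    right = min(max(right, 0), length - left)
    kept = slice(left, length - right)
    masked_sequence = "N" * left + sequence[kept] + "N" * right
    masked_qualities = "!" * left + qualities[kept] + "!" * right
    return masked_sequence, masked_qualities


if __name__ == "__main__":
    print(hard_mask_ends("ACGTAC", "IIIIII", left=1, right=2))
```

The read length is preserved in this sketch; only the terminal bases, where damage-derived misincorporations concentrate, become uninformative for genotyping.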
Options for variant calling.
Turn on genotyping of BAM files.
boolean
Turns on genotyping to run on all post-dedup and downstream BAMs. For example if --run_pmdtools
and --trim_bam
are both supplied, the genotyper will be run on all three BAM files i.e. post-deduplication, post-pmd and post-trimmed BAM files.
Specify which genotyper to use: GATK UnifiedGenotyper, GATK HaplotypeCaller, FreeBayes, pileupCaller, or ANGSD. Note: UnifiedGenotyper requires a user-supplied GATK 3.5 jar file. Options: 'ug', 'hc', 'freebayes', 'pileupcaller', 'angsd'.
string
Specifies which genotyper to use. Current options are: GATK (v3.5) UnifiedGenotyper, GATK (v4) HaplotypeCaller, FreeBayes, pileupCaller, and ANGSD. Specify 'ug', 'hc', 'freebayes', 'pileupcaller' and 'angsd' respectively.
Note that while UnifiedGenotyper is more suitable for low-coverage ancient DNA (HaplotypeCaller does de novo assembly around each variant site), it is officially deprecated by the Broad Institute and is only accessible via an archived version that is not properly available on conda. Therefore, if specifying 'ug', you will need to supply a GATK 3.5 .jar to the parameter --gatk_ug_jar. Note that this means the pipeline is not fully reproducible in this configuration, unless you personally supply the .jar file.
Specify which input BAM to use for genotyping. Options: 'raw', 'trimmed' or 'pmd'.
string
raw
Indicates which BAM file to use for genotyping, depending on what BAM processing modules you have turned on. Options are: 'raw'
for mapped only, filtered, or DeDup BAMs (with priority right to left); 'trimmed'
(for base clipped BAMs); 'pmd'
(for pmdtools output). Default is: 'raw'
.
When specifying to use GATK UnifiedGenotyper, path to GATK 3.5 .jar.
string
Specify a path to a local copy of a GATK 3.5 .jar file, preferably version '3.5-0-g36282e4'. The download location of this may be available from the GATK forums or the Google Cloud Storage of the Broad Institute.
Specify GATK phred-scaled confidence threshold.
integer
30
If selected, specify a GATK genotyper phred-scaled confidence threshold of a given SNP/INDEL call. Default: 30
Modifies GATK UnifiedGenotyper or HaplotypeCaller parameter:
-stand_call_conf
Specify GATK organism ploidy.
integer
2
If selected, specify a GATK genotyper ploidy value of your reference organism. E.g. if you want to allow heterozygous calls from >= diploid organisms. Default: 2
Modifies GATK UnifiedGenotyper or HaplotypeCaller parameter:
--sample-ploidy
Maximum depth coverage allowed for genotyping before down-sampling is turned on.
integer
250
Maximum depth coverage allowed for genotyping before down-sampling is turned on. Any position with a coverage higher than this value will be randomly down-sampled to 250 reads. Default: 250
Modifies GATK UnifiedGenotyper parameter:
-dcov
Specify VCF file for SNP annotation of output VCF files. Optional. Gzip not accepted.
string
(Optional) Specify VCF file for output VCF SNP annotation e.g. if you want to annotate your VCF file with 'rs' SNP IDs. Check GATK documentation for more information. Gzip not accepted.
Specify GATK output mode. Options: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_ACTIVE_SITES'.
string
If the GATK genotyper HaplotypeCaller is selected, what type of VCF to create, i.e. produce calls for every site or just confident sites. Options: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_ACTIVE_SITES'. Default: 'EMIT_VARIANTS_ONLY'
Modifies GATK HaplotypeCaller parameter:
-output_mode
Specify HaplotypeCaller mode for emitting reference confidence calls. Options: 'NONE', 'BP_RESOLUTION', 'GVCF'.
string
If the GATK HaplotypeCaller is selected, mode for emitting reference confidence calls. Options: 'NONE'
, 'BP_RESOLUTION'
, 'GVCF'
. Default: 'GVCF'
Modifies GATK HaplotypeCaller parameter:
--emit-ref-confidence
Specify GATK output mode. Options: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_SITES'.
string
If the GATK UnifiedGenotyper is selected, what type of VCF to create, i.e. produce calls for every site or just confident sites. Options: 'EMIT_VARIANTS_ONLY', 'EMIT_ALL_CONFIDENT_SITES', 'EMIT_ALL_SITES'. Default: 'EMIT_VARIANTS_ONLY'
Modifies GATK UnifiedGenotyper parameter:
--output_mode
Specify UnifiedGenotyper likelihood model. Options: 'SNP', 'INDEL', 'BOTH', 'GENERALPLOIDYSNP', 'GENERALPLOIDYINDEL'.
string
If the GATK UnifiedGenotyper is selected, which likelihood model to follow, i.e. whether to call SNPs, INDELs, or both. Options: 'SNP', 'INDEL', 'BOTH', 'GENERALPLOIDYSNP', 'GENERALPLOIDYINDEL'. Default: 'SNP'
Modifies GATK UnifiedGenotyper parameter:
--genotype_likelihoods_model
Specify to keep the BAM output of re-alignment around variants from GATK UnifiedGenotyper.
string
If provided when running GATK's UnifiedGenotyper, this will put into the output folder the BAMs that have realigned reads (with GATK's (v3) IndelRealigner) around possible variants for improved genotyping.
These BAMs will be stored in the same folder as the corresponding VCF files.
Supply a default base quality if a read is missing a base quality score. Setting to -1 turns this off.
string
When running GATK's UnifiedGenotyper, specify a value to set base quality scores, if reads are missing this information. Might be useful if you have 'synthetically' generated reads (e.g. chopping up a reference genome). Default is set to -1 which is to not set any default quality (turned off). Default: -1
Modifies GATK UnifiedGenotyper parameter:
--defaultBaseQualities
Specify minimum required supporting observations to consider a variant.
integer
1
Specify minimum required supporting observations to consider a variant. Default: 1
Modifies freebayes parameter:
-C
Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than the value specified here.
integer
0
Specify to skip over regions of high depth by discarding alignments overlapping positions where total read depth is greater than the specified value. Not set by default.
Modifies freebayes parameter:
-g
Specify ploidy of sample in FreeBayes.
integer
2
Specify ploidy of sample in FreeBayes. Default is diploid. Default: 2
Modifies freebayes parameter:
-p
Specify path to SNP panel in bed format for pileupCaller.
string
Specify a SNP panel in the form of a bed file of sites at which to generate pileup for pileupCaller.
Specify path to SNP panel in EIGENSTRAT format for pileupCaller.
string
Specify a SNP panel in EIGENSTRAT format, pileupCaller will call these sites.
Specify calling method to use. Options: 'randomHaploid', 'randomDiploid', 'majorityCall'.
string
Specify calling method to use. Options: randomHaploid, randomDiploid, majorityCall. Default: 'randomHaploid'
Modifies pileupCaller parameter:
--randomHaploid --randomDiploid --majorityCall
Specify the calling mode for transitions. Options: 'AllSites', 'TransitionsMissing', 'SkipTransitions'.
string
Specify if genotypes of transition SNPs should be called, set to missing, or excluded from the genotypes respectively. Options: 'AllSites'
, 'TransitionsMissing'
, 'SkipTransitions'
. Default: 'AllSites'
Modifies pileupCaller parameter:
--skipTransitions --transitionsMissing
Specify which ANGSD genotyping likelihood model to use. Options: 'samtools', 'gatk', 'soapsnp', 'syk'.
string
Specify which genotype likelihood model to use. Options: 'samtools', 'gatk', 'soapsnp', 'syk'. Default: 'samtools'
Modifies ANGSD parameter:
-GL
Specify which output type to output ANGSD genotyping likelihood results: Options: 'text', 'binary', 'binary_three', 'beagle'.
string
Specifies what type of genotyping likelihood file format will be output. Options: 'text'
, 'binary'
, 'binary_three'
, 'beagle_binary'
. Default: 'text'
.
The options refer to the following descriptions respectively:
text: text output of all 10 log genotype likelihoods.
binary: binary output of all 10 log genotype likelihoods.
binary_three: binary output of 3 times likelihood.
beagle_binary: beagle likelihood file.
See the ANGSD documentation for more information on which to select for your downstream applications.
Modifies ANGSD parameter:
-doGlf
Turn on creation of FASTA from ANGSD genotyping likelihood.
boolean
Turns on the ANGSD creation of a FASTA file from the BAM file.
Specify which genotype type of 'base calling' to use for ANGSD FASTA generation. Options: 'random', 'common'.
string
The type of base calling to be performed when creating the ANGSD FASTA file. Options: 'random' or 'common'. 'common' will output the most common non-N base at each given position, whereas 'random' will pick one at random. Default: 'random'.
Modifies ANGSD parameter:
-doFasta -doCounts
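The difference between the 'random' and 'common' modes can be illustrated on the bases observed at a single position. This is a conceptual sketch only, not ANGSD's implementation; the function name is hypothetical, and breaking ties at random in 'common' mode is an assumption made for illustration.

```python
import random
from collections import Counter
from typing import Optional


def call_base(observed_bases: str, mode: str = "random",
              rng: Optional[random.Random] = None) -> str:
    """Pick one base for a position from the bases observed in the reads covering it."""
    bases = [base for base in observed_bases.upper() if base in "ACGT"]
    if not bases:
        return "N"  # no usable observation at this position
    rng = rng or random.Random()
    if mode == "random":
        return rng.choice(bases)
    if mode == "common":
        counts = Counter(bases)
        highest = max(counts.values())
        candidates = sorted(b for b, n in counts.items() if n == highest)
        return rng.choice(candidates)  # assumption: ties are broken at random
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    print(call_base("AAAC", mode="common"))
    print(call_base("AAAC", mode="random", rng=random.Random(1)))
```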
Options for creation of a per-sample FASTA sequence useful for downstream analysis (e.g. multi sequence alignment)
Turns on ability to create a consensus sequence FASTA file based on a UnifiedGenotyper VCF file and the original reference (only considers SNPs).
boolean
Turn on consensus sequence genome creation via VCF2Genome. Only accepts GATK UnifiedGenotyper VCF files with the --gatk_ug_out_mode 'EMIT_ALL_SITES' and --gatk_ug_genotype_model 'SNP' flags. Typically useful for small genomes such as mitochondria.
Specify name of the output FASTA file containing the consensus sequence. Do not include .vcf
in the file name.
string
The name of your requested output FASTA file. Do not include .fasta
suffix.
Specify the header name of the consensus sequence entry within the FASTA file.
string
The name of the FASTA entry you would like in your FASTA file.
Minimum depth coverage required for a call to be included (else N will be called).
integer
5
Minimum depth coverage for a SNP call to be made. Else, the position will be called as N. Default: 5
Modifies VCF2Genome parameter:
-minc
Minimum genotyping quality of a call to be called. Else N will be called.
integer
30
Minimum genotyping quality of a call to be made. Else N will be called. Default: 30
Modifies VCF2Genome parameter:
-minq
Minimum fraction of reads supporting a call to be included. Else N will be called.
number
0.8
In the case of two possible alleles, the frequency of the majority allele required for a call to be made. Else, a SNP will be called as N. Default: 0.8
Modifies VCF2Genome parameter:
-minfreq
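Taken together, the three thresholds above (minimum coverage, minimum genotyping quality, minimum majority-allele frequency) decide whether a position receives a base or an N. The sketch below combines them as described in this section; it is not VCF2Genome's actual code, and the function and argument names are hypothetical.

```python
def consensus_base(allele_counts: dict[str, int], genotype_quality: float,
                   min_coverage: int = 5, min_quality: float = 30,
                   min_frequency: float = 0.8) -> str:
    """Return the majority allele if all three thresholds are met, otherwise 'N'."""
    depth = sum(allele_counts.values())
    if depth == 0 or depth < min_coverage or genotype_quality < min_quality:
        return "N"
    base, count = max(allele_counts.items(), key=lambda item: item[1])
    return base if count / depth >= min_frequency else "N"


if __name__ == "__main__":
    print(consensus_base({"A": 9, "G": 1}, genotype_quality=40))  # majority at 90%
    print(consensus_base({"A": 6, "G": 4}, genotype_quality=40))  # majority at 60%
```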
Options for creation of a SNP table useful for downstream analysis (e.g. estimation of cross-mapping of different species and multi-sequence alignment)
Turn on MultiVCFAnalyzer. Note: This currently only supports diploid GATK UnifiedGenotyper input.
boolean
Turns on MultiVCFAnalyzer. Will only work when in combination with UnifiedGenotyper genotyping module.
Turn on writing of allele frequencies in the SNP table.
boolean
Specify whether to tell MultiVCFAnalyzer to write within the SNP table the frequencies of the allele at that position e.g. A (70%).
Specify the minimum genotyping quality threshold for a SNP to be called.
integer
30
The minimal genotyping quality for a SNP to be considered for processing by MultiVCFAnalyzer. The default threshold is 30
.
Specify the minimum number of reads a position needs to be covered to be considered for base calling.
integer
5
The minimal number of reads covering a base for a SNP at that position to be considered for processing by MultiVCFAnalyzer. The default depth is 5
.
Specify the minimum allele frequency that a base requires to be considered a 'homozygous' call.
number
0.9
The minimal frequency of a nucleotide for a 'homozygous' SNP to be called. In other words, e.g. 90% of the reads covering that position must have that SNP to be called. If the threshold is not reached, and the previous two parameters are matched, a reference call is made (displayed as . in the SNP table). If the above two parameters are not met, an 'N' is called. The default allele frequency is 0.9
.
Specify the minimum allele frequency that a base requires to be considered a 'heterozygous' call.
number
0.9
The minimum frequency of a nucleotide for a 'heterozygous' SNP to be called. If this parameter is set to the same value as --min_allele_freq_hom, then only homozygous calls are made. If this value is lower than --min_allele_freq_hom, a SNP whose allele frequency falls between the two thresholds will be displayed as an IUPAC uncertainty call. Default is 0.9.
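The interplay of the quality, coverage, homozygous, and heterozygous thresholds described above can be summarised as a small decision function. This is a sketch of the rules as stated in this section, not MultiVCFAnalyzer's actual implementation; the function name, argument names, and return labels are hypothetical.

```python
def classify_position(allele_frequency: float, coverage: int, genotype_quality: float,
                      min_coverage: int = 5, min_quality: float = 30,
                      min_hom: float = 0.9, min_het: float = 0.9) -> str:
    """Classify a position as 'SNP', 'IUPAC', '.' (reference call) or 'N'."""
    # Positions failing the coverage or quality threshold are not callable.
    if coverage < min_coverage or genotype_quality < min_quality:
        return "N"
    if allele_frequency >= min_hom:
        return "SNP"  # 'homozygous' SNP call
    # Heterozygous calls are only possible when min_het is below min_hom.
    if min_het < min_hom and allele_frequency >= min_het:
        return "IUPAC"  # 'heterozygous' call shown as an IUPAC uncertainty code
    return "."  # reference call


if __name__ == "__main__":
    print(classify_position(0.95, coverage=10, genotype_quality=40))
    print(classify_position(0.50, coverage=10, genotype_quality=40))
    print(classify_position(0.50, coverage=10, genotype_quality=40, min_het=0.1))
```

With both thresholds left at 0.9, the second example yields a reference call, while lowering the heterozygous threshold turns the same position into an IUPAC call.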
Specify paths to additional pre-made VCF files to be included in the SNP table generation. Use wildcard(s) for multiple files.
string
If you wish to add previously created VCF files to the table, specify here a path with wildcards (in quotes). These VCF files must have been created with the same settings as used for the GATK UnifiedGenotyper module above.
Specify path to the reference genome annotations in '.gff' format. Optional.
string
NA
If you wish to report in the SNP table annotation information for the regions
SNPs fall in, provide a file in GFF format (the path must be in quotes).
Specify path to the positions to be excluded in '.gff' format. Optional.
string
NA
If you wish to exclude SNP regions from consideration by MultiVCFAnalyzer (such as for problematic regions), provide a file in GFF format (the path must be in quotes).
Specify path to the output file from SNP effect analysis in '.txt' format. Optional.
string
NA
If you wish to include results from SNPEff effect analysis, supply the output
from SNPEff in txt format (the path must be in quotes).
Options for the calculation of ratio of reads to one chromosome/FASTA entry against all others.
Turn on mitochondrial to nuclear ratio calculation.
boolean
Turn on the module to estimate the ratio of mitochondrial to nuclear reads.
Specify the name of the reference FASTA entry corresponding to the mitochondrial genome (up to the first space).
string
MT
Specify the FASTA entry in the reference file specified as --fasta, which acts as the mitochondrial 'chromosome' to base the ratio calculation on. The tool only accepts the first section of the header before the first space. The default chromosome name is based on the hs37d5/GRCh37 human reference genome. Default: 'MT'
Options for the calculation of biological sex of human individuals.
Turn on sex determination for human reference genomes.
boolean
Specify to run the optional process of sex determination.
Specify path to SNP panel in bed format for error bar calculation. Optional (see documentation).
string
Specify an optional bedfile of the list of SNPs to be used for X-/Y-rate calculation. Running without this parameter will considerably increase runtime, and render the resulting error bars untrustworthy. Theoretically, any set of SNPs that are distant enough that two SNPs are unlikely to be covered by the same read can be used here. The programme was coded with the 1240K panel in mind. The path must be in quotes.
Options for the estimation of contamination of human DNA.
Turn on nuclear contamination estimation for human reference genomes.
boolean
Specify to run the optional processes for (human) nuclear DNA contamination estimation.
The name of the X chromosome in your bam/FASTA header. 'X' for hs37d5, 'chrX' for HG19.
string
X
The name of the human chromosome X in your bam. 'X'
for hs37d5, 'chrX'
for HG19. Defaults to 'X'
.
Options for metagenomic screening of off-target reads.
Turn on metagenomic screening module for reference-unmapped reads.
boolean
Turn on the metagenomic screening module.
Specify which classifier to use. Options: 'malt', 'kraken'.
string
undefined
Specify which taxonomic classifier to use. There are two options available:
⚠️ Important: It is very important to run nextflow clean -f on your Nextflow run directory once completed. RMA6 files are VERY large and are copied from a work/ directory into the results folder. You should clean the work directory with this command to avoid redundant copies and a large HDD footprint!
Specify path to classifier database directory. For Kraken2 this can also be a .tar.gz of the directory.
string
Specify the path to the directory containing your taxonomic classifier's database (malt or kraken).
For Kraken2, it can be either the path to the directory or the path to the .tar.gz compressed directory of the Kraken2 database.
Specify the minimum number of reads a taxon is required to have to be retained. Not compatible with --malt_min_support_mode 'percent'.
integer
1
Specify the minimum number of reads a given taxon is required to have to be retained as a positive 'hit'.
For malt, this only applies when --malt_min_support_mode is set to 'reads'. Default: 1.
Modifies MALT or kraken_parse.py parameter: -sup and -c respectively
Percent identity value threshold for MALT.
integer
85
Specify the minimum percent identity (or similarity) a sequence must have to the reference for it to be retained. Default: 85.
Only used when --metagenomic_tool malt is also supplied.
Modifies MALT parameter:
-id
Specify which alignment mode to use for MALT. Options: 'Unknown', 'BlastN', 'BlastP', 'BlastX', 'Classifier'.
string
Use this to run the program in 'BlastN', 'BlastP' or 'BlastX' mode to align DNA reads against DNA references, protein sequences against protein references, or DNA reads against protein references, respectively. Ensure your database matches the mode. Check the MALT manual for more details. Default: 'BlastN'.
Only when --metagenomic_tool malt is also supplied.
Modifies MALT parameter:
-m
Specify alignment method for MALT. Options: 'Local', 'SemiGlobal'.
string
Specify which alignment algorithm to use. Options are 'Local' or 'SemiGlobal'. Local is a BLAST-like alignment, but is much slower. Semi-global alignment aligns reads end-to-end. Default: 'SemiGlobal'.
Only when --metagenomic_tool malt is also supplied.
Modifies MALT parameter:
-at
Specify the top percent value for the LCA algorithm for MALT (see MEGAN6 CE manual).
integer
1
Specify the top percent value of the LCA algorithm. From the MALT manual: "For each read, only those matches are used for taxonomic placement whose bit score is within 10% of the best score for that read." Default: 1.
Only when --metagenomic_tool malt is also supplied.
Modifies MALT parameter:
-top
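As a rough illustration of the top-percent rule quoted above, the following Python sketch keeps only the matches whose bit score lies within a given percentage of the best bit score for a read. The function name `matches_within_top_percent` and the list-of-scores layout are hypothetical and chosen for illustration; this is not MALT's actual implementation.

```python
def matches_within_top_percent(bit_scores, top_percent=1.0):
    """Return the bit scores that lie within `top_percent` % of the best score.

    Hypothetical helper for illustration only; MALT's real LCA code differs.
    """
    if not bit_scores:
        return []
    best = max(bit_scores)
    # A match is kept when its score is at least (100 - top_percent) % of the best.
    threshold = best * (1.0 - top_percent / 100.0)
    return [score for score in bit_scores if score >= threshold]


scores = [200.0, 199.0, 185.0, 150.0]
print(matches_within_top_percent(scores, top_percent=1))   # strict: few matches survive
print(matches_within_top_percent(scores, top_percent=10))  # looser: more matches survive
```

A smaller top percent value passes fewer alignments to the LCA step, which tends to give more specific taxonomic placements.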
Specify whether to use percent or raw number of reads for minimum support required for taxon to be retained for MALT. Options: 'percent', 'reads'.
string
Specify whether to use a percentage, or raw number of reads as the value used to decide the minimum support a taxon requires to be retained.
Only when --metagenomic_tool malt
is also supplied.
Modifies MALT parameter:
-sup -supp
Specify the minimum percentage of reads (out of the sample total) a taxon is required to have to be retained for MALT.
number
0.01
Specify the minimum number of reads (as a percentage of all assigned reads) a given taxon is required to have to be retained as a positive 'hit' in the RMA6 file. This only applies when --malt_min_support_mode is set to 'percent'. Default: 0.01.
Only when --metagenomic_tool malt is also supplied.
Modifies MALT parameter:
-supp
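The two minimum-support modes described above can be sketched in Python as follows. The function `retained_taxa` and its arguments are hypothetical names for illustration, and the sketch assumes that a value of 0.01 in 'percent' mode means 0.01% of all assigned reads; it does not reproduce MALT's internal logic.

```python
def retained_taxa(read_counts, mode="percent", min_reads=1, min_percent=0.01):
    """Filter taxa by minimum support.

    read_counts: dict mapping taxon name -> number of assigned reads.
    mode: 'reads' applies an absolute read count threshold,
          'percent' applies a threshold on the share of all assigned reads.
    Hypothetical helper; assumes min_percent is expressed in percent units.
    """
    if mode not in ("reads", "percent"):
        raise ValueError("mode must be 'reads' or 'percent'")
    total = sum(read_counts.values())
    kept = {}
    for taxon, n_reads in read_counts.items():
        if mode == "reads":
            supported = n_reads >= min_reads
        else:
            supported = total > 0 and (100.0 * n_reads / total) >= min_percent
        if supported:
            kept[taxon] = n_reads
    return kept


counts = {"Yersinia pestis": 5, "Homo sapiens": 99_990, "Escherichia coli": 5}
print(retained_taxa(counts, mode="reads", min_reads=1))
print(retained_taxa(counts, mode="percent", min_percent=0.01))
```

In this made-up example, taxa with 5 of 100,000 reads (0.005%) survive the default 'reads' threshold but are dropped by the default 'percent' threshold.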
Specify the maximum number of alignments a read (query) can have for MALT.
integer
100
Specify the maximum number of alignments a read can have. All further alignments are discarded. Default: 100.
Only when --metagenomic_tool malt is also supplied.
Modifies MALT parameter:
-mq
Specify the memory load method. Do not use 'map' with GPFS file systems for MALT as it can be very slow. Options: 'load', 'page', 'map'.
string
How to load the database into memory. Options are 'load', 'page' or 'map'.
'load' directly loads the entire database into memory prior to seed look up; this is slow but compatible with all servers/file systems. 'page' and 'map' perform a sort of 'chunked' database loading, allowing seed look up before the entire database has been loaded. Note that 'page' and 'map' modes do not work properly with many remote file systems such as GPFS. Default is 'load'.
Only when --metagenomic_tool malt is also supplied.
Modifies MALT parameter:
--memoryMode
Specify to also produce SAM alignment files. Note that these include both aligned and unaligned reads, and are gzipped. Note this will result in very large file sizes.
boolean
Specify to also produce gzipped SAM files of all alignments and un-aligned reads in addition to RMA6 files. These are not soft-clipped or in 'sparse' format. Can be useful for downstream analyses due to more common file format.
⚠️ This can result in very large run output directories, as it is essentially a duplication of the RMA6 files.
Modifies MALT parameter: -a -f
Options for authentication of metagenomic screening performed by MALT.
Turn on MaltExtract for MALT aDNA characteristics authentication.
boolean
Turn on MaltExtract for MALT aDNA characteristics authentication of metagenomic output from MALT.
More can be seen in the MaltExtract documentation.
Only when --metagenomic_tool malt is also supplied.
Path to a text file with taxa of interest (one taxon per row, NCBI taxonomy name format)
string
Path to a .txt file with taxa of interest you wish to assess for aDNA characteristics. The .txt file should contain one taxon per row, and each taxon should be in a valid NCBI taxonomy name format.
Only when --metagenomic_tool malt is also supplied.
Path to directory containing NCBI resource files (ncbi.tre and ncbi.map; available: https://github.com/rhuebler/HOPS/)
string
Path to directory containing the NCBI resource tree and taxonomy table files (ncbi.tre and ncbi.map; available at the HOPS repository).
Only when --metagenomic_tool malt is also supplied.
Specify which MaltExtract filter to use. Options: 'def_anc', 'ancient', 'default', 'crawl', 'scan', 'srna', 'assignment'.
string
Specify which MaltExtract filter to use. This is used to specify what types of characteristics to scan for. The default will output statistics on all alignments, and then a second set with just reads with one C to T mismatch in the first 5 bases. Further details on other parameters can be seen in the HOPS documentation. Options: 'def_anc', 'ancient', 'default', 'crawl', 'scan', 'srna', 'assignment'. Default: 'def_anc'.
Only when --metagenomic_tool malt is also supplied.
Modifies MaltExtract parameter:
-f
Specify percent of top alignments to use.
number
0.01
Specify frequency of top alignments for each read to be considered for each node. Default is 0.01, i.e. 1% of all reads (where 1 would correspond to 100%).
⚠️ This parameter follows the same concept as --malt_top_percent but uses a different notation, i.e. integer (MALT) versus float (MaltExtract).
Only when --metagenomic_tool malt is also supplied.
Modifies MaltExtract parameter:
-a
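Because the two tools express the same idea in different units, a small conversion helper can prevent mix-ups when keeping both settings in sync. The Python sketch below uses hypothetical function names and only assumes what the text states: MALT takes a percent value (1 means 1%), while MaltExtract takes a fraction (0.01 means 1%).

```python
def malt_to_maltextract_top(malt_top_percent):
    """Convert a MALT-style percent (1 means 1%) to a MaltExtract-style fraction."""
    return malt_top_percent / 100.0


def maltextract_to_malt_top(maltextract_fraction):
    """Convert a MaltExtract-style fraction (0.01 means 1%) to a MALT-style percent."""
    return maltextract_fraction * 100.0


# The documented defaults of both parameters describe the same 1% cut-off.
print(malt_to_maltextract_top(1))
print(maltextract_to_malt_top(0.01))
```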
Turn off destacking.
boolean
Turn off destacking. If left on, a read that overlaps with another read will be
removed (leaving a depth coverage of 1).
Only when --metagenomic_tool malt
is also supplied.
Modifies MaltExtract parameter:
--destackingOff
Turn off downsampling.
boolean
Turn off downsampling. By default, downsampling is on and will randomly select 10,000 reads if the number of reads on a node exceeds this number. This is to speed up processing, under the assumption that at 10,000 reads the species is a 'true positive'.
Only when --metagenomic_tool malt
is also supplied.
Modifies MaltExtract parameter:
--downSampOff
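The downsampling behaviour described above can be sketched in Python as below. The function name `downsample`, the `seed` argument and the use of `random.sample` are assumptions made for illustration; MaltExtract's own sampling procedure may differ.

```python
import random


def downsample(reads, max_reads=10_000, seed=None):
    """Randomly keep at most `max_reads` reads assigned to a node.

    Hypothetical helper: returns all reads unchanged when the node holds
    `max_reads` or fewer, otherwise a random subset without replacement.
    """
    reads = list(reads)
    if len(reads) <= max_reads:
        return reads
    rng = random.Random(seed)  # a seed makes the illustration reproducible
    return rng.sample(reads, max_reads)


node_reads = [f"read_{i}" for i in range(25_000)]
subset = downsample(node_reads, seed=42)
print(len(subset))
```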
Turn off duplicate removal.
boolean
Turn off duplicate removal. By default, reads that are an exact copy of another read (i.e. same start and stop coordinates and exact sequence match) will be removed, as they are considered PCR duplicates.
Only when --metagenomic_tool malt
is also supplied.
Modifies MaltExtract parameter:
--dupRemOff
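A minimal Python sketch of the duplicate criterion described above (same start, same stop, identical sequence) is shown here. The tuple layout and the function name `remove_exact_duplicates` are hypothetical and only illustrate the rule; they do not reflect MaltExtract's internal data structures.

```python
def remove_exact_duplicates(alignments):
    """Drop alignments that share start, stop and sequence with an earlier one.

    alignments: iterable of (start, stop, sequence) tuples (hypothetical layout).
    The first occurrence of each alignment is kept, in input order.
    """
    seen = set()
    unique = []
    for start, stop, sequence in alignments:
        key = (start, stop, sequence)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


alignments = [
    (100, 104, "ACGTA"),
    (100, 104, "ACGTA"),  # exact copy: treated as a PCR duplicate
    (100, 104, "ACGTT"),  # same coordinates, different sequence: kept
    (101, 105, "ACGTA"),  # same sequence, different coordinates: kept
]
print(remove_exact_duplicates(alignments))
```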
Turn on exporting alignments of hits in BLAST format.
boolean
Export alignments of hits for each node in BLAST format. By default turned off.
Only when --metagenomic_tool malt
is also supplied.
Modifies MaltExtract parameter:
--matches
Turn on export of MEGAN summary files.
boolean
Export 'minimal' summary files (i.e. without alignments) that can be loaded into MEGAN6. By default turned off.
Only when --metagenomic_tool malt
is also supplied.
Modifies MaltExtract parameter:
--meganSummary
Minimum percent identity alignments are required to have to be reported. Recommended to set same as MALT parameter.
number
85
Minimum percent identity alignments are required to have to be reported. Higher values allow fewer mismatches between read and reference sequence, and therefore provide greater confidence in the hit. Lower values allow more mismatches, which can account for damage and divergence of a related strain/species to the reference. Recommended to set same as MALT parameter or higher. Default: 85.0.
Only when --metagenomic_tool malt is also supplied.
Modifies MaltExtract parameter:
--minPI
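To make the trade-off concrete, the Python sketch below computes a simple percent identity for a pair of equal-length aligned strings and applies a minimum threshold. The definition used here (matching columns divided by alignment length) and both function names are assumptions for illustration; the exact formula used by MALT and MaltExtract may differ.

```python
def percent_identity(read_aln, ref_aln):
    """Percent of alignment columns where read and reference carry the same base.

    Assumes both strings are already aligned and of equal length, with '-'
    marking gaps. Hypothetical simplification of how identity is computed.
    """
    if len(read_aln) != len(ref_aln):
        raise ValueError("aligned sequences must have the same length")
    if not read_aln:
        return 0.0
    matches = sum(1 for a, b in zip(read_aln, ref_aln) if a == b and a != "-")
    return 100.0 * matches / len(read_aln)


def passes_min_pi(read_aln, ref_aln, min_pi=85.0):
    """Return True when the alignment reaches the minimum percent identity."""
    return percent_identity(read_aln, ref_aln) >= min_pi


# One mismatch in ten columns gives 90% identity.
print(percent_identity("ACGTACGTAC", "ACGTACGTAT"))
print(passes_min_pi("ACGTACGTAC", "ACGTACGTAT", min_pi=85.0))
print(passes_min_pi("ACGTACGTAC", "ACGTACGTAT", min_pi=95.0))
```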
Turn on using top alignments per read after filtering.
boolean
Use the best alignment of each read for every statistic, except for those concerning read distribution and coverage. Default: off.
Only when --metagenomic_tool malt
is also supplied.
Modifies MaltExtract parameter:
--useTopAlignment