riboSeed Pipeline¶
Usage¶
The pipeline consists of 3 main stages: preprocessing, de fere novo assembly, and visualization/assessment. As of version 0.4.21, the pipeline is run by invoking ribo
and then one of the following commands:
[ preprocessing ]
scan
select
[ de fere novo assembly ]
seed
[ visualizations/assessment ]
snag
stack
sketch
swap
score
spec
Before We Start¶
Please back up any and all data used, and work within a virtualenv.
Genome assembly gobbles RAM. If you, like me, are working on a 4gb RAM
lappy, don’t run seed in parallel and instead run in series by using
the --serialize
option. That should prevent you from running out of
RAM during the final SPAdes calls.
0: Preprocessing¶
scan
¶
scan
preprocesses sequences straight from a multifasta or one
or more fasta. The issue with many legacy annotations, assemblies, and
scaffold collections is rDNAs are often poorly annotated at best, and
unannotated at worst. This is shortcut to happiness without using the
full Prokka annotation scheme. It requires
`barrnap
<http://www.vicbioinformatics.com/software.barrnap.shtml>`__
and seqret (from `emboss
<http://www.ebi.ac.uk/Tools/emboss/>`__) to
be available in your path. #### Usage
scan
can either use a directory of fastas or one (multi)fasta file.
If using a directory of fastas, provide the appropriate extension using
the -e
option. If using a (multi)fasta as input, it write out each
entry to its own fasta in the contigs
subdirectory that it makes in
the output. For each of the fastas, the script renames complex headers
(sketchy), scans with barrnap and captures the output gff. It then edits
the gff to add fake incrementing locus_tags, and uses the original
sequence file through seqret to make a GenBank file that contains just
annotated rRNA features. The last step is a concatenation which, whether
or not there are multiple files, makes a single (possibly multi-entry)
genbank file perfect for seed-ing.
NOTE: If using a reference with long names or containing special characters, use the –name argument to rename the contigs to something a bit more convenient and less prone to errors when piping results.
usage: ribo scan -o OUTPUT [-e EXT] [-k {bac,euk,arc,mito}] [-t ID_THRESH]
[-b BARRNAP_EXE] [-n NAME] [-c {1,2,4,8,16}]
[-s SEQRET_EXE] [-v {1,2,3,4,5}] [-h]
contigs
Given a directory of one or more chromosomes as fasta files, this facilitates
reannotation of rDNA regions with Barrnap and outputs all sequences as a
single, annotated genbank file
positional arguments:
contigs_dir directory containing one or more chromosomal sequences
in fasta format
required named arguments:
-o OUTPUT, --output OUTPUT
output directory; default:
/home/nicholas/GitHub/seed
optional arguments:
-k {bac,euk,arc,mito}, --kingdom {bac,euk,arc,mito}
whether to look for eukaryotic, archaeal, or bacterial
rDNA; default: bac
-e extension extension of the chromosomal sequences, usually
'.fa', '.fasta' or similar; default: .fa
-t ID_THRESH, --id_thresh ID_THRESH
partial rRNA hits below this threshold will be
ignored. default: 0.5
-b BARRNAP_EXE, --barrnap_exe BARRNAP_EXE
path to barrnap executable; default: barrnap
-n NAME, --name NAME name to give the contig files; default: infered from
file
-s SEQRET_EXE, --seqret_exe SEQRET_EXE
path to seqret executable, usually installed with
emboss; default: seqret
-v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
Logger writes debug to file in output dir; this sets
verbosity level sent to stderr. 1 = debug(), 2 =
info(), 3 = warning(), 4 = error() and 5 = critical();
default: 2
-h, --help Displays this help message
NOTE: If using a reference with long names or containing special characters, use the –name argument to rename the contigs to something a bit more convenient and less prone to errors when piping results.
select
¶
select
searches the genome for rRNA annotations, clusters
them into likely ribosomal groups, and outputs a colon-separated list of
clustered rRNA locus tags by record id.
You will probably want to preview your file to figure out the syntax used. (ie, 16s vs 16S, rRNA vs RRNA, etc…)
If not using scan
or if not working with a prokaryotic
genome, you will need to change --specific_features
appropriately to
reflect the annotations in your reference (ie, for a fungal genome, use
--specific_features 5_8S:18S:28S
).
NOTE: the format of the output text file is very simple, and due to the
relatively small number of such coding sequences in bacterial genomes,
this can be constructed by hand if the clusters do not look appropriate.
The format is genome_sequence_id locus_tag1:locus_tag2
, where each
line represents a cluster. See example below, where 14 rRNAs are
clustered into 6 groups:
NOTE 2: In order to streamline things, as of version 0.0.3 there will be
a commented header line with the feature type in the format “#$ FEATURE
“, such as #$ FEATURE rRNA
.
#$ FEATURE rRNA
CM000577.1 FGSG_20052:FGSG_20051:FGSG_20053
CM000577.1 FGSG_20048:FGSG_20047
CM000577.1 FGSG_20049:FGSG_20050
CM000577.1 FGSG_20054:FGSG_20056:FGSG_20055
CM000577.1 FGSG_20058:FGSG_20057
CM000577.1 FGSG_20075:FGSG_20074
Usage¶
usage: ribo select [-h] [-o OUTPUT] [-f FEATURE] [-s SPECIFIC_FEATURES]
[--keep_temps] [--clobber] [-c CLUSTERS] [-v VERBOSITY]
[--debug]
genbank_genome
This is used to identify and cluster rRNA regions from a gb file, returns a
text file with the clusters
positional arguments:
genbank_genome Genbank file (WITH SEQUENCE)
optional arguments:
-h, --help show this help message and exit
required named arguments:
-o OUTPUT, --output OUTPUT
output directory;default:
/home/nicholas/GitHub/seed
optional arguments:
-f FEATURE, --feature FEATURE
Feature, rRNA or RRNA; default: rRNA
-s SPECIFIC_FEATURES, --specific_features SPECIFIC_FEATURES
colon:separated -- specific features; default:
16S:23S:5S
--keep_temps view intermediate clustering filesdefault: False
--clobber overwrite previous output files: default: False
-c CLUSTERS, --clusters CLUSTERS
number of rDNA clusters;if submitting multiple
records, must be a colon:separated list whose length
matches number of genbank records. Default is inferred
from specific feature with fewest hits
-v VERBOSITY, --verbosity VERBOSITY
1 = debug(), 2 = info(), 3 = warning(), 4 = error()
and 5 = critical(); default: 2
--debug Enable debug messages
2: De fere novo Assembly¶
seed
¶
seed
maps reads to a genome and (1) extracts reads mapping to
rDNA regions, (2) perfoms subassemblies on each pool of extracted reads
to recover the rDNA complete with flanking regions (resulting in a
pseudocontig) (3) concatenates a;; pseudocontigs into them into a
pseudogenome with 5kb spacers of N’s in between, (5) map remaining reads
to the pseudogenome, and (6) repeat steps 1-5 for a given number of
iterations (default 3 iterations). Finally, seed runs SPAdes
assemblied with and without the pseudocontigs and the resulting
assemblies are assessed with QUAST.
Output¶
The results directory will contain a ‘final_long_reads’ directory with
all the pseudocontigs, the mapped fastq files, and
final_de_novo_assembly
and final_de_fere_novo_assembly
folders,
containing the SPAdes results.
NOTE:¶
If using a consumer-grade computer, it will be advantagous to run with
-z/--serialize
enabled to run asseblies in serial rather than
parallel.
Usage:¶
minimal usage:
ribo seed clustered_accession\list.txt -F FASTQ1 -R FASTQ2 -r REFERENCE_GENOME -o OUTPUT
usage: ribo seed -r REFERENCE_GENBANK -o OUTPUT [-F FASTQ1] [-R FASTQ2]
[-S1 FASTQS1] [-n EXP_NAME] [-l FLANKING] [-m {smalt,bwa}]
[-c CORES] [-k KMERS] [-p PRE_KMERS] [-s SCORE_MIN]
[-a MIN_ASSEMBLY_LEN] [--include_shorts] [--linear]
[--ref_as_contig {None,trusted,untrusted}] [--keep_temps]
[--skip_control] [-i ITERATIONS] [-v {1,2,3,4,5}]
[--target_len TARGET_LEN] [-t {1,2,4}] [-z]
[--smalt_scoring SMALT_SCORING] [--mapper_args MAPPER_ARGS]
[-h] [--spades_exe SPADES_EXE]
[--samtools_exe SAMTOOLS_EXE] [--smalt_exe SMALT_EXE]
[--bwa_exe BWA_EXE] [--quast_exe QUAST_EXE]
[--python2_7_exe PYTHON2_7_EXE]
clustered_loci_txt
Given cluster file of rDNA regions from select and either paired-end or
single-end reads, assembles the mapped reads into pseduocontig 'seeds', and
uses those with the reads to runde fere novo and de novo assembly with SPAdes
positional arguments:
clustered_loci_txt output from select
required named arguments:
-r REFERENCE_GENBANK, --reference_genbank REFERENCE_GENBANK
genbank reference, used to estimate insert sizes, and
compare with QUAST
-o OUTPUT, --output OUTPUT
output directory; default:
/home/nicholas/GitHub/seed
optional arguments:
-F FASTQ1, --fastq1 FASTQ1
forward fastq reads, can be compressed
-R FASTQ2, --fastq2 FASTQ2
reverse fastq reads, can be compressed
-S1 FASTQS1, --fastq_single1 FASTQS1
single fastq reads
-n EXP_NAME, --experiment_name EXP_NAME
prefix for results files; default: seed
-l FLANKING, --flanking_length FLANKING
length of flanking regions, in bp; default: 1000
-m {smalt,bwa}, --method_for_map {smalt,bwa}
available mappers: smalt and bwa; default: bwa
-c CORES, --cores CORES
cores for multiprocessing; default: None
-k KMERS, --kmers KMERS
kmers used for final assembly, separated by commas;
default: 21,33,55,77,99,127
-p PRE_KMERS, --pre_kmers PRE_KMERS
kmers used during seeding assemblies, separated bt
commas; default: 21,33,55,77,99
-s SCORE_MIN, --score_min SCORE_MIN
If using smalt, this sets the '-m' param; default with
smalt is inferred from read length. If using BWA,
reads mapping with ASscore lower than this will be
rejected; default with SWA is half of read length
-a MIN_ASSEMBLY_LEN, --min_assembly_len MIN_ASSEMBLY_LEN
if initial SPAdes assembly largest contig is not at
least as long as --min_assembly_len, exit. Set this to
the length of the seed sequence; if it is not
achieved, seeding across regions will likely fail;
default: 6000
--include_shorts if assembled contig is smaller than
--min_assembly_len, contig will still be included in
assembly; default: inferred
--linear if genome is known to not be circular and a region of
interest (including flanking bits) extends past
chromosome end, this extends the seqence past
chromosome origin forward by --padding; default: False
--ref_as_contig {None,trusted,untrusted}
if 'trusted', SPAdes will use the seed sequences as a
--trusted-contig; if 'untrusted', SPAdes will treat as
--untrusted-contig. if '', seeds will not be used
during assembly. See SPAdes docs; default: untrusted
--keep_temps if not --keep_temps, mapping files will be removed
once they are no no longer needed during the
iterations; default: False
--skip_control if --skip_control, no de novo assembly will be done;
default: False
-i ITERATIONS, --iterations ITERATIONS
if iterations>1, multiple seedings will occur after
subassembly of seed regions; if setting --target_len,
seedings will continue until --iterations are
completed or --target_len is matched or exceeded;
default: 3
-v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
Logger writes debug to file in output dir; this sets
verbosity level sent to stderr. 1 = debug(), 2 =
info(), 3 = warning(), 4 = error() and 5 = critical();
default: 2
--target_len TARGET_LEN
if set, iterations will continue until contigs reach
this length, or max iterations (set by --iterations)
have been completed. Set as fraction of original seed
length by giving a decimal between 0 and 5, or set as
an absolute number of base pairs by giving an integer
greater than 50. Not used by default
-t {1,2,4}, --threads {1,2,4}
if your cores are hyperthreaded, set number threads to
the number of threads per processer.If unsure, see
'cat /proc/cpuinfo' under 'cpu cores', or 'lscpu'
under 'Thread(s) per core'.: 1
-z, --serialize if --serialize, runs seeding and assembly without
multiprocessing. This is recommended for machines with
less than 8GB RAM: False
--smalt_scoring SMALT_SCORING
if mapping with SMALT, submit custom smalt scoring via
smalt -S scorespec option; default:
match=1,subst=-4,gapopen=-4,gapext=-3
--mapper_args MAPPER_ARGS
submit custom parameters to mapper. And by mapper, I
mean bwa, cause we dont support this option for SMALT,
sorry. This requires knowledge of your chosen mapper's
optional arguments. Proceed with caution! default: -L
0,0 -U 0
-h, --help Displays this help message
--spades_exe SPADES_EXE
Path to SPAdes executable; default: spades.py
--samtools_exe SAMTOOLS_EXE
Path to samtools executable; default: samtools
--smalt_exe SMALT_EXE
Path to smalt executable; default: smalt
--bwa_exe BWA_EXE Path to BWA executable; default: bwa
--quast_exe QUAST_EXE
Path to quast executable; default: quast.py
--python2_7_exe PYTHON2_7_EXE
Path to python2.7 executable, cause QUAST won't run on
python3. default: python2.7
Key Parameters¶
Results can be tuned by changing several of the default parameters.
--score_min
: This can be used to set the minimum mapping score. If using BWA, the default is not to supply a minimum and to rely on the BWA default. If submitting a--score_min
to BWA, double check that it is appropriate. It appears to be extremely sensitive to read length, and having a too-low threshold for minimum mapping can seriously ruin ones day. Check out IGB or similar to view your mappings if greater than, say, 5% or the reads are mapping in subsequent iterations.-l, --flanking_length
: Default is 2000. That seems to be a good compromise between gaining unique sequence and not relying too much on the reference.--kmers
and--pre_kmers
: Adjust these as you otherwise would for a de novo assembly.--min_assembly_len
: For bacteria, this is about 7000bp, as the rDNA regions for a typical operon of 16S 23S and 5S coding sequences combined usually are about that long. If you are using non-standard rDNA regions, this should be adjusted to prevent spurious assemblies.--ref_as_contig
: This can be used to guide how SPAdes treats the long read sequences during the assembly (trusted
oruntrusted
). By default, this is infered from mapping percentage (trusted
if over 85% of reads map to the reference)--iterations
: Each iteration typically increases the length of the long read by approximately 5%.
3: Visualization/Assessment¶
snag
¶
snag
takes the list of clustered locus tags and extracts
their sequences with flanking regions, optionally turning the coding
sequences to N’s to minimize bias towards reference. Is used to pull out
regions of interest from a Genbank file. Outputs a directory with a
fasta file for each clustered region (and a log file).
Additionally, it does a lot of plotting to visualize the Shannon entropy, coverage, occurrences, and other useful metrics.
Usage:¶
usage: ribo snag [-o OUTPUT] [-n NAME] [-l FLANKING] [--msa_kmers] [-c]
[-p PADDING] [-v VERBOSITY] [--clobber] [--no_revcomp]
[--skip_check] [--msa_tool {mafft,prank}]
[--prank_exe PRANK_EXE] [--mafft_exe MAFFT_EXE]
[--barrnap_exe BARRNAP_EXE]
[--makeblastdb_exe MAKEBLASTDB_EXE]
[--kingdom {mito,euk,arc,bac}] [-h]
genbank_genome clustered_loci
Use to extract regions of interest based on supplied Locus tags and evaluate
the extracted regions
positional arguments:
genbank_genome Genbank file (WITH SEQUENCE)
clustered_loci output from select
required named arguments:
-o OUTPUT, --output OUTPUT
output directory; default:
/home/nicholas/GitHub/seed
optional arguments:
-n NAME, --name NAME rename the contigs with this prefixdefault: date
(YYYYMMDD)
-l FLANKING, --flanking_length FLANKING
length of flanking regions, in bp; default: 1000
--msa_kmers calculate kmer similarity based on aligned sequences
instead of raw sequences;default: False
-c, --circular if the genome is known to be circular, and an region
of interest (including flanking bits) extends past
chromosome end, this extends the seqence past
chromosome origin forward by 5kb; default: False
-p PADDING, --padding PADDING
if treating as circular, this controls the length of
sequence added to the 5' and 3' ends to allow for
selecting regions that cross the chromosom's origin;
default: 5000
-v VERBOSITY, --verbosity VERBOSITY
1 = debug(), 2 = info(), 3 = warning(), 4 = error()
and 5 = critical(); default: 2
--clobber overwrite previous output filesdefault: False
--no_revcomp default returns reverse complimented seq if majority
of regions on reverse strand. if --no_revcomp, this is
overwriddendefault: False
--skip_check Dont bother calculating Shannon Entropy; default:
False
--msa_tool {mafft,prank}
Path to PRANK executable; default: mafft
--prank_exe PRANK_EXE
Path to PRANK executable; default: prank
--mafft_exe MAFFT_EXE
Path to MAFFT executable; default: mafft
--barrnap_exe BARRNAP_EXE
Path to barrnap executable; default: barrnap
--makeblastdb_exe MAKEBLASTDB_EXE
Path to makeblastdb executable; default: makeblastdb
--kingdom {mito,euk,arc,bac}
kingdom for barrnap; default: bac
-h, --help Displays this help message
stack
¶
Decause assembly using short reads often collases rDNA repeats, it is
not uncommon to find a reference genome that has less than the actual
number of rDNAs. stack uses bedtools
and samtools
to
determine the coverage across rDNA regiosn, adn compares that coverage
depth to 10 sets of randomly selected non-rDNA regions. If the number of
rDNAs in the reference matches the number of rDNAs in your sequecned
isolate, the coverage should be pretty similar. However, if the coverage
in your rDNA regions is significantly higher, than there are likely more
rDNAs in your sequenced isoalte that there are in the reference, which
is something to be aware of.
It requires a mapping BAM file and the scan output directory as input.
swap
¶
Infrequently, seed
has joined together contigs that appear
incorrect according to your reference. If you are at all unhappy with a
bridging, swap
allows swapping of a “bad” contig for one or
more syntenic contigs from the de novo assembly. #### USAGE
usage: ribo swap -o OUTPUT [-v {1,2,3,4,5}] [-h]
de_novo_file de_fere_novo_file bad_contig good_contigs
Given de novo and de fere novo contigs files, a misjoined de fere novo contig
name, and a colon:separated list of de novo contig names, replace the
offending contig with the de novo contig(s)
positional arguments:
de_novo_file multifasta containing de novo contigs
de_fere_novo_file multifasta containing de fere novo contigs
bad_contig name of the bad contig
good_contigs colon separated good contigs for replacement
required named arguments:
-o OUTPUT, --output OUTPUT
output directory; default:
/home/nicholas/GitHub/seed
optional arguments:
-v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
Logger writes debug to file in output dir; this sets
verbosity level sent to stderr. 1 = debug(), 2 =
info(), 3 = warning(), 4 = error() and 5 = critical();
default: 2
-h, --help Displays this help message
spec
¶
One limitation in resolving the rDNA repeats is the lack of confidence in
the reference genomes that were assembled from short reads along. ribo spec
parses
the SPAdes assembly graph in fastg format to take a guess at how many rDNAs are in
the genome based on the nodes and edges represinting the region in the graph. This can
help alert the user that the number of rDNAs in the reference may disagree with the
actual number in the genome.