Welcome to riboSeed’s documentation!¶
Impatient? See our Quickstart Guide
A brief overview of the theory can be found here.
The manuscript can be found here: Nicholas R Waters, Florence Abram, Fiona Brennan, Ashleigh Holmes, Leighton Pritchard; riboSeed: leveraging prokaryotic genomic architecture to assemble across ribosomal regions, Nucleic Acids Research, Volume 46, Issue 11, 20 June 2018, Pages e68, https://doi.org/10.1093/nar/gky212
Description¶
riboSeed is an supplemental assembly refinement method to try to address the issue of multiple ribosomal regions in a genome, as these create repeates unresolvable by short read sequencing. It takes advantage of the fact that while each region is identical, the regions flanking are unique, and therefore can potentially be used to seed an assembly in such a way that rDNA regions are bridged.
Contents:
Description¶
riboSeed is an supplemental assembly refinement method to try to address the issue of multiple ribosomal regions in a genome, as these create repeates unresolvable by short read sequencing. It takes advantage of the fact that while each region is identical, the regions flanking are unique, and therefore can potentially be used to seed an assembly in such a way that rDNA regions are bridged.
The Problem¶
As you probably know, repeated regions are difficult to resolve when sequencing with a short read technology. Specifically, if the length of the repeat exceeds the length of the kmers used to construct the de Bruijn graph, the repeat cannot be resolved.
rDNAs, the genomic regions containing the sequences coding for ribosomal RNAs, are often found multiple times in a single genome. The rDNAs are usually around 5kb long, which is much longer than the length of short reads. Because of this, these regions cause breaks in the assembly.
Due to how well rDNA is conserved within a taxa, we hypothesized that if the regions flanking the rDNAs are sufficiently unique within a genome, those regions would be able to locate an rDNA within the genome during assembly.
The Method¶
We call our method a de fere novo assembly (meaning “starting from almost nothing”), as we use a subassembly technique to minimize the bias caused by reference choice. We map the short reads to the reference genome, extract the reads mapping to rDNA (with flanking) regions, and perform subassemblies with SPAdes to reassemble the rDNA and flanking regions from the reads. These “long reads” are concatenated together separated with 5kb of N’s. The reads are then mapped to the concatenated sequence and and subassembled for several additional iterations.
The Results¶
We generated a simulated genome from the 7 rDNA regions with 5kb flanking regions, and then used `ART (MountRainier-2016-06-05)<https://www.niehs.nih.gov/research/resources/software/biostatistics/art/>`__ to generated simulated MiSeq reads of various depths.
In this `Mauve<http://darlinglab.org/mauve/mauve.html>`__ visualization, we show (from top to bottom) the reference simulated genome, riboSeed’s de fere novo assembly, de novo assembly, and a negative control de fere novo assembly using a Klebsiella reference genome. The results show that with riboSeed’s de fere novo assembly correctly joins six of the seven rDNA regions to reconstruct the simulated genome with only short reads. By contrast, the short reads alone failed to bridge any gaps caused by the repeated rDNAs, and the assembly using a poor reference choice only assembled across a single rDNA region. We have run this successfully on many real datasets with positive results
Installation¶
From conda (new and recommended!) —
Conda is a cross-platform, cross-language package management system. If you haven’t already installed conda, follow these instructions here, and install the python3.6 version. Once you have that done, add the appropriate channels.
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
and then install riboSeed and all of its dependencies with one command:
conda install riboseed
(Note the lowercase “s”)
From Pypi¶
riboSeed is on Pypi, so you can install with pip, preferably within a virtualenv:
virtualenv -p python3.5 venv-riboSeed
source venv-riboSeed/bin/activate
pip3.5 install riboSeed
From TestPypi¶
To install the bleeding-edge version, install from testpypi:
virtualenv -p python3.5 venv-riboSeed
source venv-riboSeed/bin/activate
pip install --extra-index-url https://testpypi.python.org/pypi riboSeed
From GitHub¶
You can also clone this repository, create a virtual environment, and run
python3.5 setup.py install
.
Dependencies¶
External Requirements¶
riboScan.py
- Barrnap (must be 0.7 or above)
- EMBOSS’s Seqret
riboSelect.py
- None
riboSnag.py
- PRANK or Mafft
- BLAST+ suite
- Barrnap (must be 0.7 or above)
riboSeed.py
- SPAdes v3.8 or higher
- BWA (tested with 0.7.12-r1039)
- SAMTools (must be 1.3.1 or above)
- QUAST (tested with 4.1)
NOTE: barrnap has certain Perl requirements that may not be included on your machine. Ensure barrnap runs fine before trying riboSeed. Or try python barrnap.
Choosing an Appropriate Reference¶
Introduction¶
Any sort of analysis that involves a reference strain begs the question: which strain should I use? Luckily, riboSeed’s de fere novo approach gives you a bit of flexiblility here, in that the method is not affected my major genomic restructuring events, as long as they do not occur in the rDNA region. Still, using a close reference maximizes your chance for a successful assembly.
There are three ways that we recommend selecting a reference. The first uses average nucleotide identity to select an optimal reference for a given set of potential genomes – perfect if you are sequencing a popular bug and are spoiled for reference choice. The Kraken method gives great results when nothing is known a priori, and results in a high degree of certainty, but requires a bit of legwork. The last method can be done entirely through your web browser, but is much less robust. For popular bugs, we reccomend ANI; otherwise we recommend Kraken, but if you have already pulled out all of your hair from bioinformatics software installation, reads2Type/cgfind is a nice, painless alternative.
Reference Selection via ANI¶
With any assembly that uses a reference, the choice of that reference is crucial. Here, we outline a protocol for using Average Nucleotide Identity via pyANI. We call this pipeline Plenty of Bugs, as it will help you find a compatible match for your sequenced isolate.
This is easiest to do with Docker (or Singularity)
The following script will identify 25 random E. coli genomes, download then, build the pyani database, do a quick assembly of your isolate, and find the closest reference to your isolate.
docker run --rm -t -v ${PWD}:/data/ nickp60/plentyofbugs:0.87 -f /data/tests/references/toy_reads1.fq -o "Escherichia coli" -n 25 -e sample_ecoli -d /data/results/
Kraken Method¶
Kraken is a kmer-based phylogeny tool that can be used to idenify the strains present in a metagenomic dataset; the installation and usage instructions can be found here
- Download and install Kraken, along with the MiniKraken database from their website.
- Run Kraken on your isolate’s reads, and generate the Kraken report.
kraken --db MiniKraken reads1.fq reads2.fastq > sequences.kraken
kraken-translate --db MiniKraken sequences.kraken > sequences.labels
kraken-report --db MiniKraken sequence.kraken
Because the MiniKraken database was built from all the complete genomes from RefSeq, it should be easy to identify which strain in the database has the closest match to your sequenced isolate.
PS: This is a great time to check if you have any contamination in your sample; thanks, Kraken!
reads2type and cgFind Method¶
reads2type is also a kmer-based phylogeny tool, but it relies on a lightweight, prebuilt database that allows the analysis to be performed in your web browser, and it doesn’t require you to upload your whole read file to a webserver. It works by taking one read at a time from your file, generating 55-mers, and comparing to its databse. If there is not enough taxanomic information to indentify the isolate off of that read alone, additional reads will be processed until a single taxonomy is achieved. This method works best on trimmed reads. Instructions and the webserver can be found here
Once you have a genus and species, you can use cgfind
, a tool we developed to provide easy access to downloadable genomes based on the complete prokaryotic genomes found in NCBI. it can be found here Just enter your genus and species name, and select one of the available strains to download.
Usage¶
Minimal example: ribo run
¶
The pseudogenome was constructed from the 7 rDNAs separated by several kb of flanking DNA. If can be found under ./riboSeed/integration_test/concatenated_seq.fasta. If you have installed using setuptools, the integration_test folder will be installed in the site-packages dir, such as /venv-riboSeed/lib/python3.5/site-packages/riboSeed/integration_data/.
Two read files can be found in the same directory.
To run the whole riboSeed pipline, use the following command:
ribo run ./riboSeed/integration_data/concatenated_seq.fasta \
-F ./riboSeed/integration_data/test_reads1.fq \
-R ./riboSeed/integration_data/test_reads2.fq \
-o ./test1/ -v 1
Whats going on:¶
ribo run
is used to run the pipeline with the most commonly used settings. It first creates a config file, tracking down your system executables
for the required tools, and setting the default parameters for things not
specified as args to run_riboSeed.
Then, ribo scan
is run to re-annotate your reference, ribo select
calls the rDNA
operons, and ribo seed
runs the de fere novo assembly.
If you want to change the behaviour of the programs under the hood, all of the
command line options not set by ribo run
are defined in the config file in
the output directory. After editing the parameters in the config file, you can
submit it to ribo run
using the -c flag.
Running individual scripts¶
All of the elements of the package can be run individually: Perhaps you want to
modify barrnap’s behaviour in scan
, or you want to experiment with
different feature selectors in select
. Go for it!
$ ribo
Description: A suite of tools to perform de fere novo assembly to bridge
gaps caused by rDNA repeats
Usage: ribo <command> [options]
Available commands:
-run execute pipeline (scan, select, seed, sketch, and score)
-scan reannotate rRNAs in a FASTA file
-select group rRNA annotations into rDNA operons
-seed perform de fere novo assembly
-snag extract rDNA regions and plot entropy
-sim perform simulations used in manuscript
-sketch plot results from a de fere novo assembly
-stack compare coverage depth in rDNA regions to rest of genome
-score score batches of assemblies with BLASTn
-swap swap contigs from assemblies
-spec use assembly graph to speculate number of rDNAs
-structure view the rRNA operon structure across several genomes
-config write out a blank config file to be used with `run`
riboSeed Pipeline¶
Usage¶
The pipeline consists of 3 main stages: preprocessing, de fere novo assembly, and visualization/assessment. As of version 0.4.21, the pipeline is run by invoking ribo
and then one of the following commands:
[ preprocessing ]
scan
select
[ de fere novo assembly ]
seed
[ visualizations/assessment ]
snag
stack
sketch
swap
score
spec
Before We Start¶
Please back up any and all data used, and work within a virtualenv.
Genome assembly gobbles RAM. If you, like me, are working on a 4gb RAM
lappy, don’t run seed in parallel and instead run in series by using
the --serialize
option. That should prevent you from running out of
RAM during the final SPAdes calls.
0: Preprocessing¶
scan
¶
scan
preprocesses sequences straight from a multifasta or one
or more fasta. The issue with many legacy annotations, assemblies, and
scaffold collections is rDNAs are often poorly annotated at best, and
unannotated at worst. This is shortcut to happiness without using the
full Prokka annotation scheme. It requires
`barrnap
<http://www.vicbioinformatics.com/software.barrnap.shtml>`__
and seqret (from `emboss
<http://www.ebi.ac.uk/Tools/emboss/>`__) to
be available in your path. #### Usage
scan
can either use a directory of fastas or one (multi)fasta file.
If using a directory of fastas, provide the appropriate extension using
the -e
option. If using a (multi)fasta as input, it write out each
entry to its own fasta in the contigs
subdirectory that it makes in
the output. For each of the fastas, the script renames complex headers
(sketchy), scans with barrnap and captures the output gff. It then edits
the gff to add fake incrementing locus_tags, and uses the original
sequence file through seqret to make a GenBank file that contains just
annotated rRNA features. The last step is a concatenation which, whether
or not there are multiple files, makes a single (possibly multi-entry)
genbank file perfect for seed-ing.
NOTE: If using a reference with long names or containing special characters, use the –name argument to rename the contigs to something a bit more convenient and less prone to errors when piping results.
usage: ribo scan -o OUTPUT [-e EXT] [-k {bac,euk,arc,mito}] [-t ID_THRESH]
[-b BARRNAP_EXE] [-n NAME] [-c {1,2,4,8,16}]
[-s SEQRET_EXE] [-v {1,2,3,4,5}] [-h]
contigs
Given a directory of one or more chromosomes as fasta files, this facilitates
reannotation of rDNA regions with Barrnap and outputs all sequences as a
single, annotated genbank file
positional arguments:
contigs_dir directory containing one or more chromosomal sequences
in fasta format
required named arguments:
-o OUTPUT, --output OUTPUT
output directory; default:
/home/nicholas/GitHub/seed
optional arguments:
-k {bac,euk,arc,mito}, --kingdom {bac,euk,arc,mito}
whether to look for eukaryotic, archaeal, or bacterial
rDNA; default: bac
-e extension extension of the chromosomal sequences, usually
'.fa', '.fasta' or similar; default: .fa
-t ID_THRESH, --id_thresh ID_THRESH
partial rRNA hits below this threshold will be
ignored. default: 0.5
-b BARRNAP_EXE, --barrnap_exe BARRNAP_EXE
path to barrnap executable; default: barrnap
-n NAME, --name NAME name to give the contig files; default: infered from
file
-s SEQRET_EXE, --seqret_exe SEQRET_EXE
path to seqret executable, usually installed with
emboss; default: seqret
-v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
Logger writes debug to file in output dir; this sets
verbosity level sent to stderr. 1 = debug(), 2 =
info(), 3 = warning(), 4 = error() and 5 = critical();
default: 2
-h, --help Displays this help message
NOTE: If using a reference with long names or containing special characters, use the –name argument to rename the contigs to something a bit more convenient and less prone to errors when piping results.
select
¶
select
searches the genome for rRNA annotations, clusters
them into likely ribosomal groups, and outputs a colon-separated list of
clustered rRNA locus tags by record id.
You will probably want to preview your file to figure out the syntax used. (ie, 16s vs 16S, rRNA vs RRNA, etc…)
If not using scan
or if not working with a prokaryotic
genome, you will need to change --specific_features
appropriately to
reflect the annotations in your reference (ie, for a fungal genome, use
--specific_features 5_8S:18S:28S
).
NOTE: the format of the output text file is very simple, and due to the
relatively small number of such coding sequences in bacterial genomes,
this can be constructed by hand if the clusters do not look appropriate.
The format is genome_sequence_id locus_tag1:locus_tag2
, where each
line represents a cluster. See example below, where 14 rRNAs are
clustered into 6 groups:
NOTE 2: In order to streamline things, as of version 0.0.3 there will be
a commented header line with the feature type in the format “#$ FEATURE
“, such as #$ FEATURE rRNA
.
#$ FEATURE rRNA
CM000577.1 FGSG_20052:FGSG_20051:FGSG_20053
CM000577.1 FGSG_20048:FGSG_20047
CM000577.1 FGSG_20049:FGSG_20050
CM000577.1 FGSG_20054:FGSG_20056:FGSG_20055
CM000577.1 FGSG_20058:FGSG_20057
CM000577.1 FGSG_20075:FGSG_20074
Usage¶
usage: ribo select [-h] [-o OUTPUT] [-f FEATURE] [-s SPECIFIC_FEATURES]
[--keep_temps] [--clobber] [-c CLUSTERS] [-v VERBOSITY]
[--debug]
genbank_genome
This is used to identify and cluster rRNA regions from a gb file, returns a
text file with the clusters
positional arguments:
genbank_genome Genbank file (WITH SEQUENCE)
optional arguments:
-h, --help show this help message and exit
required named arguments:
-o OUTPUT, --output OUTPUT
output directory;default:
/home/nicholas/GitHub/seed
optional arguments:
-f FEATURE, --feature FEATURE
Feature, rRNA or RRNA; default: rRNA
-s SPECIFIC_FEATURES, --specific_features SPECIFIC_FEATURES
colon:separated -- specific features; default:
16S:23S:5S
--keep_temps view intermediate clustering filesdefault: False
--clobber overwrite previous output files: default: False
-c CLUSTERS, --clusters CLUSTERS
number of rDNA clusters;if submitting multiple
records, must be a colon:separated list whose length
matches number of genbank records. Default is inferred
from specific feature with fewest hits
-v VERBOSITY, --verbosity VERBOSITY
1 = debug(), 2 = info(), 3 = warning(), 4 = error()
and 5 = critical(); default: 2
--debug Enable debug messages
2: De fere novo Assembly¶
seed
¶
seed
maps reads to a genome and (1) extracts reads mapping to
rDNA regions, (2) perfoms subassemblies on each pool of extracted reads
to recover the rDNA complete with flanking regions (resulting in a
pseudocontig) (3) concatenates a;; pseudocontigs into them into a
pseudogenome with 5kb spacers of N’s in between, (5) map remaining reads
to the pseudogenome, and (6) repeat steps 1-5 for a given number of
iterations (default 3 iterations). Finally, seed runs SPAdes
assemblied with and without the pseudocontigs and the resulting
assemblies are assessed with QUAST.
Output¶
The results directory will contain a ‘final_long_reads’ directory with
all the pseudocontigs, the mapped fastq files, and
final_de_novo_assembly
and final_de_fere_novo_assembly
folders,
containing the SPAdes results.
If using a consumer-grade computer, it will be advantagous to run with
-z/--serialize
enabled to run asseblies in serial rather than
parallel.
Usage:¶
minimal usage:
ribo seed clustered_accession\list.txt -F FASTQ1 -R FASTQ2 -r REFERENCE_GENOME -o OUTPUT
usage: ribo seed -r REFERENCE_GENBANK -o OUTPUT [-F FASTQ1] [-R FASTQ2]
[-S1 FASTQS1] [-n EXP_NAME] [-l FLANKING] [-m {smalt,bwa}]
[-c CORES] [-k KMERS] [-p PRE_KMERS] [-s SCORE_MIN]
[-a MIN_ASSEMBLY_LEN] [--include_shorts] [--linear]
[--ref_as_contig {None,trusted,untrusted}] [--keep_temps]
[--skip_control] [-i ITERATIONS] [-v {1,2,3,4,5}]
[--target_len TARGET_LEN] [-t {1,2,4}] [-z]
[--smalt_scoring SMALT_SCORING] [--mapper_args MAPPER_ARGS]
[-h] [--spades_exe SPADES_EXE]
[--samtools_exe SAMTOOLS_EXE] [--smalt_exe SMALT_EXE]
[--bwa_exe BWA_EXE] [--quast_exe QUAST_EXE]
[--python2_7_exe PYTHON2_7_EXE]
clustered_loci_txt
Given cluster file of rDNA regions from select and either paired-end or
single-end reads, assembles the mapped reads into pseduocontig 'seeds', and
uses those with the reads to runde fere novo and de novo assembly with SPAdes
positional arguments:
clustered_loci_txt output from select
required named arguments:
-r REFERENCE_GENBANK, --reference_genbank REFERENCE_GENBANK
genbank reference, used to estimate insert sizes, and
compare with QUAST
-o OUTPUT, --output OUTPUT
output directory; default:
/home/nicholas/GitHub/seed
optional arguments:
-F FASTQ1, --fastq1 FASTQ1
forward fastq reads, can be compressed
-R FASTQ2, --fastq2 FASTQ2
reverse fastq reads, can be compressed
-S1 FASTQS1, --fastq_single1 FASTQS1
single fastq reads
-n EXP_NAME, --experiment_name EXP_NAME
prefix for results files; default: seed
-l FLANKING, --flanking_length FLANKING
length of flanking regions, in bp; default: 1000
-m {smalt,bwa}, --method_for_map {smalt,bwa}
available mappers: smalt and bwa; default: bwa
-c CORES, --cores CORES
cores for multiprocessing; default: None
-k KMERS, --kmers KMERS
kmers used for final assembly, separated by commas;
default: 21,33,55,77,99,127
-p PRE_KMERS, --pre_kmers PRE_KMERS
kmers used during seeding assemblies, separated bt
commas; default: 21,33,55,77,99
-s SCORE_MIN, --score_min SCORE_MIN
If using smalt, this sets the '-m' param; default with
smalt is inferred from read length. If using BWA,
reads mapping with ASscore lower than this will be
rejected; default with SWA is half of read length
-a MIN_ASSEMBLY_LEN, --min_assembly_len MIN_ASSEMBLY_LEN
if initial SPAdes assembly largest contig is not at
least as long as --min_assembly_len, exit. Set this to
the length of the seed sequence; if it is not
achieved, seeding across regions will likely fail;
default: 6000
--include_shorts if assembled contig is smaller than
--min_assembly_len, contig will still be included in
assembly; default: inferred
--linear if genome is known to not be circular and a region of
interest (including flanking bits) extends past
chromosome end, this extends the seqence past
chromosome origin forward by --padding; default: False
--ref_as_contig {None,trusted,untrusted}
if 'trusted', SPAdes will use the seed sequences as a
--trusted-contig; if 'untrusted', SPAdes will treat as
--untrusted-contig. if '', seeds will not be used
during assembly. See SPAdes docs; default: untrusted
--keep_temps if not --keep_temps, mapping files will be removed
once they are no no longer needed during the
iterations; default: False
--skip_control if --skip_control, no de novo assembly will be done;
default: False
-i ITERATIONS, --iterations ITERATIONS
if iterations>1, multiple seedings will occur after
subassembly of seed regions; if setting --target_len,
seedings will continue until --iterations are
completed or --target_len is matched or exceeded;
default: 3
-v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
Logger writes debug to file in output dir; this sets
verbosity level sent to stderr. 1 = debug(), 2 =
info(), 3 = warning(), 4 = error() and 5 = critical();
default: 2
--target_len TARGET_LEN
if set, iterations will continue until contigs reach
this length, or max iterations (set by --iterations)
have been completed. Set as fraction of original seed
length by giving a decimal between 0 and 5, or set as
an absolute number of base pairs by giving an integer
greater than 50. Not used by default
-t {1,2,4}, --threads {1,2,4}
if your cores are hyperthreaded, set number threads to
the number of threads per processer.If unsure, see
'cat /proc/cpuinfo' under 'cpu cores', or 'lscpu'
under 'Thread(s) per core'.: 1
-z, --serialize if --serialize, runs seeding and assembly without
multiprocessing. This is recommended for machines with
less than 8GB RAM: False
--smalt_scoring SMALT_SCORING
if mapping with SMALT, submit custom smalt scoring via
smalt -S scorespec option; default:
match=1,subst=-4,gapopen=-4,gapext=-3
--mapper_args MAPPER_ARGS
submit custom parameters to mapper. And by mapper, I
mean bwa, cause we dont support this option for SMALT,
sorry. This requires knowledge of your chosen mapper's
optional arguments. Proceed with caution! default: -L
0,0 -U 0
-h, --help Displays this help message
--spades_exe SPADES_EXE
Path to SPAdes executable; default: spades.py
--samtools_exe SAMTOOLS_EXE
Path to samtools executable; default: samtools
--smalt_exe SMALT_EXE
Path to smalt executable; default: smalt
--bwa_exe BWA_EXE Path to BWA executable; default: bwa
--quast_exe QUAST_EXE
Path to quast executable; default: quast.py
--python2_7_exe PYTHON2_7_EXE
Path to python2.7 executable, cause QUAST won't run on
python3. default: python2.7
Key Parameters¶
Results can be tuned by changing several of the default parameters.
--score_min
: This can be used to set the minimum mapping score. If using BWA, the default is not to supply a minimum and to rely on the BWA default. If submitting a--score_min
to BWA, double check that it is appropriate. It appears to be extremely sensitive to read length, and having a too-low threshold for minimum mapping can seriously ruin ones day. Check out IGB or similar to view your mappings if greater than, say, 5% or the reads are mapping in subsequent iterations.-l, --flanking_length
: Default is 2000. That seems to be a good compromise between gaining unique sequence and not relying too much on the reference.--kmers
and--pre_kmers
: Adjust these as you otherwise would for a de novo assembly.--min_assembly_len
: For bacteria, this is about 7000bp, as the rDNA regions for a typical operon of 16S 23S and 5S coding sequences combined usually are about that long. If you are using non-standard rDNA regions, this should be adjusted to prevent spurious assemblies.--ref_as_contig
: This can be used to guide how SPAdes treats the long read sequences during the assembly (trusted
oruntrusted
). By default, this is infered from mapping percentage (trusted
if over 85% of reads map to the reference)--iterations
: Each iteration typically increases the length of the long read by approximately 5%.
3: Visualization/Assessment¶
snag
¶
snag
takes the list of clustered locus tags and extracts
their sequences with flanking regions, optionally turning the coding
sequences to N’s to minimize bias towards reference. Is used to pull out
regions of interest from a Genbank file. Outputs a directory with a
fasta file for each clustered region (and a log file).
Additionally, it does a lot of plotting to visualize the Shannon entropy, coverage, occurrences, and other useful metrics.
Usage:¶
usage: ribo snag [-o OUTPUT] [-n NAME] [-l FLANKING] [--msa_kmers] [-c]
[-p PADDING] [-v VERBOSITY] [--clobber] [--no_revcomp]
[--skip_check] [--msa_tool {mafft,prank}]
[--prank_exe PRANK_EXE] [--mafft_exe MAFFT_EXE]
[--barrnap_exe BARRNAP_EXE]
[--makeblastdb_exe MAKEBLASTDB_EXE]
[--kingdom {mito,euk,arc,bac}] [-h]
genbank_genome clustered_loci
Use to extract regions of interest based on supplied Locus tags and evaluate
the extracted regions
positional arguments:
genbank_genome Genbank file (WITH SEQUENCE)
clustered_loci output from select
required named arguments:
-o OUTPUT, --output OUTPUT
output directory; default:
/home/nicholas/GitHub/seed
optional arguments:
-n NAME, --name NAME rename the contigs with this prefixdefault: date
(YYYYMMDD)
-l FLANKING, --flanking_length FLANKING
length of flanking regions, in bp; default: 1000
--msa_kmers calculate kmer similarity based on aligned sequences
instead of raw sequences;default: False
-c, --circular if the genome is known to be circular, and an region
of interest (including flanking bits) extends past
chromosome end, this extends the seqence past
chromosome origin forward by 5kb; default: False
-p PADDING, --padding PADDING
if treating as circular, this controls the length of
sequence added to the 5' and 3' ends to allow for
selecting regions that cross the chromosom's origin;
default: 5000
-v VERBOSITY, --verbosity VERBOSITY
1 = debug(), 2 = info(), 3 = warning(), 4 = error()
and 5 = critical(); default: 2
--clobber overwrite previous output filesdefault: False
--no_revcomp default returns reverse complimented seq if majority
of regions on reverse strand. if --no_revcomp, this is
overwriddendefault: False
--skip_check Dont bother calculating Shannon Entropy; default:
False
--msa_tool {mafft,prank}
Path to PRANK executable; default: mafft
--prank_exe PRANK_EXE
Path to PRANK executable; default: prank
--mafft_exe MAFFT_EXE
Path to MAFFT executable; default: mafft
--barrnap_exe BARRNAP_EXE
Path to barrnap executable; default: barrnap
--makeblastdb_exe MAKEBLASTDB_EXE
Path to makeblastdb executable; default: makeblastdb
--kingdom {mito,euk,arc,bac}
kingdom for barrnap; default: bac
-h, --help Displays this help message
stack
¶
Decause assembly using short reads often collases rDNA repeats, it is
not uncommon to find a reference genome that has less than the actual
number of rDNAs. stack uses bedtools
and samtools
to
determine the coverage across rDNA regiosn, adn compares that coverage
depth to 10 sets of randomly selected non-rDNA regions. If the number of
rDNAs in the reference matches the number of rDNAs in your sequecned
isolate, the coverage should be pretty similar. However, if the coverage
in your rDNA regions is significantly higher, than there are likely more
rDNAs in your sequenced isoalte that there are in the reference, which
is something to be aware of.
It requires a mapping BAM file and the scan output directory as input.
swap
¶
Infrequently, seed
has joined together contigs that appear
incorrect according to your reference. If you are at all unhappy with a
bridging, swap
allows swapping of a “bad” contig for one or
more syntenic contigs from the de novo assembly. #### USAGE
usage: ribo swap -o OUTPUT [-v {1,2,3,4,5}] [-h]
de_novo_file de_fere_novo_file bad_contig good_contigs
Given de novo and de fere novo contigs files, a misjoined de fere novo contig
name, and a colon:separated list of de novo contig names, replace the
offending contig with the de novo contig(s)
positional arguments:
de_novo_file multifasta containing de novo contigs
de_fere_novo_file multifasta containing de fere novo contigs
bad_contig name of the bad contig
good_contigs colon separated good contigs for replacement
required named arguments:
-o OUTPUT, --output OUTPUT
output directory; default:
/home/nicholas/GitHub/seed
optional arguments:
-v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
Logger writes debug to file in output dir; this sets
verbosity level sent to stderr. 1 = debug(), 2 =
info(), 3 = warning(), 4 = error() and 5 = critical();
default: 2
-h, --help Displays this help message
spec
¶
One limitation in resolving the rDNA repeats is the lack of confidence in
the reference genomes that were assembled from short reads along. ribo spec
parses
the SPAdes assembly graph in fastg format to take a guess at how many rDNAs are in
the genome based on the nodes and edges represinting the region in the graph. This can
help alert the user that the number of rDNAs in the reference may disagree with the
actual number in the genome.
Accessory Scripts¶
Assessment¶
riboScore.py
~~~
Suppose you have a whole bunch of assemblies to assess. The most rigorous way of checking the assemblies would be to use Mauve (or a similar whole genome alignment visualizer tool) for the job, and manually check the quality of each assembly, listening to the ends of the contigs, seeking one-ness with the data. Thats all well and good if you are (a) independantly wealthy and enjoy doing this sort of thing, (b) seeking a meditative state through mindless clicking, or (c) an undergrad assistant, but for the rest of us, we are willing to sacrifice a bit of accuracy for throughput. This is, after all, why we aren’t sequencing on gels anymore.
ribo score
outputs two types of score repors as text files: one which is
easy for humans to read, and the other that can be easily combined with
hundreds like it to make various types of graphs etc.
usage: ribo score [-h] [-o OUTPUT] [-l FLANKING] [-p MIN_PERCENT]
[-f ASSEMBLY_EXT] [-g REF_EXT] [-F] [-v {1,2,3,4,5}]
indir
This does some simple blasting to detect correctness of riboSeed results
positional arguments:
indir dir containing a genbank file and other file
optional arguments:
-h, --help show this help message and exit
-o OUTPUT, --output OUTPUT
directory in which to place the output files
-l FLANKING, --flanking_length FLANKING
length of flanking regions, in bp; default: 1000
-p MIN_PERCENT, --min_percent MIN_PERCENT
minimum percent identity
-f ASSEMBLY_EXT, --assembly_ext ASSEMBLY_EXT
extenssion of reference, usually fasta
-g REF_EXT, --ref_ext REF_EXT
extension of reference, usually .gb
-F, --blast_Full if true, blast full sequences along with just the
flanking. Interpretation is not implemented currently
as false positives cant be detected this way
-v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
Logger writes debug to file in output dir; this sets
verbosity level sent to stderr. 1 = debug(), 2 =
info(), 3 = warning(), 4 = error() and 5 = critical();
default: 2
Visualization¶
riboSnag.py
¶
riboSnag.py
takes the list of clustered locus tags and extracts
their sequences with flanking regions, optionally turning the coding
sequences to N’s to minimize bias towards reference. Is used to pull out
regions of interest from a Genbank file. Outputs a directory with a
fasta file for each clustered region (and a log file).
Additionally, it does a lot of plotting to visualize the Shannon entropy, coverage, occurrences, and other useful metrics.
Usage:¶
usage: riboSnag.py [-o OUTPUT] [-n NAME] [-l FLANKING] [--msa_kmers] [-c]
[-p PADDING] [-v VERBOSITY] [--clobber] [--no_revcomp]
[--skip_check] [--msa_tool {mafft,prank}]
[--prank_exe PRANK_EXE] [--mafft_exe MAFFT_EXE]
[--barrnap_exe BARRNAP_EXE]
[--makeblastdb_exe MAKEBLASTDB_EXE]
[--kingdom {mito,euk,arc,bac}] [-h]
genbank_genome clustered_loci
Use to extract regions of interest based on supplied Locus tags and evaluate
the extracted regions
positional arguments:
genbank_genome Genbank file (WITH SEQUENCE)
clustered_loci output from riboSelect
required named arguments:
-o OUTPUT, --output OUTPUT
output directory; default:
/home/nicholas/GitHub/riboSeed
optional arguments:
-n NAME, --name NAME rename the contigs with this prefixdefault: date
(YYYYMMDD)
-l FLANKING, --flanking_length FLANKING
length of flanking regions, in bp; default: 1000
--msa_kmers calculate kmer similarity based on aligned sequences
instead of raw sequences;default: False
-c, --circular if the genome is known to be circular, and an region
of interest (including flanking bits) extends past
chromosome end, this extends the seqence past
chromosome origin forward by 5kb; default: False
-p PADDING, --padding PADDING
if treating as circular, this controls the length of
sequence added to the 5' and 3' ends to allow for
selecting regions that cross the chromosom's origin;
default: 5000
-v VERBOSITY, --verbosity VERBOSITY
1 = debug(), 2 = info(), 3 = warning(), 4 = error()
and 5 = critical(); default: 2
--clobber overwrite previous output filesdefault: False
--no_revcomp default returns reverse complimented seq if majority
of regions on reverse strand. if --no_revcomp, this is
overwriddendefault: False
--skip_check Dont bother calculating Shannon Entropy; default:
False
--msa_tool {mafft,prank}
Path to PRANK executable; default: mafft
--prank_exe PRANK_EXE
Path to PRANK executable; default: prank
--mafft_exe MAFFT_EXE
Path to MAFFT executable; default: mafft
--barrnap_exe BARRNAP_EXE
Path to barrnap executable; default: barrnap
--makeblastdb_exe MAKEBLASTDB_EXE
Path to makeblastdb executable; default: makeblastdb
--kingdom {mito,euk,arc,bac}
kingdom for barrnap; default: bac
-h, --help Displays this help message
ribo stack
¶
Because assembly using short reads often collases rDNA repeats, it is
not uncommon to find a reference genome that has less than the actual
number of rDNAs. riboStack uses bedtools
and samtools
to
determine the coverage across rDNA regiosn, adn compares that coverage
depth to 10 sets of randomly selected non-rDNA regions. If the number of
rDNAs in the reference matches the number of rDNAs in your sequecned
isolate, the coverage should be pretty similar. However, if the coverage
in your rDNA regions is significantly higher, than there are likely more
rDNAs in your sequenced isoalte that there are in the reference, which
is something to be aware of.
It requires a mapping BAM file and the riboScan output directory as input.
Utilities¶
riboSwap.py
¶
Infrequently, riboSeed has joined together contigs that appear
incorrect according to your reference. If you are at all unhappy with a
bridging, ribo swap
allows swapping of a “bad” contig for one or
more syntenic contigs from the de novo assembly. #### USAGE
usage: ribo swap -o OUTPUT [-v {1,2,3,4,5}] [-h]
de_novo_file de_fere_novo_file bad_contig good_contigs
Given de novo and de fere novo contigs files, a misjoined de fere novo contig
name, and a colon:separated list of de novo contig names, replace the
offending contig with the de novo contig(s)
positional arguments:
de_novo_file multifasta containing de novo contigs
de_fere_novo_file multifasta containing de fere novo contigs
bad_contig name of the bad contig
good_contigs colon separated good contigs for replacement
required named arguments:
-o OUTPUT, --output OUTPUT
output directory; default:
/home/nicholas/GitHub/riboSeed
optional arguments:
-v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
Logger writes debug to file in output dir; this sets
verbosity level sent to stderr. 1 = debug(), 2 =
info(), 3 = warning(), 4 = error() and 5 = critical();
default: 2
-h, --help Displays this help message
seedRand.py
¶
There is no convenient unix command to generate seeded random numbers from the command line. This standalone script uses numpy (if availible) or the built-in random module to generate n random numbers given a seed.
Note: numpy should give you the same random numbers given the same seed across platforms: this is not the case with python’s build-in random module.
usage: seedRand.py [-h] seed n
Given a seed, return a pseudrando integer between 1 and 9999, separated by
newlines, to stdout. usage : `seedRand.py 27 10` would return 10 random
numbers seeded with 27
positional arguments:
seed seed
n number of random numbers to return, must be > 0
optional arguments:
-h, --help show this help message and exit