Note: Newbler is available for use on the HPC systems
Newbler is a software package for de novo DNA sequence assembly. It is designed specifically for assembling sequence data generated by the 454 GS-series of pyrosequencing platforms sold by 454 Life Sciences, a Roche Diagnostics company.
Newbler is a useful tool for assembling your 454 (or other pyrosequencing) data.
Make sure you are using Newbler v2.5 or newer! Previous versions had a bug that greatly reduced their effectiveness.
There are several ways to run Newbler.
You can set up and run an individual job like this:
runAssembly [options] filename.sff
Options can be zero or more of the following:
This is the directory in which Newbler will save your output files. This directory must NOT EXIST when you run Newbler, or Newbler will exit with an error. Newbler will create the directory and fill it with files. Default: None. This argument is required.
Trims primers, adapters, and polyA tails from the start or end of reads.
Removes reads that match a cloning vector (such as E. coli). -vt / -vs also match the reverse complements of the given sequences.
Minimum contig length for all contigs. Default: 100
Minimum contig length for large contigs (Newbler classifies some contigs as "large"). Default: 500
Speeds up assembly but reduces accuracy. Intended for large genomes. Cannot be used with the -cdna option.
Keeps sequence data in memory, which increases speed but requires more RAM. With a large assembly this is likely to use all your available RAM. I recommend allocating at least 8 GB of RAM to use this.
Number of CPUs to use. Leave this as default unless you have a reason for giving it fewer CPUs than are available. Default: all
Minimum length of reads to use in assembly. Default: 50. Minimum: 15
Outputs each read in only one contig. This prevents any single read from appearing in more than one contig.
Disables the default quality and primer trimming of input reads.
Input file contains paired-end reads.
Does not group duplicates; treats each read separately. Default: groups duplicates
Set seed step parameter
Set seed length parameter
Set seed count parameter
Set minimum overlap length
Set minimum overlap identity
Skips output of large files (.ace, 454AlignmentInfo.tsv). Default: no
Creates a subdirectory with .ace and .phd files and an sff_dir for consed. Default: no. Warning: the consed files can be 1-5 GB each per folder. Keep that in mind if you are creating several of them, and don't generate the consed files unless you need them.
For transcriptome (cDNA) assembly:
Note: Newbler first collects contigs into isogroups, and then creates isotigs from them.
Max contigs in an isogroup Default: 500
Max number of isotigs in an isogroup Default: 100
Max number of contigs in one isotig Default: 100
Isotig contig length threshold, below which traversal stops Default: 3 bp
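Since the output directory must not exist when Newbler starts, it can be worth checking for it in the job script before the assembler is launched. The following is only a minimal sketch: the function name check_outdir is made up for illustration, and the directory and input names are borrowed from the example scripts on this page.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the given output directory does NOT
# exist yet, because Newbler exits with an error when it is already there.
check_outdir() {
    if [ -e "$1" ]; then
        echo "Output directory '$1' already exists; remove it or pick another name." >&2
        return 1
    fi
    return 0
}

# Example guard. The echo stands in for the real runAssembly call.
outdir="results_defaults_consed"
if check_outdir "$outdir"; then
    echo "Safe to run: runAssembly -o $outdir KC_meta.fna"
fi
```

Placing a check like this before the runAssembly line lets a resubmitted job stop with a clear message instead of failing inside Newbler.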
Newbler can also be run with a graphical interface using the command:
To use this, you will need some sort of X window software installed on your computer. On an Apple computer, you can install X11 (included with the developer tools, available on the OS X disc or from Apple's website). On Windows, you can use Xming which is available here:
The rest of the instructions refer to the individual-job mode using the command-line interface.
Newbler's default parameters have been tuned to offer the best results for a wide variety of input data.
Here is an example PBS script for running Newbler with the default assembly options (the only option added is -consed, to generate the consed output):
#!/bin/bash
#PBS -N asmbl
#PBS -m ea
#PBS -M email@example.com
#PBS -W group_list=rmaier
#PBS -q default
### Set the number of cpus that will be used.
#PBS -l select=1:ncpus=8:mem=2gb
### Specify up to a maximum of 1600 hours total cpu time for 1-processor job
#PBS -l cput=12:0:0
### Specify up to a maximum of 240 hours walltime for the job
#PBS -l walltime=12:0:0
### Specify working directory with sff or fasta/qual files (you need to set this)
cd /homeB/home4/u32/bmf/results/newbler
/uaopt/roche454/2.5.3/bin/runAssembly -consed -o results_defaults_consed KC_meta.fna
The cd command changes directories to our results folder. This allows us to specify the output directory (results_defaults_consed) and the input file (KC_meta.fna) without including the full absolute path. I always recommend specifying the absolute path to the program's executable (/uaopt/roche454/2.5.3/bin/runAssembly in this case), because otherwise PBS may not be able to find the program when it launches the job. This is because the PBS system won't use the same PATH information as your own Bash profile.
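One way to catch a wrong program path early is to test it before submitting the job. This is a rough sketch rather than anything PBS or Newbler provides: check_program is a hypothetical helper name, and the path is the one used in the example script above, so adjust it for your own system.

```shell
#!/bin/bash
# Path taken from the example script; change it to match your installation.
NEWBLER=/uaopt/roche454/2.5.3/bin/runAssembly

# Hypothetical helper: succeeds only if the argument is an absolute path
# to a file that exists and is executable.
check_program() {
    case "$1" in
        /*) ;;  # starts with a slash, so it is an absolute path
        *)  echo "Not an absolute path: $1" >&2
            return 1 ;;
    esac
    if [ ! -x "$1" ]; then
        echo "Not found or not executable: $1" >&2
        return 1
    fi
    return 0
}

if check_program "$NEWBLER"; then
    echo "Found $NEWBLER"
else
    echo "Fix the program path before submitting the job." >&2
fi
```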
Newbler's options can also be tuned to offer more customized performance for specific datasets. Here is a PBS script for running Newbler with more stringent options:
#!/bin/bash
#PBS -N asmbl
#PBS -m ea
#PBS -M firstname.lastname@example.org
#PBS -W group_list=rmaier
#PBS -q default
### Set the number of cpus that will be used.
#PBS -l select=1:ncpus=8:mem=2gb
### Specify up to a maximum of 1600 hours total cpu time for 1-processor job
#PBS -l cput=12:0:0
### Specify up to a maximum of 240 hours walltime for the job
#PBS -l walltime=12:0:0
### Specify working directory with sff or fasta/qual files (you need to set this)
cd /homeB/home4/u32/bmf/results/newbler
/uaopt/roche454/2.5.3/bin/runAssembly -consed16 -ml 80 -mi 95 -rip -o results_consed16 KC_meta.fna
This script sets the minimum overlap length to 80 (-ml 80), sets the minimum overlap identity to 95 (-mi 95), and restricts each read to appearing in only one contig (-rip).
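If you want to compare several settings, each run needs its own output directory, since Newbler will not write into an existing one. The sketch below only prints the command lines (a dry run) so they can be checked before being placed in job scripts. The helper name build_cmd, the output directory naming scheme, and the identity value 90 are illustrative assumptions; the remaining values come from the script above.

```shell
#!/bin/bash
# Path taken from the example script; change it to match your installation.
NEWBLER=/uaopt/roche454/2.5.3/bin/runAssembly

# Hypothetical helper: prints a runAssembly command line.
# Arguments: min overlap length, min overlap identity, output dir, input file.
build_cmd() {
    echo "$NEWBLER -consed16 -ml $1 -mi $2 -rip -o $3 $4"
}

# Dry run: one command per identity setting, each with its own output directory.
for mi in 90 95; do
    build_cmd 80 "$mi" "results_ml80_mi${mi}" KC_meta.fna
done
```

Encoding the parameter values in the directory name keeps the runs from colliding and makes it easier to tell later which settings produced which results.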