Newbler

Note: Newbler is available for use on the HPC systems.




General Information

Newbler is a software package for de novo DNA sequence assembly. It is designed specifically for assembling sequence data generated by the 454 GS-series of pyrosequencing platforms sold by 454 Life Sciences, a Roche Diagnostics company.

Newbler is a useful tool for assembling your 454 (or other pyrosequencing) data.

Important

Make sure you are using Newbler v2.5 or newer! Earlier versions had a bug that greatly reduced their effectiveness.



Newbler Usage

There are several ways to run Newbler.

Individual Job

You can set up and run an individual job like this:

runAssembly [options] filename.sff

Options can be zero or more of the following:

-o output-directory

This is the directory in which Newbler will save your output files. This directory must NOT EXIST when you run Newbler, or Newbler will exit with an error. Newbler will create the directory and fill it with files. Default: None. This argument is required.

-vt trimmingFile.fasta

Trims primers, adapters, and poly-A tails from the start or end of reads.

-vs screeningFile.fasta

Removes reads that match a cloning vector or other contaminant (such as E. coli). Both -vt and -vs also match the reverse complements of the given sequences.

-a NUM

Minimum contig length for all contigs. Default: 100

-l NUM

Minimum contig length for large contigs (Newbler classifies some contigs as "large"). Default: 500

-large

Speeds up assembly but reduces accuracy. Intended for large genomes. Cannot be combined with the -cdna option.

-m

Keeps sequence data in memory, which increases speed but requires more RAM. With a large assembly this is likely to use all your available RAM; I recommend allocating at least 8 GB of RAM to use this.

-cpu NUM

Number of CPUs to use. Leave this at the default unless you have a reason to give it fewer CPUs than are available. Default: all

-minlen NUM

Minimum length of reads to use in the assembly. Default: 50. Minimum: 15

-rip

Outputs each read in only one contig, which prevents any single read from appearing in more than one contig.

-notrim

Disable default quality and primer trimming of input reads

-p FILENAME

Input file contains paired-end reads

-ud

Does not group duplicates; treats each read separately. Default: groups duplicates

-ss

Set seed step parameter

-sl

Set seed length parameter

-sc

Set seed count parameter

-ml

Set minimum overlap length

-mi

Set minimum overlap identity

-nobig

Skip output of large files (.ace, 454AlignmentInfo.tsv). Default: no

-consed

Creates a subdirectory containing the .ace and .phd files and an sff_dir for consed. Default: no. Warning: the consed files can be 1-5 GB each per folder. Keep that in mind if you are creating several of them, and don't generate the consed files unless you need them.
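Because Newbler exits with an error when the directory given to -o already exists, it can help to check for that before launching a long job. The following is only a minimal sketch and not part of Newbler itself: the function name outdir_is_free and the names my_assembly_out and filename.sff are hypothetical placeholders.

```shell
#!/bin/bash
# Hypothetical helper (not part of Newbler): succeed only if the
# requested output directory does not exist yet.
outdir_is_free() {
    local outdir="$1"
    if [ -e "$outdir" ]; then
        echo "Output directory '$outdir' already exists; pick another name or remove it." >&2
        return 1
    fi
    return 0
}

# Example use (placeholder names): only start the assembly when the
# output directory is free.
# outdir_is_free my_assembly_out && runAssembly -o my_assembly_out filename.sff
```

Checking first avoids waiting in the queue only to have the job fail immediately on an existing directory.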



The following options are for using Newbler to process transcriptome data

Note: Newbler first collects contigs into isogroups, and then creates isotigs.

-cdna

For transcriptome (cDNA) assembly

-ig

Max contigs in an isogroup. Default: 500

-it

Max number of isotigs in an isogroup. Default: 100

-icc

Max number of contigs in one isotig. Default: 100

-icl

Isotig contig length threshold, below which traversal stops. Default: 3 bp
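The option list notes that -large cannot be combined with -cdna. As a hedged illustration, a small pre-flight check could catch that combination before a job is submitted. The function name options_compatible is hypothetical, and the check covers only this one restriction, not every rule Newbler may enforce.

```shell
#!/bin/bash
# Hypothetical pre-flight check (not part of Newbler): reject an
# argument list that contains both -large and -cdna.
options_compatible() {
    local has_large=0 has_cdna=0 arg
    for arg in "$@"; do
        case "$arg" in
            -large) has_large=1 ;;
            -cdna)  has_cdna=1 ;;
        esac
    done
    if [ "$has_large" -eq 1 ] && [ "$has_cdna" -eq 1 ]; then
        echo "-large cannot be used together with -cdna" >&2
        return 1
    fi
    return 0
}

# Example use (placeholder file names):
# options_compatible -cdna -o cdna_out reads.sff && runAssembly -cdna -o cdna_out reads.sff
```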



GUI (Graphical Interface)

Newbler can also be run with a graphical interface using the command:

gsAssembler

To use this, you will need X Window System server software installed on your computer. On an Apple computer, you can install X11 (included with the developer tools, available on the OS X disc or from Apple's website). On Windows, you can use Xming, which is available here:

http://www.straightrunning.com/XmingNotes/
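If you connect to the HPC system over SSH, X forwarding is typically requested with ssh -X. The sketch below assumes that setup and simply checks that the DISPLAY variable is set before starting the GUI. The function name have_display is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if an X display appears to be
# available in this session (DISPLAY is set and non-empty).
have_display() {
    [ -n "${DISPLAY:-}" ]
}

# Example use:
# if have_display; then
#     gsAssembler
# else
#     echo "No X display found; reconnect with X forwarding enabled." >&2
# fi
```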

Important

The rest of the instructions refer to the individual-job mode using the command line interface.



Running Newbler

Newbler's default parameters have been tuned to offer the best results for a wide variety of input data.
Here is an example PBS script for running Newbler with the default options:

#!/bin/bash
#PBS -N asmbl
#PBS -m ea
#PBS -M bmf@email.arizona.edu
#PBS -W group_list=rmaier
#PBS -q default

### Set the number of cpus that will be used.
#PBS -l select=1:ncpus=8:mem=2gb
### Specify up to a maximum of 1600 hours total cpu time for 1-processor job
#PBS -l cput=12:0:0

### Specify up to a maximum of 240 hours walltime for the job
#PBS -l walltime=12:0:0


### Specify working directory with sff or fasta/qual files (you need to set this)
cd /homeB/home4/u32/bmf/results/newbler

/uaopt/roche454/2.5.3/bin/runAssembly -consed -o results_defaults_consed KC_meta.fna


Explanation

The cd command changes directories to our results folder. This allows us to specify the output directory (results_defaults_consed) and the input file (KC_meta.fna) without including the full absolute path. I always recommend specifying the absolute path to the program's executable (/uaopt/roche454/2.5.3/bin/runAssembly in this case), because otherwise PBS may not be able to find the program when it launches the job. This is because the PBS system won't use the same PATH information as your own Bash profile.
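As a hedged illustration of the point about absolute paths, a job script could confirm that the executable is really there before starting the assembly. The helper name is_runnable is hypothetical; the path is the one used in the script above.

```shell
#!/bin/bash
# Absolute path to the Newbler executable, as used in the PBS script above.
NEWBLER=/uaopt/roche454/2.5.3/bin/runAssembly

# Hypothetical helper: succeed only if the path is an executable regular file.
is_runnable() {
    [ -f "$1" ] && [ -x "$1" ]
}

if ! is_runnable "$NEWBLER"; then
    # In a real job script you would probably stop here with "exit 1".
    echo "Cannot find an executable runAssembly at $NEWBLER" >&2
fi
```

Failing early with a clear message is easier to diagnose than a "command not found" line buried in the PBS error file.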


Newbler's options can also be tuned to offer more customized performance for specific datasets. Here is a PBS script for running Newbler with more stringent options:

#!/bin/bash
#PBS -N asmbl
#PBS -m ea
#PBS -M bmf@email.arizona.edu
#PBS -W group_list=rmaier
#PBS -q default

### Set the number of cpus that will be used.
#PBS -l select=1:ncpus=8:mem=2gb
### Specify up to a maximum of 1600 hours total cpu time for 1-processor job
#PBS -l cput=12:0:0

### Specify up to a maximum of 240 hours walltime for the job
#PBS -l walltime=12:0:0


### Specify working directory with sff or fasta/qual files (you need to set this)
cd /homeB/home4/u32/bmf/results/newbler

/uaopt/roche454/2.5.3/bin/runAssembly -consed16 -ml 80 -mi 95 -rip -o results_consed16 KC_meta.fna

This script sets the minimum overlap length to 80 and the minimum overlap identity to 95, and limits each read to appearing in only one contig.
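One simple way to compare the default and stringent runs is to count the contigs each one produced. The sketch below counts FASTA header lines in a file. It assumes each output directory contains a contig FASTA file named 454AllContigs.fna, which is not stated on this page, so check the file name and location against your own output. The function name count_contigs is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count sequences in a FASTA file by counting
# the header lines, which start with ">".
count_contigs() {
    grep -c '^>' "$1"
}

# Example use (file name and location are assumptions):
# count_contigs results_defaults_consed/454AllContigs.fna
# count_contigs results_consed16/454AllContigs.fna
```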

Related Links:

Newbler Official Page

Wikipedia: Newbler

How Newbler Works