Blast


Blast stands for Basic Local Alignment Search Tool. Blast allows us to compare a nucleotide sequence (or group of nucleotide sequences), known as the query, against another group of nucleotide sequences known as the database. Blast is useful in many situations when analyzing genomic DNA, such as determining whether a certain gene is represented in a certain species. In metagenomics, Blast is especially useful for comparing a read or set of reads against a reference genome and determining how much and which sequences of the reference genome are represented in the original metagenomic data.




Online Resources

Here are some sites you can use to learn more about BLAST

The Wikiomics BLAST Tutorial describes the differences between blastn, blastx, and blastp; it also describes the command line options for BLAST. Please refer to this guide for general information regarding running BLAST.

The

Official BLAST Quick Start Guide

is also a good resource for information about running BLAST.



Usage

You use Blast to compare a collection of DNA sequences in Fasta format against a "blast database".

You can create this database from your own data, or you can use a pre-made database such as the NR database.

You will be comparing fasta files with each other so make sure your reads and your reference genome (or other sequences) are in fasta format.

There are a couple of steps to using Blast:

  1. Run the format DB command to create the db (if needed)

  2. Run Blast to generate the alignment information

  3. Analyze the Results



Example

Here is an example PBS script for submitting a BLAST job:

#!/bin/bash
#PBS -N total_kc_01
#PBS -m ea
#PBS -M bmf@email.arizona.edu

#PBS -W group_list=rmaier
#PBS -q standard

### Set the number of cpus that will be used.
#PBS -l select=1:ncpus=12:mem=23gb
### Specify up to a maximum of 1600 hours total cpu time for 1-processor job
#PBS -l cput=50:0:0

### Specify up to a maximum of 240 hours walltime for the job
#PBS -l walltime=7:0:0

/uaopt/ncbi/blast-2.2.26+/bin/blastx -num_threads 12 -db /genome/nr -query /homeB/home4/u32/bmf/datasets/main/kartchner/KC_meta_01.fna > /scr5/bmf/results/blast/total/KC_meta_01.blastx

This example is a standard PBS script. The last line is the BLAST command.

/uaopt/ncbi/blast-2.2.26+/bin/blastx is the location of the BLAST executable. Change this if necessary.

The -num_threads option defines how many threads BLAST will use to run the job. Set this equal to the number of cpus you have reserved in the PBS directive (12 cpus in this case: select=1:ncpus=12:mem=23gb).

The -db option allows you to define the location of the BLAST database. On the HTC/Cluster systems, the NCBI NR database is located at /genome/nr. Change this to your own database if necessary.

The -query option chooses the FASTA file to use as the BLAST query. Change this to your BLAST file accordingly.

The > operator redirects the output of the command to the file following the > symbol. In this case it will save a text file containing the results of the BLAST command to /scr5/bmf/results/blast/total/KC_meta_01.blastx.



Blasting a Large Dataset

The time needed to BLAST a very large dataset can be considerably reduced by breaking the dataset into pieces and Blasting them concurrently.

I recommend breaking them up into 1000 read per file.

Use the split-fasta.rb script to split up the fasta file. Here is an example command to split up a fasta file:

split-fasta.rb -l -v -n 1000 -d my_output_dir my_fasta_file.fna

In this example, the -l and -v switches enable logging and verbose output. The -n switch tells the script we want each file to contain 1000 reads. -d sets the output directory to "my_output_dir". The script processes the "my_fasta_file.fna" file.

To submit many fasta files to the PBS system at once, use the multi-blast.rb script. This script is made to submit fasta files created with the split-fasta.rb script.

Here is an example using the multi-blast.rb script:

multi-blast.rb -i /xdisk/bmf/datasets/split_data -o /xdisk/bmf/results/split_data -b my_fasta_file_ 1 2 3 4 5 6 7 8 9

or

multi-blast.rb -i /xdisk/bmf/datasets/split_data -o /xdisk/bmf/results/split_data -b my_fasta_file_ {1..9}

In this example, the -i option sets input directory to /xdisk/bmf/datasets/split_data. This should be the directory that was used as the output directory from the split-fasta.rb script.

The -o option sets the output directory. In this example, it is set to /xdisk/bmf/results/split_data.

The -b option sets the "base name" (e.g. my_fasta_file_). You can find this by looking at the fasta files generated from the split-fasta.rb script. Everything up to but excluding the number part of the file name is the "base name". E.g., if you have my_fasta_file_1, my_fasta_file_2, ..., your base name is my_fasta_file_.

After the options, list the numbers of the fasta file you would like to BLAST. To do a series of numbers, you can use the {0..9} syntax. You can do something like {1..3}{0..9} to do the numbers 100-399.

Hint: to see which numbers are generated enter echo {0..9} into the terminal. That way you can see what numbers are generated by more complicated sequences such as {0..2}{0..9} before submitting the jobs.

Important: Make sure to use absolute paths to your data for the multi-blast.rb script. E.g. /xdisk/bmf/data/my_data (absolute) vs. data/my_data (relative). If you use relative paths the PBS system may not find your data resulting in your jobs failing.

Once the files are finished processing, you can use them as desired.



Considerations when BLASTing Large Datasets

You must take into careful consideration the amount of space data you will be generating. For example, a metagenomic dataset consisting of 500,000 reads might be split into 500 files of 1000 reads per file. If we assume we are BLASTing against the NCBI NR database, we might generate around 150 MB per file. That amounts to 75 GB of resulting data. Be sure the location in which the files will be stored can hold that amount of data.