Jargon


Here is some common terminology related to bioinformatics:


metagenomic - containing genomic data from a variety of organisms rather than a single organism, e.g. an environmental sample in which all of the DNA was sequenced together

read - an individual DNA string output by the sequencer. Reads are either raw or only lightly processed after sequencing (such as for quality control)

contig - a sequence of base pairs made up of one or more reads that have been assembled (joined together). The reads form a contiguous length of DNA sequence
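
To make the idea of joining reads concrete, here is a minimal sketch that merges two reads when the end of one matches the start of the other. The function name merge_reads and the min_overlap parameter are made up for illustration; real assemblers such as Newbler also handle sequencing errors, reverse complements, and many reads at once.

```python
from typing import Optional


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads if a suffix of `left` equals a prefix of `right`.

    Hypothetical helper for illustration only; it requires an exact match.
    """
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            # Keep the shared bases once and append the rest of `right`.
            return left + right[size:]
    return None  # no overlap of at least min_overlap bases


contig = merge_reads("ATGCGTAC", "GTACCTTA")
print(contig)  # ATGCGTACCTTA
```

The two reads share the four bases GTAC, so they form one contiguous sequence of 12 bases.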

assembly - the full set of contigs produced by assembling a set of reads

assembler - a piece of software that takes as input a file of reads, compares and analyzes them all, and then outputs a file of contigs that together form the assembly. e.g. Newbler

alignment - the process of lining up one set of reads or contigs against another to find regions where the sequences match or are similar

BLAST (Basic Local Alignment Search Tool) - performs sequence alignment and returns information based on the similarity of matching sequences. Compares a "query" sequence (or group of sequences) with a "database" sequence (or group of sequences)

query - when performing an alignment, the sequences that are being searched with (e.g. your reads)

database - when performing an alignment, the sequences that will be searched against (e.g. reference genome contigs)
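
The query and database roles can be sketched with a toy lookup. This is not BLAST itself: it only finds exact matches of short words (k-mers), which is roughly the first step BLAST takes before it extends and scores hits. The names build_index and find_seed_hits, and the example sequences, are assumptions made up for this illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_index(database: Dict[str, str], k: int) -> Dict[str, List[Tuple[str, int]]]:
    """Map every k-mer in the database to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seed_hits(query: str, index, k: int) -> List[Tuple[int, str, int]]:
    """Return (query position, database sequence name, database position) for each exact k-mer match."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, dpos))
    return hits


# The database is what you search against; the query is what you search with.
database = {"contig1": "ATGCGTACCTTA", "contig2": "GGGGCCCC"}
index = build_index(database, k=4)
print(find_seed_hits("CGTACC", index, k=4))
# [(0, 'contig1', 3), (1, 'contig1', 4), (2, 'contig1', 5)]
```

The database is indexed once and can then be searched with many different queries, which is why the two roles are kept separate.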

pyrosequencing - a high-throughput next-generation sequencing technique used by the 454 machines. It works by detecting the pyrophosphate released each time a nucleotide is incorporated

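shotgun sequencing - a method in which DNA is broken at random into many small fragments that are sequenced separately; the overlapping reads are then assembled computationally. (Chain termination with dideoxynucleotides is the Sanger method, which was often used to sequence the individual fragments)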

singleton - a read that does not overlap any other read in a metagenomic dataset, and so is not joined into any contig

comparative metagenomics - comparing the characteristics of different metagenomic samples

FASTA format - a common plaintext format for DNA (or protein) sequences. Each record is a header line beginning with ">" followed by one or more lines of sequence
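
As a rough illustration of how simple the format is, the sketch below reads FASTA text into a dictionary. The function name parse_fasta and the example records are made up here; for real work a library such as Biopython is a safer choice, since it deals with edge cases this sketch ignores.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {sequence name: sequence}.

    Minimal sketch: the name is the first word after ">", and the
    sequence lines that follow are concatenated.
    """
    records: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


example = ">read1 sample A\nATGC\nGTAC\n>read2\nTTAA\n"
print(parse_fasta(example))  # {'read1': 'ATGCGTAC', 'read2': 'TTAA'}
```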

ACE format - a format for storing assemblies, i.e. contigs together with the reads that make them up, rather than raw sequence data alone; originally used by Consed

SFF files - Standard Flowgram Format files, created by 454 machines. They include quality info and other metadata as well as the DNA sequences themselves

consensus sequence - the single sequence formed after assembling many separate overlapping sequences, typically by taking the most common base at each position
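
One simple way to picture a consensus is a majority vote at each position. The sketch below assumes the sequences are already aligned to the same length, with "-" marking a gap; the function name consensus and the example reads are made up for illustration, and real assemblers also weigh base quality scores.

```python
from collections import Counter
from typing import List


def consensus(aligned: List[str]) -> str:
    """Return the most common base in each column of pre-aligned sequences.

    Assumes all sequences have equal length and use "-" for gaps.
    Ties go to whichever base appears first in the column.
    """
    result = []
    for column in zip(*aligned):
        counts = Counter(base for base in column if base != "-")
        # A column made only of gaps stays a gap.
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)


reads = ["ATGCA", "ATGGA", "ATGCA"]
print(consensus(reads))  # ATGCA
```

The second read disagrees at the fourth position (G instead of C), but it is outvoted by the other two.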

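repeat - a sequence that occurs more than once in the dataset, e.g. a read that has already been seen. Repeats make assembly harder because copies from different places look alike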

chimera - a contig that is pieced together from reads that do not actually belong together (e.g. reads from different organisms), so it does not fit in the assembly; this is a false positive in the sequence matching


Here is some common terminology related to using the University HPC systems:


HPC - University of Arizona's High Performance Computing center. You can access their website here: http://uits.arizona.edu/research-computing/hpc. They run several different systems that are used around campus for data processing and other research.

HTC/Cluster - Two of the HPC systems the University provides. Because they all (or mostly) run Unix or another POSIX OS, the information here applies directly to whichever one you use.

POSIX - The operating system specifications that define the programming interface, shell interface, and low-level OS utilities in Unix that other software can rely on. In a very general sense, it allows you to run the same software on any OS that conforms to these standards, e.g. Mac OS X, BSD, or one of the many varieties of Linux.

Cygwin - A POSIX-compliant development and run-time environment for Windows. It allows you to use a Unix-like environment on your Windows computer.

shell - The Unix command line. There are several commonly used shell programs. The most popular is called Bash, and it is the default shell on most Linux computers and on OS X. The HPC systems use the Korn shell by default, although you can ask to have your default shell changed. It is important to know which shell you are using so that when you have problems you can look up the right documentation. Bash is the most widely used shell and is highly recommended.
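
If you are not sure which shell your account uses, a small script can report it. This sketch relies on the SHELL environment variable, which Unix-like systems normally set to your login shell; the function name login_shell is made up here, and the value may differ from the shell you are currently typing in if you started another one by hand.

```python
import os


def login_shell() -> str:
    """Return the name of the login shell from $SHELL, or "unknown" if it is not set."""
    path = os.environ.get("SHELL", "")
    # $SHELL holds a full path such as /bin/bash; keep only the program name.
    return os.path.basename(path) or "unknown"


print(login_shell())
```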