
University of Wisconsin-Madison, version 10/2008

EMBOSS Software for sequence analysis

Professor Ann Palmenberg,

Institute for Molecular Virology & Department of Biochemistry [email protected]

Dr. Jean-Yves Sgro

Biotechnology Center & Institute for Molecular Virology [email protected]

Fall 2008

Biochemistry 711 – Book 3

This labbook is Copyright © 1997-2008 A.C. Palmenberg & J.-Y. Sgro, University of Wisconsin-Madison. All Rights Reserved (October 2008)


Biochemistry 711 - 2008

Biochem 711 – 2008 i


Foreword and Acknowledgements

The original laboratory exercises resulted from a long-term commitment to promote and foster genetic computing on the Madison campus by the Genetics Computer Group, Inc. (GCG) and its standing collaborative teaching efforts with Ann Palmenberg. John Devereux and Maggie Smith provided, through GCG, the original UNIX-based hardware and software licenses necessary to create the first such curriculum for UW students. We are thankful for their largess in providing the funding for the purchase and yearly upgrades of the original UW UNIX-based teaching computer.

The GCG exercises of this lab book were inspired by the original educational tutorials developed by Barbara Butler to teach this complex family of software programs. She has generously shared her materials and her knowledge for the benefit of UW students and staff.

GCG has now been replaced by open source software, and the exercises have been adapted to this new package: EMBOSS, the European Molecular Biology Open Software Suite. We want to express special thanks to Ms. Marchel Hill, a course instructor, who has helped translate the GCG exercises to an EMBOSS equivalent and has unselfishly volunteered many hundreds of hours of her time, as well as her teaching skills, towards tutoring UW students, both inside and outside of the scheduled classes. Ann and Jean-Yves would also like to acknowledge Joshua Harder at the Digital Media Center (DMC) for the maintenance of the desktop computing classroom and John Koger for installing EMBOSS on both the Macintosh and Windows partitions.

The goal of these exercises is to provide an introduction to sequence analysis that will help students acquire the expertise beneficial to their research programs. Two key lessons are (1) that computers are nothing to be afraid of, and (2) that they will only do what they are told. In this modern age of genomics, “what can I DO with my sequence, now that I have it?” and “how can I put my sequence into biological perspective?” are very important questions for the learned biologist. If by taking this lab course you simply increase your confidence when using a computer, it will be time well spent!


The BLOSUM62 matrix

BLOSUM (BLOcks SUbstitution Matrix) is a family of substitution matrices used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences, and they are based on local alignments. They were first introduced in a paper by Henikoff and Henikoff [1], who scanned the BLOCKS database for very conserved regions of protein families (regions that do not have gaps in the sequence alignment) and then counted the relative frequencies of amino acids and their substitution probabilities. They then calculated a log-odds score for each of the 210 possible substitution pairs of the 20 standard amino acids. All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins like the PAM matrices.

[1] Henikoff, S., Henikoff, J.G. (1992). "Amino Acid Substitution Matrices from Protein Blocks". Proc Natl Acad Sci 89 (22): 10915–10919. doi:10.1073/pnas.89.22.10915. PMID 1438297

Source: http://en.wikipedia.org/wiki/BLOSUM
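The figure of 210 and the idea of a log-odds score can be checked with a few lines of Python. This is only a sketch: the function name and the frequencies used in the usage note are made up for illustration and are not taken from the BLOCKS database. It assumes the commonly described BLOSUM scaling, in which the score is twice the base-2 logarithm of the observed-to-expected ratio, rounded to an integer.

```python
import math
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

# Unordered residue pairs, identities included: 20 + (20 * 19) / 2 = 210.
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))


def log_odds_half_bits(observed_pair_freq, freq_a, freq_b, same_residue):
    """Log-odds score: 2 * log2(observed / expected), rounded to an integer.

    The expected frequency of an unordered pair of two different residues
    is 2 * freq_a * freq_b; for a pair of identical residues it is freq_a ** 2.
    """
    expected = freq_a * freq_b if same_residue else 2 * freq_a * freq_b
    return round(2 * math.log2(observed_pair_freq / expected))
```

With hypothetical numbers, a pair seen four times more often than expected by chance, as in `log_odds_half_bits(0.02, 0.05, 0.05, same_residue=False)`, receives a positive score, while a pair seen half as often as expected receives a negative one.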


Introduction to EMBOSS

Table of Contents

Introduction: The EMBOSS Package
1. History
2. Overview
3. License
4. The EMBOSS software organization
4.1. Applications
4.2. Platforms & Interface
4.3. Accessing the line-command
5. Download and installation
5.1. Windows
5.2. Macintosh
6. Manual, documentation and help
7. Tutorial

EMBOSS Graphical Output

EMBOSS Commands Organized by Functional Group

GCG to EMBOSS Commands Equivalence



Introduction: The EMBOSS Package

1. History
The Genetics Computer Group package (GCG, or the Wisconsin Package), which originated in Madison¹, was a pioneering software package for sequence analysis that became commercial in 1992. EGCG, developed from 1988 by a group within EMBnet², provided extensions to the GCG package. Because of changes in the source code distribution rules of GCG, and other factors, the former EGCG developers created a totally new generation of academic sequence analysis software: the present EMBOSS project.

2. Overview
EMBOSS is "The European Molecular Biology Open Software Suite". EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology community […]. EMBOSS also integrates a range of currently available packages and tools for sequence analysis into a seamless whole. EMBOSS breaks the historical trend towards commercial software packages³.

Citation: Rice, P., Longden, I. and Bleasby, A. (2000) EMBOSS: The European Molecular Biology Open Software Suite. Trends in Genetics 16(6), pp. 276-277

3. License
EMBOSS is licensed for use by everyone under the GNU General Public Licence (GPL) and GNU Library General Public Licence (LGPL). No one individual or institute 'owns' the code. For developers who have their own licensing conditions already in effect […] the EMBASSY collection can include packages that use the EMBOSS core libraries and interfaces but under their own licensing conditions. They will be bound by the Library GPL […], but not necessarily by the full GPL. For more information see http://emboss.sourceforge.net/licence/

¹ Devereux J, Haeberli P, Smithies O. A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 1984 Jan 11;12(1 Pt 1):387-95.
² EMBnet (http://www.embnet.org/) is the only organisation world-wide bringing bioinformatics professionals to work together to serve the expanding fields of genetics and molecular biology.
³ Rice, P., Longden, I. and Bleasby, A. "EMBOSS: The European Molecular Biology Open Software Suite". Trends in Genetics, June 2000, 16(6), pp. 276-277



4. The EMBOSS software organization

4.1. Applications
EMBOSS is a set of a few hundred programs (applications) that handle specific functions. The EMBOSS applications are organized into 45 logical groups according to their function (http://emboss.sourceforge.net/apps/groups.html). The groups cover the EMBOSS and EMBASSY (see above) sets of applications.

For example, the group ALIGNMENT GLOBAL contains 4 applications:

Table - Global sequence alignment
Program name  Description
est2genome    Align EST and genomic DNA sequences
needle        Needleman-Wunsch global alignment
stretcher     Finds the best global alignment between two sequences
esim4         Align an mRNA to a genomic DNA sequence

while the group ALIGNMENT LOCAL contains 5 applications:

Table - Local sequence alignment
Program name  Description
matcher       Finds the best local alignments between two sequences
seqmatchall   All-against-all comparison of a set of sequences
supermatcher  Match large sequences against one or more other sequences
water         Smith-Waterman local alignment
wordmatch     Finds all exact matches of a given size between 2 sequences
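The difference between the global (Needleman-Wunsch) and local (Smith-Waterman) groups can be illustrated with a small dynamic-programming sketch in Python. This is not how needle or water are implemented: the function name, the simple match/mismatch scores and the single linear gap penalty are assumptions chosen to keep the example short, whereas the EMBOSS programs use substitution matrices and separate gap opening and extension penalties.

```python
def alignment_score(seq_a, seq_b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score of two sequences with a linear gap penalty.

    local=False scores the sequences end to end (Needleman-Wunsch style);
    local=True scores the best matching sub-regions (Smith-Waterman style).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq_a[i - 1] == seq_b[j - 1]
            cell = max(
                score[i - 1][j - 1] + (match if same else mismatch),
                score[i - 1][j] + gap,   # gap in seq_b
                score[i][j - 1] + gap,   # gap in seq_a
            )
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]
```

For instance, comparing the made-up sequences "TTTTGATTACA" and "GATTACA" gives a local score of 7 (the shared GATTACA region) but a global score of -1, because the global alignment must also pay for the four unmatched leading residues.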

4.2. Platforms & Interface
EMBOSS exists for multiple computer platforms. All platforms can support the basic line-command version of EMBOSS, including the Microsoft Windows cmd (DOS) interface. The line-command applications are the core engine of EMBOSS. These commands can be called from multiple graphical user interface (GUI) variations that can be added over EMBOSS (some GUIs are not available for all platforms). The most common GUI is the Java-based Jemboss, which is part of the EMBOSS development. Jemboss assumes a client-server set-up, but in some cases it is available as a stand-alone application.



Some GUIs are specific to an operating system, such as EMBOSSrunner for MacOSX. Various web interface options also exist. Essentially, EMBOSS can be viewed as a layer over the operating system (OS). Similarly, the GUI can be viewed as another layer between EMBOSS and the user. The GUI is therefore useful but not essential to running EMBOSS. A list of all available GUIs is at http://emboss.sourceforge.net/interfaces/

4.3. Accessing the line-command
The line-command is the most basic way to interact with the operating system.

4.3.1. Macintosh
On a Macintosh it is available through the Terminal or the X11 terminal, both found within Applications > Utilities.

4.3.2. Windows
On a Windows system it is available within the DOS command window, started by the menu cascade Start > Run and then entering cmd within the resulting window.

This will open a new DOS command-line text window. Note: you may need Administrator privilege to install.

5. Download and installation

[Figure: EMBOSS viewed as layered software. Layers shown: User, GUI, EMBOSS applications, OS]



http://emboss.sourceforge.net/download/ is the official download information page. It points to the actual download site, an FTP site: ftp://emboss.open-bio.org/pub/EMBOSS/

Biologists should only consider the “stable release” and not bother with any developer release. It is somewhat assumed that the end-user will configure and compile the software from the source code, which should be practical on a Linux system.

5.1. Windows
Windows users will be pleased to find a Windows-only version of EMBOSS that installs together with Jemboss (the Java GUI) configured as a stand-alone application:

ftp://emboss.open-bio.org/pub/EMBOSS/windows/

The Windows version is called mEMBOSS and developers insist that any emails sent their way specify this fact and not EMBOSSWin or any other name.

5.2. Macintosh
Macintosh users install EMBOSS from fink (http://www.finkproject.org/). The simplest way to use fink is via the fink GUI called FinkCommander (part of the download package).

Search for “emboss” at the top right, and use the top left button, “install from binary”, to install it on your system:



6. Manual, documentation and help
The documentation page http://emboss.sourceforge.net/docs/ has limited information but provides other links. An online search might reveal manuals at various institutions.

The Fine Manual (tfm) is the online documentation for applications. It is called from the command line by typing tfm followed by the application name.

To find relevant applications, the command wossname is very useful: it will echo back a list of applications based on a single search word. (Note: in line-command, $ and % are typical prompts waiting for the user’s input.) For example:

$ wossname global
Finds programs by keywords in their short description
SEARCH FOR 'GLOBAL'
est2genome  Align EST sequences to genomic DNA sequence
needle      Needleman-Wunsch global alignment of two sequences
stretcher   Needleman-Wunsch rapid global alignment of two sequences

Therefore, to obtain information on the application needle for global alignment, the command would be:

$ tfm needle

Help can also simply be requested by adding -help after the name of the application, for example:

$ needle -help

Finally, the user can ask to be prompted for optional parameters by adding -opt after the name of the application:

$ needle -opt

7. Tutorial
A short online tutorial is available on the EMBOSS home page or by going directly to: http://emboss.sourceforge.net/docs/emboss_tutorial/emboss_tutorial.html



EMBOSS Graphical Output

(From http://www.ch.embnet.org/EMBOSS/introduction.html) EMBOSS applications that create a graphical output (interactive or redirected to a file) will send the graphics to the current default set-up: for example, X11 graphics if connecting by line command, or PNG if using Jemboss. The graphical format can be altered with the -graph qualifier.

The allowed values are better explained in the following table:

Example:

$ dotmatcher calm_drome.fasta calm_drome.fasta -graph png
Draw a threshold dotplot of two sequences
Created dotmatcher.1.png
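To give a feel for what a threshold dotplot contains, here is a minimal Python sketch. It is not the dotmatcher algorithm itself: the function name, the default window and threshold values, and the use of a plain count of identical residues (rather than a scoring matrix) are simplifying assumptions made for illustration.

```python
def dotplot(seq_a, seq_b, window=3, threshold=3):
    """Return the (i, j) positions that would receive a dot.

    A dot is placed where the window starting at position i of seq_a and
    the window starting at position j of seq_b share at least `threshold`
    identical residues.
    """
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            window_a = seq_a[i:i + window]
            window_b = seq_b[j:j + window]
            identical = sum(a == b for a, b in zip(window_a, window_b))
            if identical >= threshold:
                dots.append((i, j))
    return dots
```

Comparing a sequence with itself, as in the example above, always yields the main diagonal; any internal repeat shows up as additional dots away from the diagonal, for example at (0, 4) and (4, 0) for `dotplot("ACGTACGT", "ACGTACGT", window=4, threshold=4)`.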



EMBOSS Commands Organized by Functional Group

Group / Program  Description

Acd: Acd file utilities
  acdc  ACD compiler
  acdpretty  ACD pretty printing utility
  acdtable  Creates an HTML table from an ACD file
  acdtrace  ACD compiler on-screen trace
  acdvalid  ACD file validation

Alignment consensus: Merging sequences to make a consensus
  cons  Creates a consensus from multiple alignments
  megamerger  Merge two large overlapping nucleic acid sequences
  merger  Merge two overlapping nucleic acid sequences

Alignment differences: Finding differences between sequences
  diffseq  Find differences between nearly identical sequences

Alignment dot plots: Dot plot sequence comparisons
  dotmatcher  Displays a thresholded dotplot of two sequences
  dotpath  Non-overlapping wordmatch dotplot of two sequences
  dottup  Displays a wordmatch dotplot of two sequences
  polydot  Displays all-against-all dotplots of a set of sequences

Alignment global: Global sequence alignment
  est2genome  Align EST and genomic DNA sequences
  needle  Needleman-Wunsch global alignment
  stretcher  Finds the best global alignment between two sequences
  esim4  Align an mRNA to a genomic DNA sequence

Alignment local: Local sequence alignment
  matcher  Finds the best local alignments between two sequences
  seqmatchall  All-against-all comparison of a set of sequences
  supermatcher  Match large sequences against one or more other sequences
  water  Smith-Waterman local alignment
  wordmatch  Finds all exact matches of a given size between 2 sequences

Alignment multiple: Multiple sequence alignment
  emma  Multiple alignment program - interface to ClustalW program
  infoalign  Information on a multiple sequence alignment
  plotcon  Plot quality of conservation of a sequence alignment
  prettyplot  Displays aligned sequences, with colouring and boxing
  showalign  Displays a multiple sequence alignment
  tranalign  Align nucleic coding regions given the aligned proteins
  mse  Multiple Sequence Editor

Display: Publication-quality display
  abiview  Reads ABI file and display the trace
  cirdna  Draws circular maps of DNA constructs
  lindna  Draws linear maps of DNA constructs
  pepnet  Displays proteins as a helical net
  pepwheel  Shows protein sequences as helices
  prettyplot  Displays aligned sequences, with colouring and boxing
  prettyseq  Output sequence with translated ranges
  remap  Display sequence with restriction sites, translation etc
  seealso  Finds programs sharing group names
  showalign  Displays a multiple sequence alignment
  showdb  Displays information on the currently available databases
  showfeat  Show features of a sequence
  showseq  Display a sequence with features, translation etc
  sixpack  Display a DNA sequence with 6-frame translation and ORFs
  textsearch  Search sequence documentation. Slow, use SRS and Entrez!



Edit: Sequence editing
  biosed  Replace or delete sequence sections
  codcopy  Reads and writes a codon usage table
  cutseq  Removes a specified section from a sequence
  degapseq  Removes gap characters from sequences
  descseq  Alter the name or description of a sequence
  entret  Reads and writes (returns) flatfile entries
  extractfeat  Extract features from a sequence
  extractseq  Extract regions from a sequence
  listor  Write a list file of the logical OR of two sets of sequences
  maskfeat  Mask off features of a sequence
  maskseq  Mask off regions of a sequence
  newseq  Type in a short new sequence
  noreturn  Removes carriage return from ASCII files
  notseq  Exclude a set of sequences and write out the remaining ones
  nthseq  Writes one sequence from a multiple set of sequences
  pasteseq  Insert one sequence into another
  revseq  Reverse and complement a sequence
  seqret  Reads and writes (returns) sequences
  seqretsplit  Reads and writes (returns) sequences in individual files
  skipseq  Reads and writes (returns) sequences, skipping first few
  splitter  Split a sequence into (overlapping) smaller sequences
  trimest  Trim poly-A tails off EST sequences
  trimseq  Trim ambiguous bits off the ends of sequences
  union  Reads sequence fragments and builds one sequence
  vectorstrip  Strips out DNA between a pair of vector sequences
  yank  Reads a sequence range, appends the full USA to a list file

Enzyme kinetics: Enzyme kinetics calculations
  findkm  Find Km and Vmax for an enzyme reaction

Feature tables: Manipulation and display of sequence annotation
  coderet  Extract CDS, mRNA and translations from feature tables
  extractfeat  Extract features from a sequence
  maskfeat  Mask off features of a sequence
  showfeat  Show features of a sequence
  twofeat  Finds neighbouring pairs of features in sequences

HMM: Hidden Markov model analysis
  ealistat  Statistics for multiple alignment files
  ehmmalign  Align sequences with an HMM
  ehmmbuild  Build HMM
  ehmmcalibrate  Calibrate a hidden Markov model
  ehmmconvert  Convert between HMM formats
  ehmmemit  Extract HMM sequences
  ehmmfetch  Extract HMM from a database
  ehmmindex  Index an HMM database
  ehmmpfam  Align single sequence with an HMM
  ehmmsearch  Search sequence database with an HMM

Information: Information and general help for users
  infoalign  Information on a multiple sequence alignment
  infoseq  Displays some simple information about sequences
  seealso  Finds programs sharing group names
  showdb  Displays information on the currently available databases
  textsearch  Search sequence documentation. Slow, use SRS and Entrez!
  tfm  Displays a program's help documentation manual
  whichdb  Search all databases for an entry
  wossname  Finds programs by keywords in their one-line documentation

Menus: Menu interface(s)
  emnu  Simple menu of EMBOSS applications

Nucleic 2d structure: Nucleic acid secondary structure
  einverted  Finds DNA inverted repeats



Nucleic codon usage: Codon usage analysis
  cai  CAI codon adaptation index
  chips  Codon usage statistics
  codcmp  Codon usage table comparison
  cusp  Create a codon usage table
  syco  Synonymous codon usage Gribskov statistic plot

Nucleic composition: Composition of nucleotide sequences
  banana  Bending and curvature plot in B-DNA
  btwisted  Calculates the twisting in a B-DNA sequence
  chaos  Create a chaos game representation plot for a sequence
  compseq  Count composition of dimer/trimer/etc words in a sequence
  dan  Calculates DNA RNA/DNA melting temperature
  freak  Residue/base frequency table or plot
  isochore  Plots isochores in large DNA sequences
  sirna  Finds siRNA duplexes in mRNA
  wordcount  Counts words of a specified size in a DNA sequence

Nucleic CpG islands: CpG island detection and analysis
  cpgplot  Plot CpG rich areas
  cpgreport  Reports all CpG rich regions
  geecee  Calculates fractional GC content of nucleic acid sequences
  newcpgreport  Report CpG rich areas
  newcpgseek  Reports CpG rich regions

Nucleic gene finding: Predictions of genes and other genomic features
  getorf  Finds and extracts open reading frames (ORFs)
  marscan  Finds MAR/SAR sites in nucleic sequences
  plotorf  Plot potential open reading frames
  showorf  Pretty output of DNA translations
  sixpack  Display a DNA sequence with 6-frame translation and ORFs
  syco  Synonymous codon usage Gribskov statistic plot
  tcode  Fickett TESTCODE statistic to identify protein-coding DNA
  wobble  Wobble base plot

Nucleic motifs: Nucleic acid motif searches
  dreg  Regular expression search of a nucleotide sequence
  fuzznuc  Nucleic acid pattern search
  fuzztran  Protein pattern search after translation
  marscan  Finds MAR/SAR sites in nucleic sequences

Nucleic mutation: Nucleic acid sequence mutation
  msbar  Mutate sequence beyond all recognition
  shuffleseq  Shuffles a set of sequences maintaining composition

Nucleic primers: Primer prediction
  eprimer3  Picks PCR primers and hybridization oligos
  primersearch  Searches DNA sequences for matches with primer pairs
  stssearch  Search a DNA database for matches with a set of STS primers

Nucleic profiles: Nucleic acid profile generation and searching
  profit  Scan a sequence or database with a matrix or profile
  prophecy  Creates matrices/profiles from multiple alignments
  prophet  Gapped alignment for profiles

Nucleic repeats: Nucleic acid repeat detection
  einverted  Finds DNA inverted repeats
  equicktandem  Finds tandem repeats
  etandem  Looks for tandem repeats in a nucleotide sequence
  palindrome  Looks for inverted repeats in a nucleotide sequence

Nucleic restriction: Restriction enzyme sites in nucleotide sequences
  recoder  Remove restriction sites but maintain same translation
  redata  Search REBASE for enzyme name, references, suppliers etc
  remap  Display sequence with restriction sites, translation etc
  restover  Find restriction enzymes producing specific overhang
  restrict  Finds restriction enzyme cleavage sites



  showseq  Display a sequence with features, translation etc
  silent  Silent mutation restriction enzyme scan

Nucleic RNA folding: RNA folding methods and analysis
  vrnaalifold  RNA alignment folding
  vrnaalifoldpf  RNA alignment folding with partition
  vrnacofold  RNA cofolding
  vrnacofoldconc  RNA cofolding with concentrations
  vrnacofoldpf  RNA cofolding with partitioning
  vrnadistance  RNA distances
  vrnaduplex  RNA duplex calculation
  vrnaeval  RNA eval
  vrnaevalpair  RNA eval with cofold
  vrnafold  Calculate secondary structures of RNAs
  vrnafoldpf  Secondary structures of RNAs with partition
  vrnaheat  RNA melting
  vrnainverse  RNA sequences matching a structure
  vrnalfold  Calculate locally stable secondary structures of RNAs
  vrnaplot  Plot vrnafold output
  vrnasubopt  Calculate RNA suboptimals

Nucleic transcription: Transcription factors, promoters and terminator prediction
  tfscan  Scans DNA sequences for transcription factors

Nucleic translation: Translation of nucleotide sequence to protein sequence
  backtranambig  Back translate a protein sequence to ambiguous codons
  backtranseq  Back translate a protein sequence
  coderet  Extract CDS, mRNA and translations from feature tables
  plotorf  Plot potential open reading frames
  prettyseq  Output sequence with translated ranges
  remap  Display sequence with restriction sites, translation etc
  showorf  Pretty output of DNA translations
  showseq  Display a sequence with features, translation etc
  sixpack  Display a DNA sequence with 6-frame translation and ORFs
  transeq  Translate nucleic acid sequences

Phylogeny consensus: Phylogenetic consensus methods
  econsense  Majority-rule and strict consensus tree
  fconsense  Majority-rule and strict consensus tree
  ftreedist  Distances between trees
  ftreedistpair  Distances between two sets of trees

Phylogeny continuous characters: Phylogenetic continuous character methods
  econtml  Continuous character Maximum Likelihood method
  econtrast  Continuous character Contrasts
  fcontrast  Continuous character Contrasts

Phylogeny discrete characters: Phylogenetic discrete character methods
  eclique  Largest clique program
  edollop  Dollo and polymorphism parsimony algorithm
  edolpenny  Penny algorithm Dollo or polymorphism
  efactor  Multistate to binary recoding program
  emix  Mixed parsimony algorithm
  epenny  Penny algorithm, branch-and-bound
  fclique  Largest clique program
  fdollop  Dollo and polymorphism parsimony algorithm
  fdolpenny  Penny algorithm Dollo or polymorphism
  ffactor  Multistate to binary recoding program
  fmix  Mixed parsimony algorithm
  fmove  Interactive mixed method parsimony
  fpars  Discrete character parsimony
  fpenny  Penny algorithm, branch-and-bound

Phylogeny distance matrix: Phylogenetic distance matrix methods
  distmat  Creates a distance matrix from multiple alignments
  efitch  Fitch-Margoliash and Least-Squares Distance Methods
  ekitsch  Fitch-Margoliash method with contemporary tips



  eneighbor  Phylogenies from distance matrix by N-J or UPGMA method
  ffitch  Fitch-Margoliash and Least-Squares Distance Methods
  fkitsch  Fitch-Margoliash method with contemporary tips
  fneighbor  Phylogenies from distance matrix by N-J or UPGMA method

Phylogeny gene frequencies: Phylogenetic gene frequency methods
  egendist  Genetic Distance Matrix program
  fcontml  Gene frequency and continuous character Maximum Likelihood
  fgendist  Compute genetic distances from gene frequencies

Phylogeny molecular sequence: Phylogenetic molecular sequence methods
  ednacomp  DNA compatibility algorithm
  ednadist  Nucleic acid sequence Distance Matrix program
  ednainvar  Nucleic acid sequence Invariants method
  ednaml  Phylogenies from nucleic acid Maximum Likelihood
  ednamlk  Phylogenies from nucleic acid Maximum Likelihood with clock
  ednapars  DNA parsimony algorithm
  ednapenny  Penny algorithm for DNA
  eprotdist  Protein distance algorithm
  eprotpars  Protein parsimony algorithm
  erestml  Restriction site Maximum Likelihood method
  eseqboot  Bootstrapped sequences algorithm
  fdiscboot  Bootstrapped discrete sites algorithm
  fdnacomp  DNA compatibility algorithm
  fdnadist  Nucleic acid sequence Distance Matrix program
  fdnainvar  Nucleic acid sequence Invariants method
  fdnaml  Estimates nucleotide phylogeny by maximum likelihood
  fdnamlk  Estimates nucleotide phylogeny by maximum likelihood
  fdnamove  Interactive DNA parsimony
  fdnapars  DNA parsimony algorithm
  fdnapenny  Penny algorithm for DNA
  fdolmove  Interactive Dollo or Polymorphism Parsimony
  ffreqboot  Bootstrapped genetic frequencies algorithm
  fproml  Protein phylogeny by maximum likelihood
  fpromlk  Protein phylogeny by maximum likelihood
  fprotdist  Protein distance algorithm
  fprotpars  Protein parsimony algorithm
  frestboot  Bootstrapped restriction sites algorithm
  frestdist  Distance matrix from restriction sites or fragments
  frestml  Restriction site maximum Likelihood method
  fseqboot  Bootstrapped sequences algorithm
  fseqbootall  Bootstrapped sequences algorithm

Phylogeny tree drawing: Phylogenetic tree drawing methods
  fdrawgram  Plots a cladogram- or phenogram-like rooted tree diagram
  fdrawtree  Plots an unrooted tree diagram
  fretree  Interactive tree rearrangement

Protein 2d structure: Protein secondary structure
  garnier  Predicts protein secondary structure
  helixturnhelix  Report nucleic acid binding motifs
  hmoment  Hydrophobic moment calculation
  pepcoil  Predicts coiled coil regions
  pepnet  Displays proteins as a helical net
  pepwheel  Shows protein sequences as helices
  tmap  Displays membrane spanning regions
  topo  Draws an image of a transmembrane protein

Protein 3d structure: Protein tertiary structure
  psiphi  Phi and psi torsion angles from protein coordinates
  domainreso  Remove low resolution domains from a DCF file
  domainalign  Generate alignments (DAF file) for nodes in a DCF file
  domainrep  Reorder DCF file to identify representative structures
  seqalign  Extend alignments (DAF file) with sequences (DHF file)
  seqfraggle  Removes fragment sequences from DHF files
  seqsearch  Generate PSI-BLAST hits (DHF file) from a DAF file
  seqsort  Remove ambiguous classified sequences from DHF files
  seqwords  Generates DHF files from keyword search of UniProt



  libgen  Generate discriminating elements from alignments
  matgen3d  Generate a 3D-1D scoring matrix from CCF files
  rocon  Generates a hits file from comparing two DHF files
  rocplot  Performs ROC analysis on hits files
  siggen  Generates a sparse protein signature from an alignment
  siggenlig  Generate ligand-binding signatures from a CON file
  sigscan  Generate hits (DHF file) from a signature search
  sigscanlig  Search ligand-signature library & write hits (LHF file)
  contacts  Generate intra-chain CON files from CCF files
  interface  Generate inter-chain CON files from CCF files

Protein composition: Composition of protein sequences
  backtranambig  Back translate a protein sequence to ambiguous codons
  backtranseq  Back translate a protein sequence
  charge  Protein charge plot
  checktrans  Reports STOP codons and ORF statistics of a protein
  compseq  Count composition of dimer/trimer/etc words in a sequence
  emowse  Protein identification by mass spectrometry
  freak  Residue/base frequency table or plot
  iep  Calculates the isoelectric point of a protein
  mwcontam  Shows molwts that match across a set of files
  mwfilter  Filter noisy molwts from mass spec output
  octanol  Displays protein hydropathy
  pepinfo  Plots simple amino acid properties in parallel
  pepstats  Protein statistics
  pepwindow  Displays protein hydropathy
  pepwindowall  Displays protein hydropathy of a set of sequences

Protein motifs: Protein motif searches
  antigenic  Finds antigenic sites in proteins
  digest  Protein proteolytic enzyme or reagent cleavage digest
  epestfind  Finds PEST motifs as potential proteolytic cleavage sites
  fuzzpro  Protein pattern search
  fuzztran  Protein pattern search after translation
  helixturnhelix  Report nucleic acid binding motifs
  oddcomp  Find protein sequence regions with a biased composition
  patmatdb  Search a protein sequence with a motif
  patmatmotifs  Search a PROSITE motif database with a protein sequence
  pepcoil  Predicts coiled coil regions
  preg  Regular expression search of a protein sequence
  pscan  Scans proteins using PRINTS
  sigcleave  Reports protein signal cleavage sites
  meme  Motif detection

Protein mutation: Protein sequence mutation
  msbar  Mutate sequence beyond all recognition
  shuffleseq  Shuffles a set of sequences maintaining composition

Protein profiles: Protein profile generation and searching
  profit  Scan a sequence or database with a matrix or profile
  prophecy  Creates matrices/profiles from multiple alignments
  prophet  Gapped alignment for profiles

Test: Testing tools, not for general use.
  crystalball  Answers every drug discovery question about a sequence

Utils database creation: Database installation
  aaindexextract  Extract data from AAINDEX
  cutgextract  Extract data from CUTG
  printsextract  Extract data from PRINTS
  prosextract  Build the PROSITE motif database for use by patmatmotifs
  rebaseextract  Extract data from REBASE
  tfextract  Extract data from TRANSFAC
  cathparse  Generates DCF file from raw CATH files
  domainnr  Removes redundant domains from a DCF file
  domainseqs  Adds sequence records to a DCF file
  domainsse  Add secondary structure records to a DCF file
  scopparse  Generate DCF file from raw SCOP files



  ssematch  Search a DCF file for secondary structure matches
  allversusall  Sequence similarity data from all-versus-all comparison
  seqnr  Removes redundancy from DHF files
  domainer  Generates domain CCF files from protein CCF files
  hetparse  Converts heterogen group dictionary to EMBL-like format
  pdbparse  Parses PDB files and writes protein CCF files
  pdbplus  Add accessibility & secondary structure to a CCF file
  pdbtosp  Convert swissprot:PDB codes file to EMBL-like format
  sites  Generate residue-ligand CON files from CCF files

Utils database indexing: Database indexing
  dbiblast  Index a BLAST database
  dbifasta  Database indexing for fasta file databases
  dbiflat  Index a flat file database
  dbigcg  Index a GCG formatted database
  dbxfasta  Database b+tree indexing for fasta file databases
  dbxflat  Database b+tree indexing for flat file databases
  dbxgcg  Database b+tree indexing for GCG formatted databases

Utils misc: Utility tools
  embossdata  Finds or fetches data files read by EMBOSS programs
  embossversion  Writes the current EMBOSS version number

GCG to EMBOSS Commands Equivalence

Edited from http://helix.nih.gov/Applications/ and/or http://migale.jouy.inra.fr/faq/outils/gcg-vs-emboss
Former GCG users will find this extremely useful. A dash (-) in a column means that there is no equivalent program.

GCG program | EMBOSS program | Description/Comments
Assemble | merger, union | Construct new sequences from pieces of existing sequences; merger only accepts 2 sequences while assemble and union accept several.
BackTranslate | backtranseq, backtranambig | Backtranslate protein -> nucleotide sequence. backtranambig backtranslates to ambiguous codons.
BestFit | water, matcher | Bestfit uses the Smith-Waterman algorithm to find the best local alignment between 2 sequences. water uses Smith-Waterman, matcher uses Pearson's lalign algorithm.
Blast, Psiblast | dbiblast | NCBI homology search between query and database.
Breakup | splitter | Splits a sequence into (overlapping) smaller sequences.
Chopup | - | Helps to convert a non-GCG sequence format. Not needed in EMBOSS because it reads most sequence formats without conversion.
CodonFrequency | chips, compseq, cusp | CodonFrequency tabulates codon usage. compseq counts the composition of dimers/trimers in a sequence. chips calculates codon usage statistics. cusp creates a codon usage table.
CodonPreference | syco, wobble | Recognize protein coding sequences.
CoilScan | pepcoil | Predicts coiled-coil regions.
Compare + DotPlot | dottup + dotmatcher, dotpath | 2-sequence comparison. dotpath does a non-overlapping wordmatch dotplot.
Composition | compseq, pepstats | Sequence composition.
compresstext | - | Removes extra whitespace in text files. Can be done via Unix shell script.
comptable | - | Creates a scoring matrix.
consensus | prophecy | Creates a consensus sequence or matrices/profiles from multiple alignments.
correspond | codcmp | Codon usage table comparison.



corrupt | msbar | Randomly mutate sequence.
dataset | dbiflat, dbiblast, dbigcg | Creates searchable sequence database. GCG's Dataset requires sequences in GCG format, whereas dbiflat, dbiblast, dbigcg will take most formats between them.
detab | - | Replaces tabs with spaces in sequence files. Can be performed by Unix shell command.
distances | - | Calculates pairwise evolutionary distances between aligned sequences. The Phylip package can do this. http://evolution.genetics.washington.edu/phylip.html
diverge | - | Estimates pairwise substitutions per site between 2 or more coding sequences. The Phylip package can do this. http://evolution.genetics.washington.edu/phylip.html
dotplot | dottup, dotmatcher | 2-sequence comparison.
extractpeptide | transeq | ExtractPeptide takes the output of Map and can write one or more of the reading-frame translations. transeq translates one or more of the frames or specific regions directly from an input nucleotide sequence.
FastA, FastX, Tfasta, TfastX | - | Pearson's homology-search program, available as a standalone. Mostly replaced by Blast.
fetch | seqret, seqretsplit | Pull one or more sequences out of the databases. seqret/seqretsplit can save output in various sequence formats.
figure | - | Generates plots from other GCG programs. The equivalent EMBOSS programs usually generate plots (e.g. plotorf).
findpatterns | fuzznuc, fuzzpro | Searches for patterns in a sequence or database.
fingerprint | - | Finds the products of T1 ribonuclease digestion.
fitconsensus | - | Use after Consensus to find the best fits.
framealign | - | Finds best local alignment including frame shifts between a protein and nucleotide sequence.
frames | plotorf, showorf | Show open reading frames. plotorf does this graphically.
framesearch | - | Homology searches including frameshifts between protein and nucleotide sequences.
fromembl, fromfasta, fromgenbank, fromig, frompir, fromstaden, fromtrace | - | Converts from various formats to GCG sequence format. Unnecessary in EMBOSS because it can accept most sequence formats, but seqret can convert between formats if desired.
Gap | needle, stretcher | Needleman-Wunsch algorithm to compare 2 sequences. stretcher uses the Myers-Miller algorithm, which is more memory-efficient. For sequences larger than 10 kb, use the stretcher program, which is also a global alignment program. If one of your sequences is genomic and you are trying to align an EST sequence to it, consider the est2genome program. On the other hand, water, matcher and supermatcher are local alignment programs for small, medium, and large sequences, respectively.
Gapshow | plotcon | Graphical representation of similarity of 2 sequences.
GCGtoBlast | - | Makes a Blast database. Use NCBI's 'formatdb' instead.
GelAssemble, GelDisassemble, GelEnter, GelMerge, GelStart, GelView | megamerger, merger, union | Parts of GCG's gel assembly suite.
GetSeq | seqret | Type in a new sequence.
GrowTree | - | Creates phylogenetic tree. Can use Phylip or Clustal instead.
HelicalWheel | pepwheel | Plots peptide sequence as helical wheel to help recognize amphiphilic regions.
HmmerAlign, HmmerBuild, HmmerCalibrate, HmmerEmit, HmmerFetch, HmmerIndex, HmmerPfam, HmmerSearch | - | Sean Eddy's HMMER package. http://biowiki.org/HmmerPackage
HTHScan | helixturnhelix | Finds HTH motifs in protein sequences.
IsoElectric | iep | Calculates isoelectric point of protein.
Lineup | - | Edits multiple sequence alignments. See SeqEd below.
ListFile | - | For printing. Can use Unix pcprint command instead.
Lookup | - | Versatile program for finding sequences in a database. whichdb in EMBOSS can search for accession numbers, but GCG's lookup is much more sophisticated. Use NCBI Entrez instead. http://www.ncbi.nlm.nih.gov/Entrez/

Map, Mapplot, Mapsort | restrict, remap, restover | Finds restriction enzyme cleavage sites. GCG & EMBOSS may display different isoschizomers of the same enzyme, but the results are equivalent. The EMBOSS remap program may not display a few of the available isoschizomers.
MeltTemp | dan | Computes melting temperature of oligos.
MEME | - | Finds conserved motifs in a group of unaligned sequences. A standalone Meme/Mast software exists. http://meme.sdsc.edu/
MFold | - | Predicts nucleotide secondary structure. GCG's version is an old version of Zuker's MFOLD. Info on Zuker's site: http://mfold.bioinfo.rpi.edu/
Moment | pepnet, octanol, hmoment | Makes a contour plot of the helical hydrophobic moment of a peptide sequence. hmoment prints the text output of the calculation.
Motifs | patmatmotifs | Finds common Prosite motifs in a sequence. Use '-full' tag to display abstract information when using EMBOSS patmatmotifs. Note that both these programs will only find Prosite 'Patterns' (e.g. CAMP Phosphorylation Site), and not Prosite 'Matrices' (e.g. Helix-turn-Helix). Use Interproscan to find all known domains and functional sites (http://www.ebi.ac.uk/Tools/InterProScan/). patmatmotifs can accept a file containing multiple sequences or patterns.
Meme + Motifsearch | prophecy + profit | Search a sequence or database with a matrix or profile.
Names | infoseq | Provides some info about sequence specifications.
NetBlast, Netfetch | - | Remote access to NCBI's Blast. Use web version: http://www.ncbi.nlm.nih.gov/BLAST/
NoOverlap | diffseq | Finds differences between 2 sequences. NoOverlap can work with a group of sequences.
OldDistances | - | Makes a table of the pairwise similarities within a group of sequences.
onecase | - | Converts sequence into lower or upper case. Can be performed by Unix shell command.
Overlap | - | Compares 2 sets of sequences using Wilbur-Lipman algorithm.
Paupdisplay + Paupsearch | - | PAUP Phylogenetic Analysis.
Pepdata | getorf, sixpack | Translates in all 6 reading frames. sixpack displays the DNA sequence with 6-frame translations and ORFs.
Pepplot | pepinfo | Pepplot plots protein secondary structure and hydrophobicity. pepinfo plots hydrophobicity, and garnier does protein secondary structure prediction.
Peptidemap | digest | Enzyme/reagent cleavage map of a protein.
Peptidesort | digest, pepstats | GCG peptidesort sorts fragments from an enzyme/reagent cleavage of one or more proteins according to position, mol. wt., and HPLC retention. EMBOSS digest only processes one reagent cleavage at a time. EMBOSS pepstats can be used to determine the composition of the fragments afterwards. The EMBOSS programs do not provide the elution times from HPLC. If you need this data, try the UCSF MS-Digest program, which has an option for HPLC Indices. http://prospector.ucsf.edu/cgi-bin/msform.cgi?form=msdigest
Peptidestructure, Plotstructure | garnier, antigenic, pepwindow, pepwindowall | Secondary structure prediction. garnier does not include Jameson-Wolf antigenic indexing. antigenic predicts potentially antigenic regions of a protein sequence, using the method of Kolaskar and Tongaonkar. pepwindow displays Kyte-Doolittle protein hydropathy. pepwindowall produces a set of superimposed Kyte & Doolittle hydropathy plots from an aligned set of protein sequences.

Pileup | emma | Multiple sequence alignment. emma is an interface to ClustalW. Can also use the standalone Clustal (command clustalw for line-command or clustalx for GUI) or web ClustalW online: http://www.ebi.ac.uk/Tools/clustalw2/
PlasmidMap | cirdna, lindna | Plot DNA constructs.
PlotFold | - | Plots MFold output. See MFOLD.
PlotSimilarity | plotcon | Graphical representation of the similarity along a set of aligned sequences.
Pretty | prettybox, cons, prettyplot, showalign | Calculates consensus sequence from a multiple sequence alignment, and displays them prettily.
Prime | eprimer3 | Selects oligonucleotide primers.
Profilegap, Profilemake | prophecy, prophet, distmat | Creates matrices/profiles from multiple alignments. Gapped alignment for profiles and sequences.
PrimePair | primersearch | Evaluates individual primers to determine their compatibility for use as PCR primer pairs.
Profilescan | patmatdb | Searches sequences or db for protein motifs. Profilescan uses Gribskov method.
Profilesearch | profit | Scans a sequence or database with a matrix or profile.
Profilesegments | - | Alignments for results of Profilesearch.
Publish | seqret, showseq | Makes publication-quality displays of sequences.
Reformat | seqret | GCG requires input sequences to be in GCG format, hence other formats need to be converted with 'reformat'. EMBOSS programs accept most sequence formats, so conversion is rarely required, but 'seqret' can be used to convert between formats if desired.
Repeat | equicktandem, etandem, einverted, palindrome | Finds tandem repeats in sequences. The equivalent group of EMBOSS programs will also look for inverted or palindromic repeats.
Replace | biosed, degapseq | Replaces characters in a text file. degapseq is specific for replacing gap characters. Can be performed with Unix shell utilities like sed, awk or tr.
Reverse | revseq | Reverse/complement a sequence.
Sample | extractseq | Extract regions from a sequence.
Seg | maskseq | Masks off low-complexity regions from a sequence.
Seqed | biosed, cutseq, degapseq, descseq, entret, extractfeat, extractseq, listor, maskfeat, maskseq, newseq, noreturn, notseq, nthseq, pasteseq, revseq, seqret, seqretsplit, skipseq, splitter, trimest, trimseq, union, vectorstrip, yank | Sequence editor. EMBOSS has several tools for specific editing tasks. Or use a text editor (not word processor!). Try the Jemboss alignment editor for editing multiple sequence alignments: http://emboss.sourceforge.net/Jemboss/ Other alternatives are BioEdit (Windows only, http://www.mbio.ncsu.edu/BioEdit/bioedit.html ) and Seaview (Mac, Windows, Unix; http://pbil.univ-lyon1.fr/software/seaview)
SeqLab | - | X-windows interface to GCG.
Setkeys | - | Redefines keyboard keys, mainly used for GCG's gel assembly programs.
Shiftover | - | Moves text by column. Use the nedit editor instead.
Shuffle | shuffleseq | Shuffles a sequence.
Simplify | - | Reduce the number of symbols in a sequence.
Spew | - | Sends a sequence from a remote computer (e.g. Helix) to your desktop. Use FTP instead. SecureFX for Windows, or line-command sftp on Mac/Unix.
SPScan | sigcleave | Predicts signal peptides in protein sequences.
Ssearch | - | Part of Pearson's Fasta package, available as a standalone program on Helix.
StatPlot | - | Plotting program. Rarely used.
StemLoop | palindrome, etandem | Finds inverted repeats.
Stringsearch | textsearch | Finds text phrases in sequence or database. Use NCBI's Entrez instead: http://www.ncbi.nlm.nih.gov/Entrez/



Terminator | - | Searches for prokaryotic factor-independent RNA polymerase terminators according to the method of Brendel and Trifonov.
Testcode | wobble | Plots 3rd-position variability as an indicator of potential coding regions.
ToFastA, ToIG, ToPIR, ToStaden | seqret | EMBOSS accepts most sequence formats, therefore format conversion is rarely required. seqret can be used to convert between formats if desired.
Translate | transeq | Translates nucleotide -> protein sequences.
Transmem | - | Predicts transmembrane helices.
Window + Statplot | freak | Residue/base frequency table or plot.
Wordsearch, Segments | - | Homology search using Wilbur/Lipman algorithm. Segments displays the result.
Xnu | - | Masks tandem repeats for future Blast search.
- | abiview | Reads ABI file and displays trace.
- | antigenic | Finds antigenic sites in proteins.
- | banana | Bending and curvature plot in B-DNA.
- | btwisted | Calculates the twisting in a B-DNA sequence.
- | cai | CAI codon adaptation index, to measure synonymous codon usage bias.
- | chaos | Create a chaos game representation plot for a sequence.
- | charge | Protein charge plot.
- | checktrans | Reports STOP codons and ORF statistics of a protein.
- | coderet | Extract CDS, mRNA and translations from feature tables.
- | cpgplot, cpgreport, newcpgreport, newcpgseek | Plots and reports CpG-rich regions.
seqed | cutseq | Removes a specified section from a sequence. seqed is interactive, cutseq is command-line.
seqed | descseq | Alter name/description of sequence.
Findpatterns | dreg | Regular expression search of a sequence. Findpatterns is an approximate equivalent.
- | emma | Interface to ClustalW program.
- | emowse | Protein identification by mass spectrometry.
- | epestfind | Finds PEST motifs as potential proteolytic cleavage sites.
- | est2genome | Align EST and genomic DNA sequences.
- | extractfeat | Extract features from a sequence.
- | findkm | Find Km and Vmax for an enzyme reaction by a Hanes/Woolf plot.
- | fuzztran | Protein pattern search after translation.
- | geecee | Calculates the fractional GC content of nucleic acid sequences.
- | isochore | Plots isochores in large DNA sequences.
- | listor | Writes a list file of the logical OR of two sets of sequences.
- | makenucseq, makeprotseq | Create random nucleotide and protein sequences.
- | marscan | Finds MAR/SAR sites in nucleic sequences.
- | maskfeat | Mask off features of a sequence.
- | mwcontam | Shows molwts that match across a set of files.
- | mwfilter | Filter noisy molwts from mass spec output.
- | noreturn | Removes carriage returns from ASCII files. Can be performed by Unix utilities like 'tr'.
Reformat | nthseq | Pulls one sequence out of a multiple set. Reformat will pull a sequence out of an MSF or RSF file.
- | oddcomp | Finds protein sequence regions with a biased composition.
- | polydot | Displays all-against-all dotplots of a set of sequences.
- | printsextract | Extract data from PRINTS.
- | pscan | Scans proteins using PRINTS.
- | rebaseextract, redata | Search and extract from REBASE.
- | recoder | Remove restriction sites but maintain the same translation.
- | seqmatchall | All-against-all comparison of a set of sequences.
- | showdb | Shows info about currently available databases.
- | showfeat | Shows features of a sequence.
- | silent | Silent mutation restriction enzyme scan.
- | sirna | Finds siRNA duplexes in mRNA.
- | stssearch | Searches a DNA database for matches with a set of STS primers.



- | supermatcher | Finds a match of a large sequence against one or more sequences.
- | tfextract | Extract data from TRANSFAC database.
gcghelp | tfm | Shows documentation for a program.
- | tfscan | Scans DNA sequences for transcription factors.
- | tmap | Displays membrane spanning regions.
- | tranalign | Align nucleic coding regions given the aligned proteins.
- | trimest, trimseq | Trim bits off ends of sequences. Can be done interactively with GCG's seqed.
- | twofeat | Finds neighbouring pairs of features in sequences.
- | vectorstrip | Strips out DNA between a pair of vector sequences.
- | wordcount | Counts words of a specified size in a DNA sequence.
- | wordmatch | Finds all exact matches of a given size between 2 sequences.
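To make the equivalence table concrete, here is a minimal sketch of two of its rows (Reformat -> seqret, Reverse -> revseq) run as line commands. The file names and the short demo sequence are hypothetical and are not part of LabFiles; the sketch assumes the standard EMBOSS qualifiers -sequence, -outseq, -osformat and -auto, and it only runs the EMBOSS programs if they are found on the computer.

```shell
# Work in a throw-away directory (assumption: any scratch location will do).
cd "$(mktemp -d)"

# Hypothetical two-line FASTA file used only for this illustration.
printf '>demo example sequence\nATGGCCATTGTAATGGGCCGC\n' > demo.fasta

if command -v seqret >/dev/null 2>&1; then
    # GCG Reformat -> EMBOSS seqret: rewrite the sequence in GCG format.
    seqret -auto -sequence demo.fasta -outseq demo.gcg -osformat gcg
    # GCG Reverse -> EMBOSS revseq: reverse/complement the sequence.
    revseq -auto -sequence demo.fasta -outseq demo_rev.fasta
    cat demo_rev.fasta
else
    echo "EMBOSS not found: the seqret and revseq commands were skipped"
fi
```

The -auto qualifier turns off the interactive prompts so that the commands run unattended; leave it out to be prompted for each value instead.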



Class notes


L09: Pairwise alignment with EMBOSS

Table of Contents

L09 Exercise A: Xterm and Unix line commands
1. Begin an Xterm session
2. Line commands
2.1. The prompt: $, %
2.2. Full path location of a file
2.3. Relative path and current directory: ., ..
2.4. Changing directory and present working directory: cd, pwd, ~
2.5. Directory listing: ls
2.6. Creating a new directory: mkdir
2.7. Text file content: cat, more, head, tail
2.8. Simple text editing: pico, nano
2.9. Redirect of standard text output: >
2.10. Documentation, help and manual pages: man
2.11. Summary tables

L09 Exercise B: Help and relevant EMBOSS applications: wossname, tfm, -option
1. Find relevant programs: wossname
2. Documentation and Help: tfm, -help, -option

L09 Exercise C: Sequence format and changing format: seqret
1. Fasta format
2. Seqret reads and writes (reformats) sequences
2.1. Changing the format: format codes
3. List files: @ symbol
4. Multiple sequence formats: seqret, seqretsplit

L09 Exercise D: Pairwise comparisons with dotplots
1. Working directory
2. Dotmatcher
2.1. Defaults run
2.2. Window size
2.3. Threshold
3. dottup
3.1. Word size
4. Comparison tables: BLOSUM62
5. Nucleotide sequence comparison
6. Inverted repeats

L09 Exercise E: Pairwise comparisons with optimal alignments
1. Local alignment: water
2. Global alignment: needle
2.1. Defaults run
2.2. Change the gaps
3. Comparison tables: PAM250
4. Global and local alignment comparison
5. Alternative alignments

L09 Exercise F: End of laboratory



The EMBOSS package will be used here as a line-command tool, which is available on all installations and therefore common to all platforms. All of the commands can be transcribed to any of the GUIs that exist for EMBOSS, including the Java-based Jemboss interface. In this manner, the options specific to each GUI do not encumber the learning process, and users can concentrate on the algorithms and on the effect of changing parameters from their defaults.

L09 Exercise A: Xterm and Unix line commands

The directory LabFiles on the desktop contains the files necessary for these exercises. If you are practicing at home you can download the files from the http://virology.wisc.edu/acp web site. Under Public Data click on Class Resources, then under Files for ACP Labs use the Files for our Labs pull-down menu to select Seq Files for Lab.

1. Begin an Xterm session

✔ TASK Click on the X11 logo within the Dock (bottom of screen). This will launch X11 (Xwindows) and open an xterm VT100 terminal emulation.

At the % or $ prompt type (DO NOT TYPE the prompt!):
$ cd Desktop/LabFiles

LabFiles is on the DMC desktops.

Alternatively, find X11 within the Applications > Utilities directory. Note: X11 is mandatory for any EMBOSS application that has graphical output. However, Copy/Paste is easier from Terminal, and with the most recent Mac OS, Terminal will transfer the graphical output to the X11 system. You can therefore use Terminal if you prefer, but launch X11 as well. Terminal is found in the directory Applications/Utilities.

2. Line commands

✔ READ A few sets of line-commands are useful to know. They serve to navigate along the directory tree on the hard drive.

2.1. The prompt: $, %



The line command prompt means that the computer is ready for input. Typically the prompt is either $ or % for non-administrative users. The prompt could also be > and depending on the computer setting reflect the name of the computer and even the current directory name.

2.2. Full path location of a file

Under Unix the top directory is called “root” and is symbolized by a forward slash. To access a file, e.g. myfile.txt, all directories that must be traversed from the root directory to reach the file are listed, separated by forward slashes without spaces. (Spaces can be allowed but require special care, not covered here.) For example: /Users/dmc/Desktop/myfile.txt is the full path to the file myfile.txt since it starts with root, which is the first /.

Note: On a Windows system it is exactly the same, except that it is usually case-insensitive, spaces are allowed, the root is the hard drive letter, e.g. C:, and the slashes are backward slashes. For example:

C:\Documents and Settings\Administrator\Desktop\myfile.txt

2.3. Relative path and current directory: ., ..

A relative path gives access to a file relative to the current location, without listing the whole hierarchy of directories from root. The simplest relative path is the name of the file alone, assuming that the software we are using is now “looking” within the current directory where the file resides. The special symbol for the current directory is a dot (.), while the parent directory immediately above the current directory is represented by a double dot (..). Therefore, the following relative paths are correct depending on the location of the file and the location where the software is “looking”:

myfile.txt
./myfile.txt
../Desktop/myfile.txt
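As a small illustration of these three relative paths, the sketch below builds a throw-away directory and reads the same file three ways. The sandbox location and the file content are assumptions made for the example and are not part of LabFiles; the commands mkdir, cat and > are described in the sections that follow.

```shell
# Hypothetical sandbox: a temporary directory standing in for a home directory.
sandbox=$(mktemp -d)
mkdir "$sandbox/Desktop"
echo "test content" > "$sandbox/Desktop/myfile.txt"

# Move into Desktop so that it becomes the current directory.
cd "$sandbox/Desktop"

# The same file reached through three equivalent relative paths:
cat myfile.txt              # file name alone
cat ./myfile.txt            # explicit current directory
cat ../Desktop/myfile.txt   # up to the parent, then back down into Desktop
```

All three cat commands print the same line because each path resolves to the same file when the current directory is Desktop.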

2.4. Changing directory and present working directory: cd, pwd, ~

We already used the cd command above to change directory. Combined with a path, it gives access to any directory within the accessible hard drives.



To know which directory we are currently looking into, we can use the command pwd, which echoes the present working directory. A very useful shorthand always takes you back “home” (to your home directory, as computers may have multiple users): the tilde symbol (~) replaces everything that would be required as a full path from root to the home directory. It can also be used as the start of a path going down from the home directory. For example, the commands

cd ~
cd ~/Desktop

would return to the home directory and to the desktop respectively from ANY other location. Extremely useful if one gets “lost” even with help of pwd!
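A minimal sketch of getting "lost" and coming back home, assuming only that /tmp exists (it does on Mac OS and Linux). The variable name home_dir is made up for the example.

```shell
cd /tmp            # go somewhere other than the home directory
pwd                # echoes the present working directory

cd ~               # back to the home directory from anywhere
home_dir=$(pwd)    # remember what pwd reports at home
echo "$home_dir"

cd ~/..            # ~ can also start a longer path: here, the parent of home
pwd

cd ~               # and home again
pwd
```

On the class computers, cd ~/Desktop/LabFiles would use the same shorthand to reach the exercise files from any location.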

2.5. Directory listing: ls

To obtain a list of the files present in the current working directory we use the command ls. (On Windows the command is DIR.) The ls command can be modified with -l (letter L) for a long list, -1 (number one) for a one-column list, -F to show file type (files marked with / are directories) and -a to show hidden files. Compatible modifiers can be combined:

$ ls -lFa
total 168
drwx------  22 dmc staff  748 Oct 23 19:16 ./
drwxr--r--+ 71 dmc staff 2414 Oct 23 19:17 ../
-rw-------@  1 dmc staff 6148 Sep 11 15:42 .DS_Store
drwxr-x---  17 dmc staff  578 Apr 11  2002 EVOL/
drwxr-x---   7 dmc staff  238 Nov  5  2003 FOLD/
-rw-r-----   1 dmc staff 1014 Sep 14  1999 ant.pep
-rw-r--r--   1 dmc staff 5743 Sep 11 15:08 blue.vec.seq
-rw-r--r--   1 dmc staff  163 Oct 23 19:14 calm_drome.fasta

The first column shows whether the entry is a directory (d), followed by 3 sets of file permissions (read, write, execute) for the user, for a group to which the user belongs, and for the rest of the world. The owner of the file (dmc) and group (staff) are shown, then the file size, the date of last change and the file name. Since we used -F the directories are shown with /, and since we used -a we can see the hidden files; the most remarkable are ./ and ../, the present and parent directories. Note: on Windows, the command DIR /b (forward slash!) shows the directory content as one column.
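The effect of each modifier is easiest to see one at a time. The sketch below assumes a scratch directory containing one ordinary file, one directory and one hidden file; the names EVOL and ant.pep are borrowed from the listing above, and .hidden is made up.

```shell
# Hypothetical sandbox with one file, one directory and one hidden file.
cd "$(mktemp -d)"
mkdir EVOL
echo "ACGT" > ant.pep
echo "secret" > .hidden

ls          # plain listing: the hidden file is not shown
ls -1       # one name per line
ls -F       # EVOL is marked with a trailing /
ls -a       # also shows .hidden, as well as . and ..
ls -lFa     # long listing with all three modifiers combined
```

Note that the modifiers are typed with a plain keyboard hyphen; a longer dash pasted from a word processor is not recognized by the shell.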

2.6. Creating a new directory: mkdir



Since the EMBOSS software is on the local computer it may be easier to create new directories with the mouse menu File > New Folder. However, the command mkdir will create a new directory within the current directory. The command cd can then be used to go down the directory path into the new directory.
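A short sketch of the mkdir and cd sequence, run in a scratch location so that nothing is created on the Desktop. The directory name Test matches the example used in the summary table; the scratch location is an assumption.

```shell
cd "$(mktemp -d)"   # hypothetical scratch location
mkdir Test          # create a new directory named Test
ls -F               # Test/ now appears in the listing
cd Test             # go down into the new directory
pwd                 # the reported path now ends in /Test
```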

2.7. Text file content: cat, more, head, tail

The content of a text file (binary files are special cases) can easily be appraised by scrolling it onto the terminal. Some commands scroll the complete file all at once (cat), while others pause (more), showing the next line when the return key is pressed and the next page when the space bar is pressed. Note: in Windows the command is type.

The commands head and tail display the first 10 lines at the top or the last 10 lines at the bottom of a file. The number of desired lines to view can be specified.
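To try these commands without touching the lab files, the sketch below first writes a hypothetical 12-line file (the name myfile.txt and its content are made up) and then views it in different ways. The more command is left as a comment because it waits for keyboard input.

```shell
cd "$(mktemp -d)"

# Build a 12-line practice file, one numbered line at a time.
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
    echo "line $i"
done > myfile.txt

cat myfile.txt        # scrolls all 12 lines at once
head myfile.txt       # first 10 lines (the default)
head -2 myfile.txt    # first 2 lines only
tail -5 myfile.txt    # last 5 lines only

# more myfile.txt     # interactive: <return> next line, <space bar> next page, q to quit
```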

2.8. Simple text editing: pico, nano

All exercises are done with files that are local. Therefore it is easiest to use TextEdit (make sure to change the format to plain text with the menu cascade Format > Make Plain Text). However, it is possible to edit a small text file within the terminal with the full-screen text program pico. Navigation is simple with the up/down/right/left arrows of the keyboard. Cut one line: control-k; paste that line: control-u. Type control-x to exit and write the file. Commands are summarized at the bottom of the screen.

Note: recently pico has been replaced by nano: “ANOther editor, an enhanced free Pico clone.”

2.9. Redirect of standard text output: >

The standard input is the keyboard and the standard output is the terminal screen. It is possible to redirect the standard text output to a file by adding > and a file name after a command that would create a text output, such as cat or ls. For example, we can obtain a one-column list of the file names within the current directory with the dash-one (-1) option of ls, redirecting the standard text output into a file:

ls -1 > mylist.txt

Note: in Windows the command would be (forward slash b, as /b and \b have different meanings):

DIR /b > mylist.txt
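The redirect can be tried safely in a scratch directory. In this sketch the two sequence file names are borrowed from the directory listing shown earlier, but their one-line contents are made up for the example.

```shell
cd "$(mktemp -d)"
echo "ACGT" > ant.pep            # > creates ant.pep and writes one line into it
echo "TTGA" > blue.vec.seq

ls -1 > mylist.txt               # nothing is printed: the list goes into mylist.txt
cat mylist.txt                   # view the saved list

head -1 mylist.txt > first.txt   # any command that prints text can be redirected
cat first.txt
```

One detail worth noticing: mylist.txt contains its own name, because the shell creates the output file before ls reads the directory. Redirecting to an existing file name normally replaces the previous content of that file.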



2.10. Documentation, help and manual pages: man

The command man displays the documentation of a command within a more screen display. Example:

man pico

2.11. Summary tables

The commands and symbols reviewed here are summarized below. If you learn this table, you will appear as a “Unix Guru” to most people! And indeed you will be able to interact with ease with any Unix/Linux system! Learn the Windows notes embedded above for an even stronger effect.

✔ READ

Symbol | Name | Function / examples
$ % > | Prompt | Shows ready for input
/ | Root | See cd and pwd below
. | Current directory | See cd and pwd below
.. | Parent directory | See cd and pwd below
~ | “Home” directory | See cd and pwd below

Command | Name | Function / examples
cd | Change directory | cd Desktop ; cd ../Desktop/LabFiles
pwd | Present working directory | Shows absolute path
mkdir | Create a new directory | Example: mkdir Test
ls | List files. Can specify another directory | Modifiers can be added: long list (letter L): -l ; 1 column (# one): -1 ; mark file types: -F ; show hidden files: -a ; example: ls -laF
cat | Types complete file to screen | cat myfile.txt
more | Types file one screen-page at a time. % of file viewed displayed at bottom left. | See next line: press <return> ; see next page: press <space bar> ; return to prompt (quit): q ; example: more myfile.txt
head | Displays top 10 lines of file by default or specify # of lines | head myfile.txt ; head -2 myfile.txt
tail | Same as head for end of file. | tail -5 myfile.txt
pico | Simple text editor | Cut one line: Control-K ; save and exit: Control-X
man | Displays doc with more | man cat
> | Redirects standard screen text output into a text file. | Examples: ls > mylist.txt ; head myfile.txt > top10.txt



L09 Exercise B: Help and relevant EMBOSS applications: wossname, tfm, -option

EMBOSS programs used in this exercise: wossname, tfm, dotmatcher

EMBOSS contains a very large number of applications (programs). A list organized by logical group is provided in the EMBOSS introduction. Online, the wossname application identifies the applications relevant to the task at hand.

1. Find relevant programs: wossname

In a following exercise we will use dotplotting as a means to compare 2 sequences, or even a sequence against itself. wossname will let us know what applications could be used for the exercise:

✔ TASK
$ wossname dotplot
Finds programs by keywords in their short description
SEARCH FOR 'DOTPLOT'
dotmatcher  Draw a threshold dotplot of two sequences
dotpath     Draw a non-overlapping wordmatch dotplot of two sequences
dottup      Displays a wordmatch dotplot of two sequences
polydot     Draw dotplots for all-against-all comparison of a sequence set

We now have a list of relevant EMBOSS applications that we can use.
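Because the list printed by wossname is ordinary text output, it can be saved with the > redirect seen in Exercise A. The sketch below is one way to do this; the output file name dotplot_programs.txt and the variable emboss_found are made up for the example, and the command is only attempted if EMBOSS is found on the computer.

```shell
cd "$(mktemp -d)"   # hypothetical scratch location

# Assumption: EMBOSS may or may not be installed where this is run.
if command -v wossname >/dev/null 2>&1; then
    emboss_found=yes
    wossname dotplot > dotplot_programs.txt   # save the list instead of printing it
    cat dotplot_programs.txt
else
    emboss_found=no
    echo "wossname not found: EMBOSS is not installed or not on the PATH"
fi
```

Any other keyword can replace dotplot, so the same pattern serves to look up programs for other tasks.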

2. Documentation and Help: tfm, -help, -option

The command tfm (the fine manual) displays all the details about a specified application. More succinct information can be obtained as well by the following methods:

✔ TASK At the % or $ prompt, type the commands shown on the lines beginning with $ (do not type the $) and observe the output:

$ tfm dotmatcher
dotmatcher
Function
Draw a threshold dotplot of two sequences
Description
dotmatcher generates a dotplot from two input sequences. The dotplot is an intuitive graphical representation of the regions of similarity



tfm uses more to display text: press the space bar to see the next page, or press q to quit.

$ dotmatcher -help
Standard (Mandatory) qualifiers (* if not always prompted):
  [-asequence]  sequence  Sequence filename and optional format, or reference (input USA)
  [-bsequence]  sequence  Sequence filename and optional format, or reference (input USA)
[...]

The qualifier -option (or -opt) will be used in a following exercise.

L09 Exercise C: Sequence format and changing format: seqret

EMBOSS programs used in this exercise: seqret
EMBOSS qualifiers used: -option, -help, -osf
UNIX commands used: cat, head, more, cd, mkdir, pwd

✔ READ There are too many file types, each with its own story and history, to review here. A good summary is presented at http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

On that web page they rightfully state: “Before reading the rest of this document, please note: Microsoft WORD format is not a sequence format.” Program-specific file types such as PDF, RTF, HTML and PostScript® are NOT sequence file formats either. Sequence files are plain text files containing only printable characters from the keyboard. If anything is in bold, italics or underlined it is NOT a plain text file!

Formats were designed to hold the sequence data and other information about the sequence. The “format” part pertains to the conventions for arranging the text within the file, as well as the order and organization of specific characters that serve as flags to tell which parts of the file contain the actual sequence data, headers, annotations and features.

Most sequence formats include at least one form of ID name, usually placed somewhere at the top of the file. Most sequence databases have two identifiers for each sequence: an ID name and an Accession number.



The ID name was originally intended to be a human-readable name that had some indication of the function of its sequence. This naming scheme started to be a problem when the number of entries added each day was so vast that people could not make up the ID names fast enough. Names are not guaranteed to remain the same between different versions of a database (although in practice they usually do). Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the rest of the life of the database. If two sequences are merged into one, then the new sequence will get a new Accession number and the Accession numbers of the merged sequences will be retained as 'secondary' Accession numbers. 1

1. Fasta format

The simplest file format is the fasta file format, used by default for output by EMBOSS. The first character is the greater-than sign (>), followed by a name with no blank space either before or within the name. Comments can follow the name, but only on that same first line. Note that > means one thing for Unix (redirection) and something else for the fasta format. For example, the fasta format version of a file of a calmodulin protein with ID name calm_drome on Entrez is:

>gi|49037468|sp|P62152.2|CALM_DROME RecName: Full=Calmodulin; Short=CaM
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFL
TMMARKMKDTDSEEEIREAFRVFDKDGNGFISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYE
EFVTMMTSK

The accession number is P62152, version 2. The ID name is CALM_DROME. Note that all this information is within one line after the > symbol. Since the sequence name is the first word, without spaces, touching the > sign, here it is a very long word for many programs and should be rewritten. Later we will use the EMBOSS seqret program for this purpose.
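The layout of a fasta record can be made concrete with a short Python sketch. It is only an illustration, not part of EMBOSS: the function names are hypothetical, and the pipe-delimited layout of the name (gi|number|database|accession.version|ID) is assumed from this single example rather than from a specification.

```python
def parse_fasta_record(text):
    """Split one fasta record into (name, description, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a fasta record must start with '>'")
    header = lines[0][1:]
    # The name is the first word touching '>'; the rest of the line is comment.
    name, _, description = header.partition(" ")
    sequence = "".join(lines[1:])
    return name, description, sequence


def split_ncbi_name(name):
    """Assumed layout: gi|number|db|accession.version|ID (as in the example)."""
    fields = name.split("|")
    accession, _, version = fields[3].partition(".")
    return {"accession": accession, "version": version, "id": fields[4]}


record = """>gi|49037468|sp|P62152.2|CALM_DROME RecName: Full=Calmodulin; Short=CaM
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFL
TMMARKMKDTDSEEEIREAFRVFDKDGNGFISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYE
EFVTMMTSK"""

name, description, sequence = parse_fasta_record(record)
info = split_ncbi_name(name)
print(info["id"], info["accession"], info["version"], len(sequence))
```

The sketch treats everything after the first blank space as comment text, which is why a long pipe-delimited name such as this one ends up as a single unwieldy word.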

✔ TASK
Open a browser to point to NCBI: http://www.ncbi.nlm.nih.gov/sites/
Click on Protein: sequence database
Within the search box type: calm_drome

1 http://emboss.sourceforge.net/docs/themes/SequenceFormats.html



When the entry is shown switch from Summary to FASTA

With the mouse select (highlight) and Edit > Copy the text of the file: title with > and sequence.

We will paste the content of the clipboard shortly...

✔ TASK Switch to the Terminal or X11 xterm. We will create a test directory within the LabFiles directory (review line commands in previous section if necessary): Type the bold commands after the % or $ prompt on the terminal:

cd ~/Desktop/LabFiles <return>
mkdir TEST <return>
cd TEST <return>
pwd <return>

We will now create a new text file called testfile.txt from the clipboard contents, with the help of cat and redirect (>):

cat > testfile.txt <return>
Paste the contents of the clipboard immediately after that.

✔CAUTION: on X11 Xterm, the mouse menu Edit > Paste is NOT available. The method to paste is to click the middle mouse button. On Terminal, use the mouse menu or the paste shortcut ⌘v

At this point your screen should look like this:



cat > testfile.txt
>gi|49037468|sp|P62152.2|CALM_DROME RecName: Full=Calmodulin; Short=CaM
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTIDFPEFL
TMMARKMKDTDSEEEIREAFRVFDKDGNGFISAAELRHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYE
EFVTMMTSK

Now do the following: Press return and then press together control and D to close the file.

<return> <control> D

The file is now written on the local hard drive and contains the pasted text. The ls command will now list the file within our directory, and we can also verify its contents with cat, more or head. For example:

ls <rtn>
more testfile.txt <rtn>

Press either return, q or the space bar to return to the prompt.

2. Seqret reads and writes (reformats) sequences

By default the EMBOSS application seqret reformats sequence files to the fasta format.

✔ TASK Type the bold commands after the % or $ prompt and press return:
$ seqret -help
Standard (Mandatory) qualifiers:
  [-sequence]  seqall     (Gapped) sequence(s) filename and optional format, or reference (input USA)
  [-outseq]    seqoutall  [<sequence>.<format>] Sequence set(s) filename and optional format (output USA)
[...]

There are many more options explained within the tfm manual (tfm seqret). In our case we already have a fasta-formatted file, but the name within is very long, and seqret can rewrite the file to update the name to a more useful format.

$ seqret testfile.txt
Reads and writes (returns) sequences
output sequence(s) [calm_drome.fasta]: <return>
$
$ head -1 calm_drome.fasta
>CALM_DROME P62152.2 RecName: Full=Calmodulin; Short=CaM



The command head -1 shows only the first line, which we can observe has been rewritten with the sequence ID as its name. By simply pressing return we accepted the default of fasta format and the suggested file name. Note: the file is still a fasta-formatted file. Only the name after > has changed.

2.1. Changing the format: format codes

It is possible to specify the format for the output file if we know the format code. (Reduced list; the complete list and description are available online at http://emboss.sourceforge.net/docs/themes/SequenceFormats.html)

Format codes (use one)   Format short description
abi                      ABI trace file format.
clustal, aln             ClustalW ALN (multiple alignment) format.
embl, em                 EMBL entry format.
fasta, ncbi              FASTA format with optional accession number.
gcg                      GCG 9.x and 10.x format with !!NA and !!AA sequence type
                         identified on the first line. Sequence data after first
                         double dot “..”
gcg8                     GCG 8.x heading up to first "..". Remainder is sequence data.
genbank, gb, ddbj        GENBANK entry format, including the feature table.
                         (NOT for output of protein sequences)
ig                       IntelliGenetics format.
mega                     Mega format.
msf                      GCG's MSF multiple sequence format.
nbrf, pir                NBRF (PIR) format, as used in the PIR database sequence files.
nexus, paup              Nexus/PAUP format.
pearson                  FASTA with no further processing of the "ID",
                         e.g.: >name description
phylip                   PHYLIP interleaved multiple alignment format.
strider                  DNA Strider format.
swissprot, swiss, sw     SWISSPROT entry format, or at least a minimal subset of
                         the fields.

Specifying the output format is done either with the qualifier -osf (output sequence format) or with the double colon nomenclature, in which the format code is followed by the desired file name: formatcode::filename

In addition it is possible to specify the input and output file names on the same line as seqret rather than pressing return.

✔ TASK Type the bold commands after the % or $ prompt and press return:
$ seqret testfile.txt test.gcg -osf gcg
$ cat test.gcg

!!AA_SEQUENCE 1.0
RecName: Full=Calmodulin; Short=CaM
CALM_DROME  Length: 149  Type: P  Check: 5504 ..

       1  MADQLTEEQI AEFKEAFSLF DKDGDGTITT KELGTVMRSL GQNPTEAELQ
      51  DMINEVDADG NGTIDFPEFL TMMARKMKDT DSEEEIREAF RVFDKDGNGF
     101  ISAAELRHVM TNLGEKLTDE EVDEMIREAD IDGDGQVNYE EFVTMMTSK

The command can also be written with the exactly equivalent alternative:

$ seqret testfile.txt gcg::test.gcg

In that case the format code and the desired output file name are joined by 2 colons. The complete command-line qualifiers are shown in the following tables (integer: a numeric value; boolean: a switch; string: text).

Input sequence command-line qualifiers that change the behaviour of the sequence input.

Qualifier       Type      Description
-sbegin         integer   first base used
-send           integer   last base used, default=seq length
-sreverse       boolean   reverse (if DNA)
-sask           boolean   ask for begin/end/reverse
-snucleotide    boolean   sequence is nucleotide
-sprotein       boolean   sequence is protein
-slower         boolean   make lower case
-supper         boolean   make upper case
-sformat        string    input sequence format
-sopenfile      string    input filename
-sdbname        string    database name
-sid            string    entryname
-ufo            string    UFO features
-fformat        string    features format
-fopenfile      string    features file name

Output sequence command-line qualifiers that change the behaviour of the sequence output.

Qualifier       Type      Description
-osformat       string    output sequence file format
-osextension    string    file name extension
-osname         string    base file name
-osdirectory    string    output sequence file directory
-osdbname       string    database name to add
-ossingle       boolean   create a separate output file for each entry
-oufo           string    feature file to create
-offormat       string    features format
-ofname         string    features file name
-ofdirectory    string    features output directory



3. List files: @ symbol

✔ INFO When the number of files becomes large, it may be easiest to enter the sequence file names into a list and supply the list to the EMBOSS application. A list file contains a single column, with each file name on one line. Lists can be embedded within another list if preceded by @. Here is an example of a valid list:

File1.gcg
File2.fasta
File3.ig
@another_list.txt

One easy way to create a list is to use the ls command with the dash-one option (ls -1) and redirect the output into a file. Some minor editing may be needed to remove names that do not belong in the list. Example:

ls -1 > mylist.txt

To tell the EMBOSS application that we are supplying a list rather than an actual sequence file, the list file name is preceded by the @ symbol, for example: seqret @mylist.txt
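How a list file with an embedded list unfolds into plain file names can be sketched in Python. This is an illustration under stated assumptions, not the EMBOSS implementation: the function name is hypothetical, blank lines are assumed to be skipped, nested list names are assumed to be relative to the list that mentions them, and File4.fasta is an invented entry.

```python
import os
import tempfile


def expand_list_file(path, _stack=()):
    """Return the sequence file names in a list file, following @ references."""
    real = os.path.realpath(path)
    if real in _stack:
        raise ValueError(f"list file refers back to itself: {path}")
    names = []
    with open(path) as handle:
        for line in handle:
            entry = line.strip()
            if not entry:
                continue  # skip blank lines (assumption)
            if entry.startswith("@"):
                # Nested list: resolve relative to the current list file.
                nested = os.path.join(os.path.dirname(path), entry[1:])
                names.extend(expand_list_file(nested, _stack + (real,)))
            else:
                names.append(entry)
    return names


# Small demonstration with two temporary list files.
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "another_list.txt"), "w") as handle:
        handle.write("File4.fasta\n")
    with open(os.path.join(folder, "mylist.txt"), "w") as handle:
        handle.write("File1.gcg\nFile2.fasta\nFile3.ig\n@another_list.txt\n")
    expanded = expand_list_file(os.path.join(folder, "mylist.txt"))

print(expanded)
```

The order of the names is preserved, and the entries of the embedded list appear at the position where the @ line was written.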

4. Multiple sequence formats: seqret, seqretsplit

✔ INFO Multiple sequences can fit together one after the other, in any order, in a single fasta-formatted file. Other formats interleave the sequences, as is the case for some alignment formats. The multiple sequence formats that are useful to us are the fasta, msf and aln formats. seqret will return a multiple fasta-formatted sequence file by default if it is supplied with multiple files as input, either as a list or as a wild card command:

seqret "*.fasta"

(On UNIX, the quotes keep the shell from expanding the wild card itself, so that seqret receives it as a single input.) The output format can be altered by specifying the output format code, e.g. msf or aln, in the same manner as for a single file: either with the -osf option or with the double colon :: method. Multiple sequence files can be split back into single files with the EMBOSS application seqretsplit.



L09 Exercise D: Pairwise comparisons with dotplots

EMBOSS programs used: dottup, dotmatcher, embossdata
EMBOSS qualifiers used: -option, -fetch, -sask, -file, -wordsize, -windowsize, -threshold
UNIX commands used in this exercise: cd, pwd, ls, cat

✔ READ In this exercise we will explore two EMBOSS programs for pair-wise sequence comparison: dotmatcher and dottup. Dot-plotting is the best method for comparing two sequences visually when it is suspected that there could be more than one segment of similarity between them. Identity and similarity are defined by the chosen comparison table (substitution matrix).

dotmatcher compares two protein or nucleic acid sequences at all positions of the first sequence against all positions of the second sequence and displays the points of similarity between them as a graphical 2-dimensional dotplot.

dottup looks for places where “words” (tuples) of a specified length have an exact match in both sequences and draws a diagonal line over the position of these words. The “word” method is faster but not as sensitive, and requires that the sequences actually contain short perfect matches for any similarity to be found. Using a longer word (tuple) size displays less random noise and runs extremely quickly, but is less sensitive.
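The “word” idea can be illustrated with a few lines of Python. This is a minimal sketch of exact word matching, not the dottup source code; the function names and the toy sequence are hypothetical.

```python
def word_matches(seq_a, seq_b, wordsize):
    """Return (i, j) start positions (0-based) where a word of length
    `wordsize` is identical in both sequences."""
    # Index every word of seq_b by its text for fast lookup.
    index = {}
    for j in range(len(seq_b) - wordsize + 1):
        index.setdefault(seq_b[j:j + wordsize], []).append(j)
    hits = []
    for i in range(len(seq_a) - wordsize + 1):
        for j in index.get(seq_a[i:i + wordsize], []):
            hits.append((i, j))
    return hits


def off_diagonal(hits):
    """Drop the trivial i == j hits seen when a sequence meets itself."""
    return [(i, j) for i, j in hits if i != j]


toy = "MKTLAAMKTL"  # hypothetical sequence with one repeated 4-letter word
print(off_diagonal(word_matches(toy, toy, 4)))
print(off_diagonal(word_matches(toy, toy, 5)))
```

With a word size of 4 the repeated word MKTL is found in both orientations of the comparison; with a word size of 5 no repeat is long enough, which is the loss of sensitivity described above.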

1. Working directory

✔ TASK Make sure you are in the LabFiles directory with:

pwd

If you just completed the previous exercise you need to go up one level with:

cd ..

If you are unsure:

cd ~/Desktop/LabFiles

2. Dotmatcher

The dotplot created by dotmatcher is a graphical output. Since we are using the line-command on an X11 system, the default graphical output is the X11 interactive display. The -graph option changes the graphical output format, for example to create a png file. See the “EMBOSS Graphical Output” tables within the introduction section for more details.



2.1. Defaults run

✔ TASK Use dotmatcher to compare the protein sequence in “dcalm.pep” with itself. This time use all of the default parameters. We will be able to see what the command line choices are with the -option qualifier. If the -option qualifier is omitted you will only be prompted for three things: input sequence, second sequence and graph type.

% dotmatcher -option
Draw a threshold dotplot of two sequences
Input sequence: dcalm.pep <rtn>
Second sequence: dcalm.pep <rtn>
Matrix file [EBLOSUM62]: <rtn>
Window size over which to test threshhold [10]: <rtn>
Threshold [23]: <rtn>
Graph type [x11]: <rtn> (to display this graphic)

While the interactive X11 graphic window is being displayed it is not possible to type any more commands and the prompt is not visible. Typing <rtn> would only create useless blank lines. The process displaying the graphical output needs to be terminated to return to the line-command prompt ($ or %). Do either of the following:

a) Close the graphical window (click the red “x” button at top left)
b) On the keyboard press the “control” and “C” keys (together)

Note: On line-command Windows the default graphical output is called “win3” and the graphical window is closed by clicking the red “x” square on the top right.



2.2. Window size

✔ TASK Repeat the dotmatcher command, adding the names of the files to the command line (short cut) along with -option, and when prompted change the window size to 20 and the threshold to 44.

$ dotmatcher dcalm.pep dcalm.pep -option
Draw a threshold dotplot of two sequences
Matrix file [EBLOSUM62]: <rtn>
Window size over which to test threshhold [10]: 20<rtn>
Threshold [23]: 44<rtn>
Graph type [x11]: <rtn>

<control C> together to return the prompt

2.3. Threshold

✔ TASK Rerun dotmatcher with a window size of 10 and a threshold of 44. This time put all the qualifiers on the first line.

$ dotmatcher dcalm.pep dcalm.pep -windowsize 10 -threshold 44
Draw a threshold dotplot of two sequences
Graph type [x11]: <rtn>

Notice the change in the size of the diagonals, relative to the defaults run, when changing only the threshold (stringency).

<control C> together to return the prompt

Note: in the output of this example there is at least one long region of similarity in addition to the diagonal that bisects the figure. The long bisecting diagonal represents the identity that is found when a sequence is compared to itself.



Reminder: If you want to know more about any of the programs in EMBOSS, add the -help qualifier after the program name, for example: dotmatcher -help

3. dottup

dottup displays a wordmatch dotplot of two sequences. The default word size is 10. Type tfm dottup at the prompt for all the details.

3.1. Word size

Run dottup with the -wordsize qualifier to identify perfect matches (in this case, repeat sequences) that are at least 8 residues long (wordsize of 8).

$ dottup dcalm.pep dcalm.pep -wordsize 8
Displays a wordmatch dotplot of two sequences
Graph type [x11]: <rtn> (to display the graphic)

<control C> together to return the prompt

Ah HA! Gotcha! You probably didn’t find any dots on this plot, did you? This means that there are no repeats of 8 identical amino acids in a row in the peptide sequence “dcalm.pep”. Repeat the exercise as above, but this time specify -wordsize 4. You should now find a few dots.

$ dottup dcalm.pep dcalm.pep -wordsize 4

<control C> together to return the prompt



4. Comparison tables: BLOSUM62

✔ INFO Special Note: but how can this be? In the first 3 exercises with dotmatcher, where we compared the protein to itself with different windows and thresholds (stringency), we found several long diagonals that suggested possible internal repeats within dcalm.pep, or at least regions of similarity. Yet the last 2 exercises with dottup show that there are very few regions with short perfect matches! The key here is to look at the scoring table that tells dotmatcher whether there IS a match or a similarity between residues in the window, and whether that match should (or should not) be counted towards the stringency score. For protein comparison dotmatcher uses a table called “EBLOSUM62”. This table assigns a relative significance score to every possible pairing of amino acids (aa), according to their (average) known frequencies in proteins and the observed likelihood of this particular pair substituting for each other during natural evolution. We can retrieve the comparison table and look at its content:

✔ TASK Type the bold commands after the % or $ prompt and press return.
$ embossdata -fetch -file EBLOSUM62
$ ls
$ cat EBLOSUM62

# Matrix made by matblas from blosum62.iij
# * column uses minimum score
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

Note how the values for identical aa matches in this table have scores that vary, but are always positive (AA=4, CC=9, GG=6). Some similarity matches are also positive (EQ=2, FY=3, LI=2). When using this table, dotmatcher just adds all the match values within the requested window size and evaluates whether that sum meets or exceeds the stringency. For example, with -windowsize=15 -threshold=11 settings, a single CC match (9), supplemented by an EQ match (2), would meet that criterion (i.e. if the rest of the aa pairs averaged zero), and give a dot on the dotplot. This means that, when using this default table, dotmatcher doesn’t have to find any exact matches between the sequences if it can find enough similarities with high enough scores to meet or exceed the specified stringency (threshold). Also note that even a very few exact matches within the window may give a high enough score to register as a dot. There are MANY other tables you can use, containing different values for aa comparisons. Some applications can use an identity matrix that assigns all identical aa matches a value of 1.0, and all mismatches a value of 0.
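The window-and-threshold arithmetic described above can be sketched in Python. This is a simplified illustration, not the dotmatcher implementation: the function names are hypothetical, only a handful of pair scores are copied from the EBLOSUM62 table, and each window is assumed to start at position (i, j) and run along the diagonal.

```python
# A few pair scores copied from the EBLOSUM62 table above.
SCORES = {
    ("C", "C"): 9, ("E", "Q"): 2, ("A", "A"): 4,
    ("A", "G"): 0, ("A", "T"): 0,
}


def pair_score(x, y):
    """Look up a symmetric pair score; pairs not listed raise KeyError."""
    if (x, y) in SCORES:
        return SCORES[(x, y)]
    return SCORES[(y, x)]


def window_dots(seq_a, seq_b, windowsize, threshold):
    """Return (i, j) window starts whose summed score reaches threshold."""
    dots = []
    for i in range(len(seq_a) - windowsize + 1):
        for j in range(len(seq_b) - windowsize + 1):
            total = sum(pair_score(seq_a[i + k], seq_b[j + k])
                        for k in range(windowsize))
            if total >= threshold:
                dots.append((i, j))
    return dots


# CC (9) + EQ (2) + two zero-scoring pairs (AG, AT) = 11
print(window_dots("CEAA", "CQGT", windowsize=4, threshold=11))
print(window_dots("CEAA", "CQGT", windowsize=4, threshold=12))
```

The two toy sequences share only one identical residue, yet the window still scores 11 and yields a dot at threshold 11; raising the threshold to 12 removes it.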

5. Nucleotide sequence comparison

Use dotmatcher to find regions of similarity between two different nucleotide sequences. This type of comparison (between 2 different sequences) is perhaps the most common use of dotplots. We will compare the H1 and H4 histone promoter sequences to see if they share any regions of sequence similarity; first use defaults, then repeat changing the threshold levels. Note: the default scoring table for DNA sequences (“EDNAFULL”) scores all identities with values of 5 and all mismatches with -4; pairs that involve ambiguity codes receive intermediate values.

✔ TASK
$ dotmatcher h1prom.seq h4prom.seq -option <rtn>

Draw a threshold dotplot of two sequences
Matrix file [EDNAFULL]:
Window size over which to test threshhold [10]:
Threshold [23]:
Graph type [x11]:
(use all other defaults)

Repeat with altered threshold:

✔ TASK
$ dotmatcher h1prom.seq h4prom.seq -option <rtn>

Draw a threshold dotplot of two sequences
Matrix file [EDNAFULL]: <rtn>
Window size over which to test threshhold [10]: 12 <rtn>
Threshold [23]: 30 <rtn>
Graph type [x11]: <rtn> (to display the graphic)

6. Inverted repeats

Use dotmatcher to find regions of inverted repeats within a single nucleotide sequence. Analyze the first 300 bases of the dau.seq sequence against its reverse-complement (“Reverse strand: Y”). In the previous seqret and file format exercise, the line-command qualifier -sask is listed in the tables of qualifiers and means “ask for begin/end/reverse.” Here we will use this feature with -sask1 and -sask2 to specify that we want to answer optional questions about sequence input files 1 and 2. Note: for clarity return <rtn> is only shown for lines that keep default values and is implied for other lines.

✔ TASK
$ dotmatcher dau.seq dau.seq -sask1 -sask2 -option
Draw a threshold dotplot of two sequences
Begin at position [start]: 1
End at position [end]: 300
Reverse strand [N]: <rtn>
Begin at position [start]: 1
End at position [end]: 300
Reverse strand [N]: Y
Matrix file [EDNAFULL]: <rtn>
Window size over which to test threshhold [10]: 50
Threshold [23]: 50
Graph type [x11]: <rtn> (to display the graphic)

Note how the top and bottom halves of this plot are symmetrical. This method can be a very valuable tool for identifying inverted repeats, or when used in conjunction with RNA structural prediction programs.
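What “first 300 bases” and “reverse strand” do to a sequence can be sketched in Python. This is only an illustration: the function names and the toy sequence are hypothetical, positions are assumed to be 1-based and inclusive, and only the unambiguous bases A, C, G and T are handled.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse-complement strand of an unambiguous DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def subsequence(seq, begin, end):
    """1-based, inclusive slice, like answering Begin/End at the prompts."""
    return seq[begin - 1:end]


toy = "GGGAATTCCC"  # hypothetical sequence
print(reverse_complement("ATGC"))
print(reverse_complement(toy) == toy)
print(subsequence(toy, 4, 9))
```

The toy sequence reads the same as its own reverse complement, which is the kind of region that produces a match when a sequence is plotted against its reverse strand.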



L09 Exercise E: Pairwise comparisons with optimal alignments

EMBOSS commands used: water, needle, and matcher
EMBOSS qualifiers used: -option, -fetch, -gapopen, -gapextend, -sask, -alt
UNIX commands used in this exercise: more

✔ INFO needle (footnote 2) and water (footnote 3) are pair-wise alignment programs based on published algorithms that find the optimal mathematical “fit” between two sequences through the judicious insertion of gaps (spacers designated with “-” symbols to show where one sequence might have an insertion or a deletion relative to the other). When you want an alignment that covers the whole length of both sequences (global), use needle. When you are trying to find only the best segment of similarity between two sequences (local), use water.

Both programs read a scoring matrix (comparison table) that contains values for every possible symbol match. These values are used to construct a path matrix that represents the entire surface of comparison, with a score at every position for the best possible alignment to that point. The quality score for the best alignment to any point is equal to the sum of the scoring matrix values of the matches in that alignment, less the gap creation penalty times the number of gaps in that alignment, less the gap extension penalty times the total length of all gaps in that alignment. The gap opening and gap extension penalties are set by the user.

After the path matrix is complete, the highest value on the surface (water) or at the edge of the comparison (needle) represents the end of the best region of similarity between the sequences. The best path from this highest value backwards to the point where the values revert to zero (water) or back to the origin of the matrix (needle) is the alignment shown in the output file. For water, this alignment is the best segment of similarity between the two sequences. For needle, it is the best end-to-end alignment for the two sequences.

Note that either program will find an alignment for any pair of sequences you compare, even if there is no significant similarity between them! YOU must evaluate the results critically to decide whether the segment shown is not just a random region of relative similarity.
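The path matrix idea can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the needle or water code: the function name is hypothetical, a simple match/mismatch score (5 and -4) stands in for a full comparison table, a single linear gap penalty replaces the separate opening and extension penalties, and the global variant is assumed not to penalize gaps at the sequence ends.

```python
def best_path_score(seq_a, seq_b, local, match=5, mismatch=-4, gap=10):
    """Fill a path matrix and return the best score.

    local=True  : highest value anywhere on the surface (water-like).
    local=False : highest value on the last row or column (needle-like,
                  with end gaps left unpenalized in this sketch).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            cell = max(matrix[i - 1][j - 1] + pair,   # align the two symbols
                       matrix[i - 1][j] - gap,        # gap in seq_b
                       matrix[i][j - 1] - gap)        # gap in seq_a
            # A local path may restart anywhere, so scores never go below 0.
            matrix[i][j] = max(cell, 0) if local else cell
    if local:
        return max(max(row) for row in matrix)
    edge = matrix[-1][1:] + [row[-1] for row in matrix[1:]]
    return max(edge)


print(best_path_score("TTACGTTT", "GGACGTGG", local=True))
print(best_path_score("AAAA", "TTTT", local=True))
print(best_path_score("TTACGTTT", "ACGT", local=False))
```

Only the score is returned here; the programs also walk back along the best path to print the alignment itself.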

1. Local alignment: water

Use water to align two regions of dcalm.pep that the dottup and dotmatcher programs show to be similar. The first region is between amino acids 10 and 30. The second is between amino acids 80 and 100. Call the output file “dcalm.pair”. As in the previous exercise we will use the “-saskn” qualifier to have the program prompt for start and end positions, with “n” being the sequence number in the order listed in the line command. For clarity return <rtn> is only shown for lines that keep default values and is implied for other lines.

2 Needleman SB, Wunsch CD. (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". Journal of Molecular Biology 48 (3): 443-53. doi:10.1016/0022-2836(70)90057-4. PMID 5420325
3 Smith TF, Waterman MS (1981). "Identification of Common Molecular Subsequences". Journal of Molecular Biology 147: 195-197. doi:10.1016/0022-2836(81)90087-5

✔ TASK
$ water dcalm.pep dcalm.pep -sask1 -sask2 -option

Smith-Waterman local alignment of sequences
Begin at position [start]: 10
End at position [end]: 30
Begin at position [start]: 80
End at position [end]: 100
Matrix file [EBLOSUM62]: <rtn>
Gap opening penalty [10.0]: <rtn>
Gap extension penalty [0.5]: <rtn>
Output alignment [calm_drome.water]: dcalm.pair

$ more dcalm.pair

The output first restates the commands given and then provides the result:

#=======================================
#
# Aligned_sequences: 2
# 1: CALM_DROME
# 2: CALM_DROME
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 17
# Identity:      11/17 (64.7%)
# Similarity:    14/17 (82.4%)
# Gaps:           0/17 ( 0.0%)
# Score: 60.0
#
#
#=======================================

CALM_DROME    11 EFKEAFSLFDKDGDGTI     27
                 |.:|||.:|||||:|.|
CALM_DROME    84 EIREAFRVFDKDGNGFI    100

#---------------------------------------
#---------------------------------------

Note: both water and needle output will summarize the input parameters chosen for the alignment and show the quality score (score of the optimal matrix path), the % similarity and the % identity that were calculated for this alignment, according to the scoring table that was selected, in this case the default: Eblosum62.
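The numbers in this report can be checked by hand, or with a short Python sketch such as the one below. It is an illustration only: the function name is hypothetical, the pair scores are copied from the EBLOSUM62 table shown in Exercise D (just the pairs that occur in this alignment), and “similarity” is assumed to count every position with a positive score.

```python
# Pair scores copied from the EBLOSUM62 table (only the pairs needed here).
SCORES = {
    ("E", "E"): 5, ("A", "A"): 4, ("F", "F"): 6, ("D", "D"): 6,
    ("K", "K"): 5, ("G", "G"): 6, ("I", "I"): 4,
    ("F", "I"): 0, ("K", "R"): 2, ("S", "R"): -1, ("L", "V"): 1,
    ("D", "N"): 1, ("T", "F"): -2,
}


def summarize(top, bottom):
    """Return (score, identities, similarities) for an ungapped alignment."""
    score = identity = similarity = 0
    for x, y in zip(top, bottom):
        value = SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]
        score += value
        if x == y:
            identity += 1
        if value > 0:  # assumption: positive score counts as similar
            similarity += 1
    return score, identity, similarity


top = "EFKEAFSLFDKDGDGTI"     # CALM_DROME 11-27
bottom = "EIREAFRVFDKDGNGFI"  # CALM_DROME 84-100
score, identity, similarity = summarize(top, bottom)
length = len(top)
print(score, identity, similarity, length)
print(round(100 * identity / length, 1), round(100 * similarity / length, 1))
```

Because this alignment contains no gaps, no gap penalties enter the sum, and the total of the pair scores equals the reported score.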



2. Global alignment: needle

Use needle to analyze the viral protein leader sequences from encephalomyocarditis virus (“emc.pep”) and from Theiler’s virus (“tme.pep”). Call the output file “emcgap1.pair”.

2.1. Defaults run

✔ TASK
$ needle emc.pep tme.pep
Needleman-Wunsch global alignment of two sequences
Gap opening penalty [10.0]: <rtn> (accept all defaults, this time)
Gap extension penalty [0.5]: <rtn> (accept all defaults, this time)
Output alignment [e.needle]: emcgap1.pair (for the output file name)

$ more emcgap1.pair
#=======================================
#
# Aligned_sequences: 2
# 1: e.
# 2: t.
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 258
# Identity:     145/258 (56.2%)
# Similarity:   170/258 (65.9%)
# Gaps:          36/258 (14.0%)
# Score: 721.0
#
#
#=======================================

e.     1 MATTMEQETCAHSLTFEECPKCSALQYRNGF-YLLKYDEEWYPEELL-TD     48
                ..|.|... :.||.|:|:....|| |||..|.||||.:|| .|
t.     1 -------MACKHGYP-DVCPICTAVDATPGFEYLLMADGEWYPTDLLCVD     42

e.    49 GEDDVF---------------DPELDMEVVFELQGNSTSSDKNNSSSEGN     83
         .:||||               |..|..::|.|.||||:||||:||.|.||
t.    43 LDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSGN     92

e.    84 EGVIINNFYSNQYQNSIDLSANAAGS-DPPRTYGQFSNLFSGAVNAFSNM    132
         |||||||||||||||||||||:...: |.|:|.||.||:..||.|||:.|
t.    93 EGVIINNFYSNQYQNSIDLSASGGNAGDAPQTNGQLSNILGGAANAFATM    142

[...] etc.

2.2. Change the gaps

Now align the same sequences, but with lower gap creation penalties and gap extension penalties. Notice how the optimal path changes depending upon how “easy” it is for the program to insert gaps.



✔ TASK
$ needle emc.pep tme.pep -gapopen=3 -gapextend=1

<rtn> (accept all other defaults)
emcgap2.pair (for the output file name)

$ more emcgap2.pair

e.     1 MATTMEQETCAHSLTFEE-CPKCSALQYRNGF-YLLKYDEEWYPEELL-T     47
         ||       |.|.  :.: ||.|:|:....|| |||..|.||||.:|| .
t.     1 MA-------CKHG--YPDVCPICTAVDATPGFEYLLMADGEWYPTDLLCV     41

e.    48 DGEDDVF---D----PE-LD-----M--EVVFELQGNSTSSDKNNSSSEG     82
         |.:||||   |    .: :|     :  ::|.|.||||:||||:||.|.|
t.    42 DLDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSG     91

[...] etc.

Note also how the increased scores of this alignment (Score: 765; Length: 261; Gaps: 16.1%; %Similarity: 67.8; %Identity: 57.5) differ from the previous values (Score: 721; Length: 258; Gaps: 14.0%; %Similarity: 65.9; %Identity: 56.2) and come at the expense of adding more gaps.

3. Comparison tables: PAM250

Use the -fetch qualifier to copy an alternative symbol comparison table, the original PAM250 matrix, to your local directory. Realign the viral leader sequences using this table. Use the same gap opening and gap extension penalties as in section 2.2. How does using this alternative table change the gap analysis?

✔ TASK
$ embossdata -fetch -file EPAM250 (note: case sensitive)
$ needle emc.pep tme.pep -gapopen=3 -gapextend=1 -datafile=EPAM250
Needleman-Wunsch global alignment of two sequences
Output alignment [e.needle]: emcgap3.pair (output file name)

$ more emcgap3.pair
#=======================================

e.     1 MATTMEQETCAHSLTFEE-CPKCSALQYRNGF-YLLKYDEEWYPEELL-T     47
             |.   |.|:  :.: ||.|:|::...|| |||..|.||||.:|| .
t.     1 ----MA---CKHG--YPDVCPICTAVDATPGFEYLLMADGEWYPTDLLCV     41

e.    48 DGEDDVF-------DPE-LD-----M--EVVFELQGNSTSSDKNNSSSEG     82
         |.:||||       ::: :|     :  ::|.|.||||:||||:||.|.|
t.    42 DLDDDVFWPSDTSNQSQTMDWTDVPLIRDIVMEPQGNSSSSDKSNSQSSG     91

[...] etc.



Note: the last 3 exercises should emphasize that the specific output for any optimal alignment is very sensitive to the values in the scoring table, the gap weight (-gapopen) and the gap length (-gapextend) weight. How then, can we recognize a good alignment when we see one? What values or tables should we use? These programs and others will typically offer default values that have been chosen for their general ability to give “relevant” results. But it is always wise to rerun alignment programs with a variety of different input values, and then LOOK carefully at the different results. For many proteins or nucleic acid comparisons, the regions of “good” similarity will tend to be part of the optimal path over a wide range of penalties or table scores. Mostly however, we still need to use good biological judgment and evaluate each output for whether or not it makes any sense! If you don’t trust your judgment, and prefer a mathematical answer, the shuffleseq program can help you evaluate the significance of your alignment, using a simple statistical method by randomizing (shuffling) the sequence. Simply create one or many randomized sequences with shuffleseq and compare them with the biological sequence keeping track of the resulting scores.
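The shuffling idea can be sketched in Python. This is an illustration of the statistical reasoning only, not a substitute for shuffleseq: the function names are hypothetical, the example scores are invented, and in practice each shuffled sequence would be aligned with needle or water to obtain its score.

```python
import random
import statistics


def shuffled(sequence, rng):
    """Return a random permutation of the sequence (same composition)."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def z_score(real_score, shuffled_scores):
    """How many standard deviations the real score lies above the mean
    of the scores obtained with shuffled sequences."""
    mean = statistics.mean(shuffled_scores)
    spread = statistics.stdev(shuffled_scores)
    return (real_score - mean) / spread


original = "MADQLTEEQIAEFKEAFSLF"
rng = random.Random(711)  # fixed seed so the run is repeatable
decoys = [shuffled(original, rng) for _ in range(5)]
print(decoys)

# Hypothetical scores: 1 real alignment and 3 shuffled controls.
print(z_score(10, [1, 2, 3]))
```

A real alignment score that sits many standard deviations above the shuffled scores is unlikely to be explained by composition alone; a score within the shuffled range deserves suspicion.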

4. Global and local alignment comparison

Compare the two histone promoter sequences using the water and needle programs. How do the alignments from these two programs differ?

✔ TASK
$ water h1prom.seq h4prom.seq <rtn> (3 times)
$ needle h1prom.seq h4prom.seq <rtn> (9 times)

Accepting all defaults includes accepting the default names for the output files, which can then be viewed with:
$ more h1prom.water
$ more h1prom.needle

Local: h1prom.water

########################################
# Program: water
# Rundate: Fri 24 Oct 2008 17:40:42
# Commandline: water
#    [-asequence] h1prom.seq
#    [-bsequence] h4prom.seq
# Align_format: srspair
# Report_file: h1prom.water
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: h1prom.seq
# 2: h4prom.seq
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 180
# Identity: 72/180 (40.0%)
# Similarity: 72/180 (40.0%)
# Gaps: 80/180 (44.4%)
# Score: 103.5
#
#
#=======================================

h1prom.seq  271 TGTGAACCTGGAGGCTGTTTT--CCTCCTTTGGAG---CTTCAAAGTGCC 315
                |.||..|||..|    |.|||  |||||...||.|   ||||   ..|||
h4prom.seq   11 TTTGGCCCTTTA----GATTTCCCCTCCACCGGCGGGACTTC---CCGCC 53

h1prom.seq  316 AAATTCTGTACCATTGTTTTAAGCATTTAATCAAATTTTGAGG-ACTAAC 364
                .|.|||                              |||.||| .||
h4prom.seq   54 GACTTC------------------------------TTTCAGGTTCT--- 70

h1prom.seq  365 AAACACAATTTGGGAGTCCAAC--GCG------AGCGCGGC----GGCCA 402
                     ||.||.||....|||||  .||      .||||.||    ||.||
h4prom.seq   71 -----CAGTTCGGTCCGCCAACTGTCGTATAAAGGCGCTGCCTCAGGTCA 115

h1prom.seq  403 GAGGGCGGTGGATTGGACGCTCCACCAATC 432
                ||||                 ||||.||.|
h4prom.seq  116 GAGG-----------------CCACAAAGC 128

#---------------------------------------
#---------------------------------------

Global: h1prom.needle (no header)

h1prom.seq    1 GTCCTGTGCCTGTGTTACTTGCTACAGTTAGAAACAAACTTCATGCCCAA 50
h4prom.seq    0 -------------------------------------------------- 0

h1prom.seq   51 ACCAAGGAACCCAGTGTCTTTTCTCTTGCAAAAATCAAAGCATGAACTCA 100
h4prom.seq    0 -------------------------------------------------- 0

h1prom.seq  101 TGGGCAAATTTTTAAAAATAACTTTCACTGGATACTTAGTAGAAATTTAT 150
h4prom.seq    0 -------------------------------------------------- 0

h1prom.seq  151 CGCGACACGCTACTAACTAACATGATGCCCTCAGCCCAATGGATTCTTAT 200
                                                   ||.||   |||
h4prom.seq    1 -----------------------------------CCTAT---TTC---- 8

h1prom.seq  201 GAAAAGCTGAAGGGATTT-----TTTAAAATATCTTTCATCAATTGCACA 245
                             |.|||     ||||.|     ||||..|   |.||
h4prom.seq    9 -------------GGTTTGGCCCTTTAGA-----TTTCCCC---TCCA-- 35

h1prom.seq  246 AGATTCTTGAAAACACAAACAAGTATGTGAACCTGGAGGCTGTTTTC--- 292
                                                |.||.||  |..|||
h4prom.seq   36 --------------------------------CCGGCGG--GACTTCCCG 51

h1prom.seq  293 ----CTCCTTTGGAGCTTCAAAGT-------GCCAAATTCTGTACCATTG 331
                    ||.|||| .||.|||..|||       |||||   |||     |.|
h4prom.seq   52 CCGACTTCTTT-CAGGTTCTCAGTTCGGTCCGCCAA---CTG-----TCG 92

h1prom.seq  332 TTTTAAGCATTTAATCAAATTTTGAGGACTAACAAACACAATTTGGGAGT 381
                |.|.|||
h4prom.seq   93 TATAAAG------------------------------------------- 99

h1prom.seq  382 CCAACGCGAGCGCGGC----GGCCAGAGG-------GCGGTGGATTGGAC 420
                         ||||.||    ||.||||||       |||.||
h4prom.seq  100 ---------GCGCTGCCTCAGGTCAGAGGCCACAAAGCGTTG-------- 132

h1prom.seq  421 GCTCCACCAATCACAGGGCAGCGCCGGCTTATATAAGCCCGGGCCCGAGC 470
h4prom.seq  132 -------------------------------------------------- 132

h1prom.seq  471 ATAGCAGCAACGCAAAACCTGCTCTTTAGATTTCGAGCTTATTCTCTTCT 520
h4prom.seq  132 -------------------------------------------------- 132

h1prom.seq  521 AGCAGTTTCTTGCCACCATG 540
h4prom.seq  132 -------------------- 132

#---------------------------------------
#---------------------------------------
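The difference between the two kinds of alignment can also be seen in a small Python sketch. It is a simplified illustration and not the code used by water or needle: the function names (needle_score, water_score), the toy sequences, and the scoring values (match +5, mismatch -4, and a flat penalty of 10 per gap position in place of separate open and extend penalties) are assumptions made for this example. The only change between the two functions is that the local version never lets a score fall below zero and reports the best cell anywhere in the table.

```python
def needle_score(a, b, match=5, mismatch=-4, gap=-10):
    """Global score: both sequences must be aligned from end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]  # bottom-right cell: the whole of a against the whole of b


def water_score(a, b, match=5, mismatch=-4, gap=-10):
    """Local score: the best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A score of 0 means "start a new alignment here"
            cell = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best  # best cell anywhere in the table


# Hypothetical toy sequences: a shared core (GATTACA) with unrelated flanks
a = "TTTTGATTACATTTT"
b = "CCCGATTACACCC"
print("local:", water_score(a, b))
print("global:", needle_score(a, b))
```

The local score counts only the shared core, while the global score also pays for the mismatched and gapped flanks, so it comes out lower for this pair.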

5. Alternative alignments

Repeat the alignment of the histone promoters using the matcher program and the -alt (alternative) command line qualifier. This sets the number of alternative matches output. By default only the highest scoring alignment is shown. A value of 2 gives you other reasonable alignments. How do the alignments differ?

✔ TASK
$ matcher h1prom.seq h4prom.seq -alt=4 <rtn> (one time)
h1prom4.pair     (for output file name)
$ matcher h1prom.seq h4prom.seq -alt=10 <rtn> (one time)
h1prom10.pair     (for output file name)
$ more h1prom4.pair
$ more h1prom10.pair

✔ READ
Note: just when you thought you had the variables in the optimal alignment programs figured out, you now have to face the concept of the “high road” and the “low road”! The specific location of a gap insertion is arbitrary in many cases, and equally optimal alignments can be generated by inserting the gaps differently. When equally optimal alignments are possible, matcher will insert the gaps differently if you use the alternative qualifier. Here are two examples for the alignment of GACCAT with GACAT, each with a different set of parameters.

For: Match = 10  MisMatch = -9  Gap weight = 10  Length Weight = 0

LowRoad (Quality = 40):
   1 GACCAT 6
     || |||
   1 GA.CAT 5

HighRoad (Quality = 40):
   1 GACCAT 6
     ||| ||
   1 GAC.AT 5

For: Match = 10  MisMatch = 0  Gap weight = 30  Length Weight = 0

HighRoad (Quality = 30):
   1 GACCAT 6
     |||
   1 GACAT. 5

LowRoad (Quality = 30):
   1 GACCAT 6
        |||
   1 .GACAT 5

Essentially, the low road shifts all of the arbitrary gaps in sequence two to the left and all of the arbitrary gaps in sequence one to the right. The high road does exactly the opposite. Applications will try NOT to insert a gap whenever that is possible, but when forced to choose, may use the high road alternative as a default.
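The Quality values quoted for the GACCAT and GACAT alignments can be checked by hand or with a short Python sketch. The function name alignment_quality is hypothetical, and the scoring rule is an assumption inferred from the numbers in the examples: each matching column adds the match value, each mismatch adds the mismatch value, each internal gap costs the gap weight plus the length weight times the gap length, and gaps hanging off either end of the alignment cost nothing.

```python
def alignment_quality(top, bottom, match, mismatch,
                      gap_weight, length_weight, gap_char="."):
    """Score an alignment that has already been written out with gaps.

    Assumption: terminal gaps are free, which is what the second pair of
    examples (Quality = 30) implies.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")

    def span(s):
        # First and last columns that hold a residue rather than a gap
        cols = [i for i, c in enumerate(s) if c != gap_char]
        return cols[0], cols[-1]

    start = max(span(top)[0], span(bottom)[0])
    end = min(span(top)[1], span(bottom)[1])

    quality = 0
    gap_in = None  # which sequence the current gap run is in, if any
    for x, y in zip(top[start:end + 1], bottom[start:end + 1]):
        if x == gap_char or y == gap_char:
            which = "top" if x == gap_char else "bottom"
            if which != gap_in:
                quality -= gap_weight   # charged once per gap
            quality -= length_weight    # charged per gap position
            gap_in = which
        else:
            quality += match if x == y else mismatch
            gap_in = None
    return quality


# First parameter set: both roads give the same quality
print(alignment_quality("GACCAT", "GA.CAT", 10, -9, 10, 0))  # 40
print(alignment_quality("GACCAT", "GAC.AT", 10, -9, 10, 0))  # 40
# Second parameter set: the gap is pushed to one end or the other
print(alignment_quality("GACCAT", "GACAT.", 10, 0, 30, 0))   # 30
print(alignment_quality("GACCAT", ".GACAT", 10, 0, 30, 0))   # 30
```

With the second parameter set, an internal gap would give five matches minus one gap, 50 - 30 = 20, which is lower than the 30 obtained by pushing the gap to an end. That is why both optimal alignments place the gap at a terminus.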

L09 Exercise F: End of laboratory

1) Tell the server you are done: type exit at the $ prompt.
2) Quit from X11: File > Quit.
3) Close all Macintosh windows.




Class notes