www.bioalgorithms.info An Introduction to Bioinformatics Algorithms Protein Sequencing and Identification With Mass Spectrometry



Outline

• Tandem Mass Spectrometry

• De Novo Peptide Sequencing

• Spectrum Graph

• Protein Identification via Database Search

• Identifying Post Translationally Modified Peptides

• Spectral Convolution

• Spectral Alignment


Amino Acids vs. Nucleic Acids

Amino Acids: Amine, Carboxylic Acid, R-group

Nucleic Acids: Sugar, Phosphate, Base

Protein Backbone

N-terminus   H...-HN-CH(R_{i-1})-CO-NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-...OH   C-terminus

AA residue i-1, AA residue i, AA residue i+1

Breaking of Protein Backbone

N-terminus   H...-HN-CH(R_{i-1})-CO | NH-CH(R_i)-CO-NH-CH(R_{i+1})-CO-...OH   C-terminus

AA residue i-1, AA residue i, AA residue i+1

[Figure: in the presence of a proton (H+), the backbone breaks between AA residue i-1 and AA residue i.]

Breaking Peptides into Fragment Ions

• Proteases, e.g. trypsin, break a protein into peptides.

• A tandem mass spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece.

• The mass spectrometer electrically accelerates the fragment ions; heavier ions accelerate more slowly than lighter ones.

• Mass spectrometers measure the mass/charge ratio of an ion.
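Because the instrument reports mass/charge rather than mass, the same molecule appears at different positions depending on its charge. The sketch below is a minimal illustration, assuming positive ions formed by adding protons; the function name `mass_to_charge` and the example mass of 1274.0 Da are hypothetical and not taken from the slides.

```python
PROTON_MASS = 1.00728  # Da, mass of one proton

def mass_to_charge(neutral_mass: float, charge: int) -> float:
    """m/z of a positive ion formed by adding `charge` protons to a neutral molecule."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge

# A hypothetical 1274.0 Da peptide observed with one and with two charges.
print(mass_to_charge(1274.0, 1))
print(mass_to_charge(1274.0, 2))
```

A doubly charged ion therefore shows up at roughly half the m/z of the singly charged one.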

Mass Spectrometry

Matrix-assisted Laser Desorption/Ionization

From lectures by Vineet Bafna (UCSD)

Tandem Mass Spectrometry

[Figure: LC base-peak chromatogram (relative abundance vs. time, RT 0.01 - 80.02 min); full MS scan 1707 (RT 54.44 min, m/z 300 - 2000, labeled peaks including 638.0 and 801.0); MS/MS scan 1708 of precursor m/z 638.00 (RT 54.47 min, m/z 165 - 1925, labeled peaks including 588.1, 687.3, 850.3, 949.4 and 1048.6). Instrument schematic: ion source, MS-1, collision cell, MS-2.]

Using Tandem Mass Spectrometry

[Figure: an MS/MS instrument produces a spectrum (scan 1708, relative abundance vs. m/z), which is interpreted to obtain the peptide sequence by one of two routes:]

• Database search (Sequest)

• De novo interpretation (Sherenga)

Tandem Mass Spectrum

• Tandem mass spectrometry (MS/MS) mainly generates partial N- and C-terminal peptides.

• The spectrum consists of different ion types because peptides can be broken in several places.

• Chemical noise often complicates the spectrum.

• The spectrum is represented in 2-D: mass/charge axis vs. intensity axis.

Tandem Mass Spectrum: An Example

[Figure: example tandem mass spectrum, with secondary fragmentation and the ionized parent peptide labeled.]

N- and C-terminal Peptides

[Figure: the N-terminal peptides and C-terminal peptides of a peptide.]

Terminal peptides and ion types

Peptide: mass (Da) 57 + 97 + 147 + 114 = 415

Peptide without H2O: mass (Da) 57 + 97 + 147 + 114 - 18 = 397
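The arithmetic on this slide can be reproduced with a small lookup table. This is only a sketch: the four masses coincide with the standard integer residue masses of G, P, F and N, so the peptide name "GPFN" is an assumption made for illustration, as are the names `RESIDUE_MASS` and `peptide_mass`.

```python
# Integer residue masses in Da (partial table; assumed to be the residues behind the slide's numbers).
RESIDUE_MASS = {"G": 57, "P": 97, "F": 147, "N": 114}
WATER = 18  # Da, integer mass of H2O

def peptide_mass(peptide: str, water_loss: bool = False) -> int:
    """Sum of residue masses, optionally minus one water molecule."""
    total = sum(RESIDUE_MASS[aa] for aa in peptide)
    return total - WATER if water_loss else total

print(peptide_mass("GPFN"))                   # 57 + 97 + 147 + 114
print(peptide_mass("GPFN", water_loss=True))  # same sum minus 18
```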

Peptide Fragmentation

[Figure: backbone of a four-residue peptide (side chains R1 to R4, NH3+ at the N-terminus, COOH at the C-terminus) with the cleavage sites of the N-terminal ions a2, a3, b2, b3 and the C-terminal ions y1, y2, y3. Neutral-loss ions shown: b2-H2O, b3-NH3, y3-H2O, y2-NH3.]

De novo Peptide Sequencing

[Figure: the MS/MS spectrum of scan 1708 (relative abundance vs. m/z) is interpreted directly to produce the peptide sequence.]

Theoretical Spectrum

[Figure: theoretical spectrum of a peptide, continued over three slides.]

Building Spectrum Graph

• How to create vertices (from peaks)

• How to create edges (from mass differences)

• How to score paths

• How to find best path

[Figure sequence, slides 20-28, building up the MS/MS spectrum of the example peptide SEQUENCE on a mass/charge (m/z) axis:]

• b ions, read in the order S E Q U E N C E

• a ions; a is an ion type shift in b

• y ions, read in the order E C N E U Q E S

• y ions with corresponding intensities

• noise peaks

• the resulting MS/MS spectrum (intensity vs. m/z)

Mass Differences Correspond to Amino Acids

[Figure: MS/MS spectrum in which the mass differences between peaks are labeled with the amino acid letters of the example peptide (s, e, q, u, e, n, c, e).]

Ion Types

• Some peaks correspond to fragment ions, others are just random noise.

• Knowing ion types Δ = {δ1, δ2, …, δk} lets us distinguish fragment ions from noise.

• We can learn ion types δi and their probabilities qi by analyzing a large test sample of annotated spectra.

Example of Ion Type

• Δ = {δ1, δ2, …, δk}

• Ion types: {b, b-NH3, b-H2O}

• Corresponding values of δ: {0, 17, 18}

• *Note: In reality the δ value of ion type b is -1

Peptide Sequencing Problem

Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• Δ: set of possible ion types

• m: parent mass

Output:

• P: peptide with mass m whose theoretical spectrum best matches the experimental spectrum S

Vertices

• Masses of potential N-terminal peptides

• Vertices are generated by reverse shift

• Every peak s in a spectrum generates vertices

• V(s) = {s+δ1, s+δ2, …, s+δk}

Vertices (cont’d)

• Vertices of the spectrum graph:

• {v_init} ∪ V(s1) ∪ V(s2) ∪ ... ∪ V(sm) ∪ {v_fin}

• Where Δ = {δ1, δ2, …, δk} are ion types.
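The vertex construction can be sketched in a few lines. This is a hedged illustration, not the deck's implementation: it assumes v_init = 0 and v_fin = parent mass, uses the example shifts {0, 17, 18} for {b, b-NH3, b-H2O}, and the peak values and function names are hypothetical.

```python
from collections import Counter

def vertex_votes(spectrum, deltas):
    """Count how many (peak, ion type) pairs map to each candidate vertex s + delta."""
    votes = Counter()
    for s in spectrum:
        for d in deltas:
            votes[s + d] += 1
    return votes

def spectrum_graph_vertices(spectrum, deltas, parent_mass):
    """{v_init} U V(s_1) U ... U V(s_m) U {v_fin}, assuming v_init = 0 and v_fin = parent_mass."""
    return sorted({0, parent_mass} | set(vertex_votes(spectrum, deltas)))

# Hypothetical peaks: a b ion at 253 with its b-NH3 (236) and b-H2O (235) companions.
peaks = [235, 236, 253]
deltas = [0, 17, 18]
print(vertex_votes(peaks, deltas)[253])
print(spectrum_graph_vertices(peaks, deltas, 710))
```

A candidate mass supported by several peaks (here 253, reached from all three peaks) is more likely to be a real N-terminal peptide mass than one supported by a single peak.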

Reverse Shifts

• Two peaks, b-H2O and b, are given by the mass spectrum.

• Apply a +H2O shift to the spectrum: where a shifted peak coincides with an original peak, that position is a possible vertex.

[Figure: intensity vs. mass/charge (m/z). Red: mass spectrum; blue: shift (+H2O). Labeled peaks: b-H2O, b (coinciding with b-H2O shifted by +H2O), and b+H2O.]

Example of Reverse Shift

[Figure: spectrum panels labeled "Shift in H2O and NH3" and "Shift in H2O".]

Edges

• Two vertices whose mass difference corresponds to an amino acid A are connected by an edge labeled A.

• Gap edges are added for di- and tri-peptides.

Paths

• A path in the graph corresponds to an amino acid sequence.

• There are many paths; how do we find the correct one?

• We need scoring to evaluate paths.
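To make the edge and path construction concrete, here is a minimal sketch that connects vertices whose difference equals a residue mass and enumerates all paths from the initial to the final vertex. It assumes integer residue masses, treats I/L and K/Q as single labels because they share an integer mass, and omits gap edges; the vertex list and all names are illustrative.

```python
# Standard integer residue masses (Da); I/L and K/Q are indistinguishable at this resolution.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I/L": 113, "N": 114, "D": 115, "K/Q": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
MASS_TO_RESIDUE = {mass: aa for aa, mass in RESIDUE_MASS.items()}

def build_edges(vertices):
    """Edges u -> v labeled by the residue whose mass equals v - u (no gap edges)."""
    verts = sorted(vertices)
    edges = {u: [] for u in verts}
    for i, u in enumerate(verts):
        for v in verts[i + 1:]:
            label = MASS_TO_RESIDUE.get(v - u)
            if label is not None:
                edges[u].append((v, label))
    return edges

def all_paths(edges, start, end):
    """Every labeled path from start to end (depth-first search on a DAG)."""
    if start == end:
        return [[]]
    paths = []
    for nxt, label in edges[start]:
        for rest in all_paths(edges, nxt, end):
            paths.append([label] + rest)
    return paths

# Hypothetical noise-free vertex set: the prefix masses of a six-residue peptide.
prefix_masses = [0, 97, 253, 354, 483, 596, 710]
paths = all_paths(build_edges(prefix_masses), 0, 710)
print(paths)
```

With real spectra the vertex set also contains noise and many paths exist, which is why a scoring function is needed to choose among them.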

Path Score

• p(P,S) = probability that peptide P produces spectrum S = {s1, s2, …, sq}

• p(P, s) = the probability that peptide P generates a peak s

• Scoring = computing probabilities

• $p(P,S) = \prod_{s \in S} p(P, s)$

Peak Score

• For a position t that represents ion type δj:

$$p(P, s_t) = \begin{cases} q_j & \text{if a peak is generated at } t \\ 1 - q_j & \text{otherwise} \end{cases}$$

Peak Score (cont’d)

• For a position t that is not associated with an ion type:

$$p_R(P, s_t) = \begin{cases} q_R & \text{if a peak is generated at } t \\ 1 - q_R & \text{otherwise} \end{cases}$$

• qR = the probability of a noisy peak that does not correspond to any ion type

Finding Optimal Paths in the Spectrum Graph

• For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:

$$p(P', S) = \max_{P} p(P, S)$$

• Peptides = paths in the spectrum graph

• P’ = the optimal path in the spectrum graph

Ions and Probabilities

• Tandem mass spectrometry is characterized by a set of ion types {δ1, δ2, …, δk} and their probabilities {q1, …, qk}

• δi-ions of a partial peptide are produced independently with probabilities qi

Ions and Probabilities

• A peptide has all k peaks with probability $\prod_{i=1}^{k} q_i$

• and no peaks with probability $\prod_{i=1}^{k} (1 - q_i)$

• A peptide also produces "random noise" with uniform probability qR in any position.

Ratio Test Scoring for Partial Peptides

• Incorporates premiums for observed ions and penalties for missing ions.

• Example: for k=4, assume that for a partial peptide P’ we only see ions δ1, δ2, δ4.

The score is calculated as:

$$\frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{(1 - q_3)}{(1 - q_R)} \cdot \frac{q_4}{q_R}$$
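The ratio test for one partial peptide can be sketched as a product of premiums and penalties. The probabilities below are made-up values chosen only to exercise the k=4 example (ions δ1, δ2, δ4 observed, δ3 missing); the function name `ratio_score` is hypothetical.

```python
import math

def ratio_score(q, q_noise, observed):
    """Product over ion types of q_j/q_R (ion observed) or (1-q_j)/(1-q_R) (ion missing).

    q        -- list of ion-type probabilities q_1..q_k
    q_noise  -- noise probability q_R
    observed -- set of 0-based indices of the ion types that were seen
    """
    score = 1.0
    for j, qj in enumerate(q):
        if j in observed:
            score *= qj / q_noise               # premium for an observed ion
        else:
            score *= (1 - qj) / (1 - q_noise)   # penalty for a missing ion
    return score

# Hypothetical probabilities; indices {0, 1, 3} stand for ions delta_1, delta_2, delta_4.
q = [0.7, 0.5, 0.3, 0.2]
score = ratio_score(q, q_noise=0.1, observed={0, 1, 3})
print(score, math.log(score))
```

In practice the logarithm of the score is often more convenient, since it turns the product into a sum that can be accumulated along a path.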

Scoring Peptides

• T: set of all positions.

• $T_i = \{t_{\delta_1}, t_{\delta_2}, \ldots, t_{\delta_k}\}$: set of positions that represent ions of partial peptide Pi.

• A peak at position $t_{\delta_j}$ is generated with probability qj.

• $R = T - \bigcup_i T_i$: set of positions that are not associated with any partial peptides (noise).

Probabilistic Model

• For a position $t_{\delta_j} \in T_i$, the probability $p(t_{\delta_j}, P, S)$ that peptide P produces a peak at position $t_{\delta_j}$ is:

$$p(t_{\delta_j}, P, S) = \begin{cases} q_j & \text{if a peak is generated at position } t_{\delta_j} \\ 1 - q_j & \text{otherwise} \end{cases}$$

• Similarly, for $t \in R$, the probability that P produces a random noise peak at t is:

$$p_R(t) = \begin{cases} q_R & \text{if a peak is generated at position } t \\ 1 - q_R & \text{otherwise} \end{cases}$$

Probabilistic Score

• For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:

$$\frac{p(P, S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{i\delta_j}, P, S)}{p_R(t_{i\delta_j})}$$

Role of de novo Interpretation

• Interpreting MS/MS of novel peptides

• Automatic validation of MS/MS database matches

• Leveraging homology matching across species

Post-Translational Modifications

Proteins are involved in cellular signaling and metabolic regulation.

They are subject to a large number of biological modifications.

Almost all protein sequences are post-translationally modified, and 200 types of modifications of amino acid residues are known.

Examples of Post-Translational Modification

Difficulties in Finding Post-Translational Modifications

Currently post-translational modifications cannot be inferred from DNA sequences.

Finding post-translational modifications remains an open problem even after the human genome is completed.

Post-translational modifications increase the number of “letters” in the amino acid alphabet and lead to a combinatorial explosion.

Sequencing of Modified Peptides

De novo peptide sequencing is invaluable for identification of unknown proteins.

However, de novo algorithms are designed for working with high quality spectra with good fragmentation and without modifications.

Another approach is to compare a spectrum against a set of known spectra in a database.

Functional Proteomics

• Problem: Given a large collection of uninterpreted spectra, find out which spectra correspond to similar peptides.

• A method that cross-correlates related spectra (e.g., from normal and diseased individuals) would be valuable in functional proteomics.

Protein Identification Problem

• Input: A database of proteins, an experimental spectrum S, a set of ion types Δ, and a parent mass m.

• Output: A peptide of mass m from the database with the best match to spectrum S.
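The first step of such a search is to list the database peptides whose mass equals m. The sketch below shows only that candidate-generation step, using a two-pointer scan over one protein; it assumes integer residue masses, treats m as the plain sum of residue masses, and leaves the spectrum-matching score out. The protein string, the table and the function name are hypothetical.

```python
# Partial table of integer residue masses (Da), enough for the example protein below.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "T": 101,
                "I": 113, "N": 114, "E": 129, "M": 131, "R": 156}

def peptides_of_mass(protein, target_mass, residue_mass=RESIDUE_MASS):
    """All (start, peptide) substrings whose residue masses sum to target_mass."""
    hits = []
    left = 0
    window = 0  # mass of protein[left:right + 1]
    for right, aa in enumerate(protein):
        window += residue_mass[aa]
        # Masses are positive, so shrinking from the left is the only way to reduce the sum.
        while window > target_mass and left <= right:
            window -= residue_mass[protein[left]]
            left += 1
        if window == target_mass and left <= right:
            hits.append((left, protein[left:right + 1]))
    return hits

print(peptides_of_mass("MSPRTEINGA", 710))
```

Each candidate returned this way would then be scored against the experimental spectrum S, and the best-scoring one reported.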

MS/MS Database Search

Database search in mass spectrometry has been very successful in identification of already known proteins.

An experimental spectrum can be compared with the theoretical spectra of database peptides to find the best fit.

SEQUEST (Yates et al., 1995)

But reliable algorithms for identification of modified peptides are not yet known.

Search for Modified Peptides: Virtual Database Approach

Yates et al., 1995: an exhaustive search in a virtual database of all modified peptides.

Exhaustive search leads to a large combinatorial problem, even for a small set of modification types.

Problem (Yates et al., 1995): Extend the virtual database approach to a large set of modifications.

Modified Peptide Identification Problem

Input:

• Experimental spectrum S

• Database of peptides

• Parameter k (# of mutations/modifications)

• A set of ion types Δ

• Parent mass m

Output: a peptide with the best match to the spectrum S that is at most k mutations/modifications apart from a database peptide.

Peptide Identification Problem: Challenge

Very similar peptides may have very different spectra!

Goal: Define a notion of spectral similarity that correlates well with the sequence similarity.

If peptides are a few mutations/modifications apart, the spectral similarity between their spectra should be high.

Deficiency of the Shared Peaks Count

Shared peaks count (SPC): intuitive measure of spectral similarity.

Problem: SPC diminishes very quickly as the number of mutations increases.

Only a small portion of correlations between the spectra of mutated peptides is captured by SPC.

SPC Diminishes Quickly

S(PRTEIN) = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632}

S(PRTEYN) = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682}

S(PGTEYN) = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583}

Shared peaks count with S(PRTEIN):

• PRTEIN (no mutations): SPC = 10

• PRTEYN (1 mutation): SPC = 5

• PGTEYN (2 mutations): SPC = 2
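The three spectra and their shared peaks counts can be reproduced with a short sketch. It assumes integer residue masses and the offsets b = prefix mass + 1 and y = suffix mass + 19; these offsets are inferred here because they reproduce the numbers listed above, and the function names are illustrative.

```python
# Integer residue masses (Da) for the residues used in the three example peptides.
RESIDUE_MASS = {"G": 57, "P": 97, "T": 101, "I": 113, "N": 114,
                "E": 129, "R": 156, "Y": 163}

def theoretical_spectrum(peptide):
    """Sorted b-ion (prefix + 1) and y-ion (suffix + 19) masses of all proper prefixes/suffixes."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    spectrum = set()
    prefix = 0
    for m in masses[:-1]:
        prefix += m
        spectrum.add(prefix + 1)           # b ion of this prefix
        spectrum.add(total - prefix + 19)  # y ion of the complementary suffix
    return sorted(spectrum)

def shared_peaks_count(s1, s2):
    """Number of masses present in both spectra."""
    return len(set(s1) & set(s2))

reference = theoretical_spectrum("PRTEIN")
for peptide in ("PRTEIN", "PRTEYN", "PGTEYN"):
    print(peptide, shared_peaks_count(reference, theoretical_spectrum(peptide)))
```

A single substitution changes every b ion after the mutated position and every y ion before it, which is why the count drops from 10 to 5 and then to 2.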

Spectral Convolution

$$S_2 \ominus S_1 = \{\, s_2 - s_1 : s_1 \in S_1,\ s_2 \in S_2 \,\}$$

Number of pairs $s_1 \in S_1$, $s_2 \in S_2$ with $s_2 - s_1 = x$:

$$(S_2 \ominus S_1)(x)$$

The shared peaks count (SPC peak):

$$(S_2 \ominus S_1)(0)$$

[Figure: elements of S2 ⊖ S1 represented as elements of a difference matrix. The elements with multiplicity >2 are colored; the elements with multiplicity =2 are circled. The SPC takes into account only the red entries.]

Spectral Convolution: An Example

[Figure: spectral convolution (multiplicity, 0 to 5) plotted against x from -150 to 150.]

Spectral Comparison: Difficult Case

S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}

Which of the spectra

S’ = {10, 20, 30, 40, 50, 55, 65, 75, 85, 95}

or

S” = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}

fits the spectrum S the best?

SPC: both S’ and S” have 5 peaks in common with S.

Spectral Convolution: reveals the peaks at 0 and 5.
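The difficult case can be checked numerically. This sketch counts, for each offset x, the pairs with s2 - s1 = x, following the definition of spectral convolution; the function and variable names are illustrative, and S is passed as the first argument so that the second peak appears at +5 rather than -5.

```python
from collections import Counter

def spectral_convolution(s2, s1):
    """Map each offset x to the number of pairs (a in s1, b in s2) with b - a == x."""
    return Counter(b - a for a in s1 for b in s2)

S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
S_prime = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]
S_double_prime = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]

conv_prime = spectral_convolution(S, S_prime)
conv_double = spectral_convolution(S, S_double_prime)

# Multiplicity at offset 0 is the shared peaks count; offset 5 is the second peak.
print(conv_prime[0], conv_prime[5])
print(conv_double[0], conv_double[5])
```

Both comparisons give the same multiplicities at offsets 0 and 5, so these two values alone do not separate S’ from S”.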

Spectral Comparison: Difficult Case

[Figure: two panels, one comparing S with S’ and one comparing S with S”.]

Limitations of Spectral Convolution

Spectral convolution does not reveal that spectra S and S’ are similar, while spectra S and S” are not.

Clumps of shared peaks: the matching positions in S’ come in clumps while the matching positions in S” don't.

This important property was not captured by spectral convolution and was overlooked in the previous MS/MS algorithms.

Edit Distance Between Spectra

A = {a1 < … < an}: an ordered set of natural numbers.

A shift Δi transforms {a1, …, an} into {a1, …, ai-1, ai+Δi, …, an+Δi}

e.g.

10 20 30 40 50 60 70 80 90

10 20 30 35 45 55 65 75 85   (shift -5 from the 4th element on)

10 20 30 35 45 55 62 72 82   (shift -3 from the 7th element on)
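A shift is easy to express in code. The sketch below applies the definition to the example rows, assuming the first row starts at 10 so that all three rows have nine elements; `apply_shift` is a hypothetical helper name and uses a 1-based index i as in the definition.

```python
def apply_shift(a, i, delta):
    """Add delta to a_i, ..., a_n (1-based i); a_1, ..., a_{i-1} stay unchanged."""
    return a[:i - 1] + [x + delta for x in a[i - 1:]]

original = [10, 20, 30, 40, 50, 60, 70, 80, 90]
after_first = apply_shift(original, 4, -5)      # shift -5 from the 4th element on
after_second = apply_shift(after_first, 7, -3)  # shift -3 from the 7th element on
print(after_first)
print(after_second)
```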


Spectral Alignment Problem

• Find a series of k shifts that make the sets

A = {a1, …, an} and B = {b1, …, bm}

as similar as possible.

• k-similarity between sets:

D(k), the maximum number of elements in common between the sets after k shifts.


Spectral Alignment vs. Sequence Alignment

• Manhattan-like graph with different alphabet and scoring.

• Axes in the graph correspond to peaks in the two spectra.

• In this case, score is 1 if the diagonal line goes through a peak on both axes, 0 otherwise.

• Movement can be diagonal or perpendicular (but only k times total).


Spectral Alignment = Sequence Alignment in 0-1 Alphabet

• Convert spectrum to a string with each index being 1 if it corresponds to a peak and 0 otherwise.
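The conversion can be sketched as follows. This assumes integer masses and string positions numbered from 1 up to the largest mass; the function name `spectrum_to_binary` and the optional `length` parameter are illustrative, not taken from the slides.

```python
def spectrum_to_binary(spectrum, length=None):
    """Return a 0-1 string whose position p (1-based) is '1' iff p is a peak."""
    peaks = set(spectrum)
    if length is None:
        length = max(peaks)
    return "".join("1" if p in peaks else "0" for p in range(1, length + 1))


print(spectrum_to_binary([2, 5, 6]))            # 010011
print(spectrum_to_binary([2, 5, 6], length=8))  # 01001100
```

In this encoding a shift of the spectrum corresponds to inserting or deleting a run of 0s in the string.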


Spectral Product

A = {a1, …, an} and B = {b1, …, bm}

Spectral product A⊗B: a two-dimensional matrix with nm 1s corresponding to all pairs of indices (ai, bj), the remaining elements being 0s.

[Figure: spectral product matrix with its 1s marked; one axis labeled 10 20 30 40 50 55 65 75 85 95; δ marks the offset of a shifted diagonal.]

SPC: the number of 1s on the main diagonal.

δ-shifted SPC: the number of 1s on the diagonal (i, i + δ).
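A sparse version of the spectral product is enough to read off both counts. The sketch below stores only the cells that hold a 1, and treats the first spectrum as the row index and the second as the column index. That orientation, and the names `spectral_product` and `shifted_spc`, are assumptions made for illustration; swapping rows and columns flips the sign of δ.

```python
def spectral_product(a, b):
    """Sparse 0-1 matrix: the set of cells (a_i, b_j) that hold a 1."""
    return {(x, y) for x in a for y in b}


def shifted_spc(product, delta=0):
    """Number of 1s on the diagonal (i, i + delta); delta = 0 gives the SPC."""
    return sum(1 for x, y in product if y == x + delta)


S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
S1 = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]   # S'

P = spectral_product(S, S1)
print(len(P))              # 100 cells hold a 1 (n * m)
print(shifted_spc(P, 0))   # 5 on the main diagonal
print(shifted_spc(P, -5))  # 5 on the diagonal shifted by -5
```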


Spectral Alignment: k-similarity

k-similarity between spectra: the maximum number of 1s on a path through this graph that uses at most k+1 diagonals.

k-optimal spectral alignment = a path.

The spectral alignment allows one to detect more and more subtle similarities between spectra by increasing k.


Use of k-Similarity

SPC reveals only D(0) = 3 matching peaks.

Spectral alignment reveals more hidden similarities between spectra, D(1) = 5 and D(2) = 8, and detects the corresponding mutations.


Black lines represent the paths for k=0

Red lines represent the paths for k=1

Blue line in Fig. (b) represents the path for k=2


Spectral Convolution’s Limitation

The spectral convolution considers diagonals separately without combining them into feasible mutation scenarios.

[Figure: two spectral products, each with vertical axis 10 20 30 40 50 60 70 80 90 100 and a shifted diagonal marked δ. First panel: horizontal axis 10 20 30 40 50 55 65 75 85 95, D(1) = 10. Second panel: horizontal axis 10 15 30 35 50 55 70 75 90 95, D(1) = 6. Shift function score = 10.]


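Dynamic Programming for Spectral Alignment

Dij(k): the maximum number of 1s on a path to (ai, bj) that uses at most k+1 diagonals.

\[
D_{ij}(k) = \max_{(i',j') < (i,j)}
\begin{cases}
D_{i'j'}(k) + 1, & \text{if } (i',j') \sim (i,j) \\
D_{i'j'}(k-1) + 1, & \text{otherwise}
\end{cases}
\]

\[
D(k) = \max_{ij} D_{ij}(k)
\]

Here (i',j') ~ (i,j) means that the two points lie on the same diagonal, i.e. ai - ai' = bj - bj'.

Running time: O(n^4 k)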
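A direct transcription of this recurrence into Python might look like the sketch below. It rests on several assumptions not spelled out on the slide: (i',j') < (i,j) is read as i' < i and j' < j, two cells are on the same diagonal when ai - ai' = bj - bj', and a path may start at any cell, so every Dij(k) is at least 1. The function name `k_similarity_naive` is hypothetical.

```python
def k_similarity_naive(a, b, k):
    """D(k): most 1s on a path that uses at most k+1 diagonals (O(n^4 k))."""
    a, b = sorted(set(a)), sorted(set(b))
    n, m = len(a), len(b)
    # D[t][i][j]: best path ending at (a[i], b[j]) with at most t shifts.
    # Assumption: a path may start at any cell, so each entry starts at 1.
    D = [[[1] * m for _ in range(n)] for _ in range(k + 1)]
    for t in range(k + 1):
        for i in range(n):
            for j in range(m):
                for i2 in range(i):
                    for j2 in range(j):
                        if a[i] - a[i2] == b[j] - b[j2]:
                            cand = D[t][i2][j2] + 1      # stay on the diagonal
                        elif t > 0:
                            cand = D[t - 1][i2][j2] + 1  # spend one shift
                        else:
                            continue
                        D[t][i][j] = max(D[t][i][j], cand)
    return max(max(row) for row in D[k]) if n and m else 0


S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
S1 = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]   # S'
S2 = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]   # S''
print(k_similarity_naive(S, S1, 1))  # 10
print(k_similarity_naive(S, S2, 1))  # 6
```

With one shift allowed, S’ aligns with all ten peaks of S while S” reaches only six, which is the distinction the convolution alone could not make.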


Edit Graph for Fast Spectral Alignment


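Fast Spectral Alignment Algorithm

\[
D_{ij}(k) = \max
\begin{cases}
D_{diag(i,j)}(k) + 1 \\
M_{i-1,j-1}(k-1) + 1
\end{cases}
\]

\[
M_{ij}(k) = \max_{(i',j') \le (i,j)} D_{i'j'}(k)
\]

\[
M_{ij}(k) = \max
\begin{cases}
D_{ij}(k) \\
M_{i-1,j}(k) \\
M_{i,j-1}(k)
\end{cases}
\]

Here diag(i,j) is the closest point (i',j') preceding (i,j) on the same diagonal.

Running time: O(n^2 k)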
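The faster recurrences can be sketched in Python as follows. The assumptions are labeled in the code: diag(i,j) is tracked with a dictionary keyed by the diagonal offset a_i - b_j, M is padded with a zero row and column, M(-1) is treated as all zeros, and a path may start at any cell. The name `k_similarity_fast` is hypothetical.

```python
def k_similarity_fast(a, b, k):
    """D(k) via the D/M recurrences, O(n * m * k) time."""
    a, b = sorted(set(a)), sorted(set(b))
    n, m = len(a), len(b)
    # prev_m plays the role of M(t-1); padded with a zero row and column.
    # Assumption: M(-1) is all zeros, so a path may start at any cell.
    prev_m = [[0] * (m + 1) for _ in range(n + 1)]
    for _ in range(k + 1):
        cur_m = [[0] * (m + 1) for _ in range(n + 1)]
        last_on_diag = {}  # offset a_i - b_j -> D at the closest earlier cell
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                offset = a[i - 1] - b[j - 1]
                d = max(last_on_diag.get(offset, 0),   # D_diag(i,j)(t)
                        prev_m[i - 1][j - 1]) + 1      # M_{i-1,j-1}(t-1)
                last_on_diag[offset] = d
                cur_m[i][j] = max(d, cur_m[i - 1][j], cur_m[i][j - 1])
        prev_m = cur_m
    return prev_m[n][m]


S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
S1 = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]   # S'
S2 = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]   # S''
print(k_similarity_fast(S, S1, 1))  # 10
print(k_similarity_fast(S, S2, 1))  # 6
```

Scanning cells row by row means that the last cell recorded for a given offset always has both a smaller row and a smaller column index, so a single dictionary lookup stands in for the inner maximization over all earlier cells.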


Spectral Alignment: Complications

• Simultaneous analysis of N- and C-terminal ions

• Taking into account the intensities and charges

• Analysis of minor ions

• Much more complicated!


Spectral Alignment: Complications

Spectra are combinations of an increasing (N-terminal ions) and a decreasing (C-terminal ions) number series.

These series form two diagonals in the spectral product, the main diagonal and the perpendicular diagonal.

The described algorithm deals with the main diagonal only.