
An Introduction to Bioinformatics Algorithms, www.bioalgorithms.info

Protein Sequencing and Identification With Mass Spectrometry


Outline

• Tandem Mass Spectrometry

• De Novo Peptide Sequencing

• Spectrum Graph

• Protein Identification via Database Search

• Identifying Post Translationally Modified Peptides

• Spectral Convolution

• Spectral Alignment


Amino Acids vs. Nucleic Acids

Amino Acids: Amine, Carboxylic Acid, R-group

Nucleic Acids: Sugar, Phosphate, Base


Protein Backbone

H...-HN-CH-CO-NH-CH-CO-NH-CH-CO-…OH

R(i-1)   R(i)   R(i+1)

AA residue i-1   AA residue i   AA residue i+1

N-terminus C-terminus


Breaking of Protein Backbone

H...-HN-CH-CO NH-CH-CO-NH-CH-CO-…OH

R(i-1)   R(i)   R(i+1)

AA residue i-1   AA residue i   AA residue i+1

N-terminus C-terminus

H+


Breaking Peptides into Fragment Ions

• Proteases, e.g. trypsin, break a protein into peptides.

• A tandem mass spectrometer further breaks the peptides down into fragment ions and measures the mass of each piece.

• A mass spectrometer electrically accelerates the fragmented ions; heavier ions accelerate more slowly than lighter ones.

• Mass spectrometers measure the mass/charge ratio of an ion.


Mass Spectrometry

Matrix-assisted Laser Desorption/Ionization

From lectures by Vineet Bafna (UCSD)


Tandem Mass Spectrometry

[Figure: three panels. LC: chromatogram of relative abundance vs. time (RT 0.01 - 80.02 min). MS: scan 1707 (RT 54.44 min, full MS, m/z 300 - 2000), relative abundance vs. m/z, with labeled peaks including 638.0 and 801.0. MS/MS: scan 1708 (RT 54.47 min, full MS2 of the ion at m/z 638.00, m/z 165 - 1925), relative abundance vs. m/z, with labeled peaks including 588.1, 687.3, 850.3 and 949.4.]

Instrument layout: Ion Source → MS-1 → collision cell → MS-2


Using Tandem Mass Spectrometry

[Figure: the MS/MS instrument produces an MS/MS spectrum (relative abundance vs. m/z), which is interpreted to obtain the peptide sequence.]

Two routes from spectrum to sequence:

• Database search (e.g., Sequest)

• De novo interpretation (e.g., Sherenga)


Tandem Mass Spectrum

• Tandem Mass Spectrometry (MS/MS): mainly generates partial N- and C-terminal peptides

• Spectrum consists of different ion types because peptides can be broken in several places.

• Chemical noise often complicates the spectrum.

• Represented in 2-D: mass/charge axis vs. intensity axis


Tandem Mass Spectrum: An Example

Secondary Fragmentation

Ionized parent peptide


N- and C-terminal Peptides

[Figure: a peptide cut at each backbone position, showing the resulting N-terminal peptides and C-terminal peptides.]


Terminal peptides and ion types

Peptide: mass (Da) 57 + 97 + 147 + 114 = 415

Peptide without H2O: mass (Da) 57 + 97 + 147 + 114 – 18 = 397
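The arithmetic on this slide can be sketched in a few lines. This is only an illustration: the function name `terminal_peptide_masses` is hypothetical, the four residue masses are the ones shown on the slide, and masses are treated as integers with no charge or ion-type offsets.

```python
def terminal_peptide_masses(residue_masses):
    """Return (prefix_masses, suffix_masses) for every internal cut of a peptide.

    prefix_masses[i] is the mass of the N-terminal peptide ending after
    residue i; suffix_masses[i] is the mass of the matching C-terminal peptide.
    """
    total = sum(residue_masses)
    prefixes = []
    running = 0
    for mass in residue_masses[:-1]:  # the last cut would give the whole peptide
        running += mass
        prefixes.append(running)
    suffixes = [total - p for p in prefixes]
    return prefixes, suffixes


residues = [57, 97, 147, 114]  # residue masses from the slide
prefixes, suffixes = terminal_peptide_masses(residues)
print(sum(residues))       # 415
print(sum(residues) - 18)  # 397, the same peptide after losing water
print(prefixes, suffixes)
```

Each prefix mass and its matching suffix mass add up to the total peptide mass, which is why N-terminal and C-terminal ladders mirror each other in a spectrum.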


Peptide Fragmentation

[Figure: backbone of a four-residue peptide (side chains R1 to R4, N-terminus on the left, COOH on the right) with the cleavage sites labeled. N-terminal fragments: a2, a3, b2, b3. C-terminal fragments: y1, y2, y3. Neutral-loss ions: b2-H2O, y3-H2O, b3-NH3, y2-NH3.]


De novo Peptide Sequencing


[Figure: an MS/MS spectrum (relative abundance vs. m/z) is interpreted directly to obtain the peptide sequence.]


Theoretical Spectrum


Theoretical Spectrum (cont’d)


Building Spectrum Graph

• How to create vertices (from peaks)

• How to create edges (from mass differences)

• How to score paths

• How to find best path


[Figure sequence: intensity vs. mass/charge (m/z) for the example peptide SEQUENCE.]

• b-ions, read in the order S E Q U E N C E

• a-ions: a is an ion type shift in b

• y-ions, read in the order E C N E U Q E S

• y-ions with corresponding intensities

• noise

• MS/MS Spectrum


Mass Differences Correspond to Amino Acids

[Figure: spectrum in which the gaps between consecutive peaks are labeled with the letters S, E, Q, U, E, N, C, E.]


Ion Types

• Some peaks correspond to fragment ions, others are just random noise

• Knowing ion types Δ={δ1, δ2,…, δk} lets us distinguish fragment ions from noise

• We can learn ion types δi and their probabilities qi by analyzing a large test sample of annotated spectra.


Example of Ion Type

• Δ={δ1, δ2,…, δk}

• Ion types: {b, b-NH3, b-H2O}

• Corresponding values of Δ={0, 17, 18}

• *Note: In reality the δ value of ion type b is -1


Peptide Sequencing Problem

Goal: Find a peptide with maximal match between an experimental and theoretical spectrum.

Input:

• S: experimental spectrum

• Δ: set of possible ion types

• m: parent mass

Output:

• P: peptide with mass m, whose theoretical spectrum matches the experimental spectrum S the best


Vertices

• Masses of potential N-terminal peptides

• Vertices are generated by reverse shift

• Every peak s in a spectrum generates vertices V(s) = {s+δ1, s+δ2, …, s+δk}


Vertices (cont’d)

• Vertices of the spectrum graph:

• {vinit}∪V(s1) ∪V(s2) ∪... ∪V(sm) ∪{vfin}

• Where Δ={δ1, δ2,…, δk} are ion types.
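The vertex construction can be sketched as follows. The function name `spectrum_graph_vertices` is hypothetical, and two things are assumed rather than taken from the slides: that the initial vertex sits at mass 0 and the final vertex at the parent mass m, and that masses are integers. The offsets {0, 17, 18} are the example values given for b, b-NH3 and b-H2O.

```python
def spectrum_graph_vertices(peaks, offsets, parent_mass):
    """Build the vertex set {v_init} U V(s1) U ... U V(sm) U {v_fin}.

    Assumption: v_init is mass 0 and v_fin is the parent mass.
    Every peak s contributes the vertices s + delta for each offset delta.
    """
    vertices = {0, parent_mass}
    for s in peaks:
        for delta in offsets:
            vertices.add(s + delta)
    return sorted(vertices)


# A b peak at 57 and a b-H2O peak at 136 (illustrative values only).
vertices = spectrum_graph_vertices([57, 136], [0, 17, 18], 415)
print(vertices)  # [0, 57, 74, 75, 136, 153, 154, 415]
```

In this toy input the peak at 136 shifted by +18 lands on 154 = 57 + 97, which is the kind of coincidence that marks a plausible N-terminal peptide mass. The other generated vertices are spurious, and scoring is what later separates them from real ones.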


Reverse Shifts

• Two peaks, b-H2O and b, are given by the mass spectrum.

• With a +H2O shift, if two peaks coincide, that is a possible vertex.

[Figure: intensity plot. Red: mass spectrum. Blue: the spectrum shifted by +H2O. The shifted b-H2O peak coincides with the b peak.]


Example of Reverse Shift

Shift in H2O and NH3

Shift in H2O


Edges

• Two vertices with mass difference

corresponding to an amino acid A:

• Connect with an edge labeled by A

• Gap edges for di- and tri-peptides
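A minimal sketch of edge construction, assuming integer masses and a deliberately small residue table (`RESIDUE_MASS`, `build_edges` and `spell_paths` are hypothetical names; a real table would cover all amino acids and use a mass tolerance). Gap edges for di- and tri-peptides are left out to keep the example short.

```python
# Nominal (integer) residue masses for a few amino acids, for illustration only.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "N": 114, "F": 147}


def build_edges(vertices, residue_mass=RESIDUE_MASS):
    """Connect u -> v with label A whenever v - u equals the mass of residue A."""
    mass_to_residue = {m: r for r, m in residue_mass.items()}
    ordered = sorted(vertices)
    edges = []
    for i, u in enumerate(ordered):
        for v in ordered[i + 1:]:
            residue = mass_to_residue.get(v - u)
            if residue is not None:
                edges.append((u, v, residue))
    return edges


def spell_paths(edges, start, end):
    """List the residue strings spelled by all paths from start to end."""
    outgoing = {}
    for u, v, r in edges:
        outgoing.setdefault(u, []).append((v, r))

    def walk(node, prefix):
        if node == end:
            yield prefix
        for nxt, r in outgoing.get(node, []):
            yield from walk(nxt, prefix + r)

    return list(walk(start, ""))


edges = build_edges([0, 57, 154, 301, 415])
print(edges)
print(spell_paths(edges, 0, 415))  # ['GPFN']
```

Because every edge goes from a smaller mass to a larger one, the graph is acyclic and the recursive walk terminates.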


Paths

• A path in the graph corresponds to an amino acid sequence

• There are many paths; how do we find the correct one?

• We need scoring to evaluate paths


Path Score

• p(P,S) = probability that peptide P produces spectrum S = {s1, s2, …, sq}

• p(P,s) = the probability that peptide P generates a peak s

• Scoring = computing probabilities

• $p(P,S) = \prod_{s \in S} p(P,s)$


Peak Score

• For a position t that represents ion type δj:

$$p(P, s_t) = \begin{cases} q_j & \text{if a peak is generated at } t \\ 1 - q_j & \text{otherwise} \end{cases}$$


Peak Score (cont’d)

• For a position t that is not associated with an ion type:

$$p_R(P, s_t) = \begin{cases} q_R & \text{if a peak is generated at } t \\ 1 - q_R & \text{otherwise} \end{cases}$$

• qR = the probability of a noisy peak that does not correspond to any ion type


Finding Optimal Paths in the Spectrum Graph

• For a given MS/MS spectrum S, find a peptide P’ maximizing p(P,S) over all possible peptides P:

$$p(P', S) = \max_{P} p(P, S)$$

• Peptides = paths in the spectrum graph

• P’ = the optimal path in the spectrum graph


Ions and Probabilities

• Tandem mass spectrometry is characterized by a set of ion types {δ1, δ2, …, δk} and their probabilities {q1, …, qk}

• δi-ions of a partial peptide are produced independently with probabilities qi


Ions and Probabilities

• A peptide has all k peaks with probability $\prod_{i=1}^{k} q_i$

• and no peaks with probability $\prod_{i=1}^{k} (1 - q_i)$

• A peptide also produces "random noise" with uniform probability qR in any position.


Ratio Test Scoring for Partial Peptides

• Incorporates premiums for observed ions and penalties for missing ions.

• Example: for k=4, assume that for a partial peptide P’ we only see ions δ1, δ2, δ4.

The score is calculated as:

$$\frac{q_1}{q_R} \cdot \frac{q_2}{q_R} \cdot \frac{1 - q_3}{1 - q_R} \cdot \frac{q_4}{q_R}$$
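The k=4 example can be checked numerically with a small sketch. The function name `ratio_score` and the probability values are made up for illustration; only the form of the product (a premium q_i/q_R for each observed ion, a penalty (1-q_i)/(1-q_R) for each missing one) comes from the slide.

```python
def ratio_score(observed, q, q_r):
    """Ratio-test score of one partial peptide.

    observed[i] says whether the ion of type delta_i was seen,
    q[i] is the probability of that ion type, q_r the noise probability.
    """
    score = 1.0
    for seen, q_i in zip(observed, q):
        if seen:
            score *= q_i / q_r              # premium for an observed ion
        else:
            score *= (1 - q_i) / (1 - q_r)  # penalty for a missing ion
    return score


q = [0.7, 0.5, 0.3, 0.2]  # illustrative ion probabilities
q_r = 0.1                 # illustrative noise probability
print(ratio_score([True, True, False, True], q, q_r))
```

A score above 1 means the observed pattern is better explained by real fragment ions than by noise.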


Scoring Peptides

• T: set of all positions.

• Ti = {tδ1, tδ2, …, tδk}: set of positions that represent ions of partial peptide Pi.

• A peak at position tδj is generated with probability qj.

• R = T − ∪ Ti: set of positions that are not associated with any partial peptides (noise).


Probabilistic Model

• For a position tδj ∈ Ti, the probability p(t, P, S) that peptide P produces a peak at position t is:

$$p(t_{\delta_j}, P, S) = \begin{cases} q_j & \text{if a peak is generated at position } t_{\delta_j} \\ 1 - q_j & \text{otherwise} \end{cases}$$

• Similarly, for t ∈ R, the probability that P produces a random noise peak at t is:

$$p_R(t) = \begin{cases} q_R & \text{if a peak is generated at position } t \\ 1 - q_R & \text{otherwise} \end{cases}$$


Probabilistic Score

• For a peptide P with n amino acids, the score for the whole peptide is expressed by the following ratio test:

$$\frac{p(P,S)}{p_R(S)} = \prod_{i=1}^{n} \prod_{j=1}^{k} \frac{p(t_{i\delta_j}, P, S)}{p_R(t_{i\delta_j})}$$
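One way to evaluate the whole-peptide ratio in practice is to sum logarithms instead of multiplying many small factors. The sketch below assumes the observations are given as a table of booleans, one row per partial peptide and one column per ion type; the name `log_ratio_score` and the numbers are illustrative, not from the slides.

```python
import math


def log_ratio_score(observed, q, q_r):
    """Log of the ratio-test score for a peptide.

    observed[i][j] is True when the ion of type delta_j of partial peptide i
    has a peak in the spectrum. q[j] is the probability of ion type delta_j
    and q_r is the noise probability.
    """
    total = 0.0
    for row in observed:
        for seen, q_j in zip(row, q):
            if seen:
                total += math.log(q_j / q_r)
            else:
                total += math.log((1 - q_j) / (1 - q_r))
    return total


observed = [[True, False],
            [True, True]]
print(log_ratio_score(observed, q=[0.6, 0.4], q_r=0.1))
```

Exponentiating the result gives back the product form; working in log space just avoids numerical underflow for long peptides.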


Role of de novo Interpretation

• Interpreting MS/MS of novel peptides

• Automatic validation of MS/MS database matches.

• Leveraging homology matching across species


Post-Translational Modifications

Proteins are involved in cellular signaling and metabolic regulation.

They are subject to a large number of biological modifications.

Almost all protein sequences are post-translationally modified, and 200 types of modifications of amino acid residues are known.


Examples of Post-Translational Modification


Difficulties in Finding Post-Translational Modifications

Currently post-translational modifications cannot be inferred from DNA sequences.

Finding post-translational modifications remains an open problem even after the human genome is completed.

Post-translational modifications increase the number of “letters” in the amino acid alphabet and lead to a combinatorial explosion.


Sequencing of Modified Peptides

De novo peptide sequencing is invaluable for identification of unknown proteins.

However, de novo algorithms are designed for working with high quality spectra with good fragmentation and without modifications.

Another approach is to compare a spectrum against a set of known spectra in a database.


Functional Proteomics

• Problem: Given a large collection of uninterpreted spectra, find out which spectra correspond to similar peptides.

• A method that cross-correlates related spectra (e.g., from normal and diseased individuals) would be valuable in functional proteomics.


Protein Identification Problem

• Input: A database of proteins, an experimental spectrum S, a set of ion types Δ, and a parent mass m.

• Output: A peptide of mass m from the database with the best match to spectrum S.


MS/MS Database Search

Database search in mass spectrometry has been very successful in identification of already known proteins.

An experimental spectrum can be compared with the theoretical spectra of database peptides to find the best fit.

SEQUEST (Yates et al., 1995)

But reliable algorithms for identification of modified peptides are not yet known.


Search for Modified Peptides: Virtual Database Approach

Yates et al., 1995: an exhaustive search in a virtual database of all modified peptides.

Exhaustive search leads to a large combinatorial problem, even for a small set of modification types.

Problem (Yates et al., 1995). Extend the virtual database approach to a large set of modifications.


Modified Peptide Identification Problem

Input:

• Experimental spectrum S

• Database of peptides

• Parameter k (# of mutations/modifications)

• A set of ion types Δ

• Parent mass m

Output: a peptide with the best match to the spectrum S that is at most k mutations/modifications apart from a database peptide.


Peptide Identification Problem: Challenge

Very similar peptides may have very different spectra!

Goal: Define a notion of spectral similarity that correlates well with the sequence similarity.

If peptides are a few mutations/modifications apart, the spectral similarity between their spectra should be high.


Deficiency of the Shared Peaks Count

Shared peaks count (SPC): intuitive measure of spectral similarity.

Problem: SPC diminishes very quickly as the number of mutations increases.

Only a small portion of correlations between the spectra of mutated peptides is captured by SPC.


SPC Diminishes Quickly

S(PRTEIN) = {98, 133, 246, 254, 355, 375, 476, 484, 597, 632}

S(PRTEYN) = {98, 133, 254, 296, 355, 425, 484, 526, 647, 682}

S(PGTEYN) = {98, 133, 155, 256, 296, 385, 425, 526, 548, 583}

• S(PRTEIN) vs. S(PRTEIN): no mutations, SPC=10

• S(PRTEIN) vs. S(PRTEYN): 1 mutation, SPC=5

• S(PRTEIN) vs. S(PGTEYN): 2 mutations, SPC=2
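The counts on this slide can be reproduced by intersecting the three spectra as sets. This is a small sketch with a hypothetical function name, `shared_peaks_count`, and exact integer matching (a real implementation would allow a mass tolerance).

```python
def shared_peaks_count(s1, s2):
    """Number of masses present in both spectra (exact match)."""
    return len(set(s1) & set(s2))


PRTEIN = [98, 133, 246, 254, 355, 375, 476, 484, 597, 632]
PRTEYN = [98, 133, 254, 296, 355, 425, 484, 526, 647, 682]
PGTEYN = [98, 133, 155, 256, 296, 385, 425, 526, 548, 583]

print(shared_peaks_count(PRTEIN, PRTEIN))  # 10
print(shared_peaks_count(PRTEIN, PRTEYN))  # 5
print(shared_peaks_count(PRTEIN, PGTEYN))  # 2
```

Two substitutions are enough to drop the count from 10 to 2, even though the peptides still share four of six residues.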


Spectral Convolution

$$S_2 \ominus S_1 = \{ s_2 - s_1 : s_1 \in S_1, s_2 \in S_2 \}$$

Number of pairs $s_1 \in S_1$, $s_2 \in S_2$ with $s_2 - s_1 = x$:

$$(S_2 \ominus S_1)(x)$$

The shared peaks count (SPC peak):

$$(S_2 \ominus S_1)(0)$$


Elements of S2 ⊖ S1 represented as elements of a difference matrix. The elements with multiplicity >2 are colored; the elements with multiplicity =2 are circled. The SPC takes into account only the red entries.


Spectral Convolution: An Example

[Figure: spectral convolution plotted as multiplicity (1 to 5) against the offset x, for x from -150 to 150.]


Spectral Comparison: Difficult Case

S = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}

Which of the spectra

S’ = {10, 20, 30, 40, 50, 55, 65, 75, 85, 95}

or

S” = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}

fits the spectrum S the best?

SPC: both S’ and S” have 5 peaks in common with S.

Spectral Convolution: reveals the peaks at 0 and 5.
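The convolution of these spectra can be computed by counting every pairwise difference. The sketch below uses a hypothetical function name, `spectral_convolution`, and follows the convention that offsets are taken as (second spectrum) minus (first spectrum).

```python
from collections import Counter


def spectral_convolution(s1, s2):
    """Map each offset x to the number of pairs (a in s1, b in s2) with b - a = x."""
    return Counter(b - a for a in s1 for b in s2)


S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
S1 = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]  # S' on the slide
S2 = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]  # S'' on the slide

conv1 = spectral_convolution(S, S1)
conv2 = spectral_convolution(S, S2)
print(conv1[0], conv1[5])  # 5 5
print(conv2[0], conv2[5])  # 5 5
```

The value at offset 0 is the SPC, and the value at offset x is the x-shifted SPC. Both candidate spectra give the same counts at offsets 0 and 5, so these two convolution values alone cannot tell them apart.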


Spectral Comparison: Difficult Case

[Figure: S compared with S’, and S compared with S’’.]


Limitations of the Spectrum Convolutions

Spectral convolution does not reveal that spectra S and S’ are similar, while spectra S and S” are not.

Clumps of shared peaks: the matching positions in S’ come in clumps while the matching positions in S” don't.

This important property was not captured by spectral convolution and was overlooked in the previous MS/MS algorithms.


Edit Distance Between Spectra

A = {a1 < … < an}: an ordered set of natural numbers.

A shift Δi transforms {a1, …, an} into {a1, …, ai-1, ai+Δi, …, an+Δi}.

e.g.

10 20 30 40 50 60 70 80 90   (original)

10 20 30 35 45 55 65 75 85   (shift of -5 starting at the fourth element)

10 20 30 35 45 55 62 72 82   (shift of -3 starting at the seventh element)
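A shift can be sketched as a one-line list operation. The function name `apply_shift` is hypothetical, positions are 0-based here (the slide uses 1-based indices), and the starting row is assumed to be 10, 20, …, 90.

```python
def apply_shift(spectrum, i, delta):
    """Add delta to every element from 0-based position i onward."""
    return spectrum[:i] + [a + delta for a in spectrum[i:]]


row1 = [10, 20, 30, 40, 50, 60, 70, 80, 90]
row2 = apply_shift(row1, 3, -5)  # shift from the fourth element
row3 = apply_shift(row2, 6, -3)  # shift from the seventh element
print(row2)  # [10, 20, 30, 35, 45, 55, 65, 75, 85]
print(row3)  # [10, 20, 30, 35, 45, 55, 62, 72, 82]
```

Each shift moves the whole tail of the spectrum, which mirrors how a single mutation or modification changes the mass of every fragment that contains the affected residue.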


Spectral Alignment Problem

• Find a series of k shifts that make the sets A={a1, …, an} and B={b1, …, bm} as similar as possible.

• k-similarity between sets: D(k), the maximum number of elements in common between the sets after k shifts.


Spectral Alignment vs. Sequence Alignment

• Manhattan-like graph with different alphabet and scoring.

• Axes in the graph correspond to peaks in the two spectra.

• In this case, score is 1 if the diagonal line goes through a peak on both axes, 0 otherwise.

• Movement can be diagonal or perpendicular (but only k times total).


Spectral Alignment = Sequence Alignment in 0-1 Alphabet

• Convert spectrum to a string with each index being 1 if it corresponds to a peak and 0 otherwise.
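The conversion to a 0-1 string can be sketched directly. The function name `spectrum_to_binary` is hypothetical, and the choice to index positions from 0 up to a given maximum mass is an assumption made for the example.

```python
def spectrum_to_binary(peaks, max_mass):
    """Return a string with '1' at every index that is a peak mass, '0' elsewhere.

    Indices run from 0 to max_mass inclusive (an assumption of this sketch).
    """
    present = set(peaks)
    return "".join("1" if position in present else "0"
                   for position in range(max_mass + 1))


print(spectrum_to_binary([2, 5], 6))  # 0010010
```

In this representation, a shift of the spectrum corresponds to inserting or deleting a run of 0s, which is what links spectral alignment to sequence alignment.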


Spectral Product

A={a1, …, an} and B={b1, …, bm}

Spectral product A⊗B: two-dimensional matrix with nm 1s corresponding to all pairs of indices (ai, bj) and remaining elements being 0s.

[Figure: spectral product matrix with the peak positions 10, 20, 30, 40, 50, 55, 65, 75, 85, 95 marked along one axis and a diagonal at offset δ.]

SPC: the number of 1s on the main diagonal.

δ-shifted SPC: the number of 1s on the diagonal (i, i+δ)


Spectral Alignment: k-similarity

k-similarity between spectra: the maximum number of 1s on a path through this graph that uses at most k+1 diagonals.

k-optimal spectral alignment = a path.

The spectral alignment allows one to detect more and more subtle similarities between spectra by increasing k.


Use of k-Similarity

SPC reveals only D(0)=3 matching peaks.

Spectral alignment reveals more hidden similarities between spectra, D(1)=5 and D(2)=8, and detects the corresponding mutations.

[Figure legend: black lines represent the paths for k=0; red lines represent the paths for k=1; the blue line in Fig. (b) represents the path for k=2.]


Spectral Convolution’s Limitation

The spectral convolution considers diagonals separately without combining them into feasible mutation scenarios.

[Figure: two spectral products with S = {10, 20, …, 100} on the vertical axis and a diagonal at offset δ. Left: S’ = {10, 20, 30, 40, 50, 55, 65, 75, 85, 95}, D(1) = 10. Right: S” = {10, 15, 30, 35, 50, 55, 70, 75, 90, 95}, D(1) = 6. Shift function score = 10.]


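Dynamic Programming for Spectral Alignment

Dij(k): the maximum number of 1s on a path to (ai, bj) that uses at most k+1 diagonals.

$$D_{ij}(k) = \max_{(i',j') < (i,j)} \begin{cases} D_{i'j'}(k) + 1, & \text{if } (i',j') \sim (i,j) \\ D_{i'j'}(k-1) + 1, & \text{otherwise} \end{cases}$$

Here (i',j') ~ (i,j) means that the two points lie on the same diagonal.

$$D(k) = \max_{ij} D_{ij}(k)$$

Running time: O(n^4 k)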
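A direct, unoptimized sketch of this dynamic program is given below. It treats every pair (a_i, b_j) as a point, extends a path from any earlier point for free when both lie on the same diagonal, and spends one of the k shifts otherwise. The function name `k_similarity` is hypothetical, and the path is allowed to start on any diagonal, which is an assumption of this sketch.

```python
def k_similarity(a, b, k):
    """Maximum number of points (a_i, b_j) on a monotone path with at most k shifts."""
    a, b = sorted(set(a)), sorted(set(b))
    # d[q][(i, j)]: best path ending at (a[i], b[j]) using at most q shifts.
    d = [dict() for _ in range(k + 1)]
    for q in range(k + 1):
        for i in range(len(a)):
            for j in range(len(b)):
                best = 1  # the path may start at this point
                for ip in range(i):
                    for jp in range(j):
                        if b[jp] - a[ip] == b[j] - a[i]:
                            # same diagonal: no shift needed
                            best = max(best, d[q][(ip, jp)] + 1)
                        elif q > 0:
                            # different diagonal: costs one shift
                            best = max(best, d[q - 1][(ip, jp)] + 1)
                d[q][(i, j)] = best
    return max(d[k].values(), default=0)


S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
S1 = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]  # S'
S2 = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]  # S''
print(k_similarity(S, S1, 1))  # 10
print(k_similarity(S, S2, 1))  # 6
```

With one shift allowed, S’ aligns with all ten peaks of S while S” reaches only six, which is the distinction that SPC and spectral convolution miss. The four nested loops over points reflect the O(n^4 k) running time.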


Edit Graph for Fast Spectral Alignment


Fast Spectral Alignment Algorithm

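$$D_{ij}(k) = \max \begin{cases} D_{diag(i,j)}(k) + 1 \\ M_{i-1,j-1}(k-1) + 1 \end{cases}$$

where diag(i,j) is the closest previous point on the same diagonal as (i,j), and

$$M_{ij}(k) = \max_{(i',j') \le (i,j)} D_{i'j'}(k)$$

which is computed as

$$M_{ij}(k) = \max \begin{cases} D_{ij}(k) \\ M_{i-1,j}(k) \\ M_{i,j-1}(k) \end{cases}$$

Running time: O(n^2 k)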
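The faster recurrence can be sketched by keeping, for each diagonal, the score of the nearest earlier point on it, and a running-maximum table M for the previous number of shifts. The function name `fast_k_similarity` and the dictionary keyed by diagonal offset are implementation choices of this sketch, not something the slides prescribe. Peaks within each spectrum are assumed to be distinct.

```python
def fast_k_similarity(a, b, k):
    """k-similarity via D and M tables: O(n * m) work per allowed shift."""
    a, b = sorted(set(a)), sorted(set(b))
    n, m = len(a), len(b)
    prev_m = None  # M table for one fewer shift
    for q in range(k + 1):
        # cur_m[i][j] = max of D over points (i', j') <= (i, j), 1-based with zero padding
        cur_m = [[0] * (m + 1) for _ in range(n + 1)]
        last_on_diag = {}  # diagonal offset -> D of the nearest earlier point on it
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                offset = b[j - 1] - a[i - 1]
                d = last_on_diag.get(offset, 0) + 1      # continue along the diagonal
                if prev_m is not None:
                    d = max(d, prev_m[i - 1][j - 1] + 1)  # arrive by one shift
                last_on_diag[offset] = d
                cur_m[i][j] = max(d, cur_m[i - 1][j], cur_m[i][j - 1])
        prev_m = cur_m
    return prev_m[n][m]


S = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
S1 = [10, 20, 30, 40, 50, 55, 65, 75, 85, 95]  # S'
S2 = [10, 15, 30, 35, 50, 55, 70, 75, 90, 95]  # S''
print(fast_k_similarity(S, S1, 1))  # 10
print(fast_k_similarity(S, S2, 1))  # 6
```

Each cell is filled in constant time, so the total work is proportional to n·m per value of k, in line with the O(n^2 k) bound on the slide.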


Spectral Alignment: Complications

• Simultaneous analysis of N- and C-terminal ions

• Taking into account the intensities and charges

• Analysis of minor ions

• Much more complicated!


Spectral Alignment: Complications

Spectra are combinations of an increasing (N-terminal ions) and a decreasing (C-terminal ions) number series.

These series form two diagonals in the spectral product, the main diagonal and the perpendicular diagonal.

The described algorithm deals with the main diagonal only.
