2019.10.01. Bioinformatics - Proteomics, Lecture 4. Prof. László Poppe, BME Department of Organic Chemistry and Technology. Bioinformatics - Proteomics: lecture and practice.


2019.10.01. Bioinformatics - Proteomics

Bioinformatics − Proteomics Lecture 4

Prof. László Poppe

BME Department of Organic Chemistry and Technology

Bioinformatics – Proteomics

Lecture and practice


Biological databases

User -> search program (BLAST) -> biological databases

Structural databases
  Primary databases: PDB (protein)
  Secondary databases: SCOP, CATH

Sequence databases
  Primary databases: SwissProt, TrEMBL, PIR (protein); GenBank, DDBJ, EMBL (nucleotide)
  Secondary databases: PFAM, BLOCKS, PROSITE

Integrated databases: INTERPRO


Secondary databases

The secondary sequence databases, which contain sequence pattern data, are derived from primary databases (i.e. those containing sequences).

Motifs can be determined from multiple alignments of primary sequence data.

A fingerprint is a group of conserved motifs used to characterise a protein family. Regular expressions or frequency matrices can be derived from the motifs.

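[Figure: sequences in a multiple alignment (with insertions), showing a single motif and a fingerprint made of several motifs; the representations derived from them are the regular expression, the frequency matrix and the balanced frequency matrix (block).]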
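As a minimal sketch of how a frequency matrix and a simple regular-expression-like pattern could be derived from an aligned motif, the Python snippet below counts residues column by column. The four-sequence motif and the function names `frequency_matrix` and `to_pattern` are illustrative assumptions, not part of any database's tooling.

```python
from collections import Counter


def frequency_matrix(aligned):
    """Return one Counter of residue counts per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*aligned)]


def to_pattern(matrix):
    """Write each column as a single letter or as a bracketed set of letters."""
    parts = []
    for counts in matrix:
        residues = "".join(sorted(counts))
        parts.append(residues if len(residues) == 1 else f"[{residues}]")
    return "-".join(parts)


# Hypothetical aligned motif (illustrative data only).
motif = ["ADLG", "SDVG", "ADLG", "ADIG"]
matrix = frequency_matrix(motif)
pattern = to_pattern(matrix)

for position, counts in enumerate(matrix, start=1):
    print(position, dict(counts))
print(pattern)
```

The counts keep the information about how often each residue occurs, whereas the bracketed pattern only records which residues were observed at each position.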


Secondary databases

Secondary database           Primary or secondary source   Content
PROSITE                      SwissProt                     Regular expressions (motifs)
Profiles (part of PROSITE)   SwissProt                     Balanced matrices (profiles)
PRINTS                       SwissProt + TrEMBL            Aligned motifs (fingerprints)
Pfam                         SwissProt                     Hidden Markov models (HMMs)
BLOCKS*                      PROSITE / PRINTS              Aligned motifs (blocks)
eMOTIF*                      BLOCKS / PRINTS               "Fuzzy" regular expressions (patterns)

* Derived from a secondary database


Secondary databases

[Figure: classification of the methods behind secondary databases: single-motif methods (accurate regular expression, fuzzy regular expression), multi-motif methods, and methods based on full alignment (profiles).]


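Application paradigm

Similar sequence - Similar structure - Similar function

[Figure: homology, divided into orthology and paralogy, set against similar sequence, similar structure and similar function; the slide marks how reliably each relationship holds with + and +/- symbols.]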

The basic question of bioinformatics: given a new sequence, what is the protein's function, structure, family, etc.?

Search tools (FASTA, BLAST, PSI-BLAST, etc.) are good at finding homology, but sometimes identifying orthology is more important (a homolog may be a paralog rather than an ortholog, which is less useful).

Secondary databases (derived mostly from sequences of proteins of similar function) can help to find orthology.


PROSITE – Regular expressions

ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ

Alignment of four proteins

[AS]-D-[IVL]-G-x(4)-{PG}-C-[DE]-R-[FY](2)-Q

· Standard IUPAC single-letter amino acid (AA) codes are used
· Positions are separated by -
· An AA letter: fully conserved position (e.g. -G-)
· Square brackets: one of the listed AAs (e.g. [AS])
· Curly brackets: any AA except the listed ones (e.g. {PG})
· x: any AA
· Number in parentheses: repetition (e.g. [FY](2), x(4))
· x(2,4): x repeated 2, 3 or 4 times
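The notation rules listed above map almost one-to-one onto ordinary regular expressions. The following Python sketch translates a pattern written with the parenthesised repeat syntax, x(4) and [FY](2), and checks it against the four aligned sequences. The function name `prosite_to_regex` is an assumption made for illustration, and the sketch ignores the `<` and `>` anchors that real PROSITE patterns may also contain.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split each element into its core and an optional repeat: (n) or (n,m).
        parsed = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if parsed is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = parsed.groups()

        if core == "x":
            piece = "."                        # any amino acid
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"    # any amino acid except the listed ones
        else:
            piece = core                       # single letter or [..] set

        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        regex_parts.append(piece)
    return "".join(regex_parts)


regex = prosite_to_regex("[AS]-D-[IVL]-G-x(4)-{PG}-C-[DE]-R-[FY](2)-Q")
sequences = [
    "ADLGAVFALCDRYFQ",
    "SDVGPRSCFCERFYQ",
    "ADLGRTQNRCDRYYQ",
    "ADIGQPHSLCERYFQ",
]
matches = [re.fullmatch(regex, seq) is not None for seq in sequences]
print(regex)
print(matches)
```

All four aligned sequences satisfy the expression, while a sequence with P or G at the {PG} position would be rejected.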


PROSITE – Regular expressions

Sequence matching

Pattern: H-x-[LIVM]-{P}-x(0,2)-G-x(4)-W

Example of a matching sequence: H-C-I-N--G-YFRA-W
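To make the matching example concrete, the pattern can be hand-translated into a Python regular expression with one named group per variable position. The group names are illustrative assumptions. The sketch shows that x(0,2) matches zero residues in this sequence, which is why the example is written with two consecutive hyphens.

```python
import re

# Hand-translated from H-x-[LIVM]-{P}-x(0,2)-G-x(4)-W (group names are illustrative).
pattern = re.compile(
    r"H"
    r"(?P<any>.)"
    r"(?P<livm>[LIVM])"
    r"(?P<not_p>[^P])"
    r"(?P<gap>.{0,2})"
    r"G"
    r"(?P<four>.{4})"
    r"W"
)

match = pattern.fullmatch("HCINGYFRAW")
groups = match.groupdict()
print(groups)
```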


PROSITE - Patterns

Homologous regions from multiple alignments that have a relevant function within a protein family, e.g.:

· Catalytic sites of enzymes
· Binding sites for prosthetic groups (e.g. heme, biotin, etc.)
· Metal binding sites
· Disulfide-bridge-forming cysteines
· Binding sites for ligands (ADP/ATP, GDP/GTP, calcium, DNA, etc.)

Single-motif database, based on manual alignments of SwissProt sequences, annotated by experts using experimental and literature data.

The quality of the expressions is regularly checked and improved.

Reliable and exhaustive documentation.


PROSITE – Pattern records


PROSITE – Pattern description


PROSITE – Documentation section


PROSITE - Search


PRINTS – „Fingerprints”

PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved

motifs used to characterise a protein family; its diagnostic power is refined by iterative

scanning of a SWISS-PROT/TrEMBL composite. Usually the motifs do not overlap, but are

separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can

encode protein folds and functionalities more flexibly and powerfully than can single

motifs, full diagnostic potency deriving from the mutual context provided by motif

neighbours.

Building PRINTS

· Primary database: SWISS-PROT + TrEMBL
· Start with several manually aligned sequences of a protein family
· Find the conserved regions (mostly visually) -> starting motifs
· Derive a frequency matrix for each motif
· Search the primary database (SwissProt + TrEMBL) with these frequency matrices
· Add the best-scoring sequences to the starting motifs
· Repeat the process iteratively until no further improvement is found
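The iterative procedure above can be sketched for a single motif as follows. This is a toy illustration under stated assumptions, not the actual PRINTS software: the scoring scheme (mean column frequency of the best window), the fixed `threshold`, the function names and the three-sequence "database" are all invented for the example.

```python
from collections import Counter


def frequency_matrix(instances):
    """Per-column residue frequencies of equally long motif instances."""
    total = len(instances)
    return [
        {residue: count / total for residue, count in Counter(column).items()}
        for column in zip(*instances)
    ]


def best_window(sequence, matrix):
    """Return (score, window) for the best-scoring window of the sequence."""
    width = len(matrix)
    best = (0.0, "")
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, 0.0) for col, res in zip(matrix, window)) / width
        best = max(best, (score, window))
    return best


def refine_motif(seed, database, threshold, max_rounds=10):
    """Scan, add hits to the motif, rebuild the matrix; stop when nothing new is found."""
    instances = list(seed)
    members = {}
    for _ in range(max_rounds):
        matrix = frequency_matrix(instances)
        new_hits = {}
        for name, sequence in database.items():
            if name in members:
                continue
            score, window = best_window(sequence, matrix)
            if score >= threshold:
                new_hits[name] = window
        if not new_hits:
            break
        members.update(new_hits)
        instances.extend(new_hits.values())
    return members


# Hypothetical seed motif and database (illustrative data only).
seed = ["ADLGA", "SDVGP", "ADLGR"]
database = {
    "hit_round1": "GGADVGAKK",
    "hit_round2": "TTADIGATT",
    "unrelated": "MKKLLPTTW",
}
members = refine_motif(seed, database, threshold=0.62)
print(members)
```

In this toy run the second sequence scores just below the threshold against the seed matrix and is only picked up after the first hit has been added, which is the point of iterating.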


PRINTS – Database


BLOCKS – "Blocks"

Older, matrix-based approach, derived from the SwissProt database.

BLOCKS fields

BLOCKS search (no longer maintained; more sophisticated methods are available nowadays)

· Search by keyword, description, etc.
· Comparison of a sequence with BLOCKS (using the balanced frequency matrices) -> matching blocks, each with an E value.

The so-called logo of a block, in which the AA frequencies are converted to letter heights, can be visualized.
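One common convention for sequence logos, assumed here because the slide does not give the exact formula, scales each letter by its frequency multiplied by the information content of the column, log2(20) minus the Shannon entropy (small-sample corrections are ignored). A minimal Python sketch with made-up columns:

```python
import math
from collections import Counter


def logo_heights(column, alphabet_size=20):
    """Letter heights in bits for one alignment column (no small-sample correction)."""
    counts = Counter(column)
    total = sum(counts.values())
    freqs = {residue: n / total for residue, n in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    information = math.log2(alphabet_size) - entropy
    return {residue: f * information for residue, f in freqs.items()}


# Hypothetical columns: one fully conserved, one variable.
conserved = logo_heights("GGGG")
variable = logo_heights("LVLI")
print(conserved)
print(variable)
```

A fully conserved column gives a single tall letter, whereas a variable column gives a shorter stack in which the most frequent residue is the tallest.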


Profiles – Prosite, Pfam

Sequence profiles are mathematical objects, derived from aligned sequences, that describe full sequences. Profiles are essentially patterns in which each position in the sequence of the segment (or motif) has been assigned a probability value for each possible amino acid residue type. There are two main types of profiles:

Weight matrices: balanced frequency matrices (similar to those in BLOCKS), extended with position-dependent gap opening and gap extension penalties (22 numbers in each row of the matrix: 20 AAs + 2 gap penalties). PROSITE uses such profiles to describe protein families for which no good regular expression can be found.

Hidden Markov Models (HMMs): an HMM is a probability model that "generates" sequences, i.e. a linear chain of Match (M), Insertion (I) and Deletion (D) states with values assigned to the transitions between them. HMMs are a theoretically sound modeling paradigm for collections of motifs, for which efficient algorithms exist.


Hidden Markov models (HMM)

The Hidden Markov model (HMM) is a virtual machine that generates sequences. The machine has a finite number of states and steps between these states. At each state or state transition, a sequence unit (i.e. one AA or nucleotide) may be generated. The generated sequences are assembled from these units.
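The "sequence-generating machine" can be illustrated with a toy model in Python. The state names (M1, I1, M2, D2), the transition and emission probabilities and the function name `generate` are all invented for this sketch; a real profile HMM has one Match/Insertion/Deletion triplet per alignment column and parameters estimated from data.

```python
import random
import re

# Toy model: two match states, one insertion state, one silent deletion state.
TRANSITIONS = {
    "start": {"M1": 1.0},
    "M1": {"M2": 0.8, "I1": 0.1, "D2": 0.1},
    "I1": {"I1": 0.3, "M2": 0.7},
    "M2": {"end": 1.0},
    "D2": {"end": 1.0},
}
EMISSIONS = {
    "M1": {"A": 0.7, "S": 0.3},
    "I1": {"G": 0.5, "K": 0.5},
    "M2": {"D": 0.9, "E": 0.1},
}


def draw(distribution, rng):
    """Pick one key of the dict at random, weighted by its probability."""
    return rng.choices(list(distribution), weights=list(distribution.values()))[0]


def generate(rng):
    """Walk from 'start' to 'end', emitting one residue in each emitting state."""
    state, residues = "start", []
    while True:
        state = draw(TRANSITIONS[state], rng)
        if state == "end":
            return "".join(residues)
        if state in EMISSIONS:  # the deletion state D2 is silent
            residues.append(draw(EMISSIONS[state], rng))


rng = random.Random(0)
samples = [generate(rng) for _ in range(200)]
print(samples[:10])
```

Every generated sequence starts with A or S; most continue with D or E, some carry inserted G/K residues in between, and a few stop after the first residue because the deletion state was visited.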


Hidden Markov−models (HMM)

An HMM can be defined on the basis of related sequences. If the model represents the related family efficiently, the HMM can generate novel sequences similar to those in the starting set of sequences.

In sequence analysis, one can calculate the probability that the HMM could generate a given sequence. If this probability is high, then the sequence is likely to belong to the family from which the HMM was constructed.
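The probability that a model generates a given sequence is the sum, over all state paths, of the product of the transition and emission probabilities along the path. The Python sketch below computes this sum recursively with memoisation for a toy model; this is the same quantity the forward algorithm computes. The states, probabilities and the function name `sequence_probability` are assumptions chosen for illustration.

```python
from functools import lru_cache

# Toy model: two match states, one insertion state, one silent deletion state.
TRANSITIONS = {
    "start": {"M1": 1.0},
    "M1": {"M2": 0.8, "I1": 0.1, "D2": 0.1},
    "I1": {"I1": 0.3, "M2": 0.7},
    "M2": {"end": 1.0},
    "D2": {"end": 1.0},
}
EMISSIONS = {
    "M1": {"A": 0.7, "S": 0.3},
    "I1": {"G": 0.5, "K": 0.5},
    "M2": {"D": 0.9, "E": 0.1},
}


def sequence_probability(sequence):
    """Probability that the toy HMM generates exactly this sequence."""

    @lru_cache(maxsize=None)
    def prob(state, pos):
        # Probability of emitting sequence[pos:] and reaching 'end' from this state.
        total = 0.0
        for nxt, t in TRANSITIONS[state].items():
            if nxt == "end":
                if pos == len(sequence):
                    total += t
            elif nxt in EMISSIONS:
                if pos < len(sequence):
                    emit = EMISSIONS[nxt].get(sequence[pos], 0.0)
                    total += t * emit * prob(nxt, pos + 1)
            else:  # silent state: no residue is consumed
                total += t * prob(nxt, pos)
        return total

    return prob("start", 0)


for candidate in ["AD", "AGD", "A", "WW"]:
    print(candidate, sequence_probability(candidate))
```

Sequences that resemble the modelled family ("AD") receive a much higher probability than sequences that need insertions or deletions, and residues the model cannot emit give a probability of zero.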


Profiles – Prosite, Pfam

PROSITE profile records

· Basic parameters: the scores for transitions such as MI (i.e. Match to Insertion)
· M: Match states, with parameters (elements of the weight matrix)
· I: Insertion states, with parameters

Pfam records

· Description records: descriptions of the families (sequence lists)
· HMM record: gives the HMM
· Pfam-A: well-documented families
· Pfam-B: poorly documented, automatically generated families

Search in profile databases

· Sequence comparisons with the profiles (various programs / servers)


Integrated secondary database - INTERPRO

Integration of the best-documented secondary databases (PROSITE, PRINTS) with other secondary databases (Pfam, PRODOM, etc.).

Thousands of protein families


Integrated secondary database - INTERPRO


Integrated secondary database - INTERPRO


Integrated biological database - NCBI


Integrated biological database - NCBI


Integrated biological database – NCBI Structure


Integrated biological database – NCBI PubMed