Retos de la bioinformática (Challenges of bioinformatics)

Bioinformatics: biology by other means. Alberto Labarga, UGR, November 2008

Upload: alberto-labarga

Post on 05-Dec-2014

DESCRIPTION

Talk given at the University of Granada

TRANSCRIPT

Page 1: Retos de la Bioinformatica

Bioinformatics: biology by other means

Alberto Labarga

UGR, November 2008

Page 2: Retos de la Bioinformatica

Computational Biology

Bioinformatics [Biological Information]

Page 3: Retos de la Bioinformatica

1859 1866 1870 1900 1902

Towards a scientific theory of heredity

Page 4: Retos de la Bioinformatica

1859 1866 1870 1900 1902

In 1859 Charles Darwin published 'On the Origin of Species', which proposed that living beings are the result of natural selection and that all creatures have evolved over the generations through small changes.

Page 5: Retos de la Bioinformatica

1859 1866 1870 1900 1902

Mendel's laws, published in 1866, rediscovered in 1900

Page 6: Retos de la Bioinformatica

1859 1866 1870 1900 1902

Around 1870, the Swiss scientist Friedrich Miescher, working in Germany, isolated the material stored in the cell nucleus, composed mainly of proteins and nucleic acids. At the time it was believed that the element storing hereditary information had to be protein, built from 20 amino acids, whereas nucleic acids had only 4 components.

Page 7: Retos de la Bioinformatica

1859 1866 1870 1900 1902

At the beginning of the 20th century, Phoebus Levene discovered that DNA is a chain of nucleotides, in which each nucleotide is composed of a sugar (deoxyribose), a phosphate group and a nitrogenous base, which can be one of four types: adenine, thymine, guanine and cytosine.

Page 8: Retos de la Bioinformatica

1859 1866 1870 1900 1902

Walter Sutton, a graduate student in E. B. Wilson’s

lab at Columbia University, observed that in the

process of cell division, called meiosis, that produces

sperm and egg cells, each sperm or egg receives only

one chromosome of each type. (In other parts of the

body, cells have two chromosomes of each type, one

inherited from each parent.) The segregation pattern

of chromosomes during meiosis matched the

segregation patterns of Mendel’s genes.

Page 9: Retos de la Bioinformatica

1928 1944 1949 1952 1953

The discovery of DNA

Page 10: Retos de la Bioinformatica

1928 1944 1949 1952 1953

1928 Frederick Griffith: the transforming principle

When Griffith injected mice with live R pneumococci mixed with S pneumococci previously killed by heat, the mice died. Moreover, in the blood of these dead mice Griffith found encapsulated (S) pneumococci.

Page 11: Retos de la Bioinformatica

1928 1944 1949 1952 1953

In 1944 Oswald Avery and his collaborators, who were studying Pneumococcus, the bacterium that causes pneumonia, discovered that bacteria contain nucleic acids and that DNA is the molecule that stores the genes. Other studies with viruses went on to confirm this theory, even though DNA was still believed to be too simple.

Page 12: Retos de la Bioinformatica

1928 1944 1949 1952 1953

Life can be seen as a process of storing and transmitting biological information.

Chromosomes are the carriers of this information.

The information is stored in the form of a molecular code.

To understand life we must identify these molecules and decipher the code.

Page 13: Retos de la Bioinformatica

1928 1944 1949 1952 1953

1949: DNA is duplicated during cell division

Chargaff: A = T and G = C
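Chargaff's rule follows from base pairing in double-stranded DNA. The snippet below is a minimal sketch that counts bases over both strands of a duplex; the function names and the example sequence are illustrative assumptions, not part of the talk.

```python
from collections import Counter

# Watson-Crick pairing: A with T, G with C
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Return the complementary strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def duplex_base_counts(strand):
    """Count bases over both strands of the double helix."""
    return Counter(strand + reverse_complement(strand))


# Illustrative sequence; any DNA string over A, C, G, T works.
counts = duplex_base_counts("GTGAGGCGCTGC")
print(counts["A"] == counts["T"], counts["G"] == counts["C"])
```

Because every A on one strand faces a T on the other (and every G a C), the equalities hold for any input sequence.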

Page 14: Retos de la Bioinformatica

1928 1944 1949 1952 1953

1952 - Hershey-Chase Experiment

Page 15: Retos de la Bioinformatica

1928 1944 1949 1952 1953

M.H.F. Wilkins, A.R. Stokes, H.R. Wilson:

Molecular Structure of Deoxypentose Nucleic

Acids. Nature 171, 738 (1953)

R.E. Franklin and R.G. Gosling

Molecular Configuration in Sodium

Thymonucleate, Nature 171, 740

(1953)

Page 16: Retos de la Bioinformatica

1928 1944 1949 1952 1953

MOLECULAR STRUCTURE

OF NUCLEIC ACIDS

“We wish to suggest a structure for the salt of deoxyribose nucleic acid (D.N.A.). This structure has novel features which are of considerable biological interest.”

Nature, 25 April 1953

Page 17: Retos de la Bioinformatica

1928 1944 1949 1952 1953

“It has not escaped our notice that the specific pairing we have postulated immediately suggests a possible copying mechanism for the genetic material.”

Page 18: Retos de la Bioinformatica

The base pairs

Page 19: Retos de la Bioinformatica
Page 20: Retos de la Bioinformatica

1955 1959 1962 1966

In 1955 Ochoa, together with the French-Russian biochemist Marianne Grunberg-Manago, published in the Journal of the American Chemical Society the isolation of an enzyme from the bacterium Azotobacter vinelandii that catalyses the synthesis of RNA, the intermediary between DNA and proteins. The discoverers named the enzyme "polynucleotide phosphorylase"; it was at first taken to be the cell's RNA-synthesizing enzyme, a role later shown to belong to RNA polymerase. The discovery of polynucleotide phosphorylase made it possible to prepare synthetic polynucleotides of different base composition, with which Severo Ochoa's group, in parallel with Marshall Nirenberg's group, deciphered the genetic code.

Page 21: Retos de la Bioinformatica
Page 22: Retos de la Bioinformatica

1955 1959 1962 1966

Page 23: Retos de la Bioinformatica

1955 1959 1962 1966

When Perutz arrived in Cambridge, the largest molecular structure that had been solved was that of the pigment phthalocyanine, with 58 atoms. A protein has thousands of atoms. Bernal, his supervisor, had taken some X-ray diffraction images of crystals of a protein, pepsin, but had not managed to interpret them. The subject Perutz chose for his thesis was another protein, hemoglobin, the oxygen carrier that gives our blood its red color. Hemoglobin has no fewer than 11,000 atoms. It took him 23 years.

Page 24: Retos de la Bioinformatica

1955 1959 1963 1966

Page 25: Retos de la Bioinformatica

1955 1959 1962 1966

Over the course of several years,

Marshall Nirenberg, Har Khorana and

Severo Ochoa and their colleagues

elucidated the genetic code – showing

how nucleic acids with their 4-letter

alphabet determine the order of the 20

kinds of amino acids in proteins.

Messenger RNA is interpreted three

letters at a time; a set of three

nucleotides forms a "codon" that

encodes an amino acid. A three-letter

word made of four possible letters can

have 64 (4 x 4 x 4) permutations, which

is more than enough to encode the 20

amino acids in living beings.
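The counting argument above can be checked directly. The sketch below enumerates all three-letter codons and translates a short sequence with the standard genetic code; it uses DNA letters (T in place of U), and the helper name `translate` and the example sequence are illustrative assumptions.

```python
from itertools import product

# Standard genetic code in the conventional compact layout:
# first, second and third base each cycle through T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence codon by codon until a stop ('*')."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))                  # number of distinct codons
print(len(set(AMINO_ACIDS) - {"*"}))     # number of encoded amino acids
print(translate("ATGGCCTGGTAA"))
```

Since 64 codons map onto 20 amino acids plus stop signals, most amino acids are encoded by more than one codon.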

Page 26: Retos de la Bioinformatica
Page 27: Retos de la Bioinformatica

From DNA to protein

Page 28: Retos de la Bioinformatica

1970 1971 1975 1977 1980

Understanding the mechanisms, creating the tools

Page 29: Retos de la Bioinformatica

1970 1971 1975 1977 1980

The Central Dogma

Page 30: Retos de la Bioinformatica

1970 1971 1975 1977 1980

Protein Data Bank: created in 1971 with seven structures

Page 31: Retos de la Bioinformatica

1970 1971 1975 1977 1980

Recombinant DNA is a DNA molecule formed by joining two heterologous molecules, that is, molecules of different origin.

It is made using restriction enzymes, which are able to "cut" DNA at specific sites.
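As a rough illustration of cutting at specific sites, the sketch below digests a single DNA strand at every occurrence of a recognition sequence. The default site is that of EcoRI (GAATTC, cut after the first G); the function name, parameters and example sequence are illustrative assumptions, and the sketch ignores the second strand and the resulting sticky ends.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Split dna at every occurrence of site.

    cut_offset is the number of bases of the site kept on the left fragment
    (1 for EcoRI, which cuts between G and AATTC).
    """
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        fragments.append(dna[start:pos + cut_offset])
        start = pos + cut_offset
        pos = dna.find(site, pos + 1)
    fragments.append(dna[start:])
    return fragments


fragments = digest("AAGAATTCGGGAATTCTT")
print(fragments)
```

Joining the fragments back together in order restores the original sequence, which is the "paste" half of the cut-and-paste picture.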

Put very simply, we "cut" a human gene and "paste" it into the DNA of a bacterium; if, for example, it is the gene that encodes insulin, putting it into a bacterium "forces" the bacterium to make insulin.

Page 32: Retos de la Bioinformatica

1970 1971 1975 1977 1980

Page 33: Retos de la Bioinformatica

1970 1971 1975 1977 1980

A precursor-RNA may often be matured to

mRNAs with alternative structures. An example

where alternative splicing has a dramatic

consequence is somatic sex determination in the

fruit fly Drosophila melanogaster.

In this system, the female-specific sxl-protein

is a key regulator. It controls a cascade of

alternative RNA splicing decisions that finally

result in female flies.

Page 34: Retos de la Bioinformatica

1981 1982 1983 1985 1987 1990

Understanding the mechanisms, creating the tools

Page 35: Retos de la Bioinformatica

1981 1982 1983 1985 1987 1990

Read out the letters from a DNA sequence

GTGAGGCGCTGC

Page 36: Retos de la Bioinformatica

1981 1982 1983 1985 1987 1990

1983: The polymerase chain reaction (PCR) is a molecular biology technique, conceived by Kary Mullis in 1983 and described in 1986, whose aim is to obtain a large number of copies of a particular DNA fragment starting from a minimal amount; in theory, a single copy of the original fragment, or template, is enough.
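The reason a single template copy can be enough is that each cycle can at most double the number of copies. The arithmetic can be sketched as follows; the `efficiency` parameter is an illustrative assumption (1.0 means perfect doubling, real reactions fall short of it).

```python
def pcr_copies(initial_copies, cycles, efficiency=1.0):
    """Idealized copy number after a number of PCR cycles.

    Each cycle multiplies the copy number by (1 + efficiency).
    """
    return initial_copies * (1 + efficiency) ** cycles


# Starting from one template molecule, 30 ideal cycles give 2**30 copies.
print(pcr_copies(1, 30))
```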

Page 37: Retos de la Bioinformatica

1981 1982 1983 1985 1987 1990

Total nucleotides

(Nov 07: 188,490,792,445)

Number of entries

(Nov 07: 106,144,026)

Page 38: Retos de la Bioinformatica

1981 1982 1983 1985 1987 1990

Page 39: Retos de la Bioinformatica

1981 1982 1983 1985 1987 1990

The Human Genome Project (HGP) aims to determine the relative positions of all the nucleotides (or base pairs) of the human genome and to identify the roughly 100,000 genes then thought to be present in it.

The project, with a budget of 3,000 million dollars, was launched in 1990 by the United States Department of Energy and National Institutes of Health, with a planned duration of 15 years.

Page 40: Retos de la Bioinformatica

“Imagine several copies of a book, each cut into 10 million small pieces, in such a way that the pieces overlap. Suppose that 1 million pieces have been lost and that the other 9 million are stained with ink. Recover the original text.”

Page 41: Retos de la Bioinformatica
Page 42: Retos de la Bioinformatica

HUGO: Idealized representation of the hierarchical shotgun sequencing strategy. A library is constructed by

fragmenting the target genome and cloning it into a large-fragment cloning vector; here, BAC vectors are shown. The

genomic DNA fragments represented in the library are then organized into a physical map and individual BAC clones

are selected and sequenced by the random shotgun strategy. Finally, the clone sequences are assembled to reconstruct

the sequence of the genome.
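The final assembly step can be illustrated with a toy greedy overlap assembler: repeatedly merge the two reads with the longest suffix-prefix overlap. This is a simplified sketch under strong assumptions (error-free reads, no repeats, one strand only), not the assembler used by the genome project; all names and the example reads are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by largest overlap; returns the remaining contigs."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # drop reads fully contained in another read
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlaps left: several contigs remain
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping "shotgun reads" from a hypothetical 20-base clone
reads = ["GCTTACCGAT", "ACGTTGCAAG", "GCAAGGCTTA"]
print(greedy_assemble(reads))
```

Real genomes contain repeats longer than the reads, which is one reason the hierarchical strategy first orders large clones on a physical map before shotgun sequencing each one.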

Page 43: Retos de la Bioinformatica

1990 1995 1996 1997 1998 1999 2001

Deciphering the book of life

Page 44: Retos de la Bioinformatica

1990 1995 1996 1997 1998 1999 2001

S.F. Altschul, et al. (1990), "Basic Local

Alignment Search Tool," J. Molec.

Biol., 215(3): 403-10, 1990. 15,306

citations

Altschul, S.F. et al (1997), “Gapped

BLAST and PSI-BLAST: a new

generation of protein database search

programs”, Nucleic Acids Res., vol. 25,

no. 17, pp. 3389-402.

Page 45: Retos de la Bioinformatica
Page 46: Retos de la Bioinformatica
Page 47: Retos de la Bioinformatica

• SSAHA (Ning et al., 2001): http://www.sanger.ac.uk/Software/analysis/SSAHA/

• SSAHA is an algorithm for very fast matching and alignment of DNA

sequences. It stands for Sequence Search and Alignment by Hashing

Algorithm. It achieves its fast search speed by converting sequence

information into a `hash table' data structure, which can then be

searched very rapidly for matches.

• BLAT (J. Kent, 2002): http://genome.ucsc.edu/cgi-bin/hgBlat

• BLAT on DNA is designed to quickly find sequences of 95% and greater

similarity of length 40 bases or more. It may miss more divergent or

shorter sequence alignments. It will find perfect sequence matches of 33

bases, and sometimes find them down to 20 bases. BLAT on proteins

finds sequences of 80% and greater similarity of length 20 amino acids

or more.
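The hashing idea behind tools of this kind can be sketched in a few lines: index every k-mer of the reference in a hash table, then use the query's first k-mer as a seed and verify the rest. This is a simplified illustration, not the published SSAHA or BLAT algorithm; function names and sequences are assumptions.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map each k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_exact_matches(query, reference, index, k):
    """Seed with the query's first k-mer, then verify the full query at each hit."""
    hits = []
    for pos in index.get(query[:k], []):
        if reference[pos:pos + len(query)] == query:
            hits.append(pos)
    return hits


reference = "ACGTACGTGACCA"
index = build_kmer_index(reference, k=4)
print(find_exact_matches("ACGTG", reference, index, k=4))
```

Building the index costs one pass over the reference; after that, each lookup is a dictionary access rather than a scan, which is where the speed comes from.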

Page 48: Retos de la Bioinformatica

1990 1995 1996 1997 1998 1999 2001

J. Thompson, T. Gibson, D.

Higgins (1994), CLUSTAL W:

improving the sensitivity of

progressive multiple sequence

alignment … Nuc. Acids. Res. 22,

4673 - 4680

Page 49: Retos de la Bioinformatica

Flowchart of computation steps in

Clustal W (Thompson et al., 1994)

Pairwise alignment: calculation of distance matrix

Creation of unrooted neighbor-joining tree

Rooted NJ tree (guide tree) and calculation of sequence weights

Progressive alignment following the guide tree
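The first step of the flowchart, the pairwise distance matrix, can be sketched with a basic Needleman-Wunsch global alignment and a distance defined as one minus the fraction of identical aligned columns. The scoring values and function names are illustrative assumptions; Clustal W itself uses more elaborate scoring and gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    # Trace back from the bottom-right corner
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def distance(a, b):
    """1 minus the fraction of identical columns in the global alignment."""
    x, y = needleman_wunsch(a, b)
    identical = sum(1 for p, q in zip(x, y) if p == q)
    return 1 - identical / len(x)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise distances (upper triangle computed once)."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = distance(seqs[i], seqs[j])
    return d


print(distance_matrix(["GATTACA", "GATTTCA", "GATTACA"]))
```

A matrix like this is the input to the neighbor-joining step that produces the guide tree.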

Page 50: Retos de la Bioinformatica

Other methods

Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for

fast and accurate multiple sequence alignment. J. Mol. Biol, 302, 205–217.

Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high

accuracy and high throughput. Nucleic Acids Res., 32, 1792–1797.

Katoh, K., Kuma, K., Toh, H., Miyata, T. (2005) MAFFT version 5:

improvement in accuracy of multiple sequence alignment. Nucleic Acids

Res, 33, 511–518.

Lassmann, T., Sonnhammer, E. (2005) Kalign – an accurate and fast multiple

sequence alignment algorithm. BMC Bioinformatics , 6, 298.

Larkin M.A. et al. (2007) ClustalW and ClustalX version 2. Bioinformatics 2007

23(21): 2947-2948.

Page 51: Retos de la Bioinformatica

Tree of Life

http://tolweb.org/tree/phylogeny.html http://itol.embl.de/

Page 52: Retos de la Bioinformatica

1990 1995 1996 1997 1998 1999 2001

1995: The first complete genome of a free-living organism, Haemophilus influenzae.

Page 53: Retos de la Bioinformatica

1990 1995 1996 1997 1998 1999 2001

1996: The yeast genome is completed: approximately 6,000 genes and 14,000,000 base pairs

Page 54: Retos de la Bioinformatica

1990 1995 1996 1997 1998 1999 2001

Page 55: Retos de la Bioinformatica

1990 1995 1996 1997 1998 1999 2001

1997: The genome of the bacterium E. coli is sequenced: 4,600 genes and 4.5 million nucleotides.

Page 56: Retos de la Bioinformatica

1990 1995 1996 1997 1998 1999 2001

1998: The genome of the worm Caenorhabditis elegans has 18,000 genes and about 100 million nucleotides

Page 57: Retos de la Bioinformatica

1990 1995 1996 1997 1998 1999 2001

1999: The complete sequence of chromosome 22 is obtained. The HGP is ahead of schedule. The small number of genes found (about 300) comes as a surprise.

Page 58: Retos de la Bioinformatica

Fire A, Xu S, Montgomery M, Kostas

S, Driver S, Mello C (1998). "Potent

and specific genetic interference by

double-stranded RNA in

Caenorhabditis elegans". Nature 391

(6669): 806–11. doi:10.1038/35888.

PMID 9486653

Page 59: Retos de la Bioinformatica

Hamilton A, Baulcombe D

(1999). "A species of small

antisense RNA in

posttranscriptional gene

silencing in plants". Science

286 (5441): 950–2.

PMID 10542148

Page 60: Retos de la Bioinformatica

Dr Alan Wolffe (1999)

• Epigenetics: heritable changes in gene expression that occur without a change in DNA sequence

• Such changes cannot be attributed to changes in DNA sequence (mutations)

• They are as irreversible as mutations (or difficult to reverse)

Page 61: Retos de la Bioinformatica

1990 1995 1996 1997 1998 1999 2001

Page 62: Retos de la Bioinformatica

Gene prediction

In humans:

~22,000 genes

~1.5% of human DNA

Where are the genes?

Page 63: Retos de la Bioinformatica

The GENCODE pipeline

1. mapping of known transcript sequences (ESTs, cDNAs, proteins) onto the human genome

2. manual curation to resolve conflicting evidence

3. additional computational predictions

4. experimental verification

5. FINAL ANNOTATION

Page 64: Retos de la Bioinformatica

August 2008 Bioinformatics tools for Comparative

Genomics of Vectors

Genome annotation - building a pipeline

Genome sequence

Map repeats

Genefinding

Protein-coding genes

Map ESTs Map Peptides

nc-RNAs

Functional annotation

Release

Page 65: Retos de la Bioinformatica


Genefinding - ab initio predictions

Use compositional features of the DNA sequence to define coding

segments (essentially exons)

ORFs

Coding bias

Splice site consensus sequences

Start and stop codons

Each feature is assigned a log likelihood score

Use dynamic programming to find the highest scoring path

Need to be trained using a known set of coding sequences

Examples: Genefinder, Augustus, Glimmer, SNAP, fgenesh
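Two of the features listed above, ORFs and start/stop codons, are easy to sketch. The code below scans the three forward reading frames for ATG ... stop stretches; it is a toy illustration only (no reverse strand, no splice sites, no coding-bias scoring), and the function name and threshold are assumptions rather than the behavior of any of the tools named.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start/end are 0-based, end-exclusive, and include the stop codon.
    min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


seq = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(seq):
    print(frame, seq[start:end])
```

A real gene finder would turn such signals into log-likelihood scores and combine them by dynamic programming, as the slide describes.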

Page 66: Retos de la Bioinformatica


ab initio prediction

[Figure: a genome track annotated with coding potential, ATG & stop codons, and splice sites]

Page 67: Retos de la Bioinformatica


Page 68: Retos de la Bioinformatica


ab initio prediction

Find best prediction

[Figure: a genome track annotated with coding potential, ATG & stop codons, and splice sites, with the best-scoring gene prediction highlighted]

Page 69: Retos de la Bioinformatica


Genefinding - similarity

Use known coding sequence to define coding regions

EST sequences

Peptide sequences

Needs to handle fuzzy alignment regions around splice sites

Needs to attempt to find start and stop codons

Examples: EST2Genome, exonerate, genewise

Use 2 or more genomic sequences to predict genes based on

conservation of exon sequences

Examples: Twinscan and SLAM

Page 70: Retos de la Bioinformatica


Similarity-based prediction

Align

Create prediction

Genome

cDNA/peptide

Page 71: Retos de la Bioinformatica

Example of a simple HMM

EPFL – Bioinformatics I – 05 Dec 2005

Top: model architecture and parameters. Bottom: sequence generation process.

green: state transition probabilities, red: emission probabilities.

Prob(sequence, path|model) = 6.8e-8.
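The quantity Prob(sequence, path|model) is the product of the start, transition and emission probabilities along the path. The sketch below computes it in log space for a hypothetical two-state model (C for coding, N for non-coding) and also finds the most probable path with the Viterbi algorithm. The model parameters are invented for illustration and are not those of the slide's figure.

```python
import math

# Hypothetical toy model: states C (coding) and N (non-coding)
START = {"C": 0.5, "N": 0.5}
TRANS = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.2, "N": 0.8}}
EMIT = {
    "C": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "N": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def joint_log_prob(seq, path, start, trans, emit):
    """log Prob(sequence, path | model): sum of log start, transition, emission terms."""
    lp = math.log(start[path[0]]) + math.log(emit[path[0]][seq[0]])
    for t in range(1, len(seq)):
        lp += math.log(trans[path[t - 1]][path[t]]) + math.log(emit[path[t]][seq[t]])
    return lp


def viterbi(seq, start, trans, emit):
    """Most probable state path and its log probability (dynamic programming)."""
    states = list(start)
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    pointers = []
    for symbol in seq[1:]:
        new_score, back = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + math.log(trans[r][s]))
            new_score[s] = score[prev] + math.log(trans[prev][s]) + math.log(emit[s][symbol])
            back[s] = prev
        score = new_score
        pointers.append(back)
    state = max(states, key=score.get)
    best_log_prob = score[state]
    path = [state]
    for back in reversed(pointers):
        state = back[state]
        path.append(state)
    return "".join(reversed(path)), best_log_prob


print(math.exp(joint_log_prob("GA", "CN", START, TRANS, EMIT)))
print(viterbi("GGCATA", START, TRANS, EMIT))
```

Working in log space avoids numerical underflow: products of many probabilities, like the 6.8e-8 on the slide, shrink quickly with sequence length.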

Page 72: Retos de la Bioinformatica

Automatic Annotation vs Manual

Automatic Annotation

• Quick whole genome analysis ~ weeks

• Consistent annotation

• Use unfinished sequence/shotgun assembly

• No polyA sites/signals, pseudogene

• Predicts ~70% loci

Manual Annotation

• Extremely slow: ~3 months for Chr 6

• Need finished seq

• Flexible, can deal with inconsistencies in data

• Most rules have exception

• Consult publications as well as databases

Page 73: Retos de la Bioinformatica

Analysis: EGASP predictions vs manual annotation

[Figure: four bar charts of sensitivity (Sn) and specificity (Sp), in percent, at the nucleotide, exon, transcript and gene level, for predictions 9_101_1, 20_79_1, 36_46_1 and 41_77_1]
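The Sn and Sp measures can be sketched at the nucleotide level as follows, assuming the convention common in gene-prediction evaluation: Sn = TP / (TP + FN) and Sp = TP / (TP + FP), where the manual annotation is taken as the truth. The function name and the interval representation are illustrative assumptions.

```python
def nucleotide_sn_sp(predicted, annotated):
    """Nucleotide-level sensitivity and specificity.

    predicted, annotated: lists of (start, end) half-open coding intervals.
    """
    def covered(intervals):
        positions = set()
        for start, end in intervals:
            positions.update(range(start, end))
        return positions

    pred, truth = covered(predicted), covered(annotated)
    tp = len(pred & truth)   # predicted coding and annotated coding
    fp = len(pred - truth)   # predicted coding, not annotated
    fn = len(truth - pred)   # annotated coding, missed by the prediction
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


print(nucleotide_sn_sp(predicted=[(0, 10)], annotated=[(5, 25)]))
```

The same two ratios can be computed at the exon, transcript or gene level by counting exactly matching features instead of nucleotides.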

Page 74: Retos de la Bioinformatica

2002 2004 2005 2007 2010

And this is only the beginning

Page 75: Retos de la Bioinformatica

2002 2004 2005 2007 2010

Page 76: Retos de la Bioinformatica

2002 2004 2005 2007 2010

Genome projects tracked at http://www.genomesonline.org

Date | Published complete genomes | Ongoing prokaryotic genomes | Ongoing eukaryotic genomes
10/3/02 | 104 | 316 | 218
8/28/03 | 156 | 386 | 246
5/07 | 500 | 1500 | 700
10/08 | 874 | 2124 | 1004

Total (10/08): about 4000

Page 77: Retos de la Bioinformatica

2002 2004 2005 2007 2010

Applied Biosystems ABI 3730XL: 1 Mb / day

Roche / 454 Genome Sequencer FLX: 100 Mb / run

Illumina / Solexa Genome Analyzer: 2000 Mb / run

Applied Biosystems SOLiD: 3000 Mb / run

[Figure: bases per run (millions) against date of introduction, 1994 to 2006, for ABI 370/377, ABI 3700, ABI 3730 and 454-GS20 (labelled 32,000,000)]

Page 78: Retos de la Bioinformatica

2002 2004 2005 2007 2010

Although human beings share 99.9 percent of their genetic information, we carry small variations called single nucleotide polymorphisms, or SNPs (pronounced "snip"). An estimated 10 million SNPs exist in the human species, and these differences are thought to be related to greater resistance or susceptibility to diseases and drugs.

Page 79: Retos de la Bioinformatica

VARIATION IN THE HUMAN DNA SEQUENCE

Mutation rate = 10^-8 per site per generation

Number of generations from the common ancestor to present-day humans: 10^4 to 10^5
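These two figures are enough for a back-of-the-envelope estimate of how much two human genomes should differ. The sketch below assumes a simple neutral model in which both lineages accumulate mutations independently since the common ancestor, hence the factor of 2; that factor and the function name are assumptions added for illustration.

```python
MUTATION_RATE = 1e-8  # per site per generation (figure quoted on the slide)


def expected_divergence(generations, mu=MUTATION_RATE):
    """Expected fraction of sites that differ between two lineages.

    Assumes mutations accumulate independently along both lineages
    (factor 2) and ignores repeated hits at the same site.
    """
    return 2 * mu * generations


low = expected_divergence(1e4)
high = expected_divergence(1e5)
print(low, high)
```

Under these assumptions the estimate spans roughly 0.02% to 0.2% of sites, a range that brackets the 0.1% difference implied by humans sharing 99.9% of their sequence.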

Page 80: Retos de la Bioinformatica

2002 2004 2005 2007 2010

ENCyclopedia Of DNA Elements (ENCODE)

Page 81: Retos de la Bioinformatica

2002 2004 2005 2007 2010

Page 82: Retos de la Bioinformatica

Functional genomics

Page 83: Retos de la Bioinformatica

• Comparative genomics

• Sequence (DNA/RNA) & phylogeny

• Regulation of gene expression; transcription factors & micro RNAs

• Protein sequence analysis & evolution

• Protein families, motifs and domains

• Protein structure & function: computational crystallography

• Protein interactions & complexes: modelling and prediction

• Chemical biology

• Pathway analysis

• Systems modelling

• Image analysis

• Data integration & literature mining

Page 84: Retos de la Bioinformatica

Copies of the DNA of the genes of interest are prepared...

...and printed on the chip

The RNA samples of interest are prepared (control and sample)

Reverse transcription

Fluorescent labels are added

The samples are hybridized on the microarray

The chip is excited with different lasers (Laser 1, Laser 2): the control responds to one of them and the sample to the other

Comparing the two images shows which genes are expressed differently

Schena et al. Science 1995
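The comparison of the two channels is usually summarized as a log ratio per gene. The sketch below is a minimal illustration with made-up intensities; the gene names, the two-fold threshold and the function names are assumptions, and real analyses normalize the channels first.

```python
import math


def log2_ratios(sample, control):
    """log2(sample / control) intensity ratio for each gene."""
    return {gene: math.log2(sample[gene] / control[gene]) for gene in sample}


def differentially_expressed(ratios, threshold=1.0):
    """Genes whose absolute log2 ratio reaches the threshold (1.0 = two-fold)."""
    return sorted(gene for gene, r in ratios.items() if abs(r) >= threshold)


# Hypothetical spot intensities for the two fluorescent channels
sample = {"geneA": 800.0, "geneB": 100.0, "geneC": 210.0}
control = {"geneA": 200.0, "geneB": 400.0, "geneC": 200.0}

ratios = log2_ratios(sample, control)
print(differentially_expressed(ratios))
```

The log scale treats up- and down-regulation symmetrically: a four-fold increase and a four-fold decrease give +2 and -2.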

Page 85: Retos de la Bioinformatica

Microarray analysis

Clinical prediction of Leukemia type

• 2 types

– Acute lymphoid (ALL)

– Acute myeloid (AML)

• Different treatment & outcomes

• Predict type before treatment?

Golub et al. Science 286:531-537 (1999)

Page 86: Retos de la Bioinformatica

Biomarker discovery

Data management → Statistical analysis → Annotation → Network analysis → Selection

30,000 genes → 1,500 genes → 150 genes → 50 elements → 10 targets

Page 87: Retos de la Bioinformatica

RT-PCR Standard Processing Procedure

TaqMan Assays

Step 1: Calculate Ct with SDS and export text file

Step 2: Retrieve data and define experiment design

Step 3: Biological Replicates

Step 4: Selection of Optimal Endogenous Controls & Calculation of ΔCt

Step 5: Differential Expression Analysis ΔΔCt

Checkpoints: Overview Plates & Samples; Quality Control (Raw Values); Discard Samples; Quality Control (ΔCt Overview)
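Steps 4 and 5 can be sketched with the widely used 2^-ΔΔCt calculation: ΔCt is the target gene's Ct minus the endogenous control's Ct, and ΔΔCt is the difference in ΔCt between the treated and control samples. The sketch assumes perfect amplification efficiency (a doubling per cycle); the function names and Ct values are hypothetical.

```python
def delta_ct(ct_target, ct_reference):
    """ΔCt: target gene Ct normalized to the endogenous control gene."""
    return ct_target - ct_reference


def fold_change(ct_target_treated, ct_ref_treated, ct_target_control, ct_ref_control):
    """Relative expression 2**(-ΔΔCt), assuming 100% PCR efficiency."""
    ddct = delta_ct(ct_target_treated, ct_ref_treated) - delta_ct(
        ct_target_control, ct_ref_control
    )
    return 2 ** (-ddct)


# Hypothetical Ct values: the target crosses threshold 2 cycles earlier when treated
print(fold_change(22.0, 18.0, 24.0, 18.0))
```

A lower Ct means more starting template, so a negative ΔΔCt corresponds to up-regulation in the treated sample.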

Page 88: Retos de la Bioinformatica


Example of Array CGH Technology*

Chari et al, Cancer Informatics, 2006, 2, 48-58

Page 89: Retos de la Bioinformatica


Page 90: Retos de la Bioinformatica

Source: http://www.chiponchip.org/

Chip-on-chip

Page 91: Retos de la Bioinformatica

DNA-binding proteins are crosslinked to DNA with formaldehyde in vivo

Isolate the chromatin. Shear DNA along with bound proteins into small fragments.

Bind antibodies specific to the DNA-binding protein to isolate the complex by precipitation. Reverse the cross-linking to release the DNA and digest the proteins.

Use PCR (polymerase chain reaction) to amplify specific DNA sequences to see if they were precipitated with the antibody

ChIP (Chromatin ImmunoPrecipitation)

• Chromatin immunoprecipitation, or ChIP, refers to a procedure used to determine whether a given protein binds to a specific DNA sequence in vivo

Page 92: Retos de la Bioinformatica
Page 93: Retos de la Bioinformatica

Protein Microarray

G. MacBeath and S.L. Schreiber, 2000, Science 289:1760

arrayIT TM

Spotting platform and protein microarray

Page 94: Retos de la Bioinformatica

Different Kinds of Protein Arrays*

Antibody Array Antigen Array Ligand Array

Detection by: SELDI MS, fluorescence, SPR, electrochemical, radioactivity, microcantilever

Page 95: Retos de la Bioinformatica

The Microarray Study Process

Page 96: Retos de la Bioinformatica

Preprocessing

Page 97: Retos de la Bioinformatica

Some Questions:

• Which genes have expression levels that are correlated

with some external variable?

• For a given pathway, which of the genes in our collection

are most likely to be involved?

• For a diffuse disease, which genes are associated with

different outcomes?

Page 98: Retos de la Bioinformatica

Challenges for Data Analysis

• Normalization (removing systematic measurement effects)

• Variable Selection (Identification of relevant Variables)

• Large sample Effects:

Type I and Type II errors (False positives / False negatives)

• Dimensionality Reduction

• Identification of new disease classes

• Classification of data into known disease classes
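The false-positive problem in the list above is easy to quantify: when thousands of genes are tested at once, a fixed significance level yields many chance hits. The sketch below shows the arithmetic and, as one possible remedy not named on the slide, the Bonferroni-corrected threshold; the gene count and function names are illustrative assumptions.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of Type I errors if no gene is truly different."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_genes = 30000  # illustrative array size
print(expected_false_positives(n_genes))
print(bonferroni_threshold(n_genes))
```

Bonferroni is conservative; it trades fewer false positives for more false negatives (Type II errors), which is the tension the slide points at.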

Page 99: Retos de la Bioinformatica

Data Analysis Methods

Dimension Reduction

• PCA (Principal Component Analysis)

• ICA (Independent Component Analysis)

• Multidimensional Scaling

Unsupervised Learning

• K-Means / K-Medoid

• Hierarchical Clustering Algorithms

Supervised Learning

• Linear Discriminant Analysis

• Maximum Likelihood Discrimination

• Nearest Neighbor Methods

• Decision Trees

• Random Forests

Page 100: Retos de la Bioinformatica

Matrix factorization

Page 101: Retos de la Bioinformatica
Page 102: Retos de la Bioinformatica


Popular Classification Methods

• Decision Trees/Rules - find smallest gene sets, but not robust; poor performance

• Neural Nets - work well for reduced number of genes

• K-nearest neighbor – good results for small number of genes, but no model

• Naïve Bayes – simple, robust, but ignores gene interactions

• Support Vector Machines (SVM)– Good accuracy, does own gene selection,

but hard to understand

• Specialized methods, D/S/A (Dudoit), …

Page 103: Retos de la Bioinformatica

Support Vector Machine (SVM)

• Main idea: Select hyperplane that is more likely to

generalize on a future datum

Page 104: Retos de la Bioinformatica


Best Practices

• Capture the complete process, from raw data to final results

• Gene (feature) selection inside cross-validation

• Randomization testing

• Robust classification algorithms

– Simple methods give good results

– Advanced methods can be better

• Wrapper approach for best gene subset selection

• Use bagging to improve accuracy

• Remove/relabel mislabeled or poorly differentiated samples

Page 105: Retos de la Bioinformatica

Alistair Chalk, 2008

Enrichment Analysis

• What are major enriched GO terms?

• What are the highly active pathways?

• What are the frequently interacting proteins?

• What are the known disease associations?
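A common way to answer the first of these questions is a hypergeometric (one-sided Fisher) test: given a gene list, how surprising is the number of genes carrying a given GO term? The sketch below is one such calculation; the slide does not name a specific test, so the choice of test, the function name and the numbers are assumptions.

```python
from math import comb


def enrichment_p_value(population, annotated, selected, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    population: genes on the array
    annotated:  genes in the population carrying the GO term
    selected:   genes in the list of interest
    overlap:    selected genes that carry the GO term
    """
    total = comb(population, selected)
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return favourable / total


# Toy numbers: 10 genes, 4 annotated, 3 selected, 2 of them annotated
print(enrichment_p_value(10, 4, 3, 2))
```

When many GO terms are tested, the resulting p-values need a multiple-testing correction before they are interpreted.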

Page 106: Retos de la Bioinformatica

Meta-analysis example: “Creation and

implications of a phenome-genome network”

Butte and Kohane. Nat Biotech. 2006

Page 107: Retos de la Bioinformatica

Meta-analysis example: “Creation and

implications of a phenome-genome network”

Butte and Kohane. Nat Biotech. 2006

• Clustered experiments based on mapping concepts found in sample annotations to UMLS meta-thesaurus.

• Relationships found between phenotype (e.g., aging), disease (e.g., leukemia), environmental (e.g., injury) and experimental (e.g., muscle cells) factors and genes with differential expression.

• “the ease and accuracy of automating inferences across data are crucially dependent on the accuracy and consistency of the human annotation process, which will only happen when every investigator has a better prospective understanding of the long-term value of the time invested in improving annotations.”

Page 108: Retos de la Bioinformatica

Systems biology

Page 109: Retos de la Bioinformatica
Page 110: Retos de la Bioinformatica

PPI ANNOTATION AND DATABASES

Database | Reference | URL
HPID | Han et al., 2004 | http://www.hpid.org
IntAct | Hermjakob et al., 2004 | http://www.ebi.ac.uk/intact
HPRD | Peri et al., 2004 | http://www.hprd.org/
DIP | Xenarios et al., 2002 | http://dip.doe-mbi.ucla.edu/
MINT | Zanzoni et al., 2002 | http://mint.bio.uniroma2.it/mint

IMEx agreement to share curation efforts

Proteomics Standards Initiative (PSI) recommendation

Molecular Interaction (MI) Ontology

Large scale experiments

Literature curation

Page 111: Retos de la Bioinformatica
Page 112: Retos de la Bioinformatica

Complex networks

• Many systems can be represented as networks (graphs)

– Nodes: individual components (proteins)

– Edges: relationships (interactions)

• They share common properties

– Scale-free

– Hierarchical

– Clustering

• Some properties may be intrinsic and can be understood better when putting into the context of evolution

Page 113: Retos de la Bioinformatica

Detecting Hierarchical Organization

Page 114: Retos de la Bioinformatica

Summary: Network Measures

• Degree ki

The number of edges involving node i

• Degree distribution P(k)

The probability (frequency) of nodes of degree k

• Mean path length

The avg. shortest path between all node pairs

• Network Diameter

– i.e. the longest shortest path

• Clustering Coefficient

– A high CC is found for modules
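The measures summarized above can be computed on a small undirected graph stored as an adjacency dictionary. This is a minimal sketch assuming a connected graph; the function names and the four-node example network are hypothetical.

```python
from collections import deque
from itertools import combinations


def degrees(adj):
    """Degree k_i: number of edges involving each node."""
    return {node: len(neighbours) for node, neighbours in adj.items()}


def degree_distribution(adj):
    """P(k): fraction of nodes that have degree k."""
    counts = {}
    for k in degrees(adj).values():
        counts[k] = counts.get(k, 0) + 1
    return {k: c / len(adj) for k, c in counts.items()}


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def mean_path_length_and_diameter(adj):
    """Average and longest shortest path over all node pairs (connected graph)."""
    lengths = []
    for source in adj:
        dist = shortest_path_lengths(adj, source)
        lengths.extend(d for node, d in dist.items() if node != source)
    return sum(lengths) / len(lengths), max(lengths)


def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of node that are themselves connected."""
    neighbours = adj[node]
    if len(neighbours) < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2 * links / (len(neighbours) * (len(neighbours) - 1))


# Hypothetical network: a triangle A-B-C with D attached to C
network = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(degrees(network))
print(mean_path_length_and_diameter(network))
print(clustering_coefficient(network, "C"))
```

In the example, the nodes of the triangle have a high clustering coefficient, which is the sense in which a high CC signals a module.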

Page 115: Retos de la Bioinformatica

Mapping the phenotypic data to the network

Begley TJ, Rosenbach AS, Ideker T,

Samson LD. Damage recovery pathways

in Saccharomyces cerevisiae revealed by

genomic phenotyping and interactome

mapping. Mol Cancer Res. 2002

Dec;1(2):103-12.

•Systematic phenotyping

of 1615 gene knockout

strains in yeast

•Evaluation of growth of

each strain in the presence

of MMS (and other DNA

damaging agents)

•Screening against a

network of 12,232 protein

interactions

Page 116: Retos de la Bioinformatica
Page 117: Retos de la Bioinformatica

The Role of Proteomics

• The existence of an ORF does not imply the

existence of a functional gene.

• Limitations of comparative genomics.

• mRNA levels may not correlate with protein levels.

• Protein variants: post-transcriptional modifications, isoforms, post-translational modifications, mutants.

• Issues of proteolysis, sequestration, etc. relevant only

at the protein level.

• Protein complex composition, protein-protein

interactions, structures.

Page 118: Retos de la Bioinformatica

Structural proteomics

• Folding

• Structure and function

• Protein structure prediction

• Secondary structure

• Tertiary structure

• Function

• Post-translational modification

• Prot.-Prot. Interaction -- Docking algorithm

• Molecular dynamics/Monte Carlo

Page 119: Retos de la Bioinformatica

What kinds of methods are available?

5 main levels of protein Structure prediction:

1. Extensive Sequence Search

2. Threading and 1D-3D profiles

3. Ab initio prediction of protein structure

4. Comparative Modelling

5. Docking (domain interaction prediction)

Page 120: Retos de la Bioinformatica
Page 121: Retos de la Bioinformatica

Prediction of Protein Structures

• Examples – a few good examples

[Figure: four pairs of actual and predicted protein structures shown side by side]

Page 122: Retos de la Bioinformatica
Page 123: Retos de la Bioinformatica

MODPIPE: Large-Scale Comparative Protein Structure Modeling

START

For each target sequence:

• Get profile for sequence (NR)

• Scan sequence profile against representative PDB chains

• Scan PDB chain profiles against sequence

• Select templates using permissive E-value cutoff

For each template structure:

• Expand match to cover complete domains

• Align matched parts of sequence and structure

• Build model for target segment by satisfaction of spatial restraints

• Evaluate model

END

Tools: PSI-BLAST, MODELLER

R. Sánchez & A. Šali, Proc. Natl. Acad. Sci. USA 95, 13597, 1998.

N. Eswar, M. Marti-Renom, M.S. Madhusudhan, B. John, A. Fiser, R. Sánchez, F. Melo, N. Mirkovic, A. Šali.

3/25/03

Page 124: Retos de la Bioinformatica

Structural Proteomics:

The Motivation*

[Figure: growth in the number of known sequences (axis 0 to 2,000,000) compared with known structures (axis 0 to 200,000), 1980 to 2005]

Page 125: Retos de la Bioinformatica

The hierarchies of protein structure

Page 126: Retos de la Bioinformatica

Docking Programs

• DOCK (UCSF)

• AutoDock (Scripps)

• Glide (Schrödinger)

• ICM (Molsoft)

• FRED (OpenEye)

• GOLD, FlexX, etc.

Page 127: Retos de la Bioinformatica

Cell cycle network from KEGG

Page 128: Retos de la Bioinformatica

Graphical Notation: a necessity for the conceptual representation of biopathways

Thiery & Sleeman, Nat. Rev. Mol. Cell. Biol. 7:131 (2006)

Qualitative, Mechanistic

Various degrees of detail, mixed levels of presentation

Aladjem et al., Science STKE pe8 (2004)

Page 129: Retos de la Bioinformatica

Strategies: simulate or analyse? (or rather what to do first)

Simulate first:

1. Convert diagram into a quantitative model

2. Simulate model behavior numerically

3. Obtain qualitative understanding through numerical results and model reduction

Analyse first:

1. Qualitatively analyze network topology, stability, etc.

2. Identify “elementary modes”

3. Build and simulate a reduced model
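As an illustration of the "simulate first" route, the sketch below turns a one-arrow diagram (A ⇌ B) into a quantitative model and simulates it numerically. It is a minimal sketch, not taken from the talk: the reaction, the rate constants kf and kr, the step size, and the use of forward Euler integration are all assumptions chosen for brevity.

```python
def simulate_reversible(a0, b0, kf, kr, dt, steps):
    """Forward-Euler integration of A <-> B with mass-action kinetics.

    All parameters are illustrative assumptions, not values from the slides.
    """
    a, b = a0, b0
    trajectory = [(0.0, a, b)]
    for i in range(1, steps + 1):
        flux = kf * a - kr * b  # net rate of conversion A -> B
        a -= flux * dt
        b += flux * dt
        trajectory.append((i * dt, a, b))
    return trajectory


# Hypothetical starting amounts and rate constants.
trajectory = simulate_reversible(a0=1.0, b0=0.0, kf=2.0, kr=1.0, dt=0.01, steps=2000)
t_end, a_end, b_end = trajectory[-1]
print(f"t={t_end:.1f}  A={a_end:.4f}  B={b_end:.4f}")
```

The qualitative insight one would hope to read off the numbers is that the system settles at the ratio B/A = kf/kr, whatever the starting amounts.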

Page 130: Retos de la Bioinformatica

Space of modeling methods

[Figure: modeling methods arranged along a continuous ↔ discrete axis; labels include stochsim and Boolean networks]

Page 131: Retos de la Bioinformatica

Continuum of modeling approaches

Top-down Bottom-up

Page 132: Retos de la Bioinformatica

Frazier et al. (2003) Science 11 April Vol 300:290-293

Page 133: Retos de la Bioinformatica

Data integration

Page 134: Retos de la Bioinformatica

Nucleic Acids Research article lists 1078 public databases

Nucleic Acids Research, 2008, Vol. 36, Database issue

http://nar.oxfordjournals.org/cgi/reprint/36/suppl_1/D2

Page 135: Retos de la Bioinformatica

Growth in Available Bioinformatics Databases

Page 136: Retos de la Bioinformatica

Too much unintegrated data

• Data sources incompatible

• No (or few) standard naming conventions

• No common interface (varying tools for browsing, querying and visualizing data)

Page 137: Retos de la Bioinformatica

– Small, isolated, independent, groups/individuals

– Loosely coupled provider-consumer of resources.

– Commonly resource consumers

– Boutique suppliers.

– Poor access to systems admins

– Large experiments or large research groups/labs, possibly distributed

– Large service provider institutes.

– Tightly coupled provider-consumer of resources.

– Commonly resource providers.

– Some or lots of access to systems admins

Page 138: Retos de la Bioinformatica

Challenges: Names and Identity

Q93038 = Tumor necrosis factor receptor superfamily member 25 precursor

Accession numbers: Q92983, O00275, O00276, O00277, O00278, O00279, O00280, O14865, O14866, P78507, P78515, Q93036, Q93037, Q99722, Q99830, Q99831, Q9BY86, Q9UME0, Q9UME1, Q9UME5

Names:

• WSL-1 protein

• Apoptosis-mediating receptor DR3

• Apoptosis-mediating receptor TRAMP

• Death domain receptor 3

• WSL protein

• Apoptosis-inducing receptor AIR

• Apo-3

• Lymphocyte-associated receptor of death

• LARD

• GENE: Name=TNFRSF25

Annotation history: http://www.expasy.org/uniprot/Q93038

GUIDs

Life Science Identifier?

Normalisation
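One way to picture normalisation is a lookup table that maps every accession number and name on the slide to a single identifier. The sketch below assumes, as the slide suggests, that all the listed accessions and names refer to the entry Q93038; the functions build_index and resolve are hypothetical helpers, not part of any UniProt tool.

```python
PRIMARY = "Q93038"

# Accession numbers and names copied from the slide.
ACCESSIONS = [
    "Q92983", "O00275", "O00276", "O00277", "O00278", "O00279", "O00280",
    "O14865", "O14866", "P78507", "P78515", "Q93036", "Q93037", "Q99722",
    "Q99830", "Q99831", "Q9BY86", "Q9UME0", "Q9UME1", "Q9UME5",
]
NAMES = [
    "WSL-1 protein", "Apoptosis-mediating receptor DR3",
    "Apoptosis-mediating receptor TRAMP", "Death domain receptor 3",
    "WSL protein", "Apoptosis-inducing receptor AIR", "Apo-3",
    "Lymphocyte-associated receptor of death", "LARD", "TNFRSF25",
]


def build_index(primary, aliases):
    """Map each alias (case-insensitive) to the chosen primary identifier."""
    index = {primary.lower(): primary}
    for alias in aliases:
        index[alias.strip().lower()] = primary
    return index


def resolve(name, index):
    """Return the primary identifier for a name, or None if it is unknown."""
    return index.get(name.strip().lower())


index = build_index(PRIMARY, ACCESSIONS + NAMES)
print(resolve("LARD", index))      # alias given as a protein name
print(resolve("o00275", index))    # alias given as an accession number
print(resolve("unknown", index))   # not in the table
```

A real identity service would also have to cope with names shared by different proteins, which a flat dictionary like this cannot represent.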

Page 139: Retos de la Bioinformatica
Page 140: Retos de la Bioinformatica

Why must we support standards?

• Unambiguous representation, description and communication

– Final results and metadata

• Interoperability

– Data management and analysis

• Integration of omics data in systems biology

Page 141: Retos de la Bioinformatica

What to standardize?

• CONTENT: Minimal/Core Information to be reported

• MIBBI (http://www.mibbi.org)

• SEMANTIC: Terminology Used -> Ontologies

• OBI (http://obi-ontology.org)

• SYNTAX: Data Model, Data Exchange

• FuGE (http://fuge.sourceforge.net/)

Page 142: Retos de la Bioinformatica

MIBBI: Standard Content

Taylor et al., "Promoting Coherent Minimum Reporting Requirements for Biological and Biomedical Investigations: The MIBBI Project", Nature Biotech.

Page 143: Retos de la Bioinformatica

Link Integration: Integration Lite

[Diagram labels: User interface, Application, Application interface, Ontology Authority, Identity Authority]

Page 144: Retos de la Bioinformatica

Warehouse

[Diagram labels: Application, User interface, Wrappers (×3), Unified model, Data Access and Query]

• Copy the data sets, clean and massage data into shape

• Combine them into a (different) pre-determined model before query

• ATLAS, MRS, e-Fungi, GIMS, Medicel Integrator, MIPS, BioMart

• Often called “Knowledge bases”
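The warehouse idea can be sketched with an in-memory SQLite database: records are copied out of two sources, cleaned, and loaded into one pre-determined table before any query is run. The two source layouts, the table schema and the GENEX/X00000 record are hypothetical; only the TNFRSF25 / Q93038 / "Death domain receptor 3" values come from the slides.

```python
import sqlite3

# Hypothetical source A: accession plus a loosely formatted gene symbol.
SOURCE_A = [
    {"acc": "Q93038", "gene": " tnfrsf25"},
    {"acc": "X00000", "gene": "geneX"},  # made-up placeholder record
]
# Hypothetical source B: (gene symbol, description) tuples.
SOURCE_B = [
    ("TNFRSF25", "Death domain receptor 3"),
    ("GENEX", "placeholder description"),
]


def normalise(symbol):
    """Clean a gene symbol into the warehouse convention (trimmed, upper case)."""
    return symbol.strip().upper()


def build_warehouse(source_a, source_b):
    """Copy both sources into one unified table; queries never touch the sources."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein (gene TEXT PRIMARY KEY, accession TEXT, description TEXT)"
    )
    for rec in source_a:
        conn.execute(
            "INSERT INTO protein (gene, accession) VALUES (?, ?)",
            (normalise(rec["gene"]), rec["acc"]),
        )
    for gene, description in source_b:
        conn.execute("INSERT OR IGNORE INTO protein (gene) VALUES (?)", (normalise(gene),))
        conn.execute(
            "UPDATE protein SET description = ? WHERE gene = ?",
            (description, normalise(gene)),
        )
    conn.commit()
    return conn


def lookup(conn, gene):
    """Query the unified model only."""
    return conn.execute(
        "SELECT accession, description FROM protein WHERE gene = ?",
        (normalise(gene),),
    ).fetchone()


conn = build_warehouse(SOURCE_A, SOURCE_B)
print(lookup(conn, "tnfrsf25"))
```

The trade-off is that the copy goes stale as soon as a source changes and has to be rebuilt.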

Page 145: Retos de la Bioinformatica

View integration

• Data at Source; Virtual integrating database view

• Global as View / Local as View mappings between models

• Map from model to databases dynamically so always fresh

• TAMBIS, Information Integrator, K4, ComparaGrid, UTOPIA, caCORE

[Diagram labels: Wrappers (×3), Application, User interface, Unified model, Data Access and Query]
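A minimal sketch of view integration, under assumed names: each wrapper holds a mapping from the global model's fields to one source's local fields, and the mediator forwards every query to the sources at query time instead of copying them. The classes DictWrapper and Mediator and both source layouts are hypothetical and are not the design of any of the systems listed above.

```python
class DictWrapper:
    """Translate queries on the global model into lookups on one live source."""

    def __init__(self, records, field_map):
        self.records = records      # reference to the source, not a copy
        self.field_map = field_map  # global field name -> local field name

    def query(self, field, value):
        local = self.field_map.get(field)
        if local is None:
            return  # this source does not expose the requested field
        for rec in self.records:
            if rec.get(local) == value:
                # Present the matching record in global-model terms.
                yield {g: rec.get(l) for g, l in self.field_map.items()}


class Mediator:
    """Virtual integrated view: fan each query out to every wrapper."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, field, value):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(field, value))
        return results


# Two hypothetical sources with different local field names.
source_a = [{"acc": "Q93038", "symbol": "TNFRSF25"}]
source_b = []

mediator = Mediator([
    DictWrapper(source_a, {"gene": "symbol", "accession": "acc"}),
    DictWrapper(source_b, {"gene": "gene_name", "description": "label"}),
])

before = mediator.query("gene", "TNFRSF25")
# A source is updated after the mediator was built; no reload step is needed.
source_b.append({"gene_name": "TNFRSF25", "label": "Death domain receptor 3"})
after = mediator.query("gene", "TNFRSF25")
print(len(before), len(after))
```

Because nothing is copied, the second query sees the new record immediately, which is the "always fresh" property from the bullet list.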

Page 146: Retos de la Bioinformatica

Specialist Integrating Application

E.g. Ensembl, UTOPIA

• Very popular. Known to be one application.

[Diagram labels: Application, User interface, Wrappers (×3)]

Page 147: Retos de la Bioinformatica

Workflows

• Data flow protocol. Automated data chaining.

• General technique for describing and enacting a process

• Describes what you want to do, not how you want to do it

• Various degrees of data type compliance anticipated

[Diagram labels: Application, User interface, Wrapper, Workflow Engine]
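The "what, not how" idea can be sketched as a workflow that is just a list of step names, with a small engine that chains each step's output into the next step's input. The step names, the registry and the toy sequence are assumptions made for illustration; real workflow engines add typing, provenance and remote service calls.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean(seq):
    """Strip whitespace and upper-case a raw sequence string."""
    return "".join(seq.split()).upper()


def reverse_complement(seq):
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq):
    """Fraction of G and C bases; 0.0 for an empty sequence."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Registry of available steps (hypothetical names).
STEPS = {"clean": clean, "revcomp": reverse_complement, "gc": gc_content}

# The workflow itself is only a declarative description of what to do.
WORKFLOW = ["clean", "revcomp", "gc"]


def run_workflow(workflow, data, registry):
    """Enact the workflow, feeding each step's output into the next step."""
    trace = []
    for name in workflow:
        data = registry[name](data)
        trace.append((name, data))
    return data, trace


result, trace = run_workflow(WORKFLOW, " atgcgc\n", STEPS)
for name, value in trace:
    print(name, value)
```

Swapping a step means editing the WORKFLOW list or the registry, not the engine.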

Page 148: Retos de la Bioinformatica

Mash-Up Data Marshalling

• Content syndication and feeds

• Emphasis on User creating specific integration by mapping.

• Just in time, just enough design

• On demand integration

[Diagram labels: Mash Up Application, User interface, Protocol (×3), objects]

Page 149: Retos de la Bioinformatica

Composite applications

Page 150: Retos de la Bioinformatica

Can the Semantic Web help?

• Slight problem: we have no first-class infrastructure for metadata migration and management, in which metadata lives outside the application, in the middleware, and progressive curation can be handled

[Diagram labels: Wrappers (×3), Application, User interface, Access and Query, Semantic Enrichment, Model flattening, Mapping Transparency]

Page 151: Retos de la Bioinformatica
Page 152: Retos de la Bioinformatica

Service Oriented Architecture

[Diagram: web services (ws) linking dataflow and workflow; labels: curation, submission, Advanced Search, Retrieve data, Submit data]

Page 153: Retos de la Bioinformatica

Distributed Annotation System

Page 154: Retos de la Bioinformatica

Distributed Annotation System

Page 155: Retos de la Bioinformatica

An Integrative Analysis Example

• Relational data mining

• Text mining

• Spectrum data mining

• Chemical sequence data model

• Visualizing relational data clusters

• Visualizing multidimensional data

• Visualizing sequence data

• Visualizing pathway data

• Text mining visualization

• Visualizing cluster statistics

• Visualizing serial/spectrum data

• Decision tree model of metabonomic profile

• Chemical structure visualization

Page 156: Retos de la Bioinformatica

From experiments to scientific publications

1. Experiments: planning and carrying out experiments (lab work)

2. Results: processing and interpretation of obtained results

3. Scientific peer-reviewed articles: 'relevant' results are published in scientific journals

Page 157: Retos de la Bioinformatica

PubMed/Medline database at NCBI

- Developed at the National Center for Biotechnology Information (NCBI).

- The core 'Textome'.

- Repository of citation entries of scientific articles.

- PubMed titles and abstracts are the primary data source for Bio-NLP.

- ~450,000 new abstracts per year

- > 4,800 biomedical journals

- ENTREZ search engine

Page 158: Retos de la Bioinformatica

Data in scientific articles

Scientific journals, journal-specific information:

• Format

• Paper structure (sections)

• Article type

Free text: Title, Abstracts, Keywords, Text body, References

Tables, Figures

Biomedical literature characteristics

- Heavy use of domain-specific terminology (12% biochemistry-related technical terms).

- Polysemic words (word sense disambiguation).

- Most words with low frequency (data sparseness).

- New names and terms are created.

- Typographical variants.

- Different writing styles (native languages).
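The data-sparseness point can be made concrete by counting how many distinct words in a text occur only once. The sketch below is an illustration on a made-up sentence, not a measurement on PubMed; the tokenising regular expression and the function name hapax_fraction are assumptions.

```python
import re
from collections import Counter


def hapax_fraction(text):
    """Fraction of distinct words that occur exactly once in the text."""
    # Simple tokeniser: lower-case alphanumeric runs, hyphens allowed inside.
    words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    counts = Counter(words)
    if not counts:
        return 0.0
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)


# Toy sentence invented for this example.
sample = "the receptor DR3 binds the ligand and the receptor signals"
print(round(hapax_fraction(sample), 3))
```

On real abstracts the same count would be taken over a much larger vocabulary, where rare gene and protein names make up most of the singletons.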

Page 159: Retos de la Bioinformatica
Page 160: Retos de la Bioinformatica

BioCreative

Page 161: Retos de la Bioinformatica

BioCreative

Page 162: Retos de la Bioinformatica

BioCreative results

1: Chiang et al.

2: Couto et al.

3: Ehrler et al.

4: Ray et al.

5: Rice et al.

6: Verspoor et al.

TP: predictions for which both the protein and the GO term were evaluated as correct

Precision: TP / total number of evaluated submissions
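The precision definition above can be written out directly. In the sketch below each evaluated submission is a pair of booleans (protein correct, GO term correct); the four example evaluations are invented to show the arithmetic and are not BioCreative results.

```python
def precision(evaluations):
    """Precision = TP / total number of evaluated submissions.

    A submission counts as a true positive only if both the protein and
    the GO term were judged correct.
    """
    if not evaluations:
        raise ValueError("no evaluated submissions")
    true_positives = sum(1 for protein_ok, go_ok in evaluations if protein_ok and go_ok)
    return true_positives / len(evaluations)


# Hypothetical evaluations: (protein correct?, GO term correct?)
example = [(True, True), (True, False), (False, True), (True, True)]
print(precision(example))
```

Requiring both parts to be correct makes this a stricter score than judging the protein or the GO term alone.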

Page 163: Retos de la Bioinformatica

Data Integration

• Standards, DBs

Knowledge Discovery

• Algorithms, Informatics, Machine Learning

Integrate knowledge

• Text mining, Ontologies

Modelling

• Pathways, Circuits, Abstraction

Infrastructure

Support Research

Page 164: Retos de la Bioinformatica

The challenges of biology over the next 50 years

• List all the molecular components that make up an organism:

– Genes, proteins, and other functional elements

• Understand the function of each component

• Understand how they interact

• Study how function has evolved

• Find the genetic defects that cause disease

• Design drugs and therapies rationally

• Sequence the genome of every individual and use it in personalized medicine

• Bioinformatics is an essential component in achieving all of these goals