Biologia In Silico - Centro de Informática - UFPE
Ivan G. Costa [email protected]
Centro de Informática, Universidade Federal de Pernambuco
String Processing
Topics
• Biological strings
• Basic problems
– pairwise/multiple alignment
– motif finding
– modeling of protein families
• Methods
– dynamic programming algorithms
– hidden Markov models
– probabilistic methods
Course
• Lectures: March/April
– introduction of basic concepts/methods
– practical classes
• Seminars: April/May
– presentation of course topics
• individual: graduate students
• pairs: undergraduate students
• Project: May to June
– analysis of real data (from the papers discussed), in groups
Grading
• 40%: seminar presentations
– peer evaluation by classmates, plus attendance
• 20%: exercise lists
• 40%: group project
– individual grade: each group is responsible for describing the participation of its members
Bibliography
• R Durbin, Sean R Eddy, A Krogh, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press.
• Neil Jones and Pavel Pevzner, An Introduction to Bioinformatics Algorithms, MIT Press, 2004.
• See the course page for the literature specific to each lecture …
– www.cin.ufpe.br/~igcf
Molecular Biology
Understanding life at the cellular level
• How genetic information is inherited
• How genetic information influences cellular processes
• How genes work together to carry out a cellular function
Genetic Information - DNA
• DNA (deoxyribonucleic acid)
– chain of nucleotides
– 4 types: A, C, G, T
– forms a double strand through base complementarity
• A pairs with T, and C pairs with G
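The pairing rule above is enough to derive one strand from the other. A minimal Python sketch, assuming sequences are plain strings over A, C, G, T; the function names are illustrative and not part of the course material:

```python
# Complement table for the pairing rule A-T, C-G (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(dna: str) -> str:
    """Return the base-by-base complement of a DNA string."""
    return dna.translate(COMPLEMENT)

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction."""
    return complement(dna)[::-1]

print(complement("AACG"))          # TTGC
print(reverse_complement("ATGC"))  # GCAT
```

The reversal reflects the fact that the two strands run in opposite directions.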
Central Dogma - Transcription
• Transcription: DNA to RNA
• RNA (ribonucleic acid)
– single strand
– 4 types: A, C, G, U
– unstable molecules
– carries information from the nucleus to the cytoplasm
Central Dogma - Transcription
• Transcription copies the base sequence of the DNA into RNA (with U instead of T).
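As a string operation, transcription can be sketched in one line of Python. This assumes the input is the coding strand of the gene written as a plain string, so the RNA has the same sequence with U in place of T; the function name is illustrative:

```python
def transcribe(coding_strand: str) -> str:
    """Return the mRNA for a DNA coding strand: same bases, U instead of T."""
    return coding_strand.upper().replace("T", "U")

print(transcribe("ATGGTGCTG"))  # AUGGUGCUG
```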
Central Dogma - Translation
• Translation
– RNA -> proteins
– carried out by the ribosome
– genetic code
• Proteins
– chain of amino acids
– 20 different types
– folds into a three-dimensional structure
– functional entities of the cell
Translation - Genetic Code
• Each codon (a combination of 3 bases) encodes one of the 20 amino acids.
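A minimal Python sketch of translation, assuming the standard genetic code and a reading frame that starts at the first base. The table is built from the usual compact 64-letter encoding (bases ordered T, C, A, G; `*` marks a stop codon); the names are illustrative:

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' = stop.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(mrna: str) -> str:
    """Translate an mRNA (or DNA coding strand) until the first stop codon."""
    seq = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):   # step through whole codons only
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGGUGCUGUCUGCC"))  # MVLSA
```

The example input is the first 15 bases of the alpha-A sequence shown later in these slides, with T written as U.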
Central Dogma
• Dogma: information flows DNA -> mRNA -> protein
• Gene: a DNA segment encoding a protein.
• Transcript: an RNA segment transcribed from a gene.
• One gene corresponds to one protein and one cellular function.
Control of Gene Expression
• How is gene expression controlled?
• Certain proteins, the transcription factors, bind to DNA and are responsible for initiating transcription.
Control of Gene Regulation
Bioinformatics
• Manage molecular biological data
– store in databases, organise, formalise, describe...
• Compare molecular biological data
• Find patterns in molecular biological data
– phylogenies
– correlations (sequence / structure / expression / function / disease)
Goals:
• characterise biological patterns & processes
• predict biological properties
– low level data ⇒ high level properties (e.g., sequence ⇒ function)
Bioinformatics: neighbour disciplines
• Computational biology
– broader concept: includes computational ecology, physiology, neurology etc...
• -omics:
– Genomics
– Transcriptomics
– Proteomics
• Systems biology
– putting it all together...
– building models, identifying control & regulation
Molecular biology data...
• DNA sequences
>alpha-D
ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCACCCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCAAGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAGCACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGGCCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCAGCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGACGGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCCGGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTGGCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTGTCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA
>alpha-A
ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGCCAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCCATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTGTCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTCTGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACCTGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTGAGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACGCCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCAGTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACCATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTTCCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGGCACCGTCCTTACTGCCAAGTACCGTTAA
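Sequence records of this kind are commonly stored in the FASTA convention: a header line starting with `>` followed by one or more sequence lines. A minimal Python sketch of a reader, assuming that layout; the function name and the short example records are illustrative:

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each '>' header (without the '>') to its concatenated sequence."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)    # sequence may span several lines
    return {key: "".join(parts) for key, parts in records.items()}

example = ">alpha-D\nATGCTG\nACC\n>alpha-A\nATGGTG\n"
print(parse_fasta(example))  # {'alpha-D': 'ATGCTGACC', 'alpha-A': 'ATGGTG'}
```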
Molecular biology data...
• Amino acid sequences
• Protein structure:
– X-ray crystallography
– NMR
Cell biology & proteomics data...
• Subcellular localization
Prediction Methods
• Homology / Alignment
• Simple pattern (“word”) recognition
• Statistical methods
– Weight matrices: calculate amino acid probabilities
– Other examples: regression, variance analysis, clustering
• Machine learning
– Like statistical methods, but parameters are estimated by iterative training rather than direct calculation
– Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)
• Combinations
Similarity between sequences
If two sequences look similar, the explanation may be:
• Homology (common descent)
• Convergent evolution (common function → common selective pressure)
• Chance!
Sequences are related
• Darwin: all organisms are related through descent with modification
• => Sequences are related through descent with modification
• => Similar molecules have similar functions in different organisms
Phylogenetic tree based on ribosomal RNA: three domains of life
Sequences are related II
Phylogenetic tree of globin-type proteins found in humans
Why compare sequences?
• Determination of evolutionary relationships
• Prediction of protein function and structure (database searches).
Protein 1: binds oxygen
Sequence similarity
Protein 2: binds oxygen ?
Biological Databases
• Vast biological and sequence data is freely available through online databases
• Use computational algorithms to efficiently store large amounts of biological data
Examples
• NCBI GenBank http://ncbi.nih.gov
Huge collection of databases, the most prominent being the nucleotide sequence database
• Protein Data Bank http://www.pdb.org
Database of protein tertiary structures
• SWISSPROT http://www.expasy.org/sprot/
Database of annotated protein sequences
• PROSITE http://kr.expasy.org/prosite
Database of protein active site motifs
Sequence Alignment
BLAST
• A computational tool that allows us to compare query sequences with entries in current biological databases.
• A great tool for predicting the function of an unknown sequence based on alignment similarities to known genes.
BLAST
Some Early Roles of Bioinformatics
• Sequence comparison
• Searches in sequence databases
Biological Sequence Comparison
• Needleman-Wunsch, 1970
– Dynamic programming algorithm to align sequences
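A minimal Python sketch of Needleman-Wunsch global alignment. The scoring values (match +1, mismatch -1, linear gap penalty -1) are assumptions chosen for illustration, not values given in the slides; real applications typically use substitution matrices and tuned gap costs:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for an optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

print(needleman_wunsch("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

The table has (n+1) x (m+1) cells and each cell is filled in constant time, so both time and memory grow with the product of the two sequence lengths.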
Searching for Localization Signals
Protein sorting in eukaryotes
• Proteins belong in different organelles of the cell – and some even have their function outside the cell
• In 1999, Günter Blobel was awarded the Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"
Protein sorting: secretory pathway / ER
Secretory proteins have a signal peptide
Initially, they are transported across the ER membrane
Signal peptides
A signal peptide is an N-terminal part of the amino acid chain, containing a hydrophobic region.
Signal peptides differ between proteins, and can be hard to recognize.
Simple pattern (“word”) recognition
Example: PROSITE entry PS00014, ER_TARGET: Endoplasmic reticulum targeting sequence (“KDEL-signal”).
Pattern: [KRHQSA]-[DENQ]-E-L
NB: only yes/no answers!
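A pattern like this maps almost directly onto a regular expression. The Python sketch below converts a simplified subset of PROSITE syntax and returns the yes/no answer; the helper names and the two test peptides are made up for illustration, and the converter is not a full PROSITE parser:

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a simplified PROSITE pattern to a Python regular expression."""
    regex = pattern.replace("-", "")                   # '-' only separates positions
    regex = regex.replace("x", ".")                    # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]") # {..} = excluded residues
    regex = regex.replace("(", "{").replace(")", "}")  # (n) or (n,m) = repetition
    regex = regex.replace("<", "^").replace(">", "$")  # N- / C-terminal anchors
    return regex

ER_TARGET = prosite_to_regex("[KRHQSA]-[DENQ]-E-L")

def has_er_target(protein: str) -> bool:
    """Yes/no answer: does the pattern occur anywhere in the sequence?"""
    return re.search(ER_TARGET, protein) is not None

print(ER_TARGET)                     # [KRHQSA][DENQ]EL
print(has_er_target("MKTLLAAKDEL"))  # True
print(has_er_target("MKTLLAAKDAL"))  # False
```

A single substitution flips the answer from yes to no, which is the limitation that motivates quantitative methods.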
Statistical Methods
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV
• Estimate probabilities for nucleotides / amino acids
• Information content in sequences; logos; Position-Weight Matrices.
• Quantitative answers.
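A minimal Python sketch of the statistical approach, assuming the sequence block on this slide is ten aligned peptides of 9 residues each. It estimates per-position amino acid frequencies and scores a new peptide by its log-probability; the pseudo-probability for unseen residues (0.01) is an arbitrary assumption, and the function names are illustrative:

```python
import math
from collections import Counter

# Assumed reading of the slide's sequence block: ten aligned 9-mers.
PEPTIDES = ["ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
            "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV"]

def position_frequencies(peptides):
    """Return one {residue: frequency} dict per alignment column."""
    if len({len(p) for p in peptides}) != 1:
        raise ValueError("peptides must all have the same length")
    n = len(peptides)
    return [{aa: count / n for aa, count in Counter(column).items()}
            for column in zip(*peptides)]

def log_score(peptide, freqs, pseudo=0.01):
    """Sum of log-probabilities; unseen residues get the pseudo-probability."""
    return sum(math.log(col.get(aa, pseudo)) for aa, col in zip(peptide, freqs))

freqs = position_frequencies(PEPTIDES)
print(freqs[0]["A"], freqs[1]["L"])   # 0.6 0.7
print(log_score("ALAKAAAAV", freqs) > log_score("GGGGGGGGG", freqs))  # True
```

Unlike the pattern match, the score is a number, so candidate peptides can be ranked rather than only accepted or rejected.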
Motif Finding
Random Sample
atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtaca
tgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttatag
gtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca
Implanting Motif AAAAAAAAGGGGGGG
atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa
tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag
gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
Where is the Implanted Motif?
atgaccgggatactgataaaaaaaagggggggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataaaaaaaaaggggggga
tgagtatccctgggatgacttaaaaaaaagggggggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgaaaaaaaagggggggtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaataaaaaaaagggggggcttatag
gtcaatcatgttcttgtgaatggatttaaaaaaaaggggggggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtaaaaaaaagggggggcaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaaaagggggggctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcataaaaaaaagggggggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttaaaaaaaaggggggga
Implanting Motif AAAAAAAAGGGGGGG with Four Mutations
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
Where is the Motif???
atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcggga
tgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttatag
gtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
Why Is Finding a (15,4) Motif Difficult?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg
acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa
tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga
gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga
tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag
gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa
cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat
aacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta
ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaag
ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa
AgAAgAAAGGttGGG
cAAtAAAAcGGcGGG
..|..|||.|..|||
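The comparison above can be checked with a Hamming distance. In the Python sketch below, the consensus is the 15-letter motif implanted in the earlier slides, and the helper names are illustrative. Each instance differs from the consensus in 4 positions, yet the two instances differ from each other in 7, so comparing instances pairwise gives a much weaker signal than comparing against the (unknown) consensus:

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions, ignoring case."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a.lower(), b.lower()))

def best_match(pattern: str, sequence: str):
    """Return (distance, offset) of the window closest to the pattern."""
    k = len(pattern)
    return min((hamming(pattern, sequence[i:i + k]), i)
               for i in range(len(sequence) - k + 1))

CONSENSUS = "AAAAAAAAGGGGGGG"
INSTANCE_1 = "AgAAgAAAGGttGGG"
INSTANCE_2 = "cAAtAAAAcGGcGGG"

print(hamming(CONSENSUS, INSTANCE_1))   # 4
print(hamming(CONSENSUS, INSTANCE_2))   # 4
print(hamming(INSTANCE_1, INSTANCE_2))  # 7
```

`best_match` only helps once a candidate pattern is known; finding the pattern itself is the hard part of the problem.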
Next Lecture
• Read chapter 1 of Durbin
• Introduction to dynamic programming algorithms (10/08)
Acknowledgments
• Some slides taken from
– Biological Sequence Analysis course, CBS, Technical University of Denmark
– Neil Jones, University of California at San Diego