Sequence comparison (Comparación de secuencias)


Upload: vail

Post on 19-Mar-2016


DESCRIPTION

Sequence comparison (Comparación de secuencias). Objective: to leverage functional and/or structural information by identifying homology between sequences. Difference between homology and identity. Two sequences are considered homologous when they have the same evolutionary origin.

TRANSCRIPT

  • Sequence comparison

  • Objective

    Leverage functional and/or structural information by identifying homology between sequences

    Difference between homology and identity

    Two sequences are considered homologous when:

    They have the same evolutionary origin

    They have similar function and structure

  • Homologous sequences: sequences that share a common evolutionary ancestry

    Similar sequences: sequences that have a high percentage of aligned residues with similar physicochemical properties (e.g., size, hydrophobicity, charge)

    IMPORTANT:

    Sequence homology: an inference about a common ancestral relationship, drawn when two sequences share a high enough degree of sequence similarity. Homology is qualitative.

    Sequence similarity: the direct result of observation from a sequence alignment. Similarity is quantitative and can be described using percentages.

  • Exercise

  • Possible proteins of 50 amino acids?

    MALRTGGPAL VVLLAFWVAL GPCHLQGTDP GASADAEGPQ CPVACTCSHD
    MRCAPTAGAA LVLCAATAGL LSAQGRPAQP EPPRFASWDE MNLLAHGLLQ

    20^50 ≈ 1.1 × 10^65 possible proteins

    Distinct proteins that exist in nature: about 200,000

    Percentage of real over possible: about 2 × 10^-58 % (in other words, practically nothing)

    Our proteins are a minority
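The arithmetic behind this exercise can be checked directly. The sketch below assumes the 20 standard amino acids and takes the figure of about 200,000 natural proteins from the slide; the variable names are illustrative only.

```python
# Count the sequences of a given length over the 20 standard amino acids.
AMINO_ACIDS = 20          # size of the standard amino acid alphabet
LENGTH = 50               # chain length used in the exercise
KNOWN_PROTEINS = 200_000  # rough figure quoted on the slide

# Each position can hold any of the 20 residues, so the count is 20^50.
possible = AMINO_ACIDS ** LENGTH
fraction_percent = 100 * KNOWN_PROTEINS / possible

print(f"possible sequences: {possible:.3e}")
print(f"known / possible:   {fraction_percent:.1e} %")
```

Python integers have arbitrary precision, so `possible` is exact; only the printed values are rounded.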

  • More definitions

    Orthologs: sequences that correspond to exactly the same function/structure in different organisms

    Paralogs: sequences produced by duplications within the same organism. They normally involve changes of function.

  • ORTHOLOGS AND PARALOGS IN THE GLOBIN LOCI

  • Homology and prediction

    Very divergent protein sequences may support similar structures

    Similar protein structures will probably have related or similar functions

  • 3D STRUCTURE VERSUS SEQUENCE

    Sequence alignment between human myoglobin and the α- and β-globins from hemoglobin

  • Myoglobin, α-globin, β-globin: comparison of the 3D structures of human myoglobin and the α- and β-globins from hemoglobin


  • Homology and prediction

    Sequence comparison is the simplest method for identifying the existence of homology.

    Identity > 30% in protein implies homology

    Identity > 80-90% is normal in orthologs from closely related species

    Identity 10-30%: if homology exists, it is undetectable (twilight zone)
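A minimal sketch of how percent identity could be computed from an already aligned pair and read against the thresholds above. The function names are hypothetical, identity is taken over columns where neither sequence has a gap (one of several conventions in use), and the label for values below 10% is an assumption, since the slide does not cover that range.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns in which neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def interpret(identity: float) -> str:
    """Map an identity value to the ranges quoted on the slide."""
    if identity > 30:
        return "homology implied"
    if identity >= 10:
        return "twilight zone"
    return "below the ranges on the slide"  # assumption: not covered by the slide


# Aligned pairs taken from the alignment examples in this deck.
close = percent_identity("AWTRRATVHDGLMEDEFAA", "AWTRRATVHDGLCEDEFAA")
distant = percent_identity("AWTKLATAVVVFEGLCEDEWGG", "AWTRRAT---VHDGLMEDEFAA")
print(round(close, 1), interpret(close))
print(round(distant, 1), interpret(distant))
```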

  • DNA or protein?

    Both provide information about homology

    DNA: only identity between bases is relevant

    Protein: there is functional equivalence between amino acids

  • Canonical (Watson-Crick) base pairing

    Only identity is relevant

  • Mismatch costs are not usually used in aligning DNA or RNA sequences, because no substitution is "better" than any other (in general)

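  • Genetic code

    Codons per amino acid: Trp, Met (1); Leu, Ser, Arg (6); the rest (2 to 4). Initiation: AUG. Stop (3).

    Each cell lists the amino acids for position 3 = U, C, A, G, in that order.

    Pos 1   Pos 2 = U          Pos 2 = C          Pos 2 = A          Pos 2 = G
    U       Phe Phe Leu Leu    Ser Ser Ser Ser    Tyr Tyr Stop Stop  Cys Cys Stop Trp
    C       Leu Leu Leu Leu    Pro Pro Pro Pro    His His Gln Gln    Arg Arg Arg Arg
    A       Ile Ile Ile Met    Thr Thr Thr Thr    Asn Asn Lys Lys    Ser Ser Arg Arg
    G       Val Val Val Val    Ala Ala Ala Ala    Asp Asp Glu Glu    Gly Gly Gly Gly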
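The codon counts per amino acid can be derived from the table itself. The sketch below encodes the standard genetic code as a 64-character string in U, C, A, G order for each codon position and tallies how many codons map to each residue; the constant names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, ordered by position 1, then 2, then 3
# (each cycling through U, C, A, G). "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

# Degeneracy: number of codons that encode each amino acid (or stop).
degeneracy = Counter(CODON_TABLE.values())

for aa, count in sorted(degeneracy.items(), key=lambda item: (-item[1], item[0])):
    print(aa, count)
```

Third-position changes are often silent, which is one reason DNA sequences can diverge while the encoded protein stays the same.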

  • Equivalent amino acids

    Hydrophobic: Ala (A), Val (V), Met (M), Leu (L), Ile (I), Phe (F), Trp (W), Tyr (Y)

    Small: Gly (G), Ala (A), Ser (S)

    Polar: Ser (S), Thr (T), Asn (N), Gln (Q), Tyr (Y). On the protein surface, polar and charged residues are equivalent

    Charged: Asp (D), Glu (E) / Lys (K), Arg (R)

    Hard to substitute: Gly (G), Pro (P), Cys (C), His (H)
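The groups above can be turned into a simple lookup to test whether two residues count as equivalent. This is only a sketch: the group names and function names are hypothetical, the slash in the charged group is read as separating negative from positive residues, and the surface rule (polar equivalent to charged) is left out because it needs structural context.

```python
# Residue groups copied from the slide, using one-letter codes.
GROUPS = {
    "hydrophobic": set("AVMLIFWY"),
    "small": set("GAS"),
    "polar": set("STNQY"),
    "charged_negative": set("DE"),  # assumption: the slash splits the two charge signs
    "charged_positive": set("KR"),
}


def shared_groups(a: str, b: str) -> list[str]:
    """Names of the groups that contain both residues."""
    return sorted(name for name, members in GROUPS.items() if a in members and b in members)


def equivalent(a: str, b: str) -> bool:
    """Two residues are treated as equivalent if identical or in a common group."""
    return a == b or bool(shared_groups(a, b))


print(equivalent("L", "I"))     # both hydrophobic
print(equivalent("D", "K"))     # opposite charges
print(shared_groups("S", "A"))  # both small
```

A lookup like this is the idea behind the physicochemical scoring matrices mentioned later in the deck, which replace the yes/no answer with a score.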

  • 3D visualization of some conserved residues in the globin family (myoglobin structure)

    Histidine: for the heme coordination bonds

    Proline: in a turn

    Two conserved glycines: in two separate helices that cross each other

  • The DNA sequence diverges faster

    Mutation or recombination alters the DNA, but function/structure must be maintained

    Protein comparison makes it possible to locate more distant homologies

  • Sequence alignment

    Measuring the similarity between sequences requires an alignment

    High similarity:

    AWTRRATVHDGLMEDEFAA
    AWTRRATVHDGLCEDEFAA

    Low similarity:

    AWTKLATAVVVFEGLCEDEWGG
    AWTRRAT---VHDGLMEDEFAA

  • Alignment types

    Pairwise: two sequences

    Multiple: more than two sequences

    Global: the whole sequence is considered

    Local: only similar regions are aligned

  • Strategies

    It depends on the objective

    Sequence comparison. Objective: measure similarity, identify equivalent amino acids. Global, pairwise/multiple

    Database search. Objective: identify homologs in a large set of sequences. Local, pairwise

  • Manual protein alignment

    Requires expertise: knowing the properties of the amino acids and knowing the protein

    Allows additional information to be incorporated: functional amino acids, amino acids needed to maintain the structure

    It is slow and poorly reproducible

  • Automatic alignment (optimization problem)

    Requires an objective method of comparing amino acids or bases to score the alignment (comparison matrices)

    Requires an algorithm to find the alignment with the maximum score

    It is reproducible and fast

    In general, it does not allow additional information to be introduced
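To make the optimization view concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch algorithm, which the slides do not name). It assumes the simplest possible scoring, an identity matrix with +1 for a match, -1 for a mismatch and -1 per gap position; these values are illustrative and not taken from the deck.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1]); bottom.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))
```

Replacing the match/mismatch rule with a lookup in a substitution matrix such as PAM or BLOSUM gives the protein version of the same procedure.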

  • Types of matrices

    Identity

    Physicochemical properties

    Genetic (codon substitution)

    Evolutionary

  • Successive application of the PAM matrix simulates several generations: PAM 40, PAM 100, PAM 250

    The evolutionary distance considered is constant. The bigger the number, the bigger the divergence and the lower the stringency
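Successive application amounts to raising the one-step mutation probability matrix to a power. The sketch below uses a made-up three-letter alphabet with a 1% chance of change per step purely for illustration; the real PAM1 matrix is 20 × 20 and is estimated from observed substitutions. The helper names are hypothetical.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(m, n):
    """Apply the one-step matrix n times (n >= 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix: each residue stays unchanged with probability 0.99.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

for n in (40, 100, 250):
    pam_n = mat_power(TOY_PAM1, n)
    # The diagonal is the probability that a residue is unchanged after n steps.
    print(n, round(pam_n[0][0], 3))
```

The probability of staying unchanged shrinks as n grows, which is why matrices with a high PAM number tolerate more substitutions and are less stringent.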

  • BLOSUM: the evolutionary distances considered are variable. More modern than PAM, but with similar results.

    The smaller n is (BLOSUM n), the bigger the divergence and the lower the stringency

  • Blosum 62

  • Which matrix to use?

    No clear answer

    All matrices evaluate the functional equivalence between amino acids in the light of evolution and conservation

  • Choice of a matrix

    Closely related sequences (rat versus mouse protein): BLOSUM90, PAM30

    BLOSUM80, PAM120

    BLOSUM62, PAM180

    Distantly related sequences (rat versus bacterial protein): BLOSUM45, PAM240

  • PAM: Point Accepted Mutation

    Query Length Substitution Matrix Gap Costs

  • Gaps (insertions/deletions)

    Normally located in loops

    AWTKLATAVVVFEGLCEDEWGG
    AWTRRAT---VHDGLMEDEFAA

  • Gaps (insertions/deletions)

    Scoring schemes:

    Dependent on secondary structure

    Constant value

    Linear function: $g_o + n \cdot g_l$ (gap-opening cost $g_o$, cost per gap position $g_l$, gap length $n$)
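The constant and linear schemes can be written as two small functions. The default values below (11 to open a gap, 1 per position) are illustrative assumptions, and the function names are hypothetical.

```python
def constant_gap_cost(n: int, cost: int = 5) -> int:
    """Constant scheme: every gap costs the same, whatever its length n."""
    if n < 0:
        raise ValueError("gap length must be non-negative")
    return 0 if n == 0 else cost


def linear_gap_cost(n: int, gap_open: int = 11, gap_length: int = 1) -> int:
    """Linear scheme from the slide: go + n * gl for a gap of n positions."""
    if n < 0:
        raise ValueError("gap length must be non-negative")
    return 0 if n == 0 else gap_open + n * gap_length


for n in (1, 3, 10):
    print(n, constant_gap_cost(n), linear_gap_cost(n))
```

With a high opening cost and a low per-position cost, one long gap is cheaper than several short ones, which matches the observation that insertions and deletions cluster in loops.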

  • Global versus local alignment

    Global alignment: finds the best possible alignment across the entire length of the two sequences. The aligned sequences are assumed to be generally similar over their entire length.

    Local alignment: finds the local regions with the highest similarity between the two sequences and aligns these without regard for the rest of the sequence. The sequences are not assumed to be similar over their entire length.

  • Global or local?

    1. Searching for conserved motifs in DNA or protein sequences?

    2. Aligning two closely related sequences with similar lengths?

    3. Aligning highly divergent sequences?

    4. Generating an extended alignment of closely related sequences?

    5. Generating an extended alignment of closely related sequences with very different lengths?

  • Local vs. Global Alignment (cont'd)

    Global alignment

    --T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-GAC
    AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT--C

    Local alignment: better alignment to find a conserved segment

                      tccCAGTTATGTCAGgggacacgagcatgcagagac
                         ||||||||||||
    aattgccgccgtcgttttcagCAGTTATGTCAGatc
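The conserved segment in this example can be recovered with a local alignment by dynamic programming (the Smith-Waterman algorithm, not named on the slide). The sketch assumes simple illustrative scores of +1 for a match, -1 for a mismatch and -1 per gap position; the two sequences are the ungapped ones from the slide.

```python
def smith_waterman(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, segment_a, segment_b) for one best local alignment."""
    n, m = len(a), len(b)
    # h[i][j] = best score of a local alignment ending at a[i-1], b[j-1]; never below 0.
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the highest-scoring cell until the score drops to zero.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


seq1 = "TCCCAGTTATGTCAGGGGACACGAGCATGCAGAGAC"
seq2 = "AATTGCCGCCGTCGTTTTCAGCAGTTATGTCAGATC"
print(smith_waterman(seq1, seq2))
```

The only change from the global version is the floor at zero and the choice of start and end points, which lets the alignment ignore the unrelated flanking regions.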

  • Sequence comparison against databases

    Unknown (query) sequence: ATTVG...LMN

    Sequence database: AGLM...WTKRTCGGLMN..HICGWRKCPGL...

    Requires very fast comparison algorithms

  • Disadvantages of global alignment

    Slow

    Scores the whole sequence

    Does not recognize multidomain proteins

    ABCACBD

    Global alignment server

  • Local alignment

    10 to 100x faster

    Recognizes individual domains

    Does not necessarily provide the best alignment!

    BLAST, FASTA

  • Basic Local Alignment Search Tool (BLAST), NCBI

  • The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

  • Input formats

  • E parameter (Expect threshold)

    The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that, in a database of the current size, one might expect to see 1 match with a similar score simply by chance. The lower the E value, or the closer it is to 0, the more "significant" the match is. However, keep in mind that searches with short sequences can give virtually identical matches that still have relatively high E values. This is because the calculation of the E value also takes into account the length of the query sequence: shorter sequences have a higher probability of occurring in the database purely by chance.

  • E value (Expect)

    Expect: this setting specifies the statistical significance threshold for reporting matches against database sequences. The default value (10) means that 10 such matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the E value ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.

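    $E = K \, m \, n \, e^{-\lambda S}$, where $m$ is the query length, $n$ the database length, $S$ the alignment score, and $K$ and $\lambda$ are statistical parameters of the scoring system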

    Warning: a lower E threshold means more false negatives
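The E value formula is easy to explore numerically. In the sketch below the function name is hypothetical, and the values used for K and λ are placeholders chosen only to show the behaviour; in practice they depend on the substitution matrix and gap costs and are reported by the search program.

```python
import math


def expect_value(score: float, query_len: int, db_len: int, k: float, lam: float) -> float:
    """E = K * m * n * exp(-lambda * S) for a raw alignment score S."""
    return k * query_len * db_len * math.exp(-lam * score)


# Placeholder parameters for illustration only (assumptions, not program output).
K, LAMBDA = 0.1, 0.3
QUERY_LEN, DB_LEN = 250, 50_000_000

for s in (40, 80, 120):
    print(s, f"{expect_value(s, QUERY_LEN, DB_LEN, K, LAMBDA):.2e}")
```

E falls exponentially with the score and grows linearly with the size of the search space, so the same alignment is less significant in a larger database.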

  • Statistics

    Reference index: E, the expected number of false positives

    Occasional searches: 0.01 to 0.001

    Large-scale searches (genome annotation): 10^-6

  • BLAST programs

    blastp: amino acid query sequence vs. protein sequence database

    blastn: nucleotide query sequence vs. nucleotide sequence database

    blastx: nucleotide query sequence translated in all reading frames vs. protein sequence database

    tblastn: protein query sequence vs. a nucleotide sequence database translated in all reading frames

    tblastx: six-frame translations of a nucleotide query vs. the six-frame translations of a nucleotide sequence database
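The list above can be summarised as a lookup from the type of query and the type of database to the applicable programs. This is only a sketch of that mapping; the dictionary and function names are hypothetical and it does not invoke BLAST.

```python
# (query type, database type) -> programs from the list above
BLAST_PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}

# Programs that compare at the protein level after translating at least one side.
TRANSLATED = {"blastx", "tblastn", "tblastx"}


def choose_blast_programs(query_type: str, db_type: str) -> list[str]:
    """Return the programs that accept this combination of sequence types."""
    try:
        return BLAST_PROGRAMS[(query_type, db_type)]
    except KeyError:
        raise ValueError("types must be 'protein' or 'nucleotide'") from None


print(choose_blast_programs("nucleotide", "nucleotide"))
print([p for p in choose_blast_programs("nucleotide", "nucleotide") if p in TRANSLATED])
```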

  • Which program to use?

    Comparing at the protein level broadens the scope of the search (even when we are comparing DNA!)

    blastn → blastx, tblastx

    blastp → tblastn

    Degeneracy of the genetic code

    Functional equivalence between amino acids

  • BLAST substitution matrices

    A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score for aligning any possible pair of residues. The theory of amino acid substitution matrices is described in [1], and applied to DNA sequence comparison in [2]. In general, different substitution matrices are tailored to detecting similarities among sequences that are diverged by differing degrees [1-3]. A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary change [1-3]. Experimentation has shown that the BLOSUM-62 matrix [4] is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM-45 matrix may prove superior. A detailed statistical theory for gapped alignments has not been developed, and the best gap costs to use with a given substitution matrix are determined empirically. Short alignments need to be relatively strong (i.e. have a higher percentage of matching residues) to rise above background noise. Such short but strong alignments are more easily detected using a matrix with a higher "relative entropy" [1] than that of BLOSUM-62. In particular, short query sequences can only produce short alignments, and therefore database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices [5,6] may be used instead. For proteins, a provisional table of recommended substitution matrices and gap costs for various query lengths is: