Genome Sequences Ka-Lok Ng Asia University


Post on 12-Jan-2016


Page 1: Genome Sequences

Genome Sequences

Ka-Lok Ng

Asia University

Page 2: Genome Sequences

History of genome sequencing

• 1995: Craig Venter’s group at The Institute for Genomic Research (TIGR) in Maryland reported the complete DNA seq. of the bacterium Haemophilus influenzae

• The first viral genome seq. (phage phiX174) was produced by Fred Sanger’s group in 1977

• Insulin A and B chains: the first amino acid sequence to be determined, in 1951, by F. Sanger (Cambridge U)

• Sanger was awarded two Nobel Prizes, both in chemistry: the first in 1958 for the structure of insulin, and the second in 1980 for developing DNA sequencing techniques (the 1980 prize was shared with Paul Berg and Walter Gilbert)

Page 3: Genome Sequences

Genome sequencing up to year 2001

http://www.biochem.arizona.edu/classes/bioc471/pages/Lecture7/Lecture7.html

Page 4: Genome Sequences

Timeline of genome sequencing

http://www.biochem.arizona.edu/classes/bioc471/pages/Lecture7/Lecture7.html

Page 5: Genome Sequences

First draft of human genome

F. Collins and C. Venter

Page 6: Genome Sequences

Biological sequence space

• DNA sequence
– a seq. of symbols from the alphabet A, T, C, and G
– IUPAC notation
– R denotes A or G
– Y denotes C or T
– "-" denotes a gap

• RNA sequence
– a seq. of symbols from the alphabet A, U, C, and G
– IUPAC notation
– R denotes A or G
– Y denotes C or U
– "-" denotes a gap

• Protein sequence
– a seq. of symbols from an alphabet of 20 letters (all letters except B, J, O, U, X, Z)

RNA secondary structure

Page 7: Genome Sequences

Biological sequence space

• Convenient to model biological seq. as a one-dimensional (1D) object

• It is also incorrect

• It neglects all the information that might be contained in the 3D structure of the molecule

• We make this approximation in this course

Page 8: Genome Sequences

Building blocks of DNA sequences

• Backbone

• Pyrimidines – single ring
– Thymine
– Cytosine

• Purines – double rings
– Adenine
– Guanine

Complementary pairs: (A,T), (C,G)

Page 9: Genome Sequences

Building blocks of protein sequences

N-terminus, C-terminus (protein sequences are read from N to C)

peptide bond O=C–N–H, alpha carbon, the R group

Page 10: Genome Sequences

Central dogma of molecular biology

More with coding DNA

DNA is double-stranded, so there are a total of 6 possible reading frames (3 on each strand) in which to look for an open reading frame (ORF)
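A minimal sketch of the six reading frames, assuming an uppercase A/C/G/T string. The function names and the toy sequence are illustrative and are not taken from the slides.

```python
# Sketch: split a DNA string into its six reading frames
# (three offsets on the given strand, three on the reverse complement).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the inverted complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Return a list of six codon lists: offsets 0-2 on each strand."""
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


frames = six_frames("ATGGCCTAA")  # toy sequence, hypothetical
for number, codons in enumerate(frames, start=1):
    print(number, codons)
```

Frames 1 to 3 come from the given strand and frames 4 to 6 from its reverse complement; an ORF search would then scan each codon list for a start codon followed by a stop codon.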

Page 11: Genome Sequences

Codon translation

Page 12: Genome Sequences

Alternative splicing

Page 13: Genome Sequences

Genome sequences

• Prokaryotic genomes
– Eubacteria and archaea are the two major groups of prokaryotes (organisms without nuclei)
– Generally have a single, circular genome between 0.5 and 13 Mbp long
– Simple genes and genetic control seqs.

• Viral genomes
– Not free-living organisms
– Can be either single- or double-stranded, and either DNA or RNA, that is ssDNA, ssRNA, dsDNA or dsRNA
– HIV, SARS

• Eukaryotic genomes
– Ranging in size from 8 Mbp for some fungi to 670 Gbp
– Human genome is about 3 Gbp long
– Baker’s yeast, worm, zebrafish, fruit fly, mosquito; mammals such as human and mouse; plants such as rice

• Organellar genomes
– Mitochondrion (mtDNA) and chloroplast genomes
– Only tens or hundreds of thousands of bases long, circular, and contain a few essential genes

Page 14: Genome Sequences

Working with whole Genomes

Below is a circular representation of the E. coli genome.

Page 15: Genome Sequences

DNA and Protein Sequences Databases

NCBI http://www.ncbi.nlm.nih.gov/

EMBL http://www.ebi.ac.uk/services/

DDBJ http://www.ddbj.nig.ac.jp/

Protein Sequence Databases

NCBI Molecular databases http://www.ncbi.nlm.nih.gov/Database/

UniProt http://www.pir.uniprot.org/

UniProt = Swiss-Prot + TrEMBL + PIR-PSD

UniProt = UniProt Archive (UniParc) + UniProt Knowledgebase (UniProtKB) + UniProt nonredundant reference database (UniRef)

ExPASy http://us.expasy.org/

PIR http://www-nbrf.georgetown.edu/

Page 16: Genome Sequences

The Entrez system

• Redundancy in GenBank

• Many different GenBank entries are relevant to a specific gene, esp. for human, E. coli, yeast, fruit fly

• 4 entries encompass the same E. coli dUTPase gene

GenBank entry   Size
X01714          1609
V01578          2568
L10328          136254
AE000441        10562

Page 17: Genome Sequences

Entrez Gene

• Example: MEN1 AND human[ORGN]

• where ORGN = organism

Page 18: Genome Sequences

Entrez Gene

• Read the summary

• Official Symbol
• Gene type
• Gene name
• Gene description
• RefSeq status
• Organism
• Lineage
• Gene aliases
• Summary
• Reference
• Protein-protein interaction

Page 19: Genome Sequences

FASTA format
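A minimal sketch of reading FASTA-formatted text, assuming the usual layout: a header line starting with ">" followed by one or more sequence lines. The function name and the toy records are illustrative only.

```python
# Sketch: parse FASTA text into a {header: sequence} dictionary.
def parse_fasta(text):
    """Join the sequence lines that follow each '>' header line."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


example = ">seq1 demo\nACGT\nTTAA\n>seq2\nGG\n"  # hypothetical records
records = parse_fasta(example)
print(records)
```

Line breaks inside a record carry no meaning, which is why the sequence lines are simply concatenated.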

Page 20: Genome Sequences

Batch Entrez Gene

• NCBI site map

Page 21: Genome Sequences

Batch Entrez Gene

• Retrieve multiple sequences information at one time

• UniProt seq. IDs: prepare a text file and upload it (use database = protein)

Q9XX00 Q8MQ56 Q9XWS4 Q9XU77 Q9XWH5 Q9N2K7

Page 22: Genome Sequences

Eukaryotic entry

example: AF018430

Use CoreNucleotide to search for the seq.

Page 23: Genome Sequences

Retrieving GenBank entries without accession number

• Entrez: human[organism] AND dUTPase[protein name]

• AND must be in capital letters!

Page 24: Genome Sequences

Whole Genome DB

• NCBI home page → Genome Biology → Entrez Genome (Viral genome DB, Microbial genome, etc.)

Page 25: Genome Sequences

Microbial genome – TIGR

• http://www.tigr.org/tdb/

• Comprehensive Microbial Resource (CMR)

Page 26: Genome Sequences

Genome databases

• Allow you to browse genomes from the chromosome level down to a single gene, an individual exon or a nucleotide.

• Ensembl database
• http://www.ensembl.org

• UCSC database
• http://genome.ucsc.edu

Page 27: Genome Sequences

Microbial Database : GOLD

• http://www.genomesonline.org

Page 28: Genome Sequences

Statistical analysis of biological sequences

• Look for sequence structures in biological sequences, either DNA, RNA or protein seqs.

• Assuming one starts from the 1D structure

• Take DNA as an example: in a random sequence one expects the nucleotides A, T, C and G to appear with equal frequency, %A = %T = %C = %G = 25%

• In actual DNA seqs., this is not true!
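The comparison with the 25% expectation can be sketched as follows. This is an illustrative example only: the function names and the toy sequence are assumptions, and symbols other than A, C, G, T are simply ignored.

```python
# Sketch: base composition and GC content of a DNA string.
from collections import Counter


def base_composition(seq):
    """Return the fraction of A, C, G and T among the A/C/G/T symbols."""
    seq = seq.upper()
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T symbols found")
    return {base: counts[base] / total for base in "ACGT"}


def gc_content(seq):
    """Return the fraction of G plus C."""
    composition = base_composition(seq)
    return composition["G"] + composition["C"]


toy = "AATTAATTGC"  # hypothetical sequence
composition = base_composition(toy)
print(composition, gc_content(toy))
```

For a real genome the same function would be applied to the sequence read from a FASTA file; any departure of the four fractions from 0.25 is the bias the slide refers to.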

Page 29: Genome Sequences

Statistical analysis of DNA sequences

• Study the base composition
• GC content
• Frequent or rare words – words of length k
• Biological relevance of unusual words (motifs)

Page 30: Genome Sequences

Counting words in DNA seqs.

http://www.genomatix.de/cgi-bin/tools/tools.pl (create seq. statistics)

Page 31: Genome Sequences

Counting words in DNA seqs.

• NCBI → Genome (complete genome sequences) → microbial → Haemophilus influenzae Rd KW20, NC_000907.1 (TIGR, dated 1995) → Link: RefSeq FTP or GenBank FTP (L42023.fna)

Page 32: Genome Sequences

Counting words in Haemophilus influenzae genome

Total number of bp

GC content agrees with the NCBI record

Page 33: Genome Sequences

Counting words in Haemophilus influenzae genome

• (%A) on strand + = (%T) on strand -

• (%C) on strand + = (%G) on strand -

• ….

• Because of the complementary principle, i.e. A-T and C-G
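The strand relations above can be checked on a toy example. The sketch below is illustrative only; the helper name and the sequence are assumptions.

```python
# Sketch: the + strand's A count equals the - strand's T count, and so on.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the - strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


plus = "AAACCGT"                  # hypothetical + strand
minus = reverse_complement(plus)  # the corresponding - strand

for base, partner in (("A", "T"), ("C", "G"), ("G", "C"), ("T", "A")):
    print(base, plus.count(base), partner, minus.count(partner))
```

Because both strands have the same length, equal counts imply equal percentages.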

Page 34: Genome Sequences

Percentage of dinucleotide

Counting words in Haemophilus influenzae genome

Use L-k+1 (the number of overlapping words of length k in a seq. of length L)

Page 35: Genome Sequences

Counting words in Haemophilus influenzae genome

• Nucleotide words of length 2 (called dimer) or higher (trimers, k-mers)

• Words of length k are called k-grams or k-tuples in computer science, or k-mer in biological science

Frequency of 3-mers
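A minimal sketch of k-mer counting with a sliding window, assuming overlapping words so that a sequence of length L yields L-k+1 words. The function name and toy sequence are illustrative.

```python
# Sketch: count overlapping words of length k (k-mers) in a sequence.
from collections import Counter


def count_kmers(seq, k):
    """Return a Counter over the L-k+1 overlapping k-mers of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


toy = "ATATAG"  # hypothetical sequence, L = 6
dimers = count_kmers(toy, 2)
total = sum(dimers.values())  # L - k + 1 windows
for word, count in sorted(dimers.items()):
    print(word, count, count / total)
```

Dividing each count by L-k+1 turns the counts into the percentages shown in the dinucleotide and 3-mer tables.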

Page 36: Genome Sequences

Finding unusual DNA words

• A simple statistical analysis can be used to find under- and over-representation of motifs (i.e. k-mers)

• Helps us to decide when an observed bias is significant

For the case of 2-mers

• Compare the observed probability N of the 2-mers with the one expected under a background model, typically a multinomial model. The ratio between the two quantities indicates how much a certain word deviates from the background model and is called the odds ratio:

$$ r_{xy} = \frac{N(xy)}{N(x)\,N(y)} $$

where N(xy) is the frequency of the dinucleotide xy, and N(x) and N(y) denote the frequencies of the nucleotides x and y respectively.

r_xy > 1 or r_xy < 1: the dinucleotide xy is considered to be of higher or lower relative abundance compared with a random seq.
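The odds ratio for 2-mers can be sketched as below. This is only an illustration: frequencies are taken as fractions (dimer counts divided by L-1, base counts divided by L), and the function name and toy sequence are assumptions.

```python
# Sketch: dinucleotide odds ratio r_xy = N(xy) / (N(x) * N(y)),
# with N(.) taken as relative frequencies.
from collections import Counter


def dinucleotide_odds_ratios(seq):
    """Return {dimer: odds ratio} for every dimer observed in seq."""
    seq = seq.upper()
    length = len(seq)
    base_freq = {b: seq.count(b) / length for b in "ACGT"}
    dimers = Counter(seq[i:i + 2] for i in range(length - 1))
    n_dimers = length - 1
    ratios = {}
    for xy, count in dimers.items():
        x, y = xy[0], xy[1]
        if x not in base_freq or y not in base_freq:
            continue  # skip dimers containing ambiguity symbols
        ratios[xy] = (count / n_dimers) / (base_freq[x] * base_freq[y])
    return ratios


ratios = dinucleotide_odds_ratios("AACC")  # hypothetical sequence
print(ratios)
```

Values well above or below 1 flag dimers that are over- or under-represented relative to the multinomial background.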

Page 37: Genome Sequences

Finding unusual DNA words

• Clearly, dimers whose odds ratio deviates from 1 are unusually represented, although the amount of deviation needed to consider this a significant pattern needs to be analyzed with the tools discussed later in this course.

• The dimer GG looks extremely infrequent in that table, but this analysis reveals that this is not likely to be a significant bias, because the nucleotide G is low in frequency to begin with.

AA and TA seem to be unusual

Page 38: Genome Sequences

Finding unusual DNA words

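• The odds ratio can be generalized to k-mers

• For k-mers there are 4 to the k-th power, 4^k, possible different patterns

$$ r_{k\text{-mer}} = \frac{N(k\text{-mer})}{N(1)\,N(2)\cdots N(k)} $$

where N(i) is the frequency of the nucleotide at position i of the k-mer.

Frequent words in H. influenzae: the words AAAGTGCGGT and ACCGCACTTT both appear more than 500 times. (The two words are inverted complements of each other.)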
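A hedged sketch of the generalized odds ratio for one k-mer, using relative frequencies as in the 2-mer case. The function name and toy sequence are assumptions; for long words in real genomes a different background model may be more appropriate.

```python
# Sketch: odds ratio of a single k-mer against a multinomial background,
# r = N(word) / (N(b1) * N(b2) * ... * N(bk)).
from math import prod


def kmer_odds_ratio(seq, word):
    """Observed frequency of word divided by the product of its base frequencies."""
    k = len(word)
    length = len(seq)
    n_windows = length - k + 1
    hits = sum(1 for i in range(n_windows) if seq[i:i + k] == word)
    observed = hits / n_windows
    expected = prod(seq.count(base) / length for base in word)
    if expected == 0:
        return 0.0  # the word uses a base that never occurs in seq
    return observed / expected


toy = "ACGACGACG"  # hypothetical sequence
ratio = kmer_odds_ratio(toy, "ACG")
print(4 ** 3, ratio)  # number of possible 3-mers, and the ratio for ACG
```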

Page 39: Genome Sequences

Biological relevance of unusual motifs

• Frequent words may be due to repetitive elements

• Rare motifs include binding sites for transcription factors

• Words such as CTAG that have undesirable structural properties, because they lead to “kinking” of the DNA

Virus vs. Bacteria

• Words that are not compatible with the internal immune system of a bacterium. Bacterial cells can be infected by viruses, and in response they produce restriction enzymes, proteins that are capable of cutting DNA at specific nucleotide words, known as restriction sites. The nucleotide motifs recognized by restriction enzymes are under-represented in many viral genomes, so as to avoid the bacterial hosts’ restriction enzymes.

Page 40: Genome Sequences

Analyzing DNA seq.

http://bioweb.pasteur.fr/intro-uk.html#dna

Page 41: Genome Sequences

Analyzing DNA seq.: GC composition

• Calculates the fractional GC content of nucleic acid sequences

• C+G content: a G≡C base pair is held together by three hydrogen bonds

• GEECEE http://bioweb.pasteur.fr/seqanal/interfaces/geecee.html

Page 42: Genome Sequences

Counting long words in DNA seqs.

• http://bioweb.pasteur.fr/intro-uk.html

• Use AK003076

>gi|12833508|dbj|AK003076.1| Mus musculus adult male spleen cDNA, RIKEN full-length enriched library, clone:0910001I10 product:DUTPASE homolog [Mus musculus], full insert sequence
GGCTTTTTCCACGCCCGCCGCCATGCCCTGCTCGGAAGATGCCGCGGCCGTCTCTGCCTCCAAGAGGGCT
CGAGCGGAGGATGGCGCTTCTCTGCGCTTCGTGCGGCTCTCGGAGCACGCCACGGCGCCCACCCGCGGGT
CCGCGCGCGCTGCCGGCTACGACCTATTCAGTGCCTATGATTATACAATATCACCCATGGAGAAAGCCAT
CGTGAAGACAGACATTCAGATAGCTGTCCCTTCTGGGTGCTATGGAAGAGTAGCTCCACGTTCTGGCTTG
GCTGTAAAGCACTTCATAGATGTAGGAGCTGGTGTCATAGACGAGGATTACAGAGGAAACGTTGGGGTCG
TGCTGTTTAACTTTGGGAAAGAGAAGTTTGAAGTGAAAAAAGGTGATCGGATTGCGCAGCTCATCTGTGA
GCGGATTTCTTATCCAGACTTAGAGGAAGTGCAGACCCTGGATGACACCGAGAGAGGCTCAGGAGGCTTC
GGCTCCACCGGGAAGAATTAGAACTTTGCTGGAAGTATCTCGCTGTTTCAACACTGGAAACCAGAAGCTC
TAACTTCGGAAGCATTTGGTGTTCTAGGATGCAGGAAAGGAGACCTCGATCACATCACGTTGGAACGATT
CTGTTCCCTGGTTGAGGTCGCCTGTAAGTCTGCACTGTGAGCATGGCATTGACATGCAGACTTGGTAAAA
CCCAGGGTACAGTTAGATTTTTTGTTGTTGTTGTATTATTTAAATTATAGCCTTCCAAAAACTGTTTTTG
ATCATAATTGCTGTATCATTTGTAATTTT
TTTTAATCCAATAAAGTTGCTTTTAGC

Page 43: Genome Sequences

Analyzing DNA seq. composition

Page 44: Genome Sequences

Unusual words in different organisms or chromosomes

• The measure r_xy is suitable for a single seq.

• When comparing seqs. from different organisms or chromosomes, account for the complementary anti-parallel structure of DNA → modify r_xy

• Reference: Burge, Campbell and Karlin (1992), PNAS, 89, 1358

Double helix

S = 5’-ATCG....-3’        S = 5’-CAGT....-3’
I(S) = 3’-TAGC....-5’     I(S) = 3’-GTCA....-5’

• Let I(S) = inverted complementary seq. of S
• X = A, T, C, G
• a superscript on f labels the species
• f_X = freq. of X for a given species

Observation

• Chargaff’s rule for double strands: total number of A = total number of T, and total number of C = total number of G

Page 45: Genome Sequences

Unusual words in different organisms or chromosomes

• Question: compare f_X for one species with f_X for another

• Need to consider the union of S and I(S), where I(S) is the inverted complementary seq. of S

• Why? Consider the case in which one seq. has lots of A and the other has lots of T: the second in fact has lots of A in its complementary seq.!

S = 5’-AAAACGT....-3’        S = 5’-TTTTCGA....-3’
I(S) = 3’-TTTTGCA....-5’     I(S) = 3’-AAAAGCT....-5’

• Need to symmetrize the nucleotide frequencies, i.e. take the complementary seq. into account

• Define S* = S + I(S) and f*_X = (f_X + f_I(X))/2, where I(X) = inverted complement of X

• * means the union, that is, count the freq. of X in both strands and take the average

$$ f^*_A = \frac{f_A + f_{I(A)}}{2} = \frac{f_A + f_T}{2}, \qquad f^*_T = \frac{f_T + f_{I(T)}}{2} = \frac{f_T + f_A}{2} $$

$$ f^*_A = f^*_T = f^*_{I(A)}, \qquad \text{similarly } f^*_C = f^*_G $$

Compare the double-strand quantity f*, that is, compare f*_X for the two species.

Work with a single DNA strand only; there is no need to find the complementary seq.
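A small sketch of the symmetrized base frequencies, computed from one strand only. The function name is an assumption; the two toy strings are the A-rich and T-rich examples from this slide with the trailing dots dropped.

```python
# Sketch: symmetrized frequency f*_X = (f_X + f_I(X)) / 2 from a single strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def symmetrized_base_freq(seq):
    """Average each base frequency with that of its complement."""
    length = len(seq)
    f = {base: seq.count(base) / length for base in "ACGT"}
    return {base: (f[base] + f[COMPLEMENT[base]]) / 2 for base in "ACGT"}


a_rich = symmetrized_base_freq("AAAACGT")
t_rich = symmetrized_base_freq("TTTTCGA")
print(a_rich)
print(t_rich)
```

Although one string is rich in A and the other in T, their symmetrized frequencies coincide, which is the point of the symmetrization.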

Page 46: Genome Sequences

Unusual words in different organisms or chromosomes

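How about counting the frequency of 2-mers?

$$ f^*_{XY} = \frac{f_{XY} + f_{I(XY)}}{2} $$

For example,

$$ f^*_{AC} = \frac{f_{AC} + f_{I(AC)}}{2} = \frac{f_{AC} + f_{GT}}{2}, \qquad f^*_{GT} = \frac{f_{GT} + f_{I(GT)}}{2} = \frac{f_{GT} + f_{AC}}{2} $$

so f*_AC = f*_GT, and in general f*_XY = f*_I(XY), where I(XY) = inverted complement of XY.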

Page 47: Genome Sequences

Unusual words in different organisms or chromosomes

How about the odds ratio for 2-mers?

$$ r^*_{GT} = \frac{f^*_{GT}}{f^*_G\, f^*_T} = \frac{2\,(f_{GT} + f_{AC})}{(f_G + f_C)(f_T + f_A)} $$

Similar for other 2-mers. Can you prove that, in general, r*_XY = r*_I(XY)?

As a conservative estimate, odds ratios below 0.78 are considered low and odds ratios above 1.22 are considered high.
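The symmetrized odds ratio can be sketched from a single strand as follows. This is an illustration under the definitions on these slides (f* as the average of a word and its inverted complement); the function names and the toy sequence are assumptions.

```python
# Sketch: symmetrized dinucleotide odds ratio r*_XY = f*_XY / (f*_X f*_Y).
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def inverted_complement(word):
    """Return I(word): complement each base and reverse the order."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def symmetrized_odds_ratios(seq):
    """Return {XY: r*_XY} for all 16 dimers with non-zero base frequencies."""
    length = len(seq)
    f1 = {base: seq.count(base) / length for base in "ACGT"}
    dimers = Counter(seq[i:i + 2] for i in range(length - 1))
    f2 = {word: count / (length - 1) for word, count in dimers.items()}
    ratios = {}
    for x in "ACGT":
        for y in "ACGT":
            xy = x + y
            f_xy = (f2.get(xy, 0.0) + f2.get(inverted_complement(xy), 0.0)) / 2
            f_x = (f1[x] + f1[COMPLEMENT[x]]) / 2
            f_y = (f1[y] + f1[COMPLEMENT[y]]) / 2
            if f_x > 0 and f_y > 0:
                ratios[xy] = f_xy / (f_x * f_y)
    return ratios


ratios = symmetrized_odds_ratios("ACGT")  # hypothetical sequence
print(ratios["GT"], ratios["AC"])
```

The two printed values agree, illustrating r*_XY = r*_I(XY); words such as GT could then be compared against the 0.78 and 1.22 thresholds.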

Page 48: Genome Sequences

Unusual words in different organisms or chromosomes

How about the odds ratio for 3-mers?

$$ r^*_{XYZ} = \frac{f^*_{XYZ}\; f^*_X\, f^*_Y\, f^*_Z}{f^*_{XY}\; f^*_{YZ}\; f^*_{XNZ}} $$

where N is any nucleotide and

$$ f^*_{XYZ} = \frac{f_{XYZ} + f_{I(XYZ)}}{2} $$

Compare statistical properties (1-mer and 2-mers) of human and chimp

complete mitochondrial DNA

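NC_001807 (human) and NC_001643 (chimp)

         Human     Chimp
A (%)    30.86%    31.13%
C (%)    31.33%    30.80%
G (%)    13.16%    12.89%
T (%)    24.66%    25.18%

Human

$$ f^*_A = f^*_{I(A)} = \frac{f_A + f_T}{2} = \frac{30.86 + 24.66}{2} = 27.76\% $$

$$ f^*_C = \frac{f_C + f_G}{2} = \frac{31.33 + 13.16}{2} = 22.245\% $$

Chimp

$$ f^*_A = f^*_{I(A)} = \frac{f_A + f_T}{2} = \frac{31.13 + 25.18}{2} = 28.155\% $$

$$ f^*_C = \frac{f_C + f_G}{2} = \frac{30.80 + 12.89}{2} = 21.845\% $$

Both species have similar f_X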

Page 50: Genome Sequences

Compare statistical properties (1-mer and 2-mers) of human and chimp complete mitochondrial DNA

              second nucleotide
              A        C        G        T
first    A    0.0962   0.0902   0.0483   0.0738
nucl.    C    0.0927   0.1074   0.0265   0.0868
         G    0.0371   0.0432   0.0258   0.0254
         T    0.0826   0.0725   0.0309   0.0606

              second nucleotide
              A        C        G        T
first    A    1.0042   0.8812   1.1750   1.0293
nucl.    C    0.9664   1.1537   0.5742   1.0861
         G    0.9819   1.0328   1.5773   0.7321
         T    1.0352   0.9857   0.9296   1.0019

$$ r^*_{XY} = r^*_{I(XY)} $$

Human Chimp

              second nucleotide
              A    C    G    T
first    A    1    2    3    4
nucl.    C    5    6    7    3
         G    8    9    6    2
         T    10   8    5    1

symmetric

4 x 4 = 16 entries, but because r*_XY = r*_I(XY) the table is symmetric: the six pairs (AA,TT), (AC,GT), (AG,CT), (CA,TG), (CC,GG), (GA,TC) share a value, while AT, CG, GC and TA are their own inverted complements. Only 10 numbers need to be computed, not 16!

Page 51: Genome Sequences

Compare statistical properties (1-mer and 2-mers) of human and chimp complete mitochondrial DNA

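$$ r_{XY} = \frac{f_{XY}}{f_X\, f_Y} $$

$$ r^*_{XY} = \frac{f^*_{XY}}{f^*_X\, f^*_Y} = \frac{(f_{XY} + f_{I(XY)})/2}{\dfrac{f_X + f_{I(X)}}{2}\cdot\dfrac{f_Y + f_{I(Y)}}{2}} = \frac{2\,(f_{XY} + f_{I(XY)})}{(f_X + f_{I(X)})(f_Y + f_{I(Y)})} = \frac{2\,(p_{XY} + p_{I(XY)})}{(p_X + p_{I(X)})(p_Y + p_{I(Y)})} $$

where p denotes the percentage of the word, i.e. the word count divided by the total number N of words of that length, expressed as a fraction.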

See my human and chimp k-mers Excel file

Page 52: Genome Sequences

Linguistic study of DNA sequences

• Do genomic sequences bear any resemblance to a natural language? An open question!

– Coding regions
  • Bacteria: no introns
  • Archaea: some introns, TATA boxes
  • Eukarya: many introns and exons, TATA boxes

– Noncoding regions
  • Pseudogenes
  • Repetitive sequences
    – Mini-satellites
    – Micro-satellites

– Alphabets, words, sentences
– Coding regions: words
– Non-coding regions: ?

Page 53: Genome Sequences

How to obtain inverted complementary seq. ?

• Prepare a FASTA format file

• Biological software web site http://bioweb.pasteur.fr/intro-uk.html#dna (seq. tools)

EMBOSS program name: revseq; Advanced revseq form; output file: outseq.out

Page 54: Genome Sequences

GC content

Factors contributing to the variation of GC content

1. Environmental temperature

2. Levels of methylation

3. Recent transposon activity (DNA jumps around)

• Over stretches of hundreds of kb, GC content should vary by <1% as a result of random sampling

• But most genomes show a bias ranging over as much as 30%!
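Variation of GC content along a sequence can be sketched with fixed-size windows. This is only an illustration; the function names, the window size and the toy sequence are assumptions.

```python
# Sketch: GC content in consecutive windows along a sequence.
def gc_fraction(seq):
    """Fraction of G plus C in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step=None):
    """GC fraction of each full window; windows do not overlap by default."""
    step = step or window
    return [gc_fraction(seq[start:start + window])
            for start in range(0, len(seq) - window + 1, step)]


profile = windowed_gc("GGCCAATT", window=4)  # hypothetical sequence
print(profile)
```

On a real chromosome the window would be far larger (for example 1 Mb or 200 bp, the two scales used in the chromosome 1 figure), and the spread of the resulting profile shows the bias.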

Page 55: Genome Sequences

GC content

Figure. Distribution of GC content along human chromosome 1. GC content varies between 20% and 65% at several different levels of resolution, including for the entire 220 Mb of chromosome 1 averaged over 1-Mb windows (top) and within just 1 Mb for 200-bp windows (bottom). A gap in the IHGSC seq. can be seen at the 400-kb mark on the 1-Mb scale.

Page 56: Genome Sequences

GC content

• Karyotypic bands revealed by nuclear dyes such as Giemsa tend to correlate with GC content (dark bands being more AT-rich), possibly reflecting their propensity to coil into superstructures, but clearly other features of the DNA contribute to chromatin assembly.

• Chromosome is 2 ~ 3 cm long

• The 46 chromosomes (over 1 m long in total) are packed inside the nucleus, which has a size of 0.001 cm! Amazing!!

• CpG dinucleotides are underrepresented in mammalian genomes overall, but cluster as CpG islands between 0.5 and 2 kb in length that are significantly enriched just upstream of genes.

hsa-mir-639

UCSC database http://genome.ucsc.edu

Page 57: Genome Sequences

Finding internal repeats in DNA seqs.

• Tandem repeats, inverted repeats

• Repeats are often involved in genome rearrangements or regulatory mechanisms of gene expression

• Tool results depend on the scoring system and ranking

• Dot-plot approach http://arbl.cvmbs.colostate.edu/molkit/
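A minimal sketch of the dot-plot idea for finding internal repeats: compare a sequence with itself and record every pair of positions where a short word matches exactly. The function name, the word length and the toy sequence are assumptions; real tools add scoring and filtering.

```python
# Sketch: exact-match dot plot between two sequences using words of fixed length.
def dot_plot(seq_a, seq_b, word=3):
    """Return the set of (i, j) where seq_a[i:i+word] == seq_b[j:j+word]."""
    index = {}
    for j in range(len(seq_b) - word + 1):
        index.setdefault(seq_b[j:j + word], []).append(j)
    hits = set()
    for i in range(len(seq_a) - word + 1):
        for j in index.get(seq_a[i:i + word], []):
            hits.add((i, j))
    return hits


toy = "ACGTACGT"  # hypothetical sequence with a tandem repeat
hits = dot_plot(toy, toy, word=4)
repeats = sorted((i, j) for i, j in hits if i < j)  # off-diagonal dots
print(repeats)
```

In a self-comparison the main diagonal is always filled; dots off the diagonal mark repeated words, and runs of such dots parallel to the diagonal indicate direct repeats.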

Page 58: Genome Sequences

Finding internal repeats in DNA seqs.

Page 59: Genome Sequences

TF sequence

Transcription factor, TFIIIA for X.laevis, K02938

>gi|214818|gb|K02938.1|XELTFIIIA X.laevis 5S RNA gene transcription factor (TFIIIA) mRNA, complete cds
GAATTCCGGAAGCCGAGGGCTGTTCAGTTGCTGAAGGAGAGATGGGAGAGAAGGCGCTGCCGGTGGTGTATAAGCGGTACATCTGCTCTTTCGCCGACTGCGGCGCTGCTTATAACAAGAACTGGAAACTGCAGGCGCATCTGTGCAAACACACAGGAGAGAAACCATTTCCATGTAAGGAAGAAGGATGTGAGAAAGGCTTTACCTCGCTTCATCACTTAACCCGCCACTCACTCACTCATACTGGCGAGAAAAACTTCACATGTGACTCGGATGGATGTGACTTGAGATTTACTACAAAGGCAAACATGAAGAAGCACTTTAACAGATTCCATAACATCAAGATCTGCGTCTATGTGTGCCATTTTGAGAACTGTGGCAAAGCATTCAAGAAACACAATCAATTAAAGGTTCATCAGTTCAGTCACACACAGCAGCTGCCATACGAATGTCCTCATGAAGGCTGTGACAAGCGGTTTTCTTTGCCTTCCCGTTTAAAACGTCATGAAAAAGTCCATGCAGGCTATCCCTGCAAAAAGGATGATTCTTGCTCATTTGTGGGAAAGACTTGGACATTATACTTGAAACACGTGGCAGAATGCCATCAGGACCTAGCAGTATGTGATGTGTGTAATCGAAAATTCAGGCACAAAGATTACTTGAGGGATCATCAGAAAACTCACGAAAAAGAGCGAACTGTGTATCTCTGCCCTCGAGATGGCTGTGACCGCTCCTATACCACTGCATTCAATCTTAGAAGCCATATACAATCATTTCATGAGGAACAGAGACCTTTTGTTTGTGAGCATGCTGGCTGCGGGAAATGCTTTGCAATGAAAAAAAGCCTAGAAAGACATTCAGTTGTACATGATCCAGAGAAGAGGAAGCTGAAGGAGAAATGCCCTCGCCCAAAGAGAAGCCTGGCCTCTCGCCTCACTGGATACATACCCCCCAAGAGCAAAGAAAAAAATGCATCCGTTTCGGGAACAGAAAAGACTGATTCACTTGTGAAAAATAAGCCCTCTGGCACTGAAACAAATGGCTCATTGGTTCTAGATAAATTAACTATACAATAATATAAGAAAACATTTAAATTTATTTTTTTATTTGTTAAAATTGCCCTCAGGATGGTTAACCCATATTTAGTGTGGGTTTTTTCTTTTTTTACAGCTTTAATTCATTTTTTTTCGGCTATAACAAAAGGAATCTGTTCTAGACGCATGATTTGTTTTATGAACTGCAGTATTGGCCATGCCTACAGGTAAAGGCACAGTGTTAATGGCTACATACCTCTTCTACCCCATGTTTGCTATTAAAAGTGAGGTGCAGCAGCCACTGGTCTGTTTATTTACAATACATTCATTTAGTAAGACTCTGTATTCATTTTCAAAAGAATCACTAAGGGAATGTGCAAAATTGTTATCACTCTACTGTAAACACAA

ATGTACTGCTTGCACCCTGTTGGTGGGGCTTTTTTTGGGGAGGTTGACTGACCCTGTTTTTTTTTTAACGGAATTC

Page 60: Genome Sequences

Rosalind Franklin: The Dark Lady of DNA (1920-1958)

By Brenda Maddox

• Maddox tells her readers that in their Nobel acceptance speeches in 1962, Watson and Crick made no mention of Rosalind Franklin at all. It was only Wilkins who “uttered” Franklin’s name, mentioning her as one of two people (the other being Alex Stokes) who “made very valuable contributions to the X-ray analysis.”

• Watson, Francis Crick, and Maurice Wilkins received a Nobel Prize for their discovery in 1962. Franklin was ignored.

• For more about the story read http://www.humanistperspectives.org/issue151/books.html

Sodium deoxyribose nucleate from calf thymus, Structure B, Photo 51, taken by Rosalind Franklin and R G Gosling, 2 May 1952, with Linus Pauling’s holographic annotations to the right of the photo. This photo shows the double-helix structure of DNA with a separation of 20 Å.

Page 61: Genome Sequences

Discovery of the double helix structure of DNA

The discovery is based on three pieces of work

1. Chargaff’s rule (discovered in 1949)
• Chargaff: an Austrian-American biochemist
• total number of A = total number of T, and total number of C = total number of G

2. Linus Pauling discovered the alpha-helix structure of protein

3. X-ray diffraction pattern of crystal
– Done by Rosalind Franklin
– Crystal X-ray diffraction: by William Henry Bragg and his son William Lawrence Bragg

http://www.virtualsciencefair.org/2004/mcgo4s0/public_html/t2/dna.html

http://post.queensu.ca/~forsdyke/bioinfo1.htm

Erwin Chargaff (1905 -2002)

Page 62: Genome Sequences

Discovery of the double helix structure of DNA

Linus Pauling

• Nobel Prize in Chemistry in 1954

• Nobel Peace Prize for 1962 (awarded in 1963)

• championed the use of vitamin C

• lived to be 93 (he died in 1994)

• nature was a marvelous contrivance composed of molecules assembled by the Great Mechanic

• http://www.utoronto.ca/jpolanyi/public_affairs/public_affairs4i.html

• Watson and Crick conjectured that DNA is made up of two, three or four helices, but their model did not fit the X-ray data.

• They stopped their work for almost one year

• Crick continued his Ph.D. thesis; Watson worked on tobacco virus research

• Pauling likewise proposed a three-helix model of DNA.

• They were wrong, because they did not yet know the complementary principle.

Page 63: Genome Sequences

Discovery of the double helix structure of DNA

• Watson and Crick also proposed that DNA is made up of two identical helices, with the same nucleotide on the opposite helix → wrong

• Jerry Donohue, a visiting chemist at Cambridge from Caltech, asserted that the DNA bases ought to be in the keto form, and not in the enol form that the textbooks of the day showed.

• Armed now with the memory of Franklin’s clear photograph 51, this next-to-last step in the emergence of the final model was absolutely crucial.

Donohue (1920 – 1985)

Page 64: Genome Sequences

Discovery of the double helix structure of DNA

• “‘The point was important,’ [Crick] said, ‘because if the unit cell is strictly C2, one must have the DNA chains in pairs, running in opposite directions.’”

• This scientific point was crucial for Watson and Crick. In separate papers published that same year, Franklin had said that “C2 is the only space group possible.” Why, Maddox wonders, had Watson or Crick failed to mention the importance of this in either of their Nature papers of 1953?

• Maurice Wilkins: a physicist, he worked with John Randall in the late 1930s on the development of radar, moving to the USA during World War II to work on the Manhattan Project. After the war he joined Randall at King's College London and with Rosalind Franklin began an investigation into the structure of DNA.

Watson (1928-) and Crick (1916-2004)

Maurice Wilkins (1916-2004)

Page 65: Genome Sequences

Diffraction of X ray by crystal

• Max von Laue was awarded the Nobel Prize in Physics in 1914 "for his discovery of the diffraction of X-rays by crystals". His collaborators Walter Friedrich and Paul Knipping took the picture on the right in 1912.

• http://cxpi.spme.monash.edu.au/xray_history.htm

Max von Laue (1879-1960)

A beam of X-rays is scattered into a characteristic pattern by a crystal. In this case it is copper sulphate.

Page 66: Genome Sequences

Diffraction of X ray by crystal

• Sir William Lawrence Bragg, Australian-born British physicist, won the Nobel Prize (1915) with his father William Henry Bragg "for their services in the analysis of crystal structure by means of X-rays", when he was only 25 years old.

William Henry Bragg (1862-1942)

William Lawrence Bragg (1890-1971)

Bragg’s law of diffraction
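For reference, Bragg's law in its standard textbook form; the symbols below are the conventional ones and are not taken from the slide, which shows only a figure here.

$$ n\lambda = 2d\sin\theta $$

where d is the spacing between adjacent crystal planes, θ is the angle between the incident beam and those planes, λ is the X-ray wavelength, and n is a positive integer (the order of diffraction). Constructive interference, and hence a diffraction spot, occurs only when this condition is met.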