Computational Skills for Biology Research, Lecture 13

Computational Skill for Modern Biology Research

Department of Biology, Chungbuk National University

13th Lecture 2015.12.8

Data Mining

Syllabus

Week / Topic

Week 1: Introduction: Why do we need to learn this stuff?

Week 2: Basics of Unix and running BLAST on your PC

Week 3: Unix Command Prompt II and shell scripts

Week 4: Basics of programming (Python programming)

Week 5: Python Scripting II and sequence manipulations

Week 6: IPython Notebook and Pandas

Week 7: Basics of Next-Generation Sequencing and Tutorial

Week 8

Week 9: Next Generation Sequencing Analysis I

Week 10: Next Generation Sequencing Analysis II

Week 11: Next Generation Sequencing Analysis III

Week 12: Next Generation Sequencing Analysis IV

Week 13: Structural Predictions

Week 14

‘Parts’

Complex machine..

‘Parts’

A complex machine consists of various parts

A cell is a very complicated ‘machine’

It consists of biological macromolecules

As with a machine, we examine the structure of its parts in detail

Structure of ‘Parts’

Structure of Biological Components

Understanding structure of Protein (or RNA)

- Understanding the structure of the ‘components of life’ is the first step toward understanding life at the molecular level

• More than 20,000 proteins are encoded in the human genome

• We need to understand their structures and their interactions

- The Human Genome Project determined the human DNA sequence

• So we have sequence information for most human (and other organisms') proteins

• In other words, we have the ‘component list’

• Now we need detailed structural information for these components

Primary Structure of Proteins

Amino acid sequence of a protein = primary structure of the protein

Secondary Structure of Proteins

Tertiary Structure of Proteins

Quaternary Structure of Proteins

Protein sequence -> structure -> function

The sequence of a protein determines its structure, and the structure of a protein determines its function.

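Determination of Protein Structure by Experimental Methods

X-ray Crystallography
- High-resolution structure (up to 1-2 Å)
- A snapshot of the protein in its crystallized state
- The protein must be crystallized
- Protein complexes and large macromolecules are also feasible

NMR (Nuclear Magnetic Resonance)
- Medium-resolution structure
- An ensemble of the dynamic motions of the protein in solution
- The protein must stay stable at high concentration in solution
- Difficult for proteins larger than 20-30 kDa

Electron Microscopy
- Low-resolution structure
- Direct observation of large protein complexes
- Difficult unless the sample is a large complex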

Protein Structure Determination by X-ray Crystallography

Protein Production

- You need enough (5-10 mg) pure (at least 95% purity) protein

- Overexpression (bacteria, insect cells, or mammalian cells) or a natural source

- Purification

Crystallization

- Concentrate the protein (at least 5 mg/ml)

- Crystallization happens at the boundary between the soluble state and precipitation

Strong X-rays generated by a synchrotron are essential

Raw data: diffraction images from protein crystals

Computer Analysis

Calculation of electron density

~100-300 individual images

Final Structure and Interpretation

Do you need to know how to solve a protein structure experimentally?

- Most likely not.

What about knowing what the structure of your protein of interest looks like?

- That you do need to know: http://www.rcsb.org

In the old days, you needed a very expensive workstation-class computer to visualize a protein structure.

Not anymore. A cheap PC or even your smartphone can do that.

Protein Visualization Software

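PyMOL: http://www.pymol.org

How to Represent Proteins

For small molecules,

a space-filling model like this works, but…

for proteins

??????

a simpler representation is needed.

Line

Not suitable for showing a whole protein

Suitable for a zoomed-in view of one region

Ribbon

Good for showing the overall outline of a protein

Cartoon

Alpha-Helix

Beta-Sheet

Good for showing the secondary structure of a protein

Surface

Surface with Charge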

DEMO

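PyMOL, RCSB PDB

We want to know the structure of a protein of interest without doing experiments

Tertiary structure prediction
- Without referencing known protein structures: Ab initio modeling
- Referencing known protein structures: Homology modeling

Secondary structure / other predictions
- Secondary structure prediction
- Coiled-coil prediction
- Membrane topology prediction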

GFCHIKAYTRLIMVG…

Anabaena 7120

Anacystis nidulans

Condrus crispus

Desulfovibrio vulgaris

Secondary Structure Prediction of Proteins

Primary structure of a protein (sequence) -> predicted secondary structure

Alpha helix? Beta-sheet? Loop?

Secondary Structure Preferences of Amino Acids

- These amino acids prefer the alpha helix:
Ala, Leu, Met, Glu, Gln, His, Lys, Arg

- These amino acids prefer the beta-sheet because of their bulky side chains:
Tyr, Trp, Phe, Ile, Val, Thr, Cys

- These amino acids tend to break secondary structure:
Gly, Pro, Asp, Asn, Ser
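The three preference groups can be turned into a crude per-residue labeling. The following is only a minimal sketch: it labels each residue by the group it belongs to and is not a real predictor. The function name and the label letters are illustrative choices, and the helix set includes Gln (Q), which is an assumption about the slide's intent.

```python
# One-letter codes for the three groups listed on the slide.
# Assumption: Gln (Q) is taken to be one of the helix formers.
HELIX_FORMERS = set("ALMEQHKR")  # Ala, Leu, Met, Glu, Gln, His, Lys, Arg
SHEET_FORMERS = set("YWFIVTC")   # Tyr, Trp, Phe, Ile, Val, Thr, Cys
BREAKERS = set("GPDNS")          # Gly, Pro, Asp, Asn, Ser


def label_residues(sequence):
    """Label each residue: H (helix former), E (sheet former),
    B (breaker), or '-' for anything not in the three groups."""
    labels = []
    for residue in sequence.upper():
        if residue in HELIX_FORMERS:
            labels.append("H")
        elif residue in SHEET_FORMERS:
            labels.append("E")
        elif residue in BREAKERS:
            labels.append("B")
        else:
            labels.append("-")
    return "".join(labels)


if __name__ == "__main__":
    fragment = "GFCHIKAYTRLIMVG"  # fragment shown on the slides
    print(fragment)
    print(label_residues(fragment))
```

Real predictors look at stretches of residues and at conservation across homologs rather than at single residues, so this labeling is best read as a way to see the preferences, not as a prediction.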

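Secondary Structure Prediction from an MSA

Alpha Helix

Conservation at residues i, i+3, i+4, i+7

or at residues i, i+4, i+7

Hydrophobic and hydrophilic residues alternate

Beta-Sheet

A run of consecutive conserved hydrophobic residues: a beta-sheet buried inside the protein

Conserved hydrophobic residues at i, i+2, i+4: a beta-sheet on the protein surface

Secondary Structure Prediction from an MSA: Loop/Disordered Region

- Usually not well conserved
- Presence of secondary structure breakers (P, G)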
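The spacing rules for helices and surface strands can be checked mechanically. The sketch below is a simplification: it scans a single sequence for hydrophobic residues at the stated offsets, whereas the slides describe conservation across an MSA. The hydrophobic residue set and the function names are assumptions made for illustration.

```python
# Assumption: this hydrophobic set is an illustrative choice, not from the slides.
HYDROPHOBIC = set("AVILMFWC")

HELIX_OFFSETS = (0, 3, 4, 7)        # i, i+3, i+4, i+7
SURFACE_STRAND_OFFSETS = (0, 2, 4)  # i, i+2, i+4


def matches_pattern(sequence, start, offsets):
    """True if every position start+offset exists and is hydrophobic."""
    positions = [start + offset for offset in offsets]
    if max(positions) >= len(sequence):
        return False
    return all(sequence[p] in HYDROPHOBIC for p in positions)


def find_pattern_starts(sequence, offsets):
    """Return every start index i where the offset pattern is satisfied."""
    seq = sequence.upper()
    return [i for i in range(len(seq)) if matches_pattern(seq, i, offsets)]


if __name__ == "__main__":
    print(find_pattern_starts("LKKLLKKL", HELIX_OFFSETS))
    print(find_pattern_starts("VKVKV", SURFACE_STRAND_OFFSETS))
```

To follow the slides more closely, the per-position test would be replaced by a check that the column of the alignment is conserved and hydrophobic.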

Secondary Structure Prediction by Machine Learning

What is machine learning?

http://www.crazymind.net/28

Prediction of Secondary Structure based on machine learning

Using MSAs of proteins with known secondary structure as training sets, generate structure prediction models.

Examples of Secondary Structure Predictions

Jpred: http://www.compbio.dundee.ac.uk/www-jpred/index.html

Jpred3

Enter the sequence to be predicted

BLAST search in UniRef90 (retrieve homologous sequences and use them to build an MSA)

Predict the secondary structure with multiple algorithms and make the final call from their consensus
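The consensus step can be pictured as a per-residue vote. This is only a sketch of simple majority voting over prediction strings of equal length. It is not a description of how Jpred itself combines its predictors, and the function name and the H/E/- alphabet are illustrative assumptions.

```python
from collections import Counter


def consensus_prediction(predictions):
    """Majority vote per residue over equal-length prediction strings.

    Ties go to the state that appears first in that column.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")

    consensus = []
    for column in zip(*predictions):
        state, _count = Counter(column).most_common(1)[0]
        consensus.append(state)
    return "".join(consensus)


if __name__ == "__main__":
    # Three hypothetical predictor outputs for a four-residue stretch.
    print(consensus_prediction(["HHE-", "HEE-", "HHEE"]))
```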

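First, check whether the protein is already in the protein structure DB. If the tertiary structure of the identical protein has already been solved, there is no need to predict its secondary structure.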

http://www.compbio.dundee.ac.uk/www-jpred/results/jp_q1hwsUv/jp_q1hwsUv.results.html

Secondary Structure Prediction

Confidence for predictions

Alpha-Helix Beta-Sheet

Other Predictions

- Coiled-Coil Prediction

Namgoong et al., Nature Struct Mol Biol. 2011

Coiled-Coil Prediction

http://toolkit.tuebingen.mpg.de/pcoils

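Membrane-Spanning Region Prediction

Membrane protein

[Figure: membrane protein with its hydrophilic and hydrophobic regions labeled]

Membrane spanning region predictions

* The membrane-spanning segments are expected to contain relatively more hydrophobic amino acids.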

Hydropathy plot

>sp|P08908|5HT1A_HUMAN 5-hydroxytryptamine receptor 1A OS=Homo sapiens GN=HTR1A PE=1 SV=3
MDVLSPGQGNNTTSPPAPFETGGNTTGISDVTVSYQVITSLLLGTLIFCAVLGNACVVAAIALERSLQNVANYLIGSLAVTDLMVSVLVLPMAALYQVLNKWTLGQVTCDLFIALDVLCCTSSILHLCAIALDRYWAITDPIDYVNKRTPRRAAALISLTWLIGFLISIPPMLGWRTPEDRSDPDACTISKDHGYTIYSTFGAFYIPLLLMLVLYGRIFRAARFRIRKTVKKVEKTGADTRHGASPAPQPKKSVNGESGSRNWRLGVESKAGGALCANGAVRQGDDGAALEVIEVHRVGNSKEHLPLPSEAGPTPCAPASFERKNERNAEAKRKMALARERKTVKTLGIIMGTFILCWLPFFIVALVLPFCESSCHMPTLLGAIINWLGYSNSLLNPVIYAYFNKDFQNAFKKIIKCKFCRQ

Membrane protein sequence

Quantify how hydrophobic each amino acid is (positive values: hydrophobic, negative values: hydrophilic)

Convert the sequence to numbers

1.9,-3.5,4.2,3.8,-0.9,-1.6,-0.4,-3.5,-0.4,-3.5,-3.5,-0.7,-0.7,-0.9,-1.6,-1.6,1.8,-1.6,2.8,-3.5,-0.7,-0.4,-0.4,-3.5,-0.7,-0.7,-0.4,4.5,-0.9,-3.5,4.2,-0.7,4.2,-0.9,-1.3,-3.5,4.2,4.5,-0.7,-0.9,3.8,3.8,3.8,-0.4,-0.7,3.8,4.5,2.8,2.5,1.8,4.2,3.8,-0.4…

Average the values in windows of 10
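The two steps of the hydropathy plot, converting residues to numbers and averaging over a window, can be sketched as follows. The table is the published Kyte-Doolittle scale, which is an assumption about what the slides used: the slide values match it closely, although serine appears as -0.9 on the slides and is -0.8 in the published scale. The function names and the choice of a sliding (overlapping) window are also assumptions.

```python
# Kyte-Doolittle hydropathy scale (assumed to be the scale behind the slides).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_values(sequence):
    """Convert a protein sequence into a list of hydropathy values."""
    return [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]


def window_averages(values, window=10):
    """Average each run of `window` consecutive values (sliding by one)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # N-terminal stretch of the 5HT1A_HUMAN sequence from the slides.
    n_terminus = "MDVLSPGQGNNTTSPPAPFETGGNTTGISDVTVSYQVITSLLLGTLIFCAVLG"
    averages = window_averages(hydropathy_values(n_terminus), window=10)
    for start, average in enumerate(averages, start=1):
        print(start, round(average, 2))
```

Plotting the averages against the window start position gives the hydropathy plot. Stretches where the average stays high are candidates for membrane-spanning segments.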

DEMO

Jpred, Coils, Hydropathy plot

Prediction of the Three-Dimensional Structure of Proteins

Referencing previously known structures: Homology modeling

Not referencing previously known structures: Ab initio modeling

Ab initio Modeling

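Ab initio: “from the beginning”. Predict the protein structure from its sequence on the basis of physicochemical principles, without referencing any experimentally determined protein structure.

Anfinsen’s experiments (1973)

- Treatment with urea + mercaptoethanol destroys the three-dimensional structure of the protein
- The three-dimensional structure of the recovered protein can return to its original state
- All the information that determines the tertiary structure of a protein is in its sequence!

Therefore, the three-dimensional structure of a protein can be predicted from its sequence alone!

Ab initio modeling

A protein sits in its thermodynamically most stable state. So if a physical and chemical simulation finds the conformation with the most stable energy -> that is the tertiary structure of the protein!

Reality is not that simple. The RNase A that Anfinsen used is an exceptionally stable protein; most proteins are hard to recover once their tertiary structure has been denatured.

Aggregates that are more stable than the native protein exist

Ab initio modeling

So in practice it is hard to use for predicting protein structures accurately

Model the structure of an unknown protein using structural information from experimentally solved proteins with similar sequences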

Homology Modeling

Template-Based Modeling

Homology Modeling

Protein structure is more conserved than sequence

Identity = 4.7%

RMSD = 3.99

Use this to infer the structure of a protein whose structure is not known!
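RMSD (root-mean-square deviation) measures how far apart equivalent atoms are in two structures. The sketch below assumes that the two structures are already superimposed and that the atoms are already paired one to one. Real comparison tools first compute an optimal superposition, which is not shown here. The function name is an illustrative choice.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z).

    Assumes the structures are already superimposed and the atoms paired.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("both structures need the same number of atoms")
    if not coords_a:
        raise ValueError("at least one atom pair is required")

    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Two made-up two-atom "structures" that differ in one atom by 2 units.
    print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)]))
```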

Steps in Homology Modeling

1. Search a sequence database of proteins with known structure
- PSI-BLAST
- HHpred

2. Among the highly homologous hits, select the highest-quality structure (Template Selection)

3. Align the sequence of the known structure with the unknown sequence

4. Modeling

5. Loop Modeling

6. Model Assessment

Search Protein Structure

1. PSI-BLAST using PDB blast db

2. HHpred

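How much homology does homology modeling require?

~ Roughly 30% sequence identity is needed

Is the structure with the highest sequence homology the best modeling template?

Template 1: 93% id, 3.5 Å vs Template 2: 90% id, 1.5 Å

Template Selection

Choose a high-resolution structure whenever possible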

[Figure: comparison at 4 Å, 3 Å, 2 Å, and 1 Å resolution]

NMR or X-ray Crystallography?

http://www.cbs.dtu.dk/courses/27614/Lectures/TBlicher_Homology_Modelling.ppt


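When an NMR structure is the only choice..

An NMR structure usually comes as multiple models with subtle differences between them (Ensemble)

It is practical to remove the highly variable parts of the structure and keep only the well-defined parts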

Alignment and Modeling

>gi|6513841|gb|AAD01939.2| homeobox protein HOXA7 [Homo sapiens]
MSSSYYVNALFSKYTAGTSLFQNAEPTSCSFAPNSQRSGYGAGAGAFASTVPGLYNVNSPLYQSPFASGYGLGADAYGNLPCASYDQNIPGLCSDLAKGACDKTDEGALHGAAEANFRIYPWMRSSGPDRKRGRQTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALCLTERQVKIWFQNRRMKWKKEHKDEGPTAAAAPEGAVPSAAATAAADKADEEDDDEEEEDEEE

>gi|34398398|gb|AAQ67266.1| antennapedia [Drosophila virilis]
MTMSTNNCESMTSYFTNSYMGADMHHGHYPGNGVTDLDAQQMHHYSQNPNQQGNMPYPRFPPYDRMPYYNGQGMDQQQQQHQGYSRPDSPSSQVGGVMPQAQTNGQLVSVAQQQQQTQQQQQAQTQQQQAQQAPLQQQQHPQVTQQVTHPQQQQPVVYASCKLQAAVGGLGMVQEGGSPPLVDQMGGHHMNAQMTLPHHMGHPQAQLGYTDVGVPDVTEVHQNHHNMGMYGQQQTGVPPVVAPPQAMMHPGAGQGPPQMHQGHPGQHTPPSQNPSSQSSGMPSPLYPWMRSQFGKCQERKRGRQTYTRYQTLELEKEFHFNRYLTRRRRIEIAHALCLTERQIKIWFQNRRMKWKKENKTKGEPGSGGEGDEITPPNSPQ

119 IYPWMRS---SGPDRKRGRQTYTRYQTLELEKEFHFNRYLTRRRRIEIAH 165
    :||||||   ...:||||||||||||||||||||||||||||||||||||
285 LYPWMRSQFGKCQERKRGRQTYTRYQTLELEKEFHFNRYLTRRRRIEIAH 334

166 ALCLTERQVKIWFQNRRMKWKKEHKDEG 193
    ||||||||:||||||||||||||:|.:|
335 ALCLTERQIKIWFQNRRMKWKKENKTKG 362
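The number of ‘|’ marks in such an alignment is what a percent identity figure summarizes. The sketch below counts identical columns in two aligned strings. Definitions of percent identity differ in what they divide by. Dividing by the full alignment length, gap columns included, is the assumption made here, and the function name is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the full alignment length, gaps included.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")

    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * identical / len(aligned_a)


if __name__ == "__main__":
    # Second alignment block from the slide (HOXA7 166-193 vs. Antp 335-362).
    hoxa7_block = "ALCLTERQVKIWFQNRRMKWKKEHKDEG"
    antp_block = "ALCLTERQIKIWFQNRRMKWKKENKTKG"
    print(round(percent_identity(hoxa7_block, antp_block), 1))
```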

?

Homology Modeling Tool

- Swiss-Model: http://swissmodel.expasy.org

- Modeller: https://salilab.org/modeller/about_modeller.html

- HHpred: http://toolkit.tuebingen.mpg.de/hhpred

Swiss-Model

Swiss-Model : Search Template

Transcriptome Analysis

1. Select one of the datasets

- Fan et al., Single-cell RNA-seq transcriptome analysis of linear and circular RNAs in mouse preimplantation embryos
http://www.genomebiology.com/2015/16/1/148

- Pauli et al., Systematic identification of long noncoding RNAs expressed during zebrafish embryogenesis
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3290793/

- Klijn et al., A comprehensive transcriptional portrait of human cancer cell lines
http://www.nature.com/nbt/journal/v33/n3/abs/nbt.3080.html

Select 3-4 defined stages or tissues

Data link: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32898

2. Using Kallisto, quantify each transcript in the RNA-Seq data

3. Using Sleuth, analyze gene expression and produce a list of differentially expressed genes

4. Generate a ‘HeatMap’ for the top 100 differentially expressed genes (ranked by p value) and generate a ‘ClusterMap’

5. From the differentially expressed gene list, find your genes of interest using GO annotations, for example:

- Wnt signaling related genes
- Transcription factors
- Actin cytoskeleton related genes
- Pluripotency related genes
- Splicing related genes
- Etc.

6. Find the expression levels (TPM) of these genes, filter them by expression and differential expression (e.g. TPM > 10 and p value < 1e-10), and generate a ClusterMap

7. Using the Ensembl transcript IDs, find the protein sequences of these genes and generate a multi-FASTA file and

8. Write a short ‘Discussion’ describing what kind of biological reasoning can be drawn from these ‘dry experiments’
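Steps 6 and 7 come down to a filter and a file-format conversion. The sketch below works on plain Python data and makes several assumptions: each transcript is a dictionary with hypothetical keys `target_id`, `tpm`, and `pval`, a small p value is what marks a significant transcript, and the protein sequences have already been looked up. The identifiers and sequences in the usage example are made up.

```python
def filter_transcripts(rows, min_tpm=10.0, max_pval=1e-10):
    """Keep rows that are both well expressed and strongly significant."""
    return [
        row for row in rows
        if row["tpm"] > min_tpm and row["pval"] < max_pval
    ]


def to_multifasta(records, width=60):
    """Format (identifier, sequence) pairs as multi-FASTA text."""
    lines = []
    for identifier, sequence in records:
        lines.append(">" + identifier)
        # Wrap each sequence at `width` characters per line.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up rows standing in for a merged quantification/test table.
    rows = [
        {"target_id": "T1", "tpm": 50.0, "pval": 1e-12},
        {"target_id": "T2", "tpm": 5.0, "pval": 1e-12},
        {"target_id": "T3", "tpm": 50.0, "pval": 0.01},
    ]
    # Made-up protein sequences keyed by transcript ID.
    proteins = {"T1": "MKVLAA", "T2": "MSTNN", "T3": "MDVLSP"}

    kept = filter_transcripts(rows)
    records = [(row["target_id"], proteins[row["target_id"]]) for row in kept]
    print(to_multifasta(records), end="")
```

The thresholds are parameters so that the same functions can be reused with other cutoffs.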
