Computer Skills for Biology Research, Lecture 5


Upload: suk-namgoong

Post on 13-Feb-2017


Page 1

Computational Skill for Modern Biology Research

Department of Biology, Chungbuk National University

5th Lecture, 2015.10.6

Basics of Programming, Python Scripting & Sequence Manipulation

Page 2

Syllabus

Week     Topic
1        Introduction: Why do we need to learn this stuff?
2        Basics of Unix and running BLAST on your PC
3        Unix Command Prompt II and shell scripts
4        Basics of programming (Python programming)
5        Python Scripting II and sequence manipulations
6        Python Scripting III and Biopython
7        Python Scripting IV and
8        Next Generation Sequencing
9, 10    Next Generation Sequencing Analysis
11       R and statistical analysis
12       Bioconductor I
13       Bioconductor II
14       Network analysis

Page 3

Utilization of Genome Sequencing Data

Most genome sequences are archived at ftp://ftp.ncbi.nlm.nih.gov/genomes.

Today, we will learn how to extract the desired information from them (and the Python needed to do so).

Page 4

ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/Saccharomyces_cerevisiae_uid128/

Page 5

Description of each file type

.asn - nucleotide record in asn.1 format

.faa - protein sequences in fasta format

.ffn - nucleotide sequences of CDS features in fasta format

.fna - total nucleotide sequence in fasta format

.frn - nucleotide sequences of structural RNAs in fasta format

.gbk - full Genbank flat file format

.gff - feature annotation in GFF3 format

.ptt - protein table

.rnt - structural RNA table

.rpt - report file

.val - binary file

Page 6

GFF file: contains genome annotations, with the columns separated by <TAB>

The columns include the type of region, the start, and the end.

We will learn how to extract the desired information from them.

Page 7

Download multiple files

We used curl to download files from the internet.

But for downloading multiple files, wget is a more efficient option.

On a Mac, download and install wget: http://rudix.org/packages/wget.html

On Linux, wget is already installed in most cases.

Page 8

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/Saccharomyces_cerevisiae_uid128/*.faa

This downloads all of the faa files in that directory.

Download the fna, ffn, frn, gff, and ptt files using wget!

Page 9

What we will learn today

Read annotation (gff) files and store the annotation data in a Python dictionary

Using the stored annotation data, search for a desired gene and find its location

Based on a list of genes, extract the desired sequences from the genome fasta file

Page 10

First, we need to read the annotation file

A gff file is just a text file. Once it is read, we extract only the desired information (parsing). We start by modifying a script from the last lecture.

    #!/usr/bin/python

    import sys

    for filename in sys.argv[1:]:
        f = open(filename, 'r')
        content = f.readlines()
        f.close()    # the parentheses are needed; f.close alone does not close the file

        for line in content:
            print(line)

Save it as gff.py, then run chmod +x gff.py

Page 11

./gff.py *.gff

This does the same thing as cat *.gff

    #!/usr/bin/python

    import sys

    for filename in sys.argv[1:]:
        f = open(filename, 'r')
        content = f.readlines()
        f.close()

        for line in content:
            print(line)    # <- instead of just printing line, we will extract the information held in the variable line

Page 12

Parsing in Python

We need to skip the lines that start with #. How can we skip them?

    if line[0] != '#':
        print(line)

If the first character of line (line[0]) is not (!=) '#', process the line.

!= means "is not equal to"; == means "is equal to".
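The comment-skipping test can be sketched as follows, written for Python 3. The sample lines are made up for illustration, and str.startswith is used instead of line[0] as one possible variation, since it also copes with an empty string.

```python
# Hypothetical sample lines standing in for the content of a gff file.
content = [
    "##gff-version 3\n",
    "#!processor NCBI annotwriter\n",
    "NC_001224.1\tRefSeq\tregion\t1\t85779\t.\t+\t.\tID=id0\n",
]

kept = []
for line in content:
    # Skip header and comment lines, which begin with '#'.
    if line.startswith('#'):
        continue
    kept.append(line)

print(len(kept))
```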

Page 13

Split string

split(): splits the content of a string and stores the pieces in a list

line:

    NC_001224.1<tab>RefSeq<tab>region<tab>1<tab>85779<tab>.<tab>+<tab>.<tab>(attributes)

    separate = line.strip().split('\t')

Splitting on the tab character keeps the last column in one piece even when it contains spaces.

    separate[0]    NC_001224.1
    separate[1]    RefSeq
    separate[2]    region
    separate[3]    1
    separate[4]    85779
    separate[5]    .
    separate[6]    +
    separate[7]    .
    separate[8]    the attributes column (ID=...;Name=...;...)

Store some of them in separate variables

    name = separate[0]
    classification = separate[2]
    start = separate[3]
    end = separate[4]
    id = separate[8]
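As a small worked example of the splitting step, the sketch below splits one annotation line on tabs. The line is assembled from the example values on the slides; converting start and end to integers is an addition here, assumed to be useful for later arithmetic.

```python
# One annotation line, built from the example values shown on the slides.
line = ("NC_001224.1\tRefSeq\tregion\t1\t85779\t.\t+\t.\t"
        "ID=id0;Dbxref=taxon:559292;Is_circular=true;gbkey=Src\n")

# strip() removes the trailing newline; split('\t') cuts at each tab.
separate = line.strip().split('\t')

name = separate[0]
classification = separate[2]
start = int(separate[3])   # convert text to a number
end = int(separate[4])
attributes = separate[8]

print(name, classification, start, end)
```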

Page 14

    id = separate[8]

Split one more time

    ID=cds18;Name=NP_009328.1;Parent=rna43;Dbxref=SGD:S000007283,Genbank:NP_009328.1,GeneID:854627;gbkey=CDS;product=cytochrome c oxidase subunit 3;protein_id=NP_009328.1;transl_table=3

We need to separate the content of id again. This time, we separate it at each ';'.

    items = id.split(';')

    items[0]    'ID=cds18'
    items[1]    'Name=NP_009328.1'
    items[2]    'Parent=rna43'

Page 15

Store them in a dictionary

    items[0]    'ID=cds18'
    items[1]    'Name=NP_009328.1'
    items[2]    'Parent=rna43'

Split again using '='

    temp = items[0].split('=')

    temp[0] is 'ID'
    temp[1] is 'cds18'

    keys = {}                  # create an empty dictionary
    keys[temp[0]] = temp[1]    # keys['ID'] is now 'cds18'

    Key       Value
    ID        cds18
    Name      NP_009328.1
    Parent    rna43


Page 17

    items = id.split(';')
    keys = {}
    for item in items:
        if '=' in item:    # skip empty pieces, such as the one after a trailing ';'
            temp = item.split('=')
            keys[temp[0]] = temp[1]

id:

    ID=cds18;Name=NP_009328.1;Parent=rna43;Dbxref=SGD:S000007283,Genbank:NP_009328.1,GeneID:854627;gbkey=CDS;product=cytochrome c oxidase subunit 3;protein_id=NP_009328.1;transl_table=3

id.split(';') gives items:

    ID=cds18
    Name=NP_009328.1
    Parent=rna43
    Dbxref=SGD:S000007283,Genbank:NP_009328.1,GeneID:854627
    gbkey=CDS
    product=cytochrome c oxidase subunit 3
    protein_id=NP_009328.1
    transl_table=3

item.split('=') on each item fills keys:

    Key             Value
    ID              cds18
    Name            NP_009328.1
    Parent          rna43
    Dbxref          SGD:S000007283,Genbank:NP_009328.1,GeneID:854627
    gbkey           CDS
    product         cytochrome c oxidase subunit 3
    protein_id      NP_009328.1
    transl_table    3
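The two-stage split can be wrapped in a small helper, sketched below. The function name parse_attributes is hypothetical, and splitting only at the first '=' is a defensive choice made here in case a value itself contains '='.

```python
def parse_attributes(text):
    """Turn 'key=value;key=value' text into a dictionary (hypothetical helper)."""
    keys = {}
    for item in text.split(';'):
        if '=' not in item:
            continue                      # skip empty pieces, e.g. after a trailing ';'
        key, value = item.split('=', 1)   # split only at the first '='
        keys[key] = value
    return keys

attrs = parse_attributes(
    "ID=cds18;Name=NP_009328.1;Parent=rna43;gbkey=CDS;"
    "product=cytochrome c oxidase subunit 3;transl_table=3"
)
print(attrs['product'])
```

Calling parse_attributes on the ninth column of any annotation line gives a dictionary that can be queried by key.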

Page 18

Retrieve from dictionary

    product = keys['product']

keys:

    Key             Value
    ID              cds18
    Name            NP_009328.1
    Parent          rna43
    Dbxref          SGD:S000007283,Genbank:NP_009328.1,GeneID:854627
    gbkey           CDS
    product         cytochrome c oxidase subunit 3
    protein_id      NP_009328.1
    transl_table    3

Sometimes there is no 'product' item in keys. Before reading from a dictionary, you have to check whether the key is in the dictionary.

    if 'product' in keys:
        product = keys['product']
    else:
        product = ''

If 'product' is in keys, get the value stored under the key 'product' and put it in the variable product. Otherwise, product becomes an empty string.
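The same check can also be written in one line with the dictionary method get(), which returns a default when the key is missing. This is an alternative shown for comparison, not the form used on the slide; the small dictionary is made up for illustration.

```python
keys = {'ID': 'cds18', 'Name': 'NP_009328.1'}

# get() returns its second argument when the key is not in the dictionary.
product = keys.get('product', '')
name = keys.get('Name', '')

print(repr(product), repr(name))
```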

Page 19

Put all of them together

https://gist.github.com/anonymous/fc1ad3ff14a0e0298eca

The script prints a record only if its classification is "mRNA".

You can change this condition if you want to extract "CDS" instead.
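The steps so far can be combined roughly as below. This is a sketch under stated assumptions and not the content of the linked gist: the function name parse_gff_lines, the wanted parameter, and the sample lines (including their ID values) are all made up for illustration.

```python
def parse_gff_lines(lines, wanted="mRNA"):
    """Yield (name, classification, start, end, product) for matching rows."""
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue                          # skip comments and blank lines
        separate = line.rstrip('\n').split('\t')
        if len(separate) < 9:
            continue                          # not a complete annotation line
        classification = separate[2]
        if classification != wanted:
            continue
        keys = {}
        for item in separate[8].split(';'):
            if '=' in item:
                key, value = item.split('=', 1)
                keys[key] = value
        yield (separate[0], classification, separate[3], separate[4],
               keys.get('product', ''))

# Hypothetical sample input standing in for a gff file.
sample = [
    "##gff-version 3\n",
    "NC_001148.4\tRefSeq\tgene\t939992\t941136\t.\t+\t.\tID=gene1;Name=ARR3\n",
    "NC_001148.4\tRefSeq\tmRNA\t939992\t941136\t.\t+\t.\tID=rna1;product=Arr3p\n",
]
rows = list(parse_gff_lines(sample))
for row in rows:
    print(*row)
```

Passing wanted="CDS" (or any other feature type) changes which rows are kept.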

Page 20

Execute it

Download the script and change its permission:

    chmod +x gff2.py
    ./gff2.py *.gff    (processes all of the gff files in the directory)

Page 21

Save results as file

First, open a file for saving and assign the file handle to save. Instead of 'r', we use the 'w' flag for "write".

    save = open('mRNA.txt', 'w')

Then change the print call like this:

    print(filename, classification, start, end, product, name, note, gene, file=save)

(In Python 2, this form of print requires from __future__ import print_function at the top of the script.)

At the end of the script, close the save file:

    save.close()

https://gist.github.com/anonymous/b7ded2ebed8f94f69798
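A minimal Python 3 sketch of the saving step follows. It uses a with block, which closes the file automatically, as an alternative to calling close() by hand. The rows and the temporary directory are assumptions made only so that the example is self-contained.

```python
import os
import tempfile

# Hypothetical extracted rows, using values from the lecture's examples.
rows = [
    ("NC_001148.4", "mRNA", "939992", "941136", "Arr3p"),
    ("NC_001148.4", "mRNA", "943032", "943896", "Hypothetical"),
]

# A temporary directory keeps this demonstration out of the working
# directory; in a real script 'mRNA.txt' alone would do.
outpath = os.path.join(tempfile.mkdtemp(), "mRNA.txt")

# The with block closes the file even if an error occurs inside it.
with open(outpath, "w") as save:
    for row in rows:
        print(*row, sep="\t", file=save)

# Read the file back to confirm what was written.
with open(outpath) as check:
    saved_lines = check.read().splitlines()
```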

Page 22

Storing extracted data in memory

In the previous examples, we extracted the desired data and printed it (or saved it to a file).

But in some cases, we need to store the data in memory and use it later.

We extract data with these fields:

    Filename
    Classification
    Start
    End
    Product
    Name
    Note

Page 23

    Filename: NC_001148.4
    Classification: mRNA
    Start: 939992
    End: 941136
    Product: Arr3p
    Name: NM_001184298.1
    Note:

    Filename: NC_001148.4
    Classification: mRNA
    Start: 943032
    End: 943896
    Product: Hypothetical
    Name: NM_001184290.1
    Note:

    Filename: NC_001148.4
    Classification: mRNA
    Start: 943880
    End: 944188
    Product: Hypothetical
    Name: NM_001184300.1
    Note:

    ….

How can we store data like this?

Page 24

Modules and Packages

Various functions (file handling, graphics, calculation…) are built in and distributed with Python.

This collection is called "The Python Standard Library".

A package contains one or more modules:

    Package: Module, Module, Module
    Package: Module, Module, Module

    from package import module

Page 25

Collections and namedtuple

We will use collections from the standard library, and from it we will use namedtuple. (Strictly speaking, collections is a module and namedtuple is a function that it provides.)

    from collections import namedtuple

    TypeName = namedtuple('TypeName', list_of_member_names)

Page 26

    data = namedtuple('data', ['x', 'y', 'z'])

data has the members x, y, and z, which are accessed as data.x, data.y, and data.z.

It is basically the same as a tuple, but we can access the members by name.

    Record = namedtuple("Record", ["filename", "classification", "start", "end", "product", "gene", "name", "note"])

    firstrecord = Record("NC_001148.4", "mRNA", "939992", "941136", "", "Arr3p", "NM_001184298.1", "")
    secondrecord = Record("NC_001148.4", "mRNA", "943032", "943896", "", "Hypothetical", "NM_001184290.1", "")

firstrecord:

    Filename: NC_001148.4
    Classification: mRNA
    Start: 939992
    End: 941136
    Product: Arr3p
    Name: NM_001184298.1
    Note:

secondrecord:

    Filename: NC_001148.4
    Classification: mRNA
    Start: 943032
    End: 943896
    Product: Hypothetical
    Name: NM_001184290.1
    Note:
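A short runnable sketch of namedtuple in Python 3 follows. Two choices here are assumptions for the sake of illustration: keyword arguments are used so that it is visible which value lands in which field, and start and end are stored as integers so that they can be used in arithmetic.

```python
from collections import namedtuple

Record = namedtuple("Record", ["filename", "classification", "start",
                               "end", "product", "gene", "name", "note"])

# Keyword arguments make it clear which value goes into which field.
firstrecord = Record(filename="NC_001148.4", classification="mRNA",
                     start=939992, end=941136, product="Arr3p",
                     gene="", name="NM_001184298.1", note="")

print(firstrecord.product)    # access by name
print(firstrecord[2])         # access by position still works, as with a tuple

# Start and end are both included in the feature, hence the + 1.
length = firstrecord.end - firstrecord.start + 1
print(length)
```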

Page 27

Check the content of a record

Page 28

You can store records in a list

    DataStore = []                    # initialize a list named DataStore
    DataStore.append(firstrecord)     # append the first record to DataStore
    DataStore.append(secondrecord)    # append the second record to DataStore

DataStore[0] is firstrecord

DataStore[1] is secondrecord

Page 29

DataStore[0] (firstrecord):

    Filename: NC_001148.4
    Classification: mRNA
    Start: 939992
    End: 941136
    Product: Arr3p
    Name: NM_001184298.1
    Note:

DataStore[1] (secondrecord):

    Filename: NC_001148.4
    Classification: mRNA
    Start: 943032
    End: 943896
    Product: Hypothetical
    Name: NM_001184290.1
    Note:

The member names are lowercase, as in the definition of Record:

    DataStore[0].start    939992
    DataStore[0].end      941136
    DataStore[1].start    943032
    DataStore[1].end      943896

Page 30

Back to gff parsing

Store all of the data in DataStore

https://gist.github.com/anonymous/03da342a3d21dcc20e66

Define a namedtuple called 'Record' and initialize a list called DataStore.

Append each Record to DataStore.

Retrieve the records from DataStore and print them.

Page 31
Page 32

Save each feature type in a separate file

The classification could be:

    gene
    exon
    tRNA
    mRNA
    CDS
    ncRNA
    region

Save them in separate files such as gene.txt, exon.txt, mRNA.txt…

Page 33

We will store these classifications in a tuple named 'classes'.

savefile is a dictionary that stores the file handles used for saving.

oneclass = 'gene'

    Open the file 'gene.txt' in write mode.
    Store the handle in the dictionary with the class name as the key.
    savefile['gene'] is the file handle for gene.txt.

oneclass = 'exon'

    Open the file 'exon.txt' in write mode.
    Store the handle in the dictionary with the class name as the key.
    savefile['exon'] is the file handle for exon.txt.

Page 34

    if record.classification in classes:

classes is ('gene', 'exon', 'tRNA', 'mRNA', 'ncRNA', 'CDS', 'region')

    print(…...., file=savefile[record.classification])

When record.classification is 'gene', savefile['gene'] holds the file handle for gene.txt, so the current record is written to gene.txt.

When everything is done, close the files.
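The dictionary-of-file-handles idea can be sketched as below in Python 3. The shortened Record type, the sample records, and the temporary output directory are assumptions made so that the example runs on its own; it is not the content of the lecture's gist.

```python
import os
import tempfile
from collections import namedtuple

# A shortened, hypothetical Record with only three fields.
Record = namedtuple("Record", ["classification", "start", "end"])
DataStore = [Record("gene", 939992, 941136),
             Record("mRNA", 939992, 941136),
             Record("gene", 943032, 943896)]

classes = ('gene', 'exon', 'tRNA', 'mRNA', 'ncRNA', 'CDS', 'region')
outdir = tempfile.mkdtemp()   # demonstration only; use '.' in a real script

# One open file handle per class, stored under the class name.
savefile = {}
for oneclass in classes:
    savefile[oneclass] = open(os.path.join(outdir, oneclass + '.txt'), 'w')

for record in DataStore:
    if record.classification in classes:
        print(record.classification, record.start, record.end,
              sep='\t', file=savefile[record.classification])

# Everything is done, so close every file.
for handle in savefile.values():
    handle.close()

with open(os.path.join(outdir, 'gene.txt')) as f:
    gene_lines = f.read().splitlines()
```

Note that a file is created for every class, including classes with no records, which then stay empty.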

Page 35

https://gist.github.com/anonymous/03e10ec56f679bab608f

All things put together

Page 36

File Handling packages

Get a list containing file names

    >>> import glob
    >>> glob.glob('*.*')
    ['CDS.txt', 'exon.txt', 'gene.txt', 'gff.py'….]
    >>> list = glob.glob('*.gff')
    >>> list
    ['NC_001133.gff', 'NC_001134.gff'.... 'NC_001224.gff']
    >>>

Open and process all of the files obtained from glob

    filelist = glob.glob('*.gff')
    for file in filelist:
        # do something with each file
        print(file)
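A self-contained sketch of the glob loop follows. It first creates a few small files in a temporary directory, purely so that there is something to find; the file names are borrowed from the listing above and their contents are made up.

```python
import glob
import os
import tempfile

# Create some hypothetical files to search through.
workdir = tempfile.mkdtemp()
for fname in ("NC_001133.gff", "NC_001134.gff", "notes.txt"):
    with open(os.path.join(workdir, fname), "w") as f:
        f.write("##gff-version 3\n")

# sorted() gives a predictable order; glob itself does not promise one.
filelist = sorted(glob.glob(os.path.join(workdir, "*.gff")))
basenames = [os.path.basename(path) for path in filelist]
print(basenames)
```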

Page 37

Glob Example

https://gist.github.com/anonymous/3eff4b86bd0f3aa55db9

Read all of the gff files in the directory.

Each file is read into singlefile and merged into contents.

Save the result as 'total'.

This does the same thing as cat *.gff > total

Page 38

Execute External Program

subprocess module, call function

As in shell scripts, we sometimes need to run an external program from a Python script.

    call(list_containing_program_and_arguments)

    ls -l    becomes    ['ls', '-l']

Page 39

Run MUSCLE inside Python Scripts

Assume the fasta file is saved as "merge.fasta".

This is equivalent to the command line muscle -in merge.fasta -out merge.aln -clw

    from subprocess import call

    call(["muscle", "-in", "merge.fasta", "-out", "merge.aln", "-clw"])
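Because MUSCLE may not be installed on every machine, the sketch below demonstrates call() with the Python interpreter itself as the external program. That substitution is an assumption made only so that the example runs anywhere; the list would hold the muscle command in practice.

```python
import sys
from subprocess import call

# sys.executable is the path of the running Python interpreter, used here
# only because it is certain to exist; substitute any program and arguments.
returncode = call([sys.executable, "-c", "print('hello from a child process')"])

# call() returns the exit status of the program; 0 conventionally means success.
if returncode != 0:
    print("the external program reported an error")
```

Checking the returned value is a simple way to notice that the external program failed before the script goes on to read its output file.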

Page 40

Other file-related packages and modules

Current directory?

    os.getcwd()

Change directory

    os.chdir(DIRECTORY)

Page 41

Make a directory under the current directory

    os.mkdir(directory)

Rename a directory

    os.renames('data', 'data2')
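The directory functions above can be tried safely inside a temporary directory, as in this sketch. The use of tempfile and of os.rename (the simpler relative of os.renames) are choices made here for the demonstration.

```python
import os
import tempfile

start_dir = os.getcwd()          # remember where we started
base = tempfile.mkdtemp()        # scratch area for the demonstration

os.chdir(base)                   # change into the scratch directory
os.mkdir('data')                 # make a directory under the current directory
os.rename('data', 'data2')       # rename it
made = os.path.isdir('data2')    # True if the renamed directory exists

os.chdir(start_dir)              # go back to the original directory
```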

Page 42

Install External Packages

Besides the Python Standard Library, you may want to install external packages to add features.

External packages are developed by outside developers and add more features to Python.

Most scientific packages are external packages developed by scientists, so you need to install them if you need them.

Page 43

Examples of scientific packages we may be interested in

BioPython: comprehensive bioinformatics package for Python
http://biopython.org/wiki/Main_Page

Numpy
http://www.numpy.org/

Scipy
http://www.scipy.org/

MatPlotlib
http://matplotlib.org/

Pandas: data analysis package
http://pandas.pydata.org/

We will cover some of them in later lectures.

Page 44

Install External Packages

First, check at your command prompt:

    pip

If you see pip's usage message, you are OK. If not, install pip itself.

Page 45

Install pip (the Python package installer)

http://pip.readthedocs.org/en/stable/installing/

Download get-pip.py

    curl -O https://bootstrap.pypa.io/get-pip.py

Change the permission of get-pip.py

    chmod +x get-pip.py

Execute get-pip.py (as administrator)

    sudo python get-pip.py    (you will need to enter your password)

Page 46

Install pyfasta package using pip

pyfasta: an external package that makes it easy to access multifasta files.

    sudo pip install pyfasta
    Collecting pyfasta
      Downloading pyfasta-0.5.2.tar.gz
    Installing collected packages: pyfasta
      Running setup.py install for pyfasta
    Successfully installed pyfasta-0.5.2

Check whether the installation is okay

In the Python interpreter, import the package. If the installation is not okay, you will see an error message there.

Page 47

pyFasta Example

Combine all of the chromosome and protein files into a single fna (nucleotide) file and a single faa (protein) file.

Load yeast.faa into f.

In f.keys(), you can see the header of each protein.

In f[list][:], you can see the sequence.

To access a specific sequence, you need its header line from the fasta file.

Page 48

    ./pyfasta1.py

It will print out all of the yeast proteins.

Let's extract using conditions

    f[list][:]                    the sequence
    len(f[list][:])               the length of the sequence
    if len(f[list][:]) > 1000     if the length of the sequence is bigger than 1000, print it out

Page 49

Total number of proteins

Proteins meeting the condition

Page 50

If you want only the list of proteins satisfying the condition

I only want to see the RefSeq ID, the protein name, and the number of amino acids.

Page 51

    a = list.split('|')

    a[0]    gi
    a[1]    6324242
    a[2]    ref
    a[3]    NP_014312.1
    a[4]    Tcb2p [Saccharomyces cerevisiae…]

Page 52

I don't want [Saccharomyces cerevisiae…]

How can we remove it?

Split again...

    a[4]    Mdm1p [Saccharomyces cerevisiae S288c]

    a[4].split("[")[0].strip()

strip(): removes the whitespace at the beginning and end of a string
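The header clean-up can be sketched as a few lines of Python. The header string below is assembled from the pieces shown on the slides, with the organism name shortened, so it should be read as an illustrative value and not as an actual line from yeast.faa.

```python
# Illustrative fasta header, assembled from the pieces shown on the slides.
header = "gi|6324242|ref|NP_014312.1| Tcb2p [Saccharomyces cerevisiae]"

a = header.split('|')

refseq_id = a[3]
# Keep the text before '[' and trim the surrounding whitespace.
protein_name = a[4].split('[')[0].strip()

print(refseq_id, protein_name)
```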

Page 53

https://gist.github.com/anonymous/4fc9f86c82ac7fde4b69

Page 54

"…Instead of the number of amino acids, I want to see the molecular weight"

In the last lecture, we learned how to calculate the molecular weight from an amino acid sequence.

https://gist.github.com/anonymous/74ffdb1407ead96ce560

You can cut and paste this code and modify it. But if you need to calculate the molecular weight several times, that is not efficient.

Page 55

Functions

If some part of a program is used frequently, make it a separate part and then reuse it.

Without a function: A, B, C, B, E, …..

With a function: A, call B, C, call B, E, ….. (B is defined once, separately)

How can we define functions?

Page 56

Syntax of Functions

A function takes input (arguments), calculates something from the input, and returns some values.

    def function_name(arguments):
        …...
        do something
        return value

The return value is optional…

Page 57

Molecular Weight Function

Argument: sequence, which should contain an amino acid sequence

Calculation of the molecular weight

return (mweight)

Page 58

Main part of the program

Put the current sequence into the variable 'sequence'.

Call the newly defined function, mw. Its result is then stored in the variable 'mweight'.
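A molecular weight function along these lines might look like the sketch below. It is not the code from the linked gist: the table of average residue masses and the treatment of unknown letters are assumptions of this example, and the result is an approximation.

```python
# Average residue masses in daltons (assumed values for this sketch).
RESIDUE_MASS = {
    'G': 57.0519, 'A': 71.0788, 'S': 87.0782, 'P': 97.1167, 'V': 99.1326,
    'T': 101.1051, 'C': 103.1388, 'L': 113.1594, 'I': 113.1594, 'N': 114.1038,
    'D': 115.0886, 'Q': 128.1307, 'K': 128.1741, 'E': 129.1155, 'M': 131.1926,
    'H': 137.1411, 'F': 147.1766, 'R': 156.1875, 'Y': 163.1760, 'W': 186.2132,
}
WATER = 18.01528   # one water molecule is added for the free chain ends

def mw(sequence):
    """Return the approximate molecular weight of an amino acid sequence."""
    mweight = WATER
    for aa in sequence.upper():
        mweight += RESIDUE_MASS.get(aa, 0.0)   # unknown letters add nothing
    return mweight

# Main part: put a sequence into 'sequence', call mw, store the result.
sequence = "GA"
mweight = mw(sequence)
print(round(mweight, 2))
```

Once the function is defined, it can be called for every sequence in a file without copying the calculation.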

Page 59

https://gist.github.com/anonymous/3c2bf5ec586fd3280502

A small modification

Print out the molecular weights of all of the proteins

    ./molweight2.py > mw.csv

Save them and import the file into Excel (or any other statistics program)

Page 60

Imported data; bins for the histogram

Page 61

Use the Histogram function in Excel.

Page 62
Page 63

Draw a bar chart

[Bar chart: Distributions of protein size in yeast. Horizontal axis: bins from 10000 to 300000, plus "More". Vertical axis: Frequency, from 0 to 900.]

In the next class, we will learn how to draw these graphs in Python.

Page 64

Extraction of subsequences in pyFasta

First 10 characters

Page 65

Characters 5 to 15

Page 66

Extract features using pyFasta and Gff

Previously, we learned how to extract the desired information from a gff file.

Using this information, we can extract the desired sequences from a big fasta file.

Page 67

A small modification of the previous script

Parse the gff file and store the records in the DataStore list. The only difference is that the strand is stored as well.

    NC_001224.1 RefSeq region 1 85779 . + . ID=id0;Dbxref=taxon:559292;Is_circular=true;gbkey=Src;

The seventh column is the direction of the strand (+: sense, -: antisense).

Page 68

filename: the gff file name, "NC_001224.gff"

We need to open NC_001224.fna

    filename.split(".")[0] is "NC_001224"

    filename.split(".")[0] + ".fna" is "NC_001224.fna"

Then open the fasta file

    fasta = Fasta(fastafilename)

Go through the data stored in DataStore.

If record.classification is gene: first print out the header, then print out the partial sequence from record.start to record.end.
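The slicing step can be illustrated without pyfasta by using a plain Python string as the chromosome. In this sketch the function name extract_feature and the 12-base sequence are made up, start and end are treated as 1-based and inclusive (as in GFF3), and taking the reverse complement for the minus strand is an assumption about what one usually wants; only uppercase A, C, G, T, and N are handled.

```python
def extract_feature(chrom_seq, start, end, strand='+'):
    """Return the subsequence for 1-based, inclusive coordinates."""
    # Python slices are 0-based and exclude the end, hence start - 1.
    piece = chrom_seq[start - 1:end]
    if strand == '-':
        # Reverse complement for features on the opposite strand.
        complement = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G', 'N': 'N'}
        piece = ''.join(complement.get(base, 'N') for base in reversed(piece))
    return piece

chrom = "ATGCCGTAAGGT"                       # made-up 12-base "chromosome"
plus = extract_feature(chrom, 1, 3)          # bases 1 to 3
minus = extract_feature(chrom, 4, 9, '-')    # bases 4 to 9, opposite strand

print(">gene1")                              # fasta header line
print(plus)
```

Since the coordinates read from a gff file are text, they need to be converted with int() before being used in a slice.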

Page 69

https://gist.github.com/anonymous/6a7d63ccb82be2a30dca

All things put together

    ./extract.py *.gff

To save the output as a fasta file:

    ./extract.py *.gff > gene.fasta

Page 70

Assignments

1. From the NCBI genome database, download all of the faa, gff, and fna files for Arabidopsis thaliana and C. elegans and analyze them using the scripts (modify the scripts if you need to).

ftp://ftp.ncbi.nlm.nih.gov/genomes/Caenorhabditis_elegans/

Download CHR-I, II, III, IV, V, X

ftp://ftp.ncbi.nlm.nih.gov/genomes/Arabidopsis_thaliana/

Download CHR-I, II, III, IV, V

Page 71

2. Calculate the molecular weights of all of the proteins in each organism and make a histogram.

3. What fraction of the proteins in the proteome is bigger than 30 kDa (30,000) and smaller than 100 kDa (100,000)?

4. Generate a multifasta file containing the protein sequences with 30,000 < MW < 100,000.

5. From the genome sequences, extract all of the tRNA sequences and submit them as a multifasta file.