Computer Skills for Biological Research, Lecture 5

Computational Skill for Modern Biology Research

Department of Biology, Chungbuk National University

5th Lecture, 2015.10.6

Basics of Programming, Python Scripting & Sequence Manipulation

Syllabus

Week               Topic
Week 1             Introduction: Why do we need to learn this stuff?
Week 2             Basics of Unix and running BLAST on your PC
Week 3             Unix Command Prompt II and shell scripts
Week 4             Basics of programming (Python programming)
Week 5             Python Scripting II and sequence manipulations
Week 6             Python Scripting III and Biopython
Week 7             Python Scripting IV and
Week 8             Next Generation Sequencing
Week 9 / Week 10   Next Generation Sequencing Analysis
Week 11            R and statistical analysis
Week 12            Bioconductor I
Week 13            Bioconductor II
Week 14            Network analysis

Utilization of Genome Sequencing Data

Most sequenced genomes are archived at ftp://ftp.ncbi.nlm.nih.gov/genomes.

Today, we will learn how to extract the information we want from them (and the Python needed to do so).


ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/Saccharomyces_cerevisiae_uid128/

.asn - nucleotide record in asn.1 format

.faa - protein sequences in fasta format

.ffn - nucleotide sequences of CDS features in fasta format

.fna - total nucleotide sequence in fasta format

.frn - nucleotide sequences of structural RNAs in fasta format

.gbk - full Genbank flat file format

.gff - feature annotation in GFF3 format

.ptt - protein table

.rnt - structural RNA table

.rpt - report file

.val - binary file

Description of each file type

GFF file: contains genome annotations, with columns separated by <TAB>

Columns include the type of region, the start and the end

We will learn how to extract desired information from them

Download multiple files

We used curl to download files from the internet.

But for downloading multiple files, wget is a more efficient option.

On Mac, download and install wget: http://rudix.org/packages/wget.html

On Linux, wget is already installed in most cases.


wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/Saccharomyces_cerevisiae_uid128/*.faa

This downloads all faa files in that directory.

Download the fna, ffn, frn, gff and ptt files using wget!

What we will learn today

Read annotation (gff) files and store the annotation data in a Python dictionary

Using the stored annotation data, search for a desired gene and find its location

Based on a list of genes, extract the desired sequences from the genome fasta file

First, we need to read the annotation file

A gff file is just a text file. We read it, then extract only the desired information (parsing).

We start by modifying a script from the last lecture:

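#!/usr/bin/env python3

import sys

for filename in sys.argv[1:]:
    # Read every line of the file into a list
    f = open(filename, 'r')
    content = f.readlines()
    f.close()

    # Each line already ends with a newline, so print without adding one
    for line in content:
        print(line, end='')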

Save it as gff.py, then make it executable: chmod +x gff.py

./gff.py *.gff

This does the same thing as cat *.gff

Inside the loop (for line in content:), instead of just printing line, we will extract information from the line variable.

Parsing in Python

We need to skip lines that start with #. How can we skip them?

if line[0] != '#':
    print(line)

If the first character of line (line[0]) is not (!=) '#', process the line.

!= : is not equal to
== : is equal to
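As a minimal sketch, the same filter can be wrapped in a small reusable function. The function name `data_lines` and the sample lines are made up for illustration, and `str.startswith` is used in place of `line[0]`, which gives the same answer for non-empty lines.

```python
def data_lines(lines):
    """Return only the lines that do not start with '#'."""
    kept = []
    for line in lines:
        # For non-empty lines this is the same test as line[0] != '#'
        if not line.startswith('#'):
            kept.append(line)
    return kept


# Hypothetical sample: one comment line followed by one annotation line
sample = [
    "##gff-version 3\n",
    "NC_001224.1\tRefSeq\tregion\t1\t85779\t.\t+\t.\tID=id0\n",
]
kept_lines = data_lines(sample)
print(len(kept_lines))
```

Unlike `line[0]`, `startswith` does not raise an error if it ever meets an empty string.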

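Split string

split(): splits the content of a string and stores the pieces in a list

line:
NC_001224.1 <tab> RefSeq <tab> region <tab> 1 <tab> 85779 <tab> . <tab> + <tab> .

separate = line.strip().split('\t')

Splitting on the tab character ('\t') rather than on any whitespace keeps values that contain spaces, such as product names, in one piece.

separate[0]   NC_001224.1
separate[1]   RefSeq
separate[2]   region
separate[3]   1
separate[4]   85779
separate[5]   .
separate[6]   +
separate[7]   .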

Store some of them in separate variables

name = separate[0]
classification = separate[2]
start = separate[3]
end = separate[4]
id = separate[8]
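The column handling above can be tried on a single line, roughly as follows. The sample line is a shortened version of the example on the slide, and converting start and end to integers is an addition that is only needed if you want to compare or subtract positions later.

```python
# One annotation line, with the nine GFF columns separated by tabs
line = "NC_001224.1\tRefSeq\tregion\t1\t85779\t.\t+\t.\tID=id0;gbkey=Src\n"

# Remove the trailing newline, then split on tabs only
columns = line.rstrip("\n").split("\t")

name = columns[0]
classification = columns[2]
start = int(columns[3])   # converted to integers so they can be compared
end = int(columns[4])
attributes = columns[8]   # the ninth column, called id on the slide

print(name, classification, start, end, attributes)
```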

Split one more time

ID=cds18;Name=NP_009328.1;Parent=rna43;Dbxref=SGD:S000007283,Genbank:NP_009328.1,GeneID:854627;gbkey=CDS;product=cytochrome c oxidase subunit 3;protein_id=NP_009328.1;transl_table=3

We need to split the content of id again. This time, we split on ';'.

items = id.split(';')

items[0]   'ID=cds18'
items[1]   'Name=NP_009328.1'
items[2]   'Parent=rna43'

Store them in a dictionary

Split again using '='

temp = items[0].split('=')

temp[0] is 'ID'
temp[1] is 'cds18'

keys = {}   # create an empty dictionary

keys[temp[0]] = temp[1]

Key      Value
ID       cds18
Name     NP_009328.1
Parent   rna43


items = id.split(';')
keys = {}
for item in items:
    temp = item.split('=')
    keys[temp[0]] = temp[1]

id:
ID=cds18;Name=NP_009328.1;Parent=rna43;Dbxref=SGD:S000007283,Genbank:NP_009328.1,GeneID:854627;gbkey=CDS;product=cytochrome c oxidase subunit 3;protein_id=NP_009328.1;transl_table=3

id.split(';') gives items:
ID=cds18
Name=NP_009328.1
Parent=rna43
Dbxref=SGD:S000007283,Genbank:NP_009328.1,GeneID:854627
gbkey=CDS
product=cytochrome c oxidase subunit 3
protein_id=NP_009328.1
transl_table=3

item.split('=') on each item gives keys:
Key            Value
ID             cds18
Name           NP_009328.1
Parent         rna43
Dbxref         SGD:S000007283,Genbank:NP_009328.1,GeneID:854627
gbkey          CDS
product        cytochrome c oxidase subunit 3
protein_id     NP_009328.1
transl_table   3
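The two splits can be combined into one helper, sketched below. The name `parse_attributes` is hypothetical, and the check for '=' is an extra guard (not on the slides) so that an empty piece, for example after a trailing ';', does not cause an error.

```python
def parse_attributes(text):
    """Turn 'key=value;key=value' text into a dictionary."""
    keys = {}
    for item in text.strip().split(';'):
        if '=' not in item:
            continue                      # skip empty pieces, e.g. after a trailing ';'
        key, value = item.split('=', 1)   # split only on the first '='
        keys[key] = value
    return keys


attribute_text = ("ID=cds18;Name=NP_009328.1;Parent=rna43;gbkey=CDS;"
                  "product=cytochrome c oxidase subunit 3;transl_table=3")
keys = parse_attributes(attribute_text)
print(keys['product'])
```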

Retrieve from dictionary

product = keys['product']

Sometimes there is no 'product' item in keys. Before reading from a dictionary, you have to check whether the key is in the dictionary:

if 'product' in keys:
    product = keys['product']
else:
    product = ''

If 'product' is in keys, get the value stored under the key 'product' and put it in the variable product. Otherwise, product becomes an empty string.
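For comparison, here is a small sketch of the same check next to the dictionary method `get`, which does the test and the lookup in one step. The dictionary content is a made-up record without a 'product' entry.

```python
keys = {'ID': 'rna43', 'gbkey': 'mRNA'}   # hypothetical record without 'product'

# Long form: test first, then look up
if 'product' in keys:
    product = keys['product']
else:
    product = ''

# Short form: get() returns the second argument when the key is missing
product_short = keys.get('product', '')

print(repr(product), repr(product_short))
```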

Put all of them together

https://gist.github.com/anonymous/fc1ad3ff14a0e0298eca

The script only prints a record if its classification is "mRNA".

You can change this if you want to extract "CDS" instead.

Execute it

Download it and change its permissions:

chmod +x gff2.py
./gff2.py *.gff    (processes all gff files in the directory)
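The gist itself is not reproduced here, so the following is only a sketch of how the steps so far could fit together; it is not the gist's code. The function name `extract_features` and the sample lines are invented for illustration.

```python
def extract_features(lines, wanted='mRNA'):
    """Return (name, start, end, product) for every line of the wanted type."""
    results = []
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue                              # skip comments and blank lines
        columns = line.rstrip('\n').split('\t')
        if len(columns) < 9 or columns[2] != wanted:
            continue                              # wrong type or incomplete line
        keys = {}
        for item in columns[8].split(';'):
            if '=' in item:
                key, value = item.split('=', 1)
                keys[key] = value
        results.append((columns[0], columns[3], columns[4],
                        keys.get('product', '')))
    return results


# Hypothetical two-line input: a header comment and one mRNA feature
sample = [
    "##gff-version 3\n",
    "NC_001148.4\tRefSeq\tmRNA\t939992\t941136\t.\t+\t.\tID=rna1;product=Arr3p\n",
]
for feature in extract_features(sample):
    print(*feature)
```

Calling `extract_features(sample, wanted='CDS')` would select CDS features instead.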

Save results as file

First you need to open a file to save into. Instead of 'r', we use the 'w' flag for 'write':

save = open('mRNA.txt', 'w')

Then change the print call like this, assigning the save file handle to file:

print(filename, classification, start, end, product, name, note, gene, file=save)

At the end of the script, close the save file:

save.close()

https://gist.github.com/anonymous/b7ded2ebed8f94f69798
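A short sketch of the same idea using a `with` block, which closes the file automatically. Writing into a temporary directory is an assumption made here only so that the example leaves no files behind; the record values are taken from the example data.

```python
import os
import tempfile

# Write into a temporary directory so the example leaves nothing behind
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'mRNA.txt')

# 'with' closes the file automatically, even if an error occurs
with open(path, 'w') as save:
    print('NC_001148.4', 'mRNA', 939992, 941136, 'Arr3p', file=save)

# Read the file back to check what was written
with open(path, 'r') as saved:
    written = saved.read()
print(written, end='')
```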

Storing extracted data in memory

In the previous examples, we extracted the desired data and printed it (or saved it to a file).

But in some cases, we need to store the data in memory and use it later.

We extract data like this:

Filename      Classification   Start    End      Product        Name             Note
NC_001148.4   mRNA             939992   941136   Arr3p          NM_001184298.1
NC_001148.4   mRNA             943032   943896   Hypothetical   NM_001184290.1
….

How can we store data like this?


Modules and Packages

Various functions (file handling, graphics, calculation…) are built in and distributed with Python.

This collection is called 'The Python Standard Library'.

Packages contain modules, and a module is imported from a package like this:

from packages import module

Collections and namedtuple

We will use a module called collections, and from it the function namedtuple.

from collections import namedtuple

Name of type = namedtuple('Name of type', list containing member names)

data = namedtuple('data', ['x', 'y', 'z'])

A value of type data has the members x, y and z, accessed as .x, .y and .z.

It is basically the same as a tuple, but we can access its members by name.

Record = namedtuple("Record", ["filename", "classification", "start", "end", "product", "gene", "name", "note"])

firstrecord = Record("NC_001148.4", "mRNA", "939992", "941136", "", "Arr3p", "NM_001184298.1", "")
secondrecord = Record("NC_001148.4", "mRNA", "943032", "943896", "", "Hypothetical", "NM_001184290.1", "")

Check content of record
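A minimal sketch of checking a record's content. To keep it short, this Record has only four of the eight fields used in the lecture.

```python
from collections import namedtuple

# A reduced Record with four fields (the lecture's Record has eight)
Record = namedtuple('Record', ['filename', 'classification', 'start', 'end'])

firstrecord = Record('NC_001148.4', 'mRNA', '939992', '941136')

print(firstrecord)            # shows all field names and values
print(firstrecord.start)      # access by name
print(firstrecord[2])         # access by position still works
```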

You can store these records in a list

DataStore = []                    # initialize a list named 'DataStore'
DataStore.append(firstrecord)     # append the first Record to DataStore
DataStore.append(secondrecord)    # append the second Record to DataStore

DataStore[0] is firstrecord
DataStore[1] is secondrecord

DataStore[0].start   939992
DataStore[0].end     941136
DataStore[1].start   943032
DataStore[1].end     943896
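Once the records sit in a list, they can be searched or summarized later. The sketch below again uses a reduced Record, and it stores start and end as integers (an assumption that differs from the slides, where they are text) so that arithmetic is possible. The cut-off of 940000 is arbitrary.

```python
from collections import namedtuple

Record = namedtuple('Record', ['filename', 'classification', 'start', 'end'])

DataStore = []
DataStore.append(Record('NC_001148.4', 'mRNA', 939992, 941136))
DataStore.append(Record('NC_001148.4', 'mRNA', 943032, 943896))

# Length of each feature, counting both the start and the end position
lengths = [record.end - record.start + 1 for record in DataStore]
print(lengths)

# Keep only records that start after position 940000 (arbitrary cut-off)
late = [record for record in DataStore if record.start > 940000]
print(len(late))
```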

Back to gff parsing

Store all of the data in DataStore

https://gist.github.com/anonymous/03da342a3d21dcc20e66

The script does three things:

Define a namedtuple called 'Record' and initialize a list called DataStore.

Append each Record to DataStore.

Retrieve the records from DataStore and print them.

Save each feature type in a separate file

Classification could be: gene, exon, tRNA, mRNA, CDS, ncRNA, region

We save separate files such as gene.txt, exon.txt, mRNA.txt…

We store these classifications in a tuple named 'classes':

('gene', 'exon', 'tRNA', 'mRNA', 'ncRNA', 'CDS', 'region')

savefile is a dictionary used to store the file handles of the save files.

oneclass = 'gene': open 'gene.txt' in write mode and store the handle in the dictionary with the class name as key. savefile['gene'] is then the file handle for gene.txt.

oneclass = 'exon': open 'exon.txt' in write mode and store the handle in the dictionary with the class name as key. savefile['exon'] is then the file handle for exon.txt.

If record.classification is in classes:

print(…, file=savefile[record.classification])

For example, when record.classification is 'gene', savefile['gene'] holds the file handle for gene.txt, so the current record is written to gene.txt.

When everything is done, close the files.

All things put together:

https://gist.github.com/anonymous/03e10ec56f679bab608f
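The dictionary of file handles could look roughly like this. It is a sketch rather than the gist's code: the records are made up, only two classes are used, and the files go into a temporary directory.

```python
import os
import tempfile
from collections import namedtuple

Record = namedtuple('Record', ['classification', 'start', 'end'])
# Made-up records; 'intron' is not in classes, so it will be skipped
records = [Record('gene', 1, 100), Record('exon', 1, 50), Record('intron', 51, 60)]

classes = ('gene', 'exon')            # shortened list of classes
folder = tempfile.mkdtemp()           # temporary output directory

# One open file handle per class, stored under the class name
savefile = {}
for oneclass in classes:
    savefile[oneclass] = open(os.path.join(folder, oneclass + '.txt'), 'w')

for record in records:
    if record.classification in classes:
        print(record.classification, record.start, record.end,
              file=savefile[record.classification])

# Close every file when everything is done
for handle in savefile.values():
    handle.close()

with open(os.path.join(folder, 'gene.txt')) as check:
    gene_text = check.read()
print(gene_text, end='')
```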

File Handling packages

Get a list containing file names:

>>> import glob
>>> glob.glob('*.*')
['CDS.txt', 'exon.txt', 'gene.txt', 'gff.py'….]
>>> list = glob.glob('*.gff')
>>> list
['NC_001133.gff', 'NC_001134.gff'.... 'NC_001224.gff']
>>>

Open and process all of the files obtained from glob:

filelist = glob.glob('*.gff')
for file in filelist:
    do something

Glob Examples

https://gist.github.com/anonymous/3eff4b86bd0f3aa55db9

Read all gff files in the directory.

Each file is read into singlefile and merged into contents.

contents is then saved as 'total'.

This does the same thing as cat *.gff > total
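A sketch of the merge, assuming two tiny stand-in files created in a temporary directory (the file names and their content are made up). The variable names singlefile and contents follow the description above.

```python
import glob
import os
import tempfile

# Create two tiny stand-in gff files in a temporary directory
folder = tempfile.mkdtemp()
for filename, text in [('a.gff', 'first\n'), ('b.gff', 'second\n')]:
    with open(os.path.join(folder, filename), 'w') as handle:
        handle.write(text)

# sorted() gives a predictable order; glob itself does not guarantee one
filelist = sorted(glob.glob(os.path.join(folder, '*.gff')))

# Read every file and collect all lines in one list
contents = []
for singlefile in filelist:
    with open(singlefile, 'r') as handle:
        contents.extend(handle.readlines())

# Save the merged lines as 'total'
total_path = os.path.join(folder, 'total')
with open(total_path, 'w') as total:
    total.writelines(contents)

print(len(filelist), len(contents))
```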

Execute External Program

subprocess module, call function

As in shell scripts, we sometimes need to run an external program from a Python script.

from subprocess import call

call(list_containing_program_and_arguments)

ls -l    becomes    call(['ls', '-l'])

Run MUSCLE inside Python scripts

Assume the fasta file is saved as "merge.fasta"

call(["muscle", "-in", "merge.fasta", "-out", "merge.aln", "-clw"])

This is equivalent to the command line: muscle -in merge.fasta -out merge.aln -clw
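Because MUSCLE may not be installed on every machine, the sketch below runs the Python interpreter itself as the external program. The pattern is the same: the command is given as a list, and `call` returns the program's exit code, which by convention is 0 on success.

```python
import subprocess
import sys

# Program and arguments as a list, one list item per word of the command
command = [sys.executable, '-c', 'print("hello from a child process")']

# call() waits for the program to finish and returns its exit code
exit_code = subprocess.call(command)

if exit_code != 0:
    print('The command failed with exit code', exit_code)
```

Checking the exit code before reading an output file such as merge.aln helps to notice a failed run early.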

Other file-related packages and modules

import os

Current directory?

os.getcwd()

Change directory

os.chdir(DIRECTORY)

Make a directory under the current directory

os.mkdir(directory)

Rename a directory

os.renames('data', 'data2')
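These calls can be tried safely inside a temporary directory, as sketched below. The simpler `os.rename` is used here in place of `os.renames`, and `os.listdir` is added only to show the result.

```python
import os
import tempfile

start_dir = os.getcwd()              # remember where we started
workspace = tempfile.mkdtemp()       # temporary directory to experiment in

os.chdir(workspace)                  # change directory
os.mkdir('data')                     # make a directory under the current one
os.rename('data', 'data2')           # rename it
listing = os.listdir('.')            # list what is in the current directory
print(listing)

os.chdir(start_dir)                  # go back to the original directory
```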

Install External Packages

Besides the Python Standard Library, you may want to install external packages to add features.

External packages are developed by outside developers and add more features to Python.

Most scientific packages are external packages developed by scientists, so you need to install them if you need them.

Examples of scientific packages we may be interested in:

BioPython: comprehensive bioinformatics package for Python, http://biopython.org/wiki/Main_Page
NumPy: http://www.numpy.org/
SciPy: http://www.scipy.org/
Matplotlib: http://matplotlib.org/
Pandas: data analysis package, http://pandas.pydata.org/

We will cover some of them in later lectures.


Install External Packages

First, check at your command prompt by typing:

pip

If you see pip's usage message, you are OK. If not, install pip itself.

Install pip (Pip Installs Packages)

http://pip.readthedocs.org/en/stable/installing/

Download get-pip.py

curl -O https://bootstrap.pypa.io/get-pip.py

Change the permissions of get-pip.py

chmod +x get-pip.py

Execute get-pip.py (as administrator)

sudo python get-pip.py    (you will need to enter your password)

Install pyfasta package using pip

pyfasta: an external package that gives easy access to multi-fasta files.

sudo pip install pyfasta
Collecting pyfasta
  Downloading pyfasta-0.5.2.tar.gz
Installing collected packages: pyfasta
  Running setup.py install for pyfasta
Successfully installed pyfasta-0.5.2

Check whether the installation is okay

In the Python interpreter, import pyfasta. If the installation is not okay, you will see an error message.

pyFasta Example

Combine all of the chromosome files into a single fna (nucleotide) file and all of the protein files into a single faa (protein) file.

Load yeast.faa into f.

f.keys() gives the header of each protein record.

f[list][:] gives the sequence, where list is one of those headers.

To access a specific sequence, you need its header from the fasta file.

./pyfasta1.py

Let's extract using conditions

f[list][:]                    the sequence
len(f[list][:])               the length of the sequence
if len(f[list][:]) > 1000:    if the length of the sequence is bigger than 1000

It will print out all of the yeast proteins.

It prints out:

Total number of proteins

Number of proteins satisfying the condition

If you want only the list satisfying the condition
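To try the length condition without installing pyfasta, the sketch below swaps in a plain-Python fasta reader. The function `read_fasta` is not part of pyfasta; it is a stand-in that returns an ordinary dictionary of header to sequence. The records and the cut-off of 3 are made up for this toy example.

```python
def read_fasta(lines):
    """Return a dictionary mapping each header (without '>') to its sequence."""
    sequences = {}
    header = None
    for line in lines:
        line = line.strip()
        if line.startswith('>'):
            header = line[1:]            # a new record starts here
            sequences[header] = ''
        elif header is not None:
            sequences[header] += line    # sequence lines are joined together
    return sequences


# Two made-up records; real protein files hold thousands
sample = [">protein_a\n", "MKT\n", "AYI\n", ">protein_b\n", "MS\n"]
f = read_fasta(sample)

minimum = 3                       # small cut-off for this toy example
selected = [header for header in f if len(f[header]) > minimum]

print('Total number of proteins:', len(f))
print('Proteins longer than', minimum, ':', len(selected))
```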

"I only want to see the RefSeq ID, the protein name and the number of amino acids..."

a = list.split('|')

a[0]   gi
a[1]   6324242
a[2]   ref
a[3]   NP_014312.1
a[4]   Tcb2p [Saccharomyces cerevisiae…]

"I don't want [Saccharomyces cerevisiae…]"

How can we remove it?

Split again...

a[4]                          Mdm1p [Saccharomyces cerevisiae S288c]
a[4].split('[')[0].strip()    Mdm1p

strip(): removes the spaces at both ends of the string

https://gist.github.com/anonymous/4fc9f86c82ac7fde4b69
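The header handling can be written as one small function, sketched here. The name `parse_header` is hypothetical, and the sample header is assembled from the fields shown above, with the species text added for illustration.

```python
def parse_header(header):
    """Return (RefSeq ID, protein name) from a 'gi|...|ref|...| name [species]' header."""
    a = header.split('|')
    refseq_id = a[3]
    # Keep the text before '[' and trim the surrounding spaces
    protein_name = a[4].split('[')[0].strip()
    return refseq_id, protein_name


# Header assembled from the fields shown above; the species text is illustrative
header = 'gi|6324242|ref|NP_014312.1| Tcb2p [Saccharomyces cerevisiae S288c]'
refseq_id, protein_name = parse_header(header)
print(refseq_id, protein_name)
```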

“…Instead of amino acid number, I want to see molecular weight”

In the last lecture, we learned how to calculate molecular weight from an amino acid sequence.

https://gist.github.com/anonymous/74ffdb1407ead96ce560

You can cut and paste this code and modify it. But if you need to calculate the molecular weight several times, that is not efficient.

Functions

If some part of a program is used frequently, make it a separate part, then reuse it.

Without a function:   A B C B E …..
With a function:      A, call B, C, call B, E …..   (B is defined once)

How can we define functions?

Syntax of Functions

def function_name(arguments):
    …...
    Do something
    return (return values)

Input (arguments): the values the function receives

Return values: the function calculates from its input and returns some values

The return value is optional…
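As a small illustration of the syntax, here is a sketch of a function that is defined once and called twice. The function `gc_fraction` is an invented example and is not part of the lecture scripts.

```python
def gc_fraction(sequence):
    """Return the fraction of G and C bases in a nucleotide sequence."""
    if len(sequence) == 0:
        return 0.0                      # avoid dividing by zero
    sequence = sequence.upper()
    gc = sequence.count('G') + sequence.count('C')
    return gc / len(sequence)


# The same code is reused for two different inputs
first = gc_fraction('ATGC')
second = gc_fraction('ggcc')
print(first, second)
```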

Molecular Weight Function

Argument: sequence, which should contain an amino acid sequence

Calculation of the molecular weight

Return (mweight)

Main part of the program

Put the current sequence into the variable 'sequence'

Call the newly defined function, mw; its result is then stored in the variable 'mweight'

https://gist.github.com/anonymous/3c2bf5ec586fd3280502
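The gist is not reproduced here, so the following is only one possible shape for such a function. The residue masses are approximate average values rounded to two decimals and should be checked against a reference table before real use. Ignoring unknown letters is a simplifying assumption of this sketch.

```python
# Approximate average residue masses in daltons (rounded to two decimals)
RESIDUE_MASS = {
    'G': 57.05, 'A': 71.08, 'S': 87.08, 'P': 97.12, 'V': 99.13,
    'T': 101.11, 'C': 103.14, 'L': 113.16, 'I': 113.16, 'N': 114.10,
    'D': 115.09, 'Q': 128.13, 'K': 128.17, 'E': 129.12, 'M': 131.19,
    'H': 137.14, 'F': 147.18, 'R': 156.19, 'Y': 163.18, 'W': 186.21,
}
WATER = 18.02   # one water is added for the free N- and C-termini


def mw(sequence):
    """Return the approximate molecular weight of an amino acid sequence."""
    mweight = WATER
    for residue in sequence.upper():
        # Unknown letters (e.g. 'X') are ignored in this sketch
        mweight += RESIDUE_MASS.get(residue, 0.0)
    return mweight


sequence = 'GA'          # put the current sequence into 'sequence'
mweight = mw(sequence)   # call the function and store the result
print(round(mweight, 2))
```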

A few modifications

Print out the molecular weights of all proteins

./molweight2.py > mw.csv

Save them and import them into Excel (or any other statistics program)

Imported data and bins for the histogram

Use the Histogram function in Excel.
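The binning that Excel does can also be sketched in a few lines of Python. The function name `histogram`, the bin width of 10000 and the sample values are assumptions for illustration; the values are not real yeast data. A value lying exactly on a bin edge goes into the next bin here, which differs slightly from Excel.

```python
from collections import Counter


def histogram(values, bin_width=10000):
    """Count how many values fall into each bin of the given width."""
    counts = Counter()
    for value in values:
        # Upper edge of the bin, e.g. 12500 falls into the bin ending at 20000
        upper_edge = (int(value) // bin_width + 1) * bin_width
        counts[upper_edge] += 1
    return counts


# Made-up molecular weights, not real yeast data
weights = [12500.0, 18700.5, 45210.3, 99999.9]
counts = histogram(weights)
for upper_edge in sorted(counts):
    print(upper_edge, counts[upper_edge])
```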

Draw bar chart

[Figure: bar chart "Distributions of protein size in yeast"; x-axis: molecular weight bins from 10000 to 300000, then "More"; y-axis: Frequency, 0 to 900]

In the next class, we will learn how to draw these graphs in Python…

Extraction of subsequences in pyFasta

First 10 characters

Characters 5 to 15
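Slicing can be tried on an ordinary Python string, as sketched below with a made-up sequence. This assumes that slicing a pyfasta record, as in f[list][:], follows the same rules as slicing a string: counting starts at 0 and the end position is excluded.

```python
sequence = 'ATGGCCATTGTAATGGGCCGC'   # made-up sequence of 21 characters

first_ten = sequence[:10]      # characters at positions 0..9
middle = sequence[5:15]        # characters at positions 5..14 (15 is excluded)

print(first_ten)
print(middle)
```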

Extract features using pyFasta and Gff

Previously, we learned how to extract the desired information from a gff file.

Using this information, we can extract the desired sequences from a big fasta file.

The only difference is a small modification of the previous script.

NC_001224.1 RefSeq region 1 85779 . + . ID=id0;Dbxref=taxon:559292;Is_circular=true;gbkey=Src;

Direction of strand (+ : sense, - : antisense)

Parse the gff file and store the records in the DataStore list.

filename: the gff file name, "NC_001224.gff"

We need to open NC_001224.fna

filename.split(".")[0] = "NC_001224"

filename.split(".")[0] + ".fna" = "NC_001224.fna"

Then open the fasta file:

fasta = Fasta(fastafilename)

Go through the data stored in DataStore.

If record.classification is gene: first print out the header, then print out the partial sequence from record.start to record.end.

https://gist.github.com/anonymous/6a7d63ccb82be2a30dca

All things put together

./extract.py *.gff

./extract.py *.gff > gene.fasta    (saves the output as a fasta file)
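The two key steps, building the fasta file name and cutting out the region, can be sketched without pyfasta by using a plain string as the genome. GFF positions count from 1 and include the end position, so they have to be shifted for Python slicing. The function names and the tiny "chromosome" are made up, and the minus strand is not handled in this sketch.

```python
def fasta_name_for(gff_filename):
    """'NC_001224.gff' -> 'NC_001224.fna'"""
    return gff_filename.split('.')[0] + '.fna'


def extract(sequence, start, end):
    """Return the bases from start to end, both 1-based and inclusive as in GFF."""
    return sequence[int(start) - 1:int(end)]


genome = 'AAACCCGGGTTT'            # made-up 12-base "chromosome"
gene = extract(genome, '4', '9')   # gff columns are read as text

print(fasta_name_for('NC_001224.gff'))
print('>example_gene')
print(gene)
```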

Assignments

1. From the NCBI genome database, download all of the faa, gff and fna files for Arabidopsis thaliana and C. elegans and analyze them using the scripts (modify them if you need to).

ftp://ftp.ncbi.nlm.nih.gov/genomes/Caenorhabditis_elegans/
Download CHR-I, II, III, IV, V, X

ftp://ftp.ncbi.nlm.nih.gov/genomes/Arabidopsis_thaliana/
Download CHR-I, II, III, IV, V

2. Calculate the molecular weights of all proteins in each organism and make a histogram.

3. What fraction of the proteins in each proteome is bigger than 30 kDa (30,000) and smaller than 100 kDa (100,000)?

4. Generate a multifasta file containing the protein sequences with 30,000 < MW < 100,000.

5. Extract all of the tRNA sequences in the genome and submit them as a multifasta file.
