Computational Skills for Biology Research, Lecture 5
Computational Skill for Modern Biology Research
Department of Biology, Chungbuk National University
5th Lecture, 2015.10.6
Basics of Programming, Python Scripting & Sequence Manipulation
Syllabus
Week        Topic
Week 1      Introduction: Why do we need to learn this stuff?
Week 2      Basics of Unix and running BLAST on your PC
Week 3      Unix Command Prompt II and shell scripts
Week 4      Basics of programming (Python programming)
Week 5      Python Scripting II and sequence manipulations
Week 6      Python Scripting III and Biopython
Week 7      Python Scripting IV and
Week 8      Next Generation Sequencing
Weeks 9-10  Next Generation Sequencing Analysis
Week 11     R and statistical analysis
Week 12     Bioconductor I
Week 13     Bioconductor II
Week 14     Network analysis
Utilization of Genome Sequencing Data
Most genome sequences are archived at ftp://ftp.ncbi.nlm.nih.gov/genomes.
Today, we will learn how to extract the desired information from them
(and the Python features needed to do so).
ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/Saccharomyces_cerevisiae_uid128/
.asn - nucleotide record in asn.1 format
.faa - protein sequences in fasta format
.ffn - nucleotide sequences of CDS features in fasta format
.fna - total nucleotide sequence in fasta format
.frn - nucleotide sequences of structural RNAs in fasta format
.gbk - full Genbank flat file format
.gff - feature annotation in GFF3 format
.ptt - protein table
.rnt - structural RNA table
.rpt - report file
.val - binary file
Description of each file type
GFF file: contains genome annotations, with columns separated by <TAB>
The columns include the type of region, start, and end
We will learn how to extract the desired information from these files
Download multiple files
Previously, we used curl to download files from the internet.
For downloading multiple files, wget is the more efficient option.
On a Mac, download and install wget: http://rudix.org/packages/wget.html
On Linux, wget is already installed in most cases.
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Fungi/Saccharomyces_cerevisiae_uid128/*.faa
This downloads all .faa files in that directory.
Download the fna, ffn, frn, gff, and ptt files using wget as well!
What we will learn today
Read annotation (gff) files and store the annotation data in a Python dictionary
Using the stored annotation data, search for a desired gene and find its location
Based on the list of genes, extract the desired sequences from the genome fasta file
First, we need to read the annotation file
A gff file is just a text file. We then extract only the desired information (parsing).
Modification of a script from the last lecture
#!/usr/bin/python
import sys

for filename in sys.argv[1:]:
    f = open(filename, 'r')
    content = f.readlines()
    f.close()    # the parentheses are required; f.close alone does not close the file
    for line in content:
        print(line)
Save it as gff.py, then run chmod +x gff.py
./gff.py *.gff
This does the same as cat *.gff
Next, inside the loop over content (for line in content:), instead of just printing each line, we will extract information from the variable line.
Parsing in Python
We need to skip lines that start with #. How can we skip them?
    if line[0] != '#':
        print(line)
If the first character of line (line[0]) is not equal to (!=) '#', process the line.
"is not equal to" is written !=; "is equal to" is written ==
Split a string
split(): splits the content of a string and stores the pieces in a list
line:
NC_001224.1 <tab> RefSeq <tab> region <tab> 1 <tab> 85779 <tab> . <tab> + <tab> . <tab> (attributes)
    separate = line.rstrip('\n').split('\t')
Splitting on the tab character ('\t') keeps the spaces inside the last column intact; a bare split() would also split at those spaces.
separate[0]   NC_001224.1
separate[1]   RefSeq
separate[2]   region
separate[3]   1
separate[4]   85779
separate[5]   .
separate[6]   +
separate[7]   .
separate[8]   (attributes)
Store some of them in separate variables
    name = separate[0]
    classification = separate[2]
    start = separate[3]
    end = separate[4]
    id = separate[8]
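The skip-and-split steps above can be sketched together as follows. This is a minimal Python 3 illustration under stated assumptions, not the lecture's script: the helper name parse_gff_line is made up for this sketch, and the two example lines are typed in by hand with a shortened attribute column.

```python
# Hypothetical helper: turn one GFF line into a list of its nine columns.
def parse_gff_line(line):
    # Comment lines and empty lines carry no feature, so return None for them.
    if not line.strip() or line[0] == '#':
        return None
    # Split on tabs only, so spaces inside the last column are preserved.
    return line.rstrip('\n').split('\t')

# Hand-typed example lines (attribute column shortened for the example).
example_lines = [
    "##gff-version 3\n",
    "NC_001224.1\tRefSeq\tregion\t1\t85779\t.\t+\t.\tID=id0;gbkey=Src\n",
]

for line in example_lines:
    separate = parse_gff_line(line)
    if separate is None:
        continue
    name = separate[0]
    classification = separate[2]
    start = separate[3]
    end = separate[4]
    id = separate[8]
    print(name, classification, start, end, id)
```

Returning None for skipped lines keeps the loop body free of nested if blocks.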
Split one more time
ID=cds18;Name=NP_009328.1;Parent=rna43;Dbxref=SGD:S000007283,Genbank:NP_009328.1,GeneID:854627;gbkey=CDS;product=cytochrome c oxidase subunit 3;protein_id=NP_009328.1;transl_table=3
We need to split the content of id again. This time, we split on ';'.
    items = id.split(';')
items[0]   'ID=cds18'
items[1]   'Name=NP_009328.1'
items[2]   'Parent=rna43'
Store them in a dictionary
items[0]   'ID=cds18'
items[1]   'Name=NP_009328.1'
items[2]   'Parent=rna43'
Split again using '='
    temp = items[0].split('=')
temp[0] is 'ID' and temp[1] is 'cds18'
    keys = {}                  # create an empty dictionary
    keys[temp[0]] = temp[1]    # keys['ID'] is now 'cds18'
After doing the same for the other items:
key      value
ID       cds18
Name     NP_009328.1
Parent   rna43
    items = id.split(';')
    keys = {}
    for item in items:
        if '=' not in item:    # skip empty pieces, e.g. after a trailing ';'
            continue
        temp = item.split('=')
        keys[temp[0]] = temp[1]
id (one string):
ID=cds18;Name=NP_009328.1;Parent=rna43;Dbxref=SGD:S000007283,Genbank:NP_009328.1,GeneID:854627;gbkey=CDS;product=cytochrome c oxidase subunit 3;protein_id=NP_009328.1;transl_table=3
id.split(';') gives items (a list):
ID=cds18
Name=NP_009328.1
Parent=rna43
Dbxref=SGD:S000007283,Genbank:NP_009328.1,GeneID:854627
gbkey=CDS
product=cytochrome c oxidase subunit 3
protein_id=NP_009328.1
transl_table=3
item.split('=') on each item gives the key and value stored in keys:
ID             cds18
Name           NP_009328.1
Parent         rna43
Dbxref         SGD:S000007283,Genbank:NP_009328.1,GeneID:854627
gbkey          CDS
product        cytochrome c oxidase subunit 3
protein_id     NP_009328.1
transl_table   3
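The two-level split described above can be wrapped into one small function. The sketch below is an illustration in Python 3; the function name parse_attributes is hypothetical, and splitting only at the first '=' is a defensive choice of this sketch rather than something shown on the slides.

```python
# Hypothetical helper: convert a GFF attribute column into a dictionary.
def parse_attributes(attribute_text):
    keys = {}
    for item in attribute_text.split(';'):
        # Skip empty pieces (e.g. after a trailing ';') and malformed items.
        if '=' not in item:
            continue
        # Split only at the first '=' so values containing '=' stay whole.
        key, value = item.split('=', 1)
        keys[key] = value
    return keys

id = ("ID=cds18;Name=NP_009328.1;Parent=rna43;"
      "Dbxref=SGD:S000007283,Genbank:NP_009328.1,GeneID:854627;"
      "gbkey=CDS;product=cytochrome c oxidase subunit 3;"
      "protein_id=NP_009328.1;transl_table=3")
keys = parse_attributes(id)
print(keys['ID'])
print(keys['product'])
```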
Retrieve from dictionary
    if 'product' in keys:
        product = keys['product']
    else:
        product = ''
Sometimes there is no 'product' item in keys. Before reading from a dictionary, you have to check whether the key is in it; a plain product = keys['product'] raises a KeyError when the key is missing.
If 'product' is in keys, get the value stored under the key 'product' and put it in the variable product. Otherwise, set product to an empty string.
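As a side note, Python dictionaries also offer the get() method, which performs the same check in one line. The sketch below compares both forms on a small hand-typed dictionary; it is an alternative spelling, not what the lecture's script uses.

```python
keys = {'ID': 'cds18', 'Name': 'NP_009328.1', 'Parent': 'rna43'}

# Explicit check, as in the lecture.
if 'product' in keys:
    product = keys['product']
else:
    product = ''

# Same result in one line: get() returns the default when the key is missing.
product_short = keys.get('product', '')
name = keys.get('Name', '')
print(repr(product), repr(product_short), name)
```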
Put all of them together
https://gist.github.com/anonymous/fc1ad3ff14a0e0298eca
The script only prints a record if its classification is "mRNA".
You can change this if you want to extract "CDS" instead.
Execute it
Download the script and change its permission:
chmod +x gff2.py
./gff2.py *.gff    (processes all gff files in the directory)
Save results as file
First, open a file for saving:
    save = open('mRNA.txt', 'w')
Instead of 'r', we use the 'w' flag, for "write".
Then change the print call like this, passing the save file handle with file=:
    print(filename, classification, start, end, product, name, note, gene, file=save)
(In Python 2, this form of print needs "from __future__ import print_function" at the top of the script.)
At the end of the script, close the save file:
    save.close()
https://gist.github.com/anonymous/b7ded2ebed8f94f69798
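The open / print / close sequence can also be written with a with block, which closes the file automatically. The following Python 3 sketch assumes a temporary output directory and two hand-typed rows taken from the slide examples; it is not the gist's code.

```python
import os
import tempfile

# Hand-typed rows standing in for parsed gff records (values from the slides).
rows = [
    ("NC_001148.4", "mRNA", "939992", "941136", "Arr3p"),
    ("NC_001148.4", "mRNA", "943032", "943896", "Hypothetical"),
]

# Write into a temporary directory so the example leaves the working directory untouched.
out_path = os.path.join(tempfile.mkdtemp(), "mRNA.txt")

# 'with' closes the file automatically, even if an error occurs.
with open(out_path, "w") as save:
    for row in rows:
        # sep='\t' writes tab-separated columns; file=save redirects print.
        print(*row, sep="\t", file=save)

# Read the file back to confirm what was written.
with open(out_path) as check:
    saved_lines = check.read().splitlines()
print(saved_lines)
```

Tab-separated output opens cleanly in Excel and is easy to split again in Python.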
Storing extracted data in memory
In the previous examples, we extracted the desired data and printed it (or saved it to a file).
But in some cases, we need to store the data in memory and use it later.
We extract data like this:
Filename      Classification  Start   End     Product       Name            Note
NC_001148.4   mRNA            939992  941136  Arr3p         NM_001184298.1
NC_001148.4   mRNA            943032  943896  Hypothetical  NM_001184290.1
NC_001148.4   mRNA            943880  944188  Hypothetical  NM_001184300.1
....
How can we store data like this?
Modules and Packages
Various functions (file handling, graphics, calculation, ...) are built in and distributed with Python.
This collection is called "The Python Standard Library".
A package contains one or more modules, and a module is imported from a package like this:
    from package import module
Collections and namedtuple
We will use a module called collections, which is part of the standard library.
From collections, we will use the function namedtuple.
    from collections import namedtuple
Name of variable = namedtuple('Name of variable', list containing member names)
    data = namedtuple('data', ['x', 'y', 'z'])
An instance of data then has the members .x, .y, and .z.
It is basically the same as a tuple, but we can access the members by name.
    Record = namedtuple("Record", ["filename", "classification", "start", "end", "product", "gene", "name", "note"])
    firstrecord = Record("NC_001148.4", "mRNA", "939992", "941136", "Arr3p", "", "NM_001184298.1", "")
    secondrecord = Record("NC_001148.4", "mRNA", "943032", "943896", "Hypothetical", "", "NM_001184290.1", "")
firstrecord    Filename: NC_001148.4, Classification: mRNA, Start: 939992, End: 941136, Product: Arr3p, Name: NM_001184298.1, Note: (empty)
secondrecord   Filename: NC_001148.4, Classification: mRNA, Start: 943032, End: 943896, Product: Hypothetical, Name: NM_001184290.1, Note: (empty)
Check content of record
You can store records in a list
    DataStore = []                    # initialize a list named 'DataStore'
    DataStore.append(firstrecord)     # append the first Record to DataStore
    DataStore.append(secondrecord)    # append the second Record to DataStore
DataStore[0] is firstrecord
DataStore[1] is secondrecord
DataStore[0].start   939992
DataStore[0].end     941136
DataStore[1].start   943032
DataStore[1].end     943896
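A compact, runnable version of the namedtuple-plus-list idea is sketched below in Python 3. Two choices here are assumptions of this sketch rather than the lecture's code: the fields are passed as keyword arguments, and start and end are stored as integers so that a feature length can be computed.

```python
from collections import namedtuple

Record = namedtuple("Record", ["filename", "classification", "start", "end",
                               "product", "gene", "name", "note"])

DataStore = []
# Keyword arguments make it hard to put a value into the wrong field.
DataStore.append(Record(filename="NC_001148.4", classification="mRNA",
                        start=939992, end=941136, product="Arr3p",
                        gene="", name="NM_001184298.1", note=""))
DataStore.append(Record(filename="NC_001148.4", classification="mRNA",
                        start=943032, end=943896, product="Hypothetical",
                        gene="", name="NM_001184290.1", note=""))

for record in DataStore:
    # Fields are read by name; start and end are ints here, so arithmetic works.
    length = record.end - record.start + 1
    print(record.name, record.product, length)
```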
Back to gff parsing
Store all of the data in DataStore
https://gist.github.com/anonymous/03da342a3d21dcc20e66
The script does the following:
Define a namedtuple called 'Record' and initialize a list called DataStore
Append each Record to DataStore
Retrieve the records in DataStore and print them
Save each feature type in a separate file
The classification could be:
gene, exon, tRNA, mRNA, CDS, ncRNA, region
Save them to separate files such as gene.txt, exon.txt, mRNA.txt, ...
We store these classifications in a tuple named 'classes':
    ('gene', 'exon', 'tRNA', 'mRNA', 'ncRNA', 'CDS', 'region')
savefile is a dictionary that stores the file handles used for saving.
When oneclass = 'gene':
    Open the file 'gene.txt' in write mode
    Store the handle in the dictionary with the class name as the key
    savefile['gene'] is the file handle for gene.txt
When oneclass = 'exon':
    Open the file 'exon.txt' in write mode
    Store the handle in the dictionary with the class name as the key
    savefile['exon'] is the file handle for exon.txt
    if record.classification in classes:
        print(…...., file=savefile[record.classification])
If record.classification is 'gene', savefile['gene'] holds the file handle for gene.txt, so the current record is written to gene.txt.
When everything is done, close the files.
All put together:
https://gist.github.com/anonymous/03e10ec56f679bab608f
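The dictionary-of-file-handles idea can be sketched as follows. This is a Python 3 illustration with hand-made records and a temporary output directory, both assumptions of the sketch; the shortened Record has only three fields to keep the example small.

```python
import os
import tempfile
from collections import namedtuple

Record = namedtuple("Record", ["classification", "start", "end"])
# Hand-made records for illustration only (not real annotation data).
DataStore = [Record("gene", 1, 100), Record("exon", 1, 50),
             Record("gene", 200, 300), Record("repeat", 5, 9)]

classes = ('gene', 'exon', 'tRNA', 'mRNA', 'ncRNA', 'CDS', 'region')
outdir = tempfile.mkdtemp()   # temporary directory, an assumption of this sketch

# One open file handle per class, stored under the class name.
savefile = {}
for oneclass in classes:
    savefile[oneclass] = open(os.path.join(outdir, oneclass + '.txt'), 'w')

for record in DataStore:
    # Records whose classification is not in classes are skipped.
    if record.classification in classes:
        print(record.classification, record.start, record.end,
              file=savefile[record.classification])

# Close every handle once all records are written.
for handle in savefile.values():
    handle.close()

with open(os.path.join(outdir, 'gene.txt')) as f:
    gene_lines = f.read().splitlines()
print(gene_lines)
```

Looking up the handle by record.classification replaces a long chain of if/elif tests, one per feature type.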
File handling modules
glob: get a list containing file names
>>> import glob
>>> glob.glob('*.*')
['CDS.txt', 'exon.txt', 'gene.txt', 'gff.py', ….]
>>> list = glob.glob('*.gff')
>>> list
['NC_001133.gff', 'NC_001134.gff', .... 'NC_001224.gff']
>>>
Open and process all of the files obtained from glob:
    filelist = glob.glob('*.gff')
    for file in filelist:
        do something
Glob example
https://gist.github.com/anonymous/3eff4b86bd0f3aa55db9
Read all gff files in the directory
Each file is read into singlefile and then merged into contents
Save the result as 'total'
This does the same as cat *.gff > total
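The merge described above might look like the following Python 3 sketch. It creates two tiny stand-in files in a temporary directory so that it runs anywhere; the file names and contents are made up, and the gist's actual code may differ.

```python
import glob
import os
import tempfile

# Create two tiny stand-in gff files in a temporary directory.
workdir = tempfile.mkdtemp()
for name, text in [("a.gff", "line from a\n"), ("b.gff", "line from b\n")]:
    with open(os.path.join(workdir, name), "w") as f:
        f.write(text)

# sorted() gives a reproducible order; glob itself does not guarantee one.
filelist = sorted(glob.glob(os.path.join(workdir, "*.gff")))

contents = []
for filename in filelist:
    with open(filename) as f:
        contents.extend(f.readlines())

# Write the merged lines to 'total', like: cat *.gff > total
total_path = os.path.join(workdir, "total")
with open(total_path, "w") as out:
    out.writelines(contents)

print(len(filelist), contents)
```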
Execute an external program
subprocess module, call function
As in shell scripts, we sometimes need to run an external program from a Python script
    from subprocess import call
    call(list_containing_program_and_arguments)
ls -l becomes ['ls', '-l']
Run MUSCLE inside a Python script
Assume the fasta file is saved as "merge.fasta"
Equivalent to the command line: muscle -in merge.fasta -out merge.aln -clw
    call(["muscle", "-in", "merge.fasta", "-out", "merge.aln", "-clw"])
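Since MUSCLE may not be installed on every machine, the sketch below demonstrates the same calling pattern with the Python interpreter itself as the external program. That substitution is an assumption made so the example runs anywhere; the list-of-arguments form is the same one used for muscle.

```python
import subprocess
import sys

# The program and each argument are separate list items, as with ['ls', '-l'].
# sys.executable (the running Python) stands in for muscle here.
command = [sys.executable, "-c", "print('hello from a child process')"]

# call() runs the command, waits for it to finish, and returns its exit status.
status = subprocess.call(command)
print("exit status:", status)

# check_output() additionally captures what the program printed.
output = subprocess.check_output(command).decode().strip()
print(output)
```

An exit status of 0 conventionally means success, so a script can test the return value of call() before reading the output file.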
Other file-related modules
Current directory?
    os.getcwd()
Change directory
    os.chdir(DIRECTORY)
Make a directory under the current directory
    os.mkdir(directory)
Rename a directory
    os.renames('data', 'data2')
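These os calls can be tried safely inside a scratch directory. The sketch below is a minimal Python 3 illustration; the use of tempfile for the scratch directory and of os.rename (the simpler sibling of os.renames) are choices of this sketch.

```python
import os
import tempfile

start_dir = os.getcwd()            # remember where we started
sandbox = tempfile.mkdtemp()       # scratch directory, an assumption of this sketch

os.chdir(sandbox)                  # change directory
os.mkdir('data')                   # make a directory under the current one
os.rename('data', 'data2')         # rename it
listing = os.listdir('.')          # list the current directory
print(listing)

os.chdir(start_dir)                # go back to where we started
```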
Install External Packages
Besides the Python Standard Library, you may want to install external packages to add features.
External packages are developed by outside developers and add more features to Python.
Most scientific packages are external packages developed by scientists, so you need to install them if you need them.
Examples of scientific packages we may be interested in:
BioPython: comprehensive bioinformatics package for Python  http://biopython.org/wiki/Main_Page
NumPy  http://www.numpy.org/
SciPy  http://www.scipy.org/
Matplotlib  http://matplotlib.org/
Pandas: data analysis package  http://pandas.pydata.org/
We will cover some of them in later lectures.
Install External Packages
First, check at your command prompt:
pip
If you see pip's usage message, you are OK. If not, install pip itself.
Install pip (the Python package installer)
http://pip.readthedocs.org/en/stable/installing/
Download get-pip.py
curl -O https://bootstrap.pypa.io/get-pip.py
Change the permission of get-pip.py
chmod +x get-pip.py
Execute get-pip.py (as administrator)
sudo python get-pip.py    (you need to enter your password)
Install the pyfasta package using pip
pyfasta: an external package that makes it easy to access multi-fasta files.
sudo pip install pyfasta
Collecting pyfasta
  Downloading pyfasta-0.5.2.tar.gz
Installing collected packages: pyfasta
  Running setup.py install for pyfasta
Successfully installed pyfasta-0.5.2
Check whether the installation is okay
In the Python interpreter,
If the installation is not okay, you will see an error message here.
pyFasta Example
Combine all of the chromosome and protein files into a single fna (nucleotide) file and a single faa (protein) file
Load yeast.faa into f
In f.keys(), you can see the header of each protein record
In f[list][:], you can see the sequence
To access a specific sequence, you need its header line from the fasta file
./pyfasta1.py
It will print out all of the yeast proteins.
Let's extract using conditions
f[list][:]                   the sequence
len(f[list][:])              the length of the sequence
if len(f[list][:]) > 1000    if the length of the sequence is bigger than 1000
Print out:
Total number of proteins
Proteins that satisfy the condition
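The length filter can be tried without pyfasta installed by using a plain dictionary in its place. The sketch below is a stand-in written for illustration: the reader function read_fasta is hypothetical, the sequences are made-up toy data, and the threshold is lowered from the lecture's 1000 to fit them.

```python
# Hypothetical minimal reader: returns {header: sequence} from FASTA text lines.
def read_fasta(lines):
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            header = line[1:]          # drop the leading '>'
            records[header] = ''
        elif header is not None:
            records[header] += line    # sequences may span several lines
    return records

# Made-up toy sequences; real proteins are much longer.
fasta_lines = [
    ">protein_A toy example",
    "MKTAYIAKQR",
    "QISFVKSHFS",
    ">protein_B toy example",
    "MSDNE",
]
f = read_fasta(fasta_lines)

min_length = 10   # the lecture uses 1000; lowered here to fit the toy data
selected = [h for h in f if len(f[h]) > min_length]

print("Total number of proteins:", len(f))
print("Proteins longer than", min_length, ":", len(selected))
```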
If you want only the list satisfying the conditions
I only want to see the RefSeq ID, protein name, and number of amino acids...
    a = list.split('|')
a[0]   gi
a[1]   6324242
a[2]   ref
a[3]   NP_014312.1
a[4]   Tcb2p [Saccharomyces cerevisiae…]
I don't want [Saccharomyces cerevisiae…]
How can we remove it?
Split again...
a[4]   Mdm1p [Saccharomyces cerevisiae S288c]
    a[4].split('[')[0].strip()
strip(): removes leading and trailing whitespace
https://gist.github.com/anonymous/4fc9f86c82ac7fde4b69
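The header clean-up can be checked on a single string. In the sketch below, the header is assembled from the fragments shown on the slides, so its exact form is an assumption; real headers in yeast.faa may differ slightly.

```python
# Header assembled from the slide fragments (an assumption of this sketch).
header = "gi|6324242|ref|NP_014312.1| Tcb2p [Saccharomyces cerevisiae S288c]"

a = header.split('|')
refseq_id = a[3]
# Keep only the text before '[' and trim the surrounding whitespace.
protein_name = a[4].split('[')[0].strip()
print(refseq_id, protein_name)
```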
"...Instead of the number of amino acids, I want to see the molecular weight"
In the last lecture, we learned how to calculate the molecular weight from an amino acid sequence.
https://gist.github.com/anonymous/74ffdb1407ead96ce560
You can cut and paste this code and modify it. But if you need to calculate the molecular weight several times, that is not efficient.
Functions
If some part of a program is used frequently, make it a separate part, then reuse it.
Without a function:   A B C B E …..
With a function:      A, call B, C, call B, E …..   (B is defined once)
How can we define functions?
Syntax of Functions
Input (arguments)
Calculate from the input and return some values
Return values
    def function_name(arguments):
        …...            # do something
        return value
The return value is optional.
Molecular Weight Function
Argument: sequence, which should contain an amino acid sequence
The function calculates the molecular weight
and returns it (return mweight)
Main part of the program
Put the current sequence into the variable 'sequence'
Call the newly defined function, mw; its result is then stored in the variable 'mweight'
https://gist.github.com/anonymous/3c2bf5ec586fd3280502
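A function of this shape might look like the sketch below. The residue mass table holds commonly quoted average residue masses and is an assumption of this sketch; the values used in the lecture's gist are not shown here and may differ. The peptide is a made-up toy sequence.

```python
# Average residue masses in Da (amino acid minus one water); assumed values.
RESIDUE_MASS = {
    'G': 57.0519, 'A': 71.0788, 'S': 87.0782, 'P': 97.1167, 'V': 99.1326,
    'T': 101.1051, 'C': 103.1388, 'L': 113.1594, 'I': 113.1594, 'N': 114.1038,
    'D': 115.0886, 'Q': 128.1307, 'K': 128.1741, 'E': 129.1155, 'M': 131.1926,
    'H': 137.1411, 'F': 147.1766, 'R': 156.1875, 'Y': 163.1760, 'W': 186.2132,
}
WATER = 18.015   # one water is added back for the free N- and C-termini


def mw(sequence):
    """Return the approximate molecular weight of a protein sequence in Da."""
    mweight = WATER
    for aa in sequence.upper():
        # Unknown letters (e.g. X) raise KeyError, so they are not silently ignored.
        mweight += RESIDUE_MASS[aa]
    return mweight


# Main part of the program: call the function and keep its return value.
sequence = "MKTAYIAKQR"   # made-up toy peptide
mweight = mw(sequence)
print(round(mweight, 1))
```

Once mw() exists, the main loop only needs one line per protein, which is the point of moving the calculation into a function.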
A few modifications
Print out the molecular weights of all proteins
./molweight2.py > mw.csv
Save them and import them into Excel (or whatever statistics program you use)
Imported data and bins for the histogram
Use the Histogram function in Excel.
[Figure: bar chart "Distributions of protein size in yeast". X axis: molecular weight bins (10000 to 200000 in steps of 10000, then 250000, 300000, More). Y axis: Frequency (0 to 900).]
In the next class, we will learn how to draw these graphs in Python.
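The counting step behind such a histogram can also be done in Python before any plotting. The sketch below uses made-up molecular weights and, as a simplification, uniform bins of 10000 up to 300000 plus a 'More' bin; the function name bin_label is hypothetical. Each value is counted in the bin whose upper edge is the smallest edge not below it.

```python
from collections import Counter

# Made-up molecular weights (Da) for illustration only.
weights = [8500.0, 12000.0, 19999.9, 20000.0, 45000.0, 310000.0]

bin_width = 10000
last_edge = 300000   # everything above this goes into the 'More' bin


def bin_label(weight):
    # A value belongs to the smallest upper edge that is >= the value.
    if weight > last_edge:
        return 'More'
    edge = bin_width
    while edge < weight:
        edge += bin_width
    return edge


frequency = Counter(bin_label(w) for w in weights)

# Print the numeric bins in ascending order, then the overflow bin.
for label in sorted(k for k in frequency if k != 'More'):
    print(label, frequency[label])
print('More', frequency['More'])
```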
Extraction of subsequences in pyFasta
First 10 characters
Characters 5 to 15
Extract features using pyFasta and Gff
Previously, we learned how to extract the desired information from a gff file.
Using this information, we can extract the desired sequences from a big fasta file.
A small modification of the previous script (the only difference):
NC_001224.1 RefSeq region 1 85779 . + . ID=id0;Dbxref=taxon:559292;Is_circular=true;gbkey=Src;
The seventh column is the direction of the strand (+: sense, -: antisense)
Parse the gff file and store the records in the DataStore list
filename: the gff file name, e.g. "NC_001224.gff"
We need to open NC_001224.fna
    filename.split('.')[0] = "NC_001224"
    filename.split('.')[0] + ".fna" = "NC_001224.fna"
Then open the fasta file
    fasta = Fasta(fastafilename)
Go through the stored data in DataStore
If record.classification is gene:
    First print out the header
    Then print out the partial sequence from record.start to record.end.
All put together:
https://gist.github.com/anonymous/6a7d63ccb82be2a30dca
./extract.py *.gff
./extract.py *.gff > gene.fasta
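The coordinate handling in this step is easy to get wrong, so here is a small self-contained sketch using a plain Python string in place of a pyfasta record. The toy chromosome, the gene records, and the helper names are made up for illustration. Whether the lecture's script reverse-complements features on the minus strand is not shown, so that part is an assumption of this sketch.

```python
# Toy chromosome and hand-made gene records (not real yeast data).
chromosome = "AAACCCGGGTTTACGTACGT"
genes = [
    {"name": "geneA", "start": 4, "end": 9, "strand": "+"},
    {"name": "geneB", "start": 10, "end": 15, "strand": "-"},
]

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}


def reverse_complement(seq):
    # Read the sequence backwards and swap each base for its partner.
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extract(sequence, start, end, strand):
    # GFF coordinates are 1-based and include both ends;
    # Python slices are 0-based and exclude the end, hence start - 1.
    sub = sequence[start - 1:end]
    if strand == "-":
        sub = reverse_complement(sub)
    return sub


# Print each gene as a FASTA record: header line, then sequence.
for gene in genes:
    print(">" + gene["name"])
    print(extract(chromosome, gene["start"], gene["end"], gene["strand"]))
```

Note that start and end must be integers for slicing; values read from a gff file are strings and need int() first.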
Assignments
1. From the NCBI genome database, download all of the faa, gff, and fna files for Arabidopsis thaliana and C. elegans, and analyze them using the scripts (modify them if you need to)
ftp://ftp.ncbi.nlm.nih.gov/genomes/Caenorhabditis_elegans/
Download CHR-I, II, III, IV, V, X
ftp://ftp.ncbi.nlm.nih.gov/genomes/Arabidopsis_thaliana/
Download CHR-I, II, III, IV, V
2. Calculate the molecular weights of all proteins in each organism and make a histogram
3. What fraction of the proteins in the proteome is bigger than 30 kDa (30,000) and smaller than 100 kDa (100,000)?
4. Generate a multi-fasta file containing the protein sequences with 30,000 < MW < 100,000
5. Extract all of the tRNA sequences from the genome sequences and submit them as a multi-fasta file