Computational Skills for Biology Research, Lecture 6
Post on 13-Feb-2017
Computational Skill for Modern Biology Research
Department of Biology, Chungbuk National University
6th Lecture 2015.10.13
IPython Notebook, Pandas, Matplotlib
Syllabus
Week  Topic
Week 1  Introduction: Why do we need to learn this stuff?
Week 2  Basics of Unix and running BLAST on your PC
Week 3  Unix Command Prompt II and shell scripts
Week 4  Basics of programming (Python programming)
Week 5  Python Scripting II and sequence manipulations
Week 6  IPython Notebook and Pandas
Week 7  Tutorial and Basics of Next Generation Sequencing
Week 8  Next Generation Sequencing
Week 9
Week 10  Next Generation Sequencing Analysis
Week 11  R and statistical analysis
Week 12  Bioconductor I
Week 13  Bioconductor II
Week 14  Network analysis
Python scripting and research..
Bioinformatics software
Extract information using scripts
Repeat
During a multistep analysis, several scripts and result files are generated.
Extracted data
Data
Bioinformatic analysis is a multistep process
Using Python scripts, we can analyze various datasets.
Results
Python scripts (sometimes we need to modify the scripts slightly)
Original data
Various outputs
Sometimes it is difficult to keep track of what I did before…
Notebook in Experimental Research
1. Protocols…..
2. Experimental Procedures…..
3. Resulting Data…..
4. Analysis.......
Should be summarized in the notebook.
How about ‘computational experiments’?
iPython Notebook (Jupyter)
“Electronic Notebook for Python scripts and results”
Run in web browser
As in the Python interpreter, you can test small code snippets and see the results
Long scripts can also be executed
Results are stored in the same file as the scripts
Documentation is kept along with results and scripts
Documentation
code
Data & Results
You can distribute an IPython Notebook and share code and results at the same time
You can send the .ipynb file to colleagues
And share them!
You can even publish notebooks on the web
Installing iPython Notebook and Other Packages
You can install IPython into your current Python, but it is more convenient to install everything with a single installer
http://www.continuum.io/downloads
Python + iPython Notebook + many scientific packages
Or
And Install them!
You can also install it on Windows
Anaconda Python
If installation is done correctly,
<-Anaconda
Run IPython Notebook
In the command line: ipython notebook
Your default Python becomes ‘Anaconda Python’
You may need to install some packages again (e.g., pip install pyfasta)
iPython Notebook (Jupyter)
IPython Notebook will be launched in your web browser
You can return to IPython Notebook using this address
Go to the directory where your previous analyses are stored
Then make a new notebook
It is like the Python interpreter
Input some commands…
Press Run (or Ctrl+Enter)
Variables defined in one cell remain available in other cells
Previously, we had a script like this
Cut and paste it into IPython Notebook!
https://gist.github.com/anonymous/3c2bf5ec586fd3280502
Results
A slight modification of the script
Save the output to mw.csv
You can check the content of the file using the cat command (as in the shell)
Last time we made a histogram in Excel, but you can also do it in IPython Notebook.
Pandas
Python Data Analysis Library (http://pandas.pydata.org/)
It is included in Anaconda Python, so you don’t need to install it separately.
Pandas is…
Not like this…
More like this…
Excel spreadsheet
<- Different records in rows
Different kinds of data in columns
Let’s look at the previous cell again
%matplotlib inline <- Used to display graphs inside the notebook
Import the pandas module under the name ‘pd’
data = pd.read_csv : Read a comma-separated file into data
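A minimal sketch of the cell described above. The column names `name` and `mw` and all values are assumptions made for illustration, and an in-memory buffer stands in for mw.csv. In a notebook, `%matplotlib inline` would be run first.

```python
import io
import pandas as pd

# Stand-in for mw.csv; column names and values are made up for illustration.
csv_text = "name,mw\nproteinA,12000.5\nproteinB,56000.0\nproteinC,33000.2\n"

# pd.read_csv accepts a file path or a file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())      # first rows of the DataFrame
print(data.describe())  # descriptive statistics of the numeric columns
```

With the real file, the call would simply take the file name, e.g. `pd.read_csv("mw.csv")`.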
DataFrame
The DataFrame is the basic unit of analysis in Pandas
Columns
Multiple rows
Think of it as a spreadsheet in Excel
DataFrame = spreadsheet
Descriptive statistics
Filtering by conditions
Display rows that satisfy a condition (mw > 50000)
Display rows that satisfy multiple conditions: (mw > 10000) and (mw < 50000)
First 500 rows
From row 500 to the end
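The filtering and slicing steps above can be sketched as follows. The table is a made-up stand-in with an assumed `mw` column, and the slice position is 2 instead of 500 because the toy table has only four rows.

```python
import pandas as pd

# Hypothetical molecular-weight table standing in for the lecture's data.
data = pd.DataFrame({
    "name": ["p1", "p2", "p3", "p4"],
    "mw": [8000, 12000, 45000, 70000],
})

# Single condition: a boolean mask selects the matching rows.
heavy = data[data["mw"] > 50000]

# Multiple conditions: combine masks with & and wrap each one in parentheses.
medium = data[(data["mw"] > 10000) & (data["mw"] < 50000)]

# Positional row slices (the slides use 500 as the cut point).
first_rows = data[:2]
remaining_rows = data[2:]

print(heavy)
print(medium)
```

Python's `and` does not work on whole columns, which is why the element-wise `&` operator is used here.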
Display specific column
Generate a new DataFrame that has only ‘mw’
Generate a new Series from mw
What is the difference between the two?
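A short sketch of the two selections, using a made-up table with an assumed `mw` column: double brackets return a DataFrame, single brackets return a Series.

```python
import pandas as pd

# Made-up values for illustration.
data = pd.DataFrame({"name": ["p1", "p2"], "mw": [12000, 56000]})

mw_frame = data[["mw"]]   # list of column names -> DataFrame (2-dimensional)
mw_series = data["mw"]    # single column name   -> Series (1-dimensional)

print(type(mw_frame).__name__, mw_frame.shape)    # DataFrame (2, 1)
print(type(mw_series).__name__, mw_series.shape)  # Series (2,)
```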
What is ‘Series’?
Series : one-dimensional labeled array
Series name: math
Index:  0  1  2  3  4
Values: 50 40 20 40 40
Multiple Series can form a DataFrame
Generate different Series
Make a dictionary containing ‘column name’ : Series
Then generate a DataFrame
Name, english, korean, math: each column is a Series
Together they form a DataFrame
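A minimal sketch of building a DataFrame from several Series. The column names follow the slide; the names and scores themselves are invented for illustration.

```python
import pandas as pd

# Hypothetical names and scores, used only for illustration.
name = pd.Series(["Kim", "Lee", "Park"])
english = pd.Series([50, 70, 90])
korean = pd.Series([80, 60, 100])
math = pd.Series([50, 40, 20])

# Dictionary of 'column name': Series, then build the DataFrame from it.
scores = pd.DataFrame({
    "Name": name,
    "english": english,
    "korean": korean,
    "math": math,
})
print(scores)
```

The Series are aligned on their index (0, 1, 2 here), so each row holds the values that share the same index label.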
Calculate the total score of each person
Attach the sum to the original frame: pd.concat([DataFrame_to_concatenate], axis=1)
Sort by sum
New DataFrame with a subset of the columns
Save the DataFrame as a CSV file
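The steps above can be sketched as follows, again with invented scores. `sort_values` is the current pandas method and may differ from the call used in the original notebook; an in-memory buffer stands in for the CSV file.

```python
import io
import pandas as pd

# Invented scores for illustration.
scores = pd.DataFrame({
    "Name": ["Kim", "Lee", "Park"],
    "english": [50, 70, 90],
    "korean": [80, 60, 100],
    "math": [50, 40, 20],
})

# Row-wise total of the three score columns; the result is a Series.
total = scores[["english", "korean", "math"]].sum(axis=1)
total.name = "sum"

# Attach the new column side by side (axis=1) to the original frame.
with_sum = pd.concat([scores, total], axis=1)

# Sort by the total, highest first, and keep only some of the columns.
ranked = with_sum.sort_values("sum", ascending=False)
summary = ranked[["Name", "sum"]]

# Write CSV; a StringIO buffer is used here instead of a file on disk.
buffer = io.StringIO()
summary.to_csv(buffer, index=False)
print(buffer.getvalue())
```

To write a real file, pass a file name to `to_csv` instead of the buffer.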
Let’s use DataFrames in real-world examples…
In the previous lecture, we wrote a script to parse GFF files and modified it
Read all .gff files in the directory
Extract only filename, classification, start, end, gene, name, note
Datastore is a list of dictionaries containing all the data.
{column 1 name : value, column 2 name : value, …}
Generate a DataFrame from the list of dictionaries
Select items whose classification is CDS
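A minimal sketch of turning a list of dictionaries into a DataFrame and selecting the CDS rows. The records below are invented (the coordinates are not real annotations) and carry only some of the fields listed above.

```python
import pandas as pd

# Hypothetical records shaped like the parsed GFF output; coordinates are made up.
datastore = [
    {"filename": "NC_001147.6", "classification": "gene",
     "start": "1000", "end": "2500", "gene": "ARP8"},
    {"filename": "NC_001147.6", "classification": "CDS",
     "start": "1000", "end": "2500", "gene": "ARP8"},
    {"filename": "NC_001147.6", "classification": "CDS",
     "start": "4000", "end": "4600", "gene": "CDC33"},
]

# Each dictionary becomes one row; the keys become the column names.
gff = pd.DataFrame(datastore)

# Keep only the rows whose classification is CDS.
cds = gff[gff["classification"] == "CDS"]
print(cds)
```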
We want to add a new column ‘length’
= end - start
How can you do that?
In Excel, you can enter a formula like
length
First, we need to define a function
Apply the diff function to every row in the DataFrame
From each row
First convert the strings to numbers
Then subtract ‘start’ from ‘end’
Then return the value
The output is a ‘Series’
We need to generate a DataFrame from it
Combine the length DataFrame with gene
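A sketch of the row-wise calculation described above. The function name `diff` comes from the slide; the table and its coordinates are invented, and `start`/`end` are stored as strings, as they would be after parsing a text file.

```python
import pandas as pd

# Made-up CDS table; start and end are still strings after parsing.
cds = pd.DataFrame({
    "gene": ["ARP8", "CDC33"],
    "start": ["1000", "4000"],
    "end": ["2500", "4600"],
})

def diff(row):
    """Convert the text coordinates to integers and return end - start."""
    return int(row["end"]) - int(row["start"])

# axis=1 passes each row to diff; the result is a Series with the same index.
length = cds.apply(diff, axis=1)

# Turn the Series into a one-column DataFrame, then join it to the original columns.
length_frame = pd.DataFrame({"length": length})
cds_with_length = pd.concat([cds, length_frame], axis=1)
print(cds_with_length)
```

A shorter alternative with the same result is `cds["end"].astype(int) - cds["start"].astype(int)`, which subtracts whole columns at once.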
Select genes longer than 1000 bp
Gene name contains ‘ARP’
Gene name contains ‘CDC’
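These selections can be sketched with a numeric mask and `Series.str.contains`. The gene table is invented for illustration.

```python
import pandas as pd

# Made-up gene table with a precomputed length column.
genes = pd.DataFrame({
    "gene": ["ARP8", "CDC33", "CDC5", "ACT1"],
    "length": [1500, 600, 2100, 1100],
})

# Numeric condition on the length column.
long_genes = genes[genes["length"] > 1000]

# Substring match on the gene name; na=False treats missing names as non-matching.
arp_genes = genes[genes["gene"].str.contains("ARP", na=False)]
cdc_genes = genes[genes["gene"].str.contains("CDC", na=False)]

print(long_genes)
print(arp_genes)
print(cdc_genes)
```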
We want to extract the 2 kb sequence upstream of each selected gene…
First, extract filename, gene and start as dictionaries
Each uses the same index as its key
Because filename, gene name and start share the same keys, we can look up the file name for each gene
Import pyfasta
Print Genename[index]
Filename[index] -> NC_001147.6
Filename[index].split(".")[0] -> NC_001147
Extract the sequence from (start position - 2000) to the start position
You may use it for promoter analysis
Open the FASTA file, e.g. NC_001147.fna
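A rough sketch of the upstream extraction using plain string slicing on a toy sequence instead of pyfasta. The helper `upstream_region` is hypothetical, and it assumes 1-based start coordinates and genes on the forward strand; for reverse-strand genes the upstream region lies after the gene end and would need to be reverse-complemented.

```python
def upstream_region(sequence, start, size=2000):
    """Return up to `size` bases immediately before a 1-based `start` position."""
    begin = max(0, start - 1 - size)   # clamp at the beginning of the sequence
    return sequence[begin:start - 1]

# Toy 4000-base sequence standing in for a chromosome read from a FASTA file.
chromosome = "ACGT" * 1000

# Derive the FASTA file name from the GFF file name, as on the slide.
filename = "NC_001147.6"
accession = filename.split(".")[0]     # 'NC_001147'
fasta_name = accession + ".fna"        # 'NC_001147.fna'

promoter = upstream_region(chromosome, start=3001)   # full 2000 bases available
near_edge = upstream_region(chromosome, start=501)   # only 500 bases available

print(fasta_name, len(promoter), len(near_edge))
```

Clamping at zero avoids a negative slice start, which in Python would silently count from the end of the sequence.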
Plotting Experimental Data using Pandas
How can we draw a boxplot from this dataset?
Group by ‘Treatment’
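A minimal sketch of a grouped boxplot, assuming a table with a `Treatment` column and a single measurement column called `value` (both the column name `value` and the numbers are invented). The Agg backend is selected so the code runs outside a notebook; in a notebook, `%matplotlib inline` is used instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up measurements for two treatment groups.
experiment = pd.DataFrame({
    "Treatment": ["control"] * 4 + ["drug"] * 4,
    "value": [1.0, 1.2, 0.9, 1.1, 2.0, 2.3, 1.8, 2.1],
})

# Per-group mean, computed with groupby.
group_means = experiment.groupby("Treatment")["value"].mean()
print(group_means)

# One box per Treatment group, drawn on an explicit axes object.
fig, ax = plt.subplots()
experiment.boxplot(column="value", by="Treatment", ax=ax)
ax.set_ylabel("value")
```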
Line Plots
Assume we have these datasets (11 datasets with different time points)
Save them as csv
Read Data
Read Data.csv and make a DataFrame called plotdata
Generate Time Data
We don’t have time data for each dataset, but we know the interval, so we can generate it.
The Series starts from 0 and increases in steps of 0.2
Attach time to the DataFrame, then plot
X = Time
Y = each of the datasets
Plot with mean values and standard error
Save figures as PDF
Calculate mean and s.e.m.
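The plotting steps above can be sketched as follows. The table is a made-up stand-in for Data.csv with 3 datasets instead of 11; the column names and the `Time` label are assumptions, and the PDF is written to an in-memory buffer rather than to disk.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up stand-in for Data.csv: 3 datasets, 5 time points each.
plotdata = pd.DataFrame({
    "set1": [0.0, 1.0, 2.0, 3.0, 4.0],
    "set2": [0.2, 1.1, 2.3, 2.9, 4.2],
    "set3": [0.1, 0.9, 1.9, 3.2, 3.9],
})

# Time axis: starts at 0 and increases by 0.2 per row.
time = pd.Series(np.arange(len(plotdata)) * 0.2, name="Time")

# Attach time to the data and draw one line per dataset.
with_time = pd.concat([time, plotdata], axis=1)
ax_lines = with_time.plot(x="Time")

# Mean and standard error of the mean across datasets, per time point.
mean = plotdata.mean(axis=1)
sem = plotdata.sem(axis=1)

# Mean curve with error bars, saved as PDF.
fig, ax = plt.subplots()
ax.errorbar(time, mean, yerr=sem, marker="o")
ax.set_xlabel("Time")
ax.set_ylabel("Mean of datasets")
pdf_buffer = io.BytesIO()
fig.savefig(pdf_buffer, format="pdf")
```

Passing a file name such as `"plot.pdf"` to `savefig` would write the figure to disk instead.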
Other Examples : UCSC Genome Browser Database
A lot of genomic information is needed to display it.
https://genome.ucsc.edu
http://hgdownload.soe.ucsc.edu/downloads.html
We can use these genome-related data for our analysis
Mouse Annotations
“Annotation Database”
Much of the data behind the database is available here
http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/
curl -O "http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/knownGene.txt.gz"
Text file (tab-separated)
curl -O "http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/knownGene.sql"
To use them, we need to know how the data are organized…
<-Column name
Download some of them
<-Column name
Import knownGene.txt into a DataFrame
Use read_table
Filename
Separator is tab (\t)
Column names as a list
names=['name','chrom','strand','txStart','txEnd','cdsStart','cdsEnd','exonCount','ExonStarts','ExonEnds','proteinID','alignID']
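A minimal sketch of the import, using the column list from the slide. The two rows are invented and only mimic the tab-separated layout; an in-memory buffer stands in for knownGene.txt.

```python
import io
import pandas as pd

names = ['name', 'chrom', 'strand', 'txStart', 'txEnd', 'cdsStart', 'cdsEnd',
         'exonCount', 'ExonStarts', 'ExonEnds', 'proteinID', 'alignID']

# Two invented rows in the same tab-separated layout (not real annotations).
rows = [
    ["tx1", "chr1", "+", "100", "900", "150", "850", "2", "100,500,", "300,900,", "P1", "a1"],
    ["tx2", "chr2", "-", "2000", "2600", "2100", "2500", "1", "2000,", "2600,", "P2", "a2"],
]
text = "\n".join("\t".join(row) for row in rows) + "\n"

# The file has no header line, so the column names are supplied explicitly.
known_gene = pd.read_table(io.StringIO(text), sep="\t", names=names)
print(known_gene[["name", "chrom", "txStart", "txEnd"]])
```

With the downloaded file, the first argument would be the file name instead of the buffer.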
Download other data files
http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/knownGeneMrna.txt.gz
http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/knownGenePep.txt.gz
http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/kgXref.txt.gz
Then import them as Pandas DataFrames
Check the DataFrames
knownGene
knownGenePep
We can link the two DataFrames using the ‘name’ column
Using name, we can combine them into one DataFrame
Join two DataFrames
pandas.merge(First_DataFrame, Second_DataFrame, on="column_to_join_on")
Link knownGene and knownGeneMrna through the ‘name’ column
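A small sketch of the join on the ‘name’ column. Both tables are invented, and the column name `seq` for the sequence is an assumption.

```python
import pandas as pd

# Invented stand-ins for the knownGene and knownGeneMrna tables.
known_gene = pd.DataFrame({
    "name": ["tx1", "tx2", "tx3"],
    "chrom": ["chr1", "chr2", "chr1"],
})
known_gene_mrna = pd.DataFrame({
    "name": ["tx1", "tx2"],
    "seq": ["ATGGCC", "ATGTTT"],
})

# Rows are matched on 'name'; the default inner join keeps only names found in both.
merged = pd.merge(known_gene, known_gene_mrna, on="name")
print(merged)
```

Here `tx3` is dropped because it has no sequence; passing `how="left"` would keep it with a missing value in `seq`.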
We added nucleotide and peptide sequence information
Merge knownGene and kgXref
Using the ‘name’ column, link the two DataFrames
Select specific columns
Add peptide sequences
Filter records that contain ‘cdc’ in the description
Genes that contain ‘cdc’ in the description and are located on ‘chromosome 1’
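A sketch of the two filters, assuming columns named `geneSymbol`, `chrom` and `description` and chromosome labels such as `chr1`. All rows are invented, and the case-insensitive match (`case=False`) is a choice made here, not something stated on the slides.

```python
import pandas as pd

# Invented rows; symbols and descriptions are placeholders.
genes = pd.DataFrame({
    "geneSymbol": ["GeneA", "GeneB", "GeneC", "GeneD"],
    "chrom": ["chr1", "chr2", "chr1", "chr1"],
    "description": ["CDC-like kinase (made up)", "cdc regulator (made up)",
                    "unrelated protein", None],
})

# Boolean mask: description contains 'cdc'; missing descriptions count as no match.
has_cdc = genes["description"].str.contains("cdc", case=False, na=False)

cdc_genes = genes[has_cdc]
cdc_on_chr1 = genes[has_cdc & (genes["chrom"] == "chr1")]

print(cdc_genes)
print(cdc_on_chr1)
```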
Convert gene symbols and sequences into a dictionary
Output as FASTA
Save the file
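A minimal sketch of the last three steps. The table, the column names `geneSymbol` and `seq`, and the helper `to_fasta` are all hypothetical; the text is written to an in-memory buffer rather than to a file.

```python
import io
import pandas as pd

# Invented selection of genes with peptide sequences.
selected = pd.DataFrame({
    "geneSymbol": ["GeneA", "GeneB"],
    "seq": ["MSTNPKPQRK", "MAAGLLV"],
})

# Dictionary mapping gene symbol -> sequence.
sequences = dict(zip(selected["geneSymbol"], selected["seq"]))

def to_fasta(records, width=60):
    """Format a {name: sequence} dictionary as FASTA text, wrapping lines at `width`."""
    lines = []
    for symbol, seq in records.items():
        lines.append(">" + symbol)
        for i in range(0, len(seq), width):
            lines.append(seq[i:i + width])
    return "\n".join(lines) + "\n"

fasta_text = to_fasta(sequences)

# A buffer is used here; open("selected.fasta", "w") would write a real file.
out = io.StringIO()
out.write(fasta_text)
print(fasta_text)
```

If two rows share a gene symbol, the dictionary keeps only the last one, so duplicates should be checked before this step.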
Assignments
1. Download and install Anaconda Python
2. Install pyfasta in Anaconda Python (sudo pip install pyfasta)
3. Download the IPython Notebooks (https://github.com/madscientist01/ComputationalSkill_CBNU_2015/tree/master/Lect6)
4. Look through them
In the coming weeks, we will have tutorials and lessons
http://www.continuum.io/downloads