Computer Skills for Biology Research, Lecture 6

Computational Skill for Modern Biology Research

Department of Biology, Chungbuk National University

6th Lecture 2015.10.13

iPython Notebook, Pandas, Matplotlib

Syllabus

Week: Topic

Week 1: Introduction: Why do we need to learn this stuff?

Week 2: Basics of Unix and running BLAST on your PC

Week 3: Unix Command Prompt II and shell scripts

Week 4: Basics of programming (Python programming)

Week 5: Python Scripting II and sequence manipulations

Week 6: iPython Notebook and Pandas

Week 7: Tutorial and Basics of Next Generation Sequencing

Week 8: Next Generation Sequencing

Weeks 9-10: Next Generation Sequencing Analysis

Week 11: R and statistical analysis

Week 12: Bioconductor I

Week 13: Bioconductor II

Week 14: Network analysis

Python scripting and research..

Bioinformatic Software

Extract information using scripts..

Repeats

During a multistep analysis, several scripts and result files are generated.

Extracted data

Bioinformatic analysis is a multistep process

Using Python scripts, we can analyze various datasets.

Results

Python scripts (sometimes we need to modify the scripts slightly)

Original data

Various output

Sometimes it is difficult to keep track of what I did before…

Notebook in Experimental Research

1. Protocols…..

2. Experimental Procedures…..

3. Resulting Data…..

4. Analysis.......

All of these should be summarized in the notebook.

How about ‘computational experiments’?

iPython Notebook (Jupyter)

“Electronic Notebook for Python scripts and results”

Run in web browser

As in the Python interpreter, you can test small code snippets and see the results

Long scripts can also be executed

Results are stored in the same file as the scripts

Documentation is kept along with the results and scripts

Documentation

Data & Results

You can distribute an iPython Notebook and share code and results at the same time

You can send the .ipynb file to colleagues

And share them!

You can even publish notebooks on the web

Installing iPython Notebook and Other Packages

You can install iPython into your current Python, but it is more convenient to install everything with a single installer

http://www.continuum.io/downloads

Python + iPython Notebook + many scientific packages

And install them!

You can also install it on Windows

Anaconda Python

If installation is done correctly,

<-Anaconda

Run iPython Notebook

On the command line: ipython notebook

Your default Python is now ‘Anaconda Python’

You may need to install some packages again (e.g., pip install pyfasta)

iPython Notebook (Jupyter)

iPython Notebook will be launched in your web browser

You can return to the iPython Notebook using this address

Go to the directory that contains your previous analyses.

Then make a new notebook

It is like Python Interpreter..

Input some commands…

Press Run (or Ctrl+Enter)

Variables defined in one cell remain available in the other cells

Previously, we have scripts like this

Cut and Paste into iPython Notebook!

https://gist.github.com/anonymous/3c2bf5ec586fd3280502

Results

A slight modification of the script

Save the output to mw.csv

You can check the content of the file with the cat command (as in the shell)

Last time we made the histogram in Excel, but you can also do it in iPython Notebook.

Pandas

Python Data Analysis Library (http://pandas.pydata.org/)

It is included in Anaconda Python, so you don’t need to install it separately.

Pandas is...

Not like this…

More like this…

Excel spreadsheet

<- Each row holds a different record

Each column holds a different kind of data

Let’s look at the previous cell again

%matplotlib inline <- Used to insert graphs inside the notebook

Import the pandas module under the name ‘pd’

data = pd.read_csv : Read a comma-separated file into data
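The cell described above can be sketched as follows. This is a minimal example, not the lecture's actual notebook cell: the column names and values are invented, and an in-memory string stands in for mw.csv so the snippet runs without any file.

```python
import io
import pandas as pd

# Hypothetical stand-in for mw.csv: column names and values are invented.
csv_text = """name,mw
geneA,12000.5
geneB,48000.0
geneC,75000.2
geneD,30500.7
"""

# pd.read_csv accepts a file path such as "mw.csv" or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())        # first rows of the DataFrame
print(data["mw"].mean())  # mean of the 'mw' column
```

In a notebook, running `%matplotlib inline` first and then `data["mw"].hist()` draws the histogram directly below the cell.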

Dataframe

The DataFrame is the basic unit of analysis in Pandas

columns

Multiple rows

Think of it as a spreadsheet in Excel

Dataframe = spreadsheet

Descriptive statistics

Filtering by conditions

Display rows that satisfy a condition (mw > 50000)

Display rows that satisfy multiple conditions: (mw > 10000) and (mw < 50000)

First 500 rows

From row 500 to the end
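A small sketch of descriptive statistics, filtering, and row slicing. The data are invented and the table has only five rows, so the slices use row 2 where the slide uses row 500.

```python
import pandas as pd

# Toy data; the 'mw' values are invented for illustration.
data = pd.DataFrame({
    "name": ["geneA", "geneB", "geneC", "geneD", "geneE"],
    "mw": [8000.0, 12000.0, 48000.0, 75000.0, 30500.0],
})

print(data.describe())              # count, mean, std, min, quartiles, max

heavy = data[data["mw"] > 50000]    # single condition

# Multiple conditions: wrap each one in parentheses and combine with & (and) or | (or).
middle = data[(data["mw"] > 10000) & (data["mw"] < 50000)]

first_rows = data[:2]               # first 2 rows (analogous to the first 500 rows)
rest = data[2:]                     # from row 2 to the end (analogous to "from 500 to end")
```

Python's `and` keyword does not work on whole columns, which is why the element-wise `&` operator is used here.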

Display specific column

Generate a new DataFrame that has only ‘mw’

Generate a new Series from mw

What is the difference between the two?

What is ‘Series’?

Series : one-dimensional labeled array
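The difference between the two selections can be checked directly. A minimal sketch with invented values:

```python
import pandas as pd

data = pd.DataFrame({"name": ["geneA", "geneB"], "mw": [12000.0, 48000.0]})

mw_frame = data[["mw"]]   # list of column names -> DataFrame (2-D, one column)
mw_series = data["mw"]    # single column name   -> Series (1-D)

print(type(mw_frame).__name__, mw_frame.shape)    # DataFrame (2, 1)
print(type(mw_series).__name__, mw_series.shape)  # Series (2,)
```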

math

Index: 0 1 2 3 4

Values: 50 40 20 40 40

Multiple Series can form a DataFrame

Generate Different series

Make a dictionary that maps ‘column name’ : Series

Then, generate DataFrame
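These three steps might look like the sketch below. The names and the english and korean scores are invented; only the math values come from the Series slide.

```python
import pandas as pd

# Step 1: generate the individual Series (invented scores, except 'math').
name = pd.Series(["Kim", "Lee", "Park", "Choi", "Jung"])
english = pd.Series([70, 60, 90, 80, 50])
korean = pd.Series([80, 70, 60, 90, 100])
math = pd.Series([50, 40, 20, 40, 40])

# Step 2: dictionary of 'column name' : Series.
columns = {"name": name, "english": english, "korean": korean, "math": math}

# Step 3: the DataFrame aligns the Series on their shared index (0..4).
scores = pd.DataFrame(columns)
print(scores)
```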

Name english korean math

Series Series Series Series

DataFrame

Calculate the total score of each person

Attach the sum to the original frame: pd.concat([DataFrames_to_concatenate], axis=1)

Sort by sum

New DataFrame with a subset of the columns

Save the DataFrame as a CSV file
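A sketch of the whole sequence: total, attach, sort, select columns, save. The scores are invented, and `sort_values` is the method name in current pandas (the lecture-era notebook may have used a different call).

```python
import io
import pandas as pd

scores = pd.DataFrame({
    "name": ["Kim", "Lee", "Park"],
    "english": [70, 60, 90],
    "korean": [80, 70, 60],
    "math": [50, 40, 30],
})

# Row-wise total over the score columns; the result is a Series named 'sum'.
total = scores[["english", "korean", "math"]].sum(axis=1).rename("sum")

# axis=1 places the pieces side by side, aligned on the index.
with_sum = pd.concat([scores, total], axis=1)

ranked = with_sum.sort_values(by="sum", ascending=False)
summary = ranked[["name", "sum"]]   # new DataFrame with a subset of the columns

# to_csv accepts a path such as "scores.csv"; a buffer is used here to avoid writing a file.
buffer = io.StringIO()
summary.to_csv(buffer, index=False)
print(buffer.getvalue())
```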

Let’s use DataFrames in a real-world example…

In the previous lecture, we wrote a script to parse GFF files; here we modify it

Read all .gff files in the directory

Extract only filename, classification, start, end, gene, name, note

Datastore is a list of dictionaries containing all the data.

{column 1 name : value, column 2 name : value, …}

Generate a DataFrame from the list of dictionaries
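A minimal sketch of the list-of-dictionaries approach. The two GFF-like lines are invented, and the parsing is an assumption about how the lecture script works: it takes the first GFF column (the sequence ID) as ‘filename’ and the third column (the feature type) as ‘classification’.

```python
import pandas as pd

# Two invented GFF-like records; real lines would come from the .gff files.
gff_lines = [
    "NC_001147.6\tRefSeq\tgene\t1000\t2500\t.\t+\t.\tgene=ARP8;Name=ARP8;Note=example",
    "NC_001147.6\tRefSeq\tCDS\t1000\t2500\t.\t+\t0\tgene=ARP8;Name=NP_000001.1;Note=example",
]

datastore = []
for line in gff_lines:
    fields = line.split("\t")
    # Column 9 holds key=value pairs separated by ';'.
    attributes = dict(item.split("=", 1) for item in fields[8].split(";"))
    datastore.append({
        "filename": fields[0],        # sequence ID, used as the file name (assumption)
        "classification": fields[2],  # feature type such as gene or CDS
        "start": fields[3],           # kept as strings, as read from the file
        "end": fields[4],
        "gene": attributes.get("gene", ""),
        "name": attributes.get("Name", ""),
        "note": attributes.get("Note", ""),
    })

# Each dictionary becomes one row; the keys become the column names.
genes = pd.DataFrame(datastore)
print(genes)
```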

Select the rows whose classification is CDS

We want to add a new column, ‘length’

=end-start

How can you do that?

In Excel, you could enter a formula like

length

First, we need to define a function

Apply the diff function to every row in the DataFrame

From each row

First convert the strings to numbers

Then subtract ‘start’ from ‘end’

Then return the value

The output is a ‘Series’

We need to convert it into a DataFrame

Combine the length DataFrame with the gene DataFrame

Select genes longer than 1000 bp

Gene names containing ‘ARP’

Gene names containing ‘CDC’
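The length calculation and the filters above can be sketched as follows. The rows are invented; `diff` is the function name used on the slide, but its body here is a guess at what the lecture notebook does.

```python
import pandas as pd

# Invented rows; start and end are strings, as they are after parsing a text file.
cds = pd.DataFrame({
    "gene": ["ARP8", "CDC28", "ARP2", "YFG1"],
    "start": ["1000", "5000", "9000", "12000"],
    "end": ["2500", "5900", "10200", "12300"],
})

def diff(row):
    """Return end minus start for one row, after converting both to integers."""
    return int(row["end"]) - int(row["start"])

# axis=1 passes each row to diff; the result is a Series with the same index.
length = cds.apply(diff, axis=1)

# Turn the Series into a one-column DataFrame, then attach it side by side.
length_frame = pd.DataFrame({"length": length})
cds_with_length = pd.concat([cds, length_frame], axis=1)

long_genes = cds_with_length[cds_with_length["length"] > 1000]
arp_genes = cds_with_length[cds_with_length["gene"].str.contains("ARP")]
cdc_genes = cds_with_length[cds_with_length["gene"].str.contains("CDC")]
```

For large tables, the vectorized form `cds["end"].astype(int) - cds["start"].astype(int)` gives the same Series without calling a Python function per row.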

We want to extract the 2 kb sequence upstream of each selected gene…

First, extract filename, gene, and start as dictionaries

Each uses the same index as its key

Because filename, genename, and start share the same keys, we can look up the file name for each gene

Import pyFasta

Print Genename[index]

Filename[index] -> NC_001147.6

Filename[index].split(".")[0] -> NC_001147

Extract the sequence from (start position - 2000) to the start position

You could use it for promoter analysis.

Open the FASTA file, e.g., NC_001147.fna
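The coordinate logic can be sketched without pyfasta. In this example a plain dictionary of strings stands in for the opened FASTA file, and all sequences, gene names, and positions are invented; with the real file, the slice would be taken from the FASTA record instead.

```python
# Invented stand-ins: 'sequences' plays the role of the opened FASTA file
# (record name -> sequence); the three dictionaries share the same index keys.
sequences = {"NC_001147": "ACGT" * 1000}   # 4000 bp of dummy sequence
filename = {0: "NC_001147.6", 1: "NC_001147.6"}
genename = {0: "ARP8", 1: "CDC28"}
start = {0: 3000, 1: 1500}

UPSTREAM = 2000

upstream = {}
for index in genename:
    record = filename[index].split(".")[0]     # "NC_001147.6" -> "NC_001147"
    begin = max(0, start[index] - UPSTREAM)    # do not run past the sequence start
    upstream[genename[index]] = sequences[record][begin:start[index]]
    print(genename[index], record, len(upstream[genename[index]]))
```

Two caveats apply to real data: GFF coordinates are 1-based while Python slices are 0-based, and for genes on the minus strand the upstream region lies after the end coordinate and must be reverse-complemented.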

Plotting Experimental Data Using Pandas

How can we draw a boxplot from this dataset?

Group by ‘Treatment’
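A sketch of grouping by treatment. The measurements and the column name ‘value’ are invented; the code computes the quartiles that a boxplot draws, and the plotting call itself is left as a comment because it needs a notebook with Matplotlib.

```python
import pandas as pd

# Invented measurements; the column names are assumptions.
experiment = pd.DataFrame({
    "Treatment": ["control", "control", "control", "drug", "drug", "drug"],
    "value": [1.0, 2.0, 3.0, 4.0, 6.0, 8.0],
})

grouped = experiment.groupby("Treatment")["value"]

# The numbers a boxplot shows for each group: median and quartiles.
summary = grouped.describe()
print(summary[["25%", "50%", "75%"]])

# In a notebook with %matplotlib inline, this draws one box per treatment:
# experiment.boxplot(column="value", by="Treatment")
```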

Line Plots

Assume we have these datasets (11 datasets with different time points)

Save them as CSV

Read Data

Read Data.csv into a DataFrame called plotdata

Generate Time Data

We don’t have time values for each dataset, but we know the interval, so we can generate them.

A Series that starts at 0 and increases in steps of 0.2

Attach the time column to the DataFrame, then plot

X = Time

Y = each of the datasets

Plot the mean values with the standard error

Save the figure as a PDF

Calculate the mean and s.e.m.
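The time axis, mean, s.e.m., and error-bar plot might be put together as in this sketch. It assumes each column of plotdata is one dataset and each row is one time point; the readings are invented and only three datasets are used instead of eleven.

```python
import io
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Invented readings: 3 datasets (columns) measured at 5 time points (rows).
plotdata = pd.DataFrame({
    "set1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "set2": [2.0, 3.0, 4.0, 5.0, 6.0],
    "set3": [3.0, 4.0, 5.0, 6.0, 7.0],
})

# Time axis: starts at 0 and increases by 0.2 per row.
time = pd.Series(np.arange(len(plotdata)) * 0.2, name="Time")

mean = plotdata.mean(axis=1)     # mean across datasets at each time point
sem = plotdata.sem(axis=1)       # standard error of the mean across datasets

fig, ax = plt.subplots()
ax.errorbar(time, mean, yerr=sem, marker="o")
ax.set_xlabel("Time")
ax.set_ylabel("Mean of datasets")

# savefig accepts a path such as "plot.pdf"; a buffer is used here to avoid writing a file.
pdf_buffer = io.BytesIO()
fig.savefig(pdf_buffer, format="pdf")
```

The mean and s.e.m. are computed before the time column is attached, so the time values do not leak into the statistics.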

Other Examples : UCSC Genome Browser Database

A lot of genomic information is needed to display the browser views.

https://genome.ucsc.edu

http://hgdownload.soe.ucsc.edu/downloads.html

We can use these genome-related data for our analysis

Mouse Annotations

“Annotation Database”

Much of the data behind the database is available here

http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/

curl -O "http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/knownGene.txt.gz"

Text file (tab-separated)

curl -O "http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/knownGene.sql"

To use them, we need to know how the data is organized…

<- Column name

Download some of them

<- Column names

Import knownGene.txt into a DataFrame

Use read_table

Filename

Separator is tab (\t)

Column names as a list

names=['name','chrom','strand','txStart','txEnd','cdsStart','cdsEnd','exonCount','ExonStarts','ExonEnds','proteinID','alignID']
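A sketch of the import with the column list above. The two sample rows are invented (including the IDs) and only mimic the tab-separated, header-less layout of knownGene.txt; `pd.read_csv` with `sep="\t"` is used here, which behaves like `read_table`.

```python
import io
import pandas as pd

# Two invented rows in the knownGene.txt layout (tab-separated, no header line).
sample = (
    "uc000aaa.1\tchr1\t-\t1000\t5000\t1200\t4800\t2\t1000,3000,\t2000,5000,\t\tuc000aaa.1\n"
    "uc000aab.1\tchr2\t+\t7000\t9000\t7000\t7000\t1\t7000,\t9000,\t\tuc000aab.1\n"
)

names = ['name', 'chrom', 'strand', 'txStart', 'txEnd', 'cdsStart', 'cdsEnd',
         'exonCount', 'ExonStarts', 'ExonEnds', 'proteinID', 'alignID']

# For the real file, pass its path instead of the buffer, e.g. "knownGene.txt.gz";
# pandas infers gzip compression from the .gz extension.
knownGene = pd.read_csv(io.StringIO(sample), sep="\t", names=names)
print(knownGene[["name", "chrom", "txStart", "txEnd"]])
```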

Download additional data files

http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/knownGeneMrna.txt.gz

http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/knownGenePep.txt.gz

http://hgdownload.soe.ucsc.edu/goldenPath/mm10/database/kgXref.txt.gz

Then import them as Pandas DataFrames

Check DataFrame

knownGene

knownGenePep

We can link the two DataFrames using the ‘name’ column

Using name, we can combine them into a single DataFrame

Join two DataFrames

pd.merge(First_DataFrame, Second_DataFrame, on="column_to_join_on")

Link knownGene and knownGeneMrna on the ‘name’ column

We have added nucleotide and peptide sequence information

Merge knownGene and kgXref

Using the ‘name’ column, link the two DataFrames
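A minimal sketch of the joins on toy tables. The rows and the ‘pep’, ‘geneSymbol’, and ‘description’ column names are assumptions for illustration; only the shared ‘name’ column matters for the merge.

```python
import pandas as pd

# Invented rows; the real tables come from the UCSC downloads.
knownGene = pd.DataFrame({
    "name": ["uc000aaa.1", "uc000aab.1", "uc000aac.1"],
    "chrom": ["chr1", "chr1", "chr2"],
})
knownGenePep = pd.DataFrame({
    "name": ["uc000aaa.1", "uc000aac.1"],
    "pep": ["MKTAYIAK", "MASQGTKR"],
})
kgXref = pd.DataFrame({
    "name": ["uc000aaa.1", "uc000aab.1", "uc000aac.1"],
    "geneSymbol": ["GeneA", "GeneB", "GeneC"],
    "description": ["toy protein A", "toy RNA B", "toy protein C"],
})

# Rows are matched on 'name'; the default is an inner join,
# so uc000aab.1 (no peptide) drops out.
with_pep = pd.merge(knownGene, knownGenePep, on="name")
annotated = pd.merge(with_pep, kgXref, on="name")
print(annotated)
```

Passing `how="left"` keeps every row of the first DataFrame and fills the missing peptide with NaN.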

Select specific columns

Add peptide sequences

Filter records whose description contains ‘cdc’

Genes whose description contains ‘cdc’ and that are located on ‘chromosome 1’

Convert gene symbol and sequence into a dictionary

Output as FASTA

Save File
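The last steps (filter, build a dictionary, write FASTA) can be sketched as below. All rows and column names are invented, and the case-insensitive match is an assumption about how ‘cdc’ should be searched.

```python
import io
import pandas as pd

# Invented annotation table; column names are assumptions.
annotated = pd.DataFrame({
    "geneSymbol": ["GeneA", "GeneB", "GeneC"],
    "chrom": ["chr1", "chr1", "chr2"],
    "description": ["toy Cdc-related protein", "toy actin", "toy CDC homolog"],
    "pep": ["MKTAYIAK", "MDDDIAAL", "MASQGTKR"],
})

has_cdc = annotated["description"].str.contains("cdc", case=False, na=False)
on_chr1 = annotated["chrom"] == "chr1"
selected = annotated[has_cdc & on_chr1]

# zip pairs each gene symbol with its sequence; dict turns the pairs into a lookup table.
sequences = dict(zip(selected["geneSymbol"], selected["pep"]))

# FASTA format: a '>' header line followed by the sequence.
fasta = io.StringIO()
for symbol, seq in sequences.items():
    fasta.write(">{}\n{}\n".format(symbol, seq))
print(fasta.getvalue())
```

To save to disk, replace the buffer with a file opened for writing, for example `open("selected.fa", "w")`, where the file name is hypothetical.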

Assignments

1. Download and install Anaconda Python

2. Install pyfasta in Anaconda Python (sudo pip install pyfasta)

3. Download the iPython Notebooks (https://github.com/madscientist01/ComputationalSkill_CBNU_2015/tree/master/Lect6)

4. Look through them

In the coming weeks, we will have a tutorial and lessons

http://www.continuum.io/downloads
