Post on 06-Jan-2018

96-Summer Bioinformatics Programming Practicum (II)

Bioinformatics with Perl

8/13~8/22 蘇中才
8/24~8/29 張天豪
8/31 曾宇鳯

Before class

Course web page:
http://gene.csie.ntu.edu.tw/~sbb/summer-course/

Setup steps:
Download PuTTY / PieTTY
Connect to 140.112.28.186
wget http://gene.csie.ntu.edu.tw/~sbb/summer-course/doc/course1.tgz
tar zxvf course1.tgz

No.  Name    Account
1    許郁彬  course1
2    杜羿樞  course2
3    黃裕雄  course3
4    王建智  course4
5    陳士杰  course5
6    莊智傑  course6
7    朱柏威  course7
8    洪文峯  course8
9    吳耿豪  course9
10   張雯琪  course10
11   王悅    course11
12   張嘉芸  course12
13   林義峰  course13
14   游棨元  course14
15   許育堂  course15
16   陳建瑋  course16
17   黃國鑫  course17
18   翁小涵  course18
19   郭建鴻  course19
20   曾意儒  course20

Appendix

Scalar, Array, Hash

Variable reset (1/2)

$scalar = undef;
$scalar = "";
$scalar = 0;

@array = ();
%hash = ();

Variable reset (2/2)

@array = undef;         # pitfall: stores one undef element instead of emptying the array

print scalar(@array);   # prints 1, not 0
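The pitfall on this slide can be checked side by side with the two forms that really empty an array. This is a minimal sketch; the variable names are illustrative and not part of the course material.

```perl
use strict;
use warnings;

# Assigning undef stores ONE element (the undef value) in the array.
my @wrong = (1, 2, 3);
@wrong = undef;
my $wrong_count = scalar(@wrong);

# Assigning the empty list really empties the array.
my @right = (1, 2, 3);
@right = ();
my $right_count = scalar(@right);

# The undef function applied to the array itself also empties it.
my @also = (1, 2, 3);
undef @also;
my $also_count = scalar(@also);

print "wrong=$wrong_count right=$right_count also=$also_count\n";
# prints: wrong=1 right=0 also=0
```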

Array

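my @number = ("one", "two", "three");
my $number = ("one", "two", "three");   # scalar assignment: the comma operator keeps only the last value

print "@number\n";             # one two three
print scalar(@number)."\n";    # 3 (number of elements)
print $#number."\n";           # 2 (index of the last element)
print @number."\n";            # 3 (concatenation forces scalar context)
print $number."\n";            # three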

Array

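@array = qw(5 4 9 8 1 3 6 2 7 10);

print "@array\n";      # 5 4 9 8 1 3 6 2 7 10 (elements joined by spaces)
print @array."\n";     # 10 (number of elements)
print @array;          # 54981362710 (elements printed with no separator)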

Array – sort by number

#! /usr/bin/perl

@test=(1, 5, 4, 22, 9, 101);

@mmm=sort {$a<=>$b} @test;

print join(',', @mmm), "\n\n";   # parentheses keep "\n\n" out of the joined list
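The reason the comparison block is needed becomes clear when the default sort is run on the same data. The sketch below reuses the numbers from the slide; the variable names are illustrative.

```perl
use strict;
use warnings;

my @test = (1, 5, 4, 22, 9, 101);

my @as_strings = sort @test;                 # default: string comparison
my @ascending  = sort { $a <=> $b } @test;   # numeric, smallest first
my @descending = sort { $b <=> $a } @test;   # numeric, largest first

print join(',', @as_strings), "\n";   # 1,101,22,4,5,9
print join(',', @ascending),  "\n";   # 1,4,5,9,22,101
print join(',', @descending), "\n";   # 101,22,9,5,4,1
```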

Hash – show all elements

#! /usr/bin/perl -w

%nucleotide_bases = ( A => 'Adenine', T => 'Thymine', G => 'Guanine', C => 'Cytosine' );

# Method 1: each returns one (key, value) pair per call
while (($key, $value) = each %nucleotide_bases) { print "$key ====> $value\n"; }

# Method 2: loop over the keys and look up each value
foreach $key (keys %nucleotide_bases) { print "$key ====> $nucleotide_bases{$key}\n"; }

Hash – reverse with identical values

%nucleotide_bases = ( A => 'Adenine', T => 'Thymine', G => 'Adenine', C => 'Cytosine' );

while (($key, $value) = each %nucleotide_bases) { print "$key ====> $value\n"; }

# A and G share the value Adenine, so the reversed hash keeps only one of them
%reverse = reverse %nucleotide_bases;
while (($key, $value) = each %reverse) { print "$key ====> $value\n"; }
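When duplicate values must not be lost, one option is to invert the hash into a hash of arrays instead of calling reverse. This is a sketch of that alternative, not something shown in the slides; the name %keys_for is hypothetical.

```perl
use strict;
use warnings;

my %nucleotide_bases = (
    A => 'Adenine',
    T => 'Thymine',
    G => 'Adenine',    # deliberately duplicated value, as on the slide
    C => 'Cytosine',
);

# Map each value to the list of all keys that carry it.
my %keys_for;
for my $key (sort keys %nucleotide_bases) {
    push @{ $keys_for{ $nucleotide_bases{$key} } }, $key;
}

for my $value (sort keys %keys_for) {
    print "$value ====> @{ $keys_for{$value} }\n";
}
# prints:
# Adenine ====> A G
# Cytosine ====> C
# Thymine ====> T
```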

Hash – the number of elements

How to know the number of elements in a hash?

Ex:
my %hash = ('a'=>1,'b'=>2);
print scalar(keys(%hash))."\n";

Comment

# This is a comment

=This is a comment, too
=This is a comment, three
=cut

print "Really ?\n";

Appendix

STDIN, <>, our/my

$_ - extract data from <STDIN>

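while (<STDIN>) { print; }   # only a while condition assigns the line to $_

if (<STDIN>) { print; }      # reads a line but does not store it in $_

<>;                          # reads a line and discards it
$line = <>;                  # reads a line into $line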

#! /usr/bin/perl -w

while ( $line = <> ){ print $line;}

Processing Data Files

(like UNIX command : cat)

#! /usr/bin/perl -w

while (<> ){ print;}

Others …

while (defined($_ = <>)) { print; }
while ($_ = <>) { print; }
while (<>) { print; }
for (;<>;) { print; }
print while defined($_ = <>);
print while ($_ = <>);
print while <>;

our/my

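my $var;
$var = 1;
{
    my $var;             # a new lexical variable, visible only inside this block
    $var = 2;
    print $var, "\n";    # prints 2
}
print $var, "\n";        # prints 1

our $var;
$var = 1;
{
    our $var;            # the same package variable as outside the block
    $var = 2;
    print $var, "\n";    # prints 2
}
print $var, "\n";        # prints 2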
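Both cases can be put in one file when the two variables get different names. The sketch below uses the hypothetical names $lexical and $global and assumes the code runs in the default package main.

```perl
use strict;
use warnings;

my $lexical = 1;
our $global = 1;
my @seen;

{
    my $lexical = 2;    # a new variable that hides the outer one
    our $global;        # an alias for the same package variable $main::global
    $global = 2;
    push @seen, "inner lexical=$lexical global=$global";
}
push @seen, "outer lexical=$lexical global=$global";

print "$_\n" for @seen;
print "via package name: $main::global\n";
# prints:
# inner lexical=2 global=2
# outer lexical=1 global=2
# via package name: 2
```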

Appendix

Regular expression

Reserved word

open log, ">test.txt" or die "…";   # "log" is also the name of a built-in function, so avoid it as a filehandle name
print log "test\n";
close log;
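One way to sidestep name clashes with built-in functions is a lexical filehandle together with the three-argument form of open. This is a minimal sketch: it writes to a scratch file created with File::Temp instead of test.txt, and the variable names are illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A scratch file, so the sketch does not overwrite an existing test.txt.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# The lexical filehandle $out cannot collide with a built-in name such as log.
open(my $out, '>', $path) or die "Cannot open $path for writing: $!";
print {$out} "test\n";
close($out) or die "Cannot close $path: $!";

# Read the line back to confirm the write.
open(my $in, '<', $path) or die "Cannot open $path for reading: $!";
my $line = <$in>;
close $in;

print "read back: $line";
```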

Magic diamond - <>

print "$_" while (<>);

print "$_" while (<*.pl>);

Get the list of files in the current directory

my @files = <*.pl>;

my @files = glob("*.pl");

Greedy matching

my $string = "course1:x:509:510::/home/course1:/bin/bash";

if ($string =~ /(.*):/) { print "matched string = [$1]\n";}

#How to match the first column ?

Greedy matching

my $string = "course1:x:509:510::/home/course1:/bin/bash";

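if ($string =~ /^([\S]*):/) { print "matched string = [$1]\n"; }    # still greedy: [course1:x:509:510::/home/course1]

if ($string =~ /^([\S]*?):/) { print "matched string = [$1]\n"; }   # non-greedy: [course1]
if ($string =~ /([^:]*):/) { print "matched string = [$1]\n"; }     # no colon allowed in the match: [course1]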
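The variants above can be compared in one run, together with split, which is a common alternative for colon-separated records. This is a sketch with illustrative variable names; split is not used on the slide.

```perl
use strict;
use warnings;

my $string = "course1:x:509:510::/home/course1:/bin/bash";

my ($greedy)     = $string =~ /^(.*):/;       # .* grabs as much as possible
my ($non_greedy) = $string =~ /^(.*?):/;      # .*? stops at the first colon
my ($negated)    = $string =~ /^([^:]*):/;    # the match cannot cross a colon
my ($by_split)   = split /:/, $string;        # first field of the record

print "greedy     = [$greedy]\n";       # [course1:x:509:510::/home/course1]
print "non-greedy = [$non_greedy]\n";   # [course1]
print "negated    = [$negated]\n";      # [course1]
print "split      = [$by_split]\n";     # [course1]
```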

Substitution – remove all x

$_ = "China xxxxxx Taiwan";
s/x*//;    # How to rewrite ?
print;

Output (unchanged, because x* first matches the empty string at the start):
China xxxxxx Taiwan
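Two possible answers to the question on the slide are sketched below: require at least one x and repeat the substitution with the g modifier, or delete the character with tr. The variable names are illustrative.

```perl
use strict;
use warnings;

my $original = "China xxxxxx Taiwan";

# x* matches the empty string at position 0, so nothing is removed.
(my $first_try = $original) =~ s/x*//;

# x+ needs at least one x; /g repeats the substitution over the whole string.
(my $with_g = $original) =~ s/x+//g;

# tr with the d modifier deletes every x without using a regular expression.
(my $with_tr = $original) =~ tr/x//d;

print "[$first_try]\n";   # [China xxxxxx Taiwan]
print "[$with_g]\n";      # [China  Taiwan]
print "[$with_tr]\n";     # [China  Taiwan]
```

Both rewrites leave the two spaces that surrounded the run of x characters.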

Quoted syntax

Symbol    General    Description          Interpolated
' '       q/ /       String               No
" "       qq/ /      String               Yes
` `       qx/ /      Execution            Yes
( )       qw/ /      List of words        No
/ /       m/ /       Pattern matching     Yes
s/ / /    s/ / /     Substitution         Yes
y/ / /    tr/ / /    Transliteration      No
" "       qr/ /      Regular expression   Yes

Appendix

Useful techniques

Shell command – file/directory

mkdir("doc", 0744);      # permission modes are octal (leading 0), not hexadecimal
chdir("doc");
rmdir("doc");
unlink("log.txt");
chmod(0700, "log1.txt", "log2.txt", "log3.txt");
rename("old_name", "new_name");
chown($uid, $gid, "log1.txt", "log2.txt", "log3.txt");   # $uid and $gid are numeric IDs
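These functions return false on failure, so in practice each call is usually checked. The sketch below runs several of them inside a scratch directory created with File::Temp. It assumes a Unix-like system, and the paths and variable names are illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $base = tempdir(CLEANUP => 1);    # scratch area, removed at exit
my $dir  = "$base/doc";

mkdir($dir, 0744) or die "mkdir $dir failed: $!";

my $file = "$dir/log.txt";
open(my $fh, '>', $file) or die "Cannot create $file: $!";
close $fh;

# chmod returns the number of files it changed.
chmod(0700, $file) == 1 or die "chmod $file failed: $!";
my $mode = (stat $file)[2] & 07777;          # keep only the permission bits
printf "mode of log.txt: %04o\n", $mode;     # 0700 on a Unix-like system

# Why 0x744 is a mistake: read as octal, it is a very different mode.
my $hex_as_octal = sprintf '%o', 0x744;
print "0x744 in octal is $hex_as_octal\n";   # 3504

rename($file, "$dir/log.old") or die "rename failed: $!";
unlink("$dir/log.old") == 1   or die "unlink failed: $!";
rmdir($dir)                   or die "rmdir $dir failed: $!";
my $dir_gone = (-e $dir) ? 0 : 1;
```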

Perl

Usage: perl [switches] [--] [programfile] [arguments]
  -c              check syntax only (runs BEGIN and CHECK blocks)
  -d[:debugger]   run program under debugger
  -e program      one line of program (several -e's allowed, omit programfile)
  -i[extension]   edit <> files in place (makes backup if extension supplied)
  -n              assume "while (<>) { ... }" loop around program
  -p              assume loop like -n but print line also, like sed
  -u              dump core after parsing program
  -v              print version, subversion
  -w              enable many useful warnings (RECOMMENDED)
  -W              enable all warnings
  -X              disable all warnings

Removal of ^M

perl -pi.bak -e 's/\r//g;' index.html

File Copy

#! /usr/bin/perl

use File::Copy;

copy("file1", "file2") or die "Copy failed: $!";   # copy returns false on failure

Reserved word for debug

__FILE__ __LINE__

Ex:
print "FILE:".__FILE__." LINE:".__LINE__."\n";

Debug

perl -d "program name"

Debug

$ perlcc -d test.pl

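Special variable

$_          the default variable (holds the last implicit assignment, e.g. the line read by while (<>))
$!          error message
$$          current process ID
$?          the status returned when the previous child process ended
$"          the separator used when a list is interpolated into a string
$/          the input record separator (newline by default)
$`, $&, $'  string matching: the text before the match, the match itself, the text after the match
$+          the last backreference
@-          @LAST_MATCH_START
@+          @LAST_MATCH_END
@_          arguments of a subroutine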
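A few of these variables are easier to remember after seeing them change. The sketch below is illustrative only; the sample strings and variable names are not from the slides.

```perl
use strict;
use warnings;

# $" : separator used when an array is interpolated into a string
my @bases = qw(A T G C);
my $joined;
{
    local $" = '-';
    $joined = "@bases";
}

# $`, $&, $' and @-, @+ : details of the last successful match
my $text = "course1:x:509";
my ($pre, $match, $post, $start, $end);
if ($text =~ /x/) {
    ($pre, $match, $post) = ($`, $&, $');
    ($start, $end) = ($-[0], $+[0]);
}

# $/ : input record separator; undefined means "read the whole file at once"
my $data = "line1\nline2\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my $slurped = do { local $/; <$fh> };
close $fh;

print "joined  = $joined\n";                                 # A-T-G-C
print "match   = [$pre] [$match] [$post] at $start..$end\n";  # [course1:] [x] [:509] at 8..9
print "slurped = ", length($slurped), " characters\n";       # 12 characters
print "pid     = $$\n";
```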

Bytecode generator

$ perlcc -B -o test test3.pl

CPAN

perl -MCPAN -e "install GD"

BioPerl

PSI-BLAST

Position-Specific Iterative BLAST constructs a multiple sequence alignment, then creates a position-specific scoring matrix (PSSM).

[Figure: PSI-BLAST workflow. Labels: Query sequence, BLAST, Sequence database, Homologous proteins, Multiple sequence alignment, PSSM, New homologous proteins]

PSSM (1/4)

Query sequence:
G H E G V G K V V K L G A G A

Homologous proteins:
G H E K K G Y F E D R G P S A
G H E G Y G G R S R G G G Y S
G H E F E G P K G C G A L Y I
G H E L R G T T F M P A L E C

Residue counts per column:

    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
A   0  0  0  0  0  0  0  0  0  0  0  2  1  0  2
C   0  0  0  0  0  0  0  0  0  1  0  0  0  0  1
D   0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
E   0  0  5  0  1  0  0  0  1  0  0  0  0  1  0
F   0  0  0  1  0  0  0  1  1  0  0  0  0  0  0
G   5  0  0  2  0  5  1  0  1  0  2  3  1  1  0
H   0  5  0  0  0  0  0  0  0  0  0  0  0  0  0
I   0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
K   0  0  0  1  1  0  1  1  0  1  0  0  0  0  0
L   0  0  0  1  0  0  0  0  0  0  1  0  2  0  0
M   0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
N   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
P   0  0  0  0  0  0  1  0  0  0  1  0  1  0  0
Q   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
R   0  0  0  0  1  0  0  1  0  1  1  0  0  0  0
S   0  0  0  0  0  0  0  0  1  0  0  0  0  1  1
T   0  0  0  0  0  0  1  1  0  0  0  0  0  0  0
V   0  0  0  0  1  0  0  1  1  0  0  0  0  0  0
W   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Y   0  0  0  0  1  0  1  0  0  0  0  0  0  2  0

Frequency

Column 1: f_{A,1}=0/5, f_{C,1}=0/5, …, f_{G,1}=5/5, …
Column 2: f_{A,2}=0/5, f_{C,2}=0/5, …, f_{H,2}=5/5, …
…
Column 15: f_{A,15}=2/5, f_{C,15}=1/5, …, f_{S,15}=1/5, …

PSSM (2/4)

The original data:

Column 1: f_{A,1}=0/5, f_{C,1}=0/5, …, f_{G,1}=5/5, …
Column 2: f_{A,2}=0/5, f_{C,2}=0/5, …, f_{H,2}=5/5, …
…
Column 15: f_{A,15}=2/5, f_{C,15}=1/5, …, f_{S,15}=1/5, …

Set a pseudo-count of 1 (each of the 20 residues gets one extra count, so the denominator becomes 5+20):

Column 1: f'_{A,1}=(0+1)/(5+20), f'_{C,1}=(0+1)/(5+20), …, f'_{G,1}=(5+1)/(5+20), …
Column 2: f'_{A,2}=(0+1)/(5+20), f'_{C,2}=(0+1)/(5+20), …, f'_{H,2}=(5+1)/(5+20), …
…
Column 15: f'_{A,15}=(2+1)/(5+20), f'_{C,15}=(1+1)/(5+20), …, f'_{S,15}=(1+1)/(5+20), …

PSSM (3/4)

The score is derived from the ratio of the observed to the expected frequencies. More precisely, the logarithm of this ratio is taken and referred to as the log-likelihood ratio:

$$\mathrm{Score}_{i,j} = \log\left(\frac{f'_{i,j}}{q_i}\right)$$

where $\mathrm{Score}_{i,j}$ is the score for residue $i$ at position $j$, $f'_{i,j}$ is the relative frequency of residue $i$ at position $j$, and $q_i$ is the expected relative frequency of residue $i$ in a random sequence.

PSSM (4/4)

      1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
A  -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  1.3  0.7 -0.2  1.3
C  -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2  0.7
D  -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2
E  -0.2 -0.2  2.3 -0.2  0.7 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2  0.7 -0.2
F  -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
G   2.3 -0.2 -0.2  1.3 -0.2  2.3  0.7 -0.2  0.7 -0.2  1.3  1.7  0.7  0.7 -0.2
H  -0.2  2.3 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
I  -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7
K  -0.2 -0.2 -0.2  0.7  0.7 -0.2  0.7  0.7 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2
L  -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2  1.3 -0.2 -0.2
M  -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2
N  -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
P  -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2  0.7 -0.2  0.7 -0.2 -0.2
Q  -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
R  -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2  0.7 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2
S  -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2  0.7  0.7
T  -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
V  -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
W  -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
Y  -0.2 -0.2 -0.2 -0.2  0.7 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2  1.3 -0.2
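The four PSSM steps (count, add a pseudo-count, divide, take the log ratio) can be sketched in Perl on the five aligned sequences from the slides. Two choices below are assumptions, since the slides do not state them: a uniform background frequency q_i = 1/20 and a base-2 logarithm. Under these assumptions the scores for observed residues round to the 0.7, 1.3, 1.7 and 2.3 shown in the table, while unobserved residues come out near -0.3 rather than the table's -0.2, so the original settings may differ slightly.

```perl
use strict;
use warnings;

my @alignment = (
    [qw(G H E G V G K V V K L G A G A)],    # query sequence
    [qw(G H E K K G Y F E D R G P S A)],    # homologous proteins
    [qw(G H E G Y G G R S R G G G Y S)],
    [qw(G H E F E G P K G C G A L Y I)],
    [qw(G H E L R G T T F M P A L E C)],
);
my @residues = qw(A C D E F G H I K L M N P Q R S T V W Y);

my $n_res      = scalar @residues;          # 20 amino acids
my $n_seqs     = scalar @alignment;         # 5 sequences
my $n_cols     = scalar @{ $alignment[0] }; # 15 columns
my $pseudo     = 1;                         # pseudo-count from the slide
my $background = 1 / $n_res;                # ASSUMPTION: uniform q_i = 1/20

my (%count, %score);
for my $col (0 .. $n_cols - 1) {
    # Step 1: count each residue in this column.
    $count{$_}[$col] = 0 for @residues;
    $count{ $_->[$col] }[$col]++ for @alignment;

    for my $res (@residues) {
        # Step 2: relative frequency with pseudo-count, e.g. (5+1)/(5+20).
        my $freq = ($count{$res}[$col] + $pseudo) / ($n_seqs + $pseudo * $n_res);
        # Step 3: log ratio. ASSUMPTION: base-2 logarithm.
        $score{$res}[$col] = log($freq / $background) / log(2);
    }
}

# Print one row per residue, one column per alignment position.
for my $res (@residues) {
    printf "%s %s\n", $res, join ' ', map { sprintf '%5.1f', $_ } @{ $score{$res} };
}
```

Arrays are indexed from 0 here, so column 1 of the slides is index 0 in %count and %score.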
