96-Summer Bioinformatics Programming Practicum (II)

Bioinformatics with Perl

8/13~8/22 蘇中才
8/24~8/29 張天豪
8/31 曾宇鳯

Course preparation

Course web page: http://gene.csie.ntu.edu.tw/~sbb/summer-course/

Setup steps:
Download Putty / Pietty
Connect to 140.112.28.186
wget http://gene.csie.ntu.edu.tw/~sbb/summer-course/doc/course1.tgz
tar zxvf course1.tgz



No.  Name    Account
1    許郁彬  course1
2    杜羿樞  course2
3    黃裕雄  course3
4    王建智  course4
5    陳士杰  course5
6    莊智傑  course6
7    朱柏威  course7
8    洪文峯  course8
9    吳耿豪  course9
10   張雯琪  course10
11   王悅    course11
12   張嘉芸  course12
13   林義峰  course13
14   游棨元  course14
15   許育堂  course15
16   陳建瑋  course16
17   黃國鑫  course17
18   翁小涵  course18
19   郭建鴻  course19
20   曾意儒  course20

Appendix

Scalar, Array, Hash

Variable reset (1/2)

$scalar = undef;
$scalar = "";
$scalar = 0;

@array = ();
%hash = ();

Variable reset (2/2)

@array = undef;          # not empty: one element whose value is undef

print scalar(@array);    # prints 1
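The difference between the two reset styles can be checked directly. A minimal sketch (the variable names are illustrative, not from the slides) comparing assignment of an empty list with assignment of undef:

```perl
use strict;
use warnings;

my @cleared = (1, 2, 3);
@cleared = ();                 # empty list: the array has no elements left
my $cleared_count = scalar(@cleared);

my @not_cleared = (1, 2, 3);
@not_cleared = undef;          # a one-element list whose only value is undef
my $not_cleared_count = scalar(@not_cleared);

print "after = ():    $cleared_count element(s)\n";
print "after = undef: $not_cleared_count element(s)\n";
```

Use the empty list () to reset an array or hash; undef is only appropriate for scalars.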

Array

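my @number = ("one", "two", "three");
my $number = ("one", "two", "three");   # comma operator in scalar context: "three"

print "@number\n";             # one two three
print scalar(@number)."\n";    # 3 (number of elements)
print $#number."\n";           # 2 (index of the last element)
print @number."\n";            # 3 (concatenation forces scalar context)
print $number."\n";            # three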

Array

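@array = qw"5 4 9 8 1 3 6 2 7 10";

print "@array\n";     # 5 4 9 8 1 3 6 2 7 10
print @array."\n";    # 10 (number of elements)
print @array;         # 54981362710 (elements printed with no separator)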

Array – sort by number

#! /usr/bin/perl

@test=(1, 5, 4, 22, 9, 101);

@mmm=sort {$a<=>$b} @test;

print join(',', @mmm), "\n\n";    # 1,4,5,9,22,101 (parentheses keep "\n\n" out of the joined list)
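The comparison block matters because sort compares strings by default. A short sketch on the same data (the result variable names are illustrative) showing the default order next to numeric ascending and descending order:

```perl
use strict;
use warnings;

my @test = (1, 5, 4, 22, 9, 101);

my @as_strings = sort @test;                  # default: string comparison
my @ascending  = sort { $a <=> $b } @test;    # numeric, smallest first
my @descending = sort { $b <=> $a } @test;    # numeric, largest first

print join(',', @as_strings), "\n";
print join(',', @ascending),  "\n";
print join(',', @descending), "\n";
```

With the default string comparison, 101 sorts before 22 because the character "1" precedes "2".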

Hash – show all elements

#! /usr/bin/perl -w

%nucleotide_bases = ( A => 'Adenine', T => 'Thymine', G => 'Guanine', C => 'Cytosine' );

while (($key, $value) = each %nucleotide_bases) {
    print "$key ====> $value\n";
}

foreach $key (keys %nucleotide_bases) {
    print "$key ====> $nucleotide_bases{$key}\n";
}

Hash – reverse with identical values

%nucleotide_bases = ( A => 'Adenine', T => 'Thymine', G => 'Adenine', C => 'Cytosine' );

while (($key, $value) = each %nucleotide_bases) {
    print "$key ====> $value\n";
}

# A and G share the value Adenine, so only one of them survives as a value in %reverse
%reverse = reverse %nucleotide_bases;
while (($key, $value) = each %reverse) {
    print "$key ====> $value\n";
}
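When values repeat, reverse silently drops entries. One way to keep every key is to group keys by value in a hash of arrays. This is a sketch only; %keys_for is an illustrative name, not part of the course material:

```perl
use strict;
use warnings;

my %nucleotide_bases = (A => 'Adenine', T => 'Thymine', G => 'Adenine', C => 'Cytosine');

my %reverse = reverse %nucleotide_bases;    # one entry per distinct value
my $reverse_count = scalar(keys %reverse);

# Group the original keys by value so that no key is lost
my %keys_for;
for my $key (sort keys %nucleotide_bases) {
    push @{ $keys_for{ $nucleotide_bases{$key} } }, $key;
}

print "reverse keeps $reverse_count of ", scalar(keys %nucleotide_bases), " entries\n";
for my $value (sort keys %keys_for) {
    print "$value ====> @{ $keys_for{$value} }\n";
}
```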

Hash – the number of elements

How to know the number of elements in a hash?

Ex:
my %hash = ('a'=>1,'b'=>2);
print scalar(keys(%hash))."\n";    # 2

Comment

# This is a comment

=This is a comment, too
=This is a comment, three
=cut

print "Really ?\n";

Appendix

STDIN, <>, our/my

$_ - extract data from <STDIN>

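while (<STDIN>) {print;}    # in a while condition, each line read is assigned to $_

if (<STDIN>) {print;}       # here the line is read but NOT assigned to $_

<>;                         # reads one line and discards it
$line = <>;                 # reads one line into $line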

#! /usr/bin/perl -w

while ( $line = <> ){ print $line;}

Processing Data Files

(like UNIX command : cat)

#! /usr/bin/perl -w

while (<> ){ print;}

Others …

while (defined($_ = <>)) { print; }
while ($_ = <>) { print; }
while (<>) { print; }
for (;<>;) { print; }
print while defined($_ = <>);
print while ($_ = <>);
print while <>;

our/my

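my $var;
$var = 1;
{
    my $var;            # a new lexical variable that hides the outer $var
    $var = 2;
    print $var,"\n";    # 2
}
print $var, "\n";       # 1

our $var;
$var = 1;
{
    our $var;           # the same package variable as outside the block
    $var = 2;
    print $var,"\n";    # 2
}
print $var, "\n";       # 2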

Appendix

Regular expression

Reserved word

# log is also the name of a built-in function, so it is a poor choice for a filehandle name
open log, ">test.txt" or die "…";
print log "test\n";
close log;

Magic diamond - <>

print "$_" while (<>);

print "$_" while (<*.pl>);

Get the list of files in the current directory

my @files = <*.pl>;

my @files = glob("*.pl");

Greedy matching

my $string = "course1:x:509:510::/home/course1:/bin/bash";

if ($string =~ /(.*):/) {
    print "matched string = [$1]\n";    # [course1:x:509:510::/home/course1]
}

# How to match the first column?

Greedy matching

my $string = "course1:x:509:510::/home/course1:/bin/bash";

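# Still greedy: [course1:x:509:510::/home/course1]
if ($string =~ /^([\S]*):/) { print "matched string = [$1]\n";}

# Non-greedy quantifier: [course1]
if ($string =~ /^([\S]*?):/) { print "matched string = [$1]\n";}

# Capture may not contain a colon: [course1]
if ($string =~ /([^:]*):/) { print "matched string = [$1]\n";}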
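For colon-separated records like this one, split is a common alternative to a capture group. A small sketch comparing the greedy match, the non-greedy match, and split on the same string (the result variable names are illustrative):

```perl
use strict;
use warnings;

my $string = "course1:x:509:510::/home/course1:/bin/bash";

my ($greedy)     = $string =~ /(.*):/;      # runs up to the last colon
my ($non_greedy) = $string =~ /^(.*?):/;    # stops at the first colon
my ($first)      = split /:/, $string;      # first field of the record
my @fields       = split /:/, $string;      # all seven fields

print "greedy     = [$greedy]\n";
print "non-greedy = [$non_greedy]\n";
print "split      = [$first]\n";
print "login shell = [$fields[-1]]\n";
```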

Substitution – remove all x

$_ = "China xxxxxx Taiwan";
s/x*//;    # How to rewrite?
print;

Output (unchanged, because x* first matches the empty string at the start of the string):
China xxxxxx Taiwan
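One possible answer to the question above, as a sketch. The pattern x* can match zero characters, so it succeeds at position 0 without removing anything. Adding the /g modifier, or requiring at least one x with x+, removes the run:

```perl
use strict;
use warnings;

my $original = "China xxxxxx Taiwan";

(my $unchanged = $original) =~ s/x*//;    # empty match at position 0: nothing removed
(my $no_x      = $original) =~ s/x//g;    # /g removes every x
(my $no_run    = $original) =~ s/x+//;    # x+ needs at least one x: removes the whole run

print "[$unchanged]\n";
print "[$no_x]\n";
print "[$no_run]\n";
```

Both rewrites leave two spaces between the words, since the spaces around the run are untouched.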

Quoted syntax

Symbol    General    Description          Interpolated
' '       q/ /       String               No
" "       qq/ /      String               Yes
` `       qx/ /      Execution            Yes
( )       qw/ /      List of words        No
/ /       m/ /       Pattern matching     Yes
s/ / /    s/ / /     Substitution         Yes
y/ / /    tr/ / /    Transliteration      No
(none)    qr/ /      Regular expression   Yes

Appendix

Useful techniques

Shell command – file/directory

# Permission modes are octal (leading 0), not hexadecimal (0x)
mkdir("doc", 0744);
chdir("doc");
rmdir("doc");
unlink("log.txt");
chmod(0700, "log1.txt", "log2.txt", "log3.txt");
rename("old_name", "new_name");
chown(<uid>, <gid>, "log1.txt", "log2.txt", "log3.txt");
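Permission modes are conventionally written in octal, and a hexadecimal literal with the same digits is a different number. The sketch below works in a scratch directory created with File::Temp (a choice made for this example, so that nothing in the current directory is touched) and reads the mode back with stat:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $base = tempdir(CLEANUP => 1);    # scratch directory, removed when the script exits
my $dir  = "$base/doc";

printf "0744 = %d, 0x744 = %d\n", 0744, 0x744;    # octal and hex literals differ

mkdir($dir, 0744) or die "mkdir failed: $!";    # requested mode, still filtered by the umask
chmod(0700, $dir) or die "chmod failed: $!";    # chmod sets the mode exactly
my $mode = (stat $dir)[2] & 07777;              # keep only the permission bits
printf "mode of doc: %04o\n", $mode;

rmdir($dir) or die "rmdir failed: $!";
my $dir_gone = (-e $dir) ? 0 : 1;
```

Checking each return value with or die and $! reports why a file operation failed.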

Perl

Usage: perl [switches] [--] [programfile] [arguments]
  -c              check syntax only (runs BEGIN and CHECK blocks)
  -d[:debugger]   run program under debugger
  -e program      one line of program (several -e's allowed, omit programfile)
  -i[extension]   edit <> files in place (makes backup if extension supplied)
  -n              assume "while (<>) { ... }" loop around program
  -p              assume loop like -n but print line also, like sed
  -u              dump core after parsing program
  -v              print version, subversion
  -w              enable many useful warnings (RECOMMENDED)
  -W              enable all warnings
  -X              disable all warnings

Removal of ^M

perl -pi.bak -e 's/\r//g;' index.html

File Copy

#! /usr/bin/perl

use File::Copy;

copy("file1", "file2");

Reserved word for debug

__FILE__ __LINE__

Ex:
print "FILE:".__FILE__." LINE:".__LINE__."\n";

Debug

perl -d "program name"

Debug

$ perlcc -d test.pl

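Special variables

$_          the default variable (holds the last implicit assignment, e.g. the line read by while (<>))
$!          error message
$$          current process ID
$?          the status returned when the previous child process ended
$"          the separator used when a list is interpolated into a string
$/          the input record separator
$`, $&, $'  string matching (text before the match, the match, text after the match)
$+          the last backreference
@-          @LAST_MATCH_START
@+          @LAST_MATCH_END
@_          arguments of a subroutine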
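A few of these variables in action, as a minimal sketch (the data and variable names are illustrative). It changes the list separator locally and reads the offsets of a match from @- and @+:

```perl
use strict;
use warnings;

my @bases = qw(A T G C);
my $joined;
{
    local $" = '-';        # list separator used when an array is interpolated
    $joined = "@bases";
}

my $string = "course1:x:509";
my ($match, $start, $end);
if ($string =~ /x/) {
    # $& is the matched text; $-[0] and $+[0] are its start and end offsets
    ($match, $start, $end) = ($&, $-[0], $+[0]);
}

print "joined: $joined\n";
printf "matched '%s' from offset %d to %d\n", $match, $start, $end;
```

Using local restores the previous value of the separator when the block ends, so the change does not leak into other code.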

Bytecode generator

$ perlcc -B -o test test3.pl

CPAN

perl -MCPAN -e "install GD"

BioPerl

PSI-BLAST

Position Specific Iterative BLAST constructs a multiple sequence alignment, then creates a position-specific scoring matrix (PSSM).

[Figure: PSI-BLAST workflow diagram. Labels: Query sequence, BLAST, Sequence database, Homologous proteins, Multiple sequence alignment, PSSM, New homologous proteins]

PSSM(1/4)

Query Sequence

G H E G V G K V V K L G A G A

Homologous proteins

G H E K K G Y F E D R G P S A
G H E G Y G G R S R G G G Y S
G H E F E G P K G C G A L Y I
G H E L R G T T F M P A L E C

Residue counts per column

    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
A   0  0  0  0  0  0  0  0  0  0  0  2  1  0  2
C   0  0  0  0  0  0  0  0  0  1  0  0  0  0  1
D   0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
E   0  0  5  0  1  0  0  0  1  0  0  0  0  1  0
F   0  0  0  1  0  0  0  1  1  0  0  0  0  0  0
G   5  0  0  2  0  5  1  0  1  0  2  3  1  1  0
H   0  5  0  0  0  0  0  0  0  0  0  0  0  0  0
I   0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
K   0  0  0  1  1  0  1  1  0  1  0  0  0  0  0
L   0  0  0  1  0  0  0  0  0  0  1  0  2  0  0
M   0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
N   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
P   0  0  0  0  0  0  1  0  0  0  1  0  1  0  0
Q   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
R   0  0  0  0  1  0  0  1  0  1  1  0  0  0  0
S   0  0  0  0  0  0  0  0  1  0  0  0  0  1  1
T   0  0  0  0  0  0  1  1  0  0  0  0  0  0  0
V   0  0  0  0  1  0  0  1  1  0  0  0  0  0  0
W   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Y   0  0  0  0  1  0  1  0  0  0  0  0  0  2  0

Frequency

Column 1: fA,1=0/5, fC,1=0/5, …, fG,1=5/5, …
Column 2: fA,2=0/5, fC,2=0/5, …, fH,2=5/5, …
…
Column 15: fA,15=2/5, fC,15=1/5, …, fS,15=1/5, …
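The per-column counts can be computed from the five aligned sequences. A minimal sketch, assuming the alignment is stored as plain strings of equal length (the array and hash names are illustrative):

```perl
use strict;
use warnings;

# Query sequence followed by the four homologous proteins
my @alignment = qw(
    GHEGVGKVVKLGAGA
    GHEKKGYFEDRGPSA
    GHEGYGGRSRGGGYS
    GHEFEGPKGCGALYI
    GHELRGTTFMPALEC
);

# $count[$j]{$residue} = occurrences of $residue in column $j (0-based)
my @count;
for my $sequence (@alignment) {
    my @residues = split //, $sequence;
    $count[$_]{ $residues[$_] }++ for 0 .. $#residues;
}

my $n = scalar @alignment;
for my $column (0, 1, 14) {
    my %in_column = %{ $count[$column] };
    my @parts = map { sprintf("f%s,%d=%d/%d", $_, $column + 1, $in_column{$_}, $n) }
                sort keys %in_column;
    print "Column ", $column + 1, ": ", join(", ", @parts), "\n";
}
```

Only residues that actually occur in a column are stored; every other residue has an implicit count of 0.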

PSSM (2/4)

The original data:

Column 1: fA,1=0/5, fC,1=0/5, …, fG,1=5/5, …

Column 2: fA,2=0/5, fC,2=0/5, …, fH,2=5/5, …

…
Column 15: fA,15=2/5, fC,15=1/5, …, fS,15=1/5, …

Set a pseudo-count of 1 (each of the 20 residues gets one extra count, so the denominator becomes 5+20):

Column 1: f'A,1=(0+1)/(5+20), f'C,1=(0+1)/(5+20), …, f'G,1=(5+1)/(5+20), …
Column 2: f'A,2=(0+1)/(5+20), f'C,2=(0+1)/(5+20), …, f'H,2=(5+1)/(5+20), …
…
Column 15: f'A,15=(2+1)/(5+20), f'C,15=(1+1)/(5+20), …, f'S,15=(1+1)/(5+20), …
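The pseudo-count adjustment adds the pseudo-count to each residue count and adds 20 times the pseudo-count to the number of sequences. A small sketch; the function name smoothed_frequency and its parameter order are illustrative, not from the slides:

```perl
use strict;
use warnings;

# (count + pseudo) / (n_sequences + pseudo * alphabet_size)
sub smoothed_frequency {
    my ($count, $n_sequences, $pseudo, $alphabet_size) = @_;
    return ($count + $pseudo) / ($n_sequences + $pseudo * $alphabet_size);
}

my $f_G1  = smoothed_frequency(5, 5, 1, 20);    # G in column 1:  (5+1)/(5+20)
my $f_A1  = smoothed_frequency(0, 5, 1, 20);    # A in column 1:  (0+1)/(5+20)
my $f_A15 = smoothed_frequency(2, 5, 1, 20);    # A in column 15: (2+1)/(5+20)

printf "f'G,1 = %.2f, f'A,1 = %.2f, f'A,15 = %.2f\n", $f_G1, $f_A1, $f_A15;
```

Because every residue receives the pseudo-count, no adjusted frequency is zero and the logarithm in the scoring step is always defined.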

PSSM (3/4)

The score is derived from the ratio of the observed to the expected frequencies. More precisely, the logarithm of this ratio is taken and referred to as the log-likelihood ratio:

\[ \mathrm{Score}_{i,j} = \log\left(\frac{f'_{i,j}}{q_i}\right) \]

where Score_{i,j} is the score for residue i at position j, f'_{i,j} is the relative frequency for residue i at position j, and q_i is the expected relative frequency of residue i in a random sequence.
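The slides do not state the base of the logarithm or the values of q_i. The sketch below assumes a base-2 logarithm and a uniform background q_i = 1/20. Both are assumptions, chosen because they reproduce the positive entries of the PSSM (4/4) table (0.7, 1.3, 1.7, 2.3). Under the same assumptions a count of 0 gives about -0.3 rather than the -0.2 shown in that table, so the original settings may differ slightly. The function name pssm_score is illustrative.

```perl
use strict;
use warnings;

sub pssm_score {
    my ($count, $n_sequences) = @_;
    my $f_prime = ($count + 1) / ($n_sequences + 20);    # pseudo-count of 1, 20 residues
    my $q       = 1 / 20;                                # ASSUMED uniform background
    return log($f_prime / $q) / log(2);                  # ASSUMED base-2 logarithm
}

# Counts that occur in the alignment of 5 sequences
printf "count %d -> %.1f\n", $_, pssm_score($_, 5) for (1, 2, 3, 5);
```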

PSSM (4/4)

      1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
A -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  1.3  0.7 -0.2  1.3
C -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2  0.7
D -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2
E -0.2 -0.2  2.3 -0.2  0.7 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2  0.7 -0.2
F -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
G  2.3 -0.2 -0.2  1.3 -0.2  2.3  0.7 -0.2  0.7 -0.2  1.3  1.7  0.7  0.7 -0.2
H -0.2  2.3 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
I -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7
K -0.2 -0.2 -0.2  0.7  0.7 -0.2  0.7  0.7 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2
L -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2  1.3 -0.2 -0.2
M -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2
N -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
P -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2  0.7 -0.2  0.7 -0.2 -0.2
Q -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
R -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2  0.7 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2
S -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2  0.7  0.7
T -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
V -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
W -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
Y -0.2 -0.2 -0.2 -0.2  0.7 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  1.3 -0.2
