96-Summer Bioinformatics Programming Practicum (II)
Bioinformatics with Perl
8/13~8/22 蘇中才
8/24~8/29 張天豪
8/31 曾宇鳯
Preparation
Course web page:
http://gene.csie.ntu.edu.tw/~sbb/summer-course/
Setup: download PuTTY / PieTTY, connect to 140.112.28.186, then run:
wget http://gene.csie.ntu.edu.tw/~sbb/summer-course/doc/course1.tgz
tar zxvf course1.tgz
No.  Name    Account
1    許郁彬  course1
2    杜羿樞  course2
3    黃裕雄  course3
4    王建智  course4
5    陳士杰  course5
6    莊智傑  course6
7    朱柏威  course7
8    洪文峯  course8
9    吳耿豪  course9
10   張雯琪  course10
11   王悅    course11
12   張嘉芸  course12
13   林義峰  course13
14   游棨元  course14
15   許育堂  course15
16   陳建瑋  course16
17   黃國鑫  course17
18   翁小涵  course18
19   郭建鴻  course19
20   曾意儒  course20
Appendix
Scalar, Array, Hash
Variable reset (1/2)
$scalar = undef;
$scalar = "";
$scalar = 0;
@array = ();
%hash = ();
Variable reset (2/2)
@array = undef;         # pitfall: stores one undef element, does not empty the array
print scalar(@array);   # prints 1, not 0
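The two ways of resetting an array behave differently, which is easy to check. A minimal sketch (the variable names are illustrative and not part of the course files):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assigning the empty list really empties the array.
my @emptied = (1, 2, 3);
@emptied = ();
my $count_after_list = scalar(@emptied);

# Assigning undef stores a one-element list whose only element is undef.
my @not_emptied = (1, 2, 3);
@not_emptied = undef;
my $count_after_undef = scalar(@not_emptied);

# The function form, undef(@array), does empty the array.
undef @not_emptied;
my $count_after_undef_func = scalar(@not_emptied);

print "$count_after_list $count_after_undef $count_after_undef_func\n";   # 0 1 0
```

For arrays and hashes, assigning `()` is the usual way to clear the contents.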
Array
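my @number = ("one", "two", "three");
my $number = ("one", "two", "three");   # comma operator in scalar context: $number is "three"
print "@number\n";             # one two three
print scalar(@number)."\n";    # 3 (number of elements)
print $#number."\n";           # 2 (index of the last element)
print @number."\n";            # 3 (array in scalar context)
print $number."\n";            # three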
Array
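@array = qw"5 4 9 8 1 3 6 2 7 10";
print "@array\n";     # 5 4 9 8 1 3 6 2 7 10
print @array."\n";    # 10 (number of elements)
print @array;         # 54981362710 (elements printed with no separator)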
Array – sort by number
#! /usr/bin/perl
@test = (1, 5, 4, 22, 9, 101);
@mmm = sort { $a <=> $b } @test;   # numeric comparison, ascending
print join(',', @mmm), "\n\n";     # parentheses keep "\n\n" out of the joined list
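The comparison block matters because sort compares elements as strings by default. A short sketch contrasting the default order with numeric ascending and descending order (the result variable names are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @test = (1, 5, 4, 22, 9, 101);

my @as_strings = sort @test;                 # default: string comparison
my @ascending  = sort { $a <=> $b } @test;   # numeric, smallest first
my @descending = sort { $b <=> $a } @test;   # numeric, largest first

print join(',', @as_strings), "\n";   # 1,101,22,4,5,9
print join(',', @ascending),  "\n";   # 1,4,5,9,22,101
print join(',', @descending), "\n";   # 101,22,9,5,4,1
```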
Hash – show all elements
#! /usr/bin/perl -w
%nucleotide_bases = ( A => 'Adenine', T => 'Thymine', G => 'Guanine', C => 'Cytosine' );
# each returns one (key, value) pair per call
while (($key, $value) = each %nucleotide_bases) { print "$key ====> $value\n"; }
# keys returns all keys; look up each value by key
foreach $key (keys %nucleotide_bases) { print "$key ====> $nucleotide_bases{$key}\n"; }
Hash – reverse with identical values
%nucleotide_bases = ( A => 'Adenine', T => 'Thymine', G => 'Adenine', C => 'Cytosine' );
while (($key, $value) = each %nucleotide_bases) { print "$key ====> $value\n"; }
%reverse = reverse %nucleotide_bases;   # A and G share a value, so only one of them survives as the key for Adenine
while (($key, $value) = each %reverse) { print "$key ====> $value\n"; }
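Because reversing a hash silently drops entries whose values collide, one common alternative is to collect every key under its value. A minimal sketch, assuming a hash of array references is acceptable (the name %keys_for is illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %nucleotide_bases = ( A => 'Adenine', T => 'Thymine', G => 'Adenine', C => 'Cytosine' );

# Plain reverse: four pairs go in, but only three distinct values can become keys.
my %reverse = reverse %nucleotide_bases;
print scalar(keys %reverse), "\n";   # 3

# Inverse that keeps every key: value => [ list of keys ].
my %keys_for;
for my $key (sort keys %nucleotide_bases) {
    push @{ $keys_for{ $nucleotide_bases{$key} } }, $key;
}

for my $value (sort keys %keys_for) {
    print "$value ====> @{ $keys_for{$value} }\n";
}
```

The keys are sorted only to make the printed order repeatable; hash order itself is not guaranteed.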
Hash – the number of elements
How to know the number of elements in a hash?
Ex:
my %hash = ('a'=>1,'b'=>2);
print scalar(keys(%hash))."\n";   # 2
Comment
# This is a comment
=This is a comment, too
=This is a comment, three
=cut
print "Really ?\n";
Appendix
STDIN, <>, our/my
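$_ - extract data from <STDIN>
while (<STDIN>) { print; }   # each line is assigned to $_ automatically
if (<STDIN>) { print; }      # a line is read but NOT assigned to $_
<>;                          # reads one line and discards it
$line = <>;                  # reads one line into $line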
#! /usr/bin/perl -w
while ( $line = <> ){ print $line;}
Processing Data Files
(like UNIX command : cat)
#! /usr/bin/perl -w
while (<> ){ print;}
Others …
while (defined($_ = <>)) { print; }
while ($_ = <>) { print; }
while (<>) { print; }
for (;<>;) { print; }
print while defined($_ = <>);
print while ($_ = <>);
print while <>;
our/my
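my $var;
$var = 1;
{
    my $var;             # a new lexical variable that hides the outer one
    $var = 2;
    print $var,"\n";     # 2
}
print $var, "\n";        # 1

our $var;
$var = 1;
{
    our $var;            # the same package variable as outside the block
    $var = 2;
    print $var,"\n";     # 2
}
print $var, "\n";        # 2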
Appendix
Regular expression
Reserved word
open log, ">test.txt" or die "…";   # "log" is also the name of a built-in function; prefer a handle name such as LOG
print log "test\n";
close log;
Magic diamond - <>
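print "$_" while (<>);       # reads lines from the files named on the command line, or from STDIN
print "$_" while (<*.pl>);   # here <...> is a file glob: it yields matching file names, not lines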
Get the list of files in the current directory
my @files = <*.pl>;
my @files = glob("*.pl");
Greedy matching
my $string = "course1:x:509:510::/home/course1:/bin/bash";
if ($string =~ /(.*):/) { print "matched string = [$1]\n";}
#How to match the first column ?
Greedy matching
my $string = "course1:x:509:510::/home/course1:/bin/bash";
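if ($string =~ /^([\S]*):/) { print "matched string = [$1]\n";}    # still greedy: matches up to the last colon
if ($string =~ /^([\S]*?):/) { print "matched string = [$1]\n";}   # non-greedy: [course1]
if ($string =~ /([^:]*):/) { print "matched string = [$1]\n";}     # negated character class: [course1]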
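The three approaches can be compared side by side. The sketch below also shows split, which is not on the slide but is a common way to read colon-separated records; the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "course1:x:509:510::/home/course1:/bin/bash";

# A match in list context returns the captured groups.
my ($greedy)     = $string =~ /(.*):/;       # .* grabs as much as possible
my ($non_greedy) = $string =~ /^(.*?):/;     # .*? stops at the first colon
my ($negated)    = $string =~ /^([^:]*):/;   # anything except a colon

# Alternative: split the record into fields.
my @fields = split /:/, $string;

print "greedy     = [$greedy]\n";       # [course1:x:509:510::/home/course1]
print "non-greedy = [$non_greedy]\n";   # [course1]
print "negated    = [$negated]\n";      # [course1]
print "first of ", scalar(@fields), " fields = [$fields[0]]\n";   # first of 7 fields = [course1]
```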
Substitution – remove all x
$_ = "China xxxxxx Taiwan";
s/x*//; # How to rewrite ?
print;
Output: China xxxxxx Taiwan
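The substitution leaves the string unchanged because x* can match the empty string, and it does so at position 0 before any x is reached. One possible rewrite, sketched with copies of the string so the results can be compared (the variable names are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "China xxxxxx Taiwan";

# x* matches the empty string at the very start, so nothing is removed.
(my $unchanged = $text) =~ s/x*//;

# x+ needs at least one x, and /g repeats the substitution for every match.
(my $all_removed = $text) =~ s/x+//g;

# tr with /d deletes single characters without using a regular expression.
(my $deleted = $text) =~ tr/x//d;

print "[$unchanged]\n";     # [China xxxxxx Taiwan]
print "[$all_removed]\n";   # [China  Taiwan]
print "[$deleted]\n";       # [China  Taiwan]
```

Both working versions leave the two surrounding spaces in place.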
Quoted syntax
Symbol    General    Description          Interpolated
' '       q/ /       String               No
" "       qq/ /      String               Yes
` `       qx/ /      Execution            Yes
( )       qw/ /      List of words        No
/ /       m/ /       Pattern matching     Yes
s/ / /    s/ / /     Substitution         Yes
y/ / /    tr/ / /    Transliteration      No
" "       qr/ /      Regular expression   Yes
Appendix
Useful techniques
Shell command – file/directory
mkdir("doc", 0744);     # permission modes are octal (leading 0), not hexadecimal (0x)
chdir("doc");
rmdir("doc");
unlink("log.txt");
chmod(0700, "log1.txt", "log2.txt", "log3.txt");
rename("old_name", "new_name");
chown(<uid>, <gid>, "log1.txt", "log2.txt", "log3.txt");
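These built-ins report failure through their return value and $!, so it helps to check each call. A minimal sketch that works inside a temporary directory; the use of File::Temp and the path names are assumptions made for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Work in a scratch directory that is removed when the script exits.
my $base = tempdir(CLEANUP => 1);
my $dir  = "$base/doc";
my $file = "$dir/log.txt";

mkdir($dir, 0744) or die "mkdir failed: $!";

open(my $fh, '>', $file) or die "open failed: $!";
print {$fh} "test\n";
close($fh) or die "close failed: $!";

# chmod and unlink return the number of files they changed or removed.
chmod(0700, $file) == 1          or die "chmod failed: $!";
rename($file, "$dir/log.old")    or die "rename failed: $!";
unlink("$dir/log.old") == 1      or die "unlink failed: $!";
rmdir($dir)                      or die "rmdir failed: $!";

# Octal 0744 and hexadecimal 0x744 are different numbers.
printf "0744 = %d, 0x744 = %d\n", 0744, 0x744;   # 0744 = 484, 0x744 = 1860
```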
Perl
Usage: perl [switches] [--] [programfile] [arguments]
  -c              check syntax only (runs BEGIN and CHECK blocks)
  -d[:debugger]   run program under debugger
  -e program      one line of program (several -e's allowed, omit programfile)
  -i[extension]   edit <> files in place (makes backup if extension supplied)
  -n              assume "while (<>) { ... }" loop around program
  -p              assume loop like -n but print line also, like sed
  -u              dump core after parsing program
  -v              print version, subversion
  -w              enable many useful warnings (RECOMMENDED)
  -W              enable all warnings
  -X              disable all warnings
Removal of ^M
perl -pi.bak -e 's/\r//g;' index.html
File Copy
#! /usr/bin/perl
use File::Copy;
copy("file1", "file2");
Reserved word for debug
__FILE__ __LINE__
Ex:
print "FILE:".__FILE__." LINE:".__LINE__."\n";
Debug
perl -d "program name"
Debug
$ perlcc -d test.pl
Special variable
$_            the last assignment (the default variable)
$!            error message
$$            current process ID
$?            status of the previous child process when it ended
$"            the separator of the list
$/            input record separator
$`, $&, $'    string matching (text before the match, the match, text after the match)
$+            the last backreference
@-            @LAST_MATCH_START
@+            @LAST_MATCH_END
@_            arguments of a subroutine
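A few of these variables are easiest to understand by printing them. A small sketch; the sample sequence, the pattern, and the sub name total are chosen only for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $" is the separator used when an array is interpolated into a string.
my @bases = qw(A T G C);
my $joined;
{
    local $" = '-';
    $joined = "@bases";
}
print "$joined\n";   # A-T-G-C

# After a successful match: $` (before), $& (match), $' (after),
# and the offsets of the whole match in $-[0] and $+[0].
my $seq = "GHEGVGKVVKLGAGA";
my ($pre, $match, $post, $start, $end);
if ($seq =~ /VGK/) {
    ($pre, $match, $post) = ($`, $&, $');
    ($start, $end)        = ($-[0], $+[0]);
}
print "$pre|$match|$post $start-$end\n";   # GHEG|VGK|VVKLGAGA 4-7

# @_ holds the arguments passed to a subroutine.
sub total {
    my $sum = 0;
    $sum += $_ for @_;
    return $sum;
}
print total(1, 2, 3), "\n";   # 6
```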
Bytecode generator
$ perlcc -B -o test test3.pl
CPAN
perl -MCPAN -e "install GD"
BioPerl
PSI-BLAST
Position Specific Iterative BLAST constructs a multiple sequence alignment, then creates a position-specific scoring matrix (PSSM).
[Figure labels: Query sequence, BLAST, Sequence database, Homologous proteins, Multiple sequence alignment, PSSM, New homologous proteins]
PSSM (1/4)
Query sequence:
G H E G V G K V V K L G A G A
Homologous proteins:
G H E K K G Y F E D R G P S A
G H E G Y G G R S R G G G Y S
G H E F E G P K G C G A L Y I
G H E L R G T T F M P A L E C
Residue counts per column:
   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
A  0  0  0  0  0  0  0  0  0  0  0  2  1  0  2
C  0  0  0  0  0  0  0  0  0  1  0  0  0  0  1
D  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
E  0  0  5  0  1  0  0  0  1  0  0  0  0  1  0
F  0  0  0  1  0  0  0  1  1  0  0  0  0  0  0
G  5  0  0  2  0  5  1  0  1  0  2  3  1  1  0
H  0  5  0  0  0  0  0  0  0  0  0  0  0  0  0
I  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
K  0  0  0  1  1  0  1  1  0  1  0  0  0  0  0
L  0  0  0  1  0  0  0  0  0  0  1  0  2  0  0
M  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0
N  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
P  0  0  0  0  0  0  1  0  0  0  1  0  1  0  0
Q  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
R  0  0  0  0  1  0  0  1  0  1  1  0  0  0  0
S  0  0  0  0  0  0  0  0  1  0  0  0  0  1  1
T  0  0  0  0  0  0  1  1  0  0  0  0  0  0  0
V  0  0  0  0  1  0  0  1  1  0  0  0  0  0  0
W  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
Y  0  0  0  0  1  0  1  0  0  0  0  0  0  2  0
Frequency
Column 1: f_{A,1}=0/5, f_{C,1}=0/5, …, f_{G,1}=5/5, …
Column 2: f_{A,2}=0/5, f_{C,2}=0/5, …, f_{H,2}=5/5, …
…
Column 15: f_{A,15}=2/5, f_{C,15}=1/5, …, f_{S,15}=1/5, …
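The count table can be derived from the five aligned sequences with a short loop. A minimal sketch, assuming the alignment is stored as ungapped strings of equal length; the sub name column_counts and the data layout are illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The query sequence followed by the four homologous proteins from the slide.
my @alignment = qw(
    GHEGVGKVVKLGAGA
    GHEKKGYFEDRGPSA
    GHEGYGGRSRGGGYS
    GHEFEGPKGCGALYI
    GHELRGTTFMPALEC
);

# Returns a list with one hash per column: residue => number of occurrences.
sub column_counts {
    my @seqs = @_;
    my @count;
    for my $seq (@seqs) {
        my @residues = split //, $seq;
        $count[$_]{ $residues[$_] }++ for 0 .. $#residues;
    }
    return @count;
}

my @count = column_counts(@alignment);
my $n     = scalar(@alignment);

# Columns are numbered from 1 on the slide and from 0 in the array.
printf "f(G,1)  = %d/%d\n", ($count[0]{G}  || 0), $n;   # 5/5
printf "f(A,15) = %d/%d\n", ($count[14]{A} || 0), $n;   # 2/5
printf "f(W,15) = %d/%d\n", ($count[14]{W} || 0), $n;   # 0/5
```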
PSSM (2/4)
The original data:
Column 1: f_{A,1}=0/5, f_{C,1}=0/5, …, f_{G,1}=5/5, …
Column 2: f_{A,2}=0/5, f_{C,2}=0/5, …, f_{H,2}=5/5, …
…
Column 15: f_{A,15}=2/5, f_{C,15}=1/5, …, f_{S,15}=1/5, …
Set a pseudo-count of 1:
Column 1: f'_{A,1}=(0+1)/(5+20), f'_{C,1}=(0+1)/(5+20), …, f'_{G,1}=(5+1)/(5+20), …
Column 2: f'_{A,2}=(0+1)/(5+20), f'_{C,2}=(0+1)/(5+20), …, f'_{H,2}=(5+1)/(5+20), …
…
Column 15: f'_{A,15}=(2+1)/(5+20), f'_{C,15}=(1+1)/(5+20), …, f'_{S,15}=(1+1)/(5+20), …
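The pseudo-count adjustment adds the pseudo-count to each residue's count, and adds 20 times the pseudo-count to the denominator, once for each of the 20 amino acids. A minimal sketch of that arithmetic; the sub name smoothed_frequency is illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# f' = (count + pseudo) / (N + 20 * pseudo)
sub smoothed_frequency {
    my ($count, $n_sequences, $pseudo) = @_;
    my $alphabet_size = 20;   # the 20 standard amino acids
    return ($count + $pseudo) / ($n_sequences + $alphabet_size * $pseudo);
}

# With 5 sequences and a pseudo-count of 1:
printf "%.2f\n", smoothed_frequency(5, 5, 1);   # 0.24  (6/25, e.g. G in column 1)
printf "%.2f\n", smoothed_frequency(2, 5, 1);   # 0.12  (3/25, e.g. A in column 15)
printf "%.2f\n", smoothed_frequency(0, 5, 1);   # 0.04  (1/25, any unseen residue)
```

No residue ends up with frequency zero, which keeps the later logarithm defined.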
PSSM (3/4)
The score is derived from the ratio of the observed to the expected frequencies. More precisely, the logarithm of this ratio is taken and referred to as the log-likelihood ratio:
Score_{i,j} = \log\left( \frac{f'_{i,j}}{q_i} \right)
where Score_{i,j} is the score for residue i at position j, f'_{i,j} is the relative frequency of residue i at position j, and q_i is the expected relative frequency of residue i in a random sequence.
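The slides do not state the value of q_i or the base of the logarithm. The sketch below assumes a uniform background (q_i = 1/20) and a base-2 logarithm; both are assumptions, as is the sub name pssm_score. Under these assumptions, counts of 1, 2, 3 and 5 give 0.7, 1.3, 1.7 and 2.3. A count of 0 gives about -0.32, so a table that lists -0.2 for unseen residues may follow a different convention for that case.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Log-likelihood score for one residue count in one column.
sub pssm_score {
    my ($count, $n_sequences) = @_;
    my $f_prime = ($count + 1) / ($n_sequences + 20);   # pseudo-count of 1
    my $q       = 1 / 20;                               # ASSUMPTION: uniform background frequency
    return log($f_prime / $q) / log(2);                 # ASSUMPTION: base-2 logarithm
}

# Scores for the counts that occur with 5 aligned sequences.
printf "count %d -> %.1f\n", $_, pssm_score($_, 5) for 0, 1, 2, 3, 5;
# count 0 -> -0.3
# count 1 -> 0.7
# count 2 -> 1.3
# count 3 -> 1.7
# count 5 -> 2.3
```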
PSSM (4/4)
     1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
A -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  1.3  0.7 -0.2  1.3
C -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2  0.7
D -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2
E -0.2 -0.2  2.3 -0.2  0.7 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2  0.7 -0.2
F -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
G  2.3 -0.2 -0.2  1.3 -0.2  2.3  0.7 -0.2  0.7 -0.2  1.3  1.7  0.7  0.7 -0.2
H -0.2  2.3 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
I -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7
K -0.2 -0.2 -0.2  0.7  0.7 -0.2  0.7  0.7 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2
L -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2  1.3 -0.2 -0.2
M -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2
N -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
P -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2  0.7 -0.2  0.7 -0.2 -0.2
Q -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
R -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2  0.7 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2
S -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2 -0.2 -0.2  0.7  0.7
T -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
V -0.2 -0.2 -0.2 -0.2  0.7 -0.2 -0.2  0.7  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
W -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2
Y -0.2 -0.2 -0.2 -0.2  0.7 -0.2  0.7 -0.2 -0.2 -0.2 -0.2 -0.2 -0.2  1.3 -0.2