2012 12 12_adam_v_final
TRANSCRIPT
Prof. Wim Van Criekinge
18th december 2012, VUmc, Amsterdam
Bioinformatics
2
Outline
• ScriptingPerl (Bioperl/Python)
examples spiders/bots
• DatabasesGenome Browser
examples biomart, galaxy
• AIClassification and clustering
examples WEKA (R, Rapidminer)
Math
Informatics
Bioinformatics, a life science discipline …
(Molecular)Biology
Math
Informatics
Bioinformatics, a life science discipline …
Theoretical Biology
Computational Biology
(Molecular)Biology
Computer Science
Math
Informatics
Bioinformatics, a life science discipline …
Theoretical Biology
Computational Biology
(Molecular)Biology
Computer Science
Bioinformatics
Math
Informatics
Bioinformatics, a life science discipline … management of expectations
Theoretical Biology
Computational Biology
(Molecular)Biology
Computer Science
Bioinformatics
Interface Design
AI, Image Analysisstructure prediction (HTX)
Sequence Analysis
Expert Annotation
NPDatamining
Math
Informatics
Bioinformatics, a life science discipline … management of expectations
Theoretical Biology
Computational Biology
(Molecular)Biology
Computer Science
BioinformaticsDiscovery Informatics – Computational Genomics
Interface Design
AI, Image Analysisstructure prediction (HTX)
Sequence Analysis
Expert Annotation
NPDatamining
8
Bioinformatics
9
10
• Perl is a High-level Scripting language• Larry Wall created Perl in 1987Practical Extraction (a)nd Reporting Language(or Pathologically Eclectic Rubbish Lister)• Born from a system administration
tool• Faster than sh or csh• Sslower than C• No need for sed, awk, tr, wc, cut, …• Perl is open and free• http://conferences.oreillynet.com/eurooscon/
What is Perl ?
11
• Perl is available for most computing platforms: all flavors of UNIX (Linux), MS-DOS/Win32, Macintosh, VMS, OS/2, Amiga, AS/400, Atari
• Perl is a computer language that is:Interpreted, compiles at run-time (need for perl.exe !)Loosely “typed”String/text orientedCapable of using multiple syntax formats
• In Perl, “there’s more than one way to do it”
What is Perl ?
12
• Ease of use by novice programmers• Flexible language: Fast software prototyping
(quick and dirty creation of small analysis programs)
• Expressiveness. Compact code, Perl Poetry: @{$_[$#_]||[]}
• Glutility: Read disparate files and parse the relevant data into a new format
• Powerful pattern matching via “regular expressions” (Best Regular Expressions on Earth)
• With the advent of the WWW, Perl has become the language of choice to create Common Gateway Interface (CGI) scripts to handle form submissions and create compute severs on the WWW.
• Open Source – Free. Availability of Perl modules for Bioinformatics and Internet.
Why use Perl for bioinformatics ?
13
• Some tasks are still better done with other languages (heavy computations / graphics)
C(++),C#, Fortran, Java (Pascal,Visual Basic)
• With perl you can write simple programs fast, but on the other hand it is also suitable for large and complex programs. (yet, it is not adequate for very large projects)
Python
• Larry Wall: “For programmers, laziness is a virtue”
Why NOT use Perl for bioinformatics ?
14
• Sequence manipulation and analysis
• Parsing results of sequence analysis programs (Blast, Genscan, Hmmer etc)
• Parsing database (eg Genbank) files
• Obtaining multiple database entries over the internet
• …
What bioinformatics tasks are suited to Perl ?
15
• PerlPerl is available for various operating systems. To download Perl and install it on your computer, have a look at the following resources: www.perl.com (O'Reilly). Downloading Perl Software ActiveState. ActivePerl for Windows, as well as for Linux and Solaris. ActivePerl binary packages. CPAN
• PHPTriad: bevat Apache/PHP en MySQL: http://sourceforge.net/projects/phptriad
Perl installation
16
Check installation
• Command-line flags for perlPerl – v Gives the current version of PerlPerl –e Executes Perl statements from the comment line.
Perl –e “print 42;”
Perl –e “print \”Two\n\lines\n\”;”
Perl –we Executes and print warnings
Perl –we “print ‘hello’;x++;”
17
• Syntax highlighting
• Run program (prompt for parameters)
• Show line numbers
• Clip-ons for web with perl syntax
• ….
TextPad
18
Customize textpad part 1: Create Document Class
19
• Document classes
20
Customize textpad part 2: Add Perl to “Tools Menu”
21
Unzip to textpad samples directory
22
• Perl is mostly a free format language: add spaces, tabs or new lines wherever you want.
• For clarity, it is recommended to write each statement in a separate line, and use indentation in nested structures.
• Comments: Anything from the # sign to the end of the line is a comment. (There are no multi-line comments).
• A perl program consists of all of the Perl statements of the file taken collectively as one big routine to execute.
General Remarks
23
Three Basic Data Types
•Scalars - $
•Arrays of scalars - @
•Associative arrays of scalers or Hashes - %
2+2 = ?
$a = 2;
$b = 2;
$c = $a + $b;
$ - indicates a variable
; - ends every command
= - assigns a value to a variable
$c = 2 + 2;or
$c = 2 * 2;or
$c = 2 / 2;or
$c = 2 ^ 4;or 2^4 <-> 24 =16
$c = 1.35 * 2 - 3 / (0.12 + 1);or
Ok, $c is 4. How do we know it?
print “Hello \n”;
print command:
$c = 4;
print “$c”;
“ ” - bracket output expression
\n - print a end-of-the-line character
(equivalent to pressing ‘Enter’)
print “Hello everyone\n”;
print “Hello” . ” everyone” . “\n”;
Strings concatenation:
Expressions and strings together:
print “2 + 2 = “ . (2+2) . ”\n”;
expression
2 + 2 = 4
Loops and cycles (for statement):
# Output all the numbers from 1 to 100
for ($n=1; $n<=100; $n+=1) {
print “$n \n”;
}
1. Initialization:for ( $n=1 ; ; ) { … }
2. Increment:for ( ; ; $n+=1 ) { … }
3. Termination (do until the criteria is satisfied):for ( ; $n<=100 ; ) { … }
4. Body of the loop - command inside curly brackets:for ( ; ; ) { … }
FOR & IF -- all the even numbers from 1 to 100:
for ($n=1; $n<=100; $n+=1) {
if (($n % 2) == 0) {
print “$n”;
}
}
Note: $a % $b -- Modulus -- Remainder when $a is divided by $b
28
Two brief diversions (warnings & strict)
• Use warnings • strict – forces you to ‘declare’ a variable
the first time you use it.usage: use strict; (somewhere near the top of your script)
• declare variables with ‘my’usage: my $variable; or: my $variable = ‘value’;
• my sets the ‘scope’ of the variable. Variable exists only within the current block of code
• use strict and my both help you to debug errors, and help prevent mistakes.
29
Text Processing Functions
The substr function• Definition• The substr function extracts a substring out of a
string and returns it. The function receives 3 arguments: a string value, a position on the string (starting to count from 0) and a length.
Example:• $a = "university"; • $k = substr ($a, 3, 5); • $k is now "versi" $a remains unchanged. • If length is omitted, everything to the end of the
string is returned.
30
Random
$x = rand(1);
• srand The default seed for srand, which used to be time, has been changed. Now it's a heady mix of difficult-to-predict system-dependent values, which should be sufficient for most everyday purposes. Previous to version 5.004, calling rand without first calling srand would yield the same sequence of random numbers on most or all machines. Now, when perl sees that you're calling rand and haven't yet called srand, it calls srand with the default seed. You should still call srand manually if your code might ever be run on a pre-5.004 system, of course, or if you want a seed other than the default
31
Demo/Example
• Oefening hoe goed zijn de random nummers ?
• Als ze goed zijn kan je er Pi mee berekenen …
• Een goede random generator is belangrijk voor goede randomsequenties die we nadien kunnen gebruiken in simulaties
32
Bereken Pi aan de hand van twee random getallen
1
x
y
33
Introduction
Buffon's Needle is one of the oldest problems in the field of geometrical probability. It was first stated in 1777. It involves dropping a needle on a lined sheet of paper and determining the probability of the needle crossing one of the lines on the page. The remarkable result is that the probability is directly related to the value of pi.
http://www.angelfire.com/wa/hurben/buff.html
In Postscript you send it too the printer … PS has no variables but “stacks”, you can mimick this in Perl by recursively loading and rewriting a subroutine
34– http://www.csse.monash.edu.au/~damian/papers/HTML/Perligata.html
35
Programming
•Variables
•Flow control (if, regex …)
•Loops
• input/output
•Subroutines/object
36
What is a regular expression?
• A regular expression (regex) is simply a way of describing text.
• Regular expressions are built up of small units (atoms) which can represent the type and number of characters in the text
• Regular expressions can be very broad (describing everything), or very narrow (describing only one pattern).
37
38
Regular Expression Review
• A regular expression (regex) is a way of describing text.
• Regular expressions are built up of small units (atoms) which can represent the type and number of characters in the text
• You can group or quantify atoms to describe your pattern
• Always use the bind operator (=~) to apply your regular expression to a variable
39
Why would you use a regex?
• Often you wish to test a string for the presence of a specific character, word, or phrase
Examples
“Are there any letter characters in my string?” “Is this a valid accession number?” “Does my sequence contain a start codon (ATG)?”
40
Regular Expressions
Match to a sequence of characters
The EcoRI restriction enzyme cuts at the consensus sequence GAATTC.
To find out whether a sequence contains a restriction site for EcoR1, write;
if ($sequence =~ /GAATTC/) { ...};
41
Regex-style
[m]/PATTERN/[g][i][o]
s/PATTERN/PATTERN/[g][i][e][o]
tr/PATTERNLIST/PATTERNLIST/[c][d][s]
42
Regular Expressions
Match to a character class• Example• The BstYI restriction enzyme cuts at the consensus sequence
rGATCy, namely A or G in the first position, then GATC, and then T or C. To find out whether a sequence contains a restriction site for BstYI, write;
• if ($sequence =~ /[AG]GATC[TC]/) {...}; # This will match all of AGATCT, GGATCT, AGATCC, GGATCC.
Definition• When a list of characters is enclosed in square brackets [], one and
only one of these characters must be present at the corresponding position of the string in order for the pattern to match. You may specify a range of characters using a hyphen -.
• A caret ^ at the front of the list negates the character class. Examples• if ($string =~ /[AGTC]/) {...}; # matches any nucleotide • if ($string =~ /[a-z]/) {...}; # matches any lowercase letter • if ($string =~ /chromosome[1-6]/) {...}; # matches chromosome1,
chromosome2 ... chromosome6 • if ($string =~ /[^xyzXYZ]/) {...}; # matches any character except x, X, y,
Y, z, Z
43
Constructing a Regex
• Pattern starts and ends with a / /pattern/if you want to match a /, you need to escape it \/ (backslash, forward slash)
you can change the delimiter to some other character, but you probably won’t need to m|pattern|
• any ‘modifiers’ to the pattern go after the last /
i : case insensitive /[a-z]/i o : compile once g : match in list context (global) m or s : match over multiple lines
44
Looking for a pattern
• By default, a regular expression is applied to $_ (the default variable)
if (/a+/) {die} looks for one or more ‘a’ in $_
• If you want to look for the pattern in any other variable, you must use the bind operator
if ($value =~ /a+/) {die} looks for one or more ‘a’ in $value
• The bind operator is in no way similar to the ‘=‘ sign!! = is assignment, =~ is bind.
if ($value = /[a-z]/) {die} Looks for one or more ‘a’ in $_, not $value!!!
45
Regular Expression Atoms
• An ‘atom’ is the smallest unit of a regular expression.
• Character atoms 0-9, a-Z match themselves . (dot) matches everything [atgcATGC] : A character class (group) [a-z] : another character class, a through z
46
Quantifiers
• You can specify the number of times you want to see an atom. Examples
•\d* : Zero or more times•\d+ : One or more times•\d{3} : Exactly three times•\d{4,7} : At least four, and not more than seven•\d{3,} : Three or more times We could rewrite /\d\d\d-\d\d\d\d/ as:
/\d{3}-\d{4}/
47
Anchors
• Anchors force a pattern match to a certain location•^ : start matching at beginning of string•$ : start matching at end of string•\b : match at word boundary (between \w and \W)
• Example:•/^\d\d\d-\d\d\d\d$/ : matches only valid phone numbers
48
Remembering Stuff
• Being able to match patterns is good, but limited.
• We want to be able to keep portions of the regular expression for later.
Example: $string = ‘phone: 353-7236’ We want to keep the phone number only Just figuring out that the string contains a phone number is insufficient,
we need to keep the number as well.
49
Memory Parentheses (pattern memory)
• Since we almost always want to keep portions of the string we have matched, there is a mechanism built into perl.
• Anything in parentheses within the regular expression is kept in memory.
‘phone:353-7236’ =~ /^phone\:(.+)$/; Perl knows we want to keep everything that matches ‘.+’ in the above
pattern
50
Getting at pattern memory
• Perl stores the matches in a series of default variables. The first parentheses set goes into $1, second into $2, etc.
This is why we can’t name variables ${digit}Memory variables are created only in the amounts needed. If you have three sets of parentheses, you have ($1,$2,$3).Memory variables are created for each matched set of parentheses. If you have one set contained within another set, you get two variables (inner set gets lowest number)Memory variables are only valid in the current scope
51
Finding all instances of a match
• Use the ‘g’ modifier to the regular expression@sites = $sequence =~ /(TATTA)/g;think g for globalReturns a list of all the matches (in order), and stores them in the arrayIf you have more than one pair of parentheses, your array gets values in sets ($1,$2,$3,$1,$2,$3...)
52
Perl is Greedy
• In addition to taking all your time, perl regular expressions also try to match the largest possible string which fits your pattern
/ga+t/ matches gat, gaat, gaaat‘Doh! No doughnuts left!’ =~ /(d.+t)/ $1 contains ‘doughnuts left’
• If this is not what you wanted to do, use the ‘?’ modifier
/(d.+?t)/ # match as few ‘.’s as you can and still make the pattern work
53
Substitute function
• s/pattern1/pattern2/;
• Looks kind of like a regular expressionPatterns constructed the same way
• Inherited from previous languages, so it can be a bit different.Changes the variable it is bound to!
54
55
tr function
• translate or transliterate
• tr/characterlist1/characterlist2/;
• Even less like a regular expression than s
• substitutes characters in the first list with characters in the second list
$string =~ tr/a/A/; # changes every ‘a’ to an ‘A’ No need for the g modifier when using tr.
56
Translations
57
Using tr
• Creating complimentary DNA sequence$sequence =~ tr/atgc/TACG/;
• Sneaky Perl trick for the daytr does two things. 1. changes characters in the bound variable 2. Counts the number of times it does thisSuper-fast character counter™ $a_count = $sequence =~ tr/a/a/; replaces an ‘a’ with an ‘a’ (no net change), and assigns the result
(number of substitutions) to $a_count
58
Regex-Related Special Variables
• Perl has a host of special variables that get filled after every m// or s/// regex match. $1, $2, $3, etc. hold the backreferences. $+ holds the last (highest-numbered) backreference. $& (dollar ampersand) holds the entire regex match.
• @- is an array of match-start indices into the string. $-[0] holds the start of the entire regex match, $-[1] the start of the first backreference, etc. Likewise, @+ holds match-end indices (ends, not lengths).
• $' (dollar followed by an apostrophe or single quote) holds the part of the string after (to the right of) the regex match. $` (dollar backtick) holds the part of the string before (to the left of) the regex match. Using these variables is not recommended in scripts when performance matters, as it causes Perl to slow down all regex matches in your entire script.
• All these variables are read-only, and persist until the next regex match is attempted. They are dynamically scoped, as if they had an implicit 'local' at the start of the enclosing scope. Thus if you do a regex match, and call a sub that does a regex match, when that sub returns, your variables are still set as they were for the first match.
59
Voorbeeld
Which of following 4 sequences (seq1/2/3/4) a) contains a “Galactokinase signature”
b) How many of them? c) Where (hints:pos and $&) ?
http://us.expasy.org/prosite/
60
>SEQ1MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT
YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVAM MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR
>SEQ2MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK
VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ
>SEQ3MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY
SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLACISVDRY LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGSSSGH TSTTL
>SEQ4MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK
HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA
61
Arrays
Definitions• A scalar variable contains a scalar value: one number or one
string. A string might contain many words, but Perl regards it as one unit.
• An array variable contains a list of scalar data: a list of numbers or a list of strings or a mixed list of numbers and strings. The order of elements in the list matters.
Syntax• Array variable names start with an @ sign.• You may use in the same program a variable named $var and
another variable named @var, and they will mean two different, unrelated things.
Example• Assume we have a list of numbers which were obtained as a
result of some measurement. We can store this list in an array variable as the following:
• @msr = (3, 2, 5, 9, 7, 13, 16);
62
The foreach construct
The foreach construct iterates over a list of scalar values (e.g. that are contained in an array) and executes a block of code for each of the values.
• Example:foreach $i (@some_array) { statement_1; statement_2; statement_3; } Each element in @some_array is aliased to the variable $i in turn, and the block of code inside the curly brackets {} is executed once for each element.• The variable $i (or give it any other name you wish) is
local to the foreach loop and regains its former value upon exiting of the loop.
• Remark $_
63
Examples for using the foreach construct - cont.
• Calculate sum of all array elements:
#!/usr/local/bin/perl
@msr = (3, 2, 5, 9, 7, 13, 16);
$sum = 0;
foreach $i (@msr) {
$sum += $i; }
print "sum is: $sum\n";
64
Accessing individual array elements
Individual array elements may be accessed by indicating their position in the list (their index).
Example:@msr = (3, 2, 5, 9, 7, 13, 16);index value 0 3 1 2 2 5 3 9 4 7 5 13 6 16 First element: $msr[0] (here has the value of
3),Third element: $msr[2] (here has the value of
5),and so on.
65
The sort function
The sort function receives a list of variables (or an array) and returns the sorted list.
@array2 = sort (@array1);
#!/usr/local/bin/perl @countries = ("Israel", "Norway", "France", "Argentina"); @sorted_countries = sort ( @countries); print "ORIG: @countries\n", "SORTED: @sorted_countries\n"; Output:ORIG: Israel Norway France Argentina SORTED: Argentina France Israel Norway
#!/usr/local/bin/perl @numbers = (1 ,2, 4, 16, 18, 32, 64); @sorted_num = sort (@numbers); print "ORIG: @numbers \n", "SORTED: @sorted_num \n"; Output:ORIG: 1 2 4 16 18 32 64 SORTED: 1 16 18 2 32 4 64 Note that sorting numbers does not happen numerically, but by the string values of
each number.
66
The push and shift functions
The push function adds a variable or a list of variables to the end of a given array.
Example:$a = 5; $b = 7; @array = ("David", "John", "Gadi"); push (@array, $a, $b); # @array is now ("David", "John", "Gadi", 5, 7)
The shift function removes the first element of a given array and returns this element.
Example:@array = ("David", "John", "Gadi"); $k = shift (@array); # @array is now ("John", "Gadi"); # $k is now "David"
Note that after both the push and shift operations the given array @array is changed!
67
Perl Array review
• An array is designated with the ‘@’ sign• An array is a list of individual elements• Arrays are orderedYour list stays in the same order that you created it, although you can add or subtract elements to the front or back of the list
• You access array elements by number, using the special syntax:
$array[1] returns the ‘1th’ element of the array (remember perl starts counting at zero)
• You can do anything with an array element that you can do with a scalar variable (addition, subtraction, printing … whatever)
68
Generate random sequence string
for($n=1;$n<=50;$n++)
{
@a = ("A","C","G","T");
$b=$a[rand(@a)];
$r.=$b;
}
print $r;
69
Text Processing Functions
The split function
• The split function splits a string to a list of substrings according to the positions of a given delimiter. The delimiter is written as a pattern enclosed by slashes: /PATTERN/. Examples:
• $string = "programming::course::for::bioinformatics";
• @list = split (/::/, $string); • # @list is now ("programming", "course", "for",
"bioinformatics") # $string remains unchanged. • $string = "protein kinase C\t450 Kilodaltons\t120
Kilobases"; • @list = split (/\t/, $string); #\t indicates tab # • @list is now ("protein kinase C", "450
Kilodaltons", "120 Kilobases")
70
Text Processing Functions
The join function• The join function does the opposite of split. It
receives a delimiter and a list of strings, and joins the strings into a single string, such that they are separated by the delimiter.
• Note that the delimiter is written inside quotes. • Examples:• @list = ("programming", "course", "for",
"bioinformatics"); • $string = join ("::", @list); • # $string is now
"programming::course::for::bioinformatics" • $name = "protein kinase C"; $mol_weight = "450
Kilodaltons"; $seq_length = "120 Kilobases"; • $string = join ("\t", $name, $mol_weight,
$seq_length); • # $string is now: # "protein kinase C\t450
Kilodaltons\t120 Kilobases"
71
When is an array not good enough?
• Sometimes you want to associate a given value with another value. (name/value pairs)
(Rob => 353-7236, Matt => 353-7122, Joe_anonymous => 555-1212)
(Acc#1 => sequence1, Acc#2 => sequence2, Acc#n => sequence-n)
• You could put this information into an array, but it would be difficult to keep your names and values together (what happens when you sort? Yuck)
72
Problem solved: The associative array
• As the name suggests, an associative array allows you to link a name with a value
• In perl-speak: associative array = hash‘hash’ is the preferred term, for various arcane reasons, including that it is easier to say.
• Consider an array: The elements (values) are each associated with a name – the index position. These index positions are numerical, sequential, and start at zero.
• A hash is similar to an array, but we get to name the index positions anything we want
73
The ‘structure’ of a Hash
• An array looks something like this:
@array =Index
Value
0 1 2
'val1' 'val2' 'val3'
74
The ‘structure’ of a Hash
• An array looks something like this:
• A hash looks something like this:
@array =Index
Value
0 1 2
'val1' 'val2' 'val3'
Rob Matt Joe_A
353-7236 353-7122 555-1212
Key (name)
Value%phone =
75
Creating a hash
• There are several methods for creating a hash. The most simple way – assign a list to a hash.
%hash = (‘rob’, 56, ‘joe’, 17, ‘jeff’, ‘green’);
• Perl is smart enough to know that since you are assigning a list to a hash, you meant to alternate keys and values.
%hash = (‘rob’ => 56 , ‘joe’ => 17, ‘jeff’ => ‘green’);
• The arrow (‘=>’) notation helps some people, and clarifies which keys go with which values. The perl interpreter sees ‘=>’ as a comma.
76
Getting at values
• You should expect by now that there is some way to get at a value, given a key.
• You access a hash key like this:$hash{‘key’}
• This should look somewhat familiar$array[21] : refer to a value associated with a specific index position in an array$hash{key} : refer to a value associated with a specific key in a hash
77
• There is more than one right way to do it. Unfortunately, there are also many wrong ways.
1. Always check and make sure the output is correct and logical Consider what errors might occur, and take steps to ensure that you are
accounting for them.2. Check to make sure you are using every variable you declare. Use Strict !3. Always go back to a script once it is working and see if you can eliminate unnecessary steps. Concise code is good code. You will learn more if you optimize your code. Concise does not mean comment free. Please use as many comments as you
think are necessary. Sometimes you want to leave easy to understand code in, rather than short but
difficult to understand tricks. Use your judgment. Remember that in the future, you may wish to use or alter the code you wrote
today. If you don’t understand it today, you won’t tomorrow.
Programming in general and Perl in particular
78
Develop your program in stages. Once part of it works, save the working version to another file (or use a source code control system like RCS) before continuing to improve it.
When running interactively, show the user signs of activity. There is no need to dump everything to the screen (unless requested to), but a few words or a number change every few minutes will show that your program is doing something.
Comment your script. Any information on what it is doing or why might be useful to you a few months later.
Decide on a coding convention and stick to it. For example,
for variable names, begin globals with a capital letter and privates (my) with a lower case letter indent new control structures with (say) 2 spaces line up closing braces, as in: if (....) { ... ... } Add blank lines between sections to improve readibility
Programming in general and Perl in particular
79
CPAN
• CPAN: The Comprehensive Perl Archive Network is available at www.cpan.org and is a very large respository of Perl modules for all kind of taks (including bioperl)
80
What is BioPerl?
• An ‘open source’ projecthttp://bio.perl.org or http://www.cpan.org• A loose international collaboration of
biologist/programmersNobody (that I know of) gets paid for this• A collection of PERL modules and methods
for doing a number of bioinformatics tasksThink of it as subroutines to do biology• Consider it a ‘tool-box’There are a lot of nice tools in there, and (usually) somebody else takes care of fixing parsers when they break • BioPerl code is portable - if you give
somebody a script, it will probably work on their system
81
Multi-line parsing
use strict;use Bio::SeqIO;
my $filename="sw.txt";my $sequence_object;
my $seqio = Bio::SeqIO -> new ( '-format' => 'swiss', '-file' => $filename );
while ($sequence_object = $seqio -> next_seq) {my $sequentie = $sequence_object-> seq(); print $sequentie."\n";}
82
Live.pl
#!e:\Perl\bin\perl.exe -w# script for looping over genbank entries, printing out
nameuse Bio::DB::Genbank;use Data::Dumper;
$gb = new Bio::DB::GenBank();
$sequence_object = $gb->get_Seq_by_id('MUSIGHBA1');
print Dumper ($sequence_object);
$seq1_id = $sequence_object->display_id();$seq1_s = $sequence_object->seq();print "seq1 display id is $seq1_id \n";print "seq1 sequence is $seq1_s \n";
83
Bioperl 101: 2 ESSENTIAL TOOLS
Data::Dumper to find out what class your in
Perl bptutorial (100 Bio::Seq) to find the available methods for that class
84
Outline
• ScriptingPerl (Bioperl/Python)
examples spiders/bots
• DatabasesGenome Browser
examples biomart, galaxy
• AIClassification and clustering
examples WEKA (R, Rapidminer)
85
Overview
• Bots and SpidersThe webBotsSpidersReal world examples Bioinformatics applicationsPerl – LWP librariesGoogle hacksAdvanced APIsFetch data from NCBI / Ensembl /
86
The web
• The WWW-part of the Internet is based on hyperlinks
• So if one started to follow all hyperlinks, it would be possible to map almost the entire WWW
• Everything you can do as a human (clicking, filling in forms,…) can be done by machines
87
Bots
• Webbots (web robots, WWW robots, bots): software applications that run automated tasks over the Internet
• Bots perform tasks that:Are simpleStructurally repetitiveAt a much higher rate than would be possible for a
human• Automated script fetches, analyses and files information
from web servers at many times the speed of a human• Other uses:
Chatbots IM / Skype / Wiki botsMalicious bots and bot networks (Zombies)
88
Spiders
• Webspiders / Crawlers are programs or automated scripts which browses the World Wide Web in a methodical, automated manner. It is one type of bot
• The spider starts with a list of URLs to visit, called the seeds As the crawler visits these
URLs, it identifies all the hyperlinks in the page
It adds them to the list of URLs to visit, called the crawl frontier
URLs from the frontier are recursively visited according to a set of policies
• This process is called web crawling or spidering: in most cases a mean of providing up-to-date data
89
Spiders
Use of webcrawlers: Mainly used to create a copy of all the visited pages for later
processing by a search engine that will index the downloaded pages to provide fast searches
Automating maintenance tasks on a website, such as checking links or validating HTML code
Can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses
Most common used crawler is probably the GoogleBot crawler Crawls Indexes (content + key content tags and attributes, such as Title
tags and ALT attributes) Serves results: PageRank Technology
90
Spiders
91
Perl - LWP
LWP (also known as libwww-perl) The World-Wide Web library for Perl Set of Perl modules which provides a simple and
consistent application programming interface (API) to the World-Wide Web
Free book: http://lwp.interglacial.com/ LWP for newbies
LWP::Simple (demo1) Go to a URL, fetch data, ready to parse Attention: HTML tags and regular expression
92
Perl - LWP
Some more advanced features LWP::UserAgent (demo2 – show server access
logs) Fill in forms and parse results Depending on content: follow hyperlinks to other
pages and parse these again,… Bioinformatics examples Use genome browser data (demo3) and sequences Get gene aliases and symbols from GeneCards
(demo4)
93
Google hacks
Why not make use of crawls, indexing and serving technologies of others (e.g. Google) Google allows automated queries: per account 1000
queries a day Google uses Snippets: the short pieces of text you get
in the main search results This is the result of its indexing and parsing algoritms Demo5: LWP and Google combined and parsing the
results
94
Advanced APIs
An application programming interface (API) is a source code interface that an operating system, library or service provides to support requests made by computer programs
Language-dependent APIs Language-independent APIs are written in a way
they can be called from several programming languages. This is a desired feature for service style API which is not bound to a particular process or system and is available as a remote procedure call
95
Advanced APIs
Google example used Google API / SOAP NCBI API
The NCBI Web service is a web program that enables developers to access Entrez Utilities via the Simple Object Access Protocol (SOAP)
Programmers may write software applications that access the E-Utilities using any SOAP development tool
Main tools (demo6):E-Search Searches and retrieves primary IDs and term
translations and optionally retains results for future use in the user's environment
E-Fetch: Retrieves records in the requested format from a list of one or more primary IDs
Ensembl API (demo7)
96
Fetch data from NCBI
A NCBI database, frequently used is PubMed PubMed can be queried using E-Utils Uses syntax as regular PubMed website Get the data back in data formats as on the website
(XML, Plain Text) Parse XML results and more advanced Text-mining
techniques Demo8 Parse results and present them in an interface (
http://matrix.ugent.be/mate/methylome/result1.html)
97
Fetch data from NCBI
Example: PubMeth Get data from NCBI PubMed Get all genes and all aliases for human genes and their
annotations from Ensembl & GeneCards Get all cancer types from cancer thesaurius Parse PubMed results: find genes and aliases;
keywords Keep variants in mind (Regexes are very useful) Sort the PubMed abstracts and store found genes and
keywords in database; apply scoring scheme
98
Outline
• ScriptingPerl (Bioperl/Python)
examples spiders/bots
• DatabasesGenome Browser
examples biomart, galaxy
• AIClassification and clustering
examples WEKA (R, Rapidminer)
99
The three genome browsers
• There are three main browsers:EnsemblNCBI MapViewerUCSC
• At first glance their main distinguishing features are:
MapViewer is arranged vertically.Ensembl has multiple (22) different “Views”.UCSC has a single “View” for (almost) everything.
100
MapViewer Home
http://www.ncbi.nlm.nih.gov/mapview/
101
MapViewer Master Map
102
Selecting tracks on MapViewer
103
MapViewer strengths
• Good coverage of plant and fungal genomes.• Close integration with other NCBI tools and
databases, such as Model Maker, trace archives or Celera assemblies.
• Vertical view enables convenient overview of regional gene descriptions.
• Discontiguous MEGABLAST is probably the most sensitive tool available for cross-species sequence queries.
• Ability to view multiple assemblies (e.g. Celera and reference) simultaneously.
104
MapViewer limitations
• Little cross-species conservation or alignment data.
• Inability to upload custom annotations and data.
• Limited capability for batch data access.• Limited support for automated database
querying.• Vertical view makes base-pair level annotation
cumbersome.
105105
UCSC Genome Browser
106106
http://genome.ucsc.edu/
107107
UCSC Genome Browser
108
Strengths of the UCSC Browser (I)
For this course I will be focusing primarily on the UCSC Browser for several reasons:
• Strong comparative genomics capabilities.• Fast responsesequence searches performed with BLAT.code is written in speed-optimized C.Multiple indexing and non-normalized tables for fast database retrieval.
• (Essentially) single “view” from single base-pair to entire chromosome.
• Easiest interface for loading custom annotations.
109
UCSC Browser Strengths (II)
• Well suited for batch and automated querying of both gene and intergenic regions.
• Comprehensive: tends to have the most species, genes and annotations.• Annotations frequently updated (Genbank/Refseq daily / ESTs weekly).• Able to find “similar” genes easily with GeneSorter.• Rapid access to in situ images with VisiGene.
110
UCSC browser limitations
• Lack of “overview” mode can make it harder to see genomic context.
• Syntenic regions cannot be viewed simultaneously.
• Cross species sequence queries with BLAT are often insensitive.
• Comprehensiveness of database can make user interface intimidating.
• Code access for commercial users requires licensing.
111
Human, mouse,rat synteny in MapViewer
112112
Browser/Database Batch Querying
113
Batch querying overview
• Introduction / motivation• UCSC table browser• Custom tracks and frames• Galaxy and direct SQL database
querying• A batch query example• UCSC Database “gotchas”• Batch querying on Ensembl
114
Why batch querying
• Interactive querying is difficult if you want to study numerous “interesting” genomic regions.
• Querying each region interactively is:TediousTime-consumingError prone
115
Batch querying examples
• As an example, say you have found one hundred candidate polymorphisms and you want to know:
Are they in dbSNP?Do they occur in any known ESTs?Are the sites conserved in other vertebrates?Are they near any ”LINE” repeat sequences?
Of course you could repeat the procedures described in Part II one hundred times but that would get “old” very fast…
116
Other examples
• Other examples include characterizing multiple:
Non-coding RNA candidates ultra-conserved regions introns hosting snoRNA genes
117
Browsers and databases
• Each of the genome browsers is built on top of multiple relational databases.
• Typically data for each genome assembly are stored in a separate database and auxiliary data, e.g. gene ontology (GO) data, are stored in yet other databases.
• These databases may have hundreds of tables, many with millions of entries.
118
The UCSC Table Browser
• For batch queries, you need to query the browser databases.
• The conventional way of querying a relational database is via “Structured Query Language” (SQL).
• However with the Table Browser, you can query the database without using SQL.
119
Browser Database Formats
Nevertheless, even with the Table Browser, you need some understanding of the underlying track, table and file formats.
Table formats describe how data is stored in the (relational) databases.Track formats describe how the data is presented on the browser.File formats describe how the data is stored in “flat files” in conventional computer files.Finally, for understanding the underlying the computer code (as we will do in the last part of this tutorial) you will need to learn about the “C” structures which hold the data in the source code.
120
Main UCSC Data Formats
• GFF/GTF • BED (Browser Extensible Data) lists of genomic blocks
• PSLRNA/DNA alignments
• .chain pair-wise cross species alignments
• .maf multiple genome alignments
• .wignumerical data
121
Custom Tracks
• Custom tracks are essentially BED, PSL or GTF files with formatting lines so they can be displayed on the browser.
• A custom track file can contain multiple tracks, which may be in different formats.
• Custom tracks are useful for:Display of regions of interest on the browser.Sharing custom data with others.Input of multiple, arbitrary regions for annotation by the Table Browser.
• Custom tracks can be made by the Table Browser, or you can make them easily yourself.
122
Selecting custom track output
123123
Sending custom track to browser
124124
Adding a custom track
125
Adding a custom track (II)
126
Custom track example
browser position chr22:10000000-10020000browser hide alltrack name=clones description="Clones” visibility=3 color=0,128,0 useScore=1 chr22 10000000 10004000 cloneA 960 chr22 10002000 10006000 cloneB 200 chr22 10005000 10009000 cloneC 700 chr22 10006000 10010000 cloneD 600chr22 10011000 10015000 cloneE 300chr22 10012000 10017000 cloneF 100
127
Limitations of the table browser
• Can be difficult to create more complex queries.• With hundreds of tables, finding the one(s) you
want can be confusing.• Getting intersections or unions of genomic
regions is often a multi-step process and can be tedious or error prone.
• May be slower than direct SQL query.• Not designed for fully automated operation.
128
Ensembl
129
Ensembl Home http://www.ensembl.org/
130
Ensembl ContigView
131
Ensembl ContigView
132
Detail and Basepair view
133
Changing tracks in Ensembl
134134
Ensembl strengths (I)
• Multiple view levels shows genomic context.
• Some annotations are more complete and/or are more clearly presented (e.g. snpView of multiple mouse strain data.)
• Possible to create query over more than one genome database at a time (with BioMart).
135
Ensembl snpView
136
Ensembl strengths (II)
• Batch and automated querying well supported and documented (especially for perl and java).
• API (programmer interface) is designed to be identical for all databases in a release.
• Ensembl tends to be more “community oriented” - using standard, widely used tools and data formats.
• All data and code are completely free to all.
137
Ensembl is “community oriented”
• Close alliances with Wormbase, Flybase, SGD• “support for easy integration with third party data
and/or programs” – BioMart• Close integration with R/ Bioconductor software • More use of community standard formats and
programs, e.g. DAS, GFF/GTF, Bioperl
( Note: UCSC also supports GFF/GTF and is compatible with R/Bioconductor and DAS, but UCSC tends to use more “homegrown” formats, e.g. BED, PSL, and tools.)
138
Ensembl limitations
• Limited data quantifying cross-species sequence conservation.
• Batch queries for intergenic regions with BioMart are difficult.
• BioMart offers less complete access to database than UCSC Table Browser. (However, the user interface to BioMart is easier.)
139
BioMart
• BioMart - the Ensembl “Table browser” • Similar to the Table Browser and Galaxy tools.• Previous version was called EnsMart.• Fewer tables can be accessed with BioMart
than with UCSC Table Browser. In particular, non-gene oriented queries may be difficult.
• However, the user interface is simpler.• Tight interface with Bioconductor project for
annotation of microarray genes.
140
The Galaxy Website
• Galaxy website: http://g2.bx.psu.edu
• Galaxy objective: Provide sequence and data manipulation tools (a la SRS or the UCSD Biology Workbench) that are capable of being applied to genomic data.
• The intent is to provide an easy interface to numerous analysis tools with varied output formats that can work on data from multiple browsers / databases.
141141
142
• Galaxy is a web interface to bioinformatics tools that deal with genome-scale data
• There is a public server with many pre-installed tools• Many tools work with genomic intervals• Other tools work with various types of tab delimited data
formats, and some directly on DNA sequences• It has excellent tools to access public data• It can be installed on a local computer or set up as an
institutional server• Can access a standard or custom build on Amazon “Cloud”• Any command line tool or web service can easily be
wrapped into the Galaxy interface.
Demo: Galaxy Genomics Toolkit
143
Genome-Scale Data
• Bioinformatics work is challenging on very large “genomics” data setssequencing, gene expression, variants,
ChIPseq
• Complex command line programs• Genome Browsers • New tools
144
The Galaxy Interface has 3 parts
List of Tools Central work panelHistory =
data & results
145
Load Data from UCSC
Or upload from your computer
146
Demo: Galaxy Genomics Toolkit
• http://athos.ugent.be:8080: staat er een Galaxy instance.
• inloggen (als admin: [email protected], password: newnew)
• de cleanfq history heeft 2 paar fastq files en een ref fa en een ref gtf
147
Workflows
• Galaxy saves your data, and results in the History
• The exact commands and parameters used with each operation with each tool are also saved.
• These operations can be saved as a “Workflow”, which can be reused, and shared with other users.
148
• Galaxy has many public data sets and public workflows, which can be easily used in your projects (or a tutorial)
149
NGS tools
• Galaxy has recently been expanded with tools to analyze Next-Gen Sequence data
• File format conversions• Analysis methods specific to different sequencing
platforms (454, Illumina, SOLID)• Analysis methods specific to different applications
(RNA-seq, ChIP-seq, mutation finding, metagenomics, etc).
• NGS tools include file format conversion, mapping to reference genome, ChIPseq peak calling, RNA-seq gene expression, etc.
• NGS data analysis uses large files – slow to upload and slow to
process on a public server
151
A number of Groups have set up custom Galaxy servers with special tools
152
The SPARQLing future
153
Outline
• ScriptingPerl (Bioperl/Python)
examples spiders/bots
• DatabasesGenome Browser
examples biomart, galaxy
• AIClassification and clustering
examples WEKA (R, Rapidminer)
154
Wat is ‘intelligent’ ?
•Intelligentie = de mogelijkheid tot leren en begrijpen, tot het oplossen van problemen, tot het nemen van beslissingen
Machine learning …
155
Turing test voor intelligentie
THE IMITATION GAME
Vrouw
Man/Machine
Ondervrager: Wie van beide is de vrouw?
156
Wat is ‘artificieel’ ?
• Artificieel = kunstmatig = door de mens vervaardigd, niet van natuurlijke oorsprong
• in de context van A.I.: machines, meestal een digitale computer
• H. Simon: analogie mens-digitale computergeheugenuitvoeringseenheidcontrole-eenheid
157
Data mining
• WAT? extraheren van kennis uit data
• Data indelen in drie groepen:trainingssetvalidatiesettestset
• Clustering/Classificatie
158
Clustering
• WAT? ‘unsupervised learning’ – antwoord voor de trainingsdata niet gekend
• Resultaat meestal als boomstructuur
• Belangrijke methode: hiërarchisch clusteren opstellen van distance matrix
159
Cluster Analysis
• Unsupervised methods• Descriptive modelingGrouping of genes with “similar” expression profilesGrouping of disease tissues, cell lines, or toxicants with “similar” effects on gene expression
• Clustering algorithmsSelf-organizing mapsHierarchical clusteringK-means clusteringSVD
160
Linkage in Hierarchical Clustering
• Single linkage:S(A,B) = mina minb d(a,b)
• Average linkage:A(A,B) = (∑a ∑b d(a,b)) / |A| |B|
• Complete linkage:C(A,B) = maxa maxb d(a,b)
• Centroid linkage:M(A,B) = d(mean(A),mean(B))
• Hausdorff linkage:h(A,B) = maxa minb d(a,b)H(A,B) = max(h(A,B),h(B,A))
• Ward linkage:W(A,B) = (|A| |B| (M(A,B))2) / (|A|+|B|)
A
B
161
Hierarchical Clustering
3 clusters?
2 clusters?
162
Classificatie
• WAT? ‘supervised learning’ – antwoord voor de trainingsdata is gekend
• Verschillende classificatiemethoden:decision treeneurale netwerkensupport vector machines
163
Voorbeeld: tennis
Decision tree
164
Neurale netwerken
BOUW: Neuronen en verbindingenTAAK:verwerken van invoergegevensmachine learning
165
Support Vector Machines
Doorvoeren van een lineaire separatie in de data door de dimensies aan te passen
166
Bio-informatica toepassingen
• Decision tree: zoeken naar DNA-sequenties homoloog aan een gegeven DNA-sequentie
• Neurale netwerken: modelleren en analyseren van genexpressiegegevens, voorspellen van de inwerkingsplaatsen van proteasen
• Support Vector Machines: identificeren van genen betrokken bij anti-kankermechanismen, detecteren van homologie tussen eiwitten, analyse van genexpressie
167
Bio-informatica toepassingen
• Hiërarchisch clusteren: opstellen van fylo- genetische bomen op basis van DNA-sequenties
• Genetische algoritmes: moleculaire herkenning, relatie tussen structuur en functie ophelderen, Multiple Sequence Alignment
• Expertsystemen: ontdekken van blessures, vroege detectie van afwijkingen aan de hartklep
• Fuzzy logic: primerdesign, voorspellen van de functie van een onbekend gen, expressie-analyse
168
Outline
169
Classification
OMSclassifier
C NN N C C
N C
CC C C
NN N
N
C: cancer, N: normal
170
Classification
OMSclassifier
R NN N R R
N R
RR R R
NN N
N
R: responderN: non-responder
171
Outline
172
OMS Classifier using “Methylation”
Patient
Sample
Measuring Methylation
OMSclassifier
Cancer Normal
Gene Gen 1 Gen 2 Gen 3 … Gen nMethylated + - - … +
173
Why use methylation as a biomarker ?
• What is feature/biomarker ?A characteristic that is objectively
measured and evaluated as an indicator of normal biological processes, pathogenic processes, or pharmacologic responses to a therapeutic intervention
• Business/biological feature selection/reductionOf all possible (molecular and clinical)
features oncomethylome measures methylation (in cancer/onco)
174
Outline
175
Data preparation and modelling
• Data preparationConstruct binary features « Methylated » from
PCR data (Ct and Temp)
• ModellingConstruct classifier (cancer vs normal) from
features « Methylated »
176
Data Preparation: Feature Construction
Sample
Methylation SpecificQuantitative PCR
Feature construction: “gene Methylated in sample”
Gene Gen 1 Gen 2 Gen 3 … Gen nTemp 78 81 69 … 72Ct 25 38 24 … 27
Gene Gen 1 Gen 2 Gen 3 … Gen nMethylated + - - … +
Compute « methylated » as function of Temp and Ct
177
Construction of features « Methylated »
• Per gene: find boolean functionMethylated IFF:
Ct below upperbound AND
Temp above lowerbound
• Taking into accountAll Ct and Temp measurements
Methylation Specific Quantitative PCR (QMSP) for normals and cancers
Noise in QMPS measurementsAs observed per gene during Quality Control
178
Construction of features « Methylated »
Plot of all Ct and Temp measurements for a given gene
Temp
Ct
What about noise?
179
Noise
Noise: random error or variance in a measured variable
Incorrect attribute values may due to Quantity not correctly compared to calibration
(e.g., ruler slips)
Inaccurate calibration device (e.g., ruler > 1m)
Precision (e.g., truncated to nearest mile or Ångstrom unit)
Data entry problems Data transmission problems Inconsistency in naming convention
180
Construction of features «Methylated» Taking into account noise
StDev 1.6 StDev 0.3
QC: StdDev of Ct and Tm in IVM
StD
ev
3.5
StD
ev 0
.02
Inrobust assay Robust assay
Cancer
Normal
Cut-off
181
Construction of features « Methylated » Taking into account noise
Sharp cut-off
Blunt cut-off
Good Reproducibility Bad Reproducibility
Methylated Methylated
MethylatedMethylated
Fatal
okok
ok
182
Construction of features « Methylated » Taking into account noise
Find most robust cut-off for each gene
Compute quality with increasing noise levels (0-2 times StdDev)
Stdev 0 2
Qua
lity 1
Stdev 0 2
Qua
lity 1
Inrobust Robust
Quality score based on binomial test
46 or more successes with 58 trials unlikelyWhen probability success = 80/179
Expected nr successes = 21
16 or more successes with 44 trials likelywhen probability success = 77/175
Expected nr successes = 19
183
Construction of features « Methylated »
Methylated: inside red box
184
Construction of features « Methylated »C
ance
rN
orm
al
Ranked GenesMethylated Unmethylated
185
Data preparation and modelling
• Data preparationConstruct binary features « Methylated » from
PCR data (Ct and Temp)
• ModellingConstruct classifier (cancer vs normal) from
« Methylated » features
186
Selection of modelling technique
• In theory, many techniques applicableData type: boolean methylation table, discrete
classesSee other talks today
• But, additional requirements follow from business understanding (more details below)
Feature selectionFinal test should be based on at most ~5 genes
UnderstandabilityBoth provide a direct competitive advantage
• Example of acceptable technique: decision trees
187
Decision treesThe Weka tool
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}@attribute temperature {hot, mild, cool}@attribute humidity {high, normal}@attribute windy {TRUE, FALSE}@attribute play {yes, no}
@datasunny,hot,high,FALSE,nosunny,hot,high,TRUE,noovercast,hot,high,FALSE,yesrainy,mild,high,FALSE,yesrainy,cool,normal,FALSE,yesrainy,cool,normal,TRUE,noovercast,cool,normal,TRUE,yessunny,mild,high,FALSE,nosunny,cool,normal,FALSE,yesrainy,mild,normal,FALSE,yessunny,mild,normal,TRUE,yesovercast,mild,high,TRUE,yesovercast,hot,normal,FALSE,yesrainy,mild,high,TRUE,no
http://www.cs.waikato.ac.nz/ml/weka/
188
Decision treesAttribute selection
outlook temperature humidity windy playsunny hot high FALSE nosunny hot high TRUE noovercast hot high FALSE yesrainy mild high FALSE yesrainy cool normal FALSE yesrainy cool normal TRUE noovercast cool normal TRUE yessunny mild high FALSE nosunny cool normal FALSE yesrainy mild normal FALSE yessunny mild normal TRUE yesovercast mild high TRUE yesovercast hot normal FALSE yesrainy mild high TRUE no
play
don’t play
pno = 5/14
pyes = 9/14
maximal gain of information maximal reduction of Entropy = - pyes log2 pyes - pno log2 pno
= - 9/14 log2 9/14 - 5/14 log2 5/14
= 0.94 bits
http://www-lmmb.ncifcrf.gov/~toms/paper/primer/latex/index.htmlhttp://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
play
don’t play
amount of information required to specify class of an example given that it reaches node
0.94 bits
0.0 bits* 4/14
0.97 bits* 5/14
0.97 bits* 5/14
0.98 bits* 7/14
0.59 bits* 7/14
0.92 bits* 6/14
0.81 bits* 4/14
0.81 bits* 8/14
1.0 bits* 4/14
1.0 bits* 6/14
outlook
sunny overcast rainy
+= 0.69 bits
gain: 0.25 bits
+= 0.79 bits
+= 0.91 bits
+= 0.89 bits
gain: 0.15 bits gain: 0.03 bits gain: 0.05 bits
play don't playsunny 2 3
overcast 4 0rainy 3 2
humidity temperature windy
high normal hot mild cool false true
play don't playhot 2 2mild 4 2cool 3 1
play don't playhigh 3 4
normal 6 1
play don't playFALSE 6 2TRUE 3 3
maximal
informatio
n
gain
Decision treesAttribute selection
play
don’t play
outlook
sunny overcast rainy
maximal
informatio
n
gain
0.97 bits
0.0 bits* 3/5
humidity temperature windy
high normal hot mild cool false true
+= 0.0 bits
gain: 0.97 bits
+= 0.40 bits
gain: 0.57 bits
+= 0.95 bits
gain: 0.02 bits
0.0 bits* 2/5
0.0 bits* 2/5
1.0 bits* 2/5
0.0 bits* 1/5
0.92 bits* 3/5
1.0 bits* 2/5
outlook temperature humidity windy playsunny hot high FALSE nosunny hot high TRUE nosunny mild high FALSE nosunny cool normal FALSE yessunny mild normal TRUE yes
Decision treesAttribute selection
play
don’t playoutlook
sunny overcast rainy
humidity
high normal
outlook temperature humidity windy playrainy mild high FALSE yesrainy cool normal FALSE yesrainy cool normal TRUE norainy mild normal FALSE yesrainy mild high TRUE no
1.0 bits*2/5
temperature windy
hot mild cool false true
+= 0.95 bits
gain: 0.02 bits
+= 0.95 bits
gain: 0.02 bits
+= 0.0 bits
gain: 0.97 bits
humidity
high normal
0.92 bits* 3/5
0.92 bits* 3/5
1.0 bits* 2/5
0.0 bits* 3/5
0.0 bits* 2/5
0.97 bits
Decision treesAttribute selection
192
play
don’t playoutlook
sunny overcast rainy
windy
false true
humidity
high normal
Decision treesfinal tree
193
• Initialize top node to all examples
• While impure leaves availableselect next impure leave L find splitting attribute A with maximal information gain for each value of A add child to L
Decision treesBasic algorithm
194
Decision tree built from methylation table
Sensitivity: 80%
Specificity: 88%
Decision tree:Test based on 12 genes
Leave-one-out experimentTo avoid overfitting
195
Outline
196
Evaluation and deployment
• Decide whether to use Classification resultsCan we use 12 gene decision tree for classifying
new patients?
• Verification of all stepsExcercise. The above modelling procedure contains
a classical mistake: the test-sets used for cross-validation (see leave-one-out) have actually been used for training the model. How? (Weka is not to blame) And how can we fix this?
• Check whether business goals have been metNo: test based on 12 genes not useful (max ~5) Iteration required
197
Attempt to rebuild decision treewith at most ~5 genes
New Decision tree:Test based on 4 genes
Minimal leaf sizeIncreased to 12
Sensitivity decreased from 80% to 64%
Specificity increased from 88% to 90%
198
Evaluation and deploymentThe impact of « cost »
• Market conditions, cost of goods & royalty structure can limit the amount of genes that can tested
199
Evaluation and deploymentThe importance of « understandability »
200
Evaluation and deploymentThe importance of « understandability »
Pre and postmarket requirements imposed for IVDMIA (510k etc)
Understandability (NO black boxes) is becoming an important asset
201
Outline
• ScriptingPerl (Bioperl/Python)
examples spiders/bots
• DatabasesGenome Browser
examples biomart, galaxy
• AIClassification and clustering
examples WEKA (R, Rapidminer)
202
WEKA:: Introduction
• A collection of open source ML algorithmspre-processingclassifiersclusteringassociation rule
• Created by researchers at the University of Waikato in New Zealand
• Java based
203
WEKA:: Installation
• Download software from http://www.cs.waikato.ac.nz/ml/weka/
If you are interested in modifying/extending weka there is a developer version that includes the source code• Set the weka environment variable for javasetenv WEKAHOME /usr/local/weka/weka-3-0-2 setenv CLASSPATH $WEKAHOME/weka.jar:$CLASSPATH• Download some ML data from
http://mlearn.ics.uci.edu/MLRepository.html
204
205
Main GUI
• Three graphical user interfaces “The Explorer” (exploratory data
analysis) “The Experimenter” (experimental
environment) “The KnowledgeFlow” (new
process model inspired interface)
20604/12/2023
Explorer: pre-processing the data
• Data can be imported from a file in various formats: ARFF, CSV, C4.5, binary
• Data can also be read from a URL or from an SQL database (using JDBC)
• Pre-processing tools in WEKA are called “filters”
• WEKA contains filters for:Discretization, normalization, resampling, attribute selection, transforming and combining attributes, …
207
@relation heart-disease-simplified
@attribute age numeric@attribute sex { female, male}@attribute chest_pain_type { typ_angina, asympt, non_anginal,
atyp_angina}@attribute cholesterol numeric@attribute exercise_induced_angina { no, yes}@attribute class { present, not_present}
@data63,male,typ_angina,233,no,not_present67,male,asympt,286,yes,present67,male,asympt,229,yes,present38,female,non_anginal,?,no,not_present...
WEKA only deals with “flat” files
20804/12/2023
208
@relation heart-disease-simplified
@attribute age numeric@attribute sex { female, male}@attribute chest_pain_type { typ_angina, asympt, non_anginal,
atyp_angina}@attribute cholesterol numeric@attribute exercise_induced_angina { no, yes}@attribute class { present, not_present}
@data63,male,typ_angina,233,no,not_present67,male,asympt,286,yes,present67,male,asympt,229,yes,present38,female,non_anginal,?,no,not_present...
WEKA only deals with “flat” files
20904/12/2023University of Waikato
209
21004/12/2023University of Waikato
210
21104/12/2023University of Waikato
211
21204/12/2023University of Waikato
212
21304/12/2023University of Waikato
213
21404/12/2023University of Waikato
214
21504/12/2023University of Waikato
215
21604/12/2023University of Waikato
216
21704/12/2023University of Waikato
217
21804/12/2023University of Waikato
218
21904/12/2023University of Waikato
219
22004/12/2023University of Waikato
220
22104/12/2023University of Waikato
221
22204/12/2023University of Waikato
222
22304/12/2023University of Waikato
223
22404/12/2023University of Waikato
224
22504/12/2023University of Waikato
225
22604/12/2023University of Waikato
226
22704/12/2023University of Waikato
227
22804/12/2023University of Waikato
228
22904/12/2023University of Waikato
229
230
Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting nominal or numeric quantities
• Implemented learning schemes include:Decision trees and lists, instance-based classifiers, support vector machines, multi-layer perceptrons, logistic regression, Bayes’ nets, …
231April 12, 2023
231
age income student credit_rating buys_computer<=30 high no fair no<=30 high no excellent no31…40 high no fair yes>40 medium no fair yes>40 low yes fair yes>40 low yes excellent no31…40 low yes excellent yes<=30 medium no fair no<=30 low yes fair yes>40 medium yes fair yes<=30 medium yes excellent yes31…40 medium no excellent yes31…40 high yes fair yes>40 medium no excellent no
This follows
an example
of Quinlan’s
ID3 (Playing Tennis)
Decision Tree Induction: Training Dataset
232
age?
overcast
student? credit rating?
<=30 >40
no yes yes
yes
31..40
no
fairexcellentyesno
Output: A Decision Tree for “buys_computer”
23404/12/2023University of Waikato
234
23504/12/2023University of Waikato
235
23604/12/2023University of Waikato
236
23704/12/2023University of Waikato
237
23804/12/2023University of Waikato
238
23904/12/2023University of Waikato
239
24004/12/2023University of Waikato
240
24104/12/2023University of Waikato
241
24204/12/2023University of Waikato
242
24304/12/2023University of Waikato
243
24404/12/2023University of Waikato
244
24504/12/2023University of Waikato
245
24604/12/2023University of Waikato
246
24704/12/2023University of Waikato
247
24804/12/2023University of Waikato
248
24904/12/2023University of Waikato
249
25004/12/2023University of Waikato
250
25104/12/2023University of Waikato
251
25204/12/2023University of Waikato
252
25304/12/2023University of Waikato
253
25404/12/2023University of Waikato
254
25504/12/2023University of Waikato
255
258
Explorer: finding associations
• WEKA contains an implementation of the Apriori algorithm for learning association rules
Works only with discrete data• Can identify statistical dependencies between
groups of attributes:milk, butter bread, eggs (with confidence 0.9 and support 2000)• Apriori can compute all rules that have a
given minimum support and exceed a given confidence
25904/12/2023
259
Explorer: data visualization
• Visualization very useful in practice: e.g. helps to determine difficulty of the learning problem
• WEKA can visualize single attributes (1-d) and pairs of attributes (2-d)
To do: rotating 3-d visualizations (Xgobi-style)• Color-coded class values• “Jitter” option to deal with nominal attributes
(and to detect “hidden” data points)• “Zoom-in” function
26004/12/2023University of Waikato
260
26104/12/2023University of Waikato
261
26204/12/2023University of Waikato
262
26304/12/2023University of Waikato
263
26404/12/2023University of Waikato
264
26504/12/2023University of Waikato
265
26604/12/2023University of Waikato
266
26704/12/2023University of Waikato
267
26804/12/2023University of Waikato
268
26904/12/2023University of Waikato
269
270
References and Resources
• References: WEKA website:
http://www.cs.waikato.ac.nz/~ml/weka/index.html
WEKA Tutorial:Machine Learning with WEKA: A presentation
demonstrating all graphical user interfaces (GUI) in Weka.
A presentation which explains how to use Weka for exploratory data mining.
WEKA Data Mining Book:Ian H. Witten and Eibe Frank, Data Mining:
Practical Machine Learning Tools and Techniques (Second Edition)
WEKA Wiki: http://weka.sourceforge.net/wiki/index.php/Main_Page
Others:Jiawei Han and Micheline Kamber, Data Mining:
Concepts and Techniques, 2nd ed.
271
Een praktisch voorbeeld…
272
Resultaat van clustering
273
Resultaat van classificatie
Verschillende technieken:ZeroR
IBk
Decision Table – SMO – Multilayer Perceptron
274
RapidMiner:: Introduction
• A very comprehensive open-source software implementing tools forintelligent data analysis, data mining, knowledge discovery, machine learning, predictive analytics, forecasting, and analytics in business intelligence (BI). • Is implemented in Java and available under GPL among other
licenses• Available from http://rapid-i.com
275
RapidMiner:: Intro. Contd.
• Is similar in spirit to Weka’s Knowledge flow
• Data mining processes/routines are views as sequential operators
Knowledge discovery process are modeled as operator chains/trees Operators define their expected inputs and delivered
outputs as well as their parameters
• Has over 400 data mining operators
276
RapidMiner:: Intro. Contd.
• Uses XML for describing operator trees in the KD process
• Alternatively can be started through the command line and passed the XML process file
277
Outline
• ScriptingPerl (Bioperl/Python)
examples spiders/bots
• DatabasesGenome Browser
examples biomart, galaxy
• AIClassification and clustering
examples WEKA (R, Rapidminer)