3.09.2008
Text Algorithms (4AP)
Jaak Vilo
2008 fall
MTAT.03.190 Text Algorithms, Jaak Vilo
Topic
• Algorithms on strings
• Strings, sequences, texts, documents, …
– In Estonian: stringid, sõned, tekstid, …
Ingredients
• S = s1 s2 … sn (text), |S| = n (length)
• P = p1 p2 … pm (pattern), |P| = m
• Σ: alphabet, |Σ| = c
Questions: Pattern matching
• Does S contain P?
– Does S = S' P S" for some strings S' and S"?
– Usually m << n and n can be (very) large
– Exact match: O(nm), O(n), O(n/m)
– Approximate match
• Hamming distance
• Edit distance
• Generalised edit distance
• Why?
Example
• S=text that contains characters
• P= racter
• P= obtain, cotain, ota, hate, harra
Multiple occurrences in text
[Figure: pattern P occurring at several positions in text S]
Multiple patterns
[Figure: a set of patterns {P} matched against text S]
Dictionary lookup
• Given D= { T1, T2, .., Tn }
• Does D contain P?
• Data Structures & Algorithms?
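One possible answer to the data-structure question is a trie (prefix tree), which answers "does D contain P?" in time proportional to |P|, independent of the number of strings in D. The sketch below is only an illustration: the class name `Trie`, its methods and the sample dictionary are assumptions, not taken from the lecture.

```python
class Trie:
    """Minimal trie (prefix tree) for exact dictionary lookup."""

    def __init__(self):
        self.children = {}      # character -> child Trie node
        self.is_word = False    # True if a dictionary string ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False    # no stored string continues with ch
        return node.is_word


# Hypothetical sample dictionary D
D = Trie()
for t in ["text", "texts", "string", "strings"]:
    D.insert(t)

print(D.contains("string"))   # True
print(D.contains("str"))      # False: only a prefix of stored strings
```

A hash set would also answer exact lookups; the trie additionally supports prefix queries, which is why it is often the starting point for multi-pattern matching.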
CPM
• Combinatorial Pattern Matching addresses issues of searching and matching strings and more complicated patterns such as trees, regular expressions, graphs, point sets, and arrays. The goal is to derive non-trivial combinatorial properties for such structures and then to exploit these properties in order to achieve improved performance for the corresponding computational problem.
• A steady flow of high-quality research on this subject has changed a sparse set of isolated results into a full-fledged area of algorithmics with important applications. This area is expected to grow even further due to the increasing demand for speed and efficiency that comes from molecular biology, but also from areas such as information retrieval, pattern recognition, compiling, data compression, program analysis and security.
Algorithms
• Brute force
• Knuth‐Morris‐Pratt
• Rabin‐Karp
• Boyer-Moore
• …
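Of the algorithms listed, Knuth-Morris-Pratt is perhaps the shortest to sketch. The version below is the standard textbook formulation, not necessarily the one presented in this course; the function names `failure_function` and `kmp_search` are illustrative, and positions are 0-based.

```python
def failure_function(P):
    """fail[j] = length of the longest proper prefix of P[:j+1] that is also its suffix."""
    fail = [0] * len(P)
    k = 0
    for j in range(1, len(P)):
        while k > 0 and P[j] != P[k]:
            k = fail[k - 1]         # fall back to a shorter border
        if P[j] == P[k]:
            k += 1
        fail[j] = k
    return fail


def kmp_search(S, P):
    """Return all 0-based start positions of P in S in O(n + m) time."""
    if not P:
        return []
    fail = failure_function(P)
    hits = []
    k = 0                           # number of pattern characters currently matched
    for i, ch in enumerate(S):
        while k > 0 and ch != P[k]:
            k = fail[k - 1]         # shift the pattern without re-reading S
        if ch == P[k]:
            k += 1
        if k == len(P):
            hits.append(i - len(P) + 1)
            k = fail[k - 1]         # continue, allowing overlapping occurrences
    return hits


print(kmp_search("gcatcgcagagagtatacagtacg", "gcagagag"))   # [5]
```

The text pointer never moves backwards, which is what gives the O(n) search bound after O(m) preprocessing.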
Animations
• http://www-igm.univ-mlv.fr/~lecroq/string/
• Exact String Matching Algorithms: Animation in Java
• Christian Charras and Thierry Lecroq, Laboratoire d'Informatique de Rouen, Université de Rouen, Faculté des Sciences et des Techniques, 76821 Mont-Saint-Aignan Cedex, France
• e‐mails: {Christian.Charras, Thierry.Lecroq}@laposte.net
Brute force
Algorithm Naive
Input: Text S[1..n] and pattern P[1..m]
Output: All positions i where P occurs in S

for ( i=1 ; i <= n-m+1 ; i++ ) {
    for ( j=1 ; j <= m ; j++ )
        if ( S[i+j-1] != P[j] ) break ;   /* first mismatch at P[j] */
    if ( j > m ) print i ;                /* all m characters matched */
}

Example with S = gcatcgcagagagtatacagtacg and P = gcagagag (uppercase: matched characters; lowercase: the mismatch that ends the attempt):

attempt 1:
gcatcgcagagagtatacagtacg
GCAg....
attempt 2:
gcatcgcagagagtatacagtacg
 g.......
attempt 3:
gcatcgcagagagtatacagtacg
  g.......
attempt 4:
gcatcgcagagagtatacagtacg
   g.......
attempt 5:
gcatcgcagagagtatacagtacg
    g.......
attempt 6:
gcatcgcagagagtatacagtacg
     GCAGAGAG
attempt 7:
gcatcGCAGAGAGtatacagtacg
      g.......
Brute Force
[Figure: P aligned below S at position i; character P[j] is compared with S[i+j-1]]
Identify the first mismatch!
Question:
Problems of this method?
Ideas to improve the search? ☺
Time analysis
• Worst case
• Average case
• Practical measurements
• Preprocessing vs analysis
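As a small illustration of worst-case analysis versus practical measurement, the sketch below counts the character comparisons made by brute-force search. The function name `count_comparisons` and the chosen inputs are assumptions for this example. A text of n copies of "a" with the pattern "aa…ab" forces every alignment to compare all m characters, giving m(n-m+1) comparisons, whereas a pattern that mismatches immediately needs only n-m+1.

```python
def count_comparisons(S, P):
    """Count the character comparisons made by brute-force search of P in S."""
    n, m = len(S), len(P)
    comparisons = 0
    for i in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if S[i + j] != P[j]:
                break               # stop this attempt at the first mismatch
    return comparisons


n, m = 1000, 10
worst = count_comparisons("a" * n, "a" * (m - 1) + "b")
best = count_comparisons("a" * n, "b" * m)
print(worst, m * (n - m + 1))   # 9910 9910
print(best, n - m + 1)          # 991 991
```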
How does search depend on
• |S|
• |P| or set of patterns, ||P||
• |Σ|
• Similarity measure and distance k
Space complexity
• Memory usage
– Preprocessing
– Matching
– …
Approximate search
• Similarity measures
– Edit distance
– …
• Dynamic programming
– Memorizing intermediate results
• Bit‐parallel algorithms and practical efficiency
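To make "memorizing intermediate results" concrete, here is a minimal dynamic-programming sketch of edit distance. It assumes the unit-cost (Levenshtein) model, where each insertion, deletion and substitution costs 1; the lecture may use other cost models, and the function name `edit_distance` is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # match (cost 0) or substitute (cost 1)
            ))
        prev = curr                         # keep only the previous row
    return prev[-1]


print(edit_distance("contain", "cotain"))   # 1
print(edit_distance("kitten", "sitting"))   # 3
```

Only two rows of the table are kept, so the space is O(|b|) while the time remains O(|a|·|b|).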
Questions
• Exact vs approximate
• (sub) string ACGTAG
• 1D vs 2D …
• Regular expressions A([CG]A*T)+T
• Probabilistic
• Multiple patterns
• Online vs offline (indexed)
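The regular expression A([CG]A*T)+T from the list above can be tried directly with Python's standard `re` module. The test strings are made up for this illustration.

```python
import re

# A, then one or more groups of (C or G, any number of A, T), then a final T
pattern = re.compile(r"A([CG]A*T)+T")

for s in ["ACTT", "AGAATCTT", "ACGTAG", "ATT"]:
    m = pattern.search(s)
    print(s, "->", m.group(0) if m else None)
# ACTT -> ACTT
# AGAATCTT -> AGAATCTT
# ACGTAG -> None
# ATT -> None
```

"ATT" fails because the group must occur at least once, and "ACGTAG" fails because no C or G is followed by A*T.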
Knowledge Explosion: PubMed
[Figure: number of new PubMed publications per year and accumulated publications, 1980 to 2003]
• Average number of new citations appearing in PubMed
– In 1980: 746/day
– In 2004: 1,640/day
Indexing
• Suffix tree
– O(n) time and space
– O(m) query
• Suffix array
• Compressed suffix trees
• Inverted index (Estonian: pöördindex)
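As an illustration of offline (indexed) search, here is a minimal suffix-array sketch. The construction is deliberately naive (sorting whole suffixes, far from the linear-time methods used in practice), and the function names are assumptions for this example. Queries use binary search over the sorted suffixes.

```python
def build_suffix_array(S):
    """Naive construction: sort suffix start positions by the suffix text.

    Worst case O(n^2 log n); for illustration only.
    """
    return sorted(range(len(S)), key=lambda i: S[i:])


def find_occurrences(S, sa, P):
    """Return sorted 0-based positions of P in S via binary search on sa."""
    m = len(P)
    # Lower bound: first suffix whose m-character prefix is >= P.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if S[sa[mid]:sa[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-character prefix is > P.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if S[sa[mid]:sa[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


S = "banana"
sa = build_suffix_array(S)
print(sa)                               # [5, 3, 1, 0, 4, 2]
print(find_occurrences(S, sa, "ana"))   # [1, 3]
```

Each query performs O(log n) string comparisons of at most m characters, so the index is built once and then reused for many patterns.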
Compression
[Figure: Text, Encoder, Compressed text, Decoder, Text; the encoder and the decoder each use a Model]
Compression
• Run‐length encoding
• Shannon‐Fano
• Huffman codes
• Arithmetic Coding
• “Memorizing” – Lempel‐Ziv
• Burrows‐Wheeler
• Algorithmic complexity, Kolmogorov …
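Run-length encoding, the first method in the list, is simple enough to sketch in a few lines. The pair-list representation and the function names below are assumptions chosen for readability, not a real file format.

```python
from itertools import groupby


def rle_encode(text):
    """Run-length encode text into a list of (character, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


def rle_decode(pairs):
    """Invert rle_encode: expand each (character, run length) pair."""
    return "".join(ch * count for ch, count in pairs)


encoded = rle_encode("TTTTCCCAAG")
print(encoded)               # [('T', 4), ('C', 3), ('A', 2), ('G', 1)]
print(rle_decode(encoded))   # TTTTCCCAAG
```

The scheme only pays off when the input has long runs; on text without repeats the encoded form is larger than the original.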
Information retrieval
• Google, Yahoo!, …
• How can this be made possible?
• How to find relevant documents?
• Similar documents
Text mining
• Analyzing texts
• Finding trends
• Finding regularities, patterns, motifs
• Mining web usage, links, etc…
• …
250 words by relative frequency (2004)
aasta aega ainult ajal alajaama all alles ampritasu anda annab as asi atonen balti edasi eesti eile elekter
elektrienergia elektriga elektrihinna elektrijaama elektrita elektrituru enda endale energeetika energia energiafirma energiaturu eriti esimees
esimehe ettevõte euroopa firma fortum gunnar hakkab hea hetkel hind hinnangul hinnatõus ikka ilma ilmselt investeeringute ise
isegi jaoks jooksul juhataja juhatuse juht juhul jäi järgi jääb kaasa kahe kaks kasutada keegi kell kelle keskmiselt kindlasti kinni kinnitas kirjutab kogu
kohaselt kohe kokku kolm kolme koos korda korral krooni kuidas kuigi kuu kuus kwh kõige kõik küll küsimus leedu ligi lihtsalt liidu liiga
liige lisaks lisas läbi läheb majandusminister maksab maksma meelis midagi miljardit miljon minister minu mõne märkis nad narva neid nõukogu nüüd
okkoleks oleme olen olid olnud oluliselt online osa osas osta palju pea peab peakaitsme peale peavad pm poole poolt postimees praegu pressiesindaja protsent puhul põlevkivi raha reiljan reklaam riigi riigikogu riik riina rohkem rääkis saada saanud saavad sai sama samal
samas samuti seal seetõttu selleks selline seni seotud siin siiski soome suur suure suurem sõnul sõõrumaa tagasi tahab tallinn tarbija tarbijad
tartu tasu teada teatas teeb tegelikult teha tehtud teine tuleks tulevikus turu tõsta tõttu tõusu tähendab täna umbes urmas uued uus uute vahel vaja valitsus valmis
vastavalt vastu vene venemaa viis võiks võimalik võimalus võivad võrguteenuse võrra võrreldes võtta väga vähem vähemalt vändre öelda ühe üks ütleb
Words associated with Eesti Energia in June 2004: change, rise and fall (rise: orange, fall: blue)
aasta aastal aastat all asi atonen balti eesti eestis ekseko elektri elektrit
energia hoonejuuni jäi kasutada kiriku krooni kubits kuidas kõik küll lisaks
läbi maja meelis mikk narva oleks pakri peab peale pinge praegu riigikogu
sõnul tehatrafopunkti varastati vargad vene võimalik väga vändre üks
A selection of the fastest-rising words in May 2004
niinemäe atoneni vene minister riigisaladuse
kubits corpore toomas juhi mai kapo teabeameti talle iru
kätlin pirita pärnu saaremaa
näitas oki valitsuse maran info
res publica teab elioni
TGTTCTTTCTTCTTTCATACATCCTTTTCCTTTTTTTCCTTCTCCTTTCATTTCCTGACTTTTAATATAGGCTTACCATCCTTCTTCTCTTCAATAACCTTCTTACATTGCTTCTTCTTCGATTGCTTCAAAGTAGTTCGTGAATCATCCTTCAATGCCTCAGCACCTTCAGCACTTGCACTTCATTCTCTGGAAGTGCTGCACCTGCGCTGTCTTGCTAATGGATTTGGAGTTGGCGTGGCACTGATTTCTTCGACATGGGCGGCGTCTTCTTCGAATTCCATCAGTCCTCATAGTTCTGTTGGTTCTTTTCTCTGATGATCGTCATCTTTCACTGATCTGATGTTCCTGTGCCCTATCTATATCATCTCAAAGTTCACCTTTGCCACTTTCCAAGATCTCTCATTCATAATGGGCTTAAAGCCGTACTTTTTTCACTCGATGAGCTATAAGAGTTTTCCACTTTTAGATCGTGGCTGGGCTTATATTACGGTGTGATGAGGGCGCTTGAAAAGATTTTTTCATCTCACAAGCGACGAGGGCCCGAGTGTTTGAAGCTAGATGCAGTAGGTGCAAGCGTAGAGTCTTAGAAGATAAAGTAGTGAATTACAATAGATTCGATAC
Overview
• Matching (exact, approximate, multiple P, regular expressions, …)
• Indexing and data structures
• Text compression
• Information Retrieval
• Probabilistic motifs: HMM, SCFG, …
• Pattern discovery & data mining
Practical assignment 1
• Propose some real-life use cases for string searching
– Exact
– Approximate
– Regular expression
– Multiple patterns
• Estimate the size of “interesting texts” for various typical use cases
• Study the Unix grep family of tools (grep, ggrep, egrep, fgrep, agrep, …)
– What functionality is offered?
• Run grep and perform practical measurements
– Speed: how many characters are searched per second?
– Analyze dependence on m
– Dependence on alphabet size
• Create script(s) for evaluating grep at different text and pattern sizes
– E.g. Perl, bash, Python, …
– Unix time, redirection of output ( > , >>, 2> , … ) …
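One possible starting point for the measurement script is sketched below in Python. Everything in it is an assumption for illustration: the helper names `random_text` and `time_grep`, the DNA-like alphabet, the text size and the pattern lengths. It relies only on the standard `-c` (count matching lines) and `-F` (fixed string) options of grep, and the measured time includes process start-up, so larger texts give more meaningful numbers.

```python
import os
import random
import shutil
import subprocess
import tempfile
import time


def random_text(n, alphabet="acgt", seed=0):
    """Generate a reproducible random text of length n over the given alphabet."""
    rng = random.Random(seed)
    return "".join(rng.choices(alphabet, k=n))


def time_grep(path, pattern):
    """Time one `grep -c -F pattern path` run; return (seconds, matching lines)."""
    start = time.perf_counter()
    result = subprocess.run(["grep", "-c", "-F", pattern, path],
                            capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, int(result.stdout.strip() or 0)


if __name__ == "__main__" and shutil.which("grep"):
    n = 200_000                                   # text size in characters
    text = random_text(n)
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for k in range(0, n, 80):                 # grep is line-based: 80-char lines
            f.write(text[k:k + 80] + "\n")
    try:
        for m in (2, 4, 8, 16, 32):               # pattern lengths to compare
            pattern = random_text(m, seed=m)
            seconds, lines = time_grep(f.name, pattern)
            print(f"m={m:2d}  {seconds:.4f}s  "
                  f"{n / seconds:,.0f} chars/s  {lines} matching lines")
    finally:
        os.remove(f.name)                         # clean up the temporary text
```

Changing the `alphabet` argument of `random_text` gives a simple way to study the dependence on alphabet size.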
Course
• ~12 Lectures 24h + 20h
• ~10 Practicals 20h + 30h
• Term paper 10h
• Project work 40h
• Exam 4h + 12h
• ---------------------------------------------
• Total 48h + 112h = 160h (4AP)
Grade
• Homework 40 + bonus points
• Term paper 10
• Project work 20
• Exam 30
• ‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐‐
• Total 100p
Homework
• Essential part of the course
• Completing at least 50% of the tasks is obligatory
• Solutions are presented orally during the practicals
• In rare cases of not attending:
– Solutions must be in before 16:15 by email, clearly stating which tasks have been completed.
Term paper
• Will be an essay based on some article
• To be decided during the course
• Reading and writing skills
• Follows the format of a scientific article (abstract, citations, etc.)
Project
• A practical algorithm development task plus analysis and comparisons of efficiency
• Presentation of results as a poster
• A poster session – everybody presenting!
Exam
• Will be based exactly or nearly on the questions of the homework assignments
• Knowledge of the basic principles of algorithms
• Creative use of the algorithms
Contact
• Lectures, practicals – active hours
• http://courses.cs.ut.ee/2008/text/
• Email ([email protected])
• Office hour (Thursday 1-2pm); room 327
– Other times: knock on the door, or drop in when the door is open
• Upon agreement