Greedy Algorithms
Amihood Amir, Bar-Ilan University
Post on 14-Dec-2015
Idea
Simplest type of strategy:
1. Take a step that makes the problem smaller.
2. Iterate.
Difficulty: Prove that this leads to an optimal solution.
This is not always the case!
Example: Centerstring Problem
Input: k strings s1,…,sk of length ℓ over alphabet Σ, distance d.
Find: a string s* of length ℓ such that max(Ham(s*,si)), i=1,…,k, is ≤ d, where Ham is the Hamming distance.
Example:
s1: 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
s2: 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
s3: 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
--------------------------------------------------
s*: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The Hamming distance of the consensus from any string: 4
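The distances in this example can be checked mechanically. The following is a minimal sketch; the helper names `hamming` and `radius` are illustrative and do not come from the slides.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def radius(center: str, strings: list[str]) -> int:
    """Largest Hamming distance from `center` to any input string."""
    return max(hamming(center, s) for s in strings)


strings = [
    "1111000000000000",
    "0000111100000000",
    "0000000011110000",
]
center = "0" * 16  # the all-zero string s* from the example

print([hamming(center, s) for s in strings])  # [4, 4, 4]
print(radius(center, strings))                # 4
```

For a given d, s* is a valid answer exactly when `radius(s*, strings) <= d`.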
Suggestion: a greedy strategy, column majority?
0 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0
0 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0
0 1 1 0 0 0 1 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 1 1 0
-------------------------------------------------------
0 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0
Problem: This works if we want to minimize the average distance, but not if we want to minimize the maximum!
Why?
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
-----------------------------------------
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (majority)
Hamming distance from last string: 16
But:
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
-----------------------------------------
1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0
Hamming distance from any string: 8
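The contrast between the two candidate strings can be reproduced with a short sketch. The function names are illustrative; `column_majority` is one straightforward way to implement the greedy column-majority suggestion.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def column_majority(strings: list[str]) -> str:
    """Greedy consensus: the most common symbol in each column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))


strings = [
    "1111000000000000",
    "0000111100000000",
    "0000000011110000",
    "0000000000001111",
    "1111111111111111",
]

majority = column_majority(strings)
alternative = "1100110011001100"  # the second candidate from the example

for name, center in [("majority", majority), ("alternative", alternative)]:
    dists = [hamming(center, s) for s in strings]
    print(name, dists, "sum:", sum(dists), "max:", max(dists))
# majority [4, 4, 4, 4, 16] sum: 32 max: 16
# alternative [8, 8, 8, 8, 8] sum: 40 max: 8
```

The majority string has the smaller sum of distances (32 versus 40), but the alternative has the smaller maximum (8 versus 16), which is the quantity the Centerstring Problem asks about.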
Example (that works): Huffman code
Computer Data Encoding: How do we represent data in binary?
Historical solution: fixed length codes. Encode every symbol by a unique binary string of a fixed length.
Examples: ASCII (7 bit code), EBCDIC (8 bit code), …
Total space usage in bits:
Assume an ℓ bit fixed length code.
For a file of n characters
Need nℓ bits.
Variable Length codes
Idea: In order to save space, use fewer bits for frequent characters and more bits for rare characters.
Example: Suppose an alphabet of 3 symbols: { A, B, C }, and a file of 1,000,000 characters. A fixed length code needs 2 bits per character, for a total of 2,000,000 bits.
Variable Length codes - example
Suppose the frequency distribution of the characters is:
A: 999,000    B: 500    C: 500
Encode:
A: 0    B: 10    C: 11
Note that the code of A is of length 1, and the codes for B and C are of length 2.
Total space usage in bits:
Fixed code: 1,000,000 x 2 = 2,000,000
Variable code: 999,000 x 1 + 500 x 2 + 500 x 2 = 1,001,000
A savings of almost 50%.
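The arithmetic of this example can be reproduced in a few lines. This is only a sketch of the calculation; the variable names are illustrative.

```python
# Frequencies and the variable length code from the example.
freq = {"A": 999_000, "B": 500, "C": 500}
code = {"A": "0", "B": "10", "C": "11"}
fixed_len = 2  # bits per symbol for a 3-symbol fixed length code

n = sum(freq.values())
fixed_bits = n * fixed_len
variable_bits = sum(freq[s] * len(code[s]) for s in freq)

print(fixed_bits)     # 2000000
print(variable_bits)  # 1001000
print(f"saving: {1 - variable_bits / fixed_bits:.2%}")
```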
How do we decode?
In the fixed length, we know where every character starts, since they all have the same number of bits.
Example: A = 00 B = 01 C = 10
000000010110101001100100001010
A A A B B C C C B C B A A C C
How do we decode?
In the variable length code, we use an idea called Prefix code, where no code is a prefix of another.
Example: A = 0 B = 10 C = 11
None of the above codes is a prefix of another.
How do we decode?
Example: A = 0 B = 10 C = 11
So, for the string A A A B B C C C B C B A A C C the encoding is:
0 0 0 10 10 11 11 11 10 11 10 0 0 11 11
Prefix Code
Example: A = 0 B = 10 C = 11
Decode the string
0001010111111101110001111
A A A B B C C C B C B A A C C
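Decoding a prefix code can be done by reading bits left to right and emitting a symbol as soon as the bits read so far form a codeword. A minimal sketch, assuming the code table is given as a Python dictionary (the function names `encode` and `decode` are illustrative):

```python
def encode(text: str, code: dict[str, str]) -> str:
    """Concatenate the codewords of the symbols in `text`."""
    return "".join(code[ch] for ch in text)


def decode(bits: str, code: dict[str, str]) -> str:
    """Decode a bit string, assuming `code` is a prefix code."""
    inverse = {word: sym for sym, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        # Because no codeword is a prefix of another, the first match is the only one.
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


code = {"A": "0", "B": "10", "C": "11"}
text = "AAABBCCCBCBAACC"
bits = encode(text, code)
print(bits)                # 0001010111111101110001111
print(decode(bits, code))  # AAABBCCCBCBAACC
```

No separators between codewords are needed: the prefix property alone tells the decoder where each character ends.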
Desiderata:
Construct a variable length code for a given file with the following properties:
1. Prefix code.
2. Using shortest possible codes.
3. Efficient.
4. As close to entropy as possible.
Idea
Consider a binary tree, with 0 meaning a left turn and 1 meaning a right turn.
[Figure: binary tree with leaves A, B, C and D; each left edge is labeled 0 and each right edge is labeled 1.]
Idea
Consider the paths from the root to each of the leaves A, B, C, D:
A : 0 B : 10 C : 110 D : 111
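Reading the codes off the tree is a simple recursive walk. The sketch below assumes a hypothetical representation that is not in the slides: a leaf is a symbol (a string) and an internal node is a `(left, right)` tuple.

```python
# The tree with leaves A, B, C, D: A is the left child of the root,
# B the left child of the next node, and C, D the deepest pair.
tree = ("A", ("B", ("C", "D")))


def codes_from_tree(node, prefix: str = "") -> dict[str, str]:
    """Map each leaf symbol to its root-to-leaf path (0 = left, 1 = right)."""
    if isinstance(node, str):  # leaf: the path so far is its code
        return {node: prefix}
    left, right = node
    table = codes_from_tree(left, prefix + "0")
    table.update(codes_from_tree(right, prefix + "1"))
    return table


print(codes_from_tree(tree))
# {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```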
Observe:
1. This is a prefix code, since each of the leaves has a path ending in it, without continuation.
2. If the tree is full then we are not “wasting” bits.
3. If we make sure that the more frequent symbols are closer to the root then they will have a shorter code.
Greedy Algorithm:
1. Consider all pairs: <frequency, symbol>.
2. Choose the two lowest frequencies, and make them siblings, with their parent having the combined frequency.
3. Iterate.
Greedy Algorithm Example:
Alphabet: A, B, C, D, E, F
Frequency table:
Symbol:     A   B   C   D   E   F
Frequency:  10  20  30  40  50  60
Total File Length: 210
The Huffman encoding:
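[Figure: Huffman tree built by the merges below; left edges are labeled 0 and right edges are labeled 1.]
X 30 = A 10 + B 20
Y 60 = X 30 + C 30
Z 90 = D 40 + E 50
W 120 = Y 60 + F 60
V 210 = Z 90 + W 120
A: 1000  B: 1001  C: 101  D: 00  E: 01  F: 11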
File Size: 10x4 + 20x4 + 30x3 + 40x2 + 50x2 + 60x2 = 40 + 80 + 90 + 80 + 100 + 120 = 510 bits
Note the savings:
The Huffman code: required 510 bits for the file.
Fixed length code: need 3 bits for 6 characters. The file has 210 characters.
Total: 630 bits for the file.
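Both totals can be recomputed from the frequency table and the code table of the example. A small sketch; the variable names are illustrative, and the fixed code length is computed as the number of bits needed to number the symbols.

```python
freq = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}
huffman_code = {"A": "1000", "B": "1001", "C": "101",
                "D": "00", "E": "01", "F": "11"}

# Cost of the variable length code: frequency times codeword length.
huffman_bits = sum(freq[s] * len(huffman_code[s]) for s in freq)

# Bits needed to give each of the 6 symbols a distinct fixed length code.
fixed_len = (len(freq) - 1).bit_length()
fixed_bits = fixed_len * sum(freq.values())

print(huffman_bits)  # 510
print(fixed_len)     # 3
print(fixed_bits)    # 630
```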
Note also:
For a uniform character distribution over an alphabet whose size is a power of 2, the Huffman encoding will be equal in length to the fixed length encoding.
Why?
Assignment.
Formally, the algorithm:
Initialize trees of a single node each.
Keep the roots of all subtrees in a priority queue.
Iterate until only one tree is left: merge the two smallest frequency subtrees into a single subtree with two children, and insert it into the priority queue.
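The steps above can be sketched with Python's `heapq` module as the priority queue. This is a minimal illustration, not a definitive implementation: the function name `huffman_codes`, the tuple representation of subtrees, and the sequence-number tie-break are assumptions of this sketch, and symbols are assumed to be strings.

```python
import heapq
from itertools import count


def huffman_codes(freq: dict[str, int]) -> dict[str, str]:
    """Build a Huffman code; a subtree is a symbol or a (left, right) tuple."""
    if not freq:
        return {}
    tiebreak = count()  # unique sequence numbers, so the heap never compares subtrees
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Merge the two subtrees of smallest total frequency.
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))

    codes: dict[str, str] = {}

    def walk(node, prefix: str) -> None:
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")  # left turn
            walk(node[1], prefix + "1")  # right turn
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes


freq = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}
codes = huffman_codes(freq)
total = sum(freq[s] * len(codes[s]) for s in freq)
print(total)  # 510
```

The individual bit patterns depend on how ties are broken and on which child is put on the left, so they may differ from the ones in the example, but the total of 510 bits is the same.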
Algorithm time:
Each priority queue operation (e.g. heap):
O(log n)
In each iteration: one less subtree.
Initially: n subtrees.
Total: O(n log n) time.
Algorithm correctness:
Need to prove two things for greedy algorithms:
Greedy Choice Property: The choice of local optimum is indeed part of a global optimum.
Optimal Substructure Property: When we recurse on the remaining problem and combine its solution with the local optimum of the greedy choice, we get a global optimum.
Centerstring Algorithm correctness:
Greedy Choice Property: The choice of majority at a column turns out not necessarily to be part of a global optimum.
Optimal Substructure Property: A global optimum means that the overall max distance, including the first greedy choice, is smallest.
Example:
0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
-----------------------------------------
1 (the greedy choice for the first index)
For the optimum, the second index needs to be 0, but if we ignore the first index, a global optimum for the remaining indices may be
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
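This failure of optimal substructure can be confirmed by brute force on the two strings of the example. The sketch below tries every binary center, which is exponential in the string length and only reasonable for an instance this small; the helper names are illustrative.

```python
from itertools import product


def hamming(a, b) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def radius(center, strings) -> int:
    """Largest Hamming distance from `center` to any input string."""
    return max(hamming(center, s) for s in strings)


def best_radius(strings) -> int:
    """Smallest achievable radius, by trying all 2**length binary centers."""
    length = len(strings[0])
    return min(radius(c, strings) for c in product("01", repeat=length))


strings = ["0011111111111111", "1111111111111111"]
suffixes = [s[1:] for s in strings]  # the problem with the first index ignored
sub_center = "1" * 15                # the optimum of the subproblem shown above

print(best_radius(strings))                                 # 1
print(radius(sub_center, suffixes), best_radius(suffixes))  # 1 1
print(radius("1" + sub_center, strings))                    # 2
```

The all-ones string is optimal for the remaining 15 indices, yet prepending the first choice 1 gives radius 2, while the true optimum for the full problem is 1.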
Huffman Algorithm correctness:
Need to prove two things:
Greedy Choice Property: There exists a minimum cost prefix tree where the two smallest frequency characters are indeed siblings with the longest path from the root.
This means that the greedy choice does not hurt finding the optimum.
Algorithm correctness:
Optimal Substructure Property: Once we choose the two least frequent elements and combine them to produce a smaller problem, an optimal solution to the smaller problem is indeed an optimal solution to the original problem when the two elements are added back.
Algorithm correctness:
There exists a minimum cost tree where the minimum frequency elements are longest path siblings:
Assume that this is not the situation. Then there are two other elements that are siblings at the end of the longest path.
Say a, b are the elements with smallest frequency and x, y the elements in the longest path.
Algorithm correctness:
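[Figure: code tree CT, with siblings x and y at the deepest level, at depth dy, and a at depth da.]
We also know about code tree CT: $\sum_{\sigma} f_\sigma d_\sigma$ is the smallest possible.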
Now exchange a and y.
Algorithm correctness:
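[Figure: code tree CT’, obtained from CT by exchanging a and y; a is now at depth dy and y at depth da.]
Since $d_a \le d_y$ and $f_a \le f_y$, we have $(f_y - f_a)(d_y - d_a) \ge 0$, that is, $f_a d_a + f_y d_y \ge f_y d_a + f_a d_y$. Therefore
$$\mathrm{cost}(CT) = \sum_{\sigma} f_\sigma d_\sigma = \sum_{\sigma \ne a,y} f_\sigma d_\sigma + f_a d_a + f_y d_y \ge \sum_{\sigma \ne a,y} f_\sigma d_\sigma + f_y d_a + f_a d_y = \mathrm{cost}(CT')$$
Since CT has the smallest possible cost, CT’ is optimal as well.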
Algorithm correctness:
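[Figure: code tree CT”, obtained from CT’ by exchanging b and x; a and b are now siblings at depth dx, and x is at depth db.]
Exchange b and x in the same way, and get an optimal code tree where a and b are siblings with the longest paths.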
Algorithm correctness:
Optimal substructure property: Let a, b be the symbols with the smallest frequency. Let x be a new symbol whose frequency is fx = fa + fb. Delete characters a and b, and find the optimal code tree CT for the reduced alphabet, which includes x.
Then CT’ = CT ∪ {a,b}, obtained by making a and b the children of the leaf x, is an optimal tree for the original alphabet.
Algorithm correctness:
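Let $d'_\sigma$ be the depth of σ in CT’ and $d_\sigma$ its depth in CT, so that $d'_\sigma = d_\sigma$ for σ ≠ a,b and $d'_a = d'_b = d_x + 1$. Then:
$$\mathrm{cost}(CT') = \sum_{\sigma} f_\sigma d'_\sigma = \sum_{\sigma \ne a,b} f_\sigma d'_\sigma + f_a d'_a + f_b d'_b$$
$$= \sum_{\sigma \ne a,b} f_\sigma d'_\sigma + f_a (d_x + 1) + f_b (d_x + 1)$$
$$= \sum_{\sigma \ne a,b} f_\sigma d'_\sigma + (f_a + f_b)(d_x + 1)$$
$$= \sum_{\sigma \ne a,b} f_\sigma d_\sigma + f_x d_x + f_x = \mathrm{cost}(CT) + f_x$$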
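The relation cost(CT’) = cost(CT) + fx can be checked numerically on the six-symbol example with frequencies 10, 20, 30, 40, 50, 60. The sketch below uses the fact that the cost of a code tree equals the sum of the frequencies of its internal nodes, since each symbol's frequency is counted once for every merge above it, that is, once per unit of depth. The function name `huffman_cost` is illustrative.

```python
import heapq


def huffman_cost(freqs: list[int]) -> int:
    """Cost of the tree built by repeatedly merging the two smallest frequencies."""
    heap = list(freqs)
    heapq.heapify(heap)
    cost = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        cost += merged  # the new internal node adds one bit to every symbol below it
        heapq.heappush(heap, merged)
    return cost


full = [10, 20, 30, 40, 50, 60]  # original alphabet; a and b have frequencies 10 and 20
f_x = 10 + 20                    # frequency of the combined symbol x
reduced = [f_x, 30, 40, 50, 60]  # reduced alphabet with a, b replaced by x

print(huffman_cost(full))           # 510
print(huffman_cost(reduced) + f_x)  # 510
```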
Algorithm correctness:
Assume CT’ is not optimal.
By the previous lemma there is a tree CT” that is optimal, and where a and b are siblings. So
cost(CT”) < cost(CT’)
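Algorithm correctness:
Consider the tree CT’’’ obtained from CT” by deleting the sibling leaves a and b and turning their parent into a leaf x, with fx = fa + fb.
[Figure: CT” with x as the parent of a and b, next to CT’’’ with x as a leaf.]
By a similar argument: cost(CT’’’) + fx = cost(CT”)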
Algorithm correctness:
We get:
cost(CT’’’) = cost(CT”) – fx < cost(CT’) – fx = cost(CT)
and this contradicts the minimality of cost(CT).