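DNA Computing-Based Implementation of a Decision Tree
Advanced AI
School of Computer Science and Engineering: 임 예니
Interdisciplinary Program in Cognitive Science: 이 은석
Interdisciplinary Program in Bioinformatics: 조 성범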
Decision Tree using DNA computing
• Input strand organization
  For each attribute, the instance's value and its class label are coupled into one code.
  After hybridization, the length of a strand indicates the number of instances.
Patient   Gene 1   Class   Gene 2   Class
1         0        1       0        0
2         1        0       0        0
3         1        0       0        0
4         1        1       0        1
Patient   Gene A   Class   Gene B   Class   Gene C   Class
1         0        0       0        0       1        0
2         0        1       0        1       0        1
3         1        1       1        0       1        1
4         0        1       0        1       1        1
5         0        1       0        1       0        1
[Figure: input strand layout: 5' sticky end, code, sticky end 3', with a Cy5 label. The code is one of {(0,0), (0,1), (1,0), (1,1)}.]
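The coupling of attribute value and class label can be sketched in Python. This is a minimal illustration, assuming each instance contributes one (value, class) code per gene and that counting codes stands in for reading strand length after hybridization. The names `PATIENTS`, `CODES` and `count_codes` are hypothetical; the rows are copied from the Gene A/B/C table.

```python
from collections import Counter

# Rows copied from the Gene A/B/C table: one (value, class) pair per gene.
PATIENTS = {
    1: {"A": (0, 0), "B": (0, 0), "C": (1, 0)},
    2: {"A": (0, 1), "B": (0, 1), "C": (0, 1)},
    3: {"A": (1, 1), "B": (1, 0), "C": (1, 1)},
    4: {"A": (0, 1), "B": (0, 1), "C": (1, 1)},
    5: {"A": (0, 1), "B": (0, 1), "C": (0, 1)},
}

CODES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def count_codes(patients):
    """Count how often each (value, class) code occurs for every gene."""
    counts = {}
    for pairs in patients.values():
        for gene, code in pairs.items():
            counts.setdefault(gene, Counter())[code] += 1
    return counts


counts = count_codes(PATIENTS)
for gene in sorted(counts):
    # One count per code, in the order (0,0), (0,1), (1,0), (1,1).
    print(gene, [counts[gene][code] for code in CODES])
```

The four counts per gene are the only quantities the later node-selection step needs, which is why a strand-length readout per code is sufficient in this scheme.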
Calculation of Information Gain
• Information Gain(S,A) ≡ Entropy(S) - ∑(|Sv|/|S|)*Entropy(Sv)
In gene expression data, all attribute values are encoded in binary mode, so
∑(|Sv|/|S|)*Entropy(Sv) = (|S0|/|S|)*Entropy(S0) + (|S1|/|S|)*Entropy(S1)
Each term is approximated by a count ratio
(n1, n2: presumably the number of minority-class instances in S0 and S1):
(|S0|/|S|)*Entropy(S0) ≈ (|S0|/|S|)*(n1/|S0|) = n1/|S|
(|S1|/|S|)*Entropy(S1) ≈ (|S1|/|S|)*(n2/|S1|) = n2/|S|
Hence
∑(|Sv|/|S|)*Entropy(Sv) ≈ n1/|S| + n2/|S| = (n1+n2)/|S| ∝ n1+n2
Entropy(S) and |S| are the same for every attribute, so the attribute with the smallest n1+n2 has the largest approximate information gain.
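To make the approximation concrete, the sketch below computes both the exact weighted entropy term and the n1+n2 proxy for one binary attribute. It assumes, since the slides do not define them, that n1 and n2 are the minority-class counts in S0 and S1. The function names are hypothetical, and the example counts (13, 49, 47, 11) are the first column of the count table on the following slide.

```python
import math


def entropy(n_class0, n_class1):
    """Binary entropy (in bits) of a subset with the given class counts."""
    total = n_class0 + n_class1
    h = 0.0
    for n in (n_class0, n_class1):
        if n:
            p = n / total
            h -= p * math.log2(p)
    return h


def weighted_entropy(c00, c01, c10, c11):
    """Exact sum over v of |Sv|/|S| * Entropy(Sv) for a binary attribute."""
    total = c00 + c01 + c10 + c11
    s0, s1 = c00 + c01, c10 + c11
    result = 0.0
    if s0:
        result += s0 / total * entropy(c00, c01)
    if s1:
        result += s1 / total * entropy(c10, c11)
    return result


def minority_proxy(c00, c01, c10, c11):
    """n1 + n2, ASSUMING n1 and n2 are the minority-class counts."""
    return min(c00, c01) + min(c10, c11)


# Counts for codes (0,0), (0,1), (1,0), (1,1) of one attribute.
example = (13, 49, 47, 11)
print("exact weighted entropy:", weighted_entropy(*example))
print("proxy n1+n2:", minority_proxy(*example))
```

The proxy needs only comparisons and additions of counts, which is presumably what makes it attractive when the counts come from a physical readout rather than from a digital table.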
Code    36822   1894   34836   1915   38982
(0,0)   13      39     24      46     54
(0,1)   49      6      55      14     24
(1,0)   47      21     36      14     6
(1,1)   11      54     5       46     36

36822=0
Code    1894   34836   1915   38982
(0,0)   11     7       13     13
(0,1)   6      46      13     21
(1,0)   2      6       0      0
(1,1)   43     3       36     28

1894=0
Code    34836   1915   38982
(0,0)   5       11     11
(0,1)   6       1      1
(1,0)   0       0      1
(1,1)   6       5      4
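The node choice at each level can be replayed from the three count tables. The sketch below is an illustration under two assumptions that the slides do not state explicitly: the table columns follow the order in which the gene IDs are listed, and the selection score is n1+n2 with n1 and n2 read as minority-class counts. `LEVELS`, `minority_proxy` and `pick_gene` are hypothetical names.

```python
# Counts per gene for codes (0,0), (0,1), (1,0), (1,1), copied from the slides.
# ASSUMPTION: columns follow the order in which the gene IDs are listed.
LEVELS = [
    ("all instances", {
        "36822": (13, 49, 47, 11),
        "1894": (39, 6, 21, 54),
        "34836": (24, 55, 36, 5),
        "1915": (46, 14, 14, 46),
        "38982": (54, 24, 6, 36),
    }),
    ("36822=0", {
        "1894": (11, 6, 2, 43),
        "34836": (7, 46, 6, 3),
        "1915": (13, 13, 0, 36),
        "38982": (13, 21, 0, 28),
    }),
    ("36822=0, 1894=0", {
        "34836": (5, 6, 0, 6),
        "1915": (11, 1, 0, 5),
        "38982": (11, 1, 1, 4),
    }),
]


def minority_proxy(counts):
    """n1 + n2, ASSUMING n1 and n2 are the minority-class counts."""
    c00, c01, c10, c11 = counts
    return min(c00, c01) + min(c10, c11)


def pick_gene(table):
    """Return the gene with the smallest proxy score."""
    return min(table, key=lambda gene: minority_proxy(table[gene]))


chosen = [pick_gene(table) for _, table in LEVELS]
for (subset, table), gene in zip(LEVELS, chosen):
    scores = {g: minority_proxy(c) for g, c in table.items()}
    print(f"{subset}: scores={scores} -> split on {gene}")
```

Under these assumptions the selected genes are 36822, 1894 and 1915, the same genes that appear in the rules derived from DNA computing.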
DNA Computing vs. Digital Computing
• Rules from DNA computing
  36822=0
    - 1894=0
        - 1915=0: 0
        - 1915=1: 1
    - 1894=1: 1
  Identical to the result of the conventional decision tree algorithm
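The rules above can be written as a small lookup in Python. This is a sketch only: the nesting of the rules is inferred from the slide, `RULES` and `classify` are hypothetical names, and samples that no listed rule covers (for example 36822=1) are returned as `None` rather than guessed.

```python
# Each rule: (conditions on gene values, predicted class).
# ASSUMPTION: 1915 is tested under 1894=0, and 1894 under 36822=0.
RULES = [
    ({"36822": 0, "1894": 0, "1915": 0}, 0),
    ({"36822": 0, "1894": 0, "1915": 1}, 1),
    ({"36822": 0, "1894": 1}, 1),
]


def classify(sample):
    """Return the class of the first matching rule, or None if none matches."""
    for conditions, label in RULES:
        if all(sample.get(gene) == value for gene, value in conditions.items()):
            return label
    return None


print(classify({"36822": 0, "1894": 0, "1915": 1}))
print(classify({"36822": 0, "1894": 1, "1915": 0}))
print(classify({"36822": 1, "1894": 0, "1915": 0}))
```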
Input sequences for <00>/<01>/<10>/<11>
Input   5' Sticky end                            Sticky end 3'
<00>    GCATAG   GAAATGAGTT   CTTTACTCAA   CGTATC
<01>    ATAGGC   TGATGCTACA   ACTACGATGT   TATCCG
<10>    AGGCAT   GGTTGTGGCG   CCAACACCGC   TCCGTA
<11>    ATAGGA   CAGTTATTTC   GTCAATAAAG   TATCCT
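A quick consistency check on the four sequence rows: in each row the third and fourth fields are the base-by-base Watson-Crick complements of the second and first fields. The sketch below verifies this. The assignment of rows to <00>, <01>, <10>, <11> by order is an assumption, and `ROWS` and `complement` are hypothetical names.

```python
# A<->T and C<->G, applied position by position (no reversal).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Fields per row as listed on the slide.
# ASSUMPTION: rows map to <00>, <01>, <10>, <11> in the order shown.
ROWS = {
    "<00>": ("GCATAG", "GAAATGAGTT", "CTTTACTCAA", "CGTATC"),
    "<01>": ("ATAGGC", "TGATGCTACA", "ACTACGATGT", "TATCCG"),
    "<10>": ("AGGCAT", "GGTTGTGGCG", "CCAACACCGC", "TCCGTA"),
    "<11>": ("ATAGGA", "CAGTTATTTC", "GTCAATAAAG", "TATCCT"),
}


def complement(seq):
    """Return the position-wise complement of a DNA string."""
    return seq.translate(COMPLEMENT)


checks = {}
for code, (sticky5, body, body_c, sticky3) in ROWS.items():
    checks[code] = (complement(sticky5) == sticky3, complement(body) == body_c)
    print(code, checks[code])
```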
Implementation steps
1. Rule-representing sequences
2. Hybridization
3. Construction of random paths
4. Fluorescence detection: check whether a specific rule appears consecutively
5. Repeat from step 3
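The steps above can be mimicked with a toy Monte Carlo model. This is not the simulator used for the slides, whose details are not given. It assumes that one hybridization event joins two strands drawn in proportion to pool size and that a "consecutive appearance" is a pair carrying the same rule. `POOL` and `simulate` are hypothetical names; the pool sizes and the number of hybridizations are taken from the first simulation run.

```python
import random
from collections import Counter

# Pool sizes per rule sequence, from the first simulation run on the slides.
POOL = {"<00>": 1000, "<01>": 900, "<10>": 800, "<11>": 700}


def simulate(pool, n_hybridizations, seed=0):
    """Count events in which the same rule is drawn twice in a row."""
    rng = random.Random(seed)
    codes = list(pool)
    weights = [pool[code] for code in codes]
    consecutive = Counter()
    for _ in range(n_hybridizations):
        # TOY ASSUMPTION: one event joins two strands, each drawn in
        # proportion to its pool size (sampling with replacement).
        first, second = rng.choices(codes, weights=weights, k=2)
        if first == second:
            consecutive[first] += 1
    return consecutive


result = simulate(POOL, n_hybridizations=1000)
for code in POOL:
    print(code, result[code])
```

A fixed seed keeps the run reproducible; changing the pool sizes shows how unequal strand concentrations bias which rules are detected.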
Simulation Results
• 1st: each rule sequence: 1000, 900, 800, 700; hybridization #: 1000
  [Bar chart "1st": number of consecutively appearing sequences (y-axis, 0 to 900) for <00>, <01>, <10>, <11>]
• 2nd:
  [Bar chart "2nd": number of consecutively appearing sequences (y-axis, 0 to 1000) for <00>, <01>, <10>, <11>]
• 3rd:
  [Bar chart "3rd": number of consecutively appearing sequences (y-axis, 0 to 700) for <00>, <01>, <10>, <11>]
• 4th:
  [Bar chart "4th": number of consecutively appearing sequences (y-axis, 0 to 1000) for <00>, <01>, <10>, <11>]
(0,0): 0.85
(0,1): 0.91
(1,0): 0.62
(1,1): 0.87
Summary of Simulation Results
Cell format: count:scaled count
Code    36822      1894       34836      1915       38982
(0,0)   13:11.05   39:33.15   24:20.4    46:39.1    54:45.9
(0,1)   49:44.59   6:5.46     55:50.05   14:12.74   24:21.84
(1,0)   47:29.14   21:13.02   36:22.32   14:8.68    6:3.72
(1,1)   11:9.57    54:46.98   5:4.35     46:40.02   36:31.3
Calculation of Root Node with Simulation Results
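One way to redo the root-node calculation on simulated values is sketched below. It assumes that the second number in each summary cell is the original count multiplied by the per-code factor (0.85, 0.91, 0.62, 0.87), and that the selection score is n1+n2 with n1 and n2 read as minority-class counts. Neither assumption is stated on the slides, and `FACTORS`, `COUNTS`, `scale` and `minority_proxy` are hypothetical names.

```python
# Per-code factors for (0,0), (0,1), (1,0), (1,1), as listed on the slides.
FACTORS = (0.85, 0.91, 0.62, 0.87)

# Original counts per gene for codes (0,0), (0,1), (1,0), (1,1).
COUNTS = {
    "36822": (13, 49, 47, 11),
    "1894": (39, 6, 21, 54),
    "34836": (24, 55, 36, 5),
    "1915": (46, 14, 14, 46),
    "38982": (54, 24, 6, 36),
}


def scale(counts):
    """ASSUMPTION: simulated value = count * factor for that code."""
    return tuple(n * f for n, f in zip(counts, FACTORS))


def minority_proxy(counts):
    """n1 + n2, ASSUMING n1 and n2 are the minority-class counts."""
    c00, c01, c10, c11 = counts
    return min(c00, c01) + min(c10, c11)


scores = {gene: round(minority_proxy(scale(c)), 2) for gene, c in COUNTS.items()}
root = min(scores, key=scores.get)
print(scores)
print("root candidate:", root)
```

Because the factors differ per code, the ranking of genes on scaled counts need not match the ranking on raw counts, which is one way non-specific hybridization could change the resulting tree.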
Validation of decision trees resulting from DNA computing and digital computing
Number of genes   Digital computing   DNA computing
3                 82%                 70%
5                 90%                 75%
Discussion
• Because of non-specific hybridization, the simulation results differed from the calculated values
• Lack of a pruning process
• Cost
• More specific hybridization process