Thèse
Présentée pour l’obtention du titre de
DOCTEUR DE L’UNIVERSITÉ PARIS-EST, Spécialité : Sciences Informatiques
par Ivan BUDNYK
Contribution à l’Étude et Implantation de Systèmes Intelligents Modulaires Auto-Organisateurs
Soutenue publiquement le 8 décembre 2009 devant la commission d’examen composée de :

Rapporteur : Prof. Gilles BERNARD, Université Paris 8
Rapporteur : Prof. Vladimir GOLOVKO, Brest State Technical University
Examinateur : Prof. Hichem MAAREF, Université d’Évry-Val d’Essonne
Examinateur : Dr. Abdennasser CHEBIRA, Université Paris-Est (Paris 12)
Invité : Dr. Semen GOROKHOVSKYI, Université nationale de Kiev-Mohyla-Académie
Directeur de thèse : Prof. Kurosh MADANI, Université Paris-Est (Paris 12)

LISSI – Laboratoire Images, Signaux et Systèmes Intelligents – EA 3956
Université Paris-Est (Paris 12), l’IUT Sénart-Fontainebleau, Département GEII
Bât A., Av. P. Point, F-77127 Lieusaint ; Tél : +33(0)164134486, Fax : +33(0)164134503, http://lissi.univ-paris12.fr
Thesis
Presented to obtain the degree of
DOCTOR OF THE UNIVERSITY PARIS-EST, Specialty: Computer Science
by Ivan BUDNYK
Contribution to the Study and Implementation of Intelligent Modular Self-organizing Systems
Defended publicly on 8 December 2009 before the examination committee composed of:

Rapporteur: Prof. Gilles BERNARD, University Paris 8
Rapporteur: Prof. Vladimir GOLOVKO, Brest State Technical University
Examiner: Prof. Hichem MAAREF, University of Évry-Val d’Essonne
Examiner: Dr. Abdennasser CHEBIRA, University Paris-Est (Paris 12)
Invited member: Dr. Semen GOROKHOVSKYI, National University of Kyiv-Mohyla Academy
Thesis supervisor: Prof. Kurosh MADANI, University Paris-Est (Paris 12)

LISSI – Laboratory of Images, Signals and Intelligent Systems – EA 3956
University Paris-Est (Paris 12), IUT Sénart-Fontainebleau, Department GEII
Bât A., Av. P. Point, F-77127 Lieusaint; Tel: +33(0)164134486, Fax: +33(0)164134503, http://lissi.univ-paris12.fr
Abstract
Classification problems deal with separating groups of objects into sets of smaller classes; this family of problems has received considerable attention in diverse engineering fields such as biomedical imaging, speaker identification, fingerprint recognition, etc. Several effective approaches to automated classification have been suggested based on artificial intelligence techniques, including neural networks. Still, one of the major challenges faced by these approaches is the large scale of data required for successful classification. In this thesis, we explore a possible solution to this problem based on the module-based Tree-like Divide to Simplify (T-DTS) classification model.
We focus on enhancing the key module of this approach, the complexity estimation module. Furthermore, we provide an automated procedure for optimizing the key complexity estimation parameters of the T-DTS model; this considerably improves usability and allows a more effective configuration of the decomposition reasoning of the approach. Another major contribution of this work is the further development of T-DTS modules that could be implemented on a parallel computer architecture, thereby allowing T-DTS to utilize the underlying hardware to the fullest extent.
Keywords: information processing, complex systems, artificial intelligence, modular artificial learning systems, classification, complexity estimation.
Résumé
Les problèmes de classification ont reçu une attention considérable dans différents champs d’ingénierie comme le traitement des images biomédicales, l’identification à partir de la voix, la reconnaissance d’empreintes digitales, etc. Les techniques d’intelligence artificielle, incluant les réseaux de neurones artificiels, permettent de traiter des problèmes de ce type. En particulier, les problèmes rencontrés nécessitent la manipulation de bases de données de tailles très importantes. Des structures de traitement adaptatives et exploitant des ensembles de classificateurs sont utilisées. Dans cette thèse, nous décrivons principalement le développement et les améliorations apportées à un outil de classification désigné par le terme Tree-like Divide to Simplify ou T-DTS. Nos efforts se sont portés sur l’un des modules de cet outil, le module d’estimation de complexité. L’architecture de l’outil T-DTS est très flexible et nécessite le choix d’un nombre important de paramètres. Afin de simplifier l’exploitation de T-DTS, nous avons conçu et développé une procédure automatique d’optimisation de l’un de ses plus importants paramètres, le seuil de décision associé à la mesure de complexité. La contribution principale de cette thèse concerne le développement de modules pouvant s’implanter sur une architecture de calcul matérielle parallèle. Ceci permet de se rapprocher d’une implantation purement matérielle de l’outil T-DTS.
Mots clés : traitement de l’information, systèmes complexes, intelligence artificielle, systèmes d’apprentissage artificiels modulaires, classification, estimation de la complexité.
Acknowledgement
I would like to express my deepest gratitude to Prof. Kurosh Madani for his
continuous support and patience. His help and guidance throughout my studies at
Laboratoire Images Signaux et Systèmes Intelligents (LISSI / EA 3956) were invaluable.
His experience was truly of great benefit to me and his advice in academia and otherwise
has undoubtedly left a mark on me.
I would also like to thank Dr. Abdennasser Chebira for his valuable input and
meticulous efforts in reviewing much of this research, as well as his patience while I was
preparing this thesis.
I would also like to thank my thesis committee members Prof. Gilles Bernard, Prof.
Vladimir Golovko and Prof. Hichem Maaref for the useful comments, suggestions, and
the references.
I am grateful to the faculty members of l’IUT de Sénart including Dr. Veronique
Amarger, Dr. Christophe Sabourin and Dr. Amine Chohra for their assistance.
My deepest thanks also go to the members and ex-members of LISSI including Dr.
Nadia Kanaoui, Dr. El Khier Sofiane Bouyoucef, Dr. Mariusz Rybnik, Dr. Lamine Thiaw,
Dr. Matthieu Voiry, Weiwei Yu, Dalel Kanzari, Ting Wang, Dominik Maximilián Ramík
and Arash Bahrammirzaee.
I also thank France’s leading international exchange operator ÉGIDE, which gave me the
initial opportunity to pursue research in France.
Numerous other people and lifelong friends with whom I have worked and lived in
France, the Czech Republic and Ukraine deserve many thanks as well, especially Dr. Maksym
Petrenko and Dr. Semen Gorokhovskyi.
Finally, I would like to thank my parents, Georgiy and Ganna Budnyk, because without
them I would not be the person I am today.
Table of contents
List of tables............................................................................................................................v
List of figures........................................................................................................................ vi
Index of symbols ....................................................................................................................x
General introduction ...............................................................................................................1
I State-of-the-art of classification approaches ........................................................................6
I.1 Concepts of classification............................................................................................7
I.2 Clustering methods ......................................................................................................8
I.3 Main classification methods ......................................................................................17
I.4 T-DTS (Tree-like Divide To Simplify) approach ......................................................37
I.5 Conclusion .................................................................................................................42
II Complexity concepts .........................................................................................................44
II.1 Introduction to complexity concepts ........................................................................44
II.2 Computational complexity measurement.................................................................48
II.3 ANN-structure based classification complexity estimator.......................................75
II.4 Conclusion................................................................................................................85
III T-DTS software architecture............................................................................................86
III.1 T-DTS concept: self-tuning procedure ...................................................................94
III.2 T-DTS software architecture and realization........................................................101
III.3 Conclusion ............................................................................................................118
IV Validation aspects..........................................................................................................120
IV.1 ANN-structure based complexity estimators........................................................120
IV.1.1 Hardware-based validation .........................................................................121
IV.1.2 Software-based validation ...........................................................................133
IV.1.3 Summary......................................................................................................149
IV.2 T-DTS...................................................................................................................152
IV.2.1 ANN-structure based complexity estimator validation ...............................152
IV.2.2 T-DTS self-tuning procedure validation......................................................160
IV.2.3 Summary......................................................................................................172
IV.3 Conclusion ............................................................................................................173
General conclusion and perspectives ..................................................................................175
Appendixes .........................................................................................................................179
A List of publications ....................................................................................................180
B Approaches to defining complexity ..........................................................................181
B.1 Defining complexity. Genesis of the concept complexity ...............................182
B.2 System’s attributes and complexity..................................................................183
C Neural Networks in hardware ...................................................................................187
C.1 IBM© ZISC®-036 Neurocomputer .................................................................189
C.2 System’s attributes and complexity..................................................................193
Bibliography .......................................................................................................................197
List of tables
II.1 Advantages and disadvantages of complexity estimating techniques............................72
IV.1 Benchmarks complexity rates obtained using IBM© ZISC®-036 implementation of
ANN structure based complexity estimator........................................................................123
IV.2 Complexity rates obtained for Splice-junction DNA classification problem (original
database) using IBM© ZISC®-036 Neurocomputer ..........................................................132
IV.3 Complexity rates obtained for Splice-junction DNA classification problem (re-
encoded database) using ANN-structure based and other applications ..............................147
IV.4 Complexity rates obtained for Tic-tac-toe endgame classification problem using
sixteen classification complexity criteria including ANN-structure based complexity
estimator..............................................................................................................................148
IV.5 Classification results: Four spiral benchmark, two classes, generalization database
size 500 prototypes, learning database size 500 prototypes ...............................................163
IV.6 Classification results: Tic-tac-toe endgame classification problem ...........................167
IV.7 Classification results: Splice-junction DNA sequences classification problem, three
classes, generalization and learning database size 1595 prototypes ...................................170
IV.8 Consolidation of classification results: Splice-junction DNA sequences classification
problem ...............................................................................................................................171
List of figures
I.1 SVM: space mapping using linear hyperplane ................................................................21
I.2 SVM: space mapping using different space kernel functions Φ .....................................22
I.3 General block scheme diagram of the T-DTS structure constructing .............................40
II.1 Description and interpretation process...........................................................................52
II.2 Retained adherence subset for two classes near the boundary.......................................69
II.3 Taxonomy of classification complexity (separability) measures...................................71
II.4 Examples of Voronoy polyhedron for 2D and 3D classification problems ...................77
II.5 Q(m) indicator function behaviour.................................................................................78
III.1 Diagram of T-DTS implementation for classification tasks .........................................87
III.2 Scheme of T-DTS learning concept..............................................................................90
III.3 T-DTS operating ...........................................................................................................90
III.4 An example of maximal possible decomposition tree ..................................................95
III.5 An example of distribution of the clusters’ number over [Amin; Amax] complexity interval .................................................................................................................................97
III.6 T-DTS software architecture.......................................................................................102
III.7 Principal T-DTS v. 2.50 Matlab software architecture...............................................104
III.8 Detailed T-DTS v. 2.50 Matlab software architecture ...............................................107
III.9 Matlab T-DTS software realization v. 2.50, Control panel ........................................108
III.10 T-DTS GUI: Results : 2 stripe-like benchmark (576 prototypes).............................113
III.11 GUI of T-DTS, decomposition clusters’ chart..........................................................114
III.12 GUI of T-DTS, decomposition tree charts................................................................115
III.13 GUI of T-DTS, decomposition tree charts in 3D......................................................115
III.14 Menu of T-DTS, Configuration ................................................................................116
III.15 Menu of T-DTS, Set Constant ..................................................................................117
III.16 Menu of T-DTS, Set EC Options ..............................................................................117
III.17 Menu of T-DTS, Analysis .........................................................................................117
IV.1 Stripe classification benchmarks ...............................................................................122
IV.2 Stripe classification benchmarks : Qi(m) behavior versus learning database size m, LSUP ZISC®-036 mode.....................................................................................................124
IV.3 Stripe classification benchmarks : Qi(m) behavior versus learning database size m, L1 ZISC®-036 mode................................................................................................................124
IV.4 Benchmarks’ classification rates behavior versus learning database size m, LSUP ZISC®-036 mode................................................................................................................126
IV.5 Benchmarks’ classification rates behavior versus learning database size m, L1 ZISC®-036 mode ............................................................................................................................126
IV.6 Qk(m) evaluation for DNA splice-junction classification problem using different ZISC®-036 k-MIF parameters (k: 55,56, 4096) for: Q55(m), Q56(m), Q4096(m), mk – corresponds to calculated m0 for each k-curve....................................................................129
IV.7 Quality check of RCE-kNN-like Voronoy polyhedron construction based on its generalization ability performed for k-MIF parameter k=55..............................................129
IV.8 Quality check of RCE-kNN-like Voronoy polyhedron construction based on its generalization ability performed for k-MIF parameter k=56..............................................131
IV.9 Quality check of RCE-kNN-like Voronoy polyhedron construction based on its generalization ability performed for k-MIF parameter k=4096..........................................131
IV.10 Square classification benchmarks, 2 classes, 2000 prototypes.................................134
IV.11 ANN-structure based complexity estimator evaluation for: Square benchmarks, 2 classes, 2000 prototypes, MIF = 1024, 3 distance modes...................................................135
IV.12 ANN-structure based complexity estimator evaluation for: 8 Stripe benchmark, 2 classes, 2000 prototypes, LSUP distance mode..................................................................136
IV.13 ANN-structure based complexity estimator evaluation for: 8 Stripe benchmark, 2 classes (4&4 stripes), LSUP distance mode .......................................................................137
IV.14 Grid classification benchmarks in D1 D2 and D3 dimension ..................................138
IV.15 ANN-structure based complexity estimator evaluation for: Grid benchmark, 2 classes, EUCL distance mode .............................................................................................139
IV.16 ANN-structure based complexity estimator evaluation for: Grid and 8-stripe-benchmarks, 2 classes, EUCL distance mode, MIF=1024 .................................................140
IV.17 Four spiral classification benchmarks, 2 classes, 2000 prototypes ..........................142
IV.18 ANN-structure based complexity estimator evaluation for: 4 Spiral benchmark, 2 classes, EUCL distance mode .............................................................................................142
IV.19 Six classification benchmarks [from left to right, from top to down: 2 Stripes (2ST), 2 Grids (2GR), 2 Squares (2SQ), 2 Sinusoids (2SN), 2 Spirals (2SP), 2 Circles (2CR) with small overlapping zone]......................................................................................................143
IV.20 ANN-structure based complexity estimator evaluation for: 6 classification benchmarks, 2 classes, 2000 prototypes, MIF=1024..........................................................144
IV.21 Validation Matlab implementation of ANN-structure based complexity estimator embedded into T-DTS framework: 2 Stripe benchmark, 2 classes, generalization database size 1000 prototypes, learning database size 1000 prototypes, DU – CNN, PU – LVQ1..153
IV.22 Validation ANN-structure based complexity estimator embedded into T-DTS framework: 10 Stripe benchmark, 2 classes, generalization database size 1600 prototypes, learning database size 400 prototypes, DU – CNN, PU – LVQ1.......................................154
IV.23 Validation Matlab implementation of ANN-structure based complexity estimator embedded into T-DTS framework: Tic-tac-toe endgame problem, 2 classes, generalization database size 479 prototypes, learning database size 479 prototypes, DU – CNN, PU – MLP_FF_GDM...................................................................................................................157
IV.24 Validation ANN-structure based complexity estimator embedded into T-DTS framework: Tic-tac-toe endgame problem, 2 classes, generalization database size 766 prototypes, learning database size 192 prototypes, DU – CNN, PU – MLP_FF_GDM ....158
IV.25 Validation ANN-structure based complexity estimator embedded into T-DTS framework: Splice-junction DNA sequence classification problem, 3 classes, generalization database size 1520 prototypes, learning database size 380 prototypes, DU – CNN, PU – MLP_FF_GDM...................................................................................................................158
IV.26 Validation T-DTS self-tuning threshold procedure, Average learning rate (including its corridor of the standard deviations) as a function of θ - threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator................................160
IV.27 Validation T-DTS self-tuning threshold procedure, Average generalization rate (including its corridor of the standard deviations) as a function of θ-threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator ......161
IV.28 Validation T-DTS self-tuning threshold procedure, Average clusters’ number as a function of θ-threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator.................................................................................................161
IV.29 Validation T-DTS self-tuning threshold procedure, Performance estimating function P(θ): 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator..............................................................................................................................162
IV.30 Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: 4 Spiral benchmark, 2 classes, learning database size 500 prototypes, DU – CNN, Collective entropy based complexity estimator ...................................................................................163
IV.31 Validation T-DTS self-tuning threshold procedure: 10 stripe benchmark, 2 classes, generalization database size 1000 prototypes, learning database size 1000 prototypes, DU – CNN, PU – LVQ1, 4 complexity estimators ......................................................................164
IV.32 Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Tic-tac-toe endgame problem, 2 classes, DU – CNN, Collective entropy complexity estimator..............................................................................................................................166
IV.33 Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Tic-tac-toe endgame problem, 2 classes, DU – CNN, ANN-structure based complexity estimator..............................................................................................................................166
IV.34 Validation T-DTS self-tuning threshold procedure, Splice-junction DNA sequences classification problem, 3 classes, generalization database size 1520 prototypes, learning database size 380 prototypes, DU – CNN, PU – MLP_FF_GDM, 3 complexity estimators.............................................................................................................................................168
IV.35 Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size 1595 prototypes, DU – CNN, Purity PRISM based complexity estimator.........................170
IV.36 Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size 1595 prototypes, DU – CNN, Fukunaga’s interclass distance measure J1 based complexity estimator..............................................................................................................................170
B.1 Genesis of the complexity ............................................................................................181
C.1 General block level architecture representation of a neurochip or a neurocomputer processing elements ............................................................................................................188
C.2 IBM© ZISC®-036 PC-486 ISA bus based block diagram..........................................190
C.3 Schematic drawing of a single IBM© ZISC®-036 processing element - neuron........191
C.4 IBM© ZISC®-036 chip’s block diagram ....................................................................191
C.5 Hardware realization: IBM© ZISC®-036 PCI board ..................................................192
C.6 CM-1K Neural network chip........................................................................................193
C.7 Network of CM-1K chips, or 6144 neurons in parallel ................................................194
C.8 CM-1K chip’s functional diagram. Inner architecture .................................................194
Index of symbols
Symbol Signification
s / s input vector in the feature space
S set of vectors of the feature space
c element of a concept class
C set of concepts
i, j, l, r indexes
dim feature space dimension
d() distance function between two vectors
dE Euclidean distance
dWE weighted Euclidean distance
W vector of weights for dWE
dMN Minkovsky distance
L coefficient of dMN
dMH Manhattan distance, City-block distance or L1 distance
dCH Chebyshev distance
dML Mahalanobis distance
dCN Canberra distance
dCS Cosine distance
dTN Tanimoto distance
m total number of data items / instances in S
Df() distance function
Gi i-th cluster / group
s̄i mean vector of cluster Gi
M() / E() mean / math expectation
ESS() error sum-of-square of a cluster
k clusters / groups number, number of samples
εs square error
af() affinity function
SM matrix of similarity
θ threshold / (threshold value of: affinity function or complexity or so forth)
g() hyperplane function
N number of classifiers
xi
p(), q() probability distribution functions
µ length of the segment of code / program
x, y discrete variables
H() entropy measure
I() mutual information measure
Λ() likelihood ratio
KL() Kullback–Leibler divergence, information gain or relative entropy
SH overall degree of separability
Dλ() Λ-divergence
JSD() Jensen-Shannon divergence
HD() Hellinger distance measure
JMD() Jeffreys-Matusita distance measure
BD() Bhattacharyya distance measure
ε Bayes error
Г(s) local region around instance s
v(s) volume of the region Г(s)
p̂() density estimating function
ε̂() estimating function / equation for the Bayes error ε
NMD() normalized mean distance
Sw/Sb/Sm scatter matrices
J1/J2/J3/J4 Fukunaga’s four criteria of scatter matrices
B resolution parameter used for the creation of hyper-cuboids
ηij number of data instances of class i in box j
SGj PRISM’s separability parameter for each cluster (box) Gj
SG PRISM’s overall purity for all clusters (boxes) G
SG.norm normalized SG
HPG entropy measure of each G-PRISM’s box
HPG.norm collective entropy measure
FMD() Fisher linear discriminant (ratio) based measure
Ω Chaitin’s constant
n number of neurons in RBF-Net / parameter that reflects complexity
ĝ(.) classification complexity estimation function
Q() indicator of classification complexity
ℜ^i input feature space of dimension i, to distinguish from S
Ψ(t) model’s input vector(s), ψ(t) ∈ ℜ^i
nΨ dimension of input vectors Ψ(t)
ui output decision variable for NNMi , i – index
nu dimension of a vector ui
U linear combiner of outputs ui
F(.) transfer function ℜ^i → ℜ^j, where i and j are dimension indexes
CU[.] control unit output of a decision function
b set of parameters of CU
ξ set of condition of CU
Wi centroid of Gi cluster
z iteration number required to get statistics of T-DTS output
Ai complexity ratio of the sub-database denoted by index i
[Amin;Amax] interval of θ-threshold ratio variation
α T-DTS parameter responsible for constriction of [Amin;Amax]
[Bmin;Bmax] sub-interval of interval [Amin;Amax]
Gr generalization (testing) rate, expressed in %
Lr learning rate, expressed in %
SdGr standard deviation of generalization (testing) rate, expressed in %
SdLr standard deviation of learning rate, expressed in %
Tp T-DTS processing time, expressed in seconds
NTp processing unit execution time applied to non-decomposed database
SdTp standard deviation of T-DTS processing time, expressed in seconds
SdNTp standard deviation of processing unit executing time, expressed in seconds
P(.) aggregation function of T-DTS performance
b1/b2/b3 coefficients of priorities
h number of optimization iterations, equivalent to the precision of the quasi-optimal threshold
General introduction
Fundamental efforts in understanding the nature of intelligence and its realization in
human minds have been of growing interest in various research communities, including
education, cognitive science, computer science, neuroscience and engineering. The field
of Artificial Intelligence (AI) was formed specifically to facilitate these efforts and to
solve a set of associated problems, ranging from Pattern recognition to Artificial life.
This thesis addresses Machine learning (ML) (Mitchell, Anderson, Carbonel and
Michalski 1986), (Mitchell 1997), the discipline of AI that deals with the design and
development of learning algorithms capable of recognizing complex patterns and making
intelligent decisions based on the information represented by a database of classification objects.
Traditionally, Machine learning tools occupied a specific niche of commercial and self-improving
software applications; however, recent advances in the AI field have brought ML
into the mainstream (Thrun, Faloutsos, Mitchell and Wasserman 1999). For example,
Machine learning is at the foundation of many data-mining techniques that have become
a necessity for coping with the ever-growing volumes of available on-line and off-line
data (Thrun, Faloutsos, Mitchell and Wasserman 1999). Our motivation is that better
ML techniques, particularly better algorithms based on fundamental statistical-computational
theories of learning processes, could greatly benefit and facilitate the further
development of ML-based applications.
Motivation and objectives
In this thesis, I explore the direction dealing with data stores that use classification and
clustering methods (Fielding 2007). Being extensively used in data mining applications
(Bellman 1961), these methods were shown not to perform well with huge data sets
(Dong 2003). In this research, I study Machine learning methods that can help to alleviate
this caveat and, as suggested by the No Free Lunch Theorem (Stork, Duda and Hart
2001), are at least as effective as other available classification approaches. I am
particularly interested in making Machine learning classification tools adaptable, self-
adjusting and capable of employing an intelligent human-like view on classification
problems in massive databases. I also aim at streamlining the use of Machine learning
techniques by providing an automated method for resolving parameters of these
techniques while tuning them for efficiency and effectiveness.
Of the available Machine learning techniques, I focus on the Hybrid Multi Neural
Networks (HMNN) technique proposed in (Madani and Chebira 2000), (Madani, Rybnik and
Chebira 2003) and developed by Rybnik (Rybnik 2004) and Bouyoucef (Bouyoucef, Chebira,
Rybnik and Madani 2005), (Bouyoucef 2007). Since this technique is based on the design
paradigms of “divide and conquer” and “reduce complexity through decomposition”, in
this thesis I investigate the issue of complexity; in this context, complexity means the difficulty
of a classification task regardless of the data set size.
Further, I study how the data-driven tree-like construction approach for ensembles of
neural networks can be adopted for data organization purposes. I argue that this
structure provides the best control of the generalization performance of the technique in
the typical setting of classification problems, i.e., when little reliable prior information
about the statistical law underlying a classification problem is available. I adopted and
enhanced the data-driven HMNN structure constructor concept (Madani and Chebira
2000) named Tree-like Divide to Simplify (T-DTS), developed in the works (Madani,
Chebira and Mercier 1997) - (Madani, Rybnik and Chebira 2003) and designed for
classification problems based on the "divide and conquer" strategy. The objectives of this
thesis in studying T-DTS can be summarized as follows:
1. As it has been shown (Rybnik 2004) that the T-DTS algorithmic approach converges quickly
and that its time complexity (computational complexity) depends linearly on the number
of training samples and the dimension of the feature vector, my central target is T-DTS
development and enhancement.
2. As a complexity estimator is at the heart of the T-DTS engine, I aim at developing a novel
classification complexity estimator based on an ad hoc method that takes into account the
classification database complexity issue and extracts information about its
“classifyability” using information provided by a Neural Network (NN) structure.
Neural Networks are especially suitable for this problem, as the data come from a
complex environment and can be incomplete, heterogeneous, or have other characteristics
that make statistically based classification complexity estimation not only very difficult,
but sometimes impossible or incorrect (Bouyoucef 2007).
3. Update T-DTS framework with the new complexity estimating techniques, the
Neural Networks based decomposition, and the processing methods.
4. Provide an automated self-adjusting procedure for effective resolving of the key
parameters of the approach.
5. Perform verification, analyze the results and outline possible perspectives.
The following section represents my contribution to this research.
Contribution
The main purpose of this research is to improve the T-DTS technique and its
implementation. I expect that the results of this thesis will not only increase T-DTS
performance in classification tasks, but also answer the question of how and when to apply
the T-DTS approach to a classification problem represented by a particular database.
Based on the validation results, I expect that the proposed ANN-based classification
complexity estimator will enhance classifier adjustment regardless of the classification
concept or paradigm.
In summary, this thesis makes the following specific contributions:
1. I have implemented a component-independent T-DTS platform for classification
tasks. Each component of the platform (i.e., decomposition method, processing method,
or complexity estimation technique) can be adjusted and verified independently of the
main platform and then successfully integrated into T-DTS.
2. I have proposed an Artificial Neural Network (ANN) structure-based classification
complexity estimator that can theoretically be used for the adjustment of other advanced
classifiers such as T-DTS; I have shown, using benchmarks and two real-world
classification problems, the effectiveness of the proposed estimator in the case of the T-DTS
classifier.
3. I have proposed a self-adjusting T-DTS procedure that semi-automatically
determines the key parameter of T-DTS, the complexity threshold, at which T-DTS
reaches its quasi-optimal performance, and that produces a range (not only one) of
satisfactory results.
Most importantly, using the analysis provided by this semi-automated procedure,
one can reason about why the T-DTS approach may not be successful for a selected
decomposition and processing unit. In addition, taking advantage of the low-priced
high-performance hardware that has become available in recent years and has become
the de facto platform of choice for large-scale learning algorithms, I have prepared
the way for a multi-processor implementation of the T-DTS approach that would allow a user
to further improve T-DTS performance and its usability for large-scale data.
In conclusion to my contributions, I should mention that in this research I focused on
the preconditions and control issues of T-DTS operation. The findings of this research can
help industrial as well as academic researchers in designing efficient and robust
intelligent classifiers capable of handling complex classification problems, even those imposed
by huge databases.
Thesis organization
The thesis is organized in four chapters. The main contributions of the thesis are
discussed in Chapter I, Chapter II, Chapter III and Chapter IV; these chapters are intended
to be self-contained and can be read by an experienced reader without considering other
chapters.
In Chapter I, I review major classification techniques and learning algorithms
available in the literature. The main goal of this chapter is to provide the reader with the
background necessary to understand the details of, and additional motivation for, the
T-DTS approach; the problems that arise when the reviewed classification methods are
applied to T-DTS-type tasks are discussed in the subsequent sections of this Chapter I.
This is an additional reason why I provide a brief overview and comparison of traditional
classification techniques based on their speed, accuracy and memory requirements on
large and complex data sets; I also discuss the limitations of these techniques and argue for a more
advanced approach. In addition, I discuss basic optimization techniques and key issues in
the disciplines of numerical analysis, computer architecture, operating systems and
parallel computing that are necessary for designing a high-performance implementation
and industrial adaptation of the suggested learning-based algorithm. The chapter
concludes with a theoretical overview of our proposed learning framework named T-DTS,
which is based on building a tree-like classifier by means of the "divide and conquer" design
principle and a unique ensemble method of complexity estimation. I also highlight the key
role of the complexity estimating technique, which is the core of the T-DTS approach.
Chapter II is fully dedicated to complexity concepts and to a broad range of complexity
measurement approaches. It also provides a detailed description of my novel Artificial
Neural Network structure (ANN-structure) based complexity estimation approach, which is
used for defining the complexity of classification tasks.
In Chapter III, I highlight important T-DTS implementation aspects. I introduce a
self-tuning threshold procedure enhancement of the basic T-DTS learning framework that
is based on the analysis of the maximal possible database decomposition. This procedure
automatically determines an appropriate complexity threshold required by our approach.
Moreover, it provides a range of acceptable alternative solutions for industrial T-DTS
applications.
Chapter IV is dedicated to the validation of the proposed approach. In the first part of
this chapter, we provide the results of RCE-kNN-like based complexity estimator (an
implementation of ANN structure based complexity estimator) verification, both as a
stand-alone technique and as a part of T-DTS. The second part of Chapter IV overviews
the evaluation of my proposed T-DTS threshold self-tuning procedure and outlines the
basics of an effective hardware implementation of the approach.
Finally, the last part of the thesis contains consolidated conclusions of my work and
defines further perspectives for T-DTS development.
Chapter I:
State-of-the-art of classification approaches
We start this chapter by providing a brief cross-field overview of the major statistical
classification methods. We briefly review the advantages of these methods and their
shortcomings in terms of accuracy and computational cost; our aim is to highlight the
importance of database decomposition based techniques within the frame of Statistical
classification. The review surveys not only traditional methods such as the k-Nearest
Neighbour-based method (Cover and Hart 1967), but also the modern ensemble classifiers
such as Bagging (Breiman 1996) and Boosting (Freund and Schapire 1996), and advanced
classifiers such as Support Vector Machines (SVM) (Vapnik 1998), (Serfling 2002). This
review is crucial in selecting and designing an appropriate classifier for our approach. We
conclude the chapter by introducing computational tools and techniques in order to
describe the principal design of our framework named T-DTS (Tree-like Divide to Simplify)
and also to define the place of T-DTS among the existing classification approaches. The
subsections of the chapter are organized as follows.
In Section I.1, we provide a short introduction to the theory of classification. Section
I.2 introduces clustering techniques and methods. Section I.3 presents an overview of the
main classification techniques. Section I.4 describes general concepts of our T-DTS
approach and introduces our modular neural tree structure construction approach as
applied to classification tasks initially proposed by (Madani and Chebira 2000), and
subsequently developed by Rybnik (Rybnik 2004) and Bouyoucef (Bouyoucef 2007).
Section I.5 concludes.
I.1 Concepts of classification
Categorization (classification) is a process where patterns or objects are recognized,
differentiated and associated with different classes. Categorization is fundamental for a
very large spectrum of problem-solving tasks. Among the many available categorization
theories and techniques, we can distinguish three general approaches to categorization:
Classical categorization, Conceptual clustering and Prototype theory (Mitchell,
Anderson, Carbonel and Michalski 1986).
Categorization is based on the criterion that all entities in a single category share one or
more common properties (Jurgen 2004). Categorization (classification) originates from
Plato, who in his “Statesman” 1 (Plato 1997) introduces the approach for grouping objects
based on their similar properties. In this way, any entity of a given classification universe
belongs unequivocally to one and only one class in the classification. Conceptual
clustering (Jurgen 2004) is a modern version of Classical categorization, where the
selection criteria are based not on common properties, but rather on shared concepts
among different entities. Prototype theory uses an entity of a category as a prototypical
object, and the decision whether another object also fits into this category depends on the
degree of overlap between the objects (Jurgen 2004). At a glance, the prototype theory
might seem related to machine learning; however, it is in fact a subjective cognitive
approach that might manipulate natural categories (Goldstone and Kersten 2003).

1 Dialogue between the Stranger and Younger Socrates (“Statesman” [261a-261e]), (Plato 1997): Plato, J.M. Cooper (ed.), D.C. Hutchinson (ed.), “Plato Complete Works”, Hackett Publishing Company Inc., 1997, ISBN: 0872203492.
In our thesis, we deal with a subset of Conceptual clustering theory called Statistical
classification, typically used in pattern recognition systems. Generally, this type of
classification may be described in terms of data instances: $s \in S$, where $s$ is a single
vector of a feature space $S$. Each problem instance $s$ belongs to one element $c$ of a concept
class $C$: $c \in C$. The mapping from $S$ to $C$ is represented by the target concept $c$ belonging to
the set of concepts $C$ (that is, the concept space). Thus, the goal of an automated classification
system is to predict to which class $c$ an arbitrary instance $s$ belongs (Butz 2001).
The general classification process involves two main steps: learning, where training
data are analyzed and a model is built, and classification (testing or generalization), where
the built model is used to predict to which class any given instance belongs (Han and
Kamber 2006).
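To make these two steps concrete, the following minimal Python sketch (an illustration only, not part of T-DTS; the class name and toy data are hypothetical) learns one centroid per class from labelled feature vectors and then generalizes to unseen instances.

```python
import numpy as np

class NearestCentroidClassifier:
    """Minimal two-step classifier: learning, then classification (generalization)."""

    def fit(self, S, c):
        # Learning step: analyze the training data and build the model,
        # here simply one centroid per class.
        self.classes_ = np.unique(c)
        self.centroids_ = np.array([S[c == k].mean(axis=0) for k in self.classes_])
        return self

    def predict(self, S_new):
        # Classification (generalization) step: predict to which class an
        # arbitrary instance belongs, using the closest centroid.
        d = np.linalg.norm(S_new[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[np.argmin(d, axis=1)]

# Hypothetical toy data: six instances s of a 2-D feature space S, two classes c.
S = np.array([[0.1, 0.2], [0.0, 0.3], [0.2, 0.1],
              [1.0, 1.1], [0.9, 1.2], [1.1, 0.9]])
c = np.array([0, 0, 0, 1, 1, 1])

model = NearestCentroidClassifier().fit(S, c)
print(model.predict(np.array([[0.15, 0.15], [1.05, 1.0]])))  # expected: [0 1]
```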
From the theoretical point of view, when information about training/generalization
data is given, the prime interest is the maximization of the accuracy rate. Other important
factors to be considered are the following: processing speed, recognition rate and other
computational resources cost factors (Gao, Foster, Mobus and Moschytz 2001).
According to the “No free lunch” theorem (Stork, Duda and Hart 2001), there is no
single learning-based approach that works best on all classification problems;
consequently, there is a wide variety of learning systems/approaches/concepts, each optimized
to provide, through a selection/extraction process, the most successful and predictive model
for the given input data and classification task (Han and Kamber 2006).
This is therefore a strong reason to continue the development of various categorization
algorithms. The existing categorization techniques fall under two general groups (Han and
Karypis 2000). The first one, described in Section I.3, contains traditional machine
learning algorithms that have been developed over the years. The second group,
described in the following Section I.2, contains specialized categorization
algorithms developed in the Information Retrieval community.
I.2 Clustering methods
Data clustering (not to be confused with the Conceptual clustering paradigm (Brito,
Bertrand, Cucumel and De Carvalho 2007)) or clustering analysis (first proposed by
Tryon in 1939) encompasses a number of different algorithms and methods for grouping
objects of a similar kind into respective categories. Clustering is often confused with
classification, but there is an important difference between the two: in classification, the
objects are assigned to pre-defined classes, whereas in clustering the classes have to be
discovered from the data (Michalski and Stepp 1987). Thus, one can assign clustering to
the category of unsupervised-learning-based classification methods. Furthermore, the cluster
analysis (clustering) group of methods is part of an important data mining research area
(Agrarwal and Yu 1999). Clustering is important for data mining because, as a technique
in which logically similar information is physically stored together, clustering analysis
increases the efficiency of database management systems: objects with similar properties
are placed in a single class of objects, and a single disk access makes the entire class
available. The existing approaches to data clustering include the statistical approach (e.g.,
the k-MEANS algorithm), the optimization approach (e.g., the branch and bound method),
simulated annealing techniques and the neural network approach (Zeng and Starzyk 2001).
Since the data mining process involves extracting or uncovering patterns from data
(Kantardzic 2002) and grouping records in accordance with these patterns, a subset of
data-mining (Fayyad, Piatetsky-Shapiro and Smyth 1996) methods is directly related to data
clustering, and data-mining methods can therefore be used as classification methods.
Let us note that most clustering is largely based on heuristic but intuitively reasonable
procedures, and most clustering methods for solving important practical questions are also
of this type. However, there is little semantic guidance associated with these methods
(Fraley and Raftery 2002).
In conclusion, we highlight that clustering analysis is a method that can be
applied to pattern classification (Fayyad, Piatetsky-Shapiro and Smyth 1996) and has
numerous practical applications in diverse engineering fields (Mitchell 1997),
(Leondes 1998) such as oil exploration, biomedical imaging, speaker identification,
automated data entry, parameter identification of dynamical systems (Rao and Yadaiah
2005), fingerprint recognition, evaluation of the fetal state as carried out by obstetricians,
multi-path propagation channel conditions, etc. (Abdelwahab 2004); for this reason, we
perform a more detailed overview of the main clustering methods in the following section.
I.2.1 Type of clustering
There are many approaches to the categorization of clustering methods. Depending on
the selected criterion, one might find it important to group the methods into two big categories
such as hierarchical and non-hierarchical, or parametric and non-parametric. However,
the most commonly used structure of the clustering methods includes the following four
categories (Jesse, Liu, Smart and Brown 2008):
1. Hierarchical methods (Section I.2.1.1)
2. Partitioning methods (Section I.2.1.2) determine the clusters at once, but can also
be viewed as divisive hierarchical methods
3. Density-based clustering methods (Section I.2.1.3) are employed to discover
arbitrary-shaped clusters, where each cluster is regarded as a region in which the density
of data objects exceeds a threshold
4. Two-way clustering, co-clustering or bi-clustering are the algorithms where not
only the objects are clustered, but also the features of the objects, i.e., if the data is
represented in a data matrix, the rows and columns are clustered simultaneously.
Section I.2.1.1 - Section I.2.1.3 provide a detailed overview of clustering methods in
groups 1-3, because these methods are closely related to the combination of clustering
approaches embedded into T-DTS.
I.2.1.1 Hierarchical methods
Hierarchical methods belong to the iterative type of procedures (Xu and Wunsch
2008) in which m data instances are partitioned into groups that may vary from a single
cluster containing all m data instances to m clusters each containing a single instance; a
proper definition may be found in (Arabie 1994). Hence, the key point of the
hierarchical clustering methods is the decomposition algorithm over the given data set.
Based on the features of the algorithm, the most commonly used taxonomy of hierarchical
methods (Jain, Murty and Flynn 1999) includes the following five sub-categories.
I. The first sub-category contains Agglomerative vs. Divisive approaches, which relate
to the algorithmic structure(s) and operation(s). An agglomerative approach begins
with each pattern in a distinct cluster and merges clusters together until a
stopping criterion is satisfied. A divisive approach proceeds top-down, placing all the
data in one cluster and then successively splitting clusters until a stopping criterion
is satisfied. The T-DTS concept belongs to this sub-category, and more details on this
technique are provided in Chapter III; a schematic Python sketch of hierarchical clustering is given after this list.
II. The second sub-category consists of Monothetic vs. Polythetic approaches: a
monothetic approach considers the features sequentially (one at a time), whereas a
polythetic approach uses all features simultaneously.
III. The third sub-category includes Deterministic (Hard) vs. Fuzzy approaches. The
first allocates each instance-pattern to a single cluster, while the second may
assign it to different clusters at the same time.
IV. The fourth sub-category deals with Deterministic vs. Stochastic approaches. They
are designed to optimize a square-error function using either a traditional deterministic
technique or a random search.
V. The fifth sub-category contains Incremental vs. Non-incremental methods; this
distinction refers to the scalability issues that arise when the pattern database for
clustering is very large and constraints on execution time or memory affect the
architecture of the algorithm.
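For illustration only, the short sketch below (assuming NumPy and SciPy are available; it is not the decomposition engine used in T-DTS) builds an agglomerative hierarchy over a few hypothetical feature vectors and cuts the resulting tree into flat clusters; a divisive method such as the one underlying T-DTS would grow a comparable tree top-down instead.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical feature vectors: eight instances with two attributes each.
S = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
              [5.0, 5.1], [5.2, 4.9], [5.1, 5.0],
              [9.8, 0.1], [10.0, 0.0]])

# Agglomerative step: start with one cluster per instance and repeatedly
# merge the closest pair (Ward linkage) until a single cluster remains.
Z = linkage(S, method="ward")

# Stopping criterion: here simply the desired number of flat clusters
# (T-DTS instead stops splitting when a complexity threshold is met).
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2 3 3]
```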
The following sub-section is dedicated to the second category of clustering methods -
Partitioning methods.
I.2.1.2 Partitioning methods
For given m data items (instances), a partitioning method arranges the data into k
groups/clusters $G_i$, $i = 1, \dots, k$, with $k \le m$, where $G_i$ is the i-th cluster (sub-database,
sub-group of vectors). As k is an input parameter for this group of algorithms, some domain
knowledge is required, which in practice is unfortunately not available for real-world
applications (Ester, Kriegel, Sander and Xu 1996).
The partitioning algorithm typically starts with an initial partition and then uses an
iterative control strategy to optimize an objective function. Each cluster is represented by
the gravity centre of the cluster (k-centroid-based method) or by one of the objects of the
cluster located near its centre (k-medoid-based method). Usually, a centroid-based method
is used. The centroid is computed as the average of the attributes of the vectors s (of the
feature space S) belonging to the cluster (Berry 2003).
Consequently, partitioning algorithms use a two-step procedure (a minimal k-MEANS sketch
of it is given after the list below). First, determine k representatives minimizing the objective
function. Second, assign each object to the cluster whose representative is closest (Section I.2.2)
to the considered object. This second step implies that a partition is equivalent to a Voronoy
diagram and that each cluster corresponds to one Voronoy cell (Ester, Kriegel, Sander and Xu
1996). Thus, the shape of all clusters found by a partitioning algorithm is convex, which is
very restrictive. There are plenty of partitioning methods, but principally, according to
(Jain, Murty and Flynn 1999), all of them can be arranged into three general sub-categories.
I. The first sub-category is square-error clustering. The most popular partitioning
algorithm of this class is the k-MEANS algorithm.
II. The second sub-category is Mixture-resolving and Mode-seeking clustering. This
group of algorithms has been developed in a number of ways. The principal concept uses
the underlying assumption that the patterns to be clustered are drawn from
one of several distributions, and the goal is to identify the parameters of each
distribution and (if possible) their number. Examples are the Gaussian mixture model,
the Expectation-Maximization (EM) algorithm and unsupervised Bayes models.
III. The third sub-category is Graph-theoretic clustering. These algorithms use
a graph-theoretic approach in which the input data are represented as
a similarity graph and the algorithm recursively partitions the current set of elements
(represented by a sub-graph) into two subsets by a minimum-weight cut computed
from that sub-graph, until a stopping rule is met (Shamir and Sharan 2002). Examples are
the Minimal Spanning Tree (MST) approach, Highly Connected Sub-graphs (HCS) (Hartuv and
Shamir 2000) and CLICK (Cluster Identification via Connectivity Kernels) (Sharan
and Shamir 2000).
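The following minimal Python sketch (a simplified illustration with hypothetical data, not the implementation used in this work) makes the k-MEANS two-step procedure explicit: the assignment step attaches every vector to its closest representative, and the update step recomputes each centroid as the gravity centre of its cluster.

```python
import numpy as np

def k_means(S, k, n_iter=100, seed=0):
    """Minimal centroid-based partitioning of the feature vectors S into k clusters."""
    rng = np.random.default_rng(seed)
    centroids = S[rng.choice(len(S), size=k, replace=False)]  # initial partition
    for _ in range(n_iter):
        # Assignment step: attach every vector to its closest representative.
        d = np.linalg.norm(S[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # Update step: recompute each representative as the gravity centre
        # (mean) of the vectors currently assigned to it.
        new_centroids = np.array([S[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # the partition no longer changes
        centroids = new_centroids
    return labels, centroids

# Hypothetical usage: partition 200 random 2-D vectors into k = 4 clusters.
S = np.random.default_rng(1).random((200, 2))
labels, centroids = k_means(S, k=4)
```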
The following sub-section represents the third category of clustering methods – the
Density-based methods.
I.2.1.3 Density-based methods
Density-based methods are used as stand-alone tools to gain insight into the
distribution of a data set, e.g. to focus further analysis and data processing, or as a pre-processing
step. Density-based approaches apply a local cluster criterion. Clusters are
regarded as regions in the data space in which the objects are dense, and which are
separated by regions of low object density (noise). These regions may have an arbitrary
shape and the points inside a region may be arbitrarily distributed (Jesse, Liu, Smart and
Brown 2008).
The most popular approaches of this category are DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the
Clustering Structure) and LOF (Local Outlier Factor). They are used in Knowledge
Discovery in Databases (KDD) applications for finding outliers, i.e. the rare events,
which are often more interesting and useful than the common cases, e.g. for detecting
criminal activities in e-commerce (Barbara and Kamath 2003).
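As an illustration of this density-based idea, the sketch below (assuming the scikit-learn library, which is not part of T-DTS; the data values are hypothetical) applies DBSCAN to two dense groups plus two isolated points; the isolated points fall below the density threshold and are labelled as noise, i.e. potential outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense groups of 50 points each and two isolated points.
rng = np.random.default_rng(0)
S = np.vstack([rng.normal(0.0, 0.05, size=(50, 2)),
               rng.normal(1.0, 0.05, size=(50, 2)),
               np.array([[0.5, 3.0], [3.0, 0.5]])])

# eps is the neighbourhood radius and min_samples the local density threshold:
# regions denser than the threshold become clusters, the remaining points
# are labelled -1, i.e. noise / potential outliers.
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(S)
print(sorted(set(labels)))  # expected: [-1, 0, 1]
```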
Finalizing this short overview of the structure of classification methods, we would like
to recall that in clustering analysis the existence of predefined pattern classes is not
assumed, the number of classes is unknown, or the class memberships of the vectors are
generally unknown (Leondes 1998). The method arranges the instances
into groups (clusters) so that there is a high similarity among objects within each cluster, but a
very low similarity among objects of different clusters. The goals of any clustering include:
1. Organizing information about the data so that relatively homogeneous groups
(clusters) are formed, and describing their unknown properties.
2. Finding representatives.
Based on these goals, homogeneous group analysis has two components. The first
component is the (dis)similarity measure between any two data samples or feature
vectors. The second component is the clustering algorithm that groups samples into clusters.
A similarity measure is essential to most clustering algorithms, which is why the next
section describes the main (dis)similarity measures.
The distance measures presented there serve only as a short overview. These distances are
embedded into T-DTS as part of its decomposition and complexity estimation techniques.
I.2.2 Distance measure
An important step in any clustering is to select a distance measure, which will
determine how the similarity of two elements is calculated. This will influence the shape
of the clusters, as some elements may be close to one another according to one distance
and farther away according to another.
The first classical and most commonly used similarity measure is the Euclidean distance.
Let d(·,·) denote a distance function between two vectors. Let the two instances s1 and
s2 of the feature space S be represented by two vectors s_1 \in S and s_2 \in S accordingly, where
the index i = 1,\dots,\dim denotes an attribute of the feature vector. Then the scalar d_E(·,·) is
the Euclidean distance:
• Euclidean distance:

d_E(s_1,s_2) = \sqrt{\sum_{i=1}^{\dim} \big(s_1(i) - s_2(i)\big)^2}    (I.1)
• Weighted Euclidean distance:

d_{WE}(s_1,s_2) = \sqrt{\sum_{i=1}^{\dim} w(i)\,\big(s_1(i) - s_2(i)\big)^2}    (I.2)
where the vector w defines the importance (weights) of the features. The choice of
weights must be made carefully, because this factor is more critical than switching
between types of distance measures. The basic Euclidean distance is the special case
(L = 2) of a more general type of distance, the Minkowski distance.
• Minkowski distance:

d_{MN}(s_1,s_2) = \left(\sum_{i=1}^{\dim} \big|s_1(i) - s_2(i)\big|^{L}\right)^{1/L}    (I.3)
For L = 1 we obtain the Manhattan distance.
• Manhattan (Hamming) distance, city-block distance or L1-distance (in most cases
denoted in the literature as L_1):

d_{MH}(s_1,s_2) = \sum_{i=1}^{\dim} \big|s_1(i) - s_2(i)\big|    (I.4)
The Chebyshev distance is the limiting case of the Minkowski distance when L
approaches infinity. In the literature it is denoted as the L_\infty-distance; further in this work it is
referred to as the LSUP-distance.
• Chebyshev distance or LSUP-distance:

d_{CH}(s_1,s_2) = \max_{i=1,\dots,\dim} \big|s_1(i) - s_2(i)\big|    (I.5)
The Mahalanobis distance is popular in statistics for measuring the
similarity of two data distributions. Here T denotes the matrix transpose and \Sigma is the covariance
matrix of the vectors s1 and s2.
• Mahalanobis distance:

d_{ML}(s_1,s_2) = \sqrt{(s_1 - s_2)^{T}\,\Sigma^{-1}\,(s_1 - s_2)}    (I.6)

The purpose of using \Sigma^{-1} is to standardize the data relative to the covariance matrix.
The following distances provide some important clues about (dis)similarity criteria for
cluster analysis. For example, the Canberra distance is often used for homogeneous
cluster analysis owing to its sensitivity to small changes when both coordinates are close to
zero.
• Canberra distance:

d_{CN}(s_1,s_2) = \sum_{i=1}^{\dim} \frac{\big|s_1(i) - s_2(i)\big|}{\big|s_1(i) + s_2(i)\big|}    (I.7)

When s_1(i) + s_2(i) = 0, one needs to define 0/0 = 0.
• Cosine distance:

d_{CS}(s_1,s_2) = \arccos\!\left(\frac{s_1 \cdot s_2}{\|s_1\|\,\|s_2\|}\right)    (I.8)

where d_{CS} lies in the range [0; \pi]. The cosine distance is frequently used in text
mining and document comparison.
Next, the Tanimoto coefficient is an extension of equation I.8, such that it yields the
Jaccard coefficient (Tan, Steinbach and Kumar 2005).
• Tanimoto distance:

d_{TN}(s_1,s_2) = \frac{s_1 \cdot s_2}{\|s_1\|^2 + \|s_2\|^2 - s_1 \cdot s_2}    (I.9)
The above list of metrics can be completed with the Levenshtein distance, the Sorensen
similarity measure (Deza M. and Deza E. 2006) and similar metrics, which together form the
complex taxonomy of distance metrics. One example of recent developments
is the popular (because of its superior characteristics) group of distances called
signal-to-noise distances (Gavin, Oswald, Wahl and Williams 2002); note that they do
not belong to the subclass of equal-weighted and unweighted metrics listed in I.1 – I.9.
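To make the measures of equations I.1 – I.9 concrete, the following minimal NumPy sketch (our own illustrative code, not part of T-DTS) computes several of them for two feature vectors given as arrays:

import numpy as np

def euclidean(s1, s2):                 # eq. (I.1)
    return np.sqrt(np.sum((s1 - s2) ** 2))

def weighted_euclidean(s1, s2, w):     # eq. (I.2)
    return np.sqrt(np.sum(w * (s1 - s2) ** 2))

def minkowski(s1, s2, L):              # eq. (I.3)
    return np.sum(np.abs(s1 - s2) ** L) ** (1.0 / L)

def manhattan(s1, s2):                 # eq. (I.4)
    return np.sum(np.abs(s1 - s2))

def chebyshev(s1, s2):                 # eq. (I.5), the LSUP distance
    return np.max(np.abs(s1 - s2))

def mahalanobis(s1, s2, cov):          # eq. (I.6)
    d = s1 - s2
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

def canberra(s1, s2):                  # eq. (I.7), with 0/0 defined as 0
    num, den = np.abs(s1 - s2), np.abs(s1 + s2)
    return np.sum(np.divide(num, den, out=np.zeros_like(num, dtype=float), where=den != 0))

def cosine(s1, s2):                    # eq. (I.8)
    c = np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
    return np.arccos(np.clip(c, -1.0, 1.0))

def tanimoto(s1, s2):                  # eq. (I.9)
    dot = np.dot(s1, s2)
    return dot / (np.dot(s1, s1) + np.dot(s2, s2) - dot)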
Therefore, the above-mentioned (dis)similarity measures make it possible to group the feature
vectors s into clusters in which the resemblance of instances is stronger than between
clusters (Leondes 1998). Another important distinction is whether the clustering uses
symmetric or asymmetric distances: most of the distance functions listed above are symmetric,
but in some applications this is not the case. Distance
measurement is the fundamental vehicle for data clustering and is widely used in
numerous classification applications. However, defining clustering in terms of
simultaneous closeness on all attributes may sometimes be desirable, but often is not:
usually, clustering, if it exists, occurs only within a relatively small unknown subset of the
attributes (Friedman and Meulman 2004).
The following section summarizes the clustering methods, taking into account
the important concepts mentioned above: clustering algorithms and distance measures.
I.2.3 Summary
The presented overview of cluster analysis demonstrates the wide range of
available clustering methods, where each method may produce a different grouping of a
given dataset. The choice of a particular method strongly depends on the type of output
desired; a pragmatic approach to selecting an appropriate clustering algorithm has to
take into account whether the desired algorithm should be parametric or
nonparametric (Fukunaga 1972).
In the first case, a main criterion has to be provided; the data are then arranged into a pre-
assigned number of groups with the goal of optimizing this criterion. Parametric clustering
methods are based on a pre-analysis of the global data structure and generally have to be
used in combination with other optimization methods. The performance of this type
of method depends on assumptions about its parameters (e.g., the number of clusters),
which are hard to establish beforehand in real-world applications.
The most intuitive and frequently used criterion in parametric partition clustering
techniques is the squared error criterion, which tends to work well on isolated and
compact clusters (Jain, Murty and Flynn 1999). The most commonly used parametric
method is the k-MEANS algorithm, which employs the squared error criterion (McQueen
1967). Several other variants of the k-MEANS algorithm have been proposed to handle the
sensitivity to the initial partitioning (Anderberg 1973), which attempt to select a good
initial partitioning so that the algorithm is more likely to find the global optimum (Jain,
Murty and Flynn 1999). Other variants use splitting and merging of clusters, which make
it possible to obtain an optimal partitioning while starting from any arbitrary initial
partitioning. If the variance of a cluster is above a threshold, it is split, while two clusters
are merged if the distance between them is below another threshold (Jain, Murty and
Flynn 1999). However, the necessity to specify the threshold parameters in advance and
to supervise the output is an inherent disadvantage of these approaches.
The alternative to the parametric methods is non-parametric clustering, where no
assumptions can be made about the main characterizing parameter(s). In the commonly
used nonparametric approaches, e.g., the valley-seeking method (Koontz and Fukunaga 1972),
data are grouped according to a density function (density-based methods). These methods
do not require knowledge of the number of clusters beforehand. In general, however, the
performance of these methods is very sensitive to the control parameters and, naturally, to the
data distribution. The classic example of a non-parametric approach is the algorithm called
CAST (Cluster Affinity Search Technique), an iterative approach (Portnoy,
Bellaachia, Chen and Elkhahloun 2002) that deals effectively with outliers (Shamir and
Sharan 2002), (Zhang 2006), developed by Ben-Dor (Ben-Dor, Shamir and Yakhini 1999).
Regardless of the clustering strategy, the main purpose of clustering analysis is to
discover structures in data without providing an explanation or interpretation. Therefore,
clustering is subjective in nature: the same data set may need to be partitioned
differently for different purposes. Typically, this subjectivity is incorporated into the main
clustering control parameters and employs domain knowledge in one or more steps of the
clustering analysis. It should be mentioned that every clustering approach uses some type
of knowledge, either implicitly or explicitly. The incorporation of explicitly available
domain knowledge into clustering is used mainly in ad hoc approaches (Jain, Murty and
Flynn 1999); in general, however, clustering techniques automatically extract knowledge
during their pre-processing step.
Another problem with clustering is that the result strongly
depends on the choice of the feature space, on the object proximity measures, and on the
methods used to formalize the concepts of object and cluster equivalence (Biryukov,
Ryazanov and Shmakov 2007). By combining clustering with classification methods
(described in the following section) that take into account domain knowledge (in
the form of a set of concepts), we can greatly improve the performance of both clustering and
classification.
I.3 Main classification methods
The first part of this section surveys the most prominent supervised-learning classification
methods. These techniques differ from the unsupervised learning
(more precisely, clustering) approaches overviewed above in that the learner is provided
with class labels. Within machine learning, a particular group of such algorithms
is known as instance-based or lazy learners; the second part of this section
is dedicated to this type of learner.
In contrast to lazy learners, the aim of eager learners is to predict the value of a function
for any valid input object after having seen a number of training examples. To achieve
this, the learner has to generalize from the presented (training/learning) data to unseen
situations, i.e. to build a learning model. These algorithms are almost always biased toward
some representation (e.g. Neural Networks, Decision Trees, the Support Vector Machine
(SVM), etc. fall into this category). In contrast, lazy learners do not build a model
and generally just remember the data samples. They are faster at the training step but slower
at the classification step (Ding 2007). A good example of a lazy learner is the k-Nearest
Neighbour (kNN) classifier (Han and Kamber 2006).
A lazy learner has the option of representing the target function by a combination of
many local approximations, whereas an eager learner must commit at training time to a
single global approximation. The distinction between eager and lazy learners is thus related
to the distinction between global and local approximations of the target function.
Independently of the specificities of the learners, a combination of classification
approaches may be considered as a general solution.
Recently, in the area of machine learning, the concept of combining classifiers has been
proposed as a new direction for improving the performance of individual
classifiers. Numerous combination schemes – hybrid methods, multiple experts,
mixtures of experts, cooperative agents, opinion pools, decision forests, classifier
ensembles and classifier fusion – have been shown to improve classification.
Classifiers with different features and methodologies can complement each other (Parvin,
Alizadeh and Minaei-Bidgoli 2009). More precisely, the work of Dietterich (Dietterich
2001) gives accessible and informal reasoning, from statistical, computational and
representational viewpoints, of why ensemble approaches can improve results.
The general goal of classification ensembles is to generate more certain, precise and
accurate system results. Our aim is the development of a classifier-ensemble approach based on
universal principles. Thus, we have to consider two main aspects: the first is that a
classifier structures the data; the second is that the classifier's optimization procedure has to
fit the classification model(s) to the data sample(s). Given the complexity of constructing
such a classifier, there is a risk of failure if the amount of data is
insufficient (Micheli-Tzanakou 1999). Moreover, an ensemble of classifiers has its own
complex set of properties that has to be controlled by a user or a procedure. However, this
issue can also be considered an advantage, because these different properties become a
source of the principal classifiers' dissimilarities and of a potentially wider applicability
(in comparison to a single classifier). Therefore, based on Lotte's survey (Lotte et al.
2007), we provide the list of the most important properties that are commonly used to
describe different types of classifiers:
• Generative vs. discriminative. The former computes the likelihood of each class
and chooses the most likely one; the latter learns only how to discriminate between the
classes. This is loosely related to the concepts of lazy and eager learners, respectively.
• Static vs. dynamic. The former neglects temporal information during classification,
for example the Multi-Layer Perceptron (MLP); the latter, for example the Hidden Markov Model
(HMM), can classify sequences of instances.
• Stable vs. unstable. The linear discriminant function is an example of a very stable
classifier, in contrast to the MLP.
• Regularized vs. unregularized. The first group of classifiers uses careful control
methods in order to prevent overtraining; in contrast, overfitting may occur with
unregularized classifiers.
The list of criteria could be extended, but even considering only these four
criteria, we find that the properties of the most commonly used classifiers overlap
strongly (the full chart of classifiers and their properties is available in the work
(Lotte et al. 2007)).
Sections I.3.1-I.3.6 present descriptions of single classifiers in order to give an idea of
the ensemble-of-classifiers approach of Section I.3.7. Afterwards, Section I.4
describes T-DTS. T-DTS is based on the "divide" and "conquer" paradigm proposed in the works
(Chebira, Madani and Mercier 1997), (Madani, Chebira and Mercier 1997); thus, it
builds a classifier ensemble over an initial database split up in a tree-like manner.
I.3.1 Linear classifiers
Linear classifiers are discriminant algorithms that use a linear function to distinguish
classes. They are probably the most commonly used algorithms in applications. Two
main kinds of linear classifiers are used: the discriminant function and the Support Vector
Machine (SVM).
I.3.1.1 Discriminant functions
The discriminant (linear / quadratic) function is a simple and basic classification method
based on Linear Discriminant Analysis (also known as Fisher's LDA), used to separate the
data representing different classes. For a two-class classification problem, this function, or
class-separating hyperplane (a higher-dimensional analogue of a plane in three dimensions,
The Collins English Dictionary), is generally written as

g_0(s) = w^{T} s + w_0    (I.11)

where w is called the weight vector and w_0 the threshold. The sign of g_0(·)
defines membership of one of the two classes. Equation I.11 performs well when the data from
different classes are linearly separable or when the covariance matrices of all the classes are
identical (Duda and Hart 1973). When the data are linearly separable, the Perceptron, using
a learning rule (Rosenblatt 1961), can successfully classify the data after a finite number of
iterations (Novikoff 1963).
For a two-class problem, the weight vector of the Perceptron classifier can also be
determined by the Fisher criterion (Fisher 1936). For non-separable cases, these parameters
can be obtained by assuming that the data distribution is Gaussian-like or by minimizing the
mean square error (Duda and Hart 1973). To solve problems with more than two classes, several
hyperplanes g_i(·) are used. The strategy used for multiclass tasks is "One Versus the
Rest" (OVR), which consists in separating each class from all the others.
Although the linear discriminant function cannot achieve high accuracy in most real
cases, its appealing property is a low computational cost, and it can easily be implemented on
vector processors or on a single processor. Therefore, for a large classification
problem with thousands of classes, the discriminant function is a good choice for pre-
classification.
Further enhancements of this approach, such as the Quadratic Discriminant Function
(QDF) or the Modified Quadratic Discriminant Function (MQDF) (Kimura, Takashina,
Tsuruoka and Miyake 1987), perform very well in handwritten character recognition
(Kimura and Shridhar 1991), (Kimura, Wakabayashi, Tsuruoka and Miyake 1997).
Nevertheless, when a large number of training samples is available and the problems involve
multimodal (multiple local maxima) densities, non-parametric methods such as kNN and
SVM usually perform better than discriminant functions.
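As an illustration of equation I.11 and of the Perceptron learning rule mentioned above, the following minimal sketch (our own illustrative code; data, names and the learning rate are placeholders) classifies by the sign of g_0(s) = w^T s + w_0 and updates the weights on misclassified samples:

import numpy as np

def perceptron_train(X, y, lr=0.1, n_epochs=100):
    """X: (n, dim) samples; y: labels in {-1, +1}. Returns weight vector w and threshold w0."""
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(n_epochs):
        errors = 0
        for s, label in zip(X, y):
            g = w @ s + w0                    # g_0(s) = w^T s + w_0, eq. (I.11)
            if np.sign(g) != label:           # misclassified: move the hyperplane
                w += lr * label * s
                w0 += lr * label
                errors += 1
        if errors == 0:                       # converges after finitely many updates
            break                             # when the classes are linearly separable
    return w, w0

def perceptron_predict(X, w, w0):
    return np.sign(X @ w + w0)                # class given by the sign of g_0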
I.3.1.2 Support Vector Machine
The Support Vector Machine (SVM) also uses a hyperplane to identify classes, but to build
it the SVM algorithm uses the Structural Risk Minimization principle (Han and Karypis
2000). The SVM method was first proposed by Vapnik (Vapnik 1998), (Serfling 2002).
It often achieves superior classification performance compared to other learning
algorithms across most domains and tasks (Statnikov et al. 2005). It has recently
achieved a key position in pattern classification and has shown promising performance
in many applications such as handwritten digit recognition (Decoste and Scholkopf 2002),
classification of web pages (Joachims 1998) and face recognition (Osuna, Freund and
Girosi 1997).
The SVM is built in two steps. First, the data vectors are mapped to a high-
dimensional feature space. Second, the SVM tries to find a hyperplane in this space
separating the data with maximum margin. The margin denotes the distance from the boundary
to the closest data point in the feature space (Fig. I.1).
Fig. I.1 : SVM: space mapping using linear hyperplane
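For reference, the standard hard-margin formulation of this maximum-margin search, written here in the usual textbook notation (the labels c_i and bias b are our notation, not the thesis's), is:

\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad c_i\,(w^{T} s_i + b) \ge 1,\ i = 1,\dots,l,

where c_i \in \{-1,+1\} are the class labels. The soft-margin variant adds slack variables \xi_i \ge 0 penalized by a trade-off constant C, which corresponds to the misclassification penalty discussed below.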
The Linear Support Vector Machine (LSVM) is the simplest linear form of SVM. In the linear
case, the margin is defined by the distance of the hyperplane to the nearest positive and
negative examples. The goal of the SVM is to predict the class label for each input;
the classification is based on the sign of the decision function defined by the hyperplane.
Vapnik's SVM algorithm finds the optimal hyperplane, defined as the one with
the largest margin separating the classes of data (Serfling 2002). When the training sets are not
linearly separable and perfect separation is not possible, a trade-off is used that allows the
LSVM to penalize the misclassification of a data point. The following section
presents an SVM modification - the Nonlinear Support Vector Machine (NSVM) - which takes an
intermediate position between linear classifiers and the group of non-linear classifiers briefly described
in Sections I.3.3 – I.3.7.
I.3.2 Nonlinear Support Vector Machine
The original optimal hyperplane algorithm proposed by Vapnik was a linear classifier.
The NSVM solution for non-linear cases builds nonlinear decision boundaries
by using a non-linear kernel function.
This allows the algorithm to fit the maximum-margin hyperplane in a transformed
feature space. The mapping of the data S to another space by means of a kernel-induced
function Φ is schematically described in Fig. I.2. Generally, this new space has a higher
dimension than the original one (Gold, Holub and Sollich 2005).
Fig. I.2 : SVM space mapping using different space kernel functions Φ: Nonlinear
kernel tool (right) and Linear hyperplane (left)
The classical kernels for SVM are: polynomial, Gaussian, Radial Basis Function (RBF)
and sigmoid. To solve a multiclass classification problem,
SVM employs the One-versus-rest (OVR) method (Kressel 1999), the One-versus-one (OVO)
method (Manning, Raghavan and Schultze 2008), the Directed Acyclic Graph SVM
(DAGSVM) method and the Weston and Watkins (WW) method (Hastie, Tibshirani and
Friedman 2009).
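As a purely illustrative usage sketch (assuming the scikit-learn library; the data set and parameter values are placeholders, not results from this work), a nonlinear SVM with an RBF kernel can be wrapped in an explicit One-Versus-Rest multiclass scheme as follows:

import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

X = np.random.rand(300, 4)                       # placeholder feature vectors
y = np.random.randint(0, 3, size=300)            # placeholder labels for 3 classes

# RBF-kernel SVM wrapped in a One-Versus-Rest multiclass scheme
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.predict(X[:5]))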
SVM-based approaches have several advantages. They have good generalization
properties owing to the margin maximization and the regularization term. Many problems
in pattern recognition, such as the curse of dimensionality and over-fitting, do not occur with
SVM when a suitable kernel and the multiclass SVM method parameters are properly
selected. Another appealing property of SVM is that its classifier structure is data-driven:
this structure is automatically determined by solving a constrained convex quadratic
programming problem. This avoids having to customize the structure manually to achieve high
performance, in contrast to NNs (neural networks). However, there are two important
open problems for SVM: wide-range applicability and the computational cost of SVM training
algorithms (Lotte et al. 2007).
I.3.3 Neural networks
Neural Networks (NNs) or more precisely Artificial Neural Networks (ANNs) are a
powerful tool with nonlinear approximation capabilities used in many engineering areas
and computer technologies. They are computational systems, either hardware or software,
which mimic the computational abilities of biological systems (Maren 1990).
ANNs are easy to construct and can be developed within a reasonable timeframe
(Maren 1990). They have been employed and compared to conventional classifiers for a
number of classification problems (Abdelwahab 2004). The results have shown that the
accuracy of the NN-based approach is equivalent to, or slightly better than, that of other
methods, owing to its ability to produce non-linear boundaries. This makes NN-based
classifiers efficient (Zhou 1999), (Lotte et al. 2007).
ANNs are composed of highly interconnected neurons that accept input and generate
output. Each connection has a weight associated with it. The weights are adjusted during
the training of the network to achieve human-like pattern recognition. The choice of the
learning algorithm, weight initialization, the input signal representation and the structure
of the network is very important (Go, Han, Kim and Lee 2001). The number of hidden
neurons and layers must be sufficient to provide the discriminating capability required for
an application. However, if there are too many neurons, the neural network will not be
able to generalize between input patterns when there are minor variations from
the training data (Fogel 1991). Furthermore, there will be a significant increase in cost and
in the time required for training. As reported in (Goblick 1988), NN classifiers have the
following characteristics:
o NNs’ classifiers are distribution free. NNs allow the target classes to be defined
without consideration to their distribution in the corresponding domain of each data
source (Benediksson, Swain and Ersoy 1990). In other words, using neural networks is a
better choice when it is necessary to define heterogeneous classes that may cover
extensive and irregularly formed areas in the spectral domain and may not be well
described by statistical models.
o NNs’ classifiers are capable of forming non-linear decision boundaries and they do
not require decision functions to be given in advance (Go, Han, Kim and Lee 2001). This
makes them flexible in modelling real world complex relationships (Zhang 2000) and they
can approximate any function with arbitrary accuracy.
o NNs’ classifiers are data independent (Abdelwahab 2004). When neural networks
are used, data sources with different characteristics can be incorporated into the process of
classification without knowing or specifying the weights on each data source. Until now,
the importance-free property of neural networks has been mostly demonstrated
empirically. Efforts have also been made to establish the relationship between the data
independent characteristics of NNs and their internal structure, particularly their weights
after training (Zhou 1999). In addition, recent NN implementations show a trend towards
reduced storage and computational requirements.
NN learning methods are either unsupervised or supervised:
• Unsupervised methods determine classes automatically, but in fact show limited
ability to accurately divide space into clusters.
• Supervised methods have yielded higher accuracy than unsupervised ones, but
suffer from the need for human interaction to determine classes and training regions.
Backpropagation is one of the low-computational-cost algorithms used for training
supervised neural networks. It is based on a linear model (steepest descent) (Go, Han, Kim
and Lee 2001), and it has been shown that for Multi-Layer Perceptron (MLP) training,
backpropagation approximates the Bayes-optimal discriminant functions for both two-
class and multi-class recognition problems (Ruck et al. 1990). However, some
drawbacks are associated with backpropagation, such as convergence to a local
minimum and the absence of specific methods for determining the network structure.
However, the network pruning approach of deleting the irrelevant weights of a network
before invoking inference can be used to optimize the size of the network (Hertz, Palmer
and Krogh 1991). Some approaches (Le Cun, Denker and Solla 1990), (Hassibi and Stork
1993) use the information of the second-order derivatives of the error function for network
pruning. Although these methods can improve generalization performance, the
computational cost of pruning an initially large, fully connected network is high. For another
range of methods such as LeNet1 (Le Cun et al. 1989) and LeNet5 (Le Cun, Bottou,
Bengio and Haffner 1998), the network structure is customized manually for the specific
application. However, these network-constructing methods require good prior
knowledge.
Neural networks are robust to errors and thus are well-suited to problems in which the
training data are noisy. However, they have poor interpretability, since it is difficult for
humans to interpret the meaning behind the weights. They also require a number of
parameters, such as the number of layers, to be determined, which often comes from
experience, especially when dealing with large NNs. A glaring and
fundamental weakness in the current theories of ANNs and ANN-connectionism is the
total absence of the concept of an autonomous system. As a result, the NN algorithms
developed in the field require human adjustment (Roy 2000).
In conclusion, let us note that NNs are used to model continuous complex processes,
even human behaviour in simplified tasks involving collision avoidance and target
positioning (Maren 1990).
I.3.4 Non-linear Bayesian classifiers
This section introduces Bayesian classifiers and Hidden Markov Models (HMMs). All
these classifiers produce non-linear decision boundaries. Although this group of
classifiers is not as widespread as discriminant functions or NNs in real-world
applications, they are generative, which enables them to reject uncertain samples more
efficiently than discriminative classifiers.
I.3.4.1 Bayesian classifiers
Bayesian decision theory is a fundamental statistical tool in pattern classification
problems. The Bayesian method is one of the traditional classification techniques. It provides
the optimal performance from the standpoint of error probabilities in a statistical
framework (Go, Han, Kim and Lee 2001).
The success of the Bayesian methods depends on the assumptions used to obtain the
probabilistic model (Chen and Varshney 2002), (Zhang 2000). This makes them
unsuitable for some applications such as image classification based on a feature space
comprising texture measures (Simard, Saatchi and Grandi 2000). However, they have
been applied to ANNs in order to regularize training and thus improve the
performance of the classifier (Kupinski, Edwards, Giger and Metz 2001).
I.3.4.2 Hidden Markov Models
Hidden Markov Models (HMMs) are models with finite sets of states, each of
which is associated with a probability distribution. Transitions among the states are
governed by a set of probabilities called transition probabilities. In a particular state, an
outcome or observation can be generated according to the associated probability
distribution. Only the outcome, not the state, is visible to an external observer;
the states are therefore hidden from the outside. HMMs are known to classify data based on
their statistical properties. HMMs extract fuzzy features from the pattern in question and
compare it with the known (stored) one (Lu 1996). To use HMMs for the classification of
unknown input data, the HMMs are first trained to classify (recognize) known patterns.
HMMs are basically one-dimensional models. Hence, to use them for a complex classification task
such as face recognition, the pattern must be represented in a 1D format without losing
any vital information. HMMs are popular in the field of speech recognition, because they are
perfectly suitable for the classification of time series. However, there are several problems
with HMMs, the main ones being that HMMs make very general assumptions about the data
and that the number of parameters to be set is large.
The theory of HMMs is elegant, but its implementation is hard (Kadous 2002).
I.3.5 Prototype methods
Let a set of prototypes consist of l pairs (s_i, c_i), i = 1,\dots,l, where c_i is the class label of
sample s_i. In most cases, the s_i associated with a prototype is an example from the
training set. The classification of an unseen pattern s consists of assigning to it the class label of
the closest prototype according to a distance measure function d(·,·).
This group of classification methods is related to the group of clustering techniques
mentioned in Section I.2. The most frequently used prototype methods are k-Nearest
Neighbour (kNN) and Vector Quantization. This type of classifier, briefly overviewed
in the following sub-sections, is relatively simple and performs non-linear space
separation.
I.3.5.1 Vector quantization
Vector quantization is a powerful technique used not only for classification but also
for data compression purposes. It is based on the competitive learning paradigm, so it is
closely related to the self-organizing map model that is trained using unsupervised
learning. Vector quantization and supervised classification techniques are combined
because both techniques can be designed and implemented using methods from statistical
classification as well as classification trees (Cosman, Oehler, Riskin and Gray 1993). Let
us note that such an implementation with a tree structure greatly reduces the encoding
complexity (Gray, Oehler, Perlmutte and Ohlsen 1993), and it has been shown that if an
optimal vector quantizer is obtained, under certain design constraints and for a given
performance objective, no other coding system can achieve better performance. Vector
quantization has several advantages in coding and in reducing the computation in speech
recognition (Gold and Morgan 1999).
One of the most widely used algorithms is Learning Vector Quantization (LVQ). It is
applied to classifying various kinds of patterns and signals. The reason to apply LVQ is
that it can treat large amounts of input data with a small computational burden; in other words, it can
deal with a high-dimensional representation space using a simple learning structure.
LVQ can be used for training, in a supervised manner, the competitive layers of the unsupervised neural network
model developed by Kohonen (Kohonen 1989), called the Self-Organizing Map (SOM).
LVQ is composed of two layers: a competitive layer that learns the
feature space topology and a linear layer that maps the competitive units onto target classes.
According to Kohonen (Kohonen 1989), prototypes are placed with respect to the
decision boundary to reduce the classification error by attracting the prototypes of the
correct class and repelling the prototypes of incorrect classes (see the sketch below). The decision boundary of LVQ is
a piece-wise hyperplane. LVQ is defined in the form of an algorithm rather than the optimization
of a cost function, which makes the analysis of its properties difficult. Generalized
Learning Vector Quantization (GLVQ) (Sato and Yamada 1996) adjusts the prototypes
based on the Minimization of Classification Errors (MCE) criterion (Juang and Katagiri 1992), which
allows the GLVQ user to improve classification performance. It also has the advantage of
increasing the classification accuracy of the SOM network (Go, Han, Kim and Lee 2001).
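The attraction/repulsion rule can be sketched as follows (a minimal LVQ1-style update written by us for illustration only; alpha is a hypothetical learning rate and all names are ours):

import numpy as np

def lvq1_epoch(prototypes, proto_labels, X, y, alpha=0.05):
    """One pass of the basic LVQ1 rule: the closest prototype is attracted towards
    a sample of the same class and repelled from a sample of a different class."""
    for s, label in zip(X, y):
        j = np.argmin(np.linalg.norm(prototypes - s, axis=1))  # nearest prototype
        if proto_labels[j] == label:
            prototypes[j] += alpha * (s - prototypes[j])        # attract
        else:
            prototypes[j] -= alpha * (s - prototypes[j])        # repel
    return prototypes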
The vector quantization technique is used to simplify image processing tasks such as half-
toning, edge detection (Cosman, Oehler, Riskin and Gray 1993), image recognition
(Gersho and Gray 1991), image thinning, shrinking and skeletonization (Shen and Castan
1999), as well as speech processing (Gersho and Gray 1991).
I.3.5.2 k-Nearest Neighbour classifier
The k-Nearest Neighbour (kNN) classifiers are instance-based learners. Learning
consists of storing the available training samples. When a new instance is presented as a
query, a set of similar instances is retrieved and used for classification. As a lazy learner,
the kNN classifier stores the training samples and does not build the classifier explicitly. When
an unknown instance/prototype is given, the algorithm searches the whole set of training
instances for the k instances which are closest to the unknown instance.
The unknown instance is then assigned the most common class among those k
instances; these k instances are the k nearest neighbours of the unknown instance.
Proximity is generally defined in terms of the Euclidean distance or any other distance
measure (Mitchell 1997), such as those of equations I.1 – I.9.
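A minimal sketch of this procedure (our own illustrative code; the Euclidean distance of equation I.1 is used as the proximity measure) is:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by a majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances (eq. I.1)
    nearest = np.argsort(dists)[:k]                      # indices of the k closest instances
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                    # most common class among the k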
A very good property of this classifier is that kNN does not require any analysis of the
form of the data density function. Its asymptotic probability of error is never greater than twice
the Bayesian error (Cover and Hart 1967).
The kNN classifiers are faster at training but slower at classification than eager
methods, since nearly all computation takes place at classification time rather than when
the classifier model is built at training time. A disadvantage of kNN is that, in order
to achieve high accuracy, a huge number of training samples is required; as a consequence,
the computational cost of kNN is prohibitively high. To address this problem, a variety of
techniques have been developed for tuning kNN performance (Han and Karypis 2000).
In conclusion, we highlight that the performance of prototype methods largely
depends on the initial prototypes, which are usually set by some clustering algorithm such as
k-MEANS. Comparing kNN with LVQ, the latter usually achieves better
performance at a lower cost.
The following Section I.3.6 presents a classifier which is directly related to the T-DTS
concept, owing to its ability to construct a tree-like classification structure.
I.3.6 Decision trees
Decision trees are considered one of the most widely used classification approaches
due to their accuracy and modest computational requirements (Gelfand, Ravishankar and
Delp 1991), (Srivastava, Han, Kumar and Singh 1999), (Zhang, Chen and Kot 2000).
They are capable of performing non-linear classification (Atlas et al. 1989) and they do
not rely on assumptions about the statistical distribution. This leads to successful applications in many fields
(Simard, Saatchi and Grandi 2000).
The tree is composed of a root node, intermediate nodes and terminal nodes. The data
set is classified at each node according to the decision framework defined by the tree (Ho,
Hull and Stihari 1994). The decision tree model is built by recursively splitting the
learning set based on a locally optimal criterion (Han and Karypis 2000). It starts with a
coarse classification, followed by a finer classification, until finally each group
contains only one instance or one class. Decision-tree-based classification has the
advantage of employing more than one feature: each group of employed features
provides partial information about the instance(s), and the combination of such groups or
clusters can be used to obtain an accurate recognition decision (Senior 2001). More than
one decision tree can be built for a given database (Tu and Chung 1992).
A large number of methods has been proposed in the literature for the design of
classification trees. Classification and Regression Trees (CART) is one of the most commonly used
approaches (Gelfand, Ravishankar and Delp 1991). It was developed
over the years from 1973 to 1984 (Atlas et al. 1989). It has the advantage of
constructing classification regions with sharp corners; however, it is computationally
expensive (Gelfand, Ravishankar and Delp 1991). In CART, splitting continues until
terminal nodes are reached. Then a pruning criterion is used to sequentially remove splits.
Pruning can be implemented by using different data than those used for tree building
(Atlas et al. 1989). The main advantage of pruning is the reduction of the size of the decision
tree, hence reducing the classification error and avoiding both overfitting and
underfitting. The most frequently used pruning methods are based on removing some of the
nodes of the tree. Pruning can also be performed by employing neural networks, trained by the
backpropagation algorithm (Kijsirikul and Chongkasemwongse 2001), to give weights to
nodes according to their significance instead of completely removing them.
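As an illustrative usage sketch only (assuming the scikit-learn library, whose cost-complexity pruning is one possible realization of the pruning idea described above; data and parameter values are placeholders):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(500, 6)                      # placeholder training data
y = np.random.randint(0, 2, size=500)

# Grow the tree by recursive locally optimal splits, then prune it:
# ccp_alpha > 0 removes splits whose contribution does not justify their complexity.
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01)
tree.fit(X, y)
print("leaves after pruning:", tree.get_n_leaves())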
Another widely used decision-tree-based classification algorithm that has to be
mentioned is C4.5. Like CART, it has been shown to produce good classification results;
let us nevertheless highlight, in conclusion, that decision-tree-based schemes may not work well
because of the overfitting problem (Han and Karypis 2000). Therefore, the disadvantages of the
individual classification methods (Sections I.3.1 – I.3.6) motivate the more advanced approach of
combining classifiers.
I.3.7 Ensemble of classifiers
There are two common ways to increase the accuracy of a classification
system/framework: one is to improve the performance of a single classifier; the other is
to combine the results of multiple classifiers by employing decision combination for
different data-clusters of the learning heap (Tumer and Ghosh 1995). One may find in the
literature a range of experiments where multiple classifiers achieve better performance
than a single classifier when they are selected carefully and the combining algorithm retains
the advantages of each individual classifier while avoiding its weaknesses (Ho, Hull and Stihari
1994), (Hsieh and Fan 2001).
This is due to two reasons (Briem, Benediktsson and Sveinsson 2000):
1. The risk of choosing the wrong data-cluster is lower;
2. Individual classifiers can be built on different types of instance-features of the
same data-cluster, and the combination scheme can weight the classifiers based on the
characteristics of the different features.
It’s quite typical when the ensemble method of combining classifiers is based on data
re-sampling approach. In this instance the outputs of a classifier are interpreted in terms of
bias-and-variance decomposition, the ensemble methods mainly reduce the variance of
these single classifiers.
Bagging (Breiman 1996) and Boosting (Freund and Schapire 1996) are two classical
methods that have shown great success when using ensembles of classifiers (Chan, Huang
and De Fries 2001). Bagging employs the bootstrap sampling method to generate training
subsets, while in Boosting the creation of each subset depends on the previous classification
results.
For both methods, the final decision is made by a majority vote. Numerous
experiments (Bauer and Kohavi 1999), (Opitz and Maclin 1999) have shown that Bagging
and Boosting are effective only for weak classifiers such as classification trees and neural
networks, and that they perform well on small datasets.
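A minimal sketch of the Bagging idea (bootstrap re-sampling plus a majority vote), written by us for illustration only; base_learner is assumed to be a generic factory returning an object with scikit-learn-style fit/predict methods, and integer class labels are assumed:

import numpy as np

def bagging_fit(base_learner, X, y, n_estimators=10, seed=0):
    """Train n_estimators copies of base_learner, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample with replacement
        ensemble.append(base_learner().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """Combine the individual decisions by a majority vote."""
    votes = np.array([clf.predict(X) for clf in ensemble])          # (n_estimators, n)
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])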
Stacking is another method which uses several classifiers, called
level-0 classifiers. The output of each of these classifiers is given as input to a meta-
classifier (level-1 classifier), which makes the final decision (Lotte et al. 2007).
In comparison to the previous method, Voting is used in a "democratic" way:
each classifier assigns a class to a given vector, but the final class assignment is based
on the majority of the classifiers, which is the weakness of this method (Lotte et al. 2007).
Two important issues appear when designing a multiple classifier system:
classifier selection and decision combination. However, on large data sets, other issues of
the ensemble methods need to be taken into consideration more seriously:
• If an ensemble method generates N classifiers over a database and then combines the
classifiers' outputs, the time cost of such an approach is about N times as high as that of a
single base classifier.
• If the data set assigned for building the ensemble of classifiers has to be stored
on the hard drive, because of a lack of RAM, the time cost of re-sampling cannot be
ignored either.
The last important issue is that, very often, the approaches for constructing an
ensemble of classifiers use subset(s) of the database rather than the total set. As a result, the
constructed classifiers lose generalizing power. This issue has to be taken into
account, because in this case the ensemble methods may not perform better than a single
base classifier over the total database.
Taking into consideration these two issues and the fact that classification accuracy
may degrade very quickly as the number of classes increases (Li, Zhang and Ogihara 2004),
the multiple classifier method becomes one alternative solution for solving
classification problems. The multiple classifier method is not automatically superior to a single classifier
method, because we have to deal with the two most important issues in designing a multiple
classifier system: classifier selection and decision combination.
The classifier-ensemble approach is very important for complex tasks, which involve
not only major problems such as the structure of the classifiers and the final decision combination,
but also the realization. These details, which can be viewed as minor sub-problems, have a
crucial influence on the performance of the ensemble of classifiers. The following sections are
dedicated to the problem of constructing Multiple Classifier Structures (MCS).
I.3.7.1 Multiple classifiers structures
A number of classification systems based on the combination of the outputs of a set of
different classifiers, and approaches for constructing them, have been proposed in the
literature. The different Multiple Classifier Structures (MCS) can be grouped as follows
(Sarlashkar, Bodruzzaman and Malkani 1998), (Hsieh and Fan 2001): parallel, pipeline
and hierarchical structures.
For the parallel structure, the classifiers are used in parallel and their outputs are
combined. In the pipeline structure, the system classifiers are connected in cascade
(Giusti, Masulli and Sperduti 2002), (Kawatani 1999). The hierarchical structure is a
combination of the previous two structures.
Irrespective of the type of MCS, the difficulty of choosing and constructing classifiers has pushed
researchers to develop methods that help the designer to make this choice (Gasmi and
Merouani 2005). Among the various methods of constructing an MCS, one central
approach dominates: initially producing a large number of classifiers and then selecting the subset
judged most likely to lead to optimal performance. A central issue in MCS
construction is the combination of the classifiers' output decisions. The following section
gives an overview of this problem.
I.3.7.2 Decision combination
The decision combination methods proposed in the literature are based on different
ideas: voting, the use of statistics, the construction of belief functions and other classifier fusion
schemes (Xu, Krzyzak and Suen 1992), (Prampero and Carvalho 1998), etc. Of particular
interest here are the decision combination methods available for hierarchical MCS
(Abdelwahab 2004): random decision, majority decision and
the hierarchical decision method. In a hierarchical classification framework, the probability
output from each individual classifier is used as input for the next lower level of the hierarchy,
the expectation being that classification inaccuracy might in this way be reduced at the lower levels
of the hierarchy.
For solving pattern recognition problems, one can thus find a proposal (Xu,
Krzyzak and Suen 1992) to use different types of classifier decision combination: the
average Bayes classifier, voting methods, the Bayesian formalism and the Dempster-Shafer
formalism. In these methods, only the top choice from each classifier is used, which is
usually sufficient for problems with a small number of classes. The examination of the
strengths and weaknesses of each method leads to the problem of determining classifier
correlation, which is the central issue in deriving an effective combination method.
Recently, in the work (Du, Zhang and Sun 2009), one can even find an integration of
the Dempster-Shafer and hierarchical decision combination approaches. There are plenty
of other approaches, such as the work (Kittler, Hatef, Duin and Matas 1998) presenting a
common theoretical framework for combining classifiers, from which the product
rule, sum rule, max rule, min rule and median rule can be derived by taking the product, sum, maximum,
minimum and median values of the a posteriori probabilities p(c_j|s_i) - the probability that
an input pattern with feature vector s_i is assigned to class c_j. Different
classification techniques based on the combination of classifiers, such as the kNN decision rule
and the combination of an ensemble of neural networks, have also been proposed in the
literature.
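To make these fixed combination rules concrete, the following sketch (illustrative only; names and the example values are ours) applies the product, sum, max, min and median rules to a stack of posterior probability estimates p(c_j|s_i) produced by several classifiers for one input pattern:

import numpy as np

def combine_posteriors(posteriors, rule="sum"):
    """posteriors: array of shape (n_classifiers, n_classes) with p(c_j|s_i) for one
    input pattern; returns the index of the class selected by the chosen fixed rule."""
    rules = {
        "product": np.prod,
        "sum":     np.sum,
        "max":     np.max,
        "min":     np.min,
        "median":  np.median,
    }
    scores = rules[rule](posteriors, axis=0)   # combine over the classifiers
    return int(np.argmax(scores))              # class with the highest combined score

# Example: three classifiers, four classes
p = np.array([[0.1, 0.6, 0.2, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.3, 0.3, 0.3, 0.1]])
print(combine_posteriors(p, rule="product"))   # -> 1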
The T-DTS concept belongs to this group of classifier-ensemble approaches. The
following section describes this type of approach in greater detail.
I.3.7.3 Ensemble of Neural Networks
The combination of an ensemble of neural networks has been proposed to achieve
high classification performance in comparison with the best performance that could be
achieved by employing a single neural network. This has been verified experimentally
(Kittler, Hatef, Duin and Matas 1998), (Giacinto, Roli and Fumera 2000). It has
also been shown that an ensemble of neural networks provides additional advantages in
the context of classification applications, including biomedical applications that employ
specific diagnostic tools (Dujardin et al. 1999).
There are two types of NN combination at the macro-structure level (Maren 1990): strongly
coupled and loosely coupled networks. The first type can be treated as a single network and
is created by fusing together two or more networks into a single new structure. Loosely
coupled structures connect networks which retain their structural distinctness. Consequently,
a combination of NNs can be performed not only at the macro-structure level,
but also at the micro (intra-NN characteristics) level. This type of NN combination is
similar to data fusion mechanisms. Data fusion of multitemporal
characteristics is essential for medical imaging and remote sensing applications.
Owing to the availability of the large amounts of data acquired by different types of NNs, it is
mandatory to develop effective fusion techniques able to take advantage of such multi-
NN sources and of the multi-temporal characteristics of NNs (Bruzzone, Prieto and Serpico
1999). Fusion of multiple characteristics refers in particular to the acquisition, processing and
synergistic combination of information from various types of NNs in order to provide a
better understanding of the classification situation under consideration (Kundur,
Hatzinakos and Leung 2000).
A fusion scheme (regardless of the NN combination problem) might be defined as follows
(Pattichis C., Pattichis M. and Micheli-Tzanakou 2001): "data fusion is a formal
framework in which are expressed means and tools for the alliance of data originating
from different sources. It aims at obtaining information of greater quality; the
exact definition of 'greater quality' will depend upon the application." Data fusion
techniques may be classified into the following three groups:
1. Data-level fusion: combination of raw data from all sensors.
2. Feature-level fusion: extraction, combination and classification of feature vectors
from all sensors.
3. Decision-level fusion: combination of the outputs of the classifications achieved on
each single source.
Taking into consideration the specificity of NNs, one may find a wide range of multi-NN
model fusion approaches. For example, a method proposed by Ueda (Ueda 2000) for
linearly combining multiple NN-based classifiers uses statistical pattern recognition
theory. In this approach, several neural networks are first selected based on which one works
best for each class in terms of minimizing classification errors. They are then linearly
combined to form the desired classifier, which exploits the strengths of the individual
classifiers. In this approach, the Minimum Classification Error (MCE) criterion is used to
estimate the optimal linear weights, and the problem of estimating the linear combination
weights is reformulated as a problem of designing linear discriminant
functions using the MCE discriminant (Juang and Katagiri 1992). More generally, we
recall that it has been shown (Lazarevic and Obradovic 2001) that NN ensembles are
effective only if the NNs forming them make different errors. For example, it has been shown
(Hansen and Salamon 1990) that neural networks combined by the majority rule can
provide an increase in classification accuracy only if the NNs make independent errors.
Unfortunately, the reported experimental results point out that the creation of error-
independent networks is not a trivial task.
In the neural network field, several methods for creating ensembles of neural
networks making different errors have been investigated. Such methods basically rely on
varying the parameters related to the design and training of the neural networks. According to
the work (Rao, Chand and Murthy 2005), the ensemble-of-NNs methods can be placed
into one of the three following categories:
1. Methods that use different initial NN parameters,
2. Methods that use different types of NNs (such as probabilistic neural networks,
MLP or RBF networks),
3. Methods that use different sub-sets of the given database.
The capabilities of the above methods for creating error-independent NN outputs were
experimentally compared. It was concluded that varying the network type (second
category) and the training data (third category) are the two best ways of creating ensembles of
networks making independent errors. However, it has been noted that neural network
ensembles can also be created using a combination of two or more of the above methods.
Therefore, neural network researchers have started to investigate the problem of the
engineering design of NN ensembles. The proposed approaches can be classified into two
main design strategies: the already mentioned overproduce-and-choose strategy and the direct
strategy; the latter comprises methods that generate an ensemble of error-
independent NNs directly. In practice, it has been shown (Sung and Niyogi 1995) that using
an ensemble of NNs might lead to low computation cost and good classification
performance. Nevertheless, the assumption that is central to this type of modelling is too
tight for many real-world applications (Ryabko 2006).
In conclusion, let us highlight that there are two major ways in which we can integrate
NNs to create an ensemble. One is to create a hybrid network by tightly fusing two
existing neural network architectures (the well-known Hamming net and the counter-
propagation network are examples). The disadvantage of this approach is that
we are still relying on a single network to accomplish a task which may well exceed
the capabilities that any single network can offer. The alternative, more promising
approach is to create an ensemble of individually integrated ANNs, because complex problems
require multiple stages of processing. The computational complexity of fully connected
networks scales as the square of the number of neurons involved. An
NN ensemble is easy to test, debug, repair and update using different types of ANNs.
Moreover, from the practical point of view, nothing like a structured-programming
principle has yet been established for implementing a multi-modular ANN system. Therefore,
modular system design (such as our T-DTS) makes it easy to verify and validate the final
project (Maren 1990).
Summarizing Section I.3, which includes a brief overview of classification approaches
and their combination, we now logically move to one of the main cores of our work: the
development of the T-DTS concept. Let us point out that some well-known classifiers may not have been
mentioned in Sections I.2 and I.3. Concerning the final sub-Section I.3.7.3, there are
many other combination schemes available in the literature that might
make this chapter more complete, but our goal here has been to provide an introduction
to the NN-ensemble (hybrid) T-DTS approach. Thus, the T-DTS concept can be treated
as a special case of an ensemble of classifiers (currently NNs) that uses a decision tree
approach.
Section I.4 is fully dedicated to the T-DTS approach. It contains a detailed description
of the T-DTS concept regardless of implementation aspects.
I.4 T-DTS (Tree-like Divide to Simplify) approach
In many real-world problems and applications, e.g., system identification, industrial
processes, manufacturing regulation, optimization, decision systems, plant safety,
pattern recognition, etc., information is available as data stored in files (databases, etc.).
For classification approaches, efficient processing and handling of these data is of
paramount importance. In most of these cases, processing efficiency is closely related to
several issues, among which are:
1. Data nature: this includes data complexity, data quality and data representativeness.
2. Processing-technique-related issues: these include model choice, processing
complexity and intrinsic processing delays.
Data complexity, frequently related to nonlinearity or subjectivity of the data, may affect
processing efficiency, while data quality (noisy or degraded data) may influence
processing success and the quality of the expected results. Finally, representativeness issues such as
the scarcity of pertinent data could affect processing achievement or the resulting precision
(Madani, Rybnik and Chebira 2003). On the other hand, the choice or availability of an
appropriate model which describes and forecasts the behaviour is of major importance.
The ability of a multi-model system to achieve goals through selectively switching
between the models within a model space can be considered as a form of adaptive
reasoning (Ravindranathan and Leitch 1999). This switching (more precisely, selection) of
models using reasoning is an adaptation. In the T-DTS case, the processing technique or
algorithmic complexity (design, precision, etc.) shapes the processing effectiveness.
The intrinsic processing delay or processing time, related to implementation issues (software-
or hardware-related) or to the parameterization of the processing models, could affect not only
the processing quality (quality of results) but also the technique's viability to offer an adequate
solution for a complex problem represented by a huge data store.
I.4.1 Modular approach
The aim of the modular tree structure called Tree-like Divide to Simplify (T-DTS)
(Bouyoucef, Chebira, Rybnik and Madani 2005) is to extend the existing relation between
processing time and database size. By "extend" we mean the ability to use clustering or
database decomposition as an approach to induce a significant gain in performance. The
expected result is a working system that decreases the overall processing time and/or
increases the classification quality by means of the database decomposition initially proposed in
the works (Chebira, Madani and Mercier 1997), (Madani, Chebira and Mercier 1997).
From our point of view, "complexity reduction" is the key point on which the
modular Tree-like Divide to Simplify (T-DTS) approach acts (Rybnik 2004), (Bouyoucef,
Chebira, Rybnik and Madani 2005). Let us mention here that, in our search for an appropriate
characterization, we focus on the property of the data (classification complexity) that
easily models the underlying structure (an ensemble of NNs) that might be coherent with a
given problem. An important note is that the choice of the characterization depends on the
available model; on the other hand, when a particular set of properties of the experimental
data has been found, one can meaningfully ask for a model that reproduces that
structure. Modelling requires prior characterization (Rossberg 2004). In essence,
T-DTS is a hybrid of multiple neural networks handled by a characterization named the
complexity estimator (Bouyoucef 2007). We purposely leave the term complexity
without a proper definition here, in order to first describe the conceptual outline of T-DTS and then
specify it for a particular set of classification problems.
The T-DTS concept is founded on the assumption that database decomposition decreases the general task complexity. More precisely, it is based on the universal "divide" and "conquer" principle and on the "complexity reduction" approach (Madani, Chebira and Mercier 1997) (in this paragraph we acknowledge a possible criticism concerning misapplication of the term complexity).
Many systems, such as Committee Machines (Tresp 2001), Multi-Agent systems and Distributed Artificial Intelligence, share this paradigm, either directly, by splitting the database into clusters, or indirectly, by means of agents or module coordinators (Rybnik 2004), (Bouyoucef 2007).
The complexity estimating and reduction technique allows T-DTS to deal with complex problems in an intelligent way. It was observed that an intelligent method of partitioning generally performs better than a random one (Chawla, Eschrich and Hall 2001). The staple point of T-DTS is the recursive construction of adaptive tree structure(s) over database subsets.
Our first question, "Why do we use tree structures?", finds an answer if one takes into account (we acknowledge a possible criticism concerning this primitive comparison) the analogies in the "self-organizing world" and in "brain activity" (Josephson 2004). Tree structures are abundant in complex natural systems (e.g. taxonomic hierarchies, trophic pyramids) and in human intelligent organizations. There are some grounds (Green and Newth 2001) for supposing that trees form naturally in many problems.
Therefore, whoever deals with intelligence shares a common view on how to handle the phenomenon of complexity, more precisely the complexity of natural or intelligent human systems such as grammars and languages. Let us recall here that the structure of any language, as a product of human intelligence, underlies human behaviour in a broad sense (Chomsky 1968). For example, we find a common idea with Humboldt's universal approach to complexity, highlighted in the same work (Chomsky 1968): "in the Humboldtian sense, namely as «a recursively generated system, where the laws of generalization are fixed and invariant, but the scope and the specific manner in which they are applied remain entirely unspecified»". This presages Wolfram's basic insight discussed in the work (Chaitin 2005) dedicated to computational complexity. Similarly, the T-DTS concept targets complexity using simple principles that may produce very complicated-looking output, because being principally simple is a condition for being the richest in the phenomena it produces: "simple in hypotheses - the most rich in phenomena" (Chaitin 2005).
In machine learning, the recursive tree-like decomposition approach belongs to the junction-tree, clique-tree, or join-tree group of decomposition methods. This general concept of tree-like decompositions was formalized in the works of (Robertson and Seymour 1984) and since then has been studied and developed by many other authors.
Unfortunately, real-world and industrial problems are never as comfortable for tree-like decomposition approaches as benchmarks are. They are often much more complex, because of the large number of parameters which have to be considered. That is why conventional solutions (based on mathematical and analytical models) reach serious limitations when solving this category of problems (Budnyk, Bouyoucef, Chebira and Madani 2008).
One of the key points on which we can rely is complexity reduction. This approach might allow us to deal with complexity not only at the level of the problem's solution but also at the level of the processing procedure. One way to achieve complexity reduction is to split a complex problem into a set of simpler sub-problems. This leads us to "multi-modelling", where a set of simple models is used to sculpt a complex behaviour (Jordan and Xu 1995), (Murray-Smith and Johansen 1997). Another promising approach to reducing complexity takes advantage of hybridization (Goonatilake and Khebbal 1995). Henceforth, taking also into consideration the classifiers' ensemble approach, the following section describes the general T-DTS concept.
I.4.2 T-DTS concept
The T-DTS approach deals with classification problems using the universal "divide" to "conquer" principle. According to the DARPA report (Goblick 1988), ANN models have demonstrated superiority over classical methods for pattern classification (Maren 1990). Therefore, the T-DTS approach was implemented using Neural Network models. Another argument for an ANN-based T-DTS is the fact that NNs are superior for dealing with more complex or open systems, which may be poorly understood and cannot be adequately described by a set of rules or equations. Regardless of the application, the principal T-DTS concept includes two main operation stages that can be described as follows:
Fig. I.3: General block diagram of the T-DTS structure construction
• The First Stage, or learning phase: T-DTS recursively decomposes the input database into sub-databases of smaller size using a step-by-step scheme (Fig. I.3) and then generates processing structures and tools (special parameters) for the decomposed data sets.
• The Second Stage, or operation phase: aims at processing the input sub-spaces obtained from the splitting. Here, the obtained hybrid multi-neural-network system is used on an unknown (i.e. unlearned) database of new instances.
T-DTS decomposes the problem into sub-problems recursively, building a neural tree computing structure. The nodes of the constructed tree are decision-making units and NN-based decomposition units, and the leaves correspond to NN-based processing units (Madani and Chebira 2000).
We have to note that the diagram shown in Fig. I.3 is general. It can be adapted to specific problems by an intelligent choice of the components. Let us also mention that the Normalization block could include not only database normalization, but also other pre-processing expertise such as Principal Component Analysis.
I.4.3 T-DTS short description
T-DTS is a hybrid multi-neural-network, tree-based constructing approach (Madani, Chebira and Mercier 1997). First, the concept is designed to create an ensemble of NNs over a tree-like database decomposition. However, this ensemble of NNs might contain different intelligent modules (more appropriate for the global task target) at different levels of the structure. Generated recursively by T-DTS, the tree structure (including the form and size of the tree) should conceptually reflect the general complexity of the given problem represented by the initial database.
Henceforth, T-DTS is also a purely data-driven concept founded on the universal "divide" to "conquer" principle. The T-DTS decomposition technique belongs to the category of nonparametric approaches (the unsupervised methods must be realized), or more precisely to the subgroup of hierarchical divisive methods, where the processing techniques are supervised.
We have to remark, first, that the proposed way of decomposition is an analytical method which infers microscopic events from macroscopic data, as is typical for physics. The second remark is that the T-DTS approach is based on the concept of reducibility (Madani, Chebira and Mercier 1997), which is known to have its own limitations (Haken 2002).
The T-DTS tree-designing strategy belongs to the class of "overproduce and choose" engineering methods. The decision-taking strategy is also hierarchical. During decomposition, T-DTS checks complexity conformity, and this is the moment of decision taking: to continue decomposing or to stop. The tree-structure building process is performed dynamically; as a consequence, T-DTS inherits the properties of the decision-tree approach. Database decomposition continues until the complexity conformity and the sub-database parameters required by each neural network module are met together. In the case where the sub-problem/sub-database has not yet reached the required size or dimensionality for the neural network model, T-DTS continues the decomposition process. Therefore, the complexity conformity block is the core engine of T-DTS (Fig. I.3). This (core) block cannot be omitted in any possible implementation.
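To make this decision loop concrete, the following minimal Python sketch mimics the "estimate complexity, then either split or stop" recursion. The helpers estimate_complexity (a toy nearest-centroid error rate) and split_two_means (a toy 2-means decomposition) are illustrative stand-ins for the T-DTS modules, not the actual implementation.

import numpy as np

def estimate_complexity(X, y):
    # Toy stand-in for the complexity estimation module: error rate of a
    # nearest-class-centroid rule (0 means the subset is trivially separable).
    centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    pred = [min(centroids, key=lambda c: np.linalg.norm(x - centroids[c])) for x in X]
    return float(np.mean(np.array(pred) != y))

def split_two_means(X, y, n_iter=20):
    # Toy unsupervised decomposition unit: plain 2-means clustering.
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin([[np.linalg.norm(x - c) for c in centers] for x in X], axis=1)
        if (assign == 0).sum() == 0 or (assign == 1).sum() == 0:
            break
        centers = np.array([X[assign == k].mean(axis=0) for k in (0, 1)])
    return [(X[assign == k], y[assign == k]) for k in (0, 1)]

def build_tree(X, y, threshold=0.1, min_size=20):
    # Recursive construction: stop (leaf) when complexity conforms to the
    # threshold or the subset is too small; otherwise decompose further.
    if len(X) <= min_size or len(np.unique(y)) < 2:
        return {"leaf": True, "size": len(X)}          # a local NN would be trained here
    c = estimate_complexity(X, y)
    if c <= threshold:
        return {"leaf": True, "size": len(X), "complexity": c}
    left, right = split_two_means(X, y)
    return {"leaf": False, "complexity": c,
            "children": [build_tree(*left, threshold, min_size),
                         build_tree(*right, threshold, min_size)]}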
The T-DTS concept is based on the universal "divide" to "conquer" paradigm. The decomposition technique(s) of T-DTS use unsupervised, data-driven methods. The complexity conformity block, or simply the complexity estimating module, as well as its decision-taking part, also has to be purely data driven.
The following section provides a summary of clustering and classification methods and situates the T-DTS concept among the various paradigms.
I.5 Conclusion
Our thesis is dedicated to the T-DTS technique applied to classification problems. T-DTS employs two basic data-processing methods: clustering and classification.
There are two main categories of classification techniques, see Section I.2: parametric and nonparametric. The first category uses parameters that are based on certain assumptions about a given problem. Since the performance of parametric clustering methods is very sensitive to their parameters, we have chosen to base T-DTS on nonparametric methods. More precisely, T-DTS uses prototype-based vector quantization techniques and density-based decomposition techniques that include k-centroid partitioning methods.
To achieve high classification performance, T-DTS employs a set of classification methods that were reviewed in Section I.3. Among the considered techniques were Bayesian classifiers, Hidden Markov Models and Support Vector Machines; however, we found these techniques unsuitable as they require extensive parameterization and often involve a high computational cost. Instead, we have chosen to base T-DTS on a combination of Neural Networks and Decision Tree techniques, which provide good classification performance at a moderate computational cost and require only minor parameterization; we combined the two classification approaches by means of the ensemble-of-classifiers technique, which is based on the "overproduce and choose" design principle and allows reaching a higher accuracy than that of a single classifier. It is worth mentioning that the core of T-DTS is the Neural Networks technique, as it is currently considered an outperforming tool for identifying patterns, forecasting, and identifying trends in large amounts of data, and is thus well suited for the objectives of this thesis.
Chapter II:
Complexity concepts
Since complexity is at the core of T-DTS, the aim of this chapter is to provide an overview of different types of complexity and their relations. A specific type of complexity - classification complexity, which corresponds to the class separability criterion - was applied in previous T-DTS related work (Rybnik 2004). The complexity estimators have been well studied in the work (Bouyoucef 2007), and we rely on those results in this chapter. Our prime goal is to extend and update the taxonomy of the definitions and to link the high-level concepts with the applied definitions of complexity (Bouyoucef 2007). Let us note that the term « complexity » cannot be defined in a unique way; thus, like other works attempting to generalize complexity, our overview may suffer from a lack of accuracy in defining it.
We begin this chapter in Section II.1 with a description of the complexity concept as a whole. We provide a cross-disciplinary overview with the aim of gaining a better insight into the common views of complexity and its origins. Next, in Section II.2, we describe means of measuring a specific type of complexity - the computational complexity. Section II.3 presents our novel approach for estimating the complexity of classification tasks, which belongs to the group of ad hoc complexity estimation techniques and is based on extracting information about the separability of classes from ANN (Artificial Neural Network) structures. Section II.4 provides a short conclusion.
II.1 Introduction to complexity concepts
There is no common definition of the term "complexity". The Oxford dictionary defines complexity as something "made of usually several closely connected parts". The Latin "complexus" signifies "entwined" or "twisted together". "Complicated" uses the Latin root "plic", which means "to fold", while "complex" uses "plex", which means "to weave".
Many attempts have been made to develop a generalized understanding of complexity and, ultimately, a Theory of Complexity - a systems theory (Lucas 2000) of systems that consist of many interacting components and many hierarchical layers. Note that a system is called complex if it is impossible to reduce its overall behaviour to a set of properties characterizing its individual components (Lucas 2000); interactions at the collective level in such a system can produce properties that are simply not present when the components are considered individually.
The quantification (measurement) of a system's complexity, including the complexity of the classification process, plays the key role in T-DTS, as it is responsible for optimizing the tree structure of Neural Networks. This motivated us to find a new approach for the analysis of complexity and the synthesis of complex systems.
We highlight in this chapter our general interest in investigating the complexity phenomenon and stress the need for a clearly visible and solid classification complexity estimator, to the extent possible while using our ad hoc approach described below. We begin by considering the theoretical aspects of the complexity concept. Considering the given aggregation, this concept can be stratified from the simplest level to the most complex:
1. Static complexity: the simplest form of complexity, which relates to static systems and is generally studied by scientists using well-developed mathematical tools (Lucas 2000). For example, this form of complexity is studied by such techniques as Algorithmic Information Theory (Chaitin 2005).
2. Dynamic complexity: extends static complexity by adding the dimension of time, which can improve or worsen the static situation (Wolfram 1994). Given the interest in experimental repeatability in science, observing dynamics and measuring this type of complexity (Lucas 2000) is one way to exhibit the phenomenon of complexity, because complexity as a phenomenon is born not statically, but dynamically (Appendix B).
3. Evolving complexity: relates to systems that evolve through time into different systems (Wolfram 1994); the best-known class of this phenomenon is usually described as organic: open-ended mutation. Although all such systems are unique, there are symmetries present in their arrangements that allow one to measure these systems. For example, it is possible to analyze the complexity of an evolving system from an evolutionary viewpoint as a set of specific, already investigated parts or patterns (e.g. DNA code) that can also have numerous combinations that have not yet occurred and thus have not been studied (Lucas 2000).
4. Complexity of self-organizing systems: according to Lucas (Lucas 2000), this is the self-maintaining type of system that, operating at the edge of chaos, aggregates in nonlinear ways the structures and a complex mix of types 1-3 mentioned above (Wolfram 1994).
Let us mention that the categorization of the complexity concept into types 1-4 is performed by Lucas (Lucas 2000) in a very informal way, but this is not the fault of the author: even the simplest, static type of complexity, which employs precise mathematical tools, does not have a common definition (Saakian 2004). Nevertheless, this work (Lucas 2000) is useful to introduce a common context for the phenomenon of complexity. Therefore, before applying a quantitative technique for estimating complexity (meaning that we are concerned with the static type of complexity), we need to decide whether the systems are, in fact, complex in any of the senses mentioned above (Bak 1996). However, deciding and determining complexity is typically done in an informal form of comparison; it means that the whole spectrum of complexity assessment statements will be of the form "system x … is more complex than system y" (Edmonds 1999). Also, there is some point of transition from "simple" to "complex"; the assumed nature of this point further complicates the formalization of complexity estimations.
Therefore, we will provide a pragmatic solution for describing categories of complexity. The philosophical issues related to complexity will henceforth be ignored. To classify different types of complexity, we define the following criteria, which we later use to organize the concepts into several groups:
• Criterion of size: for example, the size of a genome, or the number of species in a biosphere. Size can be an indicator of difficulty, but for a strong definition of complexity this criterion alone is not sufficient.
• Minimum description length criterion: based on Kolmogorov's idea of complexity as the minimum possible length of a description in some language (usually that of a Turing machine) (Shalizi 2005), (Chaitin 2005). We discuss this criterion in more detail later in this chapter.
• Criterion of variety: the variety of the basic components of a concept. Variety is the key point of evolutionary processes. For example, human teeth, considering their organization and functions, are more complex than shark teeth regardless of quantity (Edmonds 1999), (Lucas 2000). Variety is a necessary feature of complexity but is not sufficient for it.
• (Dis)Order: complexity is a mid-point between order and disorder (in the broad meaning of these terms) (Permana 2003).
There are certain difficulties in applying the listed criteria to building a common, solid complexity hierarchy; most notably, the criteria originate from absolutely different fields, and these origins cannot be ignored. Thus, we are going to rely on the heuristic inventory made in Horgan's work (Horgan 1995). In his survey, Horgan analyses more than 30 different ways of categorizing complexity; in this work, we will consider four of them: algorithmic complexity, computational complexity, entropic complexity and grammatical complexity.
In the next section we describe the most important of the four considered complexity categories, the computational complexity. Based on the structure and complexity of given data, we propose a combined, context-dependent measure of the complexity of the associated computations. We describe our contribution of a novel classification complexity estimation technique, named the ANN-structure based classification complexity estimator, and we explain its connections to other complexity estimation approaches according to the proposed classification complexity measurement hierarchy.
II.2 Computational complexity measurement
Computational complexity as a discipline represents an outstanding research area. This subject lies at the interface between mathematics and theoretical computer science, with a clear mathematical profile and a strictly mathematical format. Neighbouring fields of study inside theoretical computer science are the analysis of algorithms and computability theory. The key distinction between computational complexity theory and the analysis of algorithms is that the latter is dedicated to analysing the amount of resources needed by a particular algorithm to solve a concrete problem, whereas the former asks a more general question: what kind of problems can be solved at all (within a given computational model)?
There are other measures of computational complexity, such as communication complexity in distributed computation or, with connections to hardware design, circuit complexity. Informally, a computational problem is regarded as inherently difficult if solving it requires a large amount of resources, independently of the algorithm used for solving it.
Computational complexity theory formalizes this intuition by introducing mathematical models of computation. Therefore, the following section is dedicated to the algorithmic aspect of computational complexity.
II.2.1 Complexity, randomness and computability
Computational problems can be classified by the time it takes for an algorithm - usually a computer program - to solve them as a function of the problem's size. Some problems are difficult to solve, while others are easy. For example, some difficult problems need algorithms whose running time is exponential in the size of the problem, such as the travelling salesman problem.
From the algorithmic part of computational complexity theory, complexity relates to the time, or the number of steps, that it takes to solve an instance of the problem as a function of the size of the input (usually measured in bits), using the most efficient algorithm (if such efficiency can be proved) (Sipser 2005).
The question of whether NP is the same set as P is one of the most important open questions in theoretical computer science, due to the wide implications a solution would present. Most scientists tend towards a negative answer to this question (Boolos, Burgess and Jeffrey 2002). This corresponds with the unproven Church-Turing thesis, which claims the equivalence of all reasonable computing machines in terms of theoretical computational power; the thesis states that there is no way to build a computation device that is more powerful than the Turing machine. Thus, for NP-problems there is no alternative to a deep investigation of NP-problems (or their sub-classes), where one may find new, specific methods for resolving a given problem. Another alternative is new computational models (Blass and Gurevich 2003); for example, some NP-problems could in principle be addressed using a quantum model of computation (Gershenfeld and Chuang 1998).
We also have to mention that the "P=NP?" question, with its trend towards P≠NP, is linked with "real" randomness in the sense of Chaitin (Chaitin 2005). Thus, for a random binary sequence of 0s and 1s, the inability to find a compressing algorithm/machine represents complexity in the sense of Kolmogorov (Rubin and Trajkovic 2001). Because such sequences may represent not only a relation between input and output but a program or algorithm itself, Chaitin calculated, for all random programs represented by these random sequences, the probability of halting, Ω³. As the halting problem for Turing machines is related to "P=NP?" and the halting problem is unsolvable in Turing's terms, Chaitin comes to the conclusion that real randomness is another form of computational complexity.
Taking into account that the basis of our intent to define an applied computational complexity measure is the deterministic Turing machine, where uncomputability - the inability to resolve the P≠NP problem - is linked with randomness, and taking also into consideration that computability (algorithmic computability) is related to the number of algorithm iterations required to solve the posed problem and that this characteristic is a measurable quantity, the following section gives a brief overview of measurable computational complexity classes in order to establish the hierarchy of classification complexity estimators and their relations to more general/abstract types of complexity measures.
II.2.2 Kolmogorov related complexity measures
Among the large variety of complexity measures proposed nowadays, one may find no relation between them, because they are usually special quantifications for applied usage (Shalizi 2005). However, all of them have a common base. Thus, computational complexity is the amount of computational resources (usually time or memory) that it takes to solve a class of problems; the difficulty here is the limited supply of these resources once the appropriate program is supplied. This is now a very well studied measure. For our purposes, this is a weak definition of complexity as applied to evolving entities, as the time to perform a program or the space that the program takes is often not a very pressing difficulty compared with the problem of providing the program itself (through evolution) (Lucas 2000). The following list of complexity estimators is purely resource (defined instance) based.
The classical computational complexity measure mentioned above was introduced by Kolmogorov (Shalizi 2005); it is roughly the minimum length of a Turing machine program needed to generate a binary sequence (Edmonds 1999). This quantity is in general incomputable (Chaitin 2005), in the sense that there is simply no algorithm which will compute it. This follows from the halting problem, i.e. from the other form of Church's thesis (Church 1936), which in turn is a disguised form of Gödel's theorem (Godel 2001), so this limit is not likely to be broken any time soon. Moreover, Kolmogorov's complexity is maximized by random strings (Chaitin 2005), so it cannot really justify what is random, which is why it has gradually come to be called algorithmic information (Shalizi 2007). Nevertheless, the ideas behind the formalism describing complexity in Kolmogorov's way - relying on its main parameter, the minimum length of an algorithm - can be used for creating different types of practical approaches, where the complexity might be measured in context (Green and Newth 2001). One application example is grammar complexity.
³ In the subfield of algorithmic information theory, Chaitin's construction (Chaitin's constant, or halting probability) is a real number that informally represents the probability that a randomly chosen program will halt.
II.2.2.1 Grammar complexity
The Minimum Grammar Complexity Criterion (grammar complexity) is a formalism used for the description of structural relationships (Young and Fu 1986). The goal of grammar complexity is to give a quantitative characterization of communication. The measure is said to be an upper bound of complexity, not the real complexity value, because there is no guarantee that it is the minimum description of a sequence (Permana 2003).
Let µ be the length of a segment of the code/program or input. We start with segments of length µ = 2:
Step 1: search for the most frequent µ-tuple in the sequence.
Step 2: replace the most frequent µ-tuple by a new symbol.
Step 3: increase the length µ of the tuple by 1.
Step 4: repeat Steps 1 to 4 until no replacements can be performed.
When the original sequence has been compressed into another sequence of symbols, the complexity is computed as follows: it is the sum of the length of the new sequence plus the length of all symbols used to recode the sequence, without counting the repetition of symbols; if symbols are repeated in the recoding, their logarithm is added to the complexity value. The problem with this procedure relates to the lack of bounds or comparisons for the measure. One does not really know what a complexity of 5 or 20 means - is it small or large? What does a disordered system typically exhibit? This makes the measure difficult to apply in real situations. Another limit concerns the algorithm used to compute the complexity. Starting with pairs and then moving to triplets, quadruples and so on is logical, but what if a system exhibits a 3-period cycle? Replacing pairs before triplets completely destroys the structure of the original sequence; replacing triplets would be a far more efficient procedure. In fact, a hybrid procedure, combining this approach with entropy, would be better.
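As a rough illustration of the procedure described above, the following Python sketch implements Steps 1-4 literally; the scoring rule is a simplified reading of the description (compressed length plus total rule length) and drops the logarithmic term for reused symbols.

from collections import Counter

def grammar_complexity(seq):
    # Repeatedly replace the most frequent mu-tuple by a fresh symbol (Steps 1-2),
    # grow mu by one (Step 3), and stop when no replacement is possible (Step 4).
    seq, rules, mu, k = list(seq), {}, 2, 0
    while mu <= len(seq):
        counts = Counter(tuple(seq[i:i + mu]) for i in range(len(seq) - mu + 1))
        best, n = counts.most_common(1)[0]
        if n < 2:                                  # no repeated mu-tuple left
            break
        symbol = "S%d" % k; k += 1
        rules[symbol] = best
        out, i = [], 0
        while i < len(seq):                        # rewrite the sequence
            if tuple(seq[i:i + mu]) == best:
                out.append(symbol); i += mu
            else:
                out.append(seq[i]); i += 1
        seq, mu = out, mu + 1
    # Simplified score: compressed length plus the length of each distinct rule body.
    return len(seq) + sum(len(body) for body in rules.values()), rules

score, rules = grammar_complexity("abababababab")  # small usage example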
Grammar complexity is an illustration of the use of Kolmogorov's approach. It plays a very important role in every discussion of measuring complexity, but it is often put aside as useless for any practical application (Shalizi 2007). Generally speaking, Kolmogorov-style complexity measures involve finding some computer or abstract automaton which will produce the pattern of interest. Bennett's Logical Depth is an instance of this tendency.
II.2.2.2 Bennett's Logical Depth
Bennett's Logical Depth is the running time of the shortest program (Bennett 1988). It is a measure based on computational resources (especially time) that calculates the time needed to obtain the results of a program of minimal length (Shalizi 2007). Bennett also uses it to formalize the level of organization in systems. All present-day organisms can be viewed as the result of a very long computation from an incompressible program and are thus, by this definition, complex (Edmonds 1999). The principal disadvantage of this definition is that it measures the process, but not the results. The next measure, Lofgren's Interpretation and Descriptive complexity, is better than Bennett's Logical Depth in terms of formalizing the complexity of self-organizing systems.
II.2.2.3 Lofgren's Interpretation and Descriptive Complexity
In his work (Lofgren 1973), Lofgren describes complexity through two processes, interpretation and description (Fig. II.1). The interpretation process is the translation from the description to the system, and the descriptive process works the other way around (Edmonds 1999).
For example, the description could be the genotype and the system the phenotype. The interpretation process would then correspond to the decoding of the DNA into the effective proteins that control the cell, and the descriptive process to the result of reproduction and selection of the information encoded there. Löfgren then goes on to associate descriptive complexity with Kolmogorov's complexity, and interpretational complexity with logical strength and computational complexity.
Fig. II.1 : Description and interpretation process
Kauffman's number-of-conflicts measure is a development of the idea of measuring the self-organizing process, because the self-organizing ability is an essential attribute of complexity.
II.2.2.4 Kauffman's number of conflicting constraints
Kauffman's definition (Kauffman 1993) is less concerned with Kolmogorov's complexity, but it introduces a working definition of complexity for a formal self-organizing model (Kauffman 1993). His principal idea can reflect the complexity of a self-organizing process.
Kauffman defines complexity as the "number of conflicts", or more precisely the "number of conflicting constraints". This definition represents the difficulty of specifying a successful evolutionary (and not only evolutionary) process that meets the imposed constraints, but it is hard to apply to some real-world problems, because the measurement of conflicts is a relative issue.
Summarizing, we highlight that Kolmogorov's complexity and the Kolmogorov-related complexities mentioned here are too theoretical, or too relative, to be easily applicable in practice. Dealing with the more pragmatic Information theory allows us to use explicit methods for computing classification complexity.
II.2.3 Information based complexity measures
Information theory is a branch of applied mathematics concerned with quantifying the amount of information, originally for coding and encoding. This theory was developed in order to find fundamental limits on compressing, reliably storing and communicating information data. Since its inception, it has been broadened to find applications in many other areas, including the machine learning field.
II.2.3.1 Shannon’s entropy based measures
Beginning with the fundamental Shannon entropy measure, the role of Information theory was at first very narrow. It was a subset of communication theory, the main purpose of which was to answer two fundamental questions:
1. What is the ultimate data compression that can be applied to a signal?
2. What is the ultimate transmission rate of signals on a wire?
But the mathematical techniques developed after Shannon's pioneering work so fruitfully that they were applied in various fields of investigation, including the Theory of Complexity. Entropy can be used for measuring the complexity of classification tasks because, according to the theory, it captures the characteristics of probabilistic models. It can be defined as follows:
Let x be a discrete variable which may take the values x1, …, xl, i = 1 … l, where l is the maximal number of possible values. To the i-th value xi of variable x is assigned the probability pi. Shannon's entropy then measures complexity as:
H(x) = -\sum_{i=1}^{l} p_i(x)\,\log_2 p_i(x)    (II.1)
The entropy varies from 0 to log2(l). An additional convention has to be adopted for the cases where pi(x) = 0: the corresponding term pi(x) log2 pi(x) is then set equal to 0. In the case of a uniform distribution the entropy is maximal, H(x) = log2(l). It is obvious that the greater the number of possible states, the greater the entropy. In classical works from information theory, entropy signifies the average amount of information required to select observations by categories (Krippendorff 1986).
Entropy may be standardized so that it ranges from 0 to 1, by dividing it by its maximum log2(l). This makes it easier to compare the amount of disorder of two systems when one system can take more states than the other.
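As a minimal numerical illustration of equation II.1 and of this standardized form (assuming a discrete probability vector as input), one might write:

import numpy as np

def shannon_entropy(p, normalize=False):
    # H(x) = -sum_i p_i log2 p_i (Eq. II.1); terms with p_i = 0 contribute 0.
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    h = -np.sum(nz * np.log2(nz))
    return h / np.log2(len(p)) if normalize else h

shannon_entropy([0.25, 0.25, 0.25, 0.25])               # maximal: log2(4) = 2 bits
shannon_entropy([0.7, 0.1, 0.1, 0.1], normalize=True)   # standardized to [0, 1]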
A nice property of Shannon's entropy is that the categories of the variable may be permuted without changing its value: only the relative frequencies matter. This is why the measure is said to be content-free. It does not make any assumptions about the distribution of the data; it thus belongs to the nonparametric family of statistical methods.
Entropy is interpreted in many different ways. It is a measure of the uncertainty tied to the observed system: the lower the entropy, the easier it is to predict the system's state, and conversely. It may also be interpreted as a measure of the disorder of a system or, in a very similar fashion, a measure of its variability; again, the lower the entropy, the more orderly the system, and conversely. An important note here is that we have to take into consideration the source/analyzing tool which provides the distribution pi(x) that describes the system in terms of the mentioned variables.
Shannon's entropy works very well for describing the order, uncertainty or variability of a single variable, but when we deal with more than one variable the following joint entropy is used. There are various entropies when considering two or more variables together: the joint entropy, the mutual information and the conditional entropy.
When considering two discrete variables x and y at the same time, it is possible to measure the degree of uncertainty or information associated with them. This is called the joint entropy, H(x,y). If the variables x and y may respectively take at most l1 and l2 possible values, with joint probabilities pij, the joint entropy is computed as:
H(x,y) = -\sum_{i=1}^{l_1} \sum_{j=1}^{l_2} p_{ij}(x,y)\,\log_2 p_{ij}(x,y)    (II.2)
where pij represents the probability of being classified in both category i of variable x and category j of variable y. The joint entropy varies from a theoretical 0 (empirically, from min{H(x), H(y)}) to log2(l1) + log2(l2). The relation between the individual entropies and their joint entropy is given by:
H(x,y) \leq H(x) + H(y)    (II.3)
It expresses the fact that the joint entropy is never larger than the sum of the individual entropies; equality in II.3 holds only when the two variables are independent. Despite the similar notation, the joint entropy should not be confused with the cross entropy.
Not only can the information of two variables be measured as a whole; one can also measure the amount of information of one variable knowing the other. This is called the conditional entropy.
This type of entropy relies on conditional probabilities, also called transitional probabilities. Suppose we want to compute the conditional probability of state i of variable x given state j of variable y; this is written as p(x|y) and is different from the joint probability. The conditional entropy is then:
H(x|y) = -\sum_{j=1}^{l_2} p(y_j) \sum_{i=1}^{l_1} p(x_i|y_j)\,\log_2 p(x_i|y_j) = -\sum_{i=1}^{l_1} \sum_{j=1}^{l_2} p_{ij}(x,y)\,\log_2 \frac{p_{ij}(x,y)}{p(y_j)}    (II.4)
The relationship between conditional entropy and joint entropy is as follows:
H(x|y) = H(x,y) - H(y)    (II.5)
The conditional entropy quantifies a reduction of uncertainty: the lower the conditional entropy, the better an observer can predict the state of a variable, knowing the state of the other variable.
Contrary to the joint entropy, the conditional entropy is not a symmetrical measure: H(y|x) ≠ H(x|y). Conditioning on one variable or the other does not give the same result, because each variable has its own entropy, H(x) and H(y).
The conditional entropy is the information particular to one variable, while the joint entropy is the total information carried by the two variables together.
Another commonly used Shannon-based measure is the mutual information. It measures the information shared by the variables, i.e. the quantity of information that an observer finds in common between two (or more) variables. For two variables the general formula is given as:
I(x,y) = \sum_{i=1}^{l_1} \sum_{j=1}^{l_2} p_{ij}(x,y)\,\log_2 \frac{p_{ij}(x,y)}{p_i(x)\,p_j(y)}    (II.6)
There are many ways of expressing this formula. It can be expressed as a relation between the individual entropies and the joint entropy: it is the sum of the individual entropies minus the joint entropy, as expressed by:
I(x,y) = I(y,x) = H(x) + H(y) - H(x,y)    (II.7)
In the case of three variables, the equation becomes:
I(x,y,z) = H(x) + H(y) + H(z) - H(x,y) - H(x,z) - H(y,z) + H(x,y,z)    (II.8)
When two variables are independent, the sum of their individual entropies is equal to the joint entropy and the mutual information is equal to zero.
Therefore, the best measure of the proximity between variables is the mutual information. It was shown in (Lemay 1999) that the mutual information is related to the likelihood ratio Λ by the following relation, where l is the number of observations:
\Lambda(x,y) = 2\,l\,\ln(2)\,I(x,y)    (II.9)
This is an important fact, since it links information theory with the statistical use of probability theory: the greater the mutual information, the more similar the two variables.
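The relations II.2, II.4, II.5 and II.7 can be checked numerically; the sketch below assumes the joint distribution is given as a matrix p_ij and derives all the quantities from it.

import numpy as np

def entropies(pxy):
    # Returns H(x), H(y), H(x,y), H(x|y) and I(x,y) from a joint distribution p_ij,
    # illustrating Eqs. II.2, II.4, II.5 and II.7.
    pxy = np.asarray(pxy, dtype=float)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    Hx, Hy, Hxy = H(px), H(py), H(pxy.ravel())
    return Hx, Hy, Hxy, Hxy - Hy, Hx + Hy - Hxy     # H(x|y) = H(x,y) - H(y)

# Example: two weakly dependent binary variables.
Hx, Hy, Hxy, Hx_given_y, Ixy = entropies([[0.4, 0.1],
                                          [0.1, 0.4]])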
II.2.3.2 Relative entropies
Another group of Shannon-like entropy methods is called the relative entropies. They are also known in the literature as divergences or distances. In probability theory, this type of measure belongs to the class of f-divergences, because it measures the difference between two probability distributions. Interestingly, these measures can be used not only in signal processing, but also in the analysis of contingency tables and particularly in pattern recognition (Lin 1991).
Suppose we compare two distributions: a "true" probability distribution p(x), and an
arbitrary probability distribution q(x). The relative entropy, information gain or Kullback-
Leibler's I-directed divergence (Lin 1991) formula is given by:
KL(p,q) = \sum_{i=1}^{l} p_i(x)\,\log_2\!\left(\frac{p_i(x)}{q_i(x)}\right)    (II.10)
where l is the maximal index of levels of the variables.
The relative entropy is non-negative and equal to 0 if both distributions are equivalent (p = q). The smaller the relative entropy, the more similar the distributions of the two variables, and conversely. It has to be noted that the measure is asymmetrical: the distance KL(p,q) is not equal to KL(q,p). If the distributions are not too dissimilar, the difference between KL(p,q) and KL(q,p) is small, and the distance is then equivalent to the χ2-statistic (relative to the sample size). To gain the property of symmetry, Kullback and Leibler actually define the divergence as:
KL(p,q) + KL(q,p) = \sum_{i=1}^{l} \left(p_i(x) - q_i(x)\right)\,\log_2\!\left(\frac{p_i(x)}{q_i(x)}\right)    (II.11)
An alternative approach is given via a variable 0 < λ < 1, defining the λ-divergence D_λ:
D_\lambda(p,q) = \lambda\, KL(p,\ \lambda p + (1-\lambda) q) + (1-\lambda)\, KL(q,\ \lambda p + (1-\lambda) q)    (II.12)
When λ = 0.5, D_λ becomes the Jensen-Shannon divergence (Fuglede and Topsoe 2004):
JSD(p,q) = \frac{1}{2} KL\!\left(p, \frac{p+q}{2}\right) + \frac{1}{2} KL\!\left(q, \frac{p+q}{2}\right)    (II.13)
or, in another form:
JSD(p,q) = \frac{1}{2} \sum_{i=1}^{l} \left[ p_i(x)\,\log_2\!\left(\frac{2 p_i(x)}{p_i(x)+q_i(x)}\right) + q_i(x)\,\log_2\!\left(\frac{2 q_i(x)}{p_i(x)+q_i(x)}\right) \right]    (II.14)
For classification problems, the Jensen-Shannon divergence performs better as a feature selection criterion for multivariate normal classes (Richards and Xiuping 2005). In fact, it provides both a lower and an upper bound for the Bayes probability of misclassification error ε (the importance of such approximations is outlined in Section II.2.4.1). This makes it particularly suitable for the study of decision problems (Lin 1991).
The Jensen-Shannon divergence is the square of a metric that is equivalent to the Hellinger distance:
HD(p,q) = \sqrt{\frac{1}{2} \sum_{i=1}^{l} \left(\sqrt{p_i(x)} - \sqrt{q_i(x)}\right)^2}    (II.15)
HD(p,q) satisfies the property 0 ≤ HD(p,q) ≤ 1. The Hellinger distance is equivalent to the Jensen-Shannon divergence, and the latter is also equal to one half of the so-called Jeffreys divergence, or Jeffreys-Matusita distance measure (Matusita and Akaike 1956):
JMD(p,q) = \sqrt{\sum_{i=1}^{l} \left(\sqrt{p_i(x)} - \sqrt{q_i(x)}\right)^2}    (II.16)
Equations II.12 - II.16 have been developed in order to detect small differences between two probability distributions. It is important to note that the Hellinger distance is related to another distance-based measure, the Bhattacharyya distance:
BD(p,q) = -\ln\!\left(\sum_{i=1}^{l} \sqrt{p_i(x)\, q_i(x)}\right)    (II.17)
where the expression under the logarithm is the Bhattacharyya coefficient. As in the cases of II.16 and II.17, the Bhattacharyya coefficient can be used to determine the relative closeness of two samples. It is shown in (Theodoridis and Koutroumbas 2006) that it corresponds to the optimum Chernoff bound, which is a more general case of the Bhattacharyya distance (Chernoff 1966).
It has also been theoretically demonstrated that the Bhattacharyya distance becomes proportional to the Mahalanobis distance I.6 (Theodoridis and Koutroumbas 2006).
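For two discrete distributions, equations II.10, II.14, II.15 and II.17 can be computed directly; the following sketch is a straightforward numerical transcription (the small eps only guards against log(0) and division by zero).

import numpy as np

def divergences(p, q, eps=1e-12):
    # Kullback-Leibler (II.10), Jensen-Shannon (II.14), Hellinger (II.15)
    # and Bhattacharyya (II.17) for two discrete distributions p and q.
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    kl = np.sum(p * np.log2(p / q))
    m = 0.5 * (p + q)
    jsd = 0.5 * np.sum(p * np.log2(p / m)) + 0.5 * np.sum(q * np.log2(q / m))
    hd = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    bd = -np.log(np.sum(np.sqrt(p * q)))
    return kl, jsd, hd, bd

kl, jsd, hd, bd = divergences([0.5, 0.3, 0.2], [0.3, 0.3, 0.4])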
Equations II.6 - II.17 originate from information theory; applied to the narrower sphere of statistical classification, they suit well the approximation of the Bayes error ε. In the form given above, they present an acceptable (from the point of view of computational/algorithmic difficulty) solution for two-class classification problems. However, in reality we deal with multi-class problems, and in this case a common solution (Swain and King 1973) is to use an average, aggregated value over all possible pairs of classes. For II.11, let K_{p,q} represent KL(p(x|a_p), q(x|a_q)), where a_p and a_q are class labels. If the maximal number of classes is l, the aggregated ratio of class separability (complexity - a more detailed description is given in Section II.2.4.1) for a multi-class classification problem is computed as:
K_{avr} = \frac{1}{l(l-1)} \sum_{p=1}^{l-1} \sum_{q=p+1}^{l} K_{p,q}    (II.18)
Similar to equation II.18 (Swain and King 1973) is equation II.19, which uses the frequently employed sum approach (Shi, Shu and Liu 1998); for example, defining JSD_{p,q} analogously to K_{p,q}, we obtain the following aggregation ratio:
JSD_{sum} = \sum_{p=1}^{l-1} \sum_{q=p+1}^{l} JSD_{p,q}    (II.19)
We have to mention that computing the final ratio of classification complexity in this way (II.18-II.19) has one disadvantage: when the pairwise values K_{p,q} or JSD_{p,q} vary much, the aggregated ratio is not representative. Equation II.20 represents the approach for obtaining an aggregated value that has been implemented in the new T-DTS version (Chapter IV):
JSD_{global} = \min_{p=1,\ldots,l-1} \left( \min_{q=p+1,\ldots,l} JSD_{p,q} \right)    (II.20)
In equation II.20, the aggregated complexity ratio is represented by the most difficult case of class separability. Thus, a value of zero indicates that there is at least one pair of classes with similar distributions p(x|c_p) and q(x|c_q) among all possible pairs. We also have to mention that the Kullback-Leibler relative entropy approaches are just a particular group of Renyi's entropies (Renyi 1960), which belong to a general family of functionals of order α for quantifying the diversity, uncertainty or randomness of systems.
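The three aggregation schemes II.18 - II.20 can be sketched as follows for a set of per-class distributions; the jsd helper repeats equation II.14, and the class distributions in the usage example are purely illustrative.

import numpy as np
from itertools import combinations

def jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence of two discrete distributions (Eq. II.14).
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log2(p / m)) + 0.5 * np.sum(q * np.log2(q / m))

def aggregate_pairwise(class_distributions):
    # Average (Eq. II.18), sum (Eq. II.19) and min-of-pairs (Eq. II.20)
    # of the pairwise separability values over all class pairs.
    labels = list(class_distributions)
    values = np.array([jsd(class_distributions[p], class_distributions[q])
                       for p, q in combinations(labels, 2)])
    l = len(labels)
    return values.sum() / (l * (l - 1)), values.sum(), values.min()

avg, total, worst = aggregate_pairwise({"c1": [0.6, 0.3, 0.1],
                                        "c2": [0.2, 0.5, 0.3],
                                        "c3": [0.1, 0.2, 0.7]})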
Taking into consideration this clue about the foundations of information theory, and more precisely the f-divergence aspect of complexity estimation measures, the complexity estimating techniques defined in II.11 and II.14 - II.17 are used in the current T-DTS version (Chapter III).
The next sub-section briefly highlights the main disadvantages of these Shannon-based complexity estimation measures grounded on Information theory.
II.2.3.3 Limits of the information theory to complexity estimation
Information-theoretical analyses consider variables as nominal ones, so any analysis dealing with truly quantitative variables may suffer a loss of information and power.
Another, similar point is the loss of meaning in the variables, especially when they are aggregated (II.18 - II.19), because these measures are very general and dedicated to the treatment of any information, whereas we deal with the more specific statistical classification case.
In the more general example of linear system modelling, when a correlation is computed the direction of the relationship is known: if the sign is positive the two variables vary together, and when it is negative they vary in opposite directions. Analyses of categorical variables seldom show the direction: for example, a high mutual information does not tell for which categories there is a strong association. But this is a problem of categorical variables, not just of the information-theoretical approach.
To make our view of the problem of defining complexity more closely related to classification problems within the frame of the T-DTS concept, we need to use data complexity measurement for the classification process in Machine Learning.
II.2.4 Complexity measurement in machine learning
In this section we investigate the role of data complexity in the context of classification problems. The classical Kolmogorov complexity is closely related to several existing universal principles, used not only in machine learning, such as Occam's razor of the simplest possible theory (Chaitin 2005) (the minimum description length) and the Bayesian approach. The difficulty of implementing these universal principles in terms of the Turing machine model drives us to the idea that data complexity should be defined based on a learning model, which is a more realistic approach.
In this context, the successful development of the T-DTS concept, including the implementation of real-world applications for classification tasks, is our particular aim. Therefore, the group of complexity measures overviewed in the previous section, used for more global problems such as data transmission, might also be applied to classification (labelled data) problems.
In practice, we approximate the data complexity measures given above for classification problems. The following section describes a more utilitarian range of methods focused on estimating the complexity of classification problems only.
II.2.4.1 Class separability measures: Bayes error
While the definition of an optimal complexity value for a problem, including classification tasks, is often difficult (Ho 2001), this difficulty does not arise from the lack of an appropriate metric (Ho 2002). Since the objective of pattern recognition problems is to minimize the number of classification errors, also called the error rate ε, the staple stone of statistical classification is the minimum error rate over the set of all classifiers (Ho and Baird 1994). This error is known as the Bayes error, ε. Thus ε becomes the proper measure for evaluating a given database of vectors S and their class-label set C for classification (Cacoullos 1966).
If ε were easy to estimate, the classification complexity problem would admit an alternative approach that minimizes ε (Ho and Basu 2002). Unfortunately, direct estimation of ε is usually difficult (Fukunaga 1972), (Young and Fu 1986), (Devroye 1987), (Therrien 1989). However, for purely scientific (not industrial) testing needs, we have implemented it in the last version of T-DTS using the highly time-consuming algorithm expressed by equation II.21.
Examination of the governing equations for ε shows where the difficulty of estimating ε arises. The misclassification rate for l classes (i being the class index), where a data vector s is represented as s = [s1, …, sdim]^T, is:
\varepsilon = \sum_{j=1}^{dim} \left(1 - \max_i p(c_i|s_j)\right) p(s_j)    (II.21)
where p(c_i|s_j) is the posterior probability and p(s_j) is the probability of attribute j of the data vector s, which is defined as:
p(s_j) = \sum_{i=1}^{l} p(s_j|c_i)\, p(c_i)    (II.22)
where p(s_j|c_i) is the probability that the j-th attribute of vector s belongs to class c_i, and p(c_i) is the class prior. Through Bayes' theorem, the posterior probability function is related to the class-conditional density by:
p(c_i|s_j) = \frac{p(s_j|c_i)\, p(c_i)}{p(s_j)}    (II.23)
The estimation of ε is difficult for the three following reasons: density estimation of p(s_j|c_i) is an ill-posed problem (Devroye 1987), the difficulty of numerical integration increases with dimensionality, and the class probabilities p(c_i) are needed.
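For a single discrete attribute with known class-conditional distributions and priors, equations II.21 - II.23 reduce to a few array operations; the toy two-class example below is only meant to illustrate the computation.

import numpy as np

def bayes_error(class_conditionals, priors):
    # eps = sum_j (1 - max_i p(c_i|s_j)) p(s_j)   (Eq. II.21)
    pc = np.asarray(priors, float)                        # p(c_i)
    psc = np.asarray(class_conditionals, float)           # rows: p(s_j | c_i)
    ps = psc.T @ pc                                       # p(s_j), Eq. II.22
    post = (psc.T * pc) / ps[:, None]                     # p(c_i | s_j), Eq. II.23
    return float(np.sum((1.0 - post.max(axis=1)) * ps))

# Two classes, one attribute taking three discrete values, equal priors.
eps = bayes_error([[0.7, 0.2, 0.1],     # p(s | c1)
                   [0.1, 0.3, 0.6]],    # p(s | c2)
                  [0.5, 0.5])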
Concerning the last reason, in many applications the p(c_i) values are unknown and must be approximated. Several techniques have been developed to circumvent the difficulties associated with direct estimation of ε, or that are not related to ε at all. The general taxonomy of the most commonly used techniques is:
• Indirect Bayes error estimation: mathematically based measures,
• Non-parametric Bayes error estimation and bounds,
• Intuitive measures.
Our main interest in estimating classification complexity is related to this last group, named Intuitive measures. It contains the following types of measurement: interclass distance measures (Fukunaga 1972), boundary methods (Sancho et al. 1997), space partitioning methods (Singh 2003), other techniques (Chen 1976), (Young and Fu 1986), and ad hoc measures (Ho 2000). The "other techniques" group exists because this family of methods is not yet well developed. Our research interest addresses the group of ad hoc methods, where the following trends can be named: length of class boundary (Friedman and Rafsky 1979), space covering by ε-neighbourhoods (Le Bourgeois and Emptoz 1996), feature efficiency (Ho and Baird 1998), hyperplane-number based complexity estimator (Zhao and Wu 1999), and ANN based complexity estimator (Budnyk, Chebira and Madani 2007).
As mentioned, our method of complexity estimation can be ranked with the ad hoc type of methods, which belong to the upper group of Intuitive measures; the latter is one of the three main directions of estimating class separability (classification complexity). We briefly overview this category of methods below.
II.2.4.1.1 Indirect Bayes error estimation: mathematically based measures
This group of measures is easier to calculate. Some of these measures bound ε, while others are justified by other considerations. The calculation of II.10 - II.20 can be simplified when the distributions are normal (Theodoridis and Koutroumbas 2006), but for the universal T-DTS concept, which deals with a priori unknown classification problems, this assumption cannot be made. Thus, the high computational cost of these measures, especially in cases of high-dimensional S and increasing numbers of classes, makes them difficult to use (Rybnik 2004), (Bouyoucef 2007). Of course, some simplifications unrelated to the Information-theory results II.10 - II.20 can be used for calculating an indirect Bayes error. One popular example is the normalized mean distance.
This measure has been used as a complexity estimator in T-DTS; it is shown (Rybnik 2004) to be a computationally cheap procedure. For two-class problems represented by vector instances s_1 ∈ S and s_2 ∈ S respectively, we have:
NDM_j(s_1(j), s_2(j)) = \frac{\left| \bar{s}_1(j) - \bar{s}_2(j) \right|}{\sigma_{s_1}(j) + \sigma_{s_2}(j)}    (II.24)
where j (1…dim) is the index over the feature space, \bar{s}_i(j) is the mean of the j-th component of the class-i vectors, and \sigma_{s_i}(j) is the corresponding variance. The aggregation of NDM_j over all j, and over class pairs in the multi-class case, into a final NDM(s_1, s_2) ratio is resolved in the T-DTS application in the way proposed by equation II.20.
We should mention here that II.24 is inadequate as a measure of class separability in cases where the classes are not distant, and especially when both classes have the same mean values. However, because of the low computational cost of this measure, we have mainly used this criterion for testing purposes.
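A direct transcription of equation II.24 (with the denominator taken, as stated above, as the sum of the per-feature class variances, plus a small eps against division by zero) and the min-aggregation of II.20 could look as follows; the random two-class data in the usage example are illustrative only.

import numpy as np

def normalized_mean_distance(S1, S2, eps=1e-12):
    # Per-feature normalized mean distance (Eq. II.24) between two classes,
    # aggregated by keeping the worst (smallest) feature, in the spirit of Eq. II.20.
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    ndm = np.abs(S1.mean(axis=0) - S2.mean(axis=0)) / (S1.var(axis=0) + S2.var(axis=0) + eps)
    return ndm, ndm.min()

ndm_per_feature, ndm_global = normalized_mean_distance(
    np.random.randn(50, 3) + 2.0,    # class 1 samples
    np.random.randn(60, 3))          # class 2 samples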
II.2.4.1.2 Non-parametric Bayes error estimation and bounds
The two main non-parametric techniques used for density estimation are k-Nearest Neighbours (kNN) and Parzen windows (Parzen 1962). Both techniques rely on the same underlying concept of setting a local region Г(s) around each sample s and examining the ratio of the samples enclosed to the total number of samples, normalized with respect to the volume v(s) of the local region. Let the resolution parameter B determine the division of the given initial space into Г(s) regions; v(s) then depends on the inner parameter of Г(s) and on the global resolution B.
At this point we come to the difference between the realizations of the Parzen and kNN techniques: Parzen fixes the volume v(s) of the local region, while kNN fixes the number of samples enclosed by the local region. Both ε-estimates are asymptotically unbiased and consistent (Fukunaga 1972), (Chernoff 1966). To provide an asymptotic bound of the Bayes error, the following density estimation function can be calculated:
\hat{p}(s) = \frac{K - 1}{m\, v(s)}    (II.25)
For the Parzen measure, II.25 is defined similarly. Therefore, using this logic of building the density-estimating function \hat{p}(s) of the kNN approach, together with the intent to approximate the Bayes error, the following generalized formula expresses the non-parametric Bayes measure:
\varepsilon = \frac{1}{m} \sum_{i=1}^{m} \left[ 1 - \max_j \left( \frac{v_j(s_i)}{\sum_{j=1}^{l} v_j(s_i)} \right) \right]    (II.26)
However, the variance of the Parzen estimates decays faster than that of k-NN (Fukunaga 1972). Nevertheless, k-NN is very often used, and there are three reasons for its common and widespread applicability:
1. The kernel, i.e. the local region size and shape, must be determined for Parzen estimation, which can be difficult (Chernoff 1966).
2. Using variably sized kernels, as is the case for kNN, helps to improve the accuracy in low-density regions (Fukunaga 1972).
3. It is shown (Cover and Hart 1967) that k-NN (for k = 1) asymptotically estimates the true Bayes error.
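One plausible reading of equation II.26 is to use, for each sample, the class fractions among its k nearest neighbours as a local posterior estimate; the sketch below implements that reading and is not the exact T-DTS implementation.

import numpy as np

def knn_bayes_error_estimate(X, y, k=5):
    # For each sample, estimate the local posteriors from its k nearest
    # neighbours and accumulate 1 - max posterior (spirit of Eq. II.26).
    X, y = np.asarray(X, float), np.asarray(y)
    classes = np.unique(y)
    err = 0.0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                                    # exclude the sample itself
        nn = y[np.argsort(d)[:k]]
        err += 1.0 - max((nn == c).mean() for c in classes)
    return err / len(X)

# Toy example: two overlapping Gaussian classes in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(1.5, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
eps_hat = knn_bayes_error_estimate(X, y, k=7)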
An ANN-structure based complexity estimating measure with a resolution parameter B, which must be preselected by the user, has been implemented in T-DTS. The above-mentioned methods of approximating the Bayes error belong to the group of Intuitive measures. Therefore, the next Section II.2.4.2 completes this direction of approximating the Bayes error by shortly describing additional methods.
II.2.4.2 Intuitive measures
The most intuitive measures are founded upon the idea that class separability increases as the class means separate and the class covariances become tighter. One of the first concepts of separability proposed was the use of second-order statistics. This approach does not perform well in situations not adequately described by this statistical assumption. However, there are several applications for which these techniques perform well and which, accordingly, give them a valid claim as a useful approach to classification complexity.
II.2.4.2.1 Scatter matrices measures
The concepts of mean separation and covariance tightness are expressed here by the scatter matrices (Fukunaga 1972). The expressions for the within-class, between-class and mixture (total) scatter matrices, Sw, Sb and Sm respectively, are:
S_w = \sum_{i=1}^{l} p(c_i)\, E\!\left[(s - M_i)(s - M_i)^T \mid c_i\right] = \sum_{i=1}^{l} p(c_i)\, \Sigma_i    (II.27)
S_b = \sum_{i=1}^{l} p(c_i)\, (M_i - M_0)(M_i - M_0)^T    (II.28)
S_m = S_w + S_b    (II.29)
where E[·] is the mathematical expectation of the vectors, the p(c_i) are the class distributions, the Σ_i are the i-th class covariance matrices, M_i are the class mean vectors, and M_0 = \sum_{i=1}^{l} p(c_i) M_i is the mathematical expectation over all classes.
Many intuitive measures of class separability come from manipulating these matrices, which are formulated to capture the separation of the class means and the compactness of the class covariances. Some of the most commonly used (Fukunaga 1972) measures are:
J_1 = tr\!\left(S_2^{-1} S_1\right)    (II.30)
J_2 = \ln\!\left(\frac{|S_1|}{|S_2|}\right)    (II.31)
J_3 = tr\!\left(S_1\right)    (II.32)
J_4 = \frac{tr(S_1)}{tr(S_2)}    (II.33)
where generally (Fukunaga 1972) S_1 = S_b and S_2 = S_m. When |S_2| = 0, for practical reasons a pseudo-inverted matrix S_1 or S_2 is used. J_1 and J_2 are invariant, while J_3 and J_4 are coordinate dependent (Fukunaga 1972).
The criteria of equations II.32 - II.33 are implemented in the current T-DTS version.
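Equations II.27 - II.33 translate directly into a few matrix operations; the sketch below uses S1 = Sb and S2 = Sm and falls back to a pseudo-inverse when Sm is singular, as suggested above. The synthetic two-class data in the usage line are illustrative only.

import numpy as np

def scatter_criteria(X, y):
    # Within/between/mixture scatter matrices (Eqs. II.27-II.29) and the
    # J criteria of Eqs. II.30-II.33 with S1 = Sb and S2 = Sm.
    X, y = np.asarray(X, float), np.asarray(y)
    classes, m = np.unique(y), len(X)
    M0 = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in classes:
        Xc = X[y == c]
        pc = len(Xc) / m
        Mc = Xc.mean(axis=0)
        Sw += pc * np.cov(Xc, rowvar=False, bias=True)
        Sb += pc * np.outer(Mc - M0, Mc - M0)
    Sm = Sw + Sb
    S1, S2 = Sb, Sm
    det1, det2 = np.linalg.det(S1), np.linalg.det(S2)
    J1 = np.trace(np.linalg.pinv(S2) @ S1)
    J2 = np.log(det1 / det2) if det1 > 0 and det2 > 0 else float("nan")
    J3 = np.trace(S1)
    J4 = np.trace(S1) / np.trace(S2)
    return Sw, Sb, Sm, (J1, J2, J3, J4)

Sw, Sb, Sm, J = scatter_criteria(np.random.randn(100, 3) + np.repeat([[0], [2]], 50, axis=0),
                                 np.repeat([0, 1], 50))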
Another group of intuitive measures is the bound methods.
II.2.4.2.2 Bound methods: PRISM approach
These methods count the number of samples in the overlap region of the sample distributions. This is done in a controlled way, where the boundaries are modified such that the overlap region progressively decays. The trajectory of this count is then examined to determine the separability of the classes. Information regarding class separation can be obtained from two possible sources:
1. The value of the trajectory.
2. The shape of the trajectory.
We have to mention Pierson's measure of classification complexity, called the Overlap Sum (Pierson 1998). The Overlap Sum is the arithmetical mean of the number of overlapped points with respect to the progressive collapsing iterations. This criterion does not require any exact knowledge of the distributions and is easy to compute. It has been shown (Pierson 1998) to have a strong correlation with the Bayes error (Rybnik 2004). Furthermore, in terms of computational complexity, it has an advantage over kNN (Bouyoucef 2007).
The following Space partitioning methods are quite new. They have been proposed for pattern recognition in Singh's work (Singh 2003). As shown by the example of Pierson's Overlap Sum method, the Space partitioning group of methods is more efficient compared to conventional data analysis approaches (Pierson 1998).
The space partitioning methods analyze classification problems by decomposing the feature space into a number of cells or boxes, which are then analyzed at different resolutions.
According to Singh (Singh 2003), the only other class separability measure based on feature space partitioning is the one proposed in (Kohn, Nakano and Silva 1996). This measure is called the Class Discriminability Measure (CDM). It is grounded on a data-adaptive partitioning of the feature space in such a way that regions containing samples from two or more classes are more finely partitioned than those containing a single class. CDM is based on the analysis of inhomogeneous buckets. The basic idea is to divide the feature space into a number of hyper-cuboids.
Each of these subspaces is termed a box. A given box at any stage is first tested on the basis of a number of stopping criteria. If the stopping criterion is not satisfied, the box partitioning continues. Once the stopping criteria are satisfied, all boxes that are inhomogeneous and not linearly separable are used to calculate the CDM, which is defined on the basis of the difference between the total number of samples in a box and the number of samples of the majority class, summed over all boxes (Kohn, Nakano and Silva 1996), (Singh 2003).
A class separability ratio calculated using a PRISM-based approach has been compared with the Bayes error ratio, as well as with the mentioned Fukunaga's scatter matrices measures, across a range of normal and non-normal data sets (Kohn, Nakano and Silva 1996). The main conclusion is that, when ranking the importance of features, the CDM measure is more in line with the Bayes error than the indirect Bayes error estimating criteria, and that it is attractive in use due to its reduced computational cost (Kohn, Nakano and Silva 1996), (Singh 2003).
The next paragraphs present a more sophisticated concept of feature space partitioning, which forms the Pattern Recognition using Information Slicing Method (PRISM) framework.
The feature space of dimension dim can be partitioned using subspaces with different topologies. In order to avoid the curse of dimensionality, a hyper-cuboid primitive is used in the PRISM framework (Singh 2002). The partitioning algorithm creates hyper-cuboids for the m data points (1 ≤ i ≤ m) in the dim-dimensional feature space, each of which can be assigned to one of the known l classes 1, …, l (l being the index of the maximal class). The extra parameter, the resolution B of partitions per axis (commonly 0 ≤ B ≤ 31), must be selected by the user in advance. The total number of boxes is K_total = (B + 1)^{dim}.
The difference of PRISM from CDM (Kohn, Nakano and Silva 1996) is that it does not start to split from the median position. Secondly, there is no stopping criterion, and the empty boxes are not analyzed. The measure of classification complexity is described on the basis of the above mentioned partitioning scheme. The ratio of the measures lies in the interval [0; 1]; a higher ratio indicates a simpler problem (Singh 2003[2]).
Since we have a space partitioned into hyper-cuboids by the resolution factor B, we can calculate the Purity measure, which defines how pure the data is within the hyper-cuboids. This method of calculation is more advanced than the CLC (Cluster Label Consistency) complexity estimate proposed in (Shipp and Kuncheva 2001).
Therefore, based on the initial PRISM clustering, we have the instances allocated in the boxes, and for each box we compute the category ratios. Consider box G_j (j being the index of the box, 1 ≤ j ≤ k_total) and let \eta_{ij} be the number of data points of class i available in box G_j, with 1 ≤ i ≤ l_j, where l_j is the number of classes present in box G_j. Then the probability of class i being allocated to box G_j is defined as:
p_{ij} = \frac{\eta_{ij}}{\sum_{i=1}^{l_j} \eta_{ij}}    (II.34)
For each box and for each class inside the j-th box, we apply equation II.34 and obtain a probability distribution vector p_{1j}, p_{2j}, \ldots, p_{l_j j}. Taking into account the required normalization (Singh 2003), the separability parameter for box G_j is calculated as:
S_{G_j} = \frac{l_j}{l_j - 1} \sum_{i=1}^{l_j} \left( p_{ij} - \frac{1}{l_j} \right)^2    (II.35)
Since each box G_j contains m_j instances of data out of the total number m, the overall purity of the different cells is given as:
S_G = \sum_{j=1}^{k_{total}} S_{G_j} \frac{m_j}{m}    (II.36)
To give the largest weight to the lowest resolution (Singh 2003), and to invert the meaning of the complexity value, equation II.36 takes the form:
S_{G\,norm.B} = 1 - \frac{1}{2^B} \sum_{j=1}^{k_{total}} S_{G_j} \frac{m_j}{m}    (II.37)
The next PRISM criterion, Neighbourhood separability, proposed by Singh in (Singh 2003), defines a classification complexity measure that depends on the concept of decision boundaries. It is very similar to Purity, which is why we have skipped its T-DTS implementation; however, the following PRISM Collective entropy, based on Information theory, is a very important measure that represents the order/disorder of the system (Singh 2003[2]).
In this particular PRISM case, the Collective entropy signifies the order/disorder accumulated at different resolutions, considering the non-empty cells for a given global resolution parameter B. For the probability distribution vector p_{1j}, p_{2j}, \ldots, p_{l_j j} produced by equation II.34, the entropy measure for each box G_j is calculated as:
HP_j = -\sum_{i=1}^{l_j} p_{ij} \log_2(p_{ij})    (II.37)
Let us note here that, generally, the base of the logarithm is not defined in (Singh 2003). Finally, the overall collective entropy, including the weighting factor and the required normalization, is calculated as:
HP_{G\,norm.B} = 1 - \frac{1}{2^B} \sum_{j=1}^{k_{total}} HP_j \frac{m_j}{m}    (II.38)
This keeps consistency with the other measures: the maximal value of 1 signifies complete certainty, while the minimum value of 0 signifies uncertainty and disorder. The utility of the PRISM framework and of the above complexity measures, Purity and Collective entropy, based on feature space partitioning, has been demonstrated in practice (Singh 2003) through their ability to predict the relevant level of classification error.
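The following minimal sketch illustrates, for a single resolution B, how the box probabilities of equation II.34, the per-box purity and entropy, and the normalized aggregates of equations II.36–II.38 could be computed; the scaling of the features to [0, 1], the handling of single-class boxes, the single-resolution weighting and the helper name are our own assumptions:

```python
import numpy as np

def prism_purity_entropy(X, y, B=3):
    """Sketch of the PRISM Purity (Eqs. II.35-II.37) and Collective entropy (Eqs. II.37-II.38)
    for a single resolution B, assuming features scaled to [0, 1] and (B + 1) partitions per axis."""
    m = len(X)
    cell_ids = np.floor(np.clip(X, 0, 1) * B).astype(int)       # box index along each axis
    boxes = {}
    for cid, label in zip(map(tuple, cell_ids), y):
        boxes.setdefault(cid, []).append(label)
    purity, entropy = 0.0, 0.0
    for labels in boxes.values():                                # only non-empty boxes
        mj = len(labels)
        _, counts = np.unique(labels, return_counts=True)
        p = counts / mj                                          # p_ij of Eq. II.34
        lj = len(p)
        if lj > 1:
            S_Gj = lj / (lj - 1) * np.sum((p - 1.0 / lj) ** 2)   # box purity, Eq. II.35
        else:
            S_Gj = 1.0                                           # assumption: a single-class box is fully pure
        HP_j = -np.sum(p * np.log2(p))                           # box entropy, Eq. II.37
        purity += S_Gj * mj / m
        entropy += HP_j * mj / m
    # weighting 1/2^B and inversion, following our reading of Eqs. II.37-II.38
    return 1.0 - purity / 2 ** B, 1.0 - entropy / 2 ** B
```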
The group of methods that cannot be ranged with the classification complexity measures mentioned above is organized under the group name Other methods. These methods may be very different in their origin. We do not try to provide a complete overview of them. Our aim is to give a clue about this group of complexity estimators and to define the type of estimators/criteria that are implemented in T-DTS. One of these measures is the well known Fisher linear discriminant ratio.
II.2.4.2.3 Fisher linear discriminant ratio and Ad hoc approaches
This classification complexity estimator originates from Linear Discriminant Analysis. Fisher (Fisher 1936) has defined the separation between two classes, represented by two vector-instances s_1 ∈ S and s_2 ∈ S respectively, as:
FDM(s_1(j), s_2(j)) = \frac{\left(\bar{s}_1(j) - \bar{s}_2(j)\right)^2}{\sigma_1^2(j) + \sigma_2^2(j)}    (II.39)

where j is an index of the feature space, \bar{s}_1(j), \bar{s}_2(j) are the means of the j-th component of the vectors, and \sigma_1^2(j), \sigma_2^2(j) are the variances of these components respectively.
A similar question arises about defining an aggregate ratio, as for II.8. In the framework of T-DTS, this problem is resolved in a similar way as for measure II.3. Although this complexity measure is easy to compute, it has the same disadvantages as the Normalized mean distance based measure.
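As a sketch of equation II.39, the per-feature Fisher ratio can be computed as below; the aggregation by taking the maximum over the features is only one possible choice, not necessarily the one used in T-DTS:

```python
import numpy as np

def fisher_ratio(X, y, c1=0, c2=1):
    """Sketch of the per-feature Fisher discriminant ratio of Eq. II.39 for two classes,
    aggregated (as one possible choice) by taking the maximum over the features."""
    X1, X2 = X[y == c1], X[y == c2]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2           # squared mean difference per feature j
    den = X1.var(axis=0) + X2.var(axis=0)                    # sum of per-feature variances
    fdm = num / np.where(den == 0, np.finfo(float).eps, den) # guard against zero variance
    return fdm, fdm.max()                                    # per-feature values and an aggregate
```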
Leaving the collective category named Other methods, the following paragraphs overview the Ad hoc measures in more detail. In the work (Ho 2000), there are three classes of Ad hoc complexity measures.
The first of them is named the Length of class boundary group. The Minimum Spanning Tree is one of its well known measures. Here, the Length of class boundary method based on the Minimum Spanning Tree (MST) (Friedman and Rafsky 1979) finds the length of the boundary between two classes. It constructs a tree that connects all sample points regardless of class; the length of the boundary is then given as the count of the edges that connect points belonging to two different classes (Singh 2003). The length of the class boundary normalized by the total number of points is used as a measure of classification complexity (Ho 2000).
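A minimal sketch of this boundary count, using a plain Prim's algorithm over the full distance matrix (our own implementation choice, suitable only for small samples), could look as follows:

```python
import numpy as np

def mst_boundary_length(X, y):
    """Sketch of the MST-based length of class boundary: build a minimum spanning tree
    over all points with Prim's algorithm, count the edges joining points of different
    classes and normalize by the number of points."""
    m = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # full distance matrix
    visited = np.zeros(m, dtype=bool)
    visited[0] = True
    best_from = np.zeros(m, dtype=int)       # tree vertex each outside point would attach to
    cost = d[0].copy()
    boundary_edges = 0
    for _ in range(m - 1):
        nxt = int(np.argmin(np.where(visited, np.inf, cost)))   # cheapest point to attach
        if y[nxt] != y[best_from[nxt]]:
            boundary_edges += 1                                  # edge crosses the class boundary
        visited[nxt] = True
        closer = d[nxt] < cost
        best_from[closer] = nxt
        cost = np.minimum(cost, d[nxt])
    return boundary_edges / m
```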
The second method of this group is known as the Space covering by ε-Neighbourhoods measure. The idea of this method is to generate an adherence set for each data point, which represents the maximum size hypersphere around a data point that contains samples of the same class and no other classes, see Fig. II.2 (Ho 2000).
Fig. II.2 : Retained adherence subset for two classes near the boundary
A list of the ε-Neighbourhoods (where ε represents the maximum possible distance allowed between sample points belonging to the same set) needed to cover the two classes is a composite description of the shape of the classes. The number and the order of the retained subsets form an extension for complexity measuring. For example, in the work (Le Bourgeois and Emptoz 1996), a measure is proposed that is computed as the average proportional size, in terms of the number of members in each hypersphere divided by the total number of data points used.
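A minimal sketch in the spirit of such a measure, with the sphere radius grown until a sample of another class is hit (our own simplification), is given below:

```python
import numpy as np

def adherence_measure(X, y):
    """Sketch of a space-covering style measure inspired by (Le Bourgeois and Emptoz 1996):
    for each point, the largest hypersphere containing only same-class samples is retained,
    and the average proportion of points it covers is reported."""
    m = len(X)
    proportions = []
    for i in range(m):
        d = np.linalg.norm(X - X[i], axis=1)
        other = d[y != y[i]]
        radius = other.min() if other.size else np.inf     # grow until another class is hit
        members = np.sum((d < radius) & (y == y[i]))       # same-class points inside the sphere
        proportions.append(members / m)
    return float(np.mean(proportions))
```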
The third group of methods is titled Feature efficiency. It is constructed for classification problems with a high dimensional feature space (Ho and Baird 1998). Feature efficiency based methods are concerned with how the discriminatory information is distributed across the features. The resulting characteristic of the efficiency of individual features describes how much each feature contributes to the separation of the classes.
The fourth measure is called the Hyperplane number based complexity estimator. This estimating approach was worked out in (Zhao and Wu 1999). The complexity ratio there is defined as k-degree linear separability if at least k hyperplanes are needed to separate the data linearly across the different classes. This approach is simplistic but easily computable.
II.2.5 Hierarchy of complexity estimation approaches
We would like to highlight the central role of the classification (class separability) complexity estimator. Even regardless of the T-DTS concept, most statistical pattern recognition methods use such measures. A brief overview of the global concept of complexity, as a formalism and as an inherent attribute of complex systems, leads us to a more general understanding of this phenomenon and, more precisely, to an understanding of classification complexity.
Fig. II.3 describes the updated taxonomy of the different complexity estimation approaches, in order to situate each measure and to describe the general hierarchy, starting from the global methods and finishing with particular, specific classification complexity measures.
Let us note that the complexity estimation approaches might include melting/overlapping concepts; moreover, some particular classification complexity methods can use conceptual ideas that belong to different branches of this hierarchy. For example, it is possible to ascribe the ANN based complexity estimator to the branch of the dynamic complexity concept (Lucas 2000), because of its use of a hardware implemented evolutionary net construction (De Tremiolles 1998).
These and other issues might prejudice the stratification given in Fig. II.3, and we share here the possible scepticism, because of the nature of the phenomenon of complexity. Any attempt to give a general high-level definition that at the same time generalizes and includes the various aspects of this phenomenon automatically implies uncertainty or subjectivism. For example, Kaufmann's definition of complexity, a sub-sub-class of Self-organizing complexity (Fig. II.3), defines complexity as the number of conflicts; but the natural questions that arise are how we calculate this number, based on which criterion, and whether this criterion is environment invariant. Thus, the output of Kaufmann's complexity measure is relative, because of its weakly defined origin. However, according to the taxonomy, at the lower levels of the proposed hierarchy complexity carries a more certain and utilitarian meaning, while ignoring some general aspects of the phenomenon of complexity.
To link the highly abstract, ill-defined concept with practical classification complexity measures in the frame of the classification problem, we propose the following brief summary:
Fig. II.3 : Taxonomy of classification complexity (separability) measures
1. The Bayes error (misclassification error) is the staple point of any classification complexity measure.
2. The concepts of data distribution analysis which approximate the Bayes error dominate in pattern classification.
3. These approximations might belong to more global complexity concepts, such as Information Theory, but the principal usage of those concepts does not relate to the classification problem and pattern recognition.
4. The range of novel methods (including, but not limited to, the well developed direction of space partitioning) may or may not (and this is important and has practical sense) be directly related to the golden standard of the Bayes misclassification error. This set of methods can provide efficient and data adaptive solutions for a "classifiability" (Singh 2003) measurement independent of the feature set (ideally, there should be one such measure).
Summarizing, the Bayes misclassification error is necessary, but not sufficient, for defining the classification view on complexity. The classification tasks can be purely separable, yet geometrically or topologically complicated (complex), meaning that Bayes error based methods are the most reliable, but not sufficient. Moreover, as we deal with computation in practice, there are many applied computational problems (simply put, the computational complexity of classification complexity) that drive the further discovery of alternative methods.
Table II.1 : Advantages and disadvantages of complexity estimating techniques

Technique: ANN-structure based
  Advantages: No probability distributions required. Directly related to Bayes error. Computation efficient.

Technique: Bhattacharyya bound based, Hellinger distance, Kullback-Leibler divergence, Jeffreys-Matusita distance, Bhattacharya distance, Mahalanobis distance
  Advantages: Directly related to Bayes error.
  Disadvantages: Probability distributions required. Limited to 2 classes.

Technique: Fukunaga's 4 scatter matrices criteria
  Advantages: Easy to compute.
  Disadvantages: Not equivalent to Bayes error. Limited to second order statistics.

Technique: PRISM (Collective entropy, Purity)
  Advantages: Provides relevant complexity levels for purely separable cases.
  Disadvantages: High computational costs; not equivalent to Bayes error.

Technique: Fisher linear discriminant based, Normalized distance based
  Advantages: Very easy to compute.
  Disadvantages: Not equivalent to Bayes error.
Table II.1 presents a quick snapshot analysis of the classification complexity methods, extracted from the thesis results of Pierson (Pierson 1998), Rybnik (Rybnik 2004) and Bouyoucef (Bouyoucef 2007).
The aim of Table II.1 is to give a brief view of the main classification complexity estimation criteria implemented in T-DTS. They are organized into five groups according to the taxonomy given in Fig. II.3 (the 18 implemented complexity estimators are marked there using a different box style and bold font; the exceptional kNN method of PRISM is not implemented and is also marked in a different way).
Complexity estimation is essential in classification. The principal idea of developing a large diversity of complexity estimating methods is to satisfy the different requirements of database decomposition and tree-structure building in the T-DTS framework. The complexity estimator is the engine of T-DTS, and the investigation of complexity measures is essential for correct T-DTS functioning.
In the taxonomy of Fig. II.3, the non-gradient, sharp bold areas include the complexity estimator methods that are an integrated part of the new T-DTS version.
The next Section II.3 is fully dedicated to our novel complexity estimator, named the ANN based estimating approach, which belongs to the ad hoc group of the intuitive measures.
II.3 ANN-structure based classification complexity
estimator
Anticipating possible criticism concerning the name ANN-structure based classification complexity, I underline that in this section of my thesis I am not focused on studying yet another statistical complexity measure built upon a combination of information measures. I am concerned with the fact that any concept of complexity (including its possible criticism) strongly depends on the chosen interpretation base of the measure (in the general case, the language), because the definition of complexity is exclusively founded on this interpretation base (Feldman and Crutchfield 1997), (Edmonds 1999). In the next paragraphs I focus on giving a proper definition of the ANN-structure based classification complexity estimator.
Essentially, by classification task complexity I mean a class separability criterion. Class separability is a classical concept in the field of pattern recognition, usually expressed using a scatter matrices approach (Fukunaga 1972). Because this separability measure is used in discriminant analysis, feature selection, clustering and a wide range of pattern recognition applications, this direction of research continues to develop. For example, in the past few years, Singh has proposed his PRISM-based method of computing class separability criteria (Singh 2003), which belongs to the group of Space partitioning methods of Fig. II.3. In contrast, the majority of methods have been investigated from the Bayesian standpoint and are strongly influenced by ideas and tools from statistical physics, as well as by information theory. While each of these theories has its own distinct strengths and drawbacks, there is little understanding of what relationships hold between them (Haussler, Kearns and Schapire 1994).
My novel, direct approach to classification task complexity estimation comprises the measurement of the computational effort required for successful classification using the "most efficient algorithm" (Boolos, Burgess and Jeffrey 2002). Theoretically, the "most efficient algorithm" has to appear in the previous sentence when we deal with complexity estimation. However, at the same time, according to Chaitin (Chaitin 2005), this is generally an idealistic expression, because proving the fact that our algorithm is "the most ..." depends entirely on the selected formalism or formal system (FS), which itself might be proven incomplete (Hofstadter 1999), (Godel 2001). Therefore, the role of the "most efficient algorithm" in our ANN-structure based classification complexity estimation technique is played by an algorithm that is user selected.
In our case of the ANN-structure based complexity estimator, we have chosen a kNN-based classification method. It is very successful for estimating an unknown data density function (Dong 2003), whether unimodal or a mixture of multi-modal densities. On the other hand, the number of samples required to obtain a good estimate grows exponentially with the dimensionality of the feature space, and this issue leads to an increasing computational cost of the kNN-classifier. That the estimation of the class density function requires extra learning effort in the multi-dimensional feature space case is to be expected in terms of Occam's razor principle, although Occam's razor does not tell us why this principle should work or why it is necessary. The idea that only the simplest underlying principle produces an enormous variety of a phenomenon, and in such a way exhibits its complexity, is central in Chaitin's view of computational complexity (Chaitin 2005). Thus, from the geometric viewpoint, the kNN-classifier is a simple, intuitive approach.
Historically, the ANN-structure based classification complexity estimator was first proposed and implemented using the IBM© ZISC®-036 Neurocomputer. That is why one may find the realized modules titled "ZISC" or "ZISC complexity estimator".
The main reasons why we have chosen this tool are the following:
1. Quality aspect: implemented Restricted Coulomb Energy (RCE) and KNN algorithms.
2. Control aspect: limited number of functions necessary to control the IBM© ZISC®-036 Neurocomputer.
3. Conceptual aspect: based on an ANN-evolutionary learning strategy (De Tremiolles 1998), (Madani, De Tremiolles and Tannhof 2001).
The following section describes in more detail the foundations of the ANN-structure based classification complexity estimator.
II.3.1 ANN-structure based complexity estimating concept
Before starting to discuss the complexity estimating algorithm named ANN-structure based, or ANN based, we briefly give an important clue about the computational complexity (not classification complexity) of artificial neural networks. According to the taxonomy in Fig. II.3, our novel approach belongs to the sub-set of computational complexity that takes into account the limits of computational resources.
II.3.1.1 Computability of ANNs
Early computing machines had computational limitations, so in the 1960s complexity theory was developed to chart the different subsets of the computable functions that could be obtained when restrictions are placed on computing resources.
Unlike digital machines, ANN models require a framework of computation defined on a continuous phase space, with dynamics characterized by the existence of real constants that influence the macroscopic behaviour of the system.
The ANN paradigm is an alternative to the conventional von Neumann model (Sima and Orponen 2003). ANN models have been considered more powerful than traditional models of computation. Depending on the implementation and the adjustment of the intra-parameters, they may compute more functions within the same given time bounds. However, in practice it is hard to rely on an approach which uses infinite precision real operations (Arbib 2003).
Generally, estimating the classification task complexity using neural networks is not an easy subject. In this work, I propose a different slant, using a neuroprocessor implementation of the Reduced Coulomb Energy (RCE) (Park and Sandberg 1991) and KNN (Dasarathy 1991) models. The analysis of the evolutionary process of constructing the neural network topology (Budnyk, Chebira and Madani 2007), (Budnyk, Bouyoucef, Chebira and Madani 2008) on the IBM© ZISC®-036 Neurocomputer allows us to extract information about the general complexity of a classification process and about the separability of the classes.
The next section introduces our novel approach to complexity estimation. In this approach, the analysis of the ANN-structure construction plays a key role. My prime interest is to give a precise definition of classification (class separability) complexity that can be measured by analyzing the evolutive process of the ANN-structure construction.
II.3.1.2 ANN based classification complexity estimating approach
In this section, my principal goal is to extract a ratio of classification task complexity. I perform my analysis taking into account the aspect of resource consumption, which is related to the sub-group of Computational complexities (Fig. II.3) processed on the deterministic Turing machine. The IBM© ZISC®-036 Neurocomputer was selected as the tool for ANN construction because of its high performance characteristics. Let me note that another ANN builder could be selected, but the core of the approach (the principal idea) remains the same. Another important note is that my aim is not to build a kNN-classifier for the problem, but to estimate the classification complexity by extracting information about its construction. In general, regardless of the ANN type, we learn a classification problem using the associated database and then estimate the task's classification complexity by analyzing the created neural network structure.
My method can be applied to any classification problem that is defined on the data level by a pair of sets S and C. The given database is composed of a collection of m objects, where each object (instance) s ∈ S is associated with an appropriate label (category) c ∈ C.
I expect (with the minor reservation that the selected learning technique, within its learning parameters, is appropriate) that a more complex problem will generally involve a more complex neural network structure.
The simplest feature of an RCE-KNN-like NN structure is n, the number of neurons. This criterion is a good candidate for the analysis not only because it is a resource criterion, but also because of the ZISC®-036 ANN-evolutionary based learning strategy (De Tremiolles 1998). Thus, the number of neurons reflects the complexity of the ANN structure. Taking these two facts into consideration, the following indicator is defined:
Q = \frac{n}{m}, \quad m \ge 1,\ n \ge 1    (II.40)
The idea of n = \hat{g}(m) was proposed by Zhao and Wu (Zhao and Wu 1999) for complexity estimation. In order to reserve a place for a more general ANN-structure, we set n = g(·), where g(·) is a function that estimates the complexity of the ANN classification structure. Arguments of this function might be: a signal-to-noise ratio, the dimension of the representation space, the boundary non-linearity and/or the database size, etc. Therefore, g(·) can combine several arguments, for example the number of neurons n and the length of the boundary line, and in such a way it can aggregate the key parameters (related to classification complexity) of several of the "Ad hoc" approaches (Fig. II.3).
In the first approach, I consider only the variations of the function g(·) with respect to the parameter m, supposing that for the case of the ZISC®-036 structure n = \hat{g}(m); therefore, our complexity indicator Q(m) is defined for g(m) ≡ \hat{g}(m) as follows:

Q(m) = 1 - \frac{g(m)}{m}, \quad m \ge 1,\ g(m) \ge 0    (II.41)
Here, the use of the m parameter assumes that our database is free of any incorrect or missing information. We also do not consider the informative experience issue (Scott and Markovitch 1989). Therefore, for the implementation and validation, we apply the Random strategy of classification problem representation.
I suppose that for problems of different classification complexity, the same increment of the parameter m produces a different g(m) behaviour, where a more complex classification problem involves a more complex ANN structure. I share the possible criticism of this utilitarian point of view on classification complexity, but let me mention that the ANN based complexity estimation is the responsibility of g(m). Thus, for the same fixed m value, problems of different classification difficulty are supposed to have different complexity indicators Q(m).
My approach is based on the dynamic analysis of the ANN construction. To provide a good illustration, I return at this point to our IBM© ZISC®-036 Neurocomputer implementation. The construction of the NNs corresponds to the building of a Voronoy polyhedron. Therefore, the RCE-KNN IBM© ZISC®-036 Neurocomputer maps the given dim-dimensional space to prototypes; this implementation of the classification process is represented by a network of neurons with influence fields shaped as hyper-polygons rather than hyper-spheres, see Fig. II.4. The 2D chart of Fig. II.4 (left) shows that the kNN algorithm leads to clustering the input space S of m instances/objects/vectors into Voronoy cells. Each cluster of the training points is labelled by a class (a category).
Fig. II.4 : Examples of Voronoy polyhedron for 2D and 3D classification problems
The analysis of the construction of such classification structures is based on the above mentioned indicator Q(m). Based on the RCE-kNN algorithm, each neuron of the classification structure allocates (captures) the maximal possible number of prototypes. For this purpose, a neuron of the RBF Network (RBFN) modifies its influence field. The final RBFN reflects the ability of the group of neurons of this network to arrange the given number of m prototypes into a minimal number of Voronoy polyhedron clusters. Each cluster corresponds to one neuron. This means that our analysis of the RBFN construction is done through the analysis of the behaviour of Q(m), because Q(m) represents the balance between the m input prototypes and the n Voronoy cells or neurons. Let me mention here that the neurons of the RBFN and the rules of their interactions are relatively simple, but the resulting interacting behaviour exhibits the overall classification complexity. Observing the dynamics and extracting information about its complexity is a typical approach for dealing with such complex systems as the brain, cellular regulatory machinery, ecosystems, etc. (Zhigulin 2004).
Concerning the RBFN behaviour, my expectation is to observe two stages (1 and 3) of the RBFN building and a transition phase (2), Fig. II.5:
1. Stage of RBFN process initialization.
2. Transitional phase.
3. Stage of RBFN process deployment.
Fig. II.5 : Q(m) indicator function behaviour
The first stage characterizes the state when neurons are easily generated in order to capture the new incoming prototypes, or to adjust the influence fields of a preliminary RBFN installation. During the second stage, the allocation of the remaining prototypes does not require a strong adjustment of the influence fields and, as a result, there are no new requests to add a neuron. We suppose, taking into consideration the different concepts/views on complexity, that the classification task's difficulty is determined by this transitional phase. One may find similarities between the RBFN initialization process, with its transitional step, and the effect of hysteresis that exists in the world of physics (Mielke and Roubicek 2003): a phenomenon which occurs in magnetic and ferromagnetic materials, and in the elastic, electric and magnetic behaviour of materials, as the subsequent effect of applying and removing a force or field; here, the most interesting fact is that this transitional phase is force-independent and is a characteristic of the material. The effect of hysteresis (similar to the RBFN installation) is observed in more general fields, such as economics (Jelassi and Enders 2004), energy, user interface design, electronics, cell biology (Pomerening, Sontag and Ferrell 2003), neuroscience and respiratory physiology (West 2008).
According to the definition of hysteresis (Bertotti and Mayergoyz 2006), this phenomenon can conceptually be explained, with a link to the RBFN initialization process, in the following way:
1. A system can be divided into subsystems or domains, much larger than an atomic volume, but still microscopic: in RCE-kNN terms, a domain is equivalent to a neuron, including its influence field.
2. Such domains form small isotropic regions: it means that each neuron has one category, or belongs to only one class label.
3. Each of the system's domains can be shown to have a metastable state: in terms of our ANN-structure construction, it means that a further increase of the m parameter (number of prototypes) does not influence its configuration.
4. The hysteresis is simply defined as the sum over all domains: in RBFN terms, this is the number of neurons.
Taking into account the broad range of real-world evolutionary processes, and the phenomenon of hysteresis with its properties, we expect to capture a similar behaviour of the RBFN. However, from our point of view, the key role in the RBFN construction is played by the transition between two stages: the first one is eruption, frenzy, and the second is synergy, maturity (Jelassi and Enders 2004).
The equation given below aims to capture this momentary transition phase, where j is an index of such a transitional point:
\frac{\partial^2 Q(m)}{\partial m^2} = 0    (II.42)
The point(s) m_j which satisfy the above equation have the following properties:
\forall m_j \ge 1,\ \exists\, \varepsilon_j \ge 1:\ \forall m \in (m_j - \varepsilon_j;\ m_j) \Rightarrow \frac{\partial^2 Q(m)}{\partial m^2} < 0,\quad \forall m > m_j \Rightarrow \frac{\partial^2 Q(m)}{\partial m^2} > 0    (II.43)
or, as another possible solution:
\forall m_j \ge 1,\ \exists\, \varepsilon_j \ge 1:\ \forall m \in (m_j - \varepsilon_j;\ m_j) \Rightarrow \frac{\partial^2 Q(m)}{\partial m^2} > 0,\quad \forall m > m_j \Rightarrow \frac{\partial^2 Q(m)}{\partial m^2} < 0    (II.44)
In practice, Q(m) is a piecewise function and we provide a piecewise approximation of Q(m) using a polynomial function. In the case of a more complex behaviour of Q(m) when we start learning, the polynomial power order must be selected in an appropriate way. We expect to meet an S-shaped curve at the final stage of the RBFN installation.
Equations II.43 and II.44 represent two cases. In the case of an odd power order greater than 2 (II.43), the approximated Q(m) approaches its limit while Q(m) monotonically grows. In the case of an even power order greater than 1 (II.44), the approximated Q(m) approaches its limit while Q(m) monotonically declines. We expect Q(m) to have a limit, for two reasons:
1. Theoretical: we suppose that for any given classification problem the algorithm of ANN-structure construction is ideal. For our ZISC®-036 tool, a Voronoy polyhedron that contains "the minimal possible number" of clusters for a given classification problem is not affected by any further increase of the parameter m.
2. Practical: in the real world, we have the same constraints for the classification task independently of the algorithm or its hardware implementation. Even if it were possible to construct a benchmark in which m has no limit, we still have limitations on the hardware memory, the execution time of the clustering algorithm, etc.
Concerning the first issue, let us note that this is a typical platonistic-like assumption, as in Kolmogorov's case of complexity with the belief in the existence of a "most efficient algorithm". However, it is known that the problem with Kolmogorov's definition of complexity, and with Kolmogorov-like approaches, is that there is no proof that a given algorithm (polyhedron) is the minimal one.
For the cases when j > 1, we define the transitional point m_0 as:

m_0 = \max(m_1, \ldots, m_j)    (II.45)

This minimax approach of selecting the maximum among the available points is similar to the idea of generalization expressed in equation II.20. Here, we select the representative ratio of classification complexity as the most complex among the available ones: since m_1 < \ldots < m_j, we take m_0 = m_j.
The main characteristic of the transitional point m_0 is the following:

\forall m > m_0:\ m \to +\infty \Rightarrow Q(m) \to const    (II.46)

It means that for m > m_0, m \to +\infty, the RBFN is in the stage of deployment. We expect that increasing the number of new incoming instances/prototypes does not significantly change the structure of the RBFN. It means that the classifier is in a dynamical phase where new incoming prototypes are easily assigned to the categories/clusters and do not significantly change the classifier's structure. In practice, const ≠ 0 in equation II.46 because of the second, practical reason mentioned above: m is a finite number, and const ≥ 1/m.
We state that m_0 plays the special role of a crash-point, a transitional point or, more precisely, a bifurcation critical point (Prigogine 2003) of the classification process.
Def: Q(m_0), with 0 < Q(m_0) ≤ 1, is defined as the coefficient of task classification complexity.
The interval of possible Q(m_0) values is similar to that of Fukunaga's (Fukunaga 1972) criteria of class separability. According to the given definition, a value of Q(m_0) close to 0 indicates the most difficult case of a classification task.
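As an illustration of equations II.41–II.45, the following minimal sketch fits a polynomial to the observed Q(m) values, locates the inflection points of the fit and returns the largest one as m_0 together with Q(m_0); the fitting degree, the helper names and the toy data are our own assumptions:

```python
import numpy as np

def transition_point(ms, ns, degree=4):
    """Sketch of the ANN-structure based indicator Q(m) = 1 - g(m)/m (Eq. II.41),
    a polynomial approximation of Q(m), and the transition point m0 taken as the
    largest inflection point of the fit (Eqs. II.42-II.45).
    ms: increasing learning-database sizes; ns: number of neurons built for each size."""
    ms, ns = np.asarray(ms, float), np.asarray(ns, float)
    Q = 1.0 - ns / ms
    coeffs = np.polyfit(ms, Q, degree)                 # piecewise behaviour approximated globally here
    d2 = np.polyder(np.poly1d(coeffs), 2)              # second derivative of the fitted polynomial
    roots = [r.real for r in d2.roots
             if abs(r.imag) < 1e-9 and ms.min() <= r.real <= ms.max()]
    if not roots:
        return None, None
    m0 = max(roots)                                    # Eq. II.45: m0 = max(m_1, ..., m_j)
    Qm0 = np.poly1d(coeffs)(m0)                        # complexity coefficient Q(m0)
    return m0, Qm0

# toy usage: a saturating neuron count produces the expected S-shaped Q(m)
ms = np.arange(10, 500, 10)
ns = np.minimum(40, 0.5 * ms)                          # hypothetical g(m): grows, then saturates
print(transition_point(ms, ns))
```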
The following sections are dedicated to the important question of the implementation of the ANN-structure based complexity estimator. We have selected the RBFN because it can be processed in a parallel, high-performance way. That is also why the ANN based complexity estimator was first implemented and tested using the IBM© ZISC®-036 Neurocomputer.
Afterwards, this ANN structure complexity estimator was implemented in the Matlab environment in order to extend the testing database and, more importantly, to embed it into T-DTS. Conceptually, these two implementations are similar; both of them employ the principal idea mentioned in the theoretical part of our work. However, there are some minor differences.
Concluding this section, let me mention that we were not the first to be interested in the asymptotic behaviour of a learning algorithm as m becomes large. In addition to the given definition of classification complexity, the intensive investigation of concept learning (Haussler, Kearns and Schapire 1994) undertaken by the broad research community has resulted in the development of at least two general viewpoints on the learning process, in terms of the learning curve and the cumulative mistake. The learning curve is viewed as the sample complexity of learning (Haussler, Kearns and Schapire 1994). The next section presents the hardware implementation of the ANN based complexity estimator using the IBM© ZISC®-036 Neurocomputer.
II.3.1.2.1. IBM© ZISC®-036 hardware implementation of ANN-structure based
complexity estimator
The IBM© ZISC®-036 exhibits very interesting characteristics. More precisely, this neurocomputer employs RCE-KNN evolutionary neural networks that, during the learning phase, partition the input space by prototypes, where each prototype is characterized by its own class (category) (Madani, De Tremiolles and Tannhof 2001). From the processing point of view, it means that during the kNN-like partitioning (learning) every neuron's threshold is adjusted. For each learning database of size m, after the learning process performed by the ZISC®-036 Neurocomputer, we obtain an RCE-kNN Neural Network structure that corresponds to a Voronoy polyhedron, and we apply the concept of complexity estimation described above. Afterwards, we compute the ANN based classification complexity ratio (coefficient/rate). The IBM© ZISC®-036 implementation of my estimator allows the user to easily conduct testing even for unseen prototypes. During this phase, it activates, or does not activate, the neighbourhood neuron(s).
The following section describes my software implementation of the ANN based estimator.
II.3.1.2.2 ANN-structure based classification complexity estimator. Software
implementation.
The difference between the software version of the ANN based complexity estimator and the IBM© ZISC®-036 Neurocomputer is that the first one does not allow the overlap of influence fields that may take place when one uses the IBM© ZISC®-036 Neurocomputer for prototype association (Lindblad and al. 1996). Thus, in the software ANN based complexity estimator, the MIF parameter is assigned to each prototype's centre. During the process of Voronoy polyhedron construction, the centre's influence field is adjusted (minimized down to the MIF limit) automatically. In the RCE-kNN-like ZISC®-036 Neurocomputer version, the influence fields of the neighbourhood neurons (prototypes with an associated category/class) are adjusted, but with respect to all neighbourhoods. If an instance/vector does not belong to any influence field, it is not recognized. If an input vector lies within the influence field of one or more prototypes associated with one category, it is declared as belonging to that category. If the input vector lies within the influence fields of two or more prototypes associated with different categories, it is declared as recognized, but not formally identified. Afterwards, during the learning process, the neurons adjust their influence fields. There is no change in the network content; only one or more influence fields are modified.
For both types of implementation of the ANN based complexity estimator, the process of classification complexity extraction is the same: we analyze the Voronoy polyhedron construction process. Then, using the function n = \hat{g}(m), we estimate the complexity of the created NNs. By analyzing the behaviour of the complexity indicator Q(m), we calculate m_0 and Q(m_0), the classification complexity ratio.
In the next section we compare our ANN based classification complexity estimator to the other classification complexity estimating techniques and give a comprehensive analysis of the estimators' advantages and disadvantages.
II.3.2 ANN-structure based complexity estimator compared to other approaches
The ANN-structure based method belongs to the "Ad hoc" group of methods of the Intuitive classification complexity measures class. The kNN-like hyperplane constructing method approximates the Bayes error. At the same time, for kNN, the ANN-structure based complexity estimating method becomes a variant of the "Ad hoc space covering" by ε-Neighbourhoods (Le Bourgeois and Emptoz 1996).
The ANN-structure based complexity estimation method, in its RCE-KNN-like implementation, can also be viewed as a further development of the related Zhao's and Wu's Hyperplane number based complexity estimating method (Zhao and Wu 1999). The given realization of our method is also related to Kaufmann's complexity concept. Thus, the process of constructing polyhedrons, which includes phases of adjustment of the influence fields of neurons/categories for a new prototype, can be treated as a conflict. Therefore, one may construct a function n = \hat{g}(m) using the total number of conflicts, according to Kaufmann's concept. However, according to the complexity taxonomy of Fig. II.3, Kaufmann's concept of complexity belongs to the very different and general category named Self-organizing complexity, although this does not mean that one cannot employ this conceptual idea in the classification complexity estimating context.
Finally, the relation between our method and the fundamental Kolmogorov definition of complexity shows itself at the moment of selecting the algorithm for the ANN construction. As has been shown, the ANN-structure based concept of the classification complexity estimator is related to several different groups of classification complexity estimation methods.
II.4 Conclusion
Complexity is an inherent phenomenon of complex systems; to date, the quantification of complexity remains an uncertain and non-trivial task. Even though the foundations of complexity theory are still not well established, a pragmatic look at the existing applications of the complexity paradigm could be used to build the taxonomy of classification complexity measures described in Fig. II.3. Using this taxonomy, we investigated methods for the complexity estimation of class separability as applicable to T-DTS. We reviewed a set of methods ranging from popular techniques, like Fukunaga's scatter matrices discriminators, to the most recent methods, like PRISM.
Using the insights gained in this review, we proposed a novel complexity estimation method that is based on Artificial Neural Network (ANN) structures and designed to further improve the effectiveness of the T-DTS technique. My ANN-structure based classification complexity estimation method extracts information about complexity by analyzing the dynamic process of ANN-structure construction. For this, we employ an additional function, in particular the number of neurons/categories of the evolutionary RCE-KNN used for the Voronoy polyhedron construction.
Besides the complexity estimation aspects, there are other issues that have a crucial impact on the overall performance of T-DTS. An important note is that T-DTS, as an approach to the construction of an ensemble of classifiers, is driven by a classification complexity estimating technique; but let us mention in this chapter that one might also use the construction of NNs (or other classifiers) in order to estimate the classification complexity (Tumer and Ghosh 2003). In the next chapter, we study one of these issues, dealing with the implementation of T-DTS.
Chapter III:
T-DTS software architecture
Practical issues of the T-DTS implementation are of great importance, because even a minor problem in an implementation of T-DTS for classification tasks can have a major impact on the overall classification performance of the technique. Since one of the objectives of this work is minimizing errors in classification tasks, this chapter is dedicated to the software architecture of T-DTS. The chapter transitions from the high-level concepts of Chapter II to the software architecture of T-DTS, to its implementation, and to the usability aspects of the approach.
In our software implementation, we simplify the general T-DTS functioning scheme described in Fig. II.2 by omitting the second decomposition feedback loop of processing model conformity. Taking into account the scope of T-DTS application (i.e., classification tasks only, using neural network based classifiers), the conceptual diagram of the proposed T-DTS implementation is shown in Fig. III.1.
The diagram consists of four main operational phases (blocks): data pre-processing (when required), learning, complexity estimation and acting (processing). The aim of these phases is to shape an ANN structure that reflects the general complexity of a given classification task. The behaviour of the T-DTS ensemble of NN models, sculpted over the decomposed databases, is associated mainly with the Complexity Estimation Agent (CEA). In Fig. III.1 it is marked as the "Complexity Estimation Loop".
In the diagram, the complexity estimation loop (Complexity Estimation Agent, CEA) plays the central role. It shapes the behaviour of the T-DTS ensemble of NNs by controlling the T-DTS database decomposition and tree organization by means of a feedback.
Fig. III.1 : Diagram of T-DTS implementation for classification tasks. Main blocks: Preprocessing (normalizing, removing outliers, Principal Component Analysis), Learning Phase (feature space splitting, NN based models generation, structure construction), Complexity Estimation Loop, Operation Phase, Processing Results. Data flows: Data (D), Targets (T), Preprocessed Data (PD), Prototypes (P), NN Trained Parameters (NNTP).
According to the basic tree-like decomposition of T-DTS by "divide to simplify" (Rybnik 2004), the decomposition methods of T-DTS belong to the first sub-category, "Agglomerative vs. divisive", of the decomposition techniques of Density-based Hierarchical methods (Section II.2). Any divisive decomposition method proceeds in a top-down manner, first placing all the data, or a starting decomposition, in a single cluster and then splitting the cluster into a larger set of clusters until a stopping criterion is satisfied. However, we should mention that the first sub-category of "Agglomerative vs. divisive" approaches consists of a wide range of different methods that extend the base of decomposition methods.
The main "Agglomerative vs. divisive" methods are called Linkage based methods, because of the heuristic model based clustering sub-methods, where the data is assumed to be generated by a mixture of component probability distributions, and because the majority of the classical Density-based methods are variants of "Agglomerative vs. divisive" methods. Let me note that some authors define Linkage based methods as Model-based agglomerative methods (Kamvar, Klein and Manning 2002).
Linkage based methods contain four types of approaches (Jain, Murty and Flynn 1999): Single-link (Sneath and Sokal 1973), Complete-link (King 1967), Group-average (Johnson and Kargupta 2002) and Ward's method (Ward 1963). These four methods are based on computing an approximate maximum of the classification likelihood (Kamvar, Klein and Manning 2002).
Generally speaking, the merging in Model-based agglomerative methods is done on the basis of the higher likelihood ratio. Therefore, similarly to the divisive methods, each sub-approach (method) employs its own function Df():
• Single-linkage method: the distance Df() between two clusters/groups G_i and G_j is the minimum over all pairs of vectors s_i and s_j that belong to G_i and G_j respectively:

Df(G_i, G_j) = \min_{\forall s_i \in G_i,\ \forall s_j \in G_j} d(s_i, s_j)    (III.1)

In this case, two clusters are merged to form a larger cluster based on the minimum distance criterion.
• Complete-linkage method: the distance between two clusters is the maximum over all pairs of objects in the clusters:

Df(G_i, G_j) = \max_{\forall s_i \in G_i,\ \forall s_j \in G_j} d(s_i, s_j)    (III.2)
• Group-average, or simply Average-linkage, method: the distance between two clusters is the average over the pairs of objects in the clusters:

Df(G_i, G_j) = M_{\forall s_i \in G_i,\ \forall s_j \in G_j}\left( d(s_i, s_j) \right)    (III.3)
• Ward's method uses the error sum-of-squares (ESS), which is given by:

Df(G_i, G_j) = ESS(G_i \cup G_j) - ESS(G_i) - ESS(G_j)    (III.4)

where the ESS() is defined as:

ESS(G) = \sum_{s_i \in G} \left( s_i - M(G) \right)^2    (III.5)

and M(G) is the sample mean of the data points in the cluster G.
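The four dissimilarities of equations III.1–III.5 can be sketched directly; the function names below are ours, and the Euclidean distance is assumed for d(·,·):

```python
import numpy as np

def df_single(Gi, Gj):      # single-linkage, Eq. III.1
    return min(np.linalg.norm(a - b) for a in Gi for b in Gj)

def df_complete(Gi, Gj):    # complete-linkage, Eq. III.2
    return max(np.linalg.norm(a - b) for a in Gi for b in Gj)

def df_average(Gi, Gj):     # group-average linkage, Eq. III.3
    return np.mean([np.linalg.norm(a - b) for a in Gi for b in Gj])

def ess(G):                 # error sum-of-squares, Eq. III.5
    G = np.asarray(G)
    return np.sum((G - G.mean(axis=0)) ** 2)

def df_ward(Gi, Gj):        # Ward's criterion, Eq. III.4
    return ess(np.vstack([Gi, Gj])) - ess(Gi) - ess(Gj)
```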
To obtain better Density-based Hierarchical methods, several advanced approaches adopt more complicated measurements in order to determine the dissimilarities between clusters. Examples are Clustering Using Representatives, CURE (Guha, Rastogi and Shim 1998), Robust Clustering Using Links, ROCK (Guha, Rastogi and Shim 2000), Balanced Iterative Reducing and Clustering Using Hierarchies, BIRCH (Zhang, Ramakrishnan and Livny 1996), and the Hierarchical Clustering Algorithm Using Dynamic Modeling, CHAMELEON (Karypis, Han and Kumar 1999). This list of methods is given rather to describe various possible directions of hierarchical clustering than to provide a common clue. A good summary of these methods is presented in (Ding 2007). Basically, variants of hierarchical clustering algorithms can also be distinguished by their (dis)similarity measurement and by how they deal with the (dis)similarity measures between existing clusters and merged clusters.
Summing up, the divisive type of universal decomposition of T-DTS provides the opportunity to employ different types of decomposition methods, because the T-DTS concept is grounded on the principle that (simply put) "splitting a database into sub-databases must decrease the overall database complexity".
In fact, the set of T-DTS decomposing and processing methods consists of small and specialized mapping neural networks (Madani and Chebira 2000), (Madani, Rybnik and Chebira 2003). In Fig. III.1 the corresponding block is called Neural Network based Models (NNM). Supervised by the CEA, T-DTS decomposes the feature space into a set of simpler (according to the principal concept) sub-spaces. Thus, at the leaf level of the obtained tree one finds the NNM, and at the node level one finds the Decomposition Agent (DA). In the T-DTS realization, the DA is also NN-based, more precisely a Kohonen self-organizing map, but to distinguish the decomposing and processing phases we have given different names to these two blocks. Let us also note that the DA might not be NN-based. Therefore, dynamically constructing a tree by means of the CEA that controls the decomposition of the learning database, and then building a set of specialized NNM trained on the generated sub-databases, is the way in which T-DTS can be viewed as a complex ensemble of neural networks. Fig. III.2 illustrates the T-DTS learning process.
More generally, the combination of the CEA, the splitting (DA) and the learning capabilities (NNM) confers on the resulting intelligent systems their self-organizing ability (Bouyoucef, Chebira, Rybnik and Madani 2005).
Fig. III.2 shows the learning phase, at the end of which a set of neural network based models (trained on sub-databases) is available that covers (models) the behaviour region by region in the problem's feature space. It is important to note that the initial complex problem has been decomposed recursively into a set of simpler sub-problems.
The initial feature space S is divided into N sub-spaces (the number of classifiers and the number of learning sub-databases/clusters is the same). For each subspace r (an index), T-DTS constructs a neural network based model describing the relations between inputs and outputs. An important remark here is that a neural network based model may fail to be built for some obtained sub-database.
Fig. III.2 : Scheme of the T-DTS learning concept. The Learning Database is recursively split by Decomposition Agents (DA) into learning sub-databases, each of which is handled by a Neural Network based Model (NNM).
Theoretically, at the conceptual level, a new decomposition can be performed on a sub-space, dividing it into several other sub-spaces. The learning phase can be considered as a self-organizing model generation process, which leads to the organization of a set of NNM. The T-DTS operating phase is depicted in Fig. III.3.
Fig. III.3 : T-DTS operating phase. The Control Unit routes the input Ψ(t) along control and data paths to the appropriate model among NNM_1, …, NNM_r, …, NNM_N, producing the outputs U_1, …, U_r, …, U_N.
The second operation mode, as mentioned, corresponds to the use of the constructed ensemble of NNM. According to the work (Madani, Chebira and Rybnik 2003), during the learning phase the Control Unit (CU) constructs the learning sub-sets, Fig. III.3. During the processing phase, the CU receives unlearned input vector(s) or sub-sets and then sends (switches) them to the most appropriate neural processing unit NNM.
The given description can be formalized in the following way. Let Ψ(t) be the system's input (Ψ(t) ∈ ℝ^{n_Ψ}), an n_Ψ-dimensional vector, and let U_r(t) ∈ ℝ^{n_u} be the r-th model's output vector of dimension n_u, where r ∈ {1, …, N}. In general, n_Ψ is not equal to dim, because the DA might modify the dimension, meaning that here we reserve room for a sophisticated DA.
Let F_r(·): ℝ^{n_Ψ} → ℝ^{n_u} be the r-th NNM's transfer function. Let CU[Ψ(t), β, ξ] ∈ {0, 1}^N be the decision output of CU[·], which depends on a set of extra parameters β and/or on some conditions ξ. β_r represents a particular r-th value of the parameter, and ξ_r denotes the corresponding particular condition ξ, obtained from the learning phase for the r-th sub-dataset. Taking into account the above defined notation, the control unit output CU[·] can be formalized as:
CU[\psi(t), \beta, \xi] = (u_1, \ldots, u_r, \ldots, u_N)^T    (III.6)

where

u_r = \begin{cases} 1, & \text{if } \beta = \beta_r \ \&\ \xi = \xi_r \\ 0, & \text{else} \end{cases}    (III.7)
A CU response corresponds to some particular values of the parameter β and the condition ξ (e.g. CU[Ψ, β_r, ξ_r] activates the r-th NNM), and so the processing of an unlearned input datum conforming to the parameter β_r and the condition ξ_r will be given by the output of the selected NNM:

U(\psi, t) = F_r(\psi(t))    (III.8)

We highlight that the activation of an appropriate NNM by the CU does not depend on the complete vector but on the partial input vector Ψ(t).
Concerning the formal model of the DA, the three available splitting methods realized in T-DTS (Section III.2) are based on the advantages of unsupervised Kohonen Self-Organizing Maps (SOM) and their properties. The splitting is performed on the basis of Similarity Matching (Madani, Chebira and Rybnik 2003). Thus, the tree-like decomposition is performed under the control of a SOM-based DA. As an output we have the set of clusters G_i, each of them represented by a vector W_i, where W_i is a centroid. Let us suppose that after the learning phase we have N clusters in our final tree structure. In this case, for a new unlearned Ψ(t), CU[·] activates the appropriate function based on the criterion of similarity expressed as:
92
⎢⎣
⎡ −=−=∈∀
elseWtWtif
WtuNi iiii ,0
)(min)(,1)),((:
ψψψ (III.9)
where |·| denotes the distance between vectors, which may be calculated using the different metrics II.1–II.9.
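A minimal sketch of this routing, under the assumptions that the trained NNM are available as callables and that the Euclidean norm is used for |·|, could look as follows:

```python
import numpy as np

def control_unit(psi, centroids, models):
    """Sketch of the CU routing of Eqs. III.6-III.9: the unlearned input psi is sent to the
    NNM whose DA centroid W_i is closest, and that model's output U is returned.
    `models` is a list of callables (the trained NNM_r), `centroids` the SOM centroids W_i."""
    dists = [np.linalg.norm(psi - W) for W in centroids]   # |psi(t) - W_i|
    r = int(np.argmin(dists))                              # u_r = 1 for the closest cluster
    return models[r](psi)                                  # U(psi, t) = F_r(psi(t)), Eq. III.8

# toy usage with two hypothetical constant models
centroids = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
models = [lambda x: "class A", lambda x: "class B"]
print(control_unit(np.array([4.5, 5.2]), centroids, models))   # routed to the second NNM
```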
Being the most important T-DTS module/unit, the CEA requires a parameter θ, the threshold of complexity. This parameter determines which complexity ratio/rank the decomposed sub-databases should have. Henceforth, the θ threshold is a ratio such that θ ∈ [0; 1], where 1 signifies the simplest case and 0 a very complex one.
The θ threshold plays a crucial role: it controls the database decomposition. Therefore the θ threshold can also be called the parameter of database divisibility or decomposability, because an arbitrarily selected complexity estimating technique might provide only an approximate class separability measure. For example, the Maximal standard deviation criterion, which in fact is not able to detect the "true" class separability, can still play this role, according to the results obtained in the work (Rybnik 2004).
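The way the θ threshold drives the recursive decomposition can be sketched as below; the callables complexity, split and build_model, as well as the depth limit, are hypothetical placeholders for the corresponding T-DTS units (CEA, DA, NNM):

```python
def t_dts_decompose(X, y, complexity, split, build_model, theta=0.7, depth=0, max_depth=8):
    """Sketch of the theta-threshold controlled recursion: if the estimated complexity of a
    (sub-)database is acceptable, a local model is built; otherwise the DA splits it further.
    `complexity`, `split` and `build_model` are user-supplied callables (hypothetical names)."""
    q = complexity(X, y)                     # e.g. a complexity/separability ratio in [0, 1]
    if q >= theta or depth >= max_depth:     # 1 = simplest case, so a high q means "simple enough"
        return {"leaf": build_model(X, y)}
    children = []
    for Xi, yi in split(X, y):               # SOM-based DA producing learning sub-databases
        children.append(t_dts_decompose(Xi, yi, complexity, split, build_model,
                                        theta, depth + 1, max_depth))
    return {"node": children}
```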
This θ threshold must be selected in advance, but there is no a priori information on how θ should be set. If the complexity threshold requires operating with the lowest possible complexity of the learning sub-databases, θ is preset to 1. The result is then generally a huge tree structure with very small database sizes at the leaf levels. Usually, by comparison with decision tree methods, such a tree structure requires further pruning. In terms of computational resources, this is a costly processing procedure. Moreover, the success or failure of these methods depends on a correctly constructed learning database.
Returning to T-DTS, a θ threshold equal to 0 signifies the incapability to initiate the T-DTS recursive database decomposition. Another remark that must be made concerns the relativity of the θ measure. Different types of CEA produce complexity ratios that cannot be compared, because of the nature of the phenomenon of complexity and the weak or differing definitions of the applied complexities (e.g. Information theory and Kolmogorov's complexity): the θ measure is relative. Therefore, the problem naturally appears of employing a non-heuristic and user-free procedure that searches, for the pre-selected parameters, for the maximum (in terms of performance) of the T-DTS output.
In the next section, I propose a solution for this problem. It is a semi-automated, user-free procedure that finds a quasi-optimal value of the θ threshold.
III.1 T-DTS concept: self-tuning procedure
T-DTS version 2.00 was released and adapted for classification problems. Based on the formal model for the learning and processing phases described above, the user has to define the global parameter θ, the complexity threshold. The value of this parameter is unknown a priori and differs for the different CEA modules/units. Within the T-DTS framework, the user relies on a "cut and try" method to find an "optimal", or more precisely a quasi-optimal, θ threshold at which an initial supposition is fulfilled. We place the word "optimal" in quotation marks because, in general, the existence of such a threshold is not evident.
Briefly, Section III.1 is dedicated to solving the problem of an automatic search for the "optimal" θ threshold for which T-DTS produces a tree structure such that the classification error assigned to this structure's ensemble of NNM reaches its minimum. We describe a semi-automatic (because some global parameters of this procedure still have to be pre-defined by the user) complexity-threshold adjusting procedure that is deterministic and free of heuristic user control.
The main disadvantages of the user-based heuristic search for a quasi-optimal θ threshold are:
• a priori analysis of information about the given database is required in order to initiate the search;
• there is no universal complexity estimator: the complexity-estimation methods applied to a classification task depend on the nature of the data, the coordinate system, etc. (a list of desirable, but not mandatory, properties of classification complexity estimators / CEA can be found in (Fukunaga 1972)). These and other features of the CEA must be selected by the user and taken into account when an operator searches for the quasi-optimal θ.
In our framework, we deal with θ-threshold optimization. This belongs to the group of optimization problems in which the solution space includes the different θ values varying from 0 to 1 together with the T-DTS decomposition and processing modes, and our goal is to find the best possible solution(s).
III.1.1 Self-tuning procedure as optimization problem
Optimization methods are usually preoccupied with finding the best possible solution. The actors in nature (living organisms), on the other hand, usually seek only an adequate solution. Genetic algorithms and other biologically inspired methods for search and optimization adopt this approach implicitly. The problem is that "adequacy" remains to be defined. For instance, an NP-hard problem (a general optimization problem is NP-hard) becomes tractable if we are content to accept "good" solutions rather than perfect ones (Green and Newth 2001). Therefore, the role of an optimization procedure, for us, is to extract (i.e. explore adequacy) information, where possible, about an optimization problem while avoiding NP-hardness. Our T-DTS optimization algorithm should be designed to search effectively for a plausible solution in the solution space. More precisely, dealing with classification problems and applying the divide-and-conquer paradigm to T-DTS, we have inherited NP-hardness (it is NP-hard to find a truly optimal partitioning of a given multi-dimensional space (Garey and Johnson 1979)). Thus, for a given classification dataset and otherwise fixed T-DTS settings, the solution lies in the solution space of a CEA selection and an optimal θ threshold.
Regardless of the structure of the general optimization problem, the problem may actually have multiple quasi-optimal solutions, and the quality measure (classification error ratio) may have a high standard deviation, or the provision of several near-optimal solutions may be more desirable than the detection of one (completely) optimal solution. Often, an expert may want to choose from such a set of (near-)optimal solutions. In this case, a learner would be required to find not just one solution, but rather a set of several different (near-)optimal solutions (Butz 2001).
In conclusion, the θ-threshold optimization problem is the problem in which the best possible solution must be found from a given θ-threshold space while the other parameters of T-DTS are fixed. T-DTS feedback is available in the form of learning and testing rates, and of the CEA outputs for the decomposed dataset. This feedback may or may not provide hints about where to direct the further search for the optimal solution. Finally, the number of optimal solutions may vary and, depending on the problem, one or many optimal (or near-optimal) solutions might be found. The disclosed methods for optimization (self-optimization) of dynamic systems such as T-DTS rely on a deep analysis of the feedback, covering not only the learning and testing performance characteristics, but also the variations of the prescribed variables (Chou 1983).
Optimization is possible because the general T-DTS approach is based on the assumption that a functional relation exists between the θ complexity threshold, the T-DTS tree structure, and the classification rates. Each time we modify the θ parameter for a range of fixed parameters, including the CEA, T-DTS builds a different tree with a different depth and structure. The T-DTS user may primarily consider only one output regardless of the tree: the minimization of the accuracy error.
III.1.2 Self-tuning procedure description
In the previous T-DTS implementation, v. 2.00, the user relied on a "cut and try" method to find a quasi-optimal θ threshold, and there was no possibility to handle the optimization in a well-formalized way. Therefore, for T-DTS v. 2.50, we developed an algorithmic procedure that resolves this problem in a semi-automated way; the procedure is not yet fully automatic, as the user still has to define several other parameters, particularly the complexity estimation method and the decomposition unit.
The goal of the proposed procedure is to eliminate user supervision during the search for a quasi-optimal θ threshold that corresponds to a quasi-optimal classifier tree structure.
Fig. III.4 : An example of maximal possible decomposition tree
The procedure works as follows. First, we preset the θ threshold to 1. The result of the corresponding T-DTS decomposition is shown in Fig. III.4 and is called the maximal possible decomposition tree, because T-DTS keeps decomposing until the classification complexity of every sub-cluster equals 1.
Still, it should be noted that even when the calculated θ equals 1, this does not mean that the cluster contains only one instance or only one class. According to the T-DTS concept, θ equal to 1 signifies the easiest classification-task difficulty. One possible criticism of this pre-processing step is that it performs the maximal possible T-DTS-like decomposition of the dataset; still, within the T-DTS framework, and no matter whether a heuristic or an algorithmic search for the quasi-optimal θ threshold is used, the user always extracts information/knowledge about the complexity of the complete classification problem from the T-DTS feedback. The difference between the heuristic "cut and try" method and the method based on the maximal possible decomposition tree of Fig. III.4 lies in the computational cost: deterministic search versus fast but non-deterministic heuristic optimization.
The maximal possible decomposition tree is a special T-DTS output, valid only for a fixed range of intra-parameters, including the complexity estimator and the decomposition unit. This tree is a chaotic product of applying the T-DTS concept, because the tree-building trajectory strongly depends on the initial conditions: the decomposition unit and the complexity estimator. We argue here that this chaotic decomposition provides an important self-consistency condition for determining the T-DTS-like adaptability to the given classification problem (Fellman 2004). In our further analysis we disregard all leaf nodes of this tree (marked blue in Fig. III.4) and consider the distribution of the number of non-leaf clusters over the θ interval [0;1). This means that for each delta step of [0;1) (this parameter can be modified by the user, but delta is generally assumed to be small), we count how many clusters (marked magenta in Fig. III.4) of the tree have a complexity ratio lying in that delta-step sub-interval of [0;1). The result of this analysis is a histogram built over [0;1); in our further analysis, we consider the global characteristic of the maximal decomposition tree represented by this histogram (given in Fig. III.5, which has been obtained for the four-spiral, two-class academic classification benchmark using the Fisher-ratio-based complexity estimator).
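As an illustration, a minimal Matlab sketch of this histogram-building step is given below; it assumes that a vector theta_nodes (a hypothetical name) already holds the complexity ratios of the non-leaf nodes of the maximal decomposition tree, and that delta is the user-chosen step.

    % Sketch: distribution of the non-leaf clusters' complexities over [0,1),
    % assuming theta_nodes holds the complexity ratio of every non-leaf node.
    theta_nodes = [0.22 0.25 0.28 0.28 0.30 0.35 0.47 0.61];  % hypothetical values
    delta  = 0.01;                          % user-defined delta step
    edges  = 0:delta:1;                     % delta-step sub-intervals of [0,1)
    counts = histc(theta_nodes, edges);     % number of clusters per sub-interval
    bar(edges, counts, 'histc');            % histogram of the kind shown in Fig. III.5
    xlabel('\theta complexity ratio'); ylabel('number of clusters');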
Generally, the θ values vary in [0;1), so we find the maximal ratio A_{max} = \max_{\forall i} \theta_i and the minimal ratio A_{min} = \min_{\forall i} \theta_i. As expected, [A_{min}; A_{max}] \subseteq [0;1); please recall that θ = 0 corresponds to the case when the sub-database is very complex and θ = 1 denotes the opposite.
The second step of our analysis consists of a further constriction of [A_min; A_max]. For this purpose, I propose selecting the sub-interval of [A_min; A_max] in Fig. III.5 that contains the majority (this parameter has to be predefined by the user) of the clusters. The starting point for extracting this sub-interval of [A_min; A_max] is the delta step of the histogram in which the maximum number of clusters has been registered. According to the histogram (Fig. III.5), this is θ = 0.283; the sub-interval containing the majority of the clusters, which conveys the overall task complexity, is then approximately [0.22; 0.50].
Fig. III.5: An example of distribution of the clusters’ number over [Amin;Amax]
complexity interval
Such a constriction to [B_{min}; B_{max}], where [B_{min}; B_{max}] \subseteq [A_{min}; A_{max}], therefore requires a parameter α that determines the majority, with 0 < α < 1.
In practice, one may use the Pareto principle to determine α: to obtain (1−α)·100% equal to 80% or 20%, α should be set to 0.2 or 0.8, respectively. However, in our practical realization I have used a pessimistic assumption, setting α = 0.1 (10%) as the default T-DTS parameter.
Let me highlight that the form of the histogram is a priori unknown; a hypothetical assumption can be made only from an analysis of the complexity estimator type, the specificity of the decomposition unit, and the values the CEA returns. Thus, in the worst case of a uniform distribution, we have:
\frac{A_{max}-A_{min}}{B_{max}-B_{min}} = \frac{1}{1-\alpha} \qquad (III.10)
where 0 \le A_{min} \le B_{min} < B_{max} \le A_{max} < 1, whereas the optimistic expectation is:
\frac{A_{max}-A_{min}}{B_{max}-B_{min}} > \frac{1}{1-\alpha} \qquad (III.11)
Let me also remind the reader that the procedure for extracting [B_min; B_max] from [A_min; A_max] starts from the point of the histogram where the number of clusters reaches its maximum. In the case of two or more maxima, the procedure automatically extracts the whole sub-interval between the outermost maxima.
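One possible way to perform this extraction is sketched below in Matlab: starting from the delta step with the most clusters, the interval is grown towards the larger neighbouring bin until the (1−α) majority is covered. The variable names are illustrative and the actual Histogram_Analyse routine may proceed differently.

    % Sketch: constriction of [Amin;Amax] to [Bmin;Bmax] around the histogram peak,
    % assuming 'counts' holds the number of clusters per delta step and alpha = 0.1.
    delta  = 0.05;  edges = 0:delta:1;
    counts = [0 0 0 1 2 5 9 4 3 2 1 0 1 0 0 0 0 0 0 0 0];   % hypothetical histogram
    alpha  = 0.1;
    target = (1 - alpha) * sum(counts);      % majority of clusters to be covered
    [~, lo] = max(counts);  hi = lo;         % start at the bin with the most clusters
    while sum(counts(lo:hi)) < target        % grow towards the larger neighbour
        growLeft = lo > 1 && (hi == numel(counts) || counts(lo-1) >= counts(hi+1));
        if growLeft, lo = lo - 1; else, hi = hi + 1; end
    end
    Bmin = edges(lo);  Bmax = edges(min(hi + 1, numel(edges)));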
The second phase of the semi-automatic self-tuning procedure performs a search. For a pre-defined delta step corresponding to h, a selected α = 0.1, a chosen decomposition method, and a chosen complexity estimation method, the procedure searches for a quasi-optimal θ threshold. To process a classification problem within the T-DTS framework, the procedure also requires an additional parameter z, the number of T-DTS iterations. This parameter is needed for checking the robustness of the classification results, especially when T-DTS uses learning-database randomization on a dataset of fixed size. Generally, z is an integer and z > 1; therefore, by applying the T-DTS robustness check z times, we compute, for a fixed θ ∈ [B_min; B_max], the averages and standard deviations of the performance characteristics. The main measures of T-DTS performance are the following: Gr – generalization rate, Lr – learning rate, Tp – time of the overall T-DTS processing, and NTp – processing time of the T-DTS PU applied to the non-decomposed (solid) database. The additional measures include SdGr – the standard deviation of the generalization rate, SdLr – the standard deviation of the learning rate, and SdTp and SdNTp – the standard deviations of Tp and NTp, correspondingly.
We aggregate the performance measures Gr, Lr, SdGr, SdLr ∈ [0, 1] and Tp, NTp, SdTp, SdNTp > 0 into a single T-DTS performance function P(θ), constructed as follows:
P(\theta) = \sqrt[b_1+b_2+b_3]{\Big[(1-Gr)+SdGr\Big]^{b_1}\,\Big[(1-Lr)+SdLr\Big]^{b_2}\,\Big[\frac{Tp+SdTp}{NTp+SdNTp}\Big]^{b_3}} \qquad (III.12)
where b1, b2 and b3 are the priorities of the performance measures; for example, one may preset b1 = 3, b2 = 2 and b3 = 1 as default parameters. For validation purposes, however, we fix the parameters as b1 = 3, b2 = 2 and b3 = 0; note that by zeroing the last parameter we tune the procedure to disregard the processing-time component.
It should be noted that P(θ) is quite sensitive to the priority parameters bi; however,
this sensitivity is on par with the heuristic adjustment of T-DTS, where the operator
manually performs a similar testing, analyzes the output based on the performance
criteria, and consciously defines the priorities of these criteria. In the implementation of
T-DTS v. 2.50, we provide the user with the freedom of customizing not only the priority-
coefficients of equation III.12, but also the complete equation itself.
An important note: by using the P(θ) performance-aggregation function, we do not treat the generalization rate as the only possible performance characteristic. Instead, by applying equation III.12, one may look for a balanced solution. P(θ) is written as a function of the argument θ because, according to our initial assumption, a quasi-optimal θ exists. P(θ) is thus a fusion of the three main performance characteristics, and its specificity can easily be modified by means of the priority parameters b_i.
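A minimal Matlab sketch of this aggregation, following the form of equation III.12 (the names and sample values are purely illustrative), is:

    % Sketch of the P(theta) aggregation of equation III.12; Gr, Lr, SdGr, SdLr
    % lie in [0,1], Tp, NTp, SdTp, SdNTp are positive, b = [b1 b2 b3] priorities.
    P_theta = @(Gr,SdGr,Lr,SdLr,Tp,SdTp,NTp,SdNTp,b) ...
        ( ((1-Gr)+SdGr)^b(1) * ((1-Lr)+SdLr)^b(2) * ((Tp+SdTp)/(NTp+SdNTp))^b(3) ) ...
        ^ (1/sum(b));
    b = [3 2 0];                % b3 = 0 disregards the processing-time component
    p = P_theta(0.92, 0.02, 0.99, 0.01, 12.5, 0.4, 3.1, 0.1, b);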
The applied semi-automated self-tuning procedure is defined by the behaviour of the P(θ) function and by its important intra-parameters (besides P(θ) itself and the b_i priorities), namely the coefficient α, the number of T-DTS iterations z, and h, the delta step (precision) of the θ search. The rest of the T-DTS parameters are assumed to be pre-selected.
Taking into account our principal supposition concerning the existence of a minimum of P(θ), the self-tuning procedure is transformed into the problem of searching for the θ on [B_min; B_max] for which P(θ) is minimal. Even though there is no evidence that P(θ) is a continuous function, the restrictions on the performance characteristics Gr, Lr, etc., suggest that P(θ) cannot contain essential or jump discontinuities.
We should note that, during a run of the self-adjusting procedure, the intermediate parameters are stored for future analysis. This backup allows us to provide the user not only with a single quasi-optimal solution, but also with a range of alternative solutions similar to the quasi-optimal one. For instance, one of the intermediate solutions may be of interest to a user who wishes to sacrifice robustness (half a percent of standard deviation) in exchange for a lower overall T-DTS computational cost. Another important remark is that the dilemma of the "global vs. local" minimum of P(θ) cannot be resolved because of the origins of P(θ): P(θ) contains discontinuities, because the components of this function are not continuous.
The self-adjusting algorithm implemented in the enhanced version of T-DTS is based on the classical golden-section search algorithm. The pseudo-code of the self-tuning procedure is given below:
Programme TDTS_Self_Tuning_Proc(<S,C>, DA, CEA, NNM, z, h)
  '<S,C> : Classification database defined as a pair of the S and C sets
  'DA  : Decomposition agent
  'CEA : Complexity estimation agent
  'NNM : Processing neural network model
BEGIN
  Integer : z                        'Number of iterations required to get statistics
  Integer : h                        'Number of optimization iterations
  [Amin, Amax, Hist] := Pro_Decomp_Total(<S,C>, DA, CEA)
      'Amin, Amax : minimum and maximum of the complexity interval
      'Hist : histogram of the complexity, extracted by Pro_Decomp_Total
  [Bmin, Bmax] := Histogram_Analyse(Amin, Amax, Hist)
      'Bmin, Bmax : minimum and maximum of the new complexity interval
  j := h                             'tuning cycle controller
  m_optimum := max_integer_value     'current optimum (to be minimized)
  REPEAT
    [m_t, j, m_optimum] := Find_min_opt(j, z, Bmin, Bmax, m_optimum)
        'm_t : temporary complexity threshold θ ; m_optimum : current value of P(θ)
    FOR i = 1 TO z
      m_Res(i) := TDTS(<S,C>, DA, m_t, CEA, NNM)     'm_Res : array of the results
    ENDFOR
    m_optimum := Collect_TDTS_statistics_and_calculate_optimum(m_Res)
    m_Global_results(j) := m_Res                     'matrix of the results
  UNTIL j < h
  print m_Global_results, m_t, m_optimum
END
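Since the actual Find_min_opt routine is not reproduced here, the following self-contained Matlab sketch shows how the golden-section step over [Bmin; Bmax] may be organized; evalP is a hypothetical wrapper that averages P(θ) over z T-DTS runs, e.g. evalP = @(theta) mean(arrayfun(@(i) run_tdts(theta), 1:z)) for a hypothetical run_tdts function.

    % Sketch: golden-section search for the quasi-optimal theta on [Bmin;Bmax],
    % assuming evalP(theta) returns the averaged P(theta) over z T-DTS runs.
    function [theta_opt, p_opt] = golden_section_theta(evalP, Bmin, Bmax, h)
        g = (sqrt(5) - 1) / 2;                % golden-ratio coefficient
        a = Bmin;  b = Bmax;
        x1 = b - g*(b - a);  f1 = evalP(x1);
        x2 = a + g*(b - a);  f2 = evalP(x2);
        for k = 1:h                           % h optimization iterations
            if f1 < f2                        % the minimum lies in [a, x2]
                b = x2;  x2 = x1;  f2 = f1;
                x1 = b - g*(b - a);  f1 = evalP(x1);
            else                              % the minimum lies in [x1, b]
                a = x1;  x1 = x2;  f1 = f2;
                x2 = a + g*(b - a);  f2 = evalP(x2);
            end
        end
        theta_opt = (a + b) / 2;              % quasi-optimal complexity threshold
        p_opt = evalP(theta_opt);
    end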
In this algorithm, the extra integer parameter h is the number of optimization iterations (not to be mistaken with the z parameter) and is equivalent to the precision of the optimal θ threshold. The h parameter has to be provided in advance for the following reasons:
1. The optimization criterion P(θ) might deviate for different reasons, such as the random way the learning database is built, fluctuations of the decomposition coordinates of the DA, etc. Thus, h represents a trade-off between a high precision of the quasi-optimal θ threshold and the instability of P(θ) caused by its fluctuating components.
2. We are interested in an optimization procedure that searches for the optimal threshold using the smallest possible total number of single-run T-DTS iterations. Taking the h and z parameters into consideration, the computational time cost of the self-tuning procedure can be expressed as h·z·(time of a single T-DTS run for fixed parameters such as DA/CEA/NNM, learning database, etc.). Please note that the duration of a particular single T-DTS run strongly depends on the size of the classification database and on the selected NNM method.
In conclusion, we would like to stress that the use of the proposed algorithm does not increase the overall computational or handling complexity of T-DTS, the only alternative being the more expensive heuristic manual search. The difference between searching for the quasi-optimal θ threshold with the given procedure and with heuristic user-based approaches lies essentially in the T-DTS processing-time budget.
The proposed semi-automated procedure has the range of advantages and disadvantages mentioned above. We will discuss and evaluate the practical issues of the procedure in the validation chapter. The following section presents another important issue – the implementation of T-DTS 2.50 in the Matlab environment.
III.2 T-DTS software architecture and realization
The current Matlab T-DTS v. 2.50 software architecture is based on the use of a set of specialized mapping Neural Network Models (NNM) supervised by a set of DA. The DA are prototype-oriented neural networks, and the NNM are models of artificial-neural-network origin. The principal software architecture is depicted in Fig. III.6.
Our T-DTS framework incorporates three core databases:
1. Decomposition methods (database DU – Decomposition Unit; the decomposition unit is equivalent to the DA in terms of the T-DTS concept).
2. Processing methods (PU – Processing Unit, equivalent to the NNM).
3. Complexity estimators (equivalent to the designated CEA in terms of the T-DTS concept).
The T-DTS software engine is the CU (Control Unit), which controls and activates the corresponding packages through the Graphical User Interface (GUI). The CU allows the operator to perform:
1. Normalization of the incoming database.
2. Automatic extraction of the learning database from the general classification database S, with two implemented extraction options:
• random learning-database extraction;
• random learning-database extraction with respect to the class distribution.
3. Principal component analysis and transformation.
4. DU selection.
5. PU selection.
6. Selection of the Complexity estimating method.
7. Definition of the θ complexity threshold in the range [0;1].
8. Setting of T-DTS parameters and constants such as z, h, α, B, etc.
9. Configuration of the DU and PU.
10. Database analysis using graphic tools, graphic interpretation of the obtained results, tree construction, and additional user tools.
Fig. III.6 : T-DTS software architecture
In addition to the T-DTS design itself, the need, expressed in the literature (Maren 1990), to develop guidelines for further integrating other types of processing – fuzzy logic, genetic algorithms, expert systems and conventional algorithms – required re-engineering the T-DTS architecture.
To conclude, T-DTS uploads the databases, performs pre-processing, builds a tree of prototypes using the selected decomposition method and the selected complexity estimating technique, and then learns each sub-database using a particular PU (NNM). Afterwards, it applies the obtained structure of NNM to the test/generalization database and assembles a set of local results. The final output of a T-DTS run is the learning and generalization rates.
Generally, based on Fig. III.6, the T-DTS software can be viewed as a Lego system of decomposition methods, processing methods and complexity estimating techniques, powered by a control engine accessible to the operator through the GUI.
The main advantage of the T-DTS architecture is that these three databases can be developed independently, outside the main frame, and then easily incorporated into T-DTS. Whatever NN model is pre-selected, significant efficiency can be gained through careful design and minor modifications, which the new T-DTS architecture permits.
Therefore, the T-DTS implementation is an extendible platform that supports the principal T-DTS concept. An example of the possible extendibility of T-DTS is the SOM-LSVMDT approach (Saglam, Yazgan and Ersoy 2003), which is based on the same idea of decomposition using Self-Organizing Maps. This technique could be implemented by incorporating LSVMDT (Linear Support Vector Machine Decision Tree) (Chi and Ersoy 2002) as a processing method into the PU database, because the SOM decomposition method is already implemented. An example of actual T-DTS extendibility is the recent implementation of the PU method realized in (Voiry and al. 2007). In fact, T-DTS v. 2.50 is a working concept of an independent-component development platform for classification tasks. The next paragraphs describe our Matlab T-DTS implementation.
The principal Matlab T-DTS v. 2.50 software architecture is briefly described, in terms of its main processing components, in Fig. III.7. This scheme identifies 9 principal modules of T-DTS. Relations between modules are represented by one-way arrows (indicating a simple call of the module) and two-way arrows for the cases when the relations between modules are complex.
The following list presents a detailed description of each of them.
1. T-DTS GUI – contains the group of functions responsible for controlling the T-DTS application.
2. T-DTS Init – initializes and saves the processing parameters.
3. T-DTS Modifier – allows the user, through the graphical interface, to modify the existing decomposition (DA) and processing (NNM) methods; the update is then picked up and applied.
Fig. III.7 : Principal T-DTS v. 2.50 Matlab software architecture
4. T-DTS Output – a relatively simple group of functions that is called once in order to print out the results.
5. T-DTS Graph – the set of functions responsible for 2D and 3D drawing.
6. T-DTS Main programme – the central module containing the group of functions responsible for integrating the main T-DTS activities. In short, this module corresponds to CU[.].
7. Data T-DTS – provides database upload and, if requested by the user through the GUI, pre-processing. This module can also split the database into learning and generalization parts.
8. T-DTS Core – the most important module, containing the realization of the global functions, such as the recursive decomposition and tree building defined by the T-DTS concept.
9. T-DTS Complexity – the engine of the T-DTS application; it is responsible for estimating database complexity and for taking the decision on further decomposition.
A more detailed scheme, required for a possible industrial T-DTS implementation, is given in Fig. III.8. This architecture is assembled from Matlab modules; in the chart, each *.m file represents a separate processing module. The relations between modules are indicated with two types of arrows.
The first type of relation indicates a complex relation, meaning that the modules can call each other in order to modify each other's input/output parameters. For example, m_getProcMethod_params.m in Fig. III.8 (top left) contains the parameters of multi-ANN training pre-set by the user. This module has complex relations with the "t_dts.fig & t_dts.m" module: the latter calls m_getProcMethod_params.m to pass the pre-set parameters forward to m_tdts.m, and so on. At the same time, using the T-DTS GUI by means of t_dts.fig, the user can open the Configuration menu and modify these parameters (Fig. IV.5.8, with the corresponding explanation).
The second type defines simple relations between modules, meaning that one module plays the role of a master and the called module the role of a slave. For example, t_dts_init_script.m in Fig. III.8, which contains the range of default parameters, is called by "t_dts.fig & t_dts.m". The t_dts_init_script.m script has been written to initialize T-DTS through the GUI t_dts.fig; it is responsible for saving and updating the default parameters with the user-given parameters. Similarly, according to the organization chart, t_dts_init_script.m calls, for its needs, the m_getProcMethod_params.m and m_getDecMethod_params.m modules in order to update the processing and decomposition T-DTS methods with the modified parameters, again through the GUI interface t_dts.fig.
The aim of providing such a complex, real-world scheme in Fig. III.8 is to show the T-DTS realization in more detail: many questions about the possible realization of the different T-DTS processing steps arise here, and these details, which are not present in the conceptual T-DTS scheme, have a crucial impact on the T-DTS functioning.
The practice of upgrading T-DTS from v. 2.00 to v. 2.50 has shown that the main modules of T-DTS, including the relations between the principal modules, are invariant and will be retained in new versions of T-DTS. Each module shown in Fig. III.8 is described below; some modules contain sub-functions or data file(s) that have to be explained:
1. t_dts.fig plus t_dts.m, which include t_dts_params.mat – the parameter file, including the default parameters. After each run of T-DTS, the parameters of the last run are automatically saved in this t_dts_params.mat file.
2. m_tdts.m
• m_it_T_DTS() – sub-function that executes T-DTS n times with the range of passed parameters.
• m_recDBSplit() – recursively decomposes the database based on the provided complexity-estimator ratio; used for learning.
• m_recDBSplit_Following_tree() – recursively decomposes the database based on the provided tree, regardless of the database size. This sub-function imitates the same process that has been used for learning; it is used when the option "Tree_following" (Fig. III.8) is selected.
• m_splitting() – for a given database and decomposition centres, provides the clustering.
• m_QClassification() – for a given input and output database, returns the learning and generalization rates.
3. m_build_sANN.m, which includes the module m_Cldb2MatrixConvertation(); it converts the database of categories/classes into a matrix of zeros in which the position of the class is indicated by a 1 in the corresponding column, the number of lines of this matrix being the number of patterns (see the sketch after this list). This classical data-interpretation step is required before ANN training.
4. m_uploadDBs.m includes m_randData(), which randomizes the indexes and passes them as input parameters; it is required for the two randomization modes (Fig. III.8).
5. m_3d_graph_clusters.m includes m_3d_Tree(), which builds a tree in 3D over the decomposed 2D clusters.
6. m_run_sANN.m includes m_Matrix2CldbConvertation(), which performs the operation opposite to m_Cldb2MatrixConvertation(): it converts the obtained matrix of testing results back into the standard format, where the categories/classes are stored in an array.
7. m_RBF_ZISC_fusion.m :
• m_ZISC_CE() – the module of the ANN-based complexity estimator, holding the core of the ANN-based complexity estimation.
• m_distance() – sub-module containing the realization of the L1, LSUP and Euclidean distances, which may be used not only by the ANN-based complexity estimator but also for database decomposition.
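The class-to-matrix conversion and its inverse can be sketched in a few lines of Matlab (an illustrative re-implementation, not the actual m_Cldb2MatrixConvertation / m_Matrix2CldbConvertation modules):

    % Classes become a zero matrix with a single 1 per row marking the class column;
    % the number of rows equals the number of patterns.
    classes = [2 1 3 2]';                      % hypothetical class labels, one per pattern
    nCls    = max(classes);
    M = zeros(numel(classes), nCls);
    M(sub2ind(size(M), (1:numel(classes))', classes)) = 1;
    % Opposite operation: the result matrix is converted back to class labels.
    [~, back] = max(M, [], 2);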
Let me mention that the default configuration of the T-DTS neural networks (number of layers, number of nodes, and type of connectivity) is an important issue. Empirical tests using back-propagation networks (the majority of the PU methods) have not demonstrated a significant advantage of 2 hidden layers over 1 in a relatively small and simple diagnostic network.
Fig. III.8: Detailed T-DTS v. 2.50 Matlab software architecture
For our T-DTS realization we follow Maren's advice (Maren 1990): for a classification (decision boundary) problem where the output node with the greatest activation determines the category of the instance, one hidden layer will most likely be sufficient. Therefore, the default PU settings (used during the validation phase) for the back-propagation networks are the following: the number of neurons in the output layer equals the number of classes, and the hidden layer contains a number of neurons equal to the (rounded) square root of the size of the database (sub-database); an illustrative sketch is given below.
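As an illustration of this default sizing rule only (not the actual T-DTS PU code, and assuming Matlab's Neural Network Toolbox), a back-propagation PU could be configured as follows:

    % Sketch of the default PU sizing: one hidden layer with round(sqrt(m)) neurons;
    % patternnet derives the number of output neurons from the 1-of-N coded targets.
    X = rand(2, 100);                     % hypothetical sub-database: 2 features x m patterns
    labels = 1 + (X(1,:) > 0.5);          % hypothetical 2-class targets
    T = full(ind2vec(labels));            % one column per pattern, 1-of-N coding
    m = size(X, 2);                       % sub-database size
    net = patternnet(round(sqrt(m)));     % single hidden layer, sqrt(m) neurons
    net = train(net, X, T);               % back-propagation training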
Fig. III.8 depicts the full chart of the realization of the T-DTS modules.
The options and processing of the current Matlab T-DTS software realization v. 2.50 can be controlled through the GUI shown in Fig. III.9.
Fig. III.9 : Matlab T-DTS software realization v 2.50, Control panel
The detailed description is the following:
• Decomposition Units:
1.CNN (Competitive Neural Network)
2.SOM (Self Organized Map)
3.LVQ1 and LVQ2_1 (Learning Vector Quantization, with two different learning functions)
• Processing Units:
1.LVQ1 and LVQ2_1
2.Elman_BN (Elman’s backpropagation network)
3.MLP_CF_GD (MLP cascade forward network with gradient descent algorithm)
4.LNM (Linear Neuron Model)
5.RBF (Radial basis function based network)
6.MLP_FF_GDM (MLP feed forward backpropagation network with gradient descent
algorithm including momentum adjusting)
7.GRNN (General Regression Neural Network)
8.PNN (Probabilistic Neural Network)
9.GRNN (Generalized regression neural network)
10. MLP_FF_BR (MLP feed forward backpropagation network with Bayes regulation
including statistical incoming database normalization) (Voiry and al. 2007)
11. MLP_FF_ID (MLP feed forward backpropagation network with input-delay)
12. Perceptron.
13. Elman_BNwP (Elman’s backpropagation network including statistical incoming
database normalization)
14. Elman_BNBR (Elman’s backpropagation network with Bayes regulation)
• Complexity estimating technique:
1.Maximum_Standard_Deviation (based technique) (Rybnik 2004)
2.Fisher_Disriminant_Ratio (Rybnik 2004)
3.Purity_Measure (PRISM based technique) (Singh 2003)
4.Normalized_mean_Distance (Bouyoucef 2007)
5.KLD (Kullback–Leibler Divergence based estimator)
6.JMDBC (Jeffreys-Matusita distance based complexity criterion)
7.Bhattacharyya_Coefficient (based criterion)
8.Mahalanobis_Distance (based technique)
9.Interclass_DM_CRT_Trace (Scattered-matrix method based on inter-intra matrix-
criteria of the traces), criterion 9.13 (Fukunaga 1972)
10. Interclass_DM_CRT_Div_Trace (Scattered-matrix method based on inter-intra
matrix-criteria of the traces’ division), criterion 9.16 (Fukunaga 1972)
11. Interclass_DM_CRT_Log_Det (Scattered-matrix method based on inter-intra
matrix-criteria of logarithm of determinants), criterion 9.14 (Fukunaga 1972)
12. Interclass_DM_CRT_Dif_Trace (Scattered-matrix method based on inter-intra
matrix-criteria of the traces' difference), criterion 9.15 (Fukunaga 1972)⁴
⁴ These four criteria (9–12) generate a class-separability ratio based on the statistical theory of discrimination (Fukunaga 1972)
13. RBF_ZISC_based_Fusion_CRT (Simulation of ZISC®-036 ANN based
complexity estimator in Matlab environment) Section III.4.3
14. k_Nearest_Neighbor_Estimator
15. Collective_Entropy (PRISM based technique) (Singh 2003)
16. JSD (Jensen-Shannon Divergence based estimator)
17. Hellindger_Distance (based technique)
18. Bayes_Error – a direct realization of a computationally costly procedure, intended only for testing purposes and not for real-world classification-task processing.
The complexity-estimating threshold can be set in the corresponding field of Fig. III.9. This value varies from 0 to 1. We have to mention that the ratios returned by the different complexity estimators 1–18, which come from different taxonomic classes, have been linearly normalized in order to lie in the interval [0;1]: a ratio of 0 means that the database/sub-database is "very complex" to classify, and 1 means the opposite. I would like to remind the reader that the values produced by the different CEA are relative and cannot be compared; this issue has been verified during the validation phase of my work.
Decomposition techniques use two important intra-parameters for performing the decomposition (Fig. III.9):
• Mode:
1. Min_dist_to_prototype
2. Tree_following.
• Distance:
1. Euclidean
2. Manhattan
3. Mahalanobis.
Concerning the given description of the tree-like learning-database decomposition controlled by the SOM-based DA, we have to note that there are two possibilities for managing the decomposition of the databases.
Let the DA decompose the database following the proposed T-DTS concept. After decomposition, the DA obtains, besides the clusters of the learning database, the set of centroids W_i and the T-DTS tree. The question of how to use this information arises, because there is a difference between the hierarchical clustering of the database following the tree-like decomposition and the clustering performed by analysing the distances to the centroids W_i, which ignores the tree-like decomposition. Theoretically, the borders between these two types of clustering are completely different. Moreover, from the algorithmic-complexity point of view, for m given instances the first decomposition approach requires about m·log₂(m) iterations (if the tree is binary) and the second 0.5·m·(m−1).
To distinguish these two cases, we have implemented the Min_dist_to_prototype clustering mode, which corresponds to the second approach, and the Tree_following mode, which is responsible for the original concept of clustering by following the tree. Whichever mode is selected, Min_dist_to_prototype or Tree_following, the T-DTS user can be sure that there is no discrepancy between the way the learning and the generalization databases are decomposed. This issue is extremely important.
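The difference between the two modes can be sketched as follows in Matlab (a toy binary tree with hypothetical centroids, not the actual T-DTS data structures):

    % Sketch of the two clustering modes for one instance x: node centroids are
    % stored in treeW, children per node in treeC (empty = leaf node).
    x      = [0.9 0.1];
    treeW  = [0.5 0.5; 0.25 0.5; 0.75 0.5; 0.75 0.25; 0.75 0.75];
    treeC  = {[2 3]; []; [4 5]; []; []};
    leaves = [2 4 5];
    dist2  = @(A, v) sum(bsxfun(@minus, A, v).^2, 2);
    % Mode 1: Min_dist_to_prototype - compare x against all leaf centroids at once.
    [~, k] = min(dist2(treeW(leaves,:), x));
    leaf_direct = leaves(k);
    % Mode 2: Tree_following - descend the tree, always following the nearest child.
    node = 1;
    while ~isempty(treeC{node})
        ch = treeC{node};
        [~, k] = min(dist2(treeW(ch,:), x));
        node = ch(k);
    end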
The distance type is another important criterion, because it is the metric used for the distance measurement; however, the problems for which we have provided testing have been represented purely in Euclidean space. In addition, one can expand the T-DTS platform with a new database of distances (metrics) if required. The following sections describe the methods related to the Databases' pre-processing and Create Learning & Generalization DBs sections.
Besides the selection of the input and output file names (Fig. III.10, left), the learning database can be pre-processed, if required, using Principal Component Analysis and Transformation, or one of the normalization options:
• Normalization
1. No_normalizing
2. Norm_LinTans_EC
3. Norm_Statistical_EC
Norm_LinTans_EC is a linear data pre-processing such that the prototypes lie in the interval from 0 to 1. Norm_Statistical_EC pre-processes the data so that its mean is 0 and its standard deviation is 1.
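Under the behaviour just described, the two normalization options amount to the following Matlab sketch (illustrative code, column-wise over an m-by-d database S):

    S = [1 200; 3 250; 2 400];                                   % hypothetical raw database
    % Norm_LinTans_EC: linear transform so that every feature lies in [0,1]
    mn = min(S, [], 1);  mx = max(S, [], 1);
    S_lin = bsxfun(@rdivide, bsxfun(@minus, S, mn), mx - mn);
    % Norm_Statistical_EC: statistical normalization to mean 0 and std 1
    S_stat = bsxfun(@rdivide, bsxfun(@minus, S, mean(S, 1)), std(S, 0, 1));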
The Create Learning & Generalization DBs section is used when the classification problem does not already provide such a split. In this case there is a wide range of well-studied techniques dedicated to the question of how to learn, but in any case the incoming solid database must be split according to the T-DTS concept of problem handling. Therefore, there are two types of splitting:
• Split-type
1. Balanced
2. Random
In the first case, the learning database is randomly extracted (if the corresponding flag is on, Fig. III.9) with respect to the class distribution and the given extraction percentage. The second option provides a random (without any additional condition) extraction of the learning sub-database from the original classification database.
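A minimal Matlab sketch of the two extraction options (illustrative names; p is the extraction percentage) is given below:

    S = rand(576, 2);  C = 1 + (S(:,1) > 0.5);  p = 0.5;   % hypothetical database and labels
    % Random: extract p*m patterns without any class constraint
    idx = randperm(size(S, 1));
    learnRnd = idx(1:round(p * numel(idx)));
    % Balanced: extract p of the patterns of each class separately
    learnBal = [];
    for c = unique(C)'
        ic = find(C == c);
        ic = ic(randperm(numel(ic)));
        learnBal = [learnBal; ic(1:round(p * numel(ic)))];   %#ok<AGROW>
    end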
The right panel of the T-DTS GUI contains the T-DTS processing options (Fig. III.9). There are 3 run modes:
1. Uni
2. Multi
3. Self_Tuning
The first mode performs a simple, single T-DTS run. The Multi run mode does the same, but for a range of specially selected parameters saved in the multi-run script: instead of launching T-DTS manually for each new set of parameters, it is more efficient to write this information into a script and then process it. For example, the most frequently used mode of T-DTS is testing for different θ-threshold values.
• Self_Tuning is the mode that uses the θ-threshold adjusting procedure described above in Section III.1.
• The Iteration No. field (Fig. III.9) allows the user to set the number of iterations z in order to gather statistics. The importance of this option has been discussed in Section III.1.
• The Run button starts T-DTS; Result prints out the summary of the testing. Result(Threshold) is an option that presents the generalization and learning results as a function of the θ threshold; the idea of the existence of a functional relation between them is highlighted in Section III.2. To illustrate the T-DTS GUI abilities, including Result(Threshold), I have applied the T-DTS approach to a very simple two-class benchmark. The aim of this task is to classify 2D points belonging to two different classes. These classes are linearly separated by the line X = 0, meaning that points with a negative X-coordinate belong to class one, and the others to class two.
The learning process has been performed using the CNN decomposition method, PNN processing (not appropriate for the goal, but good for an illustration of the T-DTS GUI), and the Maximum_Standard_Deviation complexity estimator (which is not a true classification complexity estimator), with different threshold values preset in the multi-run script. The learning ratio was set to 50%. The learning database has been extracted randomly from the solid database, which contains 576 2D points; the class distribution is 50%/50%. The number of iterations used to gather statistics is 5. Fig. III.10 shows the T-DTS output of the learning and generalization rates for different θ-threshold values.
In Fig. III.10, the abscissa represents the complexity θ threshold and the ordinate the percentage (%) of successful learning (red) and generalization (blue). The obtained generalization rates are marked on the chart by diamonds; the corresponding corridor of the standard deviation of this rate is marked by asterisks. The learning rates in Fig. III.10 are likewise marked by diamonds with asterisks. As the problem is very simple, the learning rates reach their maximum and their standard deviation is zero. One may note that although the maximal generalization rate is reached for θ = 0.19, the solution provided by θ = 0.224 might be more attractive because of the low standard deviation of its generalization rate. By analysing such a chart, one may select a combination of alternative solutions in case the single best solution cannot be found.
Fig. III.10 : T-DTS GUI: Results : 2 stripe-like benchmark (576 prototypes)
The next option, "Print DB Complexity", allows the user to apply all available complexity estimating techniques to the solid (non-decomposed) database. It provides preliminary information for black-box classification cases. "Print SubDB Complexity" can be applied to a decomposed database: it calculates the complexity ratios for each sub-database and prints them out for analysis of the decomposed configuration.
The buttons "2D DBs", "2D Tree" and "3D Graph" permit us to interpret the result of the decomposition and tree construction in 2D and 3D. Two tests on the same benchmark problem have been performed with the above-mentioned settings, except that the θ threshold was fixed: in the first case (Fig. III.11, left) Maximum_Standard_Deviation with θ = 0.72, and in the second case (Fig. III.11, right) Purity_Measure with θ = 0.35.
Fig. III.11 : GUI of T-DTS, decomposition clusters’ chart
The clusters that represent the final decomposition in Fig. III.11 are marked using different styles. In the left picture, the clustering includes a projection of the tree over the mosaic of clusters: the centres (nodes of the tree) of the sub-databases/clusters are linked, and not only the leaves (processing nodes) but also the decomposition nodes are shown.
A brief analysis of this chart suggests that the majority of the clusters in Fig. III.11 are allocated along the class border X = 0. We expect such a result, because those clusters do not belong to one single class: these bordering clusters contain two classes (Fig. III.11) and are therefore assessed as more complex than the others, which contain one class only. The left and right charts have different cluster mosaics because of the different complexity estimators; these pictures illustrate the direct influence of the complexity estimating technique on the decomposition (clustering) and tree building. Accurate calculation of the complexity ratio of the sub-databases (clusters) is the engine of T-DTS that builds the decomposition set. Returning to the graphical representation of results, the next available T-DTS option, "2D Tree", draws the tree. Fig. III.12 shows two cases: in both the left and right pictures the trees are built for the above-mentioned problem and settings, with the complexity estimator set to Maximum_Standard_Deviation and θ threshold = 0.72. What matters for the tree building is the configuration of the decomposition method: in the left picture, the decomposition method is a CNN with 2 neurons (decomposition centres); on the right of Fig. III.12 we have used a SOM for decomposition, with a 5×8 grid, i.e. 40 decomposition centres in total.
Fig. III.12 : GUI of T-DTS, decomposition tree charts
The final GUI option, "3D Graph", combines the two previously mentioned features: it allows the user to build a decomposition tree over the 2D clustering.
Fig. III.13 : GUI of T-DTS, decomposition tree chart in 3D
This option illustrates the dynamics of the T-DTS tree-like decomposition approach regulated by a complexity estimator. Fig. III.13 shows the decomposition and the tree structure in 3D for the same simple 2D classification task, where the CNN contains 2 neurons, the complexity estimating method is Purity_Measure, and the θ threshold equals 0.35. Another range of options is available in the T-DTS main menu. The two most important menus for the end user are Configuration (Fig. III.14) and Analysis (Fig. III.17).
DU Parameters Configuration and PU Parameters Configuration permit the user to modify the DU and PU settings, such as the number of epochs, the number of neurons, etc.
Fig. III.14 : Menu of T-DTS, Configuration.
DU Configuration and PU Configuration are responsible for the T-DTS platform extension and independent component development. For example, the PU named Elman_BNwP (Elman's backpropagation network including statistical incoming-database normalization) was developed as an analogue of MLP_FF_BR (MLP feed-forward backpropagation network with Bayes regulation including statistical incoming-database normalization), proposed by (Voiry and al. 2007), which had been incorporated earlier.
The Multi Run Configuration menu configures the script and the optimization-function settings.
In case a T-DTS run has been interrupted, the user can continue it from the interruption point, because the sub-products and parameters are stored automatically. In order to concatenate the sub-products afterwards, the option "Concatenate output" has been created; it also allows the user to run T-DTS in parallel on different PCs. This option appeared as a result of practical working experience.
The next two principal menus permit the setting of constants. The first (Fig. III.15) covers the T-DTS processing constants, such as the threshold accuracy k, the Alfa_Coefficent (α), etc.
Fig. III.15 : Menu of T-DTS, Set Constants
The second menu, "Set EC Options" (Fig. III.16), defines the constants for the complexity estimators, such as the resolution parameter B for the PRISM methods, the Maximal Influence Field for our ANN-based complexity estimator, and so on.
Fig. III.16 : Menu of T-DTS, Set EC Options
The Analysis menu, shown in Fig. III.17, contains the settings for histogram building ("PDF of Complexity"), "Postregression" analysis, and building the chart of the optimization function P(θ).
Fig. III.17 : Menu of T-DTS, Analysis
Fig. III.17 concludes the description of the T-DTS parameters and their implementation.
Let me recall that the realization aspect is very important because of the range of difficulties that appear when implementing an abstract theoretical concept. It is well known that the simplifications and assumptions made during the implementation phase may negatively affect the T-DTS processing ability. Thus, for further validation and result analysis it is very important to have not only a vision of the general T-DTS concept, but also an understanding of the details of the realization. The realization can answer possible questions such as why the structure of the output clusters looks the way it does, or why the generalization rate is high or low. The following Section III.3 provides a conclusion on the current aspects of the T-DTS implementation and its software architecture.
III.3 Conclusion
In this chapter, we have described the implementation of T-DTS. From the conceptual
point of view, the presented implementation scheme is platform independent: no matter
which programming language is selected for the main T-DTS modules, its implementation
scheme remains platform-invariant.
The presented implementation of T-DTS is also very flexible: any decomposition or
processing unit could be easily modified, adjusted, or replaced by a non-advanced user. A
variety of parameters, including pre-processing techniques, distance measures, and so on,
creates a user-friendly T-DTS environment for the classification tasks’ processing. The
key role in T-DTS implementation is played by the complexity estimation module that
controls the overall T-DTS performance through the θ threshold. To handle the θ threshold in T-DTS successfully, one would require knowledge of certain CEA features, of the specifics of the classification problem, of the DA, etc.; instead, to streamline the usability of T-DTS, we proposed the self-tuning procedure, which automatically optimizes the threshold and allows the results to be obtained in a deterministic way.
Since the question of finding the optimal θ threshold is known to be an NP-hard problem, we have created a novel semi-automated (several parameters still have to be predefined by the user) procedure for finding the quasi-optimal θ threshold at which the results of the T-DTS classification reach their quasi-maximum. Furthermore, the procedure keeps track of several possible results, giving the user a choice of suitable solutions.
The proposed semi-automated procedure also opens a new direction for the future development of T-DTS. An important aspect of this T-DTS enhancement is that the proposed semi-automated procedure uses the concept of the maximal possible decomposition tree – the tree in which, at the leaf level, each sub-cluster contains a sub-database that is simplest to classify. During the process of maximal decomposition, T-DTS accumulates information about the distribution of the clusters' complexity. The histogram over the θ sub-interval of [0;1] represents how the number of clusters varies with their complexity. It provides additional information about database divisibility when the complexity estimator, the decomposition module and the other T-DTS parameters are fixed. This is also very important, because the choice of the decomposition technique and of the estimator remains the user's responsibility. The T-DTS concept makes provision for an intelligent and self-organizing allocation of the decomposition clusters. Moreover, the above-mentioned histogram represents the initial database divisibility; from the global point of view of self-organizing systems, and in accordance with the cross-disciplinary overview (Haken 2002), this histogram is a key macroscopic characteristic that exhibits the self-organizing abilities of T-DTS. In fact, this histogram of the maximal tree-like database decomposition predefines the database decomposition process.
It is also important to mention that the proposed ANN-based complexity estimator maps very well onto the RCE-kNN-based complexity estimator implemented on the IBM© ZISC®-036 Neurocomputer, which has been extensively validated using benchmarks and real-world problems. This makes it possible to create a hardware implementation of T-DTS or, more precisely, a hardware RBF-kNN-like T-DTS for the IBM© ZISC®-036 Neurocomputer. In spite of the possible limitations of a hardware-based implementation of the decomposition methods, the exclusive benefits of the RCE-kNN ZISC®-036-based complexity estimator make the hardware implementation a viable choice not only for clustering, but also for classification with an RBF-based processing unit.
To explore the direction of hardware-based neurocomputing, we implemented a hybrid software/hardware prototype of T-DTS using the IBM© ZISC®-036, where the hardware was used to implement the complexity estimation and processing modules. Naturally, such an implementation has its advantages and disadvantages. The principal disadvantage is that it reduces the conceptual flexibility of T-DTS. The main advantages of the hardware implementation are the speed achieved through parallelization and the computational efficiency. Considering the fact that modern classification problems require the analysis of huge data stores, the new trends in the development of high-speed parallel hardware systems make the direction of hardware T-DTS development a very attractive alternative to the currently predominant software solutions.
The T-DTS enhancement, the implementation of the ANN-based complexity estimator on the IBM© ZISC®-036 Neurocomputer and its PC-software-based version, together with 16 other complexity estimating techniques, have been verified using classification benchmarks and real-world classification problems. The design and the results of this verification are reported in Chapter IV, Validation.
Chapter IV:
Validation aspects
In this chapter, we validate the main aspects of the proposed T-DTS approach in the following steps. First, we compare the effectiveness of the proposed ANN-based complexity estimation technique to that of other available estimators outside of the T-DTS framework. In the second part of the chapter, we validate and assess the effectiveness of the proposed T-DTS enhancements within the framework; specifically, we test the performance of the complexity estimation techniques embedded into T-DTS and of the proposed T-DTS self-tuning procedure.
My validation datasets consist of two parts: benchmarks specially designed for the validation of classification techniques, and a real-world classification problem.
IV.1 ANN-structure based complexity estimators
The prime objective of enhancing T-DTS performance led me to propose a new method for estimating classification complexity. The result of our research is an ad hoc ANN-structure-based complexity estimator. This concept of classification-complexity estimation is free of the disadvantages of class-distribution analysis and, as a novel classification complexity (discriminant) estimating method, can also be used outside the T-DTS framework, for example for improving the readability of other classification models or for pre-analysis of a problem in pattern recognition.
The use of an ANN-based concept was motivated by the possibility of exploiting a neural network's learning indicator(s) to obtain implicit information (parameters) about a complex industrial system's process, plant, etc. (Madani and Berechet 2001). Initially, the ANN-based complexity estimator was implemented on the IBM© ZISC®-036 Neurocomputer because of its advantages, such as the evolutionary, RCE-kNN-like construction of the neural network. The second step was a simulation of the proposed estimator in the Matlab environment. Conceptually, the two realizations are similar: both stand on the same principal idea presented in the theoretical part; however, there are some minor differences that have a minor influence (as will be shown) on the final outputs.
The core difference between these two realizations of the ANN-based complexity estimator is that the Matlab simulation does not allow the overlapping of influence fields that may occur in the IBM© ZISC®-036 Neurocomputer implementation during prototype association (Lindblad and al. 1996). Thus, for the Matlab-simulated kNN-like ANN-structure-based complexity estimator, the MIF parameter of each prototype is automatically adjusted (minimizing the MIF parameter starting from the value pre-set in advance) during the polyhedron construction. By contrast, the IBM© ZISC®-036 implementation of the estimator adjusts the threshold of the neighbourhood neurons (prototypes with an associated category/class), where the final MIF parameter should not be lower than the pre-set value. Nevertheless, let us highlight that for both types of complexity estimator the common and principal feature is the same: extraction of the complexity ratio from the analysis of the Voronoy polyhedron construction process.
The following section provides the testing results, and their analysis, obtained for the IBM© ZISC®-036 ANN-structure-based classification complexity estimator.
IV.1.1 Hardware-based validation
Historically, the proposed ad hoc classification-task complexity estimator was first verified using the IBM© ZISC®-036-based PC-board Neurocomputer implementation. We chose this hardware because it is a good candidate for a hardware-implemented RCE-kNN Neural Network that uses an evolutionary learning strategy (Madani, De Tremiolles and Tannhof 2001). During the kNN-like partitioning (learning), the thresholds of the neurons are adjusted; during the generalization phase, the neighbouring neuron(s) may (or may not) be activated. For a given learning database, the result obtained after the learning process is an RCE-kNN Neural Network structure represented by a Voronoy polyhedron. Using the complexity estimating concept described above, we compute the classification complexity ratio (coefficient/rate).
IV.1.1.1 IBM© ZISC®-036 Neurocomputer's implementation and benchmarks
To validate the new ad hoc concept using the IBM© ZISC®-036 Neurocomputer-based implementation, I have constructed academic (i.e. simple) classification benchmarks. Basically, there are five databases representing a mapping of a restricted 2D space onto 2 categories/classes, depicted in Fig. IV.1. Each pattern was divided into two or more equal striped sub-zones, each of them belonging alternately to class 1 or 2.
Fig. IV.1 : Stripe classification benchmarks
The benchmark samples are created using randomly generated instances s_j that contain two coordinates. Theoretically, the number m of s_j samples influences the quality of the demarcation of the striped zones (categories): when the vectors are uniformly randomly distributed, a higher quantity of vectors/prototypes determines the class-separating hyperplane more precisely. According to the value of the first coordinate of s_j, and according to the type of striped pattern, an appropriate category c_j is assigned to the instance s_j. The m structures defined by the pairs <s_j, c_j> are sent to the ZISC®-036 Neurocomputer for learning. Using equation II.42, I compute the indicator function Q_i(m), where i is the pattern index defined in Fig. IV.1 (the leftmost pattern has index i = 1 and the rightmost i = 5). Afterwards, I calculate the complexity ratio for each of the five benchmarks.
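A minimal Matlab sketch of how such a stripe benchmark can be generated (assuming the restricted 2D space is the unit square divided into k equal vertical stripes assigned alternately to the two classes) is:

    m = 1000;  k = 10;                  % number of instances and of stripes (Example 5)
    s = rand(m, 2);                     % randomly generated 2D instances s_j
    c = 1 + mod(floor(s(:,1) * k), 2);  % class c_j assigned from the first coordinate
    % The m pairs <s_j, c_j> are then sent to the complexity estimator for learning.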
This range of validation tests has been performed on the IBM© ZISC®-036 Neurocomputer using two different modes, L1 and LSUP (Lindblad and al. 1996). More detailed information concerning the internal ZISC®-036 parameters and their realization is available in Appendix B.1.
For these 5 different databases, we have performed validation using 8 database sizes m: 50, 100, 250, 500, 1000, 2500, 5000 and 10000 s_j instances. For each set of the three parameters (type of pattern, ZISC®-036 metric/distance mode, and database size m), the tests have been repeated 10 times for statistical purposes. In total, 800 tests have been performed. Fig. IV.2 and Fig. IV.3 show the results.
It is expected that for Example 5 (pattern i=5) (Fig. IV.1 - 10 stripe-like zones), the indicator function Q5(m) has the highest values among Q1(m) – Q4(m). It is also expected that the classification complexity ratio for Example 1 (pattern i=1), calculated by the IBM© ZISC®-036 ANN based complexity estimator, is the lowest.
Let us highlight that in Fig. IV.2 and Fig. IV.3 one observes the complexity indicator behaviour, not a complexity estimation function. Let me note that in the T-DTS framework 1 stands for the easiest case and 0 for the most complex; therefore, any complexity estimation output is linearly normalized.
Basically, Fig. IV.2 and Fig. IV.3 show an example of the Qi(m) complexity indicators' variations versus the learning database size m for the 5 different benchmarks. I have considered that the calculated classification complexity estimation ratio Q1(m0) corresponds to the easiest case and Q5(m0) to the most difficult (Fig. IV.1).
We expect that, for any given benchmark problem, increasing m reduces the problem ambiguity, which is why we observe a declining Qi(m). This means that the considered classification task becomes less complex when enough representative examples are available. On the other hand, the benchmarks' complexities are different: it is intuitively expected that the Q5 indicator function lies above Q4. Based on this supposition about the Qi(m) behaviours, I have approximated Qi(m) with a polynomial function of degree 3 in order to capture m0 from equation II.42, where Qi(m0) acts as the classification task complexity ratio.
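The sketch below illustrates, under the assumption that m0 can be read off the fitted cubic at its inflection point (where the second derivative changes sign), how such a degree-3 polynomial approximation of the measured (m, Q(m)) pairs could be computed; the exact definition of m0 follows equation II.42 and is not reproduced here.

```python
import numpy as np

def fit_q_indicator(m_values, q_values):
    """Fit Q(m) samples with a degree-3 polynomial and read off m0 and Q(m0).

    Illustrative sketch only: here m0 is taken at the inflection point of the
    fitted cubic (where its second derivative changes sign); the thesis defines
    m0 and Q(m) through equation II.42.
    """
    coeffs = np.polyfit(m_values, q_values, 3)      # a3, a2, a1, a0
    poly = np.poly1d(coeffs)
    m0 = -coeffs[1] / (3.0 * coeffs[0])             # root of 6*a3*m + 2*a2 = 0
    return m0, float(poly(m0))

# usage: m0, q_m0 = fit_q_indicator([50, 100, 250, 500, 1000, 2500, 5000, 10000], measured_q)
```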
The following Table IV.1 consolidates the classification complexities obtained for the benchmarks using the proposed method.
Table IV.1 : Benchmarks complexity rates obtained using the IBM© ZISC®-036 implementation of the ANN-structure based complexity estimator

Benchmarks         LSUP ZISC®-036 mode        L1 ZISC®-036 mode
                   m0        Qi(m0)           m0        Qi(m0)
Q1 (Example 1)     100       0.846            88        0.849
Q2 (Example 2)     170       0.818            168       0.823
Q3 (Example 3)     190       0.767            186       0.771
Q4 (Example 4)     235       0.760            229       0.761
Q5 (Example 5)     265       0.739            254       0.746
Fig. IV.2 : Stripe classification benchmarks : Qi(m) behaviour versus the learning
database size m, LSUP ZISC®-036 mode
Fig. IV.3 : Stripe classification benchmarks : Qi(m) behaviour versus the learning
database size m, L1 ZISC®-036 mode
These results are plotted in Fig. IV.2 and Fig. IV.3: the Qi(m) behaviour corresponds to the intuitive complexity of the classification benchmarks, and Qi(m0) shifts in conformity with the aforementioned expectations.
The difference between the LSUP and L1 modes of the IBM© ZISC®-036 Neurocomputer reflects a difference in the Voronoy polyhedron construction. Thus, the rhomb-like L1 metric (Appendix C.1) produces a better space partitioning (i.e. a lower number of clusters is required) than the square-like LSUP distance, regardless of the fact that the classes are purely linearly separable.
It is important to note that as the classes’ of benchmarks are perfectly separable,
meaning that Bayes error is equal to zero, it is expected that for classification complexity
estimating methods which is based on the theoretical Bayes error will not find difference
between Example 1 and Example 5.
Analysis of the plots from m0,Q1 (Example 1) to m0,Q5 (Example 5) for the related classification tasks implies the following property:

m_{0,Q1} < m_{0,Q2} < m_{0,Q3} < m_{0,Q4} < m_{0,Q5}        (IV.1)

where

Q_1(m_0) > Q_2(m_0) > Q_3(m_0) > Q_4(m_0) > Q_5(m_0)        (IV.2)
With respect with the task’s complexity incensement to observe the results that can be
represented by equation IV.1 and equation IV.2. Concerning the behaviour of indicator
function Qi(m) one has 1)(lim =+∞→
mQim.
Let me stress once more that the MIF (Minimum Influence Field of the neuron, Appendix B.1) is an important parameter influencing the Voronoy polyhedron construction and, finally, the quality of the complexity ratio.
If one looks at the provided benchmark testing as a classification process (Fig. IV.4 and Fig. IV.5), the results are, as expected, very sensitive to the m-parameter and less sensitive to the fixed IBM© ZISC®-036 metric (distance). For ∀ m > m_0, m → +∞, Fig. IV.2, Fig. IV.3 and Fig. IV.4, Fig. IV.5 show that the situation becomes more predictable regarding the indicators' evolution and the classification rates. In other words, our validation indicates that extra data (additional prototypes, m > m_0) do not change the dynamics (second derivative) of the classification process.
Fig. IV.4 : Benchmarks’ classification rates behaviour versus learning database size
m, LSUP ZISC®-036 mode
Fig. IV.5 : Benchmarks’ classification rates behaviour versus learning database size
m, L1 ZISC®-036 mode
To conclude on the results obtained for the constructed stripe-like benchmarks, we can state that the behaviour of the Qi(m) complexity indicators, the Qi(m0) ratios computed by the ANN based complexity estimator, and the classification rates used as a quality check of the Voronoy polyhedron constructed by the ZISC®-036 RCE-kNN have confirmed our expectations. The next section is dedicated to the validation of the same realization of the ANN based complexity estimator, but for a real-world classification problem.
IV.1.1.2 IBM© ZISC®-036 Neurocomputer’s implementation facing
Splice-junction DNA sequence classification problem
The second part of our validation is related to complexity estimation for a real-world problem. A good candidate is the well-studied Splice-junction DNA-Patterns classification problem from the well-known Machine Learning Repository.
This classification problem is related to the complex process of protein creation. During this process, in higher organisms, the superfluous DNA sequence is eliminated. The points on a DNA sequence at which redundant DNA is removed are called splice junctions.
One has to recognize, for given DNA sequences, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (referred to in the original database as EI sites), and recognizing intron/exon boundaries (IE sites). In the biological community, IE borders are referred to as acceptors, while EI borders are referred to as donors.
To evaluate the performance of the complexity estimation approach, I use the molecular biology database titled "Primate splice-junction gene sequences (DNA) with associated imperfect domain theory", which is available in the mentioned Machine Learning Repository of the Bren School of Information and Computer Science, University of California, Irvine. This database has the following main features: 3190 instances, 60 attributes, 3 classes (labelled as N-class, consisting of 50% of all instances, EI-class – 25% and IE-class – 25%), and no missing attribute values.
I start the validation with the generation of databases of different size m. We have created these databases by randomly extracting sub-databases from the original one with respect to the classes' distribution. The number of instances m corresponds to the database size. Each instance sj belongs to a category cj equal to 1, 2 or 3. The vector sj consists of 60 sequential DNA nucleotide positions. The pairs sj, cj, where j is the pair index (1 ≤ j ≤ m), are sent to the ZISC®-036 Neurocomputer for learning (Voronoy polyhedron construction).
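A minimal sketch of this class-distribution-respecting extraction, assuming the data are held in NumPy arrays (the function name extract_subdatabase is hypothetical), could look as follows.

```python
import numpy as np

def extract_subdatabase(X, y, m, seed=0):
    """Randomly extract an m-instance sub-database respecting the class distribution.

    Illustrative sketch of the stratified extraction described above (50% N,
    25% EI, 25% IE for the splice-junction data); proportions are taken from y.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    picked = []
    for cls, count in zip(classes, counts):
        n_cls = int(round(m * count / len(y)))          # per-class quota
        idx = np.flatnonzero(y == cls)
        picked.append(rng.choice(idx, size=min(n_cls, len(idx)), replace=False))
    picked = np.concatenate(picked)
    return X[picked], y[picked]
```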
Comparing the previous academic benchmarks with the DNA Splice-junction problem, one may state that the difference lies only in the number of classes and attributes.
Let me mention that the procedure of classifying DNA Splice-junction instances corresponds to sorting the classes with a small decision tree. According to the detailed description of the problem, a classifier must first decide whether a given instance belongs to the group of classes EI and IE or to class N. Afterwards, inside the first group, the ANN or another classifier should separate the EI class from the IE class. However, even this does not fully express the difficulty of this real-world classification problem.
To demonstrate the hardness of DNA Splice-junction classification, we quote Sarkar's and Leong's work (Sarkar and Leong 2001), which uses a special representation named the DNA walk representation to figure out the complexity: "We have plotted the trace corresponding to all three classes. We can observe that to some extend, the lines representing the classes IE and EI can be separated visually; however, even for human observer it is difficult to separate lines corresponding to the class N from the lines of the other two". Moreover, the database does not represent purely separable classes, because the Bayes error is equal to 0.0003.
Returning to the classification complexity estimation validation protocol for the DNA splice-junction problem, we have to mention that, in order to obtain statistically sound results, we generate 200 databases (files). Each validation test has been repeated twice to check the reliability of the results. As the boundaries between the classes are "complex even for human observer" (Sarkar and Leong 2001), there is no special preference in the selection of the distance measure. Thus, the validation has been performed in the L1 distance mode. However, a crucial issue is the selection of the ZISC®-036 MIF parameter. I have done my validation using 3 MIF values: 55, 56 and 4096. Approximately 8400 tests have been performed.
Fig. IV.6 represents the behaviour of the complexity indicators Qk(m), where k denotes the MIF parameter. Fig. IV.7, Fig. IV.8 and Fig. IV.9 demonstrate the influence of m0 on the quality of the learning process; mk corresponds to m0 calculated for the k-curve.
Fig. IV.6 : Qk(m) evaluation for DNA splice-junction classification problem using
different k-MIF parameters (k: 55, 56, 4096) for: Q55(m), Q56(m), Q4096(m), mk –
corresponds to calculated m0 for each k-curve
Fig. IV.7 : Quality check of RCE-kNN-like Voronoy polyhedron construction based
on its generalization ability performed for k-MIF parameter k=55
These figures are similar to Fig. IV.4 and Fig. IV.5. A brief chart analysis suggests that the highest generalization rate (rate of success), 53.5%, and the lowest rate of failure, 29.4%, are reached for MIF = 56. This allows us to conclude that this MIF parameter is the most appropriate among those available. Another fact taken into consideration for selecting MIF = 56 is the highest rate of uncertainty. It means that, in the RBFN implementation on the IBM© ZISC®-036, some instances during generalization might be labelled as belonging to none of the proposed categories. One may thus extract these undetermined instances and, using additional techniques, enhance the final generalization rate.
Concerning the generalization classification rate in absolute value for the MIF = 56 parameter, let me note that, to determine an approximate upper limit of this value, the result obtained in (Bouyoucef 2006) using a software simulation of the RBFN approach is used for comparison. I have to mention that the differences between these two implementations of the networks are insignificant: the software realization uses an advanced intra-technique of RBFN adjustment, but it does not increase the generalization rate. In the work (Bouyoucef 2006) it is 66.3%.
Let me recall that the class distribution is: 50%, 25% and 25%. Consider the question "Which of the given internal parameters (such as the MIF) and distance modes of the IBM© ZISC®-036 implementation of the ANN based complexity estimator must be chosen in order to obtain the most appropriate Voronoy polyhedron (related to the RBFN structure)?" The answer is: the set of parameters for which the testing/generalization or learning classification rates reach their maximum; the corresponding RBFN structure will then correspond to the most appropriate polyhedron. This approach demonstrates the above-mentioned relation between the classification process and the estimation of classification complexity.
The polynomial approximation of the Qk(m) indicators has been performed similarly to the classification benchmarks. The classification complexity rates are presented in Table IV.2.
Let me note that I share the possible criticism concerning the lack of evidence about the curve(s) behaviour in Fig. IV.6, in comparison to the benchmark cases of Fig. IV.2 and Fig. IV.3, where one may clearly observe the change of sign of the second derivative.
Fig. IV.8 : Quality check of RCE-kNN-like Voronoy polyhedron construction based
on its generalization ability performed for k-MIF parameter k=56
Fig. IV.9 : Quality check of RCE-kNN-like Voronoy polyhedron construction based
on its generalization ability performed for k-MIF parameter k=4096
Table IV.2 : Complexity rates obtained for the Splice-junction DNA classification problem (original database) using the IBM© ZISC®-036 Neurocomputer

MIF parameter     m0 (denoted on the charts as mk)     Qk(mk)
55                730                                  0.382
56                775                                  0.438
4096              700                                  0.896

Let me mention that for the m55, m56 and m4096 points, the resulting Q(m0) complexity rates are calculated using the approximated polynomial functions of the Q(m) indicators.
Analyzing the classification potential of the Voronoy polyhedrons constructed for the 3 different MIF parameters (Fig. IV.7 - Fig. IV.9), the best available candidate for the role of the final classification complexity ratio of the DNA Splice-junction problem is Q56(m56) = 0.438.
During the validation, I have extracted from the remaining database instances (3190 - m) a test (generalization) database of the same size m and in the same manner, so as to have directly comparable results. Let us highlight that our aim was not to obtain the minimum classification error; the goal of this testing is to assist in selecting the most appropriate classification complexity rate among those available.
It is interesting to note that the change of sign of the second derivative influences not only the Qi(m) indicators' behaviour, but also the whole classification process. Fluctuations, i.e. a quick change of the second derivative over the short sub-interval m ∈ [700;800], have been observed, which is similar to a hysteresis effect. By analogy with this phenomenon, new, unseen m-instances for m > 800 (in the hysteresis phenomenon this would correspond to an extra force) do not change the second derivative of the Q(m) indicator, because the new instances do not significantly change the corresponding Voronoy polyhedron structure. Analogously to this real-world problem, one may observe similar behaviour of the indicator functions for the classification benchmarks: Fig. IV.4, Fig. IV.5 and Table IV.1.
In conclusion, I can state that the proposed ad hoc ANN based classification complexity method in its IBM© ZISC®-036 Neurocomputer realization has confirmed our expectations. Using this approach, I have obtained a correct ranking of the benchmarks' classification complexity, and defined the classification complexity of the DNA Splice-junction problem. Comparing the benchmarks' complexity rates with the DNA Splice-junction problem's rate, we can state that the latter problem is the most complex. The analysis of the classification tasks, such as the separating space and the classes' distribution, fully corresponds to our expectations.
The next Section IV.1.2 is dedicated to the validation of the proposed complexity estimator, but for a wider range of tests and with the software implementation of the ANN based complexity estimator. This deeper validation is done in order to gain confidence in this classification complexity estimation tool before applying it in the T-DTS framework. The second main aim of the following validation is to compare the ANN based complexity estimation approach to the other complexity estimation techniques implemented in T-DTS.
IV.1.2 Software-based validation
In this section, I provide the results of the ANN-structure based complexity estimator validation using its software implementation. Our prime aim is to verify the concept of the ANN based complexity estimator more deeply, comparing it to 17 other estimators (the list is available in Section IV.2) that had already been implemented (Bouyoucef 2007) as Matlab code and as part of the previous Matlab T-DTS version 2.00. For this reason, we have created a wider and more complex range of benchmarks, and I have performed checks for a wide range of ANN based complexity estimator parameters, such as the order of the approximating polynomial. To make the outputs comparable, all 17 complexity estimators have been linearly normalized so that the complexity rates vary in the interval [0;1], where 1 signifies the easiest classification case and 0 the hardest. Such standardization and normalization is needed for the T-DTS self-adjusting procedure.
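The following minimal sketch illustrates the kind of linear rescaling meant here; the per-estimator bounds raw_easiest and raw_hardest are assumptions, and the actual normalization constants of T-DTS 2.00 are not reproduced.

```python
def normalize_complexity(raw, raw_easiest, raw_hardest):
    """Linearly map a raw estimator output onto [0;1], 1 = easiest, 0 = hardest.

    Sketch of the normalization convention described above; raw_easiest and
    raw_hardest are the estimator-specific bounds (assumed known per estimator).
    """
    t = (raw - raw_hardest) / (raw_easiest - raw_hardest)
    return min(1.0, max(0.0, t))
```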
The first part of this section provides the results obtained for a wide range of academic
benchmarks, the second one – for real-world problems.
IV.1.2.1 ANN-structure based complexity estimator using classification
benchmarks
Let me start with the description of the first academic classification benchmark (Fig. IV.10).
Fig. IV.10 : Square classification benchmarks, 2 classes, 2000 prototypes
The tests have been performed for the 3 implemented distance measures: LSUP, L1 and EUCL (Euclidean). To obtain statistics for each fixed set of ANN based complexity estimator parameters, the same test has been repeated 10 times. The benchmark depicted in Fig. IV.10 has been constructed not only for m = 2000, but also for different values of m: 200, 500, 1000 and 2000.
The influence of three different MIF values (here MIF is the Maximum Influence Field: 1024, 10.24 and 0.1024) on the complexity ratio has been checked. In total, approximately 1440 tests have been performed. If we take into account the 17 remaining complexity estimators, where the PRISM based ones have been verified using three values of the internal resolution parameter B (2, 4 and 8), the overall number was above 1600. The following Fig. IV.11 presents the influence of the benchmark complexity (in terms of an increasing number of squares) on the obtained ANN-structure based classification complexity ratio(s)/rate(s).
In the figure, the bold solid lines represent the average complexity ratio for each distance mode, and the dotted lines the corridor of standard deviations for the corresponding distance mode. A brief analysis suggests that this complexity estimator correctly reflects the trend: an increasing number of squares increases the complexity (the rates decrease). Secondly, this particular type of benchmark was used to check the influence of the type of metric (distance) on the complexity ratio.
[Plot: complexity ratio versus the number of squares (2SQ, 3SQ, 4SQ, 5SQ) for the L1, LSUP and EUCL distance modes.]
Fig. IV.11 : ANN-structure based complexity estimator evaluation for: Square
benchmarks, 2 classes, 2000 number of prototypes, MIF = 1024, 3 distance modes
(LSUP - ∆, EUCL - x and L1 - )
It was natural to expect the highest ratios (i.e. the least complex problem) to be obtained with LSUP, because a Voronoy polyhedron built from square-like LSUP influence fields is an ideal match for the proposed benchmarks. The high standard deviation for EUCL and L1 does not allow us to state that the EUCL metric is definitely better than L1. However, taking into consideration that the circle (EUCL) occupies an intermediate space-covering position between the square (LSUP) and the rhomb (L1), this supports the idea that the metric has an influence on the final complexity ratio, because the EUCL distance measure takes an average position between L1 and LSUP in space covering.
Concerning the output of the remaining 17 complexity estimators, a more detailed analysis of the influence of the number of prototypes and other parameters has been done in the work (Bouyoucef 2007); in the following paragraphs we give a brief final summary:
1. Complexity estimators that do not follow the expected classification complexity trend: Maximum_Standard_Deviation (based criterion), Fisher_Disriminant_Ratio, Normalized_mean_Distance, Mahalanobis_Distance (based), and all 4 Scattered-matrix methods based on inter-intra matrix trace criteria.
2. Complexity estimators that satisfy our complexity trend expectations:
• Insensitive (ratio is invariant): KLD (Kullback–Leibler Divergence based estimator), JMDBC (Jeffreys-Matusita distance based complexity criterion), Bhattacharyya_Coefficient (based criterion), JSD (Jensen-Shannon Divergence based estimator), Hellindger_Distance (based technique).
• Sensitive: Purity_Measure (PRISM based technique), Collective_Entropy (PRISM based technique), k_Nearest_Neighbor_Estimator (PRISM based technique) and RBF_ZISC_based_Fusion_CRT (the module name of our ANN based estimator).
It was expected to have a group of insensitive complexity estimators, because they approximate the Bayes error, which is equal to zero in our case of purely separable benchmarks.
To check the influence of the MIF parameter and of the dimensionality on the classification complexity rates, we have performed a range of tests for the 1-5 stripe-like academic benchmarks of Fig. IV.1, but only for the ANN based estimator, with the number of instances equal to 2000.
To demonstrate the influence of the MIF parameter on the complexity ratio, I provide the results given in Fig. IV.12.
[Plot: complexity ratio versus stripe benchmark type (ST2, ST4, ST6, ST8).]
Fig. IV.12 : ANN-structure based complexity estimator evaluation for: Stripe
benchmarks, 2 classes, 2000 number of prototypes, LSUP distance mode (MIF=10.24
- ∆, MIF=0.1024 - x and MIF=1024 - )
The results described in Fig. IV.12 (the average values and their corridors of standard deviations) suggest that an overestimated MIF does not influence the results (see in Fig. IV.12 MIF = 1024 against MIF = 10.24). Of course, in this case the MIF increases the overall computational time. However, an underestimated MIF does not allow the algorithm to construct a Voronoy polyhedron that reflects the classification task's complexity, and the final ratio is therefore not meaningful.
Let me note that the problem of an under- or overestimated MIF can also be observed for the IBM© ZISC®-036 ANN based complexity estimator. For example, in the case of the Splice-junction DNA classification problem, we had an overestimated MIF = 4096; for this case the RBFN has a weaker generalization ability than even for MIF = 55. It means that the ANN complexity estimator suffers a lot from under- or overestimation of the MIF parameter. However, let me underscore that in the case of the software ANN-structure based complexity estimator, MIF overestimation does not influence the results, but only the computational cost, because its implementation differs from the IBM© ZISC®-036 Neurocomputer's implementation.
In practice, this fact allows a user-independent treatment of the estimator by pre-setting the default MIF parameter to the maximum possible distance between two prototypes; any other attempt to speed up the computation by making the default MIF shorter must be justified using extra assumptions about the database, which are not available a priori.
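A minimal sketch of such a default pre-setting, assuming the prototypes are available as rows of an array (the function name default_mif is hypothetical), is given below; an O(m²) scan is shown for clarity, although a cheaper bound such as the bounding-box diameter could be used instead.

```python
import numpy as np

def default_mif(X, metric=lambda a, b: np.abs(a - b).sum()):
    """Default MIF pre-setting sketch: the maximum possible distance between two prototypes.

    O(m^2) illustration of the user-independent default discussed above.
    """
    m = len(X)
    best = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            best = max(best, metric(X[i], X[j]))
    return best
```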
The following Fig. IV.13 represents the influence of the number of prototypes/plots m on the ANN based complexity estimation ratio.
[Plot: complexity ratio versus the number of prototypes (instances): 200, 500, 1000, 2000.]
Fig. IV.13 : ANN-structure based complexity estimator evaluation for: 8 Stripe
benchmark, 2 classes (4&4 stripes), LSUP distance mode (MIF=10.24 - ∆,
MIF=0.1024 - x and MIF=1024 - )
The results described in Fig. IV.13 are similar to the results described in Fig. IV.2 and Fig. IV.3 (in terms of the final classification complexity ratios), but they are not the same. The conceptual difference is that Fig. IV.2 and Fig. IV.3 correspond to an incremental increase of the m-parameter and to the behaviour of the indicator function, from which one may naturally extract the complexity ratio. In contrast, Fig. IV.13 contains the evaluation of the final complexity ratios for an upper m-limit (m ≤ 200, m ≤ 500, etc.). Let me note that these are the final ratios, not an illustration of the dynamics of the Voronoy polyhedron construction.
Concerning Fig. IV.13, it is naturally expected that the fewer data instances are available for classification, the more complex the classification problem is; a sufficient number of prototypes defines the border(s) between the class(es) more precisely. However, let us note that for the ANN based complexity estimator, the two implementation cases (Fig. IV.2, Fig. IV.3) and (Fig. IV.13) are not comparable because of the different meaning of m. In the first case it is the varying number of instances, while in the second case the m-parameter signifies the overall number of instances. Comparing these two cases, one may say that in the case of the IBM© ZISC®-036 Neurocomputer's implementation, increasing m exhibits the behaviour of the Q(m) indicator, whereas the software ANN based implementation provides the result (complexity ratio) for the overall total number of instances m.
The following Fig. IV.14 presents a range of benchmarks grouped under the name grid. These benchmarks were constructed to verify the influence of increasing dimension on the classification complexity. For the first three dimensions, the benchmarks are depicted in Fig. IV.14:
Fig. IV.14 : Grid classification benchmarks in D1, D2 and D3 dimension
Thus, grid benchmarks have been created in dimensions D1 – D3, D4 and D5. The number of clusters for each dimension is 2^dim, meaning that in D3 we have 8 box-like clusters, and in D5 – 32. For each dimension, the number of plots/instances is fixed and equal to 2000. These benchmarks contain a "doubled classification complexity": the first factor is the increasing dimension for a fixed number of prototypes, meaning that the border in a high-dimensional space is determined more weakly for a smaller number of prototypes, because the instances are uniformly distributed without any preference for any dimension; the second factor is that the number of clusters doubles for each additional dimension. The obtained results and the particular details are represented in Fig. IV.15.
[Plot: complexity ratio versus dimension (D1–D5).]
Fig. IV.15 : ANN-structure based complexity estimator evaluation for: Grid
benchmarks, 2 classes, EUCL distance mode (MIF=10.24 - ∆, MIF=1024 - ,
MIF=0.5012 - ♦ , MIF=0.3018 - and MIF=0.1024 - x)
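For illustration, the following sketch generates such a grid benchmark under the assumption (not stated explicitly in the text) that each axis of [-1;1]^dim is split at 0 and the two classes alternate in a checkerboard fashion over the 2^dim cells.

```python
import numpy as np

def grid_benchmark(dim, m=2000, seed=0):
    """Grid benchmark sketch: 2**dim box-like clusters in [-1, 1]**dim, 2 classes.

    Assumed construction (for illustration): each axis is split at 0, giving
    2**dim cells, and the class alternates in a checkerboard fashion.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(m, dim))
    cell = (X > 0).astype(int)                     # which half-space per axis
    y = cell.sum(axis=1) % 2 + 1                   # checkerboard classes 1 / 2
    return X, y

# e.g. the D5 case with 32 hypercuboid cells
X, y = grid_benchmark(5)
```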
Let me note that the Euclidean type of metric has been selected because of its non-conformity to the benchmarks. For MIF = 1024 and MIF = 10.24 the complexity rates are identical, because both are reduced during the Voronoy polyhedron creation. Given that all plot coordinates for a given dimension lie in the interval [-1;1], the trends for MIF = 1024 and MIF = 10.24 are similar. On the other hand, the underestimated MIF = 0.1024 has a crucial impact on the complexity determination in the high-dimensional D5 feature space. Intuitively, it is impossible to agree that 32 purely linearly separable box-like clusters in D5 have 0-complexity (i.e. are very complex). Therefore, the only results that can be taken into consideration are those provided by the overestimated MIFs. Overestimating the MIF increases the computational time, but returns reasonable complexity rates; that is why the question of selecting the default MIF parameter is important, especially for high-dimensional classification tasks.
Let us also mention that a similar high-dimensionality test has been performed for the 8 Stripe classification benchmark. The conclusion concerning the importance of the MIF parameter is also confirmed by the same dynamics as in Fig. IV.15.
Therefore, to figure out the dimensional complexity induced in high-dimensional classification benchmarks, we have extended the construction of the 2D 8-stripe classification benchmark. Following a similar logic, we increase the number of dimensions up to 5, but the number of clusters and their form remain similar: it is a benchmark with 4 stripes of class one and 4 stripes of the opposite class two. We create a range of benchmarks where the separability criterion lies in one dimension. The next Fig. IV.16 presents the obtained results.
[Plot: complexity ratio versus dimension (D2–D5) for the Grid and 8-stripe benchmarks.]
Fig. IV.16 : ANN-structure based complexity estimator evaluation for: Grid and 8-
stripe-benchmarks, 2 classes, EUCL distance mode, MIF=1024
Concerning the results described in Fig. IV.16, the 8-stripe classification benchmarks are overall more complex than the Grid benchmarks. The dimension increase causes a growth of the difference between the Grid-benchmark and 8-stripe-benchmark complexity ratios (in absolute values). Intuitively, the opposite was expected, because for D5 we still have only 8 multidimensional clusters, whereas for the Grid benchmark in D5 we have 32 hypercuboids. This discrepancy between the expected and actual results can be explained if one takes into account the use of a kNN-like algorithm for the Voronoy polyhedron construction. However, this falsification of the proposed approach (in Popper's sense (Popper 2002)) should not discourage us.
Thus, the explanation of these results, which do not match the intuitive complexity of the benchmarks, is the following. It is not because the 8-stripe problem is more complex than the Grid one; it is because of the Voronoy polyhedron construction approach employed as the core of the ANN-structure based complexity estimation. The RCE-kNN-like method suits or matches well (i.e. requires a low number of clusters) the "ideal" hyper-cube space partitioning, especially for the D5 Grid benchmark, whereas for the 8-stripe benchmark in D5 the RCE-kNN-like algorithm constructs a higher number of clusters.
Therefore, one may note a relation between the ANN-structure based complexity estimator and the Kolmogorov view on complexity. Thus, the RCE-kNN-like polyhedron construction algorithm no longer appears to be "the most efficient" description for estimating the complexity of these grid-type benchmarks. However, in defence of our technique, let me mention that such benchmarks are very artificial; in most real-world cases the border between classes is more complex. Another point is that one may modify the space partitioning algorithm used inside my estimator or, more radically, apply a different ANN.
To check how the polynomial order/degree parameter influences the final classification complexity value, we have used the four spirals benchmark described in Fig. IV.17.
This 2D classification problem belongs to the same range of benchmarks as the stripes, for example. It is called the four spirals problem because it contains four volutes, and it follows the same benchmark naming idea as, for example, the stripes (2, 4 – 8). Not only the number of classes and the form of the border, but also the overall number of borders is one of the most crucial parameters of the benchmarks. The border of the benchmark described in Fig. IV.17 is a spiral. The number of instances was selected equal to 500; as has been shown during the testing, a reduced number of instances increases the classification complexity.
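A hedged sketch of one possible construction of such a benchmark is given below; the exact generating formulas of the thesis are not reproduced, and the assumption here is that the four volutes are Archimedean spiral arms offset by 90 degrees, with alternating classes.

```python
import numpy as np

def four_spirals(m=500, turns=2.0, seed=0):
    """Four-spiral benchmark sketch: 4 spiral arms (volutes), 2 alternating classes.

    Assumed construction for illustration: arms are Archimedean spirals offset
    by 90 degrees; arms 0 and 2 belong to class 1, arms 1 and 3 to class 2.
    """
    rng = np.random.default_rng(seed)
    arm = rng.integers(0, 4, size=m)
    t = rng.uniform(0.0, 1.0, size=m)
    theta = 2.0 * np.pi * turns * t + arm * (np.pi / 2.0)
    r = t                                             # radius grows along the arm
    X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    y = arm % 2 + 1
    return X, y
```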
This experiment has been performed for different types of MIF parameters: underestimated and overestimated. The polynomial approximation supports the idea expressed in equations II.43 and II.44. For both even (2, 4, 6) and odd (3, 5, 7) polynomial orders, the complexities are similar when one takes into account the corridors of the standard deviations. However, theoretically, we expected the approximation by a polynomial of order 3 to stand out, because in fact the behaviour of the Q(m) indicator is not predictable (Fig. II.6), especially when the m parameter is small.
Fig. IV.17 : Four spiral classification benchmark, 2 classes, 500 prototypes
The following validation check has been performed for polynomial orders/degrees 1-7. The results are depicted in Fig. IV.18.
[Plot: complexity ratio versus polynomial order (1–7).]
Fig. IV.18 : ANN-structure based complexity estimator evaluation for: 4 Spiral
benchmark, 2 classes, EUCL distance mode (MIF=0.1024 - x, MIF=0.2048 - ,
MIF=10.24 - ∆ and MIF=1024 - )
A brief analysis of this experiment suggests that the polynomial order parameter and the MIF must be defined by the user. However, what matters is not the number itself, but the parity of this parameter, as it describes two different Q(m) approximation approaches, II.43 and II.44. Another point is that, if one takes into account the standard deviation of the average complexity ratio, the difference between odd and even orders is not crucial at all, although it is noticeable, because the odd-order ratios lie on the border of the standard deviation corridor of the even orders, and vice versa.
Finally, the next Fig. IV.19 presents a sequence of benchmark problems (with their abbreviations) ranging from the simplest (left) to the most complex (right).
One may claim that the last 2CR benchmark is less complex than the sinusoid or spiral examples. Concerning the construction formulas, all of these 3 benchmarks use the sine function to define the classes' border, but only the last 2CR benchmark is constructed so that two overlapping zones are present. That is why the given benchmarks' Bayes error is equal to zero, except for the overlapping circles of Fig. IV.19 (right). According to this theoretical criterion, we have defined the 2CR example as the most complex one.
The features of these 6 benchmarks are similar: 2 classes, 2000 prototypes; the validation has been performed for MIF = 1024. The results for these benchmarks are depicted in Fig. IV.20.
Fig. IV.19 : Six classification benchmarks [from left to right, from top to down: 2
Stripes (2ST), 2 Grids (2GR), 2 Squares (2SQ), 2 Sinusoids (2SN), 2 Spirals (2SP)
and 2 Circles (2CR) with small overlapping zone]
[Plot: complexity ratio versus benchmark (2ST, 2GR, 2SQ, 2SN, 2SP, 2CR) for the L1, LSUP and EUCL distance modes.]
Fig. IV.20 : ANN-structure based complexity estimator evaluation for: 6
classification benchmark, 2 classes, 2000 prototypes, MIF=1024 (LSUP- ∆, EUCL - x
and L1 - )
Firstly, let me note that the central indicator is the result obtained using the EUCL type of distance metric, because a priori, and especially for real-world problems, there is no evidence of which type of distance measure matches a given problem best. Secondly, the first three types of benchmarks (2ST, 2GR and 2SQ) are strictly linearly separable, which is why the LSUP metric does not reveal any classification difficulty. For L1, we have an unexpected growth of the classification complexity value from the 2SQ to the 2SN benchmark, but let us note that the standard deviation of the complexity ratio for the 2SN benchmark using the L1 metric is the highest over the whole range of tests and is equal to 0.0099. Therefore, if we take into account the range of the overall complexity for the proposed benchmarks, [0.920;0.995], the absolute value of the obtained standard deviation allows us to ignore the unexpected jump for 2SN with the L1 distance measure.
The same holds for the LSUP metric for the 2GR benchmark, which theoretically had to be less complex than 2ST; but for 2GR the standard deviation is 0.0042 and for 2ST it is 0.028, which is why, from the point of view of the LSUP metric and the given deviation of the averages, we cannot state that 2GR is more complex than 2ST or the opposite; one may only say that the complexities of these two benchmarks are similar.
Another important note concerns the way the sinusoids are constructed for the 2SN benchmark: we have a situation where the majority of clusters are concentrated on the classification border. Thus, the decline of the curve from 2SQ to 2SN for the EUCL metric is not sharp. The overlapping problem creates a Voronoy polyhedron where each cluster (neuron) contains one prototype from the overlapping zone, and this increases the complexity.
The next section presents the verification of the ANN based complexity estimator applied to the Splice-junction DNA classification problem and the Tic-tac-toe classification problem.
IV.1.2.2 ANN-structure based complexity estimator facing real world
classification problems
In this subsection, we deal with the complexity estimation of real-world problems. This validation has been performed using the software ANN based estimator; afterwards, the results obtained by the ZISC®-036 Neurocomputer based estimator are compared with those of the software ANN based estimator.
IV.1.2.2.1 ANN-structure based complexity estimator facing Splice-junction DNA
sequences classification problem
The same L1 distance measure has been selected for the ANN-structure based complexity estimator. However, using the software simulation in our environment, I do not have a hardware memory limit; this particular realization issue allows us to determine the classification complexity more precisely. Let us note that, because of the difference in implementation between the IBM© ZISC®-036 based estimator and its software version, and because of the different role played by the MIF parameter, these two complexity estimators are not identical.
Therefore, for the L1 metric, MIF = 1024 and the complete Splice-junction DNA database used for the Voronoy polyhedron construction, the complexity ratio equals 0.6856 +/- 0.0147 (standard deviation). Compared to the result of 0.439 obtained by the ZISC®-036 hardware based implementation, one may indeed expect a lower ratio there, because the hardware implementation used a smaller amount of information. Another point is that, using the software ANN based complexity estimator, we can also be confident about the generalization ability of the obtained structure, and there is no need to do a quality check as was done for the ZISC®-036 Neurocomputer's implementation.
The second reason why there is no need to check the generalization ability of the obtained RBFN structure is that the MIF parameter has been set equal to 1024 (higher than the maximal distance between any two prototypes), and this MIF parameter has been adjusted for every cluster of the obtained Voronoy polyhedron.
An important remark is that the results strongly depend on the database representation. It is a well-known fact that these four molecules organize the sequences into three-letter basic genetic words (triplets) that represent 22 amino acids including the "Stop" coding combination (this appears to be an error-correction mechanism, so a random error in transcribing one letter of a DNA word will not necessarily produce the wrong amino acid). We have taken this basic idea into consideration and re-encoded the initial database using triplets of binary sequences. We have therefore obtained a new database containing the same 3190 instances, where each instance is a binary code of 180-bit length.
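The following sketch shows one possible triplet-style re-encoding of a 60-nucleotide instance into a 180-bit vector; the particular 3-bit codes (NUCLEOTIDE_BITS) are hypothetical and only illustrate the principle, not the actual codes used for the re-encoded database.

```python
# Hypothetical 3-bit codes for illustration only; the actual codes used for the
# re-encoded database are not reproduced here.
NUCLEOTIDE_BITS = {"A": (0, 0, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (1, 1, 0)}

def encode_sequence(seq):
    """Re-encode a 60-nucleotide instance as a 180-bit binary vector (60 x 3 bits).

    Ambiguity symbols present in the original database (e.g. N, D, S, R) are
    mapped to an all-zero triplet in this sketch.
    """
    bits = []
    for nucleotide in seq.upper():
        bits.extend(NUCLEOTIDE_BITS.get(nucleotide, (0, 0, 0)))
    return bits

# e.g. encode_sequence("ACGT" * 15) returns a list of 180 bits
```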
Then we have calculated the classification complexity ratio using the EUCL metric and MIF = 1024. The calculated complexity ratio is 0.72295 +/- 0.00785, meaning that the increase in the number of features decreases the overall complexity of the problem. This check demonstrates the importance of the database representation: a meaningless encoding of the classification problem might be an initial barrier to possible enhancement of classification and classification complexity estimation performance.
The interesting fact is that the majority of the remaining 11 classification complexity estimators, including (let us underscore this here) the Maximal Standard Deviation complexity-like criterion, define this task as very, very complex: the majority of rates are close to zero. We consolidate the results in Table IV.3. Let me note here that, besides the ANN based complexity estimator's parameters, the PRISM based methods contain an important resolution parameter B (Singh 2003).
Concerning the results, we have two groups of estimators. The first one consists of the methods which figure the DNA Splice-junction classification problem to be very, very complex: the ratio is near zero. Another group of estimators finds this classification problem complex; they provide reasonable values that could be taken into account, because of the verification that has been performed on the benchmarks. Besides the expected leader, the ANN based complexity estimator, similar results are shown by the Mahalanobis distance based measure and one of Fukunaga's four interclass distance measure criteria (J1 – J4).
Table IV.3 : Complexity rates obtained for the Splice-junction DNA classification problem (re-encoded database) using the ANN-structure based and other estimators

Complexity estimator                                    Inter-parameter(s)      Complexity ratio
Maximum standard deviation based                        N/A                     0.013
Normalized distance based                               N/A                     0.000
Fisher ratio distance based                             N/A                     0.000
Mahalanobis distance based                              N/A                     0.630
Interclass distance measure criterion J1                N/A                     0.002
Interclass distance measure criterion J4                N/A                     0.000
Interclass distance measure criterion J2                N/A                     0.002
Interclass distance measure criterion J3                N/A                     0.387
kNN based criterion                                     N/A                     0.520
Purity (PRISM)                                          B (resolution) = 2      0.180
Purity (PRISM)                                          B (resolution) = 4      0.045
Purity (PRISM)                                          B (resolution) = 8      0.003
Collective entropy (PRISM)                              B (resolution) = 2      0.041
Collective entropy (PRISM)                              B (resolution) = 4      0.010
Collective entropy (PRISM)                              B (resolution) = 8      0.001
Matlab implementation of ANN-structure based            EUCL, MIF = 1024        0.723 +/- 0.008
complexity estimator                                    LSUP, MIF = 1024        0.954 +/- 0.101
                                                        L1, MIF = 1024          0.722 +/- 0.016
IBM© ZISC®-036 Neurocomputer's implementation           L1, MIF = 56            0.439
A more detailed summary of the complexity estimation techniques is given in Section IV.1.3. The following section presents similar consolidated results of the complexity estimation techniques applied to the Tic-tac-toe endgame classification problem.
IV.1.2.2.2 ANN-structure based complexity estimator facing Tic-tac-toe endgame
classification problem
The aim of this classification task is to predict whether each of the 958 legal Tic-tac-toe endgame boards is a win for "x" or for "o". In total, 16 complexity criteria have been used. The results are given in Table IV.4.
A brief summary confirms our expectation concerning the classification complexity ranking. The Tic-tac-toe endgame problem is more complex than all the 2D academic benchmarks (starting from the simplest stripe case and finishing with the overlapping circles) and, at the same time, it is less complex than the Splice-junction DNA sequences classification problem, even taking into account the large overlap.
Table IV.4 : Complexity rates obtained for the Tic-tac-toe endgame classification problem using sixteen classification complexity criteria including the ANN-structure based complexity estimating technique

Complexity estimator                                    Inter-parameter(s)      Complexity ratio
Maximum standard deviation based                        N/A                     0.123
Normalized distance based                               N/A                     0.040
Fisher ratio distance based                             N/A                     0.004
Kullback-Leibler divergence based                       N/A                     1.000
Jeffries-Matusita distance based                        N/A                     1.000
Bhattacharyya criterion based                           N/A                     1.000
Mahalanobis distance based                              N/A                     0.404
Interclass distance measure criterion J1                N/A                     0.000
Interclass distance measure criterion J4                N/A                     0.000
Interclass distance measure criterion J2                N/A                     0.000
Interclass distance measure criterion J3                N/A                     0.037
kNN based criterion                                     N/A                     0.653
Purity (PRISM)                                          B (resolution) = 2      0.366
Purity (PRISM)                                          B (resolution) = 4      0.116
Purity (PRISM)                                          B (resolution) = 8      0.007
Collective entropy (PRISM)                              B (resolution) = 2      0.237
Collective entropy (PRISM)                              B (resolution) = 4      0.074
Collective entropy (PRISM)                              B (resolution) = 8      0.005
Matlab implementation of ANN-structure based            EUCL, MIF = 1024        0.828 +/- 0.035
complexity estimator                                    EUCL, MIF = 10.24       0.826 +/- 0.028
                                                        EUCL, MIF = 0.1024      0
                                                        LSUP, MIF = 1024        0.895 +/- 0.033
                                                        LSUP, MIF = 10.24       0.923 +/- 0.026
                                                        LSUP, MIF = 0.1024      0
                                                        L1, MIF = 1024          0.820 +/- 0.030
                                                        L1, MIF = 10.24         0.839 +/- 0.029
Considering the specificity of the obtained results, one may note that the Information theory based estimators, which approximate the Bayes error well, fail to detect a misclassification rate of 0.0003. However, the Jeffreys-Matusita based criterion was the most sensitive among them.
The results also suggest that, using the ANN based estimator, one may face the problem of an underestimated MIF parameter (MIF = 0.1024), which can lead to an output that labels this problem as the most complex (ratio equal to 0). Let me recall that the DNA Splice-junction classification problem, in the DNA walk representation, cannot be classified at a glance by a human using the classical colour sequence representation, according to the work (Sarkar and Leong 2001).
A general summary based on a more detailed analysis of the results, including the whole range of task parameters, is presented in the following section.
IV.1.3 Summary
First of all, we consider the results of the complexity estimators for particular classification benchmarks: Stripe, Grid, Circles, etc. For each of them, we can point out complexity estimators that do not confirm our expectation (for example, when the overall number of sub-zones is doubled while the other benchmark parameters remain the same, the complexity ratio should be lower (thereby exhibiting a more complex problem) than for the benchmark with the smaller number of sub-zones).
Whichever one may choose among the Maximum standard deviation based, Normalized distance based, Fisher ratio distance based or Mahalanobis distance based criteria, I can state that they do not correctly capture the evolution (increasing number of sub-patterns, dimensionality, class-separating space, etc.) of the classification complexity rate within a given classification benchmark. We cannot completely rely on them, because they have (as the validation results have shown) too many disadvantages. Nevertheless, this does not mean that they could not be applied in the T-DTS framework or in other clustering applications.
The second group contains the criteria that generally define the classification complexity correctly, but provide unexpected ratios for some specific parameters. They are: Fukunaga's four Interclass distance measure based criteria (Fukunaga 1972), the ANN-structure based criterion, and the Purity (PRISM) based criterion (Singh 2003). We should note, however, that the last two criteria are very sensitive to their intra-parameters: the B resolution parameter for the PRISM based criterion and, similarly, the MIF parameter for the ANN based estimator.
The third group of complexity estimators satisfies our expectations, but because they are based on approximations of the Bayes error, they are not sensitive in the case of purely separable classes. These estimators are: the Kullback-Leibler divergence based, Jeffries-Matusita distance based, Bhattacharyya criterion based, Hellinger distance based and Jensen-Shannon distance based criteria.
Briefly summarizing our 3-group categorization of the complexity estimators, we can generally state that, according to the test results, we have two leading complexity estimators: our ANN based complexity estimator and the Collective Entropy (PRISM) estimator. These two estimators' performance depends crucially on their intra-parameters, MIF and B. It is very important, especially in the case of the ANN based estimator, to set these parameters correctly using additional information.
Nevertheless, there are two important issues that I would like to highlight in this summary, which is not dedicated only to T-DTS development and its verification. First, within the statistical framework, the complexity estimation (class separability) issue stands on its own: outside of the T-DTS concept, a pre-analysis of the problem is required that allows the user to select an appropriate classifier, parameterize it, or even normalize/pre-process the given classification database.
The second issue is the use of a classification complexity estimation technique in the T-DTS Control Unit. The information returned by the complexity estimator is used for optimization purposes but, with respect to a pre-set threshold, it plays the role of decision maker. Thus, using an unsatisfactory complexity estimator (one belonging to the worst group of complexity estimators) does not mean that this estimator cannot drive the decomposition. The problem of selecting the complexity estimator is as important as the problem of selecting an appropriate decomposition method. The problem of selecting the processing unit is less important, because the PU is not required for database decomposition and tree-structure building.
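To illustrate this decision-maker role, the following schematic sketch (all names, including build_tree, estimate_complexity and split, are hypothetical, and the splitting/processing units are left abstract) decomposes a sub-database recursively while its normalized complexity rate stays below the pre-set threshold θ, under the convention used above that 1 denotes the easiest case.

```python
def build_tree(X, y, theta, estimate_complexity, split, min_size=20):
    """Schematic T-DTS-like decomposition sketch (hypothetical names).

    A node is decomposed while its normalized complexity rate (1 = easiest,
    0 = hardest) stays below the threshold theta and enough prototypes remain;
    otherwise it becomes a leaf to be handled by a processing unit.
    """
    rate = estimate_complexity(X, y)
    if rate >= theta or len(X) < min_size:
        return {"leaf": True, "data": (X, y), "rate": rate}
    children = []
    for X_sub, y_sub in split(X, y):          # e.g. a self-organizing NN decomposition
        children.append(build_tree(X_sub, y_sub, theta, estimate_complexity, split, min_size))
    return {"leaf": False, "rate": rate, "children": children}
```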
In conclusion, I can state that only the proposed ANN based complexity estimator fully satisfies our expectations of classification complexity estimation. It is, however, a computationally expensive method that requires careful adjustment of the MIF intra-parameter; yet the parallel IBM© ZISC®-036 or a similar hardware neurocomputing organization may significantly cut down the overall computational cost.
Another important parameter of the ANN based classification complexity estimator is the database size. For cases where the learning database of a real-world problem has to be constructed, the total size of this database, regardless of the classification problem representation, has a crucial influence on the final complexity ratio. For instance, one might construct a simple counter-example on which the ANN based complexity estimator could easily be made to fail (be falsified (Popper 2002)): a simple database that contains two instances, each of them belonging to a different class. Applying my estimator, we obtain two hyperplanes and a feedback ratio of 0 (the problem is very complex). One may defend my indicator by noting that the lack of information is equivalent to a state of uncertainty (not enough information to construct an optimal class-separating hyperplane); therefore, the obtained output "very complex" signifies complexity as uncertainty. There are, however, two ways to adjust the ANN-structure based complexity estimator. One may set a lower limit on the database size required for successful estimator processing (let me note that in the T-DTS framework this limit has been set to 10/20 prototypes). Another way is to use a different ANN type or a different g(.)/g(m) function, which is divided by the overall database (sub-database) size m. However, let me mention that in real-world cases the database contains hundreds of instances. Moreover, taking into account the present tendency of real-world databases to grow quickly in size because of internet technologies, the artificial cases of very small classification databases lie outside our prime interest.
In this discussion, the results obtained for the Splice-junction DNA sequence classification problem (Table IV.2 and Table IV.3), including the different database sizes, serve me as a justification of the database-size sensitivity of the ANN-structure based estimator; let me note that this is due to its RCE-kNN-like implementation, since kNN-based methods are very sensitive to the learning database extraction/selection. Thus, the IBM© ZISC®-036 Neurocomputer implementation of the ANN estimator sets the DNA problem complexity equal to 0.439; this experiment has been performed over a range of random extractions from the given initial database. That is why the result obtained by the ANN-structure based estimator applied to the complete Splice-junction DNA database figures the problem as less complex, 0.7222, with respect to the same L1 metric. In comparison to the results obtained for the Tic-tac-toe endgame problem, we can state that that classification problem is less complex than the DNA one.
In addition, to confirm our initial suppositions, we have calculated the Bayes error ε directly. For DNA, ε is equal to 0.0003 and for Tic-tac-toe it is 0 (because each s-instance represents only one distinct game combination). In the case of the Tic-tac-toe endgame problem, one may point to large overlapping sub-zones, but let us note that each instance of the given database describes a unique game regardless of position similarities.
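One simple way such a direct calculation can be carried out for databases with discrete attributes (the thesis does not detail its own procedure) is sketched below: identical instances carrying conflicting labels constitute the unavoidable error.

```python
from collections import Counter, defaultdict

def empirical_bayes_error(instances, labels):
    """Direct Bayes-error estimate for discrete data: identical instances with
    conflicting labels are unavoidable errors for any classifier.

    Sketch of one possible direct calculation; returns 0 when every distinct
    instance carries a single label (as for the Tic-tac-toe endgame database).
    """
    groups = defaultdict(Counter)
    for x, c in zip(instances, labels):
        groups[tuple(x)][c] += 1
    unavoidable = sum(sum(cnt.values()) - max(cnt.values()) for cnt in groups.values())
    return unavoidable / len(labels)
```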
The following Section IV.2.1 is dedicated to the verification of the ANN-structure based complexity estimator as an integral part of T-DTS, and contains the classification results that have been obtained using the T-DTS self-adjusting procedure. The validation approach remains the same: first, I conduct experiments using academic benchmarks, then real-world problems, namely the Tic-tac-toe endgame and Splice-junction DNA sequences classification problems.
IV.2 T-DTS
This validation section consists of two parts. In the first part, I present the validation of our ANN based complexity estimator as part of T-DTS. Besides the question of selecting a worthwhile complexity estimator, the user faces the following problems in handling T-DTS:
• selection of the complexity estimator: classification complexity estimation techniques depend on the data nature and on the coordinate system (a full list of desirable, but not mandatory, properties of classification complexity estimators/discriminators is mentioned in Fukunaga's work (Fukunaga 1972));
• the information required for modifying the θ threshold value and searching for the value that yields an optimal solution.
Therefore, the second part is dedicated to the verification of the self-tuning procedure. In this part, I particularly focus on the practical aspects of using/applying complexity estimation techniques and on their ability to enhance T-DTS performance.
IV.2.1 ANN-structure based complexity estimator validation
According to the T-DTS concept, the complexity estimation modules play a key role: the complexity estimators are essential for the database decomposition and for the final tree construction. In order to evaluate T-DTS performance with the ANN based complexity estimator, I have used the range of benchmarks already mentioned.
The verification on 2D classification benchmarks is highlighted in the following sub-section, Section IV.2.1.1, and Section IV.2.1.2 is dedicated to the real-world Splice-junction DNA sequences classification problem and the Tic-tac-toe endgame problem. For all these benchmarks, including the real-world classification problems, the learning databases have been extracted randomly from the corresponding database with respect to the classes' distributions; the rest of each database has been used for generalization. Let us also note that the threshold adjustment has been done manually, by the operator, in a trial-and-error way. The ANN based complexity estimator's achievements are depicted in the resulting Fig. IV.21 - Fig. IV.28 as ANN. The initial intra-parameter MIF has also been manually optimized. The EUCL distance metric has been chosen.
IV.2.1.1 ANN-structure based complexity estimator in T-DTS
framework using classification benchmarks
In order to check the T-DTS generalization ability with the embedded ANN based complexity estimator, an arbitrary PU, LVQ1, has been chosen. The results for the simple 2 Stripe classification problem are given in Fig. IV.21.
In Fig. IV.21 - Fig. IV.28, the x-axis represents the complexity threshold θ and the y-axis the average generalization rate.
The ANN based complexity estimator, depicted as ANN, takes an average position among the other complexity indicators; however, for threshold θ = 0.900, this complexity estimator achieves the maximal possible (i.e. the best) generalization, like the other complexity estimation techniques.
Fig. IV.21 : Validation of the ANN-structure based complexity estimator embedded into the T-DTS framework: 2 Stripe benchmark, 2 classes, generalization database size 1000 prototypes, learning database size 1000 prototypes, DU – CNN, PU – LVQ1
Fig. IV.21 also exhibits the relativity of complexity measuring, meaning that for the same benchmark and the same fixed DU and PU, different complexity estimators reach their maximal output for different θ thresholds. This result is expected, because the complexity measures used have different origins.
Fig. IV.22 : Validation of the ANN-structure based complexity estimator embedded into the T-DTS framework: 10 Stripe benchmark, 2 classes, generalization database size 1600 prototypes, learning database size 400 prototypes, DU – CNN, PU – LVQ1
Fig. IV.22 demonstrates the results obtained for a related, but more complex, sub-type of benchmark (10 stripe sub-zones) than the previous experiment, whose results are depicted in Fig. IV.21. In order to increase the overall classification complexity further, besides increasing the number of borders, I have reduced the learning database size to 400 prototypes. This highlights the same conclusion: the ANN based complexity indicator in the T-DTS framework is not the worst method among the others; however, for a reduced learning database, it cannot be a leader, according to its definition and the validation mentioned above.
I call attention to the fact that, even though the principal problem here is still the same 2D classification dilemma, simply with an increased number of stripes, the reduced learning database size is the key factor in the non-leading performance of the ANN based estimator. One ought therefore to set aside the imperfection of LVQ1 as the processing unit selected for this experiment. In this way, I have verified the T-DTS generalization and decomposition ability within artificial worst-case constraints. In fact, in such worst-case conditions, which arise from the conjunction of the intrinsic classification complexity and the information shortage (resulting from the reduced learning database size), it is to be expected to face such a low generalization rate (around 45%), regardless of the selected complexity estimator; but let us highlight that these parameters have consciously been selected arbitrarily for the validation.
As in the previous experiments, the classification complexity estimating techniques demonstrate their relatedness within the framework of the T-DTS application and its performance characteristics. However, it is clear that the maximal generalization rates for the majority of the complexity estimating techniques are obtained for cases where the θ threshold is approximately equal to 0.75, whereas in the case of Fig. IV.21 it is 0.8. Let me highlight here that when one deals with a classification problem within the T-DTS framework, one may overlook the decomposition performed by the DU; yet the selection of the DU is a very important issue, even though its effect may look invisible in the resulting Fig. IV.21 - Fig. IV.22.
Briefly summarizing, when the learning database size is limited, one should not expect the ANN based estimator to be among the leading complexity estimators in the T-DTS framework (an explanation based on the complexity estimator definition is given above in Section IV.1.3).
Using the complexity estimators that are based on Information theory, including the leading Jeffreys-Matusita based criterion, one can reach the performance maximum (Fig. IV.21 - Fig. IV.22), even for the same θ = 0.800; but let me highlight that this does not mean that the given classification problem can be successfully resolved using θ = 0.800 as a constant. The complexity estimators' outputs, especially after linear normalization, are relative by their definitions.
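To make the Jeffreys-Matusita based criterion mentioned above concrete, the sketch below computes the standard Jeffries-Matusita separability between two classes modelled as Gaussians. This is a generic textbook formulation; the linear normalization applied to complexity outputs inside T-DTS is not reproduced here, and the function name is illustrative only.

import numpy as np

def jeffries_matusita(x1, x2):
    """Standard Jeffries-Matusita separability between two classes of samples
    (rows = prototypes, columns = features), assuming Gaussian class models.
    Values close to sqrt(2) indicate well-separated (low-complexity) classes."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    c1, c2 = np.cov(x1, rowvar=False), np.cov(x2, rowvar=False)
    c = (c1 + c2) / 2.0
    d = m2 - m1
    # Bhattacharyya distance between the two Gaussian class models
    b = (d @ np.linalg.solve(c, d) / 8.0
         + 0.5 * np.log(np.linalg.det(c)
                        / np.sqrt(np.linalg.det(c1) * np.linalg.det(c2))))
    return float(np.sqrt(2.0 * (1.0 - np.exp(-b))))

For more than two classes, such pairwise separabilities are typically averaged or the minimum is taken; note that the measure saturates once the classes are (almost) perfectly separated, which is exactly the insensitivity limitation listed below.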
The main problems with complexity estimators that influence T-DTS performance are the following:
• complexity estimators that approximate the Bayes error very well become insensitive when the classes are perfectly separated (in the case of the Jeffreys-Matusita based criterion, the estimator demonstrates, regardless of the number of clusters, not superiority but insensitiveness; in both cases we obtain a large number of leaf sub-databases/sub-clusters);
• complexity estimators are coordinate and database-size dependent (Fukunaga 1972).
Looking for other possible reasons related to the general question "Why is the ANN-structure based estimator, verified as a leading classification complexity estimator, not a leader within the T-DTS framework?", we have put forward the following argumentation:
• The T-DTS concept is supposed to be general, but the practical realization is done for a specific set of self-organizing, prototype-based NN decomposition techniques (including the problem of their parameterization). Because the processing techniques integrated into the T-DTS framework and related to the processing units are responsible for the final performance results, the decomposition techniques cannot be treated as universal, even when the underlying "divide" and "conquer" principle is universal. The decomposition techniques may fail (according to the No Free Lunch Theorem) for a specific classification problem, and the central T-DTS idea, "performing decomposition reduces the overall task's complexity", might not hold. Briefly put, the imperfection of any decomposition technique requires a never-ending update and adjustment of the T-DTS DU database.
• Additional, unknown parameters of the tests, such as the learning and generalization databases' sizes, which are naturally defined in a random way, influence the process of tree construction; as a result, the output tree varies correspondingly with the learning database size. For the same reason, the coordinates of the centres of the sub-clusters fluctuate. As a result, the clusters change their forms and locations. Simply put, at the level of the resulting decomposition it is expected to see different mosaics (Fig. IV.11).
In the following section, I provide the validation of the ANN based complexity estimator in the T-DTS framework using two real-world problems as a validation tool.
IV.2.1.2 ANN-structure based complexity estimator in T-DTS
framework facing real-world classification problems
For these two problems, testing has been done with learning and generalization databases of equal size. They have been extracted randomly with respect to the classes' distribution, in the overall proportion 50%/50% mentioned above.
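A minimal sketch of such a stratified random split is given below; the function name and the 50%/50% default are illustrative only, and the actual T-DTS partitioning code is not reproduced here.

import numpy as np

def stratified_split(x, y, learn_fraction=0.5, seed=0):
    """Randomly split prototypes into learning and generalization sets while
    preserving the class distribution of the original database."""
    rng = np.random.default_rng(seed)
    learn_idx, gen_idx = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        cut = int(round(learn_fraction * idx.size))
        learn_idx.extend(idx[:cut])
        gen_idx.extend(idx[cut:])
    return (x[learn_idx], y[learn_idx]), (x[gen_idx], y[gen_idx])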
For the Tic-tac-toe endgame problem, the aim of the classification task was to predict whether each of the 958 legal endgame boards is won for x or for o. This problem is hard for the covering algorithm family because of the multiple overlapping (Yang, Parekh and Honavar 1999); however, a distinguishing attribute of each instance is always present, because it is impossible to have two identical game combinations.
The T-DTS results obtained for the Tic-tac-toe endgame problem are shown in Fig. IV.23. For this highly overlapping problem, the ANN based complexity estimator takes the second leading position; only the Mahalanobis distance based criterion has achieved the maximal generalization rate for this range of tests. However, for the same problem with a reduced learning database (Fig. IV.24), the leader among the complexity estimators was the Bhattacharyya bound based criterion, and again the second rank in both Tic-tac-toe endgame tests has been captured by the ANN based estimator, depicted as ANN.
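For reference, the Mahalanobis distance based criterion mentioned above rests on the classic pooled-covariance distance between class means; the sketch below shows that generic formulation only, not the exact normalization used as a complexity indicator in T-DTS, and the function name is illustrative.

import numpy as np

def mahalanobis_separability(x1, x2):
    """Mahalanobis distance between the means of two classes using the pooled
    covariance matrix; a larger distance indicates better separability,
    i.e. a lower classification complexity."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
              + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    d = m1 - m2
    return float(np.sqrt(d @ np.linalg.solve(pooled, d)))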
Fig. IV.23 : Validation of the ANN-structure based complexity estimator embedded into the T-DTS framework: Tic-tac-toe endgame problem, 2 classes, generalization database size 479 prototypes, learning database size 479 prototypes, DU – CNN, PU – MLP_FF_GDM
Another important note concerning this problem is the range of insensitive complexity estimators, such as the Kullback-Leibler divergence based one, which cannot be applied. More explanation concerning this issue is given in Section IV.2.2. Running ahead, these estimators conclude that both the problem and the sub-problems are of high complexity throughout the step-by-step decomposition. As a result, there is no sense in optimizing the θ threshold for these estimators, because there are only two decomposition cases: no decomposition and full decomposition (in general, the latter means that we obtain a large number of clusters, each containing one pure class and having a very small size).
Fig. IV.24 : Validation of the ANN-structure based complexity estimator embedded into the T-DTS framework: Tic-tac-toe endgame problem, 2 classes, generalization database size 766 prototypes, learning database size 192 prototypes, DU – CNN, PU – MLP_FF_GDM
Fig. IV.25 : Validation of the ANN-structure based complexity estimator embedded into the T-DTS framework: Splice-junction DNA sequences classification problem, 3 classes, generalization database size 1520 prototypes, learning database size 380 prototypes, DU – CNN, PU – MLP_FF_GDM
The following validation was performed for the Splice-junction DNA sequences classification problem. The results are depicted in Fig. IV.25. One can notice that the ANN based complexity estimator is leading among the indicators.
Furthermore, the indicators inapplicable for this problem are: Maximum_Standard_Deviation, which defines the problem as very complex regardless of the decomposition process, and Normalized_mean_Distance, which cannot be applied because each vector consists of 60 attributes and the complexity ratio, based on the root square deviation, identifies every problem as complex.
Next, it came as no surprise that the Fisher_Disriminant_Ratio estimator produced the worst result. Moreover, we should highlight that all three of these complexity estimators had already failed during the classification complexity validation, even in the framework of a single classification benchmark.
Summarizing the comparison of the estimators on the classification benchmarks and the real-world problems, one may state that the ANN based complexity estimator embedded into the T-DTS framework stands out among the other complexity estimating techniques. Because of its origins, this estimator is not sensitive to a large number of attributes of an input vector; still, it is computationally costly. One may also conclude that the ANN based estimator in T-DTS is an appealing candidate for solving high-dimensional problems. However, this estimator requires optimization of its internal parameters, such as the very sensitive MIF, in order to avoid extra execution time cost. Incorporating this complexity estimator had the goal of checking the performance of T-DTS compared to the other complexity estimation modules, and this validation has passed successfully. T-DTS was used to solve two real-world classification problems. We have shown that the ANN based complexity estimator embedded into T-DTS allows the user to reach better learning and generalization rates. We have also illustrated that this estimator is matchless for classification-related problems with a high-dimensional feature space, where the statistics based complexity indicators failed. This estimator appears to be a general complexity indicator and thus acts more efficiently than the other criteria. However, the problem of finding the optimal (more precisely, quasi-optimal) threshold θ at which the T-DTS output reaches its maximal performance has not been covered in this section. It is known that the θ threshold is a relative parameter because of the different origins of the complexity estimators. Therefore, we were motivated to apply the semi-automated procedure that finds a quasi-optimal θ threshold regardless of the pre-selected complexity estimator. The following section provides a verification of the proposed procedure (Section IV.1) and proves the superiority of such a user-independent approach. Another possibility to achieve a maximal generalization rate for the real-world problems (the Tic-tac-toe endgame and Splice-junction DNA classification problems), and its analysis, will be discussed in greater detail in the following section.
IV.2.2 T-DTS self-tuning procedure validation
In order to evaluate the self-tuning procedure of T-DTS, we used a similar range of tests: the benchmarks and the two real-world problems. For each problem, we have fixed the complexity estimator, the PU, the DU and the rest of the T-DTS parameters. Then, T-DTS has been launched in the self-tuning mode.
The complexity threshold of decomposition was adjusted automatically using the T-DTS self-tuning procedure described in Section III.1. The optimization function had the following macro-parameters: h = 9, z = 10, α = 0.1, meaning that T-DTS was run h · z times for each complexity estimator and fixed DU, PU and other parameters. The next section provides a validation of this procedure using classification benchmarks.
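Before turning to that validation, a minimal sketch of how such a threshold search can be organized is given below. It assumes one plausible reading of the macro-parameters (h candidate thresholds, z randomized T-DTS runs per candidate, averaging of the observed rates); the actual roles of h, z and α are defined in Section III.1, and run_tdts and performance are hypothetical placeholders for the T-DTS launcher and the P(θ) function of equation III.12.

import numpy as np

def self_tune_threshold(run_tdts, performance, thetas, z=10):
    """For each candidate threshold, run T-DTS z times (each run uses a fresh
    random learning/generalization split), average the observed generalization
    rate, learning rate and leaf count, and keep the threshold minimizing the
    performance-estimating function P(theta)."""
    best = None
    for theta in thetas:
        runs = np.array([run_tdts(theta) for _ in range(z)])  # rows: (Gr, Lr, leaves)
        gr, lr, leaves = runs.mean(axis=0)
        p = performance(gr, lr, leaves)
        if best is None or p < best[1]:
            best = (theta, p, gr, lr, leaves)
    return best  # quasi-optimal theta and the averaged rates behind it

The quasi-optimal value reported later in this section (θ = 0.7217) suggests that the actual procedure refines the search beyond a coarse grid with step α.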
IV.2.2.1 T-DTS self-tuning procedure validation using classification
benchmarks
Fig. IV.26 : Validation of the T-DTS self-tuning threshold procedure, average learning rate (including its corridor of standard deviations) as a function of the θ threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator
Fig. IV.27 : Validation of the T-DTS self-tuning threshold procedure, average generalization rate (including its corridor of standard deviations) as a function of the θ threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator
A very good illustration of how the self-tuning procedure works is given on the 4 Spiral classification benchmark. Fig. IV.26 – Fig. IV.29 provide the details of the possible quasi-optimum search for this specific classification problem.
Fig. IV.28 : Validation of the T-DTS self-tuning threshold procedure, average clusters' number as a function of the θ threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator
Intuitively and heuristically analyzing the trends, the quasi-optimal threshold is expected to be found in the subinterval [0.7 ; 0.8], where the generalization rate reaches its maximum and the learning rate continues to grow. Now, when we come to the question of what an optimal solution means for us, it is time to define it in terms of the performance-estimating function P(θ) (equation III.12).
For this purpose I have set b1 = 3, b2 = 2 and b3 = 0. This was done in order to simplify the number of parameters that define the quasi-optimum, ignoring the T-DTS execution time, which is proportional to the number of prototypes (Fig. IV.28).
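Equation III.12 is not reproduced in this section, so the sketch below only assumes a common weighted-sum form, in which b1 penalizes the generalization error, b2 the learning error and b3 a normalized measure of the tree size (a proxy for execution time). The exact definition and normalization in equation III.12 may differ; in particular, the concrete values reported in the text (e.g. min P(θ) = 0.42) come from the thesis's own definition and are not expected to be reproduced by this sketch.

def performance(gr, lr, leaves, b1=3.0, b2=2.0, b3=0.0, max_leaves=500):
    """Assumed weighted-sum form of P(theta); smaller is better.
    gr and lr are the generalization and learning rates as fractions in [0, 1];
    leaves is the average number of leaf sub-databases of the tree."""
    return (b1 * (1.0 - gr)                 # generalization error, highest priority
            + b2 * (1.0 - lr)               # learning error
            + b3 * (leaves / max_leaves))   # proxy for execution time (b3 = 0 here)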
Fig. IV.29 : Validation of the T-DTS self-tuning threshold procedure, performance-estimating function P(θ): 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator
Fig. IV.29 describes the evolution of the P(θ) function over the θ-threshold interval [0.1 ; 0.8] for the 4 Spiral academic benchmark and a fixed range of parameters. It shows that for θ close to 0.7, T-DTS reaches its performance maximum (minimum of P(θ) = 0.42) in terms of the combination of generalization and learning rates, where the highest priority is set on maximizing the overall generalization rate. More precisely, using the self-tuning θ-threshold procedure implemented in T-DTS, the minimum of P(θ) has been found for θ = 0.7217. For different selected complexity estimators, DUs and PUs, different combinations of satisfactory results are possible. These results are given in Table IV.5, where "Gr" stands for the generalization rate, "Lr" for the learning rate, and "Std" for the standard deviation of these parameters.
Table IV.5 : Classification results: 4 Spiral benchmark, 2 classes, generalization
database size 500 prototypes, learning database size 500 prototypes
DU Complexity estimator PU Gr±Std/2 (%) Lr±Std/2 (%) Avr. leaf No. ±Std/2 θ
CNN Collective entropy based Elman_BN 79.1583±0.4960 96.4870±0.1813 104.00±5.95 0.2798
SOM Collective entropy based Elman_BN 77.7956±1.6256 97.6846±0.2961 144.20±6.21 0.3353
CNN Fisher measure based PNN 80.4008±0.4216 95.8882±0.2505 176.2±1.98 0.7217
Based on the given results, one may select the one that best satisfies the given constraints, such as a low standard deviation of the generalization rate, the maximal possible generalization rate, or a satisfactory generalization rate combined with the maximal learning rate. Let us stress that in the framework of T-DTS there is no a priori "best" or most "optimal" solution; there is a set of possible quasi-optimal solutions.
Fig. IV.30 : Validation of the T-DTS self-tuning threshold procedure, clusters' number distribution: 4 Spiral benchmark, 2 classes, learning database size 500 prototypes, DU – CNN, Collective entropy based complexity estimator
Minimizing the P(θ) performance function induces configurations of the results in which the generalization and learning rates satisfy the user's expectations. To obtain this minimum, different types of PU must be applied. The selection of an appropriate PU is a heuristic procedure that requires user experience. Characteristics of the problem and its type given in advance, together with knowledge of the PUs' features, are helpful for the selection of a particular PU. The possibility of reaching the quasi-optimum is predefined by the form of the histogram given in Fig. IV.30.
Therefore, the form of the histogram in Fig. IV.30 defines the divisibility of the initial database: if the complexity estimator measures the classification complexity in a proper manner, we obtain an appropriate histogram.
However, another key factor that determines the form of the histogram, and that is invisibly present each time it is built, is the decomposition used. If the decomposition produces sub-clusters in such a manner that it does not reduce complexity (regardless of the problem), then, whatever complexity estimator one has used, perfect or not, the histogram exhibits this case by its form. In conclusion, the pair of DU and complexity estimator determines the divisibility of the initial database.
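The sketch below illustrates, under stated assumptions, how such a divisibility histogram can be obtained: the selected DU is applied recursively down to a "maximal" decomposition tree and the normalized complexity of every leaf is histogrammed. The functions decompose and complexity are hypothetical placeholders for the selected DU and complexity estimator, and the stopping rule is only one possible choice, not the T-DTS implementation.

import numpy as np

def leaf_complexities(data, labels, decompose, complexity, min_size=10):
    """Split recursively until sub-databases become too small (or the DU cannot
    split further) and collect the complexity estimate of every leaf."""
    if len(labels) <= min_size:
        return [complexity(data, labels)]
    subsets = list(decompose(data, labels))  # sequence of (sub_data, sub_labels)
    if len(subsets) <= 1:                    # DU could not split further
        return [complexity(data, labels)]
    leaves = []
    for sub_data, sub_labels in subsets:
        leaves.extend(leaf_complexities(sub_data, sub_labels,
                                        decompose, complexity, min_size))
    return leaves

def divisibility_histogram(complexities, bins=10):
    """Histogram of normalized leaf complexities in [0, 1]: mass near 0 suggests
    that decomposition simplifies the task, mass near 1 that it does not."""
    return np.histogram(np.asarray(complexities), bins=bins, range=(0.0, 1.0))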
Fig. IV.31 : Validation of the T-DTS self-tuning threshold procedure: 10 Stripe benchmark, 2 classes, generalization database size 1000 prototypes, learning database size 1000 prototypes, DU – CNN, PU – LVQ1, 4 complexity estimators
Fig. IV.31 and the following figures describe the results obtained for the classification benchmarks and then for the real-world problems (the ANN based complexity estimator is marked under the name ZISC). Here, we start from the two-class, 10 Stripe benchmark problem (Fig. IV.31). The x-axis represents the decision threshold, the y-axis the percentage of the learning and generalization rates.
For this two-class benchmark problem, the quasi-optimal thresholds for the four complexity estimators were found in the range [0.8591 ; 0.9992]. Because of the artificiality of the benchmark, these optimal thresholds lie close to each other regardless of the complexity estimator. In this sub-interval, the four complexity estimators achieve their maximums (98-99% of generalization for the proposed θ thresholds). These results correspond to the results obtained in the work (Bouyoucef, 2007).
The aim of this experiment was not to find the best complexity estimator in the T-DTS framework or to determine the best performance, but rather to test the self-tuning threshold procedure for the same range of fixed parameters as for the given classification benchmark, including the bi priorities of P(θ).
The most interesting aspect of these results is that setting h = 9 allows T-DTS, in the user-free mode, to find the quasi-optimal θ threshold with high accuracy, particularly because of the continuous behaviour of the predefined P(θ). However, this does not mean that a human operator cannot heuristically achieve this accuracy.
The next section is dedicated to a validation of the self-tuning threshold T-DTS
procedure for two real-world classification problems.
IV.2.2.2 T-DTS self-tuning procedure validation using real-world
classification problems
Applying the self-tuning threshold T-DTS procedure to the Tic-tac-toe problem, we have used different database partitionings and complexity estimators. The histogram obtained for the maximal decomposition tree determines the overall database divisibility for the pre-selected DU and complexity estimator.
For the Tic-tac-toe endgame problem, we provide the two histograms obtained using two trusted complexity estimators: the Collective (PRISM based) entropy and my ANN based complexity estimator. The results are described in Fig. IV.32 - Fig. IV.33.
A brief analysis suggests that, according to the Collective entropy's histogram, the database is divisible and decomposition provides clusters of different complexity; however, this does not mean that it increases T-DTS performance, since the sub-clusters still have a high complexity ratio. Concerning Fig. IV.33, it suggests that decomposition does not decrease the database complexity and the sub-databases remain complex; thus, the database of the Tic-tac-toe endgame problem is not divisible in the sense of classification simplification.
Let me note here that, according to the complexity estimators' validation, our ANN based estimator has been proven to be more trustworthy than the Collective (PRISM based) entropy; however, both of them are leaders among the approaches used for the validation. Applying the self-tuning T-DTS threshold procedure for different combinations of the general database partition, using different PUs and complexity estimating techniques, gives the same result: no database decomposition reduces the overall complexity and, as a consequence, decomposition cannot increase performance.
Fig. IV.32 : Validation of the T-DTS self-tuning threshold procedure, clusters' number distribution: Tic-tac-toe endgame problem, 2 classes, DU – CNN, Collective entropy complexity estimator
Fig. IV.33 : Validation of the T-DTS self-tuning threshold procedure, clusters' number distribution: Tic-tac-toe endgame problem, 2 classes, DU – CNN, ANN-structure based complexity estimator
Therefore, the database should be processed as a whole, or another, more sophisticated decomposition method must be used. Knowing the origin of this problem, this conclusion is to be expected, because each instance describes a unique game combination that determines a unique part of the border in the feature space S; moreover, the problem is highly overlapping, which is why a low learning database ratio also reduces performance.
Table IV.6 : Classification results: Tic-tac-toe endgame classification problem
Method description Type of algorithm Accuracy (%)
MLP_FF_BR (90% of database used) MLP FF based 99.9063
MLP_FF_BR (80% of database used) MLP FF based 99.6545
Elman_BNwP (90% of database used) MLP FF based 99.5521
T-DTS (80% of db), ANN-based CE, θ=0, 2 clusters, PU: Elman_BNwP T-DTS 98.4921
MLP_FF_BR (70% of database used) MLP FF based 98.4583
CN2 standard Rule induction 98.33
Elman_BNwP (70% of database used) MLP FF based 98.0521
T-DTS (70% of db), Fisher ratio CE, θ=0, 2 clusters, PU: Elman_BNwP T-DTS 98.0050
IB3-CI Instance learning 97.8
MLP_FF_BR (60% of database used) MLP FF based 97.3890
Decision tree learner +FICUS Feature construction 96.45
kNN +FICUS (k=3) Feature construction 96.14
T-DTS (70% of db), kNN Matlab CE, θ=0, 2 clusters, PU: PNN T-DTS 95.7334
kNN +FICUS (k=5) Feature construction 95.35
kNN +FICUS (k=7) Feature construction 94.99
kNN +FICUS (kNN – basic) Feature construction 94.73
CN2-SD (γ=0.9) Rule induction 88.41
CN2-SD (γ=0.7) Rule induction 85.07
MLP_FF_BR (60% of database used) MLP FF based 85.0313
CN2-SD (γ=0.5) Rule induction 84.45
MBRTalk Instance learning 84.1
CN2-SD (add. weight.) Rule induction 83.92
Backpropagation +FICUS /note: high standard deviation of the results/ Feature construction 81.66
NewID Decision tree based 79.8
CN2 WRAcc Rule induction 70.56
Table IV.6 presents a comparative study of T-DTS against the performance of other classification approaches (Aha 1991), (Lavrac, Flach and Todorovski 2002), (Markovitch and Rosenstein 2002). The classification results are given in terms of accuracy (the combined learning and generalization rate) in order to be comparable to the other methods.
This table clearly positions the T-DTS based results and shows the influence of the Tic-tac-toe database decomposition on the performance. Let me note that any complexity estimator producing a histogram similar to that of the ANN based estimator will be a good candidate for processing this problem in the T-DTS framework. Such candidates include not only the Fisher based ratio, but also Fukunaga's four interclass matrix distance criteria. However, one may select a solution with two or more clusters if the given result satisfies the initial conditions.
For the Splice-junction DNA sequence classification problem, the results are described in Fig. IV.34. The Fisher_Disriminant_Ratio complexity estimator is the leader, with a generalization rate of 74.6218%, and the ANN based one is second. Maximum_Standard_Deviation is inapplicable because of the same weaknesses as were mentioned above when Fisher_Disriminant_Ratio was applied to the Tic-tac-toe endgame problem.
Fig. IV.34 : Validation of the T-DTS self-tuning threshold procedure: Splice-junction DNA sequences classification problem, 3 classes, generalization database size 1520 prototypes, learning database size 380 prototypes, DU – CNN, PU – MLP_FF_GDM, 3 complexity estimators
This confirms our expectation that the ANN-structure based complexity estimator (marked in Fig. IV.34 as ZISC) is broadly applicable regardless of the specificity of the problem.
Before analyzing the quality of the obtained results and focusing on the main goal of any automatic classification method – maximization of the generalization rate – let me mention the weaknesses of the T-DTS application (not of the concept). The application assumes that a tuple, for example the pair <complexity threshold; complexity estimating method>, determines the optimum or quasi-optimum; for different methods, the quasi-optimal threshold may also be different.
Firstly, it is incorrect to assume or simplify that the set of optimal thresholds for the whole range of complexity estimators applied to a certain single classification task can be allocated in some sub-interval. Several aspects influence the optimization function P(θ), including the relativity of the complexity rate (except for the ANN based complexity estimator), meaning that we cannot optimize the search for an appropriate complexity estimator among the available ones.
Secondly, we expect that the main controlling pair <complexity threshold; complexity estimating method>, used by the DU during decomposition, simplifies the problem regardless of the PU. In fact, taking into account that the problem of finding an optimal decomposition is NP-hard, this is an oversimplified expectation, as it assumes that the obtained decomposition is quasi-optimal, i.e. that the simplification is indeed achieved.
Finally, turning back to the classification goal of the manipulations mentioned above, let us highlight that, in the framework of the T-DTS output, we consider minimization of the generalization error to be the principal direction.
Furthermore, we search for the answer to the question of how to predict and predefine the way of decomposing (or not decomposing) in which the maximal generalization rate can be achieved. Once more, our idea is based on the macroscopic features of self-organization highlighted in the work (Haken 2002). Thus, the macroscopic characteristic that rules decomposition is the histogram of divisibility extracted from the maximal decomposition tree.
To illustrate this principle, we have used an encoded database (in order to enhance the testing rate – our principal aim) for the Splice-junction DNA sequences classification problem. Fig. IV.35 and Fig. IV.36 show this database's divisibility and complexity reduction based on two complexity estimators.
Based on the given histograms, one may conclude that the problem is hard to decompose: decomposition does not reduce the complexity, and during decomposition the majority of sub-clusters remain very complex. That is why the decomposition does not provide generalization error minimization.
Fig. IV.35 : Validation of the T-DTS self-tuning threshold procedure, clusters' number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size 1595 prototypes, DU – CNN, Purity PRISM based complexity estimator
Fig. IV.36 : Validation of the T-DTS self-tuning threshold procedure, clusters' number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size 1595 prototypes, DU – CNN, Fukunaga's interclass distance measure J1 based complexity estimator
Therefore, we have consolidated the maximum results reached by T-DTS for the Splice-junction DNA classification problem in Table IV.7; as expected, the maximum is reached when the database is not decomposed.
Table IV.7 : Classification results: Splice-junction DNA sequences classification
problem, three classes, generalization and learning database size 1595 prototypes
DU Complexity estimator PU Gr±Std/2 (%) Lr±Std/2 (%) Avr. leaf No. ±Std/2 θ
None None Elman_BN 94.6675±0.0421 99.9373±0.0181 None None
CNN Purity PRISM based Elman_BN 93.5966±0.4174 99.8800±0.0224 2±0 0.0033
CNN Purity PRISM based Elman_BN 93.3950±0.4302 99.8500±0.0533 4±0 0.0340
The results show that the decomposition process reduces the generalization ability. However, when the processing time is taken into account, it is quite probable that a user willing to sacrifice one percent of the generalization ability may select the 4-cluster T-DTS solution.
To conclude, I provide brief consolidated results (only the maximal characteristics) obtained by different authors, including methods employing ROC analysis (Makal, Ozyilmaz and Palavaroglu 2008), for this particular classification problem in Table IV.8.
Table IV.8 : Consolidation of the classification results: Splice-junction DNA
sequences classification problem.
Method description Specificity of the method Generalization rate (%)
Maximum obtained in the work (Lumini and Nanni 2006) Hierarchical SVM based 99
Maximum obtained in the work (Duch 2002), (learning db is modified) Specific NN based 95
Elman_BN (50% of database used) MLP FF based 94.6675
The average result for this type of problems (Malousi et al. 2008) SVM based 94
T-DTS (50% of db), 2 clusters, PU: Elman_BN T-DTS 93.5966
T-DTS (50% of db), 4 clusters, PU: Elman_BN T-DTS 93.3950
Maximum obtained in the work (Malousi et al. 2008) for ANN-based solid ANN-based 93.3890
MLP (by ROC analysis) solid ANN-based 91.23
GRNN (by ROC analysis) solid ANN-based 91.14
RBF (by ROC analysis) solid ANN-based 89.35
Let us compare our output to the results obtained in the work (Malousi et al. 2008). We have obtained a higher generalization rate (94.66% against 91.23%) because of the embedded recursivity of Elman's Backpropagation. The work of Makal, Ozyilmaz and Palavaroglu provides a good summary of the different solid-ANN methods applied to this particular problem. The approaches used there (Malousi et al. 2008) do not use decomposition. Therefore, it is important to note that our T-DTS result with the four-cluster solution (93.3950%) is better than the results of (Malousi et al. 2008). In the work (Lumini and Nanni 2006), it was shown that SVM based methods, especially hierarchical SVM based methods like HM, Subspace and RankSVM, have reached better results (97-99%); however, let me recall that the question of SVM parameterization is complex and requires additional applied techniques. An interesting fact is that the very specific methods proposed in the work (Duch 2002) – RBF with 720 nodes, the GhostMiner version of kNN, and Dipol92 – surpass our result with their 95% of generalization; nonetheless, their learning databases have been specially adjusted for these three methods, so we cannot consider these methods to be general. If one looks at the general problem of DNA splice-junction classification as a general medical problem, the work (Malousi et al. 2008) provides, for similar databases, an average generalization rate of 94%, even when SVM based methods are used.
Finally, we can state that, using the T-DTS approach with the enhanced self-tuning procedure applied to the Splice-junction DNA sequences classification problem, a generalization rate of 93.4% – 94.6% was obtained, which is around the average computed for this type of problem. However, various SVM based methods (not the NN-based processing methods that have been used in T-DTS) are the leaders. This result analysis stimulates a further update of the T-DTS PU database with SVM methods. The following section provides an overall T-DTS validation summary.
IV.2.3 Summary
In this section, we performed a range of experiments dedicated to the validation of the T-DTS approach, including its recent enhancement implemented as T-DTS v. 2.50. The first part of the validation confirmed the superior performance characteristics of the ANN based complexity estimator: using the proposed novel method as the decomposition controller of T-DTS, one can reach the maximal generalization ratios for the academic benchmarks. Since these maximal ratios (in absolute values) can typically be achieved only with the best current complexity estimators, e.g., Mahalanobis distance based, Normalized distance based, Maximal standard deviation based measures, etc., the proposed ANN based estimator proved itself to be a very practical approach. Moreover, it should be noted that our estimator performed well even in those real-world problems where a range of other popular complexity estimators, including the Kullback-Leibler divergence and Hellinger distance based estimators, were not applicable because of their Information theory origins.
The second part of the T-DTS validation tested the proposed self-tuning procedure; recall that this procedure is able to answer the following questions:
• why using a leading (proven to be leading) complexity estimation technique might not maximize the T-DTS output;
• why applying the T-DTS tree-like decomposition technique to some classification problems cannot enhance the performance of the technique beyond the results of an alternative, non-decomposing task-processing technique.
During the experiments, the obtained histogram of divisibility, i.e., the result of the T-DTS maximal decomposition tree, gives an answer to the second question. The consequence of this analysis might stimulate the user to choose another decomposition unit or another complexity estimator to control the process of decomposition. The self-tuning procedure validation confirmed our expectations: employing this semi-automated procedure does allow the user to find a range of quasi-optimal solutions. The next section presents a consolidated conclusion on the validation of the T-DTS enhancements and complexity estimators, including the proposed ANN based complexity estimator.
IV.3 Conclusion
The first part of this chapter was dedicated to the experimental validation of the proposed ANN based complexity estimation technique, which we implemented using the IBM© ZISC®-036 Neurocomputer and the Matlab environment. The latter implementation also allowed performing a comparative analysis of 17 complexity estimators. During the verification, we observed that the complexity estimators fall into three groups based on their relative performance. The third, most effective, group contains the leaders of classification task complexity estimation; these include the PRISM (Singh 2003) based methods and the novel ANN based complexity estimator.
The second part of the chapter provides the results of the T-DTS concept validation, where we showed that the ANN-structure based complexity estimator belongs to the class of leading complexity estimators. Moreover, even though the classification complexity estimation technique is at the kernel of T-DTS, as it controls the decomposition, the results of the evaluation showed that this control may also be performed successfully (in terms of T-DTS performance) by a range of other techniques that, taken separately, cannot appropriately measure the true classification complexity.
The last section of the second part provides the results obtained for the validation of the self-tuning complexity (i.e. θ-threshold) procedure, which allows the user to find a quasi-optimal θ-threshold. Also, while searching for a quasi-optimal θ-threshold, T-DTS might produce a whole range of satisfactory solutions; this allows the user to select the most preferable combination of the output characteristics, such as the generalization and learning rates, the total number of clusters, and the overall T-DTS processing. It should also be emphasized that the most important part of the self-tuning procedure is the divisibility histogram. As shown in the validation, it is not only the source of information for the θ-threshold adjustment; according to the results, this histogram also explains why the T-DTS approach could not be applied to some classification problems. Such an unsatisfactory histogram output might be regarded as a stimulus for the further development of decomposition and complexity estimation methods.
The general conclusion and perspectives of this work are separately consolidated in the following section.
General conclusion and perspectives
Conclusion
T-DTS, a multi-model “Divide-To-Conquer” based classification technique was
initially introduced by Madani & Chebira (Madani 2000). Its key-components, “self-
organizing ability” and “complexity estimation loop”, have been subject of two prior
doctoral works performed by Rybnik (Rybnik 2004) and Bouyoucef (Bouyoucef 2007),
respectively. In this thesis, we proposed and verified several important extensions of this approach. The main focus of this work was, on the one hand, the complexity estimation issue of T-DTS and, on the other hand, the overall enhancement of T-DTS.
We developed a novel Artificial Neural Network (ANN) structure-based "classification task's complexity estimation" technique. We implemented and validated the proposed technique within the MatLab environment, performing a comparative analysis of 17 complexity estimators that had already been used as "complexity estimators" in the above-mentioned preceding doctoral works. Moreover, a fully parallel implementation of the proposed ANN based "classification task's complexity estimator" has been proposed, implemented and validated using the IBM© ZISC®-036 Neuro-processor.
The experiments confirmed the effectiveness of the proposed ANN based complexity
estimation approach, showing performances comparable to the current leading techniques,
such as PRISM based methods and Information theory based complexity estimators
(Kullback–Leibler divergence, Jeffreys-Matusita distance, etc.). An important note is that the group of "Information Theory" based estimators shows a partial sensitiveness to the number of features, which overrules the "classification complexity" estimation. In other words, the performed analysis pointed out that the aforementioned group of complexity estimators is less sensitive to the concerned features and so remains ineffective compared to the PRISM based and ANN based complexity estimators. The origin of the above-mentioned insensitiveness can be explained by the fact that the aforementioned group of techniques works on a statistical approximation of the Bayes error. However, it should also be mentioned that one of the current disadvantages of the ANN based estimator is its high computational cost and the required parameterization (of the influence field value: a specificity of the used ANN model). One should also take into consideration the fact that there is no ideal way of constructing ANN structures and, as an impact of this fact on T-DTS, the proposed complexity estimation approach might, in some cases, produce an incorrect estimation for a number of classification problems. The above-mentioned shortcoming may occur either in the case of learning data scarcity or when the ANN structure remains inappropriate. However, the latter issue is a general feature of any machine learning based structure, and the former is easy to overcome using the database size control that has already been realized.
Some complexity estimators, such as the Maximum standard deviation based criterion, showed severe limitations. We encountered several issues limiting the applicability of such approaches and rejecting them definitively. However, let us highlight that the mentioned practical inapplicability is related neither to the T-DTS decomposition control ability nor to its ability to search for a quasi-optimal tree structure of the NNs ensemble.
Even though the validation of the proposed complexity estimators (performed on the benchmarks and two real-world problems) did not clearly point to a single best T-DTS decomposition control agent, it demonstrated that the ANN based complexity estimator belongs to the group of leading techniques. This means that slotting the ANN based complexity estimator into the T-DTS framework guarantees the possibility of obtaining an outperforming NNs ensemble tree-structure.
In the second part of this work we studied the matter of the overall enhancement of T-
DTS. We designed and implemented the T-DTS classification system as a framework
consisting of a set of independent components, e.g., decomposition method, processing
method, and complexity estimation method. Owing to the proposed architecture, the
implemented system can be used as a versatile platform for further investigations of T-
DTS, where each component can be investigated, modified independently of other
components and easily replaced as soon as an improved version becomes available.
Moreover, such a clear separation of the basic components of the system makes the use of the T-DTS platform accessible to those potential users who would like to experiment with the platform but do not have expertise in the architecture of the system. The aforementioned versatility has been implemented as three levels of competence (i.e. three levels of usage): the "basic-user" level, the "advanced-user" level and the "programmer" level.
The next contribution of this work is the design of a procedure for an automated quasi-optimal adjustment of the θ – threshold parameter of T-DTS. The proposed procedure is
based on the histogram analysis of separability and complexity reduction characteristics
of analyzed data. Since in the previous iterations of T-DTS the threshold had to be
adjusted manually (using a non-trivial procedure), the proposed automation significantly
improves the usability of the approach and paves the way for industrial adoption of T-
DTS. As an added benefit, the histograms composed in the proposed procedure have
proven to be also useful for investigating macroscopic features of self-organizing abilities
exposed by T-DTS. The validation showed that use of the θ – threshold adjustment
procedure allows producing not only a single satisfactory result, but rather a set of
possible satisfactory results, where the user can select a suitable result based on his (or
her) particular preferences (needs): low accuracy error rate, low learning error rate, etc. It
should be mentioned that applying the automated procedure requires an additional
computational effort. However, this effort is well-justified considering that the alternative
manual heuristic of θ – threshold adjustment might also lead to a heavy and time
consuming search for a quasi optimal NNs ensemble structure.
Perspectives
The further development of the ANN-structure based complexity estimator requires enhancing the g(.) function. For example, the extraction of additional information related to classification complexity might aggregate other parameters, such as the length of the bounds, etc. Employing different types of NNs is also one of the possible directions for developing the technique.
To improve the current implementation of T-DTS, we plan to create and incorporate (into the framework of T-DTS) a database of algorithms for searching for the quasi-optimal θ – threshold. We also plan to embed the histogram analysis algorithms. These improvements are in the near-term T-DTS development plan.
Next, we would like to experiment with SVM-T-DTS implementation (combining T-
DTS and SVM) in order to investigate the superior classification properties of SVM. As it
was shown by the comparative DNA-problem results’ analysis, the SVM-based methods
are well suited for the classification problems in the medical area. Precisely, it was shown
that the hierarchical SVM methods are the leaders among the available methods for DNA
sequences' classification. Another possible area of T-DTS extension is a combination of self-organizing maps and LSVMDT (which is currently used for the processing of satellite images).
The usage of histogram analysis in the T-DTS framework allows the user to analyze the complexity estimator when a known outperforming decomposition unit and a special benchmark are selected. We would also like to investigate the opposite case: analyzing the decomposition unit's "simplification" ability for a selected leading complexity estimator.
The far perspective of this work includes applying the ANN based estimator not only to T-DTS-like NNs ensemble construction and adjustment, but also to the optimization of any classifier. We also plan to increase the computational performance by using a more suitable alternative to the current PC-based hardware platform. Since the approach makes heavy use of Neural Networks, T-DTS naturally lands on the emerging hardware platform of neurocomputing. A key role in this perspective may be played by the IBM© ZISC®-036 neuro-processor, which provides a fully parallel hardware implementation for ANN structures. The aforementioned option could play not only the role of the decomposition controller, but could also act as an optimizer of any classifier (of the processing unit).
In conclusion, we believe that the close and far perspectives in the design and enhancement of the T-DTS classification architecture would make T-DTS a promising industrial platform for resolving complex classification problems.
Appendixes
A. The list of publications

Articles published in international journals
Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, “Estimating Complexity of Classification Tasks Using Neurocomputers Technology”, International Journal of Computing (ISJC’2009), ISSN 1727-6209, Vol. 8, Issue 1, pp. 43-52, 2009.
Ivan Budnyk, El-Khier Bouyoucef, Abdennasser Chebira, Kurosh Madani, “Neurocomputer Based Complexity Estimator Optimizing a Hybrid Multi Neural Network Structure”, International Journal of Computing (ISJC’2008), ISSN 1727-6209, Vol. 7, Issue 3, pp. 122-129, 2008.

Articles published as chapters in collective books
Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, “ZISC Neuro-computer for Task Complexity Estimation in T-DTS framework”, Artificial Neural Networks and Intelligent Information Processing, INSTICC PRESS, ISBN: 978-989-8111-35-7, Vol. 4, pp. 18-27, 2008.
Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, “ZISC Neural Network Base Indicator for Classification Complexity Estimation”, Artificial Neural Networks and Intelligent Information Processing, INSTICC PRESS, ISBN: 978-972-8865-86-3, Vol. 3, pp. 38-47, 2007.

Articles published in international symposiums and conferences
El-Khier Bouyoucef, Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, "A Modular Neural Classifier with Self-Organizing Learning: Performance Analysis", Proceedings of the Sixth International Conference on Neural Networks and Artificial Intelligence (ICNNAI 2008), ISBN: 978-985-6329-79-4, Minsk, Byelorussia, 27 – 30 May, pp. 65-69, 2008.
Ivan Budnyk, El-Khier Bouyoucef, Abdennasser Chebira, Kurosh Madani, "A Hybrid Multi-Neural Network Structure Optimization Handled by a Neurocomputer Complexity Estimator", Proceedings of the International Conference on Neural Networks and Artificial Intelligence (ICNNAI 2008), ISBN: 978-985-6329-79-4, Minsk, Byelorussia, 27 – 30 May, pp. 310-314, 2008.
Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, "Estimating Complexity of the Classification Tasks that Using Neurocomputers", Proceedings of the International Conference on Intelligent Data Acquisition and Advanced Computing Systems (IEEE - IDAACS 2007), IEEE Cat. N°: 07EX1838C, ISBN 1-4244-1348-6, Dortmund, Germany, 6 – 8 September, pp. 207-212, 2007.
Abdennasser Chebira, Kurosh Madani, Ivan Budnyk, "Task Complexity Estimation Using Neural Networks Hardware", Proceedings of the 9th International Conference on Pattern Recognition and Information Processing (PRIP’2007), ISBN 978-985-6744-29-0, Minsk, Byelorussia, 22 – 24 May, Vol. 1, pp. 59-63, 2007.

Articles published in national symposiums and conferences
Abdennasser Chebira, Ivan Budnyk, Kurosh Madani, "Auto-organisation d’une Structure Neuronale Arborescente", Proceedings of the XVIth Joint Meetings of the French Society of Classification (SFC 2009), Grenoble, France, 2 – 4 September, pp. 19-22, 2009.
B. Complexity
Generally speaking, the phenomenon of complexity has always been an inherent part of environmental systems, for example ecology, world economics, etc. Altogether, these systems consist of interdependent and variable parts. In other words, unlike in a conventional system, the parts need not have fixed relationships, fixed behaviours or fixed quantities, and thus their individual functions may also be undefined in traditional terms. Despite the apparent tenuousness of this concept, these systems, according to (Lucas 2000), form the majority of our world, including living organisms and social and natural systems. A sub-conclusion that might be drawn from this is that complex systems cannot be studied independently of their surroundings (Lucas 2000). It means that the behaviour of complex systems requires "a simultaneous understanding of the environment of these systems" (Moffat 2003).
Thus, the approaches used in Complexity theory are based on a number of new mathematical techniques, originating from fields as diverse as physics, biology, artificial intelligence, politics and telecommunications, and this interdisciplinary viewpoint is the crucial aspect, reflecting the general applicability of the theory to these systems in all areas (Lucas 2000). Describing these main features of the term complexity does not clarify its definition, because of the high variation in terminology and concepts that have proliferated in the field – deterministic chaos, fractals, self-organizing systems far from thermodynamic equilibrium, complex adaptive systems, self-organized criticality, cellular automata, solitons, and so on – even though they all globally share the same property (Czerwinski 1998).
In looking at where these key staples of the term complexity come from, let us start by considering natural world systems (examples of such systems are mentioned above). According to the work (Moffat 2003), in the classical view physical or biological processes are reducible to a few fundamental interactions. This leads to the idea that, under well-defined conditions, a system governed by a given set of laws will follow a unique course: "like the planets of the solar system" (Moffat 2003). In this sense, we are interested in the essential components, the staple point(s), that determine complexity in a system-independent way.
B.1 Defining complexity. Genesis of the concept complexity
There is a list of various approaches to defining complexity, but all of them may be characterized by one common feature. This common point is the answer to the question "How is complexity born?".
The senses (observing the world) plus mental (human) activity (making sense of that sensory information) encode a Natural System (NS) into a Formal System (FS); we manipulate the FS to mimic the causal change in the NS. From the NS we derive an implication that corresponds to a causal event in the FS; we decode the FS and check its success in representing the causal event in the NS (Ferreira 2001). Complexity as a phenomenon appears at the moment when we find "unexpected behaviour" between the natural system (NS) and the expected behaviour of the formal system (FS) that has been created in order to describe it. Complexity appears as the result of the discrepancies between our observation of the NS and the human-centred, pro-active FS. This is the gap of encoding/decoding between FS and NS and back (Fig. B.1).
Summarizing, "Complexity is the property of a real world system that is manifest in the inability of any one formalism being adequate to capture all its properties" (Mikulecky 2007).
Fig. B.1 : Genesis of the complexity
The complex system, from which we single out some smaller part – our NS that is converted into an FS – allows us to manipulate it and to have a model. The observed process is stated to be complex when the chosen FS tries to capture its behaviour but can only be partially successful (Mikulecky 2007). A good illustration of the process described in Fig. B.1 is the Newtonian paradigm. As an FS we were satisfied with it; then, using the FS for encoding and decoding in some special cases, we figured out discrepancies. We began to change it, so that the post-Newtonian paradigm actually replaced it as the model of the real world. As we began to look more deeply into the world, we came up with aspects that the Newtonian Paradigm failed to capture. Requirements of explanation give birth to complexity.
Briefly reviewing the wide variety of approaches, we find the common way of defining complexity. The principal understanding of this phenomenon named complexity comes from observing the interaction "between human mind and nature" (Fig. B.1). Scientists who operate with their own definitions try to figure nature out in a process of interaction, using wording such as phase states, bifurcations, strange attractors and emergence in order to describe and predict the behaviour of a few natural phenomena, assigning a "model" to them (Czerwinski 1998).
For centuries we have had a model of nature based on the Newtonian paradigm, which was built on Cartesian Reductionism: the Machine Metaphor and Cartesian Dualism (Bennett 2003), meaning that the Body is a biological machine and the Mind is something apart from the body. The intuitive concept of a machine is that it is built up from distinct parts and can be reduced to those parts without losing its machine-like character – this is Cartesian Reductionism.
The Newtonian Paradigm and the three General Laws of motion were used as the foundation of the modern scientific method, where dynamics is the centre of the framework, which leads to the notion of trajectory (Ferreira 2001). However, the real world came to be different from any model (Badiou 2007), including the Newtonian one.
Complexity as a world phenomenon shows itself, by observation or experience, as an ability to falsify a previously created model. Hence, we deeply link complexity with Popper's falsifiability and unfalsifiability (Popper 2002), which is an important concept for the advance of science; the term complexity, as the result of a failure (Mikulecky 2007), is tightly linked with the concept of Popper's falsifiability. The latter is a source of the discrepancy between FS and NS described in Fig. B.1. It is no surprise that any formal system describing a system, even a well developed and complicated one, sometimes cannot describe or explain, in its own terms, some natural phenomenon, which then claims to be complex.
According to Gödel's first incompleteness theorem, it is to be expected that a formal system is incomplete (Gödel and Feferman 2001). Moreover, the first theorem struck a fatal blow to Hilbert's program (Ewald 2004), which aimed towards a universal formalization based on a complete mathematical formal system; it thereby grants a right of existence to complexity as a form of incompleteness of a natural system. In turn, the second theorem of Gödel (Gödel and Feferman 2001) supports the idea of the cyclic re-defining of complexity described in Fig. B.1: formalizing a system, finding discrepancies, then again encoding, formalizing and decoding, we can find that the NS is again more complex than the FS, and so on.
In the computer world, complexity is a perpetual drive towards its determination and calculation/modelling. It is a non-stop movement in which computer scientists are "pushed" into the world of the running Tortoise and Achilles (Hofstadter 1999). However, Chaitin (Chaitin 2005) proposes that scientists and mathematicians abandon any hope of finding an "ideal" FS and suggests applying, each time, a quasi-empirical methodology.
Summarizing the topic of defining complexity (the complexity of systems), we want to mention that the various current definitions are related to what one might call the post-Newtonian paradigm (Czerwinski 1998), by which is meant that the arrangement of natural life and its complications is nonlinear: inputs and outputs are not proportional; phenomena are unpredictable, but within bounds are self-organizing; unpredictability frustrates conventional structuring; and solutions appear as self-organization rather than control as we usually think of it. Complexity in the broad meaning shows itself as the inability of some (or any) formalism to fully describe the observed phenomenon.
The next section is dedicated to a more detailed overview of complexity as a system attribute.
B.2 System’s attribute and complexity
The majority of definitions are grounded on the principal idea that complexity is an inherent part of complex systems such as economies, social structures, climate, nervous systems, etc. Complexity theory and chaos theory both attempt to reconcile the unpredictability of the non-linear dynamics of these systems with a sense of underlying order and structure (David 2000).
First of all we have to discuss what we understand by complex systems. In a naïve
way, we may describe them as systems which are composed of many parts, or elements,
or components which may be of the same or different kinds. The components or parts may
be connected in a more or less complicated fashion. The various branches of science offer
us numerous examples, some of which turn out to be rather simple whereas others may be
called truly complex (Haken 2002).
A modern definition of complex systems is based on the concept of algebraic complexity. It means that, at least to some extent, systems can be described by a sequence of data, for example the fluctuating intensity of a light source or a curve that represents data by numbers. There, one might attempt to follow the paths of the individual particles and their collisions and then derive the distribution function of the velocities of the individual particles, known as the Boltzmann distribution.
In all cases, a macroscopic description allows an enormous compression of information, so that we are no longer concerned with the individual microscopic data but rather with global properties. An important step in treating complex systems consists in establishing relations between various macroscopic quantities (Haken 2002).
The more science becomes divided into specialized disciplines, the more important it becomes to find unifying principles (Haken 2002). We may recount the cross-disciplinary views on complexity in complex systems given in the literature. As a matter of fact, these definitions cannot be reduced to a "top twenty" list (Sussman 2002); their common point is that all of them use word combinations such as: intricate ways, subtle, degree and nature of the relationships, behaviour of macroscopic collections. This type of wording is not acceptable for a strong definition, because all of these terms would themselves have to be pre-defined.
First of all, let us mention that we share the critical view of some authors on complexity who treat this paradigm as a kind of holism. They suggest that any attempt to cope with complexity using traditional tools is doomed to failure. Their remedies vary from complete abandonment of those tools to the introduction of new techniques and approaches. Of course, constructive suggestions for dealing with complexity are welcome from whatever source; thus, all the techniques of "the new sciences of complexity" are welcome for studying what has been considered the complexity of complex systems. As is aptly noted by Sussman (Sussman 2000), many of these techniques, however, have nothing to do with complexity per se. It is stated (Sussman 2000) that many papers with the word "complexity" in the title refer merely to some techniques for dealing with rather difficult (complex) systems. Considering this point of view, complexity as a solid attribute of systems might comprise the following features:
A system is complex when it is composed of many parts that interconnect in intricate ways (Moses 2002) - this definition has to do with the number and nature of the interconnections. A metric for intricateness is the amount of information contained in the system.
A system presents dynamic complexity when cause and effect are subtle over time (Senge 2006) - effects differ in the short run and in the long run; dramatically different effects can be observed locally and in other parts of the system. Obvious interventions produce non-obvious consequences.
A system is complex when it is composed of a group of related units (subsystems) for which the degree and nature of the relationships are imperfectly known (Sussman 2000) - the overall emergent behaviour is difficult to predict, even when the subsystem behaviour is readily predictable; small changes in inputs or parameters may produce large changes in behaviour.
A complex system has a set of different elements so connected or related as to perform a unique function not performable by the elements alone (Maier and Rechtin 2000) - this requires different problem-solving techniques at different levels of abstraction.
Complexity relates to the behaviour of macroscopic collections of units endowed with the potential to evolve in time (Highfield 1996) - this definition differs from computational complexity, which estimates the number of mathematical operations needed to solve a problem using the Turing machine concept.
The features of the complexity of complex systems are the following:
• Complex systems are not fragmentable: if a system were, it would be a machine. Reduction to parts destroys important system characteristics irreversibly.
• Complex systems comprise real components that are distinct from their parts: these are functional components defined by the system, whose definition depends on the context of the system. Outside the system they have no meaning; removed from the system, a component loses its original identity.
• Complex systems have no analytic or synthetic "largest FS model": if there were a largest model, all other models could be derived from it.
• Causalities in the system are mixed when distributed over the parts.
• Attributes of the systems are beyond algorithmic definition or realization: here, we deal with posing a challenge to falsify. One example is the famous Church's thesis (Church 1936) ("…All the models of computation yet developed, and all those that may be developed in the future, are equivalent in power… We will not ever find a more powerful model...").
These are not definitive indicators, but a system that has many of these attributes is hard to analyze using linear determinism or statistical methods (Lucas 2000).
Another important feature of complexity as an attribute of complex systems that has to be mentioned here, especially because of the T-DTS concept of our thesis, is self-organization. In the literature we find complexity treated as an inalienable attribute of systems that allows them to self-organize. Naturally, we first have to answer the question "What is the complexity of self-organizing systems?"
A single strong definition relevant to this complexity could not be found. The notion may arise in bio-organisms (Hinegardner and Engelberg-Kulka 1983), where it has a direct relationship with the evolutionary selection process. A very weak definition, such as the size of the genome, could be sufficient to explain an increase in the maximum complexity of all species under the evolutionary process. Related to mutation of the genome is the self-organizing complexity suggested by Wimsatt (Wimsatt 1974), by which is meant the co-adaptation of an organism's mechanisms (or of sub-mechanisms of other mechanisms) as a source of evolution; in other words, descriptive complexity. Kauffman (Kauffman 1993) suggests that the order manifest in organisms is a result of selection acting upon a system that is basically self-organizing, and that this self-organizational ability depends critically on the complexity of conflicting constraints. Here, complexity is linked with biology. These are some of the criteria that may or may not allow the system to achieve the benefits in innovation, based on survival (fitting real-world constraints) and adaptability, that we see for natural complex systems (Lucas 2000).
Therefore, one may see the difficulty of a strong, common definition of complexity as an inherent part of complex systems; this is not so much because of their different origins, but mostly because of the different phenomena and processes that take place and that are entangled with the word (rather than the term) complexity.
C. Neural Networks in hardware
Neural network hardware has undergone rapid development during the last decade. Unlike the conventional von Neumann architecture, which is sequential in origin, ANNs profit from massively parallel processing. A large variety of hardware has been designed to exploit the inherent parallelism of NN models. Despite the tremendous growth in the computing power of general-purpose processors, ANN-based hardware has been designed for specialized applications such as image processing, speech synthesis and analysis, pattern recognition, high-energy physics and so on.
Neural network hardware is usually defined as the set of devices designed to implement a particular NN architecture and learning algorithm. These devices take advantage of the parallel nature of ANNs. Due to the great diversity of neurohardware, this overview is limited to certain aspects of implementation (Liao 2001).
The most important aspect of neurocomputing hardware is its performance in comparison with software ANN realizations. It is true that, in general, a particular classification or other task does not require extremely high speed, and for this reason a software-based realization is in demand: it is architecture-independent and easy to handle and to adjust. However, these criteria become unimportant for real-time-response applications, systems and consumer products such as cheap verification devices.
According to (Aybay, Cetinkaya and Halici 1996), the hardware design of each NN-chip, as a principal component of NN-hardware, is built of four key elements: a Weights block, an Activation block, a Transfer function block and a Neuron state block, under the supervision of a Control unit that is present on each chip and is responsible for passing control parameters, Fig. C.1.
An important issue concerning the given scheme is that the Neuron state block, the Weights block and the Transfer function block may be located off the chip, and part of their function can be performed by a host computer (Aybay, Cetinkaya and Halici 1996).
Neural network hardware is usually specified by the number of artificial neurons on each neuro-chip and by the number of connections between them. The number of neurons can vary widely, from 10 up to 10^6. Another important characteristic of neurocomputers is the precision with which the arithmetical units perform the basic operations (Liao 2001).
Fig. C.1 : General block-level architecture of a neurochip or neurocomputer processing element
NN hardware can be categorized into classes using the following criteria (Aybay, Cetinkaya and Halici 1996):
• Type of device (neuro-chip or neurocomputer, general-purpose or specific-purpose device),
• Neuron properties (number of neurons, precision, storage of neuron state: on-chip/off-chip, digital/analogue),
• Weights (storage of weights, number of synapses, etc.),
• Activation characteristics (computation, activation block output),
• Transfer function characteristics (on/off-chip, analogue/digital, threshold look-up table/computational),
• Information flow,
• Learning (on/off-chip, standalone/via a host),
• Speed (learning speed/processing speed),
• Cascadability,
• Type of fabrication technology (for example VLSI),
• Clock rate,
• Data transfer rate,
• Number of inputs,
• Number of outputs,
• Type of input (analogue/digital),
• Type of output (analogue/digital).
One may find in the works (Aybay, Cetinkaya and Halici 1996), (Liao 2001) a commonly used simple taxonomy of neurocomputing hardware, but let us mention that the borders between the categories of this taxonomy are weakly defined.
Concluding our introduction to NN-hardware, we would like to highlight that, because of their popularity and especially because of their first-mover commercial success, NN-software-based applications have become more popular than hardware-based solutions, due to the disadvantages of the latter such as algorithmic specificity, design complexity and lack of user-friendliness. This has allowed NN-based software to lead the trend of ANN application development. Moreover, because of their high flexibility and universality in comparison with neurocomputers, software solutions continue to demonstrate a high potential to hold this leading position. However, whenever there appears a need to handle computation for a real-time application, or to employ a complex ANN with a large number of neurons, there is no competition for neurocomputers. For specific niches/problems, neurocomputers provide a much better cost-to-performance ratio, lower power consumption and smaller size. We may regard hardware-based neurocomputing as an efficient, delicate toolbox that, after its "gold rush" of the late 80s and early 90s, has been left at the mercy of the leading ANN-software-based tools and applications. However, the algorithmic success of the latter will, in the long term, only revive the area of neurohardware. In fact, these two approaches are not rivals: as long as conventional hardware cannot provide sufficient performance, there will be a need for neurocomputers (Liao 2001).
Since many neural network chips have already been described in the literature (Maren 1990), and new neural network chips appear frequently, we will not attempt to review them all. The following Section C.1 and Section C.2 give more details about the type of neurocomputers that employ RBF-like neural network models for pattern recognition. The first one, IBM© ZISC®-036, has been used as a tool for the verification of our ad hoc classification complexity approach. According to the above approaches to NN-hardware categorization, these two neurocomputers belong to the class of General Purpose Digital Neuro-Chips (NCgd) (Aybay, Cetinkaya and Halici 1996). The commonly used code that denotes their main characteristics is Id/Ad/Wdo/Sdo/Tdo/Lo, where I stands for input/output, A for activation block, W for weights block, S for neuron state block, T for transfer function and L for learning; furthermore, d means digital and o means on-chip.
C.1 IBM© ZISC®-036 Neurocomputer
The IBM© ZISC® (Zero Instruction Set Computer)-036 neurocomputer (IBM Corporation 1998), (De Tremiolles 1998) is a fully integrated neural-network-based circuit designed for recognition and classification applications which generally require supercomputing. The key component, a chip of 36 neurons (Fig. C.2), has an RBF neural network topology (Lindblad and al. 1996). The IBM© ZISC®-036 is a parallel neuro-processor that uses RCE (Reduced Coulomb Energy), an algorithm which automatically adjusts the number of hidden units and converges in only a few epochs. This method, as implemented on the IBM© ZISC®-036, has been proven effective in resolving pattern classes separated by nonlinear boundaries (Madani, De Tremiolles and Tannhof 1998).
Fig. C.2 : IBM© ZISC®-036 PC-486 ISA bus based block diagram
However, the RCE network depends on user-specified parameters which are computationally expensive to optimize (Wang, Neskovic and Cooper 2006). The IBM© ZISC®-036 implementation of the RBF-like (Radial Basis Function) model (Park and Sandberg 1991) can be seen as mapping an N-dimensional space by prototypes. Each prototype is associated with a category and an influence field. Intermediate neurons are added only when necessary, and the influence field is then adjusted by a threshold to minimize conflicting zones. A kNN algorithm is also embedded in this neurocomputer. The kNN (k-nearest neighbour) method classifies objects based on the closest training examples in the feature space; kNN is a type of instance-based, or lazy, learning where the function is only approximated locally and all computation is deferred until classification.
The IBM© ZISC®-036 system implements two kinds of distance metrics (a short illustrative sketch follows this list):
1. L1: a polyhedral-volume influence field.
2. LSUP: a hyper-cubical-volume influence field.
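The two norms can be illustrated in a few lines of Python (a purely illustrative sketch, not the chip's implementation; the toy three-component vectors are our own, whereas the real neurons store 64 components of 8 bits each):

    import numpy as np

    def l1_distance(x: np.ndarray, p: np.ndarray) -> int:
        """L1 (Manhattan) norm: sum of absolute component differences,
        giving a polyhedral (diamond-shaped) influence field."""
        return int(np.abs(x - p).sum())

    def lsup_distance(x: np.ndarray, p: np.ndarray) -> int:
        """LSUP (Chebyshev) norm: largest absolute component difference,
        giving a hyper-cubical influence field."""
        return int(np.abs(x - p).max())

    x = np.array([10, 200, 30], dtype=np.int32)    # input vector (toy values)
    p = np.array([12, 190, 35], dtype=np.int32)    # stored prototype (toy values)
    print(l1_distance(x, p), lsup_distance(x, p))  # 17 10

A prototype fires on an input vector when the selected distance does not exceed its influence field, which is why the L1 norm carves the input space into diamond-shaped regions and LSUP into hyper-cubes.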
The ZISC®-036 is conventionally regarded as a coprocessor device, Fig. C.3. Such a device must be controlled by a micro-controller or a state machine (accessing its registers). Naturally, in many RBF applications a large number of neurons is required. The cascadability of the ZISC®-036 is provided by 144 pins that can be directly interconnected. This simplification of the basic design supports ZISC®-036 tower construction. The ZISC chip supports asynchronous as well as synchronous protocols, the latter when a common clock can be shared with the controller. The calculation of the distance between the input vector and a prototype uses 14-bit precision. The components of the vectors are fed in sequence and processed in parallel by each neuron (Lindblad and al. 1996).
Fig. C.3 : Schematic drawing of a single IBM© ZISC®-036 processing element -
neuron
This means that each chip, Fig. C.4, is able to perform up to 250,000 recognitions per second.
Fig. C.4 : IBM© ZISC®-036 chip's block diagram
This chip is fully cascadable, which allows the use of as many neurons as the user needs. The first implementation (Lindblad and al. 1996) was done on a PC-486 ISA-bus card, Fig. C.5, and consists of 576 neurons (16 chips, each chip containing 36 neurons, Fig. C.4).
Fig. C.5 : Hardware realization: IBM© ZISC®-036 PCI board
Each neuron is an element which is able to (a simplified software sketch of this behaviour is given after the list):
• memorize a prototype (64 components coded on 8 bits), the associated category (14 bits), an influence field (14 bits) and a context (7 bits),
• compute the distance, based on the selected norm (L1 or LSUP), between its memorized prototype and the input vector (the distance is coded on fourteen bits),
• compare the computed distance with its influence field,
• communicate with the other neurons (in order to find the minimum distance, category, etc.),
• adjust its influence field (during the learning phase).
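The following lines give a simplified software model of this behaviour and of an RCE-style learning step (a sketch under strong assumptions: it ignores the 14-bit distance coding, contexts and the hardware search protocol; the class name is hypothetical, and the default maximum influence field simply follows the 14-bit coding mentioned above):

    import numpy as np

    class ZiscLikeNeuron:
        """Simplified model of one prototype neuron (not the actual silicon)."""
        def __init__(self, prototype, category, influence_field):
            self.prototype = np.asarray(prototype, dtype=np.int32)  # up to 64 components, 8 bits each
            self.category = category
            self.influence_field = influence_field                  # firing radius

        def distance(self, x, norm="L1"):
            d = np.abs(np.asarray(x, dtype=np.int32) - self.prototype)
            return int(d.sum()) if norm == "L1" else int(d.max())

        def fires(self, x, norm="L1"):
            return self.distance(x, norm) <= self.influence_field

    def rce_learn(neurons, x, category, max_field=2**14 - 1):
        """RCE-style learning step (sketch): shrink the fields of wrongly firing
        neurons and commit a new neuron when no correct-category neuron fires."""
        correct_neuron_fired = False
        for n in neurons:
            d = n.distance(x)
            if d <= n.influence_field:                 # the neuron fires on x
                if n.category == category:
                    correct_neuron_fired = True
                else:
                    n.influence_field = max(d - 1, 0)  # reduce the conflicting zone
        if not correct_neuron_fired:
            neurons.append(ZiscLikeNeuron(x, category, max_field))

    network = []
    rce_learn(network, [5, 5, 5], "A")   # commits the first prototype
    rce_learn(network, [9, 9, 9], "B")   # shrinks the first field, commits a second prototype

In recognition mode, the parallel search over the chip then amounts to collecting the responses of all firing neurons, exactly the step that the inter-neuron communication performs in hardware.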
As mentioned, controlling the IBM© ZISC®-036 is performed by accessing its registers, and requires an address definition via the address bus and data transfer via the data bus. The inter-ZISC communication bus, which is used to connect several devices within the same network, and the decision bus, which carries classification information, allow the use of the ZISC in a 'stand-alone' mode.
The ZISC's neuron organization allows the connection of several ZISC modules without impact on performance. An efficient protocol allows true parallel operation of all neurons of the network, even during the learning process.
Most commonly, ZISC control is performed by a master state machine over a standard I/O bus. The I/O bus of the IBM© ZISC®-036 has been designed to allow a wide variety of attachments, from a simple state-machine interface to standard micro-controllers or buses.
This neurocomputer has been used for image enhancement tasks such as noise reduction and focus correction (Madani, De Tremiolles and Tannhof 2001). However, after its last manufacturing run by IBM in 2001, present-day application requirements have stimulated the renewed production of the ZISC's cousin, the CogniMem (CM) chip. The first production batch was released in January 2008. The following section is dedicated to a detailed description of this CM-1K neural network chip.
C.2 CM-1K Neural Network chip
The CM-1K neural network chip, Fig. C.6, is a descendant of the IBM© ZISC®-036 neurocomputer (Mendez 05-2009).
Fig. C.6 : CM-1K Neural network chip
It is a neural network chip featuring 1024 neurons working in parallel and a parallel bus which allows the user to increase the network size by cascading multiple chips. It is the first product of the CogniMem network line (Mendez 05-2009). It is an ideal companion chip for smart sensors and cameras, and can classify patterns at high speed while coping with ill-defined data, the detection of unknown events, and adaptation to changes of context and working conditions (Mendez 03-2009). Compared with a present-day 4 GHz CPU and its bottleneck (memory access through a single bus), the CM-1K chip has a very simple, self-contained and efficient architecture (Mendez 05-2009).
In addition to this parallel neural network architecture, Fig. C.7, the CM-1K integrates a built-in recognition engine which can receive vector data directly from a sensor and broadcast it to the neurons in real time. It has been demonstrated (Mendez 03-2009) that the CM-1K chip's recognition time depends on the operating clock and not on the number of models stored in the neurons. CM-1K chips, like ZISC®-036 chips, can be interconnected to build an NN of any capacity, in increments of 1024 neurons, at less than one watt per chip (a short numerical estimate of such a cascade is given after Fig. C.7).
Fig. C.7 : Network of 6 CM-1K chips, or 6144 neurons in parallel
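A back-of-the-envelope estimate of this incremental cascading can be written as follows (our own illustrative helper; the per-chip neuron count and the one-watt bound are taken from the description above, the rest is an assumption):

    import math

    NEURONS_PER_CHIP = 1024      # CM-1K network size increment
    MAX_WATTS_PER_CHIP = 1.0     # "less than one watt per chip", used as an upper bound

    def cascade_estimate(prototypes_needed: int):
        """Chips, total neurons and worst-case power for a given number of prototypes."""
        chips = math.ceil(prototypes_needed / NEURONS_PER_CHIP)
        return chips, chips * NEURONS_PER_CHIP, chips * MAX_WATTS_PER_CHIP

    print(cascade_estimate(6000))   # (6, 6144, 6.0): the six-chip configuration of Fig. C.7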
The CM-1K control and data bus is simple and composed of only 28 lines. Adding more chips to increase the network size is totally transparent to the controller, since the neurons include their own learning and recognition logic (Mendez 03-2009). CogniMem also offers a proprietary signature extraction from 2D video to a 1D vector.
Fig. C.8 : CM-1K chip’s functional diagram. Inner architecture
The recognition engine can operate at sensor speed (up to 27 MHz). The usage of the high-speed recognition engine requires that knowledge be previously loaded into the neurons. Concerning the chip's inner architecture, Fig. C.8, one finds the principal organizational similarity with the IBM© ZISC®-036, Fig. C.3: the ZISC-like chain of identical neurons operating in parallel. A neuron is an associative memory which can perform pattern comparison autonomously.
During the recognition of an input vector, all the neurons communicate briefly with one another (for 16 clock cycles) to find which one has the best match. In addition to its register-level instructions, the CM-1K integrates a built-in recognition engine which receives vector data directly through a digital input bus, broadcasts it to the neurons and returns the best-fit category 3 milliseconds later (Mendez 03-2009).
Therefore, given the analogy with the IBM© ZISC®-036 neurons, we provide a brief list of the CM-1K chip's features and specifications (a small sketch contrasting its RBF and KNN modes follows the list):
• 1024 parallel neurons
• Vector data of up to 256 bytes
• 10 ms learning time (maximum)
• 10 ms recognition time (maximum)
• No limit to neuron expansion
• Trained by example
• RCE (Restricted Coulomb Energy)
• L1 and LSup distance norms
• Radial Basis Function (RBF) or K-Nearest Neighbour (KNN) classifier
• 0.13 µm technology - die size 8 x 8 mm
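As a purely illustrative contrast between the two classification modes listed above (reusing the hypothetical ZiscLikeNeuron sketch from Section C.1, and not the chip's actual firmware), RBF mode answers only from neurons whose influence field covers the input, reporting an unknown event otherwise, while KNN mode always answers with the majority category of the k closest prototypes:

    from collections import Counter

    def classify_rbf(neurons, x, norm="L1"):
        """RBF mode: answer only if the input falls inside some influence field."""
        firing = [(n.distance(x, norm), n.category) for n in neurons if n.fires(x, norm)]
        return min(firing)[1] if firing else None     # None stands for an unknown event

    def classify_knn(neurons, x, k=3, norm="L1"):
        """KNN mode: majority category among the k nearest stored prototypes."""
        ranked = sorted((n.distance(x, norm), n.category) for n in neurons)
        return Counter(category for _, category in ranked[:k]).most_common(1)[0][0]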
The range of fields where the CM-1K is used includes: image recognition (part inspection, object recognition, face recognition, target tracking and identification, video monitoring, gaze tracking, medical imaging, satellite imaging, smart motion detection, kinematics), signal recognition (speech recognition, voice identification, radar identification, EKG and EEG monitoring, sonar identification, spectrum recognition, flight analysis, vibration monitoring), data mining (cryptography, genomics, bioinformatics, fingerprint identification, unstructured data mining) and more.
According to the report (Mendez 05-2009), several companies have integrated the CM-1K chip into their product designs, mostly to add recognition capabilities to embedded sensor boards. Moreover, the newly created European Laboratory for Sensory Intelligence is presently attempting the design of a system featuring 100 CM-1K chips. CogniMem also envisages the possibility of creating a USB-key portable data-mining solution based on the CM-1K. Summarizing our overview of these specific neurocomputers, let us note that, according to the extensive DARPA study, NN-hardware will remain on duty, because "as the number of neurons and interconnects increases with regards to the size of application, the amount of memory required to store the interconnect values increases. If that memory cannot be stored locally with every processor, then the processor must access memory external to itself and that slows the overall speed of the simulator" (Goblick 1988).
Bibliography
(Abdelwahab 2004) Manal M.Abdelwahab, Self Designing Pattern Recognition System Employing Multistage Classification, Thesis presented to obtain the degree of Doctor of Philosophy in the Department of Electrical and Computer Engineering in the College of Engineering and Computer Science at the University of Central Florida, Orlando, Florida 2004, Publication No.: 3162085 (Aggarwal and Yu 1999) Charu C.Aggarwal and Phillips S.Yu, Data Mining Techniques for Associations, Clustering and Classification, Lecture Notes in Artificial Intelligence, Proceedings of the Third Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining, PAKDD'99, April 26-28, 1999, Beijing, China, vol. 1574, pp. 13-23, 1999, ISBN: 3540658661 (Aha 1991) David W.Aha, L.Birnbaum (ed.), and G.Collins (ed.), Incremental Constructive Induction: An Instance-Based Approach, Machine Learning: Proceedings of the Eighth International Workshop (ML 91), pp. 117-121, 1991, ISBN: 1558602003 (Anderberg 1973) Michael R.Anderberg, Cluster Analysis for Applications (Probability & Mathematical Statistics Monograph) Academic Press Inc., New York, US, 1973, ISBN: 0120576503 (Arabie 1994) Phipps Arabie, Clustering and Classificication World Scientific Publishing Co. Pte. Ltd., 1994, ISBN: 9810212879 (Arbib 2003) Michael A.Arbib, The Handbook of Brain Theory and Neural Networks, Second revised edition, Bradford Books, 2003, ISBN: 0262011972 (Atlas and al. 1989) Les Atlas, Jerome Connor, Dong Park, Mohamed El-Sharkawi, I. M. Robert J., Alan Lippman, Ronald Cole, and Yeshwant Muthusamy, A Performance Comparison of Trained Multi-Layer Perceptrons and Trained Classification Tree, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol. 3, pp. 915-920, 1989, ISSN: 00189219 (Badiou 2007) Alain Badiou, Le Concept de Modele : Introduction а une Epistemologie Materialiste des Mathematiques L'Harmattan, Paris, France, 2007, ISBN: 2213634815 (Bak 1996) Per Bak, How Nature Works: The Science of Self-Organized Criticality Springer-Verlag New York Inc., New York, US, 1996, ISBN: 0387947914 (Barbara and Kamath 2003) Daniel Barbara and Chandrika Kamath, Proceedings of the Third SIAM International Conference on Data Mining: Proceedings in Applied Mathematics 112 Society for Industrial and Applied Mathematics, US, 2003, ISBN: 0898715458 (Bauer and Kohavi 1999) Eric Bauer and Ron Kohavi, An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants, Machine Learning, vol. 36, no. 1-2, pp. 105-139, 1999, ISSN: 08856125
(Bellman 1961) Richard Ernest Bellman, Adaptive Control Processes: A Guided Tour, First edition, Princeton University Press, Princeton, New Jersey, US, 1961, ASIN: B0006AWE08 (Ben-Dor, Shamir and Yakhini 1999) Amir Ben-Dor, Ron Shamir, and Zohar Yakhini, Clustering Gene Expression Patterns., Journal of Computational Biology, vol. 6, no. 3-4, pp. 281-297, 1999, ISSN: 10665277 (Benediksson, Swain and Ersoy 1990) Jon A.Benediksson, Philip H.Swain, and Okan K.Ersoy, Neural Network Approaches Versus Statistical Methods in Classification of Multisource Remote Sensing Data, IEEE Transactions on Geoscience and Remote Sensing, vol. 28, no. 4, pp. 540-551, 1990, ISSN: 01962892 (Bennett 1988) Charles H.Bennett and R.Herken (ed.), Logical Depth and Physical Complexity, in The Universal Turing Machine: A Half-Century Survey Oxford University Press Inc., New York, US, 1998, pp. 227-257, ISBN: 3211826378 (Bennett 2003) Jonathan Bennett, Learning from Six Philosophers: Descartes, Spinoza, Leibniz, Locke, Berkeley, Hume Clarendon Press, US, 2003, ISBN: 0199266298 (Berry 2003) Michael W.Berry, Survey of Text Mining: Clustering, Classification, and Retrieval Springer-Verlag New York Inc., New York, US, 2003, ISBN: 0387955631 (Bertotti and Mayergoyz 2006) Giorgio Bertotti and Isaak D.Mayergoyz, The Science of Hysteresis : 3-Volume Set Academic Press Inc., 2006, ISBN: 0124808743
(Biryukov, Ryazanov and Shmakov 2007) Andrey S.Biryukov, Vladimir V.Ryazanov, and A.S.Shmakov, "Solving Clusterization Problems Using Groups of Algorithms," Computational Mathematics and Mathematical Physics, vol. 48, no. 1, pp. 168-183, 2007, ISSN: 09655425
(Blass and Gurevich 2003) Andreas Blass and Yuri Gurevich, Algorithms: A Quest for Absolute Definitions, Bulletin of European Association for Theoretical Computer Science, vol. 81, pp. 195-225, 2003, ISSN: 02529742 (Boolos, Burgess and Jeffrey 2002) George S.Boolos, John P.Burgess, and Richard C.Jeffrey, Computability and Logic, Fourth revised edition, Cambridge University Press, 2002, ISBN: 0521007585 (Bouyoucef 2006) El Khier Sofiane Bouyoucef, Comparaison des Performances de la T-DTS avec 34 Algorithmes de Classification en Exploitant 16 Bases de Donnees de l'UCI: (Machine Learning Repository), 2006, Ph.D.Report,University Paris XII
(Bouyoucef, Chebira, Rybnik and Madani 2005) El Khier Sofiane Bouyoucef, Abdennasser Chebira, Mariusz Rybnik, Kurosh Madani, A.Sachenko (ed.), and O.Berezsky (ed.), Multiple Neural Network Model Generator with Complexity Estimation and Self-Organization Abilities, International Journal of Computing, vol. 4, no. 3, pp. 20-29, 2005.
(Bouyoucef 2007) El Khier Sofiane Bouyoucef, Contribution а l'Etude et la Mise en Ouvre d'Indicateurs Quantitatifs et Qualitatifs d'Estimation de la Complexite pour la Regulation du processus d'Auto Organisation d'une Structure Neuronale Modulaire de Traitement d'Information, These presentee аu Laboratoire Images, Signaux et Systemes Intelligents - LISSI de l'Universite Paris XII pour obtenir le grade de docteur en sciences informatiques 2007. (Breiman 1996) Leo Breiman, Bagging predictors, Machine Learning, vol. 24, no. 2, pp. 123-140, 1996, ISSN: 17276209 (Briem, Benediktsson and Sveinsson 2000) Jakob Gunnar Briem, Jon Atli Benediktsson, and Johannes R.Sveinsson, Use of Multiple Classifiers in Classification of Data from Multiple Data Sources, International Geoscience and Remote Sensing Symposium Proceedings, vol. 2, pp. 882-884, 2000, ISSN: 08856125 (Brito, Bertrand, Cucumel and De Carvalho 2007) Paula Brito, Patrice Bertrand, Guy Cucumel, and Francisco de Carvalho, Selected Contributions in Data Analysis and Classification (Studies in Classification, Data Analysis, and Knowledge Organization) Springer-Verlag Berlin and Heidelberg GmbH. and Co. K., Germany, 2007, ISBN: 0780363604 (Bruzzone, Prieto, and Serpico 1999) Lorenzo Bruzzone, Diego Fernandez Prieto, and Sebastiano B.Serpico, A Neura-Statistical Approach to Multitemporal and Multisource Remote-Sensing Image Classification, IEEE Transactions on Geoscience and Remote Sensing, vol. 37, no. 3, pp. 1350-1359, 1999, ISBN: 3540735585 (Budnyk, Bouyoucef, Chebira and Madani 2008) Ivan Budnyk, El Khier Sofiane Bouyoucef, Abdennasser Chebira, and Kurosh Madani, A Modular Neural Classifier with Self-Organizing Learning: Performance Analysis, Proceedings of International Conference on Neural Networks and Artificial Intelligence, pp. 65-69, 2008, ISSN: 01962892 (Budnyk, Chebira and Madani 2007) Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, J.Zaytoon (ed.), J-L.Ferrier (ed.), J.Andrade-Cetto (ed.), and J.Filipe (ed.), ZISC Neural Network Base Indicator for Classification Complexity Estimation, in Artificial Neural Networks and Intelligent Information Processing (ANNIIP 2007) INSTICC PRESS (Portugal 2007), 2007, pp. 38-47, ISBN: 9789856329794 (Budnyk, Chebira and Madani 2008) Ivan Budnyk, Abdennasser Chebira, and Kurosh Madani, ZISC Neuro-computer for Task Complexity Estimation in T-DTS framework, in Artificial Neural Networks and Intelligent Information Processing (ANNIIP 2008) INSTICC PRESS (Portugal 2008), 2008, pp. 18-27, ISBN: 9789728865825 (Butz 2001) Martin Volker Butz, Rule-based Evolutionary Online Learning Systems: Learning Bounds, Classification, and Prediction, Thesis presented to obtain the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, Urbana, Illinois 2001, ISBN: 9789898111333
(Cacoullos 1966) Theophilos Cacoullos, Estimation of a Multivariate Density, Annals of the Institute of Statistical Mathematics, vol. 18, no. 2, pp. 179-189, 1966, Publication No.: 3153259 (Chaitin 2005) Gregory J.Chaitin, Meta Math!: The Quest for Omega (Peter N. Nevraumont Books) Pantheon Books, 2005, ISSN: 00203157 (Chan, Huang and De Fries 2001) Jonathan Cheung-Wai Chan, Chengquan Huang, and uth De Fries, Enhanced Algorithm Performance for Land Cover Classification from Remotely Sensed Data Using Bagging and Boosting, IEEE Transactions on Geoscience and Remote Sensing, vol. 39, no. 3, pp. 693-695, 2001, ISBN: 0375423133
(Chawla, Eschrich and Hall 2001) Nitesh Chawla, Steven Eschrich, and Lawrence O.Hall, Creating Ensemble of Classifiers, Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM'01), 29 November - 2 December 2001, San Jose, California, US, no. 580, p. 581, 2001. ISBN: 0769511198
(Chebira, Madani and Mercier 1997) Abdennasser Chebira, Kurosh Madani, Gilles Mercier, and S.K.Rogers (ed.), Various Ways for Building a Multi-Neural Network System : Application to a Control Process, Society of Photo-Optical Instrumentation Engineers, Applications and Science of Artificial Neural Networks III, 21-24 April 1997, Orlando, Florida, US, vol. 3077, pp. 148-159, 1997. ISBN: 0819424927 (Chen 1976) Chi Hao Chen, Chi Hao, Information Sciences, vol. 10, pp. 159-173, 1976, ISSN: 01962892 (Chen and Varshney 2002) Biao Chen and Pramod K.Varshney, A Byesian Sampling Approach to Decision Fusion Using Hierarchical Models, IEEE Transactions on Signal Processing, vol. 50, no. 8, pp. 1809-1818, 2002, ISSN: 00200255 (Chernoff 1966) Herman Chernoff, Estimation of a Multivariate Density, Annals of the Institute of Statistical Mathematics, vol. 18, pp. 179-189, 1966, ISSN: 1053587X (Chi and Ersoy 2002) Hoi-Ming Chi and Okan K.Ersoy, Support Vector Machine Decision Trees with Rare Event Detection, International Journal of Smart Engineering System Design, vol. 4, no. 4, pp. 225-242, 2002, ISSN: 00203157 (Chomsky 1968) Noam Chomsky, Language and Mind Harcourt Brace and World, US, 1968, ISSN: 10255818 (Chou 1983) Li Chou, Self-optimizing Method and Machines (WO/1983/000069), European Patent Office (EPO) (DE),1983, ISBN: 052167493X (Church 1936) Alonzo Church, A Note on the Entscheidungsproblem, The Journal of Symbolic Logic, vol. 1, no. 1, pp. 40-41, 1936, World Intellectual Property Organization,PCT/US1982/000845, ISSN: 00224812 Abstract: http://links.jstor.org/sici?sici=0022-4812%28193603%291%3A1%3C40%3AANOTE%3E2.0.CO%3B2-D
(Cosman, Oehler, Riskin and Gray 1993) Pamela C.Cosman, Kivanc L.Oehle, Eve A.Riskin, and Robert M.Gray, Using Vector Quantization for Image Processing, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol. 81, no. 9, pp. 1326-1341, 1993, ISSN: 00189219 (Cover and Hart 1967) Thomas M.Cover and Peter E.Hart, Nearest Neighbour Pattern Classification, IEEE Transactions on Information Theory, vol. 13, pp. 21-27, 1967, ISSN: 00189448 (Czerwinski 1998) Thomas J.Czerwinski, Coping With the Bounds: Speculations on Nonlinearity in Military Affairs National Defense University Press, 1998, ISBN: 1579060099 (Dasarathy 1991) Belur V.Dasarathy, Nearest Neighbor: Pattern Classification Techniques IEEE Computer Society Press, US, 1991, ISBN: 0818689307 (David 2000) Levy David, J.Rabin (ed.), G.J.Miller (ed.), M.Dekker (ed.), and B.W.Hildreth (ed.), Applications and Limitations of Complexity Theory in Organization Theory and Strategy, in Handbook of Strategic Management, Second revised edition, Marcel Dekker Inc., New York, US, 2000, ISBN: 0824703391 (De Tremiolles 1998) Ghislain Imbert De Tremiolles, Contribution a l'Etude Theoretique des Modeles Neuromimetiques et a leur Validation Experimentale: Mise en Oeuvre d'Applications Industrielles, These presentee a LISSI de l'Universite Paris XII pour obtenir le grade de docteur en sciences informatiques 1998, Number: 98PA120018 (De Tremiolles and al. 1997) Ghislain Imbert De Tremiolles, Pascal Tannhof, Brendan Plougonven, Claude Demarigny, Kurosh Madani, J.Mira (ed.), R.Moreno-Diaz (ed.), and J.Cabestany Moncusi (ed.), Visual Probe Mark Inspection, Using Hardware Implementation of Artificial Neural Networks, in VLSI Production, Lecture Notes in Computer Science, Biological and Artificial Computation, From Neuroscience to Technology, International Work-Conference on Artificial and Natural Neural Networks, IWANN'97 Lanzarote, Canary Islands, Spain, vol. 1240, pp. 1374-1383, 1997, ISBN: 3540630473 (Decoste and Scholkopf 2002) Dennis Decoste and Bernhard Scholkopf, Training Invariant Support Vector Machines, Machine Learning, vol. 46, no. 1-3, pp. 161-190, 2002, ISSN: 08856125 (Devroye 1987) Luc Devroye, A Course in Density Estimation Birkhauser Verlag AG, Germany, 1987, ISBN: 0817633650 (Deza M. and Deza E. 2006) Michel Marie Deza and Elena Deza, Dictionary of Distance Metrics Elsevier Science Ltd., 2006, ISBN: 0444520872 (Dietterich 2001) Thomas G.Dietterich, J.Kittler (ed.), and F.Roli (ed.), Ensemble Methods in Machine Learning, Lecture Notes in Computer Science, Multiple Classifier Systems, Proceedings of the First International Workshop, McS 2000, June 21-23, 2000, Cagliari, Italy, vol. 1857, pp. 1-15, 2001, ISBN: 3540677046
(Ding 2007) Yuanyuan Ding, Handling Complex, High Dimensional Data for Classification and Clustering, Thesis presented to obtain the degree of Doctor of Philosophy of University of Mississippi 2007, Publication No.: 3279419 (Dong 2003) Jianxion Dong, Speed and Accuracy: Large-scale Machine Learning Algorithms and their Applications, Thesis in the Department of Computer Sceince Presented in Partial Fulfillment of the Requirements For the Degree of Doctor of Philosophy Concordia University Montreal, Quebec, Canada 2003, ISBN: 0612852695 (Du, Zhang and Sun 2009) Peijun Du, Wei Zhang, Hao Sun, J.A.Benediktsson (ed.), J.Kittler (ed.), and F.Roli (ed.), Multiple Classifier Combination for Hyperspectral Remote Sensing Image Classification, Lecture Notes in Computer Science, Proceedings of the 8th International Workshop on Multiple Classifier Systems, MCS 2009, June 10-12, 2009, Reykjavik, Iceland, vol. 5519, pp. 52-61, 2009, ISBN: 3642023258 (Duda and Hart 1973) Richard O.Duda and Peter E.Hart, Pattern Classification and Scene Analysis John Wiley and Sons Inc., New York, US, 1973, ISBN: 0471223611 (Dujardin and al. 1999) Anne-Sophie Dujardin, Veronique Amarger, Kurosh Madani, Olivier Adam, Jean-Francois Motsch, M.Jose (ed.), and J.V.Sanchez-Andres (ed.), Multi-Neural Network Approach for Classification of Brainstem Evoked Response Auditory, Lecture Notes in Computer Science, Engineering Applications of Bio-Inspired Artificial Neural Networks, the International Work-Conference on Artificial and Natural Neural Networks, IWANN'99, June 2-4, Alicante, Spain, Proceedings, vol. 1607, pp. 255-264, 1999, ISBN: 3540660682 (Edmonds 1999) Bruce Edmonds, F.Heylighen (ed.), J.Bollen (ed.), and A.Riegler (ed.), What is Complexity? - The philosophy of complexity per se with application to some examples in evolution, in The Evolution of Complexity: The Violet Book of Einstein Meets Magritte, Kluwer Academic Publishers, 1999, ISBN: 0792357647 (Ester, Kriegel, Sander and Xu 1996) Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiawei Xu, and J.Han (ed.), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Lecture Notes in Computer Science, Proceedings of Second International Conference on Knowledge Discovery and Data Mining, vol. 12, no. 1, pp. 18-24, 1996, ISBN: 1577350049 (Ewald 2004) William Bragg Ewald, From Kant to Hilbert Oxford University Press, US, 2004, ISBN: 0198505353 (Fayyad, Piatetsky-Shapiro and Smyth 1996) Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, From Data Mining to Knowledge Discovery in Databases, AI Magazine (An Official Publication of the American Association for Artificial Intelligence), vol. 17, no. 3, pp. 37-54, 1996, ISSN: 07384602 (Feldman and Crutchfield 1997) David P.Feldman, James P.Crutchfield, V.M.Agranovich (ed.), A.R.Bishop (ed.), A.P.Fordy (ed.), P.R.Holland (ed.), P.R.Holland (ed.), and R.Q.Wu (ed.), Measures of Statistical Complexity: Why?, Physics Letters A, vol. 238, no. 4-5, pp. 244-252, 1997, ISSN: 03759601
(Fellman 2004) Philip V.Fellman, "The Nash Equilibrium Revisited: Chaos and Complexity Hidden in Simplicity," Interjournal ICCS4, Proceedings of the Fourth Internatinal Conference on Complex Systems, May 16-21, 2004, Boston, US, 2004, ISSN: 10810625, Abstract: ID 1013, http://arxiv.org/abs/0707.0891
(Ferreira 2001) Pedro M.Ferreira, Tracing Complexity Theory, 2001, Massachusetts Institute of Technology Engineering Systems Division Research Seminar in Engineering Systems ESD.83 (Fielding 2007) Alan H.Fielding, Cluster and Classification Techniques for the Biosciences, First edition, Cambridge University Press, UK, 2007, ISBN: 0521852811 (Fisher 1936) Ronald A Fisher and R.A Fisher (ed.), The Use of Multiple Measures in Taxonomic Problems, Annals of Eugenics: A Journal Devoted to the Genetic Study of Human Populations, vol. 7, pp. 179-188, 1936, ASIN: B00282EMS4 (Fogel 1991) David B.Fogel, An Information Criterion for Optimal Neural Network Selection, IEEE Transactions on Neural Networks, vol. 2, no. 5, pp. 490-497, 1991, ISSN: 10459227
(Fraley and Raftery 2002) Chris Fraley and Adrian E Raftery, "Model-Based Clustering, Discriminant Analysis, and Density Estimation," Journal of the American Statistical Association, vol. 97, no. 458, pp. 611-631, 2002, ISSN: 01621459, http://www.jstor.org/pss/3085676
(Freund and Schapire 1996) Yoav Freund, Robert E.Schapire, and L.Saitt (ed.), Experiments with a New Boosting Algorithm, Proceedings of the Thirteenth International Conference on Machine Learning 1996 International Conference, pp. 148-156, 1996, ISBN: 1558604197
(Friedman and Meulman 2004) Jerome H.Friedman and Jacqueline J.Meulman, "Clustering Objects on Subsets of Attributes," Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 66, no. 4, pp. 815-849, 2004, ISSN: 13697412, http://www.jstor.org/pss/3647651
(Friedman and Rafsky 1979) Jerome H.Friedman and Lawrence C.Rafsky, Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests, The Annals of Mathematical Statistics, vol. 7, no. 4, pp. 697-717, 1979, ISSN: 00034851 (Friedrich 2004) Friedrich Jurgen, Spatial Modeling in Natural Sciences and Engineering: Software Development and Implementation, First edition, Springer-Verlag Berlin and Heidelberg GmbH. and Co. K., Germany, 2004, ISBN: 3540208771 (Fuglede and Topsoe 2004) Bent Fuglede and Flemming Topsoe, Jensen-Shannon Divergence and Hilbert Space Embedding, Proceedings, IEEE International Symposium on Information Theory (ISIT 2004), vol. 3, pp. 31-36, 2004, ISBN: 0780382803 (Fukunaga 1972) Keinosuke Fukunaga, Introduction to Statistical Pattern Recognition (in Russian) Nauka, Moscow, USSR, 1979, 1972, ISBN: 0122698509
(Gao, Foster, Mobus and Moschytz 2001) Qun Gao, Philipp Forster, Karl R.Mobus, and George S.Moschytz, Fingerprint Recognition Using CNNs: Fingerprint Preprocessing, Proceedings on the IEEE International Symposium on Circuits and Systems (ISCAS 2001), 6-9 May 2001, Sydney, Australia, vol. 2, pp. 433-436, 2001, ISBN: 0780366859 (Garey and Johnson 1979) Michael R.Garey and David S.Johnson, Computers and Intractability: A Guide to the Theory of Np-Completeness W.H. Freeman and Company Lyd., San Francisco, California, US, 1979, ISBN: 0716710455 (Gasmi and Merouani 2005) Ibtissem Gasmi and Hayet Merouani, Towards a Method of Automatic Design of Multi-Classifiers System Based Combination, Proceeding of World Academy of Science, Engineering and Technology, June 2005, vol. 6, pp. 82-87, 2005, ISSN: 20703724 (Gavin, Oswald, Wahl and Williams 2002) Daniel G.Gavin, W.Wyatt Oswald, Eugene R.Wahl, John W.Williams, D.B.Booth (ed.), and A.R.Gillespie (ed.), A Statistical Approach to Evaluating Distance Metrics and Analog Assignments for Pollen Records, Quaternary Research, vol. 60, no. 3, pp. 356-367, 2002, ISSN: 00335894
(Gelfand, Ravishankar and Delp 1991) Saul B.Gelfand, Channasandra S.Ravishankar, and Edward J.Delp, An Iterative Growing and Pruning Algorithm for Classification Tree Design, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 2, pp. 163-174, 1991, ISSN: 01628828, Abstract: www.cs.virginia.edu/~robins/Quantum_Computing_with_Molecules.pdf
(Gershenfeld and Chuang 1998) Neil Gershenfeld and Isaac L.Chuang, Quantum Computing with Molecules, Scientific American Magazine, pp. 66-71, 1998, ISSN: 00368733.
(Gersho and Gray 1991) Allen Gersho and Robert M.Gray, Vector Quantization and Signal Compression Kluwer Academic Publishers, 1991, ISBN: 0792391810
(Giacinto, Roli and Fumera 2000) Giorgio Giacinto, Fabio Roli, and Giorgio Fumera, Unsupervised Learning of Neural Network Ensembles for Image Classification, ICNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, 24-27 July, Como, Italy, vol. 3, pp. 155-159, 2000, ISBN: 0769506194
(Giusti, Masulli and Sperduti 2002) Nicola Giusti, Francesco Masulli, and Alessandro Sperduti, Theoretical and Experimental Analysis of a Two-Stage System for Classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 893-904, 2002, ISSN: 01628828
(Go, Han, Kim and Lee 2001) Jinwook Go, Gunhee Han, Hagbae Kim, and Chulhee Lee, Multigradient: A New Neural Network Learning Algorithm For Pattern Classification, IEEE Transactions on Geoscience and Remote Sensing, vol. 39, no. 5, pp. 986-993, 2001, ISSN: 01962892
(Goblick 1988) Thomas Goblick, DARPA Neural Network Study: October 1987-February 1988 AFCEA International Press, 1998, ISBN: 0916159175
(Godel 2001) Kurt Godel, S.Feferman (ed.), J.W.Dawson Jr (ed.), S.C.Kleene (ed.), G.H.Moore (ed.), R.M.Solovay (ed.), and J.Van Heijenoort (ed.), Collected Works: Volume II: Publications 1938-1974 (Collected Works (Oxford)) Oxford University Press, US, 2001, ISBN: 0195147219 (Gold and Morgan 1999) Bernard Gold and Nelson Morgan, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, First edition, John Wiley, New York, US, 1999, ISBN: 0471351547 (Gold, Holub and Sollich 2005) Carl Gold, Alex Holub, Peter Sollich, J.Mira (ed.), and A.Prieto (ed.), Bayesian Approach to Feature Selection and Parameter Tuning for Support Vector Machine Classifiers, Neural Networks, vol. 18, no. 5-6 (Special issue: IJCNN 2005), pp. 693-701, 2005, ISSN: 08936080 (Goldstone and Kersten 2003) Robert L.Goldstone, Alan Kersten, A.F.Healy (ed.), R.W.Proctor (ed.), and I.B.Weiner (ed.), Concept and Categorization, in Handbook of Psychology: Experimental Psychology John Wiley adn Sons Inc., US, 2003, pp. 599-622, ISBN: 0471392626 (Goonatilake and Khebbal 1995) Suran Goonatilake and Sukhdev Khebbal, Intelligent Hybrid Systems: Fuzzy Logic, Neural Networks, and Genetic Algorithms John Wiley and Sons Ltd., 1995, ISBN: 0471942421 (Gray, Oehler, Perlmutte and Ohlsen 1993) Robert M.Gray, Karen L.Oehler, Keren O Perlmutter, and Richard A.Olshen, Combining Tree-Structured Vector Quantization with Classification and Regression Trees, Proceedings of the Twenty-Seventh Asilomar Conference on Signals, Systems, and Computers, pp. 1494-1498, 1993, ISBN: 0818641207 (Green and Newth 2001) David G.Green, David Newth, T.Bossomaier (ed.), R.Standish (ed.), and S.Halloy (ed.), Towards a T heory of Everything? - Grand Challenges in Complexity and Informatics, Complexity International, vol. 8,, Paper ID: green05 2001, ISSN: 13200682, http://www.complexity.org.au/ci/vol08/green05/ (Guha, Rastogi and Shim 1998) Sudipto Guha, Rajeev Rastogi, Kyuscok Shim, L.M.Haas, and A.Tiwary, CURE: An Efficient Clustering Algorithm for Large Databases, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, US, vol. 27, no. 2, pp. 73-84, 1998, ISBN: 0897919955 (Guha, Rastogi and Shim 2000) Sudipto Guha, Rajeev Rastogi, and Kyuscok Shim, ROCK: Robust Clustering Algorithm for Categorization Attributes, Information Systems, vol. 25, no. 5, pp. 345-366, 2000, ISSN: 03064379 (Haken 2002) Hermann Haken, Information and Self-Organization: A Macroscopic Approach to Complex Systems (Springer Series in Synergetics), Second edition, Springer-Verlag Berlin and Heidelberg GmbH. and Co. K., Germany, 2000, ISBN: 3540662863
(Han and Kamber 2006) Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second revised edition, Morgan Kaufmann Publishers Inc., 2006, ISBN: 1558609016
(Han and Karypis 2000) Eui-Hong (Sam) Han and George Karypis, "Centroid-Based Document Classication: Analysis and Experimental Results," University of Minnesota, Department of Computer Science / Army HPC Research Center, AHPCRC, Minnesota Supercomputer Institute (Center contract number DAAH04-95-C-0008) ,2000, Technical Report No.: 00-017, Related papers are available via http://www.cs.umn.edu/~karypis
(Hansen and Salamon 1990) Lars Kai Hansen and Peter Salamon, Neural Network Ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993-1001, 1990, ISSN: 01628828 (Hartuv and Shamir 2000) Erez Hartuv and Ron Shamir, A Clustering Algorithm Based on Graph Connectivity, Information Processing Letters, vol. 76, no. 4-6, pp. 175-181, 2000, ISSN: 00200190 (Hassibi and Stork 1993) Babak Hassibi, David G.Stork, and S.J.Hanson (ed.), Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, Advances in Neural Information Processing Systems Five, Nips Five, pp. 164-171, 1993, ISBN: 1558602747 (Hastie, Tibshirani and Friedman 2009) Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Third revised edition, Springer-Verlag New York Inc., New York, US, 2009, ISBN: 0387848576 (Haussler, Kearns and Schapire 1994) David Haussler, Michael Kearns, and Robert E.Schapire, Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension, Machine Learning, vol. 14, no. 1, pp. 83-113, 1994, ISSN: 08856125 (Hertz, Palmer and Krogh 1991) John Hertz, Richard G.Palmer, and Anders Krogh, Introduction to the Theory of Neural Computation Addison Wesley, New York, US, 1991, ISBN: 0201503956 (Highfield 1996) Roger Highfield, Frontiers of Complexity: The Search for Order in a Chaotic World Ballantine Books, 1996, ISBN: 0449910814 (Hinegardner and Engelberg-Kulka 1983) Ralph T.Hinegardner and Hanna Engelberg-Kulka, Biological Complexity, Journal of Theoretical Biology, vol. 104, pp. 7-20, 1983, ISSN: 00225193 (Ho 2000) Tin Kam Ho, J.Kittler (ed.), and F.Roli (ed.), Complexity of Classification Problems and Comparative Advantages of Combined Classifiers, Lecture Notes in Computer Science, Proceedings on the First International Workshop on Multiple Classifier Systems, McS 2000, June 2000, Cagliari, Italy, vol. 1857, pp. 97-106, 2000, ISBN: 3540677046
(Ho 2001) Tin Kam Ho, J.Kittler (ed.), and F.Roli (ed.), "Data Complexity Analysis for Classifier Combination ," Lecture Notes in Computer Science, Multiple Classifier Systems, Second International Workshop Proceedings, MCS 2001, July 2-4, 2001, Cambridge, UK, vol. 2096, pp. 53-56, 2001, ISBN: 3540422846
(Ho 2002) Tin Kam Ho, "A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors," Pattern Analysis and Applications, vol. 5, no. 2, pp. 102-112, 2002, ISSN: 1433754
(Ho and Baird 1994) Tin Kam Ho and Henry S.Baird, "Estimating the Intrinsic Difficulty of A Recognition Problem ," Proceedings of the 12th IAPR International, Conference on Pattern Recognition, Conference B: Computer Vision and Image Processing, vol. 2, pp. 178-183, 1994, ISBN: 0818662700
(Ho and Baird 1998) Tin Kam Ho and Henry S.Baird, Pattern Classification with Compact Distribution Maps, Computer Vision and Image Understanding, vol. 70, no. 1, pp. 101-110, 1998, ISSN: 10773142
(Ho and Basu 2002) Tin Kam Ho and Mitra Basu, "Complexity Measures of Supervised Classification Problems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 289-300, 2002, ISSN: 01628828
(Ho, Hull and Stihari 1994) Tin Kam.Ho, Jonathan J.Hull, and Sargur N.Stihari, Decision Combination in Multiple Classifier Systems, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 1, pp. 66-75, 1994, ISSN: 01628828
(Hofstadter 1999) Douglas R.Hofstadter, Godel, Escher, Bach: An Eternal Golden Braid Basic Books, 1999, ISBN: 0465026567 Abstract: http://www.econ.iastate.edu/tesfatsi/hogan.complexperplex.htm
(Horgan 1995) John Horgan, From Complexity to Perplexity, Scientific American Magazine, pp. 74-79, 1995, ISSN: 00368733 (Hsieh and Fan 2001) Ing-Sheen Hsieh, Kuo-Chin Fan, and C.A.Bouman (ed.), Multiple Classifier for Color Flag and Trademark Image Retrieval, IEEE Transactions on Image Processing, vol. 10, no. 6, pp. 938-950, 2001, ISSN: 10577149
(IBM Corporation 1998) IBM Corporation, ZISC(R)036 Neurons User's Manual, 1998, Documents Number: IOZSCWBU-02. http://noel.feld.cvut.cz/vyu/scs/2001/ZISC/ziscwb.pdf (Jain, Murty and Flynn 1999) Anil Kumar Jain, M.Narasimha Murty, and Patrick Joseph Flynn, Data Clustering: A Review, ACM Computing Surveys (CSUR), vol. 31, no. 3, pp. 264-323, 1999, ISSN: 03600300 (Jelassi and Enders 2004) Tawfik Jelassi and Albrech Enders, Key Terminology and Evolution of e-Business, in Strategies for e-Business: Creating Value through Electronic and Mobile Commerce Financial Times Prentice Hall, 2004, pp. 10-11, ISBN: 0273688405
(Jesse, Liu, Smart and Brown 2008) Christopher Jesse, Honghai Liu, Edward Smart, David Brown, I.Lovrek (ed.), R.J.Howlett (ed.), and L.C.Jain (ed.), Analysing Flight Data Using Clustering Methods, Lecture Notes in Artificial Intelligence, Proceedings of the 12th international conference on Knowledge-Based Intelligent Information and Engineering Systems, Part I, vol. 5177, pp. 733-740, 2008, ISBN: 3540855645 (Joachims 1998) Thorsten Joachims, C.Nedellec (ed.), and C.Rouveirol (ed.), Text Categorization with Support Vector Machine: Learning with Many Relevant Features, Lecture Notes in Computer Science, Proceedings in Machine Learning, ECML 98, 10th European Conference on Machine Learning April 21-23, 1998, Chemnitz, Germany, vol. 1398, pp. 137-142, 1998, ISBN: 3540644172 (Johnson and Kargupta 2002) Erik L.Johnson, Hillol Kargupta, M.J.Zaki (ed.), and C-T.Ho (ed.), Collective, Hierarchical Clustering from Distributed, Heterogeneous Data, Lecture Notes In Computer Science, Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, vol. 1759, pp. 103-114, 2002, ISBN: 3540671943 (Jordan and Xu 1995) Michael I.Jordan, Lei Xu, S.Grossberg (ed.), K.Doya (ed.), and J.Taylor (ed.), Convergence Results for the EM Approach to Mixtures of Experts Architectures, Neural Networks, vol. 8, no. 9, pp. 1487-1489, 1995, ISSN: 08936080
(Josephson 2004) Brian D.Josephson, "How We Might be Able to Understand the Brain," Interjournal ICCS4, Proceedings of the Fourth Internatinal Conference on Complex Systems, May 16-21, 2004, Boston, US, 2004, ISSN: 10810625, Abstract ID: 5225, http://cogprints.org/3655/5/ICCS2004.links.html
(Juang and Katagiri 1992) Biing-Hwang Juang and Shigeru Katagiri, Discriminant Learning for Minimum Error Classification, IEEE Transactions on Signal Processing, vol. 40, no. 12, pp. 3043-3054, 1992, ISSN: 1053587X (Kadous 2002) Mohammed Waleed Kadous, Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series, A Thesis Submitted as a Requirement for the Degree of Doctor of Philosophy 2002, Order Number: AAI0806481 (Kamvar, Klein and Manning 2002) Sepandar D.Kamvar, Dan Klein, and Christopher D.Manning, Interpreting and Extending Classical Agglomerative Clustering Algorithms Using a Model-Based Approach, Proceedings of the Nineteenth International Conference on Machine Learning, pp. 283-290, 2002, ISBN: 1558608737 (Kantardzic 2002) Mehmed Kantardzic, Data Mining: Concepts, Models, Methods and Algorithms John Wiley and Sons Inc., US, 2002, ISBN: 0471228524 (Karypis, Han and Kumar 1999) George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling, IEEE Computer: Special Issue on Data Analysis and Mining, vol. 32, no. 8, pp. 68-75, 1999, ISSN: 00189162.
(Kauffman 1993) Stuart Alan A.Kauffman, The Origins of Order: Self-Organization and Selection in Evolution Oxford University Press Inc., US, 1993, ISBN: 0195079515 (Kawatani 1999) Takahiko Kawatani, Handwritten Kanji Recognition Using Combined Complementary Classifiers in a Cascade Arrangement, Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR 99), September 20-22, Bangalore, India, pp. 503-506, 1999, ISBN: 0780370449 (Kijsirikul and Chongkasemwongse 2001) Boonserm Kijsirikul and Kongsak Chongkasemwongse, Decision Tree Pruning Using Backpropagation Neural Networks, Proceedings of International Joint Conference on Neural Networks (IJCNN 01), July 15-19, 2001, Washington DC, US, vol. 3, pp. 1876-1880, 2001, ISBN: 0769503187 (Kimu ra and Shridhar 1991) Fumitaka Kimura and Malayappan Shridhar, Handwritten Numeral Recognition Based on Multiple Algorithms, Pattern Recognition, The Journal of the Pattern Recognition Society, vol. 24, no. 10, pp. 969-983, 1991, ISSN: 00313203 (Kimura, Takashina, Tsuruoka and Miyake 1987) Fumitaka Kimura, Kenji Takashina, Shinji Tsuruoka, and Yasuji Miyake, Modified Quadratic Discriminant Functions and the Application to Chinese Character Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 9, no. 1, pp. 149-153, 1987, ISSN: 01628828 (Kimura, Wakabayashi, Tsuruoka and Miyake 1991) Fumitaka Kimura, Tetsushi Wakabayashi, Shinji Tsuruoka, Yasuji Miyake, and C.Y.Suen (ed.), Improvement of Handwritten Japanese Character Recognition Using Weighted Direction Code Histogram, Pattern Recognition, The Journal of the Pattern Recognition Society, vol. 30, no. 8, pp. 1329-1337, 1997, ISSN: 00313203 (King 1967) Benjaman King, Step-wise Clustering Procedures, Journal of American Statistical Assosiation, vol. 62, no. 317, pp. 86-101, 1967, ISSN: 01621459 (Kittler, Hatef, Duin and Matas 1998) Josef Kittler, Mohamad Hatef, Robert P.W.Duin, and Jiri Matas, On Combining Classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, 1998, ISSN: 01628828 (Kohn, Nakano and Silva 1996) Andre Fabio Kohn, Luis Gustavo Mendonca Nakano, Miguel Oliveira E.Silva, and C.Y.Suen (ed.), A Class Discriminability Measure Based on Feature Space Partitioning, Pattern Recognition, The Journal of the Pattern Recognition Society, vol. 29, no. 5, pp. 873-887, 1996, ISSN: 00313203 (Kohonen 1989) Teuvo Kohonen, Self-Organization and Associate Memory, Third edition, Springer-Verlag, Berlin, Germany, 1989, ISBN: 0387513876 (Koontz and Fukunaga 1972) Warren L.G.Koontz and Keinosuke Fukunaga, A Nonparametric Valley-Seeking Technique for Cluster Analysis, IEEE Transactions on Computers, vol. 21, no. 2, pp. 171-178, 1972, ISSN: 00189340 (Kressel 1999) Urlich Kressel, B.Scholkopf (ed.), C.J.C.Burges (ed.), and A.J.Smola (ed.), Pairwise Classification and Support Vector Machines, in Advances in Kernel
Methods: Support Vector Learning MIT Press, US, 1999, pp. 255-268, ISBN: 0262194163 (Krippendorff 1986) Klaus Krippendorff, Information Theory: Structural Models for Qualitative Data SAGE Publications Inc., 1986, ISBN: 0803921322 (Kundur, Hatzinakos and Leung 2000) Deepa Kundur, Dimitrios Hatzinakos, and Henry Leung, Robust Classification of Blurred Imagery, IEEE Transactions on Image Processing, vol. 9, no. 2, pp. 243-255, 2000, ISSN: 10577149 (Kupinski, Edwards, Giger and Metz 2001) Matthew A.Kupinski, Darrin C.Edwards, Maryellen L.Giger, and Charles E.Metz, Ideal Observer Approximation Using Bayesian Classification Neural Networks, IEEE Transactions on Medical Imaging, vol. 20, no. 9, pp. 886-899, 2001, ISSN: 02780062 (Lavrac, Flach and Todorovski 2002) Nada Lavrac, Peter Flach, Ljupco Todorovski, M.Bohanec (ed.), B.Kasek (ed.), N.Lavrac (ed.), and D.Mladenic (ed.), Rule Induction for Subgroup Discovery with CN2-SD, Proceedings of the ECML/PKDD'02 Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning, pp. 77-87, 2002 (Lazarevic and Obradovic 2001) Aleksandar Lazarevic and Zoran Obradovic, Effective Pruning of Neural Network Classifier Ensembles, Proceedings of International Joint Conference on Neural Networks (IJCNN 01): Washington, DC, US, July 15-19, vol. 2, pp. 796-801, 2001, ISBN: 0780370449 (Le Bourgeois and Emptoz 1996) Frank Le Bourgeois and Hubert Emptoz, Pretopological Approach for Supervised Learning, Proceedings of the 13th International Conference on Pattern Recognition: August 25-29, 1996 (1996 International Conference on Pattern Recognition (13th IAPR), vol. 4, pp. 256-260, 1996, ISBN: 0818674725 (Le Cun and al. 1989) Yann Le Cun, Bernard E.Boser, John S.Denker, Donnie Henderson, Richard E.Howard, Wayne E.Hubbard, and Lawrence D.Jackel, Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, vol. 1, no. 4, pp. 541-551, 1989, ISSN: 08997667 (Le Cun, Bottou, Bengio and Haffner 1998) Yann Le Cun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, Gradient-based Learning Applied to Document Recognition, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998, ISSN: 00189219
(Le Cun, Denker and Solla 1990) Yann Le Cun, John S.Denker, Sara A.Solla, and D.S.Touretzky (ed.), Optimal Brain Damage, Advances in Neural Information Processing Systems, vol. 2, pp. 598-605, 1990, ISBN: 1558601007
(Lemay 1999) Philippe Lemay, The Statistical Analysis of Dynamics and Complexity in Psychology: A Configural Approach, Thesis presented to the Faculty of Social and Political Sciences of the University of Lausanne to obtain the degree of Doctor in Psychology 1999, OCLC Number: 78333154, Abstract: http://tecfa.unige.ch/~lemay/thesis/THX-Doctorat/THX-Doctorat.html (Leondes 1998) Cornelius T.Leondes, Image Processing and Pattern Recognition (Neural Network Systems Techniques and Applications) Academic Press Inc., San Diego, California, US, 1998, ISBN: 0124438652 (Li, Zhang and Ogihara 2004) Tao Li, Chengliang Zhang, and Mitsunori Ogihara, A Comparative Study of Feature Selection and Multiclass Classification Methods for Tissue Classification Based on Gene Expression, Bioinformatics, vol. 20, no. 15, pp. 2429-2437, 2004, ISSN: 13674803 (Liao 2001) Yihua Liao, Neural Networks in Hardware: A Survey, Department of Computer Sciences, University of California, Davis, California, US, 2001, Project: ECS250A (Lin 1991) Jianhua Lin, Divergence Measures Based on the Shannon Entropy, IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145-151, 1991, ISSN: 00189448, Abstract: http://bit.csc.lsu.edu/~jianhua/shiv2.pdf (Lindblad and al. 1996) Thomas Lindblad, Clark S.Lindsey, Maxim Minerskjold, Givi Sekhniaidze, Geza Szekely, and A.J.Eide (ed.), The IBM ZISC036 Zero Instruction Set Computer, 1996
(Lindblad, Lindsey and Eide 2002) Thomas Lindblad, Clark S.Lindsey, and Age J.Eide, Radial Basis Function (RBF) Neural Networks, 2002
(Lofgren 1973) Lennart Lofgren, P.Suppes (ed.), L.Henkin (ed.), A.Joja (ed.), and Gr.C.Moisil (ed.), On the Formalization of Learning and Evolution, in Logic, Methodology and Philosophy of Science IV, Proceedings of the Fourth International Congress for Logic, Methodology and Philosophy of Science, Bucharest 1971 (Studies in Logic and the Foundations of Mathematics, 74) North-Holland Publishing Co., Amsterdam, The Netherlands, 1973, ASIN: B001DCBM74
(Lotte and al. 2007) Fabien Lotte, Marco Congedo, Anatole Lecuyer, Fabrice Lamarche, and Bruno Arnaldi, A Review of Classification Algorithms for EEG-based Brain-Computer Interfaces, Journal of Neural Engineering, vol. 4, no. 2, pp. 1-24, 2007, ISSN: 17412560 (Lu 1996) Yi Lu, Knowledge Integrations in a Multiple Classifier System, Applied Intelligence, vol. 6, no. 2, pp. 75-86, 1996, ISSN: 0924669X (Madani, Chebira and Mercier 1997) Kurosh Madani, Abdennasser Chebira, Gilles Mercier, J.Mira (ed.), R.Moreno-Diaz (ed.), and J.Cabestany Moncusi (ed.), Multi-Neural Networks Hardware and Software Architecture: Application of the Divide to Simplify Paradigm DTS, Lecture Notes in Computer Science / Biological and Artificial Computation: From Neuroscience to Technology : International Work-Conference on Artificial and Natural Neural Networks, IWANN'97, 3 June 1997, Lanzarote, Canary Islands, Spain, vol. 1240, pp. 841-850, 1997. ISBN: 3540630473
(Madani and Chebira 2000) Kurosh Madani, Abdennasser Chebira, D.A.Zighed (ed.), J.Komorowski (ed.), and J.Zytkow (ed.), A Data Analysis Approach Based on a Neural Networks Data Sets Decomposition and its Hardware Implementation, Lecture Notes in Computer Science / Lecture Notes in Artificial Intelligence, Principles of Data Mining and Knowledge Discovery, 4th European Conference, PKDD 2000, September 13-16, 2000, Lyon, France, Proceedings Workshop 1, Advances in Data Mining, vol. 1910, 2000, ISBN: 354041066X Abstract: http://eric.univ-lyon2.fr/~pkdd2000/Download/#WS1
(Lucas 2000) Chris Lucas, Quantifying Complexity Theory, CALResCo Group, 2000, http://www.calresco.org/lucas/quantify.htm (Madani and Berechet 2001) Kurosh Madani, Ion Berechet, J.Mira (ed.), and A.Prieto (ed.), Inaccessible Parameters Monitoring in Industrial Environment: A Neural Based Approach, Lecture Notes in Computer Science, Bio-Inspired Applications of Connectionism: Proceedings of the 6th International Work-Conference on Artificial and Natural Neural Networks, IWANN 2001, June 13-15, 2001, Granada, Spain, vol. 2085, pp. 619-627, 2001, ISBN: 3540422358 (Madani, De Tremiolles and Tannhof 2001) Kurosh Madani, Ghislain Imbert De Tremiolles, Pascal Tannhof, J.Mira (ed.), and A.Prieto (ed.), ZISC-036 Neuro-processor Based Image Processing, Lecture Notes in Computer Science, Bio-Inspired Applications of Connectionism: Proceedings of the 6th International Work-Conference on Artificial and Natural Neural Networks, IWANN 2001, June 13-15, 2001, Granada, Spain, vol. 2085, pp. 200-207, 2001, ISBN: 3540422358 (Madani, Rybnik and Chebira 2003) Kurosh Madani, Mariusz Rybnik, Abdennasser Chebira, J.Mira (ed.), and J.R.Alvarez (ed.), Data Driven Multiple Neural Network Models Generator Based on a Tree-like Scheduler, Lecture Notes in Computer Science, Computational Methods in Neural Modeling, 7th International Work Conference on Artificial and Natural Neural Networks, IWANN 2003, June 3-6, Mao, Menorca, Spain, Proceedings, Part 1, vol. 2686, pp. 382-389, 2003, ISBN: 3540402101 (Maier and Rechtin 2000) Mark W.Maier and Eberhardt Rechtin, The Art of Systems Architecting, Second revised edition, CRC Press Inc., US, 2000, ISBN: 0849304407
(Makal, Ozyilmaz and Palavaroglu 2008) Senem Makal, Lale Ozyilmaz, and Senih Palavaroglu, "Neural Network Based Determination of Splice Junctions by ROC Analysis," Proceedings of World Academy of Science, Engineering and Technology, vol. 33, pp. 630-632, 2008, ISSN: 20703740, www.waset.org/journals/waset/v43/v43-112.pdf
(Malousi and al. 2008) Andigoni Malousi, Ioanna Chouvarda, Vassilis Koutkias, Sofia Kouidou, and Nicos Maglaveras, Variable-length Positional Modeling for Biological Sequence Classification, AMIA Annual Symposium proceedings, pp. 91-95, 2008, ISSN: 1942597X (Manning, Raghavan and Schultze 2008) Christopher D.Manning, Prabhakar Raghavan, and Hinrich Schutze, Support Vector Machines and Machine Learning on Documents, in Introduction to Information Retrieval Cambridge University Press, UK, 2008, ISBN: 0521865719
(Maren 1990) Alianna J.Maren, Handbook of Neural Computing Applications Academic Press, Inc., 1990, ISBN: 0125460902 (Markovitch and Rosenstein 2002) Shaul Markovitch and Dan Rosenstein, Feature Generation Using General Constructor Functions, Machine Learning, vol. 49, no. 1, pp. 59-98, 2002, ISSN: 08856125 (Matusita and Akaike 1956) Kameo Matusita and Hirotugu Akaike, Decision Rules, Based on the Distance, For the Problems of Independence, Invariance and Two Samples, Annals of the Institute of Statistical Mathematics, vol. 7, no. 2, pp. 67-80, 1956, ISSN: 00203157 (McQueen 1967) James B.MacQueen, L.M.Le Cam (ed.), and J.Neyman (ed.), Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, June 21-July 18, 1965 and December 27, 1965 - January 7, 1966, vol. 1, pp. 281-297, 1967, Library of Congress Catalog Card Number: 498189
(Mendez 2009) Anne Menendez, Disruptive Parallel Neural Network Chip Ready to Compete With DSPs for Pattern Recognition, 2009, CM1K Technical Brief, Rev 05-09, http://general-vision.com/White%20Papers/WP_CM1K_disruptive%20performance%20for%20DSP.pdf
(Michalski and Stepp 1987) Ryszard S.Michalski, Robert E.Stepp, S.C.Shapiro (ed.), D.Eckroth (ed.), and G.A.Vallasi (ed.), "Clustering," in Encyclopedia of Artificial Intelligence: A-N Vol. 1 John Wiley and Sons Inc., 1987, pp. 103-111, ISBN: 047162974X
(Micheli-Tzanakou 1999) Evangelia Micheli-Tzanakou, Supervised and Unsupervised Pattern Recognition: Feature Extraction and Computational Intelligence (Industrial Electronics Series), First edition, CRC Press Inc., Boca Raton, US, 1999, ISBN: 0849322782 (Mielke and Roubicek 2003) Alexander Mielke and Tomas Roubicek, A Rate-Independent Model for Inelastic Behavior of Shape-Memory Alloys, Multiscale Modeling and Simulation: A SIAM Interdisciplinary Journal, vol. 1, no. 4, pp. 571-597, 2003, ISSN: 15403459 (Mikulecky 2007) Donald C.Mikulecky, C.Gershenson (ed.), D.Aerts (ed.), and B.Edmonds (ed.), Complexity Science as an Aspect of the Complexity of Science, in Worldviews, Science and Us: Philosophy and Complexity World Scientific Publishing Co. Pte. Ltd., 2007, pp. 30-53, ISBN: 9812705481 (Mitchell 1997) Tom M.Mitchell, Machine Learning McGraw Hill Higher Education, 1997, ISBN: 0070428077 (Mitchell, Anderson, Carbonel and Michalski 1986) Tom M.Mitchell, John R.Anderson, Jaime G.Carbonell, and Ryszard Stanislaw Michalski, Machine Learning: An Artificial
Intelligence Approach Morgan Kaufmann Publishers Inc., US, 1986, ASIN: B000FO7JKK (Moffat 2003) James Moffat, Complexity Theory and Network Centric Warfare CForty-OneSR Cooperative Research, 2003, ISBN: 1893723119 (Moses 2002) Joel Moses, Ideas on Complexity in Systems - Twenty Views, Complexity and Flexibility (Working Paper), 2002, Massachusetts Institute of Technology Engineering Systems Division Working Paper Series ESD-WP-2000-02 (Murray-Smith and Johansen 1997) Roderick Murray-Smith and Tor Arne Johansen, Multiple Model Approaches to Modelling and Control Taylor and Francis Ltd., London, UK, 1997, ISBN: 074840595X (Novikoff 1963) Albert B.J.Novikoff, On Convergence Proofs for Perceptrons, Proceedings of the Symposium on Mathematical Theory of Automata: New York, N. Y., April 24, 25, 26, 1962, vol. 12, pp. 615-622, 1963, ASIN: B000GVXN4I (Opitz and Maclin 1999) David Opitz and Richard Maclin, Popular Ensemble Methods: An Empirical Study, Journal of Artificial Intelligence Research, vol. 11, pp. 169-198, 1999, ISSN: 10769757 (Osuna, Freund, and Girosi 1997) Edgar Osuna, Robert Freund, and Federico Girosi, Training Support Vector Machines: An Application to Face Detection, Computer Vision and Pattern Recognition: Conference Proceedings, CVPR 97 (Proceedings on IEEE Computer Society Conference on Computer Vision and Pattern Recognition), pp. 130-136, 1997, ISBN: 0818678224 (Park and Sandberg 1991) Jooyoung Park, Irwin W.Sandberg, and T.J.Sejnowski (ed.), Universal Approximation Using Radial-Basis-Function Networks, Neural Computation, vol. 3, no. 2, pp. 246-257, 1991, ISSN: 08997667 (Parvin, Alizadeh and Minaei-Bidgoli 2009) Hamid Parvin, Hosein Alizadeh, Behrouz Minaei-Bidgoli, and F.I.S.Ko (ed.), Using Clustering for Generating Diversity in Classifiers Ensemble, International Journal of Digital Content Technology and its Applications, vol. 3, no. 1, pp. 51-57, 2009, ISSN: 19759339 (Parzen 1962) Emanuel Parzen, On Estimation of a Probability Density Function and Mode, The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065-1076, 1962, ISSN: 00034851 (Pattichis C., Pattichis M., and Micheli-Tzanakou 2002) Constantinos S.Pattichis, Marios S.Pattichis, and Evangelia Micheli-Tzanakou, Medical Imaging Fusion Applications: An Overview, Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, vol. 9, no. 2, pp. 243-255, 2001, ISBN: 078037147X (Permana 2003) Sidik Permana, Towards the Complexity of Science, Journal of Social Complexity, vol. 1, no. 1, pp. 1-6, 2003, ISSN: 18296041, Abstract: http://josc.bandungfe.net/josc1/spam1ft.pdf
(Pierson 1998) William Edward Pierson, Using Boundary Methods for Estimating Class Separability, Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of Ohio State University 1998, ISBN: 0591977974 (Plato 1997) Plato, J.M.Cooper (ed.), and D.C.Hutchinson (ed.), Plato Complete Works Hackett Publishing Company Inc., 1997, ISBN: 0872203492 (Pomerening, Sontag and Ferrell 2003) Joseph R.Pomerening, Eduardo D.Sontag, and James E.Ferrell Jr, Building a Cell Cycle Oscillator: Hysteresis and Bistability in the Activation of Cdc2, Nature Cell Biology, vol. 5, no. 4, pp. 346-351, 2003, ISSN: 14657392 (Popper 2002) Karl R.Popper, The Logic of Scientific Discovery Routledge, 2002, ISBN: 0415278449 (Portnoy, Bellaachia, Chen and Elkhahloun 2002) David Portnoy, Abdelhani Bellaachia, Yidong Chen, Abdel G.Elkhahloun, M.J.Zaki (ed.), J.T-L.Wang (ed.), and H.Toivonen (ed.), E-CAST: A Data Mining Algorithm for Gene Expression Data, KDD-2002, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Data Mining in Bioinformatics (BIOKDD 2002), July 23-26, 2002, Edmonton, Alberta, Canada, pp. 49-54, 2002, ISBN: 158113567X (Prampero and Carvalho 1998) Paulo S.Prampero and Andre DeCarvalho, Recognition of Vehicles Silhouette using Combination of Classifiers, The 1998 IEEE International Joint Conference on Neural Network Proceedings, IEEE World Congress on Computational Intelligence, May 4-May 9, Anchorage, Alaska, US, pp. 1723-1726, 1998, ISBN: 0780348591 (Prigogine 1980) Ilya Prigogine, From Being to Becoming: Time and Complexity in the Physical Sciences W.H.Freeman and Company Ltd., New York, US, 1980, ISBN: 0716711079 (Prigogine 2003) Ilya Prigogine, Time in Non-equilibrium Physics, in Is Future Given? World Scientific Publishing Co. Pte. Ltd., 2003, pp. 44-54, ISBN: 9812385088 (Rao, Chand and Murthy 2005) Venu Gopala K.Rao, Prem P.Chand, and Ramana M.V.Murthy, Soft Computing-Neural Networks Ensembles, Journal of Theoretical and Applied Information Technology, vol. 3, no. 4, pp. 45-50, 2005, ISSN: 19928645 (Rao and Yadaiah 2005) Sree Hari V.Rao, Narri Yadaiah, and M.S.El Naschie (ed.), Parameter Identification of Dynamical Systems, Chaos, Solitons and Fractals, vol. 23, no. 4, pp. 1137-1151, 2005, ISSN: 09600779 (Ravindranathan and Leitch 1999) Mohan Ravindranathan, Roy Leitch, and M.S.El Naschie (ed.), "Model Switching in Intelligent Control Systems," Artificial Intelligence in Engineering, vol. 13, no. 2, pp. 175-187, 1999, ISSN: 09541810
(Renyi 1960) Alfred Renyi and J.Neyman (ed.), On Measures of Information and Entropy, Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability, vol. 3, pp. 547-561, 1960, ISSN: 00970433 (Richards and Xiuping 2005) John Alan Richards and Jia Xiuping, Remote Sensing Digital Image Analysis: An Introduction, Fourth edition, Springer-Verlag Berlin and Heidelberg GmbH. and Co. K., Germany, 2005, ISBN: 3540251286 (Robertson and Seymour 1984) Neil Robertson and Paul D.Seymour, Graph minors. III: Planar Tree-width, Journal of Combinatorial Theory, Series B, vol. 36, pp. 49-64, 1984, ISSN: 00958956 (Rosenblatt 1961) Frank Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (Cornell Aeronautical Laboratory, Report No. VG-1196-G-8), Spartan Books, Washington DC, US, 1961. (Rossberg 2004) Axel G.Rossberg, "A Generic Scheme for Choosing Models and Characterizations of Complex Systems," Interjournal ICCS4, Proceedings of the Fourth International Conference on Complex Systems, May 16-21, 2004, Boston, US, 2004, ISSN: 10810625, Abstract No. 71, http://arxiv.org/abs/physics/0308018
(Roy 2000) Asim Roy, "Artificial Neural Networks: A Science in Trouble," Association for Computing Machinery, SIGKDD Explorations Newsletter, vol. 1, no. 2, pp. 33-38, 2000, ISSN: 19310145
(Rubin and Trajkovic 2001) Stuart H.Rubin and Ljiljana Trajkovic, On the Role of Randomization in Minimizing Neural Entropy, Invited paper in Proceedings of the Fifth Multi-Conference on Systemics, Cybernetics, and Informatics (SCI 2001), July 22-25, 2001, Orlando, Florida, US, 2001. (Ruck and al 1990) Dennis W.Ruck, Steven K.Rogers, Matthew Kabrisky, Mark E.Oxley, and Bruce W.Suter, The Multilayer Perceptron as an Approximation to a Bayes Optimal Discriminant Function, IEEE Transactions on Neural Networks, vol. 1, no. 4, pp. 296-298, 1990, ISSN: 10459227
(Ryabko 2006) Daniil Ryabko, "Pattern Recognition for Conditionally Independent Data," The Journal of Machine Learning Research, vol. 7, pp. 645-664, 2006, ISSN: 15324435
(Rybnik 2004) Mariusz Rybnik, Contribution to the Modeling and the Exploitation of Hybrid Multiple Neural Networks Systems: Application to Intelligent Processing of Information, Thesis presented to obtain the degree of Doctor of Philosophy of University Paris XII 2004.
(Saakian 2004) David B.Saakian, "Error Threshold in Optimal Coding, Numerical Criteria, and Classes of Universalities for Complexity," Physical Review E, Statistical, Nonlinear, and Soft Matter Physics, vol. 71(2), no. 1, p. 016126.1-016126.12, 2004, ISSN: 15393755, http://arxiv.org/abs/cond-mat/0409107
(Saglam, Yazgan and Ersoy 2003) Mehmet I.Saglam, Bingul Yazgan, and Okan K.Ersoy, Classification of Satellite Images by using Self-organizing map and Linear Support
Vector Machine Decision Tree, (c) GISdevelopment.net, Kuala Lumpur, Malaysia, 2003, http://www.gisdevelopment.net/technology/ip/ma03120abs.htm
(Sancho and al. 1997) Jose Luis Sancho, Batu Ulug, William Pierson, Anibal R.Figueiras-Vidal, Stanley C.Ahalt, D.Do Campo (ed.), A.R.Figueiras-Vidal (ed.), and F.Perez-Gonzalez (ed.), Boundary Methods for Distribution Analysis, in Intelligent Methods in Signal Processing and Communications Birkhauser Boston Inc., Cambridge, Massachusetts, US, 1997, pp. 173-197, ISBN: 0817639608 (Sarkar and Leong 2001) Manish Sarkar and Tze-Yun Leong, Splice Junction Classification Problems for DNA Sequences: Representation Issues, 2001 Conference Proceedings of the 23rd Annual International IEEE Engineering in Medicine and Biology Society, 25-28 October 2001, Istanbul, Turkey, vol. 3, pp. 2895-2898, 2001, ISBN: 0780372115 (Sarlashkar, Bodruzzaman and Malkani 1998) Avinash N.Sarlashkar, Mohammad Bodruzzaman, and Mohan J.Malkani, Feature Extraction Using Wavelet Transform For Neural Network Based Image Classification, Systems Theory 1998, Proceedings of the Thirtieth Southeastern Symposium on System Theory, March 8-10, 1998, West Virginia University, Morgantown, US, pp. 412-416, 1998, ISBN: 0780345479 (Sato and Yamada 1996) Atsushi Sato, Keiji Yamada, D.S.Touretzky (ed.), M.C.Mozer (ed.), and M.E.Hasselmo (ed.), Generalized Learning Vector Quantization, Advances in Neural Information Processing Systems 8: Proceedings of the 1995 Conference, vol. 8-9, pp. 423-429, 1996, ISBN: 0262201070 (Scott and Markovitch 1989) Paul D.Scott, Shaul Markovitch, and A.M.Segre (ed.), Uncertainty Based Selection of Learning Experiences, Proceedings of the Sixth International Workshop on Machine Learning, June 26-27, 1989, Cornell University, Ithaca, New York, US, pp. 358-361, 1989, ISBN: 1558600361. (Senge 2006) Peter M.Senge, The Fifth Discipline: The Art & Practice of The Learning Organization, Broadway Business, US, 2006, ISBN: 0385517254 (Senior 2001) Andrew Senior, A Combination Fingerprint Classifier, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 10, pp. 1165-1174, 2001, ISSN: 01628828 (Serfling 2002) Robert J.Serfling and Y.Dodge (ed.), A Depth Function and a Scale Curve Based on Spatial Quantiles, in Statistical Data Analysis Based on the L1-Norm and Related Methods Birkhauser Verlag AG, Germany, 2002, pp. 25-38, ISBN: 3764369205
(Shalizi 2005) Cosma Rohilla Shalizi, Information Theory, 2005, Cosma Rohilla Shalizi's Web Notebook. http://www.cscs.umich.edu/~crshalizi/notebooks/ (Shalizi 2007) Cosma Rohilla Shalizi, Complexity, Complexity Measures, 2007, Cosma Rohilla Shalizi's Web Notebook. http://www.cscs.umich.edu/~crshalizi/notebooks/
(Shamir and Sharan 2002) Ron Shamir, Roded Sharan, T.Jiang (ed.), Y.Xu (ed.), and M.Q.Zhang (ed.), Algorithmic Approaches to Clustering Gene Expression Data, in Current Topics in Computational Molecular Biology MIT Press, US, 2002, pp. 269-300, ISBN: 0262100924 (Sharan and Shamir 2000) Roded Sharan, Ron Shamir, R.Altman (ed.), T.L.Bailey (ed.), P.Bourne (ed.), T.Lengauer (ed.), I.N.Shindyalov (ed.), L.F.T.Eyck (ed.), and H.Weissig (ed.), CLICK: A Clustering Algorithm with Application to Gene Expression Analysis, Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), August 16-23, 2000, La Jolla, California, US, pp. 307-316, 2000, ISBN: 1577351150 (Shen and Castan 1999) Jun Shen, Serge Castan, T.Kohonen (ed.), K.Makisara (ed.), O.Simula (ed.), and J.Kangas (ed.), Image Thinning by Neural Networks, Artificial Neural Networks: Proceedings of the 1991 International Conference on Artificial Neural Networks, ICANN'91, June 24-28, Espoo, Finland, vol. 1, pp. 841-846, 1991, ISBN: 0444891781 (Shi, Shu and Liu 1998) Daming Shi, Wenhao Shu, and Haitao Liu, Feature Selection for Handwritten Chinese Character Recognition Based on Genetic Algorithms, 1998 IEEE International Conference on Systems, Man, and Cybernetics, San Diego, US, vol. 5, pp. 4201-4206, 1998, ISBN: 0780347781 (Shipp and Kuncheva 2001) Catherine A.Shipp and Ludmila I.Kuncheva, Four Measures of Data Complexity for Bootstrapping, Splitting and Feature Sampling, Proceedings International ICSC Congress on Computational Intelligence: Methods and Applications (CIMA'2001), June 19-20, 2001, Bangor, Wales, United Kingdom, pp. 429-435, 2001, ISBN: 3906454266 (Sima and Orponen 2003) Jiri Sima and Pekka Orponen, General-Purpose Computation with Neural Networks: A Survey of Complexity Theoretic Results, Neural Computation, vol. 15, no. 12, pp. 2727-2778, 2003, ISSN: 08997667 (Simard, Saatchi and De Grandi 2000) Marc Simard, Sasan S.Saatchi, and Gianfranco De Grandi, The Use of Decision Tree and Multiscale Texture for Classification of JERS-1 SAR Data over Tropical Forest, IEEE Transactions on Geoscience and Remote Sensing, vol. 38, no. 5, pp. 2310-2321, 2000, ISSN: 01962892
(Singh 2002) Sameer Singh, "Estimating Classification Complexity," CiteSeerX, 2002, pp. 1-41, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.7109&rep=rep1&type=pdf
(Singh 2003) Sameer Singh, PRISM - A Novel Framework for Pattern Recognition, Pattern Analysis and Applications, vol. 6, no. 2, pp. 134-149, 2003, ISSN: 14337541 (Singh 2003[2]) Sameer Singh, "Multi-Resolution Estimates of Classification Complexity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1534-1539, 2003, ISSN: 01628828
(Sipser 2005) Michael Sipser, Introduction to the Theory of Computation, Second edition, Brooks/Cole, 2005, ISBN: 0534950973 (Sneath and Sokal 1973) Peter H.A.Sneath and Robert R.Sokal, Numerical Taxonomy: The Principles and Practice of Numerical Classification W.H.Freeman and Co. Ltd., 1973, ISBN: 0716706970 (Srivastava, Han, Kumar and Singh 1999) Anurag Srivastava, Eui-Hong Han, Vipin Kumar, and Vineet Singh, Parallel Formulations of Decision-Tree Classification Algorithms, Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 237-244, 1999, ISSN: 13845810 (Statnikov and al. 2005) Alexander Statnikov, Constantin F.Aliferis, Ioannis Tsamardinos, Douglas Hardin, and Shawn Levy, A Comprehensive Evaluation of Multicategory Classification Methods for Microarray Gene Expression Cancer Diagnosis, Bioinformatics, vol. 21, no. 5, pp. 631-643, 2005, ISSN: 13674803 (Stork, Duda and Hart 2001) David G.Stork, Richard O.Duda, and Peter E.Hart, Pattern Classification, Second edition, John Wiley and Sons Inc., New York, US, 2001, ISBN: 9755031030 (Sung and Niyogi 1995) Kah Kay Sung, Partha Niyogi, G.Tesauro (ed.), D.S.Touretzky (ed.), and T.K.Leen (ed.), Active Learning for Function Approximation, Advances in Neural Information Processing Systems 7: Proceedings of the 1994 Conference November 28-December 1, 1994, Denver, Colorado, US, vol. 7, pp. 593-600, 1995, ISBN: 0262201046 (Sussman 2000) Joseph M.Sussman, Introduction to Transportation Systems Artech House Publishers, 2000, ISBN: 1580531415 (Sussman 2002) Joseph M.Sussman, The New Transportation Faculty: The Evolution to Engineering Systems (paper). Introduction to Transportation Systems (book), 2002, Massachusetts Institute of Technology Engineering Systems Division Working Paper Series ESD-WP-2000-02 Ideas on Complexity in Systems - Twenty Views (Swain and King 1973) Philip H.Swain and Roger C.King, Two Effective Feature Selection Criteria for Multispectral Remote Sensing, The First International Joint Conference on Pattern Recognition, October 30 - November 1, 1973, Washington DC, US, pp. 536-540, 1973, ASIN: B000KIKTIA (Tan, Steinbach and Kumar 2005) Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining Addison Wesley, US, 2005, ISBN: 0321321367
(Tresp 2001) Volker Tresp, Y.H.Hu (ed.), and J-N Hwang (ed.), "Committee Machines," in Handbook of Neural Network Signal Processing CRC Press Inc., 2001, pp. 5-1-5-21, ISBN: 0849323592
(Theodoridis and Koutroumbas 2006) Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, Third edition, Academic Press Inc., 2006, ISBN: 0123695317
(Therrien 1989) Charles W.Therrien, Decision, Estimation and Classification John Wiley and Sons Inc., US, 1989, ISBN: 0471504165 (Thrun, Faloutsos, Mitchell and Wasserman 1999) Sebastian Thrun, Christos Faloutsos, Tom M.Mitchell, and Larry Wasserman, Automated Learning and Discovery: State-of-the-art and Research Topics in a Rapidly Growing Field, AI Magazine (An Official Publication of the American Association for Artificial Intelligence), vol. 20, no. 3, pp. 78-82, 1999, ISSN: 07384602 (Tu and Chung 1992) Pei-Lei Tu and Jen-Yao Chung, A New Decision-Tree Classification Algorithm for Machine Learning, Proceedings of the Fourth International Conference on Tools for Artificial Intelligence (TAI 92), pp. 370-377, 1992, ISBN: 0818629053
(Tumer and Ghosh 1995) Kagan Tumer and Joydeep Ghosh, "Classifier Combining: Analytical Results and Implications," Proceedings of the AAAI-96 Workshop on Integrating Multiple Learned Models for Improving and Scaling Machine Learning Algorithms at the 13th National Conference on Artificial Intelligence, August 1996, Portland, US, pp. 126-132, 1995, ISSN: 10821089
(Tumer and Ghosh 2003) Kagan Tumer and Joydeep Ghosh, "Bayes Error Rate Estimation Using Classifier Ensembles," International Journal of Smart Engineering System Design, vol. 5, no. 2, pp. 95-105, 2003, ISSN: 10255818 (Ueda 2000) Naonori Ueda, Optimal Linear Combination of Neural Networks for Improving Classification Performance, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 2, pp. 207-215, 2000, ISSN: 01628828 (Vapnik 1992) Vladimir N.Vapnik, J.E.Moody (ed.), S.J.Hanson (ed.), and R.P.Lippmann (ed.), Principles of Risk Minimization for Learning Theory, Advances in Neural Information Processing Systems, vol. 4, pp. 831-838, 1992, ISBN: 1558602224 (Vapnik 1998) Vladimir N.Vapnik, Statistical Learning Theory John Wiley and Sons Inc., New York, US, 1998, ISBN: 0471030031 (Voiry and al. 2007) Matthieu Voiry, Kurosh Madani, Veronique Amarger, Joel Bernier, F.Sandoval (ed.), A.Prieto (ed.), J.Cabestany (ed.), and M.Grana (ed.), Optical Devices Diagnosis by Neural Classifier Exploiting Invariant Data Representation and Dimensionality Reduction Ability, Lecture Notes in Computer Science / Theoretical Computer Science and General Issues, Computational and Ambient Intelligence: 9th International Work-conference on Artificial Neural Networks, IWANN 2007, June 20-22, 2007, San Sebastian, Spain, Proceedings, vol. 4507, pp. 1098-1105, 2007, ISBN: 3540730060 (Wang, Neskovic and Cooper 2006) Jigang Wang, Predrag Neskovic, and Leon N.Cooper, Learning Class Regions by Sphere Covering, 2006, IBNS Technical Report 2006-02, Department of Physics and Institute for Brain and Neural Systems, Brown University, Providence, RI 02912, supported under Grants DAAD19-01-1-0754, W911NF-04-1-0357
(Ward 1963) Joe H.Ward and C.Hildreth (ed.), Hierarchical Grouping to Optimize an Objective Function, Journal of the American Statistical Association, vol. 58, no. 301, pp. 236-244, 1963, ISSN: 01621459 (West 2008) John B.West, Respiratory Physiology: The Essentials, Eighth revised edition, Lippincott Williams and Wilkins, 2008, ISBN: 0781772060 (Wimsatt 1974) William C.Wimsatt, K.F.Schaffner (ed.), and R.S.Cohen (ed.), Complexity and Organization, in Proceedings of the 1972 Biennial Meeting of the Philosophy of Science Association D. Reidel Publishing Company, Dordrecht, The Netherlands, 1974, pp. 67-86, ISBN: 9027704090 (Wolfram 1994) Stephen Wolfram, Cellular Automata and Complexity: Collected Papers Perseus Books, 1994, ISBN: 0201626640 (Xu and Wunsch 2008) Rui Xu and Donald C.Wunsch II, "Recent Advances in Cluster Analysis," International Journal of Intelligent Computing and Cybernetics, vol. 1, no. 4, pp. 484-508, 2008, ISSN: 1756378X (Xu, Krzyzak and Suen 1992) Lei Xu, Adam Krzyzak, and Ching Y.Suen, Methods For Combining Multiple Classifiers And Their Applications To Handwriting Recognition, IEEE Transactions on Systems, Man, and Cybernetics, vol. 22, no. 3, pp. 418-435, 1992, ISSN: 00189472 (Yang, Parekh and Honava 1999) Jihoon Yang, Rajesh Parekh, and Vasant Honavar, DistAI: An Inter-pattern Distance-based Constructive Learning Algorithm, Intelligent Data Analysis, vol. 3, no. 1, pp. 55-73, 1999, ISSN: 1088467X (Young and Fu 1986) Tzay Y.Young and King-Sun Fu, Handbook of Pattern Recognition and Image Processing Academic Press Inc., Orlando, Florida, US, 1986, ISBN: 0127745602 (Zeng and Starzyk 2001) Yujing Zeng and Janusz Starzyk, Statistical Approach to Clustering In Pattern Recognition, System Theory, Proceedings of the 33rd Southeastern Symposium, pp. 177-181, 2001, ISBN: 0780366611 (Zhang 2000) Guoqiang P.Zhang, Neural Networks for Classification: A Survey, IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, vol. 30, no. 4, pp. 451-462, 2000, ISSN: 10946977 (Zhang 2006) Aidong Zhang, Advanced Analysis of Gene Expression Microarray Data, First edition, World Scientific Publishing Co. Pte. Ltd., 2006, ISBN: 9812566457 (Zhang, Chen and Kot 2000) Ping Zhang, Lihui Chen, and Alex C.Kot, A Novel Hybrid Classifier for Recognition of Handwritten Numerals, IEEE International Conference on Systems, Man and Cybernetics 2000, October 8-11, 2000, Sheraton Music City Hotel, Nashville, Tennessee, US, vol. 4, pp. 2709-2714, 2000, ISBN: 0780365836
(Zhang, Ramakrishnan and Livny 1996) Tian Zhang, Raghu Ramakrishnan, Miron Livny, H.V.Jagadish (ed.), and I.S.Mumick (ed.), BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, June 4-6, 1996, Montreal, Quebec, Canada, pp. 103-114, 1996, ISBN: 0897917944 (Zhao and Wu 1999) Mingsheng Zhao and Youshou Wu, Classification Complexity and Its Estimation Algorithm for Two-class Classification Problem, 1999 IEEE International Joint Conference on Neural Networks, vol. 3, pp. 1631-1634, 1999, ISBN: 0780355296
(Zhigulin 2004) Valentin P.Zhigulin, "Dynamical Motifs: Building Blocks of Complex Network Dynamics," Interjournal ICCS4, Proceedings of the Fourth International Conference on Complex Systems, May 16-21, 2004, Boston, US, 2004, ISSN: 10810625, http://arxiv.org/abs/cond-mat/0311330
(Zhou 1999) Weiyang Zhou and C.Ruf (ed.), Verification of The Nonparametric Characteristics of Backpropagation Neural Networks for Image Classification, IEEE Transactions on Geoscience and Remote Sensing, vol. 37, no. 2, pp. 771-779, 1999, ISSN: 01962892