Thèse
Présentée pour l’obtention du titre de
DOCTEUR DE L’UNIVERSITÉ PARIS-EST, Spécialité : Sciences Informatiques
par Ivan BUDNYK
Contribution à l’Étude et Implantation de Systèmes Intelligents Modulaires Auto-Organisateurs
Soutenue publiquement le 8 décembre 2009 devant la commission d’examen composée de :

Rapporteur : Prof. Gilles BERNARD, Université Paris 8
Rapporteur : Prof. Vladimir GOLOVKO, Brest State Technical University
Examinateur : Prof. Hichem MAAREF, Université d’Évry-Val d’Essonne
Examinateur : Dr. Abdennasser CHEBIRA, Université Paris-Est (Paris 12)
Invité : Dr. Semen GOROKHOVSKYI, Université nationale de Kiev-Mohyla-Académie
Directeur de thèse : Prof. Kurosh MADANI, Université Paris-Est (Paris 12)

LISSI – Laboratoire Images, Signaux et Systèmes Intelligents – EA 3956
Université Paris-Est (Paris 12), l’IUT Sénart-Fontainebleau, Département GEII
Bât A., Av. P. Point, F-77127 Lieusaint ; Tél : +33(0)164134486, Fax : +33(0)164134503, http://lissi.univ-paris12.fr
Thesis
Presented to obtain the degree of
DOCTOR OF THE UNIVERSITY PARIS-EST, Specialty: Computer Science
by Ivan BUDNYK
Contribution to the Study and Implementation of Intelligent Modular Self-organizing Systems
Defended publicly on 8 December 2009 before the examination committee composed of:

Rapporteur: Prof. Gilles BERNARD, University Paris 8
Rapporteur: Prof. Vladimir GOLOVKO, Brest State Technical University
Examiner: Prof. Hichem MAAREF, University of Évry-Val d’Essonne
Examiner: Dr. Abdennasser CHEBIRA, University Paris-Est (Paris 12)
Invited member: Dr. Semen GOROKHOVSKYI, National University of Kyiv-Mohyla Academy
Thesis supervisor: Prof. Kurosh MADANI, University Paris-Est (Paris 12)

LISSI – Laboratory of Images, Signals and Intelligent Systems – EA 3956
University Paris-Est (Paris 12), IUT Sénart-Fontainebleau, Department GEII
Bât A., Av. P. Point, F-77127 Lieusaint; Tel: +33(0)164134486, Fax: +33(0)164134503, http://lissi.univ-paris12.fr
Abstract
Classification problems deal with separating groups of objects into sets of smaller classes; this family of problems has received considerable attention in diverse engineering fields such as biomedical imaging, speaker identification, fingerprint recognition, etc. Several effective approaches to automated classification have been suggested based on artificial intelligence techniques, including neural networks. Still, one of the major challenges faced by these approaches is the large scale of data required for successful classification. In this thesis, we explore a possible solution to this problem based on the module-based Tree-like Divide to Simplify (T-DTS) classification model.
We focus on enhancing the key module of this approach, the complexity estimation module. Furthermore, we provide an automated procedure for optimizing the key complexity estimation parameters of the T-DTS model; this considerably improves usability and allows a more effective configuration of the decomposition reasoning of the approach. Another major contribution of this work is the further development of T-DTS modules that could be implemented on a parallel computer architecture, thereby allowing T-DTS to utilize the underlying hardware to the fullest extent.
Keywords: information processing, complex systems, artificial intelligence, modular artificial learning systems, classification, complexity estimation.
Résumé
Les problèmes de classification ont reçu une attention considérable dans différents champs d’ingénierie comme le traitement des images biomédicales, l’identification à partir de la voix, la reconnaissance d’empreintes digitales, etc. Les techniques d’intelligence artificielle, incluant les réseaux de neurones artificiels, permettent de traiter des problèmes de ce type. En particulier, les problèmes rencontrés nécessitent la manipulation de bases de données de tailles très importantes. Des structures de traitement adaptatives et exploitant des ensembles de classificateurs sont utilisées. Dans cette thèse, nous décrivons principalement le développement et les améliorations apportées à un outil de classification désigné par le terme Tree-like Divide to Simplify ou T-DTS. Nos efforts se sont portés sur l’un des modules de cet outil, le module d’estimation de complexité. L’architecture de l’outil T-DTS est très flexible et nécessite le choix d’un nombre important de paramètres. Afin de simplifier l’exploitation de T-DTS, nous avons conçu et développé une procédure automatique d’optimisation de l’un de ses plus importants paramètres, le seuil de décision associé à la mesure de complexité. La contribution principale de cette thèse concerne le développement de modules pouvant s’implanter sur une architecture de calcul matérielle parallèle. Ceci permet de se rapprocher d’une implantation purement matérielle de l’outil T-DTS.
Mots clés : traitement de l’information, systèmes complexes, intelligence artificielle, systèmes d’apprentissage artificiels modulaires, classification, estimation de la complexité.
Acknowledgement
I would like to express my deepest gratitude to Prof. Kurosh Madani for his
continuous support and patience. His help and guidance throughout my studies at
Laboratoire Images Signaux et Systèmes Intelligents (LISSI / EA 3956) were invaluable.
His experience was truly of great benefit to me and his advice in academia and otherwise
has undoubtedly left a mark on me.
I would also like to thank Dr. Abdennasser Chebira for his valuable input and
meticulous efforts in reviewing much of this research, as well as his patience while I was
preparing this thesis.
I would also like to thank my thesis committee members Prof. Gilles Bernard, Prof.
Vladimir Golovko and Prof. Hichem Maaref for the useful comments, suggestions, and
the references.
I am grateful to the faculty members of l’IUT de Sénart including Dr. Veronique
Amarger, Dr. Christophe Sabourin and Dr. Amine Chohra for their assistance.
My deepest thanks also go to the members and ex-members of LISSI including Dr.
Nadia Kanaoui, Dr. El Khier Sofiane Bouyoucef, Dr. Mariusz Rybnik, Dr. Lamine Thiaw,
Dr. Matthieu Voiry, Weiwei Yu, Dalel Kanzari, Ting Wang, Dominik Maximilián Ramík
and Arash Bahrammirzaee.
I also thank France’s leading international exchange operator ÉGIDE, which gave me the
initial opportunity to pursue research in France.
Numerous other people and lifelong friends with whom I have worked and lived in
France, the Czech Republic and Ukraine deserve many thanks as well, especially Dr. Maksym
Petrenko and Dr. Semen Gorokhovskyi.
Finally, I would like to thank my parents, Georgiy and Ganna Budnyk, because without
them I would not be the person I am today.
Table of contents
List of tables............................................................................................................................v
List of figures........................................................................................................................ vi
Index of symbols ....................................................................................................................x
General introduction ...............................................................................................................1
I State-of-the-art of classification approaches ........................................................................6
I.1 Concepts of classification............................................................................................7
I.2 Clustering methods ......................................................................................................8
I.3 Main classification methods ......................................................................................17
I.4 T-DTS (Tree-like Divide To Simplify) approach ......................................................37
I.5 Conclusion .................................................................................................................42
II Complexity concepts .........................................................................................................44
II.1 Introduction to complexity concepts ........................................................................44
II.2 Computational complexity measurement.................................................................48
II.3 ANN-structure based classification complexity estimator.......................................75
II.4 Conclusion................................................................................................................85
III T-DTS software architecture............................................................................................86
III.1 T-DTS concept: self-tuning procedure ...................................................................94
III.2 T-DTS software architecture and realization........................................................101
III.3 Conclusion ............................................................................................................118
IV Validation aspects..........................................................................................................120
IV.1 ANN-structure based complexity estimators........................................................120
IV.1.1 Hardware-based validation .........................................................................121
IV.1.2 Software-based validation ...........................................................................133
IV.1.3 Summary......................................................................................................149
IV.2 T-DTS...................................................................................................................152
IV.2.1 ANN-structure based complexity estimator validation ...............................152
IV.2.2 T-DTS self-tuning procedure validation......................................................160
IV.2.3 Summary......................................................................................................172
IV.3 Conclusion ............................................................................................................173
General conclusion and perspectives ..................................................................................175
Appendixes .........................................................................................................................179
A List of publications ....................................................................................................180
B Approaches to defining complexity ..........................................................................181
B.1 Defining complexity. Genesis of the concept complexity ...............................182
B.2 System’s attributes and complexity..................................................................183
C Neural Networks in hardware ...................................................................................187
C.1 IBM© ZISC®-036 Neurocomputer .................................................................189
C.2 System’s attributes and complexity..................................................................193
Bibliography .......................................................................................................................197
List of tables
II.1 Advantages and disadvantages of complexity estimating techniques............................72
IV.1 Benchmarks complexity rates obtained using IBM© ZISC®-036 implementation of
ANN structure based complexity estimator........................................................................123
IV.2 Complexity rates obtained for Splice-junction DNA classification problem (original
database) using IBM© ZISC®-036 Neurocomputer ..........................................................132
IV.3 Complexity rates obtained for Splice-junction DNA classification problem (re-
encoded database) using ANN-structure based and other applications ..............................147
IV.4 Complexity rates obtained for Tic-tac-toe endgame classification problem using
sixteen classification complexity criteria including ANN-structure based complexity
estimator..............................................................................................................................148
IV.5 Classification results: Four spiral benchmark, two classes, generalization database
size 500 prototypes, learning database size 500 prototypes ...............................................163
IV.6 Classification results: Tic-tac-toe endgame classification problem ...........................167
IV.7 Classification results: Splice-junction DNA sequences classification problem, three
classes, generalization and learning database size 1595 prototypes ...................................170
IV.8 Consolidation of classification results: Splice-junction DNA sequences classification
problem ...............................................................................................................................171
List of figures
I.1 SVM: space mapping using linear hyperplane ................................................................21
I.2 SVM: space mapping using different space kernel functions Φ .....................................22
I.3 General block scheme diagram of the T-DTS structure constructing .............................40
II.1 Description and interpretation process...........................................................................52
II.2 Retained adherence subset for two classes near the boundary.......................................69
II.3 Taxonomy of classification complexity (separability) measures...................................71
II.4 Examples of Voronoy polyhedron for 2D and 3D classification problems ...................77
II.5 Q(m) indicator function behaviour.................................................................................78
III.1 Diagram of T-DTS implementation for classification tasks .........................................87
III.2 Scheme of T-DTS learning concept..............................................................................90
III.3 T-DTS operating ...........................................................................................................90
III.4 An example of maximal possible decomposition tree ..................................................95
III.5 An example of distribution of the clusters’ number over [Amin; Amax] complexity interval .................................................................................................................................97
III.6 T-DTS software architecture.......................................................................................102
III.7 Principal T-DTS v. 2.50 Matlab software architecture...............................................104
III.8 Detailed T-DTS v. 2.50 Matlab software architecture ...............................................107
III.9 Matlab T-DTS software realization v. 2.50, Control panel ........................................108
III.10 T-DTS GUI: Results : 2 stripe-like benchmark (576 prototypes).............................113
III.11 GUI of T-DTS, decomposition clusters’ chart..........................................................114
III.12 GUI of T-DTS, decomposition tree charts................................................................115
III.13 GUI of T-DTS, decomposition tree charts in 3D......................................................115
III.14 Menu of T-DTS, Configuration ................................................................................116
III.15 Menu of T-DTS, Set Constant ..................................................................................117
III.16 Menu of T-DTS, Set EC Options ..............................................................................117
III.17 Menu of T-DTS, Analysis .........................................................................................117
IV.1 Stripe classification benchmarks ...............................................................................122
IV.2 Stripe classification benchmarks : Qi(m) behavior versus learning database size m, LSUP ZISC®-036 mode.....................................................................................................124
IV.3 Stripe classification benchmarks : Qi(m) behavior versus learning database size m, L1 ZISC®-036 mode................................................................................................................124
IV.4 Benchmarks’ classification rates behavior versus learning database size m, LSUP ZISC®-036 mode................................................................................................................126
IV.5 Benchmarks’ classification rates behavior versus learning database size m, L1 ZISC®-036 mode ............................................................................................................................126
IV.6 Qk(m) evaluation for DNA splice-junction classification problem using different ZISC®-036 k-MIF parameters (k: 55,56, 4096) for: Q55(m), Q56(m), Q4096(m), mk – corresponds to calculated m0 for each k-curve....................................................................129
IV.7 Quality check of RCE-kNN-like Voronoy polyhedron construction based on its generalization ability performed for k-MIF parameter k=55..............................................129
IV.8 Quality check of RCE-kNN-like Voronoy polyhedron construction based on its generalization ability performed for k-MIF parameter k=56..............................................131
IV.9 Quality check of RCE-kNN-like Voronoy polyhedron construction based on its generalization ability performed for k-MIF parameter k=4096..........................................131
IV.10 Square classification benchmarks, 2 classes, 2000 prototypes.................................134
IV.11 ANN-structure based complexity estimator evaluation for: Square benchmarks, 2 classes, 2000 prototypes, MIF = 1024, 3 distance modes...................................................135
IV.12 ANN-structure based complexity estimator evaluation for: 8 Stripe benchmark, 2 classes, 2000 prototypes, LSUP distance mode..................................................................136
IV.13 ANN-structure based complexity estimator evaluation for: 8 Stripe benchmark, 2 classes (4&4 stripes), LSUP distance mode .......................................................................137
IV.14 Grid classification benchmarks in D1 D2 and D3 dimension ..................................138
IV.15 ANN-structure based complexity estimator evaluation for: Grid benchmark, 2 classes, EUCL distance mode .............................................................................................139
IV.16 ANN-structure based complexity estimator evaluation for: Grid and 8-stripe-benchmarks, 2 classes, EUCL distance mode, MIF=1024 .................................................140
IV.17 Four spiral classification benchmarks, 2 classes, 2000 prototypes ..........................142
IV.18 ANN-structure based complexity estimator evaluation for: 4 Spiral benchmark, 2 classes, EUCL distance mode .............................................................................................142
IV.19 Six classification benchmarks [from left to right, from top to down: 2 Stripes (2ST), 2 Grids (2GR), 2 Squares (2SQ), 2 Sinusoids (2SN), 2 Spirals (2SP), 2 Circles (2CR) with small overlapping zone]......................................................................................................143
IV.20 ANN-structure based complexity estimator evaluation for: 6 classification benchmarks, 2 classes, 2000 prototypes, MIF=1024..........................................................144
IV.21 Validation Matlab implementation of ANN-structure based complexity estimator embedded into T-DTS framework: 2 Stripe benchmark, 2 classes, generalization database size 1000 prototypes, learning database size 1000 prototypes, DU – CNN, PU – LVQ1..153
IV.22 Validation ANN-structure based complexity estimator embedded into T-DTS framework: 10 Stripe benchmark, 2 classes, generalization database size 1600 prototypes, learning database size 400 prototypes, DU – CNN, PU – LVQ1.......................................154
IV.23 Validation Matlab implementation of ANN-structure based complexity estimator embedded into T-DTS framework: Tic-tac-toe endgame problem, 2 classes, generalization database size 479 prototypes, learning database size 479 prototypes, DU – CNN, PU – MLP_FF_GDM...................................................................................................................157
IV.24 Validation ANN-structure based complexity estimator embedded into T-DTS framework: Tic-tac-toe endgame problem, 2 classes, generalization database size 766 prototypes, learning database size 192 prototypes, DU – CNN, PU – MLP_FF_GDM ....158
IV.25 Validation ANN-structure based complexity estimator embedded into T-DTS framework: Splice-junction DNA sequence classification problem, 3 classes, generalization database size 1520 prototypes, learning database size 380 prototypes, DU – CNN, PU – MLP_FF_GDM...................................................................................................................158
IV.26 Validation T-DTS self-tuning threshold procedure, Average learning rate (including its corridor of the standard deviations) as a function of θ - threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator................................160
IV.27 Validation T-DTS self-tuning threshold procedure, Average generalization rate (including its corridor of the standard deviations) as a function of θ-threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator ......161
IV.28 Validation T-DTS self-tuning threshold procedure, Average clusters’ number as a function of θ-threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator.................................................................................................161
IV.29 Validation T-DTS self-tuning threshold procedure, Performance estimating function P(θ): 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator..............................................................................................................................162
IV.30 Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: 4 Spiral benchmark, 2 classes, learning database size 500 prototypes, DU – CNN, Collective entropy based complexity estimator ...................................................................................163
IV.31 Validation T-DTS self-tuning threshold procedure: 10 stripe benchmark, 2 classes, generalization database size 1000 prototypes, learning database size 1000 prototypes, DU – CNN, PU – LVQ1, 4 complexity estimators ......................................................................164
IV.32 Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Tic-tac-toe endgame problem, 2 classes, DU – CNN, Collective entropy complexity estimator..............................................................................................................................166
IV.33 Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Tic-tac-toe endgame problem, 2 classes, DU – CNN, ANN-structure based complexity estimator..............................................................................................................................166
IV.34 Validation T-DTS self-tuning threshold procedure, Splice-junction DNA sequences classification problem, 3 classes, generalization database size 1520 prototypes, learning database size 380 prototypes, DU – CNN, PU – MLP_FF_GDM, 3 complexity estimators.............................................................................................................................................168
IV.35 Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size 1595 prototypes, DU – CNN, Purity PRISM based complexity estimator.........................170
IV.36 Validation T-DTS self-tuning threshold procedure, Clusters’ number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size 1595 prototypes, DU – CNN, Fukunaga’s interclass distance measure J1 based complexity estimator..............................................................................................................................170
B.1 Genesis of the complexity ............................................................................................181
C.1 General block level architecture representation of a neurochip or a neurocomputer processing elements ............................................................................................................188
C.2 IBM© ZISC®-036 PC-486 ISA bus based block diagram..........................................190
C.3 Schematic drawing of a single IBM© ZISC®-036 processing element - neuron........191
C.4 IBM© ZISC®-036 chip’s block diagram ....................................................................191
C.5 Hardware realization: IBM© ZISC®-036 PCI board ..................................................192
C.6 CM-1K Neural network chip........................................................................................193
C.7 Network of CM-1K chips, or 6144 neurons in parallel ................................................194
C.8 CM-1K chip’s functional diagram. Inner architecture .................................................194
Index of symbols
Symbol Signification
s / s input vector in the feature space
S set of vectors of the feature space
c element of a concept class
C set of concepts
i, j, l, r indexes
dim feature space dimension
d() distance function between two vectors
dE Euclidean distance
dWE weighted Euclidean distance
W vector of weights for dWE
dMN Minkovsky distance
L coefficient of dMN
dMH Manhattan distance, City-block distance or L1 distance
dCH Chebyshev distance
dML Mahalanobis distance
dCN Canberra distance
dCS Cosine distance
dTN Tanimoto distance
m total number of data items / instances in S
Df() distance function
Gi i-th cluster / group
s̄i mean vector of cluster Gi
M() / E() mean / math expectation
ESS() error sum-of-square of a cluster
k clusters / groups number, number of samples
εs square error
af() affinity function
SM matrix of similarity
θ threshold / (threshold value of: affinity function or complexity or so forth)
g() hyperplane function
N number of classifiers
xi
p(), q() probability distribution functions
µ length of the segment of code / program
x, y discrete variables
H() entropy measure
I() mutual information measure
Λ() likelihood ratio
KL() Kullback–Leibler divergence, information gain or relative entropy
SH overall degree of separability
Dλ() Λ-divergence
JSD() Jensen-Shannon divergence
HD() Hellinger distance measure
JMD() Jeffreys-Matusita distance measure
BD() Bhattacharyya distance measure
ε Bayes error
Г(s) local region around instance s
v(s) volume of the region Г(s)
p̂() density estimating function
ε̂() estimating function / equation for the Bayes error ε
NMD() normalized mean distance
Sw/Sb/Sm scatter matrices
J1/J2/J3/J4 Fukunaga’s four criteria of scatter matrices
B resolution parameter used for the creation of hyper-cuboids
ηij number of data instances of class i in box j
SGj PRISM’s separability parameter for each cluster (box) Gj
SG PRISM’s overall purity for all clusters (boxes) G
SG.norm normalized SG
HPG entropy measure of each G-PRISM’s box
HPG.norm collective entropy measure
FMD() Fisher linear discriminant (ratio) based measure
Ω Chaitin’s constant
n number of neurons in RBF-Net / parameter that reflects complexity
ĝ(.) classification complexity estimation function
Q() indicator of classification complexity
ℜ^i input feature space of dimension i, to distinguish from S
Ψ(t) model’s input vector(s), ψ(t) ∈ ℜ^i
nΨ dimension of input vectors Ψ(t)
ui output decision variable for NNMi , i – index
nu dimension of a vector ui
U linear combiner of outputs ui
F(.) transfer function ℜ^i → ℜ^j, where i and j are dimension indexes
CU[.] control unit output of a decision function
b set of parameters of CU
ξ set of condition of CU
Wi centroid of Gi cluster
z iteration number required to get statistics of T-DTS output
Ai complexity ratio of the sub-database denoted by index i
[Amin;Amax] interval of θ-threshold ratio variation
α T-DTS parameter responsible for constriction of [Amin;Amax]
[Bmin;Bmax] sub-interval of interval [Amin;Amax]
Gr generalization (testing) rate, expressed in %
Lr learning rate, expressed in %
SdGr standard deviation of generalization (testing) rate, expressed in %
SdLr standard deviation of learning rate, expressed in %
Tp T-DTS processing time, expressed in seconds
NTp processing unit execution time applied to non-decomposed database
SdTp standard deviation of T-DTS processing time, expressed in seconds
SdNTp standard deviation of processing unit executing time, expressed in seconds
P(.) aggregation function of T-DTS performance
b1/b2/b3 coefficients of priorities
h number of optimization iterations, equivalent to the precision of the quasi-optimal threshold
General introduction
Fundamental efforts in understanding the nature of intelligence and its realization in
human minds have been of growing interest in various research communities, including
education, cognitive science, computer science, neuroscience and engineering. The field
of Artificial Intelligence (AI) was formed specifically to facilitate these efforts and to
solve a set of associated problems, ranging from Pattern recognition to Artificial life.
This thesis addresses Machine learning (ML) (Mitchell, Anderson, Carbonel and
Michalski 1986), (Mitchell 1997), the discipline of AI that deals with the design and
development of learning algorithms capable of recognizing complex patterns and making
intelligent decisions based on the information represented by a database of classification objects.
Traditionally, Machine learning tools occupied a specific niche of commercial and self-improving
software applications; however, recent advances in the AI field have brought ML
into the mainstream (Thrun, Faloutsos, Mitchell and Wasserman 1999). For example,
Machine learning is at the foundation of many data-mining techniques that have become
a necessity for coping with the ever-growing volumes of available on-line and off-line
data (Thrun, Faloutsos, Mitchell and Wasserman 1999). Our motivation is that better
ML techniques, particularly better algorithms based on fundamental statistical-computational
theories of learning processes, could greatly benefit and facilitate the further
development of ML-based applications.
Motivation and objectives
In this thesis, I explore the direction dealing with data stores that use classification and
clustering methods (Fielding 2007). Being extensively used in data mining applications
(Bellman 1961), these methods were shown not to perform well with huge data sets
(Dong 2003). In this research, I study Machine learning methods that can help to alleviate
this caveat and, as suggested by the No Free Lunch Theorem (Stork, Duda and Hart
2001), are at least as effective as other available classification approaches. I am
particularly interested in making Machine learning classification tools adaptable, self-
adjusting and capable of employing an intelligent human-like view on classification
problems in massive databases. I also aim at streamlining the use of Machine learning
techniques by providing an automated method for resolving parameters of these
techniques while tuning them for efficiency and effectiveness.
Of the available Machine learning techniques, I focus on the Hybrid Multi Neural
Networks (HMNN) technique proposed in (Madani and Chebira 2000), (Madani, Rybnik and
Chebira 2003) and developed by Rybnik (Rybnik 2004) and Bouyoucef (Bouyoucef, Chebira,
Rybnik and Madani 2005), (Bouyoucef 2007). Since this technique is based on the design
paradigms of “divide and conquer” and “reduce complexity through decomposition”, in
this thesis I investigate the issue of complexity; in this context, complexity means the difficulty
of a classification task regardless of the data set size.
Further, I study how the data-driven tree-like construction approach for ensembles of
neural networks can be adopted for data organization purposes. I argue that this
structure provides the best control of the generalization performance of the technique in
the typical setting of classification problems, i.e., when little reliable prior information
about the statistical law underlying a classification problem is available. I adopted and
enhanced the data-driven HMNN structure constructor concept (Madani and Chebira
2000) named Tree-like Divide to Simplify (T-DTS), developed in the works (Madani,
Chebira and Mercier 1997) - (Madani, Rybnik and Chebira 2003) and designed for
classification problems based on the "divide and conquer" strategy. The objectives of this
thesis in studying T-DTS can be summarized as follows:
1. As it has been shown (Rybnik 2004) that the T-DTS algorithmic approach converges quickly
and that its time complexity (computational complexity) depends linearly on the number
of training samples and the dimension of the feature vector, my central target is T-DTS
development and enhancement.
2. As a complexity estimator is at the heart of the T-DTS engine, I aim at developing a novel
classification complexity estimator based on an ad hoc method that takes into account the
classification database complexity issue and extracts information about its
“classifyability” using information provided by a Neural Network (NN) structure.
Neural Networks are especially suitable for this problem, as the data come from a
complex environment and can be incomplete, heterogeneous, or have other characteristics
that make statistically based classification complexity estimation not only very difficult,
but sometimes impossible or incorrect (Bouyoucef 2007).
3. Update T-DTS framework with the new complexity estimating techniques, the
Neural Networks based decomposition, and the processing methods.
4. Provide an automated self-adjusting procedure for effective resolving of the key
parameters of the approach.
5. Perform verification, analyze the results and outline possible perspectives.
The following section represents my contribution to this research.
Contribution
The main purpose of this research is to improve the T-DTS technique and its
implementation. I expect that the results of this thesis will not only increase T-DTS
performance in classification tasks, but also answer the question of how and when to apply
the T-DTS approach to a classification problem represented by a particular database.
Based on the validation results, I expect that the proposed ANN-based classification
complexity estimator will enhance classifier adjustment regardless of the classification
concept or paradigm.
In summary, this thesis makes the following specific contributions:
1. I have implemented a component-independent T-DTS platform for classification
tasks. Each component of the platform (i.e., decomposition method, processing method,
or complexity estimation technique) can be adjusted and verified independently of the
main platform and then successfully integrated into T-DTS.
2. I have proposed an Artificial Neural Network (ANN) structure-based classification
complexity estimator that can theoretically be used for the adjustment of other advanced
classifiers such as T-DTS; I have shown, using benchmarks and two real-world
classification problems, the effectiveness of the proposed estimator in the case of the T-DTS
classifier.
3. I have proposed a self-adjusting T-DTS procedure that semi-automatically
determines the key parameter of T-DTS, the complexity threshold, at which T-DTS
reaches its quasi-optimal performance, and that produces a range (not only one) of
satisfactory results.
Most importantly, using the analysis provided by this semi-automated procedure,
one can reason about why the T-DTS approach may not be successful for a selected
decomposition and processing unit. In addition, taking advantage of the low-priced
high-performance hardware that has become available in recent years and has become
the de facto platform of choice for large-scale learning algorithms, I have prepared
the way for a multi-processor implementation of the T-DTS approach that would allow a user
to further improve T-DTS performance and its usability for large-scale data.
In conclusion to my contributions, I should mention that in this research I focused on
the preconditions and control issues of T-DTS operation. The findings of this research can
help industrial as well as academic researchers in designing efficient and robust
intelligent classifiers capable of handling complex classification problems, even those imposed
by huge databases.
Thesis organization
The thesis is organized in four chapters. The main contributions of the thesis are
discussed in Chapter I, Chapter II, Chapter III and Chapter IV; these chapters are intended
to be self-contained and can be read by an experienced reader without considering other
chapters.
In Chapter I, I review major classification techniques and learning algorithms
available in the literature. The main goal of this chapter is to provide the reader with the
background necessary to understand the details of, and additional motivation for, the
T-DTS approach; the problems that arise when the reviewed classification methods are
applied to T-DTS-type tasks are discussed in the subsequent sections of this Chapter I.
This is an additional reason why I provide a brief overview and comparison of traditional
classification techniques based on their speed, accuracy and memory requirements on
large and complex data sets; I also discuss the limitations of these techniques and argue for a more
advanced approach. In addition, I discuss basic optimization techniques and key issues in
the disciplines of numerical analysis, computer architecture, operating systems and
parallel computing that are necessary for designing a high-performance implementation
and industrial adaptation of the suggested learning-based algorithm. The chapter
concludes with a theoretical overview of our proposed learning framework named T-DTS,
which is based on building a tree-like classifier by means of the "divide and conquer" design
principle and a unique ensemble method of complexity estimation. I also highlight the key
role of the complexity estimating technique, which is the core of the T-DTS approach.
Chapter II is fully dedicated to complexity concepts and to a broad range of complexity
measurement approaches. It also provides a detailed description of my novel Artificial
Neural Network structure (ANN-structure) based complexity estimation approach, which is
used for defining the complexity of classification tasks.
In Chapter III, I highlight important T-DTS implementation aspects. I introduce a
self-tuning threshold procedure enhancement of the basic T-DTS learning framework that
is based on the analysis of the maximal possible database decomposition. This procedure
automatically determines an appropriate complexity threshold required by our approach.
Moreover, it provides a range of acceptable alternative solutions for industrial T-DTS
applications.
Chapter IV is dedicated to the validation of the proposed approach. In the first part of
this chapter, we provide the results of RCE-kNN-like based complexity estimator (an
implementation of ANN structure based complexity estimator) verification, both as a
stand-alone technique and as a part of T-DTS. The second part of Chapter IV overviews
the evaluation of my proposed T-DTS threshold self-tuning procedure and outlines the
basics of an effective hardware implementation of the approach.
Finally, the last part of the thesis contains consolidated conclusions of my work and
defines further perspectives for T-DTS development.
Chapter I:
State-of-the-art of classification approaches
We start this chapter by providing a brief cross-field overview of the major statistical
classification methods. We briefly review the advantages of these methods and their
shortcomings in terms of accuracy and computational cost; our aim is to highlight the
importance of database decomposition based techniques within the frame of Statistical
classification. The review surveys not only traditional methods such as the k-Nearest
Neighbour-based method (Cover and Hart 1967), but also the modern ensemble classifiers
such as Bagging (Breiman 1996) and Boosting (Freund and Schapire 1996), and advanced
classifiers such as Support Vector Machines (SVM) (Vapnik 1998), (Serfling 2002). This
review is crucial in selecting and designing an appropriate classifier for our approach. We
conclude the chapter by introducing computational tools and techniques in order to
describe the principal design of our framework named T-DTS (Tree-like Divide to Simplify)
and also to define the place of T-DTS among the existing classification approaches. The
subsections of the chapter are organized as follows.
In Section I.1, we provide a short introduction to the theory of classification. Section
I.2 introduces clustering techniques and methods. Section I.3 presents an overview of the
main classification techniques. Section I.4 describes general concepts of our T-DTS
approach and introduces our modular neural tree structure construction approach as
applied to classification tasks initially proposed by (Madani and Chebira 2000), and
subsequently developed by Rybnik (Rybnik 2004) and Bouyoucef (Bouyoucef 2007).
Section I.5 concludes.
I.1 Concepts of classification
Categorization (classification) is a process where patterns or objects are recognized,
differentiated and associated with different classes. Categorization is fundamental for a
very large spectrum of problem-solving tasks. Among the many available categorization
theories and techniques, we can distinguish three general approaches to categorization:
Classical categorization, Conceptual clustering and Prototype theory (Mitchell,
Anderson, Carbonel and Michalski 1986).
Categorization is based on the criterion that all entities in a single category share one or
more common properties (Jurgen 2004). Categorization (classification) originates from
Plato, who in his “Statesman” 1 (Plato 1997) introduces the approach for grouping objects
based on their similar properties. In this way, any entity of a given classification universe
belongs unequivocally to one and only one class in the classification. Conceptual
clustering (Jurgen 2004) is a modern version of Classical categorization, where the
selection criteria are based not on common properties, but rather on shared concepts
among different entities. Prototype theory uses an entity of a category as a prototypical
object, and the decision whether another object also fits into this category depends on the
degree of overlap between the objects (Jurgen 2004). At a glance, the prototype theory
might seem related to machine learning; however, it is in fact a subjective cognitive
approach that might manipulate natural categories (Goldstone and Kersten 2003).

1 Dialogue between the Stranger and Younger Socrates (“Statesman” [261a-261e]), (Plato 1997): Plato, J.M. Cooper (ed.), D.C. Hutchinson (ed.), “Plato Complete Works”, Hackett Publishing Company Inc., 1997, ISBN: 0872203492.
In our thesis, we deal with a subset of Conceptual clustering theory called Statistical
classification, typically used in pattern recognition systems. Generally, this type of
classification may be described in terms of data instances: $s \in S$, where $s$ is a single
vector of a feature space $S$. Each problem instance $s$ belongs to one element $c$ of a concept
class $C$: $c \in C$. The mapping from $S$ to $C$ is represented by the target concept $c$ belonging to
the set of concepts $C$ (that is, the concept space). Thus, the goal of an automated classification
system is to predict to which class $c$ an arbitrary instance $s$ belongs (Butz 2001).
The general classification process involves two main steps: learning, where training
data are analyzed and a model is built, and classification (testing or generalization), where
the built model is used to predict to which class any given instance belongs (Han and
Kamber 2006).
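To make these two steps concrete, the following minimal Python sketch (an illustration only, not part of T-DTS; the class name and toy data are hypothetical) learns one centroid per class from labelled feature vectors and then generalizes to unseen instances.

```python
import numpy as np

class NearestCentroidClassifier:
    """Minimal two-step classifier: learning, then classification (generalization)."""

    def fit(self, S, c):
        # Learning step: analyze the training data and build the model,
        # here simply one centroid per class.
        self.classes_ = np.unique(c)
        self.centroids_ = np.array([S[c == k].mean(axis=0) for k in self.classes_])
        return self

    def predict(self, S_new):
        # Classification (generalization) step: predict to which class an
        # arbitrary instance belongs, using the closest centroid.
        d = np.linalg.norm(S_new[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[np.argmin(d, axis=1)]

# Hypothetical toy data: six instances s of a 2-D feature space S, two classes c.
S = np.array([[0.1, 0.2], [0.0, 0.3], [0.2, 0.1],
              [1.0, 1.1], [0.9, 1.2], [1.1, 0.9]])
c = np.array([0, 0, 0, 1, 1, 1])

model = NearestCentroidClassifier().fit(S, c)
print(model.predict(np.array([[0.15, 0.15], [1.05, 1.0]])))  # expected: [0 1]
```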
From the theoretical point of view, when information about training/generalization
data is given, the prime interest is the maximization of the accuracy rate. Other important
factors to be considered are the following: processing speed, recognition rate and other
computational resources cost factors (Gao, Foster, Mobus and Moschytz 2001).
According to the “No free lunch” theorem (Stork, Duda and Hart 2001), there is no
single learning-based approach that works best on all classification problems;
consequently, there is a wide variety of learning systems/approaches/concepts, each optimized
to provide, through a selection/extraction process, the most successful and predictive model
for the given input data and classification task (Han and Kamber 2006).
This is therefore a strong reason to continue the development of various categorization
algorithms. The existing categorization techniques fall under two general groups (Han and
Karypis 2000). The first one, described in Section I.3, contains traditional machine
learning algorithms that have been developed over the years. The second group,
described in the following Section I.2, contains specialized categorization
algorithms developed in the Information Retrieval community.
I.2 Clustering methods
Data clustering (not to be confused with the Conceptual clustering paradigm (Brito,
Bertrand, Cucumel and De Carvalho 2007)) or clustering analysis (first proposed by
Tryon in 1939) encompasses a number of different algorithms and methods for grouping
objects of a similar kind into respective categories. Clustering is often confused with
classification, but there is an important difference between the two: in classification, the
objects are assigned to pre-defined classes, whereas in clustering the classes have to be
discovered from the data (Michalski and Stepp 1987). Thus, one can assign clustering to
the category of unsupervised-learning-based classification methods. Furthermore, the cluster
analysis (clustering) group of methods is part of an important data mining research area
(Agrarwal and Yu 1999). Clustering is important for data mining because, as a technique
in which logically similar information is physically stored together, clustering analysis
increases the efficiency of database management systems: objects with similar properties
are placed in a single class of objects, and a single disk access makes the entire class
available. The existing approaches to data clustering include the statistical approach (e.g.,
the k-MEANS algorithm), the optimization approach (e.g., the branch and bound method),
simulated annealing techniques and the neural network approach (Zeng and Starzyk 2001).
Since the data mining process involves extracting or uncovering patterns from data
(Kantardzic 2002) and grouping records in accordance with these patterns, a subset of
data-mining (Fayyad, Piatetsky-Shapiro and Smyth 1996) methods is directly related to data
clustering, and data-mining methods can therefore be used as classification methods.
Let us note that most clustering is largely based on heuristic but intuitively reasonable
procedures, and most clustering methods for solving important practical questions are also
of this type. However, there is little semantic guidance associated with these methods
(Fraley and Raftery 2002).
In conclusion, we highlight that clustering analysis is a method that can be
applied to pattern classification (Fayyad, Piatetsky-Shapiro and Smyth 1996) and has
numerous practical applications in diverse engineering fields (Mitchell 1997),
(Leondes 1998) such as oil exploration, biomedical imaging, speaker identification,
automated data entry, parameter identification of dynamical systems (Rao and Yadaiah
2005), fingerprint recognition, evaluation of the fetal state as carried out by obstetricians,
multi-path propagation channel conditions, etc. (Abdelwahab 2004); for this reason, we
perform a more detailed overview of the main clustering methods in the following section.
I.2.1 Type of clustering
There are many approaches to the categorization of clustering methods. Depending on
the selected criterion, one might find it important to group the methods into two big categories
such as hierarchical and non-hierarchical, or parametric and non-parametric. However,
the most commonly used structure of the clustering methods includes the following four
categories (Jesse, Liu, Smart and Brown 2008):
1. Hierarchical methods (Section I.2.1.1)
2. Partitioning methods (Section I.2.1.2) determine the clusters at once, but can also
be viewed as divisive hierarchical methods
3. Density-based clustering methods (Section I.2.1.3) are employed to discover
arbitrary-shaped clusters, where each cluster is regarded as a region in which the density
of data objects exceeds a threshold
4. Two-way clustering, co-clustering or bi-clustering are the algorithms where not
only the objects are clustered, but also the features of the objects, i.e., if the data is
represented in a data matrix, the rows and columns are clustered simultaneously.
Section I.2.1.1 - Section I.2.1.3 provide a detailed overview of clustering methods in
groups 1-3, because these methods are closely related to the combination of clustering
approaches embedded into T-DTS.
I.2.1.1 Hierarchical methods
Hierarchical methods belong to the iterative type of procedures (Xu and Wunsch
2008) in which m data instances are partitioned into groups that may vary from a single
cluster containing all m data instances to m clusters each containing a single instance; a
proper definition may be found in (Arabie 1994). Hence, the key point of the
hierarchical clustering methods is the decomposition algorithm over the given data set.
Based on the features of the algorithm, the most commonly used taxonomy of hierarchical
methods (Jain, Murty and Flynn 1999) includes the following five sub-categories.
I. The first sub-category contains Agglomerative vs. Divisive approaches, which relate
to the algorithmic structure(s) and operation(s). An agglomerative approach begins
with each pattern in a distinct cluster and merges clusters together until a
stopping criterion is satisfied. A divisive approach proceeds top-down, placing all the
data in one cluster and then successively splitting clusters until a stopping criterion
is satisfied. The T-DTS concept belongs to this sub-category, and more details on this
technique are provided in Chapter III; a schematic Python sketch of hierarchical clustering is given after this list.
II. The second sub-category consists of Monothetic vs. Polythetic approaches: a
monothetic approach considers the features sequentially (one at a time), whereas a
polythetic approach uses all features simultaneously.
III. The third sub-category includes Deterministic (Hard) vs. Fuzzy approaches. The
first allocates each instance-pattern to a single cluster, while the second may
assign it to different clusters at the same time.
IV. The fourth sub-category deals with Deterministic vs. Stochastic approaches. They
are designed to optimize a square-error function using either a traditional deterministic
technique or a random search.
V. The fifth sub-category contains Incremental vs. Non-incremental methods; this
distinction refers to the scalability issues that arise when the pattern database for
clustering is very large and constraints on execution time or memory affect the
architecture of the algorithm.
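For illustration only, the short sketch below (assuming NumPy and SciPy are available; it is not the decomposition engine used in T-DTS) builds an agglomerative hierarchy over a few hypothetical feature vectors and cuts the resulting tree into flat clusters; a divisive method such as the one underlying T-DTS would grow a comparable tree top-down instead.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical feature vectors: eight instances with two attributes each.
S = np.array([[0.0, 0.1], [0.1, 0.0], [0.2, 0.2],
              [5.0, 5.1], [5.2, 4.9], [5.1, 5.0],
              [9.8, 0.1], [10.0, 0.0]])

# Agglomerative step: start with one cluster per instance and repeatedly
# merge the closest pair (Ward linkage) until a single cluster remains.
Z = linkage(S, method="ward")

# Stopping criterion: here simply the desired number of flat clusters
# (T-DTS instead stops splitting when a complexity threshold is met).
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2 3 3]
```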
The following sub-section is dedicated to the second category of clustering methods -
Partitioning methods.
I.2.1.2 Partitioning methods
For given m data items (instances), a partitioning method arranges the data into k
groups/clusters $G_i$, $i = 1, \dots, k$, with $k \le m$, where $G_i$ is the i-th cluster (sub-database,
sub-group of vectors). As k is an input parameter for this group of algorithms, some domain
knowledge is required, which in practice is unfortunately not available for real-world
applications (Ester, Kriegel, Sander and Xu 1996).
The partitioning algorithm typically starts with an initial partition and then uses an
iterative control strategy to optimize an objective function. Each cluster is represented by
the gravity centre of the cluster (k-centroid-based method) or by one of the objects of the
cluster located near its centre (k-medoid-based method). Usually, a centroid-based method
is used. The centroid is computed as the average of the attributes of the vectors s (of the
feature space S) belonging to the cluster (Berry 2003).
Consequently, partitioning algorithms use a two-step procedure (a minimal k-MEANS sketch
of it is given after the list below). First, determine k representatives minimizing the objective
function. Second, assign each object to the cluster whose representative is closest (Section I.2.2)
to the considered object. This second step implies that a partition is equivalent to a Voronoy
diagram and that each cluster corresponds to one Voronoy cell (Ester, Kriegel, Sander and Xu
1996). Thus, the shape of all clusters found by a partitioning algorithm is convex, which is
very restrictive. There are plenty of partitioning methods, but principally, according to
(Jain, Murty and Flynn 1999), all of them can be arranged into three general sub-categories.
I. The first sub-category is square-error clustering. The most popular partitioning
algorithm of this class is the k-MEANS algorithm.
II. The second sub-category is Mixture-resolving and Mode-seeking clustering. This
group of algorithms has been developed in a number of ways. The principal concept uses
the underlying assumption that the patterns to be clustered are drawn from
one of several distributions, and the goal is to identify the parameters of each
distribution and (if possible) their number. Examples are the Gaussian mixture model,
the Expectation-Maximization (EM) algorithm and unsupervised Bayes models.
III. The third sub-category is Graph-theoretic clustering. These algorithms use
a graph-theoretic approach in which the input data are represented as
a similarity graph and the algorithm recursively partitions the current set of elements
(represented by a sub-graph) into two subsets by a minimum-weight cut computed
from that sub-graph, until a stopping rule is met (Shamir and Sharan 2002). Examples are
the Minimal Spanning Tree (MST) approach, Highly Connected Sub-graphs (HCS) (Hartuv and
Shamir 2000) and CLICK (Cluster Identification via Connectivity Kernels) (Sharan
and Shamir 2000).
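The following minimal Python sketch (a simplified illustration with hypothetical data, not the implementation used in this work) makes the k-MEANS two-step procedure explicit: the assignment step attaches every vector to its closest representative, and the update step recomputes each centroid as the gravity centre of its cluster.

```python
import numpy as np

def k_means(S, k, n_iter=100, seed=0):
    """Minimal centroid-based partitioning of the feature vectors S into k clusters."""
    rng = np.random.default_rng(seed)
    centroids = S[rng.choice(len(S), size=k, replace=False)]  # initial partition
    for _ in range(n_iter):
        # Assignment step: attach every vector to its closest representative.
        d = np.linalg.norm(S[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(d, axis=1)
        # Update step: recompute each representative as the gravity centre
        # (mean) of the vectors currently assigned to it.
        new_centroids = np.array([S[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # the partition no longer changes
        centroids = new_centroids
    return labels, centroids

# Hypothetical usage: partition 200 random 2-D vectors into k = 4 clusters.
S = np.random.default_rng(1).random((200, 2))
labels, centroids = k_means(S, k=4)
```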
The following sub-section represents the third category of clustering methods – the
Density-based methods.
I.2.1.3 Density-based methods
Density-based methods are used as stand-alone tools to gain insight into the
distribution of a data set, e.g. to focus further analysis and data processing, or as a pre-processing
step. Density-based approaches apply a local cluster criterion. Clusters are
regarded as regions in the data space in which the objects are dense, and which are
separated by regions of low object density (noise). These regions may have an arbitrary
shape and the points inside a region may be arbitrarily distributed (Jesse, Liu, Smart and
Brown 2008).
The most popular approaches of this category are DBSCAN (Density-Based Spatial
Clustering of Applications with Noise), OPTICS (Ordering Points To Identify the
Clustering Structure) and LOF (Local Outlier Factor). They are used in Knowledge
Discovery in Databases (KDD) applications for finding outliers, i.e. the rare events,
which are often more interesting and useful than the common cases, e.g. for detecting
criminal activities in e-commerce (Barbara and Kamath 2003).
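As an illustration of this density-based idea, the sketch below (assuming the scikit-learn library, which is not part of T-DTS; the data values are hypothetical) applies DBSCAN to two dense groups plus two isolated points; the isolated points fall below the density threshold and are labelled as noise, i.e. potential outliers.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: two dense groups of 50 points each and two isolated points.
rng = np.random.default_rng(0)
S = np.vstack([rng.normal(0.0, 0.05, size=(50, 2)),
               rng.normal(1.0, 0.05, size=(50, 2)),
               np.array([[0.5, 3.0], [3.0, 0.5]])])

# eps is the neighbourhood radius and min_samples the local density threshold:
# regions denser than the threshold become clusters, the remaining points
# are labelled -1, i.e. noise / potential outliers.
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(S)
print(sorted(set(labels)))  # expected: [-1, 0, 1]
```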
Finalizing this short overview of the structure of classification methods, we would like
to recall that in clustering analysis the existence of predefined pattern classes is not
assumed, the number of classes is unknown, or the class memberships of the vectors are
generally unknown (Leondes 1998). The method arranges the instances
into groups (clusters) so that there is a high similarity among objects within each cluster, but a
very low similarity among objects of different clusters. The goals of any clustering include:
1. Organizing information about the data so that relatively homogeneous groups
(clusters) are formed, and describing their unknown properties.
2. Finding representatives.
Based on these goals, homogeneous group analysis has two components. The first
component is the (dis)similarity measure between any two data samples or feature
vectors. The second component is the clustering algorithm that groups samples into clusters.
A similarity measure is essential to most clustering algorithms, which is why the next
section describes the main (dis)similarity measures.
The distance measures presented there serve only as a short overview. These distances are
embedded into T-DTS as part of its decomposition and complexity estimation techniques.
I.2.2 Distance measure
An important step in any clustering is to select a distance measure, which will
determine how the similarity of two elements is calculated. This will influence the shape
of the clusters, as some elements may be close to one another according to one distance
and farther away according to another.
The first classical and most commonly used similarity measure is the Euclidean distance.
Let d(·,·) denote a distance function between two vectors. Let the two instances s1 and
s2 of the feature space S be represented by two vectors s_1 \in S and s_2 \in S accordingly, where
the index i = 1,\dots,\dim denotes an attribute of the feature vector. Then the scalar d_E(·,·) is
the Euclidean distance:
• Euclidean distance:

d_E(s_1,s_2) = \sqrt{\sum_{i=1}^{\dim} \big(s_1(i) - s_2(i)\big)^2}    (I.1)
• Weighted Euclidean distance:

d_{WE}(s_1,s_2) = \sqrt{\sum_{i=1}^{\dim} w(i)\,\big(s_1(i) - s_2(i)\big)^2}    (I.2)
where the vector w defines the importance (weights) of the features. The choice of
weights must be made carefully, because this factor is more critical than switching
between types of distance measures. The basic Euclidean distance is the special case
(L = 2) of a more general type of distance, the Minkowski distance.
• Minkowski distance:

d_{MN}(s_1,s_2) = \left(\sum_{i=1}^{\dim} \big|s_1(i) - s_2(i)\big|^{L}\right)^{1/L}    (I.3)
For L = 1 we obtain the Manhattan distance.
• Manhattan (Hamming) distance, city-block distance or L1-distance (in most cases
denoted in the literature as L_1):

d_{MH}(s_1,s_2) = \sum_{i=1}^{\dim} \big|s_1(i) - s_2(i)\big|    (I.4)
The Chebyshev distance is the limiting case of the Minkowski distance when L
approaches infinity. In the literature it is denoted as the L_\infty-distance; further in this work it is
referred to as the LSUP-distance.
• Chebyshev distance or LSUP-distance:

d_{CH}(s_1,s_2) = \max_{i=1,\dots,\dim} \big|s_1(i) - s_2(i)\big|    (I.5)
The Mahalanobis distance is popular in statistics for measuring the
similarity of two data distributions. Here T denotes the matrix transpose and \Sigma is the covariance
matrix of the vectors s1 and s2.
• Mahalanobis distance:

d_{ML}(s_1,s_2) = \sqrt{(s_1 - s_2)^{T}\,\Sigma^{-1}\,(s_1 - s_2)}    (I.6)

The purpose of using \Sigma^{-1} is to standardize the data relative to the covariance matrix.
The following distances provide some important clues about (dis)similarity criteria for
cluster analysis. For example, the Canberra distance is often used for homogeneous
cluster analysis owing to its sensitivity to small changes when both coordinates are close to
zero.
• Canberra distance:

d_{CN}(s_1,s_2) = \sum_{i=1}^{\dim} \frac{\big|s_1(i) - s_2(i)\big|}{\big|s_1(i) + s_2(i)\big|}    (I.7)

When s_1(i) + s_2(i) = 0, one needs to define 0/0 = 0.
• Cosine distance:

d_{CS}(s_1,s_2) = \arccos\!\left(\frac{s_1 \cdot s_2}{\|s_1\|\,\|s_2\|}\right)    (I.8)

where d_{CS} lies in the range [0; \pi]. The cosine distance is frequently used in text
mining and document comparison.
Next, the Tanimoto coefficient is an extension of equation I.8, such that it yields the
Jaccard coefficient (Tan, Steinbach and Kumar 2005).
• Tanimoto distance:

d_{TN}(s_1,s_2) = \frac{s_1 \cdot s_2}{\|s_1\|^2 + \|s_2\|^2 - s_1 \cdot s_2}    (I.9)
The above list of metrics can be completed with the Levenshtein distance, the Sorensen
similarity measure (Deza M. and Deza E. 2006) and similar metrics, which together form the
complex taxonomy of distance metrics. One example of recent developments
is the popular (because of its superior characteristics) group of distances called
signal-to-noise distances (Gavin, Oswald, Wahl and Williams 2002); note that they do
not belong to the subclass of equal-weighted and unweighted metrics listed in I.1 – I.9.
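To make the measures of equations I.1 – I.9 concrete, the following minimal NumPy sketch (our own illustrative code, not part of T-DTS) computes several of them for two feature vectors given as arrays:

import numpy as np

def euclidean(s1, s2):                 # eq. (I.1)
    return np.sqrt(np.sum((s1 - s2) ** 2))

def weighted_euclidean(s1, s2, w):     # eq. (I.2)
    return np.sqrt(np.sum(w * (s1 - s2) ** 2))

def minkowski(s1, s2, L):              # eq. (I.3)
    return np.sum(np.abs(s1 - s2) ** L) ** (1.0 / L)

def manhattan(s1, s2):                 # eq. (I.4)
    return np.sum(np.abs(s1 - s2))

def chebyshev(s1, s2):                 # eq. (I.5), the LSUP distance
    return np.max(np.abs(s1 - s2))

def mahalanobis(s1, s2, cov):          # eq. (I.6)
    d = s1 - s2
    return np.sqrt(d @ np.linalg.inv(cov) @ d)

def canberra(s1, s2):                  # eq. (I.7), with 0/0 defined as 0
    num, den = np.abs(s1 - s2), np.abs(s1 + s2)
    return np.sum(np.divide(num, den, out=np.zeros_like(num, dtype=float), where=den != 0))

def cosine(s1, s2):                    # eq. (I.8)
    c = np.dot(s1, s2) / (np.linalg.norm(s1) * np.linalg.norm(s2))
    return np.arccos(np.clip(c, -1.0, 1.0))

def tanimoto(s1, s2):                  # eq. (I.9)
    dot = np.dot(s1, s2)
    return dot / (np.dot(s1, s1) + np.dot(s2, s2) - dot)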
Therefore, the above-mentioned (dis)similarity measures make it possible to group the feature
vectors s into clusters in which the resemblance of instances is stronger than between
clusters (Leondes 1998). Another important distinction is whether the clustering uses
symmetric or asymmetric distances: most of the distance functions listed above are symmetric,
but in some applications this is not the case. Distance
measurement is the fundamental vehicle for data clustering and is widely used in
numerous classification applications. However, defining clustering in terms of
simultaneous closeness on all attributes may sometimes be desirable, but often is not:
usually, clustering, if it exists, occurs only within a relatively small unknown subset of the
attributes (Friedman and Meulman 2004).
The following section summarizes the clustering methods, taking into account
the important concepts mentioned above: clustering algorithms and distance measures.
I.2.3 Summary
The presented overview of cluster analysis demonstrates the wide range of
available clustering methods, where each method may produce a different grouping of a
given dataset. The choice of a particular method strongly depends on the type of output
desired; a pragmatic approach to selecting an appropriate clustering algorithm has to
take into account whether the desired algorithm should be parametric or
nonparametric (Fukunaga 1972).
In the first case, a main criterion has to be provided; the data are then arranged into a pre-
assigned number of groups with the goal of optimizing this criterion. Parametric clustering
methods are based on a pre-analysis of the global data structure and generally have to be
used in combination with other optimization methods. The performance of this type
of method depends on assumptions about its parameters (e.g., the number of clusters),
which are hard to establish beforehand in real-world applications.
The most intuitive and frequently used criterion in parametric partition clustering
techniques is the squared error criterion, which tends to work well on isolated and
compact clusters (Jain, Murty and Flynn 1999). The most commonly used parametric
method is the k-MEANS algorithm, which employs the squared error criterion (McQueen
1967). Several other variants of the k-MEANS algorithm have been proposed to handle the
sensitivity to the initial partitioning (Anderberg 1973), which attempt to select a good
initial partitioning so that the algorithm is more likely to find the global optimum (Jain,
Murty and Flynn 1999). Other variants use splitting and merging of clusters, which make
it possible to obtain an optimal partitioning while starting from any arbitrary initial
partitioning. If the variance of a cluster is above a threshold, it is split, while two clusters
are merged if the distance between them is below another threshold (Jain, Murty and
Flynn 1999). However, the necessity to specify the threshold parameters in advance and
to supervise the output is an inherent disadvantage of these approaches.
The alternative to the parametric methods is non-parametric clustering, where no
assumptions can be made about the main characterizing parameter(s). In the commonly
used nonparametric approaches, e.g., the valley-seeking method (Koontz and Fukunaga 1972),
data are grouped according to a density function (density-based methods). These methods
do not require knowledge of the number of clusters beforehand. In general, however, the
performance of these methods is very sensitive to the control parameters and, naturally, to the
data distribution. The classic example of a non-parametric approach is the algorithm called
CAST (Cluster Affinity Search Technique), an iterative approach (Portnoy,
Bellaachia, Chen and Elkhahloun 2002) that deals effectively with outliers (Shamir and
Sharan 2002), (Zhang 2006), developed by Ben-Dor (Ben-Dor, Shamir and Yakhini 1999).
Regardless of the clustering strategy, the main purpose of clustering analysis is to
discover structures in data without providing an explanation or interpretation. Therefore,
clustering is subjective in nature: the same data set may need to be partitioned
differently for different purposes. Typically, this subjectivity is incorporated into the main
clustering control parameters and employs domain knowledge in one or more steps of the
clustering analysis. It should be mentioned that every clustering approach uses some type
of knowledge, either implicitly or explicitly. The incorporation of explicitly available
domain knowledge into clustering is used mainly in ad hoc approaches (Jain, Murty and
Flynn 1999); in general, however, clustering techniques automatically extract knowledge
during their pre-processing step.
Another problem with clustering is that the result strongly
depends on the choice of the feature space, on the object proximity measures, and on the
methods used to formalize the concepts of object and cluster equivalence (Biryukov,
Ryazanov and Shmakov 2007). By combining clustering with classification methods
(described in the following section) that take into account domain knowledge (in
the form of a set of concepts), we can greatly improve the performance of both clustering and
classification.
I.3 Main classification methods
The first part of this section surveys the most prominent supervised-learning classification
methods. These techniques differ from the unsupervised learning
(more precisely, clustering) approaches overviewed above in that the learner is provided
with class labels. Within machine learning, a particular group of such algorithms
is known as instance-based or lazy learners; the second part of this section
is dedicated to this type of learner.
In contrast to lazy learners, the aim of eager learners is to predict the value of a function
for any valid input object after having seen a number of training examples. To achieve
this, the learner has to generalize from the presented (training/learning) data to unseen
situations, i.e. to build a learning model. These algorithms are almost always biased toward
some representation (e.g. Neural Networks, Decision Trees, the Support Vector Machine
(SVM), etc. fall into this category). In contrast, lazy learners do not build a model
and generally just remember the data samples. They are faster at the training step but slower
at the classification step (Ding 2007). A good example of a lazy learner is the k-Nearest
Neighbour (kNN) classifier (Han and Kamber 2006).
A lazy learner has the option of representing the target function by a combination of
many local approximations, whereas an eager learner must commit at training time to a
single global approximation. The distinction between eager and lazy learners is thus related
to the distinction between global and local approximations of the target function.
Independently of the specificities of the learners, a combination of classification
approaches may be considered as a general solution.
Recently, in the area of machine learning, the concept of combining classifiers has been
proposed as a new direction for improving the performance of individual
classifiers. Numerous combination schemes – hybrid methods, multiple experts,
mixtures of experts, cooperative agents, opinion pools, decision forests, classifier
ensembles and classifier fusion – have been shown to improve classification.
Classifiers with different features and methodologies can complement each other (Parvin,
Alizadeh and Minaei-Bidgoli 2009). More precisely, the work of Dietterich (Dietterich
2001) gives accessible and informal reasoning, from statistical, computational and
representational viewpoints, of why ensemble approaches can improve results.
The general goal of classification ensembles is to generate more certain, precise and
accurate system results. Our aim is the development of a classifier-ensemble approach based on
universal principles. Thus, we have to consider two main aspects: the first is that a
classifier structures the data; the second is that the classifier's optimization procedure has to
fit the classification model(s) to the data sample(s). Given the complexity of constructing
such a classifier, there is a risk of failure if the amount of data is
insufficient (Micheli-Tzanakou 1999). Moreover, an ensemble of classifiers has its own
complex set of properties that has to be controlled by a user or a procedure. However, this
issue can also be considered an advantage, because these different properties become a
source of the principal classifiers' dissimilarities and of a potentially wider applicability
(in comparison to a single classifier). Therefore, based on Lotte's survey (Lotte et al.
2007), we provide the list of the most important properties that are commonly used to
describe different types of classifiers:
• Generative vs. discriminative. The former computes the likelihood of each class
and chooses the most likely one; the latter learns only how to discriminate between the
classes. This is loosely related to the concepts of lazy and eager learners, respectively.
• Static vs. dynamic. The former neglects temporal information during classification,
for example the Multi-Layer Perceptron (MLP); the latter, for example the Hidden Markov Model
(HMM), can classify sequences of instances.
• Stable vs. unstable. The linear discriminant function is an example of a very stable
classifier, in contrast to the MLP.
• Regularized vs. unregularized. The first group of classifiers uses careful control
methods in order to prevent overtraining; in contrast, overfitting may occur with
unregularized classifiers.
The list of criteria could be extended, but even considering only these four
criteria, we find that the properties of the most commonly used classifiers overlap
strongly (the full chart of classifiers and their properties is available in the work
(Lotte et al. 2007)).
Sections I.3.1-I.3.6 present descriptions of single classifiers in order to give an idea of
the ensemble-of-classifiers approach of Section I.3.7. Afterwards, Section I.4
describes T-DTS. T-DTS is based on the "divide" and "conquer" paradigm proposed in the works
(Chebira, Madani and Mercier 1997), (Madani, Chebira and Mercier 1997); thus, it
builds a classifier ensemble over an initial database split up in a tree-like manner.
I.3.1 Linear classifiers
Linear classifiers are discriminant algorithms that use a linear function to distinguish
classes. They are probably the most commonly used algorithms in applications. Two
main kinds of linear classifiers are used: the discriminant function and the Support Vector
Machine (SVM).
I.3.1.1 Discriminant functions
The discriminant (linear / quadratic) function is a simple and basic classification method
based on Linear Discriminant Analysis (also known as Fisher's LDA), used to separate the
data representing different classes. For a two-class classification problem, this function, or
class-separating hyperplane (a higher-dimensional analogue of a plane in three dimensions,
The Collins English Dictionary), is generally written as

g_0(s) = w^{T} s + w_0    (I.11)

where w is called the weight vector and w_0 the threshold. The sign of g_0(·)
defines membership of one of the two classes. Equation I.11 performs well when the data from
different classes are linearly separable or when the covariance matrices of all the classes are
identical (Duda and Hart 1973). When the data are linearly separable, the Perceptron, using
a learning rule (Rosenblatt 1961), can successfully classify the data after a finite number of
iterations (Novikoff 1963).
For a two-class problem, the weight vector of the Perceptron classifier can also be
determined by the Fisher criterion (Fisher 1936). For non-separable cases, these parameters
can be obtained by assuming that the data distribution is Gaussian-like or by minimizing the
mean square error (Duda and Hart 1973). To solve problems with more than two classes, several
hyperplanes g_i(·) are used. The strategy used for multiclass tasks is "One Versus the
Rest" (OVR), which consists in separating each class from all the others.
Although the linear discriminant function cannot achieve high accuracy in most real
cases, its appealing property is a low computational cost, and it can easily be implemented on
vector processors or on a single processor. Therefore, for a large classification
problem with thousands of classes, the discriminant function is a good choice for pre-
classification.
Further enhancements of this approach, such as the Quadratic Discriminant Function
(QDF) or the Modified Quadratic Discriminant Function (MQDF) (Kimura, Takashina,
Tsuruoka and Miyake 1987), perform very well in handwritten character recognition
(Kimura and Shridhar 1991), (Kimura, Wakabayashi, Tsuruoka and Miyake 1997).
Nevertheless, when a large number of training samples is available and the problems involve
multimodal (multiple local maxima) densities, non-parametric methods such as kNN and
SVM usually perform better than discriminant functions.
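As an illustration of equation I.11 and of the Perceptron learning rule mentioned above, the following minimal sketch (our own illustrative code; data, names and the learning rate are placeholders) classifies by the sign of g_0(s) = w^T s + w_0 and updates the weights on misclassified samples:

import numpy as np

def perceptron_train(X, y, lr=0.1, n_epochs=100):
    """X: (n, dim) samples; y: labels in {-1, +1}. Returns weight vector w and threshold w0."""
    w = np.zeros(X.shape[1])
    w0 = 0.0
    for _ in range(n_epochs):
        errors = 0
        for s, label in zip(X, y):
            g = w @ s + w0                    # g_0(s) = w^T s + w_0, eq. (I.11)
            if np.sign(g) != label:           # misclassified: move the hyperplane
                w += lr * label * s
                w0 += lr * label
                errors += 1
        if errors == 0:                       # converges after finitely many updates
            break                             # when the classes are linearly separable
    return w, w0

def perceptron_predict(X, w, w0):
    return np.sign(X @ w + w0)                # class given by the sign of g_0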
I.3.1.2 Support Vector Machine
The Support Vector Machine (SVM) also uses a hyperplane to identify classes, but to build
it the SVM algorithm uses the Structural Risk Minimization principle (Han and Karypis
2000). The SVM method was first proposed by Vapnik (Vapnik 1998), (Serfling 2002).
It often achieves superior classification performance compared to other learning
algorithms across most domains and tasks (Statnikov et al. 2005). It has recently
achieved a key position in pattern classification and has shown promising performance
in many applications such as handwritten digit recognition (Decoste and Scholkopf 2002),
classification of web pages (Joachims 1998) and face recognition (Osuna, Freund and
Girosi 1997).
The SVM is built in two steps. First, the data vectors are mapped to a high-
dimensional feature space. Second, the SVM tries to find a hyperplane in this space
separating the data with maximum margin. The margin denotes the distance from the boundary
to the closest data point in the feature space (Fig. I.1).
Fig. I.1 : SVM: space mapping using linear hyperplane
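For reference, the standard hard-margin formulation of this maximum-margin search, written here in the usual textbook notation (the labels c_i and bias b are our notation, not the thesis's), is:

\min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad c_i\,(w^{T} s_i + b) \ge 1,\ i = 1,\dots,l,

where c_i \in \{-1,+1\} are the class labels. The soft-margin variant adds slack variables \xi_i \ge 0 penalized by a trade-off constant C, which corresponds to the misclassification penalty discussed below.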
The Linear Support Vector Machine (LSVM) is the simplest linear form of SVM. In the linear
case, the margin is defined by the distance of the hyperplane to the nearest positive and
negative examples. The goal of the SVM is to predict the class label for each input;
the classification is based on the sign of the decision function defined by the hyperplane.
Vapnik's SVM algorithm finds the optimal hyperplane, defined as the one with
the largest margin separating the classes of data (Serfling 2002). When the training sets are not
linearly separable and perfect separation is not possible, a trade-off is used that allows the
LSVM to penalize the misclassification of a data point. The following section
presents an SVM modification - the Nonlinear Support Vector Machine (NSVM) - which takes an
intermediate position between linear classifiers and the group of non-linear classifiers briefly described
in Sections I.3.3 – I.3.7.
I.3.2 Nonlinear Support Vector Machine
The original optimal hyperplane algorithm proposed by Vapnik was a linear classifier.
The NSVM solution for non-linear cases builds nonlinear decision boundaries
by using a non-linear kernel function.
This allows the algorithm to fit the maximum-margin hyperplane in a transformed
feature space. The mapping of the data S to another space by means of a kernel-induced
function Φ is schematically described in Fig. I.2. Generally, this new space has a higher
dimension than the original one (Gold, Holub and Sollich 2005).
Fig. I.2 : SVM space mapping using different space kernel functions Φ: Nonlinear
kernel tool (right) and Linear hyperplane (left)
The classical kernels for SVM are: polynomial, Gaussian, Radial Basis Function (RBF)
and sigmoid. To solve a multiclass classification problem,
SVM employs the One-versus-rest (OVR) method (Kressel 1999), the One-versus-one (OVO)
method (Manning, Raghavan and Schultze 2008), the Directed Acyclic Graph SVM
(DAGSVM) method and the Weston and Watkins (WW) method (Hastie, Tibshirani and
Friedman 2009).
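As a purely illustrative usage sketch (assuming the scikit-learn library; the data set and parameter values are placeholders, not results from this work), a nonlinear SVM with an RBF kernel can be wrapped in an explicit One-Versus-Rest multiclass scheme as follows:

import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

X = np.random.rand(300, 4)                       # placeholder feature vectors
y = np.random.randint(0, 3, size=300)            # placeholder labels for 3 classes

# RBF-kernel SVM wrapped in a One-Versus-Rest multiclass scheme
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.predict(X[:5]))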
SVM-based approaches have several advantages. They have good generalization
properties owing to the margin maximization and the regularization term. Many problems
in pattern recognition, such as the curse of dimensionality and over-fitting, do not occur with
SVM when a suitable kernel and the multiclass SVM method parameters are properly
selected. Another appealing property of SVM is that its classifier structure is data-driven:
this structure is automatically determined by solving a constrained convex quadratic
programming problem. This avoids having to customize the structure manually to achieve high
performance, in contrast to NNs (neural networks). However, there are two important
open problems for SVM: wide-range applicability and the computational cost of SVM training
algorithms (Lotte et al. 2007).
I.3.3 Neural networks
Neural Networks (NNs) or more precisely Artificial Neural Networks (ANNs) are a
powerful tool with nonlinear approximation capabilities used in many engineering areas
and computer technologies. They are computational systems, either hardware or software,
which mimic the computational abilities of biological systems (Maren 1990).
ANNs are easy to construct and can be developed within a reasonable timeframe
(Maren 1990). They have been employed and compared to conventional classifiers for a
number of classification problems (Abdelwahab 2004). The results have shown that the
accuracy of the NN-based approach is equivalent to, or slightly better than, that of other
methods, owing to its ability to produce non-linear boundaries. This makes NN-based
classifiers efficient (Zhou 1999), (Lotte et al. 2007).
ANNs are composed of highly interconnected neurons that accept input and generate
output. Each connection has a weight associated with it. The weights are adjusted during
the training of the network to achieve human-like pattern recognition. The choice of the
learning algorithm, weight initialization, the input signal representation and the structure
of the network is very important (Go, Han, Kim and Lee 2001). The number of hidden
neurons and layers must be sufficient to provide the discriminating capability required for
an application. However, if there are too many neurons, the neural network will not be
able to generalize between input patterns when there are minor variations from
the training data (Fogel 1991). Furthermore, there will be a significant increase in cost and
in the time required for training. As reported in (Goblick 1988), NN classifiers have the
following characteristics:
o NNs’ classifiers are distribution free. NNs allow the target classes to be defined
without consideration to their distribution in the corresponding domain of each data
source (Benediksson, Swain and Ersoy 1990). In other words, using neural networks is a
better choice when it is necessary to define heterogeneous classes that may cover
extensive and irregularly formed areas in the spectral domain and may not be well
described by statistical models.
o NNs’ classifiers are capable of forming non-linear decision boundaries and they do
not require decision functions to be given in advance (Go, Han, Kim and Lee 2001). This
makes them flexible in modelling real world complex relationships (Zhang 2000) and they
can approximate any function with arbitrary accuracy.
o NNs’ classifiers are data independent (Abdelwahab 2004). When neural networks
are used, data sources with different characteristics can be incorporated into the process of
classification without knowing or specifying the weights on each data source. Until now,
the importance-free property of neural networks has been mostly demonstrated
empirically. Efforts have also been made to establish the relationship between the data
independent characteristics of NNs and their internal structure, particularly their weights
after training (Zhou 1999). In addition, recent NN implementations show a trend towards
reduced storage and computational requirements.
NN learning methods are either unsupervised or supervised:
• Unsupervised methods determine classes automatically, but in fact show limited
ability to accurately divide space into clusters.
• Supervised methods have yielded higher accuracy than unsupervised ones, but
suffer from the need for human interaction to determine classes and training regions.
Backpropagation is one of the low-computational-cost algorithms used for training
supervised neural networks. It is based on a linear model (steepest descent) (Go, Han, Kim
and Lee 2001), and it has been shown that for Multi-Layer Perceptron (MLP) training,
backpropagation approximates the Bayes-optimal discriminant functions for both two-
class and multi-class recognition problems (Ruck et al. 1990). However, some
drawbacks are associated with backpropagation, such as convergence to a local
minimum and the absence of specific methods for determining the network structure.
However, the network pruning approach of deleting the irrelevant weights of a network
before invoking inference can be used to optimize the size of the network (Hertz, Palmer
and Krogh 1991). Some approaches (Le Cun, Denker and Solla 1990), (Hassibi and Stork
1993) use the information of the second-order derivatives of the error function for network
pruning. Although these methods can improve generalization performance, the
computational cost of pruning an initially large, fully connected network is high. For another
range of methods such as LeNet1 (Le Cun et al. 1989) and LeNet5 (Le Cun, Bottou,
Bengio and Haffner 1998), the network structure is customized manually for the specific
application. However, these network-constructing methods require good prior
knowledge.
Neural networks are robust to errors and thus are well-suited to problems in which the
training data are noisy. However, they have poor interpretability, since it is difficult for
humans to interpret the meaning behind the weights. They also require a number of
parameters, such as the number of layers, to be determined, which often comes from
experience, especially when dealing with large NNs. A glaring and
fundamental weakness in the current theories of ANNs and ANN-connectionism is the
total absence of the concept of an autonomous system. As a result, the NN algorithms
developed in the field require human adjustment (Roy 2000).
In conclusion, let us note that NNs are used to model continuous complex processes,
even human behaviour in simplified tasks involving collision avoidance and target
positioning (Maren 1990).
I.3.4 Non-linear Bayesian classifiers
This section introduces Bayesian classifiers and Hidden Markov Models (HMMs). All
these classifiers produce non-linear decision boundaries. Although this group of
classifiers is not as widespread as discriminant functions or NNs in real-world
applications, they are generative, which enables them to reject uncertain samples more
efficiently than discriminative classifiers.
I.3.4.1 Bayesian classifiers
Bayesian decision theory is a fundamental statistical tool in pattern classification
problems. The Bayesian method is one of the traditional classification techniques. It provides
the optimal performance from the standpoint of error probabilities in a statistical
framework (Go, Han, Kim and Lee 2001).
The success of the Bayesian methods depends on the assumptions used to obtain the
probabilistic model (Chen and Varshney 2002), (Zhang 2000). This makes them
unsuitable for some applications such as image classification based on a feature space
comprising texture measures (Simard, Saatchi and Grandi 2000). However, they have
been applied to ANNs in order to regularize training and thus improve the
performance of the classifier (Kupinski, Edwards, Giger and Metz 2001).
I.3.4.2 Hidden Markov Models
Hidden Markov Models (HMMs) are models with finite sets of states, each of
which is associated with a probability distribution. Transitions among the states are
governed by a set of probabilities called transition probabilities. In a particular state, an
outcome or observation can be generated according to the associated probability
distribution. Only the outcome, not the state, is visible to an external observer;
the states are therefore hidden from the outside. HMMs are known to classify data based on
their statistical properties. HMMs extract fuzzy features from the pattern in question and
compare it with the known (stored) one (Lu 1996). To use HMMs for the classification of
unknown input data, the HMMs are first trained to classify (recognize) known patterns.
HMMs are basically one-dimensional models. Hence, to use them for a complex classification task
such as face recognition, the pattern must be represented in a 1D format without losing
any vital information. HMMs are popular in the field of speech recognition, because they are
perfectly suitable for the classification of time series. However, there are several problems
with HMMs, the main ones being that HMMs make very general assumptions about the data
and that the number of parameters to be set is large.
The theory of HMMs is elegant, but its implementation is hard (Kadous 2002).
I.3.5 Prototype methods
Let a set of prototypes consist of l pairs (s_i, c_i), i = 1,\dots,l, where c_i is the class label of
sample s_i. In most cases, the s_i associated with a prototype is an example from the
training set. The classification of an unseen pattern s consists of assigning to it the class label of
the closest prototype according to a distance measure function d(·,·).
This group of classification methods is related to the group of clustering techniques
mentioned in Section I.2. The most frequently used prototype methods are k-Nearest
Neighbour (kNN) and Vector Quantization. This type of classifier, briefly overviewed
in the following sub-sections, is relatively simple and performs non-linear space
separation.
I.3.5.1 Vector quantization
Vector quantization is a powerful technique used not only for classification but also
for data compression purposes. It is based on the competitive learning paradigm, so it is
closely related to the self-organizing map model that is trained using unsupervised
learning. Vector quantization and supervised classification techniques are combined
because both techniques can be designed and implemented using methods from statistical
classification as well as classification trees (Cosman, Oehler, Riskin and Gray 1993). Let
us note that such an implementation with a tree structure greatly reduces the encoding
complexity (Gray, Oehler, Perlmutte and Ohlsen 1993), and it has been shown that if an
optimal vector quantizer is obtained, under certain design constraints and for a given
performance objective, no other coding system can achieve better performance. Vector
quantization has several advantages in coding and in reducing the computation in speech
recognition (Gold and Morgan 1999).
One of the most widely used algorithms is Learning Vector Quantization (LVQ). It is
applied to classifying various kinds of patterns and signals. The reason to apply LVQ is
that it can treat large amounts of input data with a small computational burden; in other words, it can
deal with a high-dimensional representation space using a simple learning structure.
LVQ can be used for training, in a supervised manner, the competitive layers of the unsupervised neural network
model developed by Kohonen (Kohonen 1989), called the Self-Organizing Map (SOM).
LVQ is composed of two layers: a competitive layer that learns the
feature space topology and a linear layer that maps the competitive units onto target classes.
According to Kohonen (Kohonen 1989), prototypes are placed with respect to the
decision boundary to reduce the classification error by attracting the prototypes of the
correct class and repelling the prototypes of incorrect classes (see the sketch below). The decision boundary of LVQ is
a piece-wise hyperplane. LVQ is defined in the form of an algorithm rather than the optimization
of a cost function, which makes the analysis of its properties difficult. Generalized
Learning Vector Quantization (GLVQ) (Sato and Yamada 1996) adjusts the prototypes
based on the Minimization of Classification Errors (MCE) criterion (Juang and Katagiri 1992), which
allows the GLVQ user to improve classification performance. It also has the advantage of
increasing the classification accuracy of the SOM network (Go, Han, Kim and Lee 2001).
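The attraction/repulsion rule can be sketched as follows (a minimal LVQ1-style update written by us for illustration only; alpha is a hypothetical learning rate and all names are ours):

import numpy as np

def lvq1_epoch(prototypes, proto_labels, X, y, alpha=0.05):
    """One pass of the basic LVQ1 rule: the closest prototype is attracted towards
    a sample of the same class and repelled from a sample of a different class."""
    for s, label in zip(X, y):
        j = np.argmin(np.linalg.norm(prototypes - s, axis=1))  # nearest prototype
        if proto_labels[j] == label:
            prototypes[j] += alpha * (s - prototypes[j])        # attract
        else:
            prototypes[j] -= alpha * (s - prototypes[j])        # repel
    return prototypes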
The vector quantization technique is used to simplify image processing tasks such as half-
toning, edge detection (Cosman, Oehler, Riskin and Gray 1993), image recognition
(Gersho and Gray 1991), image thinning, shrinking and skeletonization (Shen and Castan
1999), as well as speech processing (Gersho and Gray 1991).
I.3.5.2 k-Nearest Neighbour classifier
The k-Nearest Neighbour (kNN) classifiers are instance-based learners. Learning
consists of storing the available training samples. When a new instance is presented as a
query, a set of similar instances is retrieved and used for classification. As a lazy learner,
the kNN classifier stores the training samples and does not build the classifier explicitly. When
an unknown instance/prototype is given, the algorithm searches the whole set of training
instances for the k instances which are closest to the unknown instance.
The unknown instance is then assigned the most common class among those k
instances; these k instances are the k nearest neighbours of the unknown instance.
Proximity is generally defined in terms of the Euclidean distance or any other distance
measure (Mitchell 1997), such as those of equations I.1 – I.9.
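A minimal sketch of this procedure (our own illustrative code; the Euclidean distance of equation I.1 is used as the proximity measure) is:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by a majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances (eq. I.1)
    nearest = np.argsort(dists)[:k]                      # indices of the k closest instances
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]                    # most common class among the k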
A very good property of this classifier is that kNN does not require any analysis of the
form of the data density function. Its asymptotic probability of error is never greater than twice
the Bayesian error (Cover and Hart 1967).
The kNN classifiers are faster at training but slower at classification than eager
methods, since nearly all computation takes place at classification time rather than when
the classifier model is built at training time. A disadvantage of kNN is that, in order
to achieve high accuracy, a huge number of training samples is required; as a consequence,
the computational cost of kNN is prohibitively high. To address this problem, a variety of
techniques have been developed for tuning kNN performance (Han and Karypis 2000).
In conclusion, we highlight that the performance of prototype methods largely
depends on the initial prototypes, which are usually set by some clustering algorithm such as
k-MEANS. Comparing kNN with LVQ, the latter usually achieves better
performance at a lower cost.
The following Section I.3.6 presents a classifier which is directly related to the T-DTS
concept, owing to its ability to construct a tree-like classification structure.
I.3.6 Decision trees
Decision trees are considered one of the most widely used classification approaches
due to their accuracy and modest computational requirements (Gelfand, Ravishankar and
Delp 1991), (Srivastava, Han, Kumar and Singh 1999), (Zhang, Chen and Kot 2000).
They are capable of performing non-linear classification (Atlas et al. 1989) and they do
not rely on assumptions about the statistical distribution. This leads to successful applications in many fields
(Simard, Saatchi and Grandi 2000).
The tree is composed of a root node, intermediate nodes and terminal nodes. The data
set is classified at each node according to the decision framework defined by the tree (Ho,
Hull and Stihari 1994). The decision tree model is built by recursively splitting the
learning set based on a locally optimal criterion (Han and Karypis 2000). It starts with a
coarse classification, followed by a finer classification, until finally each group
contains only one instance or one class. Decision-tree-based classification has the
advantage of employing more than one feature: each group of employed features
provides partial information about the instance(s), and the combination of such groups or
clusters can be used to obtain an accurate recognition decision (Senior 2001). More than
one decision tree can be built for a given database (Tu and Chung 1992).
A large number of methods has been proposed in the literature for the design of
classification trees. Classification and Regression Trees (CART) is one of the most commonly used
approaches (Gelfand, Ravishankar and Delp 1991). It was developed
over the years from 1973 to 1984 (Atlas et al. 1989). It has the advantage of
constructing classification regions with sharp corners; however, it is computationally
expensive (Gelfand, Ravishankar and Delp 1991). In CART, splitting continues until
terminal nodes are reached. Then a pruning criterion is used to sequentially remove splits.
Pruning can be implemented by using different data than those used for tree building
(Atlas et al. 1989). The main advantage of pruning is the reduction of the size of the decision
tree, hence reducing the classification error and avoiding both overfitting and
underfitting. The most frequently used pruning methods are based on removing some of the
nodes of the tree. Pruning can also be performed by employing neural networks, trained by the
backpropagation algorithm (Kijsirikul and Chongkasemwongse 2001), to give weights to
nodes according to their significance instead of completely removing them.
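As an illustrative usage sketch only (assuming the scikit-learn library, whose cost-complexity pruning is one possible realization of the pruning idea described above; data and parameter values are placeholders):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(500, 6)                      # placeholder training data
y = np.random.randint(0, 2, size=500)

# Grow the tree by recursive locally optimal splits, then prune it:
# ccp_alpha > 0 removes splits whose contribution does not justify their complexity.
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01)
tree.fit(X, y)
print("leaves after pruning:", tree.get_n_leaves())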
Another widely used decision-tree-based classification algorithm that has to be
mentioned is C4.5. Like CART, it has been shown to produce good classification results;
let us nevertheless highlight, in conclusion, that decision-tree-based schemes may not work well
because of the overfitting problem (Han and Karypis 2000). Therefore, the disadvantages of the
individual classification methods (Sections I.3.1 – I.3.6) motivate the more advanced approach of
combining classifiers.
I.3.7 Ensemble of classifiers
There are two common ways to increase the accuracy of a classification
system/framework: one is to improve the performance of a single classifier; the other is
to combine the results of multiple classifiers by employing decision combination for
different data-clusters of the learning heap (Tumer and Ghosh 1995). One may find in the
literature a range of experiments where multiple classifiers achieve better performance
than a single classifier when they are selected carefully and the combining algorithm retains
the advantages of each individual classifier while avoiding its weaknesses (Ho, Hull and Stihari
1994), (Hsieh and Fan 2001).
This is due to two reasons (Briem, Benediktsson and Sveinsson 2000):
1. The risk of choosing the wrong data-cluster is lower;
2. Individual classifiers can be built on different types of instance-features of the
same data-cluster, and the combination scheme can weight the classifiers based on the
characteristics of the different features.
It’s quite typical when the ensemble method of combining classifiers is based on data
re-sampling approach. In this instance the outputs of a classifier are interpreted in terms of
bias-and-variance decomposition, the ensemble methods mainly reduce the variance of
these single classifiers.
Bagging (Breiman 1996) and Boosting (Freund and Schapire 1996) are two classical
methods that have shown great success when using ensembles of classifiers (Chan, Huang
and De Fries 2001). Bagging employs the bootstrap sampling method to generate training
subsets, while in Boosting the creation of each subset depends on the previous classification
results.
For both methods, the final decision is made by a majority vote. Numerous
experiments (Bauer and Kohavi 1999), (Opitz and Maclin 1999) have shown that Bagging
and Boosting are effective only for weak classifiers such as classification trees and neural
networks, and that they perform well on small datasets.
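A minimal sketch of the Bagging idea (bootstrap re-sampling plus a majority vote), written by us for illustration only; base_learner is assumed to be a generic factory returning an object with scikit-learn-style fit/predict methods, and integer class labels are assumed:

import numpy as np

def bagging_fit(base_learner, X, y, n_estimators=10, seed=0):
    """Train n_estimators copies of base_learner, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap: sample with replacement
        ensemble.append(base_learner().fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """Combine the individual decisions by a majority vote."""
    votes = np.array([clf.predict(X) for clf in ensemble])          # (n_estimators, n)
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])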
Stacking is another method which uses several classifiers, called
level-0 classifiers. The output of each of these classifiers is given as input to a meta-
classifier (level-1 classifier), which makes the final decision (Lotte et al. 2007).
In comparison to the previous method, Voting is used in a "democratic" way:
each classifier assigns a class to a given vector, but the final class assignment is based
on the majority of the classifiers, which is the weakness of this method (Lotte et al. 2007).
Two important issues appear when designing a multiple classifier system:
classifier selection and decision combination. However, on large data sets, other issues of
the ensemble methods need to be taken into consideration more seriously:
• If an ensemble method generates N classifiers over a database and then combines the
classifiers' outputs, the time cost of such an approach is about N times as high as that of a
single base classifier.
• If the data set assigned for building the ensemble of classifiers has to be stored
on the hard drive, because of a lack of RAM, the time cost of re-sampling cannot be
ignored either.
The last important issue is that, very often, the approaches for constructing an
ensemble of classifiers use subset(s) of the database rather than the total set. As a result, the
constructed classifiers lose generalizing power. This issue has to be taken into
account, because in this case the ensemble methods may not perform better than a single
base classifier over the total database.
Taking into consideration these two issues and the fact that classification accuracy
may degrade very quickly as the number of classes increases (Li, Zhang and Ogihara 2004),
the multiple classifier method becomes one alternative solution for solving
classification problems. The multiple classifier method is not automatically superior to a single classifier
method, because we have to deal with the two most important issues in designing a multiple
classifier system: classifier selection and decision combination.
The classifier-ensemble approach is very important for complex tasks, which involve
not only major problems such as the structure of the classifiers and the final decision combination,
but also the realization. These details, which can be viewed as minor sub-problems, have a
crucial influence on the performance of the ensemble of classifiers. The following sections are
dedicated to the problem of constructing Multiple Classifier Structures (MCS).
I.3.7.1 Multiple classifiers structures
A number of classification systems based on the combination of the outputs of a set of
different classifiers, and approaches for constructing them, have been proposed in the
literature. The different Multiple Classifier Structures (MCS) can be grouped as follows
(Sarlashkar, Bodruzzaman and Malkani 1998), (Hsieh and Fan 2001): parallel, pipeline
and hierarchical structures.
For the parallel structure, the classifiers are used in parallel and their outputs are
combined. In the pipeline structure, the system classifiers are connected in cascade
(Giusti, Masulli and Sperduti 2002), (Kawatani 1999). The hierarchical structure is a
combination of the previous two structures.
Irrespective of the type of MCS, the difficulty of choosing and constructing classifiers has pushed
researchers to develop methods that help the designer to make this choice (Gasmi and
Merouani 2005). Among the various methods of constructing an MCS, one central
approach dominates: initially producing a large number of classifiers and then selecting the subset
judged most likely to lead to optimal performance. A central issue in MCS
construction is the combination of the classifiers' output decisions. The following section
gives an overview of this problem.
I.3.7.2 Decision combination
The decision combination methods proposed in the literature are based on different
ideas: voting, the use of statistics, the construction of belief functions and other classifier fusion
schemes (Xu, Krzyzak and Suen 1992), (Prampero and Carvalho 1998), etc. Of particular
interest here are the decision combination methods available for hierarchical MCS
(Abdelwahab 2004): random decision, majority decision and
the hierarchical decision method. In a hierarchical classification framework, the probability
output from each individual classifier is used as input for the next lower level of the hierarchy,
the expectation being that classification inaccuracy might in this way be reduced at the lower levels
of the hierarchy.
For solving pattern recognition problems, one can thus find a proposal (Xu,
Krzyzak and Suen 1992) to use different types of classifier decision combination: the
average Bayes classifier, voting methods, the Bayesian formalism and the Dempster-Shafer
formalism. In these methods, only the top choice from each classifier is used, which is
usually sufficient for problems with a small number of classes. The examination of the
strengths and weaknesses of each method leads to the problem of determining classifier
correlation, which is the central issue in deriving an effective combination method.
Recently, in the work (Du, Zhang and Sun 2009), one can even find an integration of
the Dempster-Shafer and hierarchical decision combination approaches. There are plenty
of other approaches, such as the work (Kittler, Hatef, Duin and Matas 1998) presenting a
common theoretical framework for combining classifiers, from which the product
rule, sum rule, max rule, min rule and median rule can be derived by taking the product, sum, maximum,
minimum and median values of the a posteriori probabilities p(c_j|s_i) - the probability that
an input pattern with feature vector s_i is assigned to class c_j. Different
classification techniques based on the combination of classifiers, such as the kNN decision rule
and the combination of an ensemble of neural networks, have also been proposed in the
literature.
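To make these fixed combination rules concrete, the following sketch (illustrative only; names and the example values are ours) applies the product, sum, max, min and median rules to a stack of posterior probability estimates p(c_j|s_i) produced by several classifiers for one input pattern:

import numpy as np

def combine_posteriors(posteriors, rule="sum"):
    """posteriors: array of shape (n_classifiers, n_classes) with p(c_j|s_i) for one
    input pattern; returns the index of the class selected by the chosen fixed rule."""
    rules = {
        "product": np.prod,
        "sum":     np.sum,
        "max":     np.max,
        "min":     np.min,
        "median":  np.median,
    }
    scores = rules[rule](posteriors, axis=0)   # combine over the classifiers
    return int(np.argmax(scores))              # class with the highest combined score

# Example: three classifiers, four classes
p = np.array([[0.1, 0.6, 0.2, 0.1],
              [0.2, 0.5, 0.2, 0.1],
              [0.3, 0.3, 0.3, 0.1]])
print(combine_posteriors(p, rule="product"))   # -> 1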
The T-DTS concept belongs to this group of classifier-ensemble approaches. The
following section describes this type of approach in greater detail.
I.3.7.3 Ensemble of Neural Networks
The combination of an ensemble of neural networks has been proposed to achieve
high classification performance in comparison with the best performance that could be
achieved by employing a single neural network. This has been verified experimentally
(Kittler, Hatef, Duin and Matas 1998), (Giacinto, Roli and Fumera 2000). It has
also been shown that an ensemble of neural networks provides additional advantages in
the context of classification applications, including biomedical applications that employ
specific diagnostic tools (Dujardin et al. 1999).
There are two types of NN combination at the macro-structure level (Maren 1990): strongly
coupled and loosely coupled networks. The first type can be treated as a single network and
is created by fusing together two or more networks into a single new structure. Loosely
coupled structures connect networks which retain their structural distinctness. Consequently,
a combination of NNs can be performed not only at the macro-structure level,
but also at the micro (intra-NN characteristics) level. This type of NN combination is
similar to data fusion mechanisms. Data fusion of multitemporal
characteristics is essential for medical imaging and remote sensing applications.
Owing to the availability of the large amounts of data acquired by different types of NNs, it is
mandatory to develop effective fusion techniques able to take advantage of such multi-
NN sources and of the multi-temporal characteristics of NNs (Bruzzone, Prieto and Serpico
1999). Fusion of multiple characteristics refers in particular to the acquisition, processing and
synergistic combination of information from various types of NNs in order to provide a
better understanding of the classification situation under consideration (Kundur,
Hatzinakos and Leung 2000).
A fusion scheme (regardless of the NN combination problem) might be defined as follows
(Pattichis C., Pattichis M. and Micheli-Tzanakou 2001): "data fusion is a formal
framework in which are expressed means and tools for the alliance of data originating
from different sources. It aims at obtaining information of greater quality; the
exact definition of 'greater quality' will depend upon the application." Data fusion
techniques may be classified into the following three groups:
1. Data-level fusion: combination of raw data from all sensors.
2. Feature-level fusion: extraction, combination and classification of feature vectors
from all sensors.
3. Decision-level fusion: combination of the outputs of the classifications achieved on
each single source.
Taking into consideration the specificity of NNs, one may find a wide range of multi-NN
model fusion approaches. For example, a method proposed by Ueda (Ueda 2000) for
linearly combining multiple NN-based classifiers uses statistical pattern recognition
theory. In this approach, several neural networks are first selected based on which one works
best for each class in terms of minimizing classification errors. They are then linearly
combined to form the desired classifier, which exploits the strengths of the individual
classifiers. In this approach, the Minimum Classification Error (MCE) criterion is used to
estimate the optimal linear weights, and the problem of estimating the linear combination
weights is reformulated as a problem of designing linear discriminant
functions using the MCE discriminant (Juang and Katagiri 1992). More generally, we
recall that it has been shown (Lazarevic and Obradovic 2001) that NN ensembles are
effective only if the NNs forming them make different errors. For example, it has been shown
(Hansen and Salamon 1990) that neural networks combined by the majority rule can
provide an increase in classification accuracy only if the NNs make independent errors.
Unfortunately, the reported experimental results point out that the creation of error-
independent networks is not a trivial task.
In the neural network field, several methods for creating ensembles of neural
networks making different errors have been investigated. Such methods basically rely on
varying the parameters related to the design and training of the neural networks. According to
the work (Rao, Chand and Murthy 2005), the ensemble-of-NNs methods can be placed
into one of the three following categories:
1. Methods that use different initial NN parameters,
2. Methods that use different types of NNs (such as probabilistic neural networks,
MLP or RBF networks),
3. Methods that use different sub-sets of the given database.
The capabilities of the above methods for creating error-independent NN outputs were
experimentally compared. It was concluded that varying the network type (second
category) and the training data (third category) are the two best ways of creating ensembles of
networks making independent errors. However, it has been noted that neural network
ensembles can also be created using a combination of two or more of the above methods.
Therefore, neural network researchers have started to investigate the problem of the
engineering design of NN ensembles. The proposed approaches can be classified into two
main design strategies: the already mentioned overproduce-and-choose strategy and the direct
strategy; the latter comprises methods that generate an ensemble of error-
independent NNs directly. In practice, it has been shown (Sung and Niyogi 1995) that using
an ensemble of NNs might lead to low computation cost and good classification
performance. Nevertheless, the assumption that is central to this type of modelling is too
tight for many real-world applications (Ryabko 2006).
In conclusion, let us highlight that there are two major ways in which we can integrate
NNs to create an ensemble. One is to create a hybrid network by tightly fusing two
existing neural network architectures (the well-known Hamming net and the counter-
propagation network are examples). The disadvantage of this approach is that
we are still relying on a single network to accomplish a task which may well exceed
the capabilities that any single network can offer. The alternative, more promising
approach is to create an ensemble of individually integrated ANNs, because complex problems
require multiple stages of processing. The computational complexity of fully connected
networks scales as the square of the number of neurons involved. An
NN ensemble is easy to test, debug, repair and update using different types of ANNs.
Moreover, from the practical point of view, nothing like a structured-programming
principle has yet been established for implementing a multi-modular ANN system. Therefore,
modular system design (such as our T-DTS) makes it easy to verify and validate the final
project (Maren 1990).
Summarizing Section I.3, which includes a brief overview of classification approaches
and their combination, we now logically move to one of the main cores of our work: the
development of the T-DTS concept. Let us point out that some well-known classifiers may not have been
mentioned in Sections I.2 and I.3. Concerning the final sub-Section I.3.7.3, there are
many other combination schemes available in the literature that might
make this chapter more complete, but our goal here has been to provide an introduction
to the NN-ensemble (hybrid) T-DTS approach. Thus, the T-DTS concept can be treated
as a special case of an ensemble of classifiers (currently NNs) that uses a decision tree
approach.
Section I.4 is fully dedicated to the T-DTS approach. It contains a detailed description
of the T-DTS concept regardless of implementation aspects.
I.4 T-DTS (Tree-like Divide to Simplify) approach
In many real-world problems and applications, e.g., system identification, industrial
processes, manufacturing regulation, optimization, decision systems, plant safety,
pattern recognition, etc., information is available as data stored in files (databases, etc.).
For classification approaches, efficient processing and handling of these data is of
paramount importance. In most of these cases, processing efficiency is closely related to
several issues, among which are:
1. Data nature: this includes data complexity, data quality and data representativeness.
2. Processing-technique-related issues: these include model choice, processing
complexity and intrinsic processing delays.
Data complexity, frequently related to nonlinearity or subjectivity of the data, may affect
processing efficiency, while data quality (noisy or degraded data) may influence
processing success and the quality of the expected results. Finally, representativeness issues such as
the scarcity of pertinent data could affect processing achievement or the resulting precision
(Madani, Rybnik and Chebira 2003). On the other hand, the choice or availability of an
appropriate model which describes and forecasts the behaviour is of major importance.
The ability of a multi-model system to achieve goals through selectively switching
between the models within a model space can be considered as a form of adaptive
reasoning (Ravindranathan and Leitch 1999). This switching (more precisely, selection) of
models using reasoning is an adaptation. In the T-DTS case, the processing technique or
algorithmic complexity (design, precision, etc.) shapes the processing effectiveness.
The intrinsic processing delay or processing time, related to implementation issues (software-
or hardware-related) or to the parameterization of the processing models, could affect not only
the processing quality (quality of results) but also the technique's viability to offer an adequate
solution for a complex problem represented by a huge data store.
I.4.1 Modular approach
The aim of the modular tree structure called Tree-like Divide to Simplify (T-DTS)
(Bouyoucef, Chebira, Rybnik and Madani 2005) is to extend the existing relation between
processing time and database size. By "extend" we mean the ability to use clustering or
database decomposition as an approach to induce a significant gain in performance. The
expected result is a working system that decreases the overall processing time and/or
increases the classification quality by means of the database decomposition initially proposed in
the works (Chebira, Madani and Mercier 1997), (Madani, Chebira and Mercier 1997).
From our point of view, "complexity reduction" is the key point on which the
modular Tree-like Divide to Simplify (T-DTS) approach acts (Rybnik 2004), (Bouyoucef,
Chebira, Rybnik and Madani 2005). Let us mention here that, in our search for an appropriate
characterization, we focus on the property of the data (classification complexity) that
easily models the underlying structure (an ensemble of NNs) that might be coherent with a
given problem. An important note is that the choice of the characterization depends on the
available model; on the other hand, when a particular set of properties of the experimental
data has been found, one can meaningfully ask for a model that reproduces that
structure. Modelling requires prior characterization (Rossberg 2004). In essence,
T-DTS is a hybrid of multiple neural networks handled by a characterization named the
complexity estimator (Bouyoucef 2007). We purposely leave the term complexity
without a proper definition here, in order to first describe the conceptual outline of T-DTS and then
specify it for a particular set of classification problems.
The T-DTS concept is founded on the assumption that database decomposition decreases the general task complexity. More precisely, it is based on the universal "divide" and "conquer" principle and on the "complexity reduction" approach (Madani, Chebira and Mercier 1997) (in this paragraph we acknowledge a possible criticism concerning misapplication of the term complexity).
Many systems, such as Committee Machines (Tresp 2001), Multi-Agent systems and Distributed Artificial Intelligence, share this paradigm, either directly, by splitting the database into clusters, or indirectly, by means of agents or module coordinators (Rybnik 2004), (Bouyoucef 2007).
The complexity estimating and reduction technique allows T-DTS to deal with complex problems in an intelligent way. It was observed that an intelligent method of partitioning generally performs better than a random one (Chawla, Eschrich and Hall 2001). The staple point of T-DTS is the recursive construction of adaptive tree structure(s) over database subsets.
Our first question, "Why do we use tree structures?", finds an answer if one takes into account (we acknowledge a possible criticism concerning this primitive comparison) the analogies in the "self-organizing world" and in "brain activity" (Josephson 2004). Tree structures are abundant in complex natural systems (e.g. taxonomic hierarchies, trophic pyramids) and in human intelligent organizations. There are some grounds (Green and Newth 2001) for supposing that trees form naturally in many problems.
Therefore, whoever deals with intelligence shares a common view on how to handle the phenomenon of complexity, more precisely the complexity of natural or intelligent human systems such as grammars and languages. Let us recall here that the structure of any language, as a product of human intelligence, underlies human behaviour in a broad sense (Chomsky 1968). For example, we find a common idea with Humboldt's universal approach to complexity, highlighted in the same work (Chomsky 1968): "in the Humboldtian sense, namely as «a recursively generated system, where the laws of generalization are fixed and invariant, but the scope and the specific manner in which they are applied remain entirely unspecified»". This presages Wolfram's basic insight discussed in the work (Chaitin 2005) dedicated to computational complexity. Similarly, the T-DTS concept targets complexity using simple principles that may produce very complicated-looking output, because being principally simple is a condition for being the richest in the phenomena it produces: "simple in hypotheses - the most rich in phenomena" (Chaitin 2005).
In machine learning, the recursive tree-like decomposition approach belongs to the junction-tree, clique-tree, or join-tree group of decomposition methods. This general concept of tree-like decompositions was formalized in the works of (Robertson and Seymour 1984) and since then has been studied and developed by many other authors.
Unfortunately, real-world and industrial problems are never as comfortable for tree-like decomposition approaches as benchmarks are. They are often much more complex, because of the large number of parameters which have to be considered. That is why conventional solutions (based on mathematical and analytical models) reach serious limitations when solving this category of problems (Budnyk, Bouyoucef, Chebira and Madani 2008).
One of the key points on which we can rely is complexity reduction. This approach might allow us to deal with complexity not only at the level of the problem's solution but also at the level of the processing procedure. One way to achieve complexity reduction is to split a complex problem into a set of simpler sub-problems. This leads us to "multi-modelling", where a set of simple models is used to sculpt a complex behaviour (Jordan and Xu 1995), (Murray-Smith and Johansen 1997). Another promising approach to reducing complexity takes advantage of hybridization (Goonatilake and Khebbal 1995). Henceforth, taking also into consideration the classifiers' ensemble approach, the following section describes the general T-DTS concept.
I.4.2 T-DTS concept
The T-DTS approach deals with classification problems using the universal "divide" to "conquer" principle. According to the DARPA report (Goblick 1988), ANN models have demonstrated superiority over classical methods for pattern classification (Maren 1990). Therefore, the T-DTS approach was implemented using Neural Network models. Another argument for an ANN-based T-DTS is the fact that NNs are superior for dealing with more complex or open systems, which may be poorly understood and cannot be adequately described by a set of rules or equations. Regardless of the application, the principal T-DTS concept includes two main operation stages that can be described as follows:
Fig. I.3: General block diagram of the T-DTS structure construction
• The First Stage, or learning phase: T-DTS recursively decomposes the input database into sub-databases of smaller size using a step-by-step scheme (Fig. I.3) and then generates processing structures and tools (special parameters) for the decomposed data sets.
• The Second Stage, or operation phase: aims at processing the input sub-spaces obtained from the splitting. Here, the obtained hybrid multi-neural-network system is used on an unknown (i.e. unlearned) database of new instances.
T-DTS decomposes the problem into sub-problems recursively, building a neural tree computing structure. The nodes of the constructed tree are decision-making units and NN-based decomposition units, and the leaves correspond to NN-based processing units (Madani and Chebira 2000).
We have to note that the diagram shown in Fig. I.3 is general. It can be adapted to specific problems by an intelligent choice of the components. Let us also mention that the Normalization block could include not only database normalization, but also other pre-processing expertise such as Principal Component Analysis.
I.4.3 T-DTS short description
T-DTS is a hybrid multi-neural-network, tree-based constructing approach (Madani, Chebira and Mercier 1997). First, the concept is designed to create an ensemble of NNs over a tree-like database decomposition. However, this ensemble of NNs might contain different intelligent modules (more appropriate for the global task target) at different levels of the structure. Generated recursively by T-DTS, the tree structure (including the form and size of the tree) should conceptually reflect the general complexity of the given problem represented by the initial database.
Henceforth, T-DTS is also a purely data-driven concept founded on the universal "divide" to "conquer" principle. The T-DTS decomposition technique belongs to the category of nonparametric approaches (the unsupervised methods must be realized), or more precisely to the subgroup of hierarchical divisive methods, where the processing techniques are supervised.
We have to remark, first, that the proposed way of decomposition is an analytical method which infers microscopic events from macroscopic data, as is typical for physics. The second remark is that the T-DTS approach is based on the concept of reducibility (Madani, Chebira and Mercier 1997), which is known to have its own limitations (Haken 2002).
The T-DTS tree-designing strategy belongs to the class of "overproduce and choose" engineering methods. The decision-taking strategy is also hierarchical. During decomposition, T-DTS checks complexity conformity, and this is the moment of decision taking: to continue decomposing or to stop. The tree-structure building process is performed dynamically; as a consequence, T-DTS inherits the properties of the decision-tree approach. Database decomposition continues until the complexity conformity and the sub-database parameters required by each neural network module are met together. In the case where the sub-problem/sub-database has not yet reached the required size or dimensionality for the neural network model, T-DTS continues the decomposition process. Therefore, the complexity conformity block is the core engine of T-DTS (Fig. I.3). This (core) block cannot be omitted in any possible implementation.
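To make this decision loop concrete, the following minimal Python sketch mimics the "estimate complexity, then either split or stop" recursion. The helpers estimate_complexity (a toy nearest-centroid error rate) and split_two_means (a toy 2-means decomposition) are illustrative stand-ins for the T-DTS modules, not the actual implementation.

import numpy as np

def estimate_complexity(X, y):
    # Toy stand-in for the complexity estimation module: error rate of a
    # nearest-class-centroid rule (0 means the subset is trivially separable).
    centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
    pred = [min(centroids, key=lambda c: np.linalg.norm(x - centroids[c])) for x in X]
    return float(np.mean(np.array(pred) != y))

def split_two_means(X, y, n_iter=20):
    # Toy unsupervised decomposition unit: plain 2-means clustering.
    rng = np.random.default_rng(0)
    centers = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(n_iter):
        assign = np.argmin([[np.linalg.norm(x - c) for c in centers] for x in X], axis=1)
        if (assign == 0).sum() == 0 or (assign == 1).sum() == 0:
            break
        centers = np.array([X[assign == k].mean(axis=0) for k in (0, 1)])
    return [(X[assign == k], y[assign == k]) for k in (0, 1)]

def build_tree(X, y, threshold=0.1, min_size=20):
    # Recursive construction: stop (leaf) when complexity conforms to the
    # threshold or the subset is too small; otherwise decompose further.
    if len(X) <= min_size or len(np.unique(y)) < 2:
        return {"leaf": True, "size": len(X)}          # a local NN would be trained here
    c = estimate_complexity(X, y)
    if c <= threshold:
        return {"leaf": True, "size": len(X), "complexity": c}
    left, right = split_two_means(X, y)
    return {"leaf": False, "complexity": c,
            "children": [build_tree(*left, threshold, min_size),
                         build_tree(*right, threshold, min_size)]}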
The T-DTS concept is based on the universal "divide" to "conquer" paradigm. The decomposition technique(s) of T-DTS use unsupervised, data-driven methods. The complexity conformity block, or simply the complexity estimating module, as well as its decision-taking part, also has to be purely data driven.
The following section provides a summary of clustering and classification methods and situates the T-DTS concept among the various paradigms.
I.5 Conclusion
Our thesis is dedicated to the T-DTS technique applied to classification problems. T-DTS employs two basic data-processing methods: clustering and classification.
There are two main categories of classification techniques, see Section I.2: parametric and nonparametric. The first category uses parameters that are based on certain assumptions about a given problem. Since the performance of parametric clustering methods is very sensitive to their parameters, we have chosen to base T-DTS on nonparametric methods. More precisely, T-DTS uses prototype-based vector quantization techniques and density-based decomposition techniques that include k-centroid partitioning methods.
To achieve high classification performance, T-DTS employs a set of classification methods that were reviewed in Section I.3. Among the considered techniques were Bayesian classifiers, Hidden Markov Models and Support Vector Machines; however, we found these techniques unsuitable as they require extensive parameterization and often involve a high computational cost. Instead, we have chosen to base T-DTS on a combination of Neural Networks and Decision Tree techniques, which provide good classification performance at a moderate computational cost and require only minor parameterization; we combined the two classification approaches by means of the ensemble-of-classifiers technique, which is based on the "overproduce and choose" design principle and allows reaching a higher accuracy than that of a single classifier. It is worth mentioning that the core of T-DTS is the Neural Networks technique, as it is currently considered an outperforming tool for identifying patterns, forecasting, and identifying trends in large amounts of data, and is thus well suited for the objectives of this thesis.
Chapter II:
Complexity concepts
Since complexity is at the core of T-DTS, the aim of this chapter is to provide an overview of different types of complexity and their relations. A specific type of complexity - classification complexity, which corresponds to the class separability criterion - was applied in previous T-DTS related work (Rybnik 2004). The complexity estimators have been well studied in the work (Bouyoucef 2007), and we rely on those results in this chapter. Our prime goal is to extend and update the taxonomy of the definitions and to link the high-level concepts with the applied definitions of complexity (Bouyoucef 2007). Let us note that the term « complexity » cannot be defined in a unique way; thus, like other works attempting to generalize complexity, our overview may suffer from a lack of accuracy in defining it.
We begin this chapter in Section II.1 with a description of the complexity concept as a whole. We provide a cross-disciplinary overview with the aim of gaining a better insight into the common views of complexity and its origins. Next, in Section II.2, we describe means of measuring a specific type of complexity - the computational complexity. Section II.3 presents our novel approach for estimating the complexity of classification tasks, which belongs to the group of ad hoc complexity estimation techniques and is based on extracting information about the separability of classes from ANN (Artificial Neural Network) structures. Section II.4 provides a short conclusion.
II.1 Introduction to complexity concepts
There is no common definition of the term "complexity". The Oxford dictionary defines complexity as something "made of usually several closely connected parts". The Latin "complexus" signifies "entwined" or "twisted together". "Complicated" uses the Latin root "plic", which means "to fold", while "complex" uses "plex", which means "to weave".
Many attempts have been made to develop a generalized understanding of complexity and, ultimately, a Theory of Complexity - a systems theory (Lucas 2000) of systems that consist of many interacting components and many hierarchical layers. Note that a system is called complex if it is impossible to reduce its overall behaviour to a set of properties characterizing its individual components (Lucas 2000); interactions at the collective level in such a system can produce properties that are simply not present when the components are considered individually.
The quantification (measurement) of a system's complexity, including the complexity of the classification process, plays the key role in T-DTS, as it is responsible for optimizing the tree structure of Neural Networks. This motivated us to find a new approach for the analysis of complexity and the synthesis of complex systems.
We highlight in this chapter our general interest in investigating the complexity phenomenon and stress the need for a clearly visible and solid classification complexity estimator, to the extent possible while using our ad hoc approach described below. We begin by considering the theoretical aspects of the complexity concept. Considering the given aggregation, this concept can be stratified from the simplest level to the most complex:
1. Static complexity: the simplest form of complexity, which relates to static systems and is generally studied by scientists using well-developed mathematical tools (Lucas 2000). For example, this form of complexity is studied by such techniques as Algorithmic Information Theory (Chaitin 2005).
2. Dynamic complexity: extends static complexity by adding the dimension of time, which can improve or worsen the static situation (Wolfram 1994). Given the interest in experimental repeatability in science, observing dynamics and measuring this type of complexity (Lucas 2000) is one way to exhibit the phenomenon of complexity, because complexity as a phenomenon is born not statically, but dynamically (Appendix B).
3. Evolving complexity: relates to systems that evolve through time into different systems (Wolfram 1994); the best-known class of this phenomenon is usually described as organic: open-ended mutation. Although all such systems are unique, there are symmetries present in their arrangements that allow one to measure these systems. For example, it is possible to analyze the complexity of an evolving system from an evolutionary viewpoint as a set of specific, already investigated parts or patterns (e.g. DNA code) that can also have numerous combinations that have not yet occurred and thus have not been studied (Lucas 2000).
4. Complexity of self-organizing systems: according to Lucas (Lucas 2000), this is the self-maintaining type of system that, operating at the edge of chaos, aggregates in nonlinear ways the structures and a complex mix of types 1-3 mentioned above (Wolfram 1994).
Let us mention that the categorization of the complexity concept into types 1-4 is performed by Lucas (Lucas 2000) in a very informal way, but this is not the fault of the author: even the simplest, static type of complexity, which employs precise mathematical tools, does not have a common definition (Saakian 2004). Nevertheless, this work (Lucas 2000) is useful to introduce a common context for the phenomenon of complexity. Therefore, before applying a quantitative technique for estimating complexity (meaning that we are concerned with the static type of complexity), we need to decide whether the systems are, in fact, complex in any of the senses mentioned above (Bak 1996). However, deciding and determining complexity is typically done in an informal form of comparison; it means that the whole spectrum of complexity assessment statements will be of the form "system x … is more complex than system y" (Edmonds 1999). Also, there is some point of transition from "simple" to "complex"; the assumed nature of this point further complicates the formalization of complexity estimations.
Therefore, we will provide a pragmatic solution for describing categories of complexity. The philosophical issues related to complexity will henceforth be ignored. To classify different types of complexity, we define the following criteria, which we later use to organize the concepts into several groups:
• Criterion of size: for example, the size of a genome, or the number of species in a biosphere. Size can be an indicator of difficulty, but for a strong definition of complexity this criterion alone is not sufficient.
• Minimum description length criterion: based on Kolmogorov's idea of complexity as the minimum possible length of a description in some language (usually that of a Turing machine) (Shalizi 2005), (Chaitin 2005). We discuss this criterion in more detail later in this chapter.
• Criterion of variety: the variety of the basic components of a concept. Variety is the key point of evolutionary processes. For example, human teeth, considering their organization and functions, are more complex than shark teeth regardless of quantity (Edmonds 1999), (Lucas 2000). Variety is a necessary feature of complexity but is not sufficient for it.
• (Dis)Order: complexity is a mid-point between order and disorder (in the broad meaning of these terms) (Permana 2003).
There are certain difficulties in applying the listed criteria to building a common, solid complexity hierarchy; most notably, the criteria originate from absolutely different fields, and these origins cannot be ignored. Thus, we are going to rely on the heuristic inventory made in Horgan's work (Horgan 1995). In his survey, Horgan analyses more than 30 different ways of categorizing complexity; in this work, we will consider four of them: algorithmic complexity, computational complexity, entropic complexity and grammatical complexity.
In the next section we describe the most important of the four considered complexity categories, the computational complexity. Based on the structure and complexity of given data, we propose a combined, context-dependent measure of the complexity of the associated computations. We describe our contribution of a novel classification complexity estimation technique, named the ANN-structure based classification complexity estimator, and we explain its connections to other complexity estimation approaches according to the proposed classification complexity measurement hierarchy.
II.2 Computational complexity measurement
Computational complexity as a discipline represents an outstanding research area. This subject lies at the interface between mathematics and theoretical computer science, with a clear mathematical profile and a strictly mathematical format. Neighbouring fields of study inside theoretical computer science are the analysis of algorithms and computability theory. The key distinction between computational complexity theory and the analysis of algorithms is that the latter is dedicated to analysing the amount of resources needed by a particular algorithm to solve a concrete problem, whereas the former asks a more general question: what kind of problems can be solved at all (within a given computational model)?
There are other measures of computational complexity, such as communication complexity in distributed computation or, with connections to hardware design, circuit complexity. Informally, a computational problem is regarded as inherently difficult if solving it requires a large amount of resources, independently of the algorithm used for solving it.
Computational complexity theory formalizes this intuition by introducing mathematical models of computation. Therefore, the following section is dedicated to the algorithmic aspect of computational complexity.
II.2.1 Complexity, randomness and computability
Computational problems can be classified by the time it takes for an algorithm - usually a computer program - to solve them as a function of the problem's size. Some problems are difficult to solve, while others are easy. For example, some difficult problems need algorithms whose running time is exponential in the size of the problem, such as the travelling salesman problem.
From the algorithmic part of computational complexity theory, complexity relates to the time, or the number of steps, that it takes to solve an instance of the problem as a function of the size of the input (usually measured in bits), using the most efficient algorithm (if such efficiency can be proved) (Sipser 2005).
The question of whether NP is the same set as P is one of the most important open questions in theoretical computer science, due to the wide implications a solution would present. Most scientists tend towards a negative answer to this question (Boolos, Burgess and Jeffrey 2002). This corresponds with the unproven Church-Turing thesis, which claims the equivalence of all reasonable computing machines in terms of theoretical computational power; the thesis states that there is no way to build a computation device that is more powerful than the Turing machine. Thus, for NP-problems there is no alternative to a deep investigation of NP-problems (or their sub-classes), where one may find new, specific methods for resolving a given problem. Another alternative is new computational models (Blass and Gurevich 2003); for example, some NP-problems could in principle be addressed using a quantum model of computation (Gershenfeld and Chuang 1998).
We also have to mention that the "P=NP?" question, with its trend towards P≠NP, is linked with "real" randomness in the sense of Chaitin (Chaitin 2005). Thus, for a random binary sequence of 0s and 1s, the inability to find a compressing algorithm/machine represents complexity in the sense of Kolmogorov (Rubin and Trajkovic 2001). Because such sequences may represent not only a relation between input and output but a program or algorithm itself, Chaitin calculated, for all random programs represented by these random sequences, the probability of halting, Ω³. As the halting problem for Turing machines is related to "P=NP?" and the halting problem is unsolvable in Turing's terms, Chaitin comes to the conclusion that real randomness is another form of computational complexity.
Taking into account that the basis of our intent to define an applied computational complexity measure is the deterministic Turing machine, where uncomputability - the inability to resolve the P≠NP problem - is linked with randomness, and taking also into consideration that computability (algorithmic computability) is related to the number of algorithm iterations required to solve the posed problem and that this characteristic is a measurable quantity, the following section gives a brief overview of measurable computational complexity classes in order to establish the hierarchy of classification complexity estimators and their relations to more general/abstract types of complexity measures.
II.2.2 Kolmogorov related complexity measures
Among the large variety of complexity measures proposed nowadays, one may find no relation between them, because they are usually special quantifications for applied usage (Shalizi 2005). However, all of them have a common base. Thus, computational complexity is the amount of computational resources (usually time or memory) that it takes to solve a class of problems; the difficulty here is the limited supply of these resources once the appropriate program is supplied. This is now a very well studied measure. For our purposes, this is a weak definition of complexity as applied to evolving entities, as the time to perform a program or the space that the program takes is often not a very pressing difficulty compared with the problem of providing the program itself (through evolution) (Lucas 2000). The following list of complexity estimators is purely resource (defined instance) based.
The classical computational complexity measure mentioned above was introduced by Kolmogorov (Shalizi 2005); it is roughly the minimum length of a Turing machine program needed to generate a binary sequence (Edmonds 1999). This quantity is in general incomputable (Chaitin 2005), in the sense that there is simply no algorithm which will compute it. This follows from the halting problem, i.e. from the other form of Church's thesis (Church 1936), which in turn is a disguised form of Gödel's theorem (Godel 2001), so this limit is not likely to be broken any time soon. Moreover, Kolmogorov's complexity is maximized by random strings (Chaitin 2005), so it cannot really justify what is random, which is why it has gradually come to be called algorithmic information (Shalizi 2007). Nevertheless, the ideas behind the formalism describing complexity in Kolmogorov's way - relying on its main parameter, the minimum length of an algorithm - can be used for creating different types of practical approaches, where the complexity might be measured in context (Green and Newth 2001). One application example is grammar complexity.
³ In the subfield of algorithmic information theory, Chaitin's construction (Chaitin's constant, or halting probability) is a real number that informally represents the probability that a randomly chosen program will halt.
II.2.2.1 Grammar complexity
The Minimum Grammar Complexity Criterion (grammar complexity) is a formalism used for the description of structural relationships (Young and Fu 1986). The goal of grammar complexity is to give a quantitative characterization of communication. The measure is said to be an upper bound of complexity, not the real complexity value, because there is no guarantee that it is the minimum description of a sequence (Permana 2003).
Let µ be the length of a segment of the code/program or input. We start with segments of length µ = 2:
Step 1: search for the most frequent µ-tuple in the sequence.
Step 2: replace the most frequent µ-tuple by a new symbol.
Step 3: increase the length µ of the tuple by 1.
Step 4: repeat Steps 1 to 4 until no replacements can be performed.
When the original sequence has been compressed into another sequence of symbols, the complexity is computed as follows: it is the sum of the length of the new sequence plus the length of all symbols used to recode the sequence, without counting the repetition of symbols; if symbols are repeated in the recoding, their logarithm is added to the complexity value. The problem with this procedure relates to the lack of bounds or comparisons for the measure. One does not really know what a complexity of 5 or 20 means - is it small or large? What does a disordered system typically exhibit? This makes the measure difficult to apply in real situations. Another limit concerns the algorithm used to compute the complexity. Starting with pairs and then moving to triplets, quadruples and so on is logical, but what if a system exhibits a 3-period cycle? Replacing pairs before triplets completely destroys the structure of the original sequence; replacing triplets would be a far more efficient procedure. In fact, a hybrid procedure, combining this approach with entropy, would be better.
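As a rough illustration of the procedure described above, the following Python sketch implements Steps 1-4 literally; the scoring rule is a simplified reading of the description (compressed length plus total rule length) and drops the logarithmic term for reused symbols.

from collections import Counter

def grammar_complexity(seq):
    # Repeatedly replace the most frequent mu-tuple by a fresh symbol (Steps 1-2),
    # grow mu by one (Step 3), and stop when no replacement is possible (Step 4).
    seq, rules, mu, k = list(seq), {}, 2, 0
    while mu <= len(seq):
        counts = Counter(tuple(seq[i:i + mu]) for i in range(len(seq) - mu + 1))
        best, n = counts.most_common(1)[0]
        if n < 2:                                  # no repeated mu-tuple left
            break
        symbol = "S%d" % k; k += 1
        rules[symbol] = best
        out, i = [], 0
        while i < len(seq):                        # rewrite the sequence
            if tuple(seq[i:i + mu]) == best:
                out.append(symbol); i += mu
            else:
                out.append(seq[i]); i += 1
        seq, mu = out, mu + 1
    # Simplified score: compressed length plus the length of each distinct rule body.
    return len(seq) + sum(len(body) for body in rules.values()), rules

score, rules = grammar_complexity("abababababab")  # small usage example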
Grammar complexity is an illustration of the use of Kolmogorov's approach. It plays a very important role in every discussion of measuring complexity, but it is often put aside as useless for any practical application (Shalizi 2007). Generally speaking, Kolmogorov-style complexity measures involve finding some computer or abstract automaton which will produce the pattern of interest. Bennett's Logical Depth is an instance of this tendency.
II.2.2.2 Bennett's Logical Depth
Bennett's Logical Depth is the running time of the shortest program (Bennett 1988). It is a measure based on computational resources (especially time) that calculates the time needed to obtain the results of a program of minimal length (Shalizi 2007). Bennett also uses it to formalize the level of organization in systems. All present-day organisms can be viewed as the result of a very long computation from an incompressible program and are thus, by this definition, complex (Edmonds 1999). The principal disadvantage of this definition is that it measures the process, but not the results. The next measure, Lofgren's Interpretation and Descriptive complexity, is better than Bennett's Logical Depth in terms of formalizing the complexity of self-organizing systems.
II.2.2.3 Lofgren's Interpretation and Descriptive Complexity
In his work (Lofgren 1973), Lofgren describes complexity through two processes, interpretation and description (Fig. II.1). The interpretation process is the translation from the description to the system, and the descriptive process works the other way around (Edmonds 1999).
For example, the description could be the genotype and the system the phenotype. The interpretation process would then correspond to the decoding of the DNA into the effective proteins that control the cell, and the descriptive process to the result of reproduction and selection of the information encoded there. Löfgren then goes on to associate descriptive complexity with Kolmogorov's complexity, and interpretational complexity with logical strength and computational complexity.
Fig. II.1 : Description and interpretation process
Kauffman's number-of-conflicts measure is a development of the idea of measuring the self-organizing process, because the self-organizing ability is an essential attribute of complexity.
II.2.2.4 Kauffman's number of conflicting constraints
Kauffman's definition (Kauffman 1993) is less concerned with Kolmogorov's complexity, but it introduces a working definition of complexity for a formal self-organizing model (Kauffman 1993). His principal idea can reflect the complexity of a self-organizing process.
Kauffman defines complexity as the "number of conflicts", or more precisely the "number of conflicting constraints". This definition represents the difficulty of specifying a successful evolutionary (and not only evolutionary) process that meets the imposed constraints, but it is hard to apply to some real-world problems, because the measurement of conflicts is a relative issue.
Summarizing, we highlight that Kolmogorov's complexity and the Kolmogorov-related complexities mentioned here are too theoretical, or too relative, to be easily applicable in practice. Dealing with the more pragmatic Information theory allows us to use explicit methods for computing classification complexity.
II.2.3 Information based complexity measures
Information theory is a branch of applied mathematics concerned with quantifying the amount of information, originally for coding and encoding. This theory was developed in order to find fundamental limits on compressing, reliably storing and communicating information data. Since its inception, it has been broadened to find applications in many other areas, including the machine learning field.
II.2.3.1 Shannon’s entropy based measures
Beginning with the fundamental Shannon entropy measure, the role of Information theory was at first very narrow. It was a subset of communication theory, the main purpose of which was to answer two fundamental questions:
1. What is the ultimate data compression that can be applied to a signal?
2. What is the ultimate transmission rate of signals on a wire?
But the mathematical techniques developed after Shannon's pioneering work so fruitfully that they were applied in various fields of investigation, including the Theory of Complexity. Entropy can be used for measuring the complexity of classification tasks because, according to the theory, it captures the characteristics of probabilistic models. It can be defined as follows:
Let x be a discrete variable which may take the values x1, …, xl, i = 1 … l, where l is the maximal number of possible values. To the i-th value xi of variable x is assigned the probability pi. Shannon's entropy then measures complexity as:
H(x) = -\sum_{i=1}^{l} p_i(x)\,\log_2 p_i(x)    (II.1)
The entropy varies from 0 to log2(l). An additional convention has to be adopted for the cases where pi(x) = 0: the corresponding term pi(x) log2 pi(x) is then set equal to 0. In the case of a uniform distribution the entropy is maximal, H(x) = log2(l). It is obvious that the greater the number of possible states, the greater the entropy. In classical works from information theory, entropy signifies the average amount of information required to select observations by categories (Krippendorff 1986).
Entropy may be standardized so that it ranges from 0 to 1, by dividing it by its maximum log2(l). This makes it easier to compare the amount of disorder of two systems when one system can take more states than the other.
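As a minimal numerical illustration of equation II.1 and of this standardized form (assuming a discrete probability vector as input), one might write:

import numpy as np

def shannon_entropy(p, normalize=False):
    # H(x) = -sum_i p_i log2 p_i (Eq. II.1); terms with p_i = 0 contribute 0.
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    h = -np.sum(nz * np.log2(nz))
    return h / np.log2(len(p)) if normalize else h

shannon_entropy([0.25, 0.25, 0.25, 0.25])               # maximal: log2(4) = 2 bits
shannon_entropy([0.7, 0.1, 0.1, 0.1], normalize=True)   # standardized to [0, 1]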
A nice property of Shannon's entropy is that the categories of the variable may be permuted without changing its value: only the relative frequencies matter. This is why the measure is said to be content-free. It does not make any assumptions about the distribution of the data; it thus belongs to the nonparametric family of statistical methods.
Entropy is interpreted in many different ways. It is a measure of the uncertainty tied to the observed system: the lower the entropy, the easier it is to predict the system's state, and conversely. It may also be interpreted as a measure of the disorder of a system or, in a very similar fashion, a measure of its variability; again, the lower the entropy, the more orderly the system, and conversely. An important note here is that we have to take into consideration the source/analyzing tool which provides the distribution pi(x) that describes the system in terms of the mentioned variables.
Shannon's entropy works very well for describing the order, uncertainty or variability of a single variable, but when we deal with more than one variable the following joint entropy is used. There are various entropies when considering two or more variables together: the joint entropy, the mutual information and the conditional entropy.
When considering two discrete variables x and y at the same time, it is possible to measure the degree of uncertainty or information associated with them. This is called the joint entropy, H(x,y). If the variables x and y may respectively take at most l1 and l2 possible values, with joint probabilities pij, the joint entropy is computed as:
H(x,y) = -\sum_{i=1}^{l_1} \sum_{j=1}^{l_2} p_{ij}(x,y)\,\log_2 p_{ij}(x,y)    (II.2)
where pij represents the probability of being classified in both category i of variable x and category j of variable y. The joint entropy varies from a theoretical 0 (empirically, from min{H(x), H(y)}) to log2(l1) + log2(l2). The relation between the individual entropies and their joint entropy is given by:
H(x,y) \leq H(x) + H(y)    (II.3)
It expresses the fact that the joint entropy is never larger than the sum of the individual entropies; equality in II.3 holds only when the two variables are independent. Despite the similar notation, the joint entropy should not be confused with the cross entropy.
Not only can the information of two variables be measured as a whole; one can also measure the amount of information of one variable knowing the other. This is called the conditional entropy.
This type of entropy relies on conditional probabilities, also called transitional probabilities. Suppose we want to compute the conditional probability of state i of variable x given state j of variable y; this is written as p(x|y) and is different from the joint probability. The conditional entropy is then:
H(x|y) = -\sum_{j=1}^{l_2} p(y_j) \sum_{i=1}^{l_1} p(x_i|y_j)\,\log_2 p(x_i|y_j) = -\sum_{i=1}^{l_1} \sum_{j=1}^{l_2} p_{ij}(x,y)\,\log_2 \frac{p_{ij}(x,y)}{p(y_j)}    (II.4)
The relationship between conditional entropy and joint entropy is as follows:
H(x|y) = H(x,y) - H(y)    (II.5)
The conditional entropy quantifies a reduction of uncertainty: the lower the conditional entropy, the better an observer can predict the state of a variable, knowing the state of the other variable.
Contrary to the joint entropy, the conditional entropy is not a symmetrical measure: H(y|x) ≠ H(x|y). Conditioning on one variable or the other does not give the same result, because each variable has its own entropy, H(x) and H(y).
The conditional entropy is the information particular to one variable, while the joint entropy is the total information carried by the two variables together.
Another commonly used Shannon-based measure is the mutual information. It measures the information shared by the variables, i.e. the quantity of information that an observer finds in common between two (or more) variables. For two variables the general formula is given as:
I(x,y) = \sum_{i=1}^{l_1} \sum_{j=1}^{l_2} p_{ij}(x,y)\,\log_2 \frac{p_{ij}(x,y)}{p_i(x)\,p_j(y)}    (II.6)
There are many ways of expressing this formula. It can be expressed as a relation between the individual entropies and the joint entropy: it is the sum of the individual entropies minus the joint entropy, as expressed by:
I(x,y) = I(y,x) = H(x) + H(y) - H(x,y)    (II.7)
In the case of three variables, the equation becomes:
I(x,y,z) = H(x) + H(y) + H(z) - H(x,y) - H(x,z) - H(y,z) + H(x,y,z)    (II.8)
When two variables are independent, the sum of their individual entropies is equal to the joint entropy and the mutual information is equal to zero.
Therefore, the best measure of the proximity between variables is the mutual information. It was shown in (Lemay 1999) that the mutual information is related to the likelihood ratio Λ by the following relation, where l is the number of observations:
\Lambda(x,y) = 2\,l\,\ln(2)\,I(x,y)    (II.9)
This is an important fact, since it links information theory with the statistical use of probability theory: the greater the mutual information, the more similar the two variables.
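The relations II.2, II.4, II.5 and II.7 can be checked numerically; the sketch below assumes the joint distribution is given as a matrix p_ij and derives all the quantities from it.

import numpy as np

def entropies(pxy):
    # Returns H(x), H(y), H(x,y), H(x|y) and I(x,y) from a joint distribution p_ij,
    # illustrating Eqs. II.2, II.4, II.5 and II.7.
    pxy = np.asarray(pxy, dtype=float)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    H = lambda p: -np.sum(p[p > 0] * np.log2(p[p > 0]))
    Hx, Hy, Hxy = H(px), H(py), H(pxy.ravel())
    return Hx, Hy, Hxy, Hxy - Hy, Hx + Hy - Hxy     # H(x|y) = H(x,y) - H(y)

# Example: two weakly dependent binary variables.
Hx, Hy, Hxy, Hx_given_y, Ixy = entropies([[0.4, 0.1],
                                          [0.1, 0.4]])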
II.2.3.2 Relative entropies
Another group of Shannon-like entropy methods is called the relative entropies. They are also known in the literature as divergences or distances. In probability theory, this type of measure belongs to the class of f-divergences, because it measures the difference between two probability distributions. Interestingly, these measures can be used not only in signal processing, but also in the analysis of contingency tables and particularly in pattern recognition (Lin 1991).
Suppose we compare two distributions: a "true" probability distribution p(x), and an
arbitrary probability distribution q(x). The relative entropy, information gain or Kullback-
Leibler's I-directed divergence (Lin 1991) formula is given by:
KL(p,q) = \sum_{i=1}^{l} p_i(x)\,\log_2\!\left(\frac{p_i(x)}{q_i(x)}\right)    (II.10)
where l is the maximal index of levels of the variables.
The relative entropy is non-negative and equal to 0 if both distributions are equivalent (p = q). The smaller the relative entropy, the more similar the distributions of the two variables, and conversely. It has to be noted that the measure is asymmetrical: the distance KL(p,q) is not equal to KL(q,p). If the distributions are not too dissimilar, the difference between KL(p,q) and KL(q,p) is small, and the distance is then equivalent to the χ2-statistic (relative to the sample size). To gain the property of symmetry, Kullback and Leibler actually define the divergence as:
KL(p,q) + KL(q,p) = \sum_{i=1}^{l} \left(p_i(x) - q_i(x)\right)\,\log_2\!\left(\frac{p_i(x)}{q_i(x)}\right)    (II.11)
An alternative approach is given via a variable 0 < λ < 1, defining the λ-divergence D_λ:
D_\lambda(p,q) = \lambda\, KL(p,\ \lambda p + (1-\lambda) q) + (1-\lambda)\, KL(q,\ \lambda p + (1-\lambda) q)    (II.12)
When λ = 0.5, D_λ becomes the Jensen-Shannon divergence (Fuglede and Topsoe 2004):
JSD(p,q) = \frac{1}{2} KL\!\left(p, \frac{p+q}{2}\right) + \frac{1}{2} KL\!\left(q, \frac{p+q}{2}\right)    (II.13)
or, in another form:
JSD(p,q) = \frac{1}{2} \sum_{i=1}^{l} \left[ p_i(x)\,\log_2\!\left(\frac{2 p_i(x)}{p_i(x)+q_i(x)}\right) + q_i(x)\,\log_2\!\left(\frac{2 q_i(x)}{p_i(x)+q_i(x)}\right) \right]    (II.14)
For classification problems, the Jensen-Shannon divergence performs better as a feature selection criterion for multivariate normal classes (Richards and Xiuping 2005). In fact, it provides both a lower and an upper bound for the Bayes probability of misclassification error ε (the importance of such approximations is outlined in Section II.2.4.1). This makes it particularly suitable for the study of decision problems (Lin 1991).
The Jensen-Shannon divergence is the square of a metric that is equivalent to the Hellinger distance:
HD(p,q) = \sqrt{\frac{1}{2} \sum_{i=1}^{l} \left(\sqrt{p_i(x)} - \sqrt{q_i(x)}\right)^2}    (II.15)
HD(p,q) satisfies the property 0 ≤ HD(p,q) ≤ 1. The Hellinger distance is equivalent to the Jensen-Shannon divergence, and the latter is also equal to one half of the so-called Jeffreys divergence, or Jeffreys-Matusita distance measure (Matusita and Akaike 1956):
JMD(p,q) = \sqrt{\sum_{i=1}^{l} \left(\sqrt{p_i(x)} - \sqrt{q_i(x)}\right)^2}    (II.16)
Equations II.12 - II.16 have been developed in order to detect small differences between two probability distributions. It is important to note that the Hellinger distance is related to another distance-based measure, the Bhattacharyya distance:
BD(p,q) = -\ln\!\left(\sum_{i=1}^{l} \sqrt{p_i(x)\, q_i(x)}\right)    (II.17)
where the expression under the logarithm is the Bhattacharyya coefficient. As in the cases of II.16 and II.17, the Bhattacharyya coefficient can be used to determine the relative closeness of two samples. It is shown in (Theodoridis and Koutroumbas 2006) that it corresponds to the optimum Chernoff bound, which is a more general case of the Bhattacharyya distance (Chernoff 1966).
It has also been theoretically demonstrated that the Bhattacharyya distance becomes proportional to the Mahalanobis distance I.6 (Theodoridis and Koutroumbas 2006).
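For two discrete distributions, equations II.10, II.14, II.15 and II.17 can be computed directly; the following sketch is a straightforward numerical transcription (the small eps only guards against log(0) and division by zero).

import numpy as np

def divergences(p, q, eps=1e-12):
    # Kullback-Leibler (II.10), Jensen-Shannon (II.14), Hellinger (II.15)
    # and Bhattacharyya (II.17) for two discrete distributions p and q.
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    kl = np.sum(p * np.log2(p / q))
    m = 0.5 * (p + q)
    jsd = 0.5 * np.sum(p * np.log2(p / m)) + 0.5 * np.sum(q * np.log2(q / m))
    hd = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
    bd = -np.log(np.sum(np.sqrt(p * q)))
    return kl, jsd, hd, bd

kl, jsd, hd, bd = divergences([0.5, 0.3, 0.2], [0.3, 0.3, 0.4])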
Equations II.6 - II.17 originate from information theory; applied to the narrower sphere of statistical classification, they suit well the approximation of the Bayes error ε. In the form given above, they present an acceptable (from the point of view of computational/algorithmic difficulty) solution for two-class classification problems. However, in reality we deal with multi-class problems, and in this case a common solution (Swain and King 1973) is to use an average, aggregated value over all possible pairs of classes. For II.11, let K_{p,q} represent KL(p(x|a_p), q(x|a_q)), where a_p and a_q are class labels. If the maximal number of classes is l, the aggregated ratio of class separability (complexity - a more detailed description is given in Section II.2.4.1) for a multi-class classification problem is computed as:
K_{avr} = \frac{1}{l(l-1)} \sum_{p=1}^{l-1} \sum_{q=p+1}^{l} K_{p,q}    (II.18)
Similar to equation II.18 (Swain and King 1973) is equation II.19, which uses the frequently employed sum approach (Shi, Shu and Liu 1998); for example, defining JSD_{p,q} analogously to K_{p,q}, we obtain the following aggregation ratio:
JSD_{sum} = \sum_{p=1}^{l-1} \sum_{q=p+1}^{l} JSD_{p,q}    (II.19)
We have to mention that computing the final ratio of classification complexity in this way (II.18-II.19) has one disadvantage: when the pairwise values K_{p,q} or JSD_{p,q} vary much, the aggregated ratio is not representative. Equation II.20 represents the approach for obtaining an aggregated value that has been implemented in the new T-DTS version (Chapter IV):
JSD_{global} = \min_{p=1,\ldots,l-1} \left( \min_{q=p+1,\ldots,l} JSD_{p,q} \right)    (II.20)
In equation II.20, the aggregated complexity ratio is represented by the most difficult case of class separability. Thus, a value of zero indicates that there is at least one pair of classes with similar distributions p(x|c_p) and q(x|c_q) among all possible pairs. We also have to mention that the Kullback-Leibler relative entropy approaches are just a particular group of Renyi's entropies (Renyi 1960), which belong to a general family of functionals of order α for quantifying the diversity, uncertainty or randomness of systems.
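The three aggregation schemes II.18 - II.20 can be sketched as follows for a set of per-class distributions; the jsd helper repeats equation II.14, and the class distributions in the usage example are purely illustrative.

import numpy as np
from itertools import combinations

def jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence of two discrete distributions (Eq. II.14).
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log2(p / m)) + 0.5 * np.sum(q * np.log2(q / m))

def aggregate_pairwise(class_distributions):
    # Average (Eq. II.18), sum (Eq. II.19) and min-of-pairs (Eq. II.20)
    # of the pairwise separability values over all class pairs.
    labels = list(class_distributions)
    values = np.array([jsd(class_distributions[p], class_distributions[q])
                       for p, q in combinations(labels, 2)])
    l = len(labels)
    return values.sum() / (l * (l - 1)), values.sum(), values.min()

avg, total, worst = aggregate_pairwise({"c1": [0.6, 0.3, 0.1],
                                        "c2": [0.2, 0.5, 0.3],
                                        "c3": [0.1, 0.2, 0.7]})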
Taking into consideration this clue about the foundations of information theory, and more precisely the f-divergence aspect of complexity estimation measures, the complexity estimating techniques defined in II.11 and II.14 - II.17 are used in the current T-DTS version (Chapter III).
The next sub-section briefly highlights the main disadvantages of these Shannon-based complexity estimation measures grounded on Information theory.
II.2.3.3 Limits of the information theory to complexity estimation
Information-theoretical analyses consider variables as nominal ones, so any analysis dealing with truly quantitative variables may suffer a loss of information and power.
Another, similar point is the loss of meaning in the variables, especially when they are aggregated (II.18 - II.19), because these measures are very general and dedicated to the treatment of any information, whereas we deal with the more specific statistical classification case.
In the more general example of linear system modelling, when a correlation is computed the direction of the relationship is known: if the sign is positive the two variables vary together, and when it is negative they vary in opposite directions. Analyses of categorical variables seldom show the direction: for example, a high mutual information does not tell for which categories there is a strong association. But this is a problem of categorical variables, not just of the information-theoretical approach.
To make our view of the problem of defining complexity more closely related to classification problems within the frame of the T-DTS concept, we need to use data complexity measurement for the classification process in Machine Learning.
II.2.4 Complexity measurement in machine learning
In this section we investigate the role of data complexity in the context of classification problems. The classical Kolmogorov complexity is closely related to several existing universal principles, used not only in machine learning, such as Occam's razor of the simplest possible theory (Chaitin 2005) (the minimum description length) and the Bayesian approach. The difficulty of implementing these universal principles in terms of the Turing machine model drives us to the idea that data complexity should be defined based on a learning model, which is a more realistic approach.
In this context, the successful development of the T-DTS concept, including the implementation of real-world applications for classification tasks, is our particular aim. Therefore, the group of complexity measures overviewed in the previous section, used for more global problems such as data transmission, might also be applied to classification (labelled data) problems.
In practice, we approximate the data complexity measures given above for classification problems. The following section describes a more utilitarian range of methods focused on estimating the complexity of classification problems only.
II.2.4.1 Class separability measures: Bayes error
While the definition of an optimal complexity value for a problem, including classification tasks, is often difficult (Ho 2001), this difficulty does not arise from the lack of an appropriate metric (Ho 2002). Since the objective of pattern recognition problems is to minimize the number of classification errors, also called the error rate ε, the staple stone of statistical classification is the minimum error rate over the set of all classifiers (Ho and Baird 1994). This error is known as the Bayes error, ε. Thus ε becomes the proper measure for evaluating a given database of vectors S and their class-label set C for classification (Cacoullos 1966).
If ε were easy to estimate, the classification complexity problem would admit an alternative approach that minimizes ε (Ho and Basu 2002). Unfortunately, direct estimation of ε is usually difficult (Fukunaga 1972), (Young and Fu 1986), (Devroye 1987), (Therrien 1989). However, for purely scientific (not industrial) testing needs, we have implemented it in the last version of T-DTS using the highly time-consuming algorithm expressed by equation II.21.
Examination of the governing equations for ε shows where the difficulty of estimating ε arises. The misclassification rate for l classes (i being the class index), where a data vector s is represented as s = [s1, …, sdim]^T, is:
\varepsilon = \sum_{j=1}^{dim} \left(1 - \max_i p(c_i|s_j)\right) p(s_j)    (II.21)
where p(c_i|s_j) is the posterior probability and p(s_j) is the probability of attribute j of the data vector s, which is defined as:
p(s_j) = \sum_{i=1}^{l} p(s_j|c_i)\, p(c_i)    (II.22)
where p(s_j|c_i) is the probability that the j-th attribute of vector s belongs to class c_i, and p(c_i) is the class prior. Through Bayes' theorem, the posterior probability function is related to the class-conditional density by:
p(c_i|s_j) = \frac{p(s_j|c_i)\, p(c_i)}{p(s_j)}    (II.23)
The estimation of ε is difficult for the three following reasons: density estimation of p(s_j|c_i) is an ill-posed problem (Devroye 1987), the difficulty of numerical integration increases with dimensionality, and the class probabilities p(c_i) are needed.
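For a single discrete attribute with known class-conditional distributions and priors, equations II.21 - II.23 reduce to a few array operations; the toy two-class example below is only meant to illustrate the computation.

import numpy as np

def bayes_error(class_conditionals, priors):
    # eps = sum_j (1 - max_i p(c_i|s_j)) p(s_j)   (Eq. II.21)
    pc = np.asarray(priors, float)                        # p(c_i)
    psc = np.asarray(class_conditionals, float)           # rows: p(s_j | c_i)
    ps = psc.T @ pc                                       # p(s_j), Eq. II.22
    post = (psc.T * pc) / ps[:, None]                     # p(c_i | s_j), Eq. II.23
    return float(np.sum((1.0 - post.max(axis=1)) * ps))

# Two classes, one attribute taking three discrete values, equal priors.
eps = bayes_error([[0.7, 0.2, 0.1],     # p(s | c1)
                   [0.1, 0.3, 0.6]],    # p(s | c2)
                  [0.5, 0.5])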
Concerning the last reason, in many applications the p(c_i) values are unknown and must be approximated. Several techniques have been developed to circumvent the difficulties associated with direct estimation of ε, or that are not related to ε at all. The general taxonomy of the most commonly used techniques is:
• Indirect Bayes error estimation: mathematically based measures,
• Non-parametric Bayes error estimation and bounds,
• Intuitive measures.
Our main interest in estimating classification complexity is related to this last group, named Intuitive measures. It contains the following types of measurement: interclass distance measures (Fukunaga 1972), boundary methods (Sancho et al. 1997), space partitioning methods (Singh 2003), other techniques (Chen 1976), (Young and Fu 1986), and ad hoc measures (Ho 2000). The "other techniques" group exists because this family of methods is not yet well developed. Our research interest addresses the group of ad hoc methods, where the following trends can be named: length of class boundary (Friedman and Rafsky 1979), space covering by ε-neighbourhoods (Le Bourgeois and Emptoz 1996), feature efficiency (Ho and Baird 1998), hyperplane-number based complexity estimator (Zhao and Wu 1999), and ANN based complexity estimator (Budnyk, Chebira and Madani 2007).
As mentioned, our method of complexity estimation can be ranked with the ad hoc type of methods, which belong to the upper group of Intuitive measures; the latter is one of the three main directions of estimating class separability (classification complexity). We briefly overview this category of methods below.
II.2.4.1.1 Indirect Bayes error estimation: mathematically based measures
This group of measures is easier to calculate. Some of these measures bound ε, while others are justified by other considerations. The calculation of II.10 - II.20 can be simplified when the distributions are normal (Theodoridis and Koutroumbas 2006), but for the universal T-DTS concept, which deals with a priori unknown classification problems, this assumption cannot be made. Thus, the high computational cost of these measures, especially in cases of high-dimensional S and increasing numbers of classes, makes them difficult to use (Rybnik 2004), (Bouyoucef 2007). Of course, some simplifications unrelated to the Information-theory results II.10 - II.20 can be used for calculating an indirect Bayes error. One popular example is the normalized mean distance.
This measure has been used as a complexity estimator in T-DTS; it is shown (Rybnik 2004) to be a computationally cheap procedure. For two-class problems represented by vector instances s_1 ∈ S and s_2 ∈ S respectively, we have:
NDM_j(s_1(j), s_2(j)) = \frac{\left| \bar{s}_1(j) - \bar{s}_2(j) \right|}{\sigma_{s_1}(j) + \sigma_{s_2}(j)}    (II.24)
where j (1…dim) is the index over the feature space, \bar{s}_i(j) is the mean of the j-th component of the class-i vectors, and \sigma_{s_i}(j) is the corresponding variance. The aggregation of NDM_j over all j, and over class pairs in the multi-class case, into a final NDM(s_1, s_2) ratio is resolved in the T-DTS application in the way proposed by equation II.20.
We should mention here that II.24 is inadequate as a measure of class separability in cases where the classes are not distant, and especially when both classes have the same mean values. However, because of the low computational cost of this measure, we have mainly used this criterion for testing purposes.
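A direct transcription of equation II.24 (with the denominator taken, as stated above, as the sum of the per-feature class variances, plus a small eps against division by zero) and the min-aggregation of II.20 could look as follows; the random two-class data in the usage example are illustrative only.

import numpy as np

def normalized_mean_distance(S1, S2, eps=1e-12):
    # Per-feature normalized mean distance (Eq. II.24) between two classes,
    # aggregated by keeping the worst (smallest) feature, in the spirit of Eq. II.20.
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    ndm = np.abs(S1.mean(axis=0) - S2.mean(axis=0)) / (S1.var(axis=0) + S2.var(axis=0) + eps)
    return ndm, ndm.min()

ndm_per_feature, ndm_global = normalized_mean_distance(
    np.random.randn(50, 3) + 2.0,    # class 1 samples
    np.random.randn(60, 3))          # class 2 samples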
II.2.4.1.2 Non-parametric Bayes error estimation and bounds
The two main non-parametric techniques used for density estimation are k-Nearest Neighbours (kNN) and Parzen windows (Parzen 1962). Both techniques rely on the same underlying concept of setting a local region Г(s) around each sample s and examining the ratio of the samples enclosed to the total number of samples, normalized with respect to the volume v(s) of the local region. Let the resolution parameter B determine the division of the given initial space into Г(s) regions; v(s) then depends on the inner parameter of Г(s) and on the global resolution B.
At this point we come to the difference between the realizations of the Parzen and kNN techniques: Parzen fixes the volume v(s) of the local region, while kNN fixes the number of samples enclosed by the local region. Both ε-estimates are asymptotically unbiased and consistent (Fukunaga 1972), (Chernoff 1966). To provide an asymptotic bound of the Bayes error, the following density estimation function can be calculated:
\hat{p}(s) = \frac{K - 1}{m\, v(s)}    (II.25)
For the Parzen measure, II.25 is defined similarly. Therefore, using this logic of building the density-estimating function \hat{p}(s) of the kNN approach, together with the intent to approximate the Bayes error, the following generalized formula expresses the non-parametric Bayes measure:
\varepsilon = \frac{1}{m} \sum_{i=1}^{m} \left[ 1 - \max_j \left( \frac{v_j(s_i)}{\sum_{j=1}^{l} v_j(s_i)} \right) \right]    (II.26)
However, the variance of the Parzen estimates decays faster than that of k-NN (Fukunaga 1972). Nevertheless, k-NN is very often used, and there are three reasons for its common and widespread applicability:
1. The kernel, i.e. the local region size and shape, must be determined for Parzen estimation, which can be difficult (Chernoff 1966).
2. Using variably sized kernels, as is the case for kNN, helps to improve the accuracy in low-density regions (Fukunaga 1972).
3. It is shown (Cover and Hart 1967) that k-NN (for k = 1) asymptotically estimates the true Bayes error.
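One plausible reading of equation II.26 is to use, for each sample, the class fractions among its k nearest neighbours as a local posterior estimate; the sketch below implements that reading and is not the exact T-DTS implementation.

import numpy as np

def knn_bayes_error_estimate(X, y, k=5):
    # For each sample, estimate the local posteriors from its k nearest
    # neighbours and accumulate 1 - max posterior (spirit of Eq. II.26).
    X, y = np.asarray(X, float), np.asarray(y)
    classes = np.unique(y)
    err = 0.0
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                                    # exclude the sample itself
        nn = y[np.argsort(d)[:k]]
        err += 1.0 - max((nn == c).mean() for c in classes)
    return err / len(X)

# Toy example: two overlapping Gaussian classes in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(1.5, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
eps_hat = knn_bayes_error_estimate(X, y, k=7)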
An ANN-structure based complexity estimating measure with a resolution parameter B, which must be preselected by the user, has been implemented in T-DTS. The above-mentioned methods of approximating the Bayes error belong to the group of Intuitive measures. Therefore, the next Section II.2.4.2 completes this direction of approximating the Bayes error by shortly describing additional methods.
II.2.4.2 Intuitive measures
The most intuitive measures are founded upon the idea that class separability increases as the class means separate and the class covariances become tighter. One of the first concepts of separability proposed was the use of second-order statistics. This approach does not perform well in situations not adequately described by this statistical assumption. However, there are several applications for which these techniques perform well and which, accordingly, give them a valid claim as a useful approach to classification complexity.
II.2.4.2.1 Scatter matrices measures
The concepts of mean separation and covariance tightness are expressed here by the scatter matrices (Fukunaga 1972). The expressions for the within-class, between-class and mixture (total) scatter matrices, Sw, Sb and Sm respectively, are:
S_w = \sum_{i=1}^{l} p(c_i)\, E\!\left[(s - M_i)(s - M_i)^T \mid c_i\right] = \sum_{i=1}^{l} p(c_i)\, \Sigma_i    (II.27)
S_b = \sum_{i=1}^{l} p(c_i)\, (M_i - M_0)(M_i - M_0)^T    (II.28)
S_m = S_w + S_b    (II.29)
where E[·] is the mathematical expectation of the vectors, the p(c_i) are the class distributions, the Σ_i are the i-th class covariance matrices, M_i are the class mean vectors, and M_0 = \sum_{i=1}^{l} p(c_i) M_i is the mathematical expectation over all classes.
Many intuitive measures of class separability come from manipulating these matrices, which are formulated to capture the separation of the class means and the compactness of the class covariances. Some of the most commonly used (Fukunaga 1972) measures are:
J_1 = tr\!\left(S_2^{-1} S_1\right)    (II.30)
J_2 = \ln\!\left(\frac{|S_1|}{|S_2|}\right)    (II.31)
J_3 = tr\!\left(S_1\right)    (II.32)
J_4 = \frac{tr(S_1)}{tr(S_2)}    (II.33)
where generally (Fukunaga 1972) S_1 = S_b and S_2 = S_m. When |S_2| = 0, for practical reasons a pseudo-inverted matrix S_1 or S_2 is used. J_1 and J_2 are invariant, while J_3 and J_4 are coordinate dependent (Fukunaga 1972).
The criteria of equations II.32 - II.33 are implemented in the current T-DTS version.
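Equations II.27 - II.33 translate directly into a few matrix operations; the sketch below uses S1 = Sb and S2 = Sm and falls back to a pseudo-inverse when Sm is singular, as suggested above. The synthetic two-class data in the usage line are illustrative only.

import numpy as np

def scatter_criteria(X, y):
    # Within/between/mixture scatter matrices (Eqs. II.27-II.29) and the
    # J criteria of Eqs. II.30-II.33 with S1 = Sb and S2 = Sm.
    X, y = np.asarray(X, float), np.asarray(y)
    classes, m = np.unique(y), len(X)
    M0 = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))
    Sb = np.zeros_like(Sw)
    for c in classes:
        Xc = X[y == c]
        pc = len(Xc) / m
        Mc = Xc.mean(axis=0)
        Sw += pc * np.cov(Xc, rowvar=False, bias=True)
        Sb += pc * np.outer(Mc - M0, Mc - M0)
    Sm = Sw + Sb
    S1, S2 = Sb, Sm
    det1, det2 = np.linalg.det(S1), np.linalg.det(S2)
    J1 = np.trace(np.linalg.pinv(S2) @ S1)
    J2 = np.log(det1 / det2) if det1 > 0 and det2 > 0 else float("nan")
    J3 = np.trace(S1)
    J4 = np.trace(S1) / np.trace(S2)
    return Sw, Sb, Sm, (J1, J2, J3, J4)

Sw, Sb, Sm, J = scatter_criteria(np.random.randn(100, 3) + np.repeat([[0], [2]], 50, axis=0),
                                 np.repeat([0, 1], 50))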
Another group of intuitive measures is the bound methods.
II.2.4.2.2 Bound methods: PRISM approach
These methods count the number of samples in the overlap region of the sample distributions. This is done in a controlled way, where the boundaries are modified such that the overlap region progressively decays. The trajectory of this count is then examined to determine the separability of the classes. Information regarding class separation can be obtained from two possible sources:
1. The value of the trajectory.
2. The shape of the trajectory.
We have to mention Pierson's measure of classification complexity, called the Overlap Sum (Pierson 1998). The Overlap Sum is the arithmetical mean of the number of overlapped points with respect to the progressive collapsing iterations. This criterion does not require any exact knowledge of the distributions and is easy to compute. It has been shown (Pierson 1998) to have a strong correlation with the Bayes error (Rybnik 2004). Furthermore, in terms of computational complexity, it has an advantage over kNN (Bouyoucef 2007).
The following Space partitioning methods are quite new. They have been proposed for pattern recognition in Singh's work (Singh 2003). As shown by the example of Pierson's Overlap Sum method, the Space partitioning group of methods is more efficient compared to conventional data analysis approaches (Pierson 1998).
The space partitioning methods analyze classification problems by decomposing the feature space into a number of cells or boxes, which are then analyzed at different resolutions.
According to Singh (Singh 2003), the only other class separability measure based on feature space partitioning is the one proposed in (Kohn, Nakano and Silva 1996). This measure is called the Class Discriminability Measure (CDM). It is grounded on a data-adaptive partitioning of the feature space in such a way that regions containing samples from two or more classes are more finely partitioned than those containing a single class. CDM is based on the analysis of inhomogeneous buckets. The basic idea is to divide the feature space into a number of hyper-cuboids.
Each of these subspaces is termed a box. A given box at any stage is first tested on the basis of a number of stopping criteria. If the stopping criterion is not satisfied, the box partitioning continues. Once the stopping criteria are satisfied, all boxes that are inhomogeneous and not linearly separable are used to calculate the CDM, which is defined on the basis of the difference between the total number of samples in a box and the number of samples of the majority class, summed over all boxes (Kohn, Nakano and Silva 1996), (Singh 2003).
A class separability ratio calculated using a PRISM-based approach has been compared with the Bayes error ratio, as well as with the mentioned Fukunaga's scatter matrices measures, across a range of normal and non-normal data sets (Kohn, Nakano and Silva 1996). The main conclusion is that, when ranking the importance of features, the CDM measure is more in line with the Bayes error than the indirect Bayes error estimating criteria, and that it is attractive in use due to its reduced computational cost (Kohn, Nakano and Silva 1996), (Singh 2003).
The next paragraphs present a more sophisticated concept of feature space partitioning, which forms the Pattern Recognition using Information Slicing Method (PRISM) framework.
The feature space of dimension dim can be partitioned using subspaces with different topologies. In order to avoid the curse of dimensionality, a hyper-cuboid primitive is used in the PRISM framework (Singh 2002). The partitioning algorithm creates hyper-cuboids for the m data points (1 ≤ i ≤ m) in the dim-dimensional feature space, each of which can be assigned to one of the known l classes 1, …, l (l being the index of the maximal class). The extra parameter, the resolution B of partitions per axis (commonly 0 ≤ B ≤ 31), must be selected by the user in advance. The total number of boxes is K_total = (B + 1)^{dim}.
The difference of PRISM from CDM (Kohn, Nakano and Silva 1996) is that it does not start to split from the median position. Secondly, there is no stopping criterion, and the empty boxes are not analyzed. The measure of classification complexity is described on the basis of the above mentioned partitioning scheme. The ratio of the measures lies in the interval [0; 1]; a higher ratio indicates a simpler problem (Singh 2003[2]).
Since we have a space partitioned into hyper-cuboids by the resolution factor B, we can calculate the Purity measure, which defines how pure the data is within the hyper-cuboids. This method of calculation is more advanced than the CLC (Cluster Label Consistency) complexity estimate proposed in (Shipp and Kuncheva 2001).
Therefore, based on the initial PRISM clustering, we have the instances allocated in the boxes, and for each box we compute the category ratios. Consider box G_j (j being the index of the box, 1 ≤ j ≤ k_total) and let \eta_{ij} be the number of data points of class i available in box G_j, with 1 ≤ i ≤ l_j, where l_j is the number of classes present in box G_j. Then the probability of class i being allocated to box G_j is defined as:
p_{ij} = \frac{\eta_{ij}}{\sum_{i=1}^{l_j} \eta_{ij}}    (II.34)
For each box and for each class inside the j-th box, we apply equation II.34 and obtain a probability distribution vector p_{1j}, p_{2j}, \ldots, p_{l_j j}. Taking into account the required normalization (Singh 2003), the separability parameter for box G_j is calculated as:
S_{G_j} = \frac{l_j}{l_j - 1} \sum_{i=1}^{l_j} \left( p_{ij} - \frac{1}{l_j} \right)^2    (II.35)
Since each box G_j contains m_j instances of data out of the total number m, the overall purity of the different cells is given as:
S_G = \sum_{j=1}^{k_{total}} S_{G_j} \frac{m_j}{m}    (II.36)
To give the largest weight to the lowest resolution (Singh 2003), and to invert the meaning of the complexity value, equation II.36 takes the form:
S_{G\,norm.B} = 1 - \frac{1}{2^B} \sum_{j=1}^{k_{total}} S_{G_j} \frac{m_j}{m}    (II.37)
The next PRISM criterion, Neighbourhood separability, proposed by Singh in (Singh 2003), defines a classification complexity measure that depends on the concept of decision boundaries. It is very similar to Purity, which is why we have skipped its T-DTS implementation; however, the following PRISM Collective entropy, based on Information theory, is a very important measure that represents the order/disorder of the system (Singh 2003[2]).
In this particular PRISM case, the Collective entropy signifies the order/disorder accumulated at different resolutions, considering the non-empty cells for a given global resolution parameter B. For the probability distribution vector p_{1j}, p_{2j}, \ldots, p_{l_j j} produced by equation II.34, the entropy measure for each box G_j is calculated as:
HP_j = -\sum_{i=1}^{l_j} p_{ij} \log_2(p_{ij})    (II.37)
Let us note here that, generally, the base of the logarithm is not defined in (Singh 2003). Finally, the overall collective entropy, including the weighting factor and the required normalization, is calculated as:
HP_{G\,norm.B} = 1 - \frac{1}{2^B} \sum_{j=1}^{k_{total}} HP_j \frac{m_j}{m}    (II.38)
This keeps consistency with the other measures: the maximal value of 1 signifies complete certainty, while the minimum value of 0 signifies uncertainty and disorder. The utility of the PRISM framework and of the above complexity measures, Purity and Collective entropy, based on feature space partitioning, has been demonstrated in practice (Singh 2003) through their ability to predict the relevant level of classification error.
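The following minimal sketch illustrates, for a single resolution B, how the box probabilities of equation II.34, the per-box purity and entropy, and the normalized aggregates of equations II.36–II.38 could be computed; the scaling of the features to [0, 1], the handling of single-class boxes, the single-resolution weighting and the helper name are our own assumptions:

```python
import numpy as np

def prism_purity_entropy(X, y, B=3):
    """Sketch of the PRISM Purity (Eqs. II.35-II.37) and Collective entropy (Eqs. II.37-II.38)
    for a single resolution B, assuming features scaled to [0, 1] and (B + 1) partitions per axis."""
    m = len(X)
    cell_ids = np.floor(np.clip(X, 0, 1) * B).astype(int)       # box index along each axis
    boxes = {}
    for cid, label in zip(map(tuple, cell_ids), y):
        boxes.setdefault(cid, []).append(label)
    purity, entropy = 0.0, 0.0
    for labels in boxes.values():                                # only non-empty boxes
        mj = len(labels)
        _, counts = np.unique(labels, return_counts=True)
        p = counts / mj                                          # p_ij of Eq. II.34
        lj = len(p)
        if lj > 1:
            S_Gj = lj / (lj - 1) * np.sum((p - 1.0 / lj) ** 2)   # box purity, Eq. II.35
        else:
            S_Gj = 1.0                                           # assumption: a single-class box is fully pure
        HP_j = -np.sum(p * np.log2(p))                           # box entropy, Eq. II.37
        purity += S_Gj * mj / m
        entropy += HP_j * mj / m
    # weighting 1/2^B and inversion, following our reading of Eqs. II.37-II.38
    return 1.0 - purity / 2 ** B, 1.0 - entropy / 2 ** B
```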
The group of methods that cannot be ranged with the classification complexity measures mentioned above is organized under the group name Other methods. These methods may be very different in their origin. We do not try to provide a complete overview of them. Our aim is to give a clue about this group of complexity estimators and to define the type of estimators/criteria that are implemented in T-DTS. One of these measures is the well known Fisher linear discriminant ratio.
II.2.4.2.3 Fisher linear discriminant ratio and Ad hoc approaches
This classification complexity estimator originates from Linear Discriminant Analysis. Fisher (Fisher 1936) has defined the separation between two classes, represented by two vector-instances s_1 ∈ S and s_2 ∈ S respectively, as:
FDM(s_1(j), s_2(j)) = \frac{\left(\bar{s}_1(j) - \bar{s}_2(j)\right)^2}{\sigma_1^2(j) + \sigma_2^2(j)}    (II.39)

where j is an index of the feature space, \bar{s}_1(j), \bar{s}_2(j) are the means of the j-th component of the vectors, and \sigma_1^2(j), \sigma_2^2(j) are the variances of these components respectively.
A similar question arises about defining an aggregate ratio, as for II.8. In the framework of T-DTS, this problem is resolved in a similar way as for measure II.3. Although this complexity measure is easy to compute, it has the same disadvantages as the Normalized mean distance based measure.
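As a sketch of equation II.39, the per-feature Fisher ratio can be computed as below; the aggregation by taking the maximum over the features is only one possible choice, not necessarily the one used in T-DTS:

```python
import numpy as np

def fisher_ratio(X, y, c1=0, c2=1):
    """Sketch of the per-feature Fisher discriminant ratio of Eq. II.39 for two classes,
    aggregated (as one possible choice) by taking the maximum over the features."""
    X1, X2 = X[y == c1], X[y == c2]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2           # squared mean difference per feature j
    den = X1.var(axis=0) + X2.var(axis=0)                    # sum of per-feature variances
    fdm = num / np.where(den == 0, np.finfo(float).eps, den) # guard against zero variance
    return fdm, fdm.max()                                    # per-feature values and an aggregate
```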
Leaving the collective category named Other methods, the following paragraphs overview the Ad hoc measures in more detail. In the work (Ho 2000), there are three classes of Ad hoc complexity measures.
The first of them is named the Length of class boundary group. The Minimum Spanning Tree is one of its well known measures. Here, the Length of class boundary method based on the Minimum Spanning Tree (MST) (Friedman and Rafsky 1979) finds the length of the boundary between two classes. It constructs a tree that connects all sample points regardless of class; the length of the boundary is then given as the count of the edges that connect points belonging to two different classes (Singh 2003). The length of the class boundary normalized by the total number of points is used as a measure of classification complexity (Ho 2000).
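A minimal sketch of this boundary count, using a plain Prim's algorithm over the full distance matrix (our own implementation choice, suitable only for small samples), could look as follows:

```python
import numpy as np

def mst_boundary_length(X, y):
    """Sketch of the MST-based length of class boundary: build a minimum spanning tree
    over all points with Prim's algorithm, count the edges joining points of different
    classes and normalize by the number of points."""
    m = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # full distance matrix
    visited = np.zeros(m, dtype=bool)
    visited[0] = True
    best_from = np.zeros(m, dtype=int)       # tree vertex each outside point would attach to
    cost = d[0].copy()
    boundary_edges = 0
    for _ in range(m - 1):
        nxt = int(np.argmin(np.where(visited, np.inf, cost)))   # cheapest point to attach
        if y[nxt] != y[best_from[nxt]]:
            boundary_edges += 1                                  # edge crosses the class boundary
        visited[nxt] = True
        closer = d[nxt] < cost
        best_from[closer] = nxt
        cost = np.minimum(cost, d[nxt])
    return boundary_edges / m
```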
The second method of this group is known as the Space covering by ε-Neighbourhoods measure. The idea of this method is to generate an adherence set for each data point, which represents the maximum size hypersphere around a data point that contains samples of the same class and no other classes, see Fig. II.2 (Ho 2000).
Fig. II.2 : Retained adherence subset for two classes near the boundary
A list of the ε-Neighbourhoods (where ε represents the maximum possible distance allowed between sample points belonging to the same set) needed to cover the two classes is a composite description of the shape of the classes. The number and the order of the retained subsets form an extension for complexity measuring. For example, in the work (Le Bourgeois and Emptoz 1996), a measure is proposed that is computed as the average proportional size, in terms of the number of members in each hypersphere divided by the total number of data points used.
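A minimal sketch in the spirit of such a measure, with the sphere radius grown until a sample of another class is hit (our own simplification), is given below:

```python
import numpy as np

def adherence_measure(X, y):
    """Sketch of a space-covering style measure inspired by (Le Bourgeois and Emptoz 1996):
    for each point, the largest hypersphere containing only same-class samples is retained,
    and the average proportion of points it covers is reported."""
    m = len(X)
    proportions = []
    for i in range(m):
        d = np.linalg.norm(X - X[i], axis=1)
        other = d[y != y[i]]
        radius = other.min() if other.size else np.inf     # grow until another class is hit
        members = np.sum((d < radius) & (y == y[i]))       # same-class points inside the sphere
        proportions.append(members / m)
    return float(np.mean(proportions))
```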
The third group of methods is titled Feature efficiency. It is constructed for classification problems with a high dimensional feature space (Ho and Baird 1998). Feature efficiency based methods are concerned with how the discriminatory information is distributed across the features. The resulting characteristic of the efficiency of individual features describes how much each feature contributes to the separation of the classes.
The fourth measure is called the Hyperplane number based complexity estimator. This estimating approach was worked out in (Zhao and Wu 1999). The complexity ratio there is defined as k-degree linear separability if at least k hyperplanes are needed to separate the data linearly across the different classes. This approach is simplistic but easily computable.
II.2.5 Hierarchy of complexity estimation approaches
We would like to highlight the central role of the classification (class separability) complexity estimator. Even regardless of the T-DTS concept, most statistical pattern recognition methods use such measures. A brief overview of the global concept of complexity, as a formalism and as an inherent attribute of complex systems, leads us to a more general understanding of this phenomenon and, more precisely, to an understanding of classification complexity.
Fig. II.3 describes the updated taxonomy of the different complexity estimation approaches, in order to situate each measure and to describe the general hierarchy, starting from the global methods and finishing with particular, specific classification complexity measures.
Let us note that the complexity estimation approaches might include melting/overlapping concepts; moreover, some particular classification complexity methods can use conceptual ideas that belong to different branches of this hierarchy. For example, it is possible to ascribe the ANN based complexity estimator to the branch of the dynamic complexity concept (Lucas 2000), because of its use of a hardware implemented evolutionary net construction (De Tremiolles 1998).
These and other issues might prejudice the stratification given in Fig. II.3, and we share here the possible scepticism, because of the nature of the phenomenon of complexity. Any attempt to give a general high-level definition that at the same time generalizes and includes the various aspects of this phenomenon automatically implies uncertainty or subjectivism. For example, Kaufmann's definition of complexity, a sub-sub-class of Self-organizing complexity (Fig. II.3), defines complexity as the number of conflicts; but the natural questions that arise are how we calculate this number, based on which criterion, and whether this criterion is environment invariant. Thus, the output of Kaufmann's complexity measure is relative, because of its weakly defined origin. However, according to the taxonomy, at the lower levels of the proposed hierarchy complexity carries a more certain and utilitarian meaning, while ignoring some general aspects of the phenomenon of complexity.
To link the highly abstract, ill-defined concept with practical classification complexity measures in the frame of the classification problem, we propose the following brief summary:
Fig. II.3 : Taxonomy of classification complexity (separability) measures
1. The Bayes error (misclassification error) is the staple point of any classification complexity measure.
2. The concepts of data distribution analysis which approximate the Bayes error dominate in pattern classification.
3. These approximations might belong to more global complexity concepts, such as Information Theory, but the principal usage of those concepts does not relate to the classification problem and pattern recognition.
4. The range of novel methods (including, but not limited to, the well developed direction of space partitioning) may or may not (and this is important and has practical sense) be directly related to the golden standard of the Bayes misclassification error. This set of methods can provide efficient and data adaptive solutions for a "classifiability" (Singh 2003) measurement independent of the feature set (ideally, there should be one such measure).
Summarizing, the Bayes misclassification error is necessary, but not sufficient, for defining the classification view on complexity. The classification tasks can be purely separable, yet geometrically or topologically complicated (complex), meaning that Bayes error based methods are the most reliable, but not sufficient. Moreover, as we deal with computation in practice, there are many applied computational problems (simply put, the computational complexity of classification complexity) that drive the further discovery of alternative methods.
Table II.1 : Advantages and disadvantages of complexity estimating techniques

Technique: ANN-structure based
  Advantages: No probability distributions required. Directly related to Bayes error. Computation efficient.

Technique: Bhattacharyya bound based, Hellinger distance, Kullback-Leibler divergence, Jeffreys-Matusita distance, Bhattacharya distance, Mahalanobis distance
  Advantages: Directly related to Bayes error.
  Disadvantages: Probability distributions required. Limited to 2 classes.

Technique: Fukunaga's 4 scatter matrices criteria
  Advantages: Easy to compute.
  Disadvantages: Not equivalent to Bayes error. Limited to second order statistics.

Technique: PRISM (Collective entropy, Purity)
  Advantages: Provides relevant complexity levels for purely separable cases.
  Disadvantages: High computational costs; not equivalent to Bayes error.

Technique: Fisher linear discriminant based, Normalized distance based
  Advantages: Very easy to compute.
  Disadvantages: Not equivalent to Bayes error.
Table II.1 presents a quick snapshot analysis of the classification complexity methods, extracted from the thesis results of Pierson (Pierson 1998), Rybnik (Rybnik 2004) and Bouyoucef (Bouyoucef 2007).
The aim of Table II.1 is to give a brief view of the main classification complexity estimation criteria implemented in T-DTS. They are organized into five groups according to the taxonomy given in Fig. II.3 (the 18 implemented complexity estimators are marked there using a different box style and bold font; the exceptional kNN method of PRISM is not implemented and is also marked in a different way).
Complexity estimation is essential in classification. The principal idea of developing a large diversity of complexity estimating methods is to satisfy the different requirements of database decomposition and tree-structure building in the T-DTS framework. The complexity estimator is the engine of T-DTS, and the investigation of complexity measures is essential for correct T-DTS functioning.
In the taxonomy of Fig. II.3, the non-gradient, sharp bold areas include the complexity estimator methods that are an integrated part of the new T-DTS version.
The next Section II.3 is fully dedicated to our novel complexity estimator, named the ANN based estimating approach, which belongs to the ad hoc group of the intuitive measures.
II.3 ANN-structure based classification complexity
estimator
Anticipating possible criticism concerning the name ANN-structure based classification complexity, I underline that in this section of my thesis I am not focused on studying yet another statistical complexity measure built upon a combination of information measures. I am concerned with the fact that any concept of complexity (including its possible criticism) strongly depends on the chosen interpretation base of the measure (in the general case, the language), because the definition of complexity is exclusively founded on this interpretation base (Feldman and Crutchfield 1997), (Edmonds 1999). In the next paragraphs I focus on giving a proper definition of the ANN-structure based classification complexity estimator.
Essentially, by classification task complexity I mean a class separability criterion. Class separability is a classical concept in the field of pattern recognition, usually expressed using a scatter matrices approach (Fukunaga 1972). Because this separability measure is used in discriminant analysis, feature selection, clustering and a wide range of pattern recognition applications, this direction of research continues to develop. For example, in the past few years, Singh has proposed his PRISM-based method of computing class separability criteria (Singh 2003), which belongs to the group of Space partitioning methods of Fig. II.3. In contrast, the majority of methods have been investigated from the Bayesian standpoint and are strongly influenced by ideas and tools from statistical physics, as well as by information theory. While each of these theories has its own distinct strengths and drawbacks, there is little understanding of what relationships hold between them (Haussler, Kearns and Schapire 1994).
My novel, direct approach to classification task complexity estimation comprises the measurement of the computational effort required for successful classification using the "most efficient algorithm" (Boolos, Burgess and Jeffrey 2002). Theoretically, the "most efficient algorithm" has to appear in the previous sentence when we deal with complexity estimation. However, at the same time, according to Chaitin (Chaitin 2005), this is generally an idealistic expression, because proving the fact that our algorithm is "the most ..." depends entirely on the selected formalism or formal system (FS), which itself might be proven incomplete (Hofstadter 1999), (Godel 2001). Therefore, the role of the "most efficient algorithm" in our ANN-structure based classification complexity estimation technique is played by an algorithm that is user selected.
In our case of the ANN-structure based complexity estimator, we have chosen a kNN-based classification method. It is very successful for estimating an unknown data density function (Dong 2003), whether unimodal or a mixture of multi-modal densities. On the other hand, the number of samples required to obtain a good estimate grows exponentially with the dimensionality of the feature space, and this issue leads to an increasing computational cost of the kNN-classifier. That the estimation of the class density function requires extra learning effort in the multi-dimensional feature space case is to be expected in terms of Occam's razor principle, although Occam's razor does not tell us why this principle should work or why it is necessary. The idea that only the simplest underlying principle produces an enormous variety of a phenomenon, and in such a way exhibits its complexity, is central in Chaitin's view of computational complexity (Chaitin 2005). Thus, from the geometric viewpoint, the kNN-classifier is a simple, intuitive approach.
Historically, the ANN-structure based classification complexity estimator was first proposed and implemented using the IBM© ZISC®-036 Neurocomputer. That is why one may find the realized modules titled "ZISC" or "ZISC complexity estimator".
The main reasons why we have chosen this tool are the following:
1. Quality aspect: implemented Restricted Coulomb Energy (RCE) and KNN algorithms.
2. Control aspect: limited number of functions necessary to control the IBM© ZISC®-036 Neurocomputer.
3. Conceptual aspect: based on an ANN-evolutionary learning strategy (De Tremiolles 1998), (Madani, De Tremiolles and Tannhof 2001).
The following section describes in more detail the foundations of the ANN-structure based classification complexity estimator.
II.3.1 ANN-structure based complexity estimating concept
Before starting to discuss the complexity estimating algorithm named ANN-structure based, or ANN based, we briefly give an important clue about the computational complexity (not classification complexity) of artificial neural networks. According to the taxonomy in Fig. II.3, our novel approach belongs to the sub-set of computational complexity that takes into account the limits of computational resources.
II.3.1.1 Computability of ANNs
Early computing machines had computational limitations, so in the 1960s complexity theory was developed to chart the different subsets of the computable functions that could be obtained when restrictions are placed on computing resources.
Unlike digital machines, ANN models require a framework of computation defined on a continuous phase space, with dynamics characterized by the existence of real constants that influence the macroscopic behaviour of the system.
The ANN paradigm is an alternative to the conventional von Neumann model (Sima and Orponen 2003). ANN models have been considered more powerful than traditional models of computation. Depending on the implementation and the adjustment of the intra-parameters, they may compute more functions within the same given time bounds. However, in practice it is hard to rely on an approach which uses infinite precision real operations (Arbib 2003).
Generally, estimating the classification task complexity using neural networks is not an easy subject. In this work, I propose a different slant, using a neuroprocessor implementation of the Reduced Coulomb Energy (RCE) (Park and Sandberg 1991) and KNN (Dasarathy 1991) models. The analysis of the evolutionary process of constructing the neural network topology (Budnyk, Chebira and Madani 2007), (Budnyk, Bouyoucef, Chebira and Madani 2008) on the IBM© ZISC®-036 Neurocomputer allows us to extract information about the general complexity of a classification process and about the separability of the classes.
The next section introduces our novel approach to complexity estimation. In this approach, the analysis of the ANN-structure construction plays a key role. My prime interest is to give a precise definition of classification (class separability) complexity that can be measured by analyzing the evolutive process of the ANN-structure construction.
II.3.1.2 ANN based classification complexity estimating approach
In this section, my principal goal is to extract a ratio of classification task complexity. I perform my analysis taking into account the aspect of resource consumption, which is related to the sub-group of Computational complexities (Fig. II.3) processed on the deterministic Turing machine. The IBM© ZISC®-036 Neurocomputer was selected as the tool for ANN construction because of its high performance characteristics. Let me note that another ANN builder could be selected, but the core of the approach (the principal idea) remains the same. Another important note is that my aim is not to build a kNN-classifier for the problem, but to estimate the classification complexity by extracting information about its construction. In general, regardless of the ANN type, we learn a classification problem using the associated database and then estimate the task's classification complexity by analyzing the created neural network structure.
My method can be applied to any classification problem that is defined on the data level by a pair of sets S and C. The given database is composed of a collection of m objects, where each object (instance) s ∈ S is associated with an appropriate label (category) c ∈ C.
I expect (with the minor reservation that the selected learning technique, within its learning parameters, is appropriate) that a more complex problem will generally involve a more complex neural network structure.
The simplest feature of an RCE-KNN-like NN structure is n, the number of neurons. This criterion is a good candidate for the analysis not only because it is a resource criterion, but also because of the ZISC®-036 ANN-evolutionary based learning strategy (De Tremiolles 1998). Thus, the number of neurons reflects the complexity of the ANN structure. Taking these two facts into consideration, the following indicator is defined:
Q = \frac{n}{m}, \quad m \ge 1,\ n \ge 1    (II.40)
The idea of n = \hat{g}(m) was proposed by Zhao and Wu (Zhao and Wu 1999) for complexity estimation. In order to reserve a place for a more general ANN-structure, we set n = g(·), where g(·) is a function that estimates the complexity of the ANN classification structure. Arguments of this function might be: a signal-to-noise ratio, the dimension of the representation space, the boundary non-linearity and/or the database size, etc. Therefore, g(·) can combine several arguments, for example the number of neurons n and the length of the boundary line, and in such a way it can aggregate the key parameters (related to classification complexity) of several of the "Ad hoc" approaches (Fig. II.3).
In the first approach, I consider only the variations of the function g(·) with respect to the parameter m, supposing that for the case of the ZISC®-036 structure n = \hat{g}(m); therefore, our complexity indicator Q(m) is defined for g(m) ≡ \hat{g}(m) as follows:

Q(m) = 1 - \frac{g(m)}{m}, \quad m \ge 1,\ g(m) \ge 0    (II.41)
Here, the use of the m parameter assumes that our database is free of any incorrect or missing information. We also do not consider the informative experience issue (Scott and Markovitch 1989). Therefore, for the implementation and validation, we apply the Random strategy of classification problem representation.
I suppose that for problems of different classification complexity, the same increment of the parameter m produces a different g(m) behaviour, where a more complex classification problem involves a more complex ANN structure. I share the possible criticism of this utilitarian point of view on classification complexity, but let me mention that the ANN based complexity estimation is the responsibility of g(m). Thus, for the same fixed m value, problems of different classification difficulty are supposed to have different complexity indicators Q(m).
My approach is based on the dynamic analysis of the ANN construction. To provide a good illustration, I return at this point to our IBM© ZISC®-036 Neurocomputer implementation. The construction of the NNs corresponds to the building of a Voronoy polyhedron. Therefore, the RCE-KNN IBM© ZISC®-036 Neurocomputer maps the given dim-dimensional space to prototypes; this implementation of the classification process is represented by a network of neurons with influence fields shaped as hyper-polygons rather than hyper-spheres, see Fig. II.4. The 2D chart of Fig. II.4 (left) shows that the kNN algorithm leads to clustering the input space S of m instances/objects/vectors into Voronoy cells. Each cluster of the training points is labelled by a class (a category).
Fig. II.4 : Examples of Voronoy polyhedron for 2D and 3D classification problems
The analysis of the construction of such classification structures is based on the above mentioned indicator Q(m). Based on the RCE-kNN algorithm, each neuron of the classification structure allocates (captures) the maximal possible number of prototypes. For this purpose, a neuron of the RBF Network (RBFN) modifies its influence field. The final RBFN reflects the ability of the group of neurons of this network to arrange the given number of m prototypes into a minimal number of Voronoy polyhedron clusters. Each cluster corresponds to one neuron. This means that our analysis of the RBFN construction is done through the analysis of the behaviour of Q(m), because Q(m) represents the balance between the m input prototypes and the n Voronoy cells or neurons. Let me mention here that the neurons of the RBFN and the rules of their interactions are relatively simple, but the resulting interacting behaviour exhibits the overall classification complexity. Observing the dynamics and extracting information about its complexity is a typical approach for dealing with such complex systems as the brain, cellular regulatory machinery, ecosystems, etc. (Zhigulin 2004).
Concerning the RBFN behaviour, my expectation is to observe two stages (1 and 3) of the RBFN building and a transition phase (2), Fig. II.5:
1. Stage of RBFN process initialization.
2. Transitional phase.
3. Stage of RBFN process deployment.
Fig. II.5 : Q(m) indicator function behaviour
The first stage characterizes the state when neurons are easily generated in order to capture the new incoming prototypes, or to adjust the influence fields of a preliminary RBFN installation. During the second stage, the allocation of the remaining prototypes does not require a strong adjustment of the influence fields and, as a result, there are no new requests to add a neuron. We suppose, taking into consideration the different concepts/views on complexity, that the classification task's difficulty is determined by this transitional phase. One may find similarities between the RBFN initialization process, with its transitional step, and the effect of hysteresis that exists in the world of physics (Mielke and Roubicek 2003): a phenomenon which occurs in magnetic and ferromagnetic materials, and in the elastic, electric and magnetic behaviour of materials, as the subsequent effect of applying and removing a force or field; here, the most interesting fact is that this transitional phase is force-independent and is a characteristic of the material. The effect of hysteresis (similar to the RBFN installation) is observed in more general fields, such as economics (Jelassi and Enders 2004), energy, user interface design, electronics, cell biology (Pomerening, Sontag and Ferrell 2003), neuroscience and respiratory physiology (West 2008).
According to the definition of hysteresis (Bertotti and Mayergoyz 2006), this phenomenon can conceptually be explained, with a link to the RBFN initialization process, in the following way:
1. A system can be divided into subsystems or domains, much larger than an atomic volume, but still microscopic: in RCE-kNN terms, a domain is equivalent to a neuron, including its influence field.
2. Such domains form small isotropic regions: it means that each neuron has one category, or belongs to only one class label.
3. Each of the system's domains can be shown to have a metastable state: in terms of our ANN-structure construction, it means that a further increase of the m parameter (number of prototypes) does not influence its configuration.
4. The hysteresis is simply defined as the sum over all domains: in RBFN terms, this is the number of neurons.
Taking into account the broad range of real-world evolutionary processes, and the phenomenon of hysteresis with its properties, we expect to capture a similar behaviour of the RBFN. However, from our point of view, the key role in the RBFN construction is played by the transition between two stages: the first one is eruption, frenzy, and the second is synergy, maturity (Jelassi and Enders 2004).
The equation given below aims to capture this momentary transition phase, where j is an index of such a transitional point:
\frac{\partial^2 Q(m)}{\partial m^2} = 0    (II.42)
The point(s) m_j which satisfy the above equation have the following properties:
\forall m_j \ge 1,\ \exists\, \varepsilon_j \ge 1:\ \forall m \in (m_j - \varepsilon_j;\ m_j) \Rightarrow \frac{\partial^2 Q(m)}{\partial m^2} < 0,\quad \forall m > m_j \Rightarrow \frac{\partial^2 Q(m)}{\partial m^2} > 0    (II.43)
or, as another possible solution:
\forall m_j \ge 1,\ \exists\, \varepsilon_j \ge 1:\ \forall m \in (m_j - \varepsilon_j;\ m_j) \Rightarrow \frac{\partial^2 Q(m)}{\partial m^2} > 0,\quad \forall m > m_j \Rightarrow \frac{\partial^2 Q(m)}{\partial m^2} < 0    (II.44)
In practice, Q(m) is a piecewise function and we provide a piecewise approximation of Q(m) using a polynomial function. In the case of a more complex behaviour of Q(m) when we start learning, the polynomial power order must be selected in an appropriate way. We expect to meet an S-shaped curve at the final stage of the RBFN installation.
Equations II.43 and II.44 represent two cases. In the case of an odd power order greater than 2 (II.43), the approximated Q(m) approaches its limit while Q(m) monotonically grows. In the case of an even power order greater than 1 (II.44), the approximated Q(m) approaches its limit while Q(m) monotonically declines. We expect Q(m) to have a limit, for two reasons:
1. Theoretical: we suppose that for any given classification problem the algorithm of ANN-structure construction is ideal. For our ZISC®-036 tool, a Voronoy polyhedron that contains "the minimal possible number" of clusters for a given classification problem is not affected by any further increase of the parameter m.
2. Practical: in the real world, we have the same constraints for the classification task independently of the algorithm or its hardware implementation. Even if it were possible to construct a benchmark in which m has no limit, we still have limitations on the hardware memory, the execution time of the clustering algorithm, etc.
Concerning the first issue, let us note that this is a typical platonistic-like assumption, as in Kolmogorov's case of complexity with the belief in the existence of a "most efficient algorithm". However, it is known that the problem with Kolmogorov's definition of complexity, and with Kolmogorov-like approaches, is that there is no proof that a given algorithm (polyhedron) is the minimal one.
For the cases when j > 1, we define the transitional point m_0 as:

m_0 = \max(m_1, \ldots, m_j)    (II.45)

This minimax approach of selecting the maximum among the available points is similar to the idea of generalization expressed in equation II.20. Here, we select the representative ratio of classification complexity as the most complex among the available ones: since m_1 < \ldots < m_j, we take m_0 = m_j.
The main characteristic of the transitional point m_0 is the following:

\forall m > m_0:\ m \to +\infty \Rightarrow Q(m) \to const    (II.46)

It means that for m > m_0, m \to +\infty, the RBFN is in the stage of deployment. We expect that increasing the number of new incoming instances/prototypes does not significantly change the structure of the RBFN. It means that the classifier is in a dynamical phase where new incoming prototypes are easily assigned to the categories/clusters and do not significantly change the classifier's structure. In practice, const ≠ 0 in equation II.46 because of the second, practical reason mentioned above: m is a finite number, and const ≥ 1/m.
We state that m_0 plays the special role of a crash-point, a transitional point or, more precisely, a bifurcation critical point (Prigogine 2003) of the classification process.
Def: Q(m_0), with 0 < Q(m_0) ≤ 1, is defined as the coefficient of task classification complexity.
The interval of possible Q(m_0) values is similar to that of Fukunaga's (Fukunaga 1972) criteria of class separability. According to the given definition, a value of Q(m_0) close to 0 indicates the most difficult case of a classification task.
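As an illustration of equations II.41–II.45, the following minimal sketch fits a polynomial to the observed Q(m) values, locates the inflection points of the fit and returns the largest one as m_0 together with Q(m_0); the fitting degree, the helper names and the toy data are our own assumptions:

```python
import numpy as np

def transition_point(ms, ns, degree=4):
    """Sketch of the ANN-structure based indicator Q(m) = 1 - g(m)/m (Eq. II.41),
    a polynomial approximation of Q(m), and the transition point m0 taken as the
    largest inflection point of the fit (Eqs. II.42-II.45).
    ms: increasing learning-database sizes; ns: number of neurons built for each size."""
    ms, ns = np.asarray(ms, float), np.asarray(ns, float)
    Q = 1.0 - ns / ms
    coeffs = np.polyfit(ms, Q, degree)                 # piecewise behaviour approximated globally here
    d2 = np.polyder(np.poly1d(coeffs), 2)              # second derivative of the fitted polynomial
    roots = [r.real for r in d2.roots
             if abs(r.imag) < 1e-9 and ms.min() <= r.real <= ms.max()]
    if not roots:
        return None, None
    m0 = max(roots)                                    # Eq. II.45: m0 = max(m_1, ..., m_j)
    Qm0 = np.poly1d(coeffs)(m0)                        # complexity coefficient Q(m0)
    return m0, Qm0

# toy usage: a saturating neuron count produces the expected S-shaped Q(m)
ms = np.arange(10, 500, 10)
ns = np.minimum(40, 0.5 * ms)                          # hypothetical g(m): grows, then saturates
print(transition_point(ms, ns))
```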
The following sections are dedicated to the important question of the implementation of the ANN-structure based complexity estimator. We have selected the RBFN because it can be processed in a parallel, high-performance way. That is also why the ANN based complexity estimator was first implemented and tested using the IBM© ZISC®-036 Neurocomputer.
Afterwards, this ANN structure complexity estimator was implemented in the Matlab environment in order to extend the testing database and, more importantly, to embed it into T-DTS. Conceptually, these two implementations are similar; both of them employ the principal idea mentioned in the theoretical part of our work. However, there are some minor differences.
Concluding this section, let me mention that we were not the first to be interested in the asymptotic behaviour of a learning algorithm as m becomes large. In addition to the given definition of classification complexity, the intensive investigation of concept learning (Haussler, Kearns and Schapire 1994) undertaken by the broad research community has resulted in the development of at least two general viewpoints on the learning process, in terms of the learning curve and the cumulative mistake. The learning curve is viewed as the sample complexity of learning (Haussler, Kearns and Schapire 1994). The next section presents the hardware implementation of the ANN based complexity estimator using the IBM© ZISC®-036 Neurocomputer.
II.3.1.2.1. IBM© ZISC®-036 hardware implementation of ANN-structure based
complexity estimator
The IBM© ZISC®-036 exhibits very interesting characteristics. More precisely, this neurocomputer employs RCE-KNN evolutionary neural networks that, during the learning phase, partition the input space by prototypes, where each prototype is characterized by its own class (category) (Madani, De Tremiolles and Tannhof 2001). From the processing point of view, it means that during the kNN-like partitioning (learning) every neuron's threshold is adjusted. For each learning database of size m, after the learning process performed by the ZISC®-036 Neurocomputer, we obtain an RCE-kNN Neural Network structure that corresponds to a Voronoy polyhedron, and we apply the concept of complexity estimation described above. Afterwards, we compute the ANN based classification complexity ratio (coefficient/rate). The IBM© ZISC®-036 implementation of my estimator allows the user to easily conduct testing even for unseen prototypes. During this phase, it activates, or does not activate, the neighbourhood neuron(s).
The following section describes my software implementation of the ANN based estimator.
II.3.1.2.2 ANN-structure based classification complexity estimator. Software
implementation.
The difference between the software version of the ANN based complexity estimator and the IBM© ZISC®-036 Neurocomputer is that the first one does not allow the overlap of influence fields that may take place when one uses the IBM© ZISC®-036 Neurocomputer for prototype association (Lindblad and al. 1996). Thus, in the software ANN based complexity estimator, the MIF parameter is assigned to each prototype's centre. During the process of Voronoy polyhedron construction, the centre's influence field is adjusted (minimized down to the MIF limit) automatically. In the RCE-kNN-like ZISC®-036 Neurocomputer version, the influence fields of the neighbourhood neurons (prototypes with an associated category/class) are adjusted, but with respect to all neighbourhoods. If an instance/vector does not belong to any influence field, it is not recognized. If an input vector lies within the influence field of one or more prototypes associated with one category, it is declared as belonging to that category. If the input vector lies within the influence fields of two or more prototypes associated with different categories, it is declared as recognized, but not formally identified. Afterwards, during the learning process, the neurons adjust their influence fields. There is no change in the network content; only one or more influence fields are modified.
For both types of implementation of the ANN based complexity estimator, the process of classification complexity extraction is the same: we analyze the Voronoy polyhedron construction process. Then, using the function n = \hat{g}(m), we estimate the complexity of the created NNs. By analyzing the behaviour of the complexity indicator Q(m), we calculate m_0 and Q(m_0), the classification complexity ratio.
In the next section we compare our ANN based classification complexity estimator to the other classification complexity estimating techniques and give a comprehensive analysis of the estimators' advantages and disadvantages.
II.3.2 ANN-structure based complexity estimator compared to other approaches
The ANN-structure based method belongs to the "Ad hoc" group of methods of the Intuitive classification complexity measures class. The kNN-like hyperplane constructing method approximates the Bayes error. At the same time, for kNN, the ANN-structure based complexity estimating method becomes a variant of the "Ad hoc space covering" by ε-Neighbourhoods (Le Bourgeois and Emptoz 1996).
The ANN-structure based complexity estimation method, in its RCE-KNN-like implementation, can also be viewed as a further development of the related Zhao's and Wu's Hyperplane number based complexity estimating method (Zhao and Wu 1999). The given realization of our method is also related to Kaufmann's complexity concept. Thus, the process of constructing polyhedrons, which includes phases of adjustment of the influence fields of neurons/categories for a new prototype, can be treated as a conflict. Therefore, one may construct a function n = \hat{g}(m) using the total number of conflicts, according to Kaufmann's concept. However, according to the complexity taxonomy of Fig. II.3, Kaufmann's concept of complexity belongs to the very different and general category named Self-organizing complexity, although this does not mean that one cannot employ this conceptual idea in the classification complexity estimating context.
Finally, the relation between our method and the fundamental Kolmogorov definition of complexity shows itself at the moment of selecting the algorithm for the ANN construction. As has been shown, the ANN-structure based concept of the classification complexity estimator is related to several different groups of classification complexity estimation methods.
II.4 Conclusion
Complexity is an inherent phenomenon of complex systems; to date, the quantification of complexity remains an uncertain and non-trivial task. Even though the foundations of complexity theory are still not well established, a pragmatic look at the existing applications of the complexity paradigm could be used to build the taxonomy of classification complexity measures described in Fig. II.3. Using this taxonomy, we investigated methods for the complexity estimation of class separability as applicable to T-DTS. We reviewed a set of methods ranging from popular techniques, like Fukunaga's scatter matrices discriminators, to the most recent methods, like PRISM.
Using the insights gained in this review, we proposed a novel complexity estimation method that is based on Artificial Neural Network (ANN) structures and designed to further improve the effectiveness of the T-DTS technique. My ANN-structure based classification complexity estimation method extracts information about complexity by analyzing the dynamic process of ANN-structure construction. For this, we employ an additional function, in particular the number of neurons/categories of the evolutionary RCE-KNN used for the Voronoy polyhedron construction.
Besides the complexity estimation aspects, there are other issues that have a crucial impact on the overall performance of T-DTS. An important note is that T-DTS, as an approach to the construction of an ensemble of classifiers, is driven by a classification complexity estimating technique; but let us mention in this chapter that one might also use the construction of NNs (or other classifiers) in order to estimate the classification complexity (Tumer and Ghosh 2003). In the next chapter, we study one of these issues, dealing with the implementation of T-DTS.
Chapter III:
T-DTS software architecture
Practical issues of the T-DTS implementation are of great importance, because even a minor problem in an implementation of T-DTS for classification tasks can have a major impact on the overall classification performance of the technique. Since one of the objectives of this work is minimizing errors in classification tasks, this chapter is dedicated to the software architecture of T-DTS. The chapter transitions from the high-level concepts of Chapter II to the software architecture of T-DTS, to its implementation, and to the usability aspects of the approach.
In our software implementation, we simplify the general T-DTS functioning scheme described in Fig. II.2 by omitting the second decomposition feedback loop of processing model conformity. Taking into account the scope of T-DTS application (i.e., classification tasks only, using neural network based classifiers), the conceptual diagram of the proposed T-DTS implementation is shown in Fig. III.1.
The diagram consists of four main operational phases (blocks): data pre-processing (when required), learning, complexity estimation and acting (processing). The aim of these phases is to shape an ANN structure that reflects the general complexity of a given classification task. The behaviour of the T-DTS ensemble of NN models, sculpted over the decomposed databases, is associated mainly with the Complexity Estimation Agent (CEA). In Fig. III.1 it is marked as the "Complexity Estimation Loop".
In the diagram, the complexity estimation loop (Complexity Estimation Agent, CEA) plays the central role. It shapes the behaviour of the T-DTS ensemble of NNs by controlling the T-DTS database decomposition and tree organization by means of a feedback.
Fig. III.1 : Diagram of T-DTS implementation for classification tasks. Main blocks: Preprocessing (normalizing, removing outliers, Principal Component Analysis), Learning Phase (feature space splitting, NN based models generation, structure construction), Complexity Estimation Loop, Operation Phase, Processing Results. Data flows: Data (D), Targets (T), Preprocessed Data (PD), Prototypes (P), NN Trained Parameters (NNTP).
According to the basic tree-like decomposition of T-DTS by "divide to simplify" (Rybnik 2004), the decomposition methods of T-DTS belong to the first sub-category, "Agglomerative vs. divisive", of the decomposition techniques of Density-based Hierarchical methods (Section II.2). Any divisive decomposition method proceeds in a top-down manner, first placing all the data, or a starting decomposition, in a single cluster and then splitting the cluster into a larger set of clusters until a stopping criterion is satisfied. However, we should mention that the first sub-category of "Agglomerative vs. divisive" approaches consists of a wide range of different methods that extend the base of decomposition methods.
The main "Agglomerative vs. divisive" methods are called Linkage based methods, because of the heuristic model based clustering sub-methods, where the data is assumed to be generated by a mixture of component probability distributions, and because the majority of the classical Density-based methods are variants of "Agglomerative vs. divisive" methods. Let me note that some authors define Linkage based methods as Model-based agglomerative methods (Kamvar, Klein and Manning 2002).
Linkage based methods contain four types of approaches (Jain, Murty and Flynn 1999): Single-link (Sneath and Sokal 1973), Complete-link (King 1967), Group-average (Johnson and Kargupta 2002) and Ward's method (Ward 1963). These four methods are based on computing an approximate maximum of the classification likelihood (Kamvar, Klein and Manning 2002).
Generally speaking, the merging in Model-based agglomerative methods is done on the basis of the higher likelihood ratio. Therefore, similarly to the divisive methods, each sub-approach (method) employs its own function Df():
• Single-linkage method: the distance Df() between two clusters/groups G_i and G_j is the minimum over all pairs of vectors s_i and s_j that belong to G_i and G_j respectively:

Df(G_i, G_j) = \min_{\forall s_i \in G_i,\ \forall s_j \in G_j} d(s_i, s_j)    (III.1)

In this case, two clusters are merged to form a larger cluster based on the minimum distance criterion.
• Complete-linkage method: the distance between two clusters is the maximum over all pairs of objects in the clusters:

Df(G_i, G_j) = \max_{\forall s_i \in G_i,\ \forall s_j \in G_j} d(s_i, s_j)    (III.2)
• Group-average, or simply Average-linkage, method: the distance between two clusters is the average over the pairs of objects in the clusters:

Df(G_i, G_j) = M_{\forall s_i \in G_i,\ \forall s_j \in G_j}\left( d(s_i, s_j) \right)    (III.3)
• Ward's method uses the error sum-of-squares (ESS), which is given by:

Df(G_i, G_j) = ESS(G_i \cup G_j) - ESS(G_i) - ESS(G_j)    (III.4)

where the ESS() is defined as:

ESS(G) = \sum_{s_i \in G} \left( s_i - M(G) \right)^2    (III.5)

and M(G) is the sample mean of the data points in the cluster G.
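The four dissimilarities of equations III.1–III.5 can be sketched directly; the function names below are ours, and the Euclidean distance is assumed for d(·,·):

```python
import numpy as np

def df_single(Gi, Gj):      # single-linkage, Eq. III.1
    return min(np.linalg.norm(a - b) for a in Gi for b in Gj)

def df_complete(Gi, Gj):    # complete-linkage, Eq. III.2
    return max(np.linalg.norm(a - b) for a in Gi for b in Gj)

def df_average(Gi, Gj):     # group-average linkage, Eq. III.3
    return np.mean([np.linalg.norm(a - b) for a in Gi for b in Gj])

def ess(G):                 # error sum-of-squares, Eq. III.5
    G = np.asarray(G)
    return np.sum((G - G.mean(axis=0)) ** 2)

def df_ward(Gi, Gj):        # Ward's criterion, Eq. III.4
    return ess(np.vstack([Gi, Gj])) - ess(Gi) - ess(Gj)
```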
To obtain better Density-based Hierarchical methods, several advanced approaches adopt more complicated measurements in order to determine the dissimilarities between clusters. Examples are Clustering Using Representatives, CURE (Guha, Rastogi and Shim 1998), Robust Clustering Using Links, ROCK (Guha, Rastogi and Shim 2000), Balanced Iterative Reducing and Clustering Using Hierarchies, BIRCH (Zhang, Ramakrishnan and Livny 1996), and the Hierarchical Clustering Algorithm Using Dynamic Modeling, CHAMELEON (Karypis, Han and Kumar 1999). This list of methods is given rather to describe various possible directions of hierarchical clustering than to provide a common clue. A good summary of these methods is presented in (Ding 2007). Basically, variants of hierarchical clustering algorithms can also be distinguished by their (dis)similarity measurement and by how they deal with the (dis)similarity measures between existing clusters and merged clusters.
Summing up, the divisive type of universal decomposition of T-DTS provides the opportunity to employ different types of decomposition methods, because the T-DTS concept is grounded on the principle that (simply put) "splitting a database into sub-databases must decrease the overall database complexity".
In fact, the set of T-DTS decomposing and processing methods consists of small and specialized mapping neural networks (Madani and Chebira 2000), (Madani, Rybnik and Chebira 2003). In Fig. III.1 the corresponding block is called Neural Network based Models (NNM). Supervised by the CEA, T-DTS decomposes the feature space into a set of simpler (according to the principal concept) sub-spaces. Thus, at the leaf level of the obtained tree one finds the NNM, and at the node level one finds the Decomposition Agent (DA). In the T-DTS realization, the DA is also NN-based, more precisely a Kohonen self-organizing map, but to distinguish the decomposing and processing phases we have given different names to these two blocks. Let us also note that the DA might not be NN-based. Therefore, dynamically constructing a tree by means of the CEA that controls the decomposition of the learning database, and then building a set of specialized NNM trained on the generated sub-databases, is the way in which T-DTS can be viewed as a complex ensemble of neural networks. Fig. III.2 illustrates the T-DTS learning process.
More generally, the combination of the CEA, the splitting (DA) and the learning capabilities (NNM) confers on the resulting intelligent systems their self-organizing ability (Bouyoucef, Chebira, Rybnik and Madani 2005).
Fig. III.2 shows the learning phase, at the end of which a set of neural network based models (trained on sub-databases) is available that covers (models) the behaviour region by region in the problem's feature space. It is important to note that the initial complex problem has been decomposed recursively into a set of simpler sub-problems.
The initial feature space S is divided into N sub-spaces (the number of classifiers and the number of learning sub-databases/clusters is the same). For each subspace r (an index), T-DTS constructs a neural network based model describing the relations between inputs and outputs. An important remark here is that a neural network based model may fail to be built for some obtained sub-database.
Fig. III.2 : Scheme of the T-DTS learning concept. The Learning Database is recursively split by Decomposition Agents (DA) into learning sub-databases, each of which is handled by a Neural Network based Model (NNM).
Theoretically, at the conceptual level, a new decomposition can be performed on a sub-space, dividing it into several other sub-spaces. The learning phase can be considered as a self-organizing model generation process, which leads to the organization of a set of NNM. The T-DTS operating phase is depicted in Fig. III.3.
Fig. III.3 : T-DTS operating phase. The Control Unit routes the input Ψ(t) along control and data paths to the appropriate model among NNM_1, …, NNM_r, …, NNM_N, producing the outputs U_1, …, U_r, …, U_N.
The second operation mode, as mentioned, corresponds to the use of the constructed ensemble of NNM. According to the work (Madani, Chebira and Rybnik 2003), during the learning phase the Control Unit (CU) constructs the learning sub-sets, Fig. III.3. During the processing phase, the CU receives unlearned input vector(s) or sub-sets and then sends (switches) them to the most appropriate neural processing unit NNM.
The given description can be formalized in the following way. Let Ψ(t) be the system's input (Ψ(t) ∈ ℝ^{n_Ψ}), an n_Ψ-dimensional vector, and let U_r(t) ∈ ℝ^{n_u} be the r-th model's output vector of dimension n_u, where r ∈ {1, …, N}. In general, n_Ψ is not equal to dim, because the DA might modify the dimension, meaning that here we reserve room for a sophisticated DA.
Let F_r(·): ℝ^{n_Ψ} → ℝ^{n_u} be the r-th NNM's transfer function. Let CU[Ψ(t), β, ξ] ∈ {0, 1}^N be the decision output of CU[·], which depends on a set of extra parameters β and/or on some conditions ξ. β_r represents a particular r-th value of the parameter, and ξ_r denotes the corresponding particular condition ξ, obtained from the learning phase for the r-th sub-dataset. Taking into account the above defined notation, the control unit output CU[·] can be formalized as:
CU[\psi(t), \beta, \xi] = (u_1, \ldots, u_r, \ldots, u_N)^T    (III.6)

where

u_r = \begin{cases} 1, & \text{if } \beta = \beta_r \ \&\ \xi = \xi_r \\ 0, & \text{else} \end{cases}    (III.7)
A CU response corresponds to some particular values of the parameter β and the condition ξ (e.g. CU[Ψ, β_r, ξ_r] activates the r-th NNM), and so the processing of an unlearned input datum conforming to the parameter β_r and the condition ξ_r will be given by the output of the selected NNM:

U(\psi, t) = F_r(\psi(t))    (III.8)

We highlight that the activation of an appropriate NNM by the CU does not depend on the complete vector but on the partial input vector Ψ(t).
Concerning the formal model of the DA, the three available splitting methods realized in T-DTS (Section III.2) are based on the advantages of unsupervised Kohonen Self-Organizing Maps (SOM) and their properties. The splitting is performed on the basis of Similarity Matching (Madani, Chebira and Rybnik 2003). Thus, the tree-like decomposition is performed under the control of a SOM-based DA. As an output we have the set of clusters G_i, each of them represented by a vector W_i, where W_i is a centroid. Let us suppose that after the learning phase we have N clusters in our final tree structure. In this case, for a new unlearned Ψ(t), CU[·] activates the appropriate function based on the criterion of similarity expressed as:
92
⎢⎣
⎡ −=−=∈∀
elseWtWtif
WtuNi iiii ,0
)(min)(,1)),((:
ψψψ (III.9)
where |·| denotes the distance between vectors, which may be calculated using the different metrics II.1–II.9.
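A minimal sketch of this routing, under the assumptions that the trained NNM are available as callables and that the Euclidean norm is used for |·|, could look as follows:

```python
import numpy as np

def control_unit(psi, centroids, models):
    """Sketch of the CU routing of Eqs. III.6-III.9: the unlearned input psi is sent to the
    NNM whose DA centroid W_i is closest, and that model's output U is returned.
    `models` is a list of callables (the trained NNM_r), `centroids` the SOM centroids W_i."""
    dists = [np.linalg.norm(psi - W) for W in centroids]   # |psi(t) - W_i|
    r = int(np.argmin(dists))                              # u_r = 1 for the closest cluster
    return models[r](psi)                                  # U(psi, t) = F_r(psi(t)), Eq. III.8

# toy usage with two hypothetical constant models
centroids = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
models = [lambda x: "class A", lambda x: "class B"]
print(control_unit(np.array([4.5, 5.2]), centroids, models))   # routed to the second NNM
```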
Being the most important T-DTS module/unit, the CEA requires a parameter θ, the threshold of complexity. This parameter determines which complexity ratio/rank the decomposed sub-databases should have. Henceforth, the θ threshold is a ratio such that θ ∈ [0; 1], where 1 signifies the simplest case and 0 a very complex one.
The θ threshold plays a crucial role: it controls the database decomposition. Therefore the θ threshold can also be called the parameter of database divisibility or decomposability, because an arbitrarily selected complexity estimating technique might provide only an approximate class separability measure. For example, the Maximal standard deviation criterion, which in fact is not able to detect the "true" class separability, can still play this role, according to the results obtained in the work (Rybnik 2004).
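The way the θ threshold drives the recursive decomposition can be sketched as below; the callables complexity, split and build_model, as well as the depth limit, are hypothetical placeholders for the corresponding T-DTS units (CEA, DA, NNM):

```python
def t_dts_decompose(X, y, complexity, split, build_model, theta=0.7, depth=0, max_depth=8):
    """Sketch of the theta-threshold controlled recursion: if the estimated complexity of a
    (sub-)database is acceptable, a local model is built; otherwise the DA splits it further.
    `complexity`, `split` and `build_model` are user-supplied callables (hypothetical names)."""
    q = complexity(X, y)                     # e.g. a complexity/separability ratio in [0, 1]
    if q >= theta or depth >= max_depth:     # 1 = simplest case, so a high q means "simple enough"
        return {"leaf": build_model(X, y)}
    children = []
    for Xi, yi in split(X, y):               # SOM-based DA producing learning sub-databases
        children.append(t_dts_decompose(Xi, yi, complexity, split, build_model,
                                        theta, depth + 1, max_depth))
    return {"node": children}
```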
This θ threshold must be selected in advance, but there is no a priori information on how θ should be set. If the complexity threshold requires operating with the lowest possible complexity of the learning sub-databases, θ is preset to 1. The result is then generally a huge tree structure with very small database sizes at the leaf levels. Usually, by comparison with decision tree methods, such a tree structure requires further pruning. In terms of computational resources, this is a costly processing procedure. Moreover, the success or failure of these methods depends on a correctly constructed learning database.
Returning to T-DTS, a θ threshold equal to 0 signifies the incapability to initiate the T-DTS recursive database decomposition. Another remark that must be made concerns the relativity of the θ measure. Different types of CEA produce complexity ratios that cannot be compared, because of the nature of the phenomenon of complexity and the weak or differing definitions of the applied complexities (e.g. Information theory and Kolmogorov's complexity): the θ measure is relative. Therefore, the problem naturally appears of employing a non-heuristic and user-free procedure that searches, for the pre-selected parameters, for the maximum (in terms of performance) of the T-DTS output.
In the next section, I propose a solution for this problem. It is a semi-automated, user-free procedure that finds a quasi-optimal value of the θ threshold.
III.1 T-DTS concept: self-tuning procedure
T-DTS version 2.00 was released and adapted for classification problems. Based on the formal model for the learning and processing phases described above, the user has to define the global parameter θ, the complexity threshold. The value of this parameter is unknown a priori and differs for the different CEA modules/units. Within the T-DTS framework, the user relies on a "cut and try" method to find an "optimal", or more precisely a quasi-optimal, θ threshold at which an initial supposition is fulfilled. We place the word "optimal" in quotation marks because, in general, the existence of such a threshold is not evident.
Briefly, Section III.1 is dedicated to solving the problem of an automatic search for the "optimal" θ threshold for which T-DTS produces a tree structure such that the classification error assigned to this structure's ensemble of NNM reaches its minimum. We describe a semi-automatic (because some global parameters of this procedure still have to be pre-defined by the user) complexity-threshold adjusting procedure that is deterministic and free of heuristic user control.
The main disadvantages of the user-based heuristic search for a quasi-optimal θ threshold are:
• a priori analysis of information about the given database is required in order to initiate the search;
• there is no universal complexity estimator: the complexity-estimation methods applied to a classification task depend on the nature of the data, the coordinate system, etc. (a list of desirable, but not mandatory, properties of classification complexity estimators / CEA can be found in (Fukunaga 1972)). These and other features of the CEA must be selected by the user and taken into account when an operator searches for the quasi-optimal θ.
In our framework, we deal with θ-threshold optimization. This belongs to the group of optimization problems in which the solution space includes the different θ values varying from 0 to 1 together with the T-DTS decomposition and processing modes, and our goal is to find the best possible solution(s).
III.1.1 Self-tuning procedure as optimization problem
Optimization methods are usually preoccupied with finding the best possible solution. The actors in nature (living organisms), on the other hand, usually seek only an adequate solution. Genetic algorithms and other biologically inspired methods for search and optimization adopt this approach implicitly. The problem is that "adequacy" remains to be defined. For instance, an NP-hard problem (a general optimization problem is NP-hard) becomes tractable if we are content to accept "good" solutions rather than perfect ones (Green and Newth 2001). Therefore, the role of an optimization procedure, for us, is to extract (i.e. explore adequacy) information, where possible, about an optimization problem while avoiding NP-hardness. Our T-DTS optimization algorithm should be designed to search effectively for a plausible solution in the solution space. More precisely, dealing with classification problems and applying the divide-and-conquer paradigm to T-DTS, we have inherited NP-hardness (it is NP-hard to find a truly optimal partitioning of a given multi-dimensional space (Garey and Johnson 1979)). Thus, for a given classification dataset and otherwise fixed T-DTS settings, the solution lies in the solution space of a CEA selection and an optimal θ threshold.
Regardless of the structure of the general optimization problem, the problem may actually have multiple quasi-optimal solutions, and the quality measure (classification error ratio) may have a high standard deviation, or the provision of several near-optimal solutions may be more desirable than the detection of one (completely) optimal solution. Often, an expert may want to choose from such a set of (near-)optimal solutions. In this case, a learner would be required to find not just one solution, but rather a set of several different (near-)optimal solutions (Butz 2001).
In conclusion, the θ-threshold optimization problem is the problem in which the best possible solution must be found from a given θ-threshold space while the other parameters of T-DTS are fixed. T-DTS feedback is available in the form of learning and testing rates, and of the CEA outputs for the decomposed dataset. This feedback may or may not provide hints about where to direct the further search for the optimal solution. Finally, the number of optimal solutions may vary and, depending on the problem, one or many optimal (or near-optimal) solutions might be found. The disclosed methods for optimization (self-optimization) of dynamic systems such as T-DTS rely on a deep analysis of the feedback, covering not only the learning and testing performance characteristics, but also the variations of the prescribed variables (Chou 1983).
Optimization is possible because the general T-DTS approach is based on the assumption that a functional relation exists between the θ complexity threshold, the T-DTS tree structure, and the classification rates. Each time we modify the θ parameter for a range of fixed parameters, including the CEA, T-DTS builds a different tree with a different depth and structure. The T-DTS user may primarily consider only one output regardless of the tree: the minimization of the accuracy error.
III.1.2 Self-tuning procedure description
In the previous T-DTS implementation, v. 2.00, the user relied on a "cut and try" method to find a quasi-optimal θ threshold, and there was no possibility to handle the optimization in a well-formalized way. Therefore, for T-DTS v. 2.50, we developed an algorithmic procedure that resolves this problem in a semi-automated way; the procedure is not yet fully automatic, as the user still has to define several other parameters, particularly the complexity estimation method and the decomposition unit.
The goal of the proposed procedure is to eliminate user supervision during the search for a quasi-optimal θ threshold that corresponds to a quasi-optimal classifier tree structure.
Fig. III.4 : An example of maximal possible decomposition tree
The procedure works as follows. First, we preset the θ threshold to 1. The result of the corresponding T-DTS decomposition is shown in Fig. III.4 and is called the maximal possible decomposition tree, because T-DTS keeps decomposing until the classification complexity of every sub-cluster equals 1.
Still, it should be noted that even when the calculated θ equals 1, this does not mean that the cluster contains only one instance or only one class. According to the T-DTS concept, θ equal to 1 signifies the easiest classification-task difficulty. One possible criticism of this pre-processing step is that it performs the maximal possible T-DTS-like decomposition of the dataset; still, within the T-DTS framework, and no matter whether a heuristic or an algorithmic search for the quasi-optimal θ threshold is used, the user always extracts information/knowledge about the complexity of the complete classification problem from the T-DTS feedback. The difference between the heuristic "cut and try" method and the method based on the maximal possible decomposition tree of Fig. III.4 lies in the computational cost: deterministic search versus fast but non-deterministic heuristic optimization.
The maximal possible decomposition tree is a special T-DTS output, valid only for a fixed range of intra-parameters, including the complexity estimator and the decomposition unit. This tree is a chaotic product of applying the T-DTS concept, because the tree-building trajectory strongly depends on the initial conditions: the decomposition unit and the complexity estimator. We argue here that this chaotic decomposition provides an important self-consistency condition for determining the T-DTS-like adaptability to the given classification problem (Fellman 2004). In our further analysis we disregard all leaf nodes of this tree (marked blue in Fig. III.4) and consider the distribution of the number of non-leaf clusters over the θ interval [0;1). This means that for each delta step of [0;1) (this parameter can be modified by the user, but delta is generally assumed to be small), we count how many clusters (marked magenta in Fig. III.4) of the tree have a complexity ratio lying in that delta-step sub-interval of [0;1). The result of this analysis is a histogram built over [0;1); in our further analysis, we consider the global characteristic of the maximal decomposition tree represented by this histogram (given in Fig. III.5, which has been obtained for the four-spiral, two-class academic classification benchmark using the Fisher-ratio-based complexity estimator).
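As an illustration, a minimal Matlab sketch of this histogram-building step is given below; it assumes that a vector theta_nodes (a hypothetical name) already holds the complexity ratios of the non-leaf nodes of the maximal decomposition tree, and that delta is the user-chosen step.

    % Sketch: distribution of the non-leaf clusters' complexities over [0,1),
    % assuming theta_nodes holds the complexity ratio of every non-leaf node.
    theta_nodes = [0.22 0.25 0.28 0.28 0.30 0.35 0.47 0.61];  % hypothetical values
    delta  = 0.01;                          % user-defined delta step
    edges  = 0:delta:1;                     % delta-step sub-intervals of [0,1)
    counts = histc(theta_nodes, edges);     % number of clusters per sub-interval
    bar(edges, counts, 'histc');            % histogram of the kind shown in Fig. III.5
    xlabel('\theta complexity ratio'); ylabel('number of clusters');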
Generally, the θ values vary in [0;1), so we find the maximal ratio A_{max} = \max_{\forall i} \theta_i and the minimal ratio A_{min} = \min_{\forall i} \theta_i. As expected, [A_{min}; A_{max}] \subseteq [0;1); please recall that θ = 0 corresponds to the case when the sub-database is very complex and θ = 1 denotes the opposite.
The second step of our analysis consists of a further constriction of [A_min; A_max]. For this purpose, I propose selecting the sub-interval of [A_min; A_max] in Fig. III.5 that contains the majority (this parameter has to be predefined by the user) of the clusters. The starting point for extracting this sub-interval of [A_min; A_max] is the delta step of the histogram in which the maximum number of clusters has been registered. According to the histogram (Fig. III.5), this is θ = 0.283; the sub-interval containing the majority of the clusters, which conveys the overall task complexity, is then approximately [0.22; 0.50].
Fig. III.5: An example of distribution of the clusters’ number over [Amin;Amax]
complexity interval
Such a constriction to [B_{min}; B_{max}], where [B_{min}; B_{max}] \subseteq [A_{min}; A_{max}], therefore requires a parameter α that determines the majority, with 0 < α < 1.
In practice, one may use the Pareto principle to determine α: to obtain (1−α)·100% equal to 80% or 20%, α should be set to 0.2 or 0.8, respectively. However, in our practical realization I have used a pessimistic assumption, setting α = 0.1 (10%) as the default T-DTS parameter.
Let me highlight that the form of the histogram is a priori unknown; a hypothetical assumption can be made only from an analysis of the complexity estimator type, the specificity of the decomposition unit, and the values the CEA returns. Thus, in the worst case of a uniform distribution, we have:
\frac{A_{max}-A_{min}}{B_{max}-B_{min}} = \frac{1}{1-\alpha} \qquad (III.10)
where 0 \le A_{min} \le B_{min} < B_{max} \le A_{max} < 1, whereas the optimistic expectation is:
\frac{A_{max}-A_{min}}{B_{max}-B_{min}} > \frac{1}{1-\alpha} \qquad (III.11)
Let me also remind the reader that the procedure for extracting [B_min; B_max] from [A_min; A_max] starts from the point of the histogram where the number of clusters reaches its maximum. In the case of two or more maxima, the procedure automatically extracts the whole sub-interval between the outermost maxima.
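One possible way to perform this extraction is sketched below in Matlab: starting from the delta step with the most clusters, the interval is grown towards the larger neighbouring bin until the (1−α) majority is covered. The variable names are illustrative and the actual Histogram_Analyse routine may proceed differently.

    % Sketch: constriction of [Amin;Amax] to [Bmin;Bmax] around the histogram peak,
    % assuming 'counts' holds the number of clusters per delta step and alpha = 0.1.
    delta  = 0.05;  edges = 0:delta:1;
    counts = [0 0 0 1 2 5 9 4 3 2 1 0 1 0 0 0 0 0 0 0 0];   % hypothetical histogram
    alpha  = 0.1;
    target = (1 - alpha) * sum(counts);      % majority of clusters to be covered
    [~, lo] = max(counts);  hi = lo;         % start at the bin with the most clusters
    while sum(counts(lo:hi)) < target        % grow towards the larger neighbour
        growLeft = lo > 1 && (hi == numel(counts) || counts(lo-1) >= counts(hi+1));
        if growLeft, lo = lo - 1; else, hi = hi + 1; end
    end
    Bmin = edges(lo);  Bmax = edges(min(hi + 1, numel(edges)));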
The second phase of the semi-automatic self-tuning procedure performs a search. For a pre-defined delta step corresponding to h, a selected α = 0.1, a chosen decomposition method, and a chosen complexity estimation method, the procedure searches for a quasi-optimal θ threshold. To process a classification problem within the T-DTS framework, the procedure also requires an additional parameter z, the number of T-DTS iterations. This parameter is needed for checking the robustness of the classification results, especially when T-DTS uses learning-database randomization on a dataset of fixed size. Generally, z is an integer and z > 1; therefore, by applying the T-DTS robustness check z times, we compute, for a fixed θ ∈ [B_min; B_max], the averages and standard deviations of the performance characteristics. The main measures of T-DTS performance are the following: Gr – generalization rate, Lr – learning rate, Tp – time of the overall T-DTS processing, and NTp – processing time of the T-DTS PU applied to the non-decomposed (solid) database. The additional measures include SdGr – the standard deviation of the generalization rate, SdLr – the standard deviation of the learning rate, and SdTp and SdNTp – the standard deviations of Tp and NTp, correspondingly.
We aggregate the performance measures Gr, Lr, SdGr, SdLr ∈ [0, 1] and Tp, NTp, SdTp, SdNTp > 0 into a single T-DTS performance function P(θ), constructed as follows:
P(\theta) = \sqrt[b_1+b_2+b_3]{\Big[(1-Gr)+SdGr\Big]^{b_1}\,\Big[(1-Lr)+SdLr\Big]^{b_2}\,\Big[\frac{Tp+SdTp}{NTp+SdNTp}\Big]^{b_3}} \qquad (III.12)
where b1, b2 and b3 are the priorities of the performance measures; for example, one may preset b1 = 3, b2 = 2 and b3 = 1 as default parameters. For validation purposes, however, we fix the parameters as b1 = 3, b2 = 2 and b3 = 0; note that by zeroing the last parameter we tune the procedure to disregard the processing-time component.
It should be noted that P(θ) is quite sensitive to the priority parameters bi; however,
this sensitivity is on par with the heuristic adjustment of T-DTS, where the operator
manually performs a similar testing, analyzes the output based on the performance
criteria, and consciously defines the priorities of these criteria. In the implementation of
T-DTS v. 2.50, we provide the user with the freedom of customizing not only the priority-
coefficients of equation III.12, but also the complete equation itself.
An important note: by using the P(θ) performance-aggregation function, we do not treat the generalization rate as the only possible performance characteristic. Instead, by applying equation III.12, one may look for a balanced solution. P(θ) is written as a function of the argument θ because, according to our initial assumption, a quasi-optimal θ exists. P(θ) is thus a fusion of the three main performance characteristics, and its specificity can easily be modified by means of the priority parameters b_i.
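A minimal Matlab sketch of this aggregation, following the form of equation III.12 (the names and sample values are purely illustrative), is:

    % Sketch of the P(theta) aggregation of equation III.12; Gr, Lr, SdGr, SdLr
    % lie in [0,1], Tp, NTp, SdTp, SdNTp are positive, b = [b1 b2 b3] priorities.
    P_theta = @(Gr,SdGr,Lr,SdLr,Tp,SdTp,NTp,SdNTp,b) ...
        ( ((1-Gr)+SdGr)^b(1) * ((1-Lr)+SdLr)^b(2) * ((Tp+SdTp)/(NTp+SdNTp))^b(3) ) ...
        ^ (1/sum(b));
    b = [3 2 0];                % b3 = 0 disregards the processing-time component
    p = P_theta(0.92, 0.02, 0.99, 0.01, 12.5, 0.4, 3.1, 0.1, b);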
The applied semi-automated self-tuning procedure is defined by the behaviour of the P(θ) function and by its important intra-parameters (besides P(θ) itself and the b_i priorities), namely the coefficient α, the number of T-DTS iterations z, and h, the delta step (precision) of the θ search. The rest of the T-DTS parameters are assumed to be pre-selected.
Taking into account our principal supposition concerning the existence of a minimum of P(θ), the self-tuning procedure is transformed into the problem of searching for the θ on [B_min; B_max] for which P(θ) is minimal. Even though there is no evidence that P(θ) is a continuous function, the restrictions on the performance characteristics Gr, Lr, etc., suggest that P(θ) cannot contain essential or jump discontinuities.
We should note that, during a run of the self-adjusting procedure, the intermediate parameters are stored for future analysis. This backup allows us to provide the user not only with a single quasi-optimal solution, but also with a range of alternative solutions similar to the quasi-optimal one. For instance, one of the intermediate solutions may be of interest to a user who wishes to sacrifice robustness (half a percent of standard deviation) in exchange for a lower overall T-DTS computational cost. Another important remark is that the dilemma of the "global vs. local" minimum of P(θ) cannot be resolved because of the origins of P(θ): P(θ) contains discontinuities, because the components of this function are not continuous.
The self-adjusting algorithm implemented in the enhanced version of T-DTS is based on the classical golden-section search algorithm. The pseudo-code of the self-tuning procedure is given below:
Programme TDTS_Self_Tuning_Proc(<S,C>, DA, CEA, NNM, z, h)
  '<S,C> : Classification database defined as a pair of the S and C sets
  'DA  : Decomposition agent
  'CEA : Complexity estimation agent
  'NNM : Processing neural network model
BEGIN
  Integer : z                        'Number of iterations required to get statistics
  Integer : h                        'Number of optimization iterations
  [Amin, Amax, Hist] := Pro_Decomp_Total(<S,C>, DA, CEA)
      'Amin, Amax : minimum and maximum of the complexity interval
      'Hist : histogram of the complexity, extracted by Pro_Decomp_Total
  [Bmin, Bmax] := Histogram_Analyse(Amin, Amax, Hist)
      'Bmin, Bmax : minimum and maximum of the new complexity interval
  j := h                             'tuning cycle controller
  m_optimum := max_integer_value     'current optimum (to be minimized)
  REPEAT
    [m_t, j, m_optimum] := Find_min_opt(j, z, Bmin, Bmax, m_optimum)
        'm_t : temporary complexity threshold θ ; m_optimum : current value of P(θ)
    FOR i = 1 TO z
      m_Res(i) := TDTS(<S,C>, DA, m_t, CEA, NNM)     'm_Res : array of the results
    ENDFOR
    m_optimum := Collect_TDTS_statistics_and_calculate_optimum(m_Res)
    m_Global_results(j) := m_Res                     'matrix of the results
  UNTIL j < h
  print m_Global_results, m_t, m_optimum
END
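Since the actual Find_min_opt routine is not reproduced here, the following self-contained Matlab sketch shows how the golden-section step over [Bmin; Bmax] may be organized; evalP is a hypothetical wrapper that averages P(θ) over z T-DTS runs, e.g. evalP = @(theta) mean(arrayfun(@(i) run_tdts(theta), 1:z)) for a hypothetical run_tdts function.

    % Sketch: golden-section search for the quasi-optimal theta on [Bmin;Bmax],
    % assuming evalP(theta) returns the averaged P(theta) over z T-DTS runs.
    function [theta_opt, p_opt] = golden_section_theta(evalP, Bmin, Bmax, h)
        g = (sqrt(5) - 1) / 2;                % golden-ratio coefficient
        a = Bmin;  b = Bmax;
        x1 = b - g*(b - a);  f1 = evalP(x1);
        x2 = a + g*(b - a);  f2 = evalP(x2);
        for k = 1:h                           % h optimization iterations
            if f1 < f2                        % the minimum lies in [a, x2]
                b = x2;  x2 = x1;  f2 = f1;
                x1 = b - g*(b - a);  f1 = evalP(x1);
            else                              % the minimum lies in [x1, b]
                a = x1;  x1 = x2;  f1 = f2;
                x2 = a + g*(b - a);  f2 = evalP(x2);
            end
        end
        theta_opt = (a + b) / 2;              % quasi-optimal complexity threshold
        p_opt = evalP(theta_opt);
    end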
In this algorithm, the extra integer parameter h is the number of optimization iterations (not to be mistaken with the z parameter) and is equivalent to the precision of the optimal θ threshold. The h parameter has to be provided in advance for the following reasons:
1. The optimization criterion P(θ) might deviate for different reasons, such as the random way the learning database is built, fluctuations of the decomposition coordinates of the DA, etc. Thus, h represents a trade-off between a high precision of the quasi-optimal θ threshold and the instability of P(θ) caused by its fluctuating components.
2. We are interested in an optimization procedure that searches for the optimal threshold using the smallest possible total number of single-run T-DTS iterations. Taking the h and z parameters into consideration, the computational time cost of the self-tuning procedure can be expressed as h·z·(time of a single T-DTS run for fixed parameters such as DA/CEA/NNM, learning database, etc.). Please note that the duration of a particular single T-DTS run strongly depends on the size of the classification database and on the selected NNM method.
In conclusion, we would like to stress that the use of the proposed algorithm does not increase the overall computational or handling complexity of T-DTS, the only alternative being the more expensive heuristic manual search. The difference between searching for the quasi-optimal θ threshold with the given procedure and with heuristic user-based approaches lies essentially in the T-DTS processing-time budget.
The proposed semi-automated procedure has the range of advantages and disadvantages mentioned above. We will discuss and evaluate the practical issues of the procedure in the validation chapter. The following section presents another important issue – the implementation of T-DTS 2.50 in the Matlab environment.
III.2 T-DTS software architecture and realization
The current Matlab T-DTS v. 2.50 software architecture is based on the use of a set of specialized mapping Neural Network Models (NNM) supervised by a set of DA. The DA are prototype-oriented neural networks, and the NNM are models of artificial-neural-network origin. The principal software architecture is depicted in Fig. III.6.
Our T-DTS framework incorporates three core databases:
1. Decomposition methods (database DU – Decomposition Unit; the decomposition unit is equivalent to the DA in terms of the T-DTS concept).
2. Processing methods (PU – Processing Unit, equivalent to the NNM).
3. Complexity estimators (equivalent to the designated CEA in terms of the T-DTS concept).
The T-DTS software engine is the CU (Control Unit), which controls and activates the corresponding packages through the Graphical User Interface (GUI). The CU allows the operator to perform:
1. Normalization of the incoming database.
2. Automatic extraction of the learning database from the general classification database S, with two implemented extraction options:
• random learning-database extraction;
• random learning-database extraction with respect to the class distribution.
3. Principal component analysis and transformation.
4. DU selection.
5. PU selection.
6. Selection of the Complexity estimating method.
7. Definition of the θ complexity threshold in the range [0;1].
8. Setting of T-DTS parameters and constants such as z, h, α, B, etc.
9. Configuration of the DU and PU.
10. Database analysis using graphic tools, graphic interpretation of the obtained results, tree construction, and additional user tools.
Fig. III.6 : T-DTS software architecture
In addition to the T-DTS design itself, the need, expressed in the literature (Maren 1990), to develop guidelines for further integrating other types of processing – fuzzy logic, genetic algorithms, expert systems and conventional algorithms – required re-engineering the T-DTS architecture.
To conclude, T-DTS uploads the databases, performs pre-processing, builds a tree of prototypes using the selected decomposition method and the selected complexity estimating technique, and then learns each sub-database using a particular PU (NNM). Afterwards, it applies the obtained structure of NNM to the test/generalization database and assembles a set of local results. The final output of a T-DTS run is the learning and generalization rates.
Generally, based on Fig. III.6, the T-DTS software can be viewed as a Lego system of decomposition methods, processing methods and complexity estimating techniques, powered by a control engine accessible to the operator through the GUI.
The main advantage of the T-DTS architecture is that these three databases can be developed independently, outside the main frame, and then easily incorporated into T-DTS. Whatever NN model is pre-selected, significant efficiency can be gained through careful design and minor modifications, which the new T-DTS architecture permits.
Therefore, the T-DTS implementation is an extendible platform that supports the principal T-DTS concept. An example of the possible extendibility of T-DTS is the SOM-LSVMDT approach (Saglam, Yazgan and Ersoy 2003), which is based on the same idea of decomposition using Self-Organizing Maps. This technique could be implemented by incorporating LSVMDT (Linear Support Vector Machine Decision Tree) (Chi and Ersoy 2002) as a processing method into the PU database, because the SOM decomposition method is already implemented. An example of actual T-DTS extendibility is the recent implementation of the PU method realized in (Voiry and al. 2007). In fact, T-DTS v. 2.50 is a working concept of an independent-component development platform for classification tasks. The next paragraphs describe our Matlab T-DTS implementation.
The principal Matlab T-DTS v. 2.50 software architecture is briefly described, in terms of its main processing components, in Fig. III.7. This scheme identifies 9 principal modules of T-DTS. Relations between modules are represented by one-way arrows (indicating a simple call of the module) and two-way arrows for the cases when the relations between modules are complex.
The following list presents a detailed description of each of them.
1. T-DTS GUI – contains the group of functions responsible for controlling the T-DTS application.
2. T-DTS Init – initializes and saves the processing parameters.
3. T-DTS Modifier – allows the user, through the graphical interface, to modify the existing decomposition (DA) and processing (NNM) methods; the update is then picked up and applied.
Fig. III.7 : Principal T-DTS v. 2.50 Matlab software architecture
4. T-DTS Output – a relatively simple group of functions that is called once in order to print out the results.
5. T-DTS Graph – the set of functions responsible for 2D and 3D drawing.
6. T-DTS Main programme – the central module containing the group of functions responsible for integrating the main T-DTS activities. In short, this module corresponds to CU[.].
7. Data T-DTS – provides database upload and, if requested by the user through the GUI, pre-processing. This module can also split the database into learning and generalization parts.
8. T-DTS Core – the most important module, containing the realization of the global functions, such as the recursive decomposition and tree building defined by the T-DTS concept.
9. T-DTS Complexity – the engine of the T-DTS application; it is responsible for estimating database complexity and for taking the decision on further decomposition.
A more detailed scheme, required for a possible industrial T-DTS implementation, is given in Fig. III.8. This architecture is assembled from Matlab modules; in the chart, each *.m file represents a separate processing module. The relations between modules are indicated with two types of arrows.
The first type of relation indicates a complex relation, meaning that the modules can call each other in order to modify each other's input/output parameters. For example, m_getProcMethod_params.m in Fig. III.8 (top left) contains the parameters of multi-ANN training pre-set by the user. This module has complex relations with the "t_dts.fig & t_dts.m" module: the latter calls m_getProcMethod_params.m to pass the pre-set parameters forward to m_tdts.m, and so on. At the same time, using the T-DTS GUI by means of t_dts.fig, the user can open the Configuration menu and modify these parameters (Fig. IV.5.8, with the corresponding explanation).
The second type defines simple relations between modules, meaning that one module plays the role of a master and the called module the role of a slave. For example, t_dts_init_script.m in Fig. III.8, which contains the range of default parameters, is called by "t_dts.fig & t_dts.m". The t_dts_init_script.m script has been written to initialize T-DTS through the GUI t_dts.fig; it is responsible for saving and updating the default parameters with the user-given parameters. Similarly, according to the organization chart, t_dts_init_script.m calls, for its needs, the m_getProcMethod_params.m and m_getDecMethod_params.m modules in order to update the processing and decomposition T-DTS methods with the modified parameters, again through the GUI interface t_dts.fig.
The aim of providing such a complex, real-world scheme in Fig. III.8 is to show the T-DTS realization in more detail: many questions about the possible realization of the different T-DTS processing steps arise here, and these details, which are not present in the conceptual T-DTS scheme, have a crucial impact on the T-DTS functioning.
The practice of upgrading T-DTS from v. 2.00 to v. 2.50 has shown that the main modules of T-DTS, including the relations between the principal modules, are invariant and will be retained in new versions of T-DTS. Each module shown in Fig. III.8 is described below; some modules contain sub-functions or data file(s) that have to be explained:
1. t_dts.fig plus t_dts.m, which include t_dts_params.mat – the parameter file, including the default parameters. After each run of T-DTS, the parameters of the last run are automatically saved in this t_dts_params.mat file.
2. m_tdts.m
• m_it_T_DTS() – sub-function that executes T-DTS n times with the range of passed parameters.
• m_recDBSplit() – recursively decomposes the database based on the provided complexity-estimator ratio; used for learning.
• m_recDBSplit_Following_tree() – recursively decomposes the database based on the provided tree, regardless of the database size. This sub-function imitates the same process that has been used for learning; it is used when the option "Tree_following" (Fig. III.8) is selected.
• m_splitting() – for a given database and decomposition centres, provides the clustering.
• m_QClassification() – for a given input and output database, returns the learning and generalization rates.
3. m_build_sANN.m, which includes the module m_Cldb2MatrixConvertation(); it converts the database of categories/classes into a matrix of zeros in which the position of the class is indicated by a 1 in the corresponding column, the number of lines of this matrix being the number of patterns (see the sketch after this list). This classical data-interpretation step is required before ANN training.
4. m_uploadDBs.m includes m_randData(), which randomizes the indexes and passes them as input parameters; it is required for the two randomization modes (Fig. III.8).
5. m_3d_graph_clusters.m includes m_3d_Tree(), which builds a tree in 3D over the decomposed 2D clusters.
6. m_run_sANN.m includes m_Matrix2CldbConvertation(), which performs the operation opposite to m_Cldb2MatrixConvertation(): it converts the obtained matrix of testing results back into the standard format, where the categories/classes are stored in an array.
7. m_RBF_ZISC_fusion.m :
• m_ZISC_CE() – the module of the ANN-based complexity estimator, holding the core of the ANN-based complexity estimation.
• m_distance() – sub-module containing the realization of the L1, LSUP and Euclidean distances, which may be used not only by the ANN-based complexity estimator but also for database decomposition.
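The class-to-matrix conversion and its inverse can be sketched in a few lines of Matlab (an illustrative re-implementation, not the actual m_Cldb2MatrixConvertation / m_Matrix2CldbConvertation modules):

    % Classes become a zero matrix with a single 1 per row marking the class column;
    % the number of rows equals the number of patterns.
    classes = [2 1 3 2]';                      % hypothetical class labels, one per pattern
    nCls    = max(classes);
    M = zeros(numel(classes), nCls);
    M(sub2ind(size(M), (1:numel(classes))', classes)) = 1;
    % Opposite operation: the result matrix is converted back to class labels.
    [~, back] = max(M, [], 2);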
Let me mention that the default configuration of the T-DTS neural networks (number of layers, number of nodes, and type of connectivity) is an important issue. Empirical tests using back-propagation networks (the majority of the PU methods) have not demonstrated a significant advantage of 2 hidden layers over 1 in a relatively small and simple diagnostic network.
Fig. III.8: Detailed T-DTS v. 2.50 Matlab software architecture
For our T-DTS realization we follow Maren's advice (Maren 1990): for a classification (decision boundary) problem where the output node with the greatest activation determines the category of the instance, one hidden layer will most likely be sufficient. Therefore, the default PU settings (used during the validation phase) for the back-propagation networks are the following: the number of neurons in the output layer equals the number of classes, and the hidden layer contains a number of neurons equal to the (rounded) square root of the size of the database (sub-database); an illustrative sketch is given below.
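As an illustration of this default sizing rule only (not the actual T-DTS PU code, and assuming Matlab's Neural Network Toolbox), a back-propagation PU could be configured as follows:

    % Sketch of the default PU sizing: one hidden layer with round(sqrt(m)) neurons;
    % patternnet derives the number of output neurons from the 1-of-N coded targets.
    X = rand(2, 100);                     % hypothetical sub-database: 2 features x m patterns
    labels = 1 + (X(1,:) > 0.5);          % hypothetical 2-class targets
    T = full(ind2vec(labels));            % one column per pattern, 1-of-N coding
    m = size(X, 2);                       % sub-database size
    net = patternnet(round(sqrt(m)));     % single hidden layer, sqrt(m) neurons
    net = train(net, X, T);               % back-propagation training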
Fig. III.8 depicts the full chart of the realization of the T-DTS modules.
The options and processing of the current Matlab T-DTS software realization v. 2.50 can be controlled through the GUI shown in Fig. III.9.
Fig. III.9 : Matlab T-DTS software realization v 2.50, Control panel
The detailed description is the following:
• Decomposition Units:
1.CNN (Competitive Neural Network)
2.SOM (Self Organized Map)
3.LVQ1 and LVQ2_1 (Learning Vector Quantization, with two different learning functions)
• Processing Units:
1.LVQ1 and LVQ2_1
2.Elman_BN (Elman’s backpropagation network)
3.MLP_CF_GD (MLP cascade forward network with gradient descent algorithm)
4.LNM (Linear Neuron Model)
5.RBF (Radial basis function based network)
6.MLP_FF_GDM (MLP feed forward backpropagation network with gradient descent
algorithm including momentum adjusting)
7.GRNN (General Regression Neural Network)
8.PNN (Probabilistic Neural Network)
9.GRNN (Generalized regression neural network)
10. MLP_FF_BR (MLP feed forward backpropagation network with Bayes regulation
including statistical incoming database normalization) (Voiry and al. 2007)
11. MLP_FF_ID (MLP feed forward backpropagation network with input-delay)
12. Perceptron.
13. Elman_BNwP (Elman’s backpropagation network including statistical incoming
database normalization)
14. Elman_BNBR (Elman’s backpropagation network with Bayes regulation)
• Complexity estimating technique:
1.Maximum_Standard_Deviation (based technique) (Rybnik 2004)
2.Fisher_Disriminant_Ratio (Rybnik 2004)
3.Purity_Measure (PRISM based technique) (Singh 2003)
4.Normalized_mean_Distance (Bouyoucef 2007)
5.KLD (Kullback–Leibler Divergence based estimator)
6.JMDBC (Jeffreys-Matusita distance based complexity criterion)
7.Bhattacharyya_Coefficient (based criterion)
8.Mahalanobis_Distance (based technique)
9.Interclass_DM_CRT_Trace (Scattered-matrix method based on inter-intra matrix-
criteria of the traces), criterion 9.13 (Fukunaga 1972)
10. Interclass_DM_CRT_Div_Trace (Scattered-matrix method based on inter-intra
matrix-criteria of the traces’ division), criterion 9.16 (Fukunaga 1972)
11. Interclass_DM_CRT_Log_Det (Scattered-matrix method based on inter-intra
matrix-criteria of logarithm of determinants), criterion 9.14 (Fukunaga 1972)
12. Interclass_DM_CRT_Dif_Trace (Scattered-matrix method based on inter-intra
matrix-criteria of the traces' difference), criterion 9.15 (Fukunaga 1972)⁴
⁴ These four criteria (9–12) generate a class-separability ratio based on the statistical theory of discrimination (Fukunaga 1972)
13. RBF_ZISC_based_Fusion_CRT (Simulation of ZISC®-036 ANN based
complexity estimator in Matlab environment) Section III.4.3
14. k_Nearest_Neighbor_Estimator
15. Collective_Entropy (PRISM based technique) (Singh 2003)
16. JSD (Jensen-Shannon Divergence based estimator)
17. Hellindger_Distance (based technique)
18. Bayes_Error – a direct realization of a computationally costly procedure, intended only for testing purposes and not for real-world classification-task processing.
The complexity-estimating threshold can be set in the corresponding field of Fig. III.9. This value varies from 0 to 1. We have to mention that the ratios returned by the different complexity estimators 1–18, which come from different taxonomic classes, have been linearly normalized in order to lie in the interval [0;1]: a ratio of 0 means that the database/sub-database is "very complex" to classify, and 1 means the opposite. I would like to remind the reader that the values produced by the different CEA are relative and cannot be compared; this issue has been verified during the validation phase of my work.
Decomposition techniques use two important intra-parameters for performing the decomposition (Fig. III.9):
• Mode:
1. Min_dist_to_prototype
2. Tree_following.
• Distance:
1. Euclidean
2. Manhattan
3. Mahalanobis.
Concerning the given description of the tree-like learning-database decomposition controlled by the SOM-based DA, we have to note that there are two possibilities for managing the decomposition of the databases.
Let the DA decompose the database following the proposed T-DTS concept. After decomposition, the DA obtains, besides the clusters of the learning database, the set of centroids W_i and the T-DTS tree. The question of how to use this information arises, because there is a difference between the hierarchical clustering of the database following the tree-like decomposition and the clustering performed by analysing the distances to the centroids W_i, which ignores the tree-like decomposition. Theoretically, the borders between these two types of clustering are completely different. Moreover, from the algorithmic-complexity point of view, for m given instances the first decomposition approach requires about m·log₂(m) iterations (if the tree is binary) and the second 0.5·m·(m−1).
To distinguish these two cases, we have implemented the Min_dist_to_prototype clustering mode, which corresponds to the second approach, and the Tree_following mode, which is responsible for the original concept of clustering by following the tree. Whichever mode is selected, Min_dist_to_prototype or Tree_following, the T-DTS user can be sure that there is no discrepancy between the way the learning and the generalization databases are decomposed. This issue is extremely important.
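The difference between the two modes can be sketched as follows in Matlab (a toy binary tree with hypothetical centroids, not the actual T-DTS data structures):

    % Sketch of the two clustering modes for one instance x: node centroids are
    % stored in treeW, children per node in treeC (empty = leaf node).
    x      = [0.9 0.1];
    treeW  = [0.5 0.5; 0.25 0.5; 0.75 0.5; 0.75 0.25; 0.75 0.75];
    treeC  = {[2 3]; []; [4 5]; []; []};
    leaves = [2 4 5];
    dist2  = @(A, v) sum(bsxfun(@minus, A, v).^2, 2);
    % Mode 1: Min_dist_to_prototype - compare x against all leaf centroids at once.
    [~, k] = min(dist2(treeW(leaves,:), x));
    leaf_direct = leaves(k);
    % Mode 2: Tree_following - descend the tree, always following the nearest child.
    node = 1;
    while ~isempty(treeC{node})
        ch = treeC{node};
        [~, k] = min(dist2(treeW(ch,:), x));
        node = ch(k);
    end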
The distance type is another important criterion, because it is the metric used for the distance measurement; however, the problems for which we have provided testing have been represented purely in Euclidean space. In addition, one can expand the T-DTS platform with a new database of distances (metrics) if required. The following sections describe the methods related to the Databases' pre-processing and Create Learning & Generalization DBs sections.
Besides the selection of the input and output file names (Fig. III.10, left), the learning database can be pre-processed, if required, using Principal Component Analysis and Transformation, or one of the normalization options:
• Normalization
1. No_normalizing
2. Norm_LinTans_EC
3. Norm_Statistical_EC
Norm_LinTans_EC is a linear data pre-processing such that the prototypes lie in the interval from 0 to 1. Norm_Statistical_EC pre-processes the data so that its mean is 0 and its standard deviation is 1.
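Under the behaviour just described, the two normalization options amount to the following Matlab sketch (illustrative code, column-wise over an m-by-d database S):

    S = [1 200; 3 250; 2 400];                                   % hypothetical raw database
    % Norm_LinTans_EC: linear transform so that every feature lies in [0,1]
    mn = min(S, [], 1);  mx = max(S, [], 1);
    S_lin = bsxfun(@rdivide, bsxfun(@minus, S, mn), mx - mn);
    % Norm_Statistical_EC: statistical normalization to mean 0 and std 1
    S_stat = bsxfun(@rdivide, bsxfun(@minus, S, mean(S, 1)), std(S, 0, 1));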
The Create Learning & Generalization DBs section is used when the classification problem does not already provide such a split. In this case there is a wide range of well-studied techniques dedicated to the question of how to learn, but in any case the incoming solid database must be split according to the T-DTS concept of problem handling. Therefore, there are two types of splitting:
• Split-type
1. Balanced
2. Random
In the first case, the learning database is randomly extracted (if the corresponding flag is on, Fig. III.9) with respect to the class distribution and the given extraction percentage. The second option provides a random (without any additional condition) extraction of the learning sub-database from the original classification database.
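A minimal Matlab sketch of the two extraction options (illustrative names; p is the extraction percentage) is given below:

    S = rand(576, 2);  C = 1 + (S(:,1) > 0.5);  p = 0.5;   % hypothetical database and labels
    % Random: extract p*m patterns without any class constraint
    idx = randperm(size(S, 1));
    learnRnd = idx(1:round(p * numel(idx)));
    % Balanced: extract p of the patterns of each class separately
    learnBal = [];
    for c = unique(C)'
        ic = find(C == c);
        ic = ic(randperm(numel(ic)));
        learnBal = [learnBal; ic(1:round(p * numel(ic)))];   %#ok<AGROW>
    end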
The right panel of the T-DTS GUI contains the T-DTS processing options (Fig. III.9). There are 3 run modes:
1. Uni
2. Multi
3. Self_Tuning
The first mode performs a simple, single T-DTS run. The Multi run mode does the same, but for a range of specially selected parameters saved in the multi-run script: instead of launching T-DTS manually for each new set of parameters, it is more efficient to write this information into a script and then process it. For example, the most frequently used mode of T-DTS is testing for different θ-threshold values.
• Self_Tuning is the mode that uses the θ-threshold adjusting procedure described above in Section III.1.
• The Iteration No. field (Fig. III.9) allows the user to set the number of iterations z in order to gather statistics. The importance of this option has been discussed in Section III.1.
• The Run button starts T-DTS; Result prints out the summary of the testing. Result(Threshold) is an option that presents the generalization and learning results as a function of the θ threshold; the idea of the existence of a functional relation between them is highlighted in Section III.2. To illustrate the T-DTS GUI abilities, including Result(Threshold), I have applied the T-DTS approach to a very simple two-class benchmark. The aim of this task is to classify 2D points belonging to two different classes. These classes are linearly separated by the line X = 0, meaning that points with a negative X-coordinate belong to class one, and the others to class two.
The learning process has been performed using the CNN decomposition method, PNN processing (not appropriate for the goal, but good for an illustration of the T-DTS GUI), and the Maximum_Standard_Deviation complexity estimator (which is not a true classification complexity estimator), with different threshold values preset in the multi-run script. The learning ratio was set to 50%. The learning database has been extracted randomly from the solid database, which contains 576 2D points; the class distribution is 50%/50%. The number of iterations used to gather statistics is 5. Fig. III.10 shows the T-DTS output of the learning and generalization rates for different θ-threshold values.
In Fig. III.10, the abscissa represents the complexity θ threshold and the ordinate the percentage (%) of successful learning (red) and generalization (blue). The obtained generalization rates are marked on the chart by diamonds; the corresponding corridor of the standard deviation of this rate is marked by asterisks. The learning rates in Fig. III.10 are likewise marked by diamonds with asterisks. As the problem is very simple, the learning rates reach their maximum and their standard deviation is zero. One may note that although the maximal generalization rate is reached for θ = 0.19, the solution provided by θ = 0.224 might be more attractive because of the low standard deviation of its generalization rate. By analysing such a chart, one may select a combination of alternative solutions in case the single best solution cannot be found.
Fig. III.10 : T-DTS GUI: Results : 2 stripe-like benchmark (576 prototypes)
The next option, "Print DB Complexity", allows the user to apply all available complexity estimating techniques to the solid (non-decomposed) database. It provides preliminary information for black-box classification cases. "Print SubDB Complexity" can be applied to a decomposed database: it calculates the complexity ratios for each sub-database and prints them out for analysis of the decomposed configuration.
The buttons "2D DBs", "2D Tree" and "3D Graph" permit us to interpret the result of the decomposition and tree construction in 2D and 3D. Two tests on the same benchmark problem have been performed with the above-mentioned settings, except that the θ threshold was fixed: in the first case (Fig. III.11, left) Maximum_Standard_Deviation with θ = 0.72, and in the second case (Fig. III.11, right) Purity_Measure with θ = 0.35.
Fig. III.11 : GUI of T-DTS, decomposition clusters’ chart
The clusters that represent the final decomposition in Fig. III.11 are marked using different styles. In the left picture, the clustering includes a projection of the tree over the mosaic of clusters: the centres (nodes of the tree) of the sub-databases/clusters are linked, and not only the leaves (processing nodes) but also the decomposition nodes are shown.
A brief analysis of this chart suggests that the majority of the clusters in Fig. III.11 are allocated along the class border X = 0. We expect such a result, because those clusters do not belong to one single class: these bordering clusters contain two classes (Fig. III.11) and are therefore assessed as more complex than the others, which contain one class only. The left and right charts have different cluster mosaics because of the different complexity estimators; these pictures illustrate the direct influence of the complexity estimating technique on the decomposition (clustering) and tree building. Accurate calculation of the complexity ratio of the sub-databases (clusters) is the engine of T-DTS that builds the decomposition set. Returning to the graphical representation of results, the next available T-DTS option, "2D Tree", draws the tree. Fig. III.12 shows two cases: in both the left and right pictures the trees are built for the above-mentioned problem and settings, with the complexity estimator set to Maximum_Standard_Deviation and θ threshold = 0.72. What matters for the tree building is the configuration of the decomposition method: in the left picture, the decomposition method is a CNN with 2 neurons (decomposition centres); on the right of Fig. III.12 we have used a SOM for decomposition, with a 5×8 grid, i.e. 40 decomposition centres in total.
Fig. III.12 : GUI of T-DTS, decomposition tree charts
The final GUI option, "3D Graph", combines the two previously mentioned features: it allows the user to build a decomposition tree over the 2D clustering.
Fig. III.13 : GUI of T-DTS, decomposition tree chart in 3D
This option illustrates the dynamics of the T-DTS tree-like decomposition approach regulated by a complexity estimator. Fig. III.13 shows the decomposition and the tree structure in 3D for the same simple 2D classification task, where the CNN contains 2 neurons, the complexity estimating method is Purity_Measure, and the θ threshold equals 0.35. Another range of options is available in the T-DTS main menu. The two most important menus for the end user are Configuration (Fig. III.14) and Analysis (Fig. III.17).
DU Parameters Configuration and PU Parameters Configuration permit the user to modify the DU and PU settings, such as the number of epochs, the number of neurons, etc.
Fig. III.14 : Menu of T-DTS, Configuration.
DU Configuration and PU Configuration are responsible for the T-DTS platform extension and independent component development. For example, the PU named Elman_BNwP (Elman's backpropagation network including statistical incoming-database normalization) was developed as an analogue of MLP_FF_BR (MLP feed-forward backpropagation network with Bayes regulation including statistical incoming-database normalization), proposed by (Voiry and al. 2007), which had been incorporated earlier.
The Multi Run Configuration menu configures the script and the optimization-function settings.
In case a T-DTS run has been interrupted, the user can continue it from the interruption point, because the sub-products and parameters are stored automatically. In order to concatenate the sub-products afterwards, the option "Concatenate output" has been created; it also allows the user to run T-DTS in parallel on different PCs. This option appeared as a result of practical working experience.
The next two principal menus permit the setting of constants. The first (Fig. III.15) covers the T-DTS processing constants, such as the threshold accuracy k, the Alfa_Coefficent (α), etc.
Fig. III.15 : Menu of T-DTS, Set Constants
The second menu, "Set EC Options" (Fig. III.16), defines the constants for the complexity estimators, such as the resolution parameter B for the PRISM methods, the Maximal Influence Field for our ANN-based complexity estimator, and so on.
Fig. III.16 : Menu of T-DTS, Set EC Options
The Analysis menu, shown in Fig. III.17, contains the settings for histogram building ("PDF of Complexity"), "Postregression" analysis, and building the chart of the optimization function P(θ).
Fig. III.17 : Menu of T-DTS, Analysis
Fig. III.17 concludes the description of the T-DTS parameters and their implementation.
Let me recall that the realization aspect is very important because of the range of difficulties that appear when implementing an abstract theoretical concept. It is well known that the simplifications and assumptions made during the implementation phase may negatively affect the T-DTS processing ability. Thus, for further validation and result analysis it is very important to have not only a vision of the general T-DTS concept, but also an understanding of the details of the realization. The realization can answer possible questions such as why the structure of the output clusters looks the way it does, or why the generalization rate is high or low. The following Section III.3 provides a conclusion on the current aspects of the T-DTS implementation and its software architecture.
III.3 Conclusion
In this chapter, we have described the implementation of T-DTS. From the conceptual
point of view, the presented implementation scheme is platform independent: no matter
which programming language is selected for the main T-DTS modules, its implementation
scheme remains platform-invariant.
The presented implementation of T-DTS is also very flexible: any decomposition or
processing unit could be easily modified, adjusted, or replaced by a non-advanced user. A
variety of parameters, including pre-processing techniques, distance measures, and so on,
creates a user-friendly T-DTS environment for the classification tasks’ processing. The
key role in T-DTS implementation is played by the complexity estimation module that
controls the overall T-DTS performance through the θ threshold. To handle the θ threshold in T-DTS successfully, one would require knowledge of certain CEA features, of the specifics of the classification problem, of the DA, etc.; instead, to streamline the usability of T-DTS, we proposed the self-tuning procedure, which automatically optimizes the threshold and allows the results to be obtained in a deterministic way.
Since the question of finding the optimal θ threshold is known to be an NP-hard problem, we have created a novel semi-automated (several parameters still have to be predefined by the user) procedure for finding the quasi-optimal θ threshold at which the results of the T-DTS classification reach their quasi-maximum. Furthermore, the procedure keeps track of several possible results, giving the user a choice of suitable solutions.
The proposed semi-automated procedure also opens a new direction for the future development of T-DTS. An important aspect of this T-DTS enhancement is that the proposed semi-automated procedure uses the concept of the maximal possible decomposition tree – the tree in which, at the leaf level, each sub-cluster contains a sub-database that is simplest to classify. During the process of maximal decomposition, T-DTS accumulates information about the distribution of the clusters' complexity. The histogram over the θ sub-interval of [0;1] represents how the number of clusters varies with their complexity. It provides additional information about database divisibility when the complexity estimator, the decomposition module and the other T-DTS parameters are fixed. This is also very important, because the choice of the decomposition technique and of the estimator remains the user's responsibility. The T-DTS concept makes provision for an intelligent and self-organizing allocation of the decomposition clusters. Moreover, the above-mentioned histogram represents the initial database divisibility; from the global point of view of self-organizing systems, and in accordance with the cross-disciplinary overview (Haken 2002), this histogram is a key macroscopic characteristic that exhibits the self-organizing abilities of T-DTS. In fact, this histogram of the maximal tree-like database decomposition predefines the database decomposition process.
It is also important to mention that the proposed ANN-based complexity estimator maps very well onto the RCE-kNN-based complexity estimator implemented on the IBM© ZISC®-036 Neurocomputer, which has been extensively validated using benchmarks and real-world problems. This makes it possible to create a hardware implementation of T-DTS or, more precisely, a hardware RBF-kNN-like T-DTS for the IBM© ZISC®-036 Neurocomputer. In spite of the possible limitations of a hardware-based implementation of the decomposition methods, the exclusive benefits of the RCE-kNN ZISC®-036-based complexity estimator make the hardware implementation a viable choice not only for clustering, but also for classification with an RBF-based processing unit.
To explore the direction of hardware-based neurocomputing, we implemented a hybrid software/hardware prototype of T-DTS using the IBM© ZISC®-036, where the hardware was used to implement the complexity estimation and processing modules. Naturally, such an implementation has its advantages and disadvantages. The principal disadvantage is that it reduces the conceptual flexibility of T-DTS. The main advantages of the hardware implementation are the speed achieved through parallelization and the computational efficiency. Considering the fact that modern classification problems require the analysis of huge data stores, the new trends in the development of high-speed parallel hardware systems make the direction of hardware T-DTS development a very attractive alternative to the currently predominant software solutions.
The T-DTS enhancement, the implementation of the ANN-based complexity estimator on the IBM© ZISC®-036 Neurocomputer and its PC-software-based version, together with 16 other complexity estimating techniques, have been verified using classification benchmarks and real-world classification problems. The design and the results of this verification are reported in Chapter IV, Validation.
Chapter IV:
Validation aspects
In this chapter, we validate the main aspects of the proposed T-DTS approach in the following steps. First, we compare the effectiveness of the proposed ANN-based complexity estimation technique to that of other available estimators outside of the T-DTS framework. In the second part of the chapter, we validate and assess the effectiveness of the proposed T-DTS enhancements within the framework; specifically, we test the performance of the complexity estimation techniques embedded into T-DTS and of the proposed T-DTS self-tuning procedure.
My validation datasets consist of two parts: benchmarks specially designed for the validation of classification techniques, and a real-world classification problem.
IV.1 ANN-structure based complexity estimators
The prime objective of enhancing T-DTS performance led me to propose a new method for estimating classification complexity. The result of our research is an ad hoc ANN-structure-based complexity estimator. This concept of classification-complexity estimation is free of the disadvantages of class-distribution analysis and, as a novel classification complexity (discriminant) estimating method, can also be used outside the T-DTS framework, for example for improving the readability of other classification models or for pre-analysis of a problem in pattern recognition.
The use of an ANN-based concept was motivated by the possibility of exploiting a neural network's learning indicator(s) to obtain implicit information (parameters) about a complex industrial system's process, plant, etc. (Madani and Berechet 2001). Initially, the ANN-based complexity estimator was implemented on the IBM© ZISC®-036 Neurocomputer because of its advantages, such as the evolutionary, RCE-kNN-like construction of the neural network. The second step was a simulation of the proposed estimator in the Matlab environment. Conceptually, the two realizations are similar: both stand on the same principal idea presented in the theoretical part; however, there are some minor differences that have a minor influence (as will be shown) on the final outputs.
The core difference between these two realizations of the ANN-based complexity estimator is that the Matlab simulation does not allow the overlapping of influence fields that may occur in the IBM© ZISC®-036 Neurocomputer implementation during prototype association (Lindblad and al. 1996). Thus, for the Matlab-simulated kNN-like ANN-structure-based complexity estimator, the MIF parameter of each prototype is automatically adjusted (minimizing the MIF parameter starting from the value pre-set in advance) during the polyhedron construction. By contrast, the IBM© ZISC®-036 implementation of the estimator adjusts the threshold of the neighbourhood neurons (prototypes with an associated category/class), where the final MIF parameter should not be lower than the pre-set value. Nevertheless, let us highlight that for both types of complexity estimator the common and principal feature is the same: extraction of the complexity ratio from the analysis of the Voronoy polyhedron construction process.
The following section provides the testing results, and their analysis, obtained for the IBM© ZISC®-036 ANN-structure-based classification complexity estimator.
IV.1.1 Hardware-based validation
Historically, the proposed ad hoc classification-task complexity estimator was first verified using the IBM© ZISC®-036-based PC-board Neurocomputer implementation. We chose this hardware because it is a good candidate for a hardware-implemented RCE-kNN Neural Network that uses an evolutionary learning strategy (Madani, De Tremiolles and Tannhof 2001). During the kNN-like partitioning (learning), the thresholds of the neurons are adjusted; during the generalization phase, the neighbouring neuron(s) may (or may not) be activated. For a given learning database, the result obtained after the learning process is an RCE-kNN Neural Network structure represented by a Voronoy polyhedron. Using the complexity estimating concept described above, we compute the classification complexity ratio (coefficient/rate).
IV.1.1.1 IBM© ZISC®-036 Neurocomputer's implementation and benchmarks
To validate the new ad hoc concept using the IBM© ZISC®-036 Neurocomputer-based implementation, I have constructed academic (i.e. simple) classification benchmarks. Basically, there are five databases representing a mapping of a restricted 2D space onto 2 categories/classes, depicted in Fig. IV.1. Each pattern was divided into two or more equal striped sub-zones, each of them belonging alternately to class 1 or 2.
Fig. IV.1 : Stripe classification benchmarks
The benchmark samples are created using randomly generated instances s_j that contain two coordinates. Theoretically, the number m of s_j samples influences the quality of the demarcation of the striped zones (categories): when the vectors are uniformly randomly distributed, a higher quantity of vectors/prototypes determines the class-separating hyperplane more precisely. According to the value of the first coordinate of s_j, and according to the type of striped pattern, an appropriate category c_j is assigned to the instance s_j. The m structures defined by the pairs <s_j, c_j> are sent to the ZISC®-036 Neurocomputer for learning. Using equation II.42, I compute the indicator function Q_i(m), where i is the pattern index defined in Fig. IV.1 (the leftmost pattern has index i = 1 and the rightmost i = 5). Afterwards, I calculate the complexity ratio for each of the five benchmarks.
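A minimal Matlab sketch of how such a stripe benchmark can be generated (assuming the restricted 2D space is the unit square divided into k equal vertical stripes assigned alternately to the two classes) is:

    m = 1000;  k = 10;                  % number of instances and of stripes (Example 5)
    s = rand(m, 2);                     % randomly generated 2D instances s_j
    c = 1 + mod(floor(s(:,1) * k), 2);  % class c_j assigned from the first coordinate
    % The m pairs <s_j, c_j> are then sent to the complexity estimator for learning.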
This range of validation tests has been performed on the IBM© ZISC®-036 Neurocomputer using two different modes, L1 and LSUP (Lindblad and al. 1996). More detailed information concerning the internal ZISC®-036 parameters and their realization is available in Appendix B.1.
For these 5 different databases, we have performed validation using 8 database sizes m: 50, 100, 250, 500, 1000, 2500, 5000 and 10000 s_j instances. For each set of the three parameters (type of pattern, ZISC®-036 metric/distance mode, and database size m), the tests have been repeated 10 times for statistical purposes. In total, 800 tests have been performed. Fig. IV.2 and Fig. IV.3 show the results.
It is expected that for Example 5 (pattern i=5) (Fig. IV.1 - 10 stripe-like zones), the indicator function Q5(m) has the highest values among Q1(m) – Q4(m). It is also expected that the classification complexity ratio for Example 1 (pattern i=1), calculated by the IBM© ZISC®-036 ANN based complexity estimator, is the lowest.
Let us highlight that in Fig. IV.2 and Fig. IV.3 one observes the complexity indicator behaviour, not a complexity estimation function. Let me note that in the T-DTS framework 1 stands for the easiest case and 0 for the most complex; therefore, any complexity estimation output is linearly normalized.
Basically, Fig. IV.2 and Fig. IV.3 show an example of the Qi(m) complexity indicators' variations versus the learning database size m for the 5 different benchmarks. I have considered that the calculated classification complexity estimation ratio Q1(m0) corresponds to the easiest case and Q5(m0) to the most difficult (Fig. IV.1).
We expect that, for any given benchmark problem, increasing m reduces the problem ambiguity, which is why we observe a declining Qi(m). This means that the considered classification task becomes less complex when enough representative examples are available. On the other hand, the benchmarks' complexities are different: it is intuitively expected that the Q5 indicator function lies above Q4. Based on this supposition about the Qi(m) behaviours, I have approximated Qi(m) with a polynomial function of degree 3 in order to capture m0 from equation II.42, where Qi(m0) acts as the classification task complexity ratio.
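The sketch below illustrates, under the assumption that m0 can be read off the fitted cubic at its inflection point (where the second derivative changes sign), how such a degree-3 polynomial approximation of the measured (m, Q(m)) pairs could be computed; the exact definition of m0 follows equation II.42 and is not reproduced here.

```python
import numpy as np

def fit_q_indicator(m_values, q_values):
    """Fit Q(m) samples with a degree-3 polynomial and read off m0 and Q(m0).

    Illustrative sketch only: here m0 is taken at the inflection point of the
    fitted cubic (where its second derivative changes sign); the thesis defines
    m0 and Q(m) through equation II.42.
    """
    coeffs = np.polyfit(m_values, q_values, 3)      # a3, a2, a1, a0
    poly = np.poly1d(coeffs)
    m0 = -coeffs[1] / (3.0 * coeffs[0])             # root of 6*a3*m + 2*a2 = 0
    return m0, float(poly(m0))

# usage: m0, q_m0 = fit_q_indicator([50, 100, 250, 500, 1000, 2500, 5000, 10000], measured_q)
```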
The following Table IV.1 consolidates the classification complexities obtained for the benchmarks using the proposed method.
Table IV.1 : Benchmarks complexity rates obtained using the IBM© ZISC®-036 implementation of the ANN-structure based complexity estimator

Benchmarks         LSUP ZISC®-036 mode        L1 ZISC®-036 mode
                   m0        Qi(m0)           m0        Qi(m0)
Q1 (Example 1)     100       0.846            88        0.849
Q2 (Example 2)     170       0.818            168       0.823
Q3 (Example 3)     190       0.767            186       0.771
Q4 (Example 4)     235       0.760            229       0.761
Q5 (Example 5)     265       0.739            254       0.746
Fig. IV.2 : Stripe classification benchmarks : Qi(m) behaviour versus the learning
database size m, LSUP ZISC®-036 mode
Fig. IV.3 : Stripe classification benchmarks : Qi(m) behaviour versus the learning
database size m, L1 ZISC®-036 mode
These results are plotted in Fig. IV.2 and Fig. IV.3: the Qi(m) behaviour corresponds to the intuitive complexity of the classification benchmarks, and Qi(m0) shifts in conformity with the aforementioned expectations.
The difference between the LSUP and L1 modes of the IBM© ZISC®-036 Neurocomputer reflects a difference in the Voronoy polyhedron construction. Thus, the rhomb-like L1 metric (Appendix C.1) produces a better space partitioning (i.e. a lower number of clusters is required) than the square-like LSUP distance, regardless of the fact that the classes are purely linearly separable.
It is important to note that as the classes’ of benchmarks are perfectly separable,
meaning that Bayes error is equal to zero, it is expected that for classification complexity
estimating methods which is based on the theoretical Bayes error will not find difference
between Example 1 and Example 5.
Analysis of the plots from m0,Q1 (Example 1) to m0,Q5 (Example 5) for the related classification tasks implies the following property:

m_{0,Q1} < m_{0,Q2} < m_{0,Q3} < m_{0,Q4} < m_{0,Q5}        (IV.1)

where

Q_1(m_0) > Q_2(m_0) > Q_3(m_0) > Q_4(m_0) > Q_5(m_0)        (IV.2)
With respect with the task’s complexity incensement to observe the results that can be
represented by equation IV.1 and equation IV.2. Concerning the behaviour of indicator
function Qi(m) one has 1)(lim =+∞→
mQim.
Let me stress once more that the MIF (Minimum Influence Field of the neuron, Appendix B.1) is an important parameter influencing the Voronoy polyhedron construction and, finally, the quality of the complexity ratio.
If one looks at the provided benchmark testing as a classification process (Fig. IV.4 and Fig. IV.5), the results are, as expected, very sensitive to the m-parameter and less sensitive to the fixed IBM© ZISC®-036 metric (distance). For ∀ m > m_0, m → +∞, Fig. IV.2, Fig. IV.3 and Fig. IV.4, Fig. IV.5 show that the situation becomes more predictable regarding the indicators' evolution and the classification rates. In other words, our validation indicates that extra data (additional prototypes, m > m_0) do not change the dynamics (second derivative) of the classification process.
Fig. IV.4 : Benchmarks’ classification rates behaviour versus learning database size
m, LSUP ZISC®-036 mode
Fig. IV.5 : Benchmarks’ classification rates behaviour versus learning database size
m, L1 ZISC®-036 mode
To conclude on the results obtained for the constructed stripe-like benchmarks, we can state that the behaviour of the Qi(m) complexity indicators, the Qi(m0) ratios computed by the ANN based complexity estimator, and the classification rates used as a quality check of the Voronoy polyhedron constructed by the ZISC®-036 RCE-kNN have confirmed our expectations. The next section is dedicated to the validation of the same realization of the ANN based complexity estimator, but for a real-world classification problem.
IV.1.1.2 IBM© ZISC®-036 Neurocomputer’s implementation facing
Splice-junction DNA sequence classification problem
The second part of our validation is related to complexity estimation for a real-world problem. A good candidate is the well-studied Splice-junction DNA-Patterns classification problem from the well-known Machine Learning Repository.
This classification problem is related to the complex process of protein creation. During this process, in higher organisms, the superfluous DNA sequence is eliminated. The points on a DNA sequence at which redundant DNA is removed are called splice junctions.
One has to recognize, for given DNA sequences, the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out). This problem consists of two subtasks: recognizing exon/intron boundaries (referred to in the original database as EI sites), and recognizing intron/exon boundaries (IE sites). In the biological community, IE borders are referred to as acceptors, while EI borders are referred to as donors.
To evaluate the performance of the complexity estimation approach, I use the molecular biology database titled "Primate splice-junction gene sequences (DNA) with associated imperfect domain theory", which is available in the mentioned Machine Learning Repository of the Bren School of Information and Computer Science, University of California, Irvine. This database has the following main features: 3190 instances, 60 attributes, 3 classes (labelled as N-class, consisting of 50% of all instances, EI-class – 25% and IE-class – 25%), and no missing attribute values.
I start the validation with the generation of databases of different size m. We have created these databases by randomly extracting sub-databases from the original one with respect to the classes' distribution. The number of instances m corresponds to the database size. Each instance sj belongs to a category cj equal to 1, 2 or 3. The vector sj consists of 60 sequential DNA nucleotide positions. The pairs sj, cj, where j is the pair index (1 ≤ j ≤ m), are sent to the ZISC®-036 Neurocomputer for learning (Voronoy polyhedron construction).
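A minimal sketch of this class-distribution-respecting extraction, assuming the data are held in NumPy arrays (the function name extract_subdatabase is hypothetical), could look as follows.

```python
import numpy as np

def extract_subdatabase(X, y, m, seed=0):
    """Randomly extract an m-instance sub-database respecting the class distribution.

    Illustrative sketch of the stratified extraction described above (50% N,
    25% EI, 25% IE for the splice-junction data); proportions are taken from y.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    picked = []
    for cls, count in zip(classes, counts):
        n_cls = int(round(m * count / len(y)))          # per-class quota
        idx = np.flatnonzero(y == cls)
        picked.append(rng.choice(idx, size=min(n_cls, len(idx)), replace=False))
    picked = np.concatenate(picked)
    return X[picked], y[picked]
```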
Comparing the previous academic benchmarks with the DNA Splice-junction problem, one may state that the difference lies only in the number of classes and attributes.
Let me mention that the procedure of classifying DNA Splice-junction instances corresponds to sorting the classes with a small decision tree. According to the detailed description of the problem, a classifier must first decide whether a given instance belongs to the group of classes EI and IE or to class N. Afterwards, inside the first group, the ANN or another classifier should separate the EI class from the IE class. However, even this does not fully express the difficulty of this real-world classification problem.
To demonstrate the hardness of DNA Splice-junction classification, we quote Sarkar's and Leong's work (Sarkar and Leong 2001), which uses a special representation named the DNA walk representation to figure out the complexity: "We have plotted the trace corresponding to all three classes. We can observe that to some extend, the lines representing the classes IE and EI can be separated visually; however, even for human observer it is difficult to separate lines corresponding to the class N from the lines of the other two". Moreover, the database does not represent purely separable classes, because the Bayes error is equal to 0.0003.
Returning to the classification complexity estimation validation protocol for the DNA splice-junction problem, we have to mention that, in order to obtain statistically sound results, we generate 200 databases (files). Each validation test has been repeated twice to check the reliability of the results. As the boundaries between the classes are "complex even for human observer" (Sarkar and Leong 2001), there is no special preference in the selection of the distance measure. Thus, the validation has been performed in the L1 distance mode. However, a crucial issue is the selection of the ZISC®-036 MIF parameter. I have done my validation using 3 MIF values: 55, 56 and 4096. Approximately 8400 tests have been performed.
Fig. IV.6 represents the behaviour of the complexity indicators Qk(m), where k denotes the MIF parameter. Fig. IV.7, Fig. IV.8 and Fig. IV.9 demonstrate the influence of m0 on the quality of the learning process; mk corresponds to m0 calculated for the k-curve.
Fig. IV.6 : Qk(m) evaluation for DNA splice-junction classification problem using
different k-MIF parameters (k: 55, 56, 4096) for: Q55(m), Q56(m), Q4096(m), mk –
corresponds to calculated m0 for each k-curve
Fig. IV.7 : Quality check of RCE-kNN-like Voronoy polyhedron construction based
on its generalization ability performed for k-MIF parameter k=55
These figures are similar to Fig. IV.4 and Fig. IV.5. A brief chart analysis suggests that the highest generalization rate (rate of success), 53.5%, and the lowest rate of failure, 29.4%, are reached for MIF = 56. This allows us to conclude that this MIF parameter is the most appropriate among those available. Another fact taken into consideration for selecting MIF = 56 is the highest rate of uncertainty. It means that, in the RBFN implementation on the IBM© ZISC®-036, some instances during generalization might be labelled as belonging to none of the proposed categories. One may thus extract these undetermined instances and, using additional techniques, enhance the final generalization rate.
Concerning the generalization classification rate in absolute value for the MIF = 56 parameter, let me note that, to determine an approximate upper limit of this value, the result obtained in (Bouyoucef 2006) using a software simulation of the RBFN approach is used for comparison. I have to mention that the differences between these two implementations of the networks are insignificant: the software realization uses an advanced intra-technique of RBFN adjustment, but it does not increase the generalization rate. In the work (Bouyoucef 2006) it is 66.3%.
Let me recall that the class distribution is: 50%, 25% and 25%. Consider the question "Which of the given internal parameters (such as the MIF) and distance modes of the IBM© ZISC®-036 implementation of the ANN based complexity estimator must be chosen in order to obtain the most appropriate Voronoy polyhedron (related to the RBFN structure)?" The answer is: the set of parameters for which the testing/generalization or learning classification rates reach their maximum; the corresponding RBFN structure will then correspond to the most appropriate polyhedron. This approach demonstrates the above-mentioned relation between the classification process and the estimation of classification complexity.
The polynomial approximation of the Qk(m) indicators has been performed similarly to the classification benchmarks. The classification complexity rates are presented in Table IV.2.
Let me note that I share the possible criticism concerning the lack of evidence about the curve(s) behaviour in Fig. IV.6, in comparison to the benchmark cases of Fig. IV.2 and Fig. IV.3, where one may clearly observe the change of sign of the second derivative.
Fig. IV.8 : Quality check of RCE-kNN-like Voronoy polyhedron construction based
on its generalization ability performed for k-MIF parameter k=56
Fig. IV.9 : Quality check of RCE-kNN-like Voronoy polyhedron construction based
on its generalization ability performed for k-MIF parameter k=4096
Table IV.2 : Complexity rates obtained for the Splice-junction DNA classification problem (original database) using the IBM© ZISC®-036 Neurocomputer

MIF parameter     m0 (denoted on the charts as mk)     Qk(mk)
55                730                                  0.382
56                775                                  0.438
4096              700                                  0.896

Let me mention that for the m55, m56 and m4096 points, the resulting Q(m0) complexity rates are calculated using the approximated polynomial functions of the Q(m) indicators.
Analyzing the classification potential of the Voronoy polyhedrons constructed for the 3 different MIF parameters (Fig. IV.7 - Fig. IV.9), the best available candidate for the role of the final classification complexity ratio of the DNA Splice-junction problem is Q56(m56) = 0.438.
During the validation, I have extracted from the remaining database instances (3190 - m) a test (generalization) database of the same size m and in the same manner, so as to have directly comparable results. Let us highlight that our aim was not to obtain the minimum classification error; the goal of this testing is to assist in selecting the most appropriate classification complexity rate among those available.
It is interesting to note that the change of sign of the second derivative influences not only the Qi(m) indicators' behaviour, but also the whole classification process. Fluctuations, i.e. a quick change of the second derivative over the short sub-interval m ∈ [700;800], have been observed, which is similar to a hysteresis effect. By analogy with this phenomenon, new, unseen m-instances for m > 800 (in the hysteresis phenomenon this would correspond to an extra force) do not change the second derivative of the Q(m) indicator, because the new instances do not significantly change the corresponding Voronoy polyhedron structure. Analogously to this real-world problem, one may observe similar behaviour of the indicator functions for the classification benchmarks: Fig. IV.4, Fig. IV.5 and Table IV.1.
In conclusion, I can state that the proposed ad hoc ANN based classification complexity method in its IBM© ZISC®-036 Neurocomputer realization has confirmed our expectations. Using this approach, I have obtained a correct ranking of the benchmarks' classification complexity, and defined the classification complexity of the DNA Splice-junction problem. Comparing the benchmarks' complexity rates with the DNA Splice-junction problem's rate, we can state that the latter problem is the most complex. The analysis of the classification tasks, such as the separating space and the classes' distribution, fully corresponds to our expectations.
The next Section IV.1.2 is dedicated to the validation of the proposed complexity estimator, but for a wider range of tests and with the software implementation of the ANN based complexity estimator. This deeper validation is done in order to gain confidence in this classification complexity estimation tool before applying it in the T-DTS framework. The second main aim of the following validation is to compare the ANN based complexity estimation approach to the other complexity estimation techniques implemented in T-DTS.
IV.1.2 Software-based validation
In this section, I provide the results of the ANN-structure based complexity estimator validation using its software implementation. Our prime aim is to verify the concept of the ANN based complexity estimator more deeply, comparing it to 17 other estimators (the list is available in Section IV.2) that had already been implemented (Bouyoucef 2007) as Matlab code and as part of the previous Matlab T-DTS version 2.00. For this reason, we have created a wider and more complex range of benchmarks, and I have performed checks for a wide range of ANN based complexity estimator parameters, such as the order of the approximating polynomial. To make the outputs comparable, all 17 complexity estimators have been linearly normalized so that the complexity rates vary in the interval [0;1], where 1 signifies the easiest classification case and 0 the hardest. Such standardization and normalization is needed for the T-DTS self-adjusting procedure.
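The following minimal sketch illustrates the kind of linear rescaling meant here; the per-estimator bounds raw_easiest and raw_hardest are assumptions, and the actual normalization constants of T-DTS 2.00 are not reproduced.

```python
def normalize_complexity(raw, raw_easiest, raw_hardest):
    """Linearly map a raw estimator output onto [0;1], 1 = easiest, 0 = hardest.

    Sketch of the normalization convention described above; raw_easiest and
    raw_hardest are the estimator-specific bounds (assumed known per estimator).
    """
    t = (raw - raw_hardest) / (raw_easiest - raw_hardest)
    return min(1.0, max(0.0, t))
```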
The first part of this section provides the results obtained for a wide range of academic
benchmarks, the second one – for real-world problems.
IV.1.2.1 ANN-structure based complexity estimator using classification
benchmarks
Let me start with the description of the first academic classification benchmark (Fig. IV.10).
Fig. IV.10 : Square classification benchmarks, 2 classes, 2000 prototypes
The tests have been performed for the 3 implemented distance measures: LSUP, L1 and EUCL (Euclidean). To obtain statistics for each fixed set of ANN based complexity estimator parameters, the same test has been repeated 10 times. The benchmark depicted in Fig. IV.10 has been constructed not only for m = 2000, but also for different values of m: 200, 500, 1000 and 2000.
The influence of three different MIF values (here MIF is the Maximum Influence Field: 1024, 10.24 and 0.1024) on the complexity ratio has been checked. In total, approximately 1440 tests have been performed. If we take into account the 17 remaining complexity estimators, where the PRISM based ones have been verified using three values of the internal resolution parameter B (2, 4 and 8), the overall number was above 1600. The following Fig. IV.11 presents the influence of the benchmark complexity (in terms of an increasing number of squares) on the obtained ANN-structure based classification complexity ratio(s)/rate(s).
In the figure, the bold solid lines represent the average complexity ratio for each distance mode, and the dotted lines the corridor of standard deviations for the corresponding distance mode. A brief analysis suggests that this complexity estimator correctly reflects the trend: an increasing number of squares increases the complexity (the rates decrease). Secondly, this particular type of benchmark was used to check the influence of the type of metric (distance) on the complexity ratio.
[Plot: complexity ratio versus the number of squares (2SQ, 3SQ, 4SQ, 5SQ) for the L1, LSUP and EUCL distance modes.]
Fig. IV.11 : ANN-structure based complexity estimator evaluation for: Square
benchmarks, 2 classes, 2000 number of prototypes, MIF = 1024, 3 distance modes
(LSUP - ∆, EUCL - x and L1 - )
It was natural to expect the highest ratios (i.e. the least complex problem) to be obtained with LSUP, because a Voronoy polyhedron built from square-like LSUP influence fields is an ideal match for the proposed benchmarks. The high standard deviation for EUCL and L1 does not allow us to state that the EUCL metric is definitely better than L1. However, taking into consideration that the circle (EUCL) occupies an intermediate space-covering position between the square (LSUP) and the rhomb (L1), this supports the idea that the metric has an influence on the final complexity ratio, because the EUCL distance measure takes an average position between L1 and LSUP in space covering.
Concerning the output of the remaining 17 complexity estimators, a more detailed analysis of the influence of the number of prototypes and other parameters has been done in the work (Bouyoucef 2007); in the following paragraphs we give a brief final summary:
1. Complexity estimators that do not follow the expected classification complexity trend: Maximum_Standard_Deviation (based criterion), Fisher_Disriminant_Ratio, Normalized_mean_Distance, Mahalanobis_Distance (based), and all 4 Scattered-matrix methods based on inter-intra matrix trace criteria.
2. Complexity estimators that satisfy our complexity trend expectations:
• Insensitive (ratio is invariant): KLD (Kullback–Leibler Divergence based estimator), JMDBC (Jeffreys-Matusita distance based complexity criterion), Bhattacharyya_Coefficient (based criterion), JSD (Jensen-Shannon Divergence based estimator), Hellindger_Distance (based technique).
• Sensitive: Purity_Measure (PRISM based technique), Collective_Entropy (PRISM based technique), k_Nearest_Neighbor_Estimator (PRISM based technique) and RBF_ZISC_based_Fusion_CRT (the module name of our ANN based estimator).
It was expected to have a group of insensitive complexity estimators, because they approximate the Bayes error, which is equal to zero in our case of purely separable benchmarks.
To check the influence of the MIF parameter and of the dimensionality on the classification complexity rates, we have performed a range of tests for the 1-5 stripe-like academic benchmarks of Fig. IV.1, but only for the ANN based estimator, with the number of instances equal to 2000.
To demonstrate the influence of the MIF parameter on the complexity ratio, I provide the results given in Fig. IV.12.
[Plot: complexity ratio versus stripe benchmark type (ST2, ST4, ST6, ST8).]
Fig. IV.12 : ANN-structure based complexity estimator evaluation for: Stripe
benchmarks, 2 classes, 2000 number of prototypes, LSUP distance mode (MIF=10.24
- ∆, MIF=0.1024 - x and MIF=1024 - )
The results described in Fig. IV.12 (the average values and their corridors of standard deviations) suggest that an overestimated MIF does not influence the results (see in Fig. IV.12 MIF = 1024 against MIF = 10.24). Of course, in this case the MIF increases the overall computational time. However, an underestimated MIF does not allow the algorithm to construct a Voronoy polyhedron that reflects the classification task's complexity, and the final ratio is therefore not meaningful.
Let me note that the problem of an under- or overestimated MIF can also be observed for the IBM© ZISC®-036 ANN based complexity estimator. For example, in the case of the Splice-junction DNA classification problem, we had an overestimated MIF = 4096; for this case the RBFN has a weaker generalization ability than even for MIF = 55. It means that the ANN complexity estimator suffers a lot from under- or overestimation of the MIF parameter. However, let me underscore that in the case of the software ANN-structure based complexity estimator, MIF overestimation does not influence the results, but only the computational cost, because its implementation differs from the IBM© ZISC®-036 Neurocomputer's implementation.
In practice, this fact allows a user-independent treatment of the estimator by pre-setting the default MIF parameter to the maximum possible distance between two prototypes; any other attempt to speed up the computation by making the default MIF shorter must be justified using extra assumptions about the database, which are not available a priori.
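A minimal sketch of such a default pre-setting, assuming the prototypes are available as rows of an array (the function name default_mif is hypothetical), is given below; an O(m²) scan is shown for clarity, although a cheaper bound such as the bounding-box diameter could be used instead.

```python
import numpy as np

def default_mif(X, metric=lambda a, b: np.abs(a - b).sum()):
    """Default MIF pre-setting sketch: the maximum possible distance between two prototypes.

    O(m^2) illustration of the user-independent default discussed above.
    """
    m = len(X)
    best = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            best = max(best, metric(X[i], X[j]))
    return best
```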
The following Fig. IV.13 represents the influence of the number of prototypes/plots m on the ANN based complexity estimation ratio.
[Plot: complexity ratio versus the number of prototypes (instances): 200, 500, 1000, 2000.]
Fig. IV.13 : ANN-structure based complexity estimator evaluation for: 8 Stripe
benchmark, 2 classes (4&4 stripes), LSUP distance mode (MIF=10.24 - ∆,
MIF=0.1024 - x and MIF=1024 - )
The results described in Fig. IV.13 are similar to the results described in Fig. IV.2 and Fig. IV.3 (in terms of the final classification complexity ratios), but they are not the same. The conceptual difference is that Fig. IV.2 and Fig. IV.3 correspond to an incremental increase of the m-parameter and to the behaviour of the indicator function, from which one may naturally extract the complexity ratio. In contrast, Fig. IV.13 contains the evaluation of the final complexity ratios for an upper m-limit (m ≤ 200, m ≤ 500, etc.). Let me note that these are the final ratios, not an illustration of the dynamics of the Voronoy polyhedron construction.
Concerning Fig. IV.13, it is naturally expected that the fewer data instances are available for classification, the more complex the classification problem is; a sufficient number of prototypes defines the border(s) between the class(es) more precisely. However, let us note that for the ANN based complexity estimator, the two implementation cases (Fig. IV.2, Fig. IV.3) and (Fig. IV.13) are not comparable because of the different meaning of m. In the first case it is the varying number of instances, while in the second case the m-parameter signifies the overall number of instances. Comparing these two cases, one may say that in the case of the IBM© ZISC®-036 Neurocomputer's implementation, increasing m exhibits the behaviour of the Q(m) indicator, whereas the software ANN based implementation provides the result (complexity ratio) for the overall total number of instances m.
The following Fig. IV.14 presents a range of benchmarks grouped under the name grid. These benchmarks were constructed to verify the influence of increasing dimension on the classification complexity. For the first three dimensions, the benchmarks are depicted in Fig. IV.14:
Fig. IV.14 : Grid classification benchmarks in D1, D2 and D3 dimension
Thus, grid benchmarks have been created in dimensions D1 – D3, D4 and D5. The number of clusters for each dimension is 2^dim, meaning that in D3 we have 8 box-like clusters, and in D5 – 32. For each dimension, the number of plots/instances is fixed and equal to 2000. These benchmarks contain a "doubled classification complexity": the first factor is the increasing dimension for a fixed number of prototypes, meaning that the border in a high-dimensional space is determined more weakly for a smaller number of prototypes, because the instances are uniformly distributed without any preference for any dimension; the second factor is that the number of clusters doubles for each additional dimension. The obtained results and the particular details are represented in Fig. IV.15.
[Plot: complexity ratio versus dimension (D1–D5).]
Fig. IV.15 : ANN-structure based complexity estimator evaluation for: Grid
benchmarks, 2 classes, EUCL distance mode (MIF=10.24 - ∆, MIF=1024 - ,
MIF=0.5012 - ♦ , MIF=0.3018 - and MIF=0.1024 - x)
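For illustration, the following sketch generates such a grid benchmark under the assumption (not stated explicitly in the text) that each axis of [-1;1]^dim is split at 0 and the two classes alternate in a checkerboard fashion over the 2^dim cells.

```python
import numpy as np

def grid_benchmark(dim, m=2000, seed=0):
    """Grid benchmark sketch: 2**dim box-like clusters in [-1, 1]**dim, 2 classes.

    Assumed construction (for illustration): each axis is split at 0, giving
    2**dim cells, and the class alternates in a checkerboard fashion.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(m, dim))
    cell = (X > 0).astype(int)                     # which half-space per axis
    y = cell.sum(axis=1) % 2 + 1                   # checkerboard classes 1 / 2
    return X, y

# e.g. the D5 case with 32 hypercuboid cells
X, y = grid_benchmark(5)
```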
Let me note that the Euclidean type of metric has been selected because of its non-conformity to the benchmarks. For MIF = 1024 and MIF = 10.24 the complexity rates are identical, because both are reduced during the Voronoy polyhedron creation. Given that all plot coordinates for a given dimension lie in the interval [-1;1], the trends for MIF = 1024 and MIF = 10.24 are similar. On the other hand, the underestimated MIF = 0.1024 has a crucial impact on the complexity determination in the high-dimensional D5 feature space. Intuitively, it is impossible to agree that 32 purely linearly separable box-like clusters in D5 have 0-complexity (i.e. are very complex). Therefore, the only results that can be taken into consideration are those provided by the overestimated MIFs. Overestimating the MIF increases the computational time, but returns reasonable complexity rates; that is why the question of selecting the default MIF parameter is important, especially for high-dimensional classification tasks.
Let us also mention that a similar high-dimensionality test has been performed for the 8 Stripe classification benchmark. The conclusion concerning the importance of the MIF parameter is also confirmed by the same dynamics as in Fig. IV.15.
Therefore, to figure out the dimensional complexity induced in high-dimensional classification benchmarks, we have extended the construction of the 2D 8-stripe classification benchmark. Following a similar logic, we increase the number of dimensions up to 5, but the number of clusters and their form remain similar: it is a benchmark with 4 stripes of class one and 4 stripes of the opposite class two. We create a range of benchmarks where the separability criterion lies in one dimension. The next Fig. IV.16 presents the obtained results.
[Plot: complexity ratio versus dimension (D2–D5) for the Grid and 8-stripe benchmarks.]
Fig. IV.16 : ANN-structure based complexity estimator evaluation for: Grid and 8-
stripe-benchmarks, 2 classes, EUCL distance mode, MIF=1024
Concerning the results described in Fig. IV.16, the 8-stripe classification benchmarks are overall more complex than the Grid benchmarks. The dimension increase causes a growth of the difference between the Grid-benchmark and 8-stripe-benchmark complexity ratios (in absolute values). Intuitively, the opposite was expected, because for D5 we still have only 8 multidimensional clusters, whereas for the Grid benchmark in D5 we have 32 hypercuboids. This discrepancy between the expected and actual results can be explained if one takes into account the use of a kNN-like algorithm for the Voronoy polyhedron construction. However, this falsification of the proposed approach (in Popper's sense (Popper 2002)) should not discourage us.
Thus, the explanation of these results, which do not match the intuitive complexity of the benchmarks, is the following. It is not because the 8-stripe problem is more complex than the Grid one; it is because of the Voronoy polyhedron construction approach employed as the core of the ANN-structure based complexity estimation. The RCE-kNN-like method suits or matches well (i.e. requires a low number of clusters) the "ideal" hyper-cube space partitioning, especially for the D5 Grid benchmark, whereas for the 8-stripe benchmark in D5 the RCE-kNN-like algorithm constructs a higher number of clusters.
Therefore, one may note a relation between the ANN-structure based complexity estimator and the Kolmogorov view on complexity. Thus, the RCE-kNN-like polyhedron construction algorithm no longer appears to be "the most efficient" description for estimating the complexity of these grid-type benchmarks. However, in defence of our technique, let me mention that such benchmarks are very artificial; in most real-world cases the border between classes is more complex. Another point is that one may modify the space partitioning algorithm used inside my estimator or, more radically, apply a different ANN.
To check how the polynomial order/degree parameter influences the final classification complexity value, we have used the four spirals benchmark described in Fig. IV.17.
This 2D classification problem belongs to the same range of benchmarks as the stripes, for example. It is called the four spirals problem because it contains four volutes, and it follows the same benchmark naming idea as, for example, the stripes (2, 4 – 8). Not only the number of classes and the form of the border, but also the overall number of borders is one of the most crucial parameters of the benchmarks. The border of the benchmark described in Fig. IV.17 is a spiral. The number of instances was selected equal to 500; as has been shown during the testing, a reduced number of instances increases the classification complexity.
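A hedged sketch of one possible construction of such a benchmark is given below; the exact generating formulas of the thesis are not reproduced, and the assumption here is that the four volutes are Archimedean spiral arms offset by 90 degrees, with alternating classes.

```python
import numpy as np

def four_spirals(m=500, turns=2.0, seed=0):
    """Four-spiral benchmark sketch: 4 spiral arms (volutes), 2 alternating classes.

    Assumed construction for illustration: arms are Archimedean spirals offset
    by 90 degrees; arms 0 and 2 belong to class 1, arms 1 and 3 to class 2.
    """
    rng = np.random.default_rng(seed)
    arm = rng.integers(0, 4, size=m)
    t = rng.uniform(0.0, 1.0, size=m)
    theta = 2.0 * np.pi * turns * t + arm * (np.pi / 2.0)
    r = t                                             # radius grows along the arm
    X = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    y = arm % 2 + 1
    return X, y
```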
This experiment has been performed for different types of MIF parameters: underestimated and overestimated. The polynomial approximation supports the idea expressed in equations II.43 and II.44. For both even (2, 4, 6) and odd (3, 5, 7) polynomial orders, the complexities are similar when one takes into account the corridors of the standard deviations. However, theoretically, we expected the approximation by a polynomial of order 3 to stand out, because in fact the behaviour of the Q(m) indicator is not predictable (Fig. II.6), especially when the m parameter is small.
Fig. IV.17 : Four spiral classification benchmark, 2 classes, 500 prototypes
The following validation check has been performed for polynomial orders/degrees 1-7. The results are depicted in Fig. IV.18.
[Plot: complexity ratio versus polynomial order (1–7).]
Fig. IV.18 : ANN-structure based complexity estimator evaluation for: 4 Spiral
benchmark, 2 classes, EUCL distance mode (MIF=0.1024 - x, MIF=0.2048 - ,
MIF=10.24 - ∆ and MIF=1024 - )
A brief analysis of this experiment suggests that the polynomial order parameter and the MIF must be defined by the user. However, what matters is not the number itself, but the parity of this parameter, as it describes two different Q(m) approximation approaches, II.43 and II.44. Another point is that, if one takes into account the standard deviation of the average complexity ratio, the difference between odd and even orders is not crucial at all, although it is noticeable, because the odd-order ratios lie on the border of the standard deviation corridor of the even orders, and vice versa.
Finally, the next Fig. IV.19 presents a sequence of benchmark problems (with their abbreviations) ranging from the simplest (left) to the most complex (right).
One may claim that the last 2CR benchmark is less complex than the sinusoid or spiral examples. Concerning the construction formulas, all of these 3 benchmarks use the sine function to define the classes' border, but only the last 2CR benchmark is constructed so that two overlapping zones are present. That is why the given benchmarks' Bayes error is equal to zero, except for the overlapping circles of Fig. IV.19 (right). According to this theoretical criterion, we have defined the 2CR example as the most complex one.
The features of these 6 benchmarks are similar: 2 classes, 2000 prototypes; the validation has been performed for MIF = 1024. The results for these benchmarks are depicted in Fig. IV.20.
Fig. IV.19 : Six classification benchmarks [from left to right, from top to down: 2
Stripes (2ST), 2 Grids (2GR), 2 Squares (2SQ), 2 Sinusoids (2SN), 2 Spirals (2SP)
and 2 Circles (2CR) with small overlapping zone]
[Plot: complexity ratio versus benchmark (2ST, 2GR, 2SQ, 2SN, 2SP, 2CR) for the L1, LSUP and EUCL distance modes.]
Fig. IV.20 : ANN-structure based complexity estimator evaluation for: 6
classification benchmark, 2 classes, 2000 prototypes, MIF=1024 (LSUP- ∆, EUCL - x
and L1 - )
Firstly, let me note that the central indicator is the result obtained using the EUCL type of distance metric, because a priori, and especially for real-world problems, there is no evidence of which type of distance measure matches a given problem best. Secondly, the first three types of benchmarks (2ST, 2GR and 2SQ) are strictly linearly separable, which is why the LSUP metric does not reveal any classification difficulty. For L1, we have an unexpected growth of the classification complexity value from the 2SQ to the 2SN benchmark, but let us note that the standard deviation of the complexity ratio for the 2SN benchmark using the L1 metric is the highest over the whole range of tests and is equal to 0.0099. Therefore, if we take into account the range of the overall complexity for the proposed benchmarks, [0.920;0.995], the absolute value of the obtained standard deviation allows us to ignore the unexpected jump for 2SN with the L1 distance measure.
The same holds for the LSUP metric for the 2GR benchmark, which theoretically had to be less complex than 2ST; but for 2GR the standard deviation is 0.0042 and for 2ST it is 0.028, which is why, from the point of view of the LSUP metric and the given deviation of the averages, we cannot state that 2GR is more complex than 2ST or the opposite; one may only say that the complexities of these two benchmarks are similar.
Another important note concerns the way the sinusoids are constructed for the 2SN benchmark: we have a situation where the majority of clusters are concentrated on the classification border. Thus, the decline of the curve from 2SQ to 2SN for the EUCL metric is not sharp. The overlapping problem creates a Voronoy polyhedron where each cluster (neuron) contains one prototype from the overlapping zone, and this increases the complexity.
The next section presents the verification of the ANN based complexity estimator applied to the Splice-junction DNA classification problem and the Tic-tac-toe classification problem.
IV.1.2.2 ANN-structure based complexity estimator facing real world
classification problems
In this subsection, we deal with the complexity estimation of real-world problems. This validation has been performed using the software ANN based estimator; afterwards, the results obtained by the ZISC®-036 Neurocomputer based estimator are compared with those of the software ANN based estimator.
IV.1.2.2.1 ANN-structure based complexity estimator facing Splice-junction DNA
sequences classification problem
The same L1 distance measure has been selected for the ANN-structure based complexity estimator. However, using the software simulation in our environment, I do not have a hardware memory limit; this particular realization issue allows us to determine the classification complexity more precisely. Let us note that, because of the difference in implementation between the IBM© ZISC®-036 based estimator and its software version, and because of the different role played by the MIF parameter, these two complexity estimators are not identical.
Therefore, for the L1 metric, MIF = 1024 and the complete Splice-junction DNA database used for the Voronoy polyhedron construction, the complexity ratio equals 0.6856 +/- 0.0147 (standard deviation). Compared to the result of 0.439 obtained by the ZISC®-036 hardware based implementation, one may indeed expect a lower ratio there, because the hardware implementation used a smaller amount of information. Another point is that, using the software ANN based complexity estimator, we can also be confident about the generalization ability of the obtained structure, and there is no need to do a quality check as was done for the ZISC®-036 Neurocomputer's implementation.
The second reason why there is no need to check the generalization ability of the obtained RBFN structure is that the MIF parameter has been set equal to 1024 (higher than the maximal distance between any two prototypes), and this MIF parameter has been adjusted for every cluster of the obtained Voronoy polyhedron.
An important remark is that the results strongly depend on the database representation. It is a well-known fact that these four molecules organize the sequences into three-letter basic genetic words (triplets) that represent 22 amino acids including the "Stop" coding combination (this appears to be an error-correction mechanism, so a random error in transcribing one letter of a DNA word will not necessarily produce the wrong amino acid). We have taken this basic idea into consideration and re-encoded the initial database using triplets of binary sequences. We have therefore obtained a new database containing the same 3190 instances, where each instance is a binary code of 180-bit length.
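The following sketch shows one possible triplet-style re-encoding of a 60-nucleotide instance into a 180-bit vector; the particular 3-bit codes (NUCLEOTIDE_BITS) are hypothetical and only illustrate the principle, not the actual codes used for the re-encoded database.

```python
# Hypothetical 3-bit codes for illustration only; the actual codes used for the
# re-encoded database are not reproduced here.
NUCLEOTIDE_BITS = {"A": (0, 0, 1), "C": (0, 1, 0), "G": (1, 0, 0), "T": (1, 1, 0)}

def encode_sequence(seq):
    """Re-encode a 60-nucleotide instance as a 180-bit binary vector (60 x 3 bits).

    Ambiguity symbols present in the original database (e.g. N, D, S, R) are
    mapped to an all-zero triplet in this sketch.
    """
    bits = []
    for nucleotide in seq.upper():
        bits.extend(NUCLEOTIDE_BITS.get(nucleotide, (0, 0, 0)))
    return bits

# e.g. encode_sequence("ACGT" * 15) returns a list of 180 bits
```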
Then we have calculated the classification complexity ratio using the EUCL metric and MIF = 1024. The calculated complexity ratio is 0.72295 +/- 0.00785, meaning that the increase in the number of features decreases the overall complexity of the problem. This check demonstrates the importance of the database representation: a meaningless encoding of the classification problem might be an initial barrier to possible enhancement of classification and classification complexity estimation performance.
The interesting fact is that the majority of the remaining 11 classification complexity estimators, including (let us underscore this here) the Maximal Standard Deviation complexity-like criterion, define this task as very, very complex: the majority of rates are close to zero. We consolidate the results in Table IV.3. Let me note here that, besides the ANN based complexity estimator's parameters, the PRISM based methods contain an important resolution parameter B (Singh 2003).
Concerning the results, we have two groups of estimators. The first one consists of the methods which figure the DNA Splice-junction classification problem to be very, very complex: the ratio is near zero. Another group of estimators finds this classification problem complex; they provide reasonable values that could be taken into account, because of the verification that has been performed on the benchmarks. Besides the expected leader, the ANN based complexity estimator, similar results are shown by the Mahalanobis distance based measure and one of Fukunaga's four interclass distance measure criteria (J1 – J4).
Table IV.3 : Complexity rates obtained for the Splice-junction DNA classification problem (re-encoded database) using the ANN-structure based and other estimators

Complexity estimator                                    Inter-parameter(s)      Complexity ratio
Maximum standard deviation based                        N/A                     0.013
Normalized distance based                               N/A                     0.000
Fisher ratio distance based                             N/A                     0.000
Mahalanobis distance based                              N/A                     0.630
Interclass distance measure criterion J1                N/A                     0.002
Interclass distance measure criterion J4                N/A                     0.000
Interclass distance measure criterion J2                N/A                     0.002
Interclass distance measure criterion J3                N/A                     0.387
kNN based criterion                                     N/A                     0.520
Purity (PRISM)                                          B (resolution) = 2      0.180
Purity (PRISM)                                          B (resolution) = 4      0.045
Purity (PRISM)                                          B (resolution) = 8      0.003
Collective entropy (PRISM)                              B (resolution) = 2      0.041
Collective entropy (PRISM)                              B (resolution) = 4      0.010
Collective entropy (PRISM)                              B (resolution) = 8      0.001
Matlab implementation of ANN-structure based            EUCL, MIF = 1024        0.723 +/- 0.008
complexity estimator                                    LSUP, MIF = 1024        0.954 +/- 0.101
                                                        L1, MIF = 1024          0.722 +/- 0.016
IBM© ZISC®-036 Neurocomputer's implementation           L1, MIF = 56            0.439
A more detailed summary of the complexity estimation techniques is given in Section IV.1.3. The following section presents similar consolidated results of the complexity estimation techniques applied to the Tic-tac-toe endgame classification problem.
IV.1.2.2.2 ANN-structure based complexity estimator facing Tic-tac-toe endgame
classification problem
The aim of this classification task is to predict whether each of the 958 legal Tic-tac-toe endgame boards is a win for "x" or for "o". In total, 16 complexity criteria have been used. The results are given in Table IV.4.
A brief summary confirms our expectation concerning the classification complexity ranking. The Tic-tac-toe endgame problem is more complex than all the 2D academic benchmarks (starting from the simplest stripe case and finishing with the overlapping circles) and, at the same time, it is less complex than the Splice-junction DNA sequences classification problem, even taking into account the large overlap.
Table IV.4 : Complexity rates obtained for the Tic-tac-toe endgame classification problem using sixteen classification complexity criteria including the ANN-structure based complexity estimating technique

Complexity estimator                                    Inter-parameter(s)      Complexity ratio
Maximum standard deviation based                        N/A                     0.123
Normalized distance based                               N/A                     0.040
Fisher ratio distance based                             N/A                     0.004
Kullback-Leibler divergence based                       N/A                     1.000
Jeffries-Matusita distance based                        N/A                     1.000
Bhattacharyya criterion based                           N/A                     1.000
Mahalanobis distance based                              N/A                     0.404
Interclass distance measure criterion J1                N/A                     0.000
Interclass distance measure criterion J4                N/A                     0.000
Interclass distance measure criterion J2                N/A                     0.000
Interclass distance measure criterion J3                N/A                     0.037
kNN based criterion                                     N/A                     0.653
Purity (PRISM)                                          B (resolution) = 2      0.366
Purity (PRISM)                                          B (resolution) = 4      0.116
Purity (PRISM)                                          B (resolution) = 8      0.007
Collective entropy (PRISM)                              B (resolution) = 2      0.237
Collective entropy (PRISM)                              B (resolution) = 4      0.074
Collective entropy (PRISM)                              B (resolution) = 8      0.005
Matlab implementation of ANN-structure based            EUCL, MIF = 1024        0.828 +/- 0.035
complexity estimator                                    EUCL, MIF = 10.24       0.826 +/- 0.028
                                                        EUCL, MIF = 0.1024      0
                                                        LSUP, MIF = 1024        0.895 +/- 0.033
                                                        LSUP, MIF = 10.24       0.923 +/- 0.026
                                                        LSUP, MIF = 0.1024      0
                                                        L1, MIF = 1024          0.820 +/- 0.030
                                                        L1, MIF = 10.24         0.839 +/- 0.029
Considering the specificity of the obtained results, one may note that the Information theory based estimators, which approximate the Bayes error well, fail to detect a misclassification rate of 0.0003. However, the Jeffreys-Matusita based criterion was the most sensitive among them.
The results also suggest that, using the ANN based estimator, one may face the problem of an underestimated MIF parameter (MIF = 0.1024), which can lead to an output that labels this problem as the most complex (ratio equal to 0). Let me recall that the DNA Splice-junction classification problem, in the DNA walk representation, cannot be classified at a glance by a human using the classical colour sequence representation, according to the work (Sarkar and Leong 2001).
A general summary based on a more detailed analysis of the results, including the whole range of task parameters, is presented in the following section.
IV.1.3 Summary
First of all, we consider the results of the complexity estimators for particular classification benchmarks: Stripe, Grid, Circles, etc. For each of them, we can point out complexity estimators that do not confirm our expectation (for example, when the overall number of sub-zones is doubled while the other benchmark parameters remain the same, the complexity ratio should be lower (thereby exhibiting a more complex problem) than for the benchmark with the smaller number of sub-zones).
Whichever one may choose among the Maximum standard deviation based, Normalized distance based, Fisher ratio distance based or Mahalanobis distance based criteria, I can state that they do not correctly capture the evolution (increasing number of sub-patterns, dimensionality, class-separating space, etc.) of the classification complexity rate within a given classification benchmark. We cannot completely rely on them, because they have (as the validation results have shown) too many disadvantages. Nevertheless, this does not mean that they could not be applied in the T-DTS framework or in other clustering applications.
The second group contains the criteria that generally define the classification complexity correctly, but provide unexpected ratios for some specific parameters. They are: Fukunaga's four Interclass distance measure based criteria (Fukunaga 1972), the ANN-structure based criterion, and the Purity (PRISM) based criterion (Singh 2003). We should note, however, that the last two criteria are very sensitive to their intra-parameters: the B resolution parameter for the PRISM based criterion and, similarly, the MIF parameter for the ANN based estimator.
The third group of complexity estimators satisfies our expectations, but because they are based on approximations of the Bayes error, they are not sensitive in the case of purely separable classes. These estimators are: the Kullback-Leibler divergence based, Jeffries-Matusita distance based, Bhattacharyya criterion based, Hellinger distance based and Jensen-Shannon distance based criteria.
Briefly summarizing our 3-group categorization of the complexity estimators, we can generally state that, according to the test results, we have two leading complexity estimators: our ANN based complexity estimator and the Collective Entropy (PRISM) estimator. These two estimators' performance depends crucially on their intra-parameters, MIF and B. It is very important, especially in the case of the ANN based estimator, to set these parameters correctly using additional information.
Nevertheless, there are two important issues that I would like to highlight in this summary, which is not dedicated only to T-DTS development and its verification. First, within the statistical framework, the complexity estimation (class separability) issue stands on its own: outside of the T-DTS concept, a pre-analysis of the problem is required that allows the user to select an appropriate classifier, parameterize it, or even normalize/pre-process the given classification database.
The second issue is the use of a classification complexity estimation technique in the T-DTS Control Unit. The information returned by the complexity estimator is used for optimization purposes but, with respect to a pre-set threshold, it plays the role of decision maker. Thus, using an unsatisfactory complexity estimator (one belonging to the worst group of complexity estimators) does not mean that this estimator cannot drive the decomposition. The problem of selecting the complexity estimator is as important as the problem of selecting an appropriate decomposition method. The problem of selecting the processing unit is less important, because the PU is not required for database decomposition and tree-structure building.
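To illustrate this decision-maker role, the following schematic sketch (all names, including build_tree, estimate_complexity and split, are hypothetical, and the splitting/processing units are left abstract) decomposes a sub-database recursively while its normalized complexity rate stays below the pre-set threshold θ, under the convention used above that 1 denotes the easiest case.

```python
def build_tree(X, y, theta, estimate_complexity, split, min_size=20):
    """Schematic T-DTS-like decomposition sketch (hypothetical names).

    A node is decomposed while its normalized complexity rate (1 = easiest,
    0 = hardest) stays below the threshold theta and enough prototypes remain;
    otherwise it becomes a leaf to be handled by a processing unit.
    """
    rate = estimate_complexity(X, y)
    if rate >= theta or len(X) < min_size:
        return {"leaf": True, "data": (X, y), "rate": rate}
    children = []
    for X_sub, y_sub in split(X, y):          # e.g. a self-organizing NN decomposition
        children.append(build_tree(X_sub, y_sub, theta, estimate_complexity, split, min_size))
    return {"leaf": False, "rate": rate, "children": children}
```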
In conclusion, I can state that only the proposed ANN based complexity estimator fully satisfies our expectations of classification complexity estimation. It is, however, a computationally expensive method that requires careful adjustment of the MIF intra-parameter; yet the parallel IBM© ZISC®-036 or a similar hardware neurocomputing organization may significantly cut down the overall computational cost.
Another important parameter of the ANN based classification complexity estimator is the database size. For cases where the learning database of a real-world problem has to be constructed, the total size of this database, regardless of the classification problem representation, has a crucial influence on the final complexity ratio. For instance, one might construct a simple counter-example on which the ANN based complexity estimator could easily be made to fail (be falsified (Popper 2002)): a simple database that contains two instances, each of them belonging to a different class. Applying my estimator, we obtain two hyperplanes and a feedback ratio of 0 (the problem is very complex). One may defend my indicator by noting that the lack of information is equivalent to a state of uncertainty (not enough information to construct an optimal class-separating hyperplane); therefore, the obtained output "very complex" signifies complexity as uncertainty. There are, however, two ways to adjust the ANN-structure based complexity estimator. One may set a lower limit on the database size required for successful estimator processing (let me note that in the T-DTS framework this limit has been set to 10/20 prototypes). Another way is to use a different ANN type or a different g(.)/g(m) function, which is divided by the overall database (sub-database) size m. However, let me mention that in real-world cases the database contains hundreds of instances. Moreover, taking into account the present tendency of real-world databases to grow quickly in size because of internet technologies, the artificial cases of very small classification databases lie outside our prime interest.
In this discussion, the results obtained for the Splice-junction DNA sequence classification problem (Table IV.2 and Table IV.3), including the different database sizes, serve me as a justification of the database-size sensitivity of the ANN-structure based estimator; let me note that this is due to its RCE-kNN-like implementation, since kNN-based methods are very sensitive to the learning database extraction/selection. Thus, the IBM© ZISC®-036 Neurocomputer implementation of the ANN estimator sets the DNA problem complexity equal to 0.439; this experiment has been performed over a range of random extractions from the given initial database. That is why the result obtained by the ANN-structure based estimator applied to the complete Splice-junction DNA database figures the problem as less complex, 0.7222, with respect to the same L1 metric. In comparison to the results obtained for the Tic-tac-toe endgame problem, we can state that that classification problem is less complex than the DNA one.
In addition, to confirm our initial suppositions, we have calculated the Bayes error ε directly. For DNA, ε is equal to 0.0003 and for Tic-tac-toe it is 0 (because each s-instance represents only one distinct game combination). In the case of the Tic-tac-toe endgame problem, one may point to large overlapping sub-zones, but let us note that each instance of the given database describes a unique game regardless of position similarities.
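One simple way such a direct calculation can be carried out for databases with discrete attributes (the thesis does not detail its own procedure) is sketched below: identical instances carrying conflicting labels constitute the unavoidable error.

```python
from collections import Counter, defaultdict

def empirical_bayes_error(instances, labels):
    """Direct Bayes-error estimate for discrete data: identical instances with
    conflicting labels are unavoidable errors for any classifier.

    Sketch of one possible direct calculation; returns 0 when every distinct
    instance carries a single label (as for the Tic-tac-toe endgame database).
    """
    groups = defaultdict(Counter)
    for x, c in zip(instances, labels):
        groups[tuple(x)][c] += 1
    unavoidable = sum(sum(cnt.values()) - max(cnt.values()) for cnt in groups.values())
    return unavoidable / len(labels)
```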
The following Section IV.2.1 is dedicated to the verification of the ANN-structure based complexity estimator as an integral part of T-DTS, and contains the classification results that have been obtained using the T-DTS self-adjusting procedure. The validation approach remains the same: first, I conduct experiments using academic benchmarks, then real-world problems, namely the Tic-tac-toe endgame and Splice-junction DNA sequences classification problems.
IV.2 T-DTS
This validation section consists of two parts. In the first part, I present the validation of our ANN based complexity estimator as part of T-DTS. Besides the question of selecting a worthwhile complexity estimator, the user faces the following problems in handling T-DTS:
• selection of the complexity estimator: classification complexity estimation techniques depend on the data nature and on the coordinate system (a full list of desirable, but not mandatory, properties of classification complexity estimators/discriminators is mentioned in Fukunaga's work (Fukunaga 1972));
• the information required for modifying the θ threshold value and searching for the value that yields an optimal solution.
Therefore, the second part is dedicated to the verification of the self-tuning procedure. In this part, I particularly focus on the practical aspects of using/applying complexity estimation techniques and on their ability to enhance T-DTS performance.
IV.2.1 ANN-structure based complexity estimator validation
According to the T-DTS concept, the complexity estimation modules play a key role: the complexity estimators are essential for the database decomposition and for the final tree construction. In order to evaluate T-DTS performance with the ANN based complexity estimator, I have used the range of benchmarks already mentioned.
The verification on 2D classification benchmarks is highlighted in the following sub-section, Section IV.2.1.1, and Section IV.2.1.2 is dedicated to the real-world Splice-junction DNA sequences classification problem and the Tic-tac-toe endgame problem. For all these benchmarks, including the real-world classification problems, the learning databases have been extracted randomly from the corresponding database with respect to the classes' distributions; the rest of each database has been used for generalization. Let us also note that the threshold adjustment has been done manually, by the operator, in a trial-and-error way. The ANN based complexity estimator's achievements are depicted in the resulting Fig. IV.21 - Fig. IV.28 as ANN. The initial intra-parameter MIF has also been manually optimized. The EUCL distance metric has been chosen.
IV.2.1.1 ANN-structure based complexity estimator in T-DTS
framework using classification benchmarks
In order to check the T-DTS generalization ability with the embedded ANN based complexity estimator, an arbitrary PU, LVQ1, has been chosen. The results for the simple 2 Stripe classification problem are given in Fig. IV.21.
In Fig. IV.21 - Fig. IV.28, the x-axis represents the complexity threshold θ and the y-axis the average generalization rate.
The ANN based complexity estimator, depicted as ANN, takes an average position among the other complexity indicators; however, for threshold θ = 0.900, this complexity estimator achieves the maximal possible (i.e. the best) generalization, like the other complexity estimation techniques.
Fig. IV.21 : Validation of the ANN-structure based complexity estimator embedded into the T-DTS framework: 2 Stripe benchmark, 2 classes, generalization database size 1000 prototypes, learning database size 1000 prototypes, DU – CNN, PU – LVQ1
Fig. IV.21 also exhibits the relativity of complexity measuring, meaning that for the same benchmark and the same fixed DU and PU, different complexity estimators reach their maximal output for different θ thresholds. This result is expected, because the complexity measures used have different origins.
Fig. IV.22 : Validation of the ANN-structure based complexity estimator embedded into the T-DTS framework: 10 Stripe benchmark, 2 classes, generalization database size 1600 prototypes, learning database size 400 prototypes, DU – CNN, PU – LVQ1
Fig. IV.22 demonstrates the results obtained for a related, but more complex, sub-type of benchmark (10 stripe sub-zones) than the previous experiment, whose results are depicted in Fig. IV.21. In order to increase the overall classification complexity further, besides increasing the number of borders, I have reduced the learning database size to 400 prototypes. This highlights the same conclusion: the ANN based complexity indicator in the T-DTS framework is not the worst method among the others; however, for a reduced learning database, it cannot be a leader, according to its definition and the validation mentioned above.
I call attention to the fact that, even though the principal problem here is still the same 2D classification dilemma, simply with an increased number of stripes, the reduced learning database size is the key factor in the non-leading performance of the ANN based estimator. One ought therefore to set aside the imperfection of LVQ1 as the processing unit selected for this experiment. In this way, I have verified the T-DTS generalization and decomposition ability within artificial worst-case constraints. In fact, in such worst-case conditions, which arise from the conjunction of the intrinsic classification complexity and the information shortage (resulting from the reduced learning database size), it is to be expected to face such a low generalization rate (around 45%), regardless of the selected complexity estimator; but let us highlight that these parameters have consciously been selected arbitrarily for the validation.
As in the previous experiments, the classification complexity estimating techniques demonstrate their relatedness within the framework of the T-DTS application and its performance characteristics. However, it is clear that the maximal generalization rates for the majority of the complexity estimating techniques are obtained for cases where the θ threshold is approximately equal to 0.75, whereas in the case of Fig. IV.21 it is 0.8. Let me highlight here that when one deals with a classification problem within the T-DTS framework, one may overlook the decomposition performed by the DU; yet the selection of the DU is a very important issue, even though its effect may look invisible in the resulting Fig. IV.21 - Fig. IV.22.
Briefly summarizing, when the learning database size is limited, one should not expect the ANN based estimator to be among the leading complexity estimators in the T-DTS framework (an explanation based on the complexity estimator definition is given above in Section IV.1.3).
Using the complexity estimators that are based on Information theory, including the leading Jeffreys-Matusita based criterion, one can reach the performance maximum (Fig. IV.21 - Fig. IV.22), even for the same θ = 0.800; but let me highlight that this does not mean that the given classification problem can be successfully resolved using θ = 0.800 as a constant. The complexity estimators' outputs, especially after linear normalization, are relative by their definitions.
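To make the Jeffreys-Matusita based criterion mentioned above concrete, the sketch below computes the standard Jeffries-Matusita separability between two classes modelled as Gaussians. This is a generic textbook formulation; the linear normalization applied to complexity outputs inside T-DTS is not reproduced here, and the function name is illustrative only.

import numpy as np

def jeffries_matusita(x1, x2):
    """Standard Jeffries-Matusita separability between two classes of samples
    (rows = prototypes, columns = features), assuming Gaussian class models.
    Values close to sqrt(2) indicate well-separated (low-complexity) classes."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    c1, c2 = np.cov(x1, rowvar=False), np.cov(x2, rowvar=False)
    c = (c1 + c2) / 2.0
    d = m2 - m1
    # Bhattacharyya distance between the two Gaussian class models
    b = (d @ np.linalg.solve(c, d) / 8.0
         + 0.5 * np.log(np.linalg.det(c)
                        / np.sqrt(np.linalg.det(c1) * np.linalg.det(c2))))
    return float(np.sqrt(2.0 * (1.0 - np.exp(-b))))

For more than two classes, such pairwise separabilities are typically averaged or the minimum is taken; note that the measure saturates once the classes are (almost) perfectly separated, which is exactly the insensitivity limitation listed below.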
The main problems with complexity estimators that influence T-DTS performance are the following:
• complexity estimators that approximate the Bayes error very well become insensitive when the classes are perfectly separated (in the case of the Jeffreys-Matusita based criterion, the estimator demonstrates, regardless of the number of clusters, not superiority but insensitiveness; in both cases we obtain a large number of leaf sub-databases/sub-clusters);
• complexity estimators are coordinate and database-size dependent (Fukunaga 1972).
Looking for other possible reasons related to the general question "Why is the ANN-structure based estimator, verified as a leading classification complexity estimator, not a leader within the T-DTS framework?", we have put forward the following argumentation:
• The T-DTS concept is supposed to be general, but the practical realization is done for a specific set of self-organizing, prototype-based NN decomposition techniques (including the problem of their parameterization). Because the processing techniques integrated into the T-DTS framework and related to the processing units are responsible for the final performance results, the decomposition techniques cannot be treated as universal, even when the underlying "divide" and "conquer" principle is universal. The decomposition techniques may fail (according to the No Free Lunch Theorem) for a specific classification problem, and the central T-DTS idea, "performing decomposition reduces the overall task's complexity", might not hold. Briefly put, the imperfection of any decomposition technique requires a never-ending update and adjustment of the T-DTS DU database.
• Additional, unknown parameters of the tests, such as the learning and generalization databases' sizes, which are naturally defined in a random way, influence the process of tree construction; as a result, the output tree varies correspondingly with the learning database size. For the same reason, the coordinates of the centres of the sub-clusters fluctuate. As a result, the clusters change their forms and locations. Simply put, at the level of the resulting decomposition it is expected to see different mosaics (Fig. IV.11).
In the following section, I provide the validation of the ANN based complexity estimator in the T-DTS framework using two real-world problems as a validation tool.
IV.2.1.2 ANN-structure based complexity estimator in T-DTS
framework facing real-world classification problems
For these two problems, testing has been done with learning and generalization databases of equal size. They have been extracted randomly with respect to the classes' distribution, in the overall proportion 50%/50% mentioned above.
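A minimal sketch of such a stratified random split is given below; the function name and the 50%/50% default are illustrative only, and the actual T-DTS partitioning code is not reproduced here.

import numpy as np

def stratified_split(x, y, learn_fraction=0.5, seed=0):
    """Randomly split prototypes into learning and generalization sets while
    preserving the class distribution of the original database."""
    rng = np.random.default_rng(seed)
    learn_idx, gen_idx = [], []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        rng.shuffle(idx)
        cut = int(round(learn_fraction * idx.size))
        learn_idx.extend(idx[:cut])
        gen_idx.extend(idx[cut:])
    return (x[learn_idx], y[learn_idx]), (x[gen_idx], y[gen_idx])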
For the Tic-tac-toe endgame problem, the aim of the classification task was to predict whether each of the 958 legal endgame boards is won for x or for o. This problem is hard for the covering algorithm family because of the multiple overlapping (Yang, Parekh and Honavar 1999); however, a distinguishing attribute of each instance is always present, because it is impossible to have two identical game combinations.
The T-DTS results obtained for the Tic-tac-toe endgame problem are shown in Fig. IV.23. For this highly overlapping problem, the ANN based complexity estimator takes the second leading position; only the Mahalanobis distance based criterion has achieved the maximal generalization rate for this range of tests. However, for the same problem with a reduced learning database (Fig. IV.24), the leader among the complexity estimators was the Bhattacharyya bound based criterion, and again the second rank in both Tic-tac-toe endgame tests has been captured by the ANN based estimator, depicted as ANN.
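For reference, the Mahalanobis distance based criterion mentioned above rests on the classic pooled-covariance distance between class means; the sketch below shows that generic formulation only, not the exact normalization used as a complexity indicator in T-DTS, and the function name is illustrative.

import numpy as np

def mahalanobis_separability(x1, x2):
    """Mahalanobis distance between the means of two classes using the pooled
    covariance matrix; a larger distance indicates better separability,
    i.e. a lower classification complexity."""
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
              + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    d = m1 - m2
    return float(np.sqrt(d @ np.linalg.solve(pooled, d)))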
Fig. IV.23 : Validation of the ANN-structure based complexity estimator embedded into the T-DTS framework: Tic-tac-toe endgame problem, 2 classes, generalization database size 479 prototypes, learning database size 479 prototypes, DU – CNN, PU – MLP_FF_GDM
Another important note concerning this problem is the range of insensitive complexity estimators, such as the Kullback-Leibler divergence based one, which cannot be applied. More explanation concerning this issue is given in Section IV.2.2. Running ahead, these estimators conclude that both the problem and the sub-problems are of high complexity throughout the step-by-step decomposition. As a result, there is no sense in optimizing the θ threshold for these estimators, because there are only two decomposition cases: no decomposition and full decomposition (in general, the latter means that we obtain a large number of clusters, each containing one pure class and having a very small size).
Fig. IV.24 : Validation of the ANN-structure based complexity estimator embedded into the T-DTS framework: Tic-tac-toe endgame problem, 2 classes, generalization database size 766 prototypes, learning database size 192 prototypes, DU – CNN, PU – MLP_FF_GDM
Fig. IV.25 : Validation of the ANN-structure based complexity estimator embedded into the T-DTS framework: Splice-junction DNA sequences classification problem, 3 classes, generalization database size 1520 prototypes, learning database size 380 prototypes, DU – CNN, PU – MLP_FF_GDM
The following validation was performed for the Splice-junction DNA sequences classification problem. The results are depicted in Fig. IV.25. One can notice that the ANN based complexity estimator is leading among the indicators.
Furthermore, the indicators inapplicable for this problem are: Maximum_Standard_Deviation, which defines the problem as very complex regardless of the decomposition process, and Normalized_mean_Distance, which cannot be applied because each vector consists of 60 attributes and the complexity ratio, based on the root square deviation, identifies every problem as complex.
Next, it came as no surprise that the Fisher_Disriminant_Ratio estimator produced the worst result. Moreover, we should highlight that all three of these complexity estimators had already failed during the classification complexity validation, even in the framework of a single classification benchmark.
Summarizing the comparison of the estimators on the classification benchmarks and the real-world problems, one may state that the ANN based complexity estimator embedded into the T-DTS framework stands out among the other complexity estimating techniques. Because of its origins, this estimator is not sensitive to a large number of attributes of an input vector; still, it is computationally costly. One may also conclude that the ANN based estimator in T-DTS is an appealing candidate for solving high-dimensional problems. However, this estimator requires optimization of its internal parameters, such as the very sensitive MIF, in order to avoid extra execution time cost. Incorporating this complexity estimator had the goal of checking the performance of T-DTS compared to the other complexity estimation modules, and this validation has passed successfully. T-DTS was used to solve two real-world classification problems. We have shown that the ANN based complexity estimator embedded into T-DTS allows the user to reach better learning and generalization rates. We have also illustrated that this estimator is matchless for classification-related problems with a high-dimensional feature space, where the statistics based complexity indicators failed. This estimator appears to be a general complexity indicator and thus acts more efficiently than the other criteria. However, the problem of finding the optimal (more precisely, quasi-optimal) threshold θ at which the T-DTS output reaches its maximal performance has not been covered in this section. It is known that the θ threshold is a relative parameter because of the different origins of the complexity estimators. Therefore, we were motivated to apply the semi-automated procedure that finds a quasi-optimal θ threshold regardless of the pre-selected complexity estimator. The following section provides a verification of the proposed procedure (Section IV.1) and proves the superiority of such a user-independent approach. Another possibility to achieve a maximal generalization rate for the real-world problems (the Tic-tac-toe endgame and Splice-junction DNA classification problems), and its analysis, will be discussed in greater detail in the following section.
IV.2.2 T-DTS self-tuning procedure validation
In order to evaluate the self-tuning procedure of T-DTS, we used a similar range of tests: the benchmarks and the two real-world problems. For each problem, we have fixed the complexity estimator, the PU, the DU and the rest of the T-DTS parameters. Then, T-DTS has been launched in the self-tuning mode.
The complexity threshold of decomposition was adjusted automatically using the T-DTS self-tuning procedure described in Section III.1. The optimization function had the following macro-parameters: h = 9, z = 10, α = 0.1, meaning that T-DTS was run h · z times for each complexity estimator and fixed DU, PU and other parameters. The next section provides a validation of this procedure using classification benchmarks.
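Before turning to that validation, a minimal sketch of how such a threshold search can be organized is given below. It assumes one plausible reading of the macro-parameters (h candidate thresholds, z randomized T-DTS runs per candidate, averaging of the observed rates); the actual roles of h, z and α are defined in Section III.1, and run_tdts and performance are hypothetical placeholders for the T-DTS launcher and the P(θ) function of equation III.12.

import numpy as np

def self_tune_threshold(run_tdts, performance, thetas, z=10):
    """For each candidate threshold, run T-DTS z times (each run uses a fresh
    random learning/generalization split), average the observed generalization
    rate, learning rate and leaf count, and keep the threshold minimizing the
    performance-estimating function P(theta)."""
    best = None
    for theta in thetas:
        runs = np.array([run_tdts(theta) for _ in range(z)])  # rows: (Gr, Lr, leaves)
        gr, lr, leaves = runs.mean(axis=0)
        p = performance(gr, lr, leaves)
        if best is None or p < best[1]:
            best = (theta, p, gr, lr, leaves)
    return best  # quasi-optimal theta and the averaged rates behind it

The quasi-optimal value reported later in this section (θ = 0.7217) suggests that the actual procedure refines the search beyond a coarse grid with step α.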
IV.2.2.1 T-DTS self-tuning procedure validation using classification
benchmarks
Fig. IV.26 : Validation of the T-DTS self-tuning threshold procedure, average learning rate (including its corridor of standard deviations) as a function of the θ threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator
Fig. IV.27 : Validation of the T-DTS self-tuning threshold procedure, average generalization rate (including its corridor of standard deviations) as a function of the θ threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator
A very good illustration of how the self-tuning procedure works is given on the 4 Spiral classification benchmark. Fig. IV.26 – Fig. IV.29 provide the details of the possible quasi-optimum search for this specific classification problem.
Fig. IV.28 : Validation of the T-DTS self-tuning threshold procedure, average clusters' number as a function of the θ threshold: 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator
Intuitively and heuristically analyzing the trends, the quasi-optimal threshold is expected to be found in the subinterval [0.7 ; 0.8], where the generalization rate reaches its maximum and the learning rate continues to grow. Now, when we come to the question of what an optimal solution means for us, it is time to define it in terms of the performance-estimating function P(θ) (equation III.12).
For this purpose I have set b1 = 3, b2 = 2 and b3 = 0. This was done in order to simplify the number of parameters that define the quasi-optimum, ignoring the T-DTS execution time, which is proportional to the number of prototypes (Fig. IV.28).
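Equation III.12 is not reproduced in this section, so the sketch below only assumes a common weighted-sum form, in which b1 penalizes the generalization error, b2 the learning error and b3 a normalized measure of the tree size (a proxy for execution time). The exact definition and normalization in equation III.12 may differ; in particular, the concrete values reported in the text (e.g. min P(θ) = 0.42) come from the thesis's own definition and are not expected to be reproduced by this sketch.

def performance(gr, lr, leaves, b1=3.0, b2=2.0, b3=0.0, max_leaves=500):
    """Assumed weighted-sum form of P(theta); smaller is better.
    gr and lr are the generalization and learning rates as fractions in [0, 1];
    leaves is the average number of leaf sub-databases of the tree."""
    return (b1 * (1.0 - gr)                 # generalization error, highest priority
            + b2 * (1.0 - lr)               # learning error
            + b3 * (leaves / max_leaves))   # proxy for execution time (b3 = 0 here)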
Fig. IV.29 : Validation of the T-DTS self-tuning threshold procedure, performance-estimating function P(θ): 4 Spiral benchmark, 2 classes, generalization database size 500 prototypes, learning database size 500 prototypes, DU – CNN, PU – PNN, Fisher measure based complexity estimator
Fig. IV.29 describes the evolution of the P(θ) function over the θ-threshold interval [0.1 ; 0.8] for the 4 Spiral academic benchmark and a fixed range of parameters. It shows that for θ close to 0.7, T-DTS reaches its performance maximum (minimum of P(θ) = 0.42) in terms of the combination of generalization and learning rates, where the highest priority is set on maximizing the overall generalization rate. More precisely, using the self-tuning θ-threshold procedure implemented in T-DTS, the minimum of P(θ) has been found for θ = 0.7217. For different selected complexity estimators, DUs and PUs, different combinations of satisfactory results are possible. These results are given in Table IV.5, where "Gr" stands for the generalization rate, "Lr" for the learning rate, and "Std" for the standard deviation of these parameters.
Table IV.5 : Classification results: 4 Spiral benchmark, 2 classes, generalization
database size 500 prototypes, learning database size 500 prototypes
DU Complexity estimator PU Gr±Std/2 (%) Lr±Std/2 (%) Avr. leaf No. ±Std/2 θ
CNN Collective entropy based Elman_BN 79.1583±0.4960 96.4870±0.1813 104.00±5.95 0.2798
SOM Collective entropy based Elman_BN 77.7956±1.6256 97.6846±0.2961 144.20±6.21 0.3353
CNN Fisher measure based PNN 80.4008±0.4216 95.8882±0.2505 176.2±1.98 0.7217
Based on the given results, one may select the one that best satisfies the given constraints, such as a low standard deviation of the generalization rate, the maximal possible generalization rate, or a satisfactory generalization rate combined with the maximal learning rate. Let us stress that in the framework of T-DTS there is no a priori "best" or most "optimal" solution; there is a set of possible quasi-optimal solutions.
Fig. IV.30 : Validation of the T-DTS self-tuning threshold procedure, clusters' number distribution: 4 Spiral benchmark, 2 classes, learning database size 500 prototypes, DU – CNN, Collective entropy based complexity estimator
Minimizing the P(θ) performance function induces configurations of the results in which the generalization and learning rates satisfy the user's expectations. To obtain this minimum, different types of PU must be applied. The selection of an appropriate PU is a heuristic procedure that requires user experience. Characteristics of the problem and its type given in advance, together with knowledge of the PUs' features, are helpful for the selection of a particular PU. The possibility of reaching the quasi-optimum is predefined by the form of the histogram given in Fig. IV.30.
Therefore, the form of the histogram in Fig. IV.30 defines the divisibility of the initial database: if the complexity estimator measures the classification complexity in a proper manner, we obtain an appropriate histogram.
However, another key factor that determines the form of the histogram, and that is invisibly present each time it is built, is the decomposition used. If the decomposition produces sub-clusters in such a manner that it does not reduce complexity (regardless of the problem), then, whatever complexity estimator one has used, perfect or not, the histogram exhibits this case by its form. In conclusion, the pair of DU and complexity estimator determines the divisibility of the initial database.
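The sketch below illustrates, under stated assumptions, how such a divisibility histogram can be obtained: the selected DU is applied recursively down to a "maximal" decomposition tree and the normalized complexity of every leaf is histogrammed. The functions decompose and complexity are hypothetical placeholders for the selected DU and complexity estimator, and the stopping rule is only one possible choice, not the T-DTS implementation.

import numpy as np

def leaf_complexities(data, labels, decompose, complexity, min_size=10):
    """Split recursively until sub-databases become too small (or the DU cannot
    split further) and collect the complexity estimate of every leaf."""
    if len(labels) <= min_size:
        return [complexity(data, labels)]
    subsets = list(decompose(data, labels))  # sequence of (sub_data, sub_labels)
    if len(subsets) <= 1:                    # DU could not split further
        return [complexity(data, labels)]
    leaves = []
    for sub_data, sub_labels in subsets:
        leaves.extend(leaf_complexities(sub_data, sub_labels,
                                        decompose, complexity, min_size))
    return leaves

def divisibility_histogram(complexities, bins=10):
    """Histogram of normalized leaf complexities in [0, 1]: mass near 0 suggests
    that decomposition simplifies the task, mass near 1 that it does not."""
    return np.histogram(np.asarray(complexities), bins=bins, range=(0.0, 1.0))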
Fig. IV.31 : Validation of the T-DTS self-tuning threshold procedure: 10 Stripe benchmark, 2 classes, generalization database size 1000 prototypes, learning database size 1000 prototypes, DU – CNN, PU – LVQ1, 4 complexity estimators
Fig. IV.31 and the following figures describe the results obtained for the classification benchmarks and then for the real-world problems (the ANN based complexity estimator is marked under the name ZISC). Here, we start from the two-class, 10 Stripe benchmark problem (Fig. IV.31). The x-axis represents the decision threshold, the y-axis the percentage of the learning and generalization rates.
For this two-class benchmark problem, the quasi-optimal thresholds for the four complexity estimators were found in the range [0.8591 ; 0.9992]. Because of the artificiality of the benchmark, these optimal thresholds lie close to each other regardless of the complexity estimator. In this sub-interval, the four complexity estimators achieve their maximums (98-99% of generalization for the proposed θ thresholds). These results correspond to the results obtained in the work (Bouyoucef, 2007).
The aim of this experiment was not to find the best complexity estimator in the T-DTS framework or to determine the best performance, but rather to test the self-tuning threshold procedure for the same range of fixed parameters as for the given classification benchmark, including the bi priorities of P(θ).
The most interesting aspect of these results is that setting h = 9 allows T-DTS, in the user-free mode, to find the quasi-optimal θ threshold with high accuracy, particularly because of the continuous behaviour of the predefined P(θ). However, this does not mean that a human operator cannot heuristically achieve this accuracy.
The next section is dedicated to a validation of the self-tuning threshold T-DTS
procedure for two real-world classification problems.
IV.2.2.2 T-DTS self-tuning procedure validation using real-world
classification problems
Applying the self-tuning threshold T-DTS procedure to the Tic-tac-toe problem, we have used different database partitionings and complexity estimators. The histogram obtained for the maximal decomposition tree determines the overall database divisibility for the pre-selected DU and complexity estimator.
For the Tic-tac-toe endgame problem, we provide the two histograms obtained using two trusted complexity estimators: the Collective (PRISM based) entropy and my ANN based complexity estimator. The results are described in Fig. IV.32 - Fig. IV.33.
A brief analysis suggests that, according to the Collective entropy's histogram, the database is divisible and decomposition provides clusters of different complexity; however, this does not mean that it increases T-DTS performance, since the sub-clusters still have a high complexity ratio. Concerning Fig. IV.33, it suggests that decomposition does not decrease the database complexity and the sub-databases remain complex; thus, the database of the Tic-tac-toe endgame problem is not divisible in the sense of classification simplification.
Let me note here that, according to the complexity estimators' validation, our ANN based estimator has been proven to be more trustworthy than the Collective (PRISM based) entropy; however, both of them are leaders among the approaches used for the validation. Applying the self-tuning T-DTS threshold procedure for different combinations of the general database partition, using different PUs and complexity estimating techniques, gives the same result: no database decomposition reduces the overall complexity and, as a consequence, decomposition cannot increase performance.
Fig. IV.32 : Validation of the T-DTS self-tuning threshold procedure, clusters' number distribution: Tic-tac-toe endgame problem, 2 classes, DU – CNN, Collective entropy complexity estimator
Fig. IV.33 : Validation of the T-DTS self-tuning threshold procedure, clusters' number distribution: Tic-tac-toe endgame problem, 2 classes, DU – CNN, ANN-structure based complexity estimator
Therefore, the database should be processed as a whole, or another, more sophisticated decomposition method must be used. Knowing the origin of this problem, this conclusion is to be expected, because each instance describes a unique game combination that determines a unique part of the border in the feature space S; moreover, the problem is highly overlapping, which is why a low learning database ratio also reduces performance.
Table IV.6 : Classification results: Tic-tac-toe endgame classification problem
Method description Type of algorithm Accuracy (%)
MLP_FF_BR (90% of database used) MLP FF based 99.9063
MLP_FF_BR (80% of database used) MLP FF based 99.6545
Elman_BNwP (90% of database used) MLP FF based 99.5521
T-DTS (80% of db), ANN-based CE, θ=0, 2 clusters, PU: Elman_BNwP T-DTS 98.4921
MLP_FF_BR (70% of database used) MLP FF based 98.4583
CN2 standard Rule induction 98.33
Elman_BNwP (70% of database used) MLP FF based 98.0521
T-DTS (70% of db), Fisher ratio CE, θ=0, 2 clusters, PU: Elman_BNwP T-DTS 98.0050
IB3-CI Instance learning 97.8
MLP_FF_BR (60% of database used) MLP FF based 97.3890
Decision tree learner +FICUS Feature construction 96.45
kNN +FICUS (k=3) Feature construction 96.14
T-DTS (70% of db), kNN Matlab CE, θ=0, 2 clusters, PU: PNN T-DTS 95.7334
kNN +FICUS (k=5) Feature construction 95.35
kNN +FICUS (k=7) Feature construction 94.99
kNN +FICUS (kNN – basic) Feature construction 94.73
CN2-SD (γ=0.9) Rule induction 88.41
CN2-SD (γ=0.7) Rule induction 85.07
MLP_FF_BR (60% of database used) MLP FF based 85.0313
CN2-SD (γ=0.5) Rule induction 84.45
MBRTalk Instance learning 84.1
CN2-SD (add. weight.) Rule induction 83.92
Backpropagation +FICUS /note: high standard deviation of the results/ Feature construction 81.66
NewID Decision tree based 79.8
CN2 WRAcc Rule induction 70.56
Table IV.6 presents a comparative study of T-DTS against the performance of other classification approaches (Aha 1991), (Lavrac, Flach and Todorovski 2002), (Markovitch and Rosenstein 2002). The classification results are given in terms of accuracy (the combined learning and generalization rate) in order to be comparable to the other methods.
This table clearly positions the T-DTS based results and shows the influence of the Tic-tac-toe database decomposition on the performance. Let me note that any complexity estimator producing a histogram similar to that of the ANN based estimator will be a good candidate for processing this problem in the T-DTS framework. Such candidates include not only the Fisher based ratio, but also Fukunaga's four interclass matrix distance criteria. However, one may select a solution with two or more clusters if the given result satisfies the initial conditions.
For the Splice-junction DNA sequence classification problem, the results are described in Fig. IV.34. The Fisher_Disriminant_Ratio complexity estimator is the leader, with a generalization rate of 74.6218%, and the ANN based one is second. Maximum_Standard_Deviation is inapplicable because of the same weaknesses as were mentioned above when Fisher_Disriminant_Ratio was applied to the Tic-tac-toe endgame problem.
Fig. IV.34 : Validation of the T-DTS self-tuning threshold procedure: Splice-junction DNA sequences classification problem, 3 classes, generalization database size 1520 prototypes, learning database size 380 prototypes, DU – CNN, PU – MLP_FF_GDM, 3 complexity estimators
This confirms our expectation that the ANN-structure based complexity estimator (marked in Fig. IV.34 as ZISC) is broadly applicable regardless of the specificity of the problem.
Before analyzing the quality of the obtained results and focusing on the main goal of any automatic classification method – maximization of the generalization rate – let me mention the weaknesses of the T-DTS application (not of the concept). The application assumes that a tuple, for example the pair <complexity threshold; complexity estimating method>, determines the optimum or quasi-optimum; for different methods, the quasi-optimal threshold may also be different.
Firstly, it is incorrect to assume or simplify that the set of optimal thresholds for the whole range of complexity estimators applied to a certain single classification task can be allocated in some sub-interval. Several aspects influence the optimization function P(θ), including the relativity of the complexity rate (except for the ANN based complexity estimator), meaning that we cannot optimize the search for an appropriate complexity estimator among the available ones.
Secondly, we expect that the main controlling pair <complexity threshold; complexity estimating method>, used by the DU during decomposition, simplifies the problem regardless of the PU. In fact, taking into account that the problem of finding an optimal decomposition is NP-hard, this is an oversimplified expectation, as it assumes that the obtained decomposition is quasi-optimal, i.e. that the simplification is indeed achieved.
Finally, turning back to the classification goal of the manipulations mentioned above, let us highlight that, in the framework of the T-DTS output, we consider minimization of the generalization error to be the principal direction.
Furthermore, we search for the answer to the question of how to predict and predefine the way of decomposing (or not decomposing) in which the maximal generalization rate can be achieved. Once more, our idea is based on the macroscopic features of self-organization highlighted in the work (Haken 2002). Thus, the macroscopic characteristic that rules decomposition is the histogram of divisibility extracted from the maximal decomposition tree.
To illustrate this principle, we have used an encoded database (in order to enhance the testing rate – our principal aim) for the Splice-junction DNA sequences classification problem. Fig. IV.35 and Fig. IV.36 show this database's divisibility and complexity reduction based on two complexity estimators.
Based on the given histograms, one may conclude that the problem is hard to decompose: decomposition does not reduce the complexity, and during decomposition the majority of sub-clusters remain very complex. That is why the decomposition does not provide generalization error minimization.
Fig. IV.35 : Validation of the T-DTS self-tuning threshold procedure, clusters' number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size 1595 prototypes, DU – CNN, Purity PRISM based complexity estimator
Fig. IV.36 : Validation of the T-DTS self-tuning threshold procedure, clusters' number distribution: Splice-junction DNA sequences classification problem, 3 classes, learning database size 1595 prototypes, DU – CNN, Fukunaga's interclass distance measure J1 based complexity estimator
Therefore, we have consolidated the maximum results reached by T-DTS for the Splice-junction DNA classification problem in Table IV.7; as expected, the maximum is reached when the database is not decomposed.
Table IV.7 : Classification results: Splice-junction DNA sequences classification
problem, three classes, generalization and learning database size 1595 prototypes
DU Complexity estimator PU Gr±Std/2 (%) Lr±Std/2 (%) Avr. leaf No. ±Std/2 θ
None None Elman_BN 94.6675±0.0421 99.9373±0.0181 None None
CNN Purity PRISM based Elman_BN 93.5966±0.4174 99.8800±0.0224 2±0 0.0033
CNN Purity PRISM based Elman_BN 93.3950±0.4302 99.8500±0.0533 4±0 0.0340
The results show that the decomposition process reduces the generalization ability. However, when the processing time is taken into account, it is quite probable that a user willing to sacrifice one percent of the generalization ability may select the 4-cluster T-DTS solution.
To conclude, I provide brief consolidated results (only the maximal characteristics) obtained by different authors, including methods employing ROC analysis (Makal, Ozyilmaz and Palavaroglu 2008), for this particular classification problem in Table IV.8.
Table IV.8 : Consolidation of the classification results: Splice-junction DNA
sequences classification problem.
Method description Specificity of the method Generalization rate (%)
Maximum obtained in the work (Lumini and Nanni 2006) Hierarchical SVM based 99
Maximum obtained in the work (Duch 2002), (learning db is modified) Specific NN based 95
Elman_BN (50% of database used) MLP FF based 94.6675
The average result for this type of problems (Malousi et al. 2008) SVM based 94
T-DTS (50% of db), 2 clusters, PU: Elman_BN T-DTS 93.5966
T-DTS (50% of db), 4 clusters, PU: Elman_BN T-DTS 93.3950
Maximum obtained in the work (Malousi et al. 2008) for ANN-based solid ANN-based 93.3890
MLP (by ROC analysis) solid ANN-based 91.23
GRNN (by ROC analysis) solid ANN-based 91.14
RBF (by ROC analysis) solid ANN-based 89.35
Let us compare our output to the results obtained in the work (Malousi et al. 2008). We have obtained a higher generalization rate (94.66% against 91.23%) because of the embedded recursivity of Elman's Backpropagation. The work of Makal, Ozyilmaz and Palavaroglu provides a good summary of the different solid-ANN methods applied to this particular problem. The approaches used there (Malousi et al. 2008) do not use decomposition. Therefore, it is important to note that our T-DTS result with the four-cluster solution (93.3950%) is better than the results of (Malousi et al. 2008). In the work (Lumini and Nanni 2006), it was shown that SVM based methods, especially hierarchical SVM based methods like HM, Subspace and RankSVM, have reached better results (97-99%); however, let me recall that the question of SVM parameterization is complex and requires additional applied techniques. An interesting fact is that the very specific methods proposed in the work (Duch 2002) – RBF with 720 nodes, the GhostMiner version of kNN, and Dipol92 – surpass our result with their 95% of generalization; nonetheless, their learning databases have been specially adjusted for these three methods, so we cannot consider these methods to be general. If one looks at the general problem of DNA splice-junction classification as a general medical problem, the work (Malousi et al. 2008) provides, for similar databases, an average generalization rate of 94%, even when SVM based methods are used.
Finally, we can state that, using the T-DTS approach with the enhanced self-tuning procedure applied to the Splice-junction DNA sequences classification problem, a generalization rate of 93.4% – 94.6% was obtained, which is around the average computed for this type of problem. However, various SVM based methods (not the NN-based processing methods that have been used in T-DTS) are the leaders. This result analysis stimulates a further update of the T-DTS PU database with SVM methods. The following section provides an overall T-DTS validation summary.
IV.2.3 Summary
In this section, we performed a range of experiments dedicated to the validation of the T-DTS approach, including its recent enhancement implemented as T-DTS v. 2.50. The first part of the validation confirmed the superior performance characteristics of the ANN based complexity estimator: using the proposed novel method as the decomposition controller of T-DTS, one can reach the maximal generalization ratios for the academic benchmarks. Since these maximal ratios (in absolute values) can typically be achieved only with the best current complexity estimators, e.g., Mahalanobis distance based, Normalized distance based, Maximal standard deviation based measures, etc., the proposed ANN based estimator proved itself to be a very practical approach. Moreover, it should be noted that our estimator performed well even in those real-world problems where a range of other popular complexity estimators, including the Kullback-Leibler divergence and Hellinger distance based estimators, were not applicable because of their Information theory origins.
The second part of the T-DTS validation tested the proposed self-tuning procedure; recall that this procedure is able to answer the following questions:
• why using a leading (proven to be leading) complexity estimation technique might not maximize the T-DTS output;
• why applying the T-DTS tree-like decomposition technique to some classification problems cannot enhance the performance of the technique beyond the results of an alternative, non-decomposing task-processing technique.
During the experiments, the obtained histogram of divisibility, i.e., the result of the T-DTS maximal decomposition tree, gives an answer to the second question. The consequence of this analysis might stimulate the user to choose another decomposition unit or another complexity estimator to control the process of decomposition. The self-tuning procedure validation confirmed our expectations: employing this semi-automated procedure does allow the user to find a range of quasi-optimal solutions. The next section presents a consolidated conclusion on the validation of the T-DTS enhancements and complexity estimators, including the proposed ANN based complexity estimator.
IV.3 Conclusion
The first part of this chapter was dedicated to the experimental validation of the proposed ANN based complexity estimation technique, which we implemented using the IBM© ZISC®-036 Neurocomputer and the Matlab environment. The latter implementation also allowed performing a comparative analysis of 17 complexity estimators. During the verification, we observed that the complexity estimators fall into three groups based on their relative performance. The third, most effective, group contains the leaders of classification task complexity estimation; these include the PRISM (Singh 2003) based methods and the novel ANN based complexity estimator.
The second part of the chapter provides the results of the T-DTS concept validation, where we showed that the ANN-structure based complexity estimator belongs to the class of leading complexity estimators. Moreover, even though the classification complexity estimation technique is at the kernel of T-DTS, as it controls the decomposition, the results of the evaluation showed that this control may also be performed successfully (in terms of T-DTS performance) by a range of other techniques that, taken separately, cannot appropriately measure the true classification complexity.
The last section of the second part provides the results obtained for the validation of the self-tuning complexity (i.e. θ-threshold) procedure, which allows the user to find a quasi-optimal θ-threshold. Also, while searching for a quasi-optimal θ-threshold, T-DTS might produce a whole range of satisfactory solutions; this allows the user to select the most preferable combination of the output characteristics, such as the generalization and learning rates, the total number of clusters, and the overall T-DTS processing. It should also be emphasized that the most important part of the self-tuning procedure is the divisibility histogram. As shown in the validation, it is not only the source of information for the θ-threshold adjustment; according to the results, this histogram also explains why the T-DTS approach could not be applied to some classification problems. Such an unsatisfactory histogram output might be regarded as a stimulus for the further development of decomposition and complexity estimation methods.
The general conclusion and perspectives of this work are separately consolidated in the following section.
General conclusion and perspectives
Conclusion
T-DTS, a multi-model “Divide-To-Conquer” based classification technique was
initially introduced by Madani & Chebira (Madani 2000). Its key-components, “self-
organizing ability” and “complexity estimation loop”, have been subject of two prior
doctoral works performed by Rybnik (Rybnik 2004) and Bouyoucef (Bouyoucef 2007),
respectively. In this thesis, we proposed and verified several important extensions of this approach. The main focus of this work was, on the one hand, the complexity estimation issue of T-DTS and, on the other hand, the overall enhancement of T-DTS.
We developed a novel Artificial Neural Network (ANN) structure-based "classification task's complexity estimation" technique. We implemented and validated the proposed technique within the MatLab environment, performing a comparative analysis of 17 complexity estimators that had already been used as "complexity estimators" in the above-mentioned preceding doctoral works. Moreover, a fully parallel implementation of the proposed ANN based "classification task's complexity estimator" has been proposed, implemented and validated using the IBM© ZISC®-036 Neuro-processor.
The experiments confirmed the effectiveness of the proposed ANN based complexity
estimation approach, showing performances comparable to the current leading techniques,
such as PRISM based methods and Information theory based complexity estimators
(Kullback–Leibler divergence, Jeffreys-Matusita distance, etc.). An important note is that the group of "Information Theory" based estimators shows a partial sensitiveness to the number of features, which overrules the "classification complexity" estimation. In other words, the performed analysis pointed out that the aforementioned group of complexity estimators is less sensitive to the concerned features and so remains ineffective compared to the PRISM based and ANN based complexity estimators. The origin of the above-mentioned insensitiveness can be explained by the fact that the aforementioned group of techniques works on a statistical approximation of the Bayes error. However, it should also be mentioned that one of the current disadvantages of the ANN based estimator is its high computational cost and the required parameterization (of the influence field value: a specificity of the used ANN model). One should also take into consideration the fact that there is no ideal way of constructing ANN structures and, as an impact of this fact on T-DTS, the proposed complexity estimation approach might, in some cases, produce an incorrect estimation for a number of classification problems. The above-mentioned shortcoming may occur either in the case of learning data scarcity or when the ANN structure remains inappropriate. However, the latter issue is a general feature of any machine learning based structure, and the former is easy to overcome using the database size control that has already been realized.
Some complexity estimators, such as the Maximum standard deviation based criterion, showed severe limitations. We encountered several issues limiting the applicability of such approaches and rejecting them definitively. However, let us highlight that the mentioned practical inapplicability is related neither to the T-DTS decomposition control ability nor to its ability to search for a quasi-optimal tree structure of the NNs ensemble.
Even though the validation of the proposed complexity estimators (performed on the benchmarks and two real-world problems) did not clearly point to a single best T-DTS decomposition control agent, it demonstrated that the ANN based complexity estimator belongs to the group of leading techniques. This means that slotting the ANN based complexity estimator into the T-DTS framework guarantees the possibility of obtaining an outperforming NNs ensemble tree-structure.
In the second part of this work we studied the matter of the overall enhancement of T-
DTS. We designed and implemented the T-DTS classification system as a framework
consisting of a set of independent components, e.g., decomposition method, processing
method, and complexity estimation method. Owing to the proposed architecture, the
implemented system can be used as a versatile platform for further investigations of T-
DTS, where each component can be investigated, modified independently of other
components and easily replaced as soon as an improved version becomes available.
Moreover, such a clear separation of the basic components of the system makes the use of the T-DTS platform accessible to those potential users who would like to experiment with the platform but do not have expertise in the architecture of the system. The aforementioned versatility has been implemented as three levels of competence (i.e. three levels of usage): the "basic-user" level, the "advanced-user" level and the "programmer" level.
The next contribution of this work is the design of a procedure for an automated quasi-optimal adjustment of the θ – threshold parameter of T-DTS. The proposed procedure is
based on the histogram analysis of separability and complexity reduction characteristics
of analyzed data. Since in the previous iterations of T-DTS the threshold had to be
adjusted manually (using a non-trivial procedure), the proposed automation significantly
improves the usability of the approach and paves the way for industrial adoption of T-
DTS. As an added benefit, the histograms composed in the proposed procedure have
proven to be also useful for investigating macroscopic features of self-organizing abilities
exposed by T-DTS. The validation showed that use of the θ – threshold adjustment
procedure allows producing not only a single satisfactory result, but rather a set of
possible satisfactory results, where the user can select a suitable result based on his (or
her) particular preferences (needs): low accuracy error rate, low learning error rate, etc. It
should be mentioned that applying the automated procedure requires an additional
computational effort. However, this effort is well-justified considering that the alternative
manual heuristic of θ – threshold adjustment might also lead to a heavy and time
consuming search for a quasi optimal NNs ensemble structure.
Perspectives
The further development of the ANN-structure based complexity estimator requires enhancing the g(.) function. For example, the extraction of additional information related to classification complexity might aggregate other parameters, such as the length of the bounds, etc. Employing different types of NNs is also one of the possible directions for developing the technique.
To improve the current implementation of T-DTS, we plan to create and incorporate (into the framework of T-DTS) a database of algorithms for searching for the quasi-optimal θ – threshold. We also plan to embed the histogram analysis algorithms. These improvements are in the near-term T-DTS development plan.
Next, we would like to experiment with SVM-T-DTS implementation (combining T-
DTS and SVM) in order to investigate the superior classification properties of SVM. As it
was shown by the comparative DNA-problem results’ analysis, the SVM-based methods
are well suited for the classification problems in the medical area. Precisely, it was shown
that the hierarchical SVM methods are the leaders among the available methods for DNA
sequences' classification. Another possible area of T-DTS extension is a combination of self-organizing maps and LSVMDT (which is currently used for the processing of satellite images).
The usage of histogram analysis in the T-DTS framework allows the user to analyze the complexity estimator when a known outperforming decomposition unit and a special benchmark are selected. We would also like to investigate the opposite case: analyzing the decomposition unit's "simplification" ability for a selected leading complexity estimator.
The far perspective of this work includes applying the ANN based estimator not only to T-DTS-like NNs ensemble construction and adjustment, but also to the optimization of any classifier. We also plan to increase the computational performance by using a more suitable alternative to the current PC-based hardware platform. Since the approach makes heavy use of Neural Networks, T-DTS naturally lands on the emerging hardware platform of neurocomputing. A key role in this perspective may be played by the IBM© ZISC®-036 neuro-processor, which provides a fully parallel hardware implementation for ANN structures. The aforementioned option could play not only the role of the decomposition controller, but could also act as an optimizer of any classifier (of the processing unit).
In conclusion, we believe that the close and far perspectives in the design and enhancement of the T-DTS classification architecture would make T-DTS a promising industrial platform for resolving complex classification problems.
Appendixes
A. The list of publications

Articles published in international journals
Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, “Estimating Complexity of Classification Tasks Using Neurocomputers Technology”, International Journal of Computing (ISJC’2009), ISSN 1727-6209, Vol. 8, Issue 1, pp. 43-52, 2009.
Ivan Budnyk, El-Khier Bouyoucef, Abdennasser Chebira, Kurosh Madani, “Neurocomputer Based Complexity Estimator Optimizing a Hybrid Multi Neural Network Structure”, International Journal of Computing (ISJC’2008), ISSN 1727-6209, Vol. 7, Issue 3, pp. 122-129, 2008.

Articles published as chapters in collective books
Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, “ZISC Neuro-computer for Task Complexity Estimation in T-DTS framework”, Artificial Neural Networks and Intelligent Information Processing, INSTICC PRESS, ISBN: 978-989-8111-35-7, Vol. 4, pp. 18-27, 2008.
Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, “ZISC Neural Network Base Indicator for Classification Complexity Estimation”, Artificial Neural Networks and Intelligent Information Processing, INSTICC PRESS, ISBN: 978-972-8865-86-3, Vol. 3, pp. 38-47, 2007.

Articles published in international symposiums and conferences
El-Khier Bouyoucef, Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, "A Modular Neural Classifier with Self-Organizing Learning: Performance Analysis", Proceedings of the Sixth International Conference on Neural Networks and Artificial Intelligence (ICNNAI 2008), ISBN: 978-985-6329-79-4, Minsk, Byelorussia, 27 – 30 May, pp. 65-69, 2008.
Ivan Budnyk, El-Khier Bouyoucef, Abdennasser Chebira, Kurosh Madani, "A Hybrid Multi-Neural Network Structure Optimization Handled by a Neurocomputer Complexity Estimator", Proceedings of the International Conference on Neural Networks and Artificial Intelligence (ICNNAI 2008), ISBN: 978-985-6329-79-4, Minsk, Byelorussia, 27 – 30 May, pp. 310-314, 2008.
Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, "Estimating Complexity of the Classification Tasks that Using Neurocomputers", Proceedings of the International Conference on Intelligent Data Acquisition and Advanced Computing Systems (IEEE - IDAACS 2007), IEEE Cat. N°: 07EX1838C, ISBN 1-4244-1348-6, Dortmund, Germany, 6 – 8 September, pp. 207-212, 2007.
Abdennasser Chebira, Kurosh Madani, Ivan Budnyk, "Task Complexity Estimation Using Neural Networks Hardware", Proceedings of the 9th International Conference on Pattern Recognition and Information Processing (PRIP’2007), ISBN 978-985-6744-29-0, Minsk, Byelorussia, 22 – 24 May, Vol. 1, pp. 59-63, 2007.

Articles published in national symposiums and conferences
Abdennasser Chebira, Ivan Budnyk, Kurosh Madani, "Auto-organisation d’une Structure Neuronale Arborescente", Proceedings of the XVIth Joint Meetings of the French Society of Classification (SFC 2009), Grenoble, France, 2 – 4 September, pp. 19-22, 2009.
B. Complexity
Generally speaking, the phenomenon of complexity has always been an inherent part of environmental systems, for example ecology, world economics, etc. Altogether, these systems consist of interdependent and variable parts. In other words, unlike in a conventional system, the parts need not have fixed relationships, fixed behaviours or fixed quantities, and thus their individual functions may also be undefined in traditional terms. Despite the apparent tenuousness of this concept, these systems, according to (Lucas 2000), form the majority of our world, including living organisms and social and natural systems. A sub-conclusion that might be drawn from this is that complex systems cannot be studied independently of their surroundings (Lucas 2000). It means that the behaviour of complex systems requires "a simultaneous understanding of the environment of these systems" (Moffat 2003).
Thus, the approaches used in Complexity theory are based on a number of new mathematical techniques, originating from fields as diverse as physics, biology, artificial intelligence, politics and telecommunications, and this interdisciplinary viewpoint is the crucial aspect, reflecting the general applicability of the theory to these systems in all areas (Lucas 2000). Describing these main features of the term complexity does not clarify its definition, because of the high variation in terminology and concepts that have proliferated in the field – deterministic chaos, fractals, self-organizing systems far from thermodynamic equilibrium, complex adaptive systems, self-organized criticality, cellular automata, solitons, and so on – even though they all globally share the same property (Czerwinski 1998).
In looking at where these key staples of the term complexity come from, let us start by considering natural world systems (examples of such systems are mentioned above). According to the work (Moffat 2003), in the classical view physical or biological processes are reducible to a few fundamental interactions. This leads to the idea that, under well-defined conditions, a system governed by a given set of laws will follow a unique course: "like the planets of the solar system" (Moffat 2003). In this sense, we are interested in the essential components, the staple point(s), that determine complexity in a system-independent way.
B.1 Defining complexity. Genesis of the concept complexity
There is a list of various approaches to defining complexity, but all of them may be characterized by one common feature. This common point is the answer to the question "How is complexity born?".
The senses (observing the world) plus mental (human) activity (making sense of that sensory information) encode a Natural System (NS) into a Formal System (FS); we manipulate the FS to mimic the causal change in the NS. From the NS we derive an implication that corresponds to a causal event in the FS; we decode the FS and check its success in representing the causal event in the NS (Ferreira 2001). Complexity as a phenomenon appears at the moment when we find "unexpected behaviour" between the natural system (NS) and the expected behaviour of the formal system (FS) that has been created in order to describe it. Complexity appears as the result of the discrepancies between our observation of the NS and the human-centred, pro-active FS. This is the gap of encoding/decoding between FS and NS and back (Fig. B.1).
Summarizing, "Complexity is the property of a real world system that is manifest in the inability of any one formalism being adequate to capture all its properties" (Mikulecky 2007).
Fig. B.1 : Genesis of the complexity
The complex system, from which we single out some smaller part – our NS that is converted into an FS – allows us to manipulate it and to have a model. The observed process is stated to be complex when the chosen FS tries to capture its behaviour but can only be partially successful (Mikulecky 2007). A good illustration of the process described in Fig. B.1 is the Newtonian paradigm. As an FS we were satisfied with it; then, using the FS for encoding and decoding in some special cases, we figured out discrepancies. We began to change it, so that the post-Newtonian paradigm actually replaced it as the model of the real world. As we began to look more deeply into the world, we came up with aspects that the Newtonian Paradigm failed to capture. Requirements of explanation give birth to complexity.
Briefly reviewing the wide variety of approaches, we find the common way of defining complexity. The principal understanding of this phenomenon named complexity comes from observing the interaction "between human mind and nature" (Fig. B.1). Scientists who operate with their own definitions try to figure nature out in a process of interaction, using wording such as phase states, bifurcations, strange attractors and emergence in order to describe and predict the behaviour of a few natural phenomena, assigning a "model" to them (Czerwinski 1998).
For centuries we have had a model of nature based on the Newtonian paradigm, which was built on Cartesian Reductionism: the Machine Metaphor and Cartesian Dualism (Bennett 2003), meaning that the Body is a biological machine and the Mind is something apart from the body. The intuitive concept of a machine is that it is built up from distinct parts and can be reduced to those parts without losing its machine-like character – this is Cartesian Reductionism.
The Newtonian Paradigm and the three General Laws of motion were used as the foundation of the modern scientific method, where dynamics is the centre of the framework, which leads to the notion of trajectory (Ferreira 2001). However, the real world came to be different from any model (Badiou 2007), including the Newtonian one.
Complexity as a world phenomenon shows itself, by observation or experience, as an ability to falsify a previously created model. Hence, we deeply link complexity with Popper's falsifiability and unfalsifiability (Popper 2002), which is an important concept for the advance of science; the term complexity, as the result of a failure (Mikulecky 2007), is tightly linked with the concept of Popper's falsifiability. The latter is a source of the discrepancy between FS and NS described in Fig. B.1. It is no surprise that any formal system describing a system, even a well developed and complicated one, sometimes cannot describe or explain, in its own terms, some natural phenomenon, which then claims to be complex.
According to Gödel's first incompleteness theorem, it is to be expected that a formal system is incomplete (Gödel and Feferman 2001). Moreover, the first theorem struck a fatal blow to Hilbert's program (Ewald 2004), which aimed towards a universal formalization based on a complete mathematical formal system; it thereby grants a right of existence to complexity as a form of incompleteness of a natural system. In turn, the second theorem of Gödel (Gödel and Feferman 2001) supports the idea of the cyclic re-defining of complexity described in Fig. B.1: formalizing a system, finding discrepancies, then again encoding, formalizing and decoding, we can find that the NS is again more complex than the FS, and so on.
In the computer world, complexity is a perpetual drive towards its determination and calculation/modelling. It is a non-stop movement in which computer scientists are "pushed" into the world of the running Tortoise and Achilles (Hofstadter 1999). However, Chaitin (Chaitin 2005) proposes that scientists and mathematicians abandon any hope of finding an "ideal" FS and suggests applying, each time, a quasi-empirical methodology.
Summarizing the topic of defining complexity (the complexity of systems), we want to mention that the various current definitions are related to what one might call the post-Newtonian paradigm (Czerwinski 1998), by which is meant that the arrangement of natural life and its complications is nonlinear: inputs and outputs are not proportional; phenomena are unpredictable, but within bounds are self-organizing; unpredictability frustrates conventional structuring; and solutions appear as self-organization rather than control as we usually think of it. Complexity in the broad meaning shows itself as the inability of some (or any) formalism to fully describe the observed phenomenon.
The next section is dedicated to a more detailed overview of complexity as a system attribute.
B.2 System’s attribute and complexity
The majority of definitions are grounded on the principal idea that complexity is an inherent part of complex systems such as economies, social structures, climate, nervous systems, etc. Complexity theory and chaos theory both attempt to reconcile the unpredictability of the non-linear dynamics of these systems with a sense of underlying order and structure (David 2000).
First of all we have to discuss what we understand by complex systems. In a naïve
way, we may describe them as systems which are composed of many parts, or elements,
or components which may be of the same or different kinds. The components or parts may
be connected in a more or less complicated fashion. The various branches of science offer
us numerous examples, some of which turn out to be rather simple whereas others may be
called truly complex (Haken 2002).
A modern definition of complex systems is based on the concept of algebraic complexity. It means that, at least to some extent, systems can be described by a sequence of data, for example the fluctuating intensity of a light source or a curve that represents data by numbers. There, one might attempt to follow the paths of the individual particles and their collisions and then derive the distribution function of the velocities of the individual particles, known as the Boltzmann distribution.
In all cases, a macroscopic description allows an enormous compression of information, so that we are no longer concerned with the individual microscopic data but rather with global properties. An important step in treating complex systems consists in establishing relations between various macroscopic quantities (Haken 2002).
The more science becomes divided into specialized disciplines, the more important it becomes to find unifying principles (Haken 2002). We may recount the cross-disciplinary views on complexity in complex systems given in the literature. As a matter of fact, these definitions cannot be reduced to a "top twenty" list (Sussman 2002); their common point is that all of them use word combinations such as: intricate ways, subtle, degree and nature of the relationships, behaviour of macroscopic collections. This type of wording is not acceptable for a strong definition, because all of these terms would themselves have to be pre-defined.
First of all, let us mention that we share the critical view of some authors on complexity who treat this paradigm as a kind of holism. They suggest that any attempt to cope with complexity using traditional tools is doomed to failure. Their remedies vary from complete abandonment of those tools to the introduction of new techniques and approaches. Of course, constructive suggestions for dealing with complexity are welcome from whatever source; thus, all the techniques of "the new sciences of complexity" are welcome for studying what has been considered the complexity of complex systems. As is aptly noted by Sussman (Sussman 2000), many of these techniques, however, have nothing to do with complexity per se. It is stated (Sussman 2000) that many papers with the word "complexity" in the title refer merely to some techniques for dealing with rather difficult (complex) systems. Considering this point of view, complexity as a solid attribute of systems might comprise the following features:
A system is complex when it is composed of many parts that interconnect in intricate ways (Moses 2002) - this definition has to do with the number and nature of the interconnections. A metric for intricateness is the amount of information contained in the system.
A system presents dynamic complexity when cause and effect are subtle over time (Senge 2006) - effects differ in the short run and in the long run; dramatically different effects can be observed locally and in other parts of the system. Obvious interventions produce non-obvious consequences.
A system is complex when it is composed of a group of related units (subsystems) for which the degree and nature of the relationships are imperfectly known (Sussman 2000) - the overall emergent behaviour is difficult to predict, even when the subsystem behaviour is readily predictable; small changes in inputs or parameters may produce large changes in behaviour.
A complex system has a set of different elements so connected or related as to perform a unique function not performable by the elements alone (Maier and Rechtin 2000) - this requires different problem-solving techniques at different levels of abstraction.
Complexity relates to the behaviour of macroscopic collections of units endowed with the potential to evolve in time (Highfield 1996) - this definition differs from computational complexity, which estimates the number of mathematical operations needed to solve a problem using the Turing machine concept.
The features of the complexity of complex systems are the following:
• Complex systems are not fragmentable: if a system were, it would be a machine. Reduction to parts destroys important system characteristics irreversibly.
• Complex systems comprise real components that are distinct from their parts: these are functional components defined by the system, whose definition depends on the context of the system. Outside the system they have no meaning; removed from the system, a component loses its original identity.
• Complex systems have no analytic or synthetic "largest FS model": if there were a largest model, all other models could be derived from it.
• Causalities in the system are mixed when distributed over the parts.
• Attributes of the systems are beyond algorithmic definition or realization: here, we deal with posing a challenge to falsify. One example is the famous Church's thesis (Church 1936) ("…All the models of computation yet developed, and all those that may be developed in the future, are equivalent in power… We will not ever find a more powerful model...").
These are not definitive indicators, but a system that has many of these attributes is hard to analyze using linear determinism or statistical methods (Lucas 2000).
Another important feature of complexity as an attribute of complex systems that has to be mentioned here, especially because of the T-DTS concept of our thesis, is self-organization. In the literature we find complexity treated as an inalienable attribute of systems that allows them to self-organize. Naturally, we first have to answer the question "What is the complexity of self-organizing systems?"
A single strong definition relevant to this complexity could not be found. The notion may arise in bio-organisms (Hinegardner and Engelberg-Kulka 1983), where it has a direct relationship with the evolutionary selection process. A very weak definition, such as the size of the genome, could be sufficient to explain an increase in the maximum complexity of all species under the evolutionary process. Related to mutation of the genome is the self-organizing complexity suggested by Wimsatt (Wimsatt 1974), by which is meant the co-adaptation of an organism's mechanisms (or of sub-mechanisms of other mechanisms) as a source of evolution; in other words, descriptive complexity. Kauffman (Kauffman 1993) suggests that the order manifest in organisms is a result of selection acting upon a system that is basically self-organizing, and that this self-organizational ability depends critically on the complexity of conflicting constraints. Here, complexity is linked with biology. These are some of the criteria that may or may not allow the system to achieve the benefits in innovation, based on survival (fitting real-world constraints) and adaptability, that we see for natural complex systems (Lucas 2000).
Therefore, one may see the difficulty of a strong, common definition of complexity as an inherent part of complex systems; this is not so much because of their different origins, but mostly because of the different phenomena and processes that take place and that are entangled with the word (rather than the term) complexity.
C. Neural Networks in hardware
Neural network hardware has undergone rapid development during the last decade. Unlike the conventional von Neumann architecture, which is sequential in origin, ANNs profit from massively parallel processing. A large variety of hardware has been designed to exploit the inherent parallelism of NN models. Despite the tremendous growth in the computing power of general-purpose processors, ANN-based hardware has been designed for specialized applications such as image processing, speech synthesis and analysis, pattern recognition, high-energy physics and so on.
Neural network hardware is usually defined as the set of devices designed to implement a particular NN architecture and learning algorithm. These devices take advantage of the parallel nature of ANNs. Due to the great diversity of neurohardware, this overview is limited to certain aspects of implementation (Liao 2001).
The most important aspect of neurocomputing hardware is its performance in comparison with software ANN realizations. It is true that, in general, a particular classification or other task does not require extremely high speed, and for this reason a software-based realization is in demand: it is architecture-independent and easy to handle and to adjust. However, these criteria become unimportant for real-time-response applications, systems and consumer products such as cheap verification devices.
According to (Aybay, Cetinkaya and Halici 1996), the hardware design of each NN-chip, as a principal component of NN-hardware, is built of four key elements: a Weights block, an Activation block, a Transfer function block and a Neuron state block, under the supervision of a Control unit that is present on each chip and is responsible for passing control parameters, Fig. C.1.
An important issue concerning the given scheme is that the Neuron state block, the Weights block and the Transfer function block may be located off the chip, and part of their function can be performed by a host computer (Aybay, Cetinkaya and Halici 1996).
Neural network hardware is usually specified by the number of artificial neurons on each neuro-chip and by the number of connections between them. The number of neurons can vary widely, from 10 up to 10^6. Another important characteristic of neurocomputers is the precision with which the arithmetical units perform the basic operations (Liao 2001).
Fig. C.1 : General block-level architecture of a neurochip or neurocomputer processing element
NN hardware can be categorized into classes using the following criteria (Aybay, Cetinkaya and Halici 1996):
• Type of device (neuro-chip or neurocomputer, general-purpose or specific-purpose device),
• Neuron properties (number of neurons, precision, storage of neuron state: on-chip/off-chip, digital/analogue),
• Weights (storage of weights, number of synapses, etc.),
• Activation characteristics (computation, activation block output),
• Transfer function characteristics (on/off-chip, analogue/digital, threshold look-up table/computational),
• Information flow,
• Learning (on/off-chip, standalone/via a host),
• Speed (learning speed/processing speed),
• Cascadability,
• Type of fabrication technology (for example VLSI),
• Clock rate,
• Data transfer rate,
• Number of inputs,
• Number of outputs,
• Type of input (analogue/digital),
• Type of output (analogue/digital).
One may find in the works (Aybay, Cetinkaya and Halici 1996), (Liao 2001) a commonly used simple taxonomy of neurocomputing hardware, but let us mention that the borders between the categories of this taxonomy are weakly defined.
Concluding our introduction to NN-hardware, we would like to highlight that, because of their popularity and especially because of their first-mover commercial success, NN-software-based applications have become more popular than hardware-based solutions, due to the disadvantages of the latter such as algorithmic specificity, design complexity and lack of user-friendliness. This has allowed NN-based software to lead the trend of ANN application development. Moreover, because of their high flexibility and universality in comparison with neurocomputers, software solutions continue to demonstrate a high potential to hold this leading position. However, whenever there appears a need to handle computation for a real-time application, or to employ a complex ANN with a large number of neurons, there is no competition for neurocomputers. For specific niches/problems, neurocomputers provide a much better cost-to-performance ratio, lower power consumption and smaller size. We may regard hardware-based neurocomputing as an efficient, delicate toolbox that, after its "gold rush" of the late 80s and early 90s, has been left at the mercy of the leading ANN-software-based tools and applications. However, the algorithmic success of the latter will, in the long term, only revive the area of neurohardware. In fact, these two approaches are not rivals: as long as conventional hardware cannot provide sufficient performance, there will be a need for neurocomputers (Liao 2001).
Since many neural network chips have already been described in the literature (Maren 1990), and new neural network chips appear frequently, we will not attempt to review them all. The following Section C.1 and Section C.2 give more details about the type of neurocomputers that employ RBF-like neural network models for pattern recognition. The first one, IBM© ZISC®-036, has been used as a tool for the verification of our ad hoc classification complexity approach. According to the above approaches to NN-hardware categorization, these two neurocomputers belong to the class of General Purpose Digital Neuro-Chips (NCgd) (Aybay, Cetinkaya and Halici 1996). The commonly used code that denotes their main characteristics is Id/Ad/Wdo/Sdo/Tdo/Lo, where I stands for input/output, A for activation block, W for weights block, S for neuron state block, T for transfer function and L for learning; furthermore, d means digital and o means on-chip.
C.1 IBM© ZISC®-036 Neurocomputer
The IBM© ZISC® (Zero Instruction Set Computer)-036 neurocomputer (IBM Corporation 1998), (De Tremiolles 1998) is a fully integrated neural-network-based circuit designed for recognition and classification applications which generally require supercomputing. The key component, a chip of 36 neurons (Fig. C.2), has an RBF neural network topology (Lindblad and al. 1996). The IBM© ZISC®-036 is a parallel neuro-processor that uses RCE (Reduced Coulomb Energy), an algorithm which automatically adjusts the number of hidden units and converges in only a few epochs. This method, as implemented on the IBM© ZISC®-036, has been proven effective in resolving pattern classes separated by nonlinear boundaries (Madani, De Tremiolles and Tannhof 1998).
Fig. C.2 : IBM© ZISC®-036 PC-486 ISA bus based block diagram
However, the RCE network depends on user-specified parameters which are computationally expensive to optimize (Wang, Neskovic and Cooper 2006). The IBM© ZISC®-036 implementation of the RBF-like (Radial Basis Function) model (Park and Sandberg 1991) can be seen as mapping an N-dimensional space by prototypes. Each prototype is associated with a category and an influence field. Intermediate neurons are added only when necessary, and the influence field is then adjusted by a threshold to minimize conflicting zones. A kNN algorithm is also embedded in this neurocomputer. The kNN (k-nearest neighbour) method classifies objects based on the closest training examples in the feature space; kNN is a type of instance-based, or lazy, learning where the function is only approximated locally and all computation is deferred until classification.
The IBM© ZISC®-036 system implements two kinds of distance metrics (a short illustrative sketch follows this list):
1. L1: a polyhedral-volume influence field.
2. LSUP: a hyper-cubical-volume influence field.
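The two norms can be illustrated in a few lines of Python (a purely illustrative sketch, not the chip's implementation; the toy three-component vectors are our own, whereas the real neurons store 64 components of 8 bits each):

    import numpy as np

    def l1_distance(x: np.ndarray, p: np.ndarray) -> int:
        """L1 (Manhattan) norm: sum of absolute component differences,
        giving a polyhedral (diamond-shaped) influence field."""
        return int(np.abs(x - p).sum())

    def lsup_distance(x: np.ndarray, p: np.ndarray) -> int:
        """LSUP (Chebyshev) norm: largest absolute component difference,
        giving a hyper-cubical influence field."""
        return int(np.abs(x - p).max())

    x = np.array([10, 200, 30], dtype=np.int32)    # input vector (toy values)
    p = np.array([12, 190, 35], dtype=np.int32)    # stored prototype (toy values)
    print(l1_distance(x, p), lsup_distance(x, p))  # 17 10

A prototype fires on an input vector when the selected distance does not exceed its influence field, which is why the L1 norm carves the input space into diamond-shaped regions and LSUP into hyper-cubes.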
The ZISC®-036 is conventionally regarded as a coprocessor device, Fig. C.3. Such a device must be controlled by a micro-controller or a state machine (accessing its registers). Naturally, in many RBF applications a large number of neurons is required. The cascadability of the ZISC®-036 is provided by 144 pins that can be directly interconnected. This simplification of the basic design supports ZISC®-036 tower construction. The ZISC chip supports asynchronous as well as synchronous protocols, the latter when a common clock can be shared with the controller. The calculation of the distance between the input vector and a prototype uses 14-bit precision. The components of the vectors are fed in sequence and processed in parallel by each neuron (Lindblad and al. 1996).
Fig. C.3 : Schematic drawing of a single IBM© ZISC®-036 processing element -
neuron
This means that each chip, Fig. C.4, is able to perform up to 250,000 recognitions per second.
Fig. C.4 : IBM© ZISC®-036 chip's block diagram
This chip is fully cascadable, which allows the use of as many neurons as the user needs. The first implementation (Lindblad and al. 1996) was done on a PC-486 ISA-bus card, Fig. C.5, and consists of 576 neurons (16 chips, each chip containing 36 neurons, Fig. C.4).
Fig. C.5 : Hardware realization: IBM© ZISC®-036 PCI board
Each neuron is an element which is able to (a simplified software sketch of this behaviour is given after the list):
• memorize a prototype (64 components coded on 8 bits), the associated category (14 bits), an influence field (14 bits) and a context (7 bits),
• compute the distance, based on the selected norm (L1 or LSUP), between its memorized prototype and the input vector (the distance is coded on fourteen bits),
• compare the computed distance with its influence field,
• communicate with the other neurons (in order to find the minimum distance, category, etc.),
• adjust its influence field (during the learning phase).
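The following lines give a simplified software model of this behaviour and of an RCE-style learning step (a sketch under strong assumptions: it ignores the 14-bit distance coding, contexts and the hardware search protocol; the class name is hypothetical, and the default maximum influence field simply follows the 14-bit coding mentioned above):

    import numpy as np

    class ZiscLikeNeuron:
        """Simplified model of one prototype neuron (not the actual silicon)."""
        def __init__(self, prototype, category, influence_field):
            self.prototype = np.asarray(prototype, dtype=np.int32)  # up to 64 components, 8 bits each
            self.category = category
            self.influence_field = influence_field                  # firing radius

        def distance(self, x, norm="L1"):
            d = np.abs(np.asarray(x, dtype=np.int32) - self.prototype)
            return int(d.sum()) if norm == "L1" else int(d.max())

        def fires(self, x, norm="L1"):
            return self.distance(x, norm) <= self.influence_field

    def rce_learn(neurons, x, category, max_field=2**14 - 1):
        """RCE-style learning step (sketch): shrink the fields of wrongly firing
        neurons and commit a new neuron when no correct-category neuron fires."""
        correct_neuron_fired = False
        for n in neurons:
            d = n.distance(x)
            if d <= n.influence_field:                 # the neuron fires on x
                if n.category == category:
                    correct_neuron_fired = True
                else:
                    n.influence_field = max(d - 1, 0)  # reduce the conflicting zone
        if not correct_neuron_fired:
            neurons.append(ZiscLikeNeuron(x, category, max_field))

    network = []
    rce_learn(network, [5, 5, 5], "A")   # commits the first prototype
    rce_learn(network, [9, 9, 9], "B")   # shrinks the first field, commits a second prototype

In recognition mode, the parallel search over the chip then amounts to collecting the responses of all firing neurons, exactly the step that the inter-neuron communication performs in hardware.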
As mentioned, controlling the IBM© ZISC®-036 is performed by accessing its registers, and requires an address definition via the address bus and data transfer via the data bus. The inter-ZISC communication bus, which is used to connect several devices within the same network, and the decision bus, which carries classification information, allow the use of the ZISC in a 'stand-alone' mode.
The ZISC's neuron organization allows the connection of several ZISC modules without impact on performance. An efficient protocol allows true parallel operation of all neurons of the network, even during the learning process.
Most commonly, ZISC control is performed by a master state machine over a standard I/O bus. The I/O bus of the IBM© ZISC®-036 has been designed to allow a wide variety of attachments, from a simple state-machine interface to standard micro-controllers or buses.
This neurocomputer has been used for image enhancement tasks such as noise reduction and focus correction (Madani, De Tremiolles and Tannhof 2001). However, after its last manufacturing run by IBM in 2001, present-day application requirements have stimulated the renewed production of the ZISC's cousin, the CogniMem (CM) chip. The first production batch was released in January 2008. The following section is dedicated to a detailed description of this CM-1K neural network chip.
C.2 CM-1K Neural Network chip
The CM-1K neural network chip, Fig. C.6, is a descendant of the IBM© ZISC®-036 neurocomputer (Mendez 05-2009).
Fig. C.6 : CM-1K Neural network chip
It is a neural network chip featuring 1024 neurons working in parallel and a parallel bus which allows the user to increase the network size by cascading multiple chips. It is the first product of the CogniMem network line (Mendez 05-2009). It is an ideal companion chip for smart sensors and cameras, and can classify patterns at high speed while coping with ill-defined data, the detection of unknown events, and adaptation to changes of context and working conditions (Mendez 03-2009). Compared with a present-day 4 GHz CPU and its bottleneck (memory access through a single bus), the CM-1K chip has a very simple, self-contained and efficient architecture (Mendez 05-2009).
In addition to this parallel neural network architecture, Fig. C.7, the CM-1K integrates a built-in recognition engine which can receive vector data directly from a sensor and broadcast it to the neurons in real time. It has been demonstrated (Mendez 03-2009) that the CM-1K chip's recognition time depends on the operating clock and not on the number of models stored in the neurons. CM-1K chips, like ZISC®-036 chips, can be interconnected to build an NN of any capacity, in increments of 1024 neurons, at less than one watt per chip (a short numerical estimate of such a cascade is given after Fig. C.7).
Fig. C.7 : Network of 6 CM-1K chips, or 6144 neurons in parallel
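A back-of-the-envelope estimate of this incremental cascading can be written as follows (our own illustrative helper; the per-chip neuron count and the one-watt bound are taken from the description above, the rest is an assumption):

    import math

    NEURONS_PER_CHIP = 1024      # CM-1K network size increment
    MAX_WATTS_PER_CHIP = 1.0     # "less than one watt per chip", used as an upper bound

    def cascade_estimate(prototypes_needed: int):
        """Chips, total neurons and worst-case power for a given number of prototypes."""
        chips = math.ceil(prototypes_needed / NEURONS_PER_CHIP)
        return chips, chips * NEURONS_PER_CHIP, chips * MAX_WATTS_PER_CHIP

    print(cascade_estimate(6000))   # (6, 6144, 6.0): the six-chip configuration of Fig. C.7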
The CM-1K control and data bus is simple and composed of only 28 lines. Adding more chips to increase the network size is totally transparent to the controller, since the neurons include their own learning and recognition logic (Mendez 03-2009). CogniMem also offers a proprietary signature extraction from 2D video to a 1D vector.
Fig. C.8 : CM-1K chip’s functional diagram. Inner architecture
The recognition engine can operate at sensor speed (up to 27 MHz). The usage of the high-speed recognition engine requires that knowledge be previously loaded into the neurons. Concerning the chip's inner architecture, Fig. C.8, one finds the principal organizational similarity with the IBM© ZISC®-036, Fig. C.3: the ZISC-like chain of identical neurons operating in parallel. A neuron is an associative memory which can perform pattern comparison autonomously.
During the recognition of an input vector, all the neurons communicate briefly with one another (for 16 clock cycles) to find which one has the best match. In addition to its register-level instructions, the CM-1K integrates a built-in recognition engine which receives vector data directly through a digital input bus, broadcasts it to the neurons and returns the best-fit category 3 milliseconds later (Mendez 03-2009).
Therefore, given the analogy with the IBM© ZISC®-036 neurons, we provide a brief list of the CM-1K chip's features and specifications (a small sketch contrasting its RBF and KNN modes follows the list):
• 1024 parallel neurons
• Vector data of up to 256 bytes
• 10 ms learning time (maximum)
• 10 ms recognition time (maximum)
• No limit to neuron expansion
• Trained by example
• RCE (Restricted Coulomb Energy)
• L1 and LSup distance norms
• Radial Basis Function (RBF) or K-Nearest Neighbour (KNN) classifier
• 0.13 µm technology - die size 8 x 8 mm
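As a purely illustrative contrast between the two classification modes listed above (reusing the hypothetical ZiscLikeNeuron sketch from Section C.1, and not the chip's actual firmware), RBF mode answers only from neurons whose influence field covers the input, reporting an unknown event otherwise, while KNN mode always answers with the majority category of the k closest prototypes:

    from collections import Counter

    def classify_rbf(neurons, x, norm="L1"):
        """RBF mode: answer only if the input falls inside some influence field."""
        firing = [(n.distance(x, norm), n.category) for n in neurons if n.fires(x, norm)]
        return min(firing)[1] if firing else None     # None stands for an unknown event

    def classify_knn(neurons, x, k=3, norm="L1"):
        """KNN mode: majority category among the k nearest stored prototypes."""
        ranked = sorted((n.distance(x, norm), n.category) for n in neurons)
        return Counter(category for _, category in ranked[:k]).most_common(1)[0][0]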
The range of fields where the CM-1K is used includes: image recognition (part inspection, object recognition, face recognition, target tracking and identification, video monitoring, gaze tracking, medical imaging, satellite imaging, smart motion detection, kinematics), signal recognition (speech recognition, voice identification, radar identification, EKG and EEG monitoring, sonar identification, spectrum recognition, flight analysis, vibration monitoring), data mining (cryptography, genomics, bioinformatics, fingerprint identification, unstructured data mining) and more.
According to the report (Mendez 05-2009), several companies have integrated the CM-1K chip into their product designs, mostly to add recognition capabilities to embedded sensor boards. Moreover, the newly created European Laboratory for Sensory Intelligence is presently attempting the design of a system featuring 100 CM-1K chips. CogniMem also envisages the possibility of creating a USB-key portable data-mining solution based on the CM-1K. Summarizing our overview of these specific neurocomputers, let us note that, according to the extensive DARPA study, NN-hardware will remain on duty, because "as the number of neurons and interconnects increases with regards to the size of application, the amount of memory required to store the interconnect values increases. If that memory cannot be stored locally with every processor, then the processor must access memory external to itself and that slows the overall speed of the simulator" (Goblick 1988).
Bibliography
(Abdelwahab 2004) Manal M.Abdelwahab, Self Designing Pattern Recognition System Employing Multistage Classification, Thesis presented to obtain the degree of Doctor of Philosophy in the Department of Electrical and Computer Engineering in the College of Engineering and Computer Science at the University of Central Florida, Orlando, Florida 2004, Publication No.: 3162085 (Aggarwal and Yu 1999) Charu C.Aggarwal and Phillips S.Yu, Data Mining Techniques for Associations, Clustering and Classification, Lecture Notes in Artificial Intelligence, Proceedings of the Third Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining, PAKDD'99, April 26-28, 1999, Beijing, China, vol. 1574, pp. 13-23, 1999, ISBN: 3540658661 (Aha 1991) David W.Aha, L.Birnbaum (ed.), and G.Collins (ed.), Incremental Constructive Induction: An Instance-Based Approach, Machine Learning: Proceedings of the Eighth International Workshop (ML 91), pp. 117-121, 1991, ISBN: 1558602003 (Anderberg 1973) Michael R.Anderberg, Cluster Analysis for Applications (Probability & Mathematical Statistics Monograph) Academic Press Inc., New York, US, 1973, ISBN: 0120576503 (Arabie 1994) Phipps Arabie, Clustering and Classificication World Scientific Publishing Co. Pte. Ltd., 1994, ISBN: 9810212879 (Arbib 2003) Michael A.Arbib, The Handbook of Brain Theory and Neural Networks, Second revised edition, Bradford Books, 2003, ISBN: 0262011972 (Atlas and al. 1989) Les Atlas, Jerome Connor, Dong Park, Mohamed El-Sharkawi, I. M. Robert J., Alan Lippman, Ronald Cole, and Yeshwant Muthusamy, A Performance Comparison of Trained Multi-Layer Perceptrons and Trained Classification Tree, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol. 3, pp. 915-920, 1989, ISSN: 00189219 (Badiou 2007) Alain Badiou, Le Concept de Modele : Introduction а une Epistemologie Materialiste des Mathematiques L'Harmattan, Paris, France, 2007, ISBN: 2213634815 (Bak 1996) Per Bak, How Nature Works: The Science of Self-Organized Criticality Springer-Verlag New York Inc., New York, US, 1996, ISBN: 0387947914 (Barbara and Kamath 2003) Daniel Barbara and Chandrika Kamath, Proceedings of the Third SIAM International Conference on Data Mining: Proceedings in Applied Mathematics 112 Society for Industrial and Applied Mathematics, US, 2003, ISBN: 0898715458 (Bauer and Kohavi 1999) Eric Bauer and Ron Kohavi, An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting and Variants, Machine Learning, vol. 36, no. 1-2, pp. 105-139, 1999, ISSN: 08856125
(Bellman 1961) Richard Ernest Bellman, Adaptive Control Processes: A Guided Tour, First edition, Princeton University Press, Princeton, New Jersey, US, 1961, ASIN: B0006AWE08 (Ben-Dor, Shamir and Yakhini 1999) Amir Ben-Dor, Ron Shamir, and Zohar Yakhini, Clustering Gene Expression Patterns., Journal of Computational Biology, vol. 6, no. 3-4, pp. 281-297, 1999, ISSN: 10665277 (Benediksson, Swain and Ersoy 1990) Jon A.Benediksson, Philip H.Swain, and Okan K.Ersoy, Neural Network Approaches Versus Statistical Methods in Classification of Multisource Remote Sensing Data, IEEE Transactions on Geoscience and Remote Sensing, vol. 28, no. 4, pp. 540-551, 1990, ISSN: 01962892 (Bennett 1988) Charles H.Bennett and R.Herken (ed.), Logical Depth and Physical Complexity, in The Universal Turing Machine: A Half-Century Survey Oxford University Press Inc., New York, US, 1998, pp. 227-257, ISBN: 3211826378 (Bennett 2003) Jonathan Bennett, Learning from Six Philosophers: Descartes, Spinoza, Leibniz, Locke, Berkeley, Hume Clarendon Press, US, 2003, ISBN: 0199266298 (Berry 2003) Michael W.Berry, Survey of Text Mining: Clustering, Classification, and Retrieval Springer-Verlag New York Inc., New York, US, 2003, ISBN: 0387955631 (Bertotti and Mayergoyz 2006) Giorgio Bertotti and Isaak D.Mayergoyz, The Science of Hysteresis : 3-Volume Set Academic Press Inc., 2006, ISBN: 0124808743
(Biryukov, Ryazanov and Shmakov 2007) Andrey S.Biryukov, Vladimir V.Ryazanov, and A.S.Shmakov, "Solving Clusterization Problems Using Groups of Algorithms," Computational Mathematics and Mathematical Physics, vol. 48, no. 1, pp. 168-183, 2007, ISSN: 09655425
(Blass and Gurevich 2003) Andreas Blass and Yuri Gurevich, Algorithms: A Quest for Absolute Definitions, Bulletin of European Association for Theoretical Computer Science, vol. 81, pp. 195-225, 2003, ISSN: 02529742 (Boolos, Burgess and Jeffrey 2002) George S.Boolos, John P.Burgess, and Richard C.Jeffrey, Computability and Logic, Fourth revised edition, Cambridge University Press, 2002, ISBN: 0521007585 (Bouyoucef 2006) El Khier Sofiane Bouyoucef, Comparaison des Performances de la T-DTS avec 34 Algorithmes de Classification en Exploitant 16 Bases de Donnees de l'UCI: (Machine Learning Repository), 2006, Ph.D.Report,University Paris XII
(Bouyoucef, Chebira, Rybnik and Madani 2005) El Khier Sofiane Bouyoucef, Abdennasser Chebira, Mariusz Rybnik, Kurosh Madani, A.Sachenko (ed.), and O.Berezsky (ed.), Multiple Neural Network Model Generator with Complexity Estimation and Self-Organization Abilities, International Journal of Computing, vol. 4, no. 3, pp. 20-29, 2005.
(Bouyoucef 2007) El Khier Sofiane Bouyoucef, Contribution а l'Etude et la Mise en Ouvre d'Indicateurs Quantitatifs et Qualitatifs d'Estimation de la Complexite pour la Regulation du processus d'Auto Organisation d'une Structure Neuronale Modulaire de Traitement d'Information, These presentee аu Laboratoire Images, Signaux et Systemes Intelligents - LISSI de l'Universite Paris XII pour obtenir le grade de docteur en sciences informatiques 2007. (Breiman 1996) Leo Breiman, Bagging predictors, Machine Learning, vol. 24, no. 2, pp. 123-140, 1996, ISSN: 17276209 (Briem, Benediktsson and Sveinsson 2000) Jakob Gunnar Briem, Jon Atli Benediktsson, and Johannes R.Sveinsson, Use of Multiple Classifiers in Classification of Data from Multiple Data Sources, International Geoscience and Remote Sensing Symposium Proceedings, vol. 2, pp. 882-884, 2000, ISSN: 08856125 (Brito, Bertrand, Cucumel and De Carvalho 2007) Paula Brito, Patrice Bertrand, Guy Cucumel, and Francisco de Carvalho, Selected Contributions in Data Analysis and Classification (Studies in Classification, Data Analysis, and Knowledge Organization) Springer-Verlag Berlin and Heidelberg GmbH. and Co. K., Germany, 2007, ISBN: 0780363604 (Bruzzone, Prieto, and Serpico 1999) Lorenzo Bruzzone, Diego Fernandez Prieto, and Sebastiano B.Serpico, A Neura-Statistical Approach to Multitemporal and Multisource Remote-Sensing Image Classification, IEEE Transactions on Geoscience and Remote Sensing, vol. 37, no. 3, pp. 1350-1359, 1999, ISBN: 3540735585 (Budnyk, Bouyoucef, Chebira and Madani 2008) Ivan Budnyk, El Khier Sofiane Bouyoucef, Abdennasser Chebira, and Kurosh Madani, A Modular Neural Classifier with Self-Organizing Learning: Performance Analysis, Proceedings of International Conference on Neural Networks and Artificial Intelligence, pp. 65-69, 2008, ISSN: 01962892 (Budnyk, Chebira and Madani 2007) Ivan Budnyk, Abdennasser Chebira, Kurosh Madani, J.Zaytoon (ed.), J-L.Ferrier (ed.), J.Andrade-Cetto (ed.), and J.Filipe (ed.), ZISC Neural Network Base Indicator for Classification Complexity Estimation, in Artificial Neural Networks and Intelligent Information Processing (ANNIIP 2007) INSTICC PRESS (Portugal 2007), 2007, pp. 38-47, ISBN: 9789856329794 (Budnyk, Chebira and Madani 2008) Ivan Budnyk, Abdennasser Chebira, and Kurosh Madani, ZISC Neuro-computer for Task Complexity Estimation in T-DTS framework, in Artificial Neural Networks and Intelligent Information Processing (ANNIIP 2008) INSTICC PRESS (Portugal 2008), 2008, pp. 18-27, ISBN: 9789728865825 (Butz 2001) Martin Volker Butz, Rule-based Evolutionary Online Learning Systems: Learning Bounds, Classification, and Prediction, Thesis presented to obtain the degree of Doctor of Philosophy in Computer Science in the Graduate College of the University of Illinois at Urbana-Champaign, Urbana, Illinois 2001, ISBN: 9789898111333
(Cacoullos 1966) Theophilos Cacoullos, Estimation of a Multivariate Density, Annals of the Institute of Statistical Mathematics, vol. 18, no. 2, pp. 179-189, 1966, Publication No.: 3153259 (Chaitin 2005) Gregory J.Chaitin, Meta Math!: The Quest for Omega (Peter N. Nevraumont Books) Pantheon Books, 2005, ISSN: 00203157 (Chan, Huang and De Fries 2001) Jonathan Cheung-Wai Chan, Chengquan Huang, and uth De Fries, Enhanced Algorithm Performance for Land Cover Classification from Remotely Sensed Data Using Bagging and Boosting, IEEE Transactions on Geoscience and Remote Sensing, vol. 39, no. 3, pp. 693-695, 2001, ISBN: 0375423133
(Chawla, Eschrich and Hall 2001) Nitesh Chawla, Steven Eschrich, and Lawrence O.Hall, Creating Ensemble of Classifiers, Proceedings of the 2001 IEEE International Conference on Data Mining (ICDM'01), 29 November - 2 December 2001, San Jose, California, US, no. 580, p. 581, 2001. ISBN: 0769511198
(Chebira, Madani and Mercier 1997) Abdennasser Chebira, Kurosh Madani, Gilles Mercier, and S.K.Rogers (ed.), Various Ways for Building a Multi-Neural Network System : Application to a Control Process, Society of Photo-Optical Instrumentation Engineers, Applications and Science of Artificial Neural Networks III, 21-24 April 1997, Orlando, Florida, US, vol. 3077, pp. 148-159, 1997. ISBN: 0819424927 (Chen 1976) Chi Hao Chen, Chi Hao, Information Sciences, vol. 10, pp. 159-173, 1976, ISSN: 01962892 (Chen and Varshney 2002) Biao Chen and Pramod K.Varshney, A Byesian Sampling Approach to Decision Fusion Using Hierarchical Models, IEEE Transactions on Signal Processing, vol. 50, no. 8, pp. 1809-1818, 2002, ISSN: 00200255 (Chernoff 1966) Herman Chernoff, Estimation of a Multivariate Density, Annals of the Institute of Statistical Mathematics, vol. 18, pp. 179-189, 1966, ISSN: 1053587X (Chi and Ersoy 2002) Hoi-Ming Chi and Okan K.Ersoy, Support Vector Machine Decision Trees with Rare Event Detection, International Journal of Smart Engineering System Design, vol. 4, no. 4, pp. 225-242, 2002, ISSN: 00203157 (Chomsky 1968) Noam Chomsky, Language and Mind Harcourt Brace and World, US, 1968, ISSN: 10255818 (Chou 1983) Li Chou, Self-optimizing Method and Machines (WO/1983/000069), European Patent Office (EPO) (DE),1983, ISBN: 052167493X (Church 1936) Alonzo Church, A Note on the Entscheidungsproblem, The Journal of Symbolic Logic, vol. 1, no. 1, pp. 40-41, 1936, World Intellectual Property Organization,PCT/US1982/000845, ISSN: 00224812 Abstract: http://links.jstor.org/sici?sici=0022-4812%28193603%291%3A1%3C40%3AANOTE%3E2.0.CO%3B2-D
(Cosman, Oehler, Riskin and Gray 1993) Pamela C.Cosman, Kivanc L.Oehle, Eve A.Riskin, and Robert M.Gray, Using Vector Quantization for Image Processing, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, vol. 81, no. 9, pp. 1326-1341, 1993, ISSN: 00189219 (Cover and Hart 1967) Thomas M.Cover and Peter E.Hart, Nearest Neighbour Pattern Classification, IEEE Transactions on Information Theory, vol. 13, pp. 21-27, 1967, ISSN: 00189448 (Czerwinski 1998) Thomas J.Czerwinski, Coping With the Bounds: Speculations on Nonlinearity in Military Affairs National Defense University Press, 1998, ISBN: 1579060099 (Dasarathy 1991) Belur V.Dasarathy, Nearest Neighbor: Pattern Classification Techniques IEEE Computer Society Press, US, 1991, ISBN: 0818689307 (David 2000) Levy David, J.Rabin (ed.), G.J.Miller (ed.), M.Dekker (ed.), and B.W.Hildreth (ed.), Applications and Limitations of Complexity Theory in Organization Theory and Strategy, in Handbook of Strategic Management, Second revised edition, Marcel Dekker Inc., New York, US, 2000, ISBN: 0824703391 (De Tremiolles 1998) Ghislain Imbert De Tremiolles, Contribution a l'Etude Theoretique des Modeles Neuromimetiques et a leur Validation Experimentale: Mise en Oeuvre d'Applications Industrielles, These presentee a LISSI de l'Universite Paris XII pour obtenir le grade de docteur en sciences informatiques 1998, Number: 98PA120018 (De Tremiolles and al. 1997) Ghislain Imbert De Tremiolles, Pascal Tannhof, Brendan Plougonven, Claude Demarigny, Kurosh Madani, J.Mira (ed.), R.Moreno-Diaz (ed.), and J.Cabestany Moncusi (ed.), Visual Probe Mark Inspection, Using Hardware Implementation of Artificial Neural Networks, in VLSI Production, Lecture Notes in Computer Science, Biological and Artificial Computation, From Neuroscience to Technology, International Work-Conference on Artificial and Natural Neural Networks, IWANN'97 Lanzarote, Canary Islands, Spain, vol. 1240, pp. 1374-1383, 1997, ISBN: 3540630473 (Decoste and Scholkopf 2002) Dennis Decoste and Bernhard Scholkopf, Training Invariant Support Vector Machines, Machine Learning, vol. 46, no. 1-3, pp. 161-190, 2002, ISSN: 08856125 (Devroye 1987) Luc Devroye, A Course in Density Estimation Birkhauser Verlag AG, Germany, 1987, ISBN: 0817633650 (Deza M. and Deza E. 2006) Michel Marie Deza and Elena Deza, Dictionary of Distance Metrics Elsevier Science Ltd., 2006, ISBN: 0444520872 (Dietterich 2001) Thomas G.Dietterich, J.Kittler (ed.), and F.Roli (ed.), Ensemble Methods in Machine Learning, Lecture Notes in Computer Science, Multiple Classifier Systems, Proceedings of the First International Workshop, McS 2000, June 21-23, 2000, Cagliari, Italy, vol. 1857, pp. 1-15, 2001, ISBN: 3540677046
(Ding 2007) Yuanyuan Ding, Handling Complex, High Dimensional Data for Classification and Clustering, Thesis presented to obtain the degree of Doctor of Philosophy of University of Mississippi 2007, Publication No.: 3279419 (Dong 2003) Jianxion Dong, Speed and Accuracy: Large-scale Machine Learning Algorithms and their Applications, Thesis in the Department of Computer Sceince Presented in Partial Fulfillment of the Requirements For the Degree of Doctor of Philosophy Concordia University Montreal, Quebec, Canada 2003, ISBN: 0612852695 (Du, Zhang and Sun 2009) Peijun Du, Wei Zhang, Hao Sun, J.A.Benediktsson (ed.), J.Kittler (ed.), and F.Roli (ed.), Multiple Classifier Combination for Hyperspectral Remote Sensing Image Classification, Lecture Notes in Computer Science, Proceedings of the 8th International Workshop on Multiple Classifier Systems, MCS 2009, June 10-12, 2009, Reykjavik, Iceland, vol. 5519, pp. 52-61, 2009, ISBN: 3642023258 (Duda and Hart 1973) Richard O.Duda and Peter E.Hart, Pattern Classification and Scene Analysis John Wiley and Sons Inc., New York, US, 1973, ISBN: 0471223611 (Dujardin and al. 1999) Anne-Sophie Dujardin, Veronique Amarger, Kurosh Madani, Olivier Adam, Jean-Francois Motsch, M.Jose (ed.), and J.V.Sanchez-Andres (ed.), Multi-Neural Network Approach for Classification of Brainstem Evoked Response Auditory, Lecture Notes in Computer Science, Engineering Applications of Bio-Inspired Artificial Neural Networks, the International Work-Conference on Artificial and Natural Neural Networks, IWANN'99, June 2-4, Alicante, Spain, Proceedings, vol. 1607, pp. 255-264, 1999, ISBN: 3540660682 (Edmonds 1999) Bruce Edmonds, F.Heylighen (ed.), J.Bollen (ed.), and A.Riegler (ed.), What is Complexity? - The philosophy of complexity per se with application to some examples in evolution, in The Evolution of Complexity: The Violet Book of Einstein Meets Magritte, Kluwer Academic Publishers, 1999, ISBN: 0792357647 (Ester, Kriegel, Sander and Xu 1996) Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiawei Xu, and J.Han (ed.), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Lecture Notes in Computer Science, Proceedings of Second International Conference on Knowledge Discovery and Data Mining, vol. 12, no. 1, pp. 18-24, 1996, ISBN: 1577350049 (Ewald 2004) William Bragg Ewald, From Kant to Hilbert Oxford University Press, US, 2004, ISBN: 0198505353 (Fayyad, Piatetsky-Shapiro and Smyth 1996) Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, From Data Mining to Knowledge Discovery in Databases, AI Magazine (An Official Publication of the American Association for Artificial Intelligence), vol. 17, no. 3, pp. 37-54, 1996, ISSN: 07384602 (Feldman and Crutchfield 1997) David P.Feldman, James P.Crutchfield, V.M.Agranovich (ed.), A.R.Bishop (ed.), A.P.Fordy (ed.), P.R.Holland (ed.), P.R.Holland (ed.), and R.Q.Wu (ed.), Measures of Statistical Complexity: Why?, Physics Letters A, vol. 238, no. 4-5, pp. 244-252, 1997, ISSN: 03759601
(Fellman 2004) Philip V.Fellman, "The Nash Equilibrium Revisited: Chaos and Complexity Hidden in Simplicity," Interjournal ICCS4, Proceedings of the Fourth Internatinal Conference on Complex Systems, May 16-21, 2004, Boston, US, 2004, ISSN: 10810625, Abstract: ID 1013, http://arxiv.org/abs/0707.0891
(Ferreira 2001) Pedro M.Ferreira, Tracing Complexity Theory, 2001, Massachusetts Institute of Technology Engineering Systems Division Research Seminar in Engineering Systems ESD.83 (Fielding 2007) Alan H.Fielding, Cluster and Classification Techniques for the Biosciences, First edition, Cambridge University Press, UK, 2007, ISBN: 0521852811 (Fisher 1936) Ronald A Fisher and R.A Fisher (ed.), The Use of Multiple Measures in Taxonomic Problems, Annals of Eugenics: A Journal Devoted to the Genetic Study of Human Populations, vol. 7, pp. 179-188, 1936, ASIN: B00282EMS4 (Fogel 1991) David B.Fogel, An Information Criterion for Optimal Neural Network Selection, IEEE Transactions on Neural Networks, vol. 2, no. 5, pp. 490-497, 1991, ISSN: 10459227
(Fraley and Raftery 2002) Chris Fraley and Adrian E Raftery, "Model-Based Clustering, Discriminant Analysis, and Density Estimation," Journal of the American Statistical Association, vol. 97, no. 458, pp. 611-631, 2002, ISSN: 01621459, http://www.jstor.org/pss/3085676
(Freund and Schapire 1996) Yoav Freund, Robert E.Schapire, and L.Saitt (ed.), Experiments with a New Boosting Algorithm, Proceedings of the Thirteenth International Conference on Machine Learning 1996 International Conference, pp. 148-156, 1996, ISBN: 1558604197
(Friedman and Meulman 2004) Jerome H.Friedman and Jacqueline J.Meulman, "Clustering Objects on Subsets of Attributes," Journal of the Royal Statistical Society, Series B (Statistical Methodology), vol. 66, no. 4, pp. 815-849, 2004, ISSN: 13697412, http://www.jstor.org/pss/3647651
(Friedman and Rafsky 1979) Jerome H.Friedman and Lawrence C.Rafsky, Multivariate Generalizations of the Wald-Wolfowitz and Smirnov Two-Sample Tests, The Annals of Mathematical Statistics, vol. 7, no. 4, pp. 697-717, 1979, ISSN: 00034851 (Friedrich 2004) Friedrich Jurgen, Spatial Modeling in Natural Sciences and Engineering: Software Development and Implementation, First edition, Springer-Verlag Berlin and Heidelberg GmbH. and Co. K., Germany, 2004, ISBN: 3540208771 (Fuglede and Topsoe 2004) Bent Fuglede and Flemming Topsoe, Jensen-Shannon Divergence and Hilbert Space Embedding, Proceedings, IEEE International Symposium on Information Theory (ISIT 2004), vol. 3, pp. 31-36, 2004, ISBN: 0780382803 (Fukunaga 1972) Keinosuke Fukunaga, Introduction to Statistical Pattern Recognition (in Russian) Nauka, Moscow, USSR, 1979, 1972, ISBN: 0122698509
(Gao, Foster, Mobus and Moschytz 2001) Qun Gao, Philipp Forster, Karl R.Mobus, and George S.Moschytz, Fingerprint Recognition Using CNNs: Fingerprint Preprocessing, Proceedings on the IEEE International Symposium on Circuits and Systems (ISCAS 2001), 6-9 May 2001, Sydney, Australia, vol. 2, pp. 433-436, 2001, ISBN: 0780366859 (Garey and Johnson 1979) Michael R.Garey and David S.Johnson, Computers and Intractability: A Guide to the Theory of Np-Completeness W.H. Freeman and Company Lyd., San Francisco, California, US, 1979, ISBN: 0716710455 (Gasmi and Merouani 2005) Ibtissem Gasmi and Hayet Merouani, Towards a Method of Automatic Design of Multi-Classifiers System Based Combination, Proceeding of World Academy of Science, Engineering and Technology, June 2005, vol. 6, pp. 82-87, 2005, ISSN: 20703724 (Gavin, Oswald, Wahl and Williams 2002) Daniel G.Gavin, W.Wyatt Oswald, Eugene R.Wahl, John W.Williams, D.B.Booth (ed.), and A.R.Gillespie (ed.), A Statistical Approach to Evaluating Distance Metrics and Analog Assignments for Pollen Records, Quaternary Research, vol. 60, no. 3, pp. 356-367, 2002, ISSN: 00335894
(Gelfand, Ravishankar and Delp 1991) Saul B.Gelfand, Channasandra S.Ravishankar, and Edward J.Delp, An Iterative Growing and Pruning Algorithm for Classification Tree Design, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 2, pp. 163-174, 1991, ISSN: 01628828, Abstract: www.cs.virginia.edu/~robins/Quantum_Computing_with_Molecules.pdf
(Gershenfeld and Chuang 1998) Neil Gershenfeld and Isaac L.Chuang, Quantum Computing with Molecules, Scientific American Magazine, pp. 66-71, 1998, ISSN: 00368733.
(Gersho and Gray 1991) Allen Gersho and Robert M.Gray, Vector Quantization and Signal Compression Kluwer Academic Publishers, 1991, ISBN: 0792391810
(Giacinto, Roli and Fumera 2000) Giorgio Giacinto, Fabio Roli, and Giorgio Fumera, Unsupervised Learning of Neural Network Ensembles for Image Classification, ICNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, 24-27 July, Como, Italy, vol. 3, pp. 155-159, 2000, ISBN: 0769506194
(Giusti, Masulli and Sperduti 2002) Nicola Giusti, Francesco Masulli, and Alessandro Sperduti, Theoretical and Experimental Analysis of a Two-Stage System for Classification, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 893-904, 2002, ISSN: 01628828
(Go, Han, Kim and Lee 2001) Jinwook Go, Gunhee Han, Hagbae Kim, and Chulhee Lee, Multigradient: A New Neural Network Learning Algorithm For Pattern Classification, IEEE Transactions on Geoscience and Remote Sensing, vol. 39, no. 5, pp. 986-993, 2001, ISSN: 01962892
(Goblick 1988) Thomas Goblick, DARPA Neural Network Study: October 1987-February 1988 AFCEA International Press, 1998, ISBN: 0916159175
(Godel 2001) Kurt Godel, S.Feferman (ed.), J.W.Dawson Jr (ed.), S.C.Kleene (ed.), G.H.Moore (ed.), R.M.Solovay (ed.), and J.Van Heijenoort (ed.), Collected Works: Volume II: Publications 1938-1974 (Collected Works (Oxford)) Oxford University Press, US, 2001, ISBN: 0195147219 (Gold and Morgan 1999) Bernard Gold and Nelson Morgan, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, First edition, John Wiley, New York, US, 1999, ISBN: 0471351547 (Gold, Holub and Sollich 2005) Carl Gold, Alex Holub, Peter Sollich, J.Mira (ed.), and A.Prieto (ed.), Bayesian Approach to Feature Selection and Parameter Tuning for Support Vector Machine Classifiers, Neural Networks, vol. 18, no. 5-6 (Special issue: IJCNN 2005), pp. 693-701, 2005, ISSN: 08936080 (Goldstone and Kersten 2003) Robert L.Goldstone, Alan Kersten, A.F.Healy (ed.), R.W.Proctor (ed.), and I.B.Weiner (ed.), Concept and Categorization, in Handbook of Psychology: Experimental Psychology John Wiley adn Sons Inc., US, 2003, pp. 599-622, ISBN: 0471392626 (Goonatilake and Khebbal 1995) Suran Goonatilake and Sukhdev Khebbal, Intelligent Hybrid Systems: Fuzzy Logic, Neural Networks, and Genetic Algorithms John Wiley and Sons Ltd., 1995, ISBN: 0471942421 (Gray, Oehler, Perlmutte and Ohlsen 1993) Robert M.Gray, Karen L.Oehler, Keren O Perlmutter, and Richard A.Olshen, Combining Tree-Structured Vector Quantization with Classification and Regression Trees, Proceedings of the Twenty-Seventh Asilomar Conference on Signals, Systems, and Computers, pp. 1494-1498, 1993, ISBN: 0818641207 (Green and Newth 2001) David G.Green, David Newth, T.Bossomaier (ed.), R.Standish (ed.), and S.Halloy (ed.), Towards a T heory of Everything? - Grand Challenges in Complexity and Informatics, Complexity International, vol. 8,, Paper ID: green05 2001, ISSN: 13200682, http://www.complexity.org.au/ci/vol08/green05/ (Guha, Rastogi and Shim 1998) Sudipto Guha, Rajeev Rastogi, Kyuscok Shim, L.M.Haas, and A.Tiwary, CURE: An Efficient Clustering Algorithm for Large Databases, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, US, vol. 27, no. 2, pp. 73-84, 1998, ISBN: 0897919955 (Guha, Rastogi and Shim 2000) Sudipto Guha, Rajeev Rastogi, and Kyuscok Shim, ROCK: Robust Clustering Algorithm for Categorization Attributes, Information Systems, vol. 25, no. 5, pp. 345-366, 2000, ISSN: 03064379 (Haken 2002) Hermann Haken, Information and Self-Organization: A Macroscopic Approach to Complex Systems (Springer Series in Synergetics), Second edition, Springer-Verlag Berlin and Heidelberg GmbH. and Co. K., Germany, 2000, ISBN: 3540662863
(Han and Kamber 2006) Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second revised edition, Morgan Kaufmann Publishers Inc., 2006, ISBN: 1558609016
(Han and Karypis 2000) Eui-Hong (Sam) Han and George Karypis, "Centroid-Based Document Classication: Analysis and Experimental Results," University of Minnesota, Department of Computer Science / Army HPC Research Center, AHPCRC, Minnesota Supercomputer Institute (Center contract number DAAH04-95-C-0008) ,2000, Technical Report No.: 00-017, Related papers are available via http://www.cs.umn.edu/~karypis
(Hansen and Salamon 1990) Lars Kai Hansen and Peter Salamon, Neural Network Ensembles, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 10, pp. 993-1001, 1990, ISSN: 01628828 (Hartuv and Shamir 2000) Erez Hartuv and Ron Shamir, A Clustering Algorithm Based on Graph Connectivity, Information Processing Letters, vol. 76, no. 4-6, pp. 175-181, 2000, ISSN: 00200190 (Hassibi and Stork 1993) Babak Hassibi, David G.Stork, and S.J.Hanson (ed.), Second Order Derivatives for Network Pruning: Optimal Brain Surgeon, Advances in Neural Information Processing Systems Five, Nips Five, pp. 164-171, 1993, ISBN: 1558602747 (Hastie, Tibshirani and Friedman 2009) Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Third revised edition, Springer-Verlag New York Inc., New York, US, 2009, ISBN: 0387848576 (Haussler, Kearns and Schapire 1994) David Haussler, Michael Kearns, and Robert E.Schapire, Bounds on the Sample Complexity of Bayesian Learning Using Information Theory and the VC Dimension, Machine Learning, vol. 14, no. 1, pp. 83-113, 1994, ISSN: 08856125 (Hertz, Palmer and Krogh 1991) John Hertz, Richard G.Palmer, and Anders Krogh, Introduction to the Theory of Neural Computation Addison Wesley, New York, US, 1991, ISBN: 0201503956 (Highfield 1996) Roger Highfield, Frontiers of Complexity: The Search for Order in a Chaotic World Ballantine Books, 1996, ISBN: 0449910814 (Hinegardner and Engelberg-Kulka 1983) Ralph T.Hinegardner and Hanna Engelberg-Kulka, Biological Complexity, Journal of Theoretical Biology, vol. 104, pp. 7-20, 1983, ISSN: 00225193 (Ho 2000) Tin Kam Ho, J.Kittler (ed.), and F.Roli (ed.), Complexity of Classification Problems and Comparative Advantages of Combined Classifiers, Lecture Notes in Computer Science, Proceedings on the First International Workshop on Multiple Classifier Systems, McS 2000, June 2000, Cagliari, Italy, vol. 1857, pp. 97-106, 2000, ISBN: 3540677046
(Ho 2001) Tin Kam Ho, J.Kittler (ed.), and F.Roli (ed.), "Data Complexity Analysis for Classifier Combination ," Lecture Notes in Computer Science, Multiple Classifier Systems, Second International Workshop Proceedings, MCS 2001, July 2-4, 2001, Cambridge, UK, vol. 2096, pp. 53-56, 2001, ISBN: 3540422846
(Ho 2002) Tin Kam Ho, "A Data Complexity Analysis of Comparative Advantages of Decision Forest Constructors," Pattern Analysis and Applications, vol. 5, no. 2, pp. 102-112, 2002, ISSN: 1433754
(Ho and Baird 1994) Tin Kam Ho and Henry S.Baird, "Estimating the Intrinsic Difficulty of A Recognition Problem ," Proceedings of the 12th IAPR International, Conference on Pattern Recognition, Conference B: Computer Vision and Image Processing, vol. 2, pp. 178-183, 1994, ISBN: 0818662700
(Ho and Baird 1998) Tin Kam Ho and Henry S.Baird, Pattern Classification with Compact Distribution Maps, Computer Vision and Image Understanding, vol. 70, no. 1, pp. 101-110, 1998, ISSN: 10773142
(Ho and Basu 2002) Tin Kam Ho and Mitra Basu, "Complexity Measures of Supervised Classification Problems," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 3, pp. 289-300, 2002, ISSN: 01628828
(Ho, Hull and Stihari 1994) Tin Kam.Ho, Jonathan J.Hull, and Sargur N.Stihari, Decision Combination in Multiple Classifier Systems, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 1, pp. 66-75, 1994, ISSN: 01628828
(Hofstadter 1999) Douglas R.Hofstadter, Godel, Escher, Bach: An Eternal Golden Braid Basic Books, 1999, ISBN: 0465026567 Abstract: http://www.econ.iastate.edu/tesfatsi/hogan.complexperplex.htm
(Horgan 1995) John Horgan, From Complexity to Perplexity, Scientific American Magazine, pp. 74-79, 1995, ISSN: 00368733 (Hsieh and Fan 2001) Ing-Sheen Hsieh, Kuo-Chin Fan, and C.A.Bouman (ed.), Multiple Classifier for Color Flag and Trademark Image Retrieval, IEEE Transactions on Image Processing, vol. 10, no. 6, pp. 938-950, 2001, ISSN: 10577149
(IBM Corporation 1998) IBM Corporation, ZISC(R)036 Neurons User's Manual, 1998, Documents Number: IOZSCWBU-02. http://noel.feld.cvut.cz/vyu/scs/2001/ZISC/ziscwb.pdf (Jain, Murty and Flynn 1999) Anil Kumar Jain, M.Narasimha Murty, and Patrick Joseph Flynn, Data Clustering: A Review, ACM Computing Surveys (CSUR), vol. 31, no. 3, pp. 264-323, 1999, ISSN: 03600300 (Jelassi and Enders 2004) Tawfik Jelassi and Albrech Enders, Key Terminology and Evolution of e-Business, in Strategies for e-Business: Creating Value through Electronic and Mobile Commerce Financial Times Prentice Hall, 2004, pp. 10-11, ISBN: 0273688405
(Jesse, Liu, Smart and Brown 2008) Christopher Jesse, Honghai Liu, Edward Smart, David Brown, I.Lovrek (ed.), R.J.Howlett (ed.), and L.C.Jain (ed.), Analysing Flight Data Using Clustering Methods, Lecture Notes in Artificial Intelligence, Proceedings of the 12th international conference on Knowledge-Based Intelligent Information and Engineering Systems, Part I, vol. 5177, pp. 733-740, 2008, ISBN: 3540855645 (Joachims 1998) Thorsten Joachims, C.Nedellec (ed.), and C.Rouveirol (ed.), Text Categorization with Support Vector Machine: Learning with Many Relevant Features, Lecture Notes in Computer Science, Proceedings in Machine Learning, ECML 98, 10th European Conference on Machine Learning April 21-23, 1998, Chemnitz, Germany, vol. 1398, pp. 137-142, 1998, ISBN: 3540644172 (Johnson and Kargupta 2002) Erik L.Johnson, Hillol Kargupta, M.J.Zaki (ed.), and C-T.Ho (ed.), Collective, Hierarchical Clustering from Distributed, Heterogeneous Data, Lecture Notes In Computer Science, Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, vol. 1759, pp. 103-114, 2002, ISBN: 3540671943 (Jordan and Xu 1995) Michael I.Jordan, Lei Xu, S.Grossberg (ed.), K.Doya (ed.), and J.Taylor (ed.), Convergence Results for the EM Approach to Mixtures of Experts Architectures, Neural Networks, vol. 8, no. 9, pp. 1487-1489, 1995, ISSN: 08936080
(Josephson 2004) Brian D.Josephson, "How We Might be Able to Understand the Brain," Interjournal ICCS4, Proceedings of the Fourth Internatinal Conference on Complex Systems, May 16-21, 2004, Boston, US, 2004, ISSN: 10810625, Abstract ID: 5225, http://cogprints.org/3655/5/ICCS2004.links.html
(Juang and Katagiri 1992) Biing-Hwang Juang and Shigeru Katagiri, Discriminant Learning for Minimum Error Classification, IEEE Transactions on Signal Processing, vol. 40, no. 12, pp. 3043-3054, 1992, ISSN: 1053587X (Kadous 2002) Mohammed Waleed Kadous, Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series, A Thesis Submitted as a Requirement for the Degree of Doctor of Philosophy 2002, Order Number: AAI0806481 (Kamvar, Klein and Manning 2002) Sepandar D.Kamvar, Dan Klein, and Christopher D.Manning, Interpreting and Extending Classical Agglomerative Clustering Algorithms Using a Model-Based Approach, Proceedings of the Nineteenth International Conference on Machine Learning, pp. 283-290, 2002, ISBN: 1558608737 (Kantardzic 2002) Mehmed Kantardzic, Data Mining: Concepts, Models, Methods and Algorithms John Wiley and Sons Inc., US, 2002, ISBN: 0471228524 (Karypis, Han and Kumar 1999) George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar, CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling, IEEE Computer: Special Issue on Data Analysis and Mining, vol. 32, no. 8, pp. 68-75, 1999, ISSN: 00189162.
(Kauffman 1993) Stuart Alan A.Kauffman, The Origins of Order: Self-Organization and Selection in Evolution Oxford University Press Inc., US, 1993, ISBN: 0195079515 (Kawatani 1999) Takahiko Kawatani, Handwritten Kanji Recognition Using Combined Complementary Classifiers in a Cascade Arrangement, Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR 99), September 20-22, Bangalore, India, pp. 503-506, 1999, ISBN: 0780370449 (Kijsirikul and Chongkasemwongse 2001) Boonserm Kijsirikul and Kongsak Chongkasemwongse, Decision Tree Pruning Using Backpropagation Neural Networks, Proceedings of International Joint Conference on Neural Networks (IJCNN 01), July 15-19, 2001, Washington DC, US, vol. 3, pp. 1876-1880, 2001, ISBN: 0769503187 (Kimu ra and Shridhar 1991) Fumitaka Kimura and Malayappan Shridhar, Handwritten Numeral Recognition Based on Multiple Algorithms, Pattern Recognition, The Journal of the Pattern Recognition Society, vol. 24, no. 10, pp. 969-983, 1991, ISSN: 00313203 (Kimura, Takashina, Tsuruoka and Miyake 1987) Fumitaka Kimura, Kenji Takashina, Shinji Tsuruoka, and Yasuji Miyake, Modified Quadratic Discriminant Functions and the Application to Chinese Character Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 9, no. 1, pp. 149-153, 1987, ISSN: 01628828 (Kimura, Wakabayashi, Tsuruoka and Miyake 1991) Fumitaka Kimura, Tetsushi Wakabayashi, Shinji Tsuruoka, Yasuji Miyake, and C.Y.Suen (ed.), Improvement of Handwritten Japanese Character Recognition Using Weighted Direction Code Histogram, Pattern Recognition, The Journal of the Pattern Recognition Society, vol. 30, no. 8, pp. 1329-1337, 1997, ISSN: 00313203 (King 1967) Benjaman King, Step-wise Clustering Procedures, Journal of American Statistical Assosiation, vol. 62, no. 317, pp. 86-101, 1967, ISSN: 01621459 (Kittler, Hatef, Duin and Matas 1998) Josef Kittler, Mohamad Hatef, Robert P.W.Duin, and Jiri Matas, On Combining Classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226-239, 1998, ISSN: 01628828 (Kohn, Nakano and Silva 1996) Andre Fabio Kohn, Luis Gustavo Mendonca Nakano, Miguel Oliveira E.Silva, and C.Y.Suen (ed.), A Class Discriminability Measure Based on Feature Space Partitioning, Pattern Recognition, The Journal of the Pattern Recognition Society, vol. 29, no. 5, pp. 873-887, 1996, ISSN: 00313203 (Kohonen 1989) Teuvo Kohonen, Self-Organization and Associate Memory, Third edition, Springer-Verlag, Berlin, Germany, 1989, ISBN: 0387513876 (Koontz and Fukunaga 1972) Warren L.G.Koontz and Keinosuke Fukunaga, A Nonparametric Valley-Seeking Technique for Cluster Analysis, IEEE Transactions on Computers, vol. 21, no. 2, pp. 171-178, 1972, ISSN: 00189340 (Kressel 1999) Urlich Kressel, B.Scholkopf (ed.), C.J.C.Burges (ed.), and A.J.Smola (ed.), Pairwise Classification and Support Vector Machines, in Advances in Kernel
Methods: Support Vector Learning MIT Press, US, 1999, pp. 255-268, ISBN: 0262194163 (Krippendorff 1986) Klaus Krippendorff, Information Theory: Structural Models for Qualitative Data SAGE Publications Inc., 1986, ISBN: 0803921322 (Kundur, Hatzinakos and Leung 2000) Deepa Kundur, Dimitrios Hatzinakos, and Henry Leung, Robust Classification of Blurred Imagery, IEEE Transactions on Image Processing, vol. 9, no. 2, pp. 243-255, 2000, ISSN: 10577149 (Kupinski, Edwards, Giger and Metz 2001) Matthew A.Kupinski, Darrin C.Edwards, Maryellen L.Giger, and Charles E.Metz, Ideal Observer Approximation Using Bayesian Classification Neural Networks, IEEE Transactions on Medical Imaging, vol. 20, no. 9, pp. 886-899, 2001, ISSN: 02780062 (Lavrac, Flach and Todorovski 2002) Nada Lavrac, Peter Flach, Ljupco Todorovski, M.Bohanec (ed.), B.Kasek (ed.), N.Lavrac (ed.), and D.Mladenic (ed.), Rule Induction for Subgroup Discovery with CN2-SD, Proceedings of the ECML/PKDD'02 Workshop on Integration and Collaboration Aspects of Data Mining, Decision Support and Meta-Learning, pp. 77-87, 2002 (Lazarevic and Obradovic 2001) Aleksandar Lazarevic and Zoran Obradovic, Effective Pruning of Neural Network Classifier Ensembles, Proceedings of International Joint Conference on Neural Networks (IJCNN 01): Washington, DC, US, July 15-19, vol. 2, pp. 796-801, 2001, ISBN: 0780370449 (Le Bourgeois and Emptoz 1996) Frank Le Bourgeois and Hubert Emptoz, Pretopological Approach for Supervised Learning, Proceedings of the 13th International Conference on Pattern Recognition: August 25-29, 1996 (1996 International Conference on Pattern Recognition (13th IAPR), vol. 4, pp. 256-260, 1996, ISBN: 0818674725 (Le Cun and al. 1989) Yann Le Cun, Bernard E.Boser, John S.Denker, Donnie Henderson, Richard E.Howard, Wayne E.Hubbard, and Lawrence D.Jackel, Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, vol. 1, no. 4, pp. 541-551, 1989, ISSN: 08997667 (Le Cun, Bottou, Bengio and Haffner 1998) Yann Le Cun, Leon Bottou, Yoshua Bengio, and Patrick Haffner, Gradient-based Learning Applied to Document Recognition, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998, ISSN: 00189219
(Le Cun, Denker and Solla 1990) Yann Le Cun, John S.Denker, Sara A.Solla, and D.S.Touretzky (ed.), Optimal Brain Damage, Advances in Neural Information Processing Systems, vol. 2, pp. 598-605, 1990, ISBN: 1558601007
(Lemay 1999) Philippe Lemay, The Statistical Analysis of Dynamics and Complexity in Psychology: A Configural Approach, Thesis presented to the Faculty of Social and Political Sciences of the University of Lausanne to obtain the degree of Doctor in Psychology 1999, OCLC Number: 78333154, Abstract: http://tecfa.unige.ch/~lemay/thesis/THX-Doctorat/THX-Doctorat.html (Leondes 1998) Cornelius T.Leondes, Image Processing and Pattern Recognition (Neural Network Systems Techniques and Applications) Academic Press Inc., San Diego, California, US, 1998, ISBN: 0124438652 (Li, Zhang and Ogihara 2004) Tao Li, Chengliang Zhang, and Mitsunori Ogihara, A Comparative Study of Feature Selection and Multiclass Classification Methods for Tissue Classification Based on Gene Expression, Bioinformatics, vol. 20, no. 15, pp. 2429-2437, 2004, ISSN: 13674803 (Liao 2001) Yihua Liao, Neural Networks in Hardware: A Survey, Department of Computer Sciences, University of California, Davis, California, US, 2001, Project: ECS250A (Lin 1991) Jianhua Lin, Divergence Measures Based on the Shannon Entropy, IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145-151, 1991, ISSN: 00189448, Abstract: http://bit.csc.lsu.edu/~jianhua/shiv2.pdf (Lindblad and al. 1996) Thomas Lindblad, Clark S.Lindsey, Maxim Minerskjold, Givi Sekhniaidze, Geza Szekely, and A.J.Eide (ed.), The IBM ZISC036 Zero Instruction Set Computer, 1996
(Lindblad, Lindsey and Eide 2002) Thomas Lindblad, Clark S.Lindsey, and Age J.Eide, Radial Basis Function (RBF) Neural Networks, 2002
(Lofgren 1973) Lennart Lofgren, P.Suppes (ed.), L.Henkin (ed.), A.Joja (ed.), and Gr.C.Moisil (ed.), On the Formalization of Learning and Evolution, in Logic, Methodology and Philosophy of Science IV, Proceedings of the Fourth International Congress for Logic, Methodology and Philosophy of Science, Bucharest 1971 (Studies in Logic and the Foundations of Mathematics, 74) North-Holland Publishing Co., Amsterdam, The Netherlands, 1973, ASIN: B001DCBM74
(Lotte and al. 2007) Fabien Lotte, Marco Congedo, Anatole Lecuyer, Fabrice Lamarche, and Bruno Arnaldi, A Review of Classification Algorithms for EEG-based Brain-Computer Interfaces, Journal of Neural Engineering, vol. 4, no. 2, pp. 1-24, 2007, ISSN: 17412560 (Lu 1996) Yi Lu, Knowledge Integrations in a Multiple Classifier System, Applied Intelligence, vol. 6, no. 2, pp. 75-86, 1996, ISSN: 0924669X (Madani, Chebira and Mercier 1997) Kurosh Madani, Abdennasser Chebira, Gilles Mercier, J.Mira (ed.), R.Moreno-Diaz (ed.), and J.Cabestany Moncusi (ed.), Multi-Neural Networks Hardware and Software Architecture: Application of the Divide to Simplify Paradigm DTS, Lecture Notes in Computer Science / Biological and Artificial Computation: From Neuroscience to Technology : International Work-Conference on Artificial and Natural Neural Networks, IWANN'97, 3 June 1997, Lanzarote, Canary Islands, Spain, vol. 1240, pp. 841-850, 1997. ISBN: 3540630473
(Madani and Chebira 2000) Kurosh Madani, Abdennasser Chebira, D.A.Zighed (ed.), J.Komorowski (ed.), and J.Zytkow (ed.), A Data Analysis Approach Based on a Neural Networks Data Sets Decomposition and its Hardware Implementation, Lecture Notes in Computer Science / Lecture Notes in Artificial Intelligence, Principles of Data Mining and Knowledge Discovery, 4th European Conference, PKDD 2000, September 13-16, 2000, Lyon, France, Proceedings Workshop 1, Advances in Data Mining, vol. 1910, 2000, ISBN: 354041066X Abstract: http://eric.univ-lyon2.fr/~pkdd2000/Download/#WS1
(Lucas 2000) Chris Lucas, Quantifying Complexity Theory, CALResCo Group, 2000, http://www.calresco.org/lucas/quantify.htm (Madani and Berechet 2001) Kurosh Madani, Ion Berechet, J.Mira (ed.), and A.Prieto (ed.), Inaccessible Parameters Monitoring in Industrial Environment: A Neural Based Approach, Lecture Notes in Computer Science, Bio-Inspired Applications of Connectionism: Proceedings of the 6th International Work-Conference on Artificial and Natural Neural Networks, IWANN 2001, June 13-15, 2001, Granada, Spain, vol. 2085, pp. 619-627, 2001, ISBN: 3540422358 (Madani, De Tremiolles and Tannhof 2001) Kurosh Madani, Ghislain Imbert De Tremiolles, Pascal Tannhof, J.Mira (ed.), and A.Prieto (ed.), ZISC-036 Neuro-processor Based Image Processing, Lecture Notes in Computer Science, Bio-Inspired Applications of Connectionism: Proceedings of the 6th International Work-Conference on Artificial and Natural Neural Networks, IWANN 2001, June 13-15, 2001, Granada, Spain, vol. 2085, pp. 200-207, 2001, ISBN: 3540422358 (Madani, Rybnik and Chebira 2003) Kurosh Madani, Mariusz Rybnik, Abdennasser Chebira, J.Mira (ed.), and J.R.Alvarez (ed.), Data Driven Multiple Neural Network Models Generator Based on a Tree-like Scheduler, Lecture Notes in Computer Science, Computational Methods in Neural Modeling, 7th International Work Conference on Artificial and Natural Neural Networks, IWANN 2003, June 3-6, Mao, Menorca, Spain, Proceedings, Part 1, vol. 2686, pp. 382-389, 2003, ISBN: 3540402101 (Maier and Rechtin 2000) Mark W.Maier and Eberhardt Rechtin, The Art of Systems Architecting, Second revised edition, CRC Press Inc., US, 2000, ISBN: 0849304407
(Makal, Ozyilmaz and Palavaroglu 2008) Senem Makal, Lale Ozyilmaz, and Senih Palavaroglu, "Neural Network Based Determination of Splice Junctions by ROC Analysis," Proceedings of World Academy of Science, Engineering and Technology, vol. 33, pp. 630-632, 2008, ISSN: 20703740, www.waset.org/journals/waset/v43/v43-112.pdf
(Malousi and al. 2008) Andigoni Malousi, Ioanna Chouvarda, Vassilis Koutkias, Sofia Kouidou, and Nicos Maglaveras, Variable-length Positional Modeling for Biological Sequence Classification, AMIA Annual Symposium proceedings, pp. 91-95, 2008, ISSN: 1942597X (Manning, Raghavan and Schultze 2008) Christopher D.Manning, Prabhakar Raghavan, and Hinrich Schutze, Support Vector Machines and Machine Learning on Documents, in Introduction to Information Retrieval Cambridge University Press, UK, 2008, ISBN: 0521865719
(Maren 1990) Alianna J.Maren, Handbook of Neural Computing Applications Academic Press, Inc., 1990, ISBN: 0125460902 (Markovitch and Rosenstein 2002) Shaul Markovitch and Dan Rosenstein, Feature Generation Using General Constructor Functions, Machine Learning, vol. 49, no. 1, pp. 59-98, 2002, ISSN: 08856125 (Matusita and Akaike 1956) Kameo Matusita and Hirotugu Akaike, Decision Rules, Based on the Distance, For the Problems of Independence, Invariance and Two Samples, Annals of the Institute of Statistical Mathematics, vol. 7, no. 2, pp. 67-80, 1956, ISSN: 00203157 (McQueen 1967) James B.MacQueen, L.M.Le Cam (ed.), and J.Neyman (ed.), Some Methods for Classification and Analysis of Multivariate Observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, June 21-July 18, 1965 and December 27, 1965 - January 7, 1966, vol. 1, pp. 281-297, 1967, Library of Congress Catalog Card Number: 498189
(Mendez 2009) Anne Menendez, Disruptive Parallel Neural Network Chip Ready to Compete With DSPs for Pattern Recognition, 2009, CM1K Technical Brief, Rev 05-09, http://general-vision.com/White%20Papers/WP_CM1K_disruptive%20performance%20for%20DSP.pdf
(Michalski and Stepp 1987) Ryszard S.Michalski, Robert E.Stepp, S.C.Shapiro (ed.), D.Eckroth (ed.), and G.A.Vallasi (ed.), "Clustering," in Encyclopedia of Artificial Intelligence: A-N Vol. 1 John Wiley and Sons Inc., 1987, pp. 103-111, ISBN: 047162974X
(Micheli-Tzanakou 1999) Evangelia Micheli-Tzanakou, Supervised and Unsupervised Pattern Recognition: Feature Extraction and Computational Intelligence (Industrial Electronics Series), First edition, CRC Press Inc., Boca Raton, US, 1999, ISBN: 0849322782 (Mielke and Roubicek 2003) Alexander Mielke and Tomas Roubicek, A Rate-Independent Model for Inelastic Behavior of Shape-Memory Alloys, Multiscale Modeling and Simulation: A SIAM Interdisciplinary Journal, vol. 1, no. 4, pp. 571-597, 2003, ISSN: 15403459 (Mikulecky 2007) Donald C.Mikulecky, C.Gershenson (ed.), D.Aerts (ed.), and B.Edmonds (ed.), Complexity Science as an Aspect of the Complexity of Science, in Worldviews, Science and Us: Philosophy and Complexity World Scientific Publishing Co. Pte. Ltd., 2007, pp. 30-53, ISBN: 9812705481 (Mitchell 1997) Tom M.Mitchell, Machine Learning McGraw Hill Higher Education, 1997, ISBN: 0070428077 (Mitchell, Anderson, Carbonel and Michalski 1986) Tom M.Mitchell, John R.Anderson, Jaime G.Carbonell, and Ryszard Stanislaw Michalski, Machine Learning: An Artificial
Intelligence Approach Morgan Kaufmann Publishers Inc., US, 1986, ASIN: B000FO7JKK (Moffat 2003) James Moffat, Complexity Theory and Network Centric Warfare CForty-OneSR Cooperative Research, 2003, ISBN: 1893723119 (Moses 2002) Joel Moses, Ideas on Complexity in Systems - Twenty Views, Complexity and Flexibility (Working Paper), 2002, Massachusetts Institute of Technology Engineering Systems Division Working Paper Series ESD-WP-2000-02 (Murray-Smith and Johansen 1997) Roderick Murray-Smith and Tor Arne Johansen, Multiple Model Approaches to Modelling and Control Taylor and Francis Ltd., London, UK, 1997, ISBN: 074840595X (Novikoff 1963) Albert B.J.Novikoff, On Convergence Proofs for Perceptrons, Proceedings of the Symposium on Mathematical Theory of Automata: New York, N. Y., April 24, 25, 26, 1962, vol. 12, pp. 615-622, 1963, ASIN: B000GVXN4I (Opitz and Maclin 1999) David Opitz and Richard Maclin, Popular Ensemble Methods: An Empirical Study, Journal of Artificial Intelligence Research, vol. 11, pp. 169-198, 1999, ISSN: 10769757 (Osuna, Freund, and Girosi 1997) Edgar Osuna, Robert Freund, and Federico Girosi, Training Support Vector Machines: An Application to Face Detection, Computer Vision and Pattern Recognition: Conference Proceedings, CVPR 97 (Proceedings on IEEE Computer Society Conference on Computer Vision and Pattern Recognition), pp. 130-136, 1997, ISBN: 0818678224 (Park and Sandberg 1991) Jooyoung Park, Irwin W.Sandberg, and T.J.Sejnowski (ed.), Universal Approximation Using Radial-Basis-Function Networks, Neural Computation, vol. 3, no. 2, pp. 246-257, 1991, ISSN: 08997667 (Parvin, Alizadeh and Minaei-Bidgoli 2009) Hamid Parvin, Hosein Alizadeh, Behrouz Minaei-Bidgoli, and F.I.S.Ko (ed.), Using Clustering for Generating Diversity in Classifiers Ensemble, International Journal of Digital Content Technology and its Applications, vol. 3, no. 1, pp. 51-57, 2009, ISSN: 19759339 (Parzen 1962) Emanuel Parzen, On Estimation of a Probability Density Function and Mode, The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 1065-1076, 1962, ISSN: 00034851 (Pattichis C., Pattichis M., and Micheli-Tzanakou 2002) Constantinos S.Pattichis, Marios S.Pattichis, and Evangelia Micheli-Tzanakou, Medical Imaging Fusion Applications: An Overview, Conference Record of the Thirty-Fifth Asilomar Conference on Signals, Systems and Computers, vol. 9, no. 2, pp. 243-255, 2001, ISBN: 078037147X (Permana 2003) Sidik Permana, Towards the Complexity of Science, Journal of Social Complexity, vol. 1, no. 1, pp. 1-6, 2003, ISSN: 18296041, Abstract: http://josc.bandungfe.net/josc1/spam1ft.pdf
(Pierson 1998) William Edward Pierson, Using Boundary Methods for Estimating Class Separability, Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of Ohio State University 1998, ISBN: 0591977974 (Plato 1997) Plato, J.M.Cooper (ed.), and D.C.Hutchinson (ed.), Plato Complete Works Hackett Publishing Company Inc., 1997, ISBN: 0872203492 (Pomerening, Sontag and Ferrell 2003) Joseph R.Pomerening, Eduardo D.Sontag, and James E.Ferrell Jr, Building a Cell Cycle Oscillator: Hysteresis and Bistability in the Activation of Cdc2, Nature Cell Biology, vol. 5, no. 4, pp. 346-351, 2003, ISSN: 14657392 (Popper 2002) Karl R.Popper, The Logic of Scientific Discovery Routledge, 2002, ISBN: 0415278449 (Portnoy, Bellaachia, Chen and Elkhahloun 2002) David Portnoy, Abdelhani Bellaachia, Yidong Chen, Abdel G.Elkhahloun, M.J.Zaki (ed.), J.T-L.Wang (ed.), and H.Toivonen (ed.), E-CAST: A Data Mining Algorithm for Gene Expression Data, KDD-2002, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Data Mining in Bioinformatics (BIOKDD 2002), July 23-26, 2002, Edmonton, Alberta, Canada, pp. 49-54, 2002, ISBN: 158113567X (Prampero and Carvalho 1998) Paulo S.Prampero and Andre DeCarvalho, Recognition of Vehicles Silhouette using Combination of Classifiers, The 1998 IEEE International Joint Conference on Neural Network Proceedings, IEEE World Congress on Computational Intelligence, May 4-May 9, Anchorage, Alaska, US, pp. 1723-1726, 1998, ISBN: 0780348591 (Prigogine 1980) Ilya Prigogine, From Being to Becoming: Time and Complexity in the Physical Sciences W.H.Freeman and Company Ltd., New York, US, 1980, ISBN: 0716711079 (Prigogine 2003) Ilya Prigogine, Time in Non-equilibrium Physics, in Is Future Given? World Scientific Publishing Co. Pte. Ltd., 2003, pp. 44-54, ISBN: 9812385088 (Rao, Chand and Murthy 2005) Venu Gopala K.Rao, Prem P.Chand, and Ramana M.V.Murthy, Soft Computing-Neural Networks Ensembles, Journal of Theoretical and Applied Information Technology, vol. 3, no. 4, pp. 45-50, 2005, ISSN: 19928645 (Rao and Yadaiah 2005) Sree Hari V.Rao, Narri Yadaiah, and M.S.El Naschie (ed.), Parameter Identification of Dynamical Systems, Chaos, Solitons and Fractals, vol. 23, no. 4, pp. 1137-1151, 2005, ISSN: 09600779 (Ravindranathan and Leitch 1999) Mohan Ravindranathan, Roy Leitch, and M.S.El Naschie (ed.), "Model Switching in Intelligent Control Systems," Artificial Intelligence in Engineering, vol. 13, no. 2, pp. 175-187, 1999, ISSN: 09541810
(Renyi 1960) Alfred Renyi and J.Neyman (ed.), On Measures of Information and Entropy, Proceedings of the 4th Berkeley Symposium on Mathematics, Statistics and Probability, vol. 3, pp. 547-561, 1960, ISSN: 00970433 (Richards and Xiuping 2005) John Alan Richards and Jia Xiuping, Remote Sensing Digital Image Analysis: An Introduction, Fourth edition, Springer-Verlag Berlin and Heidelberg GmbH. and Co. K., Germany, 2005, ISBN: 3540251286 (Robertson and Seymour 1984) Neil Robertson and Paul D.Seymour, Graph minors. III: Planar Tree-width, Journal of Combinatorial Theory, Series B, vol. 36, pp. 49-64, 1984, ISSN: 00958956 (Rosenblatt 1961) Frank Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (Cornell Aeronautical Laboratory, Report No. VG-1196-G-8), Spartan Books, Washington DC, US, 1961. (Rossberg 2004) Axel G.Rossberg, "A Generic Scheme for Choosing Models and Characterizations of Complex Systems," Interjournal ICCS4, Proceedings of the Fourth International Conference on Complex Systems, May 16-21, 2004, Boston, US, 2004, ISSN: 10810625, Abstract No. 71, http://arxiv.org/abs/physics/0308018
(Roy 2000) Asim Roy, "Artificial Neural Networks: A Science in Trouble," Association for Computing Machinery, SIGKDD Explorations Newsletter, vol. 1, no. 2, pp. 33-38, 2000, ISSN: 19310145
(Rubin and Trajkovic 2001) Stuart H.Rubin and Ljiljana Trajkovic, On the Role of Randomization in Minimizing Neural Entropy, Invited paper in Proceedings of the Fifth Multi-Conference on Systemics, Cybernetics, and Informatics (SCI 2001), July 22-25, 2001, Orlando, Florida, US, 2001. (Ruck and al 1990) Dennis W.Ruck, Steven K.Rogers, Matthew Kabrisky, Mark E.Oxley, and Bruce W.Suter, The Multilayer Perceptron as an Approximation to a Bayes Optimal Discriminant Function, IEEE Transactions on Neural Networks, vol. 1, no. 4, pp. 296-298, 1990, ISSN: 10459227
(Ryabko 2006) Daniil Ryabko, "Pattern Recognition for Conditionally Independent Data," The Journal of Machine Learning Research, vol. 7, pp. 645-664, 2006, ISSN: 15324435
(Rybnik 2004) Mariusz Rybnik, Contribution to the Modeling and the Exploitation of Hybrid Multiple Neural Networks Systems: Application to Intelligent Processing of Information, Thesis presented to obtain the degree of Doctor of Philosophy of University Paris XII 2004.
(Saakian 2004) David B.Saakian, "Error Threshold in Optimal Coding, Numerical Criteria, and Classes of Universalities for Complexity," Physical Review E, Statistical, Nonlinear, and Soft Matter Physics, vol. 71(2), no. 1, p. 016126.1-016126.12, 2004, ISSN: 15393755, http://arxiv.org/abs/cond-mat/0409107
(Saglam, Yazgan and Ersoy 2003) Mehmet I.Saglam, Bingul Yazgan, and Okan K.Ersoy, Classification of Satellite Images by using Self-organizing map and Linear Support
Vector Machine Decision Tree, (c) GISdevelopment.net, Kuala Lumpur, Malaysia, 2003, http://www.gisdevelopment.net/technology/ip/ma03120abs.htm
(Sancho and al. 1997) Jose Luis Sancho, Batu Ulug, William Pierson, Anibal R.Figueiras-Vidal, Stanley C.Ahalt, D.Do Campo (ed.), A.R.Figueiras-Vidal (ed.), and F.Perez-Gonzalez (ed.), Boundary Methods for Distribution Analysis, in Intelligent Methods in Signal Processing and Communications Birkhauser Boston Inc., Cambridge, Massachusetts, US, 1997, pp. 173-197, ISBN: 0817639608 (Sarkar and Leong 2001) Manish Sarkar and Tze-Yun Leong, Splice Junction Classification Problems for DNA Sequences: Representation Issues, 2001 Conference Proceedings of the 23rd Annual International IEEE Engineering in Medicine and Biology Society, 25-28 October 2001, Istanbul, Turkey, vol. 3, pp. 2895-2898, 2001, ISBN: 0780372115 (Sarlashkar, Bodruzzaman and Malkani 1998) Avinash N.Sarlashkar, Mohammad Bodruzzaman, and Mohan J.Malkani, Feature Extraction Using Wavelet Transform For Neural Network Based Image Classification, Systems Theory 1998, Proceedings of the Thirtieth Southeastern Symposium on System Theory, March 8-10, 1998, West Virginia University, Morgantown, US, pp. 412-416, 1998, ISBN: 0780345479 (Sato and Yamada 1996) Atsushi Sato, Keiji Yamada, D.S.Touretzky (ed.), M.C.Mozer (ed.), and M.E.Hasselmo (ed.), Generalized Learning Vector Quantization, Advances in Neural Information Processing Systems 8: Proceedings of the 1995 Conference, vol. 8-9, pp. 423-429, 1996, ISBN: 0262201070 (Scott and Markovitch 1989) Paul D.Scott, Shaul Markovitch, and A.M.Segre (ed.), Uncertainty Based Selection of Learning Experiences, Proceedings of the Sixth International Workshop on Machine Learning, June 26-27, 1989, Cornell University, Ithaca, New York, US, pp. 358-361, 1989, ISBN: 1558600361. (Senge 2006) Peter M.Senge, The Fifth Discipline: The Art & Practice of The Learning Organization, Broadway Business, US, 2006, ISBN: 0385517254 (Senior 2001) Andrew Senior, A Combination Fingerprint Classifier, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 10, pp. 1165-1174, 2001, ISSN: 01628828 (Serfling 2002) Robert J.Serfling and Y.Dodge (ed.), A Depth Function and a Scale Curve Based on Spatial Quantiles, in Statistical Data Analysis Based on the L1-Norm and Related Methods Birkhauser Verlag AG, Germany, 2002, pp. 25-38, ISBN: 3764369205
(Shalizi 2005) Cosma Rohilla Shalizi, Information Theory, 2005, Cosma Rohilla Shalizi's Web Notebook. http://www.cscs.umich.edu/~crshalizi/notebooks/ (Shalizi 2007) Cosma Rohilla Shalizi, Complexity, Complexity Measures, 2007, Cosma Rohilla Shalizi's Web Notebook. http://www.cscs.umich.edu/~crshalizi/notebooks/
(Shamir and Sharan 2002) Ron Shamir, Roded Sharan, T.Jiang (ed.), Y.Xu (ed.), and M.Q.Zhang (ed.), Algorithmic Approaches to Clustering Gene Expression Data, in Current Topics in Computational Molecular Biology MIT Press, US, 2002, pp. 269-300, ISBN: 0262100924 (Sharan and Shamir 2000) Roded Sharan, Ron Shamir, R.Altman (ed.), T.L.Bailey (ed.), P.Bourne (ed.), T.Lengauer (ed.), I.N.Shindyalov (ed.), L.F.T.Eyck (ed.), and H.Weissig (ed.), CLICK: A Clustering Algorithm with Application to Gene Expression Analysis, Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), August 16-23, 2000, La Jolla, California, US, pp. 307-316, 2000, ISBN: 1577351150 (Shen and Castan 1999) Jun Shen, Serge Castan, T.Kohonen (ed.), K.Makisara (ed.), O.Simula (ed.), and J.Kangas (ed.), Image Thinning by Neural Networks, Artificial Neural Networks: Proceedings of the 1991 International Conference on Artificial Neural Networks, ICANN'91, June 24-28, Espoo, Finland, vol. 1, pp. 841-846, 1991, ISBN: 0444891781 (Shi, Shu and Liu 1998) Daming Shi, Wenhao Shu, and Haitao Liu, Feature Selection for Handwritten Chinese Character Recognition Based on Genetic Algorithms, 1998 IEEE International Conference on Systems, Man, and Cybernetics, San Diego, US, vol. 5, pp. 4201-4206, 1998, ISBN: 0780347781 (Shipp and Kuncheva 2001) Catherine A.Shipp and Ludmila I.Kuncheva, Four Measures of Data Complexity for Bootstrapping, Splitting and Feature Sampling, Proceedings International ICSC Congress on Computational Intelligence: Methods and Applications (CIMA'2001), June 19-20, 2001, Bangor, Wales, United Kingdom, pp. 429-435, 2001, ISBN: 3906454266 (Sima and Orponen 2003) Jiri Sima and Pekka Orponen, General-Purpose Computation with Neural Networks: A Survey of Complexity Theoretic Results, Neural Computation, vol. 15, no. 12, pp. 2727-2778, 2003, ISSN: 08997667 (Simard, Saatchi and De Grandi 2000) Marc Simard, Sasan S.Saatchi, and Gianfranco De Grandi, The Use of Decision Tree and Multiscale Texture for Classification of JERS-1 SAR Data over Tropical Forest, IEEE Transactions on Geoscience and Remote Sensing, vol. 38, no. 5, pp. 2310-2321, 2000, ISSN: 01962892
(Singh 2002) Sameer Singh, "Estimating Classification Complexity," CiteSeerX, 2002, pp. 1-41, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.7109&rep=rep1&type=pdf
(Singh 2003) Sameer Singh, PRISM - A Novel Framework for Pattern Recognition, Pattern Analysis and Applications, vol. 6, no. 2, pp. 134-149, 2003, ISSN: 14337541 (Singh 2003[2]) Sameer Singh, "Multi-Resolution Estimates of Classification Complexity," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1534-1539, 2003, ISSN: 01628828
(Sipser 2005) Michael Sipser, Introduction to the Theory of Computation, Second edition, Brooks/Cole, 2005, ISBN: 0534950973 (Sneath and Sokal 1973) Peter H.A.Sneath and Robert R.Sokal, Numerical Taxonomy: The Principles and Practice of Numerical Classification W.H.Freeman and Co. Ltd., 1973, ISBN: 0716706970 (Srivastava, Han, Kumar and Singh 1999) Anurag Srivastava, Eui-Hong Han, Vipin Kumar, and Vineet Singh, Parallel Formulations of Decision-Tree Classification Algorithms, Data Mining and Knowledge Discovery, vol. 3, no. 3, pp. 237-244, 1999, ISSN: 13845810 (Statnikov and al. 2005) Alexander Statnikov, Constantin F.Aliferis, Ioannis Tsamardinos, Douglas Hardin, and Shawn Levy, A Comprehensive Evaluation of Multicategory Classification Methods for Microarray Gene Expression Cancer Diagnosis, Bioinformatics, vol. 21, no. 5, pp. 631-643, 2005, ISSN: 13674803 (Stork, Duda and Hart 2001) David G.Stork, Richard O.Duda, and Peter E.Hart, Pattern Classification, Second edition, John Wiley and Sons Inc., New York, US, 2001, ISBN: 9755031030 (Sung and Niyogi 1995) Kah Kay Sung, Partha Niyogi, G.Tesauro (ed.), D.S.Touretzky (ed.), and T.K.Leen (ed.), Active Learning for Function Approximation, Advances in Neural Information Processing Systems 7: Proceedings of the 1994 Conference November 28-December 1, 1994, Denver, Colorado, US, vol. 7, pp. 593-600, 1995, ISBN: 0262201046 (Sussman 2000) Joseph M.Sussman, Introduction to Transportation Systems Artech House Publishers, 2000, ISBN: 1580531415 (Sussman 2002) Joseph M.Sussman, The New Transportation Faculty: The Evolution to Engineering Systems (paper). Introduction to Transportation Systems (book), 2002, Massachusetts Institute of Technology Engineering Systems Division Working Paper Series ESD-WP-2000-02 Ideas on Complexity in Systems - Twenty Views (Swain and King 1973) Philip H.Swain and Roger C.King, Two Effective Feature Selection Criteria for Multispectral Remote Sensing, The First International Joint Conference on Pattern Recognition, October 30 - November 1, 1973, Washington DC, US, pp. 536-540, 1973, ASIN: B000KIKTIA (Tan, Steinbach and Kumar 2005) Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining Addison Wesley, US, 2005, ISBN: 0321321367
(Tresp 2001) Volker Tresp, Y.H.Hu (ed.), and J-N Hwang (ed.), "Committee Machines," in Handbook of Neural Network Signal Processing CRC Press Inc., 2001, pp. 5-1-5-21, ISBN: 0849323592
(Theodoridis and Koutroumbas 2006) Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, Third edition, Academic Press Inc., 2006, ISBN: 0123695317
(Therrien 1989) Charles W.Therrien, Decision, Estimation and Classification John Wiley and Sons Inc., US, 1989, ISBN: 0471504165 (Thrun, Faloutsos, Mitchell and Wasserman 1999) Sebastian Thrun, Christos Faloutsos, Tom M.Mitchell, and Larry Wasserman, Automated Learning and Discovery: State-of-the-art and Research Topics in a Rapidly Growing Field, AI Magazine (An Official Publication of the American Association for Artificial Intelligence), vol. 20, no. 3, pp. 78-82, 1999, ISSN: 07384602 (Tu and Chung 1992) Pei-Lei Tu and Jen-Yao Chung, A New Decision-Tree Classification Algorithm for Machine Learning, Proceedings of the Fourth International Conference on Tools for Artificial Intelligence (TAI 92), pp. 370-377, 1992, ISBN: 0818629053
(Tumer and Ghosh 1995) Kagan Tumer and Joydeep Ghosh, "Classifier Combining: Analytical Results and Implications," Proceedings of the AAAI-96 Workshop on Integrating Multiple Learned Models for Improving and Scaling Machine Learning Algorithms at the 13th National Conference on Artificial Intelligence, August 1996, Portland, US, pp. 126-132, 1995, ISSN: 10821089
(Tumer and Ghosh 2003) Kagan Tumer and Joydeep Ghosh, "Bayes Error Rate Estimation Using Classifier Ensembles," International Journal of Smart Engineering System Design, vol. 5, no. 2, pp. 95-105, 2003, ISSN: 10255818 (Ueda 2000) Naonori Ueda, Optimal Linear Combination of Neural Networks for Improving Classification Performance, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 2, pp. 207-215, 2000, ISSN: 01628828 (Vapnik 1992) Vladimir N.Vapnik, J.E.Moody (ed.), S.J.Hanson (ed.), and R.P.Lippmann (ed.), Principles of Risk Minimization for Learning Theory, Advances in Neural Information Processing Systems, vol. 4, pp. 831-838, 1992, ISBN: 1558602224 (Vapnik 1998) Vladimir N.Vapnik, Statistical Learning Theory John Wiley and Sons Inc., New York, US, 1998, ISBN: 0471030031 (Voiry and al. 2007) Matthieu Voiry, Kurosh Madani, Veronique Amarger, Joel Bernier, F.Sandoval (ed.), A.Prieto (ed.), J.Cabestany (ed.), and M.Grana (ed.), Optical Devices Diagnosis by Neural Classifier Exploiting Invariant Data Representation and Dimensionality Reduction Ability, Lecture Notes in Computer Science / Theoretical Computer Science and General Issues, Computational and Ambient Intelligence: 9th International Work-conference on Artificial Neural Networks, IWANN 2007, June 20-22, 2007, San Sebastian, Spain, Proceedings, vol. 4507, pp. 1098-1105, 2007, ISBN: 3540730060 (Wang, Neskovic and Cooper 2006) Jigang Wang, Predrag Neskovic, and Leon N.Cooper, Learning Class Regions by Sphere Covering, 2006, IBNS Technical Report 2006-02, Department of Physics and Institute for Brain and Neural Systems, Brown University, Providence, RI 02912, supported under Grants DAAD19-01-1-0754, W911NF-04-1-0357
(Ward 1963) Joe H.Ward and C.Hildreth (ed.), Hierarchical Grouping to Optimize an Objective Function, Journal of the American Statistical Association, vol. 58, no. 301, pp. 236-244, 1963, ISSN: 01621459 (West 2008) John B.West, Respiratory Physiology: The Essentials, Eighth revised edition, Lippincott Williams and Wilkins, 2008, ISBN: 0781772060 (Wimsatt 1974) William C.Wimsatt, K.F.Schaffner (ed.), and R.S.Cohen (ed.), Complexity and Organization, in Proceedings of the 1972 Biennial Meeting of the Philosophy of Science Association D. Reidel Publishing Company, Dordrecht, The Netherlands, 1974, pp. 67-86, ISBN: 9027704090 (Wolfram 1994) Stephen Wolfram, Cellular Automata and Complexity: Collected Papers Perseus Books, 1994, ISBN: 0201626640 (Xu and Wunsch 2008) Rui Xu and Donald C.Wunsch II, "Recent Advances in Cluster Analysis," International Journal of Intelligent Computing and Cybernetics, vol. 1, no. 4, pp. 484-508, 2008, ISSN: 1756378X (Xu, Krzyzak and Suen 1992) Lei Xu, Adam Krzyzak, and Ching Y.Suen, Methods For Combining Multiple Classifiers And Their Applications To Handwriting Recognition, IEEE Transactions on Systems, Man, and Cybernetics, vol. 22, no. 3, pp. 418-435, 1992, ISSN: 00189472 (Yang, Parekh and Honava 1999) Jihoon Yang, Rajesh Parekh, and Vasant Honavar, DistAI: An Inter-pattern Distance-based Constructive Learning Algorithm, Intelligent Data Analysis, vol. 3, no. 1, pp. 55-73, 1999, ISSN: 1088467X (Young and Fu 1986) Tzay Y.Young and King-Sun Fu, Handbook of Pattern Recognition and Image Processing Academic Press Inc., Orlando, Florida, US, 1986, ISBN: 0127745602 (Zeng and Starzyk 2001) Yujing Zeng and Janusz Starzyk, Statistical Approach to Clustering In Pattern Recognition, System Theory, Proceedings of the 33rd Southeastern Symposium, pp. 177-181, 2001, ISBN: 0780366611 (Zhang 2000) Guoqiang P.Zhang, Neural Networks for Classification: A Survey, IEEE Transactions on Systems, Man and Cybernetics, Part C: Applications and Reviews, vol. 30, no. 4, pp. 451-462, 2000, ISSN: 10946977 (Zhang 2006) Aidong Zhang, Advanced Analysis of Gene Expression Microarray Data, First edition, World Scientific Publishing Co. Pte. Ltd., 2006, ISBN: 9812566457 (Zhang, Chen and Kot 2000) Ping Zhang, Lihui Chen, and Alex C.Kot, A Novel Hybrid Classifier for Recognition of Handwritten Numerals, IEEE International Conference on Systems, Man and Cybernetics 2000, October 8-11, 2000, Sheraton Music City Hotel, Nashville, Tennessee, US, vol. 4, pp. 2709-2714, 2000, ISBN: 0780365836
(Zhang, Ramakrishnan and Livny 1996) Tian Zhang, Raghu Ramakrishnan, Miron Livny, H.V.Jagadish (ed.), and I.S.Mumick (ed.), BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, June 4-6, 1996, Montreal, Quebec, Canada, pp. 103-114, 1996, ISBN: 0897917944 (Zhao and Wu 1999) Mingsheng Zhao and Youshou Wu, Classification Complexity and Its Estimation Algorithm for Two-class Classification Problem, 1999 IEEE International Joint Conference on Neural Networks, vol. 3, pp. 1631-1634, 1999, ISBN: 0780355296
(Zhigulin 2004) Valentin P.Zhigulin, "Dynamical Motifs: Building Blocks of Complex Network Dynamics," Interjournal ICCS4, Proceedings of the Fourth International Conference on Complex Systems, May 16-21, 2004, Boston, US, 2004, ISSN: 10810625, http://arxiv.org/abs/cond-mat/0311330
(Zhou 1999) Weiyang Zhou and C.Ruf (ed.), Verification of The Nonparametric Characteristics of Backpropagation Neural Networks for Image Classification, IEEE Transactions on Geoscience and Remote Sensing, vol. 37, no. 2, pp. 771-779, 1999, ISSN: 01962892