
UNIVERSITY OF ZAGREB FACULTY OF ELECTRICAL ENGINEERING AND COMPUTING

GRADUATE THESIS No. 1599

Comparison of different dimensionality reduction methods for information retrieval and text mining

Goran Jovanov

Zagreb, July, 2006.


Sincere and honorable thanks to my mentor, prof. dr. sc. Bojana Dalbelo Bašić, for her professional guidance and advice throughout my graduate thesis and for her contribution to my personal and professional development.

I also want to thank mr. sc. Jasminka Dobša for her assistance with the implementation of the fuzzy k-means clustering algorithm.


CONTENTS

1. INTRODUCTION
2. INFORMATION RETRIEVAL
   2.1. VECTOR SPACE MODEL
   2.2. VECTOR SPACE MODEL EXAMPLES
   2.3. EVALUATION
   2.4. EXAMPLE OF DOCUMENTS AND QUERIES
3. CLUSTERING
4. DIMENSIONALITY REDUCTION
   4.1. LATENT SEMANTIC INDEXING
   4.2. CONCEPT INDEXING
      4.2.1. SPHERICAL K-MEANS
      4.2.2. FUZZY K-MEANS
   4.3. LSI AND CI COMPARISON EXAMPLE
   4.4. LSI FOLDING-IN ALGORITHM
5. SYSTEM IMPLEMENTATION
6. EXPERIMENTS AND RESULTS
   6.1. DECOMPOSITION ERROR
   6.2. ORTHONORMALITY OF CONCEPT VECTORS
   6.3. CI PERFORMANCES
   6.4. INFORMATION RETRIEVAL EVALUATION
      6.4.1. WITHOUT FOLDING-IN DOCUMENTS
      6.4.2. WITH FOLDING-IN DOCUMENTS
7. RELATED WORKS
8. CONCLUSIONS
9. REFERENCES
TABLES INDEX
FIGURES INDEX


1. INTRODUCTION

Large collections of documents are becoming increasingly common. The public

Internet currently has more than 1.5 billion web pages, while private intranets also

contain an abundance of text data. A vast amount of important scientific data

appears as technical abstracts and papers. Given such large document collections it

is important to organize them into structured ontologies. This organization facilitates

navigation and search, and at the same time provides a framework for continual

maintenance as document repositories grow in size.

Manual construction of structured ontologies is one possible solution and has

been adopted to organize the internet (www.yahoo.com) and to structure library

content. However, this process has the obvious disadvantage of being too labor intensive, and is viable only for large organizations. Thus it is desirable to seek

automatic methods for organizing unlabeled document collections. Given a collection

of unlabeled data points, clustering refers to the problem of automatically assigning

class labels to the data and has been widely studied in statistical pattern recognition

and machine learning. This has increased interest in methods that allow users to quickly and accurately retrieve and organize such information, which poses many challenges for disciplines like text mining and information retrieval.

Text mining is a constituent discipline of data mining; it is content-based and operates on unstructured text documents, extracting useful information ([17], [18]). Content-based operation means working solely with the document content, without using metadata. Text mining comprises many different methods, such as document clustering, categorization and automatic document indexing (a subform of categorization).

The information retrieval discipline deals with the presentation, storage and organization of information, and with methods of accessing it (see [1], [2], [8], [21], [24]). Furthermore, information retrieval performs an interpretation of the text document collection, which is an abstraction of the syntactic and semantic information contained in the text. The principal objective of an information retrieval system is, for a given query, to


retrieve all of the relevant documents and the smallest possible number of irrelevant

documents. Unfortunately, due to problems such as polysemy (words with multiple

meanings) and synonymy (different words that have the same meaning), a list of

documents retrieved for a given query is almost never perfect, and the user has to

ignore some of the items.

Although there are many other models (see Information retrieval chapter), the

algorithms we are dealing with are embedded in the vector space model. The

documents are represented as vectors of term frequencies and the documents set is

represented as a matrix of document vectors. One of the main problems in

information retrieval with the vector space model of documents is the high

dimensionality of the document-term matrix. The number of documents in document

collections may vary from a few thousand to several hundred thousand, and the

number of terms is often more than a few thousand. Hence, reduction of

dimensionality appears to be very useful.

There are many methods for dimensionality reduction, but the most used are

Latent Semantic Indexing (LSI) (see [1], [2], [3], [15], [20], [25]), which is based on

Singular Value Decomposition (SVD), and Concept Indexing (CI) (see [4], [5], [6], [7],

[9]), which is based on Concept Decomposition (CD) computed with k-means clustering algorithms (see the Clustering chapter). The comparison of these two methods ([8]) is the main

objective of this thesis.


2. INFORMATION RETRIEVAL

The machine learning approach ([18], [22]) to classifier construction relies heavily on

the basic machinery of information retrieval. The reason is that both information

retrieval and document categorization are content-based document management

tasks, and therefore share many characteristics. Information retrieval techniques are

used in three phases of the classification task:

1. IR-style indexing is always (uniformly, i.e. by means of the same technique)

performed on the documents of the initial corpus and on those to be

categorized during the operating phase of the classifier;

2. IR-style techniques (such as document-request matching, query expansion,...)

are typically used in the inductive construction of the classifiers;

3. IR-style evaluation of the effectiveness of the classifiers is performed.

Document preprocessing is necessary before creating any information retrieval model. The usual preprocessing steps are the following:

(1) lexical text processing (eliminating punctuation marks and numbers, ignoring

case)

(2) eliminating non-content-bearing words such as conjunctions, prepositions and

any similar words which generally have low semantic value in text exploration

(so called stop words)

(3) reducing words to their basic form such as stemming or lemmatization

(4) index term selection, e.g. preferring nouns and eliminating other forms of

those words

(5) construction and use of a glossary of associated synonym term sets, called a thesaurus
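Steps (1)-(3) above can be sketched in Python; this is a minimal illustration only, and the stop-word list and the crude suffix-stripping rule are placeholders, not the resources actually used in this thesis:

```python
import re

# Illustrative stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "for", "in", "on", "using"}

def preprocess(text):
    # (1) lexical processing: lowercase, drop punctuation and numbers
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = text.split()
    # (2) eliminate stop words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # (3) crude suffix stripping as a stand-in for real stemming/lemmatization
    tokens = [re.sub(r"(ices|es|s)$", "", t) if len(t) > 3 else t for t in tokens]
    return tokens

print(preprocess("Matrices, vector spaces, and information retrieval"))
```

A production pipeline would replace step (3) with a proper stemmer or lemmatizer and add steps (4) and (5).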


In the following section different models of information retrieval are presented and discussed. An information retrieval model can be precisely defined in the following way:

Definition 2.1 [30] An information retrieval model is an ordered quadruple (Dr, Qr, F, g(q, a)) where:

- Dr is a representation of a document set,
- Qr is a representation of a query set,
- F is a set of rules for modeling document and query representations and the relationship between them,
- g(q, a) : Qr × Dr → R is a real function which defines the ordering of documents by relevance for a certain query, also called the decision function.

Three classical models for the information retrieval discipline are: probabilistic,

logic and vector space. In the probabilistic model the set of rules for modeling

document and query representation is based on the probabilistic theory. Moreover, in

the logic model, documents and queries are represented with an index term set, so

this model is based on set theory. Finally, in the vector space model documents and queries are represented as multidimensional vectors, so this model is based on linear algebra. The algorithms used in this thesis are embedded in the vector space model.


2.1. VECTOR SPACE MODEL

Today, the vector space model is the most popular model in the information retrieval discipline. Unlike in other models, index terms in documents and queries are assigned weights (real values), and the similarity measure takes values from the [0, 1] interval.

Document representation in the vector space model is often called a bag of words representation. It reflects the presumption of index term independence and the loss of information about relationships between terms.

Let T = {t1, t2, ..., tm} denote the set of index terms and D = {d1, d2, ..., dn} the set of documents in the collection. Furthermore, let aij denote the weight assigned to the pair (ti, dj), for i = 1, 2, ..., m and j = 1, 2, ..., n. The weight values are real and positive. Index terms in the representation of a query q are also assigned weights; let qi, for i = 1, 2, ..., m, denote the weight assigned to the pair (ti, q).

Definition 2.2 [31] In the vector space model documents dj, j = 1, 2, ..., n are represented by vectors aj = (a1j, a2j, ..., amj)T, and queries by q = (q1, q2, ..., qm)T. The document collection D is represented by the document-term matrix A = [aij] = [a1 a2 ... an] (see Figure 2.1).

Each column of the document-term matrix represents one document from the collection, and each row represents one index term (its weights across the document collection).


$$A = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & a_{ij} & \vdots \\ a_{m1} & \cdots & a_{mn} \end{bmatrix}$$

(rows correspond to index terms $t_1, \dots, t_m$, columns to documents $d_1, \dots, d_n$)

Figure 2.1 Document-term matrix

Definition 2.3 [31] The similarity measure between document dj and query q is defined as the cosine of the angle between their vector representations:

$$\mathrm{sim}(d_j, q) = \cos(\angle(a_j, q)) = \frac{a_j^T q}{\|a_j\| \, \|q\|} = \frac{\sum_{i=1}^{m} a_{ij} q_i}{\sqrt{\sum_{i=1}^{m} a_{ij}^2} \, \sqrt{\sum_{i=1}^{m} q_i^2}} \qquad (2.1)$$

where ||aj|| and ||q|| are the Euclidean norms of the vector representations of the document and the query.

Since the aij and qi values are positive, the similarity measure takes values from the [0, 1] interval. Similarity values close to 1 mean a better match between document dj and query q. In practice, a similarity threshold t is often defined, and the documents retrieved for a query q are those whose similarity value lies in the [t, 1] interval. Another approach to filtering and ranking retrieved documents is to sort documents by descending similarity and retrieve only the k documents with the best similarity values.
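Definition 2.3 and the top-k retrieval strategy described above can be sketched as follows; the small document-term matrix and query are purely illustrative:

```python
import numpy as np

def cosine_sim(a, q):
    # equation 2.1: cosine of the angle between a document vector and a query
    return float(a @ q) / (np.linalg.norm(a) * np.linalg.norm(q))

def retrieve_top_k(A, q, k):
    # one similarity value per column (document) of A
    sims = [cosine_sim(A[:, j], q) for j in range(A.shape[1])]
    order = sorted(range(len(sims)), key=lambda j: sims[j], reverse=True)
    return [(j, sims[j]) for j in order[:k]]

# toy document-term matrix: 3 terms (rows) x 3 documents (columns)
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 3.0, 1.0]])
q = np.array([1.0, 1.0, 0.0])
print(retrieve_top_k(A, q, 2))
```

The threshold-based variant would instead keep every document whose similarity falls in [t, 1].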


The above preprocessing scheme yields the number of occurrences of word j in document i, say fji, and the number of documents which contain word j, say dj. Using these counts, we now create n document vectors in Rd, namely x1, x2, ..., xn, as follows. For 1 ≤ j ≤ d, set the j-th component of document vector xi, 1 ≤ i ≤ n, to be the product of three terms

$$x_{ji} = t_{ji} \, g_j \, s_i \qquad (2.2)$$

where tji is the term weighting component and depends only on fji, gj is the global

weighting component and depends on dj, and si is the normalization component for document i. Intuitively, tji captures the relative importance of a word in a document, while gj

captures the overall importance of a word in the entire set of documents. The

objective of such weighting schemes is to enhance discrimination between various

document vectors and to enhance retrieval effectiveness.

There are many schemes for selecting the term, global, and normalization

components, for example, ([4], [7], [17], [18]) presents 5, 5, and 2 schemes,

respectively, for the term, global, and normalization components, a total of 5 × 5 × 2 =

50 choices. From this extensive set, we will use two popular schemes denoted as txn

and tfn, and known, respectively, as normalized term frequency and normalized

term frequency-inverse document frequency. Both schemes emphasize words

with higher frequencies, and use tji = fji. The txn scheme uses gj = 1, while the tfn

scheme emphasizes words with low overall collection frequency and uses formula

2.3 (dj – number of documents in which index term occurs; n – number of documents

in document collection). In both schemes, each document vector is normalized to

have unit L2 norm, that is,

$$g_j = \log\frac{n}{d_j} \qquad (2.3)$$

$$s_i = \left( \sum_{j=1}^{d} (t_{ji} \, g_j)^2 \right)^{-1/2} \qquad (2.4)$$

Intuitively, the effect of normalization is to retain only the direction of the

document vectors. This ensures that documents dealing with the same subject matter

(that is, using similar words), but differing in length lead to similar document vectors.
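The weighting scheme of equations 2.2-2.4 can be sketched as follows, covering both the txn (gj = 1) and tfn (gj = log(n/dj)) variants; the function and its toy input are an illustration under these assumptions, not the thesis implementation:

```python
import math

def weight_matrix(F, scheme="tfn"):
    # F is an m x n term-document count matrix (lists of lists), m terms, n docs
    m, n = len(F), len(F[0])
    # d_j: number of documents in which term j occurs (assumed > 0 for all terms)
    d = [sum(1 for i in range(n) if F[j][i] > 0) for j in range(m)]
    # global weight g_j: 1 for txn, log(n / d_j) for tfn (equation 2.3)
    g = [1.0 if scheme == "txn" else math.log(n / d[j]) for j in range(m)]
    # term weight t_ji = f_ji, multiplied by g_j (equation 2.2)
    X = [[F[j][i] * g[j] for i in range(n)] for j in range(m)]
    # normalization s_i: scale each document (column) to unit L2 norm (equation 2.4)
    for i in range(n):
        s = math.sqrt(sum(X[j][i] ** 2 for j in range(m))) or 1.0
        for j in range(m):
            X[j][i] /= s
    return X
```

For example, `weight_matrix([[1, 0], [1, 1]], "txn")` normalizes the first document's column (1, 1) to (0.7071, 0.7071).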


2.2. VECTOR SPACE MODEL EXAMPLES

Example 2.1 [8] The document collection in this example is composed of 15 books or

article titles divided into two clusters. The first cluster is composed of 9 Data mining

(DM) documents, the second cluster contains 5 documents related to linear algebra

(LA) and document D6 (matrices, vector spaces, and information retrieval) is a

combination of both mentioned disciplines (data mining and linear algebra). The

index terms list is formed in three steps:

1. only terms that occur in at least two documents are considered

2. „stop words“ are eliminated (conjunctions, definite articles and similar words with no semantic value for information retrieval)

3. word variations are mapped into their basic form; e.g. the word matrices is mapped into the index term matrix. Furthermore, the words applications and applied are mapped into the index term application.

As a result we get the following index term list: 1) text, 2) mining, 3) clustering, 4)

classification, 5) retrieval, 6) analysis, 7) information, 8) linear, 9) algebra, 10) matrix,

11) application, 12) document, 13) vector, 14) space, 15) data and 16) algorithm.

Documents and their categorization are shown in Table 2.1. Two queries shall be

presented in order to illustrate the information retrieval process:

Q1: Data mining

Q2: Using linear algebra for data mining

Relevant documents for query Q1 are the DM documents and the document categorized in both categories, whereas only document D6 is relevant for query Q2. Most of the documents relevant for Q1 do not contain the index terms from Q1 (but they do contain index terms such as clustering, classification, information and retrieval, which are relevant for the DM discipline). D6, which is relevant for Q2, does not contain any index term from Q2.


Label  Category  Document
D1   DM    Survey of text mining: clustering, classification, and retrieval
D2   DM    Automatic text processing: the transformation analysis and retrieval of information by computer
D3   LA    Elementary linear algebra: A matrix approach
D4   LA    Matrix algebra and its applications in statistics and econometrics
D5   DM    Effective databases for text and document management
D6   Both  Matrices, vector spaces, and information retrieval
D7   LA    Matrix analysis and applied linear algebra
D8   LA    Topological vector spaces and algebras
D9   DM    Information retrieval: data structures and algorithms
D10  LA    Vector spaces and algebras for chemistry and physics
D11  DM    Classification, clustering and data analysis
D12  DM    Clustering of large data sets
D13  DM    Clustering algorithms
D14  DM    Document warehousing and text mining: techniques for improving business operations, marketing and sales
D15  DM    Data mining and knowledge discovery

Table 2.1 Documents and their categorization (example 2.1)

First of all, the document-term matrix F is formed, where the matrix component at (i, j) represents the number of occurrences of the i-th index term in the j-th document. Analogously, query vectors q1 and q2 are formed:

q1 = (0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0)T,

q2 = (0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0)T.


Before computing the similarities between queries Q1 and Q2 and the documents Di (1 ≤ i ≤ 15) from the collection, the document and query vectors are transformed into txn form. Since the document and query vectors are now unit vectors, the inner product is used as the similarity measure:

$$\cos(\angle(a, b)) = \frac{a^T b}{\|a\| \, \|b\|} = a^T b \qquad (2.5)$$

The rankings of documents by relevance for queries Q1 and Q2 are shown in Table 2.2. For Q1, 6 out of 10 relevant documents and no irrelevant documents are retrieved, whereas for Q2 the only relevant document, D6, is not retrieved.

Query Q1                  Query Q2
Document  Inner product   Document  Inner product
D15  1.4142               D15  1.4142
D12  0.7071               D3   1.1547
D14  0.5774               D7   0.8944
D9   0.5000               D12  0.7071
D11  0.5000               D4   0.5774
D1   0.4472               D8   0.5774
D2   0                    D10  0.5774
D3   0                    D14  0.5774
D4   0                    D9   0.5000
D5   0                    D11  0.5000
D6   0                    D1   0.4472
D7   0                    D2   0
D8   0                    D5   0
D10  0                    D6   0
D13  0                    D13  0

Table 2.2 Document ranking by similarity with Q1 and Q2 (example 2.1)
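The top row of Table 2.2 can be checked by hand. D15 (Data mining and knowledge discovery) contains the index terms mining and data once each; normalizing the document vector to unit length while keeping the query as raw term frequencies reproduces the tabulated value. A sketch assuming exactly that convention:

```python
import math

# D15 under txn weighting: term frequencies normalized to unit Euclidean norm
doc = {"mining": 1.0, "data": 1.0}
norm = math.sqrt(sum(v * v for v in doc.values()))
doc = {t: v / norm for t, v in doc.items()}

# query Q1 ("Data mining") as raw term frequencies
q1 = {"mining": 1.0, "data": 1.0}

inner = sum(doc.get(t, 0.0) * w for t, w in q1.items())
print(round(inner, 4))  # 1.4142, matching the first row of Table 2.2
```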


This example shows an evident disadvantage of the vector space model: the only documents retrieved are those that contain index terms appearing in the query (the model is lexically bounded). Documents that are related by semantics but do not contain index terms from the query (e.g. synonyms) will not be retrieved.

Example 2.2 [8] This example illustrates the application of the global weighting (IDF) component. The same document collection is used, with the same queries Q1 and Q2, where the document collection uses the tfn weighting function and the queries use tfx. The TF component represents the number of occurrences of an index term in a certain document. Furthermore, IDF is computed by formula 2.3 and the normalization component by formula 2.4. The IDF components of the index terms are shown in Table 2.3.

Term            dj (number of documents in which term occurs)  IDF component
text            4   1.3218
mining          3   1.6094
clustering      4   1.3218
classification  2   2.0149
retrieval       4   1.3218
analysis        3   1.6094
information     3   1.6094
linear          2   2.0149
algebra         5   1.0986
matrix          4   1.3218
application     2   2.0149
document        2   2.0149
vector          3   1.6094
space           3   1.6094
data            4   1.3218
algorithm       2   2.0149

Table 2.3 IDF components of index terms (example 2.2)


The final document-term matrix A is obtained by multiplying the rows of matrix F by the corresponding IDF components and normalizing the columns. Query vectors q1 and q2 from example 2.1 are multiplied by the corresponding IDF components, so the resulting vector representations of the queries are:

q1 = (0, 1.6094, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1.3218, 0)T,

q2 = (0, 1.6094, 0, 0, 0, 0, 0, 2.0149, 1.0986, 0, 0, 0, 0, 0, 1.3218, 0)T.

The retrieved documents ranked by relevance for queries q1 and q2 are shown in Table 2.4.

Query Q1                  Query Q2
Document  Inner product   Document  Inner product
D15  2.0826               D15  2.0826
D12  0.9346               D3   1.9887
D14  0.8939               D7   1.4248
D1   0.7512               D12  0.9346
D9   0.5485               D14  0.8939
D11  0.5485               D1   0.7512
D2   0                    D9   0.5485
D3   0                    D11  0.5485
D4   0                    D8   0.4776
D5   0                    D10  0.4776
D6   0                    D4   0.4557
D7   0                    D2   0
D8   0                    D5   0
D10  0                    D6   0
D13  0                    D13  0

Table 2.4 Document ranking by similarity with Q1 and Q2 (example 2.2)


In this example we notice a slightly different ranking of the retrieved documents than in example 2.1. However, the same documents have non-zero similarity, and the few top documents are ranked in the same order as in example 2.1.


2.3. EVALUATION

In this section the criteria and techniques for information retrieval evaluation are described. More about evaluation criteria can be found in [17], [18]. Evaluation is usually performed on standard test document collections, constructed by experts from different areas, which facilitates the comparison of different information retrieval methods. Test collections are composed of:

- document collections (science article abstracts, reports or news articles)
- a set of queries for the document collection
- estimated relevance of documents for each query from the query set (relevance judgments)

As in the case of information retrieval systems, the evaluation of document

classifiers is typically conducted experimentally, rather than analytically. The reason

for this tendency is that, in order to evaluate a system analytically (e.g. proving that

the system is correct and complete) we always need a formal specification of the

problem that the system is trying to solve (e.g. with respect to what correctness and

completeness are defined), and the central notion of document classification (namely,

that of relevance of a document to a category) is, due to its subjective character,

inherently non-formalisable. The experimental evaluation of classifiers, rather than

concentrating on issues of efficiency, usually tries to evaluate the effectiveness of a

classifier, i.e. its capability of taking the right categorization decisions. The main

reasons for this bias are that:

Efficiency is a notion dependent on the hw/sw technology used. Once this

technology evolves, the results of experiments aimed at establishing efficiency

are no longer valid. This does not happen for effectiveness, as any experiment

aimed at measuring effectiveness can be replicated, with identical results, on

any different or future hw/sw platform;

Effectiveness is really a measure of how good the system is at tackling the central notion of classification, that of the relevance of a document to a category.


Classification effectiveness is measured in terms of the classic IR notions of precision p and recall r. In order to define these measures precisely, we denote by A the set of documents retrieved for a certain query, by R the set of relevant documents, and by Rα the intersection of the two (Rα = A ∩ R). By |A|, |R| and |Rα| we denote the cardinalities of those sets. Recall r is computed as follows:

$$r = \frac{|R_\alpha|}{|R|} \qquad (2.6)$$

$$r = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} TP_i + \sum_{i=1}^{n} FN_i} \qquad (2.7)$$

and precision p as:

$$p = \frac{|R_\alpha|}{|A|} \qquad (2.8)$$

$$p = \frac{\sum_{i=1}^{n} TP_i}{\sum_{i=1}^{n} TP_i + \sum_{i=1}^{n} FP_i} \qquad (2.9)$$
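Equations 2.6-2.9 can be sketched as a small function over per-query contingency counts; summing the counts over all n queries before dividing corresponds to micro-averaging. The counts below are illustrative:

```python
def micro_precision_recall(counts):
    # counts: one {"TP": ..., "FP": ..., "FN": ...} dict per query
    tp = sum(c["TP"] for c in counts)
    fp = sum(c["FP"] for c in counts)
    fn = sum(c["FN"] for c in counts)
    recall = tp / (tp + fn)      # equation 2.7
    precision = tp / (tp + fp)   # equation 2.9
    return precision, recall

# illustrative contingency counts for two queries
counts = [{"TP": 4, "FP": 1, "FN": 2}, {"TP": 3, "FP": 2, "FN": 1}]
p, r = micro_precision_recall(counts)
print(p, r)  # 0.7 0.7
```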

TP (true positive), FP (false positive), TN (true negative) and FN (false negative) are

described in Table 2.5.

Documents di, 1 ≤ i ≤ n        Relevant document
                               YES    NO
Retrieved document      YES    TPi    FPi
                        NO     FNi    TNi

Table 2.5 The contingency table for one query

Generally, as recall grows, precision decreases, and vice versa. These measures are closely related to each other and are computed at the same time. Moreover, the set A is never presented to the user at once; instead, a ranked document list ordered by decreasing similarity is retrieved. As the user examines this list in top-down order, precision and recall vary. A good insight into retrieval quality can be


acquired by measuring the average precision at different recall levels. Usually the average precision is computed at 11 standard recall levels: 0%, 10%, 20%, ..., 100%. Let rk, k = 0, 1, ..., 10, denote the k-th standard recall level. When the user iterates through the list of retrieved documents for a certain query, recall usually does not hit the standard levels exactly. So for computing the precision at a given level, P(ri) for i = 0, 1, ..., 9, the following formula is used:

$$P(r_i) = \max\{\, P(r) : r_i \le r \le r_{i+1} \,\}, \quad i = 0, 1, \dots, 9 \qquad (2.10)$$

The precision at the 100% recall level is equal to the precision value at the point where 100% recall is achieved; if 100% recall is never achieved, the precision at the 100% level is 0. The mean average precision is then computed as the arithmetic mean over the 11 standard recall levels:

$$\bar{P} = \frac{1}{11} \sum_{i=0}^{10} P(r_i) \qquad (2.11)$$
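The 11-point computation of equations 2.10-2.11 can be sketched as follows, using the common interpolation over all recall levels r ≥ ri (a slight simplification of formula 2.10); the ranked relevance flags are illustrative:

```python
def eleven_point_average_precision(relevant_flags, num_relevant):
    # (recall, precision) after each rank position in the retrieved list
    points = []
    hits = 0
    for rank, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
        points.append((hits / num_relevant, hits / rank))
    # interpolated precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0;
    # levels never reached get precision 0, as described above
    interp = []
    for i in range(11):
        level = i / 10
        candidates = [p for r, p in points if r >= level]
        interp.append(max(candidates) if candidates else 0.0)
    return sum(interp) / 11

# illustrative ranked list: relevance flags of the retrieved documents
ranked = [True, False, True, True]
print(round(eleven_point_average_precision(ranked, num_relevant=4), 4))  # 0.6136
```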

Effectiveness can also be measured as the value of the Fα function, for some 0 ≤ α ≤ 1, where Fα is defined as follows:

$$F_\alpha = \frac{1}{\alpha \frac{1}{p} + (1 - \alpha) \frac{1}{r}} \qquad (2.12)$$

In this formula α may be seen as the relative degree of importance attributed to p and

r: If α = 1, then Fα coincides with p, if α = 0 then Fα coincides with r. Usually, a value

of α = 0.5 is used, which attributes equal importance to p and r; for reasons we do not

want to enter here, rather than F0.5 this is usually called F1 (see [17], [18] for details).

As shown in [32], for a given classifier Φ, its breakeven value is always less than or equal to its F1 value.
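Equation 2.12 can be sketched directly; the precision and recall values below are illustrative:

```python
def f_alpha(p, r, alpha=0.5):
    # equation 2.12: weighted harmonic mean of precision p and recall r;
    # alpha = 1 reduces to p, alpha = 0 reduces to r, alpha = 0.5 gives F1
    return 1.0 / (alpha / p + (1 - alpha) / r)

print(f_alpha(0.5, 1.0))  # F1 = 2pr / (p + r) = 0.666...
```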


2.4. EXAMPLE OF DOCUMENTS AND QUERIES

Example 2.3 This example shows documents and queries from the MEDLINE and CRANFIELD collections. The label .I marks a new document, and the number following this label is the ordinal number of the document. Furthermore, the label .W marks the beginning of the document text. In Figure 2.3 we see that the document is relatively short and contains very specific terms such as fetal, plasma, glucose, fatty, acids. In Figure 2.4 we can also see that the queries are very short and clear, and contain specific terms. The fifth query contains the term fetus, while the first document, which is relevant for the fifth query, contains the term fetal (lexically different from the term fetus, but with the same semantics).

.I 1
.W
correlation between maternal and fetal plasma levels of glucose and free fatty acids .
correlation coefficients have been determined between the levels of glucose and ffa in maternal and fetal plasma collected at delivery . significant correlations were obtained between the maternal and fetal glucose levels and the maternal and fetal ffa levels . from the size of the correlation coefficients and the slopes of regression lines it appears that the fetal plasma glucose level at delivery is very strongly dependent upon the maternal level whereas the fetal ffa level at delivery is only slightly dependent upon the maternal level .

Figure 2.3 The first document from the MEDLINE document collection

.I 1
.W
the crystalline lens in vertebrates, including humans.
.I 3
.W
electron microscopy of lung or bronchi.
.I 4
.W
tissue culture of lung or bronchial neoplasms.
.I 5
.W
the crossing of fatty acids through the placental barrier. normal fatty acid levels in placenta and fetus.
.I 10
.W
neoplasm immunology.

Figure 2.4 Some of the queries for the MEDLINE document collection


In Figure 2.5 we see the first document and in Figure 2.6 some of the queries from the CRANFIELD document collection. The label .T marks the title and the label .A the author of the document. Unlike MEDLINE, we can see that the CRANFIELD documents and queries contain less specific terms.

.I 1
.T
experimental investigation of the aerodynamics of a wing in a slipstream .
.A
brenckman,m.
.W
experimental investigation of the aerodynamics of a wing in a slipstream .
an experimental study of a wing in a propeller slipstream was made in order to determine the spanwise distribution of the lift increase due to slipstream at different angles of attack of the wing and at different free stream to slipstream velocity ratios . the results were intended in part as an evaluation basis for different theoretical treatments of this problem .
the comparative span loading curves, together with supporting evidence, showed that a substantial part of the lift increment produced by the slipstream was due to a /destalling/ or boundary-layer-control effect . the integrated remaining lift increment, after subtracting this destalling lift, was found to agree well with a potential flow theory .
an empirical evaluation of the destalling effects was made for the specific configuration of the experiment .

Figure 2.5 The first document from the CRANFIELD document collection

.I 001
.W
what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft .
.I 002
.W
what are the structural and aeroelastic problems associated with flight of high speed aircraft .
.I 004
.W
what problems of heat conduction in composite slabs have been solved so far .
.I 008
.W
can a criterion be developed to show empirically the validity of flow solutions for chemically reacting gas mixtures based on the simplifying assumption of instantaneous local chemical equilibrium .

Figure 2.6 Some of the queries for the CRANFIELD document collection


3. CLUSTERING

Clustering is the task of organizing a set of objects into meaningful groups. These

groups can be disjoint, overlapping, or organized in some hierarchical fashion. The

key element of clustering is the notion that the discovered groups are meaningful.

This definition is intentionally vague, as what constitutes meaningful is, to a large extent, application dependent. In some applications this may translate to groups in which the pairwise similarity between their objects is maximized and the pairwise similarity between objects of different groups is minimized. In other applications this may translate to groups that contain objects sharing some key characteristics, even though their overall similarity is not the highest.

Clustering is the unsupervised classification of patterns (observations, data items,

or feature vectors) into groups (clusters). The clustering problem has been addressed

in many contexts and by researchers in many disciplines; this reflects its broad

appeal and usefulness as one of the steps in exploratory data analysis. However,

clustering is a difficult combinatorial problem, and differences in assumptions and

contexts in different communities have made the transfer of useful generic concepts

and methodologies slow to occur.

Typical pattern clustering activity involves the following steps [12]:

(1) pattern representation (optionally including feature extraction and/or selection),

(2) definition of a pattern proximity measure appropriate to the data domain,

(3) clustering or grouping,

(4) data abstraction (if needed),

(5) assessment of output (if needed).

Figure 3.1 depicts a typical sequencing of the first three of these steps, including

a feedback path where the grouping process output could affect subsequent feature

extraction and similarity computations.


Figure 3.1 Process of data clustering

Pattern representation refers to the number of classes, the number of

available patterns, and the number, type, and scale of the features available to the

clustering algorithm. Some of this information may not be controllable by the

practitioner. Feature selection is the process of identifying the most effective subset

of the original features to use in clustering. Feature extraction is the use of one or

more transformations of the input features to produce new salient features. Either or

both of these techniques can be used to obtain an appropriate set of features to use

in clustering.

Pattern proximity is usually measured by a distance function defined on pairs

of patterns. A variety of distance measures are in use in the various communities. A

simple distance measure like Euclidean distance can often be used to reflect

dissimilarity between two patterns, whereas other similarity measures can be used to

characterize the conceptual similarity between patterns.
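As a small concrete illustration (a sketch of ours, not part of the thesis; the vectors are invented), the two kinds of measure can be computed as follows:

```python
import math

def euclidean(a, b):
    # dissimilarity: 0 for identical patterns, grows as they differ
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # similarity: 1.0 for vectors pointing in the same direction
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

a = [2.0, 0.0, 1.0]   # invented term-frequency vectors
b = [4.0, 0.0, 2.0]   # b is a scaled copy of a (a longer document)

print(euclidean(a, b))  # nonzero: the patterns differ in length
print(cosine(a, b))     # ~1.0: same direction, maximally similar
```

For unit-norm document vectors the two orderings coincide, since ||a − b||^2 = 2 − 2 a^T b; for raw term frequencies, as above, they can disagree.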

The grouping step can be performed in a number of ways. The output cluster

(or clusters) can be hard (a partition of the data into groups) or fuzzy (where each

pattern has a variable degree of membership in each of the output clusters).

Hierarchical clustering algorithms produce a nested series of partitions based on a

criterion for merging or splitting clusters based on similarity. Partitional clustering

algorithms identify the partition that optimizes (usually locally) a clustering criterion.

Additional techniques for the grouping operation include probabilistic and graph-theoretic clustering methods.


Data abstraction is the process of extracting a simple and compact

representation of a data set. Here, simplicity is either from the perspective of

automatic analysis (so that a machine can perform further processing efficiently) or it

is human-oriented (so that the representation obtained is easy to comprehend and

intuitively appealing). In the clustering context, a typical data abstraction is a compact

description of each cluster, usually in terms of cluster prototypes or representative

patterns such as the centroid.

Different approaches to clustering data can be described with the help of the

hierarchy shown in Figure 3.2. At the top level, there is a distinction between

hierarchical and partitional approaches (hierarchical methods produce a nested

series of partitions, while partitional methods produce only one).

Figure 3.2 Taxonomy of clustering approaches

One well-known clustering system is G-means (clustering in ping-pong style) [10]. Concept indexing is based on the k-means clustering algorithm, so the k-means algorithm is precisely defined and described in the next chapter (dimensionality reduction, concept decomposition).


4. DIMENSIONALITY REDUCTION

Different techniques for dimensionality reduction in the vector space model have been developed. There are many motives for dimensionality reduction, such as reducing the memory space needed for document representation, improving information retrieval or classification performance, and eliminating noise and redundancy in the document representation. Although dimensionality reduction means information reduction, it often results in more efficient information retrieval and classification, which shall be confirmed later in this thesis. If the dimensionality of the original vector space is equal to the index term count m, reduction to a dimensionality m̃ can be performed in the following two manners:

1. m̃ << m by term selection (feature selection),

2. m̃ << m by term extraction (feature construction).

Dimensionality reduction methods can also be supervised, if they use information about the assignment of documents to classes, or unsupervised, if they do not use this information.

Upon dimensionality reduction using term selection, the index term set T = {t1, …, tm} is reduced to a subset T̃ ⊆ T. The objective is to select terms so that the loss in information retrieval performance is minimal. On the other hand, dimensionality reduction by term extraction creates a new set T̃ which is not a subset of T; in general, terms from T̃ will not match terms from T. This approach is also called reparameterization (the number of new parameters being smaller than the number of old ones), and its aim is to overcome the problems of synonymy and polysemy. Dimensionality reduction techniques map document representations that are close to each other in the original space into vectors in the reduced space that are closer to each other than in the original space. This facilitates the retrieval of documents that are relevant for a certain query yet do not necessarily contain the query's index terms from the original space.
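As a toy illustration of the first manner (our sketch, not the thesis implementation; the collection and the value of m̃ are invented), unsupervised term selection can be as simple as keeping the terms with the highest document frequency:

```python
# invented toy collection: each document is a list of index terms
docs = [
    ["matrix", "vector", "space"],
    ["matrix", "algebra"],
    ["mining", "matrix", "vector"],
    ["mining", "text"],
]

# document frequency: in how many documents each term occurs
df = {}
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1

# keep only the m_tilde terms with the highest document frequency
m_tilde = 2
selected = sorted(df, key=lambda t: (-df[t], t))[:m_tilde]
print(selected)   # → ['matrix', 'mining']
```

Supervised criteria (e.g. information gain with respect to class labels) would replace the document-frequency score but keep the same select-the-top-m̃ structure.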


This thesis is based on term extraction information retrieval methods of latent

semantic indexing and concept indexing. The method of LSI was introduced in

1990 [33] and improved in 1995 [30]. It represents documents as approximations and

tends to cluster documents on similar topics even if their term profiles are somewhat

different. This approximate representation is accomplished by using a low-rank

singular value decomposition (SVD) approximation of the term-document matrix.

Although the LSI method has had empirical success, it suffers from the lack of interpretation

for the low-rank approximation and, consequently, the lack of controls for

accomplishing specific tasks in information retrieval. The explanation of Latent

Semantic Indexing efficiency in terms of multivariate analysis is provided in [3], [15],

[16]. A method by Dhillon and Modha [7] uses centroids of clusters created by the

spherical k-means algorithm or so-called concept decomposition (CD) for lowering

the rank of the term-document matrix. Applying this method, the space on which the

term-document matrix is projected is more interpretable. Namely, it is a space spread

by centroids of clusters. The information retrieval technique using concept

decomposition is called concept indexing (CI). Furthermore, the concept

decomposition method is computationally more efficient and requires less memory

than LSI.


4.1. LATENT SEMANTIC INDEXING

Let the m × n matrix A = [aij] be the term-document matrix. Then aij is the weight of

the i-th term in the j-th document. The standard procedure is to normalize the

columns of the matrix to be of unit norm. The term-document matrix has an important

property of being sparse, i.e. most of its elements are zeros.

A query has the same form as a document; it is a vector whose i-th component is the frequency of the i-th term in the query. We never normalize the vector of the

query because it has no effect on document ranking. A common measure of similarity

between the query and the document is the cosine of the angle between them.

In order to rank documents according to their relevance to the query, we compute

s = q^T A, where q is the vector of the query and the j-th entry of s represents the relevance score of the j-th document. The LSI method is just a variation of the

vector space model. The fundamental mathematical result that supports LSI [2] is

that for any m × n matrix A, the following singular value decomposition exists:

A = U Σ V^T   (4.1)

where U is the m × m orthogonal matrix, V is the n × n orthogonal matrix and Σ is the

m × n diagonal matrix:

Σ = diag(σ1, …, σp)   (4.2)

where p = min{m, n} and σ1 ≥ σ2 ≥ ...≥ σp ≥ 0. The σi are the singular values and ui

and vi are the i-th left singular vector and the i-th right singular vector respectively.

The second fundamental result [29] is the theorem by Eckart and Young, which

states that the distance in the Frobenius norm between A and any rank-k matrix is minimized by the approximation Ak. Here


Ak = Uk Σk Vk^T   (4.3)

where Uk is the m × k matrix whose columns are the first k columns of U, Vk is the n × k matrix whose columns are the first k columns of V, and Σk is the k × k diagonal matrix whose diagonal elements are the k largest singular values of A. More precisely,

||A − Ak||F = min{X : rank(X) ≤ k} ||A − X||F = (σ(k+1)^2 + … + σp^2)^(1/2)   (4.4)

We call Ak the truncated SVD of A, and the space spanned by the columns of Uk the k-dimensional LSI subspace. Thus there is no better rank-k approximation of the matrix A than Ak in the Frobenius norm. Figure 4.1 depicts the process of singular value decomposition.

Figure 4.1 Reduced singular value decomposition (truncated SVD)

The ranking of documents according to their relevance for a query using the LSI method is executed by calculating the score vector

s = q^T Uk Σk Vk^T   (4.5)

[Figure 4.1 depicts A (m × n) = U (m × m) Σ (m × n) V^T (n × n) and its truncation to Uk (m × k), Σk (k × k), Vk^T (k × n); the rows of U are the term vectors and the rows of V the document vectors.]


LSI Algorithm:

1. Create the term-document matrix A and the vector of the query q.

2. Use the singular value decomposition of the term-document matrix to create a rank-k approximation Ak according to formula (4.3).

3. Rank the documents according to their relevance to the query according to formula (4.5).


4.2. CONCEPT INDEXING

Concept indexing is an indexing method for dimensionality reduction in the vector space model using concept decomposition (CD) of the term-document matrix. CD of the m × n term-document matrix A is an approximation by another matrix which provides a k-dimensional representation (k << m) of the document collection. The CD algorithm is performed in two basic steps: the first step is clustering the documents into k clusters, and the second step is projecting the documents onto the cluster centroids according to the least-squares approximation. Dhillon and Modha [7] use the spherical k-means clustering algorithm, a variant of the basic k-means clustering algorithm which uses unit-norm document representations.

Cluster centroids are also called concept vectors. Concept vectors represent concepts from the clustered index terms and can be used as a model for classification of documents later folded into the document collection. One of the main advantages of CI over LSI is the interpretability of the concept vectors, because they are local, unlike LSI's singular vectors, which are global and not directly interpretable. Furthermore, CI is less computationally complex and uses less memory than LSI [7]. While LSI has a solid theoretical basis, CI has no comparable theoretical foundation.

The CI method has two versions: unsupervised and supervised. The unsupervised method is used for document collections that have not been classified by experts; for collections that have been classified by experts, both the supervised and the unsupervised method can be used. The supervised method simply skips the first step of clustering the documents into k clusters, and the cluster centroids are formed on the basis of the expert classification.

CI's target is to approximate each document vector by a linear combination of concept vectors. Define the concept matrix as the m × k matrix whose j-th column is the concept vector cj, that is

Ck = [c1, c2, …, ck].   (4.6)


If we assume linear independence of the concept vectors, then it follows that the concept matrix has rank k. Now we define the concept decomposition Dk of the term-document matrix A as the least-squares approximation of A on the column space of the concept matrix Ck. The concept decomposition is the m × n matrix

Dk = Ck Z*   (4.7)

where Z* is the solution of the least-squares problem ([28])

Z* = arg min(Z) ||A − Ck Z||F   (4.8)

that is

Z* = (Ck^T Ck)^(−1) Ck^T A.   (4.9)

In this thesis two types of CI algorithms are used: one is spherical k-means and the other is fuzzy k-means. Both types are described in the following subsections.

Concept decomposition algorithm:

1) Compute the document cluster centroids ci, i = 1, 2, …, k, by using one of the clustering algorithms (spherical or fuzzy k-means) from the following subsections.

2) Form the concept matrix Ck = [c1 c2 … ck].

3) Compute the document representation matrix Z* by using formula (4.9).
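Steps 2 and 3 can be sketched as follows (our illustration; the matrix A and the centroids in Ck are invented, and since k = 2 the normal-equations matrix of formula (4.9) is inverted in closed form):

```python
def matmul(X, Y):
    # naive matrix product of lists-of-rows
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def inv2(M):
    # closed-form inverse of a 2 x 2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# invented term-document matrix A (4 terms x 3 documents)
A = [[1.0, 0.0, 1.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 1.0],
     [0.0, 1.0, 0.0]]

# invented concept matrix Ck (4 terms x k = 2 concepts), e.g. two centroids
Ck = [[0.5, 0.5],
      [1.0, 0.0],
      [0.0, 1.0],
      [0.5, 0.5]]

CtC = matmul(transpose(Ck), Ck)                    # k x k matrix Ck^T Ck
Z = matmul(inv2(CtC), matmul(transpose(Ck), A))    # formula (4.9)
Dk = matmul(Ck, Z)                                 # Dk = Ck Z*, formula (4.7)
print(Z)   # → [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
```

For larger k one would solve the normal equations with a Cholesky or QR factorization instead of forming the explicit inverse.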


4.2.1. SPHERICAL K-MEANS

Suppose we are given n document vectors a1, a2, ..., an in R^m with non-negative components. Let π1, π2, ..., πk denote a partitioning of the document vectors into k disjoint clusters such that ([7])

π1 ∪ π2 ∪ … ∪ πk = {a1, a2, …, an}  and  πj ∩ πl = ∅ if j ≠ l.   (4.10)

For each fixed 1 ≤ j ≤ k, the mean vector or the centroid of the document vectors

contained in the cluster πj is

mj = (1/nj) Σ(a ∈ πj) a   (4.11)

where nj is the number of document vectors in πj . Note that the mean vector mj need

not have a unit norm; we can capture its direction by writing the corresponding

concept vector as

cj = mj / ||mj||.   (4.12)

The concept vector cj has the following important property. For any unit vector

z in Rm, we have from the Cauchy-Schwarz inequality that

Σ(a ∈ πj) a^T z ≤ Σ(a ∈ πj) a^T cj.   (4.13)

Thus, the concept vector may be thought of as the vector that is closest in

cosine similarity (in an average sense) to all the document vectors in the cluster πj.

Motivated by (4.13), we measure the "coherence" or "quality" of each cluster πj, 1 ≤ j ≤ k, as


Σ(a ∈ πj) a^T cj   (4.14)

Observe that if all document vectors in a cluster are identical, then the average

coherence of that cluster will have the highest possible value of 1. On the other hand,

if the document vectors in a cluster vary widely, then the average coherence will be

small, that is, close to 0. Since Σ(a ∈ πj) a^T cj = nj mj^T cj and ||cj|| = 1, we have that

Σ(a ∈ πj) a^T cj = nj mj^T cj = nj ||mj|| = || Σ(a ∈ πj) a ||   (4.15)

This rewriting yields the remarkably simple intuition that the quality of each

cluster πj is measured by the L2 norm of the sum of the document vectors in that

cluster. We measure the quality of any given partitioning {πj}, j = 1, …, k, using the following objective function:

Q({π1, …, πk}) = Σ(j=1..k) Σ(a ∈ πj) a^T cj   (4.16)

Intuitively, the objective function measures the combined coherence of all the

k clusters. Such an objective function has also been proposed and studied

theoretically in the context of market segmentation problems (Kleinberg et al., 1998).


Spherical k-means algorithm:

1) Initialize clustering. Start with some initial partitioning of the document vectors, namely {πj(0)}, j = 1, …, k, and let {cj(0)}, j = 1, …, k, be the concept vectors of the associated partitioning. Set the iteration count t to 0.

2) For each document vector ai, 1 ≤ i ≤ n, find the concept vector closest in cosine similarity to ai. Now, compute the new partitioning {πj(t+1)}, j = 1, …, k, induced by the old concept vectors {cj(t)}, j = 1, …, k:

πj(t+1) = { ai : ai^T cj(t) ≥ ai^T cl(t), 1 ≤ l ≤ k },  1 ≤ j ≤ k.   (4.17)

In words, πj(t+1) is the set of all document vectors that are closest to the concept vector cj(t). If it happens that some document vector is simultaneously closest to more than one concept vector, then it is randomly assigned to one of the clusters. Clusters defined using (4.17) are known as Voronoi or Dirichlet partitions.

3) Compute the new concept vectors corresponding to the partitioning computed in (4.17):

cj(t+1) = mj(t+1) / ||mj(t+1)||,  1 ≤ j ≤ k   (4.18)

where mj(t+1) denotes the centroid or the mean of the document vectors in cluster πj(t+1).

4) If some "stopping criterion" is met, then set πj(F) = πj(t+1) and cj(F) = cj(t+1) for 1 ≤ j ≤ k, and exit. Otherwise, increment t by 1, and go to step 2 above. An example of a stopping criterion is: Stop if


Q({πj(t+1)}) − Q({πj(t)}) ≤ ε,   (4.19)

for some suitably chosen ε ≥ 0. In words, stop if the “change” in objective

function after an iteration of the algorithm is less than a certain threshold. We

now establish that the spherical k-means algorithm outlined above never

decreases the value of the objective function.

Like any other gradient-ascent scheme, the spherical k-means algorithm is prone to local maxima. A careful selection of the initial partitions {πj(0)}, j = 1, …, k, is important. One can either (a) randomly assign each document to one of the k clusters, (b) first compute the concept vector for the entire document collection and randomly perturb this vector to get k starting concept vectors, or (c) try several initial clusterings and select the best in terms of the largest objective function. In my implementation I use strategy (b).
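The four steps can be sketched as follows (our illustration with invented unit-norm documents; for determinism the initial partitioning is a fixed round-robin assignment instead of strategies (a)-(c)):

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def centroid(vectors):
    return [sum(v[j] for v in vectors) / len(vectors)
            for j in range(len(vectors[0]))]

def spherical_kmeans(docs, k, iters=20):
    # step 1: deterministic round-robin initial partitioning (the thesis
    # uses the perturbed-collection-centroid strategy (b) instead)
    parts = [[d for i, d in enumerate(docs) if i % k == j] for j in range(k)]
    concepts = [unit(centroid(p)) for p in parts]
    for _ in range(iters):
        # step 2: assign each document to the concept closest in cosine
        parts = [[] for _ in range(k)]
        for d in docs:
            sims = [sum(x * y for x, y in zip(d, c)) for c in concepts]
            parts[sims.index(max(sims))].append(d)
        # step 3: recompute concept vectors (keep the old one if empty)
        concepts = [unit(centroid(p)) if p else c
                    for p, c in zip(parts, concepts)]
    return concepts, parts

# invented unit-norm documents: two about one topic, two about another
docs = [unit([1.0, 1.0, 0.0]), unit([1.0, 0.8, 0.0]),
        unit([0.0, 0.1, 1.0]), unit([0.0, 0.0, 1.0])]
concepts, parts = spherical_kmeans(docs, k=2)
print([len(p) for p in parts])   # the two topic groups: [2, 2]
```

The sketch uses a fixed iteration count instead of the change-in-Q criterion (4.19); adding that test is a one-line extension of the loop.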


4.2.2. FUZZY K-MEANS

The fuzzy k-means algorithm (FKM) (see [9], [26], [27], [34]) generalizes the classical or hard k-means algorithm. The goal of the k-means algorithm is to cluster n objects (here documents) into k clusters and to find k mean vectors for the clusters (centroids). In the context of the vector space model for information retrieval we call these mean vectors concepts. The spherical k-means algorithm used in [7] is just a variation of the hard k-means algorithm which exploits the fact that the document vectors (and the concept vectors) are of unit norm.

As opposed to the hard k-means algorithm, which allows a document to belong to only one cluster, FKM allows a document to partially belong to multiple clusters. FKM seeks a minimum of a heuristic global cost function

Jfuzz = Σ(i=1..k) Σ(j=1..n) μij^b ||aj − ci||^2 ,   (4.20)

where aj, j = 1, …, n, are the document vectors, ci, i = 1, …, k, are the concept vectors, μij is the fuzzy membership degree of document aj in the cluster whose concept is ci, and b is a weight exponent of the fuzzy membership. In general, the Jfuzz criterion is minimized when the concept ci is near those points that have a high fuzzy membership degree for cluster i, i = 1, …, k. By solving the system of equations ∂Jfuzz/∂ci = 0 and ∂Jfuzz/∂μij = 0 we obtain the stationary point

μij = 1 / Σ(r=1..k) ( ||aj − ci||^2 / ||aj − cr||^2 )^(1/(b−1)) ,  i = 1, …, k,  j = 1, …, n,   (4.21)


ci = Σ(j=1..n) μij^b aj / Σ(j=1..n) μij^b ,  i = 1, …, k,   (4.22)

for which the cost function achieves a local minimum. We obtain the concept vectors using the following iterative procedure:

Fuzzy k-means algorithm:

1. Start with arbitrary concept vectors ci(0), i = 1, …, k. Set the iteration index t = 0.

2. Compute the fuzzy membership degrees μij(0) according to formula (4.21).

3. Compute the cost function Jfuzz(0) according to formula (4.20).

4. Compute new concept vectors ci(t+1) according to formula (4.22).

5. Compute new fuzzy membership degrees μij(t+1) according to formula (4.21).

6. Compute a new cost function Jfuzz(t+1) according to formula (4.20).

7. If |Jfuzz(t+1) − Jfuzz(t)| ≤ ε for some threshold ε, then stop and return the concept vectors; else increment t by 1 and go to step 4.
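The procedure can be sketched as follows (our illustration with the weight exponent b = 2 and invented two-dimensional points, not the thesis implementation):

```python
def dist2(a, c):
    return sum((x - y) ** 2 for x, y in zip(a, c))

def memberships(docs, concepts, b=2.0):
    # formula (4.21): fuzzy membership of each document in each cluster
    mu = []
    for ci in concepts:
        row = []
        for a in docs:
            di = dist2(a, ci)
            row.append(1.0 / sum((di / dist2(a, cr)) ** (1.0 / (b - 1.0))
                                 for cr in concepts))
        mu.append(row)
    return mu

def new_concepts(docs, mu, b=2.0):
    # formula (4.22): membership-weighted means of the documents
    out = []
    for row in mu:
        w = [m ** b for m in row]
        s = sum(w)
        out.append([sum(w[j] * docs[j][d] for j in range(len(docs))) / s
                    for d in range(len(docs[0]))])
    return out

def cost(docs, concepts, mu, b=2.0):
    # formula (4.20)
    return sum(mu[i][j] ** b * dist2(docs[j], concepts[i])
               for i in range(len(concepts)) for j in range(len(docs)))

docs = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]   # invented points
concepts = [[0.0, 0.1], [0.9, 0.9]]                       # arbitrary start
mu = memberships(docs, concepts)
j_start = cost(docs, concepts, mu)
for _ in range(10):                                       # steps 4-6
    concepts = new_concepts(docs, mu)
    mu = memberships(docs, concepts)
j_end = cost(docs, concepts, mu)
print(j_start, j_end)   # the cost function decreases
```

A full implementation would also guard against a document coinciding exactly with a concept (which makes a distance in (4.21) zero) and use the threshold test of step 7 instead of a fixed iteration count.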


4.3. LSI AND CI COMPARISON EXAMPLE

Example 4.1 [8] In this example LSI (SVD) and CI using fuzzy k-means (CDFKM) are compared for the document collection and queries from Example 2.1.

The term-document matrix was created and its columns normalized to unit norm. To this matrix we applied CDFKM (concept decomposition with fuzzy k-means, k = 2) and the truncated SVD (k = 2). Let the truncated SVD be U2 Σ2 V2^T and the CDFKM decomposition be C2 Z*. In the truncated SVD, the rows of U2 are the approximate (two-dimensional) representations of terms, while the rows of V2 are the approximate (two-dimensional) representations of documents. Here we neglect the Σ2 part, since Σ2 is a diagonal matrix and produces only a scaling of the axes. In CDFKM, the rows of C2 are approximate representations of terms and the columns of Z* are approximate

representations of documents. Coordinates of terms are listed in Table 4.1, while

coordinates of documents and queries are listed in Table 4.2. The images of the terms are also plotted in Figures 4.2 and 4.3. From Figure 4.2 we can see that the images of the two

groups of terms, data mining (DM) terms and linear algebra (LA) terms are grouped

together in the case of the truncated SVD. In the case of CDFKM, the two groups of terms are generally grouped along the axes: along the y axis (and near it) we have DM terms, and along the x axis we have LA terms. The exceptions are the terms information and retrieval. Our assumption is that this is because the model was confused by document D6, which contains these terms together with LA terms.

Most of the DM documents do not contain words data and mining. Such

documents will not be recognized by the simple term-matching vector space method

as relevant. Document D6, relevant for Q2, does not contain any of the terms

contained in the query. In the vector space model, the query has the same form as

the document. Let q be a representation of the query in the vector space model and

q~ its approximate representation using truncated SVD.


Term SVD CDFKM

xi yi xi yi

text 0.21 -0.31 0.10 0.43

mining 0.16 -0.29 0.01 0.42

clustering 0.24 -0.41 0.08 0.48

classification 0.12 -0.18 0.00 0.23

retrieval 0.27 -0.20 0.29 0.11

analysis 0.21 -0.11 0.29 0.00

information 0.09 -0.16 0.00 0.32

linear 0.25 -0.41 0.10 0.47

algebra 0.19 0.14 0.20 0.00

matrix 0.50 0.40 0.54 0.00

application 0.37 0.25 0.40 0.00

document 0.29 0.19 0.32 0.00

vector 0.29 0.20 0.32 0.00

space 0.21 -0.09 0.19 0.12

data 0.19 0.14 0.20 0.00

algorithm 0.11 -0.17 0.18 0.07

Table 4.1 Coordinates of the terms by SVD and CDFKM

Document LSI Space CI Space

xi yi xi yi

D1 0.24 -0.35 0.06 0.74

D2 0.24 -0.20 0.38 0.26

D3 0.33 0.26 0.69 -0.14

D4 0.33 0.26 0.69 -0.14

D5 0.11 -0.19 -0.04 0.54

D6 0.34 0.08 0.74 -0.10

D7 0.35 0.22 0.71 -0.08

D8 0.34 0.26 0.71 -0.14

D9 0.22 -0.25 0.38 0.25

D10 0.34 0.26 0.71 -0.14

D11 0.22 -0.31 0.06 0.64

D12 0.19 -0.33 -0.01 0.67

D13 0.13 -0.23 0.11 0.37

D14 0.14 -0.25 -0.08 0.69

D15 0.16 -0.28 -0.05 0.64

Q1 0.22 -0.40 -0.07 0.90

Q2 0.57 -0.03 0.70 0.75

Table 4.2 Coordinates of documents and queries by SVD and CDFKM


Figure 4.2 Images of terms by LSI (scatter plot; DM terms vs. LA terms)

Figure 4.3 Images of terms by CI (scatter plot; DM terms vs. LA terms)


Then, the following is satisfied

q̃^T = q^T U2 Σ2^(−1).

On the other side, since documents are represented as the columns of Z* = (Ck^T Ck)^(−1) Ck^T A in CD, the approximate representation of the query by CD will be q̃ = (Ck^T Ck)^(−1) Ck^T q. In Figures 4.4 and 4.5, images of the approximate representations of

documents and queries are plotted. In the SVD projection, DM documents form one

group, LA documents another and the D6 document is isolated. In the CD projection,

LA documents are grouped; DM documents are somewhat more dispersed, while D6

document is in the group of LA documents. Shaded areas represent the area of

relevant documents for queries in the cosine similarity sense.

The documents retrieved for query Q1, in descending order of their score for the term-matching method, are: D15, D12, D14, D9, D11 and D1. Other documents are

not retrieved at all, since their score is 0. So, the term-matching method has retrieved

6 out of 10 relevant documents. The retrieved documents for Q1 applying LSI are: D1,

D11, D12, D9, D15, D2, D14, D13, D5 and D6. The score for other documents is much

lower and we can state that other documents are not retrieved at all. The retrieved

documents are exactly all the relevant documents. The retrieved documents for Q1

applying CI are: D1, D14, D12, D11, D15, D5, D13, D2 and D9. These are all the relevant

documents except D6 document. For query Q2, only D6 document is relevant. The

term-matching method does not retrieve it at all, the LSI method recognizes D6 as the

most relevant document (although it does not contain any term from the query) and

the CI method retrieves D6 as the sixth most relevant document.

As a conclusion of this academic example we can state that the LSI and CI methods have a similar effect: they cluster documents on similar topics even if their term profiles are different. On this example, LSI seems to perform better. In the EXPERIMENTS AND RESULTS section, we compare these two techniques on much larger document collections to achieve statistically significant comparisons.


Figure 4.4 Images of documents and queries by LSI (scatter plot; DM documents, LA documents, document D6 and queries Q1, Q2)

Figure 4.5 Images of documents and queries by CI (scatter plot; DM documents, LA documents, document D6 and queries Q1, Q2)

The clustering of the document collection from this example was performed by our implemented system. In the System implementation section another clustering of this collection is presented.


4.4. LSI FOLDING-IN ALGORITHM

Document collections on which information retrieval is performed are often dynamic: documents are continuously added to and removed from the collection. The best example of a dynamic document collection is the World Wide Web (WWW). In addition, adding new documents to a document collection introduces new index terms, so it is necessary to update the index term list. This thesis only considers adding new documents (folding in) to a document collection.

Adding new documents to a document collection is not an obstacle in the vector space model. A problem appears when the document collection is represented in a reduced-dimensionality space. Namely, the vectors onto which the projection is performed during dimensionality reduction are acquired using the whole document collection, so folding new documents in would require re-computing the singular and concept vectors, which is, of course, not practical. In this thesis only the folding-in method without re-computation of the transformation matrix for LSI is presented. The expanded term-document matrix A is acquired after adding new documents, where A1 represents the initial documents in the initial term space and A2 represents the new documents in the initial term space:

A = [A1 A2]   (4.23)

With m we denote the initial term count, with n1 the initial document count and with n2 the new document count. Then the matrix A1 is of size m × n1 and the matrix A2 is of size m × n2. Furthermore, the transformations txn(A1) → A1 and txn(A2) → A2 are performed; in other words, the column vectors of A1 and A2 are normalized.


The truncated SVD is represented as A1k = Uk Σk Vk^T. The query matrix Q is of size q × m (the queries contain only initial terms).

This algorithm represents documents and queries in the LSI space without introducing coordinates for new index terms and without correction of the singular vectors. The algorithm is performed as described below.

Algorithm:

1) Compute the representation DNEW of the new document vectors in the reduced-dimensionality space by using the formula DNEW = A2^T Uk Σk^(−1).

2) Compute the representation Queries of the query vectors in the reduced-dimensionality space by using the formula Queries = Q Uk Σk^(−1).

3) Form the document matrix DALL composed of the vector representations of the initial and new documents in the reduced-dimensionality space:

DALL = [ Vk ; DNEW ]   (Vk stacked on top of DNEW)


5. SYSTEM IMPLEMENTATION

Our system implements two dimensionality reduction methods, Concept Indexing (CI) and Latent Semantic Indexing (LSI), with more emphasis on the CI method. Because of lack of time, the LSI method was developed in the MATLAB programming language, while the CI method was developed from three main components and is more detailed and elaborated.

The first component of the CI system is Intel's Math Kernel Library (the BLAS and LAPACK libraries [11]). Because IR in this thesis is based on the vector space model, we found Intel's BLAS and LAPACK libraries suitable for implementing that model. The second component is the core implementation (back-end), written in C++ in Microsoft Visual Studio .NET, and the third component is a graphical user interface, GUI (front-end), written in C# in Microsoft Visual Studio .NET. Each system component is described in the following paragraphs.

1) The BLAS and LAPACK libraries are time-honored standards for solving a

large variety of linear algebra problems. The Intel Math Kernel Library (Intel

MKL) contains an implementation of BLAS and LAPACK that is highly

optimized for Intel processors. Intel MKL can enable you to achieve significant

performance improvements over alternative implementations of BLAS and

LAPACK.

BLAS

Basic Linear Algebra Subroutines (BLAS) provide the basic vector and matrix

operations underlying many linear algebra problems. Intel MKL BLAS support

includes:

BLAS Level 1 - vector-vector operations

BLAS Level 2 - vector-matrix operations

BLAS Level 3 - matrix-matrix operations

Sparse BLAS - an extension of BLAS Level 1


For BLAS Levels 2 and 3, multiple matrix storage schemes are provided. All BLAS functions within Intel MKL are thread-safe. Parallelized (threaded) BLAS Level 3 routines provide the performance gains of multiprocessing without requiring alterations to the application.

Sparse BLAS

Sparse BLAS is a set of functions that perform a number of common vector operations on sparse vectors stored in compressed form. Sparse vectors are those in which the majority of elements are zero. Sparse BLAS routines and functions are specially implemented to take advantage of vector sparsity, which yields large savings in computing time and memory.
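As an illustration of the compressed storage that Sparse BLAS exploits (a sketch only, not Intel's actual API), a sparse vector can be kept as parallel index/value arrays, so a dot product touches only the non-zero elements:

```python
def sparse_dot(indices, values, dense):
    """Dot product of a compressed sparse vector with a dense vector.

    indices: positions of the non-zero elements
    values:  the non-zero elements themselves
    dense:   an ordinary (dense) vector
    """
    return sum(v * dense[i] for i, v in zip(indices, values))

# a length-8 vector with only 3 non-zeros: [0, 2.0, 0, 0, 1.0, 0, 0, 3.0]
idx, val = [1, 4, 7], [2.0, 1.0, 3.0]
dense = [1.0] * 8
result = sparse_dot(idx, val, dense)   # 2.0 + 1.0 + 3.0 = 6.0
```

For document vectors, which are typically more than 95% zeros, this reduces both the storage and the work per dot product from the number of index terms to the number of non-zeros.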

LAPACK

Intel MKL includes Linear Algebra Package (LAPACK) routines that are used

for solving:

Linear equations

Eigenvalue problems

Least-squares problems

Singular value problems

LAPACK routines support both real and complex data. Routines are supported

for systems of equations with the following types of matrices: general, banded,

symmetric or Hermitian, triangular, and tridiagonal. The LAPACK routines

within Intel MKL provide multiple matrix storage schemes. LAPACK routines

are available with a FORTRAN interface.

2) The back-end is composed of two parts. The first part is the core implementation of the concept indexing functionality using Intel's MKL library, written in unmanaged C++. Because we decided to use C# for the GUI, it was necessary to bridge the core implementation from unmanaged to managed C++. The second part of the back-end is therefore a wrapper class, written in managed C++, which serves as a mediator between the core back-end and the GUI.


3) The front-end is written in the C# programming language, which we found very suitable for visualization. The GUI contains three tab-windows: Single test, Batch testing and Query testing.

Figure 5.1 Single test tab-window screenshot

In the Single test window (see Figure 5.1) we can perform the CI algorithm step by step or to completion, with options for adjusting many algorithm parameters, such as the clustering algorithm type (spherical or fuzzy k-means), the number of concept vectors, the type of concept vector initialization (random values, random documents or perturbed centroids) and the initial percentage of documents to select from the document collection. We have also implemented a 2D example mode in which CI can be performed on two types of examples.
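The spherical k-means step that this window runs can be sketched in a few lines (pure Python with illustrative helper names; the actual implementation is in C++ over Intel MKL):

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def spherical_kmeans(docs, concepts, iters=10):
    """docs: unit document vectors; concepts: initial unit concept vectors."""
    clusters = [[] for _ in concepts]
    for _ in range(iters):
        # assign each document to the concept with the largest dot product
        clusters = [[] for _ in concepts]
        for d in docs:
            sims = [sum(a * b for a, b in zip(d, c)) for c in concepts]
            clusters[sims.index(max(sims))].append(d)
        # new concept vector = normalized centroid of each non-empty cluster
        for j, cl in enumerate(clusters):
            if cl:
                centroid = [sum(col) / len(cl) for col in zip(*cl)]
                concepts[j] = normalize(centroid)
    return concepts, clusters

# two well-separated groups of 2D "documents", as in the example mode
docs = [normalize(v) for v in ([1, 0.1], [1, 0.2], [0.1, 1], [0.2, 1])]
concepts, clusters = spherical_kmeans(docs, [normalize([1, 0]), normalize([0, 1])])
```

Because documents and concept vectors are all unit length, maximizing the dot product is the same as maximizing cosine similarity, which is why the concept vectors in the screenshots sit in the "middle" of their document groups.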


In the first example (example type-1), on the graph on the right side of this window, we can define documents in the positive quadrant by clicking on points of the graph (Figure 5.2) and then cluster those documents by running one of the clustering algorithms (Figure 5.3). Blue crosses represent documents and red crosses represent concept vectors.

Figure 5.2 Documents before clustering (example type-1)

Figure 5.3 Documents after clustering (example type-1)


In the second example (example type-2), on the graph on the right side of this window, we can perform clustering (CD and SVD) on the document collection defined in Example 2.1 in Section 2.2; in other words, Example 4.1 from Section 4.2 can be re-run. Figure 5.4 shows the representations of documents and queries in the CI space (a different example of the clustering presented in Figure 4.5).

Figure 5.4 Document and query representations in CI space (example type-2)


Figure 5.5 depicts representations of terms in CI space (different example of

clustering presented in Figure 4.3).

Figure 5.5 Term representations in CI space (example type-2)


Figure 5.6 depicts representations of documents and queries in LSI space (different

example of clustering presented in Figure 4.4).

Figure 5.6 Document and query representations in LSI space (example type-2)


Figure 5.7 depicts representations of terms in LSI space (different example of

clustering presented in Figure 4.2).

Figure 5.7 Term representations in LSI space (example type-2)

The Single test tab-window also shows algorithm information such as document and term counts, memory consumption and algorithm outputs.

The Batch testing tab-window is composed of three sub tab-windows: Options (Figure 5.8), Test list and results (Figure 5.9) and Graph (Figure 5.10).


In the Options tab-window we can adjust many batch-run parameters: the number of repeats for a given test, the concept vector number range (or a fixed number), the clustering algorithm (spherical, fuzzy k-means or both), the type of concept vector initialization (random, random documents or perturbed centroids), the minimum and maximum term frequencies, the initial documents percentage range (or a fixed percentage) and the type of query matching condition. After adjusting the parameters for batch testing, we select Start Batch in the Batch menu. Batch testing can be paused or stopped at any time, because it runs as a separate thread.

Figure 5.8 Batch test tab-window (Options) screenshot

Batch results are presented in the Test list and results tab-window. Each test is represented as one record (row) of the table. The columns of the table represent test attributes such as the number of repeats, the clustering algorithm (spherical or fuzzy k-means), the type of concept vector initialization, the number of concept vectors, the number of CI iterations, the average concept vector dot product, the decomposition error, the mean average precision (MAP) and F1 measure, and the running time of the algorithm.


Figure 5.9 Batch test tab-window (Test list and results) screenshot

In the Graph tab-window we can define parameters for visualizing batch tests. We can choose the quantities on the x and y axes. On the x axis we can observe the number of concept vectors, the initial document percentage or the fuzzy exponent b. We can define two y axes and observe the following measures: the average concept vector dot product, decomposition error, number of iterations, test duration, memory consumption, recall (min, max and avg), precision (min, max and avg), MAP (mean average precision) and F1 (min, max and avg). Four types of graphs can be defined: points for each repeat in the test, test repeats' average values, test repeats' median values, and the test's mean and variation values. Colors and labels can also be defined for each graph.

Graph refreshing is real-time, because drawing runs as a separate thread.


Figure 5.10 Batch test tab-window (Graph) screenshot

In the Query testing tab-window (Figure 5.11), query testing and analysis is enabled through four tables and graph visualization. In the first table each row represents one query (query id, F1, MAP, recall, precision, number of relevant documents, number of retrieved documents, and maximum query-document similarity). The contents of the three other tables change dynamically, depending on which row (query) is selected in the first table. The second table contains the terms of the selected query, the third table contains the ids of the relevant documents, and the last table contains the ids and similarities of the retrieved documents. There are three types of query matching: similarity limit (retrieves all documents with similarity greater than or equal to the limit), fixed number n of retrieved documents (retrieves the n best documents by similarity) and proportion limit (the number of retrieved documents equals the proportion limit times the number of relevant documents). We can draw recall-precision graphs for each query separately or averaged over all queries. We can also draw the distributions of the queries' relevant documents.
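The three matching conditions amount to three cut-off rules over the ranked result list; a sketch with illustrative names:

```python
def by_similarity_limit(ranked, limit):
    """ranked: list of (doc_id, similarity) pairs, sorted by similarity descending."""
    return [d for d, s in ranked if s >= limit]

def by_fixed_number(ranked, n):
    """Keep the n best documents by similarity."""
    return [d for d, _ in ranked[:n]]

def by_proportion_limit(ranked, proportion, num_relevant):
    """Keep proportion * (number of relevant documents) top documents."""
    return [d for d, _ in ranked[:int(proportion * num_relevant)]]

ranked = [(3, 0.9), (1, 0.7), (7, 0.4), (2, 0.1)]
a = by_similarity_limit(ranked, 0.5)      # documents 3 and 1
b = by_fixed_number(ranked, 3)            # the best three documents
c = by_proportion_limit(ranked, 2.0, 2)   # 2.0 * 2 = top four documents
```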


Figure 5.11 Query testing tab-window screenshot


6. EXPERIMENTS AND RESULTS

In this section different experimental results are presented. The experiments compare two dimensionality reduction methods, latent semantic indexing and concept indexing (with spherical and fuzzy k-means clustering). The two methods are compared on two criteria:

the error between the matrix of document vector representations in the original space (bag-of-words representation) and the matrix of documents in the reduced-dimensionality space

information retrieval evaluation

The experiments are performed on two standard information retrieval document collections, MEDLINE and CRANFIELD. MEDLINE is a standard document collection containing 1033 article abstracts related to medicine, with 30 defined queries; CRANFIELD is a standard document collection containing 1400 documents related to aeronautics, with 225 defined queries. Both collections were converted from text documents into matrix representations with the SMART system ([19]).

Furthermore, it is shown that:

concept vectors built by the defined clustering algorithms (spherical and fuzzy k-means) tend to become orthogonal as the algorithms converge

the effectiveness of information retrieval depends greatly on correctly choosing the dimension of the reduced-dimensionality space, which corresponds to the natural number of clusters

Experiments are also performed for the defined folding-in method for LSI (only on the MEDLINE document collection). Finally, memory consumption and running time experiments are performed and their results presented.


6.1. DECOMPOSITION ERROR

The decomposition error is computed as the distance ||A - CZ||_F between the document matrix in the original space and the document matrix in the reduced-dimensionality space. Figure 6.1 shows a graph of the decomposition error against the number of concept/singular vectors used for clustering on the MEDLINE document collection. We see that SVD gives better results than CD with respect to decomposition error, but also that CD using fuzzy k-means clustering is better than CD using spherical k-means clustering.
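The comparison behind Figure 6.1 can be reproduced in miniature (a NumPy sketch with random data standing in for MEDLINE; the matrix Z is obtained by least squares as in concept decomposition, while C here is a random stand-in for the k-means concept vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((40, 30))     # toy term-document matrix
k = 5                        # reduced dimension

# rank-k SVD approximation, optimal in Frobenius norm by Eckart-Young
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_svd = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

# concept decomposition: given k concept vectors (columns of C),
# solve the least-squares problem min ||A - CZ||_F for Z
C = rng.random((40, k))
Z, *_ = np.linalg.lstsq(C, A, rcond=None)
A_cd = C @ Z

err_svd = np.linalg.norm(A - A_svd, 'fro')
err_cd = np.linalg.norm(A - A_cd, 'fro')
```

Since CZ has rank at most k and the rank-k SVD is the best possible rank-k approximation, the SVD error can never exceed the CD error, which is exactly the ordering visible in Figures 6.1 and 6.2; better clustering (here, fuzzy k-means) only narrows the gap.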

Figure 6.1 Relation between the number of concept/singular vectors and the decomposition error for MEDLINE


Figure 6.2 depicts the relation between the decomposition error and the number of concept/singular vectors for the CRANFIELD document collection. This graph confirms that SVD decomposes better than CD on the CRANFIELD collection as well, and also that CD using fuzzy k-means clustering is slightly better than CD using spherical k-means clustering on this collection.

Figure 6.2 Relation between the number of concept/singular vectors and the decomposition error for CRANFIELD


6.2. ORTHONORMALITY OF CONCEPT VECTORS

This experiment shows that concept vectors tend towards orthonormality. Figure 6.3 depicts the average dot product of the concept vectors against the number of concept vectors for the MEDLINE document collection. As the number of concept vectors increases, their average dot product tends towards 0; that is, the vectors tend towards orthonormality. CD using fuzzy k-means clustering shows a stronger tendency towards orthonormality than CD using spherical k-means clustering on the MEDLINE collection.
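The quantity plotted in Figure 6.3, the average pairwise dot product of the (unit-length) concept vectors, can be computed as:

```python
def avg_dot_product(concepts):
    """Average dot product over all distinct pairs of concept vectors."""
    k = len(concepts)
    total, pairs = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            total += sum(a * b for a, b in zip(concepts[i], concepts[j]))
            pairs += 1
    return total / pairs

# an orthonormal set of concept vectors has average dot product 0
orthonormal = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
val = avg_dot_product(orthonormal)
```

Because the concept vectors are already normalized, a pairwise dot product near 0 means the pair is nearly orthogonal, so an average near 0 is evidence that the whole set is approaching orthonormality.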

Figure 6.3 Relation between the number of concept vectors and the average concept vectors dot product for MEDLINE


Unlike CD on MEDLINE, Figure 6.4 shows that on the CRANFIELD document collection there are no great differences in the concept vectors' tendency towards orthonormality between CD using fuzzy and spherical k-means clustering. It is still shown that the concept vectors tend towards orthonormality, though with a weaker tendency than on the MEDLINE collection. This is probably because MEDLINE has fewer documents than CRANFIELD (1033 compared to 1398, respectively) but almost double the number of index terms (7014 compared to 3763, respectively). With a greater number of index terms, orthonormality is easier to achieve because the document vectors are very sparse.

Figure 6.4 Relation between the number of concept vectors and the average concept vectors dot product for CRANFIELD


6.3. CI PERFORMANCE

In this section different CI performance characteristics are measured: the number of iterations, the test duration and the memory consumption, each with respect to the number of concept vectors.

Figure 6.5 shows that CI using fuzzy clustering generally needs more iterations than CI using spherical clustering on the MEDLINE document collection.

Figure 6.5 Relation between the number of concept vectors and the number of iterations for MEDLINE


Figure 6.6 corroborates that CI using fuzzy clustering generally needs more iterations than CI using spherical clustering on the CRANFIELD set as well.

Figure 6.6 Relation between the number of concept vectors and the number of iterations for CRANFIELD


Tests performed with CI using fuzzy k-means clustering last longer than tests with CI using spherical clustering, because of the computation of the fuzzy weights. Figure 6.7 shows the test duration for MEDLINE and Figure 6.8 shows the test duration for CRANFIELD.
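The extra cost comes from the membership (fuzzy weight) matrix, which must be recomputed for every document against every concept vector in each iteration. A sketch of the standard fuzzy k-means membership update with exponent b, assuming distances are given (the thesis follows the algorithm of [9], which may differ in detail):

```python
def fuzzy_weights(dists, b=2.0):
    """Membership of one document in each cluster.

    dists[j]: distance of the document to concept vector j (all > 0)
    b:        fuzzy exponent, b > 1
    Returns weights w with sum(w) == 1; closer concepts get larger weight.
    """
    weights = []
    for dj in dists:
        s = sum((dj / dl) ** (2.0 / (b - 1.0)) for dl in dists)
        weights.append(1.0 / s)
    return weights

w = fuzzy_weights([1.0, 2.0, 4.0])   # nearest concept dominates
```

Spherical k-means stores only one cluster index per document, whereas fuzzy clustering stores this full weight vector per document, which also explains the memory difference discussed below.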

Figure 6.7 Relation between the number of concept vectors and the test duration (in seconds) for MEDLINE


Figure 6.8 Relation between the number of concept vectors and the test duration (in seconds) for CRANFIELD


In Figure 6.9 we see that memory savings are achieved only for fewer than 25 concept vectors (for MEDLINE). We can also see that the memory consumption of CI grows linearly and that CI using fuzzy clustering generally needs more memory than CI using spherical clustering. CI using fuzzy clustering additionally stores the fuzzy weight matrix (number of documents times number of concept vectors), which is why it uses more memory.

Figure 6.9 Relation between number of concept vectors and memory consumption for MEDLINE


Figure 6.10 shows CI’s memory consumption for CRANFIELD.

Figure 6.10 Relation between number of concept vectors and memory consumption for CRANFIELD

We can conclude that, while running, CI using fuzzy clustering needs more iterations, more time and more memory than CI using spherical clustering.


6.4. INFORMATION RETRIEVAL EVALUATION

In this section the results of the information retrieval evaluation experiments are presented. First, IR is evaluated without folding documents into the document collection (MEDLINE and CRANFIELD); after that, the IR evaluation of folding-in for LSI is performed. The 50 documents best ranked by similarity with the query are retrieved, and four IR evaluation measures are computed over all queries: recall, precision, mean average precision (MAP) and F1.

Subsection 6.4.1 presents the experiments without folding-in documents and subsection 6.4.2 those with folding documents into the collection.
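For a single query, the four measures can be sketched as follows (MAP is then the mean of the per-query average precision over all queries; the names below are illustrative):

```python
def evaluate(retrieved, relevant):
    """retrieved: ranked list of doc ids; relevant: set of relevant doc ids."""
    relevant = set(relevant)
    hits = [d for d in retrieved if d in relevant]
    recall = len(hits) / len(relevant)
    precision = len(hits) / len(retrieved)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # average precision: mean over relevant documents of the precision
    # at the rank where each relevant document is retrieved
    ap, found = 0.0, 0
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            found += 1
            ap += found / rank
    ap /= len(relevant)
    return recall, precision, f1, ap

# 4 retrieved documents, of which 1 and 2 are among the 4 relevant ones
r, p, f1, ap = evaluate([1, 9, 2, 8], {1, 2, 3, 4})
```

Average precision rewards placing relevant documents early in the ranking, which is why MAP and F1 are treated below as more informative than raw recall and precision at a fixed cutoff.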


6.4.1. WITHOUT FOLDING-IN DOCUMENTS

Figure 6.11 shows the average recall over all queries against the number of concept/singular vectors for the MEDLINE document collection. Recall for LSI is lower than recall for CI. Furthermore, CI achieves fairly similar recall regardless of which clustering algorithm is used.

Figure 6.11 Relation between the number of concept/singular vectors and the average recall of all queries for MEDLINE


Figure 6.12 depicts the average recall over all queries against the number of concept/singular vectors for the CRANFIELD document collection. Unlike on MEDLINE, the situation on CRANFIELD is inverted: recall is greater for LSI than for CI. Similarly to the MEDLINE collection, the spherical and fuzzy k-means clustering algorithms give similar recall.

Figure 6.12 Relation between the number of concept/singular vectors and the average recall of all queries for CRANFIELD


Moreover, the relation between precision and the number of concept/singular vectors is shown in Figure 6.13 for MEDLINE and in Figure 6.14 for CRANFIELD.

In Figure 6.13 we see that precision is greater for LSI than for CI. So far, then, LSI has greater precision than CI but lower recall on MEDLINE. Regarding precision, CI with spherical k-means gives slightly better results than CI with fuzzy k-means clustering.

Figure 6.13 Relation between the number of concept/singular vectors and the average precision of all queries for MEDLINE


On the CRANFIELD collection we see that both precision and recall are greater for LSI than for CI. There is almost no difference between CI with spherical and with fuzzy k-means clustering.

Figure 6.14 Relation between the number of concept/singular vectors and the average precision of all queries for CRANFIELD

So far we have presented recall and precision, but these measures alone are not adequate and satisfactory for information retrieval evaluation. More adequate measures are MAP and F1, so these are presented in the following experimental results.


Figure 6.15 shows query-document matching in the original space and in the reduced-dimensionality spaces (LSI and CI). We can see that LSI generally gives greater MAP than CI, but also that LSI and CI (with spherical and fuzzy k-means clustering) reach their maximum MAP at 50 concept vectors. We can assume that the MEDLINE document collection divides naturally into 50 clusters. Furthermore, CI using fuzzy k-means clustering gives greater MAP than CI using spherical clustering at the natural number of clusters (50), but for more clusters it is vice versa. The graph also shows that better results are achieved in the reduced space than in the original space, even though information is discarded by the reduction. This is because LSI and CI address the problems of synonymy and polysemy, redundancy and noise.

Figure 6.15 Relation between the number of concept/singular vectors and the average MAP of all queries for MEDLINE


Unlike on MEDLINE, the results of the experiments performed on CRANFIELD (see Figure 6.16) are different. Better results regarding MAP are achieved in the original than in the reduced-dimensionality space. Also, the LSI MAP graph grows monotonically; in other words, it does not reach a maximum. The CI MAP graphs begin to saturate at 100 concept vectors. The MAP graph for spherical CI almost completely matches the MAP graph for fuzzy CI.

Figure 6.16 Relation between the number of concept/singular vectors and the average MAP of all queries for CRANFIELD


Figure 6.17 shows much better results for LSI than for CI regarding the F1 measure. We can also see that LSI gives better results than query-document matching in the original space. Furthermore, CI using spherical k-means gives a slightly better F1 measure than CI using fuzzy clustering.

Figure 6.17 Relation between the number of concept/singular vectors and the average F1 of all queries for MEDLINE


On CRANFIELD, the CI method also gives lower F1 values than LSI (see Figure 6.18). The F1 value of query-document matching in the original space is reached by the LSI method at 100 singular vectors.

Figure 6.18 Relation between the number of concept/singular vectors and the average F1 of all queries for CRANFIELD


In Figure 6.19 we see the recall-precision graph for CI (fuzzy clustering) with 50 concept vectors on the MEDLINE document collection.

Figure 6.19 Recall-precision graph for MEDLINE

Furthermore, in Figure 6.20 we see the recall-precision graph for CI (fuzzy clustering) with 100 concept vectors on the CRANFIELD document collection.


Figure 6.20 Recall-precision graph for CRANFIELD

In conclusion from these experiments, we can state that LSI generally gives better IR evaluation results than CI on both document collections (MEDLINE and CRANFIELD), and that better IR evaluation results are achieved for MEDLINE than for CRANFIELD. Furthermore, the differences between CI using spherical and fuzzy k-means clustering are slight.


6.4.2. WITH FOLDING-IN DOCUMENTS

In this section the folding-in method for LSI is tested, on the MEDLINE document collection only. Tests are performed with the initial documents ranging from 10% to 100% of the collection, in steps of 10%.

Figure 6.21 shows that recall grows with a greater initial documents percentage. We can also see that maximum recall is reached when the initial documents make up 80% of all documents in the collection. This is probably because between 80% and 100% of initial documents few or no new terms occur.

Figure 6.21 Relation between the initial document percentage and recall


Precision graphs (Figure 6.22) have the same characteristics as recall graphs.

Figure 6.22 Relation between the initial document percentage and precision


Figure 6.23 shows MAP measure.

Figure 6.23 Relation between the initial document percentage and MAP


And finally Figure 6.24 depicts F1 measure.

Figure 6.24 Relation between the initial document percentage and F1

The conclusion of this experiment is that IR evaluation results improve as the initial document percentage increases. The experiment also shows that most of the terms are covered by 80% of the initial documents.


7. RELATED WORKS

Dhillon and Modha in their work [7] compare concept decomposition (CD), which they developed, to singular value decomposition (SVD) in terms of the similarity of the original document matrix and its approximation. In that work, however, they do not investigate the suitability of the concept decomposition document representation for text mining tasks such as information retrieval or text classification.

Dimensionality reduction methods using cluster (class) centroids were developed in [13], [35], [36], [37] and [38]. In Karypis's and Han's work [13] a dimensionality reduction technique using class centroid projection is used, but it is not based on least mean squares (as concept decomposition is). They investigate the effectiveness of this technique through information retrieval evaluation, using both supervised and unsupervised versions of the algorithm.

Park and co-workers in [35] developed an orthonormal-centroid-based algorithm for dimensionality reduction and compared its effectiveness with concept decomposition for text classification. Ye and co-workers in [38] presented the IDR/QR dimensionality reduction algorithm, which uses the QR decomposition ([23]).

Dimensionality reduction methods using kernel functions can also be implemented. Park and Park [37] implemented the orthonormal centroid algorithm using kernel functions. This approach enables non-linearity to be imposed effectively for classification with support vector machines (SVM).

Dhillon and co-workers in [6], prompted by the poor effectiveness of the spherical k-means clustering algorithm for concept decomposition, present a divisive information-theoretic feature clustering algorithm whose objective function uses measures from information theory.


Kogan and co-workers in [14] optimize k-means by combining the batch and incremental k-means clustering algorithms. They use a distance-like function which combines Euclidean distance and relative entropy.

In [9] the fuzzy k-means clustering algorithm is presented, and in [8] a comparison of CI and LSI is presented on a simple example.

Folding new documents into the reduced-dimensionality space for LSI is presented in [2], where SVD updating retains, but SVD folding-in does not retain, the orthonormality of the singular vectors. That work also uses the semidiscrete decomposition for LSI ([25]).

The incremental dimensionality reduction algorithm IDR/QR is presented in [38]. While folding new documents into the document collection, this algorithm updates the transformation matrix that maps documents into the reduced-dimensionality space. This dimensionality reduction method is supervised, and its authors test its effectiveness for text classification.


8. CONCLUSIONS

In this thesis the LSI and CI dimensionality reduction methods are experimentally compared.

Our experiments have shown that concept vectors tend towards orthonormality. Concept vectors are local and have well-defined semantics, while singular vectors are global and cannot be interpreted. Concept vectors are also very sparse (often more than 85%). Regarding approximation error, LSI gives the best approximation of the document matrix in the reduced-dimensionality space, although CI's approximation error is comparable to LSI's.

Information retrieval using the LSI method is more effective at higher levels of recall than information retrieval in the original space, because LSI addresses the problem of synonymy, which affects recall. Overall, we achieve better information retrieval results with LSI than with CI.

We have also compared the clustering algorithms (fuzzy and spherical) for CI. CI using fuzzy k-means clustering gives slightly better IR evaluation results than CI using spherical clustering. We have also shown that the spherical clustering algorithm is faster and needs less memory than the fuzzy clustering algorithm. Better IR evaluation results are achieved for MEDLINE than for CRANFIELD with both dimensionality reduction methods (CI and LSI).

Regarding folding documents into the document collection without re-computing the transformation matrix (the truncated SVD), our experiments show that the best IR evaluation results are achieved with 80% initial documents. This indicates that folding-in without re-computation and correction of the transformation matrix gives solid results only for higher percentages of initial documents.

From the results of the experiments we have performed, it can be concluded that LSI and CI are comparable with regard to IR.


9. REFERENCES

[1] M. W. Berry, Z. Drmač, E. R. Jessup, Matrices, Vector Spaces and Information Retrieval, SIAM Review, Vol. 41, no. 2, pp. 335–362, 1999.

[2] M. W. Berry, S. T. Dumais, G. W. O’Brien, Using Linear Algebra for Intelligent Information Retrieval, SIAM Review, Vol. 37, no. 4, pp. 573–595, 1995.

[3] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman, Indexing by latent semantic analysis, J. American Society for Information Science, Vol. 41, no. 6, pp. 391–407, 1990.

[4] I.S. Dhillon, J. Fan, Y. Guan, Efficient Clustering of Very Large Document

Collections, Data Mining for Scientific and Engineering Applications, Kluwer

Academic Publishers, pp. 357–381, 2001.

[5] I.S. Dhillon, Y. Guan and J. Kogan, Refining Clusters in High Dimensional Text

Data, 2nd SIAM International Conference on Data Mining (Workshop on

Clustering High-Dimensional Data and its Applications), April 2002.

[6] I.S. Dhillon, S. Mallela, R. Kumar, A divisive information-theoretic feature

clustering algorithm for text classification, Journal of Machine Learning

Research: Special Issue on Variable and Feature Selection, Vol. 3, pp. 1265–

1287., March 2003.

[7] I.S. Dhillon, D. S. Modha, Concept decomposition for large sparse text data

using clustering, Machine Learning, Vol. 42, no. 1, pp. 143-175., 2001.

[8] J. Dobša, B. Dalbelo Bašić, Comparison of information retrieval techniques:

Latent semantic indexing and concept indexing, Journal of Information and

Organizational Sciences 28 1-2, pp. 1-15., 2004.

[9] J. Dobša, B. Dalbelo Bašić, Concept decomposition by fuzzy k-means algorithm, Proceedings of the IEEE/WIC International Conference on Web Intelligence, pp. 684–688, Halifax, Canada, 2003.

[10] Gmeans: clustering in ping-pong style, http://www.cs.utexas.edu/users/yguan/datamining/gmeans.html [19.6.2006].

[11] Intel® Math Kernel Library, http://www.intel.com/cd/software/products/asmo-na/eng/perflib/mkl/index.htm [19.6.2006].

[12] A. K. Jain, M. N. Murty, P. J. Flynn, Data Clustering: A Review, ACM Computing Surveys, Vol. 31, No. 3, pp. 264–323, 1999.

[13] G. Karypis, E.-H. S. Han, Concept Indexing: A Fast Dimensionality Reduction Algorithm with Applications to Document Retrieval and Categorization, Technical Report TR-00-0016, University of Minnesota, 2000.

[14] J. Kogan, M. Teboulle, C. Nicholas, Data driven similarity measures for k-means like clustering algorithms, Information Retrieval, Vol. 8, pp. 331–349, 2005.

[15] T. A. Letsche, M. W. Berry, Large-scale Information Retrieval with Latent Semantic Indexing, Information Sciences - Applications, 1997.

[16] NSF Research Awards Abstracts 1990–2003, http://www.ics.uci.edu/~kdd/databases/nsfabs/nsfawards.html [19.6.2006].

[17] F. Sebastiani, A Tutorial on Automated Text Categorisation, Proceedings of ASAI-99, 1st Argentinian Symposium on Artificial Intelligence, Buenos Aires, AR, pp. 7–35, 1999.

[18] F. Sebastiani, Machine Learning in Automated Text Categorisation, ACM Computing Surveys, Vol. 34, No. 1, pp. 1–47, March 2002.

[19] SMART collections, ftp://ftp.cs.cornell.edu/pub/smart/ [19.6.2006].

[20] Latent Semantic Indexing Web Site, http://www.cs.utk.edu/~lsi/ [19.6.2006].

[21] S. Dominich, Information Retrieval - An Advanced Course, 2005.

[22] T. Hofmann, Introduction to Machine Learning, November 2003.

[23] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing & Management, Vol. 24, No. 5, pp. 513–523, 1988.

[24] G. Salton, M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.

[25] T. G. Kolda, D. P. O’Leary, A semi-discrete matrix decomposition for latent semantic indexing in information retrieval, ACM Transactions on Information Systems, Vol. 16, pp. 322–346, 1998.

[26] J. C. Bezdek, A convergence theorem for the fuzzy ISODATA clustering algorithms, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-2, No. 1, pp. 1–8, 1980.

[27] J. C. Bezdek, R. J. Hathaway, Convergence theory for fuzzy c-means: Counterexamples and repairs, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 17, No. 5, pp. 873–877, 1987.

[28] C. L. Lawson, R. J. Hanson, Solving Least Squares Problems, SIAM, Philadelphia, 1995.

[29] C. Eckart, G. Young, The approximation of one matrix by another of lower rank, Psychometrika, Vol. 1, pp. 211–218, 1936.

[30] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press / Addison-Wesley, New York, 1999.

[31] M. E. Maron, J. L. Kuhns, On relevance, probabilistic indexing and information retrieval, Journal of the ACM, Vol. 7, No. 3, pp. 216–244, 1960.

[32] Y. Yang, An evaluation of statistical approaches to text categorization, Information Retrieval, Vol. 1, No. 1-2, pp. 69–90, 1999.

[33] L. D. Baker, A. K. McCallum, Distributional clustering of words for text categorisation, Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, pp. 96–103, Melbourne, Australia, 1998.

[34] J. Yen, R. Langari, Fuzzy Logic: Intelligence, Control and Information, Prentice Hall, New Jersey, 1999.

[35] H. Kim, P. Howland, H. Park, Dimension reduction in text classification using support vector machines, Journal of Machine Learning Research, Vol. 6, pp. 37–53, 2005.

[36] H. Park, M. Jeon, J. B. Rosen, Lower dimensional representation of text data based on centroids and least squares, BIT, Vol. 43, No. 3, pp. 1–22, 2003.

[37] C. Park, H. Park, Nonlinear feature extraction based on centroids and kernel functions, Pattern Recognition, Vol. 37, No. 4, pp. 801–810, 2004.

[38] J. Ye, Q. Li, H. Xiong, H. Park, R. Janardan, V. Kumar, IDR/QR: An incremental dimension reduction algorithm via QR decomposition, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 9, pp. 1208–1222, 2005.

TABLES INDEX

Table 2.1 Documents and their categorization (example 2.1)
Table 2.2 Document ranking by similarity with Q1 and Q2 (example 2.1)
Table 2.3 IDF components of index terms (example 2.2)
Table 2.4 Document ranking by similarity with Q1 and Q2 (example 2.2)
Table 2.5 The contingency table for one query
Table 4.1 Coordinates of the terms by SVD and CDFKM
Table 4.2 Coordinates of documents and queries by SVD and CDFKM

FIGURES INDEX

Figure 2.1 Document-term matrix
Figure 2.3 The first document from MEDLINE document collection
Figure 2.4 Some of the queries for MEDLINE document collection
Figure 2.5 The first document from CRANFIELD document collection
Figure 2.6 Some of the queries for CRANFIELD document collection
Figure 3.1 Process of data clustering
Figure 3.2 Taxonomy of clustering approaches
Figure 4.1 Reduced singular value decomposition (truncated SVD)
Figure 4.2 Images of terms by LSI
Figure 4.3 Images of terms by CI
Figure 4.4 Images of documents and queries by LSI
Figure 4.5 Images of documents and queries by CI
Figure 5.1 Single test tab-window screenshot
Figure 5.2 Documents before clustering (example type-1)
Figure 5.3 Documents after clustering (example type-1)
Figure 5.4 Document and query representations in CI space (example type-2)
Figure 5.5 Term representations in CI space (example type-2)
Figure 5.6 Document and query representations in LSI space (example type-2)
Figure 5.7 Term representations in LSI space (example type-2)
Figure 5.8 Batch test tab-window (Options) screenshot
Figure 5.9 Batch test tab-window (Test list and results) screenshot
Figure 5.10 Batch test tab-window (Graph) screenshot
Figure 5.11 Query testing tab-window screenshot
Figure 6.1 Relation between the number of concept/singular vectors and the decomposition error for MEDLINE
Figure 6.2 Relation between the number of concept/singular vectors and the decomposition error for CRANFIELD
Figure 6.3 Relation between the number of concept vectors and the average concept vectors dot product for MEDLINE
Figure 6.4 Relation between the number of concept vectors and the average concept vectors dot product for CRANFIELD
Figure 6.5 Relation between the number of concept vectors and the number of iterations for MEDLINE
Figure 6.6 Relation between the number of concept vectors and the number of iterations for CRANFIELD
Figure 6.7 Relation between the number of concept vectors and the test duration (in seconds) for MEDLINE
Figure 6.8 Relation between the number of concept vectors and the test duration (in seconds) for CRANFIELD
Figure 6.9 Relation between the number of concept vectors and memory consumption for MEDLINE
Figure 6.10 Relation between the number of concept vectors and memory consumption for CRANFIELD

Figure 6.11 Relation between the number of concept/singular vectors and the average recall of all queries for MEDLINE
Figure 6.12 Relation between the number of concept/singular vectors and the average recall of all queries for CRANFIELD
Figure 6.13 Relation between the number of concept/singular vectors and the average precision of all queries for MEDLINE
Figure 6.14 Relation between the number of concept/singular vectors and the average precision of all queries for CRANFIELD
Figure 6.15 Relation between the number of concept/singular vectors and the average MAP of all queries for MEDLINE
Figure 6.16 Relation between the number of concept/singular vectors and the average MAP of all queries for CRANFIELD
Figure 6.17 Relation between the number of concept/singular vectors and the average F1 of all queries for MEDLINE
Figure 6.18 Relation between the number of concept/singular vectors and the average F1 of all queries for CRANFIELD
Figure 6.19 Recall-precision graph for MEDLINE
Figure 6.20 Recall-precision graph for CRANFIELD
Figure 6.21 Relation between the initial document percentage and recall
Figure 6.22 Relation between the initial document percentage and precision
Figure 6.23 Relation between the initial document percentage and MAP
Figure 6.24 Relation between the initial document percentage and F1