
MALAY-ENGLISH PARALLEL CLUSTERING USING

HIERARCHICAL AGGLOMERATIVE CLUSTERING

CHAN CHEN JIE

THESIS SUBMITTED IN FULFILMENT FOR THE BACHELOR OF COMPUTER SCIENCE

(SOFTWARE ENGINEERING)

SCHOOL OF ENGINEERING AND INFORMATION TECHNOLOGY

UNIVERSITI MALAYSIA SABAH 2011


DECLARATION

I hereby declare that this piece of work was completed by me, except for the use of some resources for reference and information, which I have declared in my writing.

6-5-2011

________________

CHAN CHEN JIE

CERTIFIED BY

__________________________
DR. RAYNER ALFRED
(PROJECT SUPERVISOR)

__________________________
PN. SURAYA ALIAS
(EXAMINER)


ACKNOWLEDGEMENT

First and foremost, I want to acknowledge the continuous effort of my project supervisor, Dr. Rayner Alfred, who guided me from the beginning of the project until its end. I sincerely appreciate his precious suggestions and comments throughout this project. He scrupulously helped me with every chapter and aspect of this project.

Besides, I also wish to thank my family, lecturers, and friends who assisted me throughout my studies and during this project. Their continuous support was the biggest impetus for me to work hard on this project.


ABSTRACT

MALAY-ENGLISH PARALLEL CLUSTERING USING HIERARCHICAL

AGGLOMERATIVE CLUSTERING

Multilingual text documents are becoming an important resource for work in multilingual natural language processing. This paper discusses the effects of applying a clustering technique to parallel bilingual documents. It is interesting to look at the differences in the cluster mapping between the Malay clusters and the English clusters. Hierarchical agglomerative clustering is chosen as the clustering approach for this project. A genetic algorithm is applied to optimize the weights of the terms considered in clustering the text documents. Finally, this work aims to identify suitable hierarchical agglomerative clustering methods for clustering English and Malay text documents, and to examine the mapping between the English document clusters and the Malay document clusters.


ABSTRAK

A multilingual collection is one of the important resources for carrying out various language-processing tasks. This paper discusses the effects of applying clustering to parallel bilingual documents. It is interesting to look at the differences in the cluster mappings between the Malay clusters and the English clusters. Hierarchical agglomerative clustering is the technique chosen for clustering in this project. A genetic algorithm is used to optimize the weights of the terms considered in clustering the texts. In addition, reducing the number of terms speeds up the clustering process. Specifically, the aim of this project is to introduce suitable techniques for clustering English-language and Malay-language documents, and to present the mapping results between the English document clusters and the Malay document clusters.


LIST OF CONTENTS

DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
LIST OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
LIST OF FORMULAS

CHAPTER 1: INTRODUCTION
1.1 Introduction
1.2 Problem Background
1.3 Problem Statements
1.4 Objective
1.5 Project Scope
1.6 Organization of the Report

CHAPTER 2: LITERATURE REVIEW
2.1 Introduction
2.2 Clustering
2.3 Reviews of Existing Clustering Methods
    2.3.1 K-Means Clustering
    2.3.2 Hierarchical Agglomerative Clustering
    2.3.3 Comparison between K-Means Clustering and Hierarchical Clustering
2.4 Cross-Language Information Retrieval
    2.4.1 Cognate Matching
    2.4.2 Query Translation
    2.4.3 Document Translation
    2.4.4 Interlingual Techniques
2.5 Text Processing in Document Clustering
    2.5.1 Processing Text
        2.5.1.1 Tokenization
        2.5.1.2 Stopping
        2.5.1.3 Stemming
    2.5.2 Term Frequency and Weighting
2.6 Past Research on CLIR
    2.6.1 Japanese/English Cross-Language Information Retrieval (Atsushi Fuji and Tetsuya, 2001)
    2.6.2 A Parallel Hierarchical Agglomerative Clustering Technique for Bilingual Corpora Based on Reduced Terms with Automatic Weight Optimization (Rayner Alfred, 2009)
    2.6.3 Comparison of the Reviewed Past Researches on CLIR
2.7 Conclusion

CHAPTER 3: METHODOLOGY
3.1 Introduction
3.2 System Development Life Cycle
    3.2.1 Planning
    3.2.2 System Analysis
    3.2.3 Design Phase
    3.2.4 Implementation Phase
    3.2.5 Testing and Maintenance Phase
3.3 Operational Environments
    3.3.1 Software
    3.3.2 Hardware Requirements
3.4 Conclusion

CHAPTER 4: SYSTEM ANALYSIS AND DESIGN
4.1 Introduction
4.2 System Analysis
4.3 System Design
    4.3.1 Data Flow Diagram (DFD)
        4.3.1.1 Context Diagram
        4.3.1.2 Level 0 Data Flow Diagram
        4.3.1.3 Flow Chart
4.4 User Interface Design
4.5 Conclusion

CHAPTER 5: IMPLEMENTATION
5.0 Introduction
5.1 Text Processing
5.2 Weighting
5.3 Construct the Distance Matrix Using Cosine Similarity
5.4 Clustering
5.5 Cluster Analysis
5.6 Genetic Algorithm Setup
5.7 Conclusion

CHAPTER 6: TESTING
6.1 Introduction
6.2 Loading Data
6.3 Text Processing
6.4 Weighting
6.5 Distance Matrix
6.6 Experiment 1: Testing the Cluster Results for Different Clustering Algorithms
6.7 Experiment 2: Testing the DBI Values between Single Linkage, Complete Linkage and Average Linkage
6.8 Experiment 3: Improving the DBI Value and the Percentage of English-Malay Mapping Using the Genetic Algorithm
6.9 Conclusion

CHAPTER 7: CONCLUSION AND FUTURE WORK
7.1 Introduction
7.2 Objective Achievement
7.3 Discussion
7.4 Limitations of the System
7.5 Recommendation of Future Works
7.6 Conclusions

REFERENCES


LIST OF FIGURES

1.1 Clustering documents using hierarchical agglomerative clustering
2.1 An example of clustering
2.2 K-means clustering algorithm
2.3 A dendrogram
2.4 Two identical dendrograms with different cutting points
2.5 Flow chart of hierarchical agglomerative clustering
2.6 Text processing before clustering
3.1 Waterfall development-based model
4.1 Context diagram for English-Malay parallel clustering using hierarchical agglomerative clustering
4.2 Level-0 data flow diagram for English-Malay parallel clustering using hierarchical agglomerative clustering
4.3 Flow chart of English-Malay parallel clustering using hierarchical agglomerative clustering
4.4 User interface design of the system in this project
5.1 Code fragment for tokenizer
5.2 Code fragment for stopword removal
5.3 Code fragment for English stemming
5.4 Code fragment for the array
5.5 Tokenizer function in Java
5.6 Code fragment that calculates the raw term frequency
5.7 Code fragment that normalizes the term frequency
5.8 Code fragment that calculates the inverse document frequency
5.9 Code fragment that calculates the TF-IDF
5.10 Code fragment that calculates the cosine similarity between clusters
5.11 Code fragment that shows all functions in the clustering class
5.12 Code fragment for the cluster class
5.13 Code fragment that shows the calculation of single linkage, complete linkage and average linkage
5.14 Code fragment for hierarchical agglomerative clustering
5.15 Code fragment for DBI calculation
6.1 File dialog that allows the user to select the datasets directory
6.2 Message dialog that informs the user that the selected folder does not contain any text file
6.3 Screenshot of loaded files displayed in the program
6.4 The text processing that normalizes the texts
6.5 Screenshot of the raw term frequency table
6.6 Screenshot of the term frequency table
6.7 Part of the IDF values displayed by the program
6.8 Screenshot of the TF-IDF table
6.9 Screenshot of the distance matrix
6.10 Summary of document distribution in 7 clusters
6.11 DBI values for different hierarchical agglomerative clustering methods
6.12 DBI values for each generation of the genetic algorithm


LIST OF TABLES

2.1 Comparison between k-means clustering and hierarchical clustering
2.2 Comparison of the reviewed past researches on CLIR
3.1 Software requirements
5.1 Example of text preprocessing that normalizes the terms
5.2 The setup for the genetic algorithm experiment
6.1 Cosine similarity distance matrix for 10 documents
6.2 Clustering result for single linkage
6.3 Clustering result for complete linkage
6.4 Clustering result for average linkage
6.5 Summary of document distribution in clusters
6.6 DBI values for different hierarchical agglomerative clustering methods
6.7 DBI values for English document clustering (before GA and after GA)
6.8 The percentage of cluster mappings before and after GA applied


LIST OF ACRONYMS

TF: TERM FREQUENCY

IDF: INVERSE DOCUMENT FREQUENCY

DFD: DATA FLOW DIAGRAM

HAC: HIERARCHICAL AGGLOMERATIVE CLUSTERING

CLIR: CROSS-LANGUAGE INFORMATION RETRIEVAL

GA: GENETIC ALGORITHM

DBI: DAVIES-BOULDIN INDEX


LIST OF FORMULAS

1. Inverse document frequency: idf_t = log(N / df_t)
2. TF-IDF weighting scheme: tf-idf_{t,d} = tf_{t,d} × idf_t
3. Precision of the English-Malay mapping (EMM): Precision_EMM(C(E), C(M)) = |C_E ∩ C_M| / |C_E|
4. Precision of the Malay-English mapping (MEM): Precision_MEM(C(M), C(E)) = |C_M ∩ C_E| / |C_M|
5. Inverse document frequency: idf_t = log(N / df_t)
6. TF-IDF: tf-idf_{t,d} = tf_{t,d} × idf_t
7. Cosine similarity: cos(x, y) = (x · y) / (||x|| ||y||)
8. Davies-Bouldin Index: DBI = (1/n) Σ_{i=1}^{n} max_{j≠i} [(δ_i + δ_j) / d(c_i, c_j)]
9. Term frequency: tf_{i,j} = n_{i,j} / Σ_k n_{k,j}


CHAPTER 1

INTRODUCTION

1.1 Introduction

People have always wanted to retrieve useful information from data collections. Many experiments have been run to find methods that can retrieve information from sets of documents; we call these methods text mining methods. Clustering is one of the methods that can retrieve useful information from documents. Some clustering approaches used in text mining are k-means clustering and hierarchical agglomerative clustering. Hierarchical agglomerative clustering (HAC) is one of the common approaches used in many text mining experiments, because HAC outputs its result as a dendrogram, from which any number of clusters can be obtained easily compared with k-means or other clustering approaches.

There has been a lot of research on clustering text documents. However, it is an interesting experiment to apply clustering to parallel bilingual document sets, where the alignment between pairs of clustered documents can be used to extract words from each language and further be used in other applications. There are various clustering experiments on non-parallel bilingual document sets but few related to parallel bilingual document sets, possibly because fewer sources are available for parallel bilingual document sets than for non-parallel ones.

Some useful results can be found by applying clustering to the same documents in two languages (parallel bilingual documents) (Rayner Alfred, 2009). For example, the clustering in one language of a parallel bilingual document set can be used as a source of annotation to verify the clusters produced by clustering in the other language. Besides that, combining the results for the two languages can eliminate some language-specific bias. Finally, the alignment between pairs of clustered documents can be used to extract words from each language, which can further be used for other applications such as cross-lingual information retrieval (CLIR).

1.2 Problem Background

The rapid development of the Internet frees people from the limitations of space, so they can freely get information from around the world. However, this freedom of obtaining information is limited by the diversity of languages. To cross the language barrier, people have put forward solutions such as online dictionaries, online translation, machine translation, cross-language information retrieval and cross-language search engines. Among them, cross-language information retrieval is one of the popular research topics.

There are few available resources on bilingual parallel corpora for English-Malay. 200 sets of English-Malay bilingual parallel documents were collected to conduct the experiment.

1.3 Problem Statements

There is numerous research on clustering text documents. However, few experiments have examined the impact of clustering parallel bilingual corpora, especially Malay-English parallel clustering. This project conducts a clustering experiment on parallel English-Malay documents.

In this project, a collection of English-Malay text documents is gathered for bilingual clustering.

This project is divided into a few parts. For English document clustering, the English document sets may include useless information. To overcome this problem, the English documents must undergo stopping and stemming before hierarchical clustering.


Figure 1.1: Clustering documents using hierarchical agglomerative clustering

After hierarchical clustering, the next step is to perform mapping between the English and Malay cluster memberships.

The last part of this project is to combine the clustering with a genetic algorithm that optimizes the weights of the terms so that the clustering matches the provided annotation as closely as possible.

1.4 Objective

There are three main objectives in this project:

1. To propose a framework for English-Malay parallel hierarchical agglomerative clustering.
2. To implement a hierarchical agglomerative clustering method to cluster English documents.
3. To apply a Genetic Algorithm (GA) and analyse its effects on the clustering results.

1.5 Project Scope

The scope of this project includes the document representation and the clustering algorithm used to cluster a large amount of Malay-English bilingual documents. Besides that, this project will implement the genetic algorithm (GA) technique to improve the clustering results.


1.6 Organization of the Report

This report is organized into 7 chapters.

Chapter 1 is the introduction. It includes the problem background, problem statements, objectives and project scope, and aims to provide an introduction to the project.

Chapter 2 is the literature review. It reviews existing concepts and methods, and includes some comparisons of them.

Chapter 3 is the methodology. It reviews and explains the methodology used in this project, and lists the software and hardware requirements.

Chapter 4 is system analysis and design. It explains the system design of this project.

Chapter 5 is implementation. It discusses the implementation of the system.

Chapter 6 is testing. It discusses the testing stage and the testing results.

Chapter 7 is the conclusion for the whole project.


CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

This chapter reviews some existing data mining concepts and methods related to this project. Since this project deals with parallel bilingual documents, the chapter focuses on text-mining concepts and methods. The data mining approaches reviewed in this chapter are clustering, text processing and cross-language information retrieval.

2.2 Clustering

Clustering is a data mining method that divides data into groups that are useful and meaningful (Christopher D. Manning et al., 2009). Clustering is performed based on data similarity: objects grouped in a group/cluster show similar characteristics. Clustering can only extract information from existing data. Clustering is unsupervised learning, where a model is built without a well-defined goal or prediction field.

Figure 2.1: An Example of Clustering

There are several clustering methods we can use for text mining. Every clustering method has its own advantages and disadvantages. The most popular methods are hierarchical and k-means clustering, and this chapter uses these two methods to illustrate how clustering works.

2.3 Reviews of Existing Clustering Methods

2.3.1 K-Means Clustering

K-means clustering is a simple and efficient clustering algorithm. K-means groups the data, based on their attributes or features, into K groups, where K is a positive number (Kardi Teknomo, 2006). Grouping is done by minimizing the sum of squared distances between the data and the corresponding cluster centroids.

The k-means algorithm is simple. It starts by determining the number K and setting K centroids randomly in the dataset. After that, it repeatedly determines the distance of each data point to the centroids and assigns each point to the nearest centroid. Figure 2.2 shows a simple k-means clustering algorithm.


Figure 2.2: K-means clustering algorithm
Source: http://people.revoledu.com/kardi/tutorial/Clustering/

The way to initialize the centroids is not specified, and usually the K initial samples are chosen randomly. Different initial centroids will produce different results, and there is no general theoretical way to find the optimal clusters from the data. A simple approach is to compare the results of multiple runs with different initial centroids.

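To make the algorithm concrete, the following minimal Java sketch implements one version of the k-means loop described above. It is an illustration only, not code from this project; the class name, random seed, iteration count and sample data are arbitrary choices for the example.

import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {

    // Euclidean distance between two feature vectors
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static int[] cluster(double[][] data, int k, int iters) {
        Random rnd = new Random(42);
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++)
            centroids[c] = data[rnd.nextInt(data.length)].clone(); // random initial centroids
        int[] assign = new int[data.length];
        for (int it = 0; it < iters; it++) {
            // assignment step: group each point with its nearest centroid
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (dist(data[i], centroids[c]) < dist(data[i], centroids[best])) best = c;
                assign[i] = best;
            }
            // update step: move each centroid to the mean of its members
            for (int c = 0; c < k; c++) {
                double[] sum = new double[data[0].length];
                int n = 0;
                for (int i = 0; i < data.length; i++)
                    if (assign[i] == c) {
                        n++;
                        for (int d = 0; d < sum.length; d++) sum[d] += data[i][d];
                    }
                if (n > 0) {
                    for (int d = 0; d < sum.length; d++) sum[d] /= n;
                    centroids[c] = sum;
                }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1.2, 0.9}, {5, 5}, {5.1, 4.8}};
        System.out.println(Arrays.toString(cluster(data, 2, 10))); // e.g. [0, 0, 1, 1]
    }
}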


2.3.2 Hierarchical Agglomerative Clustering

Another popular clustering method is hierarchical agglomerative clustering (HAC). In hierarchical agglomerative clustering, data are organized into a hierarchy similar to a tree-like diagram (Figure 2.3), called a dendrogram (Kardi Teknomo, 2006).

Figure 2.3: A dendrogram

Source: http://people.revoledu.com/kardi/tutorial/clustering/

A dendrogram is the standard output of hierarchical agglomerative clustering. It is a cluster tree in which the distance of each split or merge is recorded.

Using a dendrogram, the number of clusters can be determined by specifying a cutting point on the dendrogram. For example, in the left dendrogram below (Figure 2.4), we set the cutting distance at 2 and obtain two clusters out of 6 data points. The first cluster consists of 4 data points (numbers 4, 6, 5 and 3) and the second cluster consists of two objects (numbers 1 and 2). Similarly, in the right dendrogram, cutting at distance 1.2 will produce 3 clusters.



Figure 2.4: Two identical dendrograms with different cutting points

Source: http://people.revoledu.com/kardi/tutorial/clustering/

The "agglomerative" in hierarchical agglomerative clustering means that the clustering starts from the bottom, where every object is a single cluster, and goes up (a bottom-up approach) through merging of objects. There is a different method called the divisive approach (top-down), where all objects start in a single group that is repeatedly split into two groups until the number of objects in each group becomes one.

The step-by-step algorithm of hierarchical agglomerative clustering is as follows:

Step 1: Convert object features into a distance matrix.
Step 2: Set each object as a cluster.
Step 3: Iterate until the number of clusters is 1.
    3a: Merge the two closest clusters.
    3b: Update the distance matrix.

Figure 2.5 shows the flow chart of the hierarchical agglomerative clustering.


Figure 2.5: Flow chart of hierarchical agglomerative clustering

Source: http://people.revoledu.com/kardi/tutorial/clustering/

In Figure 2.5, the distance matrix needs to be updated by finding the distances between the clusters after the two closest clusters are merged, and there are several methods to do this: single-linkage, complete-linkage and average-linkage. Different methods produce different outputs, and the form of the dendrogram will differ.
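To make the merge loop and the three linkage methods concrete, here is a naive Java sketch (an illustration, not this project's implementation); the distance matrix d is assumed to be precomputed, for example from cosine similarities, and the class and method names are invented for the example.

import java.util.ArrayList;
import java.util.List;

public class HacSketch {

    enum Linkage { SINGLE, COMPLETE, AVERAGE }

    // distance between two clusters under the chosen linkage method
    static double linkDist(List<Integer> a, List<Integer> b, double[][] d, Linkage link) {
        double best = (link == Linkage.SINGLE) ? Double.MAX_VALUE : 0;
        double sum = 0;
        for (int i : a) for (int j : b) {
            sum += d[i][j];
            if (link == Linkage.SINGLE)   best = Math.min(best, d[i][j]);
            if (link == Linkage.COMPLETE) best = Math.max(best, d[i][j]);
        }
        return (link == Linkage.AVERAGE) ? sum / (a.size() * b.size()) : best;
    }

    static List<List<Integer>> cluster(double[][] d, int target, Linkage link) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < d.length; i++) {           // every object starts as its own cluster
            List<Integer> c = new ArrayList<>();
            c.add(i);
            clusters.add(c);
        }
        while (clusters.size() > target) {             // repeatedly merge the two closest clusters
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double dist = linkDist(clusters.get(i), clusters.get(j), d, link);
                    if (dist < best) { best = dist; bi = i; bj = j; }
                }
            clusters.get(bi).addAll(clusters.remove(bj));
        }
        return clusters;
    }
}

This naive version recomputes the linkage distances on every pass and is cubic in the number of objects; it is meant only to illustrate the procedure, not to be efficient.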


2.3.3 Comparison between K-Means Clustering and Hierarchical Clustering

Table 2.1: Comparison between k-means clustering and hierarchical clustering

K-means clustering
- Advantages: less computation, suitable for clustering large document sets; the time complexity of k-means clustering is linear.
- Disadvantages: the result is sensitive to noise and outliers; the result cannot be guaranteed optimal because it depends on the initial centroids; the application of k-means is limited to numerical variables.

Hierarchical clustering
- Advantages: the result is shown in a dendrogram, which produces an ordering of the objects and makes it easy to find the elements of the n clusters by cutting the dendrogram; similar clusters are generated, which is helpful for analysis.
- Disadvantages: no provision can be made for relocating objects that were incorrectly grouped at an early stage; results may differ if a different distance metric is used.

2.4 Cross-Language Information Retrieval

Cross-language information retrieval (CLIR) can be explained as a user trying to search a set of information in one language using a query in another language (Leah S. Larkey & Margaret E. Connell, 2004). The issues of CLIR have been discussed for several decades.

There are some existing methods for cross-language information retrieval. There are four types of strategies for matching a query with a set of documents in the context of CLIR (Oard and Diekema, 1998). The four types of strategies are:

No translation:
1. Cognate matching

Translation:
2. Query translation
3. Document translation
4. Interlingual techniques

2.4.1 Cognate Matching

Cognate matching is a naïve model in CLIR. Some untranslatable terms, such as proper nouns or technical terminology, are left unchanged after the translation stage. The unchanged terms can be expected to match successfully with corresponding terms in another language if the two languages have a close linguistic relationship.

One way to do cognate matching is to decompose the words in both the query and the document into n-grams (more specifically, character-based overlapping n-grams), and perform matching operations on the two sets of n-grams. However, when two languages are very different, e.g., English and Chinese, the techniques of edit distance and n-grams may not work well.
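As a small illustration of the n-gram decomposition mentioned above (a sketch for this review only; cognate matching is not part of this project's system), character-based overlapping n-grams can be generated as follows:

import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // all overlapping character n-grams of a word
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++)
            grams.add(word.substring(i, i + n));
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("universiti", 3));
        // [uni, niv, ive, ver, ers, rsi, sit, iti]
    }
}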

2.4.2 Query Translation

Query translation translates queries into the document language using bilingual dictionaries and/or corpora, prior to the retrieval process. Query translation is the most common matching method used in CLIR, because it is easier to handle: the retrieval system does not need to change its inverted index files in any way; it just translates the query into the other language from which the information may be retrieved. Furthermore, it takes far less computation to translate the query than to translate a large set of documents.

However, there is a disadvantage: it is difficult to resolve term ambiguity during the translation process, because the query is usually short and provides only a little context for disambiguation.

2.4.3 Document translation

This approach translates documents into the query language prior to retrieval. Document translation is the "opposite" of query translation, and it has different advantages and disadvantages. People prefer query translation to document translation, because document translation needs a large amount of computation to translate a big set of documents. However, some researchers still use this approach, since it can extract more context from each document and improve the translation quality.

2.4.4 Interlingual techniques

In interlingual techniques, an intermediate space of subject representation, into which both the query and the documents are converted, is used to compare them. In an interlingua-based machine translation approach, translation is done via an intermediary (semantic) representation of the source-language (SL) text. An interlingua is supposed to be a language-independent representation from which translations can be generated into different target languages. The interlingual approach assumes that it is possible to convert source texts into representations common to more than one language; from such interlingual representations, texts are generated in other languages. Translation is thus done in two stages: from the source language to the interlingua (IL), and from the IL to the target language.

2.5 Text Processing in Document Clustering

Document clustering is a complex process. It needs a few text mining steps to accomplish its goal.


2.5.1 Processing Text

After the document sets are collected, the next step is to decide whether they should be modified or restructured in some way to simplify clustering. The types of changes made at this stage are called text processing.

The goal of text processing is to convert the many forms in which words can occur into more consistent index terms. Index terms are the representation of the content of a document that is used for clustering.

The flow chart in Figure 2.6 shows the techniques used for text processing.

Figure 2.6: Text processing before clustering (original text → tokenization → stopping → stemming → HAC)

2.5.1.1 Tokenization

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens (Christopher D. Manning et al. 2009). In the process of tokenization, certain characters such as punctuation are thrown away. The example below shows the result of tokenization on a sentence:

Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends Romans Countrymen lend me your ears

However, while this simple tokenizing process was suitable for experiments with small test collections, it does not seem appropriate for most text processing experiments, because too much information is discarded.


Some examples of issues involving tokenizing that can have a significant impact on the effectiveness of clustering are:

- Small words (1 or 2 characters) can be important in some clusters, usually in combination with other words. E.g.: xp, ma, pm, ben e king, el paso, master p, gm, j lo, world war II.
- Both hyphenated and non-hyphenated forms of many words are common. In some cases the hyphen is not needed (e.g.: e-bay, wal-mart, active-x, CD-rom, t-shirt). At other times, hyphens should be considered either as part of the word or as a word separator (e.g.: Winston-Salem, Mazda rx-7, e-cards, pre-diabetes, t-mobile, Spanish-speaking).
- Special characters are an important part of the tags, URLs, code and other important parts of documents that must be correctly tokenized.
- Capitalized words can have a different meaning from lowercase words. E.g.: "Bush" and "Apple".
- Apostrophes can be part of a word, part of a possessive, or just a mistake. E.g.: rosie o'donnell, can't, don't, 80's, 1980's, men's straw hats, master's degree, England's ten largest cities, shriner's.
- Numbers can be important, including decimals. E.g.: nokia 3250, top 10 courses, united 93, quicktime 6.5 pro.
- Periods can occur in numbers and abbreviations (e.g.: "I.B.M.", "Ph.D."), in URLs, at the ends of sentences, and in other situations.

(Source: An Introduction to Information Retrieval, 2009)

From these examples, tokenizing appears to be more complicated than it may seem at first.
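A minimal tokenizer sketch in Java is shown below. It only lowercases the text and splits on non-alphanumeric characters, which is a simplifying assumption for illustration; a real tokenizer would have to handle the cases listed above.

import java.util.Arrays;
import java.util.List;

public class TokenizeSketch {
    static List<String> tokenize(String text) {
        // lowercase, replace every non-letter/digit run with a space, then split
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9]+", " ").trim();
        return cleaned.isEmpty() ? List.of() : Arrays.asList(cleaned.split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Friends, Romans, Countrymen, lend me your ears;"));
        // [friends, romans, countrymen, lend, me, your, ears]
    }
}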


2.5.1.2 Stopping

Human language is filled with function words: words which have little meaning apart from other words (Christopher D. Manning et al. 2009). The most popular, like "the", "a", "an", "that", or "those", are determiners. These words are part of how we describe nouns in text, and express concepts like location or quantity. Prepositions, like "over", "under", "above", and "below", represent relative positions between two nouns. In information retrieval, these function words have a second name: stopwords.

A stopword list holding all stopwords is constructed and used in text processing. However, constructing a stopword list must be done with caution: removing too many words will decrease the effectiveness of clustering, while not removing stopwords may cause problems in ranking.

A stopword list can be constructed by simply using the top n most frequent words in a collection. This can, however, lead to the inclusion of words that are important for some clusters. More typically, either a standard stopword list is used, or a list of frequent words and standard stopwords is manually edited to remove any words that may be significant for a particular application. A standard stopword list contains around 300 words.
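A sketch of stopword removal using a small hard-coded stopword set is shown below (illustrative only; a real list would contain around 300 entries, as noted above, and the class name and word set are invented for the example).

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordSketch {
    // a tiny sample stopword set; a real list is much larger
    static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "the", "a", "an", "that", "those", "over", "under",
            "above", "below", "is", "are", "and"));

    static List<String> removeStopwords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOPWORDS.contains(t))   // keep only non-stopwords
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(removeStopwords(
                Arrays.asList("the", "cat", "is", "over", "that", "wall")));
        // [cat, wall]
    }
}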

2.5.1.3 Stemming

Stemming, also called conflation, is a component of text processing that captures the relationships between different variations of a word. More precisely, stemming reduces the different forms of a word that occur because of inflection (e.g., plurals, tenses) or derivation (e.g., turning a verb into a noun by adding the suffix -ation) to a common stem (Christopher D. Manning et al. 2009).

There are two basic types of stemmers: algorithmic and dictionary-based. An algorithmic stemmer uses a small program to decide whether two words are related, usually based on knowledge of word suffixes for a particular language. By contrast, a dictionary-based stemmer has no logic of its own, but instead relies on pre-created dictionaries of related terms to store term relationships.

The simplest kind of English algorithmic stemmer is the suffix-s stemmer. This kind of stemmer assumes that any word ending in the letter "s" is plural, so cakes >> cake, dogs >> dog. However, this rule is not perfect. It cannot detect many plural relationships, like "century" and "centuries". In very rare cases, it detects a relationship where none exists, such as with "I" and "is". The first kind of error is called a false negative, and the second kind is called a false positive.

More complicated algorithmic stemmers reduce the number of false negatives by considering more kinds of suffixes, like -ing or -ed. By handling more suffix types, the stemmer can find more term relationships; in other words, the false negative rate is reduced. However, the false positive rate (finding a relationship where none exists) generally increases.

The most popular algorithmic stemmer is the Porter stemmer. It has been used in many information retrieval experiments and systems since the 1970s, and a number of implementations are available. The stemmer consists of a number of steps, each containing a set of rules for removing suffixes. At each step, the rule for the longest applicable suffix is executed. Some of the rules are obvious, while others require some thought to work out what they are doing. As an example, here are the first two parts of step 1 (of 5 steps):

Step 1a:
- Replace sses by ss (e.g., stresses >> stress).
- Delete s if the preceding word part contains a vowel not immediately before the s (e.g., gaps >> gap but gas >> gas).
- Replace ied or ies by i if preceded by more than one letter, otherwise by ie (e.g., ties >> tie, cries >> cri).
- If the suffix is us or ss, do nothing (e.g., stress >> stress).

Step 1b:
- Replace eed, eedly by ee if it is in the part of the word after the first non-vowel following a vowel (e.g., agreed >> agree, feed >> feed).
- Delete ed, edly, ing, ingly if the preceding word part contains a vowel, and then: if the word ends in at, bl, or iz, add e (e.g., fished >> fish, pirating >> pirate); or if the word ends with a double letter that is not ll, ss or zz, remove the last letter (e.g., falling >> fall, dripping >> drip); or if the word is short, add e (e.g., hoping >> hope).

(Source: An Introduction to Information Retrieval, 2009)
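The following Java sketch implements the simple suffix-s stemmer and a partial version of Porter step 1a (illustrative rules only; the vowel condition from the text is omitted for brevity, and a full Porter stemmer has many more steps and rules):

public class StemSketch {

    // naive suffix-s rule: any word ending in 's' is assumed plural
    static String suffixS(String w) {
        return w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
    }

    // a partial version of Porter step 1a; the vowel condition is omitted,
    // so e.g. "gas" would be over-stemmed by the final rule
    static String porterStep1aPartial(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2); // stresses >> stress
        if (w.endsWith("ss") || w.endsWith("us")) return w;            // stress >> stress
        if (w.endsWith("s")) return w.substring(0, w.length() - 1);    // gaps >> gap
        return w;
    }

    public static void main(String[] args) {
        System.out.println(suffixS("cakes"));                 // cake
        System.out.println(porterStep1aPartial("stresses"));  // stress
        System.out.println(porterStep1aPartial("stress"));    // stress
    }
}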

2.5.2 Term Frequency and Weighting

After the text processing stage, the next step is calculating the TF-IDF. TF is term frequency and IDF is inverse document frequency (Christopher D. Manning et al. 2009). TF-IDF is a weight used to measure how important a word is to a document in a collection of documents.

The purpose of computing the term frequency is to compute a score between a query term t and a document d, based on the weight of term t in document d. A simple approach is to take the number of occurrences of term t in document d as its weight; this weighting scheme is referred to as term frequency and is denoted tf_{t,d}.

The concept of term frequency has a problem: are all terms in a document equally important? All terms are considered equally important when assessing relevancy for a query. Inverse document frequency can overcome this problem. Inverse document frequency is a measure of the general importance of a term, obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient.


Thus the idf of a rare term is high, whereas the idf of a frequent term is likely to be low:

idf_t = log(N / df_t)   (1)

where idf_t is the inverse document frequency, N is the total number of documents, and df_t is the number of documents in which the term t appears.

To produce a composite weight for each term in each document, we combine the definitions of term frequency and inverse document frequency. The tf-idf weighting scheme assigns to term t a weight in document d as shown in Equation (2):

tf-idf_{t,d} = tf_{t,d} × idf_t   (2)

where tf_{t,d} is the term frequency and idf_t is the inverse document frequency.
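A small Java sketch of Equations (1) and (2) over tokenized documents is shown below (an illustration only, not this project's code; documents are assumed to be lists of tokens):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {

    // raw term frequency of each term in one tokenized document
    static Map<String, Integer> termFrequency(List<String> doc) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : doc) tf.merge(t, 1, Integer::sum);
        return tf;
    }

    // idf of a term over the whole collection: log(N / df_t)
    static double idf(String term, List<List<String>> docs) {
        long df = docs.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0 : Math.log((double) docs.size() / df);
    }

    // tf-idf_{t,d} = tf_{t,d} x idf_t
    static double tfIdf(String term, List<String> doc, List<List<String>> docs) {
        return termFrequency(doc).getOrDefault(term, 0) * idf(term, docs);
    }
}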

2.6 Past Research on CLIR

In this section, some past research on CLIR proposed by other researchers is reviewed.

2.6.1 Japanese/English Cross-Language Information Retrieval (Atsushi Fuji and Tetsuya, 2001)

Atsushi Fuji and Tetsuya (2001) proposed a Japanese/English CLIR system using the query translation approach combined with retrieval modules. They target the retrieval of technical documents, so the translation of technical terms plays an important role in the performance of their system. To conduct the experiment, they produced a Japanese/English dictionary for base words and translated compound words on a word-by-word basis. To resolve translation ambiguity during the translation process, they used a probabilistic method.



The reason why Atsushi Fuji and Tetsuya chose the query translation approach in their experiment is that it is relatively inexpensive to implement. However, they state that the naïve query translation method does not guarantee sufficient system performance, because it relies on an existing bilingual dictionary while new technical terms are progressively created.

Other information retrieval techniques used in the paper are tokenization and stopword removal. The purpose of using these two techniques is to convert all possible query words into base words.

18,700 sets of documents in both English and Japanese were collected to conduct the experiments in the project.

2.6.2 A Parallel Hierarchical Agglomerative Clustering Technique for Bilingual Corpora Based on Reduced Terms with Automatic Weight Optimization (Rayner Alfred, 2009)

This paper was proposed by Rayner Alfred (2009). It investigated the effects of applying a clustering technique to parallel multilingual texts. The results of the experiment are displayed as cluster mappings and cluster tree structures. The targeted languages are English and Bulgarian, where a collection of English-Bulgarian documents forms the bilingual parallel corpus.

There are three main parts in the paper. The first part tested clustering on a parallel corpus of English-Bulgarian texts. The second part conducted the English-Bulgarian cluster mapping and constructed the English versus Bulgarian tree structures. The last part applied a genetic algorithm to optimize the weights of the terms considered in clustering the English texts.


20,000 pairs of English-Bulgarian documents were used as datasets in the experiment. There were two parallel corpora (News Briefs and Features), each in two languages, English and Bulgarian. In both corpora, each English document corresponded to a Bulgarian document with the same content.

The text mining methods used in the experiment are stemming, stopword removal and hierarchical agglomerative clustering. A genetic algorithm was applied to optimize the weights of the terms considered in clustering the English texts, so that the clustering matched the provided annotation as closely as possible.


2.6.3 Comparison of the Reviewed Past Researches on CLIR

Table 2.2: Comparison of the reviewed past researches on CLIR

A Parallel Hierarchical Agglomerative Clustering Technique for Bilingual Corpora Based on Reduced Terms with Automatic Weight Optimization (R. Alfred)
- Targeted languages: English, Bulgarian
- Parallel or non-parallel document sets: parallel
- Translated texts? No translation in the project
- Applied techniques: hierarchical agglomerative clustering; cluster mapping

Japanese/English Cross-Language Information Retrieval (Atsushi Fuji and Tetsuya, 2001)
- Targeted languages: Japanese, English
- Parallel or non-parallel document sets: non-parallel
- Translated texts? Translates the query
- Applied techniques: query translation

2.7 Conclusion

The purpose of reviewing the current methods is to gain background knowledge of existing techniques. From the review of the techniques, we found that the output of HAC (a dendrogram) is suitable for map comparison and tree comparison in parallel bilingual clustering. We also identified the sequence of steps for document clustering and the methods to use in it.


CHAPTER 3

METHODOLOGY

3.1 Introduction

This chapter explains the approaches and framework used in this project and reviews the methodology applied. The software and hardware requirements used in this project are also listed, since they affect the performance of the experiments.

There are two main sections in this chapter. The first section is a brief summary of the System Development Life Cycle (SDLC); SDLC is the process of developing an information system through investigation, analysis, design, implementation and testing. The second section relates to the operational environment: the software and hardware requirements needed by the whole system.

3.2 System Development Life Cycle

The waterfall development-based model was chosen as the methodology for this project. This model is called a waterfall because it only moves forward from phase to phase, with no return to a previous phase, just like water flowing down a waterfall. The waterfall model was chosen because the system requirements were identified before programming design.

The waterfall model can be defined as a completely sequential approach to software development. There are five phases in the waterfall model, starting from planning and continuing with system analysis, system design, system implementation, and testing and maintenance. Each phase should start only after the previous phase has finished. Figure 3.1 illustrates how the waterfall model works.


Figure 3.1: Waterfall development-based model

3.2.1 Planning

Planning is the first phase in the waterfall model. In this phase, we need to figure out the problem we want to solve: researchers need to identify the challenges and find the way to achieve the goal of the title, Malay-English Parallel Clustering Using Hierarchical Agglomerative Clustering. The tasks that need to be performed in this phase are:

- Identify the overall view and the problems of the existing concepts or methods.
- Identify the scope, objectives and goals to be achieved by the system.
- Identify the methods or concepts that will be used to solve the problem.
- Evaluate the project: Is the project feasible in time? Is the project reliable?



In the planning phase, researchers also need to define the objectives and scope. This is important because it serves as the guideline for developing the system; well-done planning makes the development of the system easier.

The purpose of this project is to perform clustering on Malay-English corpora using hierarchical agglomerative clustering. Many past experiments have been performed to test the effect of clustering on documents; however, it is interesting that different results can be obtained by applying clustering to parallel bilingual corpora.

The next planning step is to conduct research on the existing techniques. Comparison enhances the common understanding of the advantages and disadvantages of clustering techniques, here hierarchical agglomerative clustering and k-means clustering. From this, researchers can decide which clustering technique suits the title. In this step, hierarchical agglomerative clustering was chosen because its output (a dendrogram) is suitable for cluster mapping and tree comparison between the clustering results of the two languages.

The requirements are also settled in the planning phase. Researchers need to list the hardware and software to be used in the project. Since document clustering is computationally demanding, hardware with higher performance should be selected.

3.2.2 System Analysis

In system analysis, researchers need to analyse the requirements and the techniques used in the project. This phase must describe and establish a clear understanding of the project; in it, the whole software development process and the overall software structure are defined.


To make the process of system analysis easier, researchers need to study past research on this title. There is similar research on English and Bulgarian (Rayner Alfred, 2009). From it, an understanding can be gained of the overall techniques and concepts applied to achieve English-Malay parallel clustering using hierarchical agglomerative clustering.

Besides that, there are several methods used for stemming, and Porter's stemming is the most popular among them; hence, Porter's stemming will be applied in this project. Tokenizing and stopword removal will be studied and applied as well, to improve the clustering results.

3.2.3 Design Phase

In this phase, the information collected in the analysis phase is evaluated, and the required development tools are defined. The analysis and design phases are very important in the whole development cycle: any mistake in the design phase can be very costly to fix during the development process. The design phase also decides how the system will operate in terms of hardware and software.

In this project, the first part conducts clustering on the English documents. After that, the clusters output by the clustering are mapped to the Malay clusters. The last part applies a genetic algorithm to the clustering to optimize the weights of the terms so that the clustering matches the provided annotation as closely as possible.

Java is the programming language chosen to develop the system in this project. More details of the design phase are explained in Chapter 4.


3.2.4 Implementation Phase

In this phase, the design is converted into the physical system. The details of the coding and design, as well as the user interface design of the system, are important in this phase.

The tasks in this phase are:

- Completing the development plan and finalizing the designs.
- Implementing the coding of the system.
- Producing a user manual or user guide.
- Testing the system.

3.2.5 Testing and Maintenance Phase

After the implementation phase, system testing begins. Different testing methods are available to detect the bugs committed during the previous phases, and different testing inputs are used to test the accuracy. We can test the system in the following aspects:

- Functionality
- Structure
- Consistency
- Performance
- Durability

In the maintenance phase, the system may change due to errors that were not discovered in previous stages.

3.3 Operational Environments

Software and hardware requirements are the software and hardware needed to

develop and run the system.


3.3.1 Software

The software requirements to build the system in this project are shown in Table 3.1.

Table 3.1: Software requirements

Type                 Specification
Operating system     Microsoft Windows 7
Programming tools    NetBeans 6.9.1 (SDK), Java SE JDK, Java SE JRE

3.3.2 Hardware Requirements

The hardware requirements to build the system in this project are shown below.

Laptop with the following hardware specification:
- Central Processing Unit (CPU): Intel Centrino 2 (2.26 GHz)
- 4 GB RAM
- 250 GB hard drive
- Mouse and keyboard

3.4 Conclusion

In conclusion, the methodology provides a guide, a blueprint or template, during the development of the system. This chapter concludes all the research and development of the system needed to achieve its goals. In addition, it provides developers a clearer picture of what needs to be done and how the system is to be organized and developed.


CHAPTER 4

SYSTEM ANALYSIS AND DESIGN

4.1 Introduction

This chapter is about the process and details of the system analysis phase and the system design phase. It explains the system design, interface design, techniques and related algorithms. Flow charts, a context diagram and data flow diagrams are included in this chapter to give a clear picture of how this project is conducted.

4.2 System Analysis

This project is divided into two parts. In the first part, there are two sets of parallel documents, each in a different language, English and Malay. In both sets, each English document E corresponds to a Malay document M with the same content. To obtain a better clustering result, the English documents are tokenized, stopword-removed and stemmed before hierarchical agglomerative clustering is performed.

The first part of this project conducts clustering of the English documents using the hierarchical agglomerative clustering approach.

The output of clustering the English documents is mapped to the output of clustering the Malay documents; there is a one-to-one mapping between the English and Malay clusters. The precision (Rayner Alfred, 2009) for each pair of English-to-Malay cluster mappings is calculated using Equation (3).


Precision_EMM(C(E), C(M)) = |C_E ∩ C_M| / |C_E|   (3)

where C(E) is a cluster of English documents and C(M) is a cluster of Malay documents.

Similarly, the precision of the Malay-English mapping, Precision_MEM, is defined in Equation (4):

Precision_MEM(C(M), C(E)) = |C_M ∩ C_E| / |C_M|   (4)

A higher precision value indicates a better cluster mapping result.
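A sketch of Equation (3) in Java is shown below, assuming (as an illustration) that parallel English and Malay documents share integer IDs, so a cluster can be represented as a set of IDs; the class and method names are invented for the example.

import java.util.HashSet;
import java.util.Set;

public class MappingPrecisionSketch {
    // Precision_EMM = |C_E intersect C_M| / |C_E|
    static double precisionEMM(Set<Integer> englishCluster, Set<Integer> malayCluster) {
        Set<Integer> common = new HashSet<>(englishCluster);
        common.retainAll(malayCluster);                 // |C_E intersect C_M|
        return (double) common.size() / englishCluster.size();
    }

    public static void main(String[] args) {
        Set<Integer> cE = new HashSet<>(Set.of(1, 2, 3, 4));
        Set<Integer> cM = new HashSet<>(Set.of(2, 3, 4, 7));
        System.out.println(precisionEMM(cE, cM)); // 0.75
    }
}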

The second part of this project applies the genetic algorithm to the clustering to optimize the weights of the terms so that the clustering matches the provided annotation as closely as possible.

4.3 System Design

This section discusses the system design and interface design for this project.

4.3.1 Data Flow Diagram (DFD)

A data flow diagram (DFD) is a picture of the movement of data between external entities and the processes and data stores within a system.

4.3.1.1 Context Diagram

The context diagram is the top-level view of an information system. It shows the system boundaries, the external entities that interact with the system, and the major information flows between the entities and the system.


The context diagram of English-Malay parallel hierarchical agglomerative clustering is shown in Figure 4.1.

Figure 4.1: Context diagram for English-Malay parallel clustering using hierarchical agglomerative clustering

4.3.1.2 Level 0 Data Flow Diagram

The level-0 data flow diagram shows the system's major processes, data flows and data stores at a high level of abstraction. When the context diagram is expanded into the level-0 data flow diagram, all the connections that flow into and out of process 0 need to be retained. Figure 4.2 shows the level-0 data flow diagram for English-Malay parallel clustering using hierarchical agglomerative clustering.

[Figure 4.1 shows a single external entity, User, who loads the English files into process 0 (English-Malay parallel clustering using hierarchical agglomerative clustering), which displays the results.]


Figure 4.2: Level-0 data flow diagram for English-Malay parallel clustering using hierarchical agglomerative clustering

4.3.1.3 Flow Chart

A flow chart is a graphical or symbolic representation of a process. Each step in the process is represented by a different symbol and contains a short description of the process step. The flow chart symbols are linked together with arrows showing the direction of the process flow.

[Figure 4.2 processes: 1 load data; 2 tokenizing, stemming, stopping; 3 compute TF-IDF; 4 construct distance matrix; 5 clustering; 6 map clusters; 7 construct new TF-IDF values as tf × idf × scale and reconstruct the distance matrix until the stopping condition is met. The data flows are the texts, the TF-IDF values, the distance matrix, the English and Malay clusters, and the new adjusted TF-IDF values.]


Figure 4.3: Flow chart of English-Malay parallel clustering using hierarchical agglomerative clustering

The flow of this project starts with loading the English and Malay files into the program. The texts in the files are tokenized, stemmed and stopword-removed to restructure the texts.

[Figure 4.3 flow: start → load data → stemming/stopping → compute TF-IDF → build distance matrix → hierarchical agglomerative clustering → Malay-English cluster mapping → apply genetic algorithm, repeating until the stopping condition is met.]


The next step computes tf (term frequency), df (document frequency) and idf (inverse document frequency). TF is the number of times a term occurs in a document. DF is the number of documents that contain the same term. After that, the idf is calculated using Equation (5):

idf_t = log(N / df_t)   (5)

where idf_t is the inverse document frequency, N is the total number of documents, and df_t is the number of documents in which the term t appears.

The next step is calculating the tf-idf using Equation (6):

tf-idf_{t,d} = tf_{t,d} × idf_t   (6)

where tf_{t,d} is the term frequency and idf_t is the inverse document frequency.

The distance matrix is built and calculated using the cosine similarity in Equation (7):

cos(x, y) = (x · y) / (||x|| ||y||)   (7)

where · indicates the vector dot product, x · y = Σ_{k=1}^{n} x_k y_k, and ||x|| is the length of vector x, ||x|| = √(Σ_{k=1}^{n} x_k²) = √(x · x).
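A small Java sketch of Equation (7) over two TF-IDF vectors is shown below (illustration only, not this project's code):

public class CosineSketch {
    // cos(x, y) = (x . y) / (||x|| ||y||)
    static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int k = 0; k < x.length; k++) {
            dot += x[k] * y[k];   // vector dot product
            nx += x[k] * x[k];    // squared length of x
            ny += y[k] * y[k];    // squared length of y
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    public static void main(String[] args) {
        System.out.println(cosine(new double[]{1, 0, 1}, new double[]{1, 1, 0})); // 0.5
    }
}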

After the distance matrix is constructed, the next step performs hierarchical agglomerative clustering on the English documents and the Malay documents. This project uses the hierarchical agglomerative clustering algorithm shown in Figure 2.5.



After the clustering step, the next action is to map the English and Malay cluster memberships. There is a one-to-one mapping from the English to the Malay clusters (EMM). The same is repeated in the direction of the Malay-to-English mapping (MEM).

The last action is applying the genetic algorithm to the clustering to optimize the weights of the terms so that the clustering gives a better output. The setup of the genetic algorithm can be described as follows. A population of X strings of length m is randomly generated, where m is the number of terms (Rayner Alfred, 2009). These X strings represent the scale of adjustment to be performed on the inverse document frequency (IDF) values of the corpus, and they are generated with values ranging from 0.5 to 1.5, uniformly distributed within [1, m]. Each string represents a subset (S_1, S_2, S_3, ..., S_{m-1}, S_m). When S_i is 1, the IDF value for the i-th term is not adjusted; otherwise it is adjusted by multiplying the IDF value for the i-th term by 0.5 or 1.5, depending on the value of S_i. The new adjusted tf-idf for each term is tf × idf × scale. The fitness function of this genetic algorithm is the Davies-Bouldin Index (DBI), which measures cluster quality:

DBI = (1/n) Σ_{i=1}^{n} max_{j≠i} [(δ_i + δ_j) / d(c_i, c_j)]   (8)

where n is the number of clusters, δ is the centroid distance and d is the centroid linkage. A lower DBI value means the clustering produced better clusters. The fitness function is defined over the genetic representation and measures the quality of the represented solution. The generational process is repeated until a termination condition is reached; this project sets a fixed number of generations as the termination condition.
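The following Java sketch outlines this GA setup (an illustration; the dbiOf helper is a hypothetical placeholder for the project's clustering-and-DBI pipeline, and details such as selection, crossover and mutation are omitted):

import java.util.Random;

public class GaSetupSketch {

    // one individual: per-term scales drawn from {0.5, 1.0, 1.5}
    static double[] randomScales(int m, Random rnd) {
        double[] s = new double[m];
        for (int i = 0; i < m; i++)
            s[i] = new double[]{0.5, 1.0, 1.5}[rnd.nextInt(3)];
        return s;
    }

    // adjusted TF-IDF matrix: weight[d][t] = tf[d][t] * idf[t] * scale[t]
    static double[][] adjust(double[][] tf, double[] idf, double[] scale) {
        double[][] w = new double[tf.length][idf.length];
        for (int d = 0; d < tf.length; d++)
            for (int t = 0; t < idf.length; t++)
                w[d][t] = tf[d][t] * idf[t] * scale[t];
        return w;
    }

    // fitness: lower DBI is better, so the negated DBI is maximized;
    // dbiOf is assumed to cluster the weighted documents and return the DBI
    static double fitness(double[][] tf, double[] idf, double[] scale) {
        return -dbiOf(adjust(tf, idf, scale));
    }

    // hypothetical placeholder for the clustering + DBI computation
    static double dbiOf(double[][] weights) { return 0; }
}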

4.4 User Interface Design

The user interface is the place where interaction between humans and machines occurs. The user interface of the system in this project is shown in Figure 4.4.


Figure 4.4: User Interface Design of the system in this project

The functions of the elements in the user interface are as follows:

Load Files button: allows the user to load documents into the program.
Clustering button: allows the user to perform clustering on the loaded documents.
Mapping button: allows the user to map the cluster memberships of the English and Malay clusters.
Apply Genetic Algorithm tick box: check this box to apply the genetic algorithm in the clustering.
Text pane: the output of the program is shown here.

4.5 Conclusion

This chapter gives a clear picture of the design stage of the system; the sequence of the program has been shown via the flow diagrams.


CHAPTER 5

IMPLEMENTATION

5.0 Introduction

Chapter 5 explains and shows the implementation and integration of the project. Implementation carries the development process from the design stage to the operational stage.

There is first an explanation of the data input and of the text processing that normalizes the terms. After that, there is a brief explanation of the calculation of the cosine similarity and the implementation of the clustering. The last part explains the DBI calculation and the setup for the implementation of the GA in this project.

5.1 Text Processing

After the texts are read from the documents, tokenization is performed to remove punctuation marks (e.g. ",", ".", "!" and ";") and extra spaces between the tokens. Uppercase letters in the texts are also transformed into lowercase letters during tokenization.

import java.util.StringTokenizer;

public class myTokenizer
{
    private String text;
    private String toToken;
    private StringTokenizer strTokenizer;

    public myTokenizer() { /* Constructor */
        ...}
    public void setText(String s) { /* Set the input text for the tokenizer */
        ...}
    public String toString() { /* Return the tokenized result */
        ... }
    private void tokenize() { /* Tokenizer operation */
        ...}
}

Figure 5.1: Code Fragment for Tokenizer
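The body of the tokenize operation is elided in the fragment above. The following is a minimal, self-contained sketch of one way the step described here could be implemented (regex-based; the class name TokenizerSketch and this exact approach are assumptions, not the thesis' code):

public class TokenizerSketch {
    /* Lowercase the text, strip punctuation marks and collapse extra spaces. */
    public static String tokenize(String text) {
        return text.toLowerCase()
                   .replaceAll("[\\p{Punct}]", " ")  // remove punctuation such as , . ! ;
                   .replaceAll("\\s+", " ")          // collapse extra spaces to one delimiter
                   .trim();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Friends, Romans, Countrymen, lend me your ears;"));
        // prints: friends romans countrymen lend me your ears
    }
}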

After that, stop words are removed from the texts, i.e., common stop words like "a", "are", "is" and "and".

import java.util.StringTokenizer;

public class myStopWordRemover
{
    private String text;
    private String toToken;
    private StringTokenizer strTokenizer;
    private myStopWordList myList; // A stopword list object that stores all the stopwords in an array

    public myStopWordRemover() { /* Constructor */
        ...}
    public void setText(String s) { /* Set the input for the stopword removal */
        ...}
    public String toString() { /* Return the result of the stopword removal */
        ...}
    private void tokenize() { /* Tokenization needed for the stopword removal */
        ...}
}

Figure 5.2: Code Fragment for Stopword Removal

Stemming is also performed on the texts using Porter's stemming algorithm, so most terms in different forms are "stemmed" into the same word. For example, the words "compute", "computing" and "computed" are all stemmed to "comput".


Code is written to plug in the Java version of Porter's stemmer.

import java.util.StringTokenizer;

public class myStemming
{
    String input;
    String output;
    Stemmer stem; /* Porter stemmer, ready to plug in to the system */

    public myStemming() { /* Constructor */
        ... }
    public void setText(String s) { /* Set the input for the stemmer */
        ... }
    @Override
    public String toString() { /* Return the result to the main system */
        ... }
    private void stemming() { /* Call the Porter stemmer and perform stemming */
        ... }
    private boolean isWord(String s) {
        ... }
}

Figure 5.3: Code Fragment for English Stemming
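The stemming call itself is elided above. A minimal sketch of invoking the plugged-in stemmer per word, assuming the widely distributed Java Stemmer class by Martin Porter (methods add(char), stem() and toString()); the helper name stemWord is illustrative:

public class StemmingSketch {
    /* Feed one word to the Porter stemmer and return its stem. */
    public static String stemWord(String word) {
        Stemmer stemmer = new Stemmer();
        for (char c : word.toCharArray()) {
            stemmer.add(c);        // feed the word one character at a time
        }
        stemmer.stem();            // run the Porter algorithm
        return stemmer.toString(); // e.g. "computing" -> "comput"
    }
}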

Table 5.1 shows an example result of tokenization, stopword removal and stemming.

Table 5.1: Example of text processing that normalizes the terms.

Step                 Data
1. Initial String    Friends, Romans, Countrymen, lend me your ears;
2. Tokenization      friends romans countrymen lend me your ears
3. Stopword Removal  friends romans countrymen lend ears
4. Stemming          friend roman countrymen lend ear


All text in the documents is stored in the form of strings of terms. A two-dimensional string array is used to store the texts and the file names.

String[][] contents_English = new String[NUMBER_OF_DOCUMENTS][2];
/*
Where:
contents_English[i][0] = TITLE_OF_i_DOCUMENT;
contents_English[i][1] = CONTENT_OF_i_DOCUMENT;
*/

Figure 5.4: Code fragment for the array.

5.2 Weighting

The text is "chopped", or tokenized, into a series of single terms using a tokenizer function in Java, where a space is used as the delimiter.

String[] tokens = input.split(" ");

/*

Where input is the string of the text.

*/

Figure 5.5: Tokenizer Function in Java.

After that, each distinct term is counted.

public class termFrequency
{
    private String input;
    private String[] termList;
    private String[] docList;
    private int[][] vector;

    public termFrequency() { /* Constructor */
        ...
    }
    public void initialize(String contents, String docNameS, String[] oldTermList, String[] oldDocList, int[][] oldVector) { /* Initialize the inputs needed to calculate the term frequency */
        ...
    }
    public void countTerms() { /* Perform the term-counting operation */
        ...
    }
    public String[] getDocList() { /* Return the document list to the main program */
        ...
    }
    public String[] getTermsList() { /* Return the term list to the main program */
        ...
    }
    public int[][] getVector() { /* Return the term frequency vector to the main program */
        ...
    }
}

Figure 5.6: Code Fragment for Calculating the Raw Term Frequency.
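The counting operation itself is elided above. As a compact illustrative sketch (not the thesis' exact code, which stores counts in an int[][] vector), the raw frequencies of one processed document can be counted with a HashMap:

import java.util.HashMap;
import java.util.Map;

public class TermCountSketch {
    /* Count raw term frequencies for one document whose text has already
       been tokenized, stopped and stemmed. */
    public static Map<String, Integer> countTerms(String processedText) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : processedText.split(" ")) {
            counts.merge(term, 1, Integer::sum); // increment the count for this term
        }
        return counts;
    }
}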

After the raw term frequency is counted, the normalized term frequency is computed using Equation 9:

tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \quad (9)

where n_{i,j} is the number of occurrences of the considered term t_i in document d_j, and the denominator is the sum of the numbers of occurrences of all terms in document d_j, that is, the size of the document |d_j|.

public void countTermFrequency(int[][] inVector) {
    vector = inVector;
    tfVector = new double[vector.length][vector[0].length];
    String[] tokens;
    for (int i = 0, n = vector.length; i < n; i++) {
        tokens = contents[i][1].split(" ");  // contents[i][1] holds the text of document i, so tokens.length is |dj|
        for (int j = 0, m = vector[i].length; j < m; j++) {
            tfVector[i][j] = (double) vector[i][j] / tokens.length;  // normalize by the document length
        }
    }
}

Figure 5.7: Code Fragment that Normalizes the Term Frequency.


The next step is to calculate the inverse document frequency for all terms.

public void countIDF(String[] inDocsList, String[] inTermsList, int[][] inVector) {
    /* Function that calculates the IDF value of every term. */
}

Figure 5.8: Code Fragment that Calculates the Inverse Document Frequency.
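The body of countIDF is elided above. The following is a minimal sketch that follows the definition in Equation 5, assuming a base-10 logarithm and that vector[i][j] holds the raw count of term j in document i (the method name computeIdf is illustrative):

/* Sketch of the IDF computation from Equation 5: idf_t = log(N / df_t). */
public double[] computeIdf(int[][] vector) {
    int numDocs = vector.length;
    int numTerms = vector[0].length;
    double[] idf = new double[numTerms];
    for (int j = 0; j < numTerms; j++) {
        int df = 0;
        for (int i = 0; i < numDocs; i++) {
            if (vector[i][j] > 0) df++;              // term j appears in document i
        }
        idf[j] = Math.log10((double) numDocs / df);  // df > 0 whenever the term exists in the corpus
    }
    return idf;
}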

After the term frequency and the inverse document frequency are calculated, the system calculates the tf-idf by multiplying the term frequency by the inverse document frequency.

public void countTFIDF(double[][] inTfVector) {

tfVector = inTfVector;

tfidfVector = new double[tfVector.length][tfVector[0].length];

for (int i = 0, n = tfVector.length; i < n; i++){

for (int j = 0, m = tfVector[i].length; j < m; j++){

tfidfVector[i][j] = tfVector[i][j] * idfVector[j]; }}

}

Figure 5.9: Code Fragment that calculates the TF-IDF.

5.3 Construct the Distance Matrix using Cosine Similarity

The distances between the documents/clusters are calculated using cosine similarity. The code segment below shows the cosine similarity calculation in the system.

public class cosineSimilarity {
    private double[] x;
    private double[] y;

    public cosineSimilarity() { /* Constructor */
    }
    public void setInput(double[] temp1, double[] temp2) { /* Set the input vectors for the cosine similarity */
    }
    public double getDistance() { /* Calculate the cosine similarity (Equation 7); this value is used as the distance measure */
        double dotProduct = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        if (x.length == y.length) {
            for (int i = 0; i < x.length; i++) {
                dotProduct = dotProduct + (x[i] * y[i]); // x . y
                normX = normX + (x[i] * x[i]);           // ||x||^2
                normY = normY + (y[i] * y[i]);           // ||y||^2
            }
        }
        normX = Math.sqrt(normX);
        normY = Math.sqrt(normY);
        return dotProduct / (normX * normY);
    }
}

Figure 5.10: Code Fragment that Calculates the Cosine Similarity between Clusters.


5.4 Clustering

The most important part of this project is clustering. This project implements hierarchical agglomerative clustering. Single linkage, complete linkage and average linkage are all applied in the clustering algorithm.

public class ClusteringMain {
    private String FILE_NAME = "dbi-result.txt";
    cluster[] myCluster;
    private double[][] tfidfVector;
    private String[] docsList;
    private String[] termsList;
    private ArrayList currentClusters;
    private int mode;
    private int clusterNo;
    private String textLog;

    public ClusteringMain(double[][] intfidfVector, String[] indocsList, String[] intermsList, int inmode, int inclusterNo) {… /* Constructor */
    }
    private void initialize() {… /* Initialization operations for the clustering */
    }
    public void doClustering(int clusterMode) {… /* Perform the clustering operation */
    }
    public ArrayList getAllClusters() {… /* Return the result of the clustering */
    }
    public String getTextLog() {… /* Return the text log of the clustering */
    }
}

Figure 5.11: Code Fragment That Shows All Functions in the Clustering Class.


The cluster class is used to store a cluster during the clustering. Each cluster object stores the vectors and the file names of the documents in the cluster.

public class cluster {
    private double weight;
    private double[][] coordinate;
    private String[] docsList;

    public cluster() {
        weight = 0; }

    /* Store a single document as a new cluster. */
    public void setPoints(double[] pt, String docList) {
        docsList = new String[1];
        docsList[0] = docList;
        if (pt.length > 0) {
            coordinate = new double[1][pt.length];
            for (int i = 0, n = pt.length; i < n; i++) {
                coordinate[0][i] = pt[i]; } }}

    /* Merge two clusters into this cluster at the given merge weight. */
    public void setPoints(cluster clu1, cluster clu2, double atWeight) {
        if (clu1.getDimension().length > 0) {
            if (clu2.getDimension().length > 0) {
                coordinate = new double[clu1.getDimension().length +
                        clu2.getDimension().length][clu1.getDimension()[0].length];
                int i, n, k;
                // Copy the document vectors of the first cluster...
                for (i = 0, n = clu1.getDimension().length; i < n; i++) {
                    for (int j = 0, m = clu1.getDimension()[i].length; j < m; j++) {
                        coordinate[i][j] = clu1.getDimension()[i][j]; }}
                // ...then append the document vectors of the second cluster.
                for (k = 0, n = clu2.getDimension().length; k < n; k++) {
                    for (int j = 0, m = clu2.getDimension()[k].length; j < m; j++) {
                        coordinate[i][j] = clu2.getDimension()[k][j]; }
                    i++; }}}
        if (clu1.getAllDocuments().length > 0) {
            if (clu2.getAllDocuments().length > 0) {
                // Concatenate the document name lists of both clusters.
                docsList = new String[clu1.getAllDocuments().length +
                        clu2.getAllDocuments().length];
                int i, j, m, n;
                for (i = 0, n = clu1.getAllDocuments().length; i < n; i++) {
                    docsList[i] = clu1.getAllDocuments()[i]; }
                for (j = 0, m = clu2.getAllDocuments().length; j < m; j++) {
                    docsList[i] = clu2.getAllDocuments()[j];
                    i++; }}}
        weight = atWeight; }

    public double getWeight() {
        return weight; }

    public String[] getAllDocuments() {
        return docsList; }

    public double[][] getDimension() {
        return coordinate; }}

Figure 5.12: Code Fragment for Cluster Class.

At each stage of hierarchical agglomerative clustering, the two most similar clusters are merged, until the number of clusters requested by the user is reached. The difference between single linkage, complete linkage and average linkage is the definition of the term "similarity". For single linkage, the similarity of two clusters is the similarity of their most similar pair of items, one from each cluster. For complete linkage, the similarity of two clusters is the similarity of their least similar pair of items, one from each cluster. For average linkage, the similarity of two clusters is the average of the similarities over all pairs of items, one from each cluster; a small worked comparison is given below.
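As an illustration with hypothetical numbers: suppose clusters A and B contain items whose pairwise cosine similarities are 0.9, 0.5 and 0.1. Single linkage scores the pair of clusters by the most similar pair, \max\{0.9, 0.5, 0.1\} = 0.9; complete linkage by the least similar pair, \min\{0.9, 0.5, 0.1\} = 0.1; and average linkage by the mean, (0.9 + 0.5 + 0.1)/3 = 0.5.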

public class allLinkageProcessing {
    private double[][] elements1;
    private double[][] elements2;
    private double[] distances;
    private int mode;

    public allLinkageProcessing() {
    }
    public void setPoints(cluster a, cluster b) {
        …}
    public void countDist(int inmode) {
        … /* Calculate the distances between the items of the two clusters. */
        Arrays.sort(distances); /* Sort the distances in ascending order */
    }
    public double getSinglelink() {
        double temp = 0.0;
        if (mode == 1) {
            temp = distances[0];                       // mode 1: smallest value
        } else if (mode == 2) {
            temp = distances[(distances.length - 1)];  // mode 2: largest value
        }
        return temp; }
    public double getCompletelink() {
        double temp = 0.0;
        if (mode == 1) {
            temp = distances[(distances.length - 1)];
        } else if (mode == 2) {
            temp = distances[0]; }
        return temp; }
    public double getGroupAverage() {
        double temp = 0.0;
        for (int i = 0, n = distances.length; i < n; i++) {
            temp = temp + distances[i]; }
        return temp / (double) distances.length;       // average over all pairwise values
    }
    public double getCentroidDistance() {
        double temp = 0.0;                             // not implemented in this fragment
        return temp; }
    public double[] getCentroid(double[][] a) {
        … }}

Figure 5.13: Code Fragment that Shows the Calculation of Single Linkage, Complete Linkage and Average Linkage.


Hierarchical agglomerative clustering is a very time-consuming process, so an efficient coding design is important to save time when testing the system.

while (currentClusters.size() > clusterNo) {
    // Initialize the search with the first pair of clusters.
    a = (cluster) currentClusters.get(0);
    b = (cluster) currentClusters.get(1);
    al = new allLinkageProcessing();
    al.setPoints(a, b);
    al.countDist(mode);
    if (clusterMode == 1) {
        min = al.getSinglelink();
    } else if (clusterMode == 2) {
        min = al.getCompletelink();
    } else if (clusterMode == 3) {
        min = al.getGroupAverage();
    }
    pos1 = 0;
    pos2 = 1;
    // Scan every pair of clusters for the best linkage value.
    for (int i = 0, n = currentClusters.size(); i < n; i++) {
        for (int j = i + 1, m = currentClusters.size(); j < m; j++) {
            a = (cluster) currentClusters.get(i);
            b = (cluster) currentClusters.get(j);
            al = new allLinkageProcessing();
            al.setPoints(a, b);
            al.countDist(mode);
            if (clusterMode == 1) {
                distance = al.getSinglelink();
            } else if (clusterMode == 2) {
                distance = al.getCompletelink();
            } else if (clusterMode == 3) {
                distance = al.getGroupAverage();
            }
            if (mode == 1) {          // mode 1: keep the smallest value (distance measure)
                if (distance < min) {
                    min = distance;
                    pos1 = i;
                    pos2 = j;
                }
            } else if (mode == 2) {   // mode 2: keep the largest value (similarity measure)
                if (distance > min) {
                    min = distance;
                    pos1 = i;
                    pos2 = j;
                }
            }
        }
    }
    // Merge the best pair into a new cluster and update the cluster list.
    a = (cluster) currentClusters.get(pos1);
    b = (cluster) currentClusters.get(pos2);
    c = new cluster();
    c.setPoints(a, b, min);
    currentClusters.remove(pos2);   // remove pos2 first so that index pos1 stays valid
    currentClusters.remove(pos1);
    currentClusters.add(c);
}

Figure 5.14: Code Fragment for Hierarchical Agglomerative Clustering.

5.5 Cluster Analysis

After the clustering process, cluster analysis is performed on the clusters using the DBI value.

The Davies-Bouldin Index (DBI) is used to measure the quality of the clusters. A lower DBI value means a better clustering result, because the centres of the clusters are far away from each other.

DBI = \frac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left\{ \frac{\delta_i + \delta_j}{d(c_i, c_j)} \right\} \quad (8)

The code for calculating the DBI value is shown in Figure 5.15.


public class DBI {
    private ArrayList allClusters;
    private int distanceMode;
    private double dbiValue;

    public DBI(ArrayList inClusters, int inDistanceMode) { /* Constructor */
    }
    public double getDBI() { /* Return the DBI result */
        return dbiValue; }
    public double[] getCentroid(double[][] a) { /* Calculate the centroid of the cluster */
    }
    public double getDistance(double[] a, double[] b) {
    }
    public double getAverageDistanceInCluster(double[][] allPoints) { /* Return the average distance from each point to the centroid */
    }
    public double getDistanceBetweenClusters(double[][] a, double[][] b) { /* Return the distance between the centroids of the two clusters */
    }
    public double getSpecialDistance(double[][] a, double[][] b) { /* (centroid distance a + centroid distance b) / centroid linkage */
    }
}

Figure 5.15: Code Fragment for DBI Calculation.
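The method bodies above are elided. The following is a minimal sketch of how Equation 8 could be assembled from those helpers, assuming each cluster is passed as a double[][] of its document vectors (illustrative only, not the thesis' exact implementation):

/* Sketch of Equation 8 built from the helper methods declared above. */
public double computeDBI(java.util.List<double[][]> clusters) {
    int n = clusters.size();
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double worst = Double.NEGATIVE_INFINITY;
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            double di = getAverageDistanceInCluster(clusters.get(i));  // delta_i
            double dj = getAverageDistanceInCluster(clusters.get(j));  // delta_j
            double dij = getDistanceBetweenClusters(clusters.get(i), clusters.get(j)); // d(c_i, c_j)
            worst = Math.max(worst, (di + dj) / dij);
        }
        sum += worst;  // worst-case ratio for cluster i
    }
    return sum / n;
}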


5.6 Genetic Algorithm Setup

In this project, the genetic algorithm is applied to improve the quality of the clustering. The setup of the genetic algorithm is shown in Table 5.2.

Table 5.2: The Setup for the Experiment of the Genetic Algorithm.

Size of chromosome      Depends on the number of unique terms in all documents
Gene values             {0.5, 1.0, 1.5}
Fitness value           DBI value; the lower the DBI value, the better the fitness
Crossover type          Uniform
Crossover rate          50%
Mutation rate           10%
Size of population      30
Number of generations   50

The algorithm for the GA applied in this project is defined below; a sketch of the generational loop is given after the list.

1. Start with a randomly generated population of n l-bit strings (candidate solutions to the problem).

For population = 1 to 30 do
    Chromosome[population] = { rn(0.5, 1.0, 1.5)_1, rn(0.5, 1.0, 1.5)_2, rn(0.5, 1.0, 1.5)_3, …, rn(0.5, 1.0, 1.5)_n }
end

2. Calculate the fitness f(x) of each string in the population.

The tf-idf values of all terms are adjusted using tf_i × idf_i × scale_i, where scale_i is the value of the ith gene of chromosome x.
The newly adjusted tf-idf values are used to perform the clustering, and then the DBI value of the clustering is calculated.
The lowest DBI value is the best fitness.

3. Repeat the following steps until n new strings have been created:

o Select a pair of parent strings from the current population, the probability of selection being an increasing function of fitness.

o With the crossover probability, cross over the pair at a randomly chosen point to form two new strings. If no crossover takes place, form two new strings that are exact copies of their respective parents.

o Mutate the two new strings at each locus with the mutation probability, and place the resulting strings in the new population.

4. Replace the current population with the new population.

5. Go to step 2.
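A compact sketch of steps 3 to 5 under the Table 5.2 settings is given below. Binary tournament selection stands in for the fitness-proportional selection of step 3 (it likewise favours lower-DBI strings), and uniform crossover follows Table 5.2; all class and method names are illustrative, not the thesis' code:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class GaGenerationSketch {
    private static final Random RND = new Random();
    private static final double[] GENE_VALUES = {0.5, 1.0, 1.5};
    private static final double CROSSOVER_RATE = 0.5;  // crossover rate from Table 5.2
    private static final double MUTATION_RATE = 0.1;   // mutation rate from Table 5.2

    /* Binary tournament: the string with the lower DBI (better fitness) wins. */
    static double[] tournamentSelect(List<double[]> pop, double[] fitness) {
        int a = RND.nextInt(pop.size());
        int b = RND.nextInt(pop.size());
        return fitness[a] < fitness[b] ? pop.get(a) : pop.get(b);
    }

    /* Build the next generation (steps 3 and 4 of the algorithm above). */
    static List<double[]> nextGeneration(List<double[]> pop, double[] fitness) {
        List<double[]> next = new ArrayList<>();
        while (next.size() < pop.size()) {
            double[] p1 = tournamentSelect(pop, fitness).clone();
            double[] p2 = tournamentSelect(pop, fitness).clone();
            if (RND.nextDouble() < CROSSOVER_RATE) {          // uniform crossover
                for (int i = 0; i < p1.length; i++) {
                    if (RND.nextBoolean()) { double t = p1[i]; p1[i] = p2[i]; p2[i] = t; }
                }
            }
            for (int i = 0; i < p1.length; i++) {             // per-locus mutation
                if (RND.nextDouble() < MUTATION_RATE) p1[i] = GENE_VALUES[RND.nextInt(3)];
                if (RND.nextDouble() < MUTATION_RATE) p2[i] = GENE_VALUES[RND.nextInt(3)];
            }
            next.add(p1);
            next.add(p2);
        }
        return next;
    }
}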

5.7 Conclusion

In this stage, all modules used by the system are implemented. The system should be able to process the input data, run the calculations and display the results to the user at the end. The next step in this project is to test the system in order to verify it. All documentation of the results is recorded in the following chapter.


CHAPTER 6

TESTING

6.1 Introduction

This chapter presents the testing results of this project. The testing results are important to verify that the project objectives have been fulfilled. Different datasets and parameters are used to test the system. The objective of the testing process is to find the errors and bugs in the system that lead to inconsistent results.

6.2 Loading Data

The first step in running this program is to load the data for testing. In this project, the datasets are English and Malay documents saved in text file format (.txt). A file dialog is used in the system to allow the user to select the folder directory that contains the document set.

Figure 6.1: File Dialog That Allows the User to Select the Dataset Directory


Since the datasets are in text file format, when the user selects a folder that does not contain any text file, the system will not load any data and will display a message dialog for the user.

Figure 6.2: Message Dialog That Informs the User That the Selected Folder Does Not Contain Any Text File.

After the documents are successfully loaded into the system, the file names and file contents are displayed in the program. The user can review the loaded files to check whether they are the right datasets.

Figure 6.3: Print Screen of the Loaded Files Displayed in the Program.


6.3 Text Processing

The second test covers the text processing, including tokenization, stopword removal and stemming.

Input (number of words: 126):
Zul Noordin Says Will Not Retract Police Report Against Khalid
KUALA LUMPUR, Jan 25 (Bernama) -- Kulim Bandar-Baharu Member of Parliament (MP) Zulkifli Noordin says will not retract a police report he lodged against Shah Alam MP Khalid Samad despite being told to do so by Parti Keadilan Rakyat (PKR) advisor, Datuk Seri Anwar Ibrahim.
Zulkifli, who is also a member of PKR, said he would only do so if Khalid retracted statements he made that he (Zulkifli) deemed were insulting to Islam.
Zulkifli told this to Bernama when contacted about the matter on Monday night.
He had lodged the report at the Masjid India police station here on Saturday.
Khalid is from PAS. Both PAS and PKR are members of the Pakatan Rakyat opposition coalition.

After tokenization (number of words: 126):
zul noordin says will not retract police report against khalid kuala lumpur jan 25 bernama kulim bandar baharu member of parliament mp zulkifli noordin says will not retract a police report he lodged against shah alam mp khalid samad despite being told to do so by parti keadilan rakyat pkr advisor datuk seri anwar ibrahim zulkifli who is also a member of pkr said he would only do so if khalid retracted statements he made that he zulkifli deemed were insulting to islam zulkifli told this to bernama when contacted about the matter on monday night he had lodged the report at the masjid india police station here on saturday khalid is from pas both pas and pkr are members of the pakatan rakyat opposition coalition

After stopword removal (number of words: 69):
zul noordin retract police report khalid kuala lumpur jan 25 bernama kulim bandar baharu parliament mp zulkifli noordin retract police report lodged shah alam mp khalid samad despite told parti keadilan rakyat pkr advisor datuk seri anwar ibrahim zulkifli pkr khalid retracted statements zulkifli deemed insulting islam zulkifli told bernama contacted matter monday night lodged report masjid india police station saturday khalid pas pas pkr pakatan rakyat opposition coalition

After stemming (number of words: 68):
zul noordin retract polic report khalid kuala lumpur jan bernama kulim bandar baharu parliament mp zulkifli noordin retract polic report lodg shah alam mp khalid samad despit told parti keadilan rakyat pkr advisor datuk seri anwar ibrahim zulkifli pkr khalid retract statement zulkifli deem insult islam zulkifli told bernama contact matter mondai night lodg report masjid india polic station saturdai khalid pa pa pkr pakatan rakyat opposit coalit

Figure 6.4: The Text Processing That Normalizes the Texts. Note that the number of words decreases at each step.


6.4 Weighting

Term frequency is one of the most important elements in document clustering. Raw term frequency is defined as the number of times a term t appears in document D. For example, in Figure 6.5, the term "chong" is found 4 times in doc001.txt and the term "mew" is found 3 times in doc001.txt.

Figure 6.5: Print Screen of the Raw Term Frequency Table.

The term frequency of a term t in a document D can be normalized by the total number of terms N_D in the document using Equation 9. For example, the term frequency of the term "chong" changes from 4 to 0.01694915 after the term frequency normalization.

Figure 6.6: Print Screen of the Term Frequency Table.

The inverse document frequency (idf) is also displayed in this system.

Figure 6.7: Part of the idf displayed by the program.


Note that the idf value of "bernama" is 0; this is because "bernama" occurs in all documents, which means "bernama" is not useful for distinguishing relevant from non-relevant documents in the collection.

The TF-IDF was calculated using Equation 2.

Figure 6.8: Print Screen of the TF-IDF Table


6.5 Distance Matrix

In this project, cosine similarity is selected to measure the similarity between the documents or clusters. For example, documents "Doc001.txt" and "Doc001.txt" have the same contents since they are the same document; therefore the cosine similarity between "Doc001" and "Doc001" is 1.

Figure 6.9: Print Screen of the Distance Matrix.

Table 6.1: Cosine similarity distances matrix for 10 documents

Doc001 Doc002 Doc003 Doc004 Doc005 Doc006 Doc007 Doc008 Doc009 Doc010

Doc001 1 0.163477 0.284548 0.284688 0.02337295 0.006486 0.007898 0.004389 0.010518 6.32E-04

Doc002 0.163477 1 0.07978 0.167431 0.03145406 0.00369 0.004215 0.014173 0.003047 6.71E-04

Doc003 0.284548 0.07978 1 0.152682 0.00476926 0.002682 3.61E-04 0.017931 1.99E-04 0.013871

Doc004 0.284688 0.167431 0.152682 1 0.03681549 0.008881 0.011351 0.007047 0.016692 8.94E-04

Doc005 0.023373 0.031454 0.004769 0.036815 1 0.005798 0.002323 0.010332 0.011481 0.004086

Doc006 0.006486 0.00369 0.002682 0.008881 0.00579808 1 0.005763 0.02707 0.323859 0.02797

Doc007 0.007898 0.004215 3.61E-04 0.011351 0.00232275 0.005763 1 0.080513 0.036755 2.13E-04

Doc008 0.004389 0.014173 0.017931 0.007047 0.01033184 0.02707 0.080513 1 0.070439 0.023135

Doc009 0.010518 0.003047 1.99E-04 0.016692 0.01148149 0.323859 0.036755 0.070439 1 0.07894

Doc010 6.32E-04 6.71E-04 0.013871 8.94E-04 0.00408612 0.02797 2.13E-04 0.023135 0.07894 1

6.6 Experiment 1: Testing the Cluster Results of Different Clustering Algorithms

Clustering is the main task in this project. A good clustering result is important for the cluster mapping between the Malay clusters and the English clusters. This project ran single linkage, complete linkage and average linkage clustering on 50 documents. The 50 documents were clustered into 7 clusters, and the sets of documents in the clusters are compared to see the differences between the clustering results of single linkage, complete linkage and average linkage. The 10 most important terms of each cluster are also shown in the results.

Test 1: Single Linkage

Table 6.2: Clustering Result for Single Linkage.

Cluster E1 (42 documents): Business001.txt, Business007.txt, General004.txt, Business002.txt, Business006.txt, Business004.txt, Business005.txt, Business008.txt, Feature010.txt, Business003.txt, Business010.txt, Feature009.txt, Feature002.txt, Feature005.txt, General007.txt, Sport005.txt, Sport006.txt, General001.txt, Politic001.txt, Politic004.txt, Politic005.txt, Politic006.txt, Politic008.txt, Politic002.txt, Politic003.txt, Politic009.txt, General002.txt, Politic007.txt, Sport010.txt, General005.txt, General010.txt, Sport007.txt, Sport009.txt, General006.txt, General008.txt, Feature003.txt, Politic010.txt, Sport001.txt, Sport004.txt, Sport003.txt, Sport002.txt, Sport008.txt
Important terms: bank, umno, pkr, khalid, zulkifli, polic, petrona, team, sailor, sport

Cluster E2 (1 document): Business009.txt
Important terms: fdi, unctad, flow, drop, cent, trillion, declin, quarter, economi, global

Cluster E3 (3 documents): Feature001.txt, Feature006.txt, Feature008.txt
Important terms: seedstock, camel, prawn, farm, freshwat, haiezack, pond, breed, breeder, venture

Cluster E4 (1 document): Feature004.txt
Important terms: traffic, jakarta, road, jl, agu, bu, citi, hour, crawl, peak

Cluster E5 (1 document): Feature007.txt
Important terms: paint, voc, eco, chemic, soo, odour, irrit, environment, hazard, low

Cluster E6 (1 document): General003.txt
Important terms: asli, orang, expos, knowledg, tradit, shafi, documentari, master, hospit, import

Cluster E7 (1 document): General009.txt
Important terms: choi, ik, summon, court, magistr, chong, privat, daphn, seduct, appeal

From Table 6.2, 42 documents are clustered into cluster E1, and the important terms for cluster E1 are bank, umno, pkr, khalid, zulkifli, polic, petrona, team, sailor and sport.


Test 2: Complete Linkage

Table 6.3: Clustering Result for Complete Linkage

Cluster E1 (5 documents): Business001.txt, Business007.txt, Business003.txt, Business010.txt, Business009.txt
Important terms: properti, fdi, equiti, fund, rubber, price, cent, unctad, incom, flow

Cluster E2 (5 documents): Business002.txt, Business006.txt, General004.txt, Feature003.txt, Feature007.txt
Important terms: paint, petrona, biochar, lubric, proton, bad, timor, lest, ga, carbon

Cluster E3 (13 documents): Business004.txt, Business005.txt, Business008.txt, Feature010.txt, General007.txt, Sport005.txt, Sport006.txt, Feature001.txt, Feature006.txt, Feature008.txt, Feature002.txt, Feature005.txt, Feature004.txt
Important terms: bank, seedstock, team, camel, class, race, prawn, pst, deposit, product

Cluster E4 (7 documents): Feature009.txt, Sport010.txt, General010.txt, General006.txt, General008.txt, Sport007.txt, Sport009.txt
Important terms: sailor, tourism, sport, venu, wind, boat, safeti, train, park, mya

Cluster E5 (9 documents): General001.txt, Politic001.txt, Politic004.txt, Politic005.txt, Politic006.txt, Politic008.txt, General003.txt, General005.txt, General009.txt
Important terms: pkr, khalid, zulkifli, polic, retract, asli, pa, gunasegaran, orang, nik

Cluster E6 (6 documents): General002.txt, Politic007.txt, Politic010.txt, Politic002.txt, Politic003.txt, Politic009.txt
Important terms: umno, voter, secretari, suprem, appoint, tengku, razaleigh, regist, divis, puad

Cluster E7 (5 documents): Sport001.txt, Sport004.txt, Sport003.txt, Sport002.txt, Sport008.txt
Important terms: bt, chn, chong, round, wei, titl, singl, lee, hafiz, pair


Test 3: Average Linkage

Table 6.4: Clustering Result for Average Linkage.

Cluster E1 (16 documents): Business001.txt, Business007.txt, Business002.txt, Business006.txt, General004.txt, Business004.txt, Business005.txt, Business008.txt, Feature010.txt, Business003.txt, Business010.txt, Business009.txt, Feature003.txt, Feature001.txt, Feature006.txt, Feature008.txt
Important terms: bank, seedstock, sale, properti, fdi, cent, price, petrona, camel, equity

Cluster E2 (6 documents): Feature002.txt, Feature005.txt, General007.txt, Sport005.txt, Sport006.txt, Feature004.txt
Important terms: team, race, class, pst, endur, dubai, lap, cultur, traffic, commun

Cluster E3 (1 document): Feature007.txt
Important terms: paint, voc, eco, chemic, soo, odour, irrit, environment, hazard, low

Cluster E4 (7 documents): Feature009.txt, Sport010.txt, General006.txt, General008.txt, General010.txt, Sport007.txt, Sport009.txt
Important terms: sailor, tourism, sport, venu, wind, boat, safeti, train, park, mya

Cluster E5 (13 documents): General001.txt, Politic001.txt, Politic004.txt, Politic005.txt, Politic006.txt, Politic008.txt, Politic002.txt, Politic003.txt, Politic009.txt, General002.txt, Politic007.txt, Politic010.txt, General003.txt
Important terms: umno, pkr, khalid, zulkifli, retract, asli, voter, pa, rakyat, secretary

Cluster E6 (2 documents): General005.txt, General009.txt
Important terms: gunasegaran, choi, polic, court, death, personnel, ik, summon, intent, subari

Cluster E7 (5 documents): Sport001.txt, Sport004.txt, Sport003.txt, Sport002.txt, Sport008.txt
Important terms: bt, chn, chong, round, wei, titl, singl, lee, hafiz, pair


Table 6.5: Summary of Document Distribution in Clusters.

                   E1   E2   E3   E4   E5   E6   E7   Total
Single Linkage     42    1    3    1    1    1    1      50
Complete Linkage    5    5   13    7    9    6    5      50
Average Linkage    16    6    1    7   13    2    5      50

Figure 6.10: Summary of Document Distribution in 7 Clusters. [Bar chart: x-axis cluster No. 1 to No. 7; y-axis number of documents (0 to 45); one series each for single, complete and average linkage.]

From the results in Table 6.5, complete linkage seems to produce better clusters than single linkage and average linkage, as the sizes of the clusters are more balanced. Single linkage has the worst clusters, where some clusters are extremely bigger than the others. For example, in Table 6.2, there are 42 documents in cluster E1, which is 84% of all the documents. This is because single linkage is sensitive to noise and outliers (Tan et al., 2006). Average linkage is intermediate between single linkage and complete linkage: there is no extremely large cluster in the average linkage result, but some clusters contain only 1 to 2 documents, for example clusters E3 and E6 in Table 6.4.

6.7 Experiment 2: Testing the DBI Values of Single Linkage, Complete Linkage and Average Linkage

A test was made by clustering 100 documents using complete linkage, single linkage and average linkage. The DBI value of each clustering test is recorded to observe the quality of the clustering, where a lower DBI value indicates a better clustering result. This experiment was run using 5 clusters, 10 clusters, 15 clusters and 20 clusters.

Table 6.6: DBI Values for the Different Hierarchical Agglomerative Clustering Methods.

                    5 Clusters   10 Clusters   15 Clusters   20 Clusters
Complete Linkage     4.882957     3.523555      2.831373      2.47898299
Single Linkage       1.781201     1.176582      1.119181      1.009035665
Average Linkage      3.672783     2.637479      2.335699      2.05946653

Figure 6.11: DBI Values for the Different Hierarchical Agglomerative Clustering Methods. [Line chart: x-axis 5, 10, 15 and 20 clusters; y-axis DBI value (0 to 6); one series each for complete linkage, single linkage and average linkage.]


Based on the results in Table 6.6, the DBI values decrease from 5 clusters to 20 clusters: clustering into 5 clusters produces the highest DBI value and clustering into 20 clusters produces the lowest. This is because when the number of clusters increases, the number of elements per cluster decreases and the within-cluster distances decrease. This means a larger number of clusters groups the documents more compactly and produces better-shaped clusters compared to a lower number of clusters.


6.8 Experiment 3: Improving the DBI Value and the Percentage of English-Malay Mapping Using the Genetic Algorithm

In this experiment, average linkage clustering was performed on a set of 100 documents, and the English and Malay clusters were then mapped. The DBI values and the success rate of the mapping were compared before and after applying the GA.

Table 6.7: DBI Values for English Document Clustering (Before GA and After GA).

                                5 Clusters   10 Clusters   15 Clusters
Complete Linkage (Before GA)      2.113        1.812         1.737
Complete Linkage (After GA)       1.184        1.642         0.575

Figure 6.12: DBI Values for Each Generation of the Genetic Algorithm. [Line chart: x-axis number of generations (1 to 50); y-axis DBI value (0.000 to 2.500); series for English 5 clusters, 10 clusters and 15 clusters.]


Table 6.8: The Percentage of Successful Cluster Mapping Before and After the GA Is Applied.

                                                 5 Clusters   10 Clusters   15 Clusters
Average Successful Cluster Mapping (Before GA)
    MEM                                            72.0%        60.0%         52.3%
    EMM                                            69.3%        61.0%         54.3%
Average Successful Cluster Mapping (After GA)
    MEM                                            76.3%        52.3%         54.0%
    EMM                                            73.0%        54.0%         55.3%

*MEM = Malay to English cluster mapping
*EMM = English to Malay cluster mapping

From Table 6.7, the DBI values decreased after the genetic algorithm was applied. This means that better-shaped, more compact clusters were produced by the clustering that used the adjusted tf-idf.

Table 6.8 shows that applying the GA is able to improve the percentage of successful cluster mappings between the English and Malay clusters. However, not every GA experiment improves the mapping results. Referring to Table 6.8, the percentage of cluster mappings for 10 clusters dropped after the GA was applied to the clustering. This is due to the worst case of the GA, where the tf-idf was over-adjusted and hence made a big change to the structure of the clusters, which affected the cluster mapping.


6.9 Conclusion

In conclusion, the testing phase was performed correctly on the system, and the system ran successfully according to its design. Based on the results, complete linkage is the most suitable clustering algorithm for document clustering, as it produces small, tightly bound and cohesive clusters, whereas single linkage clustering tends to produce large, loosely bound and "straggly" clusters. Referring to the results of Experiment 3, applying the genetic algorithm can also improve the DBI of the clusters and hence slightly improve the cluster mapping between the English and Malay clusters.


CHAPTER 7

CONCLUSION AND FUTURE WORK

7.1 Introduction

The purpose of developing this system is to implement a framework for parallel clustering of English and Malay text documents. The clustering results, the English and Malay clusters, then proceed to the cluster mapping. A genetic algorithm is applied to the system to improve the clustering results. A system was designed and developed to run experiments to achieve the objectives of this project, and the results were analyzed based on the data generated by the system. Finally, the conclusion summarizes the work and study carried out during this project.

7.2 Objective Achievement

Generally, the objectives of this system have been achieved. This section summarizes the objectives achieved in this project.

Objective 1: Propose a framework for English-Malay parallel hierarchical agglomerative clustering.

The system is able to perform English text document clustering and Malay text document clustering together. The clusters produced by the clustering can be mapped between the English clusters and the Malay clusters.

Objective 2: Implement a hierarchical agglomerative clustering method to cluster English documents.

Single linkage, complete linkage and average linkage are all applied in the system in this project. The results show that hierarchical agglomerative clustering is able to group the English documents into clusters.


Objective 3: Apply the genetic algorithm (GA) and analyze its effects on the clustering results.

The genetic algorithm improved the clustering results: the DBI value was reduced after the genetic algorithm was run on the system.

7.3 Discussion

Based on the results of the experiments, complete linkage hierarchical agglomerative clustering produced better-quality clusters. This is because complete linkage HAC is less sensitive to the noise and outliers in the data set. Normalizing the text via tokenization, stopword removal and stemming is important, as it reduces the time required for document clustering.

A large number of clusters produces a lower DBI value; this is because a large number of clusters reduces the distance between each document and its centroid.

The genetic algorithm in this system can reduce the DBI value of the clustering and improve the structure of the clusters. This is due to the compact and well-separated clusters found by minimizing the DBI. However, in some genetic algorithm experiments the percentage of successful cluster mappings was reduced; this is because the tf-idf of the documents was over-adjusted, so a good cluster mapping between the English and Malay clusters could not be performed.

7.4 Limitations of the System

The limitations of the system refer to the things that the system cannot provide or do. The limitations of this system are listed below.

This system cannot distinguish English documents from Malay documents.
This system cannot save the results for the user.
This system cannot display the dendrogram as a visualization of the clustering.


7.5 Recommendations for Future Work

Some suggestions and ideas are identified which can be used to improve this system. The recommendations are stated below:

Build an English and Malay language recognizer that automatically separates the documents based on the document language.
Save the results of the system for future reference.
Display the dendrogram to illustrate the structure of the clusters of the hierarchical agglomerative clustering.

7.6 Conclusions

Based on the limitations and scope of this project, the English-Malay parallel clustering can still be improved. Suggestions and recommendations were given in the previous section so that future work can improve the design of the system. The discussion section summarizes the study in this project and explains the results of the experiments. Although the system has some limitations, it still achieved the objectives of the project.


References

Alfred, R. (2009). A Parallel Hierarchical Agglomerative Clustering Technique for Bilingual Corpora Based on Reduced Terms with Automatic Weight Optimization. In Proceedings of the 5th International Conference on Advanced Data Mining and Applications. Springer-Verlag, Berlin, 19-30.

Dash, M., Petrutiu, S., & Scheuermann, P. (2004). Efficient Parallel Hierarchical Clustering. Accepted for publication in the 10th International Conference on Parallel and Distributed Computing, 364-371.

Fujii, A., & Ishikawa, T. (2001). Japanese/English Cross-Language Information Retrieval. 1-29.

Gey, F. C., Kando, N., & Peters, C. (2004). Cross-Language Information Retrieval: The Way Ahead. 416-430.

Greengrass, E. (2000). Information Retrieval: A Survey. University of Maryland, Maryland, 111-115.

Kishida, K. (2004). Technical Issues of Cross-Language Information Retrieval: A Review. Information Processing and Management: An International Journal, Special Issue: Cross-Language Information Retrieval, 435-437.

Manning, C. D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge: Cambridge University Press.

Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston: Pearson Education.

Teknomo, K. (2007). Hierarchical Clustering Tutorial. Retrieved October 10, 2010, from Kardi Teknomo's Page: http://people.revoledu.com/kardi/tutorial/Clustering/index.html

Yusuf, H. R. (1992). An Analysis of Indonesia Language for Interlingual Machine-Translation System. In COLING '92: Proceedings of the 14th Conference on Computational Linguistics (4), 1228-1232.