fyp ii ( chan chen jie bk08110364)
TRANSCRIPT
MALAY-ENGLISH PARALLEL CLUSTERING USING
HIERARCHICAL AGGLOMERATIVE CLUSTERING
CHAN CHEN JIE
SCHOOL OF ENGINEERING AND INFORMATION TECHNOLOGY
UNIVERSITI MALAYSIA SABAH 2011
MALAY-ENGLISH PARALLEL CLUSTERING USING
HIERARCHICAL AGGLOMERATIVE CLUSTERING
CHAN CHEN JIE
THESIS SUBMITTED IN FULFILMENT FOR THE BACHELOR OF COMPUTER SCIENCE
(SOFTWARE ENGINEERING)
SCHOOL OF ENGINEERING AND INFORMATION TECHNOLOGY
UNIVERSITI MALAYSIA SABAH 2011
i
DECLARATION
I hereby declare that this piece of work is completed by me except for using
some resources as my reference and information, which I had declared them
in my writings.
6-5-2011
________________
CHAN CHEN JIE
CERTIFIED BY __________________________ __________________________ DR. RAYNER ALFRED PN. SURAYA ALIAS (PROJECT SUPERVISOR) (EXAMINER)
ii
ACKNOWLEDGEMENT
First and the foremost, I want to acknowledge the continuous effort of my
project supervisor, Dr. Rayner Alfred who has guide me from the beginning
of the project until the end of this project. I sincerely appreciate his precious
suggestion and comment throughout this project. He is very scrupulous
helped me on every chapters and aspect of this project.
Besides, I also wish to thank to my family, lecturers, and friends who
assisting me in whole of my studies, no matter externally or internally in the
period of study and doing this project. Their continuous support is the
biggest impetus to me, in order to struggle hard towards this project.
iii
ABSTRACT
MALAY-ENGLISH PARALLEL CLUSTERING USING HIERARCHICAL
AGGLOMERATIVE CLUSTERING
Multilingual text documents are becoming important resource for work in
multilingual natural language processing. This paper discusses the effects applying
clustering technique on parallel bilingual documents. It is interesting to look at the
differences of the cluster mapping of the Malay clusters and English clusters.
Hierarchical agglomerative clustering is chosen as clustering approach for this
project. Genetic algorithm is applied to optimize the weights of terms considered in
clustering the texts documents. Finally, this work aim to discover the suitable
clustering hierarchical agglomerative clustering methods for English and Malay text
documents clustering and the mapping results between English documents Clusters
and Malay Documents clusters.
iv
ABSTRAK
Koleksi berbilang-bahasa adalah salah satu sumber yang penting untuk menjalani
pelbagai tugas pemprosesan bahasa. Kertas kerja ini membincang tentang kesan
applikasi pengelompokan dalam dokumen dwibahasa selari. Ia adalah menarik
untuk melihat perbezaan pemetaan gelompok kumpulan bahasa Malaysia dan
bahasa English. “Hierarchical Agglomerative Clustering” adalah teknik
pengelompokan yang dipilih sebagai teknik pengelompokan dalam projek ini.
Algoritma genetik diguna untuk mengoptimumkan pemberat istilah yang
dipertimbangkan dalam pengelompokan teks. Selain itu, mengurangkan jumlah
istilah dapat mempercepatkan proses pengelompokan. Secara khusus, tujuan
projek ini adalah memperkenalkan teknik yang sesuai untuk pengelompokan
dokumen bahasa English and dokumen bahasa Malaysia dan keputusan pemetaan
antara kelompok dokumen English dan kelompok dokumen Melayu.
v
LIST OF CONTENTS
TITLE PAGE
DECLARATION i
ACKNOWLEDGEMENT
ii
ABSTRACT
iii
ABSTRAK
iv
LIST OF CONTENTS
v
LIST OF FIGURE
viii
LIST OF TABLES
xi
LIST OF ACRONYMS
xii
LIST OF FORMULAS
xiii
CHAPTER 1: Introduction 1.1 Introduction 1.2 Problem Background 1.3 Problem Statements 1.4 Objective 1.5 Project Scope 1.6 Organization of the Report
1 1 2 2 3 3 4
CHAPTER 2: LITERATURE REVIEW 2.1 Introduction 2.2 Clustering 2.3 Reviews of Existing Clustering Method
2.3.1 K-Mean clustering 2.3.2 Hierarchical Agglomerative Clustering 2.3.3 Comparison between K-mean clustering and
Hierarchical Clustering 2.4 Cross Language Information Retrieve
2.4.1 Cognate Matching 2.4.2 Query Translation 2.4.3 Document translation 2.4.4 Interlingual techniques
2.5 Text Processing in Document Clustering 2.5.1 Processing Text
2.5.1.1 Tokenization 2.5.1.2 Stopping
5 5 5 6 6 7 11
11 12 12 13 13 13 14 14 16
vi
2.5.1.3 Stemming 2.5.2 Term Frequency and Weighting 2.6 Past Research on CLIR
2.6.1 Japanese/ English Cross Language Information Retrieve (Atsushi Fuji and Tetsuya, 2001)
2.6.2 A Parallel Hierarchical Agglomerative Clustering Technique for bilingual Corpora Based on Reduced Terms with Automatic Weight Optimization (Rayner Alfred, 2009)
2.6.3 Comparison between the Reviews Past of Past Researches on CLIR
2.7 Conclusion
16 18 19 19
20
22
22
CHAPTER 3: Methodology 3.1 Introduction 3.2 System Development Life Cycle
3.2.1 Planning 3.2.2 System Analysis 3.2.3 Design Phase 3.2.4 Implementation Phase 3.2.5 Testing and Maintenance phase
3.3 Operational Environments 3.3.1 Software 3.3.2 Hardware Requirements
3.4 Conclusion
23 23 23 24 25 26 27 27 27 27 28 28
Chapter 4: System Analysis and Design 4.1 Introduction 4.2 System Analysis 4.3 System Design
4.3.1 Data Flow Diagram (DFD) 4.3.1.1 Context Diagram 4.3.1.2 Level 0 Data Flow Diagram 4.3.1.3 Flow Chart
4.4 User Interface Design 4.5 Conclusion
29 29 29 30 30 31 31 32 35 36
Chapter 5: Implementation 1.0 Introduction 1.1 Text Processing 1.2 Weighting 1.3 Construct the Distances Matrix using Cosine Similarity 1.4 Clustering 1.5 Cluster Analysis 1.6 Genetic Algorithm Setup 1.7 Conclusion
37 37 37 40 42 44 49 51 52
Chapter 6: Testing 6.1 Introduction 6.2 Loading Data
53 53 53
vii
6.3 Text Processing 6.4 Weighting 6.5 Distances Matrix 6.6 Experiment 1: Testing the Cluster Results for Different
Clustering Algorithm 6.7 Experiment 2: Testing the DBI values between the Single
Linkage, Complete Linkage and Average Linkage 6.8 Experiment 3: Improve the DBI value and the Percentage
of English Malay Mapping using Genetic Algorithm 6.9 Conclusion
55 56 58 58
63
65
68
Chapter 7 Conclusion and Future Work 7.1 Introduction 7.2 Objective Achievement 7.3 Discussion 7.4 Limitations of the System 7.5 Recommendation of Future Works 7.6 Conclusions
69 69 69 70 70 70 71
References
72
viii
LIST OF FIGURE
FIGURE NO
TITLE PAGE
1.1 Clustering documents using hierarchical
agglomerative clustering
1
2.1 An Example of Clustering 3
2.2 K-Mean clustering algorithm 5
2.3 A dendogram 6
2.4 two same dendogram with different cutting points
7
2.5 Flow chart of hierarchical agglomerative clustering
8
2.6 Text Processing before clustering
12
3.1 Waterfall development-base model
22
4.1 Context Diagram for English-Malay parallel
hierarchical clustering using hierarchical clustering
29
4.2 Leval-0 Data Flow Diagram for English-Malay parallel
clustering using Hierarchical Agglomerative
Clustering
30
4.3 Flow chart of English-Malay parallel clustering using
Hierarchical agglomerative clustering
31
4.4 User Interface Design of the system in this project 35
ix
5.1 Code Fragment for Tokenizer 38
5.2 Code Fragment for Stopword Remove 38
5.3 Code fragment for English Stemming 39
5.4 Code fragment for the array 40
5.5 Tokenizer Function in Java 40
5.6 Code fragment for Calculate Raw Term Frequency 40
5.7 Code Fragment that Normalize the Term Frequency 41
5.8 Code Fragment that Calculate the Inverse Document
Frequency
42
5.9 Code Fragment that calculates the TF-IDF 42
5.10 Code Fragment that calculates the Cosine Similarity
between Clusters
43
5.11 Code Fragment That Show All Function In The
Clustering Class
44
5.12 Code Fragment for Cluster Class 46
5.13 Code Fragment that Show the Calculation of Single
Linkage, Complete Linkage and Average Linkage
47
5.14 Code Fragment for Hierarchical Agglomerative
Clustering
48
5.15 Code Fragment for DBI Calculation 49
6.1 File Dialog That Allows User Selects the Datasets
Directory
53
6.2 Message dialog that inform user that the user 54
x
selected folder does not contain any text file
6.3 Print Screen for Loaded Files Will Display on the
Program
54
6.4 The text processing that normalizes the texts 55
6.5 Print Screen of the Raw Term Frequency Table 56
6.6 Print Screen of the Term Frequency Table 56
6.7 Part of the idf displayed by the program 56
6.8 Print Screen for TF-IDF Table 57
6.9 Print Screen for Distances Matrix 58
6.10 Summary for Document Distribution in 7 Clusters 63
6.11 DBI values for different hierarchical agglomerative
clustering
64
6.12 DBI values for Each Generation in Genetic Algorithm 66
xi
LIST OF TABLES
TABLE NO. TITLE
PAGE
2.1 Comparison between k-mean clustering and Hierarchical clustering
11
2.2 Comparison between the reviews past of past researches
on CLIR
21
3.1 Software requirements 28
5.1 Example of texts preprocessing that normalizes the terms 39
5.2 The Setup for the Experiment of Genetic Algorithm 51
6.1 Cosine similarity distances matrix for 10 documents 58
6.2 Clustering Result for Single Linkage 59
6.3 Clustering Result for Complete Linkage 61
6.4 Clustering result for Average Linkage 62
6.5 Summary for Document Distribution in Cluster 63
6.6 DBI values for different hierarchical agglomerative
clustering
64
6.7 DBI values for English Documents Clustering (Before GA
and After GA)
66
6.8 The Percentage of Clusters Mapping Before and After GA
Applied
67
xii
LIST OF ACRONYMS
TF: TERM FREQUENCY
IDF: INVERSE DOCUMENT FREQUENCY
DFD: DATA FLOW DIAGRAM
HAC: HIERARCHICAL AGGLOMERATIVE CLUSTERING
CLIR: CROSS LANGUAGE INFORMATION RETRIEVE
GA: GENETIC ALGORITHM
DBI: Davies-Bouldin Index
xiii
LIST OF FORMULAS
NO. NAME FORMULA
1 Inverse Document Frequency
idft = log N
dft
2 tf-idf weighting scheme
tf-idft,d = tft,d x idft
3 Precision of the English –Malay Mapping (EMM)
PrecisionEMM (C(E), C(M)) = |C E C B |
|C E |
4 Precision of the Malay – English Mapping (EMM)
PrecisionMEM (C(M), C(E)) = |C M C E |
|C M |
5 inverse document frequency
idft = log N
dft
6 tf-idf tf-idft,d = tft,d x idft
7 Cosine Similarity
Cos(x,y) = x∙y
x | y |
8 DBI DBI = 1
n max(n
i=1,i≠jδ i +δ j
d(ci ,cj ))
9 term frequency
t, fi,j = ni,j
nkjk
1
CHAPTER 1
INTRODUCTION
1.0 Introduction
People always intended to retrieve useful information from the data collections.
Many experiments had been run to find out the methods that can retrieve
information from the set of documents and we call these methods as text mining
method. Clustering is one of the methods that can retrieve useful information from
documents. There are some clustering approaches that are used in text mining
such as k-mean clustering and hierarchical agglomerative clustering. Hierarchical
agglomerative clustering (HAC) is one of the common approaches used in many
text mining experiments. This is because HAC gives an outputs result in dendogram
and we can get any number of clusters from the dendogram easily as compared
with the k-mean or others clustering approaches.
There has been a lot of research on clustering text documents. However, it is
an interesting experiment if we apply clustering in parallel bilingual documents sets
where the alignment between pairs of clustered documents can be used to extract
words from each language and further be used in others applications. There are
various clustering experiments running on un-parallel bilingual documents sets but
few clustering experiments that related with parallel bilingual documents sets. This
is possible that there are few available sources for parallel bilingual documents sets
compared with the un-parallel bilingual documents sets.
Some useful results can be found by applied clustering to the same documents
in two languages (parallel bilingual documents) (Rayner Alfred, 2009). For example,
clustering in one language in the parallel bilingual documents sets can be used as a
source of annotation to verify the cluster that produced by clustering in another
language. Besides that, combining results for the two results for the two languages
2
can be eliminated on some language-specific bias. Finally, the alignment between
pairs of clustered documents can be used to extract words from each language and
can be furthered use for others application such as cross linguistic information
retrieval (CLIR).
1.1 Problem Background
The rapid development of the Internet makes people out of space limitation, can
freely get the information in the world. However, this freedom of obtaining
information has been limited by the diversity of language. To across the language
barrier, people put forward corresponding solutions, such as online dictionary,
online translation, machine translation, crossing-language information retrieval and
cross-language search engine. Among them, crossing-language information
retrieval is one of the popular research topics.
There are less available resources on bilingual parallel corpora for English-
Malay. 200 sets of English – Malay bilingual parallel documents sets are collected to
conduct the experiment.
2.0 Problem Statements
There are numerous of research on clustering text documents. However, there are
few experiments that examined the impacts of clustering parallel bilingual corpora
especially for Malay-English parallel clustering. This project will conduct an
experiment on clustering for parallel English – Malay documents.
In this project, a collection of English-Malay texts documents are collected to
do bilingual clustering to these documents.
This project is divided into few parts. For English documents clustering, the
English documents sets may include un-useful information. In order to overcome
this problem, the precondition for the English documents is that they must be
stopping and stemming before go to the hierarchical clustering.
3
Figure 1.1: Clustering documents using hierarchical agglomerative
clustering
After hierarchical clustering, the next step is performing mapping between
English and Malay clusters membership.
The last part of this project is to combine the clustering with a genetic
algorithm optimizing the weight of the terms so that the clustering matches the
annotation provided as closely as possible.
1.4 Objective
There are three main objectives in this project:
To propose a framework for English – Malay Parallel hierarchical
agglomerative clustering.
To implement a hierarchical agglomerative clustering method to cluster
English documents.
To apply Genetic Algorithm (GA) to analysis the effects of GA to the
clustering results.
1.5 Project Scope
The scope of this project is included the document representation and clustering
algorithm to cluster the large amount of bilingual documents Malay-English. Besides
that, this project wills implement the genetic algorithm (GA) technique to improve
the result of clustering.
HAC
English Documents
Stemming
on
Stopping
on
4
1.6 Organization of the Report
This report will be organized and divided into 5 chapters.
Chapter 1 is introduction. This chapter will includes the problem
background, problem statements, objectives and project scope of this project. This
chapter aims to provide introduction of the project.
Chapter 2 is literature review. This chapter will make reviews of the existing
concept and methods. Some comparison of existing concepts and methods are also
includes in this chapter.
Chapter 3 is Methodology. This chapter will reviews and explains the
methodology used in this projects. This chapter will list out the software and
hardware requirements of this project.
Chapter 4 is system analysis and design. This chapter will explain the
system design of this project.
Chapter 5 is implementation. This chapter will discuss the implementation of
the system.
Chapter 6 is testing. This chapter will discuss the testing stage and the
testing results.
Chapter 7 is the conclusion for the whole project.
5
CHAPTER 2
LITRERATURE REVIEW
2.1 Introduction
This chapter will review some existing data mining concepts and methods that
related to this project. Since this project is deals with parallel bilingual documents,
this chapter will focus on text-mining concepts and methods. The data mining
approaches those reviews in this chapter are clustering, text-processing and cross
language information retrieve.
2.2 Clustering
Clustering is the data mining method that divides data into groups that are useful
and meaningful (Christopher D. manning et al., 2009). Clustering data is performed
based on the data similarity. Objects that group in a group/cluster will show similar
characteristics. Clustering can only extract message based on existing data.
Clustering is unsupervised learning where it build model without well defined goal
or prediction field.
Figure 2.1: An Example of Clustering
There are some clustering methods that we can use to do text mining.
Every clustering method has its‟ own advantages and disadvantages. And the most
6
popular methods are hierarchical and k-mean and this chapter will takes this two
methods to illustrate how clustering work.
2.3 Reviews of Existing Clustering Method
2.3.1 K-Mean clustering
K-mean clustering is a simple and efficient clustering algorithm. K-mean is an
algorithm used to group the data based on the attributes or features into K number
of groups and k is a positive number (Kardi Teknomo, 2006). The process of
grouping is done by minimizing the sum of squares of distances between data and
the corresponding cluster centroid.
The algorithm of k-mean is simple. It starts with determining the number of
K and set k number of centroid randomly in the dataset. After that, it will repeat to
determine the distances of each data to the centroid and group the data to the
nearest centroid. Figure 2.2 shows a simple algorithm of K-mean clustering.
7
Figure 2.2 K-Mean clustering algorithm
Source: http://people.revoledu.com/kardi/tutorial/Clustering/
Since the way to initialize the centroid was not specified and usually the k of
samples is choosing randomly. Different initial centroid will produce different results
and there is no general theoretical way to find the optimal results of clusters from
the data. A simple approach is comparing the results of multiple runs with different
initial centroid.
Start
Determine the
number of K
Set centroid
Calculate the distances
of data to centroid.
Grouping based on
minimum distance
Available data
for grouping end
Yes
No
8
2.3.2 Hierarchical Agglomerative Clustering
Another popular clustering method is hierarchical agglomerative clustering (HAC).
In hierarchical agglomerative clustering, data are categorized into a hierarchy
structure similar to a tree-like diagram (Figure 2.3) which is called dendogram
(Kardi Teknomo, 2006).
Figure 2.3: a dendogram
Source: http://people.revoledu.com/kardi/tutorial/clustering/
A dendogram is a standard output of hierarchical agglomerative clustering.
A dendogram is a cluster tree where the distance of split or merge is recorded.
Using dendogram, the number of clusters can determine by specify the
cutting point at the dendogram. For example, in the left dendogram below (Figure
2.4), we set cutting distance at 2 and obtain two clusters out of 6 data. The first
cluster consists of 4 data (number 4, 6, 5 and 3) and the second cluster consists of
two objects (number 1 and 2). Similarly, in the right dendogram, cutting at distance
at 1.2 will produce 3 clusters.
2.5
2
1.5
1
0.5
4 6 5 3 2 1
9
Figure 2.4: two same dendogram with different cutting points
Source: http://people.revoledu.com/kardi/tutorial/clustering/
The agglomerative in hierarchical agglomerative clustering is means that the
clustering will starts from the bottom where all the objects are single cluster and
going up (bottom up approach) through merging of objects. There is another
different method call divisive approach (top down approach), where the objects will
group into a single group and then repeat split the group into two groups until the
number of object in each group become one.
The step by step algorithm of hierarchical agglomerative clustering is as
follow:
Step 1: Convert object features into distance matrix.
Step 2: Set each object as a cluster.
Step 2: Iterate until the number of cluster is 1.
2a: Merge two closest clusters.
2b: Update distance matrix.
Figure 2.4 show the flow chat of the hierarchical agglomerative
clustering.
10
Figure 2.5: Flow chat of hierarchical agglomerative clustering
Source: http://people.revoledu.com/kardi/tutorial/clustering/
In Figure 2.5 the distance matrix needs to update by finding the distances
between the clusters after two closest clusters are merged and there are some
methods to do this. The methods are single-linkage, complete-linkage, and
average-linkage. Different methods will produce different outputs and the form of
the dendogram will different.
Start
Objects and their measured
features
Compute Distance Matrix
Set Object as cluster
Number of cluster =
1 End
Merge 2 closest clusters
Update Distance Matrix
Yes
No
11
2.3.3 Comparison between K-mean clustering and Hierarchical Clustering
Table 2.1: Comparison between k-mean clustering and Hierarchical clustering
K-Means Hierarchical Clustering
Advantages
- Less computation process and
suitable for clustering large
documents sets.
- Time complexity for k-mean
clustering is linear.
- Result show in dendogram
where it can produce an
ordering of the objects and
easier to find the elements in
the n cluster by cutting at the
dendogram.
- Similar cluster are generated
and helpful for clustering.
Disadvantages
- Result is sensitive to the noise
and outliners.
-Result cannot guarantee as
optimal result because it is
depend on the initial centroids.
- Application of k-mean only
limit to numerical variables.
- No provision can be made for
a relocation of objects that may
have been incorrectly grouped
at the beginning state.
-Results may different if use
different distance matrix.
2.4 Cross Language Information Retrieve
Cross language information retrieve (CLIR) can be explains as a user try to search a
set of information in one language using a query in another language (Leah S.
Larkey & Margaret E. Connell, 2004). The issues of CLIR have been discussed for
several decades.
There are some existing methods to do cross language information retrieval
(CLIR). There are four types of strategies for matching a query with a set of
12
documents in the context of CLIR (Oard and Diekema, 1998). The four types of
strategies are:
No translation
1. Cognate matching
Translation
2. Query translation
3. Document translation
4. Interlingual techniques
2.4.1 Cognate Matching
Cognate matching is a naïve model in CLIR. Some un-translatable terms such as
proper nouns or technical terminology are left unchanged after the translation
state. The unchanged terms can be expected to match successfully with and
corresponding terms in another language if the two languages have a close
linguistic relationship.
One way to cognate matching may be to decompose words in both the
query and document into n-gram (more specifically, character-based overlapping n-
grams), and do matching operations on the two sets of n-grams. However, when
two languages are very different, e.g., English and Chinese, the technique of edit
distance and n-grams can be not work well.
2.4.2 Query Translation
Query translation translates queries into document languages using bilingual
dictionaries or/ and corpora, prior to the retrieval process. Query translation is the
most common method used for matching in CLIR. This is because this method is
easier to handle. The reason is the retrieval system no needs to change its inverted
files of index terms in any way and just translate the query into the others
languages that the information may retrieve out. Furthermore it is less computation
to process the translation of the query than translate a large set of documents.
13
However, there is a disadvantage where it is difficult to resolve term
ambiguity during the process of translation. This is because the query is usually
short and only little context can found for “disambiguation”.
2.4.3 Document translation
This approach translates documents into query languages, prior to the retrieval.
Document translation is the “opposite” method from the query translation. It has
different advantages and disadvantages from the query translation. Peoples prefer
query translation than document translation. This is because document translation
needs large computation process to translate a big set of documents. However,
some researchers still use this approach since this approach can extracts more
contexts from each document and improve the translation quality.
2.4.4 Interlingual techniques
Interlingual techniques is an intermediate space of subject representation into
which both the query can the documents are converted is used to compare them.
In an Interlingual based machine translation approach translation is done via an
intermediary (semantic) representation of the SL text. Interlingua is supposed to be
a language independent representation from which translations can be generated
to different target languages. Interlingual approach assumes that it is possible to
convert source texts into representations common to more than one language.
From such interlingual representation texts are generated into other languages.
Translation is thus in two stages: from the source language to the Interlingual (IL)
and from the IL to the target language.
2.5 Text Processing in Document Clustering
Document clustering is a complex algorithm. It needs few text mining process to
complete its‟ goal.
14
2.5.1 Processing Text
After the documents sets are collected, the next step is to decide if it should be
modified or restructured in some way to simplify cluster. The types of changes that
are made at this state are call text processing.
The goal of text processing is to convert the many forms that words can
occur into more consistent index forms. Index terms are the representation of the
content of a document that is used for clustering.
The flow chart in Figure 2.6 shows the techniques that use for text
processing.
Figure 2.6: Text Processing before clustering
2.5.1.1 Tokenization
Given a character sequence and a defined document unit, tokenization is the task
of chopping it up into pieces, called tokens (Christopher D. Manning et al. 2009). In
the process of tokenization, certain character such as punctuation will throw away.
The example below show the result of the tokenization on a sentence:
Input: Friends, Romans, Countrymen, lend me your ears;
Output: Friends Romans Countrymen lend me your ears
However, this simple tokenizing process example was suitable for
experiments with small test collection, it does not seem appropriate for most text
processing experiments because too much information is discarded. Some
Original text Tokenization Stopping Stemming
HAC
Text Processing
15
examples of issues involving tokenizing that can have significant impact on the
effectiveness of clustering are:
Small words (1 or 2 characters) can be important in some cluster,
usually in combinations with other words.
o E.g.: xp, ma, pm, ben e king, el paso, master p, gm, j lo,
world war ll.
Both hyphenated and non-hyphenated forms of many words are
common.
->In some cases the hyphen is not needed.
o E.g.: e-bay, wal-mart, active-x, CD-rom, t-shirt.
->At others times, hyphens should be considered either as part of
the word or a word separator.
o E.g: Winston-selem, Mazda rx-7, e-cards, pre-diabetes, t-
mobile, Spanish-speaking.
Special characters are an important part of the tags, URLs, code and
other important parts of documents that must be correctly tokenized.
Capitalized words can have different meaning from lower case words.
o E.g: “Bush” and “Apple”.
Apostrophes can be a part of a word, a part of a possessive, or just
a mistake.
o E.g: rosie o‟donnell, can‟t, don‟t, 80‟s, 1980‟s, men‟s straw
hats, master‟s degree, England‟s ten largest cities, shriner‟s.
Numbers can be important, including decimals.
o E.g.: nokia 3250, top 10 courses, united 93, quicktime 6.5
pro.
Periods can occur in numbers, abbreviations.
o E.g.: “I.B.M.”, “Ph.D.”), URLs, ends of sentences, and other
situations.
(Source: An introduction to Information Retrieve, 2009)
From these examples, tokenizing appears to be more complicated than it
may seem at first.
16
2.5.1.2 Stopping
Human language is filled with function words; words which have little meaning
apart from other words (Christopher D. Manning et al. 2009). The most popular,
like “the”, “a”, “an”, “that”, or “those” are determiners. These words are part of
how we describe nouns in text, and express concept like location or quantity.
Prepositions, like “over”, “under”, “above”, and “below”, represent relative position
between two nouns. In information retrieval, these function words have a second
name: stopword.
A stopword list to keep all stopword are construct and will use in text
processing. However, constructing a stopword list must be done with caution.
Removing too many words will decrease the effectiveness of clustering. While not
removing stopword may cause some problems in ranking.
A stopword list can be constructed by simply using the top n most frequent
words in a collection. This can, however, lead to words being included that are
important for some clusters. More typically, either a standard stopword list is used,
or a list of frequent words and standard stopwords is manually edited to remove
and words that may be significant for a particular application. Standard stopword
list contains ~300 words.
2.5.1.3 Stemming
Stemming, also called conflation, is a component of text processing that captures
the relationships between different variations of word. More precisely, stemming
reduces the different forms of a word that occur because inflection (e.g., plurals,
tenses) or derivation (e.g., making a verb to a noun by adding the suffix-action) to
a common stem (Christopher D. Manning et al. 2009).
There are two basic types of stemmers: algorithmic and dictionary-based.
An algorithmic stemmer uses a small program to decide whether two words are
related, usually based on knowledge of word suffixes for a particular language. By
17
contrast, a dictionary-based stemmer has no logic of its‟ own, but instead relies on
pre-created dictionaries of related terms to store term relationships.
The simplest kind of English algorithmic stemmer is the suffix-s stemmer.
This kind of stemmer assumes that any word ending in the letter „s‟ is plural, so
cakes >> cake, dogs >> dog. However, this rule is not perfect. It cannot detect
many plural relationships, like “century” and “centuries”. In very rare cases, it
detects a relationship where it does not exist, such as with “I” and “is”. The first
kind of error is called a false negative, and the second kind of error is called a false
positive.
More complicated algorithmic stemmers reduce the number of false
negatives by considering more kinds of suffixes, like –ing or –ed. By handling more
suffix types, the stemmer can find more term relationships: in other words, the
false negative rate is reduced. However, the false positive rate (finding a
relationship where none exists) generally increases.
The most popular algorithmic stemmer is the porter stemmer. This has been
used in many information retrieval experiments and system since 1970s, and a
number of implementations are available. The stemmer consists of a number of
steps, each containing a set of rules for removing suffixes. At each step, the rule
for the longest applicable suffix is executed. Some of the rules are obvious, while
others require some thought to work out what they are doing. As an example, here
are the first two parts of step 1 (of 5 steps):
Step 1a:
o Replace sses by ss (e.g., stresses >> stress).
o Delete s if the preceding word part contains a vowel not
immediately before the s (e.g., gaps >> gap but gas >>
gas).
o Replace ied or ies by I if preceded by more than one letter,
otherwise by ie (e.g., ties >> tie, cries >> cri).
o If suffix is us or ss do nothing (e.g., stress >> stress).
18
Step 1b:
o Replace eed, eedly by ee if it is in the part of the word after
the first non-vowel following a vowel (e.g., agrees >> agree,
feed >> feed).
o Delete ed, edly, ing, ingly if the preceding word part contains
a vowel, and then if the word ends in at, bl, or iz add e (e.g.,
fished >> fish, pirating >> pirate), or if the word ends with a
double letter that not ll, ss or zz, remove the last letter (e.g.,
falling >> fall, dripping >> drip), or if it the word is short,
add e (e.g., hoping >> hope).
(Source: An Introduction to Information Retrieval, 2009)
2.5.2 Term Frequency and Weighting
After the processing text state, the next step is calculating the TF-IDF. TF is term
frequency and IDF is inverse document frequency (Christopher D. Manning et al.
2009). TF-IDF is a weight that uses to measure how important a word to a
document in a collection of documents.
The purpose of construct the term frequency is to compute a score between
a query term t and a document d, based on the weight of term t in document d.
There is a simple approach where we just take the number of occurrences of term t
as the weight in the document d. The weighting scheme is referred to as term
frequency and is denoted tft,d.
The concept of term frequency has a problem and the problem is to know
are all terms in a document equally important? This is because all terms are
considered equally important when it come to assessing relevancy on a query. The
Inverse document frequency can overcome this problem. Inverse document
frequency is a measure of the general important of term by dividing the total
number of documents by the number if documents containing the term and then
take the logarithm of that quotient.
19
Thus the idf of a rare term is high whereas the idf of a frequent term is
likely to be low.
To produce a composite weight for each term in each document, we
combine the definitions of term frequency and inverse document frequency. The tf-
idf weighting scheme assigns to term t a weight in document d shown in Equation
2.
2.6 Past Research on CLIR
In this section, some past research on CLIR that proposed by others researcher
ware reviewed.
2.6.1 Japanese/ English Cross Language Information Retrieve (Atsushi
Fuji and Tetsuya, 2001)
Atsushi Fuji and Tetsuya (2001) proposed a Japanese/ English CLIR system using
the query translation approach combine with retrieval modules. They target the
retrieval of technical documents. The translation of technical terms plays an
important role in the performance of their system. To conduct this experiment, they
produce a Japanese/ English dictionary for base words and translate compound
words on a word-by-word basic. To resolve the problem of translation ambiguity
during the translation process, they use a probabilistic method.
tf-idft,d = tft,d x idft (2)
Where: tft,d = term frequency
idft = inverse document frequency
idft = log N
dft (1)
Where: idf = inverse document frequency N = total number of ducuments dtf = number of documents where the term ti appears.
20
The reason why Atsushi Fuji and Tetsuya choose the query translation
approach in their experiment is the query translation approach is relatively
inexpensive to implement. However, they state that the naïve query translation
method does not guarantee sufficient system performance because this method
relying on existing bilingual dictionary, and new technical terms are progressively
created.
Others information retrieval techniques that used in the paper are
tokenization and stopword remove. The purposed of using this two techniques is to
convert all possible query words into base words.
18700 set of documents in both English and Japanese ware collected to
conduct the experiments in the project.
2.6.2 A Parallel Hierarchical Agglomerative Clustering Technique for
bilingual Corpora Based on Reduced Terms with Automatic Weight
Optimization (Rayner Alfred, 2009)
This paper was proposed by Rayner Alfred (2009). The purpose of this paper has
investigated the effects of applying a clustering technique to parallel multilingual
texts. The result of this experiment is display in the cluster mappings and the tree
structures of the cluster. The targeted languages are English and Bulgarian where a
collection of English – Bulgarian documents will form the bilingual parallel corpus.
There are three main parts in the paper. The first part has tested the
experiment of clustering on parallel corpora of English – Bulgarian texts. The
second part has conducted the English – Bulgarian clusters mapping and
constructed the English versus Bulgarian tree structures. The last part is applied
genetic algorithm to optimize the weight of terms considered in clustering the
English texts.
21
20,000 pairs of English – Bulgarian documents were used as datasets in this
experiment. There were two parallel corpora (News Briefs and Features) and each
in two different languages, English and Bulgarian. In both corpora, each English
document was corresponded to a Bulgarian document with the same content.
The text mining methods that used in the experiment are stemming,
stopword removal and hierarchical agglomerative clustering. Genetic algorithm was
applied to optimize the weights of terms considered in clustering the English texts,
so that clustering matched the annotation provided as closely as possible.
22
2.6.3 Comparison between the Reviews Past of Past Researches on CLIR
Table 2.2: Comparison between the reviews past of past researches on CLIR
Project Title Targeted
languages
Parallel or
Non-
parallel
languages
documents
Translation
texts?
Applied
Techniques
A Parallel
Hierarchical
Agglomerative
Clustering Technique
for bilingual Corpora
Based on Reduced
Terms with
Automatic Weight
Optimization (R.
Alfred)
-English
-Bugarian
-Parallel -no
translation
in the
project
-Hierarchical
agglomerative
clustering.
-Clusters
mapping.
Japanese/ English
Cross Language
Information Retrieve
(Atsushi Fuji and
Tetsuya, 2001)
-Japanese
-English
-non-parallel -translate
query.
-Query
Translation.
2.7 Conclusion
The purpose of reviewing the current methods is to get some background
knowledge about existing techniques. From the review of the techniques, we found
that the output of HAC (dendogram) is suitable for map comparison and tree
comparison in parallel bilingual clustering. We also know the sequence for
document clustering and the methods to use in the document clustering.
23
CHAPTER 3
Methodology
3.1 Introduction
This chapter will explain the approaches and framework that will be used in this
project. This chapter also reviews the methodology that is used in this project.
Software and hardware requirements that used in this project will also list out since
it will affects the performance of the experiments.
There are two main sections in this chapter. The first section is the simple
summary of System Development Life Cycle (SDLC). SDLC is the process of
developing information system through investigation, analysis, design,
implementation and testing. The second section is related to the operational
environment which the software and hardware requirements that needed in whole
system.
3.2 System Development Life Cycle
Waterfall development-base model are chosen as the methodology in this project.
This model is called waterfall because it only moves forward from phase to phase
and no return to previous phase, just like the water flow in the waterfall. The
reason why waterfall model is chosen is because the system requirements are
identified before programming design.
Waterfall model can be defined as a complete sequential approach for the
software development. There are five phase in waterfall model which is start from
planning, and continue by system analysis, system design, system implementation,
testing and maintenance. Each phase should start after the previous phase has
finished. Figure 3.1 illustrate waterfall model work.
24
Figure 3.1: Waterfall development-base model
3.2.1 Planning
Planning is the first phase in the waterfall model. In this phase, we need to figure
out the problem that we want to solve. In this phase, researchers need to identify
what is the challenge and find out the way to achieve the goal of my title - Malay-
English Parallel Clustering Using Hierarchical Agglomerative Clustering. The needed
tasks in this phase that need to perform are:
Identify the overall view and problem of the existing concept or method.
Identify the scopes, objectives and goals that needed to be achieved by the
system.
Identify the methods or concepts that researchers will be used to solve our
problem.
Evaluation of the project:
o Is the project feasible in time?
o Is the project reliable?
Planning
Analysis
Design
Implementation
Testing and
Maintenance
25
In this planning phase, researchers also need to define our objectives and
scopes. This is important because it is consider as the guideline to develop the
system. A well done planning will make the development of the system easier.
The purpose of this project is to perform clustering on Malay-English
corpora using hierarchical agglomerative clustering. There are many past
experiments has been performed to test the effect of clustering on documents.
However, it is interesting whereby we can get different results by apply clustering
on parallel bilingual corpora.
The next planning step is to conduct the research on the existing
techniques. By having comparison, it leads to enhance the common understanding
of the advantages and disadvantages of clustering techniques through the
technique of hierarchical agglomerative clustering and k-mean clustering. From
this, researchers can decide which clustering technique is suitable for our title. In
this step, researchers decide to use hierarchical agglomerative clustering because
the output of hierarchical agglomerative clustering (dendogram) is suitable for
cluster mapping and tree comparison between the results of clustering two
languages.
The requirements also included in planning phase. Researchers need to list
out the hardware and software to be used in the project. Since documents
clustering need more computational consuming, researchers need to select the
hardware that has higher performance.
3.2.2 System Analysis
In system analysis, researchers need to analysis the requirements and the
technique that used in the project. This phase must describe and understand the
project clearly. In this phase, the whole software development process and the
overall software structure are defined.
26
To make the process of system analysis easier, researchers need to study
the past research on this title. There are similar research on this project which is
English and Bulgarian (Rayner Alfred, 2009. From this, it can gain understanding
that the overall techniques and concepts that apply to achieve English – Malay
parallel clustering using hierarchical agglomerative clustering.
Besides that, there are several methods used for stemming and porter‟s
stemming is the most popular among the methods. Hence, porter‟s stemming will
be applied within this project. Besides, tokenizing and stopword removing will be
study and apply as well as to improve the clustering results.
3.2.3 Design Phase
In this phase, the information that collected from the analysis phase will be
evaluated. The required development tools need to be defined in this phase.
Analysis and design phase are very important in the whole development cycle
process. Any mistakes in the design phase could be very costly in order to solve in
the development process. Design phase also decide how the system will be
operated in term of hardware and software.
In this project, the first part is conducts clustering on English documents.
After that, the clusters that outputted from the clustering will map with the Malay
clusters. The last part in this project is apply genetic algorithm on the clustering to
optimizing the weight of the terms so that clustering matches the annotation
provided as closely as possible.
In this project, Java is the programming language that researchers choose
to develop the system. More detail of design phase in this project will be explained
in chapter 4 later.
27
3.2.4 Implementation Phase
In this phase, the design will be converted into the physical system. The details of
the coding and design as well as the user interfaces design of this system are
important in this phase.
The tasks in this phase are:
Completing the development plan and finalizing the designs.
The coding of this system will start to implement.
Produce a user manual or user guild.
Test the system.
3.2.5 Testing and Maintenance phase
After the Implementation phase, the system testing begins. Different testing
methods are available to detect the bugs that were committed during the previous
phases. Different testing input are use to test the accuracy.
We can test the system in the following aspects:
Functionality
Structural
consistency
performance
durability
In maintenance phase, the system may change due to the errors that are
not discovered in previous stages.
3.3 Operational Environments
Software and hardware requirements are the software and hardware needed to
develop and run the system.
28
3.3.1 Software
The software requirements to build the system in this project are show in table 3.1.
Table 3.1: Software requirements
Type Specification
Operating System: Microsoft Windows 7
Programming Tools SDK –NetBeans 6.9.1
Java SE JDK
Java SE JRE
3.3.2 Hardware Requirements
The hardware requirements to build the system in this project are show as below.
Laptop with the following hardware specification
o Central Processing Unit(CPU): Intel Centrino 2 (2.26 Ghz)
o 4 GB RAM
o 250 GB hard drive
o mouse and keyboard
3.4 Conclusion
As a conclusion, methodology provides guild, a blueprint or template during the
development of the system. This chapter concludes all the researches and
developments of system in order to achieve the goals of this system. In addition, it
also provides developers a clearer picture on what they need to be done and how
the system is to be organizes and developed.
29
CHAPTER 4
SYSTEM ANALYSIS AND DESIGN
4.1 Introduction
This chapter is about the process and details of the system analysis phase and the
system design phase.
This chapter will explain the system design, interface design, technique and
related algorithms. Some flowchart, context diagram and data flow diagram will
include in this chapter to give a clear picture how this project will conducted.
4.2 System Analysis
This project is divided into two parts. In the first part of this project, there are two
sets of parallel documents, each in two different languages, English and Malay. In
both documents, each English document E corresponds to Malay document M with
the same content. To obtain a better clustering result, the English documents will
be tokenize, stopword remove and stemming before performed hierarchical
agglomerative clustering.
The first part of this project is conducted with clustering English documents
with using hierarchical agglomerative clustering approach.
The output of clustering English documents will map with the output of
Clustering Malay documents. There will be one-to-one mapping between the
English and Malay cluster. The precision (Rayner Alfred, 2009) values for each pair
English to Malay clusters mapping will be calculated using Equation 3.
30
PrecisionEMM (C(E), C(M)) = |C E C B |
|C E | (3)
Where : C(E) = Clustered English Documents
C(M) = Clustered Malay Documents
Similarly, the precision of the Malay-English mapping, PrecisionMEM, can
define as the Equation 4.
PrecisionMEM (C(M), C(E)) = |C M C E |
|C M | (4)
Where : C(M) = Clustered Malay Documents
C(E) = Clustered English Documents
Higher value for precision will indicated a better clusters mapping result.
The second part of this project is apply the genetic algorithm in the
clustering to optimizing the weight of the terms so that clustering matches the
annotation provided as closely as possible.
4.3 System Design
This section discussed user system design and interface design for this project.
4.3.1 Data Flow Diagram (DFD)
Data flow diagram (DFD) is a picture of the movement of data between external
entities and the processes and data stores within a system.
4.3.1.1 Context Diagram
Context diagram is the top-level view of information system. It shows the system
boundaries, external entities that interact with the system and major information
flows between entities and the system.
31
The context diagram of English-Malay parallel hierarchical agglomerative
clustering is shows in Figure 4.1.
Figure 4.1: Context Diagram for English-Malay parallel hierarchical
clustering using hierarchical clustering
4.3.1.2 Level 0 Data Flow Diagram
Level-0 data flow diagram shows the system‟s major processes, data flows and
data stores at a high level of abstraction. When the context diagram is expanded
into level-0 data flow diagram, all the connections that flow into and out of process
0 needs to be retained. Figure 4.2 shown the leval-0 data flow diagram for English-
Malay parallel clustering using hierarchical agglomerative clustering
User
0
English-Malay parallel clustering
using hierarchical agglomerative
clustering
Load English Files
Display results
32
Figure 4.2: Leval-0 Data Flow Diagram for English-Malay parallel
clustering using Hierarchical Agglomerative Clustering
4.3.1.3 Flow Chart
A flow chart is a graphical or symbolic representation of a process. Each step in the
process is represented by a different symbol and contains a short description of the
process step. The flow chart symbols are linked together with arrows showing the
process flow direction.
1
Load Data
2
Tokenizing,
Stemming,
Stopping
7
Construct new TF-IDF values
by using tf × idf × scale
3
Compute TF -
IDF
4
Construct
distance matrix
5
Clustering
6
Mapping
clusters
Texts Texts
TF-IDF
Distance
Matrix
English clusters and
Malay clusters
New adjusted
TF-IDF values
Contract
distance matrix
until stopping
condition
33
Figure 4.3: Flow chart of English-Malay parallel clustering using
Hierarchical agglomerative clustering
The flow of this project started with load English and Malay files into the
program. The texts in the files will performed tokenize, stemming and stopword
removed to restructure the texts.
Start
Load Data
Stemming/ Stopping
Compute TF-IDF
Build Distance Matrix
Hierarchical Agglomerative Clustering
Malay-English Cluster Mapping
Apply Genetic Algorithm
End
Stopping Condition
Yes
No
34
The next step is computed tf (term frequency), df(document frequency) and
idf (inverse document frequency). TF is number of the terms occurred in a
document. DF is the number of documents that contained same terms. After that
idf will calculated using the Equation 5.
The next step is calculating the tf-idf using Equation 6.
The distance matrix will build and calculated using the Cosine Similarity in
Equation 7.
Cos(x,y) = 𝑥∙𝑦
𝑥 | 𝑦 | (7)
Where ∙ indicates the vector dot product, x∙y= 𝑥𝑘𝑦𝑘𝑛𝑘=1 and ||x|| is the length of
vector x, ||x|| = 𝑥2𝑘
𝑛𝑘=1 = 𝑥 ∙ 𝑥
After the distance matrix was constructed, the next step is performed
hierarchical agglomerative clustering on the English documents and Malay
documents. This project will used the algorithm of hierarchical agglomerative
clustering in figure 2.4.
tf-idft,d = tft,d x idft (6)
Where: tft,d = term frequency idft = inverse document frequency
idft = log N
dft (5)
Where: idf = inverse document frequency N = total number of ducuments
dtf = number of documents where the term ti appears.
35
After the clustering step, the next action is do mapping on the English and
Malay clusters membership. There will be one-to-one mapping between the English
and Malay clusters (EMM). The same is repeated in the direction of Malay to English
Mapping (MBM).
The last action is applying the genetic algorithm in the clustering to optimize
the weight of the terms so the clustering can give a better output. The setting of
the genetic algorithm can be described as follows. A population of X strings of
length m is randomly generated where m is the number of terms (Rayner Alfred,
2009). These X strings represent the scale of adjustment that will be performed to
the values of inverse-documents-frequency (IDF) of the corpus and they are
generated with values ranging from 0.5-1.5 uniformly distributed within [1, m].
Each string represents a subset of (S1, S2, S3, …, Sm-1, Sm). When Si is 1, the IDF
value for the ith terms will not be adjusted; otherwise it will adjusted by multiplying
the IDF value for the ith term by 0.5 or 1.5, depending on the value of Si. The new
adjusted tf-idf for all terms wills tf × idf × scale. The fitness function of this genetic
algorithm is the Davies-Bouldin Index (DBI), to measure the cluster quality.
DBI = 1
𝑛 max{𝑛
𝑖=1,𝑖≠𝑗𝛿𝑖+𝛿𝑗
𝑑(𝑐𝑖 ,𝑐𝑗 )} (8)
Where n is the number of clusters. 𝛿 is the centroid distance and d is the
centroid linkage. A lower DBI value means the clustering produced a better cluster.
Fitness function is defined over the genetic representation and measures the quality
of the represented solution. The generational process is repeated until a
termination condition has been reached. This project will set a fixed number of
generations reached as the termination condition.
4.4 User Interface Design
The user interface is the place where interaction between humans and machines
occurs. The user interface of the system in this project is shows in Figure 4.4.
36
Figure 4.4: User Interface Design of the system in this project
The functions of the elements in the user interface:
Load Files button: This button allows user to load documents
into the program.
Clustering button: This button allows user to perform clustering
to the loaded documents.
Mapping: This button allows user to do mapping on the English
and Malay clusters membership.
Apply Genetic Algorithm tick box: check this tick box to apply
genetic algorithm in the clustering.
Text pane: The output of the program will show here.
4.5 Conclusion
This chapter gives a clear picture of the implementation stage of the system and
the sequences of the program had been shown via the data flow diagrams.
37
CHAPTER 5
IMPLEMENTATION
5.0 Introduction
Chapter 5 is explained and shows the implementations and integrates of the
project. Implementation carries the development process from the design to
operations stage.
There will be explanation from the input for the data and text processing to
normalize the terms. After that, there will be a brief explanation for the calculation
of cosine similarity and the implementation of the clustering. The last part is
explained about the DBI calculation and the setup for implementation of GA in this
project.
5.1 Text Processing
After the texts are read from the documents, tokenization is performed to remove
the punctuation mark (i.e. “,”, “.”, “!” and “;”) and extra spaces between the text.
Uppercase letters in the texts also transformed into lowercase letters during the
tokenization.
import java.util.StringTokenizer;
public class myTokenizer
{
private String text;
private String toToken;
private StringTokenizer strTokenizer;
public myTokenizer() { /*Constructer */
...}
38
public void setText(String s) { /*Set the input for tokenizer*/
...}
public String toString() { /* return the tokenizer result. */
... }
private void tokenizerr() { /*Tokenizer operation*/
...}
}
Figure 5-1: Code Fragment for Tokenizer
After that, stop words is removed from the texts, i.e., the common stop
words like “a”, “are”, “is”, and “and”.
import java.util.StringTokenizer;
public class myStopWordRemover
{
private String text;
private String toToken;
private StringTokenizer strTokenizer;
private myStopWordList myList; // A stopword list is created to store all the stopword in Array
public myStopWordRemover(){ /* Constructer */
...}
public void setText(String s) { /* Set the input for stopword remove */
...}
public String toString() { /* Return the result for stopword remove */
...}
private void tokenizerr(){ /* operation required for tokenizer.*/
...}
}
Figure 5-2: Code Fragment for Stopword Remove
Stemming also performed to the texts using porter‟s stemming algorithm.
Thus, most of the terms in different form will “stemmed” into same word. For
example, words “compute”, “computing” and “computed” are stemmed to
“comput”.
39
A code is written to plug-in the Java version porter‟s stemmer.
import java.util.StringTokenizer;
public class myStemming
{
String input;
String output;
Stemmer stem; /* Porter stemmer that ready to plug in in the system. */
public myStemming(){ /*ConStructer */
... }
public void setText(String s) { /* Set the input for tokenizer */
... }
@Override
public String toString() { /* Return the result to the main system */
... }
private void stemming() { /* call the porter stemmer and performed stemming */
... }
private boolean isWord(String s){
... }
Figure 5-3: Code fragment for English Stemming
Table 5.1 shows the example result from tokenization, stopword remove
and stemming.
Table 5.1: Example of text processing that normalizes the terms.
Step                  Data
1. Initial String     Friends, Romans, Countrymen, lend me your ears;
2. Tokenization       friends romans countrymen lend me your ears
3. Stop Word Removal  friends romans countrymen lend ears
4. Stemming           friend roman countrymen lend ear
All text in the documents is stored in the form of strings of terms. A
two-dimensional string array is used to store the texts and the file names.
String[][] contents_English = new String[NUMBER_OF_DOCUMENTS][2];
/*
Where:
contents_English[i][0] = TITLE_OF_DOCUMENT_i;
contents_English[i][1] = CONTENT_OF_DOCUMENT_i;
*/
Figure 5.4: Code fragment for the array.
5.2 Weighting
The text will be "chopped", or tokenized, into a series of single terms using the
split function in Java, where a space is used as the delimiter.
String[] tokens = input.split(" ");
/*
Where input is the string of the text.
*/
Figure 5.5: Tokenizer Function in Java.
After that, each distinct term is counted.
public class termFrequency
{
    private String input;
    private String[] termList;
    private String[] docList;
    private int[][] vector;

    public termFrequency() { /* Constructor */
        ...
    }
    public void initialize(String contents, String docNameS, String[] oldTermList,
            String[] oldDocList, int[][] oldVector) {
        /* Initialize the input needed to calculate the term frequency */
        ...
    }
    public void countTerms() { /* Perform the term counting operation. */
        ...
    }
    public String[] getDocList() { /* Return the document list to the main program */
        ...
    }
    public String[] getTermsList() { /* Return the term list to the main program */
        ...
    }
    public int[][] getVector() { /* Return the term frequency vector to the main program */
        ...
    }
}
Figure 5.6: Code Fragment for Calculating the Raw Term Frequency.
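At its core, the counting step builds a frequency map per document. A compact, self-contained sketch of the idea follows; the names are illustrative, and the termFrequency class above additionally maintains a full document-by-term matrix.

import java.util.LinkedHashMap;
import java.util.Map;

public class RawTermCounter {
    // Count how many times each term occurs in one processed document.
    public static Map<String, Integer> count(String document) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : document.split(" ")) {
            counts.merge(term, 1, Integer::sum); // increment this term's count
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("friend roman countrymen lend ear friend"));
        // prints: {friend=2, roman=1, countrymen=1, lend=1, ear=1}
    }
}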
After the raw term frequency is counted, the normalized term frequency is computed using

$\mathrm{tf}_{i,j} = \dfrac{n_{i,j}}{\sum_{k} n_{k,j}}$   (9)

where $n_{i,j}$ is the number of occurrences of the considered term $t_i$ in document $d_j$,
and the denominator is the sum of the numbers of occurrences of all terms in
document $d_j$, that is, the size of the document $|d_j|$.
public void countTermFrequency(int[][] inVector) {
    vector = inVector;
    tfVector = new double[vector.length][vector[0].length];
    String[] tokens;
    for (int i = 0, n = vector.length; i < n; i++) {
        // contents[i][1] holds the text of document i (see Figure 5.4), so
        // tokens.length is the total number of terms in that document.
        tokens = contents[i][1].split(" ");
        for (int j = 0, m = vector[i].length; j < m; j++) {
            tfVector[i][j] = (double) vector[i][j] / tokens.length;
        }
    }
}
Figure 5.7: Code Fragment that Normalizes the Term Frequency.
The next step is to calculate the inverse document frequency for all terms.
public void countIDF(String[] inDocsList, String[] inTermsList, int[][] inVector) {
    /* Function that calculates the IDF value. */
    ...
}
Figure 5.8: Code Fragment that Calculates the Inverse Document Frequency.
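The body of countIDF is elided above. The usual computation is idf_j = log(N / df_j), where N is the number of documents and df_j is the number of documents containing term j. A minimal sketch over the count matrix of Figure 5.6 (documents in rows, terms in columns) follows; the logarithm base is an assumption, as the report does not state it.

public class IdfCalculator {
    public static double[] countIDF(int[][] vector) {
        int numDocs = vector.length;
        int numTerms = vector[0].length;
        double[] idf = new double[numTerms];
        for (int j = 0; j < numTerms; j++) {
            int docFreq = 0; // number of documents that contain term j
            for (int i = 0; i < numDocs; i++) {
                if (vector[i][j] > 0) {
                    docFreq++;
                }
            }
            // A term present in every document gets idf = log(N/N) = 0,
            // matching the "bernama" observation in Chapter 6.
            idf[j] = Math.log((double) numDocs / docFreq);
        }
        return idf;
    }
}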
After the term frequency and the inverse document frequency are calculated, the system
calculates the tf-idf by multiplying the term frequency by the inverse document
frequency.
public void countTFIDF(double[][] inTfVector) {
    tfVector = inTfVector;
    tfidfVector = new double[tfVector.length][tfVector[0].length];
    for (int i = 0, n = tfVector.length; i < n; i++) {
        for (int j = 0, m = tfVector[i].length; j < m; j++) {
            tfidfVector[i][j] = tfVector[i][j] * idfVector[j];
        }
    }
}
Figure 5.9: Code Fragment that Calculates the TF-IDF.
5.3 Construct the Distances Matrix using Cosine Similarity
The distances between the documents/clusters are calculated using cosine
similarity. The code segment below shows the cosine similarity class in the system.
public class cosineSimilarity {
    private double[] x;
    private double[] y;

    public cosineSimilarity() { /* Constructor */
    }

    public void setInput(double[] temp1, double[] temp2) {
        /* Set the input needed for the cosine similarity */
        ...
    }

    public double getDistance() { /* Operation that calculates the cosine similarity. */
        double dotProduct = 0.0;
        double normX = 0.0;
        double normY = 0.0;
        if (x.length == y.length) {
            for (int i = 0; i < x.length; i++) {
                dotProduct = dotProduct + (x[i] * y[i]);
                normX = normX + (x[i] * x[i]);
                normY = normY + (y[i] * y[i]);
            }
        }
        normX = Math.sqrt(normX);
        normY = Math.sqrt(normY);
        return dotProduct / (normX * normY);
    }
}
Figure 5.10: Code Fragment that Calculates the Cosine Similarity between Clusters.
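As a quick sanity check, the class can be exercised on two identical vectors, which should yield a similarity of exactly 1. This is a usage sketch, assuming setInput simply stores the two input vectors in x and y.

public class CosineDemo {
    public static void main(String[] args) {
        cosineSimilarity cs = new cosineSimilarity();
        cs.setInput(new double[]{1.0, 2.0, 3.0}, new double[]{1.0, 2.0, 3.0});
        System.out.println(cs.getDistance()); // prints 1.0 for identical vectors
    }
}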
5.4 Clustering
The most important part of this project is the clustering. This project implements
hierarchical agglomerative clustering, and single linkage, complete linkage and
average linkage are all applied in the clustering algorithm.
import java.util.ArrayList;

public class ClusteringMain {
    private String FILE_NAME = "dbi-result.txt";
    cluster[] myCluster;
    private double[][] tfidfVector;
    private String[] docsList;
    private String[] termsList;
    private ArrayList currentClusters;
    private int mode;
    private int clusterNo;
    private String textLog;

    public ClusteringMain(double[][] intfidfVector, String[] indocsList,
            String[] intermsList, int inmode, int inclusterNo) { /* Constructor */
        ... }
    private void initialize() { /* Initialization operation for the clustering */
        ... }
    public void doClustering(int clusterMode) { /* Perform the clustering operation */
        ... }
    public ArrayList getAllCLusters() { /* Return the result of the clustering */
        ... }
    public String getTextLog() { /* Return the text log of the clustering */
        ... }
}
Figure 5.11: Code Fragment that Shows All Functions in the Clustering Class.
The cluster class is used to store a cluster during the clustering. Each cluster object
can store the vectors and the file names of the documents in a cluster.
public class cluster {
    private double weight;
    private double[][] coordinate;
    private String[] docsList;

    public cluster() {
        weight = 0;
    }

    /* Store a single document as a one-member cluster. */
    public void setPoints(double[] pt, String docList) {
        docsList = new String[1];
        docsList[0] = docList;
        if (pt.length > 0) {
            coordinate = new double[1][pt.length];
            for (int i = 0, n = pt.length; i < n; i++) {
                coordinate[0][i] = pt[i];
            }
        }
    }

    /* Merge two clusters: copy the vectors and document lists of both. */
    public void setPoints(cluster clu1, cluster clu2, double atWeight) {
        if (clu1.getDimnsion().length > 0 && clu2.getDimnsion().length > 0) {
            coordinate = new double[clu1.getDimnsion().length
                    + clu2.getDimnsion().length][clu1.getDimnsion()[0].length];
            int i, n, k;
            for (i = 0, n = clu1.getDimnsion().length; i < n; i++) {
                for (int j = 0, m = clu1.getDimnsion()[i].length; j < m; j++) {
                    coordinate[i][j] = clu1.getDimnsion()[i][j];
                }
            }
            for (k = 0, n = clu2.getDimnsion().length; k < n; k++) {
                for (int j = 0, m = clu2.getDimnsion()[k].length; j < m; j++) {
                    coordinate[i][j] = clu2.getDimnsion()[k][j];
                }
                i++;
            }
        }
        if (clu1.getAllDocuments().length > 0 && clu2.getAllDocuments().length > 0) {
            docsList = new String[clu1.getAllDocuments().length
                    + clu2.getAllDocuments().length];
            int i, j, m, n;
            for (i = 0, n = clu1.getAllDocuments().length; i < n; i++) {
                docsList[i] = clu1.getAllDocuments()[i];
            }
            for (j = 0, m = clu2.getAllDocuments().length; j < m; j++) {
                docsList[i] = clu2.getAllDocuments()[j];
                i++;
            }
        }
        weight = atWeight;
    }

    public double getWeight() {
        return weight;
    }

    public String[] getAllDocuments() {
        return docsList;
    }

    public double[][] getDimnsion() {
        return coordinate;
    }
}
Figure 5.12: Code Fragment for the Cluster Class.
At each stage of the hierarchical agglomerative clustering, the two most similar
clusters are merged together, until the number of clusters that the user requested is
reached. The difference between single linkage, complete linkage and average linkage
is the definition of the term "similarity". For single linkage, the similarity of two
clusters is the similarity of their most similar pair of items, one from each cluster.
For complete linkage, the similarity of two clusters is the similarity of their least
similar pair of items, one from each cluster. For average linkage, two clusters are
merged such that each member of the merged cluster has a greater average similarity to
the remaining members of that cluster than it does to all members of any other
cluster.
import java.util.Arrays;

public class allLinkageProcessing {
    private double[][] elements1;
    private double[][] elements2;
    private double[] distances;
    private int mode;

    public allLinkageProcessing() {
    }

    public void setPoints(cluster a, cluster b) {
        ... }

    public void countDist(int inmode) {
        ... /* Calculate the distances between the items of the two clusters. */
        Arrays.sort(distances); /* Sort the distances in ascending order */
    }

    /* mode == 1: the stored measure is a distance (smaller means more similar);
       mode == 2: the stored measure is a similarity (larger means more similar). */
    public double getSinglelink() {
        double temp = 0.0;
        if (mode == 1) {
            temp = distances[0];
        } else if (mode == 2) {
            temp = distances[(distances.length - 1)];
        }
        return temp;
    }

    public double getCompletelink() {
        double temp = 0.0;
        if (mode == 1) {
            temp = distances[(distances.length - 1)];
        } else if (mode == 2) {
            temp = distances[0];
        }
        return temp;
    }

    public double getGroupAverage() {
        double temp = 0.0;
        for (int i = 0, n = distances.length; i < n; i++) {
            temp = temp + distances[i];
        }
        return temp / (double) distances.length;
    }

    public double getCentroidDistance() {
        double temp = 0.0;
        return temp;
    }

    public double[] getCentroid(double[][] a) {
        ... }
}
Figure 5.13: Code Fragment that Shows the Calculation of Single Linkage,
Complete Linkage and Average Linkage.
Hierarchical agglomerative clustering is a very time-consuming process, so an effective code design is very important to save time when testing the system.
while (currentClusters.size() > clusterNo) {
    // Seed the search with the measure between the first two clusters.
    a = (cluster) currentClusters.get(0);
    b = (cluster) currentClusters.get(1);
    al = new allLinkageProcessing();
    al.setPoints(a, b);
    al.countDist(mode);
    if (clusterMode == 1) {
        min = al.getSinglelink();
    } else if (clusterMode == 2) {
        min = al.getCompletelink();
    } else if (clusterMode == 3) {
        min = al.getGroupAverage();
    }
    pos1 = 0;
    pos2 = 1;
    // Scan every pair of clusters for the closest (or most similar) pair.
    for (int i = 0, n = currentClusters.size(); i < n; i++) {
        for (int j = i + 1, m = currentClusters.size(); j < m; j++) {
            a = (cluster) currentClusters.get(i);
            b = (cluster) currentClusters.get(j);
            al = new allLinkageProcessing();
            al.setPoints(a, b);
            al.countDist(mode);
            if (clusterMode == 1) {
                distance = al.getSinglelink();
            } else if (clusterMode == 2) {
                distance = al.getCompletelink();
            } else if (clusterMode == 3) {
                distance = al.getGroupAverage();
            }
            if (mode == 1) {
                if (distance < min) { // distance measure: keep the minimum
                    min = distance;
                    pos1 = i;
                    pos2 = j;
                }
            } else if (mode == 2) {
                if (distance > min) { // similarity measure: keep the maximum
                    min = distance;
                    pos1 = i;
                    pos2 = j;
                }
            }
        }
    }
    // The pair (pos1, pos2) is then merged; the merge step is elided here.
    c = new cluster();
    a = (cluster) currentClusters.get(pos1);
    b = (cluster) currentClusters.get(pos2);
}
Figure 5.14: Code Fragment for Hierarchical Agglomerative Clustering.
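The fragment above ends once the closest pair (pos1, pos2) has been located; the actual merge is elided. The following is a sketch of how each iteration would typically finish, assuming cluster.setPoints(clu1, clu2, weight) merges two clusters as in Figure 5.12 and currentClusters is the working ArrayList; it is not the exact elided code.

// Illustrative completion of one agglomeration step.
c.setPoints(a, b, min);        // combine the members and vectors of a and b
currentClusters.remove(pos2);  // remove the higher index first so pos1 stays valid
currentClusters.remove(pos1);
currentClusters.add(c);        // the merged cluster rejoins the pool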
5.5 Cluster Analysis
After the clustering process, the next step is to perform cluster analysis on the
clusters using the DBI value.
The Davies-Bouldin Index (DBI) is used to measure the quality of the
clusters. A lower DBI value means a better clustering result, because the centers
of the clusters are far away from each other.
$\mathrm{DBI} = \dfrac{1}{n} \sum_{i=1}^{n} \max_{j \neq i} \left( \dfrac{\delta_i + \delta_j}{d(c_i, c_j)} \right)$
The code for calculating the DBI value is defined in Figure 5.15.
import java.util.ArrayList;

public class DBI {
    private ArrayList allClusters;
    private int distanceMode;
    private double dbiValue;

    public DBI(ArrayList inClusters, int inDistanceMode) { /* Constructor */
        ...
    }
    public double getDBI() { /* Return the DBI result */
        return dbiValue;
    }
    public double[] getCentroid(double[][] a) { /* Calculate the centroid of the cluster. */
        ...
    }
    public double getDistance(double[] a, double[] b) {
        ...
    }
    public double getAverageDistanceInCluster(double[][] allPoints) {
        /* Return the average distance from each point to the centroid. */
        ...
    }
    public double getDistanceBetweenClusters(double[][] a, double[][] b) {
        /* Return the distance between the centroids of the two clusters. */
        ...
    }
    public double getSpecialDistance(double[][] a, double[][] b) {
        /* (average distance in cluster a + average distance in cluster b) / centroid distance */
        ...
    }
}
Figure 5.15: Code Fragment for the DBI Calculation.
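Putting the formula and the class together: if delta_i denotes the average distance from the points of cluster i to its centroid, and d(c_i, c_j) the distance between two centroids, the index reduces to a few loops. A self-contained sketch follows, with illustrative names; it is not the elided DBI class above.

public class DbiSketch {
    // delta[i]: average within-cluster distance of cluster i;
    // d[i][j]: distance between the centroids of clusters i and j.
    public static double daviesBouldin(double[] delta, double[][] d) {
        int n = delta.length;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            double worst = 0.0; // max over all j != i
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    worst = Math.max(worst, (delta[i] + delta[j]) / d[i][j]);
                }
            }
            sum += worst;
        }
        return sum / n; // average of the worst-case ratios
    }
}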
5.6 Genetic Algorithm Setup
In this project, a genetic algorithm is applied to improve the quality of the clustering.
The setup of the genetic algorithm is shown in Table 5.2.
Table 5.2: The Setup for the Experiment of the Genetic Algorithm.
Size of chromosome      Depends on the number of unique terms in all documents
Gene values             {0.5, 1.0, 1.5}
Fitness value           DBI value; the lower the DBI value, the better the fitness
Crossover type          Uniform
Crossover rate          50%
Mutation rate           10%
Size of population      30
Number of generations   50
The GA applied in this project is defined as follows.
1. Start with a randomly generated population of n l-valued strings (candidate
solutions to the problem).
   For population = 1 to 30 do
       Chromosome[population] = { rn(0.5, 1.0, 1.5)_1, rn(0.5, 1.0, 1.5)_2,
       rn(0.5, 1.0, 1.5)_3, ..., rn(0.5, 1.0, 1.5)_n }
   end
2. Calculate the fitness f(x) of each string in the population.
   The tf-idf value of every term is adjusted using tf_i x idf_i x scale_i,
   where scale_i is the value of the i-th gene of chromosome x.
   The adjusted tf-idf values are used to perform the clustering, and
   then the DBI value of the clustering is calculated.
   The lowest DBI value is the best fitness.
3. Repeat the following steps until n new strings have been created:
   o Select a pair of parent strings from the current population, the
   probability of selection being an increasing function of fitness.
   o With the crossover probability, cross over the pair to form two new
   strings (uniform crossover, as given in Table 5.2). If no crossover takes
   place, form two new strings that are exact copies of their respective
   parents.
   o Mutate the two new strings at each locus with the mutation
   probability, and place the resulting strings in the new population.
4. Replace the current population with the new population.
5. Go to step 2 until the number of generations is reached.
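To make the loop concrete, the following is a compact sketch of the chromosome operations under the settings of Table 5.2 (genes drawn from {0.5, 1.0, 1.5}, uniform crossover, 10% mutation). The class and method names are illustrative assumptions; the fitness evaluation itself is the clustering-plus-DBI computation of step 2 and is not repeated here.

import java.util.Random;

public class GaSketch {
    static final double[] GENES = {0.5, 1.0, 1.5};
    static final Random RNG = new Random();

    // Step 1: one random scaling gene per unique term.
    static double[] randomChromosome(int numTerms) {
        double[] c = new double[numTerms];
        for (int i = 0; i < numTerms; i++) {
            c[i] = GENES[RNG.nextInt(GENES.length)];
        }
        return c;
    }

    // Step 2 (weight adjustment): scale each term's tf-idf by its gene.
    static double[] adjust(double[] tfidf, double[] chromosome) {
        double[] adjusted = new double[tfidf.length];
        for (int i = 0; i < tfidf.length; i++) {
            adjusted[i] = tfidf[i] * chromosome[i];
        }
        return adjusted;
    }

    // Step 3 (uniform crossover): each locus comes from either parent.
    static double[] crossover(double[] p1, double[] p2) {
        double[] child = new double[p1.length];
        for (int i = 0; i < p1.length; i++) {
            child[i] = RNG.nextBoolean() ? p1[i] : p2[i];
        }
        return child;
    }

    // Step 3 (mutation): each locus is redrawn with probability 0.1.
    static void mutate(double[] chromosome) {
        for (int i = 0; i < chromosome.length; i++) {
            if (RNG.nextDouble() < 0.1) {
                chromosome[i] = GENES[RNG.nextInt(GENES.length)];
            }
        }
    }
}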
5.7 Conclusion
At this stage, all the modules used by the system have been implemented. The system
should be able to process the input data, run the calculations and display the results
to the user at the end. The next step in this project is to test the system in order to
verify it. All the documentation of the results is recorded in the following
chapter.
CHAPTER 6
TESTING
6.1 Introduction
This chapter presents the testing results of this project. The testing results
are important to verify that the project objectives have been fulfilled. Different
datasets and parameters are used to test the system. The objective of the testing
process is to find the errors and bugs in the system that lead to inconsistent
results.
6.2 Loading Data
The first step in running this program is to load the data for testing. In this project,
the datasets are English and Malay documents saved in text file format (.txt). A
file dialog is used in the system to allow the user to select the folder directory that
contains the document set.
Figure 6.1: File Dialog That Allows the User to Select the Dataset Directory
Since the datasets are in text file format, when the user selects a folder that
does not contain any text file, the system will not load any data and will display a
message dialog to the user.
Figure 6.2: Message dialog that informs the user that the selected folder
does not contain any text file.
After the documents are successfully loaded into the system, the file names and
file contents are displayed in the program. The user can review the loaded files to
check whether they are the right datasets.
Figure 6.3: Print Screen of the Loaded Files Displayed in the Program.
6.3 Text Processing
The second test covers the text processing, including tokenization, stop word removal
and stemming.
Input (126 words):
Zul Noordin Says Will Not Retract Police Report Against Khalid
KUALA LUMPUR, Jan 25 (Bernama) -- Kulim Bandar-Baharu Member of Parliament (MP) Zulkifli Noordin says will not retract a police report he lodged against Shah Alam MP Khalid Samad despite being told to do so by Parti Keadilan Rakyat (PKR) advisor, Datuk Seri Anwar Ibrahim.
Zulkifli, who is also a member of PKR, said he would only do so if Khalid retracted statements he made that he (Zulkifli) deemed were insulting to Islam.
Zulkifli told this to Bernama when contacted about the matter on Monday night.
He had lodged the report at the Masjid India police station here on Saturday.
Khalid is from PAS. Both PAS and PKR are members of the Pakatan Rakyat opposition coalition.

After tokenization (126 words):
zul noordin says will not retract police report against khalid kuala lumpur jan 25 bernama kulim bandar baharu member of parliament mp zulkifli noordin says will not retract a police report he lodged against shah alam mp khalid samad despite being told to do so by parti keadilan rakyat pkr advisor datuk seri anwar ibrahim zulkifli who is also a member of pkr said he would only do so if khalid retracted statements he made that he zulkifli deemed were insulting to islam zulkifli told this to bernama when contacted about the matter on monday night he had lodged the report at the masjid india police station here on saturday khalid is from pas both pas and pkr are members of the pakatan rakyat opposition coalition

After stop word removal (69 words):
zul noordin retract police report khalid kuala lumpur jan 25 bernama kulim bandar baharu parliament mp zulkifli noordin retract police report lodged shah alam mp khalid samad despite told parti keadilan rakyat pkr advisor datuk seri anwar ibrahim zulkifli pkr khalid retracted statements zulkifli deemed insulting islam zulkifli told bernama contacted matter monday night lodged report masjid india police station saturday khalid pas pas pkr pakatan rakyat opposition coalition

After stemming (68 words):
zul noordin retract polic report khalid kuala lumpur jan bernama kulim bandar baharu parliament mp zulkifli noordin retract polic report lodg shah alam mp khalid samad despit told parti keadilan rakyat pkr advisor datuk seri anwar ibrahim zulkifli pkr khalid retract statement zulkifli deem insult islam zulkifli told bernama contact matter mondai night lodg report masjid india polic station saturdai khalid pa pa pkr pakatan rakyat opposit coalit

Figure 6.4: The text processing that normalizes the texts. Note that the
number of words decreases at each step.
6.4 Weighting
Term frequency is one of the most important elements in document clustering.
Raw term frequency is defined as the frequency with which a term t appears in a
document D. For example, in Figure 6.5, the term "chong" is found 4 times in
doc001.txt and the term "mew" is found 3 times in doc001.txt.
Figure 6.5: Print Screen of the Raw Term Frequency Table.
The term frequency of a term t in a document D can be normalized by the total
number of terms N_D in the document using formula (9). For example, the term
frequency of the term "chong" changes from 4 to 0.01694915 after the
normalization (i.e., 4 divided by the 236 terms of the document).
Figure 6.6: Print Screen of the Term Frequency Table.
The inverse document frequency (idf) values are also displayed in this system.
Figure 6.7: Part of the idf displayed by the program.
Note that the idf value of "bernama" is 0; this is because "bernama"
occurs in all the documents, which means "bernama" is not useful for distinguishing
relevant from non-relevant documents in the collection of documents.
The TF-IDF is calculated using equation (2).
Figure 6.8: Print Screen of the TF-IDF Table
6.5 Distances Matrix
In this project, cosine similarity is selected to measure the similarity between the
documents or clusters. For example, the documents "Doc001.txt" and
"Doc001.txt" have the same contents, since they are the same document; therefore,
the cosine similarity value between "Doc001" and "Doc001" is 1.
Figure 6.9: Print Screen of the Distances Matrix.
Table 6.1: Cosine similarity distances matrix for 10 documents
Doc001 Doc002 Doc003 Doc004 Doc005 Doc006 Doc007 Doc008 Doc009 Doc010
Doc001 1 0.163477 0.284548 0.284688 0.02337295 0.006486 0.007898 0.004389 0.010518 6.32E-04
Doc002 0.163477 1 0.07978 0.167431 0.03145406 0.00369 0.004215 0.014173 0.003047 6.71E-04
Doc003 0.284548 0.07978 1 0.152682 0.00476926 0.002682 3.61E-04 0.017931 1.99E-04 0.013871
Doc004 0.284688 0.167431 0.152682 1 0.03681549 0.008881 0.011351 0.007047 0.016692 8.94E-04
Doc005 0.023373 0.031454 0.004769 0.036815 1 0.005798 0.002323 0.010332 0.011481 0.004086
Doc006 0.006486 0.00369 0.002682 0.008881 0.00579808 1 0.005763 0.02707 0.323859 0.02797
Doc007 0.007898 0.004215 3.61E-04 0.011351 0.00232275 0.005763 1 0.080513 0.036755 2.13E-04
Doc008 0.004389 0.014173 0.017931 0.007047 0.01033184 0.02707 0.080513 1 0.070439 0.023135
Doc009 0.010518 0.003047 1.99E-04 0.016692 0.01148149 0.323859 0.036755 0.070439 1 0.07894
Doc010 6.32E-04 6.71E-04 0.013871 8.94E-04 0.00408612 0.02797 2.13E-04 0.023135 0.07894 1
6.6 Experiment 1: Testing the Cluster Results of Different Clustering
Algorithms
Clustering is the main task in this project, and a good clustering result is important
for the cluster mapping between the Malay clusters and the English clusters. In this
project, single linkage, complete linkage and average linkage clustering were run on
50 documents. These 50 documents were clustered into 7 clusters, and the document
sets in the clusters are compared to see the differences between the clustering
results of single linkage, complete linkage and average linkage. The 10 most
important terms of each cluster are also shown in the results.
Test 1: Single Linkage
Table 6.2: Clustering Result for Single Linkage.
Cluster E1 (42 documents): Business001.txt, Business007.txt, General004.txt, Business002.txt, Business006.txt, Business004.txt, Business005.txt, Business008.txt, Feature010.txt, Business003.txt, Business010.txt, Feature009.txt, Feature002.txt, Feature005.txt, General007.txt, Sport005.txt, Sport006.txt, General001.txt, Politic001.txt, Politic004.txt, Politic005.txt, Politic006.txt, Politic008.txt, Politic002.txt, Politic003.txt, Politic009.txt, General002.txt, Politic007.txt, Sport010.txt, General005.txt, General010.txt, Sport007.txt, Sport009.txt, General006.txt, General008.txt, Feature003.txt, Politic010.txt, Sport001.txt, Sport004.txt, Sport003.txt, Sport002.txt, Sport008.txt
Important terms: bank, umno, pkr, khalid, zulkifli, polic, petrona, team, sailor, sport

Cluster E2 (1 document): Business009.txt
Important terms: fdi, unctad, flow, drop, cent, trillion, declin, quarter, economi, global

Cluster E3 (3 documents): Feature001.txt, Feature006.txt, Feature008.txt
Important terms: seedstock, camel, prawn, farm, freshwat, haiezack, pond, breed, breeder, venture

Cluster E4 (1 document): Feature004.txt
Important terms: traffic, jakarta, road, jl, agu, bu, citi, hour, crawl, peak

Cluster E5 (1 document): Feature007.txt
Important terms: paint, voc, eco, chemic, soo, odour, irrit, environment, hazard, low

Cluster E6 (1 document): General003.txt
Important terms: asli, orang, expos, knowledg, tradit, shafi, documentari, master, hospit, import

Cluster E7 (1 document): General009.txt
Important terms: choi, ik, summon, court, magistr, chong, privat, daphn, seduct, appeal

From Table 6.2, 42 documents are clustered into cluster E1, and the important terms of cluster E1 are bank, umno, pkr, khalid, zulkifli, polic, petrona, team, sailor and sport.
Test 2: Complete Linkage
Table 6.3: Clustering Result for Complete Linkage.
Cluster E1 (5 documents): Business001.txt, Business007.txt, Business003.txt, Business010.txt, Business009.txt
Important terms: properti, fdi, equiti, fund, rubber, price, cent, unctad, incom, flow

Cluster E2 (5 documents): Business002.txt, Business006.txt, General004.txt, Feature003.txt, Feature007.txt
Important terms: paint, petrona, biochar, lubric, proton, bad, timor, lest, ga, carbon

Cluster E3 (13 documents): Business004.txt, Business005.txt, Business008.txt, Feature010.txt, General007.txt, Sport005.txt, Sport006.txt, Feature001.txt, Feature006.txt, Feature008.txt, Feature002.txt, Feature005.txt, Feature004.txt
Important terms: bank, seedstock, team, camel, class, race, prawn, pst, deposit, product

Cluster E4 (7 documents): Feature009.txt, Sport010.txt, General010.txt, General006.txt, General008.txt, Sport007.txt, Sport009.txt
Important terms: sailor, tourism, sport, venu, wind, boat, safeti, train, park, mya

Cluster E5 (9 documents): General001.txt, Politic001.txt, Politic004.txt, Politic005.txt, Politic006.txt, Politic008.txt, General003.txt, General005.txt, General009.txt
Important terms: pkr, khalid, zulkifli, polic, retract, asli, pa, gunasegaran, orang, nik

Cluster E6 (6 documents): General002.txt, Politic007.txt, Politic010.txt, Politic002.txt, Politic003.txt, Politic009.txt
Important terms: umno, voter, secretari, suprem, appoint, tengku, razaleigh, regist, divis, puad

Cluster E7 (5 documents): Sport001.txt, Sport004.txt, Sport003.txt, Sport002.txt, Sport008.txt
Important terms: bt, chn, chong, round, wei, titl, singl, lee, hafiz, pair
Test 3: Average Linkage
Table 6.4: Clustering Result for Average Linkage.
Cluster E1 (16 documents): Business001.txt, Business007.txt, Business002.txt, Business006.txt, General004.txt, Business004.txt, Business005.txt, Business008.txt, Feature010.txt, Business003.txt, Business010.txt, Business009.txt, Feature003.txt, Feature001.txt, Feature006.txt, Feature008.txt
Important terms: bank, seedstock, sale, properti, fdi, cent, price, petrona, camel, equity

Cluster E2 (6 documents): Feature002.txt, Feature005.txt, General007.txt, Sport005.txt, Sport006.txt, Feature004.txt
Important terms: team, race, class, pst, endur, dubai, lap, cultur, traffic, commun

Cluster E3 (1 document): Feature007.txt
Important terms: paint, voc, eco, chemic, soo, odour, irrit, environment, hazard, low

Cluster E4 (7 documents): Feature009.txt, Sport010.txt, General006.txt, General008.txt, General010.txt, Sport007.txt, Sport009.txt
Important terms: sailor, tourism, sport, venu, wind, boat, safeti, train, park, mya

Cluster E5 (13 documents): General001.txt, Politic001.txt, Politic004.txt, Politic005.txt, Politic006.txt, Politic008.txt, Politic002.txt, Politic003.txt, Politic009.txt, General002.txt, Politic007.txt, Politic010.txt, General003.txt
Important terms: umno, pkr, khalid, zulkifli, retract, asli, voter, pa, rakyat, secretary

Cluster E6 (2 documents): General005.txt, General009.txt
Important terms: gunasegaran, choi, polic, court, death, personnel, ik, summon, intent, subari

Cluster E7 (5 documents): Sport001.txt, Sport004.txt, Sport003.txt, Sport002.txt, Sport008.txt
Important terms: bt, chn, chong, round, wei, titl, singl, lee, hafiz, pair
Table 6.5: Summary of the Document Distribution in the Clusters.
                   E1   E2   E3   E4   E5   E6   E7   Total
Single Linkage     42    1    3    1    1    1    1    50
Complete Linkage    5    5   13    7    9    6    5    50
Average Linkage    16    6    1    7   13    2    5    50
Figure 6.10: Summary of the Document Distribution in 7 Clusters.
From the results in Table 6.5, complete linkage seems to produce better
clusters than single linkage and average linkage, as the sizes of its clusters are
more balanced. Single linkage has the worst clusters, where some clusters are
extremely larger than the others. For example, in Table 6.2, there are 42
documents in cluster E1, which is almost 84% of all the documents. This is because
single linkage is sensitive to noise and outliers (Tan et al., 2006). Average
linkage is intermediate between single linkage and complete linkage: there is no
extremely large cluster in the average linkage result, but there are some clusters
that contain only one or two documents, for example clusters E3 and E6 in
Table 6.4.
6.7 Experiment 2: Testing the DBI Values of Single Linkage, Complete
Linkage and Average Linkage
A test was carried out by clustering 100 documents using complete linkage,
single linkage and average linkage. The DBI value of each clustering test is
recorded to observe the quality of the clustering, where a lower DBI value indicates a
better clustering result. This experiment was run using 5 clusters, 10
clusters, 15 clusters and 20 clusters.
Table 6.6: DBI values for the different hierarchical agglomerative clustering methods.
                   5 Clusters   10 Clusters   15 Clusters   20 Clusters
Complete Linkage   4.882957     3.523555      2.831373      2.47898299
Single Linkage     1.781201     1.176582      1.119181      1.009035665
Average Linkage    3.672783     2.637479      2.335699      2.05946653
Figure 6.11: DBI values for different hierarchical agglomerative
clustering.
Based on the results in Table 6.6, the DBI values decrease from 5
clusters to 20 clusters: clustering into 5 clusters produces the highest DBI
value and clustering into 20 clusters produces the lowest DBI value. This is because
when the number of clusters increases, the number of elements in each cluster
decreases and the within-cluster distances decrease. This means that a larger number
of clusters can be grouped more compactly and produce better-shaped clusters compared
to a lower number of clusters.
6.8 Experiment 3: Improving the DBI Value and the Percentage of English-
Malay Mapping using the Genetic Algorithm
In this experiment, average linkage clustering was performed on a set of 100
documents and then the English clusters and the Malay clusters were mapped. The DBI
values and the success rate of the mapping were compared before and after applying
the GA.
Table 6.7: DBI values for the English Document Clustering (Before GA and After GA).
DBI Value for HAC              5 Clusters   10 Clusters   15 Clusters
Complete Linkage (Before GA)   2.113        1.812         1.737
Complete Linkage (After GA)    1.184        1.642         0.575
Figure 6.12: DBI value for each generation of the genetic algorithm (DBI value versus number of generations, for English documents clustered into 5, 10 and 15 clusters).
Table 6.8: The Percentage of Successful Cluster Mappings Before and After Applying the GA.
                                                 5 Clusters   10 Clusters   15 Clusters
Average Successful Cluster Mapping (Before GA)
    MEM                                          72.0%        60.0%         52.3%
    EMM                                          69.3%        61.0%         54.3%
Average Successful Cluster Mapping (After GA)
    MEM                                          76.3%        52.3%         54.0%
    EMM                                          73.0%        54.0%         55.3%
*MEM = Malay-to-English cluster mapping
*EMM = English-to-Malay cluster mapping
From Table 6.7, the DBI values decreased after the genetic algorithm was
applied. This means that better-shaped, more compact clusters were produced by the
clustering that used the adjusted tf-idf values.
Table 6.8 shows that applying the GA is able to improve the percentage of
successful cluster mappings between the English and Malay clusters.
However, not every GA experiment can improve the mapping results. Referring to
Table 6.8, the percentage of cluster mappings for 10 clusters dropped after the
GA was applied to the clustering. This is due to the worst case of the GA, in which the
tf-idf values were over-adjusted, making a big change to the structure of the
clusters and affecting the cluster mapping.
6.9 Conclusion
In conclusion, the testing phase was performed correctly on the
system, and the system ran successfully according to its design. Based on the
results, complete linkage is the most suitable clustering algorithm for document
clustering, as it produces small, tightly bound and cohesive clusters, whereas
single linkage clustering tends to produce large, loosely bound and "straggly"
clusters. Referring to the results of Experiment 3, the genetic algorithm can also
improve the DBI of the clusters and hence slightly improve the cluster mapping
between the English and Malay clusters.
CHAPTER 7
CONCLUSION AND FUTURE WORK
7.1 Introduction
The purpose of developing this system was to implement a framework for the parallel
clustering of English and Malay text documents. The clustering results, the English
and Malay clusters, proceed to the cluster mapping. A genetic algorithm is applied in
this system to improve the clustering results. A system was designed and developed
to run experiments to achieve the objectives of this project, and the results were
analyzed based on the data generated by the system. Finally, the conclusion
summarizes the work and study carried out during this project.
7.2 Objective Achievement
Generally, the objectives of this system have been achieved. This section
summarizes the objectives achieved in this project.
Objective 1: Propose a framework for English-Malay parallel hierarchical
agglomerative clustering.
The system is able to perform English text document clustering and Malay
text document clustering together. The clusters produced by the
clustering can be mapped between the English clusters and the Malay clusters.
Objective 2: Implement a hierarchical agglomerative clustering method to cluster
English documents.
Single linkage, complete linkage and average linkage are all applied in the
system in this project. The results show that hierarchical agglomerative
clustering is able to group the English documents into clusters.
Objective 3: Apply a Genetic Algorithm (GA) to analyze the effects of the GA on the
clustering results.
The genetic algorithm improved the clustering results, as the DBI value
was reduced after the genetic algorithm was run on the system.
7.3 Discussion
Based on the results of the experiments, complete linkage hierarchical
agglomerative clustering produced clusters of better quality. This is because complete
linkage HAC is less sensitive to the noise and outliers of the data set. Normalizing
the text via tokenization, stop word removal and stemming is important, as it reduces
the time required for document clustering.
A larger number of clusters will produce a lower DBI value; this is because
a larger number of clusters reduces the distance between each document and its
centroid.
The genetic algorithm in this system can reduce the DBI value of the clustering
and improve the structure of the clusters, because compact and well-separated
clusters are found by minimizing the DBI. However, in some genetic algorithm
experiments the percentage of successful cluster mappings was reduced; this is
because the tf-idf values of the documents were over-adjusted and could not produce a
good cluster mapping between the English and Malay clusters.
7.4 Limitations of the System
The limitations of the system refer to the things that the system cannot provide
or do. The limitations of this system are listed below.
This system cannot distinguish English documents from Malay documents.
This system cannot save the results for the user.
This system cannot display a dendrogram as a visualization of the clustering.
7.5 Recommendation of Future Works
Some suggestions and ideas have been identified which can be used to improve this
system. The recommendations are stated below:
Build an English and Malay language recognizer that automatically separates the
documents based on the document language.
Save the results of the system for future reference.
Display a dendrogram to illustrate the structure of the clusters produced by
hierarchical agglomerative clustering.
7.6 Conclusions
Within the limitations and scope of this project, the English-Malay parallel
clustering can still be improved. Suggestions and recommendations were given in
the previous section so that future work can improve the design of the
system. The discussion section summarized the study carried out in this project and
explained the results of the experiments. Although the system has some
limitations, it still achieved the objectives of the project.
References
Alfred, R. (2009). A Parallel Hierarchical Agglomerative Clustering Technique for Bilingual Corpora Based on Reduced Terms with Automatic Weight Optimization. Proceedings of the 5th International Conference on Advanced Data Mining and Applications. Springer-Verlag, Berlin, 19-30.
Dash, M., Petrutiu, S., & Scheuermann, P. (2004). Efficient Parallel Hierarchical Clustering. Accepted for publication in the 10th International Conference on Parallel and Distributed Computing, 364-371.
Fujii, A., & Ishikawa, T. (2001). Japanese/English Cross-Language Information Retrieval. 1-29.
Gey, F. C., Kando, N., & Peters, C. (2004). Cross-Language Information Retrieval: The Way Ahead. 416-430.
Greengrass, E. (2000). Information Retrieval: A Survey. University of Maryland: Maryland, 111-115.
Kishida, K. (2004). Technical Issues of Cross-Language Information Retrieval: A Review. Information Processing and Management: An International Journal - Special Issue: Cross-Language Information Retrieval, 435-437.
Manning, C. D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval. Cambridge: Cambridge University Press.
Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining. Boston: Pearson Education.
Teknomo, K. (2007). Hierarchical Clustering Tutorial. Retrieved October 10, 2010, from Kardi Teknomo's Page: http://people.revoledu.com/kardi/tutorial/Clustering/index.html
Yusuf, H. R. (1992). An Analysis of Indonesian Language for Interlingual Machine-Translation System. COLING '92 Proceedings of the 14th Conference on Computational Linguistics (4), 1228-1232.