Efficient Data Mining for Calling Path Patterns in GSM Networks
Information Systems, accepted 5 December 2002
SPEAKER: YAO-TE WANG (王耀德)
2
Outline
Introduction
Frequent calling path graph
Graph construction
Mining calling path patterns
Experimental results
Conclusions
3
Mining association rules
Agrawal, Imielinski, and Swami first addressed the issue of mining association rules in 1993.
4
Mining association rules
Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all distinct items, labeled in lexicographic order.
An association rule can be represented as $A \Rightarrow B$, where $A$ and $B$ are subsets of $I$.
The rule states that if itemset $A$ appears in a transaction, itemset $B$ most likely occurs in the same transaction.
For example, $\text{Bread} \Rightarrow \text{Milk}$ and $\text{Beer} \Rightarrow \text{Diaper}$.
5
Mining association rules
A transaction T in a database supports an itemset S if S is contained in T.
All combinations of items whose fractional transaction support is above a certain threshold, called the minimum support, are termed large itemsets.
6
Mining association rules
The problem of association rule mining can be decomposed into two sub-problems: find all large itemsets and, for a given large itemset, generate all rules.
For every large itemset $L$, find all non-empty subsets of $L$. For every such subset $A$, output a rule of the form $A \Rightarrow (L - A)$ if the ratio of support($L$) to support($A$) is at least minconf.
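The rule-generation step can be sketched as follows. This is a minimal illustration, not code from the talk: the function name `generate_rules` and the support values are hypothetical, and only proper subsets of L are enumerated so that the right-hand side of each rule is never empty.

```python
from itertools import combinations

def generate_rules(large_itemset, support, minconf):
    """Return (A, L - A, confidence) for each non-empty proper subset A of L
    whose confidence support(L) / support(A) is at least minconf."""
    L = frozenset(large_itemset)
    rules = []
    for size in range(1, len(L)):
        for subset in combinations(sorted(L), size):
            A = frozenset(subset)
            confidence = support[L] / support[A]
            if confidence >= minconf:
                rules.append((A, L - A, confidence))
    return rules

# Hypothetical support values (fractions of transactions), for illustration only.
support = {
    frozenset({"bread"}): 0.6,
    frozenset({"milk"}): 0.5,
    frozenset({"bread", "milk"}): 0.4,
}
rules = generate_rules({"bread", "milk"}, support, minconf=0.7)
for A, B, conf in rules:
    print(sorted(A), "=>", sorted(B), round(conf, 2))
```

With these made-up numbers only the rule from milk to bread passes the threshold, because 0.4 / 0.5 = 0.8 while 0.4 / 0.6 is about 0.67.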
7
Mining association rules
Apriori algorithm
The first pass determines the large 1-itemsets.
A subsequent pass $k$ consists of two phases.
First, the large itemsets $L_{k-1}$ are used to generate the candidate itemsets $C_k$.
Next, the database is scanned and the support of the candidates in $C_k$ is counted.
Apriori property: any subset of a large itemset must be large.
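The two phases of each pass can be sketched as below. This is a simplified illustration under stated assumptions rather than the original Apriori code: transactions are Python sets, the toy database is made up, and candidates are generated by a naive join of the previous level with itself followed by the Apriori-property prune.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every large itemset."""
    n = len(transactions)
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    large = {s: c for s, c in counts.items() if c / n >= min_support}
    result = dict(large)
    k = 2
    while large:
        prev = set(large)
        # Phase 1: build candidates of size k and prune any candidate that
        # has a (k-1)-subset which is not large (Apriori property).
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Phase 2: scan the database and count each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        large = {s: c for s, c in counts.items() if c / n >= min_support}
        result.update(large)
        k += 1
    return result

# Hypothetical toy database, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"beer", "diaper"},
    {"bread", "milk", "diaper"},
    {"beer", "diaper", "milk"},
]
large_itemsets = apriori(transactions, min_support=0.5)
```

Note that the database is scanned once per pass, which is the repeated-scan cost that motivates the graph-based approach later in the talk.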
8
Association rules
J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers (2001).
Min_sup = 20%
9
Motivation
Traditional methods for mining sequential patterns, such as Apriori-like algorithms, may suffer from two problems:
A large number of candidate sequences.
Repeated database scans.
10
Motivation
Traditional mining methods cannot extract PMFCPs from the database directly.
TID   Calling paths
T100  <a, b, c, d, e, f>
T200  <a, b, c, d, e, f, g, h>
T300  <b, c, d, e>
T400  <d, e, f, g, h, i, j, k, l>
T500  <e, f, g, h, i, j>
T600  <g, h, i, j, k>
Let the minimum support be 50%.
Maximal frequent calling paths:
<b, c, d, e>
<d, e, f>
<e, f, g, h>
<g, h, i, j>
[Figure: cells a to l, with the spans of calling paths T100 to T600 and of the PMFCP.]
11
Our solutions
A novel graph data structure is proposed to hold the information of the calling paths.
The database is scanned only once.
An efficient mining algorithm based on the proposed graph structure is devised to mine the PMFCPs in GSM networks.
13
Introduction
The cell structure of a real network.
Source: GSM: Switching, Services and Protocols, John Wiley & Sons Ltd., Chichester, England (1999).
14
Introduction
A mobile phone user may start a phone call in one cell and then move to other cells during the call.
The sequence of cells visited during the phone call is termed a calling path.
15
Introduction
The support of a calling path P is the fraction of transactions in the calling path database that contain P.
A calling path whose support is not less than the user-specified minimum support is termed a frequent calling path.
A frequent calling path is maximal if it is not contained in any other frequent calling path.
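These definitions can be checked by brute force on a small database. The sketch below is only an illustration: it assumes that "contains" means containment as a contiguous run of cells, the function names are hypothetical, and the six calling paths are the ones from the Motivation slide with a minimum support of 50%.

```python
def contains(path, sub):
    """True if `sub` occurs in `path` as a contiguous run of cells (assumption)."""
    n, m = len(path), len(sub)
    return any(path[i:i + m] == sub for i in range(n - m + 1))

def support(db, sub):
    """Fraction of calling paths in `db` that contain `sub`."""
    return sum(contains(p, sub) for p in db) / len(db)

def maximal_frequent_paths(db, min_support):
    # Brute force: enumerate every contiguous sub-path of every calling path.
    candidates = {p[i:j] for p in db
                  for i in range(len(p))
                  for j in range(i + 1, len(p) + 1)}
    frequent = {c for c in candidates if support(db, c) >= min_support}
    # Keep only the paths not contained in another frequent path.
    return sorted(f for f in frequent
                  if not any(f != g and contains(g, f) for g in frequent))

db = [tuple("abcdef"), tuple("abcdefgh"), tuple("bcde"),
      tuple("defghijkl"), tuple("efghij"), tuple("ghijk")]
result = maximal_frequent_paths(db, 0.5)
for path in result:
    print(path)
```

The enumeration is exponential in spirit and only suitable for toy inputs; it is meant to make the definitions concrete, not to stand in for the mining algorithm presented later.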
16
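Problem definition
Let $P_1$ and $P_2$ be calling paths, where $P_1 = \langle v_{1,1}, v_{1,2}, \ldots, v_{1,n} \rangle$ and $P_2 = \langle v_{2,1}, v_{2,2}, \ldots, v_{2,m} \rangle$. If $\langle v_{1,i+1}, \ldots, v_{1,i+h} \rangle = \langle v_{2,j+1}, \ldots, v_{2,j+h} \rangle$, where $h \ge 2$, $0 \le i \le n-h$, and $0 \le j \le m-h$, we define the merge operation as
$$
P_1 \oplus P_2 =
\begin{cases}
\langle v_{1,1}, v_{1,2}, \ldots, v_{1,n}, v_{2,h+1}, \ldots, v_{2,m} \rangle & \text{if } i = n-h \text{ and } j = 0, \\
\langle v_{2,1}, v_{2,2}, \ldots, v_{2,m}, v_{1,h+1}, \ldots, v_{1,n} \rangle & \text{if } i = 0 \text{ and } j = m-h, \\
\{ \langle v_{1,1}, \ldots, v_{1,i+1}, \ldots, v_{1,i+h}, v_{2,j+h+1}, \ldots, v_{2,m} \rangle,\ \langle v_{2,1}, \ldots, v_{2,j+1}, \ldots, v_{2,j+h}, v_{1,i+h+1}, \ldots, v_{1,n} \rangle \} & \text{otherwise.}
\end{cases}
$$
For example:
$P_1 = \langle A X \rangle$, $P_2 = \langle X B \rangle$: $P_1 \oplus P_2 = \langle A X B \rangle$
$P_1 = \langle X A \rangle$, $P_2 = \langle B X \rangle$: $P_1 \oplus P_2 = \langle B X A \rangle$
$P_1 = \langle A_1 X A_2 \rangle$, $P_2 = \langle B_1 X B_2 \rangle$: $P_1 \oplus P_2 = \{ \langle A_1 X B_2 \rangle, \langle B_1 X A_2 \rangle \}$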
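The three cases of the merge operation can be sketched in code. This is an illustrative reading of the definition, not the authors' implementation: the function name `merge` is hypothetical, paths are Python tuples, the offsets i, j and the overlap length h are passed in explicitly, and a list is returned in every case so that callers handle one result type.

```python
def merge(p1, p2, i, j, h):
    """Merge p1 and p2 on the common segment p1[i:i+h] == p2[j:j+h].

    The 1-based segment <v_{i+1}, ..., v_{i+h}> of the definition is the
    0-based slice [i:i+h] here.
    """
    n, m = len(p1), len(p2)
    if h < 2 or not (0 <= i <= n - h and 0 <= j <= m - h):
        raise ValueError("indices outside the range allowed by the definition")
    if p1[i:i + h] != p2[j:j + h]:
        raise ValueError("the two segments differ")
    if i == n - h and j == 0:
        # The end of p1 overlaps the start of p2.
        return [p1 + p2[h:]]
    if i == 0 and j == m - h:
        # The end of p2 overlaps the start of p1.
        return [p2 + p1[h:]]
    # Otherwise the paths cross on the common segment and two paths result.
    return [p1[:i + h] + p2[j + h:], p2[:j + h] + p1[i + h:]]

# The common part X is taken to be the two cells ("x", "y") so that h = 2.
case1 = merge(("a", "x", "y"), ("x", "y", "b"), i=1, j=0, h=2)
case2 = merge(("x", "y", "a"), ("b", "x", "y"), i=0, j=1, h=2)
case3 = merge(("a1", "x", "y", "a2"), ("b1", "x", "y", "b2"), i=1, j=1, h=2)
```

The three calls mirror the three examples on the slide, with `case3` producing the two crossed paths.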
17
Problem definition
The potential maximal frequent calling paths (PMFCPs) in the database $D$ are defined as $\{P \mid P \in FP^{+} \text{ and } P \text{ is maximal in } FP^{+}\}$, where $FP^{+}$ is the closure of $FP$ under the merge operation and $FP = \{P \mid P \text{ is a maximal frequent calling path in } D\}$.
18
Frequent calling path graph
A calling path graph is a directed graph containing the information needed to mine PMFCPs in a calling path database.
A calling path graph consists of vertices, out-edges, and in-out paths.
A vertex in the calling path graph represents a cell in the GSM network.
An out-edge of vertex v in a calling path graph is an edge starting at v. An in-edge of vertex v is an edge ending at v.
An in-out path of vertex v in a calling path graph is a calling path formed by one in-edge of v and one out-edge of v.
19
Frequent calling path graph
In a frequent calling path graph G, all out-edges and in-out paths are frequent.
A calling path can be decomposed into an out-edge, or an out-edge plus several in-out paths, from which the corresponding calling path graph can be constructed.
The decomposed out-edge and in-out paths can be merged to regenerate the original calling path.
20
Frequent calling path graph
For example, the calling path <a, b, c, d, e> can be decomposed into an out-edge <a, b> plus three in-out paths <a, b, c>, <b, c, d>, and <c, d, e>.
Conversely, the out-edge <a, b> and the in-out paths <a, b, c>, <b, c, d>, and <c, d, e> can be merged into <a, b, c, d, e>.
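The decomposition and its inverse can be sketched as follows. The function names `decompose` and `recompose` are hypothetical, and paths are represented as tuples of cell names; each recomposition step is a merge on a two-cell overlap.

```python
def decompose(path):
    """Split a calling path into its out-edge and its in-out paths."""
    if len(path) < 2:
        raise ValueError("a calling path needs at least two cells to have an out-edge")
    out_edge = path[:2]
    # Every run of three consecutive cells is an in-out path of the middle cell.
    in_out_paths = [path[k:k + 3] for k in range(len(path) - 2)]
    return out_edge, in_out_paths

def recompose(out_edge, in_out_paths):
    """Rebuild the calling path by merging each in-out path onto the end in turn."""
    path = out_edge
    for triple in in_out_paths:
        if path[-2:] != triple[:2]:
            raise ValueError("in-out path does not continue the current path")
        path = path + triple[2:]
    return path

out_edge, triples = decompose(tuple("abcde"))
rebuilt = recompose(out_edge, triples)
```

A two-cell calling path decomposes into an out-edge only, with an empty list of in-out paths.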
21
Graph construction
The cell structure of the GSM network may need to be divided into several partitions so that the calling path sub-graph of each partition fits in main memory. The mining algorithm is then applied to each sub-graph.
22
Graph construction
[Figure: Example of graph partition, showing Q1 to Q9 and partition lines 1 to 4.]
23
Graph construction
The graph construction algorithm first examines whether the cell structure of the GSM network is partitioned.
Then, the calling paths are retrieved from the database and decomposed into out-edges and in-out paths.
The graph is constructed from the out-edges and in-out paths.
24
Graph construction
[Figure: cell structure with cells a to n.]
TID   Calling paths       TID   Calling paths
T001  <d, g, j, m>        T011  <b, d>
T002  <a, b>              T012  <g, k, l>
T003  <g, k, n, l>        T013  <b, d, g>
T004  <a, b, d>           T014  <d, g, j, m>
T005  <b, d, g>           T015  <g, k, n>
T006  <l, n, m>           T016  <b, d, g, k, l>
T007  <d, g, f>           T017  <a, c>
T008  <g, k>              T018  <d, c, f>
T009  <d, g, k, n>        T019  <m, n, l>
T010  <a, b, d, h, l>     T020  <b, d, g, f, j>
Example 1: Let the minimum support be 10%.
25
Graph construction
v   Out-edges               In-out paths
a   <a, b>:3, <a, c>:1
b   <b, d>:5                <a, b, d>:2
c                           <d, c, f>:1
d   <d, c>:1, <d, g>:4      <b, d, g>:4, <b, d, h>:1
f                           <g, f, j>:1
g   <g, k>:4                <d, g, f>:2, <d, g, j>:2, <d, g, k>:2
h                           <d, h, l>:1
j                           <g, j, m>:2
k                           <g, k, l>:2, <g, k, n>:3
l   <l, n>:1
m   <m, n>:1
n                           <k, n, l>:1, <l, n, m>:1, <m, n, l>:1
[Figure: The frequent calling path graph, with vertices a, b, d, f, g, j, k, l, m, n.]
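The counts of Example 1 can be reproduced with a single pass over the database. The sketch below is an illustration under stated assumptions: the function name `build_frequent_graph` is hypothetical, the graph is represented simply as two dictionaries of counts, and partitioning is ignored.

```python
from collections import Counter

def build_frequent_graph(db, min_support):
    """Count out-edges and in-out paths in one scan, then keep the frequent ones."""
    out_edges, in_out_paths = Counter(), Counter()
    for path in db:  # the database is scanned only once
        if len(path) >= 2:
            out_edges[path[:2]] += 1          # first edge of the calling path
        for k in range(len(path) - 2):
            in_out_paths[path[k:k + 3]] += 1  # every three consecutive cells
    n = len(db)
    frequent_edges = {e: c for e, c in out_edges.items() if c / n >= min_support}
    frequent_triples = {t: c for t, c in in_out_paths.items() if c / n >= min_support}
    return frequent_edges, frequent_triples

# The 20 calling paths T001 to T020 of Example 1, one string per path.
db = [tuple(p) for p in
      "dgjm ab gknl abd bdg lnm dgf gk dgkn abdhl "
      "bd gkl bdg dgjm gkn bdgkl ac dcf mnl bdgfj".split()]
edges, triples = build_frequent_graph(db, min_support=0.10)
```

With a minimum support of 10% (two of the 20 transactions), the entries with a count of 1 in the table are dropped, which leaves exactly the vertices of the frequent calling path graph.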
26
Mining PMFCPs
The algorithm for mining PMFCPs is based on depth-first search, a natural way to visit the vertices of a graph systematically.
First, all local PMFCPs are found in each sub-graph. Then, the local PMFCPs extracted from the sub-graphs are merged into global PMFCPs.
27
Mining PMFCPs
Example 1 (cont.):
Path(1): <a, b, d, g, k, n>
Path(2): <a, b, d, g, k, l>
Path(3): <a, b, d, g, j, m>
Path(4): <a, b, d, g, f>
[Figure: the frequent calling path graph with vertices a, b, d, f, g, j, k, l, m, n, with paths marked "Copied" or "Appended".]
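A depth-first traversal that arrives at the four paths of Example 1 can be sketched as below. This is a simplified illustration, not the algorithm of the talk: it does not model the "Copied" and "Appended" bookkeeping, the function names are hypothetical, and the frequent out-edges and in-out paths are hard-coded from the graph construction table at a minimum support of 10%.

```python
def contains(big, small):
    """True if `small` occurs in `big` as a contiguous run of cells."""
    return any(big[i:i + len(small)] == small
               for i in range(len(big) - len(small) + 1))

def mine_paths(out_edges, in_out_paths):
    # successors[(u, v)] lists every w such that <u, v, w> is a frequent in-out path.
    successors = {}
    for u, v, w in in_out_paths:
        successors.setdefault((u, v), []).append(w)

    found = []

    def dfs(path):
        # Assumption: a cell already on the path is not revisited, which keeps
        # the search finite if the graph has cycles.
        nexts = [w for w in successors.get(path[-2:], []) if w not in path]
        if not nexts:
            found.append(path)
        for w in nexts:
            dfs(path + (w,))

    for edge in out_edges:
        dfs(edge)
    # Keep only the paths that are not contained in another path found.
    return sorted(p for p in found
                  if not any(p != q and contains(q, p) for q in found))

frequent_out_edges = [("a", "b"), ("b", "d"), ("d", "g"), ("g", "k")]
frequent_in_out_paths = [
    ("a", "b", "d"), ("b", "d", "g"), ("d", "g", "f"), ("d", "g", "j"),
    ("d", "g", "k"), ("g", "j", "m"), ("g", "k", "l"), ("g", "k", "n"),
]
paths = mine_paths(frequent_out_edges, frequent_in_out_paths)
```

Starting a search from every frequent out-edge and filtering afterwards is wasteful; it is used here only because it is easy to verify by hand.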
28
Mining PMFCPs
If the cell structure of the GSM network is divided into two partitions, the local PMFCPs of each partition are merged into the global PMFCPs:
[Figure: (a) Original cell structure, cells a to n. (b) PartitionU, cells a to i. (c) PartitionL, cells c to n. A partition line separates the two partitions.]
Local PMFCPs in PartitionU:
<a, b, d, g, f>
Local PMFCPs in PartitionL:
<d, g, k, n>
<d, g, k, l>
<d, g, j, m>
Global PMFCPs:
<a, b, d, g, k, n>
<a, b, d, g, k, l>
<a, b, d, g, j, m>
<a, b, d, g, f>
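The step from local to global PMFCPs can be illustrated by closing the local paths under the merge operation and keeping the maximal results. The sketch below is an assumption-laden illustration: the function names are hypothetical, every common segment of at least two cells is tried, and the `max_rounds` cap is an added safeguard that is not part of the definition.

```python
def merge_all(p1, p2):
    """All merges of p1 and p2 over every common segment of at least two cells."""
    n, m = len(p1), len(p2)
    results = set()
    for h in range(2, min(n, m) + 1):
        for i in range(n - h + 1):
            for j in range(m - h + 1):
                if p1[i:i + h] != p2[j:j + h]:
                    continue
                if i == n - h and j == 0:
                    results.add(p1 + p2[h:])
                elif i == 0 and j == m - h:
                    results.add(p2 + p1[h:])
                else:
                    results.add(p1[:i + h] + p2[j + h:])
                    results.add(p2[:j + h] + p1[i + h:])
    return results

def contains(big, small):
    """True if `small` occurs in `big` as a contiguous run of cells."""
    return any(big[i:i + len(small)] == small
               for i in range(len(big) - len(small) + 1))

def global_pmfcps(local_paths, max_rounds=10):
    closure = set(local_paths)
    for _ in range(max_rounds):  # cap is a safeguard, not part of the definition
        new = {r for a in closure for b in closure for r in merge_all(a, b)} - closure
        if not new:
            break
        closure |= new
    return sorted(p for p in closure
                  if not any(p != q and contains(q, p) for q in closure))

local_u = [tuple("abdgf")]
local_l = [tuple("dgkn"), tuple("dgkl"), tuple("dgjm")]
merged = global_pmfcps(local_u + local_l)
```

For the two partitions of Example 1 the closure stabilises after one round, since the paths overlap only on the segment <d, g> and its extensions.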
29
Experimental results (PMFCPs)
Two synthetic datasets were simulated.
For a GSM network with $N$ cells, the cell structure is arranged in a semi-square shape that contains $\sqrt{N}$ and $\sqrt{N} + 1$ cells at consecutive levels.
For example, the cell structure of a GSM network with 22 cells is shown as follows:
[Figure: cell structure of a GSM network with 22 cells.]
30
Experimental results (PMFCPs)
The cells are labeled in sequential order from 0 to $N-1$ if the GSM network contains $N$ cells.
The starting cell of each mobile phone call is drawn from a uniform distribution $U(0, N-1)$, where 0 is the smallest cell ID and $N-1$ is the largest cell ID.
The next cell of a calling path is selected uniformly from the six neighboring cells.
The length of a calling path is drawn from an exponential distribution with mean T.
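A generator along these lines might look like the sketch below. It is not the simulator used in the experiments: the function name, the rounding of the exponential draw to a whole number of cells, and the tiny four-cell neighbor map are all assumptions made for illustration, and a real run would supply the neighbor map of the semi-square cell layout.

```python
import random

def generate_calling_path(neighbors, mean_length, rng):
    """Draw one synthetic calling path.

    `neighbors` maps each cell ID to the list of its neighboring cell IDs.
    """
    cells = sorted(neighbors)
    # Path length: exponential with the given mean, rounded to at least one cell.
    length = max(1, round(rng.expovariate(1.0 / mean_length)))
    # Starting cell: uniform over all cell IDs.
    path = [rng.choice(cells)]
    # Each next cell: uniform over the neighbors of the current cell.
    while len(path) < length:
        path.append(rng.choice(neighbors[path[-1]]))
    return tuple(path)

# Hypothetical four-cell ring used only to exercise the function.
toy_neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
rng = random.Random(0)
sample = [generate_calling_path(toy_neighbors, mean_length=5.0, rng=rng)
          for _ in range(100)]
```

Passing the random generator in explicitly keeps runs reproducible when a seed is fixed.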
31
Experimental results (PMFCPs)
[Figure: run time (sec.) versus minimum support threshold (%), from 0.06 to 0.36, for Apriori, RevisedApriori, PrefixSpan, and Graph-based, with T = 10.0, C = 1K, D = 100K.]
32
Experimental results (PMFCPs)
[Figure: run time (sec.) versus number of transactions (K), from 100 to 600, for Apriori, RevisedApriori, Graph-based, and PrefixSpan, with T = 5.0, C = 2K, S = 0.05%.]
33
Experimental results (PMFCPs)
[Figure: run time (sec.) versus mean length of calling paths, from 3 to 7, for Apriori, RevisedApriori, Graph-based, and PrefixSpan, with C = 2K, D = 500K, S = 0.05%.]
34
Experimental results (PMFCPs)
[Figure: run time (sec.) versus size of memory (MB), from 0.25 to 4.00, for PrefixSpan and Graph-based, with T = 5.0, C = 4.5K, D = 200M, S = 0.003%.]