
Efficient Data Mining for Calling Path Patterns in GSM Networks

Information Systems, accepted 5 December 2002

SPEAKER: YAO-TE WANG (王耀德)

2

Outline

Introduction

Frequent calling path graph

Graph construction

Mining calling path patterns

Experimental results

Conclusions

3

Mining association rules

Agrawal, Imielinski, and Swami first addressed the issue of mining association rules in 1993.

4

Mining association rules

Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all distinct items, labeled in lexicographic order.

An association rule is written "$A \Rightarrow B$", where A and B are subsets of I. The rule states that if A appears in a transaction, B most likely occurs in the same transaction.

For example, "Bread $\Rightarrow$ Milk" and "Beer $\Rightarrow$ Diaper".

5

Mining association rules

A transaction T in a database supports an itemset S if S is contained in T.

All combinations of items whose fractional transaction support is above a certain threshold, called the minimum support, are termed large itemsets.

6

Mining association rules

The problem of association rule mining can be decomposed into two sub-problems:

1. Find all large itemsets.
2. For a given large itemset, generate all rules.

For every large itemset L, find all non-empty subsets of L. For every such subset A, output a rule of the form "$A \Rightarrow (L - A)$" if the ratio of support(L) to support(A) is at least minconf.
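The rule-generation step can be sketched as follows. This is a minimal illustration, not code from the paper: the function name `generate_rules`, the item names, and the support counts are all hypothetical, and only proper subsets A are used so that L - A is never empty.

```python
from itertools import combinations

def generate_rules(support, minconf):
    """Return (A, L - A, confidence) for every rule that reaches minconf.

    `support` maps each large itemset (a frozenset) to its support count.
    """
    rules = []
    for itemset, sup_l in support.items():
        # Only proper, non-empty subsets A give a non-empty consequent L - A.
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                a = frozenset(subset)
                conf = sup_l / support[a]
                if conf >= minconf:
                    rules.append((a, itemset - a, conf))
    return rules

# Hypothetical support counts, chosen only for illustration.
support = {
    frozenset({"bread"}): 6,
    frozenset({"milk"}): 5,
    frozenset({"bread", "milk"}): 4,
}
rules = generate_rules(support, minconf=0.7)
```

With these made-up counts, only the rule from milk to bread passes the threshold, because 4/5 reaches 0.7 while 4/6 does not.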

7

Mining association rules

Apriori algorithm

The first pass determines the large 1-itemsets. Each subsequent pass k consists of two phases.

First, the large itemsets $L_{k-1}$ are used to generate the candidate itemsets $C_k$.

Next, the database is scanned and the support of the candidates in $C_k$ is counted.

Apriori property: any subset of a large itemset must be large.
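The two phases of each pass can be sketched in a few lines. This is a simplified illustration under stated assumptions, not an optimized implementation: the function name `apriori`, the `min_count` parameter (minimum support expressed as a transaction count), and the sample transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for every large itemset."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Pass 1: determine the large 1-itemsets.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        c = count(frozenset([item]))
        if c >= min_count:
            current[frozenset([item])] = c

    large = {}
    k = 1
    while current:
        large.update(current)
        k += 1
        # Phase 1: join L_{k-1} with itself, then prune candidates that
        # have a (k-1)-subset which is not large (Apriori property).
        joined = {a | b for a in current for b in current if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))]
        # Phase 2: scan the database and count each candidate.
        current = {}
        for c in candidates:
            n = count(c)
            if n >= min_count:
                current[c] = n
    return large

# Hypothetical transactions, chosen only for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"beer", "diaper"},
    {"bread", "milk", "diaper"},
    {"beer", "diaper", "milk"},
]
large = apriori(transactions, min_count=2)
```

In this toy data the candidate {milk, beer, diaper} survives pruning, because all of its 2-subsets are large, but it is rejected in the counting phase since only one transaction contains it.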

8

Association rules

J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers (2001).

Min_sup = 20%

9

Motivation

Traditional methods for mining sequential patterns, such as Apriori-like algorithms, may suffer from two problems:

1. A large number of candidate sequences.
2. Repeated database scans.

10

Motivation

Traditional mining methods cannot extract PMFCPs from the database directly.

[Figure: the calling paths T100 to T600 and the PMFCP drawn over cells a to l, with the paths <b, c, d, e>, <d, e, f>, <e, f, g, h>, and <g, h, i, j>.]

TID   Calling paths
T100  <a, b, c, d, e, f>
T200  <a, b, c, d, e, f, g, h>
T300  <b, c, d, e>
T400  <d, e, f, g, h, i, j, k, l>
T500  <e, f, g, h, i, j>
T600  <g, h, i, j, k>

Let the minimum support be 50%.

11

Our solutions

A novel graph data structure is proposed to hold the information of calling paths. The database is scanned only once.

An efficient mining algorithm based on the proposed graph structure is devised to mine the PMFCPs in GSM networks.

12

Introduction

[Figure: the cell structure of a GSM network.]

13

Introduction

[Figure: the cell structure of a real network.]

GSM: Switching, Services and Protocols, John Wiley & Sons Ltd., Chichester, England (1999).

14

Introduction

A mobile phone user may start a phone call in one cell and then move to other cells during the call.

The sequence of cells visited during the phone call is termed a calling path.

15

Introduction

The support of a calling path P is the fraction of transactions in the calling path database that contain P.

A calling path whose support is not less than the user-specified minimum support is termed a frequent calling path.

A frequent calling path is maximal if it is not contained in any other frequent calling path.
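These definitions can be checked by brute force on the six-path database from the Motivation slide at 50% minimum support. The sketch below is only an illustration: the helper names are hypothetical, and it assumes that a transaction "contains" a path when the path occurs in it as a contiguous run of cells.

```python
def contains(path, sub):
    """True if `sub` occurs in `path` as a contiguous run of cells (assumed)."""
    k = len(sub)
    return any(path[i:i + k] == sub for i in range(len(path) - k + 1))

def maximal_frequent_paths(db, minsup):
    # Every contiguous subpath of every transaction is a candidate.
    candidates = {t[i:j] for t in db
                  for i in range(len(t))
                  for j in range(i + 1, len(t) + 1)}
    frequent = {p for p in candidates
                if sum(contains(t, p) for t in db) / len(db) >= minsup}
    # Keep only frequent paths not contained in another frequent path.
    return {p for p in frequent
            if not any(p != q and contains(q, p) for q in frequent)}

# The database from the Motivation slide (T100 to T600), one cell per letter.
db = [tuple(s) for s in ("abcdef", "abcdefgh", "bcde",
                         "defghijkl", "efghij", "ghijk")]
maximal = maximal_frequent_paths(db, minsup=0.5)
```

Under that containment assumption, the maximal frequent calling paths are exactly <b, c, d, e>, <d, e, f>, <e, f, g, h>, and <g, h, i, j>, each supported by three of the six transactions.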

16

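Problem definition

Let $P_1$ and $P_2$ be calling paths, where $P_1 = \langle v_{1,1}, v_{1,2}, \ldots, v_{1,n} \rangle$ and $P_2 = \langle v_{2,1}, v_{2,2}, \ldots, v_{2,m} \rangle$. If $\langle v_{1,i+1}, \ldots, v_{1,i+h} \rangle = \langle v_{2,j+1}, \ldots, v_{2,j+h} \rangle$, where $h \ge 2$, $0 \le i \le n-h$, and $0 \le j \le m-h$, we define the merge operation as

$$
\mathrm{merge}(P_1, P_2) =
\begin{cases}
\langle v_{1,1}, v_{1,2}, \ldots, v_{1,n}, v_{2,h+1}, \ldots, v_{2,m} \rangle & \text{if } i = n-h \text{ and } j = 0, \\
\langle v_{2,1}, v_{2,2}, \ldots, v_{2,m}, v_{1,h+1}, \ldots, v_{1,n} \rangle & \text{if } i = 0 \text{ and } j = m-h, \\
\{ \langle v_{1,1}, \ldots, v_{1,i+1}, \ldots, v_{1,i+h}, v_{2,j+h+1}, \ldots, v_{2,m} \rangle, \\
\ \ \langle v_{2,1}, \ldots, v_{2,j+1}, \ldots, v_{2,j+h}, v_{1,i+h+1}, \ldots, v_{1,n} \rangle \} & \text{otherwise.}
\end{cases}
$$

For example, with X denoting the common subpath:

P1 = <AX>, P2 = <XB>: merge(P1, P2) = <AXB>

P1 = <XA>, P2 = <BX>: merge(P1, P2) = <BXA>

P1 = <A1XA2>, P2 = <B1XB2>: merge(P1, P2) = {<A1XB2>, <B1XA2>}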
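The three cases of the merge operation translate directly into slicing. The following sketch assumes paths are Python tuples and that the caller supplies the alignment (i, j, h) explicitly; the name `merge_at` is hypothetical, and every case returns a set so the result type is uniform.

```python
def merge_at(p1, p2, i, j, h):
    """Merge p1 and p2 on the common subpath p1[i:i+h] == p2[j:j+h]."""
    n, m = len(p1), len(p2)
    if h < 2 or not (0 <= i <= n - h) or not (0 <= j <= m - h):
        raise ValueError("need h >= 2, 0 <= i <= n-h and 0 <= j <= m-h")
    if p1[i:i + h] != p2[j:j + h]:
        raise ValueError("the two subpaths are not equal")
    if i == n - h and j == 0:
        # The end of p1 overlaps the start of p2.
        return {p1 + p2[h:]}
    if i == 0 and j == m - h:
        # The end of p2 overlaps the start of p1.
        return {p2 + p1[h:]}
    # Otherwise: swap the tails after the common subpath.
    return {p1[:i + h] + p2[j + h:], p2[:j + h] + p1[i + h:]}

# The three slide examples, with X spelled out as the two cells x, y.
case1 = merge_at(("A", "x", "y"), ("x", "y", "B"), i=1, j=0, h=2)
case2 = merge_at(("x", "y", "A"), ("B", "x", "y"), i=0, j=1, h=2)
case3 = merge_at(("A1", "x", "y", "A2"), ("B1", "x", "y", "B2"), i=1, j=1, h=2)
```

Note that Python indices are 0-based, so `p1[i:i + h]` corresponds to the cells with subscripts i+1 through i+h in the definition.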

17

Problem definition

The potential maximal frequent calling paths (PMFCPs) in the database D are defined as $\{P \mid P \in FP^{+} \text{ and } P \text{ is maximal in } FP^{+}\}$, where $FP^{+}$ is the closure of $FP$ under the merge operation and $FP = \{P \mid P \text{ is a maximal frequent calling path in } D\}$.
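As a rough illustration of this definition, the sketch below computes the closure of a small FP under merge and keeps the maximal paths. It is a brute-force reading of the definition, not the mining algorithm of the paper. The helper names and the `max_rounds` safety cap are assumptions (with cyclic paths the closure may keep growing), and containment is assumed to mean a contiguous run of cells. The input is the set of four paths from the Motivation slide, which are the maximal frequent calling paths of that example at 50% support.

```python
def contains(path, sub):
    k = len(sub)
    return any(path[i:i + k] == sub for i in range(len(path) - k + 1))

def merge_all(p1, p2):
    """All merges of p1 and p2 over every common subpath of length h >= 2."""
    n, m = len(p1), len(p2)
    out = set()
    for h in range(2, min(n, m) + 1):
        for i in range(n - h + 1):
            for j in range(m - h + 1):
                if p1[i:i + h] != p2[j:j + h]:
                    continue
                if i == n - h and j == 0:
                    out.add(p1 + p2[h:])
                elif i == 0 and j == m - h:
                    out.add(p2 + p1[h:])
                else:
                    out.add(p1[:i + h] + p2[j + h:])
                    out.add(p2[:j + h] + p1[i + h:])
    return out

def closure(fp, max_rounds=20):
    """Repeat merging until no new path appears (capped for safety)."""
    closed = set(fp)
    for _ in range(max_rounds):
        new = set()
        for a in closed:
            for b in closed:
                new |= merge_all(a, b)
        if new <= closed:
            return closed
        closed |= new
    raise RuntimeError("closure did not stabilise within max_rounds")

def pmfcps(fp):
    closed = closure(fp)
    return {p for p in closed
            if not any(p != q and contains(q, p) for q in closed)}

fp = {tuple("bcde"), tuple("def"), tuple("efgh"), tuple("ghij")}
result = pmfcps(fp)
```

For this input the only maximal path in the closure is <b, c, d, e, f, g, h, i, j>, since each neighbouring pair of paths shares a two-cell subpath.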

18

Frequent calling path graph

A calling path graph is a directed graph containing the information needed to mine PMFCPs from a calling path database.

A calling path graph consists of vertices, out-edges, and in-out paths.

A vertex in the calling path graph represents a cell in the GSM network.

An out-edge of vertex v is an edge starting at v. An in-edge of vertex v is an edge ending at v.

An in-out path of vertex v is a calling path formed by one in-edge of v followed by one out-edge of v.

19

Frequent calling path graph

In a frequent calling path graph G, all out-edges and in-out paths are frequent.

A calling path can be decomposed into an out-edge, or an out-edge plus several in-out paths, from which the corresponding calling path graph can be constructed.

The decomposed out-edge and in-out paths can be merged to regenerate the original calling path.

20

Frequent calling path graph

For example, the calling path <a, b, c, d, e> can be decomposed into the out-edge <a, b> plus three in-out paths: <a, b, c>, <b, c, d>, and <c, d, e>.

Conversely, the out-edge <a, b> and the in-out paths <a, b, c>, <b, c, d>, and <c, d, e> can be merged into <a, b, c, d, e>.
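The decomposition and its inverse can be sketched as below. The function names `decompose` and `recompose` are hypothetical, and paths are assumed to be tuples of at least two cells.

```python
def decompose(path):
    """Split a calling path into its out-edge and its in-out paths."""
    if len(path) < 2:
        raise ValueError("a calling path needs at least two cells")
    out_edge = path[:2]
    # One in-out path per interior cell: (previous, cell, next).
    in_out = [path[k:k + 3] for k in range(len(path) - 2)]
    return out_edge, in_out

def recompose(out_edge, in_out):
    """Merge the out-edge and in-out paths back into one calling path."""
    path = list(out_edge)
    for triple in in_out:
        # Each in-out path must overlap the current path in two cells.
        if tuple(path[-2:]) != tuple(triple[:2]):
            raise ValueError("in-out path does not extend the current path")
        path.append(triple[2])
    return tuple(path)

out_edge, in_out = decompose(("a", "b", "c", "d", "e"))
rebuilt = recompose(out_edge, in_out)
```

A path of length two decomposes into its out-edge alone, which matches the "an out-edge, or an out-edge plus several in-out paths" wording.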

21

Graph construction

The cell structure of the GSM network may need to be divided into several partitions so that the calling path sub-graph of each partition fits in main memory. The mining algorithm is then applied to each sub-graph.

22

Graph construction

[Figure: example of graph partition, showing regions Q1 to Q9 and partition lines 1 to 4.]

23

Graph construction

The graph construction algorithm first examines whether the cell structure of the GSM network is partitioned.

Then the calling paths are retrieved from the database and decomposed into out-edges and in-out paths.

The graph is constructed from these out-edges and in-out paths.

24

Graph construction

[Figure: cell structure with cells a to n.]

TID   Calling paths        TID   Calling paths
T001  <d, g, j, m>         T011  <b, d>
T002  <a, b>               T012  <g, k, l>
T003  <g, k, n, l>         T013  <b, d, g>
T004  <a, b, d>            T014  <d, g, j, m>
T005  <b, d, g>            T015  <g, k, n>
T006  <l, n, m>            T016  <b, d, g, k, l>
T007  <d, g, f>            T017  <a, c>
T008  <g, k>               T018  <d, c, f>
T009  <d, g, k, n>         T019  <m, n, l>
T010  <a, b, d, h, l>      T020  <b, d, g, f, j>

Example 1: Let the minimum support be 10%.

25

Graph construction

v   Out-edges              In-out paths
a   <a, b>:3, <a, c>:1     -
b   <b, d>:5               <a, b, d>:2
c   -                      <d, c, f>:1
d   <d, c>:1, <d, g>:4     <b, d, g>:4, <b, d, h>:1
f   -                      <g, f, j>:1
g   <g, k>:4               <d, g, f>:2, <d, g, j>:2, <d, g, k>:2
h   -                      <d, h, l>:1
j   -                      <g, j, m>:2
k   -                      <g, k, l>:2, <g, k, n>:3
l   <l, n>:1               -
m   <m, n>:1               -
n   -                      <k, n, l>:1, <l, n, m>:1, <m, n, l>:1

[Figure: the frequent calling path graph, with vertices a, b, d, f, g, j, k, l, m, and n.]
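The counts for Example 1 can be reproduced in a single scan of the 20 calling paths. The sketch below is an illustration under assumptions, not the authors' data structure: each path is a string with one letter per cell, an out-edge is counted only for the first two cells of a path, and the 10% minimum support is expressed as a count of 2 out of 20 transactions.

```python
from collections import Counter

# Calling paths T001 to T020 of Example 1, one letter per cell.
DB = ["dgjm", "ab", "gknl", "abd", "bdg", "lnm", "dgf", "gk", "dgkn", "abdhl",
      "bd", "gkl", "bdg", "dgjm", "gkn", "bdgkl", "ac", "dcf", "mnl", "bdgfj"]

out_edges = Counter()
in_out = Counter()
for path in DB:                      # the database is scanned only once
    out_edges[path[:2]] += 1         # out-edge of the starting cell
    for k in range(len(path) - 2):
        in_out[path[k:k + 3]] += 1   # in-out path of the middle cell

MIN_COUNT = 2                        # 10% of 20 transactions
freq_out = {e for e, c in out_edges.items() if c >= MIN_COUNT}
freq_in_out = {p for p, c in in_out.items() if c >= MIN_COUNT}
```

Filtering by `MIN_COUNT` leaves four frequent out-edges and eight frequent in-out paths, which together touch only the cells a, b, d, f, g, j, k, l, m, and n.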

26

Mining PMFCPs

The algorithm for mining PMFCPs is based on depth-first search, which is one of the natural ways to visit the vertices of a graph systematically.

First, all local PMFCPs are found in each sub-graph. Then the local PMFCPs extracted from the sub-graphs are merged into global PMFCPs.

27

Mining PMFCPs

Example 1 (cont.):

Path(1): <a, b, d, g, k, n>

Path(2): <a, b, d, g, k, l>

Path(3): <a, b, d, g, j, m>

Path(4): <a, b, d, g, f>

[Figure: the frequent calling path graph, annotated with "Copied" and "Appended" labels on the paths.]
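A depth-first traversal over the frequent out-edges and in-out paths of Example 1 (those with a count of at least 2) can be sketched as follows. This is a simplified reading of the approach, not the authors' exact procedure: the function name `mine_paths` is hypothetical, each cell is one letter so that a path is a string, and an in-out path is never reused within one path so that cycles cannot loop forever.

```python
FREQ_OUT = {"ab", "bd", "dg", "gk"}
FREQ_IN_OUT = {"abd", "bdg", "dgf", "dgj", "dgk", "gjm", "gkl", "gkn"}

def mine_paths(freq_out, freq_in_out):
    found = set()

    def dfs(path, used):
        # An in-out path extends the current path if its first two cells
        # equal the last two cells of the path.
        extensions = [t for t in freq_in_out
                      if t[:2] == path[-2:] and t not in used]
        if not extensions:
            found.add(path)          # dead end: record the finished path
            return
        for t in extensions:
            dfs(path + t[2], used | {t})

    for edge in freq_out:
        dfs(edge, frozenset())
    # Keep only paths not contained in another found path
    # (substring test, valid because each cell is one letter).
    return {p for p in found if not any(p != q and p in q for q in found)}

paths = mine_paths(FREQ_OUT, FREQ_IN_OUT)
```

On this input the search returns the four paths listed above. Paths that start from <b, d>, <d, g>, or <g, k> are all contained in the ones that start from <a, b>, so the final filter removes them.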

28

Mining PMFCPs

Suppose the cell structure of the GSM network is divided into two partitions, PartitionU and PartitionL.

[Figure: (a) Original cell structure with cells a to n and the partition line. (b) PartitionU. (c) PartitionL.]

Local PMFCPs in PartitionU: <a, b, d, g, f>

Local PMFCPs in PartitionL: <d, g, k, n>, <d, g, k, l>, <d, g, j, m>

Global PMFCPs: <a, b, d, g, k, n>, <a, b, d, g, k, l>, <a, b, d, g, j, m>, <a, b, d, g, f>

29

Experimental results (PMFCPs)

Two synthetic datasets were simulated.

For a GSM network with N cells, the cell structure is arranged in a semi-square shape that contains $\sqrt{N}$ and $\sqrt{N}+1$ cells at consecutive levels.

For example, the cell structure of a GSM network with 22 cells is shown as follows:

[Figure: cell structure of a GSM network with 22 cells.]

30

Experimental results (PMFCPs)

The cells are labeled in sequential order from 0 to N-1 if the GSM network contains N cells.

The starting cell of each mobile phone call is drawn from a uniform distribution over the cell IDs, ranging from the smallest cell ID to the largest cell ID.

The next cell of a calling path is selected uniformly from the six neighboring cells.

The length of a calling path is drawn from an exponential distribution with a specified mean.

31

Experimental results (PMFCPs)

[Figure: run time (sec.) versus minimum support threshold (%), from 0.06 to 0.36, for Apriori, RevisedApriori, PrefixSpan, and Graph-based, with T = 10.0, C = 1K, D = 100K.]

32


[Figure: run time (sec.) versus number of transactions (K), from 100 to 600, for Apriori, RevisedApriori, Graph-based, and PrefixSpan, with T = 5.0, C = 2K, S = 0.05%.]

33


[Figure: run time (sec.) versus mean length of calling paths, from 3 to 7, for Apriori, RevisedApriori, Graph-based, and PrefixSpan, with C = 2K, D = 500K, S = 0.05%.]

34


[Figure: run time (sec.) versus size of memory (MB), from 0.25 to 4.00, for PrefixSpan and Graph-based, with T = 5.0, C = 4.5K, D = 200M, S = 0.003%.]

35

Conclusions

The interesting issue of mining calling path patterns in GSM networks is addressed.

A new concept of interesting and effective patterns (PMFCPs) is derived from the calling path patterns.

The PMFCPs can be mined efficiently by using our proposed graph structure.
