Learning Sets of Rules
2004.12, edited by Wang Yinglin, Shanghai Jiaotong University


Page 1

Learning Sets of Rules

2004.12

Edited by Wang Yinglin

Shanghai Jiaotong University

Page 2

Note: library resources

www.lib.sjtu.edu.cn

Electronic databases in the library collection:

Kluwer

Elsevier

IEEE

….

China Journal Net, Wanfang Data, …

Page 3

Introduction

Sets of "if-then" rules: the hypothesis is easy to interpret.

Goal: look at new methods for learning rules.

Rules:

Propositional rules (rules without variables)

First-order predicate rules (rules with variables)

Page 4

Introduction

So far:
Method 1: learn a decision tree, then convert it to rules
Method 2: genetic algorithm, encoding the rule set as a bit string

From now on, a new method: learning first-order rules using sequential covering.

First-order rules are difficult to represent using a decision tree or other propositional representation, e.g.:

IF Parent(x, y) THEN Ancestor(x, y)

IF Parent(x, z) AND Ancestor(z, y) THEN Ancestor(x, y)

Page 5

Introduction

Contents

Introduction

Sequential Covering Algorithms

First-Order Rules

First-Order Inductive Learning (FOIL)

Induction as Inverted Deduction

Summary

Page 6

Sequential Covering Algorithms

Algorithm:

1. Learn one rule that covers a certain number of positive examples.

2. Remove the examples covered by that rule.

3. Repeat until no positive examples are left.

Page 7

Sequential Covering Algorithms

Each rule is required to have high accuracy, but not necessarily high coverage.

High accuracy: whatever the rule predicts should be correct.

Accepting low coverage: the rule need not make a prediction for every training example.

Page 8

Sequential Covering Algorithms

SEQUENTIAL-COVERING(Target_attribute, Attributes, Examples, Threshold)

Learned_rules ← {}

Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples)

WHILE PERFORMANCE(Rule, Examples) > Threshold DO

  Learned_rules ← Learned_rules + Rule    // add the new rule to the set

  Examples ← Examples - {examples correctly classified by Rule}

  Rule ← LEARN-ONE-RULE(Target_attribute, Attributes, Examples)

Learned_rules ← SORT-BY-PERFORMANCE(Learned_rules, Target_attribute, Examples)

RETURN Learned_rules
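
As a concrete illustration, here is a minimal Python sketch of the loop above. It is a sketch under assumptions: the learn_one_rule and performance callables, and the rule's correctly_classifies method, are hypothetical interfaces supplied by the caller, not something defined in these slides.

def sequential_covering(target_attr, attributes, examples,
                        learn_one_rule, performance, threshold):
    # Greedy covering loop: learn a rule, drop what it covers, repeat.
    learned_rules = []
    rule = learn_one_rule(target_attr, attributes, examples)
    while performance(rule, examples) > threshold:
        learned_rules.append(rule)
        # Keep only the examples the rule does NOT correctly classify.
        examples = [ex for ex in examples
                    if not rule.correctly_classifies(ex)]
        rule = learn_one_rule(target_attr, attributes, examples)
    # Consult the most accurate rules first at classification time.
    learned_rules.sort(key=lambda r: performance(r, examples), reverse=True)
    return learned_rules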

Page 9

Sequential Covering Algorithms

One of the most widespread approaches to learning disjunctive sets of rules.

It reduces the problem of learning a disjunctive set of rules to a sequence of single-rule problems (each rule: a conjunction of attribute-value tests).

It performs a greedy search (no backtracking); as such, it may not find an optimal rule set.

It sequentially covers the set of positive examples until the performance of a rule falls below a threshold.

Page 10

Sequential Covering Algorithms : General to Specific Beam Search

How do we learn each individual rule?

Requirements for LEARN-ONE-RULE: high accuracy, but not necessarily high coverage.

One approach is to proceed as in decision tree learning (ID3), BUT following only the single branch with the best performance at each search step.

Page 11

Sequential Covering Algorithms : General to Specific Beam Search

Search tree (the most general rule at the top; each step adds one attribute test):

IF {} THEN Play-Tennis = Yes

First-level specializations:
IF {Humidity = Normal} THEN Play-Tennis = Yes
IF {Humidity = High} THEN Play-Tennis = No
IF {Wind = Strong} THEN Play-Tennis = No
IF {Wind = Light} THEN Play-Tennis = Yes

Specializations of IF {Humidity = Normal}:
IF {Humidity = Normal, Outlook = Sunny} THEN Play-Tennis = Yes
IF {Humidity = Normal, Outlook = Rain} THEN Play-Tennis = Yes
IF {Humidity = Normal, Wind = Strong} THEN Play-Tennis = Yes
IF {Humidity = Normal, Wind = Light} THEN Play-Tennis = Yes

Page 12

• Idea: organize the hypothesis-space search in a general-to-specific fashion.

• Start with the most general rule precondition, then greedily add the attribute test that most improves performance measured over the training examples.

Sequential Covering Algorithms : General to Specific Beam Search

Page 13

Sequential Covering Algorithms : General to Specific Beam Search

Greedy search without backtracking: there is a danger of a suboptimal choice at any step.

The algorithm can be extended to a beam search: keep a list of the k best candidates at each step.

On each search step, descendants are generated for each of these k best candidates, and the resulting set is again reduced to the k best candidates.

Page 14

Sequential Covering Algorithms : General to Specific Beam Search

LEARN-ONE-RULE(Target_attribute, Attributes, Examples, k)

Best_hypothesis ← Ø (the most general hypothesis)
Candidate_hypotheses ← {Best_hypothesis}
WHILE Candidate_hypotheses is not empty DO
  1. Generate the next, more specific candidate hypotheses.
  2. Update Best_hypothesis:
     FOR ALL h in New_candidate_hypotheses:
       IF Performance(h) > Performance(Best_hypothesis) THEN Best_hypothesis ← h
  3. Update Candidate_hypotheses ← the best k members of New_candidate_hypotheses.
RETURN the rule "IF Best_hypothesis THEN prediction"
  (prediction: the most frequent value of Target_attribute among the examples covered by Best_hypothesis)

PERFORMANCE(h, Examples, Target_attribute)
  h_examples ← the subset of Examples covered by h
  RETURN -Entropy(h_examples)
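
To make the beam search concrete, here is a minimal Python sketch. The representation is an assumption made for illustration: a hypothesis is a dict of attribute = value constraints, an example is a dict of attribute values, and duplicate candidates are not pruned.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of target values.
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def covers(h, example):
    # An example is covered if it satisfies every (attribute = value) constraint.
    return all(example[a] == v for a, v in h.items())

def performance(h, examples, target_attr):
    covered = [ex[target_attr] for ex in examples if covers(h, ex)]
    return -entropy(covered) if covered else float("-inf")

def learn_one_rule(target_attr, attributes, examples, k):
    best = {}                      # most general hypothesis: empty precondition
    candidates = [best]
    while candidates:
        # 1. Generate the next, more specific candidate hypotheses.
        new_candidates = [{**h, a: v}
                          for h in candidates
                          for a in attributes if a not in h
                          for v in {ex[a] for ex in examples}]
        # 2. Update the best hypothesis.
        for h in new_candidates:
            if performance(h, examples, target_attr) > \
               performance(best, examples, target_attr):
                best = h
        # 3. Keep only the k best candidates.
        new_candidates.sort(key=lambda h: performance(h, examples, target_attr),
                            reverse=True)
        candidates = new_candidates[:k]
    return best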

Page 15

Generate the next, more specific candidate hypotheses:

• All_constraints ← the set of all constraints of the form (a = v), where a is a member of Attributes and v is a value of a that occurs in the current Examples set.

• New_candidate_hypotheses ←
    for each h in Candidate_hypotheses, and
    for each c in All_constraints:
      create a specialization of h by adding the constraint c.

• Remove from New_candidate_hypotheses any hypotheses that are duplicates, inconsistent, or not maximally specific.

Sequential Covering Algorithms : General to Specific Beam Search

Page 16

Day Outlook Temp Humid Wind PlayTennis

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes

D5 Rain Cool Low Weak Yes

D6 Rain Cool Low Strong No

D7 Overcast Cool Low Strong Yes

D8 Sunny Mild High Weak No

D9 Sunny Cool Low Weak Yes

D10 Rain Mild Low Weak Yes

D11 Sunny Mild Low Strong Yes

D12 Overcast Mild High Strong Yes

D13 Overcast Hot Low Weak Yes

D14 Rain Mild High Strong No

1. best-hypothesis = IF T THEN PlayTennis(x) = Yes

Learn-One-Rule Example

Page 17

(Training data: the same PlayTennis table as on page 16.)

1. best-hypothesis = IF T THEN PlayTennis(x) = Yes

2. candidate-hypotheses = {best-hypothesis}

Learn-One-Rule Example

Page 18

(Training data: the same PlayTennis table as on page 16.)

1. best-hypothesis = IF T THEN PlayTennis(x) = Yes

2. candidate-hypotheses = {best-hypothesis}

3. all-constraints = {Outlook(x)=Sunny, Outlook(x)=Overcast, Temp(x)=Hot, ......}

Learn-One-Rule Example

Page 19

(Training data: the same PlayTennis table as on page 16.)

1. best-hypothesis = IF T THEN PlayTennis(x) = Yes

2. candidate-hypotheses = {best-hypothesis}

3. all-constraints = {Outlook(x)=Sunny, Outlook(x)=Overcast, Temp(x)=Hot, ......}

4. new-candidate-hypotheses = {IF Outlook=Sunny THEN PlayTennis=YES, IF Outlook=Overcast THEN PlayTennis=YES, ...}

Learn-One-Rule Example

Page 20

(Training data: the same PlayTennis table as on page 16.)

1. best-hypothesis = IF T THEN PlayTennis(x) = Yes

2. candidate-hypotheses = {best-hypothesis}

3. all-constraints = {Outlook(x)=Sunny, Outlook(x)=Overcast, Temp(x)=Hot, ......}

4. new-candidate-hypotheses = {IF Outlook=Sunny THEN PlayTennis=YES, IF Outlook=Overcast THEN PlayTennis=YES, ...}

5. best-hypothesis = IF Outlook=Sunny THEN PlayTennis=YES

Learn-One-Rule Example

Page 21

Sequential Covering Algorithms : Variations

Learn rules that cover only positive examples: when the fraction of positive examples is small, the algorithm can be modified to learn only from those rare examples. Instead of entropy, use a measure that evaluates the fraction of positive examples covered by the hypothesis.

AQ algorithm:

A different covering algorithm: it searches rule sets for one particular target value at a time.

A different single-rule algorithm: the search is guided by single uncovered positive examples, and only attribute values satisfied by that example are used.

Page 22

Summary : Points for Consideration

Key design issues for learning sets of rules:

Sequential or simultaneous? Sequential: isolate components of the hypothesis, learning one rule at a time. Simultaneous: learn the whole hypothesis at once.

General-to-specific or specific-to-general? GS: learn-one-rule. SG: Find-S.

Generate-and-test or example-driven? G&T: search through syntactically legal hypotheses. E-D: Find-S, Candidate-Elimination.

Post-pruning of rules? A popular method for recovering from overfitting.

Page 23

Summary : Points for Consideration

What statistical evaluation method?

Relative frequency:

nc / n (n: examples matched by the rule; nc: examples classified correctly by the rule)

m-estimate of accuracy:

(nc + m·p) / (n + m)

p: the prior probability that a randomly drawn example has the classification assigned by the rule

m: the weight (the equivalent number of examples for weighting this prior)

Entropy:

Entropy(S) = -Σ(i=1..c) pi log2 pi
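
For reference, here are the three evaluation functions written out in Python, a direct transcription of the formulas above:

import math

def relative_frequency(n_c, n):
    # n: examples matched by the rule; n_c: those classified correctly.
    return n_c / n

def m_estimate(n_c, n, p, m):
    # m-estimate of accuracy with prior p and equivalent sample size m.
    return (n_c + m * p) / (n + m)

def entropy(proportions):
    # Entropy given the class proportions p_1 .. p_c of the covered examples.
    return -sum(p * math.log2(p) for p in proportions if p > 0)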

Page 24

Learning first-order rules

From now on, we consider learning rules that contain variables (first-order rules).

Inductive learning of first-order rules is known as inductive logic programming (ILP); it can be viewed as automatically inferring Prolog programs.

Two methods are considered:

FOIL

Induction as inverted deduction

Page 25

Learning first-order rules

First-order rules: rules that contain variables.

Example:

Ancestor(x, y) ← Parent(x, y)

Ancestor(x, y) ← Parent(x, z) ∧ Ancestor(z, y)   (recursive)

First-order rules are more expressive than propositional rules. Compare:

IF (Father1 = Bob) ∧ (Name2 = Bob) ∧ (Female1 = True) THEN Daughter1,2 = True

IF Father(y, x) ∧ Female(y) THEN Daughter(x, y)

Page 26

Learning first-order rules : Terminology

Constants: e.g., John, Kansas, 42

Variables: e.g., Name, State, x

Predicates: e.g., Father-Of, Greater-Than

Functions: e.g., age, cosine

Term: a constant, a variable, or function(term)

Literals (atoms): Predicate(term) or its negation, e.g., Greater-Than(age(John), 42), Female(Mary), ¬Female(x)

Clause: a disjunction of literals with implicit universal quantification, e.g., ∀x: Female(x) ∨ Male(x)

Horn clause: a clause with at most one positive literal:

H ∨ ¬L1 ∨ ¬L2 ∨ … ∨ ¬Ln

Page 27

Learning first-order rules :Terminology

First-order Horn clauses: rules that have one or more preconditions and a single consequent; the predicates may contain variables.

The following forms of a Horn clause are equivalent:

H ∨ ¬L1 ∨ … ∨ ¬Ln

H ← (L1 ∧ … ∧ Ln)

IF L1 ∧ … ∧ Ln THEN H

Page 28

Learning first-order rules :First-Order Inductive Learning (FOIL)

A natural extension of "sequential covering + learn-one-rule".

FOIL rules are similar to Horn clauses, with two exceptions:

Syntactic restriction: no function symbols are allowed.

More expressive than Horn clauses: negated literals are allowed in rule bodies.

Page 29

Learning first-order rules : First-Order Inductive Learning (FOIL)

FOIL(Target_predicate, Predicates, Examples)

Pos ← those Examples for which Target_predicate is True
Neg ← those Examples for which Target_predicate is False

Learned_rules ← {}

WHILE Pos is not empty DO

  NewRule ← the rule that predicts Target_predicate with no preconditions
  NewRuleNeg ← Neg

  WHILE NewRuleNeg is not empty DO

    Candidate_literals ← candidate new literals for NewRule
    Best_literal ← argmax over L of Foil_Gain(L, NewRule)
    Add Best_literal to the preconditions of NewRule
    NewRuleNeg ← the subset of NewRuleNeg that satisfies NewRule's preconditions

  Learned_rules ← Learned_rules + NewRule
  Pos ← Pos - {members of Pos covered by NewRule}

RETURN Learned_rules

Page 30

Learning first-order rules :First-Order Inductive Learning (FOIL)

FOIL learns rules only for the case where the target literal is True (cf. the propositional sequential covering, which learns rules for both True and False values).

Outer loop: add a new rule to the disjunctive hypothesis; each new rule generalizes the current hypothesis (a specific-to-general search).

Inner loop: find the best conjunction of preconditions; a general-to-specific search on each rule, starting with an empty precondition and adding one literal at a time (hill climbing).

Cf. the earlier learn-one-rule, which performs a beam search.

Page 31

FOIL: Generating Candidate Specializations in FOIL

Generate new literals, each of which may be added to the rule's preconditions.

Current rule: P(x1, x2, …, xk) ← L1 ∧ … ∧ Ln

Add a new literal Ln+1 to get a more specific Horn clause.

Forms of the new literal:

Q(v1, v2, …, vr), where Q is a predicate in Predicates and at least one vi is a variable already present in the rule

Equal(xj, xk), where xj and xk are variables already present in the rule

The negation of either of the above

Page 32

FOIL : Guiding the Search in FOIL

Consider all possible bindings (substitutions of variables): prefer rules that possess more positive bindings.

Foil_Gain(L, R), where:

L: the candidate literal to add to rule R

p0: number of positive bindings of R

n0: number of negative bindings of R

p1: number of positive bindings of R + L

n1: number of negative bindings of R + L

t: number of positive bindings of R that are still covered by R + L

Foil_Gain(L, R) = t · ( log2( p1 / (p1 + n1) ) - log2( p0 / (p0 + n0) ) )

It is based on the numbers of positive and negative bindings covered before and after adding the new literal.
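
As a sanity check, the gain formula above transcribes directly into Python; the binding counts are assumed to be supplied by the caller:

import math

def foil_gain(t, p0, n0, p1, n1):
    # t: positive bindings of R still covered after adding L
    # p0, n0: positive / negative bindings of R
    # p1, n1: positive / negative bindings of R + L
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))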

Page 33

FOIL : Examples

Example

Target literal: GrandDaughter(x, y)

Training examples:
GrandDaughter(Victor, Sharon)
Father(Sharon, Bob)   Father(Tom, Bob)
Female(Sharon)        Father(Bob, Victor)

Initial step: GrandDaughter(x, y) ←

Positive binding: {x/Victor, y/Sharon}

Negative bindings: all others

Page 34

FOIL : Examples

Candidate additions to the rule preconditions: Equal(x, y), Female(x), Female(y), Father(x, y), Father(y, x), Father(x, z), Father(z, x), Father(y, z), Father(z, y), and their negations.

For each candidate, calculate Foil_Gain.

If Father(y, z) has the maximum value of Foil_Gain, select it and add it to the preconditions of the rule:
GrandDaughter(x, y) ← Father(y, z)

Iterating, we add the best candidate literal and continue adding literals until we generate a rule like the following:

GrandDaughter(x, y) ← Father(y, z) ∧ Father(z, x) ∧ Female(y)

At this point we remove all positive examples covered by the rule and begin the search for a new rule.

Page 35

FOIL : Learning recursive rules sets

Predicate occurs in rule head.

Example

Ancestor (x, y) Parent (x, z) Ancestor (z, y).

Rule: IF Parent (x, z) Ancestor (z, y) THEN Ancestor (x, y)

Learning recursive rule from relation

Given: appropriate set of training examples

Can learn using FOIL-based search

Requirement: Ancestor Predicates

Recursive rules still have to outscore competing candidates at FOIL-Gain

How to ensure termination? (i.e. no infinite recursion)

[Quinlan, 1990; Cameron-Jones and Quinlan, 1993]

Page 36

Induction: inference from the specific to the general.

Deduction: inference from the general to the specific.

Induction can be cast as a deduction problem:

(∀ <xi, f(xi)> ∈ D) (B ∧ h ∧ xi) ⊢ f(xi)

D: a set of training data

B: background knowledge

xi: the i-th training instance

f(xi): the target value

X ⊢ Y: "Y follows deductively from X", or "X entails Y"

For every training instance xi, the target value f(xi) must follow deductively from B, h, and xi.

Page 37

Induction as inverted deduction

Learn target : Child(u,v) : child of u is v

Positive example : Child(Bob, Sharon)

Given instance: Male(Bob), Female(Sharon), Father(Sharon,Bob)

Background knowledge : Parent(u,v) Father(u,v)

Hypothesis satisfying the (B h x∧ ∧ i)├ f(xi)

h1 : Child(u, v) ←Father(v, u) : no need of B

h2 : Child(u, v) ←Parent(v, u) : need B

The role of Background KnowledgeExpanding the set of hypotheses

New predicates (Parent) can be introduced into hypotheses(h2)

Page 38

Induction as inverted deduction

In view of induction as the inverse of deduction

Inverse entailment operators is required

O(B, D) = h such that ( < x∀ i, f(xi) > D) (B h x∈ ∧ ∧ i)├ f(xi)

Input : training data D = {< xi, f(xi)>}

background knowledge B

Output : a hypothesis h

Page 39

Induction as inverted deduction

Attractive features of this formulation of the learning task:

1. This formulation subsumes the common definition of learning.

2. By incorporating the notion of background knowledge B, it allows a richer definition of when a hypothesis fits the data.

3. By incorporating B, it invites learning methods that use B to guide the search for h.

Page 40

Induction as inverted deduction

Practical difficulties of this formulation:

1. Its requirements do not naturally accommodate noisy training data.

2. The language of first-order logic is so expressive that the number of hypotheses satisfying the formulation is very large.

3. In most ILP systems, the complexity of the hypothesis-space search increases as B is increased.

Page 41

Inverting Resolution

Resolution rule:

P ∨ L
¬L ∨ R
──────
P ∨ R     (L: a literal; P, R: clauses)

Resolution operator (propositional form):

Given initial clauses C1 and C2, find a literal L from clause C1 such that ¬L occurs in clause C2.

Form the resolvent C by including all literals from C1 and C2, except for L and ¬L. More precisely, the set of literals occurring in the conclusion C is:

C = (C1 - {L}) ∪ (C2 - {¬L})
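
A minimal Python sketch of this operator, under an assumed representation: a clause is a frozenset of string literals, with a leading "~" marking negation.

def negate(literal):
    return literal[1:] if literal.startswith("~") else "~" + literal

def resolve(c1, c2):
    # All resolvents of clauses c1 and c2: drop L and ~L, union the rest.
    resolvents = []
    for lit in c1:
        if negate(lit) in c2:
            resolvents.append((c1 - {lit}) | (c2 - {negate(lit)}))
    return resolvents

c1 = frozenset({"PassExam", "~KnowMaterial"})
c2 = frozenset({"KnowMaterial", "~Study"})
print(resolve(c1, c2))   # [frozenset({'PassExam', '~Study'})] (order may vary)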

Page 42

Inverting Resolution

Example 1:

C1: PassExam ∨ ¬KnowMaterial
C2: KnowMaterial ∨ ¬Study
C:  PassExam ∨ ¬Study

Example 2:

C1: A ∨ B ∨ C ∨ ¬D
C2: ¬B ∨ E ∨ F
C:  A ∨ C ∨ ¬D ∨ E ∨ F

Page 43

Inverting Resolution

O(C, C1): performs inductive inference.

Inverse resolution operator (propositional form):

Given initial clauses C1 and C, find a literal L that occurs in clause C1, but not in clause C.

Form the second clause C2 by including the following literals:

C2 = (C - (C1 - {L})) ∪ {¬L}
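
A matching sketch of the inverse operator, reusing the clause representation and the negate helper from the resolution sketch above; it returns one candidate per choice of L, reflecting the nondeterminism discussed on the next page.

def inverse_resolve(c, c1):
    # Given resolvent c and clause c1, propose candidate clauses c2.
    candidates = []
    for lit in c1:
        if lit not in c:
            candidates.append((c - (c1 - {lit})) | {negate(lit)})
    return candidates

c  = frozenset({"PassExam", "~Study"})
c1 = frozenset({"PassExam", "~KnowMaterial"})
print(inverse_resolve(c, c1))   # [frozenset({'KnowMaterial', '~Study'})]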

Page 44

Inverting Resolution

Example 1:

C1: PassExam ∨ ¬KnowMaterial
C2: KnowMaterial ∨ ¬Study
C:  PassExam ∨ ¬Study

Example 2:

C1: B ∨ D,  C: A ∨ B
C2: A ∨ ¬D   (but why not C2: A ∨ ¬D ∨ B ?)

Inverse resolution is nondeterministic.

Page 45

Inverting Resolution : First-Order Resolution

First-order resolution:

Substitution: a mapping of variables to terms.
Example: θ = {x/Bob, z/y}

Unifying substitution: for two literals L1 and L2, a substitution θ such that L1θ = L2θ.
Example: θ = {x/Bill, z/y}
L1 = Father(x, y), L2 = Father(Bill, z)
L1θ = L2θ = Father(Bill, y)
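
A toy illustration of applying a substitution, under an assumed representation: a literal is a predicate name plus a tuple of argument strings, and lowercase names are variables.

def apply_subst(literal, theta):
    # Apply substitution theta (a dict mapping variable -> term) to a literal.
    name, args = literal
    return (name, tuple(theta.get(a, a) for a in args))

L1 = ("Father", ("x", "y"))
L2 = ("Father", ("Bill", "z"))
theta = {"x": "Bill", "z": "y"}
print(apply_subst(L1, theta))   # ('Father', ('Bill', 'y'))
print(apply_subst(L2, theta))   # ('Father', ('Bill', 'y'))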

Page 46

Inverting Resolution : First-Order Resolution

Resolution operator (first-order form):

Find a literal L1 from clause C1, a literal L2 from clause C2, and a substitution θ such that L1θ = ¬L2θ.

Form the resolvent C by including all literals from C1θ and C2θ, except for L1θ and ¬L2θ. More precisely, the set of literals occurring in the conclusion C is:

C = (C1 - {L1})θ ∪ (C2 - {L2})θ

Page 47

Inverting Resolution : First-Order Resolution

Example: C1 = White(x) ← Swan(x), C2 = Swan(Fred)

C1 = White(x) ∨ ¬Swan(x)

L1 = ¬Swan(x), L2 = Swan(Fred)

Unifying substitution: θ = {x/Fred},

then L1θ = ¬L2θ = ¬Swan(Fred)

(C1 - {L1})θ = White(Fred)

(C2 - {L2})θ = Ø

∴ C = White(Fred)

Page 48

Inverting Resolution : First-order case

Inverse resolution, first-order case:

C = (C1 - {L1})θ1 ∪ (C2 - {L2})θ2

(where θ = θ1θ2, a factorization)

C - (C1 - {L1})θ1 = (C2 - {L2})θ2

(where L2 = ¬L1θ1θ2⁻¹)

∴ C2 = (C - (C1 - {L1})θ1)θ2⁻¹ ∪ {¬L1θ1θ2⁻¹}

Page 49

Inverting Resolution : First-order case

Multistep inverse resolution (the forward resolution steps, which inverse resolution runs backwards):

Father(Tom, Bob) resolves with GrandChild(y, x) ∨ ¬Father(x, z) ∨ ¬Father(z, y) under {Bob/y, Tom/z}, giving:

GrandChild(Bob, x) ∨ ¬Father(x, Tom)

Father(Shannon, Tom) resolves with GrandChild(Bob, x) ∨ ¬Father(x, Tom) under {Shannon/x}, giving:

GrandChild(Bob, Shannon)

Page 50

Inverting Resolution

C = GrandChild(Bob, Shannon)

C1 = Father(Shannon, Tom)

L1 = Father(Shannon, Tom)

Suppose we choose the inverse substitutions θ1⁻¹ = {} and θ2⁻¹ = {Shannon/x}.

(C - (C1 - {L1})θ1)θ2⁻¹ = (Cθ1)θ2⁻¹ = GrandChild(Bob, x)

{¬L1θ1θ2⁻¹} = {¬Father(x, Tom)}

∴ C2 = GrandChild(Bob, x) ∨ ¬Father(x, Tom),
or equivalently GrandChild(Bob, x) ← Father(x, Tom)

Page 51

Inverse resolution: summary

Inverse resolution provides a general way to automatically generate hypotheses h that satisfy (∀ <xi, f(xi)> ∈ D) (B ∧ h ∧ xi) ⊢ f(xi).

Differences from FOIL:

Inverse resolution is example-driven; FOIL is generate-and-test.

Inverse resolution considers only a small fraction of the available data at each step; FOIL considers all of the available data.

A search based on inverse resolution appears more focused and therefore more efficient, but in practice this is not necessarily so.

Page 52

Summary

Learning rules from data.

Sequential covering algorithms learn a disjunctive set of rules by first learning a single accurate rule, then removing the positive examples covered by this rule, and repeating the process over the remaining examples. They provide an efficient, greedy algorithm for learning rule sets, and an alternative to top-down decision tree algorithms; decision tree learning can be viewed as simultaneous (parallel) covering, in contrast to sequential covering.

Learning single rules by search: beam search.

Alternative covering methods: these methods differ in the strategy by which they examine the space of rule preconditions.

Learning rule sets.

First-order rules:

Learning single first-order rules; representation: first-order Horn clauses.

Extending Sequential-Covering and Learn-One-Rule to allow variables in rule preconditions: one approach extends the sequential covering algorithm of CN2 from propositional to first-order representations.

Page 53

Summary

FOIL: learning first-order rule sets, including simple recursive rule sets.

Idea: induce logical rules from observed relations.

Guiding the search in FOIL.

Learning recursive rule sets.

Induction as inverted deduction: another approach to learning first-order rule sets, which searches for hypotheses by applying inverse operators of well-known deductive inference rules.

Idea: induce logical rules as inverted deduction:

O(B, D) = h such that (∀ <xi, f(xi)> ∈ D) (B ∧ h ∧ xi) ⊢ f(xi)

Inverse resolution inverts the resolution operator; resolution is an inference rule widely used in automated theorem proving.

Page 54

End-of-chapter exercises