Semi-Supervised Learning
Mining the Web, Chakrabarti & Ramakrishnan


Page 1:

Semi-Supervised learning

Page 2:

Need for an intermediate approach

Unsupervised and supervised learning: two extreme learning paradigms

• Unsupervised learning: a collection of documents without any labels; easy to collect

• Supervised learning: each object tagged with a class; a laborious job

Semi-supervised learning: real-life applications are somewhere in between.

Page 3:

Motivation

• Document collection D; a subset D^K ⊂ D (with |D^K| << |D|) has known labels.
• Goal: label the rest of the collection, D \ D^K.

Approach

• Train a supervised learner using D^K, the labeled subset.
• Apply the trained learner on the remaining documents D \ D^K.

Idea

• Harness information in D \ D^K to enable better learning.

Page 4:

The Challenge

The unsupervised portion of the corpus, D \ D^K, adds to:
• Vocabulary
• Knowledge about the joint distribution of terms
• Unsupervised measures of inter-document similarity, e.g. site name, directory path, hyperlinks

Put together multiple sources of evidence of similarity and class membership into a label-learning system:
• Combine different features with partial supervision

Page 5:

Hard Classification

• Train a supervised learner on the available labeled data D^K.
• Label all documents in D \ D^K.
• Retrain the classifier using the new labels for the documents where the classifier was most confident.
• Continue until the labels do not change any more.
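Below is a minimal sketch of this hard (self-training) loop, assuming documents are already vectorized as dense term-count matrices; scikit-learn's MultinomialNB stands in for the naive Bayes learner, and the confidence threshold is an illustrative choice, not from the slides.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X_labeled, y_labeled, X_unlabeled, conf_threshold=0.95, max_iter=20):
    """Hard self-training: label the unlabeled pool, keep only confident
    predictions, retrain, and repeat until the labels stop changing."""
    X_train, y_train = X_labeled, y_labeled
    prev_labels = None
    clf = MultinomialNB().fit(X_train, y_train)
    for _ in range(max_iter):
        proba = clf.predict_proba(X_unlabeled)
        labels = clf.classes_[proba.argmax(axis=1)]
        if prev_labels is not None and np.array_equal(labels, prev_labels):
            break                                  # labels no longer change
        prev_labels = labels
        confident = proba.max(axis=1) >= conf_threshold
        # retrain on D^K plus the confidently labeled part of D \ D^K
        X_train = np.vstack([X_labeled, X_unlabeled[confident]])
        y_train = np.concatenate([y_labeled, labels[confident]])
        clf = MultinomialNB().fit(X_train, y_train)
    return clf, labels
```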

Page 6:

Expectation Maximization

A softer variant of the previous algorithm. Steps:
• Set up some fixed number of clusters with arbitrary initial distributions.
• Alternate the following steps, based on the current parameters of the distribution that characterizes each cluster c:
  – E-step: re-estimate Pr(c|d) for each cluster c and each document d.
  – M-step: re-estimate the parameters of the distribution for each cluster.

Page 7:

Experiment: EM

• Set up one cluster for each class label.
• Estimate a class-conditional distribution for each cluster, which includes information from D.
• Simultaneously estimate the cluster memberships of the unlabeled documents.

Page 8:

Experiment: EM (contd.)

Example:
• EM procedure + multinomial naive Bayes text classifier
• Laplace's law for parameter smoothing
• For EM, unlabeled documents belong to clusters probabilistically; term counts are weighted by these probabilities
• Likewise, modify the class priors

With only the labeled documents, the Laplace-smoothed multinomial parameters are

$$\theta_{c,t} = \frac{1 + \sum_{d \in D^K_c} n(d,t)}{|W| + \sum_{d \in D^K_c} n(d)}$$

With EM, every document contributes in proportion to its cluster membership probability:

$$\theta_{c,t} = \frac{1 + \sum_{d \in D} \Pr(c \mid d)\, n(d,t)}{|W| + \sum_{d \in D} \Pr(c \mid d)\, n(d)}$$

where n(d,t) is the number of occurrences of term t in document d, n(d) = \sum_t n(d,t), and W is the vocabulary.
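A minimal sketch of this weighted estimation under illustrative assumptions (X is a dense document-term count matrix, R holds the responsibilities Pr(c|d) with one row per document); the function and variable names are not from the book.

```python
import numpy as np

def m_step(X, R):
    """Laplace-smoothed multinomial parameters: each document's term counts
    are weighted by its cluster responsibilities Pr(c|d)."""
    weighted_counts = R.T @ X                      # sum_d Pr(c|d) * n(d, t)
    theta = (1.0 + weighted_counts) / (
        X.shape[1] + weighted_counts.sum(axis=1, keepdims=True))
    priors = (1.0 + R.sum(axis=0)) / (R.shape[1] + R.shape[0])  # smoothed class priors
    return theta, priors

def e_step(X, theta, priors):
    """Re-estimate Pr(c|d) from the current parameters (log space for stability)."""
    log_post = X @ np.log(theta).T + np.log(priors)
    log_post -= log_post.max(axis=1, keepdims=True)
    R = np.exp(log_post)
    return R / R.sum(axis=1, keepdims=True)
```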

Page 9:

EM: Issues

For d ∈ D^K, we know the class label c_d.
• Question: how to use this information?
• Will be dealt with later.

Using Laplace estimates instead of ML estimates:
• Not strictly EM
• Convergence takes place in practice

Page 10:

EM: Experiments

• Take a completely labeled corpus D, and randomly select a subset as D^K; also use the set of unlabeled documents D^U = D \ D^K in the EM procedure.
• A document is correctly classified when its concealed class label equals the class with the largest probability.
• Accuracy with unlabeled documents > accuracy without unlabeled documents, keeping the labeled set the same size.
• EM beats naive Bayes with the same size of labeled document set:
  – Largest boost for small labeled sets
  – Comparable or poorer performance of EM for large labeled sets

Page 11:

Belief in labeled documents

Depending on one's faith in the initial labeling:
• Set before the 1st iteration: Pr(c_d | d) = 1 and Pr(c' | d) = 0 for all c' ≠ c_d
• With each iteration, let the class probabilities of the labeled documents 'smear'

Page 12:

EM: Reducing belief in unlabeled documents

Problems due to:
• Noise in the term distributions of documents in D^U
• Mistakes in the E-step

Solution:
• Attenuate the contribution from documents in D^U
• Add a damping factor α in the E-step for the contribution from D^U

$$\theta_{c,t} = \frac{1 + \sum_{d \in D^K} \alpha_d \Pr(c \mid d)\, n(d,t) + \sum_{d \in D^U} \alpha_d \Pr(c \mid d)\, n(d,t)}{|W| + \sum_{d \in D^K} \alpha_d \Pr(c \mid d)\, n(d) + \sum_{d \in D^U} \alpha_d \Pr(c \mid d)\, n(d)}$$

where \alpha_d = 1 for d ∈ D^K and \alpha_d = \alpha < 1 for d ∈ D^U.
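Under the same illustrative assumptions as the earlier EM sketch (dense count matrix X, responsibility matrix R), the damping only changes the M-step weighting; labeled_mask and alpha are hypothetical parameter names:

```python
import numpy as np

def damped_m_step(X, R, labeled_mask, alpha=0.5):
    """M-step in which responsibilities of unlabeled documents are attenuated:
    alpha_d = 1 for d in D^K, alpha_d = alpha for d in D^U."""
    alpha_d = np.where(labeled_mask, 1.0, alpha)   # per-document damping factor
    weighted_R = R * alpha_d[:, None]              # alpha_d * Pr(c|d)
    counts = weighted_R.T @ X                      # sum_d alpha_d Pr(c|d) n(d,t)
    theta = (1.0 + counts) / (X.shape[1] + counts.sum(axis=1, keepdims=True))
    return theta
```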

Page 13:

Increasing D^U while holding D^K fixed also shows the advantage of using large unlabeled sets in the EM-like algorithm.

Page 14:

EM: Reducing belief in unlabeled documents (contd.)

• No theoretical justification, but accuracy is indeed influenced by the choice of α.
• What value of α to choose? An intuitive recipe (to be tried):
  – For large D^K, choose small α
  – For small D^K, choose large α (close to 1)

Page 15:

EM: Modeling labels using many mixture components

There need not be a one-to-one correspondence between EM clusters and class labels.

Mixture modeling of the term distributions of some classes:
• Especially "the negative class"
• E.g., for the two-class case "football" vs. "not football": documents not about "football" are actually about a variety of other things.

Page 16:

EM: Modeling labels using many mixture components (contd.)

Experiments: comparison with naive Bayes
• Lower accuracy with one mixture component per label
• Higher accuracy with more mixture components per label
• Overfitting and degradation with too large a number of clusters

Page 17:

Allowing more clusters in the EM-like algorithm than there are class labels often helps to capture term distributions for composite or complex topics, and boosts the accuracy of the semi-supervised learner beyond that of a naive Bayes classifier.

Page 18:

Labeling hypertext graphs

More complex features than exploited by EM:
• The test document is cited directly by a training document, or vice versa.
• There is a short path between the test document and one or more training documents.
• The test document is cited by a named category in a Web directory (the target category system could be somewhat different).
• Some category of a Web directory co-cites one or more training documents along with the test document.

Page 19:

Labeling hypertext graphs: Scenario

• Snapshot of the Web graph, G = (V, E)
• Set of topics; a small subset of nodes V^K is labeled
• Use the supervision to label some or all nodes in V - V^K

Page 20:

Hypertext models for classification

c = class, t = text, N = neighbors
• Text-only model: Pr[t | c]
• Using neighbors' text to judge my topic: Pr[t, t(N) | c]
• Better model: Pr[t, c(N) | c], estimated by non-linear relaxation

Page 21:

Absorbing features from neighboring pages

• Page u may have little text on it to train or apply a text classifier.
• u cites some second-level pages.
• Often the second-level pages have usable quantities of text.
• Question: how to use these features?

Page 22:

Absorbing features

Indiscriminate absorption of neighborhood text does not help; at times it deteriorates accuracy.

Reason: the implicit assumption
• The topic of a page u is likely to be the same as the topic of a page cited by u.
• Not always true: the topic may be "related" but not "same".

The distribution of topics of the pages cited could be quite distorted compared to the totality of contents available from the page itself.
• E.g., a university page with little textual content points to "how to get to our campus" or "recent sports prowess".

Page 23:

Absorbing link-derived features

Key insight 1
• The classes of hyperlinked neighbors are a better representation of hyperlinks.
• E.g., use the fact that u points to a page about athletics to raise our belief that u is a university homepage, and learn to systematically reduce the attention we pay to the fact that a page links to the Netscape download site.

Key insight 2
• Class labels are drawn from an is-a hierarchy.
• Evidence at the detailed topic level may be too noisy; coarsening the topic helps collect more reliable data on the dependence between the class of the homepage and the link-derived feature.

Page 24:

Absorbing link-derived features (contd.)

• Add all prefixes of the class path to the feature pool (see the sketch below).
• Do feature selection to get rid of noise features.
• Experiment:
  – Corpus of US patents
  – Two-level topic hierarchy: three first-level classes, each with four children
  – Each leaf topic has 800 documents
  – Compare Text, Link, Prefix, and Text+Prefix feature sets
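A tiny sketch of this prefix trick; the slash-separated class-path format and the function name are illustrative assumptions, not from the book:

```python
def prefix_features(class_path):
    """Expand a neighbor's class path, e.g. '/Patents/Chemistry/Polymers',
    into all of its prefixes so that coarse levels of the is-a hierarchy can
    also be used as (less noisy) link-derived features."""
    parts = [p for p in class_path.split("/") if p]
    return ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]

# prefix_features("/Patents/Chemistry/Polymers")
# -> ['/Patents', '/Patents/Chemistry', '/Patents/Chemistry/Polymers']
```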

Page 25:

The prefix trick

A two-level topic hierarchy of US patents. Using prefix-encoded link features in conjunction with text can significantly reduce classification error.

Page 26:

Absorbing link-derived features: Observations

Absorbing text from neighboring pages in an indiscriminate manner does not help classify hyperlinked patent documents any better than a purely text-based naive Bayes classifier.

Page 27:

Absorbing link-derived features: Limitation

• |V^K| << |V|: hardly any neighbor of a node to be classified is linked to any pre-labeled node.

Proposal:
• Start with a labeling of reasonable quality (maybe using a text classifier).
• Repeat: refine the labeling using a coupled distribution of the text and labels of neighbors,
• until the labeling stabilizes.

Page 28:

A relaxation labeling algorithm

Given:
• Hypertext graph G(V, E)
• Each vertex u is associated with text u_T

Desired:
• A labeling f of all (unlabeled) vertices so as to maximize

$$\Pr(f(V) \mid E, \{u_T : u \in V\}) = \frac{\Pr(f(V))\; \Pr(E, \{u_T : u \in V\} \mid f(V))}{\Pr(E, \{u_T : u \in V\})}$$

where $\Pr(E, \{u_T : u \in V\}) = \sum_{f(V)} \Pr(f(V)) \Pr(E, \{u_T : u \in V\} \mid f(V))$.

Page 29:

Preferential attachment

• Simplifying assumption: undirected graph
• The Web graph starts with m_0 nodes
• Time proceeds in discrete steps
• Every step, one new node v is added
• v is attached with m edges to old nodes
  – Suppose old node w has current degree d(w)
  – Multinomial distribution: the probability of attachment to w is proportional to d(w)
• Old nodes keep accumulating degree
• "Rich gets richer", or "winner takes all"
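A minimal simulation of this growth process, assuming an undirected graph kept as an edge list plus a degree map; m0, m, and steps are illustrative parameter names, not values from the slides:

```python
import random

def preferential_attachment(m0=3, m=2, steps=100, seed=0):
    """Grow an undirected graph: each new node attaches m edges to old nodes,
    choosing each endpoint with probability proportional to its current degree."""
    rng = random.Random(seed)
    # start from a small clique of m0 nodes so every node has nonzero degree
    edges = [(i, j) for i in range(m0) for j in range(i + 1, m0)]
    degree = {i: m0 - 1 for i in range(m0)}
    for v in range(m0, m0 + steps):
        nodes, weights = zip(*degree.items())
        targets = set()
        while len(targets) < m:                    # m distinct old endpoints
            targets.add(rng.choices(nodes, weights=weights)[0])
        degree[v] = 0
        for w in targets:
            edges.append((v, w))
            degree[v] += 1
            degree[w] += 1                         # "rich get richer"
    return edges, degree
```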

Page 30:

Heuristic Assumption

• E: the event that the edges were generated as per the edge list E.
• It is difficult to obtain a known function for Pr(E, {u_T : u ∈ V} | f(V)); approximate it using heuristic assumptions. Assume that:
  – Term occurrences are independent
  – Link-derived features are independent
  – There is no dependence between a term and a link-derived feature
• Assumption: decision boundaries will remain relatively immune to errors in the probability estimates.

Page 31:

Heuristic Assumption (contd.)

• Approximate the joint probability of neighboring classes by the product of the marginals.
• Couple the class probabilities of neighboring nodes.
• Optimization concerns:
  – Kleinberg and Tardos: global optimization, a unique f for all nodes in V^U
  – Greedy labeling followed by iterative correction of neighborhoods:
    · Greedy labeling using a text classifier
    · Re-evaluate the class probability of each page using the latest estimates of the class probabilities of its neighbors
    · EM-like soft classification of nodes

$$\Pr(f(v) \mid E, V_T, f(V^K)) = \sum_{f(N^U(v))} \Pr(f(N^U(v)) \mid E, V_T, f(V^K)) \; \Pr(f(v) \mid f(N(v)), E, V_T, f(V^K))$$

$$\approx \sum_{f(N^U(v))} \Big[ \prod_{w \in N^U(v)} \Pr(f(w) \mid E, V_T, f(V^K)) \Big] \; \Pr(f(v) \mid f(N(v)), E, V_T, f(V^K))$$

Page 32:

Inducing a Markov Random field

• Induction on the time step r, to break the circular definition.
• Converges if the seed values are reasonably accurate.
• Further assumptions:
  – Limited range of influence
  – The text of nodes other than v contains no information about f(v) (already accounted for in the graph structure)

$$\Pr^{(r+1)}(f(v) \mid E, V_T, f(V^K)) = \sum_{f(N^U(v))} \Big[ \prod_{w \in N^U(v)} \Pr^{(r)}(f(w) \mid E, V_T, f(V^K)) \Big] \; \Pr(f(v) \mid f(N(v)), E, V_T, f(V^K))$$

With the limited-range assumption, the last factor depends only on v's own text and the labels of its immediate neighborhood N(v).

Page 33:

Overview of the algorithm

Desired: the class (probabilities) of v, given
• the text v_T on that page, and
• the classes of the neighbors N(v) of v.

Use Bayes' rule to invert that goal: build the distributions Pr(f(N(v)), v_T | f(v)).

The algorithm HyperClass:
• Input: test node v
• Construct a suitably large vicinity graph around and containing v
• For each w in the vicinity graph: assign Pr^{(0)}(f(w) | E, V_T, f(V^K)) using a text classifier
• While the label probabilities do not stabilize (r = 1, 2, ...):
  – For each node w in the vicinity graph: update Pr^{(r)}(f(w) | E, V_T, f(V^K)) to Pr^{(r+1)}(f(w) | E, V_T, f(V^K)) using the equation above
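A minimal sketch of such a relaxation loop under simplifying assumptions: text_proba(v) is a hypothetical text classifier returning a class distribution for node v, and link_proba(c, nbr_dist) a hypothetical estimate of how likely class c is given the average label distribution of v's neighbors. Neither name is from the book, and the update below collapses the sum over neighbor label configurations into a mean-field style step.

```python
import numpy as np

def relaxation_label(nodes, neighbors, text_proba, link_proba,
                     known_labels, n_classes, max_iter=30, tol=1e-4):
    """Seed every node with a text-only class distribution, then repeatedly
    re-estimate each node's distribution from the current distributions of
    its neighbors until the labels stabilize."""
    P = {v: text_proba(v) for v in nodes}          # r = 0: greedy text seeding
    for v, c in known_labels.items():
        P[v] = np.eye(n_classes)[c]                # clamp pre-labeled nodes
    for _ in range(max_iter):
        delta, P_new = 0.0, {}
        for v in nodes:
            if v in known_labels:
                P_new[v] = P[v]
                continue
            # expected neighbor label distribution under current estimates
            nbr_dist = np.mean([P[w] for w in neighbors[v]], axis=0)
            scores = text_proba(v) * np.array(
                [link_proba(c, nbr_dist) for c in range(n_classes)])
            P_new[v] = scores / scores.sum()
            delta = max(delta, float(np.abs(P_new[v] - P[v]).max()))
        P = P_new
        if delta < tol:
            break
    return P
```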

Page 34:

Exploiting link features

• 9600 patents from 12 classes marked by the USPTO
• Patents have text and cite other patents
• Expand the test patent to include its neighborhood
• 'Forget' a fraction of the neighbors' classes

[Figure: classification error (%) vs. % of neighborhood known, for the Text, Link, and Text+Link classifiers.]

Page 35:

Relaxation labeling: Observations

When the test neighborhood is completely unlabeled:
• 'Link' performs better than the text-based classifier.
• Reason: model bias; pages tend to link to pages with a related class label.

Relaxation labeling:
• An approximate procedure to optimize a global objective function on the hypertext graph being labeled.
• A metric graph labeling problem.

Page 36:

A metric graph-labeling problem

Inference about the topic of page u possibly depends on the entire Web.
• Computationally infeasible
• Unclear if capturing such dependencies is useful: the phenomenon of "losing one's way" with clicks; significant clues about a page are expected to lie in a neighborhood of limited radius

Example:
• A hypertext graph whose nodes can belong to exactly one of two topics (red and blue)
• Given a graph with a small subset of nodes with known colors

Page 37:

A metric graph-labeling problem (contd.)

Goal: find a labeling f(u) for each unlabeled node u to minimize

$$Q(f) = \sum_{u} L(u, f(u)) + \sum_{(u,v) \in E} A(f(u), f(v))$$

Two terms:
• Affinity A(c1, c2): a cost between every pair of colors
• L(u, f(u)) = -Pr(f(u)|u): the cost of assigning label f(u) to node u

Parameters:
• Marginal distribution of topics
• 2 x 2 topic citation matrix: the probability of differently colored nodes linking to each other
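A tiny sketch of evaluating this objective for a candidate labeling f; the data structures (node_prob as per-node class probabilities, affinity as a label-pair cost table) are illustrative assumptions:

```python
def labeling_cost(nodes, edges, f, node_prob, affinity):
    """Q(f) = sum_u L(u, f(u)) + sum_{(u,v) in E} A(f(u), f(v)),
    with L(u, c) = -Pr(c | u), as on the slide."""
    node_cost = sum(-node_prob[u][f[u]] for u in nodes)
    edge_cost = sum(affinity[f[u]][f[v]] for (u, v) in edges)
    return node_cost + edge_cost
```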

Page 38:

Semi-supervised hypertext classification can thus be represented as the problem of completing a partially colored graph subject to a given set of cost constraints.

Page 39:

A metric graph-labeling problem: NP-completeness

• NP-complete [Kleinberg and Tardos]
• Approximation algorithms exist:
  – within an O(log k log log k) multiplicative factor of the minimal cost,
  – where k = the number of distinct class labels.

Page 40:

Problems with approaches so far

Metric or relaxation labeling:
• Representing accurate joint distributions over thousands of terms has high space and time complexity.

Naive models:
• Fast: assume class-conditional attribute independence.
• The dimensionality of the textual sub-problem >> the dimensionality of the link sub-problem.
• Pr(v_T | f(v)) tends to be lower in magnitude than Pr(f(N(v)) | f(v)).
• Hacky workaround: aggressive pruning of textual features.

Page 41:

Co-Training [Blum and Mitchell]

Classifiers with disjoint feature spaces. Co-training of classifiers:
• The scores produced by each classifier are used to train the other.
• Semi-supervised, EM-like training with two classifiers.

Assumptions:
• Two sets of features per document, d_A and d_B (with learners L_A and L_B).
• There must be no instance d for which f_A(d_A) ≠ f_B(d_B).
• Given the label, d_A is conditionally independent of d_B (and vice versa).

Page 42:

Co-training

• Divide the features into two class-conditionally independent sets: Pr(d_A, d_B | c) = Pr(d_A | c) · Pr(d_B | c).
• Use the labeled data to induce two separate classifiers.
• Repeat:
  – Each classifier is "most confident" about some unlabeled instances.
  – These are labeled and added to the training set of the other classifier.
• Improvements for text + hyperlinks.

Page 43:

Co-Training: Performance

• d_A = bag of words on the page
• d_B = bag of anchor texts from HREF tags
• Co-training reduces the error below the levels of both L_A and L_B individually.
• Pick a class c by maximizing Pr(c|d_A) · Pr(c|d_B) (a minimal sketch follows).
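A minimal sketch of such a co-training loop, assuming two pre-vectorized views XA (page words) and XB (anchor texts) of the same documents; scikit-learn's MultinomialNB stands in for the two learners, and the transfer size per round is an illustrative choice:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(XA, XB, y, labeled_idx, rounds=10, per_round=5):
    """Blum-Mitchell style co-training on two feature views.
    Each round, every view's classifier labels the pool and hands its most
    confident examples to the other view's training set."""
    pools = {"A": set(labeled_idx), "B": set(labeled_idx)}
    unlabeled = set(range(XA.shape[0])) - set(labeled_idx)
    y_guess = y.copy()                 # entries outside labeled_idx are overwritten before use
    for _ in range(rounds):
        clf_A = MultinomialNB().fit(XA[list(pools["A"])], y_guess[list(pools["A"])])
        clf_B = MultinomialNB().fit(XB[list(pools["B"])], y_guess[list(pools["B"])])
        for clf, X, other in ((clf_A, XA, "B"), (clf_B, XB, "A")):
            cand = sorted(unlabeled)
            if not cand:
                break
            proba = clf.predict_proba(X[cand])
            best = np.argsort(proba.max(axis=1))[::-1][:per_round]  # most confident
            for i in best:
                doc = cand[i]
                y_guess[doc] = clf.classes_[proba[i].argmax()]
                pools[other].add(doc)          # teach the other classifier
                unlabeled.discard(doc)
    # final prediction: pick c maximizing Pr(c|dA) * Pr(c|dB)
    # (assumes both views have seen every class, so the columns line up)
    combined = clf_A.predict_proba(XA) * clf_B.predict_proba(XB)
    return clf_A.classes_[combined.argmax(axis=1)]
```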

Page 44:

Co-training reduces classification error: reduction in error plotted against the number of mutual training rounds.