Web Page Language Identification Based on URLs

Reporter: 鄭志欣
Advisor: Hsing-Kuo Pao



Reference

E. Baykan, M. Henzinger, and I. Weber. Web page language identification based on URLs. In 34th International Conference on Very Large Data Bases (VLDB), pages 176-188. ACM, 2008.

Outline

Introduction
Language Identification Based on URLs
Experimental Setup
Experimental Results
Conclusions

Introduction

Given only the URL of a web page, can we identify its language?
  Web crawlers
  Personalized web browsers

We consider the problem of determining the language of a web page using only its URL.
  English, French, German, Spanish, and Italian
  .com (60%), .org (10%)
  www.wasserbett-test.com

Introduction

Applying machine learning techniques
  Features
    Word features
    N-gram features
    Custom-made features
  Machine learning algorithms
    Naïve Bayes
    Decision Tree
    Relative Entropy
    Maximum Entropy


Extracting Feature Vectors

Words as features
  Remove "www", "index", "html", etc.
  For example, http://www.internetwordstats.com/africa2.htm
  is split into: internetwordstats, com, africa
  "cnn" and "gov" are indicative of English.
  "produits" and "recherche" are indicative of French.
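The word-splitting step on this slide can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `url_to_tokens` is hypothetical, and the stop list beyond "www", "index", and "html" is an assumption inferred from the example (where "http" and "htm" also disappear). Splitting on every non-letter character is likewise inferred from "africa2" becoming "africa".

```python
import re

# Assumption: the slide names "www", "index", "html" and then says "etc.";
# "htm", "http" and "https" are added here only because the slide's example drops them.
SPECIAL_WORDS = {"www", "index", "html", "htm", "http", "https"}


def url_to_tokens(url):
    """Split a URL on every run of non-letter characters and drop special words."""
    tokens = re.split(r"[^a-z]+", url.lower())
    return [t for t in tokens if t and t not in SPECIAL_WORDS]


print(url_to_tokens("http://www.internetwordstats.com/africa2.htm"))
```

Note that the pattern only keeps ASCII letters, which is a simplification that fits typical URLs but would break up tokens containing accented characters.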

Trigrams as features
  Start with the same tokens as in the method above (words as features).
  E.g., weather: "_we", "wea", "eat", "ath", "the", "her", "er_"
  "_th" and "ing" are very common in English.
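A short sketch of the trigram extraction for a single token, assuming (as the "weather" example suggests) that each token is padded with one underscore on each side before sliding a window of three characters over it. The name `token_trigrams` is hypothetical.

```python
def token_trigrams(token):
    """Return the character trigrams of a token padded with '_' on both sides."""
    padded = f"_{token}_"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


print(token_trigrams("weather"))
```

Applied to "weather", this yields the seven trigrams listed on the slide.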

Custom-made features
  Top-level domain country code
  OpenOffice dictionaries
  Dictionary with city names
  Number of hyphens
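The custom-made features could be computed roughly as below. This is only a sketch under stated assumptions: the tiny word and city sets are stand-ins invented for illustration (the slide refers to OpenOffice dictionaries and a city-name dictionary, which are not reproduced here), and the feature names and the function `custom_features` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Hypothetical stand-ins for the OpenOffice dictionaries and the city-name dictionary.
TOY_DICTIONARIES = {
    "English": {"weather", "news"},
    "German": {"wasserbett", "wetter"},
    "French": {"produits", "recherche"},
}
TOY_CITY_NAMES = {
    "English": {"london"},
    "German": {"berlin"},
    "French": {"paris"},
}


def custom_features(url):
    """Top-level domain, hyphen count, and per-language dictionary/city hits."""
    # urlsplit only fills in the host when the URL has a "//" prefix.
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    tokens = re.findall(r"[a-z]+", url.lower())
    features = {
        "tld": host.rsplit(".", 1)[-1],
        "hyphens": url.count("-"),
    }
    for language, words in TOY_DICTIONARIES.items():
        features[f"dict_{language}"] = sum(t in words for t in tokens)
    for language, cities in TOY_CITY_NAMES.items():
        features[f"city_{language}"] = sum(t in cities for t in tokens)
    return features


print(custom_features("www.wasserbett-test.com"))
```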

Classification Algorithms

Country code top-level domain only (ccTLD)
Country code top-level domain plus (ccTLD+)
Naïve Bayes (NB)
Decision Trees (DT)
Relative Entropy (RE)
Maximum Entropy (ME)
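To make one of the listed algorithms concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier over character trigrams with add-one smoothing. It is an illustration only: the slides do not give the paper's exact formulation, the class name `TrigramNaiveBayes` is hypothetical, and the six training URLs are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def url_trigrams(url):
    """Character trigrams of every letter-only token, each padded with '_'."""
    grams = []
    for token in re.findall(r"[a-z]+", url.lower()):
        padded = f"_{token}_"
        grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


class TrigramNaiveBayes:
    def fit(self, urls, labels):
        self.gram_counts = defaultdict(Counter)  # label -> trigram counts
        self.class_counts = Counter(labels)      # label -> number of URLs
        for url, label in zip(urls, labels):
            self.gram_counts[label].update(url_trigrams(url))
        self.vocab = set().union(*self.gram_counts.values())
        return self

    def predict(self, url):
        n_urls = sum(self.class_counts.values())
        scores = {}
        for label, counts in self.gram_counts.items():
            total = sum(counts.values())
            # log prior + sum of add-one-smoothed log likelihoods
            score = math.log(self.class_counts[label] / n_urls)
            for gram in url_trigrams(url):
                score += math.log((counts[gram] + 1) / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up toy training set, for illustration only.
train_urls = [
    "www.wetter-bericht.de", "www.wasserbett-test.com", "www.nachrichten-heute.de",
    "www.weather-report.com", "www.news-today.org", "www.the-morning-paper.com",
]
train_labels = ["German"] * 3 + ["English"] * 3
model = TrigramNaiveBayes().fit(train_urls, train_labels)
print(model.predict("www.wasserbett-kaufen.de"))
```

With so little training data the result is only indicative; the point is the shape of the computation, not its accuracy.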


Data Set

The algorithms were evaluated on three different data sets:
  Open Directory Project
  Microsoft's Live Search
  1260 pages from a large web crawl, labeled by hand

Data set                Language  Training size  Test size
Open Directory Project  English   145,000        4,910
                        German    144,999        4,965
                        French    144,996        4,961
                        Spanish   144,974        4,878
                        Italian   144,987        4,933
Search Engine Results   English   99,992         999
                        German    99,572         992
                        French    99,549         997
                        Spanish   99,838         997
                        Italian   99,786         997
Web Crawl               English   0              1,082
                        German    0              81
                        French    0              57
                        Spanish   0              19
                        Italian   0              21


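Precision: $P = \frac{n_+ \, p(+|+)}{n_+ \, p(+|+) + n_- \, (1 - p(-|-))}$

Recall: $R = p(+|+)$

$p(-|-)$: the fraction of negative test examples that are classified as negative

F-measure: $F = \frac{2}{1/R + 1/P}$

Here $n_+$ and $n_-$ are the numbers of positive and negative test examples.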
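The measures on this slide can be checked numerically with a small sketch. The function names and the input numbers are hypothetical; the formulas follow the slide, with n_pos and n_neg the numbers of positive and negative test examples, p_pos = p(+|+) the recall, and p_neg = p(-|-).

```python
def precision(n_pos, n_neg, p_pos, p_neg):
    """P = n+ p(+|+) / (n+ p(+|+) + n- (1 - p(-|-)))."""
    true_positives = n_pos * p_pos
    false_positives = n_neg * (1 - p_neg)
    return true_positives / (true_positives + false_positives)


def f_measure(recall, prec):
    """Harmonic mean of recall and precision: F = 2 / (1/R + 1/P)."""
    return 2 / (1 / recall + 1 / prec)


# Made-up example: 100 positive and 400 negative test URLs.
R = 0.90  # p(+|+)
P = precision(n_pos=100, n_neg=400, p_pos=R, p_neg=0.95)
F = f_measure(R, P)
print(round(P, 4), round(F, 4))
```

The example shows why precision depends on the class balance: with four times as many negatives as positives, even a 5% error rate on negatives produces 20 false positives against 90 true positives.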

Human Performance


Baseline: ccTLD
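A ccTLD-style baseline can be sketched as a simple lookup on the top-level domain. Everything specific here is an assumption made for illustration: the slides do not list which country codes map to which language, and treating .com and .org as English in the "plus" variant is a guess at what ccTLD+ adds, not a statement of the paper's definition. The names `cctld_baseline` and `plus` are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit

# Assumed, deliberately small mapping; the real baseline may cover more domains.
CCTLD_TO_LANGUAGE = {
    "uk": "English", "de": "German", "fr": "French", "es": "Spanish", "it": "Italian",
}
# Assumption: the "plus" variant additionally treats these generic domains as English.
PLUS_EXTRA = {"com": "English", "org": "English"}


def cctld_baseline(url: str, plus: bool = False) -> Optional[str]:
    """Guess the language from the top-level domain alone; None if unknown."""
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    if tld in CCTLD_TO_LANGUAGE:
        return CCTLD_TO_LANGUAGE[tld]
    if plus:
        return PLUS_EXTRA.get(tld)
    return None


print(cctld_baseline("http://www.example.de/"))
print(cctld_baseline("www.wasserbett-test.com", plus=True))
```

Under these assumptions, the German-looking URL www.wasserbett-test.com gets no answer from the plain variant and is labeled English by the plus variant, which illustrates why a TLD lookup alone is a weak baseline.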


Conclusions

This paper shows that high-quality language identifiers for web pages can be built based on URLs alone.

The largest challenge is to identify English-looking URLs of non-English web pages.
