2007/4/201 extracting parallel texts from massive web documents chikayama taura lab. m2 dai saito

2007/4/20 1

Extracting Parallel Texts from Massive Web

Documents

Chikayama Taura lab.

M2 Dai Saito

2007/4/20 2

Purpose

Parallel corpus : a set of parallel textsParallel texts : translated pairs of texts

Construct Parallel Corpora from the Web

One thing was certain, that the WHITE kitten had had

nothing to do with it.

一つ確実なのは、白い子ネコはなんの関係も

なかったということ。

--it was the black kitten's fault entirely.

―― もうなにもかも、黒い子ネコのせいだったのです。

English 日本語

2007/4/20 3

Parallel Texts

Useful resource for Statistical machine translationDictionary construction

But… existing corpora are not enough Genre

Public Documents

Software Manuals

Language Limited English-French

Amount Small Large human

resource

2007/4/20 4

Parallel Texts from the Web

Extracting Parallel Texts from Massive Web Documents Very large amount of texts Varied languages Small human resource

2007/4/20 5

Problems

How to detect parallel texts automatically

How to reduce calculation cost

Web Web

To construct parallel corpus1. Extract candidate pairs2. Judge whether they

really are parallel texts

2007/4/20 6

Agenda

IntroductionRelated workProposal

Detect parallel textsExtract candidate pairs

ExperimentConclusion

2007/4/20 7

STRAND [Resnik et. al. 03]

URL Matching

1. Remove language-specific substrings[LSSs](Japanese : ja, jp, jpn, euc, sjis,…)

2. Match LSSs-removed URLs3. Make a detail comparison

http://www.hostname.com/index.html.en

http://www.hostname.com/index.html.ja

http://www.hostname.com/index.html.en

http://www.hostname.com/index.html.ja

2007/4/20 8

DOM Tree Alignment [Lei et. al. 06]

HTML→DOM Tree

Searching linked pages“alt” taglink name

Parallel link: a pair of the same hyperlinks in parallel texts

link

link

“English version”“In English” etc…

2007/4/20 9

Agenda




2007/4/20 10

Outline

WebWeb

Detect parallel texts

Extract candidate pairs

……

……

Crawler

2007/4/20 11

Detecting parallel texts

Low comparison costwithout HTML Information

1. word (noun)2. semantic ID3. comparison

[Fukushima et.al. 06]

2007/4/20 12

Semantic ID Conversion

Constructing a graph from dictionaries

Treating Japaneseand English texts in the same level

# of Semantic IDs:about 10,000

Sense 感覚

意味Movie

Film 映画

Hobby

Taste

趣味

味

１

２

３

2007/4/20 13

Texts to Vector

テキスト955

…

辞書 1704

…

数列 3173

辞書を使ってテキストを数列に変える。

1704

955 3173

(955, 1704, 3173)

sort

+position information

2007/4/20 14

Comparison

tscore (translation score)

T1:(106, 335, 455, 567, 1704, 3173, 7421)

T2:(335, 567, 567, 1704, 4014, 5449, 7421)

score=012

)2#1(# TTO

2#1# TTscoretscore

3

tscore = 4/(7+7)

4

2007/4/20 15

tscore threshold

Fry Corpus[05 Fry] 400 pairF-measure

Speed 200,000 pairs/sec

tscore threshold 0.102

2recallpresicion

recallpresicion

2007/4/20 16

Agenda




2007/4/20 17

Extract candidate pairs

Calculation cost of each comparison

Calculation cost of extracting parallel textsA number of comparison: n^2URL matching is too strict

Japanese and English 90,000,000URL → 4,000 URL pairs → 1,000 real pairs

2007/4/20 18

Calculation Cost Reduction

→Reducing the number of comparison

distance score : tscoreCompare only texts close to each other

Distance of each parallel texts and a sample text should be equal

English 日本語

Sample

2007/4/20 19

Calculation Cost Reduction

Flow1. Select sample texts (<<n)2. Calculate distance score with

sample texts3. Classify top m score4. Compare only for texts in the

same group

2007/4/20 20

Number of sampleCalculation cost

Accuracy (low risk of miss labeling)

Methods to select sampleRandomk-means

Sampling

2007/4/20 21

k-means

1. Select k samples

2. Classify all texts

3. Calculate centers

4. Re-classify

k=2

2007/4/20 22

Calculation of tscore in k-means

Text1:(106, 335, 455, 567, 1704, 3173, 7421)Text2:(335, 567, 567, 1704, 4014, 5449, 7421)

Text1:(106, 335, 455, 567, 1704, 3173, 7421)Average1:((567, 0.2), (4014, 0.14), (7421, 0.5), …)

tscore = 4/(7+7)

tscore = (0.2+0.5)

normal

k-means

2007/4/20 23

Converting HTML on the Web

Guess languageEnglish, SJIS, EUC-JP, UTF-8

Convert character codeRemove HTML TagMorphological Analysis→pickup noun

2007/4/20 24

Agenda




2007/4/20 25

Experiment

Calculation CostAccuracy v.s. Calculation timeClustering

k-means

2007/4/20 26

Environment

Dataset： Fry Corpus [Fry 05]Corpus of Japanese-English news pagesConvert HTML to Semantic ID in advance

MachineCPU : Xeon 2.4GHz DualMemory : 2GBOS : Linux (Debian)

2007/4/20 27

0

50

100

150

200

250

0 1000 2000 3000 4000 5000 6000

# of pairs

Exec

utio

n Ti

me

[sec

]

n̂ 2

sampling( )

sampling( )

Calculation Cost

Fry Corpus200 - 6400 pairs

Normal All-to-All

Random sampling (Top3)

# of texts grows, gap becomes widerLow cost with n^2 samples

n

2n

2007/4/20 28

0

2

4

6

8

10

12

14

3 4 5 6 7 8 9 1011 1213 1415 16 1718 1920

# of samples

mis

s cl

assi

ficat

ion

ratio

[%]

0

0.5

1

1.5

2

2.5

exec

utio

n tim

e [s

ec]

miss classificationratioexecution time

Accuracy v.s. Calculation time

Fry Corpus400 pairs

Random sampling# of sample grows,

Miss classification ratio → high Execution time → low

Trade off with Miss classification ratio and Execution time

2007/4/20 29

Sample selection with k-means Accuracy and Execution time with k-

means Flow

1. Random sampling number of samples : √n

2. Calculating the center and re-sampling3. Measuring Miss-classification ratio and

Execution time

2007/4/20 30

Evaluation of k-means

Low miss-classification ratio→High biased

miss classification

calculation time [sec]

200

random 21 0.15

k-means

4 0.32

400

random 51 0.54

k-means

7 1.18

0

100

200

300

400

500

600

700

cluster number

# of

tex

ts

200 k-means200 normal400 k-means400 normal

2007/4/20 31

Agenda




2007/4/20 32

Conclusion and Future work

Parallel texts from the WebDetecting parallel textsExtracting candidate pairs

Random sampling k-means

2007/4/20 33

Future work

Better clustering methodsHierarchical

Dimension reductionAbout 10,000 dimension is too high

Processing real HTML texts from the Web

2007/4/20 34

Thank you for your attention!