2007/4/201 extracting parallel texts from massive web documents chikayama taura lab. m2 dai saito
TRANSCRIPT
2007/4/20 2
Purpose
Parallel corpus : a set of parallel textsParallel texts : translated pairs of texts
Construct Parallel Corpora from the Web
One thing was certain, that the WHITE kitten had had
nothing to do with it.
一つ確実なのは、白い子ネコはなんの関係も
なかったということ。
--it was the black kitten's fault entirely.
―― もうなにもかも、黒い子ネコのせいだったのです。
English 日本語
2007/4/20 3
Parallel Texts
Useful resource for Statistical machine translationDictionary construction
But… existing corpora are not enough Genre
Public Documents
Software Manuals
Language Limited English-French
Amount Small Large human
resource
2007/4/20 4
Parallel Texts from the Web
Extracting Parallel Texts from Massive Web Documents Very large amount of texts Varied languages Small human resource
2007/4/20 5
Problems
How to detect parallel texts automatically
How to reduce calculation cost
Web Web
To construct parallel corpus1. Extract candidate pairs2. Judge whether they
really are parallel texts
2007/4/20 6
Agenda
IntroductionRelated workProposal
Detect parallel textsExtract candidate pairs
ExperimentConclusion
2007/4/20 7
STRAND [Resnik et. al. 03]
URL Matching
1. Remove language-specific substrings[LSSs](Japanese : ja, jp, jpn, euc, sjis,…)
2. Match LSSs-removed URLs3. Make a detail comparison
http://www.hostname.com/index.html.en
http://www.hostname.com/index.html.ja
http://www.hostname.com/index.html.en
http://www.hostname.com/index.html.ja
2007/4/20 8
DOM Tree Alignment [Lei et. al. 06]
HTML→DOM Tree
Searching linked pages“alt” taglink name
Parallel link: a pair of the same hyperlinks in parallel texts
link
link
“English version”“In English” etc…
2007/4/20 9
Agenda
IntroductionRelated workProposal
Detect parallel textsExtract candidate pairs
ExperimentConclusion
2007/4/20 11
Detecting parallel texts
Low comparison costwithout HTML Information
1. word (noun)2. semantic ID3. comparison
[Fukushima et.al. 06]
2007/4/20 12
Semantic ID Conversion
Constructing a graph from dictionaries
Treating Japaneseand English texts in the same level
# of Semantic IDs:about 10,000
Sense 感覚
意味Movie
Film 映画
Hobby
Taste
趣味
味
1
2
3
2007/4/20 13
Texts to Vector
テキスト955
…
辞書 1704
…
数列 3173
辞書を使ってテキストを数列に変える。
1704
955 3173
(955, 1704, 3173)
sort
+position information
2007/4/20 14
Comparison
tscore (translation score)
T1:(106, 335, 455, 567, 1704, 3173, 7421)
T2:(335, 567, 567, 1704, 4014, 5449, 7421)
score=012
)2#1(# TTO
2#1# TTscoretscore
3
tscore = 4/(7+7)
4
2007/4/20 15
tscore threshold
Fry Corpus[05 Fry] 400 pairF-measure
Speed 200,000 pairs/sec
tscore threshold 0.102
2recallpresicion
recallpresicion
2007/4/20 16
Agenda
IntroductionRelated workProposal
Detect parallel textsExtract candidate pairs
ExperimentConclusion
2007/4/20 17
Extract candidate pairs
Calculation cost of each comparison
Calculation cost of extracting parallel textsA number of comparison: n^2URL matching is too strict
Japanese and English 90,000,000URL → 4,000 URL pairs → 1,000 real pairs
2007/4/20 18
Calculation Cost Reduction
→Reducing the number of comparison
distance score : tscoreCompare only texts close to each other
Distance of each parallel texts and a sample text should be equal
English 日本語
Sample
2007/4/20 19
Calculation Cost Reduction
Flow1. Select sample texts (<<n)2. Calculate distance score with
sample texts3. Classify top m score4. Compare only for texts in the
same group
2007/4/20 20
Number of sampleCalculation cost
Accuracy (low risk of miss labeling)
Methods to select sampleRandomk-means
Sampling
2007/4/20 21
k-means
1. Select k samples
2. Classify all texts
3. Calculate centers
4. Re-classify
k=2
2007/4/20 22
Calculation of tscore in k-means
Text1:(106, 335, 455, 567, 1704, 3173, 7421)Text2:(335, 567, 567, 1704, 4014, 5449, 7421)
Text1:(106, 335, 455, 567, 1704, 3173, 7421)Average1:((567, 0.2), (4014, 0.14), (7421, 0.5), …)
tscore = 4/(7+7)
tscore = (0.2+0.5)
normal
k-means
2007/4/20 23
Converting HTML on the Web
Guess languageEnglish, SJIS, EUC-JP, UTF-8
Convert character codeRemove HTML TagMorphological Analysis→pickup noun
2007/4/20 24
Agenda
IntroductionRelated workProposal
Detect parallel textsExtract candidate pairs
ExperimentConclusion
2007/4/20 26
Environment
Dataset: Fry Corpus [Fry 05]Corpus of Japanese-English news pagesConvert HTML to Semantic ID in advance
MachineCPU : Xeon 2.4GHz DualMemory : 2GBOS : Linux (Debian)
2007/4/20 27
0
50
100
150
200
250
0 1000 2000 3000 4000 5000 6000
# of pairs
Exec
utio
n Ti
me
[sec
]
n̂ 2
sampling( )
sampling( )
Calculation Cost
Fry Corpus200 - 6400 pairs
Normal All-to-All
Random sampling (Top3)
# of texts grows, gap becomes widerLow cost with n^2 samples
n
2n
2007/4/20 28
0
2
4
6
8
10
12
14
3 4 5 6 7 8 9 1011 1213 1415 16 1718 1920
# of samples
mis
s cl
assi
ficat
ion
ratio
[%]
0
0.5
1
1.5
2
2.5
exec
utio
n tim
e [s
ec]
miss classificationratioexecution time
Accuracy v.s. Calculation time
Fry Corpus400 pairs
Random sampling# of sample grows,
Miss classification ratio → high Execution time → low
Trade off with Miss classification ratio and Execution time
2007/4/20 29
Sample selection with k-means Accuracy and Execution time with k-
means Flow
1. Random sampling number of samples : √n
2. Calculating the center and re-sampling3. Measuring Miss-classification ratio and
Execution time
2007/4/20 30
Evaluation of k-means
Low miss-classification ratio→High biased
miss classification
calculation time [sec]
200
random 21 0.15
k-means
4 0.32
400
random 51 0.54
k-means
7 1.18
0
100
200
300
400
500
600
700
cluster number
# of
tex
ts
200 k-means200 normal400 k-means400 normal
2007/4/20 31
Agenda
IntroductionRelated workProposal
Detect parallel textsExtract candidate pairs
ExperimentConclusion
2007/4/20 32
Conclusion and Future work
Parallel texts from the WebDetecting parallel textsExtracting candidate pairs
Random sampling k-means
2007/4/20 33
Future work
Better clustering methodsHierarchical
Dimension reductionAbout 10,000 dimension is too high
Processing real HTML texts from the Web