Exploiting Bilingual Word Embeddings to Establish Translational Equivalence

Tobias Eder

15.05.2017



Translation without a dictionary

• Translation within a specific domain

• Unknown words in the text?

• Without a dictionary, no translation

• Domain-dependent translation of rare words


Overview

1. Motivation

2. Word embeddings

3. Vector space models

• Word2Vec

• fastText

4. Linear mappings

5. Corpora and experimental setup

6. Next steps / regularization

7. References


Word Embeddings


Word2Vec

• Google (2013)

• Word embedding toolkit

• CBOW and skip-gram models
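
The difference between the two model types lies in what is predicted from what. The following minimal sketch only builds the training examples each model would see for a toy sentence; the function names and the window size are illustrative assumptions, not part of the Word2Vec toolkit.

```python
def context_window(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: predict the centre word from its surrounding context words."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """Skip-gram: predict each context word from the centre word."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]


sentence = "the cat sat on the mat".split()
cbow = cbow_examples(sentence, window=1)
skipgram = skipgram_examples(sentence, window=1)

print(cbow[1])       # (['the', 'sat'], 'cat')
print(skipgram[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```

CBOW yields one example per token, while skip-gram yields one example per (centre, context) pair.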


fastText

• Facebook Research (2016)

• Word representation learning

  • With subword information (character n-grams)

• Word vectors for OOV words

• Text classification with a linear model
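
How subword information yields vectors for OOV words can be sketched as follows. The sketch assumes the scheme described by Bojanowski et al. (2016): a word is wrapped in the boundary markers `<` and `>`, split into character n-grams (here of length 3 to 6), and represented by the sum of its n-gram vectors. The plain dictionary of n-gram vectors and both function names are simplifying assumptions; this is not the fastText library API.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in the boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def oov_vector(word, ngram_vectors, dim):
    """Sum the vectors of all known n-grams of `word`; unknown n-grams are skipped."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# Hypothetical toy n-gram table with 2-dimensional vectors.
toy_ngram_vectors = {"<wh": [1.0, 0.0], "re>": [0.0, 2.0]}
print(oov_vector("where", toy_ngram_vectors, dim=2))  # [1.0, 2.0]
```

A word never seen in training still shares n-grams with known words, so it receives a non-trivial vector.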


Linear mappings


Linear regression:

Ridge regression (L2 regularization):
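
A minimal sketch of both objectives, assuming the usual formulation for learning a translation matrix from a seed dictionary (as in Mikolov, Le and Sutskever, 2013): with source vectors as rows of X and the vectors of their translations as rows of Z, linear regression minimizes ||XW - Z||² over W, and ridge regression adds the penalty λ||W||². The symbols X, Z, W, λ, the function name, and the toy data are assumptions, not taken from the slides.

```python
import numpy as np


def fit_linear_map(X, Z, lam=0.0):
    """Fit W minimising ||X W - Z||_F^2 + lam * ||W||_F^2.

    X: (n, d_src) source-language vectors, one row per dictionary entry.
    Z: (n, d_tgt) vectors of the corresponding translations.
    lam = 0 gives ordinary least squares; lam > 0 gives ridge regression.
    """
    d_src = X.shape[1]
    # Normal equations: (X^T X + lam * I) W = X^T Z
    return np.linalg.solve(X.T @ X + lam * np.eye(d_src), X.T @ Z)


# Hypothetical toy data: 50 word pairs related by a known linear map plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
W_true = rng.normal(size=(4, 3))
Z = X @ W_true + 0.01 * rng.normal(size=(50, 3))

W_ols = fit_linear_map(X, Z)               # plain linear regression
W_ridge = fit_linear_map(X, Z, lam=10.0)   # L2-regularised

# The ridge penalty shrinks the mapping towards zero.
print(np.linalg.norm(W_ridge) < np.linalg.norm(W_ols))
```

The penalty trades a slightly worse fit on the dictionary pairs for a smaller W, which can help when few training pairs are available relative to the embedding dimension.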


Corpora and experimental setup

• Four different parallel corpora:

  • General (approx. 110M tokens)

  • Medical Big (approx. 50M tokens)

  • EMEA (approx. 4M tokens)

  • TED Talks (approx. 2M tokens)

• Different embeddings (CBOW, skip-gram)

• English-German translation

• Small parallel corpus (approx. 5,000 words)


Corpora and experimental setup

• Selection of words from the corpus (approx. 1,000 high-frequency words)

• Mapping with a regression model

• Domain-specific test sets

• Differing performance across models
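
One way such an evaluation could look, sketched under assumptions: a source vector is mapped with a learned matrix W, the nearest target-language words by cosine similarity are taken as translation candidates (the retrieval step used by Mikolov, Le and Sutskever, 2013), and a test set is scored by precision at k. The function names, the metric, and the toy vectors are illustrative; the slides do not state which measure was used.

```python
import numpy as np


def translate(x, W, target_vectors, target_words, k=1):
    """Map source vector x with W and return the k nearest target words (cosine)."""
    mapped = x @ W
    mapped = mapped / np.linalg.norm(mapped)
    normed = target_vectors / np.linalg.norm(target_vectors, axis=1, keepdims=True)
    scores = normed @ mapped          # cosine similarity to every target word
    best = np.argsort(-scores)[:k]    # indices of the k highest scores
    return [target_words[i] for i in best]


def precision_at_k(test_pairs, source_vectors, W, target_vectors, target_words, k=1):
    """Fraction of test pairs whose gold translation is among the top-k candidates."""
    hits = 0
    for src_word, gold in test_pairs:
        candidates = translate(source_vectors[src_word], W,
                               target_vectors, target_words, k)
        hits += gold in candidates
    return hits / len(test_pairs)


# Hypothetical toy vectors; the identity map already aligns both spaces here.
target_words = ["Hund", "Katze", "Haus"]
target_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
source_vectors = {"dog": np.array([0.9, 0.1]), "cat": np.array([0.1, 0.9])}
W = np.eye(2)
test_pairs = [("dog", "Hund"), ("cat", "Katze")]

p1 = precision_at_k(test_pairs, source_vectors, W, target_vectors, target_words, k=1)
print(p1)  # 1.0
```

Running the same function on each domain-specific test set gives one comparable score per corpus and embedding type.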


Next steps

• Low-frequency words?

• Better mappings?

• Other regularization methods?

• Evaluation on OOV words in fastText


References

• Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey: “Efficient Estimation of Word Representations in Vector Space”. In: Proceedings of Workshop at ICLR. 2013.

• Mikolov, Tomas; Le, Quoc V.; Sutskever, Ilya: “Exploiting Similarities among Languages for Machine Translation”. In: arXiv:1309.4168. 2013.

• Ishiwatari, Shonosuke; Kaji, Nobuhiro; Yoshinaga, Naoki; Toyoda, Masashi; Kitsuregawa, Masaru: “Accurate Cross-lingual Projection between Count-based Word Vectors by Exploiting Translatable Context Pairs”. In: Proceedings of the 19th Conference on Computational Natural Language Learning. 2015.

• Bojanowski, Piotr; Grave, Edouard; Joulin, Armand; Mikolov, Tomas: “Enriching Word Vectors with Subword Information”. In: arXiv:1607.04606. 2016.

• Mikolov, Tomas; Yih, Wen-tau; Zweig, Geoffrey: “Linguistic Regularities in Continuous Space Word Representations”. In: Proceedings of NAACL-HLT. 2013.