AINL 2016: Romanova, Nefedov
TRANSCRIPT
HSE-School of linguistics at Russian Paraphrase Detection Shared Task
Anastasia Romanova, Mikhail Nefedov
Saint-Petersburg, 2016
Anastasia Romanova, Mikhail Nefedov HSE-School of linguistics at Russian Paraphrase Detection Shared Task
Overview
1 Introduction
2 Task
3 Standard Features
4 Word Embedding Features
5 Results
6 Next steps
Introduction
Higher School of Economics, School of Linguistics
Task
Compare two sentences
Two types of classification
Standard and Non-standard runs
Standard Features
Precision
precision = word-overlap(sentence1, sentence2) / word-count(sentence1)
Recall
recall = word-overlap(sentence1, sentence2) / word-count(sentence2)
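The two overlap features can be sketched in a few lines of Python. This is only an illustration: the slides do not say whether repeated words are counted, so the multiset intersection used here, the whitespace tokenization, and the function names are all assumptions.

```python
from collections import Counter

def word_overlap(tokens1, tokens2):
    """Number of tokens shared by both sentences (assumption: repeats are counted)."""
    return sum((Counter(tokens1) & Counter(tokens2)).values())

def overlap_precision(tokens1, tokens2):
    """Overlap divided by the length of the first sentence."""
    return word_overlap(tokens1, tokens2) / len(tokens1) if tokens1 else 0.0

def overlap_recall(tokens1, tokens2):
    """Overlap divided by the length of the second sentence."""
    return word_overlap(tokens1, tokens2) / len(tokens2) if tokens2 else 0.0

s1 = "the boy smiles at the girl".split()
s2 = "the girl laughs".split()
print(overlap_precision(s1, s2), overlap_recall(s1, s2))
```

Here two tokens are shared ("the" once and "girl"), so precision is 2/6 and recall is 2/3.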
BLEU score
Proposed by IBM (Papineni et al., 2002) for evaluating Machine Translation Systems
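As a rough illustration of how BLEU can serve as a sentence-pair feature, the sketch below computes clipped n-gram precisions, their geometric mean, and the brevity penalty for one candidate against one reference. The maximum n-gram order of 2, the absence of smoothing, and the function names are assumptions; the slides do not state which settings were used.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as often as in the reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def sentence_bleu(candidate, reference, max_n=2):
    """Unsmoothed single-reference BLEU; returns 0.0 if any n-gram precision is zero."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)

print(sentence_bleu("the boy smiles".split(), "the boy laughs".split()))
```

Because BLEU is asymmetric, one option is to compute it in both directions and use both values as features; whether the authors did so is not stated.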
Standard Features
SyntaxNet
Released by Google in May, 2016
Models for 40 languages
Dependency parse tree
Standard Features
Tree Edit Distance (Zhang, Shasha, 1989)
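To make the feature concrete, here is a small sketch of ordered tree edit distance with unit costs for insertion, deletion, and relabeling. It uses a plain memoized forest recursion rather than the Zhang-Shasha dynamic program itself, so it is only practical for small trees; the tuple-based tree encoding and the unit costs are assumptions for illustration.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees.

def size(forest):
    """Total number of nodes in a forest."""
    return sum(1 + size(children) for _label, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Edit distance between two ordered forests with unit costs."""
    if not f1:
        return size(f2)  # insert every remaining node
    if not f2:
        return size(f1)  # delete every remaining node
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the root of the rightmost tree in f1; its children move up.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the root of the rightmost tree in f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Map the two rightmost roots onto each other, relabeling if needed.
    match = (forest_distance(f1[:-1], f2[:-1])
             + forest_distance(kids1, kids2)
             + (label1 != label2))
    return min(delete, insert, match)

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

t1 = ("smiles", (("boy", (("the", ()),)),))
t2 = ("laughs", (("girl", (("the", ()),)),))
print(tree_edit_distance(t1, t2))
```

The two toy dependency trees have the same shape and differ in two labels, so two relabeling operations are enough.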
Standard Results
Standard run
Word Embedding Features
Words as vectors
Word Embedding Features
Drawbacks of the averaging approach (Kenter and de Rijke, 2015)
Vectors for words Mean vectors
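The sketch below illustrates one well-known limitation of averaging, which may or may not be the one the slide had in mind: the mean vector ignores word order, so sentences with different meanings can receive identical representations. The three-dimensional toy vectors are invented for illustration and do not come from any trained model.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy embeddings (assumption: made-up values, not a real Word2Vec model).
toy = {"dog": [1.0, 0.0, 0.0], "bites": [0.0, 1.0, 0.0], "man": [0.0, 0.0, 1.0]}

s1 = "dog bites man".split()
s2 = "man bites dog".split()
m1 = mean_vector([toy[w] for w in s1])
m2 = mean_vector([toy[w] for w in s2])
print(cosine(m1, m2))  # the two sentences are indistinguishable after averaging
```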
Word Embedding Features
Before preprocessing
Клинтон выступила с первой речью после поражения на выборах (Clinton gave her first speech after the election defeat)
After preprocessing
клинтон_S выступать_V первый_A речь_S поражение_S выбор_S
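The formatting step can be sketched as follows, assuming a morphological analyzer has already produced lemma and part-of-speech pairs. The analyzer output below is written by hand, and the set of tags that are kept (S, V, A) and the tag PR for prepositions are assumptions inferred from the example; the slides do not name the tool or the filtering rule.

```python
# Hypothetical analyzer output for the example sentence: (lemma, POS tag) pairs.
analyzed = [
    ("клинтон", "S"), ("выступать", "V"), ("с", "PR"), ("первый", "A"),
    ("речь", "S"), ("после", "PR"), ("поражение", "S"), ("на", "PR"),
    ("выбор", "S"),
]

# Assumption: only nouns, verbs, and adjectives are kept.
CONTENT_TAGS = {"S", "V", "A"}

def to_tagged_lemmas(pairs, keep=CONTENT_TAGS):
    """Join lowercased lemmas with their POS tags, dropping tags outside `keep`."""
    return " ".join(f"{lemma.lower()}_{pos}" for lemma, pos in pairs if pos in keep)

print(to_tagged_lemmas(analyzed))
```

Tokens in this lemma_TAG form can then be looked up in a Word2Vec model whose vocabulary uses the same convention.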
Word Embedding Features
BM25 + Word2Vec
sl - the longer sentence
ss - the shorter sentence
avgsl - average sentence length
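The formula itself did not survive in the transcript. The sketch below follows the BM25-style weighting commonly attributed to Kenter and de Rijke (2015), in which the term frequency is replaced by the best embedding similarity between a word of the longer sentence and any word of the shorter one. The parameter defaults k1 = 1.2 and b = 0.75, the fallback idf of 1.0, and the toy vectors are assumptions, not values from the slides. The toy vectors are non-negative, so the sketch does not handle negative similarities.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sem(word, sentence, vectors):
    """Highest cosine similarity between `word` and any word of `sentence`."""
    return max(cosine(vectors[word], vectors[w]) for w in sentence)

def bm25_w2v(s_long, s_short, vectors, idf, avgsl, k1=1.2, b=0.75):
    """BM25-style score where term frequency is replaced by embedding similarity."""
    length_norm = k1 * (1 - b + b * len(s_short) / avgsl)
    score = 0.0
    for w in s_long:
        s = sem(w, s_short, vectors)
        score += idf.get(w, 1.0) * s * (k1 + 1) / (s + length_norm)
    return score

# Toy data (assumption: invented vectors and idf values).
vectors = {"boy": [1.0, 0.0], "smiles": [0.0, 1.0]}
idf = {"boy": 2.0, "smiles": 3.0}
sentence = ["boy", "smiles"]
print(bm25_w2v(sentence, sentence, vectors, idf, avgsl=2.0))
```

When a sentence is compared with itself and its length equals avgsl, every word contributes exactly its idf, which gives a quick sanity check on the implementation.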
Word Embedding Features
All to all similarities
The boy smiles - The girl laughs
Similarity matrix
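A similarity matrix of this kind can be sketched as follows: each row corresponds to a word of the first sentence, each column to a word of the second, and each cell holds the cosine similarity of the two word vectors. The toy vectors are invented for illustration and are not taken from the authors' model.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def similarity_matrix(sentence1, sentence2, vectors):
    """Rows follow sentence1, columns follow sentence2."""
    return [[cosine(vectors[w1], vectors[w2]) for w2 in sentence2] for w1 in sentence1]

# Toy embeddings (assumption: made-up values).
vectors = {
    "the": [1.0, 0.0, 0.0],
    "boy": [0.0, 1.0, 0.0],
    "girl": [0.0, 0.8, 0.6],
    "smiles": [0.0, 0.0, 1.0],
    "laughs": [0.0, 0.6, 0.8],
}

matrix = similarity_matrix("the boy smiles".split(), "the girl laughs".split(), vectors)
for row in matrix:
    print([round(value, 2) for value in row])
```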
Word Embedding Features
All to all similarities
The boy smiles - The girl laughs
Bins for all values
Bins for maximum values
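The two binning schemes can be sketched as histogram counts over a similarity matrix: one histogram over every cell, and one over the maximum of each row. The bin edges below are hypothetical, and the matrix values are invented; the slides do not give the intervals that were used.

```python
from bisect import bisect_right

def bin_counts(values, edges):
    """Count values per bin; bin i covers [edges[i], edges[i+1]), the last bin includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        index = bisect_right(edges, value) - 1
        index = min(max(index, 0), len(counts) - 1)  # clamp out-of-range values to the end bins
        counts[index] += 1
    return counts

# Hypothetical similarity matrix (rows: words of sentence 1, columns: words of sentence 2).
matrix = [
    [1.0, 0.1, 0.0],
    [0.1, 0.8, 0.2],
    [0.0, 0.3, 0.7],
]
edges = [-1.0, 0.15, 0.4, 1.0]  # assumption: illustrative bin boundaries

all_values = [value for row in matrix for value in row]
max_values = [max(row) for row in matrix]

print(bin_counts(all_values, edges))
print(bin_counts(max_values, edges))
```

The resulting count vectors have a fixed length regardless of sentence length, which is what makes them usable as classifier features.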
Word Embedding Features
Per-dimension similarities
Cosine similarity
Similarity bins
Results
Non-standard run
Next steps
Find optimal intervals for bins
Create a new Word2Vec model
Test AdaGram
Compute idf on a larger corpus
Include dependency weighting into BM25
Contacts
[email protected]@gmail.com