AINL 2016: Yagunova
![Page 1: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/1.jpg)
ParaPhraser: Russian Paraphrase Corpus and Shared Task
Elena Yagunova, Ekaterina Pronoza, ..
Saint-Petersburg State University & Co
ParaPhraser.ru, 2016
![Page 2: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/2.jpg)
Our approach (overall)
• As part of our ParaPhraser project on the identification and classification of Russian paraphrases, we have collected a corpus of more than 8000 sentence pairs annotated as precise paraphrases, loose paraphrases or non-paraphrases.
• It is annotated via crowdsourcing by naïve native speakers of Russian.
• Our paraphrase corpus is collected from news headlines and can therefore be considered a summarized news stream describing the most important events.
![Page 3: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/3.jpg)
First: the aims of our paraphrase project
• To create a publicly available Russian sentential paraphrase corpus
• The corpus is NOT intended to be a general-purpose one
• Potential applications: Information Extraction, Text Summarization
![Page 4: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/4.jpg)
Sources of paraphrases
• Natural
◦ parallel multilingual corpora
◦ comparable monolingual corpora
◦ different translations of the same stories/novels
◦ news collections
◦ texts from social networks
• Artificial
![Page 5: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/5.jpg)
Other corpora
• Microsoft Research Paraphrase Corpus (2004)
◦ 5801 sentences, 67% paraphrases
◦ 2 classes: paraphrases and non-paraphrases
◦ annotated by 2 experts
• METER corpus, Knight and Marcu corpus, User Language Paraphrase Corpus, PPDB, SEMILAR, Twitter Paraphrase Corpus, etc.
![Page 6: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/6.jpg)
Our unsupervised approach
• Parse news articles in real time
• Calculate the similarity metric for each pair of headlines published by different media agencies within a short period of time
• Include candidate pairs in the corpus
• Annotate candidate pairs (using crowdsourcing)
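The candidate-selection step above can be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout `(agency, timestamp, text)`, the 24-hour window, the 0.5 threshold and the placeholder `word_overlap` similarity are all hypothetical and are not taken from the slides.

```python
from datetime import datetime, timedelta
from itertools import combinations


def word_overlap(s1, s2):
    """Placeholder similarity (Jaccard over lowercased tokens); an assumption."""
    a, b = set(s1.lower().split()), set(s2.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def candidate_pairs(headlines, similarity,
                    window=timedelta(hours=24), threshold=0.5):
    """Return headline pairs from different agencies published close in time.

    headlines: list of (agency, timestamp, text) tuples (hypothetical schema).
    window and threshold are illustrative values, not the authors' settings.
    """
    pairs = []
    for (ag1, t1, s1), (ag2, t2, s2) in combinations(headlines, 2):
        if ag1 == ag2:
            continue  # only compare headlines from different agencies
        if abs(t1 - t2) > window:
            continue  # only compare headlines published close in time
        if similarity(s1, s2) >= threshold:
            pairs.append((s1, s2))
    return pairs
```

Any sentence similarity function with the same two-argument signature could be passed in place of `word_overlap`.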
![Page 7: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/7.jpg)
Unsupervised Similarity Metric
Extends the matrix metric by Fernando and Stevenson:

$$\mathrm{sim}(\vec{a}, \vec{b}) = \frac{\vec{a}\, W\, \vec{b}^{T}}{|\vec{a}|\,|\vec{b}|}$$

The word similarity matrix W is calculated according to the rules:
◦ identical words starting with capital letters -> 1.2
◦ identical words -> 1
◦ synonyms -> NPMI, Dice or Jaccard synset coefficient multiplied by 0.8
◦ substrings -> the length of the shorter word divided by the length of the longer word, multiplied by 0.7
◦ common prefix (at least 3 characters) -> the prefix length divided by the length of the shorter word, multiplied by 0
◦ otherwise -> 0
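A minimal sketch of how the word-level rules and the matrix metric might fit together is given below. It is not the authors' implementation. Sentences are assumed to be represented as binary vectors over their distinct tokens, so that a W b^T reduces to a sum of word scores and |a| is the square root of the number of distinct tokens. The synonym coefficient is passed in as a hypothetical callable `synonym_score`, and the prefix multiplier is left as the parameter `prefix_weight`, defaulting to the value printed on the slide.

```python
import math


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def word_score(a, b, synonym_score=None, prefix_weight=0.0):
    """Score one word pair by the slide's rules, checked in the listed order."""
    if a == b:
        return 1.2 if a[:1].isupper() else 1.0
    if synonym_score is not None:
        s = synonym_score(a, b)  # hypothetical: NPMI, Dice or Jaccard coefficient
        if s:
            return 0.8 * s
    short, long_ = sorted((a, b), key=len)
    if short and short in long_:
        return 0.7 * len(short) / len(long_)
    p = common_prefix_len(a, b)
    if p >= 3:
        return prefix_weight * p / len(short)
    return 0.0


def sentence_similarity(tokens_a, tokens_b, **kwargs):
    """a W b^T / (|a| |b|) for binary vectors over distinct tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a or not set_b:
        return 0.0
    numerator = sum(word_score(x, y, **kwargs) for x in set_a for y in set_b)
    return numerator / (math.sqrt(len(set_a)) * math.sqrt(len(set_b)))
```

With these weights the score is not bounded by 1: capitalized matches contribute 1.2, so two identical headlines full of proper names score above 1.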
![Page 8: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/8.jpg)
Paraphrase classes
Precise paraphrase
◦ КНДР аннулировала договор о ненападении с Южной Кореей.
DPRK annulled the non-aggression treaty with South Korea.
◦ КНДР вышла из соглашений о ненападении с Южной Кореей.
DPRK withdrew from the non-aggression agreements with South Korea.
Loose paraphrase
◦ ВТБ может продать долю в Tele2 в ближайшие недели.
VTB might sell its stake in Tele2 in the coming weeks.
◦ ВТБ анонсировал продажу Tele2.
VTB announced the sale of Tele2.
![Page 9: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/9.jpg)
Paraphrase classes
Loose paraphrase
◦ ВТБ может продать долю в Tele2 в ближайшие недели.
VTB might sell its stake in Tele2 in the coming weeks.
◦ ВТБ анонсировал продажу Tele2.
VTB announced the sale of Tele2.
Non-paraphrase
◦ В главном здании МГУ загорелась столовая.
A canteen caught fire in the main building of MSU.
◦ Из главного здания МГУ эвакуированы около 300 человек.
About 300 people have been evacuated from the main building of MSU.
![Page 10: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/10.jpg)
Annotation
• Each pair of sentences is annotated by at least 3 users
• Pairs annotated by fewer than 4 users with opposite judgements are discarded
![Page 11: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/11.jpg)
ParaPhraser
8072 sentence pairs
3 classes:
◦ 1862 precise (23%)
◦ 3257 loose (40%)
◦ 2953 non-paraphrases (37%)
![Page 12: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/12.jpg)
Linguistic characteristics
![Page 13: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/13.jpg)
Evaluation

| Method | Result, % |
|---|---|
| Unsupervised similarity metric, 2 classes (Precision) | 80.24 |
| Supervised similarity metric, 3 classes (F1) | 60.31 |
| New supervised approach, 3 classes (F1) | 63.94 |
| New supervised approach, 2 classes (F1) | 82.46 |
![Page 14: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/14.jpg)
Second (I)
• Annotated via crowdsourcing: 3 paraphrase classes
• Dear X, please evaluate the similarity in meaning:
◦ In Transnistria, the Stabilization Fund has been created for the needs of the President and the KGB
◦ Transnistria has created the Stabilization Fund at the cost of the Russian gas
• We compare prediction models based on different sentence similarity measures
![Page 15: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/15.jpg)
Second (II)
• We compare prediction models based on different sentence similarity measures:
◦ Shallow (edit distance, longest common subsequence, BLEU, word/character level overlap, etc.)
◦ Semantic (dictionary-based; use semantic resources like WordNet)
◦ Distributional (distributional semantic models)
• We analyze linguistic characteristics of the misclassified sentences
• We analyze the level of agreement between the annotators
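To make the "shallow" group concrete, here is a small sketch of a few such measures: word and character-trigram overlap, token-level longest common subsequence, and character-level edit distance. The feature names, the trigram size and the normalizations are assumptions made for illustration (BLEU is omitted for brevity); the actual feature set used in the project is not given on the slides.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def char_ngrams(text, n=3):
    """Set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(x, y):
    """Jaccard coefficient of two sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def shallow_features(s1, s2):
    """Illustrative shallow similarity features for a sentence pair."""
    l1, l2 = s1.lower(), s2.lower()
    t1, t2 = l1.split(), l2.split()
    return {
        "word_jaccard": jaccard(set(t1), set(t2)),
        "char3_jaccard": jaccard(char_ngrams(l1), char_ngrams(l2)),
        "lcs_ratio": lcs_length(t1, t2) / (max(len(t1), len(t2)) or 1),
        "edit_similarity": 1 - levenshtein(l1, l2) / (max(len(l1), len(l2)) or 1),
    }
```

A feature dictionary of this kind could then be fed to any standard classifier, alone or combined with semantic and distributional features.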
![Page 16: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/16.jpg)
Level of Agreement between the Annotators: Cohen’s Kappa
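For reference, Cohen's kappa for two annotators is κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each annotator's label distribution. A minimal sketch follows; the label encoding (1 for precise, 0 for loose, -1 for non-paraphrase) is an assumption made for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

Since each pair is annotated by several users, the coefficient would be computed per pair of annotators on the sentence pairs they both labelled.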
![Page 17: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/17.jpg)
Percentage of Different Linguistic Phenomena among the Most Disagreed-On Sentence Pairs (Top-100)

| Linguistic Phenomenon | % | Linguistic Phenomenon | % |
|---|---|---|---|
| Different content | 68 | Synonymy | 11 |
| Presupposition | 43 | Context knowledge | 11 |
| Syntactic synonymy | 28 | Different time | 8 |
| Quotation | 22 | Metonymy | 6 |
| Phrasal synonymy | 20 | Transliteration | 3 |
| Reordering | 19 | Metaphor | 2 |
| Numeral | 12 | | |
![Page 18: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/18.jpg)
Evaluation of Different Feature Sets

| Feature set | Precision, % | Recall, % | F1-score, % |
|---|---|---|---|
| shallow | 62.42 | 61.04 | 60.67 |
| semantic | 62.41 | 59.28 | 58.78 |
| distrib | 61.22 | 58.25 | 57.16 |
| distrib + cosine | 60.63 | 57.69 | 56.54 |
| distrib + cosine_ext | 61.02 | 58.17 | 57.22 |
| shallow + semantic | 63.75 | 62.15 | 62.02 |
| shallow + distrib | 63.49 | 61.83 | 61.61 |
| shallow + distrib + cosine | 63.42 | 61.67 | 61.42 |
| shallow + distrib + cosine_ext | 64.04 | 62.39 | 62.19 |
| shallow + semantic + distrib | 63.72 | 62.23 | 62.04 |
| shallow + semantic + distrib + cosine | 64.05 | 62.87 | 62.68 |
| shallow + semantic + distrib + cosine_ext | 65.73 | 63.90 | 63.66 |
| shallow + semantic + cosine | 63.82 | 62.55 | 62.31 |
| shallow + semantic + cosine + ext | 64.42 | 62.95 | 62.72 |
![Page 19: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/19.jpg)
Class 1 (Precise Paraphrases)
Sentence pairs 1-3 show that all three feature types fail to detect pragmatic presupposition, as well as syntactic synonymy combined with word- and phrase-level synonymy and reordering.
In #1, see “convict” and “Supreme Court”: only a convicted person can be subject to a cancelled sentence, and it is the court that is supposed to cancel the sentence.
In #2, “Donbass” as the place of action in the first sentence is also obvious, especially to a Russian speaker, if the action concerns martial law imposed in response to an attack by the militia.
In #3 the precise paraphrase class is disputable. For a naïve Russian speaker, “tourists” might be identical to “Russian tourists” in a news report due to presupposition, especially in a Russian news report; this is of course not a general truth, but a prediction based on the general expectations of our group of annotators.
![Page 20: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/20.jpg)
Class 0 (Loose paraphrases (conveying similar meaning))
Sentence pairs 4-7 show that all three feature types fail to detect presupposition, context, or the communicative structure of the phrase.
#6 is a “difficult” sentence pair: understood metaphorically, the sentences might be considered somewhat similar; however, such an understanding requires a large amount of general knowledge, and it is extremely hard to teach a machine to distinguish such subtle meanings.
Most of the difficulty concerns the different ways in which the communicative structure of the phrase can be divided:
e.g. In Transnistria, the Stabilization Fund has been created / for the needs of the President and the KGB || Transnistria has created the Stabilization Fund / at the cost of the Russian gas.
![Page 21: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/21.jpg)
Class 0 (Loose paraphrases (conveying similar meaning))
The main part of the information structure from the annotators’ point of view is marked in bold, the unimportant part in italics.
This is the reason for the difference between the “true class” (the annotators’ result) and the three predicted classes.
The same holds for #8: “Transnistria has created the Stabilization Fund” and “In Transnistria, the Stabilization Fund has been created” are the most important parts of the phrase structure.
The situation with #9 is similar, but the structure of the phrase is simpler.
What is more important: that Turkish police have detained three sons of ministers, or the reason (that they have been arrested for corruption)?
![Page 22: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/22.jpg)
Some final words
• Our paraphrase corpus is collected from news headlines and can therefore be considered a summarized news stream describing the most important events.
• By building a graph of paraphrases, we can detect such events.
• We construct two types of graphs: one based on the current human annotation and one based on the predictions of the complex model.
• We compare and analyze the structure of the graphs and show that the model graph has larger connected components, which give a more complete picture of the important events than the human annotation graph.
• The predictive model appears to be better than human annotators at capturing full information about the important events in the news collection.
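The graph construction can be sketched as follows: sentences are nodes, every pair judged to be a paraphrase (by the annotators or by the model) is an edge, and each connected component is read as one event. This is a minimal illustration; which classes count as an edge is an assumption left to the caller.

```python
from collections import defaultdict


def connected_components(edges):
    """Connected components of an undirected graph, largest first.

    edges: iterable of (sentence_a, sentence_b) pairs judged to be paraphrases.
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal collecting everything reachable from `start`.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return sorted(components, key=len, reverse=True)
```

Running this once on the annotated edges and once on the predicted edges yields the two sets of components that are compared.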
![Page 23: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/23.jpg)
Corpus splitting
The corpus is divided into two parts:
• the graphs are constructed on the events since 2015
• the model is trained on the part referring to 2013-2014
• the training and testing sets are not chosen randomly (so that possible chains of events are not lost)
• we do not employ any other data (since we need to compare the graphs)
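A chronological split of this kind might look like the sketch below. The record layout (a dictionary with a `date` field) and the exact cutoff of 1 January 2015 are assumptions; the slides only say that the model is trained on 2013-2014 and the graphs are built on events since 2015.

```python
from datetime import date


def split_by_date(pairs, cutoff=date(2015, 1, 1)):
    """Split sentence pairs chronologically instead of randomly.

    pairs: list of dicts with a "date" key (hypothetical schema).
    Returns (training part before the cutoff, graph part from the cutoff on).
    """
    train = [p for p in pairs if p["date"] < cutoff]
    graph = [p for p in pairs if p["date"] >= cutoff]
    return train, graph
```

Keeping the later period intact means that chains of headlines about one event stay together in the part used for the graphs.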
![Page 24: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/24.jpg)
Are the connected components corresponding to the same event really larger in the “model” graph?
Out of 50 connected components (cc) of the “model” graph:
• 42 (84%) are larger than the corresponding components of the “annotators” graph,
• 23 (46%) correspond to 2 or more “annotators” components each,
• 10 (20%) correspond to several “annotators” components each by mistake (false positives).
2 (4%) of the connected components in the “model” graph should have been combined into a single component, but they are not (false negatives).
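The first two counts above can be obtained mechanically by matching components through shared sentences, as in the sketch below; deciding whether a merge is a mistake (a false positive) or a missing merge (a false negative) still requires reading the headlines. The function and field names are hypothetical.

```python
def match_components(model_cc, annot_cc):
    """For each model component, find the annotator components it overlaps.

    model_cc, annot_cc: lists of sets of sentences (connected components).
    """
    # Map every sentence to the index of its annotator component.
    node_to_annot = {node: i for i, comp in enumerate(annot_cc) for node in comp}
    report = []
    for comp in model_cc:
        matched = sorted({node_to_annot[n] for n in comp if n in node_to_annot})
        largest = max((len(annot_cc[i]) for i in matched), default=0)
        report.append({
            "size": len(comp),
            "annot_components": matched,
            "is_larger": len(comp) > largest,            # larger than every match
            "joins_several": len(matched) >= 2,          # merges 2+ components
        })
    return report
```

The shares of components with `is_larger` or `joins_several` set then correspond to the first two percentages on the slide.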
![Page 25: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/25.jpg)
Top-5 connected components of the two graphs

| Annotators | Model |
|---|---|
| Earthquake in Nepal | Earthquake in Nepal + avalanche on Everest (#2 in the annotators’ graph) + a few sentences about other disasters |
| “Immortal regiment” march | “Immortal regiment” march |
| The space truck “Progress M-27M” | The space truck “Progress M-27M” |
| Evacuation from Nepal by a Russian aircraft | Evacuation from Nepal by a Russian aircraft |
| Elections in Kazakhstan | Elections in Kazakhstan |
![Page 26: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/26.jpg)
Results I
• Connected components (“cc”) in the “model” graph are larger than cc in the “annotators” graph
• The “model” cc are usually formed by joining several “annotators” cc referring to the same topic
• Central nodes often stay the same from graph to graph (the shortest and simplest sentences are likely to have the largest node degrees)
• In general, the “model” cc give a more complete picture of the described events
![Page 27: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/27.jpg)
Results II
• Because it relies on n-gram overlap, the “model” can join completely unrelated events together (for example, fire in Orel, evacuation of people from Mi-8 and clashes in Peru in the 4th largest “annotators” connected component)
• Sometimes “model” components miss some nodes; for example, the component with headlines about “Progress M-27M” lacks one node which is present in the “annotators” graph
![Page 28: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/28.jpg)
Conclusion
We create a publicly available Russian sentential paraphrase corpus
The corpus can be applied for multiple purposes
The corpus is aimed at news texts with heterogeneous information structure
The corpus can be considered a news stream describing the most important events occurring in the world
![Page 29: AINL 2016: Yagunova](https://reader036.vdocuments.pub/reader036/viewer/2022062905/58749dd51a28abfc5f8b6a23/html5/thumbnails/29.jpg)
THANK YOU FOR YOUR ATTENTION
and join us at ParaPhraser.ru!