dr. s. arunmozhi1
TRANSCRIPT
-
8/2/2019 Dr. S. Arunmozhi1
1/46
Selvaraj [email protected]
Dravidian University
-
24-Jan-2012
SRM University
Lexical Resources
In recent years, lexical resources, both monolingual and multilingual,
have become more readily available.
Information about words and their relations, both within and across languages,
has become richer and more easily exploitable in various applications.
-
Parallel corpora aligned at word level have created possibilities for analyzing translational correspondences and deriving lexical relations within and across languages by means of new computational methods such as Semantic Mirrors.
Furthermore, unstructured texts, such as ordinary web materials, can be mined in different ways by tools such as SketchEngine in order to fully automatically derive overviews of how lexical items behave in context.
-
TDIL Initiatives
MCIT started TDIL (Technology Development for Indian Languages) in 1991
To develop information processing tools and techniques
To facilitate human-machine interaction without language barriers
To create and access multilingual knowledge resources and integrate them to develop innovative user products and services
-
Basic tools for Indian languages
Software tools and fonts for all 22 Indian languages have
been released in the public domain
Software tools for enabling the linguistic community in the digital age
www.ildc.in
-
Ongoing projects in Consortium mode
English-IL MT system
- system
On-line handwriting recognition system
-
Speech Corpora/Technologies
Language Corpora
-
Lexical Resources
WordNet
Corpora
-
WordNet
WordNets are being used in word sense disambiguation,
information extraction and information
retrieval, among other applications.
Over 60 WordNets have been developed around the world.
Typologically different languages have faced challenges in adapting the original model and linking WordNets across languages.
-
What is WordNet?
A large lexical database, or electronic dictionary
Covers most English nouns, verbs, adjectives, adverbs
Electronic format makes it amenable to automatic manipulation
(... and sorting, machine translation, ...)
-
What's so special about WordNet?
Traditional paper dictionaries are organized alphabetically,
so words that are grouped together (on the same page) are unrelated
WordNet is organized by meaning,
so words in close proximity are related
Users can browse WordNet and find words related to their queries (like in a thesaurus)
-
Basic Design of WN
WordNet entries are word-concept mappings
Natural languages map many-to-many:
One concept can be expressed by many words (synonymy):
{car, auto, automobile}
{close, shut}
-
One word can express many concepts
{club, stick}
{club, nightclub}
{club, playing card}
The words we use most frequently are the most
polysemous (have the most meanings)!
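The many-to-many mapping between words and concepts can be sketched in a few lines of Python. This is a toy illustration only, not WordNet's actual storage format: the concept labels (`motor_vehicle`, `heavy_stick`, `night_spot`, `card_suit`) and the function names are invented for the example, while the word sets come from the slides.

```python
from collections import defaultdict

# Toy synsets: hypothetical concept label -> words that express it.
SYNSETS = {
    "motor_vehicle": {"car", "auto", "automobile"},
    "heavy_stick": {"club", "stick"},
    "night_spot": {"club", "nightclub"},
    "card_suit": {"club", "playing card"},
}

# Invert the mapping: word -> set of concepts it can express.
WORD_TO_CONCEPTS = defaultdict(set)
for concept, words in SYNSETS.items():
    for word in words:
        WORD_TO_CONCEPTS[word].add(concept)


def synonyms(concept):
    """All words that express one concept (synonymy), sorted alphabetically."""
    return sorted(SYNSETS[concept])


def senses(word):
    """All concepts one word can express (polysemy), sorted alphabetically."""
    return sorted(WORD_TO_CONCEPTS.get(word, set()))
```

Here `synonyms("motor_vehicle")` lists the three words for the car concept, and `senses("club")` lists the three concepts that the single word "club" can express.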
-
WordNet handles synonymy and polysemy
Represents words and concepts unambiguously
Meaningfully relates words and concepts
-
WordNet's building blocks: sets of synonyms (synsets)
{hit, beat}
{queue, line}
Each synset expresses a distinct concept.
Currently, WordNet contains approximately 117,000 synsets
-
WordNet stores, and allows one to retrieve,
all words that express a given concept
meaning-based relations
Result: a large semantic network
(as opposed to a flat list in a paper dictionary)
-
Relations among noun synsets
Hyperonymy/hyponymy relates super/subordinate synsets (denoting more/less general concepts):
                    {vehicle}
                   /         \
    {car, automobile}         {bicycle, bike}
       /         \                  \
{convertible}   {SUV}          {mountain bike}
Transitivity: A car is a kind of vehicle
An SUV is a kind of car
=> An SUV is a kind of vehicle
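The transitivity of the "is a kind of" relation can be sketched by walking hypernym links upward. This is a minimal illustration using the noun hierarchy from this slide; each synset is abbreviated to one word, and the dictionary layout and function names are assumptions made for the example, not WordNet's own interface.

```python
# Direct hypernym (superordinate) of each toy synset, child -> parent.
HYPERNYM = {
    "car": "vehicle",
    "bicycle": "vehicle",
    "convertible": "car",
    "SUV": "car",
    "mountain bike": "bicycle",
}


def hypernym_chain(synset):
    """Follow hypernym links upward; return every ancestor, nearest first."""
    chain = []
    while synset in HYPERNYM:
        synset = HYPERNYM[synset]
        chain.append(synset)
    return chain


def is_a_kind_of(specific, general):
    """True if `general` is reachable from `specific` via hypernym links."""
    return general in hypernym_chain(specific)
```

Only the direct links "SUV -> car" and "car -> vehicle" are stored, yet `is_a_kind_of("SUV", "vehicle")` holds because the chain is followed through both steps.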
-
Relations among noun synsets
Meronymy/holonymy (part/whole)
      {car, automobile}
              |
           {engine}
           /      \
{spark plug}      {cylinder}
Inheritance: A car has an engine
An engine has spark plugs
=> A car has spark plugs
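Part/whole inheritance can be sketched the same way, by collecting the parts of parts recursively. The `PARTS` table and the `all_parts` helper are hypothetical names for this toy example; the part/whole facts are the ones on the slide.

```python
# Direct parts (meronyms) of each toy synset, whole -> list of parts.
PARTS = {
    "car": ["engine"],
    "engine": ["spark plug", "cylinder"],
}


def all_parts(whole):
    """Return direct and inherited parts of `whole`, depth-first."""
    found = []
    for part in PARTS.get(whole, []):
        found.append(part)
        found.extend(all_parts(part))
    return found
```

`all_parts("car")` includes "spark plug" even though only the engine is listed as a direct part of the car.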
-
Relations among verb synsets
Verbs denote events
     {communicate}
           |
        {talk}
        /     \
{stammer}     {whisper}
-
Semantics of events (verbs) are very different
WordNet captures this fact with different relations
Relations refer to temporal properties of events:
partial and complete overlap of two events
prior or posterior events
-
Relations among synsets create an interconnected network
Different senses of polysemous words are members of distinct synsets that are related to different synsets, i.e. occupy different locations in the network
e.g., {stock, broth} has superordinate synset {dish}
{stock, breed} has superordinate {variety}
These different synsets are also linked to ...
-
A word's meaning can be defined in terms of its position in the network
club 1 is a kind of association, has members
club 2 is a kind of stick
Relatedness between words or synsets can be
quantified in terms of path length (number of connections among synsets)
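Path length can be computed with a breadth-first search over the links between synsets. The sketch below is a toy under stated assumptions: the graph is tiny, sense labels such as `club (stick)` and the `animal` node are invented for illustration, and intermediate levels of the real hierarchy are omitted, so the numbers are not WordNet's actual distances.

```python
from collections import deque

# Toy hypernym links (child -> parent). Labels are hypothetical.
HYPERNYM = {
    "zebra": "equine",
    "horse": "equine",
    "equine": "animal",
    "club (association)": "association",
    "club (stick)": "stick",
}


def neighbours(node):
    """Synsets one link away, following hypernym links in both directions."""
    linked = {child for child, parent in HYPERNYM.items() if parent == node}
    if node in HYPERNYM:
        linked.add(HYPERNYM[node])
    return linked


def path_length(start, goal):
    """Number of links on the shortest path, or None if no path exists."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

A short path (two links between "zebra" and "horse" in this toy graph) indicates close relatedness, while `None` indicates that the two synsets sit in parts of the network that are not connected.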
-
How closely related are {zebra} and {horse}?
Very: Both share the direct superordinate {equine}
What about {horse, sawhorse} and {horse, gymnastic horse}?
Related, but remote: joint superordinate {artifact}
is 4-5 levels up
What about {zebra} and {horse, gymnastic horse}?
Unrelated: the trees containing them never intersect
-
Word sense disambiguation (WSD) is a major problem in Natural Language Processing
Assumption: words in a context (phrase, sentence, discourse) are semantically related
e.g., horse in the neighborhood of zebra is likely
to mean equine;
in the neighborhood of gym it likely means gymnastic horse.
If you want to disambiguate horse in the
context of zebra, look for the shortest paths from zebra to horse.
The shortest path points to the intended sense of horse.
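The path-based idea can be sketched as: measure the distance from the context word to each candidate sense and keep the nearest one. Everything below is a toy under stated assumptions: the sense labels (`horse#equine`, `horse#gymnastic`), the links, and the function names are invented for illustration and are not taken from WordNet.

```python
from collections import deque

# Toy undirected links between synsets (hypothetical labels and relations).
LINKS = [
    ("zebra", "equine"),
    ("horse#equine", "equine"),
    ("horse#gymnastic", "gymnastic apparatus"),
    ("gym", "gymnastic apparatus"),
]

GRAPH = {}
for a, b in LINKS:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)

# Candidate senses for each ambiguous word.
SENSES = {"horse": ["horse#equine", "horse#gymnastic"]}


def distance(start, goal):
    """Shortest number of links between two nodes; infinity if unconnected."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in GRAPH.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")


def disambiguate(word, context_word):
    """Pick the sense of `word` closest to `context_word` in the graph."""
    return min(SENSES[word], key=lambda sense: distance(context_word, sense))
```

With "zebra" as context the equine sense is chosen, and with "gym" the gymnastic sense. If no sense is reachable, `min` simply returns the first listed sense, so a real system would need an explicit fallback.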
-
Freely downloadable:
http://wordnet.princeton.edu
-
WordNets around the world
Currently, WordNets exist for some 60 languages, including ...,
Estonian, Hebrew, Icelandic, Italian, Kannada,
Latvian, Persian, Romanian, Sanskrit, Tamil, Telugu, Thai, Turkish, Urdu, ...
Global WordNet Association
http://www.globalwordnet.org
-
WordNets in Indian Languages
Pioneer: Hindi WordNet
Other Indian languages under construction
North-East WordNet
Indradhanush
Bengali, Gujarati, Kashmiri, Konkani, Oriya,
Punjabi, Urdu
-
Dravidian WordNet
Tamil (Tamil University), Telugu (Dravidian,
Viswavidyalayam), Kannada (University of Mysore)
Funding Agency: DIT
Budget: 152 lakhs
Time frame: 24 months
Starting Date: 26-12-2011
-
Work already done
Tamil WordNet -
Tamil Virtual University
Available for download from www.nrcfoss.in
Dravidian WordNet
11000 synsets developed
Available online from http://www.cfilt.iitb.ac.in/indowordnet
-
IndoWordNet
Collaborative effort to develop/link all Indian language WordNets
Foundation of WordNet construction:
Source: Hindi WordNet
Expansion Approach
-
Three Principles
Minimality
Capture the minimal set of the words in the synset which uniquely identifies
the concept. For example
{family, house} uniquely identifies a concept
(e.g. he is from the house of the King of Jaipur).
-
Coverage
The principle then stresses the completion of the synset, i.e., capturing ALL the words that stand
for the concept (e.g., {family, house, household, ménage} completes the synset).
Within the synset the words should be ordered according to their frequency in the corpus.
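Ordering the words of a synset by corpus frequency can be sketched with a simple counter. The function name `order_synset` is hypothetical, and the token list in the usage note is an invented toy corpus, so the resulting order says nothing about real frequencies.

```python
from collections import Counter


def order_synset(words, corpus_tokens):
    """Return the synset's words sorted by descending corpus frequency.

    Words that never occur count as zero. Ties keep their input order
    because Python's sort is stable.
    """
    counts = Counter(corpus_tokens)
    return sorted(words, key=lambda word: -counts[word])
```

For example, with the toy token list `"the family moved house and the family sold the house to another family household".split()`, the synset `["house", "ménage", "family", "household"]` is reordered so that "family" comes first and the unseen "ménage" comes last.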
-
Replaceability
The most common words in the synset,
i.e., words towards the beginning of the synset, should be able to replace one another in the example sentence associated with the synset
-
Some Statistics on IndoWordNet
Assamese     3530 / 19609
Bengali      8679 / 18563
Bodo         3837 / 13357
Gujarati      970 / 2125
Hindi       33900 / 82000
Kannada      5920 / 7344
Malayalam    6154 / 8622
Manipuri     2744 / 5231
Marathi      9739 / 21223
Nepali      5 0 2 0 27
Sanskrit     3340 / 17820
Tamil        4750 / 9821
Telugu      10639 / 18250
Urdu         6123 / 9641
-
Corpora
-
Indian Languages Corpora Initiative
The Indian Languages Corpora Initiative (ILCI)
is a research project for technology
development for Indian languages.
Special Centre for Sanskrit Studies of Jawaharlal Nehru University
is coordinating this national project and is the consortium leader of the ILCI project.
-
Consortium Members
Punjabi University for Punjabi
JNU (Center for Indian languages) for Urdu
ISI Kolkata for Bangla
Utkal University for Oriya
IIT Mumbai for Marathi
Gujarat University for Gujarati
Dravidian University for Telugu
Tamil University for Tamil
IITM-K Trivandrum for Malayalam
Goa University for Konkani
Each consortium member will develop corpora
and standards in their respective languages.
-
The main objective
Parallel annotated corpora (11 Indian languages along with English) with
standards for 12 major Indian languages including English in the domain of tourism and health.
Major aims of the project are
build parallel corpora in the domain of tourism and health (Hindi-English and Hindi-Indian languages) &
annotate (label) the parallel corpora.
-
Aims
Evolving Draft Standards includes evaluation of
as part of various projects under Technology
Development in Indian Languages (TDIL), and evaluating existing standards for their usability.
Standards for corpora collection, for corpora
The task of Corpora development includes corpora collection in Hindi, parallel corpora in 11 Indian languages and parallel corpora in English.
-
The basic starting point for this project is a list of 50,000 Hindi sentences used in the tourism and health domain.
A list of data source institutions including Tourism and Health departments was made to collect data for Hindi.
... the given 11 Indian languages and English has been created as per the standards evolved.
... English are almost completed as per the BIS standards
-
50 K sentences from Hindi into Telugu were translated
25 K each in tourism and health domain
based on BIS-POS Tagset
Will be ready by 1st Jan and
Will be made available online from www.tdil.gov.in
-
Tools developed
Corpora Annotation Tool en er
Stemmer
Frequency list builder
-
ILCI-Phase II
Major aims of the project are:
Corpora collection for source language
target languages
Corpora annotation of parallel corpora in 23 languages
Agriculture and Culture domains (in addition to tourism and health)
More than 10 million word corpus to be developed
-
Budget
1049.26 lakhs - 10 crores, 49 lakhs and 26 thousand
... partners ... lakhs
New partners 60.38 lakhs
,Communications and IT, GoI.
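The conversion behind the total budget figure can be checked with a short calculation, assuming the amount is quoted in lakhs (1 crore = 100 lakhs, 1 lakh = 100 thousand). The helper name `split_lakhs` is hypothetical; the amount is passed as a string so that `Decimal` keeps the digits exact.

```python
from decimal import Decimal


def split_lakhs(amount_in_lakhs):
    """Split an amount given in lakhs into (crores, lakhs, thousands)."""
    amount = Decimal(amount_in_lakhs)
    crores, rest = divmod(amount, Decimal(100))  # 1 crore = 100 lakhs
    lakhs, fraction = divmod(rest, Decimal(1))   # whole lakhs and remainder
    thousands = fraction * 100                   # 1 lakh = 100 thousand
    return int(crores), int(lakhs), int(thousands)
```

`split_lakhs("1049.26")` gives 10 crores, 49 lakhs and 26 thousand, which matches the breakdown stated on the slide.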
-
New languages in ILCI-PII
Maithili
Kannada
Sanskrit
Dogri
Sindhi
Assamese
Manipuri
Nepali
Bodo
-
Advertisement
M.Sc in Computational Linguistics
Dravidian University
Under UGC's Innovative Programme
-
Thank you for your kind attention!