RDF and other linked data standards: how to make use of big localization data

Dave Lewis, FEISGILTT, Vancouver, 29th Oct 2014


Page 1

RDF and other linked data standards: how to make use of big localization data
Dave Lewis, FEISGILTT Vancouver, 29th Oct 2014

Page 2

Problem – Data Management

- Language assets (TM and TB) are scattered across the localisation pipeline
  - in different versions
  - at different stages of quality
  - in different 'silos'
  - challenging to pool and clean these resources
- Difficult to share and search language resources within and across organisations
  - impacts the consistency, cost and quality of translation
- Merging and extending language resources is complex
  - makes leveraging new resources extremely difficult

Page 3

Why RDF/Linked Data?

- Creates relationships (links) between data, i.e. linked data
- Easier to integrate and leverage language resources regardless of their format, where they are stored, or who owns them
  - saves money
- Easier to search and analyse language resources
  - saves time in finding the most suitable resources for your projects
- Enriches language resources with additional meaning
  - allows them to be better used
- Easy tracking of provenance
  - helps manage versioning
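As a rough illustration of the linking and provenance points above, the sketch below reduces linked data to plain (subject, predicate, object) tuples and walks a revision chain for one translation memory segment. `prov:wasRevisionOf` is a property from the W3C PROV vocabulary; every `example.org` URI, the `ex:` predicates and the helper name `revision_history` are assumptions made up for this example.

```python
# Triples are (subject, predicate, object) tuples.
# All example.org URIs and ex: predicates are hypothetical.
TRIPLES = [
    ("http://example.org/tm/seg-42/v3", "prov:wasRevisionOf", "http://example.org/tm/seg-42/v2"),
    ("http://example.org/tm/seg-42/v2", "prov:wasRevisionOf", "http://example.org/tm/seg-42/v1"),
    ("http://example.org/tm/seg-42/v3", "ex:usesTerm", "http://example.org/tb/term-7"),
    ("http://example.org/tb/term-7", "ex:ownedBy", "http://example.org/org/vendor-a"),
]


def revision_history(triples, resource):
    """Follow prov:wasRevisionOf links from a resource back to its first version."""
    previous = {s: o for s, p, o in triples if p == "prov:wasRevisionOf"}
    history = [resource]
    while history[-1] in previous:
        older = previous[history[-1]]
        if older in history:  # stop if the data contains a cycle
            break
        history.append(older)
    return history


# Newest version first, oldest last.
history = revision_history(TRIPLES, "http://example.org/tm/seg-42/v3")
```

Because the versions are just linked resources, the same data can also answer which term base entry a given version used, without a separate versioning system.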

Page 4

Linked Data for Language Technology

- Open Data on the Web: W3C Semantic Web standards for data published on the Web
  - Fine-grained inter-linking of data "cells": URL
  - Extensible metadata: Resource Description Framework (RDF)
  - Standard query API: SPARQL
- LIDER Project:
  - Stakeholder data needs for LT and language resources
  - Best practices and guidelines to apply linked data
- Existing open data vocabularies
  - Lexical-conceptual data: LEMON vocabulary
  - Encyclopedic: DBpedia
  - Resource metadata, licensing, annotation, provenance, etc.
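To give a feel for what a SPARQL query does over RDF data, here is a minimal in-memory sketch of triple-pattern matching, with `None` playing the role of a query variable. It is not a SPARQL engine; the `ex:` identifiers and the sample data are hypothetical, and the use of `dct:language` with a plain language code is a simplification for illustration.

```python
def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None behaves like a SPARQL variable."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# Hypothetical term base data expressed as triples.
DATA = [
    ("ex:term/network", "rdfs:label", "network"),
    ("ex:term/network", "dct:language", "en"),
    ("ex:term/red", "rdfs:label", "red"),
    ("ex:term/red", "dct:language", "es"),
]

# Roughly equivalent to:
#   SELECT ?label WHERE { ?term dct:language "es" . ?term rdfs:label ?label }
spanish_labels = [
    label
    for term, _, _ in match(DATA, p="dct:language", o="es")
    for _, _, label in match(DATA, s=term, p="rdfs:label")
]
```

The second pattern reuses the `term` bound by the first, which is the join that a SPARQL basic graph pattern expresses declaratively.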

Page 5

Linguistic Linked Data: Lexicons

[Diagram: lexicon graph for the Spanish word "red". Labels recoverable from the figure: lexical entry "Red"; Form, singular [RED], with phonetic form and number; Form, plural [REDES], with phonetic form and number; Form with written form "red" and gender feminine; Sense with written form "red"; Sense with written form "malla", linked as equivalent; image; translation es-en between senses with written forms "red" and "network".]
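The lexicon structure on this slide can be approximated with a few small data classes. This is only a sketch in the spirit of the lemon model, not its actual vocabulary: the class and field names are assumptions, and the sense identifiers are invented, while the forms, the equivalent "malla" and the es-en translation "network" come from the diagram labels.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Form:
    written: str
    number: Optional[str] = None  # e.g. "singular" or "plural"
    gender: Optional[str] = None


@dataclass
class Sense:
    sense_id: str  # hypothetical identifier, not from the slide
    translations: Dict[str, str] = field(default_factory=dict)  # language code -> written form
    equivalents: List[str] = field(default_factory=list)


@dataclass
class LexicalEntry:
    lemma: str
    language: str
    forms: List[Form] = field(default_factory=list)
    senses: List[Sense] = field(default_factory=list)

    def translate(self, target_lang: str) -> List[str]:
        """Collect the written forms of all senses translated into target_lang."""
        return [s.translations[target_lang] for s in self.senses if target_lang in s.translations]


red = LexicalEntry(
    lemma="red",
    language="es",
    forms=[
        Form("red", number="singular", gender="feminine"),
        Form("redes", number="plural", gender="feminine"),
    ],
    senses=[
        Sense("red-sense-1", equivalents=["malla"]),
        Sense("red-sense-2", translations={"en": "network"}),
    ],
)
```

In a linked data setting each entry, form and sense would carry its own URI, so that other datasets can point at an individual sense rather than at the whole word.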

Page 6

Use Case: BabelNet.org

Page 7

Data Challenges in Using LT

- Language Technology is statistical
- Quality is limited by the distance between the training data and the job at hand
- Training data is the key asset for LT
  - E.g. for L10n it's Translation Memories and Term Bases
- Challenges for managing training data
  - Discover
  - Select
  - Curate
  - Share/Pool/Sell
  - Understanding quality
  - Measure impact on productivity
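One crude way to make "distance between training data and job at hand" concrete is vocabulary overlap. The sketch below uses Jaccard distance over word sets as a stand-in; this is an assumption chosen for simplicity, not a measure proposed in the talk, and the resource names and sample sentences are invented.

```python
import re


def vocabulary(texts):
    """Lower-cased set of word tokens found in a list of strings."""
    words = set()
    for text in texts:
        words.update(re.findall(r"\w+", text.lower()))
    return words


def jaccard_distance(a, b):
    """0.0 for identical vocabularies, 1.0 for disjoint ones."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def rank_resources(job_texts, resources):
    """Order resource names from closest to furthest from the job's vocabulary."""
    job_vocab = vocabulary(job_texts)
    scored = [(jaccard_distance(job_vocab, vocabulary(segments)), name)
              for name, segments in resources.items()]
    return [name for _, name in sorted(scored)]


# Hypothetical job and candidate translation memories.
JOB = ["Restart the router to reset the network connection."]
RESOURCES = {
    "legal_tm": ["The licensee shall indemnify the licensor."],
    "networking_tm": ["Reset the router.", "Check the network connection."],
}

ranked = rank_resources(JOB, RESOURCES)
networking_distance = jaccard_distance(vocabulary(JOB), vocabulary(RESOURCES["networking_tm"]))
```

A production selection step would more likely compare n-grams, terminology or language-model scores, but the shape of the problem (score each candidate resource against the job, then select) stays the same.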

Page 8

Active Curation: Managing LT/LR Lifecycle

[Diagram labels: Language Workers; Language Technology; Language Resources; ACTIVE CURATION]

Page 9

Use Case: FALCON Project

[Diagram labels]
- Tool Chain: Website translation; Translation Management; Terminology Management
- Language Technology: Machine Translation; Term Identification
- Linked Data: Parallel Text; Terms: Lexical-conceptual
- XLIFF + ITS 2.0

- Building the Localization Web = Decentralised Annotated Global Translation Memory and Term Base
- Terms and translations become linkable resources
- Metadata from the L10n tool chain adds value
- Use in training Machine Translation and Text Analytics

Page 10

FALCON Demo: Locworld Expo

[Diagram labels]
- Components: Web Site Translation; Translation Management System; Terminology Management System; Machine Translation; Federated L3Data Platform; Translation Management; Text Analysis
- Layers: Localization Tool chain; Language Technology; Language Resources
- Other labels: Public Resources; DCU; TCD

Page 11

Words as Resources on the Web

The Web of Content
- http://www.ex.org/obama_en.html
  "Barack Obama is the 44th president of the United States of America. He was first elected in 2008."
- http://www.ex.org/obama_es.html
  "Barack Obama es el 44 º presidente de los Estados Unidos de América. Ha fue electo primera vez en 2008."

The Localization Web

Translation Data
- http://data.ex.org/String_0001 (Derived From the English page)
  Text: "Barack Obama is the 44th president of the United States of America." Lang: en
- http://data.ex.org/String_0002 (Derived From the Spanish page; Translated From String_0001)
  Text: "Barack Obama es el 44 º presidente de los Estados Unidos de América." Lang: es TranslatedBy: Google Translate

Terminology Data
- Term: "United States of America." Lang: en
- Term: "Estados Unidos de América." Lang: es
- The two terms are linked by Translation Of
- http://babelnet.org/345621
- http://babelnet.org/57835

Encyclopaedic Data
- http://Dbpedia.org/Page/Barak_Obama
  Topic: Barack Obama Lang: en BirthDate: 1961-08-04 Spouse: Michelle Obama Residence: White House
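The string resources on this slide could be written out as N-Triples, the simplest line-based RDF syntax. The sketch below is one possible rendering under stated assumptions: the `http://data.ex.org/vocab#` namespace and the property names `derivedFrom`, `translatedFrom` and `text` are invented for illustration, and only the resource URLs and the English text come from the slide.

```python
VOCAB = "http://data.ex.org/vocab#"  # hypothetical vocabulary namespace


def iri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(text, lang):
    """Quote a string as a language-tagged literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"@' + lang


STATEMENTS = [
    (iri("http://data.ex.org/String_0001"), iri(VOCAB + "derivedFrom"),
     iri("http://www.ex.org/obama_en.html")),
    (iri("http://data.ex.org/String_0001"), iri(VOCAB + "text"),
     literal("Barack Obama is the 44th president of the United States of America.", "en")),
    (iri("http://data.ex.org/String_0002"), iri(VOCAB + "derivedFrom"),
     iri("http://www.ex.org/obama_es.html")),
    (iri("http://data.ex.org/String_0002"), iri(VOCAB + "translatedFrom"),
     iri("http://data.ex.org/String_0001")),
]


def to_ntriples(statements):
    """Serialise statements one per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in statements)


ntriples = to_ntriples(STATEMENTS)
```

Since every segment has its own URI, a terminology or encyclopaedic dataset hosted elsewhere can link to `String_0001` directly, which is the sense in which the slide treats words as resources on the Web.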

Page 12

L10n Use Case: Closing the Loop

- Active Curation: systematic harvesting of LT-ready TM and TB from the localization tool chain
- Data and tools to optimise process flow:
  - Prioritize segments for postediting and input to incremental MT retraining
  - Target postedits to extract target terms and new morphologies
- Postediting instrumentation:
  - Postedit time and resource use (terms, concordance) vs. automation of MT metrics
  - iOmegaT: instrumented open source CAT tool
- LREC, AMTA, EDF, MLW, LocFoc, Multilingual, EdMedia, FEISGILTT
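Prioritizing segments for postediting can be pictured as a simple scoring and sorting step. The heuristic below (how often a segment occurs, multiplied by how unsure the MT system is) is purely an assumption for illustration and is not described in the talk; the `Segment` fields, the confidence scale and the sample data are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    source: str
    mt_confidence: float  # hypothetical quality estimate: 0.0 (poor) to 1.0 (good)
    occurrences: int      # how often the segment appears in the job


def postedit_priority(segment):
    """Higher score = more worth a post-editor's time (frequent and low confidence)."""
    return segment.occurrences * (1.0 - segment.mt_confidence)


def prioritise(segments, budget):
    """Return the `budget` highest-priority segments, best candidates first."""
    return sorted(segments, key=postedit_priority, reverse=True)[:budget]


SEGMENTS = [
    Segment("Click Save.", mt_confidence=0.9, occurrences=10),
    Segment("Reset the network adapter.", mt_confidence=0.4, occurrences=5),
    Segment("See the appendix.", mt_confidence=0.5, occurrences=1),
]

queue = prioritise(SEGMENTS, budget=2)
```

The postedited output of the selected segments would then be the input to incremental MT retraining, closing the loop the slide describes.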

Page 13

Research and Innovation Roadmap

- https://www.w3.org/community/ld4lt/wiki/Linguistic_Linked_Data_for_Content_Analytics:_a_Roadmap

[Diagram labels: Global Customer Engagement Use Cases; Public Sector and Civil Society Use Cases; Linguistic Linked Data Life Cycle and Value Network Requirements]

Page 14

Best Practices for Multilingual Linked Open Data

- Linguistic vocabularies
- Resource-specific vocabularies
- Best Practices for Multilingual Linked Data
  - Practices for naming
  - Practices for dereferencing
  - Practices for textual information
  - Practices for linking
  - Identification of languages
  - DataID
  - OWL Metamodel for Language Resources
  - License Ontology
- Guidelines for Converting WordNets to Linked Data
- Guidelines for Linguistic Linked Data Generation: Multilingual Knowledge Bases
- Guidelines for Linguistic Linked Data Generation: Bilingual Dictionaries
- Guidelines for Converting TBX into Linked Data
- Guidelines for NIF-based NLP Services
- Comparison of Repositories

Page 15

Dublin Workshop Session

- There is a need for a common API to text analysis services, live update of linked data sources, user feedback mechanisms, and annotation relevance indicators.
- "Too much information is no information": linked data information can help the translator only if it does not lead to an information overflow.
- A stand-off annotation mechanism is needed to deal with annotation overlap. NIF could be a solution.
- For the localization industry, licensing metadata is of key importance. Only with such metadata can one also work with internal (i.e. closed) linked data.
- Terminology and linked data is a hot topic that is also discussed in the LD4LT group. Currently there is no standard mapping of the TBX format to RDF.
- Bitext (i.e. aligned text of a source and one or several translations) could be exposed as linked data, as an alternative to TMX.
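The stand-off point can be illustrated with annotations that live outside the text and refer to it by character offsets, loosely in the spirit of offset-based schemes such as NIF. This is a sketch, not the NIF data model: the `Annotation` class, the annotation kinds and the `example.org` URIs are assumptions, and the sentence is borrowed from the earlier slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    begin: int  # inclusive character offset into the text
    end: int    # exclusive character offset into the text
    kind: str
    value: str  # hypothetical target URI


TEXT = "Barack Obama is the 44th president of the United States of America."


def annotate(text, phrase, kind, value):
    """Create a stand-off annotation for the first occurrence of phrase."""
    begin = text.index(phrase)
    return Annotation(begin, begin + len(phrase), kind, value)


def covered_text(text, annotation):
    """Recover the annotated substring from the untouched source text."""
    return text[annotation.begin:annotation.end]


def overlapping(annotations):
    """Return every pair of annotations whose character ranges intersect."""
    pairs = []
    for i, a in enumerate(annotations):
        for b in annotations[i + 1:]:
            if a.begin < b.end and b.begin < a.end:
                pairs.append((a, b))
    return pairs


ANNOTATIONS = [
    annotate(TEXT, "United States of America", "term", "http://example.org/term/usa"),
    annotate(TEXT, "United States", "entity", "http://example.org/entity/united-states"),
    annotate(TEXT, "Barack Obama", "entity", "http://example.org/entity/obama"),
]

overlaps = overlapping(ANNOTATIONS)
```

Inline markup would struggle to nest "United States" inside "United States of America" when the two come from different tools; with stand-off ranges both annotations coexist and the text itself is never modified.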

Page 16

More Information

- Contact: [email protected]
  - http://www.falcon-project.eu
- Lider: best practices and roadmap
  - http://www.lider-project.eu/
- See also:
  - Linked Data for Language Technology (LD4LT) W3C Community Group
    - http://www.w3.org/community/ld4lt/
  - Best Practice in Multilingual Linked Open Data
    - http://www.w3.org/community/bpmlod/
  - OntoLex Community Group
    - http://www.w3.org/community/ontolex/