RDF and other linked data standards: how to make use of big localization data
Dave Lewis, FEISGILTT, Vancouver, 29th Oct 2014
Problem – Data Management
— Language assets (translation memories, TM, and term bases, TB) are scattered across the localisation pipeline:
– different versions
– different stages of quality
– in different ‘silos’
– challenging to pool & clean these resources
— Difficult to share and search language resources within and across organisations
– impacts on consistency, cost & quality of translation
— Merging and extending language resources is complex
– makes leveraging new resources extremely difficult
Why RDF/Linked Data?
— Creates relationships (links) between data, i.e. linked data
— Easier to integrate and leverage language resources regardless of their format, where they are stored, or who owns them
– saves money
— Easier to search & analyse language resources
– saves time in finding the most suitable resources for your projects
— Enriches language resources with additional meaning
– allows them to be better used
— Easy tracking of provenance
– helps manage versioning
Linked Data for Language Technology
— Open Data on the Web: W3C Semantic Web standards for data published on the Web
– Fine-grained inter-linking of data “cells”: URL
– Extensible meta-data: Resource Description Framework (RDF)
– Standard query API: SPARQL
— LIDER Project:
– Stakeholder data needs for LT and language resources
– Best practices and guidelines to apply linked data
— Existing open data vocabularies
– Lexical-conceptual data: LEMON vocabulary
– Encyclopedic: DBpedia
– Resource meta-data, licensing, annotation, provenance etc.
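The URL, RDF and SPARQL ideas above can be illustrated with a minimal sketch. It assumes a made-up namespace (`http://data.example.org/`) and made-up property names (`sourceText`, `targetText`, `license`); none of these come from the talk or from a published vocabulary. It uses plain Python tuples rather than an RDF library, so the `match` helper only imitates the pattern-matching style of a SPARQL query.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace and property names, for illustration only.
EX = "http://data.example.org/"

# Each statement is a (subject, predicate, object) triple.
# Objects are either URLs or quoted literals with a language tag.
triples: List[Triple] = [
    (EX + "tm/unit1", EX + "sourceText", '"network"@en'),
    (EX + "tm/unit1", EX + "targetText", '"red"@es'),
    (EX + "tm/unit1", EX + "license", EX + "licenses/internal"),
    (EX + "tm/unit2", EX + "sourceText", '"mesh"@en'),
    (EX + "tm/unit2", EX + "targetText", '"malla"@es'),
]


def match(graph: List[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard,
    much like an unbound variable in a SPARQL triple pattern."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def to_ntriples(graph: List[Triple]) -> str:
    """Serialise the graph in an N-Triples-like line format."""
    lines = []
    for s, p, o in graph:
        obj = o if o.startswith('"') else "<" + o + ">"
        lines.append("<" + s + "> <" + p + "> " + obj + " .")
    return "\n".join(lines)


# "Find every target text", regardless of which unit it belongs to.
spanish_targets = [o for _, _, o in match(triples, p=EX + "targetText")]
print(spanish_targets)
print(to_ntriples(match(triples, s=EX + "tm/unit1")))
```

Because every "cell" has its own URL, a licence or provenance statement can be attached to a single translation unit without changing the format of the rest of the data.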
Linguistic Linked Data: Lexicons
[Figure: the Spanish lexical entry “red” as linked data. Labels shown: Form (singular, written form “red”, phonetic form [RED]); Form (plural, phonetic form [REDES]); number; gender: feminine; Sense with written form “red” marked equivalent to the Sense with written form “malla”; image; translation es-en between the written forms “red” and “network”.]
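The lexicon diagram above can be approximated as a small data structure. This is a sketch under stated assumptions: the class and field names are invented for illustration and are not the LEMON vocabulary's actual terms, the plural written form "redes" is supplied here (the slide only shows its phonetic form), and the split into two senses is one reading of the diagram.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Form:
    written: str
    number: str
    phonetic: Optional[str] = None


@dataclass
class Sense:
    concept: str  # hypothetical concept identifier
    equivalents: List[str] = field(default_factory=list)
    translations: Dict[str, str] = field(default_factory=dict)  # language -> written form


@dataclass
class LexicalEntry:
    lemma: str
    lang: str
    gender: str
    forms: List[Form]
    senses: List[Sense]


red = LexicalEntry(
    lemma="red",
    lang="es",
    gender="feminine",
    forms=[
        Form(written="red", number="singular", phonetic="[RED]"),
        # Written plural is an assumption; the slide gives only [REDES].
        Form(written="redes", number="plural", phonetic="[REDES]"),
    ],
    senses=[
        # One reading of the diagram: a "net/mesh" sense equivalent to "malla" ...
        Sense(concept="concept/net", equivalents=["malla"]),
        # ... and a sense carrying the es-en translation "network".
        Sense(concept="concept/network", translations={"en": "network"}),
    ],
)


def translations(entry: LexicalEntry, target_lang: str) -> List[str]:
    """Collect the written forms of all translations into target_lang."""
    return [s.translations[target_lang]
            for s in entry.senses
            if target_lang in s.translations]


print(translations(red, "en"))
print([f.written for f in red.forms if f.number == "plural"])
```

Keeping forms and senses as separate objects is what lets each one be linked independently, for example a sense to an encyclopedic resource and a form to its pronunciation.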
Use Case: BabelNet.org
Data Challenges in Using LT
— Language Technology is statistical
— Quality is limited by the distance between the training data and the job at hand
— Training data is the key asset for LT
– E.g. for L10n it is Translation Memories and Term Bases
— Challenges for managing training data
– Discover
– Select
– Curate
– Share/Pool/Sell
– Understanding Quality
– Measure Impact on Productivity
Active Curation: Managing LT/LR Lifecycle
[Figure: active curation connecting Language Workers, Language Technology and Language Resources.]
Use Case: FALCON Project
Tool Chain
• Website translation
• Translation Management
• Terminology Management
Language Technology
• Machine Translation
• Term Identification
Linked Data
• Parallel Text
• Terms: Lexical-conceptual
XLIFF + ITS 2.0
— Building the Localization Web = Decentralised Annotated Global Translation Memory and Term Base
— Terms and translations become linkable resources
— Meta-data from L10n tool chain adds value
— Use in training Machine Translation and Text Analytics
FALCON Demo: Locworld Expo
[Figure: FALCON demo architecture. Layers: Localization Tool Chain, Language Technology, Language Resources. Components shown: Web Site Translation, Translation Management System, Terminology Management System, Machine Translation, Text Analysis, Translation Management, Federated L3Data Platform, Public Resources. Partners shown: DCU, TCD.]
Words as Resources on the Web
The Web of Content
– http://www.ex.org/obama_en.html: “Barack Obama is the 44th president of the United States of America. He was first elected in 2008.”
– http://www.ex.org/obama_es.html: “Barack Obama es el 44º presidente de los Estados Unidos de América. Ha fue electo primera vez en 2008.”
The Localization Web
Translation Data
– http://data.ex.org/String_0001 (Derived From the English page). Text: “Barack Obama is the 44th president of the United States of America.” Lang: en
– http://data.ex.org/String_0002 (Derived From the Spanish page; Translated From String_0001). Text: “Barack Obama es el 44º presidente de los Estados Unidos de América.” Lang: es. TranslatedBy: Google Translate
Terminology Data
– Term: “United States of America.” Lang: en
– Term: “Estados Unidos de América.” Lang: es (Translation Of the English term)
– http://babelnet.org/345621
– http://babelnet.org/57835
Encyclopaedic Data
– http://Dbpedia.org/Page/Barak_Obama. Topic: Barack Obama. Lang: en. BirthDate: 1961-08-04. Spouse: Michelle Obama. Residence: White House
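The translation data in this example can be read as a small graph of links. The sketch below uses the example URLs from the slide, but the short property names (`derivedFrom`, `translatedFrom`, `translatedBy`) are simplified stand-ins chosen for illustration, not terms from a specific vocabulary, and the assignment of each "Derived From" link to a page is inferred from the slide layout.

```python
from typing import List, Tuple

EN_PAGE = "http://www.ex.org/obama_en.html"
ES_PAGE = "http://www.ex.org/obama_es.html"
STR_EN = "http://data.ex.org/String_0001"
STR_ES = "http://data.ex.org/String_0002"

# (subject, property, object) links; property names are illustrative.
links: List[Tuple[str, str, str]] = [
    (STR_EN, "derivedFrom", EN_PAGE),
    (STR_ES, "derivedFrom", ES_PAGE),
    (STR_ES, "translatedFrom", STR_EN),
    (STR_ES, "translatedBy", "Google Translate"),
]


def follow(subject: str, prop: str) -> List[str]:
    """Return every object linked from subject via prop."""
    return [o for s, p, o in links if s == subject and p == prop]


def provenance_chain(string_url: str) -> List[str]:
    """Walk translatedFrom links back to the original source string."""
    chain = [string_url]
    while True:
        parents = follow(chain[-1], "translatedFrom")
        # Stop at the original string, or if a link loops back on itself.
        if not parents or parents[0] in chain:
            return chain
        chain.append(parents[0])


chain = provenance_chain(STR_ES)
print(chain)
print(follow(chain[-1], "derivedFrom"))
print(follow(STR_ES, "translatedBy"))
```

Starting from the Spanish string, the chain leads back to the English string and from there to the page it was derived from, which is the kind of provenance tracking the earlier slides describe.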
L10n Use Case: Closing the Loop
— Active Curation: systematic harvesting of LT-ready TM and TB from the localization tool chain
— Data and tools to optimise process flow:
– Prioritize segments for postediting and input to incremental MT retraining
– Target postedits to extract target terms and new morphologies
— Postediting instrumentation:
– Postedit time and resource use (terms, concordance) vs. automation of MT metrics
– iOmegaT: instrumented open source CAT tool
— LREC, AMTA, EDF, MLW, LocFoc, Multilingual, EdMedia, FEISGILTT
Research and Innovation Roadmap
— https://www.w3.org/community/ld4lt/wiki/Linguistic_Linked_Data_for_Content_Analytics:_a_Roadmap
[Figure labels: Global Customer Engagement Use Cases; Public Sector and Civil Society Use Cases; Linguistic Linked Data Life Cycle and Value Network Requirements.]
Best Practices for Multilingual Linked Open Data
— Linguistic Vocabularies
— Resource-specific vocabularies
— Best Practices for Multilingual Linked Data
– Practices for Naming
– Practices for Dereferencing
– Practices for Textual Information
– Practices for Linking
– Identification of Languages
– DataID
– OWL Metamodel for Language Resources
– License Ontology
— Guidelines for Converting WordNets to Linked Data
— Guidelines for Linguistic Linked Data Generation: Multilingual Knowledge Bases
— Guidelines for Linguistic Linked Data Generation: Bilingual Dictionaries
— Guidelines for Converting TBX into Linked Data
— Guidelines for NIF-based NLP Services
— Comparison of Repositories
Dublin Workshop Session
— There is a need for a common API to text analysis services, live update of linked data sources, user feedback mechanisms, and annotation relevance indicators.
— "Too much information is no information": linked data information can help the translator only if it does not lead to an information overflow.
— A stand-off annotation mechanism is needed to deal with annotation overlap. NIF could be a solution.
— For the localization industry, licensing metadata is of key importance. Only with such metadata can one also work with internal (that is, closed) linked data.
— Terminology and linked data is a hot topic, also discussed in the LD4LT group. Currently there is no standard mapping of the TBX format to RDF.
— Bitext (= aligned text of a source and one or several translations) could be exposed as linked data, as an alternative to TMX.
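The point about stand-off annotation and overlap can be made concrete with a short sketch. It assumes character-offset identifiers of the form `#char=begin,end`, loosely following the offset-based fragment style associated with NIF; the `Annotation` class, the labels and the document URL reuse are illustrative assumptions, not part of any specification or of the workshop output.

```python
from dataclasses import dataclass

# Example sentence and document URL, reused from the earlier slide for illustration.
TEXT = "Barack Obama is the 44th president of the United States of America."
DOC = "http://data.ex.org/String_0001"


@dataclass(frozen=True)
class Annotation:
    begin: int   # offset of the first character
    end: int     # offset one past the last character
    label: str   # hypothetical annotation type

    def uri(self) -> str:
        """Identifier that points into the text without modifying it."""
        return DOC + "#char=" + str(self.begin) + "," + str(self.end)

    def anchor(self) -> str:
        """The substring this annotation refers to."""
        return TEXT[self.begin:self.end]


def annotate(phrase: str, label: str) -> Annotation:
    """Create an annotation for the first occurrence of phrase in TEXT."""
    begin = TEXT.index(phrase)
    return Annotation(begin, begin + len(phrase), label)


def overlapping(a: Annotation, b: Annotation) -> bool:
    """True when the two character ranges share at least one character."""
    return a.begin < b.end and b.begin < a.end


term = annotate("United States of America", "term")
place = annotate("United States", "place")
person = annotate("Barack Obama", "person")

for ann in (term, place, person):
    print(ann.uri(), ann.label, repr(ann.anchor()))
print(overlapping(term, place), overlapping(term, person))
```

Because the annotations live outside the text, the nested "term" and "place" spans coexist without any inline markup having to be nested or split.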
More Information
— Contact: [email protected]
– http://www.falcon-project.eu
— Lider: best practices and roadmap
– http://www.lider-project.eu/
— See also:
– Linked Data for Language Technology (LD4LT) W3C Community Group
• http://www.w3.org/community/ld4lt/
– Best Practice in Multilingual Linked Open Data
• http://www.w3.org/community/bpmlod/
– OntoLex Community Group
• http://www.w3.org/community/ontolex/