Bitextor: harvest your own parallel corpora from the Web
Miquel Esplà-Gomis
Universitat d’Alacant
[email protected]
Who is behind Bitextor?
✭ Transducens (Universitat d’Alacant)
✹ Parallel data crawling
✹ Rule-based machine translation
✹ Machine translation quality estimation
✹ Computer-aided translation
✹ ...
✭ Prompsit Language Engineering
✹ Parallel data curation
✹ Machine translation (rule-based, statistical, neural, hybrid, etc.)
✹ Linguistic variant adaptation
✹ ...
UA + Prompsit:
✹ Apertium: Rule-based MT
✹ Abu-MaTran: Automatic building of MT
✹ Bitextor: Parallel data crawling
Our motivation
✭ Specific sources (legal, technical documentation, etc.) exhaustively exploited
✹ Europarl, EMEA, MultiUN, TEDTalks, etc.
✹ Most of them available at OPUS [http://opus.lingfil.uu.se/]
✭ What about more general sources (translated websites, etc.)?
✹ a good source of data for small languages
✹ easy to find domain-specific content
✹ not as productive as crawling well-known websites... ?
What is Bitextor?
✭ Free/open-source tool for automatically crawling the Web
✭ Crawl parallel data between any two languages
✭ Build parallel corpora from any XML-based data: XML, XHTML, OOXML (.docx), etc.
✭ It can generate TMX (for translators) or Moses-like plain text (for training SMT)
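To make the two output formats concrete, here is a minimal Python sketch that serialises a few aligned segment pairs both as TMX and as Moses-like line-aligned plain text. It is not Bitextor's own code: the function names `to_tmx` and `to_moses` and the header values (such as `creationtool`) are assumptions made for illustration, and the exact files Bitextor writes may differ.

```python
import xml.etree.ElementTree as ET

# Clark notation for the reserved xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def to_tmx(pairs, src_lang, tgt_lang):
    """Serialise (source, target) segment pairs as a TMX 1.4 string."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attribute values below are illustrative placeholders.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-script",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


def to_moses(pairs):
    """Return two line-aligned strings: line i of each is one segment pair."""
    # Collapse internal whitespace so every segment stays on a single line.
    src_lines = [" ".join(src.split()) for src, _ in pairs]
    tgt_lines = [" ".join(tgt.split()) for _, tgt in pairs]
    return "\n".join(src_lines) + "\n", "\n".join(tgt_lines) + "\n"


pairs = [("Hello world", "Bonjour le monde"), ("Thank you", "Merci")]
print(to_tmx(pairs, "en", "fr"))
english, french = to_moses(pairs)
print(english, french, sep="")
```

The TMX string can be loaded into a translation-memory tool, while the two Moses-style strings would normally be written to a pair of files (one per language) for SMT training.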
Brief history of the project
✭ First version developed at Universitat d’Alacant in 2006
✭ Up to version 2.0, it was hard to compile and install: not a good product
✭ Re-implemented in version 3.0 at Prompsit Language Engineering:
✹ Unix-pipeline architecture made of scripts
✹ highly scalable
✹ up-to-date external libraries and tools
✹ good documentation and support
✭ Currently at version 5.0: dramatic improvement in performance!
Why Bitextor?
✭ Customizable:
✹ Document- and segment-alignment quality threshold
✹ Several input/output formats available
✹ Time/size limit for crawling
✹ ...
✭ High performance in document alignment: precision ~90%, recall ~80%
✭ Fast and easy to use:
bitextor -v dic -u http://golftrotter.com en fr
+90 seconds
= 4,056 pairs of segments !!!
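As a quick sanity check on the figures above, the short Python sketch below combines the approximate precision and recall into an F1 score and turns the example crawl into a throughput figure. The helper names are illustrative, and the inputs are the rounded values from the slide, so the results are only indicative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def pairs_per_second(pairs, seconds):
    """Average number of segment pairs harvested per second."""
    return pairs / seconds


# Rounded figures taken from the slide.
print(round(f1_score(0.90, 0.80), 3))
print(round(pairs_per_second(4056, 90), 1))
```

With these inputs the document-alignment F1 comes out at roughly 0.85, and the example crawl yields about 45 segment pairs per second.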
What do you need to run Bitextor?
✭ A Unix-like (POSIX) operating system
✭ To follow the installation tutorial: https://sf.net/p/bitextor/wiki
✭ To identify one or more URLs to crawl
✭ A bilingual lexicon for the languages to crawl
✹ For many languages, you can download one from: https://sf.net/projects/bitextor/files/lexicons
✹ Bitextor can also build a lexicon from parallel corpora
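To give an idea of how a bilingual lexicon can be derived from a parallel corpus, here is a minimal Python sketch of IBM Model 1 trained with EM on a toy corpus. This is a stand-in chosen for brevity, not Bitextor's actual procedure; the function names and the toy sentences are assumptions made for the example.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM (IBM Model 1, no NULL word).

    `pairs` is a list of (source_tokens, target_tokens) tuples.
    """
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(tgt_vocab)
    # Uniform initialisation for every co-occurring word pair.
    t = defaultdict(float)
    for src, tgt in pairs:
        for s in src:
            for w in tgt:
                t[(w, s)] = uniform

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (target, source) links
        total = defaultdict(float)  # expected count of links per source word
        # E-step: distribute each target word over the source words.
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalise the expected counts into probabilities.
        for (w, s), c in count.items():
            t[(w, s)] = c / total[s]
    return t


def best_translations(t):
    """Keep the most probable target word for each source word."""
    best = {}
    for (w, s), p in t.items():
        if s not in best or p > best[s][1]:
            best[s] = (w, p)
    return {s: w for s, (w, _) in best.items()}


toy_corpus = [
    ("the house".split(), "la maison".split()),
    ("the flower".split(), "la fleur".split()),
    ("a flower".split(), "une fleur".split()),
]
lexicon = best_translations(train_ibm_model1(toy_corpus))
print(lexicon)
```

On real data one would also filter entries by probability and frequency before using the result as a lexicon; the toy corpus only shows how co-occurrence statistics disambiguate word pairs.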
Bitextor is a good team player
✭ Bitextor can be combined with Spiderling* to crawl top-level domains
✭ Crawl monolingual and parallel corpora at the same time
✭ No need to look for multilingual websites!
✭ ~100GB of data in one week!
* A monolingual crawler focused on linguistic resources
Useful for translation
Useful for translators or translation companies when ...
✭ ... a statistical MT system has to be built for a new pair of languages (Abu-MaTran: en-hr, en-fi, ...)
✭ ... an MT system needs to be adapted to a new domain
✭ ... our translation memories (TM) need to be adapted to a specific domain to improve coverage
✹ even more: for new clients, we may want to build a new TM using their documents
More than just translation
✭ build bilingual (and domain-specific?) lexicons: Bitextor uses MGIZA++ to generate these lexicons from crawled data
✭ identify the parts of two documents that are parallel: for example, get the translation of a word/sentence from an e-book in a foreign language
✭ get information about multilingualism to improve your strategy:
✹ discover potential customers by finding websites that need to be translated
✹ focus your efforts by identifying linguistic domains or languages with a low ratio of translated documents by crawling top-level domains
What will be next?
✭ improved segmentation and segment alignment
✭ new tool for cleaning translation memories
✭ Bitextor (and other crawlers) as a web service
✹ ask Prompsit: [email protected]
✭ Bitextor for Windows
✭ generation of “deferred” translation memories (Forcada, Esplà-Gomis and Pérez-Ortiz, 2016)
Better corpora!
Improving usability!
Thank you very much for your attention!
Paldies jums par uzmanību!
We will be glad to hear from you: [email protected]