
Bitextor: harvest your own parallel corpora from the Web

Miquel Esplà-Gomis
Universitat d’Alacant
[email protected]

Who is behind Bitextor?

✭ Transducens (Universitat d’Alacant)
  ✹ Parallel data crawling
  ✹ Rule-based machine translation
  ✹ Machine translation quality estimation
  ✹ Computer-aided translation
  ✹ ...

✭ Prompsit Language Engineering
  ✹ Parallel data curation
  ✹ Machine translation (rule-based, statistical, neural, hybrid, etc.)
  ✹ Linguistic variant adaptation
  ✹ ...

✭ UA + Prompsit:
  ✹ Apertium: Rule-based MT
  ✹ Abu-MaTran: Automatic building of MT
  ✹ Bitextor: Parallel data crawling

Our motivation

✭ Specific sources (legal, technical documentation, etc.) have been exhaustively exploited
  ✹ Europarl, EMEA, MultiUN, TEDTalks, etc.
  ✹ Most of them available at OPUS [http://opus.lingfil.uu.se/]

✭ What about more general sources (translated websites, etc.)?
  ✹ good source of data for small languages
  ✹ easy to find domain-specific content
  ✹ not as productive as crawling well-known websites...?

What is Bitextor?

✭ Free/open-source tool for automatically crawling the Web

✭ Crawl parallel data between any two languages

✭ Build parallel corpora from any XML-based data: XML, XHTML, OOXML (.docx), etc.

✭ It can generate TMX (for translators) or Moses-like plain text (for training SMT)
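The two output formats mentioned above can be illustrated with a minimal Python sketch. It is not Bitextor's own code: the segment pairs, the function names `to_tmx` and `to_moses`, and the header values are assumptions made for illustration. It writes the same aligned segments once as a small TMX 1.4 document and once as Moses-style plain text (one segment per line, with source and target kept line-aligned).

```python
import xml.etree.ElementTree as ET

# Attribute name for xml:lang in ElementTree's namespace notation.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical aligned segment pairs (English, French); not real Bitextor output.
pairs = [
    ("Welcome to our golf club.", "Bienvenue dans notre club de golf."),
    ("Book a tee time online.", "Réservez un départ en ligne."),
]

def to_tmx(pairs, src_lang, tgt_lang):
    """Return a minimal TMX 1.4 document (as a string) for the given pairs."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX 1.4; the values are placeholders.
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")

def to_moses(pairs):
    """Return (source_text, target_text): one segment per line, line-aligned."""
    src_text = "\n".join(src for src, _ in pairs) + "\n"
    tgt_text = "\n".join(tgt for _, tgt in pairs) + "\n"
    return src_text, tgt_text

tmx_doc = to_tmx(pairs, "en", "fr")
moses_src, moses_tgt = to_moses(pairs)
```

TMX keeps each pair together in one translation unit, which suits translation-memory tools; the Moses-style layout relies only on line position, which is what SMT training pipelines typically expect.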

Brief history of the project

✭ First version developed at Universitat d’Alacant in 2006

✭ Until version 2.0, compiling and installing it was problematic: not a good product at first

✭ Re-implemented in version 3.0 at Prompsit Language Engineering:
  ✹ Unix-pipeline architecture made of scripts
  ✹ highly scalable
  ✹ up-to-date external libraries and external tools
  ✹ good documentation and support

✭ Currently at version 5.0: dramatic improvement in performance!

Why Bitextor?

✭ Customizable:
  ✹ Document- and segment-alignment quality threshold
  ✹ Several input/output formats available
  ✹ Time/size limit for crawling
  ✹ ...

✭ High performance in document alignment: precision ~90%, recall ~80%

✭ Fast and easy to use:

bitextor -v dic -u http://golftrotter.com en fr

+ 90 seconds

= 4,056 pairs of segments!
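For readers unfamiliar with the precision and recall figures quoted above for document alignment, the following Python sketch shows how such figures are usually computed from a set of predicted document pairs and a gold-standard set. The document paths and the function name `precision_recall` are invented for illustration; the numbers it produces come from this toy data, not from Bitextor's evaluation.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall of predicted pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of predicted pairs that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold pairs that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical gold-standard alignments (English page, French page).
gold = {
    ("/en/home", "/fr/accueil"),
    ("/en/courses", "/fr/parcours"),
    ("/en/prices", "/fr/tarifs"),
    ("/en/contact", "/fr/contact"),
    ("/en/about", "/fr/a-propos"),
}

# Hypothetical aligner output: three correct pairs, one wrong pair.
predicted = {
    ("/en/home", "/fr/accueil"),
    ("/en/courses", "/fr/parcours"),
    ("/en/prices", "/fr/tarifs"),
    ("/en/contact", "/fr/a-propos"),
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the alignment quality threshold typically trades recall for precision: fewer pairs are proposed, but a larger share of them is correct.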

What do you need to run Bitextor?

✭ A Unix-like or POSIX-compliant operating system

✭ The installation tutorial: https://sf.net/p/bitextor/wiki

✭ One or more URLs to crawl

✭ A bilingual lexicon for the languages to crawl
  ✹ For many languages, you can download one from: https://sf.net/projects/bitextor/files/lexicons
  ✹ Bitextor can build a lexicon from parallel corpora
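To give an idea of what building a lexicon from parallel corpora involves, here is a deliberately simple Python sketch. It is not how Bitextor does it (a later slide states that Bitextor uses MGIZA++ for this); instead it uses a plain co-occurrence heuristic, the Dice coefficient, on a tiny invented corpus. The function name `build_lexicon` and the `min_score` threshold are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

def build_lexicon(pairs, min_score=0.5):
    """Map each source word to its best target word by Dice coefficient.

    A toy heuristic, not the word-alignment models used by MGIZA++.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Count every source/target word pair seen in the same segment pair.
        joint.update(product(src_words, tgt_words))

    lexicon = {}
    for (s, t), both in joint.items():
        dice = 2 * both / (src_count[s] + tgt_count[t])
        best_so_far = lexicon.get(s, ("", 0.0))[1]
        if dice >= min_score and dice > best_so_far:
            lexicon[s] = (t, dice)
    return lexicon

# Hypothetical parallel segments (English, French).
corpus = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]

lexicon = build_lexicon(corpus)
for word, (translation, score) in sorted(lexicon.items()):
    print(word, translation, score)
```

On real data, co-occurrence scores like this are noisy for frequent words, which is one reason statistical word-alignment tools are used in practice.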

Bitextor is a good team player

✭ Bitextor can be combined with Spiderling* to crawl top-level domains

✭ Crawl monolingual and parallel corpora at the same time

✭ No need to look for multilingual websites!

✭ ~100GB of data in one week!

* A monolingual crawler focused on linguistic resources

Useful for translation

Useful for translators or translation companies when ...

✭ ... a statistical MT system has to be built for a new pair of languages (Abu-MaTran: en-hr, en-fi, ...)

✭ ... an MT system needs domain adaptation

✭ ... our translation memories (TM) need to be adapted to a specific domain to improve coverage
  ✹ even more: for new clients, we may want to build a new TM using their documents

More than just translation

✭ build bilingual (and domain-specific?) lexicons: Bitextor uses MGIZA++ to generate these lexicons from crawled data

✭ identify the parts of two documents that are parallel: for example, get the translation of a word/sentence from an e-book in a foreign language

✭ get information about multilingualism to improve your strategy:
  ✹ discover potential customers by finding websites that need to be translated
  ✹ focus your efforts by crawling top-level domains to identify linguistic domains or languages with a low ratio of translated documents

What will be next?

✭ improved segmentation and segment alignment

✭ new tool for cleaning translation memories

✭ Bitextor (and other crawlers) as a web service
  ✹ ask Prompsit: [email protected]

✭ Bitextor for Windows

✭ generation of “deferred” translation memories (Forcada, Esplà-Gomis and Pérez-Ortiz, 2016)

Better corpora!

Improving usability!

Thank you very much for your attention!

Paldies jums par uzmanību!

We will be glad to hear from you: [email protected]
