UA - GT Aligner - ICoC

Post on 15-Apr-2017

Aligning images with ground truth transcriptions

Rafael C. Carrasco ([email protected])
Departamento de Lenguajes y Sistemas Informáticos

Impact Ground Truth

Over 30,000 pages of high-quality transcriptions.

difinicion à lo difinido: y antes de contarla, no dexè
dicho quienes y quales fueron mis padres, y confuſo na-
cimiento, que en ſu tanto, ſi dellos huuiera de eſcreuir-
ſe, fuera ſin duda mas agradable y bien recebida que eſta

Identification of words and characters

Impact ground truth identifies regions (paragraphs).
Lines can usually be identified with geometric methods.
The identification of words and characters is not straightforward because of the variable separation between them.
Character breaking, overlapping and kerning are frequent.
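
As an illustration of a geometric method for finding lines, the sketch below uses a horizontal projection profile: rows that contain ink are grouped into text lines. It is only one common technique, not necessarily the one used in this project. It assumes a binarized, already deskewed region given as a list of rows of 0 (background) and 1 (ink); the function name `find_text_lines` and the parameter `min_ink` are hypothetical.

```python
def find_text_lines(image, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink.

    `image` is a list of rows; each row is a list of 0 (background) / 1 (ink).
    """
    # Horizontal projection profile: number of ink pixels in every row.
    profile = [sum(row) for row in image]
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # the text line ends
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy region: two text lines separated by an empty row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
]
print(find_text_lines(page))  # [(1, 3), (4, 5)]
```

Raising `min_ink` makes the split tolerant to isolated noise pixels, at the risk of cutting thin ascenders or descenders.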

Gap analysis is not sufficient

Bars mark the position of vertical gaps.
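
A minimal sketch of gap analysis, assuming a binarized line image stored as a list of rows of 0/1 values (the name `vertical_gaps` is hypothetical). It reports the runs of columns without ink. The toy example shows why gaps alone are not sufficient: touching characters produce no gap, while a broken character produces a spurious one.

```python
def vertical_gaps(line_image):
    """Return (start, end) column ranges, end exclusive, that contain no ink."""
    width = len(line_image[0])
    # Vertical projection profile: ink pixels in every column.
    column_ink = [sum(row[x] for row in line_image) for x in range(width)]
    gaps, start = [], None
    for x, ink in enumerate(column_ink):
        if ink == 0 and start is None:
            start = x
        elif ink > 0 and start is not None:
            gaps.append((start, x))
            start = None
    if start is not None:
        gaps.append((start, width))
    return gaps


# Columns 3-6 could hold two touching characters (no gap between them),
# while columns 8 and 10 could be two pieces of one broken character.
line = [
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1],
]
print(vertical_gaps(line))  # [(2, 3), (7, 8), (9, 10)]
```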

Objectives

Apply standard geometric methods to separate (and deskew) the lines in the image.
Use probabilistic models to identify the best segmentation of the characters in every line.
Enrich the Impact ground truth with the additional information (a map between characters and images).
Publish the source code in the Impact Centre GitHub.
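
The idea of choosing the best segmentation under a probabilistic model can be sketched with dynamic programming. The example below is an illustration only, not the project's model: it assumes that each character has a known expected width (`width_model`) and scores a segment with a Gaussian log-likelihood of its width. A real aligner would score image features rather than width alone. All names (`align`, `width_model`, `sigma`) are hypothetical.

```python
import math


def align(text, line_width, width_model, sigma=1.5):
    """Split columns [0, line_width) into one segment per character of `text`,
    maximizing the summed Gaussian log-likelihood of the segment widths."""

    def score(ch, width):
        # Log-likelihood up to a constant (same sigma for every character).
        return -((width - width_model[ch]) ** 2) / (2 * sigma ** 2)

    n = len(text)
    neg = -math.inf
    # best[i][x]: best score with the first i characters ending at column x.
    best = [[neg] * (line_width + 1) for _ in range(n + 1)]
    back = [[0] * (line_width + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for x in range(i, line_width + 1):
            for prev in range(i - 1, x):
                if best[i - 1][prev] == neg:
                    continue
                s = best[i - 1][prev] + score(text[i - 1], x - prev)
                if s > best[i][x]:
                    best[i][x], back[i][x] = s, prev
    # Trace the optimal cut points back from the end of the line.
    cuts, x = [line_width], line_width
    for i in range(n, 0, -1):
        x = back[i][x]
        cuts.append(x)
    cuts.reverse()
    return list(zip(text, zip(cuts, cuts[1:])))


# Toy model: "i" is about 2 columns wide, "m" about 6.
segments = align("imi", 10, {"i": 2, "m": 6})
print(segments)  # [('i', (0, 2)), ('m', (2, 8)), ('i', (8, 10))]
```

The result is exactly the map between characters and image positions that the objectives describe, here reduced to column ranges within one line.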

Character features

Candidate features: weight, shadow, gauge, profile, ...
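
The slide does not define these features, so the sketch below rests on assumptions: "weight" is taken to be the number of ink pixels in a glyph and "profile" the ink count per column. "Shadow" and "gauge" are left out because their definitions cannot be inferred from the text. The glyph is assumed to be a binarized image stored as a list of rows of 0/1 values.

```python
def weight(glyph):
    """Assumed definition: total number of ink pixels in the glyph."""
    return sum(sum(row) for row in glyph)


def profile(glyph):
    """Assumed definition: number of ink pixels in every column."""
    return [sum(column) for column in zip(*glyph)]


# A 3x3 plus sign as a toy glyph.
glyph = [
    [0, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
]
print(weight(glyph))   # 5
print(profile(glyph))  # [1, 3, 1]
```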

Methodology

Explore which character features are best for alignment.
Employ simple training methods which, in contrast to HMMs, require short training times.
Font size and type (bold, slanted, etc.) are not declared in the ground truth files and must therefore be addressed in a second phase of this project.

Applications

Training OCR engines, such as Tesseract, with large samples of characters can be automated.
Adapting an OCR engine to a particular book or collection could be feasible with the manual transcription of only a few pages.
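
One way the character-to-image map could feed Tesseract training is through box files, which list one character per line as `symbol left bottom right top page`, with coordinates measured from the bottom-left corner of the image. The sketch below converts aligned characters into such lines. The input tuple layout and the name `to_box_lines` are assumptions made for illustration; the input is assumed to use the usual top-left image origin.

```python
def to_box_lines(chars, image_height, page=0):
    """Convert aligned characters into Tesseract box-file lines.

    `chars` holds (symbol, left, top, right, bottom) tuples in pixel
    coordinates with the origin at the top-left corner of the image.
    """
    lines = []
    for symbol, left, top, right, bottom in chars:
        # Box files measure y from the bottom edge, so flip the vertical axis.
        box_bottom = image_height - bottom
        box_top = image_height - top
        lines.append(f"{symbol} {left} {box_bottom} {right} {box_top} {page}")
    return lines


aligned = [("d", 0, 10, 12, 30), ("i", 12, 10, 16, 30)]
box_lines = to_box_lines(aligned, image_height=100)
print("\n".join(box_lines))
# d 0 70 12 90 0
# i 12 70 16 90 0
```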

Note: Work on TEI P5

The Miguel de Cervantes library has created about 10,000 books with TEI2 markup.
TEI P5 has associated stylesheets, for example, to create e-books automatically. However, some limitations were found when migrating to TEI P5:

Little support for indentation (normal/hanging).
Automatic numeration of verse lines.
No style support for nested annotation.
Headings cannot be marked for inclusion/exclusion in the (automatically generated) table of contents.

This experience can be an opportunity for cooperation between the Centre and the TEI consortium.