Aligning images with ground truth transcriptions
Rafael C. Carrasco, Departamento de Lenguajes y Sistemas Informáticos
Impact Ground Truth
Over 30,000 pages of high-quality transcriptions.
difinicion à lo difinido: y antes de contarla, no dexè
dicho quienes y quales fueron mis padres, y confuſo na-
cimiento, que en ſu tanto, ſi dellos huuiera de eſcreuir-
ſe, fuera ſin duda mas agradable y bien recebida que eſta
Identification of words and characters
Impact ground truth identifies regions (paragraphs).
Lines can usually be identified with geometric methods.
The identification of words and characters is not straightforward because the separation between them varies.
Character breaking, overlapping and kerning are frequent.
Gap analysis is not sufficient
Bars mark the position of vertical gaps.
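To make the limitation concrete, the following is a minimal sketch of gap analysis on a binarized line image. The function name `vertical_gaps` and the list-of-rows image representation are assumptions made for illustration; they do not come from the project's code.

```python
def vertical_gaps(image):
    """Return (start, end) column ranges, end exclusive, that contain no ink.

    `image` is a list of rows; each row is a list of 0 (background) or 1 (ink).
    This representation is an assumption made for the example.
    """
    width = len(image[0])
    # Column projection: number of ink pixels in each column.
    projection = [sum(row[x] for row in image) for x in range(width)]
    gaps, start = [], None
    for x, ink in enumerate(projection):
        if ink == 0 and start is None:
            start = x                      # a gap opens here
        elif ink > 0 and start is not None:
            gaps.append((start, x))        # the gap closes before column x
            start = None
    if start is not None:
        gaps.append((start, width))        # gap reaching the right border
    return gaps


line = [
    [1, 1, 0, 1, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 1, 1],
]
print(vertical_gaps(line))  # [(2, 3), (4, 6)]
```

Touching or kerned characters produce no empty column, and a broken character produces a spurious one, so the gaps found this way do not correspond one-to-one to character boundaries.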
Objectives
Apply standard geometric methods to separate (and deskew) the lines in the image.
Use probabilistic models to identify the best segmentation of the characters in every line.
Enrich the Impact ground truth with the additional information (map between characters and images).
Publish source code in the Impact Centre GitHub.
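The second objective can be illustrated with a small sketch. The slides do not specify the probabilistic model, so this example assumes one: each character has a Gaussian distribution over its width, and dynamic programming finds the cut positions that maximize the joint likelihood of the transcribed text. The function `best_segmentation` and the `model` dictionary are hypothetical names.

```python
import math


def best_segmentation(text, width, model):
    """Split the pixel range [0, width) into len(text) consecutive segments.

    `model` maps a character to (mean_width, std_dev); these values are
    assumptions for the example. Returns len(text) + 1 cut positions that
    minimize the summed negative log-likelihood of the segment widths.
    """
    n = len(text)
    if width < n:
        raise ValueError("line is narrower than one pixel per character")

    def cost(ch, w):
        mean, std = model[ch]
        # Negative log-likelihood of width w under a Gaussian (up to a constant).
        return 0.5 * ((w - mean) / std) ** 2 + math.log(std)

    inf = float("inf")
    # best[i][x]: minimal cost of fitting the first i characters into [0, x).
    best = [[inf] * (width + 1) for _ in range(n + 1)]
    back = [[0] * (width + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for x in range(i, width + 1):
            for prev in range(i - 1, x):
                if best[i - 1][prev] == inf:
                    continue
                c = best[i - 1][prev] + cost(text[i - 1], x - prev)
                if c < best[i][x]:
                    best[i][x] = c
                    back[i][x] = prev
    # Follow the back-pointers from the right border to recover the cuts.
    cuts = [width]
    for i in range(n, 0, -1):
        cuts.append(back[i][cuts[-1]])
    return cuts[::-1]


model = {"i": (2, 1.0), "m": (6, 1.0)}   # assumed widths in pixels
print(best_segmentation("mi", 8, model))   # [0, 6, 8]
print(best_segmentation("imi", 10, model)) # [0, 2, 8, 10]
```

Because every pixel column is a candidate cut, the search does not depend on empty columns being present between characters.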
Character features
Candidate features: weight, shadow, gauge, profile, ...
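Two of these candidates can be sketched as follows. The slides do not define the features, so the interpretations here are assumptions: "weight" is taken as the fraction of ink pixels in the glyph and "profile" as the ink count per column. Shadow and gauge are left out because their definitions cannot be recovered from the slides.

```python
def weight(glyph):
    """Fraction of ink pixels in the glyph (assumed meaning of 'weight')."""
    total = sum(len(row) for row in glyph)
    return sum(map(sum, glyph)) / total


def profile(glyph):
    """Ink pixels per column, left to right (assumed meaning of 'profile')."""
    return [sum(column) for column in zip(*glyph)]


# A 3x3 binary glyph: 1 is ink, 0 is background.
glyph = [
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
]
print(weight(glyph))   # 5 ink pixels out of 9
print(profile(glyph))  # [1, 3, 1]
```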
Methodology
Explore which character features are best for alignment.
Employ simple training methods which (in contrast to HMM) require short training times.
Font size and type (bold, slanted, etc.) are not declared in the ground truth files and must therefore be addressed in a second phase of this project.
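As an illustration of what a training method with short training times could look like, the sketch below estimates a per-character width model in closed form from already aligned samples. This is an assumed example, not the method used in the project; `train_width_model` and the `min_std` floor are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def train_width_model(samples, min_std=1.0):
    """Estimate (mean, std) of the width of each character.

    `samples` is an iterable of (character, width_in_pixels) pairs.
    `min_std` is an assumed floor that keeps the deviation away from zero
    when a character has been seen with a single width only.
    """
    widths = defaultdict(list)
    for ch, w in samples:
        widths[ch].append(w)
    return {ch: (mean(ws), max(pstdev(ws), min_std)) for ch, ws in widths.items()}


samples = [("m", 6), ("m", 8), ("i", 2), ("i", 2)]
width_model = train_width_model(samples)
print(width_model["m"])
print(width_model["i"])
```

A single pass over the samples is enough, with no iterative re-estimation of the kind HMM training requires.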
Applications
Training OCR engines, such as Tesseract, with large samples of characters can be automated.
Adaptation of OCR engines to a particular book or collection could be feasible with the manual transcription of only a few pages.
Note: Work on TEI P5
The Miguel de Cervantes library has created about 10,000 books with TEI2 markup.
TEI P5 has associated stylesheets, for example, to create e-books automatically. However, some limitations were found when migrating to TEI P5:
Little support for indentation (normal/hanging).
Automatic numbering of verse lines.
No style support for nested annotation.
Headings cannot be marked for inclusion in or exclusion from the (automatically generated) table of contents.
This experience can be an opportunity for cooperation between the Centre and the TEI consortium.