Text curation for the Deutsches Textarchiv
www.deutschestextarchiv.de, www.dwds.de
Alexander Geyken
BBAW Digitales Wörterbuch (DWDS), Deutsches Textarchiv (DTA), CLARIN-D
Berlin-Brandenburgische Akademie der Wissenschaften - BBAW
Corpora at BBAW
DWDS: 1900 onwards
• DWDS extended: 2.6 G tokens, 6 million docs, incl. a CMC sub-corpus
• DWDS base: 254 M tokens, 272,000 docs
• DWDS core: 100 M tokens, 80,000 docs
DTA: 1650-1900
• DTA extended: 150 M
• DTA core: 80 M
Deutsches Textarchiv in a nutshell
• Select important historical German prints (1650-1900, 1,300 vol.): Planck, Hilbert, Boltzmann, Euler; Goethe, Lessing; Marx, Wundt, Forster, …
• Digitize (first editions, high-accuracy transcription), TEI/P5 (DTA base format), linguistic annotation
• Interoperable, e.g. CLARIN-D
• Funded by DFG (2007-2014); staff 5 FTE
www.deutschestextarchiv.de (new beta version)
charakterisirt → charakterisiert
DTAQ – a collaborative platform for QA
Some detail on transcriptions …
(only) documented emendations are welcome, e.g. "Ednard": <choice> <sic>Ednard</sic> <corr>Eduard</corr> </choice>
chamisso_schlemihl_1814?p=13
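The emendation markup above can be read back in two ways: as printed (`<sic>`) or as corrected (`<corr>`). The following is a minimal sketch of that idea using only the Python standard library. The `<p>` wrapper and the surrounding words are made up for illustration, the function name `render` is hypothetical, and real TEI files carry an XML namespace that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# The <choice> element is taken from the slide; the <p> wrapper and the
# words around it are invented for illustration only.
FRAGMENT = (
    "<p>Herr <choice> <sic>Ednard</sic> <corr>Eduard</corr> </choice> kam.</p>"
)


def render(elem, reading):
    """Return the text of `elem`, keeping only `reading` ('sic' or 'corr')
    wherever a <choice> offers both alternatives."""
    if elem.tag == "choice":
        # Inside <choice> only the selected child counts; whitespace
        # between the alternatives is layout, not text.
        return "".join(render(c, reading) for c in elem if c.tag == reading)
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, reading))
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(FRAGMENT)
as_printed = render(root, "sic")     # reading with the printing error kept
as_corrected = render(root, "corr")  # reading with the documented correction
```

Because both readings stay in the file, the correction remains documented and reversible, which is the point of allowing only documented emendations.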
no modernizations, no ‘normalizations’, e.g. “Ich laſſe mich nicht irre ſchreyn” (goethe_faust01_1808?p=293):
ſchreyn → ſchreyn (kept as printed); not ſchreyn → schreyn, not ſchreyn → schreien
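One way to reconcile a source-faithful transcription with modern-spelling search is to leave the stored text untouched and derive the other forms as separate layers. The sketch below illustrates that separation only; it is not DTA's actual tooling. The lexicon `MODERN` and the function names are hypothetical, and a real normalizer would need far more than a lookup table.

```python
# Hypothetical toy lexicon: transliterated historical spelling -> modern spelling.
MODERN = {"schreyn": "schreien", "charakterisirt": "charakterisiert"}


def transliterate(token):
    """Replace the long s (U+017F) with a round s; nothing else changes."""
    return token.replace("ſ", "s")


def modern_form(token):
    """Look up a modern spelling; fall back to the transliterated form."""
    plain = transliterate(token)
    return MODERN.get(plain, plain)


source = "Ich laſſe mich nicht irre ſchreyn"

# Each entry keeps the transcribed token and adds two derived layers.
index = [(tok, transliterate(tok), modern_form(tok)) for tok in source.split()]

# The transcription itself can always be recovered unchanged from the index.
restored = " ".join(entry[0] for entry in index)
```

With this layout a query for "schreien" can match the third layer while the display continues to show "ſchreyn" exactly as printed.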
Transcription: UTF-8; transcribed true to the source; high accuracy in keying (>99%)
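The slide does not say how keying accuracy is measured. A common convention, assumed here, is character accuracy: one minus the edit distance between a verified reference and the keyed text, divided by the reference length. The function names and the sample keying error (ſ mistaken for f) are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_accuracy(reference, keyed):
    """Character accuracy of `keyed` against a verified `reference` text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return 1 - edit_distance(reference, keyed) / len(reference)


reference = "Ich laſſe mich nicht irre ſchreyn"
keyed = "Ich laſſe mich nicht irre fchreyn"  # illustrative error: ſ keyed as f

accuracy = char_accuracy(reference, keyed)
```

Under this definition, an accuracy above 99% means fewer than one character error per hundred characters of reference text.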
… and structuring
• "formative quality assurance":
Extensions of DTA core corpus
Cooperations with other partners
Technical support
Total: 53,870 pages (excluding the Polytechnisches Journal)
Web form gathering text, images and metadata
• Preliminary DTA-id
• Metadata on the source
• Metadata on the transcription
• Licence/legal information
• Image source(s)
• Conversion
DTAE – Key features
Available for all DTAE texts (as for all DTA texts):
Parallel view: img | HTML; img | XML; img | CAB; …
+ full bibliographic record and metadata
+ info on transcription & encoding
Benefits
DTAE offers an established infrastructure that supports every stage in an electronic document's life cycle
+ well-documented, consistent encoding of the DTA core corpus and sub-corpora (DTA 'base format')
+ presentation and exploration of texts in the context of DTA's corpora
+ linguistic analysis and tools for exploring a high-quality corpus
+ integration in CLARIN-D (via BBAW as service centre)
DTAE: Extensions via cooperations
Cooperation with 12 partners, including:
– HAB Wolfenbüttel (DFG project AEdit)
– Dinglers Polytechnisches Journal (HU Berlin)
– Forschungsstelle für Personalschriften, Marburg (AdW Mainz)
– CLARIN-D curation project
– MPI für Bildungsforschung
www.deutschestextarchiv.de/dtae-dlc
Thanks for your attention!
Questions?
Have a look at DTAE, DTAQ: www.deutschestextarchiv.de/dtae
www.deutschestextarchiv.de/dtaq