Post on 28-Mar-2015
Member of the Leibniz Association
Querying Spoken Language Corpora
Thomas Schmidt, IDS Mannheim
Outline
1) Background: EXMARaLDA, FOLKER, AGD, DGD2
2) Transcription: Data models, data formats, TEI
3) Corpora: Recordings, transcripts, metadata
4) Query requirements
5) Query technologies
6) Demo
7) Future directions
Background
• EXMARaLDA: System for building and querying spoken language corpora
• Used in many individual projects, at the HZSK CLARIN Centre
• Transcription editor, corpus management tool, query tool EXAKT
• FOLKER: Transcription tool – same technical basis, optimised for the Research and Teaching Corpus of Spoken German (FOLK)
Background
• Archive for Spoken German (AGD): central archive for oral corpora in Germany, IDS Mannheim
• Dialect corpora, conversation corpora
• Database for Spoken German (DGD2): access (browsing and query) for AGD data
Model: Single timeline, multiple tiers (STMT)
• Annotation tuples: text label + timeline reference
• Timeline: fully ordered, reference to a recording
• Tiers: collections of annotations of a specific category and a specific speaker; annotations in a tier do not overlap
• Cf. Annotation Graph Framework (Bird/Liberman 2001)
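The model above can be sketched in a few lines of Python. This is only an illustration of the three notions on the slide (annotation tuples, one ordered timeline, non-overlapping tiers); the class and field names are hypothetical and not taken from EXMARaLDA or any other tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    label: str  # text label
    start: int  # index of the start point on the shared timeline
    end: int    # index of the end point on the shared timeline


@dataclass
class Tier:
    speaker: str
    category: str
    annotations: list = field(default_factory=list)

    def add(self, annotation: Annotation) -> None:
        """Add an annotation, rejecting overlaps within this tier."""
        if annotation.start >= annotation.end:
            raise ValueError("annotation must span a positive interval")
        for other in self.annotations:
            if annotation.start < other.end and other.start < annotation.end:
                raise ValueError(f"overlaps with {other.label!r}")
        self.annotations.append(annotation)


@dataclass
class Transcription:
    timeline: list  # fully ordered offsets (in seconds) into one recording
    tiers: list = field(default_factory=list)


# Hypothetical sample data.
transcription = Transcription(timeline=[0.0, 1.2, 1.9, 3.4])
verbal = Tier(speaker="SPK0", category="v")
verbal.add(Annotation("okay ", 0, 1))
verbal.add(Annotation("so ", 1, 2))
transcription.tiers.append(verbal)

# A third annotation that overlaps "so " is refused by the tier.
try:
    verbal.add(Annotation("overlapping", 1, 3))
    overlap_rejected = False
except ValueError:
    overlap_rejected = True
```

Annotations in different tiers may overlap freely; only the per-tier constraint is enforced here.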
EXMARaLDA Basic Transcription:
• (Flat) hierarchy of events in tiers
• Use of ID and IDREFS to encode temporal relations
• No additional markup, no "deep" semantics
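A minimal sketch of how the ID/IDREF mechanism can be resolved when reading such a file. The sample document is hand-written and heavily simplified; its element and attribute names (`common-timeline`, `tli`, `tier`, `event`, `start`, `end`) are stated here as assumptions about the format and should be checked against real files before relying on them.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample (assumed structure, no header or metadata).
SAMPLE = """
<basic-transcription>
  <basic-body>
    <common-timeline>
      <tli id="T0" time="0.0"/>
      <tli id="T1" time="1.2"/>
      <tli id="T2" time="1.9"/>
    </common-timeline>
    <tier id="TIE0" speaker="SPK0" category="v" type="t">
      <event start="T0" end="T1">okay </event>
      <event start="T1" end="T2">so </event>
    </tier>
  </basic-body>
</basic-transcription>
"""

root = ET.fromstring(SAMPLE)

# Timeline items carry the IDs; events point at them via start/end references.
times = {tli.get("id"): float(tli.get("time")) for tli in root.iter("tli")}

events = []
for tier in root.iter("tier"):
    for event in tier.iter("event"):
        events.append((
            tier.get("speaker"),
            times[event.get("start")],  # resolve reference to a time offset
            times[event.get("end")],
            event.text,
        ))
```

The events themselves carry no further markup, which is what makes the hierarchy flat: everything beyond tier and event has to be derived later.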
• EXMARaLDA
• ELAN
• Praat
Data formats
• Schmidt, Loehr et al. (2008): An exchange format for multimodal annotations.
– XML format for data exchange between seven tools with STMT data models
– improves interoperability for data creation
• Drawbacks
– no document order (non-linear, non-hierarchical)
– what is the "full text" / the "primary data" / the "character data"?
– no explicit representation of dependencies
– temporal structure, not linguistic structure: bad for querying?
STMT to OHCO transformation
STMT to OHCO transformation
• Segment chain = any temporally connected chain of annotations within one tier
• Assumption: all other hierarchical structure lies beneath the level of segment chains
• Correspondence: segment chain ↔ <u>
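The segment-chain idea can be sketched as follows: annotations of one tier are sorted by time and grouped whenever one ends exactly where the next begins, and each group becomes one `<u>` element. The tuple layout and the attribute names `who`, `start` and `end` are illustrative assumptions, not the actual conversion code.

```python
import xml.etree.ElementTree as ET


def segment_chains(annotations):
    """Group (start, end, text) tuples of one tier into temporally connected chains."""
    chains = []
    for ann in sorted(annotations):
        # Connected: the previous annotation ends where this one starts.
        if chains and chains[-1][-1][1] == ann[0]:
            chains[-1].append(ann)
        else:
            chains.append([ann])
    return chains


def chains_to_utterances(speaker, annotations):
    """Build one <u> element per segment chain."""
    utterances = []
    for chain in segment_chains(annotations):
        u = ET.Element("u", who=speaker,
                       start=str(chain[0][0]), end=str(chain[-1][1]))
        u.text = "".join(text for _, _, text in chain)
        utterances.append(u)
    return utterances


# Hypothetical tier: two connected events, then a gap, then a third event.
tier = [(0, 1, "okay "), (1, 2, "so "), (4, 5, "yes")]
utterances = chains_to_utterances("SPK0", tier)
serialized = [ET.tostring(u, encoding="unicode") for u in utterances]
```

Once the chains are serialized as `<u>` elements, the transcript has a document order and a "full text" per utterance, which addresses two of the drawbacks listed for the exchange format.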
Unparsed (EXAKT) Parsed (DGD2)
Free annotation (EXAKT)
Token annotation (DGD2)
• Schmidt (2011): A TEI-based Approach to Standardising Spoken Language Transcription. jTEI (1)
• Romary, Witt, Schmidt: ISO/DIN PWI 24624: Transcription Of Speech
Transcripts, recordings, metadata
• Interaction metadata
– date, "genre", place, degree of formality, etc.
– pertains to a (set of) transcription(s)
• Speaker metadata
– age, sex, language biography, speech impediments, etc.
– pertains to (a) part(s) of a transcription
• Audio and video recordings
– for checking transcription quality
– for obtaining information not encoded in transcripts
• Transcripts
– not (the) primary data!
– a "convenient index into the recording"?
– selective, theory-dependent, …
Corpora
Corpora
• AGD Corpora: 8 mill. tokens
• CGN Corpus: 9 mill. tokens
• BNC Spoken: 10 mill. tokens
• MICASE: 2 mill. tokens
• Most other corpora: < 1 mill. tokens
(At least) one order of magnitude smaller than written corpora
Query speed is (not that) important
• "In informal conversation in Northern Scotland, older female speakers tend to use 'aye' as a backchannel signal with a rising intonation"
– Situational context → Interaction metadata
– Speaker metadata
– Text data / Surface form → Transcript text
– Interactional context → Temporal transcript structure
– Prosodic properties → Recording
Requirement #1: Access to all types of context
Requirement #2: (Manual) postprocessing of query results
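To make the 'aye' example concrete, here is a rough sketch of a query result that keeps references to speaker and interaction metadata and to the recording. All records, field names and the age threshold for "older" are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    token: str
    speaker_id: str
    interaction_id: str
    start: float  # offset into the recording, in seconds
    end: float


# Hypothetical metadata tables.
speakers = {"S1": {"age": 67, "sex": "f"}, "S2": {"age": 24, "sex": "m"}}
interactions = {"I1": {"region": "Northern Scotland", "formality": "informal"}}

# Hypothetical hits for the surface form "aye".
hits = [
    Hit("aye", "S1", "I1", 12.4, 12.7),
    Hit("aye", "S2", "I1", 15.0, 15.3),
]


def matches(hit, min_age=60):
    """Filter by speaker and interaction metadata (min_age is an assumed cut-off)."""
    speaker = speakers[hit.speaker_id]
    interaction = interactions[hit.interaction_id]
    return (speaker["sex"] == "f"
            and speaker["age"] >= min_age
            and interaction["formality"] == "informal")


candidates = [h for h in hits if matches(h)]
# Rising intonation is not in the transcript: each candidate still has to be
# checked manually against the recording between h.start and h.end.
```

The last step is the reason for Requirement #2: the automatic query only narrows down the candidates, and the prosodic judgement is made by listening.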
• "After a cut-off word followed by a pause of more than 0.3 seconds, the cut-off word is frequently repeated"
– special word tokens (incomplete words, semi-lexical material, …)
– non-word tokens (pauses, non-verbal articulations, …)
– temporal measurements (pause length)
Requirement #3: Queries for "special" tokens
Requirement #4: Queries with special properties (numerical values, repetition)
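The cut-off hypothesis above needs all three ingredients at once: a special token type, a numerical condition and a repetition check. A minimal sketch over a hypothetical token list follows; the token representation is invented, and "repeated" is approximated as "the next word starts with the cut-off fragment".

```python
# Hypothetical token stream of one speaker.
tokens = [
    {"type": "word", "form": "ich"},
    {"type": "cutoff", "form": "geh"},
    {"type": "pause", "duration": 0.45},
    {"type": "word", "form": "gehe"},
    {"type": "cutoff", "form": "na"},
    {"type": "pause", "duration": 0.2},
    {"type": "word", "form": "nach"},
    {"type": "cutoff", "form": "ha"},
    {"type": "pause", "duration": 0.8},
    {"type": "word", "form": "dort"},
]


def cutoff_pause_contexts(tokens, min_pause=0.3):
    """Find cut-off + pause (> min_pause seconds) + word, and flag repetitions."""
    results = []
    for first, pause, following in zip(tokens, tokens[1:], tokens[2:]):
        if (first["type"] == "cutoff"
                and pause["type"] == "pause"
                and pause["duration"] > min_pause
                and following["type"] == "word"):
            repeated = following["form"].startswith(first["form"])
            results.append((first["form"], following["form"], repeated))
    return results


contexts = cutoff_pause_contexts(tokens)
```

Here the second cut-off is skipped because its pause is too short, and the third is found but not counted as a repetition.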
• "Filled pauses are less frequent in overlapping speech than at the beginning of turns"
• "Modal particles and modal adverbs often occur near one another in an utterance" vs. "Filled pauses occur more frequently near another speaker's backchannel"
Requirement #5: Queries for position in temporal structure
Requirement #6: Multiple distance measures, query scopes
[…]
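Position in temporal structure can be illustrated with an overlap test: a token counts as "in overlapping speech" if its time interval intersects a token of a different speaker. The tokens, time values and the small set of filled-pause forms are hypothetical.

```python
from collections import namedtuple

Token = namedtuple("Token", "speaker form start end")

# Hypothetical time-aligned tokens of two speakers (times in seconds).
tokens = [
    Token("A", "äh", 0.0, 0.4),
    Token("A", "ja", 0.4, 0.8),
    Token("B", "hm", 0.6, 0.9),
    Token("B", "äh", 0.9, 1.2),
    Token("A", "genau", 1.0, 1.5),
]

FILLED_PAUSES = {"äh", "ähm"}  # assumed inventory, for illustration only


def in_overlap(token, tokens):
    """True if the token's interval intersects a token of another speaker."""
    return any(
        other.speaker != token.speaker
        and token.start < other.end
        and other.start < token.end
        for other in tokens
    )


filled = [
    (t.speaker, t.start, in_overlap(t, tokens))
    for t in tokens
    if t.form in FILLED_PAUSES
]
```

This kind of condition cannot be read off a single speaker's utterance text; it needs the shared timeline, which is why it is listed as a separate requirement.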
• Requirements
– Access to all types of context
– Manual post-processing of query results
– Queries for special tokens
– Queries with special properties
– Queries for position in temporal structure
– Multiple distance measures, query scopes
– …
[Diagram: Corpus (Recordings, Metadata, Transcripts) · Query · Query result · Context · Postprocessing]
• EXAKT
– Regular expression on "full text" of <u>
– (XPath on <u> with markup)
– (XSL on transcripts)
• DGD2
– Oracle full text on documents
– SQL on <w> with attributes
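The two approaches can be contrasted on a toy document. This is a sketch only: the `lemma`, `pos` and `who` attributes are assumed for illustration, and an in-memory SQLite table stands in for the Oracle database; neither reflects the actual EXAKT or DGD2 implementation.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical tokenised transcript fragment.
DOC = """
<text>
  <u who="A"><w lemma="ich" pos="PPER">ich</w> <w lemma="gehen" pos="VVFIN">gehe</w></u>
  <u who="B"><w lemma="gehen" pos="VVFIN">gehst</w> <w lemma="du" pos="PPER">du</w></u>
</text>
"""
root = ET.fromstring(DOC)

# 1) Regular expression on the "full text" of each <u>.
pattern = re.compile(r"\bgeh\w*")
regex_hits = []
for u in root.iter("u"):
    full_text = "".join(u.itertext())
    regex_hits += [(u.get("who"), m.group()) for m in pattern.finditer(full_text)]

# 2) SQL on <w> rows with their attributes (SQLite as a stand-in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE w (speaker TEXT, form TEXT, lemma TEXT, pos TEXT)")
conn.executemany(
    "INSERT INTO w VALUES (?, ?, ?, ?)",
    [(u.get("who"), w.text, w.get("lemma"), w.get("pos"))
     for u in root.iter("u") for w in u.iter("w")],
)
sql_hits = conn.execute(
    "SELECT speaker, form FROM w WHERE lemma = ? ORDER BY speaker", ("gehen",)
).fetchall()
conn.close()
```

Both queries find the same two forms here, but the regular expression depends on the surface string, while the SQL query relies on token-level annotation being present.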
• Demo 1: EXAKT with HaMaTaC corpus
• HaMaTaC: Hamburg Map Task Corpus
– advanced L2 learners of German
– solving a map task
– Orthographic transcription with lemma, POS, disfluency annotation
• Demo 2: DGD2 with FOLK Corpus
• FOLK: Research & Teaching Corpus of Spoken German
• Future directions:
– Support a "real" query language: CQL
– CQPWeb as a test case
– User survey DGD2 (approaching 2000 users!)
– …
– …
– TEI as common ground
• for different spoken language corpora query platforms?
• for querying spoken and written data side-by-side?