Deliverable 6.2
Multi-microphone data collection and WOZ
experiments for the analysis of user behaviour in
the DICIT scenarios
Authors: Lutz Marquardt, Luca Cristoforetti, Edwin
Mabande, Nicole Beringer, Fiorenza Arisio,
Matthias Bezold
Affiliations: FAU, FBK-irst, EB, Amuser
Date: 28-Apr-2008
Document Type: R
Status/Version: 1.0
Dissemination Level: PU
FP6 IST-034624 http://dicit.itc.it
D6.2 – Multi-microphone Data Collection and WOZ Experiments
DICIT_D6.2_20080428 ii
Project Reference FP6 IST-034624
Project Acronym DICIT
Project Full Title Distant-talking Interfaces for Control of Interactive TV
Dissemination Level PU
Contractual Date of Delivery 31-Mar-2007
Actual Date of Delivery Preliminary Version: 11-January-2008
Final Version: 28-April-2008
Document Number DICIT_D6.2_V1.0_20080428
Type Deliverable
Status & Version 1.0
Number of Pages 7+86
WP Contributing to the Deliverable WP6 (WP responsible: Nicole Beringer – EB)
WP Task responsible Lutz Marquardt (FAU)
Authors (Affiliation) Lutz Marquardt and Edwin Mabande (FAU), Luca Cristoforetti
(FBK-irst), Nicole Beringer and Matthias Bezold (EB), Fiorenza Arisio (Amuser)
Other Contributors Walter Kellermann (FAU), Federica Vola (Amuser)
Reviewer
EC Project Officers
Anne Bajart (till January 31st 2007), Erwin Valentini (from
February 1st till October 31st 2007), Pierre Paul Sondag
(from November 1st 2007)
Keywords: data collection, WOZ experiments, multi-microphone devices, distant-talking
speech recognition devices, voice-operated devices, Interactive TV, anti-intrusion,
surveillance.
Abstract:
The purpose of this document is to describe the multi-microphone data collection and WOZ
experiments that have been conducted under DICIT. While the first task's objective was to
provide testing data for acoustic pre-processing algorithms, the latter activity aimed at
determining user behaviour as a basis for the dialog specification.
© DICIT Consortium
Contents
Contents ..................................................................................................................................... iii
List of Figures ............................................................................................................................ v
List of Tables ............................................................................................................................ vii
Summary .................................................................................................................................... 1
Introduction ................................................................................................................................ 2
Part I. Multi-channel Data Acquisition / Acoustic WOZ ..................................................... 3
1. Experimental Setup ............................................................................................................ 3
1.1 Hardware Setup .......................................................................................................... 3
1.1.1 Microphone Arrays ............................................................................................. 4
1.1.2 General Hardware Setup .................................................................................... 5
1.2 Software Setup ........................................................................................................... 8
1.3 Recording Room ......................................................................................................... 9
2. Recording Sessions ........................................................................................................... 12
3. Room Impulse Response Measurements .......................................................................... 14
4. Data Exploitation .............................................................................................................. 15
4.1 Data Annotation ....................................................................................................... 15
4.2 Data Exploitation / Testing ....................................................................................... 17
Part II. Dialogue WOZ ......................................................................................................... 18
1. Experimental Setups and Recordings ............................................................................... 18
1.1 General Experimental Setup – The DICIT WOZ System ........................................ 18
1.2 Experimental Setup at EB ........................................................................................ 19
1.2.1 Hardware Setup ................................................................................................ 20
1.2.2 Software Setup ................................................................................................. 21
1.2.3 Recording Sessions at EB ................................................................................. 21
1.3 Experimental Setup at Amuser ................................................................................. 22
1.3.1 Hardware Setup ................................................................................................ 23
1.3.2 Software Setup ................................................................................................. 23
1.3.3 Recording Sessions at Amuser ......................................................................... 23
2. Questionnaire .................................................................................................................... 25
2.1 Statistical Questions (Questions 1-4) ....................................................................... 25
2.2 TV Habits (Questions 5-12) ..................................................................................... 27
2.3 The DICIT System (Questions 13-29) ..................................................................... 28
2.3.1 Using the DICIT System .................................................................................. 28
2.3.2 Watching the Screen ......................................................................................... 30
2.3.3 Vocal Interaction .............................................................................................. 33
2.3.4 The System Voice ............................................................................................ 35
2.4 General Opinion of the DICIT WOZ Prototype (Questions 30 and 31) .................. 37
2.4.1 Users' experiences with DICIT ........................................................ 37
2.4.2 Rating user satisfaction within DICIT .............................................................. 39
2.5 Summary .................................................................................................................. 40
3. Session Evaluation ........................................................................................................... 41
3.1 Logging Data ............................................................................................................ 41
3.1.1 Logging Data .................................................................................................... 41
3.1.2 Number of Screens and Views ......................................................................... 42
3.1.3 Screen preferences of the Users ....................................................................... 44
3.1.4 Remote Control vs. Voice Control ................................................................... 46
3.1.5 TTS Usage ........................................................................................................ 48
3.1.6 Barge-In Behavior ............................................................................................ 54
3.1.7 User Speech Time ............................................................................................ 56
3.1.8 Multi-Slot Usage .............................................................................................. 58
3.1.9 Off-Talk ............................................................................................................ 60
3.1.10 Overlaps ............................................................................................................ 61
3.2 Observation of the Wizards ...................................................................................... 63
3.2.1 People have to be encouraged to use natural language input ........................... 63
3.2.2 People tend to use simple commands ............................................................... 64
3.2.3 Some people use Barge-In, others do not ......................................................... 64
3.2.4 Reset function not self-explanatory .................................................................. 64
3.2.5 Remote Control is hardly used ......................................................................... 64
4. Conclusions for Subsequent Prototypes ........................................................................... 65
4.1 Overall Conclusions ................................................................................................. 65
4.2 Dialog and Menu Structure ...................................................................................... 65
4.3 Speech Dialog ........................................................................................................... 66
4.4 Remote Control ........................................................................................................ 66
4.5 Considerations Among the Two Samples ............................................................... 67
Appendix A – Microphone Arrays ........................................................................................... 68
Appendix B – The Questionnaire ............................................................................................. 70
Appendix C – The WOZ Instructions at EB ............................................................................ 78
Appendix D – List of Predefined TTS Prompts ....................................................................... 79
Appendix E – Screenshots of the Views .................................................................................. 82
Bibliography ............................................................................................................................. 85
List of Figures
Figure 1: Harmonic Nested Array (all distances are in cm) ....................................................... 4
Figure 2: NIST MarkIII Microphone Array ............................................................................... 5
Figure 3: FAU setup ................................................................................................................... 6
Figure 4: FBK setup ................................................................................................................... 7
Figure 5: FAU recording room setup ....................................................................................... 10
Figure 6: FBK recording room setup ....................................................................................... 10
Figure 7: FAU array setup ........................................................................................................ 11
Figure 8: FBK array setup ........................................................................................................ 12
Figure 9: Images of the FBK room .......................................................................................... 12
Figure 10: Impulse response measurement setup ..................................................................... 14
Figure 11: A transcription session using the Transcriber tool.................................................. 15
Figure 12: DICIT WOZ menu structure ................................................................................... 19
Figure 13: The WOZ setup at EB. ............................................................................................ 20
Figure 14: The WOZ setup at Amuser ..................................................................................... 22
Figure 15: German: Different screens and views. .................................................................... 42
Figure 16: English: Different screens and views. ..................................................................... 43
Figure 17: German: Screen preferences. Name; time in minutes; percentage. ........................ 44
Figure 18: German: Screen preferences of the individual subjects .......................................... 44
Figure 19: English: Screen preferences. Name; time in minutes; percentage. ......................... 45
Figure 20: English: Screen preferences of the individual subjects. ......................................... 46
Figure 21: German: Amount of voice and remote control input. ............................................. 47
Figure 22: English: Amount of voice and remote control input. .............................................. 48
Figure 23: German: Types of TTS output. ............................................................................... 49
Figure 24: Types of TTS output. .............................................................................................. 49
Figure 25: Prompt types in EPG_MainMenu_View. ............................................................... 50
Figure 26: Prompt types in EPG_ManualInput. ....................................................................... 50
Figure 27: Prompt types in EPG_ResultList. ........................................................................... 50
Figure 28: Prompt types in View. ............................................................................................ 50
Figure 29: English: Types of TTS output. ................................................................................ 52
Figure 30: English: Types of TTS output per user. .................................................................. 52
Figure 31: Italian: Types of TTS output. .................................................................................. 53
Figure 32: Italian: Types of TTS output per user. .................................................................... 53
Figure 33: German: Number of barge-ins per subject. ............................................................. 55
Figure 34: English: Number of barge-ins per subject. ............................................................. 55
Figure 35: Italian: Number of barge-ins per subject ................................................................ 56
Figure 36: German: User speech time per subject. Red line is average. .................................. 56
Figure 37: English: User speech time per subject. Red line is average. ................................... 57
Figure 38: Italian: User speech time ........................................................................................ 57
Figure 39: German: Multi-slot evaluation. ............................................................................... 59
Figure 40: English: Multi-slot evaluation. ............................................................................... 59
Figure 41: German: Off-talk. .................................................................................................... 60
Figure 42: English: Off-talk. .................................................................................................... 60
Figure 43: Italian Off-talk. ....................................................................................................... 61
Figure 44: German and English: Overlaps. .............................................................................. 62
Figure 45: Italian: Overlaps. ..................................................................................................... 63
List of Tables
Table 1: Noise event classes ..................................................................................................... 16
Table 2: Statistical questions .................................................................................................... 26
Table 3: Habits questions ......................................................................................................... 28
Table 4: General usability questions ........................................................................................ 29
Table 5: Screen feedback questions ......................................................................................... 31
Table 6: Vocal mode questions ................................................................................................ 34
Table 7: Listening to the system voice ..................................................................................... 36
Table 8: User general opinion .................................................................................................. 38
Table 9: Semantic differential .................................................................................................. 39
Table 10: Logging data created by GUIDE. ............................................................................. 42
Table 11: Number of different views divided by number of different screens. ....................... 43
Table 12: Prompt types per view. ............................................................................................. 50
Table 13: Spatial aliasing limits of sub-arrays ......................................................................... 68
Summary
Extensive Wizard of Oz (WOZ) experiments for the interactive TV scenario have been carried
out and evaluated. The WOZ experiments were carried out in German, English and Italian.
The acoustic WOZ experiments were carried out at FAU and FBK. They involved the
acquisition of multi-channel data for the signal front-end, motivated by the need for a
database for testing acoustic pre-processing algorithms. Besides the user inputs, the database
also contains non-speech acoustic events, room impulse responses and video data.
The dialogue WOZ experiments were carried out at EB and Amuser in order to obtain
sufficient data for characterizing user behavior, vocabulary, language, etc. The data provides
a basis for the specification of the dialog model of the DICIT prototypes. The general
impression of the users is that the WOZ prototype is easy to use, efficient, original, capable
and well-organized.
Introduction
The work conducted during the first project year of DICIT with respect to WP6 consisted of
the tasks T6.1 “Market study and user expectations for system and interface design”,
T6.2 “Data collections: multi-channel data acquisition for signal front-end” and T6.3 “WOZ
for interactive TV scenario and study of user behavior”. While the description and discussion
of T6.1 were addressed in Deliverable D6.1, this document focuses on the latter tasks.
The intention of tasks T6.2 and T6.3 according to the DICIT Technical Annex is the
“collection of multi-microphone data and Wizard-of-OZ data” respectively [1]. “This data
will allow characterizing user behavior, vocabulary, and language, etc., and other information
which is necessary to conduct part of the activities scheduled in WP3, WP4, and WP5.”
In a Wizard-of-OZ (WOZ) experiment, a subject is requested to complete specific tasks using
an artificial system. The user is told that the system is fully functional and should try to use it
in an intuitive way, while the system is operated by a person not visible to the subject. The
operating person – called wizard – can react to user input in a more comprehensive way than
any system could, because he/she is not confined by computer logic. From a WOZ study,
interaction patterns can be extracted and applied to an actual prototype.
Due to the need to carry out WOZ experiments on the one hand and to create a database by
means of multi-channel data acquisition on the other, the tasks T6.2 and T6.3 were combined.
In this regard, both dialogue WOZ and acoustic WOZ setups and task flows were created to
meet the respective requirements. The former aimed solely at analyzing the user behavior in
the foreseen DICIT TV scenario, thus enabling the tailoring of the dialogue design to the user
requirements. The latter, however, consisted of a different WOZ environment, focusing not on
behavior analysis from the dialogue point of view, but on the need to create realistic usage
scenarios for acoustic pre-processing purposes.
In the following, Part 1 of this document describes the “multi-channel data acquisition”
related to both acoustic WOZ and impulse response measurement addressed by FAU and
FBK, whereas the dialogue WOZ, which was conducted by EB (formerly Elektrobit) and
Amuser, is described in Part 2.
Part I. Multi-channel Data Acquisition / Acoustic WOZ
For the design of an acoustic front-end for the DICIT prototypes, the chosen multi-channel
approach allows for the exploitation of the sources' spatial distribution. Array signal
processing algorithms, such as beamforming, blind source separation and acoustic source
localization, make use of an array – a group of microphones – to extract information from a
wave field. They are therefore well suited to the challenge of developing distant-talking
speech interfaces. In order to meet these requirements, a microphone array has been
implemented for DICIT, which will be introduced in Section 1.1.1.
The main objective addressed by the task “Data collections: multi-channel data acquisition for
signal front-end” was to collect a database for testing acoustic pre-processing algorithms.
Thus, realistic scenarios can be simulated, which avoids the need for real-time
implementations at a preliminary stage and allows for repeatable experiments. Section 1
gives a description of the hardware and software setup employed by FAU and FBK, as well
as of the respective recording environments for the data acquisition.
Moreover, simulations that are produced from the combination of WOZ experiments with
multi-channel data acquisition also include hard-to-handle acoustic situations that only arise
from or become obvious in real-life scenarios. The task flow of the acoustic WOZ recording
sessions is presented in Section 2.
Measured room impulse responses may be used for off-line testing of both acoustic pre-
processing and speech recognition algorithms, enabling the artificial creation of simulation
data out of clean speech signals. The corresponding measurements are described in Section 3.
Section 4 finally reports on the annotation of the recorded WOZ data, which is necessary to
allow its further exploitation for speech recognition, event detection and speaker localization.
1. Experimental Setup
1.1 Hardware Setup
For multi-channel microphone acquisition the nested array (which will be further described in
Section 1.1.1) was chosen as an adequate and flexible means to meet the requirements of the
DICIT scenarios.
In order to create a testing database for acoustic pre-processing, the main objective of task
T6.2 was therefore to collect synchronized data from the nested array. In addition, further
microphone signals as well as the TV-loudspeaker signals had to be recorded
synchronously, e.g. for reference purposes. Cameras were installed to deliver further visual
reference information. The choice and acquisition of the respective hardware and the
construction of the nested array was thus established as the basis for further work.
Since the setup had to be installed for the mandatory recordings mentioned above anyway, it
was decided to carry out additional parallel recordings with the 64-channel MarkIII array
developed at NIST, which is described in more detail in Section 1.1.1. Thus it was possible
to collect more data for later testing and comparison purposes with little extra effort.
A reduced version of the same setup could be used for the acquisition of room impulse
responses.
1.1.1 Microphone Arrays
This subsection describes the two microphone arrays that were used for the acoustic WOZ
experiments.
Harmonically Nested Array
The nested microphone array depicted in Figure 1 consists of 13 linearly placed electret
microphones plus two vertically placed electret microphones.
Figure 1: Harmonic Nested Array (all distances are in cm)
It forms four linear sub-arrays, three of which consist of five microphones and one which
consists of seven microphones. The nested array allows for the exploitation of different sub-
arrays in order to meet the requirements of each of the different acoustic pre-processing
modules in terms of inter-microphone spacings (see Appendix A for further explanation).
NIST MarkIII Array
In the acoustic WOZ setup another linear microphone array, a modified NIST Microphone
Array MarkIII depicted in Figure 2, was also used [2].
The MarkIII is composed of 64 uniformly-spaced microphones, specifically developed for far-
field voice recognition, speaker localization and audio processing. It records synchronous data
at a sampling rate of 44.1 kHz or 22.05 kHz with a precision of 24 bits. The particularities of
this array are its modularity, the digitalization stage and the data transmission via an Ethernet
channel using the TCP/IP protocol. For further information please refer to Appendix A.
Figure 2: NIST MarkIII Microphone Array
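The MarkIII delivers its 24-bit samples as a continuous multi-channel stream over the network. The exact packet layout is not reproduced here; assuming raw interleaved little-endian signed 24-bit PCM frames, de-interleaving such a stream could be sketched as follows (deinterleave_24bit is a hypothetical helper, not part of the NIST software):

```python
def deinterleave_24bit(raw: bytes, n_channels: int) -> list[list[int]]:
    """Unpack interleaved little-endian signed 24-bit PCM bytes into
    one list of integer samples per channel (assumed stream layout)."""
    frame_size = 3 * n_channels
    usable = len(raw) - (len(raw) % frame_size)  # drop any trailing partial frame
    channels: list[list[int]] = [[] for _ in range(n_channels)]
    for off in range(0, usable, 3):
        b0, b1, b2 = raw[off], raw[off + 1], raw[off + 2]
        value = b0 | (b1 << 8) | (b2 << 16)
        if value & 0x800000:          # sign-extend from 24 bits
            value -= 1 << 24
        channels[(off // 3) % n_channels].append(value)
    return channels
```

In practice the bytes would be read from a TCP socket connected to the array, as described above.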
1.1.2 General Hardware Setup
The following description explains the hardware setups installed by FAU and FBK that were
employed to address the acoustic WOZ experiments.
To enable the simulation of the DICIT system for WOZ purposes via EB GUIDE Studio 2.80
(which will be further described in Section 1.2), as well as the parallel recording of 26
loudspeaker and microphone channels plus 64 additional microphone channels from the
MarkIII array, three PCs had to be employed. Due to the high data rates involved, two of the
three PCs had to feature high processing power in order to avoid data loss.
FAU Setup
A block diagram of the hardware setup used at FAU is depicted in Figure 3. In connection
with the nested array (equipped with Panasonic WM60-AT microphones), the audio
acquisition at FAU was facilitated by a Linux PC with a Dual Xeon 1.7 GHz processor (PC1)
utilizing the software “ecasound” which will be described in Section 1.2. Additionally, two
extra microphones mounted on the nested array (Panasonic WM60-AT), a table-microphone
(Shure MX 391/0), two lateral microphones (AKG SE 300 B), four close-talk microphones
(Shure WH20) as well as the stereo TV loudspeaker signals were synchronously recorded
with the 15 nested array microphones. For the connectors of the close-talk microphones three
XLR (Shure WH20XLR) and one Tini QG connector (Shure WH20TQG) were chosen, the
latter one enabling signal transmission via a wireless system (Shure PG14E R10) and thus
allowing more freedom of movement for the bearer. (It should be mentioned that the table
microphone signal was split by the preamplifier: apart from its optical transmission to PC1
via ADAT, the analogue output was routed to headphones to be monitored by the wizard.)
A virtual “multi”-device consisting of two synchronized RME HDSP 9652 multi-channel
soundcards acquired the nested array data via three ADAT ports as well as the remaining
audio signals listed above via another two ADAT ports. During each of the recording sessions
approximately 9 gigabytes (GB) of audio data were recorded by PC1. The nested array
microphone signals were processed by a FAU-constructed “Mic24ADAT”-device, integrating
microphone power supply, AD-conversion, pre-amplification and conversion to an optical
data stream. Optical data is transmitted from the “Mic24ADAT” directly to PC1 via three
TOSLINK cables (ADAT), thus allowing for a maximum of 24 separate channels. Remaining
microphone and loudspeaker signals were digitized and pre-amplified by means of a Presonus
Digimax and transmitted via two TOSLINK cables (ADAT) to PC1. 26 channels were used in
total. The Presonus Digimax served as master for synchronizing all devices related to the PC1
recordings to a 48 kHz clock – slaves drew their clock signal via Word Clock or ADAT.
A NIST-developed software was used together with the MarkIII-array. The array was
connected via cross LAN cable to the network adapter of a Linux PC equipped with a Dual
Xeon 2.67 GHz processor (PC2). Approximately 10 GB of audio data was recorded by PC2 in
connection with each recording session.
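As a rough sanity check, the per-session data volumes quoted above follow from the recording formats stated in the text. The session lengths below are inferred from the volumes, not stated in the document, and the MarkIII figure would halve at its optional 22.05 kHz mode:

```python
# Sanity check of per-session data volumes from the stated formats.

def rate_bytes_per_s(channels: int, sample_rate_hz: int, bits: int) -> int:
    """Raw PCM data rate in bytes per second."""
    return channels * sample_rate_hz * bits // 8

# PC1: 26 channels at 48 kHz, 32-bit (ecasound via HDSP 9652)
pc1 = rate_bytes_per_s(26, 48_000, 32)    # 4,992,000 B/s
# PC2: 64 channels at 44.1 kHz, 24-bit (MarkIII)
pc2 = rate_bytes_per_s(64, 44_100, 24)    # 8,467,200 B/s

gb = 1_000_000_000
print(f"PC1: {pc1 / 1e6:.2f} MB/s -> 9 GB corresponds to ~{9 * gb / pc1 / 60:.0f} min")
print(f"PC2: {pc2 / 1e6:.2f} MB/s -> 10 GB corresponds to ~{10 * gb / pc2 / 60:.0f} min")
```

Both figures point to session lengths on the order of tens of minutes, consistent with extended WOZ recordings.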
PC3 running under Windows XP was equipped with a Dual Xeon 1.7 GHz processor and 768
MB RAM. A graphic card with multi-display technology (AGP Matrox Millenium G550)
enabled the connection of two graphic devices (beamer and wizard monitor) provided with
independent signals. TV contents were transmitted by two loudspeakers and a beamer – the
respective audio signals were split by the Presonus Digimax and synchronously recorded
together with the nested array as already mentioned above. Additionally, a remote control had
to be integrated into the system – an IR receiver in the recording room was connected via
cable to the serial port of the PC, which in turn was monitored by “WinLIRC” (see Section
1.2). One video camera was employed to provide visual reference and location information.
Figure 3: FAU setup
FBK Setup
At FBK a similar setup to that of FAU was used as depicted in Figure 4. PC1 recorded 15
channels from the nested array plus nine more channels. Four close-talk microphones
(Countryman E6DW5) were used to record the user inputs. Two of them were connected to a
wireless system (CHIAYO QR-4000U, UDR-1000M, UB-2000) while the other two used
regular wires. The two lateral microphones and the table microphone were omnidirectional
boundary layer microphones (Shure Microflex MX391/O). The last two channels carried the
stereo signals of the clips, recorded directly from the audio board of the wizard PC3. The table
microphone was monitored by the wizard to hear what was happening in the room.
All the signals were recorded using three RME OctaMic II microphone preamplifiers with
integrated A/D converters, connected via three TOSLINK cables using the ADAT protocol to
an RME HDSP 9652 digital board installed in PC1. Sample synchronization of all the
OctaMics was guaranteed via a BNC cable connected to the word clock input. Data was
recorded at 48 kHz with 16-bit quantization. The setups of PC2 and PC3 were similar to those
at FAU and therefore do not warrant further description. Three video cameras were employed
to provide visual reference and location information.
Figure 4: FBK setup
1.2 Software Setup
Recording software
As already noted above, the recordings had to cover long sessions at high sampling rates, with
a variety of microphone and loudspeaker signals to be acquired. In order to deliver usable
data for acoustic pre-processing purposes, both acquisition tools had to guarantee lossless and
synchronized recordings of these signals.
The hard-disk recording tool "ecasound" was employed to record the 26 channels
synchronously (this refers to the FAU recordings; the setup at FBK differs minimally). All the
signals were aligned at sample level. These 26 channels were acquired via five ADAT
connections of the two RME HDSP 9652 multi-channel soundcards mounted in PC1.
The signals were recorded into a single 26-channel wav file at 48 kHz sampling rate and 32-
bit resolution. The latter was dictated by the soundcards, but it also allows more flexibility
than recording directly with 16-bit precision: an amplification according to the actual
maximum recording level, followed by a 32-to-16 bit conversion, remains possible. The single
26-channel wav file was subsequently separated into 26 single-channel wav files, and the
32-to-16 bit conversion was carried out.
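The amplification-plus-conversion step described above can be sketched as follows. This is an illustrative reimplementation, not the tool actually used in the project, and the function name is our own:

```python
from array import array

def convert_32_to_16(samples):
    """Amplify 32-bit PCM samples so that the channel's actual maximum
    level reaches full scale, then keep the 16 most significant bits."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return array("h", [0] * len(samples))
    gain = (2 ** 31 - 1) / peak          # normalize to 32-bit full scale
    out = array("h")
    for s in samples:
        v = int(s * gain) >> 16          # 32-bit range -> 16-bit range
        out.append(max(-32768, min(32767, v)))
    return out
```

Normalizing before truncation preserves more of the available dynamic range than recording at 16 bits directly, which is the flexibility argument made above.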
(The soundcard tools "hdspmixer" and "hdspconf" were used for monitoring and for
synchronization configuration: setting the soundcard as slave and acquiring the clock from the
ADAT input, i.e. from the A/D converter.)
The NIST MarkIII array came with utilities to record data to the hard disk. A
command-line program listened on the network card connected to the array and stored the
incoming data stream in a single file. The file contained all 64 interleaved channels at
44.1 kHz with 24-bit resolution. A custom-written program was then used to extract the single
channels and convert them to 16 bits.
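The extraction and conversion amount to de-interleaving the stream and truncating each sample. A minimal sketch follows; the big-endian byte order is an assumption here, and this is not the project's actual custom program:

```python
NUM_CHANNELS = 64  # the MarkIII stream interleaves 64 channels

def extract_channel(raw, channel):
    """Pull one channel out of an interleaved 24-bit PCM byte stream
    and reduce each sample to 16 bits (big-endian byte order assumed)."""
    frame_size = 3 * NUM_CHANNELS            # 3 bytes per sample per channel
    samples = []
    for off in range(channel * 3, len(raw) - 2, frame_size):
        v = int.from_bytes(raw[off:off + 3], "big", signed=True)
        samples.append(v >> 8)               # keep the 16 most significant bits
    return samples
```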
EB GUIDE Studio
EB GUIDE Studio, developed by EB, is an easy-to-use Human Machine Interface (HMI)
development tool that allows the user to specify, simulate, and generate powerful User
Interfaces (UIs). It supports the design of multimodal UIs combining graphical, haptic, and
speech dialogue components, without restrictions on the number or kind of displays or any
other aspect of complexity.
A version of EB GUIDE Studio tailored to the acoustic WOZ, provided by EB and running on
PC3, enabled the WOZ simulation of the DICIT TV scenario. TV content was shown by means
of a beamer displaying six country-specific avi files of half an hour duration each, which had
been pre-recorded from a TV using a digital satellite receiver (Dreambox DM7025).
Additionally, a selection of several teletext pages was available. While TV content
including overlays was transmitted to the beamer, the control interface for the "wizard" was
shown on the respective monitor. TV stereo output, including any generated speech
output, was transmitted to the preamp (which split it up for loudspeaker playback and recording).
The control interface allowed the wizard to react to the test persons' commands. Reactions
included the generation of text outputs (sometimes connected to a text-to-speech engine) and
changes of channel, volume, and teletext page, depending on the current state of the system
(e.g. registration phase, TV transmission). The table-microphone signal, which was recorded
on PC2, was also used to transmit commands to the wizard.
As indicated above, "WinLIRC" was employed, after having been trained properly, to decode
the remote control commands and provide them to the GUIDE software. WinLIRC is free
software for Windows that enables the reception of infrared signals through an optical device
connected to the serial port of the PC. The receiving device was installed in the recording
room and connected via a serial cable to the wizard PC. EB GUIDE Studio then interfaced
with WinLIRC to receive the codes of the buttons pressed on the real remote control.
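WinLIRC broadcasts the decoded button codes to clients over a TCP socket, which is how a program such as EB GUIDE Studio can receive them. A client sketch is shown below; the default port (8765) and the line format ("code repeat button remote", with hexadecimal code and repeat count) follow the LIRC convention and should be treated as assumptions here:

```python
import socket

def parse_lirc_line(line):
    """Split one WinLIRC broadcast line into its fields."""
    code, repeat, button, remote = line.strip().split(None, 3)
    return {"code": code, "repeat": int(repeat, 16),
            "button": button, "remote": remote}

def listen_for_buttons(host="localhost", port=8765):
    """Yield decoded button events from a running WinLIRC server."""
    with socket.create_connection((host, port)) as sock:
        buf = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                return
            buf += chunk
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                if line.strip():
                    yield parse_lirc_line(line.decode("ascii"))
```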
1.3 Recording Room
The television was simulated by means of a video beamer, projecting its output on a wall, and
two high-quality loudspeakers placed on the sides of the screen. The participants sat on four
seats, positioned at a fixed distance from the wall. The 15-element harmonic microphone
array shown in Figure 1 was located next to the screen and represented the acoustic setup that
the DICIT consortium intends to exploit. As already stated above, for comparison purposes
the sessions were also recorded by a NIST Mark III array, which was placed next to the
harmonic array. The table microphone was placed between the arrays and the users and was
meant to simulate a remote control equipped with a microphone. The lateral microphones will
be exploited only for experimental analyses. Finally, each participant was also recorded by a
close-talk microphone, whose signals were used to guarantee robust segmentation and
accurate transcriptions.
At FAU, a single video camera was employed to record the sessions; in this respect, equally
distributed positions were marked on the floor for the speakers, to serve as reference
information for source localization testing. The recording room at FBK was furnished with
three video cameras: one placed on the ceiling and the other two in the upper left-hand and
right-hand corners of the room. Video data were used both to monitor the experiments during
the annotation process and to derive 3D reference positions for each participant. Note that
video and audio signals were manually aligned by exploiting impulsive events present in the
recordings, such as a door slam.
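The manual alignment exploits the fact that an impulsive event such as a door slam stands out as the strongest transient in both recordings. A toy version of the idea (our own sketch, not the procedure actually used):

```python
def impulse_position(signal):
    """Sample index of the strongest transient, e.g. a door slam."""
    return max(range(len(signal)), key=lambda i: abs(signal[i]))

def alignment_offset(reference, other):
    """Offset (in samples) that aligns two recordings of the same
    impulsive event; positive means `other` lags behind `reference`."""
    return impulse_position(other) - impulse_position(reference)
```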
The exact room dimensions and positions of microphones as well as further equipment are
depicted in Figure 5 for the FAU setup and in Figure 6 for the FBK setup.
Figure 5: FAU recording room setup
Figure 6: FBK recording room setup
Figure 7 and Figure 8 show images of the array setups actually used at FAU and FBK,
respectively. Figure 9 shows two images taken from the video cameras at FBK; the active user
can be seen moving in the room and giving voice commands to the system.
Figure 7: FAU array setup
Figure 8: FBK array setup
Figure 9: Images of the FBK room
2. Recording Sessions
Six acoustic WOZ sessions, each of about 30 minutes, were recorded at both FAU (German)
and FBK (Italian), including one English session at FAU. Besides the wizard, four persons
participated in each recording session: three subjects, male and female, and a co-wizard, who
had to ensure that the correct test procedure, described below, was followed. It should be
noted that during certain parts of the experiments the test persons were encouraged to behave
naturally and vividly in order to create barge-in situations, overlapping system commands,
and background noise.
After having been introduced to the general procedure, the four participants entered the
recording room, sat down on their respective seats in front of the arrays, and adjusted their
close-talk microphones. Meanwhile, the "wizard" started the recordings in a separate
monitoring room and, for the rest of the session, listened in on the microphone signals from
the recording room in order to react to the commands uttered by the users.
A set of phonetically rich sentences 1 – taken from the SpeechDat-Car EU project [4] for the
Italian sessions, from "Der Nordwind und die Sonne" for the German sessions, and from the
TIMIT database for the English recording – was read out by each of the participants.
Afterwards, the TV was "switched on" via a voice command by the co-wizard (i.e. the wizard
reacted to the command of the co-wizard).
Next, the participants registered themselves with the DICIT system; initially they had to use
only the remote control in order to switch channels, adjust the volume, etc. After that, the
users were allowed to control the system with both the remote control and voice commands.
After some time to get acquainted with this new kind of TV usage, the subjects were asked to
find specific pages in the teletext via voice commands while walking about in the room; this
movement was specifically intended for later testing of the source localization algorithms. At
the same time, the co-wizard had to produce several noises for later event classification
purposes. These noises included a chair being moved, falling objects (a bottle of water and a
heavy book), laughter, coughing, paper rustling, various phone rings, and door slams (further
details are provided in Section 4.1).
The test subjects had to fill in pre- and post-questionnaires before and after the experiments,
respectively. The former addressed general statistical issues, including dialect and technical
background, whereas the latter focused directly on feedback on the experiments.
Analysis of the questionnaires showed that the subjects were mainly young researchers or
students, skilled in the use of PCs and open to new technologies. They got a good impression
of the DICIT system and considered it useful for controlling the TV and especially for
navigating the teletext.
From their opinions it emerged that the system should offer good language flexibility and
should be fast enough to avoid annoyance. Some recognition errors were tolerated and did not
represent a big distraction.
Results from these questionnaires will be taken into account for the development of the final
prototype.
1 These sentences comprise a quasi-balanced combination of all phonemes of the language in question, leaving
out all combinations that are invalid for that language.
3. Room Impulse Response Measurements
Room impulse response measurements were carried out in order to provide data that could
later be exploited for purposes such as speech contamination. At FAU, the measurements were
made in the same room used for the WOZ experiments, utilizing a Maximum Length
Sequence (MLS). A single loudspeaker played back the MLS while the 15-channel DICIT
array and five separate microphones simultaneously recorded the output. The loudspeaker was
moved to different positions within the room and the measurements were repeated. Figure 10
depicts the loudspeaker positions, the array positions, and the single-microphone positions
within the room.
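An MLS is generated with a linear feedback shift register, and the impulse response is recovered by circularly cross-correlating the recorded signal with the excitation, since an MLS has an almost ideal two-valued autocorrelation. A minimal sketch, where the register length and tap positions are illustrative (any primitive polynomial works):

```python
def mls(register_length, taps):
    """Generate one period of a maximum length sequence (MLS) with a
    linear feedback shift register; output values are +1/-1.
    `taps` are the feedback bit positions; they must correspond to a
    primitive polynomial, e.g. mls(4, (4, 3)) for x^4 + x^3 + 1."""
    state = [1] * register_length
    seq = []
    for _ in range(2 ** register_length - 1):
        seq.append(1 if state[-1] else -1)
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]
    return seq

def circular_xcorr(x, y):
    """Circular cross-correlation; with an MLS excitation this recovers
    a scaled estimate of the room impulse response."""
    n = len(x)
    return [sum(x[i] * y[(i + lag) % n] for i in range(n)) for lag in range(n)]
```

The autocorrelation of a length-15 MLS is 15 at lag zero and -1 everywhere else, which is what makes the cross-correlation approach work.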
Figure 10: Impulse response measurement setup
At FBK, impulse responses were measured in the WOZ room using a chirp sequence, played
back by a loudspeaker placed successively on each of the seats that had been occupied by the
subjects during the WOZ experiments (the positions are shown in Figure 6). The two
microphone arrays recorded the output for each of the four positions to be investigated.
4. Data Exploitation
This section describes the exploitation of the data from the acoustic WOZ experiments.
4.1 Data Annotation
In order to be usable for later algorithm testing and speech recognition, the six FBK sessions
collected within the acoustic WOZ have been transcribed and segmented at word level, also
introducing specific labels for acoustic events.
An annotation guideline, based on previous experience, was written in order to ensure as
much consistency as possible between different annotators [5]. Data were annotated using
"Transcriber", a free annotation tool that permits a multi-channel view [6]. To ease the effort
of understanding the dialogues between users and the system, stereo audio files were created
with the table-microphone signal on the left channel and the sum of the close-talk
microphones on the right channel. In this way, the annotators could selectively listen to either
the environmental noises or the uttered sentences.
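Building these stereo review files amounts to interleaving the table-microphone channel with a clipped sum of the close-talk channels. A minimal sketch, with function and parameter names of our own choosing:

```python
def make_annotation_stereo(table_mic, close_talk_channels):
    """Build interleaved stereo frames: left = table microphone,
    right = sum of the close-talk channels, clipped to the 16-bit range."""
    frames = []
    for i, left in enumerate(table_mic):
        right = sum(ch[i] for ch in close_talk_channels)
        frames.append(left)
        frames.append(max(-32768, min(32767, right)))
    return frames
```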
Figure 11: A transcription session using the Transcriber tool
Annotators were provided with a preliminary automatic segmentation based on the energy of
the close-talk signals. Even though it was not fully reliable, due to cross-talk effects and non-
speech human sounds, this segmentation turned out to be a very useful starting point. It was
also possible to visualize the automatic segmentation for each speaker, which helped in
understanding which user was speaking or producing a noise. Markers were inherited from
the automatic segmentation and adjusted manually in order to leave some silence before and
after the respective event.
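A rough energy-based pre-segmentation of a close-talk channel can be sketched as follows; the frame length and threshold are illustrative, as the actual tool and its parameters are not described here:

```python
def energy_segments(signal, frame_len, threshold):
    """Rough energy-based activity segmentation: return (start, end)
    sample ranges of consecutive frames whose mean energy exceeds
    `threshold`. Frame length and threshold are illustrative values."""
    segments = []
    active = None
    n_frames = len(signal) // frame_len
    for f in range(n_frames):
        frame = signal[f * frame_len:(f + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            if active is None:
                active = f * frame_len        # segment starts here
        elif active is not None:
            segments.append((active, f * frame_len))
            active = None
    if active is not None:
        segments.append((active, n_frames * frame_len))
    return segments
```

As the text notes, such a segmentation is confused by cross-talk and non-speech sounds, which is why the markers were subsequently adjusted by hand.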
Only three speakers per session were annotated: the fourth speaker was always the co-wizard
and, even though he/she actively used the system, we decided not to annotate his/her speech.
The annotation information comprises the name (ID) of the speaker, the transcription of the
utterance, and any noise included in the acoustic event list. Annotators were instructed to
mark those sentences that were personal comments and were not intended for the system.
Figure 11 shows the annotation of a session: uttered speech is annotated with the speaker ID,
along with noise symbols. Seven classes of noises were identified and annotated with square
brackets (e.g., [pap] standing for paper rustling). Two further classes were created to label
speaker-produced or unknown noises. The noises and their associated labels are described in
Table 1.
Label Acoustic Event
[sla] door slamming
[cha] chair moving
[pho] phone ringing (various rings)
[cou] cough
[lau] laugh
[fal] object falling down (water bottle, book)
[pap] paper rustling (newspaper, magazine)
[spk] noises from speaker mouth
[unk] other unknown noises
Table 1: Noise event classes
The above mentioned events were a subset of the ones exploited in previous data collections
conducted under the CHIL EU project [7].
The temporal extension of the different noise events was identified using a particular
convention to disambiguate between impulsive and prolonged events. In the lower part of
Figure 11 the activities of the different speakers can be seen, e.g. speaker_1 uttering a
sentence while speaker_4 is folding some paper.
As for the video data, a set of 3D coordinates for the head of each participant was created
with a video tracker based on a generative approach [8]. Given the 3D labels, a reference was
derived for each session, which includes the ID of the active speaker, his/her coordinates, and
some information about the presence of noises. The reference file was obtained as a
combination of the raw 3D labels generated by the video tracker and the manual acoustic
annotation, at a rate of 5 labels per second.
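The combination step can be sketched as follows: for every 200 ms time step, the active speaker is taken from the acoustic annotation and paired with the nearest-in-time tracker position. The data layout below is a plausible sketch of our own, not the project's actual file format:

```python
def build_reference(video_labels, segments, rate=5.0):
    """Combine 3D head positions from the video tracker with the acoustic
    annotation into one reference entry per 1/rate seconds.

    video_labels: {speaker_id: [(t, x, y, z), ...]}  tracker output
    segments:     [(start_s, end_s, speaker_id, noise_or_None), ...]
    """
    end_time = max(end for _, end, _, _ in segments)
    reference = []
    t = 0.0
    while t <= end_time:
        entry = {"time": round(t, 2), "speaker": None, "pos": None, "noise": None}
        for start, end, spk, noise in segments:
            if start <= t < end:
                entry["speaker"] = spk
                entry["noise"] = noise
                labels = video_labels.get(spk, [])
                if labels:
                    # nearest-in-time tracker label for this speaker
                    _t, x, y, z = min(labels, key=lambda l: abs(l[0] - t))
                    entry["pos"] = (x, y, z)
                break
        reference.append(entry)
        t += 1.0 / rate
    return reference
```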
4.2 Data Exploitation / Testing
The data collected during the WOZ experiments have been exploited for a preliminary
evaluation of the FBK algorithms.
The main goal of the evaluation was to understand the peculiarities of the DICIT scenario and
to verify their influence on localization techniques, in order to handle them correctly in the
development of the first prototype. For instance, we observed that user sentences were usually
very short and that silence was predominant. The basic metric used to evaluate source
localization (SLoc) methods is the Euclidean distance between estimated and reference
coordinates. Given this metric, the evaluation of a SLoc algorithm is carried out in terms of
localization rate, RMSE, fine RMSE, bias, and angular RMSE (refer to D3.1 for further
details on the results).
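Given paired estimated and reference 3D coordinates, most of the listed metrics can be computed as sketched below (angular RMSE is omitted). The 0.5 m threshold separating "fine" errors, and hence the localization rate, is an assumption of this sketch, not the value used in the project:

```python
import math

def sloc_metrics(estimates, references, fine_threshold=0.5):
    """Localization rate, RMSE, fine RMSE, and per-axis bias from paired
    3D coordinate estimates and references (threshold is illustrative)."""
    errs = [math.dist(e, r) for e, r in zip(estimates, references)]
    rmse = math.sqrt(sum(d * d for d in errs) / len(errs))
    fine = [d for d in errs if d <= fine_threshold]   # "fine" errors only
    fine_rmse = math.sqrt(sum(d * d for d in fine) / len(fine)) if fine else None
    loc_rate = len(fine) / len(errs)
    # bias: mean signed error per coordinate axis
    bias = tuple(sum(e[i] - r[i] for e, r in zip(estimates, references)) / len(errs)
                 for i in range(3))
    return {"rate": loc_rate, "rmse": rmse, "fine_rmse": fine_rmse, "bias": bias}
```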
The WOZ data were used to test the speaker verification and identification system: the system
was applied to the signals of the close-talk microphones, to the single central microphone of
the array, and to the beamformer output, using matched model conditions and different
amounts of training material. The results showed that beamforming improves system
performance compared to the single-microphone case, although the results are still inferior to
the close-talk microphone case. The WOZ data were also exploited to test the acoustic event
detection system. The test data were composed of 682 speech segments and 108 non-speech
segments, extracted from the continuous audio stream by exploiting the manual annotation.
The results are promising and highlight that the most confusable events are speech, cough,
and laugh (refer to D4.1 for further details).
Part II. Dialogue WOZ
A Wizard of Oz (WOZ) study was conducted in order to obtain a basis for the specification of
the dialog model of the prototypes of the DICIT system. The focus of this document is on the
electronic program guide (EPG) setup, which was used to determine how users select
broadcasts from an EPG database using a set of filter criteria by means of voice input. Also, a
screen layout and navigation scheme was evaluated.
The aim of this study was to determine how users operate an EPG system by voice control.
The WOZ system could understand all voice input and handle it accordingly, either by
performing the requested action or by replying with an error message.
The WOZ experiments were conducted at EB, Erlangen, Germany, and at Amuser (formerly
Citec Voice), Torino, Italy. 20 sessions were performed at each site, in German and Italian
respectively, involving one adult person per session. Moreover, a small number (4) of English
sessions was performed, with one subject speaking with an English, Scottish, Irish, and
American accent, respectively. At EB, the recordings were all performed between
05-May-2007 and 13-June-2007; at Amuser, from 30-May-2007 until 07-June-2007.
In this document, the WOZ experiments in German, English, and Italian are evaluated. To
indicate which language is currently being discussed, the respective parts are marked with the
corresponding country flag:
Flag Recording
German at EB, Erlangen, Germany
English at EB, Erlangen, Germany
Italian at Amuser, Torino, Italy
1. Experimental Setups and Recordings
1.1 General Experimental Setup – The DICIT WOZ System
The DICIT WOZ system is a set-top box (STB) system with an electronic program guide (EPG). Figure
12 shows the menu structure of the specified dialogue system. Users can browse the EPG data
by defining a set of filters (channel, time, day of the week, actor, subject, and title) and then
browse the results produced using these filters. Elements from the result list can be put into a
recording list. This list can also be browsed and elements can again be removed from it.
Moreover, there is a TV mode where users can watch a prerecorded set of movies (6 channels)
and use a simple teletext function. Screenshots of the views can be seen in Appendix E.
Figure 12: DICIT WOZ menu structure
Although [15] suggests having a separate wizard operator for each modality in
multimodal systems, this was not necessary for this setup: instead of having the wizard
handle every kind of user input, the remote control interaction was completely implemented
in the system, and the wizard only had to handle speech input by the user. The fraction of the
system implemented in WOZ studies is discussed in [16].
1.2 Experimental Setup at EB
At EB, two rooms were used in the experimental setup (see Figure 13): the wizard room and
the test person room. The test person room was furnished like a living room with a couch, so
that the test person could feel as if he or she was watching TV in a private environment. The
wizard room was wired up to the test person room via direct lines for keyboard, mouse, and
screen; the wizard could operate the DICIT WOZ system from there.
Each session was introduced with a short description of the experiment. The session itself was
divided into three tasks to be solved with the DICIT system. After the recording of the session,
the test persons were asked to fill out a questionnaire.
All sessions were recorded with a high-quality close-talk microphone (headset) and a distant
reference microphone of the same type as the one to be used by the acoustic front-end of the
DICIT system under development. Additionally, videos of the sessions were recorded for
reference, and the video was also made available to the wizard during the recordings.
(Figure 12 node labels: Welcome screen; Main menu [EPG_MainMenu_View]; EPG mode:
Select criterion [EPG_ChooseFilter], Manual input [EPG_ManualInput], Result list
[EPG_ResultList], Confirmation [EPG_Confirmation], Recording list [EPG_RecordingList];
TV mode: TV [View]; Teletext mode: Teletext [News].)
Figure 13: The WOZ setup at EB.
1.2.1 Hardware Setup
In this section, the hardware setup at EB is discussed (see Figure 13). One PC was used to run
both the simulation and the recording. The camera and the microphones were directly
connected to the PC or the sound card, as were the loudspeakers. The wizard screen,
keyboard, and mouse were directly connected to the PC in the next room using extension
cables (PS/2 for mouse and keyboard, VGA for the screen).
The following hardware was used:
- PC: Dual-Core P4, 2GB of RAM, 2x 100GB hard disk
- Screen: hp 2035 20” (first part of the sessions), Belinea 2225 A1W 22” widescreen
(second part of the sessions)
- Microphones:
o Shure MX391/O (room microphone)
o Sennheiser ME3 (head set)
- Camera: Logitech QuickCam Pro 4000
- Sound card for recording: Edirol UA-25
Please note that this hardware configuration had to be used in order to run the dialogue
simulation tool from outside the recording room. To ensure good quality of the audio
recordings, an external sound card was used.
(Figure 13 labels: wizard room, connected via keyboard, mouse, and screen cables; test
person room with screen, camera, room microphone, headset, and remote control.)
1.2.2 Software Setup
The simulation was run using a special version of EB GUIDE Studio 2.60 [9, 10], which was
extended using the plugin mechanism. New control windows (speech input/output, state
change, remote control simulation, etc.) were added to the simulation desktop. The wizard
could use these windows to control the simulation. Moreover, extensive logging facilities
were added to GUIDE.
In addition, GoldWave Multiquence and the Logitech QuickCam software were used at EB
for the audio and video recordings.
1.2.3 Recording Sessions at EB
At the beginning of a session, the subject was guided into the living room by the instructor
and received a short introduction to the experiment. The subjects then had to fulfill three tasks
in about half an hour. After the session, a questionnaire had to be filled in. The duration of a
session was about one hour. Both the recording and the questionnaire completion took about
half an hour each.
Introduction
The introduction given to the subjects can be outlined as follows:
- The system is a prototype that can understand spoken language.
- The session is going to be recorded (both audio and video).
- The system is fully functional and can already be used via remote control; the new
feature of this experiment is voice input.
- Three tasks have to be solved. These are formulated on sheets of paper and handed
over to the subjects.
- The experiment is about testing the system, not the subject.
- After the recording, a short questionnaire has to be filled in.
The complete text of the introduction is available in Appendix C.
Tasks
During a session, each subject had to fulfill three tasks. Every task was printed on a separate
sheet of paper. The first task was given to the subject after the introduction by the instructor.
For the other two tasks, the instructor entered the room after the time for a task had elapsed.
Task 1 (15 minutes)
Please look for your favorite broadcast following your own selection criteria for
Sunday afternoon. Please note that the prototype does not yet support every TV
channel.
Task 2 (7 minutes)
Please look for the current broadcast on ARD and change the volume.
Task 3 (7 minutes)
Please select a broadcast that is not currently on air and that you therefore would like to
record. Please note that the prototype does not yet support every channel.
Questionnaire
After the prototype session, each subject had to fill in a questionnaire of about 30 questions.
The questionnaire can be found in Appendix B and is discussed in detail in Chapter 2.
1.3 Experimental Setup at Amuser
At Amuser, too, two rooms were used in the experimental setup, very similar to the EB setup
(see Figure 14):
Figure 14: The WOZ setup at Amuser
1.3.1 Hardware Setup
In this section, the hardware setup at Amuser is discussed (see Figure 14). PC2, placed in
the test person room, was used to run the simulation and to record the audio signals coming
from both the close-talk and the distant reference microphones. The wizard PC (PC1) was
connected to PC2 via VNC over a point-to-point cable and was used as an interaction client
of PC2 (the server).
The two cameras were directly connected to a VHS mixer in the wizard room.
PC2 was connected to the two microphones (through the USB audio box), the loudspeakers
(driven directly by the built-in sound card), and the remote-control receiver (connected
through a serial port and driven by the WinLIRC software).
The following hardware was used:
- PC1: IBM Thinkpad, 256MB of RAM, 1x20 GB hard disk
- PC2: HP Compaq notebook, 1GB of RAM, 1x20GB and 1x55GB hard disk
- Screen: IBM 20” (PC 1), Samsung sync master 231T LCD 800x600 (second monitor)
- Microphones:
o Shure SM10A (head set)
o Røde NT6 and AKG c680 BL (room microphone)
- Cameras: 2 Sony 3ccd
- Sound card for recording: SoundMax integrated digital audio.
1.3.2 Software Setup
The simulation was run using the same version of EB GUIDE Studio 2.60 provided by EB.
Moreover, CoolEdit pro 2.0 was used at Amuser for the audio recordings.
1.3.3 Recording Sessions at Amuser
A session consisted of a short introduction to the experiment, two tasks to be solved with
DICIT, and the completion of a questionnaire after the recording session. At the beginning of
a session, the subject was guided into the living room by the instructor, received a short
introduction to the experiment, and then filled out the first part of the questionnaire (statistical
and habit questions). First of all, the subjects had to read some phrases, under the pretence
that this was a "calibration phase" for the microphones, and then fulfill two tasks in about 20
minutes. After the session, the usability part of the questionnaire had to be filled in. The
duration of a session was about 50 minutes.
Introduction
The introduction given to the subjects was the same used at EB.
Tasks
During a session, each subject had to read some phrases first and then fulfill two tasks. The
list of phrases and the task instructions were printed on separate sheets of paper. The
instructor gave the list of phrases for the "acoustical WOZ" to the subjects after the
introduction and, when they had finished reading, they received the first task. For the first
task, the instructor did not enter the room when the subject reached the goal, but left the
subject "playing" with the system until the time for this task had elapsed (about 10-12
minutes). Finally, the instructor entered the room to give the second task.
Recording phrases to set the microphones (3 minutes)
In order to present a "picture" of a real working system, the phrases for the "acoustical
WOZ" were presented as a microphone recording test.
Task 1 (11 minutes)
Considering that only six national channels are available, and using the criterion you
prefer, please search for a program you want to record that is not on air at this moment.
Task 2 (5 minutes)
Please search for the video clip by Cristicchi that is on air at this moment, using the title
if you know it. While you watch the video clip, please adjust the volume.
2. Questionnaire
After the recording session with the WOZ prototype system, each subject had to fill in a
questionnaire to determine users' attitudes toward different aspects of the system. For the
German and English subjects, the questionnaire data was entered on a notebook, so that the
questionnaires could be evaluated automatically without the need to enter them into the
computer separately. By contrast, the Italian subjects used a paper-based questionnaire.
The questionnaire consists of 31 questions according to the criteria of DIN EN ISO
9241-110 (see [11]). The first part consists of statistical questions (1-4) and questions
regarding TV habits (5-12). The second part contains questions regarding specific parts of the
DICIT WOZ system, such as screen, voice output, and voice input. The last part investigates
subjects‟ overall impression of the system.
German and English subjects' answers have been evaluated separately. There were 20
German and only four English subjects; therefore, the answers of the English subjects are not
statistically significant, but a limited quantitative evaluation can still be done. Moreover, two
of the English subjects did not like voice control at all (answering "do not use voice" a couple
of times in the free-text comments) and one of them answered with a negative bias in many
questions. Since the answers of the other English participants are more similar to the German
answers, a larger subject base would likely have produced different results.
The complete questionnaire can be found in Appendix B.
2.1 Statistical Questions (Questions 1-4)
The first part of the questionnaire contains statistical questions, e.g. regarding the subjects'
gender, occupation, and age.
Question | German | English | Italian
1: You are… | male 75%, female 25% | male 75%, female 25% | male 55%, female 45%
2: What is your educational qualification? | degree 60%, secondary school 30%, middle school 5%, primary school 5% | degree 100% | degree 35%, secondary school 50%, middle school 15%
3: Your age | 20_30 55%, 31_40 30%, 41_50 10%, 51_60 5% | 31_40 25%, 41_50 75% | 20_30 30%, 31_40 25%, 41_50 20%, 51_60 10%, >60 15%
4: Occupation | comp. science 25%, software developer 25%, engineering 10%, electro-technol. 5%, commercial IT 5%, IT 5%, biology 5%, <not given> 20% | software developer 75%, automation 25% | employee 45%, retired 15%, other 15%, housewife 10%, student 10%, commercial 5%
Table 2: Statistical questions
Of the German subjects, 75% (15) were male and 25% (5) female. 60% hold a university-level
degree, about one third finished secondary school, and two subjects finished middle or
primary school. 55% are aged between 20 and 30, 30% from 31 to 40, two subjects (10%) are
between 41 and 50, and one subject (5%) is between 51 and 60. Since most subjects are
employed at EB, they work in the software business.
As for the English subjects, the gender distribution is the same as for the German subjects,
but all hold a universitary level degree and are older than the German subjects.
The distribution of Italian sample about gender was: 45% (9) females, 55% (11) males;
regarding the educational qualification was: one third (7) hold a university level degree, half
sample (10) finished the secondary school, and the rest (3) finished the middle school. The
age distribution was divided with more than half the sample under 50 years: 30% (6) are aged
between 20 and 30, 25% (5) from 31 to 40, 20% (4) were between 41 and 50; two subjects
(10%) were between 51 and 60 and three subjects (15%) were over 60 years. The occupation
distribution was: almost half the sample (45%) were employees, three subjects (15%) were
retired, two subjects (10%) were students, two subjects (10%) were housewives and 20%
worked in other jobs.
While the German sample was chosen to represent a specific target of “expert” users, the
Italian sample was recruited trying to represent the distribution of the whole population
(regarding gender, educational qualification, job and age).
41_50 75%
31_40 25%
20_30 55%
41_50 10%
31_40 30%
51_60 5%
<not given> 20%
comp. science
25%
electro- technol.
5%
commer cial IT
5%
engineer- ing
10%
IT 5%
biology 5%
software developer
25%
secondary school 30%
middle school
5%
primary school
5% degree 60%
automation 25%
software developer
75%
2.2 TV Habits (Questions 5-12)
The next section of the questionnaire contains questions regarding the TV watching habits.
Question 5: How many people live in your household including you?
  DE: alone 25%, two 45%, three or more 30%
  EN: three or more 100%
  IT: alone 10%, two 35%, three or more 55%
Question 6: How many TVs do you have in your house?
  DE: none 10%, one 45%, two 20%, three or more 25%
  EN: one 50%, two 50%
  IT: none 5%, one 30%, two 40%, three or more 25%
Question 7: Who usually decides what to watch on TV?
  DE: together 42%, one 33%, each 17%, majority 8%
  EN: together 50%, majority 50%
  IT: together 47%, majority 21%, one 21%, each 11%
Question 8: How do you usually decide which programme to watch? (multiple responses)
  DE: guide 45%, surfing 26%, EPG 20%, teletext 9%
  EN: guide 60%, surfing 40%
  IT: teletext 37%, guide 26%, surfing 26%, EPG 11%
Question 9: Which type of television do you usually watch? (multiple responses)
  DE: satellite 45%, traditional 30%, digital terrestrial 20%, IPTV 5%
  EN: satellite 100%
  IT: traditional 84%, satellite 16%
Question 10: How do you usually select a programme?
  DE: no answer 55%, numeric button 45%
  EN: numeric button 75%, up/down button 25%
  IT: numeric button 45%, up/down button 30%, EPG 25%
Question 11: What is the information that interests you to choose a programme? (multiple responses)
  DE: genre 63%, topic 32%, don't care 5%
  EN: genre 50%, topic 33%, actor 17%
  IT: genre 33%, topic 22%, actor 14%, don't care 14%, channel 11%, duration 6%
Question 12: Usually you use the TV to:
  [Bar charts: per-sample counts for watching TV, surfing, DVDs, recorded programmes, background use, and other uses; the exact values are not recoverable from this transcript.]
Table 3: Habits questions
Multiple responses were possible for questions 8-11. Subjects could also enter additional
comments (more than one answer possible) for question 12. “Photos”, “HiFi”, and “Series and
Movies” were each stated once by the German subjects, whereas two subjects said that they
had no TV at all. One English subject added “video”.
2.3 The DICIT System (Questions 13-29)
Questions 13 to 29 are used to determine how subjects like specific aspects of the DICIT
WOZ system, such as the screen, voice input, or voice output.
2.3.1 Using the DICIT System
These questions are used to determine how subjects get along with the DICIT system and
whether they prefer voice to remote control input. Subjects had to rate each of the following
questions with values between 1 and 10. Moreover, they could explain or comment on their
answers in a text input field.
Question Average value
13. It was easy to understand how to use the different selection criteria
given by the system
(1 = Very Difficult, 10 = Very Easy)
DE: 9.00
EN: 6.75
IT: 7.25
14. It was easy to understand how to give all the vocal commands
(1 = Very Difficult, 10 = Very Easy)
DE: 8.80
EN: 7.75
IT: 7.95
15. It was comfortable to give some information with voice and the
other with the remote control
(1= Very Uncomfortable, 10 = Very Comfortable)
DE: 6.70
EN: 6.50
IT: 8.50
16. In case of problems did the system suggest usefully and efficiently
what to do to recover the information after the error?
(1 = Very Useless, 10 = Very Useful)
DE: 5.94
EN: 5.00
IT: 7.16
Table 4: General usability questions
Question 13:
German subjects had no problems using the filter criteria. Some found them logical, easy,
or clear (11), whereas some did not understand them at the very beginning, but they became
clear quickly after they had used the system for some time (3).
English subjects had more problems using the filters. One subject wanted to select a time
range (from/to), but the system could only select from a start time until midnight. Moreover,
a subject criticized that the “movies” genre filter showed many entries that were actually not
movies. (Since it was decided to keep the number of genres small, all entries had to be put
into the available categories.)
Only a few Italian subjects (4) had no problems using the filter criteria. Many of them (8)
had some problems at the beginning of the session, but after using the system for a while they
easily understood how to proceed. Other subjects (6) reported various problems interacting
with the system: some complained about the lack of a reference for the voice volume
commands (e.g. “mute”, “half volume”, etc.) or about the lack of EPG data (the Italian EPG
did not have data about subject and artist), while others reported difficulties in linking the
selection of search criteria to the direct viewing of a programme (one subject did not
understand that it was possible to use more than one criterion at a time to sort the data).
Question 14:
The German subjects understood how to use the vocal commands, although they had no prior
training. Most users said that the system was easy and intuitive, or that they could find out
how to operate it using a trial-and-error strategy.
One English subject observed an inconsistency in TV mode between volume up/down
and channel up/down. Moreover, one subject gave a below-average rating of 4, whereas the
others gave good ratings (8-10).
Most of the Italian subjects (12) stated that they had no problems, because the screen
output makes the available voice commands easy to understand. Other subjects (6) said they
had some difficulty guessing the volume commands, and one of them pointed out that any
problems he had with the vocal commands were because he was not used to interacting with
this kind of system (in his opinion it was not an interface design problem).
Question 15:
Subjects did not feel very comfortable using both voice and remote control (RC) to
operate the system: 11 out of 20 subjects said that they used the RC not at all or only to switch
on the system. Some wanted to use the RC for quick or simple inputs (e.g. switching
channels), whereas they thought that voice should be used for complex input (e.g. EPG queries).
One subject judged both speech control and speech feedback as irritating. Others said that
“voice was easier” and that the remote control was used for teletext.
Most of the Italian subjects (14) used only voice commands, stating that they are easier
and quicker to use than the RC. Three subjects said they used the RC only when they had
some difficulties with voice interaction (e.g. scrolling channels).
Question 16:
Regarding error recovery, some users commented that they either did not have problems
(4) or that the help function was “good” or “easy”. Others stated that the system did not
provide useful help, or that the help was very basic and only took them one step further
without helping them understand the system.
English subjects rated the help lower than the German subjects did. One subject complained
that hardly any help was given, and another that the automatic help (after silence) was annoying.
Almost half of the Italian sample (9) said that the visual feedback was very useful for
recovering information after interaction problems, and two subjects did not blame the system
interface for their difficulties in coping with an error, attributing them instead to their own
unfamiliarity with this kind of advanced system. Four subjects did not notice any help/error
messages during their sessions and therefore commented that they did not encounter any
problems. The other subjects complained about the lack of contextual help/error messages or
about a general interaction logic that differed from their expectations.
Summary: The majority of the subjects judged the system to be easy to use and had no major
difficulties in using voice control. They did not feel very comfortable using voice and remote
control at the same time, and a majority of the subjects used voice control exclusively. Error
recovery could be improved, since many subjects did not perceive the help to be useful.
As a test of the usefulness of a voice-supported dialogue system, the results can be regarded
as very positive, because dialogues and help menus can always be improved with standard
techniques. Regarding the crucial issue, namely the acceptance of voice as an interaction
modality, the results of the questionnaire support this approach.
2.3.2 Watching the Screen
The aim of this section was to get feedback on the DICIT screen, i.e. whether it was easy to
read the screen and navigate the menu structure. As in the previous section, every question
in this section consists of a rating value between 1 (negative) and 10 (positive) and an input
field where subjects could explain their choice.
Question Average value
17. Is the screen which shows the criteria for the programmes
search easy to read?
(1 = Very Difficult, 10 = Very Easy)
DE: 8.75
EN: 7.00
IT: 8.65
18. Was it easy to understand how to vocally use the search criteria
for programmes shown on the screen?
(1 = Very Difficult, 10 = Very Easy)
DE: 8.80
EN: 6.50
IT: 7.89
19. Was it easy to understand how to use the remote control to
select the search criteria for programmes?
(1= Very Difficult, 10 = Very Easy)
DE: 6.16
EN: 6.50
(only 2 answers)
IT: 7.00
20. To complete the task we assigned to you, did you expect to
have some other vocal commands?
(Yes/No)
DE: Yes=55%, No=45%
EN: Yes=50%, No=50%
IT: Yes=30%, No=70%
21. Did you find the information on the screen useful for orienting
yourself, in case you disabled the audio?
(1 = Very Useless, 10 = Very Useful)
DE: 8.80
EN: 5.00
IT: 6.15
22. Do you find it useful that the list of previous criteria is
always shown?
(1 = Very Useless, 10 = Very Useful)
DE: 8.35
EN: 5.25
IT: 7.68
23. Do you find useful a function which allows you to insert a
precise word via the remote control to search for programmes?
(1= Very Useless, 10 = Very Useful)
DE: 3.35
EN: 3.75
IT: 7.26
Table 5: Screen feedback questions
Question 17:
While the acceptance of the EPG screen was good, the comments were very diverse. Two
subjects stated that the fonts were good, whereas two others noted that they were too small.
Three mentioned that the screen contains too many details, while it was clear for others (3).
Subjects also had problems with the reset function; e.g. one did not want it to reset all filters.
Two English subjects stated that the list of TV programs was too short (only 3 entries
for 6 channels). One complaint was that the system keeps making suggestions.
The majority of the Italian subjects (12) found this screen useful and easy to read, but
some deplored the “basic” graphics and one stated that it was not clear that the search
criteria could be combined.
Question 18:
The results are also very diverse for this question, but the majority of the subjects had
positive comments. Some were positive (“surprised in a positive way”, “it's faster than
Google”, “it always worked”), others negative (“produced strange results”).
For one English subject, it was not clear that the search had to be started (e.g. by saying
“Search”). Another comment was that the basic functionality was easy to use, but advanced
features were not.
© DICIT Consortium
D6.2 –Multi-microphone Data Collection and WOZ Experiments
DICIT_D6.2_20080428 32
Although all Italian subjects gave this question a fairly high score, only 8 commented on
it: half of the comments highlighted difficulties at the beginning, after which it quickly
became clear how to use the search criteria by voice. Two subjects reinforced their answer,
stating that the search criteria were easy to understand and use. Two other subjects reported
problems knowing which data were available for a search (the Italian EPG did not have data
about subject and artist) and missed the possibility of easily switching to the programme on
air once the desired one had been selected.
Question 19:
Most subjects (15 out of 20) said that they did not use the remote control at all. Others
commented that they are used to RCs or that the DICIT RC works like other RCs.
As for the English subjects, most said that they used voice only.
Like the German and English subjects, most of the Italian ones (16) said that they used
only their voice because it was more comfortable. Some subjects appreciated the mapping
between the “colored buttons” related to voice commands on the screen and the colored
buttons on the RC.
Question 20:
People were missing the following commands: going to full-screen mode, recording all
results at once or deleting all entries from the recording list at once, foreign-language
commands, and multi-select.
English subjects also wanted an option to record all entries at once. Another person
suggested presets for volume settings (e.g. “medium volume”).
Only six Italian subjects expected to have other commands or options. Two subjects
complained about the unavailability of a list of artists or titles (within the “sub-menus” for
these search criteria); the others asked for the following commands on screen: “quit”, “show
it” (to switch the TV to the chosen programme instead of a generic channel), and “go ahead”
and “go back” commands to scroll the list of programmes.
Question 21:
Five subjects said that the provided information was enough. Subjects had some comments
regarding the TTS output: the TTS should not be muted when the TV is muted (two different
“mutes”?), and the TV volume should be lowered during TTS output. Moreover, one subject
stated that the text on the screen should not be cut off.
Two English subjects stated that they only used the screen, speech feedback was irritating,
and that they would disable the audio. The other two subjects did not comment on this
question.
Only four Italian subjects said they did not pay attention to the help messages. Four other
Italian subjects reaffirmed that they expected to have more detailed commands to adjust the
volume (especially to turn the “mute” on/off); the others said they read the commands from
the screen or used their own reasoning to orient themselves in the system, but they did not
mention anything about the help messages.
Question 22:
The most common answer is “good” (5). One user would like to make this configurable;
another would like to have templates.
© DICIT Consortium
D6.2 –Multi-microphone Data Collection and WOZ Experiments
DICIT_D6.2_20080428 33
English subjects commented that it was not clear if all or only the last criterion was used
for the query and that this could be a possible preset option.
Only four people said that they could do without the criteria list once they had gained
some familiarity with the system. The other 10 Italian subjects commented that reading the
criteria is easier than remembering them and gives confidence during the interaction with the
system; one of these subjects stated that it would be nice to make this screen configurable,
and another said that he would prefer a different kind of interface.
Question 23:
The dislike for a virtual keyboard is confirmed by the comments. Subjects regard it as slow
and compare it to T9 (cell phone input), which most of them do not like for a TV system.
Subjects also dislike the idea of a huge remote control with lots of keys. They prefer an
improved, well-working voice input, which would make an on-screen keyboard unnecessary.
Two English subjects rated this question with the extremes of 1 and 10 (one each); the two
others rated it with 3. One subject stated that a speller is not required as long as the spelling
works fine. Another said that shortcuts could be useful. Other comments were that this took
too long and depended on the word. One subject said that he preferred a printed guide.
Only one Italian subject disliked this functionality; eleven of the other subjects judged this
feature a good idea for providing an alternative way of interaction or for simplifying the
search for artists or titles (some of them suggested narrowing the search by first entering a
few letters with the RC and then completing the query by voice from the reduced number of
items found).
Summary:
Subjects regarded the screen as easy to read and the voice commands as intuitive. It was less
clear how to use the remote control for filter selection. About half of the subjects wanted
additional commands at their disposal. The information on the screen was considered useful
even without the related TTS prompts. Subjects think that the list of previous criteria should
remain available. German and English subjects strongly disliked an on-screen keyboard;
most Italian subjects, on the other hand, appreciated this feature as a way of facilitating the
search for hard-to-enter data such as artist names or exact programme titles.
2.3.3 Vocal Interaction
The aim of questions 24-26 was to probe the appeal of the vocal mode (in comparison to
interaction via the haptic mode) and its flexibility, both for providing input and for giving
output.
Question 24: How do you judge the opportunity to use a vocal command?
  DE: very useful 55%, useful if it provides more operations than the RC 25%, useful if used together with the RC 15%, useful if it replaces the RC 5%
  EN: very useful 50%, would never use vocal commands 50%
  IT: very useful 50%, useful if it provides more operations than the RC 25%, useful if used together with the RC 15%, would never use vocal commands 10%
Question 25: For the vocal commands, you prefer:
  DE: short commands 75%, full sentences 10%, reading precise commands 10%, none 5%
  EN: short commands 100%
  IT: short commands 60%, full sentences 20%, reading precise commands 20%
Question 26: When you give a vocal command to the system, you prefer [feedback]:
  DE: video only 50%, both 30%, immediate action (no feedback) 20%
  EN: immediate action (no feedback) 75%, video only 25%
  IT: no feedback 35%, both 35%, video only 20%, voice only 10%
Table 6: Vocal mode questions
Question 24:
The majority considers the opportunity to use voice input (very) useful, but 45% only if it
replaces the remote control, is used together with the remote control, or provides more
operations than the remote control. The remote control is considered better suited for e.g.
switching channels, while voice input is regarded as good for complex queries (e.g. EPG).
Subjects said that it is important for the system to understand lots of commands and to work
right from the beginning.
One subject was totally enthusiastic about this feature: “It's the way of the future and I
want it now!” Another comment was that this could be useful for disabled and elderly people.
Half of the Italian sample answered that voice interaction is useful, and seven subjects
reinforced their answer by explaining that voice control facilitates the interaction because,
when commands work correctly, voice is quick, more practical, and not as cumbersome as
the RC. Only three subjects commented that they prefer to choose when to use voice and
when to use the RC, and another subject reinforced her negative answer, stating she had
never used the voice commands because, living alone, she finds speaking to the TV as if to
a human being distressing.
Question 25:
Most subjects agree that short commands are the best solution. One person recommends
using long sentences for beginners and short ones for experts. Two subjects mention that both
short and long commands should be understood by the system.
One subject noted that short commands would take some time to figure out. Others
said that they would use full sentences later.
Twelve Italian subjects agree on short commands because they are easier to remember and
to read on the screen (also for elderly or visually impaired people). Only one of the four
people who said they prefer full sentences justified his answer, explaining that it is better to
avoid speaking like a robot.
Question 26:
Some people regard speech output as annoying (2) or repeat that they prefer video output
(3), some would like to be able to disengage it (2). One stated that a prompt should only be
repeated a certain number of times. Some users think that it should be “intelligent” or
“provide more feedback when problems occur.”
English subjects stated that this speech feedback could be good for the blind and that it was
sufficient. One subject said that he would not use voice control.
Italian subjects are equally split (7 and 7) between people who do not want any feedback,
only the system's reaction to their request, and people who want video and TTS feedback
before the system does anything. If we add to this second group the subjects who want only
video feedback (4) and those who want only audio feedback (2), so that they can interact with
the system while not in front of the TV, the majority of the sample would prefer some kind of
feedback.
Summary: Subjects like to have voice as a means to control an STB system. They prefer
short commands instead of complete sentences. While they like speech input, subjects are
sceptical about speech output, and preferences differ between the samples: half of the German
subjects prefer video-only feedback, 20% want an instant reaction without feedback, and only
one third want both speech and visual confirmation; in the Italian sample, one third of the
subjects want an instant reaction from the system, one third prefer both video and voice
feedback, 20% like video-only feedback, and 10% voice-only feedback.
2.3.4 The System Voice
The system voice of the DICIT system is the subject of this section. Subjects were asked how
they like the TTS output, whether they want to be able to interrupt the system, and whether
they want to be able to switch off the recognizer.
Question 27: Do you find it useful that the system reads (in addition to listing them on the screen) the programmes found after your search?
  DE: no 85%, yes if not too many 15%
  EN: no 100%
  IT: no 55%, yes 40%, yes if not too many 5%
Question 28: If you prefer a system which gives you vocal feedback:
  DE: want the possibility to interrupt 90%, happy to wait 10%
  EN: want the possibility to interrupt 100%
  IT: want the possibility to interrupt 81%, happy to wait 19%
Question 29: Would you like to have a button to enable/disable the vocal recognizer?
  DE: yes 90%, no 10%
  EN: yes 100%
  IT: yes 85%, no 15%
Table 7: Listening to the system voice
Question 27:
German subjects think that this feature is only useful for blind or elderly people and that,
if implemented, it should be possible to disable it. Several consider it too slow (5) and some
do not like the TTS voice (2).
None of the English subjects wanted the system to read out the results.
Although eleven Italian subjects answered that they do not want to hear the TTS read out
the listed programmes, those who like this feature (8), together with the one who wants it
only if the items are not too many, make up almost half of the sample (45%). Some of those
who answered “no”, and a couple of subjects who answered “yes”, said that this feature could
be useful for blind or elderly people (so it could be enabled/disabled on demand); by contrast,
two other subjects who want this feature regarded it as an advanced and useful functionality
even for non-impaired people, saying that it is convenient especially when they are doing
something else (not in front of the TV) while consulting the EPG.
Question 28:
People do not like to hear the same prompt again and again. They would prefer varying
texts, and texts that become shorter as use of the system increases. They also do not want
to wait for the TTS to end (2) and prefer barge-in (2).
All English subjects want the possibility to interrupt the system.
Most of the Italian subjects (13) liked the possibility of interrupting prompts (most of them
commented that this means having control of the TV), and only three subjects answered that
they prefer waiting until the end of the system output.
Question 29:
Subjects want to switch off the recognizer when they are in a conversation (6), when other
people are in the room, or when the room is loud.
All English subjects want the possibility to disable the recognizer.
Almost the whole Italian sample (17) likes the possibility of manually disabling the
recognizer, and 7 subjects reinforced their answer by saying that they prefer to control the
interaction and do not want to be annoyed by false recognitions while they are talking with
someone else.
Summary: The majority of German, English and Italian subjects do not want the system to
read out the search results; the remaining German subjects only want this feature if the
number of results is small, while 40% of the Italian sample likes this feature and thinks it is
a good way to remain free to do other things during a task where the EPG images are not as
important as when the TV is showing a programme. In addition, 80-90% of the subjects want
to be able to interrupt the system (barge-in). A function to disable the recognizer should also
be implemented, since 85-90% of the subjects want to be able to do so.
2.4 General Opinion of the DICIT WOZ Prototype (Questions 30 and 31)
The remaining questions are used to examine how subjects like the DICIT WOZ system.
The biased answers of two English subjects show up especially strongly in this section.
Since the number of English subjects is not significant, their results should not be compared
directly with those of the far larger German user base.
2.4.1 Users’ experiences with DICIT
Finally, users had to rate their experiences with DICIT by means of 13 questions. Each
question had to be rated with a value between 1 = complete disagreement and 7 =
complete agreement (as in the classical scale used in Osgood's 'semantic differential').
Sub-Question Result
1. I think that the system is easy to use DE: 6.40
EN: 5.50
IT: 5.61
2. It makes me confused when I use it DE: 1.40
EN: 2.00
IT: 3.00
3. I like the voice DE: 4.30
EN: 3.00
IT: 4.16
4. I think that the system needs too much attention to interact vocally DE: 2.15
EN: 3.75 (1)
IT: 3.63
5. I have the impression not to control the dialogue with the system DE: 1.30
EN: 2.00
IT: 2.94
6. I have to focus on using it with the remote control too DE: 4.05
EN: 1.00
(3 answers)
IT: 2.88
7. I think that the speech interaction is efficient DE: 5.36
EN: 5.50 (2)
IT: 5.68
8. By using the voice it is easier to search the programmes DE: 5.15
EN: 4.50
IT: 5.57
9. The system voice speaks too quickly DE: 1.40
EN: 1.50
IT: 1.88
10. The selection criteria which appear on the screen are not clear DE: 1.65
EN: 2.50 (3)
IT: 3.15
11. I think that it is fun to use DE: 6.40
EN: 5.75 (4)
IT: 6.05
12. I prefer using traditional way (TV guide, teletext, newspaper) to
search an interesting programme
DE: 2.80
EN: 4.25 (5)
IT: 2.09
13. I think that this system needs some improvements DE: 4.45
EN: 6.50
IT: 4.88
(1) 2x1, 1x6, 1x7; (2) 1x1, 3x7; (3) 3x1, 1x7; (4) 3x7, 1x2; (5) 1x1, 1x2, 2x7
Table 8: User general opinion
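The per-group averages reported in this table are simple arithmetic means of the individual 1-7 ratings. A minimal sketch of this computation is shown below; the data layout, group labels, and example values are hypothetical illustrations, not the actual questionnaire data.

```python
# Sketch: average per-group ratings for a 1-7 agreement scale,
# as used for the 13 questions of the user-experience section.
from statistics import mean

# ratings[group][question] -> list of individual ratings (1..7).
# The values below are made-up examples, not DICIT data.
ratings = {
    "DE": {1: [6, 7, 6, 7, 6]},
    "IT": {1: [5, 6, 6, 5, 6]},
}

def group_averages(ratings, question):
    """Return {group: mean rating} for one question, rounded to 2 digits."""
    return {g: round(mean(qs[question]), 2)
            for g, qs in ratings.items() if question in qs}

print(group_averages(ratings, 1))
```

The same helper applies unchanged to the 1-10 usability questions and the 1-7 semantic-differential pairs, since all are averaged per language group.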
Discussion:
Subjects' experiences with the DICIT WOZ prototype are positive: they think that it is
easy to use, not confusing, and fun to use. On the other hand, people do not like the voice
very much and think that the system still needs improvements.
Comparing the average answers of the Italian sample with those of the German one, the main
differences are the following: Italian subjects feel more “confused” when interacting with the
system (q. 2), perhaps because (more than the German subjects) they have the impression of
not controlling the dialogue (q. 5), and they find the selection criteria less clear (q. 10). The
other answers are well aligned with the averages of the German sample.
2.4.2 Rating user satisfaction within DICIT
In the final section, subjects had to rate the DICIT system on a range of 1 to 7 between pairs
of opposite adjectives (e.g. easy vs. complicated). For most adjectives, small values represent
positive feedback.
Adjective pair                    DE    EN*   IT
1. easy vs. complicated           1.40  2.75  2.76
2. efficient vs. inefficient      1.90  4.00  2.56
3. quick vs. slow                 2.75  3.50  3.17
4. original vs. copied            1.65  2.50  1.61
5. precise vs. vague              2.20  2.75  2.76
6. capable vs. incapable          2.05  2.75  2.41
7. formal vs. informal            3.80  4.75  3.33
8. active vs. passive             3.75  4.00  2.88
9. friendly vs. unfriendly        2.60  3.25  2.64
10. determined vs. undetermined   2.50  3.00  2.93
11. polite vs. impolite           2.35  2.25  2.38
12. clever vs. stupid             2.40  4.25  2.70
13. organized vs. disorganized    2.00  2.50  2.35
14. patient vs. impatient         2.00  3.50  2.05
Table 9: Semantic differential
* Since the number of participants in the English group is not large enough to produce significant results, these
numbers are just added for completeness. In most questions, two subjects rated the system in a very positive way,
while two had a more negative impression. Therefore, a discussion of the English results is not included.
Discussion:
Altogether, subjects have a positive impression of the system: it is said to be easy,
efficient, capable, organized, and patient. The results are good, but not excellent (around 2).
With a value of 2.75 for “quick”, the responsiveness of the system should be improved, but
given the nature of the system (a WOZ setting), this value should not be taken as a
reference. At least the system did not get bad ratings in this category.
In the categories “formal” and “active” the system gets a moderate rating (3.80, 3.75).
This is probably because subjects have different ideas of how formal and active the
system should be (i.e., some users prefer a formal and passive system, whereas others prefer a
personal and active one).
Comparing the average answers of the Italian sample with those of the German one, most of
the answers are fairly well aligned; small, not significant differences can be noted for the
same three adjective pairs of the semantic differential commented on before: Italian subjects
judge the speed of reaction of the system (q. 3) a little lower than the German subjects do;
the style of the interface is considered a little more “formal” by the Italian subjects, and the
system is judged a little more “active”.
2.5 Summary
The results of the questionnaire are positive. Subjects had no problems performing a number
of given tasks with the DICIT system, which they had not used before. They liked voice
control for an STB system, and most did not use the remote control at all for the duration of
the test session.
Moreover, there are clear results for many aspects of the system, where subjects gave uniform
answers: they do not like long TTS output and complete sentences as output, but want
short visual feedback and short commands for input. These results should definitely find their
way into the next prototypes.
3. Session Evaluation
3.1 Logging Data
Different statistics were derived from the logging data automatically using scripts (Perl and
Python). These results are discussed in this section.
Unfortunately, some of the EB log files were not valid (IDs 4, 6, and 9) and had to be omitted
from this part of the evaluation. For the German evaluation, 17 (out of 20) valid log files are
therefore included in the discussion. The English EB subjects have the IDs 4, 19, 20, and
22.
Some of the logging data (esp. view logging) was not collected during the Italian recordings.
Therefore, this data is not available for this evaluation.
The average duration of a German or English session was 27 minutes; for the Italian sessions
the average was 17 minutes.
3.1.1 Logging Data
During a recording session, logging data was collected to facilitate a thorough evaluation.
Some of the data was collected automatically using GUIDE (discussed in this section), other
data was created manually (i.e. annotations for the audio recordings).
GUIDE was extended with an extensive logging mechanism to be able to reconstruct the
interaction of the user with the system. For every session, a separate log file was created.
Some entries are specific to GUIDE and the model (e.g. state or event names), while others do
not depend on the model (such as TTS). The different logging entities are shown in Table 10.
Logging Entity Description
TTS Output Every TTS output of the system was logged. There are different
sources of the TTS output: system prompts that are played back
automatically, predefined prompts that the wizard plays manually by
selecting them from a list, or manual wizard prompts that the wizard
can enter manually if none of the predefined ones fit. If a prompt was
interrupted (either by voice or remote control), a special log entry was
created.
ASR Input Recognitions from the automatic speech recognizer (ASR) include the
name of the grammar and a confidence value.
Although no actual ASR is used for the DICIT prototype, the wizard
can select from a list of currently active ASR commands that is
extracted from the loaded grammars. If a user input matches one of
these commands, the wizard simply selects it from the list and thus
simulates the ASR. The wizard input is then treated by GUIDE like an
actual ASR recognition.
State change Every state change in the GUIDE model is logged.
Event Every GUIDE event is logged.
Haptic events The wizard has a virtual remote control at his disposal, on which he
can select most commands of the remote control. These include e.g.
cursor up/down, volume up/down, channel up/down, EPG/TV/teletext,
etc. Numbers (0-9) were not present on the control panel.
Hardware events Hardware events are remote control events that are then mapped to
GUIDE events.
Screenshots At every event, a screenshot was created to be able to reproduce a
session visually from the logging data.
Table 10: Logging data created by GUIDE.
Moreover, audio recordings using both a headset and a distant reference microphone were
created. These recordings were then annotated manually using Praat [12]; for annotation
details please refer to [13]. This made it possible to evaluate the recordings at the
textual level.
The technical evaluation was done according to the PROMISE principles for evaluation of
multimodal dialogue systems (see [14]).
3.1.2 Number of Screens and Views
First, we examine the number of different views entered by the subjects. After every event, a
screenshot was taken and a respective entry added to the log file. By removing duplicate
screenshots (using a binary file comparison), we get the number of different screens for each
user. More than one unique screenshot can be taken within one view, e.g. if the user moves
the cursor in a list. From the log file entries, we can see how many different views there are
for each user, i.e. the number of times the user changed the view. This number is obviously
smaller than the number of different screens.
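As an illustration, the two counts could be derived roughly as follows. This is only a sketch: the screenshot file layout and the "VIEW &lt;name&gt;" log entry format are assumptions made for the example, not the actual GUIDE log schema.

```python
import hashlib
from pathlib import Path

def count_unique_screens(screenshot_dir):
    """Count distinct screenshots via binary comparison (here: a content hash)."""
    digests = {hashlib.sha256(p.read_bytes()).hexdigest()
               for p in Path(screenshot_dir).glob("*.png")}
    return len(digests)

def count_views(log_lines):
    """Count view changes: one more view each time the logged view name
    differs from the previous one (the "VIEW <name>" format is assumed)."""
    changes, previous = 0, None
    for line in log_lines:
        if line.startswith("VIEW "):
            name = line.split(maxsplit=1)[1].strip()
            if name != previous:
                changes, previous = changes + 1, name
    return changes

log = ["VIEW WelcomeView", "VIEW View", "VIEW View", "VIEW EPG_MainMenu_View"]
print(count_views(log))  # 3
```

The ratio reported in Table 11 would then simply be `count_views(...)` divided by `count_unique_screens(...)` for each subject.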
Figure 15: German: Different screens and views.
German:
From these numbers, we can derive different types of users. If the number of different screens
is far larger than the number of different views, there was a lot of activity within a view,
which usually results from an intensive remote control use (e.g. Log12). On the other hand, if
activity within one view is small, it means that the users switched directly from one view to
the next, without much interaction (e.g. scrolling) inside a view (e.g. Log11). This also
becomes clear in Table 11, which shows the number of different views divided by the number
of different screens (min and max highlighted).
German subjects:
ID    1    2    3    7    8    10   11   12   13   14   15   16   17   18   21   23   24
Ratio 0.58 0.49 0.49 0.64 0.66 0.65 0.83 0.32 0.64 0.51 0.59 0.47 0.61 0.29 0.34 0.58 0.46

English subjects:
ID    4    19   20   22
Ratio 0.58 0.53 0.44 0.54

Table 11: Number of different views divided by number of different screens.
English:
All German subjects show quite similar behavior with respect to their patterns of use;2 the
average number of views divided by the number of screens is 0.52 for
the English-speaking participants, compared to 0.54 for the German subjects. Also, there are
no subjects whose values depart strongly from the average value.
Figure 16: English: Different screens and views.
Italian: This information was not available in the Italian log files.
2 For the English subjects, the same tendency could be noted. This may be due to the fact that all English-speaking subjects have been living in Germany for a long time and may have adapted their way of interacting with the TV to their German surroundings.
3.1.3 Screen preferences of the Users
German:
From the log files, we can see how much time the subjects spent in each view.
EPG_ChooseFilter; 1.30; 5%
EPG_Confirmation; 0.70; 3%
EPG_MainMenu_View; 8.18; 31%
EPG_ManualInput; 1.13; 4%
EPG_ResultList; 6.76; 25%
NewsView; 0.37; 1%
View; 7.20; 26%
EPC_RecordingList; 1.17; 4%
BlackScreen; 0.03; 0%
WelcomeView; 0.39; 1%
Figure 17: German: Screen preferences. Name; time in minutes; percentage.
Figure 18: German: Screen preferences of the individual subjects
As can be seen from Figure 17, users spent most of their time in the EPG main menu,
followed by the TV screen and the result list. This is what one could expect regarding the
tasks the subjects were given. Little time was spent in the recording list (~4%), which
suggests that this feature was not clear to the subjects or was not required to solve the given
tasks. Only 3% of the time was spent in the confirmation screen, which is what one might
expect.
Figure 18 shows the screen preferences for every subject. While some users spent more time
within the TV screen (“View”) than others, the overall distribution is similar for the subjects,
which is due to similar tasks. Only three users (2, 3, and 12) used the teletext feature, which
was neither mentioned nor part of the task. Two subjects (15 and 21) did not use manual
input.
EPC_RecordingList; 1.03; 4%
EPG_ChooseFilter; 1.93; 7%
EPG_Confirmation; 2.13; 8%
EPG_MainMenu_View; 8.01; 29%
EPG_ManualInput; 1.61; 6%
EPG_ResultList; 5.28; 19%
NewsView; 0.63; 2%
WelcomeView; 0.37; 1%
View; 6.52; 24%
Figure 19: English: Screen preferences. Name; time in minutes; percentage.
Figure 20: English: Screen preferences of the individual subjects.
English:
The results for the English subjects are similar to the results of the German subjects.
Italian: This information was not available in the Italian log files.
3.1.4 Remote Control vs. Voice Control
Users not only used voice control in different ways, but also had very different attitudes
toward it. We examined the number of remote control events
and the number of voice commands in the form of wizard actions (both high- and low-level
commands, but no direct multi-slot EPG3 (see page 57) queries, i.e. queries where more than
one EPG value is selected with the same utterance, such as "show action movies for tonight",
which specifies a genre and a time in one statement). Again, the number of remote inputs and
the number of voice actions do not directly relate, since it takes several remote control
actions (e.g. 3x "down" plus "OK") to trigger the same action that takes only one voice
command (e.g. "select by day"). Therefore, if the number of remote control inputs is the same
as the number of voice inputs, it does not mean that the user triggered the same number of
actions by remote control and voice. Still, we can derive different user groups from the
amount of remote control use.
3 EPG = electronic programming guide
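The tally described above can be sketched as follows. The "HAPTIC" and "ASR" entry tags are placeholders assumed for this example, not the literal GUIDE log entities.

```python
from collections import Counter

def tally_inputs(log_lines):
    """Count remote control (haptic) vs. voice (wizard-simulated ASR) inputs."""
    counts = Counter()
    for line in log_lines:
        tag = line.split(maxsplit=1)[0] if line.strip() else ""
        if tag == "HAPTIC":
            counts["remote"] += 1
        elif tag == "ASR":
            counts["voice"] += 1
    return counts

# Three remote actions ("down", "down", "ok") achieve what a single voice
# command ("select by day") does, as noted above.
session = ["ASR select by day", "HAPTIC down", "HAPTIC down",
           "HAPTIC ok", "ASR show action movies"]
counts = tally_inputs(session)
print(counts["remote"], counts["voice"])  # 3 2
```

Note that equal tallies do not imply equal numbers of triggered actions, for the reason given above.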
Figure 21: German: Amount of voice and remote control input.
German:
By relating the number of voice commands to the number of remote control commands, we
can see in Figure 21 that there are three different groups: a first group (mostly remote control)
makes heavy use of the remote control (subject 21 only, red arrow). A second group (mixed)
makes use of both voice control and remote control (subjects 12 and 21, orange arrow). A third
group (mostly voice control) hardly makes use of the remote control and uses mainly voice
control (subjects 3, 7, 10, 11, and 16, green arrows). Since the subjects were told during the
introduction that this experiment was about voice control, and since this was a new and
therefore interesting feature for them, these results are not surprising. Still, voice control had a
great appeal for these users, and they could evidently control the system without the remote
control.
English:
The behavior of the English subjects (Figure 22) is more uniform than the behavior of the
German subjects. No different user groups can be derived from these subjects. Please note that
the sample is not significant here.
Figure 22: English: Amount of voice and remote control input.
Italian: Although this information was not available in the Italian log files, it is worth
highlighting that most of the Italian subjects seldom used the remote control. On the other
hand, during the second task (adjusting the volume), most of them used voice commands in
the same way as they would use the RC (e.g. "louder, louder, louder", reproducing repeated
presses of the RC button, or "higheeer", probably reproducing a single but prolonged press).
In general, voice control had a great appeal for the Italian users, and they used the RC only
when they had problems interacting with the system by voice.
3.1.5 TTS Usage
Next, the TTS output will be examined, which is the second way for the wizard to react to
user input besides executing an action. There are two kinds of TTS output, automatic and
manual TTS output. While automatic TTS prompts are played when a state is entered, manual
TTS prompts are triggered by the wizard as a means to communicate with the user. In this
section, we only consider manual TTS prompts, because no information can be derived from
the automatic prompts that are part of the dialogue.
The manual TTS prompts can again be divided into two groups. Most of the prompts were
defined before the recordings and the wizard could trigger them by selecting them from a list.
These prompts were divided into different groups of prompts: error prompts (ERROR), help
prompts (HELP), please-wait prompts (WAIT), and rejection prompts that were used when
the wizard could not understand the user (REJECT). The complete list of prompts can be
found in Appendix D. Moreover, the wizard could enter additional prompts into a text field
and play them. These prompts could not be classified automatically and are therefore listed as
FREE. WAIT prompts were used when the wizard had to perform a time-consuming task
(such as entering a query in the EPG window). The wizard had a special shortcut (F12) that
he could use to trigger a WAIT prompt. (This feature was only used for the recordings at EB.)
REJECT prompts were used if the wizard could not understand the user, but sometimes also
to gain some time when the wizard could not react quickly enough. Moreover, a REJECT
prompt was also played if a requested function was not available. Also, commands that were
not part of the wizard guidelines (such as command repetition, e.g. “down, down, down” or
“down 3 times”) were rejected.
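The classification of manual prompts into these categories could be reproduced along the following lines. The example prompt texts are invented placeholders; the real predefined lists are those in Appendix D.

```python
# Hypothetical excerpts from the predefined prompt lists (see Appendix D).
PROMPT_CLASSES = {
    "REJECT": {"Sorry, I could not understand you."},
    "WAIT": {"Your input is being processed."},
    "ERROR": {"The search did not return any results."},
    "HELP": {"You can search by channel, day, or genre."},
}

def classify_prompt(text):
    """Return the category of a manual TTS prompt; prompts typed freely
    by the wizard match no predefined list and are classified as FREE."""
    for category, prompts in PROMPT_CLASSES.items():
        if text in prompts:
            return category
    return "FREE"

print(classify_prompt("Your input is being processed."))   # WAIT
print(classify_prompt("The actor is not in the database."))  # FREE
```

This matches the way the evaluation scripts must have distinguished predefined prompts from FREE ones: anything not on a predefined list falls through to FREE.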
ERROR 13%; FREE 8%; HELP 12%; REJECT 42%; WAIT 25%
Figure 23: German: Types of TTS output.
Figure 24: German: Types of TTS output per user.
German:
As can be seen from Figure 23, most prompts for all views and all subjects were REJECT
(42%) prompts, followed by WAIT (25%) prompts. About the same number of ERROR
(13%), FREE (8%), and HELP (12%) prompts were used.
We now examine some users who show distinctive TTS prompt distributions. These
are marked with an arrow in Figure 24. Most of these recordings were done at the end (i.e. on
the right-hand side of the chart), but this is coincidental.
© DICIT Consortium
D6.2 –Multi-microphone Data Collection and WOZ Experiments
DICIT_D6.2_20080428 50
- Log014: This subject had a strong dialect and did not try to speak standard German.
He also used a lot of off-talk (e.g. DICIT: “Your input is being processed”, subject
says: “Well, I hope so!”). He also tried out which dialect words the system could
understand (e.g. for “yes”). In TV mode, the subject tried to pronounce channel names
in a “drunken” way. Obviously, many of his attempts were answered with a REJECT
by the wizard.
- Log021: This subject was not an experienced user and had some trouble using the
system. In the beginning, a lot of off-talk occurred (e.g. reading what was on the
screen, “what now?”). Lots of predefined (HELP) and free (FREE) help prompts were
necessary to guide the subject through the system. After some time, the subject used a
lot of free input (e.g. only saying the name of an actor or a channel), which was often
answered by the wizard with a WAIT prompt while he was processing the request.
- Log23: The subject operated DICIT in a very calm way. No WAIT prompts were used
for this subject, because the subject did not make requests that required a WAIT
prompt (i.e. no free input). Moreover, all general help prompts were
played at least twice, which accounts for the high number of HELP prompts.
- Log24: The reason for the high number of REJECT prompts is that the subject tried
out different things, primarily unavailable functions (e.g. “summary for broadcast”,
“full-screen”, “go to top in list”).
View                ERROR  FREE  HELP  REJECT  WAIT  Sum
EPC_RecordingList     1     1     1     18      7    28
EPG_ChooseFilter      2     0     0     12      6    20
EPG_Confirmation      1     1     0     23      0    25
EPG_MainMenu_View    75    63    57     82    128   405
EPG_ManualInput      13     3     0     13     85   114
EPG_ResultList       49    19    30     99     48   245
NewsView              0     0     0     12      1    13
View                  4    19    36    204      7   270
Table 12: Prompt types per view.
EPG_MainMenu_View: ERROR 19%; FREE 16%; HELP 14%; WAIT 31%; REJECT 20%
Figure 25: Prompt types in EPG_MainMenu_View.
EPG_ManualInput: FREE 3%; REJECT 11%; WAIT 75%; ERROR 11%
Figure 26: Prompt types in EPG_ManualInput.
EPG_ResultList: ERROR 20%; FREE 8%; HELP 12%; WAIT 20%; REJECT 40%
Figure 27: Prompt types in EPG_ResultList.
View: HELP 13%; ERROR 1%; FREE 7%; WAIT 3%; REJECT 76%
Figure 28: Prompt types in View.
Table 12 shows the number of TTS prompts per view. In Figure 25 - Figure 28, charts for the
views with more than 100 TTS prompts are shown. For the other views, on average about one
prompt or fewer per view and session was used, so it is not possible to draw conclusions from
their values.
In Figure 25, the distribution of prompts for the EPG main menu is shown. About one third of
the prompts are WAIT prompts that were used when the wizard reaction took some time (e.g.
free input or when the wizard needed to find the right button). ERROR prompts were for
example used if the search did not yield any result. One third of the prompts were HELP and
FREE prompts that provided help to the user.
As one might expect, the most frequently used prompt type for EPG_ManualInput (Figure 26)
is WAIT. When the user performed a manual input (e.g. selecting an actor), the wizard had to
type the value into the EPG database window by hand and the user had to wait. There are
comparatively few REJECT and ERROR prompts, which means that the wizard could
understand most input values. In some cases, the wizard did not know an actor and could
therefore not use it as a search criterion, which was indicated by an ERROR or REJECT
prompt.
REJECT and ERROR prompts make up more than half of the prompts in
EPG_ResultList (Figure 27). Since it was not clear to the users how to scroll in the result list,
they had to experiment with it, which caused many errors. Moreover, people tried to use
methods that could not be handled by the wizard (e.g. "up, up, down, up" spoken very
quickly). Users could also say the name of a broadcast to select it. Since this took the wizard
some time, he triggered a WAIT prompt.
Finally, most of the prompts used in TV mode (“View”, Figure 28) are reject prompts. Users
tried lots of functions here that they are used to from their TV at home, but that were not
available in the WOZ prototype. This includes for example full-screen mode or brightness
settings. Users then tried the help function to see which commands were available, which led
to a HELP prompt.
All in all, the number of REJECT prompts is very high. The general error message "Sorry, I
could not understand you" does not provide the subject with information about the reason for
the misunderstanding. Therefore, the upcoming prototypes should try to provide more specific
information whenever possible.
ERROR 15%; FREE 3%; HELP 8%; REJECT 47%; WAIT 27%
Figure 29: English: Types of TTS output.
Figure 30: English: Types of TTS output per user.
English:
The distribution is similar to the one of the German subjects (Figure 23). Only the low activity
of one subject (Log020) is mentionable. The number of FREE TTS prompts is smaller than
for the German subjects. This is probably due to the German wizard being cautious using
prompts for native speakers or the lack of necessity for these prompts. Please note that this
result is not significant due to the very small sample.
WAIT 32%; REJECT 8%; HELP 32%; FREE 1%; ERROR 27%
Figure 31: Italian: Types of TTS output.
Figure 32: Italian: Types of TTS output per user.
Italian: As can be seen from Figure 31, most prompts for all subjects were WAIT or HELP
prompts (32% each), followed by ERROR prompts (27%) and REJECT prompts (8%). FREE
prompts were used only in a few cases (1%). This distribution, different from the German and
English ones, is due to the different behavior of the wizard, who preferred to play HELP or
ERROR prompts instead of REJECT messages when subjects had difficulties (even if they did
not explicitly ask for help). Some subjects (indicated by arrows in Figure 32) show a
remarkably different distribution of TTS prompt categories in their log files because of their
particular behavior:
Log005: This subject asked for help several times (21.57% of his "main commands
interactions"4) because he did not find the information on the screen useful to orient himself
(he was one of three people who answered question 21 with score 1). Other subjects (like
4 "Main commands interactions" is a set of 25 commands that are not data-driven (e.g. "morning", "Monday",
"drama", etc.) but are explicitly shown on the GUI and mentioned in the TTS messages (e.g. "search by
channel", "results", "TV-guide", etc.).
subject 7 with 8.62%, subject 11 with 8.11%, and subject 6 with 6.45% of their "main
commands interactions") asked for some help too, but they did not give any negative feedback
to questions 14, 16, or 21 of the questionnaire.
Log009: The bar chart of this subject shows a lot of ERROR messages because she was
a truly inexperienced user and mainly asked to find a program by artist or subject criteria
(14.28% of her "main commands interactions" were requests for an artist or a subject, while
the Italian EPG did not contain these data). Moreover, she did not interact much with the
system by voice: within the second task (adjusting the volume), she pushed the "mute" button.
As a result she did not hear the error messages that the wizard played to dissuade her from
using the RC, so the second task was interrupted and the whole session was the shortest one
(which inflates the percentage of errors).
Log018 and Log019: These subjects tended to ask for a program directly by its name
(without first saying "search by title"), which the wizard often answered with a WAIT prompt
while processing the request. In general, many subjects searched for a program by title or
artist, but when they asked freely for a title or artist not foreseen in the second task goal,
processing these free inputs required some time (which is why half of the sample shows a lot
of WAIT messages).
3.1.6 Barge-In Behavior
There is an entry in the log files for cancelled TTS prompts, which means that the user
interrupted the system either by voice command or remote control input.
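Counting barge-ins from these entries is then a simple scan over the log. The "TTS_CANCELLED &lt;source&gt;" entry format is an assumption made for this sketch, not the actual log syntax:

```python
def count_barge_ins(log_lines):
    """Count cancelled TTS prompts, grouped by interruption source
    (voice command vs. remote control)."""
    counts = {"voice": 0, "remote": 0}
    for line in log_lines:
        parts = line.split()
        if parts and parts[0] == "TTS_CANCELLED":
            # Default to "voice" if no source was logged (an assumption).
            source = parts[1] if len(parts) > 1 else "voice"
            counts[source] = counts.get(source, 0) + 1
    return counts

log = ["TTS begin MainMenuPrompt", "TTS_CANCELLED remote",
       "TTS begin HelpPrompt", "TTS_CANCELLED voice"]
print(count_barge_ins(log))  # {'voice': 1, 'remote': 1}
```

Distinguishing the source matters here: as noted in the Italian results below, many logged interruptions were haptic (RC) rather than vocal.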
German:
On average, every subject interrupted TTS prompts 8.65 times. Users usually interrupted
prompts that are long and that they have heard before. The most commonly interrupted
prompts are the prompts played when the main menu was entered and the recording
confirmation prompt.
As can be seen from Figure 33, there are two users who make heavy use of barge-in
(subjects 1 and 23, marked with a green circle). On the other hand, there are three subjects
(15, 16, and 18, marked in red) who used barge-in fewer than five times. While subjects 15
and 18 state in the questionnaire that they want to have the possibility to interrupt the system,
and therefore are possibly not aware of this feature, subject 16 says that she would wait for
short prompts. One subject (7) states in the questionnaire that she would be willing to wait,
but still used barge-in five times.
Figure 33: German: Number of barge-ins per subject.
Figure 34: English: Number of barge-ins per subject.
English:
Two users (LOG5 and LOG20) used barge-in rarely, and two users (LOG19 and LOG22)
used it frequently.
Italian:
In Figure 35, the results of subjects 6, 11, 12, 15, and 16 have been removed for technical
reasons, so the arrows indicate a failure of the logging rather than an actual value of 0.
Regarding the others, only two subjects used barge-in more than 10 times (9 and 11); on the
contrary, many users (12) interrupted prompts fewer than 5 times, and subjects 18 and 20 did
not use barge-in at all.
In general, Italian subjects used the barge-in functionality less than German users: the average
number of barge-in interruptions for the Italian sample is 4.9.
The annotations show that subjects interrupted the system by voice only a few times; most
interruptions in the log files have to be interpreted as haptic interruptions of the vocal output
via the RC (rather than voice commands).
Figure 35: Italian: Number of barge-ins per subject
3.1.7 User Speech Time
The user speech time, which is the overall time a user was speaking, was examined as well.
The annotations were used as a basis for this evaluation and the speech time is exact for the
German and the English evaluation. For the Italian evaluation, the time needed for each task
was examined, while for the German and English subjects, the time for both tasks was used.
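For the German and English data, the speech time follows from summing the durations of the labeled intervals in the Praat annotations. A minimal sketch, assuming a single interval tier in Praat's long TextGrid format (the actual annotation conventions are those of [13]):

```python
import re

def speech_time(textgrid_text):
    """Sum the durations of non-empty intervals in a TextGrid; intervals
    with an empty text field are treated as silence and skipped."""
    interval = re.compile(
        r'xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"')
    return sum(float(xmax) - float(xmin)
               for xmin, xmax, text in interval.findall(textgrid_text)
               if text.strip())

sample = '''
    intervals [1]:
        xmin = 0.0
        xmax = 1.25
        text = ""
    intervals [2]:
        xmin = 1.25
        xmax = 3.75
        text = "show action movies"
'''
print(speech_time(sample))  # 2.5
```

Summing per session and dividing by the number of turns would likewise yield the per-turn figures reported below.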
German:
German subjects (Figure 36) had an average speech time of 526 seconds. For one subject (3),
no data was available due to problems with the log file. On average, every subject used 263
words and had 115 turns.
Figure 36: German: User speech time per subject. Red line is average.
English:
The average speech time of the English subjects (Figure 37) was 630 seconds and therefore
somewhat higher than the speech time of the German subjects. On average, every subject used
307 words and had 138 turns.
Figure 37: English: User speech time per subject. Red line is average.
Italian:
Figure 38: Italian: User speech time
As mentioned before, the speech time of the Italian evaluation differs from the German and
English speech times. Since the task completion time is used, it is longer than the actual
speech time.
The first task was longer than the second one: on average, people took 11 minutes to
complete the first task and only 5 minutes for the second one. The reason for this difference
seems to be that in the first task the instructor and the wizard gave people more time to "play"
with the system, since the goal was to find what was interesting for them (some watched TV,
some changed channels, and others read teletext).
3.1.8 Multi-Slot Usage
The subjects could enter arbitrary queries, such as "Please show me what's on TV tonight
from genre action and with actor Brad Pitt." The wizard could understand these queries, enter
them into the query window, and show the results to the user. Queries that fill more than one
“slot” of a query are called “multi-slot” queries. The example above contains the slots time
(“tonight”), genre (“action”), and actor (“Brad Pitt”). This query could also have been
performed using three single-slot queries, but multi-slot queries are more convenient and
natural for the user.
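The slot-filling idea can be illustrated with a toy extractor. The keyword lists below are invented for the example; the real prototype matched user input against the loaded grammars.

```python
# Toy keyword lists; placeholders, not the prototype's actual grammars.
SLOT_KEYWORDS = {
    "time": ["tonight", "tomorrow", "morning"],
    "genre": ["action", "drama", "comedy"],
    "actor": ["brad pitt"],
}

def extract_slots(utterance):
    """Return the slots an utterance fills; filling more than one slot
    makes it a multi-slot query."""
    text = utterance.lower()
    return {slot: kw
            for slot, keywords in SLOT_KEYWORDS.items()
            for kw in keywords if kw in text}

query = "Please show me what's on TV tonight from genre action and with actor Brad Pitt."
slots = extract_slots(query)
print(sorted(slots))   # ['actor', 'genre', 'time']
print(len(slots) > 1)  # True: a multi-slot query
```

The same query issued as three single-slot commands would fill the same slots, only in three separate turns.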
During the experiments, it quickly became clear that users do not use multi-slot queries
without having been informed about this feature. Therefore, a prompt was introduced that tells
the subjects to use multi-slot queries. When the wizard realized after some time that no multi-
slot queries occurred, he could trigger this prompt.
German:
For the first five subjects, no prompts were played and no multi-slot queries were executed by
the users. Starting with subject 5, users were hinted at the availability of the multi-slot feature.
Only one subject (LOG13) used multi-slot before this prompt. When using multi-slot after the
prompt, subjects usually repeated the help prompt word for word and then adjusted it to their
needs in later queries.
Different kinds of behavior can be observed. First, some subjects only used this feature once
(e.g. LOG8, LOG14, LOG17, or LOG24). Other users used multi-slot extensively after they
had been informed about this feature (e.g. LOG7, LOG16, or LOG18). Finally, some subjects
did not use multi-slot even when they knew about it.
Figure 39: German: Multi-slot evaluation.
English:
The results are similar for English subjects. There are subjects who had not heard the prompt
and did not use multi-slot, a subject who heard the prompt and used multi-slot, and a subject
who heard the prompt more than once, but did not use multi-slot.
Figure 40: English: Multi-slot evaluation.
Italian:
Only a few people (4) used the multi-slot functionality. In particular, two subjects, while the
screen was still black (a few seconds before the "Welcome screen" appeared), started speaking
in long, natural phrases in which they mixed different search criteria (subject 8 used
"CHANNEL" and "DAY", subject 17 used "DAY" and "TIME"); but once they saw the GUI,
they kept to one criterion at a time, waiting for the system feedback. Another person once
used two criteria together, but in a schematic way (<keyword+value> and <keyword+value>),
and the last one tried to use several search criteria only after the wizard, through a help
prompt, explained how to use free input: this subject copied the model from the prompt, used
it only once, and then went back to single slot.
In many sessions, when people spoke freely, even keywords that matched commands were
classified as "Off-Talk", because the different volume and tone of voice showed that the
subjects were reading (it was clear they were not giving different search criteria using the
multi-slot approach).
3.1.9 Off-Talk
Off-talk is the part of the speech that does not address the system; it includes
exclamations and interjections as well as keywords that are simply read from the screen. The
diagrams in this section show the number of off-talk words.
German:
There are three groups of users regarding off-talk. The first group did not use any off-talk
(e.g. LOG1, LOG6, LOG7, …). The second group used some off-talk, but not extensively
(e.g. LOG2, LOG8, …). The third group produced a large number of off-talk words (more
than 100; LOG3 and LOG21).
[Figure: bar chart of the number of off-talk words per German subject (LOG1–LOG24); y-axis 0–180.]
Figure 41: German: Off-talk.
English:
Most English subjects had only a small number of off-talk words (LOG5, LOG20, and
LOG22). LOG19 had about twice as many as the other subjects, but all English subjects still
used far fewer off-talk words than the third group of the German subjects.
[Figure: bar chart of the number of off-talk words for the English subjects LOG5, LOG19, LOG20, and LOG22; y-axis 0–20.]
Figure 42: English: Off-talk.
Italian:
The Italian data are given as percentages, because each subject's “off-talk” has been
compared to that subject's overall number of interactions.
The majority of the sample (14 subjects) tended to comment aloud while speaking; one third
of them uttered some commands with a different volume and tone of voice, which were
classified as “Off-Talk” because the intention was clearly to comment on the system behavior
(not to give many commands at the same time). In particular, one person uttered a great deal
of “Off-Talk” (subject 7: 67% of the total interactions), making personal comments like
“well…”, “so”, “let's see…” as well as reading out the outputs on the screen or the search
criteria (“artist, genre…”).
[Figure: bar chart of off-talk as a percentage of each Italian subject's interactions (subjects 1–20); y-axis 0–80%.]
Figure 43: Italian: Off-talk.
3.1.10 Overlaps
Overlaps are parts of the annotations where both the subject and the system are speaking at
the same time (only the TTS output is considered).
Overlaps occur when the ASR does not recognize the user's speech (conversely, when a
subject utters a word or sentence the system understands, the system stops speaking). It is
important to analyze when and where overlaps appear most frequently, because this can be
used to improve the design of the dialog and to induce a more appropriate answer from the
user.
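The definition above can be made concrete with a small sketch (hypothetical code, not part of the actual annotation tooling), which derives overlap intervals from user and TTS speech segments given as start/end times:

```python
# Hypothetical sketch: deriving overlap intervals from annotation segments.
# A segment is a (start, end) pair in seconds; the function name and data
# layout are illustrative assumptions, not the project's actual tooling.

def find_overlaps(user_segments, tts_segments):
    """Return intervals where a user segment intersects a TTS segment."""
    overlaps = []
    for u_start, u_end in user_segments:
        for t_start, t_end in tts_segments:
            start, end = max(u_start, t_start), min(u_end, t_end)
            if start < end:  # non-empty intersection: both are speaking
                overlaps.append((start, end))
    return overlaps

# Example: the subject starts answering while the TTS prompt is still running.
print(find_overlaps([(4.0, 6.0)], [(0.0, 5.0)]))  # [(4.0, 5.0)]
```

Counting such intervals per subject, and noting where they fall in the session, yields diagrams like Figure 44 and Figure 45.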
In Figure 44, the overlaps of the German and the English users are shown in a diagram; at the
end, the sum over all subjects is shown. As one can see, overlaps are concentrated at the
beginning of the recordings: many subjects kept speaking while the system was still reading
the very long introductory text.
Figure 44: German and English: Overlaps.
Italian:
[Figure: bar chart of the number of overlaps per Italian subject (subjects 1–20); y-axis 0–16.]
Figure 45: Italian: Overlaps.
Regarding the Italian sample, 13 subjects spoke over the system at least once, in particular
during the long introductory message and a very long help prompt (“To search for a program,
you can use all the search criteria listed on the screen. Moreover you can always ask me:
Back, to go back, Restart to reset the criteria, or Search Now to start a search.”). The majority
of people did not produce many overlaps: only three users were a little more impatient and
spoke more often while the system was generating an output (8 and 14 times, whereas the
other subjects overlapped only two to five times in total).
3.2 Observation of the Wizards
Some results cannot be derived directly from the logging data, but are subjective impressions
of the wizards. They still have to be considered, since these observations were shared by
different wizards and across different sessions.
3.2.1 People have to be encouraged to use natural language input
In almost every recording, people initially operated the DICIT system by reading out the
commands shown on the screen. Consider for example the EPG filter menu: subjects used to
read out the name of a filter and then the subsequent value (e.g. “Day” followed by
“Wednesday”). They did not know that free text input was possible.
Therefore, the wizard played a help prompt that explained to the subjects that natural language
input was possible. This help prompt read: “E.g. say ‘What is on RTL tonight?’ or ‘Are there
any comedies tomorrow?’”. Usually, German and English subjects started off reading out this
prompt word by word, but adjusted it to their demands afterwards. In contrast, Italian subjects
did not follow the suggestion of this help prompt (only one subject copied the model from the
prompt word by word and then did not use it anymore), and even those who started with a
natural language sentence adapted their behavior to a system-driven interaction.
Thus, even though this help message told people that natural language sentences could be
used, it is interesting to note that, without an introductory text or even an automatic help
function (or tutorial movie) encouraging this kind of verbalization, users did not use natural
language input spontaneously, and many of them did not seem to dislike a system-driven
interaction.
3.2.2 People tend to use simple commands
People tend to use simple commands that can be mapped directly to remote control commands
(called low-level above) or predefined speech commands (high-level) most of the time. They
usually do not use complex or concatenated commands; they seem to use the commands
shown on the screen and do not try to combine them. Moreover, many commands are on a
“widget level”, for instance “cursor down/cursor down/cursor down” instead of “put cursor on
broadcast ‘Formula 1’”. Users might be encouraged to use more complex commands, or
speech shortcuts, by a help prompt that gives an example.
3.2.3 Some people use Barge-In, others do not
Some subjects are very impatient and interrupt TTS prompts that they have heard a
number of times (e.g. in the main menu or the confirmation view). Others do not use this
feature as often, which might also be due to the fact that they are not as experienced in
operating an STB system. On the other hand, when subjects use the RC to interrupt a non-
recurring TTS message, the interruption is due to problems, which the subjects try to resolve
using an ordinary RC interaction.
3.2.4 Reset function not self-explanatory
Many subjects had problems using the reset function. One could either reset all filters (by
saying “Reset” or pressing the yellow button) or single ones by saying “Reset [Filter]”, which
was explained neither in the introduction nor in a help prompt. Therefore, people had
problems resetting single filters and tried to do so by entering the filter sub-screen and saying
“reset”. Some users also seemed to enter filter sub-screens in search of a reset function and
left them immediately after realizing that it was not there. Reasons for these problems might
also be that the term “Reset” was not clear to the German subjects or that people did not want
to reset all filters, but only single ones.
3.2.5 Remote Control is hardly used
When people realize that speech input works well, they do not use the remote control any
more. But when they encounter problems with speech input repeatedly, they switch back to
the remote control. Still, most users stick with the voice control and do not touch the remote
control any more.
4. Conclusions for Subsequent Prototypes
In this chapter, we want to draw conclusions from questionnaire observations (cited as
“Questionnaire [Number]”), the log file evaluation, and wizard impressions (both referenced
with the section number). The feedback on the prototype was positive. People found the WOZ
prototype easy and fun to use and had no problems using it (Questions 13, 14, 17, and 18). As
a general impression, the system appears easy, efficient, original, capable, well organized, and
patient.
4.1 Overall Conclusions
Subjects describe the DICIT system as neither too active nor too formal (Question 31). While
some subjects would prefer either a more active or a less formal system, the WOZ system
should represent a good average solution for most people. Alternatively, the interaction style
could be configurable or even adapt automatically to the current user.
Altogether, help should be improved, since it does not get good ratings (Question 16). While
only five subjects (two German and three Italian) cancelled the welcome prompt, it seems that
subjects did not really believe that DICIT is “quite clever”, as said in the introduction. While
German subjects started using complex multi-slot queries once they were introduced by a help
prompt, Italian subjects did not use multi-slot queries even when they had heard the help
prompt, and those who started talking freely with the system before the DICIT welcome
prompt adopted a more passive behavior when interacting with the actual system (Sect. ”Multi
slot usage”). Moreover, some subjects did not use barge-in, but requested this feature in the
questionnaire (Question 28); these behaviors and answers indicate that most subjects had
different expectations about the way the system works. Therefore, more detailed and active
help could increase the number of multi-slot queries and make people aware that vocal
interaction is an efficient alternative to RC interaction.
4.2 Dialog and Menu Structure
Users found the system easy and fun to use. Colors and font sizes got positive feedback in the
questionnaire (Question 17 and respective comments).
German subjects did not make use of the recording list a lot (Sect. ”Screen preferences of the
users”). Possibly, the concept was not obvious or people did not require this feature, since it
was not part of the task. The first interpretation of German data logs is also supported by the
behavior of Italian subjects, because (even if this information was not available in the Italian
log files) few Italian subjects (who “recorded” a lot of programs because of the first task goal)
really used the recording list to control or delete their “recordings”. Still, people should be
made aware of all features provided by DICIT, for instance by special help prompts (e.g. “Do
you know the recording list?”).
Moreover, the reset function in the EPG main menu was not clear for the German sample.
Some users remarked this problem in the questionnaire (comments to Questions 13 and 17)
and the wizards support this impression. Therefore, this feature should either be changed or
explained in a better way. Adding a “Reset” function to the filter sub-screens (views
EPG_ChooseFilter and EPG_ManualInput) could improve the usability of the system.
4.3 Speech Dialog
All in all, subjects prefer voice input to remote control input, which is supported by the
questionnaire (Question 24) and the recordings (Sect. ”Remote control vs Voice control”).
The subjects prefer short commands (Question 25) and consider long commands useful only
for beginners (one comment on Question 25). On the other hand, people seem not to have
been aware of how powerful the WOZ system was and did not use more complex commands
for that reason. German subjects started using multi-slot queries after they had been given an
example by a help prompt (Sect. ”Multi slot usage”).
The subjects prefer smooth dialogs and do not want DICIT to interfere with their interaction
with the system. Therefore, both short output prompts and short input commands should be
used (Questions 25, 27). People prefer speech as an input means; they state that feedback
should either be visual or an instant reaction, but not voice-only (Question 26).
Moreover, German subjects and most of the Italian subjects do not like long TTS prompts and
stated that DICIT should not read out what is on the screen (Question 27). On the other hand,
45% of the Italian sample stated that it would be preferable to have the list of programs read
out as well, because this feature could benefit both impaired people and “normal” persons
who prefer not to stay in front of the TV while consulting the EPG. This also applies to the
help function, which should be on the screen in any case. On the other hand, subjects want to
be able to interrupt TTS output by means of barge-in (Question 27 and Sect. ”Barge-in
Behavior”). Barge-in is not required if the prompts are kept short enough; if long prompts are
used, the users have to be made aware that they can interrupt the system. The TTS voice was
rated low (Question 30.3) and should be improved.
When asked about this feature, people wanted to be able to switch off the recognizer
(Question 29), mainly during a conversation with another person. The use of a keyword to
address the DICIT system could render this feature redundant.
There were also remarks regarding TTS and the mute function (Question 21). First, TTS
should not be muted when the TV sound is muted. Second, it should be possible to mute the
system completely.
4.4 Remote Control
A mixed-mode operation (voice and remote control) was rated low by the German sample
(Question 15), and only subjects 12 and 21 made use of both voice input and the remote
control. On the contrary, even though most of the Italian subjects did not use the RC at all,
this feature had a good rating for the Italian sample; that is, the Italian subjects expressed an
expectation rather than actual experience. Remote control operation was not considered easy
(Question 19) and should therefore be improved. However, since the focus of this study was
on voice control, a complete and flawless remote control operation was not its main objective.
German subjects have a strong dislike for a “virtual keyboard” for free-text input (Question
23) and want to have a voice input method that can be used to select all titles, actors, and
subjects. On the contrary, most of the Italian subjects think that this is an alternative way of
interaction, which can simplify the search for artists or titles when these are difficult to
remember or to pronounce (e.g. because of a foreign language).
4.5 Considerations Among the Two Samples
In general, most Italian people are not used to interacting with the TV (that is, to using the TV
set not only to watch programs, but also to read information), and this is true especially for
elderly people, for people with a low educational qualification, and for people whose
occupation is not related to office work (in Italy the “digital divide” is still high: to date,
digital television is not fully deployed and PCs are not yet widespread in all households).
The Italian sample was selected such that it represents the whole population. Hence, two
thirds of the subjects belong to the above-mentioned categories; the underlying idea is that the
TV is a device potentially used by everybody.
The obtained results are consistent with this choice; in particular, this can explain why few
subjects either spoke in complex phrases or adopted a more “natural” interaction with the
system (even after hearing the help prompt that explained a freer way of interacting).
Although they felt comfortable using a very advanced and “natural” interaction mode (i.e.,
voice), most of them showed a “passive” behavior when approaching the EPG selection,
because they usually consult a paper guide or “surf” channels to select a program instead of
looking up an EPG.
Even though German subjects are not used to interacting with the TV by voice or multimodal
input, the German sample – and especially subjects who were more familiar with PCs or
interactive TV – showed little confidence that the system could understand complex or
natural phrases. Inexperienced subjects tried to have a more “natural” interaction with the
system, whereas experts explicitly tried complex phrases to see how capable the system was.
Appendix A – Microphone Arrays
Harmonic Nested Array
A microphone array performs spatial sampling of a wavefield. Spatial aliasing, which is
analogous to temporal and spectral aliasing, can be avoided if the microphone spacing d
satisfies the following inequality:

    d ≤ λ_min / 2,

where λ_min is the minimum wavelength in the signal of interest [17].
The nested array implemented for DICIT consists of four sub-arrays. Table 13 shows the
spatial aliasing limits for the four sub-arrays. The maximum frequency is given by
f_max = c / λ_min, where c = 344 m/s is the sound velocity in air.
Sub-array no.   Distance [m]   Minimum Wavelength [m]   Maximum Frequency [Hz]
1               0.04           0.08                     4300.0
2               0.08           0.16                     2150.0
3               0.16           0.32                     1075.0
4               0.32           0.64                     537.5
Table 13: Spatial aliasing limits of the sub-arrays
The frequencies shown in the table are harmonics, hence the name harmonic nested array.
The microphone spacing of each sub-array was chosen such that each sub-array covers one
octave. This structure allows for greater flexibility when combined with the array signal
processing algorithms.
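The aliasing relations above (λ_min = 2d, f_max = c/λ_min) can be checked with a short sketch (Python; the helper name is ours, not part of the DICIT software) that reproduces the values in Table 13:

```python
# Reproduce the spatial-aliasing limits of Table 13: for microphone
# spacing d, aliasing is avoided up to lambda_min = 2*d, i.e. up to
# f_max = c / lambda_min, with c = 344 m/s the speed of sound in air.

C_SOUND = 344.0  # m/s

def aliasing_limit(spacing_m, c=C_SOUND):
    """Return (minimum wavelength [m], maximum frequency [Hz])."""
    lambda_min = 2.0 * spacing_m
    return lambda_min, c / lambda_min

# The four DICIT sub-array spacings, each covering one octave.
for i, d in enumerate([0.04, 0.08, 0.16, 0.32], start=1):
    lam, f_max = aliasing_limit(d)
    print(f"sub-array {i}: d = {d:.2f} m, "
          f"lambda_min = {lam:.2f} m, f_max = {f_max:.1f} Hz")
```

Halving the spacing doubles the maximum alias-free frequency, which is why the four spacings of the nested array line up octave by octave.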
NIST MarkIII Array
The array uses 64 electret microphones installed in a modular environment. Two main
components constitute the system: a set of microboards for recording the signals and a single
motherboard that transmits the digital data over the network. There are eight microboards in
the array, and each microboard is connected to eight microphones. The first step performed
by a microboard is the polarization of the microphones and the amplification of the signals:
electret microphones need phantom power to work properly and provide only a low-voltage
signal, so the microboard conditions the signals before they are converted into digital format.
The digitization of the audio signals is done on each microboard, using four dedicated stereo
analog-to-digital converters. Placing the A/D converters as close as possible to the
microphones is crucial to obtain a sufficiently small input noise level, which for the Mark
III array is x dB relative to the maximum level.
Preliminary experiments conducted on the original array had shown that the coherence
between a generic pair of signals was biased by common-mode electrical noise, which proved
detrimental to time-delay estimation techniques used to co-phase signals or to localize
speakers. Therefore, a hardware intervention was carried out to remove each internal noise
source from the analog modules of the device [3].
Appendix B – The Questionnaire
PERSONAL DATA (USER TYPE DEFINED BY PRE-QUESTIONNAIRE): Expert / Non-expert
SOME QUESTIONS FOR STATISTICAL PURPOSES
1.You are
male
female
2. What is your educational qualification?
Primary school
Middle school certificate
Secondary school certificate
Degree
3.Your age
20-30
31-40
41-50
51-60
more than 60
4.Your profession
Businessman Freelancer
Manager Executive
Employee Factory worker
Trader Agent
Craftsman Housewife
Student Retired
Working/studying area______________________________
YOUR HABITS WATCHING TV AT HOME (choose only one answer to each question)
5. How many people live in your house, including you?
I live alone
2 people
3 or more people
6. How many TVs do you have in your house?
1
2
3 or more
no TV, but I use Net-TV through the PC
7. Usually, who decides what to watch on TV?
Only one person
We all decide together
The majority decides
Each person has a TV
8. Usually, how do you decide which programme to watch?
Looking up the teletext
Looking up a newspaper/TV programme guide/the internet
Looking up the electronic programme guide (EPG)
Channel surfing
9. Which type of television do you usually watch?
“Traditional” (analogue) – jump to question 11
Satellite
Digital terrestrial
IPTV
10. How do you usually select a programme?
With the numeric button of the remote control
With the program up/program down button on the remote control
Personal code:
Through the electronic programme guide (EPG)
Scheduling the visualization
11. Which information matters to you when choosing a programme?
Genre (film, sport, tv series, news, etc)
Actor/big names
Topic/subject
Duration
Channel
I don't care, because I channel surf
12. Usually you use the TV:
To watch current TV programmes
To watch programmes I have recorded
To watch video on demand (VOD)
To watch purchased DVDs
As background during other activities
I channel surf
Other (specify) ______________________
DICIT questionnaire
WE ASK YOUR OPINION ABOUT THE SYSTEM YOU HAVE JUST TESTED, EVALUATING EACH OF THE FOLLOWING ASPECTS:
USING DICIT SYSTEM
13. It was easy to understand how to use the different selection criteria given by the system.
Very Difficult 1 2 3 4 5 6 7 8 9 10 Very Easy
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
14. It was easy to understand how to give all the vocal commands.
Very Difficult 1 2 3 4 5 6 7 8 9 10 Very Easy
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
15. It was comfortable to give some information by voice and other information with the remote control.
Very Uncomfortable 1 2 3 4 5 6 7 8 9 10 Very Comfortable
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
16. In case of problems, did the system usefully and efficiently suggest what to do to recover after an error?
Very Useless 1 2 3 4 5 6 7 8 9 10 Very Useful
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
WATCHING THE SCREEN
17. Is the screen which shows the criteria for the programme search easy to read?
Very Difficult 1 2 3 4 5 6 7 8 9 10 Very Easy
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
18. Was it easy to understand how to use, by voice, the search criteria for programmes shown on the screen?
Very Difficult 1 2 3 4 5 6 7 8 9 10 Very Easy
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
19. Was it easy to understand how to use the remote control to select the search criteria for programmes?
Very Difficult 1 2 3 4 5 6 7 8 9 10 Very Easy
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
20. To accomplish the task we assigned to you, did you miss any other vocal commands?
NO / YES: write which ones
List the missing commands
……………………………………………………………………………………………………………………………………………………………………………………………… ………………………………………………………………………………………………
21. Did you find the information on the screen useful for orienting yourself, in case you disabled the audio?
Very Useless 1 2 3 4 5 6 7 8 9 10 Very Useful
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………
………………………………………………………………
WATCHING THE SCREEN
22. Do you find it useful that the list of previously selected criteria is always shown?
Very Useless 1 2 3 4 5 6 7 8 9 10 Very Useful
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
23. Do you find useful a function that allows you to enter a precise word through the remote control to search for programmes?
Very Useless 1 2 3 4 5 6 7 8 9 10 Very Useful
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
USING THE VOCAL INTERACTION
24. How do you judge the opportunity to use vocal commands?
o Very useful
o Useful if used with the remote control
o Useful if it replaces the remote control
o Useful if it allows me more operations than the remote control
o I would never use vocal commands
comments
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
25. For the vocal commands you prefer:
o Using full sentences
o Using short commands
o Having some precise commands to read on the screen
comments
……………………………………………………………………………………………………………………………………………………………………
26. When you give a vocal command to the system, you prefer:
o To have only a video feedback
o To have only a vocal feedback
o To have both vocal and video feedback
o To have an immediate action of the system (without any previous feedback)
comments
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
LISTENING TO THE SYSTEM VOICE
27. Do you find it useful that the system reads out (in addition to listing them on the screen) the programmes found by your search?
o Yes
o Yes, only if they are not too many
o No
comments ………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
28. If you prefer a system which gives you vocal feedback:
o I would like to have the option to interrupt the system every time I give a command
o I would be happy to wait for the system to finish speaking before giving my command
comments ………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
29. Would you like to have a button to enable/disable the vocal recognizer?
No
Yes
comments
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
YOUR GENERAL OPINION
NOW THAT YOU KNOW THE SYSTEM, WE ASK YOU FOR SOME GENERAL OPINIONS ABOUT EACH OF THE FOLLOWING ASPECTS.
YOUR EXPERIENCES WITH DICIT: Give your opinion about DICIT by crossing the box that best describes your degree of agreement with each of the following phrases describing the service.
Complete agreement
Complete disagreement
1. I think that the system is easy to use
2. It makes me confused when I use it
3. I like the voice
4. I think that the system needs too much attention to interact vocally
5. I have the impression not to control the dialogue with the system
6. I have to focus on using it with the remote control too
7. I think that the speech interaction is efficient
8. By using the voice it is easier to search for programmes
9. The system voice speaks too quickly
10. The selection criteria which appear on the screen are not clear
11. I think that it is funny to use
12. I prefer using traditional ways (TV guide, teletext, newspaper) to search for an interesting programme
13. I think that this system needs some improvements
DICIT FOR YOU WAS
1.easy complicated
2.efficient inefficient
3.quick slow
4.original copied
5.precise vague
6.capable incapable
7.formal informal
8.active passive
9.friendly unfriendly
10.determined undetermined
11.kind unkind
12.clever silly
13.organized disorganized
14.patient impatient
Appendix C – The WOZ Instructions at EB
Test description
First of all: Thank you very much for taking part in our tests that aim at improving our new
TV system.
We have here a prototype of a set-top box that can do a little more than an ordinary TV set.
The new thing is that the device understands spoken language and that it comes with an
integrated TV program guide. As already said, it is a prototype, and your taking part in the
test will help us to further improve it. So, thanks again.
During the test we are going to collect speech samples to find out how people interact with the
system in two situations: while selecting a TV broadcast and while watching TV. We ask you
to interact with the system as naturally as possible, which means you should speak as
normally as possible – just like you would talk to me, for example. We do not want to
evaluate your interaction with the device but to improve the device itself, so don't worry:
there is no “right” or “wrong” in your behavior.
The data we are collecting from you and the other participants will be used to find out how
well the speech interaction is working. Therefore, you will be recorded with a room
microphone and a head set. Additionally, we are going to make video recordings.
All in all, you are going to get three tasks that we ask you to perform during the test. But your
job is NOT to solve the problem as fast as possible, but to test the system thoroughly.
For your first task you have 15 minutes and you are asked to find your favorite TV broadcast.
So take these 15 minutes to deeply test the system.
For each of the other tasks you have 7 minutes. Here you are asked to select a broadcast from
the TV program under a certain aspect – which aspect remains your choice. Here too, the goal
is not to solve the problem as fast as possible.
For each task you can use both speech and the remote control.
During the session, I will leave the room.
So, if there are any problems during the test, I cannot help, but the system can. Feel free to
find out how.
After the test I would like to ask you to fill in a questionnaire. Here I will help you.
In the questionnaire, we ask about your experience and impressions during the test. Knowing
how you would like to interact with the system, we can take that into account during further
development. Your impressions and experiences with the prototype are processed and the
results are used to improve the system. Your answers, of course, will remain anonymous.
Now, you have about 15–20 minutes to “play” with the system. I will then come and pick you
up for the questionnaire. Good luck and have fun.
Appendix D – List of Predefined TTS Prompts

The prompts are listed per class, each with its German (DE), English (EN) and Italian (IT)
version.

Class WAIT
  1. DE: Ihre Eingabe wird verarbeitet. Bitte warten Sie einen Moment.
     EN: Your input is being processed. Please wait a moment.
     IT: Per favore, attendi qualche istante.

Class ERROR
  1. DE: Dieser Sender ist nicht verfügbar.
     EN: This station is not available.
     IT: Quest'emittente non è disponibile.
  2. DE: Es gibt zu viele Ergebnisse für diese Suche. Wollen Sie Ihre Suche verfeinern?
     EN: There are too many results, do you want to refine the search?
     IT: Ci sono molti programmi, vuoi perfezionare la ricerca?
  3. DE: Leider hat Ihre Suche keine Treffer.
     EN: Sorry, your search did not yield any result.
     IT: Mi dispiace, questa ricerca non ha dato risultati.
  4. DE: Ich habe nur Programminformationen für die nächsten sieben Tage.
     EN: Sorry, I only have information on the program of the next seven days.
     IT: Mi dispiace, sono disponibili solo i programmi dei prossimi sette giorni.
  5. DE: Leider ist der gewählte Sender nicht in meinen EPG-Daten.
     EN: Sorry, the requested channel is not included in my EPG data.
     IT: Mi dispiace, questo canale non è presente nella Guida TV.
  6. DE: Leider habe ich keine Information über den gewählten Sender.
     EN: Sorry, I have no information on the requested channel.
     IT: Mi dispiace, non ho informazioni per questo canale.
  7. DE: Leider ist der gewählte Titel nicht in meinen EPG-Daten.
     EN: Sorry, the requested title is not included in my EPG data.
     IT: Mi dispiace, questa trasmissione non è presente nella Guida TV.
  8. DE: Leider habe ich keine Information über den gewählten Titel.
     EN: Sorry, I have no information on the requested title.
     IT: Mi dispiace, non ho informazioni per questa trasmissione.
  9. DE: Leider ist das gewünschte Genre nicht in meinen EPG-Daten.
     EN: Sorry, the requested genre is not included in my EPG data.
     IT: Mi dispiace, questa categoria di programmi non è presente nella Guida TV.
  10. DE: Ich habe keine Informationen zum gewünschten Genre.
      EN: Sorry, I have no information on the requested genre.
      IT: Mi dispiace, non ho informazioni per questa categoria di programmi.
  11. DE: Leider ist der gewünschte Schauspieler nicht in meinen EPG-Daten.
      EN: Sorry, the requested artist is not included in my EPG data.
      IT: Mi dispiace, quest'artista non è presente nella Guida TV.
  12. DE: Leider habe ich keine Informationen zum gewünschten Schauspieler.
      EN: Sorry, I have no information on the requested artist.
      IT: Mi dispiace, non ho informazioni per quest'artista.
  13. DE: Leider gibt es keine Einträge in meinen EPG-Daten zu diesem Schlagwort.
      EN: Sorry, the requested subject is not included in my EPG data.
      IT: Mi dispiace, questa tipologia di contenuti non è presente nella Guida TV.
  14. DE: Leider habe ich keine Informationen zum gewünschten Schlagwort.
      EN: Sorry, I have no information on the requested subject.
      IT: Mi dispiace, non ho informazioni per questa tipologia di contenuti.
  15. DE: Diese Sendung befindet sich bereits in der Aufnahmeliste.
      EN: This broadcast is already scheduled for recording.
      IT: Per questa trasmissione è già programmata una registrazione.
  16. DE: Diese Sendung wurde noch nicht aufgezeichnet.
      EN: This broadcast has not been recorded yet.
      IT: Questa trasmissione non è ancora stata registrata.
  17. DE: Die Fernbedienung funktioniert nicht. Bitte benützen Sie die Spracheingabe.
      EN: The remote control is not working. Please use speech to control the system.
      IT: Il telecomando non funziona. Per favore, usa i comandi vocali.
  18. DE: Das Fernbedienungssignal ist zu schwach.
      EN: The remote control signal is too weak.
      IT: Il segnale del telecomando è troppo debole.
  19. DE: Ein schwerwiegender Systemfehler ist aufgetreten. DICIT startet gerade neu.
      EN: A fatal error occurred. DICIT is restarting.
      IT: Errore di sistema. DICIT deve riavviarsi.

Class HELP
  1. DE: Sie können die Fernbedienung wie gewohnt benutzen, aber Sie können auch mit mir
         sprechen und mir sagen, was Sie gern tun würden.
     EN: You can use the remote control as usual, but you can also speak to me and tell me
         what you would like to do.
     IT: Puoi usare il telecomando normalmente, ma puoi anche parlarmi e dirmi cosa fare.
  2. DE: Sie können folgende Befehle sagen: Hilfe, zurück, neu starten oder jetzt suchen.
     EN: You can say: help, back, restart, or search now.
     IT: Puoi dirmi: AIUTO, INDIETRO, CAMBIA, o IN ONDA.
  3. DE: Sagen Sie zum Beispiel "Was kommt heute Abend auf RTL?" oder "Gibt es morgen
         irgendwelche Komödien?"
     EN: For example, say "What is on RTL tonight?" or "Are there any comedies tomorrow?"
     IT: Per esempio, dì: "Cosa c'è su RAI 1 questa sera?" o: "Ci sono delle commedie
         domani?"
  4. DE: Sie können mir auch sagen, nach welcher Sendung Sie suchen.
     EN: You can also tell me which broadcast you are looking for.
     IT: Puoi anche dirmi quale programma stai cercando.
  5. DE: Bitte sagen Sie mir, wie ich die Lautstärke für Sie verändern soll.
     EN: Please tell me how to change the volume for you.
     IT: Quanto devo modificare il volume?
  6. DE: Um ein Programm zu suchen, können Sie alle Suchkriterien auf dem Bildschirm
         benutzen.
     EN: To search for a program, you can use all the search criteria listed on the screen.
     IT: Scegli qualsiasi criterio di ricerca elencato. Puoi anche dire INDIETRO o CAMBIA
         per reimpostare la ricerca, Elenco o Videoteca per programmi non trasmessi ora,
         oppure In Onda per quelli in onda adesso.
  7. EN: Moreover, you can always say: Back to go back, Restart to reset the criteria, or
         Search Now to start a search.
  8. DE: Sie können im EPG durch die Angabe von Sender, Zeit, Titel, Genre, Schauspieler
         oder Schlagwort suchen.
     EN: You can search the EPG by specifying channel, time, title, genre, actor or subject.
     IT: Cerca nella Guida TV con il canale, l'orario, il titolo, la categoria, l'artista o
         il contenuto.
  9. DE: Sie können einen Eintrag aus der Liste zum Anschauen oder zum Speichern in der
         Aufnahmeliste auswählen.
     EN: You can select an item from the list and watch it or save it for recording.
     IT: Puoi scegliere uno dei programmi elencati per guardarlo o registrarlo.
  10. DE: Bitte nennen Sie mir z.B. eine Zeit, einen Sender, ein Genre oder ein Schlagwort,
          um die Suche zu verfeinern.
      EN: Please tell me, for example, a time, a channel, a genre or a subject to refine
          the search.
      IT: Per favore indicami l'orario, il canale, il titolo, la categoria o il contenuto.
  11. DE: Hier sind Ihre Ergebnisse.
      EN: Here are the requested results.
      IT: Ecco cos'ho trovato!

Class REJECT
  1. DE: Wie bitte?
     EN: Pardon?
     IT: Puoi ripetere?
  2. DE: Leider konnte ich Sie nicht verstehen.
     EN: Sorry, I did not understand you.
     IT: Scusa, non ho capito.
  3. DE: Diese Funktion ist leider nicht verfügbar.
     EN: Sorry, this function is not available.
     IT: Spiacente, questa funzionalità non è disponibile.
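The prompt classes above amount to a small localization table mapping a (class, language) pair to a list of prompt texts. The following sketch illustrates one possible representation of such a table; it is not code from the DICIT prototype, only a few entries are shown, and all names (PROMPTS, get_prompt) are hypothetical.

```python
# Illustrative lookup table: (prompt class, language code) -> prompt texts.
# Entries are a small subset of Appendix D; structure and names are hypothetical.
PROMPTS = {
    ("WAIT", "en"): ["Your input is being processed. Please wait a moment."],
    ("WAIT", "it"): ["Per favore, attendi qualche istante."],
    ("REJECT", "de"): ["Wie bitte?", "Leider konnte ich Sie nicht verstehen."],
    ("REJECT", "en"): ["Pardon?", "Sorry, I did not understand you."],
    ("REJECT", "it"): ["Puoi ripetere?", "Scusa, non ho capito."],
}

def get_prompt(prompt_class, lang, index=0):
    """Return one prompt of the given class, falling back to English
    when the requested language has no entry for that class."""
    texts = PROMPTS.get((prompt_class, lang)) or PROMPTS.get((prompt_class, "en"), [])
    if not texts:
        return ""
    # Fall back to the first variant if the requested index does not exist.
    return texts[index if index < len(texts) else 0]

print(get_prompt("REJECT", "it"))  # -> Puoi ripetere?
```

A fallback to English, as sketched here, keeps the dialogue running even when a localized prompt is missing for one class.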
Appendix E – Screenshots of the Views

Screenshots are provided for the following views:
- WelcomeScreen
- EPG_MainMenu_View
- EPG_ChooseFilter
- EPG_ManualInput
- EPG_ResultList
- EPG_RecordingList
- News
- EPG_Confirmation
Bibliography
[1] Distant Talking Interfaces for Control of Interactive TV
“Annex I - Description of Work”
31-May-2006
[2] Cedrick Rochet, URL:
www.nist.gov/smartspace/toolChest/cmaiii/userg/Microphone_Array_Mark_III.pdf
[3] Luca Brayda, Claudio Bertotti, Luca Cristoforetti, Maurizio Omologo, and
Piergiorgio Svaizer. “Modifications on NIST MarkIII array to improve coherence
properties among input signals.”
AES, 118th Audio Engineering Society Convention. Barcelona, Spain, May, 2005.
[4] SpeechDat-Car EU-Project LE4-8334, URL: http://www.speechdat.org/SP-CAR/
[5] Luca Cristoforetti, Maurizio Omologo, Marco Matassoni, Piergiorgio Svaizer,
and Enrico Zovato. "Annotation of a multichannel noisy speech corpus."
Proc. of LREC 2000. Athens, Greece, May 2000.
[6] Transcriber, URL: http://trans.sourceforge.net/en/presentation.php
[7] Andrey Temko, Robert Malkin, Climent Nadeu, Christian Zieger, Dusan Macho,
and Maurizio Omologo. "CLEAR Evaluation of Acoustic Event Detection and
Classification Systems." CLEAR'06 Evaluation Campaign and Workshop.
Southampton, UK: Springer, 2006.
[8] Oswald Lanz. "Approximate Bayesian Multibody Tracking."
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006: 1436-1449.
[9] Fleischmann, T. (2007). Model Based HMI Specification in an Automotive Context.
In Smith, M. J. and Salvendy, G., editors, HCI (8), volume 4557 of Lecture Notes
in Computer Science, pages 31-39. Springer.
[10] Goronzy, S., Mochales, R., and Beringer, N. (2006). Developing Speech Dialogs
for Multimodal HMIs Using Finite State Machines. In 9th International Conference
on Spoken Language Processing (Interspeech), CD-ROM.
[11] ISO 9241-110:2006 : “Ergonomics of human-system interaction -- Part 110:
Dialogue principles” International Organization for Standardization, 2006.
[12] Praat, URL: http://www.praat.org/
[13] N. Beringer: “Transliteration of Spontaneous Speech for the detailed Dialog
Taskflow” DICIT technical document, 29-March-2007.
[14] N. Beringer, U. Kartal, K. Louka, F. Schiel, U. Türk. PROMISE: A Procedure for
Multimodal Interactive System Evaluation. LREC Workshop 'Multimodal
Resources and Multimodal Systems Evaluation' 2002, Las Palmas,
Gran Canaria, Spain, pp. 77-80.
[15] Salber, D. and Coutaz, J. (1993). A Wizard of Oz platform for the study of
multimodal systems. In Conference Companion on Human Factors in Computing
Systems (INTERACT and CHI), pages 95-96, New York, NY. ACM.
[16] Taib, R. and Ruiz, N. (2007). Wizard of Oz for Multimodal Interfaces Design:
Deployment Considerations. In Jacko, J. A., editor, HCI (1),
volume 4550 of Lecture Notes in Computer Science, pages 232-241. Springer.
[17] Wolfgang Herbordt. "Sound Capture for Human/Machine Interfaces".
Springer-Verlag, Berlin Heidelberg, 2005.