Deliverable 6.2
Multi-microphone data collection and WOZ
experiments for the analysis of user behaviour in
the DICIT scenarios
Authors: Lutz Marquardt, Luca Cristoforetti, Edwin
Mabande, Nicole Beringer, Fiorenza Arisio,
Matthias Bezold
Affiliations: FAU, FBK-irst, EB, Amuser
Date: 28-Apr-2008
Document Type: R
Status/Version: 1.0
Dissemination Level: PU
FP6 IST-034624 http://dicit.itc.it
D6.2 – Multi-microphone Data Collection and WOZ Experiments
DICIT_D6.2_20080428 ii
Project Reference FP6 IST-034624
Project Acronym DICIT
Project Full Title Distant-talking Interfaces for Control of Interactive TV
Dissemination Level PU
Contractual Date of Delivery 31-Mar-2007
Actual Date of Delivery Preliminary Version: 11-January-2008
Final Version: 28-April-2008
Document Number DICIT_D6.2_V1.0_20080428
Type Deliverable
Status & Version 1.0
Number of Pages 7+86
WP Contributing to the Deliverable WP6 (WP responsible: Nicole Beringer – EB)
WP Task responsible Lutz Marquardt (FAU)
Authors (Affiliation) Lutz Marquardt and Edwin Mabande (FAU), Luca Cristoforetti
(FBK-irst), Nicole Beringer and Matthias Bezold (EB), Fiorenza Arisio (Amuser)
Other Contributors Walter Kellermann (FAU), Federica Vola (Amuser)
Reviewer
EC Project Officers
Anne Bajart (till January 31st 2007), Erwin Valentini (from
February 1st till October 31st 2007), Pierre Paul Sondag
(from November 1st 2007)
Keywords: data collection, WOZ experiments, multi-microphone devices, distant-talking
speech recognition devices, voice-operated devices, Interactive TV, anti-intrusion,
surveillance.
Abstract:
The purpose of this document is to describe the multi-microphone data collection and WOZ
experiments that have been conducted under DICIT. While the first task's objective was to
provide testing data for acoustic pre-processing algorithms, the latter activity aimed at
determining user behaviour as a basis for the dialog specification.
© DICIT Consortium
Contents
Contents ..................................................................................................................................... iii
List of Figures ............................................................................................................................ v
List of Tables ............................................................................................................................ vii
Summary .................................................................................................................................... 1
Introduction ................................................................................................................................ 2
Part I. Multi-channel Data Acquisition / Acoustic WOZ ..................................................... 3
1. Experimental Setup ............................................................................................................ 3
1.1 Hardware Setup .......................................................................................................... 3
1.1.1 Microphone Arrays ............................................................................................. 4
1.1.2 General Hardware Setup .................................................................................... 5
1.2 Software Setup ........................................................................................................... 8
1.3 Recording Room ......................................................................................................... 9
2. Recording Sessions ........................................................................................................... 12
3. Room Impulse Response Measurements .......................................................................... 14
4. Data Exploitation .............................................................................................................. 15
4.1 Data Annotation ....................................................................................................... 15
4.2 Data Exploitation / Testing ....................................................................................... 17
Part II. Dialogue WOZ ......................................................................................................... 18
1. Experimental Setups and Recordings ............................................................................... 18
1.1 General Experimental Setup – The DICIT WOZ System ........................................ 18
1.2 Experimental Setup at EB ........................................................................................ 19
1.2.1 Hardware Setup ................................................................................................ 20
1.2.2 Software Setup ................................................................................................. 21
1.2.3 Recording Sessions at EB ................................................................................. 21
1.3 Experimental Setup at Amuser ................................................................................. 22
1.3.1 Hardware Setup ................................................................................................ 23
1.3.2 Software Setup ................................................................................................. 23
1.3.3 Recording Sessions at Amuser ......................................................................... 23
2. Questionnaire .................................................................................................................... 25
2.1 Statistical Questions (Questions 1-4) ....................................................................... 25
2.2 TV Habits (Questions 5-12) ..................................................................................... 27
2.3 The DICIT System (Questions 13-29) ..................................................................... 28
2.3.1 Using the DICIT System .................................................................................. 28
2.3.2 Watching the Screen ......................................................................................... 30
2.3.3 Vocal Interaction .............................................................................................. 33
2.3.4 The System Voice ............................................................................................ 35
2.4 General Opinion of the DICIT WOZ Prototype (Questions 30 and 31) .................. 37
2.4.1 Users' experiences with DICIT ........................................................ 37
2.4.2 Rating user satisfaction within DICIT .............................................................. 39
2.5 Summary .................................................................................................................. 40
3. Session Evaluation ........................................................................................................... 41
3.1 Logging Data ............................................................................................................ 41
3.1.1 Logging Data .................................................................................................... 41
3.1.2 Number of Screens and Views ......................................................................... 42
3.1.3 Screen preferences of the Users ....................................................................... 44
3.1.4 Remote Control vs. Voice Control ................................................................... 46
3.1.5 TTS Usage ........................................................................................................ 48
3.1.6 Barge-In Behavior ............................................................................................ 54
3.1.7 User Speech Time ............................................................................................ 56
3.1.8 Multi-Slot Usage .............................................................................................. 58
3.1.9 Off-Talk ............................................................................................................ 60
3.1.10 Overlaps ............................................................................................................ 61
3.2 Observation of the Wizards ...................................................................................... 63
3.2.1 People have to be encouraged to use natural language input ........................... 63
3.2.2 People tend to use simple commands ............................................................... 64
3.2.3 Some people use Barge-In, others do not ......................................................... 64
3.2.4 Reset function not self-explanatory .................................................................. 64
3.2.5 Remote Control is hardly used ......................................................................... 64
4. Conclusions for Subsequent Prototypes ........................................................................... 65
4.1 Overall Conclusions ................................................................................................. 65
4.2 Dialog and Menu Structure ...................................................................................... 65
4.3 Speech Dialog ........................................................................................................... 66
4.4 Remote Control ........................................................................................................ 66
4.5 Considerations Among the Two Samples ............................................................... 67
Appendix A – Microphone Arrays ........................................................................................... 68
Appendix B – The Questionnaire ............................................................................................. 70
Appendix C – The WOZ Instructions at EB ............................................................................ 78
Appendix D – List of Predefined TTS Prompts ....................................................................... 79
Appendix E – Screenshots of the Views .................................................................................. 82
Bibliography ............................................................................................................................. 85
List of Figures
Figure 1: Harmonic Nested Array (all distances are in cm) ....................................................... 4
Figure 2: NIST MarkIII Microphone Array ............................................................................... 5
Figure 3: FAU setup ................................................................................................................... 6
Figure 4: FBK setup ................................................................................................................... 7
Figure 5: FAU recording room setup ....................................................................................... 10
Figure 6: FBK recording room setup ....................................................................................... 10
Figure 7: FAU array setup ........................................................................................................ 11
Figure 8: FBK array setup ........................................................................................................ 12
Figure 9: Images of the FBK room .......................................................................................... 12
Figure 10: Impulse response measurement setup ..................................................................... 14
Figure 11: A transcription session using the Transcriber tool.................................................. 15
Figure 12: DICIT WOZ menu structure ................................................................................... 19
Figure 13: The WOZ setup at EB. ............................................................................................ 20
Figure 14: The WOZ setup at Amuser ..................................................................................... 22
Figure 15: German: Different screens and views. .................................................................... 42
Figure 16: English: Different screens and views. ..................................................................... 43
Figure 17: German: Screen preferences. Name; time in minutes; percentage. ........................ 44
Figure 18: German: Screen preferences of the individual subjects .......................................... 44
Figure 19: English: Screen preferences. Name; time in minutes; percentage. ......................... 45
Figure 20: English: Screen preferences of the individual subjects. ......................................... 46
Figure 21: German: Amount of voice and remote control input. ............................................. 47
Figure 22: English: Amount of voice and remote control input. .............................................. 48
Figure 23: German: Types of TTS output. ............................................................................... 49
Figure 24: Types of TTS output. .............................................................................................. 49
Figure 25: Prompt types in EPG_MainMenu_View. ............................................................... 50
Figure 26: Prompt types in EPG_ManualInput. ....................................................................... 50
Figure 27: Prompt types in EPG_ResultList. ........................................................................... 50
Figure 28: Prompt types in View. ............................................................................................ 50
Figure 29: English: Types of TTS output. ................................................................................ 52
Figure 30: English: Types of TTS output per user. .................................................................. 52
Figure 31: Italian: Types of TTS output. .................................................................................. 53
Figure 32: Italian: Types of TTS output per user. .................................................................... 53
Figure 33: German: Number of barge-ins per subject. ............................................................. 55
Figure 34: English: Number of barge-ins per subject. ............................................................. 55
Figure 35: Italian: Number of barge-ins per subject ................................................................ 56
Figure 36: German: User speech time per subject. Red line is average. .................................. 56
Figure 37: English: User speech time per subject. Red line is average. ................................... 57
Figure 38: Italian: User speech time ........................................................................................ 57
Figure 39: German: Multi-slot evaluation. ............................................................................... 59
Figure 40: English: Multi-slot evaluation. ............................................................................... 59
Figure 41: German: Off-talk. .................................................................................................... 60
Figure 42: English: Off-talk. .................................................................................................... 60
Figure 43: Italian Off-talk. ....................................................................................................... 61
Figure 44: German and English: Overlaps. .............................................................................. 62
Figure 45: Italian: Overlaps. ..................................................................................................... 63
List of Tables
Table 1: Noise event classes ..................................................................................................... 16
Table 2: Statistical questions .................................................................................................... 26
Table 3: Habits questions ......................................................................................................... 28
Table 4: General usability questions ........................................................................................ 29
Table 5: Screen feedback questions ......................................................................................... 31
Table 6: Vocal mode questions ................................................................................................ 34
Table 7: Listening to the system voice ..................................................................................... 36
Table 8: User general opinion .................................................................................................. 38
Table 9: Semantic differential .................................................................................................. 39
Table 10: Logging data created by GUIDE. ............................................................................. 42
Table 11: Number of different views divided by number of different screens. ....................... 43
Table 12: Prompt types per view. ............................................................................................. 50
Table 13: Spatial aliasing limits of sub-arrays ......................................................................... 68
Summary
Extensive Wizard of Oz (WOZ) experiments for the interactive TV scenario have been carried
out and evaluated. The WOZ experiments were carried out in German, English and Italian.
The acoustic WOZ experiments were carried out at FAU and FBK. They involved the
acquisition of multi-channel data for the signal front-end, motivated by the need for a
database for testing acoustic pre-processing algorithms. Besides the user inputs, the database
also contains non-speech acoustic events, room impulse responses and video data.
The dialogue WOZ experiments were carried out at EB and Amuser in order to obtain
sufficient data for characterizing user behavior, vocabulary, language, etc. The data provides
a basis for the specification of the dialog model of the DICIT prototypes. The general
impression of the users is that the WOZ prototype is easy to use, efficient, original, capable
and well-organized.
Introduction
The work conducted during the first project year of DICIT with respect to WP6 consisted of
the tasks T6.1 “Market study and user expectations for system and interface design”,
T6.2 “Data collections: multi-channel data acquisition for signal front-end” and T6.3 “WOZ
for interactive TV scenario and study of user behavior”. While the description and discussion
of T6.1 were addressed in Deliverable D6.1, this document focuses on the latter tasks.
The intention of tasks T6.2 and T6.3 according to the DICIT Technical Annex is the
“collection of multi-microphone data and Wizard-of-OZ data” respectively [1]. “This data
will allow characterizing user behavior, vocabulary, and language, etc., and other information
which is necessary to conduct part of the activities scheduled in WP3, WP4, and WP5.”
In a Wizard-of-OZ (WOZ) experiment, a subject is requested to complete specific tasks using
an artificial system. The user is told that the system is fully functional and should try to use it
in an intuitive way, while the system is operated by a person not visible to the subject. The
operating person – called wizard – can react to user input in a more comprehensive way than
any system could, because he/she is not confined by computer logic. From a WOZ study,
interaction patterns can be extracted and applied to an actual prototype.
Due to the need to carry out WOZ experiments on the one hand and to create a database by
means of multi-channel data acquisition on the other, the tasks T6.2 and T6.3 were combined.
In this regard, both dialogue WOZ and acoustic WOZ setups and task flows were created to
meet the respective requirements. The former aimed solely at analyzing the user behavior in
the foreseen DICIT TV scenario, thus enabling the tailoring of the dialogue design to the user
requirements. The latter, however, consisted of a different WOZ environment, focusing not on
behavior analysis from the dialogue point of view, but on the need to create realistic usage
scenarios for acoustic pre-processing purposes.
In the following, Part 1 of this document describes the “multi-channel data acquisition”
related to both acoustic WOZ and impulse response measurement addressed by FAU and
FBK, whereas the dialogue WOZ, which was conducted by EB (formerly Elektrobit) and
Amuser, is described in Part 2.
Part I. Multi-channel Data Acquisition / Acoustic WOZ
For the design of an acoustic front-end for the DICIT prototypes, the chosen multi-channel
approach allows for the exploitation of the sources' spatial distribution. Array signal
processing algorithms, such as beamforming, blind source separation and acoustic source
localization, make use of an array – a group of microphones – to extract information from a
wave field. They are therefore well suited to the challenge of developing distant-talking
speech interfaces. In order to meet these requirements, a microphone array has been
implemented for DICIT, which will be introduced in Section 1.1.1.
The main objective addressed by the task “Data collections: multi-channel data acquisition for
signal front-end” was to collect a database for testing acoustic pre-processing algorithms.
Thus, realistic scenarios can be simulated, which avoids the need for real-time
implementations at a preliminary stage and allows for repeatable experiments. Section 1
gives a description of the hardware and software setup employed by FAU and FBK, as well
as of the respective recording environments for the data acquisition.
Moreover, simulations that are produced from the combination of WOZ experiments with
multi-channel data acquisition also include hard-to-handle acoustic situations that only arise
from or become obvious in real-life scenarios. The task flow of the acoustic WOZ recording
sessions is presented in Section 2.
Measured room impulse responses may be used for off-line testing of both acoustic pre-
processing and speech recognition algorithms, enabling the artificial creation of simulation
data out of clean speech signals. The corresponding measurements are described in Section 3.
Section 4 finally reports on the annotation of the recorded WOZ data, which is necessary to
allow its further exploitation for speech recognition, event detection and speaker localization.
1. Experimental Setup
1.1 Hardware Setup
For multi-channel microphone acquisition the nested array (which will be further described in
Section 1.1.1) was chosen as an adequate and flexible means to meet the requirements of the
DICIT scenarios.
In order to create a testing database for acoustic pre-processing, the main objective of task
T6.2 was therefore to collect synchronized data from the nested array. In addition, further
microphone signals as well as the TV-loudspeaker signals had to be recorded
synchronously, e.g. for reference purposes. Cameras were installed to deliver further visual
reference information. The choice and acquisition of the respective hardware and the
construction of the nested array was thus established as the basis for further work.
Since the setup had to be installed for the mandatory recordings mentioned above anyway, it
was decided to carry out additional parallel recordings with the 64-channel MarkIII array
developed at NIST, which is described in more detail in Section 1.1.1. Thus it was possible
to collect more data for later testing and comparison purposes with little extra effort.
A reduced version of the same setup could be used for the acquisition of room impulse
responses.
1.1.1 Microphone Arrays
This subsection describes the two microphone arrays that were used for the acoustic WOZ
experiments.
Harmonically Nested Array
The nested microphone array depicted in Figure 1 consists of 13 linearly placed electret
microphones plus two vertically placed electret microphones.
Figure 1: Harmonic Nested Array (all distances are in cm)
It forms four linear sub-arrays, three of which consist of five microphones and one which
consists of seven microphones. The nested array allows for the exploitation of different sub-
arrays in order to meet the requirements of each of the different acoustic pre-processing
modules in terms of inter-microphone spacings (see Appendix A for further explanation).
NIST MarkIII Array
In the acoustic WOZ setup another linear microphone array, a modified NIST Microphone
Array MarkIII depicted in Figure 2, was also used [2].
The MarkIII is composed of 64 uniformly-spaced microphones, specifically developed for far-
field voice recognition, speaker localization and audio processing. It records synchronous data
at a sampling rate of 44.1 kHz or 22.05 kHz with a precision of 24 bits. The particularities of
this array are its modularity, the digitalization stage and the data transmission via an Ethernet
channel using the TCP/IP protocol. For further information please refer to Appendix A.
Figure 2: NIST MarkIII Microphone Array
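The MarkIII delivers its 24-bit samples as a continuous multi-channel stream over the network. The exact packet layout is not reproduced here; assuming raw interleaved little-endian signed 24-bit PCM frames, de-interleaving such a stream could be sketched as follows (deinterleave_24bit is a hypothetical helper, not part of the NIST software):

```python
def deinterleave_24bit(raw: bytes, n_channels: int) -> list[list[int]]:
    """Unpack interleaved little-endian signed 24-bit PCM bytes into
    one list of integer samples per channel (assumed stream layout)."""
    frame_size = 3 * n_channels
    usable = len(raw) - (len(raw) % frame_size)  # drop any trailing partial frame
    channels: list[list[int]] = [[] for _ in range(n_channels)]
    for off in range(0, usable, 3):
        b0, b1, b2 = raw[off], raw[off + 1], raw[off + 2]
        value = b0 | (b1 << 8) | (b2 << 16)
        if value & 0x800000:          # sign-extend from 24 bits
            value -= 1 << 24
        channels[(off // 3) % n_channels].append(value)
    return channels
```

In practice the bytes would be read from a TCP socket connected to the array, as described above.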
1.1.2 General Hardware Setup
The following description explains the hardware setups installed by FAU and FBK that were
employed to address the acoustic WOZ experiments.
To enable the simulation of the DICIT system for WOZ purposes via EB GUIDE Studio 2.80
(which will be further described in Section 1.2), as well as the parallel recording of 26
loudspeaker and microphone channels plus 64 additional microphone channels from the
MarkIII array, three PCs had to be employed. Due to the high data rates involved, two of the
three PCs had to feature high processing power in order to avoid data loss.
FAU Setup
A block diagram of the hardware setup used at FAU is depicted in Figure 3. In connection
with the nested array (equipped with Panasonic WM60-AT microphones), the audio
acquisition at FAU was facilitated by a Linux PC with a Dual Xeon 1.7 GHz processor (PC1)
utilizing the software “ecasound” which will be described in Section 1.2. Additionally, two
extra microphones mounted on the nested array (Panasonic WM60-AT), a table-microphone
(Shure MX 391/0), two lateral microphones (AKG SE 300 B), four close-talk microphones
(Shure WH20) as well as the stereo TV loudspeaker signals were synchronously recorded
with the 15 nested array microphones. For the connectors of the close-talk microphones three
XLR (Shure WH20XLR) and one Tini QG connector (Shure WH20TQG) were chosen, the
latter one enabling signal transmission via a wireless system (Shure PG14E R10) and thus
allowing more freedom of movement for the bearer. (It should be mentioned that the table
microphone signal was split by the preamplifier: apart from its optical transmission to PC1
via ADAT, the analogue output was routed to headphones to be monitored by the wizard.)
A virtual “multi”-device consisting of two synchronized RME HDSP 9652 multi-channel
soundcards acquired the nested array data via three ADAT ports as well as the remaining
audio signals listed above via another two ADAT ports. During each of the recording sessions
approximately 9 gigabytes (GB) of audio data were recorded by PC1. The nested array
microphone signals were processed by a FAU-constructed “Mic24ADAT”-device, integrating
microphone power supply, AD-conversion, pre-amplification and conversion to an optical
data stream. Optical data is transmitted from the “Mic24ADAT” directly to PC1 via three
TOSLINK cables (ADAT), thus allowing for a maximum of 24 separate channels. Remaining
microphone and loudspeaker signals were digitized and pre-amplified by means of a Presonus
Digimax and transmitted via two TOSLINK cables (ADAT) to PC1. 26 channels were used in
total. The Presonus Digimax served as master for synchronizing all devices related to the PC1
recordings to a 48 kHz clock – slaves drew their clock signal via Word Clock or ADAT.
A NIST-developed software was used together with the MarkIII-array. The array was
connected via cross LAN cable to the network adapter of a Linux PC equipped with a Dual
Xeon 2.67 GHz processor (PC2). Approximately 10 GB of audio data was recorded by PC2 in
connection with each recording session.
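As a rough sanity check, the per-session data volumes quoted above follow from the recording formats stated in the text. The session lengths below are inferred from the volumes, not stated in the document, and the MarkIII figure would halve at its optional 22.05 kHz mode:

```python
# Sanity check of per-session data volumes from the stated formats.

def rate_bytes_per_s(channels: int, sample_rate_hz: int, bits: int) -> int:
    """Raw PCM data rate in bytes per second."""
    return channels * sample_rate_hz * bits // 8

# PC1: 26 channels at 48 kHz, 32-bit (ecasound via HDSP 9652)
pc1 = rate_bytes_per_s(26, 48_000, 32)    # 4,992,000 B/s
# PC2: 64 channels at 44.1 kHz, 24-bit (MarkIII)
pc2 = rate_bytes_per_s(64, 44_100, 24)    # 8,467,200 B/s

gb = 1_000_000_000
print(f"PC1: {pc1 / 1e6:.2f} MB/s -> 9 GB corresponds to ~{9 * gb / pc1 / 60:.0f} min")
print(f"PC2: {pc2 / 1e6:.2f} MB/s -> 10 GB corresponds to ~{10 * gb / pc2 / 60:.0f} min")
```

Both figures point to session lengths on the order of tens of minutes, consistent with extended WOZ recordings.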
PC3 running under Windows XP was equipped with a Dual Xeon 1.7 GHz processor and 768
MB RAM. A graphic card with multi-display technology (AGP Matrox Millenium G550)
enabled the connection of two graphic devices (beamer and wizard monitor) provided with
independent signals. TV contents were transmitted by two loudspeakers and a beamer – the
respective audio signals were split by the Presonus Digimax and synchronously recorded
together with the nested array as already mentioned above. Additionally, a remote control had
to be integrated into the system – an IR receiver in the recording room was connected via
cable to the serial port of the PC, which in turn was monitored by “WinLIRC” (see Section
1.2). One video camera was employed to provide visual reference and location information.
Figure 3: FAU setup
FBK Setup
At FBK a similar setup to that of FAU was used as depicted in Figure 4. PC1 recorded 15
channels from the nested array plus nine more channels. Four close-talk microphones
(Countryman E6DW5) were used to record the user inputs. Two of them were connected to a
wireless system (CHIAYO QR-4000U, UDR-1000M, UB-2000) while the other two used
regular wires. The two lateral microphones and the table microphone were omnidirectional
boundary layer microphones (Shure Microflex MX391/O). The last two channels carried the
stereo signals of the clips, recorded directly from the audio board of the wizard PC3. The table
microphone was monitored by the wizard to hear what was happening in the room.
All the signals were recorded using three RME OctaMic II microphone preamplifiers with
integrated A/D converters, connected via three TOSLINK cables using the ADAT protocol to
an RME HDSP 9652 digital board installed in PC1. Sample synchronization of all the
OctaMics was guaranteed via a BNC cable connected to the word clock input. Data was
recorded at 48 kHz with 16-bit quantization. The setups of PC2 and PC3 were similar to those
at FAU and therefore do not warrant further description. Three video cameras were employed
to provide visual reference and location information.
Figure 4: FBK setup
1.2 Software Setup
Recording software
As already noted above, the recordings had to cover long sessions at high sampling rates, with
a variety of microphone and loudspeaker signals to be acquired. In order to deliver usable
data for acoustic pre-processing purposes, both acquisition tools had to guarantee lossless and
synchronized recordings of these signals.
The hard-disk recording tool "ecasound" was employed to record the 26 channels
synchronously (this refers to the FAU recordings; the setup at FBK differs minimally). All the
signals were aligned at sample level. These 26 channels were acquired via five ADAT
connections of the two RME HDSP 9652 multi-channel soundcards mounted in PC1.
The signals were recorded into a single 26-channel wav file at 48 kHz sampling rate and 32-
bit resolution. The latter was dictated by the soundcards, but it also allows more flexibility
than recording directly with 16-bit precision: an amplification according to the actual
maximum recording level, followed by a 32-to-16 bit conversion, remains possible. The single
26-channel wav file was subsequently separated into 26 single-channel wav files, and the
32-to-16 bit conversion was carried out.
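The amplification-plus-conversion step described above can be sketched as follows. This is an illustrative reimplementation, not the tool actually used in the project, and the function name is our own:

```python
from array import array

def convert_32_to_16(samples):
    """Amplify 32-bit PCM samples so that the channel's actual maximum
    level reaches full scale, then keep the 16 most significant bits."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return array("h", [0] * len(samples))
    gain = (2 ** 31 - 1) / peak          # normalize to 32-bit full scale
    out = array("h")
    for s in samples:
        v = int(s * gain) >> 16          # 32-bit range -> 16-bit range
        out.append(max(-32768, min(32767, v)))
    return out
```

Normalizing before truncation preserves more of the available dynamic range than recording at 16 bits directly, which is the flexibility argument made above.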
(The soundcard tools "hdspmixer" and "hdspconf" were used for monitoring and for
synchronization configuration: setting the soundcard as slave and acquiring the clock from the
ADAT input, i.e. from the A/D converter.)
The NIST MarkIII array came with utilities to record data to the hard disk. A
command-line program listened on the network card connected to the array and stored the
incoming data stream in a single file. The file contained all 64 interleaved channels at
44.1 kHz with 24-bit resolution. A custom-written program was then used to extract the single
channels and convert them to 16 bits.
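The extraction and conversion amount to de-interleaving the stream and truncating each sample. A minimal sketch follows; the big-endian byte order is an assumption here, and this is not the project's actual custom program:

```python
NUM_CHANNELS = 64  # the MarkIII stream interleaves 64 channels

def extract_channel(raw, channel):
    """Pull one channel out of an interleaved 24-bit PCM byte stream
    and reduce each sample to 16 bits (big-endian byte order assumed)."""
    frame_size = 3 * NUM_CHANNELS            # 3 bytes per sample per channel
    samples = []
    for off in range(channel * 3, len(raw) - 2, frame_size):
        v = int.from_bytes(raw[off:off + 3], "big", signed=True)
        samples.append(v >> 8)               # keep the 16 most significant bits
    return samples
```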
EB GUIDE Studio
EB GUIDE Studio, developed by EB, is an easy-to-use Human Machine Interface (HMI)
development tool that allows the user to specify, simulate, and generate powerful User
Interfaces (UIs). It supports the design of multimodal UIs combining graphical, haptic, and
speech dialogue components, without restrictions on the number or kind of displays or any
other aspect of complexity.
A version of EB GUIDE Studio tailored to the acoustic WOZ, provided by EB and running on
PC3, enabled the WOZ simulation of the DICIT TV scenario. TV content was shown by means
of a beamer displaying six country-specific avi files of half an hour duration each, which had
been pre-recorded from a TV using a digital satellite receiver (Dreambox DM7025).
Additionally, a selection of several teletext pages was available. While TV content
including overlays was transmitted to the beamer, the control interface for the "wizard" was
shown on the respective monitor. TV stereo output, including any generated speech
output, was transmitted to the preamp (which split it up for loudspeaker playback and recording).
The control interface allowed the wizard to react to the test persons' commands. Reactions
included the generation of text outputs (sometimes connected to a text-to-speech engine) and
changes of channel, volume, and teletext page, depending on the current state of the system
(e.g. registration phase, TV transmission). The table-microphone signal, which was recorded
on PC2, was also used to transmit commands to the wizard.
As indicated above, "WinLIRC" was employed, after having been trained properly, to decode
the remote control commands and provide them to the GUIDE software. WinLIRC is free
software for Windows that enables the reception of infrared signals through an optical device
connected to the serial port of the PC. The receiving device was installed in the recording
room and connected via a serial cable to the wizard PC. EB GUIDE Studio then interfaced
with WinLIRC to receive the codes of the buttons pressed on the real remote control.
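WinLIRC broadcasts the decoded button codes to clients over a TCP socket, which is how a program such as EB GUIDE Studio can receive them. A client sketch is shown below; the default port (8765) and the line format ("code repeat button remote", with hexadecimal code and repeat count) follow the LIRC convention and should be treated as assumptions here:

```python
import socket

def parse_lirc_line(line):
    """Split one WinLIRC broadcast line into its fields."""
    code, repeat, button, remote = line.strip().split(None, 3)
    return {"code": code, "repeat": int(repeat, 16),
            "button": button, "remote": remote}

def listen_for_buttons(host="localhost", port=8765):
    """Yield decoded button events from a running WinLIRC server."""
    with socket.create_connection((host, port)) as sock:
        buf = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                return
            buf += chunk
            while b"\n" in buf:
                line, buf = buf.split(b"\n", 1)
                if line.strip():
                    yield parse_lirc_line(line.decode("ascii"))
```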
1.3 Recording Room
The television was simulated by means of a video beamer, projecting its output on a wall, and
two high-quality loudspeakers placed on the sides of the screen. The participants sat on four
seats, positioned at a fixed distance from the wall. The 15-element harmonic microphone
array shown in Figure 1 was located next to the screen and represented the acoustic setup that
the DICIT consortium intends to exploit. As already stated above, for comparison purposes
the sessions were also recorded by a NIST Mark III array, which was placed next to the
harmonic array. The table microphone was placed between the arrays and the users and was
meant to simulate a remote control equipped with a microphone. The lateral microphones will
be exploited only for experimental analyses. Finally, each participant was also recorded by a
close-talk microphone, whose signals were used to guarantee robust segmentation and
accurate transcriptions.
At FAU, a single video camera was employed to record the sessions; in this respect, equally
distributed positions were marked on the floor for the speakers, to serve as reference
information for source localization testing. The recording room at FBK was furnished with
three video cameras: one placed on the ceiling and the other two in the upper left-hand and
right-hand corners of the room. Video data were used both to monitor the experiments during
the annotation process and to derive 3D reference positions for each participant. Note that
video and audio signals were manually aligned by exploiting impulsive events present in the
recordings, such as a door slam.
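The manual alignment exploits the fact that an impulsive event such as a door slam stands out as the strongest transient in both recordings. A toy version of the idea (our own sketch, not the procedure actually used):

```python
def impulse_position(signal):
    """Sample index of the strongest transient, e.g. a door slam."""
    return max(range(len(signal)), key=lambda i: abs(signal[i]))

def alignment_offset(reference, other):
    """Offset (in samples) that aligns two recordings of the same
    impulsive event; positive means `other` lags behind `reference`."""
    return impulse_position(other) - impulse_position(reference)
```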
The exact room dimensions and positions of microphones as well as further equipment are
depicted in Figure 5 for the FAU setup and in Figure 6 for the FBK setup.
Figure 5: FAU recording room setup
Figure 6: FBK recording room setup
Figure 7 and Figure 8 show images of the array setups actually used at FAU and FBK,
respectively. Figure 9 shows two images taken from the video cameras at FBK; the active user
can be seen moving in the room and giving voice commands to the system.
Figure 7: FAU array setup
Figure 8: FBK array setup
Figure 9: Images of the FBK room
2. Recording Sessions
Six acoustic WOZ sessions, each of about 30 minutes, were recorded at both FAU (German)
and FBK (Italian), including one English session at FAU. Besides the wizard, four persons
participated in each recording session: three subjects, male and female, and a co-wizard, who
had to ensure that the correct test procedure, described below, was followed. It should be
noted that during certain parts of the experiments the test persons were encouraged to behave
naturally and vividly in order to create barge-in situations, overlapping system commands,
and background noise.
After having been introduced to the general procedure, the four participants entered the
recording room, sat down on their respective seats in front of the arrays, and adjusted their
close-talk microphones. Meanwhile, the "wizard" started the recordings in a separate
monitoring room and, for the rest of the session, listened in on the microphone signals from
the recording room in order to react to the commands uttered by the users.
A set of phonetically rich sentences 1 – taken from the SpeechDat-Car EU project [4] for the
Italian sessions, from "Der Nordwind und die Sonne" for the German sessions, and from the
TIMIT database for the English recording – was read out by each of the participants.
Afterwards, the TV was "switched on" via a voice command by the co-wizard (i.e. the wizard
reacted to the command of the co-wizard).
Next, the participants registered themselves with the DICIT system; initially they had to use
only the remote control in order to switch channels, adjust the volume, etc. After that, the
users were allowed to control the system with both the remote control and voice commands.
After some time to get acquainted with this new kind of TV usage, the subjects were asked to
find specific pages in the teletext via voice commands while walking about in the room; this
movement was specifically intended for later testing of the source localization algorithms. At
the same time, the co-wizard had to produce several noises for later event classification
purposes. These noises included a chair being moved, falling objects (a bottle of water and a
heavy book), laughter, coughing, paper rustling, various phone rings, and door slams (further
details are provided in Section 4.1).
The test subjects had to fill in pre- and post-questionnaires before and after the experiments,
respectively. The former addressed general statistical issues, including dialect and technical
background, whereas the latter focused directly on feedback on the experiments.
Analysis of the questionnaires showed that the subjects were mainly young researchers or
students, skilled in the use of PCs and open to new technologies. They got a good impression
of the DICIT system and considered it useful for controlling the TV and especially for
navigating the teletext.
From their opinions it emerged that the system should offer good language flexibility and
should be fast enough to avoid annoyance. Some recognition errors were tolerated and did not
represent a big distraction.
Results from these questionnaires will be taken into account for the development of the final
prototype.
1 These sentences comprise a quasi-balanced combination of all phonemes of the language in question, leaving
out all combinations that are invalid for that language.
3. Room Impulse Response Measurements
Room impulse response measurements were carried out in order to provide data that could
later be exploited for purposes such as speech contamination. At FAU, the measurements were
made in the same room used for the WOZ experiments, utilizing a Maximum Length
Sequence (MLS). A single loudspeaker played back the MLS while the 15-channel DICIT
array and five separate microphones simultaneously recorded the output. The loudspeaker was
moved to different positions within the room and the measurements were repeated. Figure 10
depicts the loudspeaker positions, the array positions, and the single-microphone positions
within the room.
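An MLS is generated with a linear feedback shift register, and the impulse response is recovered by circularly cross-correlating the recorded signal with the excitation, since an MLS has an almost ideal two-valued autocorrelation. A minimal sketch, where the register length and tap positions are illustrative (any primitive polynomial works):

```python
def mls(register_length, taps):
    """Generate one period of a maximum length sequence (MLS) with a
    linear feedback shift register; output values are +1/-1.
    `taps` are the feedback bit positions; they must correspond to a
    primitive polynomial, e.g. mls(4, (4, 3)) for x^4 + x^3 + 1."""
    state = [1] * register_length
    seq = []
    for _ in range(2 ** register_length - 1):
        seq.append(1 if state[-1] else -1)
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]
    return seq

def circular_xcorr(x, y):
    """Circular cross-correlation; with an MLS excitation this recovers
    a scaled estimate of the room impulse response."""
    n = len(x)
    return [sum(x[i] * y[(i + lag) % n] for i in range(n)) for lag in range(n)]
```

The autocorrelation of a length-15 MLS is 15 at lag zero and -1 everywhere else, which is what makes the cross-correlation approach work.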
Figure 10: Impulse response measurement setup
At FBK, impulse responses were measured in the WOZ room using a chirp sequence, played
back by a loudspeaker placed successively on each of the seats that had been occupied by the
subjects during the WOZ experiments (the positions are shown in Figure 6). The two
microphone arrays recorded the output for each of the four positions to be investigated.
4. Data Exploitation
This section describes the exploitation of the data from the acoustic WOZ experiments.
4.1 Data Annotation
In order to be usable for later algorithm testing and speech recognition, the six FBK sessions
collected within the acoustic WOZ have been transcribed and segmented at word level, also
introducing specific labels for acoustic events.
An annotation guideline, based on previous experience, was written in order to ensure as
much consistency as possible between different annotators [5]. Data were annotated using
"Transcriber", a free annotation tool that permits a multi-channel view [6]. To ease the effort
of understanding the dialogues between users and the system, stereo audio files were created
with the table-microphone signal on the left channel and the sum of the close-talk
microphones on the right channel. In this way, the annotators could selectively listen to either
the environmental noises or the uttered sentences.
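Building these stereo review files amounts to interleaving the table-microphone channel with a clipped sum of the close-talk channels. A minimal sketch, with function and parameter names of our own choosing:

```python
def make_annotation_stereo(table_mic, close_talk_channels):
    """Build interleaved stereo frames: left = table microphone,
    right = sum of the close-talk channels, clipped to the 16-bit range."""
    frames = []
    for i, left in enumerate(table_mic):
        right = sum(ch[i] for ch in close_talk_channels)
        frames.append(left)
        frames.append(max(-32768, min(32767, right)))
    return frames
```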
Figure 11: A transcription session using the Transcriber tool
Annotators were provided with a preliminary automatic segmentation based on the energy of
the close-talk signals. Even though it was not fully reliable, due to cross-talk effects and non-
speech human sounds, this segmentation turned out to be a very useful starting point. It was
also possible to visualize the automatic segmentation for each speaker, which helped in
understanding which user was speaking or producing a noise. Markers were inherited from
the automatic segmentation and adjusted manually in order to leave some silence before and
after the respective event.
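A rough energy-based pre-segmentation of a close-talk channel can be sketched as follows; the frame length and threshold are illustrative, as the actual tool and its parameters are not described here:

```python
def energy_segments(signal, frame_len, threshold):
    """Rough energy-based activity segmentation: return (start, end)
    sample ranges of consecutive frames whose mean energy exceeds
    `threshold`. Frame length and threshold are illustrative values."""
    segments = []
    active = None
    n_frames = len(signal) // frame_len
    for f in range(n_frames):
        frame = signal[f * frame_len:(f + 1) * frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > threshold:
            if active is None:
                active = f * frame_len        # segment starts here
        elif active is not None:
            segments.append((active, f * frame_len))
            active = None
    if active is not None:
        segments.append((active, n_frames * frame_len))
    return segments
```

As the text notes, such a segmentation is confused by cross-talk and non-speech sounds, which is why the markers were subsequently adjusted by hand.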
Only three speakers per session were annotated: the fourth speaker was always the co-wizard
and, even though he/she actively used the system, we decided not to annotate his/her speech.
The annotation information comprises the name (ID) of the speaker, the transcription of the
utterance, and any noise included in the acoustic event list. Annotators were instructed to
mark those sentences that were personal comments and were not intended for the system.
Figure 11 shows the annotation of a session: uttered speech is annotated with the speaker ID,
along with noise symbols. Seven classes of noises were identified and annotated with square
brackets (e.g., [pap] standing for paper rustling). Two further classes were created to label
speaker-produced or unknown noises. The noises and their associated labels are described in
Table 1.
Label Acoustic Event
[sla] door slamming
[cha] chair moving
[pho] phone ringing (various rings)
[cou] cough
[lau] laugh
[fal] object falling down (water bottle, book)
[pap] paper rustling (newspaper, magazine)
[spk] noises from speaker mouth
[unk] other unknown noises
Table 1: Noise event classes
The above mentioned events were a subset of the ones exploited in previous data collections
conducted under the CHIL EU project [7].
The temporal extension of the different noise events was identified using a particular
convention to disambiguate between impulsive and prolonged events. In the lower part of
Figure 11 the activities of the different speakers can be seen, e.g. speaker_1 uttering a
sentence while speaker_4 is folding some paper.
As for the video data, a set of 3D coordinates for the head of each participant was created
with a video tracker based on a generative approach [8]. Given the 3D labels, a reference was
derived for each session, which includes the ID of the active speaker, his/her coordinates, and
some information about the presence of noises. The reference file was obtained as a
combination of the raw 3D labels generated by the video tracker and the manual acoustic
annotation, at a rate of 5 labels per second.
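The combination step can be sketched as follows: for every 200 ms time step, the active speaker is taken from the acoustic annotation and paired with the nearest-in-time tracker position. The data layout below is a plausible sketch of our own, not the project's actual file format:

```python
def build_reference(video_labels, segments, rate=5.0):
    """Combine 3D head positions from the video tracker with the acoustic
    annotation into one reference entry per 1/rate seconds.

    video_labels: {speaker_id: [(t, x, y, z), ...]}  tracker output
    segments:     [(start_s, end_s, speaker_id, noise_or_None), ...]
    """
    end_time = max(end for _, end, _, _ in segments)
    reference = []
    t = 0.0
    while t <= end_time:
        entry = {"time": round(t, 2), "speaker": None, "pos": None, "noise": None}
        for start, end, spk, noise in segments:
            if start <= t < end:
                entry["speaker"] = spk
                entry["noise"] = noise
                labels = video_labels.get(spk, [])
                if labels:
                    # nearest-in-time tracker label for this speaker
                    _t, x, y, z = min(labels, key=lambda l: abs(l[0] - t))
                    entry["pos"] = (x, y, z)
                break
        reference.append(entry)
        t += 1.0 / rate
    return reference
```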
4.2 Data Exploitation / Testing
The data collected during the WOZ experiments have been exploited for a preliminary
evaluation of the FBK algorithms.
The main goal of the evaluation was to understand the peculiarities of the DICIT scenario and
to verify their influence on localization techniques, in order to handle them correctly in the
development of the first prototype. For instance, we observed that user sentences were usually
very short and that silence was predominant. The basic metric used to evaluate source
localization (SLoc) methods is the Euclidean distance between estimated and reference
coordinates. Given this metric, the evaluation of a SLoc algorithm is carried out in terms of
localization rate, RMSE, fine RMSE, bias, and angular RMSE (refer to D3.1 for further
details on the results).
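Given paired estimated and reference 3D coordinates, most of the listed metrics can be computed as sketched below (angular RMSE is omitted). The 0.5 m threshold separating "fine" errors, and hence the localization rate, is an assumption of this sketch, not the value used in the project:

```python
import math

def sloc_metrics(estimates, references, fine_threshold=0.5):
    """Localization rate, RMSE, fine RMSE, and per-axis bias from paired
    3D coordinate estimates and references (threshold is illustrative)."""
    errs = [math.dist(e, r) for e, r in zip(estimates, references)]
    rmse = math.sqrt(sum(d * d for d in errs) / len(errs))
    fine = [d for d in errs if d <= fine_threshold]   # "fine" errors only
    fine_rmse = math.sqrt(sum(d * d for d in fine) / len(fine)) if fine else None
    loc_rate = len(fine) / len(errs)
    # bias: mean signed error per coordinate axis
    bias = tuple(sum(e[i] - r[i] for e, r in zip(estimates, references)) / len(errs)
                 for i in range(3))
    return {"rate": loc_rate, "rmse": rmse, "fine_rmse": fine_rmse, "bias": bias}
```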
The WOZ data were used to test the speaker verification and identification system: the system
was applied to the signals of the close-talk microphones, to the single central microphone of
the array, and to the beamformer output, using matched model conditions and different
amounts of training material. The results showed that beamforming improves system
performance compared to the single-microphone case, although the results are still inferior to
the close-talk microphone case. The WOZ data were also exploited to test the acoustic event
detection system. The test data were composed of 682 speech segments and 108 non-speech
segments, extracted from the continuous audio stream by exploiting the manual annotation.
The results are promising and highlight that the most confusable events are speech, cough,
and laugh (refer to D4.1 for further details).
Part II. Dialogue WOZ
A Wizard of Oz (WOZ) study was conducted in order to obtain a basis for the specification of
the dialog model of the prototypes of the DICIT system. The focus of this document is on the
electronic program guide (EPG) setup, which was used to determine how users select
broadcasts from an EPG database using a set of filter criteria by means of voice input. Also, a
screen layout and navigation scheme was evaluated.
The aim of this study was to determine how users operate an EPG system by voice control.
The WOZ system could understand all voice input and handle it accordingly, either by
performing the requested action or by replying with an error message.
The WOZ experiments were conducted at EB, Erlangen, Germany, and at Amuser (formerly
Citec Voice), Torino, Italy. 20 sessions were performed at each site, in German and Italian
respectively, involving one adult person per session. Moreover, a small number (4) of English
sessions was performed, with one subject speaking with an English, Scottish, Irish, and
American accent, respectively. At EB, the recordings were all performed between
05-May-2007 and 13-June-2007; at Amuser, from 30-May-2007 until 07-June-2007.
In this document, the WOZ experiments in German, English, and Italian are evaluated. To
indicate which language is currently being discussed, the respective parts are marked with the
corresponding country flag:
Flag Recording
German at EB, Erlangen, Germany
English at EB, Erlangen, Germany
Italian at Amuser, Torino, Italy
1. Experimental Setups and Recordings
1.1 General Experimental Setup – The DICIT WOZ System
The DICIT WOZ system is a set-top box (STB) system with an electronic program guide (EPG). Figure
12 shows the menu structure of the specified dialogue system. Users can browse the EPG data
by defining a set of filters (channel, time, day of the week, actor, subject, and title) and then
browse the results produced using these filters. Elements from the result list can be put into a
recording list. This list can also be browsed and elements can again be removed from it.
Moreover, there is a TV mode where users can watch a prerecorded set of movies (6 channels)
and use a simple teletext function. Screenshots of the views can be seen in Appendix E.
Figure 12: DICIT WOZ menu structure
Although [15] suggests having a separate wizard operator for each modality in
multimodal systems, this was not necessary for this setup: instead of having the wizard
handle every kind of user input, the remote control interaction was completely implemented
in the system, and the wizard only had to handle speech input by the user. The fraction of the
system implemented in WOZ studies is discussed in [16].
1.2 Experimental Setup at EB
At EB, two rooms were used in the experimental setup (see Figure 13): the wizard room and
the test person room. The test person room was furnished like a living room with a couch, so
that the test person could feel as if he or she was watching TV in a private environment. The
wizard room was wired up to the test person room via direct lines for keyboard, mouse, and
screen; the wizard could operate the DICIT WOZ system from there.
Each session was introduced with a short description of the experiment. The session itself was
divided into three tasks to be solved with the DICIT system. After the recording of the session,
the test persons were asked to fill out a questionnaire.
All sessions were recorded with a high-quality close-talk microphone (headset) and a distant
reference microphone of the same type as the one to be used by the acoustic front-end of the
DICIT system under development. Additionally, videos of the sessions were recorded for
reference, and the video was also made available to the wizard during the recordings.
(Figure 12 node labels: Welcome screen; Main menu [EPG_MainMenu_View]; EPG mode:
Select criterion [EPG_ChooseFilter], Manual input [EPG_ManualInput], Result list
[EPG_ResultList], Confirmation [EPG_Confirmation], Recording list [EPG_RecordingList];
TV mode: TV [View]; Teletext mode: Teletext [News].)
Figure 13: The WOZ setup at EB.
1.2.1 Hardware Setup
In this section, the hardware setup at EB is discussed (see Figure 13). One PC was used to run
both the simulation and the recording. The camera and the microphones were directly
connected to the PC or the sound card, as were the loudspeakers. The wizard screen,
keyboard, and mouse were directly connected to the PC in the next room using extension
cables (PS/2 for mouse and keyboard, VGA for the screen).
The following hardware was used:
- PC: Dual-Core P4, 2GB of RAM, 2x 100GB hard disk
- Screen: hp 2035 20” (first part of the sessions), Belinea 2225 A1W 22” widescreen
(second part of the sessions)
- Microphones:
o Shure MX391/O (room microphone)
o Sennheiser ME3 (head set)
- Camera: Logitech QuickCam Pro 4000
- Sound card for recording: Edirol UA-25
Please note that this hardware configuration had to be used in order to run the dialogue
simulation tool from outside the recording room. To ensure good quality of the audio
recordings, an external sound card was used.
(Figure 13 labels: wizard room, connected via keyboard, mouse, and screen cables; test
person room with screen, camera, room microphone, headset, and remote control.)
1.2.2 Software Setup
The simulation was run using a special version of EB GUIDE Studio 2.60 [9, 10], which was
extended using the plugin mechanism. New control windows (speech input/output, state
change, remote control simulation, etc.) were added to the simulation desktop. The wizard
could use these windows to control the simulation. Moreover, extensive logging facilities
were added to GUIDE.
In addition, GoldWave Multiquence and the Logitech QuickCam software were used at EB
for the audio and video recordings.
1.2.3 Recording Sessions at EB
At the beginning of a session, the subject was guided into the living room by the instructor
and received a short introduction to the experiment. The subjects then had to fulfill three tasks
in about half an hour. After the session, a questionnaire had to be filled in. The duration of a
session was about one hour. Both the recording and the questionnaire completion took about
half an hour each.
Introduction
The introduction given to the subjects can be outlined as follows:
- The system is a prototype that can understand spoken language.
- The session is going to be recorded (both audio and video).
- The system is fully functional and can already be used via remote control; the new
feature of this experiment is voice input.
- Three tasks have to be solved. These are formulated on sheets of paper and handed
over to the subjects.
- The experiment is about testing the system, not the subject.
- After the recording, a short questionnaire has to be filled in.
The complete text of the introduction is available in Appendix C.
Tasks
During a session, each subject had to fulfill three tasks. Every task was printed on a separate
sheet of paper. The first task was given to the subject after the introduction by the instructor.
For the other two tasks, the instructor entered the room after the time for a task had elapsed.
Task 1 (15 minutes)
Please look for your favorite broadcast following your own selection criteria for
Sunday afternoon. Please note that the prototype does not yet support every TV
channel.
Task 2 (7 minutes)
Please look for the current broadcast on ARD and change the volume.
Task 3 (7 minutes)
Please select a broadcast that is not currently on air and that you therefore would like to
record. Please note that the prototype does not yet support every channel.
Questionnaire
After the prototype session, each subject had to fill in a questionnaire of about 30 questions.
The questionnaire can be found in Appendix B and is discussed in detail in Chapter 2.
1.3 Experimental Setup at Amuser
At Amuser, too, two rooms were used in the experimental setup, very similar to the EB setup
(see Figure 14):
Figure 14: The WOZ setup at Amuser
1.3.1 Hardware Setup
In this section, the hardware setup at Amuser is discussed (see Figure 14). PC2, placed in
the test person room, was used to run the simulation and to record the audio signals coming
from both the close-talk and the distant reference microphones. The wizard PC (PC1) was
connected to PC2 via VNC over a point-to-point cable and was used as an interaction client
of PC2 (the server).
The two cameras were directly connected to a VHS mixer in the wizard room.
PC2 was connected to the two microphones (through the USB audio box), the loudspeakers
(driven directly by the built-in sound card), and the remote-control receiver (connected
through a serial port and driven by the WinLIRC software).
The following hardware was used:
- PC1: IBM Thinkpad, 256MB of RAM, 1x20 GB hard disk
- PC2: HP Compaq notebook, 1GB of RAM, 1x20GB and 1x55GB hard disk
- Screen: IBM 20” (PC 1), Samsung sync master 231T LCD 800x600 (second monitor)
- Microphones:
o Shure SM10A (head set)
o Røde NT6 and AKG c680 BL (room microphone)
- Cameras: 2 Sony 3ccd
- Sound card for recording: SoundMax integrated digital audio.
1.3.2 Software Setup
The simulation was run using the same version of EB GUIDE Studio 2.60 provided by EB.
Moreover, CoolEdit pro 2.0 was used at Amuser for the audio recordings.
1.3.3 Recording Sessions at Amuser
A session consisted of a short introduction to the experiment, two tasks to be solved with
DICIT, and the completion of a questionnaire after the recording session. At the beginning of
a session, the subject was guided into the living room by the instructor, received a short
introduction to the experiment, and then filled out the first part of the questionnaire (statistical
and habit questions). First of all, the subjects had to read some phrases, under the pretence
that this was a "calibration phase" for the microphones, and then fulfill two tasks in about 20
minutes. After the session, the usability part of the questionnaire had to be filled in. The
duration of a session was about 50 minutes.
Introduction
The introduction given to the subjects was the same used at EB.
Tasks
During a session, each subject had to read some phrases first and then fulfill two tasks. The
list of phrases and the task instructions were printed on separate sheets of paper. The
instructor gave the list of phrases for the "acoustical WOZ" to the subjects after the
introduction and, when they had finished reading, they received the first task. For the first
task, the instructor did not enter the room when the subject reached the goal, but left the
subject "playing" with the system until the time for this task had elapsed (about 10-12
minutes). Finally, the instructor entered the room to give the second task.
Recording phrases to set the microphones (3 minutes)
In order to present a "picture" of a real working system, the phrases for the "acoustical
WOZ" were presented as a microphone recording test.
Task 1 (11 minutes)
Considering that only six national channels are available, and using the criterion you
prefer, please search for a program you want to record that is not on air at this moment.
Task 2 (5 minutes)
Please search for the video clip by Cristicchi that is on air at this moment, using the title
if you know it. While you watch the video clip, please adjust the volume.
2. Questionnaire
After the recording session with the WOZ prototype system, each subject had to fill in a
questionnaire to determine users' attitudes toward different aspects of the system. For the
German and English subjects, the questionnaire data was entered on a notebook, so that the
questionnaires could be evaluated automatically without the need to enter them into the
computer separately. By contrast, the Italian subjects used a paper-based questionnaire.
The questionnaire consists of 31 questions according to the criteria of DIN EN ISO
9241-110 (see [11]). The first part consists of statistical questions (1-4) and questions
regarding TV habits (5-12). The second part contains questions regarding specific parts of the
DICIT WOZ system, such as screen, voice output, and voice input. The last part investigates
subjects‟ overall impression of the system.
German and English subjects' answers have been evaluated separately. There were 20
German and only four English subjects; therefore, the answers of the English subjects are not
statistically significant, but a limited quantitative evaluation can still be done. Moreover, two
of the English subjects did not like voice control at all (answering "do not use voice" a couple
of times in the free-text comments) and one of them answered with a negative bias in many
questions. Since the answers of the other English participants are more similar to the German
answers, a larger subject base would likely have produced different results.
The complete questionnaire can be found in Appendix B.
2.1 Statistical Questions (Questions 1-4)
The first part of the questionnaire contains statistical questions, e.g. regarding the subjects'
gender, occupation, and age.
Question | German | English | Italian
1: You are… | male 75%, female 25% | male 75%, female 25% | male 55%, female 45%
2: What is your educational qualification? | degree 60%, secondary school 30%, middle school 5%, primary school 5% | degree 100% | degree 35%, secondary school 50%, middle school 15%
3: Your age | 20_30 55%, 31_40 30%, 41_50 10%, 51_60 5% | 31_40 25%, 41_50 75% | 20_30 30%, 31_40 25%, 41_50 20%, 51_60 10%, >60 15%
4: Occupation | comp. science 25%, software developer 25%, engineering 10%, electro-technol. 5%, commercial IT 5%, IT 5%, biology 5%, <not given> 20% | software developer 75%, automation 25% | employee 45%, retired 15%, other 15%, housewife 10%, student 10%, commercial 5%
Table 2: Statistical questions
Of the German subjects, 75% (15) were male and 25% (5) female. 60% hold a university-level
degree, about one third finished secondary school, and two subjects finished middle or
primary school. 55% are aged between 20 and 30, 30% from 31 to 40, two subjects (10%) are
between 41 and 50, and one subject (5%) is between 51 and 60. Since most subjects are
employed at EB, they work in the software business.
As for the English subjects, the gender distribution is the same as for the German subjects,
but all hold a universitary level degree and are older than the German subjects.
The distribution of Italian sample about gender was: 45% (9) females, 55% (11) males;
regarding the educational qualification was: one third (7) hold a university level degree, half
sample (10) finished the secondary school, and the rest (3) finished the middle school. The
age distribution was divided with more than half the sample under 50 years: 30% (6) are aged
between 20 and 30, 25% (5) from 31 to 40, 20% (4) were between 41 and 50; two subjects
(10%) were between 51 and 60 and three subjects (15%) were over 60 years. The occupation
distribution was: almost half the sample (45%) were employees, three subjects (15%) were
retired, two subjects (10%) were students, two subjects (10%) were housewives and 20%
worked in other jobs.
While the German sample was chosen to represent a specific target of “expert” users, the
Italian sample was recruited trying to represent the distribution of the whole population
(regarding gender, educational qualification, job and age).
41_50 75%
31_40 25%
20_30 55%
41_50 10%
31_40 30%
51_60 5%
<not given> 20%
comp. science
25%
electro- technol.
5%
commer cial IT
5%
engineer- ing
10%
IT 5%
biology 5%
software developer
25%
secondary school 30%
middle school
5%
primary school
5% degree 60%
automation 25%
software developer
75%
2.2 TV Habits (Questions 5-12)
The next section of the questionnaire contains questions regarding the TV watching habits.
Question 5: How many people live in your household including you?
  DE: alone 25%, two 45%, three or more 30%
  EN: three or more 100%
  IT: alone 10%, two 35%, three or more 55%
Question 6: How many TVs do you have in your house?
  DE: none 10%, one 45%, two 20%, three or more 25%
  EN: one 50%, two 50%
  IT: none 5%, one 30%, two 40%, three or more 25%
Question 7: Who usually decides what to watch on TV?
  DE: together 42%, one 33%, each 17%, majority 8%
  EN: together 50%, majority 50%
  IT: together 47%, majority 21%, one 21%, each 11%
Question 8: How do you usually decide which programme to watch? (multiple responses)
  DE: guide 45%, surfing 26%, EPG 20%, teletext 9%
  EN: guide 60%, surfing 40%
  IT: teletext 37%, guide 26%, surfing 26%, EPG 11%
Question 9: Which type of television do you usually watch? (multiple responses)
  DE: satellite 45%, traditional 30%, digital terrestrial 20%, IPTV 5%
  EN: satellite 100%
  IT: traditional 84%, satellite 16%
Question 10: How do you usually select a programme?
  DE: no answer 55%, numeric button 45%
  EN: numeric button 75%, up/down button 25%
  IT: numeric button 45%, up/down button 30%, EPG 25%
Question 11: What is the information that interests you to choose a programme? (multiple responses)
  DE: genre 63%, topic 32%, don't care 5%
  EN: genre 50%, topic 33%, actor 17%
  IT: genre 33%, topic 22%, actor 14%, don't care 14%, channel 11%, duration 6%
Question 12: Usually you use the TV to:
  [Bar charts: per-sample counts for watching TV, surfing, DVDs, recorded programmes, background use, and other uses; the exact values are not recoverable from this transcript.]
Table 3: Habits questions
Multiple responses were possible for questions 8-11. Subjects could also enter additional
comments (more than one answer possible) for question 12. “Photos”, “HiFi”, and “Series and
Movies” were each stated once by the German subjects, whereas two subjects said that they
had no TV at all. One English subject added “video”.
2.3 The DICIT System (Questions 13-29)
Questions 13 to 29 are used to determine how subjects like specific aspects of the DICIT
WOZ system, such as the screen, voice input, or voice output.
2.3.1 Using the DICIT System
These questions are used to determine how subjects get along with the DICIT system and
whether they prefer voice to remote control input. Subjects had to rate each of the following
questions with values between 1 and 10. Moreover, they could explain or comment on their
answers in a text input field.
Question Average value
13. It was easy to understand how to use the different selection criteria
given by the system
(1 = Very Difficult, 10 = Very Easy)
DE: 9.00
EN: 6.75
IT: 7.25
14. It was easy to understand how to give all the vocal commands
(1 = Very Difficult, 10 = Very Easy)
DE: 8.80
EN: 7.75
IT: 7.95
15. It was comfortable to give some information with voice and the
other with the remote control
(1= Very Uncomfortable, 10 = Very Comfortable)
DE: 6.70
EN: 6.50
IT: 8.50
16. In case of problems did the system suggest usefully and efficiently
what to do to recover the information after the error?
(1 = Very Useless, 10 = Very Useful)
DE: 5.94
EN: 5.00
IT: 7.16
Table 4: General usability questions
Question 13:
German subjects had no problems using the filter criteria. Some found them logical, easy,
or clear (11), whereas some did not understand them at the very beginning, but they became
clear quickly after they had used the system for some time (3).
English subjects had more problems using the filters. One subject wanted to select a time
range (from/to), but the system could only select from a start time until midnight. Moreover,
a subject criticized that the “movies” genre filter showed many entries that were actually not
movies. (Since it was decided to keep the number of genres small, all entries had to be put
into the available categories.)
Only a few Italian subjects (4) had no problems using the filter criteria. Many of them (8)
had some problems at the beginning of the session, but after using the system for a while they
easily understood how to proceed. Other subjects (6) reported various problems interacting
with the system: some complained about the lack of a reference for the voice volume
commands (e.g. “mute”, “half volume”, etc.) or about the lack of EPG data (the Italian EPG
did not have data about subject and artist), while others reported difficulties in linking the
selection of search criteria to the direct viewing of a programme (one subject did not
understand that it was possible to use more than one criterion at a time to sort the data).
Question 14:
The German subjects understood how to use the vocal commands, although they had no prior
training. Most users said that the system was easy and intuitive, or that they could find out
how to operate it using a trial-and-error strategy.
One English subject observed an inconsistency in TV mode between volume up/down
and channel up/down. Moreover, one subject gave a below-average rating of 4, whereas the
others gave good ratings (8-10).
Most of the Italian subjects (12) stated that they had no problems, because the screen
output makes the available voice commands easy to understand. Other subjects (6) said they
had some difficulty guessing the volume commands, and one of them pointed out that any
problems he had with the vocal commands were because he was not used to interacting with
this kind of system (in his opinion it was not an interface design problem).
Question 15:
Subjects did not feel very comfortable using both voice and remote control (RC) to
operate the system: 11 out of 20 subjects said that they used the RC not at all or only to switch
on the system. Some wanted to use the RC for quick or simple inputs (e.g. switching
channels), whereas they thought that voice should be used for complex input (e.g. EPG queries).
One subject judged both speech control and speech feedback as irritating. Others said that
“voice was easier” and that the remote control was used for teletext.
Most of the Italian subjects (14) used only voice commands, stating that they are easier
and quicker to use than the RC. Three subjects said they used the RC only when they had
some difficulties with voice interaction (e.g. scrolling channels).
Question 16:
Regarding error recovery, some users commented that they either did not have problems
(4) or that the help function was “good” or “easy”. Others stated that the system did not
provide useful help, or that the help was very basic and only took them one step further
without helping them understand the system.
English subjects rated the help lower than the German subjects did. One subject complained
that hardly any help was given, and another that the automatic help (after silence) was annoying.
Almost half of the Italian sample (9) said that the visual feedback was very useful for
recovering information after interaction problems, and two subjects did not blame the system
interface for their difficulties in coping with an error, attributing them instead to their own
unfamiliarity with this kind of advanced system. Four subjects did not notice any help/error
messages during their sessions and therefore commented that they did not encounter any
problems. The other subjects complained about the lack of contextual help/error messages or
about a general interaction logic that differed from their expectations.
Summary: The majority of the subjects judged the system to be easy to use and had no major
difficulties in using voice control. They did not feel very comfortable using voice and remote
control at the same time, and a majority of the subjects used voice control exclusively. Error
recovery could be improved, since many subjects did not perceive the help to be useful.
As a test of the usefulness of a voice-supported dialogue system, the results can be regarded
as very positive, because dialogues and help menus can always be improved with standard
techniques. Regarding the crucial issue, namely the acceptance of voice as an interaction
modality, the results of the questionnaire support this approach.
2.3.2 Watching the Screen
The aim of this section was to get feedback on the DICIT screen, i.e. whether it was easy to
read the screen and navigate the menu structure. As in the previous section, every question
in this section consists of a rating value between 1 (negative) and 10 (positive) and an input
field where subjects could explain their choice.
Question Average value
17. Is the screen which shows the criteria for the programmes
search easy to read?
(1 = Very Difficult, 10 = Very Easy)
DE: 8.75
EN: 7.00
IT: 8.65
18. Was it easy to understand how to vocally use the search criteria
for programmes shown on the screen?
(1 = Very Difficult, 10 = Very Easy)
DE: 8.80
EN: 6.50
IT: 7.89
19. Was it easy to understand how to use the remote control to
select the search criteria for programmes?
(1= Very Difficult, 10 = Very Easy)
DE: 6.16
EN: 6.50
(only 2 answers)
IT: 7.00
20. To complete the task we assigned to you, did you expect to
have some other vocal commands?
(Yes/No)
DE: Yes=55%, No=45%
EN: Yes=50%, No=50%
IT: Yes=30%, No=70%
21. Did you find the information on the screen useful for orienting
yourself, in case you disabled the audio?
(1 = Very Useless, 10 = Very Useful)
DE: 8.80
EN: 5.00
IT: 6.15
22. Do you find it useful that the list of previous criteria is
always shown?
(1 = Very Useless, 10 = Very Useful)
DE: 8.35
EN: 5.25
IT: 7.68
23. Do you find useful a function which allows you to insert a
precise word via the remote control to search for programmes?
(1= Very Useless, 10 = Very Useful)
DE: 3.35
EN: 3.75
IT: 7.26
Table 5: Screen feedback questions
Question 17:
While the acceptance of the EPG screen was good, the comments were very diverse. Two
subjects stated that the fonts were good, whereas two others noted that they were too small.
Three mentioned that the screen contains too many details, while it was clear for others (3).
Subjects also had problems with the reset function; e.g. one did not want it to reset all filters.
Two English subjects stated that the list of TV programs was too short (only 3 entries
for 6 channels). One complaint was that the system keeps making suggestions.
The majority of the Italian subjects (12) found this screen useful and easy to read, but
some deplored the “basic” graphics and one stated that it was not clear that the search
criteria could be combined.
Question 18:
The results are also very diverse for this question, but the majority of the subjects had
positive comments. Some were positive (“surprised in a positive way”, “it's faster than
Google”, “it always worked”), others negative (“produced strange results”).
For one English subject, it was not clear that the search had to be started (e.g. by saying
“Search”). Another comment was that the basic functionality was easy to use, but advanced
features were not.
© DICIT Consortium
D6.2 –Multi-microphone Data Collection and WOZ Experiments
DICIT_D6.2_20080428 32
Although all Italian subjects gave this question a fairly high score, only 8 commented on
it: half of the comments highlighted difficulties at the beginning, after which it quickly
became clear how to use the search criteria by voice. Two subjects reinforced their answer,
stating that the search criteria were easy to understand and use. Two other subjects reported
problems knowing which data were available for a search (the Italian EPG did not have data
about subject and artist) and missed the possibility of easily switching to the programme on
air once the desired one had been selected.
Question 19:
Most subjects (15 out of 20) said that they did not use the remote control at all. Others
commented that they are used to RCs or that the DICIT RC works like other RCs.
As for the English subjects, most said that they used voice only.
Like the German and English subjects, most of the Italian ones (16) said that they used
only their voice because it was more comfortable. Some subjects appreciated the mapping
between the “colored buttons” related to voice commands on the screen and the colored
buttons on the RC.
Question 20:
People were missing the following commands: going to full-screen mode, recording all
results at once or deleting all entries from the recording list at once, foreign-language
commands, and multi-select.
English subjects also wanted an option to record all entries at once. Another person
suggested presets for volume settings (e.g. “medium volume”).
Only six Italian subjects expected to have other commands or options. Two subjects
complained about the unavailability of a list of artists or titles (within the “sub-menus” for
these search criteria); the others asked for the following commands on screen: “quit”, “show
it” (to switch the TV to the chosen programme instead of a generic channel), and “go ahead”
and “go back” commands to scroll the list of programmes.
Question 21:
Five subjects said that the provided information was enough. Subjects had some comments
regarding the TTS output: the TTS should not be muted when the TV is muted (two different
“mutes”?), and the TV volume should be lowered during TTS output. Moreover, one subject
stated that the text on the screen should not be cut off.
Two English subjects stated that they only used the screen, speech feedback was irritating,
and that they would disable the audio. The other two subjects did not comment on this
question.
Only four Italian subjects said they did not pay attention to the help messages. Four other
Italian subjects reaffirmed that they expected to have more detailed commands to adjust the
volume (especially to turn the “mute” on/off); the others said they read the commands from
the screen or used their own reasoning to orient themselves in the system, but they did not
mention anything about the help messages.
Question 22:
The most common answer is “good” (5). One user would like to make this configurable;
another would like to have templates.
© DICIT Consortium
D6.2 –Multi-microphone Data Collection and WOZ Experiments
DICIT_D6.2_20080428 33
English subjects commented that it was not clear if all or only the last criterion was used
for the query and that this could be a possible preset option.
Only four people said that they could do without the criteria list once they had gained
some familiarity with the system. The other 10 Italian subjects commented that reading the
criteria is easier than remembering them and gives confidence during the interaction with the
system; one of these subjects stated that it would be nice to make this screen configurable,
and another said that he would prefer a different kind of interface.
Question 23:
The dislike for a virtual keyboard is confirmed by the comments. Subjects regard it as slow
and compare it to T9 (cell phone input), which most of them do not like for a TV system.
Subjects also dislike the idea of a huge remote control with lots of keys. They prefer an
improved, well-working voice input, which would make an on-screen keyboard unnecessary.
Two English subjects rated this question with the extremes of 1 and 10 (one each); the two
others rated it with 3. One subject stated that a speller is not required as long as the spelling
works fine. Another said that shortcuts could be useful. Other comments were that this took
too long and depended on the word. One subject said that he preferred a printed guide.
Only one Italian subject disliked this functionality; eleven of the other subjects judged this
feature a good idea for providing an alternative way of interaction or for simplifying the
search for artists or titles (some of them suggested narrowing the search by first entering a
few letters with the RC and then completing the query by voice from the reduced number of
items found).
Summary:
Subjects regarded the screen as easy to read and the voice commands as intuitive. It was less
clear how to use the remote control for filter selection. About half of the subjects wanted
additional commands at their disposal. The information on the screen was considered useful
even without the related TTS prompts. Subjects think that the list of previous criteria should
remain available. German and English subjects strongly disliked an on-screen keyboard;
most Italian subjects, on the other hand, appreciated this feature as a way of facilitating the
search for hard-to-enter data such as artist names or exact programme titles.
2.3.3 Vocal Interaction
The aim of questions 24-26 was to probe the appeal of the vocal mode (in comparison to
interaction via the haptic mode) and its flexibility, both for providing input and for giving
output.
Question 24: How do you judge the opportunity to use a vocal command?
  DE: very useful 55%, useful if it provides more operations than the RC 25%, useful if used together with the RC 15%, useful if it replaces the RC 5%
  EN: very useful 50%, would never use vocal commands 50%
  IT: very useful 50%, useful if it provides more operations than the RC 25%, useful if used together with the RC 15%, would never use vocal commands 10%
Question 25: For the vocal commands, you prefer:
  DE: short commands 75%, full sentences 10%, reading precise commands 10%, none 5%
  EN: short commands 100%
  IT: short commands 60%, full sentences 20%, reading precise commands 20%
Question 26: When you give a vocal command to the system, you prefer [feedback]:
  DE: video only 50%, both 30%, immediate action (no feedback) 20%
  EN: immediate action (no feedback) 75%, video only 25%
  IT: no feedback 35%, both 35%, video only 20%, voice only 10%
Table 6: Vocal mode questions
Question 24:
The majority considers the opportunity to use voice input (very) useful, but 45% only if it
replaces the remote control, is used together with the remote control, or provides more
operations than the remote control. The remote control is considered better suited for e.g.
switching channels, while voice input is regarded as good for complex queries (e.g. EPG).
Subjects said that it is important for the system to understand lots of commands and to work
right from the beginning.
One subject was totally enthusiastic about this feature: “It's the way of the future and I
want it now!” Another comment was that this could be useful for disabled and elderly people.
Half of the Italian sample answered that voice interaction is useful, and seven subjects
reinforced their answer by explaining that voice control facilitates the interaction because,
when commands work correctly, voice is quick, more practical, and not as cumbersome as
the RC. Only three subjects commented that they prefer to choose when to use voice and
when to use the RC, and another subject reinforced her negative answer, stating she had
never used the voice commands because, living alone, she finds speaking to the TV as if to
a human being distressing.
Question 25:
Most subjects agree that short commands are the best solution. One person recommends
using long sentences for beginners and short ones for experts. Two subjects mention that both
short and long commands should be understood by the system.
One subject noted that short commands would take some time to figure out. Others
said that they would use full sentences later.
Twelve Italian subjects agree on short commands because they are easier to remember and
to read on the screen (also for elderly or visually impaired people). Only one of the four
people who said they prefer full sentences justified his answer, explaining that it is better to
avoid speaking like a robot.
Question 26:
Some people regard speech output as annoying (2) or repeat that they prefer video output
(3), some would like to be able to disengage it (2). One stated that a prompt should only be
repeated a certain number of times. Some users think that it should be “intelligent” or
“provide more feedback when problems occur.”
English subjects stated that this speech feedback could be good for the blind and that it was
sufficient. One subject said that he would not use voice control.
Italian subjects are equally split (7 and 7) between people who do not want any feedback,
only the system's reaction to their request, and people who want video and TTS feedback
before the system does anything. If we add to this second group the subjects who want only
video feedback (4) and those who want only audio feedback (2), so that they can interact with
the system while not in front of the TV, the majority of the sample would prefer some kind of
feedback.
Summary: Subjects like to have voice as a means to control an STB system. They prefer
short commands instead of complete sentences. While they like speech input, subjects are
sceptical about speech output, and preferences differ between the samples: half of the German
subjects prefer video-only feedback, 20% want an instant reaction without feedback, and only
one third want both speech and visual confirmation; in the Italian sample, one third of the
subjects want an instant reaction from the system, one third prefer both video and voice
feedback, 20% like video-only feedback, and 10% voice-only feedback.
2.3.4 The System Voice
The system voice of the DICIT system is the subject of this section. Subjects were asked how
they like the TTS output, whether they want to be able to interrupt the system, and whether
they want to be able to switch off the recognizer.
Question 27: Do you find it useful that the system reads (in addition to listing them on the screen) the programmes found after your search?
  DE: no 85%, yes if not too many 15%
  EN: no 100%
  IT: no 55%, yes 40%, yes if not too many 5%
Question 28: If you prefer a system which gives you vocal feedback:
  DE: want the possibility to interrupt 90%, happy to wait 10%
  EN: want the possibility to interrupt 100%
  IT: want the possibility to interrupt 81%, happy to wait 19%
Question 29: Would you like to have a button to enable/disable the vocal recognizer?
  DE: yes 90%, no 10%
  EN: yes 100%
  IT: yes 85%, no 15%
Table 7: Listening to the system voice
Question 27:
German subjects think that this feature is only useful for blind or elderly people and that,
if implemented, it should be possible to disable it. Several consider it too slow (5) and some
do not like the TTS voice (2).
None of the English subjects wanted the system to read out the results.
Although eleven Italian subjects answered that they do not want to hear the TTS read out
the listed programmes, those who like this feature (8), together with the one who wants it
only if the items are not too many, make up almost half of the sample (45%). Some of those
who answered “no”, and a couple of subjects who answered “yes”, said that this feature could
be useful for blind or elderly people (so it could be enabled/disabled on demand); by contrast,
two other subjects who want this feature regarded it as an advanced and useful functionality
even for non-impaired people, saying that it is convenient especially when they are doing
something else (not in front of the TV) while consulting the EPG.
Question 28:
People do not like to hear the same prompt again and again. They would prefer varying
texts, and texts that become shorter as use of the system increases. They also do not want
to wait for the TTS to end (2) and prefer barge-in (2).
All English subjects want the possibility to interrupt the system.
Most of the Italian subjects (13) liked the possibility of interrupting prompts (most of them
commented that this means having control of the TV), and only three subjects answered that
they prefer waiting until the end of the system output.
Question 29:
Subjects want to switch off the recognizer when they are in a conversation (6), when other
people are in the room, or when the room is loud.
All English subjects want the possibility to disable the recognizer.
Almost the whole Italian sample (17) likes the possibility of manually disabling the
recognizer, and 7 subjects reinforced their answer by saying that they prefer to control the
interaction and do not want to be annoyed by false recognitions while they are talking with
someone else.
Summary: The majority of German, English and Italian subjects do not want the system to
read out the search results; the remaining German subjects only want this feature if the
number of results is small, while 40% of the Italian sample likes this feature and thinks it is
a good way to remain free to do other things during a task where the EPG images are not as
important as when the TV is showing a programme. In addition, 80-90% of the subjects want
to be able to interrupt the system (barge-in). A function to disable the recognizer should also
be implemented, since 85-90% of the subjects want to be able to do so.
2.4 General Opinion of the DICIT WOZ Prototype (Questions 30 and 31)
The remaining questions are used to examine how subjects like the DICIT WOZ system.
The biased answers of two English subjects show up especially strongly in this section.
Since the number of English subjects is not significant, their results should not be compared
directly with those of the far larger German user base.
2.4.1 Users’ experiences with DICIT
Finally, users had to rate their experiences with DICIT by means of 13 questions. Each
question had to be rated with a value between 1 = complete disagreement and 7 =
complete agreement (as in the classical scale used in Osgood's 'semantic differential').
Sub-Question Result
1. I think that the system is easy to use DE: 6.40
EN: 5.50
IT: 5.61
2. It makes me confused when I use it DE: 1.40
EN: 2.00
IT: 3.00
3. I like the voice DE: 4.30
EN: 3.00
IT: 4.16
4. I think that the system needs too much attention to interact vocally DE: 2.15
EN: 3.75 (1)
IT: 3.63
5. I have the impression not to control the dialogue with the system DE: 1.30
EN: 2.00
IT: 2.94
6. I have to focus on using it with the remote control too DE: 4.05
EN: 1.00
(3 answers)
IT: 2.88
7. I think that the speech interaction is efficient DE: 5.36
EN: 5.50 (2)
IT: 5.68
8. By using the voice it is easier to search the programmes DE: 5.15
EN: 4.50
IT: 5.57
9. The system voice speaks too quickly DE: 1.40
EN: 1.50
IT: 1.88
10. The selection criteria which appear on the screen are not clear DE: 1.65
EN: 2.50 (3)
IT: 3.15
11. I think that it is fun to use DE: 6.40
EN: 5.75 (4)
IT: 6.05
12. I prefer using traditional way (TV guide, teletext, newspaper) to
search an interesting programme
DE: 2.80
EN: 4.25 (5)
IT: 2.09
13. I think that this system needs some improvements DE: 4.45
EN: 6.50
IT: 4.88
(1) 2x1, 1x6, 1x7; (2) 1x1, 3x7; (3) 3x1, 1x7; (4) 3x7, 1x2; (5) 1x1, 1x2, 2x7
Table 8: User general opinion
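The per-group averages reported in this table are simple arithmetic means of the individual 1-7 ratings. A minimal sketch of this computation is shown below; the data layout, group labels, and example values are hypothetical illustrations, not the actual questionnaire data.

```python
# Sketch: average per-group ratings for a 1-7 agreement scale,
# as used for the 13 questions of the user-experience section.
from statistics import mean

# ratings[group][question] -> list of individual ratings (1..7).
# The values below are made-up examples, not DICIT data.
ratings = {
    "DE": {1: [6, 7, 6, 7, 6]},
    "IT": {1: [5, 6, 6, 5, 6]},
}

def group_averages(ratings, question):
    """Return {group: mean rating} for one question, rounded to 2 digits."""
    return {g: round(mean(qs[question]), 2)
            for g, qs in ratings.items() if question in qs}

print(group_averages(ratings, 1))
```

The same helper applies unchanged to the 1-10 usability questions and the 1-7 semantic-differential pairs, since all are averaged per language group.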
Discussion:
Subjects' experiences with the DICIT WOZ prototype are positive: they think that it is
easy to use, not confusing, and fun to use. On the other hand, people do not like the voice
very much and think that the system still needs improvements.
Comparing the average answers of the Italian sample with those of the German one, the main
differences are the following: Italian subjects feel more “confused” when interacting with the
system (q. 2), perhaps because (more than the German subjects) they have the impression of
not controlling the dialogue (q. 5), and they find the selection criteria less clear (q. 10). The
other answers are well aligned with the averages of the German sample.
2.4.2 Rating user satisfaction within DICIT
In the final section, subjects had to rate the DICIT system on a range of 1 to 7 between pairs
of opposite adjectives (e.g. easy vs. complicated). For most adjectives, small values represent
positive feedback.
Adjective pair                    DE    EN*   IT
1. easy vs. complicated           1.40  2.75  2.76
2. efficient vs. inefficient      1.90  4.00  2.56
3. quick vs. slow                 2.75  3.50  3.17
4. original vs. copied            1.65  2.50  1.61
5. precise vs. vague              2.20  2.75  2.76
6. capable vs. incapable          2.05  2.75  2.41
7. formal vs. informal            3.80  4.75  3.33
8. active vs. passive             3.75  4.00  2.88
9. friendly vs. unfriendly        2.60  3.25  2.64
10. determined vs. undetermined   2.50  3.00  2.93
11. polite vs. impolite           2.35  2.25  2.38
12. clever vs. stupid             2.40  4.25  2.70
13. organized vs. disorganized    2.00  2.50  2.35
14. patient vs. impatient         2.00  3.50  2.05
Table 9: Semantic differential
* Since the number of participants in the English group is not large enough to produce significant results, these
numbers are just added for completeness. In most questions, two subjects rated the system in a very positive way,
while two had a more negative impression. Therefore, a discussion of the English results is not included.
Discussion:
Altogether, subjects have a positive impression of the system: it is said to be easy,
efficient, capable, organized, and patient. The results are good, but not excellent (around 2).
With a value of 2.75 for “quick”, the responsiveness of the system should be improved, but
given the nature of the system (a WOZ setting), this value should not be taken as a
reference. At least the system did not get bad ratings in this category.
In the categories “formal” and “active” the system gets a moderate rating (3.80, 3.75).
This is probably because subjects have different ideas of how formal and active the
system should be (i.e., some users prefer a formal and passive system, whereas others prefer a
personal and active one).
Comparing the average answers of the Italian sample with those of the German one, most of
the answers are fairly well aligned; small, not significant differences can be noted for the
same three adjective pairs of the semantic differential commented on before: Italian subjects
judge the speed of reaction of the system (q. 3) a little lower than the German subjects do;
the style of the interface is considered a little more “formal” by the Italian subjects, and the
system is judged a little more “active”.
2.5 Summary
The results of the questionnaire are positive. Subjects had no problems performing a number
of given tasks with the DICIT system, which they had not used before. They liked voice
control for an STB system, and most did not use the remote control at all for the duration of
the test session.
Moreover, there are clear results for many aspects of the system, where subjects gave uniform
answers: they do not like long TTS output and complete sentences as output, but want
short visual feedback and short commands for input. These results should definitely find their
way into the next prototypes.
3. Session Evaluation
3.1 Logging Data
Different statistics were derived from the logging data automatically using scripts (Perl and
Python). These results are discussed in this section.
Unfortunately, some of the EB log files were not valid (IDs 4, 6, and 9) and had to be omitted
from this part of the evaluation. For the German evaluation, 17 (out of 20) valid log files are
therefore included in the discussion. The English EB subjects have the IDs 4, 19, 20, and
22.
Some of the logging data (esp. view logging) was not collected during the Italian recordings.
Therefore, this data is not available for this evaluation.
The average duration of a German or English session was 27 minutes; for the Italian sessions
the average was 17 minutes.
3.1.1 Logging Data
During a recording session, logging data was collected to facilitate a thorough evaluation.
Some of the data was collected automatically using GUIDE (discussed in this section), other
data was created manually (i.e. annotations for the audio recordings).
GUIDE was extended with an extensive logging mechanism to be able to reconstruct the
interaction of the user with the system. For every session, a separate log file was created.
Some entries are specific to GUIDE and the model (e.g. state or event names), while others do
not depend on the model (such as TTS). The different logging entities are shown in Table 10.
Logging Entity Description
TTS Output Every TTS output of the system was logged. There are different
sources of the TTS output: system prompts that are played back
automatically, predefined prompts that the wizard plays manually by
selecting them from a list, or manual wizard prompts that the wizard
can enter manually if none of the predefined ones fit. If a prompt was
interrupted (either by voice or remote control), a special log entry was
created.
ASR Input Recognitions from the automatic speech recognizer (ASR) include the
name of the grammar and a confidence value.
Although no actual ASR is used for the DICIT prototype, the wizard
can select from a list of currently active ASR commands that is
extracted from the loaded grammars. If a user input matches one of
these commands, the wizard simply selects it from the list and thus
simulates the ASR. The wizard input is then treated by GUIDE like an
actual ASR recognition.
State change Every state change in the GUIDE model is logged.
Event Every GUIDE event is logged.
Haptic events The wizard has a virtual remote control at his disposal, on which he
can select most commands of the remote control. These include e.g.
cursor up/down, volume up/down, channel up/down, EPG/TV/teletext,
etc. Numbers (0-9) were not present on the control panel.
Hardware events Hardware events are remote control events that are then mapped to
GUIDE events.
Screenshots At every event, a screenshot was created to be able to reproduce a
session visually from the logging data.
Table 10: Logging data created by GUIDE.
Moreover, audio recordings using both a headset and a distant reference microphone were
created. These recordings were then annotated manually using Praat [12]; for annotation
details please refer to [13]. This made it possible to evaluate the recordings at the
textual level.
The technical evaluation was done according to the PROMISE principles for evaluation of
multimodal dialogue systems (see [14]).
3.1.2 Number of Screens and Views
First, we examine the number of different views entered by the subjects. After every event, a
screenshot was taken and a respective entry added to the log file. By removing duplicate
screenshots (using a binary file comparison), we get the number of different screens for each
user. More than one unique screenshot can be taken within one view, e.g. if the user moves
the cursor in a list. From the log file entries, we can see how many different views there are
for each user, i.e. the number of times the user changed the view. This number is obviously
smaller than the number of different screens.
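As an illustration, the two counts could be derived roughly as follows. This is only a sketch: the screenshot file layout and the "VIEW &lt;name&gt;" log entry format are assumptions made for the example, not the actual GUIDE log schema.

```python
import hashlib
from pathlib import Path

def count_unique_screens(screenshot_dir):
    """Count distinct screenshots via binary comparison (here: a content hash)."""
    digests = {hashlib.sha256(p.read_bytes()).hexdigest()
               for p in Path(screenshot_dir).glob("*.png")}
    return len(digests)

def count_views(log_lines):
    """Count view changes: one more view each time the logged view name
    differs from the previous one (the "VIEW <name>" format is assumed)."""
    changes, previous = 0, None
    for line in log_lines:
        if line.startswith("VIEW "):
            name = line.split(maxsplit=1)[1].strip()
            if name != previous:
                changes, previous = changes + 1, name
    return changes

log = ["VIEW WelcomeView", "VIEW View", "VIEW View", "VIEW EPG_MainMenu_View"]
print(count_views(log))  # 3
```

The ratio reported in Table 11 would then simply be `count_views(...)` divided by `count_unique_screens(...)` for each subject.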
Figure 15: German: Different screens and views.
German:
From these numbers, we can derive different types of users. If the number of different screens
is far larger than the number of different views, there was a lot of activity within a view,
which usually results from an intensive remote control use (e.g. Log12). On the other hand, if
activity within one view is small, it means that the users switched directly from one view to
the next, without much interaction (e.g. scrolling) inside a view (e.g. Log11). This also
becomes clear in Table 11, which shows the number of different views divided by the number
of different screens (min and max highlighted).
German subjects:
ID    1    2    3    7    8    10   11   12   13   14   15   16   17   18   21   23   24
Ratio 0.58 0.49 0.49 0.64 0.66 0.65 0.83 0.32 0.64 0.51 0.59 0.47 0.61 0.29 0.34 0.58 0.46

English subjects:
ID    4    19   20   22
Ratio 0.58 0.53 0.44 0.54

Table 11: Number of different views divided by number of different screens.
English:
All German subjects show quite similar behavior with respect to their patterns of use;2 the
average number of views divided by the number of screens is 0.52 for
the English-speaking participants, compared to 0.54 for the German subjects. Also, there are
no subjects whose values depart strongly from the average value.
Figure 16: English: Different screens and views.
Italian: This information was not available in the Italian log files.
2 For the English subjects, the same tendency could be noted. This may be due to the fact that all English-speaking subjects have been living in Germany for a long time and may have adapted their way of interacting with the TV to their German surroundings.
3.1.3 Screen preferences of the Users
German:
From the log files, we can see how much time the subjects spent in each view.
EPG_ChooseFilter; 1.30; 5%
EPG_Confirmation; 0.70; 3%
EPG_MainMenu_View; 8.18; 31%
EPG_ManualInput; 1.13; 4%
EPG_ResultList; 6.76; 25%
NewsView; 0.37; 1%
View; 7.20; 26%
EPC_RecordingList; 1.17; 4%
BlackScreen; 0.03; 0%
WelcomeView; 0.39; 1%
Figure 17: German: Screen preferences. Name; time in minutes; percentage.
Figure 18: German: Screen preferences of the individual subjects
As can be seen from Figure 17, users spent most of their time in the EPG main menu,
followed by the TV screen and the result list. This is what one could expect regarding the
tasks the subjects were given. Little time was spent in the recording list (~4%), which
suggests that this feature was not clear to the subjects or was not required to solve the given
tasks. Only 3% of the time was spent in the confirmation screen, which is what one might
expect.
Figure 18 shows the screen preferences for every subject. While some users spent more time
within the TV screen (“View”) than others, the overall distribution is similar for the subjects,
which is due to similar tasks. Only three users (2, 3, and 12) used the teletext feature, which
was neither mentioned nor part of the task. Two subjects (15 and 21) did not use manual
input.
EPC_RecordingList; 1.03; 4%
EPG_ChooseFilter; 1.93; 7%
EPG_Confirmation; 2.13; 8%
EPG_MainMenu_View; 8.01; 29%
EPG_ManualInput; 1.61; 6%
EPG_ResultList; 5.28; 19%
NewsView; 0.63; 2%
WelcomeView; 0.37; 1%
View; 6.52; 24%
Figure 19: English: Screen preferences. Name; time in minutes; percentage.
Figure 20: English: Screen preferences of the individual subjects.
English:
The results for the English subjects are similar to the results of the German subjects.
Italian: This information was not available in the Italian log files.
3.1.4 Remote Control vs. Voice Control
Users not only used voice control in different ways, but also had very different attitudes
toward it. We examined the number of remote control events
and the number of voice commands in the form of wizard actions (both high- and low-level
commands, but no direct multi-slot EPG3 (see page 57) queries, i.e. queries where more than
one EPG value is selected with the same utterance, such as "show action movies for tonight",
which specifies a genre and a time in one statement). Again, the number of remote inputs and
the number of voice actions do not directly relate, since it takes several remote control
actions (e.g. 3x "down" plus "OK") to trigger the same action that takes only one voice
command (e.g. "select by day"). Therefore, if the number of remote control inputs is the same
as the number of voice inputs, it does not mean that the user triggered the same number of
actions by remote control and voice. Still, we can derive different user groups from the
amount of remote control use.
3 EPG = electronic programming guide
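The tally described above can be sketched as follows. The "HAPTIC" and "ASR" entry tags are placeholders assumed for this example, not the literal GUIDE log entities.

```python
from collections import Counter

def tally_inputs(log_lines):
    """Count remote control (haptic) vs. voice (wizard-simulated ASR) inputs."""
    counts = Counter()
    for line in log_lines:
        tag = line.split(maxsplit=1)[0] if line.strip() else ""
        if tag == "HAPTIC":
            counts["remote"] += 1
        elif tag == "ASR":
            counts["voice"] += 1
    return counts

# Three remote actions ("down", "down", "ok") achieve what a single voice
# command ("select by day") does, as noted above.
session = ["ASR select by day", "HAPTIC down", "HAPTIC down",
           "HAPTIC ok", "ASR show action movies"]
counts = tally_inputs(session)
print(counts["remote"], counts["voice"])  # 3 2
```

Note that equal tallies do not imply equal numbers of triggered actions, for the reason given above.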
Figure 21: German: Amount of voice and remote control input.
German:
By relating the number of voice commands to the number of remote control commands, we
can see in Figure 21 that there are three different groups: a first group (mostly remote control)
makes heavy use of the remote control (subject 21 only, red arrow). A second group (mixed)
makes use of both voice control and remote control (subjects 12 and 21, orange arrow). A third
group (mostly voice control) hardly makes use of the remote control and uses mainly voice
control (subjects 3, 7, 10, 11, and 16, green arrows). Since the subjects were told during the
introduction that this experiment was about voice control, and since this was a new and
therefore interesting feature for them, these results are not surprising. Still, voice control had a
great appeal for these users, and they could evidently control the system without the remote
control.
English:
The behavior of the English subjects (Figure 22) is more uniform than the behavior of the
German subjects. No different user groups can be derived from these subjects. Please note that
the sample is not significant here.
Figure 22: English: Amount of voice and remote control input.
Italian: Although this information was not available in the Italian log files, it is worth
highlighting that most of the Italian subjects seldom used the remote control. On the other
hand, during the second task (adjusting the volume), most of them used voice commands in
the same way as they would use the RC (e.g. "louder, louder, louder", reproducing repeated
presses of the RC button, or "higheeer", probably reproducing a single but prolonged press).
In general, voice control had a great appeal for the Italian users, and they used the RC only
when they had problems interacting with the system by voice.
3.1.5 TTS Usage
Next, the TTS output will be examined, which is the second way for the wizard to react to
user input besides executing an action. There are two kinds of TTS output, automatic and
manual TTS output. While automatic TTS prompts are played when a state is entered, manual
TTS prompts are triggered by the wizard as a means to communicate with the user. In this
section, we only consider manual TTS prompts, because no information can be derived from
the automatic prompts that are part of the dialogue.
The manual TTS prompts can again be divided into two groups. Most of the prompts were
defined before the recordings and the wizard could trigger them by selecting them from a list.
These prompts were divided into different groups of prompts: error prompts (ERROR), help
prompts (HELP), please-wait prompts (WAIT), and rejection prompts that were used when
the wizard could not understand the user (REJECT). The complete list of prompts can be
found in Appendix D. Moreover, the wizard could enter additional prompts into a text field
and play them. These prompts could not be classified automatically and are therefore listed as
FREE. WAIT prompts were used when the wizard had to perform a time-consuming task
(such as entering a query in the EPG window). The wizard had a special shortcut (F12) that
he could use to trigger a WAIT prompt. (This feature was only used for the recordings at EB.)
REJECT prompts were used if the wizard could not understand the user, but sometimes also
to gain some time when the wizard could not react quickly enough. Moreover, a REJECT
prompt was also played if a requested function was not available. Also, commands that were
not part of the wizard guidelines (such as command repetition, e.g. “down, down, down” or
“down 3 times”) were rejected.
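The classification of manual prompts into these categories could be reproduced along the following lines. The example prompt texts are invented placeholders; the real predefined lists are those in Appendix D.

```python
# Hypothetical excerpts from the predefined prompt lists (see Appendix D).
PROMPT_CLASSES = {
    "REJECT": {"Sorry, I could not understand you."},
    "WAIT": {"Your input is being processed."},
    "ERROR": {"The search did not return any results."},
    "HELP": {"You can search by channel, day, or genre."},
}

def classify_prompt(text):
    """Return the category of a manual TTS prompt; prompts typed freely
    by the wizard match no predefined list and are classified as FREE."""
    for category, prompts in PROMPT_CLASSES.items():
        if text in prompts:
            return category
    return "FREE"

print(classify_prompt("Your input is being processed."))   # WAIT
print(classify_prompt("The actor is not in the database."))  # FREE
```

This matches the way the evaluation scripts must have distinguished predefined prompts from FREE ones: anything not on a predefined list falls through to FREE.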
ERROR 13%; FREE 8%; HELP 12%; REJECT 42%; WAIT 25%
Figure 23: German: Types of TTS output.
Figure 24: German: Types of TTS output per user.
German:
As can be seen from Figure 23, most prompts for all views and all subjects were REJECT
(42%) prompts, followed by WAIT (25%) prompts. About the same number of ERROR
(13%), FREE (8%), and HELP (12%) prompts were used.
We now examine some users who show distinctive TTS prompt distributions. These
are marked with an arrow in Figure 24. Most of these recordings were done at the end (i.e. on
the right-hand side of the chart), but this is coincidental.
© DICIT Consortium
D6.2 –Multi-microphone Data Collection and WOZ Experiments
DICIT_D6.2_20080428 50
- Log014: This subject had a strong dialect and did not try to speak standard German.
He also used a lot of off-talk (e.g. DICIT: “Your input is being processed”, subject
says: “Well, I hope so!”). He also tried out which dialect words the system could
understand (e.g. for “yes”). In TV mode, the subject tried to pronounce channel names
in a “drunken” way. Obviously, many of his attempts were answered with a REJECT
by the wizard.
- Log021: This subject was not an experienced user and had some trouble using the
system. In the beginning, a lot of off-talk occurred (e.g. reading what was on the
screen, “what now?”). Lots of predefined (HELP) and free (FREE) help prompts were
necessary to guide the subject through the system. After some time, the subject used a
lot of free input (e.g. only saying the name of an actor or a channel), which was often
answered by the wizard with a WAIT prompt while he was processing the request.
- Log23: The subject operated DICIT in a very calm way. No WAIT prompts were used
for this subject, because the subject did not make requests that required a WAIT
prompt (i.e. no free input). Moreover, all general help prompts were
played at least twice, which accounts for the high number of HELP prompts.
- Log24: The reason for the high number of REJECT prompts is that the subject tried
out different things, primarily unavailable functions (e.g. “summary for broadcast”,
“full-screen”, “go to top in list”).
View                ERROR  FREE  HELP  REJECT  WAIT  Sum
EPC_RecordingList     1     1     1     18      7    28
EPG_ChooseFilter      2     0     0     12      6    20
EPG_Confirmation      1     1     0     23      0    25
EPG_MainMenu_View    75    63    57     82    128   405
EPG_ManualInput      13     3     0     13     85   114
EPG_ResultList       49    19    30     99     48   245
NewsView              0     0     0     12      1    13
View                  4    19    36    204      7   270
Table 12: Prompt types per view.
EPG_MainMenu_View: ERROR 19%; FREE 16%; HELP 14%; WAIT 31%; REJECT 20%
Figure 25: Prompt types in EPG_MainMenu_View.
EPG_ManualInput: FREE 3%; REJECT 11%; WAIT 75%; ERROR 11%
Figure 26: Prompt types in EPG_ManualInput.
EPG_ResultList: ERROR 20%; FREE 8%; HELP 12%; WAIT 20%; REJECT 40%
Figure 27: Prompt types in EPG_ResultList.
View: HELP 13%; ERROR 1%; FREE 7%; WAIT 3%; REJECT 76%
Figure 28: Prompt types in View.
Table 12 shows the number of TTS prompts per view. In Figure 25 - Figure 28, charts for the
views with more than 100 TTS prompts are shown. For the other views, on average about one
prompt or fewer per view and session was used, so it is not possible to draw conclusions from
their values.
In Figure 25, the distribution of prompts for the EPG main menu is shown. About one third of
the prompts are WAIT prompts that were used when the wizard reaction took some time (e.g.
free input or when the wizard needed to find the right button). ERROR prompts were for
example used if the search did not yield any result. One third of the prompts were HELP and
FREE prompts that provided help to the user.
As one might expect, the most frequently used prompt type for EPG_ManualInput (Figure 26)
is WAIT. When the user performed a manual input (e.g. selecting an actor), the wizard had to
type the value into the EPG database window by hand and the user had to wait. There are
comparatively few REJECT and ERROR prompts, which means that the wizard could
understand most input values. In some cases, the wizard did not know an actor and could
therefore not use it as a search criterion, which was indicated by an ERROR or REJECT
prompt.
REJECT and ERROR prompts make up more than half of the prompts in
EPG_ResultList (Figure 27). Since it was not clear to the users how to scroll in the result list,
they had to experiment with it, which caused many errors. Moreover, people tried to use
methods that could not be handled by the wizard (e.g. "up, up, down, up" spoken very
quickly). Users could also say the name of a broadcast to select it. Since this took the wizard
some time, he triggered a WAIT prompt.
Finally, most of the prompts used in TV mode (“View”, Figure 28) are reject prompts. Users
tried lots of functions here that they are used to from their TV at home, but that were not
available in the WOZ prototype. This includes for example full-screen mode or brightness
settings. Users then tried the help function to see which commands were available, which led
to a HELP prompt.
All in all, the number of REJECT prompts is very high. The general error message "Sorry, I
could not understand you" does not provide the subject with information about the reason for
the misunderstanding. Therefore, the upcoming prototypes should try to provide more specific
information whenever possible.
ERROR 15%; FREE 3%; HELP 8%; REJECT 47%; WAIT 27%
Figure 29: English: Types of TTS output.
Figure 30: English: Types of TTS output per user.
English:
The distribution is similar to the one of the German subjects (Figure 23). Only the low activity
of one subject (Log020) is mentionable. The number of FREE TTS prompts is smaller than
for the German subjects. This is probably due to the German wizard being cautious using
prompts for native speakers or the lack of necessity for these prompts. Please note that this
result is not significant due to the very small sample.
WAIT 32%; REJECT 8%; HELP 32%; FREE 1%; ERROR 27%
Figure 31: Italian: Types of TTS output.
Figure 32: Italian: Types of TTS output per user.
Italian: As can be seen from Figure 31, most prompts for all subjects were WAIT or HELP
prompts (32% each), followed by ERROR prompts (27%) and REJECT prompts (8%). FREE
prompts were used only in a few cases (1%). This distribution, different from the German and
English ones, is due to the different behavior of the wizard, who preferred to play HELP or
ERROR prompts instead of REJECT messages when subjects had difficulties (even if they did
not explicitly ask for help). Some subjects (indicated by arrows in Figure 32) show a
remarkably different distribution of TTS prompt categories in their log files because of their
particular behavior:
Log005: This subject asked for help several times (21.57% of his "main commands
interactions"4) because he did not find the information on the screen useful to orient himself
(he was one of three people who answered question 21 with score 1). Other subjects (like
4 "Main commands interactions" is a set of 25 commands that are not data-driven (e.g. "morning", "Monday",
"drama", etc.) but are explicitly shown on the GUI and mentioned in the TTS messages (e.g. "search by
channel", "results", "TV-guide", etc.).
subject 7 with 8.62%, subject 11 with 8.11%, and subject 6 with 6.45% of their "main
commands interactions") asked for some help too, but they did not give any negative feedback
to questions 14, 16, or 21 of the questionnaire.
Log009: The bar chart of this subject shows a lot of ERROR messages because she was
a truly inexperienced user and mainly asked to find a program by artist or subject criteria
(14.28% of her "main commands interactions" were requests for an artist or a subject, while
the Italian EPG did not contain these data). Moreover, she did not interact much with the
system by voice: within the second task (adjusting the volume), she pushed the "mute" button.
As a result she did not hear the error messages that the wizard played to dissuade her from
using the RC, so the second task was interrupted and the whole session was the shortest one
(which inflates the percentage of errors).
Log018 and Log019: These subjects tended to ask for a program directly by its name
(without first saying "search by title"), which the wizard often answered with a WAIT prompt
while processing the request. In general, many subjects searched for a program by title or
artist, but when they asked freely for a title or artist not foreseen in the second task goal,
processing these free inputs required some time (which is why half of the sample shows a lot
of WAIT messages).
3.1.6 Barge-In Behavior
There is an entry in the log files for cancelled TTS prompts, which means that the user
interrupted the system either by voice command or remote control input.
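Counting barge-ins from these entries is then a simple scan over the log. The "TTS_CANCELLED &lt;source&gt;" entry format is an assumption made for this sketch, not the actual log syntax:

```python
def count_barge_ins(log_lines):
    """Count cancelled TTS prompts, grouped by interruption source
    (voice command vs. remote control)."""
    counts = {"voice": 0, "remote": 0}
    for line in log_lines:
        parts = line.split()
        if parts and parts[0] == "TTS_CANCELLED":
            # Default to "voice" if no source was logged (an assumption).
            source = parts[1] if len(parts) > 1 else "voice"
            counts[source] = counts.get(source, 0) + 1
    return counts

log = ["TTS begin MainMenuPrompt", "TTS_CANCELLED remote",
       "TTS begin HelpPrompt", "TTS_CANCELLED voice"]
print(count_barge_ins(log))  # {'voice': 1, 'remote': 1}
```

Distinguishing the source matters here: as noted in the Italian results below, many logged interruptions were haptic (RC) rather than vocal.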
German:
On average, every subject interrupted TTS prompts 8.65 times. Users usually interrupted
prompts that are long and that they have heard before. The most commonly interrupted
prompts are the prompts played when the main menu was entered and the recording
confirmation prompt.
As can be seen from Figure 33, there are two users who make heavy use of barge-in
(subjects 1 and 23, marked with a green circle). On the other hand, there are three subjects
(15, 16, and 18, marked in red) who used barge-in fewer than five times. While subjects 15
and 18 state in the questionnaire that they want to have the possibility to interrupt the system,
and therefore are possibly not aware of this feature, subject 16 says that she would wait for
short prompts. One subject (7) states in the questionnaire that she would be willing to wait,
but still used barge-in five times.
Figure 33: German: Number of barge-ins per subject.
Figure 34: English: Number of barge-ins per subject.
English:
Two users (LOG5 and LOG20) used barge-in rarely, and two users (LOG19 and LOG22)
used it frequently.
Italian:
In Figure 35, the results of subjects 6, 11, 12, 15, and 16 have been removed for technical
reasons, so the arrows indicate a failure of the logging rather than an actual value of 0.
Regarding the others, only two subjects used barge-in more than 10 times (9 and 11); on the
contrary, many users (12) interrupted prompts fewer than 5 times, and subjects 18 and 20 did
not use barge-in at all.
In general, Italian subjects used the barge-in functionality less than German users: the average
number of barge-in interruptions for the Italian sample is 4.9.
The annotations show that subjects interrupted the system by voice only a few times; most
interruptions in the log files have to be interpreted as haptic interruptions of the vocal output
via the RC (rather than voice commands).
Figure 35: Italian: Number of barge-ins per subject
3.1.7 User Speech Time
The user speech time, which is the overall time a user was speaking, was examined as well.
The annotations were used as a basis for this evaluation and the speech time is exact for the
German and the English evaluation. For the Italian evaluation, the time needed for each task
was examined, while for the German and English subjects, the time for both tasks was used.
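For the German and English data, the speech time follows from summing the durations of the labeled intervals in the Praat annotations. A minimal sketch, assuming a single interval tier in Praat's long TextGrid format (the actual annotation conventions are those of [13]):

```python
import re

def speech_time(textgrid_text):
    """Sum the durations of non-empty intervals in a TextGrid; intervals
    with an empty text field are treated as silence and skipped."""
    interval = re.compile(
        r'xmin = ([\d.]+)\s*xmax = ([\d.]+)\s*text = "([^"]*)"')
    return sum(float(xmax) - float(xmin)
               for xmin, xmax, text in interval.findall(textgrid_text)
               if text.strip())

sample = '''
    intervals [1]:
        xmin = 0.0
        xmax = 1.25
        text = ""
    intervals [2]:
        xmin = 1.25
        xmax = 3.75
        text = "show action movies"
'''
print(speech_time(sample))  # 2.5
```

Summing per session and dividing by the number of turns would likewise yield the per-turn figures reported below.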
German:
German subjects (Figure 36) had an average speech time of 526 seconds. For one subject (3),
no data was available due to problems with the log file. On average, every subject used 263
words and had 115 turns.
Figure 36: German: User speech time per subject. Red line is average.
English:
The average speech time of the English subjects (Figure 37) was 630 seconds and therefore
somewhat higher than the speech time of the German subjects. On average, every subject used
307 words and had 138 turns.
Figure 37: English: User speech time per subject. Red line is average.
Italian:
Figure 38: Italian: User speech time
As mentioned before, the speech time of the Italian evaluation differs from the German and
English speech times. Since the task completion time is used, it is longer than the actual
speech time.
The first task was longer than the second one: on average, people took 11 minutes to
complete the first task and only 5 minutes for the second one. The reason for this difference
seems to be that in the first task the instructor and the wizard gave people more time to "play"
with the system, since the goal was to find what was interesting for them (some watched TV,
some changed channels, and others read teletext).
3.1.8 Multi-Slot Usage
The subjects could enter arbitrary queries, such as "Please show me what's on TV tonight
from genre action and with actor Brad Pitt." The wizard could understand these queries, enter
them into the query window, and show the results to the user. Queries that fill more than one
“slot” of a query are called “multi-slot” queries. The example above contains the slots time
(“tonight”), genre (“action”), and actor (“Brad Pitt”). This query could also have been
performed using three single-slot queries, but multi-slot queries are more convenient and
natural for the user.
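The slot-filling idea can be illustrated with a toy extractor. The keyword lists below are invented for the example; the real prototype matched user input against the loaded grammars.

```python
# Toy keyword lists; placeholders, not the prototype's actual grammars.
SLOT_KEYWORDS = {
    "time": ["tonight", "tomorrow", "morning"],
    "genre": ["action", "drama", "comedy"],
    "actor": ["brad pitt"],
}

def extract_slots(utterance):
    """Return the slots an utterance fills; filling more than one slot
    makes it a multi-slot query."""
    text = utterance.lower()
    return {slot: kw
            for slot, keywords in SLOT_KEYWORDS.items()
            for kw in keywords if kw in text}

query = "Please show me what's on TV tonight from genre action and with actor Brad Pitt."
slots = extract_slots(query)
print(sorted(slots))   # ['actor', 'genre', 'time']
print(len(slots) > 1)  # True: a multi-slot query
```

The same query issued as three single-slot commands would fill the same slots, only in three separate turns.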
During the experiments, it quickly became clear that users do not use multi-slot queries
without having been informed about this feature. Therefore, a prompt was introduced that tells
the subjects to use multi-slot queries. When the wizard realized after some time that no multi-
slot queries occurred, he could trigger this prompt.
German:
For the first five subjects, no prompts were played and no multi-slot queries were executed by
the users. Starting with subject 5, users were hinted at the availability of the multi-slot feature.
Only one subject (LOG13) used multi-slot before this prompt. When using multi-slot after the
prompt, subjects usually repeated the help prompt word for word and then adjusted it to their
needs in later queries.
Different kinds of behavior can be observed. First, some subjects only used this feature once
(e.g. LOG8, LOG14, LOG17, or LOG24). Other users used multi-slot extensively after they
had been informed about this feature (e.g. LOG7, LOG16, or LOG18). Finally, some subjects
did not use multi-slot even when they knew about it.
Figure 39: German: Multi-slot evaluation.
English:
The results are similar for English subjects. There are subjects who had not heard the prompt
and did not use multi-slot, a subject who heard the prompt and used multi-slot, and a subject
who heard the prompt more than once, but did not use multi-slot.
Figure 40: English: Multi-slot evaluation.
Italian:
Only a few people (4) used the multi-slot functionality. In particular, two subjects, while the
screen was still black (a few seconds before the "Welcome screen" appeared), started speaking
in long, natural phrases in which they mixed different search criteria (subject 8 used
"CHANNEL" and "DAY", subject 17 used "DAY" and "TIME"); but once they saw the GUI,
they kept to one criterion at a time, waiting for the system feedback. Another person once
used two criteria together, but in a schematic way (<keyword+value> and <keyword+value>),
and the last one tried to use several search criteria only after the wizard, through a help
prompt, explained how to use free input: this subject copied the model from the prompt, used
it only once, and then went back to single slot.
In many sessions, when people spoke freely, even keywords that matched commands were
classified as "Off-Talk", because the different volume and tone of voice showed that the
subjects were reading (it was clear they were not giving different search criteria using the
multi-slot approach).
3.1.9 Off-Talk
Off-talk is the part of the speech that does not address the system; it includes
exclamations and interjections as well as keywords that are simply read from the screen. The
diagrams in this section show the number of off-talk words.
German:
There are three groups of users regarding off-talk. The first group did not use any off-talk
(e.g. LOG1, LOG6, LOG7, …). The second group used some off-talk, but not extensively
(e.g. LOG2, LOG8, …). The third group produced a large number of off-talk words (more
than 100; LOG3 and LOG21).
[Figure: bar chart of the number of off-talk words per German subject (LOG1–LOG24); y-axis 0–180.]
Figure 41: German: Off-talk.
English:
Most English subjects had only a small number of off-talk words (LOG5, LOG20, and
LOG22). LOG19 had about twice as many as the other subjects, but all English subjects still
used far fewer off-talk words than the third group of the German subjects.
[Figure: bar chart of the number of off-talk words for the English subjects LOG5, LOG19, LOG20, and LOG22; y-axis 0–20.]
Figure 42: English: Off-talk.
Italian:
The Italian data are given as percentages, because each subject's “off-talk” has been
compared to that subject's overall number of interactions.
The majority of the sample (14 subjects) tended to comment aloud while speaking; one third
of them uttered some commands with a different volume and tone of voice, which were
classified as “Off-Talk” because the intention was clearly to comment on the system behavior
(not to give many commands at the same time). In particular, one person uttered a great deal
of “Off-Talk” (subject 7: 67% of the total interactions), making personal comments like
“well…”, “so”, “let's see…” as well as reading out the outputs on the screen or the search
criteria (“artist, genre…”).
[Figure: bar chart of off-talk as a percentage of each Italian subject's interactions (subjects 1–20); y-axis 0–80%.]
Figure 43: Italian: Off-talk.
3.1.10 Overlaps
Overlaps are parts of the annotations where both the subject and the system are speaking at
the same time (only the TTS output is considered).
Overlaps occur when the ASR does not recognize the user's speech (conversely, when a
subject utters a word or sentence the system understands, the system stops speaking). It is
important to analyze when and where overlaps appear most frequently, because this can be
used to improve the design of the dialog and to induce a more appropriate answer from the
user.
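The definition above can be made concrete with a small sketch (hypothetical code, not part of the actual annotation tooling), which derives overlap intervals from user and TTS speech segments given as start/end times:

```python
# Hypothetical sketch: deriving overlap intervals from annotation segments.
# A segment is a (start, end) pair in seconds; the function name and data
# layout are illustrative assumptions, not the project's actual tooling.

def find_overlaps(user_segments, tts_segments):
    """Return intervals where a user segment intersects a TTS segment."""
    overlaps = []
    for u_start, u_end in user_segments:
        for t_start, t_end in tts_segments:
            start, end = max(u_start, t_start), min(u_end, t_end)
            if start < end:  # non-empty intersection: both are speaking
                overlaps.append((start, end))
    return overlaps

# Example: the subject starts answering while the TTS prompt is still running.
print(find_overlaps([(4.0, 6.0)], [(0.0, 5.0)]))  # [(4.0, 5.0)]
```

Counting such intervals per subject, and noting where they fall in the session, yields diagrams like Figure 44 and Figure 45.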
In Figure 44, the overlaps of the German and the English users are shown in a diagram; at the
end, the sum over all subjects is shown. As one can see, overlaps are concentrated at the
beginning of the recordings: many subjects kept speaking while the system was still reading
the very long introductory text.
Figure 44: German and English: Overlaps.
Italian:
[Figure: bar chart of the number of overlaps per Italian subject (subjects 1–20); y-axis 0–16.]
Figure 45: Italian: Overlaps.
Regarding the Italian sample, 13 subjects spoke over the system at least once, in particular
during the long introductory message and a very long help prompt (“To search for a program,
you can use all the search criteria listed on the screen. Moreover you can always ask me:
Back, to go back, Restart to reset the criteria, or Search Now to start a search.”). The majority
of people did not produce many overlaps: only three users were a little more impatient and
spoke more often while the system was generating an output (8 and 14 times, whereas the
other subjects overlapped only two to five times in total).
3.2 Observation of the Wizards
Some results cannot be derived directly from the logging data, but are subjective impressions
of the wizards. They still have to be considered, since these observations were shared by
different wizards and across different sessions.
3.2.1 People have to be encouraged to use natural language input
In almost every recording, people initially operated the DICIT system by reading out the
commands shown on the screen. Consider for example the EPG filter menu: subjects used to
read out the name of a filter and then the subsequent value (e.g. “Day” followed by
“Wednesday”). They did not know that free text input was possible.
Therefore, the wizard played a help prompt that explained to the subjects that natural language
input was possible. This help prompt read: “E.g. say ‘What is on RTL tonight?’ or ‘Are there
any comedies tomorrow?’”. Usually, German and English subjects started off reading out this
prompt word by word, but adjusted it to their demands afterwards. In contrast, Italian subjects
did not follow the suggestion of this help prompt (only one subject copied the model from the
prompt word by word and then did not use it anymore), and even those who started with a
natural language sentence adapted their behavior to a system-driven interaction.
Thus, even though this help message told people that natural language sentences could be
used, it is interesting to note that, without an introductory text or even an automatic help
function (or tutorial movie) encouraging this kind of verbalization, users did not use natural
language input spontaneously, and many of them did not seem to dislike a system-driven
interaction.
3.2.2 People tend to use simple commands
People tend to use simple commands that can be mapped directly to remote control commands
(called low-level above) or predefined speech commands (high-level) most of the time. They
usually do not use complex or concatenated commands; they seem to use the commands
shown on the screen and do not try to combine them. Moreover, many commands are on a
“widget level”, for instance “cursor down/cursor down/cursor down” instead of “put cursor on
broadcast ‘Formula 1’”. Users might be encouraged to use more complex commands, or
speech shortcuts, by a help prompt that gives an example.
3.2.3 Some people use Barge-In, others do not
Some subjects are very impatient and interrupt TTS prompts that they have heard a
number of times (e.g. in the main menu or the confirmation view). Others do not use this
feature as often, which might also be due to the fact that they are not as experienced in
operating an STB system. On the other hand, when subjects use the RC to interrupt a non-
recurring TTS message, the interruption is due to problems, which the subjects try to resolve
using an ordinary RC interaction.
3.2.4 Reset function not self-explanatory
Many subjects had problems using the reset function. One could either reset all filters (by
saying “Reset” or pressing the yellow button) or single ones by saying “Reset [Filter]”, which
was explained neither in the introduction nor in a help prompt. Therefore, people had
problems resetting single filters and tried to do so by entering the filter sub-screen and saying
“reset”. Some users also seemed to enter filter sub-screens in search of a reset function and
left them immediately after realizing that it was not there. Reasons for these problems might
also be that the term “Reset” was not clear to the German subjects or that people did not want
to reset all filters, but only single ones.
3.2.5 Remote Control is hardly used
When people realize that speech input works well, they do not use the remote control any
more. But when they encounter problems with speech input repeatedly, they switch back to
the remote control. Still, most users stick with the voice control and do not touch the remote
control any more.
4. Conclusions for Subsequent Prototypes
In this chapter, we want to draw conclusions from questionnaire observations (cited as
“Questionnaire [Number]”), the log file evaluation, and wizard impressions (both referenced
with the section number). The feedback on the prototype was positive. People found the WOZ
prototype easy and fun to use and had no problems using it (Questions 13, 14, 17, and 18). As
a general impression, the system appears easy, efficient, original, capable, well organized, and
patient.
4.1 Overall Conclusions
Subjects describe the DICIT system as neither too active nor too formal (Question 31). While
some subjects would prefer either a more active or a less formal system, the WOZ system
should represent a good average solution for most people. Alternatively, the interaction style
could be configurable or even adapt automatically to the current user.
Altogether, help should be improved, since it does not get good ratings (Question 16). While
only five subjects (two German and three Italian) cancelled the welcome prompt, it seems that
subjects did not really believe that DICIT is “quite clever”, as said in the introduction. While
German subjects started using complex multi-slot queries once they were introduced by a help
prompt, Italian subjects did not use multi-slot queries even when they had heard the help
prompt, and those who started talking freely with the system before the DICIT welcome
prompt adopted a more passive behavior when interacting with the actual system (Sect. ”Multi
slot usage”). Moreover, some subjects did not use barge-in, but requested this feature in the
questionnaire (Question 28); these behaviors and answers indicate that most subjects had
different expectations about the way the system works. Therefore, more detailed and active
help could increase the number of multi-slot queries and make people aware that vocal
interaction is an efficient alternative to RC interaction.
4.2 Dialog and Menu Structure
Users found the system easy and fun to use. Colors and font sizes got positive feedback in the
questionnaire (Question 17 and respective comments).
German subjects did not make use of the recording list a lot (Sect. ”Screen preferences of the
users”). Possibly, the concept was not obvious or people did not require this feature, since it
was not part of the task. The first interpretation of German data logs is also supported by the
behavior of Italian subjects, because (even if this information was not available in the Italian
log files) few Italian subjects (who “recorded” a lot of programs because of the first task goal)
really used the recording list to control or delete their “recordings”. Still, people should be
made aware of all features provided by DICIT, for instance by special help prompts (e.g. “Do
you know the recording list?”).
Moreover, the reset function in the EPG main menu was not clear for the German sample.
Some users remarked this problem in the questionnaire (comments to Questions 13 and 17)
and the wizards support this impression. Therefore, this feature should either be changed or
explained in a better way. Adding a “Reset” function to the filter sub-screens (views
EPG_ChooseFilter and EPG_ManualInput) could improve the usability of the system.
4.3 Speech Dialog
All in all, subjects prefer voice input to remote control input, which is supported by the
questionnaire (Question 24) and the recordings (Sect. ”Remote control vs Voice control”).
The subjects prefer short commands (Question 25) and consider long commands useful only
for beginners (one comment on Question 25). On the other hand, people seem not to have
been aware of how powerful the WOZ system was and did not use more complex commands
for that reason. German subjects started using multi-slot queries after they had been given an
example by a help prompt (Sect. ”Multi slot usage”).
The subjects prefer smooth dialogs and do not want DICIT to interfere with their interaction
with the system. Therefore, both short output prompts and short input commands should be
used (Questions 25, 27). People prefer speech as an input means; they state that feedback
should either be visual or an instant reaction, but not voice-only (Question 26).
Moreover, German subjects and most of the Italian subjects do not like long TTS prompts and
stated that DICIT should not read out what is on the screen (Question 27). On the other hand,
45% of the Italian sample stated that it would be preferable to have the list of programs read
out as well, because this feature could benefit both impaired people and “normal” persons
who prefer not to stay in front of the TV while consulting the EPG. This also applies to the
help function, which should be on the screen in any case. On the other hand, subjects want to
be able to interrupt TTS output by means of barge-in (Question 27 and Sect. ”Barge-in
Behavior”). Barge-in is not required if the prompts are kept short enough; if long prompts are
used, the users have to be made aware that they can interrupt the system. The TTS voice was
rated low (Question 30.3) and should be improved.
When asked about this feature, people wanted to be able to switch off the recognizer
(Question 29), mainly during a conversation with another person. The use of a keyword to
address the DICIT system could render this feature redundant.
There were also remarks regarding TTS and the mute function (Question 21). First, TTS
should not be muted when the TV sound is muted. Second, it should be possible to mute the
system completely.
4.4 Remote Control
A mixed-mode operation (voice and remote control) was rated low by the German sample
(Question 15), and only subjects 12 and 21 made use of both voice input and the remote
control. On the contrary, even though most of the Italian subjects did not use the RC at all,
this feature had a good rating for the Italian sample; that is, the Italian subjects expressed an
expectation rather than actual experience. Remote control operation was not considered easy
(Question 19) and should therefore be improved. However, since the focus of this study was
on voice control, a complete and flawless remote control operation was not its main objective.
German subjects have a strong dislike for a “virtual keyboard” for free-text input (Question
23) and want to have a voice input method that can be used to select all titles, actors, and
subjects. On the contrary, most of the Italian subjects think that this is an alternative way of
interaction, which can simplify the search for artists or titles when these are difficult to
remember or to pronounce (e.g. because of a foreign language).
4.5 Considerations Among the Two Samples
In general, most Italian people are not used to interacting with the TV (that is, to using the TV
set not only to watch programs, but also to read information), and this is true especially for
elderly people, for people with a low educational qualification, and for people whose
occupation is not related to office work (in Italy the “digital divide” is still high: to date,
digital television is not fully deployed and PCs are not yet widespread in all households).
The Italian sample was selected such that it represents the whole population. Hence, two
thirds of the subjects belong to the above-mentioned categories; the underlying idea is that the
TV is a device potentially used by everybody.
The obtained results are consistent with this choice; in particular, this can explain why few
subjects either spoke in complex phrases or adopted a more “natural” interaction with the
system (even after hearing the help prompt that explained a freer way of interacting).
Although they felt comfortable using a very advanced and “natural” interaction mode (i.e.,
voice), most of them showed a “passive” behavior when approaching the EPG selection,
because they usually consult a paper guide or “surf” channels to select a program instead of
looking up an EPG.
Even though German subjects are not used to interacting with the TV by voice or multimodal
input, the German sample – and especially subjects who were more familiar with PCs or
interactive TV – showed little confidence that the system could understand complex or
natural phrases. Inexperienced subjects tried to have a more “natural” interaction with the
system, whereas experts explicitly tried complex phrases to see how capable the system was.
Appendix A – Microphone Arrays
Harmonic Nested Array
A microphone array performs spatial sampling of a wavefield. Spatial aliasing, which is
analogous to temporal and spectral aliasing, can be avoided if the microphone spacing d
satisfies the following inequality:

    d ≤ λ_min / 2,

where λ_min is the minimum wavelength in the signal of interest [17].
The nested array implemented for DICIT consists of four sub-arrays. Table 13 shows the
spatial aliasing limits for the four sub-arrays. The maximum frequency is given by
f_max = c / λ_min, where c = 344 m/s is the sound velocity in air.
Sub-array no.   Distance [m]   Minimum Wavelength [m]   Maximum Frequency [Hz]
1               0.04           0.08                     4300.0
2               0.08           0.16                     2150.0
3               0.16           0.32                     1075.0
4               0.32           0.64                     537.5
Table 13: Spatial aliasing limits of the sub-arrays
The frequencies shown in the table are harmonics, hence the name harmonic nested array.
The microphone spacing of each sub-array was chosen such that each sub-array covers one
octave. This structure allows for greater flexibility when combined with the array signal
processing algorithms.
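The aliasing relations above (λ_min = 2d, f_max = c/λ_min) can be checked with a short sketch (Python; the helper name is ours, not part of the DICIT software) that reproduces the values in Table 13:

```python
# Reproduce the spatial-aliasing limits of Table 13: for microphone
# spacing d, aliasing is avoided up to lambda_min = 2*d, i.e. up to
# f_max = c / lambda_min, with c = 344 m/s the speed of sound in air.

C_SOUND = 344.0  # m/s

def aliasing_limit(spacing_m, c=C_SOUND):
    """Return (minimum wavelength [m], maximum frequency [Hz])."""
    lambda_min = 2.0 * spacing_m
    return lambda_min, c / lambda_min

# The four DICIT sub-array spacings, each covering one octave.
for i, d in enumerate([0.04, 0.08, 0.16, 0.32], start=1):
    lam, f_max = aliasing_limit(d)
    print(f"sub-array {i}: d = {d:.2f} m, "
          f"lambda_min = {lam:.2f} m, f_max = {f_max:.1f} Hz")
```

Halving the spacing doubles the maximum alias-free frequency, which is why the four spacings of the nested array line up octave by octave.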
NIST MarkIII Array
The array uses 64 electret microphones installed in a modular environment. Two main
components constitute the system: a set of microboards for recording the signals and a single
motherboard that transmits the digital data over the network. There are eight microboards in
the array, and each microboard is connected to eight microphones. The first step performed
by a microboard is the polarization of the microphones and the amplification of the signals:
electret microphones need phantom power to work properly and provide only a low-voltage
signal, so the microboard conditions the signals before they are converted into digital format.
The digitization of the audio signals is done on each microboard, using four dedicated stereo
analog-to-digital converters. Placing the A/D converters as close as possible to the
microphones is crucial to obtain a sufficiently small input noise level, which for the Mark
III array is x dB relative to the maximum level.
Preliminary experiments conducted on the original array had shown that the coherence
between a generic pair of signals was biased by common-mode electrical noise, which proved
detrimental to time-delay estimation techniques used to co-phase signals or to localize
speakers. Therefore, a hardware intervention was carried out to remove each internal noise
source from the analog modules of the device [3].
Appendix B – The Questionnaire
PERSONAL DATA (USER TYPE DEFINED BY PRE-QUESTIONNAIRE): Expert / Non-expert
SOME QUESTIONS FOR STATISTICAL PURPOSES
1.You are
male
female
2. What is your educational qualification?
Primary school
Middle school certificate
Secondary school certificate
Degree
3.Your age
20-30
31-40
41-50
51-60
more than 60
4.Your profession
Businessman Freelancer
Manager Executive
Employee Factory worker
Trader Agent
Craftsman Housewife
Student Retired
Working/studying area______________________________
YOUR HABITS WATCHING TV AT HOME (choose only one answer to each question)
5. How many people live in your house, including you?
I live alone
2 people
3 or more people
6. How many TVs do you have in your house?
1
2
3 or more
no TV, but I use Net-TV through the PC
7. Usually, who decides what to watch on TV?
Only one person
We all decide together
The majority decides
Each person has a TV
8. Usually, how do you decide which programme to watch?
Looking up the teletext
Looking up a newspaper/TV programme guide/the internet
Looking up the electronic programme guide (EPG)
Channel surfing
9. Which type of television do you usually watch?
“Traditional” (analogue) – jump to question 11
Satellite
Digital terrestrial
IPTV
10. How do you usually select a programme?
With the numeric button of the remote control
With the program up/program down button on the remote control
Personal code:
Through the electronic programme guide (EPG)
Scheduling the visualization
11. Which information matters to you when choosing a programme?
Genre (film, sport, tv series, news, etc)
Actor/big names
Topic/subject
Duration
Channel
I don't care, because I channel surf
12. Usually you use the TV:
To watch current TV programmes
To watch programmes I have recorded
To watch video on demand (VOD)
To watch purchased DVDs
As background during other activities
I channel surf
Other (specify) ______________________
DICIT questionnaire
WE ASK YOUR OPINION ABOUT THE SYSTEM YOU HAVE JUST TESTED, EVALUATING EACH OF THE FOLLOWING ASPECTS:
USING DICIT SYSTEM
13. It was easy to understand how to use the different selection criteria given by the system.
Very Difficult 1 2 3 4 5 6 7 8 9 10 Very Easy
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
14. It was easy to understand how to give all the vocal commands.
Very Difficult 1 2 3 4 5 6 7 8 9 10 Very Easy
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
15. It was comfortable to give some information by voice and other information with the remote control.
Very Uncomfortable 1 2 3 4 5 6 7 8 9 10 Very Comfortable
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
16. In case of problems, did the system usefully and efficiently suggest what to do to recover after an error?
Very Useless 1 2 3 4 5 6 7 8 9 10 Very Useful
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
WATCHING THE SCREEN
17. Is the screen which shows the criteria for the programme search easy to read?
Very Difficult 1 2 3 4 5 6 7 8 9 10 Very Easy
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
18. Was it easy to understand how to use, by voice, the search criteria for programmes shown on the screen?
Very Difficult 1 2 3 4 5 6 7 8 9 10 Very Easy
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
19. Was it easy to understand how to use the remote control to select the search criteria for programmes?
Very Difficult 1 2 3 4 5 6 7 8 9 10 Very Easy
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
20. To accomplish the task we assigned to you, did you miss any other vocal commands?
NO / YES: write which ones
List the missing commands
……………………………………………………………………………………………………………………………………………………………………………………………… ………………………………………………………………………………………………
21. Did you find the information on the screen useful for orienting yourself, in case you disabled the audio?
Very Useless 1 2 3 4 5 6 7 8 9 10 Very Useful
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………
………………………………………………………………
WATCHING THE SCREEN
22. Do you find it useful that the list of previously selected criteria is always shown?
Very Useless 1 2 3 4 5 6 7 8 9 10 Very Useful
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
23. Do you find useful a function that allows you to enter a precise word through the remote control to search for programmes?
Very Useless 1 2 3 4 5 6 7 8 9 10 Very Useful
Explain the reasons of your answer
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
USING THE VOCAL INTERACTION
24. How do you judge the opportunity to use vocal commands?
o Very useful
o Useful if used with the remote control
o Useful if it replaces the remote control
o Useful if it allows me more operations than the remote control
o I would never use vocal commands
comments
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
25. For the vocal commands you prefer:
o Using full sentences
o Using short commands
o Having some precise commands to read on the screen
comments
……………………………………………………………………………………………………………………………………………………………………
26. When you give a vocal command to the system, you prefer:
o To have only a video feedback
o To have only a vocal feedback
o To have both vocal and video feedback
o To have an immediate action of the system (without any previous feedback)
comments
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
LISTENING TO THE SYSTEM VOICE
27. Do you find it useful that the system reads out (in addition to listing them on the screen) the programmes found by your search?
o Yes
o Yes, only if they are not too many
o No
comments ………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
28. If you prefer a system which gives you vocal feedback:
o I would like to have the option to interrupt the system every time I give a command
o I would be happy to wait for the system to finish speaking before giving my command
comments ………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
29. Would you like to have a button to enable/disable the vocal recognizer?
No
Yes
comments
………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………………
YOUR GENERAL OPINION
NOW THAT YOU KNOW THE SYSTEM, WE ASK YOU FOR SOME GENERAL OPINIONS ABOUT EACH OF THE FOLLOWING ASPECTS.
YOUR EXPERIENCES WITH DICIT: Give your opinion about DICIT by crossing the box that best describes your degree of agreement with each of the following phrases describing the service.
Complete agreement
Complete disagreement
1. I think that the system is easy to use
2. It makes me confused when I use it
3. I like the voice
4. I think that the system needs too much attention to interact vocally
5. I have the impression not to control the dialogue with the system
6. I have to focus on using it with the remote control too
7. I think that the speech interaction is efficient
8. By using the voice it is easier to search for programmes
9. The system voice speaks too quickly
10. The selection criteria which appear on the screen are not clear
11. I think that it is funny to use
12. I prefer using traditional ways (TV guide, teletext, newspaper) to search for an interesting programme
13. I think that this system needs some improvements
DICIT FOR YOU WAS
1.easy complicated
2.efficient inefficient
3.quick slow
4.original copied
5.precise vague
6.capable incapable
7.formal informal
8.active passive
9.friendly unfriendly
10.determined undetermined
11.kind unkind
12.clever silly
13.organized disorganized
14.patient impatient
Appendix C – The WOZ Instructions at EB
Test description
First of all: Thank you very much for taking part in our tests that aim at improving our new
TV system.
We have here a prototype of a set-top box that can do a little more than an ordinary TV set.
The new thing is that the device understands spoken language and that it comes with an
integrated TV program guide. As already said, it is a prototype, and your taking part in the
test will help us to further improve it. So, thanks again.
During the test we are going to collect speech samples to find out how people interact with the
system in two situations: while selecting a TV broadcast and while watching TV. We ask you
to interact with the system as naturally as possible, which means you should speak as
normally as possible – just like you would talk to me, for example. We do not want to
evaluate your interaction with the device but to improve the device itself, so don't worry:
there is no “right” or “wrong” in your behavior.
The data we are collecting from you and the other participants will be used to find out how
well the speech interaction is working. Therefore, you will be recorded with a room
microphone and a head set. Additionally, we are going to make video recordings.
All in all, you are going to get three tasks that we ask you to perform during the test. But your
job is NOT to solve the problem as fast as possible, but to test the system thoroughly.
For your first task you have 15 minutes and you are asked to find your favorite TV broadcast.
So take these 15 minutes to deeply test the system.
For each of the other tasks you have 7 minutes. Here you are asked to select a broadcast from
the TV program under a certain aspect – which aspect remains your choice. Here too, the goal
is not to solve the problem as fast as possible.
For each task you can use both speech and the remote control.
During the session, I will leave the room.
So, if there are any problems during the test, I cannot help, but the system can. Feel free to
find out how.
After the test I would like to ask you to fill in a questionnaire. Here I will help you.
In the questionnaire, we ask about your experience and impressions during the test. Knowing
how you would like to interact with the system, we can take that into account during further
development. Your impressions and experiences with the prototype are processed and the
results are used to improve the system. Your answers, of course, will remain anonymous.
Now, you have about 15–20 minutes to “play” with the system. I will then come and pick you
up for the questionnaire. Good luck and have fun.
Appendix D – List of Predefined TTS Prompts

The prompts are listed per class, each with its German (DE), English (EN) and Italian (IT)
version.

Class WAIT
  1. DE: Ihre Eingabe wird verarbeitet. Bitte warten Sie einen Moment.
     EN: Your input is being processed. Please wait a moment.
     IT: Per favore, attendi qualche istante.

Class ERROR
  1. DE: Dieser Sender ist nicht verfügbar.
     EN: This station is not available.
     IT: Quest'emittente non è disponibile.
  2. DE: Es gibt zu viele Ergebnisse für diese Suche. Wollen Sie Ihre Suche verfeinern?
     EN: There are too many results, do you want to refine the search?
     IT: Ci sono molti programmi, vuoi perfezionare la ricerca?
  3. DE: Leider hat Ihre Suche keine Treffer.
     EN: Sorry, your search did not yield any result.
     IT: Mi dispiace, questa ricerca non ha dato risultati.
  4. DE: Ich habe nur Programminformationen für die nächsten sieben Tage.
     EN: Sorry, I only have information on the program of the next seven days.
     IT: Mi dispiace, sono disponibili solo i programmi dei prossimi sette giorni.
  5. DE: Leider ist der gewählte Sender nicht in meinen EPG-Daten.
     EN: Sorry, the requested channel is not included in my EPG data.
     IT: Mi dispiace, questo canale non è presente nella Guida TV.
  6. DE: Leider habe ich keine Information über den gewählten Sender.
     EN: Sorry, I have no information on the requested channel.
     IT: Mi dispiace, non ho informazioni per questo canale.
  7. DE: Leider ist der gewählte Titel nicht in meinen EPG-Daten.
     EN: Sorry, the requested title is not included in my EPG data.
     IT: Mi dispiace, questa trasmissione non è presente nella Guida TV.
  8. DE: Leider habe ich keine Information über den gewählten Titel.
     EN: Sorry, I have no information on the requested title.
     IT: Mi dispiace, non ho informazioni per questa trasmissione.
  9. DE: Leider ist das gewünschte Genre nicht in meinen EPG-Daten.
     EN: Sorry, the requested genre is not included in my EPG data.
     IT: Mi dispiace, questa categoria di programmi non è presente nella Guida TV.
  10. DE: Ich habe keine Informationen zum gewünschten Genre.
      EN: Sorry, I have no information on the requested genre.
      IT: Mi dispiace, non ho informazioni per questa categoria di programmi.
  11. DE: Leider ist der gewünschte Schauspieler nicht in meinen EPG-Daten.
      EN: Sorry, the requested artist is not included in my EPG data.
      IT: Mi dispiace, quest'artista non è presente nella Guida TV.
  12. DE: Leider habe ich keine Informationen zum gewünschten Schauspieler.
      EN: Sorry, I have no information on the requested artist.
      IT: Mi dispiace, non ho informazioni per quest'artista.
  13. DE: Leider gibt es keine Einträge in meinen EPG-Daten zu diesem Schlagwort.
      EN: Sorry, the requested subject is not included in my EPG data.
      IT: Mi dispiace, questa tipologia di contenuti non è presente nella Guida TV.
  14. DE: Leider habe ich keine Informationen zum gewünschten Schlagwort.
      EN: Sorry, I have no information on the requested subject.
      IT: Mi dispiace, non ho informazioni per questa tipologia di contenuti.
  15. DE: Diese Sendung befindet sich bereits in der Aufnahmeliste.
      EN: This broadcast is already scheduled for recording.
      IT: Per questa trasmissione è già programmata una registrazione.
  16. DE: Diese Sendung wurde noch nicht aufgezeichnet.
      EN: This broadcast has not been recorded yet.
      IT: Questa trasmissione non è ancora stata registrata.
  17. DE: Die Fernbedienung funktioniert nicht. Bitte benützen Sie die Spracheingabe.
      EN: The remote control is not working. Please use speech to control the system.
      IT: Il telecomando non funziona. Per favore, usa i comandi vocali.
  18. DE: Das Fernbedienungssignal ist zu schwach.
      EN: The remote control signal is too weak.
      IT: Il segnale del telecomando è troppo debole.
  19. DE: Ein schwerwiegender Systemfehler ist aufgetreten. DICIT startet gerade neu.
      EN: A fatal error occurred. DICIT is restarting.
      IT: Errore di sistema. DICIT deve riavviarsi.

Class HELP
  1. DE: Sie können die Fernbedienung wie gewohnt benutzen, aber Sie können auch mit mir
         sprechen und mir sagen, was Sie gern tun würden.
     EN: You can use the remote control as usual, but you can also speak to me and tell me
         what you would like to do.
     IT: Puoi usare il telecomando normalmente, ma puoi anche parlarmi e dirmi cosa fare.
  2. DE: Sie können folgende Befehle sagen: Hilfe, zurück, neu starten oder jetzt suchen.
     EN: You can say: help, back, restart, or search now.
     IT: Puoi dirmi: AIUTO, INDIETRO, CAMBIA, o IN ONDA.
  3. DE: Sagen Sie zum Beispiel "Was kommt heute Abend auf RTL?" oder "Gibt es morgen
         irgendwelche Komödien?"
     EN: For example, say "What is on RTL tonight?" or "Are there any comedies tomorrow?"
     IT: Per esempio, dì: "Cosa c'è su RAI 1 questa sera?" o: "Ci sono delle commedie
         domani?"
  4. DE: Sie können mir auch sagen, nach welcher Sendung Sie suchen.
     EN: You can also tell me which broadcast you are looking for.
     IT: Puoi anche dirmi quale programma stai cercando.
  5. DE: Bitte sagen Sie mir, wie ich die Lautstärke für Sie verändern soll.
     EN: Please tell me how to change the volume for you.
     IT: Quanto devo modificare il volume?
  6. DE: Um ein Programm zu suchen, können Sie alle Suchkriterien auf dem Bildschirm
         benutzen.
     EN: To search for a program, you can use all the search criteria listed on the screen.
     IT: Scegli qualsiasi criterio di ricerca elencato. Puoi anche dire INDIETRO o CAMBIA
         per reimpostare la ricerca, Elenco o Videoteca per programmi non trasmessi ora,
         oppure In Onda per quelli in onda adesso.
  7. EN: Moreover, you can always say: Back to go back, Restart to reset the criteria, or
         Search Now to start a search.
  8. DE: Sie können im EPG durch die Angabe von Sender, Zeit, Titel, Genre, Schauspieler
         oder Schlagwort suchen.
     EN: You can search the EPG by specifying channel, time, title, genre, actor or subject.
     IT: Cerca nella Guida TV con il canale, l'orario, il titolo, la categoria, l'artista o
         il contenuto.
  9. DE: Sie können einen Eintrag aus der Liste zum Anschauen oder zum Speichern in der
         Aufnahmeliste auswählen.
     EN: You can select an item from the list and watch it or save it for recording.
     IT: Puoi scegliere uno dei programmi elencati per guardarlo o registrarlo.
  10. DE: Bitte nennen Sie mir z.B. eine Zeit, einen Sender, ein Genre oder ein Schlagwort,
          um die Suche zu verfeinern.
      EN: Please tell me, for example, a time, a channel, a genre or a subject to refine
          the search.
      IT: Per favore indicami l'orario, il canale, il titolo, la categoria o il contenuto.
  11. DE: Hier sind Ihre Ergebnisse.
      EN: Here are the requested results.
      IT: Ecco cos'ho trovato!

Class REJECT
  1. DE: Wie bitte?
     EN: Pardon?
     IT: Puoi ripetere?
  2. DE: Leider konnte ich Sie nicht verstehen.
     EN: Sorry, I did not understand you.
     IT: Scusa, non ho capito.
  3. DE: Diese Funktion ist leider nicht verfügbar.
     EN: Sorry, this function is not available.
     IT: Spiacente, questa funzionalità non è disponibile.
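The prompt classes above amount to a small localization table mapping a (class, language) pair to a list of prompt texts. The following sketch illustrates one possible representation of such a table; it is not code from the DICIT prototype, only a few entries are shown, and all names (PROMPTS, get_prompt) are hypothetical.

```python
# Illustrative lookup table: (prompt class, language code) -> prompt texts.
# Entries are a small subset of Appendix D; structure and names are hypothetical.
PROMPTS = {
    ("WAIT", "en"): ["Your input is being processed. Please wait a moment."],
    ("WAIT", "it"): ["Per favore, attendi qualche istante."],
    ("REJECT", "de"): ["Wie bitte?", "Leider konnte ich Sie nicht verstehen."],
    ("REJECT", "en"): ["Pardon?", "Sorry, I did not understand you."],
    ("REJECT", "it"): ["Puoi ripetere?", "Scusa, non ho capito."],
}

def get_prompt(prompt_class, lang, index=0):
    """Return one prompt of the given class, falling back to English
    when the requested language has no entry for that class."""
    texts = PROMPTS.get((prompt_class, lang)) or PROMPTS.get((prompt_class, "en"), [])
    if not texts:
        return ""
    # Fall back to the first variant if the requested index does not exist.
    return texts[index if index < len(texts) else 0]

print(get_prompt("REJECT", "it"))  # -> Puoi ripetere?
```

A fallback to English, as sketched here, keeps the dialogue running even when a localized prompt is missing for one class.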
Appendix E – Screenshots of the Views

Screenshots are provided for the following views:
- WelcomeScreen
- EPG_MainMenu_View
- EPG_ChooseFilter
- EPG_ManualInput
- EPG_ResultList
- EPG_RecordingList
- News
- EPG_Confirmation
Bibliography
[1] Distant Talking Interfaces for Control of Interactive TV
“Annex I - Description of Work”
31-May-2006
[2] Cedrick Rochet, URL:
www.nist.gov/smartspace/toolChest/cmaiii/userg/Microphone_Array_Mark_III.pdf
[3] Luca Brayda, Claudio Bertotti, Luca Cristoforetti, Maurizio Omologo, and
Piergiorgio Svaizer. “Modifications on NIST MarkIII array to improve coherence
properties among input signals.”
AES, 118th Audio Engineering Society Convention. Barcelona, Spain, May, 2005.
[4] SpeechDat-Car EU-Project LE4-8334, URL: http://www.speechdat.org/SP-CAR/
[5] Luca Cristoforetti, Maurizio Omologo, Marco Matassoni, Piergiorgio Svaizer,
and Enrico Zovato. "Annotation of a multichannel noisy speech corpus."
Proc. of LREC 2000. Athens, Greece, May 2000.
[6] Transcriber, URL: http://trans.sourceforge.net/en/presentation.php
[7] Andrey Temko, Robert Malkin, Climent Nadeu, Christian Zieger, Dusan Macho,
and Maurizio Omologo. "CLEAR Evaluation of Acoustic Event Detection and
Classification Systems." CLEAR'06 Evaluation Campaign and Workshop.
Southampton, UK: Springer, 2006.
[8] Oswald Lanz. "Approximate Bayesian Multibody Tracking."
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006: 1436-1449.
[9] Fleischmann, T. (2007). Model Based HMI Specification in an Automotive Context.
In Smith, M. J. and Salvendy, G., editors, HCI (8), volume 4557 of Lecture Notes
in Computer Science, pages 31-39. Springer.
[10] Goronzy, S., Mochales, R., and Beringer, N. (2006). Developing Speech Dialogs
for Multimodal HMIs Using Finite State Machines. In 9th International Conference
on Spoken Language Processing (Interspeech), CD-ROM.
[11] ISO 9241-110:2006 : “Ergonomics of human-system interaction -- Part 110:
Dialogue principles” International Organization for Standardization, 2006.
[12] Praat, URL: http://www.praat.org/
[13] N. Beringer: “Transliteration of Spontaneous Speech for the detailed Dialog
Taskflow” DICIT technical document, 29-March-2007.
[14] N. Beringer, U. Kartal, K. Louka, F. Schiel, U. Türk. PROMISE: A Procedure for
Multimodal Interactive System Evaluation. LREC Workshop 'Multimodal
Resources and Multimodal Systems Evaluation' 2002, Las Palmas,
Gran Canaria, Spain, pp. 77-80.
[15] Salber, D. and Coutaz, J. (1993). A Wizard of Oz platform for the study of
multimodal systems. In Conference Companion on Human Factors in Computing
Systems (INTERACT and CHI), pages 95-96, New York, NY. ACM.
[16] Taib, R. and Ruiz, N. (2007). Wizard of Oz for Multimodal Interfaces Design:
Deployment Considerations. In Jacko, J. A., editor, HCI (1),
volume 4550 of Lecture Notes in Computer Science, pages 232-241. Springer.
[17] Wolfgang Herbordt. "Sound Capture for Human/Machine Interfaces".
Springer-Verlag, Berlin Heidelberg, 2005.