
Deliverable 6.4

First prototype performance evaluation

Authors: Fiorenza Arisio, Roberto Manione, Timo Sowa, Matthias Bezold, Hugo Wesseling

Affiliations: Amuser, EB

Date: March 26, 2009

Document Number: DICIT_D6.4_R_20090326

Status/Version: Final

Dissemination Level: PU

FP6 IST-034624 http://dicit.fbk.eu


Project Reference: FP6 IST-034624
Project Acronym: DICIT
Project Full Title: Distant-talking Interfaces for Control of Interactive TV
Dissemination Level: PU
Contractual Date of Delivery: November 2008
Actual Date of Delivery: November 2008
Document Number: DICIT_D6.4_R_20090326
Type: Deliverable
Status & Version: Final
Number of Pages: 103
WP Contributing to the Deliverable: WP6
WP Task responsible: Timo Sowa – EB
Authors (Affiliation): Fiorenza Arisio (Amuser), Roberto Manione (Amuser), Timo Sowa (EB), Matthias Bezold (EB), Hugo Wesseling (EB)
Other Contributors: Manfredo Pansa (Amuser)
EC Project Officers: Pierre Paul Sondag (from November 1st, 2007)

Keywords: data collection, usability test, prototype evaluation, multi-microphone devices, distant-talking speech recognition devices, voice-operated devices, interactive TV, anti-intrusion, surveillance.

Abstract: This document describes the experiments conducted within DICIT to evaluate the first prototype. The objective of the evaluation was to carry out a usability test with naïve real users in order to assess both the dialogue design and the whole system implementation.


Contents

Summary
1 Introduction
2 Goals of the evaluation campaign
3 Experimental methods
3.1 Setup
3.1.1 DICIT prototype
3.1.2 Experimental setup at EB
3.1.3 Experimental setup at Amuser
3.2 Subjects and sessions
3.3 Procedure
3.3.1 General instructions
3.3.2 Description of the different goals of each task
3.4 The DICIT questionnaire
3.5 Data logging, visualization, and annotation
3.5.1 Logging of session data
3.5.2 The EvaluationSimulator tool
3.6 Definition of metrics for the evaluation
3.6.1 Metrics for dialogue design
3.6.2 Metrics for Automatic Speech Recognition
3.6.3 Metrics for NLU
3.6.4 Metrics for experimental tasks
3.7 Corpus annotations and calculation of metrics
3.7.1 Used metrics for the classification
3.7.2 Computation of metrics out of the classification result
3.7.3 Simplified computation of metrics out of the classification result
4 Results
4.1 Subjective measures
4.1.1 Statistical questions (Questions A-D)
4.1.2 TV habits (Questions E-N)
4.1.3 The DICIT system (Questions 1-20)
4.1.4 General opinion on DICIT prototype #1 (Questions 26 & 27)
4.2 Observations of the experimenters
4.2.1 Multi-slot usage
4.2.2 Verbal behavior
4.2.3 Haptic behavior
4.3 Objective measures
4.3.1 General statistics
4.3.2 ACR results
4.3.3 UE
4.3.4 SIE
4.3.5 WRR
4.3.6 TCR
4.3.7 TCT
5 Discussion
5.1 The questionnaire


5.2 Objective metrics
5.3 Observations from the recorded log data
5.3.1 Feedback for Speech Input
5.3.2 Global Speech Commands
5.3.3 Improving the Help Screen
5.3.4 EPG Improvements
5.3.5 Options for voice interaction
6 Plan for the evaluation of the second prototype
6.1 Subject Samples
6.2 Evaluation sites and segmentation of the samples
6.3 Session settings
6.4 Analysis of the data
7 Conclusions
8 References
9 Appendix A – Questionnaire
10 Appendix B – General instructions to the subjects
11 Appendix C – Tasks
12 Appendix D – Description of the RC and the available channels

List of Figures

Figure 1: DICIT general menu structure
Figure 2: The 1st prototype setup at EB
Figure 3: The evaluation setup at Amuser
Figure 4: Evaluation Simulator tool; main screen
Figure 5: EvaluationSimulator tool; definition of complex actions
Figure 6: EvaluationSimulator tool; annotating speech input
Figure 7: EvaluationSimulator tool; generating charts
Figure 8: Questions 1-7 ("Using the DICIT system")
Figure 9: Question 3 ("It was easy to understand how to give vocal commands") broken down according to language. The difference between English and Italian is significant.
Figure 10: Questions 8-11, 13 ("Watching the screen")
Figure 11: Question 8 ("Do you find it useful to choose between displaying the programme list and the criteria list") broken down according to language. The difference between English and Italian is significant.
Figure 12: Question 12
Figure 13: Questions on adaptive features
Figure 14: Question 22 broken down according to languages
Figure 15: Questions 25.1-4, difficulty ratings for the tasks
Figure 16: Question 25.2, difficulty rating for Task 2 broken down according to language
Figure 17: Question 26.1-18 - General opinion about the DICIT prototype
Figure 18: Sub-questions of 26 showing a significant difference according to the language
Figure 19: Question 27 - system properties assigned to the prototype
Figure 20: Question 27 - some properties broken down according to language
Figure 21: Frequencies of utterance lengths (in words)


Figure 22: Mean values and standard deviation of utterance lengths according to languages (width of error bars = standard deviation)
Figure 23: Classification of the recorded utterances
Figure 24: Overall Word Recognition Rate (WRR)
Figure 25: Word Recognition Rate by gender
Figure 26: Word Recognition Rate by grammar
Figure 27: Word Recognition Rate by layout
Figure 28: Word Recognition Rate by language
Figure 29: Task Completion Rate by task (A-D: specific tasks, Ag-Dg: general tasks)
Figure 30: Task Completion Rate by gender
Figure 31: Task Completion Rate by modality and task
Figure 32: Task Completion Rate by language and task
Figure 33: Task Completion Time by task
Figure 34: Overall Task Completion Time grouped by gender and task
Figure 35: Task Completion Time by gender and task
Figure 36: Task Completion Time by modality and task
Figure 37: Task Completion Time according to language and task


Summary

This deliverable describes the efforts on the evaluation of the first DICIT set-top-box prototype for the interactive TV domain, as described in the DICIT deliverables D5.1 [20] and D2.2 [21]. The evaluation campaign mainly consists of a usability study with naïve users who test a TV able to accept voice commands spoken without using a close-talking microphone. The goal of the study is twofold: first, to assess the design of the dialogue, and, second, to measure the performance of the first prototype. The study was conducted at Amuser and EB with the English, Italian, and German language versions of the prototype. In total, 56 subjects were asked to use the system and solve some typical TV-related tasks under different experimental conditions. Subjective questionnaire data as well as objective logging data and session annotations were used as the basis for the evaluation.

Overall, the results reported in this document show that the system needs improvements, mainly in order to shorten its reaction time and to increase its ability to recognize the user commands. However, it is important to note that the STB-based prototype is a complex system addressing highly challenging goals, with different subsystems that need to be tuned and optimized both as single pieces and in their interaction; the analyzed system is the first prototype, and the results obtained in the evaluation will help implement an optimized version in the next months. One encouraging result that has already emerged is that users think that the voice input makes DICIT an original and fun-to-use product.

1 Introduction

The current deliverable is part of work package 6, "Market Study, User Study, and System Evaluation", covering task T6.5 "Evaluation of first STB prototype". Its purpose, according to Annex I, is to evaluate the resulting prototype systems concerning usability, design, and effectiveness [1]. This task is an intermediate step between T6.3, the Wizard-of-Oz study with a focus on exploring potential user behavior given a speech-enabled interface, and T6.6, the final prototype evaluation with a focus on system performance and a concluding assessment. Taken together, DICIT devotes considerable effort to evaluation, because the prototype systems are not just meant to be demonstrations of technical feasibility, but should demonstrate a serious effort to create a competitive system that anybody can use. Following the reviewers' advice after the year 1 review meeting to intensify the activities on evaluation, the effort (man months) spent on WP6 even exceeds the original plan.

Task T6.5 was originally planned to be an "HMI-only" evaluation, i.e. a PC-based evaluation of a simulation system using EB GUIDE Studio (formerly "tresos GUIDE"). Notwithstanding the plan, the evaluation was done with the full-fledged first prototype for two important reasons. First, the prototype uses IBM's dialogue manager CIMA, including NLU (Natural Language Understanding) support, as the target framework. Thus, for a meaningful evaluation, the evaluated system should support the same NLU capabilities the target system offers. Instead of building NLU capabilities into GUIDE, the easier way was to use GUIDE as a modelling tool and create an export filter to CIMA. That way, NLU functionality could be used directly without additional effort. Second, the STB hardware and corresponding interface software now used for the first prototype were made available at an early stage of the project. With the hardware ready for use, there was no need to simulate the STB output in GUIDE.

This report describes the activities on the evaluation of the first STB prototype and its results. The focus is on reporting the user study, which was prepared between February 2008 and August 2008, and conducted and evaluated between September and November 2008. At Amuser, session recordings for English and Italian were conducted between September 15th and October 5th, 2008. At EB, the recordings were conducted between October 6th and October 23rd, 2008. The structure of the document is as follows. Chapter 2 provides the principal goals of the evaluation campaign and the rationale for choosing the methodology. Chapter 3 describes the concrete evaluation method applied for the campaign. The first sections recount the layout of the experiment rooms as well as the hardware and software setups at the evaluation sites. The next sections detail the subject set, the experimental procedure, and the tasks/instructions given to the participants. The final sections deal with the data sources used for the analysis, namely the questionnaire, session logging data, manual annotations, and metrics. A section on the technical evaluation framework developed in this context is included. In Chapter 4 we provide the results using descriptive statistics, covering both subjective measures acquired from questionnaires and objective measures from logs and annotations. A discussion of the results, including a critical appraisal of the current prototype's status, is given in Chapter 5. A plan for the final evaluation is presented in Chapter 6, followed by a summary and conclusion in the final chapter.

The report encompasses results for the three system languages. In cases where anything is described or discussed separately for each country, the respective parts are marked with the flags of the country as follows.

Recording
- German: at EB, Erlangen, Germany
- English: at Amuser and EB
- Italian: at Amuser, Torino, Italy

2 Goals of the evaluation campaign

As stated in the introduction, the overarching goal of the STB prototype evaluation is to assess the prototype systems concerning usability, design, and effectiveness. Thus, in contrast to the Wizard-of-Oz study conducted in the first year, which pretended a "perfect" system and was aimed at determining users' dialogue behavior, the current evaluation concentrates on the running system. Consequently, sources of error or bad performance can be found both in the system's design in terms of dialogue behavior and in the system implementation. Both aspects should be equally highlighted in this effort, and two major goals were defined respectively. These goals are:

1. to test the adequacy of the dialogue design from an ergonomic point of view. This refers to the concordance between the design and the user's expectations, and to the effort required for an average system user to take advantage of the multimodal dialogue in performing a number of typical tasks; for example, the degree to which a user was able to realize and exploit the fact that it is possible to switch to channel CNN by just saying "go to CNN" instead of having to remember CNN's channel number and pushing the appropriate button on the remote control.

2. to assess the performance of the actual implementation of the designed dialogue and of all involved technologies in the STB prototype #1 (e.g., beam forming, acoustic echo cancellation, automatic speech recognition, action classification, etc.); for example, the extent to which a voice command is understood correctly by the ASR module, or the correctness of the system's reaction to a given command.

The novel and outstanding attribute of the DICIT interactive TV prototype is speech input from the far field. Speech as a control modality sets DICIT apart from commercially available TV sets with EPG (Electronic Program Guide) functions operated with a remote control (RC). One of the advantages provided by the visual mode in a multimodal system is that it can inform users about the available commands (e.g. letting users "say what they see"); however, the way users can take advantage of this feature varies, depending on the training of each and every user and on the flexibility of the actual implementation. The DICIT prototype is a multi-function environment where a wide range of information is shown to the user; it has been designed taking the above considerations into account, trying to accommodate the needs of two opposite kinds of users (the naïve and the trained user); in particular, the multimodal dialog system implemented in the DICIT prototype supports both of the following cases:

1) everything that can be done with the RC can also be done by voice with the same "steps", because a "parallel" interaction should facilitate naïve users (who can use their voice following the same "paths" learned using the RC);

2) voice can be used as a “shortcut” in place of several “basic” “steps”; such voice shortcuts are “suggested” to the user by labels and other cues displayed in the “captions” of the video output.

Hence, special attention is to be paid to the additional benefit speech and multimodal input offer over the traditional RC. A comparison between the input modalities speech and RC with respect to performance was thus one particular performance-related goal for each of the three languages.

Though the goals could be addressed using a number of different established methodologies for the evaluation of human-machine interfaces, a study with uninformed ("naïve") participants as system users was selected for that purpose. The rationale behind this choice is that other methodologies which do not involve end users, such as expert reviews, cognitive walkthroughs, rule-based evaluations etc., require extensive knowledge about the relevant factors which have an impact on the usability of an interface. Up to the present, there is no such collection of factors for speech dialogue systems with NLU capabilities. In other words, we lack the metrics and experience to determine the adequacy and performance of a speech-enabled system just by looking at the design. Therefore, another goal of the evaluation is to acquire this experience and define appropriate metrics.

3 Experimental methods

The basic paradigm for the evaluation campaign is a user study with "naïve" participants, i.e., subjects who are neither involved in the development of the system nor have extensive background knowledge about speech technology. In order to get results about the expected performance and adequacy of the interface in everyday life, participants should operate the system in a typical environment of use, and they should not be disturbed or influenced by the experimenter or another person. Dedicated experiment rooms were set up for this purpose, described in more detail in the following section. To test the prototype in relevant situations of use, we chose a task-based paradigm that covers some of the most frequent tasks TV and EPG users are confronted with. These tasks had to be solved using different modalities. Details can be found in the specific sections of this chapter.


3.1 Setup

3.1.1 DICIT prototype

The DICIT system is an STB system with a dynamic electronic program guide (EPG). Figure 1 illustrates the basic dialogue flow of the system as specified in D5.1 [20]. After displaying the start screen, the system switches to TV mode and shows the current broadcast. While watching TV in this mode, the user may call the settings screen or use the EPG by defining a set of filters (channel, time, day of the week and genre) and then invoke the result lists produced using these filters. Elements from the result list can be put into a "scheduling" list, which simulates turning the TV on when the chosen program is broadcast. Moreover, while in TV mode, users can watch a free satellite channel (9 for the English version, 10 for the Italian version, 15 for the German version of the system), adjust the volume, and ask for help if needed. Commands can be given either vocally or with the RC. The remote control interaction was completely implemented in the system and was "parallel" to the voice interaction.

[Figure 1 is a state diagram of the general menu structure: after the DICIT start-up splash screen, the system turns the TV on at the first channel allowed for the profile and enters the TV screen state; from there, the user can switch channels (by name, number, or channel up/down), adjust the volume, open the settings, ask for help, open the EPG guide, or handle the result list coming from the EPG; each transition can be triggered by voice (grammar classes such as TVCHANNEL, VOLUME, SETTINGS, HELP, GUIDE) or by the corresponding RC buttons.]

Figure 1: DICIT general menu structure

The software setup and system architecture of the prototype follow the directives of WP2 and run the software provided by WP2. The Fracarro STB was equipped with the firmware version provided by WP2. More documentation about the software setup of the DICIT system can be found in D2.2.


3.1.2 Experimental setup at EB

At EB, two rooms were used in the experimental setup (see Figure 2). First, the subject was sitting on a couch in the "test person room", which was furnished similarly to a living room. This room contained a TV screen and a microphone array as well as the DICIT prototype equipment. The distance between the TV set and the user was about 2.50 m. The supervisor was sitting in the room next to the test person room. When the subject had finished one task of the experiment, he or she could knock on the door to call the supervisor.

All sessions were recorded with a microphone array. The signal was processed by the DICIT PCs. Videos of the sessions were recorded for reference and extensive logging data was recorded by the dialog application.

Figure 2: The 1st prototype setup at EB.

[Diagram labels: supervisor room and test person room; TV screen, web cam, remote control, video camera, PC1, PC2, microphone array, audio equipment, and STB.]


The sessions were recorded using a video camera that was placed in a corner of the room. A mirror was used to get a view of the whole room. Moreover, another web cam, placed on top of the microphone array, was pointed at the user.

The microphone array consists of 15 small microphones, each having a separate XLR socket. Three RME Octamic II pre-amplifiers were used as A/D converters for the microphone signals. PC1 was equipped with an RME Hammerfall DSP sound board, and the microphone channels from the A/D converters were connected to the PC using ADAT connections. PC1 was connected to PC2 using a direct network connection. The dialog management software running on PC2 processed the audio fragments coming from PC1 and sent the respective commands to the Fracarro STB. The following hardware was used for the two PCs:

• PC 1: Dual-Core P4, 2 GB of RAM, running Linux (Fedora 8, 64-bit)
• PC 2: Quad-Core P4, 4 GB of RAM, running Windows XP
• TV screen: 32" Panasonic

3.1.3 Experimental setup at Amuser

At Amuser, three rooms were used in the experimental setup (actually two rooms, one of which was in turn divided by a curtain able to damp the sound waves reaching it):

1. the test person room, the only one accessible to the subjects, where the experiments took place; the actual size of this room is about 3.5m by 3.8 m; the distance between the face of the subject and the microphone array was around 2 m.

2. the equipment room, accessible only by the DICIT team, hosting the prototype hardware, such as PC1 and PC2 with their respective keyboards and screens, the STB, and so on;

3. the experimenter room, hosting the experimenter, who watched and listened to the experiment going on in the DICIT room.


Figure 3: The evaluation setup at Amuser

One digital VCR was placed in the test person room, pointing at the subject; a mirror was placed on the wall behind the subject; in this way, the camera was able to "watch" both the test person and the DICIT TV during the experiments. All sessions were recorded by this VCR and the resulting media have been archived.

One webcam was also placed in the DICIT room, pointing at the DICIT TV; the output of this camera was brought to the experimenter using a PC placed in the experiment room (PC3) running a point-to-point NetMeeting session with a fourth PC (PC4) located in the experimenter room; this allowed the experimenter to watch and listen to the experiment without interfering with it. In particular, the following hardware was used:

• PC1 is a Dell Precision T3400 (Intel Core 2 Quad Q6700 running at 2.66 GHz) with 4 GB of RAM, running Linux, kernel 2.6 (64-bit).
• PC2 is a Compaq ML 330-G2 (2 Pentium III CPUs running at 1.5 GHz) with 1.2 GB of RAM, running MS Windows 2000 Professional SP4.
• TV: a 17" Philips LCD
• Speakers: two EVENT 20/20 BAS
• STB + remote control: supplied by Fracarro
• Microphone array: the "standard" one, built by FBK

3.2 Subjects and sessions

The prototype evaluation campaign was carried out with an "end-to-end" evaluation in mind, with the purpose of assessing the goodness of the design and the suitability of the implementation. For such goals it is not necessary to involve a large subject sample: in the literature, a "small sample" (of 5-6 subjects) is usually accepted as sufficient [24]. Instead of a large subject sample, it is recommended to use a higher number of tasks that each subject has to solve [25]. For this reason it was decided to involve 20 subjects per language, with 5 tasks per subject. Things would have been different if the purpose had been to evaluate the performance of a voice recognizer, which would have required a larger subject sample.

The usability experiments were conducted at Amuser in Torino, Italy, and at EB in Erlangen, Germany. 20 sessions were performed at EB and 20 at Amuser, in German and Italian respectively, involving one person per session. Moreover, the English sessions were split between the two sites of EB and Amuser. Due to the difficulty of finding American speakers either in Germany or in Italy, the English-language sample is composed of only sixteen persons (instead of 20) and also includes British native speakers. In Italy, twelve English sessions were performed, with six subjects speaking UK English and six speaking with an American accent. In Germany, the English sessions were conducted with four speakers of US English.

3.3 Procedure

The test person room was furnished like a living room with a couch, where the test person could feel like he or she was watching TV in a private environment. The experimenter room was directly wired up to the test person room via direct lines for keyboard, mouse, and other cables.

At the beginning of a session, the subject was guided into the living room by the instructor and received a short introduction to the experiment (see Appendix B), a sheet with the description of the usable buttons of the RC, and a list of the available channels (see Appendix D). The general instructions explicitly pointed out that the response time of the system could be long and that the system cannot understand the names of specific broadcasts. This information was made explicit because both issues were considered important limitations of the current prototype that users may not be able to determine on their own.

3.3.1 General instructions

The instructions given to the subjects can be outlined as follows:


- The system is a prototype that can understand spoken language.
- The session is going to be recorded (both audio and video).
- The system was a fully functional system which could also be used via the remote control.
- One "free" task and four specific tasks had to be fulfilled. These were written on paper and handed over to the subjects.
- The experiment was about testing the system, not the subject.
- After the recording, a short questionnaire had to be completed.

The session was divided into two parts to be solved with the DICIT system. First of all, the subjects were asked to "play" with the system as they wanted (a free task of about 5 minutes). The first task was carried out without giving many instructions to the user, to let him/her freely discover the functionalities of the system, avoiding the bias introduced by the possibility of imitating what was shown in the instruction demos. Only after this "free task", during which the users were even free to move around the room while they were speaking, did they see two demos about two different modes of using the EPG.

After the two demos they had to sit on the sofa and fulfill four tasks in about 50 minutes (see the detailed description of the tasks in Appendix C); the instructor entered the room after the time for a task had elapsed (about ten minutes) or when the subject had reached the goal of the task.

To avoid bias due to the sequence of tasks, the order of tasks was different for each subject, and each kind of task required reaching the goal using a different mode of interaction with the system (using only voice commands → v, using only the RC → rc, using both voice and RC → vrc), as reported in the following:

subject   "free" task   1st task     2nd task     3rd task     4th task
f1        X             A_gen        B_spec_rc    C_spec_v     D_spec_vrc
f2        X             B_gen        A_spec_rc    C_spec_v     D_spec_vrc
f3        X             C_gen        B_spec_rc    A_spec_v     D_spec_vrc
f4        X             D_gen        B_spec_rc    C_spec_v     A_spec_vrc
f5        X             A_gen        B_spec_v     C_spec_rc    D_spec_vrc
f6        X             B_gen        A_spec_v     C_spec_rc    D_spec_vrc
f7        X             C_gen        B_spec_v     A_spec_rc    D_spec_vrc
f8        X             D_gen        B_spec_v     C_spec_rc    A_spec_vrc
f9        X             A_gen        B_spec_vrc   C_spec_v     D_spec_rc
f10       X             B_gen        A_spec_vrc   C_spec_v     D_spec_rc
m1        X             C_gen        B_spec_vrc   A_spec_v     D_spec_rc
m2        X             D_gen        B_spec_vrc   C_spec_v     A_spec_rc
m3        X             A_gen        B_spec_vrc   C_spec_rc    D_spec_v
m4        X             B_gen        A_spec_vrc   C_spec_rc    D_spec_v
m5        X             C_gen        B_spec_vrc   A_spec_rc    D_spec_v
m6        X             D_gen        B_spec_vrc   C_spec_rc    A_spec_v
m7        X             A_gen        B_spec_v     C_spec_vrc   D_spec_rc
m8        X             B_gen        A_spec_v     C_spec_vrc   D_spec_rc
m9        X             C_gen        B_spec_v     A_spec_vrc   D_spec_rc
m10       X             D_gen        B_spec_v     C_spec_vrc   A_spec_rc

Table 1: task sequence


The "rule" that was followed was: the first real task was always the general one, followed by the specific ones with the interaction modes permuted, as far as it is possible to combine them with 20 subjects.

In this way, the tasks were distributed in such a way that no user did the same task twice and that each task was done in each of the modality combinations by different users.

3.3.2 Description of the different goals of each task

Free Tasks

Please "play" for at least 5 minutes with the system to discover what it is able to do (remember that, if you need it, you can always ask DICIT for help). While you are using the system, you are kindly requested to move around the room as you prefer.

Set of Task A

Surf the available channels and then try to schedule the TV to turn on (specific goal: find a program on air at 1:30 AM on a precise channel).

Set of Task B

Search for programs using the search criteria (specific goal: find car races on Sunday).

Set of Task C

Find something playing "now" and select it to switch to TV mode; then adjust the volume (specific goal: find a program on air on a precise channel and set the volume to ¼ of full volume).

Set of Task D

Try to modify settings (specific goal: change the interaction mode to “expert” and the EPG start mode to “list”)

The duration of a single session was about 1 hour and 15 minutes.

3.4 The DICIT questionnaire

After the recording session with the first prototype, each subject had to complete a questionnaire to determine users' attitudes toward different aspects of the system. The questionnaire data was entered on a notebook, such that the data could be automatically evaluated without the need to enter it by hand. Appendix A shows the complete questionnaire.

The questionnaire consists of 74 questions according to the criteria of DIN EN ISO 9241-110 (see [11]). The first part consists of four statistical questions (A-D) and questions regarding TV habits (E-N). The second part contains questions regarding specific parts of the DICIT system, such as screen, voice output, and voice input (1-20). The last part investigates users' expectations about possible adaptive features (21-24), the evaluation of the easiness/difficulty of reaching the goal of each task, an overall impression of the system (26.1-26.18), and a semantic differential (14 statements).

3.5 Data logging, visualization, and annotation

One specific activity within WP6 concerns the development of a technical framework to support and partly automate the evaluation process. For this purpose, a logging procedure and a software tool, called the EvaluationSimulator tool, have been developed. An earlier version of the tool has already been used for the evaluation of the Wizard-of-Oz data [22]. The capabilities of the tool have been extended for the first prototype performance evaluation, as described in the following sections.

3.5.1 Logging of session data

During each experimental session, extensive log files were created containing all user-system interactions, internal system states, and contextual information. In particular, all multimodal inputs by the user are logged. This refers to key presses on the remote control as well as to any type of speech input. If speech input was detected, the pre-processed speech data sent to the dialogue components (after beamforming, SLOC, and speech activity detection) is stored in separate files. Additionally, the logging system stores the recognition result (ASR component) and the chosen action (NLU component). All in all, the logs contain all data necessary to replay and analyze the interaction with the system at a later point in time.
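The exact on-disk layout of these log files is defined by the prototype software and is not reproduced here. As a minimal sketch, assuming a hypothetical tab-separated format with a timestamp, an event type, and a payload per line, the logs could be turned into typed, time-ordered event records as follows (all names are illustrative):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LogEvent:
    timestamp: float   # seconds since session start
    kind: str          # e.g. "key", "pcm", "result", "action" (hypothetical type names)
    payload: str       # button code, PCM file name, ASR hypothesis, or chosen action

def read_session_log(path: str) -> List[LogEvent]:
    """Parse a session log into a time-ordered list of events.

    Assumes a hypothetical line format: <timestamp>\t<event-kind>\t<payload>.
    The real DICIT log layout may differ; only the idea of turning raw log
    lines into typed, time-stamped events is illustrated here.
    """
    events: List[LogEvent] = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            ts, kind, payload = line.split("\t", 2)
            events.append(LogEvent(float(ts), kind, payload))
    events.sort(key=lambda e: e.timestamp)
    return events
```

Such a list of records is the kind of input that the tier-based display of the EvaluationSimulator tool and the metric computations sketched later in this chapter operate on.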

Figure 4: Evaluation Simulator tool; main screen


3.5.2 The EvaluationSimulator tool

The EvaluationSimulator tool is able to read the log files produced during the experimental sessions and to display the data in a human-readable format. The tool uses a time-aligned, tier-based graphical display of the events (Figure 4). The upper part of the interface shows the visualization of the logging data. Each type of logging event, such as key (RC) input, speech input (availability of a PCM file), or ASR results, is displayed in a separate tier. The user may move the time marker (marker in the uppermost timeline and vertical line) to any position. The system reacts by showing key presses on the RC with a small animation (lower left part of the figure) and the TV screen displayed at that point in time (lower right part). Hence, the software can be used as a recorder playing and re-playing system-user interactions.

The tool allows for a free definition of more complex events or metrics from low-level logging events [23]. Figure 5 illustrates the definition of a complex event (“pcmresultaction”) based on three different basic logging events (“pcm”, “action”, “result”). If all such basic events with the given constraints can be found in the log, a complex action is generated and displayed on the timeline (see last four tiers in Figure 4).

Figure 5: EvaluationSimulator tool; definition of complex actions
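The tool's actual constraint language for complex events is not detailed here; the sketch below only illustrates the underlying idea with the "pcmresultaction" example, reusing the LogEvent record from the sketch in Section 3.5.1: whenever one "pcm", one "result", and one "action" event fall within a short time window, a single complex event is emitted. The five-second window and the greedy matching are assumptions, not the tool's actual behaviour.

```python
from typing import List

def compose_complex_events(events: List[LogEvent],
                           parts=("pcm", "result", "action"),
                           window: float = 5.0) -> List[LogEvent]:
    """Group one "pcm", one "result", and one "action" event occurring within
    `window` seconds of each other into a single "pcmresultaction" event."""
    complex_events: List[LogEvent] = []
    pending = {}  # kind -> most recent unmatched event of that kind
    for ev in events:
        if ev.kind not in parts:
            continue
        pending[ev.kind] = ev
        if all(k in pending for k in parts):
            group = [pending[k] for k in parts]
            start = min(e.timestamp for e in group)
            end = max(e.timestamp for e in group)
            if end - start <= window:
                payload = "; ".join(f"{e.kind}={e.payload}" for e in group)
                complex_events.append(LogEvent(start, "pcmresultaction", payload))
                pending.clear()
    return complex_events
```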

Besides visualization, re-play, and the definition of high-level events, the EvaluationSimulator tool also supports the annotation of data. Each log event can be annotated with an arbitrary number of additional parameters. Figure 6 shows an example of how this feature was used in the current evaluation campaign. The user selected a "pcm" event on the timeline by right-clicking and selecting "annotate". Such "pcm" events are generated whenever some speech input is recognized by the system. The window shows all parameters associated with the selected speech input event. The parameters "result" and "action" reflect the recognition result of the ASR engine ("Wednesday") and the action classification ("Value"), which were automatically read out of the log files. The annotator may now play the sound file associated with the event, listen to the original speech input, and transcribe ("transcr") the input (which was correct in this case). In addition, the system's reaction to the user input is annotated ("classif") – in this case "OK", for the correct choice of the action – according to the appropriateness of the action executed by the system rather than simply the correctness of utterance recognition. Using the buttons "jump to previous/next", the annotator can directly go to the next event of the same type with a single click. That way, annotating the data becomes much more efficient.

Figure 6: EvaluationSimulator tool; annotating speech input

Finally, the tool can be used to automatically generate graphical charts from any type of input data. These charts can be directly imported into text processing software, as was done in this deliverable. Figure 7 shows an example of the user interface for this feature. Here, the annotator/evaluator may enter the answers to the questionnaire questions. Descriptive statistical charts can be created simply by clicking "generate chart".

Figure 7: EvaluationSimulator tool; generating charts

3.6 Definition of metrics for the evaluation

The metrics against which the various experiment sessions are evaluated are presented in the next sections. For each of them, the following aspects will be presented:

Definition: Objective metrics include everything that can be directly or indirectly derived from observation. This includes (a) everything that can be directly derived from the log files via an automated process and (b) everything that can be derived from a transcription of the data, where the transcription itself can be considered objective due to strict and unambiguous guidelines. In contrast, subjective metrics include all subjective assessments by an individual (typically the participant him/herself) and the observations of the experimenter.

Scope: Objective metrics may relate to different units of discourse. The following units will be used as scope: a single voice utterance by the user, a task as defined, a single interaction by the user (including both voice and RC).

Restriction: Some metrics are not useful or meaningful for all units. Restrictions define which units to include in the analysis.

Requirement: The requirements specify what kind of original data is needed in order to derive the metrics.

Computation: A metric can be computed either automatically or manually.

Rationale: the kind of insight that is expected to be provided by the metric.

3.6.1 Metrics for dialogue design

Command Appropriateness
Scope: single user utterance
Restriction: --
Requirement: transcription of audio recording of user utterances
Computation: manual; the evaluation of this metric requires good knowledge of the dialogue design and may sometimes be subject to the judgement of the analyzer/annotator!

$$\mathrm{CA} = \begin{cases} 1 & \text{if the command is plausible/appropriate in the current dialogue situation} \\ 0 & \text{otherwise} \end{cases}$$

This metric reflects whether the request or command by a user is plausible or appropriate in the current dialogue situation, regardless of whether it is foreseen in the current dialogue model or not. Thus, if the user says, for example: "How is the weather tomorrow", the metric would be 0, since this kind of request is implausible (off-domain). Likewise, if a user says something that cannot be understood by design in the current dialogue state, but is implemented in another dialogue state, the value will be 0 (e.g., he/she asks for programs of genre "weather" while within the "settings" state).

If the user made a plausible request (i.e., one that was foreseen in the dialogue design at the current dialogue state), but his/her choice of words is unusual such that the utterance is not present in the language model, the value is 1.

Rationale: The CA value (over sets of utterances) will show if the dialogue design appropriately covers users' expectations. Comparing the CA against the Dialogue Coverage (DC) (see below) will give a measure of the improvement that extending the Language Model to cover the set of plausible commands could theoretically bring to the experience of the set of test users. The improvement is called theoretical because adding the set of plausible commands to the Language Model could lower the WRR and ACR.

Dialogue Coverage
Scope: single user utterance
Restriction: --
Requirement: transcription of audio recording of user utterances
Computation: manual; the evaluation of this metric requires good knowledge of the dialogue design and may sometimes be subject to the judgement of the analyzer/annotator!

$$\mathrm{DC} = \begin{cases} 1 & \text{if the command is covered by the dialogue model} \\ 0 & \text{otherwise} \end{cases}$$

This metric reflects whether the principal type of request or command by a user is foreseen in the dialogue design in the current situation, regardless of whether the actual utterance was understood by the ASR/NLU engine or not. Thus, if the user says, for example: "Please I want to watch CNN" in TV mode, but this particular way of expressing the command was not understood by the ASR/NLU engine, the metric would be 1, since this kind of request in this given dialogue situation (changing channels in TV mode) is foreseen in the dialogue design. If a user says something that cannot be understood by design in the current situation, but could be understood in another dialogue state, the value will be 0.

Rationale: The DC calculation is necessary to calculate the Word Recognition Rate (WRR) and the Action Classification Rate (ACR) metrics (see below), since only the utterances which match this criterion have to be considered.
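As an illustration of how these per-utterance annotations feed the later computations, the following sketch aggregates CA and DC over a set of annotated utterances and applies the DC = 1 restriction used for WRR and ACR below. The field and function names are our own, not the project's annotation schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AnnotatedUtterance:
    transcript: str        # manual transcription of what the user said
    recognized: str        # ASR hypothesis from the log
    ca: bool               # Command Appropriateness annotation (CA = 1/0)
    dc: bool               # Dialogue Coverage annotation (DC = 1/0)
    action_correct: bool   # system chose the appropriate action ("OK" in the annotation)

def ca_rate(utterances: List[AnnotatedUtterance]) -> float:
    """Fraction of utterances that are plausible/appropriate in the current dialogue situation."""
    return sum(u.ca for u in utterances) / len(utterances)

def dc_rate(utterances: List[AnnotatedUtterance]) -> float:
    """Fraction of utterances covered by the dialogue model."""
    return sum(u.dc for u in utterances) / len(utterances)

def covered_only(utterances: List[AnnotatedUtterance]) -> List[AnnotatedUtterance]:
    """Keep only DC = 1 utterances, the restriction used for WRR and ACR below."""
    return [u for u in utterances if u.dc]
```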

3.6.2 Metrics for Automatic Speech Recognition

In order to assess the performance of the speech recognition system (excluding NLU and dialogue processing), the following measures can be used.

Word Recognition Rate
Scope: single user utterance
Restriction: only utterances covered by the dialogue model (DC = 1)
Requirement: transcription of user utterances
Computation: automatic, out of the transcription of the utterances

$$\mathrm{WRR} = 1 - \frac{I + D + S}{N}$$

Let $utt = (w^u_1, w^u_2, \ldots, w^u_N)$ be the word chain of the utterance spoken by the user (acquired by transcription) and $rec = (w^r_1, w^r_2, \ldots, w^r_M)$ the recognized word chain. Compare both word chains, count the number of word insertions $I$, word deletions $D$, and word substitutions $S$, and compute the word recognition rate WRR as defined above (similar to the Levenshtein distance). The comparison is done using a dynamic programming approach, e.g., the Wagner-Fischer algorithm.

Rationale: The WRR value (averaged over sets of utterances, such as all the utterances spoken by a given subject) will provide information on the performance of the ASR module. It will tell us whether ASR needs improvement or not.
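A minimal sketch of this computation, using the word-level Wagner-Fischer dynamic programme mentioned above; the project's own scoring tooling may differ, but the arithmetic is the one defined by the WRR formula:

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Minimum number of word insertions, deletions and substitutions (I + D + S)
    needed to turn the recognized string into the transcribed utterance,
    computed with the classic Wagner-Fischer dynamic programme."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j] = edit distance between the first i reference and first j hypothesis words
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution / match
    return dist[len(ref)][len(hyp)]

def wrr(reference: str, hypothesis: str) -> float:
    """WRR = 1 - (I + D + S) / N, where N is the number of words in the transcription."""
    n = len(reference.split())
    return 1.0 - word_errors(reference, hypothesis) / n
```

For example, wrr("switch to channel five", "switch to channel nine") yields 1 - 1/4 = 0.75; averaging such per-utterance values over sets of utterances gives the aggregated WRR figures reported in the results chapter.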

3.6.3 Metrics for NLU

Action Classification Rate
Scope: the intended meaning of an utterance (illocutionary act)
Restriction: only utterances covered by the dialogue model (DC = 1)
Requirement: transcription of the match between the user's spoken utterances and the chosen branch of the dialogue (multi-valued variable)
Computation: automatic, out of the transcription of the result of the classification of the utterance

$$\mathrm{ACR} = \frac{C}{T}$$

Let $utt = (w^u_1, w^u_2, \ldots, w^u_N)$ be the utterance spoken by the user, and let $act$ be the action chosen by the NLU out of the (recognized chain of words from the) user input and delivered to the Dialog Manager.

C is the number of correctly classified actions within the considered set; T is the total of actions classified in the set.

Rationale: The ACR value will provide information on the performance of the ASR+NLU module. It will tell whether the ASR+NLU is a major source of errors or not. This definition of ACR, "end-to-end" from user input to the feedback given to the user, is the most effective one when the evaluation of usability is of concern.

If the performance of the NLU alone is of interest, an indication of its performance can be derived by comparing the metric for the ASR alone (WRR) with the ACR as defined above.
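Given the DC = 1 restriction and the "OK"/not-OK action annotation described in Section 3.5.2, ACR reduces to a simple ratio. A small sketch, where the boolean list is assumed to contain one entry per covered utterance:

```python
from typing import List

def acr(action_correct: List[bool]) -> float:
    """ACR = C / T: C = number of correctly classified actions (annotated "OK"),
    T = total number of classified actions; the list should contain only
    utterances covered by the dialogue model (DC = 1)."""
    return sum(action_correct) / len(action_correct)

# e.g. acr([True, True, False, True]) == 0.75
```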

3.6.4 Metrics for experimental tasks

Task Completion Rate
Scope: one complete task; or the "main" part of the task completed (where the complete task was composed of several sub-tasks)
Restriction: the task has to be completed in "one try" (repetitions are not allowed) and within a maximum of 10 minutes
Requirement: transcription of the result of one task (Boolean variable OK/KO, according to whether the task has been completed or not)
Computation: manual, out of the transcription of the result of the task

Rationale: The TCR value will provide information on both the goodness of the design of the system (efficacy) and the overall performance of the prototype, in particular the performance of the voice recognition chain (from multi-microphone signal processing to NLU/MAC).

Other measures (e.g., the ratio between commands given through the remote control and commands given via voice) will help discriminate between these two components.

Task Completion Time
Scope: one complete task; or the "main" part of the task completed (where the complete task was composed of several sub-tasks)
Restriction: maximum time range is 10 minutes for each task
Requirement: identification of the beginning and end of each single task and of the "thinking" time
Computation: automatic, out of the annotated boundaries of each single task and of the intervals of "thinking time"

TCT = Te − Tb

TCTN = TCT − DT


TCT is the raw Task Completion Time, defined as the time elapsing from the beginning (Tb) to the end (Te) of the task. TCTN is the Net Task Completion Time, that is, the real “thinking time” for the task; in order to obtain it, it is necessary to calculate the Dead Time (DT), which is the sum of all the response times elapsed during the task plus the time spent by the user to recover from system errors (since it requires manual annotation, TCTN is not calculated in the following).

Rationale: The TCT value will provide information on both the efficiency of the design of the system and the overall performance of the prototype, in particular the performance of the voice recognition chain (from multi-microphone signal processing to NLU/MAC).
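As a small illustration (not part of the DICIT tools), TCT and TCTN can be derived from logged timestamps as follows; the variable names are ours:

```python
def task_completion_times(t_begin, t_end, dead_intervals):
    """TCT = Te - Tb; TCTN = TCT - DT, with DT the sum of the system
    response times and of the error-recovery time within the task.
    `dead_intervals` is a list of (start, end) timestamps in seconds."""
    tct = t_end - t_begin
    dt = sum(end - start for start, end in dead_intervals)
    return tct, tct - dt
```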

User Errors
Scope: one complete task, or one complete session.
Restriction: the wrong actions have to be judged as plausible for the subject in the given state of the interface (if the user corrects him/herself, the wrong behaviour is due to inattention).
Requirement: classification of the result of one action/request.
Computation: by the classification of the commands given during a task

UE = E / (E + C)

Where E is the number of erroneous actions and C is the number of correct actions.

Rationale: UE gives a measure of the efficacy of the design regardless of the goodness of the implementation of the system.

If the interface is easy to use, the subject can learn in a very short time what needs to be done to reach a target. On the other hand, if the number of errors made by the user is high, the interface is not effective, because it induces wrong behaviors.

Ratio between successful interactions and errors
Scope: one complete task, or one complete session.
Restriction: the wrong actions have to be judged as plausible for the subject in the given state of the interface (if the user corrects him/herself, the wrong behaviour is due to inattention).
Requirement: classification of the result of one action/request.
Computation: by the classification of the commands given during a task

Rationale: SIE is the ratio between correct actions/requests and wrong behaviors in a given state. Effectiveness results from the relationship between the efficacy of the design, which allows a given goal to be achieved, and the correct functioning of the system.

The completion of a task results from a series of completed sub-tasks, which can be compared with the number of wrong behaviors.
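A toy sketch of the two ratios, assuming the actions of a task or session have already been labelled as correct or wrong (the count-based forms actually used for the DICIT logs are given in Section 3.7):

```python
def user_error_rate(errors, correct):
    """UE = E / (E + C), with E erroneous and C correct actions."""
    return errors / (errors + correct)

def successful_interaction_ratio(correct, errors):
    """SIE: correct actions/requests over correct plus wrong behaviors."""
    return correct / (correct + errors)
```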

3.7 Corpus annotations and calculation of metrics
This section describes the procedure followed to compute the defined metrics out of the logs of the usage experiments. The procedure has been carried out partly manually and partly assisted by tools (the Evaluation Simulator Tool by EB and MS-Excel).

In this section the metrics reported above, described in very general terms, are specialized to the particular needs of this document, that is, to evaluate the DICIT prototype with respect to two aspects:

1. to test the adequacy of the dialogue design from an ergonomic point of view.


2. to assess the performance of the actual implementation of the designed dialogue and of all the involved technologies in the prototype

as already stated in Section 2.

Decomposing the outcomes of the metric computation into these two components is not always an easy undertaking: for example, while it may be relatively easy to attribute the misrecognition of an uttered word belonging to the active Language Model or Grammar to poor implementation, and the failure to recognize a plausible utterance that was missing from the Language Model to poor design, there are cases where such a decision is not easy to take.

In order to support a finer-grained correlation of the metric outcomes to one or the other component, the outcome of the DICIT Action Classifier has been annotated with a multivalued label (11 different values – see below); such finer annotation allowed the definition of more specialized metrics, which can be more clearly correlated with the two aspects of the evaluation stated above.

3.7.1 Used metrics for the classification
The starting data of the procedure are the following; these data have been produced by the various components of the DICIT prototype while running the experiments and saved on a subject-by-subject basis (together with the video of the session, saved for future inspection and for the cases where it could not be decided what was going on from the other data):

1. timed log of the CIMA Dialog Manager, including the Language Model and embedded grammar in effect, the text of the TTS messages, the classifications of the utterances provided by the MAC, and the chosen branches in the dialog

2. timed log of the ASR results: the transcription of the user input produced by the ASR

3. duration of each task (to be further refined by removing the initial and final time, when the experimenter is still in the user room)

4. voice track of the whole session, containing both the user utterances and the output of the DICIT system (e.g. TV audio and TTS messages)

5. the actual chunks of voice sent to the ASR (as generated by the multi-microphone signal processing chain and segmented by the end-point detector)

6. timed log of the RC commands given by the user

7. timed log of the STB screens generated by DICIT (e.g. command bar, help popup, EPG lists, choice lists, and so on)

The result of the procedure is the actual value of the Objective Metrics, evaluated within the respective set of events (e.g. single utterance, single task, single session, global). In the following, a description of the data extracted/annotated from the logs is reported. For each utterance spoken by the user within one task (utterances outside a task have been discarded; this is the case of the voice segments deriving from talk between the user and the experimenter before or after a task):

1. Transcription, if the utterance has been considered (i.e. segmented and processed) by the system; for Missed utterances (i.e. the ones which have not been considered by the system) no transcription was annotated, since there is no use for it

2. Classification result CR: the user utterance has been compared against the subsequent behavior of the system; an 11-valued result has been defined (a sketch of the scheme as a data structure is given after the list):


i. OK: the user command was appropriate with respect to the system state and the system classification was correct.

ii. NO: the user command was appropriate with respect to the system state and the system classification was not correct.

iii. OF: off-talk utterance

iv. NC: Not classifiable (anomalous utterance, such as a part of one utterance, resulting from bad segmentation)

v. TV: total or prevalent audio from the TV or from the TTS

vi. NP: the user command was not appropriate with respect to the system state, but judged plausible; the system classification was not correct (like NO, but denotes commands that could be included in the design of the new version of the system)

vii. NI: the user command was not appropriate with respect to the system state, and judged implausible; system classification was not correct

viii. MOK: the user command was appropriate with respect to the system state, but the system missed the command.

ix. MNP: the user command was not appropriate with respect to the system state, but judged plausible; the system missed the command.

x. MNI: the user command was not appropriate with respect to the system state, and judged implausible; the system missed the command
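For illustration, the classification scheme can be represented as a small data structure; the field names below (task_id, transcription) are ours, chosen to mirror the annotations listed in this section, not the actual DICIT log format:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class CR(Enum):
    """Classification result labels assigned to each utterance."""
    OK = "appropriate command, correctly classified"
    NO = "appropriate command, wrongly classified"
    OF = "off-talk utterance"
    NC = "not classifiable (anomalous segment)"
    TV = "audio dominated by the TV or by TTS output"
    NP = "not appropriate but plausible, wrongly classified"
    NI = "not appropriate and implausible, wrongly classified"
    MOK = "appropriate command, missed by the system"
    MNP = "not appropriate but plausible, missed by the system"
    MNI = "not appropriate and implausible, missed by the system"

@dataclass
class AnnotatedUtterance:
    task_id: str
    cr: CR
    transcription: Optional[str] = None   # None for missed utterances
```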

For each Task:

1. Duration, calculated as the time difference between the first utterance given within the task and the first utterance outside the task (usually speech from the experimenter, who entered the room).

2. Outcome: either OK, if the user was able to successfully complete the task within the time limits, or KO if the user was not able to successfully complete the task.

3. Number of RC commands given within the task

3.7.2 Computation of metrics out of the classification result
Computing the defined metrics out of the annotated data reported above is straightforward; in the following, the expressions for all metrics but WRR (which does not depend on the classification result and has been defined in the respective section) are reported.

CA = 1 iff CR ∈ { OK, NO, NP }

DC = 1 iff CR ∈ { OK, NO }

C-ACR = ΣOK / (ΣOK + ΣMOK + ΣNO + ΣNC + ΣTV), where ΣXX is the number of utterances classified as XX over a given interval (e.g. one task, one session, or the entire set of sessions). This definition gives the actual value for ACR, according to the above definition.

C-ACR* = (ΣOK + ΣNP + ΣMOK + ΣMNP) / (ΣOK + ΣMOK + ΣNO + ΣNP + ΣNC + ΣMNP + ΣTV), where ΣXX is the number of utterances classified as XX over a given interval (e.g. one task, one session, or the entire set of sessions); this definition gives an upper bound for the ACR, under the hypothesis that the Language Model was wide enough to include all the expressions that have been judged plausible and that the WRR with this model was the same as for the real Language Model.


In other words, while ACR gives information about how the system is performing, influenced by both basic factors (the efficacy of the design and the goodness of the implementation), ACR* gives information on how the system could perform if the design were perfectly matched to the expectations of the users.

TCR = ΣOK / (ΣOK + ΣKO), where ΣXX is the number of tasks classified as XX over a given interval (e.g. one session, or the entire set of sessions).

C-UE = (ΣNP + ΣNI + ΣMNP + ΣMNI) / (ΣOK + ΣNO + ΣNP + ΣNI + ΣMNP + ΣMNI).

C-SIE = ΣOK/(ΣOK + ΣMOK + ΣNO) where ΣXX is the number of utterances classified as XX over a given interval (e.g. one task, or one session, or the entire set of sessions); this expression gives the value of the metric according to its definition.

C-SIE* = (ΣOK + ΣNO + ΣNP + ΣMOK + ΣMNP) / (ΣOK + ΣMOK + ΣNO + ΣNP + ΣMNP + ΣNI + ΣMNI), where ΣXX is the number of utterances classified as XX over a given interval (e.g. one task, one session, or the entire set of sessions); this definition gives an upper bound of SIE under the hypothesis that the Speech Recognizer was perfect and the Language Model was wide enough to accommodate all the plausible commands; in other words it gives a measure of the goodness of the design, abstracting from its implementation. Ideally, this indicator should equal 1; the main cause of deviation from 1 is poor training of the user; hence another use of this indicator could be to calculate a sequence of its values, measured over time. An asymptotic behavior with target 1 is expected for properly designed systems; the speed with which it approaches 1 expresses how easily the system can be “learned” by the users, while the value at the beginning of the experience reflects the cognitive load.
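A minimal Python sketch of these formulas, counting the CR labels produced by the “complete” annotation (it assumes every denominator is non-zero); TCR is computed analogously from the per-task outcomes:

```python
from collections import Counter

def complete_metrics(cr_labels):
    """C-ACR, C-ACR*, C-UE, C-SIE and C-SIE* from a list of CR labels
    (strings such as "OK", "NO", "MNP", ...) annotated with the
    "complete" technique over a given interval."""
    n = Counter(cr_labels)
    return {
        "C-ACR":  n["OK"] / (n["OK"] + n["MOK"] + n["NO"] + n["NC"] + n["TV"]),
        "C-ACR*": (n["OK"] + n["NP"] + n["MOK"] + n["MNP"])
                  / (n["OK"] + n["MOK"] + n["NO"] + n["NP"] + n["NC"] + n["MNP"] + n["TV"]),
        "C-UE":   (n["NP"] + n["NI"] + n["MNP"] + n["MNI"])
                  / (n["OK"] + n["NO"] + n["NP"] + n["NI"] + n["MNP"] + n["MNI"]),
        "C-SIE":  n["OK"] / (n["OK"] + n["MOK"] + n["NO"]),
        "C-SIE*": (n["OK"] + n["NO"] + n["NP"] + n["MOK"] + n["MNP"])
                  / (n["OK"] + n["MOK"] + n["NO"] + n["NP"] + n["MNP"] + n["NI"] + n["MNI"]),
    }

def task_completion_rate(task_outcomes):
    """TCR from a list of per-task outcomes, e.g. ["OK", "KO", "OK"]."""
    return task_outcomes.count("OK") / len(task_outcomes)
```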

3.7.3 Simplified computation of metrics out of the classification result
The computation of the metrics given in the section above, while the more appropriate, poses some practical problems as far as annotation is concerned: if the missed utterances have to be identified and classified, a thorough analysis of the whole sessions has to be done, rather than only listening to the already segmented parts.

On the other hand, considering only the voice segments that have been segmented by the system itself and sent to the voice recognizer is a shorter job, but all the “missed” events are not considered. For this reason, another definition has been given to some metrics, which neglects the “missed” events.

In order to verify the discrepancy between the two annotation techniques, “complete” (which also considers the “missed” events) and “express” (which neglects them), only a subset of sessions has been annotated with the “complete” technique. As will be shown in the results section, the “complete” annotation technique took three times as much effort as the “express” technique. In the following, the re-definition of some metrics is reported, which considers only the classification coming from the “express” technique.

ACR = ΣOK / (ΣOK + ΣNO + ΣNC + ΣTV), where ΣXX is the number of utterances classified as XX over a given interval (e.g. one task, one session, or the entire set of sessions). This definition gives the actual value for ACR, according to the above definition.

ACR* = (ΣOK + ΣNP) / (ΣOK + ΣNO + ΣNP + ΣNC + ΣTV), where ΣXX is the number of utterances classified as XX over a given interval (e.g. one task, one session, or the entire set of sessions); this definition gives an upper bound for the ACR, under the hypothesis that the Language Model was wide enough to include all the expressions that have been judged plausible and that the WRR with this model was the same as for the real Language Model.


In other words, while ACR gives information about how the system is performing, influenced by both basic factors (the efficacy of the design and the goodness of the implementation), ACR* gives information on how the system could perform if the design were perfectly matched to the expectations of the users.

UE = (ΣNP + ΣNI) /(ΣOK+ΣNO+ΣNP +ΣNI).

SIE = (ΣOK + ΣNO)/( ΣOK + ΣNO+ ΣNP + ΣNI ) where ΣXX is the number of utterances classified as XX over a given interval (e.g. one task, or one session, or the entire set of sessions); this expression gives the value of the metric according to its definition.

SIE* = (ΣOK + ΣNO + ΣNP) / (ΣOK + ΣNO + ΣNP + ΣNI), where ΣXX is the number of utterances classified as XX over a given interval (e.g. one task, one session, or the entire set of sessions); this definition gives an upper bound of SIE under the hypothesis that the Speech Recognizer was perfect and the Language Model was wide enough to accommodate all the plausible commands; in other words it gives a measure of the goodness of the design, abstracting from its implementation. Ideally, this indicator should equal 1; the main cause of deviation from 1 is poor training of the user; hence another use of this indicator could be to calculate a sequence of its values, measured over time. An asymptotic behavior with target 1 is expected for properly designed systems; the speed with which it approaches 1 expresses how easily the system can be “learned” by the users, while the value at the beginning of the experience reflects the cognitive load.
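The corresponding sketch for the “express” annotation simply drops the M* labels, which are not available when only the segmented utterances are listened to (again illustrative only, assuming non-zero denominators):

```python
from collections import Counter

def express_metrics(cr_labels):
    """ACR, ACR*, UE, SIE and SIE* from CR labels of the "express"
    annotation (only segments actually sent to the recognizer)."""
    n = Counter(cr_labels)
    return {
        "ACR":  n["OK"] / (n["OK"] + n["NO"] + n["NC"] + n["TV"]),
        "ACR*": (n["OK"] + n["NP"]) / (n["OK"] + n["NO"] + n["NP"] + n["NC"] + n["TV"]),
        "UE":   (n["NP"] + n["NI"]) / (n["OK"] + n["NO"] + n["NP"] + n["NI"]),
        "SIE":  (n["OK"] + n["NO"]) / (n["OK"] + n["NO"] + n["NP"] + n["NI"]),
        "SIE*": (n["OK"] + n["NO"] + n["NP"]) / (n["OK"] + n["NO"] + n["NP"] + n["NI"]),
    }
```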

4 Results

4.1 Subjective measures
The usability evaluation was conducted in order to improve the interface of the first prototype of the DICIT system. The focus of this part of the document is to understand how users perceive the interface (both vocal and haptic) while they use the TV for “traditional” goals (like watching a program or adjusting the volume) or while selecting broadcasts from an EPG database using a set of filter criteria by means of voice input. The screen layout and navigation scheme were also evaluated.

The DICIT system can understand voice input and handle it accordingly, and the aim of this study was to determine whether or not users find a system that can also be operated by voice control appealing.

4.1.1 Statistical questions (Questions A-D)
The first part of the questionnaire contains statistical questions, e.g. regarding the subjects’ gender, occupation, or age. While the German and the Italian samples were chosen trying to represent the distribution of the whole population (regarding gender, educational qualification, job and age), because of the difficulty of finding American people both in Germany and in Italy, the balancing of the subjects of the English sample was not as strictly respected as in the other two samples (e.g. see the gender and educational qualification distributions).


Question A: You are…
  German:  male 55%, female 45%
  English: male 38%, female 62%
  Italian: male 50%, female 50%

Question B: What is your educational qualification?
  German:  secondary school 5%, high school 30%, degree/diploma 65%
  English: secondary school 0%, high school 13%, degree 87%
  Italian: secondary school 10%, high school 55%, degree/diploma 35%

Question C: Your age
  German:  16-20 25%, 21-30 45%, 31-40 15%, 41-50 5%, 51-60 5%, over 61 5%
  English: 16-20 6%, 21-30 63%, 31-40 6%, 41-50 6%, 51-60 13%, over 61 6%
  Italian: 16-20 25%, 21-30 35%, 31-40 30%, 41-50 10%

Question D: Profession
  German:  student (school) 25%, student (university) 10%, employee 55%, retired 5%, self-employed 0%, other 5%
  English: student (school) 0%, student (university) 19%, apprentice/trainee 0%, employee 50%, self-employed 25%, retired 0%, other 6%
  Italian: student (school) 10%, student (university) 45%, employee 40%, other 5%

Table 2: Statistical questions

Of the German subjects, 11 (55%) were male and 9 (45%) female. The educational distribution was skewed towards higher degrees and diplomas than the average population distribution. Testing with young subjects was a priority and therefore the 16-20 age group is well represented. Professions were correlated with age: young subjects were mostly students, and middle-aged subjects were mostly employees.

As for the English native speakers, the gender distribution of the subjects is 6 male and 10 female. Regarding the educational qualification, the majority of them held a university-level


degree (80%), and a small part of them held a high-school qualification. Since the ASR engine was specialized for US English, this sample should have been composed mainly of American subjects. Due to the difficulty of finding American people both in Germany and in Italy, British English native speakers were recruited in addition for this sample. The distribution was: 63% (10) US English speakers and 38% (6) UK English speakers. The age distribution of this sample is: 6% (1) teenagers (range 16-20), 63% (10) aged between 21 and 30, 6% (1) between 31 and 40, 6% (1) between 41 and 50, 13% (2) between 51 and 60, and 6% (1) over 60. The occupation distribution was: 19% (3) university students, 50% (8) employees, 19% (3) self-employed, and 12% (2) in the “other” category.

Regarding gender, the Italian sample was equally distributed (50% females and 50% males). Regarding the educational qualification, the distribution was: about one third (7) hold a university-level degree, the main part of the sample (11) finished secondary school, and the rest (2) finished middle school. Since we tried to involve young users (teenagers) in the test, more than half of the sample is under 31 years old (the last two age ranges, 51-60 and over 60, are not represented). The age distribution is: 25% (5) aged between 16 and 20, 35% (7) from 21 to 30, 30% (6) between 31 and 40, and two subjects (10%) between 41 and 50; unfortunately no subjects over fifty could be recruited. The occupation distribution was: more than half of the sample (55%) students, 40% employees, and 5% self-employed.

Regarding the field of study or work, most of the Italian sample is involved in humanistic/administrative activities and only two subjects (10%) work or study in the computer science area.

4.1.2 TV habits (Questions E-N)
The next section of the questionnaire contains questions regarding TV watching habits.

Question E: How many people live in your household, including you?
  German:  alone 20%, 2 people 30%, 3 or more 50%
  English: alone 19%, 2 people 56%, 3 or more 25%
  Italian: alone 15%, 2 people 15%, 3 or more 70%

Question F: How many TVs do you have in your house?
  German:  none 0%, 1 45%, 2 40%, 3 or more 15%, internet TV 0%
  English: none 19%, 1 37%, 2 38%, 3 or more 0%, internet TV 6%
  Italian: none 10%, 1 25%, 2 40%, 3 or more 25%

Question G: Who usually decides what to watch on TV?
  German:  one person 40%, together 30%, majority 5%, first served 20%, each one 5%
  English: one person 25%, together 56%, majority 6%, first served 13%, each one 0%
  Italian: one person 21%, together 47%, majority 21%, each one 11%

Question H: How do you usually decide which programme to watch?
  German:  teletext 5%, guide 50%, EPG 5%, surfing 40%
  English: teletext 0%, guide 38%, EPG 31%, surfing 31%
  Italian: teletext 45%, guide 15%, EPG 10%, surfing 30%

Question I: Which type of television do you usually watch?
  German:  traditional 45%, satellite 35%, digital terrestrial 20%, IPTV 0%
  English: traditional 56%, satellite 38%, digital terrestrial 6%, IPTV 0%
  Italian: traditional 84%, satellite 16%

Question J: How do you usually select a programme?
  German:  numeric buttons 36%, up/down buttons 55%, EPG 9%, IPTV VOD 0%
  English: numeric buttons 29%, up/down buttons 14%, EPG 57%, IPTV VOD 0%
  Italian: numeric buttons 45%, EPG 55%

Question K: What is the main information that interests you in choosing a programme?
  German:  genre 45%, topic 35%, channel 10%, actor 0%, duration 0%, don't care 10%
  English: genre 43%, topic 44%, channel 13%, actor 0%, duration 0%, don't care 0%
  Italian: genre 60%, topic 40%

Question L: For what do you usually use the TV?
  German:  watch on-air programmes 60%, recorded programmes 10%, rented videos 5%, bought videos 5%, VOD 0%, background 0%, channel surfing 20%, other 0%
  English: watch on-air programmes 57%, recorded programmes 6%, rented videos 19%, bought videos 6%, VOD 0%, background 0%, channel surfing 6%, other 6%
  Italian: watch on-air programmes 70%, rented videos 5%, bought videos 5%, background 10%, channel surfing 5%, other 5%

Question M: How do you consider yourself as a user of TV and related devices?
  German:  amateur 5%, basic 35%, moderately skilled 50%, very skilled 10%
  English: amateur 7%, basic 65%, moderately skilled 7%, very skilled 21%
  Italian: basic 35%, moderately skilled 60%, very skilled 5%

Question N: Who usually operates the domestic media appliances (TV, radio, satellite...) in your home?
  German:  I do 35%, partner 10%, parents 5%, children 0%, flatmates 10%, together 30%, nobody in particular 10%
  English: I do 21%, partner 43%, parents 0%, children 0%, flatmates 0%, together 29%, nobody in particular 7%
  Italian: I do 45%, partner 10%, parents 10%, all together 35%

Table 3: Questions regarding habits


The German test subjects mostly use the TV to watch on air programs and search using the up/down and numeric buttons. They normally do not use an EPG to decide on a program to watch but either surf or use a paper guide.

Since the English sample was split across two sites, and DTT and IP-TV are not widespread in Italy, the English native speakers recruited by Amuser, like most of the Italian sample, watch traditional television (9); on the other hand, the remaining English subjects in Italy (5) and the ones recruited at EB mostly use satellite to “stay tuned” to English programs, while one uses DTT.

Since in Italy “traditional” TV is still the most watched, the most common ways to decide which program to watch are reading the teletext or surfing channels. Even though few subjects are used to handling the EPG (either to decide which program to watch or to select it), most of the Italian subjects consider themselves “moderately skilled users” (60% of the sample); only a few of them delegate the operation of domestic media to partners or parents, and most answer that they do it themselves or with someone else.

4.1.3 The DICIT system (Questions 1-20)
Questions 1 to 20 are used to determine how subjects like specific aspects of the DICIT system, such as the screen, voice input, or voice output.

Using the DICIT System
These questions are used to determine how subjects get along with the DICIT system and whether they prefer voice to remote control input. Subjects had to rate each of the following questions with values between 1 and 10. Moreover, they could explain or comment on their answers in a text input field.

Figure 8 shows the responses of the subjects as boxplots. In this and in all following boxplots, the boxes indicate the 25% and 75% quartiles and the median value is shown as a thick vertical bar. The whiskers indicate minimum and maximum values excluding extreme values which are drawn as single dots or asterisks for extreme outliers.

The median value for questions 1 to 6 is above average and the majority of the responses are in the “positive” area. The positive tendency is, however, only marginal for question 3, referring to the general ease of use of speech commands. A clear positive tendency can be found in the responses to question 4 concerning multimodality. The responses to question 7, on the usefulness and efficiency of the system when problems occur, show, in contrast, a negative tendency. It is notable that for almost every question ratings can be found over the complete range of the scale. Hence, there is a strong heterogeneity among the subjects.


Figure 8: Questions 1-7 (“Using the DICIT system”)

The answers to questions 1 to 7 do not differ significantly among the language groups, with one exception: for question 3 a significant difference between Italian and English subjects can be found (Mann-Whitney, p < 0.05). English subjects gave a slightly worse rating than Italian subjects (cf. Figure 9).

Figure 9: Question 3 (“It was easy to understand how to give vocal commands”) broken down according to language. The difference between English and Italian is significant.

For some questions on the basic usability of the system, a correlation between the answer and the familiarity of the user with the technology could be determined: Answers to Question M strongly correlate with answers to Questions 1 (two-tailed Pearson correlation test, p<0.01) and 7 (two-tailed Pearson correlation test, p<0.05). Hence, users with more expertise tend to give higher ratings for basic usability of the settings menu and the error recovery.

Question 1: “It was easy to understand how to modify the settings.”
The biggest difficulties were the unexpected use of the ‘ok’ button to change the settings and the slow reaction time of the system.

English subjects considered it easy to modify the settings because the Settings Menu was easy to understand and find and contains a limited number of options. Two subjects had some problems at the beginning of the session, but after using the system for a while they easily understood how to behave.


The Italian subjects also considered it easy to modify the settings, but often they did not use the command Apply (either with the RC or by voice) after having changed the value of the different options. Since each option has items named with very different keywords, some stated that, when using vocal interaction, it could be preferable to say only the keyword instead of naming the label of the option first. One subject stated that she initially misunderstood the command “Cambia” (= Reset) and tried to use it to change the values of the different options instead of simply resetting the factory preferences.

Question 2: “It was easy to understand how to use the different selection criteria.”
The German subjects complained about the difficulty of entering the correct time into the system. The options that could be selected were sometimes unknown or not sufficient for the task. The ‘night’ setting was not showing programs after 0:00.

English subjects did not judge the filters to be easy to use. One subject criticized that there are too many options, and another subject (who used the RC for task C) stated that it was not clear whether she had to press the OK button to see the results of the search criteria, or whether she had to use the red button (Results command).

The Italian users had no problems using the filter criteria, even if few of them understood that it was possible to use more than one criterion at a time to filter the data. Two users stated that, due to the slow reaction time of the system, it was somewhat difficult to switch from the search function to the list of programs. One subject lamented some difficulty in linking the selection of search criteria to scheduling the viewing of a program.

Question 3: “It was easy to understand how to give all the vocal commands.”
German subjects found it difficult to determine what would be understood by the system. After producing some correct commands they found the interaction easier. They would like to have some kind of quick feedback that the system has received the command. Another complaint was the reaction speed.

English subjects judged the speech interaction quite difficult to use because (due to the slow reaction time of the system) it was not clear whether or not to repeat a given command. Moreover, since repeating the utterance generated some recognition problems, some subjects stated that even when they gave clear instructions, the system often did not understand them. One subject stated that it was not clear whether the system would recognize sentences or just short commands. English subjects also remarked that getting feedback on what the system understood could be helpful.

Italian subjects judged the voice commands quite easy to use, and 25% of them attributed the recognition problems of the system to the fact that it was not clear whether the system would better recognize sentences or just short commands. One subject stated that if she had some problems using vocal commands, it was because she was not used to interacting with this kind of system (in her opinion it was not an interface design problem). Another user stated that in his opinion there was an inconsistency between the commands “previous” and “next”, because in his experience the command “previous” worked as “previous page” while the command “next” could be used only to scroll the program list item by item.


Question 4: “It was comfortable to give some information using voice and other information using the remote control.”

German subjects complained again about the speed of the system. Some judged the possibility to start with speech and switch to the remote when speech fails useful. Selecting items lower in the list was found practical using speech because it does not take a lot of cursor movements. It was found that different tasks have different preferred modalities.

English subjects found it comfortable to give some information by voice and other information with the RC, because it is good to have a choice, and when the voice recognition works correctly it speeds up changing channels or selecting programs.

The Italian subjects also found it comfortable to use both the RC and voice commands, but 25% of them attributed to the RC a recovery function for when voice commands did not work. One subject highlighted that voice interaction is more flexible, above all during the search for programs, and two subjects stated that the advantage of using voice is the possibility of interacting with the system even while doing something else.

Question 5: “It was easy to change the channel by voice.”
German subjects would have liked to have a ‘list mode’ showing an overview of the channels. Some channel names seemed not to be recognized; in those cases saying a number proved easier.

English subjects judged it easy to change channels by voice, but 15% of them stated that they had some problems making themselves understood by the system.

The Italian subjects also judged it easy to change channels by voice, but 35% of them highlighted that some channels were often confused with channels with similar names (e.g. Sky 24 with France 24, or the different Rai channels).

Question 6: “It was easy to change the volume by voice.”
German subjects found that the system sometimes muted unexpectedly. And although volume commands were recognized, the volume was sometimes adjusted in the wrong direction. A couple of subjects did not test setting the volume by voice.

English subjects judged adjusting the volume by voice a little easier than changing channels by voice. One subject highlighted that DICIT seemed to recognize her voice more easily when she asked it to turn the volume up.

The Italian subjects also judged it easier to adjust the volume than to change channels by voice. 15% of them stated that they still had some problems understanding how to use the volume commands (e.g. “mute”, “half volume”, etc.).

Question 7: “In case of problems did the system suggest usefully and efficiently what to do to recover the information after the error?”

Regarding the help function, German subjects found that the help mostly did not react to the specific context the system was in. Also, the help dialogues often switched off after reading just their beginning (TTS self-recognition). And when a user reached a point where he did not want to be, the help would not cover the menu he came from.


English subjects did not find the error recovery provided by the help function or by the GUI useful and efficient, because often the repetition of a correct command, during the lapse of time before the system reacts, was not understood and caused another misunderstanding.

The Italian sample also did not find the error recovery useful and efficient, and three subjects complained about the lack of contextual error messages, highlighting that the help prompts were not exhaustive and that the TTS volume was not so easy to distinguish from the TV in the background.

Watching the Screen
The aim of this section was to get feedback on the DICIT screen, i.e. whether it was easy to read the screen and navigate the menu structure. As in the previous section, every question in this section consists of a rating value between 1 (negative) and 10 (positive) and an input field where subjects can explain their choice.

Figure 10 shows the responses of the subjects as boxplots for questions 8 to 11 and question 13; question 12 is represented in Figure 12. As for questions 1-7, the responses show a positive tendency. This tendency is weaker for questions 8, 10, and 13, referring to the usefulness of the separated list/criteria screen design, speech input of search criteria, and the usefulness of the information on the screen with the audio disabled. For these questions, we find ratings on the whole scale, even on the strongly negative side. Responses to questions 9 and 11, regarding screen readability and use of the RC, show a strong positive tendency with few exceptions on the negative side. According to the results for question 12, many users would expect more or different voice commands.

The answers to questions 8 to 13 do not differ significantly among the language groups, with one exception: for question 8 a significant difference between Italian and English subjects can be found (Mann-Whitney, p < 0.05). English subjects gave a slightly worse rating than Italian subjects (cf. Figure 11 and Figure 9).

Figure 10: Questions 8-11,13 ("Watching the screen")


Figure 11: Question 8 (“Do you find it useful to choose between displaying the programme list and the criteria list”) broken down according to language. The difference between English and Italian is significant.

Figure 12: Question 12.

Question 8: “Do you find it useful to choose between displaying the programme list and the criteria list?”

German subjects couldn’t always switch between the two easily enough. Subjects judged the list mode easier. One would like to have it displayed underneath the criteria when the list is small enough.

The English subjects judged that it was neither easy nor difficult to choose between displaying the program list or the criteria list, but some of them stated that this is true only if the program recognizes the command; otherwise it becomes frustrating. One subject highlighted that some search criteria, like time, were not easy to follow because their options (e.g. morning, afternoon…) do not suggest how to choose intermediate values (given by the hours).

The Italian subjects found it useful to choose between displaying the program list or the criteria list, because it allows choosing the preferred way to make a search. One of them stated that it is confusing for him that the first screen of the EPG is the one with the list of programs.

Question 9: “Is the screen which shows the criteria for the programme search easy to read?”

The screen was easy to read and colors were found to be well chosen.

For the English subjects it was easy to read the search criteria screen, even if two persons stated that they had some problems accessing this screen.

The Italian subjects found this screen useful and easy to read.


Question 10: “Was it easy to understand how to use vocally the search criteria for programmes shown on the screen?”

It was found difficult to change the genre vocally, and due to the slow reaction it was not always clear whether the system had understood the speaker. The overview of possible options was helpful.

The English subjects did not find it so easy to understand how to use the search criteria vocally to select programs, because some of them wanted to say the numbers associated with the different options, but these were often not understood. In these cases they tried to select the wanted value with commands like “go down, go down”, etc., to scroll the list of options (as they would have done with the RC), but they obviously found this kind of selection rather annoying.

The Italian subjects judged that it was neither easy nor difficult to use voice interaction for the search criteria, but two of them complained about the slow reaction time of the prototype (which caused many repetitions of commands), and another user stated that it is frustrating to use the search criteria screen when the system does not understand the right option (e.g. in the Italian search criteria screen, the channel “Canale 5” was associated with ordinal number 6 in the list of channels, while the channel “Rete 4” was associated with number 5, so when they asked for channel 5, meaning the channel name, the output was often “Rete 4”). Two other subjects stated that it was not clear how to make the cursor go to another search criterion once they had filled in the first one, and one of them suggested letting the cursor automatically move to the next item, highlighting that it was not even clear that more than one criterion could be filled in at a time with a single utterance.

Question 11: “Was it easy to understand how to use the remote control to select the search criteria for programmes?”

German subjects judged the use of the remote intuitive. Some complained about the speed of the system or the layout of the buttons on the remote control. The speed of the remote control interaction was sometimes found to be a problem.

English subjects found it useful to use the remote control to select the search criteria because, it being the traditional way to interact with the TV, they found it easier than interacting by voice.

The Italian subjects found it very useful to use the remote control because they feel more familiar interacting with the TV in the haptic mode than in the speech mode. Only one subject lamented that the RC buttons are too sensitive to the touch.

Question 12: “To reach the task we have assigned to you, did you expect to have some other vocal commands?”

The biggest complaint was the lack of an intuitive record command in the program list. Another requested feature was the possibility to ask for the currently running program.

Three English subjects of the Amuser sample stated that they would like to have other commands, but their suggestions highlight that they misunderstood this question as meaning that the system should give them other feedback, like: some response after the “return” or “reset” commands; questions like “what are you looking for?”, “which day?”, “what time?” (to solicit their answers); and an overall diagram of the menu structure as a suggestion of what types of things they could choose from.


Only four Italian subjects expected to have other commands or options. Three of them lamented missing commands (like “back” and “go ahead” to scroll pages, and “after” instead of “next” to scroll the items) that are actually already foreseen. Another one stated she would like to choose programs by saying the title directly.

Question 13: “Did you find the information on the screen useful to orient yourself, if the audio has been disabled?”

Some German subjects remarked on the fact that they did not deactivate the voice. One found the specification of the function of the colored buttons on the screen helpful.

English subjects judged the information on the screen useful for orienting themselves, although some of them complained that the information (probably the help messages) was not always present and not always easy to understand.

The Italian subjects found the information on the screen useful, but one commented that it was hard for him to read the text. One subject lamented that she did not get any help from the system when the audio was muted, and another one suggested changing some keywords, such as “Cambia” (= Change, with the meaning “Reset”), to “Annulla” (= Cancel).

Vocal Interaction
The aim of questions 14-17 was to probe the appeal of the vocal mode (in comparison with interaction in the haptic mode) and its flexibility, both in providing input and in giving output.

A correlation between the answer to Question 16 and the familiarity of the user with the technology could be determined (two-tailed Pearson correlation test, p < 0.05). Hence, users with more expertise tend to give higher ratings for the system’s ability to understand vocal commands.

Question 14: How do you judge the opportunity to use a vocal command?
  German:  useful if it replaces the RC 10%, useful if used together with the RC 35%, useful if the system reacts quickly 50%, would never use vocal commands 5%
  English: very useful 44%, useful if the system reacts as quickly as the RC 31%, useful if used with the remote 6%, useful if replacing the remote 6%, would never use vocal commands 13%
  Italian: very useful 5%, useful if it allows more operations than the RC 40%, useful if the system reacts as quickly as the RC 40%, useful if it replaces the RC 15%

Question 15: For the vocal commands, you prefer:
  German:  short commands 85%, full sentences 10%, reading precise commands 5%
  English: short commands 49%, full sentences 13%, reading precise commands 38%
  Italian: short commands 55%, full sentences 15%, reading precise commands 30%

Question 16: How do you judge the system's ability to understand the vocal commands?

Question 17: Did you find any situation in which you needed to have some more instruction to interact with the system?
  German:  yes 85%, no 15%
  English: yes 69%, no 31%
  Italian: yes 64%, no 36%

Table 6: vocal mode questions

Question 14: “How do you judge the opportunity to use a vocal command?”
The German subjects found voice interaction useful as long as interaction via the remote control is still available. They found it especially helpful for selecting items in large lists or for changing to a channel when the channel number is not known.

The majority of the English subjects answered that voice interaction is very useful, even for disabled and elderly people. Five users answered that voice is useful only if the system reacts as quickly as it does with the RC, and one of the two subjects who said they would never use vocal commands reinforced his negative answer, stating that he would never use voice commands because he is not interested in talking to a machine.

The Italian sample answered this question positively, but most of the sample made the judgment conditional on a comparison with the RC: eight subjects (40%) answered that voice is useful if it allows more operations than the remote control. Another 40% considered voice useful only if the system reacts as quickly as it does with the remote control, and three of them reinforced their answer explaining that they consider the reaction of the system to voice commands slow. Only three subjects commented that they would prefer voice if it replaced the remote control entirely, commenting that this functionality is useful for elderly people who have difficulty handling the RC, or to avoid searching for the remote control.


Question 15: “For the vocal commands, you prefer…”
Most subjects like short commands the most. Reasons given include a quicker reaction, fewer recognition failures, and an uneasiness about interacting naturally with a computer system.

Most of the English subjects agree on short commands or on reading precise commands on the screen. Those who want to use only the commands suggested by the GUI think that using only the keyword would speed the system up and reduce mistakes. Some of those who immediately used short commands stated that there were no real instructions as to how the keywords on the screen should be read, and a British native speaker pointed out that, since British and American English are very different, it would be helpful to make the commands specific to the country. Only a few users said that they would use full sentences, but one stated that it was not easy to judge how much the system can understand.

Since in everyday life they are accustomed to spending limited effort on operating domestic media, most of the Italian subjects want to use short commands (55%) or to read precise commands on the screen (30%), because they imagine that this should generate a more precise and quicker reaction by the system. None of the three subjects who said that they prefer full sentences justified their answer.

Question 16: “How do you judge the system’s ability to understand the vocal commands?”
German subjects had problems getting the system to understand longer utterances and judged the recognition rate to be low. Saying just the channel number, without ‘Kanal’ (= channel) in front, did not seem to work. And the word ‘Ergebnis’ (= result) was thought to be badly recognized.

English subjects judged the ability of the system to understand vocal commands insufficient, both because the reaction time of the prototype was slow and because there was often no response to their utterances.

The Italian subjects also found the ability of the system to understand vocal commands insufficient, but their judgment is a little better than that of the English sample because some of them (mainly women) ascribed the recognition problems to the volume of the TV or to their soft voice. Moreover, the Italian subjects mainly pointed out specific cases of misunderstanding, such as known problems with the words “sport” and “niente” (= none), the exchange of similar channel names, or substitution problems like “giorno” (= date) mistaken for “genere” (= genre), or “change <channel_number>” mistaken for “mute”.

Question 17: “Did you find any situation in which you needed to have some more instruction to interact with the system?”

German subjects would have wanted more help during the sessions, especially during the programming of a TV-show. They would have liked a better example of how to use the search and programming function of the system.

Most of the English subjects would like to have some more instruction to understand well how to use the vocal commands and how to “undo” something when they realize that they did something wrong (e.g. one subject specified that she was not sure whether the “reset” command should be used to erase only the last spoken or misunderstood command, or whether it cancels everything and starts all over again).

The Italian subjects would also like to have some more instruction, but mainly to understand how to schedule when the TV has to turn on, or how to switch from the EPG mode to the on-air


program selected by the search criteria (this functionality did not work in the Italian version of the prototype).

The System Voice
The system voice of the DICIT system is the subject of this section. Subjects were asked how they judge the TTS output, whether they want the system to read the results of a search, and whether they want to be able to switch off the recognizer.

Question 18: How do you judge the talking feature of a system?
  German:  useful 70%, useful only for elderly people 30%, not useful 0%
  English: useful 38%, useful only for elderly people 62%
  Italian: useful 50%, useful only for elderly people 50%

Question 19: Do you find it useful that the system reads (in addition to listing them on the screen) the programmes found after your search?
  German:  yes 0%, yes if not too many 20%, no 80%
  English: yes 25%, yes if not too many 44%, no 31%
  Italian: yes 45%, yes if not too many 30%, no 25%

Question 20: Would you like to have a button to enable/disable the vocal recognizer?
  German:  yes 80%, no 20%
  English: yes 94%, no 6%
  Italian: yes 80%, no 20%

Table 7: vocal mode questions


Most of the English subjects consider the talking feature not useful for themselves, but they think it's useful for elderly persons or persons with visual impairment.

The Italian sample is split in half about the talking feature of DICIT: 50% think it is useful, the other 50% think it could be useful only for elderly persons. Those who consider this feature useful interpreted it as an advanced functionality, stating that it is comfortable especially if they are doing something else while consulting the EPG (10% of the Italian subjects use the TV as “background” while they are doing something else).

Question 19: “Do you find it useful that the system reads (in addition to listing them on the screen) the programmes found after your search?”

Most German subjects do not want to have the programs read to them, pointing out that it would take too long to read a long list.

Seven English subjects want the system to read out the results only if there are not too many; adding to these users the subjects who think it is useful for the system to read the programs regardless of their number, the majority of the English sample would like to have this feature.

Most of the Italian subjects liked the possibility of having the system read the output (45% answered yes and 35% answered yes if there are not too many items), but they would like to be able to choose whether to enable or disable this feature.

Question 20: “Would you like to have a button to enable/disable the vocal recognizer?”
Most subjects would like the possibility to turn off the recognizer, but only for situations when the system falsely reacts to off-talk. One remarked that it would also be desirable to be able to turn it off by voice.

The majority of the English subjects want the possibility to disable the recognizer, because having options is always considered a good thing.

Almost the whole Italian sample (16) likes the possibility to manually disable the recognizer, and some of them justified their answer by saying that they prefer to control the interaction, above all when they have to talk with someone else in the same room.

Adaptive Features
The topic of this section is some adaptive features concerning user profiling and customization of the prototype. Although the implementation of these features in the final prototype will depend on specific architecture choices, we decided to investigate whether users would like them or not before planning the second prototype. The subjects were asked to imagine some features that they had not experienced and to express their agreement or suggestions about possible new functionalities of DICIT.

Figure 13 shows the distribution of the responses for questions 21 to 23, indicating a strong positive bias towards adaptive features in general, towards the highlighting of frequently used functions, and towards the general feeling about being monitored by the system. It should be pointed out, however, that almost the complete scale was used. Hence, there are still people who do not like adaptive features and dislike the idea of being monitored by the system. For questions 21 and 23 no significant influence of the language could be determined. Question 22 depends to some degree on the language: a significant difference between the Italian and the English


group was found (cf. Figure 14). The latter had a stronger positive feeling about a possible highlighting feature than the former.

Figure 13: Questions on adaptive features.

Figure 14: Question 22 broken down according to languages.

Question 21: “Imagine the system could adapt itself to your behavior, e.g. by providing help depending on your expertise with the system. How do you feel about such a feature?”

German subjects would like to have the system adapt to their behavior. They noted that it should adapt to each person using the TV individually, that the system could show programs that the user chose more often, that it could store user preferences such as the volume per TV show, and that it could provide better guidance to a user who makes the same mistakes repeatedly. One person was worried that it might adapt to the user in the beginning and therefore not display the advanced options that he might want to use after gaining some experience with the system.

The majority of the English subjects want the system to adapt itself because it allows people who are more skilled to search and to schedule more quickly, and at the same time they think it would help elderly people or those not used to computer systems. On the contrary, one of the two subjects who gave a low score to this question stated that he does not want to argue with the system (probably thinking that this implies an annoying training phase).


Most of the Italian subjects liked the possibility that the system adapts itself to their behavior. Of the two subjects who gave a low score, one stated that operating the TV does not require particular features, the other that the system has to be user-friendly for everyone, beginner or expert.

Question 22: “What do you think about the adaptive system highlighting (e.g. by enlarging) the most frequently used functions on the screen automatically?”

Most subjects judged the highlighting feature positively, noting that it would be easier to find frequently used options. Negative opinions were due to the fact that subjects did not want to be influenced by the TV, or that rarely used functions would become harder to reach.

The English subjects found it very positive that the system could automatically highlight the most frequently used functions on the screen, because it might simplify the interaction and speed up the selection process (especially for elderly people or for persons with sight problems).

The Italian subjects think that the automatic highlighting of the most frequently used functions could be positive, provided that it does not prevent them from easily finding other functions and that it is possible to change the functions proposed automatically by the system.

Question 23: “Do you think you could feel uncomfortable if the system monitored your choices or recorded your preferences in order to improve itself for you?”

German subjects were quite diverse in answering this question. The biggest concern is the privacy of the information; they would therefore not want the system to be connected to the internet if it stores their data. A minor concern was a feeling of uneasiness with the system remembering or learning the behavior of the user. They also requested that the feature be opt-in and password protected.

The English subjects do not feel very comfortable with a system which monitors their choices, because even if it speeds up the selection process, they have some doubts about the privacy of this kind of information.

The Italian subjects feel comfortable with a system which monitors their choices, but some of them made their positive answer conditional on the possibility to choose how to manage and change the customization whenever they want.

Question 24: “Do you have any additional ideas how the system might support your operation by adapting to your behavior?”

Suggestions offered by German subjects were: a warning when a show that the user likes is about to start, recognition of different users by their voice, recognition of different moods of the user, user-specific settings such as volume level or favorites list, and placing frequently chosen programs at the beginning of a list.

Some English subjects reinforced their previous answers, saying that they would give priority to channel, series and command preferences.

Two Italian subjects would like a kind of “reminder” that alerts them when a program of a favourite genre, or one already chosen by the user, is on air. Another subject suggested improving the customization by letting the user add a channel to the “favourite list” associated with a profile while that channel is being watched.


Task specific questions

The efficiency of the design, and its relationship with the experience of the subjects, can be measured through the learning curve effect: the more often a system has been used (or a task performed), the less time is required on each subsequent iteration, because the users become more skilled at interacting with it and their cognitive load decreases.

In order to verify whether there was some learning effect, and whether the different tasks were accomplished more or less easily depending on the goal and on the interaction mode that each task required, four questions were dedicated to letting the users evaluate how difficult or easy they found each specific task. Obviously this subjective evaluation does not necessarily correspond to the time spent to reach the goal, nor to the actual accomplishment of the task, since the subjects sometimes thought they had finished the task or reached the goal even when they had not.

These questions do not include the “free” task because, as the name says, the users were free to do what they wanted; in order to let them “explore” the interface in the way they preferred, it was carried out before they had seen the two instruction demos.

The result is displayed in Figure 15. Each of these questions had to be rated with a value between 1 = difficult and 10 = easy.

Note that, although the subjects carried out the specific tasks in a different order, the first task was always “generic”, meaning that users could interact with the system looking for the items they wanted, using the interaction mode they preferred.

Since only tasks 2 to 4 were comparable, the development of the difficulty rating over these tasks is considered. In general, the distribution of responses shows a shift from a rating slightly biased towards “difficult” for Task 2 towards better ratings for Tasks 3 and 4. All quartiles move towards the right, the “easy” end of the scale. The differences between Task 2 and Task 3 as well as the difference between Task 3 and Task 4 are not significant. However, a significant increase from Task 2 to Task 4 can be determined (paired Wilcoxon signed ranks test, p<0.01). Hence, the subjects consider the tasks to be less difficult if they have prior experience. We should point out, however, that the whole range of ratings was used for all tasks and that the difference between the quartiles is quite high. Difficulty ratings seem to be highly dependent on the individual. There were no significant differences in the ratings among the language groups except for a significant difference between the German and the Italian group for Task 2 (cf. Figure 16).

Figure 15: Questions 25.1-4, difficulty ratings for the tasks.
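
For readers who wish to reproduce this kind of analysis, the following sketch shows how a paired Wilcoxon signed-ranks test over per-subject difficulty ratings could be computed with SciPy. The two rating vectors are invented placeholders, not the collected questionnaire data.

# Minimal sketch of the paired Wilcoxon signed-ranks test applied to the
# Task 2 vs. Task 4 difficulty ratings; the rating vectors are invented
# placeholders, one value (1 = difficult ... 10 = easy) per subject.
from scipy.stats import wilcoxon

task2_ratings = [3, 4, 2, 5, 6, 3, 4, 5, 2, 6, 4, 3]
task4_ratings = [5, 6, 4, 6, 7, 5, 6, 7, 4, 8, 5, 6]

stat, p_value = wilcoxon(task2_ratings, task4_ratings)
print(f"Wilcoxon statistic = {stat:.1f}, p = {p_value:.4f}")
if p_value < 0.01:
    print("Task 4 is rated significantly easier than Task 2 (p < 0.01)")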


Figure 16: Question 25.2, difficulty rating for Task 2 broken down according to language.

It is plausible to assume that the difficulty ratings correspond to the overall performance of the prototype. Since the overall recognition performance of the setup in Germany was lower than the performance of the Italian setup, this might result in a worse difficulty rating for German than for Italian subjects.

Since the experimental test plan was split between EB and Amuser, the English subjects recruited in Italy (twelve persons) mainly had a sequence of tasks where the goal of the second task was to find car races (task B) using only the voice. For this reason, the low rating of Task 2 could be explained by the more negative impression that the English subjects (mainly recruited at Amuser) had of the system’s ability to understand vocal commands.

The Italian subjects also, in the majority of cases, had as their second task the one with the goal of finding car races (task B), but the distribution of modalities used to reach this goal was more balanced (in four cases they had to use only the RC, in eight cases only the voice, and in the remaining cases they could use both); for this reason it seems that they felt more comfortable than the English and German subjects with the second task, where they could choose the interaction mode.

4.1.4 General opinion on DICIT prototype #1 (Questions 26 & 27)

The remaining questions are used to examine how subjects like the DICIT prototype #1.

Users’ experiences with DICIT

Finally, users had to rate their experiences with DICIT by means of 18 questions. Each of these questions had to be rated with a value between 1 = complete disagreement and 7 = complete agreement.

Figure 17 illustrates the responses for questions 26.1-18 for the whole sample (without considering different languages). With respect to the evaluation of the whole system in general, sub-questions 26.1/2 show that subjects are divided concerning the ease of use and the degree of confusion they experienced. People tend to rate ease of use slightly better than average and confusion slightly less than average; however, the complete range of responses was used in both questions, with the exception that no one rated “I think that DICIT is easy to use” with “complete agreement”. Though there is a certain fun factor in using the system, reflected in the better-than-average rating for sub-question 26.13, subjects do not express a clear preference for DICIT over traditional ways of searching for interesting programs (cf. 26.14) – though the responses again cover the complete range of the spectrum. The usability of the system in terms of the simplicity of solving typical tasks is rated poorly on average (cf. 26.15). For most subjects, solving the tasks was not easy – the spectrum of responses is shifted to the left (disagreement) end – though some subjects partly or even fully agreed that they “easily succeed in their tasks”. A lot less variability is found in the responses for 26.16. Almost all subjects fully or at least partly agree that the prototype system needs some improvement.

As for the evaluation of the input modalities speech and remote control, subjects express that they have to pay more attention using speech input than using the RC (cf. 26.4/6). The agreement for the question regarding speech (26.4) is rated a bit higher than average, while the same question regarding the RC (26.6) is rated much lower than average. A broad range of different responses is found when people were asked whether they lose the thread while interacting vocally (cf. 26.5). There is no particular tendency in the responses, showing that about half of the subjects partly or fully lost the thread while interacting vocally. Comparing speech input and RC, people on average do not see advantages in speed and simplicity using speech (cf. 26.7/8). Also, there is no clear preference for speech or the RC on average (cf. 26.9). The responses cover the whole range of the spectrum again, showing that the rating for speech vs. RC as input modality is highly individual.

Specific interaction designs for the selection criteria and settings were rated with a clear positive tendency. The criteria were clear (cf. 26.11) and the settings were easy to modify for most subjects (cf. 26.12).

The system’s voice is neither particularly liked nor disliked by subjects on average, though the responses again cover the full range of the spectrum (cf. 26.3). People generally do not think that the voice speaks too quickly (cf. 26.10).

Vocal instructions were rated as “boring” slightly less than average, and slightly more than average as “useful” – again with the full range of responses.


Figure 17: Question 26.1-18 - General opinion about the DICIT prototype.


Figure 18: Sub-questions of 26 showing a significant difference according to the language.


Significant differences between the language groups can only be found in sub-questions 1, 2, 4, 5, 6, and 16 (Figure 18) – (Mann-Whitney test, p<0.05 for everything considered “significant”). English subjects rated “ease of use” significantly worse than Italian subjects (cf. 26.1). The system confuses English subjects significantly more than German subjects (cf. 26.2). Italian subjects have to pay less attention using vocal interaction than the members of the other two language groups (cf. 26.4). Moreover, the English subjects’ rating on “losing the thread in vocal interaction” is significantly worse than the rating of the Germans (cf. 26.5), and English subjects also stated they had to pay more attention compared to German and Italian subjects (cf. 26.6). Taken together, the English subjects generally rated the system worse than the members of the other language groups. Finally, more German subjects than Italian subjects agreed that “DICIT needs improvement” (cf. 26.16).
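
The per-item language-group comparisons can be reproduced along the following lines; this is a minimal sketch assuming SciPy is available, and the two rating lists are invented placeholders rather than the actual Question 26 responses.

# Hedged sketch: comparing two language groups on one 7-point item of
# Question 26 with the Mann-Whitney test; the ratings are invented.
from scipy.stats import mannwhitneyu

english_ratings = [2, 3, 3, 4, 2, 3, 5, 2, 4, 3]
italian_ratings = [4, 5, 6, 4, 5, 6, 5, 4, 6, 5]

u_stat, p_value = mannwhitneyu(english_ratings, italian_ratings,
                               alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
print("significant at p<0.05" if p_value < 0.05 else "not significant")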

Rating user satisfaction within DICIT

In the final section, subjects had to rate the DICIT system on a range of 1 to 7 between sets of opposite adjectives (as in the classical scale used in the 'semantic differential' of Osgood). For most adjectives, small values represent a positive feedback.

The results shown in Figure 19 are the ones for the whole sample (without considering differences related to the language). Notably, for many pairs of adjectives subjects typically chose a rating in the middle, but the left and right ends of the scales are also covered (more differences are highlighted where results are clustered by language). Ratings that describe the appearance of the system (form, activeness, friendliness, politeness, cleverness, organization, patience) from a more personalized point of view are slightly inclined towards the left (positive) end. Ratings describing system performance in a more depersonalized way (easiness, efficiency, speed, precision, capability, predictability), however, are either balanced in the middle of the scale or on the right (negative) end. Speed was rated poorly in particular. More than 75% of the subjects gave a 5-7 rating here, i.e. a rating on the “slow” side of the scale. Finally, more than 75% of the subjects chose the left, “original” side of the “original – copied” scale.


Figure 19: Question 27 - system properties assigned to the prototype.

Of the 14 properties, five show significant differences between two or more language groups (Mann-Whitney, p<0.05). Figure 20 illustrates these differences. Italian subjects rated the efficiency of the system significantly higher than German and English subjects. Another significant difference can be found in the rating for originality between German and English subjects. English subjects also gave a significantly lower rating for the system’s capabilities than the Italian subjects (with the Germans in between). As for the more “personalized” adjectives, Italians gave higher ratings regarding form and politeness. In both cases their ratings are significantly higher than the ratings of the German group.


Figure 20: Question 27 - some properties broken down according to language.


4.2 Observations of the experimenters

Some results cannot be derived from the logging data directly, but are subjective impressions of the experimenters. They still have to be considered, since these observations were shared by different experimenters and across different sessions.

4.2.1 Multi-slot usage

In a multimodal system, the users can “say what they see”: they are not only aware of the system's domain of knowledge (as in a speech-only system), but they also do not need to be prompted by it to complete a search, because their requests are immediately elicited by the system’s options that they can see. While in a speech-only system the mixed initiative is mainly a “dialog initiative”, in a multimodal system the mixed initiative can be considered mainly a “task initiative”, because the users can directly see the available items and choose whether they want to use a few of them, or all of them, to reach the goal they prefer (without focusing on a “step by step” dialog).

In the DICIT system, when in the EPG mode, one of the advantages of the voice interaction over the haptic mode should be the ability to fill more than one search criterion at a time (speeding up the interaction with the system), instead of selecting each search criterion and entering a sub-menu in which to select a possible value for that criterion.

On the other hand, in order to take advantage of this convenient feature, users have to: 1) know what kind of words they can use to make an appropriate request (i.e. know what kind of channels or genres are available, or what type of utterances the system is able to understand), and 2) be confident that the system performs well enough and will reliably understand their requests (they will not take the initiative if they are not confident).

Obviously, the knowledge about the allowed phrases/commands is gained while using the system (the more the users interact with it, the more they are able to use the built-in phrases/commands); on the other hand, the evaluations were conducted with naïve users, who had very little time to understand how the system could be used, so their learning was limited to some essential strategies (i.e. learning the available satellite channels or the available genres, and using the voice commands like the RC commands).

Coming to the second point, some studies have highlighted that people often tend to use basic-level interactions (producing single-word commands) while interacting with automatic systems [18]; a few of them, after some use of the system, tend to take more initiative in the dialogue, provided that they have gained enough confidence in the system’s ability to understand them [19].

Multimodal interaction implies both saying complex phrases (instead of one single command) and being confident that the system will react to any utterance because it answered correctly in the first turn of the dialog (this encourages the subjects to use more complex phrases even if the first one was a simple keyword). If the subjects are not used to interacting with a voice interface, and their first turn of dialog did not get an appropriate answer, they will tend to avoid the kind of complex interaction that the “multi-slot” feature allows.

4.2.2 Verbal behavior

The kind of voice interaction exemplified in the demos had some influence only on younger users, who seemed to imitate the behavior seen in the demos a little more than adult subjects, trying to speak with the system using more natural language.

Moreover, when in the EPG mode, few of them tried to fill more than one search criterion at a time (sometimes referred to as “multi-slot” behavior); furthermore, most of them often opened the sub-menus containing the possible alternatives for each search criterion instead of saying the value directly. As an example, when in the EPG search screen, they said “channel”, which brought them into the channel sub-menu, and then chose one channel (e.g. “BBC”), whereas they could have just said “channel BBC” in the main menu.

This “step by step” behavior is probably due to the adoption of a kind of interaction that imitates the use of the remote control instead of using the more powerful “shortcuts” provided by the voice interaction. The limited experience that the subjects had with this new system probably also had some influence on this finding.
Specific problems of the voice-based interface which generated particular behaviors by the subjects are:
1) the slow reaction time of the system;
2) the interruption of the help messages using commands that were not allowed;
3) the mapping between the time intervals used as options of the “time” menu and the possibility to say the hour directly to specify a broadcast time;
4) the possibility to select an item by its associated number (using either voice or RC); and
5) the possibility to select an item by saying its title.
One of the most important problems caused by the slow reaction time of the system is that the subjects, accustomed to a quick reaction of the TV set (particularly the users that still watch “traditional” TV instead of DTT), tended to repeat within a few seconds the keywords/phrases used to give voice commands, thinking that the system had not “understood” the first command. Actually, the system was still processing the first utterance, and the second one either spoiled the recognition of the first one or caused a rejection, since the same command was no longer available in the dialog state reached after the first command.

The insufficient score given by the subjects to the system’s ability to understand vocal commands is also due to this slowness, combined with the fact that the system did not give any feedback to let the user know that it was still “working”.

Although the reaction time to commands given via the RC was almost as slow, the subjects’ habit of interacting with the TV set in that way, and the novelty of using their voice without knowing well what kind of commands they could say, probably reinforced these negative evaluations. Actually, the sample that judged the voice interaction most negatively is the English one; this is probably caused by the fact, pointed out by one of the British native speakers, that since pronunciation and the use of idiomatic forms differ between the UK and the US, it would be helpful to produce two localized versions.

Concerning the second item, since most of the system prompts are short, the users rarely tried to interrupt the TTS messages; most of the interruptions occurred while the help messages (which are the longest messages) were played.

Many users interrupted the help messages in a wrong way (the only commands allowed to quit the help pop-up are the “OK” command or the green button). This behavior is probably due to the fact that on many occasions the help message and pop-up were played without the users asking for them (e.g. after a recognition error), so they persisted in their original objective, trying to interrupt the help without hearing/reading the suggested command to remove it, and simply repeating the command that they had already said.

Concerning the third point, one source of confusion for the users arose from the fact that, while the voice dialogs allowing the definition of the time supported both symbolic expressions (e.g. morning) and numeric values (e.g. 14:30), within the GUI menu the only available alternatives were the symbolic ranges (only morning, afternoon, evening and night; furthermore the user was not informed about the actual time range associated with each option). Subjects who entered this menu with a precise hour in mind were induced by the GUI to choose a symbolic range instead of a number, and when they returned to the selection criteria screen, the chosen time was shown using the beginning of the chosen option (e.g. 14.00 for the time range “afternoon”); hence, even if the command was correctly understood by the system, the subjects interpreted this feedback as a wrong response and tried to rectify it by repeating the time range or re-entering the “time” menu screen.

As for the possibility to select an item using the related number, the behavior of the users was appropriate (some of them tried to say the numbers or to select them by hitting the RC buttons), but the interface was not completely consistent (in some states it was possible, in others it was not), so they got stuck because they had to try different modalities to accomplish the same selection action in different states of the EPG mode.

As for the last point, in the general instructions the subjects were informed that they were not allowed to select an item using its title; however, they spontaneously tended to say the title of a program when they wanted to select it.

4.2.3 Haptic behavior

The haptic interface presented some specific problems only in the part related to the “settings”: the subjects who had to perform task D (change the default settings) using only the RC had more difficulty understanding the toggle mechanism that was implemented to change the value of each option, based on the idea that every hit of the “OK” button toggles the current value.

Moreover, it was not clear that their changes would be stored only when exiting with the “red” button (=Apply), while just hitting the “yellow” button (=TV) did not store their changes.

4.3 Objective measures

The focus of this chapter is to analyze the metrics defined in § 3.6, to understand whether there are objective measures which confirm the subjective impressions, and to find out which areas need improvement in the next prototype.

4.3.1 General statistics

In total, 6,911 utterances have been collected and used for the evaluation of the utterance-based objective measures, regardless of the quality of the speech (we did not differentiate between utterances with high/good quality and those with low quality). We used only those uttered during the execution of a task, while all recorded activity between tasks and interferences caused by the experimenter were discarded. For the task-based metrics, 223 task executions by 56 subjects have been evaluated. Due to a technical error the execution of one task by one subject was not logged and is thus not contained in the statistics (4 * 56 = 224).

One interesting factor regarding the use of speech is the average length of a command by the user. Figure 21 shows the frequency distribution of utterance lengths (measured in number of words) as transcribed by the annotators. 937 of the 6,911 utterances have a zero length and are not contained in this statistic, such that the total number is 5,974. Zero-length utterances indicate misrecognitions, i.e., cases in which the system detected speech and provided an output although the user did not say anything.1 As the diagram shows, subjects have a strong preference for short commands consisting of one or a few words (mean = 2.07; stddev = 1.657). About 53% of the utterances have a length of one. The frequencies strongly decrease for increasing lengths. Utterances with lengths between one and six words already make up more than 97% of the whole set. Only six utterances with a length of twelve or more have been recorded.

Figure 21: Frequencies of utterance lengths (in words).

The general shape of this distribution is the same for all three languages. Still, differences in the mean value can be found, as illustrated in Figure 22 (EN: mean=2.57; stddev=1.89; IT: mean=1.86; stddev=1.34; DE: mean=1.93; stddev=1.68). Despite the variability, the differences between English and the other language groups are significant (two-tailed t-test; p<0.001 for EN-DE/EN-IT). English subjects use more words on average than German and Italian subjects.

Figure 22: Mean values and standard deviation of utterance lengths according to languages (width of error bars = standard deviation).
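
The per-language comparison of utterance lengths can be reproduced roughly as follows; the word counts in this sketch are invented placeholders, while the reported figures come from the transcribed corpus.

# Sketch of the comparison of utterance lengths (in words) between two
# language groups with a two-tailed t-test; the data are invented.
import numpy as np
from scipy.stats import ttest_ind

english_lengths = np.array([1, 2, 4, 3, 1, 5, 2, 3, 6, 2])
italian_lengths = np.array([1, 1, 2, 3, 1, 2, 1, 4, 2, 1])

for name, lengths in (("EN", english_lengths), ("IT", italian_lengths)):
    print(f"{name}: mean = {lengths.mean():.2f}, stddev = {lengths.std(ddof=1):.2f}")

t_stat, p_value = ttest_ind(english_lengths, italian_lengths)
print(f"t = {t_stat:.2f}, two-tailed p = {p_value:.4f}")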

The distribution of the annotators’ classification is shown in Figure 23. From the total of 6,911 utterances, 10 annotations were missing; these are not contained in the statistics. Most utterances fall into the classes OK (41%) and NO (36%), i.e., they are principally interpretable by the system – though the number of wrongly executed actions is quite high. In the following, we discuss the utterance-based metrics introduced before, all of which can be derived from the classifications.

1 A typical reason for such misrecognitions is ambient noise, or TV/TTS output not completely eliminated by the acoustic echo cancellation.


Figure 23: Classification of the recorded utterances.

4.3.2 ACR results

Table 12 illustrates the total numbers for ACR (0.44) and ACR* (0.48). Hence, only 44% of the utterances that could principally be understood by the system led to the execution of the correct action. Even if we count in utterances that were not foreseen in the dialogue model and hypothetically assume perfect recognition of these utterances, only 48% of the verbal commands would have been classified and executed correctly.


Table 12: Action Classification Rates.
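
Rates of this kind can be derived from the annotators' per-utterance classifications roughly as sketched below. The label names and the counting rules are illustrative assumptions made for this sketch; the authoritative definitions of ACR and ACR* are the ones given in § 3.6.

# Hedged sketch of ACR-style rates computed from hypothetical annotation
# labels:
#   "OK"  - in-design utterance, correct action executed
#   "NO"  - in-design utterance, wrong or no action executed
#   "OOD" - plausible utterance not foreseen in the dialogue model
from collections import Counter

labels = ["OK", "NO", "OK", "OOD", "NO", "OK", "NO", "OK", "OOD", "NO"]
counts = Counter(labels)

in_design = counts["OK"] + counts["NO"]
acr = counts["OK"] / in_design
# ACR*: assume the out-of-design utterances would have been handled correctly.
acr_star = (counts["OK"] + counts["OOD"]) / (in_design + counts["OOD"])

print(f"ACR  = {acr:.2f}")
print(f"ACR* = {acr_star:.2f}")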

ACR and ACR* broken down according to gender are illustrated in Table 13. In both cases the number is slightly higher for men (ACR 0.45; ACR* 0.48) than for women (ACR 0.44; ACR* 0.47). This difference is not statistically significant. However, it points to the general tendency observed for various speech applications that female voices are more difficult to recognize than male voices.


Table 13: Action Classification Rate compared between men and women

Trying to find out where most of the system failures occur, it can be seen that the most problematic points are the Selection grammar and the Help grammar.



Table 14: Action Classification Rate grouped by grammars

As already explained (in § 4.2.2), and as can be understood from the difference between the ACR and the ACR* metrics, while the problems in the Help grammar are due to a low efficacy of the design, the failures in the Selection grammar are more related to the limited knowledge that the users had of the available items of the selection criteria. Even if users are aware of the system's domain of knowledge, they cannot know or remember all the possible values that the search criteria can assume until they have gained confidence with it. The low ACR score in the Selection grammar is probably related to the low score of the “search by channel” procedure and to searches without results, as can be seen in the following chart:

[Chart: ACR and ACR* for each screen layout (S4HandleSearchByDate, S3SelectionFromEPGNoResults, S2HandleEPGListNoResults, S10HandleSettings, S3SelectionFromEPG, S1aVideoOut, S1SplashScreen, S4HandleSearchByChannel, S2HandleEPGList, S4HandleSearchByGenre, S4HandleSearchByTime).]

Table 15: Action Classification Rate grouped by layouts


Comparing the ACR score by grammars (Table 14) with the one by layouts, another notable point is that the two procedures where “OK” is the main command (Help and Splash screen) have a low score although they have a very simple grammar. This suggests that, to avoid the recognition errors due to this short term, both the grammars and the GUI should allow people to use other strategies and other commands to exit from these procedures.

Comparing the ACR results among the languages (cf. Table 16), it is evident that action classification worked better for Italian subjects than for English and German subjects. This difference is significant (two-tailed t-test; p<0.001).


Table 16: Action Classification Rate grouped by languages. Note that no CACR(*) were computed for German.

Finally, another important point to highlight is the difference between the score of the CACR* (the metric which measures the correct and complete action classification rate that would be obtained if the design perfectly matched the expectations of the users) and the score of the CACR (the metric which measures the overall action classification rate found in the implemented system).

- The CACR (EN about 32%, IT about 36%) expresses the real Action Classification Rate, as measured on the real prototype (considering also the low-quality utterances and the ones missed by the signal processing subsystem);

- The ACR* (EN about 46%, IT about 54%, DE about 44%) expresses the “ideal” Action Classification Rate, under the hypothesis that the Language Model was able to cover all the plausible utterances that were said;

- The CACR* (EN about 58%, IT about 62%) expresses the “ideal” Action Classification Rate, taking into account also the correct utterances that were missed by the signal processing subsystem, under the hypothesis that all the plausible utterances were classified correctly.

The two gaps of about 13 points can be narrowed, on one hand, by improving the signal processing subsystem in order to obtain better endpointing and, on the other hand, by improving the Language Model and the dialog design.

4.3.3 UE

The total user error rate, i.e., the proportion of utterances which could principally not be understood by the system because of a mismatch between user expectations and the design, is 0.072. Though the error rate is marginally higher for women than for men, this difference is insignificant (cf. Table 17).
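
A user error rate of this kind can be computed as a simple proportion over the annotated utterances, as in the following sketch; the "USER_ERROR" label and the annotation list are illustrative assumptions, not the project's actual annotation scheme.

# Hedged sketch: share of utterances that the design could not cover.
annotations = ["OK", "USER_ERROR", "OK", "NO", "OK", "USER_ERROR", "NO", "OK"]
ue = annotations.count("USER_ERROR") / len(annotations)
print(f"UE = {ue:.3f}")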



Table 17: User Errors grouped by gender

As can be seen in the following chart (Table 18), the rate of erroneous actions that occur while the subjects are using the DICIT prototype is particularly high in the Help grammar; this is probably due to the behavior already explained in § 4.2.1.

The other high rate of errors, in the EPG grammar, probably depends on the fact that the subjects did not know all the options available for the EPG selection (i.e. the available genres or times).


Table 18: User Errors grouped by grammars

The detail of the number of errors grouped by screen layout (Table 19) shows that most of the erroneous actions in the TV-screen grammar occurred while the splash screen was still active and the subjects were probably trying to change the channel, adjust the volume, or manage the EPG or the settings, without noticing that they had to give the “OK” command.

The other relevant number of errors, made by the subjects in the EPG list and the video output, matches the explanation already given for the low ACR score: the naïve subjects do not know what kinds of channels or genres are available, and they need some time to learn the available options before being able to formulate the “correct” requests.



Table 19: User Errors grouped by layouts

As can be seen in the following chart (Table 20), the English subjects (UE = 0.111) made more erroneous actions than the German subjects (UE = 0.070), who in turn made more erroneous actions than the Italian subjects (UE = 0.046). The differences between the languages, compared pairwise, are significant (two-tailed t-test; p<0.01). The reason can probably be found in the fact that the English sample was composed of UK and US subjects, who use different idiomatic forms, while the English interface (designed by non-native speakers) was targeted at US English.


Table 20: User Errors grouped by language

4.3.4 SIE

Given a set of processed utterances, the ratio between successful interactions and errors (SIE) is a metric which expresses the effectiveness of the design, as the relationship between the theoretical efficacy of the design and the correct functioning of the system. The SIE* considers not only the number of correct requests made by the subjects, but also counts all the plausible sentences as correct commands, in relation to the total number of utterances spoken by the users and processed by the system. Its actual value (close to 1) means that, even if the users asked for some “out-of-vocabulary” options/commands, they knew the full range of capabilities of the system and understood what they had to do to achieve the assigned tasks. On the other hand, the difference between the SIE* and the SIE scores indicates the leeway for improving the design by adding to the LM the sentences which the users considered plausible.
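
The shape of this computation is sketched below under the assumption that SIE-style values are ratios of successes over all processed utterances; the per-utterance flags are illustrative, and the exact formulas are the ones defined in § 3.6.

# Hedged sketch of SIE and SIE* as success ratios over processed utterances.
processed = [
    {"in_design": True,  "plausible": True},
    {"in_design": False, "plausible": True},   # out-of-vocabulary but sensible
    {"in_design": True,  "plausible": True},
    {"in_design": False, "plausible": False},  # implausible in this dialog turn
    {"in_design": True,  "plausible": True},
]

sie = sum(u["in_design"] for u in processed) / len(processed)
sie_star = sum(u["plausible"] for u in processed) / len(processed)
print(f"SIE  = {sie:.2f}")
print(f"SIE* = {sie_star:.2f}")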



Table 21: Comparison between the Successful Interactions and Errors measures

As already seen for the rate of user errors grouped by gender (Table 17), the following chart (Table 22) points out some gender-related differences in the relationship between successful interactions and user errors. Men reach a slightly higher SIE and SIE* rate than women. However, this gender difference is only significant for SIE* (two-tailed t-test; p<0.01).


Table 22: Successful Interactions and Errors compared between men and women

The grammar where subjects asked for many “out-of-vocabulary” options/commands was the Help procedure: since this message/pop-up often appeared without an explicit request from the users, they ignored the suggested command to quit it and continued to repeat the same command that they had said before this “interruption”.


Table 23: Successful Interactions and Errors grouped by grammars

As can be seen in the following chart (Table 24), the other relevant differences between successful interactions and user errors are again in the Splash screen and in the EPG list screen. For the latter, this rate of “out-of-vocabulary” options/commands can be explained by the fact that, although subjects knew that they could not select an item by saying its title, they often spontaneously tried to do so, assuming that this was one of the main shortcuts allowed by the voice interaction.


Table 24: Successful Interactions and Errors grouped by layouts

Small, but still significant differences can be found when the SIE values for the three language groups are compared. The English subjects seemed to make more requests that were implausible, or not allowed in a certain turn of the dialogue (lower values for CSIE* and SIE*).


Table 25: Successful Interactions and Errors grouped by languages.

Note that CSIE and CSIE* were not determined for German.

4.3.5 WRR

Figure 24 illustrates the mean Word Recognition Rate over the whole collection of utterances and its standard deviation (mean = 0.3; stddev = 0.655). The standard deviation in this plot (and in all following plots) is indicated by the error bars (mean +/- 1.0 * stddev). Note that the upper bound (“perfect recognition”) for WRR is always one, but the metric can be negative. The Word Recognition Rate provides information on the performance of the ASR module. The comparatively low value strongly indicates that speech recognition in the first prototype lacks accuracy. The standard deviation is very high, since the corpus includes a lot of “pseudo recognitions” where there was no utterance. Pseudo recognitions with a high number of words lead, in particular, to a highly negative value for this metric and thus influence the standard deviation.

Figure 24: Overall Word Recognition Rate (WRR).
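
A word recognition rate with these properties (an upper bound of one and possibly negative values) can be computed per utterance as 1 - (S + D + I) / N over a word-level alignment of the reference transcription and the recognizer output. The sketch below uses a plain Levenshtein alignment and illustrates this common formulation; it is not necessarily the exact implementation used in the evaluation.

# Hedged sketch: per-utterance word recognition rate WRR = 1 - (S+D+I)/N.
def wrr(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words (substitutions + deletions + insertions).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)

# A long "pseudo recognition" against a short reference yields a strongly
# negative WRR, which explains the large standard deviation reported above.
print(wrr("channel BBC", "channel BBC"))         # 1.0
print(wrr("ok", "ok please show the EPG list"))  # -4.0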

Figure 25 shows word recognition rates by gender. The WRR for women (mean = 0.317; stddev = 0.620) is slightly higher than the WRR for men (mean = 0.284; stddev = 0.685). This difference, though small, is significant (two-tailed t-test; p<0.05). This result could be explained by the fact that men tended to say phrases in which the command itself was correctly understood by the system (even if the rest of the utterance was not understood or was substituted), whereas women succeeded in getting the system to understand most of the words they said, but not the ones which were semantically relevant for the system to correctly accomplish the request.

Figure 25: Word Recognition Rate by gender.

Figure 26 shows word recognition rates by grammar. In terms of WRR, the “Search-By” (mean = 0.363; stddev = 0.595) and “EPG” (mean = 0.342; stddev = 0.686) grammars perform better than the “TV-Screen” (mean = 0.279; stddev = 0.686) and “Selection” (mean = 0.278; stddev = 0.696) grammars. When the “Help” grammar was active, WRR was worst (mean = 0.161; stddev = 0.641). The difference between the WRR for “Help” and for the other grammars is significant (two-tailed t-test; p<0.01). Also, the difference in WRR between “Search-By” and the other groups except “EPG” is significant (two-tailed t-test; p<0.05). The results show quite a similar trend to the results for ACR grouped by grammar (this is reasonable, since there is a correlation between the two measures). Since the “OK” command is a very short term, the reason why a simple grammar like the Help grammar shows such a low score is that the “main” command can easily be substituted. This result suggests changing the dialog design to propose another strategy/command to exit from the help pop-up.


Figure 26: Word Recognition Rate by grammar.

Figure 27 shows WRRs grouped by layout. While most of the layouts have a mean WRR between approx. 0.26 and 0.40, “S1SplashScreen” has a slightly lower value (mean = 0.162; stddev = 0.543), and, in particular, both no-result screens, “S3SelectionFromEPGNoRes” (mean = -0.057; stddev = 0.714) and “S2HandleEPGListNoResult” (mean = -0.261; stddev = 1.516), stick out with negative mean values.

The trend is quite similar to the one for the ACR, with the exception of the layout where the list of programs should be shown: probably, when no results are shown in the EPG list screen, the users tend to say phrases that are very often misunderstood by the system.

Figure 27: Word Recognition Rate by layout.

Figure 28 shows WRRs grouped by language. The recognition rate for the Italian system (mean = 0.402; stddev = 0.566) is much better than the rate for the English system (mean = 0.277; stddev = 0.630) and for the German system (mean = 0.239; stddev = 0.717). The difference between Italian and the other two languages is significant (two-tailed t-test; p<0.001), while the difference between German and English is insignificant.


Figure 28: Word Recognition Rate by language.

4.3.6 TCR

The Task Completion Rate provides information on both the effectiveness of the design and the overall performance of the prototype. Figure 29 shows the completion rates for the eight tasks (differentiated by specifically or generically defined goal). Notably, the values depend strongly on the task. The highest completion rates were achieved with tasks D (0.81) and Dg (0.93), which consisted of manipulating the settings. Subjects completed tasks B (0.46) and C (0.53) only about half of the time. Both tasks had to do with searching for a program, i.e., people had to use the EPG. They performed better in the generic versions, Bg (0.73) and Cg (0.69), of this task. Here, the search target was not defined, so people could search for whatever they liked. The worst completion rates were achieved for tasks A (0.29) and Ag (0.36). This task involved scheduling an arbitrary or a specific program. When comparing the completion rates for different tasks, it should be noted that a possible learning effect could also explain part of the differences, because the task execution order was not balanced across all subjects. The completion rate for the generic version of a task is always better than for the specific version (see comments on Figure 16). This is not surprising, since subjects had more freedom in the generic version while the basic type of task was the same. So, taken together, in about 50% of the cases subjects were not able to complete the task they were assigned. These results could depend on two reasons:

1) being naïve users, the subjects would need more time to understand how to interact with the system (they actually received only some general instructions, by watching two demos concerning the EPG management);

2) sometimes the system crashed and the subjects were not required to repeat/finish the task if they already had “played” with the prototype for at least 4-5 minutes.

Figure 29: Task Completion Rate by task (A-D: specific tasks, Ag-Dg: general tasks).
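
Computing the per-task completion rate from the logged task executions amounts to a simple grouped ratio, as in the following sketch; the execution records are invented placeholders, not the logged data.

# Hedged sketch: task completion rate (completed executions / all executions).
from collections import defaultdict

executions = [
    {"task": "A", "completed": False},
    {"task": "A", "completed": True},
    {"task": "B", "completed": True},
    {"task": "B", "completed": False},
    {"task": "D", "completed": True},
    {"task": "D", "completed": True},
]

totals, completed = defaultdict(int), defaultdict(int)
for e in executions:
    totals[e["task"]] += 1
    completed[e["task"]] += e["completed"]

for task in sorted(totals):
    print(f"Task {task}: TCR = {completed[task] / totals[task]:.2f}")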


For the following statistics on completion rates only the specific tasks (A-D) were considered, while the general tasks (Ag-Dg) are excluded, because the freedom given to subjects in the generic tasks would bias the results.

Results for the task completion rate grouped by gender are shown in Figure 30. The difference between male (0.55) and female (0.51) subjects is not significant. However, it is in line with the higher UE score (Table 17) and with the subjects’ self-assessment of their ability to manage domestic media appliances: women often consider themselves “basic users” and delegate the management of advanced devices to other relatives.

Figure 30: Task Completion Rate by gender.

Figure 31 shows completion rates for the four specific tasks according to the input modality. In all cases, the restriction to using only voice led to the worst completion rate. If the use of the RC was allowed (conditions RC only and multimodal), completion rates are much higher. Interestingly, using only the remote control leads to higher completion rates than using both modalities for three of the four tasks (A, C, and D). Hence, subjects do not gain an advantage from using speech in addition to the RC. Obviously this depends both on the novelty of the voice modality (the users had to discover how to interact with the system using a new mode, while also having to understand what kind of features were available), and on the ability of the system to correctly understand the users’ commands/requests.

Figure 31: Task Completion Rate by modality and task.

Figure 32 shows completion rates grouped by language. Except for task D, the highest completion rate is achieved by the Italian subjects.


The very low completion rate of the German and English samples on task A could be due to some misunderstanding about how to “schedule” a program.

Figure 32: Task Completion Rate by language and task.

4.3.7 TCT

For the computation of the task completion time, only successfully completed specific tasks were considered. The generic tasks were not considered, because subjects had a lot of freedom in these tasks and could choose for themselves what kind of content to search for and what interaction mode to use. Hence, the time spent to reach the goal in the generic tasks is not comparable.

Figure 33 shows the task completion time for the four specific tasks A (mean = 491.3; stddev = 162.8), B (mean = 325.5; stddev = 185.0), C (mean = 339.9; stddev = 178.7), and D (mean = 171.6; stddev = 191.3). The times are roughly in line with the task completion rates. This is expected, since both measures rate the “difficulty” of a task, and a low completion time should correspond to a high completion rate. Generally speaking, average completion times are quite high (more than 2.5 minutes for the “simplest” task D and about eight minutes for the most “difficult” task A). Even if the tasks only consist of a few actions, so that these times could be unacceptable for a usable interface, we have to consider that the amount of time spent to complete the tasks was determined both by the slow reaction time of the system and by the fact that the users were all “naïve” (they expressly did not receive specific instructions on how to interact with the system in the “best” way).

Figure 33: Task Completion Time by task.


Figure 35 shows the overall task completion time grouped by gender and task. No clear trend can be found in the data. The difference is not significant for any of the tasks.

Figure 35: Task Completion Time by gender and task.

Figure 36 shows task completion times according to modality. For every task the time is lowest if subjects could only use the remote control. If only speech was available, completion time is highest for tasks A, B, and D. The time to complete task C was highest for the condition “voice plus RC”.

Figure 36: Task Completion Time by modality and task.

Figure 37 shows the completion times according to language and task. In contrast to task completion, where the Italian subjects achieved the best rates, the picture for completion time is less clear. For tasks A and C, English subjects used less time than German and Italian subjects. In task B, Italians were faster than English and German subjects. In task D, Italians used much more time than the members of the other two groups.


Figure 37: Task Completion Time according to language and task.


5 Discussion

The following paragraphs summarize and discuss the main results that emerged from the subjective and the objective evaluations.

5.1 The questionnaire

A first general remark about the subjective evaluation based on the questionnaire data concerns the range of ratings. Ratings for each single item were in most cases extremely variable, i.e. almost everything between very positive and very negative can be found in the results. This indicates that usability ratings for innovative interfaces depend strongly on the individual and that, in the case of DICIT, speech as an alternative interface may not be suitable for each and every user. Thus, speech should be considered an alternative and a shortcut in addition to a well-designed traditional input style using the remote control.

The overall tendency for most of the questions regarding system usability, screen design, and speech interaction is slightly positive. Subjects seem to like the idea of speech interaction in a TV scenario and, at least on average, did not have particular trouble with the screen design and the usage of the prototype. The two main screens, the selection criteria screen and the selection screen, were judged to be easily understandable.

A main source of criticism, and one of the major points for future improvements, is the slowness of the system. Slowness was probably partly misinterpreted as misrecognition and resulted in the repetition of commands, even though subjects were informed that the prototype reacted slowly. This seems to be the reason for many errors.

Subjects like to be able to choose between consulting the list of programs and the criteria list. The screen with the search criteria is easy to read, but the voice commands are not very intuitive (an impression that is mainly due to the slow reaction time of the system).

The remote control is preferred over the voice interaction, possibly because it is the traditional way of interacting with the TV that subjects are familiar with, and because it is useful when problems occur with the voice interaction. Only a few subjects wanted to have additional commands at their disposal.

The subjects consider voice commands useful if they speed up the interaction, both by allowing more operations than the remote control and by making the system react more quickly. For the same reason they prefer short commands over complete sentences, and even the judgment about the ability of the system to understand vocal commands is affected by the expectation of a quick reaction when using voice interaction (most of them complained that the system reaction is too slow). All of them would like to have more specific instructions. Generally, the help facilities of the system were criticized as insufficient.

The majority of English and Italian subjects agree that the system should read out the search results, even if they consider the talking feature a functionality that is more useful for persons with impairments than for themselves. A function to disable the recognizer would also be appreciated by the subjects.

In general, the adaptive features are appreciated, although the idea of being monitored by a system that records the preferences of each user received a lower rating.

The task preferences pointed out that, regardless of the goal of the task, until they gain some familiarity with the system users appreciate being free to search for whatever they prefer using a mixed interaction mode.

As for the general opinion, the subjects' experiences with the DICIT prototype are quite positive: they think that it is easy and fun to use, although it is a little bit confusing. They judge the remote control a little better than the voice (the cognitive overhead is lower because this is the usual interaction mode), but the voice mode is quite appreciated for managing the EPG and searching for programs; the selection criteria are clear, it is easy to modify the settings, and the TTS is intelligible. On the other hand, people do not like the voice very much and think that the system still needs improvements.

Comparing the average answers among the languages, the main differences are the following: the English sample is the most confused when using the system in general (q. 26.2); when interacting with the system by voice, English and German subjects have to pay more attention than Italian users (q. 26.4); confirming the differences in the first question, English subjects feel more "confused" than German and Italian subjects when interacting with the system both by voice and by RC (q. 26.5 and 26.6). Most of the Italian subjects think that DICIT is easy to use (q. 26.1) and that the system needs fewer improvements than the German sample thinks (q. 26.16).

Altogether, subjects have a positive general impression of the system: it is said to be original, active, friendly, organized, patient and quite polite. DICIT is considered neither easy nor complicated, and results are quite neutral for the pairs efficient-inefficient, capable-incapable, precise-vague, formal-informal, and predictable-unpredictable. The system has to be improved to react more quickly, because it is perceived as slow.

Comparing the average answers among the languages, most of the answers are fairly aligned; a few differences can be noted for some adjective pairs of the semantic differential: Italian subjects judge the DICIT system more efficient than the other two samples do; German and Italian subjects judge DICIT more original than English subjects do, and the system is considered more "capable" by the Italian subjects. The German sample judges the interface more formal than the other two samples do, and the Italian subjects consider the system a little less polite than the German and English samples do.

5.2 Objective metrics

The evaluation of the objective metrics shows many opportunities for improving the system. First, the preference for short commands (a low number of words) expressed in the questionnaire is confirmed by the objective data: one-word commands are the most frequently used. However, subjects still use a significant number of longer commands. We suggest two ways to interpret this result. Subjects may simply prefer short commands a priori, or the result is a side effect of poor recognition performance (see below), in that subjects try longer utterances first but switch to shorter utterances if they experience problems. Consequently, an improved prototype should be able to handle the recognition of very short commands while still being able to process more "NLU"-like, complex instructions. The basic message is not to neglect simple, straightforward use of a system when striving for complex, natural interaction.
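
The command-length figures can be derived with a simple word count over the transcribed utterances; the following is a minimal sketch, assuming a hypothetical plain-text file with one transcription per line (the actual transcription format used in the evaluation may differ).

```python
# Sketch under assumptions (one transcribed utterance per line in a
# hypothetical "utterances.txt"): distribution of command lengths in words.
from collections import Counter

def length_distribution(path="utterances.txt"):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            n_words = len(line.split())
            if n_words:
                counts[n_words] += 1
    total = sum(counts.values())
    for n_words in sorted(counts):
        share = 100.0 * counts[n_words] / total
        print(f"{n_words}-word commands: {counts[n_words]} ({share:.1f}%)")

if __name__ == "__main__":
    length_distribution()
```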

Among the metrics used in the objective evaluation, the word recognition rate (WRR) measures the quality of the "recognition chain", that is, all the processing that takes place from the mouth of the user to the recognized text. The action classification rate (ACR) reflects the quality of the "classification chain", that is, all the processing that takes place from the mouth of the user to the classified action. With an average ACR of less than 50% and an average WRR of about 30%, the current prototype does not meet the demands for a usable speech-enabled system. The ACR, as the most important metric for an end-to-end evaluation, has to be much higher in order for the system to be usable. ACRs of 90% and more are desirable, meaning that no more than every tenth verbal interaction should be executed wrongly. The reason why the ACR is higher than the WRR is probably that in statistical recognition not all words contribute to the "core" meaning of the phrase; hence, even if most of the words except one important meaning-bearing word are misrecognized, the statistical method can still classify the action correctly.
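
To make the distinction between the two metrics concrete, the sketch below computes a WRR from a word-level alignment of reference and recognized text, and an ACR from pairs of annotated versus executed actions. This is one common definition; the exact formulas used in this evaluation may differ slightly.

```python
# Sketch of the two metrics under one common definition (the deliverable's
# exact formulas may differ slightly).
def edit_distance(ref, hyp):
    # Word-level Levenshtein distance between two token lists.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def word_recognition_rate(pairs):
    """pairs: (reference transcription, recognizer output) string tuples."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    ref_words = sum(len(r.split()) for r, _ in pairs)
    return 1.0 - errors / ref_words if ref_words else 0.0

def action_classification_rate(actions):
    """actions: (annotated correct action, executed action) identifier tuples."""
    correct = sum(1 for expected, executed in actions if expected == executed)
    return correct / len(actions) if actions else 0.0

# Toy illustration of why ACR can exceed WRR: several words misrecognized,
# but the meaning-bearing words still yield the correct action.
print(word_recognition_rate([("switch to channel five please", "which of channel five peas")]))
print(action_classification_rate([("SetChannel(5)", "SetChannel(5)")]))
```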

Nevertheless, even the ACR results are not good enough for a system designed to use voice interaction as a shortcut for sophisticated interactions; the subjective judgment shows that the system's ability to understand vocal commands has to be improved. To raise these relatively poor figures, it is necessary to improve the recognition chain and the language models, since the current version was tailored to understand complex phrases rather than single commands or short utterances.

The other metrics, all in all, suggest that, for verbal interaction, most of the errors made by the system are due to problems in the speech processing chain, while some errors are due to the design of the speech system. In particular, the restriction of the grammar in some screens (e.g., the help screen, which can only be dismissed with a specific command) was a design choice that led to problems. One consequence would be to make more commands globally accessible (see also 5.3.2 below). That way, speech could better serve its function as a "shortcut" for complex sequences of interactions.

The tasks given to the subjects were intended to reflect some more or less complex everyday functions of a TV/EPG home entertainment system. Therefore, the task completion rates were expected to be much higher than measured in this evaluation. Only about 50% of the tasks could be completed at all, which is definitely too low for use on an "everyday basis", but it is important to consider that the users were all "naïve", both in terms of familiarity with voice interaction and of knowledge about the system's features. The strong dependency of the TCR on the task itself indicates that there are specific problems with some operations (e.g., "scheduling" a program) which could be alleviated by more specific training of the subjects, or by more intensive testing of typical use cases before starting the second evaluation campaign. Modality-specific TCRs show that users could not solve the tasks using speech alone. The evaluation of task completion times (TCT) revealed results similar to the TCR. Whenever task completion was low, the corresponding completion times were high. Interestingly, TCTs for RC-only input were typically lower than for combined (speech + RC) input and for speech input alone. With respect to the objective results, for the current state of the prototype, we were unable to show that pure voice input or multimodal input leads to improved usability.

5.3 Observations from the recorded log data

This section contains a number of observations derived from the analysis performed in the previous chapters. While some of the ideas from the experimenter observations appear again, new observations were also made during the evaluation.

5.3.1 Feedback for Speech Input

Since no feedback about the executed action was provided, it was often not clear to the users whether the system had understood them correctly. For commands that were understood correctly, the system reaction itself provided feedback, e.g. when going to a different screen. However, if a command was misunderstood (e.g. "TV" instead of "results"), an action was executed that was not clear to the user and left him or her confused. Moreover, actions such as switching the TV channel could take up to 20 seconds, also leaving the user without any information about what the system was currently doing.


The common reaction time in turn-taking between humans is obviously quicker (and richer) than the turn-taking in a human-machine interaction. But since the subjects expected a reaction time at least as fast as that of commercial systems, they sometimes became impatient and repeated an utterance before the previous one had been processed.

Most commands in the EPG were acknowledged by the system with a short sound. This feedback should be extended for the second prototype. First, an instant reaction when recognition starts, for instance by means of a small icon on the screen, could be implemented; this is already proposed by the dialog specification but could not be implemented for the first prototype for technical reasons. Further ways of giving feedback could be discussed for the specification of the second prototype.
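
A possible shape for such feedback is sketched below with assumed event and method names (this is not the prototype's actual API): a "listening" indicator as soon as recognition starts, a "busy" indicator until the action has been executed, and the existing acknowledgement sound at the end.

```python
# Hypothetical sketch of the proposed feedback behaviour; event names and
# the GUI hook (show_icon/hide_icon/play_sound) are assumptions, not the
# prototype's actual interfaces.
class FeedbackController:
    def __init__(self, gui):
        self.gui = gui

    def on_recognition_started(self):
        # Immediate visual cue that the system is listening.
        self.gui.show_icon("listening")

    def on_recognition_finished(self):
        # The action may take several seconds (e.g. switching the channel),
        # so keep a busy indicator on screen until it has been executed.
        self.gui.hide_icon("listening")
        self.gui.show_icon("busy")

    def on_action_executed(self, action):
        self.gui.hide_icon("busy")
        self.gui.play_sound("ack")  # short acknowledgement sound
```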

5.3.2 Global Speech Commands

Users often used speech commands in screens where these were not available. For instance, some users wanted to open the EPG screen with certain filter values directly from the TV screen, using utterances such as "what's on CNN Saturday afternoon". Whereas this command is valid in the EPG main screen, it is not supported in the TV screen. Therefore, using more global grammars could improve the usability and the naturalness of the system.

When such invalid commands were used, the recognizer returned something different, and an action that was wrong from the user's point of view was executed. Therefore, a more global language model could also reduce such recognition errors and wrong system reactions.
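
One way to realise this is sketched below with assumed command names and a deliberately simplified grammar representation (the actual DICIT grammars and language models are more complex): keep a set of globally valid commands and merge it into every screen-specific command set, so that, for example, an EPG query is also accepted on the TV screen.

```python
# Sketch with assumed command names and a simplified grammar representation;
# not the actual DICIT grammar format.
GLOBAL_COMMANDS = {"help", "main menu", "results", "volume up", "volume down"}

SCREEN_COMMANDS = {
    "tv":  {"channel up", "channel down", "mute"},
    "epg": {"what's on", "schedule", "clear filters"},
}

def active_commands(screen):
    # The active grammar for a screen is the union of its local commands
    # and the globally available ones, rather than the local set alone.
    return SCREEN_COMMANDS.get(screen, set()) | GLOBAL_COMMANDS

if __name__ == "__main__":
    print(sorted(active_commands("tv")))
```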

Subjects regularly used commands without parameters, such as "volume" without specifying whether to increase or decrease the volume. Therefore, some feedback should be provided for such actions, such as showing the volume bar in this example or providing specific help.

5.3.3 Improving the Help Screen

The help screen is a modal window that has to be closed before the system can be used again. Therefore, it is not possible to use other commands while the help window is open. This was not clear to the subjects and again caused recognition errors when they used commands anyway.

Moreover, the help was very general, and while subjects frequently opened the help screen, the help message provided little new information beyond what the subjects already knew. Therefore, more extensive and context-sensitive help could improve both the usefulness of the help itself and the usability of the system.

5.3.4 EPG Improvements

First, the "now" command was not supported by the language model, and a corresponding screen was only available implicitly by opening the result screen without specifying any filter values. Therefore, the "now" command should be added to the language model, and an explicit command that shows what is currently on air should be added to the dialog design. A similar function was already part of the WOZ prototype.

The filter criteria "author", "subject", and "title" were not supported in the first prototype in order to achieve a somewhat higher recognition rate. However, these features could greatly improve the utility of the EPG menu. The evaluation contained tasks with specific instructions regarding certain programs, including the names or topics of the programs (e.g. "tennis" as a topic or "700 Club" as the name of a show). Many subjects first tried using these values, but since they were not supported, they led to misrecognitions.


Moreover, subjects had a hard time removing filter criteria. They had to open a filter selection list (such as the channel list) and select the first entry ("None") by using the word "one" (since the phrase "none" was not part of the language model) in order to remove a certain filter criterion. Therefore, it should be made clearer how single filter criteria can be removed.

5.3.5 Options for voice interaction

Since people always prefer to have some choice, other improvements/changes requested by many subjects are improving the quality of the TTS (mainly for the Italian language) and being able to enable/disable the voice recognition as they prefer.

6 Plan for the evaluation of the second prototype

6.1 Subject Samples

As for the evaluation of the first prototype, the main goal of the evaluation campaign will be to test the usability of the whole system; hence the focus will be on the appeal and ease of use of the GUI, the ease of use of the different interaction modes (including the accuracy of the distant-talking voice recognition), and the responsiveness of the system.

The three samples should be as representative as possible of the respective "country population", avoiding recruiting only "skilled" persons (i.e. people who work daily with computers or who are used to playing with computers); hence, each sample should be balanced by gender, age and education.

The chosen style of the evaluation is quite non-restrictive, but nevertheless aims to identify design and performance factors that have an impact on usability. Since we have to cope with a multi-factorial design, we need more subjects in order to derive valid results. Thus, the next evaluation campaign will involve a larger number of subjects clustered into three samples: an English, a German and an Italian sample; each sample will be composed of about 50 subjects.

6.2 Evaluation sites and segmentation of the samples

Unlike in the first evaluation campaign, the experiments will take place at several sites in order to explore the variability of the system performance under different room conditions and different user languages and cultures.

The evaluation sites, where the whole DICIT system will be available starting from May, are: Amuser, Elektrobit, FAU, FBK, Fracarro, IBM-CZ, IBM-US; each partner will provide a fully tested and calibrated DICIT system installed in a suitable room, and a pool of subjects for the experiments.

As was done for the first prototype, all the evaluation experiments will be conducted and analyzed by human factors experts from Amuser and Elektrobit, who will travel to the different locations to conduct the experiments. The following table summarizes the planned allocation of resources.

Site          American native English   British / non-native English   German   Italian   Evaluator
Amuser        5                         5                              -        30        Amuser
Elektrobit    5                         5                              30       -         Elektrobit
FAU           5                         5                              20       -         Elektrobit
FBK           5                         5                              -        10        Amuser
Fracarro      5                         5                              -        10        Amuser
IBM CZ        5                         5                              -        -         Elektrobit
IBM US        20                        -                              -        -         Amuser+Elektrobit

As already anticipated, each system will be "aligned" with respect to the environmental acoustics to a given "standard" in order to obtain comparable results; for this purpose, a calibration procedure is being implemented at FAU and FBK.

6.3 Session settings

The experimental setup will observe a single subject for the main part of each session. Only one task of each session will have two persons in the test room, in order to test the capability of the system to reject an interfering speaker; at the end of each session the experimenter will play the role of the interferer (speaking from a different place in the room than the subject). Alternatively, the two-person setup could be tested in a dedicated study. The "ecological" environment where the test sessions will take place is composed of:

• a room, optionally with interior furniture such as a sofa, allowing subjects both to stay in front of the TV as usual and to move around;

• a dual-feed dish pointing to ASTRA-HOTBIRD, connected to the DICIT STB;

• an up-to-date EPG containing at least 10 channels in the language of the sample for one week (kept updated).

Since some functionalities of the new prototype differ from the old one, each subject will receive some general instructions (1 page) about the available channels and the meaning of the RC buttons. Each experiment will be split into two parts:

1. a free task, in which users are free to "play with the system" for some time and move around the room;

2. a "guided" part, in which the subjects have to accomplish precise tasks (e.g. 4 tasks) while sitting in front of the TV. The proposed tasks will require reaching the goal using different interaction modes: only voice, only RC, and both voice and RC.

Before starting the "guided" part of the test session, the subjects will watch a demo clip providing some more instructions about the available functionalities.

6.4 Analysis of the data

As for the first prototype, both subjective and objective data will be taken into account in the analysis; in particular:

1. Subjective data will be collected through a questionnaire and the observations of the experimenters.

2. Objective data will be produced by manually transcribing and annotating the user behaviour, by means of the signals recorded at various points in the signal processing chain and of the log files generated at various points in the recognition and dialogue sequencing.


The collection of both sets of data is supported by tools provided by EB; in particular, the annotation will be carried out with the support of the Evaluation Simulator Tool. Moreover, a precise annotation procedure will be defined in advance, based upon the experience gained in the first campaign.


7 Conclusions

The current design has been judged easy to use and fairly easy to understand, although the novelty introduced by the voice interaction requires some training to be used in the best way. Subjects pointed out that the voice commands are not very intuitive, but it is necessary to distinguish between commands that perform an "action" (in this case it is sufficient to read what is on the screen) and "commands" dealing with information, which have to be learned or remembered in order to select the right option (i.e. the users have to know what the available selection criteria within the EPG are, or which channels are available while they are using TV mode).

The integration of the graphic, vocal and haptic aspects into a multimodal interface has been appreciated by the users, who liked the possibility of using both the remote control and the voice (although sometimes, when the voice recognition failed, the RC served as a kind of backup to avoid misunderstandings). Users consider the voice interaction useful since it allows them to speed up the interaction with respect to the RC (e.g. when selecting items out of large lists); this also explains why they preferred to use short phrases or one-word commands written on the screen.

Generally speaking, the user expectation of a quick interaction using voice has not been met (regardless of the language and of the kind of TV owned), for two reasons: 1) the ability of the system to understand vocal commands was judged insufficient, and 2) the reaction time of the prototype was judged too long. Moreover, the subjects judged that one of the main problems of the interface design is the lack of an "interactive" error recovery procedure and the insufficient support offered by the help procedure. Despite the above problems, users think that the voice input makes DICIT an original and fun-to-use product.

Finally, it must be noted that the prototype is a complex system addressing highly challenging goals, with different subsystems that need to be tuned and optimized both as single components and in their interaction; the analyzed system is the first prototype, and the results obtained in the evaluation will guide the implementation of a more optimized version in the coming months.


8 References

[1] Distant-talking Interfaces for Control of Interactive TV, "Annex I - Description of Work", 31 May 2006.
[2] Cedrick Rochet, URL: www.nist.gov/smartspace/toolChest/cmaiii/userg/Microphone_Array_Mark_III.pdf
[3] Luca Brayda, Claudio Bertotti, Luca Cristoforetti, Maurizio Omologo, and Piergiorgio Svaizer. "Modifications on NIST MarkIII array to improve coherence properties among input signals." AES, 118th Audio Engineering Society Convention, Barcelona, Spain, May 2005.
[4] SpeechDat-Car EU-Project LE4-8334, URL: http://www.speechdat.org/SP-CAR/
[5] Luca Cristoforetti, Maurizio Omologo, Marco Matassoni, Piergiorgio Svaizer, and Enrico Zovato. "Annotation of a multichannel noisy speech corpus." Proc. of LREC 2000, Athens, Greece, May 2000.
[6] Transcriber, URL: http://trans.sourceforge.net/en/presentation.php
[7] Andrey Temko, Robert Malkin, Climent Nadieu, Christian Zieger, Dusan Macho, and Maurizio Omologo. "CLEAR Evaluation of Acoustic Event Detection and Classification Systems." CLEAR'06 Evaluation Campaign and Workshop, Southampton, UK: Springer, 2006.
[8] Oswald Lanz. "Approximate Bayesian multibody tracking." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006: 1436-1449.
[9] Fleischmann, T. (2007). "Model Based HMI Specification in an Automotive Context." In Smith, M. J. and Salvendy, G., editors, HCI (8), volume 4557 of Lecture Notes in Computer Science, pages 31-39. Springer.
[10] Goronzy, S., Mochales, R., and Beringer, N. (2006). "Developing Speech Dialogs for Multimodal HMIs Using Finite State Machines." In 9th International Conference on Spoken Language Processing (Interspeech), CD-ROM.
[11] ISO 9241-110:2006: "Ergonomics of human-system interaction - Part 110: Dialogue principles." International Organization for Standardization, 2006.
[12] Praat, URL: http://www.praat.org/
[13] N. Beringer: "Transliteration of Spontaneous Speech for the detailed Dialog Taskflow." DICIT technical document, 29 March 2007.
[14] N. Beringer, U. Kartal, K. Louka, F. Schiel, U. Türk. "PROMISE: A Procedure for Multimodal Interactive System Evaluation." LREC Workshop 'Multimodal Resources and Multimodal Systems Evaluation' 2002, Las Palmas, Gran Canaria, Spain, pp. 77-80.
[15] Salber, D. and Coutaz, J. (1993). "A Wizard of Oz platform for the study of multimodal systems." In Conference Companion on Human Factors in Computing Systems (INTERACT and CHI), pages 95-96, New York, NY. ACM.
[16] Taib, R. and Ruiz, N. (2007). "Wizard of Oz for Multimodal Interfaces Design: Deployment Considerations." In Jacko, J. A., editor, HCI (1), volume 4550 of Lecture Notes in Computer Science, pages 232-241. Springer.
[17] Wolfgang Herbordt. "Sound Capture for Human/Machine Interfaces." Springer-Verlag, Berlin Heidelberg, 2005.


[18] N. G. Ward, A. G. Rivera, K. Ward, D. G. Novick (2005). "Some Usability Issues and Research Priorities in Spoken Dialog Applications." Technical Report UTEP-CS-05-23.
[19] C. Doran, J. Aberdeen, L. Damianos, L. Hirschman (2001). "Comparing Several Aspects of Human-Computer and Human-Human Dialog." Annual Meeting of the ACL, volume 16.
[20] R. Manione, F. Arisio, R. Balchandran, M. E. Epstein (2008). "Language Modelling, Dialogue and User Interface the First Set-top-box Related DICIT Prototype." DICIT Project deliverable D5.1.
[21] T. Sowa, M. Bezold, R. Balchandran, D. Zobel (2008). "Integrated first PC-based STB prototype system." DICIT Project deliverable D2.2.
[22] L. Marquardt, L. Cristoforetti, E. Mabande, N. Beringer, F. Arisio, M. Bezold (2008). "Multi-microphone data collection and WOZ experiments for the analysis of user behaviour in the DICIT scenarios." DICIT Project deliverable D6.2.
[23] M. Wesseling, M. Bezold, and N. Beringer (2008). "Automatic Evaluation Tool for Multimodal Dialogue Systems." Perception and Interactive Technologies for Speech-Based Systems (PIT), Kloster Irsee, Germany, June 2008.
[24] J. Nielsen and T. K. Landauer (1993, April). "A mathematical model of the finding of usability problems." In Proceedings of ACM INTERCHI'93 Conference (pp. 206-213). Amsterdam, the Netherlands: ACM Press.
[25] G. Lindgaard and J. Chattratichart (2007). "Usability Testing: What Have We Overlooked?" CHI 2007 Proceedings, ACM Press.


9 Appendix A – Questionnaire

DICIT QUESTIONNAIRE

SOME QUESTIONS FOR STATISTICAL PURPOSES

Personal code

Personal details

A. You are

Male

Female

B. Educational qualification

Secondary school certificate

High school certificate

Degree/Diploma

C. Your age

16-20

21-30

31-40

41-50

51-60

More than 60

D. Your profession

Student (school)

Student (university)

Apprentice/trainee

Employee

Self-employed

Retired

Other

Working/studying area


YOUR HABITS WATCHING TV AT HOME

E. How many people live in your house including you?

I live alone

2 people

3 or more people

F. How many TV's do you have in your house?

None

1

2

3 or more

I watch TV via the Internet

G. Who usually decides what to watch on TV?

Only one person

We decide all together

We decide by majority

First come, first served

Each person has a TV

H. How do you usually decide which programme to watch?

Looking up the teletext

Looking up in a newspaper/TV programmes guide/internet

Looking up in the electronic programme guide (EPG)

By surfing channels

I. Which type of television do you usually watch?

'Traditional' (analogue) -> jump to question K

Satellite

Digital terrestrial

Iptv

J. How do you usually select a programme?

With the numeric button of the remote control

With the program up/program down button on the remote control

Through the electronic programme guide (EPG)

Selecting it through an IPTV or Video on Demand service


K. What is the main information that interests you to choose a programme?

Genre (film, sport, tv series, news, etc)

Actor/big name

Topic/subject

Duration

Channel

I don't care

L. For what do you usually use the TV?

Watching current TV programmes

Watching the programmes I have recorded

Watching the video on demand (VOD)

Watching rented videos/DVDs

Watching bought videos/DVDs

As background to other activities

Channel surfing

For something else (specify)

M. How do you consider yourself as a user of TV and related devices?

Amateur - I hate operating devices and don't really understand how they work

A basic user, I just know the basic functions that I need and no more

A moderately skilled user, who can program the TV channels and program a recording for a given time

A very skilled user, who can use almost every function of the devices

N. Who usually operates the domestic media appliances (TV, Radio, Satellite...) in your home?

I

My husband/wife/partner

My parents

My son/daughter/sons

My roommates/flatmates

All together

Nobody in particular


WE ASK FOR YOUR OPINIONS ABOUT THE SYSTEM YOU HAVE JUST TESTED, PLEASE GIVE EACH ASPECT A RATING FROM 1 TO 10.

USING DICIT SYSTEM

1. It was easy to understand how to modify the settings.

Very Difficult Very Easy

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer

2. It was easy to understand how to use the different selection criteria given by the system

Very Difficult Very Easy

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer

3. It was easy to understand how to give all the vocal commands

Very Difficult Very Easy

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer


4. It was comfortable to give some information using voice and other information using the remote control

Very Uncomfortable Very Comfortable

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer

5. It was easy to change the channel by voice.

Very Difficult Very Easy

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer

6. It was easy to change the volume by voice.

Very Difficult Very Easy

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer

7. In case of problems did the system suggest usefully and efficiently what to do to recover the information after the error?

Very Useless Very Useful

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer


WATCHING THE VIDEO

8. Do you find it useful to choose between displaying the programme list and the criteria list?

Very Difficult Very Easy

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer

9. Is the screen which shows the criteria for the programme search easy to read?

Very Difficult Very Easy

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer

10. Was it easy to understand how to use vocally the search criteria for programmes shown on the screen?

Very Difficult Very Easy

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer

11. Was it easy to understand how to use the remote control to select the search criteria for programmes?

Very Difficult Very Easy

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer


12. To accomplish the tasks we assigned to you, did you expect to have some other vocal commands available?

No

Yes -> (Write which ones)

List the missed commands

13. Would the information on the screen have been useful for orienting yourself if the audio had been disabled?

Very Useless Very Useful

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer


USING THE VOCAL INTERACTION

14. How do you judge the opportunity to use a vocal command?

Very useful

Useful if used with the remote control

Useful if it replaces the remote control

Useful if it allows me more operations than the remote control

Useful if the system reacts as quickly as it does with the remote control

I would never use vocal commands

Comments

15. For the vocal commands, you prefer:

Using full sentences

Using short commands

Having some precise commands to read on the video

Comments

16. How do you judge the system's ability to understand the vocal commands?

Very Bad Very Good

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer / commands that the system didn't understand


17. Did you find any situation in which you needed more instructions to interact with the system?

No

Yes -> (Write which)

Situations


LISTENING TO THE SYSTEM VOICE

18. How do you judge the talking feature of a system

It's useful

I don't find it useful for me, but perhaps it's useful for elderly persons/persons with visual impairment

It's not useful at all

Comments

19. Do you find it useful that the system reads out (in addition to listing them on the screen) the programmes found after your search?

Yes

Yes, only if they are not too many

No

Comments

20. Would you like to have a button to enable / disable the vocal recognizer?

Yes

No

Comments


ADAPTIVE FEATURES

21. Imagine the system could adapt itself to your behavior, e.g. by providing help depending on your expertise with the system. How do you feel about such a feature?

Very negative Very positive

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer

22. What do you think about the adaptive system highlighting (e.g. by enlarging) the most frequently used functions on the screen automatically?

Very negative Very positive

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer

23. Do you think you could feel uncomfortable if the system monitored your choices or recorded your preferences in order to improve itself for you?

Very uncomfortable Not uncomfortable at all

1 2 3 4 5 6 7 8 9 10

Explain the reasons for your answer

24. Do you have any additional ideas how the system might support your operation by adapting to your behavior?


TASK SPECIFIC QUESTIONS:

25. How difficult did you find each of the 4 tasks? Consider both the task itself and your possibility to use the remote control or voice.

Difficult Easy

1. Task 1

2. Task 2

3. Task 3

4. Task 4


YOUR GENERAL OPINION. NOW THAT YOU KNOW THE SYSTEM WE ASK YOU SOME GENERAL OPINIONS ABOUT EACH OF THE FOLLOWING ASPECTS:

26. Give your opinion about DICIT by crossing out the box that best describes your degree of agreement with each of the following phrases describing the service.

Complete disagreement

Complete agreement

1. I think that Dicit is easy to use

2. It makes me confused when I use it

3. I like the Dicit voice

4. I had to pay a lot of attention using vocal interaction

5. I often lost the thread while interacting vocally

6. I had to pay a lot of attention when using the RC

7. I think that speech input speeds up using the programming guide (EPG)

8. By using the voice it is easier to search the programmes


9. I preferred using the remote control over voice

10. The Dicit voice speaks too quickly

11. The selection criteria which appear on the screen are not clear

12. The settings are difficult to modify

13. I think that it is funny to use

14. I prefer using the traditional way (TV guide, teletext, newspaper) to search for an interesting programme

15. I easily succeed in my tasks

16. I think that Dicit needs some improvements

17. The vocal instructions were boring

18. The vocal instructions were useful


27. For you, DICIT was

easy complicated

efficient inefficient

quick slow

original copied

precise vague

capable incapable

formal informal

active passive

friendly unfriendly

predictable unpredictable

polite impolite

clever stupid

organized disorganized

patient impatient



10 Appendix B – General instructions to the subjects

GENERAL DESCRIPTION OF THE TEST

Thank you for participating in the evaluation of a new television system! Your help will allow us to improve the present DICIT prototype. The objective of the test is to evaluate the quality of the system. During the test session, we will ask you to interact with the DICIT system as if you were at home.

The whole test is composed of 5 parts: in the first part, you are free to use the system as you like, to try out things like surfing through channels or searching and scheduling a recording using the electronic programme guide. For the other 4 parts we will ask you to do something in particular. To accomplish the tasks given to you, it is possible to use either the remote control (RC), or vocal commands, or both: for each task, we will explicitly ask you to interact with DICIT using only the RC, only vocal commands, or whatever mode you prefer.

The system works with environmental (distant-talking) speech recognition, so you can move around the room as you wish; however, to obtain a better voice sample we would like to ask you to keep a distance of 1.5 - 2 meters from the TV screen (see the line on the floor). Some of the available voice commands to interact with the system will be displayed on the screen. Sometimes the DICIT voice will suggest what to do. Please remember that you can take all the time you need to consider what to do, and what to expect from the system if you choose a particular command. If you have any difficulties during the session, you can always get help from DICIT; the test conductor is not allowed to help you.

Remember that we are not evaluating your performance, but rather the prototype that you are going to test, so there is no right or wrong behavior on your behalf. We ask you to be as natural as possible in what you say and do.

At the end of the test we will ask you to complete a questionnaire. The questionnaire contains some personal questions which are used only for statistical reasons; all the information we kindly ask of you will be used only for statistical purposes and will not identify you as a person, as each test candidate is linked to an alpha-numeric code.

Bear in mind that the DICIT system is just a prototype, so it still cannot understand the titles of the programs and it reacts a bit slowly to voice commands. If the system does not react to your voice command even after waiting for a couple of seconds, just try again. Thank you for participating!


11 Appendix C – Tasks

Task "free"

Playing Around With the System

Please, “play” for at least 5 minutes with the system, to discover what it is able to do (remember that, if you need it, you can always ask Dicit for help).

While you are using the system, you are kindly requested to move into the room as you prefer.

Task A_gen

Please, surf channels to see what kind of programs are on air now, considering that only 9 satellite channels are available.

Then try to schedule to turn the TV on to watch a program that isn’t on air at this moment.

Task A_spec_rc

Using only the RC, please surf through all the 9 satellite channels available.

Then, with the same mode, try to schedule to turn the TV on to watch ‘CBS News’ on Sky News at 1:30 AM.


Task A_spec_v

Using only vocal commands, please surf through all the 9 satellite channels available.

Then, with the same mode, try to schedule to turn the TV on to watch ‘CBS News’ on Sky News at 1:30 AM.

Task A_spec_vrc

Using both RC and vocal commands, please surf through all the 9 satellite channels available.

Then, with the same mode, try to schedule to turn the TV on to watch ‘CBS News’ on Sky News at 1:30 AM.

Task B_gen

Please, search some programs through the search criteria, using the time, genre or the channel of your preference.

Task B_spec_rc

Using only the RC and through the search criteria, please try to find out at what time all the sports shows about the car race will be broadcast on Sunday.


Task B_spec_v

Using only vocal commands and through the search criteria, please try to find out at what time all the sports shows about the car race will be broadcast on Sunday.

Task B_spec_vrc

Using both RC and vocal commands, through the search criteria, please try to find out at what time all the sports shows about the car race will be broadcast on Sunday.

Task C_gen

Please, search for a program through the EPG list; when you find something playing now that you like, select it to switch to TV mode. While you are watching it, try to set the volume to approximately ¼ of full volume.

Task C_spec_rc

Using only the RC, and through the EPG list, search and select the program now running on CNN so that the system switches to TV mode.

While watching CNN, try to set the volume to approximately ¼ of full volume.


Task C_spec_v

Using only the vocal commands, and through the EPG list, search and select the program now running on CNN so that the system switches to TV mode.

While watching CNN, try to set the volume to approximately ¼ of full volume.

Task C_spec_vrc

Using both the RC and vocal commands, through the EPG list, search and select the program now running on CNN so that the system switches to TV mode.

While watching CNN, try to set the volume to approximately ¼ of full volume.

Task D_gen

Please try to modify the settings by changing the system voice, the interaction mode or the EPG start mode.

Task D_spec_rc

Please, using only the RC, try to modify the settings by setting the interaction mode to expert and the EPG start mode to list.


Task D_spec_v

Please, using only vocal commands, try to modify settings by setting the interaction mode to expert and the EPG start mode to list.

Task D_spec_vrc

Please, using RC and vocal commands, try to modify settings by setting the interaction mode to expert and the EPG start mode to list.


12 Appendix D – Description of the RC and the available channels

BUTTON MAPPING ON THE REMOTE CONTROL

The remote control figure shows the following button groups: digit buttons; program increment/decrement buttons; navigation buttons for item scroll ("up"/"down"); navigation buttons for page scroll ("left"/"right"); volume control buttons; the on/off button; the key color buttons; and unused buttons.


English channels on Astra

RC Digit Channel Name

1 Al Jazeera

2 BBC World

3 CNBC

4 CNN

5 Euronews

6 Eurosport

7 God Europe

8 Sky News

9 Tv Shop

Italian channels on Hotbird

RC Digit Channel Name

1 RAI 1

2 RAI 2

3 RAI 3

4 Rete 4

5 Canale 5

6 Italia 1

7 Sky News 24

8 Sky Meteo 24

9 Al Jazeera

10 France 24


German channels on Astra

RC Digit Channel Name

1 Das Erste

2 DSF

3 Eurosport

4 Kabel 1

5 n-tv

6 Pro Sieben

7 RBB

8 RTL

9 RTL 2

10 Super RTL

11 SWR

12 Tele 5

13 Vox

14 WDR

15 ZDF