
FIELD STUDIES

User studies

Ubicomp: people use technology Must conduct user studies Also:

Focus groups Ethnographic studies Heuristic evaluations Etc.

User studies

Laboratory studies: Controlled environment

Field (in-situ) studies Real world

Field studies

Appropriate for ubicomp: Abundant data Observe unexpected challenges Understand impact on lives

Trade-off: Loss of control Significant time and effort

Three common types

Current behavior Proof of concept Experience with prototype

How to think about user studies?

Formulate hypotheses

Research steps

1. State problem(s)
2. State goal(s)
3. Propose hypotheses
4. Propose steps to test hypotheses
5. Explain how problem(s), goal(s) and hypotheses fit into existing knowledge
6. Produce results of testing hypotheses
7. Explain results
8. Evaluate research
9. State new problems

What is a hypothesis?

Proposing an explanation
Theory or hypothesis?
  “This is just a theory.”
  Some theories we live by (“just” not justified):
    Newton’s theory of motion
    Einstein’s theory of relativity
    Evolutionary theory

Hypothesis

Must be tentative Must predict

Hypothesis

Some criteria of scientificity:
  Self-consistent
  Grounded (fits bulk of relevant knowledge)
  Accounts for empirical evidence
  Empirically testable by objective procedures of science
  General in some respect and to some extent

On proposing hypotheses

Anomalous phenomena:
  Strange and unfamiliar (Bermuda triangle)
  Familiar yet not fully understood (cognitive load)
Is there already an explanation?

Types of hypotheses

Incremental Fundamental shift:

Ptolemy (c. 90 – c. 168): geocentric cosmology

Copernicus (1473 – 1543): heliocentric cosmology

And then came…

Kepler (1571 – 1630): elliptical orbits

Fundamental shift example

Ulcer: Stress? Spicy food? Bacteria.

Types of proposed explanations Causes Correlation Causal mechanisms Underlying processes Laws Functions

Proposing causal explanations

Studies show that using a cell phone while driving increases the probability of getting into an accident. Why is that so?
  Pick up ringing phone
  Dial number
  See but don’t perceive

Effects not always there

Cell phone + driving: Usually no accident Only one of the factors

Remote and proximate causes Cell phone + driving:

Attention shift → missed signal → accident Remote cause → proximate cause → effect

Correlation

A and B are correlated if: A → B B → A C → A and C → B A combination of (some of) the above Coincidence

Correlation vs. causal relation:
  Correlation doesn’t imply causal relation
  Cannot determine cause direction (A → B or B → A)
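The "C → A and C → B" case can be sketched with simulated data. This is a minimal illustration only: the variables, noise levels and sample size are made-up assumptions, and `pearson` is a hypothetical helper written for this example. A and B never influence each other, yet they come out correlated because both depend on C.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 2000                # hypothetical sample size

# Hypothetical common cause C drives both A and B.
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 0.5) for ci in c]  # A depends only on C plus noise
b = [ci + rng.gauss(0, 0.5) for ci in c]  # B depends only on C plus noise

r_ab = pearson(a, b)
r_ba = pearson(b, a)
print(f"correlation(A, B) = {r_ab:.2f}, correlation(B, A) = {r_ba:.2f}")
```

The coefficient is symmetric in A and B, which is one way to see why a correlation alone cannot tell us the direction of a cause.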

Correlation

Positive, negative
None found ≠ none exists
Causal link → correlation:
  May provide initial evidence for causal link
  Less explanatory value than facts about causal links

Causal mechanisms

Mechanisms connecting remote causes and their effects.

E.g.:
  Damaged artery in heart → clotting
  Clotting → blocked artery
  Blocked artery → heart attack
  Aspirin inhibits clotting → lower risk of heart attack

Underlying processes

Photoelectric effect

Photoelectric effect

Einstein: 1921 Nobel Prize in Physics

http://upload.wikimedia.org/wikipedia/commons/7/78/Einstein1921_by_F_Schmutzer_4.jpg

Laws

General regularities in nature Universal:

F = ma Non-universal:

Statistical laws

Functions

What is the purpose of the phenomenon?

FOR SALE A prime lot of serfs or SLAVES GYPSY (TZIGANY) Through an auction at noon at the St. Elias Monastery on 8 May 1852 consisting of 18 Men 10 Boys, 7 Women & 3 Girls in fine condition

http://upload.wikimedia.org/wikipedia/commons/3/35/Sclavi_Tiganesti.jpg

Functions

William Harvey (1578 – 1657):
  Heart pumps blood through circulatory system
  No modern instruments!
  Experiments with a number of animals: various fish, snail, pigeon, etc.

Multiple methods together

Function → → causal mechanism → → underlying processes

National Ignition Facility (Dennis O’Brien @ UNH): Ignition with lasers → → Laser, target chamber → → Physics of nuclear fusion

http://en.wikipedia.org/wiki/National_Ignition_Facility

http://www.eceblogger.com/?p=539

Multiple methods together

Law → underlying processes

Isaac Newton (1643 – 1727), second law of motion: F = ma → Graviton?

http://upload.wikimedia.org/wikipedia/commons/3/39/GodfreyKneller-IsaacNewton-1689.jpg

Ockham’s razor

Crop circles: pranksters or aliens?

http://upload.wikimedia.org/wikipedia/commons/c/c4/Crop_circles_Swirl.jpg

Ockham’s razor

William of Ockham (c. 1288 – c. 1348)

http://en.wikipedia.org/wiki/File:William_of_Ockham.png


Do I have a hypothesis?

Yes. Do you realize you do?

How to think about user studies?

Formulate hypotheses

Three common types


Research steps



Current behavior

Insights and inspiration: State problem(s), goal(s) Propose hypotheses

Relatively long

Current behavior – example 1

AJ Brush and Kori Inkpen, “Yours, mine and ours?...” (pdf) (2005 movie inspiring title)

Home technology: users share, etc.

http://research.microsoft.com/pubs/69498/brushinkpenyoursmineours.pdf

http://www.yoursmineandoursmovie.com/



Current behavior – example 2

Shwetak Patel et al., “Farther Than You May Think…” (pdf)

Hypothesis: Mobile phone a proxy to user location.

http://abstract.cs.washington.edu/~shwetak/papers/prox_ubicomp06.pdf

Three common types


Research steps



Proof of concept

Technological advance: Produce results: prototype Explain results: prototype

Relatively short

Proof of concept – example 1

J. Sherwani et al., “Speech vs. Touch-tone: Telephone Interfaces for Information Access by Low Literate Users” (pdf) (video)

Hypothesis: Speech better telephony interface than touch-tone for low literate users.

http://www.cs.cmu.edu/~jsherwan/pubs/ictd09.pdf

http://www.youtube.com/watch?v=jZv0y5_UyLQ

Proof of concept – example 2

John Krumm and Eric Horvitz, “Predestination:…” (pdf)

Hypothesis: Destinations from partial trajectories.

Train/test algorithm on GPS tracks from 169 people

Used pre-existing data:
  Krumm and Horvitz, “The Microsoft Multiperson Location Survey”
  Collecting original data a significant contribution
  Leverage!

http://research.microsoft.com/en-us/um/people/jckrumm/Publications%202006/predestination%20preprint%20final.pdf

Three common types


Research steps



Experience with prototype

Users’ interaction with technology: Produce results: prototype Explain results: prototype

Relatively long

Prototype an example!

Others don’t care about: Raw usage information Usability problems Intricate implementation details Etc.

Generalize! Scientific and good technical work

Experience – example 1

C. Neustaedter, et al., “A Digital Family Calendar in the Home:…” (pdf) (video)

Hypothesis: At-a-glance awareness, remote access are significant benefits.

4 households, 4 weeks each

(Best Student Paper, Graphics Interface 2007)

http://delivery.acm.org/10.1145/1270000/1268551/p199-neustaedter.pdf?key1=1268551&key2=3461847621&coll=portal&dl=ACM&CFID=26746030&CFTOKEN=26792350

http://www.youtube.com/watch?v=IVAAucKJUiw

Experience – example 2

Rafael Ballagas et al., “Gaming Tourism:…” (pdf) (video)

Hypothesis: Learning through a game.

18 participants: 2 alone + 8 pairs (8 x 2 = 16)

http://www.youtube.com/watch?v=zuefBtnQGWg


Study design

Who is the consumer?
  Manager(s): Industry, academic lab
  Professor(s): E.g. thesis committee
  Researchers: E.g. advisor’s collaborators
  Reviewers: For paper, proposal, thesis
  Funding agency: Report on progress, proposal for funding
  Public: Friends, family, alumni, potential students, donors, potential employers

Study design

How can I explain this to a layperson? What is key? What can be omitted?

How will I write this up? Paper Thesis Report Blog post

Start writing paper/thesis/report/blog post at the beginning of the study.

Study design

Test hypothesis/hypotheses

Testing hypotheses via user studies

User studies:
  Laboratory studies
    Good: Control, easier to evaluate results
    Bad: Constraints
  Field studies
    Good: Fewer constraints
    Bad: Less control, more difficult to evaluate results

Criteria

Falsifiability: Prediction fails = explanation isn’t correct Account for other factors!

Note: Criterion - singular Criteria - plural

Criteria

Verifiability:
  Prediction successful = explanation is correct
  Account for other factors!

The meat of it…

Battleship Potemkin, 1925 film

Rotten meat scene

http://en.wikipedia.org/wiki/File:Vintage_Potemkin.jpg

http://www.youtube.com/watch?v=HLW9n40LBXc

Why larvae in meat?

Francesco Redi (1626-1697)

Generation of insects, 1668

Causal explanation: fly droppings

http://en.wikipedia.org/wiki/Francesco_Redi



Redi’s research

Hypothesis: Worms derived from fly droppings

Testing hypothesis:
  Two sets of flasks with meat: sealed and open
  Prediction: worms only in open flask

Falsifiability criterion

Can anything cause a failed prediction even if explanation is correct?

Did the apparatus operate properly? Tight seal? Meat not initially spoiled? Other?

Verifiability criterion

Can anything result in successful prediction even if explanation is wrong?

What if “active principle” in the air is responsible for spontaneous generation?

Modify experiment: Replace seal with veil:

Flies cannot reach meat Air in contact with meat

Modification helps meet verifiability criterion

Verifiability criterion

Experimental vs. control group:
  Only difference in level of one independent variable
Redi’s experiment:
  Control: Open flasks
  Experimental: Veil-covered flasks
Control: laboratory experiment
  Meat in veil-covered flasks?
  Creating control/experimental groups often impossible without careful design/control

Study design

Test hypothesis/hypotheses Formulate in terms of:

Independent variables (multiple conditions) Dependent variables

Design: Within-subjects Between-subjects Mixed design

Within-subjects design: example

Police radio UI: hardware Speech

Blog post, video

http://www.eceblogger.com/2007/07/comparing-user-interfaces/

http://www.youtube.com/watch?v=5kR_LjRbZC4&feature=player_embedded

Within-subjects design: example

Results in graphical form:

Example: between-subjects design

Classical example: testing a drug

Mixed design: example 1

SUI characteristics study
Secondary task: speech control of radio
2 x 2 x 2 design:
  SR accuracy: high/low
  PTT button: yes/no – ambient recognition
  Dialog repair strategy: mis-/non-understanding

Mixed design: example 2

Motivation: PTT vs. driving performance
Secondary task: speech control of radio
2 x 3 x 3 design:
  SR accuracy: high/low
  PTT activation: push-hold-release/push-release/no push
  PTT button: ambient/fixed/glove

Design grid (rows: SR accuracy; columns: PTT activation, each split by PTT button):

          Push-hold-release       Push-release            No-push
          Ambient Fixed Glove     Ambient Fixed Glove     Ambient Fixed Glove
  High
  Low
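The cells of the 2 x 3 x 3 design can be enumerated mechanically, which is a handy check that no condition is forgotten when planning sessions. A minimal sketch follows; the factor levels are taken from the slide, while the variable names are assumptions made for this example.

```python
from itertools import product

# Factor levels as listed on the slide.
sr_accuracy = ["high", "low"]
ptt_activation = ["push-hold-release", "push-release", "no-push"]
ptt_button = ["ambient", "fixed", "glove"]

# Every combination of one level per factor is one cell of the design.
conditions = list(product(sr_accuracy, ptt_activation, ptt_button))

for accuracy, activation, button in conditions:
    print(f"{accuracy:4}  {activation:17}  {button}")

print(len(conditions), "conditions")  # 2 * 3 * 3 = 18
```

Which factors vary within subjects and which between subjects is a separate decision that this listing does not capture.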

Control condition

Baseline: e.g. no technology vs. later introduced technology

Considerations

What will subjects do? Normal behavior – may take long Scenarios

Augment existing or brand new?
  Augment: taking advantage of familiarity
  New: more control (fewer inherited constraints)
Simulate or implement?
  E.g. WoZ

Data to collect

Qualitative
  Insight into what participants did.
  How do participants compare? Did they do what they thought they did? Use quantitative data.

Quantitative
  How did people behave?
  But why? Use qualitative data.

Data to collect

At least three types of data: Demographic Usage Reactions

Data to collect

Run pilot experiments!

Collecting data

Logging Surveys Experience sampling Diaries Interviews Unstructured observation – ethnography

Logging

Plan ahead, not after the fact! Testing hypotheses

Don’t leave important data out Don’t save data you don’t need

Leverage logging: Everything OK?

E.g. Mike Farrar’s MS research: files appearing on server indicates apps OK

Explicit communication with server: “I’m OK!”
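The "Everything OK?" check can be sketched as a server-side scan for deployed clients that have gone quiet. This is only an illustration of the idea, not the logging setup used in the study mentioned above: `silent_clients`, the client IDs and the 24-hour threshold are all hypothetical.

```python
from datetime import datetime, timedelta

def silent_clients(last_seen, now, max_silence):
    """Return IDs of clients whose latest upload or "I'm OK!" is older than max_silence."""
    return sorted(cid for cid, seen in last_seen.items() if now - seen > max_silence)

# Hypothetical record of the most recent file or heartbeat received per client.
last_seen = {
    "household-1": datetime(2010, 3, 1, 9, 0),
    "household-2": datetime(2010, 3, 2, 8, 30),
    "household-3": datetime(2010, 2, 27, 22, 15),
}
now = datetime(2010, 3, 2, 12, 0)

# Flag anything that has been silent for more than a day (assumed threshold).
to_check = silent_clients(last_seen, now, timedelta(hours=24))
print("check on:", to_check)
```

An explicit heartbeat separates "the app is broken" from "the participant simply did not use it today", which file uploads alone cannot do.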

Surveys

Open-ended Multiple-choice Likert-scale

Surveys

Questions should allow positive and negative feedback.

Text clear to others? Check! One question at a time!

“Fun and easy to use?” Length?

Don’t bore subjects to death. Standard questions (e.g. QUIS)?

Previously used questions?

http://lap.umd.edu/quis/

Example: Mike Farrar’s study Hypotheses:

Initialize grammar (video): From previous tags From tags by users with similar interests

Voice commands convenient way to tag photos (video)

Keyboard users will use voice less Low task completion: give up on voice

http://www.youtube.com/user/eceblogger#p/u/7/_peG8mwAOqE

http://www.youtube.com/user/eceblogger#p/u/8/_vmtRhl7TEI

Experience sampling (ESM)

Short questionnaire Timing:

Random Scheduled Event-based
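The three timing strategies can be sketched as three small prompt generators. This is a minimal illustration under assumptions: the function names, the 9:00 to 21:00 sampling window, the prompt counts and the 30-minute minimum gap are all made up for the example.

```python
import random
from datetime import datetime, timedelta

def scheduled_prompts(start, end, interval):
    """Prompts at fixed intervals from start (inclusive) to end (exclusive)."""
    times = []
    t = start
    while t < end:
        times.append(t)
        t += interval
    return times

def random_prompts(start, end, count, rng):
    """`count` prompts at distinct, uniformly random minutes within [start, end)."""
    total_minutes = int((end - start).total_seconds() // 60)
    offsets = sorted(rng.sample(range(total_minutes), count))
    return [start + timedelta(minutes=m) for m in offsets]

def event_prompts(event_times, min_gap):
    """Prompt on an event unless the previous prompt was less than min_gap ago."""
    prompts = []
    for t in sorted(event_times):
        if not prompts or t - prompts[-1] >= min_gap:
            prompts.append(t)
    return prompts

day_start = datetime(2010, 3, 1, 9, 0)
day_end = datetime(2010, 3, 1, 21, 0)

fixed = scheduled_prompts(day_start, day_end, timedelta(hours=3))
rand = random_prompts(day_start, day_end, 5, random.Random(1))
events = [datetime(2010, 3, 1, 10, 0), datetime(2010, 3, 1, 10, 5), datetime(2010, 3, 1, 11, 30)]
triggered = event_prompts(events, timedelta(minutes=30))

print(len(fixed), len(rand), len(triggered))
```

The minimum gap in the event-based variant is one way to keep a burst of events from producing a burst of questionnaires.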

Experience sampling (ESM)

How often? How many? Relate to quantitative data?

Diaries

Similar to ESM

Interviews

Semi-structured: List of specific questions + follow-up

questions Bring data

E.g. Nancy A. Van House: “Flickr and Public Image Sharing:…”

Interviews + photo elicitation

http://people.ischool.berkeley.edu/~vanhouse/

http://people.ischool.berkeley.edu/~vanhouse/VanHouseFlickrDistantCHI07.pdf

Interviews

Neutral questions Negative feedback is OK (this is hard):

Don’t argue!

Participants

Follow IRB rules

Participants

Who to recruit? Representative of intended users Not your friends, family, colleagues – bias! May need different types

Recruit sufficient numbers of each type

Participant profile

Age E.g. age significant for driving

Gender Technology use and experience Other

Eye tracker studies: no glasses

Number of participants

Between-subjects usually requires more than within-subjects

Proof-of-concept: typically fewer and many types

Longer study: may be able to use fewer

Time commitment per participant is significant!
  Recruit (Craigslist), organize, train, run, transfer data, process data

Participants will drop out – recruit extra
  Counterbalancing may not work out
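A small sketch shows how dropouts can undo counterbalancing. It assigns condition orders from a cyclic Latin square, which balances the position of each condition but not carry-over effects. The helper names, the three conditions, the six participants and the two dropouts are hypothetical.

```python
from collections import Counter

def latin_square_orders(conditions):
    """One presentation order per row of a cyclic Latin square."""
    k = len(conditions)
    return [tuple(conditions[(row + col) % k] for col in range(k)) for row in range(k)]

def assign_orders(participant_ids, conditions):
    """Cycle through the Latin-square rows as participants are enrolled."""
    orders = latin_square_orders(conditions)
    return {pid: orders[i % len(orders)] for i, pid in enumerate(participant_ids)}

conditions = ["ambient", "fixed", "glove"]
assignment = assign_orders([f"P{n}" for n in range(1, 7)], conditions)
planned = Counter(assignment.values())

# Simulate two dropouts (hypothetical participants P2 and P5).
dropouts = {"P2", "P5"}
completed = Counter(order for pid, order in assignment.items() if pid not in dropouts)

print("planned:  ", dict(planned))
print("completed:", dict(completed))
```

In this made-up case both dropouts happened to share an order, so that order is lost entirely. Recruiting extra participants per order is one way to leave room for this.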

Compensation

Don’t try to save on this! Driving simulator lab study cost example

1 graduate student year at UNH ≈ $50k
Software maintenance fees per year ≈ $20k
Trip to conference ≈ $2k
PC or laptop ≈ $2k
$20 x 24 participants ≈ $0.5k (less than 1%)
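The "less than 1%" figure can be checked with a few lines of arithmetic. The amounts are the approximate ones quoted on the slide; the dictionary keys are just labels chosen for this sketch.

```python
# Approximate costs from the slide, in US dollars.
costs_usd = {
    "graduate student (1 year)": 50_000,
    "software maintenance (1 year)": 20_000,
    "conference trip": 2_000,
    "PC or laptop": 2_000,
    "participant compensation": 20 * 24,  # $20 x 24 participants
}

total = sum(costs_usd.values())
share = costs_usd["participant compensation"] / total

print(f"total = ${total:,}; compensation share = {share:.1%}")
```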

Compensation

Must not affect data

E.g. in image tagging study if we paid per picture:
  More data
  Unrealistic as interactions are for money not for value of prototype

Compensation

Leverage if you can:
  Latest driving simulator lab study in collaboration with Microsoft Research: Use Microsoft software as compensation

Data analysis

Test hypotheses Use multiple data types Tell a story

Data analysis

Statistics: Descriptive Inferential

Descriptive statistics

Level of measurement: Nominal Ordinal Interval

http://en.wikipedia.org/wiki/Level_of_measurement

Level of measurement

Nominal: Unordered categories
  E.g. yes/no
  Valid to report:
    Frequency

Ordinal: Rank order preference without numeric difference
  E.g. responses on Likert scale
    Five of the eight participants strongly agreed or agreed with the following statement: “I prefer to have a GPS screen for navigation.”
  Valid to report:
    Frequency
    Median
  Some people report means but what is the mean of “strongly agree” and “strongly disagree”?

Interval: Numerical differences significant
  E.g. age, number of times an action occurred, etc.
  Valid to report:
    Sum
    Mean
    Median
    Standard deviation (outliers?)
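The reporting rules for the three levels can be sketched with the Python standard library. All data below are made up for illustration. The Likert responses are coded 1 to 5, an assumed coding, and chosen so that five of eight participants agree or strongly agree. The interval data include one deliberate outlier.

```python
from collections import Counter
import statistics

# Nominal: unordered categories, so report frequencies only.
uses_gps = ["yes", "no", "yes", "yes", "no", "yes", "yes", "no"]
nominal_freq = Counter(uses_gps)

# Ordinal: Likert responses coded 1 (strongly disagree) to 5 (strongly agree).
labels = {1: "strongly disagree", 2: "disagree", 3: "neutral",
          4: "agree", 5: "strongly agree"}
likert = [5, 4, 4, 5, 3, 2, 4, 1]
ordinal_freq = Counter(labels[r] for r in likert)
# median_low always returns an actual response, so it maps back to a label.
ordinal_median = labels[statistics.median_low(likert)]
agree_count = sum(1 for r in likert if r >= 4)

# Interval: sum, mean, median and standard deviation are all meaningful.
glances = [12, 15, 11, 14, 13, 60]  # 60 is a deliberate outlier
interval_summary = {
    "sum": sum(glances),
    "mean": statistics.fmean(glances),
    "median": statistics.median(glances),
    "stdev": statistics.stdev(glances),  # sample standard deviation
}

print(nominal_freq)
print(ordinal_freq, "median:", ordinal_median, "agree or strongly agree:", agree_count)
print(interval_summary)
```

The single outlier pulls the mean well above the median, which is why it helps to look for outliers before reporting a mean and standard deviation.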

Outliers in interval data

Inferential statistics

Significance tests t-test ANOVA Many others

Which to use: depends on data

Significance test: example 1

To assess the effect of different navigation aids on visual attention, we performed a one-way ANOVA using PDT as the dependent variable. As expected, the time spent looking at the outside world was significantly higher when using spoken directions as compared to the standard PND directions, p<.01. Specifically, for spoken directions only, the average PDT was 96.9%, while it was 90.4% for the standard PND.


[Figure: PDT on standard PND [%] (axis from -5 to 20) versus distance from previous intersection [m], in bins 60-80, 80-100, 100-120, 120-140 and 140-160.]

… PDT on the PND screen changes with the distance from the previous intersection… significant main effect, p<.01…


Randomization test Kun et al. (pdf) Idea from Veit et al. (pdf)

http://andrewkun.com/papers/2009/Kun%20et%20al%20PND.pdf

http://www.int-res.com/articles/meps/139/m139p011.pdf
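The general idea of a randomization test can be sketched as follows. This is a generic two-sample version on the difference of group means, not the specific procedure used in the papers linked above. The function name, the resample count and the data are made up for illustration.

```python
import random
import statistics

def randomization_test(group_a, group_b, n_resamples=5000, seed=0):
    """Two-sided randomization test on the difference of group means.

    Shuffles the pooled observations to see how often a random split
    yields a difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Count the observed arrangement itself so the estimate is never zero.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical per-participant PDT values [%] for two navigation aids.
spoken_only = [96, 97, 98, 95, 97, 96]
standard_pnd = [90, 91, 89, 92, 90, 91]

p_value = randomization_test(spoken_only, standard_pnd)
print(f"p = {p_value:.4f}")
```

The test makes no assumption about the distribution of the data, at the cost of repeated resampling.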


[Figure: Rstw [degrees^2] (axis from 0 to 35) versus lag [seconds] (0 to 8); legend entries: standard, p = 0.05, spoken only.]
