TRANSCRIPT
FIELD STUDIES
User studies
Ubicomp: people use technology Must conduct user studies Also:
Focus groups Ethnographic studies Heuristic evaluations Etc.
User studies
Laboratory studies: Controlled environment
Field (in-situ) studies Real world
Field studies
Appropriate for ubicomp: Abundant data Observe unexpected challenges Understand impact on lives
Trade-off: Loss of control Significant time and effort
Three common types
Current behavior Proof of concept Experience with prototype
How to think about user studies?
Formulate hypotheses
Research steps
1. State problem(s)
2. State goal(s)
3. Propose hypotheses
4. Propose steps to test hypotheses
5. Explain how problem(s), goal(s) and hypotheses fit into existing knowledge
6. Produce results of testing hypotheses
7. Explain results
8. Evaluate research
9. State new problems
What is a hypothesis?
Proposing an explanation Theory or hypothesis? “This is just a theory.” Some theories we live by (“just” not
justified): Newton’s theory of motion Einstein’s theory of relativity Evolutionary theory
Hypothesis
Must be tentative Must predict
Hypothesis
Some criteria of scientificity:
Self-consistent
Grounded (fits bulk of relevant knowledge)
Accounts for empirical evidence
Empirically testable by objective procedures of science
General in some respect and to some extent
On proposing hypotheses
Anomalous phenomena:
Strange and unfamiliar (Bermuda triangle)
Familiar yet not fully understood (cognitive load)
Is there already an explanation?
Types of hypotheses
Incremental Fundamental shift:
Ptolemy (c. 90 – c. 168): geocentric cosmology
Copernicus (1473 – 1543): heliocentric cosmology
And then came…
Kepler (1571 – 1630): elliptical orbits
Fundamental shift example
Ulcer: Stress? Spicy food? Bacteria.
Types of proposed explanations Causes Correlation Causal mechanisms Underlying processes Laws Functions
Proposing causal explanations
Studies show that using a cell phone while driving increases the probability of getting into an accident. Why is that so?
Pick up ringing phone
Dial number
See but don’t perceive
Effects not always there
Cell phone + driving: Usually no accident Only one of the factors
Remote and proximate causes
Cell phone + driving:
Attention shift → missed signal → accident
Remote cause → proximate cause → effect
Correlation
A and B may be correlated because:
A → B
B → A
C → A and C → B
A combination of (some of) the above
Coincidence
Correlation vs. causal relation:
Correlation doesn’t imply causal relation
Cannot determine cause direction (A → B or B → A)
Correlation
Positive, negative
None found ≠ none exists
Causal link → correlation:
May provide initial evidence for causal link
Less explanatory value than facts about causal links
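The "C → A and C → B" case can be sketched with a small simulation. This is only an illustration with made-up numbers: a hidden common cause C drives both A and B, neither of which influences the other. The names `pearson`, `r_ab` and `r_res` are assumptions for this sketch.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical data: C is a hidden common cause; A and B never affect each other.
c = [rng.gauss(0, 1) for _ in range(n)]
a = [ci + rng.gauss(0, 1) for ci in c]  # C -> A
b = [ci + rng.gauss(0, 1) for ci in c]  # C -> B

# A and B are correlated even though there is no causal link between them.
r_ab = pearson(a, b)

# Once C's contribution is subtracted, the correlation disappears.
a_res = [ai - ci for ai, ci in zip(a, c)]
b_res = [bi - ci for bi, ci in zip(b, c)]
r_res = pearson(a_res, b_res)

print(f"corr(A, B)             = {r_ab:.2f}")
print(f"corr(A, B) without C   = {r_res:.2f}")
```

The correlation alone cannot tell this case apart from A → B or B → A, which is why it is at best initial evidence for a causal link.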
Causal mechanisms
Mechanisms connecting remote causes and their effects.
E.g.:
Damaged artery in heart → clotting
Clotting → blocked artery
Blocked artery → heart attack
Aspirin inhibits clotting → lower risk of heart attack
Underlying processes
Photoelectric effect
Photoelectric effect
Einstein: 1921 Nobel Prize in Physics
Laws
General regularities in nature Universal:
F = ma Non-universal:
Statistical laws
Functions
What is the purpose of the phenomenon?
FOR SALE A prime lot of serfs or SLAVES GYPSY (TZIGANY) Through an auction at noon at the St. Elias Monastery on 8 May 1852 consisting of 18 Men 10 Boys, 7 Women & 3 Girls in fine condition
Functions
William Harvey (1578 – 1657): Heart pumps blood through
circulatory system No modern instruments! Experiments with a number
of animals: Various fish, Snail, Pigeon, etc.
Multiple methods together
Function → causal mechanism → underlying processes
National Ignition Facility (Dennis O’Brien @ UNH): Ignition with lasers → Laser, target chamber → Physics of nuclear fusion
Multiple methods together
Law → underlying processes
Isaac Newton (1643 – 1727), second law of motion: F = ma → Graviton?
Ockham’s razor
Crop circles: pranksters or aliens?
Ockham’s razor
William of Ockham (c. 1288 – c. 1348)
http://en.wikipedia.org/wiki/File:William_of_Ockham.png
Do I have a hypothesis?
Yes. Do you realize you do?
How to think about user studies?
Formulate hypotheses
Three common types
Current behavior Proof of concept Experience with prototype
Current behavior
Insights and inspiration: State problem(s), goal(s) Propose hypotheses
Relatively long
Current behavior – example 1
AJ Brush and Kori Inkpen, “Yours, mine and ours?...” (pdf) (2005 movie inspiring title)
Home technology: users share, etc.
Current behavior – example 2
Shwetak Patel et al., “Farther Than You May Think…” (pdf)
Hypothesis: Mobile phone is a proxy for user location.
Three common types
Current behavior Proof of concept Experience with prototype
Proof of concept
Technological advance: Produce results: prototype Explain results: prototype
Relatively short
Proof of concept – example 1
J. Sherwani et al., “Speech vs. Touch-tone: Telephone Interfaces for Information Access by Low Literate Users” (pdf) (video)
Hypothesis: Speech is a better telephony interface than touch-tone for low literate users.
Proof of concept – example 2
John Krumm and Eric Horvitz, “Predestination:…” (pdf)
Hypothesis: Destinations can be predicted from partial trajectories.
Train/test algorithm on GPS tracks from 169 people
Used pre-existing data: Krumm and Horvitz, “The Microsoft Multiperson Location Survey”
Collecting original data a significant contribution
Leverage!
Three common types
Current behavior Proof of concept Experience with prototype
Experience with prototype
Users’ interaction with technology: Produce results: prototype Explain results: prototype
Relatively long
Prototype an example!
Others don’t care about: Raw usage information Usability problems Intricate implementation details Etc.
Generalize! Scientific and good technical work
Experience – example 1
C. Neustaedter, et al., “A Digital Family Calendar in the Home:…” (pdf) (video)
Hypothesis: At-a-glance awareness, remote access are significant benefits.
4 households, 4 weeks each (Best Student Paper, Graphics Interface 2007)
Experience – example 2
Rafael Ballagas et al., “Gaming Tourism:…” (pdf) (video)
Hypothesis: Learning through a game.
18 participants: 2 alone + 8 pairs (8 x 2 = 16)
Study design
Who is the consumer?
Manager(s): industry, academic lab
Professor(s): e.g. thesis committee
Researchers: e.g. advisor’s collaborators
Reviewers: for paper, proposal, thesis
Funding agency: report on progress, proposal for funding
Public: friends, family, alumni, potential students, donors, potential employers
Study design
How can I explain this to a layperson? What is key? What can be omitted?
How will I write this up? Paper Thesis Report Blog post
Start writing paper/thesis/report/blog post at the beginning of the study.
Study design
Test hypothesis/hypotheses
Testing hypotheses via user studies
User studies:
Laboratory studies
Good: Control, easier to evaluate results
Bad: Constraints
Field studies
Good: Fewer constraints
Bad: Less control, more difficult to evaluate results
Criteria
Falsifiability: Prediction fails = explanation isn’t correct Account for other factors!
Note: Criterion - singular Criteria - plural
Criteria
Verifiability: Prediction successful = explanation is correct Account for other factors!
The meat of it…
Battleship Potemkin, 1925 film
Rotten meat scene
Why larvae in meat?
Francesco Redi (1626-1697)
Generation of insects, 1668
Causal explanation: fly droppings
Redi’s research
Hypothesis: Worms derived from fly droppings
Testing hypothesis:
Two sets of flasks with meat: sealed and open
Prediction: worms only in open flask
Falsifiability criterion
Can anything cause a failed prediction even if explanation is correct?
Did the apparatus operate properly? Tight seal? Meat not initially spoiled? Other?
Verifiability criterion
Can anything result in successful prediction even if explanation is wrong?
What if “active principle” in the air is responsible for spontaneous generation?
Modify experiment: Replace seal with veil:
Flies cannot reach meat Air in contact with meat
Modification helps meet verifiability criterion
Verifiability criterion
Experimental vs. control group:
Only difference in level of one independent variable
Redi’s experiment:
Control: Open flasks
Experimental: Veil-covered flasks
Control: laboratory experiment
Meat in veil-covered flasks?
Creating control/experimental groups often impossible without careful design/control
Study design
Test hypothesis/hypotheses Formulate in terms of:
Independent variables (multiple conditions) Dependent variables
Design: Within-subjects Between-subjects Mixed design
Within-subjects design: example
Police radio UI: hardware Speech
Blog post, video
Within-subjects design: example
Results in graphical form:
Within-subjects design: example
Results in graphical form:
Example: between-subjects design
Classical example: testing a drug
Mixed design: example 1
SUI characteristics study
Secondary task: speech control of radio
2 x 2 x 2 design:
SR accuracy: high/low
PTT button: yes/no – ambient recognition
Dialog repair strategy: mis-/non-understanding
Mixed design: example 2
Motivation: PTT vs. driving performance
Secondary task: speech control of radio
2 x 3 x 3 design:
SR accuracy: high/low
PTT activation: push-hold-release/push-release/no push
PTT button: ambient/fixed/glove

SR accuracy | Push-hold-release     | Push-release          | No-push
            | Ambient  Fixed  Glove | Ambient  Fixed  Glove | Ambient  Fixed  Glove
High        |                       |                       |
Low         |                       |                       |
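As a quick check on how many conditions a 2 x 3 x 3 design implies, the cells can be enumerated. This is a minimal sketch. The level names follow the slide, and the variable names are assumptions. It does not say which factors were varied within or between subjects.

```python
from itertools import product

# Factor levels as listed on the slide.
sr_accuracy = ["high", "low"]
ptt_activation = ["push-hold-release", "push-release", "no-push"]
ptt_button = ["ambient", "fixed", "glove"]

# Every combination of levels is one cell of the 2 x 3 x 3 design.
cells = list(product(sr_accuracy, ptt_activation, ptt_button))

for accuracy, activation, button in cells:
    print(f"SR accuracy={accuracy:<4}  activation={activation:<17}  button={button}")

print(len(cells), "cells in total")
```

Each added factor multiplies the number of cells, which feeds directly into how many participants and sessions the study needs.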
Control condition
Baseline: e.g. no technology vs. later introduced technology
Considerations
What will subjects do?
Normal behavior – may take long
Scenarios
Augment existing or brand new?
Augment: taking advantage of familiarity
New: more control (fewer inherited constraints)
Simulate or implement?
E.g. Wizard of Oz (WoZ)
Data to collect
Qualitative Insight into what participants did. How do participants compare? Did they do
what they thought they did? Use quantitative data.
Quantitative How did people behave? But why? Use qualitative data.
Data to collect
At least three types of data: Demographic Usage Reactions
Data to collect
Run pilot experiments!
Collecting data
Logging Surveys Experience sampling Diaries Interviews Unstructured observation – ethnography
Logging
Plan ahead, not after the fact! Testing hypotheses
Don’t leave important data out Don’t save data you don’t need
Leverage logging: Everything OK?
E.g. Mike Farrar’s MS research: files appearing on server indicates apps OK
Explicit communication with server: “I’m OK!”
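The “I’m OK!” idea can be sketched as a server-side check over the most recent heartbeat from each deployed device. This is an illustration only: the device IDs, timestamps and the function name `silent_devices` are hypothetical and are not taken from the study mentioned above.

```python
from datetime import datetime, timedelta

def silent_devices(last_seen, now, max_silence):
    """Return IDs of devices whose last "I'm OK!" message is older than max_silence."""
    return sorted(dev for dev, seen in last_seen.items() if now - seen > max_silence)

# Hypothetical log: device ID -> time of the most recent heartbeat.
now = datetime(2024, 1, 15, 12, 0)
last_seen = {
    "phone-01": now - timedelta(minutes=5),
    "phone-02": now - timedelta(hours=7),
    "phone-03": now - timedelta(minutes=50),
}

# Devices that have been quiet for more than an hour need attention.
print(silent_devices(last_seen, now, max_silence=timedelta(hours=1)))  # ['phone-02']
```

In a field deployment a check like this would run periodically, so that a broken installation is noticed within hours and not at the end of the study.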
Surveys
Open-ended Multiple-choice Likert-scale
Surveys
Questions should allow positive and negative feedback.
Text clear to others? Check! One question at a time!
“Fun and easy to use?” Length?
Don’t bore subjects to death. Standard questions (e.g. QUIS)?
Previously used questions?
Example: Mike Farrar’s study Hypotheses:
Initialize grammar (video): From previous tags From tags by users with similar interests
Voice commands convenient way to tag photos (video)
Keyboard users will use voice less Low task completion: give up on voice
Experience sampling (ESM)
Short questionnaire Timing:
Random Scheduled Event-based
Experience sampling (ESM)
How often? How many? Relate to quantitative data?
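Scheduled and random timing can be sketched as follows. The waking-hours window, the number of prompts and the minimum gap are assumptions chosen for illustration. Event-based prompting would instead be triggered from the logging code and is not shown.

```python
import random

def scheduled_prompts(start_min, end_min, every_min):
    """Scheduled timing: fixed-interval prompts, in minutes after midnight."""
    return list(range(start_min, end_min + 1, every_min))

def random_prompts(start_min, end_min, count, min_gap, rng):
    """Random timing: `count` prompts at least `min_gap` minutes apart."""
    while True:  # redraw until the gap constraint holds
        times = sorted(rng.sample(range(start_min, end_min + 1), count))
        if all(later - earlier >= min_gap for earlier, later in zip(times, times[1:])):
            return times

def as_clock(minutes):
    """Format minutes after midnight as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"

rng = random.Random(7)               # fixed seed so the schedule is repeatable
day_start, day_end = 9 * 60, 21 * 60  # assumed prompting window: 09:00 to 21:00

fixed = scheduled_prompts(day_start, day_end, every_min=180)
sampled = random_prompts(day_start, day_end, count=5, min_gap=45, rng=rng)

print("scheduled:", [as_clock(t) for t in fixed])
print("random:   ", [as_clock(t) for t in sampled])
```

Storing the prompt times alongside the logged usage data makes it possible to relate each questionnaire answer to what the participant was doing at that moment.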
Diaries
Similar to ESM
Interviews
Semi-structured: List of specific questions + follow-up
questions Bring data
E.g. Nancy A. Van House: “Flickr and Public Image Sharing:…”
Interviews + photo elicitation
Interviews
Neutral questions Negative feedback is OK (this is hard):
Don’t argue!
Participants
Follow IRB rules
Participants
Who to recruit? Representative of intended users Not your friends, family, colleagues – bias! May need different types
Recruit sufficient numbers of each type
Participant profile
Age E.g. age significant for driving
Gender Technology use and experience Other
Eye tracker studies: no glasses
Number of participants
Between-subjects usually requires more than within-subjects
Proof-of-concept: typically fewer and many types
Longer study: may be able to use fewer
Time commitment per participant is significant!
Recruit (Craigslist), organize, train, run, transfer data, process data
Participants will drop out – recruit extra
Counterbalancing may not work out
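A small sketch of why drop-outs can break counterbalancing. It assumes a within-subjects study with three conditions ordered by a simple cyclic Latin square and twelve hypothetical participants. The condition names are borrowed from the PTT example, and everything else is made up.

```python
from collections import Counter

def latin_square_orders(conditions):
    """Cyclic Latin square: row i is the condition list rotated by i positions."""
    k = len(conditions)
    return [[conditions[(i + j) % k] for j in range(k)] for i in range(k)]

def first_condition_counts(assignment):
    """How many participants start with each condition."""
    return Counter(order[0] for order in assignment.values())

conditions = ["push-hold-release", "push-release", "no-push"]
orders = latin_square_orders(conditions)

# Assign 12 hypothetical participants to the three orders round-robin.
assignment = {f"P{p + 1:02d}": orders[p % len(orders)] for p in range(12)}
full = first_condition_counts(assignment)

# Two participants who shared the same order drop out.
remaining = {p: o for p, o in assignment.items() if p not in {"P01", "P04"}}
after = first_condition_counts(remaining)

print("planned:       ", dict(full))
print("after dropout: ", dict(after))
```

With the planned sample every condition comes first equally often. After the two drop-outs one condition comes first only half as often as the others, so order effects are no longer balanced. Recruiting extra participants per order is one way to leave room for this.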
Compensation
Don’t try to save on this!
Driving simulator lab study cost example:
1 graduate student year at UNH ≈ $50k
Software maintenance fees per year ≈ $20k
Trip to conference ≈ $2k
PC or laptop ≈ $2k
$20 x 24 participants ≈ $0.5k (less than 1%)
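The “less than 1%” figure can be checked with a few lines of arithmetic. The sketch assumes the listed items make up the whole budget, which the slide does not state explicitly.

```python
# Cost items from the example, in US dollars.
costs_usd = {
    "graduate student (1 year at UNH)": 50_000,
    "software maintenance (1 year)": 20_000,
    "trip to conference": 2_000,
    "PC or laptop": 2_000,
    "participant compensation": 20 * 24,  # $20 x 24 participants
}

total = sum(costs_usd.values())
share = costs_usd["participant compensation"] / total

print(f"compensation: ${costs_usd['participant compensation']}")  # $480
print(f"total:        ${total}")                                   # $74480
print(f"share:        {share:.2%}")                                # 0.64%
```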
Compensation
Must not affect data
E.g. in image tagging study if we paid per picture:
More data
Unrealistic as interactions are for money not for value of prototype
Compensation
Leverage if you can: Latest driving simulator lab study in
collaboration with Microsoft Research: Use Microsoft software as compensation
Data analysis
Test hypotheses Use multiple data types Tell a story
Data analysis
Statistics: Descriptive Inferential
Descriptive statistics
Level of measurement: Nominal Ordinal Interval
Level of measurement
Nominal: Unordered categories
E.g. yes/no
Valid to report: Frequency
Level of measurement
Ordinal: Rank order preference without numeric difference
E.g. responses on Likert scale
Five of the eight participants strongly agreed or agreed with the following statement: “I prefer to have a GPS screen for navigation.”
Valid to report: Frequency, Median
Some people report means but what is the mean of “strongly agree” and “strongly disagree”?
Level of measurement
Interval: Numerical differences significant
E.g. age, number of times an action occurred, etc.
Valid to report: Sum, Mean, Median, Standard deviation (outliers?)
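The three levels of measurement can be illustrated with Python’s standard library. All responses below are invented for the sketch. The ordinal data are constructed so that five of eight answers are “agree” or “strongly agree”, and the ages include one deliberate outlier.

```python
import statistics
from collections import Counter

# Nominal (hypothetical yes/no answers): report frequencies only.
nominal = ["yes", "no", "yes", "yes", "no", "yes", "yes", "no"]
freq = Counter(nominal)

# Ordinal (hypothetical Likert responses): frequencies and median are valid.
LIKERT = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
ordinal = ["agree", "strongly agree", "disagree", "agree",
           "neutral", "strongly agree", "agree", "disagree"]
ranks = [LIKERT.index(r) for r in ordinal]
# median_low always returns an actual response category, never a value in between.
median_response = LIKERT[statistics.median_low(ranks)]
agree_count = sum(r in ("agree", "strongly agree") for r in ordinal)

# Interval (hypothetical ages, one outlier): sum, mean, median, standard deviation.
ages = [22, 25, 31, 28, 24, 35, 29, 62]
age_mean = statistics.mean(ages)
age_median = statistics.median(ages)
age_sd = statistics.stdev(ages)  # sample standard deviation

print("nominal frequencies:", dict(freq))
print("ordinal median:", median_response, "| agree or strongly agree:", agree_count, "of", len(ordinal))
print(f"ages: mean={age_mean}, median={age_median}, sd={age_sd:.1f}")
```

The single 62-year-old pulls the mean (32) well above the median (28.5), which is the kind of outlier effect to check before reporting a mean and standard deviation.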
Outliers in interval data
Inferential statistics
Significance tests t-test ANOVA Many others
Which to use: depends on data
Significance test: example 1
To assess the effect of different navigation aids on visual attention, we performed a one-way ANOVA using PDT as the dependent variable. As expected, the time spent looking at the outside world was significantly higher when using spoken directions as compared to the standard PND directions, p<.01. Specifically, for spoken directions only, the average PDT was 96.9%, while it was 90.4% for the standard PND.
Significance test: example 2
[Figure: PDT on standard PND [%] (y-axis, -5 to 20) vs. distance from previous intersection [m] (x-axis bins: 60-80, 80-100, 100-120, 120-140, 140-160)]
… PDT on the PND screen changes with the distance from the previous intersection… significant main effect, p<.01…
Significance test: example 3
Randomization test Kun et al. (pdf) Idea from Veit et al. (pdf)
Significance test: example 3
[Figure: Rstw [degrees^2] (y-axis, 0 to 35) vs. lag [seconds] (x-axis, 0 to 8); legend: standard, p = 0.05, spoken only]
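For readers unfamiliar with randomization tests, here is a generic two-group sketch. It is not the procedure used in the cited papers. It simply shuffles group labels and asks how often a difference in means at least as large as the observed one arises by chance. The PDT-like values are invented.

```python
import random
import statistics

def randomization_test(x, y, n_shuffles=5000, seed=0):
    """Two-sided randomization test on the difference of group means.

    Returns the estimated p-value: the share of label shuffles whose absolute
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(statistics.fmean(pooled[:len(x)]) - statistics.fmean(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    # Count the observed arrangement itself so the p-value is never exactly 0.
    return (hits + 1) / (n_shuffles + 1)

# Hypothetical per-participant values (percent), made up for this sketch.
spoken = [97.1, 96.5, 98.0, 95.9, 97.4, 96.2, 97.8, 96.6]
standard = [90.2, 91.5, 89.8, 92.0, 88.9, 90.7, 91.1, 89.5]

p_value = randomization_test(spoken, standard)
print(f"p = {p_value:.4f}")
```

The test makes no assumption about the shape of the distribution, which is one reason to consider it when the usual t-test or ANOVA assumptions are in doubt. Note that this version treats the two groups as independent. A within-subjects comparison would shuffle the condition labels within each participant.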