web log, text, and other data mining

Post on 14-Jun-2015

Web Log, Text, and Other Data Mining

Wayne Kao

What is Data Mining?

• “Automated extraction of hidden predictive information from large databases” -Kurt Thearling

• “Quickly and thoroughly explore mountains of data, isolating the valuable, usable information -- the business intelligence” -SPSS site

Possible Questions (Chi)

• Usage

– How has info been accessed? How frequently? What’s popular?

– How do people enter the site? Where do people spend time? How long do they spend there?

– How do people travel within a site? What are the [un]popular paths?

– Who are the people accessing the site? From what geographical location? From what domains?

Possible Questions (cont)

• Structural

– What information has been added? Modified? Remained the same but moved?

• Usage + Structural

– How is new info accessed? When does it become popular?

– How does introducing new information change navigation patterns? Can people still navigate to the desired info?

– Do people look for deleted information?

Usability Testing

Common usability testing techniques:

• Interviews

• Ethnographic and/or lab-style observations

• Surveys

• Focus groups

Good qualitative data

Problems with these techniques:

• Time and effort are costly

• Small sample sizes – quantitative results? (Spool)

How can we get usability testing more involved in the design cycles, so we can find problems and potential problems earlier?

Design cycle (figure): Design → Prototype → Evaluate

Remote Usability (Waterson)

• Analyze clickstreams in the context of the task and user intentions

• Human observers not present

• Want methods that are

– Easy to deploy on any website

– Compatible with a range of OSes and browsers

• Mobile computing adds further usability challenges

– Small screen sizes

– Limited and/or new interaction techniques

– Devices are used in environments beyond the desktop

Apache Web Log

205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET /~sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"

216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET /~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)"

202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/animate.js HTTP/1.1" 200 14261 "http://www.csua.berkeley.edu/~tahir/indextop.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
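The entries above follow Apache's combined log format (client, identity, user, timestamp, request line, status, size, referrer, user agent). A minimal sketch of pulling those fields apart is shown here; the field names and the `parse_line` helper are illustrative choices, not part of the slides, and the month parsing assumes an English locale.

```python
import re
from datetime import datetime

# One named group per field of a combined-format log entry.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # Apache writes "-" when no body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

sample = (
    '205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] '
    '"GET /~sophal/whole5.gif HTTP/1.0" 200 9609 '
    '"http://www.csua.berkeley.edu/~sophal/whole.html" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"'
)
record = parse_line(sample)
```

A referrer of "-" (as in the Slurp crawler entry) marks a request that did not arrive by following a link, which is one way to spot entry points.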

Analog - One traditional tool

• Reports number of requests, info about client machines, entry/exit points, charts (Chi et al.)

• Generated on a daily basis

• Typical stats

• Prettier stats

Readings

• “Visualizing the Evolution of Web Ecologies” Chi et al., Xerox PARC, 1998

• “Visualizing Association Rules for Text Mining” Wong, Whitney, & Thomas, Pacific Northwest, 1999

• “VISVIP: 3D Visualization of Paths through Web Sites” Cugini & Scholtz, National Institute of Standards and Technology, 1999

• “Case Study: E-Commerce Clickstream Visualization” Brainerd & Becker, Blue Martini Software, 2001

• “What Did They Do? Understanding Clickstreams with the WebQuilt Visualization System” Waterson et al., UC Berkeley, 2002

Evolution of Web Ecologies

• Rather than hits, focus intermediate representation on (C)ontent, (U)sage, and (T)opology, sorted by URL.

– URL1:

• {day1: <link> <link> …}

• {day2: <link> <link> …}

– URL2:

• {day1: <link> <link> …}

• Visualize an entire web site in a small amount of space

• Show temporal changes
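The per-URL, per-day outline above can be sketched as a nested mapping. This is only an illustration of the idea: the `history` layout, the helper names, and the sample pages are assumptions, not the structure used by Chi et al. The `lifecycle_stage` helper shows how comparing two days yields the new / continued / deleted stages that the disk tree slides color-code.

```python
from collections import defaultdict

# history[url][day] -> set of links found on that page on that day
# (hypothetical layout mirroring "URL1: {day1: <link> <link> ...}").
history = defaultdict(dict)

def record(url, day, links):
    """Store the outgoing links observed for one page on one day."""
    history[url][day] = set(links)

def lifecycle_stage(url, previous_day, day):
    """Classify a page by comparing its presence on two days."""
    days = history.get(url, {})
    was_there = previous_day in days
    is_there = day in days
    if is_there and not was_there:
        return "new"
    if is_there and was_there:
        return "continued"
    if was_there:
        return "deleted"
    return "absent"

# Invented sample data: /b.html disappears and /c.html appears on day 2.
record("/index.html", 1, ["/a.html", "/b.html"])
record("/index.html", 2, ["/a.html", "/c.html"])
record("/a.html", 1, [])
record("/a.html", 2, [])
record("/b.html", 1, [])
record("/c.html", 2, [])
```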

Disk Tree Visualization

• Breadth first traversal

• Each ring represents a tree level

• All leaf nodes guaranteed some angular space (360 / # leaves)

Visual encodings (figure labels):

• Tree links: line mark in X and Y

• Page access frequency: line size/brightness

• Lifecycle stage: color (new, continued, deleted)
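The layout rule on this slide (one ring per tree level, 360 / #leaves degrees per leaf) can be sketched as below. It is a sketch under assumptions: the tree is given as a dict of child lists, an internal node is centered over the angular span of its leaves, and the function and parameter names are made up for illustration.

```python
from collections import deque

def disk_tree_layout(tree, root, ring_spacing=1.0):
    """Return node -> (radius, angle_in_degrees) for a disk tree.

    tree maps each node to a list of its children; leaves may be omitted.
    """
    def leaf_count(node):
        kids = tree.get(node, [])
        return 1 if not kids else sum(leaf_count(k) for k in kids)

    per_leaf = 360.0 / leaf_count(root)  # angular space guaranteed to each leaf
    layout = {}
    queue = deque([(root, 0, 0.0)])      # (node, tree level, start angle)
    while queue:                         # breadth-first traversal
        node, depth, start = queue.popleft()
        span = per_leaf * leaf_count(node)
        layout[node] = (depth * ring_spacing, start + span / 2.0)
        cursor = start
        for kid in tree.get(node, []):
            queue.append((kid, depth + 1, cursor))
            cursor += per_leaf * leaf_count(kid)
    return layout

site = {"/": ["a", "b"], "a": ["a1", "a2"]}
layout = disk_tree_layout(site, "/")
```

With three leaves each leaf gets 120 degrees, so `a` (two leaves) spans 0 to 240 and `b` spans 240 to 360.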

Disk Tree Visualization (cont)

• Pros

– No occlusion problems since it’s a 2D plane

– Can use the 3rd dimension for other info (e.g. time)

– Aesthetically pleasing to the eye (?)

• Cons

– Difficult to see any page-level detail

– Confusing color choices

Time Tube Visualization

• Put Disk Trees along spatial axis

• Rotated so that each slice gets equal screen area

• Focus+context

• Animation: Can fly through tube, mapping time onto time

Interaction Model

• Can rotate slices with a button click

• Can focus a slice by clicking on it

• Flicking gestures move slices around

• Right-clicking zooms to an area

• Mouseovers display more information about a node in a side window

• Can bring up pages in the browser

• Animation of slices

Real-world Analyses

• Deadwood: Shows pages becoming [un]popular

• Shows effects of a redesign

Real-world Analyses (cont)

• Added items are being used

• Deleted items aren’t negatively impacting the rest of the site

Comments

• Gives only a broad view of the data with no real way to get at the specifics

• Interaction seems very advanced

• Not sure how intuitive the whole idea of a circular tree is – seems kind of gratuitous

Association Rule?

• Quantitative rule that describes associations between sets of items

– Not qualitative because no domain knowledge necessary for text mining

• Implication X → Y where

– X: set of antecedent items

– Y: consequent item

• Example: 80% of people who buy diapers and baby powder also buy baby oil.

Association Rule? (cont)

• Support/prevalence/joint probability

– Percentage of articles in the total set that satisfy the union of the items in the antecedent and the consequent item

• Confidence/predictability/conditional probability

– Of the articles that satisfy the antecedent, the percentage that also satisfy the consequent item
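To make the two measures concrete, here is a small sketch using the standard definitions: support is the fraction of records containing every item in the rule, and confidence is support of the whole rule divided by support of the antecedent. The basket data is invented so that the result lines up with the "80%" diapers example; the function names are illustrative.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for t in transactions if wanted <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(consequent | antecedent): rule support over antecedent support."""
    rule = support(transactions, set(antecedent) | {consequent})
    base = support(transactions, antecedent)
    return rule / base if base else 0.0

# Invented toy data: 5 baskets hold diapers + baby powder, 4 of those add baby oil.
baskets = [
    {"diapers", "baby powder", "baby oil"},
    {"diapers", "baby powder", "baby oil"},
    {"diapers", "baby powder", "baby oil"},
    {"diapers", "baby powder", "baby oil"},
    {"diapers", "baby powder"},
    {"diapers"},
    {"milk"},
    {"milk", "bread"},
    {"bread"},
    {"baby oil"},
]

rule_support = support(baskets, {"diapers", "baby powder", "baby oil"})
rule_confidence = confidence(baskets, {"diapers", "baby powder"}, "baby oil")
```

Here the rule holds in 4 of 10 baskets (support 0.4), and in 4 of the 5 baskets that contain the antecedent (confidence 0.8).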

Association Rule Visualization

• Must visualize

– Antecedent items & consequent items

– Associations between antecedent and consequent

– Rules' support

– Confidence

• Traditional ways of visualizing it

– 2D matrix

– Directed graph

2D Matrix (figure 1)

• Antecedent and consequent items on axes

• Metadata icons in the cells that connect the antecedent to consequent contain support and confidence values

Association rule: B → C

2D Matrix (cont)

• Pros: one-to-one binary relationships

• Cons:

– Hard to see association rules in many-to-one relationships (A+B → C or A → C and B → C)

– Grouping antecedents adds complexity

– Object occlusion

Directed graph

• nodes = items

• edges = associations

• Cons:

– Dozen or more items → tangled display

– Selecting edges to display multiple rules requires significant human interaction

Confusing?

“Novel” Technique

• Matrix: rule-to-item

– rows = topics

– columns = item associations

– blue/red = antecedent and consequent

• Bar graph = confidence/support

• Can use queries to filter

• Mouse zooming to support context/focus
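The rule-to-item idea can be sketched as a plain table: one row per item (topic), one column per rule, with each cell marking whether the item is an antecedent or the consequent of that rule. The sample rules and the "A" / "C" / "." markers are assumptions standing in for the blue/red coloring.

```python
def rule_item_matrix(rules):
    """Build rows = items, columns = rules.

    rules is a list of (antecedent_set, consequent_item) pairs.
    Cells hold "A" (antecedent), "C" (consequent), or "." (not in the rule).
    """
    items = sorted({i for ante, cons in rules for i in ante | {cons}})
    matrix = []
    for item in items:
        row = []
        for ante, cons in rules:
            if item in ante:
                row.append("A")
            elif item == cons:
                row.append("C")
            else:
                row.append(".")
        matrix.append(row)
    return items, matrix

# Invented rules for illustration.
rules = [
    ({"diapers", "baby powder"}, "baby oil"),
    ({"milk"}, "bread"),
]
items, matrix = rule_item_matrix(rules)
```

Because every rule gets its own column, a rule with several antecedents needs no grouping: it simply has several "A" cells in one column.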

“Novel” Technique Advantages

• Handles hundreds of multiple antecedent association rules

• View topics and associations simultaneously

• Individual items clearly shown

• No antecedent groups

• Few occlusions because metadata is plotted at the far end and bar graph is scaled

• No screen swapping, animation, or serious interaction required

“Novel” Technique Demo

• Demo shows scalability

• ~9 MB news article corpus of 100,000+ documents

• Use word and concept-based text engines

• Words evaluated on whether they’re interesting depending on their position in documents

• Suffixes removed and common prepositions, pronouns, adjectives, and gerunds ignored

• Build a table of antecedents, consequents, confidences, and supports -> feed into viz

Conclusions

• Rule-to-item association

• Very clear visualization if limited to a few dozen rules

• Most web log visualizations jump to using a graph; this paper forces you to think twice.

VISVIP

• Captures individual movement between pages rather than aggregates

• Shows paths - sequence of URLs

Topology

• Directed graph

• Force-directed algorithm

– Spring-like force

– Nodes repel each other with force inversely proportional to the distance between them (i.e. closer nodes means closer pages)

– Final force pulls nodes toward center
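A rough sketch of the three forces listed above follows. It is not VISVIP's implementation: the constants, the step clamp that keeps the iteration stable, and the assumption that the spring acts along each edge are all illustrative choices.

```python
import math
import random

def force_layout(nodes, edges, steps=200, spring=0.05, repulse=0.05,
                 gravity=0.02, max_step=0.1, seed=0):
    """Return node -> (x, y). nodes is a list, edges a list of (a, b) pairs."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    for _ in range(steps):
        force = {n: [0.0, 0.0] for n in nodes}
        # Repulsion between every pair, magnitude inversely proportional to distance.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = math.hypot(dx, dy) or 1e-9
                push = repulse / dist
                force[a][0] += push * dx / dist
                force[a][1] += push * dy / dist
                force[b][0] -= push * dx / dist
                force[b][1] -= push * dy / dist
        # Spring-like attraction along each edge.
        for a, b in edges:
            dx = pos[b][0] - pos[a][0]
            dy = pos[b][1] - pos[a][1]
            force[a][0] += spring * dx
            force[a][1] += spring * dy
            force[b][0] -= spring * dx
            force[b][1] -= spring * dy
        # Final force pulls every node toward the center (0, 0).
        for n in nodes:
            force[n][0] -= gravity * pos[n][0]
            force[n][1] -= gravity * pos[n][1]
        # Move each node a bounded step in the direction of its net force.
        for n in nodes:
            for axis in (0, 1):
                pos[n][axis] += max(-max_step, min(max_step, force[n][axis]))
    return {n: (p[0], p[1]) for n, p in pos.items()}

pages = ["home", "a", "b", "c"]
links = [("home", "a"), ("home", "b"), ("a", "c")]
layout = force_layout(pages, links)
```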

Content

• URLs abbreviated

– http://sims.berkeley.edu/~bob/pics/large/abd.gif → ge/abd

• Color-coded by content type

• Mouseover reveals all the abbreviated information

Simplification

• Common problems

– Noise nodes not significant to paths - image and mailto nodes

– Over-connectivity - link back to home page or company logo

• Solutions

– Delete all edges connected to a node

– Make one node the graph root

– Focus on a subset of the graph
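The first solution (deleting all edges connected to a node) can be sketched as a filter over an edge list. The noise test below, based on `mailto:` and a few image extensions, and the `overconnected` parameter are assumptions for illustration; the slides do not say how VISVIP identifies such nodes.

```python
IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")

def is_noise(url):
    """Treat image and mailto targets as noise nodes (assumed heuristic)."""
    lowered = url.lower()
    return lowered.startswith("mailto:") or lowered.endswith(IMAGE_SUFFIXES)

def simplify(edges, overconnected=()):
    """Drop every edge that touches a noise node or a listed over-connected node."""
    drop = set(overconnected)
    return [
        (src, dst)
        for src, dst in edges
        if not (is_noise(src) or is_noise(dst) or src in drop or dst in drop)
    ]

# Invented sample edges.
edges = [
    ("/index.html", "/products.html"),
    ("/products.html", "/logo.gif"),
    ("/products.html", "mailto:sales@example.com"),
    ("/products.html", "/home.html"),
    ("/products.html", "/detail.html"),
]
kept = simplify(edges, overconnected=["/home.html"])
```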

Path Sequence

• Showing subject paths as straight lines didn't work

– Hard to follow single jagged path

– Multiple paths overlapped

• Spline representation

– Each path is a smooth curve overlaid on the graph

– Colors for groups of subjects (e.g. novices)

Path Sequence (cont)

• User path-oriented layouts

– Simpler structure than when path is laid over a graph of the entire site

Path Timing

• Vertical bar with base on node, its height proportional to time spent on page

• Animation runs through pages at 10-30 times real-time

• Select a node to get detailed stats
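One way to get the bar heights is to take the gap between consecutive page requests as the time spent on a page and scale it against the longest gap. This is a sketch under that assumption; the slides do not say how VISVIP measures time, and the last page in a path has no following request, so it gets no bar here.

```python
def dwell_times(visits):
    """visits: ordered list of (timestamp_seconds, url).

    Returns (url, seconds_on_page) for every page except the last one.
    """
    return [(url, nxt - t) for (t, url), (nxt, _) in zip(visits, visits[1:])]

def bar_heights(times, max_height=100.0):
    """Scale dwell times so the longest stay gets a bar of max_height."""
    longest = max((seconds for _, seconds in times), default=0)
    if longest == 0:
        return [(url, 0.0) for url, _ in times]
    return [(url, max_height * seconds / longest) for url, seconds in times]

# Invented sample path.
visits = [(0, "/home"), (12, "/search"), (42, "/results"), (48, "/detail")]
times = dwell_times(visits)
heights = bar_heights(times)
```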

Comments

• Capturing individual movements pretty innovative

• Curved user paths and reorienting the layout based on user paths

• Overall graph viz not too clear

• Good tips for creating a web log mining viz

Clickstream Visualizer

• Aggregate nodes using an icon (e.g. all the checkout pages)

• Edges represent transitions

– Wider means more transitions
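Aggregated nodes and weighted edges can be sketched by mapping each page to a group and counting transitions between groups. Grouping by the first path segment is an assumption made here for illustration, and the sessions are invented; the counts would drive the edge widths.

```python
from collections import Counter

def group_of(url):
    """Map a page to its group: the first path segment, or "home" for "/"."""
    first = url.strip("/").split("/")[0]
    return first if first else "home"

def transition_counts(sessions):
    """Count group-to-group transitions across all sessions.

    sessions is a list of page paths, each an ordered list of URLs.
    Moves within the same group are ignored.
    """
    counts = Counter()
    for pages in sessions:
        groups = [group_of(p) for p in pages]
        for src, dst in zip(groups, groups[1:]):
            if src != dst:
                counts[(src, dst)] += 1
    return counts

sessions = [
    ["/", "/products/1", "/checkout/cart", "/checkout/pay"],
    ["/", "/products/2", "/products/3", "/checkout/cart"],
    ["/", "/products/1"],
]
counts = transition_counts(sessions)
```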

Customer Segments

• Collect

– Clickstream

– Purchase history

– Demographic data

• Associates customer data with their clickstream (scary...)

• Different color for each customer segment

Filtering

• Using the mouse or table control, can filter by

– Edge weight

– Node selection

• Example: select checkout nodes and see if users are exiting from nodes

Layout

Using third party Tom Sawyer package

1. Hierarchical from higher-out degree to higher-in degree

– Mirrors actual flow of site users

– The default

2. Circular

– Puts related nodes into circles

– Shows relationships between groups of pages

Layout (cont)

• Aggregation based on file system path (good idea?)

Initial Findings

• Gender shopping differences (intriguing...)

Initial Findings (cont)

• Checkout process analysis

• Newsletter hurting sales

Comments

• Visualizing clickstreams with demographic data

• Grouping pages by type

• Best use of color

• Icons an interesting way of reducing complexity

System Design

• Log data with proxy

• Infer actions

• Aggregate data

• Layout graph

• Display interactive visualization

Capturing Interaction

• Typical HTTP request…

Figure: Client Browser ↔ Web Server

Capturing Interaction (cont)

• WebQuilt captures interaction with a proxy

– Proxies have typically been used for caching and firewalls

Figure: Client Browser ↔ Proxy (writes WebQuilt Log) ↔ Web Server

Capturing Interaction (cont)

• If a page says:

<A HREF="coolpage.html">

• Change it to:

<A HREF="http://webquiltproxy.cs.berkeley.edu/webquilt?replace=http://www.spiffypages.com/coolpage.html&tid=1&linkid=13">
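The rewrite shown above can be sketched as resolving the link against the page it came from and wrapping it in the proxy URL. The query layout copies the slide's example; the `rewrite_link` helper and the `index.html` page URL are assumptions, and a production proxy would presumably percent-encode the target rather than embed it raw.

```python
from urllib.parse import urljoin

# Proxy endpoint as it appears on the slide.
PROXY = "http://webquiltproxy.cs.berkeley.edu/webquilt"

def rewrite_link(href, page_url, tid, linkid):
    """Point a link at the proxy, carrying the real target and the IDs along."""
    absolute = urljoin(page_url, href)  # resolve relative links like "coolpage.html"
    return f"{PROXY}?replace={absolute}&tid={tid}&linkid={linkid}"

rewritten = rewrite_link(
    "coolpage.html",
    "http://www.spiffypages.com/index.html",  # hypothetical page containing the link
    tid=1,
    linkid=13,
)
```

Since every link on a served page leads back through the proxy, each click is logged without touching the client or the origin server.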

Capturing Interaction (cont)

• Pros:

– Don’t need access to servers

– Can analyze sites without permission from the server

– Can gather clickstreams from a variety of devices including PDAs, phones, desktop computers

• Cons:

– No direct access to the client

Visualization

Interactive, zoomable directed graph

• Nodes = web pages

• Edges = aggregate traffic between pages

Java-based SATIN toolkit for gesturing & zooming interaction

Image rendering of web pages:

• JacoZoom Java callable wrappers around an ActiveX component

• MSIE window

Directed graph

• Nodes: visited pages

– Color marks entry and exit nodes

• Arrows: traversed links

– Thicker: more heavily traversed

– Color

• Red/yellow: Time spent before clicking

• Blue: optimal path chosen by designer

Controls

• Slider: Zoom in and out

• Checkboxes: Filter paths to display

Pages

• Zooming in shows page thumbnails

• Arrows

– Originate from actual links or the Back button

– Translucent & don’t cover details

Layout

Layout system flexible…

1. Edge-weighted depth-first traversal

– Most visited path along top

– Recursively place less followed paths below

2. Grid positioning

– Organizes distance between nodes

– Avoid overlapping nodes
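The two layout steps above can be sketched together: a depth-first walk that always follows the heaviest edge first keeps the most visited path on the top row, and each less followed branch starts a fresh row below. This is a sketch under assumptions (integer grid cells, a dict of edge weights, made-up names), not WebQuilt's actual code.

```python
def layout(edges, start):
    """Return node -> (column, row) on a grid.

    edges maps (src, dst) to the number of times that link was traversed.
    """
    children = {}
    for (src, dst), weight in edges.items():
        children.setdefault(src, []).append((weight, dst))

    positions = {}
    next_row = 0  # last row handed out so far

    def visit(node, column, row):
        nonlocal next_row
        positions[node] = (column, row)
        ranked = sorted(children.get(node, []), key=lambda c: -c[0])
        first = True
        for _, child in ranked:
            if child in positions:
                continue  # already placed via a heavier path
            if first:
                visit(child, column + 1, row)  # heaviest branch stays on this row
                first = False
            else:
                next_row += 1                  # lighter branches drop to a new row
                visit(child, column + 1, next_row)

    visit(start, 0, 0)
    return positions

# Invented traffic counts.
traffic = {
    ("home", "search"): 8,
    ("home", "about"): 2,
    ("search", "results"): 7,
    ("search", "help"): 1,
    ("results", "detail"): 5,
}
positions = layout(traffic, "home")
```

Every new branch gets a row of its own, so no two nodes share a grid cell.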

Interaction

• Selecting nodes

• Zooming in and out

• Navigational gestures

Inferring & Aggregating

• Take log files and infer actions, such as when the back button is pressed

– Can infer back button pressed, but not combinations of back and forward

– Extensible framework to add other inferred actions

• Aggregate information, preserving individual paths
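One plausible way to infer back-button presses from a proxy log is to notice when a click originates from a page that is not the most recently loaded one but does appear earlier in the user's history. The sketch below assumes each log entry records the page a link was clicked on and the page it led to; that entry shape and the function name are illustrative, not WebQuilt's log format. As the slide notes, mixes of back and forward cannot be recovered this way.

```python
def infer_actions(log):
    """log: ordered list of (from_page, to_page) clicks for one user.

    Returns a list of ("click", page) and inferred ("back", page) actions,
    where a "back" action names the page the user returned to.
    """
    actions = []
    stack = []  # pages in the order the user reached them
    for from_page, to_page in log:
        if from_page not in stack:
            stack.append(from_page)  # entry point or otherwise unseen page
        # Unwind history until the page the click came from is on top.
        while stack[-1] != from_page:
            stack.pop()
            actions.append(("back", stack[-1]))
        actions.append(("click", to_page))
        stack.append(to_page)
    return actions

# Invented log: the user reaches "results", then clicks a link on "search".
log = [("home", "search"), ("search", "results"), ("search", "help")]
actions = infer_actions(log)
```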

Running a WebQuilt Remote Usability Test

• Recruit users

• Design and distribute tasks (via email)

• Auto-collect! Watch and wait as users perform tasks and proxy logs data

• Visualize, analyze

• Use the results to change design

Pilot Usability Study

• Edmunds.com PDA web site

• Handspring Visor equipped with an OmniSky wireless modem

• 10 users asked to find…

– Anti-lock brake information on the latest Nissan Sentra model

– The Nissan dealer closest to them.

In the Lab vs. Out in the Wild

Comparing in-lab usability testing with WebQuilt remote usability testing

• 5 users were tested in the lab

• 5 were given the device and asked to perform the task at their convenience

• All task directions, demographic data, and follow-up questionnaire data were presented and collected in web forms as part of the WebQuilt testing framework.

Classifying Usability Issues

Lab: Tester observations, participant comments and questionnaire data

Remote: WebQuilt visualization and questionnaire data

Four categories of issues

• Browser

• Device

• Test design

• Site design

Six severity levels

• 0 indicates comment

• 1-5 where 1 is a very minor issue and 5 is a critical issue

Issues found, by category (severity in parentheses)

Browser

• Interact before load (3)

• No forward button (2)

Device

• Difficulty with input in questionnaire (3)

• Difficulty scrolling (2)

• Device errors unrelated to testing (1)

• Tried writing on screen (0)

Site Design

• Falsely completed task (4)

• Long download times (4)

• Ping-pong behavior (3)

• Interact before load (3)

• Too much scrolling (2)

• Save address functionality not clear (1)

• Back button navigation (0)

• Would like more features (0)

• Finds site useful (0)

Test Design

• Falsely completed task (4)

• Difficulty remembering task description (3)

• Difficulty with input in questionnaire (3)

• Questionnaire wording problems (3)

• Forgot how to end task (1)

• Confusing task description (1)

Findings

• WebQuilt methodology is promising for uncovering site design related issues.

• 1/3 of the issues were device or browser related.

• Browser and device issues cannot be captured automatically with WebQuilt unless they cause an interaction with the server

– They can be revealed via the questionnaire data.

Testing Concerns

• What to do when problems with running the test occur?

• Understanding user motivation is still ambiguous: Curiosity vs. confusion?

• Gathering qualitative feedback on mobile devices is difficult

– PDA input difficult

– Phones have potential for audio

Comments

• Zooming/filtering great for showing overview and page-level details

– Can put screenshots directly into the viz

• Layout in relation to intended path

• Study compares remote usability tests to traditional tests - promising

• Proxy logging very cool

Future Work

• Expanded mobile device interaction capture, specifically net-enabled cell phones

• Improve filtering capabilities, integrating questionnaire and demographic data

• Clever algorithms to simplify graph layout

• Improved quantitative reporting

• Improved controls/interaction

• More rigorous evaluation with designers and usability experts

Concluding Comments

• Many incremental improvements in web log/data mining viz (using a graph, using demographic data, etc.)

• Would be really good to see a study of usability engineers and web developers comparing the tools themselves
