
Multimedia search: From Lab to Web

prof. dr. L. Schomaker

KI, RuG

Invited lecture, presented at the 4e Colloque International sur le Document Electronique, 24-26 October 2001.

Schomaker, L.R.B. (2001). Image Search and Annotation: From Lab to Web. Proceedings of CIDE 2001, pp. 373-375. ISBN 2-909285-17-0.

©2001 LRB Schomaker - KI/RuG


Overview

Methods in content-based image search

The user’s perspective: ergonomics, cognition and perception

Feeding the data-starved machine


Researchers

L. Schomaker, L. Vuurpijl, E. Deleau, E. Hoenkamp, A. Baris


A definition

In content-based image retrieval systems, the goal is to provide the user with a set of images, based on a query which consists, partly or completely, of pictorial information

Excluded: point & click navigation in pre-organized image bases


Image-based queries on WWW: existing methods and their problems

IBIR - image-based information retrieval

CBIR - content-based image retrieval

QBIC - query by image content

PBIR - pen-based image retrieval


Existing systems & prototypes

QBIC (IBM), VisualSEEk (Columbia), FourEyes (MIT Media Lab)

… and many more: WebSEEk, Excalibur, ImageRover, Chabot, Piction

Research: IMEDIA (INRIA), Viper/GIFT (Marchand-Maillet)


Query Methods

Query                   Matched with               Algorithm
----------------------  -------------------------  ------------------------------------------
Keywords                Manual text annotation     String search, Information Retrieval (IR)
Keywords                Textual context of image   String search, IR
Exemplar image          Complete image             Template matching, feature-vector matching
Rectangular sub-image   Complete image             Feature- and texture-based matching
Layout structure        Complete image             Texture and color matching
Object outline          Partial image              Outlines, edges
Object sketch           Partial image              Features, edges


Example 1. QBIC (IBM)

Features: colors, textures, edges, shape

Matching: layout, full-image templates, shape

The upper-left picture is the query: “boy in yellow raincoat”

…and yields very counter-intuitive results

What was the user’s intention?


Example 2. VisualSEEk

Features: colors, textures, edges, bitmap shape

Matching: layout, full-image templates

Layout- and feature-based query construction

Requires detailed user knowledge of pattern-recognition issues!



Example 3. FourEyes (MIT Media Lab)

Imposed block segmentation

Textual annotation per block

Labels are propagated on the basis of texture matching



FourEyes…

Imposed block segmentation is unrelated to object placement

Object details are lost: the features are global and textural

Interesting: a role for the user


Problems

Full-image template matching yields bad retrieval results

Feature-based matching requires a lot of input and knowledge by the user

Layout-based search only suits a subset of image needs

Grid-based partitioning misses details and breaks up meaningful objects


Problems…

Reasons behind a retrieved image list are unclear (Picard, 1995)

Features and matching scheme are not easily explainable to the user

An intelligent system should learn from previous queries of the user(s)


A statement

In content-based image retrieval systems, just as in text-based Information Retrieval, the performance of current systems is limited due to their incomplete and weak modeling of the user's:

Needs

Goals

Perception

Cognition (semantics)


User-Interfacing aspects

Computer users are continuously evaluating the value of system responses as a function of the effort spent on input actions (cost / benefit evaluation)

Consequence: after formulating a query with a large number of keystrokes, slider adjustments and mouse clicks, the quality of an image hit list is expected to be very high…

Conversely, user expectations are low when the effort only consists of a single mouse click


Pragmatic aspects

A survey on the WWW revealed that users are interested in objects (71%), not in layout, texture or abstract features.

The preferred image type is the photograph (68%)


Cognitive & Perceptual aspects

Objects are best recognized from 'canonical views' (Blanz et al., 1999).

Photographers know and utilize this phenomenon by manipulating camera attitude or object placement


Photographs and paintings imply communication

[Diagram: the World is depicted by a photographer or painter for a human user/viewer, in contrast to the World being captured by a surveillance camera for computer vision.]


Problems of geometrical invariance are less extreme


Canonical Views

[Figures: objects shown in non-canonical vs. canonical orientation.]


More cognition: Basic-level object categories

In a hierarchy of object classes (ontology) a node of the type 'Basic Level' (Rosch et al.,1976) adds many structural features in its description, as compared to the level above, whereas the number of unique additional features is reduced when going down towards a more specific node.


Basic-level categories, example

“furniture” [virtually no geometrical features]

“chair” [many clearly-defined structural features]

“kitchen chair” [only a few additional features].


Basic-level object categories and mental imagery

A basic level is the highest level for which clear mental imagery exists in an object ontology

A basic-level object elicits almost the same feature description when it is named or shown visually

Basic-level object descriptions often contain reference to structural components (parts)

In verbally describing the contents of a picture, people will tend to use 'basic-level' words.

Rosch, E., Mervis, C.B., Gray, W.E., Johnson, E.M. & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.


Implication of the ‘basic level’ category

The basic level forms a natural bridge between textual and pictorial information

It is likely to determine both annotation and search behavior of the users

It is an ideal starting point for (ultimately) developing computer vision systems that generate text on the basis of a photograph


Misconception about Perception and Cognition

“A picture is worth a thousand words”? True or false?

True… but many pictures could use a few words!

[Example image: a part of a rocket engine by NASA.]


Assumptions

In image retrieval, photographs are the preferred media type

There is a predominant interest in objects (in the broad sense: including humans and animals)

The most likely level of description in real-world images is the “basic-level” category (Rosch et al.)


Goal: object-based image search

Object recognition in an open domain?

Not possible yet.

Extensive annotation is needed in any case: for indexed access and for machine learning (MPEG-7 allows for sophisticated annotation)

But who is going to do the annotation, the content provider or the user, and how?


How to realize object-based image search?

Bootstrap process for pattern recognition

cf. project CyC (Lenat) and openMind (Stork)

Collaborative, opportunistic annotation and object labeling (browser side)

Background learning process (server side)


Design considerations

Focus on object-based representations and queries

Material: photographs with identifiable objects for which a verbal description can be given

Exploit human perceptual abilities

Allow for incremental annotation to obtain a growing training set


Outline-based queries

In order to bridge the gap between what is currently possible and the ultimate goal of automatic object detection and classification, a closed curve drawn around a known object is used as a bootstrap representation: an outline.

This closed curve itself contains shape information (XY, dXdY, curvature) and makes it possible to separate the visual object characteristics, represented by the pixels it encloses, from the background
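As an illustration of this bootstrap idea, here is a minimal sketch (my own, not the system's actual code; the (N, 2) XY outline format is an assumption) of how a closed curve yields both shape information and an object/background separation:

```python
import numpy as np
from matplotlib.path import Path  # point-in-polygon test

def outline_shape_info(outline):
    """Shape information carried by the closed curve itself.

    outline: (N, 2) array of XY points, first point == last point.
    Returns the first differences (dXdY) and a discrete curvature
    estimate (change of running angle per step).
    """
    d = np.diff(outline, axis=0)            # dXdY
    angle = np.arctan2(d[:, 1], d[:, 0])    # running angle
    curvature = np.diff(np.unwrap(angle))   # discrete curvature
    return d, curvature

def object_mask(outline, image_shape):
    """Boolean mask of the pixels enclosed by the outline, separating
    the visual object characteristics from the background."""
    h, w = image_shape
    ys, xs = np.mgrid[0:h, 0:w]
    pixels = np.column_stack([xs.ravel(), ys.ravel()])
    return Path(outline).contains_points(pixels).reshape(h, w)
```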

[Figure slides: scribbles vs. outlines; examples of outlines from a “Wild West” base of photographs; outline, basic features and matching.]


More outline-based features

Lengths of radii from the center of gravity; curvature; curvature scale space; bitmap of an outline; absolute Fourier transform |FFT|. Others (not tried yet): wavelets, Freeman coding.

[Figure slides: outline features (coordinates, running angle (cos(f), sin(f)), radii, |FFT|); outline examples from the motor-bicycle set; a motor-bike engine; image (pixel-based) features; matching possibilities.]
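A minimal sketch (my own illustration, not the original implementation) of how several of these outline features could be computed; the resampling length and the scale normalization of the |FFT| are assumptions:

```python
import numpy as np

def outline_features(outline, n_points=64):
    """Radii, running angle and |FFT| features of a closed outline."""
    # Resample to a fixed number of points so feature vectors
    # from different outlines are comparable.
    t = np.linspace(0.0, 1.0, len(outline))
    ti = np.linspace(0.0, 1.0, n_points)
    x = np.interp(ti, t, outline[:, 0])
    y = np.interp(ti, t, outline[:, 1])

    # Lengths of radii from the center of gravity.
    cx, cy = x.mean(), y.mean()
    radii = np.hypot(x - cx, y - cy)

    # Running angle, encoded as (cos(f), sin(f)) of the tangent.
    dx, dy = np.gradient(x), np.gradient(y)
    f = np.arctan2(dy, dx)
    running_angle = np.column_stack([np.cos(f), np.sin(f)])

    # Absolute Fourier transform of the complex contour; coefficient 0
    # carries translation, so it is dropped, and dividing by |F[1]|
    # removes scale.
    z = (x - cx) + 1j * (y - cy)
    spectrum = np.abs(np.fft.fft(z))
    fft_abs = spectrum[1:] / (spectrum[1] + 1e-12)

    return radii, running_angle, fft_abs
```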


Annotation

After the user has produced an outline (by pen or mouse), it is fruitful to ask for a text label (keyboard, speech, handwriting)

Knowledge of semantics can be exploited to guide the user (e.g., with menus)

[Figure slides: the annotation tool; initial results.]


Problems in performance measurement

These systems usually have the goal of returning a list of similar-looking images

What is good? What is bad?

No clear-cut definition of ‘class’, unlike speech and handwriting recognition

Performance measurement is borrowed from Information Retrieval: Precision & Recall
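For concreteness, a sketch of the two borrowed measures for a single hit list (the set-based representation is my own):

```python
def precision_recall(hit_list, wanted):
    """Precision & recall of one hit list, as borrowed from IR.

    hit_list: ids of the retrieved images
    wanted:   set of ids of the relevant images in the ensemble
    """
    hits = sum(1 for image in hit_list if image in wanted)
    precision = hits / len(hit_list) if hit_list else 0.0
    recall = hits / len(wanted) if wanted else 0.0
    return precision, recall

# 10 retrieved, 4 of them wanted, 20 wanted in total:
# precision = 0.4, recall = 0.2
p, r = precision_recall(list(range(10)), set(range(6, 26)))
```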


Ensemble vs Hit list vs Wanted


Precision of a hit list


Precision of a hit list: accidental or real?

Example (the slide's own worked figure is not reproduced in this transcript):
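One standard way to make this judgment (my sketch, not necessarily the slide's own calculation) is the hypergeometric tail probability: the chance of finding at least k relevant images in a hit list of n drawn at random from an ensemble of N images of which R are relevant:

```python
from scipy.stats import hypergeom

def chance_precision(N, R, n, k):
    """P(at least k relevant images among n random draws from an
    ensemble of N images containing R relevant ones)."""
    # sf(k - 1) = P(X >= k) for the frozen hypergeometric distribution.
    return hypergeom(N, R, n).sf(k - 1)

# Ensemble of 1000 images, 50 relevant: 4 relevant images in a
# hit list of 10 is very unlikely (~1e-3) to be accidental.
p = chance_precision(1000, 50, 10, 4)
```

If this probability is small, the observed precision is real rather than accidental.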


Intermediate summary

Outline-based search yields promising results. Many questions remain:

Can users do it? Do they like to perform outlining + annotation?

Is the 'bootstrap' idea valid: can the outlines be used for matching with unseen images?

Can users produce outlines?

Object classes: locomotive, Christmas tree, atomic explosion, jukebox, 4-wheel-drive car, brain, motor bike, pistol, Buddha, stop sign

User (N=33) differences in outline production


Multistable outlining behavior?

Locomotive: with or without smoke?

Accurate or sloppy curvature followers

Observations

Ambiguities in outlining

Cluster analysis resolves the outline variation
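One plausible implementation of this step (a sketch assuming fixed-length outline feature vectors, e.g. the |FFT| features above; the linkage method and distance threshold are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_outlines(feature_vectors, max_dist=1.0):
    """Group outlines drawn by different users around the same object.

    feature_vectors: (n_outlines, n_features) array; outlines whose
    features lie within max_dist of a cluster are merged, absorbing
    multistable and sloppy-vs-accurate variation.
    """
    Z = linkage(pdist(feature_vectors), method='average')
    return fcluster(Z, t=max_dist, criterion='distance')
```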

Outline-based image retrieval system

Outlines XY (colored) vs Edges ΔI (grey)

Match operator, for each point i on the outline: [formula not reproduced in this transcript]
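A common operator with this shape is chamfer-style matching, in which each outline point i is scored by its distance to the nearest edge pixel; the following sketch uses that stand-in, since the original formula is not available here:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def outline_edge_match(outline, edge_map):
    """Chamfer-style match between an outline and an edge image.

    outline:  (N, 2) array of integer XY points
    edge_map: boolean image, True where a strong delta-I edge was found
    Returns the mean distance from the outline points to the nearest
    edge pixel; lower values mean a better match.
    """
    # Distance from every pixel to its nearest edge pixel.
    dist_to_edge = distance_transform_edt(~edge_map)
    xs, ys = outline[:, 0], outline[:, 1]
    return dist_to_edge[ys, xs].mean()
```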


Outline vs Edge search results

Caveat: no translation, orientation or scale invariance (early results)

More use for outlines: class-specific edge detectors

Generic edge detection

Edge detector (MLP), trained with outline points from the motor-bicycle base as targets for the output neuron
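A minimal sketch of such a class-specific detector (my reconstruction; the patch size, network size and exhaustive pixel scan are assumptions, and in practice the overwhelmingly many background pixels would be subsampled):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_outline_edge_detector(images, outlines, patch=5):
    """Train an MLP whose output neuron fires on class-specific edges.

    images:   grey-value 2-D arrays (e.g. motor-bicycle photographs)
    outlines: per image, an (N, 2) array of integer XY outline points;
              pixels on a user-drawn outline are the positive targets.
    """
    half = patch // 2
    X, y = [], []
    for img, outline in zip(images, outlines):
        positives = set(map(tuple, outline))
        h, w = img.shape
        for yy in range(half, h - half):
            for xx in range(half, w - half):
                X.append(img[yy - half:yy + half + 1,
                             xx - half:xx + half + 1].ravel())
                y.append(1 if (xx, yy) in positives else 0)
    net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=200)
    return net.fit(np.array(X), np.array(y))
```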


User input is highly valuable

Labeled outlines are needed to train classifiers/matchers

Labeled outlines are needed to develop benchmark sets (like “BenchAthlon”)

Examples from other fields:

Unipen: 5 million characters

NIST: millions of characters

LDC: thousands of hours of labeled speech


User input is highly valuable

openMind arguments (Stork, 1999; 2001)

The best teams have the largest labeled training sets

Differences between algorithms vanish when huge training sets are used (Ho & Baird, 1997)

Processor speed can be exploited if sufficient amounts of data are used (free ride on Moore’s Law)


The Vind(X) site (Schomaker & Vuurpijl)

The experience collected thus far has been integrated into a functional Web site for image search and collaborative annotation

In collaboration with the Rijksmuseum Amsterdam: a large image base of paintings, with their descriptions in a text base


The Vind(X) site (Schomaker & Vuurpijl)

Site: http://kepler.cogsci.kun.nl/vindx/

The site will become part of the openMind initiative: http://www.openmind.org

The system consists of Java/JavaScript WWW pages, with server-side pattern recognition in C

Vind(X) has extensive search and rendering functions

[Figure: the Vind(X) system with the paintings database of the Rijksmuseum Amsterdam; the query at the upper left is “sitting man” (Schomaker & Vuurpijl, 1999). http://kepler.cogsci.kun.nl/vindx/]


Outline results for one user


More questions: open user access

How to detect non-cooperative outlining and annotation?

How to merge ‘identical’ outlines?

How to merge 'identical' textual annotations?

How to detect valuable expert input?


More questions: semantics and geometry

How to achieve ‘explainable’ image hit list results?

Make sure the underlying features are based on human perception

Hypothesis: “The construction of ontologies based on both semantics and feature- space characteristics will help in producing ‘explainable’ hit list results”

[Figure: example of an ontology created from all collected object annotations, with nodes such as fruit, plant, inanimate, creature.]

A relation between semantic classes and contours?


Summary

Existing systems have problems in usability

Knowledge about the user (ergonomics, perception, cognition) may help substantially

Objects are a preferred search criterion

Object-based approaches have a strong connection to semantics


Summary (continued)

An outline-based object search system was presented

The prototype was converted to a Web site with real content: Dutch paintings (> 80)

The site is used for collecting human annotations of this image base (> 1000)

The resulting data are very useful for future research in a number of areas: IR, outline matching, pixel matching, dedicated preprocessing