![Page 1: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/1.jpg)
Words and Pictures
Rahul Raguram
![Page 2: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/2.jpg)
Motivation
Huge datasets where text and images co-occur
~ 3.6 billion photos
![Page 3: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/3.jpg)
Motivation
Huge datasets where text and images co-occur
![Page 4: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/4.jpg)
Motivation
Huge datasets where text and images co-occur
Photos in the news
![Page 5: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/5.jpg)
Motivation
Huge datasets where text and images co-occur
Subtitles
![Page 6: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/6.jpg)
Motivation
Interacting with large image datasets
Image content
‘Blobworld’ [Carson et al., 99]
![Page 7: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/7.jpg)
Motivation
Interacting with large photo collections
Image content
‘Blobworld’ [Carson et al., 99]
![Page 8: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/8.jpg)
![Page 9: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/9.jpg)
Motivation
Interacting with large photo collections
Image content
Query by sketch [Jacobs et al., 95]
![Page 10: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/10.jpg)
![Page 11: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/11.jpg)
Motivation
Interacting with large photo collections
Large disparity between user needs and what technology provides (Armitage and Enser 1997, Enser 1993, Enser 1995, Markkula and Sormunen 2000)
Queries based on image histograms, texture, overall appearance, etc. make up a vanishingly small fraction of real user requests
![Page 12: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/12.jpg)
Motivation
Interacting with large photo collections
Text queries
![Page 13: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/13.jpg)
Motivation
Text and images may be separately ambiguous; jointly they tend not to be
Image descriptions often leave out what is visually obvious (e.g. the colour of a flower)
…but often include properties that are difficult to infer using vision (e.g. the species of the flower)
![Page 14: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/14.jpg)
Linking words and pictures: Applications
Automated image annotation
Auto illustration
Browsing support
Examples shown: keywords “tiger cat mouth teeth”; text “statue of liberty”
![Page 15: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/15.jpg)
Learning the Semantics of Words and Pictures
Barnard and Forsyth, ICCV 2001
![Page 16: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/16.jpg)
Key idea
Model the joint distribution of words and image features
Joint probability model for text and image features
Random bits: impossible
Keywords “apple tree”: unlikely
Keywords “sky water sun”: reasonable
Slide credit: David Forsyth
![Page 17: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/17.jpg)
Input Representation
Extract keywords
Segment the image into a set of ‘blobs’
![Page 18: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/18.jpg)
EM revisited: Image segmentation
Examples from: http://www.eecs.umich.edu/~silvio/teaching/
![Page 19: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/19.jpg)
EM revisited: Image segmentation
Image
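Segment 1, Segment 2, ..., Segment k
Each segment has its own Gaussian: $N(\mu_1, \Sigma_1)$, $N(\mu_2, \Sigma_2)$, ..., $N(\mu_k, \Sigma_k)$
Generative model:
$$p(x) = \sum_l \pi_l \, p(x \mid \theta_l), \qquad \theta_l = (\mu_l, \Sigma_l)$$
where $\pi_l$ are the mixing weights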
Problem: You don’t know the parameters, the mixing weights, or the segmentation
![Page 20: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/20.jpg)
EM revisited: Image segmentation
Image
If you knew the segmentation, then you could find the parameters easily
Compute maximum likelihood estimates for $\theta_l = (\mu_l, \Sigma_l)$
Fraction of the image in the segment gives the mixing weight $\pi_l$
![Page 21: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/21.jpg)
EM revisited: Image segmentation
Image
If you knew the segmentation, then you could find the parameters easily
If you knew the parameters, you could easily determine the segmentation
Solution: iterate
Calculate the posteriors $p(\theta_l \mid x)$
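The iteration described on these slides can be sketched as a small EM loop for a Gaussian mixture. This is a minimal illustration, not the segmentation code used in the cited work: the function names, the initialisation, the small regularisation term, and the synthetic 2-D "pixel features" are all assumptions made for the example.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(mean, cov) evaluated at each row of x."""
    d = x.shape[1]
    diff = x - mean
    inv = np.linalg.inv(cov)
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff))

def em_gmm(x, k, n_iter=50, seed=0):
    """Fit a k-component Gaussian mixture to the rows of x with EM."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    # Assumed initialisation: random data points as means, global covariance.
    means = x[rng.choice(n, size=k, replace=False)].copy()
    covs = np.stack([np.atleast_2d(np.cov(x.T)) + 1e-6 * np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior of each segment l given each pixel x.
        joint = np.stack(
            [weights[l] * gaussian_pdf(x, means[l], covs[l]) for l in range(k)],
            axis=1,
        )
        post = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood estimates.
        n_l = post.sum(axis=0)
        weights = n_l / n                      # soft "fraction of the image"
        means = (post.T @ x) / n_l[:, None]
        for l in range(k):
            diff = x - means[l]
            covs[l] = (post[:, l, None] * diff).T @ diff / n_l[l] + 1e-6 * np.eye(d)
    return weights, means, covs, post

# Synthetic stand-in for per-pixel features drawn from two segments.
rng = np.random.default_rng(1)
pixels = np.vstack([rng.normal(0.0, 0.5, (200, 2)), rng.normal(5.0, 0.5, (200, 2))])
weights, means, covs, post = em_gmm(pixels, k=2)
segmentation = post.argmax(axis=1)   # hard segmentation from the posteriors
```

Taking the arg-max of the posteriors gives a hard segmentation; keeping the posteriors themselves gives the soft assignment that the M-step uses.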
![Page 22: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/22.jpg)
EM revisited: Image segmentation
Image from: http://www.ics.uci.edu/~dramanan/teaching/
![Page 23: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/23.jpg)
Input Representation
Segment the image into a set of ‘blobs’
Each region/blob represented by a vector of 40 features (size, position, colour, texture, shape)
![Page 24: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/24.jpg)
Modeling image dataset statistics
Generative, hierarchical model
Extension of Hofmann’s model for text (1998)
Each node emits blobs and words
Higher nodes emit more general words and blobs (e.g. sky)
Middle nodes emit moderately general words and blobs (e.g. sun)
Lower nodes emit more specific words and blobs (e.g. waves)
![Page 25: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/25.jpg)
Modeling image dataset statistics
Generative, hierarchical model
Extension of Hofmann’s model for text (1998)
Following a path from root to leaf generates image and associated text
Example path: sky → sun → waves, generating the words “sun sky waves”
![Page 26: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/26.jpg)
Modeling image dataset statistics
Generative, hierarchical model
Extension of Hofmann’s model for text (1998)
Each cluster is associated with a path from the root to a leaf
Cluster of images
![Page 27: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/27.jpg)
Modeling image dataset statistics
Generative, hierarchical model
Extension of Hofmann’s model for text (1998)
Each cluster is associated with a path from the root to a leaf
Example tree: root “sky”; middle node “sun, sea”; leaves “waves” and “rocks”
Adjacent clusters: “sun sea sky waves” and “sun sea sky rocks”
![Page 28: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/28.jpg)
Modeling image dataset statistics
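$D$ = blobs + words (the items observed for one image)
$$P(D) = \sum_c P(c)\, P(D \mid c)$$
Each cluster is associated with a path from a leaf to the root
Conditional independence of the items:
$$P(D) = \sum_c P(c) \prod_{i \in D} P(i \mid c)$$
Nodes along the path from leaf to root:
$$P(D) = \sum_c P(c) \prod_{i \in D} \sum_l P(i \mid l, c)\, P(l \mid c)$$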
![Page 29: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/29.jpg)
Modeling image dataset statistics
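$$P(D) = \sum_c P(c) \prod_{i \in D} \sum_l P(i \mid l, c)\, P(l \mid c)$$
For blobs: a Gaussian density over the blob feature vector $x$
$$P(b \mid l, c) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
For words: tabulate word frequencies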
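To make the sum-over-clusters, product-over-items, sum-over-levels structure of $P(D)$ concrete, here is a small sketch on a hand-built three-node tree. Everything in it is illustrative: the dictionary layout, the diagonal covariance for blobs, and the toy word tables and probabilities are assumptions, not values from the paper.

```python
import numpy as np

def blob_density(x, mean, var):
    """Gaussian density with diagonal covariance (a simplifying assumption)."""
    return np.exp(-0.5 * np.sum((x - mean) ** 2 / var)) / np.sqrt(np.prod(2 * np.pi * var))

def document_likelihood(blobs, words, model):
    """P(D) = sum_c P(c) prod_{i in D} sum_l P(i | l, c) P(l | c)."""
    total = 0.0
    for c, path in enumerate(model["paths"]):   # path = node ids, root first
        p_doc = model["p_cluster"][c]
        level_w = model["p_level"][c]           # P(l | c) for each level on the path
        for x in blobs:
            p_doc *= sum(w * blob_density(x, model["mean"][n], model["var"][n])
                         for w, n in zip(level_w, path))
        for word in words:
            p_doc *= sum(w * model["p_word"][n].get(word, 0.0)
                         for w, n in zip(level_w, path))
        total += p_doc
    return total

# Toy tree: node 0 is the root, nodes 1 and 2 are leaves; one cluster per leaf.
model = {
    "paths": [[0, 1], [0, 2]],
    "p_cluster": [0.5, 0.5],
    "p_level": [[0.5, 0.5], [0.5, 0.5]],
    "p_word": [{"sky": 0.8, "sun": 0.2}, {"waves": 1.0}, {"rocks": 1.0}],
    "mean": [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([-1.0, -1.0])],
    "var": [np.array([4.0, 4.0]), np.array([1.0, 1.0]), np.array([1.0, 1.0])],
}

p_words_only = document_likelihood([], ["sky", "waves"], model)
p_with_blob = document_likelihood([np.array([1.0, 1.0])], ["sky", "waves"], model)
```

A real implementation would work in log space to avoid underflow when a document has many items.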
![Page 30: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/30.jpg)
Modeling image dataset statistics
Model fitting: EM
Missing data is the path and the nodes that generated each data element
Two hidden variables:
$H_{d,c}$: document $d$ is in cluster $c$
$V_{d,i,l}$: item $i$ of document $d$ was generated at level $l$
If path and node were known for each data element, it would be easy to get the maximum likelihood estimate of the parameters
Given a parameter estimate, path and node are easy to figure out
$$P(D) = \sum_c P(c) \prod_{i \in D} \sum_l P(i \mid l, c)\, P(l \mid c)$$
![Page 31: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/31.jpg)
Results
Clustering
Does text+image clustering have an advantage?
Only text
![Page 32: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/32.jpg)
Results
Clustering
Does text+image clustering have an advantage?
Only blob features
![Page 33: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/33.jpg)
Results
Clustering
Does text+image clustering have an advantage?
Both text and image segments
![Page 34: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/34.jpg)
Results
Clustering
Does text+image clustering have an advantage?
User study:
Generate 64 clusters for 3000 images
Generate 64 random clusters from the same images
Present a randomly chosen cluster to the user, ask them to rate its coherence (yes/no)
94% accuracy
![Page 35: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/35.jpg)
Results
Image search
Supply a combination of text + image features
Approach: compute, for each candidate image, the probability of emitting the query items
$$P(Q \mid d) = \sum_c P(Q \mid c, d)\, P(c \mid d) = \sum_c \left[ \prod_{q \in Q} \sum_l P(q \mid l, c)\, P(l \mid c) \right] P(c \mid d)$$
Q – set of query items
d – candidate document
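Ranking candidates by $P(Q \mid d)$ can be sketched as follows for a word-only query. The tree, the word tables, the per-document cluster posteriors $P(c \mid d)$ and the file names are all made-up values for illustration; a blob in the query would contribute a Gaussian density in place of the word-table lookup.

```python
def query_likelihood(query, p_cluster_given_doc, paths, p_level, p_word):
    """P(Q | d) = sum_c [ prod_{q in Q} sum_l P(q | l, c) P(l | c) ] P(c | d)."""
    total = 0.0
    for c, path in enumerate(paths):
        p_q = 1.0
        for q in query:
            p_q *= sum(w * p_word[n].get(q, 0.0) for w, n in zip(p_level[c], path))
        total += p_q * p_cluster_given_doc[c]
    return total

# Hypothetical model: root node 0, leaves 1 and 2, one cluster per leaf.
paths = [[0, 1], [0, 2]]
p_level = [[0.5, 0.5], [0.5, 0.5]]
p_word = [{"sky": 0.8, "sun": 0.2}, {"waves": 1.0}, {"rocks": 1.0}]

# Hypothetical cluster posteriors P(c | d) for two candidate images.
docs = {"beach.jpg": [0.9, 0.1], "cliff.jpg": [0.1, 0.9]}

query = ["sky", "waves"]
scores = {name: query_likelihood(query, p_cd, paths, p_level, p_word)
          for name, p_cd in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```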
![Page 36: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/36.jpg)
Results
Image search
Image credit: David Forsyth
![Page 37: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/37.jpg)
Results
Image search
Image credit: David Forsyth
![Page 38: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/38.jpg)
Results
Image search
Image credit: David Forsyth
![Page 39: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/39.jpg)
Results
Auto-annotation
Compute:
$$P(w \mid B) = \sum_c P(w \mid c, B)\, P(c \mid B)$$
![Page 40: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/40.jpg)
Results
Auto-annotation
Quantitative performance:
Use 160 Corel CDs, each with 100 images (grouped by theme)
Select 80 of the CDs, split into training (75%) and test (25%)
Remaining 80 CDs are a ‘harder’ test set
Model scoring:
n – number of words for the image
r – number of words predicted correctly
w – number of words predicted incorrectly
N – vocabulary size
All words that exceed a threshold are predicted
![Page 41: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/41.jpg)
Results
Auto-annotation
Quantitative performance:
Use 160 Corel CDs, each with 100 images (grouped by theme)
Select 80 of the CDs, split into training (75%) and test (25%)
Remaining 80 CDs are a ‘harder’ test set
Model scoring:
n – number of words for the image
r – number of words predicted correctly
Model predicts n words
Can do surprisingly well just by using the empirical word frequency!
![Page 42: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/42.jpg)
Results
Auto-annotation
Quantitative performance:
Score of 0.1 indicates roughly 1 out of every 3 words is correctly predicted (vs. 1 out of 6 for the empirical model)
Reported quantities: $E^{m}_{NS} - E^{e}_{NS}$ and $E^{m}_{PR} - E^{e}_{PR}$
![Page 43: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/43.jpg)
Names and Faces in the News
Berg et al., CVPR 2004
![Page 44: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/44.jpg)
Motivation
President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, 2003. Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters
![Page 45: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/45.jpg)
![Page 46: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/46.jpg)
Motivation
Organize news photographs for browsing and retrieval
Build a large ‘real-world’ face dataset
Datasets captured in lab conditions do not truly reflect the complexity of the problem
![Page 47: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/47.jpg)
Motivation
Organize news photographs for browsing and retrieval
Build a large ‘real-world’ face dataset
Datasets captured in lab conditions do not truly reflect the complexity of the problem
In many traditional face datasets, it’s possible to get excellent performance by using no facial features at all (Shamir, 2008)
![Page 48: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/48.jpg)
Motivation
Top left 100×100 pixels of the first 10 individuals in the color FERET dataset. The IDs of the subjects are listed to the right of the images
![Page 49: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/49.jpg)
Dataset
Download news photos and captions: ~500,000 images from Yahoo News, over a period of two years
Run a face detector: 44,773 faces, resized to 86x86 pixels
Extract names from the captions:
Identify two or more capitalized words followed by a present tense verb
Associate every face in the image with every detected name
Goal is to label each face detector output with the correct name
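The name-extraction rule above (two or more capitalized words followed by a present tense verb) can be approximated with a regular expression. This is only a sketch: the short verb list and the helper names are hypothetical, and the original system's lexicon is not given in the slides.

```python
import re

# Hypothetical, deliberately tiny list of present tense verbs.
PRESENT_TENSE_VERBS = ("makes", "waves", "shows", "looks", "speaks", "says")
VERBS = "|".join(PRESENT_TENSE_VERBS)

# One or more capitalized tokens (optionally an initial such as "W."),
# then a final capitalized word, then one of the verbs.
NAME_PATTERN = re.compile(
    r"\b((?:[A-Z][a-z]*\.?\s+)+[A-Z][a-z]+)\s+(?:" + VERBS + r")\b"
)

def extract_names(caption):
    """Return capitalized word sequences that precede a listed verb."""
    return [m.group(1) for m in NAME_PATTERN.finditer(caption)]

def associate(face_ids, names):
    """Every detected face starts out associated with every detected name."""
    return {face: list(names) for face in face_ids}

caption = ("President George W. Bush makes a statement in the Rose Garden while "
           "Secretary of Defense Donald Rumsfeld looks on, July 23, 2003.")
names = extract_names(caption)
candidates = associate(["face_0", "face_1"], names)
```

On this caption the rule yields "President George W. Bush" and the truncated "Defense Donald Rumsfeld", which illustrates why clusters with different name strings for the same person need to be merged later.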
![Page 50: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/50.jpg)
Dataset Properties
Diverse: large variation in lighting and pose, broad range of expressions
![Page 51: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/51.jpg)
Dataset Properties
Diverse: large variation in lighting and pose, broad range of expressions
Name frequencies follow a long tailed distribution
Doctor Nikola shows a fork that was removed from an Israeli woman who swallowed it while trying to catch a bug that flew in to her mouth, in Poriah Hospital northern Israel July 10, 2003. Doctors performed emergency surgery and removed the fork. (Reuters)
President George W. Bush waves as he leaves the White House for a day trip to North Carolina, July 25, 2002. A White House spokesman said that Bush would be compelled to veto Senate legislation creating a new department of homeland security unless changes are made. (Kevin Lamarque/Reuters)
![Page 52: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/52.jpg)
Preprocessing
Rectify faces to canonical position
Train 5 SVMs as feature detectors: corners of left and right eyes, tip of the nose, corners of the mouth
Use 150 hand-clicked faces to train the SVMs
For a test image, run the SVMs over the entire image: produces 5 feature maps
Detect maximal outputs in the 5 maps, and estimate the affine transformation to the canonical pose
Image credit: Y. J. Lee
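Once the five feature locations are detected, the affine map to the canonical pose can be estimated by least squares. The sketch below assumes the detections are already available as 2-D points; the coordinates and the use of the mean residual as a stand-in for a rectification score are assumptions made for illustration.

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2x3 affine map taking src points (n, 2) to dst points (n, 2)."""
    design = np.hstack([src, np.ones((src.shape[0], 1))])      # (n, 3)
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)      # (3, 2)
    return params.T                                            # (2, 3)

def apply_affine(affine, pts):
    """Apply a 2x3 affine map to an (n, 2) array of points."""
    return pts @ affine[:, :2].T + affine[:, 2]

# Hypothetical detections of the five features in a face image.
detected = np.array([[30.0, 40.0], [60.0, 38.0], [45.0, 55.0],
                     [35.0, 70.0], [58.0, 69.0]])

# Build canonical positions from a known map so the recovery can be checked.
true_affine = np.array([[1.1, 0.2, 3.0],
                        [-0.1, 0.9, -2.0]])
canonical = apply_affine(true_affine, detected)

estimated = estimate_affine(detected, canonical)
residual = np.linalg.norm(apply_affine(estimated, detected) - canonical, axis=1).mean()
```

With noisy detections the residual is no longer zero, and a large value would be one plausible signal for rejecting a poorly rectified face.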
![Page 53: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/53.jpg)
Preprocessing
Reject images with poor rectification scores
![Page 54: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/54.jpg)
Preprocessing
Reject images with poor rectification scores: this leaves 34,623 images
Throw out images with more than 4 names: 27,742 faces
![Page 55: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/55.jpg)
Face representation
86x86 images – 7396 dimensional vectors
However, relatively few 7396 dimensional vectors actually correspond to valid face images
We want to effectively model the subspace of valid face images
Slide credit: S. Lazebnik
![Page 56: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/56.jpg)
Face representation
We want to construct a low-dimensional linear subspace that best explains the variation in the set of face images
Slide credit: S. Lazebnik
![Page 57: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/57.jpg)
Principal Component Analysis (PCA)
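Define the sample mean $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$ and the covariance matrix $S = \frac{1}{N}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$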
Formulation: C. Bishop
![Page 58: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/58.jpg)
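Principal Component Analysis (PCA)
Want to maximize the projected variance
Alternate formulation: minimize sum-of-square errors
Maximize $u_1^T S u_1$ subject to $u_1^T u_1 = 1$
Use Lagrange multipliers: $u_1$ must be an eigenvector of $S$, i.e. $S u_1 = \lambda_1 u_1$
The projected variance is then $u_1^T S u_1 = \lambda_1$, so choose the maximum eigenvalue to maximize variance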
Image, formulation: C. Bishop
![Page 59: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/59.jpg)
Principal Component Analysis (PCA)
The direction that captures the maximum variance of the data is the eigenvector corresponding to the largest eigenvalue of the data covariance matrix
Furthermore, the top k orthogonal directions that capture the most variance of the data are the k eigenvectors corresponding to the k largest eigenvalues
Slide credit: S. Lazebnik
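A minimal NumPy sketch of this eigenvector view of PCA follows. The function name and the synthetic data (points stretched along the diagonal) are illustrative assumptions; for 7396-dimensional face vectors one would typically use an SVD of the centered data instead of forming the full covariance matrix.

```python
import numpy as np

def pca(x, k):
    """Return the mean, the top-k principal directions (columns) and their variances."""
    mean = x.mean(axis=0)
    centered = x - mean
    cov = centered.T @ centered / x.shape[0]     # data covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]        # keep the k largest
    return mean, eigvecs[:, order], eigvals[order]

# Synthetic data: large variance along (1, 1), small isotropic noise.
rng = np.random.default_rng(0)
t = rng.normal(0.0, 3.0, 500)
x = np.outer(t, [1.0, 1.0]) + rng.normal(0.0, 0.1, (500, 2)) + np.array([10.0, -4.0])

mean, components, variances = pca(x, k=2)
projected = (x - mean) @ components              # coordinates in the subspace
```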
![Page 60: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/60.jpg)
Limitations of PCA
PCA assumes that the data has a Gaussian distribution (mean µ, covariance matrix Σ)
Slide credit: S. Lazebnik
![Page 61: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/61.jpg)
Limitations of PCA
The direction of maximum variance is not always good for classification
Image credit: C. Bishop
![Page 62: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/62.jpg)
Limitation #1
Shape of the data not modeled well by the linear principal components
![Page 63: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/63.jpg)
The return of the kernel trick
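Basic idea: express conventional PCA in terms of dot products
From before: $S u_i = \lambda_i u_i$
For convenience, assume that you’ve subtracted off the mean from each vector, so that $S = \frac{1}{N}\sum_{n=1}^{N} x_n x_n^T$
Consider a nonlinear function $\Phi(x)$ mapping into M dimensions (M > D)
Assume $\sum_n \Phi(x_n) = 0$
Covariance matrix: $C = \frac{1}{N}\sum_{n=1}^{N} \Phi(x_n)\Phi(x_n)^T$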
Formulation: C. Bishop
![Page 64: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/64.jpg)
The return of the kernel trick
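Covariance matrix in feature space: $C = \frac{1}{N}\sum_{n=1}^{N} \Phi(x_n)\Phi(x_n)^T$ (now M×M)
Eigenvector equation: $C v_i = \lambda_i v_i$
Substituting for C: $\frac{1}{N}\sum_{n=1}^{N} \Phi(x_n)\left\{\Phi(x_n)^T v_i\right\} = \lambda_i v_i$, where each $\Phi(x_n)^T v_i$ is a scalar value
The eigenvectors $v_i$ can be written as a linear combination of the $\Phi(x_n)$: $v_i = \sum_{n=1}^{N} a_{in}\Phi(x_n)$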
Formulation: C. Bishop
![Page 65: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/65.jpg)
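The return of the kernel trick
Key step: express this in terms of the kernel function $k(x_n, x_m) = \Phi(x_n)^T \Phi(x_m)$
Multiply both sides by $\Phi^T(x_l)$ and substitute $v_i = \sum_n a_{in}\Phi(x_n)$: this gives $K a_i = \lambda_i N a_i$, where $K$ is the N×N matrix with entries $k(x_n, x_m)$
Projection of a point onto eigenvector i: $y_i(x) = \Phi(x)^T v_i = \sum_{n=1}^{N} a_{in}\, k(x, x_n)$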
Formulation: C. Bishop
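A compact sketch of kernel PCA in this formulation follows. The RBF kernel, its `gamma` value, the explicit centering of the kernel matrix (which stands in for the zero-mean assumption in feature space) and the random data are all assumptions chosen for illustration; only the projections of the training points are computed.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """k(x, y) = exp(-gamma * ||x - y||^2) for all pairs of rows in a and b."""
    sq = np.sum(a ** 2, axis=1)[:, None] + np.sum(b ** 2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * sq)

def kernel_pca(x, k, gamma=0.5):
    """Project the training points onto the top-k kernel principal components."""
    n = x.shape[0]
    K = rbf_kernel(x, x, gamma)
    one_n = np.full((n, n), 1.0 / n)
    # Center the kernel matrix so the mapped points have zero mean in feature space.
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    eigvals, eigvecs = np.linalg.eigh(Kc)        # eigenvalues here equal lambda_i * N
    order = np.argsort(eigvals)[::-1][:k]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Rescale the coefficient vectors so each feature-space eigenvector has unit norm.
    alphas = eigvecs / np.sqrt(eigvals)
    return Kc @ alphas                           # row n holds the projections of x_n

rng = np.random.default_rng(0)
points = rng.normal(size=(40, 2))
proj = kernel_pca(points, k=2)
```

Projecting a new point would additionally require centering its kernel values against the training set, which is omitted here.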
![Page 66: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/66.jpg)
Kernel PCA
Image credit: C. Bishop
![Page 67: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/67.jpg)
Limitation #2
The direction of maximum variance is not always good for classification
Image credit: C. Bishop
![Page 68: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/68.jpg)
Linear Discriminant Analysis (LDA)
Goal: Perform dimensionality reduction while preserving as much of the class discriminatory information as possible
Try to find directions along which the classes are best separated
Capable of distinguishing image variation due to identity from variation due to other sources such as illumination and expression
![Page 69: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/69.jpg)
Linear Discriminant Analysis (LDA)
Define inter- and intra-class scatter matrices: W – intra-class, B – inter-class
LDA computes a projection $w$ that maximizes the ratio $\frac{|w^T B w|}{|w^T W w|}$ by solving the generalized eigenvalue problem $B w = \lambda W w$
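As a rough illustration of LDA via the generalized eigenvalue problem, the sketch below builds the two scatter matrices and takes the leading eigenvectors of $W^{-1}B$. Using a pseudo-inverse instead of a dedicated generalized eigensolver, and the synthetic two-class data, are assumptions made to keep the example NumPy-only.

```python
import numpy as np

def lda(x, labels, k):
    """Return the top-k discriminant directions as columns."""
    d = x.shape[1]
    overall_mean = x.mean(axis=0)
    W = np.zeros((d, d))   # intra-class scatter
    B = np.zeros((d, d))   # inter-class scatter
    for c in np.unique(labels):
        xc = x[labels == c]
        class_mean = xc.mean(axis=0)
        W += (xc - class_mean).T @ (xc - class_mean)
        diff = (class_mean - overall_mean)[:, None]
        B += xc.shape[0] * (diff @ diff.T)
    # Solve B w = lambda W w through the eigenvectors of pinv(W) B.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(W) @ B)
    order = np.argsort(eigvals.real)[::-1][:k]
    return eigvecs[:, order].real

# Two classes with large shared variance along x and a small offset along y:
# the direction of maximum variance is x, but the classes separate along y.
rng = np.random.default_rng(0)
class_a = np.column_stack([rng.normal(0.0, 5.0, 200), rng.normal(0.0, 0.3, 200)])
class_b = np.column_stack([rng.normal(0.0, 5.0, 200), rng.normal(2.0, 0.3, 200)])
x = np.vstack([class_a, class_b])
labels = np.repeat([0, 1], 200)

w = lda(x, labels, k=1)[:, 0]
```

On this data the discriminant direction is almost exactly the y axis, whereas the first principal component would be close to the x axis.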
![Page 70: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/70.jpg)
Class labels for LDA
For the unsupervised names and faces dataset, you don’t have true labels
Use proxy for labeled training data: images from the dataset with only one detected face and one detected name
Observation: Using LDA on top of the space found by kernel PCA improves performance significantly
![Page 71: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/71.jpg)
Clustering faces
Now that we have a representation for faces, the goal is to ‘clean up’ this dataset
Modified k-means clustering
Candidate names shown: Obama, Bush, Clinton, Saddam
![Page 72: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/72.jpg)
![Page 73: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/73.jpg)
![Page 74: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/74.jpg)
Clustering faces
Now that we have a representation for faces, the goal is to ‘clean up’ this dataset
Modified k-means clustering
Candidate names shown: Bush, Saddam
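One way to read "modified k-means" here is k-means in which each face may only be assigned to a name detected in its own caption. The sketch below follows that reading; the random initialisation, the stopping rule and the toy 2-D "face features" are assumptions, and the names are taken from the slide purely as labels.

```python
import numpy as np

def constrained_kmeans(faces, candidates, n_iter=20, seed=0):
    """Assign each face to one of its candidate names by iterating means and reassignment."""
    rng = np.random.default_rng(seed)
    # Start by giving each face one of its own candidate names at random.
    assign = [c[rng.integers(len(c))] for c in candidates]
    for _ in range(n_iter):
        # Mean of every name cluster that currently has members.
        means = {}
        for name in set(assign):
            members = [f for f, a in zip(faces, assign) if a == name]
            means[name] = np.mean(members, axis=0)
        # Reassign each face to the closest mean among its own candidate names.
        new_assign = []
        for f, c in zip(faces, candidates):
            usable = [n for n in c if n in means]
            new_assign.append(min(usable, key=lambda n: np.linalg.norm(f - means[n])))
        if new_assign == assign:
            break
        assign = new_assign
    return assign

# Toy features: six faces with a single candidate name, four ambiguous ones.
faces = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, -0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.8, 5.2],
                  [0.1, 0.2], [-0.2, 0.1], [5.2, 5.1], [4.9, 4.8]])
candidates = [["Bush"]] * 3 + [["Saddam"]] * 3 + [["Bush", "Saddam"]] * 4
assignment = constrained_kmeans(faces, candidates)
```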
![Page 75: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/75.jpg)
Pruning clusters
Remove clusters with < 3 faces: this leaves 19,355 images
For every data point, compute a likelihood score
Remove points with low likelihood
k – number of nearest neighbours being considered
k_i – number of nearest neighbours that are in cluster i
n – total number of points in the dataset
n_i – total number of points in cluster i
![Page 76: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/76.jpg)
Pruning clusters
For various thresholds:
![Page 77: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/77.jpg)
Merging clusters
Merge clusters with different names that correspond to a single person
“Defense Donald Rumsfeld” and “Donald Rumsfeld”
Or “Colin Powell” and “Secretary of State”
Look at distance between the means in discriminant space: if below a threshold, merge
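The merging rule can be sketched as a threshold test on distances between cluster means. The 2-D means, the threshold value and the relabelling scheme below are illustrative assumptions; real means would live in the discriminant space learned earlier.

```python
import numpy as np

def merge_clusters(means, threshold):
    """Map each cluster name to a representative name; close clusters share one."""
    names = sorted(means)
    label = {n: n for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if np.linalg.norm(means[a] - means[b]) < threshold:
                old, new = label[b], label[a]
                # Relabel everything already merged with b.
                for n in names:
                    if label[n] == old:
                        label[n] = new
    return label

# Hypothetical cluster means in discriminant space.
means = {
    "Donald Rumsfeld": np.array([0.0, 0.0]),
    "Defense Donald Rumsfeld": np.array([0.1, 0.0]),
    "Colin Powell": np.array([5.0, 5.0]),
}
merged = merge_clusters(means, threshold=0.5)
```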
![Page 78: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/78.jpg)
Merging clusters
Image credit: David Forsyth
![Page 79: Words and Pictures Rahul Raguram. Motivation Huge datasets where text and images co-occur ~ 3.6 billion photos](https://reader036.vdocuments.pub/reader036/viewer/2022062409/56649d615503460f94a42596/html5/thumbnails/79.jpg)
Results