10-1 vocab of terms
Alan Nochenson
IST 511
10/1/2012
Outline: Motivation; Real-world example; Techniques (tokenization, stop words, normalization, stemming/lemmatization)
Using a variety of techniques, we want to improve IR systems so that they “understand” more of what we want from a query
E.g. when searching for a paper about Facebook, the following queries should all return the paper: “The facebook”, “facebook”, “face-book”
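Damerau–Levenshtein distance is the minimum number of edit operations needed to turn one word into another. The operations are: insert a character, delete a character, change one character into another, swap two adjacent characters.
adidas = adiidas = adifas (each misspelling is distance 1 from adidas, so both should match it)
But: cat != rat != hat (these are also distance 1 apart, yet they are different words)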
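As a rough illustration of how such a distance can be computed, here is a minimal Python sketch of the restricted (optimal string alignment) variant of Damerau–Levenshtein distance. The function name `osa_distance` is made up for this example and does not come from the slides.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, change, and adjacent swap.

    This is the restricted "optimal string alignment" variant: each
    substring may be edited at most once.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # change (or match)
            )
            # swap of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


print(osa_distance("adidas", "adiidas"))  # one inserted letter
print(osa_distance("adidas", "adifas"))   # one changed letter
print(osa_distance("cat", "rat"))         # one changed letter, different word
```

The distance alone cannot tell a typo (adifas) from a legitimate different word (rat), which is why a threshold of 1 helps in one case and hurts in the other.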
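Tokenization: breaking up sentences into tokens according to a variety of rules
Split on non-alphanumeric characters?
Good: The dog ran to the park
Bad: Ms. O’Hannety went to O’Flaggerty’s pub (Ms, O, Hannety, went, to, O, Flaggerty, s, pub)
Split on spaces?
Bad: San Francisco is a great city. (San and Francisco become separate tokens, although together they name one place)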
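The two splitting rules can be tried out with a few lines of Python. This is only a sketch using the standard `re` module; the function names are invented for the example.

```python
import re


def split_non_alnum(text):
    # Keep maximal runs of ASCII letters and digits; every other
    # character (spaces, periods, apostrophes) acts as a separator.
    return re.findall(r"[A-Za-z0-9]+", text)


def split_on_space(text):
    # Split on runs of whitespace only; punctuation stays attached.
    return text.split()


print(split_non_alnum("The dog ran to the park"))
print(split_non_alnum("Ms. O’Hannety went to O’Flaggerty’s pub"))
print(split_on_space("San Francisco is a great city."))
```

The second call breaks the names at each apostrophe, and the third keeps "city." with its period and treats San and Francisco as unrelated tokens.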
Compound words, e.g. German Lebensversicherungsgesellschaftsangestellter = life insurance company employee
Such a word would not get split by any of the previously mentioned methods
Stop words: drop common ‘useless’ words
But how useless are they? (Consider “President of the USA”)
Keeping them is not a big problem, space-wise or time-wise
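A small sketch of stop-word removal, applied to the phrase from the slide. The stop list here is a toy list assumed for illustration; real systems use longer lists or none at all.

```python
# Assumption: a tiny hand-picked stop list, for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def remove_stop_words(tokens):
    # Compare case-insensitively, but return the tokens unchanged.
    return [t for t in tokens if t.lower() not in STOP_WORDS]


print(remove_stop_words("President of the USA".split()))
```

After filtering, only "President" and "USA" remain, so the index can no longer distinguish the exact phrase from any document that merely contains both words.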
What I did at Amazon (codenamed BrandSims normalization)
Maps words/phrases that are semantically related to each other, so they can refer to the same content
E.g. “Alan went to the store” = “Alan go store”
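One way to picture this kind of mapping is a lookup table that sends related forms to a shared canonical term. The sketch below is purely illustrative: the tables and the `normalize` function are assumptions made up for this example and say nothing about how the Amazon system actually worked.

```python
# Assumption: toy tables built by hand for this one example.
STOP_WORDS = {"to", "the"}
CANONICAL = {"went": "go", "goes": "go", "going": "go"}


def normalize(text):
    tokens = text.split()
    # Drop stop words, then map each remaining word to its canonical form.
    kept = [t for t in tokens if t.lower() not in STOP_WORDS]
    return [CANONICAL.get(t.lower(), t) for t in kept]


print(normalize("Alan went to the store"))
print(normalize("Alan go store"))
```

Both inputs end up as the same token list, so a query in one form can match content written in the other.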
Accents and diacritics: mainly dropped, since they were not always supported
This is problematic, since in certain languages accents are critical to understanding
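Dropping accents can be sketched with Python's standard `unicodedata` module: decompose each character, then discard the combining marks. The function name `strip_accents` is chosen for this example.

```python
import unicodedata


def strip_accents(text):
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("résumé"))
print(strip_accents("peña"), strip_accents("pena"))
```

The last line shows the downside: "peña" and "pena" are distinct Spanish words, but after stripping they are indistinguishable.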
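Capitalization: standardize to all caps or all lowercase (lowercase is more common)
Everywhere in the sentence?
Bad: We went to the White House (lowercasing everything turns “White House” into “white house”)
A better solution is to lowercase only words at the beginning of a sentence and in titles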
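The difference between folding everything and folding only sentence starts can be sketched as follows. The regular expression is a simple heuristic assumed for this example (a sentence starts at the beginning of the text or after `.`, `!`, or `?` plus whitespace), and `fold_sentence_starts` is an invented name.

```python
import re


def fold_sentence_starts(text):
    # Lowercase a capital letter only when it opens the text or
    # directly follows sentence-ending punctuation and whitespace.
    return re.sub(
        r"(^|[.!?]\s+)([A-Z])",
        lambda m: m.group(1) + m.group(2).lower(),
        text,
    )


sentence = "We went to the White House. The tour was short."
print(sentence.lower())                # folds everything
print(fold_sentence_starts(sentence))  # keeps "White House" intact
```

This heuristic has its own cost: a sentence that begins with a proper noun loses its capital too.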
Stemming and lemmatization are more complicated than the previous normalization techniques
The goal is to remove things like tense, number, and possession from strings
Stemming: chop off the end of the word
Con: crude and sometimes ineffective
Pro: fast, with no overhead
E.g. cookies -> cooki, cup -> c
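To show how little machinery suffix chopping needs, here is a sketch that applies only the plural-handling rules from the first step (step 1a) of Porter's algorithm. It is not a full stemmer, and the names `STEP_1A_RULES` and `strip_plural` are invented for this example.

```python
# Porter step 1a: (suffix, replacement) pairs, tried in order.
STEP_1A_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def strip_plural(word):
    # Apply the first rule whose suffix matches, then stop.
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["cookies", "caresses", "caress", "cats"]:
    print(w, "->", strip_plural(w))
```

The rules are just string operations with no dictionary lookup, which is why stemming is fast, and also why it produces non-words such as "cooki".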
Lemmatization: use a vocabulary list and morphological (structural) analysis [which may or may not help much]
Recognize context in a sentence (“saw” would become “see” if used as a verb, but not if used as a noun)
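The role of context can be sketched with a lookup keyed on both the word and its part of speech. The table below is a hand-built toy vocabulary assumed for this example, and the part-of-speech tag is passed in by the caller; a real lemmatizer would rely on a full dictionary and a tagger.

```python
# Assumption: a tiny hand-built vocabulary of (word, part of speech) -> lemma.
LEMMAS = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("went", "VERB"): "go",
    ("cookies", "NOUN"): "cookie",
}


def lemmatize(word, pos):
    # Fall back to the lowercased word when the vocabulary has no entry.
    return LEMMAS.get((word.lower(), pos), word.lower())


print(lemmatize("saw", "VERB"))  # as in "I saw the dog"
print(lemmatize("saw", "NOUN"))  # as in "hand me the saw"
print(lemmatize("cookies", "NOUN"))
```

Unlike the stemmer's "cooki", the lemma "cookie" is a real word, at the cost of needing the vocabulary and the part-of-speech information.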
Porter’s algorithm:
Understand the type of queries that will be submitted
It is all about tradeoffs between precision and recall
These techniques can be used differently depending on the context.
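The precision/recall tradeoff can be made concrete with a small calculation. The document IDs below are invented for illustration: a strict system returns few documents, while an aggressively normalized one returns many.

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four documents are truly relevant.
relevant = {"d1", "d2", "d3", "d4"}
strict = precision_recall({"d1", "d2"}, relevant)
loose = precision_recall({"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}, relevant)
print("strict matching:", strict)
print("loose matching: ", loose)
```

In this made-up case the strict system has perfect precision but finds only half of the relevant documents, while the loose one finds all of them at the price of returning as many irrelevant ones.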