Introduction to Information Retrieval (資訊檢索導論)

Upload: stm-works

Post on 06-Apr-2018


  • 8/2/2019 Introduction to information retrieval

    1/87


    90

    Web

    Web

    2004 Pew Fallows 2004

    92%

    Web

    1990

    Web Web

    Web

    Stanford, Stuttgart


    75–90

    8

    1

    2

    3

    4

    5

    15 Boolean retrieval

    6 7

    8

    8 9–21 9

    10

    XML HTML

    6 11

    12 11

    12

    13–18 13–15

    13

    14 6

    Rocchio kNN


    k nearest neighbor

    15

    16–18 16

    K EM

    17

    18

    19–21 Web 19 Web

    Web 20

    21 Web

    Cross-language IR: Grossman and Frieder 2004, ch. 4; Oard and Dorr 1996

    Image and multimedia IR: Grossman and Frieder 2004, ch. 4; Baeza-Yates and Ribeiro-Neto 1999, ch. 6, 11, 12; del Bimbo 1999; Lew 2001; Smeulders et al. 2000

    Speech retrieval: Coden et al. 2002

    Music retrieval: Downie 2006; http://www.ismir.net/

    User interfaces for IR: Baeza-Yates and Ribeiro-Neto 1999, ch. 10

    Parallel and peer-to-peer (P2P) IR: Grossman and Frieder 2004, ch. 7; Baeza-Yates and Ribeiro-Neto 1999, ch. 9; Aberer 2001

    Digital libraries: Baeza-Yates and Ribeiro-Neto 1999, ch. 15; Lesk 2004

    Information science perspective: Korfhage 1997; Meadow et al. 1999; Ingwersen and Jarvelin 2005


    Logic-based approaches to IR: van Rijsbergen 1989

    Natural language processing techniques: Manning and Schutze 1999; Jurafsky and Martin 2008; Lewis and Jones 1996

    21

    15 6 7

    810 11

    11.1 1113 15

    18

    18.1 21

    [*][**][***]

    Lauren Cowles

    Cheryl Aasheim, Josh Attenberg, Luc Belanger, Tom Breuel, Daniel Burckhardt, Georg Buscher, Fazli Can, Dinquan Chen, Ernest Davis, Pedro Domingos, Rodrigo Panchiniak Fernandes, Paolo Ferragina, Norbert Fuhr, Vignesh Ganapathy, Elmer Garduno, Xiubo Geng, David Gondek, Sergio Govoni, Corinna Habets, Ben Handy, Donna Harman, Benjamin Haskell, Thomas Huhn, Deepak Jain, Ralf Jankowitsch, Dinakar Jayarajan, Vinay Kakade, Mei Kobayashi, Wessel Kraaij, Rick Lafleur, Florian Laws, Hang Li, David Mann, Ennio Masi, Frank McCown, Paul McNamee, Sven Meyer zu Eissen, Alexander Murzaku, Gonzalo Navarro, Scott Olsson, Daniel Paiva, Tao Qin, Megha Raghava, Ghulam Raza, Michal Rosen-Zvi, Klaus Rothenhausler, Kenyu L. Runner, Alexander Salamanca, Grigory Sapunov, Tobias Scheffer, Nico Schlaefer, Evgeny Shadchnev, Ian Soboroff, Benno Stein, Marcin Sydow, Andrew Turner, Jason Utt, Huey Vo, Travis Wade, Mike Walsh, Changliang Wang, Renjing Wang, Thomas Zeume

    James Allan, Omar Alonso, Ismail Sengor Altingovde, Vo Ngoc Anh, Roi Blanco, Eric Breck, Eric Brown, Mark Carman, Carlos Castillo, Junghoo Cho, Aron Culotta, Doug Cutting, Meghana Deodhar, Susan Dumais, Johannes Furnkranz, Andreas Hes, Djoerd Hiemstra, David Hull, Thorsten Joachims, Siddharth Jonathan J. B., Jaap Kamps, Mounia Lalmas, Amy Langville, Nicholas Lester, Dave Lewis, Stephen Liu, Daniel Lowd, Yosi Mass, Jeff Michels, Alessandro Moschitti, Amir Najmi, Marc Najork, Giorgio Maria Di Nunzio, Paul Ogilvie, Priyank Patel, Jan Pedersen, Kathryn Pedings, Vassilis Plachouras, Daniel Ramage, Stefan Riezler, Michael Schiehlen, Helmut Schmid, Falk Nicolas Scholer, Sabine Schulte im Walde, Fabrizio Sebastiani, Sarabjeet Singh, Alexander Strehl, John Tait, Shivakumar Vaithyanathan, Ellen Voorhees, Gerhard Weikum, Dawid Weiss, Yiming Yang, Yisong Yue, Jian Zhang, Justin Zobel

    Pavel Berkhin, Stefan Buttcher, Jamie Callan, Byron Dom, Torsten Suel, Andrew Trotman

    1314 15 Ray Mooney

    Ray Mooney 3

    Ray Mooney

    C. D. Manning

    P. Raghavan Yahoo!

    H. Schutze

    http://informationretrieval.org

    Email [email protected]


    Snippet XML

    Prabhakar Raghavan

    [email protected]

    http://ir.ict.ac.cn/~wangbin/iir-book/

    http://nlp.stanford.edu/IR-book/information-retrieval-book.html


    1

    1.1 4

    1.2 8

    1.3 11

    1.4 15

    1.5 18

    21

    2.1 22

    2.1.1 22

    2.1.2 23

    2.2 25

    2.2.1 25

    2.2.2 30

    2.2.3 31

    2.2.4 35

    2.3 39

    2.4 41

    2.4.1 42

    2.4.2 43

    2.4.3 46

    2.5 48


    51

    3.1 52

    3.2 55

    3.2.1 56

    3.2.2 k-gram 57

    3.3 59

    3.3.1 59

    3.3.2 60

    3.3.3 61

    3.3.4 k-gram 63

    3.3.5 64

    3.4 66

    3.5 67

    69

    4.1 70

    4.2 72

    4.3 75

    4.4 77

    4.5 80

    4.6 83

    4.7 86

    89

    5.1 91

    5.1.1 Heaps 93

    5.1.2 Zipf 94

    5.2 95

    5.2.1 96

    5.2.2 97


    5.3 100

    5.3.1 101

    5.3.2 103

    5.4 111

    115

    6.1 116

    6.1.1 118

    6.1.2 120

    6.1.3 g 122

    6.2 123

    6.2.1 124

    6.2.2 tf-idf 125

    6.3 126

    6.3.1 127

    6.3.2 130

    6.3.3 131

    6.4 tf-idf 133

    6.4.1 tf 133

    6.4.2 tf 133

    6.4.3 134

    6.4.4 135

    6.5 139

    141

    7.1 142

    7.1.1 K 143

    7.1.2 144

    7.1.3 145

    7.1.4 145

    7.1.5 147


    7.1.6 147

    7.2 149

    7.2.1 150

    7.2.2 150

    7.2.3 151

    7.2.4 152

    7.3 153

    7.3.1 154

    7.3.2 154

    7.3.3 155

    7.4 155

    157

    8.1 158

    8.2 160

    8.3 161

    8.4 165

    8.5 171

    8.6 175

    8.6.1 175

    8.6.2 176

    8.6.3 177

    8.7 177

    8.8 180

    183

    9.1 185

    9.1.1 Rocchio 188

    9.1.2 190

    9.1.3 191

    9.1.4 Web 193


    9.1.5 193

    9.1.6 194

    9.1.7 195

    9.1.8 195

    9.2 196

    9.2.1 196

    9.2.2 196

    9.2.3 198

    9.3 200

    XML 203

    10.1 XML 206

    10.2 XML 210

    10.3 XML 215

    10.4 XML 219

    10.5 XML 223

    10.6 225

    229

    11.1 230

    11.2 232

    11.2.1 1/0 232

    11.2.2 233

    11.3 233

    11.3.1 235

    11.3.2 237

    11.3.3 239

    11.3.4 240

    11.4 242

    11.4.1 242

    11.4.2 243


    11.4.3 Okapi BM25 244

    11.4.4 IR 246

    11.5 247

    249

    12.1 250

    12.1.1 250

    12.1.2 253

    12.1.3 254

    12.2 255

    12.2.1 IR 255

    12.2.2 256

    12.2.3 Ponte Croft 259

    12.3 262

    12.4 LM 263

    12.5 265

    267

    13.1 271

    13.2 273

    13.3 278

    13.4 NB 280

    13.5 286

    13.5.1 287

    13.5.2 χ² 290

    13.5.3 292

    13.5.4 292

    13.5.5 293

    13.6 294

    13.7 300


    303

    14.1 305

    14.2 Rocchio 307

    14.3 k 311

    14.4 316

    14.5 321

    14.6 323

    14.7 330

    333

    15.1 334

    15.2 341

    15.2.1 341

    15.2.2 343

    15.2.3 344

    15.2.4 347

    15.3 348

    15.3.1 349

    15.3.2 351

    15.4 ad hoc 355

    15.4.1 355

    15.4.2 357

    15.5 359

    363

    16.1 365

    16.2 368

    16.3 370

    16.4 K-means 374

    16.5 382


    16.6 387

    391

    17.1 393

    17.2 396

    17.3 403

    17.4 405

    17.5 407

    17.6 409

    17.7 410

    17.8 412

    17.9 414

    417

    18.1 418

    18.2 - SVD 422

    18.3 424

    18.4 LSI 427

    18.5 432

    Web 433

    19.1 434

    19.2 Web 436

    19.2.1 Web 438

    19.2.2 439

    19.3 441

    19.4 444

    19.5 446

    19.6 shingling 449

    19.7 454


    Web 455

    20.1 456

    20.1.1 456

    20.1.2 457

    20.2 457

    20.2.1 458

    20.2.2 DNS 462

    20.2.3 URL 463

    20.3 466

    20.4 467

    20.5 470

    473

    21.1 Web 474

    21.2 PageRank 476

    21.2.1 478

    21.2.2 PageRank 480

    21.2.3 PageRank 483

    21.3 Hub Authority 486

    21.4 492

    495

    531


    1


    Information Retrieval IR

    Web

    unstructured data

    structured

    data

    semistructured data

    Java threading

    clustering

    information retrieval retrieval information retrieval

    information

    retrieval

    search information retrievalsearch

    information retrieval


    classification

    Web web search

    Web

    personal information

    retrieval

    MacOS X Spotlight Windows Vista

    domain-specific search

    Web


    . .

    . .

    .

    Shakespeare's Collected Works

    Brutus Caesar Calpurnia

    Brutus

    Caesar Calpurnia

    grepping Unix

    grep grepping

    regular expression

    00

    ()

    () grep

    Romans NEAR countrymen NEAR

    ()

    index

    ,000

    Brutus Caesar Calpurnia Brutus Marcus Brutus

    Caesar

    Julius Caesar Calpurnia Calpurnia

    Pisonis


    incidence matrix -term

    . word

    I-Hong Kong

    - - td (t, d) 0

    Brutus AND Caesar AND NOT Calpurnia

    Brutus Caesar Calpurnia Calpurnia

    complement AND

    110100 AND 110111 AND 101111 = 100100

    Antony and Cleopatra, Hamlet

    term

    Antony and Cleopatra,

    Julius Caesar, The Tempest, Hamlet,

    Othello, Macbeth, Antony,

    Cleopatra

    term        Antony and  Julius  The      Hamlet  Othello  Macbeth  ...
                Cleopatra   Caesar  Tempest
    Antony      1           1       0        0       0        1
    Brutus      1           1       0        1       0        0
    Caesar      1           1       0        1       1        1
    Calpurnia   0           1       0        0       0        0
    Cleopatra   1           0       0        0       0        0
    mercy       1           0       1        1       1        1
    worser      1           0       1        1       1        0
    ...
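The query Brutus AND Caesar AND NOT Calpurnia can be evaluated directly on rows of the incidence matrix by bitwise operations. The following is a minimal Python sketch, not the book's own code. The 0/1 rows are assumed values: the digits are partly lost in this transcript, and the patterns below were chosen to agree with the zeros that survive. The names PLAYS, INCIDENCE and matching_plays are illustrative.

```python
# Column order of the incidence matrix; the leftmost bit is the first play.
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Assumed incidence rows (one bit per play) for the three query terms.
INCIDENCE = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}


def matching_plays(bits):
    """Return the titles of the plays whose bit is set in `bits`."""
    width = len(PLAYS)
    return [play for i, play in enumerate(PLAYS)
            if bits >> (width - 1 - i) & 1]


# Python integers are unbounded, so the complement is masked to six bits.
mask = (1 << len(PLAYS)) - 1
answer = (INCIDENCE["Brutus"]
          & INCIDENCE["Caesar"]
          & (~INCIDENCE["Calpurnia"] & mask))

print(format(answer, "06b"))   # 100100
print(matching_plays(answer))  # ['Antony and Cleopatra', 'Hamlet']
```

The explicit mask is the one design point worth noting: without it, the complement of a Python integer is negative, and the AND would still work here only by accident of the other operands being six bits wide.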


    pipeline leaks

    pipeline rupture

    effectiveness

    precision

    recall
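Precision and recall, the two effectiveness measures named here, have standard definitions: precision is the fraction of returned results that are relevant, and recall is the fraction of relevant documents that were returned. A small Python sketch under that standard definition follows; the function name and the example document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved docIDs.

    retrieved: docIDs returned by the system.
    relevant:  docIDs judged relevant to the information need.
    Both measures are reported as 0.0 when their denominator is empty.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 relevant overall, 2 in common.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 6])
print(p, r)
```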

    -

    0 00

    - 000 0 00

    0

    000 00

    -0 0 000000000 -

    .%-0 /000 0

    inverted

    index

    -

    dictionary vocabulary lexicon

    dictionary - vocabulary

    0 000

    dictionary - vocabulary


    posting

    postings list, inverted list

    postings -

    ID .

    ..

    -

    .

    ()

    Friends, Romans, countrymen. So let it be with Caesar ...

    () token

    tokenization

    Friends Romans countrymen So ...

    () friend roman countryman so ...

    ()

    ID ID

    ID

    token

    token

    Brutus

    Caesar

    Calpurnia

    0


    .

    sort-based indexing

    docID

    ID

    -

    -

    -

    document frequency

    docID

    ad hoc

    disk

    singly linked list

    . skip list

    variable length array

    Unix Unix sort uniq


    - ID

    ID

    term frequency

    doc ID doc ID

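    Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

    Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:

    term docID (in order of occurrence):

    I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2

    term docID (sorted by term, then docID):

    ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, you 2, was 1, was 2, with 2

    term (document frequency) -> postings list:

    ambitious (1) -> 2
    be (1) -> 2
    brutus (2) -> 1, 2
    capitol (1) -> 1
    caesar (2) -> 1, 2
    did (1) -> 1
    enact (1) -> 1
    hath (1) -> 2
    I (1) -> 1
    i' (1) -> 1
    it (1) -> 2
    julius (1) -> 1
    killed (1) -> 1
    let (1) -> 2
    me (1) -> 1
    noble (1) -> 2
    so (1) -> 2
    the (2) -> 1, 2
    told (1) -> 2
    you (1) -> 2
    was (2) -> 1, 2
    with (1) -> 2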
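The sort-based construction illustrated by the two short documents above (list term/docID pairs, sort them, then merge duplicates into a dictionary entry with a document frequency and a postings list) can be sketched in a few lines of Python. This is an illustrative sketch only: the tokenizer is an assumption (split on whitespace, strip a few punctuation marks, lowercase everything, so "I" becomes "i"), and DOCS and build_index are hypothetical names.

```python
from itertools import groupby

# The two example documents, keyed by docID.
DOCS = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; "
       "Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you "
       "Caesar was ambitious:",
}


def build_index(docs):
    """Map each term to (document frequency, sorted list of docIDs)."""
    # Step 1: emit one (term, docID) pair per token.
    pairs = []
    for doc_id, text in docs.items():
        for token in text.split():
            term = token.strip(".,:;").lower()  # assumed normalization
            if term:
                pairs.append((term, doc_id))
    # Step 2: sort by term, then by docID.
    pairs.sort()
    # Step 3: collapse duplicates; group each term's docIDs into postings.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        postings = sorted({doc_id for _, doc_id in group})
        index[term] = (len(postings), postings)
    return index


index = build_index(DOCS)
for term in ("brutus", "caesar", "capitol"):
    print(term, index[term])
```

Storing the document frequency next to each postings list costs little and is what later allows terms to be processed in order of increasing frequency.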


    cache

    offset

    traverse

    -

    disk seek

    1-1 [*] 1-3

    new home sales top forecasts

    home sales rise in july

    increase in home sales in july

    july new home sales rise

    1-2 [*]

    breakthrough drug for schizophrenia

    new schizophrenia drug

    new approach for treatment of schizophrenia

    new hopes for schizophrenia patients

    a.

    b. -

    1-3 [*] 1-2

    a. schizophrenia AND drug

    b. for AND NOT (drug OR approach)

    .

    simple conjunctive query

    Brutus AND Calpurnia -

    -

    () Brutus

    ()

    () Calpurnia

    ()


    () -

    - - Brutus Calpurnia

    intersection

    merge

    merge algorithm

    -

    -

    ID ID ID

    ID x y

    O(x + y) Θ(N) N

    Θ(·) O(·)

    Cormen et al.0

    Brutus

    Calpurnia

    0

    INTERSECT(p1, p2)
      answer ← ⟨ ⟩
      while p1 ≠ NIL and p2 ≠ NIL
      do if docID(p1) = docID(p2)
           then ADD(answer, docID(p1))
                p1 ← next(p1)
                p2 ← next(p2)
           else if docID(p1) < docID(p2)
                  then p1 ← next(p1)
                  else p2 ← next(p2)
      return answer
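The merge-style intersection of two postings lists sorted by docID runs in O(x + y) time for lists of lengths x and y. A minimal Python sketch follows, using plain lists and two indices in place of linked-list pointers. The example docIDs are illustrative values, since the figure's numbers did not survive in this transcript.

```python
def intersect(p1, p2):
    """Intersect two postings lists that are sorted by increasing docID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer at the smaller docID
        else:
            j += 1
    return answer


# Illustrative postings lists for Brutus and Calpurnia.
brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```

Each loop iteration advances at least one index, which is where the linear bound comes from; the sketch relies on both inputs being sorted by the same global docID order.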


    ID

    (Brutus OR Caesar) AND NOT Calpurnia -

    query optimization

    t

    Brutus AND Caesar AND Calpurnia -

    - -

    (Calpurnia AND Brutus) AND Caesar -

    (madding OR crowd) AND (ignoble OR strife) AND (killed OR slain) -

    OR AND

    AND


    -

    -

    bash

    -

    1-4 [*] O(x + y) x y Brutus

    Caesar

    a. Brutus AND NOT Caesar

    b. Brutus OR NOT Caesar

    1-5 [*]

    c. (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)

    INTERSECT(⟨t1, ..., tn⟩)
      terms ← SORTBYINCREASINGFREQUENCY(⟨t1, ..., tn⟩)
      result ← postings(first(terms))
      terms ← rest(terms)
      while terms ≠ NIL and result ≠ NIL
      do result ← INTERSECT(result, postings(first(terms)))
         terms ← rest(terms)
      return result
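For a conjunction of several terms, processing the terms in order of increasing document frequency keeps the intermediate result small, because it can never be longer than the shortest postings list seen so far. A hedged Python sketch of that ordering follows. The index contents are illustrative values, and intersect_terms and INDEX are hypothetical names; a term missing from the index is treated as having an empty postings list.

```python
def intersect(p1, p2):
    """Linear merge of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(terms, index):
    """AND together the postings of all terms, rarest term first."""
    if not terms:
        return []
    # Document frequency equals the length of the postings list here.
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, index.get(term, []))
    return result


# Illustrative index; the docIDs are assumptions, not data from the text.
INDEX = {
    "brutus":    [1, 2, 4, 11, 31, 45, 173, 174],
    "caesar":    [1, 2, 4, 5, 6, 16, 57, 132],
    "calpurnia": [2, 31, 54, 101],
}
print(intersect_terms(["brutus", "caesar", "calpurnia"], INDEX))  # [2]
```

With this ordering the query is evaluated as (Calpurnia AND Brutus) AND Caesar, matching the bracketing shown in the text.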


    1-6 [**] AND OR

    a. -

    b.

    c.

    1-7 [*]

    d. (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes)

    eyes

    kaleidoscope 00

    marmalade 0

    skies

    tangerine

    trees

    1-8 [*]

    e. friends AND romans AND (NOT countrymen)

    countrymen

    1-9 [**]

    1-10 [**] x OR y 1-6

    1-11 [**] x AND NOT y

    .

    ranked retrieval

    model .

    free text query

    P-norm


    0 0

    0

    AND, OR, NOT

    term proximityproximity

    -

    Westlaw Westlaw http://www.westlaw.com/

    0

    00 Westlaw

    Terms and Connectors

    Westlaw Natural Language

    Westlaw

    Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company

    Query: "trade secret" /s disclos! /s prevent /s employe!

    Information need: Requirements for disabled people to be able to access a workplace

    Query: disab! /p access! /s work-site work-place (employment /3 place)

    Information need: Cases about a host's responsibility for drunk guests

    Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest


    Web

    0 Web

    & AND, /s

    /p /k k

    phrase search .

    ! . liab!

    liab work-site worksitework-site

    work site ..

    Westlaw

    00

    Westlaw Turtle,

    Westlaw

    AND

    OR

    ()

    () operating system

    Westlaw

    Gates NEAR Microsoft

    ()


    ()

    ad hoc

    ad hoc Web

    Web

    Web

    1-12 [*] Westlaw professor

    teacher lecturer explain

    explain

    1-13 [*] burglar

    (i) burglar (ii) burglar AND burglar (iii) burglar OR burglar

    (i) knight (ii) conquer (iii) knight OR conquer

    .

    0 0 Cleverdon; Liddy00


    Bush

    memex

    memex

    Information Retrieval Calvin Mooers 0

    Mooers0

    IBM

    Taube and Wooster, H. P. Luhn

    Mooers

    George

    Boole

    ANDOR

    Lee and Fox,

    Witten Witten et al.

    Zobel and Moffat, 00

    Friedl 00 regular expression

    Hopcroft et al. 000


    A

    A/B test A/B 177

    Accents 32, 53

    Access control lists 84

    Accumulator 119, 132, 237

    Accuracy 162, 285, 294, 295,

    299, 373

    Active learning 196, 350,

    361

    Add-one smoothing 275,

    277, 278

    Ad hoc retrieval ad hoc 6, 196,

    264, 268, 282, 298, 315, 333, 334, 352,

    353, 355, 358, 361

    defined 2, 3, 4, 17, 22, 25, 26,

    29, 42, 52, 55, 61, 62, 63, 79, 85, 90,

    92, 94, 100, 104, 110, 118, 121, 125,

    127, 133, 145, 160, 161, 162, 163, 165,

    166, 167, 168, 170, 173, 180, 188, 195,

    204, 208, 211, 213, 215, 216, 217, 219,

    220, 221, 230, 234, 235, 237, 259, 262,

    264, 271, 272, 276, 280, 283, 284, 287,

    290, 292, 300, 301, 307, 308, 317, 318,

    319, 321, 324, 325, 327, 329, 334, 335,

    336, 341, 344, 345, 346, 352, 353, 358,

    359, 365, 368, 369, 371, 372, 374, 377,

    379, 382, 383, 386, 387, 392, 393, 396,

    397, 400, 404, 405, 407, 409, 412, 413,

    424, 427, 437, 438, 443, 446, 450, 451,

    465, 479, 480, 481, 487, 492

    evaluation of 7, 46, 68, 145,

    155, 157, 158, 159, 160, 161, 163, 165,

    167, 169, 170, 171, 173, 174, 175, 176,

    177, 180, 181, 182, 193, 194, 200, 203,

    206, 212, 219, 221, 222, 225, 226, 233,

    260, 267, 271, 294, 295, 297, 298, 301,

    324, 347, 353, 360, 363, 365, 369, 370,

    371, 372, 374, 386, 388, 415, 432, 445,

    449, 471

    machine learning methods

    22, 42, 152, 155, 174, 226, 298,

    333, 334, 347, 348, 350, 355, 361, 482

    Adjacency tables 468, 486

    Adjusted Rand index

    388

    Adversarial information retrieval

    441

    Akaike information criterion (AIC)

    381

    Algebra, linear, review

    Algorithmic search

    Anchor text 264, 411, 412,

    454, 459, 474, 475, 476, 483, 489, 492

    Any-of classification 322,

    323, 332

    Auxiliary index 81, 82, 85

    Average-link clustering

    403

    B

    Back queues 464, 465, 466

    Bag of words model

    Unigram language model 123, 124,

    127, 254, 282, 283, 324, 355

    defined 2, 3, 4, 17, 22, 25, 26,

    29, 42, 52, 55, 61, 62, 63, 79, 85, 90, 92, 94, 100, 104, 110, 118, 121, 125,

    127, 133, 145, 160, 161, 162, 163, 165,

    166, 167, 168, 170, 173, 180, 188, 195,

    204, 208, 211, 213, 215, 216, 217, 219,

    220, 221, 230, 234, 235, 237, 259, 262,

    264, 271, 272, 276, 280, 283, 284, 287,

    290, 292, 300, 301, 307, 308, 317, 318,

    319, 321, 324, 325, 327, 329, 334, 335,

    336, 341, 344, 345, 346, 352, 353, 358,


    359, 365, 368, 369, 371, 372, 374, 377,

    379, 382, 383, 386, 387, 392, 393, 396,

    397, 400, 404, 405, 407, 409, 412, 413,

    424, 427, 437, 438, 443, 446, 450, 451, 465, 479, 480, 481, 487, 492

    Balanced F measure F F

    measure F 163

    Bayes error rate 316, 331

    Bayesian networks 230, 246,

    247

    Bayesian prior 238, 240

    Bayesian smoothing 265

    Bayes Optimal Decision Rule 232

    Bayes risk 233

    Bayes Rule 231, 234, 280

    Bernoulli model 259, 260,

    267, 278, 279, 280, 281, 282, 283, 284,

    285, 287, 289, 290, 292, 300, 383, 384

    Best-merge persistence

    400, 402, 404, 405, 413

    Bias 254, 264, 286, 303, 305, 323, 324, 325, 326, 327, 328, 331, 335,

    349

    Bias-variance tradeoff -

    254, 286, 327, 335

    Biclustering 389

    Bigram language model

    Binary Independence Model (BIM)

    Binary search tree

    Biword indexes 42, 43,

    44, 46

    Blind relevance feedback

    184, 194

    Blocked sort-based indexing algorithm

    BSBI 73,

    74, 75, 76, 77, 78, 86

    Blocked storage described

    Blogs

    BM25 weights BM25 230, 242,

    244

    Boolean retrieval 1, 4, 6, 11, 15, 17, 19, 28, 38, 83, 113, 119, 123,

    154, 204

    model 4, 6, 9, 11, 15, 16, 17,

    18, 19, 29, 33, 67, 68, 87, 92, 94, 105,

    108, 111, 115, 116, 123, 124, 126, 127,

    141, 142, 149, 151, 152, 153, 154, 178,

    182, 188, 189, 191, 192, 200, 203, 204,

    205, 206, 207, 215, 222, 223, 224, 226,

    229, 230, 231, 232, 233, 234, 235, 237, 238, 239, 240, 242, 243, 244, 246, 249,

    250, 251, 252, 253, 254, 255, 256, 257,

    139, 247, 258, 257, 258, 259, 260, 261,

    262, 263, 264, 265, 267, 273, 274, 277,

    278, 279, 280, 281, 282, 283, 284, 285,

    286, 287, 289, 290, 292, 293, 300, 301,

    303, 304, 305, 306, 323, 324, 326, 327,

    328, 334, 335, 336, 338, 339, 343, 349,

    355, 359, 363, 368, 380, 381, 382, 383, 384, 386, 387, 389, 415, 432, 433, 440,

    441, 442

    principles 210, 229, 230, 232,

    233, 265, 291, 406

    query processing 9, 15,

    17, 28, 30, 31, 38, 57, 82, 83, 99, 108,

    113, 139, 147, 148, 151, 155, 466, 467,

    470

    ranked retrieval vs.

    tokenization 8, 22, 25, 26,

    27, 28, 34, 42, 95, 152, 230, 314

    vector space model interactions

    Boosting

    Bottom-up clustering

    . hierarchical agglomerative

    clustering (HAC)


    393, 410

    Bowtie structure 439

    Break-even point

    168

    BSBI (blocked sort-based indexing

    algorithm)

    73, 74, 75, 76, 77, 78, 86

    Buckshot algorithm Buckshot

    378, 414

    Buffer 71, 74, 181

    C

    Caching 71, 90, 109

    compression and in search systems 90

    variable length arrays and

    Capitalization 32, 33

    Capture-recapture method

    Cardinality in clustering

    Case-folding 33, 73, 91,

    92, 93, 100

    CAS topics CAS 220, 222, 224

    Category 3, 22, 25, 161, 162,

    172, 196, 238, 256, 268, 269, 270, 271,

    272, 273, 274, 275, 277, 279, 280, 281,

    282, 283, 284, 286, 287, 288, 289, 290,

    291, 292, 293, 294, 295, 296, 297, 298,

    299, 300, 304, 305, 307, 308, 309, 310,

    311, 312, 313, 314, 316, 317, 318, 319,

    320, 321, 322, 323, 324, 325, 326, 327,

    328, 329, 330, 332, 334, 335, 336, 339,

    340, 343, 344, 348, 349, 350, 351, 352,

    357, 358, 360, 364, 371, 372, 379, 386,

    388, 413, 435, 444, 445

    Centroid-based classification

    331

    Centroids 188, 189, 190, 192,

    304, 306, 307, 308, 309, 310, 311, 314,

    327, 329, 331, 369, 374, 375, 376, 377,

    378, 379, 381, 382, 384, 386, 388, 391,

    392, 396, 400, 401, 404, 405, 406, 407,

    409, 411, 412, 413, 414, 415, 432

    HAC 392, 393, 395, 396, 399, 400, 401, 403, 404, 405,

    406, 408, 409, 410, 412, 413, 414

    Rocchio classification Rocchio

    307, 308, 309, 310, 311, 316, 329, 331,

    374

    Chaining in clustering

    399

    Chain rule 231

    Champion lists 145, 146, 149, 155

    Character sequence decoding

    21, 22, 23

    2

    feature selection 2 291,

    292

    Chinese 12, 29, 35, 48, 49, 67,

    100, 124, 129, 142, 146, 147, 150, 174,

    178, 207, 215, 244, 276, 305, 307, 319,

    327, 334, 358, 368, 374, 407, 430, 453

    Class boundary 311, 319,

    320

    Classes 3, 22, 25, 161, 162, 172,

    196, 238, 256, 268, 269, 270, 271, 272,

    273, 274, 275, 277, 279, 280, 281, 282,

    283, 284, 286, 287, 288, 289, 290, 291,

    292, 293, 294, 295, 296, 297, 298, 299,

    300, 304, 305, 307, 308, 309, 310, 311,

    312, 313, 314, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328,

    329, 330, 332, 334, 335, 336, 339, 340,

    343, 344, 348, 349, 350, 351, 352, 357,

    358, 360, 364, 371, 372, 379, 386, 388,

    413, 435, 444, 445

    maximum a posteriori

    Classification Text

    classification 3, 22, 25, 26, 27, 28,

    126, 134, 158, 159, 160, 161, 162, 184,


    191, 197, 230, 260, 267, 268, 269, 270,

    271, 272, 273, 275, 276, 277, 278, 279,

    280, 282, 284, 285, 286, 287, 288, 289,

    290, 292, 293, 294, 295, 296, 297, 298, 299, 300, 303, 304, 305, 306, 307, 308,

    309, 310, 311, 312, 313, 314, 315, 316,

    317, 301, 318, 317, 318, 319, 320, 321,

    322, 323, 324, 325, 326, 327, 328, 329,

    330, 331, 332, 333, 334, 335, 336, 337,

    339, 340, 341, 342, 343, 344, 345, 347,

    348, 349, 350, 351, 352, 353, 354, 355,

    356, 357, 358, 360, 361, 364, 370, 374,

    380, 382, 384, 412, 427, 435, 436, 454, 471, 483

    any-of 272, 305, 321, 322,

    323, 332

    centroid-based 304, 331

    kNN k nearest neighbor

    classification (kNN) 284, 297, 298,

    301, 304, 305, 306, 311, 312, 313, 314,

    315, 316, 320, 321, 323, 326, 327, 329,

    331, 343, 348, 349, 412

    multivalue 321

    one-of 272, 299, 305, 321,

    322, 323, 331

    one-versus-all 343

    Rocchio Rocchio

    Rocchio classification 307, 308,

    309, 310, 311, 316, 329, 331, 374

    Classification function 271,

    272, 304, 321, 339, 340

    Classifiers 22, 162, 191, 230,

    268, 270, 271, 272, 275, 276, 278, 279,

    284, 285, 286, 287, 289, 292, 293, 294,

    295, 296, 297, 298, 299, 300, 301, 303,

    305, 306, 310, 311, 312, 314, 315, 316,

    317, 318, 319, 320, 321, 322, 323, 324,

    325, 326, 327, 328, 329, 330, 331, 332,

    334, 335, 336, 337, 339, 340, 342, 343,

    344, 345, 347, 349, 350, 351, 352, 353,

    354, 355, 356, 357, 358, 360, 361

    choosing 18, 23, 25, 26, 30, 34,

    36, 37, 40, 44, 52, 57, 60, 62, 64, 73,

    74, 78, 81, 83, 85, 86, 106, 111, 117, 122, 128, 130, 131, 144, 146, 148, 149,

    154, 164, 168, 172, 173, 174, 177, 178,

    181, 186, 196, 208, 210, 218, 224, 226,

    245, 256, 262, 267, 271, 273, 274, 280,

    282, 283, 284, 285, 286, 287, 288, 289,

    290, 291, 292, 293, 298, 300, 301, 304,

    312, 314, 318, 319, 323, 325, 327, 331,

    335, 343, 344, 347, 349, 350, 351, 352,

    354, 357, 361, 367, 370, 375, 376, 378, 379, 380, 381, 382, 383, 386, 389, 392,

    393, 395, 396, 403, 406, 409, 410, 411,

    413, 414, 415, 435, 437, 438, 440, 441,

    446, 447, 448, 449, 453, 457, 459, 464,

    465, 466, 469, 470, 474, 475, 477, 478,

    484, 487, 489, 492

    performance improving

    3, 9, 10, 11, 13, 18, 37, 39, 40, 46,

    48, 49, 58, 62, 77, 90, 108, 109, 112, 131, 135, 139, 149, 165, 173, 179, 182,

    185, 186, 189, 190, 192, 193, 194, 195,

    196, 198, 199, 219, 223, 238, 265, 271,

    285, 286, 288, 289, 315, 322, 334, 335,

    343, 347, 350, 351, 352, 353, 354, 360,

    366, 367, 368, 388, 412, 430, 431, 436,

    440, 444, 448, 449, 450, 457, 474, 476,

    493

    two-class 3, 101, 135, 139, 172, 178, 184, 268, 281, 294, 296, 307,

    309, 310, 317, 319, 320, 326, 327, 331,

    335, 337, 339, 343, 346, 373, 386, 435,

    456, 487

    CLEF collection CLEF

    Click spam 443

    Clickstream mining 177,

    195

    Clickthrough log analysis


    177, 181

    Cliques 246, 398

    Cloaking, in spamming

    Cluster-based classification 331

    Cluster hypothesis 365, 367,

    368, 387

    Clustering 2, 3, 26, 126, 147,

    148, 265, 309, 352, 363, 364, 365, 366,

    367, 368, 369, 370, 371, 372, 373, 374,

    376, 377, 378, 379, 380, 381, 382, 383,

    384, 385, 386, 387, 388, 389, 391, 392,

    393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408,

    409, 410, 411, 412, 413, 414, 415, 418,

    427, 431, 432, 453, 467

    average-link 403

    cardinality in 30, 110, 181, 192,

    223, 226, 232, 242, 285, 286, 301, 314,

    334, 379, 434

    centroid-based 304, 331

    chaining in 399, 409

    complete-link HAC HAC

    divisive 391, 392, 409, 410,

    415

    exclusive vs. exhaustive

    flat Flat clustering

    363, 364, 365, 368, 370, 374, 379, 392,

    394, 395, 409, 410, 413, 414

    group-average agglomerative 391, 400, 403

    hard 19, 40, 69, 70, 71, 75, 77,

    78, 87, 90, 96, 112, 163, 179, 260, 331,

    347, 365, 368, 369, 381, 383, 384, 386,

    437, 466, 490

    hierarchical Hierarchical

    clustering 150, 151, 152, 153, 210,

    221, 272, 351, 360, 391, 392, 394, 395,

    407, 409, 410, 415, 434, 435, 446

    minimum variance 414

    model-based 264, 363,

    382, 383, 415

    optimal 13, 15, 40, 103, 104, 105, 110, 112, 188, 232, 233, 245, 259,

    277, 293, 298, 301, 314, 316, 323, 324,

    325, 331, 334, 339, 340, 342, 370, 376,

    378, 379, 380, 386, 389, 391, 392, 407,

    408, 409, 414, 469

    overview 19, 455, 456

    single-link HAC HAC

    spectral 415

    top-down 211, 392, 393, 409, 410

    Clusters 70, 77, 80, 466

    pruning 155

    Co-clustering 389

    Collections 2, 4, 6, 9, 11, 12,

    15, 17, 18, 25, 30, 34, 44, 45, 47, 52,

    58, 60, 65, 70, 72, 73, 74, 75, 77, 78,

    79, 80, 81, 85, 90, 91, 92, 93, 94, 95,

    96, 97, 98, 99, 100, 101, 106, 108, 109, 111, 112, 116, 117, 119, 121, 124, 125,

    128, 129, 130, 131, 135, 136, 137, 138,

    144, 148, 151, 152, 158, 159, 160, 161,

    162, 165, 167, 168, 169, 170, 171, 173,

    174, 175, 176, 178, 181, 184, 185, 188,

    189, 191, 192, 194, 196, 198, 199, 200,

    204, 207, 208, 210, 212, 213, 214, 216,

    217, 219, 220, 224, 225, 232, 234, 235,

    237, 238, 239, 240, 241, 242, 244, 246, 255, 256, 258, 259, 260, 261, 262, 264,

    268, 271, 277, 288, 292, 293, 296, 298,

    300, 304, 307, 311, 312, 314, 315, 328,

    358, 359, 365, 366, 367, 368, 370, 371,

    373, 377, 383, 388, 392, 395, 397, 400,

    409, 412, 413, 415, 418, 424, 427, 428,

    431, 434, 457, 467

    clustering 2, 3, 26, 126, 147,

    148, 265, 309, 352, 363, 364, 365, 366,


    367, 368, 369, 370, 371, 372, 373, 374,

    376, 377, 378, 379, 380, 381, 382, 383,

    384, 385, 386, 387, 388, 389, 391, 392,

    393, 394, 395, 396, 397, 398, 399, 400, 401, 402, 403, 404, 405, 406, 407, 408,

    409, 410, 411, 412, 413, 414, 415, 418,

    427, 431, 432, 453, 467

    frequency 9, 10, 13, 14, 15,

    18, 30, 43, 45, 46, 47, 53, 65, 71, 72,

    79, 83, 86, 91, 92, 94, 95, 96, 97, 100,

    104, 106, 108, 113, 115, 123, 124, 125,

    129, 131, 132, 133, 134, 135, 138, 139,

    149, 155, 158, 176, 177, 178, 190, 192, 213, 217, 218, 219, 234, 238, 239, 241,

    244, 245, 246, 257, 258, 261, 262, 263,

    274, 275, 279, 285, 286, 290, 292, 293,

    294, 304, 315, 359, 372, 412, 415, 424,

    439, 448, 456, 457, 479, 481

    residual defined

    statistics 7, 9, 83, 85, 91,

    116, 191, 198, 213, 218, 219, 220, 256,

    355, 467

    large 96, 111

    Combination schemes

    46

    Combination similarity

    393, 394, 398, 407, 408, 409

    Complete-linkage clustering

    391, 396, 397, 398, 399, 402, 403,

    404, 408

    Complete-link clustering 406

    Component coverage 220,

    221

    Compound nouns 28, 360

    Compound-splitter 28

    Compression 7, 14, 23, 30, 40,

    44, 45, 70, 71, 73, 75, 77, 83, 86, 89,

    90, 91, 92, 94, 95, 96, 97, 98, 99, 100,

    101, 102, 97, 98, 99, 100, 101, 102,

    103, 104, 105, 106, 107, 108, 109, 110,

    111, 112, 109, 112, 113, 99, 169, 468,

    104, 105, 106, 107, 108, 112, 109, 110,

    111, 112, 113, 103

    of dictionaries 89, 91, 94,

    95, 98, 100, 109, 112

    of docIDs

    lossless/lossy

    parameter-free

    parameterized

    of postings list 91

    Compression/indexes

    Heaps' law 93, 111, 315

    overview 19, 455, 456

    Zipf's law 94, 95, 106,

    108, 111, 439

    Concept drift 284, 285, 298,

    301

    Conditional independence assumption

    235, 281, 282, 283

    Confusion matrix 322, 323,

    386

    Connected components 398

    Connectivity queries 467,

    468, 470

    Connectivity servers 455,

    467, 471

    Content management systems

    70, 87

    Content seen module

    461

    Context, XML 216, 217,

    218

    Context resemblance

    216, 219

    Contiguity hypothesis 304,

    311, 365

    Continuation bit 101, 102

    Corpus 6, 30, 72, 73, 74, 75,

    102, 161, 294

    Cosine similarity 127,

    128, 129, 130, 138, 139, 142, 143, 144,

    145, 147, 148, 150, 188, 217, 243, 306,

    309, 313, 329, 355, 356, 369, 387, 393,

    404, 427, 428, 476

    CO topics 220, 224

    CPC (cost per click)

    442

    CPM (cost per mil)

    441

    Cranfield collection

    Cross-entropy 264

    Cross-language information retrieval 161, 432, 490

    Cumulative gain 169, 181

    D

    Databases 2, 70, 87, 204, 205,

    206, 223, 224, 226, 227, 437

    communication with

    relational 204, 223,

    224, 226, 227

    δ-codes 103, 110, 112

    Decision boundaries 307,

    317, 319, 320, 326, 327, 328, 329, 334,

    336, 337, 339, 341, 347, 357, 359

    defined 2, 3, 4, 17, 22, 25, 26,

    29, 42, 52, 55, 61, 62, 63, 79, 85, 90,

    92, 94, 100, 104, 110, 118, 121, 125,

    127, 133, 145, 160, 161, 162, 163, 165,

    166, 167, 168, 170, 173, 180, 188, 195,

    204, 208, 211, 213, 215, 216, 217, 219,

    220, 221, 230, 234, 235, 237, 259, 262,

    264, 271, 272, 276, 280, 283, 284, 287,

    290, 292, 300, 301, 307, 308, 317, 318,

    319, 321, 324, 325, 327, 329, 334, 335,

    336, 341, 344, 345, 346, 352, 353, 358,

    359, 365, 368, 369, 371, 372, 374, 377,

    379, 382, 383, 386, 387, 392, 393, 396,

    397, 400, 404, 405, 407, 409, 412, 413,

    424, 427, 437, 438, 443, 446, 450, 451,

    465, 479, 480, 481, 487, 492

    Decision hyperplanes 305,

    317, 319, 320, 335, 336

    Decision trees 297, 300, 334,

    350

    Dendrograms 393, 394, 397,

    398, 406, 413

    complete-link clustering

    391, 396, 397, 398, 399, 402, 403,

    404, 408

    described 61, 142, 144, 160,

    168, 198, 204, 205, 207, 209, 222, 224,

    231, 242, 260, 275, 285, 301, 310, 320,

    322, 324, 349, 363, 367, 368, 378, 381,

    396, 404, 412, 432, 454, 468, 471, 474,

    475, 476, 491

    Development sets 298, 349

    Development test collection

    159, 245

    Diacritics 32, 33

    Dice coefficient 170

    Dictionaries 7, 8, 9, 10, 11, 13,

    17, 21, 22, 25, 26, 28, 29, 31, 34, 35,

    38, 51, 52, 53, 54, 55, 57, 58, 59, 60,

    75, 76, 77, 80, 86, 89, 90, 91, 92, 94,

    95, 96, 97, 98, 48, 67, 97, 98, 99, 100,

    109, 112, 117, 118, 125, 143, 145, 154,

    184, 196, 197, 198, 199, 200, 201, 254,

    260, 264, 283, 448, 466, 468, 469

    compression of 89, 91, 94, 95, 98, 100, 109, 112

    in inverted indexes

    52

    search structures for

    Differential cluster labeling

    410

    Digital libraries 204

    Discrete-time stochastic processes

    478

    Disk seek 11, 81

    Distortion 380, 381, 395

    Distributed crawling 461,

    466, 471

    Distributed index 69, 70,

    77, 78, 80, 86, 455, 456, 466, 471

    Distributed indexing

    Distributed information retrieval

    Divisive clustering 391,

    409, 410, 415

    DNS resolution 462, 463,

    471

    DNS resolution module

    DNS server 462, 463

    DocIDs 9, 10, 41, 45, 79, 82,

    83, 92, 100, 101, 102, 109, 112, 113,

    121, 146, 147, 150, 151, 261, 276, 299,

    356, 385

    compression of

    in inverted indexes

    in postings list intersection operations

    Document-at-a-time scoring

    Document collection

    Collections 2, 4, 6, 9, 11, 12, 15, 17,

    18, 25, 30, 34, 44, 45, 47, 52, 58, 60,

    65, 70, 72, 73, 74, 75, 77, 78, 79, 80,

    81, 85, 90, 91, 92, 93, 94, 95, 96, 97,

    98, 99, 100, 101, 106, 108, 109, 111,

    112, 116, 117, 119, 121, 124, 125, 128,

    129, 130, 131, 135, 136, 137, 138, 144,

    148, 151, 152, 158, 159, 160, 161, 162,

    165, 167, 168, 169, 170, 171, 173, 174,

    175, 176, 178, 181, 184, 185, 188, 189,

    191, 192, 194, 196, 198, 199, 200, 204,

    207, 208, 210, 212, 213, 214, 216, 217,

    219, 220, 224, 225, 232, 234, 235, 237,

    238, 239, 240, 241, 242, 244, 246, 255,

    256, 258, 259, 260, 261, 262, 264, 268,

    271, 277, 288, 292, 293, 296, 298, 300,

    304, 307, 311, 312, 314, 315, 328, 358,

    359, 365, 366, 367, 368, 370, 371, 373,

    377, 383, 388, 392, 395, 397, 400, 409,

    412, 413, 415, 418, 424, 427, 428, 431,

    434, 457, 467

    Document likelihood model

    263

    Document-partitioned index

    77

    Documents 15, 37, 60, 151,

    178, 204, 207, 216, 223, 227, 268, 270,

    275, 431, 434, 476

    character sequence decoding

    classification of

    Text classification 126, 330, 350,

    353

    defined

    delineation of 21, 22, 35, 37, 38,

    42, 47, 75, 78, 79, 80, 86, 92, 125, 139,

    151, 152, 153, 163, 171, 173, 177, 181,

    194, 197, 198, 199, 201, 205, 239, 246,

    301, 322, 330, 346, 388, 389, 410, 420,

    421, 440, 441, 454, 457, 458, 459, 462,

    466, 467, 473, 474, 475, 476, 489, 490,

    492, 493

    frequency defined

    function notations

    partitioning 38, 49, 73, 77, 78,

    79, 80, 211, 256, 369, 370, 371, 409,

    410, 415, 463

    relevant, retrieving

    unit, choosing

    vector, defined 327

    Document space 234, 271,

    280, 304, 352

    Document zones 123, 353,

    355

    Doorway pages 440

    Dot products 127, 128, 130, 131,

    136, 306, 308, 329, 339, 342, 343, 345,

    346, 402, 403, 404, 412

    described 61, 142, 144, 160,

    168, 198, 204, 205, 207, 209, 222, 224,

    231, 242, 260, 275, 285, 301, 310, 320,

    322, 324, 349, 363, 367, 368, 378, 381,

    396, 404, 412, 432, 454, 468, 471, 474,

    475, 476, 491

    in SVMs

    Duplicate elimination modules 459

    Dynamic indexing 69, 70,

    80, 83, 105, 467

    Dynamic summary 178, 179,

    180, 181

    E

    East Asian languages

    Chinese; Japanese 28, 48, 161

    Edit distance 59, 61, 62, 63,

    64, 65, 67

    Effectiveness 7, 28, 33, 37, 38,

    40, 42, 46, 48, 49, 68, 92, 110, 111,

    112, 127, 135, 158, 160, 161, 162, 163,

    166, 168, 173, 177, 180, 181, 182, 185,

    189, 190, 191, 192, 193, 194, 199, 200,

    219, 222, 224, 232, 247, 257, 258, 260,

    261, 262, 263, 272, 279, 283, 284, 289,

    290, 292, 293, 295, 296, 297, 298, 301,

    313, 315, 317, 320, 322, 323, 328, 331,

    334, 347, 349, 350, 351, 352, 353, 354,

    360, 361, 364, 365, 367, 368, 370, 381,

    388, 399, 420, 431, 458, 490

    assessment of 27, 34, 157,

    158, 159, 160, 168, 170, 171, 172, 173,

    174, 181, 189, 190, 191, 193, 194, 195,

    196, 212, 220, 230, 269, 288, 289, 292,

    296, 304, 312, 317, 351, 355, 357, 358,

    359, 370, 371

    text classification 158,

    160, 161, 260, 267, 268, 269, 270, 271,

    272, 273, 276, 277, 280, 284, 285, 286,

    292, 293, 294, 295, 297, 298, 300, 303,

    301, 304, 301, 303, 304, 305, 309, 310,

    311, 315, 316, 319, 323, 324, 325, 328,

    331, 333, 334, 341, 343, 347, 348, 350,

    351, 352, 353, 354, 355, 360, 361, 382

    Efficiency 9, 10, 13, 15, 31, 39,

    46, 49, 62, 72, 77, 82, 103, 105, 109,

    110, 112, 131, 135, 149, 190, 192, 195,

    204, 271, 286, 293, 295, 296, 305, 315,

    329, 350, 352, 365, 368, 378, 388, 392,

    402, 410, 412, 457

    Eigen decomposition 421,

    422

    Eigenvalues 418, 419, 420,

    421, 422, 423, 425, 427, 478, 480, 488

    11-point interpolated average precision

    166

    Email 27

    document units 23, 25,

    210

    sorting 3, 4, 9, 13, 19, 25, 33,

    72, 79, 117, 160, 213, 269, 350, 360,

    399, 446, 466

    EM algorithm 259, 365,

    382, 383, 384, 385, 386, 387, 389, 392

    Enterprise resource planning

    87, 205

    Enterprise search 70, 87, 95

    Entropy 104, 105, 106, 112, 264,

    300, 372, 373

    Equivalence classes 22, 31, 32,

    34, 37

    Ergodic Markov Chain

    479, 480, 484

    Euclidean distance

    138

    Euclidean length 127

    Evaluation of retrieval systems

    7, 160

    A/B test 177

    ad hoc 6, 196, 264,

    268, 282, 298, 315, 333, 334, 352, 353,

    355, 358, 361

    clustering 2, 3, 26, 126, 147,

    148, 265, 309, 352, 363, 364, 365, 366,

    367, 368, 369, 370, 371, 372, 373, 374,

    376, 377, 378, 379, 380, 381, 382, 383,

    384, 385, 386, 387, 388, 389, 391, 392,

    393, 394, 395, 396, 397, 398, 399, 400,

    401, 402, 403, 404, 405, 406, 407, 408,

    409, 410, 411, 412, 413, 414, 415, 418,

    427, 431, 432, 453, 467

    F measure 163, 164, 165, 168,

    180, 221, 371, 374, 388

    interpolated precision

    165, 166, 167, 170, 171

    kappa statistic 172,

    173, 175, 181

    keyword-in-context snippets

    MAP 167

    marginal relevance 222

    normalized discounted cumulative gain

    169

    overview 19, 455, 456

    pooling 168, 171, 181, 296

    precision at k

    precision-recall curve

    165, 166, 167, 168, 169, 193

    probabilistic information retrieval

    229, 234, 242, 243, 247, 229,

    234, 242, 243, 247, 265

    ranked sets

    relevance assessment

    157, 158, 159, 160, 168, 170, 171, 172,

    173, 174, 181, 194, 212, 220, 355, 357,

    359

    relevance feedback

    177, 183, 184, 185, 186, 187,

    188, 189, 190, 191, 192, 193, 194, 195,

    200, 193, 196, 193, 200, 184, 185, 186,

    187, 188, 196, 188, 189, 190, 191, 192,

    193, 194, 193, 194, 195, 196, 200, 230,

    234, 236, 239, 240, 241, 242, 244, 246,

    230, 234, 236, 195, 240, 241, 242, 244,

    246, 230, 234, 236, 239, 240, 241, 242,

    244, 246, 230, 234, 236, 239, 240, 241,

    242, 244, 246, 262, 263, 264, 262, 263,

    264, 262, 263, 264, 262, 263, 239, 309,

    310, 311, 330, 309, 310, 311, 330, 309,

    310, 311, 330, 309, 310, 311, 330, 194,

    195, 196, 200, 194, 195, 264

    results snippets 157, 177,

    178, 179, 180, 181

    ROC curve 169, 361

    R-precision 168, 169,

    170, 180, 181

    sensitivity 169

    specificity

    summarization, static vs. dynamic

    system quality/user utility

    test collections, standard

    157, 158, 160

    text classification 158,

    160, 161, 260, 267, 268, 269, 270, 271,

    272, 273, 276, 277, 280, 284, 285, 286,

    292, 293, 294, 295, 297, 298, 300, 303,

    301, 304, 301, 303, 304, 305, 309, 310,

    311, 315, 316, 319, 323, 324, 325, 328,

    331, 333, 334, 341, 343, 347, 348, 350,

    351, 352, 353, 354, 355, 360, 361, 382

    text summarization 152,

    158, 180, 195, 415

    unranked sets 224

    XML retrieval 203, 205,

    206, 209, 210, 212, 213, 215, 219, 221,

    222, 223, 224, 219, 226, 205, 206, 209,

    222, 212, 213, 215, 219, 221, 222, 223,

    224, 226, 227, 262, 221, 222, 223, 224,

    226, 227, 224, 226, 227

    Evidence accumulation

    Exclusive clustering

    Exhaustive clustering 369

    Expectation-Maximization (EM)

    algorithm 259, 365,

    382, 383, 384, 385, 386, 387, 389, 392

    Expectation step 383, 384, 387

    Expected edge density

    388

    Extended query 196, 214,

    216

    Extensible Markup Language

    XML 205

    External criterion of quality

    External sorting algorithm 73, 75

    F

    False negative 162, 371, 386

    False positive 162, 163, 169,

    371, 386

    Feature engineering 331,

    352, 354, 359

    Feature selection/text classification

    χ² 286, 290

    frequency-based 292, 293

    method comparison

    multiple classifiers 343

    mutual information 286,

    287, 288, 289, 292, 294, 300, 301, 352,

    354, 371, 372, 386, 411

    noise feature 284, 286,

    287, 316, 319

    overfitting 286, 356

    overview 19, 455, 456

    in performance improvement

    350, 351

    statistical significance

    180, 260, 292, 293, 301

    Fetch modules

    Field 116, 117, 118, 152, 204,

    206, 223, 224, 247

    Filtering 2, 6, 56, 57, 58, 81, 91,

    196, 209, 223, 268, 273, 327, 330, 348,

    350, 352, 450, 459, 460, 461

    First story detection 409, 414

    Flat clustering 363, 364,

    365, 368, 370, 374, 379, 392, 394, 395,

    409, 410, 413, 414

    Akaike information criterion AIC

    381, 387

    cardinality in 30, 110, 181, 192,

    223, 226, 232, 242, 285, 286, 301, 314,

    334, 379, 434

    classification vs.

    collections 2, 4, 6, 9, 11, 12,

    15, 17, 18, 25, 30, 34, 44, 45, 47, 52,

    58, 60, 65, 70, 72, 73, 74, 75, 77, 78,

    79, 80, 81, 85, 90, 91, 92, 93, 94, 95,

    96, 97, 98, 99, 100, 101, 106, 108, 109,

    111, 112, 116, 117, 119, 121, 124, 125,

    128, 129, 130, 131, 135, 136, 137, 138,

    144, 148, 151, 152, 158, 159, 160, 161,

    162, 165, 167, 168, 169, 170, 171, 173,

    174, 175, 176, 178, 181, 184, 185, 188,

    189, 191, 192, 194, 196, 198, 199, 200,

    204, 207, 208, 210, 212, 213, 214, 216,

    217, 219, 220, 224, 225, 232, 234, 235,

    237, 238, 239, 240, 241, 242, 244, 246,

    255, 256, 258, 259, 260, 261, 262, 264,

    268, 271, 277, 288, 292, 293, 296, 298,

    300, 304, 307, 311, 312, 314, 315, 328,

    358, 359, 365, 366, 367, 368, 370, 371,

    373, 377, 383, 388, 392, 395, 397, 400,

    409, 412, 413, 415, 418, 424, 427, 428,

    431, 434, 457, 467

    defined 2, 3, 4, 17, 22, 25, 26,

    29, 42, 52, 55, 61, 62, 63, 79, 85, 90,

    92, 94, 100, 104, 110, 118, 121, 125,

    127, 133, 145, 160, 161, 162, 163, 165,

    166, 167, 168, 170, 173, 180, 188, 195,

    204, 208, 211, 213, 215, 216, 217, 219,

    220, 221, 230, 234, 235, 237, 259, 262,

    264, 271, 272, 276, 280, 283, 284, 287,

    290, 292, 300, 301, 307, 308, 317, 318,

    319, 321, 324, 325, 327, 329, 334, 335,

    336, 341, 344, 345, 346, 352, 353, 358,

    359, 365, 368, 369, 371, 372, 374, 377,

    379, 382, 383, 386, 387, 392, 393, 396,

    397, 400, 404, 405, 407, 409, 412, 413,

    424, 427, 437, 438, 443, 446, 450, 451,

    465, 479, 480, 481, 487, 492

    distortion 380, 381, 395

    evaluation of 7, 46, 68, 145,

    155, 157, 158, 159, 160, 161, 163, 165,

    167, 169, 170, 171, 173, 174, 175, 176,

    177, 180, 181, 182, 193, 194, 200, 203,

    206, 212, 219, 221, 222, 225, 226, 233,

    260, 267, 271, 294, 295, 297, 298, 301,

    324, 347, 353, 360, 363, 365, 369, 370,

    371, 372, 374, 386, 388, 415, 432, 445,

    449, 471

    exhaustive 369, 409

    Expectation-Maximization (EM) algorithm

    259, 365, 382, 383, 384,

    385, 386, 387, 389, 392

    expectation step 383, 384, 387

    external criterion of quality

    HAC vs.

    internal criterion of quality

    K-means

    K-medoids 415

    in language models

    maximization step 383, 384,

    387

    model complexity 380,

    381

    normalized mutual information

    371

    objective functions 341,

    369, 370, 375, 377, 379, 380, 382

    outliers 334, 341, 377, 378, 396, 399

    partitional 369

    purity 371, 372, 373, 374, 386

    Rand index, adjusted

    388

    residual sum of squares

    374, 380, 395

    scatter-gather 366, 388

    search result 7, 17, 32, 82, 84, 222

    seeds 375, 376, 377, 378, 384,

    386, 389, 413, 457

    singleton 378

    soft 3, 23, 33, 36, 87, 96, 139,

    150, 329, 340, 341, 342, 347, 365, 369,

    383, 384, 385, 386, 387, 389, 392, 415,

    424, 431, 432, 446

    unsupervised learning

    F measure 163, 164, 165, 168,

    180, 221, 371, 374, 388

    Focused retrieval 222

    Free text

    Free text query

    parsing functions

    designing 49, 53, 70, 71, 77,

    80, 87, 96, 100, 150, 151, 152, 158,

    171, 176, 178, 179, 209, 212, 213, 224,

    244, 254, 307, 329, 435, 446, 456, 457,

    458, 463, 466, 475

    tokenization 8, 22, 25, 26,

    27, 28, 34, 42, 95, 152, 230, 314

    in vector retrieval models

    234

    Frequency-based feature selection

    292

    Frobenius norm 425, 426, 430

    Front coding 98, 99, 100,

    109

    Front queues 464, 465, 466,

    474

    Functional margins 336

    G

    GAAC Group-average agglomerative

    clustering 403, 404, 405, 406, 407,

    408, 409, 412, 413, 414, 415

    γ encoding 103, 104, 105,

    106, 107, 108, 109, 110, 111

    Gaps, encoding 468, 470

    Generative model 278,

    327

    Geometric margin 337

    Global champion list

    Gold standard 159, 370, 371

    Golomb codes 112

    GOV2 collection

    Greedy feature selection

    Grepping

    Ground truth 159

    Group-average agglomerative clustering

    391, 400, 403

    Group-average clustering

    403, 405, 413

    H

    HAC hierarchical agglomerative

    clustering (HAC) 392, 393, 395,

    396, 399, 400, 401, 403, 404, 405, 406,

    408, 409, 410, 412, 413, 414

    Hard assignment 365, 381, 384

    Hard clustering 365, 369, 383, 386

    Harmonic numbers 106

    Hashing 14, 52, 53, 55, 66, 76,

    99, 331, 450, 451, 461, 467, 470

    Heaps' law 93, 111, 315

    Held-out data 298, 313

    Hierarchical agglomerative clustering

    (HAC) 392, 393,

    395, 396, 399, 400, 401, 403, 404, 405,

    406, 408, 409, 410, 412, 413, 414

    algorithm comparison

    best-merge persistence

    400, 402, 404, 405, 413

    Buckshot algorithm

    378, 414

    centroids 188, 189, 190, 192,

    304, 306, 307, 308, 309, 310, 311, 314,

    327, 329, 331, 369, 374, 375, 376, 377,

    378, 379, 381, 382, 384, 386, 388, 391,

    392, 396, 400, 401, 404, 405, 406, 407,

    409, 411, 412, 413, 414, 415, 432

    chaining in 399, 409

    cliques 246, 398

    cluster-internal labeling

    combination similarity

    complete-link clustering 391, 396, 397, 398, 399, 402, 403,

    404, 408

    connected components

    398

    dendrograms 393, 394, 397,

    398, 406, 413

    differential cluster labeling

    410

    divisive 391, 392, 409, 410,

    415

    first story detection

    409, 414

    flat vs. group-average 391, 392,

    396, 400, 401, 403, 404, 405, 409, 413,

    414, 415

    inversions 393, 405, 406, 407,

    409

    monotonicity 393, 406, 413

    next-best merge (NBM) arrays

    novelty detection 388, 409

    optimality 13, 301, 314,

    391, 392, 407, 408

    outliers 334, 341, 377, 378,

    396, 399

    overview 19, 455, 456

    priority queue algorithm

    401

    single-link clustering 394, 396, 397, 398, 399, 400, 402, 408,

    409, 413, 414, 453

    time complexity 12,

    13, 14, 17, 39, 61, 75, 77, 81, 276, 277,

    278, 299, 300, 311, 314, 315, 329, 331,

    342, 343, 378, 379, 386, 392, 399, 400,

    403, 404, 405, 409, 410, 412, 413, 414

    top-down 211, 392, 393,

    409, 410

    Hierarchical classification

    351, 360

    Hierarchical clustering

    agglomerative hierarchical

    agglomerative clustering (HAC)

    applications 6, 10, 18, 23, 37,

    44, 70, 72, 74, 77, 83, 86, 87, 96, 102,

    108, 118, 123, 137, 139, 143, 144, 145,

    151, 152, 158, 160, 168, 169, 173, 174,

    176, 181, 189, 191, 192, 193, 194, 198,

    204, 205, 208, 213, 215, 221, 224, 226,

    243, 252, 253, 254, 257, 258, 264, 268,

    269, 271, 272, 275, 280, 286, 297, 298,

    299, 305, 310, 315, 327, 331, 334, 344,

    346, 347, 348, 349, 350, 351, 352, 360,

    361, 363, 365, 366, 367, 368, 370, 375,

    381, 382, 383, 384, 386, 387, 388, 389,

    392, 394, 395, 407, 408, 409, 410, 413,

    414, 418, 425, 432, 434, 436, 437, 448,

    450, 451, 453, 454, 457, 465, 467, 468

    defined 2, 3, 4, 17, 22, 25, 26,

    29, 42, 52, 55, 61, 62, 63, 79, 85, 90,

    92, 94, 100, 104, 110, 118, 121, 125,

    127, 133, 145, 160, 161, 162, 163, 165,

    166, 167, 168, 170, 173, 180, 188, 195,

    204, 208, 211, 213, 215, 216, 217, 219,

    220, 221, 230, 234, 235, 237, 259, 262,

    264, 271, 272, 276, 280, 283, 284, 287,

    290, 292, 300, 301, 307, 308, 317, 318,

    319, 321, 324, 325, 327, 329, 334, 335,

    336, 341, 344, 345, 346, 352, 353, 358,

    359, 365, 368, 369, 371, 372, 374, 377,

    379, 382, 383, 386, 387, 392, 393, 396,

    397, 400, 404, 405, 407, 409, 412, 413,

    424, 427, 437, 438, 443, 446, 450, 451,

    465, 479, 480, 481, 487, 492

    probabilistic interpretation of

    R environment support for

    Hierarchical Dirichlet processes HDP

    Hierarchy 272, 351, 392,

    394, 395, 409, 410, 415, 434, 435

    Highlighting

    HITS (hyperlink-induced topic search)

    474, 489, 490,

    491, 493

    Host splitters 461

    HTML 24, 178, 205,

    224, 434, 435, 438, 440, 446, 453, 460,

    474, 475

    http 16, 27, 36, 70,

    86, 87, 161, 180, 181, 185, 186, 201,

    340, 366, 367, 370, 388, 415, 432, 434,

    435, 440, 446, 448, 459, 460, 475

    Hub score 487

    Hyperlink-induced topic search (HITS)

    Hyperlinks Link analysis

    Hyphenation and tokenization 3, 438, 439, 474, 487

    I

    Ide dec-hi 190, 200

    IDF Inverse document frequency

    (IDF) 190

    IID Independent and identically

    distributed (IID)

    Images, searching for

    Relevance feedback 177,

    183, 184, 185, 186, 187, 188, 189, 190,

    191, 192, 193, 194, 195, 200, 193, 196,

    193, 200, 184, 185, 186, 187, 188, 196,

    188, 189, 190, 191, 192, 193, 194, 193,

    194, 195, 196, 200, 230, 234, 236, 239,

    240, 241, 242, 244, 246, 230, 234, 236,

    195, 240, 241, 242, 244, 246, 230, 234,

    236, 239, 240, 241, 242, 244, 246, 230,

    234, 236, 239, 240, 241, 242, 244, 246,

    262, 263, 264, 262, 263, 264, 262, 263,

    264, 262, 263, 239, 309, 310, 311, 330,

    309, 310, 311, 330, 309, 310, 311, 330,

    309, 310, 311, 330, 194, 195, 196, 200,

    194, 195, 264

    Impact ordering 84, 147

    Implicit relevance feedback

    177, 193, 195

    Incidence matrix 5, 108, 109

    Independence 233, 234, 235,

    242, 243, 247, 274, 281, 282, 283, 285,

    290, 291, 292, 293, 299, 301, 321

    Independent and identically distributed (IID)

    Index construction

    BSBI 73, 74, 75, 76, 77, 78, 86

    distributed indexes 69,

    70, 77, 78, 80, 86, 455, 456, 466, 471

    resources 3, 81, 87, 96, 180,

    195, 197, 205, 264, 434, 457

    Indexer 70, 78, 111, 152,

    153, 180, 440, 445, 457, 459

    Indexes 1, 4, 5, 7, 8, 9, 10, 11,

    13, 17, 18, 19, 22, 23, 24, 25, 26, 27,

    28, 29, 30, 31, 34, 36, 38, 39, 40, 42,

    43, 44, 45, 46, 47, 48, 49, 52, 55, 56,

    57, 58, 59, 62, 63, 64, 65, 66, 67, 69,

    70, 72, 73, 74, 75, 76, 77, 78, 79, 80,

    81, 82, 83, 84, 85, 86, 87, 89, 90, 91,

    92, 95, 99, 102, 105, 106, 107, 108,

    109, 110, 111, 112, 113, 115, 116, 117,

    118, 119, 124, 128, 129, 130, 131, 138,

    142, 144, 145, 146, 147, 150, 151, 152,

    153, 154, 155, 158, 161, 175, 178, 179,

    180, 184, 196, 197, 199, 204, 205, 210,

    211, 212, 216, 218, 224, 225, 230, 268,

    269, 315, 331, 365, 367, 368, 377, 412,

    417, 418, 427, 431, 433, 434, 435, 436,

    437, 440, 445, 446, 447, 448, 449, 450,

    451, 453, 454, 455, 456, 457, 459, 466,

    467, 468, 470, 471, 475, 490

    biword 42, 43, 44, 46, 155

    defined 2, 3, 4, 17, 22, 25, 26,

    29, 42, 52, 55, 61, 62, 63, 79, 85, 90,

    92, 94, 100, 104, 110, 118, 121, 125,

    127, 133, 145, 160, 161, 162, 163, 165,

    166, 167, 168, 170, 173, 180, 188, 195,

    204, 208, 211, 213, 215, 216, 217, 219,

    220, 221, 230, 234, 235, 237, 259, 262,

    264, 271, 272, 276, 280, 283, 284, 287,

    290, 292, 300, 301, 307, 308, 317, 318,

    319, 321, 324, 325, 327, 329, 334, 335,

    336, 341, 344, 345, 346, 352, 353, 358,

    359, 365, 368, 369, 371, 372, 374, 377,

    379, 382, 383, 386, 387, 392, 393, 396,

    397, 400, 404, 405, 407, 409, 412, 413,

    424, 427, 437, 438, 443, 446, 450, 451,

    465, 479, 480, 481, 487, 492

    document-partitioned

    k-gram 48, 57, 65

    next word 46, 98

    parametric 40, 47, 49, 61, 70,

    71, 75, 80, 85, 93, 105, 112, 115, 116,

    117, 144, 145, 148, 153, 159, 160, 173,

    174, 177, 189, 206, 242, 244, 245, 247,

    254, 257, 258, 259, 263, 273, 274, 276,

    278, 280, 281, 282, 283, 284, 287, 289,

    298, 299, 300, 301, 305, 309, 311, 314,

    316, 317, 318, 323, 327, 330, 331, 337,

    341, 347, 353, 354, 357, 359, 361, 379,

    381, 382, 383, 384, 385, 386, 387, 477

    permuterm 56, 57, 58, 59, 62,

    67

    positional 7, 8, 10, 12, 21, 24,

    29, 39, 40, 41, 43, 44, 45, 46, 47, 48,

    71, 72, 80, 83, 90, 91, 92, 95, 98, 100,

    113, 116, 152, 153, 165, 167, 170, 178,

    179, 215, 230, 254, 274, 278, 280, 281,

    282, 283, 285, 299, 318, 328, 334, 335,

    350, 354, 366, 376, 434, 443, 444, 461,

    469, 477, 479

    size/estimation

    term-partitioned

    zone 71, 116, 117, 118, 119,

    120, 123, 148, 178, 206, 304, 305, 306,

    307, 308, 310, 312, 319, 320, 322, 329,

    353, 354, 355, 357, 396, 402

    Indexing

    defined 2, 3, 4, 17, 22, 25, 26,

    29, 42, 52, 55, 61, 62, 63, 79, 85, 90,

    92, 94, 100, 104, 110, 118, 121, 125,

    127, 133, 145, 160, 161, 162, 163, 165,

    166, 167, 168, 170, 173, 180, 188, 195,

    204, 208, 211, 213, 215, 216, 217, 219,

    220, 221, 230, 234, 235, 237, 259, 262,

    264, 271, 272, 276, 280, 283, 284, 287,

    290, 292, 300, 301, 307, 308, 317, 318,

    319, 321, 324, 325, 327, 329, 334, 335,

    336, 341, 344, 345, 346, 352, 353, 358,

    359, 365, 368, 369, 371, 372, 374, 377,

    379, 382, 383, 386, 387, 392, 393, 396,

    397, 400, 404, 405, 407, 409, 412, 413,

    424, 427, 437, 438, 443, 446, 450, 451,

    465, 479, 480, 481, 487, 492

    distributed 3, 69, 70, 77,

    78, 79, 80, 86, 455, 456, 457, 459, 461,

    462, 466, 467, 471

    granularity 24, 25, 34, 103,

    124, 173, 174, 222, 255, 388, 437

    latent semantic 92, 199, 368

    unit defined

    INEX 174, 206, 219, 220, 222, 223,

    225, 227

    Informational queries 444,

    445

    Information gain 300, 301,

    410, 415

    Information need 2, 3, 6, 7,

    16, 17, 135, 158, 159, 160, 166, 167,

    168, 169, 170, 171, 174, 175, 178, 185,

    193, 196, 210, 220, 230, 234, 245, 246,

    257, 262, 268, 365, 368

    Information retrieval 1, 2,

    3, 4, 5, 7, 25, 26, 70, 83, 89, 90, 91, 94,

    95, 103, 120, 6, 126, 18, 30, 84, 113,

    135, 149, 141, 157, 158, 160, 161, 162,

    171, 173, 184, 185, 189, 191, 196, 209,

    220, 247, 249, 250, 253, 268, 363, 365,

    441

    hardware issues 78, 87

    history of 253, 254, 294, 300,

    393, 407, 433, 434, 439, 443, 445, 465,

    484

    overview 19, 455, 456

    search system components

    terms, statistical properties of

    89, 91

    In-links 438, 474, 477, 482

    Inner product Dot products 127, 128, 130, 131, 136, 306,

    308, 329, 339, 342, 343, 345, 346, 402,

    403, 404, 412

    Instance-based learning

    314

    Internal criterion of quality

    Interpolated precision

    165, 166, 167, 170, 171

    Intersection, postings list

    Inter-similarity

    Inverse document frequency (IDF)

    217

    Inversions 393, 405, 406, 407,

    409

    defined 2, 3, 4, 17, 22, 25, 26,

    29, 42, 52, 55, 61, 62, 63, 79, 85, 90,

    92, 94, 100, 104, 110, 118, 121, 125,

    127, 133, 145, 160, 161, 162, 163, 165,

    166, 167, 168, 170, 173, 180, 188, 195,

    204, 208, 211, 213, 215, 216, 217, 219,

    220, 221, 230, 234, 235, 237, 259, 262,

    264, 271, 272, 276, 280, 283, 284, 287,

    290, 292, 300, 301, 307, 308, 317, 318,

    319, 321, 324, 325, 327, 329, 334, 335,

    336, 341, 344, 345, 346, 352, 353, 358,

    359, 365, 368, 369, 371, 372, 374, 377,

    379, 382, 383, 386, 387, 392, 393, 396,

    397, 400, 404, 405, 407, 409, 412, 413,

    424, 427, 437, 438, 443, 446, 450, 451,

    465, 479, 480, 481, 487, 492

    in HAC 393, 395, 396,

    399, 400, 403, 404, 405, 408, 409, 410,

    412, 413, 414

    Inverted file Inverted

    index; Postings list 7, 100, 146

    Inverted index 1, 4, 7, 8, 9,

    10, 11, 17, 19, 22, 40, 48, 52, 55, 56,

    57, 58, 66, 67, 70, 72, 73, 74, 75, 76,

    77, 79, 82, 84, 87, 90, 92, 105, 106,

    107, 108, 109, 111, 113, 117, 119, 130,

    142, 144, 147, 152, 153, 196, 204, 205,

    218, 315, 331, 367, 368, 412, 435, 468

    Boolean query processing

    building principles

    described 61, 142, 144, 160,

    168, 198, 204, 205, 207, 209, 222, 224,

    231, 242, 260, 275, 285, 301, 310, 320,

    322, 324, 349, 363, 367, 368, 378, 381,

    396, 404, 412, 432, 454, 468, 471, 474,

    475, 476, 491

    encoding 103, 104, 105,

    106, 107, 108, 109, 110, 111

    kNN classification in

    Inverter 79, 80, 85

    IP address 27, 462

    J

    Jaccard coefficient 64,

    65, 451, 452, 453

    Japanese 28, 34, 35, 48, 55, 491

    Journal influence weight

    492

    K

    Kappa statistic 172,

    173, 175, 181

    Kernel function 346

    Kernels 7, 9, 36, 49, 61, 62, 73,

    196, 197, 204, 209, 268, 340, 345, 346,

    347, 348, 359, 360, 419, 487

    Mercer 346

    polynomial 254, 256, 257,

    258, 259, 260, 273, 274, 275, 276, 277,

    278, 279, 280, 281, 282, 283, 284, 285,

    286, 289, 290, 292, 293, 299, 300, 321,

    329, 346, 347, 419

    quadratic 82, 105, 323, 346

    radial basis functions 346, 347

    Kernel trick 345

    Keys 5, 18, 30, 151, 154,

    171, 178, 179, 192, 209, 213, 220, 222,

    270, 352, 353, 354, 366, 367, 435, 439,

    440, 442, 443, 444, 449

    Key-value pairs 78, 79, 80, 85

    Keyword-in-context (KWIC) snippets

    k-gram index 48, 57, 65

    described 61, 142, 144, 160,

    168, 198, 204, 205, 207, 209, 222, 224,

    231, 242, 260, 275, 285, 301, 310, 320,

    322, 324, 349, 363, 367, 368, 378, 381,

    396, 404, 412, 432, 454, 468, 471, 474,

    475, 476, 491

    spelling correction in 51,

    52, 59, 60, 62, 63, 64, 65, 67, 68, 83,

    85, 152, 153, 184, 191

    word matching in 17, 59

    K-means

    K-medoids 415

    k nearest neighbor classification (kNN)

    algorithm k 311

    Bayes error rate 316,

    331

    bias in 254, 264, 286, 303,

    305, 323, 324, 325, 326, 327, 328, 331,

    335, 349

    decision boundaries 307,

    317, 319, 320, 326, 327, 328, 329, 334,

    336, 337, 339, 341, 347, 357, 359

    described 61, 142, 144, 160,

    168, 198, 204, 205, 207, 209, 222, 224,

    231, 242, 260, 275, 285, 301, 310, 320,

    322, 324, 349, 363, 367, 368, 378, 381,

    396, 404, 412, 432, 454, 468, 471, 474,

    475, 476, 491

    effectiveness 7, 28, 33, 37,

    38, 40, 42, 46, 48, 49, 68, 92, 110, 111, 112, 127, 135, 158, 160, 161, 162, 163,

    166, 168, 173, 177, 180, 181, 182, 185,

    189, 190, 191, 192, 193, 194, 199, 200,

    219, 222, 224, 232, 247, 257, 258, 260,

    261, 262, 263, 272, 279, 283, 284, 289,

    290, 292, 293, 295, 296, 297, 298, 301,

    313, 315, 317, 320, 322, 323, 328, 331,

    334, 347, 349, 350, 351, 352, 353, 354,

    360, 361, 364, 365, 367, 368, 370, 381, 388, 399, 420, 431, 458, 490

    instance-based learning

    314

    memory-based learning

    314

    memory capacity 327,

    335, 336

    multinomial Naive Bayes vs.

    256, 273

    as nonlinear classification

    303, 305, 316, 320, 323, 324

    testing/training capacity

    time complexity/optimality

    variance 254, 286, 303, 305,

    323, 324, 325, 326, 327, 328, 331, 335,

    381, 382, 383, 387, 414

    Voronoi tessellation

    312, 329

    kNN classification (k nearest neighbor classification)

    311

    Kruskal's algorithm

    Kullback-Leibler (KL) divergence

    264, 330, 387

    KWIC (keyword-in-context)

    L

    Labeling 391, 410, 411, 412,

    415

    of clusters 365, 391, 392,

    410, 411, 412, 415

    defined 2, 3, 4, 17, 22, 25, 26,

    29, 42, 52, 55, 61, 62, 63, 79, 85, 90,

    92, 94, 100, 104, 110, 118, 121, 125,

    127, 133, 145, 160, 161, 162, 163, 165,

    166, 167, 168, 170, 173, 180, 188, 195,

    204, 208, 211, 213, 215, 216, 217, 219,

    220, 221, 230, 234, 235, 237, 259, 262,

    264, 271, 272, 276, 280, 283, 284, 287,

    290, 292, 300, 301, 307, 308, 317, 318,

    319, 321, 324, 325, 327, 329, 334, 335,

    336, 341, 344, 345, 346, 352, 353, 358,

    359, 365, 368, 369, 371, 372, 374, 377,

    379, 382, 383, 386, 387, 392, 393, 396,

    397, 400, 404, 405, 407, 409, 412, 413,

    424, 427, 437, 438, 443, 446, 450, 451,

    465, 479, 480, 481, 487, 492

    Language of an automaton

    250, 251, 252

    Language identification 34,

    48

    Language issues relevance

    feedback 177, 183, 184,

    185, 186, 187, 188, 189, 190, 191, 192,

    193, 194, 195, 196, 200, 230, 234, 236,

    239, 240, 241, 242, 244, 246, 262, 263,

    264, 309, 310, 311, 330

    Language models 67, 139, 222, 226, 243, 247, 249, 250, 251, 252,

    253, 254, 255, 256, 258, 260, 261, 262,

    263, 264, 265, 277, 368, 382

    Bayesian smoothing 265

    BIM/XML vs., 230

    clustering in

    defined 2, 3, 4, 17, 22, 25, 26,

    29, 42, 52, 55, 61, 62, 63, 79, 85, 90,

    92, 94, 100, 104, 110, 118, 121, 125, 127, 133, 145, 160, 161, 162, 163, 165,

    166, 167, 168, 170, 173, 180, 188, 195,

    204, 208, 211, 213, 215, 216, 217, 219,

    220, 221, 230, 234, 235, 237, 259, 262,

    264, 271, 272, 276, 280, 283, 284, 287,

    290, 292, 300, 301, 307, 308, 317, 318,

    319, 321, 324, 325, 327, 329, 334, 335,

    336, 341, 344, 345, 346, 352, 353, 358,

    359, 365, 368, 369, 371, 372, 374, 377, 379, 382, 383, 386, 387, 392, 393, 396,

    397, 400, 404, 405, 407, 409, 412, 413,

    424, 427, 437, 438, 443, 446, 450, 451,

    465, 479, 480, 481, 487, 492

    distributions multinomial

    254, 256, 257, 258, 259, 260,

    273, 274, 275, 276, 277, 278, 279, 280,

    281, 282, 283, 284, 285, 286, 289, 290,

    292, 293, 299, 300, 321, 329, 346, 347,

    419

    document likelihood

    263, 264

    extended approaches 119, 197, 198, 201, 230, 250, 361, 432

    finite automata and

    250, 251

    Kullback-Leibler (KL) divergence

    264, 330, 387

    likelihood ratio 252, 254,

    255, 264

    linear interpolation 258,

    259, 265

    overview 19, 455, 456

    query likelihood 249,

    250, 255, 261, 263, 264, 432

    tf-idf weighting vs.

    translation 2, 5, 35, 59, 184,

    200, 253, 256, 264, 386, 450, 490

    types of 2, 13, 27, 30, 39, 42,

    45, 48, 69, 83, 116, 123, 160, 206, 208, 209, 212, 220, 224, 237, 256, 262, 268,

    269, 271, 323, 326, 348, 349, 350, 358,

    359, 369, 371, 383, 392, 447, 452

    Laplace smoothing 275

    Latent Dirichlet Allocation (LDA)

    Latent semantic analysis (LSA)

    92

    Latent semantic indexing (LSI) 199, 368

    LDA (Latent Dirichlet Allocation)

    L2 distance

    Learning algorithm described

    Weighted zone scoring

    Learning error

    Learning method 22, 42,

    121, 152, 155, 174, 226, 270, 271, 272,

    273, 298, 300, 316, 323, 324, 325, 326,

    327, 328, 333, 334, 343, 347, 348, 350,

    355, 361, 482

    Lemma 36, 37

    Lemmatization described

    Lemmatizer 37, 92

    Length-normalization

    Levenshtein distance

    Lexicalized subtrees 215,

    216, 225

    Lexicons in inverted indexes

    52

    Likelihood 40, 43, 53, 61, 103,

    117, 134, 148, 231, 232, 236, 238, 240,

    249, 250, 252, 254, 255, 256, 261, 263,

    264, 274, 281, 292, 313, 322, 326, 328,

    330, 381, 382, 383, 384, 387, 432, 464

    Likelihood ratio 252, 254,

    255, 264

    Linear algebra review

    Linear classifiers 303, 305, 316, 317, 318, 319, 320, 321, 322,

    323, 324, 326, 328, 331, 336, 344, 345,

    350, 356, 357, 358

    Linear interpolation 258,

    259, 265

    Linear problem 319, 320,

    326, 342

    Linear separability 317, 319,

    320, 328, 330, 333, 334, 341, 344, 345, 359

    Link analysis 441, 467, 473,

    474, 476, 489, 490, 493

    anchor text 264, 411, 412,

    454, 459, 474, 475, 476, 483, 489, 492

    authority score 487,

    488, 489, 490, 491, 492

    ergodic Markov chain

    HITS 474, 489, 490, 491, 493

    hub score 487

    Markov chains 477, 478,

    479, 480, 481, 482, 484, 485, 486, 492

    overview 19, 455, 456

    PageRank 485,

    486

    steady-state theorem

    Link farms 493

    Link spam

    LLRUN 112

    LM (language models) 67, 139, 222, 226, 243, 247, 249, 250, 251, 252, 253, 254,

    255, 256, 258, 260, 261, 262, 263, 264,

    265, 277, 368, 382

    Logarithmic merging 82, 83,

    86

    Lossless compression 92

    Lossy compression 92

    Lovins stemmer

    Low-rank approximation 417, 418, 420, 422, 424, 425, 426, 427,

    430, 432

    LSA (latent semantic analysis)

    92

    LSI (latent semantic indexing)

    199, 368

    M

    Machine-learned relevance described

    Machine learning methods

    22, 42, 152, 155, 174, 226, 298,

    333, 334, 347, 348, 350, 355, 361, 482

    Machine translation 253,

    264, 386

    Macroaveraging 296

    MAP (mean average precision)

    167

    Map phase

    MapReduce 77, 78, 79, 80, 85, 86

    Marginal relevance 222

    Marginal statistic 172

    Margins 83, 92, 209, 269, 323, 336, 337, 338, 339, 341, 347, 463, 464,

    468, 470

    Markov chains 477, 478,

    479, 480, 481, 482, 484, 485, 486, 492

    Master node 77, 78, 79, 80

    Matrix decomposition 417,

    418, 420, 421, 424, 432

    eigen 5, 27, 33, 137, 150, 152,

    267, 271, 276, 279, 283, 284, 286, 287, 288, 289, 290, 291, 292, 293, 298, 299,

    300, 301, 305, 316, 319, 320, 322, 323,

    327, 328, 331, 334, 336, 339, 342, 343,

    344, 345, 347, 348, 351, 352, 353, 354,

    355, 356, 358, 359, 360, 361, 410, 412,

    418, 419, 420, 421, 422, 423, 425, 427,

    434, 476, 478, 480, 488, 489, 490, 491,

    492

    eigenvalues 418, 419, 420, 421, 422, 423, 425, 427, 478, 480, 488

    Frobenius norm

    latent semantic indexing

    199, 368

    linear algebra review

    low-rank approximation

    417, 418, 420, 422, 424, 425, 426, 427,

    430, 432

    reduced SVD 423, 426

    singular value 419, 421,

    423, 424, 425, 426, 428, 429

    symmetric diagonal

    422, 423

    theorems 231, 233, 234, 280,

    414, 421, 422, 423, 424, 425, 432, 451,

    452, 480, 481

    truncated SVD 423,

    426

    Maximization step 383, 384, 387

    Maximum a posteriori

    Maximum likelihood estimate (MLE)

    238, 240, 256, 274, 384, 387

    Mean average precision

    167

    Medoids 379, 389, 415

    Memory-based learning

    314

    Memory capacity 327, 335,

    336

    Mercator crawler 458, 470

    Mercer kernels 346

    Merge algorithm 12, 13,

    14, 15, 21, 39, 40, 41, 45, 47, 86, 153,

    404

    Merge postings list

    13, 14, 39, 40, 41, 44, 46, 47, 142, 153

    Metadata 27, 116, 152, 153,

    178, 388

    Microaveraging 296, 299,

    348

    Minimum spanning tree

    413, 414

    Minimum variance clustering

    ModApte split

    Model-based clustering

    Model complexity 380, 381

    Monotonicity 393, 406, 413

    Multiclass classification

    321, 343

    Multiclass SVMs 360

    Multilabel classification

    322, 323, 332

    Multimodal class

    Multinomial classification

    299

    Multinomial model 254,

    260, 278, 279, 280, 281, 282, 283, 284,

    285, 289, 290, 292, 293, 300

    Multinomial Naive Bayes

    256, 273

    Bernoulli model 259,

    260, 267, 278, 279, 280, 281, 282, 283,

    284, 285, 287, 289, 290, 292, 300, 383,

    384

    bias in 254, 264, 286, 303,

    305, 323, 324, 325, 326, 327, 328, 331,

    335, 349

    concept drift 284, 285,

    298, 301

    conditional independence assumption

    235, 281, 282, 283

    as linear classifier

    optimal classifier 316

    positional independence assumption

    274, 282, 283, 285,

    299

    properties 3, 23, 30, 89, 90,

    91, 94, 178, 234, 314, 323, 327, 381,

    433, 436, 441, 458, 468

    in query likelihood models

    249, 250, 255, 261, 264, 432

    random variables X and U

    semi-supervised learning

    sparseness 7, 213, 219, 254,

    257, 275, 277, 282, 352, 368

    testing/training capacity

    in text classification 158,

    160, 161, 260, 267, 268, 269, 270, 271,

    272, 273, 276, 277, 280, 284, 285, 286,

    292, 293, 294, 295, 297, 298, 300, 303,

    301, 304, 301, 303, 304, 305, 309, 310,

    311, 315, 316, 319, 323, 324, 325, 328,

    331, 333, 334, 341, 343, 347, 348, 350,

    351, 352, 353, 354, 355, 360, 361, 382

    variance 254, 286, 303, 305, 323, 324, 325, 326, 327, 328, 331, 335,

    381, 382, 383, 387, 414

    Multinomial NB (Multinomial Naive Bayes) 273, 274,

    275, 277, 281, 299

    Multivalue classification

    321

    Multivariate Bernoulli model

    Mutual information 286,

    287, 288, 289, 292, 294, 300, 301, 352,

    354, 371, 372, 386, 411

    N

    Naive Bayes assumption

    Naive Bayes learning method

    Multinomial Naive

    Bayes; Multivariate Bernoulli model

    271

    Named entity tagging

    352

    National Institute of Standards and Technology

    Natural language processing

    37, 178, 262

    issues in 22, 37, 175, 415

    lemmatizers in 35, 36, 37,

    48, 92

    text summarization 178,

    181, 354

    XML retrieval

    Navigational queries 444,

    445

    NDCG (normalized discounted cumulative

    gain) 169

    Near-duplicate search results

    Nested elements 212

    NEXI 208, 209, 210, 227

    Next-best merge (NBM) arrays

    Next word index 46

    N-gram language model

    Bigram language model

    Unigram language model

    Nibble 103,

    112

    NLP (Natural language processing)

    37, 178, 262

    NMI (Normalized mutual information)

    371, 372, 373, 374, 388

    Noise documents 319, 326,

    327, 328, 341

    Noise feature 284, 286, 287,

    316, 319

    Nonlinear classifiers

    303, 316, 320, 323, 324

    Nonlinear problem 320, 326, 342

    Normalization

    in probability theory 104,

    229, 230, 231, 240, 243, 247, 257, 262

    term 3, 5, 6, 7, 8, 9, 10, 11,

    12, 13, 14, 15, 16, 17, 18, 21, 22, 25,

    26, 27, 29, 30, 31, 32, 34, 37, 38, 40,

    41, 42, 43, 44, 45, 46, 47, 52, 53, 54, 55,

    56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 72, 73, 74, 75, 76, 77, 78, 79,

    80, 81, 82, 83, 84, 85, 86, 89, 90, 91,

    92, 93, 94, 95, 96, 97, 98, 99, 100, 106,

    107, 108, 109, 110, 111, 112, 113, 115,

    116, 118, 119, 121, 123, 124, 125, 126,

    127, 129, 130, 131, 132, 133, 134,

    135, 137, 138, 139, 142, 143,

    144, 145, 146, 147, 149, 150, 152, 153,

    154, 155, 171, 178, 179, 184, 187, 190,

    191, 192, 195, 196, 197, 198, 199, 200,

    208, 210, 212, 213, 214, 215, 216, 217,

    218, 219, 225, 226, 230, 233, 234, 235,

    236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246, 250, 251, 252, 253, 254,

    256, 257, 260, 261, 262, 263, 264, 273,

    274, 275, 276, 278, 279, 280, 281, 282,

    283, 284, 285, 286, 287, 288, 289, 290,

    291, 292, 293, 294, 298, 304, 309, 311,

    314, 315, 318, 329, 343, 347, 352, 355,

    359, 365, 366, 367, 368, 379, 382, 383,

    384, 385, 410, 411, 412, 415, 417, 418,

    422, 423, 424, 425, 427, 428, 429, 430, 431, 432, 439, 444, 448, 450, 459, 466,

    467, 471, 475, 476, 490

    tf weighting 123, 131

    URL 27, 434, 437, 453, 457, 458,

    459, 460, 461, 462, 463, 464, 465, 466,

    467, 468, 469, 470, 477

    Normalized discounted cumulative gain

    (NDCG)

    Normalized mutual information (NMI)

    Normalized tokens in inverted indexes

    Normal vectors 308, 317, 335

    Novelty detection 388,

    409

    NTCIR collection

    O

    Objective function 341, 369,

    370, 375, 377, 379, 380, 382

    Odds 232

    Odds ratio

    Okapi BM25 weighting

    230

    1/0 loss 233

    One-of classification 321,

    323, 331

    One-versus-all (OVA) classification

    343

    Optimal classifier 316

    Optimal clustering

    Optimal learning method

    Optimal weight

    Ordering 4, 8, 9, 10, 13, 15, 17,

    18, 30, 53, 55, 58, 59, 69, 70, 72, 73,

    74, 75, 76, 77, 79, 83, 84, 85, 86, 96,

    98, 111, 113, 116, 119, 120, 125, 126,

    130, 133, 134, 141, 142, 143, 145, 146,

    147, 149, 150, 151, 153, 154, 155, 158,

    167, 169, 170, 173, 177, 178, 181, 195,

    205, 209, 213, 217, 223, 224, 226, 229, 230, 232, 233, 234, 235, 237, 239, 246,

    250, 252, 255, 256, 259, 261, 262, 264,

    270, 277, 292, 322, 330, 334, 355, 357,

    358, 359, 361, 368, 401, 435, 436, 442,

    443, 448, 451, 453, 459, 468, 469, 474,

    476, 482, 487, 490, 492

    Ordinal regression 358, 361

    Outliers 334, 341, 377, 378,

    396, 399

    Out-links 438, 477, 479

    Overfitting 286, 356

    Overlap score measure

    126

    Oxford English Dictionary

    92

    P

    PageRank 485

    computation 7, 10, 12, 13, 19,

    30, 31, 41, 44, 47, 48, 53, 60, 61, 62,

    63, 64, 65, 67, 68, 70, 72, 75, 77, 78,

    81, 85, 90, 96, 102, 106, 107, 109, 110,

    111, 113, 115, 116, 118, 119, 120, 121,

    122, 123, 124, 125, 126, 127, 128, 129,

    130, 131, 132, 133, 134, 135, 136, 137,

    138, 141, 142, 143, 144, 145, 146, 147,

    148, 149, 150, 152, 153, 158, 162, 163,

    164, 165, 166, 167, 168, 169, 170, 171,

    172, 173, 175, 176, 177, 179, 180, 181,

    185, 188, 190, 191, 192, 193, 195, 197,

    198, 199, 213, 215, 216, 217, 218, 219,

    222, 226, 230, 231, 232, 234, 235, 236, 237, 239, 139, 155, 240,

    241, 242, 243, 244, 245, 246, 247, 252,

    253, 254, 255, 256, 257, 258, 260, 261,

    262, 263, 264, 273, 274, 275, 276, 278,

    279, 280, 282, 283, 284, 285, 286, 287,

    288, 290, 291, 292, 293, 294, 295, 296,

    297, 299, 300, 303, 304, 305, 306, 307,

    309, 311, 313, 314, 319, 322, 324, 329,

    335, 336, 339, 342, 343, 345, 346, 347, 352, 355, 358, 359, 364, 365, 367, 368,

    369, 370, 371, 373, 374, 375, 376, 377,

    378, 379, 383, 384, 386, 387, 392, 393,

    395, 396, 398, 400, 401, 402, 403, 404,

    405, 406, 411, 412, 413, 414, 415, 420,

    422, 423, 424, 425, 426, 427, 428, 429,

    430, 431, 436, 450, 451, 452, 453, 467,

    474, 475, 477, 479, 480, 481, 482, 484,

    485, 486, 487, 488, 489, 490, 491, 492, 493

    described 61, 142, 144, 160, 168,

    198, 204, 205, 207, 209, 222, 224, 231,

    242, 260, 275, 285, 301, 310, 320, 322,

    324, 349, 363, 367, 368, 378, 381, 396,

    404, 412, 432, 454, 468, 471, 474, 475,

    476, 491

    ergodic Markov chain

    479, 480, 484

    Markov chains 477, 478,

    479, 480, 481, 482, 484, 485, 486, 492

    personalized 484, 485, 486

    principal left eigenvector

    478, 480

    probability vectors 478, 479,

    480

    steady-state theorem

    stochastic matrix 478, 489

    teleport operation

    477, 480, 481, 482, 483, 484, 485, 486,

    492, 493

    topic-specific 485

    Paice stemmer

    Paid inclusion 440

    Parameter-free compression

    Parameterized compression

    Parameter tuning

    Parameter tying 353

    Parametric indexes 115,

    116, 117

    Parametric search

    Parser 78, 79, 80, 151, 152,

    153, 199

    Parsing functions, designing

    Parsing modules

    Partitional clustering 391,

    409, 410, 415

    Partition rule

    Passage retrieval 222, 226

    Patent databases 204

    Performance 18, 25, 37, 49, 68,

    70, 71, 103, 112, 131, 158, 159, 160,

    168, 190, 193, 194, 195, 224, 233, 242,

    245, 246, 254, 258, 259, 262, 263, 265,

    282, 289, 295, 298, 301, 322, 334, 348,

    349, 353, 432, 457

    Permuterm index 56, 57, 58, 59, 62, 67

    Personalized PageRank 485, 486

    Phonetic correction 66

    Phrase index 43, 46

    Phrase queries 17, 18, 21,

    22, 27, 28, 30, 41, 42, 43, 44, 46, 47,

    49, 144, 151, 154, 155, 254

    Phrase search 44, 113

    Pivoted document length normalization

    Pivot length

    Pointwise mutual information 287, 301

    Polytomous classification

    321, 343

    Polytopes 312

    Pooling

    Pornography filtering 352

    Porter stemmer

    38

    Positional independence assumption 274, 282, 283, 285,

    299

    Positional indexes 43, 44,

    45, 46, 47, 48, 179

    Posterior probability 231,

    273, 280

    Postfiltering, in k-gram indexes

    Postings in block sort-based indexing

    compression and 7, 14, 23,

    30, 40, 44, 45, 70, 71, 73, 75, 77, 83,

    86, 89, 90, 91, 92, 94, 95, 96, 97, 98,

    99, 100, 101, 102, 103, 104, 105, 106,

    107, 108, 109, 110, 111, 112, 113, 169,

    468

    defined 2, 3, 4, 17, 22, 25, 26,

    29, 42, 52, 55, 61, 62, 63, 79, 85, 90,

    92, 94, 100, 104, 110, 118, 121, 125,

    127, 133, 145, 160, 161, 162, 163, 165,

    166, 167, 168, 170, 173, 180, 188, 195,

    204, 208, 211, 213, 215, 216, 217, 219,

    220, 221, 230, 234, 235, 237, 259, 262,

    264, 271, 272, 276, 280, 283, 284, 287,

    290, 292, 300, 301, 307, 308, 317, 318,

    319, 321, 324, 325, 327, 329, 334, 335,

    336, 341, 344, 345, 346, 352, 353, 358,

    359, 365, 368, 369, 371, 372, 374, 377, 379, 382, 383, 386, 387, 392, 393, 396,

    397, 400, 404, 405, 407, 409, 412, 413,

    424, 427, 437, 438, 443, 446, 450, 451,

    465, 479, 480, 481, 487, 492

    in inverted indexes 1, 4,

    7, 8, 9, 10, 11, 17, 19, 22, 40, 48, 52,

    55, 56, 57, 58, 66, 67, 70, 72, 73, 74,

    75, 76, 77, 79, 82, 84, 87, 90, 92, 105,

    106, 107, 108, 109, 111, 113, 117, 119, 130, 142, 144, 147, 152, 153, 196, 204,

    205, 218, 315, 331, 367, 368, 412, 435,

    468

    positional 7, 8, 10, 12, 21, 24,

    29, 39, 40, 41, 43, 44, 45, 46, 47, 48,

    71, 72, 80, 83, 90, 91, 92, 95, 98, 100,

    113, 116, 152, 153, 165, 167, 170, 178,

    179, 215, 230, 254, 274, 278, 280, 281,

    282, 283, 285, 299, 318, 328, 334, 335, 350, 354, 366, 376, 434, 443, 444, 461,

    469, 477, 479

    Postings list

    compression of 7, 14, 23, 30,