W5Ch11: Hash Index



    CPSC 461. Instructor: Marina Gavrilova


    Goal

    The goal of today's lecture is to introduce the concept of a hash index. Hashing is an alternative to tree structures that allows near-constant access time to ANY record in a very large database.


    Presentation Outline

    Introduction

    Did you know that?

    Static hashing definition and methods

    Good hash function

    Collision resolution (open addressing, chaining)

    Extendible hashing

    Linear hashing

    Useful links and current market trends

    Summary


    Approaches to Search

    1. Sequential and list methods (lists, tables, arrays).

    2. Direct access by key value (hashing)

    3. Tree indexing methods.

    Introduction to Hashing


    Definition: Hashing is the process of mapping a key value to a position in a table.

    A hash function maps key values to positions.

    A hash table is an array that holds the records.

    Searching in a hash table can be done in O(1) time regardless of the hash table size.

    Introduction to Hashing


    Introduction to Hashing


    Example of Usefulness

    10 stock details, 10 table positions. Stock numbers are between 0 and 1000. Using the whole stock numbers as table positions would require 1000 storage locations, which is an obvious waste of memory (a small sketch follows).

    Introduction to Hashing
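    The following is a minimal Python sketch of the idea above: ten hypothetical stock numbers in the range 0..1000 stored in a 10-slot table instead of reserving 1000 locations. The stock numbers and the "key mod 10" hash are illustrative assumptions, not values from the slides.

        # Ten hypothetical stock numbers hashed into a 10-slot table.
        TABLE_SIZE = 10
        stock_numbers = [85, 198, 267, 413, 502, 619, 731, 876, 904, 50]

        def h(key):
            return key % TABLE_SIZE        # maps any key in 0..1000 to 0..9

        table = [None] * TABLE_SIZE
        for key in stock_numbers:
            slot = h(key)
            if table[slot] is None:
                table[slot] = key          # store the record (here, just the key)
            else:
                print("collision at slot", slot)   # resolution is covered later

        print(table)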


    Applications of Hashing

    Compilers use hash tables to keep track of declared variables.

    A hash table can be used for an on-line spelling checker: if misspelling detection (rather than correction) is important, an entire dictionary can be hashed and words checked in constant time.

    Game-playing programs use hash tables to store seen positions, thereby saving computation time if the position is encountered again.

    Hash functions can be used to quickly check for inequality: if two elements hash to different values, they must be different.

    Storing sparse data.

    Introduction to Hashing


    Did you know that?

    Cryptography was once known only to key people in the National Security Agency and a few academics.

    Until 1996, it was illegal to export strong cryptography from the United States.

    Fast forward to 2006, and the Payment Card Industry Data Security Standard (PCI DSS) requires merchants to encrypt cardholder information. Visa and MasterCard can levy fines of up to $500,000 for not complying!

    Among the methods recommended are:

    Strong one-way hash functions (hashed indexes)

    Truncation

    Index tokens and pads (pads must be securely stored)

    Strong cryptography

    [Hashing for fun and profit: Demystifying encryption for PCI DSS, Roger Nebel]

    http://www.nsa.gov/
    http://searchsecurity.techtarget.com/topics/0,295493,sid14_tax303586,00.html

    Did you know that?

    The Transport Layer Security (TLS) network protocol uses the Rivest, Shamir, and Adleman (RSA) public key algorithm for the TLS key exchange and authentication, and the Secure Hash Algorithm 1 (SHA-1) for the key exchange and hashing.

    [System cryptography: Use FIPS compliant algorithms for encryption, hashing, and signing, Microsoft TechNews, 2005]


    Did you know that?

    Spatial hashing studies performed at Microsoft Research, Redmond, combine hashing with computer graphics to create a new set of tools for rendering, mesh reconstruction, and collision optimization (see the public poster by Hugues Hoppe on the next slide).


    Perfect Spatial Hashing, Sylvain Lefebvre and Hugues Hoppe (Microsoft Research)

    [Poster: a perfect hash function is designed to losslessly pack sparse 2D/3D data while retaining efficient random access. There are no collisions (ideal for the GPU), a single lookup into a small offset table, offsets of only ~4 bits per defined data item, access in ~4 GPU instructions, and optimized spatial coherence. Applications shown include vector images, sprite maps, alpha compression, 3D textures, 3D painting, simulation, and collision detection.]


    Did you know that?

    Combining hashing and encryption provides a much stronger tool for database and password protection.

    [Security Briefs, MSDN Magazine]
    http://msdn.microsoft.com/msdnmag/issues/03/08/SecurityBriefs/


    Hash Functions

    Hashing is the process of chopping up the key and mixing it up in various ways in order to obtain an index that will be uniformly distributed over the range of indices, hence the name "hashing".

    There are several common ways of doing this:

    Truncation

    Folding

    Modular Arithmetic

    Hash Functions


    Hash Functions: Truncation

    Truncation is a method in which parts of the key are ignored and the remaining portion becomes the index. We take the given key and produce a hash location by taking portions of the key (truncating the key).

    Example: If a hash table can hold 1000 entries and an 8-digit number is used as the key, the 3rd, 5th and 7th digits counted from the left of the key could be used to produce the index. For example, if the key is 62538194, the hash location is 589 (a short sketch follows).

    Advantage: simple and easy to implement.

    Problems: clustering and repetition.

    Hash Functions
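    Below is a minimal Python sketch of the truncation example above, using the slide's 8-digit key 62538194 and its 3rd, 5th and 7th digits; the function name is my own.

        # Truncation hash: keep only the 3rd, 5th and 7th digits (from the
        # left) of an 8-digit key to form an index in 0..999.
        def truncation_hash(key):
            digits = f"{key:08d}"                           # 62538194 -> "62538194"
            return int(digits[2] + digits[4] + digits[6])   # "5" + "8" + "9" -> 589

        print(truncation_hash(62538194))   # 589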


    Hash Functions: Folding

    Folding breaks the key into several parts and combines the parts to form an index.

    The parts may be recombined by addition, subtraction or multiplication, and the result may have to be truncated as well. Such a process is usually better than truncation by itself, since it produces a better distribution: all of the digits in the key are considered.

    Using the key 62538194 and breaking it into three numbers using the first three, the next three and the last two digits produces 625, 381 and 94. These could be added to get 1100, which could be truncated to 100 (sketched below). They could also be multiplied together and then three digits chosen from the middle of the number produced.

    Hash Functions
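    A small Python sketch of the folding example above (split 62538194 into 625, 381 and 94, add them, then truncate to three digits); the function name is my own.

        # Folding hash: split the key into parts, add them, and truncate the
        # sum so it fits a 1000-entry table.
        def folding_hash(key):
            digits = f"{key:08d}"
            parts = [int(digits[0:3]), int(digits[3:6]), int(digits[6:8])]   # 625, 381, 94
            return sum(parts) % 1000     # 1100 -> 100, as on the slide

        print(folding_hash(62538194))    # 100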


    Hash Functions: Modular Arithmetic

    Modular arithmetic essentially assures that the index produced is within a specified range. The key is converted to an integer, which is divided by the range of the index, and the remainder becomes the value of the hash function (a short sketch follows).

    Uses: biometrics, encryption, compression.

    If the value of the modulus is a prime number, the distribution of indices obtained is quite uniform. A table whose size is a number with many factors makes it more likely that many keys map to the same index, so the table size should be a prime number.

    Hash Functions
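    A minimal Python sketch of the modular arithmetic method, assuming a prime table size of 11 as in the collision examples later in the lecture; the keys are hypothetical.

        # Modular-arithmetic hash: the remainder after division by a prime
        # table size keeps the index in range and spreads keys fairly evenly.
        TABLE_SIZE = 11                    # prime, as recommended on the slide

        def mod_hash(key):
            return key % TABLE_SIZE

        for key in (23, 45, 67, 89):       # hypothetical keys
            print(key, "->", mod_hash(key))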


    Good Hash Functions

    Hash functions which use all of the key are almost always better than those which use only some of the key. When only portions of the key are used, information is lost, and the number of distinct index values that can be produced is therefore reduced. If we treat the key as an integer in its binary form, the number of pieces that can be manipulated by the hash function is greatly increased.

    Hash Functions


    Collision

    It is obvious that no matter what function is used, the possibility exists that the function will produce an index that duplicates an index already in use. This is a collision.

    Collision resolution strategies:

    - Open addressing: store the key/entry in a different position

    - Chaining: chain together several keys/entries in each position

    Collision Resolution


    Collision Example

    - Hash table size: 11

    - Hash function: key mod table size

    So the new positions in the hash table are:

    Some collisions occur with this hash function.

    Collision Resolution


    Collision Resolution: Open Addressing

    Resolving collisions by open addressing means taking the next open space, as determined by rehashing the key according to some algorithm.

    Two main open addressing collision resolution techniques:

    - Linear probing: increase the position by 1 each time [mod table size!]

    - Quadratic probing: to the original position, add 1, 4, 9, 16, ...

    In some cases a key-dependent increment technique is also used.

    Probing: if the table position given by the hashed key is already occupied, increase the position by some amount until an empty position is found (a linear probing sketch follows).

    Collision Resolution
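    A short Python sketch of open addressing with linear probing, using the table size of 11 from the earlier example; the key values are hypothetical.

        # Linear probing: on a collision, step forward one slot at a time
        # (mod the table size) until a free slot is found.
        TABLE_SIZE = 11
        table = [None] * TABLE_SIZE

        def insert_linear(key):
            pos = key % TABLE_SIZE
            for step in range(TABLE_SIZE):            # at most one full sweep
                probe = (pos + step) % TABLE_SIZE
                if table[probe] is None:
                    table[probe] = key
                    return probe
            raise RuntimeError("hash table is full")

        for key in (26, 37, 59, 48, 15):              # all five keys hash to 4
            print(key, "->", insert_linear(key))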


    Collision Resolution: Open Addressing

    In order to try to avoid clustering, a method which does not simply take the first open space must be used. Two common methods are:

    - Quadratic probing

    - Key-dependent increments

    Collision Resolution


    Collision Resolution: Open Addressing

    Quadratic probing: new position = (collision position + j^2) MOD hash size, for j = 1, 2, 3, 4, ... (a sketch follows)

    Example

    Before quadratic probing:

    After quadratic probing:

    Problem: overflow may occur while there is still space in the hash table, because quadratic probing does not necessarily visit every slot.

    Collision Resolution
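    A small Python sketch of quadratic probing for the same table size of 11; the probe limit and the keys are illustrative choices of mine.

        # Quadratic probing: after a collision at position p, try
        # (p + 1), (p + 4), (p + 9), ... modulo the table size.
        TABLE_SIZE = 11
        table = [None] * TABLE_SIZE

        def insert_quadratic(key):
            pos = key % TABLE_SIZE
            for j in range(TABLE_SIZE):               # j = 0 tries the home slot
                probe = (pos + j * j) % TABLE_SIZE
                if table[probe] is None:
                    table[probe] = key
                    return probe
            # Quadratic probing can fail to reach a free slot even when one exists.
            raise RuntimeError("overflow: no free slot reached")

        for key in (26, 37, 59, 48):                  # all hash to 4
            print(key, "->", insert_quadratic(key))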


    Collision Resolution: Open Addressing

    Key-dependent Increments

    This technique is used to solve the overflow problem of the quadratic probing method. The increments vary according to the key used for the hash function.

    If the original hash function results in a good distribution, then key-dependent functions work quite well for rehashing, and all locations in the table will eventually be probed for a free position.

    Key-dependent increments are determined by using the key to calculate a new value, and then using this value as an increment to determine successive probes.

    Collision Resolution


    Collision Resolution: Open Addressing

    Key-dependent Increments

    For example, since the original hash function was key MOD 11, we might choose another function of the key (such as key MOD 7) to find the increment. In the example below, the rehash function becomes (sketched below):

    new position = current position + (key DIV 11) MOD 11

    Example

    Before key-dependent increments:

    After key-dependent increments:

    Collision Resolution
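    A Python sketch of the key-dependent increment idea, using the increment (key DIV 11) MOD 11 from the formula above; a zero increment is clamped to 1, since the next slide warns that an increment of 0 must not arise.

        # Key-dependent increments: the probe step is derived from the key
        # itself, here (key // 11) mod 11, forced to be non-zero.
        TABLE_SIZE = 11
        table = [None] * TABLE_SIZE

        def insert_key_dependent(key):
            pos = key % TABLE_SIZE
            step = (key // 11) % TABLE_SIZE or 1      # never allow a 0 increment
            for _ in range(TABLE_SIZE):
                if table[pos] is None:
                    table[pos] = key
                    return pos
                pos = (pos + step) % TABLE_SIZE
            raise RuntimeError("no free slot found")

        for key in (26, 37, 59, 48):                  # all hash to 4
            print(key, "->", insert_key_dependent(key))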


    Collision Resolution: Open Addressing

    Key-dependent Increments

    In all of these closed hash functions it is important to ensure that an increment of 0 does not arise. If the increment is equal to the hash size, the same position will be probed every time, so this value cannot be used either.

    If we ensure that the hash size is prime, that the divisors for the open and closed hash are prime, and that the rehash function does not produce a 0 increment, then this method will usually access all positions, as the linear probe does.

    Using a key-dependent method usually reduces clustering, and therefore searches for an empty position should not be as long as for the linear method.

    Collision Resolution


    Collision Resolution: Chaining

    Each table position is a linked list. Keys and entries can be added anywhere in the list (the front is easiest); a sketch follows.

    Advantages over open addressing:

    - Simpler insertion and removal

    - Array size is not a limitation (but collisions should still be minimized: make the table size roughly equal to the expected number of keys and entries)

    Disadvantage:

    - Memory overhead is large if entries are small

    Collision Resolution
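    A compact Python sketch of chaining with a list per slot, again with table size 11; the keys are hypothetical.

        # Chaining: each table position holds a list of all keys that hash
        # there, so collisions simply lengthen the list.
        TABLE_SIZE = 11
        table = [[] for _ in range(TABLE_SIZE)]

        def insert_chained(key):
            table[key % TABLE_SIZE].insert(0, key)    # add at the front (easiest)

        def search_chained(key):
            return key in table[key % TABLE_SIZE]

        for key in (26, 37, 59, 16):                  # 26, 37 and 59 all hash to 4
            insert_chained(key)

        print(table[4])            # [59, 37, 26]
        print(search_chained(37))  # True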


    Collision Resolution: Chaining

    Example

    Before chaining:

    After chaining:

    Collision Resolution


    In analyzing search efficiency, the average is usually used. Searching with hash tables is highly dependent on how full the table is, since as the table approaches a full state, more rehashes are necessary. The proportion of the table which is full is called the load factor.

    - When collisions are resolved using open addressing, the maximum load factor is 1.

    - Using chaining, however, the load factor can be greater than 1, when the table is full and the linked list attached to each hash address has more than one element.

    - Chaining consistently requires fewer probes than open addressing, but traversal of the linked list is slow, and if the records are small it may be just as well to use open addressing.

    - Chaining is best under two conditions: when the number of unsuccessful searches is large, or when the records are large.

    - Open addressing is likely to be a reasonable choice when most searches are likely to be successful, the load factor is moderate, and the records are relatively small.

    Analysis of Searching using Hash Tables


    Average number of probes for the different collision resolution methods: [the values are for large hash tables, in this case larger than 500 entries]

    Analysis of Searching using Hash Tables


    When are other representations more suitable than hashing?

    Hash tables are very good if there is a need for many searches in a reasonably stable database.

    Hash tables are not so good if there are many insertions and deletions, or if table traversals are needed; in this case, trees are better for indexing.

    Also, hashing is very slow for any operation that requires the entries to be sorted (e.g. a query to find the minimum key).

    Analysis of Searching using Hash Tables


    A perfect hashing function maps a key into a unique address. If the range of potential addresses is the same as the number of keys, the function is a minimal (in space) perfect hashing function.

    What makes perfect hashing distinctive is that it is a process for mapping a key space to a unique address in a smaller address space, that is:

    hash(key) -> unique address

    Not only does a perfect hashing function improve retrieval performance, but a minimal perfect hashing function would provide 100 percent storage utilization.

    Perfect Hashing


    Process of creating a perfect hash function

    A general form of a perfect hashing function is:

    p.hash(key) = (h0(key) + g[h1(key)] + g[h2(key)]) mod N

    Perfect Hashing


    In Cichelli's algorithm, the component functions are:

    h0 = length(key)

    h1 = first_character(key)

    h2 = second_character(key)

    and g = T(x),

    where T is the table of values associated with the individual characters x that may appear in a key (a sketch of this form follows).

    The time-consuming part of Cichelli's algorithm is determining T.

    Cichelli's Algorithm
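    A Python sketch of the Cichelli-style hash form above. The character value table T here is a made-up illustration, not Table 1 from the next slide, and the modulus is arbitrary.

        # Cichelli-style perfect hash form: combine the key's length with the
        # table values of its first and second characters.
        T = {"b": 10, "e": 2, "g": 5, "i": 0, "n": 1}   # hypothetical values only

        def cichelli_hash(key, N=38):
            return (len(key) + T[key[0]] + T[key[1]]) % N

        print(cichelli_hash("begin"))   # (5 + 10 + 2) mod 38 = 17 with these made-up values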


    When we apply Cichelli's perfect hashing function to the keyword "begin" using Table 1, the keyword "begin" hashes to, and would be stored in, location 33. Since the hash values run from 2 through 37 for this set of data, the hash function is a minimal perfect hashing function.

    Cichelli's Algorithm

    Table 1: Values associated with the characters of the Pascal reserved words


    Links for interactive hashing examples:

    http://www.engin.umd.umich.edu/CIS/course.des/cis350/hashing/WEB/HashApplet.htm

    http://www.cs.auckland.ac.nz/software/AlgAnim/hash_tables.html

    http://www.cse.yorku.ca/~aaw/Hang/hash/Hash.html

    http://www.cs.pitt.edu/~kirk/cs1501/animations/Hashing.html

    Some Links to Hashing Animation


    Hashing as Database Index

    The basic idea is to use a hashing function which maps a search key value (of a field) into a record or a bucket of records.

    As for any index, there are 3 alternatives for data entries k* (e.g. the data record itself with key value k).

    Hash-based indexes are best for equality selections. They cannot support range searches.

    Static and dynamic hashing techniques exist; the trade-offs are similar to ISAM vs. B+ trees.


    Static Hashing

    The number of primary pages is fixed; they are allocated sequentially and never de-allocated. Overflow pages are allowed if needed.

    h(k) mod M = the bucket to which the data entry with key k belongs (M = the number of buckets).

    [Figure: a key is fed through h(key) mod N to one of the primary bucket pages 0 .. N-1, each of which may have overflow pages.]


    Static Hashing (Contd.)

    Buckets contain data entries.

    The hash function depends on the search key field of record r. It must distribute values over the range 0 ... M-1 (the table size). h(key) = (a * key + b) mod M usually works well; a and b are constants, and a lot is known about how to tune h (a small sketch follows).

    Long overflow chains can develop and degrade performance. Extendible and Linear Hashing are dynamic techniques that fix this problem.
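    A minimal Python sketch of the h(key) = (a * key + b) mod M form above; the constants a and b, the bucket count and the keys are arbitrary choices for illustration.

        # Static hashing: a fixed number of buckets M and a hash of the form
        # (a * key + b) mod M, as suggested on the slide.
        M = 8                                  # number of primary buckets (fixed)
        a, b = 31, 7                           # tuning constants (arbitrary here)

        def bucket_of(key):
            return (a * key + b) % M

        buckets = [[] for _ in range(M)]       # each bucket lists its data entries
        for key in (12, 44, 51, 92, 13):       # hypothetical search key values
            buckets[bucket_of(key)].append(key)

        print(buckets)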


    Rule of thumb

    Try to keep space utilization between 50% and 80%. If it exceeds 80%, overflow becomes significant. Where the threshold falls depends on how good the hashing function is and on the number of keys per bucket.


    Extendible Hashing (Fagin et al., 1979)

    Expandable hashing (Knott 1971)

    Dynamic Hashing (Larson 1978)

    Hash Functions for Extendible Hashing


    Assume that a hashing technique is applied to a dynamically changing file composed of buckets, and that each bucket can hold only a fixed number of items.

    Extendible hashing accesses the data stored in buckets indirectly, through an index that is dynamically adjusted to reflect changes in the file.

    The characteristic feature of extendible hashing is the organization of the index, which is an expandable table.

    Extendible Hashing


    A hash function applied to a certain key indicates a position in the index, not in the file (or table of keys). Values returned by such a hash function are called pseudokeys.

    The database/file requires no reorganization when data are added to or deleted from it, since these changes are indicated in the index.

    Only one hash function h is used, but depending on the size of the index, only a portion of the address h(K) is utilized. A simple way to achieve this is to view the address as a string of bits of which only the i leftmost bits are used. The number i is the depth of the directory. In Figure 1(a) (on the next slide), the depth is equal to two.

    Extendible Hashing


    Figure 1. An example of extendible hashing (Drozdek textbook)

    Example


    Extendible Hashing as Index

    Situation: a bucket (primary page) becomes full. Why not re-organize the file by doubling the number of buckets? Because reading and writing all pages is expensive!

    Idea: use a directory of pointers to buckets; double the number of buckets by doubling the directory and splitting just the bucket that overflowed.

    The directory is much smaller than the file, so doubling it is much cheaper. Only one page of data entries is split. No overflow page! The trick lies in how the hash function is adjusted.


    Example: the directory is an array of size 4.

    To find the bucket for r, take the last (global depth) bits of h(r); we denote r by h(r). If h(r) = 5 = binary 101, it is in the bucket pointed to by 01.

    Insert: if the bucket is full, split it (allocate a new page and re-distribute). If necessary, double the directory. (As we will see, splitting a bucket does not always require doubling; we can tell by comparing the global depth with the local depth of the split bucket.)

    [Figure: a directory of size 4 (global depth 2) with entries 00, 01, 10, 11 pointing to data pages Bucket A (4*, 12*, 32*, 16*), Bucket B (1*, 5*, 21*, 13*), Bucket C (10*) and Bucket D (15*, 7*, 19*), each with local depth 2.]


    Insert h(r) = 20 (Causes Doubling)

    [Figure: inserting 20* into the full Bucket A (local depth 2 = global depth 2) triggers a split. The directory doubles from entries 00-11 to 000-111 (global depth 3); Bucket A splits into A (32*, 16*) and its 'split image' A2 (4*, 12*, 20*), both now with local depth 3, while Buckets B, C and D keep local depth 2.]


    Points to Note

    20 = binary 10100. The last 2 bits (00) tell us that r belongs in A or A2; the last 3 bits are needed to tell which.

    Global depth of the directory: the maximum number of bits needed to tell which bucket an entry belongs to.

    Local depth of a bucket: the number of bits used to determine whether an entry belongs to this bucket.

    When does a bucket split cause directory doubling? Before the insert, local depth of the bucket = global depth. The insert causes local depth to become greater than global depth; the directory is doubled by copying it over and fixing the pointer to the split image page. (Use of the least significant bits enables efficient doubling via copying of the directory! A compact sketch follows.)
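    A compact Python sketch of the scheme described on the last few slides: a directory indexed by the last global-depth bits of h(key), with splitting and doubling as needed. The class and method names are my own, Python's built-in hash stands in for h, and duplicate-heavy key sets are not handled.

        # Extendible hashing: the directory is indexed by the last
        # `global_depth` bits of h(key); a full bucket is split, and the
        # directory doubles only when the bucket's local depth equals the
        # global depth.
        BUCKET_CAPACITY = 4

        class Bucket:
            def __init__(self, local_depth):
                self.local_depth = local_depth
                self.keys = []

        class ExtendibleHash:
            def __init__(self):
                self.global_depth = 2
                self.directory = [Bucket(2) for _ in range(4)]   # entries 00, 01, 10, 11

            def _dir_index(self, key):
                return hash(key) & ((1 << self.global_depth) - 1)   # last global_depth bits

            def insert(self, key):
                bucket = self.directory[self._dir_index(key)]
                if len(bucket.keys) < BUCKET_CAPACITY:
                    bucket.keys.append(key)
                    return
                if bucket.local_depth == self.global_depth:           # doubling needed
                    self.directory = self.directory + self.directory  # double by copying
                    self.global_depth += 1
                self._split(bucket)
                self.insert(key)                                      # retry after the split

            def _split(self, bucket):
                bucket.local_depth += 1
                image = Bucket(bucket.local_depth)                    # the 'split image'
                bit = 1 << (bucket.local_depth - 1)                   # new distinguishing bit
                for i, b in enumerate(self.directory):                # re-point half the entries
                    if b is bucket and (i & bit):
                        self.directory[i] = image
                old_keys, bucket.keys = bucket.keys, []
                for k in old_keys:                                    # re-distribute data entries
                    (image if hash(k) & bit else bucket).keys.append(k)

        # Keys from the slide example: inserting 20 forces the directory to double.
        eh = ExtendibleHash()
        for k in (4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19, 20):
            eh.insert(k)
        print(eh.global_depth)   # 3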


    Directory Doubling

    Why use the least significant bits in the directory? It allows doubling via copying!

    [Figure: a directory of size 2 holding 6* (6 = binary 110) is doubled to size 4 and then 8, once using the least significant bits and once using the most significant bits; with the least significant bits the new directory is simply two copies of the old one.]


    Comments on Extendible Hashing

    If the directory fits in memory, an equality search is answered with one disk access; otherwise two.

    A 100 MB file with 100-byte records and 4 KB pages contains 1,000,000 records (as data entries) and 25,000 directory elements; chances are high that the directory will fit in memory.

    The directory grows in spurts and, if the distribution of hash values is skewed, it can grow large. Multiple entries with the same hash value cause problems!

    Delete: if removal of a data entry makes a bucket empty, the bucket can be merged with its 'split image'. If each directory element points to the same bucket as its split image, the directory can be halved.


    Expandable Hashing

    Similar idea to extendible hashing, but a binary tree is used to store an index on the buckets.

    Dynamic Hashing

    Multiple binary trees are used.

    Outcome: the search is shortened, because the key is used to select which tree to search.

    Hybrid methods


    Linear Hashing

    This is another dynamic hashing scheme, an alternative to Extendible Hashing.

    LH handles the problem of long overflow chains without using a directory, and handles duplicates.

    Idea: use a family of hash functions h0, h1, h2, ..., where hi(key) = h(key) mod (2^i * N) and N is the initial number of buckets.

    h is some hash function (its range is not just 0 to N-1).

    If N = 2^d0 for some d0, then hi consists of applying h and looking at the last di bits, where di = d0 + i.

    hi+1 doubles the range of hi (similar to directory doubling).


    Linear Hashing (Contd.)

    The directory is avoided in LH by using overflow pages and choosing the bucket to split round-robin.

    Splitting proceeds in 'rounds'. A round ends when all N_R initial buckets (for round R) have been split. Buckets 0 to Next-1 have been split; Next to N_R have yet to be split. The current round number is Level.

    Search: to find the bucket for data entry r, compute h_Level(r) (a sketch of this rule follows):

    If h_Level(r) is in the range Next to N_R, r belongs here.

    Else, r could belong to bucket h_Level(r) or to bucket h_Level(r) + N_R; we must apply h_Level+1(r) to find out.
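    A small Python sketch of the bucket-selection rule just described, with h_i(key) = key mod (2^i * N); the function names and the example values of Level, Next and N are my own.

        # Linear hashing bucket selection: use h_Level first, and fall back to
        # h_(Level+1) for buckets that have already been split this round.
        def h(level, key, N):
            return key % (2 ** level * N)

        def bucket_for(key, level, next_split, N):
            b = h(level, key, N)
            if b < next_split:                 # bucket already split this round,
                b = h(level + 1, key, N)       # so the extra bit picks old vs. split image
            return b

        # Example: N = 4 initial buckets, Level = 0, bucket 0 already split (Next = 1).
        for key in (32, 44, 9, 25, 5):
            print(key, "->", bucket_for(key, level=0, next_split=1, N=4))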


    Overview of LH File

    In the middle of a round.

    [Figure: buckets 0 to Next-1 have already been split in this round; Next is the bucket to be split next; the buckets from Next onward existed at the beginning of the round and form the range of h_Level(search key value); the 'split image' buckets created so far in this round (through splitting of other buckets) follow them. If h_Level(search key value) falls in the already-split range, h_Level+1 must be used to decide whether the entry is in the original bucket or in its 'split image' bucket.]


    Linear Hashing (Contd.)

    Insert: find the bucket by applying h_Level / h_Level+1. If the bucket to insert into is full:

    Add an overflow page and insert the data entry.

    (Maybe) split the Next bucket and increment Next.

    Any criterion can be chosen to 'trigger' a split. Since buckets are split round-robin, long overflow chains don't develop!

    The doubling of the directory in Extendible Hashing is similar; the switching of hash functions is implicit in how the number of bits examined is increased.


    Example of Linear Hashing

    On a split, h_Level+1 is used to re-distribute entries.

    [Figure: Level = 0, N = 4, Next = 0. The primary pages hold 32*, 44*, 36* (bucket 00); 9*, 25*, 5* (bucket 01); 14*, 18*, 10*, 30* (bucket 10); 31*, 35*, 7*, 11* (bucket 11); the entry 5* is labelled 'data entry r with h(r) = 5' and its page 'primary bucket page'. After inserting 43*, bucket 11 gains an overflow page holding 43*, bucket 0 is split using h1 (the last three bits), 44* and 36* move to the new bucket 100 while 32* stays in bucket 000, and Next becomes 1.]


    Example: End of a Round

    [Figure: on the left, Level = 0 and Next = 3, with overflow pages on some buckets; inserting 50* adds an overflow page and causes the last unsplit bucket to be split, ending the round. On the right, Level = 1 and Next = 0 again, with 8 primary pages (000 through 111) holding 32*; 9*, 25*; 66*, 18*, 10*, 34* (overflow page 50*); 43*, 35*, 11*; 44*, 36*; 5*, 37*, 29*; 14*, 30*, 22*; and 31*, 7*.]


    LH Described as a Variant of EH

    The two schemes are actually quite similar. Begin with an EH index where the directory has N elements. Use overflow pages, and split buckets round-robin.

    The first split is at bucket 0. (Imagine the directory being doubled at this point.) But the copied directory elements are the same as the originals, so we need only create directory element N, which now differs from element 0. When bucket 1 splits, create directory element N+1, and so on.

    So, the directory can double gradually. Also, primary bucket pages are created in order. If they are allocated in sequence too (so that finding the i-th is easy), we actually don't need a directory at all!


    Useful Links

    http://www.cs.ucla.edu/classes/winter03/cs143/l1/handouts/hash.pdf

    http://www.ecst.csuchico.edu/~melody/courses/csci151_live/Static_hash_course_notes.htm

    http://www.smckearney.com/adb/notes/lecture.extendible.hashing.pdf

    Summary

    Hash-based indexes are best for equality searches; they cannot support range searches.

    Static hashing can lead to long overflow chains.

    Extendible hashing avoids overflow pages by splitting a full bucket when a new data entry is to be added to it. (Duplicates may require overflow pages.)

    A directory keeps track of the buckets and doubles periodically; it can get large with skewed data, and costs an additional I/O if it does not fit in main memory.

    Directory-less schemes (linear dynamic hashing) are also available.


    Summary

    Linear Hashing avoids a directory by splitting buckets round-robin and using overflow pages.

    Overflow pages are not likely to be long.

    Duplicates are handled easily.

    Space utilization could be lower than with Extendible Hashing, since splits are not concentrated on 'dense' data areas.


    Check List

    What is the intuition behind hash-structured indexes? Why are they especially good for equality searches but useless for range selections?

    What is Extendible Hashing? How does it handle search, insert, and delete?

    What is Linear Hashing?

    What are the similarities and differences between Extendible and Linear Hashing?

    How does a perfect hash function work?