DFS Lecture 12


  • Csci 2111: Data and File Structures
    Week 10, Lectures 1 & 2 (March 23 & 28, 2000)

    Hashing

  • Motivation

    Sequential Searching can be done in O(N) access time, meaning that the number of seeks grows in proportion to the size of the file.

    B-Trees improve on this greatly, providing O(log_k N) access, where k is a measure of the leaf size (i.e., the number of records that can be stored in a leaf).

    What we would like to achieve, however, is an O(1)

    access, which means that no matter how big a file grows, access to a record always takes the same small number of seeks.

    Static Hashing techniques can achieve such performance, provided that the file does not grow in size over time.

  • What is Hashing?

    A Hash function is a function h(K) which transforms a

    key K into an address.

    Hashing is like indexing in that it involves associating

    a key with a relative record address.

    Hashing, however, is different from indexing in two

    important ways:

    With hashing, there is no obvious connection between the key and the location.

    With hashing two different keys may be transformed

    to the same address.

  • Collisions

    When two different keys produce the same address, there is a collision. The keys involved are called synonyms.

    Coming up with a hashing function that avoids collisions is extremely difficult. It is best to simply find ways to deal with them.

    Possible Solutions:
    Spread out the records

    Use extra memory

    Put more than one record at a single address.

  • A Simple Hashing Algorithm

    Step 1: Represent the key in numerical form.
    Step 2: Fold and add.
    Step 3: Divide by a prime number and use the remainder as the address.
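
    Below is a minimal sketch of these three steps in Python (not from the slides; the two-character folding and the prime 101 are illustrative choices):

      def simple_hash(key: str, table_size: int = 101) -> int:
          """Fold-and-add hash; table_size is assumed to be prime (101 is illustrative)."""
          total = 0
          # Steps 1 and 2: represent the key numerically, two characters at a time, and add the pieces.
          for i in range(0, len(key), 2):
              piece = 0
              for ch in key[i:i + 2]:
                  piece = piece * 256 + ord(ch)   # pack up to two character codes into one number
              total += piece
          # Step 3: divide by a prime and use the remainder as the address.
          return total % table_size

      # e.g. simple_hash("LOWELL") yields an address in the range 0..100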

  • Hashing Functions and Record Distributions

    Records can be distributed among addresses in

    different ways: there may be (a) no synonyms (uniform

    distribution); (b) only synonyms (worst case); (c) a few

    synonyms (happens with random distributions).

    Purely uniform distributions are difficult to obtain and

    may not be worth searching for.

    Random distributions can be easily derived, but they are not perfect since they may generate a fair number

    of synonyms.

    We want better hashing methods.

  • Some Other Hashing Methods

    Though there is no hash function that guarantees better-than-random distributions in all cases, by taking into consideration the keys that are being hashed, certain improvements are possible.

    Here are some methods that are potentially better than random:

    Examine keys for a pattern

    Fold parts of the key
    Divide the key by a number

    Square the key and take the middle

    Radix transformation
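
    As one illustration, here is a rough sketch of the "square the key and take the middle" method; the padding width, digit positions, and table size below are assumptions made for the example, not taken from the lecture:

      def mid_square_hash(numeric_key: int, table_size: int = 10_000) -> int:
          """Square the numeric key and use its middle digits as the address."""
          squared = str(numeric_key * numeric_key).zfill(12)   # pad so the 'middle' is well defined
          middle = squared[4:8]                                # take four middle digits (illustrative)
          return int(middle) % table_size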

  • Predicting the Distribution of Records

    When using a random distribution, we can use a

    number of mathematical tools to obtain

    conservative estimates of how our hashing function is likely to behave:

    Using the Poisson function p(x) = ((r/N)^x e^(-(r/N))) / x!

    applied to Hashing, we can conclude that:

    In general, if there are N addresses, then the

    expected number of addresses with x records

    assigned to them is Np(x)
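
    A small sketch of this estimate in Python (the function names are mine; r records are hashed randomly into N addresses):

      import math

      def poisson(r: int, N: int, x: int) -> float:
          """p(x) = ((r/N)^x e^(-(r/N))) / x!: the probability that a given address
          is assigned exactly x records."""
          load = r / N
          return (load ** x) * math.exp(-load) / math.factorial(x)

      def expected_addresses_with(x: int, r: int, N: int) -> float:
          """Expected number of addresses holding exactly x records: N * p(x)."""
          return N * poisson(r, N, x)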

  • Predicting Collisions for a Full File

    Suppose you have a hashing function that you

    believe will distribute records randomly and you

    want to store 10,000 records in 10,000 addresses.
    How many addresses do you expect to have no

    records assigned to them?

    How many addresses should have one, two, and

    three records assigned respectively?

    How can we reduce the number of overflow

    records?
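
    Using the poisson() sketch above with r = N = 10,000 (so r/N = 1), the estimates work out roughly as follows:

      r = N = 10_000
      for x in range(4):
          print(f"addresses with exactly {x} record(s): {N * poisson(r, N, x):,.0f}")
      # roughly 3,679 empty, 3,679 with one record, 1,839 with two, 613 with three

      # With one record per home address, the records stored at home number N * (1 - p(0)),
      # so the expected number of overflow records is:
      overflow = r - N * (1 - poisson(r, N, 0))   # about 3,679 records, i.e. roughly 37%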

  • Increasing Memory Space I

    Reducing collisions can be done by choosing a good

    hashing function or using extra memory.

    The question asked here is how much extra memory

    should be used to obtain a given rate of collision reduction?

    Definition: Packing density refers to the ratio of the number of records to be stored (r) to the number of available spaces (N).

    The packing density gives a measure of the amount of

    space in a file that is used.
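
    In code, packing density is just this ratio (the figures in the comment are an arbitrary illustration):

      def packing_density(r: int, N: int) -> float:
          """Ratio of records to be stored (r) to available addresses (N)."""
          return r / N

      # e.g. packing_density(750, 1000) == 0.75: the file is 75% full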

  • Increasing Memory Space II

    The Poisson Distribution allows us to predict the number of collisions that are likely to occur given a certain packing density. We use the Poisson Distribution to answer the following questions:

    How many addresses should have no records assigned to them?

    How many addresses should have exactly one record assigned (no synonym)?

    How many addresses should have one record plus one or more synonyms?

    Assuming that only one record can be assigned to each home address, how many overflow records can be expected?

    What percentage of records should be overflow records?

  • Collision Resolution by Progressive Overflow

    How do we deal with records that cannot fit into their home address? A simple approach: Progressive Overflow, or Linear Probing.

    If a key, k1, hashes into the same address, a1, as another key, k2, then look for the first available address, a2, following a1 and place k1 in a2. If the end of the address space is reached, then wrap around to the beginning.

    When searching for a key that is not in the file, if the address space is not full, then either an empty address will be reached or the search will come back to where it began.
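
    A minimal sketch of progressive overflow on an in-memory table, where None marks an empty slot (a file-based version would read slots from disk instead; function names are mine):

      def po_insert(table, key, hash_fn):
          """Place key at the first free slot at or after its home address,
          wrapping around at the end; assumes the table is not completely full."""
          n = len(table)
          addr = hash_fn(key) % n
          while table[addr] is not None:
              addr = (addr + 1) % n          # wrap around the end of the address space
          table[addr] = key

      def po_search(table, key, hash_fn):
          """Return the address holding key, or None once an empty slot (or a full
          loop back to the home address) shows the key is not present."""
          n = len(table)
          home = hash_fn(key) % n
          addr = home
          while table[addr] is not None:
              if table[addr] == key:
                  return addr
              addr = (addr + 1) % n
              if addr == home:               # the search came back to where it began
                  return None
          return None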

  • Search Length when using Progressive Overflow

    Progressive Overflow causes extra searches and thus extra disk accesses.

    If there are many collisions, then many records

    will be far from home.

    Definitions: Search length refers to the number of accesses required to retrieve a record from secondary memory. The average search length is the average number of times you can expect to have to access the disk to retrieve a record.

    Average search length = (Total search length) / (Total number of records)
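
    A sketch of how the average search length could be computed for a table built with the progressive-overflow insert above (one access per probed slot):

      def average_search_length(table, hash_fn):
          """Average number of accesses needed to retrieve each stored record
          (1.0 means every record sits at its home address)."""
          n = len(table)
          total_probes, records = 0, 0
          for addr, key in enumerate(table):
              if key is None:
                  continue
              home = hash_fn(key) % n
              total_probes += (addr - home) % n + 1   # distance from home, plus the access to the home slot
              records += 1
          return total_probes / records if records else 0.0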

  • Storing More than One Record per Address: Buckets

    Definition: A bucket describes a block of records sharing the same address that is retrieved in one disk access.

    When a record is to be stored or retrieved, its home bucket address is determined by hashing.

    When a bucket is filled, we still have to worry

    about the record overflow problem, but this occurs

    much less often than when each address can hold

    only one record.

  • Effect of Buckets on Performance

    To compute how densely packed a file is, we need to consider 1) the number of addresses, N (buckets); 2) the number of records we can put at each address, b (bucket size); and 3) the number of records, r. Then, Packing Density = r / (bN).

    Though the packing density does not change when

    halving the number of addresses and doubling the

    size of the buckets, the expected number of

    overflows decreases dramatically.
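
    A sketch of this effect, using the poisson() estimate from earlier: the packing density r/(bN) is held at 0.5 in both calls below, yet doubling the bucket size roughly halves the expected overflow (the specific figures are just this example):

      def expected_overflow(r, N, b, tail=60):
          """Expected number of overflow records when r records hash randomly into
          N buckets of capacity b; 'tail' truncates the Poisson sum."""
          stored = sum(N * min(x, b) * poisson(r, N, x) for x in range(tail + 1))
          return r - stored

      # expected_overflow(500, 1000, 1)  -> about 106 records (~21% of the file)
      # expected_overflow(500,  500, 2)  -> about  52 records (~10% of the file)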

  • Making Deletions

    Deleting a record from a hashed file is more complicated than adding a record, for two reasons:

    The slot freed by the deletion must not be allowed to hinder later searches.
    It should be possible to reuse the freed slot for later additions.

    In order to deal with deletions we use tombstones, i.e.,

    a marker indicating that a record once lived there but no longer does. Tombstones solve both problems caused by deletion.

    Insertion of records is slightly different when using tombstones.
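
    A sketch of how tombstones could be wired into the progressive-overflow routines above (the sentinel and function names are mine):

      TOMBSTONE = object()   # marks a slot whose record was deleted

      def po_delete(table, key, hash_fn):
          """Leave a tombstone rather than an empty slot, so later searches
          still probe past this position."""
          addr = po_search(table, key, hash_fn)
          if addr is not None:
              table[addr] = TOMBSTONE

      def po_insert_reusing(table, key, hash_fn):
          """Like po_insert(), but a tombstone counts as a reusable free slot."""
          n = len(table)
          addr = hash_fn(key) % n
          while table[addr] is not None and table[addr] is not TOMBSTONE:
              addr = (addr + 1) % n
          table[addr] = key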

  • Effects of Deletions and Additions on Performance

    After a large number of deletions and additions have taken place, one can expect to find many tombstones occupying places that could be occupied by records whose home address precedes them but that are stored after them.

    This degrades average search lengths.

    There are 3 types of solutions for dealing with this problem: a) local reorganization during deletions; b) global reorganization when the average search length is too large; c) use of a different collision resolution algorithm.

  • Other Collision Resolution Techniques

    There are a few variations on random hashing that may improve performance:

    Double Hashing: When an overflow occurs, use a second hashing function to map the record to its overflow location.

    Chained Progressive Overflow: Like progressive overflow, except that synonyms are linked together with pointers.

    Chaining with a Separate Overflow Area: Like chained progressive overflow, except that overflow addresses do not occupy home addresses.

    Scatter Tables: The hash file contains no records, but only pointers to records; i.e., it is an index.
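
    A sketch of the first variation, double hashing; h1 and h2 stand for whatever pair of hash functions is chosen, and the step is kept nonzero so the probe sequence always moves (assumes table_size > 1):

      def double_hash_probes(key, table_size, h1, h2):
          """Yield the probe sequence for double hashing: start at the home address
          h1(key), then step by an amount derived from h2(key) instead of by 1."""
          addr = h1(key) % table_size
          step = 1 + (h2(key) % (table_size - 1))   # a step in 1 .. table_size-1
          while True:
              yield addr
              addr = (addr + step) % table_size

      # e.g. take the first three candidate addresses for a record:
      # probes = double_hash_probes("LOWELL", 101, h1, h2)
      # [next(probes) for _ in range(3)]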

  • Pattern of Record Access

    If we have some information about what records

    get accessed most often, we can optimize their

    location so that these records will have short search lengths.

    By doing this, we try to decrease the effective

    average search length even if the nominal average

    search length remains the same.

    This principle is related to the one used in

    Huffman encoding.