DFS Lecture 12


  • Csci 2111: Data and File Structures
    Week 10, Lectures 1 & 2 (March 23 & 28, 2000)

    Hashing

  • Motivation

    Sequential Searching can be done in O(N) access time, meaning that the number of seeks grows in proportion to the size of the file.

    B-Trees improve on this greatly, providing O(log_k N) access, where k is a measure of the leaf size (i.e., the number of records that can be stored in a leaf).

    What we would like to achieve, however, is an O(1)

    access, which means that no matter how big a file grows, access to a record always takes the same small number of seeks.

    Static Hashing techniques can achieve such performance, provided that the file does not grow in size over time.

  • What is Hashing?

    A Hash function is a function h(K) which transforms a

    key K into an address.

    Hashing is like indexing in that it involves associating

    a key with a relative record address.

    Hashing, however, is different from indexing in two

    important ways:

    With hashing, there is no obvious connection between the key and the location.

    With hashing two different keys may be transformed

    to the same address.

  • Collisions

    When two different keys produce the same address, there is a collision. The keys involved are called synonyms.

    Coming up with a hashing function that avoids collisions is extremely difficult. It is best to simply find ways to deal with them.

    Possible Solutions:
    Spread out the records

    Use extra memory

    Put more than one record at a single address.

  • A Simple Hashing Algorithm

    Step 1: Represent the key in numerical form.
    Step 2: Fold and add.
    Step 3: Divide by a prime number and use the remainder as the address.
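
    Below is a minimal sketch of these three steps in Python (not from the slides; the two-character folding and the prime 101 are illustrative choices):

      def simple_hash(key: str, table_size: int = 101) -> int:
          """Fold-and-add hash; table_size is assumed to be prime (101 is illustrative)."""
          total = 0
          # Steps 1 and 2: represent the key numerically, two characters at a time, and add the pieces.
          for i in range(0, len(key), 2):
              piece = 0
              for ch in key[i:i + 2]:
                  piece = piece * 256 + ord(ch)   # pack up to two character codes into one number
              total += piece
          # Step 3: divide by a prime and use the remainder as the address.
          return total % table_size

      # e.g. simple_hash("LOWELL") yields an address in the range 0..100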

  • Hashing Functions and Record Distributions

    Records can be distributed among addresses in

    different ways: there may be (a) no synonyms (uniform

    distribution); (b) only synonyms (worst case); (c) a few

    synonyms (happens with random distributions).

    Purely uniform distributions are difficult to obtain and

    may not be worth searching for.

    Random distributions can be easily derived, but they are not perfect since they may generate a fair number

    of synonyms.

    We want better hashing methods.

  • Some Other Hashing Methods

    Though there is no hash function that guarantees better-than-random distributions in all cases, by taking into consideration the keys that are being hashed, certain improvements are possible.

    Here are some methods that are potentially better than random:

    Examine keys for a pattern

    Fold parts of the key
    Divide the key by a number

    Square the key and take the middle

    Radix transformation
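
    As one illustration, here is a rough sketch of the "square the key and take the middle" method; the padding width, digit positions, and table size below are assumptions made for the example, not taken from the lecture:

      def mid_square_hash(numeric_key: int, table_size: int = 10_000) -> int:
          """Square the numeric key and use its middle digits as the address."""
          squared = str(numeric_key * numeric_key).zfill(12)   # pad so the 'middle' is well defined
          middle = squared[4:8]                                # take four middle digits (illustrative)
          return int(middle) % table_size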

  • Predicting the Distribution of Records

    When using a random distribution, we can use a

    number of mathematical tools to obtain

    conservative estimates of how our hashing function is likely to behave:

    Using the Poisson function p(x) = ((r/N)^x e^(-(r/N))) / x!

    applied to Hashing, we can conclude that:

    In general, if there are N addresses, then the

    expected number of addresses with x records

    assigned to them is Np(x)
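
    A small sketch of this estimate in Python (the function names are mine; r records are hashed randomly into N addresses):

      import math

      def poisson(r: int, N: int, x: int) -> float:
          """p(x) = ((r/N)^x e^(-(r/N))) / x!: the probability that a given address
          is assigned exactly x records."""
          load = r / N
          return (load ** x) * math.exp(-load) / math.factorial(x)

      def expected_addresses_with(x: int, r: int, N: int) -> float:
          """Expected number of addresses holding exactly x records: N * p(x)."""
          return N * poisson(r, N, x)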

  • Predicting Collisions for a Full File

    Suppose you have a hashing function that you

    believe will distribute records randomly and you

    want to store 10,000 records in 10,000 addresses.
    How many addresses do you expect to have no

    records assigned to them?

    How many addresses should have one, two, and

    three records assigned respectively?

    How can we reduce the number of overflow

    records?
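
    Using the poisson() sketch above with r = N = 10,000 (so r/N = 1), the estimates work out roughly as follows:

      r = N = 10_000
      for x in range(4):
          print(f"addresses with exactly {x} record(s): {N * poisson(r, N, x):,.0f}")
      # roughly 3,679 empty, 3,679 with one record, 1,839 with two, 613 with three

      # With one record per home address, the records stored at home number N * (1 - p(0)),
      # so the expected number of overflow records is:
      overflow = r - N * (1 - poisson(r, N, 0))   # about 3,679 records, i.e. roughly 37%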

  • Increasing Memory Space I

    Reducing collisions can be done by choosing a good

    hashing function or using extra memory.

    The question asked here is how much extra memory

    should be used to obtain a given rate of collision reduction?

    Definition: Packing density refers to the ratio of the number of records to be stored (r) to the number of available spaces (N).

    The packing density gives a measure of the amount of

    space in a file that is used.
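
    In code, packing density is just this ratio (the figures in the comment are an arbitrary illustration):

      def packing_density(r: int, N: int) -> float:
          """Ratio of records to be stored (r) to available addresses (N)."""
          return r / N

      # e.g. packing_density(750, 1000) == 0.75: the file is 75% full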

  • Increasing Memory Space II

    The Poisson Distribution allows us to predict the number of collisions that are likely to occur given a certain packing density. We use the Poisson Distribution to answer the following questions:

    How many addresses should have no records assigned to them?

    How many addresses should have exactly one record assigned (no synonym)?

    How many addresses should have one record plus one or more synonyms?

    Assuming that only one record can be assigned to each home address, how many overflow records can be expected?

    What percentage of records should be overflow records?

  • Collision Resolution by Progressive Overflow

    How do we deal with records that cannot fit into their home address? A simple approach: Progressive Overflow, or Linear Probing.

    If a key, k1, hashes into the same address, a1, as another key, k2, then look for the first available address, a2, following a1 and place k1 in a2. If the end of the address space is reached, then wrap around to the beginning.

    When searching for a key that is not in the file, if the address space is not full, then either an empty address will be reached or the search will come back to where it began.
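
    A minimal sketch of progressive overflow on an in-memory table, where None marks an empty slot (a file-based version would read slots from disk instead; function names are mine):

      def po_insert(table, key, hash_fn):
          """Place key at the first free slot at or after its home address,
          wrapping around at the end; assumes the table is not completely full."""
          n = len(table)
          addr = hash_fn(key) % n
          while table[addr] is not None:
              addr = (addr + 1) % n          # wrap around the end of the address space
          table[addr] = key

      def po_search(table, key, hash_fn):
          """Return the address holding key, or None once an empty slot (or a full
          loop back to the home address) shows the key is not present."""
          n = len(table)
          home = hash_fn(key) % n
          addr = home
          while table[addr] is not None:
              if table[addr] == key:
                  return addr
              addr = (addr + 1) % n
              if addr == home:               # the search came back to where it began
                  return None
          return None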

  • Search Length when using Progressive Overflow

    Progressive Overflow causes extra searches and thus extra disk accesses.

    If there are many collisions, then many records

    will be far from home.

    Definitions: Search length refers to the number of accesses required to retrieve a record from secondary memory. The average search length is the average number of times you can expect to have to access the disk to retrieve a record.

    Average search length = (Total search length) / (Total number of records)
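
    A sketch of how the average search length could be computed for a table built with the progressive-overflow insert above (one access per probed slot):

      def average_search_length(table, hash_fn):
          """Average number of accesses needed to retrieve each stored record
          (1.0 means every record sits at its home address)."""
          n = len(table)
          total_probes, records = 0, 0
          for addr, key in enumerate(table):
              if key is None:
                  continue
              home = hash_fn(key) % n
              total_probes += (addr - home) % n + 1   # distance from home, plus the access to the home slot
              records += 1
          return total_probes / records if records else 0.0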

  • Storing More than One Record per Address: Buckets

    Definition: A bucket describes a block of records sharing the same address that is retrieved in one disk access.

    When a record is to be stored or retrieved, its home bucket address is determined by hashing.

    When a bucket is filled, we still have to worry

    about the record overflow problem, but this occurs

    much less often than when each address can hold

    only one record.

  • Effect of Buckets on Performance

    To compute how densely packed a file is, we need to consider 1) the number of addresses, N (buckets); 2) the number of records we can put at each address, b (bucket size); and 3) the number of records, r. Then, Packing Density = r / (bN).

    Though the packing density does not change when

    halving the number of addresses and doubling the

    size of the buckets, the expected number of

    overflows decreases dramatically.
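
    A sketch of this effect, using the poisson() estimate from earlier: the packing density r/(bN) is held at 0.5 in both calls below, yet doubling the bucket size roughly halves the expected overflow (the specific figures are just this example):

      def expected_overflow(r, N, b, tail=60):
          """Expected number of overflow records when r records hash randomly into
          N buckets of capacity b; 'tail' truncates the Poisson sum."""
          stored = sum(N * min(x, b) * poisson(r, N, x) for x in range(tail + 1))
          return r - stored

      # expected_overflow(500, 1000, 1)  -> about 106 records (~21% of the file)
      # expected_overflow(500,  500, 2)  -> about  52 records (~10% of the file)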

  • Making Deletions

    Deleting a record from a hashed file is more complicated than adding a record, for two reasons:

    The slot freed by the deletion must not be allowed to hinder later searches.
    It should be possible to reuse the freed slot for later additions.

    In order to deal with deletions we use tombstones, i.e.,

    a marker indicating that a record once lived there but no longer does. Tombstones solve both problems caused by deletion.

    Insertion of records is slightly different when using tombstones.
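
    A sketch of how tombstones could be wired into the progressive-overflow routines above (the sentinel and function names are mine):

      TOMBSTONE = object()   # marks a slot whose record was deleted

      def po_delete(table, key, hash_fn):
          """Leave a tombstone rather than an empty slot, so later searches
          still probe past this position."""
          addr = po_search(table, key, hash_fn)
          if addr is not None:
              table[addr] = TOMBSTONE

      def po_insert_reusing(table, key, hash_fn):
          """Like po_insert(), but a tombstone counts as a reusable free slot."""
          n = len(table)
          addr = hash_fn(key) % n
          while table[addr] is not None and table[addr] is not TOMBSTONE:
              addr = (addr + 1) % n
          table[addr] = key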

  • Effects of Deletions and Additions on Performance

    After a large number of deletions and additions have taken place, one can expect to find many tombstones occupying places that could be occupied by records whose home address precedes them but that are stored after them.

    This degrades average search lengths.

    There are 3 types of solutions for dealing with this problem: a) local reorganization during deletions; b) global reorganization when the average search length is too large; c) use of a different collision resolution algorithm.

  • Other Collision Resolution Techniques

    There are a few variations on random hashing that may improve performance:

    Double Hashing: When an overflow occurs, use a second hashing function to map the record to its overflow location.

    Chained Progressive Overflow: Like progressive overflow, except that synonyms are linked together with pointers.

    Chaining with a Separate Overflow Area: Like chained progressive overflow, except that overflow addresses do not occupy home addresses.

    Scatter Tables: The hash file contains no records, but only pointers to records; i.e., it is an index.
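
    A sketch of the first variation, double hashing; h1 and h2 stand for whatever pair of hash functions is chosen, and the step is kept nonzero so the probe sequence always moves (assumes table_size > 1):

      def double_hash_probes(key, table_size, h1, h2):
          """Yield the probe sequence for double hashing: start at the home address
          h1(key), then step by an amount derived from h2(key) instead of by 1."""
          addr = h1(key) % table_size
          step = 1 + (h2(key) % (table_size - 1))   # a step in 1 .. table_size-1
          while True:
              yield addr
              addr = (addr + step) % table_size

      # e.g. take the first three candidate addresses for a record:
      # probes = double_hash_probes("LOWELL", 101, h1, h2)
      # [next(probes) for _ in range(3)]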

  • Pattern of Record Access

    If we have some information about what records

    get accessed most often, we can optimize their

    location so that these records will have short search lengths.

    By doing this, we try to decrease the effective

    average search length even if the nominal average

    search length remains the same.

    This principle is related to the one used in

    Huffman encoding.