Chap 7. Indexing


Page 1: Chap 7 .   Indexing

File Structure SNU-OOPSLA Lab. 1

Chap 7. Indexing

Seoul National University, Department of Computer Engineering

Object-Oriented Systems Laboratory (SNU-OOPSLA-LAB)

Prof. Hyoung-Joo Kim

File Structures by Folk, Zoellick, and Riccardi

Page 2: Chap 7 .   Indexing

Chapter Objectives (1)

Introduce concepts of indexing that have broad applications in the design of file systems
Introduce the use of a simple linear index to provide rapid access to records in an entry-sequenced, variable-length record file
Investigate the implementation of the use of indexes for file maintenance
Introduce the template features of C++ for object I/O
Describe the object-oriented approach to indexed sequential files

Page 3: Chap 7 .   Indexing

Chapter Objectives (2)

Describe the use of indexes to provide access to records by more than one key
Introduce the idea of an inverted list, illustrating Boolean operations on lists
Discuss when to bind an index key to an address in the data file
Introduce and investigate the implications of self-indexing files

Page 4: Chap 7 .   Indexing

Contents (1)

7.1 What Is an Index?
7.2 A Simple Index for Entry-Sequenced Files
7.3 Using Template Classes in C++ for Object I/O
7.4 Object-Oriented Support for Indexed, Entry-Sequenced Files of Data Objects
7.5 Indexes That Are Too Large to Hold in Memory

Page 5: Chap 7 .   Indexing

Contents (2)

7.6 Indexing to Provide Access by Multiple Keys
7.7 Retrieval Using Combinations of Secondary Keys
7.8 Improving the Secondary Index Structure: Inverted Lists
7.9 Selective Indexes
7.10 Binding

Page 6: Chap 7 .   Indexing

Overview: Index (1)

Index: a data structure that associates given key values with corresponding record numbers
  It is usually physically separate from the file (unlike indexed sequential files, which use tight binding)
Linear indexes (like the indexes found at the back of books)
  Index records are ordered by key value, as in an ordered relative file
  The best algorithm for finding a record with a specific key value is binary search
  Addition requires reorganization

7.1 What Is an Index?

Page 7: Chap 7 .   Indexing

Overview: Index (2)

[Figure: an Index File holding the keys k1 k2 k4 k5 k7 k9 in key order, with pointers into a Data File whose records carry the same keys and the contents AAA ZZZ CCC XXX EEE FFF]

7.1 What Is an Index?

Page 8: Chap 7 .   Indexing

Overview: Index (3)

Tree indexes (like those of indexed sequential files)
  Hierarchical in that each level, beginning with the root level, points to records at the next level
  Leaves point only to the data file
Tree index structures: indexed sequential file, binary tree index, AVL tree index, B+ tree index

7.1 What Is an Index?

Page 9: Chap 7 .   Indexing

Roles of Index?

Index: keys and reference fields
Fast random access
Uniform access speed
Allow users to impose order on a file without actually rearranging the file
Provide multiple access paths to a file
Give users keyed access to variable-length record files

7.1 What Is an Index?

Page 10: Chap 7 .   Indexing

A Simple Index (1)

Data file
  Entry-sequenced, variable-length records
  Primary key: unique for each entry in the file
Searching a file by key (a common need)
  Binary search cannot be used on a variable-length record file (there is no way to know where the middle record starts)
  Construct an index object for the file
  Index object: key field + byte-offset field

7.2 A Simple Index for E-S Files
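
As a rough illustration of the key field + byte-offset pairing, the sketch below scans a buffer of variable-length records and records where each one starts. It assumes newline-terminated records with '|'-delimited fields and a key formed by joining the first two fields; `IndexEntry` and `buildIndex` are hypothetical names, not classes from the book.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One index entry: a key plus the byte offset of its record in the data file.
struct IndexEntry {
    std::string key;
    long offset;
};

// Scan newline-terminated, '|'-delimited records and build an index sorted by key.
// Assumption: the key is the first two fields joined ("LON|2312|..." -> "LON2312").
std::vector<IndexEntry> buildIndex(const std::string& data) {
    const std::size_t npos = std::string::npos;
    std::vector<IndexEntry> index;
    std::size_t pos = 0;
    while (pos < data.size()) {
        std::size_t end = data.find('\n', pos);
        if (end == npos) end = data.size();
        std::size_t bar1 = data.find('|', pos);
        std::size_t bar2 = (bar1 == npos) ? npos : data.find('|', bar1 + 1);
        if (bar2 != npos && bar2 < end) {  // skip records without two delimiters
            std::string key = data.substr(pos, bar1 - pos) +
                              data.substr(bar1 + 1, bar2 - bar1 - 1);
            index.push_back({key, static_cast<long>(pos)});
        }
        pos = end + 1;  // the next record starts right after the newline
    }
    // The data file stays in entry order; only the index is sorted.
    std::sort(index.begin(), index.end(),
              [](const IndexEntry& a, const IndexEntry& b) { return a.key < b.key; });
    return index;
}
```

Because every index entry has the same shape, the sorted vector can be binary searched even though the records it points to cannot.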

Page 11: Chap 7 .   Indexing

A Simple Index (2)

Index file (sorted by key)
  Key        Reference field
  ANG3795    167
  COL31809   353
  COL38358   211
  DG139201   396
  DG18807    256
  FF245      442
  LON2312    32
  MER75016   300
  RCA2626    77
  WAR23699   132

Data file (entry-sequenced)
  Address of record   Actual data record
  32                  LON|2312|Romeo and Juliet|Prokofiev . . .
  77                  RCA|2626|Quartet in C Sharp Minor . . .
  132                 WAR|23699|Touchstone|Corea . . .
  167                 ANG|3795|Symphony No. 9|Beethoven . . .
  211                 COL|38358|Nebraska|Springsteen . . .
  256                 DG|18807|Symphony No. 9|Beethoven . . .
  300                 MER|75016|Coq d'or Suite|Rimsky . . .
  353                 COL|31809|Symphony No. 9|Dvorak . . .
  396                 DG|139201|Violin Concerto|Beethoven . . .
  442                 FF|245|Good News|Sweet Honey In The . . .

7.2 A Simple Index for E-S Files

Page 12: Chap 7 .   Indexing

A Simple Index (3)

Index file: fixed-size records, sorted by key
  Each record holds a key and a reference field
Data file: not sorted, because it is entry-sequenced
  Record addition is quick (faster than with a sorted file)
The index can be kept in memory
  Records are found more quickly through the index than by searching a sorted data file
Class TextIndex encapsulates the index data and index operations

7.2 A Simple Index for E-S Files

Page 13: Chap 7 .   Indexing

Let’s See Figure 7.4

class TextIndex
{
public:
    TextIndex(int maxKeys = 100, int unique = 1);
    int Insert(const char* key, int recAddr);  // add to index
    int Remove(const char* key);               // remove key from index
    int Search(const char* key) const;         // search for key, return recAddr
    void Print(ostream&) const;
protected:
    int MaxKeys;     // maximum number of entries
    int NumKeys;     // actual number of entries
    char** Keys;     // array of key values
    int* RecAddrs;   // array of record references
    int Find(const char* key) const;
    int Init(int maxKeys, int unique);
    int Unique;      // if true, each key must be unique
};

7.2 A Simple Index for E-S Files
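
The interface above can be approximated with standard containers. The following is only a sketch under that assumption: `SimpleTextIndex` is a hypothetical class that keeps its entries sorted in a `std::vector` and uses `std::lower_bound` for lookups; it is not the book's TextIndex implementation, which manages its own arrays.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory index kept sorted by key (unique keys only).
class SimpleTextIndex {
public:
    // Insert keeps the entries ordered by key; returns false on a duplicate key.
    bool Insert(const std::string& key, int recAddr) {
        std::size_t i = position(key);
        if (i < entries_.size() && entries_[i].key == key) return false;
        entries_.insert(entries_.begin() + i, Entry{key, recAddr});
        return true;
    }
    // Remove the entry for key; returns false if the key is absent.
    bool Remove(const std::string& key) {
        std::size_t i = position(key);
        if (i == entries_.size() || entries_[i].key != key) return false;
        entries_.erase(entries_.begin() + i);
        return true;
    }
    // Returns the record address for key, or -1 if the key is absent.
    int Search(const std::string& key) const {
        std::size_t i = position(key);
        if (i == entries_.size() || entries_[i].key != key) return -1;
        return entries_[i].recAddr;
    }
private:
    struct Entry {
        std::string key;
        int recAddr;
    };
    // Binary search for the first entry whose key is not less than key.
    std::size_t position(const std::string& key) const {
        auto it = std::lower_bound(
            entries_.begin(), entries_.end(), key,
            [](const Entry& e, const std::string& k) { return e.key < k; });
        return static_cast<std::size_t>(it - entries_.begin());
    }
    std::vector<Entry> entries_;
};
```

Insertion still shifts later entries, which is the "addition requires reorganization" cost of a linear index; it is cheap here only because the index lives in memory.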

Page 14: Chap 7 .   Indexing

Index Implementation

Pages 638, 639, 640
  G.1 Recording.h
  G.2 Recording.cpp
  G.3 Makerec.cpp
Pages 641, 642
  G.4 Textind.h
  G.5 Textind.cpp

Page 15: Chap 7 .   Indexing

RetrieveRecording with the Index

RetrieveRecording(key, ...): retrieves a single record by key from the data file. It puts together the index search, file read, and buffer unpack operations into a single function.

int RetrieveRecording(Recording& recording, char* key,
                      TextIndex& RecordingIndex, BufferFile& RecordingFile)
// read and unpack the recording, return TRUE if it succeeds
{
    int result;
    result = RecordingFile.Read(RecordingIndex.Search(key));
    if (result == -1) return FALSE;
    result = recording.Unpack(RecordingFile.GetBuffer());
    return result;
}

Page 16: Chap 7 .   Indexing

Template Class for I/O Object (1)

Template class RecordFile
  We want to make the following code possible:
    Person p; RecordFile pFile; pFile.Read(p);
    Recording r; RecordFile rFile; rFile.Read(r);
  It is difficult to support files for different record types without having to modify the class
  Solution: a template class derived from BufferFile
  The actual declarations and calls:
    RecordFile<Person> pFile; pFile.Read(p);
    RecordFile<Recording> rFile; rFile.Read(r);

7.3 Using Template Classes in C++ for Object I/O

Page 17: Chap 7 .   Indexing

Template Class for I/O Object (2)

Template class RecordFile

template <class RecType>
class RecordFile : public BufferFile
{
public:
    int Read(RecType& record, int recaddr = -1);
    int Write(const RecType& record, int recaddr = -1);
    int Append(const RecType& record);
    RecordFile(IOBuffer& buffer) : BufferFile(buffer) {}
};
// The template parameter RecType must have the following methods:
//   int Pack(IOBuffer&);    pack record into buffer
//   int Unpack(IOBuffer&);  unpack record from buffer

7.3 Using Template Classes in C++ for Object I/O
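
To show how the Pack/Unpack requirement lets one template serve many record types, here is a small self-contained sketch. It is an assumption-laden stand-in: `MemoryRecordFile` stores packed strings in a vector instead of deriving from BufferFile, and `Pack`/`Unpack` take a `std::string` rather than an IOBuffer.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a buffer-backed file: each stored record is one
// packed string, and the record address is its position in the vector.
template <class RecType>
class MemoryRecordFile {
public:
    // Pack the record and store it; returns the new record's address.
    int Append(const RecType& record) {
        std::string buffer;
        record.Pack(buffer);
        slots_.push_back(buffer);
        return static_cast<int>(slots_.size()) - 1;
    }
    // Unpack the record stored at recaddr; returns false if it is out of range.
    bool Read(RecType& record, int recaddr) const {
        if (recaddr < 0 || recaddr >= static_cast<int>(slots_.size())) return false;
        return record.Unpack(slots_[recaddr]);
    }
private:
    std::vector<std::string> slots_;
};

// Any type with Pack and Unpack can be stored; Person is one example.
struct Person {
    std::string last, first;
    void Pack(std::string& buffer) const { buffer = last + '|' + first; }
    bool Unpack(const std::string& buffer) {
        std::size_t bar = buffer.find('|');
        if (bar == std::string::npos) return false;
        last = buffer.substr(0, bar);
        first = buffer.substr(bar + 1);
        return true;
    }
};
```

The file class never names Person; the compiler checks at instantiation time that the record type supplies both methods.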

Page 18: Chap 7 .   Indexing

Template Class for I/O Object (3)

Adding I/O to an existing class with RecordFile
  Add methods Pack and Unpack to class Recording
  Create a buffer object to use in the I/O:
    DelimFieldBuffer Buffer;
  Declare an object of type RecordFile<Recording>:
    RecordFile<Recording> rFile(Buffer);

Declaration and calls
  Recording r1, r2;
  rFile.Open("myfile");
  rFile.Read(r1);
  rFile.Write(r2);
  These calls directly open a file and read and write objects of class Recording

7.3 Using Template Classes in C++ for Object I/O

Page 19: Chap 7 .   Indexing

Object-Oriented Approach to I/O

Class IndexedFile
  Adds indexed access to the sequential access provided by class RecordFile
  Extends RecordFile with Update, Append, and Read methods
    Update and Append: maintain a primary key index of the data file
    Read: supports access to an object by key
  TextIndex + RecordFile ==> IndexedFile
Issues of IndexedFile
  How to make a persistent index of a file
  How to guarantee that the index is an accurate reflection of the contents of the data file

7.4 OO Support for Indexed, E-S Files of Data Objects

Page 20: Chap 7 .   Indexing

Basic Operations of IndexedFile (1)

Create the original empty index and data files
Load the index file into memory
Rewrite the index file from memory
Add records to the data file and index
Delete records from the data file
Update records in the data file
Update the index to reflect changes in the data file
Retrieve records

7.4 OO Support for Indexed, E-S Files of Data Objects

Page 21: Chap 7 .   Indexing

Basic Operations of TextIndexedFile (1)

Creating the files
  Both the index file and the data file are created as empty files with header records
  Implementation (makeind.cpp in Appendix G): Create method in class BufferFile
Loading the index into memory
  Loading and storing objects are supported in the IOBuffer classes
  Need to choose a particular buffer class to use for an index file (tindbuff.cpp in Appendix G)
  Define class TextIndexBuffer as a derived class of FixedFieldBuffer to support reading and writing of index objects

7.4 OO Support for Indexed, E-S Files of Data Objects

Page 22: Chap 7 .   Indexing

Basic Operations of TextIndexedFile (2)

Rewriting the index file from memory
  Part of the Close operation on an IndexedFile
  Write the index object back to the index file
  Should protect the index against failure
  Write changes when the index file is out of date (use a status flag)
  Implementation: Rewind and Write operations of class BufferFile
Record addition
  Add a new record to the data file
    using RecordFile<Recording>::Write
  Add an entry to the index
    using TextIndex.Insert
    Requires rearrangement; if the index is in memory, no file access is needed

7.4 OO Support for Indexed, E-S Files of Data Objects

Page 23: Chap 7 .   Indexing

Basic Operations of TextIndexedFile (3)

Record deletion
  Data file: the records need not be moved
  Index: either really delete the entry or just mark it
    using TextIndex::Remove
Record updating (2 categories)
  The update changes the value of the key field
    Delete/add approach
    Can reorder both the index and the data file
  The update does not affect the key field
    No rearrangement of the index file
    May need to reconstruct the data file

7.4 OO Support for Indexed, E-S Files of Data Objects

Page 24: Chap 7 .   Indexing

Class TextIndexedFile (1)

Member methods
  Create, Open, Close, Read (sequential and indexed), Append, and Update operations
Protected members
  Ensure the correlation between the index in memory (Index), the index file (IndexFile), and the data file (DataFile)
char* key()
  The template parameter RecType must have the key method
  Used to extract the key value from the record

7.4 OO Support for Indexed, E-S Files of Data Objects

Page 25: Chap 7 .   Indexing

Class TextIndexedFile (2)

template <class RecType>
class TextIndexedFile
{
public:
    int Read(RecType& record);             // read next record
    int Read(char* key, RecType& record);  // read by key
    int Append(const RecType& record);
    int Update(char* oldKey, const RecType& record);
    int Create(char* name, int mode = ios::in | ios::out);
    int Open(char* name, int mode = ios::in | ios::out);
    int Close();
    TextIndexedFile(IOBuffer& buffer, int keySize, int maxKeys = 100);
    ~TextIndexedFile();  // close and delete
protected:
    TextIndex Index;
    BufferFile IndexFile;
    TextIndexBuffer IndexBuffer;
    RecordFile<RecType> DataFile;
    char* FileName;  // base file name for file
    int SetFileName(char* fName, char*& dFileName, char*& IdxFName);
};

7.4 OO Support for Indexed, E-S Files of Data Objects
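
A compact way to see how Append, keyed Read, and Update keep the index and the data file in step is the in-memory sketch below. The names `MemoryIndexedFile` and `Recording::Key` are assumptions for illustration, and because records sit in fixed vector slots, the sketch sidesteps the space-management problem that variable-length records raise on update.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record type; Key() builds the primary key from label and ID.
struct Recording {
    std::string label, id, title;
    std::string Key() const { return label + id; }
};

// Hypothetical indexed file: RecType must provide std::string Key() const.
template <class RecType>
class MemoryIndexedFile {
public:
    // Append the record to the data file and add its key to the index.
    bool Append(const RecType& record) {
        if (index_.count(record.Key())) return false;  // primary keys are unique
        data_.push_back(record);
        index_[record.Key()] = data_.size() - 1;
        return true;
    }
    // Indexed read: look up the address by key, then fetch the record.
    bool Read(const std::string& key, RecType& record) const {
        auto it = index_.find(key);
        if (it == index_.end()) return false;
        record = data_[it->second];
        return true;
    }
    // Update in place; a changed key is handled as delete/add on the index.
    bool Update(const std::string& oldKey, const RecType& record) {
        auto it = index_.find(oldKey);
        if (it == index_.end()) return false;
        std::size_t addr = it->second;
        if (record.Key() != oldKey) {
            if (index_.count(record.Key())) return false;
            index_.erase(it);
            index_[record.Key()] = addr;
        }
        data_[addr] = record;
        return true;
    }
private:
    std::vector<RecType> data_;                 // entry-sequenced data
    std::map<std::string, std::size_t> index_;  // primary key -> record address
};
```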

Page 26: Chap 7 .   Indexing

Enhancements to TextIndexedFile (1)

Support other types of keys
  Restriction: the key type is restricted to string (char*)
  Relaxation: support a template class SimpleIndex with a parameter for the key type
Support data object class hierarchies
  Restriction: every object must be of the same type in RecordFile
  Relaxation: the type hierarchy supports virtual Pack methods

7.4 OO Support for Indexed, E-S Files of Data Objects

Page 27: Chap 7 .   Indexing

Enhancements to TextIndexedFile (2)

Support multirecord index files
  Restriction: the entire index must fit in a single record
  Relaxation: add protected methods Insert, Delete, and Search to manipulate the arrays of index objects
Optimization of operations
  Obvious: the most obvious optimization is to use binary search in the Find method
  Active: add a flag to the index object to avoid writing the index record back to the index file when it has not been changed

7.4 OO Support for Indexed, E-S Files of Data Objects

Page 28: Chap 7 .   Indexing

Where are we going?

Plain stream file
Persistency ==> buffer support ==> BufferFile
  <incremental approach>: deriving BufferFile using various other classes
Random access ==> index support ==> IndexedFile
  <incremental approach>: deriving TextIndexedFile using RecordFile and TextIndex

Page 29: Chap 7 .   Indexing

Too Large Index (1)

When the index is kept on secondary storage (large linear index)
Disadvantages
  Binary searching of the index requires several seeks (not substantially faster than binary searching a sorted file)
  Index rearrangement requires shifting or sorting records on secondary storage
Alternatives (to be considered later)
  Hashed organization
  Tree-structured index (e.g. B-tree)

7.5 Indexes That Are Too Large to Hold in Memory

Page 30: Chap 7 .   Indexing

Too Large Index (2)

Advantages over the use of a data file sorted by key, even if the index is on secondary storage
  Can use a binary search, even when the data records are variable-length
  Sorting and maintaining the index is less expensive than sorting and maintaining the data file
  Can rearrange the keys without moving the data records when there are pinned records

7.5 Indexes That Are Too Large to Hold in Memory

Page 31: Chap 7 .   Indexing

Index by Multiple Keys (1)

DB schema = (ID-No, Title, Composer, Artist, Label)
Find the record with ID-No "COL38358" (primary key: ID-No)
Find all the recordings of "Beethoven" (secondary key: Composer)
Find all the recordings titled "Violin Concerto" (secondary key: Title)

7.6 Indexing to Provide Access by Multiple Keys

Page 32: Chap 7 .   Indexing

Index by Multiple Keys (2)

Most people don’t want to search only by primary key
Secondary keys can be duplicated
Secondary key index: a search by secondary key must consult one additional index (the primary key index)

Composer index
  Secondary key           Primary key
  BEETHOVEN               ANG3795
  BEETHOVEN               DG139201
  BEETHOVEN               DG18807
  BEETHOVEN               RCA2626
  COREA                   WAR23699
  DVORAK                  COL31809
  PROKOFIEV               LON2312
  RIMSKY-KORSAKOV         MER75016
  SPRINGSTEEN             COL38358
  SWEET HONEY IN THE R    FF245

7.6 Indexing to Provide Access by Multiple Keys
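
The two-step lookup (secondary key to primary keys, then primary key to record address) might look like the sketch below. The container choices and the name `findByComposer` are assumptions for illustration; the book's indexes are its own classes, not `std::map` and `std::multimap`.

```cpp
#include <map>
#include <string>
#include <vector>

using PrimaryIndex = std::map<std::string, long>;                // primary key -> byte offset
using SecondaryIndex = std::multimap<std::string, std::string>;  // composer -> primary key

// Return the byte offsets of all recordings by the given composer.
std::vector<long> findByComposer(const SecondaryIndex& composers,
                                 const PrimaryIndex& primary,
                                 const std::string& composer) {
    std::vector<long> offsets;
    auto range = composers.equal_range(composer);  // all entries with this secondary key
    for (auto it = range.first; it != range.second; ++it) {
        auto hit = primary.find(it->second);       // the extra lookup in the primary index
        if (hit != primary.end()) offsets.push_back(hit->second);
    }
    return offsets;
}
```

Only the primary index holds addresses, so moving a record in the data file means updating one index rather than every secondary index.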

Page 33: Chap 7 .   Indexing

Secondary Index: Basic Operations (1)

Record addition
  Similar to adding to the primary index
  Secondary keys are stored in canonical form
    Fixed length (so keys can be truncated)
    The original name can be obtained from the data file
  The index can contain duplicate keys
    Local ordering within the same key group (entries with the same secondary key are ordered by primary key)

7.6 Indexing to Provide Access by Multiple Keys

Page 34: Chap 7 .   Indexing

Secondary Index: Basic Operations (2)

Record deletion (2 cases)
  Secondary index references the record directly
    Delete the entry from both the primary index and the secondary index
    Rearrange both indexes
  Secondary index references the primary key
    Delete only the primary index entry
    Leave the secondary index reference to the deleted record intact
    Advantage: fast
    Disadvantage: entries for deleted records take up space

7.6 Indexing to Provide Access by Multiple Keys
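
When the secondary index refers to primary keys, deletion and the later cleanup of leftover entries can be sketched as two separate steps. The names `deleteRecord` and `purgeStale` are hypothetical, and the periodic purge is one possible way to reclaim the space, shown here with standard containers.

```cpp
#include <cstddef>
#include <map>
#include <string>

using PrimaryIndex = std::map<std::string, long>;                // primary key -> byte offset
using SecondaryIndex = std::multimap<std::string, std::string>;  // secondary key -> primary key

// Deleting a record touches only the primary index; secondary entries stay.
bool deleteRecord(PrimaryIndex& primary, const std::string& primaryKey) {
    return primary.erase(primaryKey) > 0;
}

// Occasional cleanup: drop secondary entries whose primary key no longer exists.
// Returns the number of entries removed.
std::size_t purgeStale(SecondaryIndex& secondary, const PrimaryIndex& primary) {
    std::size_t removed = 0;
    for (auto it = secondary.begin(); it != secondary.end();) {
        if (primary.count(it->second) == 0) {
            it = secondary.erase(it);  // erase returns the next valid iterator
            ++removed;
        } else {
            ++it;
        }
    }
    return removed;
}
```

Between purges, a search that follows a stale secondary entry simply fails to find the key in the primary index and skips it.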

Page 35: Chap 7 .   Indexing

Secondary Index: Basic Operations (3)

Record updating
  The primary key index serves as a kind of protective buffer
  Secondary index references the record directly
    Update all files containing the record’s location
  Secondary index references the primary key (1)
    The secondary index is affected only when either the primary or the secondary key is changed

(Continued)

7.6 Indexing to Provide Access by Multiple Keys

Page 36: Chap 7 .   Indexing

Secondary Index: Basic Operations (4)

Secondary index references the primary key (2)
  When the update changes the secondary key
    Rearrange the secondary key index
  When the update changes the primary key
    Update all reference fields
    May require reordering the secondary index
  When the update is confined to other fields
    Does not affect the secondary key index

7.6 Indexing to Provide Access by Multiple Keys

Page 37: Chap 7 .   Indexing

Retrieval of Records

Types
  Primary key access
  Secondary key access
  Combination of the above
Combination of keys
  Easy with secondary key indexes
  Boolean operations (AND, OR)

7.7 Retrieval Using Combinations of Secondary Keys
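
Once each secondary key yields a sorted list of primary keys, AND and OR reduce to a match and a merge of two sorted lists. The sketch below uses the standard algorithms `std::set_intersection` and `std::set_union`; the names `matchAnd` and `mergeOr` are made up for this example, and both input lists are assumed to be sorted in ascending order.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

using KeyList = std::vector<std::string>;  // primary keys, sorted ascending

// AND: primary keys that appear in both lists.
KeyList matchAnd(const KeyList& a, const KeyList& b) {
    KeyList out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// OR: primary keys that appear in either list, without duplicates.
KeyList mergeOr(const KeyList& a, const KeyList& b) {
    KeyList out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}
```

For example, combining the composer list for Beethoven with the title list for "Symphony No. 9" by AND leaves only the recordings that satisfy both conditions, and no data record is read until the final key list is known.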

Page 38: Chap 7 .   Indexing

Inverted Lists (1)

Inverted list
  A secondary key leads to a set of one or more primary keys
Disadvantages of the secondary index structure
  The index must be rearranged whenever a record is added
  The secondary key is repeated in every entry when it has duplicates
Solution A: an array of references
Solution B: linking the list of references

7.8 Improving the Secondary Index Structure

Page 39: Chap 7 .   Indexing

Array of References

Revised composer index
  Secondary key           Set of primary key references
  BEETHOVEN               ANG3795  DG139201  DG18807  RCA2626
  COREA                   WAR23699
  DVORAK                  COL31809
  PROKOFIEV               LON2312
  RIMSKY-KORSAKOV         MER75016
  SPRINGSTEEN             COL38358
  SWEET HONEY IN THE R    FF245

* no need to rearrange
* limited reference array
* internal fragmentation

7.8 Improving the Secondary Index Structure

Page 40: Chap 7 .   Indexing

Inverted Lists (2)

Guidelines for a better solution
  No reorganization when adding
  No limit on the number of duplicate keys
  No internal fragmentation
Solution B: linking the list of references
  Secondary index entry: secondary key field + relative record number of the first corresponding primary key reference
  A list of primary key references, e.g. PROKOFIEV: ANG36193, LON2312

7.8 Improving the Secondary Index Structure

Page 41: Chap 7 .   Indexing

Linking List of References (1)

Improved revision of the composer index

Secondary Index file
  RRN  Secondary key          First reference (RRN in Label ID List file)
  0    BEETHOVEN              3
  1    COREA                  2
  2    DVORAK                 7
  3    PROKOFIEV              10
  4    RIMSKY-KORSAKOV        6
  5    SPRINGSTEEN            4
  6    SWEET HONEY IN THE R   9

Label ID List file
  RRN  Primary key   Next reference
  0    LON2312       -1
  1    RCA2626       -1
  2    WAR23699      -1
  3    ANG3795       8
  4    COL38358      -1
  5    DG18807       1
  6    MER75016      -1
  7    COL31809      -1
  8    DG139201      5
  9    FF245         -1
  10   ANG36193      0

7.8 Improving the Secondary Index Structure
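
The linked-list organization can be sketched as a map of list heads plus an entry-sequenced vector of list entries. Everything named here (`ListEntry`, `InvertedIndex`, `references`, `add`) is hypothetical, and keeping each list in primary key order on insertion is an assumption of this sketch.

```cpp
#include <map>
#include <string>
#include <vector>

// One entry of the list file: a primary key and the RRN of the next entry
// for the same secondary key (-1 marks the end of the list).
struct ListEntry {
    std::string primaryKey;
    int next;
};

struct InvertedIndex {
    std::map<std::string, int> heads;  // secondary key -> RRN of first reference
    std::vector<ListEntry> list;       // entry-sequenced list file

    // Follow the chain for one secondary key and collect its primary keys.
    std::vector<std::string> references(const std::string& secondaryKey) const {
        std::vector<std::string> keys;
        auto it = heads.find(secondaryKey);
        int rrn = (it == heads.end()) ? -1 : it->second;
        while (rrn != -1) {
            keys.push_back(list[rrn].primaryKey);
            rrn = list[rrn].next;
        }
        return keys;
    }

    // Append the new reference at the end of the list file, then link it
    // into its chain in primary key order. No existing entry is moved.
    void add(const std::string& secondaryKey, const std::string& primaryKey) {
        int newRrn = static_cast<int>(list.size());
        list.push_back({primaryKey, -1});
        auto it = heads.find(secondaryKey);
        if (it == heads.end() || primaryKey < list[it->second].primaryKey) {
            list[newRrn].next = (it == heads.end()) ? -1 : it->second;
            heads[secondaryKey] = newRrn;  // new head of the chain
            return;
        }
        int prev = it->second;
        while (list[prev].next != -1 &&
               list[list[prev].next].primaryKey < primaryKey)
            prev = list[prev].next;
        list[newRrn].next = list[prev].next;
        list[prev].next = newRrn;
    }
};
```

Adding a reference changes at most one link field and one head, which is why the list file never needs sorting.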

Page 42: Chap 7 .   Indexing

Linking List of References (2)

The primary key references are kept in a separate, entry-sequenced file
Advantages
  The secondary index file is rearranged only when a secondary key is added or changed
  Rearrangement is quick (fewer and smaller records)
  Less penalty associated with keeping the secondary index file on secondary storage (less need for sorting)
  The Label ID List file does not need to be sorted
  Reusing the space of deleted records is easy (the list file has fixed-length records)

7.8 Improving the Secondary Index Structure

Page 43: Chap 7 .   Indexing

Linking List of References (3)

Disadvantage
  References for the same secondary key may not be physically grouped
  Lack of locality can involve a large amount of seeking
  Solution: keep the Label ID List file in memory
    The same Label ID List file can hold the lists of a number of secondary index files
    If it is too large for memory, only part of it can be loaded

7.8 Improving the Secondary Index Structure

Page 44: Chap 7 .   Indexing

Selective Indexes

Selective index: an index on a subset of records
  Contains entries for only some of the records in the file
  Provides a selective view
  Useful when the contents of a file fall into several categories
  e.g. 20 < Age < 30 and $1000 < Salary

7.9 Selective Indexes
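
A selective index can be built by applying the selection condition while scanning the file. The sketch below assumes a hypothetical `Employee` record and a hypothetical `buildSelectiveIndex` helper; the predicate is supplied by the caller, so the same function can produce different selective views.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record type for the Age/Salary example.
struct Employee {
    std::string id;
    int age;
    double salary;
};

// Index only the records that satisfy the selection condition.
// The result maps each selected record's id to its relative record number.
template <class Predicate>
std::map<std::string, std::size_t> buildSelectiveIndex(
        const std::vector<Employee>& file, Predicate selected) {
    std::map<std::string, std::size_t> index;
    for (std::size_t rrn = 0; rrn < file.size(); ++rrn)
        if (selected(file[rrn])) index[file[rrn].id] = rrn;
    return index;
}
```

A call such as `buildSelectiveIndex(file, [](const Employee& e) { return e.age > 20 && e.age < 30 && e.salary > 1000; })` yields an index that covers only the matching category.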

Page 45: Chap 7 .   Indexing

Index Binding (1)

When should a key index be bound to the physical address of its associated record?
File construction time binding (tight, in-the-data binding)
  Tight binding and faster access
  The case for the primary key index
  When a secondary key is also bound at that time:
    Simpler and faster retrieval
    Reorganization of the data file results in modifications to all bound index files

7.10 Binding

Page 46: Chap 7 .   Indexing

Index Binding (2)

Postpone binding until a record is actually retrieved (retrieval-time binding)
  Minimal reorganization and a safe approach
  Mostly for secondary keys
Tight, in-the-data binding is good when
  The data file is static, with little or no change
  Rapid performance during retrieval is required
  e.g. mass-produced, read-only optical disks

7.10 Binding

Page 47: Chap 7 .   Indexing

Let’s Review (1)

7.1 What Is an Index?
7.2 A Simple Index for Entry-Sequenced Files
7.3 Using Template Classes in C++ for Object I/O
7.4 Object-Oriented Support for Indexed, Entry-Sequenced Files of Data Objects
7.5 Indexes That Are Too Large to Hold in Memory

Page 48: Chap 7 .   Indexing

Let’s Review (2)

7.6 Indexing to Provide Access by Multiple Keys
7.7 Retrieval Using Combinations of Secondary Keys
7.8 Improving the Secondary Index Structure: Inverted Lists
7.9 Selective Indexes
7.10 Binding