Distributed Processing Systems
(Distributed File System)
오 상 규
Sogang University, Graduate School of Information and Communications
Email : [email protected]
Definitions
File
an abstraction of permanent storage; a sequence of similar-sized data items (typically 8-bit bytes)
Directory
a file, of a special type, that provides a mapping from text names to internal file identifiers.
File system
responsible for the organization, storage, retrieval, naming, sharing and protection of files.
File storage
implemented on magnetic disks and non-volatile storage media.
Definitions (cont.)
[Figure: the file system is the part of a computer's OS, alongside process management and memory management, that manages the digitization of files onto storage; files are organized into directories or folders.]
Definitions (cont.)
Unique File Identifiers (UFIDs):
the file server creates a UFID for a file; the directory service records the UFID with its name
Access control : a combination of the two approaches below
Capability approach : a file can be accessed with a valid capability
Identity-based approach : list users and their entitled services
Mutable and Immutable files
Mutable means that there is only one stored version of a file (Sun NFS, CFS, LOCUS). Immutable means that the file cannot be modified once it has been created.
Definitions (cont.)
Repeatable (idempotent) operations
multiple executions have the same effect as a single execution.
Stateless file servers
no information is stored about previous operations.
Atomicity
if an operation terminates successfully, the new state is consistent and semantically correct.
if the operation fails, the file state remains unchanged.
File System Taxonomy
LEVEL 1 : One user performs computation via a single process, as in the IBM PC and Apple Mac. File system design issues include the naming structure, the application programming interface, the mapping to physical storage media, and integrity against failures.
LEVEL 2 : A single user computes with multiple processes, as in OS/2. The file system must also address concurrency control.
LEVEL 3 : Multiple users share data and resources. The file system must specify and enforce security.
LEVEL 4 : Distributed file systems - multiple users, physically dispersed in a network of autonomous computers, share one common file system.
File System Modules
Directory service
Directory module : relates file names to file IDs
Access control module : checks permissions for the requested operation
File service
File addressing module : uses the file location map to relate file IDs to files
File access module : uses the file index to find file pages for reads or writes
Block service
Block module : accesses and allocates disk blocks
Device module : disk I/O and buffering
Distributed file service requirements
Access transparency : client programs should be unaware of the distribution of files.
Location transparency : client programs should see a uniform file name space.
Concurrency transparency : changes to a file by one client should not interfere with the operation of other clients simultaneously accessing or changing the same file.
Failure transparency : the correct operation of servers after the failure of a client, and the correct operation of client programs in the face of lost messages.
Distributed file service requirements (cont.)
Performance transparency : client programs should continue to perform satisfactorily while the load on the service varies within a specified range.
Hardware and operating system heterogeneity : the service interface should be defined so that client and server software can be implemented for different operating systems and computers.
Scalability : the service can be extended by incremental growth to deal with a wide range of loads and network sizes.
Replication transparency : a file may be represented by several copies of its contents at different locations.
Distributed file service requirements (cont.)
Migration transparency : neither client programs nor system administration tables in client nodes need to be changed when files are moved.
Support for fine-grained distribution of data
Tolerance to network partitioning and detached operation
File Service Components
Flat File Service
concerned with implementing operations on the contents of files.
Unique File Identifiers (UFIDs) are used to refer to files in all requests for flat file service operations.
Directory Service
provides a mapping between text names for files and their UFIDs.
provides the functions needed to generate and update directories and to obtain UFIDs from directories.
the directory service is itself a client of the flat file service.
File Service Components (cont.)
Client Module
an extension of the user package
runs in each client computer, integrating and extending the operations of the flat file service and the directory service under a single API
holds information about the network locations of the flat file server and directory server processes.
plays an important role in achieving satisfactory performance through the implementation of a cache.
File Service Components (cont.)
[Figure: user programs in each client computer call the client module through a single application programming interface; the client module communicates across the network, via the file service RPC interface, with the file service and directory service processes.]
Design Issues
Flat file service
offer a simple, general purpose set of operations.
Fault tolerance
the service continues to operate in the face of client and server failures.
The RPC interfaces can be designed in terms of idempotent operations ensuring that duplicated requests do not result in invalid updates to files.
the servers can be stateless.
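The idempotency point above can be shown with a small sketch (Python; the dict-backed store and function name are invented stand-ins): a Write that names an absolute start position is idempotent, so a stateless server can safely re-execute a duplicated request without corrupting the file.

```python
# Minimal sketch (not the lecture's server code): position-based Write is
# idempotent, so repeating the same request leaves the file unchanged.

def write(store, ufid, i, data):
    """Write `data` into file `ufid` starting at item i (1-based),
    extending the file if necessary."""
    content = store.get(ufid, [])
    end = i - 1 + len(data)
    if len(content) < end:
        # extend with filler items so the slice assignment fits
        content = content + [0] * (end - len(content))
    content[i - 1:end] = data
    store[ufid] = content
    return store

store = {"f1": [1, 2, 3]}
write(store, "f1", 2, [9, 9, 9])
once = list(store["f1"])
write(store, "f1", 2, [9, 9, 9])   # duplicated request: same effect
assert store["f1"] == once
```

An append operation ("write at current end") would not have this property, which is why the RPC interface is phrased in terms of explicit positions.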
Design Issues (cont.)
Directory Service
The separation of the directory service from the file service enables a variety of directory services to be designed and offered for use with a single file service.
Client module
hides low level constructs such as the UFIDs used in the RPC interfaces of the flat file service and the directory service from the user level programs.
Attribute record structure
Maintained by the flat file service : File length, Creation timestamp, Read timestamp, Write timestamp, Attribute timestamp, Reference count, File type
Maintained by the directory service : Owner, Access control list
Mechanisms for Building DFS
Mounting
allows different file name spaces to be combined into a single hierarchical name space.
mount table in the kernel maps mount points to storage devices.
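As a rough illustration of the mount-table idea (paths and device/server names invented): the longest matching mount point determines where a path resolves.

```python
# Illustrative kernel mount table: longest-prefix match on mount points.
mount_table = {
    "/": "disk0",
    "/usr": "disk1",
    "/usr/students": "serverX:/export/people",   # a remote mount
}

def resolve(path):
    """Return (device-or-server, remainder of the path below the mount)."""
    best = max((m for m in mount_table
                if path == m or path.startswith(m.rstrip("/") + "/")),
               key=len)
    return mount_table[best], path[len(best):] or "/"

assert resolve("/usr/students/jon")[0] == "serverX:/export/people"
assert resolve("/usr/bin/ls")[0] == "disk1"
```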
Caching ( file caching )
exploits the temporal locality of reference.
data can be cached either in main memory or on the local disks of clients
data can also be cached at servers to reduce access latency.
Mechanisms for Building DFS (cont.)
[Figure: a name space hierarchy (directories a-k) assembled from subtrees stored on Server X, Server Y, and Server Z, joined at mounting points.]
Mechanisms for Building DFS (cont.)
Cache Consistency
Server-initiated approach : servers inform cache managers whenever the clients' cached data become stale.
Client-initiated approach : client cache managers validate data with the server before returning it to clients.
No file caching : used during concurrent-write sharing.
Sequential write sharing : a client opens a file that has already been modified and closed by another client; timestamps are used to handle this problem.
Mechanisms for Building DFS (cont.)
Replication
how to keep replicas up to date
how to detect inconsistencies.
Scalability
suitability to handle system expansion
Semantics
a read operation returns the value of the latest write operation
Mechanisms for Building DFS (cont.)
Location Transparency
Files are named and accessed independently of their locations and of where they are called from
Security
Authentication and Access Control
Flat file service operations
Read(File, i, n) -> Data -- REPORTS(BadPosition)
: If 1 ≤ i ≤ Length(File) : reads a sequence of up to n items from File starting at item i and returns it in Data.
  If i > Length(File) : returns the empty sequence and reports an error.
Write(File, i, Data) -- REPORTS(BadPosition)
: If 1 ≤ i ≤ Length(File) + 1 : writes the sequence Data to File starting at item i, extending the file if necessary.
  If i > Length(File) + 1 : null operation; reports an error.
Create() -> File
: Creates a new file of length 0 and delivers a UFID for it.
Flat file service operations (cont.)
Truncate(File, l)
: If l ≤ Length(File) : shortens the file to length l ; otherwise does nothing.
Delete(File)
: Removes the file from the file store.
GetAttributes(File) -> Attr
: Returns the file attributes for the file.
SetAttributes(File, Attr)
: Sets the file attributes (only those attributes that clients are permitted to change; cf. the attribute record structure).
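A minimal executable sketch of the position rules above (Python; the dict-backed store and the error class are invented stand-ins for the flat file service):

```python
# Sketch of the Read/Write bounds checks from the operation definitions.
class BadPosition(Exception):
    pass

def read(files, file, i, n):
    content = files[file]
    if not (1 <= i <= len(content)):
        raise BadPosition(i)           # i out of range: report an error
    return content[i - 1:i - 1 + n]    # up to n items starting at item i

def write(files, file, i, data):
    content = files[file]
    if not (1 <= i <= len(content) + 1):
        raise BadPosition(i)           # beyond end+1: null op, report error
    content[i - 1:i - 1 + len(data)] = data   # extends the file if needed
```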
Flat file service Interface definition
DEFINITION MODULE Files;
EXPORT QUALIFIED Read, Write, Length, Truncate, Create, Delete,
    ErrorType, Sequence, SeqPtr, MAX, UFID, ErrorReport;
CONST MAX = 2048;
TYPE
    Sequence = RECORD
        l : CARDINAL;
        s : ARRAY [1..MAX] OF CHAR;
    END;
VAR
    ErrorReport : ErrorType;
Flat file service Interface definition (cont.)
PROCEDURE Read(File : UFID; i, n : CARDINAL) : SeqPtr;
PROCEDURE Write(File : UFID; i : CARDINAL; Data : SeqPtr);
PROCEDURE Length(File : UFID) : CARDINAL;
(* Implemented in terms of GetAttributes *)
PROCEDURE Truncate(File : UFID; l : CARDINAL);
PROCEDURE Create() : UFID;
PROCEDURE Delete(File : UFID);
END Files.
CopyFile using flat file operations
MODULE CopyFile;
FROM InOut IMPORT WriteString, WriteLn;
FROM Files IMPORT Read, Write, Length, Truncate,
    UFID, ErrorType, MAX, ErrorReport;
PROCEDURE CopyFile(File1, File2 : UFID);
VAR
    i, l : CARDINAL;
BEGIN
    l := Length(File1);
    Truncate(File2, l);
CopyFile using flat file operations (cont.)
    FOR i := 1 TO l BY MAX DO
        Write(File2, i, Read(File1, i, MAX));
    END;
    IF ErrorReport <> NONE THEN
        WriteString("CopyFile failed");
        WriteLn;
    END;
END CopyFile;
END CopyFile.
Directory service operations (1)
Lookup(Dir, Name, AccessMode, UserID) -> File -- REPORTS(NotFound, NoAccess)
: Locates the text name in the directory and returns the relevant UFID; reports an error if it cannot be found or if the client making the request is not authorized to access the file in the manner specified by AccessMode.
AddName(Dir, Name, File, UserID) -- REPORTS(NameDuplicate)
: If Name is not in the directory : adds the (Name, File) pair to the directory and updates the attribute record accordingly.
  If Name is already in the directory : reports an error.
Directory service operations (2)
UnName(Dir, Name) -- REPORTS(NotFound)
: If Name is in the directory : the entry containing Name is removed from the directory.
  If Name is not in the directory : reports an error.
ReName(Dir, OldName, NewName) -- REPORTS(NotFound)
: If OldName is in the directory : the entry containing OldName is given the new name.
  If OldName is not in the directory : reports an error.
GetName(Dir, Pattern) -> NameSeq
: Returns all of the text names in the directory that match the regular expression given by Pattern.
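The operations above amount to maintaining a mapping from text names to UFIDs. A dict-backed sketch (error names follow the slides; the mapping itself is an illustrative stand-in for the directory service):

```python
# Sketch of the directory service: a name -> UFID mapping.
import re

class NotFound(Exception): pass
class NameDuplicate(Exception): pass

def add_name(directory, name, ufid):
    if name in directory:
        raise NameDuplicate(name)        # Name already present
    directory[name] = ufid

def lookup(directory, name):
    if name not in directory:
        raise NotFound(name)
    return directory[name]               # the UFID recorded for the name

def un_name(directory, name):
    if name not in directory:
        raise NotFound(name)
    del directory[name]

def re_name(directory, old, new):
    if old not in directory:
        raise NotFound(old)
    directory[new] = directory.pop(old)

def get_names(directory, pattern):
    # GetName(Dir, Pattern): all text names matching the regular expression
    return [n for n in directory if re.fullmatch(pattern, n)]
```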
Implementation techniques (1)
File Group
a collection of files mounted on a server computer.
file groups support the allocation of files to file servers in larger logical units and enable the service to be implemented with files stored on several servers.
In a file system that supports file groups, the representation of UFIDs includes a file group identifier component.
file group identifier : Internet address (32 bits) | date (16 bits)
Implementation techniques (2)
Space leaks
a disk space leak occurs whenever the program responsible for creating a file terminates without having entered the UFID of the file into any directory and without deleting the file. The client module should therefore include a composite operation :
CreateFile(Name, Dir)
: The operation for creating a new file takes the text name to be assigned to the new file and Dir, the UFID of a directory into which the file is to be entered. It creates a new file and adds Name and the UFID of the new file to Dir.
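The composite operation can be sketched as follows (Python; the create/delete/add_name primitives are hypothetical stand-ins for the flat file and directory services): if entering the name fails, the newly created file is deleted so no orphaned, directory-less file remains.

```python
# Sketch of CreateFile(Name, Dir) as a composite operation in the client
# module, undoing the Create if AddName fails (avoiding a space leak).
def create_file(name, directory, flat_service, dir_service):
    ufid = flat_service.create()
    try:
        dir_service.add_name(directory, name, ufid)
    except Exception:
        flat_service.delete(ufid)   # undo: no file without a directory entry
        raise
    return ufid
```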
Implementation techniques (3)
Capabilities and access control
A capability is a 'digital key' - a large integer selected in a manner that makes it difficult to counterfeit. The directory service requires clients to state their identity and checks it against an access control list.
Construction of UFIDs
The flat file service must generate UFIDs in a manner that not only ensures uniqueness but makes them difficult to counterfeit.
UFID : File group ID (48 bits) | File number (32 bits) | Random number (32 bits)
Implementation techniques (4)
Access modes
Access control to files is based upon the fact that a UFID constitutes a 'key' or capability to access a file. Access modes can be supported by extending the UFID to include a permission field.
UFID : File group ID (48 bits) | File number (32 bits) | Random number (32 bits) | Permission field (5 bits : Read, Write/Truncate, Delete, GetAttributes, SetAttributes)
Implementation techniques (5)
Encryption of the permission field
To defeat attempts to penetrate the security of the file service, the permission field and the random number are encrypted together to produce a single 37-bit number. An unencrypted permission field is also included so that client and server programs can determine by examination what permissions are included in a UFID.
UFID : File group ID (48 bits) | File number (32 bits) | Encrypted permission bits + random number (37 bits) | Unencrypted permission bits (5 bits)
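A hedged sketch of the capability check (Python). The slides call for encrypting the permission bits together with the random number; here a truncated HMAC stands in for that encryption purely for illustration, and the layout, secret, and field names are assumptions, not the lecture's actual scheme:

```python
# Illustrative capability-style UFID: the server seals (permissions, random
# number) into a 37-bit field; a client that flips its unencrypted
# permission bits fails the server's check.
import hmac, hashlib, secrets

SECRET = b"server-secret"          # known only to the file server (assumed)
READ, WRITE, DELETE, GETATTR, SETATTR = 1, 2, 4, 8, 16

def seal(perms, rand):
    msg = perms.to_bytes(1, "big") + rand.to_bytes(4, "big")
    mac = int.from_bytes(hmac.new(SECRET, msg, hashlib.sha256).digest(), "big")
    return mac & ((1 << 37) - 1)   # keep 37 bits, matching the slide layout

def make_ufid(group, number, perms):
    rand = secrets.randbits(32)
    return (group, number, seal(perms, rand), perms, rand)

def check(ufid, wanted):
    group, number, sealed, perms, rand = ufid
    # server re-derives the sealed field; forged permission bits won't match
    return sealed == seal(perms, rand) and (perms & wanted) == wanted
```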
Implementation techniques (6)
File representation
[Figure: a file is represented by a block index whose first entry holds the attribute record; the remaining entries point to the file's pages (Pages 1-4 in the figure), with unused entries left empty.]
On-disk i-node
[Figure: a UNIX file system volume with boot and super blocks, the i-node list, and data blocks. Each i-node records the mode, owners, timestamps, and size, plus direct block pointers and single-, double-, and triple-indirect pointers that lead to the data blocks.]
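The indirect-pointer scheme in the figure bounds the maximum file size. A back-of-the-envelope sketch (the block size, pointer width, and count of 12 direct pointers are illustrative assumptions, not taken from the slides):

```python
# Maximum file size addressable by a classic UNIX inode layout with
# direct, single-, double-, and triple-indirect block pointers.
def max_file_size(block_size=8192, direct=12, ptr_size=4):
    per_block = block_size // ptr_size           # pointers per indirect block
    blocks = direct + per_block + per_block**2 + per_block**3
    return blocks * block_size

print(max_file_size())   # with these parameters: tens of terabytes
```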
Implementation techniques (7)
File location
The flat file service must translate UFIDs to file server locations and file addresses.
Implementation :
1st step - identify the server that holds the required file group (done by the client module).
2nd step - locate the required file's block index (done by the server that holds the file).
Group location
A group location database, giving the current locations of all accessible file groups in the form of <FileGroupID, PortIdentifier> pairs is replicated in each participating server.
Implementation techniques (8)
File addressing
When a server receives a flat file service request, it uses the file group identifier and the file number to locate the required file's block index. B-trees are an effective method for structuring a set of data for searching.
Server cache
avoids repeated accesses to disk storage for the same block.
write-through cache operation should be used.
Client Cache
the client cache also uses write-through.
CASE STUDY :
The Sun Network File System
Sun Microsystem’s Network File System
the first file service that was designed as a product (1985).
To encourage its adoption as a standard, the definitions of key interfaces were placed in the public domain. [ Sun 1989 ]
provides a working solution to many requirements for distributed file access, but it does not address some issues (replication transparency, concurrency transparency, scalability) whose importance is likely to grow as the size and range of applications for distributed systems increase.
Design Goals of NFS (1)
to achieve a high level of support for hardware and operating system heterogeneity.
Access transparency : provides an API to local processes that is identical to the local operating system's interface.
Location transparency : each client establishes a file name space by adding remote file systems to its local name space.
Failure transparency : the stateless and idempotent nature of the NFS file access protocol ensures that the failure modes observed by clients when accessing remote files are similar to those for local file access.
Design Goals of NFS (2)
Performance transparency : Both the client and the server employ caching to achieve satisfactory performance.
Migration transparency : File systems may be moved between servers, but the remote mount tables in each client must then be separately updated. So migration transparency is not fully achieved by NFS.
Requirements not addressed by NFS
Replication transparency : NFS does not support file replication. The Sun Network Information Service (NIS) is a separate service available for use with NFS that supports the replication of simple databases.
Concurrency transparency : NFS does not aim to improve upon the UNIX approach to the control of concurrent updates to files.
Scalability : NFS was originally designed to allow each server to support approximately 5-10 clients.
Sun NFS features
NFS is a distributed file system that provides transparent, remote access to file systems on UNIX and other systems.
NFS uses the External Data Representation (XDR) standard.
It is implemented on top of an RPC package
NFS uses UDP and IP as its network protocol
Client machines mount file systems located on servers so they can be accessed as if they were local.
Remote mounting on an NFS client
[Figure: Server 1 holds /export/people (big, jon, bob); Server 2 holds /nfs/users (jim, ann, jane, joe); the client's local tree (/, vmunix, usr) has remote mounts at /usr/students and /usr/staff.]
The file system mounted at /usr/students in the client is actually the subtree located at /export/people in Server 1; the file system mounted at /usr/staff in the client is actually the subtree located at /nfs/users in Server 2.
VFS (Virtual File System)
to allow different file system types to be mounted on a single machine.
It separates file system operations from implementation.
It dynamically selects the appropriate file system based on what file or directory needs to be accessed.
The VFS interface to any underlying file system is through the virtual node interface (vnode).
Vnodes are data structures that uniquely identify files, similar to i-nodes in UNIX.
VFS mounting
Client-side vnode interface
File system Operations
mount(varies) : system call to mount a file system.
mount_root( ) : mount a file system as root.
VFS Operations
unmount(vfs) : unmount a file system.
sync(vfs) : flush delayed-write blocks.
Vnode operations
open(vp, flags) : mark a file open.
rdwr(vp, uio, rwflag, flags) : read or write a file.
mkdir(dvp, name) : create a directory.
Sun NFS Stateless Protocol
Stateless Protocol
ensures robustness when clients, servers, or the network experience failures.
• if a client fails, the server does not need to take any action.
• if a server fails, the client retransmits its request until it receives a response.
Disadvantages of Stateless Protocols
a server may receive multiple copies of the same request.
a server must save any modified data to stable storage before completing a client request.
NFS software architecture
[Figure: in the client computer, a user-level client process issues system calls into the UNIX kernel; the virtual file system directs local requests to the UNIX file system and remote ones to the NFS client module, which uses the NFS protocol across the network to reach the NFS server module inside the server computer's kernel, where the virtual file system passes requests on to the server's UNIX file system.]
NFS Server operations (RPC interface)
lookup(dirfh, name) -> fh, attr
: Returns the file handle and attributes for the file name in the directory dirfh.
create(dirfh, name, attr) -> newfh, attr
: Creates a new file name in directory dirfh with attributes attr and returns the new file handle and attributes.
remove(dirfh, name) -> status
: Removes file name from directory dirfh.
getattr(fh) -> attr
: Returns the file attributes of file fh (similar to the UNIX stat system call).
NFS Server operations (2)
setattr(fh, attr) -> attr
: Sets the attributes of a file (mode, user id, group id, size, access time, and modify time). Setting the size to 0 truncates the file.
read(fh, offset, count) -> attr, data
: Returns up to count bytes of data from a file starting at offset; also returns the latest attributes of the file.
write(fh, offset, count, data) -> attr
: Writes count bytes of data to a file starting at offset. Returns the attributes of the file after the write has taken place.
NFS Server operations (3)
rename(dirfh, name, todirfh, toname) -> status
: Changes the name of file name in directory dirfh to toname in directory todirfh.
link(newdirfh, newname, dirfh, name) -> status
: Creates an entry newname in the directory newdirfh which refers to the file name in the directory dirfh.
symlink(newdirfh, newname, string) -> status
: Creates an entry newname in the directory newdirfh of type symbolic link with the value string. The server does not interpret the string, but makes a symbolic link file to hold it.
NFS Server operations (4)
readlink(fh) -> string
: Returns the string that is associated with the symbolic link file identified by fh.
mkdir(dirfh, name, attr) -> newfh, attr
: Creates a new directory name with attributes attr and returns the new file handle and attributes.
rmdir(dirfh, name) -> status
: Removes the empty directory name from the parent directory dirfh. Fails if the directory is not empty.
NFS Server operations (5)
readdir(dirfh, cookie, count) -> entries
: Returns up to count bytes of directory entries from the directory dirfh. Each entry contains a file name, a file id, and an opaque pointer to the next directory entry, called a cookie. The cookie is used in subsequent readdir calls to start reading from the following entry. A readdir with a cookie value of 0 reads from the first entry in the directory.
statfs(fh) -> fsstats
: Returns file system information (such as block size, number of free blocks, and so on) for the file system containing file fh.
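The cookie mechanism of readdir can be sketched as follows (Python; a list-index cookie is an illustrative choice, not NFS's actual opaque format):

```python
# Sketch of cookie-based directory enumeration: the cookie is an opaque
# resumption point; a cookie of 0 starts from the first entry.
def readdir(entries, cookie, count):
    """Return up to `count` entries plus the cookie to resume from."""
    chunk = entries[cookie:cookie + count]
    return chunk, cookie + len(chunk)

names = ["a", "b", "c", "d", "e"]
out, cookie = [], 0
while True:
    chunk, cookie = readdir(names, cookie, 2)
    if not chunk:
        break                     # empty chunk: end of directory
    out.extend(chunk)
assert out == names
```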
Implementation (1)
The NFS client and server modules communicate using remote procedure calls (Sun's RPC system was developed for use in NFS). Because the file and directory operations are integrated in a single service, the space leak problem cannot arise.
Virtual file system
A VFS module has been added to the UNIX kernel.
Role
• distinguish between local and remote files.
• translate between the UNIX-independent file identifiers used by NFS and the internal file identifiers used in UNIX and other file systems.
Implementation (2)
Virtual file system (cont.)
File handle = File system identifier | i-node number of file | i-node generation number
• file system identifier : a unique number allocated to each file system when it is created.
• i-node generation number : incremented each time the i-node number is reused (needed because in the UNIX file system i-node numbers are reused after a file is removed).
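A small sketch of why the generation number matters (data structures invented for illustration): once an i-node is reused, an old handle's generation no longer matches, so the server can detect and reject the stale handle.

```python
# Sketch of NFS file handle fields and stale-handle detection.
from collections import namedtuple

FileHandle = namedtuple("FileHandle", "fsid inode generation")

# server-side view: current generation for each (fsid, inode)
inode_table = {(0, 17): 1}

def is_stale(fh):
    return inode_table.get((fh.fsid, fh.inode)) != fh.generation

old = FileHandle(0, 17, 1)
inode_table[(0, 17)] = 2          # file removed, i-node reused
assert is_stale(old)              # old handle now refers to a dead file
assert not is_stale(FileHandle(0, 17, 2))
```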
Implementation (3)
Virtual file system (cont.)
The virtual file system layer
• one VFS structure for each mounted file system.
• one v-node per open file.
• the v-node contains an indicator showing whether a file is local or remote.
Implementation (4)
Client integration
emulates the semantics of the standard UNIX file system primitives.
integrated with the UNIX kernel.
• user programs can access files via UNIX system calls without recompilation or reloading.
• a single client module serves all user-level processes, with a shared cache of recently used blocks.
• the encryption key used to protect user IDs passed to the server can be retained in the kernel.
cooperates with the virtual file system in each client machine.
Implementation (5)
Server integration
integrated with the UNIX kernel, mainly for performance reasons.
• a user-level NFS server achieved approximately 80% of the performance of the kernel version.
Access control and authentication
Since the NFS server is stateless, it does not keep files open on behalf of its clients.
• the server must check the user's identity against the file's access permission attributes afresh on each request.
DES encryption of the user's authentication information is used in the RPC protocol (NFS 4.0).
Implementation (6)
Path name translation
path name parsing and translation are controlled by the client.
• each part of a name that refers to a remote-mounted directory is translated to a file handle using a separate lookup request to the remote server.
Mount service
The mounting of remote file systems is supported by a separate mount service process that runs at user level on each NFS server computer.
Clients use a modified version of the UNIX mount command, specifying the remote host name, the remote path name, and the local name.
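The per-component translation described above can be sketched as follows (Python; the dict-of-dicts "server" is an invented stand-in for the remote lookup RPC):

```python
# Sketch of client-controlled pathname translation: one lookup per
# path component, starting from the mount point's file handle.
tree = {
    "root": {"usr": "usr_fh"},
    "usr_fh": {"staff": "staff_fh"},
    "staff_fh": {"jim": "jim_fh"},
}

def lookup(dirfh, name):
    return tree[dirfh][name]          # stand-in for the lookup RPC

def translate(path, root_fh="root"):
    fh = root_fh
    for component in path.strip("/").split("/"):
        fh = lookup(fh, component)    # one remote lookup per component
    return fh

assert translate("/usr/staff/jim") == "jim_fh"
```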
Implementation (7)
Mount service (cont.)
hard mounting
• when a user-level process accesses a file in the file system, the process is suspended until the request can be completed.
• in the case of server failure, user-level processes are suspended until the server restarts.
soft mounting
• in the case of server failure, the NFS client module returns a failure indication to user-level processes after a small number of retries.
Implementation (8)
Mount service (cont.)
mount request
• performed as part of the system initialization process in the client (by editing the UNIX startup script /etc/rc).
• an individual user can change the configuration using the mount command.
Automounter
dynamically mounts a file system whenever an 'empty' mount point is referenced by a client.
runs as a user-level UNIX process in each client.
maintains a table of mount points (path names).
Implementation (9)
Automounter (cont.)
The Automounter behaves like a local NFS server at the client machine.
read-only replication can be achieved by listing several servers containing identical file systems against a name in the Automounter table.
• useful for heavily used file systems that change infrequently.
Implementation (10)
The new autofs Automounter
Implementation (11)
Server caching
conventional UNIX system
• read-ahead protocol : anticipates read accesses and fetches the pages following those that have most recently been read.
• delayed-write protocol : when a page has been altered, its new contents are written to disk only when the buffer page is required for another page.
NFS server
• write-through protocol : write each modification to disk immediately because a failure of the server might otherwise result in the undetected loss of data by clients.
Implementation (12)
Client caching
The NFS client module caches the results of read, write, getattr, lookup and readdir operations in order to reduce the number of requests transmitted to servers.
A timestamp-based method is used to validate cached blocks.
the validation check is performed whenever a file is opened and whenever the server is contacted to fetch a new block from a file.
when a cached page is modified it is marked as dirty and is scheduled to be flushed to the server asynchronously.
Since NFS clients cannot determine whether a file is shared or not, the validation procedure must be used for all file accesses.
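The timestamp validation above is commonly expressed as a freshness condition: a cached block is treated as valid if it was validated within the last t seconds, or if the modification timestamp cached with the block still matches the server's. A sketch (variable names illustrative):

```python
# Sketch of the NFS client cache validity check:
# valid  <=>  (now - Tc < t)  or  (Tm_client == Tm_server)
def is_valid(now, tc, tm_client, tm_server, t=3.0):
    # tc: time the cache entry was last validated
    # tm_client / tm_server: file modification time as cached / at server
    return (now - tc < t) or (tm_client == tm_server)

assert is_valid(now=10.0, tc=9.0, tm_client=1.0, tm_server=5.0)   # recent
assert is_valid(now=10.0, tc=0.0, tm_client=5.0, tm_server=5.0)   # unchanged
assert not is_valid(now=10.0, tc=0.0, tm_client=1.0, tm_server=5.0)
```

The first disjunct avoids a round trip to the server for recently validated entries; the second detects that the file has not changed since the block was cached.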
Implementation (13)
Client caching (cont.)
asynchronous reads and writes are achieved by the inclusion of one or more bio-daemon processes at each client.
• bio : block input-output; daemons : user-level processes that perform system tasks.
Bio-daemon processes enhance performance and reduce the chances of inconsistency between caches at different clients.
Implementation (14)
Using Local Disk Caching with NFS
*Implementation (15)
Other optimizations
The Sun file system based on the UNIX BSD 4.2 Fast File System
• uses 8 Kbyte disk blocks : fewer file system calls for sequential file access.
The UDP packet
• extended to 9 Kbytes : an entire block can be transferred as an argument in a single packet.
Performance
The relatively poor write performance has been addressed by the use of battery-backed non-volatile RAM in the server’s disk controller.
#NFS Operation
NFS service is provided by a number of daemons.
nfsd : NFS server daemon that handles client file system requests.
a number of nfsds might be running concurrently.
NFS is a Request/Reply Protocol
client issues a request to access a remote file.
kernel interprets the request and passes it to the appropriate VFS routines, which forward it to the client agent.
client agent prepares an RPC, assigns a transaction ID, encodes it using XDR, and transmits the request to the server.
#NFS Operation (1)
NFS Server Side Operation
for each incoming request, IP and UDP protocol processing takes place.
the request arguments and RPC header are decoded according to XDR.
one of the nfsds is selected to execute the request.
after completion, a reply is prepared and sent to client.
NFS daemon returns to its idle state.
#NFS Operation (2)
Other Important Daemons
portmap converts RPC program numbers into protocol port numbers ; it keeps a list of available RPC servers, their ports, and the program numbers they are serving.
mountd handles file system mount requests and determines which file systems are available to which machines and users.
biod are asynchronous block I/O daemons that run on the client and perform read-ahead and write-behind from client buffer cache.
#NFS Operation (3)
Communication Protocol Daemons
: to run TCP/UDP and IP protocol functions.
inetd : listens for connections on internet addresses for certain services, invokes service specific server daemons when a connection is found.
routed : manages network routing tables.
#NFS Performance
two buffer caches are used at the client side
: to reduce the number of remote requests that go to the servers
one for data, a second for file attributes
Support of block read-ahead at both server and client sides
An I/O request is issued for the first block. A second I/O request is issued for the remaining data blocks while the first one is being processed.
Write-behind
flush critical information to stable storage. NFS uses synchronous writes to save modified data to stable storage.
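The read-ahead idea above can be sketched as a small cache wrapper (illustrative only; the fetch_block callback stands in for a remote block read, and a real NFS client prefetches asynchronously via biod daemons):

```python
class ReadAheadCache:
    """When block n is read, block n+1 is fetched speculatively,
    so a subsequent sequential read is served from the cache."""
    def __init__(self, fetch_block):
        self.fetch_block = fetch_block   # stand-in for a remote read request
        self.cache = {}

    def read(self, block_no):
        if block_no not in self.cache:
            self.cache[block_no] = self.fetch_block(block_no)
        nxt = block_no + 1
        if nxt not in self.cache:        # prefetch the following block
            self.cache[nxt] = self.fetch_block(nxt)
        return self.cache[block_no]
```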
CASE STUDY :
The Andrew File System
*AFS : Andrew File System (1)
Andrew : a distributed computing environment developed at Carnegie Mellon University for use as a campus computing and information system.
Andrew File System : a file service designed to provide an information sharing mechanism to its users.
The main goals were to build a scalable and secure distributed file system. It is based on the client/server model, and its initial goal was to support at least 7000 workstations on a campus-wide network.
AFS was extended in CODA project to develop a highly-available distributed file system.
*AFS : Andrew File System (2)
AFS is implemented on a network of workstations and servers running BSD 4.3 UNIX or the Mach operating system
AFS is compatible with NFS.
AFS is designed to perform well with large numbers of active users.
( scalability )
*Design characteristics for Scalability
Whole-file serving
The entire contents of files are transmitted to client computers
by AFS servers.
Whole-file caching
Once a copy of a file has been transferred to a client computer it is
stored in a cache on the local disk.
The cache is permanent, surviving reboots of the client computer.
*The operation scenario of AFS
A user process in a client computer issues an open system call for a file in the shared file space.
The server holding the file is located and is sent a request for a copy of the file.
The copy is stored in the local UNIX file system in the client computer ; the copy is then opened.
Subsequent read, write and other operations on the file by processes in the client computer are applied to the local copy.
When the process in the client issues a close system call, if the local copy has been updated its contents are sent back to the server.
The server updates the file contents and the timestamps on the file.
*Assumptions for Design strategy
locally-cached copies are likely to remain valid for long periods.
( infrequently updated shared files and single-user access )
The local cache can be allocated a substantial proportion of the disk
space on each workstation.
Files are small ; most are less than 10 kilobytes in size
Read operations on files are much more common than writes.
Sequential access is common and files are referenced in bursts.
Most files are read and written by only one user.
explicitly excluded the provision of storage facilities for databases.
Distribution of processes in the AFS
[ Figure: each workstation runs a USER PROGRAM and a VENUS process above the UNIX KERNEL; each server runs a VICE process above the UNIX KERNEL; workstations and servers are connected by the NETWORK. ]
*AFS Implementation (1)
two software components
Vice : the server S/W that runs as a user-level UNIX process in each server.
( the information-sharing backbone ; it consists of a collection of dedicated file servers. )
Venus : a user-level process that runs in each client computer.
( finds file in Vice, caches them locally, and performs shared file access.)
files seen by user processes
Local files : handled as normal UNIX files.
Shared files : stored on servers and copies of them are cached on the local disks of workstations.
System call interception in the AFS
[ Figure: in a workstation, a USER PROGRAM issues UNIX file system calls to the UNIX KERNEL; non-local file operations are passed to VENUS, while local operations go through the UNIX FILE SYSTEM to the local disk. ]
*AFS Implementation (2)
The UNIX kernel in each workstation and server is a modified version of BSD 4.3 UNIX, altered to intercept open, close and some other file system calls.
One of the file partitions on the local disk of each workstation is used as a cache, holding the cached copies of files from the shared space.
File name space seen by clients of the AFS
[ Figure: the client's name space has a local root (/) containing tmp, bin, ..., vmunix, plus a Shared subtree (e.g. cmu); local entries such as bin are symbolic links into the shared space. ]
*AFS file service features
Files are grouped into volumes for ease of location and movement.
A flat file service : implemented by the Vice servers.
the hierarchic directory structure : implemented by the set of Venus processes.
Each file and directory in the shared file space is identified by a unique, 96-bit file identifier (fid) .
The Venus processes translate the pathnames issued by clients to fids.
fids are used only for internal communication between AFS modules (Venus and Vice processes ).
*AFS file identifier
Volume Number ( 32 bits ) : the volume containing the file
File Handle ( 32 bits ) : identifies the file within the volume
Uniquifier ( 32 bits ) : ensures that file identifiers are not reused
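The three 32-bit fields can be modelled directly; this sketch (illustrative names, not AFS code) packs them into a single 96-bit integer and back:

```python
MASK32 = 0xFFFFFFFF

class Fid:
    """AFS 96-bit file identifier: three 32-bit fields."""
    def __init__(self, volume, handle, uniquifier):
        self.volume = volume & MASK32          # volume containing the file
        self.handle = handle & MASK32          # file within the volume
        self.uniquifier = uniquifier & MASK32  # guards against fid reuse

    def pack(self):
        """Combine the fields into one 96-bit integer."""
        return (self.volume << 64) | (self.handle << 32) | self.uniquifier

    @classmethod
    def unpack(cls, n):
        return cls((n >> 64) & MASK32, (n >> 32) & MASK32, n & MASK32)
```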
*Cache coherence
Callback-based mechanism
callback : a remote procedure call from a server to a Venus process.
callback promises ( have two states: valid or cancelled )
: a token issued by the Vice server that is the custodian of the file,
guaranteeing that it will notify the Venus process when any other
client modifies the file.
The goal of Callback-based cache coherence mechanism
to achieve the best approximation to one-copy file semantics that is practicable without serious performance degradation.
Implementation of file system calls in AFS
Open(FileName, mode)
UNIX kernel : if FileName refers to a file in the shared file space, pass the request to Venus.
Venus : check the list of files in the local cache. If the file is not present, or there is no valid callback promise, send a request for the file to the Vice server that is the custodian of the volume containing the file.
Vice : transfer a copy of the file and a callback promise to the workstation. Log the callback promise.
Venus : place the copy of the file in the local file system, enter its local name in the local cache list and return the local name to UNIX.
UNIX kernel : open the local file and return the file descriptor to the application.

Read(FileDescriptor, Buffer, length)
UNIX kernel : perform a normal UNIX read operation on the local copy.

Write(FileDescriptor, Buffer, length)
UNIX kernel : perform a normal UNIX write operation on the local copy.

Close(FileDescriptor)
UNIX kernel : close the local copy and notify Venus that the file has been closed.
Venus : if the local copy has been changed, send a copy to the Vice server that is the custodian of the file.
Vice : replace the file contents and send a callback to all other clients holding a callback promise on the file.
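The open and close-notification paths just described can be sketched from Venus's side (a toy model under stated assumptions: fetch_from_vice stands in for the RPC to the custodian Vice server, and callback state is a simple flag):

```python
class CacheEntry:
    def __init__(self, local_path):
        self.local_path = local_path
        self.callback_valid = True       # state of the callback promise

class Venus:
    def __init__(self, fetch_from_vice):
        self.fetch_from_vice = fetch_from_vice
        self.cache = {}                  # filename -> CacheEntry

    def open(self, filename):
        entry = self.cache.get(filename)
        if entry and entry.callback_valid:
            return entry.local_path      # serve from the local cache
        # fetch a fresh copy; the server logs a callback promise for us
        local_path = self.fetch_from_vice(filename)
        self.cache[filename] = CacheEntry(local_path)
        return local_path

    def break_callback(self, filename):
        """Handler for a BreakCallback RPC from Vice."""
        if filename in self.cache:
            self.cache[filename].callback_valid = False
```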
The Vice service Interface (1)
Fetch(fid) → attr, data
: Returns the attributes(status) and, optionally, the contents of file identified by
the fid and records a callback promise on it.
Store(fid, attr, data)
: Updates the attributes and (optionally) the contents of a specified file.
Create( ) → fid
: Creates a new file and records a callback promise on it.
Remove(fid)
: Deletes the specified file.
The Vice service Interface (2)
RemoveCallBack(fid)
: Informs server that a Venus process has flushed a file from its cache.
BreakCallBack(fid)
: This call is made by a Vice Server to a Venus process. It cancels the callback
promise on the relevant file.
SetLock(fid, mode)
: Sets a lock on the specified file or directory. The mode of the lock may be
shared or exclusive. Locks that are not removed expire after 30 minutes.
ReleaseLock(fid) : Unlocks the specified file or directory.
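To make the bookkeeping concrete, the interface can be mocked as a toy class that records callback promises on Fetch and Create (the bodies are illustrative stand-ins, not AFS code; locking is omitted):

```python
class ViceServer:
    def __init__(self):
        self.files = {}      # fid -> (attr, data)
        self.callbacks = {}  # fid -> set of clients holding a promise
        self.next_fid = 0

    def fetch(self, fid, client):
        # record a callback promise for this client, then return the file
        self.callbacks.setdefault(fid, set()).add(client)
        return self.files[fid]

    def store(self, fid, attr, data):
        self.files[fid] = (attr, data)

    def create(self, client):
        fid = self.next_fid
        self.next_fid += 1
        self.files[fid] = ({}, b"")
        self.callbacks[fid] = {client}   # promise for the creating client
        return fid

    def remove(self, fid):
        del self.files[fid]

    def remove_callback(self, fid, client):
        self.callbacks[fid].discard(client)
```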
*Update Semantics (1)
Update semantics ( a client : C / a file : F/ a server : S )
after a successful open : latest ( F, S )
after a failed open : failure ( S )
after a successful close : updated ( F, S )
after a failed close : failure ( S )
latest ( F, S ) : denotes a guarantee that the current value of F at C is the same as the value at S.
failure ( S ) : denotes that the open or close operation has not been performed at S.
updated ( F, S ) : denotes that C's value of F has been successfully propagated to S.
*Update Semantics (2)
the currency guarantee for open
after a successful open : latest ( F, S, 0 )
or ( lostCallback ( S, T ) and
inCache ( F ) and latest ( F, S, T ) )
latest ( F, S, T ) : denotes that the copy of F seen by the client is no
more than T seconds out of date.
lostCallback ( S, T ) : denotes that a callback message from S to C has
been lost at some time during the last T seconds.
inCache ( F ) : the file F was in the cache at C before the open
operation was attempted.
*Update Semantics (3)
If clients in different workstations open, write and close the same file concurrently, all but the update resulting from the last close will be silently lost.
Clients must implement concurrency control independently.
When two client processes in the same workstation open a file, they share the same cached copy and updates are performed in the normal UNIX fashion, block by block.
#AFS Scalability (1)
The scalability of the system is achieved by reducing static binding to a minimum and by maximizing the number of active clients that can be supported by a server.
The AFS cache manager intercepts requests for remotely stored files and either obtains the requested data from the cache, or requests the appropriate chunk from the appropriate file server.
All machines using AFS refer to any file using a common name; in AFS 3.0, one can use the pathname /afs/athena.mit.edu/user/a/xyz.
In AFS4.0, both DIGITAL's DNS and X.500 are used to navigate through the top-most directories of the name space.
#AFS Scalability (2)
the key strategy for achieving scalability
Whole-file serving : the entire contents of files are transmitted to client computers by AFS servers.
Whole-file caching : once a copy of a file has been transferred to a client computer it is stored in a cache on the local disk. The cache contains several hundred of the files most recently used on that computer. Local copies of files are used to satisfy clients' open requests in preference to remote copies whenever possible.
#AFS Security (1)
Security in AFS depends on the integrity of a small number of Vice servers. No user software ever runs on Vice servers, and Andrew assumes that the hardware and software on workstations may be modified in an arbitrary manner.
Protection Domain
It is composed of users and groups. A user can authenticate itself to the system, be held responsible for its actions, and be charged for resource consumption. A group is a set of other groups and users, associated with a user called its owner.
#AFS Security (2)
In AFS2, cache coherence was achieved based on callbacks ; the server promises to notify workstations caching a file before allowing modification. Callbacks made it feasible for clients to cache directories and to translate path names locally.
AFS2 used a single process to service all clients ; non-preemptive lightweight processes supported concurrency and convenient programming abstraction at clients and servers.
Volumes, collections of files, are used in disk storage allocation.
Read-only replication of volumes increases availability for frequently read but rarely updated files, such as system programs.
*Other aspects (1)
UNIX KERNEL MODIFICATIONS
The UNIX kernel in AFS hosts is altered so that Vice can perform file operations in terms of file handles instead of the conventional UNIX file descriptors.
LOCATION DATABASE
Each server contains a copy of a fully replicated location database giving a mapping from volume names to servers.
THREADS
The implementations of Vice and Venus make use of a non-preemptive threads package to enable requests to be processed concurrently.
*Other aspects (2)
READ-ONLY REPLICAS
Volumes containing files that are frequently read but rarely modified, can be replicated as read-only volumes at several servers.
BULK TRANSFERS
The use of a large packet size ( 64-kilobyte chunks ) is an important aid to performance, minimizing the effect of network latency.
PERFORMANCE
whole-file caching leads to dramatically reduced loads on the servers.
( a server load of 40% was measured against a load of 100% for NFS running the same benchmark. )
#Main benefits of AFS
Data sharing is simplified : since the file system is location transparent, a user at any workstation can access any file in AFS just by using its name.
User mobility is supported : user can access any shared file stored on any
workstation in the system.
System administration is easier.
Better security is possible.
The servers in VICE are secure and run trusted system software.
Client autonomy is improved : workstations can be moved, turned off
without affecting users at other workstations.
#AFS-2 versus Sun NFS performance
[ Figure: benchmark time in seconds (0-1400) plotted against load units (0-18), comparing Andrew with a cold cache, Andrew with a warm cache, and NFS. ]
CASE STUDY :
The Coda File System
The limitations of AFS
the limited form of replication ( restricted to read-only volumes )
fault tolerance of the service
the mobile use of portable computers
Coda File system
developed in a research project undertaken by Satyanarayanan and his co-workers at Carnegie Mellon University.
A descendant of AFS. ( the Coda design requirements were derived from experience with AFS. )
developed as a solution to the drawbacks of AFS and to meet the need for disconnected operation of portable workstations.
a principle of the design of Coda
the copies of files residing on servers are more reliable than those residing in the cache of workstations.
#Coda File system features (1)
disconnected operation for mobile clients
reintegration of data from disconnected clients
bandwidth adaptation
Failure Resilience
read/write replication servers
resolution of server/server conflicts
handles network failures which partition the servers
handles disconnection of clients
#Coda File system features (2)
Performance and scalability
client-side persistent caching of files, directories and attributes for high performance
write-back caching
Security
Kerberos-like authentication
access control lists ( ACLs )
Well-defined semantics of sharing
Freely available source code
*Coda File system component
VSG(Volume Storage Group)
the set of servers holding replicas of a file volume.
AVSG(Available Volume Storage Group)
the accessible subset of the VSG.
the membership of the AVSG varies as servers become accessible or are made inaccessible by network/server failure.
#Client / Venus / Vice
*Optimistic replication Strategy (1)
allows modification of files to proceed when the network is partitioned or during disconnected operation.
It relies on the attachment, to each version of a file, of a Coda version vector (CVV) and a timestamp.
CVV : a vector of integers with one element for each server in the relevant VSG. Each element of the CVV is an estimate (a count) of the number of modifications performed on the version of the file.
The purpose of the CVVs
: to provide sufficient information about the update history of each file version to enable inconsistencies to be detected and corrected automatically.
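Deciding whether two versions are ordered or in conflict reduces to an element-wise comparison of their CVVs (a sketch; vectors are assumed to be equal-length lists, one count per VSG member):

```python
def dominates(v1, v2):
    """v1 dominates v2 if every element of v1 >= the matching element of v2."""
    return all(a >= b for a, b in zip(v1, v2))

def compare_cvvs(v1, v2):
    if dominates(v1, v2) and dominates(v2, v1):
        return "equal"
    if dominates(v1, v2):
        return "newer"       # v1's history includes all of v2's updates
    if dominates(v2, v1):
        return "older"
    return "conflict"        # concurrent updates: neither includes the other
```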
*Optimistic replication Strategy (2)
When a modified file is closed,
each site in the current AVSG is sent an update message by the Venus process at the client, containing the current CVV and the new contents for the file.
The Vice process at each site checks the CVV. The Venus process then computes a new CVV with modification counts increased and distributes the new CVV to the members of the AVSG.
The message is sent only to the member of the AVSG.
*Optimistic replication Strategy (3)
The advantages deriving from the replication
The files in a replicated volume remain accessible to any client that can access at least one of the replicas.
The performance of the system can be improved by sharing some of the load.
Coda enhances availability both by the replication of files across servers
and by the ability of clients to operate entirely out of their caches.
*Update semantics (1)
currency guarantee for open/close (a client : C / a file : F / AVSG : s )
after a successful open : s ≠ Ø and ( latest ( F, s, 0 )
or ( latest ( F, s, T ) and lostCallback ( s, T ) and inCache ( F ) ) )
or ( s = Ø and inCache ( F ) )
after a failed open : s ≠ Ø and conflict ( F, s )
or ( s = Ø and ¬ inCache ( F ) )
after a successful close : s ≠ Ø and updated ( F, s )
or ( s = Ø )
*Update semantics (2)
currency guarantee for open/close (cont.) ( a client : C / a file : F / AVSG : s )
after a failed close : s ≠ Ø and conflict ( F, s )
latest ( F, s, T ) : denotes that the current value of F at C was the
latest across all the servers in s at some instant
in the last T seconds.
lostCallback(s,T) : a callback was sent by some member of s in the
last T seconds and was not received at C.
conflict( F, s ) : the values of F at some servers in s are currently
in conflict.
*Accessing Replicas
The strategy used on open and close to access the replicas of a file is a variant of the read-one, write-all approach.
Open operation
• if a copy of the file is not present in the local cache the client identifies
a preferred server from the AVSG.
• The client requests a copy of the file, and on receiving it, it checks
with all the other members of the AVSG to verify that the copy is the
latest available version.
Close operation
• When a file is closed at a client after modification, its contents and
attributes are transmitted in parallel to all the members of the AVSG
using a multicast remote procedure calling protocol.
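The read-one, write-all variant can be sketched with toy replicas (the Replica class and its version counters are illustrative assumptions; the loop in coda_close stands in for the multicast RPC to the AVSG):

```python
class Replica:
    """Toy stand-in for one Vice server's copy of a volume."""
    def __init__(self):
        self.files = {}   # fid -> (version, contents)

    def version(self, fid):
        return self.files.get(fid, (0, b""))[0]

    def fetch(self, fid):
        return self.files.get(fid, (0, b""))

    def store(self, fid, version, contents):
        self.files[fid] = (version, contents)

def coda_open(avsg, fid, cache):
    if fid in cache:
        return cache[fid]
    version, contents = avsg[0].fetch(fid)        # read from a preferred server
    newest = max(avsg, key=lambda s: s.version(fid))
    if newest.version(fid) > version:             # another member is newer
        version, contents = newest.fetch(fid)
    cache[fid] = (version, contents)
    return cache[fid]

def coda_close(avsg, fid, cache, contents):
    version = cache[fid][0] + 1
    for server in avsg:                           # stand-in for multicast RPC
        server.store(fid, version, contents)
    cache[fid] = (version, contents)
```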
*Cache coherence (1)
events to be detected by Venus
Enlargement of an AVSG.
( due to the accessibility of a previously inaccessible server )
Shrinking of an AVSG
( due to a server becoming inaccessible )
A lost callback event
To achieve this, Venus sends a probe message every T seconds to all the
servers in the VSGs of the files that it has in its cache.
*Cache coherence (2)
The problem of missed updates
: a server may miss an update because it is not in the AVSG
of a different client that performs the update.
Venus is sent a volume version vector(volume CVV) in response to
each probe message.
• Volume CVV : a summary of the CVVs.
Venus detects any mismatch between the volume CVVs.
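Venus's handling of probe replies can be sketched as follows (a toy model; the reply map, the revalidation flag, and the rule that enlargement or a mismatch forces revalidation are illustrative assumptions about the mechanism described above):

```python
def process_probe_replies(old_avsg, replies, cached_volume_cvv):
    """replies maps server -> volume CVV, or None if the probe timed out.
    Returns the new AVSG and whether cached files must be revalidated."""
    new_avsg = [s for s, cvv in replies.items() if cvv is not None]
    enlarged = any(s not in old_avsg for s in new_avsg)
    mismatch = any(cvv != cached_volume_cvv
                   for cvv in replies.values() if cvv is not None)
    # On enlargement or a volume-CVV mismatch, an update or callback may
    # have been missed, so cached copies must be revalidated on next open.
    return new_avsg, (enlarged or mismatch)
```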
*Disconnected Operation
During brief disconnections, the least-recently-used cache replacement policy normally adopted by Venus is usually sufficient to avoid cache misses on the disconnected volumes.
Coda allows users to specify a prioritized list of files and directories
that Venus should strive to retain in the cache.
When disconnected operation ends, a process of reintegration begins.
Conflicts may be detected during reintegration.
In that case the cached copy is stored in a temporary location ( covolume ) on the
server, and the user that initiated the reintegration is informed.
#Failure resilience methods
Performance
compare the performance of Coda with AFS under benchmark loads designed to simulate user populations ranging from 5 to 50 AFS users.
With no replication : no significant difference
With 3-fold replication : the time for Coda to perform a benchmark load equivalent to 5 users exceeds that of AFS without replication by only 5%.
With 3-fold replication ( and a load equivalent to 50 users ) : the time to complete the benchmark is increased by 70%, whereas that for AFS without replication is increased by only 16%.