
NFS/RDMA over IB under Linux

Charles J. Antonelli
Center for Information Technology Integration
University of Michigan, Ann Arbor
February 7, 2005
(portions copyright Tom Talpey and Gary Grider)

Agenda

- NFSv2,3,4
- NFS/RDMA
- Linux NFS/RDMA server
- NFS Sessions
- pNFS and RDMA

NFSv2,3

- One of the major software innovations of the 80’s
- Open systems
  - Open specification
- Remote procedure call (RPC)
  - Invocation across machine boundaries
  - Support for heterogeneity
- Virtual file system interface (VFS)
  - Abstract interface to file system functions
  - Read, write, open, close, etc. (see the sketch after this list)
- Stateless server
  - Ease of implementation
  - Obviates lack of server reliability
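As a rough illustration of the VFS idea above, the sketch below models the abstract file-system interface as a table of function pointers. The struct and member names are assumptions for illustration, not the actual Linux VFS definitions.

```c
/* Illustrative sketch of a VFS-style abstract interface: a table of
 * function pointers that each file system implementation fills in.
 * These names are assumptions, not the actual Linux VFS structures. */
#include <stddef.h>
#include <sys/types.h>

struct vfs_ops {
    int     (*open)(const char *path, int flags);
    ssize_t (*read)(int fd, void *buf, size_t len);
    ssize_t (*write)(int fd, const void *buf, size_t len);
    int     (*close)(int fd);
};

/* An NFS client supplies one implementation of these operations and a
 * local disk file system another; callers see only the interface. */
```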

Problems with NFSv2,3

- Naming
  - Under client control (automounter helps)
- Scalability
  - Caching is hard to get right
- Consistency
  - Three-second rule
- Performance
  - Chatty protocol

Problems with NFSv2,3

- Access control
  - Trusted client
  - Identity agreement
- Locking
  - Outside the NFS protocol specification
- System administration
  - No tools for backend management
  - Proliferation of exported workstation disks

NFSv4

- Major components
  - Export management
  - Compound RPC
  - Delegation
  - State and locks
  - Access control lists
  - Security: RPCSEC_GSS

NFSv4

Export Management

NFSv4 pseudo fs allows the client to mount the server root, and browse to discover offered exports

No more mountd

Access into an export is based on the user’s credentials

Obviates /etc/exports client list

Compound RPC

- Designed to reduce wire traffic
- Multiple operations per request:

Compound RPC

PUTROOTFH

LOOKUP

GETATTR

GETFH

“Start with the pseudo fs root, lookup mount point path name, and return attributes and file handle.”
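A minimal sketch of that compound, modeled as an ordered list of operation codes. The enum and the encoding here are illustrative only; the real protocol marshals each operation and its arguments in XDR.

```c
/* Illustrative sketch: the compound above modeled as an array of
 * operation codes. The real protocol encodes each operation and its
 * arguments in XDR; only the operation names follow NFSv4. */
#include <stdio.h>

enum nfs4_op { OP_PUTROOTFH, OP_LOOKUP, OP_GETATTR, OP_GETFH };

static const char *op_name(enum nfs4_op op)
{
    switch (op) {
    case OP_PUTROOTFH: return "PUTROOTFH";
    case OP_LOOKUP:    return "LOOKUP";
    case OP_GETATTR:   return "GETATTR";
    case OP_GETFH:     return "GETFH";
    }
    return "?";
}

int main(void)
{
    /* One request carries the whole sequence; the server executes the
     * operations in order and stops at the first failure. */
    enum nfs4_op compound[] = { OP_PUTROOTFH, OP_LOOKUP, OP_GETATTR, OP_GETFH };
    for (size_t i = 0; i < sizeof(compound) / sizeof(compound[0]); i++)
        printf("op %zu: %s\n", i + 1, op_name(compound[i]));
    return 0;
}
```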

Delegation

- Server issues delegations to clients
  - A read delegation on a file is a guarantee that no other clients are writing to the file
  - A write delegation on a file is a guarantee that no other clients are accessing the file
- Reduces revalidation requirements
  - Not necessary for correctness
  - Intended to reduce RPC requests to the server
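A minimal sketch, under assumed names, of how a client can use a held delegation to skip attribute revalidation; it illustrates the idea, not the Linux client's actual logic.

```c
/* Sketch only: a client-side check that skips attribute revalidation
 * when it holds a delegation. Types and names are illustrative. */
#include <stdbool.h>

enum nfs4_delegation { DELEG_NONE, DELEG_READ, DELEG_WRITE };

struct nfs_inode {
    enum nfs4_delegation deleg;
};

static bool need_revalidate(const struct nfs_inode *ni)
{
    /* With a read delegation, no other client can be writing the file;
     * with a write delegation, no other client is accessing it at all.
     * Either way the cached attributes are still valid, so the GETATTR
     * round trip can be skipped. */
    return ni->deleg == DELEG_NONE;
}
```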

State and Locks

- NFSv3 is an ostensibly stateless protocol
  - However, NFSv3 is typically used with a stateful auxiliary locking protocol (NLM)
- NFSv4 locking is part of the protocol
  - No more lockd
- LOCK operation sets up lock state
  - Client polls server when LOCK request is denied
- NFSv4 servers also keep track of
  - Open files, mainly to support Windows share reservation semantics
  - Delegations

State Management

- Open file and lock state are lease-based (see the sketch below)
  - A lease is the amount of time a server will wait, while not receiving a state referencing operation from a client, before reaping the client’s state.
- Delegation state is callback-based
  - A callback is a communication channel from the server back to the client
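A minimal sketch of the lease idea, with an assumed lease length and structure names; as described above, any state-referencing operation renews the lease, and the server may reap state only after expiry.

```c
/* Minimal sketch of lease-based state, with assumed names and lease
 * length. Any state-referencing operation from the client renews the
 * lease; once it expires, the server may reap the client's state. */
#include <stdbool.h>
#include <time.h>

#define LEASE_SECONDS 90            /* assumed lease period */

struct client_state {
    time_t last_renewal;            /* time of last state-referencing op */
};

static void renew_lease(struct client_state *clp)
{
    clp->last_renewal = time(NULL); /* e.g. on OPEN, LOCK, READ, RENEW */
}

static bool lease_expired(const struct client_state *clp)
{
    return difftime(time(NULL), clp->last_renewal) > LEASE_SECONDS;
}
```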

Access Control Lists

- NFSv4 defines ACLs for file system objects
  - Richer and more granular than POSIX ACLs
  - Similar to NT ACLs
- ACLs are showing up on local UNIX file systems

Security Model

- Security added to RPC layer
  - RFC 2203 defines RPCSEC_GSS
- Adds the GSSAPI to the ONC RPC
- An application that uses the GSSAPI can "plug in" any security service implementing the API
- NFSv4 mandates the implementation of Kerberos v5 and LIPKEY GSSAPI security mechanisms
- The combination of LIPKEY (and SPKM3) provides a security service similar to TLS
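As a rough sketch of what RPCSEC_GSS adds per call, the fragment below pairs a GSS mechanism with a protection service; the C names are assumptions, not a real RPCSEC_GSS header.

```c
/* Illustrative sketch of the RPCSEC_GSS "security triple" idea: a GSS
 * mechanism plus a per-call protection service. These C names are
 * assumptions, not a real RPCSEC_GSS header. */
enum gss_service {
    GSS_SVC_NONE,       /* authentication only */
    GSS_SVC_INTEGRITY,  /* arguments and results are signed */
    GSS_SVC_PRIVACY     /* arguments and results are encrypted */
};

struct gss_security_triple {
    const char *mechanism;      /* e.g. Kerberos v5, LIPKEY, SPKM-3 */
    enum gss_service service;
};
```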

Existing NFSv4 Implementations

- SUN Solaris client and server
- Network Appliance multi-protocol server
  - NFSv4, NFSv3, CIFS
- Hummingbird WinXXX client and server
- CITI
  - Linux client and server
  - OpenBSD/FreeBSD client
- EMC multi-protocol server
- HPUX server
- Guelph OpenBSD server
- IBM AIX client and server

Future Implementations

- Cluster-coherent NFS server
- pNFS

NFS/RDMA

- A way to run NFS v2/v3/v4 over RDMA
- Greatly enhanced NFS performance
  - Low overhead
  - Full bandwidth
  - Direct I/O – true zero copy
- Implemented on Linux
  - kDAPL API
  - Client today, server soon

RPC layer approach

- Implemented within RPC layer
- New RPC transport type
- Adds RDMA-transport specific header
- “Chunks” direct data transfer between client memory and server buffers (see the sketch below)
- Bindings for NFSv2/v3, also NFSv4
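A simplified sketch of the transport header and chunk idea; the field names are assumptions and this is not the on-the-wire XDR layout.

```c
/* Simplified sketch of the RPC/RDMA header and chunk idea: the header
 * rides with each RPC and lists registered client memory regions that
 * the server may read from or write into directly. Field names are
 * illustrative, not the wire format. */
#include <stdint.h>

struct rdma_chunk {                   /* one registered client region */
    uint32_t handle;                  /* remote memory handle */
    uint32_t length;                  /* size of the region in bytes */
    uint64_t offset;                  /* remote address of the region */
};

struct rpcrdma_header_sketch {
    uint32_t xid;                     /* matches the RPC transaction id */
    uint32_t credits;                 /* flow-control credits granted */
    struct rdma_chunk *read_chunks;   /* server pulls these (client write) */
    struct rdma_chunk *write_chunks;  /* server pushes into these (client read) */
    struct rdma_chunk *reply_chunk;   /* for replies too large to send inline */
};
```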

Implementation Layering

- Client implemented as kernel RPC transport
- Server approach similar
- RDMA API: kDAPL
- NFS client code remains unchanged
- Completely transparent to application

Use of kDAPL

- All RDMA interfacing is via kDAPL
- Very simple subset of kDAPL 1.1 API
  - Connection, connection DTOs
  - Kernel-virtual or physical LMRs, RMRs
  - Small (1KB-4KB typical) send/receive
  - Large RDMA (4KB-64KB typical) (see the sketch after this list)
- All RDMA read/write initiated by server
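A small sketch of the inline-versus-RDMA decision implied by the sizes above; the threshold name and value are assumptions.

```c
/* Sketch of the inline-versus-RDMA split, with an assumed threshold:
 * small payloads travel in an ordinary send/receive, larger ones are
 * moved by RDMA Read/Write, which the server always initiates. */
#include <stdbool.h>
#include <stddef.h>

#define RPCRDMA_INLINE_MAX (4 * 1024)   /* assumed inline threshold */

static bool use_rdma_chunk(size_t payload_len)
{
    return payload_len > RPCRDMA_INLINE_MAX;
}
```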

Potential NFS/RDMA Users

Anywhere high bandwidth, low overhead is important:

- HPC/Supercomputing clusters
- Database
- Financial applications
- Scientific computing
- General cluster computing

Linux NFS/RDMA server

- Project goals
  - RPC/RDMA implementation
    - kDAPL API
    - Mellanox IB
  - Interoperate with NetApp RPC RDMA client
  - Performance gain over TCP transport

Linux NFS/RDMA server

- Approach
  - Divide RPC layer into unified state management and abstract transport layer
  - Socket-specific code replaced by general interface implemented by socket or RDMA transports (sketched below)
  - Similar to client RPC transport switch concept
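A minimal sketch of what such an abstract transport interface could look like; none of these names are the actual CITI or kernel symbols.

```c
/* Hypothetical sketch of an abstract server transport interface:
 * generic RPC state management calls through this ops table, and a
 * socket transport or an RDMA transport supplies the functions. None
 * of these names are the actual CITI or kernel symbols. */
#include <stddef.h>
#include <sys/types.h>

struct svc_transport;

struct svc_transport_ops {
    int     (*accept_conn)(struct svc_transport *xprt);
    ssize_t (*recv_request)(struct svc_transport *xprt, void *buf, size_t len);
    ssize_t (*send_reply)(struct svc_transport *xprt, const void *buf, size_t len);
    void    (*destroy)(struct svc_transport *xprt);
};

struct svc_transport {
    const struct svc_transport_ops *ops;  /* socket- or RDMA-specific */
    void *private_data;                   /* transport-specific state */
};
```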

Linux NFS/RDMA server

- Implementation stages
  - Listen for and accept connections
  - Process inline NFSv3 requests
  - NFSv3 RDMA
  - NFSv4 RDMA

Listen for and accept connections

- svc_makexprt
  - Similar to svc_makesock for socket transports
- RDMA transport tasks:
  - Open HCA
  - Register memory
  - Create endpoint for RDMA connections

Listen for and accept connections

- svc_xprt
  - Retains transport-independent components of svc_sock
  - Add pointer to transport-specific structure
  - Support for registering dynamic transport implementations (eventually)

Listen for and accept connections

- Reorganize code into transport-agnostic and transport-specific blocks
- Update calling code to specify transport

Process inline NFSv3 requests

- RDMA-specific send and receive routines
- All data sent inline via RDMA Send
- Tasks
  - Register memory buffers for RDMA send
  - Manage buffer transmission by the hardware
  - Process RDMA headers

NFSv3 RDMA

- Use RDMA Read and Write for large transfers
- RPC page management
  - xdr_buf contains initial kvec and list of pages (see the sketch after this list)
  - Initial kvec holds RPC header and short payloads
  - Page list used for large data transfer
- Server memory registration
  - All server memory pre-registered
  - Allows simpler memory management
  - May need revisiting wrt security
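A userspace simplification of the xdr_buf layout described above; this is illustrative, not the kernel's actual structure definition.

```c
/* Userspace simplification of the xdr_buf layout: a small header
 * buffer for the RPC header and short payloads, plus a page list for
 * bulk data. Not the kernel's actual structure definition. */
#include <stddef.h>

struct kvec_sketch {
    void   *iov_base;
    size_t  iov_len;
};

struct xdr_buf_sketch {
    struct kvec_sketch head;   /* RPC header and short payloads (inline) */
    void  **pages;             /* page list used for large data transfer */
    size_t  page_len;          /* bytes carried in the page list */
    size_t  total_len;         /* head.iov_len + page_len */
};
```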

NFSv3 RDMA

- Client write
  - Server issues RDMA Read from client-provided read chunks
  - Server reads into xdr_buf page list
  - Similar to socket-based receive for ULP
- Client read
  - Server issues RDMA Write into client-provided write chunks
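A small sketch of the direction rule above, under assumed names: the server always initiates the RDMA operation and picks Read or Write from the request type.

```c
/* Sketch of the direction rule, with illustrative names: the server
 * initiates every RDMA data transfer, choosing the operation from the
 * NFS request type. */
enum nfs_request { NFS_WRITE, NFS_READ };
enum rdma_action { RDMA_READ_FROM_CLIENT, RDMA_WRITE_TO_CLIENT };

static enum rdma_action server_transfer_for(enum nfs_request req)
{
    /* Client WRITE: pull the payload from the client's read chunks with
     * RDMA Read. Client READ: push the payload into the client's write
     * chunks with RDMA Write. */
    return req == NFS_WRITE ? RDMA_READ_FROM_CLIENT : RDMA_WRITE_TO_CLIENT;
}
```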

NFSv3 RDMA

- Reply chunks
  - Applies when client requests generate replies that are too large for RDMA Send
  - Server issues RDMA Write into client-supplied buffers

NFSv4 RDMA

- NFSv4 layered on RPC/RDMA
- Task:
  - Export modifications for RDMA transport

NFSv4.1 Sessions

- Adds a session layer to NFSv4
- Enhances protocol reliability
  - Accurate duplicate request caching (see the sketch below)
  - Bounded resources
- Provides transport diversity
  - Trunking, multipathing

http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-sess-00.txt
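A rough sketch of bounded, accurate duplicate request caching with a per-session slot table; the names, slot count, and simplified sequencing are assumptions, not the draft's data structures.

```c
/* Rough sketch of session-based duplicate request caching: a fixed
 * slot table bounds server resources, and a per-slot sequence number
 * makes replay detection exact. Names, slot count, and the simplified
 * sequencing below are assumptions (a real implementation also rejects
 * misordered sequence numbers with an error). */
#include <stddef.h>
#include <stdint.h>

#define SESSION_SLOTS 16            /* assumed slot-table size */

struct slot {
    uint32_t last_seq;              /* last sequence seen on this slot */
    void    *cached_reply;          /* reply retained for retransmits */
};

struct session {
    struct slot slots[SESSION_SLOTS];
};

/* Returns the cached reply for an exact retransmit, or NULL for a new
 * request (the caller executes it and caches the reply in the slot). */
static void *session_check(struct session *s, uint32_t slotid, uint32_t seq)
{
    struct slot *sl = &s->slots[slotid % SESSION_SLOTS];
    if (seq == sl->last_seq)
        return sl->cached_reply;
    sl->last_seq = seq;             /* a new request normally carries last_seq + 1 */
    return NULL;
}
```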

pNFS basics

- Separation of data and control, so NFS metadata requests go through NFS and data requests flow directly to devices (OBSD, block/iSCSI, file)
- This allows an NFSv4.X-pNFS client to be a native client to Object/SAN/data-filer file systems and scale efficiently
- Limits the need for custom VFS clients for every version of every OS/kernel known to mankind

pNFS and RDMA

- An NFSv4.x client with RDMA gives us a low-latency, low-overhead path for metadata (via the RPC/RDMA layer)
- pNFS gives us parallel paths for data directly to the storage devices or filers (for OBSD, block, and file methods)
- For the file method, RPC/RDMA provides a standards-based data path to the data filer
- For the block method, iSCSI/iSER or SRP could be used; this provides a standards-based data path (though it lacks transactional security)
- For the OBSD method, since ANSI OBSD is iSCSI extended, if OBSD/iSCSI/iSER all get along, this provides a standards-based data path that is transactionally secure

pNFS and RDMA

With the previous two items, combined with other NFSv4 features like leasing, compound RPCs, etc., we have a first-class, standards-based file system client that gets native device performance, all provided by NFSv4.XXX, and is capable of effectively using any global parallel file system

AND ALL WITH STANDARDS!

pNFS and RDMA

We really need all this work to be enabled on both Ethernet and InfiniBand, and to be completely routable between the two media.

- Will higher-level apps that become RDMA-aware be able to use both Ethernet and InfiniBand, and mixtures of both, transparently?
- Will NFSv4 RPC/RDMA, iSCSI, and SRP be routable between media?

CITI

Developing NFSv4 reference implementation since 1999

NFS/RDMA and NFSv4.1 Sessions since 2003

Funded by Sun, Network Appliance, ASCI, PolyServe, NSF

http://www.citi.umich.edu/projects/nfsv4/

Key message

Give us kDAPL

Any questions?

http://www.citi.umich.edu/
