
Building Advanced Storage Environment

Cheng Yaodong

Computing Center, IHEP

December 2002

Outline

◆ Current Environment

◆ Main Problems

◆ Solutions

◆ Related Techniques

◆ Introduction to CERN/CASTOR

◆ Test Environment

Current Storage Environment

◆ Isolated Storage

■ Each server has its own storage

◆ Multi-platform
■ Red Hat Linux, HP-UX, Solaris, Windows

◆ Various media
■ Disk arrays; tapes including LTO, DLT, SDLT, etc.

◆ Obsolete management

◆ NFS

Isolated Storage

[Diagram: three storage islands (Sun, HP, and Dell storage), each with its own file system and volume manager]

Main Problems

◆ DAS (Directly Attached Storage): data islands

◆ Poor scalability

◆ Low efficiency

◆ Inconvenient to use

◆ NFS

■ Overload on the system

■ Overhead on the network

◆ Small capacity

Solutions

◆ Building an Advanced Storage Environment

■ Provides
● Remote access to disk files

● Disk pool management

● Indirect access to tape

● Volume manager

● Hierarchical Storage Manager Functionality

■ Main objectives
● Focused on HEP requirements

● Easy to use, deploy, administer

● High performance

● Good scalability

● Available on most Unix systems and Windows/NT

● Integration and virtualization of storage resources

Related Techniques

◆ Hierarchical Storage Manager (HSM)

◆ Distributed file system

◆ Storage Area Network (SAN)

◆ Virtual Storage

Hierarchical Storage Manager

◆ Characteristics of data in High Energy Physics

■ 20% active, 80% non-active

◆ Layers of storage devices

◆ Data migration

◆ Data recall

◆ 3-tier storage infrastructure

Distributed file system

◆ Load balancing between storage devices

◆ Alleviates the load on the OS and network

◆ A single, shared name space for all users, from all machines

◆ Location-independent file sharing

◆ Client caching

◆ Extended security through Kerberos authentication and Access Control Lists

◆ Replication techniques for file system reliability

Storage Area Network

◆ A private network dedicated to storage

◆ Storage devices are connected to a switch through FCP, iSCSI, InfiniBand, and other protocols

◆ These protocols are designed specifically for transferring large amounts of data

◆ Servers are directly connected to the disks and share data

◆ Native file systems are used, giving much better performance than NFS

◆ Some HSM functionality is still needed

SAN Model

[Diagram: servers attached to both the LAN and the Storage Area Network, with HSM behind the SAN]

Virtual Storage

◆ Maps all the storage resources to a virtual device or a single file space

◆ Integrating storage devices
■ different storage connections: DAS, NAS, SAN
■ different storage media: disk, tape

◆ Indirect access to physical storage devices

◆ Easy to use and administer

◆ Multi-platform support

◆ Data sharing

Our Implementation of Virtual Storage

[Diagram: physical storage devices are pooled by storage management software into a single virtual storage space; Red Hat, HP, Solaris, and NT clients access it transparently through the virtualization layer]

Introduction to CERN/CASTOR

◆ CERN Advanced STORage manager

■ In January 1999, CERN began to develop CASTOR

■ Hierarchical Storage Manager used to store user and physics files

■ It manages the secondary and tertiary storage

■ Currently holds more than 1800 TB of data

■ The servers are installed in the Computer Center, while the clients are deployed on most of the computers, including the desktops.

■ Automatic management of experiment data files

◆ Main access to data is through RFIO (Remote File I/O package)

Remote File I/O (RFIO)

◆ Provides transparent access to files: they can be local, remote, or HSM files

◆ There exist
■ a command-line interface: rfcp, rfmkdir, rfdir

■ an Application Programming Interface (API)

◆ All calls handle standard file names and file descriptors (Unix or Windows)

◆ The routine names are obtained by prepending rfio_ to standard POSIX system calls

◆ The function prototypes are unchanged

◆ The function name translation is done automatically by including the header file “rfio.h” (sketched below)
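
A minimal sketch of this transparent mode, assuming a standard CASTOR client installation; the CASTOR path is made up for illustration:

    /* Illustrative RFIO client in the "transparent" style: including
       rfio.h remaps the POSIX calls below (open/read/close) to their
       rfio_ equivalents, so the same code works on local, remote, and
       HSM files. The CASTOR path is a made-up example. */
    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <rfio.h>    /* CASTOR header: performs the name translation */

    int main(void)
    {
        char buf[4096];
        int  n;
        int  fd = open("/castor/ihep.ac.cn/user/c/cheng/run001.dat", O_RDONLY);

        if (fd < 0) {
            fprintf(stderr, "open failed\n");
            return 1;
        }
        while ((n = read(fd, buf, sizeof(buf))) > 0)
            fwrite(buf, 1, (size_t)n, stdout);   /* copy contents to stdout */
        close(fd);
        return 0;
    }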

RFIO access to data

[Diagram: an RFIO client accesses the local disk directly; access to a remote disk goes through RFIOD, the disk mover]

Disk Pool

◆ A series of disks on different machines forms a disk pool managed by the Stager

◆ Disk virtualization

◆ Allocates space in the disk pool to store files

◆ Makes space in the pools to store new files via a garbage collector (sketched below)

◆ Keeps a catalog of all files residing in the pools
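
A hypothetical sketch of this allocation logic (not CASTOR source; pool_free_bytes, run_garbage_collector, and catalog_add are invented helper names):

    /* Hypothetical stager-style space allocation: try to place the
       file in the pool, run the garbage collector if space is short,
       then record the new file in the catalog. */
    #include <stdbool.h>

    extern long pool_free_bytes(const char *pool);         /* invented */
    extern void run_garbage_collector(const char *pool);   /* invented */
    extern void catalog_add(const char *pool, const char *file, long size);

    bool allocate_in_pool(const char *pool, const char *file, long size)
    {
        if (pool_free_bytes(pool) < size)
            run_garbage_collector(pool);   /* make space for new files */
        if (pool_free_bytes(pool) < size)
            return false;                  /* still no room: allocation fails */
        catalog_add(pool, file, size);     /* stager keeps a catalog of pool files */
        return true;
    }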

File access in a disk pool

[Diagram: the RFIO client contacts the Stager, which uses its catalog and directs RFIOD, the disk mover, to serve the file from the disk pool]

CASTOR Name Server

◆ File names are in the form:
■ /castor/domain_name/experiment_name/…

• for example: /castor/ihep.ac.cn/ybj/

■ /castor/domain_name/user/…

• for example: /castor/ihep.ac.cn/user/c/cheng

◆ Role:

■ Implements a hierarchical view of the name space: files and directories

■ Remembers the file residency on tertiary storage

■ Keeps the file class definitions

CASTOR file access

[Diagram: the RFIO client resolves the file through the Name Server, then the Stager and its catalog locate it, and RFIOD, the disk mover, serves it from the disk pool]

CASTOR components

◆ The backend store consists of:

■ RFIOD (Disk Mover)

■ Name server

■ Volume Manager

■ Volume and Drive Queue Manager

■ RTCOPY daemon (Tape Mover)

■ Tpdaemon

◆ Main characteristics of the servers
■ Distributed

■ Critical servers are replicated

■ Use CASTOR Database (Cdb) and open-source databases (MySQL)

Main components

◆ Distributed components
■ Remote File I/O (RFIO)

■ CASTOR Name Server (Cns)

■ Stager

■ Tape Mover (RTCOPY)

■ Physical Volume Repository (Ctape)

◆ Central components
■ Volume Manager (VMGR)

■ Volume and Drive Queue Manager (VDQM)

■ Message Daemon

Stager

◆ Role: Storage Resource Manager
■ Disk pool manager

• Allocates space on disk to store files
• Keeps a catalog of all files residing in the pools
• Makes space in the pools to store new files (garbage collector)

■ Hierarchical Resource Manager
• Migrates files according to file class and disk pool policies
• Recalls files

■ Tape Stager (deprecated)
• Caches tape files on disk

File classes

◆ Associated with each file or directory

◆ Inherited from the parent directory, but can be changed (at the sub-directory level)

◆ Describes how the file is managed on disk, migrated, and purged

◆ File class attributes are (see the sketch after this list):
■ Ownership

■ Migration time interval

■ Minimum time before migration

■ Number of copies

■ Retention period on disk

■ Number of parallel streams (number of drives)

■ Tape pools
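
As a reading aid, these attributes map naturally onto a record; the struct below is purely illustrative, not CASTOR's actual definition:

    /* Illustrative record of the file class attributes listed above
       (field names and types are invented for the sketch). */
    #include <sys/types.h>

    struct file_class {
        uid_t owner_uid;             /* ownership */
        gid_t owner_gid;
        long  migration_interval;    /* migration time interval (seconds) */
        long  min_time_before_migr;  /* minimum time before migration (seconds) */
        int   nb_copies;             /* number of copies */
        long  retention_period;      /* retention period on disk (seconds) */
        int   nb_streams;            /* parallel streams = number of drives */
        char  tape_pool[32];         /* tape pool the class migrates to */
    };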

Migration policies

◆ Migration policy depends on

■ File class

■ Disk pool

◆ Start migration (see the sketch below)
■ Amount of data ready to be migrated exceeds a given threshold

■ Percentage of free space falls below a given threshold

■ Time interval

■ Migration can also be forced

◆ Stop migration
■ Data ready at start-migration time has been migrated

◆ Algorithm
■ Least recently accessed file is migrated first

■ A maximum number of tape drives (parallel streams) can be set
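
The start conditions combine into a simple predicate, as in the following illustrative sketch (all threshold names are invented):

    /* Hypothetical check of the start-migration conditions: data
       volume threshold, free-space threshold, time interval, or an
       explicit force flag. */
    #include <stdbool.h>
    #include <time.h>

    bool should_start_migration(long   ready_bytes, long   ready_threshold,
                                double free_pct,    double free_pct_min,
                                time_t last_run,    time_t interval,
                                bool   forced)
    {
        if (forced)
            return true;                      /* migration can be forced */
        if (ready_bytes > ready_threshold)
            return true;                      /* enough data ready to migrate */
        if (free_pct < free_pct_min)
            return true;                      /* free space below threshold */
        return (time(NULL) - last_run) >= interval;  /* periodic interval */
    }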

Physical Volume Repository (Ctape)

◆ Dynamic configuration of tape drives

◆ Reservation of resources

◆ Drive allocation (when not using VDQM)

◆ Tape volume mount and positioning

◆ Automatic label checking

◆ User-callable routines to write labels

◆ Drive status display

◆ Operator interface

◆ VMGR and VDQM interface

◆ Hardware supported:
■ Drives: DLT, LTO, IBM 3590, STK 9840, STK 9940
■ Robots: ADIC Scalar, IBM 3494, IBM 3584, Odetics, Sony DMS24, STK

Volume Manager (VMGR)

◆ Handles pools of tapes

■ private to an experiment

■ public pool

■ supply pool

◆ Features (see the sketch below):

■ Determines the most appropriate tapes for storing files in a given tape pool, according to file size

■ Minimizes the number of tape volumes for a given file

◆ Tape volumes are administered by the Computer Center. They are neither owned nor managed by users.

◆ There is one single Volume Manager
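
One way to read the two features above is as a best-fit search over the pool's tapes; the sketch below is an assumption about the intent, not VMGR code:

    /* Hypothetical best-fit tape selection: among the tapes of a
       pool, prefer the one with the least free space that still
       holds the whole file, so files are not split across volumes. */
    #include <stddef.h>

    struct tape { const char *vid; long free_bytes; };

    /* Returns the chosen tape, or NULL if no single tape fits the file. */
    const struct tape *pick_tape(const struct tape *pool, size_t n, long file_size)
    {
        const struct tape *best = NULL;
        for (size_t i = 0; i < n; i++) {
            if (pool[i].free_bytes < file_size)
                continue;                         /* file would be split: skip */
            if (best == NULL || pool[i].free_bytes < best->free_bytes)
                best = &pool[i];                  /* tightest fit so far */
        }
        return best;
    }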

Volume and Drive Queue Manager (VDQM)

◆ VDQM maintains a global queue of tape requests per device group

◆ VDQM maintains a global table of all tape drives
■ Provides tape server load balancing

■ Optimizes the number of tape mounts

◆ Tape requests are assigned a priority (see the comparison sketch below):
■ Requests are queued in priority order

■ Requests with the same priority are queued in time order

◆ Drives may be dedicated

◆ Easy to add functionality such as
■ Drive quotas

■ Fair-share scheduler (prototype exists)
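
The queuing discipline (priority first, FIFO within a priority) can be written as a comparison function; this is an illustration, not VDQM source:

    /* Hypothetical tape request, ordered the way the queue is
       described above: higher priority first, earlier submission
       first among equal priorities. */
    #include <time.h>

    struct tape_request {
        int    priority;    /* higher value = more urgent (assumed convention) */
        time_t submitted;   /* request submission time */
    };

    int compare_requests(const struct tape_request *a,
                         const struct tape_request *b)
    {
        if (a->priority != b->priority)
            return b->priority - a->priority;              /* priority order */
        if (a->submitted != b->submitted)
            return (a->submitted < b->submitted) ? -1 : 1; /* time order */
        return 0;
    }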

User interface

◆ Command line
■ name server commands: nsls, nsmkdir, nsrm, nstouch, nschmod, nsenterclass
■ RFIO commands: rfdir, rfcp, rfcat, rfchmod, rfrm, rfrename

◆ Application Programming Interface (API)
■ #include <shift.h>
■ Link with the “shift” library (-lshift) when compiling

■ Two forms of routine names (the sketch below uses the explicit form)
● obtained by prepending rfio_ to standard POSIX system calls, such as rfio_open, rfio_read, rfio_write, rfio_lseek, rfio_close, etc.
● The function prototypes are unchanged. The function name translation is done automatically by including the header file “rfio.h”
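
A minimal sketch of the explicit form, assuming a standard client installation; the path is illustrative and the build line is approximate:

    /* Sketch of the explicit form: rfio_ entry points via shift.h,
       creating and writing a file. The CASTOR path is a made-up
       example. Build (approximate): cc demo.c -lshift */
    #include <string.h>
    #include <fcntl.h>
    #include <shift.h>

    int main(void)
    {
        const char msg[] = "hello from RFIO\n";
        int fd = rfio_open("/castor/ihep.ac.cn/user/c/cheng/hello.txt",
                           O_WRONLY | O_CREAT, 0644);

        if (fd < 0)
            return 1;
        rfio_write(fd, (char *)msg, (int)strlen(msg));  /* write the buffer */
        rfio_close(fd);
        return 0;
    }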

Test Environment

◆ Hardware
■ Servers: Dell 6400, Dell 4400, Dell 2400, Dell GX110

■ Disk array, DAS disk

■ Tape library: ADIC Scalar 100 (2 HP LTO drives, 12+60 slots)

◆ Software
■ Operating system: Red Hat Linux 7.2

■ Storage management software: CERN/CASTOR

■ Distributed file systems: NFS, AFS
■ Job scheduling system: PBS

■ Database: MySQL

Future Storage Environment

Conclusion

◆ Handles a large amount of data in a fully distributed environment

◆ Maps all the storage resources to a single file space

◆ Users access files in the space through the command line or an API

◆ Users only need to remember the file name; they do not need to know where their files are placed or whether the storage capacity is sufficient

Thanks!

Storage Hierarchy

3-tier storage infrastructure

[Diagram: Tier 1, primary storage (very fast, $$$$$/MB): servers and filers on the LAN or WAN. Tier 2, secondary storage (fast, $$/MB): “disk-to-disk” appliances reached over a storage network, used for backup/restore. Tier 3, tertiary storage (slow, $/MB): tape and optical libraries for archival/HSM of heterogeneous storage.]

Parameters of Prevalent Tapes