
HIGH PERFORMANCE COMPUTING


CONTENTS

1. Introduction
2. History of Computing
3. Parallel Computing
4. Classification of Computers
5. High Performance Computing
     Architecture
     Symmetric Multiprocessing
6. Computer Clusters
     Cluster Categorizations
     Basics of Cluster Computing
     Description over HPC
     Cluster Components
     Message Passing Interface
     Parallel Virtual Machine
     Cluster Middleware
     Storage
     Cluster Features
7. Grid Computing
     Cycle Stealing
8. Bibliography


INTRODUCTION

HPC can be boiled down to one thing – SPEED. The goal is to perform the maximum amount of computation in the minimum amount of time. The term HPC refers to the use of parallel supercomputers and computer clusters, that is, computing systems composed of multiple processors linked together in a single system with commercially available interconnects. Today HPC systems have become a basic need wherever work must be done quickly and efficiently, and many organizations and institutions across the world are adopting this trend.


HISTORY OF COMPUTING

The history of computing is longer than the history of computing hardware and modern computing technology and includes the history of methods intended for pen and paper or for chalk and slate, with or without the aid of tables.

Concrete devices:

Computing is intimately tied to the representation of numbers. But long before abstractions like number arose, there were mathematical concepts to serve the purposes of civilization. These concepts are implicit in concrete practices such as:

one-to-one correspondence, a rule to count how many items, say on a tally stick, which was eventually abstracted into number;

comparison to a standard, a method for assuring reproducibility in a measurement, for example, the number of coins;

The 3-4-5 right triangle was a device for assuring a right angle, using ropes with 12 evenly spaced knots, for example.

Numbers:

Eventually, the concept of numbers became concrete and familiar enough for counting to arise, at times with sing-song mnemonics to teach sequences to others. All the known languages have words for at least "one" and "two", and even some animals like the blackbird can distinguish a surprising number of items.

Advances in the numeral system and mathematical notation eventually led to the discovery of mathematical operations such as addition, subtraction, multiplication, division, squaring, square root, and so forth. Eventually the operations were formalized, and concepts about the operations became understood well enough to be stated formally, and even proven. See, for example, Euclid's algorithm for finding the greatest common divisor of two numbers.
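As a brief aside (the code is an illustration, not part of the original report), Euclid's algorithm takes only a few lines of C:

    #include <stdio.h>

    /* Euclid's algorithm: repeatedly replace the pair (a, b) by (b, a mod b)
       until the remainder is zero; the last non-zero value is the GCD. */
    unsigned int gcd(unsigned int a, unsigned int b)
    {
        while (b != 0) {
            unsigned int r = a % b;
            a = b;
            b = r;
        }
        return a;
    }

    int main(void)
    {
        printf("gcd(1071, 462) = %u\n", gcd(1071, 462));  /* prints 21 */
        return 0;
    }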

By the High Middle Ages, the positional Hindu-Arabic numeral system had reached Europe, which allowed for systematic computation of numbers. During this period, the representation of a calculation on paper actually allowed calculation of mathematical expressions, and the tabulation of mathematical functions such as the square root and the common logarithm (for use in multiplication and division) and the trigonometric functions. By the time of Isaac Newton's research, paper or vellum was an important computing resource, and even in our present time, researchers like Enrico Fermi would cover random scraps of paper with calculation, to satisfy their curiosity about an equation. Even into the period of programmable calculators, Richard Feynman would unhesitatingly compute any steps which overflowed the memory of the calculators, by hand, just to learn the answer.

Navigation and astronomy:

Starting with known special cases, the calculation of logarithms and trigonometric functions can be performed by looking up numbers in a mathematical table, and interpolating between known cases. For small enough differences, this linear operation was accurate enough for use in navigation and astronomy in the Age of Exploration. The uses of interpolation have thrived in the past 500 years: by the twentieth century Leslie Comrie and W.J. Eckert systematized the use of interpolation in tables of numbers for punch card calculation.
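To make the table-and-interpolation idea concrete, here is a small sketch in C (the four-entry sine table and the function names are assumptions made only for this illustration):

    #include <stdio.h>

    /* A tiny table of sin(x) at known points (values rounded to 4 places). */
    static const double xs[]   = {0.0, 0.1, 0.2, 0.3};
    static const double sins[] = {0.0000, 0.0998, 0.1987, 0.2955};

    /* Linear interpolation between the two nearest table entries. */
    double table_sin(double x)
    {
        int i;
        for (i = 0; i < 3; i++) {
            if (x >= xs[i] && x <= xs[i + 1]) {
                double t = (x - xs[i]) / (xs[i + 1] - xs[i]);
                return sins[i] + t * (sins[i + 1] - sins[i]);
            }
        }
        return -1.0;   /* outside the table */
    }

    int main(void)
    {
        printf("sin(0.15) ~ %.4f\n", table_sin(0.15));  /* about 0.1493 */
        return 0;
    }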

In our time, even a student can simulate the motion of the planets, an N-body differential equation, using the concepts of numerical approximation, a feat which even Isaac Newton could admire, given his struggles with the motion of the Moon.


Weather prediction:

The numerical solution of differential equations, notably the Navier-Stokes equations, was an important stimulus to computing, with Lewis Fry Richardson's numerical approach to solving differential equations. To this day, some of the most powerful computer systems on Earth are used for weather forecasts.

Symbolic computations:

By the late 1960s, computer systems could perform symbolic algebraic manipulations well enough to pass college-level calculus courses. Using programs like Maple, Macsyma (now Maxima) and Mathematica, including some open source programs like Yacas, it is now possible to visualize concepts such as modular forms which were only accessible to the mathematical imagination before this.


PARALLEL COMPUTING

Parallel computing is the simultaneous execution of some combination of multiple instances of programmed instructions and data on multiple processors in order to obtain results faster. The idea is based on the fact that the process of solving a problem usually can be divided into smaller tasks, which may be carried out simultaneously with some coordination.

Definition:

A parallel computing system is a computer with more than one processor for parallel processing. In the past, each processor of a multiprocessing system always came in its own processor packaging, but recently-introduced multicore processors contain multiple logical processors in a single package. There are many different kinds of parallel computers. They are distinguished by the kind of interconnection between processors (known as "processing elements" or PEs) and memory. Flynn's taxonomy, one of the most accepted taxonomies of parallel architectures, classifies parallel (and serial) computers according to: whether all processors execute the same instructions at the same time (single instruction/multiple data -- SIMD) or whether each processor executes different instructions (multiple instruction/multiple data -- MIMD).

One major way to classify parallel computers is based on their memory architectures. Shared memory parallel computers have multiple processors accessing all available memory as a global address space. They can be further divided into two main classes based on memory access times: Uniform Memory Access (UMA), in which access times to all parts of memory are equal, and Non-Uniform Memory Access (NUMA), in which they are not. Distributed memory parallel computers also have multiple processors, but each of the processors can only access its own local memory; no global memory address space exists across them. Parallel computing systems can also be categorized by the number of processors in them. Systems with thousands of such processors are known as massively parallel. There is also a distinction between "large scale" and "small scale" parallel processors, which depends on the size of the system; a PC-based parallel system, for example, would generally be considered a small scale system. Parallel processor machines are also divided into symmetric and asymmetric multiprocessors, depending on whether all the processors are the same or not (for instance if only one is capable of running the operating system code and the others are less privileged).

A variety of architectures have been developed for parallel processing. For example, a ring architecture has processors linked by a ring structure. Other architectures include hypercubes, fat trees, systolic arrays, and so on.

Theory and practice:

Parallel computers can be modeled as Parallel Random Access Machines (PRAMs). The PRAM model ignores the cost of interconnection between the constituent computing units, but is nevertheless very useful in providing upper bounds on the parallel solvability of many problems. In reality the interconnection plays a significant role. The processors may communicate and cooperate in solving a problem or they may run independently, often under the control of another processor which distributes work to and collects results from them (a "processor farm").

Processors in a parallel computer may communicate with each other in a number of ways, including shared (either multiported or multiplexed) memory, a crossbar, a shared bus or an interconnect network with any of a myriad of topologies including star, ring, tree, hypercube, fat hypercube (a hypercube with more than one processor at a node), an n-dimensional mesh, etc. Parallel computers based on an interconnect network need to employ some kind of routing to enable the passing of messages between nodes that are not directly connected. The communication medium used for communication between the processors is likely to be hierarchical in large multiprocessor machines. Similarly, memory may be either private to a processor, shared between a number of processors, or globally shared. A systolic array is an example of a multiprocessor with fixed-function nodes, local-only memory and no message routing.

Approaches to parallel computers include multiprocessing, parallel supercomputers, NUMA vs. SMP vs. massively parallel computer systems, distributed computing (esp. computer clusters and grid computing).


According to Amdahl's law, parallel processing is less efficient than one x-times-faster processor from a computational perspective. However, since power consumption is a super-linear function of the clock frequency on modern processors, we are reaching the point where from an energy cost perspective it can be cheaper to run many low speed processors in parallel than a single highly clocked processor.
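For reference, Amdahl's law is commonly written as

    S(N) = 1 / ((1 - P) + P / N)

where P is the fraction of a program that can be parallelized and N is the number of processors (the formula itself is not given in the report). With P = 0.95 and N = 100, for example, the speedup is only about 17x, which is why the serial fraction, rather than the processor count, quickly becomes the limiting factor.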

Parallel programming:

Parallel programming is the design, implementation, and tuning of parallel computer programs which take advantage of parallel computing systems. It also refers to the application of parallel programming methods to existing serial programs (parallelization). Parallel programming focuses on partitioning the overall problem into separate tasks, allocating tasks to processors and synchronizing the tasks to get meaningful results. Parallel programming can only be applied to problems that are inherently parallelizable, mostly without data dependence. A problem can be partitioned based on domain decomposition or functional decomposition, or a combination.

There are two major approaches to parallel programming: implicit parallelism, where the system (the compiler or some other program) partitions the problem and allocates tasks to processors automatically (also called automatic parallelizing compilers); or explicit parallelism, where the programmer must annotate their program to show how it is to be partitioned. Many factors and techniques impact the performance of parallel programming, especially load balancing, which attempts to keep all processors busy by moving tasks from heavily loaded processors to less loaded ones.
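As one hedged illustration of explicit parallelism (OpenMP is not mentioned in the report, and the sketch assumes a compiler with OpenMP support), the programmer's annotation here is a single pragma that tells the compiler how to partition the loop:

    #include <stdio.h>

    int main(void)
    {
        double sum = 0.0;
        long i;

        /* The pragma below is the explicit annotation: it asks the compiler
           to split the loop iterations across the available processors and
           to combine the partial sums at the end. */
        #pragma omp parallel for reduction(+:sum)
        for (i = 1; i <= 10000000; i++)
            sum += 1.0 / (double)i;

        printf("harmonic sum = %f\n", sum);
        return 0;
    }

Built with an OpenMP-aware compiler (for example, gcc -fopenmp) the loop runs in parallel; without that support the pragma is simply ignored and the program runs serially.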

Some people consider parallel programming to be synonymous with concurrent programming. Others draw a distinction between parallel programming, which uses well-defined and structured patterns of communications between processes and focuses on parallel execution of processes to enhance throughput, and concurrent programming, which typically involves defining new patterns of communication between processes that may have been made concurrent for reasons other than performance. In either case, communication between processes is performed either via shared memory or with message passing, either of which may be implemented in terms of the other.

Programs which work correctly in a single CPU system may not do so in a parallel environment. This is because multiple copies of the same program may interfere with each other, for instance by accessing the same memory location at the same time. Therefore, careful programming (synchronization) is required in a parallel system.
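A minimal sketch of the problem and of the usual fix is shown below, using POSIX threads (the report does not prescribe a particular threading library; the shared counter is an assumption made for illustration).

    #include <stdio.h>
    #include <pthread.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        int i;
        for (i = 0; i < 1000000; i++) {
            /* Without the lock, two threads can read the same old value of
               'counter' at the same time and one of the increments is lost. */
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);   /* always 2000000 with the lock */
        return 0;
    }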

CLASSIFICATION OF COMPUTERS


The classification of computers can be described as follows:

1. Mainframe Computers.

Mainframes (often colloquially referred to as Big Iron) are computers used mainly by large organizations for critical applications, typically bulk data processing such as census, industry and consumer statistics, ERP, and financial transaction processing.

The term probably originated from the early mainframes, as they were housed in enormous, room-sized metal boxes or frames. Later the term was used to distinguish high-end commercial machines from less powerful units which were often contained in smaller packages.

Today in practice, the term usually refers to computers compatible with the IBM System/360 line, first introduced in 1965. (IBM System z9 is IBM's latest incarnation.) Otherwise, systems with similar functionality but not based on the IBM System/360 are referred to as "servers". However, "server" and "mainframe" are sometimes used interchangeably.

Some non-System/360-compatible systems derived from or compatible with older (pre-web) server technology may also be considered mainframes. These include the Burroughs large systems and the UNIVAC 1100/2200 series systems. Most large-scale computer system architectures were firmly established in the 1960s and most large computers were based on architecture established during that era up until the advent of web servers in the 1990s.

There were several minicomputer operating systems and architectures that arose in the 1970s and 1980s, but minicomputers are generally not considered mainframes. (UNIX is generally considered a minicomputer operating system even though it has scaled up over the years to match mainframe characteristics in many ways.)

Thus, the defining characteristic of a “mainframe” appears to be compatibility with the large computer systems that were established in the 1960s.


2. Minicomputers.

Minicomputer (colloquially, mini) is a largely obsolete term for a class of multi-user computers that lies in the middle range of the computing spectrum, in between the largest multi-user systems (mainframe computers) and the smallest single-user systems (microcomputers or personal computers). Formerly this class formed a distinct group with its own hardware and operating systems. While the distinction between mainframe computers and smaller computers remains fairly clear, contemporary middle-range computers are not well differentiated from personal computers, typically being just more powerful but still compatible versions of the personal computer. More modern terms for minicomputer-type machines include midrange systems (IBM parlance), workstations (Sun Microsystems and general UNIX/Linux parlance), and servers.

3. Microcomputers.

Although there is no rigid definition, a microcomputer (sometimes shortened to micro) is most often taken to mean a computer with a microprocessor (µP) as its CPU. Another general characteristic of these computers is that they occupy physically small amounts of space. Although the terms are not synonymous, many microcomputers are also personal computers (in the generic sense) and vice versa.

The microcomputer came after the minicomputer, most notably replacing the many distinct components that made up the minicomputer's CPU with a single integrated microprocessor chip. The early microcomputers were primitive, the earliest models shipping with as little as 256 bytes of RAM, and no input / output other than lights and switches. However, as microprocessor design advanced rapidly and memory became less expensive from the early 1970s onwards, microcomputers in turn grew faster and cheaper. This resulted in an explosion in their popularity during the late 1970s and early 1980s.

The increasing availability and power of such computers attracted the attention of more software developers. As time went on and the industry matured, the market standardized around IBM PC clones running MS-DOS (and later Windows).

Modern desktop computers, video game consoles, laptop computers, tablet PCs, and many types of handheld devices, including mobile phones, may all be considered examples of microcomputers according to the definition given above.

4. Supercomputers.

A supercomputer is a computer that led the world (or was close to doing so) in terms of processing capacity, particularly speed of calculation, at the time of its introduction. The term "Super Computing" was first used by the New York World newspaper in 1920 to refer to large custom-built tabulators IBM made for Columbia University.

The term supercomputer itself is rather fluid, and today's supercomputer tends to become tomorrow's normal computer. CDC's early machines were simply very fast scalar processors, some ten times the speed of the fastest machines offered by other companies. In the 1970s most supercomputers were dedicated to running a vector processor, and many of the newer players developed their own such processors at a lower price to enter the market. The early and mid-1980s saw machines with a modest number of vector processors working in parallel become the standard. Typical numbers of processors were in the range 4–16. In the later 1980s and 1990s, attention turned from vector processors to massive parallel processing systems with thousands of "ordinary" CPUs, some being off the shelf units and others being custom designs. (This is commonly and humorously referred to as the attack of the killer micros in the industry.) Today, parallel designs are based on "off the shelf" server-class microprocessors, such as the PowerPC, Itanium, or x86-64, and most modern supercomputers are now highly-tuned computer clusters using commodity processors combined with custom interconnects.

Supercomputers are used for highly calculation-intensive tasks such as problems involving quantum mechanical physics, weather forecasting, climate research (including research into global warming), molecular modeling (computing the structures and properties of chemical compounds, biological macromolecules, polymers, and crystals), physical simulations (such as simulation of airplanes in wind tunnels, simulation of the detonation of nuclear weapons, and research into nuclear fusion), cryptanalysis, and the like. Major universities, military agencies and scientific research laboratories are heavy users.

HIGH PERFORMANCE COMPUTING


Introduction:

The term high performance computing (HPC) refers to the use of (parallel) supercomputers and computer clusters, that is, computing systems comprised of multiple (usually mass-produced) processors linked together in a single system with commercially available interconnects. This is in contrast to mainframe computers, which are generally monolithic in nature. While a high level of technical skill is undeniably needed to assemble and use such systems, they can be created from off-the-shelf components. Because of their flexibility, power, and relatively low cost, HPC systems increasingly dominate the world of supercomputing. Usually, computer systems in or above the teraflop-region are counted as HPC-computers.

The term is most commonly associated with computing used for scientific research. A related term, High-performance technical computing (HPTC), generally refers to the engineering applications of cluster-based computing (such as computational fluid dynamics and the building and testing of virtual prototypes). Recently, HPC has come to be applied to business uses of cluster-based supercomputers, such as data warehouses, line-of-business (LOB) applications and transaction processing.

Evolving the "HPC" Concept:

The nomenclature surrounding the "HPC" acronym is evolving. The ‘old’ definition of HPC, High Performance Computing, was the natural semantic evolution of the 'supercomputing' market, referring to the expanded and diverse range of platforms, from scalable high-end systems to COTS clusters, blade servers and of course the traditional vector supercomputers used to attack the most complex data- and computational-intensive applications. A key trend that is currently taking root is the shift in focus towards productivity – or more precisely, how systems and technology are applied. This encompasses everything in the HPC ecosystem, from the development environment, to systems and storage, to the use and interoperability of applications, to the total user experience – all combined to address and solve real world problems.

The more current and evolving definition of HPC refers to High Productivity Computing, and reflects the purpose and use model of the myriad of existing and evolving architectures, and the supporting ecosystem of software, middleware, storage, networking and tools behind the next generation of applications.

Architecture:

An HPC cluster uses a multiple-computer architecture featuring a parallel computing system that consists of one or more master nodes and one or more compute nodes interconnected by a private network. All the nodes in the cluster are commodity systems – PCs, workstations or servers – running commodity software such as Linux. The master node acts as a server for the network file system (NFS) and as a gateway to the outside world. In order to make the master node highly available to the users, high availability (HA) clustering might be employed.

The sole task of compute nodes is to execute parallel jobs. In most cases, therefore, the compute nodes do not have any peripherals connected. All access and control to the compute nodes are provided via remote connections, such as network and/or serial port through the master node. Since compute nodes do not need to access the machines outside the cluster, nor do the machines outside the cluster need to access the compute nodes directly, compute nodes commonly use private IP addresses.
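As an illustration only (the report does not prescribe particular addresses or files), a minimal master-node setup for such a private cluster network might look like the sketch below; the host names, the 192.168.0.x range and the exported /home directory are assumptions made for the example.

    # /etc/hosts on the master: compute nodes on a private address range
    192.168.0.1    master
    192.168.0.11   node01
    192.168.0.12   node02

    # /etc/exports on the master: share /home with the compute nodes over NFS
    /home   192.168.0.0/24(rw,sync,no_subtree_check)

The compute nodes mount the exported directory from the master, and because their addresses are private they remain unreachable from outside the cluster except through the master.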


Symmetric Multiprocessing:

Symmetric multiprocessing, or SMP, is a multiprocessor computer architecture where two or more identical processors are connected to a single shared main memory. Most common multiprocessor systems today use SMP architecture.

SMP systems allow any processor to work on any task no matter where the data for that task are located in memory; with proper operating system support, SMP systems can easily move tasks between processors to balance the workload efficiently.

SMP is one of many styles of multiprocessor machine architecture; others include NUMA (Non-Uniform Memory Access) which dedicates different memory banks to different processors allowing them to access memory in parallel. This can dramatically improve memory throughput as long as the data is localized to specific processes (and thus processors). On the downside, NUMA makes the cost of moving data from one processor to another, as in workload balancing, more expensive. The benefits of NUMA are limited to particular workloads, notably on servers where the data is often associated strongly with certain tasks or users.

Other systems include asymmetric multiprocessing (ASMP), in which separate specialized processors are used for specific tasks, and computer clustered multiprocessing (e.g. Beowulf), in which not all memory is available to all processors.

The former is not widely used or supported (though the high-powered 3D chipsets in modern video cards could be considered a form of asymmetric multiprocessing), while the latter is used fairly extensively to build very large supercomputers. In this discussion a single-processor machine is denoted as a uniprocessor (UN).

Advantages & Disadvantages:

SMP has many uses in science, industry, and business where software is usually custom programmed for multithreaded processing. However, most consumer products such as word processors and computer games are written in such a manner that they cannot gain large benefits from SMP systems. For games this is usually because writing a program to increase performance on SMP systems will produce a performance loss on uniprocessor systems, which were predominant in the home computer market as of 2007. Due to the nature of the different programming methods, it would generally require two separate projects to support both uniprocessor and SMP systems with maximum performance. Programs running on SMP systems do, however, experience a performance increase even when they have been written for uniprocessor systems. This is because hardware interrupts that usually suspend program execution while the kernel handles them can run on an idle processor instead. The effect in most applications (e.g. games) is not so much a performance increase as the appearance that the program is running much more smoothly. In some applications, particularly software compilers and some distributed computing projects, one will see an improvement by a factor of (nearly) the number of additional processors.

In situations where more than one program is running at the same time, an SMP system will have considerably better performance than a uni-processor, because different programs can run on different CPUs simultaneously.

Support for SMP must be built into the operating system. Otherwise, the additional processors remain idle and the system functions as a uniprocessor system.

In cases where many jobs are being processed in an SMP environment, administrators often experience a loss of hardware efficiency. Software programs have been developed to schedule jobs so that the processor utilization reaches its maximum potential. Good software packages can achieve this maximum potential by scheduling each CPU separately, as well as being able to integrate multiple SMP machines and clusters.

Access to RAM is serialized; this and cache coherency issues cause performance to lag slightly behind the number of additional processors in the system.


COMPUTER CLUSTERS

A computer cluster is a group of tightly coupled computers that work together closely so that in many respects they can be viewed as though they are a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks. Clusters are usually deployed to improve performance and/or availability over that provided by a single computer, while typically being much more cost-effective than single computers of comparable speed or availability.

"A cluster is a logical arrangement of independent entities that collectively provide a service."

"Logical arrangement" implies a structured organization. Logical emphasizes that this organization is not necessarily static. Smart software and/or hardware are typically involved.

"Independent entities" implies a level of distinction and function outside of a cluster context that may involve a system or some fraction of a system (e.g., an operating system).

"Provide a service" implies the intended purpose of the cluster. Elements of pre-service preparation (i.e., provisioning), and post service teardown, may be involved here.

What is a computer cluster?

A cluster is a collection of networked computers that makes one or more defined resources available under a single name.

Computer clusters are groups of computers working together to complete one task or multiple tasks. Clusters can be used in many different ways. Some examples of how clusters are used are fault tolerance (high availability), load balancing, and parallel computing. Many of the bigger computer clusters out there reach supercomputer status.

In other words, a cluster is a group of computers which work together toward a final goal. Some would argue that a cluster must at least consist of a message passing interface and a job scheduler. The message passing interface works to transmit data among the computers (commonly called nodes or hosts) in the cluster.

The job scheduler is just what it sounds like. It takes job requests from user input or other means and schedules them to be run on the number of nodes required in the cluster. It is possible to have a cluster without either of these components, however. Consider a cluster built for a single purpose. There would be no need for a job scheduler and data could be shared among the hosts with simple methods like a CORBA interface.
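As a hedged illustration of what such a job request can look like, the sketch below uses PBS-style directives; the report does not name a particular scheduler, and the script name, resource requests and program name are assumptions made for the example.

    #!/bin/sh
    #PBS -N sample_job              # job name shown by the scheduler
    #PBS -l nodes=4:ppn=2           # ask for 4 nodes with 2 processors each
    #PBS -l walltime=01:00:00       # give the job back after one hour

    cd $PBS_O_WORKDIR               # run from the directory the job was submitted in
    mpirun -np 8 ./my_parallel_app  # launch the (hypothetical) parallel program

The script would typically be handed to the scheduler with a command such as qsub, which queues it until the requested nodes are free.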

History:

The history of cluster computing is best captured by a footnote in Greg Pfister's In Search of Clusters: "Virtually every press release from DEC mentioning clusters says 'DEC, who invented clusters...'. IBM did not invent them either. Customers invented clusters, as soon as they could not fit all their work on one computer, or needed a backup. The date of the first is unknown, but it would be surprising if it was not in the 1960s, or even late 1950s."

The formal engineering basis of cluster computing as a means of doing parallel work of any sort was arguably invented by Gene Amdahl of IBM, who in 1967 published what has come to be regarded as the seminal paper on parallel processing: Amdahl's Law. Amdahl's Law describes mathematically the speedup one can expect from parallelizing any given otherwise serially performed task on a parallel architecture. This article defined the engineering basis for both multiprocessor computing and cluster computing, where the primary differentiator is whether or not the interprocessor communications are supported "inside" the computer (on for example a customized internal communications bus or network) or "outside" the computer on a commodity network.

Consequently the history of early computer clusters is more or less directly tied into the history of early networks, as one of the primary motivations for the development of a network was to link computing resources, creating a de facto computer cluster. Packet switching networks were conceptually invented by the RAND Corporation in 1962. Using the concept of a packet switched network, the ARPANET project succeeded in creating in 1969 what was arguably the world's first commodity-network based computer cluster by linking four different computer centers (each of which was something of a "cluster" in its own right, but probably not a commodity cluster). The ARPANET project grew into the Internet -- which can be thought of as "the mother of all computer clusters" (as the union of nearly all of the compute resources, including clusters, that happen to be connected). It also established the paradigm in use by all computer clusters in the world today -- the use of packet-switched networks to perform interprocessor communications between processor (sets) located in otherwise disconnected frames.

The development of customer-built and research clusters proceeded hand in hand with that of both networks and the Unix operating system from the early 1970s, as both TCP/IP and the Xerox PARC project created and formalized protocols for network-based communications. The Hydra operating system was built for a cluster of DEC PDP-11 minicomputers called C.mmp at C-MU in 1971. However, it was not until circa 1983 that the protocols and tools for easily doing remote job distribution and file sharing were defined (largely within the context of BSD Unix, as implemented by Sun Microsystems) and hence became generally available commercially, along with a shared file system.

The first commercial clustering product was ARCnet, developed by Datapoint in 1977. ARCnet was not a commercial success and clustering per se did not really take off until DEC released their VAXcluster product in 1984 for the VAX/VMS operating system. The ARCnet and VAXcluster products not only supported parallel computing, but also shared file systems and peripheral devices. They were supposed to give you the advantage of parallel processing while maintaining data reliability and uniqueness. VAXcluster, now VMScluster, is still available on OpenVMS systems from HP running on Alpha and Itanium systems.

Two other noteworthy early commercial clusters were the Tandem Himalaya (a circa 1994 high-availability product) and the IBM S/390 Parallel Sysplex (also circa 1994, primarily for business use).

No history of commodity computer clusters would be complete without noting the pivotal role played by the development of Parallel Virtual Machine (PVM) software in 1989. This open source software based on TCP/IP communications enabled the instant creation of a virtual supercomputer -- a high performance compute cluster -- made out of any TCP/IP connected systems. Free-form heterogeneous clusters built on top of this model rapidly achieved total throughput in FLOPS that greatly exceeded that available even with the most expensive "big iron" supercomputers. PVM and the advent of inexpensive networked PCs led, in 1993, to a NASA project to build supercomputers out of commodity clusters. 1995 saw the invention of the "Beowulf"-style cluster -- a compute cluster built on top of a commodity network for the specific purpose of "being a supercomputer" capable of performing tightly coupled parallel HPC computations. This in turn spurred the independent development of Grid computing as a named entity, although Grid-style clustering had been around at least as long as the Unix operating system and the Arpanet, whether or not it, or the clusters that used it, were named.

Cluster categorizations:

1. High-availability (HA) cluster.

High-availability clusters (also known as failover clusters) are implemented primarily for the purpose of improving the availability of services which the cluster provides. They operate by having redundant nodes, which are then used to provide service when system components fail. The most common size for an HA cluster is two nodes, which is the minimum requirement to provide redundancy. HA cluster implementations attempt to manage the redundancy inherent in a cluster to eliminate single points of failure. There are many commercial implementations of High-Availability clusters for many operating systems. The Linux-HA project is one commonly used free software HA package for the Linux OSs.

2. Load-balancing cluster.

Load-balancing clusters operate by having all workload come through one or more load-balancing front ends, which then distribute it to a collection of back end servers. Although they are primarily implemented for improved performance, they commonly include high-availability features as well. Such a cluster of computers is sometimes referred to as a server farm. There are many commercial load balancers available including Platform LSF HPC, Sun Grid Engine, Moab Cluster Suite and Maui Cluster Scheduler. The Linux Virtual Server project provides one commonly used free software package for the Linux OS.

3. High-performance computing (HPC) clusters.

High-performance computing (HPC) clusters are implemented primarily to provide increased performance by splitting a computational task across many different nodes in the cluster, and are most commonly used in scientific computing. Such clusters commonly run custom programs which have been designed to exploit the parallelism available on HPC clusters. HPCs are optimized for workloads which require jobs or processes happening on the separate cluster computer nodes to communicate actively during the computation. These include computations where intermediate results from one node's calculations will affect future calculations on other nodes.

One of the most popular HPC implementations is a cluster with nodes running Linux as the OS and free software to implement the parallelism. This configuration is often referred to as a Beowulf cluster.

Microsoft offers Windows Compute Cluster Server as a high-performance computing platform to compete with Linux.

Many software programs running on High-performance computing (HPC) clusters use libraries such as MPI which are specially designed for writing scientific applications for HPC computers.

Basics of Cluster Computing

Cluster computing refers to technologies that allow multiple computers, called cluster nodes, to work together with the aim of solving common computing problems. A generic cluster architecture is shown in the figure. Each node can be a single or multiprocessor computer, such as a PC, workstation or SMP server, equipped with its own memory, I/O devices and operating system. A cluster whose nodes are similar is called homogeneous; otherwise it is heterogeneous.

The nodes are usually interconnected by a local area network (LAN) based on one of the following technologies: Ethernet, Fast Ethernet, Gigabit Ethernet, Myrinet, Quadrics Network (QsNet), InfiniBand communication fabric, Scalable Coherent Interface (SCI), Virtual Interface Architecture (VIA) or Memory Channel.

The speed of a network technology is characterized by its bandwidth and latency. Bandwidth means how much information can be sent through a particular network connection, and latency is defined as the time it takes for a networking device to process a data frame. Note that a higher network speed is usually associated with a higher price of the related equipment. To further improve cluster performance, different network topologies can be implemented in each particular case. Moreover, channel bonding technology can be used in the case of Ethernet-type networking to double the network bandwidth.

To realize this technology, two network interface cards (NICs) should be installed in each node, and two network switches should be used, one for each channel, to form two separate virtual networks. The optimal choice of network type is dictated by the demands on speed and volume of data exchange between the parts of the application software running on different nodes.
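A minimal node-side sketch of such channel bonding on an older Linux system is shown below; the interface names, addresses and the round-robin mode are assumptions for the example, exact file names vary between distributions, and the two-switch wiring described above is not shown.

    # /etc/modprobe.conf (or a file under /etc/modprobe.d/): load the bonding driver
    alias bond0 bonding
    options bonding mode=balance-rr miimon=100   # round-robin over both links

    # bring the bonded interface up and enslave the two physical NICs
    ifconfig bond0 192.168.0.11 netmask 255.255.255.0 up
    ifenslave bond0 eth0 eth1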

Various operating systems, including Linux, Solaris and Windows, can be used to manage the nodes. However, in order for the clusters to be able to pool their computing resources, special cluster enabled applications must be written using clustering libraries or a system level middleware [13] should be used. The most popular clustering libraries are PVM (Parallel Virtual Machine) [14] and MPI (Message Passing Interface) [15]; both are very mature and work well. By using PVM or MPI, programmers can design applications that can span across an entire cluster's computing resources rather than being confined to the resources of a single machine. For many applications, PVM and MPI allow computing problems to be solved at a rate that scales almost linearly in relation to the number of processors in the cluster.


The cluster architecture is usually optimized either for High Performance Computing or for High Availability Computing. The choice of architecture is dictated by the type of application and the available budget. A combination of both approaches is utilized in some cases, resulting in a highly reliable system characterized by very high performance. The principal difference between the two approaches is that in the HPC case each node in the cluster executes a part of a common job, whereas in the second case several nodes perform, or are ready to perform, the same job and are thus able to substitute for each other in case of failure.

High availability (HA) clusters are used in mission critical applications to have constant availability of services to end-users through multiple instances of one or more applications on many computing nodes. Such systems found their application as Web servers, e-commerce engines or database servers. HA clusters use redundancy to ensure that a service remains running, so that even when a server fails or must go offline for service, the other servers pick up the load. The system optimized for maximum availability should not have any single point of failure, thus requiring a specific architecture (Figure).

Two types of HA clusters can be distinguished - the shared nothing architecture and the shared disk architecture. In the first case, each computing node uses dedicated storage, whereas the second type of HA cluster shares common storage resources interconnected by a Storage Area Network (SAN). The operation of an HA cluster normally requires special software that is able to recognize a problem when it occurs and transparently migrate the job to another node.

HPC clusters are built to improve processing throughput in order to handle multiple jobs of various sizes and types, or to increase performance. The most common HPC clusters are used to shorten turnaround times on compute-intensive problems by running the job on multiple nodes at the same time, or when the problem is just too big for a single system. This is often the case in scientific, design analysis and research computing, where the HPC cluster is built purely to obtain maximum performance during the solution of a single, very large problem. Such HPC clusters utilize parallelized software that breaks down the problem into smaller parts, which are dispatched across a network of interconnected systems that concurrently process each small part and then communicate with each other using message-passing libraries to coordinate and synchronize their results. The Beowulf-type cluster [17], which will be described in the next section, is an example of an HPC system. A Beowulf system is a cluster which is built primarily out of commodity hardware components, runs a free-software operating system like Linux or FreeBSD, and is interconnected by a private high-speed network. However, some Linux clusters, which are built for high availability instead of speed, are not Beowulfs.

While Beowulf clusters are extremely powerful, they are not for everyone.

The primary drawback of Beowulf clusters is that they require specially designed software in order to take advantage of cluster resources. This is generally not a problem for those in the scientific and research communities who are used to writing their own special purpose applications, since they can use PVM or MPI libraries to create cluster-aware applications. However, many potential users of cluster technologies would like to get some kind of performance benefit with standard applications. Since such applications have not been written with the use of PVM or MPI libraries, such users simply cannot take advantage of a cluster. This problem limited the use of cluster technologies to a small group of users for years. Recently, a new technology called openMosix [18] has appeared that allows standard applications to take advantage of clustering without being rewritten or even recompiled.

OpenMosix is a "patch" to the standard Linux kernel, which adds clustering abilities and allows any standard Linux process to take advantage of a cluster's resources. OpenMosix uses adaptive load balancing techniques and allows processes running on one node in the cluster to migrate transparently to another node where they can execute faster. Because OpenMosix is completely transparent to all running programs, the process that has been migrated does not even know that it is running on another remote node. This transparency means that no special programming is required to take advantage of OpenMosix load-balancing technology. In fact, a default OpenMosix installation will migrate processes to the best node automatically. This makes OpenMosix a clustering solution that can provide an immediate benefit for many applications.

A cluster of Linux computers running OpenMosix can be considered as a large virtual SMP system, with some exclusions. The CPUs on a "real" SMP system can exchange data very fast, but with OpenMosix the speed at which nodes can communicate with one another is determined by the speed of the network. Besides, OpenMosix does not currently offer support for allowing multiple cooperating threads to be separated from one another. Also, like an SMP system, OpenMosix cannot execute a single process on multiple physical CPUs at the same time. This means that OpenMosix will not be able to speed up a single process/program, except to migrate it to a node where it can execute most efficiently. At the same time, OpenMosix can migrate most standard Linux processes between nodes and thus allows for extremely scalable parallel execution at the process level. Besides, if an application forks many child processes, then OpenMosix will be able to migrate each one of these processes to an appropriate node in the cluster. Thus, OpenMosix provides a number of benefits over traditional multiprocessor systems.
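As a hedged sketch of the kind of workload this helps (the loop bounds and the dummy computation below are assumptions for the example), a program that forks independent child processes creates exactly the ordinary Linux processes that OpenMosix is described above as being able to migrate:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        int i;

        for (i = 0; i < 8; i++) {
            pid_t pid = fork();          /* create an ordinary child process */
            if (pid == 0) {
                /* child: a stand-in for some CPU-bound piece of work */
                double x = 0.0;
                long j;
                for (j = 0; j < 100000000L; j++)
                    x += j * 0.5;
                printf("child %d finished (%g)\n", i, x);
                exit(0);
            }
        }

        for (i = 0; i < 8; i++)
            wait(NULL);                  /* parent waits for all children */
        return 0;
    }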

The OpenMosix technology can work in both homogeneous and heterogeneous environments, thus allowing clusters consisting of tens or even hundreds of nodes to be built using inexpensive PC hardware as well as high-end multi-processor systems. The use of OpenMosix together with Intel's Hyper-Threading technology, available with the latest generation of Intel Xeon processors, allows an additional improvement in performance for threaded applications. Existing MPI/PVM programs can also benefit from OpenMosix technology.

Description over High Performance Computing (HPC):

High Performance Computing (HPC) uses clusters of inexpensive, high performance processing blocks to solve difficult computational problems. Historically, HPC Linux cluster technology was incubated in scientific computing circles, where complex problems could only be tackled by individuals possessing domain knowledge and the know-how to build, debug and manage their own clusters. As a result, such clusters were mainly used to tackle computationally challenging scientific work in government agencies and university research labs.


That technology has evolved such that today, the performance, scalability, flexibility, and reliability benefits of HPC clusters are being realized in nearly all businesses where rigorous analytics are being applied to simulation and product modeling. HPC clustering provides these businesses with a scalable fabric of servers that can be allocated on an as-needed basis to provide unprecedented computational throughput.

HPC provides enterprises and organizations with a productive, simple and hardware agnostic HPC system enabling administrators to install, monitor and manage the cluster as a single system, from a single node - the Master. Through the Master, thousands of systems can be managed as if they were a single, consistent, virtual system, dramatically simplifying deployment and management and significantly improving data center resource utilization and server performance.

High Performance Computing employs a unique architecture based on three principles that, combined, deliver unparalleled productivity and lower TCO.

• The operating environment deployed to the compute nodes is provisioned "Stateless", directly to memory.

• The compute node operating environment is lightweight, stripped of unnecessary software, overhead and vulnerabilities.

• A simple operating system extension virtualizes the cluster into a pool of processors operating as if it were a single virtual machine.

The result is a highly efficient, more reliable and scalable system, capable of processing more work in less time, while being vastly simpler to use and maintain. Its powerful unified process space means that end users can easily and intuitively deploy, manage and run complex applications from the Master, as if the servers were a single virtual machine. The compute servers are fully transparent and directly accessible if need be. But if you only care about the compute capacity presented at the single Master node, you need never look further than this one machine.


Cluster components:

1. Software components.

Single System Image (SSI):

A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource.

SSI makes the cluster appear like a single machine to the user, to applications, and to the network.

MOSIX:

• An SSI approach for Linux.
• With MOSIX, a Linux cluster appears as a single multiprocessor machine.
• Load balancing and job migration are supported.
• Comes as a patch for the 2.4 Linux kernel and a set of userland tools.
• No special APIs are used; applications are executed the same way as on real SMPs (symmetric multiprocessing machines).

Shared memory architectures:

• SMP: a system with n processors with access to the same physical memory.
• Distributed shared memory (DSM): the illusion of a system of n processors (each with its own physical memory) having access to a global (shared) memory.
• In both systems, different threads of execution communicate via shared memory.

DSM:


• Problem of maintaining consistency of the global memory.
• Each process may have a copy of a page/segment; how is a global view maintained?
• Locking mechanisms are required to prevent concurrent write access.


Beowulf Project:

• A project for building cheap Linux-based cluster systems.
• A set of Open Source tools for cluster computing.
• Message passing libraries are used for node communication.
• Implementations of common message passing systems (PVM and MPI) are part of a Beowulf system.
• As opposed to MOSIX, each node in the cluster appears as a computer with its own OS and hardware resources.

Message passing vs. Distributed shared memory:

• Most parallel/distributed applications today are based on message passing.
• DSM makes programming easier but may not scale well with the number of processors due to undesired communication patterns.


Message Passing Systems:

Message Passing:

• The processors of a parallel system communicate by exchanging messages.
• Each processor has a mailbox for receiving incoming messages (messages don’t get lost).
• Receiving messages can be blocking or non-blocking, synchronous or asynchronous.

APIs:

There are two mainly used APIs: the Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM).

Message Passing Interface:

The Message Passing Interface (MPI) is a language-independent communications protocol used to program parallel computers. Although MPI belongs in layers 5 and higher of the OSI Reference Model, implementations may cover most layers of the reference model, with sockets and TCP/IP being used in the transport layer. MPI is not sanctioned by any major standards body; nevertheless, it has become the de facto standard for communication among processes that model a parallel program running on a distributed memory system. Actual distributed memory supercomputers such as computer clusters often run these programs. The principal MPI-1 model has no shared memory concept, and MPI-2 has only a limited distributed shared memory concept.

The advantages of MPI over older message passing libraries are portability (because MPI has been implemented for almost every distributed memory architecture) and speed (because each implementation is in principle optimized for the hardware on which it runs). MPI is supported on shared-memory and NUMA (Non-Uniform Memory Access) architectures as well, where it often serves not only as an important portability architecture, but also helps achieve high performance in applications that are naturally owner-computes oriented. However, it has also been criticized for being too low level and difficult to use. Despite this complaint, it remains a crucial part of parallel programming, since no effective alternative has come forth to take its place.

MPI is a specification, not an implementation. MPI has Language Independent Specifications (LIS) for the function calls and language bindings. There are two versions of the standard that are currently popular: version 1.2, which emphasizes message passing and has a static runtime environment (fixed size of world), and MPI-2.1, which includes new features such as scalable file I/O, dynamic process management and collective communication between two groups of processes. The MPI interface is meant to provide essential virtual topology, synchronization and communication functionality between a set of processes (that have been mapped to nodes/servers/computer instances) in a language-independent way, with language-specific syntax (bindings).
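A minimal sketch of an MPI program in C is shown below (it is an illustration, not code from the report): rank 0 acts as a master and sends one integer to every other process, which receives and prints it.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                     /* set up the MPI runtime        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);       /* which process am I?           */
        MPI_Comm_size(MPI_COMM_WORLD, &size);       /* how many processes in total?  */

        if (rank == 0) {
            int value = 42;
            int dest;
            for (dest = 1; dest < size; dest++)     /* master sends to every worker  */
                MPI_Send(&value, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        } else {
            int value;
            MPI_Status status;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank %d of %d received %d\n", rank, size, value);
        }

        MPI_Finalize();                             /* shut the runtime down cleanly */
        return 0;
    }

With a typical implementation the program would be built with a wrapper such as mpicc and started across the nodes with mpirun; the same source runs unchanged on shared-memory and distributed-memory systems, which is the portability argument made above.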


MPI guarantees that there is progress of asynchronous messages independent of the subsequent calls to MPI made by user processes (threads). The relative value of overlapping communication and computation, asynchronous vs. synchronous transfers and low latency vs. low overhead communication remain important controversies in the MPI user and implementer communities. MPI also specifies thread-safe interfaces, which have cohesion and coupling strategies that help avoid the manipulation of unsafe hidden state within the interface.

There has been research over time into implementing MPI directly into the hardware of the system, for example by means of Processor-in-memory, where the MPI operations are actually built into the micro circuitry of the RAM chips in each node. By implication, this type of implementation would be independent of the language, OS or CPU on the system, but cannot be readily updated or unloaded. Another approach has been to add hardware acceleration to one or more parts of the operation. This may include hardware processing of the MPI queues or the use of RDMA to directly transfer data between memory and the network interface without needing CPU or kernel intervention.

Parallel Virtual Machine:

PVM (Parallel Virtual Machine) is a software package that permits a heterogeneous collection of UNIX and/or Windows computers hooked together by a network to be used as a single large parallel computer. Thus large computational problems can be solved more cost effectively by using the aggregate power and memory of many computers. The software is very portable. PVM enables users to exploit their existing computer hardware to solve much larger problems at minimal additional cost. Hundreds of sites around the world are using PVM to solve important scientific, industrial, and medical problems in addition to PVM's use as an educational tool to teach parallel programming. With tens of thousands of users, PVM has become the de facto standard for distributed computing world-wide. PVM is an integrated set of software tools and libraries that emulates a general-purpose, flexible, heterogeneous concurrent computing framework on interconnected computers of varied architecture. The overall objective of the PVM system is to enable such a collection of computers to be used cooperatively for concurrent or parallel computation.


The PVM system is composed of two parts. The first part is a daemon, called pvmd3 and sometimes abbreviated pvmd, that resides on all the computers making up the virtual machine. (An example of a daemon program is the mail program that runs in the background and handles all the incoming and outgoing electronic mail on a computer.) The second part of the system is a library of PVM interface routines. It contains a functionally complete repertoire of primitives that are needed for cooperation between the tasks of an application. This library contains user-callable routines for message passing, spawning processes, coordinating tasks, and modifying the virtual machine.
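As a hedged sketch of how these routines are used (the program below is illustrative only, and the "worker" executable name is an assumption), a master task might spawn workers and send each of them an integer like this:

    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int tids[4];                 /* task ids returned by pvm_spawn()   */
        int n = 42;                  /* the value handed to every worker   */
        int i, started;

        pvm_mytid();                 /* enroll this process in the virtual machine */

        /* Ask the pvmd daemons to start 4 copies of a (hypothetical)
           "worker" binary anywhere in the virtual machine. */
        started = pvm_spawn("worker", NULL, PvmTaskDefault, "", 4, tids);

        for (i = 0; i < started; i++) {
            pvm_initsend(PvmDataDefault);   /* fresh send buffer, portable encoding */
            pvm_pkint(&n, 1, 1);            /* pack one integer */
            pvm_send(tids[i], 1);           /* send it with message tag 1 */
        }

        printf("sent %d to %d workers\n", n, started);
        pvm_exit();                  /* leave the virtual machine */
        return 0;
    }

Each worker would correspondingly call pvm_recv() and pvm_upkint() to obtain the value; all of these calls belong to the library of interface routines described above.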

The PVM computing model is based on the notion that an application consists of several tasks. Each task is responsible for a part of the application's computational workload. Sometimes an application is parallelized along its functions; that is, each task performs a different function, for example input, problem setup, solution, output and display. This process is often called functional parallelism. A more common method of parallelizing an application is called data parallelism. In this method all the tasks are the same, but each one only knows and solves a small part of the data. This is also referred to as the SPMD (single-program multiple-data) model of computing. PVM supports either of these methods, or a mixture of them. Depending on their functions, tasks may execute in parallel and may need to synchronize or exchange data, although this is not always the case.
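The data-parallel (SPMD) style described above might look like the following sketch, written here with MPI for brevity although the same structure applies to PVM tasks; the array size and the per-element work are illustrative only:

/* SPMD sketch: the same program runs on every task, each owning one slice
 * of the data; tasks only synchronize at the end to combine results.
 */
#include <mpi.h>
#include <stdio.h>

#define TOTAL 1000000L

int main(int argc, char **argv)
{
    int rank, size;
    double partial = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each task works on its own contiguous block of indices */
    long chunk = TOTAL / size;
    long begin = rank * chunk;
    long end   = (rank == size - 1) ? TOTAL : begin + chunk;

    for (long i = begin; i < end; i++)
        partial += 1.0 / (double)(i + 1);    /* illustrative per-element work */

    /* combine the partial results on rank 0 */
    MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f\n", total);

    MPI_Finalize();
    return 0;
}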

An example diagram of the PVM computing model and an architectural view of the PVM system, highlighting the heterogeneity of the computing platforms supported by PVM, are shown in the accompanying figures.

The principles upon which PVM is based include the following:

 User-configured host pool
 Translucent access to hardware
 Process-based computation
 Explicit message-passing model
 Heterogeneity support
 Multiprocessor support

Differences between PVM and MPI:

 MPI is a standard; PVM is not.
 MPI supports collective operations; PVM does not.
 MPI supports more modes of sending messages.
 PVM dynamically spawns processes; in MPI 1.x, processes are created once at startup.
 MPI has more support for dedicated hardware.
 There are several implementations of the MPI standard.


Commercial software:

 Load Leveler - IBM Corp., USA
 LSF (Load Sharing Facility) - Platform Computing, Canada
 NQE (Network Queuing Environment) - Craysoft Corp., USA
 Open Frame - Centre for Development of Advanced Computing, India
 RWPC (Real World Computing Partnership), Japan
 UnixWare (SCO - Santa Cruz Operation), USA
 Solaris-MC (Sun Microsystems), USA
 Cluster Tools (a number of free HPC cluster tools from Sun)


Cluster Middleware:

Middleware is generally considered the layer of software sandwiched between the operating system and the applications. It provides various services required by an application to function correctly. Middleware has been around since the 1960s; more recently, it has re-emerged as a means of integrating software applications running in a heterogeneous environment. There is a large overlap between the infrastructure that provides a cluster with high-level Single System Image (SSI) services and that provided by the traditional view of middleware. In a cluster, middleware can be described as the software that resides above the kernel and the network and provides services to applications.

Heterogeneity can arise in at least two ways in a cluster environment. First, as clusters are typically built from commodity workstations, the hardware platform can become heterogeneous. As the cluster is incrementally expanded using newer generations of the same computer product line, or even using hardware components with a very different architecture, problems related to these differences are introduced. For example, a typical problem that must be resolved in heterogeneous hardware environments is the conversion of numeric values to the correct byte ordering. A second way that clusters become heterogeneous is the requirement to support very different applications. Examples include applications that integrate software from different sources, or that require access to data or software outside the cluster. In addition, a requirement to develop applications rapidly can exacerbate the problems inherent in heterogeneity. Middleware can help the application developer overcome these problems; it also provides services for the management and administration of a heterogeneous system.
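For example, the byte-ordering issue mentioned above is commonly handled by agreeing on a network byte order before values cross the wire; a minimal C sketch (illustrative, not a prescribed cluster API):

/* Byte-order portability sketch: a little-endian node and a big-endian node
 * can exchange integers safely if both convert to network byte order.
 */
#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>   /* htonl() / ntohl() */

int main(void)
{
    uint32_t host_value = 0x12345678;

    uint32_t on_the_wire = htonl(host_value);  /* host order -> network (big-endian) */
    uint32_t received    = ntohl(on_the_wire); /* network order -> receiver's host   */

    printf("host 0x%08x  wire 0x%08x  decoded 0x%08x\n",
           (unsigned)host_value, (unsigned)on_the_wire, (unsigned)received);
    return 0;
}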

2. Hardware Components:

The use of high-density rack-mounted servers is the most popular configuration for today's HPC cluster environment. Besides the compute nodes, each rack may be equipped with network switches, UPS, PDU, and so on. Fig. 1 illustrates a typical HPC cluster configuration. The left side of the rack shows the possible interconnections for the compute nodes. For the types of applications where communication bandwidth between nodes is critical, low-latency and high-bandwidth interconnects such as Gigabit Ethernet, Myrinet and InfiniBand are common choices for connecting the compute nodes.

Several connections on the right of the rack in Fig. 2 represent the connections for cluster monitoring and management. The serial port and the BMC provide console redirection, an additional route for monitoring and managing the compute nodes from the master without relying on network connectivity or interfering with network activity. DRAC can be used for remote management, along with a KVM switch that accesses the compute nodes over CAT-5 cabling and TCP/IP networking.


Nodes:

 Computing nodes
 Master nodes

The CPUs most frequently used in the nodes are from Intel and AMD: Xeon and Itanium processors from Intel, and Opteron processors from AMD.

In a cluster, each node can be a single-processor or multiprocessor computer, such as a PC, workstation or SMP server, equipped with its own memory, I/O devices and operating system.

The nodes are interconnected by a LAN using one of the following technologies: Ethernet, Fast Ethernet, Gigabit Ethernet, Myrinet, or the InfiniBand communication fabric.


Gigabit Ethernet:

Gigabit Ethernet is a transmission technology based on the Ethernet frame format and protocol used in local area networks; it provides a data rate of 1 billion bits per second (1 gigabit). Gigabit Ethernet is defined in the IEEE 802.3 standards.

Gigabit Ethernet is carried primarily on optical fiber. Existing Ethernet LANs with 10 and 100 Mbps cards can feed into a Gigabit Ethernet backbone. Beyond the 10 and 100 Mbps cards, a newer standard, 10 Gigabit Ethernet, is also becoming available.

InfiniBand:

InfiniBand is a switched-fabric communications link primarily used in high-performance computing. Its features include quality of service and failover, and it is designed to be scalable. The InfiniBand architecture specification defines a connection between processor nodes and high-performance I/O nodes such as storage devices. It is a superset of the Virtual Interface Architecture: an architecture and specification for data flow between processors and I/O devices that promises high data bandwidth and almost unlimited expandability in tomorrow's computer systems.

Like Fibre Channel, PCI Express, Serial ATA, and many other modern interconnects, InfiniBand is a point-to-point bidirectional serial link intended for connecting processors with high-speed peripherals such as disks. It supports several signaling rates and, as with PCI Express, links can be bonded together for additional bandwidth.

It is expected to gradually replace the existing Peripheral Component Interconnect (PCI) shared-bus approach used in most of today's PCs and servers. Offering up to 2.5 Gbit/s per link and supporting up to 64,000 addressable devices, it promises increased reliability, better sharing between clustered processors and built-in security.


Myrinet:

Myrinet, ANSI/VITA 26-1998, is a high-speed local area networking system designed by Myricom to be used as an inter-connect between multiple machines to form computer clusters. Myrinet has much less protocol overhead than standards such as Ethernet, and therefore provides better throughput, less interference, and less latency while using the host CPU. Although it can be used as a traditional networking system, Myrinet is often used directly by programs that "know" about it, thereby bypassing a call into the operating system.

Myrinet physically consists of two fiber optic cables, upstream and downstream, connected to the host computers with a single connector. Machines are connected via low-overhead routers and switches, as opposed to connecting one machine directly to another. Myrinet includes a number of fault-tolerance features, mostly backed by the switches. These include flow control, error control, and "heartbeat" monitoring on every link. The newest, "fourth-generation" Myrinet, called Myri-10G, supports a 10 Gbit/s data rate and is inter-operable with 10 Gigabit Ethernet on PHY, the physical layer (cables, connectors, distances, signaling). Myri-10G started shipping at the end of 2006.


Myrinet characteristics:

Flow control, error control, and heartbeat continuity monitoring on every link.

Low-latency, cut-through switches with monitoring for high-availability applications.

Storage:

1. DAS:

Direct-attached storage (DAS) refers to a digital storage system directly attached to a server or workstation, without a storage network in between. It is a retronym, mainly used to differentiate non-networked storage from SAN and NAS.

2. SAN:

In computing, a storage area network (SAN) is an architecture to attach remote computer storage devices such as disk arrays, tape libraries and optical jukeboxes to servers in such a way that, to the operating system, the devices appear as locally attached devices. Although cost and complexity is dropping, as of 2007, SANs are still uncommon outside larger enterprises.


By contrast to a SAN, network-attached storage (NAS) uses file-based protocols such as NFS or SMB/CIFS where it is clear that the storage is remote, and computers request a portion of an abstract file rather than a disk block.

3. NAS:

Network-attached storage (NAS) is file-level data storage connected to a computer network, providing data access to heterogeneous network clients. NAS hardware is similar to a traditional file server equipped with direct-attached storage; however, it differs considerably on the software side. The operating system and other software on a NAS unit provide only data storage, data access and the management of those functions. Using NAS devices for other purposes (such as scientific computation or running a database engine) is strongly discouraged. Many vendors also purposely make it hard to develop or install any third-party software on their NAS devices by using closed-source operating systems and protocol implementations. In other words, NAS devices are server appliances.


Cluster features:

In a cluster system, it is important to eliminate single points of failure in terms of hardware. Beyond this, data integrity and system health checking are very important. For a long-term investment, a cluster should be able to accept additional nodes in the future in order to minimize the TCO.

No Single Point of Failure:

A cluster can operate from two independent machines, providing complete redundancy with no single point of failure (SPOF). Shared storage can use RAID or even multiple disk arrays to achieve a high level of storage redundancy, and multiple heartbeat communication channels between cluster nodes are supported.

Data integrity and manageability:

The cluster software works independently of the type of Linux file system and volume used, including journaling file systems and software RAID drivers. This ensures that file systems are protected with all sorts of storage software, without the need for reconfiguration or data migration. The use of journaling file systems also enables a fast recovery time without the need to run through a lengthy fsck check.

Application monitoring:

It is important to monitor application availability, since an application can fail for various reasons. The cluster offers service monitoring agents (SMAs) which can execute custom status-check scripts that verify the availability of a specific application. Monitoring agents are available for common databases, middleware and services.
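The report does not detail the SMA scripting interface, but as an illustration a status check can be as simple as probing the service's TCP port and returning 0 (healthy) or 1 (failed); the port number and address below are hypothetical:

/* Hypothetical status-check helper: exit 0 if the service's TCP port accepts
 * a connection, 1 otherwise.  A monitoring agent could run this periodically.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port   = htons(5432);                    /* hypothetical service port */
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);  /* check the local instance  */

    int ok = connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0;
    close(fd);

    printf("service %s\n", ok ? "up" : "down");
    return ok ? 0 : 1;                                /* 0 means healthy */
}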

Direct Linux kernel communications:

This ensures the Linux kernel itself is monitored by the cluster software. Cluster communication is performed by kernel modules instead of user applications, an approach that is highly reliable and safe from crashes in user space.


Robust user interface:

The user interface is an interactive, menu-driven interface which can be used from a text console or a graphical Java console, with local or remote access. This ensures the cluster can be monitored and controlled by a system administrator anytime, anywhere and over any type of connection.

Commodity hardware:

The software runs on x86-based commodity hardware or even PowerPC-based hardware; no proprietary architecture is required. This protects the future investment in your e-business applications and cluster software.

About Computer Clusters and their performance:

Clusters Evaluation: This is a good paper but very old (1995). Nevertheless, it points out the various parameters on which cluster performance should be analyzed. It also talks about the commercial clustering software available and its evaluation.

Scalable Cluster Computing with MOSIX for LINUX: A great paper about great software. If you are thinking in terms of using a cluster of workstations this is a definitive value-add.

 Scalability Limitations of VIA-Based Technologies in Supporting MPI
 Implementation and Evaluation of MPI on an SMP Cluster
 PVM Guide
 MPICH-related articles, tutorials and software
 Gigabit Performance Evaluation
 Cluster Performance for various processors
 Comparison between MPI and PVM
 Performance Evaluation of LAM, MPICH, and MVICH on a Linux cluster connected by a Gigabit Ethernet network
 Another useful link for parallel computing resources


Clusters often are confused with traditional massively parallel systems and conventional distributed systems. A cluster is a type of parallel or distributed computer system that forms--to varying degrees--a single, unified resource composed of several interconnected computers. Each interconnected computer has one or more processors, I/O capabilities, an operating-system kernel, and memory. The key difference between clustering and traditional distributed computing is that the cluster has a strong sense of membership. Like team members, the nodes of a cluster typically are peers, with all nodes agreeing on who the current participants are. This sense of membership becomes the basis for providing availability, scalability, and manageability.

However, capabilities beyond membership show up in different cluster solutions--and have spawned different classes of clustering solutions. Before exploring different approaches, it's helpful to compare clustering with traditional distributed computing and scaling solutions such as symmetric multiprocessing (SMP) and non-uniform memory addressing (NUMA).

Although there are minor hardware differences, the principal distinction between clusters and distributed computing is in software. In hardware, a cluster is more likely to share storage between computers and use a high-bandwidth, low-latency, reliable interconnect. From a software perspective, the key difference between clusters and distributed computing is the strong sense of membership that exists in the cluster. Distributed computing, like client-server computing, relies on pairwise connections, while clusters are composed of peer nodes. Another key software difference is that clusters tend to be single administrative domains, which allows cluster implementations to avoid the security overhead and complexity required by distributed computing. Because a cluster's goal is to act as a single, reliable, scalable server, its machines are not usually distributed physically.

Another important distinction between clusters and distributed computing is the way remote resources are accessed. In a cluster, it is likely that these resources will be accessed transparently. In contrast, distributed computing often has heavyweight, cumbersome, and complicated interprocess communication (IPC) mechanisms between computers. Clusters--in particular, full clusters--use single-node IPC paradigms such as pipes and message queues along with traditional TCP/IP sockets.


Advantages:

Earlier in this chapter we discussed the reasons for putting together a high-performance cluster: to provide a computational platform for all types of parallel and distributed applications. The class of applications that a cluster can typically cope with would be considered grand-challenge or supercomputing applications. GCAs (Grand Challenge Applications) are fundamental problems in science and engineering with broad economic and scientific impact; they are generally considered intractable without the use of state-of-the-art parallel computers. GCAs are distinguished by the scale of their resource requirements, such as processing time, memory and communication needs. A typical example of a grand-challenge problem is the simulation of phenomena that cannot be measured through experiments. GCAs include massive crystallographic and microtomographic structural problems, protein dynamics and biocatalysis, relativistic quantum chemistry of actinides, virtual materials design and processing, global climate modeling, and discrete event simulation.

Low cost and high performance are only a few of the advantages of a High Performance Computing Cluster. Other key benefits that distinguish it from large SMPs are summarized below:

Feature                        Large SMPs                    HPCC
Scalability                    Fixed                         Unbounded
Availability                   Moderate                      High
Ease of technology refresh     Difficult                     Manageable
Service and support            Expensive                     Affordable
System manageability           Custom; better usability      Standard; moderate usability
Application availability       High                          Moderate
Reusability of components      Low                           High
Disaster recovery ability      Weak                          Strong
Installation                   Non-standard                  Standard


Cluster computing research projects:

 Beowulf (CalTech and NASA) - USA
 CCS (Computing Centre Software) - Paderborn, Germany
 DQS (Distributed Queuing System) - Florida State University, USA
 HPVM (High Performance Virtual Machine) - UIUC, now UCSB, USA
 MOSIX - Hebrew University of Jerusalem, Israel
 MPI (MPI Forum; MPICH is one of the popular implementations)
 NOW (Network of Workstations) - Berkeley, USA
 NIMROD - Monash University, Australia
 NetSolve - University of Tennessee, USA
 PVM - Oak Ridge National Lab. / UTK / Emory, USA

PARAM Padma:

PARAM Padma is C-DAC's next generation high performance scalable computing cluster, currently with a peak computing power of One Teraflop. The hardware environment is powered by the Compute Nodes based on the state-of-the-art Power4 RISC processors’ technology. These nodes are connected through a primary high performance System Area Network, PARAMNet-II, designed and developed by C-DAC and a Gigabit Ethernet as a backup network.

ONGC Clusters:

ONGC has implemented two Linux cluster machines. One is a 272-node dual-core computing system, with each node equivalent to two CPUs; its master node consists of 12 dual-CPU nodes with 32 terabytes of SAN storage. The second system has 48 compute nodes (96 CPUs); its master node consists of 4 nodes with 20 terabytes of SAN storage.


GRID COMPUTING

Grid computing is a term in distributed computing that can have several meanings:

A local computer cluster which is like a "grid" because it is composed of multiple nodes.

Offering online computation or storage as a metered commercial service, known as utility computing, "computing on demand", or "cloud computing".

The creation of a "virtual supercomputer" by using spare computing resources within an organization.

The creation of a "virtual supercomputer" by using a network of geographically dispersed computers. Volunteer computing, which generally focuses on scientific, mathematical, and academic problems, is the most common application of this technology.

These varying definitions cover the spectrum of "distributed computing", and sometimes the two terms are used as synonyms. The focus here is on distributed computing technologies that are not confined to traditional dedicated clusters.

Functionally, one can also speak of several types of grids:


Computational grids (including CPU-scavenging grids), which focus primarily on computationally intensive operations.

Data grids, for the controlled sharing and management of large amounts of distributed data.

Equipment grids, which have a primary piece of equipment, e.g. a telescope, and where the surrounding grid is used to control the equipment remotely and to analyze the data produced.

History:

The term "grid computing" originated in the early 1990s as a metaphor for making computer power as easy to access as an electric power grid, in Ian Foster and Carl Kesselman's seminal work, "The Grid: Blueprint for a New Computing Infrastructure".

CPU scavenging and volunteer computing were popularized beginning in 1997 by distributed.net and later in 1999 by SETI@home to harness the power of networked PCs worldwide, in order to solve CPU-intensive research problems.

The ideas of the grid (including those from distributed computing, object-oriented programming, cluster computing, web services and others) were brought together by Ian Foster, Carl Kesselman and Steve Tuecke, widely regarded as the "fathers of the grid". They led the effort to create the Globus Toolkit, incorporating not just computation management but also storage management, security provisioning, data movement, monitoring, and a toolkit for developing additional services based on the same infrastructure, including agreement negotiation, notification mechanisms, trigger services and information aggregation. While the Globus Toolkit remains the de facto standard for building grid solutions, a number of other tools have been built that answer some subset of the services needed to create an enterprise or global grid.


Grids versus conventional supercomputers:

"Distributed" or "grid computing" in general is a special type of parallel computing which relies on complete computers (with onboard CPU, storage, power supply, network interface, etc.) connected to the Internet by a conventional network interface, such as Ethernet. This is in contrast to the traditional notion of a supercomputer, which has many CPUs connected by a local high-speed computer bus.

The primary advantage of distributed computing is that each node can be purchased as commodity hardware, which when combined can produce similar computing resources to a many-CPU supercomputer, but at lower cost. This is due to the economies of scale of producing commodity hardware, compared to the lower efficiency of designing and constructing a small number of custom supercomputers. The primary performance disadvantage is that the various CPUs and local storage areas do not have high-speed connections. This arrangement is thus well-suited to applications where multiple parallel computations can take place independently, without the need to communicate intermediate results between CPUs.

The high-end scalability of geographically dispersed grids is generally favorable, due to the low need for connectivity between nodes relative to the capacity of the public Internet. Conventional supercomputers also create physical challenges in supplying sufficient electricity and cooling capacity in a single location. Both supercomputers and grids can be used to run multiple parallel computations at the same time, which might be different simulations for the same project, or computations for completely different applications. The infrastructure and programming considerations needed to do this on each type of platform are different, however.

There are also differences in programming and deployment. It can be costly and difficult to write programs so that they can be run in the environment of a supercomputer, which may have a custom operating system, or require the program to address concurrency issues. If a problem can be adequately parallelized, a "thin" layer of "grid" infrastructure can cause conventional, standalone programs to run on multiple machines (but each given a different part of the same problem). This makes it possible to write and debug programs on a single conventional machine, and eliminates complications due to multiple instances of the same program running in the same shared memory and storage space at the same time.


Design considerations and variations:

One feature of distributed grids is that they can be formed from computing resources belonging to multiple individuals or organizations (known as multiple administrative domains). This can facilitate commercial transactions, as in utility computing, or make it easier to assemble volunteer computing networks.

One disadvantage of this feature is that the computers which are actually performing the calculations might not be entirely trustworthy. The designers of the system must thus introduce measures to prevent malfunctions or malicious participants from producing false, misleading, or erroneous results, and from using the system as an attack vector. This often involves assigning work randomly to different nodes (presumably with different owners) and checking that at least two different nodes report the same answer for a given work unit. Discrepancies would identify malfunctioning and malicious nodes.

Due to the lack of central control over the hardware, there is no way to guarantee that nodes will not drop out of the network at random times. Some nodes (like laptops or dialup Internet customers) may also be available for computation but not network communications for unpredictable periods. These variations can be accommodated by assigning large work units (thus reducing the need for continuous network connectivity) and reassigning work units when a given node fails to report its results as expected.

The impacts of trust and availability on performance and development difficulty can influence the choice of whether to deploy onto a dedicated computer cluster, to idle machines internal to the developing organization, or to an open external network of volunteers or contractors.

In many cases, the participating nodes must trust the central system not to abuse the access that is being granted, by interfering with the operation of other programs, mangling stored information, transmitting private data, or creating new security holes. Other systems employ measures to reduce the amount of trust "client" nodes must place in the central system such as placing applications in virtual machines.

Public systems or those crossing administrative domains (including different departments in the same organization) often result in the need to run on heterogeneous systems, using different operating systems and hardware architectures. With many languages, there is a tradeoff between investment in software development and the number of platforms that can be supported (and thus the size of the resulting network). Cross-platform languages can reduce the need to make this tradeoff, though potentially at the expense of high performance on any given node (due to run-time interpretation or lack of optimization for the particular platform).

Various middleware projects have created generic infrastructure to allow various scientific and commercial projects to harness a particular associated grid, or for the purpose of setting up new grids. BOINC is a common one for academic projects seeking public volunteers.

Cycle stealing:

Typically, there are three types of owners, who use their workstations mostly for:

1. Sending and receiving email and preparing documents.
2. Software development - edit, compile, debug and test cycle.
3. Running compute-intensive applications.

Cluster computing aims to steal spare cycles from (1) and (2) to provide resources for (3).

However, this requires overcoming the ownership hurdle - people are very protective of their workstations.

This usually requires an organisational mandate that computers are to be used in this way.

Stealing cycles outside standard work hours (e.g. overnight) is easy, stealing idle cycles during work hours without impacting interactive use (both CPU and memory) is much harder.

Usually a workstation will be owned by an individual, group, department, or organisation - they are dedicated to the exclusive use by the owners.

This brings problems when attempting to form a cluster of workstations for running distributed applications.


International Grid Projects:

 GARUDA (India)
 D-Grid (Germany)
 Malaysia national grid computing project
 Singapore national grid computing project
 Thailand national grid computing project
 CERN Data Grid (Europe)

PUBLIC FORUMS:

 o Computing Portals
 o Grid Forum
 o European Grid Forum
 o IEEE TFCC
 o GRID'2000

GARUDA:

GARUDA is a collaboration of science researchers and experimenters on a nation wide grid of computational nodes, mass storage and scientific instruments that aims to provide the technological advances required to enable data and compute intensive science for the 21st century. One of GARUDA’s most important challenges is to strike the right balance between research and the daunting task of deploying that innovation into some of the most complex scientific and engineering endeavors being undertaken today.

The Department of Information Technology (DIT), Government of India has funded the Centre for Development of Advanced Computing (C-DAC) to deploy the nation-wide computational grid ‘GARUDA’ which will connect 17 cities across the country in its Proof of Concept (PoC) phase with an aim to bring “Grid” networked computing to research labs and industry. GARUDA will accelerate India’s drive to turn its substantial research investment into tangible economic benefits.


Some Problems:

1. Embarrassingly Parallel:

In the jargon of parallel computing, an embarrassingly parallel workload (or embarrassingly parallel problem) is one for which no particular effort is needed to segment the problem into a very large number of parallel tasks, and there is no essential dependency (or communication) between those parallel tasks.

In other words, each step can be computed independently from every other step, thus each step could be made to run on a separate processor to achieve quicker results.

A very common example of an embarrassingly parallel problem lies within graphics processing units (GPUs), for tasks such as 3D projection, since each pixel on the screen can be rendered independently of every other pixel.

Embarrassingly parallel problems are ideally suited to distributed computing over the Internet (e.g. SETI@home), and are also easy to perform on server farms which do not have any of the special infrastructure used in a true supercomputer cluster.

Embarrassingly parallel problems lie at one end of the spectrum of parallelization, the degree to which a computational problem can be readily divided amongst processors.
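As a small illustration, an embarrassingly parallel loop can be expressed with OpenMP in C as sketched below; the array size and the per-element computation are placeholders:

/* Embarrassingly parallel sketch: every element is computed independently,
 * so the loop can be split across cores with no communication between
 * iterations.
 * Compile with, e.g.: gcc -fopenmp embarrassing.c -o embarrassing
 */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double result[N];

    #pragma omp parallel for          /* each iteration runs independently */
    for (int i = 0; i < N; i++)
        result[i] = (double)i * i;    /* stand-in for per-pixel/per-item work */

    printf("last value: %f (computed with up to %d threads)\n",
           result[N - 1], omp_get_max_threads());
    return 0;
}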

2. Software Lockout:

In multiprocessor computer systems, software lockout is the performance degradation caused by the idle wait time CPUs spend in kernel-level critical sections. Software lockout is the major cause of scalability degradation in a multiprocessor system, posing a limit on the maximum useful number of processors. To mitigate the phenomenon, the kernel must be designed so that its critical sections are as short as possible, for example by decomposing each data structure into smaller substructures.
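As a rough user-space analogy of shortening such critical sections (using pthreads rather than actual kernel code, and a hypothetical hash-table-like structure), decomposing one global lock into per-bucket locks looks like this:

/* Illustration of reducing lock contention by decomposing a structure:
 * instead of one global mutex around the whole table, each bucket gets
 * its own mutex, so unrelated updates no longer serialize each other.
 */
#include <pthread.h>

#define NBUCKETS 64

struct bucket {
    pthread_mutex_t lock;   /* fine-grained: protects this bucket only */
    long counter;           /* stand-in for per-bucket data            */
};

static struct bucket table[NBUCKETS];

void table_init(void)
{
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&table[i].lock, NULL);
        table[i].counter = 0;
    }
}

void table_add(int key, long delta)
{
    struct bucket *b = &table[key % NBUCKETS];

    pthread_mutex_lock(&b->lock);    /* short critical section, one bucket */
    b->counter += delta;
    pthread_mutex_unlock(&b->lock);
    /* with a single global lock, every caller would wait here even when
       touching different buckets (the "software lockout" effect) */
}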


BIBLIOGRAPHY

1. High Performance Computing, by Alex Veidenbaum.
2. High Performance Computing and Networking, by Wolfgang Gentzsch, Uwe Harms.
3. High Performance Computing and the Art of Parallel Programming, by Stan Openshaw, Ian Turton.
4. www.wikipedia.org
5. www.howstuffworks.com