Load balancing strategies for a parallel ray-tracing system based on constant subdivision



Hiroaki Kobayashi 1, Satoshi Nishimura 2, Hideyuki Kubota 3, Tadao Nakamura 1, and Yoshiharu Shigei 4

1 Department of Mechanical Engineering, Faculty of Engineering, Tohoku University, Sendai 980, Japan
2 Department of Information Science, Faculty of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-Ku, Tokyo 113, Japan
3 Yamato Research Laboratory, IBM Japan, Ltd., 1623-14 Shimotsuruma, Yamato 242, Japan
4 Department of Information Engineering, Faculty of Engineering, Toyo University, 2100 Kujirainakanodai, Kawagoe 350, Japan

Static and dynamic load balancing strategies for a multiprocessor system for a ray tracing algorithm based on constant subdivision are presented. An object space is divided into regular cubes (subspaces), whose boundary planes are perpendicular to the coordinate axes, and these are allocated to the processors in the system. Here, load balancing among the processors is the most important problem. Firstly, in a category of static load balancing, strategies for mapping the subspaces into the processors are evaluated by simulation. Moreover, we propose a hierarchical multiprocessor system in order to realize dynamic load balancing with the static one. Its architecture can overcome the limitation of the static load balancing in a large scale multiprocessor system.

Key words: Multiprocessor system - Load balancing - Performance evaluation - Ray tracing - Parallel algorithms

The Visual Computer (1988) 4:197-209 © Springer-Verlag 1988

A ray-tracing algorithm can be an efficient way to synthesize very realistic images ([...] 1980). However, in a cost/performance comparison with other visible surface algorithms such as scan-line algorithms and z-buffer algorithms, a ray-tracing algorithm is seriously handicapped. Fast image synthesis based on ray tracing is one of the most important topics in computer graphics, and much work has been done in this area.

A ray-tracing algorithm is very time consuming in that numerous ray-object intersections have to be computed in order to determine which part of an object covers a given pixel. In order to reduce the number of objects that must be checked, some efficient approaches have been proposed. For example, Rubin and Whitted (1980) proposed placing simple bounding volumes around each object in their database. If a given ray does not intersect the bounding volume of a particular object, then this object needs no further consideration. In addition, by grouping objects hierarchically and placing a bounding volume encompassing the extent of all children at each node of the hierarchical tree, the majority of the ray-bounding volume intersection calculations can also be avoided.

Glassner (1984) and Fujimoto et al. (1986) presented different approaches. Glassner adaptively subdivided an object space into subspaces whose spatial interrelations are represented by an octree encoding structure, and Fujimoto et al. subdivided an object space into regular cubes, i.e., a three-dimensional grid structure. When a ray passes through a space, these data structures reject any object away from the ray, i.e., objects are checked roughly in order of their occurrence along the ray.

Recent progress in VLSI technology makes it possible to achieve large scale parallel processing (Kunii 1984). For fast image synthesis, parallel processing is applicable to the calculation of the pixel values on a screen, since the intensity of each pixel can be calculated independently. Hereafter, we call such a parallel processing approach "pixel-oriented parallel processing". Nishimura et al. (1983) built a multiprocessor system for parallel ray tracing based on this approach. Their system consists of 64 processing elements (called unit computers), each of which generates subimages of a screen in parallel. However, all processing elements require the whole object description in a space in order to calculate pixel values, since hidden surface removal in a ray-tracing algorithm is achieved by considering the whole object in a space. Therefore, their system configuration is a tightly coupled multiprocessor system. If the object description is too complex to be duplicated in the local memory of individual processors, then substantial communication between the processors and the common memory storing the object description causes a system bottleneck. Moreover, it is impractical for the processors to hold duplicate object descriptions of the whole object space in their local memories when a large scale multiprocessor system is constructed.

Caspary and Scherson (1987) extended the algorithm based on a hierarchical data structure of bounding volumes to a parallel algorithm, and proposed a self-balanced multiprocessor system of MIMD architecture. A hierarchical data structure is divided and distributed among the processors, which execute the intersection calculations and shading by transmitting ray messages within the multiprocessor system. They suggested that nearly ideal load balancing in the multiprocessor system can be achieved by using data-driven and demand-driven control.

As a different hardware approach, a parallel processing system was studied which implements a space subdivision method to reduce the ray-object intersection calculations and to calculate pixel values in parallel (Dippe and Swensen 1984; Nemoto and Omachi 1986; Kobayashi et al. 1987). An object space is divided into subspaces, which are allocated to the processors. Each processor calculates local intensities of objects in its subspace. Global intensities of pixels on a screen are calculated by gathering the related local intensities. Propagation of rays to be traced is achieved by interprocessor communication. A ray travels from one subspace (processor) to a neighbouring subspace (processor) according to its direction. We call this approach "object-oriented parallel processing". In this approach, the main goal is to realize ideal load balancing in the system.

In this paper, we study load balancing in a multiprocessor system for parallel ray tracing based on object-oriented parallel processing. Firstly, in a category of static load balancing, we discuss strategies for allocating a regularly subdivided object space to the processors so as to improve system performance. Moreover, in order to overcome the limitation of static load balancing and realize a high-performance large scale multiprocessor system, we propose a hierarchical architecture employing a dynamic load balancing mechanism in addition to static load balancing.

The outline of the paper is as follows. Section 2 discusses parallel processing based on space subdivision and static load balancing methods. We present strategies for mapping subspaces into the processors connected by the nearest-neighbour connection network of one, two and three dimensions. Section 3 presents a multiprocessor system based on the parallel processing scheme discussed in Sect. 2, and evaluates the performance of the system by simulation. We analyze processing time, effective utilization of the system, memory requirement of the system for object description and the relationship between ray transmission time and total processing time. Section 4 proposes a hierarchical multiprocessor system to overcome the limitation of static load balancing. We show that the hierarchical architecture can achieve high performance in a large scale multiprocessor system by using a dynamic load balancing mechanism with a static load balancing mechanism. Finally, Sect. 5 presents conclusions.

2 Strategies for mapping a regularly subdivided object space into a parallel processing system

In image synthesis using ray tracing, calculations of the local intensity, reflection and refraction at any point of an object are carried out on each object in an object space. Thus, the computational effort for synthesizing an image arises at each object, and can therefore be localized. As a result, we proposed an object-oriented parallel processing approach to image synthesis by ray tracing (Kobayashi et al. 1987), as opposed to a conventional pixel-oriented parallel processing approach.

In an object-oriented parallel processing system, the objects in a space are allocated to the processors, and rays travel in the multiprocessor system by interprocessor communication. The processor space, as a geometrical configuration of the multiprocessor system, corresponds to the object space. Objects are checked by the processors in order of their occurrence along the ray. The processor determines the intersecting object for a given ray and calculates the local intensity of the intersecting object.

The main question of this approach is how to allocate the objects to the processors. When allocating the objects to the processors, it is necessary to clarify the positional relationship of the objects so that a ray may propagate through an object space autonomously. However, it is difficult to specify the positional relationship of the objects when the objects are randomly distributed in an object space. Moreover, when the object space is dense, the allocation will be almost impossible.

To achieve quasi object-oriented parallel processing, we simply divide an object space into regular subspaces of appropriate sizes (Fujimoto et al. 1986), and allocate these subspaces to the respective processors. As the interconnection network of the processors, we use the nearest-neighbour connection network (NNCN) of one, two and three dimensions (1D-NNCN, 2D-NNCN and 3D-NNCN, respectively) for one, two and three dimensional parallel processing (Fig. 1). In the figure, subspaces represented by solid lines are unit subspaces to be allocated to the processors. When allocating a regularly subdivided object space to these systems, we must consider the following problems.

1) The optimum number of subspaces of an object space depends on both the number of objects and the position of the objects to be rendered. However, the number of processors in the system is fixed. In general, the number of subspaces is larger than the number of processors.

2) In image synthesis, the computational load which is necessary in each subspace is not uniform, since it depends on whether the subspace includes objects or not and on the position of the subspace in the object space.

Fig. 1. Regularly subdivided object space and processor space (processor network) of parallel processing system. Panels: one, two and three dimensional parallel processing, each showing the object space and the corresponding processor space

The first drawback can be overcome by allocating a set of subspaces to a single processor. The simplest method to do so is to allocate a block of neighbouring subspaces to a single processor. Figure 2 illustrates this allocation. An object space is divided into regular subspaces and each of them is allocated to the processor whose number is shown in the figure. In this example, a 4 × 4 mesh-connected system for two dimensional parallel processing is assumed. In general, the processor number n of subspace (x, y, z) for D-dimensional parallel processing is obtained as follows.

1D parallel processing: $n = \lfloor xN/c \rfloor$
2D parallel processing: $n = \lfloor xN/c \rfloor + \lfloor yN/c \rfloor N$
3D parallel processing: $n = \lfloor xN/c \rfloor + \lfloor yN/c \rfloor N + \lfloor zN/c \rfloor N^2$

where
n is the processor number.
c is the number of subspaces along each coordinate axis, thus the total number of subspaces in an object space is $c^3$.
(x, y, z) is the subspace index, where x, y and z = 0, 1, ..., c - 1.
N is the number of processors along each dimension of parallel processing, thus the total number of processors in the system is $N^D$, where D is the dimension of parallel processing.
$\lfloor\ \rfloor$ is the floor function, which returns the largest integer not greater than its input.

 0  0  1  1  2  2  3  3
 0  0  1  1  2  2  3  3
 4  4  5  5  6  6  7  7
 4  4  5  5  6  6  7  7
 8  8  9  9 10 10 11 11
 8  8  9  9 10 10 11 11
12 12 13 13 14 14 15 15
12 12 13 13 14 14 15 15

Fig. 2. Block allocation for two-dimensional parallel processing

We shall call these allocations the block allocations.

On the other hand, in order to overcome the second drawback, Dippe and Swensen (1984) and Nemoto and Omachi (1986) suggested that neighbouring processors should exchange their computational load at execution time according to the load balance between them. This method is effective to a certain degree. However, the extra hardware required for implementing it is extremely expensive. Further, it tends to cause a large overhead, due both to controlling the load balance and to the communications needed to move the object description between neighbouring processors. Hence, most of the regular processes are suspended during the load exchange.

We present a simple and effective method for the same objective. Generally, the computational load in an object space tends to concentrate in a local space. So, it is necessary to distribute the load concentrated in several neighbouring subspaces uniformly to the processors. To this end, we allocate subspaces at certain intervals to one processor. Therefore, neighbouring subspaces in a local space are allocated to different processors. Figure 3 shows this allocation for a 4 × 4 mesh-connected system. The numbers in the figure refer to the processor numbers to be allocated. By this allocation, the processor number n of subspace (x, y, z) for D-dimensional parallel processing is obtained as follows:

1D parallel processing: $n = x \bmod N$
2D parallel processing: $n = x \bmod N + (y \bmod N) N$
3D parallel processing: $n = x \bmod N + (y \bmod N) N + (z \bmod N) N^2$

where mod is the modulo function, and x, y, z and N are as before. We call these allocations the distributed allocations. Since each object occupies several neighbouring subspaces due to spatial coherence, the above allocations are able to distribute a heavy load existing in a local space approximately uniformly to the processors. In these allocations, in order to communicate directly with the processors which deal with the neighbouring subspaces, it is necessary that the processor at one terminal along each dimension of parallel processing is connected to the processor at the other terminal, like a ring or torus. In Sect. 3, we will discuss the effectiveness of static load balancing by the distributed allocation by means of simulation.
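The two mapping rules can be sketched in a few lines of Python. This is only an illustration of the formulas for the block and the distributed allocations; the function names and the convention of passing the subspace index as a tuple of D coordinates are assumptions made for this example, not part of the original system.

```python
def block_allocation(index, c, N):
    """Block allocation: a block of neighbouring subspaces shares a processor.

    index: subspace index as a tuple of D coordinates (D = 1, 2 or 3)
    c:     number of subspaces along each coordinate axis
    N:     number of processors along each dimension
    """
    # floor(coord * N / c) along each axis, weighted by N**d
    return sum((coord * N // c) * N**d for d, coord in enumerate(index))


def distributed_allocation(index, N):
    """Distributed allocation: subspaces N apart along an axis share a processor."""
    return sum((coord % N) * N**d for d, coord in enumerate(index))


if __name__ == "__main__":
    c, N = 8, 4  # 8 x 8 subspaces on a 4 x 4 processor array
    for name, rule in (("block", lambda s: block_allocation(s, c, N)),
                       ("distributed", lambda s: distributed_allocation(s, N))):
        print(name)
        for y in range(c):
            print(" ".join(f"{rule((x, y)):2d}" for x in range(c)))

    # A 2 x 2 cluster of neighbouring subspaces (a local load concentration)
    cluster = [(x, y) for x in (0, 1) for y in (0, 1)]
    print(len({block_allocation(s, c, N) for s in cluster}))    # 1 processor
    print(len({distributed_allocation(s, N) for s in cluster}))  # 4 processors
```

With c = 8 and N = 4 the printed grids follow the numbering pattern of the 4 × 4 examples, and the last two lines show why the distributed rule spreads a local load concentration over more processors than the block rule.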

3 A multiprocessor system based on object-oriented parallel processing and effectiveness of static load balancing by the distributed allocation

 0  1  2  3  0  1  2  3
 4  5  6  7  4  5  6  7
 8  9 10 11  8  9 10 11
12 13 14 15 12 13 14 15
 0  1  2  3  0  1  2  3
 4  5  6  7  4  5  6  7
 8  9 10 11  8  9 10 11
12 13 14 15 12 13 14 15

Fig. 3. Distributed allocation for two-dimensional parallel processing

3.1 System architecture

In this section, we will present the concrete system architecture for the object-oriented parallel processing scheme discussed in the previous section, and the effectiveness and limitation of static load balancing by the distributed allocation. Figure 4 depicts the system organization for two dimensional parallel processing, i.e., the torus network connecting the processing elements (called Intersection Processors or IPs in the figure). For one dimensional and three dimensional parallel processing, the IPs are interconnected by the 1D-NNCN and 3D-NNCN, respectively, in which the processor at one terminal along each dimension of parallel processing is cyclically connected to the processor at the other terminal.

Fig. 4. System organization (host computer, system bus, IPs, frame buffer controller, frame buffer and display). IP: Intersection Processor (Processing Element)

The host computer controls the system, defines the object description and the related parameters for the scene to be rendered, generates primary rays and allocates these rays to the appropriate IPs. In this system, the term "ray" means a ray information packet. Communications between the host computer and the IPs are achieved over the system bus.

Each IP is allocated certain subspaces of an object space, and determines whether or not a ray visiting a subspace intersects an object within the subspace. If so, the IP calculates the local intensity on the intersecting object and, when the intersecting point is not in a shadow, sends the intensity to the frame buffer controller via the frame buffer bus. If the ray does not intersect an object, or if rays are newly generated by the reflection/refraction process, the IP transfers the ray or the secondary rays to the next IP holding the appropriate subspaces, according to the directions of these rays.

For shadowing, we have to test whether or not an object exists on the line between the intersecting point on the object and the light source. To this end, the IP generates a ray in the direction of the light source. If an object causing a shadow is found, the local intensity related to this light source becomes ineffective.

For moving a ray in an object space and determining the next subspace to be checked, we implement a three dimensional digital differential analyzer (3DDDA) in each IP. A 3DDDA is the extension of a DDA (digital differential analyzer), which is used for generating a line on a two dimensional raster grid (Fujimoto et al. 1986). Since a 3DDDA finds the next subspace by means of incremental calculations, 3DDDA processing is very fast. The propagation of rays between processors is achieved through interprocessor communication.

The result obtained in each IP for a given ray is either the local intensity on an intersecting object or the existence of a shadow. If the intersecting point on the object is not in a shadow, the frame buffer controller accumulates the local intensity in the corresponding memory location in the frame buffer in order to calculate the global intensity of the pixel on a screen.

We will assume that each IP consists of the NEC microprocessor V30 (Intel 8086 compatible) and the floating-point co-processor 8087, and is interconnected via an 8-bit parallel interface such as the GPIB parallel interface according to the network topology of the system. From this assumption, we determine the following features:

- Time for a ray-object intersection calculation per object: 1750 (µs) if intersecting object, 1300 (µs) otherwise

- Time for determining the next subspace to which a ray will be transferred: 340 (µs)

- Time for a local intensity calculation of an intersecting object: 4600 (µs)

- Time for calculating the direction of reflection: 680 (µs)

- Time for calculating the direction of refraction: 1100 (µs)

- Time for decoding a ray information packet: 580 (µs)

- Time for generating a ray information packet: 1600 (µs)

- Transmission time of a ray between IPs via communication line: 50 (µs)

These features will be used in the performance evaluation by simulation in the following sections.
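As an illustration of how a ray can be stepped incrementally from subspace to subspace, the following Python sketch enumerates the subspaces pierced by a ray in a regular grid. It is a generic floating-point incremental grid traversal in the spirit of the 3DDDA, not the exact formulation of Fujimoto et al. (1986). The function name, the unit cell size and the assumption that the ray origin lies inside the grid are choices made for this example.

```python
import math


def traverse_grid(origin, direction, c):
    """Yield the (x, y, z) indices of the subspaces visited by a ray, in order.

    Assumptions (for illustration only): the object space is the cube
    [0, c)^3 made of unit subspaces, and `origin` lies inside it.
    """
    cell = [int(math.floor(o)) for o in origin]
    if all(d == 0 for d in direction):
        yield tuple(cell)  # degenerate ray: it never leaves its subspace
        return

    step, t_next, t_delta = [], [], []
    for o, d, i in zip(origin, direction, cell):
        if d > 0:
            step.append(1)
            t_next.append((i + 1 - o) / d)  # ray parameter at the next boundary
            t_delta.append(1 / d)           # parameter increment per subspace
        elif d < 0:
            step.append(-1)
            t_next.append((i - o) / d)
            t_delta.append(-1 / d)
        else:
            step.append(0)
            t_next.append(math.inf)         # this axis is never crossed
            t_delta.append(math.inf)

    while all(0 <= i < c for i in cell):
        yield tuple(cell)
        axis = t_next.index(min(t_next))    # nearest boundary decides the step
        cell[axis] += step[axis]
        t_next[axis] += t_delta[axis]


if __name__ == "__main__":
    for subspace in traverse_grid((0.5, 0.25, 0.5), (1.0, 1.0, 0.0), 3):
        print(subspace)
```

Each step needs only one comparison and one addition per visited subspace, which is why this kind of traversal is cheap compared with an intersection test. In a system like the one described here, the processor owning the next subspace would be found by applying the allocation rule to the yielded index.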

201

3.2 Performance evaluation of the system employing the static allocation strategies

In this section, we will evaluate the system discussed in the previous section by simulation. We will examine the block and the distributed allocations by four performance estimates: processing time, effective utilization (parallel processing efficiency) of the system, memory requirement for object description and the relation between the ray transmission time and total processing time. The simulator has been written in the programming language C and run on the VAX-11/750 under the UNIX operating system. The simulator is a discrete event-driven simulator and is capable of simulating the system behaviour at a clock level.

Table 1. Systems evaluated by simulation

| System name | Dimension | Allocation | Network | Number of processors in each dimension | Total number of processors |
|---|---|---|---|---|---|
| 1B1, 1B2, 1B4, 1B6, 1B8, 1B12 | 1 | Block | 1D-NNCN (Linear) | 1, 2, 4, 6, 8, 12 | 1, 2, 4, 6, 8, 12 |
| 1D2, 1D4, 1D6, 1D8, 1D12 | 1 | Distributed | 1D-NNCN* (Ring) | 2, 4, 6, 8, 12 | 2, 4, 6, 8, 12 |
| 2B2, 2B4, 2B6, 2B8, 2B12 | 2 | Block | 2D-NNCN (Mesh) | 2, 4, 6, 8, 12 | 4, 16, 36, 64, 144 |
| 2D2, 2D4, 2D6, 2D8, 2D12 | 2 | Distributed | 2D-NNCN* (Torus) | 2, 4, 6, 8, 12 | 4, 16, 36, 64, 144 |
| 3B2, 3B4, 3B6, 3B8, 3B12 | 3 | Block | 3D-NNCN | 2, 4, 6, 8, 12 | 8, 64, 216, 512, 1728 |
| 3D2, 3D4, 3D6, 3D8, 3D12 | 3 | Distributed | 3D-NNCN* | 2, 4, 6, 8, 12 | 8, 64, 216, 512, 1728 |

* In each dimension, the processor at one terminal is connected to the processor at the other terminal (wraparound connection)

3.2.1 Simulation model

We apply the block and the distributed allocations to one, two and three dimensional parallel processing systems, and evaluate each system by simulation. The systems considered for evaluation are presented in Table 1. Here, "system name" denotes the classification of the system configuration, and "aBc" in this field means that the dimension of parallel processing is "a" (where a is 1, 2 or 3), the allocation method is "B" (where "B" means the "Block" allocation, while "D" means the "Distributed" allocation), and the number of processors in the system is $c^a$. The notation "aBc" will be used in the following experimental results and discussion.

Figures 5 and 6 show the test images used for performance evaluation. The image of Fig. 5 (Image 1) is composed of 369 spheres, whose surfaces cause diffuse reflections and whose positions are determined by a certain recurrence formula. The image of Fig. 6 (Image 2) contains one transparent sphere, one specular reflective sphere in the front, and 121 diffuse reflective spheres in a plane in the background. The object space has been divided into 24 × 24 × 24 subspaces and 144 × 144 rays are traced.

Fig. 5. Test image 1

Fig. 6. Test image 2
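The system naming scheme can be decoded mechanically. The following Python sketch is only an illustration: the function name `parse_system_name` and the returned tuple layout are hypothetical, while the rule that a system named "aBc" or "aDc" has c processors per dimension and c to the power a processors in total is the one stated for Table 1.

```python
import re


def parse_system_name(name):
    """Decode a system name such as "2D4" into its configuration.

    Returns (dimension, allocation, processors per dimension, total processors).
    """
    match = re.fullmatch(r"([123])([BD])(\d+)", name)
    if match is None:
        raise ValueError(f"not a valid system name: {name!r}")
    dimension = int(match.group(1))
    allocation = {"B": "Block", "D": "Distributed"}[match.group(2)]
    per_dimension = int(match.group(3))
    return dimension, allocation, per_dimension, per_dimension ** dimension


if __name__ == "__main__":
    for name in ("1B12", "2D4", "3D12"):
        print(name, parse_system_name(name))
```

For example, "2D4" decodes to a two dimensional system with the distributed allocation and 16 processors, and "3D12" to a three dimensional one with 1728 processors.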

3.2.2 Experimental results and discussion

A. Processing time and effective utilization of the system

Figures 7 and 8 show the processing time of Image 1 and Image 2, respectively, as a function of the number of processors. In both figures, the processing time decreases as the number of processors increases. Also, processing with the distributed allocation is two to three times faster than processing with the block allocation. Therefore, the distributed allocation is excellent regarding static load balancing. However, the rate of decrease in the processing time is small when the number of processors is large.

Figures 9 and 10 show the effective utilization of the systems as a function of the number of processors. Here, the effective utilization of a multiprocessor system is measured as follows:

$$\text{Effective utilization} = \frac{(\text{Uniprocessor processing time}) \times 100}{(\text{Multiprocessor processing time}) \times (\text{Number of processors in the multiprocessor system})}$$
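As a small worked example of this measure, the following Python sketch evaluates the effective utilization for one configuration. The function name and the timing values are hypothetical and chosen only to make the arithmetic easy to follow; they are not simulation results.

```python
def effective_utilization(uniprocessor_time, multiprocessor_time, n_processors):
    """Effective utilization in percent, as defined for the multiprocessor system."""
    return 100.0 * uniprocessor_time / (multiprocessor_time * n_processors)


if __name__ == "__main__":
    # Hypothetical timings: 1600 s on one processor, 125 s on 16 processors.
    print(effective_utilization(1600.0, 125.0, 16))  # 80.0
```

A value of 100 would mean that the processing time is reduced exactly in proportion to the number of processors. In this made-up case the 16 processors give a speedup of 12.8, i.e. 80 percent utilization.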

From Figs. 9 and 10, it is found that the systems with the distributed allocation have a higher effective utilization than the systems with the block allocation. Notice also that the efficiency decreases as the number of processors increases. This is because of the relationship between the number of subspaces in the object space and the number of processors in the processor space of the system. When the number of subspaces is small compared to the number of processors, the benefit of load balancing by the distributed allocation cannot be obtained, since only a little local information of the object space is assigned to each processor and the load is not uniformly allocated to the processors. Thus, static load balancing by the distributed allocation is limited when the degree of space subdivision is not sufficient compared to the number of processors.

2-D parallel processing seems to be the most efficient compared to 1-D and 3-D parallel processing with respect to static load balancing. The reasons are as follows. In general, most of the rays cannot reach the subspaces at the far end of an object space as seen from the view point, since these subspaces may be hidden by other objects. Thus, in 3-D parallel processing, the processors to which these subspaces are allocated tend to be idle. On the other hand, the parallelism derived from primary rays can always be expected. Therefore, 2-D parallel processing is the most efficient regarding static load balancing.

In these simulations, in which an object space is divided into 24 × 24 × 24 subspaces, it was found that the largest system configuration which gives a high effective utilization and nearly "ideal" load balancing is a 4 × 4 multiprocessor system for two dimensional parallel processing. In order to obtain "ideal" load balancing in a large scale multiprocessor system, we will propose a hierarchical architecture using dynamic load balancing in addition to static load balancing in Sect. 4.

Fig. 7. Number of processors vs. Total processing time in the case of image 1 (curves for systems 1Bn, 2Bn, 3Bn, 1Dn, 2Dn and 3Dn)

Fig. 8. Number of processors vs. Total processing time in the case of image 2 (curves for systems 1Bn, 2Bn, 3Bn, 1Dn, 2Dn and 3Dn)

Fig. 9. Number of processors vs. Effective utilization of the system in the case of image 1

Fig. 10. Number of processors vs. Effective utilization of the system in the case of image 2

B. Memory requirement

Theoretically, the memory requirement for object description in each processor is 1/n times the total memory requirement in a system consisting of n processors, and the total memory requirement is independent of the number of processors. However, since one object may have to be duplicated in several processors (subspaces), the memory requirement in each processor may be larger than 1/n times the total memory requirement. Moreover, the distributed allocation seems to be inferior to the block allocation in this respect.

To verify this, we examine the total memory requirement of the system for object description in the case of the distributed allocation and in the case of the block allocation. We define the total memory requirement in the system as follows:

Total memory requirement = (maximum memory requirement among the processors) × (number of processors in the system)
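The effect of object duplication on this measure can be illustrated with a toy calculation. In the Python sketch below, every name and the scene itself are hypothetical: an object is described by its size and by the set of subspaces it overlaps, and it is stored once in every processor that owns at least one of those subspaces. The sketch does not reproduce the simulation results; it only applies the definition of the total memory requirement.

```python
def total_memory_requirement(objects, owner, n_processors):
    """(maximum memory requirement among the processors) x (number of processors).

    objects:      list of (size, subspaces) pairs, where `subspaces` is the
                  collection of subspace indices the object overlaps
    owner:        function mapping a subspace index to a processor number
    n_processors: number of processors in the system
    """
    per_processor = [0] * n_processors
    for size, subspaces in objects:
        # one copy of the object per distinct processor that needs it
        for processor in {owner(s) for s in subspaces}:
            per_processor[processor] += size
    return max(per_processor) * n_processors


if __name__ == "__main__":
    c, N = 8, 4  # 8 x 8 subspaces, 4 x 4 processors (two dimensional case)

    def block(s):
        return (s[0] * N // c) + (s[1] * N // c) * N

    def distributed(s):
        return (s[0] % N) + (s[1] % N) * N

    # Hypothetical scene: one big object over 4 x 4 subspaces, one small object.
    big = (1, [(x, y) for x in range(4) for y in range(4)])
    small = (1, [(5, 5)])
    scene = [big, small]

    print(total_memory_requirement(scene, block, N * N))        # 16
    print(total_memory_requirement(scene, distributed, N * N))  # 32
```

In this toy scene the big object is copied to 4 processors under the block rule but to all 16 under the distributed rule, so a large object raises the maximum per-processor requirement more readily when subspaces are interleaved.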

Figures 11 and 12 show the total memory requirement of the system as a function of the number of processors. In these figures, the dot-dash line represents the total memory requirement of a conventional pixel-oriented parallel processing system, in which the whole object description of the object space is duplicated in each processor in order to completely avoid memory conflicts in the common memory. The memory requirement of our system increases gradually as the number of processors increases, whereas that of the conventional pixel-oriented parallel processing system increases rapidly.

Comparing the distributed allocation with the block allocation, there is no difference in the case of Image 1. In Image 1, objects are placed randomly in the object space (see Fig. 5), so the amount of object description in each subspace is nearly the same. Thus, the object description of Image 1 is uniformly distributed over the processors, and the memory requirement of each processor is well balanced even though the distributed allocation duplicates objects across processors. For Image 2, the memory requirement of the distributed allocation is about twice as large as that of the block allocation, since there are big objects at the front of the object space (see Fig. 6). Note, however, that the total memory requirements of both the block and the distributed allocations in the object-oriented parallel processing system are far less than the total memory requirement of the conventional pixel-oriented parallel processing system.

Fig. 11. Number of processors vs. Total memory requirement in the case of image 1

Fig. 12. Number of processors vs. Total memory requirement in the case of image 2
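The effect of large objects on the two allocations can be illustrated with a toy example. The sketch below is an illustration under stated assumptions, not the simulator used in the paper: the object space is reduced to a line of eight subspaces, every object description is assumed to cost one unit, block allocation is assumed to give each processor a contiguous run of subspaces, and distributed allocation is assumed to deal subspaces out cyclically. All function names are hypothetical.

```python
def owner_block(subspace, num_subspaces, num_procs):
    """Assumed block allocation: each processor owns a contiguous run."""
    return subspace * num_procs // num_subspaces


def owner_distributed(subspace, num_subspaces, num_procs):
    """Assumed distributed allocation: subspaces are dealt out cyclically."""
    return subspace % num_procs


def objects_per_processor(objects, num_subspaces, num_procs, owner):
    """Return, for each processor, the set of object ids it must store."""
    stored = [set() for _ in range(num_procs)]
    for obj_id, (first, last) in enumerate(objects):
        for s in range(first, last + 1):
            stored[owner(s, num_subspaces, num_procs)].add(obj_id)
    return stored


def total_memory(stored):
    """(maximum per-processor requirement) x (number of processors)."""
    return max(len(s) for s in stored) * len(stored)


# Toy scene: two big objects span subspaces 0..3; four small ones fill 4..7.
objects = [(0, 3), (0, 3), (4, 4), (5, 5), (6, 6), (7, 7)]
S, P = 8, 4

block = total_memory(objects_per_processor(objects, S, P, owner_block))
dist = total_memory(objects_per_processor(objects, S, P, owner_distributed))
pixel_oriented = len(objects) * P  # every processor holds the whole scene

print(block, dist, pixel_oriented)  # 8 12 24
```

In this toy scene the cyclic assignment copies both big objects into every processor, so the distributed total exceeds the block total, while both stay below the fully duplicated scene.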

C. Ray communication time between the processors

Figure 13 shows the total processing time as a function of the ray communication time between processors. It clearly shows that the total processing time is independent of the ray communication time when the latter is less than 10 ms/ray. This is because the processes related to intensity calculations in each processor and ray transmission between the processors are carried out simultaneously by means of pipelining. From these results, it is found that even a serial line of 19.2 Kbps (bits per second) may be applicable as a communication link between the processors.

Fig. 13. Communication time vs. Total processing time (communication time in µsec/ray)
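The two numbers quoted above, a 10 ms/ray limit and a 19.2 Kbps line, together imply a budget for the size of one ray message. The short Python calculation below is a back-of-the-envelope sketch; the paper does not state the ray packet size, and framing overhead such as start and stop bits is ignored, so the result is only an upper bound under those assumptions.

```python
LINK_RATE_BPS = 19_200     # serial line rate quoted in the text
TOLERATED_DELAY_MS = 10    # per-ray communication time below which total time is flat


def ray_budget_bits(rate_bps, delay_ms):
    """Bits that fit on the link within the tolerated per-ray delay."""
    return rate_bps * delay_ms // 1000


def transmit_time_ms(packet_bytes, rate_bps):
    """Time to send one packet of the given size, ignoring framing overhead."""
    return packet_bytes * 8 * 1000 / rate_bps


bits = ray_budget_bits(LINK_RATE_BPS, TOLERATED_DELAY_MS)
print(bits, bits // 8)                      # 192 bits, i.e. 24 bytes
print(transmit_time_ms(24, LINK_RATE_BPS))  # 10.0 ms for a 24-byte packet
```

Under these assumptions a ray description of up to about 24 bytes stays within the 10 ms limit on such a line; a larger packet would need a faster link or a more compact encoding.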

4 Hierarchical multiprocessor system with static and dynamic load balancing mechanisms

4.1 System architecture

In the previous discussion, we have seen that the effective utilization decreases as the number of processors increases, and that it is difficult to maintain a high effective utilization in a large scale multiprocessor system. In order to overcome the limitation of static load balancing by the distributed allocation and to improve the performance of a large scale parallel processing system, we propose a hierarchical multiprocessor system with static and dynamic load balancing mechanisms. Figure 14 illustrates the system architecture. There are two levels: a cluster level and a processing element level. At the cluster level, the subspaces are allocated to the clusters by using the distributed allocation. Thus, at this level, the clusters realize an object space to be traversed by rays. At the processing element level, the processes allocated to a cluster are carried out simultaneously by its processing elements. Therefore, static load balancing is achieved at the cluster level by the distributed allocation, while dynamic load balancing is achieved at the processing element level within a cluster at execution time. In particular, if each cluster has a single processing element, the hierarchical system is equivalent to the non-hierarchical system presented in Sect. 3.

The system consists of a host computer, k clusters and an intercluster connection network. The host computer controls the system. The intercluster connection network interconnects the clusters, and is used to transfer rays to the appropriate clusters according to the directions of the rays. An N-dimensional nearest-neighbour interconnection network is used for this network.

Each cluster consists of a cluster controller, m processing elements and an intracluster connection network. The cluster controller receives rays visiting the cluster and assigns each of them to the processing element which has the lightest load in the cluster.

Fig. 14. System architecture of a hierarchical multiprocessor system (a host computer and an intercluster network connect Cluster 1 to Cluster k; each cluster contains a cluster controller, an intracluster network and processing elements)

The processing elements carry out ray-object intersection calculations and/or local intensity calculations of intersecting objects for the assigned rays. Communications between the cluster controller and the processing elements are achieved over the intracluster connection network.
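The dynamic assignment performed by the cluster controller can be sketched as follows. This is a minimal Python illustration, not the authors' design: the class name, the method names, and the use of the number of outstanding rays as the measure of load are all assumptions, since the text does not say how load is measured.

```python
class ClusterController:
    """Hypothetical controller that sends each ray to the least-loaded element."""

    def __init__(self, num_processing_elements):
        if num_processing_elements < 1:
            raise ValueError("a cluster needs at least one processing element")
        # Assumed load measure: rays assigned but not yet completed.
        self.pending = [0] * num_processing_elements

    def assign(self, ray_cost=1):
        """Pick the processing element with the lightest load (lowest index on ties)."""
        pe = min(range(len(self.pending)), key=self.pending.__getitem__)
        self.pending[pe] += ray_cost
        return pe

    def complete(self, pe, ray_cost=1):
        """Record that a processing element has finished a ray."""
        if self.pending[pe] < ray_cost:
            raise ValueError("processing element has no such outstanding work")
        self.pending[pe] -= ray_cost


controller = ClusterController(4)
print([controller.assign() for _ in range(5)])  # [0, 1, 2, 3, 0]
controller.complete(2)
print(controller.assign())                      # 2
```

With m = 1 the controller always returns element 0, which corresponds to the remark that a cluster with a single processing element reduces to the non-hierarchical system.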

4.2 Performance evaluation of a hierarchical system with static and dynamic load balancing mechanisms

In order to examine the effect of static and dynamic load balancing in the hierarchical multiprocessor system, we have evaluated the performance of the system by simulation. Since we have already obtained the effect of static load balancing by the distributed allocation in Sect. 3, we keep the number of clusters constant and vary the number of processors in a cluster. We choose a 4 x 4 cluster system of two-dimensional configuration, i.e., the number of clusters is 16. This is because a 4 x 4 multiprocessor system for two-dimensional parallel processing has shown the limit of nearly "ideal" load balancing by the distributed allocation, with an effective utilization of 86% to 88%. Thus, the intercluster connection network is a two-dimensional mesh connection. The same test images, Image 1 and Image 2, are used for the performance evaluation. The features of each processing element and of the networks are the same as those described in Sect. 3.1.

Figure 15 shows the processing time as a function of the number of processors in a cluster. Here, the numbers in parentheses are the total numbers of processors in the system. Notice that the processing time decreases linearly as the number of processors increases, i.e., the speedup is almost linear. Figure 16 shows the effective utilization of the system as a function of the number of processors in a cluster. Notice that when each cluster is composed of 16 processing elements, the hierarchical system consists of 256 processing elements. This configuration achieves a higher effective utilization (84% to 86%) than the non-hierarchical system consisting of 256 processing elements, whose effective utilization was 20% to 40%. In this case, the processing time of the hierarchical system is shorter than that of the non-hierarchical system by a factor of two to four, and the hierarchical system achieves almost linear speedup. Therefore, it is found that the system performance is remarkably improved. Hence, nearly "ideal" load balancing seems to be maintained in a large scale multiprocessor system by using these static and dynamic load balancing mechanisms.

Fig. 15. Processing time vs. Number of processors in a cluster in a hierarchical multiprocessor system (numbers in parentheses: total number of processors in the system)

Fig. 16. Effective utilization of the system vs. Number of processors in a cluster in a hierarchical multiprocessor system (numbers in parentheses: total number of processors in the system)
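The quoted utilization figures and the quoted factor of two to four can be cross-checked with a short calculation. The sketch below assumes that effective utilization is useful work divided by (number of processors x processing time), and that both 256-processor systems perform the same useful work; neither assumption is restated in this section, so this is a plausibility check rather than the authors' derivation. Pairing the extreme values of the two ranges is also an assumption.

```python
def processing_time_ratio(util_hierarchical, util_flat):
    """T_flat / T_hierarchical for equal work and an equal number of processors.

    With U = W / (P * T), equal W and P give T_flat / T_hier = U_hier / U_flat.
    """
    return util_hierarchical / util_flat


# Utilization ranges quoted in the text, in percent.
low = processing_time_ratio(84, 40)   # most favourable non-hierarchical case
high = processing_time_ratio(86, 20)  # least favourable non-hierarchical case

print(f"{low:.1f}x to {high:.1f}x")   # 2.1x to 4.3x
```

Under these assumptions the utilization figures imply a reduction in processing time by a factor of roughly 2.1 to 4.3, which agrees with the factor of two to four stated for the 256-processor configuration.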

5 Conclusion

In this paper, we have studied static and dynamic load balancing mechanisms in a multiprocessor system for a ray tracing algorithm based on constant subdivision. Firstly, in the category of static load balancing, the distributed allocation for mapping a regularly subdivided object space onto the processors has been examined by simulation. We have shown that the processing time under the distributed allocation is 30% to 50% shorter than that under the block allocation. Next, we have proposed a hierarchical multiprocessor system that combines dynamic load balancing with static load balancing, and have shown nearly "ideal" load balancing in a large scale multiprocessor system. Almost linear speedup can be obtained in the hierarchical multiprocessor system.

Acknowledgements. The authors are grateful to Professor Tosiyasu L. Kunii of the University of Tokyo for motivating them to do this research and for helpful conversations.

The authors also sincerely acknowledge and appreciate the helpful suggestions and comments of Professor Ed F. Deprettere of Delft University of Technology and Professor Isaac D. Scherson of Princeton University on an earlier version of this paper.

References

Caspary E, Scherson ID (1988) Multiprocessing for ray tracing: a hierarchical self-balancing approach. The Visual Computer 4(4):188-196

Dippe M, Swensen J (1984) An adaptive subdivision algorithm and parallel architecture for realistic image synthesis. Comput Graph 18:3:149-158

Feng T (1981) A survey of interconnection networks. IEEE Comput 12:14:18-27

Foley JD, Dam AV (1982) Fundamentals of interactive computer graphics. Addison-Wesley, Reading

Fujimoto A, Tanaka T, Iwata K (1986) ARTS: accelerated ray-tracing system. IEEE Comput Graph Appl 6:4:16-26

Glassner AS (1984) Space subdivision for fast ray tracing. IEEE Comput Graph Appl 4:10:15-21

Kobayashi H, Nakamura T, Shigei Y (1987) Parallel processing of an object space for image synthesis using ray tracing. The Visual Computer 3:13-22

Kunii TL (ed) (1984) VLSI engineering. Lect Notes Comput Sci 163

Nemoto K, Omachi T (1986) An adaptive subdivision by sliding boundary surfaces for fast ray tracing. Proc Graphic Interface '86:43-48

Nishimura H, Ohno H, Kawata T, Shirakawa I, Omura K (1983) LINKS-1: a parallel pipelined multimicrocomputer system for image creation. Proc 10th Ann Int Symp Comput Archi:387-394

Rubin SM, Whitted JT (1980) A three-dimensional representation for fast rendering of complex scenes. Comput Graph 14:3:110-116

Timothy LK, Kajiya JT (1986) Ray tracing complex scenes. Comput Graph 20:4:269-278

Whitted JT (1980) An improved illumination model for shaded display. Commun ACM 23:343-349

Wittie LD (1981) Communication structure for large networks of microcomputers. IEEE Trans Comput C-30:264-273

Woodwark JR (1984) A multiprocessor architecture for viewing solid models. Display (April) pp 92-103

HIROAKI KOBAYASHI is currently a research associate in the Department of Mechanical Engineering at Tohoku University, Sendai, Japan. His research interests include computer architecture, parallel processing systems and applications, and computer graphics. He received the B.E. degree in Communication Engineering, and the M.E. and D.E. degrees in Information Engineering from Tohoku University in 1983, 1985 and 1988, respectively. He is a member of the IEEE Computer Society, the ACM, the Institute of Electronics, Information and Communication Engineers of Japan and the Information Processing Society of Japan.

SATOSHI NISHIMURA is currently a master course graduate student of Information Science at the University of Tokyo. His research interests include computer architecture and computer graphics. He received the B.E. degree in Communication Engineering from Tohoku University in 1987.

HIDEYUKI KUBOTA is currently a researcher in the Yamato Research Laboratory at IBM Japan, Ltd. His research interests include computer architecture and computer graphics. He received the B.E. degree in Communication Engineering and the M.E. degree in Information Engineering from Tohoku University in 1986 and 1988, respectively. He is a member of the IEICE of Japan.


TADAO NAKAMURA was born in Ube, Japan, on January 25, 1944. He received the Dr. of Eng. degree from Tohoku University in 1972. His major was Computer Aided Design in semiconductor electronics. Since 1972 he has been a faculty member of the Faculty of Engineering of Tohoku University. He is currently a Professor of Computer Science in the Department of Mechanical Engineering, Tohoku University. He has been studying computer architecture frequently at the Computer Systems Laboratory, Stanford University since 1983. His present research interests include computer architecture, supercomputer architecture, computer graphics, and distributed processing systems. He is an Editorial Board Member of The Visual Computer. He is also a Senior Member of the IEEE and a member of the IEICE of Japan, the IEEE COMSOC Communications Software Committee and the IEEE COMSOC Computer Communications Committee.

YOSHIHARU SHIGEI is a professor in the Department of Information Engineering, Toyo University. He received the B.E. degree and the Dr. of Eng. degree from Tohoku University in 1948 and 1962, respectively. From 1948 to 1963 he was with the Electrical Research Center, Ministry of Telegram and Telephone of Japan. From 1963 to 1977, he joined the Electrical Communication Laboratory of the Nippon Telegraph and Telephone Corporation. Here he was engaged in research on transmission as the head of the Transmission Network Laboratory of the Transmission Research Division. He was a Professor in the Institute of Electrical Communications, Tohoku University from 1977 to 1978, and from 1977 to 1988 he was a Professor in the Department of Information Engineering, Tohoku University. His present research interests include electrical communication engineering, computer architecture and distributed processing systems. He is a member of the IEICE of Japan and the IPS of Japan. He was also Vice President of the IEICE of Japan from 1983 to 1985.
