
J Sign Process Syst (2008) 53:51–71  DOI 10.1007/s11265-008-0244-0

Storage Estimation and Design Space Exploration Methodologies for the Memory Management of Signal Processing Applications

F. Balasa · P. G. Kjeldsberg · A. Vandecappelle · M. Palkovic · Q. Hu · H. Zhu · F. Catthoor

Received: 10 January 2007 / Revised: 20 June 2007 / Accepted: 7 May 2008 / Published online: 14 June 2008
© 2008 Springer Science + Business Media, LLC. Manufactured in The United States

Abstract  The storage requirements in data-dominated signal processing systems, whose behavior is described by array-based, loop-organized algorithmic specifications, have an important impact on the overall energy consumption, data access latency, and chip area. This paper gives a tutorial overview of the existing techniques for the evaluation of the data memory size, which is an important step during the early stage of system-level exploration. The paper focuses on the most advanced developments in the field, presenting in more detail (1) an estimation approach for non-procedural specifications, where the reordering of the loop execution within loop nests can yield significant memory savings, and (2) an exact computation approach for procedural specifications, with relevant memory management applications – like measuring the impact of loop transformations on the data storage, or analyzing the performance of different signal-to-memory mapping models. Moreover, the paper discusses typical memory management trade-offs – like, for instance, between storage requirement and number of memory accesses – taken into account during the exploration of the design space by loop transformations in the system specification.

This research was sponsored in part by the U.S. National Science Foundation (DAP 0133318).

F. Balasa
Southern Utah University, Cedar City, UT, USA
e-mail: [email protected]

P. G. Kjeldsberg · Q. Hu
Norwegian University of Science and Technology, Trondheim, Norway
P. G. Kjeldsberg, e-mail: [email protected]
Q. Hu, e-mail: [email protected]

A. Vandecappelle
Essensium/Mind, Leuven, Belgium
e-mail: [email protected]

M. Palkovic
IMEC vzw, Leuven, Belgium
e-mail: [email protected]

H. Zhu
ARM, Sunnyvale, CA, USA
e-mail: [email protected]

F. Catthoor
IMEC vzw and Katholieke Universiteit Leuven, Leuven, Belgium
e-mail: [email protected]

Keywords  Memory management · Loop transformations · Memory size evaluation · Digital signal processing

1 Introduction and Motivation

In many signal processing systems, particularly in the multimedia and telecom domains, data transfer and storage have a significant impact on both the system performance and the major cost parameters – power consumption and chip area. During the system development process, the designer must often focus first on the exploration of the memory subsystem in order to achieve a cost-optimized end product [1–3].

The behavior of these targeted VLSI systems, synthesized to execute mainly data-dominated applications, is usually described in a high-level programming language. The code is typically organized in sequences of loop nests having as boundaries (typically, affine) functions of the loop iterators. The code may contain conditional instructions, whose arguments can be data-dependent and/or data-independent (relational and/or logic expressions of affine functions of loop iterators). In our target domain, the data structures are multi-dimensional arrays whose indexes in the code are affine functions of loop iterators. Specifications with these characteristics are often called affine specifications [1].

Global loop transformations are important system-level design techniques, used to enhance the locality of data and the regularity of data accesses in affine specifications. Reducing the lifetime of the array elements increases the possibility of memory sharing, since data with non-overlapping lifetimes can be mapped to the same physical location. This leads to an overall reduction of the data storage requirements and, hence, of the chip area. Also, due to the large amounts of data in our targeted applications, both on-chip and off-chip memory modules are usually needed. Improving data locality by global loop reorganization enhances the data reuse opportunities [4–6]. If these data are mapped to the intermediate layer memories in the hierarchy, the off-chip memory accesses are potentially significantly reduced, which is critical for system performance and energy consumption [2]. But this is heavily enabled by the in-place mapping of the data copies in the intermediate memory, which is usually quite size-constrained [7]. Hence, size reduction of the data storage is also crucial for this purpose. Several lower-level techniques are available for such a size optimization, but they are too slow to use up front, during the actual system-level exploration stage. For that purpose, we need efficient system-level evaluation or estimation techniques.

This paper provides a tutorial overview of the existing techniques for the system-level evaluation of the data memory size in signal processing applications, focusing on the most advanced developments in this research domain. The paper presents in more detail (but, still, at a high level) two more recent, complementary approaches for the evaluation of the memory size of non-procedural and procedural algorithmic specifications. The first technique – developed at the Norwegian University of Science and Technology, Trondheim, Norway, with the co-operation of the Interuniversity Microelectronics Center (IMEC), Leuven, Belgium – performs an estimation of the data memory size, where the reordering of the loop execution within loop nests is optimized such that the storage requirements are reduced. The second technique – developed at the University of Illinois at Chicago, U.S.A., with initial support from IMEC – performs an exact computation of the minimum data storage for procedural specifications. This latter approach has relevant memory management applications, like the study of the impact of loop transformations on storage requirements and the performance analysis of different models of mapping multi-dimensional signals to the physical memories. Moreover, the paper discusses typical memory management trade-offs – extensively studied at IMEC (like, for instance, trading the number of memory accesses for the size of the data memory) – that are taken into account during the exploration, mainly by loop transformations, of system specifications. Being a tutorial presentation, the paper does not address lower-level computational aspects (a rich bibliography is offered to the interested reader); instead, it focuses on the general computation models of the approaches and on their applications in the memory management and system-level exploration methodologies, illustrating the concepts and ideas by several code examples.

The rest of the paper is organized as follows. After an overview of the past works addressing the memory estimation/computation problem (Section 2), the paper presents in Section 3 two more recent models for the evaluation of storage requirements. The first approach does an accurate estimation, exploring the possibilities of reordering the loop execution with the aim of optimizing data locality. The second technique performs an exact computation after the loop execution ordering has already been fixed. Section 4 discusses memory management trade-offs that must be taken into account during the early high-level design space exploration. Section 5 presents several experimental results and Section 6 states the main conclusions of this comprehensive overview.

2 The Memory Size Computation Problem: A Brief Overview

For several decades, researchers have worked on different approaches for estimating or computing the minimum memory size required to execute a given application. Most of the initial work was done at scalar level, due to the register-transfer behavioral specifications of the earlier digital systems. After being modeled as a clique partitioning problem [8], the register allocation and assignment have been optimally solved for nonrepetitive schedules, when the lifetime of all the scalars is fully determined [9]. The similarity with the problem of routing channels without vertical constraints [10] has been exploited in order to determine the minimum register requirements (similar to the number of tracks in a channel), and to optimally assign the scalars to registers (similar to the assignment of one-segment wires to tracks) in polynomial time by using the left-edge algorithm [11]. A nonoptimal extension for repetitive and conditional schedules has been proposed in [12]. A lower bound on the register cost can be found at any stage of the scheduling process using force-directed scheduling [13]. Integer Linear Programming techniques are used in [14] to find the optimal number of memory locations during a simultaneous scheduling and allocation of functional units, registers, and busses. Employing circular graphs, [15] and [16] proposed optimal register allocation/assignment solutions for repetitive schedules. A lower bound for the register count is found in [17] without fixing the schedule, through the use of ASAP and ALAP constraints on the operations. A good overview of these techniques can be found in [18].

Common to all the scalar-based techniques is that they break down when used by flattening large multi-dimensional arrays, each array element being considered a separate scalar. Nowadays, data-intensive signal processing applications are described by high-level, loop-organized, algorithmic specifications whose main data structures are typically multi-dimensional arrays. Flattening the arrays from the specification of a video or image processing application would typically result in many thousands or even millions of scalars.

To overcome the shortcomings of the scalar-based techniques, several research teams have tried to split the arrays into suitable units before or as a part of the estimation. Typically, each instance of array element accessing in the code is treated separately. Due to the loop structure of the code, large parts of an array can be produced or consumed by the same code instance. This reduces the number of elements the estimator must handle compared to the scalar-based methodology. We will now present different published contributions using this approach, starting with techniques that assume a procedural execution of the application code.

In [19], a production time axis is created for each array. This models the relative production and consumption time, or date, of the individual array accesses. The difference between these two dates equals the number of array elements produced between them. The maximum difference found for any two depending instances gives the storage requirement for this array. The total storage requirement is the sum of the requirements for each array. An Integer Linear Programming approach is used to find the date differences. Since each array is treated separately, only the internal in-place mapping of an array (intra-array in-place) is considered; the possibility of mapping arrays in-place of each other (inter-array in-place) is not exploited. By in-place mapping we mean the optimization technique where data with non-overlapping lifetimes can be mapped to the same physical memory locations [20].

Another approach is taken in [21]. The data-dependency relations between the array references in the code are used to find the number of array elements produced or consumed by each assignment. The storage requirement at the end of a loop equals the storage requirement at the beginning of the loop, plus the number of elements produced within the loop, minus the number of elements consumed within the loop. The upper bound for the occupied memory size within a loop is computed by producing as many array elements as possible before any elements are consumed. The lower bound is found with the opposite reasoning. From this, a memory trace of bounding rectangles as a function of time is found. The total storage requirement equals the peak bounding rectangle. If the difference between the upper and lower bounds for this critical rectangle is too large, better estimates can be achieved by splitting the corresponding loop into two loops and rerunning the estimation. In the worst-case situation, a full loop-unrolling is necessary to achieve a satisfactory estimate.

Reference [22] describes a methodology for so-called exact memory size estimation for array computation. It is based on live variable analysis and integer point counting for intersections/unions of mappings of parameterized polytopes. In this context, a polytope is the intersection of a finite set of half-spaces and may be specified as the set of solutions to a system of linear inequalities. It is shown that it is only necessary to find the number of live variables for one statement in each innermost loop nest to get the minimum memory size estimate. The live variable analysis is performed for each iteration of the loops, however, which makes it computationally hard for large multi-dimensional loop nests.

In [23], the specifications are limited to perfectly nested loops. A reference window is used for each array. At any moment during execution, the window contains array elements that have already been referenced and will also be referenced in the future. These elements are hence stored in the local memory. The maximal window size gives the memory requirement for the array. If multiple arrays exist, the maximum reference window size equals the sum of the windows for the individual arrays. Inter-array in-place is consequently not considered.

All the techniques above estimate the memory size assuming a single memory. Hu et al. [5] perform hierarchical memory size estimation, taking data reuse and memory hierarchy allocation into account. In-place mapping is not incorporated in the current version, but is indicated as part of future work.

In contrast to the array-based methods described so far in this section, the storage requirement estimation technique presented in [24] assumes a non-procedural execution of the application code. It traverses a dependency graph based on an extended data dependency analysis, resulting in a number of non-overlapping array sections (so-called basic sets) and the dependencies between them. The basic set sizes and the sizes of the dependencies are found using an efficient lattice point counting technique [25]. The maximal combined size of simultaneously alive basic sets, found through a greedy graph traversal, gives an estimation of the storage requirement.

The techniques described above are mainly used at an early design stage to estimate the memory size required for the final implementation. Optimizations at these early steps are also the main topic of this paper. In addition to this, much effort has been focused on performing the final in-place optimization and signal-to-memory mapping. A good overview of the work in this field can be found in [26]. Although this paper will not put much emphasis on the later design step of assigning signals to the physical memory, this mapping problem will be briefly addressed in Section 3.2.3, as an application of one of the memory size computation approaches. The next section will present two alternative techniques that can be used for the system-level evaluation of the data storage.

3 Non-Scalar Models for Memory Size Evaluation

This section presents two approaches for the memory size evaluation of loop-organized algorithmic specifications, where the main data structures are multi-dimensional arrays. The specifications are in single-assignment form, that is, each array element is written at most once (but it can be read an arbitrary number of times).
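As a minimal illustration of this form (ours, not an example from the paper), compare a specification that repeatedly overwrites one location with a single-assignment equivalent, where every array element is written at most once and lifetimes become explicit:

  #define N 64
  int b[N];

  /* Not single-assignment: the location s is overwritten N times. */
  int sum_multiple_assignment(void) {
      int s = 0;
      for (int i = 0; i < N; i++)
          s = s + b[i];                        /* same location rewritten */
      return s;
  }

  /* Single-assignment: each element of partial[] is written exactly
     once, so its production and consumption times are manifest. */
  int partial[N + 1];
  int sum_single_assignment(void) {
      partial[0] = 0;
      for (int i = 0; i < N; i++)
          partial[i + 1] = partial[i] + b[i];  /* one write per element */
      return partial[N];
  }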

The first technique assumes that the application code can be interpreted in a non-procedural form. It allows any degree of fixation of the execution ordering, from fully unfixed, through partially fixed, to fully fixed. To be able to quickly generate estimates with this limited amount of information available, certain approximations are used. Unless the ordering is fully fixed, it also generates upper and lower bounds on the storage requirement. It is, therefore, natural to use this approach during the early system-level design steps. The designer can use the bounds as guidance when making decisions that gradually fix the execution ordering.

The second approach assumes that the application code is written in a procedural form. It is, therefore, best used at later design stages, when it makes sense to apply more computationally expensive techniques in order to reach exact results.

3.1 Estimation Approach for Non-Procedural Specifications

All but the last estimation approach described in Section 2 assume a fully fixed ordering. This makes them hard to use during early system-level design steps such as the global data flow transformations and global loop transformations [3]. The number of alternative solutions is huge, each with a different execution ordering. It is very time consuming to evaluate each of these using its fixed ordering. The designer would benefit from having size estimates that work with only a partially fixed execution ordering. In that context, an estimate that is not necessarily fully exact is acceptable. Actually, without the ordering being fixed, it is even infeasible to come up with one number for the size: in practice, a range of sizes is still possible. So, we would like to compute bounds that are as tight as possible on that available range, for a given amount of freedom in the loop organization.

For our target classes of data-dominated applications, the high-level description is typically characterized by large multi-dimensional loop nests and arrays with mostly manifest index expressions and loop bounds. Example 1 shows the code of a simple loop nest. Two statements, S1 and S2, produce elements of two arrays, A and B. Elements from array A are consumed when elements of array B are produced. This gives rise to a flow-type data dependency between S1 and S2 [27]. In this example the array index expressions are relatively simple, but the techniques presented in this section work for any affine and manifest index expressions. Section 4.1 discusses preprocessing techniques that can be employed on the original source code if this is not the case.

Example 1

  for (x = 0; x <= 5; x++)
    for (y = 0; y <= 5; y++)
      for (z = 0; z <= 2; z++) {
        S1: A[x][y][z] = f1(input);
        S2: if (x >= 1 && y >= z)
              B[x][y][z] = f2(A[x-1][y-z][z]);
      }

Loop interchange is a transformation with a very large impact on the lifetimes of data elements within loop nests. With the worst-case ordering of the dimensions in Example 1 – y outermost, x second outermost, and z innermost, i.e., (y, x, z) – the storage requirement for the dependency between S1 and S2 is 33 memory locations. For the best-case ordering, (z, x, y), the requirement is 8. For this simple example, the absolute numbers are small; the ratio between them is large, however, and this holds also for large real-life examples.

We shall now give an overview of the four steps of our estimation methodology. It takes the available, partially fixed ordering into account to find upper and lower bounds (UB and LB) on the run-time storage requirement of an application. The span between the bounds reflects the still unfixed part of the ordering.

Step 1  Collect data dependency information from the application code.

Our estimation methodology uses a geometrical model to describe data, operations, and dependencies. The first part of this step is performed by the Atomium tool [28]. It finds the Iteration Domains (IDs) of the statements and the index expressions of the array accesses in the application code. Figure 1 shows graphically the iteration space [27] of the loop nest in Example 1 with the IDs of the two statements. To avoid unnecessarily complex figures, two-dimensional rectangles are used. All nodes within a rectangle are part of the corresponding ID. For our example, the ID of S2 originally has a somewhat complex shape because of the if-clause y >= z. When this is the case, we apply a bounding box simplification. A bounding box ID is a rectangular approximation of the original ID, where the original can have any convex shape. Figure 1 shows ID_S2 after this simplification. This is an efficient and appropriate approximation since most convex shapes in our targeted application domain are rectangular and have regular accesses, i.e., images, blocks, etc. To avoid too complex address generation, linear addresses are also normally used in the final implementation of the application. Bounding box based size estimates will then give results similar to those of the final implementation.

[Figure 1: Iteration space and iteration domains (IDs) from Example 1. In (x, y, z) coordinates: ID_S1 = {0..5} × {0..5} × {0..2}; ID_S2 = {1..5} × {0..5} × {0..2}; DP = {0..4} × {0..5} × {0..2}; DVP = {0..1} × {0..2} × {0..0}; EDV = (1, 2, 0); |DP_x| = 5, |DVP_x| = 2, |DP_y| = 6, |DVP_y| = 3.]

At each iteration node within ID_S1 and ID_S2 we assume that statements S1 and S2 are executed, respectively. For ID_S1 this is fully correct, while it is an approximation for ID_S2 because of the bounding box. Not all elements produced by S1 are read by S2. A Dependency Part (DP) is therefore defined, as shown in Fig. 1, containing the iteration nodes at which elements are produced that are read by the depending statement. Note that as long as no data dependencies are violated, i.e., data is never consumed before it is produced, we can visit the iteration nodes in any order we like. This corresponds to a non-procedural interpretation of the application code.

We now need to find the Extreme Dependency Vector (EDV) for each dependency. For this we can use an efficient technique introduced by Hu et al. in [29] that avoids the traditional, computationally expensive ILP-based solutions. There is a Dependency Vector (DV) from each iteration node writing an array element to the iteration node reading the same array element. Dependencies are uniform if they all have the same length and direction. Otherwise, they are nonuniform, as in Example 1. Figure 2 summarizes the three alternative DV lengths and directions of Example 1. All DVs have length 0 in the z-dimension. The EDV is the maximal projection among all DVs in each loop dimension. In most cases, as in Example 1, the result of this is one of the actual DVs. If this is not the case, use of the EDV will result in some overestimates, as shown in [29]. Alternatively, these exceptions can be handled separately during estimation. In [29], a concept of a Maximal DV is introduced that is always one of the existing DVs. To be able to use this, a fully fixed execution ordering is required. This is consequently not an alternative here.

The EDV spans a Dependency Vector Polytope (DVP), as shown in Fig. 1. Its dimensions are defined as Spanning Dimensions (SDs). The remaining dimensions are denoted Nonspanning Dimensions (NDs). In Fig. 1, x and y are SDs while z is an ND. More details about these concepts can be found in [30].

[Figure 2: DVs in Example 1 – the three alternative dependency vectors in the (x, y, z) space.]
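To make the EDV computation concrete, the brute-force sketch below (our illustration, not the efficient technique of [29]) enumerates all DVs of Example 1 and takes the maximal projection per dimension:

  #include <stdio.h>

  int main(void) {
      /* For every iteration (x,y,z) of S2 in Example 1, the element
         A[x-1][y-z][z] is read; it was written by S1 at iteration
         (x-1, y-z, z).  The DV is the difference of the two nodes. */
      int edv[3] = {0, 0, 0};          /* maximal projection per dimension */
      for (int x = 0; x <= 5; x++)
          for (int y = 0; y <= 5; y++)
              for (int z = 0; z <= 2; z++)
                  if (x >= 1 && y >= z) {        /* condition of S2 */
                      int dv[3] = { 1, z, 0 };   /* (x,y,z) - (x-1,y-z,z) */
                      for (int d = 0; d < 3; d++)
                          if (dv[d] > edv[d]) edv[d] = dv[d];
                  }
      printf("EDV = (%d, %d, %d)\n", edv[0], edv[1], edv[2]);  /* (1, 2, 0) */
      return 0;
  }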


Step 2 Position DPs in a common iteration space.

The IDs are placed in a common iteration space to enable a global estimation scope even for applications with loops that are not perfectly nested [31]. A best-case and a worst-case common iteration space may be necessary to find the memory size LB and UB for the complete application.

Step 3  Estimate UB and LB on the size of individual dependencies, based on the partially fixed execution ordering.

The size of a dependency between two IDs equals the number of iteration nodes visited in the DP before the first depending iteration node is visited. Since one array element is produced at each iteration node in the DP, this size equals the number of array elements produced before the first depending array element – which can potentially be mapped in-place of the first array element – is produced.

We will now present a number of guiding principles for the ordering of loops in a loop nest. The size of a dependency is minimized if the execution ordering is fixed so that its NDs and SDs are placed at the outermost and innermost nest levels, respectively. The order of the NDs is of no consequence as long as they are all fixed outermost. If one or more SDs are ordered in between the NDs, it is however better to place the shortest NDs inside the SDs. The length of an ND k is determined by the length of the DP in this dimension, |DP_k|. The ordering of the SDs is important even if all of them are fixed innermost. The SD with the largest Length Ratio (LR) should be ordered innermost. The LR is defined as follows:

    LR_k = (|DVP_k| − 1) / (|DP_k| − 1)

For the special case when |DP_k| = 1, care must be taken to avoid division by zero. LR_k = ∞ can still be assumed, since such dimensions should be ordered innermost.

Returning to the dependency between S1 and S2 in Example 1, the LRs for the SDs are easily calculated. For each dimension we use |DP| and |DVP| as illustrated in Fig. 1. This gives LR_x = 1/4 and LR_y = 2/5. According to the guiding principles, the y dimension should consequently be fixed innermost, with x second innermost, and the ND z outermost. If the guiding principles are applied in their opposite order, we get the worst-case ordering. Using these guiding principles for best-case and worst-case ordering, our estimation methodology finds the orderings that will result in the LB and UB storage requirement of individual dependencies. Any ordering already fixed will be taken into account. Starting outermost, the contribution of each nest level to the total dependency size is calculated. If a dimension is already fixed at a given nest level, this dimension is used to calculate the contribution. If not, the dimensions according to the best-case and worst-case orderings are used for the LB and UB calculations, respectively. When calculating the contribution of a given nest level, we multiply Q_j of the dimension fixed here with the P_k of all dimensions fixed inside it. Here Q_j = |DVP_j| − 1 and P_k = |DP_k|. The total dependency size is the sum of the contributions of all nest levels. Note that Q_j = 0 for all NDs, so they do not contribute to the overall dependency size, except with their P_k if they are fixed inside an SD. Table 1 demonstrates the stepwise calculation of dependency sizes with an unfixed and a partially fixed execution ordering. The different values for Q_j and P_k can be found from Fig. 1. Note that as the ordering is gradually fixed, the bounds converge. With a fully fixed ordering they are equal. In addition to the LB and UB, the best-case ordering found is reported to the user. The complete algorithm is described in detail in [30].

Table 1  Storage requirement for the code in Example 1 with unfixed and partially fixed execution ordering.

                   Completely unfixed            x fixed outermost
  BC ordering      (z, x, y)                     (x, z, y)
  WC ordering      (y, x, z)                     (x, y, z)
  1st nest level   LBt = 0                       LBt = UBt = Qx·Py·Pz = 18
                   UBt = Qy·Px·Pz = 30
  2nd nest level   LBt = LBt + Qx·Py = 6         LBt = LBt = 18
                   UBt = UBt + Qx·Pz = 33        UBt = UBt + Qy·Pz = 24
  3rd nest level   LB = LBt + Qy = 8             LB = LBt + Qy = 20
                   UB = UBt = 33                 UB = UBt = 24
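The per-nest-level calculation is straightforward to implement. The following sketch (ours, using the |DP| and |DVP| values from Fig. 1) evaluates the dependency size of Example 1 for all six loop orderings, reproducing the entries of Table 1:

  #include <stdio.h>

  /* Dimensions of Example 1: 0 = x, 1 = y, 2 = z.
     P[d] = |DP_d|, Q[d] = |DVP_d| - 1 (values from Fig. 1). */
  static const int P[3] = {5, 6, 3};
  static const int Q[3] = {1, 2, 0};

  /* Each nest level contributes Q of its dimension times the
     product of P over all dimensions fixed inside it. */
  static int dep_size(const int ord[3]) {       /* ord: outermost first */
      int size = 0;
      for (int l = 0; l < 3; l++) {
          int contrib = Q[ord[l]];
          for (int k = l + 1; k < 3; k++)
              contrib *= P[ord[k]];
          size += contrib;
      }
      return size;
  }

  int main(void) {
      const char name[] = "xyz";
      const int perms[6][3] = {{0,1,2},{0,2,1},{1,0,2},{1,2,0},{2,0,1},{2,1,0}};
      for (int p = 0; p < 6; p++)
          printf("(%c,%c,%c): %d\n", name[perms[p][0]], name[perms[p][1]],
                 name[perms[p][2]], dep_size(perms[p]));
      /* Prints (z,x,y): 8 and (y,x,z): 33 -- the unfixed LB/UB -- and
         (x,z,y): 20, (x,y,z): 24 -- the bounds with x fixed outermost,
         as in Table 1. */
      return 0;
  }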

Step 4  Find simultaneously alive dependencies and their maximal combined size ⇒ bounds on the total storage requirement.

After having found the upper and lower bounds on the size of each dependency in the common iteration space, it is necessary to determine whether two or more of them are alive simultaneously. The maximal combined size of simultaneously alive dependencies over the lifetime of an application gives the total storage requirement of the application. Two dependencies can potentially be alive simultaneously if their DPs overlap in one or more dimensions in the common iteration space. Depending on whether the overlap occurs only for NDs, for a subset of the dimensions including at least one SD, or for all dimensions, they will alternate in being alive, be alive simultaneously for certain execution orderings, or be alive simultaneously regardless of the chosen execution ordering. Similar reasoning can be made for groups of multiple dependencies. The estimation methodology uses a two-step procedure: first, groups of potentially simultaneously alive dependencies are detected, followed by an inspection to reveal those actually simultaneously alive for a given partially fixed execution ordering. Details regarding this part of the methodology are presented in [32].

We will demonstrate the benefit of allowing a partially fixed ordering using a part of the Updating Singular Value Decomposition algorithm [33] used, for instance, in beamforming for antenna systems. The part we investigate has two major dependencies, each with an estimated LB/UB of 1/100 if no execution ordering is fixed. For one of these dependencies there exist restrictions in the code, forcing a given dimension to be fixed outermost. Estimating the sizes with this partially fixed ordering results in converging LB/UB of 100/100 for both dependencies. Both are alive simultaneously, so their combined size is 200. This is found without having to explore all the alternative orderings. The large penalty of fixing this dimension outermost would encourage the designer to investigate transformation alternatives. One possibility would be to perform a loop body split, so that the dependency without any restrictions could be ordered optimally. This would then result in a combined size of 101. Sections 4 and 5 give more detailed and larger examples of how this estimation methodology can be a part of a solution space exploration during the global loop transformation design step.

3.2 Exact Computation Approach for Procedural Specifications

This section presents a non-scalar method for computing exactly the minimum data memory size in signal processing algorithms where the code is procedural, that is, where the loop structure and sequence of instructions induce the (fixed) execution ordering. This assumption is based on the fact that the design entry in present industrial design usually includes a full fixation of the execution ordering. Even if this is not the case, the designer can still explore different, functionally equivalent algorithmic specifications.

The exact computation of storage requirements (minimum data memory size) may be necessary in embedded system applications, where the constraints on the data memory space are typically tighter. It may be useful as well in assessing the impact of different code (and, in particular, loop) transformations on the data storage. The exact computation approach also allows one to address the problem of signal-to-memory mapping by providing exact determinations of window sizes for multi-dimensional signals and their indexes. Finally, it allows a more precise data reuse analysis [4–6] and, therefore, a better steering of the (hierarchical) memory allocation [1, 2]. The trade-off is the longer run-times and the requirement that each possible loop transformation instantiation has to be evaluated individually. That makes this approach better suited after the crude pruning of the system-level data-flow, when the loop transformation stage has been finalized, so that only a limited set of options remains.

An array reference can typically be represented as the image of an affine vector function i ↦ T·i + u over a Z-polytope (its iterator space) { i ∈ Z^n | A·i ≥ b } – therefore, a lattice [34] which is linearly bounded [35]. For instance, the array reference A[2i − j + 5][3i + 2j − 7] from the loop nest

  for (i = 2; i <= 7; i++)
    for (j = 1; j <= -2*i + 15; j++)
      if (j <= i + 1)
        ... A[2*i - j + 5][3*i + 2*j - 7] ...

has the iterator space

    P = { (i, j) ∈ Z² | i ≥ 2, j ≥ 1, j ≤ i + 1, j ≤ −2i + 15 }.

(The inequality i ≤ 7 is redundant.) The A-elements of the array reference have the indices

    (x, y) = (2i − j + 5, 3i + 2j − 7), with (i, j) ∈ P.

The points of the index space lie inside the Z-polytope {2x + y ≥ 17, 5x − y ≥ 25, 3x − 2y ≤ 22, x + 4y ≤ 82, x, y ∈ Z}, whose boundary is the image of the boundary of the iterator space P (see Fig. 3). However, only the points (x, y) also satisfying the divisibility condition 7 | 2x + y + 4 (that is, 7 divides exactly 2x + y + 4) belong to the index space; these are the black points in the right quadrilateral from Fig. 3. In this illustrative example, each point in the iterator space is mapped to a distinct point of the index space, but this is not always the case.
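These relations can be checked mechanically. The sketch below (ours, for illustration) enumerates the iterator space, maps each point through the affine function, and verifies both the bounding inequalities and the divisibility condition of the index space:

  #include <stdio.h>

  int main(void) {
      int points = 0;
      for (int i = 2; i <= 7; i++)
          for (int j = 1; j <= -2 * i + 15; j++)
              if (j <= i + 1) {
                  int x = 2 * i - j + 5;        /* affine mapping */
                  int y = 3 * i + 2 * j - 7;
                  points++;
                  /* Every image point must satisfy the four inequalities
                     and the divisibility condition 7 | 2x + y + 4. */
                  if (!(2*x + y >= 17 && 5*x - y >= 25 &&
                        3*x - 2*y <= 22 && x + 4*y <= 82) ||
                      (2*x + y + 4) % 7 != 0)
                      printf("violation at (%d, %d)\n", x, y);
              }
      printf("%d iterator points\n", points);   /* prints 21 */
      return 0;
  }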

3.2.1 The Algorithm Computing the Minimum Data Memory Size

The main steps of the memory size computation algorithm [36, 37] will be briefly presented below.


[Figure 3: The iterator space and the index space of the array reference A[2i − j + 5][3i + 2j − 7]. The iterator space is the quadrilateral with vertices (2, 1), (2, 3), (14/3, 17/3), (7, 1), bounded by i = 2, j = 1, j = i + 1, and j = −2i + 15; its image, the index space, lies inside the quadrilateral with vertices (8, 1), (6, 5), (26/3, 55/3), (18, 16), bounded by 2x + y = 17, −3x + 2y = 22, x + 4y = 82, and 5x − y = 25.]

Step 1  Extract the array references from the given algorithmic specification and decompose the array references of every indexed signal into disjoint linearly bounded lattices (LBLs).

Figure 4a lists the code of Example 2:

  optDelta[0] = 0;                       // A[10][529] : input
  for (j = 16; j <= 512; j++) {
      Delta[4][j][0] = 0;
      for (k = 0; k <= 8; k++)
          for (i = j-16; i <= j+16; i++)
              Delta[4][j][33*k+i-j+17] =
                  A[4][j] - A[k][i]@1 + Delta[4][j][33*k+i-j+16];
      optDelta[j-15] = Delta[4][j][297] + optDelta[j-16];
  }
  for (j = 16; j <= 512; j++) {
      Delta[5][j][0] = 0;
      for (k = 1; k <= 9; k++)
          for (i = j-16; i <= j+16; i++)
              Delta[5][j][33*k+i-j-16] =
                  A[5][j] - A[k][i]@1 + Delta[5][j][33*k+i-j-17];
      optDelta[j+482] = Delta[5][j][297] + optDelta[j+481];
  }
  out = optDelta[994];                   // out : output

[Figure 4: (a) Example 2 – the kernel of a motion detection algorithm [1], where m = 4, n = 16, M = 5, N = 512 (code above). (b) Decomposition of the index space of signal A into disjoint linearly bounded lattices (LBLs); the arcs in the graph show the inclusion relations between LBLs (sizes: |A1| = |A2| = 4,761, |A3| = 4,232, |A4| = |A5| = 529, |A6| = |A7| = 1,587, |A8| = |A9| = 32, |A10| = |A11| = 497). (c) The partitioning of A's array space according to the decomposition (b).]

Figure 4b shows the result of this decomposition for the 2-dimensional signal A in the illustrative example from Fig. 4a. The graph displays the inclusion relations (arcs) between the lattices of A (nodes). The four "bold" nodes are the four array references of signal A in the code. (The delay operator "@" indicates inputs or signals produced in previous data-sample executions of the code, and the following argument signifies the number of such previous executions. The delayed array references do not affect this decomposition step, but the next steps take the delayed signals into account since they must be kept alive during several time iterations.) The nodes are also labeled with the size of the corresponding LBL – that is, the number of lattice points (i.e., points having integer coordinates) in those sets. The inclusion graph is gradually constructed by partitioning analytically the initial (four) array references using intersections and differences of lattices [37]. While the intersection of two non-disjoint LBLs is an LBL as well [35], the difference is not necessarily an LBL – and this latter operation makes the decomposition difficult. In this example, A1 ∩ A2 = A3, and A1 − A3, A2 − A3 are also LBLs (denoted A4, A5 in Fig. 4b). However, the difference A3 − A10 is not an LBL, due to the non-convexity of this set. At the end of the decomposition, the nodes without any incident arc represent non-overlapping LBLs (they are displayed in Fig. 4c). Every array reference in the code is now either a disjoint LBL itself (like A10 and A11), or a union of disjoint LBLs (e.g., A1 = A4 ∪ A3 = A4 ∪ A6 ∪ ... ∪ A11).

When the affine vector function i ↦ T·i + u is a one-to-one mapping¹ (like in Fig. 3, where each of the 21 iterator vectors – represented by black points in the iterator space – is mapped to a distinct point in the index space), the LBL size computation reduces to the computation of the number of lattice points (i.e., having integer coordinates) in a Z-polytope. (See again Fig. 3: instead of counting the black points in the right quadrilateral containing holes, we count the black points in the left quadrilateral without any hole, if loop normalization was performed in advance.) Otherwise, the problem reduces to counting the points in a projection of a polytope [38, 39], as explained and exemplified in [25]. Counting the lattice points in a polytope can be done in several ways: there are methods based on Ehrhart polynomials like, for instance, [40, 41], or even much simpler ones adapting the Fourier-Motzkin technique [42, 43]. For reasons of scalability, we use a computation technique based on the decomposition of a simplicial cone into unimodular cones [44], an approach exemplified in [36].

¹This occurs, for instance, when the rank of matrix T is equal to the number of its columns, as proven in [25].

Step 2  Determine the memory size at the boundaries between the blocks of code.

After the decomposition of the array references in the specification code, a lifetime analysis on the polyhedral partitions of the signals finds the blocks of code (e.g., nests of loops) where each of the disjoint lattices is produced and consumed (i.e., used as part of an operand for the last time). Based on this information, the memory size at the block boundaries can be computed exactly, since the storage requirements of the LBLs are already known from Step 1.

Step 3  Compute the maximum storage requirement inside each block.

This operation is based on the computation of maximum iterator vectors relative to the lexicographic order. Taking the set of iterator vectors mapping a given array element and assuming the loops normalized (i.e., all the iterators are increasing with step 1), the maximum iterator vector yields the iteration in the loop nest when the element is consumed and, hence, the data storage decreases. Actually, part of the LBLs produced or consumed in the block can be conveniently ignored if their effect on the memory variation can be taken into account without generating the scalars they cover and computing their maximum iterator vectors. For instance, in the first loop nest of Example 2 (see Fig. 4a), it can be proven that each iterator vector [j k i]^T corresponds to a unique produced scalar Delta[4][j][33k + i − j + 17] and a unique consumed scalar Delta[4][j][33k + i − j + 16]. The effect of the two array references on the memory variation is +1 − 1 = 0 in each iteration and, therefore, these operands can be ignored. It can be shown that the storage requirement of Example 2 is 10,582 locations, the maximum occupancy occurring in the first loop nest.
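The +1 − 1 argument can be checked mechanically: in the first loop nest of Example 2, the Delta index consumed at each (k, i) iteration equals the index produced by the previous one. A small sketch of this check (ours, for illustration):

  #include <stdio.h>

  int main(void) {
      /* Walk the (k, i) iterations of the first loop nest of Example 2
         in execution order and verify that the third index of the
         consumed Delta[4][j][.] equals the previously produced one. */
      for (int j = 16; j <= 512; j++) {
          int prev = 0;                    /* Delta[4][j][0] is initialized */
          for (int k = 0; k <= 8; k++)
              for (int i = j - 16; i <= j + 16; i++) {
                  int produced = 33 * k + i - j + 17;
                  int consumed = 33 * k + i - j + 16;
                  if (consumed != prev)
                      printf("chain broken at j=%d k=%d i=%d\n", j, k, i);
                  prev = produced;
              }
      }
      printf("done\n");   /* no break reported: the net memory change is 0 */
      return 0;
  }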

3.2.2 Evaluating the Impact of Loop Transformations on the Data Storage

The tool implemented based on the algorithm computing the minimum data storage – described in Section 3.2.1 – can be easily adapted to generate the data memory variation during the execution of the application code. The memory trace in Fig. 5 shows the variation of the storage during the execution of a 2-D Gaussian blur filter algorithm from a medical image processing application which extracts contours from tomograph images in order to detect brain tumors. The abscissae are the numbers of datapath instructions in the code; the ordinates are the memory locations in use. While the first graph represents the entire trace, the second graph is a detailed trace in the interval [23800, 24900], which corresponds to the end of the "horizontal Gaussian blur" and the start of the "vertical Gaussian blur."

[Figure 5: (a) Memory trace of a 2-D Gaussian blur filter algorithm (N = 100, M = 50). The global maximum is at the point (x = 5, y = 5,005); therefore, the minimum data memory size is 5,005 locations. (b) A detail of the memory trace where the "horizontal Gaussian blur" ends and the "vertical Gaussian blur" begins.]

The computation of the minimum data storage is also useful in evaluating the impact of different code (and, in particular, loop) transformations on the data storage. For instance, the minimum memory size needed for the execution of Example 3 from Fig. 6a is 125,193 locations – as computed by the software tool. The trace of the memory variation is displayed in Fig. 6b, and a detail containing the global maximum is shown in Fig. 6c. Different code variants of the same application can thus be compared against one another from the storage point of view, without the need of performing a proper memory allocation for each variant – a significantly more expensive solution. In particular, the memory size computation tool implementing the algorithm in Section 3.2.1 can be used to evaluate the loop ordering strategy described in Section 3.1. In Example 3, i is the only SD. According to the guiding principles of Section 3.1, the loops should hence be ordered with the i dimension innermost. This is substantiated by our size computation: by interchanging the loops, the storage requirement decreases drastically, by over 99.5% (to only 576 locations), the new trace being the graph shown in Fig. 6d. The code of Example 3 (Fig. 6a) is:

  // All the array elements are produced
  // and consumed in this loop nest
  for (i = 0; i < 767; i++)
    for (j = 0; j < 256; j++) {
      if (i+j >= 127 && i+j <= 254 && j <= 127) A[i][j] = ... ;
      if (i+j >= 255 && i+j <= 382 && j <= 127) ... = A[i-128][j];
      if (i+j >= 159 && i+j <= 318 && j <= 159) B[i][j] = ... ;
      if (i+j >= 319 && i+j <= 478 && j <= 159) ... = B[i-160][j];
      if (i+j >= 191 && i+j <= 382 && j <= 191) C[i][j] = ... ;
      if (i+j >= 383 && i+j <= 574 && j <= 191) ... = C[i-192][j];
      if (i+j >= 223 && i+j <= 446 && j <= 223) D[i][j] = ... ;
      if (i+j >= 447 && i+j <= 670 && j <= 223) ... = D[i-224][j];
      if (i+j >= 255 && i+j <= 510) E[i][j] = ... ;
      if (i+j >= 511 && i+j <= 766) ... = E[i-256][j];
    }

[Figure 6: (a) Example 3 (code above). Memory traces showing the effect of loop interchange on the storage requirement: (b) the memory trace of the code – the minimum data storage (the maximum value of the memory variation) is 125,193 locations; (c) a detail of the memory trace in a neighborhood of the global maximum; (d) the memory trace of the code after loop interchange – the minimum data storage is only 576 locations.]
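Since every element in Example 3 is written once and read exactly once, the minimum storage can also be cross-checked by a brute-force sweep that increments a counter at each production and decrements it at the single consumption (following the convention above: an element is freed when used as an operand for the last time). A sketch of this check (ours, for illustration); the two peaks are expected to match the reported 125,193 and 576 locations:

  #include <stdio.h>

  static long live = 0, peak = 0;
  static void produce(void) { if (++live > peak) peak = live; }
  static void consume(void) { --live; }

  /* The iteration body of Example 3 (Fig. 6a), reduced to its effect
     on the number of live array elements. */
  static void body(int i, int j) {
      if (i+j >= 127 && i+j <= 254 && j <= 127) produce();   /* A produced */
      if (i+j >= 255 && i+j <= 382 && j <= 127) consume();   /* A consumed */
      if (i+j >= 159 && i+j <= 318 && j <= 159) produce();   /* B */
      if (i+j >= 319 && i+j <= 478 && j <= 159) consume();
      if (i+j >= 191 && i+j <= 382 && j <= 191) produce();   /* C */
      if (i+j >= 383 && i+j <= 574 && j <= 191) consume();
      if (i+j >= 223 && i+j <= 446 && j <= 223) produce();   /* D */
      if (i+j >= 447 && i+j <= 670 && j <= 223) consume();
      if (i+j >= 255 && i+j <= 510) produce();               /* E */
      if (i+j >= 511 && i+j <= 766) consume();
  }

  int main(void) {
      for (int i = 0; i < 767; i++)        /* original (i, j) ordering */
          for (int j = 0; j < 256; j++)
              body(i, j);
      printf("(i,j) peak: %ld\n", peak);

      live = peak = 0;
      for (int j = 0; j < 256; j++)        /* after loop interchange */
          for (int i = 0; i < 767; i++)
              body(i, j);
      printf("(j,i) peak: %ld\n", peak);
      return 0;
  }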

3.2.3 Mapping Multi-Dimensional Arrays into the Data Memory

The minimum data storage represents the tight lower bound for which the execution of the code is still possible. However, in practice, this amount of storage is difficult to reach (although still possible!) since it would require complex hardware for address generation. Instead, industrial designers apply more regular signal-to-memory mapping techniques to compute the physical addresses in the data memory for the array elements in the application code. Academic researchers have studied the mapping models as well, aiming to find better trade-offs. These mapping models actually trade an excess of storage for less complex address generation hardware.

Several signal-to-memory mapping techniques have been proposed (an overview is given, for instance, in [26]), but none of these research works could provide consistent information on how effective the mapping models are, that is, how large the oversize of their resulting storage amount after mapping is in comparison to the minimum data storage. The effectiveness of the mapping models is assessed only relatively, comparing the storage resulting after applying the mapping model either to the total number of array elements, or to the storage results when applying a different mapping model [26, 45, 46]. This relative evaluation is not sufficient, though, since it does not give a precise picture of the absolute quality of the model.

The computation approach described in Section 3.2.1 can determine not only the minimum data storage but, in addition, the minimum windows for each array in the application code, that is, the maximum number of each array's elements that are simultaneously alive. For instance, the signal A from Example 3 in Fig. 6a needs a minimum window of 12,352 memory locations since there are at most 12,352 A-elements simultaneously alive (as computed by the tool based on the algorithm presented in Section 3.2.1). The minimum windows (or the optimal intra-array in-place mapping [1]) of all the signals in Example 3, together with the number of array elements in a static allocation, are shown in Table 2. However, since the elements of the arrays A, ..., E can share the same locations if their lifetimes are disjoint, the minimum storage requirement is, actually, 125,193 locations, an amount smaller than the sum of the minimum windows of the five arrays in the code – that is, 146,400. This is also called the optimal inter-array in-place mapping [1].

To illustrate the analysis of the effectiveness of a mapping approach [47], let us consider the mapping model proposed in [45], which computes storage windows for each array in the code, investigating all the possible linearizations of the array; for each linearization, the largest distance between two live elements is computed.


This distance plus 1 is the data storage allocated for the array, according to [45]. For instance, the linearizations considered for a 2-D array are the ones obtained by concatenating the rows, or concatenating the columns, in the increasing or decreasing order of the indexes. When applied to the code of Example 3, the mapping model [45] yields a storage window of 16,384 locations for signal A since, e.g., the elements A[0][127] and A[128][126] are simultaneously alive, and their distance in the row-by-row concatenation is 128 × 128 − 1 = 16,383. Therefore, an allocation steered by this model would require for signal A 32.64% more storage than really necessary (i.e., 12,352). All the array windows yielded by the mapping model [45] are displayed in the last row of Table 2. In conclusion, the model [45] would allocate 194,560 memory locations for the entire code, therefore 32.9% more storage than the optimal memory sharing within each array (146,400 locations) and 55.41% more storage than the absolute minimum requirement (the overall optimal sharing) of 125,193 locations.

Table 2  Evaluation of a signal-to-memory mapping model [45] for Example 3 (Fig. 6a).

                             Array A   Array B   Array C   Array D   Array E     Total
  Number of array elements    32,640    51,040    73,536   100,128   130,816   388,160
  Minimum array windows       12,352    19,280    27,744    37,744    49,280   146,400
  Mapping array windows       16,384    25,600    36,864    50,176    65,536   194,560

Note that, according to the estimation approach described in Section 3.1, the storage requirement of the procedural code in Example 3 is 194,560 memory locations, which is exactly the result obtained above after the array mapping when DeGreef's model [45] was applied. The explanation is that the estimation approach anticipates the memory allocation; it does not attempt to minimize the data memory. So, the two approaches in Sections 3.1 and 3.2 are actually complementary to each other.
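The window obtained for A by the model of [45] can be verified directly. Under a row-major linearization of the 255 × 128 bounding box of A's accessed elements (an assumption consistent with the 128 × 128 − 1 distance quoted above), the sketch below (ours) checks that A[0][127] and A[128][126] are simultaneously alive and 16,383 locations apart:

  #include <stdio.h>

  /* Production/consumption times of A[i][j] in Example 3 under the
     (i, j) ordering, encoded as a single time stamp t = 256*i + j;
     A[i][j] is read at iteration (i + 128, j). */
  static long t_prod(int i, int j) { return 256L * i + j; }
  static long t_cons(int i, int j) { return 256L * (i + 128) + j; }

  int main(void) {
      /* Row-major addresses in the 255 x 128 box of accessed elements. */
      long a1 = 0L * 128 + 127;            /* A[0][127]   */
      long a2 = 128L * 128 + 126;          /* A[128][126] */

      /* Lifetimes overlap: A[128][126] is produced before A[0][127]
         is consumed, and vice versa. */
      if (t_prod(128, 126) < t_cons(0, 127) &&
          t_prod(0, 127) < t_cons(128, 126))
          printf("simultaneously alive, distance = %ld\n", a2 - a1);  /* 16383 */
      return 0;
  }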

4 Trade-offs in the Storage Exploration Methodology

Recent advanced multimedia systems use large amounts of data storage and transfers. This memory and bus usage consumes a major part of the energy in an embedded system, mainly due to initially bad data locality. The systematic Data Transfer and Storage Exploration (DTSE) methodology [2] reduces the energy consumption and optimizes global memory accesses by refining and improving the source code implementation.

Global Loop Transformations (GLT) form a major step in this DTSE methodology. They improve the initial bad data locality and thus enable subsequent optimization steps. The steps before GLT (pre-GLT) reduce redundant data transfers and expose the freedom (e.g., by inlining) for the GLT step. They trade off data memory size vs. other parameters of the application. For instance, selective function inlining [48] creates more opportunities for data memory size reduction; however, it increases the required instruction memory size. The GLT step changes the global execution ordering of the application and thus improves the locality of the array accesses. This results in later data memory size reduction. However, it potentially sacrifices other parameters of the application, like the instruction memory size, control-flow complexity, number of memory accesses, etc. To evaluate these data memory size related trade-offs, we need the fast and accurate storage size estimators that have been discussed in Section 3. These estimators can yield important feedback on the quality of the steering (either automatic or manual) for both the GLT and pre-GLT trade-offs. This is shown in Fig. 7, where the trade-off oriented loop transformation structural outline and its relation to memory size estimation are presented.

[Figure 7: Memory size estimation as part of the trade-off oriented loop transformation (LT) flow chart (GLT stands for Global Loop Transformations). Boxes: pre-GLT trade-offs, GLT trade-offs, other LT trade-offs, code generation trade-offs, and memory size estimation.]

In the flow chart in Fig. 7 we distinguish four types of trade-offs: pre-GLT, GLT, other loop transformation, and code generation trade-offs. These four categories of trade-offs appear in the order in which they occur in the DTSE methodology [2]. The relation between memory size estimation and those trade-offs is depicted by the arrows pointing to the trade-offs. A solid arrow means a strong relation, i.e., the memory size estimation is an important part of that trade-off. A dashed arrow depicts a weak relation, i.e., the memory size estimation could be present in that trade-off, but its importance is not very significant. Since the pre-GLT and GLT trade-offs will be addressed in separate subsections, due to the importance of memory size estimation in those trade-offs, we briefly discuss below the two remaining trade-offs displayed at the bottom of the flow diagram (Fig. 7).

Trade-offs w.r.t. the data memory size appear especially when we enlarge the exploration space or change the execution ordering. After GLT, the global execution ordering is fixed and, hence, fewer opportunities remain. In particular, in the subsequent steps of the DTSE methodology, other loop transformations are also applied to improve, e.g., the performance and bandwidth related aspects [49, 50]. These loop transformations do not have as big an impact on the data memory size as the GLT. Because of that, the high-level memory estimators are not needed as badly in this phase and, hence, are not used as often. Still, they can be helpful in this phase too, as suggested by the dashed arrow in Fig. 7. Also at the end, during the code generation from the geometrical model, interesting trade-offs exist. During this phase, when the local execution ordering is also fixed, we trade off the code size, the complexity of the control flow, and the number of empty iterations for the same geometrical model representation, from which different codes can be generated. These trade-offs have already been discussed by the authors working on code generators from the geometrical model [51, 52]. Moreover, they do not affect the data memory size, so we shall not discuss this phase further.

4.1 Pre-GLT Trade-Offs

The GLT are performed on the geometrical model, which is used in most of the research in the field of loop transformations [53–58]. However, the model imposes strict limitations on the input code: it can deal only with static control parts [59]. A static control part is a maximal set of consecutive statements without while loops, where loop bounds and conditionals may only depend on invariants within this set of statements. These invariants include symbolic constants, formal function parameters, and surrounding loop counters. The geometrical model also requires pointer-free code in one function [2]. The parts of the code that do not fulfill these strict conditions cannot be captured by the geometrical model and thus cannot be transformed. To extend the exploration scope of the GLTs, different preprocessing techniques have been proposed, such as selective function inlining [48], pointer analysis and conversion [60, 61], dynamic single assignment conversion [62], advanced copy propagation [63], hierarchical rewriting [2], and scenario creation [64–66]. These techniques often require trade-offs between the freedom they create for loop transformations and the extra cost that has to be paid (e.g., code size). These pre-GLT trade-offs are orthogonal [67] to the GLT trade-offs discussed in the next subsection, i.e., the constraints created by the decisions in the pre-GLT trade-offs are propagated and constrain the GLT trade-offs. The illustrative fragment below contrasts code that satisfies these conditions with code that does not.
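This fragment is our own illustration, not taken from the original specifications; N, A, B, k, threshold, and update are hypothetical names.

/* A static control part: the loop bounds and the condition depend
   only on the symbolic constant N and the surrounding loop
   counters, so the geometrical model can capture this nest. */
for (i = 0; i < N; i++)
    for (j = 0; j <= i; j++)
        if (j < N - i)
            A[i][j] = B[i+j];

/* Not a static control part: a while loop with a data-dependent
   exit condition cannot be captured by the geometrical model, so
   such code must first be preprocessed, e.g., by scenario
   creation [64-66]. */
while (A[k][0] > threshold)
    k = update(k);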

4.2 GLT Trade-Offs

After preparing the application to be parsed into the geometrical model, we can perform the GLT on this model. The GLT change the execution order of the application and affect the data memory size, instruction memory size, control-flow complexity, number of memory accesses, etc.

In-place optimization reuses memory space of data structures or data structure elements that are no longer needed for the future production of structures or elements [45]. The closer the production and the consumption of a data structure or of data structure elements are in the program, the better the in-place optimization step can be applied. Data reuse optimization aims at placing a local copy of the part of an array that will be used (consumed) several times closer to the data path [4]. The closer the consumptions are in the program, the better the data reuse will be. However, bringing consumptions close together for improved data reuse can cause poor in-place optimization for some arrays, and vice versa. Good data reuse reduces the number of memory accesses, while poor in-place mapping increases the data memory size. This results in trade-offs between the number of memory accesses and the data memory size. An example of this trade-off, combined with the use of the memory size evaluation techniques from Section 3, is given in Section 4.3; a small sketch of the data reuse idea follows below.
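The sketch is our own minimal illustration, not from [4]; big, row_copy, use, and the bounds H, W, R are hypothetical names.

/* Each row of big[][] is consumed R times. Copying it once into
   the small buffer row_copy[] turns R*W reads of the large memory
   into W reads, at the price of W extra storage locations: the
   in-place vs. data reuse tension described above. */
for (i = 0; i < H; i++) {
    for (j = 0; j < W; j++)
        row_copy[j] = big[i][j];      /* one copy per element */
    for (r = 0; r < R; r++)
        for (j = 0; j < W; j++)
            use(row_copy[j]);         /* repeated reads hit the copy */
}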

The trade-off in the previous paragraph has targetedthe data part of the application. Till now we did notlook at the control part of the application. The codeafter GLT, with optimal locality and thus minimal datamemory size, contains several complex if conditionsdue to the irregular access. However, high-efficientcode should not contain too complex control flow sinceit will slow down the application on the processor.Especially on VLIW processors this can cause muchoverhead because, due to the conditions, the loops canhave a problem with the crucial software pipeliningsubstep. The rich control flow in the application cancreate overhead if good branch prediction or guardedexecution is missing in the processor data path hard-ware. The extra control flow is mainly present becauseof fusion and shifting of the loops to obtain optimallocality and to still satisfy the flow dependencies. For-tunately, this control flow overhead can be reduced bynot fusing all the loops that are required for optimallocality. This will cause some growth of the memorysize requirements, resulting in a data memory size vs.control flow complexity trade-off. This trade-off can becombined with the trade-off between intra in-place anddata reuse as we will show in Section 4.3 for a real-life application. Note, that the memory size evaluationtechniques from Section 3 are crucial when steering orproviding early feedback to the designer w.r.t. thosecomplex memory oriented trade-offs.
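The following sketch (ours, not taken from any of the cited applications; produce, consume, buf, buf2, and the shift distance D are hypothetical) shows the two extremes of this trade-off:

/* Fully fused with a shift of D iterations: the buffer shrinks to
   D+1 locations, but every iteration evaluates two if conditions. */
for (i = 0; i < N + D; i++) {
    if (i < N)
        buf[i % (D+1)] = produce(i);
    if (i >= D)
        consume(buf[(i-D) % (D+1)]);
}

/* Not fused: no conditions inside the loop bodies, but all N
   produced values are simultaneously alive between the two loops. */
for (i = 0; i < N; i++)
    buf2[i] = produce(i);
for (i = 0; i < N; i++)
    consume(buf2[i]);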

4.3 The GLT Trade-Off Educative Demonstrator

Memory size estimation is beneficial in all the trade-offs discussed in the previous subsections. We first demonstrate this on the simple trade-off between storage size and number of memory accesses in Example 4. Later, in Section 5.2, we also provide results for a real-life test-vehicle.

In this simple example, three 4 × 5 arrays are present: A, B, and C, embedded in three loop nests. In the first loop nest, arrays A and B are produced (written), resulting in 2 × 4 × 5, i.e., 40 accesses. In the second loop nest, arrays A and B are consumed (read) and array C is produced (written), resulting in 3 × 4 × 5, i.e., 60 accesses. In the last loop nest, array C is consumed (read), resulting in 1 × 4 × 5, i.e., 20 accesses. Arrays A and B are still used later. Thus, the memory locations of arrays A and B cannot be reused by array C, i.e., arrays A and B cannot be in-placed (inter-array in-place, see Section 2) with array C. The maximal number of simultaneously alive elements (i.e., the required storage size) is 3 × 20, i.e., 60, occurring between the second and the third loop nest. The initial implementation thus requires 120 memory accesses and a storage size of 60 locations. The information about the storage size can be obtained using the memory size evaluation techniques of Section 3. To obtain a better implementation, loop reversal and fusion are used. Note that it is not possible to fuse all three loop nests because of the reversed loop iterator in the second loop nest, and reversing the execution of the third loop nest is not possible due to other constraints.

Example 4

int A[4][5], B[4][5], C[4][5];
for (i = 0; i < 4; i++)
    for (j = 0; j < 5; j++) {
        A[i][j] = · · ·
        B[i][j] = · · ·
    }
for (i = 0; i < 4; i++)
    for (j = 0; j < 5; j++) {
        C[3-i][4-j] = f(A[i][j], B[i][j]);
    }
for (i = 2; i < 6; i++)
    for (j = 1; j < 6; j++) {
        · · · = C[i-2][j-1];
    }
/* A & B arrays still used */

First, assume the fusion of the first two loop nests. We can then assign the computed values of arrays A and B to two temporary variables and use these variables in the f(int, int) function. Arrays A and B still have to be produced, because they are used later. This means that the whole consumption of A and B in the second loop nest is saved, i.e., 40 accesses. Thus, the number of memory accesses is reduced to 80. The storage size remains 60 locations, since this fusion does not affect the array lifetimes. The resulting code is sketched below.
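The temporaries tA and tB are our illustrative names, consistent with the description above:

for (i = 0; i < 4; i++)
    for (j = 0; j < 5; j++) {
        tA = · · ·                /* values formerly assigned directly */
        tB = · · ·
        A[i][j] = tA;             /* A and B still produced: used later */
        B[i][j] = tB;
        C[3-i][4-j] = f(tA, tB);  /* the 40 reads of A and B are saved */
    }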

Another option is to fuse the last two loop nests. Before that, a reversal has to be applied to the second loop nest. After exploring the lifetime of array C with a memory size evaluation technique from Section 3, we can determine that only some elements of this array have to be simultaneously alive: 6 instead of 20. This results in an overall storage size of 46 locations, instead of the original 60. Note that the number of memory accesses does not change; only some memory locations are accessed several times after applying (intra-array) in-place optimization. A possible shape of this transformed code is sketched below.
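This sketch is our own illustration, assuming that, after reversal and fusion with shifting, the consumer runs one row (five iterations) behind the producer, which matches the six simultaneously alive elements determined above; the circular buffer Cbuf and the flattened iterator k are illustrative names.

int Cbuf[6];                     /* only 6 of C's 20 elements alive */
for (k = 0; k < 25; k++) {
    if (k < 20) {                /* reversed producer of C */
        i = k / 5; j = k % 5;
        Cbuf[k % 6] = f(A[3-i][4-j], B[3-i][4-j]);
    }
    if (k >= 5)                  /* consumer, one row behind */
        · · · = Cbuf[(k-5) % 6];
}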

To summarize, the initial implementation requires a data memory size of 60 locations and 120 data memory accesses. After the loop fusion of the first two loop nests, the number of memory accesses decreases to 80. Another interesting option is the loop fusion of the last two loop nests, which decreases the required data memory size to 46 locations. Neither solution dominates the other: the fusion of the first two loop nests is better for the number of memory accesses, while the fusion of the last two loop nests is better for the data memory size. These two solutions represent two points on the optimal Pareto curve [68] that trades off the number of memory accesses against the storage size. Storage size estimators are crucial for the evaluation of the possible solutions and, also, to provide important feedback to the (automatic or manual) trade-off steering techniques.

5 Experimental Results

This section provides quantitative results, allowing a better understanding of the complementarity of the techniques discussed in this paper. First, we provide experimental results for memory size estimation and optimization on a number of real-life benchmarks. This is followed by a case study of the different trade-offs during GLT discussed in Section 4.

5.1 Experimental Results for Memory Size Estimation and Optimization

In this subsection, we present several experiments with three software tools:

(1) K2 is a tool computing exactly the minimum data memory size, developed at the University of Illinois at Chicago, whose model was presented in Section 3.2;

(2) STOREQ is a memory size estimation tool, developed at the Norwegian University of Science and Technology, Trondheim, Norway, whose model was discussed in Section 3.1;

(3) MC is a Memory Compaction tool from the DTSE design flow (see Section 4), developed at IMEC, Leuven, Belgium.

MC performs low-level storage order optimization of multi-dimensional data [45], which is normally the last step in the DTSE design flow. It hence gives examples of reasonable implementations, obtained using advanced techniques that trade off memory size against address and code complexity.

Table 3 Experimental results.

Application (parameters)                 # Array   # Scalars    K2                 STOREQ             MC
                                         Refs.                  Memory     CPU     Memory     CPU     Memory     CPU
Regularity detection (MaxGrid=8, L=64)   19        4,752        2,304      0.8     4,153      2.2     3,706      1.3
Durbin algorithm (N=500)                 27        252,499      1,249      15      2,002      2.5     1,502      40.4
2-D Gauss blur filter (N=800, M=600)     20        5,260,027    480,005    137     958,809    2.0     958,805    2.2
Motion detection (M=N=32, m=n=4)         11        72,543       2,740      2       2,770      1.6     2,741      1.4
Motion detection (M=N=120, m=n=8)        11        3,749,063    33,284     16      33,398     1.8     33,285     7.9
MPEG-4 motion estimation                 68        265,633      2,465      18      3,399      2.0     3,396      1.6

To make the comparative evaluation of these techniques possible, the application codes were considered procedural; therefore, the execution ordering was entirely fixed, as implied by the code organization. The selected benchmarks are either algebraic kernels or applications from digital signal processing: (1) a real-time regularity detection algorithm used in robot vision; (2) Durbin's algorithm for solving Toeplitz systems with N unknowns; (3) a 2-D Gaussian blur filter from a medical image processing application which extracts contours from tomograph images in order to detect brain tumors; (4) a motion detection algorithm used in the transmission of real-time video signals on data networks [1]; (5) the kernel of a motion estimation algorithm for moving objects (MPEG-4).

Table 3 summarizes the results of our experiments. The STOREQ and MC tools were run on a Linux server with a dual 2.4 GHz Xeon processor and 3 GB RAM; K2 was run on a PC with a 1.85 GHz Athlon XP processor and 512 MB memory.² Columns "# Array Refs." and "# Scalars" display the numbers of array references and, respectively, scalar signals (array elements) in the application codes. For each of the three tools, column "Memory" displays the storage requirement obtained (number of memory locations) and column "CPU" shows the corresponding running time in seconds.

² The difference between the two testing platforms does not affect in a significant way the relative differences between running times. According to [69], the former platform may be slightly faster (no more than 10–15%) than the latter, especially for applications exploiting the presence of the dual processor. The smaller amount of RAM of the second platform does not affect the performance of the K2 tool, since RAM is not a critical resource for it.

Table 3 shows that both STOREQ and MC are usually much faster than the exact computation approach (although the running times of the latter are still reasonable). This is not unexpected: performing exact computations typically implies a higher computational effort. Sometimes, the amount of additional computation can prove very significant, as in the case of the 2-D Gaussian blur filter application. This happens mainly when the data storage exhibits relatively small variations during the code execution, preventing the pruning mechanism of the tool from working efficiently. As exemplified in [37], the tool K2 detects and eliminates from further analysis the blocks of code where the local maxima cannot exceed the overall storage requirement. This "pruning" speeds up the tool, concentrating the analysis on those portions of code where the memory increase is likely to happen (a sketch of the idea is shown below). Consequently, the tool is slower for applications having very many local maxima of storage variation, as happens during the execution of the 2-D Gaussian blur filter application (see the left part of the memory traces in Fig. 5). On the other hand, the more computationally expensive exact technique can be used to obtain the exact minimum data storage (about half of the values found by the estimation tools) and precise memory traces, like the ones in Fig. 5, allowing the identification of the parts of the application code that require more storage.
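A minimal sketch of the pruning idea described above, not K2's actual implementation; block_upper_bound and exact_max_storage are hypothetical helpers standing in for the tool's internal bound and exact computations:

/* Blocks whose storage upper bound cannot exceed the maximum
   found so far are skipped; only the remaining blocks undergo
   the expensive exact evaluation. */
int max_storage = 0;
for (b = 0; b < num_blocks; b++) {
    if (block_upper_bound(b) <= max_storage)
        continue;                      /* pruned from further analysis */
    m = exact_max_storage(b);          /* costly exact computation */
    if (m > max_storage)
        max_storage = m;
}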

The minimum storage requirements are difficult to use in practical allocation problems, as they typically require significantly more complex hardware for address generation. MC performs a trade-off between memory size and the complexity of the address generation hardware. This causes, for instance, the big difference between the exact size of the 2-D Gaussian blur filter found by K2 and the result of MC (and the estimate of STOREQ) for this application. On the other hand, knowing the minimum storage gives a good idea of the extra storage implied by this compromise.

As can be seen from Table 3, the results of STOREQ are, in general, quite close to the implementations obtained using MC. The goal of STOREQ is not, like K2, to find the smallest possible memory requirement, but to estimate the size of the memory required by a likely implementation; this is also useful at the early loop-transformation stage. The optimized implementation found by MC is assumed to be such a likely implementation. The overestimate of STOREQ compared to MC in the Durbin algorithm is caused by a bounding box around array elements produced along a two-dimensional diagonal line in the iteration space. If this is unacceptable, a more detailed dependency analysis must be performed, so that these cases can be treated separately. Such situations appear quite rarely in our application domain. STOREQ is very fast, enabling use of the tool during the early design step of loop transformation exploration. Even more so since most of the time reported here (over 80% on average) stems from the first data dependency analysis step described in Section 3.1. This part does not have to be repeated if the sizes of multiple alternative (partially fixed) execution orderings are estimated; only the CPU time for size estimation, 0.3 s on average for the applications in Table 3, is then required. Notice also that the STOREQ run-time is close to independent of the number of nodes in a given iteration space, as demonstrated by the two versions of the motion detection kernel. Furthermore, the tool is currently implemented in Matlab; when a future version is available in C++, like the other tools, it will be substantially faster. Together, this makes the STOREQ approach useful even for automatic loop transformation exploration, where the search space can have thousands of possible solutions and where using MC or K2 is not feasible.
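The reuse of the dependency analysis across candidate orderings can be pictured as follows (our sketch, not STOREQ's actual API; the type and function names are hypothetical):

/* The dominant cost (over 80% of the reported CPU time) is paid
   once; each candidate execution ordering then only needs the
   fast size estimation step (about 0.3 s on average in Table 3). */
Deps deps = analyze_dependencies(code);           /* run once */
for (o = 0; o < num_orderings; o++)
    size[o] = estimate_size(&deps, ordering[o]);  /* fast, per ordering */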

5.2 The GLT Trade-Off Real-Life Demonstrator

Subsection 4.3 demonstrated an educative example trading off the data memory size against the number of memory accesses. This subsection demonstrates a more complex, three-dimensional trade-off among the data memory size, the number of memory accesses, and the control-flow complexity for the real-life QSDPCM demonstrator [70]. This trade-off, depicted in Fig. 8, is the combination of the trade-offs discussed in Subsection 4.2 and was first presented in more detail in [71]. The different points in Fig. 8 represent code versions with different loop transformations applied. These transformations are strip mining, loop fusion, and shifting.

In this paper, we only analyze this trade-off w.r.t. the use of the high-level storage size estimators presented in Section 3. To analyze the storage size, we use two approaches: the high-level STOREQ estimator approach, based on the size of the data flow dependency, and the exact memory trace (part of the K2 tool) for the array that participates in the data flow dependency. We first discuss the high-level estimation approach and then show on the memory traces that the estimation is in line with the exact memory trace provided by the K2 tool.

Figure 8 The 3-D exploration space example (#mem. accesses × data storage size × control flow complexity) for a real-life multimedia application (QSDPCM [70]) [71]. The axes are the data size assigned to L1, the number of data transfers (× 10³), and the control-flow complexity (number of executed if statements); the labeled points 1(SM), 2, 3, and 4 are the code versions discussed below.

We start our exploration from the code version that has the best memory size according to the high-level memory estimators. This point is named 1(SM) in our exploration space. For the next code version (going from point 1(SM) to 2), we observe that the data storage size increases from 2,544 elements to 2,589 elements. This has two effects: first, the number of data transfers decreases from 307 × 10³ to 259 × 10³; second, the number of if statement evaluations increases from 2,206 to 2,288. When going from point 2 to 3, we reduce the number of if statement evaluations from 2,288 to 572, i.e., by a factor of 4, for an increase in memory size from 2,589 to 3,249. For the last code version (going from point 3 to 4), we decrease the number of if condition evaluations even further, again by a factor of 4. This improvement is obtained at a minor cost: an increase of the memory size by only 9 locations. This demonstrator shows again that the exploitation of high-level storage size estimation, together with other high-level estimations, is crucial when positioning a code version in the exploration space and determining the Pareto optimality of that code version.

We validate the memory size estimation with an exact memory trace (obtained with the K2 tool, whose model was presented in Section 3.2). The memory traces are depicted in Fig. 9. As already explained in Section 3.2.2, the abscissae are the numbers of data-path instructions in the code and the ordinates are the memory locations in use. Figure 9a shows the memory trace for code version 1(SM), Fig. 9b for code version 2, and Fig. 9c for code version 4. As mentioned above, these versions have the same functionality and differ only in the loop transformations applied. Note that these memory traces are only for one affected array. We can observe the same relative behavior of the peak values of the memory size as in Fig. 8; that is, the relative increase of the estimation matches the peak value of the memory trace. The best code version w.r.t. the memory size is version 1(SM) with 336 storage locations, the next one is version 2 with 376 locations, and the worst is version 4 with 496 locations for the one affected array in the application. However, as stated above, this last version has other benefits resulting from the discussed trade-offs, such as fewer data transfers and fewer if statement evaluations (see also Fig. 8).

Figure 9 Memory traces for several QSDPCM versions of the code (1, 2, and 4) with different storage size requirements (a, b, and c).

6 Conclusions

This paper has given an overview of the techniques for the evaluation of the data memory size in signal processing applications, focusing on the most advanced developments in this research domain. Two alternative techniques have been presented: an estimation approach, investigating the different loop execution orders in non-procedural specifications, and a computation approach, finding the exact minimum data memory size. The paper has addressed the important role of loop transformations in enhancing the memory management of signal processing systems whose behavior is described by high-level, array-oriented specifications. In this context, the paper has discussed basic memory management trade-offs taken into account during the exploration of the design space.

References

1. Catthoor, F., Wuytack, S., De Greef, E., Balasa, F., Nachtergaele, L., & Vandecappelle, A. (1998). Custom memory management methodology: Exploration of memory organization for embedded multimedia system design. Boston: Kluwer Academic Publishers.

2. Panda, P. R., Catthoor, F., Dutt, N., Danckaert, K., Brockmeyer, E., Kulkarni, C., et al. (2001). Data and memory optimization techniques for embedded systems. ACM Transactions on Design Automation of Electronic Systems, 6(2), 149–206, April.

3. Catthoor, F., Danckaert, K., Kulkarni, C., Brockmeyer, E., Kjeldsberg, P. G., Van Achteren, T., et al. (2002). Data access and storage management for embedded programmable processors. Boston: Kluwer Acad. Publ.

4. Van Achteren, T., Deconinck, G., Catthoor, F., & Lauwereins, R. (2002). Data reuse exploration methodology for loop-dominated applications. In Proc. ACM/IEEE design and test in Europe conf. (pp. 428–435). Paris, France (April).

5. Hu, Q., Vandecappelle, A., Palkovic, M., Kjeldsberg, P. G., Brockmeyer, E., & Catthoor, F. (2006). Hierarchical memory size estimation for loop fusion and loop shifting in data-dominated applications. In Proc. Asia & S.-Pacific design automation conf. (pp. 606–611). Yokohama, Japan (January).

6. Luican, I. I., Zhu, H., & Balasa, F. (2006). Formal model of data reuse analysis for hierarchical memory organizations. In Proc. IEEE/ACM int. conf. comp.-aided design (pp. 595–600). San Jose CA (November).

7. Brockmeyer, E., Miranda, M., Catthoor, F., & Corporaal, H. (2003). Layer assignment techniques for low energy in multi-layered memory organisations. In Proc. ACM/IEEE design automation and test in Europe conf. (pp. 1070–1075). Munich, Germany (March).

8. Tseng, C. J., & Siewiorek, D. (1986). Automated synthesis of data paths in digital systems. IEEE Transactions on Computer-Aided Design of ICs and Systems, CAD-5(3), 379–395, July.

9. Goossens, G., Rabaey, J., Vandewalle, J., & De Man, H. (1987). An efficient microcode compiler for custom DSP processors. In Proc. IEEE int. conf. comp.-aided design (pp. 24–27). Santa Clara CA (November).

10. Hashimoto, A., & Stevens, J. (1971). Wire routing by optimizing channel assignment within large apertures. In Proc. 8th design automation workshop (pp. 155–169).

11. Kurdahi, F. J., & Parker, A. C. (1987). REAL: A program for register allocation. In Proc. 24th ACM/IEEE design automation conf. (pp. 210–215).

12. Goossens, G. (1989). Optimization techniques for automated synthesis of application-specific signal-processing architectures. Ph.D. thesis, K.U. Leuven, Belgium.

13. Paulin, P. G., & Knight, J. P. (1989). Force-directed scheduling for the behavioral synthesis of ASIC's. IEEE Transactions on Computer-Aided Design of ICs and Systems, 8(6), 661–679, June.

14. Gebotys, C. H., & Elmasry, M. I. (1992). Optimal VLSI architectural synthesis. Boston: Kluwer Academic Publ.

15. Stok, L., & Jess, J. (1992). Foreground memory management in data path synthesis. International Journal of Circuit Theory and Applications, 20, 235–255.

16. Parhi, K. K. (1994). Calculation of minimum number of registers in arbitrary life time chart. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 41(6), 434–436.

17. Ohm, S. Y., Kurdahi, F. J., & Dutt, N. (1994). Comprehensive lower bound estimation from behavioral descriptions. In Proc. IEEE/ACM int. conf. on computer-aided design (pp. 182–187).

18. Gajski, D., Vahid, F., Narayan, S., & Gong, J. (1994). Specification and design of embedded systems. Englewood Cliffs: Prentice Hall.

19. Verbauwhede, I., Scheers, C., & Rabaey, J. M. (1994). Memory estimation for high level synthesis. In Proc. 31st ACM/IEEE design automation conf. (pp. 143–148) (June).

20. Verbauwhede, I., Catthoor, F., Vandewalle, J., & De Man, H. (1989). Background memory management for the synthesis of algebraic algorithms on multi-processor DSP chips. In Proc. int. conf. on VLSI (pp. 209–218). Munich, Germany (August).

21. Grun, P., Balasa, F., & Dutt, N. (1998). Memory size estimation for multimedia applications. In Proc. 6th int. workshop hardware/software co-design (pp. 145–149). Seattle WA (March).

22. Zhao, Y., & Malik, S. (2000). Exact memory size estimation for array computations. IEEE Transactions on VLSI Systems, 8(5), 517–521.

23. Ramanujam, J., Hong, J., Kandemir, M., & Narayan, A. (2001). Reducing memory requirements of nested loops for embedded systems. In Proc. 38th ACM/IEEE design automation conf. (pp. 359–364) (June).

24. Balasa, F., Catthoor, F., & De Man, H. (1995). Background memory area estimation for multi-dimensional signal processing systems. IEEE Transactions on VLSI Systems, 3(2), 157–172, June.

25. Balasa, F., Catthoor, F., & De Man, H. (1997). Practical solutions for counting scalars and dependences in ATOMIUM, a memory management system for multi-dimensional signal processing. IEEE Transactions on CAD of ICs and Systems, 16(2), 133–145, February.

26. Darte, A., Schreiber, R., & Villard, G. (2005). Lattice-based memory allocation. IEEE Transactions on Computers, 54, 1242–1257, October.

27. Banerjee, U. (1988). Dependence analysis for supercomputing. Boston: Kluwer Acad. Publ.

28. IMEC (2006). Atomium web site. http://www.imec.be/design/atomium/.

29. Hu, Q., Vandecappelle, A., Kjeldsberg, P. G., Catthoor, F., & Palkovic, M. (2007). Fast memory footprint estimation based on dependency distance vector calculation. In Proc. ACM/IEEE design automation and test in Europe (pp. 379–384). Nice, France (April).

30. Kjeldsberg, P. G., Catthoor, F., & Aas, E. J. (2003). Data dependency size estimation for use in memory optimization. IEEE Transactions on CAD of ICs and Systems, 22(7), 908–921, July.

31. Kjeldsberg, P. G., Catthoor, F., & Aas, E. J. (2004). Storage requirement estimation for optimized design of data intensive applications. ACM Transactions on Design Automation of Electronic Systems, 9, 133–158, April.

32. Kjeldsberg, P. G., Catthoor, F., & Aas, E. J. (2001). Detection of partially simultaneously alive signals in storage requirement estimation for data-intensive applications. In Proc. 38th ACM/IEEE design automation conf. (pp. 365–370). Las Vegas NV (June).

33. Moonen, M., Van Dooren, P., & Vandewalle, J. (1992). An SVD updating algorithm for subspace tracking. SIAM Journal on Matrix Analysis and Applications, 13(4), 1015–1038.

34. Schrijver, A. (1986). Theory of linear and integer programming. New York: Wiley.

35. Thiele, L. (1992). Compiler techniques for massive parallel architectures. In P. Dewilde (Ed.), State-of-the-art in computer science. Boston: Kluwer Acad. Publ.

36. Zhu, H., Luican, I. I., & Balasa, F. (2006). Memory size computation for multimedia processing applications. In Proc. Asia & South-Pacific design automation conf. (pp. 802–807). Yokohama, Japan (January).

37. Balasa, F., Zhu, H., & Luican, I. I. (2007). Computation of storage requirements for multi-dimensional signal processing applications. IEEE Transactions on VLSI Systems, 15(4), 447–460, April.

38. Pugh, W., & Wonnacott, D. (1993). An exact method for analysis of value-based array data dependences. In Proc. 6th int. workshop languages and compilers for parallel computing (pp. 546–566). Portland OR (August).

39. Verdoolaege, S., Beyls, K., Bruynooghe, M., & Catthoor, F. (2005). Experiences with enumeration of integer projections of parametric polytopes. In R. Bodik (Ed.), Compiler construction: 14th int. conf. (Vol. 3443, pp. 91–105). Berlin: Springer.

40. Clauss, Ph., & Loechner, V. (1998). Parametric analysis of polyhedral iteration spaces. Journal of VLSI Signal Processing, 19(2), 179–194.


41. Verdoolaege, S., Seghir, R., Beyls, K., Loechner, V., & Bruynooghe, M. (2004). Analytical computation of Ehrhart polynomials: Enabling more compiler analyses and optimizations. In Proc. int. conf. compilers arch. and synthesis for embedded syst. (pp. 248–258) (September).

42. Dantzig, G. B., & Eaves, B. C. (1973). Fourier-Motzkin elimination and its dual. Journal of Combinatorial Theory (A), 14, 288–297.

43. Pugh, W. (1992). A practical algorithm for exact array dependence analysis. Communications of the ACM, 35(8), 102–114, August.

44. Barvinok, A. I. (1994). A polynomial time algorithm for counting integral points in polyhedra when the dimension is fixed. Mathematics of Operations Research, 19(4), 769–779, November.

45. De Greef, E., Catthoor, F., & De Man, H. (1997). Memory size reduction through storage order optimization for embedded parallel multimedia applications. In A. Krikelis (Ed.), Parallel computing, special issue on "Parallel processing and multi-media" (Vol. 23, no. 12). Amsterdam: Elsevier (December).

46. Tronçon, R., Bruynooghe, M., Janssens, G., & Catthoor, F. (2002). Storage size reduction by in-place mapping of arrays. In A. Cortesi (Ed.), Verification, model checking and abstract interpretation (pp. 167–181).

47. Luican, I. I., Zhu, H., & Balasa, F. (2007). Signal-to-memory mapping analysis for multimedia signal processing. In Proc. Asia & South-Pacific design automation conf. (pp. 486–491). Yokohama, Japan (January).

48. Absar, J., Catthoor, F., & Das, K. (2003). Call-instance based function inlining for increasing data access related optimization opportunities. Technical report, IMEC, Leuven, Belgium.

49. Dasygenis, M., Brockmeyer, E., Durinck, B., Catthoor, F., Soudris, D., & Thanailakis, A. (2004). Power, performance and area exploration for data memory assignment of multimedia applications. In A. Pimentel, & S. Vassiliadis (Eds.), Proc. systems, architectures, modeling, and simulation, LNCS (Vol. 3133, pp. 540–549). Samos: Springer-Verlag (June).

50. Shashidar, K. C., Vandecappelle, A., & Catthoor, F. (2001). Low power design of turbo decoder module with exploration of energy-performance trade-offs. In Proc. workshop on compilers and operating systems for low power, in conjunction with int. conf. on parallel arch. and compilation techniques (pp. 10.1–10.6). Barcelona, Spain (September).

51. Bastoul, C. (2004). Code generation in the polyhedral model is easier than you think. In Proc. int. conf. on parallel arch. and compilation techniques (pp. 7–16). Juan-les-Pins, France (September).

52. Quillere, F., Rajopadhye, S., & Wilde, D. (2000). Generation of efficient nested loops from polyhedra. International Journal of Parallel Programming, 28(5), October.

53. Banerjee, U., Eigenmann, R., Nicolau, A., & Padua, D. (1993). Automatic program parallelization. Proceedings of the IEEE, 81(2), 211–243, February.

54. Feautrier, P. (1991). Dataflow analysis of array and scalar references. International Journal of Parallel Programming, 20(1), 23–52.

55. Darte, A., & Robert, Y. (1995). Affine-by-statement scheduling of uniform and affine loop nests over parametric domains. Journal of Parallel and Distributed Computing, 29(1), 43–59.

56. Kandemir, M., Ramanujam, J., Choudhary, A., & Banerjee, P. (2001). A layout-conscious iteration space transformation technique. IEEE Transactions on Computers, 50(12), 1321–1335.

57. Kelly, W., & Pugh, W. (1993). A framework for unifying reordering transformations. University of Maryland, College Park, CS-TR-3193 (April).

58. Wolf, M. E., & Lam, M. S. (1991). A data locality optimizing algorithm. In Proc. SIGPLAN conf. on programming language design and implementation (pp. 30–43). Toronto, Canada (June).

59. Bastoul, C., Cohen, A., Girbal, S., Sharma, S., & Temam, O. (2003). Putting polyhedral loop transformations to work. In Proc. int. workshop languages & compilers for parallel comput. (pp. 209–225) (September).

60. Semeria, L., & De Micheli, G. (1998). SpC: Synthesis of pointers in C. In Proc. IEEE/ACM int. conf. comp.-aided design (pp. 340–346). Santa Clara CA (November).

61. Franke, B., & O'Boyle, M. (2003). Array recovery and high-level transformations for DSP applications. ACM Transactions on Embedded Computing Systems, 2(2), 132–162, May.

62. Vanbroekhoven, P., Janssens, G., Bruynooghe, M., Corporaal, H., & Catthoor, F. (2005). Transformation to dynamic single assignment using a simple data flow analysis. In Proc. 3rd Asian symp. on programming languages and syst., Tsukuba, Japan, Lecture Notes in Computer Science (Vol. 3780, pp. 330–346). Springer Verlag (November).

63. Vanbroekhoven, P., Janssens, G., Bruynooghe, M., Corporaal, H., & Catthoor, F. (2003). Advanced copy propagation for arrays. In Proc. SIGPLAN conf. languages, compilers, and tools for embedded syst. (pp. 24–33). San Diego CA (June).

64. Palkovic, M., Corporaal, H., & Catthoor, F. (2005). Global memory optimisation for embedded systems allowed by code duplication. In Proc. 9th int. workshop on software and compilers for embedded systems (pp. 72–80) (September).

65. Palkovic, M., Corporaal, H., & Catthoor, F. (2008). Dealing with data dependent conditions to enable general global source code transformations. International Journal of Embedded Systems, Interscience Publ. (in press).

66. Palkovic, M., Brockmeyer, E., Vanbroekhoven, P., Corporaal, H., & Catthoor, F. (2006). Systematic preprocessing of data dependent constructs for embedded systems. Journal of Low Power Electronics, 2(1), 9–17, April.

67. Catthoor, F., & Brockmeyer, E. (2000). Unified meta-flow summary for low-power data-dominated applications. In F. Catthoor (Ed.), Unified low-power design flow for data-dominated multi-media and telecom applications (pp. 7–23). Boston: Kluwer Acad. Publ.

68. Pareto, V. (1896). Cours D'Economie Politique, volume I–II. Lausanne.

69. Tom's Hardware (2003). Benchmark marathon: 65 CPUs from 100 MHz to 3066 MHz (Online). Available: http://www.tomshardware.com/2003/02/17/benchmark_marathon/index.html

70. Strobach, P. (1988). QSDPCM: A new technique in scene adaptive coding. In Proc. 4th Eur. signal processing conf. (pp. 1141–1144). Grenoble, France. Amsterdam: Elsevier Publ. (September).

71. Palkovic, M. (2007). Enhanced applicability of loop transformations (Chapter 6). Ph.D. thesis, Eindhoven University of Technology, Dept. of Electrical Eng. (September).


F. Balasa received the M.S. and Ph.D. degrees in computer science from the Polytechnical University of Bucharest, Bucharest, Romania, in 1981 and 1994, respectively, the M.S. degree in mathematics from the University of Bucharest, Bucharest, Romania, in 1990, and the Ph.D. degree in electrical engineering from the Katholieke Universiteit Leuven, Leuven, Belgium, in 1995.

He is currently an associate professor of computer science at Southern Utah University, Cedar City, Utah, U.S.A. From 1990 to 1995, he was with the Interuniversity Microelectronics Center (IMEC), Leuven, Belgium. From 1995 to 1999, he was a senior design automation engineer at the Advanced Technology Division of Conexant Systems (formerly Rockwell Semiconductor Systems), Newport Beach, California, U.S.A. From 2000 to 2007, he was with the Dept. of Computer Science at the University of Illinois at Chicago, Chicago, U.S.A.

His research interests span different areas of VLSI CAD. Dr. Balasa was a recipient of the U.S. National Science Foundation CAREER Award.

P. G. Kjeldsberg was born in Trondheim, Norway, in 1966. He received his Sivilingeniør degree (M.Sc.) in electrical engineering in 1992 from the Norwegian Institute of Technology. In 2001 he received the degree of Doktor ingeniør (Ph.D.) from the same institution (now the Norwegian University of Science and Technology, NTNU). Between 1992 and 1996 he worked as a design engineer at Eidsvoll Electronics, designing communication control equipment based on embedded hw/sw solutions. During his doctoral studies, he focused on storage requirement estimation and optimization for data intensive applications. The research was performed in close cooperation with IMEC, Leuven, Belgium, where he was a visiting researcher for nine months in all. Kjeldsberg has published a number of conference and journal papers, and has been coauthor of a book in his field of interest. Currently, he is an Associate Professor at the Department of Electronics and Telecommunications, NTNU. His research interests are embedded hw/sw systems, with a focus on multimedia and digital signal processing applications. Between October 2005 and June 2006, Kjeldsberg was a visiting researcher at the University of California, Irvine, Center for Embedded Computer Systems. Kjeldsberg is and has been a member of the board of directors both at the Faculty and in private companies. He is frequently used as a reviewer for several international journals and conferences.

A. Vandecappelle received his M.Sc. in Computer Sciences from the Katholieke Universiteit Leuven, Belgium, in 1997. He joined IMEC, Leuven, Belgium, as a researcher and software architect. His research interests include code optimizations targeting memory energy and speed bottlenecks. He has worked on various topics including custom memory architecture design, high-level optimisation of address expressions, global loop transformations, efficient scratch-pad memory usage, and parallelisation of sequential code. In 2008, he left IMEC to take on a job as a senior embedded software architect at Essensium/Mind, where he develops Open Source embedded software projects based on Linux, eCos, RTEMS and other Open Source components.

M. Palkovic was born in Bratislava, Slovakia, in 1977. He received his M.Sc. degree in Electrical Engineering from the Slovak University of Technology, Bratislava, Slovakia, in 2001, and his Ph.D. degree in Electrical Engineering from the Technische Universiteit Eindhoven, Eindhoven, The Netherlands, in 2007. Since March 2001 he has been a researcher in the Design Technology group of the Nomadic Embedded Systems division at the Inter-University Micro Electronics Center (IMEC), Leuven, Belgium. His research interests include high-level optimizations in data dominated multimedia applications and wireless systems, related aspects of system design automation, and platform architectures for low power.

Q. Hu received his Ph.D. from the Norwegian University of Science and Technology in Trondheim, Norway, in 2007, and his M.Sc. from the Royal Institute of Technology in Stockholm, Sweden, in 2002. He is currently a design engineer at Atmel Norway. His research interests are in embedded systems design and low power microcontroller design, with special emphasis on hw/sw codesign and the development of methods and algorithms for system-level optimization and estimation of embedded multimedia applications for low power.

H. Zhu received the B.S. degree in electrical engineering from Xi'an Jiaotong University, P.R. China, in 1996, and his M.S. degree in electrical and electronics engineering from Nanyang Technological University, Singapore, in 2001. He received his Ph.D. degree in computer science from the University of Illinois at Chicago (UIC), U.S.A., in 2007. Before joining UIC, he worked as a senior R&D engineer at JVC Asia Pte. Ltd., Singapore. Currently, he is a software engineer in the Design Automation Group of the Physical IP Dept. at ARM, Sunnyvale, California, U.S.A.

His research interests are mainly focused on memory management for real-time multidimensional signal processing and combinatorial optimization in VLSI CAD.

F. Catthoor received the engineering degree and a Ph.D. in electrical engineering from the Katholieke Universiteit Leuven, Belgium, in 1982 and 1987, respectively. Since 1987, he has headed several research domains in the area of high-level and system synthesis techniques and architectural methodologies, all within the Design Technology for Integrated Information and Telecom Systems (DESICS, formerly VSDM) division at IMEC, Leuven, Belgium. He has been an assistant professor at the EE department of the K.U. Leuven since 1989, and a full professor (part-time) since 2000. His current research activities belong to the field of architecture design methods and system-level exploration for power and memory footprint within real-time constraints, oriented towards data storage management, global data transfer optimization, and concurrency exploitation. Platforms that contain both customized architectures and (parallel) programmable instruction-set processors are targeted. Technology-aware design aspects are also a strong focus. He also has an extensive scientific curriculum, including over 500 international papers, former TPC membership of many major conferences, and current or earlier editorship in, among others, IEEE Trans. on VLSI Systems, IEEE Trans. on Multi-Media, and ACM Trans. on Design Automation for Embedded Systems. In 1986 he received the Young Scientist Award from the Marconi International Fellowship Council. He was elected an IEEE Fellow in 2005.