Wouter Kampmann, Lieven Lemiengre
Automatic Generation of Multi-core Stressmarks
Master's dissertation submitted to obtain the academic degree of
Master of Engineering: Computer Science

Promoter: prof. dr. ir. Lieven Eeckhout
Supervisor: Stijn Polfliet

Department of Electronics and Information Systems
Chair: prof. dr. ir. Jan Van Campenhout
Faculty of Engineering
Academic year 2009-2010
Automatic Generation of Multi-core Stressmarks
Wouter Kampmann and Lieven Lemiengre
Supervisor(s): Lieven Eeckhout, Stijn Polfliet
Abstract— This article describes a framework for the development of platform-portable stressmarks. Estimating the practical power and thermal characteristics of a processor is vital to evaluate power and thermal management strategies, to examine hotspots that may damage the processor or reduce the chip's lifetime, and to dimension cooling solutions and power circuitry. The proposed framework makes it possible to automatically generate optimized stressmarks for almost any platform.
Keywords— stressmark, platform-independent, synthetic benchmark, portable, power dissipation
I. INTRODUCTION
In the past few years, it has become apparent that the power and thermal characteristics of a processor have become a first-class design constraint. For a long time, the maximum processor power consumption increased by a factor of about two every four years [4], [3]. This trend could, of course, not continue, and it came to a halt around 2002, when the industry hit the power wall. Power consumption and thermal dissipation could not increase any further, and controlling them now requires attention at every stage of the microprocessor design flow.
Conventional benchmarks can be used to estimate the power and thermal characteristics of a typical workload. However, they are unsuitable for estimating the maximum power and operating temperature characteristics [1]. It is therefore important to analyse the practical worst-case behavior of a processor.
The worst-case maximum power consumption and temperature can be used to develop power and thermal management strategies. Another application is using the worst-case behavior to dimension the thermal package and the power supply circuitry for the processor.
The current practice in the industry is to develop hand-crafted stressmarks. These stressmarks are developed by specialists that have very detailed knowledge of the microprocessor architecture. It is a very tedious and time-consuming job. The resulting stressmark is processor-specific, so this work has to be repeated if the micro-architecture is modified.
We developed a framework to automate the creation of stressmarks. We based our work on the StressMarker framework, which could automatically generate stressmarks for the Alpha 21264 microprocessor architecture [1]. The key idea is to generate synthetic benchmarks based on an abstract workload description; a machine learning algorithm then optimizes this workload description to induce certain thermal or power characteristics.
Our contributions:
• Our aim is to make the framework platform-portable, and we achieve this by generating the synthetic benchmarks purely in C language constructs. The resulting C program is compiled for the target platform. This means that our framework is unaware of the underlying platform; the platform-specific details are filled in by the compiler. This allows the framework to generate stressmarks for a very wide range of systems. We verified the results for MIPS and x86-64 targets.
• A stressmark is described by a number of abstract parameters, each determining an aspect of either the target platform or the workload of the stressmark. We minimized the number of platform-specific parameters and specialized this workload model for generating stressmarks. We also extended the workload model to support generating multi-threaded stressmarks. The result is a lean workload model, specialized for stressmarks, that uses fewer parameters than the StressMarker framework [1] (30 instead of 40) while offering more functionality.
II. FRAMEWORK WORKFLOW
The framework workflow consists of four steps. We start with a workload description, which is then transformed into a C stressmark. The C stressmark is first compiled and then executed on a test platform. The measurements are fed into the machine learning algorithm, which generates an optimized workload, and the cycle is complete. As machine learning algorithm we use a genetic search algorithm.
[Figure: a cycle of abstract workload model → synthetic benchmark → measurements (SESC / HPC) → optimization.]
Fig. 1. Framework workflow.
III. DEVELOPING THE WORKLOAD MODEL
Stressmarks are described by two kinds of parameters: platform parameters and workload parameters. The platform parameters are the number of hardware threads and the size of a cache line. These parameters can easily be defined for any platform. The second type of parameter describes the workload of the stressmark. It is these parameters that the machine learning algorithm will optimize. They were chosen based on the research by Joshi et al. [1] and describe a set of hardware-independent workload characteristics. We tried to minimize the number of parameters because fewer parameters result in a smaller search space, allowing the machine learning algorithm to work more efficiently.
A. Workload Parameters
The workload model consists of four major parts: the instruction mix, the minimal dependency between instructions, the data and instruction footprint, and the memory striding behavior.
A.1 Instruction Mix
This part consists of a high-level distribution of the proportions of arithmetic, memory, and branch instructions. For each instruction type, a more specialized distribution is then defined. For arithmetic instructions, it is the distribution of each arithmetic operation, defined by its datatype and numeric operation. Memory instructions are characterized as loads or stores, working in shared or thread-local memory. For branch instructions, their branch behavior is defined.
A.2 Minimum Dependency Distance
The dependency distance is the number of instructions between two dependent instructions. Since we only work with out-of-order processors, we only considered RAW (read-after-write) dependencies. Instruction dependencies limit the instruction-level parallelism, so this parameter is essentially a measure of the ILP.
A.3 Data and Instruction Footprint
The footprints are the number of unique data and instruction addresses referenced while running the stressmark. The size of these will affect the stress on the memory subsystem, particularly the caches.
A.4 Memory Striding Behavior
Memory instructions in the stressmark may exhibit some dynamic behavior. Some memory instructions read from or write to the same address every time they are executed. Other memory instructions walk through the memory, using a different address every time they are executed. We use data stream strides to model this behavior.
IV. GENERATING C BENCHMARKS
We want the framework to be platform-portable, meaning that, given the platform-dependent parameters, it should be capable of generating stressmarks for almost any platform, without knowing the instruction set or register set. To achieve this, we use the low-level programming language C instead of assembler to express the stressmark. Once the stressmark is compiled for the target platform, we obtain an executable stressmark.
One of the inherent difficulties of this approach is the optimizing behavior of the compiler. Compilers are made to optimize away redundant code, loop invariants, etc., to increase the program's performance. However, when a stressmark is compiled, some compiler optimizations could change characteristics reflecting the workload model; these optimizations are undesired. Unfortunately, we cannot disable optimization entirely, as we still rely on intelligent instruction selection and register allocation for efficiency and correctness.
We addressed the compiler optimization problem using two approaches. First, we dumbed the compiler down to a minimum level of optimization, using a predefined optimization level tweaked with special compiler flags. We then designed the structure of the stressmark to be immune to the remaining optimizations of the dumbed-down compiler.
The resulting method is capable of generating effective stressmarks using C. The only significant drawback of our technique is the heavy register usage needed to maintain the minimum dependency distance. If the platform does not have enough hardware registers, there is a risk of register spilling.
V. RESULTS
A. Test Platforms
We used two platforms to verify the effectiveness of our framework. First, we set up a simulated SMP MIPS platform, on which we optimize for maximal power usage. The other platform is a real-world system: an Intel Core 2 Quad processor that we optimized for maximum IPC; unfortunately, the optimization gets stuck at a local maximum (IPC ≈ 3) because it does not use any memory instructions.
[Figure: two optimization curves. Top: power (W, 0–160) versus generation (0–100) for the MIPS platform. Bottom: IPC (0–3) versus generation (0–28) for the x86-64 platform.]
Fig. 2. Results: top: MIPS, bottom: x86-64.
VI. CONCLUSION
We showed that we can generate effective multi-threaded stressmarks using C as the implementation language. Because of this, our framework for automated stressmark generation is platform-portable.
REFERENCES
[1] Ajay M. Joshi, Lieven Eeckhout, Lizy Kurian John, and Ciji Isen. Automated microprocessor stressmark generation. In HPCA [2], pages 229–239.
[2] 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 16-20 February 2008, Salt Lake City, UT, USA. IEEE Computer Society, 2008.
[3] S. H. Gunther, F. Binns, D. M. Carmean, and J. C. Hall. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, 5(1), 2001.
[4] Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3):202–210, 2005.
Preface
During the past nine months, we have been looking forward tremendously to the completion
of this document you are now holding. We hope you enjoy reading it as much as we enjoyed
the experience of preparing and writing it.
In the first chapter, we introduce the concept of stressmarks and briefly discuss some impor-
tant trends and events motivating the subject of our master thesis.
The second chapter contains an overview of the workload model we defined with a description
of the different workload parameters it contains. We discuss the consequences of the design
choices we made and how they relate to the characteristics of a generated stressmark.
In the third chapter, the main component of our framework, the stressmark generator, is
explained. We take a closer look at how the workload model is transformed into an executable
synthetic benchmark.
Chapter four explains how we employed a genetic algorithm in order to turn synthetic bench-
marks into stressmarks, optimizing for an output characteristic such as power usage or the
number of instructions per cycle.
The fifth chapter combines the components described in the foregoing chapters, giving a high-
level overview of the entirety of our StressmarkRunner framework. The two different platforms
we set up are discussed in preparation of the next chapter.
In the sixth chapter, we discuss in detail the different results we obtained from running the
framework on our two target platforms, and how we verified the correctness and performance
of our genetic algorithm and stressmark generator component.
Chapter seven concludes this document by providing a final overview and some closing remarks
on the work we did.
Acknowledgements
Producing a master thesis can be quite a daunting task. We would therefore like to thank all
those who have supported and guided us throughout this endeavor, especially our supervisors,
professor Lieven Eeckhout, and Stijn Polfliet.
Wouter Kampmann and Lieven Lemiengre, May 2010
Usage restrictions
“The authors give permission to make this master dissertation available for consultation and
to copy parts of this master dissertation for personal use.
In the case of any other use, the limitations of the copyright have to be respected, in particular
with regard to the obligation to state expressly the source when quoting results from this
master dissertation.”
Wouter Kampmann and Lieven Lemiengre, May 2010
Contents

Preface iv
Usage restrictions vi
Acronyms ix

1 Introduction 1
  1.1 Before the Power Wall 1
  1.2 Hitting the Power Wall 1
  1.3 Consequences 2
  1.4 Stressmarks 3

2 The Workload Model 5
  2.1 Stressmarks 5
  2.2 Workload Parameters 5
  2.3 Workload Summary 9
  2.4 Discussion 10

3 Synthetic Benchmarks in C 15
  3.1 Introduction 15
  3.2 Language and Compiler Requirements 15
  3.3 Exploring the Optimization Behavior of the Compiler 19
  3.4 Interesting C Constructs 26
  3.5 Forming the Stressmark 27
  3.6 Remarks 34
  3.7 Stressmark Generation 36

4 Stressmark Optimization 39
  4.1 Introduction 39
  4.2 Genetic Search Algorithm 40
  4.3 Meta Algorithm 42

5 The Stressmark Runner Framework 44
  5.1 Design Considerations 44
  5.2 Platform Setups 50
  5.3 Framework Architecture 53

6 Results 66
  6.1 Number of SESC Instructions 66
  6.2 Exploration of Search Space 68
  6.3 GA Results 73
  6.4 GA Efficiency 79
  6.5 Theoretical Maximum 81

7 Conclusion 83
Bibliography 86
List of Figures 88
List of Tables 90
Acronyms
ACID Atomicity, Consistency, Isolation, Durability
ALU Arithmetic Logic Unit
API Application Programming Interface
BTB Branch Target Buffer
IDE Integrated Development Environment
ILP Instruction Level Parallelism
MDD Minimum Dependency Distance
RAW Read after Write
SIMD Single Instruction Multiple Data
WAR Write after Read
WAW Write after Write
XML Extensible Markup Language
YAML YAML Ain’t Markup Language
Chapter 1
Introduction
For a long time, every four years the maximum power consumption of processors increased
by a factor of a little more than two. This evolution lasted about fifteen years from 1986
until 2002. Around 2002 things changed; processor manufacturers hit the power wall and
ever since, the power consumption of processors has only marginally increased.
1.1 Before the Power Wall
Before we look at the consequences of the power wall, let us look at the period pre-dating
it. For about fifteen years, processor manufacturers were able to increase the single-threaded
performance of their processors by about 50% every year [11]. How did they achieve this?
1. Moore's law: The number of transistors that can be placed inexpensively in an integrated
circuit doubles about every two years [13].
2. RISC processors: Using simple instructions that are easy to pipeline.
3. Out-of-order execution: Mining ILP in single-threaded workloads by employing specu-
lative execution.
4. Increasing the clock frequency: requiring deeper pipelines.
1.2 Hitting the Power Wall
This evolution came to an end around 2002. You could say that for a long time, power was
free and transistors were expensive, but around that time the situation turned around [8].
What happened?
Figure 1.1: SPECint performance over the years (image source: [11]).
1. A higher clock frequency means more power consumption and heat dissipation, and
cooling solutions can only take you so far. Their cost increases exponentially with the
thermal dissipation [9].
2. ILP wall: Serial performance can be improved by mining more ILP but this requires
more speculative execution. The law of diminishing returns applies; at some point the
hardware cost to mine more ILP starts increasing exponentially. This is not beneficial
for the performance per watt.
3. Speed of light: It takes many clock pulses to transport data from one side of the chip
to the other.
1.3 Consequences
For the past few years, the power consumption and the clock frequencies have basically stayed
the same.
However, Moore’s law still applies, so the transistor budgets are still increasing at the same
speed. To improve performance, processor manufacturers are now focusing on multi-core
processors. If we look at the immediate future, we also see that more and more functionality
is being integrated into the CPU. For example, Intel and AMD are going to launch a CPU
with a GPU integrated on the die next year, memory controllers are already integrated on
most CPUs, and PCI-express controllers are also being integrated on some Intel processors.
Figure 1.2: Power wall, frequency wall and ILP wall (image source: [14]).
It is clear that power and thermal characteristics of a processor are becoming increasingly more
important. They have become a first-class design constraint for high-performance processors
and should be considered at every stage of the microprocessor design flow.
1.4 Stressmarks
Conventional benchmarks can be used to estimate the power and thermal characteristics of
a typical workload. However, they are unsuitable for estimating the maximum power and
thermal characteristics [12]. There actually is an increasing disparity between the maximum
power consumption and the power consumed while running more typical applications [9]. This
growing difference in power consumption confronts the system designer with a difficult problem.
The system should be designed to ensure the processor does not exceed the specified maximum
operating temperature, even if those circumstances are extremely rare.
The worst-case behavior of a processor can be used for a number of applications:
1. Developing power and thermal management strategies for the processor. The processor
could reduce its frequency if it is in danger of overheating; this can reduce the cost of the
cooling solution. Recently, the opposite strategy has become possible as well: a multi-core
processor could over-clock one core to improve single-threaded performance if the other
cores are idle [2].
2. Finding hotspots: Hotspots are small regions on a chip that dissipate a large amount
of power for a short time. This localized overheating can reduce the lifetime of a chip,
cause timing problems, degrade circuit performance, and even cause chip failure.
3. The worst-case behavior can also be used to dimension the cooling solution and the
power supply circuitry for the processor.
Current practice in the industry is to develop hand-crafted stressmarks. These stressmarks
are developed by specialists that have a very detailed knowledge of the microprocessor ar-
chitecture. It is a very tedious and time-consuming job. Moreover the resulting stressmark
is processor-specific, so this work will have to be repeated if the processor architecture is
changed.
We believe the process of generating stressmarks can be automated. Our work is based on
the StressMarker framework [12] that could automatically generate stressmarks for the Alpha
21264. The key idea is to generate synthetic benchmarks based on an abstract workload
description. A machine learning algorithm then optimizes this workload to induce certain
thermal or power characteristics.
We have created a similar framework with a few new features:
• Our framework is platform-portable. We achieve this by generating synthetic benchmarks purely in C language constructs.
• We developed a workload model that is specialized for stressmark generation.
• We can generate multi-threaded stressmarks that communicate through memory to stress cache coherence protocols and inter-cache communication.
Chapter 2
The Workload Model
2.1 Stressmarks
Stressmarks are synthetic benchmarks specially constructed to stress a specific part
of the processor under test. In our case, these benchmarks are optimized using a machine
learning algorithm to induce certain power or thermal characteristics.
A stressmark is described by a number of abstract parameters, each determining an aspect of
either the target platform, or the workload of the stressmark. The synthetic benchmark gen-
erator processes these parameters and generates the stressmark’s code written in C language
constructs.
The first type of parameters describes the target platform, determining key properties of the
processor the stressmark is being developed for. We opted to keep the number of parameters
to a minimum in our framework, so we pick only cache line size and the number of hardware
threads as platform parameters. Note that these two generally apply to any processor.
The second type of parameters describes the workload of the stressmark. These are the
parameters that the machine learning algorithm optimizes in order to create an efficient
stressmark. We have chosen them based on prior research by Joshi et al. [12], introduced
some simplifications, and extended the parameters to support multi-threaded stressmarks. In
the next section, we discuss these workload parameters in detail.
2.2 Workload Parameters
The workload model consists of four major parts: the instruction mix, the minimal depen-
dency between instructions, the data and instruction footprint, and the memory striding
behavior.
2.2.1 Instruction Mix
We start by defining a distribution describing the proportion of arithmetic, memory, and
branch instructions. For each of these general instruction types, we then define another
distribution, determining the relative frequencies of instructions of more specific subtypes
(e.g. integer addition, double multiplication, etc. for the arithmetic instructions).
Arithmetic Instructions
Arithmetic instructions take two operand registers, perform some operation on them and
store the result in a register. These instructions are characterized by an operation and the
data type of the operands. Our framework supports integer, and single and double precision
floating point as data types. The supported operations are addition, multiplication, and
division.
The relative frequencies of all arithmetic instructions that should be used in the stressmark
are combined in an arithmetic instruction profile. These instructions should stress the ALUs
responsible for arithmetic calculations in the processor.
Table 2.1: Example of an arithmetic instruction profile
Datatype Operation Relative frequency
Integer Add 20%
Integer Mul 20%
Integer Div 20%
Double Add 20%
Double Mul 10%
Double Div 10%
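To make the profile concrete, the selection of instructions could be sketched roughly as follows (a minimal illustration; the enum, array, and function names are ours, not the framework's): the generator walks the cumulative distribution until the random draw falls inside an operation's bucket.

```c
/* Sketch: drawing an arithmetic instruction from the relative-frequency
 * profile of Table 2.1. The draw is a number in 0..99. */
enum arith_op { INT_ADD, INT_MUL, INT_DIV, DBL_ADD, DBL_MUL, DBL_DIV };

static const int profile[6] = { 20, 20, 20, 20, 10, 10 }; /* in % */

enum arith_op pick_op(int draw)
{
    int cumulative = 0;
    for (int i = 0; i < 6; i++) {
        cumulative += profile[i];
        if (draw < cumulative)            /* draw falls in bucket i */
            return (enum arith_op)i;
    }
    return DBL_DIV; /* unreachable when the profile sums to 100 */
}
```

With this profile, draws 0–19 yield an integer addition and draws 90–99 a double division, so the generated instruction stream matches the requested relative frequencies.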
Memory Instructions
A memory instruction takes an address and either reads from that address and writes its value
to a register, or writes the register value to the address. The address used by the memory
instruction can point to a shared portion of the memory that is used by all threads in the
system, or a local portion of memory specifically allocated for the thread. Consequently, we
can define four kinds of memory operations in total: shared loads, shared stores, local loads
and local stores.
Stores and loads stress the ALUs responsible for memory operations, the store/load buffers,
and the memory system including caches. If the memory operation uses shared memory, it is
possible that the instruction causes extra inter-cache traffic. The amount of traffic depends
on the cache coherence protocol. If there is a lot of contention between processors, this will
cause stalls, negatively affecting the processor’s throughput.
The relative frequencies of all memory instructions are combined in the memory instruction
profile. Note that this profile does not determine the memory access pattern; that is covered
by the striding behavior.
Table 2.2: Example of a memory instruction profile
Operation Shared? Relative frequency
Load Yes 10%
Store Yes 10%
Load No 40%
Store No 40%
Branch Instructions
There are two kinds of branch instructions: conditional and unconditional branches. Using
branch instructions, we want to stress the ALU responsible for branch processing, the branch
predictor, and other associated hardware structures (e.g. BTB).
To stress the branching-related logic, we need to control the predictability of a branch. We
can do this using the branch transition rate [10]. The branch transition rate is the number
of times a branch switches between taken and untaken, divided by the number of times the
branch is executed.
For example, a branch transition rate of 100% means that the branch will constantly alternate
between taken and untaken. Branch transition rates that are very high (90-100%) or very
low (0-10%) have a high predictability. If the branch transition rate is between 30 and 70
percent, the branching behavior is harder to predict.
In the workload model we use the cumulative distribution of the inverse branch transition
rate.
Table 2.3: Example of a branch transition rate distribution
Inverse branch transition rate Relative frequency
1 70%
2 20%
4 5%
8 5%
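As an illustration (our own sketch, not framework code), restricting the inverse transition rate r to a power of two lets a branch condition be derived from a loop counter with a single shift and mask: the outcome then toggles exactly once every r executions.

```c
/* Branch outcome for inverse transition rate r = 2^log2r: the bit
 * (i >> log2r) & 1 flips once every r executions of the branch. */
int branch_taken(unsigned i, unsigned log2r)
{
    return (i >> log2r) & 1;
}

/* Fraction of executions on which the outcome differs from the
 * previous one, i.e. the measured branch transition rate. */
double transition_rate(unsigned n, unsigned log2r)
{
    unsigned transitions = 0;
    for (unsigned i = 1; i < n; i++)
        if (branch_taken(i, log2r) != branch_taken(i - 1, log2r))
            transitions++;
    return (double)transitions / n;
}
```

For 16 executions, log2r = 0 gives 15 transitions (rate close to 100%, constant alternation), while log2r = 1 gives 7 transitions, matching an inverse transition rate of 2.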
2.2.2 Dependencies Between Instructions
If an instruction has to wait for the result of a previous instruction, there is a dependency.
The dependency distance is the number of instructions between two dependent instructions.
There are three kinds of instruction dependencies: WAW (Write after Write), RAW (Read
after Write), and WAR (Write after Read). Since out-of-order processors can eliminate WAW
and WAR dependencies, we will only consider RAW dependencies.
Instruction dependencies limit the instruction level parallelism (ILP). In a stressmark we want
the ILP to be large enough to fully occupy all ALUs in the processor. We will therefore define
a minimum RAW distance in the workload model.
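The effect of a minimum RAW distance can be sketched in C (an illustrative fragment of our own, not generated code): interleaving three independent accumulator chains guarantees that every instruction depends only on a result produced three instructions earlier, so an out-of-order core can keep three operations in flight at once.

```c
/* Sketch: a straight-line sequence with minimum RAW distance 3,
 * built from three interleaved, independent accumulator chains. */
int raw_distance_3(void)
{
    int a = 1, b = 1, c = 1;
    /* Each chain uses its own accumulator, so consecutive
     * instructions are independent of each other. */
    a = a + 2;
    b = b + 3;
    c = c + 4;
    a = a * 2;  /* depends on `a = a + 2`, 3 instructions earlier */
    b = b * 3;
    c = c * 4;
    return a + b + c;
}
```

This is also where the register pressure mentioned later comes from: a minimum distance of d requires at least d live accumulators.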
2.2.3 Data and Instruction Footprint
The footprints are the number of unique data and instruction addresses referenced while
running the stressmark. The size of these will affect the stress on the memory subsystem,
particularly the caches.
The size of the instruction footprint determines whether the stressmark fits into the L1 in-
struction cache. The data footprint is twofold, defining the size of the global memory region
on the one hand, and of the private memory region allocated for every thread on the other.
The size of the global memory determines the contention probability when multiple threads
read from it or write to it.
In the workload model, we define the three parameters discussed above: the number of
instructions, the size of the global memory, and the size of the thread-local memory.
Table 2.4: Example of a data and instruction footprint
Instruction footprint 400 instructions
Shared data footprint 32kB
Thread-local data footprint 512kB
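With the example sizes of Table 2.4, the total data footprint follows directly from these parameters, since the shared region is allocated once and the local region once per thread; a small sketch (macro and function names are ours):

```c
#include <stddef.h>

/* Example sizes from Table 2.4. */
#define SHARED_FOOTPRINT  (32u * 1024u)   /* 32 kB, shared by all threads */
#define LOCAL_FOOTPRINT   (512u * 1024u)  /* 512 kB, allocated per thread */

/* Total number of unique data bytes the stressmark touches. */
size_t total_data_footprint(unsigned nthreads)
{
    return (size_t)SHARED_FOOTPRINT + (size_t)nthreads * LOCAL_FOOTPRINT;
}
```

For four hardware threads this gives 32 kB + 4 × 512 kB = 2080 kB of data touched by the stressmark, which determines how far down the cache hierarchy the memory instructions reach.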
2.2.4 Memory Striding Behavior
Memory instructions in the stressmark may exhibit some dynamic behavior. Some memory
instructions read from or write to the same address every time they are executed. Other
memory instructions walk through the memory, reading from a different address every time
they are executed. We will use data stream strides to model this behavior.
Memory instructions that walk through the memory do this with a constant step size, the
stream stride. The size of the stream stride is a multiple of the size of a cache line (defined in
the platform-dependent parameters). Every memory instruction is assigned a stream stride.
In the workload model, we define a distribution of stream strides. Memory instructions that
constantly read from the same address have a stream stride of size zero. Instructions with a
stream stride greater than zero will cycle through a part of the memory defined in the data
footprint.
Table 2.5: Example of a stream stride distribution
Stride value Relative frequency
0 80%
1 10%
2 2%
4 6%
8 2%
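A strided memory instruction can be pictured as follows (our own sketch; the 64-byte cache line stands in for the platform parameter): each execution advances the access offset by the stride, expressed in cache lines, and wraps around within the footprint region.

```c
#define CACHE_LINE 64u  /* platform parameter; 64 bytes assumed here */

/* Next byte offset for a memory instruction with the given stream
 * stride (in cache lines), cycling through a region of `footprint`
 * bytes. A stride of 0 keeps accessing the same address. */
unsigned next_offset(unsigned offset, unsigned stride_lines, unsigned footprint)
{
    return (offset + stride_lines * CACHE_LINE) % footprint;
}
```

A stride of one cache line touches every line of the region in turn, while larger strides skip lines and thus reduce spatial locality.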
2.3 Workload Summary
In total, there are 30 variables in the workload model. While developing the workload model,
we tried to minimize the number of parameters, keeping only those relevant to developing
stressmarks. A workload model containing fewer parameters results in a smaller search space,
allowing the machine learning algorithm to work more effectively.
Each parameter is designed to stress a specific part of the processor. Although the parameters
may interact with each other, each parameter has one specific goal, and we avoided such
overlaps as much as possible. For example, we could support more arithmetic operations
(shifts or subtractions), but we decided against this because we assume that the provided
operations can optimally stress all ALUs, rendering additional operations superfluous. Should
this assumption turn out to be false, our framework is set up in a way that operations can
be added with relative ease.
While developing the workload model, we also had to take into consideration that it eventually
will be converted into a stressmark. To be able to efficiently generate the stressmark code,
we defined some extra restrictions on some parameters. For example, the inverse branch
transition rate is restricted to powers of two. We will explain this in more detail in the next
chapter.
Table 2.6: Workload summary
Category # parameters
Instruction mix 3
Arithmetic instruction distribution 9
Memory instruction distribution 5
Branch transition rate 4
Inter instruction dependency 1
Footprint 3
Stream stride distribution 5
2.4 Discussion
2.4.1 Differences With Prior Work
In previous work by Joshi et al. [12], stressmarks have been created based on a workload
model made to create synthetic equivalents of real benchmarks. This workload model was
composed of 40 parameters and had no support for multi-threaded workloads, nor was it
platform-portable. Our own workload model is based on this, but we would like to highlight
a few key differences:
• In the original stressmark paper, the dependency distance was a cumulative distribution. To create stressmarks, it suffices to define a minimum dependency distance.
• We do not define a basic block size in our model.
• Because our benchmark is platform-portable, we cannot make assumptions about the latency of arithmetic operations. We define three operations: addition, multiplication, and division, for three datatypes (integers, floats, and doubles).
• We reduced the number of parameters to describe the stream stride distribution and the branch transition rate distribution, because we found that this relatively coarse granularity suffices for the generation of stressmarks.
� We added support for multi-threaded stressmarks by introducing operations on shared
memory, affecting stressmark performance due to coherency.
2.4.2 Platform Portability
In a perfect world, we would be able to produce a completely platform-independent stressmark
that could be used to stress any given processor to its absolute limit. To approach this limit
in practice however, a stressmark needs to optimally stress as many processor components as
possible, and often exploit the specifics of the platform it is running on. It is clear that this
unfortunately renders stressmarks platform-dependent by their very nature.
Because full-fledged platform independence is not possible, our goal becomes maximal plat-
form portability; we design our framework making sure that it can generate stressmarks for
a wide range of platforms, and that the adoption of new platforms is relatively easy.
Our workload model is an excellent starting point toward this goal, since its abstract param-
eters apply to most platforms, and new platforms can often be supported by adding a few
new parameters.
The next step is to generate the executable stressmark without breaking platform portabil-
ity. To achieve this, a widely supported low-level programming language, in our case the C
language, is used instead of assembler to express the stressmark (as in previous work). Using
a compiler for the target platform, we then obtain the executable stressmark. The framework
itself therefore does not know the instruction set or register set; it only needs a compiler sup-
porting the platform. The wide availability of compilers for different platforms thus ensures
platform portability.
We learnt however that our approach is not without limitations, as the following example
shows. To fully stress a modern processor we would have to support SIMD instructions in
the workload model. Supporting SIMD instructions is however problematic because there
is no standardized way to express them in C. We will explore a couple of possible solutions
further in this document, but unfortunately it is not possible to implement them in a way
that platform portability is absolutely guaranteed.
2.4.3 Branch Predictability
The performance characteristics of the branch predictor component, such as power usage and
heat production, are determined by the number of branch instructions on the one hand and
the misprediction rate on the other; it is therefore of crucial importance that our framework
can influence these two properties through the workload model.
For the number of branch instructions, this is trivial since this property directly corresponds
to the workload parameter determining the instruction footprint. For the branch predictabil-
ity, however, there is no such workload parameter, since it is not possible to accurately gen-
erate code with a specific branch misprediction rate. We will show, however, that the mis-
prediction rate can be controlled indirectly by setting the branch transition rate. The latter
is indeed a workload parameter, as it is perfectly possible to generate synthetic code with a
given branch transition rate.
In the paper "Branch Transition Rate: A New Metric for Improved Branch Classification
Analysis", Haungs, Sallee, and Farrens [10] found that the misprediction rate of global as
well as local branch predictors is determined by their transition rate and taken rate (i.e. the
fraction of executions in which the branch is taken). Due to design decisions in our stressmark
generator, the taken rate of our branches is fixed at 50%, and the transition rate is 100%,
50%, 25%, or 12.5%.
Using the following graphs from Haungs et al., it can be deduced that the corresponding
misprediction rates range from the lowest values (<5%, white) to the highest (>45%, black),
with two evenly spread intermediates.
Figure 2.1: Miss rates of local (left) and global (right) branch predictors for different classes
of branches, identified by transition rate and taken rate.
On the axes of these graphs, class 0 corresponds to 0-5%, class 1 to 5-10%, ..., class 4 to
20-25%, class 5 to 25%-75%, class 6 to 75%-80%, ..., class 9 to 90%-95%, and class 10 to
95%-100%. The values for our stressmarks, with their fixed branch taken rate of 50%, are
therefore located in the middle column, which contains widely varying values.
2.4.4 Transformation Aspects
We now draw attention to the various aspects of the transformation of the workload model into
the executable stressmark, which is performed by the stressmark generator. First, we need
to distinguish clearly between the workload model itself, which is the input of the stressmark
generator, and the effective workload. The latter contains the actual workload parameter
values of the generated stressmark when it is executed.
Randomized vs. Deterministic Stressmark Generation
Although the workload model defines the key characteristics of the stressmark that needs to be
generated, there are some aspects that are not explicitly determined by it; notable examples
are the taken branch rates discussed in the previous section, and the order of instructions.
During stressmark generation, these undefined aspects can in general either be determined at
random, or by reasonable design choices. If determined by design choice, the transformation
process will always produce the same stressmark for a given workload, but the number of
stressmarks that can possibly be created is reduced, and it cannot necessarily be guaranteed
that the design choice always produces the best stressmark possible. If chosen randomly,
the transformation process is no longer deterministic and a single workload model can then
generate different stressmarks during sequential runs of the stressmark generator.
Although we initially opted to determine some of the undefined aspects randomly, it became
clear later on that this was the wrong choice. The effective workloads produced by different
stressmarks based on the same workload model varied too much, causing the search algorithm
described further in this document to function inefficiently, since a given workload model no
longer corresponded to a single fitness value. Concluding that a deterministic transformation
is really necessary, we later switched to deterministic design choices and tuned them for best
performance. We also compared our search results against their theoretical maxima to guard
the efficacy of the framework.
Mapping Between Workload Model and Effective Workload
Note first that different workload models may sometimes result in the same stressmark and
therefore the same effective workload. For example, the instruction footprint may be 50
instructions while the number of memory operations is only 1%. In this case the framework will
generate zero memory instructions, which is of course the same stressmark as the one created
for the same workload model, but with 0% memory operations instead of 1%. Moreover,
since there are no memory instructions at all, additional memory parameters in the workload
model become irrelevant (i.e. data footprint, stride distribution, reads/writes, shared/non-
shared). Workload models differing only in these parameters will once again result in the
same stressmark.
It is now also clear that the effective workload is not always consistent with the workload
model. This is not only caused by duplicate mappings as illustrated by the example above, but
also by certain complexities within the stressmark generator algorithm. These are described
in the next chapter.
Multi-threading Aspects
The multi-threaded stressmarks are created from a single workload. This means that every
core will be running the same synthetic benchmark. The interaction between threads will
happen at the memory level. There are two situations of contention between threads. First,
threads may be competing over cache line ownership; in this case we stress the cache coherency
mechanisms. Second, contention may happen if threads compete for cache memory; this will
be the case if the size of the global memory combined with all the thread-local memory is
bigger than the total cache size.
We did not implement synchronization primitives such as mutexes in the stressmarks. These
primitives typically cause the processor to stall for a while, and stalling is undesired behavior
for stressmarks.
Chapter 3
Synthetic Benchmarks in C
3.1 Introduction
In the previous chapter we described the workload model, a collection of program character-
istics that describe a stressmark. In this chapter we will examine how the workload model
can be transformed into an executable stressmark.
We want the framework to be platform-portable, meaning that it should be capable of gener-
ating stressmarks for almost any platform, given the platform-dependent parameters, without
knowing the instruction set or register set of the platform. To achieve this feature, we use
a low-level programming language instead of assembler to express the stressmark. Once the
stressmark is compiled for the target platform, we obtain an executable stressmark.
This is why our framework doesn’t have to know the platform it is testing; it only needs a
compiler that supports it.
3.2 Language and Compiler Requirements
We will start by defining the criteria for the low-level programming language. First, compi-
lation should be static; interpreted or JIT-compiled languages will not do. Second, it must be
possible to express the various workload properties in the language constructs. And finally,
a high-quality compiler should be available for almost any platform.
As low-level programming language we therefore chose the C programming language. It is so
well supported that it is the de facto standard among the low-level programming languages.
Virtually every platform has a C compiler and most have a highly optimized one.
3.2.1 Alternatives
An alternative language to C could be Fortran; it is less widely supported but fits all the
other criteria perfectly. We opted for C over Fortran mainly because we have more experience
with it.
Note that in the end, the language choice may not matter all that much, since compilers like
the GNU Compiler Collection support many languages and use the same backend for every
language, making it quite unlikely that using another language will yield significantly better
or worse results. In fact the low-level language used is nothing more than an interface to
control the backend of the compiler for platform-specific code generation.
An alternative approach could be to skip the compiler frontend altogether and directly im-
plement the stressmark in the intermediate representation (IR) of the compiler. We could for
example use GCC's GIMPLE/TUPLES or LLVM's IR. Both compiler frameworks support
many platforms, but not every one; some specialized embedded processors (such as Trident
media processors or Microchip PIC processors) only have commercial compilers.
Using the compiler's IR to express the stressmark would improve our control over the form
and properties of the stressmark. Implementing the stressmark in C instead requires signifi-
cant effort to make sure that the critical stressmark properties are preserved after compilation.
This extra effort, however, buys a significant portability advantage.
3.2.2 Expressing the Stressmark in C
Expressing the stressmark in C is quite simple; the language constructs allow us to easily
express all the behavior we want. Figure 3.1 illustrates the general form of a stressmark.
Beware that this is a very naive implementation of a stressmark, shown only for the purpose
of illustration.
Before we start running the stressmark, we need to perform some initialization. The initial-
ization contains some variable declarations and the memory allocation.
The next part is the stressmark loop, consisting of a start block and the stressmark body. In
the start-block the next iteration of the stressmark body is prepared, and it is checked whether
the stressmark has finished. The stressmark body contains the actual behavior conforming
to the workload model. It is important to note that there are no loops inside the stressmark
body.
Finally, in the finalization routine we free the allocated memory.
Figure 3.1: Global stressmark structure.
3.2.3 Compiling the C Stressmark
The role of the compiler in our framework is to fill in the platform-specific details. The com-
piler should perform optimal instruction selection and register allocation for the underlying
platform. At the same time, the compiler may not change the execution properties of the
stressmark as they are expressed in the low-level programming language.
The example may look like a perfectly working stressmark, but after compilation the result
is very disappointing. If we compile this example, we notice that the variables v1 and v2
have disappeared. The branch operation and the arithmetic operations have been eliminated
as well, because they are functionally redundant: they do not contribute to the result of the
function, nor do they generate any effect. We also notice that the stride calculation checks
for division by zero, which is not needed since memSize will always be greater than zero.
Listing 3.1: Compilation result with -O1
stride3 = (stride3 + 12) % memSize;
    400548: addiu v0,s0,12
    40054c: div   zero,v0,s2
    400550: bnez  s2,40055c <stressmark+0x4c>
    400554: nop
    400558: break 0x7
    40055c: mfhi  s0
if (i-- == 0) break;
    400560: addiu s1,s1,-1
    400564: beq   s1,v1,400578 <stressmark+0x68>
    400568: sll   v0,s0,0x2
v2 = v3 + v2;              // arithmetic instruction
if (i & 2) v1 = v1 * v3;   // branch instruction
memory[stride3] = v3;      // memory instruction
    40056c: addu  v0,v0,a0
    400570: j     400548 <stressmark+0x38>
    400574: sw    s3,0(v0)
This brings us to the disadvantages of using a low-level programming language as imple-
mentation target for stressmarks. While the compiler is very good at converting a program
into machine code, it will also optimize the program. For normal applications, optimizing is
of course beneficial as it eliminates unnecessary operations without changing the functional
behavior. However, when a stressmark is compiled, some optimizations could change char-
acteristics reflecting the workload model; these optimizations are undesired. Unfortunately,
we cannot wholly disable optimization as we still rely on intelligent instruction selection and
register allocation for efficiency and correctness.
We want to stress that the optimization tradeoff is very tricky to get right. If the generated
code is not optimized, the result is inefficient as the available registers and instructions are
not optimally utilized. If the compiler performs too much optimization, critical parts of the
stressmark may be optimized away, changing the stressmark behavior and jeopardizing its
conformity to the workload model.
In the remainder of this chapter, we will mainly focus on how to tune the compiler and
the structure of the stressmark to generate correct executable stressmarks. Before analyz-
ing the optimization countermeasures, we define which optimization behavior is required, or
acceptable.
3.2.4 Compiler Requirements
The compiler is required to perform two tasks: instruction selection and register allocation.
The workload is only correctly expressed in the executable stressmark if the arithmetic and
memory operations use registers. Stack operations and register spilling should be avoided as
much as possible.
The compiler is allowed to do some instruction rescheduling. The workload does not define
the instruction ordering; only the minimum dependency distance is defined. If the compiler
performs some instruction rescheduling, the minimum dependency distance could possibly
change. Such optimizations are not considered harmful. The compiler may have good enough
knowledge about the latencies of instructions to be able to reschedule instructions without
causing a slowdown. Instruction rescheduling across blocks is however not acceptable.
It is also the responsibility of the compiler to optimize the address calculation for memory
instructions.
3.3 Exploring the Optimization Behavior of the Compiler
We addressed the compiler optimization problem using two approaches. First, we dumbed
the compiler down to the minimum level of optimization, using a predefined optimization
level tweaked with special compiler flags. We then designed the structure of the stressmark
to be immune to the remaining optimizations of the dumbed-down compiler.
From this point onwards, our results depend on the compiler and platform used. We use
GCC 4.4 for verifying the x86-64 target and GCC 3.4 for verifying the SESC/MIPS target.
3.3.1 Configuring the Compiler
GCC offers several optimization levels: O0, O1, O2, O3, and Os. The lowest optimization
level, O0, is not useful since it does not perform any register allocation; from O1 onwards it
does. We fine-tuned the O1 profile using flags to disable some loop optimizations.
Table 3.1: Used compiler flags
GCC 3.4 -fno-loop-optimize -mno-check-zero-division -fnew-ra -fno-if-conversion -fno-if-conversion2
GCC 4.4 -fno-tree-loop-optimize -fno-if-conversion -fno-if-conversion2
Through extensive trial and error we sought a combination of flags that suffices to reliably
compile stressmarks. The counter-optimization methods used in the next part rely on these
options.
This is a fragile part of our framework; if a new compiler is used, the user will have to
configure that compiler to fit our stressmark method. If the compiler cannot be configured
correctly, there are two solutions. One is to change how the framework generates stressmarks,
based on the behavior of the new compiler. The other is to change the framework component
that generates C into one that generates assembler, thus giving up platform portability.
Once the compiler is configured correctly, stressmarks can be generated almost without any
customization. On top of that, GCC already supports many compilation targets, and these
require no changes at all.
3.3.2 Analyzing Compiler Optimizations
In this section we will investigate the optimizations of the dumbed-down compiler and their
effects on the quality of the stressmark.
Redundant Code Elimination
Redundant operations are operations that neither contribute to the result of the function
nor produce an effect, such as a write to memory.
Redundant functions
Table 3.2: Redundant function

    C                                  Assembler (MIPS)
    void function() {
        int i, v1=2, v2=3, v3=7, v4=5;
        for (i = 0; i < 100; i++) {    Completely eliminated
            v1 = v3 * v4;
            v2 = v2 - v4;
            v1 = v2 / v4;
        }
    }
The function in this example only contains dead code. Even the dumbed-down compiler will
completely eliminate this function.
Table 3.3: Alternatives

    Alternative 1:
    int function() {
        int i, v1=2, v2=3, v3=7, v4=5;
        for (i = 0; i < 100; i++) {
            v1 = v3 * v4;
            v2 = v2 - v4;
            v1 = v2 / v4;
        }
        return v1 + v2 + v3;
    }

    Alternative 2:
    void function() {
        volatile int effect;
        int i, v1=2, v2=3, v3=7, v4=5;
        for (i = 0; i < 100; i++) {
            v1 = v3 * v4;
            v2 = v2 - v4;
            v1 = v2 / v4;
        }
        effect = v1 + v2 + v3;
    }
There are two ways to prevent the function from being eliminated entirely. In the first
alternative we make the return value dependent on the variables that are used. In the second
alternative we use a volatile variable to generate an effect: a volatile variable is always stored
in memory, never in a register. Writing the sum of the used variables to memory makes it
visible to other threads, thus preventing the variables from being optimized away.
Redundant operations Now that the function is not completely optimized away, we turn
our attention to the loop inside the function that represents a stressmark body.
Table 3.4: Redundant operations
    C                            Assembler (MIPS)
    for (i = 0; i < 100; i++) {  40051c: move  v1,zero
        v1 = v3 * v4;            (eliminated)
        v2 = v2 - v4;            400520: subu  a2,a2,a0    -> v2 = v2 - v4;
                                 400524: addiu v1,v1,1
                                 400528: slti  v0,v1,100
                                 40052c: bnez  v0,400520 <main+0x10>
        v1 = v2 / v4;            400530: div   zero,a2,a0  -> v1 = v2 / v4;
    }
After compilation, we can see that the compiler has eliminated one of the operations in the
loop body. The first and the third operation write to the same variable, while this variable
is never read between the two writes; in other words, the first operation is redundant. This
optimization behavior has significant ramifications for implementing the minimum depen-
dency distance.
Listing 3.2: Dependency distance
[...]
v1 = v1 * v2;   // v1 Write
v2 = v2 / v3;
v3 = v3 + v4;
v4 = v4 * v5;
v5 = v4 / v1;   // v1 Read
[...]
The minimum dependency distance is defined as the minimum RAW distance. In the above
example the RAW distance is four. To achieve this distance without redundant operations,
we had to use five variables. If we were to increase the minimum dependency distance, the
required number of variables would increase proportionally.
Since we want all the variables to be stored in registers for optimal execution, a large minimal
RAW distance will require a lot of hardware registers. If there are not enough hardware
registers available, this will cause register spills. Not only is this undesired behavior, it is
also impossible for the framework to detect this happening. This is a limitation caused by
the use of a low-level programming language. If we implemented the benchmark directly
in assembler, it would be a lot easier to achieve very large RAW distances with only a few
hardware registers. In fact, we would probably drop the minimum dependency distance from
the workload model.
In table 3.5 we look at redundancy elimination in combination with conditional instructions,
commonly known as "partial redundancy elimination". After compilation, the superfluous
conditional expression is completely eliminated by the dumbed-down compiler. This means
that the redundancy removal even works across blocks within the loop.
Conclusion This is by far the most annoying optimization and to our knowledge there is
no way to prevent the compiler from applying it by tweaking the compiler flags. Whenever
we enable register allocation, the compiler will try to eliminate the most obvious unnecessary
instructions.
Table 3.5: Redundant blocks

    C                              Assembler (MIPS)
    for (i = 0; i < 100; i++) {    40051c: move  v1,zero
        if (i == 10) v1 = v3*v4;   (eliminated)
        v1 = v3 * v4;              (eliminated)
        v2 = v2 - v4;              400520: subu  a2,a2,a0    -> v2 = v2 - v4;
                                   400524: addiu v1,v1,1
                                   400528: slti  v0,v1,100
                                   40052c: bnez  v0,400520 <main+0x10>
        v1 = v2 / v4;              400530: div   zero,a2,a0  -> v1 = v2 / v4;
    }
Loop Invariants
The body of the stressmark is placed inside a loop. A typical compiler optimization is to
hoist loop invariants out of the loop. To avoid this optimization through the structure
of the stressmark, we would have to make sure that all variables within the loop are de-
pendent on a previous iteration of the loop. This would add a lot of complexity to the
stressmark generator. Fortunately we were able to avoid this by setting the compiler-flag
”-fno-tree-loop-optimize”.
Table 3.6: Loop invariants
    C                            Assembler (MIPS)
    for (i = 0; i < 100; i++) {  400518: move  a0,zero
        v1 = v3 * v4;            40051c: mult  a2,a1       -> v1 = v3 * v4;
                                 400520: addiu a0,a0,1
                                 400524: slti  v0,a0,100
                                 400528: bnez  v0,40051c <main+0xc>
        v2 = v3 + v4;            40052c: addu  v1,a2,a1    -> v2 = v3 + v4;
    }
In the example (table 3.6) both instructions are loop-invariant. However, if we look at the
compilation result, we can see that they are still inside the loop. The multiplication (v1 =
v3*v4) is placed at the beginning of the loop. The addition (v2 = v3+v4) is placed in the
branch delay slot of the branch instruction.
We conclude that we needn’t worry about loop invariants in the body of the stressmark.
Constant Folding and Propagation
These optimizations are present in the dumbed-down compiler, but because of the
”-fno-tree-loop-optimize” flag, they do not work on literals declared outside the loop.
These optimizations come in handy because they optimize some of the address calculation for
memory operations inside the stressmark loop. However, there are some cases where these
optimizations may cause trouble when combined with algebraic simplifications. Listing 3.3 is
a reduced problem case we encountered while testing the framework.
Listing 3.3: Constant folding and propagation
[...]
v1 = v2 / v2;   // v1 = 1       -> eliminated (algebraic simplification)
[...]
v3 = v1 * v1;   // v3 = v1 = 1  -> eliminated (constant folding and prop.)
v5 = v1 + v1;   // v5 = 2       -> eliminated (constant folding and prop.)
[...]
v6 = v3 / v5;   // v6 = v3 >> 1 -> peephole optimization
[...]
To avoid these optimizations, we came up with a few extra rules for generating the stressmark.
• All variables are initialized with different values.
• Instructions with two operands use two different registers as operands.
Code Layout and Branch Optimization
The compiler eliminates branches whenever it can, and by doing so also removes unreachable
code. These optimizations make it harder to implement static branches in the stressmark.
Table 3.7: Branch optimization
    Unoptimized                              Optimized
    if (condition) goto L3; else goto L2;    if (condition) goto L3;
    L2: [...]                                L2: [...]
    L3: [...]                                L3: [...]

    Unoptimized                              Optimized
    [...]                                    [...]
    goto L3;                                 // eliminated
    L2: [...] // nothing jumps here          // eliminated
    L3: [...]                                L3: [...]
In listing 3.4 we show a solution for implementing static branches in the stressmark. The
for-loop represents the stressmark body. In the finalization routine, we jump to the block
that would otherwise be eliminated.
Listing 3.4: Static branch implementation
for (i = 0; i < MaxIter; i++) {  // stressmark body
    [...]
    goto L3;
L2: [...]
L3: [...]
}
goto L2;  // finalization routine
Instruction Rescheduling
Most instruction rescheduling optimizations only become available with the -O2 optimization
level. This optimization is allowed as long as it doesn’t work across blocks. In practice we
rarely see an instruction rescheduling optimization in the compiled code.
Exceptions
While compiling divisions and modulo operations for the MIPS target, the compiler generates
code that checks for division by zero. The compiler flag "-mno-check-zero-division" disables
this safety check.
3.4 Interesting C Constructs
Before we use all the knowledge we gained about the optimization behavior of the compiler
to implement a good stressmark, we look at some interesting C language constructs that may
help in constructing stressmarks.
3.4.1 Volatile Variables
Volatile variables in C are variables that may change in a way that is not predictable by
the compiler. Volatile variables are typically used to implement signal handlers, or to access
memory mapped devices.
The volatile keyword prevents the compiler from storing the variable content in a register;
this means that writing to a volatile variable always results in writing to a static memory
address, and reading from a volatile variable always results in reading from a static memory
address.
Since the content of the memory is modified, these operations are effectful and cannot be
optimized away.
3.4.2 Const Variables
The value of const variables cannot be changed after initialization. Typically the const
keyword does not improve performance at a sufficiently high optimization level, since the
compiler can figure out by itself whether a variable will be modified. The dumbed-down
compiler, however, does need this information to improve the address calculation for memory
instructions.
3.4.3 Control Flow
While researching how we could implement the control flow in the stressmark, we found
many alternatives. Some examples:
Table 3.8: Alternative control flow implementations
    Alternative 1:
    while (i > 0) {
        i--;
        if (condition1) {
            a = b + c;
        }
        c = a * b;
        if (condition2) {
            d = e + f;
        }
        f = e * d;
    }

    Alternative 2:
    start:
        if (i <= 0) goto end;
        i--;
        if (!condition1) goto L1;
        a = b + c;
    L1: c = a * b;
        if (!condition2) goto L2;
        d = e + f;
    L2: f = e * d;
        goto start;
    end:
In the example both alternatives are functionally equivalent and they compile to exactly the
same machine code. We opted to use gotos because they are more flexible; only gotos allow
us to implement unoptimizable static branches (listing 3.4).
3.5 Forming the Stressmark
We will explain the structure of the stressmarks in three stages. We start with only arithmetic
instructions, then add memory operations, and finally control flow to the stressmark.
3.5.1 Arithmetic Operations
The first stressmark is a simple loop containing only arithmetic operations. Stressmarks
such as this one can be generated by the framework by providing a workload model with an
instruction mix consisting of 100% arithmetic instructions.
The stressmark starts by initializing the used variables (vN), the loop counter (i), and a
division variable (vDiv). The division variable is created to avoid division-by-zero problems.
The next part of the stressmark is the stressmark loop, with a very small start block. The
stressmark body contains the arithmetic operations and finishes by jumping back to the start
block. When the stressmark has finished, it returns the sum of all the variables to avoid
optimization (table 3.3).
We simplified this example by using only integer operations; more typical stressmarks use a
combination of integer and floating-point operations. Note that the stressmark contains one
dynamic branch instruction for the loop, even though this is not defined in the workload
model.
Listing 3.5: Arithmetic instructions
int stressmark() {
    int i = 999999, v1 = 2, v2 = 3, v3 = 5, v4 = 7, v5 = 13;
    const int vDiv = 3;
start:
    if (i-- <= 0) goto end;  // start block
    v1 = v2 + v3;            // int add
    v2 = v3 / vDiv;          // int div
    v3 = v4 * v5;            // int mul
    v4 = v5 + v1;            // int add
    v5 = v1 / vDiv;          // int div
    [...]
    goto start;
end:
    return v1 + v2 + v3 + v4 + v5;
}
The number of registers a stressmark uses is critical to avoid register spilling. Register usage
breakdown:
• Necessarily stored in registers
  – Loop counter (i): 1
  – Variables (vN): minimum dependency distance + 1
• Optionally stored in registers
  – Division variable (vDiv): one register for every datatype
Whether the division variable ends up in a register depends on the number of available
registers: sometimes it is loaded as a literal before the division; otherwise it is kept in a
register continuously.
3.5.2 Memory Instructions
To support memory instructions we need to declare thread-local (localMemoryArray) and
shared (globalMemoryArray) memory regions. These regions are split into smaller arrays,
one for each striding memory instruction.
We assign each striding memory instruction a dedicated array to avoid memory instructions
influencing each other’s behavior; more specifically, we want to avoid memory instructions
causing data of another memory instruction to be cached. The performance of a memory
instruction should after all be independent of other instructions.
To implement the striding behavior of the memory instructions, we use the variables lStrideN
for thread-local and sStrideN for shared memory instructions, with N equaling the stride
distance. These variables are incremented by the stride value (= stride distance × cache
line length) in the start block of the stressmark. A striding memory instruction reads from
its array at an offset given by lStrideN or sStrideN.
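The resulting access pattern can be sketched with a small illustrative helper that mirrors the update performed in the start block of listing 3.6 (the parameters and their values are taken from that example):

```c
/* Offset of a striding memory instruction after n iterations, mirroring
 * the start-block update "lStrideN = (lStrideN + stride * elemSize) %
 * arraySize" from listing 3.6. The offset wraps around the array, so
 * the instruction keeps touching new locations at a fixed stride. */
int stride_offset(int n, int stride, int elem_size, int array_size) {
    int offset = 0;
    for (int i = 0; i < n; i++)
        offset = (offset + stride * elem_size) % array_size;
    return offset;
}
```

With stride 2, 4-byte elements, and an array size of 100, this reproduces the sequence 0, 8, 16, ..., 96, 4, 12, ... of listing 3.6.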
Non-striding memory instructions are much easier to implement; we can simply use volatile
variables (sCel1, lCel1, lCel2) for this purpose.
Implementing memory instructions causes the initialization and finalization blocks to grow
somewhat, but this does not affect the stressmark's measured behavior. The start block has
grown as well: it now contains some expensive modulo operations that are executed at the
beginning of every iteration. If the stressmark body is sufficiently large, these operations
should have a negligible effect on performance.
The implementation of non-striding memory instructions is very cheap. They read from or
write to an address equal to a static offset from the stack pointer, and can be executed in
a single instruction. Striding memory instructions require some address calculation. They
could be implemented more efficiently if every instruction in the loop had a dedicated hardware
register to store the address. However, since our implementation is already constrained by
the number of registers, we opted not to implement it in this way.
Listing 3.6: Arithmetic + memory instructions
volatile int sCel1;
int stressmark(int *const globalMemoryArray) {
    int i = 999999, v1 = 2, v2 = 3, v3 = 5, v4 = 7, v5 = 13;
    const int vDiv = 3;
    int *const localMemoryArray = malloc(...);
    int *const lArr1 = &localMemoryArray[0];
    int *const lArr2 = &localMemoryArray[100];
    int *const lArr3 = &localMemoryArray[200];
    int *const sArr1 = &globalMemoryArray[0];
    volatile int lCel1, lCel2;
    int lStride2 = 0, lStride4 = 0, sStride2 = 0;
start:
    if (i-- <= 0) goto end;              // begin start block
    lStride2 = (lStride2 + 2*4) % 100;
    lStride4 = (lStride4 + 4*4) % 100;
    sStride2 = (sStride2 + 2*4) % 150;   // end start block
    v1 = v2 + v3;                        // int add
    v2 = lArr1[lStride2];                // memread stride=2
    v3 = v3 / vDiv;                      // int div
    lArr2[lStride2] = v4;                // memwrite stride=2
    v4 = lCel1;                          // memread stride=0
    v5 = lArr3[lStride4];                // memread stride=4
    lCel2 = v1;                          // memwrite stride=0
    v1 = v2 / vDiv;                      // int div
    [...]
    goto start;
end:
    free(localMemoryArray);
    return v1 + v2 + v3 + v4 + v5;
}
The number of registers used by the implementation has significantly increased.
Register usage breakdown:
• Necessarily stored in registers
  – Loop counter (i): 1
  – Variables (vN): Minimum dependency distance + 1
  – Stride offsets (lStrideN and sStrideN): 1 for every used stride distance for shared or
    thread-local memory instructions—e.g. 4 used stride distances for shared memory
    instructions and 2 for thread-local memory instructions means a total of 6 required
    registers.
  – Addresses of localMemoryArray and globalMemoryArray: 2 registers
• Optionally stored in registers
  – Division variable (vDiv): One register for every data type
The variables localMemoryArray and globalMemoryArray should be stored in registers to
prevent them from being loaded before a striding memory instruction is executed.
The number of required registers is increased by a maximum of ten. Typically there will
only be a few striding memory instructions in a stressmark, so the actual number of extra
registers is lower. It is important to note that the non-striding memory instructions require
no registers because of the use of volatile variables.
3.5.3 Branch Instructions
The final step is to include branch instructions, which will cause some instructions to be
executed conditionally. Note first that conditional execution is not allowed for every instruction
type: striding memory instructions must be executed every iteration because their offset is
calculated every iteration; if they were skipped, there would be a gap in their access pattern.
We want to implement the branch behavior as cheaply as possible in terms of register usage,
so we reuse the loop counter to determine whether a branch is taken. We use the lowest bits
of the loop counter. The first bit constantly alternates, forming the pattern ...101010...,
which corresponds to an inverse branch transition rate of 1. The pattern of the second bit is
...110011001100..., corresponding to an inverse branch transition rate of 2, and so on. This
makes it very simple to implement branch instructions conforming to the workload model, as
long as the inverse branch transition rates are a power of two. However, these branches are
highly regular and therefore very predictable. This is not a problem, because we typically
want very predictable branches in a stressmark to reduce the stalling probability. A more
advanced implementation of branch transition rates can be found in the remarks (listing 3.9);
note that it uses a lot more registers, though.
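The bit-pattern scheme described above can be sketched as follows; branch_taken and transitions are illustrative helpers, not generator code:

```c
/* Outcome of a branch guarded by "if (i & (1 << k))": bit 0 of the loop
 * counter alternates every iteration (ibtr = 1), bit 1 every two
 * iterations (ibtr = 2), and in general bit k yields an inverse branch
 * transition rate of 2^k. */
int branch_taken(unsigned int i, int k) {
    return (i >> k) & 1;
}

/* Count taken/not-taken transitions over n iterations of a counter
 * decrementing from "start", as the stressmark loop does. */
int transitions(unsigned int start, int k, int n) {
    int count = 0;
    for (int i = 1; i < n; i++)
        if (branch_taken(start - i, k) != branch_taken(start - i + 1, k))
            count++;
    return count;
}
```

Over 16 iterations, bit 0 produces a transition at (almost) every step, while bit 1 produces roughly half as many, matching the doubled inverse branch transition rate.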
Using branches also has an effect on the minimum dependency distance. To preserve the
minimum dependency distance across multiple branch instructions, we have to take all possible
paths into account; this results in more registers being used.
Because of this, we reduce the number of instructions in a conditional block to a single
instruction. One might expect the compiler to optimize such very small conditional blocks,
replacing them by conditional moves; however, we can disable this optimization using the
flags -fno-if-conversion and -fno-if-conversion2.
The example below is simplified for better readability; the static branch implementation
(figure 3.4) is omitted.
Listing 3.7: Arithmetic + memory + branch instructions
// ibtr means "inverse branch transition rate"
volatile int globalMemoryCel;
int stressmark(int *const globalMemoryArray) {
    int i = 999999, v1 = 2, v2 = 3, v3 = 5, v4 = 7, v5 = 13;
    const int vDiv = 3;
    int *const localMemoryArray = (int *) malloc(...);
    int *const lArr1 = &localMemoryArray[0];
    int *const lArr2 = &localMemoryArray[100];
    int *const lArr3 = &localMemoryArray[200];
    volatile int lCel1, lCel2;
    int lStride2 = 0, lStride4 = 0;
start:
    if (i-- <= 0) goto end;
    lStride2 = (lStride2 + 2*4) % 100;
    lStride4 = (lStride4 + 4*4) % 100;
    v1 = v2 + v3;             // int add
    v2 = lArr1[lStride2];     // memread stride=2
    if (i & 2)                // branch ibtr=2
        v3 = v3 / vDiv;       // int div
    lArr2[lStride2] = v4;     // memwrite stride=2
    v5 = lArr3[lStride4];     // memread stride=4
    if (i & 3)                // branch ibtr=4
        v4 = lCel1;           // memread stride=0
    lCel2 = v1;               // memwrite stride=0
    v1 = v2 / vDiv;           // int div
    if (i & 3)                // branch ibtr=4
        v2 = v3 * v4;
    [...]
    goto start;
end:
    free(localMemoryArray);
    return v1 + v2 + v3 + v4 + v5;
}
Register usage breakdown:
• Necessarily stored in registers
  – Loop counter (i): 1
  – Variables (vN): Minimum dependency distance + (minimum dependency distance ×
    fraction of branch instructions) + 1
  – Stride offsets (lStrideN and sStrideN): 1 for every used stride distance for shared or
    thread-local memory instructions—e.g. 4 used stride distances for shared memory
    instructions and 2 for thread-local memory instructions means a total of 6 required
    registers.
  – Addresses of localMemoryArray and globalMemoryArray: 2 registers
• Optionally stored in registers
  – Division variable (vDiv): One register for every data type
All that remains to be done at this point is to start the stressmark threads.
Listing 3.8: Starting stressmarks
void *runstressmark(void *ptr) {
    stressmark((int *const) ptr);
}
int main(int argc, char **argv) {
    int t0, t1, t2, t3;
    globalMemoryArray = (int *) malloc(...);
    sesc_init();
    sesc_spawn((void *) *runstressmark, (void *) globalMemoryArray, NULL);
    sesc_spawn((void *) *runstressmark, (void *) globalMemoryArray, NULL);
    sesc_spawn((void *) *runstressmark, (void *) globalMemoryArray, NULL);
    sesc_spawn((void *) *runstressmark, (void *) globalMemoryArray, NULL);
    sesc_wait();
    sesc_exit(0);
    return 0;
}
The main method allocates the global memory and starts as many threads as there are
hardware threads. The threading implementation is platform-specific; for the MIPS target
we use a simulator-specific library (SESC threads), and for the x86 target we use PThreads.
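A PThread version of the launcher in listing 3.8 might look roughly as follows; the thread count, array size, and placeholder stressmark body are assumptions for illustration, not generated code:

```c
#include <pthread.h>
#include <stdlib.h>

#define NUM_THREADS 4     /* illustrative: one per hardware thread */
#define ARRAY_SIZE 4096   /* illustrative shared-array size */

static int *globalMemoryArray;

/* Placeholder for the generated stressmark body. */
static int stressmark(int *const shared) { return shared[0]; }

static void *runstressmark(void *ptr) {
    stressmark((int *const) ptr);
    return NULL;
}

/* Start one stressmark thread per hardware thread and wait for all of
 * them; the sesc_* calls of listing 3.8 map onto pthread_create and
 * pthread_join. Returns 0 on success. */
int run_stressmark_threads(void) {
    pthread_t threads[NUM_THREADS];
    globalMemoryArray = calloc(ARRAY_SIZE, sizeof(int));
    if (!globalMemoryArray) return -1;
    for (int t = 0; t < NUM_THREADS; t++)
        if (pthread_create(&threads[t], NULL, runstressmark,
                           globalMemoryArray) != 0)
            return -1;
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    free(globalMemoryArray);
    return 0;
}
```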
While designing the stressmark form, we mainly focused on avoiding compiler optimizations
and minimizing register usage so that the ILP can be maximized.
3.6 Remarks
3.6.1 Register Usage
It is clear that our stressmarks will perform best on architectures with a large number of
registers. Most modern microarchitectures are RISC-based and therefore fulfill this require-
ment. The only big exception is the 32-bit x86 microarchitecture; after some quick testing,
we concluded that our framework is not capable of generating stressmarks for this register-
starved microarchitecture. We then evaluated our framework with x86-64, and found that
although we can generate stressmarks for this platform, their effectiveness is unclear.
3.6.2 Alternative Branch Transition Rate Implementation
Listing 3.9: Alternative BTR
unsigned int i, btrIndex;
// pattern = 11000010110010001100001011001000
unsigned int btr2 = 0xC2C8C2C8;
for (i = 0; i < 10000; i++) {
    startblock:
    btrIndex = i % 32;
    stressmarkbody:
    [...]
    if ((btr2 >> btrIndex) & 1) {
        [...]
    }
    [...]
}
This is a way to implement branch transition rates more accurately. The patterns produced
by this method are much less predictable. This implementation consumes a lot more registers:
one for every branch transition rate.
It could be worthwhile to expose the choice between the two implementations as a platform-
dependent workload parameter, depending on whether the platform has many registers or
few. This method is probably better suited for synthetic benchmarks modeling the load of
manually written benchmarks than for our stressmarks.
3.6.3 Differences Between Workload Model and the Compiled Stressmark
There are two reasons the workload model and the properties of the benchmark, i.e. the
effective workload, can be different.
The first reason is that not all operations are necessarily compiled into a single instruction;
on some platforms, they can instead be compiled into a series of instructions. For example,
when compiling for a MIPS target, integer multiplication and division operations are each
transformed into two instructions: one for the operator itself, and one for storing the higher
bits of the result value. Load instructions can even result in up to five instructions.
The second reason is that a stressmark sometimes uses too many registers, introducing
register spilling. As this is undetectable, there is unfortunately little we can do about it.
The important thing to notice however, is that the stressmark needn’t be exactly consistent
with the workload model because it is optimized by a search algorithm that is not aware of
the semantics of the workload model anyway.
The type of algorithm we use, a genetic search algorithm, is even especially robust in this
respect due to the fact that it distinguishes between a genotype and a phenotype. The former,
corresponding to the workload model, is always an abstraction of the latter, in our case the
stressmark. The crucial point is that the fitness value will always reflect the effective workload,
regardless of the exact nature of the relation between the stressmark and its workload model.
Therefore, as long as all relevant aspects of the stressmark can somehow be controlled by
modifying the workload model, the search algorithm will always select for the best solution.
That being said, the search algorithm becomes more effective as the workload model controls
the relevant aspects of the stressmark more tightly; this is best achieved by gaining insight
into the stressmark's important qualities and making sure the workload model properly
reflects them.
3.6.4 Exploration of the Implementation of SIMD Instructions
Using Auto-vectorization
We can represent SIMD operations as in this simplified example:
Listing 3.10: Auto-vectorization
int i, a[4], b[4], c[4];
for (i = 0; i < 4; i++) {   // may compile to 1 SIMD instruction
    a[i] = b[i] + c[i];
}
Using GCC 4.0+ with the -O2 optimization level and the option "-ftree-vectorize", GCC can
optimize these kinds of code patterns into SIMD instructions.
This approach is portable across platforms that support similar SIMD instructions, making it
a good candidate to be defined as a platform-dependent workload parameter. GCC developers
are working on methods to make auto-vectorization more predictable in the future (using
pragmas).
However, this approach does have a number of serious disadvantages. First of all, it only
works at a high optimization level (-O2), a level our method cannot work reliably with.
These kinds of tricks are also somewhat compiler-dependent, as not all C compilers support
auto-vectorization. The transformation is therefore currently not guaranteed, rendering it
unpredictable.
Using Intrinsics
Another approach is to use intrinsics for the implementation of SIMD operations. We do not
use this approach because it is both compiler and architecture-specific. It does however give
direct control.
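As a sketch of the intrinsics approach (SSE2-specific, and therefore exactly the kind of architecture dependence we chose to avoid):

```c
#include <emmintrin.h>  /* SSE2 intrinsics; compiles only for x86 targets */

/* Add four packed 32-bit integers with a single SIMD instruction.
 * Unlike auto-vectorization, the SIMD instruction is guaranteed to be
 * emitted, but the code is tied to one architecture and compiler family. */
void simd_add4(const int *b, const int *c, int *a) {
    __m128i vb = _mm_loadu_si128((const __m128i *) b);
    __m128i vc = _mm_loadu_si128((const __m128i *) c);
    _mm_storeu_si128((__m128i *) a, _mm_add_epi32(vb, vc));
}
```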
3.7 Stressmark Generation
The stressmark generator is built using the pipe and filter architectural pattern. This made it
easy to replace parts of this software pipeline or compare alternative implementations. There
are four distinct phases in the generation process.
Figure 3.2: Stressmark generation (pipeline: workload model and platform parameters →
backbone generation → operation distribution → variable allocation → C generation →
stressmark).
3.7.1 Backbone Generation
In the first step, a graph made out of basic blocks is created. Each block is assigned an id, a
branch transition rate, a size and the links to its parent(s) and child(ren).
The number of blocks depends on the proportion of branch instructions in the instruction
mix and the trace size.
3.7.2 Operation Distribution
In this step, it is determined which operations should be executed in a block and in what
order. It also makes sure that striding memory operations do not get placed in conditional
blocks.
An arithmetic operation combines a data type (integer, float, double) with an operator
(addition, multiplication, division).
A memory operation is a load or store in shared or non-shared memory with a certain stride
value.
We have two ways to distribute the operations: random distribution and smooth distribution.
• Random distribution means the operations are randomly distributed across the blocks.
  This causes the stressmark to become inconsistent (a variation of 10 to 20% in power
  for the same workload).
• Smooth distribution means that the operations are chosen deterministically. To choose
  an operation, the algorithm looks at the instruction profile of the previously chosen
  operations, compares it to the instruction profile in the workload model, and picks the
  operation whose frequency deviates most from the frequency demanded by the workload
  model's instruction profile.
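The smooth distribution step can be sketched as follows; the data layout (a target-frequency vector and per-type emission counts) is an illustrative assumption, not the framework's actual representation:

```c
/* Smooth distribution: pick the next operation type as the one whose
 * observed frequency so far falls furthest below the frequency demanded
 * by the workload model's instruction profile. "target" holds the
 * demanded fractions (summing to 1), "emitted" the counts so far. */
int pick_next_operation(const double *target, const int *emitted,
                        int num_types, int total_emitted) {
    int best = 0;
    double best_deficit = -1e9;
    for (int t = 0; t < num_types; t++) {
        double observed = total_emitted > 0
                        ? (double) emitted[t] / total_emitted : 0.0;
        double deficit = target[t] - observed;  /* how far below target */
        if (deficit > best_deficit) {
            best_deficit = deficit;
            best = t;
        }
    }
    return best;
}
```

Repeatedly calling this and updating the counts drives the emitted instruction mix toward the demanded one, which is what makes the resulting stressmarks consistent.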
Variable Allocation
This step will assign variables to the operations determined in the previous step. The variables
are chosen in a way that they cannot be optimized away and the minimum dependency
distance is preserved. The number of used variables is minimized by the algorithm. In this
step the memory sizes are determined and the memory instructions are each assigned their
dedicated array.
When this step is finished, the stressmark is essentially ready; all the information to create
the stressmark is available.
C Generator
Now the stressmark is converted to C code. This step is platform-dependent, resulting in a
couple of different versions:
• Single-threaded: the most platform-portable version
• SESC multi-threaded: a version to run multi-threaded benchmarks on SESC
• Pthread: a version using the POSIX threading implementation to run multi-threaded
  on Linux
• Pthread with hardware performance counters: a special version to measure the IPC
  value of the stressmark.
Using PAPI to Measure IPC
Using hardware performance counters in Linux is not a trivial affair. We use kernel 2.6.30,
which has built-in support for hardware performance counters. To access the counters, we
use the high-level API provided by PAPI 4.0 [4], which enables us to directly measure the
IPC value. To ensure that the measurements are accurate, we perform them ten times and
take the highest value.
Chapter 4
Stressmark Optimization
4.1 Introduction
Figure 4.1: Optimization process (abstract workload model → synthetic benchmark →
measurements (SESC / hardware performance counters) → optimization, in a feedback
loop).
The foregoing chapters describe in detail the abstract workload model we use, and how this
model can be transformed into a platform-portable synthetic benchmark, which can then be
executed on its target platform. During the execution of the benchmark, several characteris-
tics can be measured that reflect the performance, or the design requirements and constraints,
of the microprocessor architecture: power usage, temperature, throughput, etc. We call these
parameters the output characteristics of the benchmark.
The goal of a stressmark is to stress the processor, i.e. to induce extreme behavior by max-
imizing (or minimizing) one of these output characteristics. In order to reach this goal, we
start from a random initial workload model and then generate and execute the correspond-
ing stressmark. We measure the output characteristic of our choice, and then optimize the
result by using a genetic algorithm that generates new workload models and selects for this
characteristic.
4.2 Genetic Search Algorithm
4.2.1 Concepts
A genetic search algorithm is a heuristic that can be used to solve optimization and search
problems by employing the principles of biological evolution. Potential solutions to the prob-
lem, the individuals, are grouped in generations. Each individual in a generation has a
genotype, a phenotype, and a fitness value. The genotype, in our case the workload model, is
an abstract representation of the phenotype, in our case the corresponding stressmark. The
fitness value, a property of the phenotype, is the parameter getting optimized.
The algorithm proceeds by creating a new generation of individuals based on the existing
one. This is done in two steps: selection and reproduction. Through selection, two parent
individuals are picked from the current generation in order to produce an individual for the
new generation (also known as the child). This selection is based on the fitness values of the
individuals in the current generation, as individuals with a higher fitness value have a higher
probability of being selected. Reproduction is done in two phases: crossover and mutation.
During crossover, the genotypes of both parents are combined to form the genotype of the
child. In this way, crossover is a possible means to select the best properties of both parents,
creating a solution with higher fitness than each parent. After crossover, mutation is applied
by randomly varying the genotype of the child with a certain probability. The function of
mutation is to explore new possible solutions in the search space.
After all individuals of the new generation have been created using selection, crossover, and
mutation, their fitness values are calculated and the process repeats itself. If the algorithm is
working correctly, the average fitness value should increase from generation to generation.
One more principle that we applied is elitism. When a new generation is produced, the
best solutions of the previous generation are normally lost, since new individuals are usually
not identical copies of the previous generation. Elitism prevents this by copying the best
individual(s) of the previous generation to the new generation without altering them. The
elitism parameter is the number of individuals copied.
4.2.2 Domain and Fitness Value
The dimensions of the search space of our genetic algorithm are defined by the parameters of
the abstract workload model, some of which are discrete, and some of which are continuous.
As mentioned in the chapter on the workload model, the number of parameters is kept as
low as possible to reduce the size of the search space to a minimum and speed up the search
process. Nevertheless, a total of 30 parameters remains.
In principle, the fitness can be any measurable output characteristic of the stressmark’s ex-
ecution. We set up a simulated SMP MIPS platform on which we optimize for maximal
power usage, and an Intel Core2Quad x86-64 platform on which we optimize for maximal
IPC (instructions per cycle).
4.2.3 Configuration
Using the meta algorithm discussed further in this chapter, we decided on a generation size
of 72 workload individuals, a mutation probability of 10%, a crossover probability of 80%,
and an elitism value of 1.
4.2.4 Genetic Operators
Mutation
Mutation of a workload model is done by iterating over each of its parameters. If a parameter
is selected for mutation (depending on the mutation probability), its value is randomized.
The randomization process is dependent on the parameter type:
• Integer parameters have a maximum value, a minimum value, and a step size defined.
  The new value is min + n · step, with n a random integer between 0 and
  ⌊(max − min) / step⌋.
• Double parameters are randomized in the same way as integer parameters, except that
  the value of n is not capped.
• Parameters with enumerated values are set to a random element of the enumeration
  set.
• So-called FixedSumParameters, used to represent instruction mix distributions, are
  vectors with components totaling a predefined sum, usually 100. These parameters are
  randomized by picking a single one of their components and setting its value to a random
  number between 0 and the predefined fixed sum. As the total sum of the components
  will then no longer be correct, the vector is rescaled to fix this.
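The integer and FixedSumParameter rules can be sketched as follows; this is a C sketch with illustrative names (the framework itself is written in Scala, see chapter 5):

```c
#include <stdlib.h>

/* Mutation of an integer parameter: the new value is min + n * step,
 * with n a random integer in [0, (max - min) / step]. */
int mutate_int(int min, int max, int step) {
    int n_max = (max - min) / step;
    return min + (rand() % (n_max + 1)) * step;
}

/* Mutation of a FixedSumParameter: set one randomly chosen component
 * to a random value in [0, sum], then rescale the vector so that the
 * components again total "sum". */
void mutate_fixed_sum(double *v, int len, double sum) {
    v[rand() % len] = (double)(rand() % ((int) sum + 1));
    double total = 0.0;
    for (int i = 0; i < len; i++) total += v[i];
    if (total > 0.0)   /* guard against an all-zero vector */
        for (int i = 0; i < len; i++) v[i] *= sum / total;
}
```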
Selection
For selection we tried tournament selection and proportionate (roulette-wheel) selection,
and found that the latter gives the best results. Proportionate selection elects two parents
for crossover by picking them one after the other from the entire population, with a proba-
bility directly proportional to the fitness of the individuals. It is possible for the same
individual to be selected twice.
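Proportionate selection can be sketched as a small illustrative helper (not the framework's actual Scala implementation):

```c
#include <stdlib.h>

/* Roulette-wheel (fitness-proportionate) selection: individual i is
 * picked with probability fitness[i] / total fitness. Calling this
 * twice yields the two parents; the same index may come up both times. */
int select_parent(const double *fitness, int n) {
    double total = 0.0;
    for (int i = 0; i < n; i++) total += fitness[i];
    double r = (double) rand() / RAND_MAX * total; /* spin the wheel */
    for (int i = 0; i < n; i++) {
        r -= fitness[i];
        if (r <= 0.0) return i;
    }
    return n - 1;  /* guard against floating-point rounding */
}
```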
Crossover
Crossover of two workload models begins by creating an empty workload model for the result
and selecting one of the parent workloads. The algorithm then fills the empty result workload
by iterating over the workload parameters. At each iteration, the algorithm switches the
selected parent with the other with a given probability, and then sets the result workload's
value for the considered parameter to the value of the selected parent.
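This parameter-wise crossover can be sketched as follows; representing a workload model as a flat parameter array is an illustrative simplification:

```c
#include <stdlib.h>

/* Parameter-wise crossover: walk over the parameters, switching the
 * currently selected parent with probability switch_prob at each step,
 * and copy the selected parent's value into the child. */
void crossover(const double *parentA, const double *parentB,
               double *child, int num_params, double switch_prob) {
    const double *selected = parentA;
    for (int p = 0; p < num_params; p++) {
        if ((double) rand() / RAND_MAX < switch_prob)
            selected = (selected == parentA) ? parentB : parentA;
        child[p] = selected[p];
    }
}
```

With a low switch probability the child inherits long contiguous runs of parameters from one parent; with a high probability the parents are mixed almost parameter by parameter.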
4.3 Meta Algorithm
Although genetic algorithms are quite easy to implement, it is often unclear in the beginning
exactly which design choices should be made and how they are best configured. Four settings
have to be set right in order for the algorithm to function efficiently:
1. Population size. The population size parameter is all about getting the trade-off right
between generation size and the number of generations. It is often unclear whether a
large number of small generations or a small number of large generations will yield the
best results.
2. Mutation probability. High mutation probability increases the randomness and variation
in the population; it makes the search less directed. Low probabilities on the other hand
make the search more vulnerable to end up in local extrema.
3. Crossover probability. This probability has again an impact on the variation, but in a
different way. A population is often partitioned in different classes of good solutions
(think species), each class containing individuals with the same quality that positively
affects their fitness value, a quality that is different from the qualities of individuals
of other classes. A high crossover probability will lessen the number of these different
classes, heavily mixing them up until they are joined and so lowering variation; a low
probability will encourage the forming of these classes while risking that they grow apart
without ever having the chance of combining their qualities for an even better result.
4. Elitism. If individuals with high fitness values are preserved, every generation they feed
their properties into the population, keeping the search process ”on track,” but also
risking to outcompete solutions with alternative qualities, even though these may yield
better results in the long run.
Because the problem of getting these parameters right is not a trivial one, we implement a
simple meta search algorithm that optimizes them for us.
4.3.1 Domain and value
The four settings discussed in the previous section are the dimensions of our search space, and
the search method is a simple hill climbing algorithm. We consider the following possibilities
for each dimension:
1. The couple (generation size, generation count) can be (18, 12), (36, 6), (54, 4), or (72,
3). The product of the two components, i.e. the total number of individuals, is always
216.
2. The mutation probability varies between 0 and 1
3. The crossover probability varies between 0 and 1
4. Elitism is expressed relative to the size of the population, between 0 and 0.25
The value of each point in the search space is determined by using its components to configure
and run a genetic algorithm, and then calculating the gain in average fitness of its individuals
between the first and last generation. When comparing the values of points, care is taken
that the algorithm executions for each point start from the same initial population.
4.3.2 Neighborhood Concept
In order to implement a hill climbing algorithm, a neighborhood concept needs to be defined
first. Neighbors are typically found by slightly increasing and decreasing the component
values of the different dimensions of a point. Because the evaluation of each point requires
the running of an entire genetic algorithm, in our case the number of neighbors has to be
limited as much as possible. We therefore allow only one dimensional value at a time to be
modified.
The step size we use for the modification of the mutation, crossover, and elitism parameters
depends on a zoom level. At zoom level 0, the step size is one sixth of the entire domain
(e.g. between 0 and 1); at zoom level 1, the step size is (1/6)² · the domain size, and so on.
At the beginning of the search process, the zoom level is set at 0; it is increased whenever
the algorithm has reached an extremum, gradually refining the result.
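The zoom-level step sizes can be computed as follows (an illustrative helper, not framework code):

```c
/* Step size used when generating neighbors for the mutation, crossover,
 * and elitism dimensions: 1/6 of the domain at zoom level 0, (1/6)^2 of
 * the domain at zoom level 1, and so on. */
double step_size(double domain_size, int zoom_level) {
    double step = domain_size;
    for (int z = 0; z <= zoom_level; z++)
        step /= 6.0;
    return step;
}
```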
Chapter 5
The Stressmark Runner Framework
5.1 Design Considerations
Every software engineering student is familiar with several methods of software development,
ranging from the traditional, rigid waterfall model to newer, agile approaches like the extreme
programming method. This was however the first time we needed to develop a fairly large
piece of software within a research context. We found that such an environment gives rise to
specific requirements and challenges. The most important ones are discussed in the ensuing
paragraphs.
5.1.1 Scala/Java
Probably the most important requirement of the software development process in a research
environment is that the researcher should be able to focus on his core job and quickly imple-
ment the desired functionality while not being distracted by practical issues concerning the
programming language or the development environment. Ease of use should encourage him
to experiment and try out alternative solutions to the research problem he is examining. Fi-
nally, since the point of research is to discover new things, the functional requirements of the
software being built will change constantly and so the development process and programming
language need to facilitate these changes.
This is why we picked the Scala programming language for our project. Scala is a high-level
programming language that is compiled into byte code running on the Java virtual machine.
It is two-way compatible with Java, meaning that Scala code can be called from within a
piece of Java code as well as the other way around.
There are several benefits that make Scala a suitable language for developing research tools
and getting work done quickly in a research context. The first is expressivity. The more ex-
pressive a programming language, the more compact its notation, the more work can be done
in less time. While high-level programming languages in general tend to be more expressive
than low-level programming languages, for Scala this is particularly the case. A standard
implementation of the well-known quicksort algorithm gives a fair idea of the expressivity of
Scala.
Listing 5.1: Scala quicksort example
def qsort[T <% Ordered[T]](list: List[T]): List[T] = list match {
  case Nil => Nil
  case pivot :: tail =>
    val (before, after) = tail partition (_ < pivot)
    qsort(before) ::: (pivot :: qsort(after))
}
The implementation, using only standard language constructs, contains merely six short lines
of code. The resulting function is general enough to sort comparable objects of any type.
Since Scala is a functional programming language, and a lot of attention has been paid to
collections and syntactic flexibility, the code is very clear and expresses the intention of the
programmer in a natural way that closely follows the core line of reasoning of the algorithm.
Implemented in C++, the code size would be several times larger and a helper function
to partition a list of objects would have to be written, distracting attention from the core
functionality.
The second benefit is a corollary of the fact that Scala is intertwined with Java; because of
the two-way compatibility, any Java library can be called directly from Scala without using
glue code or compromising syntactic clarity or brevity. Since a huge number of Java libraries
can be found on the internet, most of them available for free and fairly well-documented, this
is hugely beneficial.
Comparing Scala to lower level languages, other advantages become clear. The type system
prevents basic errors yet is flexible enough not to be restrictive, and of course the developer
needn’t bother with memory allocation and pointers, two concepts leading to numerous bugs
in C/C++ that are often very hard to track down.
5.1.2 Apache Ant
Unfortunately, the use of a high-level programming language presents a big challenge as well.
Within a research environment a lot of interoperating tools are used, and often these tools are
written in a number of different programming languages. The environment is by its nature a
heterogeneous one, despite the fact that a tight integration is often key to work efficiency. This
is even more so the case since the development environment and production environment are
one and the same; the researcher is at the same time developer as well as end-user, constantly
trading one role for the other.
The only point where all these research and development tools come together is the command
line environment, and this is where the trouble lies. Since Scala is executed on the Java
Virtual Machine, which makes it platform-independent, communication with the operating
system and the command line environment becomes notoriously burdensome. A lot of glue
code is often required and it is hard to handle errors properly.
This is why we chose to use the Apache Ant tool, which was originally made to ease the build
and deployment processes of software applications written in Java. Ant serves as the bridge
between the Java and command line environments. Itself written in Java and supported by the
open-source community, it has a well-documented API that is readily available to the Scala
developer. Using this API, it becomes a lot easier and safer to execute file system operations
and invoke command line scripts and tools.
5.1.3 The Mirrored Command Suite Pattern
With Ant firmly in place, and so having joined our Java/Scala environment with the third-
party tools we work with, we needed to expose the functionality of our framework in a way
that it could be readily used to experiment and run tests. At first, we tried to write an
Ant project file in XML to accomplish this. This seemed reasonable, as such project files support
all the functionality we needed. The way to construct a project configuration for Ant is to
define different build targets, which nicely corresponded to the different functional units our
framework implements. We could then execute these targets via the command line, which we
figured would be a convenient way to experiment.
In theory this approach still sounds fine, but unfortunately we would soon find out that in the
end, the proof of the pudding is in the eating. The problem was the complexity and overhead
of working with the Ant configuration file; although it was possible to run commands and
algorithms written in Scala as well as command line tools by defining build targets in the XML
file, it turned out to be a real pain to do so. Commands implemented in Scala each needed
to have a special dedicated interface class to be executed by Ant; passing arguments and
parameters from the configuration space to the command turned out to be very troublesome;
and a lot of other inconveniences soon surfaced.
For this reason we decided to drop the configuration file altogether and opted instead for an
architecture pattern we like to call the mirrored command suite. This approach was inspired
by the way Linux applications and tool sets are often structured, combined with the need
(yet again) to integrate the Scala environment with the command line environment. The key
principles are the following:
1. All functionality of the framework that needs to be directly available to the researcher
to experiment with, is exposed as a set of input/output commands with a (limited)
number of parameters to control their behavior.
2. If this framework functionality includes processes (e.g. the generation of a stressmark),
each step in these processes is made available as a command in its own right. This allows
the researcher to easily execute, control, and debug each of these steps individually. For
convenience, the process as a whole can also be exposed as a single command.
3. Each command is mirrored, meaning it is made available twice: once through the com-
mand line shell as a script, and once through a Java/Scala class called the CommandIn-
terpreter. Typically, one version will be a proxy calling the second version. For example,
if a command invokes a third party tool, the shell script is the natural place to do this,
so the Java/Scala command will be a proxy using Ant to run the shell script. On the
other hand, if the command runs our own stressmark generator which is written in
Scala, the shell script will be the proxy running Scala, invoking the real command, and
passing the command parameters.
4. All shell script names start with the same prefix, in our case "smr" for stressmark
runner, followed by the name of the command.
5. The parameters and usage of each command are described in concise usage instruc-
tions that can be accessed in the traditional way by invoking the shell script without
parameters or with the parameter "help".
6. If a command's input and/or output is structured data, this data is presented in a
human-readable format that can easily be edited by the researcher. This format could
be XML, but we opted for YAML [6] (YAML Ain't Markup Language) as it is less
bloated and tends to be easier to read and edit.
We found that this approach is perfectly suited for development in a research environment.
For starters, owing to the fact that the basic architecture is extremely general and has very
little structure since it is just a set of commands, the software can easily evolve in the course of
the research. It is worth noting that this dynamic nature is completely compatible and
in line with the key tenets of the modern software development methodologies mentioned at
the beginning of this chapter: iterative development in small steps, responsiveness to changing
requirements, and a strong emphasis on experiment and early interaction with the software
that is being built.
Even the merits of the mental exercise of simply dividing the framework’s functionality in
different commands, each with their respective input, output, and parameters, should not be
underestimated, as this greatly contributes to the formation of clear concepts, structuring the
chaos that research usually is. We found that the flat, simple structure of the command set
can actually be better in this respect than a traditional and potentially more complex class
system.
On a more practical note, the command line shell can be optimally used as an interactive
development environment with the developer/user exploring and debugging new commands
and fully leveraging the finished functionality. While invoking commands, thanks to basic
shell functionality and the common prefix of the command’s shell scripts, pressing the tab
key functions as a code completion tool, and the usage instructions as parameter hints.
Continuous integration, another staple of modern software development, fits into this approach
as well. Since the development and user environments are the same, it makes perfect sense to
create additional commands to increase usability of the repetitive tasks a developer/researcher
typically faces while modifying the code of different tools: build processes, updating library
binaries with newly compiled versions, and so on.
Last but not least, a mirrored command suite also lends itself perfectly to building a job
queue for the distributed execution of commands on top of it, which is the subject of the next
design consideration.
5.1.4 A Distributed Job Queue
Research in computer science would be no fun at all without overly complex simulators and
sky-rocketing simulation times. Unfortunately, sometimes even too much caffeine really is too
much. It is for these occasions that we found it necessary to curb simulation times by
building a distributed job queue for the concurrent execution of commands.
The target platform for running this job system is the Hydra server cluster of the ELIS
research group at our university. We used a MySQL database server and nine worker servers
with a shared file system, each with a dual-core processor. When executing the jobs in
the queue, two worker threads run on each machine in order to efficiently utilize both
cores.
Like any SQL transaction, calls to the MySQL database comply with the so-called ACID set of
properties, where ACID stands for atomicity, consistency, isolation, and durability. Thanks to
these properties, the database is perfectly suited for the role of central communication point
between our 18 workers. These workers connect to the database, fetch the next job fit for
execution from the queue, and start crunching. After a worker has finished its job, the result
is written back to the database, and the whole process repeats itself from the beginning.
The main concern typically would be workers fetching or updating the same job due to
concurrency issues. A simple reservation system prevents this from happening. Each job
has a status set to QUEUED as long as the job is available for processing. When a worker
attempts to execute the job, it will first try to update the status to RESERVED using the
following query:
Listing 5.2: Get work query
UPDATE jobs SET State='RESERVED' WHERE JobID=?id AND State='QUEUED'
The ACID properties of the update transaction guarantee that this query can only update
the job record once. If a worker detects its query failed, it can therefore safely conclude the
job is being executed by another worker and attempt to fetch the next available job instead.
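The worker-side logic can be sketched as follows. This is a simplified, in-memory analogue for illustration only: an atomic compare-and-set on a map plays the role the conditional UPDATE plays under the database's ACID semantics; it is not the framework's actual code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory analogue of the job table: job ID -> state.
// Map.replace(key, expectedValue, newValue) is atomic on a ConcurrentHashMap,
// just like the conditional UPDATE ... WHERE State='QUEUED' is for the database.
class JobQueue {
    private final Map<Integer, String> jobs = new ConcurrentHashMap<>();

    void queue(int jobId) {
        jobs.put(jobId, "QUEUED");
    }

    // Returns true if this worker won the reservation; a concurrent worker
    // attempting the same job sees the state is no longer QUEUED and fails.
    boolean tryReserve(int jobId) {
        return jobs.replace(jobId, "QUEUED", "RESERVED");
    }
}
```

If tryReserve returns false, the worker simply moves on to the next available job, exactly as described above.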
5.1.5 Lessons Learned
Apart from the points discussed above, we also encountered some smaller and more obvious
issues which are briefly summarized here as lessons learned:
- Avoid immature development tools, and especially unstable builds. This proved a hard
thing to do, as Scala is a relatively new language and none of the plug-ins for the
mainstream IDEs had reached release-candidate level. We stuck with Netbeans early on,
but had to try quite a lot of builds until we found a relatively stable one.
- You Ain't Gonna Need It. Despite the fact that every software engineer knows the
infamous YAGNI maxim, there are few who always heed its advice. As numerous
others before us, we found that designing functionality too rigorously too early tends
to lead to over-engineering and features that were never really needed in the first place.
Agile practices help to keep this phenomenon to an absolute minimum.
- Debugging concurrency issues is no fun, especially in a distributed environment. We
tried to alleviate this problem by using MySQL as a central point of communication as
mentioned before, and by going to great lengths to test as much functionality as possible
locally and single-threaded, making sure most errors were detected while they were still
easy to trace.
- No silver bullet: the mirrored command suite pattern has one disadvantage we only realized
later on. Since each process is split up into different commands representing the steps
it contains, and each of these commands is invoked through the command line shell,
there is no encompassing runtime environment to preserve any state between steps.
This is usually not a problem, as this state is typically the input and output of each
process step, which we want to be editable by the researcher anyway and therefore save
in a human-readable file format. If this state becomes increasingly complex though,
the cost of writing the glue code saving and loading it may outweigh the benefits of
this approach. During the course of the project, this happened to be the case for the
measurement points of the meta search algorithm described elsewhere in this document.
The solution we came up with was to save the state to the database instead of the file
system, preserving the possibility to view and edit it while reducing the serialization
overhead. It is however easy to see that this solution will not necessarily work in every
case.
- Keep the use of shared objects, especially singletons, to a minimum while developing a
concurrent system. Scala explicitly supports the declaration of singleton objects, which
the programmer can define much like ordinary classes. While this is an attractive language
feature, we found that heavy use of singletons (much like static methods in the traditional
object-oriented programming model) can easily introduce concurrency issues if they contain
shared mutable state. We mitigated this by defining facade classes that regulate
access to these singleton objects and by minimizing the shared members.
5.2 Platform Setups
5.2.1 SESC SMP MIPS
Component Overview
The first platform we test our framework on employs the SESC simulator to execute the
generated stressmarks on a simulated SMP MIPS architecture. In this setup, our framework
runs on the Hydra distributed platform as described in the section on design considerations.
We distinguish between the worker nodes and the central database node which contains the
job queue for the workers to execute. Running on the worker nodes are the components we
have implemented ourselves in Scala (i.e. StressmarkRunner and StressmarkGenerator), and
the third-party tools we invoke as executable binaries. The former are described in detail in
the next section; the latter we discuss in the ensuing paragraphs.
In order to compile the generated C code of our stressmarks, we use the gcc cross compiler
for the MIPS instruction set, which is made available in the sesc-utils package. The version
of gcc is 3.4. To implement multi-threading, we used the SESC threading library. Apart from
gcc and SESC, we use the gnuplot tool for displaying the simulation results as a graph.
Figure 5.1: SESC setup overview. The Hydra worker nodes run the StressmarkRunner, StressmarkGenerator, SESC Simulator, GCC Compiler, GNUPlot, and SESC Threading Library components; the job database node runs the MySQL Job Database.
SESC - SuperESCalar Simulator
"SESC is a microprocessor architectural simulator developed primarily by the
i-acoma research group at UIUC and various groups at other universities that
models different processor architectures, such as single processors, chip multiprocessors
and processors-in-memory. It models a full out-of-order pipeline with
branch prediction, caches, buses, and every other component of a modern processor
necessary for accurate simulation. SESC is an event-driven simulator. It has
an emulator built from MINT, an older project that emulates a MIPS processor."
[5]
The default SMP (symmetric multi-processing) architecture configuration of SESC simulates
256 identical 70 nm cores running at 5 GHz with an issue width of 4. The branch predictor
is based on the Alpha 21264 hybrid predictor. The cache configuration is the following:
- L1D and L1I: 32 kB, 4-way set-associative, LRU, write-through
- private L2: 512 kB, 8-way set-associative, LRU, write-back, MESI
As the large number of cores combined with the 70 nm process and high clock rate does not
even remotely resemble a realistic processor design, we altered the configuration as follows:
- the number of cores was reduced to 4
- the clock rate of each core was reduced to 1 GHz
Experience
Paul Sack, one of the developers of SESC, introduces the simulator by stating that "the biggest
challenge for new students in architecture research groups is not passing theory or software
classes. It is not finding a new apartment or registering with the INS. It is understanding the
architecture of the processor simulator that will soon confront them—a simulator coded not
for perfection, but for deadlines. Even the most well-conceived simulator can quickly look
like a Big Ball of Mud to the uninitiated." We found that these observations accurately matched
our own experience.
5.2.2 Intel Core2Quad x86-64
Figure 5.2: Hardware setup overview. The test node runs the StressmarkRunner, StressmarkGenerator, GCC Compiler, GNUPlot, MySQL Job Database, and pthread Library components.
The second test platform is the Intel Core2 Quad Q9450, a hardware processor executing the
64-bit x86 instruction set. It has four 45 nm cores running at 2.66 GHz. The Thermal Design
Power (TDP), which Intel defines as the maximum power usage, is 95 W. The cache
configuration is the following:
� L1D and L1I: 32kB per core
� L2: 2 x 6 MB (shared by two cores each)
In the Intel Developers Manual [1] we found that the maximum number of instructions per
cycle (IPC) for this processor is four. We use this figure to evaluate the performance of our
generated stressmarks. We measured the IPC during our tests using hardware performance
counters.
The setup of the software components is the same as the one for the SESC platform, except
for three things: the SESC simulation has been replaced by native execution of the generated
stressmark, its threading library was replaced by the standard POSIX implementation for
Linux (pthread), and the database now runs locally since native execution is fast enough not
to require running the test in a distributed way.
5.3 Framework Architecture
5.3.1 Overview
At the highest level, the framework comprises two components: the StressmarkGenerator,
which transforms the abstract workload model into C code, and the StressmarkRunner,
which in turn contains the command set exposing the entire framework’s functionality to the
researcher, the genetic search algorithm used to optimize the generated stressmarks, and the
job queue for concurrent execution of commands.
The remainder of this chapter is a more detailed exploration of the four StressmarkRunner
packages. Note that only the most important classes of each package are given. For the full
picture, the source of the project should be consulted.
5.3.2 Commands
The Command Interpreter
As a central part of the mirrored command suite pattern, the command interpreter mirrors
the command line shell in Scala, allowing the researcher to run the commands that expose the
framework's functionality by providing an instruction string, for example "generate-workload
myworkload.yml" to save a random workload model in YAML format to the file myworkload.yml.
Figure 5.3: Package overview. The stressmarkrunner component contains the commands, jobmanager, search, and util packages; the stressmarkgenerator component contains the benchgenerator, descriptors, elements, memory, and util packages.
It interprets the instruction, identifies the command name and the different parameters, and
returns an AbstractCommand object, e.g. SMRGenerateWorkload.
This command object can then be executed, and the result code, the standard output string,
and the error string can be retrieved. Note that these three output elements mimic those of
a traditional command line shell script. Moreover, to make the congruency with shell scripts
complete, commands are always run in the context of a work directory.
As mentioned in the section on design considerations, it will often be the case that a command
object is no more than a proxy wrapper for a shell script (or an executable binary), which
is why the command line shell command class (CLSCommand) is provided. CLSCommand
uses the Apache Ant library to run a shell script from Scala, simply by providing the
instruction string (e.g. "ls -a").
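A minimal sketch of this dispatch is given below. The signatures are simplified for illustration (the real AbstractCommand also exposes run, getResult, getOutput, and getError, and the dispatch table here is only a stand-in for the framework's actual lookup):

```java
import java.util.Arrays;

// Simplified command interpreter: splits an instruction string into a
// command name and arguments and returns the matching command object.
abstract class AbstractCommand {
    final String[] args;
    AbstractCommand(String[] args) { this.args = args; }
    abstract int execute(); // returns a shell-style result code
}

class SMRGenerateWorkload extends AbstractCommand {
    SMRGenerateWorkload(String[] args) { super(args); }
    int execute() { return 0; } // would write a random workload YAML file
}

class CommandInterpreter {
    AbstractCommand getCommand(String instruction) {
        String[] parts = instruction.trim().split("\\s+");
        String name = parts[0];
        String[] args = Arrays.copyOfRange(parts, 1, parts.length);
        switch (name) {
            case "generate-workload": return new SMRGenerateWorkload(args);
            default: throw new IllegalArgumentException("unknown command: " + name);
        }
    }
}
```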
Figure 5.4: Commands core. Class diagram of the core commands package: CommandInterpreter (getCommand), AbstractCommand (run, getResult, getOutput, getError, execute), CLSCommand (execute), and the command classes SMRGenerateWorkload, SMRGenerateC, SMRCompile, SMRSimulate, SMRParseSimreport, SMRPlotSimresults, and SMRFullRun.

With the CommandInterpreter, the AbstractCommand, and the CLSCommand classes, everything
is in place to expose the entire functionality of the framework by providing the
command implementations. These implementations are grouped in five packages:
1. core: a package containing all commands necessary to generate one or more random
workloads, obtain the corresponding stressmarks, simulate them on SESC, and plot the
results in a graph.
2. jobmanager: contains commands to control the job queue for concurrent execution of
commands.
3. ga: provides the functionality needed to execute the genetic search algorithm for opti-
mizing stressmarks.
4. metaga: contains commands for the meta search algorithm to optimize the genetic
search algorithm.
5. other: a kitchen-sink package with all commands that belong nowhere else.
Forming the interface by which the developer/researcher controls the framework, the com-
mand implementations in each of these packages are now briefly discussed. The notation
used for the command names is the shell script variant; for example, SMRGenerateWorkload
becomes smr-generate-workload.
Core Commands
smr-generate-workload <workload output file> [<workload output file> ...]
The smr-generate-workload command generates a YAML file containing a workload with
randomly initialized parameters. The parameter values can easily be edited by the researcher.
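As an illustration, such a workload file might look roughly as follows; the field names and values here are purely hypothetical and do not reflect the actual workload model schema, which is described elsewhere in this document:

```yaml
# Hypothetical example only; the real schema is defined by the workload model.
threads: 4
instructionMix:
  arithmetic: 0.55
  memory: 0.30
  branch: 0.15
```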
smr-generate-c <workload file> <output c file>
This command generates the C code of the synthetic benchmark based on the workload that
is provided as input.
smr-compile <c file> <binary output file> [-debug]
The smr-compile command runs the gcc MIPS cross-compiler that comes with SESC utils
using the correct compiler flags, linked libraries, etc. The output file is a MIPS binary file
that can be run on the SESC simulator. Optionally, an extra debug file containing the
assembly code can be generated.
smr-simulate <binary file> <configuration> <report output file>
The smr-simulate command runs a MIPS binary file on the SESC simulator using the re-
quested hardware configuration. The result is a human-readable SESC simulation report.
smr-parse-simreport <simulation report> <simresults output file>
The smr-parse-simreport command parses a SESC simulation report and generates a YAML
file containing the data that is relevant to the genetic search algorithm (mainly power statistics).
smr-plot-simresults [-r] <output png file> <simresults input file> [<simresults
input file> ...]
The power statistics of one or more simulation runs can be plotted in a stacked graph contain-
ing the power usage split up into the Fetch, Issue, Memory, Execution, and Clock categories.
smr-full-run <workload file> [<workload file> ...]
The smr-full-run command combines all the necessary commands to run a simulation of a
synthetic benchmark based on the provided workload file. If more than one workload file is
provided as input, the simulations of all files will be executed in sequence (i.e. not using the
job manager queue).
JobManager Commands
Figure 5.5: Jobmanager commands. Class diagram of the jobmanager commands package: SMRQStatus, SMRQRunWorkers, SMRQShutdownWorkers, and SMRQReset, derived from AbstractCommand/CLSCommand.
smr-q-status
The smr-q-status command shows the number of jobs in the job queue and their state (CREATED,
QUEUED, RESERVED, RUNNING, SUCCESS, or ERROR). If there are active
workers, the command they are running is shown as well.
smr-q-run-workers <worker count>
The smr-q-run-workers command starts a number of worker threads on the local machine.
The workers will automatically connect to the job queue database and start executing any
queued jobs (provided all the jobs they depend upon are already finished successfully).
smr-q-shutdown-workers
The smr-q-shutdown-workers command tries to shut down all active workers currently con-
nected to the job queue. Since workers shut down automatically when all queued jobs have
finished, smr-q-shutdown-workers should only be used to stop the execution of the current
job queue in order to resolve errors, or cancel or pause the execution.
smr-q-reset
The smr-q-reset command clears the job queue. All job data in the database will be removed.
Genetic Algorithm Commands
Figure 5.6: Genetic algorithm commands. Class diagram of the ga commands package: SMRGaSetup, SMRGaEvolve, SMRGaPlotResults, SMRGaSummarizeGeneration, and SMRGaSummarizeTotal, derived from AbstractCommand/CLSCommand.
smr-ga-setup <population size> <generations>
The smr-ga-setup command creates the jobs necessary to run the genetic search algorithm
described earlier with a certain population size for a certain number of generations. The jobs
are ready to be executed by the workers (e.g. by running the smr-q-run-workers command).
The jobs employ the rest of the commands in this section to execute the search algorithm.
smr-ga-evolve [-D ga.mutationProb=x] [-D ga.crossoverProb=x] [-D ga.elitism=x]
<output files prefix> <input file> [<input file> ...]
The smr-ga-evolve command takes the simulation results of one population of synthetic bench-
marks as input and generates the workloads for a new population based on these results.
During the genetic evolution process, mutation, crossover, and elitist selection are applied.
smr-ga-plot-results [-r] <output file> <input file> [<input file> ...]
The smr-ga-plot-results command generates a PNG file using gnuplot, containing a graph that
displays the fitness values of the different generations of stressmarks. The graph includes
the minimum, average, and maximum fitness value (power usage) of each generation. The
input files are generation summaries generated by the smr-ga-summarize-generation command
below.
smr-ga-summarize-generation <output file> <input file> [<input file> ...]
The smr-ga-summarize-generation command takes a list of simresults YAML files as input
and generates another YAML file containing the minimum, average, and maximum fitness
for each input file. The input files normally correspond to the stressmark individuals in the
current generation of the genetic algorithm described in this document.
Meta GA Commands
Figure 5.7: Meta GA commands. Class diagram of the metaga commands package: SMRMetagaInit, SMRMetagaSetupEvaluation, SMRMetagaCollectEvaluationResult, and SMRMetagaExpand, derived from AbstractCommand/CLSCommand.
smr-metaga-init
The smr-metaga-init command creates a single metaga measuring point in the database. The
measuring point corresponds to a genetic search using a specific configuration. A search
configuration comprises the generation size, the number of generations, the mutation
and crossover probabilities, and the elitism parameter.
smr-metaga-setup-evaluation
The smr-metaga-setup-evaluation command generates the necessary jobs to run the genetic
searches for each unevaluated meta-ga measuring point in the database. If, for example,
smr-metaga-init has been used to create a single measuring point, the jobs that run the
genetic search corresponding to the configuration of that measuring point will be generated.
smr-metaga-collect-evaluation-results
This command collects the scores of the differently configured genetic algorithms that have
been run and stores these results in the database.
smr-metaga-expand
The smr-metaga-expand command generates the neighbors of the best scoring (unexpanded)
metaga measuring point in the database. Every neighbor slightly varies in one parameter
of the measuring point configuration (e.g. a slightly increased mutation probability, or an
elitism parameter that is decreased by one).
Other Commands
smr-consistency-test <workload> <test-runs>
This command generates the jobs necessary to run a number of stressmark simulations based
on a single workload and calculate the consistency of the different outcomes (i.e. the minimum,
maximum, average, and standard deviation of the power usages). Use smr-q-run-workers to
start executing the generated jobs.
Apart from the commands controlling the functionality of the framework, a couple of com-
mands have been made available to ease the development process and make it more
efficient:
1. smr-build-sesc-and-spot [clear]: runs the make commands and others necessary to (re)build
the SESC simulator and the hotspot adaptation for SESC.
2. smr-build-sesc-configs: runs the make commands and others necessary to (re)build the
configuration files for the different architectures supported by SESC.
Figure 5.8: Other commands. Class diagram of the other commands package: SMRConsistencyTest, SMRConsistencyTestCalcResults, SMRBuildSescAndSpot, SMRBuildSescConfigs, SMRUpdateJar, SMRUpdateJarOnHydra, and SMRBootHydraWorkers, derived from AbstractCommand/CLSCommand.
3. smr-update-jar: copies the StressmarkRunner and StressmarkGenerator jar files pro-
duced by Netbeans to the local test environment.
4. smr-update-jar-on-hydra: uploads the StressmarkRunner and StressmarkGenerator jar
files produced by Netbeans to the Hydra environment.
5. smr-boot-hydra-workers: automatically boots two workers on every Hydra server. This
command is only available on the Hydra servers, not in the local test environment.
5.3.3 Jobmanager
Figure 5.9: Jobmanager. Class diagram of the jobmanager package: JobManager (createSimpleJob, createCompoundJob, run, busy, stop), Job (create, queue, reserve, run, waitForDependencies) with its subclasses SimpleJob, RegularJob, CompoundJob (createChildJob), and ChildJob, plus the Worker and JobDatabase classes.
The jobmanager package contains all classes related to the job queue that is used to run
commands concurrently. Using the JobManager, new jobs can be created and worker threads
can be run to execute the jobs in the queue. Before creating a new job, however, the developer
should decide whether it is necessary to create a CompoundJob or whether a SimpleJob will suffice.
- SimpleJobs are single instructions that come in three different types. The CUSTOM
type is used for executing a framework command through the CommandInterpreter,
the SHELL type runs a command line instruction, and the ANT type executes an Ant
target defined in a configuration file.
- CompoundJobs are sequences of ChildJobs. ChildJobs are just like SimpleJobs apart
from the fact that they have a CompoundJob as parent. All ChildJobs of a single Com-
poundJob are fetched and executed by the same worker thread.
Optionally, jobs can be organized in dependency groups and be made dependent on such
groups. Throughout its lifetime, a job can progress through the following states:
Figure 5.10: Job states: CREATED, QUEUED, RESERVED, RUNNING, SUCCESS, and ERROR.
A job that is in the CREATED state is present in the queue, but will not be executed yet.
After all properties have been set correctly, it can be QUEUED so a worker can try to fetch
it. In order to make sure each job is executed by only one worker, a worker must first
update the job's state to RESERVED, as described in the section on design considerations. After
the reservation has succeeded, the worker first waits for all dependencies of the job to be
finished successfully and then starts RUNNING the job which will either lead to SUCCESS
or an ERROR. The result of the whole operation is then written back to the job database.
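The lifecycle above can be sketched as a small state machine; this is an illustrative model of the states described in the text, not the framework's implementation:

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative model of the job lifecycle described above.
enum JobState {
    CREATED, QUEUED, RESERVED, RUNNING, SUCCESS, ERROR;

    // Legal successor states for each state.
    Set<JobState> successors() {
        switch (this) {
            case CREATED:  return EnumSet.of(QUEUED);
            case QUEUED:   return EnumSet.of(RESERVED);
            case RESERVED: return EnumSet.of(RUNNING);
            case RUNNING:  return EnumSet.of(SUCCESS, ERROR);
            default:       return EnumSet.noneOf(JobState.class); // terminal
        }
    }

    boolean canTransitionTo(JobState next) {
        return successors().contains(next);
    }
}
```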
Note that our platform setup includes a shared file system for all workers. It is the responsi-
bility of the developer creating the jobs to ensure that files are managed in a way that plays well with
the concurrent execution of queued jobs. Concurrency can be controlled using dependency
groups and compound jobs if necessary.
5.3.4 Search
The packages ga and metaga contain the classes implementing the core of the genetic search
algorithm and the meta search algorithm.
Genetic Algorithm Classes
For conducting a genetic search, the developer creates the Individuals populating the first
generation. An Individual is nothing more than a fitness value and a ParameterMap which
stores a number of key-value pairs expressing the individual’s properties and information on
how to mutate each of these values. In the case of our framework, the ParameterMap of the
Figure 5.11: Genetic Algorithm Classes (GeneticPopulation, Individual, and GaMeasurement in the ga and metaga packages).
Individuals will invariably contain the abstract workload model and the knowledge of how each value can be mutated; for example, the instruction mix can be mutated by randomizing the percentages of arithmetic, memory, and branch instructions.
When all individuals are created and their fitness values computed, a GeneticPopulation can
be instantiated containing them. Progressing to the next generation then becomes as simple
as calling the evolve method, passing the size of the next generation, the number of best
individuals that should survive without mutation (i.e. the elitism factor), and the mutation
and crossover probabilities. The evolve method returns a new GeneticPopulation object with a
new set of individuals, only this time the fitness values of these individuals are unknown. It is now
the responsibility of the developer to set these fitness values before calling the evolve method
again, and so on.
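A condensed sketch of this developer-side loop is shown below. The class and method names mirror the framework's API, but the one-parameter genotype, the elitist truncation-based selection, and the placeholder fitness function are our own illustrative assumptions, not the framework's actual implementation.

```python
import random

random.seed(0)

class Individual:
    def __init__(self, parameters):
        self.parameters = parameters          # stands in for the ParameterMap
        self.fitness = None
    def mutate(self, probability):
        if random.random() < probability:
            self.parameters["x"] += random.uniform(-1, 1)

class GeneticPopulation:
    def __init__(self, individuals):
        self.individuals = individuals
    def evolve(self, generation_size, elitism, mutation_p, crossover_p):
        ranked = sorted(self.individuals, key=lambda i: i.fitness, reverse=True)
        # the best `elitism` individuals survive without mutation
        children = [Individual(dict(i.parameters)) for i in ranked[:elitism]]
        while len(children) < generation_size:
            a, b = random.sample(ranked[: max(2, len(ranked) // 2)], 2)
            child = Individual({"x": (a.parameters["x"] + b.parameters["x"]) / 2
                                if random.random() < crossover_p
                                else a.parameters["x"]})
            child.mutate(mutation_p)
            children.append(child)
        return GeneticPopulation(children)

def fitness(ind):                             # placeholder: maximize -(x - 3)^2
    return -(ind.parameters["x"] - 3.0) ** 2

population = GeneticPopulation([Individual({"x": random.uniform(-10, 10)})
                                for _ in range(20)])
for _ in range(30):
    for ind in population.individuals:        # the developer's responsibility:
        ind.fitness = fitness(ind)            # set fitness before evolving
    population = population.evolve(20, 1, 0.1, 0.8)
best = max(fitness(i) for i in population.individuals)
```

In the framework, computing the fitness of each individual means generating the corresponding stressmark and measuring its output characteristic, which is why this step is left to the developer rather than hidden inside evolve.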
Meta Search Classes
The problem for the developer might be that it is unclear for which values of the population
size, elitism, and mutation/crossover probabilities the genetic algorithm performs best. In
order to examine this question thoroughly, the meta search algorithm can be used.
The meta search is a hill climbing algorithm whose four-dimensional search space contains the
points defined by the tuple (generation size and count, elitism, mutation probability, crossover
probability). The points of the search space which are being explored by the algorithm are
stored in the database and represented by the GaMeasurement class.
Note that the GaMeasurement class is primarily a helper class that eases accessing the database and calculating neighboring measurement points; the rest of the hill climbing functionality is implemented by the metaga commands described earlier (most notably, selection of the best measurement in order to expand it).
The developer should manually instantiate the initial measurement point by providing its
properties and calculating its fitness. After having committed this measurement point to the
database, its neighbors can be retrieved. The neighbors will be up to eight new measurement
points, each slightly varying in one dimension of the search space. It is now the responsibility
of the developer to calculate the fitness of each neighbor. After this, the process can be
repeated by calling the getNeighbours method again, and so on.
The zoom level of the getNeighbours method defines the granularity of the variation between
a measurement point and its neighbors.
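As an illustration, the neighbor expansion and hill climbing can be sketched as below. The step sizes, the zoom-level scaling, and the toy fitness function are our own assumptions; only the idea of up to eight neighbors varying one dimension at a time comes from the framework.

```python
# Each measurement point is a tuple in the four-dimensional space
# (generation size/count, elitism, mutation probability, crossover probability).
def neighbours(point, zoom_level):
    """Up to eight neighbours: +/- one step in each of the four dimensions."""
    scale = 2 ** zoom_level                  # higher zoom level = finer steps
    steps = (max(1, 8 // scale), max(1, 2 // scale), 0.05 / scale, 0.1 / scale)
    result = []
    for dim, step in enumerate(steps):
        for sign in (+1, -1):
            candidate = list(point)
            candidate[dim] += sign * step
            # discard points outside the valid search space
            if dim >= 2 and not (0.0 <= candidate[dim] <= 1.0):
                continue
            if dim < 2 and candidate[dim] < 1:
                continue
            result.append(tuple(candidate))
    return result

def hill_climb(start, fitness, zoom_level=0, rounds=10):
    best = start
    for _ in range(rounds):
        improved = max(neighbours(best, zoom_level), key=fitness, default=best)
        if fitness(improved) <= fitness(best):
            break                            # local maximum reached
        best = improved
    return best

# Toy fitness: prefer a crossover probability near 0.8.
best = hill_climb((72, 1, 0.10, 0.5), lambda p: -abs(p[3] - 0.8))
```

In the framework, evaluating the fitness of a point means running an entire genetic algorithm with those parameters, which is why the developer computes each neighbor's fitness between calls to getNeighbours.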
5.3.5 Util
Figure 5.12: Util package (Database, YamlStruct, and ParameterMap).
This package contains a number of utility classes facilitating the use of the database and the
reading of YAML files. For the latter we use the jyaml library [3].
Chapter 6
Results
6.1 Number of SESC Instructions
In order to obtain stable output characteristics when running a stressmark on the SESC
simulator, it is necessary to run the simulation long enough. There are two reasons for this.
The first one is the initialization phase executed when the stressmark is started. During this phase, the stressmark instructions themselves have not kicked in yet, so the output characteristics do not reflect the stressmark's qualities. Moreover, as the power characteristic we use applies to the entire run of the stressmark, the effect of the initialization dies away only slowly.
The second cause is the fact that even the stressmark code itself needs to be run for a certain
amount of time in order for its behavior to stabilize, since, for example, the behavior of the
data caches changes over time as they will slowly adapt to the stressmark loop code being
executed.
6.1.1 Rabbit Mode
Luckily, there is a way to minimize at least the effects of the initialization phase. SESC supports a so-called "rabbit mode," allowing the simulator to hop over the initialization, speeding up the simulation during this phase by calculating only the data that is strictly necessary to progress correctly. A corollary is that no statistics are recorded while in rabbit mode, which is exactly what we want to lessen the effect of the initialization on the average power usage.
6.1.2 Test Setup
We determine the number of instructions necessary for a stable output characteristic by
simulating an increasing number of instructions of a random stressmark. The first run only
simulates 1000 instructions, the second 2500, the next 5000, 10000, etc. We continue this
process until the value of the output characteristic remains more or less constant. We can
then choose an optimal number of instructions, balancing the tradeoff between result accuracy
and simulation time.
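The instruction counts follow a 1 / 2.5 / 5 pattern per decade. As an illustration (the helper below is our own, not part of the framework), the sweep can be generated as:

```python
# Generate the instruction-count sweep: 1000, 2500, 5000, 10000, 25000, ...
def instruction_sweep(start=1_000, stop=50_000_000):
    counts, decade = [], start
    while decade <= stop:
        for factor in (1, 2.5, 5):
            value = int(decade * factor)
            if value <= stop:
                counts.append(value)
        decade *= 10
    return counts

sweep = instruction_sweep()
```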
The entire process is then repeated with rabbit mode enabled in order to see its effect. The number of instructions skipped in rabbit mode is one million.
Figure 6.1: Executed Instructions (average power in watts versus the number of simulated instructions, with rabbit mode disabled and with 1M instructions in rabbit mode).
6.1.3 Results and Discussion
Figure 6.1 shows the results of both experiments. Please note that the number of instructions
on the horizontal axis increases exponentially, and that it is proportional to the simulation
time.
Looking at the power characteristic with rabbit mode disabled, the effect of the initialization code is very clear. The heavy workload our stressmark generates only gradually increases the average power consumption, reaching relative stability at about ten million instructions.
As was expected, the test with rabbit mode enabled shows a lessened effect of the initialization and climbs to its maximum more quickly. Note however that this comes at the (small) cost of running the first million instructions in rabbit mode prior to the actual simulation.
In order to keep the simulation time short enough without compromising the accuracy of our results too much, we have eventually settled for simulations running one million instructions in rabbit mode, followed by two million instructions of normal simulation. These figures are used for all simulations discussed in this document, unless stated otherwise.
6.2 Exploration of Search Space
For many developers, the idea of solving a hard problem simply by unleashing a genetic algorithm on it in the hope that it will magically do the heavy lifting is a very tempting one. It often appears that no real insight into the problem domain is needed in order to design a genetic algorithm that provides all the answers one is looking for. This is however a deceptive thought, as the success of a genetic algorithm very much depends on the definition of its genotype and its genetic operators. Devising these two definitions is precisely the mental exercise for which a deep understanding of the problem at hand is crucial.
In this section, we discuss the data we analyzed in order to gain this much needed insight into the problem domain of our framework and to check the relevance of the genotype we defined, the abstract workload model, by making sure it efficiently controls the output characteristics.
We do this by studying the relations between several parameters of our workload model and
the two output characteristics we optimize for: power consumption and IPC.
The platform we used for generating this data is the SESC SMP configuration described
earlier. The configuration contains four hardware threads, and two integer and two floating
point ALUs. We now look at different types of workload models.
6.2.1 Integer Addition
The instruction mix of our first workload model demands a stressmark with nothing but
integer addition instructions. The minimum dependency distance varies from 1 to 16.
As we will notice time and again, the first thing that immediately becomes apparent is the
close correlation between our two output characteristics, the number of instructions per cycle
and the power usage. This is of course to be expected, as a higher throughput immediately
Figure 6.2: Integer Addition (power and IPC versus minimum dependency distance).
translates into heavy usage of the different processor components, which in turn results in a
higher energy consumption.
In order for this high throughput to be possible, as many instructions as possible need to be
processed in parallel. This will be the case if many instructions following each other can be
executed independently, in other words: the larger the minimum dependency distance (MDD)
between instructions, the higher the IPC and power usage.
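To make the role of the MDD concrete, the following sketch (our own illustration, not the actual generator code) emits a block of C-like addition statements in which every instruction depends on the result produced exactly mdd instructions earlier, so that mdd independent dependency chains can be in flight at once:

```python
# Emit `length` additions that respect a minimum dependency distance:
# instruction i reads the result of instruction i - mdd, falling back to
# one of `mdd` independent initial values at the start of the block.
def integer_add_block(mdd, length=8):
    lines = []
    for i in range(length):
        src = i - mdd          # most recent result that is far enough away
        operand = f"r{src}" if src >= 0 else f"init{i % mdd}"
        lines.append(f"int r{i} = {operand} + 1;")
    return lines

# With mdd=1 every addition depends on the previous one (a serial chain);
# with mdd=4 four independent chains can issue in parallel.
serial = integer_add_block(1)
parallel = integer_add_block(4)
```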
The data clearly meets our expectations on this account, although we notice the output
characteristics ceasing to grow at an MDD of 10 and upwards. This is the point where an
IPC of two is reached, and since there are only two integer ALUs, it is clear that this is the
limit of what we can achieve by using only integer instructions.
6.2.2 Integer Multiplication
The second workload model only contains integer multiplication operations. All the conclu-
sions we reached for the integer addition operations are valid in this case as well. Note that
the power usage is slightly less than in the previous case; this is partly because every multipli-
cation in C code is compiled into two instructions: one for the multiplication operation itself,
and one for copying the low bits from the result register. The latter consumes less energy,
Figure 6.3: Integer Multiplication (power and IPC versus minimum dependency distance).
dragging down the average figure for the power usage.
6.2.3 Double Addition
We now look at a stressmark that contains exclusively double add operations. Once again the
positive correlation between the MDD and our output characteristics is apparent, although a
couple of things are different this time.
The minimum dependency distance needed to reach an IPC of 2 is higher than in the case of integer operations. At an MDD of 16, the IPC actually grows larger than 2, although there are only 2 floating point ALUs. This is due to spilling from floating-point registers to temporary integer registers; the spill operations positively affect the IPC since they are handled by the integer ALUs.
6.2.4 Double Multiplication
Our conclusions for double multiplication are the same as those for addition, with again a
higher MDD that is needed to reach an IPC of 2, and a lower power usage than in the case
of additions.
Figure 6.4: Double Addition (power and IPC versus minimum dependency distance).
6.2.5 Integer and Double Combined
If we look at a stressmark based on a workload model combining integer additions with
double multiplications, we can increase the MDD even more to reach an IPC of four. Since
the registers of both the double and the integer ALUs are saturated at this point, register
spilling this time affects the IPC negatively with a serious drop in performance at an MDD
of about 27.
6.2.6 Private Loads and Stores
We now turn our attention from arithmetic to memory instructions. The first workload model
we take a look at in this category, is restricted to thread-local load and store instructions with
stride 0, meaning these instructions access the same address each iteration. Instead of the
minimum dependency distance, we now vary the ratio of loads and stores. On the left,
the instruction mix exclusively contains load instructions; on the right, we only have store
instructions.
It is apparent from the graph that executing a lot of store operations has a heavily negative
impact on the IPC. This is to be expected as a store instruction needs to be propagated
through the different cache levels, causing pipeline stalls. When on the other hand a load
Figure 6.5: Double Multiplication (power and IPC versus minimum dependency distance).
instruction is executed, the data can be immediately fetched from the cache, resulting in a
relatively high IPC.
6.2.7 Shared Loads and Stores
The last instruction mixes we consider consist exclusively of shared load and store operations, again with stride 0. Note that this time, all operations use the same memory address to load from and store at. The mix containing only load instructions stands out as it reaches an IPC of 2, sharply contrasting with the instruction mixes containing store operations. Once again, this can be explained by the cache behavior. Since all operations are applied to the same address, even a small number of store operations ruins the performance: cache coherence traffic causes many pipeline stalls, rendering the caches virtually useless.
6.2.8 Conclusion
Having explored the search space of our framework by looking at distinct workload models
exhibiting different types of behavior, we conclude that the parameters of our model efficiently
control the output characteristics in the way we expected. Although the effects of changing
Figure 6.6: Integer and Double Operations (power and IPC versus minimum dependency distance).
the minimum dependency distance and the load/store ratio can be understood in the case of
these simple scenarios, it is unclear how these effects add up when they are combined, causing
complex behavior. This is where our genetic algorithm comes into play.
6.3 GA Results
We now take a look at the main result of our work. As described in an earlier part, we set
up two different target platforms to demonstrate the portability of our stressmarks. We run
the genetic algorithm with the parameters discovered by the meta search algorithm:
1. A generation size of 72 individuals.
2. A mutation probability of 10%.
3. A crossover probability of 80%.
4. An elitism factor of 1.
Note that we do not determine the number of generations upfront. We simply run the algorithm until the output characteristic no longer changes significantly for a number of
Figure 6.7: Private Loads and Stores (power and IPC versus the ratio of loads to stores).
generations. The graphs show slightly fewer generations than we actually ran in order to focus on the more interesting first part.
6.3.1 SESC Platform
The output characteristic for the SESC platform is the average power usage over the course
of the entire execution of the stressmark. Figure 6.9 shows an overview of the generations
produced throughout the search process, displaying the maximum, average, and minimum
power usage.
In the initial population, the power usage of the fittest stressmark is 41.28 watts, the power
usage of the least fit stressmark is 12.47 watts, and the average power usage is 21.28 watts.
Throughout the search process, the maximum and the average power usage become signifi-
cantly larger while the values for the minimum power usage remain around 10 watts. This
is a good sign since it shows that the tradeoff between variety and quality is nicely balanced
in the population; this is necessary in order to optimize the found solutions without getting
stuck in a local maximum.
Further looking at the evolution of the fitness, we find the characteristics of a typical genetic
algorithm run. In the first generations, the best properties present in the initial population
Figure 6.8: Shared Loads and Stores (power and IPC versus the ratio of loads to stores).
are selected for and combined using the crossover operator, quickly increasing the maximum fitness until it has almost tripled around the sixth generation. This is the point where the algorithm plateaus for the first time, struggling for a couple of generations until new fitness-increasing properties are discovered around generation 12. We see this happening again around generations 42, 65, and 75.
Note that despite the elitism factor of 1, which in principle always preserves the fittest individual, the maximum fitness sometimes drops, and towards the end seems to alternate between two values. This is because the behavior of the stressmark generator was not yet entirely deterministic at the time of this experiment, causing variance in the output characteristics of the stressmark corresponding to the fittest workload model (and undoubtedly the other models as well).
The final best result is a stressmark first produced in generation 87 with a power usage of 163 watts, an increase by a factor of four compared to the fittest individual in the initial generation. The properties of its workload model are shown below:
Listing 6.1: Workload SESC
memoryShared: 64
traceSize: 100
arithmeticInstructionMix:
Figure 6.9: Result SESC (minimum, average, and maximum power usage per generation).
  doubleAdd: 44
  doubleMul: 28
  integerAdd: 23
  integerMul: 5
mdd: 29
swThreads: 3
instructionMix:
  arithmeticInstructions: 89
  memoryInstructions: 2
  branchInstructions: 9
memoryThreadLocal: 2048
memoryStrideProfile:
  size1: 26
  size0: 37
  size4: 2
  size3: 2
  size2: 33
memoryInstructionMix:
  unsharedLoad: 5
  sharedLoad: 10
  unsharedStore: 41
  sharedStore: 44
branchTransition:
  rate0: 11
  rate1: 33
  rate2: 10
  rate4: 28
  rate8: 18
As might be expected, we notice a high proportion of arithmetic instructions and a large
minimum dependency distance. The arithmetic instruction mix is quite balanced, but seem-
ingly avoids integer multiplications. The memory instruction mix has an unexpectedly high
number of store operations, but the selection pressure on this mix is probably not that large,
since the proportion of memory instructions is only 2 percent.
6.3.2 Core 2 Quad Platform
Figure 6.10: Result Core 2 Quad (minimum, average, and maximum IPC per generation).
The same experiment was repeated on the Intel Core 2 Quad target hardware platform. This time, the output characteristic we use is the number of instructions per cycle, which is a fair indicator of the power usage, as we demonstrated in the section exploring the search space. The IPC is measured using hardware performance counters.
Although less pronounced than in the case of the SESC platform, here too we find a significant increase in the output characteristic. In the initial population, the maximum IPC value is 2.25, while in the final generation the fittest individual has an IPC of 2.92, an increase of 30%.
Our algorithm seems to be limited to an IPC of 3, although the maximum IPC of the processor
is 4. The workload model used for the best stressmark gives us a clue as to why this is the
case:
Listing 6.2: Workload x86-64
memoryShared: 64
traceSize: 50
arithmeticInstructionMix:
  doubleMul: 7
  integerAdd: 33
  doubleAdd: 19
  integerMul: 41
mdd: 9
swThreads: 8
instructionMix:
  arithmeticInstructions: 81
  memoryInstructions: 2
  branchInstructions: 17
memoryThreadLocal: 2048
memoryStrideProfile:
  size0: 30
  size1: 26
  size2: 19
  size3: 17
  size4: 8
memoryInstructionMix:
  unsharedLoad: 25
  sharedLoad: 41
  unsharedStore: 30
  sharedStore: 4
branchTransition:
  rate0: 13
  rate1: 24
  rate2: 13
  rate4: 26
  rate8: 24
Looking at the instruction mix in particular, we can see that the algorithm heavily selected for arithmetic instructions and, to a lesser extent, branch instructions. Given that it follows this approach and eliminates nearly all memory instructions, an IPC of 3 indeed becomes the limit, because the Intel Core 2 Quad has only three arithmetic ALUs, which the algorithm fully exploits by balancing the arithmetic instruction mix for optimal throughput.
The IPC limit of 4 is based on the fetch width and it is now clear that it would be necessary to
somehow add memory instructions to the mix to come close to this figure. Other individuals
in the population do have these memory instructions but have a lower fitness, probably
because they no longer optimally stress the arithmetic ALUs as a consequence. If a sweet
spot exists, combining arithmetic and memory instructions up to an IPC of 4, our algorithm
was unfortunately unable to find it—as were we.
6.4 GA Efficiency
Although we found that our genetic algorithm yielded quite satisfactory results, especially in
the case of the SESC target platform, we now set up an experiment to obtain a more objective
measure of its performance. The genetic algorithm set up for SESC had a population size of 72
individuals, and it took roughly a hundred generations in order to find an optimal stressmark.
This means that a total of 7200 workload models were produced, and the same number
of simulations were executed for measuring the output characteristics of the corresponding
stressmarks.
Using this figure, we set up a random search algorithm on the same target platform, producing another 7200 workloads and determining the fitness values of their stressmarks. We provide two views on this data: a recording of the different results as they were generated (in grey), and a more informative, sorted version showing the distribution of stressmarks in the entire set (the thin black line). We take a look at the sorted distribution of the stressmarks.
The first thing we notice is the small proportion of stressmarks producing a constant power
usage of 1.9 watts. These are different instances of the same dummy stressmark that is
generated by the framework whenever it encounters a workload model that cannot be used
to produce a valid stressmark. This may for example be the case if the proportion of branch
Figure 6.11: Random Search (power usage of the 7200 randomly generated stressmarks, in generation order and sorted).
instructions is too high (e.g. nearly 100%). We find that 159 stressmarks in the set are invalid;
this is 2.2%. This is certainly good enough, since we deal with totally randomly generated
workloads here; during the course of a genetic algorithm run, this percentage will be much
lower as the invalid stressmarks are immediately eliminated by the selection process.
If we look at the rest of the distribution, we can spot a small number of extreme stress-
marks. The minimum power usage is 7.74 watts, and the maximum 88.84 watts; the standard
deviation is a meager 6.99 and the average stressmark has a power usage of 24.03 watts.
We cannot compare the maximum power usage to the result shown earlier, since that result has
been obtained using another version of our stressmark generator. The graph below shows a run
of the genetic algorithm using the same version. Its maximum value after 7200 simulations is
140.79 watts, performing significantly better than the random search algorithm: a solid 52%.
It is of course also important to notice that the random search algorithm, being random,
will yield different results each time it is run, especially since the standard deviation in the
set is so low, making the well-performing stressmarks a rare commodity. The genetic search algorithm not only performs better, its performance is also much more stable, especially given the fairly low mutation rate that is used.
On a closing note, the random search algorithm also provided us with some data on the
Figure 6.12: GA Comparison (minimum, average, and maximum power usage per generation).
timing performance of SESC, our distributed computing setup, and our framework in general
(excluding the components of the genetic algorithm). It took the eighteen workers on our
Hydra test platform seven hours to generate and evaluate the 7200 random workload models,
resulting in an overall throughput of one stressmark every 3.5 seconds.
6.5 Theoretical Maximum
The random search algorithm provides a means to evaluate the efficiency of our genetic algo-
rithm; we now describe a way of testing the efficiency of our stressmark generator framework
in its entirety. We do this again using the SESC target platform.
SESC supports different architectures, each defined in its own configuration file, the content of which expresses the properties of the comprised building blocks. Among these properties are the energy values dissipated when a particular building block is active, when a particular operation is performed by the building block, or when a certain event takes place (e.g. a cache miss).
Based on these values, we calculate the theoretical maximum power usage a stressmark can generate. This is done simply by adding together all energies that can be dissipated in a single clock cycle and multiplying the result by the clock frequency. We also assume that it is not possible to simultaneously write to and load from a cache unit, or to effect a cache hit at the same time as a cache miss. When encountering incompatible energy values like these, we add only the larger of the two. The result is displayed in figure 6.13.
Figure 6.13: Theoretical Upper Limit (Core 1-4: 54.30 watts each; L1 caches: 30.44 watts; L2 caches: 68.46 watts; TLB: 5.75 watts).
The main components in the SMP configuration are four identical cores, each with a theoretical maximum power usage of 54.30 watts, the L1 and L2 caches, together totaling nearly 100 watts, and a TLB using at most 5.75 watts. This brings the total theoretical maximum to 321.85 watts.
The best stressmark produced by the genetic algorithm has a power usage of 163.18 watts or
50.7% of the theoretical maximum. In Joshi et al. [12] we find a comparable percentage of
57%.
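As a sanity check, the arithmetic behind these figures can be reproduced directly. The component values come from figure 6.13 and the surrounding text; the breakdown into a dictionary is our own presentation.

```python
# Sum the per-component theoretical maxima and compare the best
# stressmark against the resulting total.
components = {
    "cores": 4 * 54.30,     # four identical cores at 54.30 watts each
    "l1_caches": 30.44,
    "l2_caches": 68.46,
    "tlb": 5.75,
}
theoretical_max = sum(components.values())      # ~321.85 watts
best_stressmark = 163.18
fraction = best_stressmark / theoretical_max    # ~0.507, i.e. 50.7%
```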
Chapter 7
Conclusion
Originally described as early as 1965, Moore's law holds true today as it always has. While the number of transistors is still growing exponentially and their size keeps shrinking, processor designers find themselves in a pickle: spending their transistor budgets while continuing to comply with ever-tightening design requirements such as power usage and temperature. Having faced the power wall around 2002, slipping from the single-core into the multi-core era, these problems are larger today than ever before.
A corollary of this tendency is the importance of knowing and understanding the worst-case behavior of new microprocessors, a trait that is typically researched by writing stressmarks: benchmark programs that stress the processor to its limits. As this job becomes more and more tedious and hugely expensive, the industry is looking for ways to automate the process.
In this master thesis, we described the StressmarkRunner framework, a solution for the automatic generation of stressmarks based on prior research in this area by Joshi et al. [12]. Our two largest contributions are the use of the C programming language to make these stressmarks platform-portable, and the support of multi-core platforms.
The stressmark generation process begins with the workload model, an abstract description
containing a number of parameters determining the stressmark’s characteristics. We started
from the workload model proposed by Joshi et al. [12] and have simplified and extended it
in order for it to better suit our own requirements.
Simplification was achieved by removing a number of workload parameters such as block
size and its standard deviation, and by reducing the parameter determining the minimum
dependency distance to a scalar value. We found that these parameters, which are necessary
for the generation of synthetic benchmarks mimicking the behavior of manually designed
benchmarks, are overkill in the case of stressmarks.
Extension of the workload model was necessary to support multi-threaded stressmarks. We added the number of software threads as a workload parameter, and distinguished between shared and private memory instructions, allowing cache coherence issues to enter the equation.
We ended up with a total of thirty parameters, each one aimed at stressing a specific part of
the processor with as few overlapping effects between the different parameters as possible. We
made sure that new parameters can be added with relative ease, allowing for future extension
with platform-specific parameters, thus increasing the platform-portability of our solution.
As the next step, we employed the abstract workload model to generate synthetic benchmarks
in C, again with platform-portability as our main goal. This portability is preserved during
the stressmark generation phase through the wide support of C compilers, which act as
implementers providing the platform-specific details needed to convert our C benchmark into
executable binaries. We considered alternatives for C such as Fortran, or GCC’s or LLVM’s
intermediate representations. The latter approach stimulated us to reflect on the role of the
programming language as an interface to the backend of the compiler.
The use of C or one of its alternative languages also provides a large number of challenges
we thoroughly researched and documented. Compiler optimization plays a crucial role in the
compilation of stressmarks; it is a huge and complex process we tried to gain understanding
of in order to control it as well as possible. We described different types of optimizations,
discussed their consequences for the stressmark’s behavior, and proposed and implemented so-
lutions to the problems they posed. We discussed and utilized different C language constructs
that proved useful and often necessary to express the different parameters of the workload
model in C, and we described in detail how to implement these parameters.
Having explored the possibilities of the use of a low-level programming language, we also paid
attention to the limits of this approach. We discussed the example of SIMD instructions,
which at the moment cannot possibly be implemented in a real platform-portable way.
The synthetic benchmarks we create according to their workload model using the stressmark generator are then optimized to maximally stress the components of the underlying platform. We achieved this by writing a genetic algorithm that selects for one of the output characteristics of the stressmarks. As the configuration of a genetic algorithm is often a tricky enterprise,
we used a simple hill climbing meta algorithm to determine the best mutation and crossover
probabilities, population size, and elitism factor.
Finally able to generate real stressmarks, we set up two target platforms to run the genetic
algorithm on while demonstrating the platform-portability of our approach. The first plat-
form was the SESC simulator, running the configuration of an SMP architecture executing
the MIPS instruction set. The second was the Intel Core 2 Quad processor for an x86-64
instruction set.
On the SESC platform, we ran the genetic algorithm, optimizing for maximum power usage,
through more than a hundred generations, totaling more than 7200 stressmark individuals.
The resulting power usage was three times higher than the maximum usage in the initial
generation.
Because of the large number of simulations and the relatively long simulation times, we set
up a distributed system with a job queue to support this experiment, using nine dual-core
servers in the Hydra cluster of the ELIS research group at our alma mater, Ghent University.
Developing this system in a research environment taught us a great deal, and the setting
inspired us to design a new software architecture pattern that we found suitable for the
specific requirements that research imposes.
On the Intel Core 2 Quad hardware platform, we ran a similar experiment optimizing the
IPC output characteristic of our stressmarks, resulting in a 30% increase and almost reaching
an IPC of three. Since the maximum IPC of the platform is four, we examined possible
reasons for this result by studying the processor architecture, finding that our algorithm had
restricted itself to the arithmetic ALUs.
On top of all this, we thoroughly verified our framework and methods. First, we explored
our search space by examining characteristic workload models, making sure the results met
our expectations and that the stressmark's output could be effectively controlled through
the workload model. Second, we compared the performance of our genetic algorithm with
that of a random search, gaining new insights by examining the distribution of the stress-
marks' fitness values and finding that our genetic algorithm is 50% more effective than the
random search we ran. Third, we calculated the theoretical maximum power usage of the
SESC SMP platform by summing the maximum power values of its components, finding
that the power usage of our stressmarks reaches 50% of that theoretical maximum, a value
comparable to the one we found in the literature.
Looking back at a fruitful year, however, the things we probably cherish most are the
experiences we gained, most of them good and all of them valuable, from setting about the
eventful undertaking that producing a master's thesis is, and from following through until
the end of this very paragraph.
Bibliography
[1] Intel Developer's Manual (Basic Architecture). http://www.intel.com/Assets/PDF/manual/253665.pdf.
[2] Intel Turbo Boost. http://www.intel.com/technology/turboboost/.
[3] JYaml library. http://jyaml.sourceforge.net/.
[4] PAPI: Performance Application Programming Interface. http://icl.cs.utk.edu/papi/.
[5] SESC documentation. http://iacoma.cs.uiuc.edu/~paulsack/sescdoc/.
[6] YAML: YAML Ain't Markup Language.
[7] 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 16-20 February 2008, Salt Lake City, UT, USA. IEEE Computer Society, 2008.
[8] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[9] S. H. Gunther, F. Binns, D. M. Carmean, and J. C. Hall. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, 5(1), 2001.
[10] Michael Haungs, Phil Sallee, and Matthew Farrens. Branch transition rate: A new metric for improved branch classification analysis. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), 2000.
[11] John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2003.
[12] Ajay M. Joshi, Lieven Eeckhout, Lizy Kurian John, and Ciji Isen. Automated microprocessor stressmark generation. In HPCA [7], pages 229-239.
[13] G. E. Moore. Cramming more components onto integrated circuits. Proceedings of the
IEEE, 86(1):82–85, 1998.
[14] Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software.
Dr. Dobb’s Journal, 30(3):202–210, 2005.
List of Figures
1.1 SPECint performance over the years (image source: [11]).
1.2 Power wall, frequency wall and ILP wall (image source: [14]).
2.1 Miss rates of local (left) and global (right) branch predictors for different classes of branches, identified by transition rate and taken rate.
3.1 Global stressmark structure.
3.2 Stressmark generation.
4.1 Optimization process.
5.1 SESC overview.
5.2 Hardware overview.
5.3 Packages.
5.4 Commands Core.
5.5 Commands Jobmanager.
5.6 Commands Genetic Algorithm.
5.7 Commands Meta GA.
5.8 Other Commands.
5.9 Jobmanager.
5.10 Jobstates.
5.11 Genetic Algorithm Classes.
5.12 Util.
6.1 Executed Instructions.
6.2 Integer Addition.
6.3 Integer Multiplication.
6.4 Double Addition.
6.5 Double Multiplication.
6.6 Integer and Double operations.
6.7 Private Loads and Stores.
6.8 Shared Loads and Stores.
6.9 Result SESC.
6.10 Result Core 2 Quad.
6.11 Random Search.
6.12 GA Comparison.
6.13 Theoretical Upper Limit.
List of Tables
2.1 Example of an arithmetic instruction profile
2.2 Example of a memory instruction profile
2.3 Example of a branch transition rate distribution
2.4 Example of a data and memory footprint
2.5 Example of a stream stride distribution
2.6 Workload summary
3.1 Used compiler flags
3.2 Redundant function
3.3 Alternatives
3.4 Redundant operations
3.5 Redundant blocks
3.6 Loop invariants
3.7 Branch optimization
3.8 Alternative control flow implementations
Listings
3.1 Compilation result with -O1
3.2 Dependency distance
3.3 Constant folding and propagation
3.4 Static branch implementation
3.5 Arithmetic instructions
3.6 Arithmetic + memory instructions
3.7 Arithmetic + memory + branch instructions
3.8 Starting stressmarks
3.9 Alternative BTR
3.10 Auto-vectorization
5.1 Scala quicksort example
5.2 Get work query
6.1 Workload SESC
6.2 Workload x86-64
Automatic Generation of Multicore Stressmarks
Introduction
From 1986 to 2002, the period before the so-called power wall, the power consumption of
microprocessors doubled every four years, while serial processing speed was improved by
about 50% per year [14]. These speed improvements were driven mainly by Moore's law
[13], which states that the number of transistors in an integrated circuit doubles roughly
every two years. Transistors were initially scarce, and power consumption was not a limiting
factor. With more transistors, the execution pipeline could be made deeper so that the clock
frequency could rise. Out-of-order processors made it possible to use those pipelines
efficiently, and even to drive several execution pipelines at once by maximizing ILP
(Instruction Level Parallelism).
This evolution came to an abrupt end around 2002: the power wall had been reached. Raising
the clock frequency increases power consumption and therefore thermal dissipation, and the
cost of cooling solutions rises exponentially as a function of it. Around 2002, the limit of
what is practically feasible in terms of cooling was reached.
A ceiling was also reached in terms of ILP. The hardware structures became too large, and
their cost was no longer proportional to the speedup they delivered. Moore's law still holds,
however, which means that transistors became less and less of a limiting factor, and power
consumption more and more of one. Processor designers therefore have to concentrate ever
more on the power behavior of the processor.
Figure 1: Power wall, frequency wall and ILP wall [14].
Conventional benchmarks can be used to analyze the typical power behavior of a processor,
but that is no longer sufficient. A better way to analyze the behavior of processors is with
stressmarks.
There is an ever-widening gap between the maximum and the typical power consumption
of a processor [9]. This poses a difficult problem for the processor designer, because the
correct operation of the processor must be guaranteed even under extremely rare
circumstances. Stressmarks are used to investigate those rare cases: they are benchmarks
that provoke extreme behavior in the processor. Some of their applications are the following:
• Determining the safety margins for the thermal and power management of the processor,
for example temporarily lowering the clock frequency when the processor becomes too hot.
• Detecting so-called hotspots: small regions on the chip that become very hot for a short
period of time. Hotspots are bad for the lifetime and the reliability of a processor.
• Dimensioning the cooling system of the processor and/or the power supply circuitry.
Today, stressmarks are developed manually by a specialist who knows the processor inside
out. It is a tedious and time-consuming task that has to be redone every time the operation
of the processor changes.
In this thesis, we try to automate the development of stressmarks. We build on previous
work, the "StressMaker framework" of Joshi et al. [12]. That framework can automatically
generate benchmarks for an Alpha 21264 microprocessor. The key idea behind it is the use
of synthetic benchmarks that are generated from a small set of program characteristics. A
stressmark is obtained by optimizing those characteristics for maximum power consumption
with a search algorithm.
Our framework improves on this approach on a few critical points:
• The synthetic benchmarks are generated entirely as pure C constructs. This makes the
framework platform-portable: in principle, it can generate a stressmark for every processor
supported by the compiler.
• The program characteristics are specialized for generating stressmarks (rather than
synthetic benchmarks that imitate the behavior of other benchmarks).
• The framework can create multi-threaded stressmarks that communicate through memory.
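To illustrate that last point, the sketch below shows, in plain C with POSIX threads, how several threads can stress the memory system through a single shared buffer. It is a hypothetical miniature, not code produced by the generator; the thread count, buffer size, and iteration count are arbitrary. Because consecutive array elements written by different threads share cache lines, the access pattern also generates coherence (false-sharing) traffic.

```c
#include <pthread.h>

#define NTHREADS     4
#define SHARED_WORDS 1024     /* shared-memory footprint, in longs */
#define ITERS        100000

static long shared_mem[SHARED_WORDS];

/* Each thread writes only indices congruent to its id modulo NTHREADS,
 * so threads never race on the same word, but neighbouring words (and
 * thus cache lines) are shared between threads. */
static void *worker(void *arg)
{
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        shared_mem[(id + i * NTHREADS) % SHARED_WORDS] += 1;
    return NULL;
}

long run_threads(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    long total = 0;
    for (int i = 0; i < SHARED_WORDS; i++)
        total += shared_mem[i];
    return total;   /* NTHREADS * ITERS increments in total */
}
```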
Abstract Workload Model
A stressmark, then, is a synthetic benchmark that is generated from a number of
characteristics. We distinguish between platform characteristics and program characteristics.
The platform characteristics are limited to the size of a cache line and the number of
hardware threads the processor can execute.
The program characteristics describe the workload that is exerted on the processor. These
are the parameters that the search algorithm will optimize. It is essential to keep the number
of workload parameters to a minimum, so that the search space remains as small as possible.
The workload parameters are the following:
General instruction distribution
A relative distribution of arithmetic instructions, memory instructions, and branch instructions.
Arithmetic instruction distribution
The relative distribution of arithmetic operations. An operation is defined by a data type
(floating point or integer) and an arithmetic operation (addition, subtraction, or division).
Memory instruction distribution
A relative distribution of memory operations. There are four operations: loads and stores to
shared memory, and loads and stores to thread-local memory.
Branch behavior (inverse branch transition rates)
This parameter determines the relative distribution of branch transition rates. The branch
transition rate is the number of times a branch changes direction (i.e., taken or not taken)
relative to the total number of times the branch instruction was executed. A branch
transition rate of 0 means that the branch is static.
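To make this definition concrete, the following small helper (purely illustrative; it is not part of the generator) computes the branch transition rate of a recorded direction history:

```c
#include <stddef.h>

/* The branch transition rate (BTR) is the number of direction changes
 * (taken <-> not taken) divided by the number of executions.
 * A constant history gives a BTR of 0: a static branch. */
double transition_rate(const int *taken, size_t executions)
{
    size_t transitions = 0;
    for (size_t i = 1; i < executions; i++)
        if (taken[i] != taken[i - 1])
            transitions++;
    return (double)transitions / (double)executions;
}
```

A history {1, 0, 1, 0} changes direction on three of its four executions and therefore has a BTR of 0.75.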
Minimum dependency distance
The smallest permissible read-after-write dependency distance in the stressmark.
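One way to respect a minimum dependency distance D is to interleave D independent dependency chains, so that each value is read again only D instructions after it was last written. The sketch below illustrates the idea; it is a simplification, not generator output, and D and the seed values are arbitrary:

```c
#define D 4   /* minimum read-after-write dependency distance */

/* D interleaved chains: r[k] is touched only once every D iterations,
 * so consecutive instructions are independent of each other. */
long run_chains(long iters)
{
    long r[D] = {1, 2, 3, 4};
    for (long i = 0; i < iters; i++)
        r[i % D] += 1;

    long sum = 0;
    for (int k = 0; k < D; k++)
        sum += r[k];
    return sum;
}
```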
Sizes
The number of instructions in the stressmark and the sizes of the thread-local and the shared
memory.
Behavior of striding memory instructions
Memory instructions walk through memory with a fixed stride, defined as a function of the
size of a cache line. This parameter is a relative distribution of the strides of the memory
instructions. A stride of 0 is possible; in that case, the same address is read every time.
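A striding access can be sketched as follows. This is an illustrative fragment, not generator output; the cache line size and footprint are assumptions.

```c
#include <stddef.h>

#define CACHE_LINE 64          /* assumed platform characteristic */
#define FOOTPRINT  (1 << 16)   /* assumed thread-local footprint, in bytes */

/* Walk a buffer with a fixed stride expressed in cache lines, wrapping
 * around inside the footprint. A stride of 0 reads the same address
 * every time. */
long walk(const char *mem, int stride_lines, int accesses)
{
    size_t step = (size_t)stride_lines * CACHE_LINE;
    size_t pos = 0;
    long sum = 0;
    for (int i = 0; i < accesses; i++) {
        sum += mem[pos];
        pos = (pos + step) % FOOTPRINT;
    }
    return sum;
}
```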
Synthetic Benchmarks in C
To make the framework platform-portable, we use the C programming language rather than
assembly to implement the stressmarks. The idea is that the compiler then compiles the C
code for the target architecture in question; the compiler is used for instruction selection and
register allocation for the target platform.
Developing synthetic benchmarks in C is a difficult task, because a compiler optimizes the
program for maximum performance, whereas the characteristics of the stressmark must be
preserved after compilation. Compilers perform many optimizations for which this does not
hold, such as the elimination of redundant code, the optimization of loop invariants,
arithmetic optimizations, unwanted floating-point exceptions, instruction reordering, and
so on.
A compiler can be configured to determine which optimizations are performed and which
are not, but doing so is a laborious process. The difficulty lies in finding a balance between
under- and over-optimization. With under-optimization, the compiler optimizes too little to
still perform instruction selection and register allocation properly; with over-optimization,
the compiler changes the characteristics of the stressmark, for example by eliminating
instructions.
Our approach is to configure the GCC compiler for a minimally acceptable optimization
level, and then to shape the stressmark's code in such a way that the remaining optimizations
no longer affect the stressmark's characteristics.
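One common trick in this spirit, shown here as a hedged sketch rather than the thesis's exact construct, is to let every computed value eventually reach a `volatile` location, so that dead-code elimination cannot remove arithmetic whose only purpose is to load the ALUs:

```c
/* A volatile sink: stores to it are observable side effects, so the
 * compiler must keep the arithmetic that feeds it, even at -O2. */
volatile long sink;

void stress_add(long iters)
{
    long a = 1, b = 3;
    for (long i = 0; i < iters; i++) {
        a += b;   /* this chain survives dead-code elimination */
        b += a;   /* because its result reaches a volatile store */
    }
    sink = a + b;
}
```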
Stressmark Optimization
Figure 2: Optimization process: an abstract workload model is turned into a synthetic benchmark, whose measurements (SESC / HPC) feed the optimization.
As described in the previous part, synthetic benchmarks can be generated from an abstract
workload model. While such a benchmark runs, we can measure various characteristics, such
as power consumption, IPC, or temperature. To turn a synthetic benchmark into a
stressmark that stresses one of these characteristics, we use a genetic search algorithm.
We start from an initial population containing a number of randomly generated workload
models. We then let the StressmarkGenerator application produce the corresponding
benchmarks, execute them, and measure the value of the characteristic to be stressed for
each run. That value forms the fitness of the individual (the stressmark) in question. We
then apply the two phases of the genetic process: selection and reproduction.
Selection is proportional to the fitness of the individuals: the chance of being selected is
proportional to the fitness of the stressmark. Once two stressmarks have been selected, a
child stressmark is produced by applying crossover and mutation. In crossover, the workloads
of both stressmarks are combined by copying some parameters from the first parent and some
from the second; the crossover probability determines how thoroughly the two parents are
mixed. In the mutation phase, one or more random changes may additionally be made to the
resulting workload model; the number of changes is influenced by the mutation probability.
Starting from the initial population, a new one is generated by repeatedly applying selection
and crossover, while also taking the elitism factor into account. Elitism means that one or
more individuals from the previous generation are taken over unchanged and copied into the
next population. By generating new populations over and over, we select for ever-higher
fitness and thus obtain a stressmark that maximally stresses the chosen characteristic.
We use the following configuration values for all of our tests:
• A population size of 72 workload models
• A mutation probability of 10%
• A crossover probability of 80%
• An elitism factor of 1
This configuration was determined with a simple hill-climbing algorithm that we designed
for this purpose.
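Fitness-proportional selection can be pictured as a roulette wheel: each individual occupies a slice of the wheel proportional to its fitness, and a uniform random draw picks a slice. The helper below is an illustrative sketch (the random draw is passed in explicitly so the behavior is deterministic):

```c
/* Roulette-wheel selection: individual i is chosen with probability
 * fitness[i] / total fitness. `r` is a uniform draw in [0, 1). */
int select_parent(const double *fitness, int n, double r)
{
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += fitness[i];

    double threshold = r * total, acc = 0.0;
    for (int i = 0; i < n; i++) {
        acc += fitness[i];
        if (threshold < acc)
            return i;
    }
    return n - 1;   /* guard against floating-point rounding */
}
```

With fitnesses {1, 2, 3}, draws below 1/6 select individual 0, draws in [1/6, 1/2) select individual 1, and the remaining half selects individual 2.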
Development of the Framework
Although we had encountered various software development techniques during our education,
this was the first time we developed a large software application within a research
environment. We found that this gives rise to specific requirements and challenges, which we
discuss briefly.
The most important requirement is probably that researchers must be able to concentrate on
their research without being hindered by the software they use. A programming language
should therefore be chosen that reflects this requirement.
For that reason, we chose the Scala programming language. Scala combines functional
aspects with object-oriented paradigms. It is compiled to Java bytecode and is therefore
compatible with Java in both directions: Java code can be called directly from Scala and
vice versa.
Using Scala has several advantages. The expressiveness of the language keeps the code
concise and makes the programmer's intent clear. Since Java is compatible with Scala, the
many available Java libraries can be used. The type system is at once robust and flexible, so
that it prevents errors without imposing too many restrictions. Moreover, because the
language offers a high level of abstraction, there are no C-style pointers and memory
allocation is handled automatically.
To conveniently drive the command line from the Scala environment, we use the Apache Ant
library. This is very important, because the various third-party applications we use are only
available via the command line.
We improved this integration further by applying a software architecture that we call the
"mirrored command line suite". Its principles are the following:
1. All functionality of the framework we develop must be available in the form of
input-output commands, with a (limited) number of parameters to control their behavior.
2. For the processes present in the framework, the individual steps are made available as
separate commands, so that the developer can easily execute and debug them in isolation.
3. Every command is mirrored, by which we mean that it is made available to the
programmer twice: once in the Scala environment and once as a script on the command line.
4. All command line scripts that execute commands have names starting with the same
prefix, so that the developer can easily find them by typing the prefix and pressing tab.
5. When the input and/or output is structured data, it is stored in files in a human-readable
format. For this, we chose YAML [6].
We found this approach to work very well in a research environment, for several reasons.
First, the flexibility of the mirrored command line suite allows it to grow along with the
changing requirements of the software. Second, dividing the functionality over the simple
command structure clearly defines the concepts within the research domain. In addition, the
command line offers everything needed to serve as an interactive development environment,
such as code completion. We can also provide commands that do not implement framework
functionality, but instead integrate the development environment more tightly into the
researcher's workflow by automating repetitive tasks within the build process (continuous
integration).
Test Platforms
SESC SMP MIPS
The first platform on which we test our framework uses the SESC simulator to simulate an
SMP MIPS architecture. SESC is a simulator built on top of MINT that can simulate both
single- and multi-core architectures.
The configuration we used is a version of the symmetric multi-processing (SMP) architecture.
It contains 4 identical cores, each running at a clock frequency of 1 GHz in a 70 nm
technology. The branch predictor is based on the Alpha 21264 hybrid predictor, and the
cache configuration is the following:
• L1D and L1I: 32 kB, 4-way set associative, LRU, write-through
• Private L2: 512 kB, 8-way set associative, LRU, write-back, MESI
To limit the total simulation time, we implemented a distributed job system so that the
commands offered by the framework can be executed in parallel on this platform.
Figure 3: SESC platform: Hydra worker nodes running the StressmarkRunner and StressmarkGenerator components with the SESC simulator, SESC threading library, GCC compiler, and GNUPlot, next to a job database node running the MySQL job database.
For this, we used the Hydra server cluster of the ELIS research group at our university. It
consists of 9 worker servers, each with a dual-core processor and a shared file system, and a
MySQL database server. Each worker server runs 2 threads, each hosting a worker that
connects to the database, requests the next job, and thus executes one job after another in
parallel with the other workers. In this way, we were able to achieve an average total
simulation time of only 3.5 seconds per stressmark during testing.
We further use the GCC cross-compiler for the MIPS instruction set that ships with SESC;
the GCC version is 3.4. To implement multi-threading, we use the SESC threading library.
Intel Core 2 Quad x86-64
The second platform is an Intel Core 2 Quad Q9450 hardware processor executing a 64-bit
x86 instruction set. It has four 45 nm cores, each running at 2.66 GHz. The TDP (Thermal
Design Power), which Intel defines as the maximum power consumption, is 95 W. The cache
configuration is the following:
• L1D and L1I: 32 kB per core
• L2: 2 x 6 MB (each shared by two cores)
The Intel Developer's Manual [1] states that the maximum number of instructions per cycle
(IPC) for this processor is 4. We use this number to evaluate the performance of our
generated stressmarks. During testing, we measured the effective IPC using hardware
performance counters.
The setup of the software components is the same as for the SESC platform, with three
exceptions. First, execution of the stressmark on SESC is replaced by execution on the
hardware processor; second, the threading library is replaced by the standard POSIX
implementation for Linux (pthread); and third, the database now runs locally, because
execution on hardware is fast enough not to require distributed execution.
Results
Number of SESC Instructions
To obtain stable characteristics while running the stressmarks on the SESC simulator, the
simulation has to run long enough. First, there is the initialization phase that is executed
when the stressmark starts up. Second, the code of the stressmark itself must also execute
for a sufficiently long time to exhibit stable behavior.
Fortunately, there is a way to at least reduce the effects of the initialization phase. SESC
supports a so-called "rabbit mode", which makes it possible to skip over the initialization by
executing it at high speed, keeping only the data that is strictly needed to make correct
progress.
We determine the number of instructions needed to obtain a stable run by executing the
same stressmark with successively larger instruction counts. First only 1,000 instructions are
executed, then 2,500, then 5,000, 10,000, and so on. We continue until the value of the
characteristics no longer changes significantly. The process is then repeated, this time
preceded by one million instructions in rabbit mode.
Figure 4: Executed instructions: power (W) versus the number of simulated instructions (1K up to 50M), with rabbit mode disabled and with 1M instructions in rabbit mode.
Results and discussion
Figure 4 shows the results of both experiments. Looking at the power consumption with
rabbit mode disabled, we clearly see the effect of initialization: the heavy workload generated
by the stressmark raises power consumption only gradually as we run more instructions,
stabilizing at around 10 million instructions. As expected, the test with rabbit mode enabled
does much better, showing a reduced initialization effect. Note, however, that this comes at
the (small) cost of running one million instructions in rabbit mode before simulating the 10
million in normal mode. To keep the simulation time short enough without compromising
the correctness of the result too much, we ultimately chose to run our simulations with one
million instructions in rabbit mode, followed by two million normally simulated instructions.
Exploration of the Search Space
For the genetic search algorithm to work efficiently, we must check that our abstract
workload model can influence the characteristics we optimize for effectively enough. We do
this by examining a few cross-sections of the search space. We use the SESC platform
described earlier, running an SMP configuration with four hardware threads and two integer
and two floating-point ALUs. On it, we run workload models with specific instruction mix
profiles, each time looking at the power consumption and the IPC as a function of the
minimum dependency distance (MDD) between the instructions.
Arithmetic instructions
Figure 5: Integer additions: power (W) and IPC versus the minimum dependency distance.
The first instruction mix profile contains only integer addition operations. We vary the
minimum dependency distance from 1 to 16. What is immediately apparent is the strong
correlation between the two characteristics we measure. This is to be expected, since a
higher IPC leads to better utilization of the components, which in turn leads to higher power
consumption.
To make a high IPC possible, as many instructions as possible must execute in parallel,
which requires the minimum dependency distance to be as large as possible. The results
confirm this: the characteristics increase as we vary the MDD from 1 to 10. At that point,
however, an IPC of 2 is reached, and since only 2 integer ALUs are available, this is clearly
the expected maximum.
Figure 6: Double additions: power (W) and IPC versus the minimum dependency distance.
The second instruction profile contains only addition operations on doubles. The previous
conclusions largely still hold, but here an interesting additional effect appears. At an MDD
of 16, the IPC rises above 2, even though only 2 double ALUs are available. This is caused
by register spilling: since the spill instructions are handled by the integer ALUs, these are
slightly loaded as well.
Memory instructions
Figure 7: Private loads and stores: power (W) and IPC versus the load/store ratio (from 100/0 to 0/100).
We now consider a profile containing only memory instructions. All instructions are thread-local, but the ratio between loads and stores varies: the leftmost mix consists exclusively of loads, the rightmost exclusively of stores. The figure shows that executing many stores hurts the IPC. This is to be expected, since each store must be propagated through the various cache levels, which causes pipeline stalls.
Results of the Genetic Algorithm
SESC platform
The characteristic measured on the SESC platform is the average power consumption over the course of the entire simulation. Figure 8 gives an overview of the generations produced during the search process, recording at each generation the maximum, average, and minimum fitness of the workload models in the population.
Figure 8: SESC results. [Plot: minimum, average, and maximum power (W) per generation, 0-100.]
In the initial population the power consumption of the best stressmark is 41.28 W, that of the worst 12.47 W, and the average 21.28 W. Over the course of the search the maximum and average power consumption grow while the minimum consumption remains fairly low. This is a good sign, as it indicates a healthy balance between variation and quality in the population; such a balance is necessary to optimize the solutions found sufficiently without getting stuck in local optima of the search space.
The final result is a maximum consumption of 163 W, generated by a workload model discovered in generation 87. This is four times the maximum consumption in the first generation. Let us examine the properties of the workload model in question:
Listing 1: SESC workload
memoryShared: 64
traceSize: 100
arithmeticInstructionMix:
  doubleAdd: 44
  doubleMul: 28
  integerAdd: 23
  integerMul: 5
mdd: 29
swThreads: 3
instructionMix:
  arithmeticInstructions: 89
  memoryInstructions: 2
  branchInstructions: 9
memoryThreadLocal: 2048
memoryStrideProfile:
  size1: 26
  size0: 37
  size4: 2
  size3: 2
  size2: 33
memoryInstructionMix:
  unsharedLoad: 5
  sharedLoad: 10
  unsharedStore: 41
  sharedStore: 44
branchTransition:
  rate0: 11
  rate1: 33
  rate2: 10
  rate4: 28
  rate8: 18
As expected, we see a large share of arithmetic instructions and a high minimum dependency distance. The arithmetic mix is fairly balanced but appears to avoid integer multiplications. The memory instruction mix is arbitrary: memory instructions make up only 2% of the total, so they have no influence on the whole.
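The search loop whose per-generation minimum, average, and maximum fitness Figure 8 records can be sketched as follows. This is a simplified stand-in: the fitness function is a toy surrogate for an actual power measurement, and only a few of the workload-model parameters are included; the real framework evaluates each model by generating and running the corresponding stressmark.

```python
import random

# Toy parameter ranges mirroring a few fields of the workload models above.
PARAM_RANGES = {
    "mdd": (1, 32),
    "swThreads": (1, 8),
    "arithmeticInstructions": (0, 100),
}

def random_model(rng):
    return {k: rng.randint(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

def fitness(model):
    # Surrogate: reward high MDD and many arithmetic instructions,
    # loosely mimicking the trends observed on the SESC platform.
    return model["mdd"] + 0.5 * model["arithmeticInstructions"]

def evolve(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    pop = [random_model(rng) for _ in range(pop_size)]
    history = []  # (min, avg, max) fitness per generation
    for _ in range(generations):
        scores = [fitness(m) for m in pop]
        history.append((min(scores), sum(scores) / len(scores), max(scores)))
        new_pop = []
        while len(new_pop) < pop_size:
            # tournament selection of two parents
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # uniform crossover over the parameter set
            child = {k: rng.choice((p1[k], p2[k])) for k in PARAM_RANGES}
            # mutation: occasionally re-randomize one parameter
            if rng.random() < 0.2:
                k = rng.choice(list(PARAM_RANGES))
                child[k] = rng.randint(*PARAM_RANGES[k])
            new_pop.append(child)
        pop = new_pop
    return history
```

Plotting the three entries of `history` per generation reproduces the shape of Figure 8: the maximum and average climb while the minimum stays low, reflecting the variation that mutation keeps injecting.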
Core2Quad platform
Figure 9: Core2Quad result. [Plot: minimum, average, and maximum IPC per generation, 0-28.]
We repeated the same experiment on the hardware platform, this time using the IPC as the characteristic. Although the result is less pronounced than on the SESC platform, here too we find a significant increase: the IPC rises from 2.25 in the first generation to 2.92 in the last, an increase of 30%. Our algorithm appears limited to an IPC of 3, even though the processor's maximum IPC is 4. The workload model of the best individual shows why:
Listing 2: x86-64 workload
memoryShared: 64
traceSize: 50
arithmeticInstructionMix:
  doubleMul: 7
  integerAdd: 33
  doubleAdd: 19
  integerMul: 41
mdd: 9
swThreads: 8
instructionMix:
  arithmeticInstructions: 81
  memoryInstructions: 2
  branchInstructions: 17
memoryThreadLocal: 2048
memoryStrideProfile:
  size0: 30
  size1: 26
  size2: 19
  size3: 17
  size4: 8
memoryInstructionMix:
  unsharedLoad: 25
  sharedLoad: 41
  unsharedStore: 30
  sharedStore: 4
branchTransition:
  rate0: 13
  rate1: 24
  rate2: 13
  rate4: 26
  rate8: 24
Looking at the instruction mix, we see that the algorithm has selected strongly for arithmetic operations and, to a lesser extent, branch instructions. When following this strategy and eliminating all memory instructions, 3 is indeed the maximum IPC: the Intel Core2Quad has only 3 ALUs, and the search algorithm exploits them fully.
The IPC limit of 4 is based on the fetch width, and it is now clear that memory instructions would have to be added to the mix to approach that limit. There are individuals in the population that do use such instructions, but they have a lower fitness, presumably because they no longer fully load the integer ALUs. If a mix combining arithmetic operations and memory instructions that approaches an IPC of 4 exists at all, our algorithm, like ourselves, was unable to find it.
Efficiency
To test the efficiency of our genetic algorithm, we let a random search algorithm (Monte Carlo) evaluate the same number of simulations that the genetic algorithm performed (7,200).
The graph in Figure 10 shows two views of the same data. The grey bars show the simulations in the order in which they were run (insofar as that is possible in a distributed environment), while the black curve shows the same results sorted. We use the latter to examine the distribution of workload models more closely.
The first thing we notice is a small number of stressmarks that each produce a constant consumption of 1.9 W. These are instances of one and the same fallback stressmark, generated whenever the StressmarkGenerator is unable to produce a stressmark satisfying the workload model given as input. This can happen, for example, when the requested fraction of branch instructions is too high (e.g., close to 100%). Only 2.2% of the stressmarks are invalid, which is acceptably small, certainly given that these are randomly generated stressmarks. In the genetic algorithm such stressmarks are immediately weeded out by selection.
The rest of the distribution shows a maximum value of 88.84 W. Running the genetic search algorithm with the same version of the StressmarkGenerator yields 140.79 W, which is 58% better than the random search. Moreover, the performance of the genetic search algorithm is naturally much more stable, with a smaller luck factor.
Figure 10: Random search. [Plot: power (W) of the 7,200 workloads, both in execution order (grey bars) and sorted (black curve).]
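The Monte Carlo baseline amounts to nothing more than drawing random workload models and keeping the best. A minimal sketch, with a toy sampler and fitness in place of real stressmark runs (both stand-ins are ours):

```python
import random

def random_search(fitness, sample, budget=7200, seed=0):
    """Evaluate `budget` randomly drawn models and return the best,
    mirroring the 7,200-simulation Monte Carlo baseline."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(budget):
        model = sample(rng)
        score = fitness(model)
        if score > best_score:
            best, best_score = model, score
    return best, best_score

# Toy stand-ins: a model is a single number, fitness is the number itself.
best, score = random_search(fitness=lambda m: m,
                            sample=lambda rng: rng.uniform(0.0, 100.0))
```

Because each draw is independent, the baseline's result depends heavily on luck, whereas the genetic search reuses information from earlier evaluations; this is the stability difference noted above.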
Theoretical Maximum
As a final result, we consider an estimate of the theoretical maximum of our SESC configuration and compare it with the result achieved by our genetic search algorithm. We compute this maximum by summing the energy consumed by each component of the architecture during one cycle in its activated state and multiplying that number by the clock frequency. This yields the breakdown shown in Figure 11.
We obtain a theoretical maximum of 321.8 W in total. The best stressmark produced by our genetic algorithm reaches 163.18 W, or 50.7% of the theoretical maximum. This is comparable to the 57% reported by Joshi et al. [12].
Figure 11: Theoretical maximum. [Breakdown: Core 1: 54.30 W; Core 2: 54.30 W; Core 3: 54.30 W; Core 4: 54.30 W; L1 caches: 30.44 W; L2 caches: 68.46 W; TLB: 5.75 W.]
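The per-component figures in Figure 11 make the estimate easy to reproduce: summing them gives the quoted total, and the best stressmark's 163.18 W lands at the quoted 50.7%.

```python
# Per-component peak power from Figure 11 (watts).
components_w = {
    "Core 1": 54.30, "Core 2": 54.30, "Core 3": 54.30, "Core 4": 54.30,
    "L1 caches": 30.44, "L2 caches": 68.46, "TLB": 5.75,
}

theoretical_max = sum(components_w.values())  # 321.85 W (text rounds to 321.8)
coverage = 163.18 / theoretical_max           # fraction reached by the best stressmark
```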
Bibliography
[1] Intel developer's manual (basic architecture). http://www.intel.com/Assets/PDF/manual/253665.pdf.
[2] Intel Turbo Boost. http://www.intel.com/technology/turboboost/.
[3] JYAML library. http://jyaml.sourceforge.net/.
[4] PAPI: Performance Application Programming Interface. http://icl.cs.utk.edu/papi/.
[5] SESC documentation. http://iacoma.cs.uiuc.edu/~paulsack/sescdoc/.
[6] YAML: YAML Ain't Markup Language.
[7] 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 16-20 February 2008, Salt Lake City, UT, USA. IEEE Computer Society, 2008.
[8] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
[9] S. H. Gunther, F. Binns, D. M. Carmean, and J. C. Hall. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, Q1 2001.
[10] Michael Haungs, Phil Sallee, and Matthew Farrens. Branch transition rate: A new metric for improved branch classification analysis. In International Symposium on High-Performance Computer Architecture (HPCA), 2000.
[11] John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2003.
[12] Ajay M. Joshi, Lieven Eeckhout, Lizy Kurian John, and Ciji Isen. Automated microprocessor stressmark generation. In HPCA [7], pages 229-239.
[13] G. E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82-85, 1998.
[14] Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3):202-210, 2005.