
UPTEC IT 16 015
Degree project, 30 credits (Examensarbete 30 hp)
November 2016

Understanding Emerging Workloads for Performance and Energy

Eddie Eriksson

Department of Information Technology (Institutionen för informationsteknologi)


Abstract

Understanding Emerging Workloads for Performance and Energy

Eddie Eriksson

The size and capacity of datacenters have grown over time, and today, datacenters have become a significant source of compute capacity. This is because the type, as well as the number, of applications and users moving into the cloud has been steadily increasing. As datacenter use increases, so does the energy consumed by these large computing facilities. Therefore, improving datacenter efficiency can significantly reduce datacenter cost.

To achieve improved datacenter efficiency, the hardware and software bottlenecks that exist in today's software need to be identified and evaluated. In this work, a number of popular datacenter workloads were evaluated using the Top-Down methodology, with the aim of better understanding the bottlenecks and behavior of these workloads. The goal of this work is to determine whether the applications show any time-varying behavior and whether there is any potential for hardware improvements with respect to energy efficiency. The proposed methodology works well for understanding high-level bottlenecks on modern hardware. We identified time-varying behavior as well as areas of improvement common to several of the studied applications.

Printed by: Reprocentralen ITC
UPTEC IT 16 015
Examiner: Lars-Åke Nordén
Subject reviewer: Erik Hagersten
Supervisor: Trevor Carlson


Contents

1 Introduction
2 Background
   2.1 CPUs
   2.2 Hardware performance counters
   2.3 Top-Down analysis
      2.3.1 Frontend Bound
      2.3.2 Backend Bound
      2.3.3 Bad Speculation
      2.3.4 Retiring
   2.4 Tools used in this work
      2.4.1 Linux Perf
      2.4.2 pmu-tools
   2.5 Docker
   2.6 Cloudsuite
      2.6.1 Web Serving
      2.6.2 Web Search
      2.6.3 Media Streaming
3 Setup
   3.1 Customizing pmu-tools
   3.2 Measurements
   3.3 Modifying the Web Serving Benchmark
   3.4 Multiplexing
   3.5 Visualizing the data
4 Results
   4.1 Multiplex vs no Multiplex
   4.2 Top-Down and multiplex
   4.3 Benchmarks
      4.3.1 Web Serving
      4.3.2 Web Search
      4.3.3 Streaming
5 Related Work
6 Conclusions
   6.1 Future Work
References


Popular Science Summary (Populärvetenskaplig sammanfattning)

Today, the number of datacenters is growing as more and more applications make use of them and the number of users increases. One problem with this is that datacenters use a great deal of energy. It is therefore important to reduce their energy consumption and to optimize processors for datacenter applications.

To find such optimizations, the Top-Down methodology was used and evaluated as a way to find and identify problems and bottlenecks. The idea behind this method is that it should be easy to find and identify the bottlenecks a processor has for a given application. Bottlenecks are identified by examining the application's performance in a hierarchical way. First, one examines, for example, whether an application has problems fetching and decoding a program's instructions, or whether the problem rather lies in the execution of the instructions. Depending on where in the processor problems are identified, only that area or those areas are studied in more detail. For example, if the problems lie in the execution of the instructions, one examines whether the problem is that not enough instructions are executed, or that it takes too long to fetch data.

This work examined how well this method works for studying datacenter applications and whether any hardware changes could be proposed with its help. In addition, it was examined whether the applications showed significant behavior over time and whether their behavior differed between threads.

To do all this, a number of benchmarks from Cloudsuite were run. The measurements were made with a tool called Toplev from pmu-tools, which is a collection of tools for performing different kinds of profiling. Toplev implements Top-Down, and changes were made to allow data to be collected over time and per thread.

Time-varying behavior could be found in some of the applications. Top-Down worked well for information at a high level but was more limited for detailed analysis. This was largely because the threads were active only in short periods of time, in combination with hardware limitations on the measurements. Bottlenecks were found in the processor's instruction caches and also in its execution ports. A future improvement could therefore be to add more type-specific execution ports.


1 Introduction

With the rise of datacenters comes the potential for new and diverse workloads. In addition, datacenters use a large amount of energy [9], and therefore finding ways to reduce this energy footprint while improving performance is beneficial both for the environment and for reducing costs.

In this work, we explored the Top-Down analysis methodology proposed by Yasin [1] for datacenter applications. Top-Down is a hierarchical approach for classifying performance bottlenecks. Understanding the requirements and methodologies for datacenter analysis is needed to get an accurate understanding of their behavior. We used the Cloudsuite [5] benchmark suite, a collection of benchmarks modeled after datacenter workloads. The goal of this work is to understand the behavior and bottlenecks of the applications. A performance analysis was also performed to understand application behavior over time, how bottlenecks correlated over time, and whether or not there exists any time-varying behavior.

During our evaluation, we found that collecting time-varying behavior works well for some parts of the Top-Down methodology, but for some aspects it is not as accurate. Achieving a detailed understanding of an application was difficult due to multiplexing; this could potentially be addressed by issuing multiple runs of the applications or by running with larger intervals, which would give different levels of granularity, and is part of future work. It was also found that 2 of the 3 applications showed significant performance bottlenecks in their instruction caches, which correlates well with previous work.

2 Background

With the rise of datacenters comes a cost in energy [9]. Because of the specific requirements of datacenter workloads, the properties of these workloads can vary and differ from previous computer architecture work [11][9].

Dennard showed [17] that as transistor feature sizes shrink, transistors can operate at higher frequencies at the same power density, and Moore showed [18] that the economics of putting more transistors on a chip would allow the number of transistors to grow exponentially. Because Dennard scaling is coming to an end, the number of simultaneously active transistors is no longer increasing, which can limit system complexity. Moore's law is also ending; thus, it is no longer possible to add transistors in a cost-effective manner. Most datacenters today are built around conventional desktop processors that are designed for a broad market and do not match the needs of datacenter applications [9]. Thus, more specialized and efficient processors could be a way to enable future high-performance datacenters. To design specialized hardware, one needs to identify datacenter workload bottlenecks, as well as methodologies and tools to find them. However, current techniques appear to be quite coarse-grained, and more detail is desired without resorting to simulation; this work tries to improve on them.

2.1 CPUs

There have been many efforts to keep CPU pipelines as full as possible, which is one of the reasons why modern CPUs are complex. Some techniques employed are out-of-order execution, branch prediction, and hardware prefetching [1].

One can view the CPU pipeline as two distinct parts: a frontend and a backend. The frontend fetches instructions and generates a collection of micro-operations (µops). The backend schedules, executes, and retires the µops.

Most processors also use a cache hierarchy that is accessed before main memory, usually a three-tier hierarchy consisting of an L1, L2, and L3 cache. The benefit of using caches is that frequently used data is stored in the fast but small cache levels for quick access. Usually, when data is put in the cache, nearby data is also brought in, since it is likely to be referenced soon. The L1 cache is very small but very fast; L2 is bigger but slower than L1; and L3 is larger but slower than L2. The CPU first looks in the L1 cache; if the data is not there, it looks in L2, then L3, and then main memory.
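As a concrete illustration of this lookup order, the following sketch models a three-level hierarchy in which each miss falls through to the next level and the fetched data is filled into the levels that missed. The sizes, the fill policy, and the LRU replacement are illustrative simplifications; real caches are set-associative and operate on cache lines, not single addresses.

```python
# Toy model of a three-level cache lookup: check L1, then L2, then L3,
# then main memory, filling missed levels on the way back.
from collections import OrderedDict

class Cache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()          # address -> entry, in LRU order

    def lookup(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)    # refresh LRU position on a hit
            return True
        return False

    def insert(self, addr):
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[addr] = True

def access(addr, l1, l2, l3):
    """Return which level served the access: L1, L2, L3 or memory."""
    for name, cache in (("L1", l1), ("L2", l2), ("L3", l3)):
        if cache.lookup(addr):
            return name
        cache.insert(addr)                  # fill this level on the miss path
    return "memory"

l1, l2, l3 = Cache(8), Cache(64), Cache(512)
print([access(a, l1, l2, l3) for a in (1, 2, 1, 3, 1)])
# -> ['memory', 'memory', 'L1', 'memory', 'L1']
```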

2.2 Hardware performance counters

Hardware performance counters are a set of special-purpose registers that count various hardware events, for example mispredicted branches, executed instructions, or cache misses. These counts can then be used to better understand how software runs on a specific version of the hardware.

While generally 4, and sometimes even 8, distinct events can be monitored at one time, there are cases where it is beneficial to monitor a larger number of simultaneous events. One method to solve this problem is time-based multiplexing of performance counters, and this methodology is explored in this work.

Instead of monitoring only a fixed number of distinct events throughout the execution of the entire application, the execution is split into time periods, and in each period a group of events that fits the available counters is monitored [15]. The returned values are then scaled by the fraction of the application's execution time during which each event was actually counted. The accuracy of multiplexed results depends on the application workload and runtime. For the workloads used in this work, it was found that time multiplexing can prove to be complex when trying to understand specific workload characteristics or bottlenecks in an extremely short timespan.
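The scaling step can be made concrete with a small sketch. Perf reports, for each event, both how long the event was enabled and how long it actually ran on a counter; the estimate for the full run is the raw count scaled by the ratio of the two. The function below mirrors that calculation; the numbers in the example are made up.

```python
# Scale a time-multiplexed raw count up to a full-run estimate.
def scale_count(raw_count, time_enabled, time_running):
    """Estimate the full-run count for a multiplexed event."""
    if time_running == 0:
        return None          # event never got a counter: nothing to scale
    return raw_count * (time_enabled / time_running)

# An event that counted 1_000_000 while scheduled for 25 ms out of a
# 100 ms window is reported as roughly 4_000_000.
print(scale_count(1_000_000, time_enabled=100e-3, time_running=25e-3))
```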

2.3 Top-Down analysis

The Top-Down analysis methodology is a hierarchical classification of CPU bottlenecks proposed by Ahmad Yasin in "A Top-Down Method for Performance Analysis and Counters Architecture" [1]. Using traditional performance counter statistics, it can be difficult to tell what the actual bottlenecks of an application are. Take an example where cache misses are counted. Intuitively, a cache miss would have a large penalty, because fetching data from memory could stall the processor. Nevertheless, modern processors try to compensate by executing something else while the data is fetched. This can make it difficult to know what the actual penalty of a cache miss was, and different CPUs have different penalties. Top-Down aims to solve these issues with a straightforward methodology for identifying and understanding the bottlenecks of an application.

Top-Down is an easy way to identify and understand the critical bottlenecks of an application on out-of-order CPUs using specifically designed performance counters. The methodology allows one to more easily understand the performance of an application. First, the performance is classified as either frontend bound, backend bound, retiring, or bad speculation (explained in sections 2.3.1 to 2.3.4), making up the top level of the hierarchy. Depending on where the bottlenecks are, the subcategories of the top level can be explored.

Each category in the top level can be broken down into subcategories. The Top-Down methodology allows one to drill down into subcategories only when bottlenecks are found in that area: if the value of a category exceeds a threshold, then and only then should its subcategories be explored. Table 1 shows the Top-Down hierarchy and its categories. The different categories are explained in the following sections, focusing mostly on Level 1 and Level 2.
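To make the top level concrete, the following sketch computes the four Level 1 fractions from raw counter values using the formulas given by Yasin [1] for a 4-wide machine, and applies a simple threshold to decide which categories are worth drilling into. The 20% threshold is an illustrative choice, not a value prescribed by the methodology.

```python
# Level 1 Top-Down classification following the formulas in Yasin [1].
# The event names are the Intel counter names used by the methodology.
PIPELINE_WIDTH = 4   # issue slots per cycle on the assumed machine

def top_down_level1(c):
    slots = PIPELINE_WIDTH * c["CPU_CLK_UNHALTED.THREAD"]
    frontend = c["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    bad_spec = (c["UOPS_ISSUED.ANY"] - c["UOPS_RETIRED.RETIRE_SLOTS"]
                + PIPELINE_WIDTH * c["INT_MISC.RECOVERY_CYCLES"]) / slots
    retiring = c["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    backend = 1.0 - (frontend + bad_spec + retiring)  # remainder of the slots
    return {"Frontend Bound": frontend, "Bad Speculation": bad_spec,
            "Retiring": retiring, "Backend Bound": backend}

def categories_to_explore(level1, threshold=0.20):
    # Drill down only where the top level flags a large enough fraction.
    return [name for name, frac in level1.items() if frac > threshold]

counts = {"CPU_CLK_UNHALTED.THREAD": 1_000_000,
          "IDQ_UOPS_NOT_DELIVERED.CORE": 1_200_000,
          "UOPS_ISSUED.ANY": 2_600_000,
          "UOPS_RETIRED.RETIRE_SLOTS": 2_400_000,
          "INT_MISC.RECOVERY_CYCLES": 50_000}
level1 = top_down_level1(counts)
print(level1, categories_to_explore(level1))
```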

2.3.1 Frontend Bound

Frontend stalls occur when the backend is ready to execute additional µops but the frontend cannot supply enough µops. Yasin states that "Dealing with Frontend issues is a bit tricky as they occur at the very beginning of the long and buffered pipeline. This means in many cases transient issues will not dominate the actual performance. Hence, it is rather important to dig into this area only when Frontend Bound is flagged at the Top-Level." To further distinguish the cause of the stalls, frontend bound is divided into a latency bound and a bandwidth bound category.


Level 1          Level 2     Level 3                      Level 4
---------------  ----------  ---------------------------  ---------
Frontend         Latency     iTLB Miss
                             iCache Miss
                             Branch Resteers
                             Other
                 Bandwidth   Fetch unit 1
                             Fetch unit 2
Backend          Core        Execution Ports Utilization  0 ports
                                                          1 port
                                                          2 ports
                                                          3 ports
                             Divider
                 Memory      Stores Bound
                             L1
                             L2
                             L3
                             Ext. Memory Bound            Bandwidth
                                                          Latency
Bad Speculation  Branch Mispredicts
                 Machine Clears
Retiring         Base        Floating-point Arithmetic    Scalar
                                                          Vector
                             Other
                 Micro-code Sequencer

Table 1: Top-Down hierarchy


Latency Bound: Frontend latency bound represents stalls where the frontend takes too long to produce µops. These can occur because of instruction cache misses, but also because of other CPU-specific events, which make up the subcategories of latency bound. Latency issues can also occur because of branch resteers, meaning that no µops were delivered because the CPU was still fetching instructions from the correct path after a branch misprediction.

Bandwidth Bound: Frontend bandwidth bound represents cases where not enough µops could be supplied due to inefficiencies in the instruction decoders; these are further classified into a category for each fetch unit.

2.3.2 Backend Bound

Backend stalls occur when there are µops ready for execution, but the backend does not yet have sufficient resources to execute them. These issues can appear when there are data cache misses or when the execution ports are not fully utilized. Backend-bound stalls are further divided into memory bound and core bound.

Memory Bound: An application is memory bound when the execution ports are starved because of inefficiencies in the memory subsystem (caches and main memory), for example stalls that happen because all cache levels miss. Memory bound is further divided into a subcategory for each cache level. An application is also memory bound when the execution ports are stalled because of a large number of buffered store instructions; thus, stores bound is a subcategory under memory bound. There is also a category for when main memory is the cause of the stalls, which is further divided into bandwidth and latency.

Core Bound: Core-bound issues manifest as poor execution port utilization, e.g. only 2 ports being used at a time when 4 are available. This can happen when there are many instructions of the same type: for example, when there are only floating-point instructions, but the CPU can execute just 1 floating-point instruction per cycle even though it can commit 4 instructions per cycle. A division operation that takes long enough can also reduce the throughput of the execution ports. Thus, the core bound category is split into divider and execution ports utilization, and the latter is further classified by how many ports were utilized.

2.3.3 Bad Speculation

The bad speculation category covers stalls due to mispredicted speculation. These stalls can occur because the pipeline is blocked while recovering from a misspeculation; the category also covers slots wasted on issued µops that never retire. Yasin explains in [1] why this category is at the top level: "Having Bad Speculation category at the Top-Level is a key principle in our Top-Down Analysis. It determines the fraction of the workload under analysis that is affected by incorrect execution paths, which in turn dictates the accuracy of observations listed in other categories." If there are stalls because of bad speculation, it can be a good idea to look into this area first.

Bad speculation is divided into machine clears and branch mispredicts. Branch mispredicts is self-explanatory, while machine clears reflects stalls due to the pipeline being flushed because of incorrect data speculation.

2.3.4 Retiring

Retiring represents slots where issued µops eventually retire. A retiring value of 100% means that the maximal number of µops retired each cycle [1]. Even with a high retiring value, there can still be room for further improvement. Retiring is divided further into a base category and a microcode sequencer category.

Base: The base category represents retiring slots from ordinary µops. Within it, floating-point arithmetic is broken out as a subcategory, further divided into scalar and vector operations; besides floating-point arithmetic, there is a category for everything else.

Microcode Sequencer: The microcode sequencer category represents µops that were retired by microcode sequences, such as floating-point assists.

2.4 Tools used in this work

2.4.1 Linux Perf

Linux perf is a command-line profiler for Linux machines. It utilizes the perf_events interface exported by recent kernel versions [3]. Perf comes with a set of commands for profiling that all use performance counters, for example record and stat. Record is used for sampling and creates a profile of an application. Perf stat counts the number of occurrences of each chosen event type during an application's runtime; it only returns a summary of counts for each chosen event and generally has a low overhead. Perf record has a higher overhead than stat, but in addition to the total number of events it also reports which software and system calls caused the chosen events.

Using perf by itself, a lot of statistics can be collected about the system, but it can be hard to get deeper insight into the application and its problems. The statistics can, however, be used in methodologies such as Top-Down to get a deeper understanding of the problems.
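As an illustration of collecting such raw statistics, the sketch below drives perf stat from a script, attaching to an already-running process for a fixed measurement window. The event list and pid are placeholders; the flags used (-x for machine-readable output, -e for events, -p for attaching to a pid) are standard perf stat options.

```python
# A minimal sketch of collecting raw counter statistics with perf stat,
# as a starting point for Top-Down-style analysis.
import subprocess

def perf_stat(pid, events, seconds):
    """Attach perf stat to a running process and return its csv output."""
    cmd = ["perf", "stat", "-x", ",",        # machine-readable output
           "-e", ",".join(events),           # events to count
           "-p", str(pid),                   # attach to an existing pid
           "--", "sleep", str(seconds)]      # measure for a fixed window
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stderr                     # perf stat writes to stderr

print(perf_stat(1234, ["cycles", "instructions", "branch-misses"], 10))
```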


2.4.2 pmu-tools

Pmu-tools [4] is a collection of tools for profiling and collecting statistics on Intel CPUs, built on top of Linux perf.

One component, the Toplev tool, is the most applicable to this work. It implements the Top-Down methodology and automatically chooses the most appropriate performance counters for each specific microarchitecture. For this work, the tool was modified to add some options that were missing initially.
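A sketch of how such a Toplev run can be driven from a script is shown below. The -l (level), -I (interval in milliseconds), -x (CSV separator), and -o (output file) options exist in stock Toplev; the per-thread collection used in this work came from the modifications mentioned above, so it is not shown here.

```python
# A sketch of invoking toplev on a workload; flags as noted in the text.
import subprocess

def run_toplev(workload_cmd, level=1, interval_ms=1000, out="toplev.csv"):
    cmd = ["toplev.py",
           "-l", str(level),        # Top-Down level to collect
           "-I", str(interval_ms),  # collect statistics per interval
           "-x", ",",               # csv output
           "-o", out,
           "--"] + workload_cmd
    subprocess.run(cmd, check=True)

run_toplev(["sleep", "30"], level=2, interval_ms=100)
```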

2.5 Docker

Docker has become a popular way to deploy and run applications in isolatedenvironments, with their own file systems containing everything needed to runan application. These isolated entities are called containers.

Containers are created from images; an image consists of the runtime, libraries, and binaries needed to run a specific application.

The architecture of Docker consists of three parts: the Docker client, the Docker daemon, and the registry. The user uses the client to communicate with the daemon to create and run containers. The daemon runs on the host and creates and runs the containers. The registry holds the images from which containers are built; it can be either local or shared on Docker Hub, Docker's own registry.

At first glance, a Docker container appears to be similar to a virtual machine. However, a container does not do any hardware virtualization and runs on the same kernel as the host, making a container more lightweight: it boots in a couple of seconds, while a virtual machine boots a complete operating system and thus takes longer.
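The client-daemon-registry flow described above can be illustrated with two client commands: a pull, which makes the daemon fetch an image from the registry, and a run, which makes the daemon create and start a container from it. The image name is just an example.

```python
# Two client requests to the Docker daemon, driven from a script.
import subprocess

def pull_and_run(image):
    subprocess.run(["docker", "pull", image], check=True)  # registry -> host
    subprocess.run(["docker", "run", "--rm", "-d", image], # image -> container
                   check=True)

pull_and_run("nginx")
```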

2.6 Cloudsuite

Cloudsuite is a benchmark suite containing a number of client-server benchmarks representing different cloud-based applications, such as web serving, streaming, and data caching. The benchmarks try to mimic the kind of behavior one can see in a datacenter today [9]. Cloudsuite provides 8 benchmarks, but this work focused on 3 of them. Docker containers were used to deploy and evaluate the benchmarks.

Cloudsuite was used because it was straightforward to set up and run, and it also brought the convenience of providing benchmarks that mimic real-life datacenter behavior and stress the system.


2.6.1 Web Serving

The web serving benchmark simulates a web server used for web browsing, social networking, and other similar activities on the web. The benchmark is set up as a web stack with four parts: a Memcached server, a database server, a web server, and the client.

The web server runs Elgg, a real-life social networking engine [6] used by several organizations and similar to applications such as Facebook [12]. Elgg uses MySQL as the database, and the database queries are cached with Memcached, an in-memory key-value store for small arbitrary data [8], to improve latency and throughput.

The client used Faban [13] to set up workloads and benchmarks. First, the client had to populate the database with users; these were simulated clients that log in and use the system. The benchmark was set up in such a way that common actions, such as posting to the wall, were done more often than uncommon actions like login/logout [5].

2.6.2 Web Search

The web search benchmark consists of two parts: a client and one or more indexing servers. The client sets up the benchmarks and workloads with Faban [13]. The server contains text and fields from crawled websites [5] and relies on the Apache Solr search engine framework [14], which powers services such as Best Buy and Sears. The data sets of the server are stored in memory to keep throughput and quality of service high. The client containers simulate real-world clients that send requests to the server.

2.6.3 Media Streaming

The media streaming benchmark consists of two parts: a server and a client. The server uses nginx, which is an HTTP server, a reverse proxy server, a mail proxy server, and a generic TCP/UDP proxy server, and which powers services like Netflix [7].

The client uses Httperf, a tool that measures web server performance and sets up workloads; it was used to generate a mix of video requests of different qualities and lengths to stress the server [5].


3 Setup

3.1 Customizing pmu-tools

To perform the measurements and the Top-Down analysis, custom-made scripts were considered, as these would have made it easier to add all desired features. However, implementing the Top-Down methodology was more difficult than first imagined. Since not all needed events were listed by Perf, some had to be added manually from the Intel manual, with the risk that events could be misread, leading to errors in the Top-Down analysis. Instead, pmu-tools was used, to be sure that the events were correct; the tool also chooses appropriate counters for the tested hardware.

3.2 Measurements

For Top-Down analysis, several statistics from the applications are needed, and it is important that the measurements are done in a representative way. One option is to run the tool directly, collecting statistics for the whole system, but since many of the benchmarks simulate both the servers and the clients, measuring the whole system would not have been representative.

The measurements could have been done inside the containers or from the outside, using the option of attaching to the pids of the interesting parts of the application. Doing the measurements inside the container would have worked if only one container was measured, but in some benchmarks the non-client part consists of several containers. This could lead to issues when synchronizing the data from the different containers, but comparing results from inside and outside the container showed no significant difference. Thus, the measurements were performed outside the containers.

A method for knowing which threads to actually look closer at was desired, since data was collected per pid/thread, which produces data for many threads, not all of which are relevant. This was solved by calculating the average CPU usage for each thread and then looking closer only at the threads with high average CPU usage.
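A sketch of this selection step is shown below: it reads interval data, computes the average CPU usage per thread, and keeps only the threads above a cutoff. The column names and the 5% cutoff are assumptions for illustration, not the exact format or threshold used.

```python
# Keep only the threads whose average CPU usage exceeds a cutoff.
import csv
from collections import defaultdict

def busy_threads(csv_path, min_avg_usage=0.05):
    usage = defaultdict(list)
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            usage[row["pid"]].append(float(row["cpu_usage"]))
    # Average each thread's per-interval usage and filter.
    return {pid: sum(v) / len(v) for pid, v in usage.items()
            if sum(v) / len(v) >= min_avg_usage}
```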

When collecting the over-time data, Toplev's interval option was used. With this option, statistics are collected for set intervals, and the Top-Down statistics are calculated for each interval.


3.3 Modifying the Web Serving Benchmark

When the web serving benchmark is run, it starts by filling the database with users. This takes around half of the benchmark's total runtime, and data for this part was of no interest. Thus, changes were made to the Dockerfile and the image of the client container so that, instead of creating all the users at the beginning of the benchmark and then running the benchmark itself, it now creates all the users when the image is built. This allows measurements to be done only when the actual benchmark is run.

3.4 Multiplexing

When running Toplev for Level 2 Top-Down statistics, more events are needed than there are available hardware counters, so the events have to be multiplexed. Recalling that the results from multiplexing are estimates, a short study was done to find out whether this would cause any issues. Toplev has an option for doing no multiplexing; with this option, the tool runs the application one time for each group of events. This option is not available together with the interval option at this time, and adding this functionality would have been too time-consuming and complicated.

The study was done with a modified version of Toplev that made it possible to choose which iteration to run. This makes it possible to reset the containers between runs. For every run except the last, the intermediate counts are stored to file and the tool exits without calculating the Top-Down statistics; in the last iteration, the Top-Down statistics are calculated.
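The bookkeeping of such an iteration scheme might look like the sketch below: every run merges its event group's raw counts into a store on disk, and only the final run returns the merged counts for the Top-Down calculation. The JSON store is an illustrative choice.

```python
# Accumulate one event group's raw counts per run; compute only at the end.
import json, os

def save_iteration(counts, is_last, store="counts.json"):
    merged = {}
    if os.path.exists(store):
        with open(store) as f:
            merged = json.load(f)
    merged.update(counts)            # add this run's event group
    with open(store, "w") as f:
        json.dump(merged, f)
    # Top-Down metrics (section 2.3) are computed only on the last run,
    # once every event group has been collected without multiplexing.
    return merged if is_last else None
```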

3.5 Visualizing the data

The graphs from the Toplev script visualize the Top-Down statistics over time, with time in seconds on the x-axis. The y-axis goes from 0% to 100%, indicating how large a part of the performance of an application falls in a certain area, e.g. frontend or backend, with the exception of the CPU utilization and mux parts of the graphs. The CPU utilization goes from 0.0 to the number of cores used (e.g. 4.0 for 4 cores).

If the verbose option (calculating and reporting all the Top-Down metrics for the chosen level) is used, the top level (frontend, backend, retiring, and bad speculation) should add up to 100%. The Level 2 and deeper metrics do not necessarily add up to 100%.

Toplev has several output options: plain text, a comma-separated values (csv) file, and, if running with the interval option, a graphing option. The graphing option was of interest, but it does not work together with the per-pid option. Instead, the results are stored in a csv file that is later parsed, and a file is created with the data for each pid. Graphs can then be made for each pid by invoking the graphing script on each file.

A downside with how the calculations are done in the Top-Down methodology, however, is that an idle thread is reported as 100% backend bound, because technically the backend was not receiving any µops. This also shows in the graphs and at times makes it hard to distinguish the real issues. Therefore, the csv files are parsed and values are set to 0 where the CPU usage is zero.
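A sketch of this post-processing step is shown below: the interval CSV is split per pid, and all metrics are zeroed in intervals where the thread was idle, so that idle time is not misread as 100% backend bound. The column names are assumptions about the parsed file, not Toplev's exact output format.

```python
# Split the interval csv per pid and zero out metrics for idle intervals.
import csv
from collections import defaultdict

METRICS = ["frontend", "backend", "retiring", "bad_speculation"]

def split_and_clean(csv_path):
    per_pid = defaultdict(list)
    with open(csv_path) as f:
        for row in csv.DictReader(f):
            if float(row["cpu_usage"]) == 0.0:
                for m in METRICS:
                    row[m] = "0"      # idle: drop the bogus 100% backend
            per_pid[row["pid"]].append(row)
    return per_pid
```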


4 Results

4.1 Multiplex vs no Multiplex

A short study on multiplexing versus no multiplexing was performed, in which the total Level 2 statistics were collected for the web serving benchmark, 5 times for the multiplexed version and 5 times for the non-multiplexed version. When the runs were finished, the average and standard deviation were calculated for each metric and version.

Iteration           Bad Speculation  Branch Mispredicts  Frontend  Frontend Latency  CPU usage
1                   15.74            15.04               32.11     21.99             0.07
2                   15.59            15.02               30.85     21.89             0.07
3                   14.68            14.21               29.07     20.59             0.07
4                   16.43            15.53               31.13     22.87             0.07
5                   14.96            14.33               30.68     22.27             0.08
Average             15.48            14.83               30.78     21.92             0.07
Standard deviation  0.69             0.55                1.10      0.84              0.005

Table 2: Multiplex

Iteration           Bad Speculation  Branch Mispredicts  Frontend  Frontend Latency  CPU usage
1                   15.27            15.09               29.39     22.11             0.07
2                   14.54            15.13               27.82     21.81             0.07
3                   15.08            14.78               29.93     22.64             0.07
4                   15.55            15.26               30.37     21.34             0.07
5                   15.25            15.42               29.12     22.11             0.07
Average             15.14            15.14               29.33     22.03             0.07
Standard deviation  0.37             0.24                0.97      0.48              0

Table 3: No multiplex

As shown in Figures 1 and 2, the multiplexed metrics have higher standard deviations than the non-multiplexed versions. The highest standard deviation found was only around 1%; thus, the difference was deemed not significant enough to impact the rest of the results.


Figure 1: Frontend standard deviation. (a) Frontend; (b) Frontend Latency.

Figure 2: Bad Speculation standard deviation. (a) Bad Speculation; (b) Branch Mispredicts.

4.2 Top-Down and multiplex

When measuring Level 2 statistics for the web serving benchmark, there were some complications with the top-level results. The Level 1 statistics show that the database threads with high CPU usage were mostly frontend bound. The Level 2 statistics, however, show them as mostly backend bound at the top level. It was known beforehand that Level 2 statistics could be less accurate due to multiplexing, but the result was still surprising, and this issue could be seen in the other 2 benchmarks as well.

Examining Figure 3, the different sections were split into smaller sections as the interval was lowered, which can also be seen in the CPU usage in Figure 4. The sections could possibly be split into even more sections if the interval size were reduced further; however, this cannot be checked at the moment, because Toplev only supports intervals larger than or equal to 10 ms.

The reason for this behavior is that the threads of the application were active only for short periods of time. This could explain the issues in the Level 2 statistics: if the time a thread was active was less than the time it took to cycle through the multiplexed events, there was a possibility that nothing was counted, leaving no values to scale.
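A toy calculation makes this failure mode concrete: if four event groups are rotated on the counters and a thread is only active in a short burst, the burst can overlap only some of the groups' time slices, and the remaining groups count nothing for that period. All numbers below are made up.

```python
# Which event groups ever overlap a short activity burst under rotation?
GROUPS = 4            # event groups rotated on the counters
SLICE_MS = 2.5        # time each group stays scheduled

def counted_groups(active_start_ms, active_len_ms, total_ms=100):
    seen = set()
    t, g = 0.0, 0
    while t < total_ms:
        # does this group's slice overlap the thread's active burst?
        if t < active_start_ms + active_len_ms and t + SLICE_MS > active_start_ms:
            seen.add(g)
        t += SLICE_MS
        g = (g + 1) % GROUPS
    return seen

# A 3 ms burst overlaps at most two of the four groups; the other two
# groups report zero counts for that period.
print(counted_groups(active_start_ms=10.0, active_len_ms=3.0))
```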


Even though the top level showed faulty values, the other metrics could still be useful, but they have to be studied carefully, since some metrics depend on their parent's metric, making some results less accurate or even skipped completely. Take the case where Level 1 shows an application to be mostly frontend bound but the Level 2 statistics do not: the frontend-latency values can still be reported, since they do not depend on the top-level frontend-bound value, but the frontend-bandwidth values could be skewed, since they do. By using larger intervals we were able to get more accuracy.

Figure 3: Top-Down Level 1 statistics for different interval sizes for the web serving application. (a) 1000 ms intervals; (b) 100 ms intervals; (c) 10 ms intervals.


Figure 4: CPU usage for different interval sizes for the web serving application. (a) 1000 ms intervals; (b) 100 ms intervals; (c) 10 ms intervals.

4.3 Benchmarks

The benchmarks were run on a machine with an Intel(R) Core(TM) i7-6700K (Skylake) with 4 cores at 4 GHz and 64 gigabytes of RAM. The operating system used was openSUSE Tumbleweed. For the individual benchmark results, the server side and the client side were each pinned to half of the cores.

4.3.1 Web Serving

Figures 6 and 7 correspond to threads that run server-side scripts handling requests from the clients, and Figure 8 corresponds to a thread belonging to the database. Both the php threads and the database thread are shown to be frontend bound most of the time, and Figures 9, 10, and 11 show that the threads were frontend latency bound, indicating instruction cache misses or many branch resteers. This behavior might occur because the web serving application performs several different kinds of tasks, leading to a large instruction footprint and therefore to inefficiencies in the instruction caches.


Figure 5: Average CPU usage for Level 1 web serving statistics

Figure 6: Level 1 Top-Down statistics php5-fpm-20058

Figure 7: Level 1 Top-Down statistics php5-fpm-20095


Figure 8: Level 1 Top-Down statistics mysqld-20103

Over time, the values stayed mostly the same, with some exceptions. Of the threads with high CPU usage, the thread php5-fpm-20095 (Figure 7) did not show any significant time-varying behavior; it showed mostly high frontend-bound values, though still not as high as the database thread. The threads in Figures 6 and 8 showed time-varying behavior at the end, where they both showed higher backend-bound values.

Figure 9: Level 2 Top-Down statistics mysqld-5134


Figure 10: Level 2 Top-Down statistics php5-fpm-5126

Figure 11: Level 2 Top-Down statistics php5-fpm-4151


4.3.2 Web Search

Figure 12: Average CPU usage for Level 1 web search statistics

In Figure 12, the thread docker-6618 showed the second-largest average CPU usage. Examined closer in Figure 13, it was mostly active in the beginning, when the benchmark was built and initialized. The Docker thread still showed some spikes during the whole run, with several spikes at the end.

Figure 13: Level 1 Top-Down statistics docker-6618


Figure 14: Level 1 Top-Down statistics java-14199

Figure 15: Level 1 Top-Down statistics java-14203

Figure 16: Level 1 Top-Down statistics java-14726

Figures 14, 15, and 16 show that the threads shared similar behavior. The threads all showed high backend-bound values in the beginning, due to the caches still being cold and the first portion of the application being ramp-up. During the rest of the run, retiring was their highest value, but backend-bound values remained high. Over time, the threads were retiring the majority of the time, but they showed periodic behavior with phases of higher backend-bound values. Examining the Level 2 data in Figure 17 shows that the backend-bound time was due to the threads being core bound. A reason for being core bound could be that the threads perform several operations of the same type, so that not all execution ports can be utilized.

Unlike the web serving application, web search only does one kind of task, yet it showed more time-varying behavior. This could be because each request has a different impact on performance, or alternatively because each request has one part where it is more backend bound and is mostly retiring later in its execution. The second case would imply that the type of operations changes during execution, making it possible to utilize the execution ports more.

Figure 17: Level 2 Top-Down statistics java-17693


4.3.3 Streaming

Figure 18: Average CPU usage for Level 1 streaming

When running the streaming benchmark, there were some issues: the runtime varied from a couple of minutes to around 40 minutes. The graphs also changed somewhat between runs and interval sizes; some examples are shown in Figure 19. The CPU usage, however, followed a similar pattern between runs.


Figure 19: Top-Down Level 1 statistics for different interval sizes for the streaming application. (a) 1000 ms intervals; (b) 100 ms intervals; (c) 10 ms intervals.


Some of the differences can be attributed to the fact that the application is multi-threaded, so the scheduling of the threads accounts for part of them. The differences can also be due to several videos being streamed throughout the run, with different lengths and qualities. Using smaller intervals, it was hard to distinguish the characteristics, because the long runtime made some lines in the graph very thin. For this reason, an interval size of 1000 ms was used for this benchmark. The issues with this benchmark are suspected to be due to the setup and infrastructure used in this work rather than the benchmark itself.

During the different runs, the Docker thread stayed consistent (Figure 20). It showed a high CPU usage in the beginning and then ramped down. This repeating behavior, seen in Figures 20 and 19, occurred because the benchmark runs several videos with different qualities and lengths. Just as in the other benchmarks, the Docker thread was one of the threads with the highest average CPU usage, but the difference in this case was that for the streaming application the thread was not only active in the beginning but during the whole run.

Figure 21: Level 2 Top-Down statistics docker-31310


Figure 20: Top-Down Level 1 statistics for different interval sizes for the Docker thread in the streaming application. (a) 1000 ms intervals; (b) 100 ms intervals; (c) 10 ms intervals.


The Docker thread was mostly backend and frontend bound, with variation over time, as seen in Figure 20. At the start of each new video, the application was mostly backend bound but also retired more instructions; after a while, retiring shrank and stayed constant. Looking at the Level 2 statistics in Figure 21, the biggest reason for the thread being backend bound was that it was mostly memory bound. Further exploration with higher Top-Down levels may be needed to understand the root cause. The frontend issues stem from frontend latency, as seen in Figure 21, indicating problems in the instruction caches.

Figure 22: Level 1 Top-Down statistics nginx-30722

Figure 23: Level 1 Top-Down statistics nginx-30723

As for the other threads, seen in Figures 22, 23, 24, and 25, they all show some time-varying behavior. The nginx threads showed CPU behavior similar to the Docker thread. As for the Top-Down metrics, the threads were backend bound while also having high retiring values. The majority of the time, the threads showed many retiring spikes, reducing the frontend-bound values while the backend-bound values stayed the same. The figures also cover a long period of time, which makes the spikes look shorter than they actually were.


Figure 24: Level 1 Top-Down statistics nginx-30720

Figure 25: Level 1 Top-Down statistics nginx-30721

Unfortunately, the generated Level 2 values are very hard to distinguish. All values in Figure 26 where the mux value reached 100 should be disregarded, because in those areas no useful data was measured, and scaling 0 by anything still yields 0. When examining the Level 2 statistics, there seemed to be issues both with being memory bound in the backend and with being latency bound in the frontend.

Since the frontend problems were due to being latency bound, this suggests inefficiencies in the instruction caches, and the varying behavior in frontend and retiring suggests that the application might switch sets of instructions during the streams. This could arise because the application starts out missing instructions until it starts hitting in the cache and more instructions are able to retire; then a new set of instructions is used and it starts missing again.


Figure 26: Level 2 Top-Down statistics nginx-2886

5 Related Work

Ferdman et al. introduced Cloudsuite in [9], a benchmark suite for scale-out workloads based on real-world datacenter workloads. They explored the microarchitectural implications of their workloads' behavior using performance counters. Similar to [9], Palit et al. [12] implemented representative benchmarks for online applications. The authors then compared their benchmarks with Ferdman et al. [9] and determined that the resulting benchmarks exhibited similar microarchitectural behavior.

Wang et al. [10] presented BigDataBench, a benchmark suite targeting real applications and diverse data sets. The authors also characterized the workloads in the benchmark suite, finding that big data applications have a very low operational intensity (the ratio of work to memory traffic) compared to traditional benchmarks. The authors also showed that different input data volumes had an impact on the results.

In [11], Kanev et al. performed a microarchitectural analysis of live applications on over twenty thousand Google machines over a three-year period. They also did Top-Down analysis on some applications, but mostly at the top level and not over time. An alternative to Linux perf is LiMiT [16], developed by Demme et al., which does not use system calls to access the performance counters, reducing overhead.


6 Conclusions

There is a need to improve the energy efficiency of datacenters, and processors targeted toward datacenter workloads could make this possible. To accomplish this, we first need to identify and understand the bottlenecks of modern datacenter workloads. The Top-Down methodology is one option, and it was tested and evaluated in this work. The methodology classifies the performance of an application in a hierarchical way to better understand its bottlenecks. In this work we updated pmu-tools [4] (performance monitoring tools) to allow Top-Down statistics to be captured over time and per thread. This tool, in turn, uses Linux's perf tool as the basis for collecting data.

The methodology was evaluated to better understand its applicability to datacenter workloads. For both the web search and web serving applications, Level 1 Top-Down statistics provided the insight needed to identify bottlenecks at a broad level. The Level 2 statistics were not as accurate, due to the use of multiplexed performance counters. Even though results with multiplexing were shown to have only a slightly higher standard deviation than those without, other issues with multiplexing occur anyway. Our investigation has shown that the top-level values change between collecting Level 1 and Level 2 Top-Down statistics. Recalling that Level 2 Top-Down analysis requires time multiplexing of performance counters, the issue occurs when a thread is active for small amounts of time, leading to cases where there is not enough time to cycle through the events. Thus, no data is collected and the metrics are miscalculated. In this work, larger interval sizes were used to acquire better accuracy. Another option would be to rerun the application several times, collecting the events separately. We did not explore this due to complications with aligning the threads and phases across runs, and also the complications of resource sharing and scheduling between threads.

Analysing Top-Down data over time was done to determine whether or not there existed any time-varying behavior. Together with the over-time data, per-thread data was also studied to determine whether there existed any similarities between threads. The active threads of the web search application showed periodic behavior, alternating between backend bound and retiring, where the backend-bound value varied between around 25%-50% and retiring between around 50%-75%. The Level 2 statistics show that the backend problem was caused by poor execution port utilization, which indicates that the application sometimes performs many operations of the same type; adding more execution ports of that type could be a way to address this.

Time-varying behavior was also found in the streaming benchmark, in both the CPU usage and the Top-Down statistics. For every started video in the benchmark, the CPU usage would rise at first and then go down until it reached a steady state lasting until the next video. A limitation of this work was that we did not use the same infrastructure as in [9], and this CPU behavior could possibly be avoided using a similar infrastructure. The Top-Down statistics, however, show a similar amount of backend stalls throughout the run, but also a periodic behavior in which the retiring percentage would rise while frontend bound fell, and then the frontend percentage would rise while retiring fell. With the Level 2 results, we find that the frontend stalls were caused by frontend latency, suggesting instruction cache inefficiencies. The cause of the time-varying behavior could be that the thread runs several different sets of instructions, leading to instruction cache misses when a new set is started. In comparison to web search and streaming, the web serving application was more stable in its behavior over time.

Looking at all the applications, it is of interest to see whether they share any similar behavior and characteristics; this is also important for determining whether there are any hardware changes that can be made. The applications all showed high frontend-bound values, and the frontend stalls were due to the applications being frontend-latency bound. This suggests that to improve efficiency for these workloads, better instruction caches or branch predictors are needed. All of the applications showed some backend stalls, but in contrast to the frontend, the backend issues stemmed from different parts of the backend: the backend stalls of web search were mostly due to the application being core bound, while streaming and web serving were memory bound, making it harder in general to suggest a change other than improving the whole backend.

6.1 Future Work

For future work, one thing that could be done is to add an option to get the total Top-Down statistics for the application while at the same time getting statistics over time, as well as an option to get the data per part of the application instead of for every single thread, which is what happens when running with the per-thread option.

To solve the multiplexing issues, one could modify the scripts so that the application is run several times, collecting statistics over time, and synchronize the runs before calculating the Top-Down statistics. However, doing this, one would have to examine what impact the synchronization has on the results.

Another option that would have been interesting to explore, besides CPU performance, is profiling network and disk usage, to find bottlenecks in those areas and to check whether improvements could be made there to improve energy usage.

A limitation of this thesis is that only applications running in Docker were profiled; profiling applications in virtual machines could also be valuable.


References

[1] Yasin, A. "A Top-Down Method for Performance Analysis and Counters Architecture." In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 35–44, 2014. doi:10.1109/ISPASS.2014.6844459.

[2] "Docker." Accessed April 14, 2016. https://www.docker.com/.

[3] "perf." Accessed April 22, 2016. https://perf.wiki.kernel.org.

[4] "pmu-tools." Accessed August 4, 2016. https://github.com/andikleen/pmu-tools.

[5] "Cloudsuite." Accessed August 5, 2016. http://cloudsuite.ch.

[6] "Elgg." Accessed August 9, 2016. https://elgg.org.

[7] "nginx." Accessed August 9, 2016. https://nginx.org/en/.

[8] "memcached." Accessed August 9, 2016. https://memcached.org.

[9] Michael Ferdman, Almutaz Adileh, Onur Kocberber, Stavros Volos, Mohammad Alisafaee, Djordje Jevdjic, Cansu Kaynak, Adrian Daniel Popescu, Anastasia Ailamaki, and Babak Falsafi. "Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware." In the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2012.

[10] Wang, Lei, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, et al. "BigDataBench: A Big Data Benchmark Suite from Internet Services." In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), 488–99, 2014. doi:10.1109/HPCA.2014.6835958.

[11] Kanev, S., J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G. Y. Wei, and D. Brooks. "Profiling a Warehouse-Scale Computer." In 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 158–69, 2015. doi:10.1145/2749469.2750392.

[12] Palit, T., Yongming Shen, and M. Ferdman. "Demystifying Cloud Benchmarking." In 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 122–32, 2016. doi:10.1109/ISPASS.2016.7482080.

[13] "Faban." Accessed September 2, 2016. http://faban.org.

[14] "Apache Solr." Accessed September 7, 2016. http://lucene.apache.org/solr/.


[15] "perf wiki." Accessed August 17, 2016. https://perf.wiki.kernel.org/index.php/Tutorial.

[16] Demme, John, and Simha Sethumadhavan. "Rapid Identification of Architectural Bottlenecks via Precise Event Counting," 353. ACM Press, 2011. doi:10.1145/2000064.2000107.

[17] Dennard, R. H., F. H. Gaensslen, Hwa-Nien Yu, V. L. Rideout, E. Bassous, and A. R. Leblanc. "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions." Proceedings of the IEEE 87, no. 4 (April 1999): 668–78. doi:10.1109/JPROC.1999.752522.

[18] Moore, G. E. "Cramming More Components Onto Integrated Circuits." Proceedings of the IEEE 86, no. 1 (January 1998): 82–85. doi:10.1109/JPROC.1998.658762.
