ase2010 shang

An Experience Report on Scaling Tools for MSR Studies Using MapReduce

Weiyi Shang, Bram Adams, Ahmed E. Hassan

Software Analysis and Intelligence Lab (SAIL)

School of Computing, Queen’s University

2

Mining Software Repositories: Propagating code changes

Method A is changed. Method A calls method B, and method C calls method A.

Change methods B and C.

Not enough.

Method A is changed. When method A is changed, 90% of the time method D is changed.

Change method D.

History helps!
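The "90% of the time" figure on this slide is a co-change confidence mined from history. As a minimal sketch, assuming each commit is available as a set of changed method names (the `history` data and the function name below are hypothetical, not taken from the talk), it could be computed like this:

```python
from collections import Counter
from itertools import permutations


def co_change_confidence(commits):
    """Map (a, b) to the fraction of commits changing a that also change b."""
    changed = Counter()   # how often each method was changed
    together = Counter()  # how often each ordered pair was changed together
    for commit in commits:
        methods = set(commit)
        changed.update(methods)
        together.update(permutations(methods, 2))
    return {(a, b): n / changed[a] for (a, b), n in together.items()}


# Hypothetical history: A changes 10 times, 9 of them together with D.
history = [{"A", "D"}] * 9 + [{"A", "B"}]
confidence = co_change_confidence(history)
print(confidence[("A", "D")])  # 0.9
```

With such a table, a change to method A would suggest also checking method D, even though no call relation links the two.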

3

Traditional pipeline for MSR studies

Software repositories: source code history, bug database, mailing list, system log

Data preparation (ETL): extraction, transformation, loading

Data warehouse

Data analysis

Repository data continues to grow

Analyses use more complex algorithms

MSR studies must scale

4

Existing solutions to scale

powerful machines

ad hoc distributed computing

multi-threaded and multi-core

EXPENSIVE

LARGE PROGRAMMING EFFORT

NOT RE-USABLE

5

Example: D-CCFinder Clone Detector

40 days on 1 PC

52 hours on an 80-machine cluster

6

Web Analysis is similar to MSR studies

Large-scale data

Scan-centric

Rapidly evolving

7

Web-scale platforms

We believe that the MSR field can benefit from web-scale platforms to overcome the limitations of current approaches.

8

In our previous research

Hadoop is up to 3 times faster on a 4-machine cluster

Feasibility study using Hadoop to scale a software evolution study on Eclipse.

9

In this paper

1. Does MapReduce scale to other MSR studies and larger clusters?

2. What are the challenges and experiences of scaling MSR studies?

10

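An example of MapReduce: counting the frequency of word lengths

Data: good, hello, fish, cat, school, night, happy, dog

Map output (key = word length, value = word): (4, good), (5, hello), (4, fish), (3, cat), (6, school), (5, night), (5, happy), (3, dog)

Grouped by key: 3: dog, cat; 4: fish, good; 5: hello, night, happy; 6: school

Reduce output (key = word length, value = count): (3, 2), (4, 2), (5, 3), (6, 1)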
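The word-length example can be traced in a few lines of Python. This is a single-process sketch of the map, group, and reduce steps, not Hadoop code; the function names are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(words):
    """Emit one (word length, word) pair per input word."""
    for word in words:
        yield len(word), word


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Count how many words share each length."""
    return {key: len(values) for key, values in sorted(groups.items())}


words = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
result = reduce_phase(shuffle(map_phase(words)))
print(result)  # {3: 2, 4: 2, 5: 3, 6: 1}
```

On a real cluster the map calls run in parallel over input splits and each reduce call handles one key group, which is what lets the same two functions scale.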

11

Three large-scale MSR studies

• Software evolution study
– J-REX: code-change information abstractor for Java from line level to program entity level
• Code clone detection
– CCFinder: code clone detection tool
• Log analysis
– JACK: log analysis tool for detecting system anomalies during load testing

12

Experimental environment

CPU type | #machines | Memory size | Operating system
Intel Quad Core Q6600 (2.40 GHz) | 18 | 3 GB | Ubuntu 8.04
8 Xeon (3.0 GHz) | 10 | 8 GB | CentOS 5.2

13

Input data

Data | Size | Data type | #Files
Eclipse | 10.4 GB | CVS repository | 189,156
Datatools | 227 MB | CVS repository | 10,629
FreeBSD | 5.1 GB | source code | 317,740
Log files No.1 | 9.9 GB | execution log | 54
Log files No.2 | 2.1 GB | execution log | 54

14

1. Does MapReduce scale to other MSR studies and larger clusters?

15

Software Evolution & Log analysis: J-REX and JACK

Running time (min), 1 machine vs. SHARCNET (×10 machines):

580 vs. 98 (×6)

755 vs. 80 (×9)
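The ×6 and ×9 factors follow directly from the running times on this slide. The sketch below recomputes them; the parallel-efficiency figure (speedup divided by machine count) is a standard metric added here for illustration and does not appear in the talk.

```python
def speedup(t_single, t_cluster):
    """Ratio of single-machine running time to cluster running time."""
    return t_single / t_cluster


def efficiency(t_single, t_cluster, machines):
    """Speedup per machine; 1.0 would mean perfect linear scaling."""
    return speedup(t_single, t_cluster) / machines


# (1 machine, 10-machine SHARCNET cluster) running times in minutes.
runs = [(580, 98), (755, 80)]
for t1, tn in runs:
    print(f"{t1} -> {tn} min: speedup {speedup(t1, tn):.1f}x, "
          f"efficiency {efficiency(t1, tn, 10):.0%}")
# 580 -> 98 min: speedup 5.9x, efficiency 59%
# 755 -> 80 min: speedup 9.4x, efficiency 94%
```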

16

Code clone detection

Can MapReduce scale up CCFinder?

Yes! 58 hours on an 18-machine cluster.

17

2. What are the challenges and experiences of scaling MSR studies?

Challenge 1: Locality of MSR analysis

18

Local analysis: Web, MSR

Semi-local analysis: MSR

Global analysis: MSR

19

Challenge 2: Granularity of MSR analysis

Fine-grained analysis: Web, MSR

Coarse-grained analysis: MSR

• Web community experience:
– #Map: 10 ~ 100 × #machines
– #Reduce: 0.95 or 1.75 × #CPU cores
• MSR experience:
– #Reduce tasks = #CPU cores (fine-grained analysis)
– #Reduce tasks = #input records (coarse-grained analysis)
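The two sets of rules of thumb for the number of reduce tasks can be written down as small helpers. This is a sketch under stated assumptions: the function names are hypothetical, rounding down to a whole task is a choice made here, and the 500-record input is invented for the example.

```python
import math


def web_reduce_tasks(cpu_cores, factor=0.95):
    """Web community rule: 0.95 or 1.75 times the number of CPU cores."""
    if factor not in (0.95, 1.75):
        raise ValueError("factor should be 0.95 or 1.75")
    # Rounding down is an assumption; the slide gives only the multiplier.
    return max(1, math.floor(cpu_cores * factor))


def msr_reduce_tasks(cpu_cores, input_records, coarse_grained):
    """MSR rule: one task per core (fine-grained) or per record (coarse-grained)."""
    return input_records if coarse_grained else cpu_cores


cores = 18 * 4  # the 18 quad-core machines from the experimental environment
print(web_reduce_tasks(cores))              # 68
print(web_reduce_tasks(cores, 1.75))        # 126
print(msr_reduce_tasks(cores, 500, False))  # 72
print(msr_reduce_tasks(cores, 500, True))   # 500
```

The coarse-grained rule reflects that a single record (for example, one large subsystem) can take long enough to deserve its own task, so the load balances at record level.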

20

Challenges of migrating MSR studies to MapReduce

1. Locality of MSR analysis
2. Granularity of MSR analysis
3. Locating a suitable cluster
4. Managing data during analysis
5. Recovering from errors

21

Questions?
