ase2010 shang
An Experience Report on Scaling Tools for MSR Studies Using MapReduce
Weiyi Shang, Bram Adams, Ahmed E. Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen’s University
2
Mining Software Repositories: Propagating code changes
Method A is changed
Method A calls Method B
Method C calls Method A
Change methods B and C
Method A is changed
When method A is changed, 90% of the time method D is changed.
Change method D
Not Enough
History helps!
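The "90% of the time" rule on this slide is a co-change confidence mined from history. A minimal sketch of how such a figure could be computed is shown below. The function name `co_change_confidence` and the toy commit history are hypothetical and only illustrate the idea; they are not taken from the study.

```python
from collections import Counter
from itertools import permutations

def co_change_confidence(commits):
    """Return {(a, b): fraction of commits changing a that also change b}."""
    changed = Counter()   # number of commits touching each method
    together = Counter()  # number of commits touching both a and b
    for methods in commits:
        unique = set(methods)
        changed.update(unique)
        together.update(permutations(unique, 2))
    return {(a, b): n / changed[a] for (a, b), n in together.items()}

# Hypothetical toy history: each entry is the set of methods changed in one commit.
history = [{"A", "D"}] * 9 + [{"A", "B"}]

confidence = co_change_confidence(history)
print(confidence[("A", "D")])  # 0.9: A changed 10 times, 9 of them together with D
```

The call graph alone would only suggest methods B and C; the history-based rule also surfaces D.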
3
Traditional pipeline for MSR studies
Software repositories
Data preparation (ETL)
Extraction
Transformation
Loading
Data Warehouse
Data Analysis
Source code history
Bug database
Mailing list
System log
Continues to grow
More complex algorithms
MSR studies must scale
4
Existing solutions to scale
powerful machines
ad hoc distributed computing
multi-threaded and multi-core
EXPENSIVE
LARGE PROGRAMMING EFFORT
NOT RE-USABLE
5
Example: D-CCFinder Clone Detector
40 days on 1 PC
52 hours on an 80-machine cluster
6
Web Analysis is similar to MSR studies
Large-scale data
Scan-centric
Rapidly evolving
7
Web-scale platforms
We believe that the MSR field can benefit from web-scale platforms to overcome the limitations of current approaches.
8
In our previous research
Hadoop is up to 3 times faster on a 4-machine cluster
Feasibility study using Hadoop to scale a software evolution study on Eclipse.
9
In this paper
1. Does MapReduce scale to other MSR studies and larger clusters?
2. What are the challenges and experiences of scaling MSR studies?
10
An example of MapReduce
Counting the frequency of word lengths
Data: good, hello, fish, cat, school, night, happy, dog
Map
Key: 4, 5, 4, 3, 6, 5, 5, 3
Grouped by key
Value   Key
dog     3
cat     3
fish    4
good    4
hello   5
night   5
happy   5
school  6
Reduce
Value   Key
2       3
2       4
3       5
1       6
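The word-length example above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, group and reduce steps, not Hadoop code; the function names `map_phase`, `shuffle` and `reduce_phase` are chosen here for illustration.

```python
from collections import defaultdict

def map_phase(words):
    """Map: emit (key, value) = (word length, word) for every input word."""
    for word in words:
        yield len(word), word

def shuffle(pairs):
    """Group all values that share a key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reduce: count how many words fall under each length."""
    return {key: len(values) for key, values in groups.items()}

words = ["good", "hello", "fish", "cat", "school", "night", "happy", "dog"]
counts = reduce_phase(shuffle(map_phase(words)))
print(counts)  # {3: 2, 4: 2, 5: 3, 6: 1}
```

In a real MapReduce framework the map and reduce functions run on different machines and the framework performs the grouping step itself.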
11
Three large-scale MSR studies
• Software evolution study
– J-REX: code-change information abstractor for Java from line level to program entity level
• Code clone detection
– CCFinder: code clone detection tool
• Log analysis
– JACK: log analysis tool for detecting system anomalies during load testing
12
Experimental environment
CPU type                          #machines  Memory size  Operating system
Intel Quad Core Q6600 (2.40 GHz)  18         3 GB         Ubuntu 8.04
8 Xeon (3.0 GHz)                  10         8 GB         CentOS 5.2
13
Input data
Data            Size     Data type       #Files
Eclipse         10.4 GB  CVS repository  189,156
Datatools       227 MB   CVS repository  10,629
FreeBSD         5.1 GB   source code     317,740
Log files No.1  9.9 GB   execution log   54
Log files No.2  2.1 GB   execution log   54
14
1. Does MapReduce scale to other MSR studies and larger clusters?
15
Software Evolution & Log analysis
[Bar charts: running time in minutes, 1 machine vs. SHARCNET (×10)]
Chart 1: 1 machine 580 min, SHARCNET (×10) 98 min
Chart 2: 1 machine 755 min, SHARCNET (×10) 80 min
Speedups as labeled: J-REX ×9, JACK ×6
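The ×6 and ×9 labels follow from the running times on the charts (580 and 755 minutes on one machine, 98 and 80 minutes on SHARCNET). A small sketch of the arithmetic is shown below. It assumes that "×10" means 10 machines, and the parallel-efficiency figure is an extra illustration that is not on the slide.

```python
def speedup(single_machine_min, cluster_min):
    """Ratio of single-machine running time to cluster running time."""
    return single_machine_min / cluster_min

def efficiency(single_machine_min, cluster_min, machines):
    """Speedup divided by the number of machines (1.0 would be ideal scaling)."""
    return speedup(single_machine_min, cluster_min) / machines

# (minutes on 1 machine, minutes on SHARCNET), as read from the two charts.
runs = [(580, 98), (755, 80)]
MACHINES = 10  # assumption: "SHARCNET(x10)" denotes a 10-machine cluster

for single, cluster in runs:
    s = speedup(single, cluster)
    e = efficiency(single, cluster, MACHINES)
    print(f"{single} min -> {cluster} min: x{s:.1f}, efficiency {e:.0%}")
# 580 min -> 98 min: x5.9, efficiency 59%
# 755 min -> 80 min: x9.4, efficiency 94%
```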
16
Code clone detection
Can MapReduce scale up CCFinder?
Yes! 58 hours on an 18-machine cluster.
17
2. What are the challenges and experiences of scaling MSR studies?
Challenge 1: Locality of MSR analysis
18
Local analysis
Semi-local analysis
Global analysis
Web MSR MSR MSR
19
Challenge 2: Granularity of MSR analysis
Fine-grained analysis
Coarse-grained analysis
• Web community experience:
– #Map: 10 ~ 100 × #machines
– #Reduce: 0.95 or 1.75 × #CPU cores
• MSR experience:
– #Reduce tasks = #CPU cores (fine-grained analysis)
– #Reduce tasks = #input records (coarse-grained analysis)
Web MSR MSR
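The rules of thumb above can be written down as a small calculator. This is a sketch only: the function names are hypothetical, rounding down to whole tasks is an assumption, and the example figures (18 quad-core machines, 54 input records) are used purely for illustration.

```python
import math

def web_map_tasks(machines):
    """Web community rule: between 10 and 100 map tasks per machine."""
    return 10 * machines, 100 * machines

def web_reduce_tasks(cpu_cores):
    """Web community rule: 0.95 or 1.75 reduce tasks per CPU core (rounded down)."""
    return math.floor(0.95 * cpu_cores), math.floor(1.75 * cpu_cores)

def msr_reduce_tasks(cpu_cores, input_records, fine_grained):
    """MSR rule: one reduce task per core (fine-grained analysis),
    or one per input record (coarse-grained analysis)."""
    return cpu_cores if fine_grained else input_records

machines = 18
cores = machines * 4  # assumption: four cores per machine

print(web_map_tasks(machines))                              # (180, 1800)
print(web_reduce_tasks(cores))                              # (68, 126)
print(msr_reduce_tasks(cores, 54, fine_grained=True))       # 72
print(msr_reduce_tasks(cores, 54, fine_grained=False))      # 54
```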
20
Challenges of migrating MSR studies to MapReduce
1. Locality of MSR analysis
2. Granularity of MSR analysis
3. Locating a suitable cluster
4. Managing data during analysis
5. Recovering from errors
21
Questions?