On the Extent and Nature of Software Reuse in Open Source Java Projects
Lars Heinemann, Florian Deissenboeck, Mario Gleirscher, Benjamin Hummel, Maximilian Irlbeck
Technische Universität München
ICSR 2011, Pohang, Korea
Software Reuse
• Reuse of existing artifacts for constructing new software
• Proven benefits
  • Increased productivity
  • Reduced time to market
  • Improved quality
Software Reuse
• Tremendous reuse opportunities
  • Class libraries (e.g. Apache Commons)
  • Frameworks (e.g. Eclipse: 40 MLOC)
  • Open source code (Google Code Search: several GLOC)
• Internet serves as reuse repository
Research Problem
• Unclear how software projects make use of available reuse opportunities
• Lack of data on amount of reuse in software projects
• Assessing success of software reuse difficult
Contribution
• Empirical knowledge about extent and nature of software reuse in OSS
• Quantitative data on software reuse in 20 open source projects
• Substantiates discussion of success/failure of software reuse
• Provides practitioners with a benchmark
Terms
• Software reuse: Using code developed by third parties (excluding OS/platform)
• White-box reuse: Code incorporated in source form (internals exposed, potentially modified)
• Black-box reuse: Code incorporated in binary form (internals hidden, no modifications)
Study Design (GQM)
We analyze open source projects for the purpose of understanding the state of the practice in software reuse with respect to its extent and nature from the viewpoint of the developers and maintainers in the context of Java open source software.
Question → Metric
• RQ 1: Do open source projects reuse software? → Existence of software reuse
• RQ 2: How much white-box reuse occurs? → White-box reuse rate
• RQ 3: How much black-box reuse occurs? → Black-box reuse rate
Reuse Rate
• White-box reuse rate = Reused source code [LOC] / Overall source code [LOC]
• Black-box reuse rate = Reused binary code [bytes] / Overall binary code [bytes]
• Overall code of software system = Reused code + Project's own code
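The two reuse rates share one shape: reused code divided by overall code, where overall code is reused code plus the project's own code. A minimal sketch of that arithmetic follows; the function name `reuse_rate` and the sample figures are illustrative assumptions, not values from the study.

```python
def reuse_rate(reused: float, own: float) -> float:
    """Share of reused code in the overall code (reused + own).

    Units must match: LOC for white-box reuse, bytes for black-box reuse.
    """
    overall = reused + own
    if overall == 0:
        return 0.0  # an empty system has no reuse to report
    return reused / overall


# Hypothetical figures, chosen only to illustrate the formula.
white_box = reuse_rate(reused=5_000, own=95_000)          # source code in LOC
black_box = reuse_rate(reused=3_000_000, own=1_000_000)   # byte code in bytes

print(f"white-box reuse rate: {white_box:.1%}")
print(f"black-box reuse rate: {black_box:.1%}")
```

Because the white-box rate is measured in LOC and the black-box rate in bytes, the two numbers describe different artifacts and are not meant to be added together.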
Study Objects
• 20 Java projects from
• Criteria: Production/Stable, standalone app, pure Java, Java SE platform, source download available
• All among 50 most downloaded
• Source code size: 0.4 to 790 kLOC, bytecode size: 17 to 22,761 KB
• Test code excluded with heuristics (e.g. folders named test/tests)
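The test-code heuristic mentioned above can be sketched as a path filter. This is only an illustration of the folder-name rule from the slide; the helper names and the sample paths are hypothetical, and the study may have used further heuristics that the slide does not list.

```python
from pathlib import PurePosixPath

# Folder names taken from the slide's example; other heuristics are not covered here.
TEST_DIR_NAMES = {"test", "tests"}


def is_test_code(path: str) -> bool:
    """True if any directory component of the path is named test or tests."""
    directories = PurePosixPath(path).parts[:-1]  # drop the file name itself
    return any(part.lower() in TEST_DIR_NAMES for part in directories)


def production_files(paths):
    """Keep only the files that the heuristic does not classify as test code."""
    return [p for p in paths if not is_test_code(p)]


# Hypothetical project layout.
paths = [
    "src/org/example/Main.java",
    "src/test/org/example/MainTest.java",
    "tests/Helper.java",
    "src/org/example/contest/Entry.java",  # "contest" is not a test folder
]
print(production_files(paths))
```

Matching whole directory names rather than substrings avoids discarding packages such as `contest` that merely contain the word "test".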
Study Implementation
a) Detecting white-box reuse
• White-box reuse = copied code
• Can be detected automatically by clone detectors
• Clone detection against 22 commonly used Java libraries (~ 6 MLOC)
• Detection of reuse of statement sequences with > 15 statements
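To make the idea of detecting copied statement sequences concrete, here is a deliberately simplified sketch. It is not the clone detector used in the study: it treats each non-empty, whitespace-stripped line as one "statement" and looks for windows of consecutive statements that occur in both the project and a library. The function names and the default of 16 statements (the smallest length satisfying "> 15") are assumptions made for illustration.

```python
import re


def normalize(source: str) -> list[str]:
    """Crude statement list: non-empty lines with all whitespace removed."""
    statements = []
    for line in source.splitlines():
        stripped = re.sub(r"\s+", "", line)
        if stripped and not stripped.startswith("//"):  # skip blank lines and line comments
            statements.append(stripped)
    return statements


def shared_windows(project: str, library: str, min_statements: int = 16) -> int:
    """Count project positions where min_statements consecutive statements
    also appear, in the same order, somewhere in the library."""
    lib = normalize(library)
    proj = normalize(project)
    lib_windows = {
        tuple(lib[i:i + min_statements])
        for i in range(len(lib) - min_statements + 1)
    }
    return sum(
        1
        for i in range(len(proj) - min_statements + 1)
        if tuple(proj[i:i + min_statements]) in lib_windows
    )


# Hypothetical input: a project file that embeds 20 library statements, re-indented.
library = "\n".join(f"int v{i} = {i};" for i in range(20))
project = (
    "class A {\nvoid m() {\n"
    + "\n".join(f"    int v{i} = {i};" for i in range(20))
    + "\n}\n}"
)
print(shared_windows(project, library))
```

Real clone detectors work on token or statement streams with identifier normalization and report maximal clones rather than overlapping windows; the sketch only shows why a minimum length filters out short, coincidental matches.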
Study Implementation
a) Detecting white-box reuse
• In addition: manual inspection of source directory tree
• Clues: file/package names
• Source of files identified via header comments/web search
• Detection of reuse of whole files/directories, not limited to fixed set of libraries
Study Implementation
b) Detecting black-box reuse
• Byte-code based static analysis
• Aggregates byte code size of all library types referenced by project's source code
• Traverses type dependency graph using Java Constant Pool (type usages and method calls)
• Includes transitive dependencies
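The aggregation step can be sketched as a graph traversal. The sketch below assumes the type dependency graph has already been extracted (in the study this comes from the constant pool of the class files) and is given as a plain dictionary; the function name, the type names, and the sizes are all hypothetical.

```python
from collections import deque


def reused_bytecode_size(project_types, references, sizes, library_types):
    """Sum the byte code size of every library type reachable from the
    project's own types, following transitive dependencies."""
    seen = set(project_types)
    queue = deque(project_types)
    total = 0
    while queue:
        current = queue.popleft()
        for target in references.get(current, ()):
            if target in seen:
                continue  # count each type only once
            seen.add(target)
            queue.append(target)  # follow transitive dependencies
            if target in library_types:
                total += sizes[target]
    return total


# Hypothetical dependency graph: type name -> referenced type names.
references = {
    "app.Main": ["lib.Parser", "app.Util"],
    "app.Util": ["lib.Strings"],
    "lib.Parser": ["lib.Lexer"],
    "lib.Lexer": [],
    "lib.Unused": ["lib.Lexer"],  # shipped in the library but never referenced
}
sizes = {"lib.Parser": 4000, "lib.Lexer": 2500, "lib.Strings": 1200, "lib.Unused": 9000}

print(reused_bytecode_size(["app.Main", "app.Util"], references, sizes, set(sizes)))
```

Only types that are actually reachable contribute to the total, so `lib.Unused` is ignored even though it ships in the same library.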
Study Implementation
b) Detecting black-box reuse
• Although not covered by the reuse definition, potential variations in the use of the Java API are interesting
• Black-box reuse baseline of empty Java program: 5 MB (2,082 types)
• Object → Class → ClassLoader ... (Reflection API / Collections API)
Results RQ 1
Do open source projects reuse software?
• 18 of the 20 projects (90%) reuse software from third parties
• Exceptions: HSQLDB (relational database engine), YouTube Downloader (video download utility)
Results RQ 2
How much white-box reuse occurs?
• Clone detection found 791 clones, 11,701 copied LOC in 7 study objects
• Clones found: complete files with minor modifications (e.g. different version)
• Manual inspection additionally found whole copied libraries in 4 study objects
• Overall: white-box reuse found for 9 of 20 projects
• Reuse rates: 0% - 10%
Results RQ 3
How much black-box reuse occurs?
Figure: Absolute bytecode size distribution (MB) per project, split into Java API, Java API baseline, 3rd party, and own code. Projects as plotted: iReport-Designer, soapUI, RODIN, SQuirreL SQL Client, Azureus/Vuze, OpenProj, TV-Browser, DrJava, Sweet Home 3D, JabRef, Mobile Atlas Creator, jEdit, Buddi, DavMail, FreeMind, HSQLDB, PDF Split and Merge, Mediathek View, subsonic, YouTube Downloader.
• 3rd party: 0 - 42 MB
• Java API: 13 - 17 MB
Results RQ 3
How much black-box reuse occurs?
Figure: Relative bytecode size distribution (%) per project, split into Java API, 3rd party, and own code. Projects as plotted: PDF Split and Merge, YouTube Downloader, DavMail, Mediathek View, Buddi, Mobile Atlas Creator, subsonic, HSQLDB, FreeMind, OpenProj, Sweet Home 3D, iReport-Designer, JabRef, soapUI, RODIN, jEdit, TV-Browser, DrJava, SQuirreL SQL Client, Azureus/Vuze.
• 3rd party: 0 - 62%
• Java API: 23 - 99%
• Combined: 41 - 99%
Results RQ 3
How much black-box reuse occurs?
Figure: Relative bytecode size distribution (%) without Java API per project, split into 3rd party and own code. Projects as plotted: PDF Split and Merge, iReport-Designer, DavMail, Buddi, soapUI, OpenProj, RODIN, Mobile Atlas Creator, SQuirreL SQL Client, DrJava, Sweet Home 3D, TV-Browser, JabRef, FreeMind, Mediathek View, jEdit, subsonic, Azureus/Vuze, HSQLDB, YouTube Downloader.
Discussion
a) Extent of reuse
• Software reuse common among Java OSS
• On average: high black-box reuse rates
• Expected to have significant impact on development effort
• Black-box reuse rates vary considerably
Discussion
b) Influence of project size on reuse rate
• Lee & Litecky found a negative influence of project size on reuse rate (survey of 500 Ada professionals)
• Without Java API: Spearman correlation of 0.05 (two-tailed p-value 0.83)
• With Java API: Spearman correlation of -0.93 (p-value < 0.0001) → significant and strong negative correlation
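For readers unfamiliar with the statistic, Spearman's correlation is the Pearson correlation of the ranks of the two variables. The sketch below computes it with the standard library only; the function names and the sample data are illustrative assumptions and do not reproduce the study's measurements, and no p-value is computed.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: project size (KB of own byte code) and reuse rate.
sizes_kb = [17, 120, 900, 4000, 22000]
reuse_rates = [0.99, 0.95, 0.80, 0.60, 0.41]
print(spearman(sizes_kb, reuse_rates))
```

A strong negative value with the Java API included is plausible on purely arithmetic grounds: the Java API contribution is roughly constant in size (13 - 17 MB), so its relative share shrinks as a project's own code grows.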
Discussion
c) Types of reused functionality
• Categorization of reused libraries (e.g. networking, text/XML, rich client platforms)
• No predominant category found
• Nearly all projects reuse software from more than one category
• No significant insights, except that reuse is diverse w.r.t. types of functionality
Threats to internal validity
a) Overestimation of reuse
• False positives from clone detection
  • Mitigated by manual inspection of results
• Unclear if code was copied into study objects or from them
  • Mitigated by manual inspection
• Black-box analysis considers a whole class as the element of reuse
Threats to internal validity
b) Underestimation of reuse
• Fixed set of libraries in clone detection
• False negatives in clone detection
• Manual inspection for copied code inherently incomplete
• Black-box analysis misses calls via reflection and stops at boundaries formed by Java interfaces
• Other forms of component interaction
Threats to external validity
• Unclear how representative study objects are for all Java OSS
• Transferability to other programming languages (PL) or commercial development unclear
• Impact of PL is expected to be high
• Availability of reusable code depends on PL (e.g. Java vs. COBOL)
Conclusions
• Early visions of development by plugging reusable components not realistic
• But: Reuse in form of libraries common in Java OSS
• High black-box reuse rates (9 of 20 projects > 50%)
• Availability of reusable functionality well-established for Java platform
Future Work
• Other programming ecosystems
• Legacy programming languages, e.g. COBOL
• Scripting languages, e.g. Python
• Commercial software development environments
Thank you. Questions?