Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University
A Preliminary Study on Impact of Software Licenses on
Copy-and-Paste Reuse
Yu Kashima†, Yasuhiro Hayase††, Norihiro Yoshida†††, Yuki Manabe†, Katsuro Inoue†
†: Osaka University ††: Toyo University †††: Nara Institute of Science and Technology
Software Reuse
• Purpose of software reuse
  – Developing reliable software
  – Increasing software productivity
• We focus on copy-and-paste (CnP)
  – A basic method of software reuse
Open Source Software and Licenses
• Open source software (OSS)
  – Derivative works of OSS products may be distributed
  – The amount of reusable source code is growing as the number of OSS products grows
• OSS licenses
  – Many kinds of licenses have been designed to satisfy various developers' intents
  – Each OSS license has different conditions
  – Reuse is also restricted by the licenses
Representative OSS Licenses
• 3-clause BSD License (BSD3)
  – A derivative work must retain the copyright notices, the list of conditions, and the disclaimer of warranties
• Apache License Version 2 (Apachev2)
  – A derivative work must retain copyright, patent, trademark, and attribution notices
• GNU General Public License Version 2 (GPLv2)
  – A derivative work must be distributed under GPLv2
• "LicenseName code" ≡ source code distributed under LicenseName
  Ex. BSD3 code ≡ source code distributed under BSD3
CnP between Files under Different Licenses
• If a developer reuses source code:
  – Both the license of the reused code and the license of the code under development must be satisfied simultaneously
  – Distribution of the code under development is prohibited in some cases
[Figure: CnP between BSD3 code and GPLv2 code, and CnP between Apachev2 code and GPLv2 code]
Impact of License on CnP
• Hypothesis
  – The characteristics of source code reuse depend on the license of the code
    • Frequency of CnP
    • Kinds of licenses used by source code developed through CnP
• To our knowledge, there are no quantitative studies of CnP reuse from the perspective of software licenses
• We investigate actual OSS to test this hypothesis
Experiment
• A quantitative experiment was performed on a small set
• Purpose
  – Confirming our hypothesis
  – Investigating the scalability of our method
• Overview
  – Investigating the number of CnP instances for each license
  – Code clone detection is used to detect CnP
    • A code clone is a code fragment that is similar to another code fragment
    • Code clones are typically generated by CnP
Method of Experiment
[Figure: overview of the method. Source files of Application X and Application Y are labeled with their licenses (License A, License B, or Unknown), code clones are detected between them, and the code fragments are counted grouped by their license]
Step1. License detection
Step2. Code clone detection
Step3. Counting code clones
License #Code Fragments
License A 10
License B 3
… …
Step1. License Detection
• Ninka [1] is used to detect the licenses of source files
  – It analyzes the license statement in each source file
  – It identifies licenses with high precision
• Files whose licenses Ninka fails to detect are excluded
  – Files that contain no license statement or an unknown license statement
[1] D. M. German, Y. Manabe and K. Inoue: “A sentence-matching method for automatic license identification of source code files”, ASE 2010, pp. 437–446 (2010)
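The exclusion rule above can be sketched as follows. This is a minimal illustration, assuming the detector's results have already been loaded into a mapping from file path to license label. The labels "NONE" and "UNKNOWN", the file paths, and the function name are assumptions made for the example, not Ninka's documented interface.

```python
# Hypothetical detector output: file path -> detected license label.
# "NONE" (no license statement) and "UNKNOWN" (unrecognized statement)
# are assumed placeholder labels for files that cannot be classified.
detected = {
    "app_x/src/Parser.java": "BSD3",
    "app_x/src/Util.java": "NONE",
    "app_y/src/Client.java": "Apachev2",
    "app_y/src/Legacy.java": "UNKNOWN",
}

EXCLUDED_LABELS = {"NONE", "UNKNOWN"}


def keep_licensed_files(detected):
    """Return only the files whose license was identified."""
    return {
        path: license_name
        for path, license_name in detected.items()
        if license_name not in EXCLUDED_LABELS
    }


licensed = keep_licensed_files(detected)
```

Only the files in `licensed` would be passed on to the later steps.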
Step2. Code Clone Detection
• CCFinder [2] is used to extract code clones across different applications
  – We assume that CnP within an application does not cause license problems
• Filtering
  – Code clones generated by means other than CnP are excluded
    Ex. getters/setters, variable declarations
• The directions of CnP are not determined
[Figure: Applications X, Y, and Z, under Licenses A, B, and C respectively, with CnP relations between them; clones labeled "Getter/Setter" and "Variable Declarations" illustrate the filtering]
[2] T. Kamiya, S. Kusumoto and K. Inoue: “CCFinder: A multilinguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28, pp. 654–670 (2002)
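A minimal sketch of the two filters described above, assuming clone pairs are already available as pairs of fragments. The `Fragment` type, the `kind` tags, and the sample data are hypothetical; the slides do not say how getter/setter or variable-declaration clones were actually recognized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    application: str
    path: str
    kind: str  # hypothetical tag, e.g. "logic", "getter_setter", "variable_declaration"


# Clone kinds assumed to arise without CnP.
NON_CNP_KINDS = {"getter_setter", "variable_declaration"}


def is_cnp_candidate(a, b):
    """Keep a clone pair only if it crosses applications and is not trivial."""
    crosses_applications = a.application != b.application
    trivial = a.kind in NON_CNP_KINDS or b.kind in NON_CNP_KINDS
    return crosses_applications and not trivial


pairs = [
    (Fragment("X", "A.java", "logic"), Fragment("Y", "B.java", "logic")),
    (Fragment("X", "A.java", "logic"), Fragment("X", "C.java", "logic")),
    (Fragment("X", "D.java", "getter_setter"), Fragment("Z", "E.java", "getter_setter")),
]

candidates = [(a, b) for a, b in pairs if is_cnp_candidate(a, b)]
```

In this sample only the first pair survives: the second stays within one application and the third is a getter/setter clone.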
Step3. Counting Code Clones(1/2)
• Repeat the following steps for each target license
1. Select a license as the analysis target
2. Extract the clone sets that include code under that license
   • A clone set is a set of code clones that are similar to each other
3. Count the code fragments in the extracted clone sets, grouped by their license
[Figure: fragments in Applications X, Y, and Z (Licenses A, B, and C) that have CnP relations to License A code]
License #Code Fragments
License A 2
License B 1
License C 2
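The three steps can be sketched as below. The clone sets are a hypothetical arrangement chosen so that the result matches the example table; the data layout and the function name are assumptions for illustration.

```python
from collections import Counter

# Hypothetical clone sets: each is a list of (application, license) fragments.
clone_sets = [
    [("Application X", "License A"), ("Application Y", "License B")],
    [("Application X", "License A"),
     ("Application Z", "License C"),
     ("Application Z", "License C")],
    # This set contains no License A fragment, so it is skipped for target A.
    [("Application Y", "License B"), ("Application Z", "License C")],
]


def count_fragments_by_license(clone_sets, target):
    """Count fragments per license over clone sets that include the target license."""
    counts = Counter()
    for clone_set in clone_sets:
        licenses = [license_name for _application, license_name in clone_set]
        if target in licenses:       # step 2: keep clone sets with target code
            counts.update(licenses)  # step 3: count fragments by license
    return counts


counts = count_fragments_by_license(clone_sets, "License A")
```

With this arrangement the counts are License A 2, License B 1, and License C 2, as in the table.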
Step3. Counting Code Clones(2/2)
• A clone set includes both the original code fragments and the code fragments generated by CnP
  → Counting the code fragments in clone sets approximates counting the number of CnP instances
• The count covers CnP both to and from code fragments under the target license
• Although the table therefore includes CnP in the opposite direction, it is sufficient for a rough overview
Analyzed Code
• Java files (.java) in the main section of Debian GNU/Linux 5.0.2
• Reasons for selecting this target
  – It consists of code under various licenses
  – It can be analyzed by both Ninka and CCFinder
  – It is of a feasible scale for this experiment
#Packages 452
#Files 77,452
LOC 8,530,896
License Distribution in Analyzed Code
[Figure: bar chart of the number of files (#Files, axis from 0 to 20,000) per license: Apachev2, GPLv2+, LesserGPLv2.1+, “GPLnoVersion,GPLv2+,LinkException”, GPLv2, BSD3, “GPLv2,ClassPathException”, other, No Notification, Unknown license]
Result ( BSD3 )
License #Fragments Percentage
BSD3 613 92%
GPLv2+ 20 3.0%
Apachev2 16 2.4%
LesserGPL2+ 14 2.1%
GPLv2,ClassPathException 1 0.15%
LesserGPL2.1+ 1 0.15%
• The table counts the code fragments in clone sets that include BSD3 fragments, grouped by their license
• It shows how often each license is used by code fragments that have a CnP relationship to BSD3 fragments
• BSD3 code is mostly reused by BSD3 code
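As a quick check, the Percentage column can be recomputed from the #Fragments column. The sketch below uses only the counts in the table above; the variable names are illustrative.

```python
# Fragment counts copied from the BSD3 result table.
bsd3_counts = {
    "BSD3": 613,
    "GPLv2+": 20,
    "Apachev2": 16,
    "LesserGPL2+": 14,
    "GPLv2,ClassPathException": 1,
    "LesserGPL2.1+": 1,
}

total = sum(bsd3_counts.values())

# Share of each license among fragments with a CnP relation to BSD3 code.
percentages = {
    license_name: 100 * count / total
    for license_name, count in bsd3_counts.items()
}

for license_name, share in percentages.items():
    print(f"{license_name}: {share:.2f}%")
```

The total is 665 fragments, and rounding each share to two significant figures gives the values in the table.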
Result ( Apachev2 )
License #Fragments Percentage
Apachev2 1533 77%
Apachev1.1 316 16%
LesserGPL2.1+ 42 2.1%
MPLv1.1 33 1.6%
BSD3 29 1.5%
MX4JLicensev1 16 0.80%
GPLv2+ 4 0.20%
LibraryGPL2+ 3 0.15%
MPLv1.0 2 0.10%
MITX11noNotice 2 0.10%
Public Domain 1 0.050%
Subversion+ 1 0.050%
EPLv1 1 0.050%
• A large percentage of CnP occurs between Apachev2 code fragments
• Apachev1.1 code has changed its license to Apachev2
Result ( GPLv2+ )
License #Fragments Percentage
GPLv2+ 268 49%
GPLnoVersion,GPLv2+,LinkException 225 41%
BSD3 28 5.1%
LibraryGPLv2+ 20 3.6%
Apachev2 4 0.73%
LesserGPLv2.1+ 4 0.73%
• CnP within GPLv2+ code accounts for the highest percentage
• “GPLnoVersion, GPLv2+, LinkException” also has a high percentage
• “GPLnoVersion, GPLv2+, LinkException” code is reused by GPLv2+ code
[Figure: CnP between “GPLnoVersion, GPLv2+, LinkException” code and GPLv2+ code]
#Files and #Fragments under Each License
#Fragments #Files #Fragments / #Files
BSD3 665 2181 0.305
Apachev2 1983 16350 0.121
GPLv2+ 549 8160 0.0673
• The frequency of CnP per file: BSD3 > Apachev2 > GPLv2+
• The larger “#Fragments / #Files” is for a license, the more frequently code under that license is copy-and-pasted
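The ratio and the resulting ordering can be reproduced from the table with a few lines. This is a sketch using only the numbers above; the variable names are illustrative.

```python
# (#Fragments, #Files) per license, copied from the table.
stats = {
    "BSD3": (665, 2181),
    "Apachev2": (1983, 16350),
    "GPLv2+": (549, 8160),
}

# Fragments with CnP relations per file under each license.
ratios = {
    license_name: fragments / files
    for license_name, (fragments, files) in stats.items()
}

# Licenses ordered from most to least CnP per file.
ranking = sorted(ratios, key=ratios.get, reverse=True)
```

Rounded to three significant figures, the ratios agree with the “#Fragments / #Files” column, and `ranking` gives the order BSD3, Apachev2, GPLv2+.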
Summary of the Results
• Common characteristic of all licenses
  – CnP within code distributed under the same license, or under licenses designed by the same organization, accounts for the majority
    • CnP might happen mostly within an organization
• Apachev2 has CnP relations to various licenses
  – Apachev2 has the largest number of files
  – The conditions of Apachev2 are more relaxed than those of GPLv2+
• The frequency of CnP per file
  BSD3 > Apachev2 > GPLv2+
Threats to Validity
• The results cannot yet be generalized to OSS as a whole
  – The analysis target is small
    → We plan a large-scale analysis
  – Only Java files were analyzed
    • Java has a short history, hence Java files are less copy-and-pasted than files in other languages
    → We plan an analysis of C/C++ files
• Overlapping code fragments may be counted separately
  – The number of overlapping code fragments might be small
[Figure: two overlapping code fragments, Fragment A and Fragment B]
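To illustrate the overlap issue, the sketch below merges fragments of the same file whose line ranges overlap, so that each region would be counted once. This is one possible remedy and is not part of the study; the (path, start line, end line) representation and the sample data are assumptions.

```python
def merge_overlapping(fragments):
    """Merge fragments of the same file whose line ranges overlap.

    Each fragment is a (path, start_line, end_line) tuple with inclusive lines.
    """
    merged = []
    for path, start, end in sorted(fragments):
        if merged and merged[-1][0] == path and start <= merged[-1][2]:
            # Overlaps the previous fragment in the same file: extend it.
            prev_path, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_path, prev_start, max(prev_end, end))
        else:
            merged.append((path, start, end))
    return merged


# Hypothetical fragments: the two in Foo.java overlap on lines 25 to 30.
fragments = [("Foo.java", 10, 30), ("Foo.java", 25, 40), ("Bar.java", 5, 12)]
merged = merge_overlapping(fragments)
```

Here three reported fragments collapse into two regions, which shows how separate counting can inflate the numbers.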
Scalability of the Investigation Method
• This method can be applied to a large target because each step can scale
  – License detection
    • Ninka can analyze files in linear time
  – Code clone detection
    • There are more scalable tools than CCFinder, such as CCFinderX and D-CCFinder
  – Counting code clones
    • This process did not take a long time
Conclusion
• A preliminary study of the impact of licenses on CnP was performed
  – Java files in the main section of Debian GNU/Linux 5.0.2 were analyzed
• CnP happens mostly within code distributed under the same license or under licenses designed by the same organization
• The frequency of CnP per file
  – BSD3 > Apachev2 > GPLv2+
• Our method can be applied to a large target
Future Work
• Large-scale experiment
• Investigating whether code fragments are copy-and-pasted mostly within an organization
• Detecting the direction of CnP