Post on 01-Apr-2015

Software Engineering Laboratory, Department of Computer Science, Graduate School of Information Science and Technology, Osaka University

A Preliminary Study on Impact of Software Licenses on Copy-and-Paste Reuse

Yu Kashima†, Yasuhiro Hayase††, Norihiro Yoshida†††, Yuki Manabe†, Katsuro Inoue†

†: Osaka University  ††: Toyo University  †††: Nara Institute of Science and Technology

Software Reuse

• Purpose of software reuse
  – Development of reliable software
  – Increasing software productivity
• We focus on Copy-and-Paste (CnP)
  – A basic method of software reuse

Open Source Software and Licenses

• Open Source Software (OSS)
  – Derivative works of OSS products may be distributed
  – The amount of reusable source code is increasing as the number of OSS products increases
• OSS licenses
  – Many kinds of licenses have been designed to satisfy the various intents of developers
  – Each OSS license has different conditions
  – Reuse is also restricted by the licenses

Representative OSS Licenses

• 3-clause BSD License (BSD3)
  – A derivative work must retain the copyright notice, the list of conditions, and the disclaimer of warranties
• Apache License Version 2 (Apachev2)
  – A derivative work must retain copyright, patent, trademark, and attribution notices
• GNU General Public License Version 2 (GPLv2)
  – A derivative work must be distributed under GPLv2
• Notation: LicenseName code ≡ source code distributed under LicenseName
  Ex. BSD3 code ≡ source code distributed under BSD3

CnP between Files under Different Licenses

• If a developer reuses source code:
  – Both the license of the reused code and the license of the code under development must be satisfied simultaneously
  – Distribution of the code under development is prohibited for some license combinations

[Figure: CnP between BSD3 code and GPLv2 code, and CnP between Apachev2 code and GPLv2 code]
Impact of License on CnP

• Hypothesis
  – The characteristics of source code reuse depend on the license of the code
    • Frequency of CnP
    • Kinds of licenses used by source code developed by CnP
• To our knowledge, there are no quantitative studies of CnP reuse from the perspective of software licenses
• We investigate actual OSS to test this hypothesis

Experiment

• A quantitative experiment was performed on a small data set
• Purpose
  – Confirming our hypothesis
  – Investigating the scalability of our method
• Overview
  – Investigating the number of CnP instances for each license
  – Code clone detection is used for CnP detection
    • A code clone is a code fragment that is similar to another code fragment
    • Code clones are typically generated by CnP

Method of Experiment

[Figure: overview of the method. Source files of Application X and Application Y go through Step 1 (license detection), Step 2 (code clone detection), and Step 3 (counting code clones). Each source file is labeled with its detected license (License A, License B, or Unknown), and the code fragments are grouped by their license.]

Example result of Step 3:

License     #Code Fragments
License A   10
License B   3
…           …

Step1. License Detection

• Ninka [1] is used to detect the licenses of source files
  – It analyzes the license description in each source file
  – It detects licenses with high precision
• Files whose licenses Ninka fails to detect are excluded
  – Files that contain no license description or an unknown license description

[1] D. M. German, Y. Manabe and K. Inoue: “A sentence-matching method for automatic license identification of source code files”, ASE 2010, pp. 437–446 (2010)
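The exclusion rule in Step 1 can be sketched as follows. This is a minimal illustration, not Ninka's interface: it assumes the detected licenses are already available as a mapping from file path to license label, and the labels `NONE` and `UNKNOWN` are hypothetical placeholders for files with no license description or an unrecognized one.

```python
# Hypothetical labels for files whose license could not be determined.
UNDETECTED = {"NONE", "UNKNOWN"}


def keep_licensed_files(detected):
    """Return only the files whose license was identified.

    `detected` maps a file path to a license label (assumed format).
    """
    return {path: lic for path, lic in detected.items() if lic not in UNDETECTED}


detected = {
    "x/A.java": "BSD3",
    "x/B.java": "NONE",
    "y/C.java": "GPLv2+",
    "y/D.java": "UNKNOWN",
}
licensed = keep_licensed_files(detected)
print(sorted(licensed))
```

Only the files that survive this filter would be passed on to the clone detection step.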

Step2. Code Clone Detection

• CCFinder [2] is used to extract code clones across different applications
  – We assume that CnP within an application does not cause license problems
• Filtering
  – Code clones generated by something other than CnP are excluded
    Ex. getters/setters, variable declarations
• The direction of each CnP is not determined

[Figure: code clones (CnP) among Applications X, Y, and Z, whose code is under Licenses A, B, and C; getter/setter and variable-declaration clones are filtered out]

[2] T. Kamiya, S. Kusumoto and K. Inoue: “CCFinder: A multilinguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28, pp. 654–670 (2002)
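The cross-application restriction could be expressed as below. This is only a sketch: it does not use CCFinder's actual output format, and the `Fragment` record and the representation of a clone set as a list of fragments are assumptions made for illustration. The getter/setter and variable-declaration filtering is not shown.

```python
from collections import namedtuple

# Hypothetical record for one code fragment reported by a clone detector.
Fragment = namedtuple("Fragment", ["application", "path", "start_line", "end_line"])


def cross_application_clone_sets(clone_sets):
    """Keep only clone sets whose fragments span at least two applications."""
    return [
        clone_set
        for clone_set in clone_sets
        if len({frag.application for frag in clone_set}) >= 2
    ]


clone_sets = [
    # Clone between Application X and Application Y: kept.
    [Fragment("X", "x/A.java", 10, 40), Fragment("Y", "y/B.java", 5, 35)],
    # Clone inside Application X only: dropped.
    [Fragment("X", "x/A.java", 50, 70), Fragment("X", "x/C.java", 12, 32)],
]
kept = cross_application_clone_sets(clone_sets)
print(len(kept))
```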

Step3. Counting Code Clones(1/2)

• The following steps are repeated for each target license
  1. Select a license as the analysis target
  2. Extract the clone sets that include code under that license
     • A clone set is a set of code clones that are similar to each other
  3. Count the code fragments in the extracted clone sets, grouped by their license

[Figure: fragments having CnP relations to License A code, among Applications X, Y, and Z under Licenses A, B, and C]

License     #Code Fragments
License A   2
License B   1
License C   2
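The three counting steps can be sketched in a few lines. This is an illustrative sketch rather than the authors' implementation: each clone set is assumed to be a list holding one license label per code fragment, and the example data is made up so that it reproduces the License A counts shown above.

```python
from collections import Counter


def count_fragments_by_license(clone_sets, target):
    """Count fragments, grouped by license, in clone sets that contain `target`.

    Each clone set is a list of license labels, one per code fragment.
    """
    counts = Counter()
    for clone_set in clone_sets:
        if target in clone_set:        # step 2: keep sets that include the target license
            counts.update(clone_set)   # step 3: count every fragment by its license
    return counts


clone_sets = [
    ["License A", "License B"],
    ["License A", "License C", "License C"],
    ["License B", "License C"],  # contains no License A fragment, so it is skipped
]
counts = count_fragments_by_license(clone_sets, "License A")
print(dict(counts))
```

Repeating the call with each license as `target` yields one such table per license.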

Step3. Counting Code Clones(2/2)

• A clone set includes both the original code fragments and the code fragments generated by CnP
  → Counting the code fragments in clone sets approximates counting the number of CnP instances
• This counts the CnP to and from code fragments under the target license
• Although this table includes CnP in the opposite direction, it is sufficient for an overview

[Figure: fragments having CnP relations to License A code, among Applications X, Y, and Z under Licenses A, B, and C]

License     #Code Fragments
License A   2
License B   1
License C   2

Analyzed Code

• Java files (.java) in the main section of Debian GNU/Linux 5.0.2
• Reasons for selecting this target
  – It consists of code under various licenses
  – It can be analyzed by both Ninka and CCFinder
  – It is of a feasible scale for this experiment

#Packages   452
#Files      77,452
LOC         8,530,896

License Distribution in Analyzed Code

[Figure: bar chart of the number of files (#Files, axis from 0 to 20,000) for each license: Apachev2; GPLv2+; LesserGPLv2.1+; GPLnoVersion,GPLv2+,LinkException; GPLv2; BSD3; GPLv2,ClassPathException; other; No Notification; Unknown license]

Result (BSD3)

License                    #Fragments  Percentage
BSD3                       613         92%
GPLv2+                     20          3.0%
Apachev2                   16          2.4%
LesserGPL2+                14          2.1%
GPLv2,ClassPathException   1           0.15%
LesserGPL2.1+              1           0.15%

• Result of counting the code fragments in clone sets that include BSD3 fragments, grouped by their license
  • That is, how frequently each license is used by code fragments having a CnP relationship to BSD3 fragments
• BSD3 code is mostly reused by BSD3 code
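The Percentage column is each license's share of all counted fragments. As a quick check, the sketch below recomputes the shares from the #Fragments column of the BSD3 table; the variable names are illustrative only.

```python
# Fragment counts from the BSD3 result table, keyed by license.
fragments = {
    "BSD3": 613,
    "GPLv2+": 20,
    "Apachev2": 16,
    "LesserGPL2+": 14,
    "GPLv2,ClassPathException": 1,
    "LesserGPL2.1+": 1,
}

# All fragments in clone sets that include BSD3 code.
total = sum(fragments.values())

# Share of each license as a percentage of the total.
shares = {lic: 100.0 * n / total for lic, n in fragments.items()}

for lic, pct in shares.items():
    print(f"{lic:26s} {fragments[lic]:4d} {pct:6.2f}%")
```

The total of 665 fragments matches the #Fragments value reported for BSD3 in the per-license summary.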

Result (Apachev2)

License          #Fragments  Percentage
Apachev2         1533        77%
Apachev1.1       316         16%
LesserGPL2.1+    42          2.1%
MPLv1.1          33          1.6%
BSD3             29          1.5%
MX4JLicensev1    16          0.80%
GPLv2+           4           0.20%
LibraryGPL2+     3           0.15%
MPLv1.0          2           0.10%
MITX11noNotice   2           0.10%
Public Domain    1           0.050%
Subversion+      1           0.050%
EPLv1            1           0.050%

• A large percentage of CnP is between Apachev2 code fragments
• The license of Apachev1.1 code has been changed to Apachev2

Result (GPLv2+)

License                             #Fragments  Percentage
GPLv2+                              268         49%
GPLnoVersion,GPLv2+,LinkException   225         41%
BSD3                                28          5.1%
LibraryGPLv2+                       20          3.6%
Apachev2                            4           0.73%
LesserGPLv2.1+                      4           0.73%

• CnP within GPLv2+ code accounts for the highest percentage
• “GPLnoVersion, GPLv2+, LinkException” also has a high percentage
  • “GPLnoVersion, GPLv2+, LinkException” code is reused by GPLv2+ code

[Figure: CnP between “GPLnoVersion, GPLv2+, LinkException” code and GPLv2+ code]

#Files and #Fragments under Each License

License    #Fragments  #Files  #Fragments / #Files
BSD3       665         2181    0.305
Apachev2   1983        16350   0.121
GPLv2+     549         8160    0.0673

• The frequency of CnP per file: BSD3 > Apachev2 > GPLv2+
• The larger “#Fragments / #Files” is for a license, the more frequently code under that license is copy-and-pasted
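The ratio and the resulting ordering can be recomputed directly from the table. The snippet below is a small sketch using the reported #Fragments and #Files values; the dictionary layout is an illustrative choice.

```python
# (#Fragments, #Files) per license, taken from the table above.
table = {
    "BSD3": (665, 2181),
    "Apachev2": (1983, 16350),
    "GPLv2+": (549, 8160),
}

# Fragments per file for each license.
ratios = {lic: frags / files for lic, (frags, files) in table.items()}

# Licenses ordered from most to least frequently copy-and-pasted per file.
ranking = sorted(ratios, key=ratios.get, reverse=True)
print(ranking)
```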

Summary of the Results

• Common characteristic of all licenses
  – CnP within code distributed under the same license, or under licenses designed by the same organization, accounts for the majority
    • CnP might happen mostly within an organization
• Apachev2 has CnP relations to various licenses
  – Apachev2 has the largest number of files
  – The conditions of Apachev2 are more relaxed than those of GPLv2+
• The frequency of CnP per file
  – BSD3 > Apachev2 > GPLv2+

Threat to Validity

• The results are insufficient to generalize to OSS as a whole
  – The analysis target is small
    → We plan a large-scale analysis
  – Only Java files were analyzed
    • The history of Java is short, hence Java files may be copy-and-pasted less than files in other languages
    → We plan an analysis of C/C++ files
• Overlapping code fragments may be counted separately
  – The number of overlapping code fragments might be small

[Figure: overlapping code fragments, Fragment A and Fragment B]

Scalability of Investigating Method

• This method can be applied to a large target, because each step scales
  – License detection
    • Ninka can analyze files in linear time
  – Code clone detection
    • There are tools more scalable than CCFinder, such as CCFinderX and D-CCFinder
  – Counting code clones
    • This process did not take a long time

Conclusion

• A preliminary study of the impact of licenses on CnP was performed
  – Java files in the main section of Debian GNU/Linux 5.0.2 were analyzed
• CnP happens mostly within code distributed under the same license or under licenses designed by the same organization
• The frequency of CnP per file
  – BSD3 > Apachev2 > GPLv2+
• Our method can be applied to a large target

Future Work

• Large-scale experiment
• Investigating whether code fragments are copy-and-pasted mostly within an organization
• Detecting the direction of CnP