ERA - Measuring Maintainability of Spreadsheets in the Wild

T +31 20 314 0950 [email protected] www.sig.eu Measuring Maintainability of Spreadsheets in the Wild José Pedro Correia & Miguel Alexandre Ferreira September 2011


DESCRIPTION

Paper: Measuring Maintainability of Spreadsheets in the Wild

Authors: José Pedro Correia and Miguel Alexandre Ferreira

Session: Early Research Achievements Track Session 2: Software Changes and Maintainability

TRANSCRIPT

Page 1: ERA - Measuring Maintainability of Spreadsheets in the Wild

T +31 20 314 0950

[email protected]

www.sig.eu

Measuring Maintainability of Spreadsheets in the Wild

José Pedro Correia & Miguel Alexandre Ferreira

September 2011

Page 2: ERA - Measuring Maintainability of Spreadsheets in the Wild


Introduction

Spreadsheets

•  are widely used in all kinds of organizations

•  contain important business logic

•  are maintained by different people

Do all spreadsheets matter?

•  throwaway calculations don’t

•  some data intensive spreadsheets might

•  “spreadsheet programs” matter the most


Measuring Maintainability of Spreadsheets in the Wild – José Pedro Correia & Miguel Alexandre Ferreira – September 2011

Page 3: ERA - Measuring Maintainability of Spreadsheets in the Wild


Pragmatic criteria for “spreadsheet programs”

•  have formulas

•  the formulas have references
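The two criteria can be sketched as a small check. This is only an illustration: the `cells` mapping, the function names and the reference pattern are assumptions made for the example, and the pattern covers plain A1-style references only (no named ranges, no R1C1 notation).

```python
import re

# Assumption: references look like A1, $B$2 or C10. The lookarounds keep
# function names such as LOG10( from being mistaken for a reference.
CELL_REF = re.compile(r"(?<![A-Za-z0-9_])\$?[A-Za-z]{1,3}\$?\d+(?![A-Za-z0-9_(])")


def is_formula(content):
    """A cell holds a formula when its raw text starts with '='."""
    return isinstance(content, str) and content.startswith("=")


def is_spreadsheet_program(cells):
    """Return True if at least one formula contains a cell reference.

    `cells` is a hypothetical mapping from cell address to raw content.
    """
    return any(
        is_formula(content) and CELL_REF.search(content) is not None
        for content in cells.values()
    )


throwaway = {"A1": 2, "A2": "=1+1"}   # a formula, but no references
program = {"A1": 2, "A2": "=A1*2"}    # a formula that references A1

print(is_spreadsheet_program(throwaway))  # False
print(is_spreadsheet_program(program))    # True
```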


Page 4: ERA - Measuring Maintainability of Spreadsheets in the Wild


Definitions

•  $\text{spreadsheet} = \{ \text{sheet}_1, \text{sheet}_2, \ldots, \text{sheet}_i \}$

•  $\text{sheet} = \begin{pmatrix} \text{cell}_{11} & \text{cell}_{12} & \cdots & \text{cell}_{1j} \\ \text{cell}_{21} & \text{cell}_{22} & \cdots & \text{cell}_{2j} \\ \vdots & \vdots & \ddots & \vdots \\ \text{cell}_{i1} & \text{cell}_{i2} & \cdots & \text{cell}_{ij} \end{pmatrix}$

Page 5: ERA - Measuring Maintainability of Spreadsheets in the Wild


Definitions

•  cell types

•  blank: no content

•  data: non-blank and does not contain a formula

•  proxy: contains a formula that is a direct, single reference (e.g. =A1)

•  calculation: contains a formula and is not a proxy

•  cell roles

| | blank | data | proxy | calculation |
|---|---|---|---|---|
| not referenced | no role | label¹ | data sink¹ | calc. sink |
| referenced | open input | data source¹ | data move | calc. step |

¹ [Hodnigg & Mittermeir – Metric-based spreadsheet visualization: Support for focused maintenance – EuSpRIG’08]
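The cell types and roles defined above can be expressed as a classification function plus a lookup table. The sketch below is a minimal illustration, not the authors' tooling: the function names are hypothetical, and a proxy is assumed to be `=` followed by exactly one A1-style reference, optionally qualified by a sheet name.

```python
import re

# Assumption: a proxy formula is '=' plus a single reference, e.g. =A1 or =Sheet2!$B$3.
PROXY = re.compile(r"=\s*(?:[A-Za-z0-9_]+!)?\$?[A-Za-z]{1,3}\$?\d+\s*")


def cell_type(content):
    """Classify raw cell content as blank, data, proxy or calculation."""
    if content is None or content == "":
        return "blank"
    if not (isinstance(content, str) and content.startswith("=")):
        return "data"
    if PROXY.fullmatch(content):
        return "proxy"
    return "calculation"


# (cell type, is the cell referenced by another cell?) -> role
ROLES = {
    ("blank", False): "no role",
    ("data", False): "label",
    ("proxy", False): "data sink",
    ("calculation", False): "calc. sink",
    ("blank", True): "open input",
    ("data", True): "data source",
    ("proxy", True): "data move",
    ("calculation", True): "calc. step",
}


def cell_role(content, referenced):
    """Look up the role of a cell from its type and whether it is referenced."""
    return ROLES[(cell_type(content), referenced)]


print(cell_role("Total", False))   # label
print(cell_role("=A1", True))      # data move
print(cell_role("=A1+B1", False))  # calc. sink
```

Whether a cell is referenced has to come from a separate pass over all formulas; the sketch takes it as an input.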

Page 6: ERA - Measuring Maintainability of Spreadsheets in the Wild


Definitions

•  formula copy equivalents¹: =R1 + 2, =R2 + 2, =R3 + 2, =R4 + 2

•  unique formula: =X + 2

¹ [Mittermeir & Clermont – Finding high-level structures in spreadsheet programs – WCRE’02]
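One way to illustrate how copy equivalents collapse into a unique formula is to replace every cell reference with a placeholder and group the results. This is a deliberate simplification and an assumption of this sketch: a faithful implementation would compare references by their offset relative to the formula's own cell, whereas the code below treats all references alike. The function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: plain A1-style references; function names such as LOG10( are skipped.
REF = re.compile(r"(?<![A-Za-z0-9_])\$?[A-Za-z]{1,3}\$?\d+(?![A-Za-z0-9_(])")


def abstract_formula(formula):
    """Replace every cell reference with the placeholder X and drop spaces."""
    return REF.sub("X", formula).replace(" ", "")


def unique_formulas(formulas):
    """Count how many formula cells share each abstracted form."""
    return Counter(abstract_formula(f) for f in formulas)


column = ["=R1 + 2", "=R2 + 2", "=R3 + 2", "=R4 + 2"]
groups = unique_formulas(column)

print(groups)       # Counter({'=X+2': 4})
print(len(groups))  # 1
```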

Page 7: ERA - Measuring Maintainability of Spreadsheets in the Wild


Research questions

1.  Which metrics can we use to assess spreadsheet maintainability?

2.  What are the typical values for the selected metrics?


Page 8: ERA - Measuring Maintainability of Spreadsheets in the Wild


Study outline

1.  Metric selection

2.  Measuring the EUSES Spreadsheet Corpus

3.  Analysis of results


Page 9: ERA - Measuring Maintainability of Spreadsheets in the Wild


Goal Question Metric (mockup)


[Diagram: Goal Question Metric tree with the levels Main goal, Sub goals, Questions, Sub questions and Metrics]

•  Main goal: Maintainability

•  Sub goals: Analyzability, Changeability, Stability, Testability

•  Questions: How large is the spreadsheet? Are there inconsistencies? How much coupling is there? How complex is the spreadsheet? …

Page 10: ERA - Measuring Maintainability of Spreadsheets in the Wild


Example metrics

Spreadsheet level

•  # used rows / columns

•  # formulas / unique formulas

Sheet level

•  # data fan-in / fan-out

•  # data move / sink cells

Formula level

•  McCabe complexity

•  # operators
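As a rough illustration of the formula-level metrics, the sketch below counts operators and approximates McCabe complexity for a single formula. The slides do not say how either metric is computed, so the rules here are assumptions: only `IF(` calls count as decision points, unary minus counts as an operator, and text inside string literals is ignored.

```python
import re

STRING_LITERAL = re.compile(r'"[^"]*"')
# Two-character operators come first so '<=' is not counted as '<' and '='.
OPERATOR = re.compile(r"<=|>=|<>|[-+*/^&=<>]")
# Assumption: only IF( is a decision point; SUMIF( and IFERROR( do not match.
CONDITIONAL = re.compile(r"\bIF\s*\(", re.IGNORECASE)


def formula_metrics(formula):
    """Return the operator count and an approximate McCabe complexity."""
    body = formula[1:] if formula.startswith("=") else formula
    body = STRING_LITERAL.sub('""', body)  # blank out string contents
    return {
        "operators": len(OPERATOR.findall(body)),
        "mccabe": 1 + len(CONDITIONAL.findall(body)),
    }


print(formula_metrics("=A1+B1*2"))
# {'operators': 2, 'mccabe': 1}
print(formula_metrics('=IF(A1>0, "x+y", IF(B1<=3, 1, -1))'))
# {'operators': 3, 'mccabe': 3}
```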


Page 11: ERA - Measuring Maintainability of Spreadsheets in the Wild


The EUSES Spreadsheet Corpus¹

[Bar chart. Axis: Spreadsheets, 0 to 5000. Bar labels: Total Internet search; Contain formulas; Contain formulas with references; > 25 unique formulas. Data label: 1609]

¹ [Fisher II & Rothermel – The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms – WEUSE’05]

Page 12: ERA - Measuring Maintainability of Spreadsheets in the Wild


Power-law like distributions

most metrics follow a power-law like distribution


[Histogram: NON_BLANK_CELLS (up to 99 quantile). x-axis: 0 to 15000. y-axis: Frequency, 0 to 1000]

Page 13: ERA - Measuring Maintainability of Spreadsheets in the Wild


Extremely skewed distributions


at least 75% of sheets have no proxy cells

at least 75% of sheets are not referenced / have no references

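| Attribute | Object | Min | Q1 | Med. | Q3 | 95% | 99% | Max |
|---|---|---|---|---|---|---|---|---|
| # data move cells | Sheet | 0 | 0 | 0 | 0 | 12 | 80 | 672 |
| # data sink cells | Sheet | 0 | 0 | 0 | 0 | 8 | 80 | 1188 |
| # data fan-out | Sheet | 0 | 0 | 0 | 0 | 52 | 1256 | 964366 |
| # data fan-in | Sheet | 0 | 0 | 0 | 0 | 76 | 1814 | 964366 |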
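The columns of this table can be reproduced for any metric with a short quantile summary. The sketch below uses the nearest-rank definition, which is an assumption; the slides do not state which quantile definition was used, and the sample values are made up for illustration.

```python
import math


def quantile(sorted_values, q):
    """Nearest-rank quantile: the value at rank ceil(q * n) in the sorted data."""
    rank = max(1, math.ceil(q * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(values):
    """Return Min, Q1, Med., Q3, 95%, 99% and Max for one metric."""
    data = sorted(values)
    return {
        "Min": data[0],
        "Q1": quantile(data, 0.25),
        "Med.": quantile(data, 0.50),
        "Q3": quantile(data, 0.75),
        "95%": quantile(data, 0.95),
        "99%": quantile(data, 0.99),
        "Max": data[-1],
    }


# Made-up measurements: 16 sheets without data move cells, 4 sheets with some.
data_move_cells = [0] * 16 + [3, 12, 80, 672]
print(summarize(data_move_cells))
```

When Q3 is 0, as in this sample, at least 75% of the measured objects have a value of 0, which is how the statements about sheets without proxy cells or references follow from the table.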

Page 14: ERA - Measuring Maintainability of Spreadsheets in the Wild


Sparse distributions


conditionals and magic values are uncommon

| Attribute | Object | Min | Q1 | Med. | Q3 | 95% | 99% | Max |
|---|---|---|---|---|---|---|---|---|
| McCabe complexity | Formula | 1 | 1 | 1 | 1 | 1 | 5 | 34 |
| # unidentified values | Formula | 0 | 0 | 0 | 1 | 3 | 9 | 51 |

Page 15: ERA - Measuring Maintainability of Spreadsheets in the Wild


Documentation within the spreadsheet


at least 25% of sheets are purely for documentation purposes

| Attribute | Object | Min | Q1 | Med. | Q3 | 95% | 99% | Max |
|---|---|---|---|---|---|---|---|---|
| % label cells | Sheet | 0 | 35 | 64 | 100 | 100 | 100 | 100 |

Page 16: ERA - Measuring Maintainability of Spreadsheets in the Wild


Spreadsheet layout


most common layout seems to be vertical

| Attribute | Object | Min | Q1 | Med. | Q3 | 95% | 99% | Max |
|---|---|---|---|---|---|---|---|---|
| # used columns | Spreadsheet | 2 | 10 | 19 | 41 | 122 | 228 | 738 |
| # used rows | Spreadsheet | 2 | 47 | 99 | 210 | 686 | 1629 | 40518 |

Page 17: ERA - Measuring Maintainability of Spreadsheets in the Wild


“Dragging” formulas


seems to be a common practice

| Attribute | Object | Min | Q1 | Med. | Q3 | 95% | 99% | Max |
|---|---|---|---|---|---|---|---|---|
| # formula cells | Spreadsheet | 1 | 20 | 75 | 239 | 1153 | 4052 | 24523 |
| # unique formulas | Spreadsheet | 1 | 3 | 8 | 26 | 103 | 252 | 961 |
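A quick piece of arithmetic on the table values gives a feel for how much copying this implies. Note the caveat: dividing one quantile by another is not the same as taking quantiles of the per-spreadsheet ratio, so the figures below are only a rough indication and not a result reported in the slides.

```python
# Values copied from the table above: (# formula cells, # unique formulas).
quantiles = {
    "Q1": (20, 3),
    "Med.": (75, 8),
    "Q3": (239, 26),
    "95%": (1153, 103),
    "99%": (4052, 252),
}


def cells_per_unique_formula(formula_cells, unique_formulas):
    """Average number of formula cells sharing one unique formula."""
    return formula_cells / unique_formulas


ratios = {
    name: round(cells_per_unique_formula(*pair), 1)
    for name, pair in quantiles.items()
}
print(ratios)
# {'Q1': 6.7, 'Med.': 9.4, 'Q3': 9.2, '95%': 11.2, '99%': 16.1}
```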

Page 18: ERA - Measuring Maintainability of Spreadsheets in the Wild


Answering the research questions

1.  Which metrics can we use to assess spreadsheet maintainability?

•  levels: sheets, rows/columns, cells, formulas

2.  What are the typical values for the selected metrics?

•  most distributions resemble software metrics distributions

•  some distributions are extremely skewed/sparse

•  expect label-only sheets, more rows than columns, and copy-equivalent formulas


Page 19: ERA - Measuring Maintainability of Spreadsheets in the Wild


Roadmap

1.  Study how and what to measure in spreadsheets

2.  Select a minimal set of metrics and build a quality model

3.  Gather a representative set of measurements for calibration

4.  Calibrate the thresholds in the model

5.  Validate the model


Page 20: ERA - Measuring Maintainability of Spreadsheets in the Wild


The end…

Thank you for your attention!

Q&A

Complete data set and technical report at: http://www.sig.eu/en/spreadsheet-quality

Miguel Ferreira

Software Improvement Group [email protected]
