era - measuring maintainability of spreadsheets in the wild
DESCRIPTION
Paper: Measuring Maintainability of Spreadsheets in the Wild
Authors: José Pedro Correia and Miguel Alexandre Ferreira
Session: Early Research Achievements Track Session 2: Software Changes and Maintainability
TRANSCRIPT
Measuring Maintainability of Spreadsheets in the Wild
José Pedro Correia & Miguel Alexandre Ferreira
September 2011
Introduction
Spreadsheets
• are widely used in all kinds of organizations
• contain important business logic
• are maintained by different people
Do all spreadsheets matter?
• throwaway calculations don’t
• some data intensive spreadsheets might
• “spreadsheet programs” matter the most
Measuring Maintainability of Spreadsheets in the Wild – José Pedro Correia & Miguel Alexandre Ferreira – September 2011
Pragmatic criteria for “spreadsheet programs”
• have formulas
• the formulas have references
Definitions
• spreadsheet = { sheet_1, sheet_2, …, sheet_i }
• sheet =
    cell_11  cell_12  …  cell_1j
    cell_21  cell_22  …  cell_2j
    …        …        …
    cell_i1  cell_i2  …  cell_ij
Definitions
• cell types
• blank: no content
• data: non-blank and does not contain a formula
• proxy: contains a formula that is a direct, single reference (e.g. =A1)
• calculation: contains a formula and is not a proxy
• cell roles
                   blank        data            proxy         calculation
  not referenced   no role      label¹          data sink¹    calc. sink
  referenced       open input   data source¹    data move     calc. step
¹ [Hodnigg & Mittermeir – Metric-based spreadsheet visualization: Support for focused maintenance – EuSpRIG’08]
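The cell type and role definitions above can be sketched in Python. This is a minimal illustration, not the authors' tooling: the function names, the representation of a cell as its raw content string, and the regular expression that detects a direct, single reference are all assumptions.

```python
import re

# Assumption: a proxy formula is "=" followed by exactly one A1-style
# reference, optionally sheet-qualified (e.g. =A1, =$B$2, =Sheet2!C3).
SINGLE_REF = re.compile(r"^=\s*(?:[A-Za-z0-9_]+!)?\$?[A-Za-z]{1,3}\$?\d+\s*$")


def cell_type(content):
    """Classify raw cell content as blank, data, proxy or calculation."""
    if content is None or content.strip() == "":
        return "blank"
    if not content.startswith("="):
        return "data"
    if SINGLE_REF.match(content):
        return "proxy"
    return "calculation"


# Role table from the slide: (cell type, is referenced) -> role.
ROLES = {
    ("blank", False): "no role",
    ("blank", True): "open input",
    ("data", False): "label",
    ("data", True): "data source",
    ("proxy", False): "data sink",
    ("proxy", True): "data move",
    ("calculation", False): "calc. sink",
    ("calculation", True): "calc. step",
}


def cell_role(content, is_referenced):
    """Look up the role of a cell, given whether any formula references it."""
    return ROLES[(cell_type(content), is_referenced)]
```

Whether a cell is referenced has to come from a separate pass over all formulas in the spreadsheet; here it is simply passed in as a flag.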
Definitions
• formula copy equivalents¹, e.g. =R1 + 2, =R2 + 2, =R3 + 2, =R4 + 2
• unique formula, e.g. =X + 2
¹ [Mittermeir & Clermont – Finding high-level structures in spreadsheet programs – WCRE’02]
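One way to detect copy-equivalent formulas is to rewrite every reference relative to the cell that hosts the formula, so that dragged copies become textually identical. The sketch below assumes A1-style references and uses R1C1-like notation for the normalized form. The helper names are hypothetical, and the slides do not say how the authors computed equivalence.

```python
import re

# Assumption: references are plain A1-style. Sheet-qualified references,
# named ranges and string literals are not treated specially.
REF = re.compile(r"(?<![A-Z0-9$])(\$?)([A-Z]{1,3})(\$?)(\d+)(?![A-Z0-9(])")


def col_to_index(col):
    """Convert a column name to a 1-based index (A -> 1, Z -> 26, AA -> 27)."""
    n = 0
    for ch in col:
        n = n * 26 + (ord(ch) - ord("A") + 1)
    return n


def normalize(formula, row, col):
    """Rewrite each reference relative to the hosting cell at (row, col)."""
    def repl(m):
        col_abs, col_name, row_abs, row_num = m.groups()
        c, r = col_to_index(col_name), int(row_num)
        # Absolute parts ($) stay absolute; relative parts become offsets.
        row_part = f"R{r}" if row_abs else f"R[{r - row}]"
        col_part = f"C{c}" if col_abs else f"C[{c - col}]"
        return row_part + col_part
    return REF.sub(repl, formula)


def unique_formulas(cells):
    """cells maps (row, col) to a formula string; returns the normalized set."""
    return {normalize(f, r, c) for (r, c), f in cells.items()}


# Four dragged formulas in column B collapse to one unique formula.
dragged = {(i, 2): f"=A{i} + 2" for i in range(1, 5)}
print(len(dragged), len(unique_formulas(dragged)))  # 4 formula cells, 1 unique formula
```

The number of formula cells and the number of unique formulas then fall out as `len(cells)` and `len(unique_formulas(cells))`.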
Research questions
1. Which metrics can we use to assess spreadsheet maintainability?
2. What are the typical values for the selected metrics?
Study outline
1. Metric selection
2. Measuring the EUSES Spreadsheet Corpus
3. Analysis of results
Goal Question Metric (mockup)
Levels, from top to bottom: Main goal, Sub goals, Questions, Sub questions, Metrics
Main goal: Maintainability
• Analyzability: How large is the spreadsheet? …
• Changeability: Are there inconsistencies? …
• Stability: How much coupling is there? …
• Testability: How complex is the spreadsheet? …
Example metrics
Spreadsheet level
• # used rows / columns
• # formulas / unique formulas
Sheet level
• # data fan-in / fan-out
• # data move / sink cells
Formula level
• McCabe complexity
• # operators
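The two formula-level metrics can be approximated with a few lines of Python. The slides do not define how either is computed, so this sketch makes two assumptions: McCabe complexity is 1 plus the number of calls to functions treated as conditional, and operators are counted by pattern matching on the formula text. The list of conditional functions and the helper names are illustrative only.

```python
import re

# Assumption: these functions introduce a decision point. The source does not
# list which functions count toward formula complexity.
CONDITIONALS = ("IF", "IFERROR", "SUMIF", "COUNTIF", "CHOOSE")
CALL = re.compile(r"\b([A-Z][A-Z0-9.]*)\s*\(")
# Two-character comparison operators come first so that "<=" counts once.
OPERATOR = re.compile(r"<=|>=|<>|[+\-*/^&<>=]")


def mccabe(formula):
    """1 plus the number of calls to functions assumed to be conditional."""
    calls = CALL.findall(formula.upper())
    return 1 + sum(1 for name in calls if name in CONDITIONALS)


def operator_count(formula):
    """Count operators in the formula body, skipping the leading '='."""
    body = formula[1:] if formula.startswith("=") else formula
    return len(OPERATOR.findall(body))
```

Operators inside string literals are counted too, and unary minus is not distinguished from subtraction. A real implementation would tokenize the formula first.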
The EUSES Spreadsheet Corpus¹
[Bar chart: number of spreadsheets (axis from 0 to 5000) per category: Total Internet search; Contain formulas; Contain formulas with references; > 25 unique formulas. The only bar label that survives is 1609.]
¹ [Fisher II & Rothermel – The EUSES spreadsheet corpus: a shared resource for supporting experimentation with spreadsheet dependability mechanisms – WEUSE’05]
Power-law like distributions
most metrics follow a power-law like distribution
[Histogram: NON_BLANK_CELLS (up to the 99% quantile); x-axis from 0 to 15000, y-axis Frequency from 0 to 1000]
Extremely skewed distributions
at least 75% of sheets have no proxy cells
at least 75% of sheets are not referenced / have no references
Attribute           Object   Min   Q1   Med.   Q3   95%    99%      Max
# data move cells   Sheet      0    0      0    0    12     80      672
# data sink cells   Sheet      0    0      0    0     8     80     1188
# data fan-out      Sheet      0    0      0    0    52   1256   964366
# data fan-in       Sheet      0    0      0    0    76   1814   964366
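The summary columns used in these tables (Min, Q1, Med., Q3, 95%, 99%, Max) can be reproduced for any metric with a short helper. The sketch below uses the nearest-rank quantile definition, which is an assumption: the slides do not state which definition the authors used. The sample values are made up for illustration and are not taken from the corpus.

```python
def nearest_rank(sorted_values, p):
    """Nearest-rank quantile: the value at rank ceil(p% of n), 1-based."""
    # Integer arithmetic avoids floating-point rounding in the ceiling.
    k = max(1, (p * len(sorted_values) + 99) // 100)
    return sorted_values[k - 1]


def summarize(values):
    """Return the summary columns used in the tables for one metric."""
    s = sorted(values)
    return {
        "Min": s[0],
        "Q1": nearest_rank(s, 25),
        "Med.": nearest_rank(s, 50),
        "Q3": nearest_rank(s, 75),
        "95%": nearest_rank(s, 95),
        "99%": nearest_rank(s, 99),
        "Max": s[-1],
    }


# Hypothetical per-sheet counts (not from the corpus): mostly zeros with a
# few large values, the shape the slide calls extremely skewed.
sample = [0] * 16 + [3, 12, 80, 672]
print(summarize(sample))
```

With a sample shaped like this, Min through Q3 are all zero and only the upper quantiles carry information, which is the pattern visible in the table.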
Sparse distributions
conditionals and magic values are uncommon
Attribute               Object    Min   Q1   Med.   Q3   95%   99%   Max
McCabe complexity       Formula     1    1      1    1     1     5    34
# unidentified values   Formula     0    0      0    1     3     9    51
Documentation within the spreadsheet
at least 25% of sheets are purely for documentation purposes
Attribute       Object   Min   Q1   Med.    Q3   95%   99%   Max
% label cells   Sheet      0   35     64   100   100   100   100
Spreadsheet layout
most common layout seems to be vertical
Attribute        Object        Min   Q1   Med.    Q3   95%    99%     Max
# used columns   Spreadsheet     2   10     19    41   122    228     738
# used rows      Spreadsheet     2   47     99   210   686   1629   40518
“Dragging” formulas
seems to be a common practice
Attribute           Object        Min   Q1   Med.    Q3    95%    99%     Max
# formula cells     Spreadsheet     1   20     75   239   1153   4052   24523
# unique formulas   Spreadsheet     1    3      8    26    103    252     961
Answering the research questions
1. Which metrics can we use to assess spreadsheet maintainability?
• levels: sheets, rows/columns, cells, formulas
2. What are the typical values for the selected metrics?
• most distributions resemble software metrics distributions
• some distributions are extremely skewed/sparse
• expect label-only sheets, more rows than columns, and copy-equivalent formulas
Roadmap
1. Study how and what to measure in spreadsheets
2. Select a minimal set of metrics and build a quality model
3. Gather a representative set of measurements for calibration
4. Calibrate the thresholds in the model
5. Validate the model
The end…
Thank you for your attention!
Q&A
Complete data set and technical report at: http://www.sig.eu/en/spreadsheet-quality
Miguel Ferreira
Software Improvement Group