1 black box (finite state machine) testing design for testability coverage measures random testing...

1

Topics in Testing We’ve Covered

• Black box (Finite State Machine) testing
• Design for testability
• Coverage measures
• Random testing
• Constraint-based testing
• Debugging and test case minimization
• Using model checkers for testing
• Coverage revisited (“small model property”)

2

Black box (Finite State Machine) testing

• There “are no Turing machines”

• Vasilevskii and Chow algorithm for conformance testing based on spanning trees and distinguishing sets

• Exhaustive testing that cannot miss bugs is often computationally intractable



3

Design for testability

• Controllability and observability

• Simulation and stubbing, assertions, downward scalability, etc.


4

Coverage measures

• Not necessarily correlated with fault detection!
• Still useful!
• Graph coverage: node and edge (statement and branch coverage)
• Logic coverage
• Input space partitioning
• Syntax-based coverage


[Figures: control-flow graph of an if-then-else on x < y; a domain partitioned into blocks b1, b2, b3; the predicate ((a <= b) && !G) || (x >= y)]

5

Random testing

• Generate inputs at random
• Explore very large numbers of executions
• Relies on a good automatic test oracle
• Feedback to bias choices away from redundant and irrelevant inputs is useful
• Good baseline for evaluating other methods, and often very effective


6

Constraint-based testing

• Addresses weaknesses of random testing
  • E.g., finding needles in haystacks, such as where hash(x) = y
• Combines concrete and symbolic execution to generate inputs
• Concrete execution helps where symbolic solvers choke
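The needle-in-a-haystack situation can be sketched as follows. Everything in this block is a hypothetical illustration rather than material from the slides: `toy_hash` stands in for a real hash function, `guarded_branch_taken` models code hidden behind a check of the form hash(x) = y, and `needle_for` shows the concrete-execution idea of running the hash and reusing the observed value instead of asking a solver to invert it.

```c
#include <stdint.h>

/* Hypothetical toy hash (an assumption for illustration, not a library function). */
uint32_t toy_hash(uint32_t x) {
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 16;
    return x;
}

/* Returns 1 only when the "needle" branch is taken, i.e. when hash(x) == y. */
int guarded_branch_taken(uint32_t x, uint32_t y) {
    if (toy_hash(x) == y) {
        return 1; /* very rarely reached by uniformly random (x, y) pairs */
    }
    return 0;
}

/* Concrete execution supplies the value a symbolic solver would struggle
   to derive: run the hash on a chosen x and use the observed result as y. */
uint32_t needle_for(uint32_t x) {
    return toy_hash(x);
}
```

For a fixed y, a uniformly random 32-bit x reaches the branch with a chance of about one in 2^32 per attempt, which is why purely random generation tends to miss it.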


7

Debugging and test case minimization

• Automatic minimization of test cases is very valuable for debugging and reducing regression suite size

• Debugging can be considered as an application of the scientific method

• Various techniques exist for using test cases to localize faults


8

Using model checkers for testing

• Testing based on states, rather than on executions or paths

• Use abstractions to reduce state space

• Use automatic instrumentation to handle the engineering difficulties


9

Hang onto your hats

• It’s going to be a fast ride
• Anything in these slides is fair game for the test; anything not even mentioned in these slides is not fair game (so I’ll mention valgrind right now to let you know it might show up…)
• So ask questions as we go if something is unclear (so unclear that you think even re-reading the slides isn’t going to help)

NOW BEGINS THE REVIEW

10

Basic Definitions: Testing

What is software testing?
• Running a program
• In order to find faults
  • a.k.a. defects
  • a.k.a. errors
  • a.k.a. flaws
  • a.k.a. faults
  • a.k.a. BUGS

11

Testing

What isn’t software testing?
• Purely static analysis: examining a program’s source code or binary in order to find bugs, but not executing the program
  • Good stuff, and very important, but it’s not testing
  • We’ll get back to this in a future class
• Fuzzy borderline: if we only symbolically execute the program
• For this class, we’ll call it testing when the program actually runs (but maybe in a virtual machine)

12

Why Testing?

Ideally: we prove code correct, using formal mathematical techniques (with a computer, not chalk)

• Extremely difficult: for some trivial programs (100 lines) and many small (5K lines) programs

• Simply not practical to prove correctness in most cases – often not even for safety or mission critical code

13

Why Testing?

Nearly ideally: use symbolic or abstract model checking to prove the system correct
• Automatically extracts a mathematical abstraction from a system
• Proves properties over all possible executions

• In practice, can work well for very simple properties (“this program never crashes in this particular way”), but can’t handle complex properties (“this is a working file system”)

• Doesn’t work well for programs with complex data structures (like a file system)

14

Why Does Testing Matter?

NIST report, “The Economic Impacts of Inadequate Infrastructure for Software Testing” (2002)
• Inadequate software testing costs the US alone between $22 and $59 billion annually
• Better approaches could cut this amount in half

Major failures: Ariane 5 explosion, Mars Polar Lander, Intel’s Pentium FDIV bug

Insufficient testing of safety-critical software can cost lives: THERAC-25 radiation machine: 3 dead

We want our programs to be reliable
• Testing is how, in most cases, we find out if they are

[Image: Mars Polar Lander crash site?]

[Image: THERAC-25 design]

[Image: Ariane 5: exception-handling bug forced self-destruct on maiden flight (64-bit to 16-bit conversion; about $370 million lost)]

15

Testing and Monitoring

In this class, we’ll look at which executions of a program to run
• I’ll call this problem “the” testing problem

Second problem: how do we know if an execution reveals a bug?
• Key question when monitoring deployed programs to handle faults or send in bug reports from the field
• I’ll (mostly) take this for granted: we have a reference model or assertions to check

16

Example: File System Testing

How hard would it be to just try “all” the possibilities?

Consider only 7 core operations (mkdir, rmdir, creat, open, close, read, write)
• Most of these take either a file name or a numeric argument, or both
• Even for a “reasonable” (but not provably safe) limitation of the parameters, there are 266^10 executions of length 10 to try
• Not a realistic possibility (unless we have 10^12 years to test)
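A rough back-of-the-envelope check of these figures, assuming a purely hypothetical throughput of 10^5 executions per second (the slides do not state a rate):

```latex
\[
266^{10} \approx 1.8 \times 10^{24}\ \text{executions}, \qquad
\frac{1.8 \times 10^{24}}{10^{5}\ \text{per second}} = 1.8 \times 10^{19}\ \text{s}
\approx \frac{1.8 \times 10^{19}\ \text{s}}{3.2 \times 10^{7}\ \text{s/year}}
\approx 6 \times 10^{11}\ \text{years}
\]
```

Under that assumed rate the total lands within an order of magnitude of the quoted 10^12 years; a faster or slower tester shifts the estimate proportionally.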

17

The Testing Problem

This is a primary topic of this class: what “questions” do we pose to the software, i.e.,
• How do we select a small set of executions out of a very large set of executions?

• Fundamental problem of software testing research and practice

• An open (and essentially unsolvable, in the general case) problem

18

Terms: Verification and Validation

These two terms appear a lot, often in vague or sloppy ways, in the literature
• Verification is checking that a program matches a specification
• Validation is making sure it meets the original requirements – satisfies customers, operates ok onboard the spacecraft, etc.

Verification: “you built it right”
Validation: “you built the right thing”

(our focus, for the most part)

19

Terms: Unit, Integration, System Testing

Stages of testing
• Unit testing is the first phase, done by developers of modules
• Integration testing combines unit tested modules and tests how they interact
• System testing tests a whole program to make sure it meets requirements

• “Design testing” is testing prototypes or very abstract models before implementation – seldom mentioned, but when possible it can save your bacon

• Exhaustive model checking may be possible at this stage

20

Terms: Functional Testing

Functional testing is a related term
• Tests a program from a “user’s” perspective – does it do what it should?
• Opposed to unit testing, which often proceeds from the perspective of other parts of the program
  • Module spec/interface, not user interaction
  • Sort of a fuzzy line – consider a file system – how different is the use by a program and use of UNIX commands at a prompt by a user?
• Building inspector does “unit testing”; you, walking through the house to see if it’s livable, perform “functional testing”

• Kick the tires vs. take it for a spin?

21

Terms: Regression Testing

Regression testing
• Changes can break code, reintroduce old bugs

• Things that used to work may stop working (e.g., because of another “fix”) – software regresses

• Usually a set of cases that have failed (& then succeeded) in the past

• Finding small regressions is an ongoing research area – analyze dependencies

“. . . as a consequence of the introduction of new bugs, program maintenance requires far more system testing. . . . Theoretically, after each fix one must run the entire batch of test cases previously run against the system, to ensure that it has not been damaged in an obscure way. In practice, such regression testing must indeed approximate this theoretical ideal, and it is very costly.” - Brooks, The Mythical Man-Month

22

Terms: The Oracle Problem

The oracle problem
(oracle: a magical source of truth, often cryptic, given by the gods)
• How to know if a test fails
• If the oracle says every execution is good, why bother running the program?
• Some obvious, easily automated approaches:
  • The program probably shouldn’t crash
  • Assertions shouldn’t be violated
• Automatable, but more difficult to apply:
  • Differential testing (McKeeman, etc.) – when you have another program, likely correct, that does the same thing, just compare outputs over same inputs
• Last resort, not automatable:
  • Hand inspection of executions
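Differential testing as an oracle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the setup used in the course: both bit-counting functions and the `differential_test` driver are hypothetical names, with the slow version playing the role of the "likely correct" reference program.

```c
#include <stdint.h>
#include <stdlib.h>

/* Reference implementation: slow but obviously correct bit count. */
int popcount_reference(uint32_t v) {
    int count = 0;
    for (int bit = 0; bit < 32; bit++) {
        count += (int)((v >> bit) & 1u);
    }
    return count;
}

/* Implementation under test: repeatedly clears the lowest set bit. */
int popcount_under_test(uint32_t v) {
    int count = 0;
    while (v != 0) {
        v &= v - 1; /* clears the lowest set bit */
        count++;
    }
    return count;
}

/* Differential oracle: feed both programs the same random inputs and
   return the number of disagreements (0 means no bug was revealed). */
int differential_test(unsigned seed, int trials) {
    int mismatches = 0;
    srand(seed);
    for (int i = 0; i < trials; i++) {
        /* Combine two rand() calls, since RAND_MAX may be as small as 32767. */
        uint32_t input = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        if (popcount_reference(input) != popcount_under_test(input)) {
            mismatches++;
        }
    }
    return mismatches;
}
```

A call such as `differential_test(42u, 10000)` needs no hand-written expected outputs: any disagreement between the two programs flags a test as failing.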

23

Terms: Test (Case) vs. Test Suite

Test (case): one execution of the program, that may expose a bug

Test suite: a set of executions of a program, grouped together
• A test suite is made of test cases

Tester: a program that generates tests

Line gets blurry when testing functions, not programs – especially with persistent state

24

Terms: Black Box Testing

Black box testing
• Treats a program or system as a black box
• That is, testing that does not look at source code or internal structure of the system
• Send a program a stream of inputs, observe the outputs, decide if the system passed or failed the test

• Abstracts away the internals – a useful perspective for integration and system testing

• Sometimes you don’t have access to source code, and can make little use of object code

• True black box? Access only over a network

25

Terms: White Box Testing

White box testing
• Opens up the box!

• (also known as glass box, clear box, or structural testing)

• Use source code (or other structure beyond the input/output spec.) to design test cases

• Brings us to the idea of coverage

26

Terms: Coverage

Coverage measures or metrics
• Abstraction of “what a test suite tests” in a structural sense
• Best explained by giving examples
• Common measures:
  • Statement coverage
    • A.k.a. line coverage or basic block coverage
    • Which statements execute in a test suite
  • Decision coverage
    • Which boolean expressions in control structures evaluated to both true and false during suite execution
  • Path coverage
    • Which paths through a program’s control flow graph are taken in the test suite

27

Terms: Mutation Testing

A mutation of a program is a version of the program with one or more random changes

Mutation testing is another way to measure the quality of a test suite
• Ammann and Offutt call it syntax-based coverage

Idea: generate a large number of mutants
• Run the test suite on these
• If few mutants are detected, the test suite may not be very good
• Difficulties
  • Cost of testing many versions of a program
  • How to generate mutants (operators)
• In principle, can subsume many other forms of coverage
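A single mutant and the notion of killing it can be sketched as below. The function `is_adult`, its mutant, and the helper names are hypothetical examples, and the mutant is written by hand here, whereas a mutation tool would generate many such variants automatically.

```c
#include <stddef.h>

typedef int (*predicate_fn)(int);

/* Original program: hypothetical eligibility check. */
int is_adult(int age) { return age >= 18; }

/* Mutant: relational operator >= replaced by > (a typical mutation operator). */
int is_adult_mutant(int age) { return age > 18; }

/* A test case pairs an input with the expected output. */
struct test_case { int input; int expected; };

/* Returns 1 if every test passes on the given implementation. */
int suite_passes(predicate_fn impl, const struct test_case *suite, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (impl(suite[i].input) != suite[i].expected) {
            return 0;
        }
    }
    return 1;
}

/* The mutant is killed when the suite passes on the original but fails on the mutant. */
int kills_mutant(const struct test_case *suite, size_t n) {
    return suite_passes(is_adult, suite, n) && !suite_passes(is_adult_mutant, suite, n);
}
```

A suite with inputs 10 and 30 passes on both versions and so leaves the mutant alive; adding the boundary input 18 kills it, which is the sense in which surviving mutants point at weak spots in a suite.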

28

Faults, Errors, and Failures

Fault: a static flaw in a program
• What we usually think of as “a bug”

Error: a bad program state that results from a fault
• Not every fault always produces an error

Failure: an observable incorrect behavior of a program as a result of an error
• Not every error ever becomes visible

29

To Expose a Fault with a Test

Reachability: the test must actually reach and execute the location of the fault

Infection: the fault must actually corrupt the program state (produce an error)

Propagation: the error must persist and cause an incorrect output – a failure
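A small deliberately faulty function shows why reaching a fault is not enough. This is a hypothetical sketch: `count_zeros_faulty` is an illustrative name, and the off-by-one fault is planted on purpose.

```c
/* Intended behavior: count the zeros in a[0..n-1].
   Fault (deliberate): the loop starts at index 1, so a[0] is never examined. */
int count_zeros_faulty(const int a[], int n) {
    int count = 0;
    for (int i = 1; i < n; i++) { /* fault: should start at i = 0 */
        if (a[i] == 0) {
            count++;
        }
    }
    return count;
}
```

With the input {2, 7, 0} the faulty loop is reached and executed, yet the result (1) is still correct, so no failure is observed. With {0, 7, 2} the skipped element matters and the function returns 0 instead of 1, so the fault finally shows up as a failure.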

30

Controllability and Observability

Goals for a test case:
• Reach a fault
• Produce an error
• Make the error visible as a failure

In order to make this easy the program must be controllable and observable
• Controllability:
  • How easy it is to drive the program where we want to go
• Observability:
  • How easy it is to tell what the program is doing

31

Design for Testability

If a program is not designed to be controllable and observable, it generally won’t be

We have to start preparing for testing before we write any code
• Testing as an after-the-fact, ad hoc, exercise is often limited by earlier design choices

32

Test-Driven Development

One way to design for testability is to write the test cases before the code
• Idea arising from Extreme Programming and agile development
  • Write automated test cases first
  • Then write the code to satisfy tests
• Helps focus attention on making software well-specified
• Forces observability and controllability: you have to be able to handle the test cases you’ve already written (before deciding they were impractical)

• Reduces temptation to tailor tests to idiosyncratic behaviors of implementation

33

Controllability: Simulation and Stubbing

A key to controllable code is effective simulation and stubbing
• Simulation of low-level hardware devices through a clean driver interface
  • Real hardware may be slow
  • May be impossible/expensive to induce some hardware failure modes on real hardware
  • Real hardware may be a limited resource
• Stubbing for other routines and code
  • Other code/modules may not be complete
  • May be slow and irrelevant to test
  • May need to simulate failure of other modules
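One common way to get this kind of controllability in C is to pass the device access in as a function pointer, so tests can substitute a stub. The sketch below assumes a hypothetical sensor driver interface; `read_sensor_fn`, `average_reading`, and both stubs are illustrative names, not part of any real driver API.

```c
#include <stddef.h>

/* Hypothetical driver interface: returns 0 on success, -1 on device failure. */
typedef int (*read_sensor_fn)(int *out_value);

/* Code under test: averages n readings, reporting -1 if the device fails. */
int average_reading(read_sensor_fn read_sensor, int n, int *out_avg) {
    long sum = 0;
    if (n <= 0 || out_avg == NULL) {
        return -1;
    }
    for (int i = 0; i < n; i++) {
        int value;
        if (read_sensor(&value) != 0) {
            return -1; /* failure path that is hard to trigger on real hardware */
        }
        sum += value;
    }
    *out_avg = (int)(sum / n);
    return 0;
}

/* Stub simulating a healthy device that always reads 20. */
int stub_sensor_ok(int *out_value) { *out_value = 20; return 0; }

/* Stub simulating a hardware fault on every read. */
int stub_sensor_broken(int *out_value) { (void)out_value; return -1; }
```

The broken stub drives the error-handling path on demand, with no need to damage or unplug a physical device.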

34

Controllability: Downwards Scalability

Another important aspect of controllability is to make code “downwards scalable”
• Many faults cause an error only in a corner case due to a resource limit
• An effective strategy for finding errors is to reduce the resource limits
  • Test a version of the program with very tight bounds
  • Finding corner cases is easier if the corners are close together
• Too many programs hard-code resource limits or make assumptions about resources unconnected to defined limits
  • E.g., not checking the result of malloc
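Downward scalability can be as simple as making a limit a build-time parameter. The queue below is a hypothetical sketch: `QUEUE_CAPACITY` and the `queue_*` functions are illustrative names, and the tiny default exists only so the full and wrap-around corner cases are a few operations away.

```c
#include <stddef.h>

/* Capacity is a build-time parameter rather than a buried constant, so a
   test build can shrink it, e.g. by compiling with -DQUEUE_CAPACITY=2. */
#ifndef QUEUE_CAPACITY
#define QUEUE_CAPACITY 4 /* deliberately tiny default for this sketch */
#endif

struct queue {
    int items[QUEUE_CAPACITY];
    size_t head;  /* index of the oldest element */
    size_t count; /* number of stored elements */
};

void queue_init(struct queue *q) { q->head = 0; q->count = 0; }

/* Returns 0 on success, -1 when the queue is full (the corner case). */
int queue_push(struct queue *q, int value) {
    if (q->count == QUEUE_CAPACITY) {
        return -1;
    }
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = value;
    q->count++;
    return 0;
}

/* Returns 0 on success, -1 when the queue is empty. */
int queue_pop(struct queue *q, int *out_value) {
    if (q->count == 0) {
        return -1;
    }
    *out_value = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return 0;
}
```

With a capacity of four, a test reaches the full condition and the index wrap-around after only a handful of pushes and pops, instead of thousands.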

35

Observability: Assertions

Assertions improve observability by making (some) errors into failures
• Even if the effect of a fault doesn’t propagate, it may be visible if an assertion checks the state at the right time

Assertions also improve observability by making the error, rather than failure, visible
• Know how the state was corrupted directly, not just eventual effect

36

Observability: Invariant Checkers

Can extend the idea of assertions to writing “full” invariant checkers
• Do a crawl of code’s basic data structures
• Check various invariants that would be too expensive to check at runtime
• Invariant checker can be written in an easy-to-use style: recursion, memory allocation, etc.
  • Won’t run on actual system
• But be careful! If your invariant checker has a bug and changes the system state. . .
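A minimal invariant checker might look like the following sketch. The sorted linked list and its two invariants are hypothetical examples chosen for brevity; the `const` pointers are one way to reduce the risk, noted above, of the checker itself changing the state it inspects.

```c
#include <stddef.h>

struct node {
    int key;
    struct node *next;
};

struct sorted_list {
    struct node *head;
    size_t length; /* cached element count */
};

/* Read-only crawl of the whole structure. Returns 1 if both invariants hold:
   keys are in non-decreasing order, and the cached length matches the
   number of reachable nodes. */
int sorted_list_invariants_hold(const struct sorted_list *list) {
    size_t seen = 0;
    const struct node *cur = list->head;
    while (cur != NULL) {
        seen++;
        if (seen > list->length) {
            return 0; /* too many nodes, or a cycle */
        }
        if (cur->next != NULL && cur->key > cur->next->key) {
            return 0; /* ordering violated */
        }
        cur = cur->next;
    }
    return seen == list->length;
}
```

A test harness could call this after every list operation; the full crawl is far too slow for production use, but it pinpoints the operation that first corrupted the structure.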

37

Graph Coverage

Cover all the nodes, edges, or paths of some graph related to the program

Examples:
• Statement coverage
• Branch coverage
• Path coverage
• Data flow (def-use) coverage
• Model-based testing coverage
• Many more – most common kind of coverage, by far

38

Statement/Basic Block Coverage

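if (x < y)
{
  y = 0;
  x = x + 1;
}
else
{
  x = y;
}

[Figure: control-flow graph of the if-then-else. Node 1 branches on x < y versus x >= y to nodes 2 and 3 (y = 0; x = x + 1 on one side, x = y on the other), which rejoin at node 4]

if (x < y)
{
  y = 0;
  x = x + 1;
}

[Figure: control-flow graph of the if-then. Node 1 goes to node 2 (y = 0; x = x + 1) when x < y, then to node 3; when x >= y, node 1 goes directly to node 3]

Statement coverage: cover every node of these graphs

Treat y = 0; x = x + 1 as one node because if one statement executes the other must also execute (code is a basic block)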

39

Branch Coverage

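if (x < y)
{
  y = 0;
  x = x + 1;
}
else
{
  x = y;
}

[Figure: control-flow graph of the if-then-else. Node 1 branches on x < y versus x >= y to nodes 2 and 3 (y = 0; x = x + 1 on one side, x = y on the other), which rejoin at node 4]

if (x < y)
{
  y = 0;
  x = x + 1;
}

[Figure: control-flow graph of the if-then. Node 1 goes to node 2 (y = 0; x = x + 1) when x < y, then to node 3; when x >= y, node 1 goes directly to node 3]

Branch coverage vs. statement coverage: same for if-then-else

But consider this if-then structure. For branch coverage can’t just cover all nodes, but must cover all edges – get to node 3 both after 2 and without executing 2!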
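The gap between the two criteria on the if-then structure can be made concrete with a little hand-written instrumentation. The flags and helper below are hypothetical additions for illustration; real projects would normally rely on a coverage tool rather than manual flags.

```c
/* Hypothetical instrumentation: one flag per edge leaving node 1. */
int edge_taken_true = 0;  /* edge 1 -> 2 (x < y) */
int edge_taken_false = 0; /* edge 1 -> 3 (x >= y) */

/* The if-then example, with edge recording added. */
int if_then(int x, int y) {
    if (x < y) {
        edge_taken_true = 1;
        y = 0;
        x = x + 1;
    } else {
        edge_taken_false = 1; /* records the edge only; the program has no else body */
    }
    return x + y;
}

/* Branch coverage holds only when both edges out of node 1 were exercised. */
int branch_coverage_reached(void) {
    return edge_taken_true && edge_taken_false;
}
```

A single call such as `if_then(1, 5)` executes every statement of the original code, so statement coverage is complete, but `branch_coverage_reached()` stays 0 until some input with x >= y, such as `if_then(5, 1)`, is also run.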

40

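Path Coverage

if (x < y)
{
  y = 0;
  x = x + 1;
}
else
{
  x = y;
}

if (x < y)
{
  y = 0;
  x = x + 1;
}

[Figure: control-flow graph. Nodes 1 to 4 form the if-then-else (node 1 branches on x < y versus x >= y to nodes 2 and 3, which rejoin at node 4); nodes 4 to 6 form the if-then (node 4 goes to node 5: y = 0; x = x + 1 when x < y, then to node 6; when x >= y, node 4 goes directly to node 6)]

How many paths through this code are there? Need one test case for each to get path coverage

To get statement and branch coverage, we only need two test cases:
1 2 4 5 6 and 1 3 4 6

Path coverage needs two more:
1 2 4 5 6
1 3 4 6
1 2 4 6
1 3 4 5 6

In general: exponential in the number of conditional branches!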

41

Data Flow Coverage

x = 3;
y = 3;

if (w) {
  x = y + 2;
}

if (z) {
  y = x - 2;
}

n = x + y;

[Figure: control-flow graph annotated with definitions and uses. Node 1: x = 3, Def(x). Node 2: y = 3, Def(y). Node 3: branch on w. Node 4: x = y + 2, Def(x), Use(y). Node 5: branch on z. Node 6: y = x - 2, Use(x), Def(y). Node 7: n = x + y, Use(x), Use(y)]

Annotate program with locations where variables are defined and used (very basic static analysis)

Def-use pair coverage requires executing all possible pairs of nodes where a variable is first defined and then used, without any intervening re-definitions

E.g., this path covers the pair where x is defined at 1 and used at 7: 1 2 3 5 6 7

But this path does NOT: 1 2 3 4 5 6 7

May be many pairs, some not actually executable

42

Logic Coverage

What if, instead of:

if (x < y)
{
  y = 0;
  x = x + 1;
}

we have:

if (((a > b) || G) && (x < y))
{
  y = 0;
  x = x + 1;
}

[Figure: control-flow graph. Node 1 goes to node 2 (y = 0; x = x + 1) when ((a > b) || G) && (x < y), then to node 3; node 1 goes directly to node 3 when ((a <= b) && !G) || (x >= y)]

Now, branch coverage will guarantee that we cover all the edges, but does not guarantee we will do so for all the different logical reasons

We want to test the logic of the guard of the if statement

43


Active Clause Coverage

     (a > b)   G   (x < y)   ((a > b) or G) and (x < y)
1       T      F      T                 T
2       F      F      T                 F
3       F      T      T                 T
4       F      F      T                 F     (duplicate of row 2)
5       T      T      T                 T
6       T      T      F                 F

Rows 1 and 2: with these values for G and (x<y), (a>b) determines the value of the predicate

Rows 3 and 4: with these values for (a>b) and (x<y), G determines the value of the predicate

Rows 5 and 6: with these values for (a>b) and G, (x<y) determines the value of the predicate
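The idea of a clause determining the predicate can be checked mechanically: hold the other clauses fixed, flip the clause of interest, and see whether the predicate flips. The helper names below are hypothetical, and each clause is modeled as a plain 0/1 value rather than as the underlying comparison.

```c
/* The predicate ((a > b) || G) && (x < y), written over its three
   clauses as booleans (0 or 1). */
int predicate(int a_gt_b, int g, int x_lt_y) {
    return (a_gt_b || g) && x_lt_y;
}

/* A clause determines the predicate, for fixed values of the other
   clauses, when flipping that clause alone flips the predicate. */
int a_gt_b_determines(int g, int x_lt_y) {
    return predicate(0, g, x_lt_y) != predicate(1, g, x_lt_y);
}

int g_determines(int a_gt_b, int x_lt_y) {
    return predicate(a_gt_b, 0, x_lt_y) != predicate(a_gt_b, 1, x_lt_y);
}

int x_lt_y_determines(int a_gt_b, int g) {
    return predicate(a_gt_b, g, 0) != predicate(a_gt_b, g, 1);
}
```

For instance, `a_gt_b_determines(0, 1)` is 1, matching the case where G is false and (x < y) is true, while `a_gt_b_determines(1, 1)` is 0 because G alone already makes the disjunction true.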

44

Input Domain Partitioning

Partition scheme q of domain D

The partition q defines a set of blocks, $B_q = \{b_1, b_2, \ldots, b_Q\}$

The partition must satisfy two properties:
1. blocks must be pairwise disjoint (no overlap): $b_i \cap b_j = \emptyset, \ \forall i \neq j, \ b_i, b_j \in B_q$
2. together the blocks cover the domain D (complete): $\bigcup_{b \in B_q} b = D$

[Figure: domain D divided into blocks b1, b2, b3]

Coverage then means using at least one input from each of b1, b2, b3, . . .
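As a small illustration, a partition can be encoded as a function that maps each input to its block. The sign-based partition of `int` below is a hypothetical example, and `block_of` and `covers_all_blocks` are illustrative names.

```c
/* Hypothetical partition of the int domain by sign: three blocks. */
enum block { BLOCK_NEGATIVE = 0, BLOCK_ZERO = 1, BLOCK_POSITIVE = 2, BLOCK_COUNT = 3 };

/* Maps every input to exactly one block. Because this is a total function,
   the blocks are pairwise disjoint and together cover the whole domain. */
enum block block_of(int x) {
    if (x < 0) return BLOCK_NEGATIVE;
    if (x == 0) return BLOCK_ZERO;
    return BLOCK_POSITIVE;
}

/* Returns 1 if the test inputs include at least one value from each block. */
int covers_all_blocks(const int inputs[], int n) {
    int hit[BLOCK_COUNT] = {0, 0, 0};
    for (int i = 0; i < n; i++) {
        hit[block_of(inputs[i])] = 1;
    }
    return hit[BLOCK_NEGATIVE] && hit[BLOCK_ZERO] && hit[BLOCK_POSITIVE];
}
```

The inputs {3, 8, 0} leave the negative block untested, whereas {-4, 0, 7} draws one value from each block and therefore satisfies this coverage criterion.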

45

Syntax-Based Coverage

Based on mutation testing (a pet topic of Ammann and Offutt, who are heavily into this research area)

A bit different kind of creature than the other coverages we’ve looked at

Idea: generate many syntactic mutants of the original program

Coverage: how many mutants does a test suite kill (detect)?

46

Generation vs. Recognition

Generation of tests based on coverage means producing a test suite to achieve a certain level of coverage
• As you can imagine, generally very hard
• Consider: generating a suite for 100% statement coverage easily reaches “solving the halting problem” level
• Obviously hard for, say, mutant-killing

Recognition means seeing what level of coverage an existing test suite reaches

47

Coverage and Subsumption

Sometimes one coverage approach subsumes another
• If you achieve 100% coverage of criterion A, you are guaranteed to satisfy B as well
• For example, consider node and edge coverage
  • (there’s a subtlety here, actually – can you spot it?)

What does this mean?
• Unfortunately, not a great deal
• If test suite X satisfies “stronger” criterion A and test suite Y satisfies “weaker” criterion B
  • Y may still reveal bugs that X does not!
  • For example, consider our running example and statement vs. branch coverage
• It means we should take coverage with a grain of salt, for one thing

48

Levels of Testing

Adapted from Beizer, by Ammann and Offutt
• Level 0: Testing is debugging
• Level 1: Testing is to show the program works
• Level 2: Testing is to show the program doesn’t work
• Level 3: Testing is not to prove anything specific, but to reduce risk of using program
• Level 4: Testing is a mental discipline that helps develop higher quality software

49

What’s So Good About Coverage?

Consider a fault that causes failure every time the code is executed

Don’t execute the code: cannot possibly find the fault!

That’s a pretty good argument for statement coverage

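int findLast (int a[], int n, int x) {
  // Returns index of last element
  // in a equal to x, or -1 if no
  // such. n is length of a
  int i;
  for (i = n-1; i >= 0; i--) {
    if (a[i] = x)   // differs from the stated contract: assigns x instead of comparing with ==
      return i;
  }
  return 0;         // differs from the stated contract: should be -1 when x is absent
}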
