How to deal with input data characteristics in data-driven aquatic species distribution modelling: in search of an optimal cross-validation categorisation scheme

Shinji Fukuda, Esther Julia Olaya-Marín, Francisco Martínez-Capel, Ans M. Mouton

Shinji Fukuda, Assistant Professor, Faculty of Agriculture, Kyushu University


Post on 24-Feb-2016


Page 1

How to deal with input data characteristics in data-driven aquatic species distribution modelling: in search of an optimal cross-validation categorisation scheme

Shinji Fukuda, Esther Julia Olaya-Marín, Francisco Martínez-Capel, Ans M. Mouton

Shinji Fukuda, Assistant Professor, Faculty of Agriculture, Kyushu University

Page 2

OUTLINE

How to split our data in cross-validation?

1. Introduction
2. Method
   a) Species distribution data
   b) Modelling approach
   c) How to split data in cross-validation
   d) Performance measures
   e) Habitat information
3. Results and Discussion
4. Summary

Page 3

INTRODUCTION

Why is cross-validation necessary?

To ensure that your model is generally good
  Avoid over-fitting / improve generalisation ability
  Obtain better model parameters
    K-fold cross-validation
    Leave-one-out cross-validation
    Repeated random sub-sampling validation
To compare your target models fairly
  Nested cross-validation

[Figure: 10-fold nested CV; labels: Training, Testing, Calibration, Validation]
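As a rough illustration of nested cross-validation, the sketch below generates index sets for an outer training/testing split and an inner calibration/validation split. Reading the figure labels this way is an assumption, and the function names, the sample size of 100 and the seeds are hypothetical choices made only for this example.

```python
import random


def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle 0..n_samples-1 and deal the indices into n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def nested_cv_splits(n_samples, n_outer=10, n_inner=10, seed=0):
    """Yield (calibration, validation, testing) index lists.

    Outer loop: one fold is held out for testing, the rest is training.
    Inner loop: the training part is split again into calibration and
    validation folds (assumed reading of the slide's labels).
    """
    outer_folds = k_fold_indices(n_samples, n_outer, seed)
    for o, testing in enumerate(outer_folds):
        training = [i for f, fold in enumerate(outer_folds) if f != o for i in fold]
        # Inner folds hold positions within the training list, not raw indices.
        inner_folds = k_fold_indices(len(training), n_inner, seed + o + 1)
        for validation_pos in inner_folds:
            held_out = set(validation_pos)
            validation = [training[p] for p in validation_pos]
            calibration = [training[p] for p in range(len(training))
                           if p not in held_out]
            yield calibration, validation, testing


# Hypothetical data set of 100 samples: 10 outer x 10 inner = 100 splits.
splits = list(nested_cv_splits(n_samples=100, n_outer=10, n_inner=10))
```

Every split partitions the full index set, so a model tuned on the calibration and validation parts never sees the testing part.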

Page 4

OBJECTIVE

How to split our data when applying CV?

Compare three potential methods
  Random splitting
  Stratified splitting
    Prevalence-based splitting (output variable)
    Habitat space-based splitting (input variables)
Criteria
  Model performance
  Habitat information

[Figure labels: depth; Absence, Presence]

Page 5

METHODS

Species distribution data (3 species)

Anguilla anguilla (European eel): 46%
Luciobarbus guiraonis (Barbo Mediterraneo): 69%

Page 6

METHODS

Species distribution data (statistics)

Variable 1 (River): Cabriel 83, Jucar 42

Variable     2      3      4     5      6     7       8     9
Min.      0.00   0.80      0    28     95     3   20.69  0.00
1st Qu.   1.00   8.65      7   382   1746    97  122.99  0.15
Median    2.00  79.00     35   441   3734  1295  152.67  0.30
Mean      1.88  47.42     32   577   4187   943  165.58  0.41
3rd Qu.   3.00  79.00     54   818   3909  1295  191.57  0.46
Max.      5.00  79.00     54  1286  18296  4624  383.68  6.76

Variable    10     11     12     13     14    15    16
Min.      1153   5.76   0.09   0.11   0.28  0.42  0.15
1st Qu.   2727  12.00   2.85   3.25   4.52  0.74  0.24
Median    3595  14.25   5.32   5.51   6.44  0.84  0.31
Mean      3910  13.71   5.22   5.49   6.24  0.87  0.36
3rd Qu.   5314  15.30   7.02   7.48   8.35  0.93  0.47
Max.      6298  19.90  12.22  13.63  12.36  3.37  0.91

Variables:
1. River
2. Number of exotic fish species (%)
3. Channel length without artificial barriers (km)
4. Number of tributaries between artificial barriers
5. Altitude (m a.s.l.)
6. Drainage area (km²)
7. Drainage area between artificial barriers (km²)
8. Distance from headwater source (km)
9. Natural slope of the channel (%)
10. Solar radiation (Wh/m²)
11. Water temperature (°C)
12. Mean annual flow rate (m³/s)
13. Mean monthly flow (two years before sampling) (m³/s)
14. Inter-annual mean flow (calculated for 5 years) (m³/s)
15. Inter-monthly flow variation of the mean monthly flows (5 years before sampling)
16. Coefficient of variation of mean annual flows (calculated for 5 years)

Page 7

METHODS

How to split data (random or stratified)

Random splitting
#0 Read data
#1 Set the number of CV folds
#2 Set the number of calibration and validation data points
#3 Set a random seed
#4 Split the data set

[Figure legend: Absence, Presence]
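The random-splitting steps #0 to #4 could be sketched as follows. This is a minimal illustration rather than the authors' script: the toy data set, the fold count of 10 and the seed value are assumptions made for the example.

```python
import random

# 0 Read data: a hypothetical toy data set of (habitat values, presence/absence).
data = [((float(i), float(i % 7)), i % 2) for i in range(50)]

# 1 Set the number of CV folds.
n_folds = 10

# 2 Set the number of calibration and validation data points per fold.
#   (If len(data) is not a multiple of n_folds, the remainder is never
#   used for validation in this simple version.)
n_validation = len(data) // n_folds
n_calibration = len(data) - n_validation

# 3 Set a random seed so that the split is reproducible.
rng = random.Random(1)

# 4 Split the data set: shuffle once, then cut consecutive validation blocks.
order = list(range(len(data)))
rng.shuffle(order)
folds = []
for k in range(n_folds):
    validation = order[k * n_validation:(k + 1) * n_validation]
    held_out = set(validation)
    calibration = [i for i in order if i not in held_out]
    folds.append((calibration, validation))
```

Changing the seed in step #3 gives a different but equally valid partition, which is how the effect of the seed on performance can be examined.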

Page 8

METHODS

How to split data (random or stratified)

Stratified splitting (prevalence)
#0 Read data
#1 Set the number of CV folds
#2 Set the number of calibration and validation data points
#3 Set a random seed
#4 Split the data set according to prevalence or abundance
  #4-1 Separate presence and absence data
  #4-2 Split the presence data and the absence data separately according to the number of CV folds
  #4-3 Merge presence and absence data
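A minimal sketch of prevalence-based stratified splitting (steps #4-1 to #4-3) is given below. The function name and the toy labels with 46% presence are hypothetical, and dealing the shuffled indices round-robin into folds is one possible way to do the split, not necessarily the one used in the study.

```python
import random


def prevalence_stratified_folds(labels, n_folds, seed=0):
    """Return n_folds index lists with roughly equal prevalence."""
    rng = random.Random(seed)
    # 4-1 Separate presence and absence data.
    presence = [i for i, y in enumerate(labels) if y == 1]
    absence = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(presence)
    rng.shuffle(absence)
    # 4-2 Split each group into n_folds parts, then
    # 4-3 merge the presence and absence parts fold by fold.
    return [presence[k::n_folds] + absence[k::n_folds] for k in range(n_folds)]


# Toy labels with 46% presence (illustrative values only).
labels = [1] * 46 + [0] * 54
folds = prevalence_stratified_folds(labels, n_folds=10, seed=1)
prevalence_per_fold = [sum(labels[i] for i in fold) / len(fold) for fold in folds]
```

Each fold then serves once as the validation set while the remaining folds form the calibration set, and the prevalence in every fold stays close to that of the full data set.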

Page 9

METHODS

How to split data (random or stratified)

Stratified splitting (habitat)
#0 Read data
#1 Set the number of CV folds
#2 Set the number of calibration and validation data points
#3 Set a random seed
#4 Split the data set according to the Euclidean distance of habitat variables
  #4-1 Calculate the Euclidean distance of each data point from the centre of the habitat variable space
  #4-2 Split the data according to the Euclidean distance of habitat variables
  #4-3 Merge data sets to generate calibration and validation data sets

[Figure label: D]
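One way to read the habitat space-based procedure (steps #4-1 to #4-3) is sketched below. Several details are assumptions not stated on the slide: the centre is taken as the mean of each variable, the variables are standardised before the distance is computed, and points are assigned to folds by ranking the distances and spreading each block of neighbours across folds. The function name and toy data are hypothetical.

```python
import math
import random


def habitat_stratified_folds(habitat, n_folds, seed=0):
    """Return (folds, distance) with every fold spanning the distance range."""
    n = len(habitat)
    n_vars = len(habitat[0])
    # Standardise each variable (assumption) so no single unit dominates.
    means = [sum(row[j] for row in habitat) / n for j in range(n_vars)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in habitat) / n) or 1.0
           for j in range(n_vars)]
    # 4-1 Euclidean distance from the centre (here: the mean) of habitat space.
    distance = [math.sqrt(sum(((row[j] - means[j]) / sds[j]) ** 2
                              for j in range(n_vars)))
                for row in habitat]
    # 4-2 Rank by distance and cut the ranking into blocks of n_folds neighbours.
    ranked = sorted(range(n), key=lambda i: distance[i])
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for start in range(0, n, n_folds):
        block = ranked[start:start + n_folds]
        rng.shuffle(block)
        # 4-3 Merge: each fold receives one point from every distance block.
        for k, i in enumerate(block):
            folds[k].append(i)
    return folds, distance


# Toy habitat data with two variables (illustrative values only).
habitat = [(float(i), float((i * 37) % 50)) for i in range(100)]
folds, distance = habitat_stratified_folds(habitat, n_folds=10, seed=1)
```

With this assignment, every fold contains points both near the centre and near the edge of the habitat space, so calibration and validation sets cover similar habitat conditions.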

Page 10

METHODS

Random Forests (Breiman, 2001)

[Figure: Random Forests workflow with 500 trees (Tree 1, ..., Tree k, ..., Tree 500)]
  (1) Bootstrap sampling: creation of bootstrap samples (Sample 1, ..., Sample k, ..., Sample 500)
  (2) Development of CART: one classification tree is built for each sample
  (3) Voting for classification: the results (Result 1, ..., Result k, ..., Result 500) are combined by majority vote

Data (rows 1, 2, ..., 1064):
  Variable 1   ...   Variable n   P/A
  3.4          ...   0.23         0
  12.8         ...   0.19         1
  ...          ...   ...          ...
  107          ...   0.36         0

Model (rows 1, 2, ..., 1064):
  Occurrence   Probability
  0            0.005
  1            0.837
  0            0.129

Observation (rows 1, 2, ..., 1064): 0, 1, 0

Output: CCI, AUC, etc.

Note: no CV within training data
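The three steps of the workflow (bootstrap sampling, one tree per sample, voting) can be illustrated with a deliberately simplified toy. This is not the Random Forests algorithm as published: each "tree" here is a single-split stump instead of a full CART, and all names, the toy data and the choice of two candidate variables per tree are assumptions made for the sketch.

```python
import random
from collections import Counter


def majority(values):
    """Most frequent class in a list of 0/1 labels."""
    return Counter(values).most_common(1)[0][0]


def fit_stump(rows, labels, features):
    """Best one-split tree over the given features (a stand-in for CART)."""
    default = majority(labels)
    best = (len(labels) + 1, features[0], float("inf"), default, default)
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not right:
                continue
            l_cls, r_cls = majority(left), majority(right)
            errors = (sum(y != l_cls for y in left)
                      + sum(y != r_cls for y in right))
            if errors < best[0]:
                best = (errors, f, threshold, l_cls, r_cls)
    return best[1:]


def fit_forest(rows, labels, n_trees=500, n_try=2, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # (1) Bootstrap sampling: draw len(rows) rows with replacement.
        sample = [rng.randrange(len(rows)) for _ in rows]
        # (2) Grow one tree per sample on a random subset of variables.
        features = rng.sample(range(len(rows[0])), n_try)
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample], features))
    return forest


def predict_probability(forest, row):
    # (3) Voting: share of trees that predict presence (1).
    votes = [l_cls if row[f] <= t else r_cls for f, t, l_cls, r_cls in forest]
    return sum(votes) / len(votes)


# Toy data: presence (1) whenever the first variable exceeds 5.
rows = [(float(i % 10), float((i * 7) % 13), float((i * 3) % 11))
        for i in range(100)]
labels = [1 if row[0] > 5 else 0 for row in rows]
forest = fit_forest(rows, labels, n_trees=500, n_try=2, seed=1)
```

Thresholding the vote share at 0.5 turns the probability into the 0/1 occurrence shown in the model table.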

Page 11

METHODS

Performance measures

CCI: Correctly Classified Instances
  CCI = (TP + TN) / n, where TP and TN are the numbers of true positives and true negatives and n is the size of the data set
AUC: Area under the ROC curve
  Typically ranges between 0.5 (no better than random) and 1 (perfect)
NSE: Nash-Sutcliffe Efficiency
  $$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} \left(X_{\mathrm{obs},i} - X_{\mathrm{model},i}\right)^{2}}{\sum_{i=1}^{n} \left(X_{\mathrm{obs},i} - X_{\mathrm{mean}}\right)^{2}}$$
  Ranges between -∞ and 1, with 1 being perfect
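The three measures can be computed directly from observations and model outputs, as in the sketch below. The function names are hypothetical, the 0.5 threshold for converting probabilities to classes is an assumption, X_mean is taken as the mean of the observations, and the AUC uses the rank (Mann-Whitney) formulation. The three example values are the ones shown on the Random Forests slide.

```python
def cci(observed, predicted_class):
    """Correctly Classified Instances: (TP + TN) / n."""
    return sum(o == p for o, p in zip(observed, predicted_class)) / len(observed)


def auc(observed, probability):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation."""
    pos = [p for o, p in zip(observed, probability) if o == 1]
    neg = [p for o, p in zip(observed, probability) if o == 0]
    # A presence scored above an absence counts 1, a tie counts 0.5.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def nse(observed, modelled):
    """Nash-Sutcliffe Efficiency: 1 - residual sum of squares / spread."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - m) ** 2 for o, m in zip(observed, modelled))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread


observed = [0, 1, 0]
probability = [0.005, 0.837, 0.129]
# Assumed threshold of 0.5 to turn probabilities into presence/absence.
predicted_class = [1 if p >= 0.5 else 0 for p in probability]

cci_value = cci(observed, predicted_class)
auc_value = auc(observed, probability)
nse_value = nse(observed, probability)
```

CCI depends on the chosen threshold, whereas AUC and NSE are computed from the probabilities themselves.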

Page 12

METHODS

Habitat information

Variable importance
  Which habitat variables are important?
  Computed internally in the RF algorithm (varImpPlot)
Habitat suitability curve
  Which habitat conditions are important?
  Computed internally in the RF algorithm (partialPlot)

[Figure: example habitat suitability curves, panels (a) to (c): water depth (cm), flow velocity (cm s-1), percent vegetation coverage (%); y-axis: habitat suitability, 1: suitable, 0: unsuitable]
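The ideas behind the two kinds of habitat information can be sketched for any fitted model. The code below is a generic illustration and does not reproduce the internals or the plotting scale of varImpPlot and partialPlot: importance is measured as the drop in accuracy after shuffling one variable, and the suitability curve as the mean model output with one variable fixed at each grid value. The toy model, data and function names are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    return sum(model(row) == y for row, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, seed=0):
    """Decrease in accuracy after shuffling one habitat variable."""
    rng = random.Random(seed)
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)
    permuted = [row[:feature] + (value,) + row[feature + 1:]
                for row, value in zip(rows, shuffled)]
    return accuracy(model, rows, labels) - accuracy(model, permuted, labels)


def partial_dependence(score, rows, feature, grid):
    """Mean model score when one variable is fixed at each grid value."""
    curve = []
    for value in grid:
        fixed = [row[:feature] + (value,) + row[feature + 1:] for row in rows]
        curve.append(sum(score(row) for row in fixed) / len(fixed))
    return curve


def toy_model(row):
    """Hypothetical model: presence when depth (variable 0) is 10 to 30 cm."""
    return int(10.0 <= row[0] <= 30.0)


# Toy data: depth 0..45 cm crossed with velocity 0..90 cm/s (rows are tuples).
rows = [(float(d), float(v)) for d in range(0, 50, 5) for v in range(0, 100, 10)]
labels = [toy_model(row) for row in rows]

depth_importance = permutation_importance(toy_model, rows, labels, feature=0)
velocity_importance = permutation_importance(toy_model, rows, labels, feature=1)
depth_curve = partial_dependence(toy_model, rows, 0, [0.0, 20.0, 40.0])
```

Because the toy model ignores velocity, shuffling it leaves the accuracy unchanged, while shuffling depth lowers it; the depth curve is high only inside the suitable range.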

Page 13

RESULTS & DISCUSSION

Performance (A. anguilla): Random splitting

[Figure: NSE and AUC (y-axis, 0.5 to 1) against random seed (1 to 10), shown for calibration and validation]

Page 14

RESULTS & DISCUSSION

Performance (A. anguilla): Prevalence-based splitting

[Figure: NSE and AUC (y-axis, 0.5 to 1) against random seed (1 to 10), shown for calibration and validation]

Page 15

RESULTS & DISCUSSION

Performance (A. anguilla): Habitat space-based splitting

[Figure: NSE and AUC (y-axis, 0.5 to 1) against random seed (1 to 10), shown for calibration and validation]

Page 16

RESULTS & DISCUSSION

Variable importance (A. anguilla)

[Figure: mean decrease in accuracy (x-axis, 0 to 2) for habitat variable indices 1 to 16; panels: (a) Random splitting, (b) Prevalence-based splitting, (c) Habitat space-based splitting]

Page 17

RESULTS & DISCUSSION

Habitat suitability curves

[Figure: habitat suitability curves (y-axis, -2 to 0) for three habitat variables, each with panels (a) Random splitting, (b) Prevalence-based splitting, (c) Habitat space-based splitting]
  Altitude (m), x-axis 0 to 1000, labelled (high)
  Slope (%), x-axis 0 to 6, labelled (mid)
  Water temperature (°C), x-axis 5 to 20, labelled (low)

Page 18

SUMMARY

How to split the data when applying CV?

Three methods were compared
  Random splitting
  Prevalence-based splitting
  Habitat space-based splitting
Random Forests was used as an SDM
Model performance (AUC, NSE)
  Very similar (almost no influence on RF modelling)
Habitat information
  Variable importance was similar (accuracy was similar…)
  HSCs show variability in shape and range (to some extent, related to the degree of variable importance)
A much clearer influence may be observed for data with low prevalence

Page 19

A beautiful day in Ghent, Belgium

THANK YOU VERY MUCH!!!

Short summary
Cross-validation schemes: Random & Stratified (output & input)
Data-driven method: Random Forests
Model performance & habitat information
  Influence was not clear for accuracy and variable importance
  HSC shapes were variable