
How to deal with input data characteristics in data-driven aquatic species distribution modelling: in search of an optimal cross-validation categorisation scheme

Shinji Fukuda, Esther Julia Olaya-Marín, Francisco Martínez-Capel, Ans M. Mouton

Shinji Fukuda
Assistant Professor, Faculty of Agriculture, Kyushu University

OUTLINE

How to split our data in cross-validation?
1. Introduction
2. Method
   a) Species distribution data
   b) Modelling approach
   c) How to split data in cross-validation
   d) Performance measures
   e) Habitat information
3. Results and Discussion
4. Summary

INTRODUCTION

Why is cross-validation necessary?
  To ensure that your model is generally good
    Avoid over-fitting (improve generalisation ability)
    Obtain better model parameters
    K-fold cross-validation
    Leave-one-out cross-validation
    Repeated random sub-sampling validation
  To compare your target models fairly
    Nested cross-validation
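The three validation schemes named on this slide differ only in how record indices are assigned to calibration and validation sets. The following is a minimal sketch, assuming the records are addressed by index 0 to n-1; the function names are illustrative and do not come from the slides.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def leave_one_out_indices(n):
    """Leave-one-out is k-fold CV with k = n: every fold holds one record."""
    return [[i] for i in range(n)]

def repeated_random_subsampling(n, n_validation, repeats, seed=0):
    """Draw an independent random validation set for each repeat."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n), n_validation)) for _ in range(repeats)]

def calibration_validation_pairs(folds):
    """Yield (calibration, validation) index lists, one pair per fold."""
    for i, validation in enumerate(folds):
        calibration = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield calibration, validation

# Toy usage: 23 records, 5 folds.
folds = k_fold_indices(n=23, k=5, seed=3)
pairs = list(calibration_validation_pairs(folds))
```

In k-fold and leave-one-out CV every record is used for validation exactly once, whereas repeated random sub-sampling may select the same record several times or never.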

OBJECTIVE

How to split our data when applying CV?
Compare three potential methods
  Random splitting
  Stratified splitting
    Prevalence-based splitting (output variable)
    Habitat space-based splitting (input variables)
Criteria
  Model performance
  Habitat information

[Figure: 10-fold nested CV, with calibration/validation and training/testing subsets]
[Figure: presence and absence records, with a depth axis]

METHODS

Species distribution data (3 species)
  Anguilla anguilla (European eel): 46%
  Luciobarbus guiraonis (Barbo Mediterraneo): 69%
  Squalius pyrenaicus (image: http://en.wikipedia.org/wiki/File:Squalius_pyrenaicus_02_by-dpc.jpg)

Species distribution data (statistics)

Variable  Min.     1st Qu.  Median   Mean     3rd Qu.  Max.
2         0.00     1.00     2.00     1.88     3.00     5.00
3         0.80     8.65     79.00    47.42    79.00    79.00
4         0        7        35       32       54       54
5         28       382      441      577      818      1286
6         95       1746     3734     4187     3909     18296
7         3        97       1295     943      1295     4624
8         20.69    122.99   152.67   165.58   191.57   383.68
9         0.00     0.15     0.30     0.41     0.46     6.76
10        1153     2727     3595     3910     5314     6298
11        5.76     12.00    14.25    13.71    15.30    19.90
12        0.09     2.85     5.32     5.22     7.02     12.22
13        0.11     3.25     5.51     5.49     7.48     13.63
14        0.28     4.52     6.44     6.24     8.35     12.36
15        0.42     0.74     0.84     0.87     0.93     3.37
16        0.15     0.24     0.31     0.36     0.47     0.91

Variable 1 (River) is categorical: Cabriel 83, Jucar 42.

Variables
  1. River
  2. Number of exotic fish species (%)
  3. Channel length without artificial barriers (km)
  4. Number of tributaries between artificial barriers
  5. Altitude (m a.s.l.)
  6. Drainage area (km²)
  7. Drainage area between artificial barriers (km²)
  8. Distance from headwater source (km)
  9. Natural slope of the channel (%)
  10. Solar radiation (Wh/m²)
  11. Water temperature (°C)
  12. Mean annual flow rate (m³/s)
  13. Mean monthly flow (two years before sampling) (m³/s)
  14. Inter-annual mean flow (calculated for 5 years) (m³/s)
  15. Inter-monthly flow variation of the mean monthly flows (5 years before sampling)
  16. Coefficient of variation of mean annual flows (calculated for 5 years)

How to split data (random or stratified)
Random splitting
  #0 Read data
  #1 Set the number of CV folds
  #2 Set the number of calibration and validation data points
  #3 Set a random seed
  #4 Split the data set
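Steps #0 to #4 of random splitting can be sketched as follows. This is an illustration only, not the authors' script: the toy records, the choice of 10 folds and the function name `random_split` are assumptions.

```python
import random

def random_split(records, n_folds, seed):
    """Steps #3 and #4: shuffle with a fixed seed, then deal records into folds."""
    rng = random.Random(seed)      # #3: a fixed seed makes the split repeatable
    shuffled = records[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]  # #4

# #0: stand-in for reading the data; each record is (habitat values, presence/absence)
records = [((float(i), float(i % 7)), i % 2) for i in range(50)]

# #1: number of CV folds (assumed here)
n_folds = 10

# #2: with 10 folds, each validation set holds 1/10 of the records
folds = random_split(records, n_folds, seed=1)
validation = folds[0]
calibration = [r for fold in folds[1:] for r in fold]
```

Changing `seed` produces a different partition; plotting performance against the random seed shows how sensitive a result is to that choice.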

How to split data (random or stratified)
Stratified splitting (prevalence)
  #0 Read data
  #1 Set the number of CV folds
  #2 Set the number of calibration and validation data points
  #3 Set a random seed
  #4 Split the data set according to prevalence or abundance
    #4-1 Separate presence and absence data
    #4-2 Split each of presence and absence data according to the number of CV folds
    #4-3 Merge presence and absence data
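Steps #4-1 to #4-3 of prevalence-based splitting can be sketched as below, so that every fold keeps roughly the prevalence of the whole data set. The record layout, the toy prevalence of 46% (23 presences in 50 records) and the offset used to balance fold sizes are assumptions made for this illustration.

```python
import random

def prevalence_stratified_split(records, n_folds, seed):
    """Split presences and absences separately, then merge them fold by fold."""
    rng = random.Random(seed)
    # #4-1: separate presence (1) and absence (0) records
    presences = [r for r in records if r[1] == 1]
    absences = [r for r in records if r[1] == 0]
    rng.shuffle(presences)
    rng.shuffle(absences)
    # #4-2: deal each class into the folds in turn
    pres_folds = [presences[i::n_folds] for i in range(n_folds)]
    offset = len(presences) % n_folds   # start where the presences stopped
    abs_folds = [[] for _ in range(n_folds)]
    for i, r in enumerate(absences):
        abs_folds[(i + offset) % n_folds].append(r)
    # #4-3: merge presence and absence data within each fold
    return [p + a for p, a in zip(pres_folds, abs_folds)]

# Toy data: 50 records, 23 presences (prevalence 46%)
records = [((float(i),), 1 if i < 23 else 0) for i in range(50)]
folds = prevalence_stratified_split(records, n_folds=10, seed=1)
presences_per_fold = [sum(r[1] for r in fold) for fold in folds]
```

Each fold here receives two or three presences out of five records, so no validation set ends up with presences or absences only.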

How to split data (random or stratified)
Stratified splitting (habitat)
  #0 Read data
  #1 Set the number of CV folds
  #2 Set the number of calibration and validation data points
  #3 Set a random seed
  #4 Split the data set according to Euclidean distance of habitat variables
    #4-1 Calculate the Euclidean distance of each data point from the center of the habitat variable space
    #4-2 Split the data according to the Euclidean distance of habitat variables
    #4-3 Merge data sets to generate calibration and validation data sets
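One possible reading of steps #4-1 to #4-3 is sketched below. The slides do not say how the center is defined, whether variables are scaled, or how distances are turned into folds. This sketch therefore assumes the center is the mean of each variable, standardises each variable so that units do not dominate, and spreads records with similar distance across different folds. All names are hypothetical.

```python
import math
import random
import statistics

def habitat_stratified_split(habitat, n_folds, seed):
    """Return n_folds lists of row indices, stratified by distance from the center."""
    columns = list(zip(*habitat))
    means = [statistics.fmean(c) for c in columns]
    sds = [statistics.pstdev(c) or 1.0 for c in columns]  # guard constant variables
    # #4-1: Euclidean distance of each record from the center (assumed: the mean)
    dist = [
        math.sqrt(sum(((v - m) / s) ** 2 for v, m, s in zip(row, means, sds)))
        for row in habitat
    ]
    order = sorted(range(len(habitat)), key=dist.__getitem__)
    # #4-2: walk outward in blocks of n_folds neighbours; each block member
    # goes to a different, randomly chosen fold
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for start in range(0, len(order), n_folds):
        targets = list(range(n_folds))
        rng.shuffle(targets)
        for idx, fold in zip(order[start:start + n_folds], targets):
            folds[fold].append(idx)
    return folds

# Toy habitat table with two variables and 43 records
habitat = [(float(i), float((i * 7) % 11)) for i in range(43)]
folds = habitat_stratified_split(habitat, n_folds=5, seed=2)
validation = folds[0]
calibration = sorted(i for fold in folds[1:] for i in fold)  # #4-3: merge
```

Under these assumptions every fold contains records from both the core and the edge of the sampled habitat space.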

Random Forests (Breiman, 2001)
  (1) Bootstrap sampling: create bootstrap samples from the data (Sample 1 ... Sample k ... Sample 500)
  (2) Development of CART: build a classification tree for each sample (Tree 1 ... Tree k ... Tree 500)
  (3) Voting for classification: combine the results of all trees (Result 1 ... Result k ... Result 500) by majority vote
  No CV within training data

[Figure: schematic of the workflow. Data: records 1 to 1064 with Variable 1 ... Variable n and P/A (presence/absence). Model: predicted occurrence (0 or 1) and probability (e.g. 0.005, 0.837, 0.129) for each record. Output: CCI, AUC, etc., computed by comparing model and observation.]
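The three stages of the schematic (bootstrap sampling, one tree per sample, voting) can be illustrated with a deliberately simplified sketch. It is not Breiman's Random Forests: each "tree" is a one-split decision stump on a randomly chosen variable, standing in for a full CART tree, and the data are toy values. All function names are hypothetical.

```python
import random

def fit_stump(rows, labels, rng):
    """Stand-in for a CART tree: one threshold on one randomly chosen variable."""
    j = rng.randrange(len(rows[0]))
    best = None
    for t in sorted({r[j] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[j] <= t]
        right = [y for r, y in zip(rows, labels) if r[j] > t]
        l_pred = int(sum(left) * 2 >= len(left))            # majority class on the left
        r_pred = int(sum(right) * 2 >= len(right)) if right else l_pred
        errors = sum(y != l_pred for y in left) + sum(y != r_pred for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_pred, r_pred)
    _, t, l_pred, r_pred = best
    return lambda row: l_pred if row[j] <= t else r_pred

def fit_forest(rows, labels, n_trees=500, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    trees = []
    for _ in range(n_trees):
        # (1) bootstrap sample: draw n records with replacement
        sample = [rng.randrange(n) for _ in range(n)]
        # (2) one tree (here: a stump) per bootstrap sample
        trees.append(fit_stump([rows[i] for i in sample],
                               [labels[i] for i in sample], rng))
    return trees

def predict_probability(trees, row):
    """(3) Fraction of trees voting 'presence'; above 0.5 is a majority vote."""
    return sum(tree(row) for tree in trees) / len(trees)

# Toy data: presence (1) whenever the first variable is 10 or more
rows = [(float(i), float(i) * 2.0) for i in range(20)]
labels = [int(i >= 10) for i in range(20)]
trees = fit_forest(rows, labels, n_trees=500, seed=1)
p_high = predict_probability(trees, (19.0, 38.0))
p_low = predict_probability(trees, (0.0, 0.0))
```

The vote fraction plays the role of the probability column in the schematic, and thresholding it gives the predicted occurrence.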

Performance measures
  CCI: Correctly Classified Instances
    CCI = (TP + TN) / n
    TP: true positives, TN: true negatives, n: size of data set
  AUC: Area under the ROC curve
    Typically ranges between 0.5 (no better than random) and 1, with 1 being perfect
  NSE: Nash-Sutcliffe Efficiency
    $$ NSE = 1 - \frac{\sum_{i=1}^{n} \left( X_{\mathrm{obs},i} - X_{\mathrm{model},i} \right)^2}{\sum_{i=1}^{n} \left( X_{\mathrm{obs},i} - X_{\mathrm{mean}} \right)^2} $$
    Ranges between -∞ and 1, with 1 being perfect
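The three measures can be computed directly from observed presence/absence and modelled probabilities. The sketch below uses toy values; the 0.5 threshold for converting probabilities to classes and the rank-based form of AUC are assumptions of this illustration, not details given in the slides.

```python
def cci(observed, predicted_class):
    """Correctly classified instances: (TP + TN) / n."""
    return sum(o == p for o, p in zip(observed, predicted_class)) / len(observed)

def auc(observed, probability):
    """AUC as the chance that a random presence scores above a random absence.

    Ties count as one half. Requires at least one presence and one absence.
    """
    pos = [p for o, p in zip(observed, probability) if o == 1]
    neg = [p for o, p in zip(observed, probability) if o == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def nse(observed, modelled):
    """Nash-Sutcliffe efficiency: 1 - SSE / variation of observations about their mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - m) ** 2 for o, m in zip(observed, modelled))
    total = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / total

# Toy example
observed = [0, 1, 0, 1]
probability = [0.005, 0.837, 0.129, 0.4]
predicted = [int(p >= 0.5) for p in probability]   # assumed threshold of 0.5

print(cci(observed, predicted))      # 0.75
print(auc(observed, probability))    # 1.0
print(round(nse(observed, probability), 4))
```

In this example the fourth record is misclassified at the 0.5 threshold, which lowers CCI and NSE, while AUC stays at 1 because every presence still scores higher than every absence.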

Habitat information
  Variable importance
    Which habitat variables are important?
    Computed internally in the RF algorithm (varImpPlot)
  Habitat suitability curve
    Which habitat conditions are important?
    Computed internally in the RF algorithm (partialPlot)

[Figure: example habitat suitability curves, panels (a) to (c): water depth (cm), flow velocity (cm s−1) and percent vegetation coverage (%); suitability ranges from 0 (unsuitable) to 1 (suitable)]
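The ideas behind the two kinds of habitat information can be sketched independently of any particular package. The functions below are illustrative stand-ins, not the `varImpPlot` or `partialPlot` routines themselves. Importance is taken as the drop in accuracy after permuting one variable, and a suitability curve as the average model response when one variable is fixed at each value of a grid. The toy model and data are hypothetical.

```python
import random

def accuracy(model, rows, labels):
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, j, seed=0):
    """Decrease in accuracy after shuffling variable j across the records."""
    rng = random.Random(seed)
    shuffled = [r[j] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, shuffled)]
    return accuracy(model, rows, labels) - accuracy(model, permuted, labels)

def partial_dependence(score, rows, j, grid):
    """Average model response when variable j is fixed at each grid value."""
    curve = []
    for v in grid:
        fixed = [r[:j] + (v,) + r[j + 1:] for r in rows]
        curve.append(sum(score(r) for r in fixed) / len(fixed))
    return curve

# Hypothetical model: presence only above 500 m altitude; temperature is ignored
def model(row):
    return int(row[0] > 500.0)

rows = [(float(a), 5.0 + (a % 15)) for a in range(0, 1000, 50)]  # (altitude, temperature)
labels = [model(r) for r in rows]

imp_altitude = permutation_importance(model, rows, labels, j=0, seed=1)
imp_temperature = permutation_importance(model, rows, labels, j=1, seed=1)
curve = partial_dependence(model, rows, j=0, grid=[0.0, 250.0, 750.0, 1000.0])
```

A variable the model ignores gets an importance of exactly zero and a flat curve, which is one reason curve shapes for low-importance variables are harder to interpret.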

RESULTS & DISCUSSION

Performance (A. anguilla): Random
[Figure: NSE and AUC (y-axes, 0.5 to 1) plotted against random seed (1 to 10), for calibration and validation]
Performance (A. anguilla): Prevalence
[Figure: NSE and AUC (y-axes, 0.5 to 1) plotted against random seed (1 to 10)]
Performance (A. anguilla): Habitat space
[Figure: NSE and AUC (y-axes, 0.5 to 1) plotted against random seed (1 to 10)]
Variable importance (A. anguilla)
[Figure: mean decrease in accuracy (x-axis, 0 to 2) for habitat variable index 1 to 16; panels (a) Random splitting, (b) Prevalence-based splitting, (c) Habitat space-based splitting]

Habitat suitability curves
[Figure: habitat suitability curves (y-axes, −2 to 0), three panels each for altitude (m, 0 to 1000), slope (%, 0 to 6) and water temperature (°C, 5 to 20); rows labelled Altitude (high), Slope (mid) and Water Temp. (low)]

SUMMARY

How to split the data when applying CV?
  Three methods were compared
    Random splitting
    Prevalence-based splitting
    Habitat space-based splitting
  Random Forests was used as an SDM
Model performance (AUC, NSE)
  Very similar (almost no influence on RF modelling)
Habitat information
  Variable importance was similar (accuracy was similar…)
  HSCs show variability in shape and range (to some extent, related to the degree of variable importance)
A much clearer influence may be observed for data with low prevalence

A beautiful day in Ghent, Belgium

THANK YOU VERY MUCH!!!

Short summary
  Cross-validation schemes: Random & Stratified (output & input)
  Data-driven method: Random Forests
  Model performance & habitat information
    Influence was not clear for accuracy and variable importance
    HSC shapes were variable
