
How to deal with input data characteristics in data-driven aquatic species distribution modelling: in search of an optimal cross-validation categorisation scheme

Shinji Fukuda, Esther Julia Olaya-Marín, Francisco Martínez-Capel, Ans M. Mouton

Shinji Fukuda
Assistant Professor
Faculty of Agriculture
Kyushu University
shinji-fkd@agr.kyushu-u.ac.jp

How to split our data in cross-validation?
1. Introduction
2. Method
   a) Species distribution data
   b) Modelling approach
   c) How to split data in cross-validation
   d) Performance measures
   e) Habitat information
3. Results and Discussion
4. Summary


OUTLINE

Why is cross-validation necessary?
To ensure that your model is generally good
    Avoid over-fitting // improve generalisation ability
    Obtain better model parameters
        K-fold cross-validation
        Leave-one-out cross-validation
        Repeated random sub-sampling validation
To compare your target models fairly
    Nested cross-validation
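The three validation schemes named above differ only in how the indices of the data points are partitioned. The following is a minimal sketch in Python (standard library only), not the authors' code; the function names `k_fold`, `leave_one_out`, and `repeated_subsampling` are hypothetical and chosen for illustration.

```python
import random

def k_fold(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # deal shuffled indices into k folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield sorted(train), sorted(folds[i])

def leave_one_out(n):
    """Leave-one-out CV: every data point is the test set exactly once."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def repeated_subsampling(n, test_fraction, repeats, seed=0):
    """Yield (train, test) index lists for repeated random sub-sampling."""
    rng = random.Random(seed)
    n_test = max(1, round(n * test_fraction))
    for _ in range(repeats):
        test = set(rng.sample(range(n), n_test))  # a fresh random test set
        yield [j for j in range(n) if j not in test], sorted(test)
```

In k-fold and leave-one-out CV every data point is tested exactly once, whereas in repeated random sub-sampling a point may be tested several times or never.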

INTRODUCTION

10-fold nested CV
[Figure: data partitioning diagram with the labels Calibration, Validation, Testing, and Training]
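A nested cross-validation can be sketched as two loops over index folds. The sketch below is an assumption about how the four labels nest: the outer loop separates Testing from Training data, and the inner loop separates the Training data into Calibration and Validation data. The slide shows only the labels, so this arrangement and the names `split_folds` and `nested_cv` are illustrative, not the authors' implementation.

```python
import random

def split_folds(indices, k, rng):
    """Shuffle a copy of the indices and deal them into k folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def nested_cv(n, outer_k=10, inner_k=10, seed=0):
    """Yield (testing, [(calibration, validation), ...]) per outer fold."""
    rng = random.Random(seed)
    outer = split_folds(range(n), outer_k, rng)
    for i, testing in enumerate(outer):
        # Everything outside the current outer fold is training data.
        training = [j for f, fold in enumerate(outer) if f != i for j in fold]
        # Assumed inner loop: training data -> calibration + validation.
        inner = split_folds(training, inner_k, rng)
        pairs = []
        for v, validation in enumerate(inner):
            calibration = [j for f, fold in enumerate(inner) if f != v
                           for j in fold]
            pairs.append((calibration, validation))
        yield testing, pairs
```

The testing data of each outer fold never appear in the inner loop, which is what allows different models to be compared fairly.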

How to split our data when applying CV?
Compare three potential methods
    Random splitting
    Stratified splitting
        Prevalence-based splitting (output variable)
        Habitat space-based splitting (input variables)

OBJECTIVE

[Figure legend: Absence, Presence]

Criteria
    Model performance
    Habitat information

Species distribution data (3 species)

METHODS

Anguilla anguilla (European eel): 46%

Luciobarbus guiraonis (Barbo Mediterraneo): 69%

Species distribution data (statistics)

Statistic  1            2     3      4   5     6      7     8       9
Min.       Cabriel: 83  0.00  0.80   0   28    95     3     20.69   0.00
1st Qu.    Jucar: 42    1.00  8.65   7   382   1746   97    122.99  0.15
Median                  2.00  79.00  35  441   3734   1295  152.67  0.30
Mean                    1.88  47.42  32  577   4187   943   165.58  0.41
3rd Qu.                 3.00  79.00  54  818   3909   1295  191.57  0.46
Max.                    5.00  79.00  54  1286  18296  4624  383.68  6.76

Statistic  10    11     12     13     14     15    16
Min.       1153  5.76   0.09   0.11   0.28   0.42  0.15
1st Qu.    2727  12.00  2.85   3.25   4.52   0.74  0.24
Median     3595  14.25  5.32   5.51   6.44   0.84  0.31
Mean       3910  13.71  5.22   5.49   6.24   0.87  0.36
3rd Qu.    5314  15.30  7.02   7.48   8.35   0.93  0.47
Max.       6298  19.90  12.22  13.63  12.36  3.37  0.91

1. River
2. Number of exotic fish species (%)
3. Channel length without artificial barriers (km)
4. Number of tributaries between artificial barriers
5. Altitude (m a.s.l.)
6. Drainage area (km²)
7. Drainage area between artificial barriers (km²)
8. Distance from headwater source (km)
9. Natural slope of the channel (%)
10. Solar radiation (Wh/m²)
11. Water temperature (°C)
12. Mean annual flow rate (m³/s)
13. Mean monthly flow (two years before sampling) (m³/s)
14. Inter-annual mean flow (calculated for 5 years) (m³/s)
15. Inter-monthly flow variation of the mean monthly flows (5 years before sampling)
16. Coefficient of variation of mean annual flows (calculated for 5 years)

How to split data (random or stratified)
Random splitting

#0 Read data
#1 Set the number of CV folds
#2 Set the number of calibration and validation data points
#3 Set a random seed
#4 Split the data set
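Steps #0 to #4 of the random splitting can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' script: the data set is represented only by its number of points, and the function name `random_split` is hypothetical.

```python
import random

def random_split(n_points, n_folds, seed):
    """Return one (calibration, validation) pair of index lists per fold."""
    # 0 Read data: only the number of data points is needed for splitting.
    # 1 and 2: the fold count fixes the sizes, roughly n_points / n_folds
    #   points for validation and the remainder for calibration.
    # 3 Set a random seed so that the split can be reproduced.
    rng = random.Random(seed)
    # 4 Split the data set: shuffle the indices and deal them into folds.
    order = list(range(n_points))
    rng.shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, validation in enumerate(folds):
        calibration = [j for f, fold in enumerate(folds) if f != i
                       for j in fold]
        splits.append((calibration, validation))
    return splits

# Hypothetical usage: 20 data points, 5 folds, random seed 1.
splits = random_split(20, 5, seed=1)
```

Repeating the call with seeds 1 to 10 would give ten different partitions of the same data.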


[Figure legend: Absence, Presence]

How to split data (random or stratified)
Stratified splitting (prevalence)

#0 Read data
#1 Set the number of CV folds
#2 Set the number of calibration and validation data points
#3 Set a random seed
#4 Split the data set according to prevalence or abundance
    #4-1 Separate presence and absence data
    #4-2 Split each of presence and absence data according to the number of CV folds
    #4-3 Merge presence and absence data
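The prevalence-based steps #4-1 to #4-3 can be sketched in Python as follows. This is a minimal illustration, assuming presence is coded as 1 and absence as 0; the function name `prevalence_split` is hypothetical and the code is not taken from the authors.

```python
import random

def prevalence_split(labels, n_folds, seed):
    """Return (calibration, validation) index lists with similar prevalence."""
    rng = random.Random(seed)
    # 4-1 Separate presence and absence data.
    presence = [i for i, y in enumerate(labels) if y == 1]
    absence = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(presence)
    rng.shuffle(absence)
    # 4-2 Split each of presence and absence data into n_folds parts.
    presence_folds = [presence[i::n_folds] for i in range(n_folds)]
    absence_folds = [absence[i::n_folds] for i in range(n_folds)]
    # 4-3 Merge presence and absence data. The absence folds are paired in
    #     reverse order so that leftover points even out the fold sizes.
    folds = [p + a for p, a in zip(presence_folds, reversed(absence_folds))]
    splits = []
    for i, validation in enumerate(folds):
        calibration = [j for f, fold in enumerate(folds) if f != i
                       for j in fold]
        splits.append((calibration, validation))
    return splits
```

With this scheme every validation fold has nearly the same proportion of presences as the full data set, which plain random splitting does not guarantee.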


How to split data (random or stratified)
Stratified splitting (habitat)

#0 Read data
#1 Set the number of CV folds
#2 Set the number of calibration and validation data points
#3 Set a random seed
#4 Split the data set according to Euclidean distance of habitat variables
    #4-1 Calculate Euclidean distance of each data point from center of habitat variable space
    #4-2 Split the data according to the Euclidean distance of habitat variables
    #4-3 Merge data sets to generate calibration and validation data sets
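The habitat space-based steps #4-1 to #4-3 can be sketched in Python as follows. Several details are assumptions that the slide does not state: each habitat variable is standardised before the distance is computed, the center of the habitat variable space is taken as the mean of each variable, and the points are sorted by distance and dealt block by block so that every fold spans the full range of distances. The function name `habitat_split` is hypothetical.

```python
import math
import random

def habitat_split(habitat, n_folds, seed):
    """Return (calibration, validation) index lists stratified by distance."""
    n, n_vars = len(habitat), len(habitat[0])
    # Assumption: standardise each variable so that no unit dominates.
    means = [sum(row[v] for row in habitat) / n for v in range(n_vars)]
    sds = [math.sqrt(sum((row[v] - means[v]) ** 2 for row in habitat) / n)
           or 1.0 for v in range(n_vars)]
    # 4-1 Euclidean distance of each data point from the center.
    dist = [math.sqrt(sum(((row[v] - means[v]) / sds[v]) ** 2
                          for v in range(n_vars))) for row in habitat]
    # 4-2 Split according to distance: take blocks of n_folds neighbours in
    #     distance order and give each fold one point from every block.
    order = sorted(range(n), key=lambda i: dist[i])
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for start in range(0, n, n_folds):
        block = order[start:start + n_folds]
        rng.shuffle(block)
        for f, i in enumerate(block):
            folds[f].append(i)
    # 4-3 Merge folds into calibration and validation data sets.
    return [([j for f, fold in enumerate(folds) if f != i for j in fold],
             folds[i]) for i in range(n_folds)]
```

Other readings of step #4-2 are possible, for example cutting the distance range into a fixed number of strata, so this should be read as one plausible interpretation.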


Random Forests (Breiman, 2001)


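[Figure: Random Forests workflow]

Input data (habitat variables and presence/absence, P/A):

Variable 1  …  Variable n  P/A
3.4         …  0.23        0
12.8        …  0.19        1
…           …  …           …
107         …  0.36        0

(1) Bootstrap sampling: creation of bootstrap samples (Sample 1, …, Sample k, …, Sample 500)
(2) Development of CART: one classification tree is built for each sample (Tree 1, …, Tree k, …, Tree 500), giving Result 1, …, Result k, …, Result 500
(3) Voting for classification: the results are combined by majority vote

no CV within training data

Output:

Occurrence  Probability  Observation
0           0.005        0
1           0.837        1
0           0.129        0

Performance measures computed from the output: CCI, AUC, etc.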
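The three stages of the workflow (bootstrap sampling, one classifier per sample, majority voting) can be sketched in Python as follows. This is a deliberately simplified stand-in, not Random Forests itself: each "tree" is a one-split decision stump on a single randomly chosen variable, whereas Random Forests grows full CART trees with a random subset of variables at every split. All function names are hypothetical.

```python
import random

def bootstrap(n, rng):
    """Draw n indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def fit_stump(X, y, sample, var):
    """Find the threshold and direction on `var` that classify best."""
    best = None
    for t in sorted({X[i][var] for i in sample}):
        for sign in (1, -1):
            correct = sum((1 if sign * (X[i][var] - t) > 0 else 0) == y[i]
                          for i in sample)
            if best is None or correct > best[0]:
                best = (correct, var, t, sign)
    return best[1:]

def predict_stump(stump, row):
    var, t, sign = stump
    return 1 if sign * (row[var] - t) > 0 else 0

def fit_forest(X, y, n_trees=500, seed=0):
    """Stages 1 and 2: one stump per bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = bootstrap(len(X), rng)
        var = rng.randrange(len(X[0]))  # random variable for this stump
        forest.append(fit_stump(X, y, sample, var))
    return forest

def predict(forest, row):
    """Stage 3: majority vote; the vote share serves as a probability."""
    votes = sum(predict_stump(stump, row) for stump in forest)
    probability = votes / len(forest)
    return (1 if probability > 0.5 else 0), probability
```

The returned pair corresponds to the Occurrence and Probability columns of the figure; comparing them with the Observation column yields the performance measures.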

Performance measures
CCI: Correctly Classified Instances
    CCI = (TP + TN) / n
    TP: true positives; TN: true negatives; n: size of data set
AUC: Area under the ROC curve
    Typically ranges between 0.5 (no better than random) and 1 (perfect)
NSE: Nash-Sutcliffe Efficiency
    $\mathrm{NSE} = 1 - \dfrac{\sum_i (\mathrm{obs}_i - \mathrm{model}_i)^2}{\sum_i (\mathrm{obs}_i - \mathrm{mean}_{\mathrm{obs}})^2}$
    Ranges between -∞ and 1, with 1 being perfect
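The three performance measures can be computed directly from observed and predicted values. The sketch below is a minimal Python illustration with hypothetical function names; AUC is computed through its rank interpretation (the chance that a presence receives a higher predicted probability than an absence, with ties counted as one half), which is equivalent to the area under the ROC curve.

```python
def cci(observed, predicted):
    """Correctly Classified Instances: (TP + TN) / n."""
    return sum(o == p for o, p in zip(observed, predicted)) / len(observed)

def auc(observed, probability):
    """Area under the ROC curve via pairwise comparison of ranks."""
    pos = [p for o, p in zip(observed, probability) if o == 1]
    neg = [p for o, p in zip(observed, probability) if o == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def nse(observed, modelled):
    """Nash-Sutcliffe Efficiency: 1 - sum((obs-model)^2) / sum((obs-mean)^2)."""
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - m) ** 2 for o, m in zip(observed, modelled))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - residual / spread

# Example with three observations and predicted probabilities.
observation = [0, 1, 0]
probability = [0.005, 0.837, 0.129]
occurrence = [1 if p > 0.5 else 0 for p in probability]
scores = (cci(observation, occurrence),
          auc(observation, probability),
          nse(observation, probability))
```

CCI needs the thresholded occurrence, whereas AUC and NSE use the predicted probability directly, so they do not depend on the choice of threshold.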

Habitat information
Variable importance
    Which habitat variables are important?
    Computed internally in the RF algorithm (varImpPlot)
Habitat suitability curve
    Which habitat conditions are important?
    Computed internally in the RF algorithm (partialPlot)
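The ideas behind the two kinds of habitat information can be sketched in Python as follows. This is a generic illustration, not the internal computation of the RF software: importance is taken as the decrease in accuracy after shuffling one variable, and the suitability curve as the average prediction when one variable is fixed at each value of a grid. The `predict` argument is a hypothetical function that maps one row of habitat variables to a probability of presence.

```python
import random

def accuracy(predict, X, y):
    """Share of rows whose thresholded prediction matches the label."""
    return sum((predict(row) >= 0.5) == (label == 1)
               for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, var, seed=0):
    """Decrease in accuracy after shuffling the values of one variable."""
    rng = random.Random(seed)
    column = [row[var] for row in X]
    rng.shuffle(column)
    shuffled = [row[:var] + [value] + row[var + 1:]
                for row, value in zip(X, column)]
    return accuracy(predict, X, y) - accuracy(predict, shuffled, y)

def partial_dependence(predict, X, var, grid):
    """Average prediction with variable `var` fixed at each grid value."""
    curve = []
    for value in grid:
        rows = [row[:var] + [value] + row[var + 1:] for row in X]
        curve.append(sum(predict(r) for r in rows) / len(rows))
    return curve

# Hypothetical toy model: presence depends only on the first variable.
def toy_model(row):
    return 1.0 if row[0] > 5 else 0.0

X = [[float(i), float(i % 3)] for i in range(12)]
y = [1 if i > 5 else 0 for i in range(12)]
importance = [permutation_importance(toy_model, X, y, v) for v in range(2)]
curve = partial_dependence(toy_model, X, 0, [0.0, 10.0])
```

A variable that the model ignores has an importance of zero and a flat curve, which is why the shape of a habitat suitability curve is to some extent related to the importance of the variable.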


[Figure: example habitat suitability curves for water depth (cm), flow velocity (cm s^-1), and percent vegetation coverage (%); habitat suitability ranges from 0 (unsuitable) to 1 (suitable)]

Performance (A. anguilla): Random

RESULTS & DISCUSSION

[Figure: two panels of model performance against random seed (1 to 10); legend: Calibration, Validation]

Performance (A. anguilla): Prevalence

[Figure: two panels of model performance against random seed (1 to 10)]

Performance (A. anguilla): Habitat space

[Figure: two panels of model performance against random seed (1 to 10)]

Variable importance (A. anguilla)

[Figure: mean decrease in accuracy for habitat variables 1 to 16, in three panels: (a) Random splitting, (b) Prevalence-based splitting, (c) Habitat space-based splitting]

Habitat suitability curves

[Figure: habitat suitability curves for altitude (m), slope (%), and water temperature (°C), three panels each; annotations on the figure: Altitude, Water Temp., (high)]

How to split the data when applying CV?
Three methods were compared
    Random splitting
    Prevalence-based splitting
    Habitat space-based splitting
Random Forests was used as an SDM
Model performance (AUC, NSE)
    Very similar (almost no influence on RF modelling)
Habitat information
    Variable importance was similar (accuracy was similar…)
    HSCs show variability in shape and range (to some extent, related to the degree of variable importance)
Much clearer influence may be observed in case of data with low prevalence

SUMMARY

A beautiful day in Ghent, Belgium

THANK YOU VERY MUCH!!!

Short summary
    Cross-validation schemes: Random & Stratified (output & input)
    Data-driven method: Random Forests
    Model performance & habitat information
    Influence was not clear for accuracy and variable importance
    HSC shapes were variable
