How to deal with input data characteristics in data-driven aquatic species distribution modelling: in search of an optimal cross-validation categorisation scheme
Shinji Fukuda, Esther Julia Olaya-Marín, Francisco Martínez-Capel, Ans M. Mouton
Shinji Fukuda, Assistant Professor, Faculty of Agriculture, Kyushu [email protected]
OUTLINE
How to split our data in cross-validation?
1. Introduction
2. Method
  a) Species distribution data
  b) Modelling approach
  c) How to split data in cross-validation
  d) Performance measures
  e) Habitat information
3. Results and Discussion
4. Summary
INTRODUCTION
Why is cross-validation necessary?
To ensure that your model is generally good
  Avoid over-fitting / improve generalisation ability
  Obtain better model parameters
  K-fold cross-validation
  Leave-one-out cross-validation
  Repeated random sub-sampling validation
To compare your target models fairly
  Nested cross-validation
OBJECTIVE
How to split our data when applying CV?
Compare three potential methods
  Random splitting
  Stratified splitting
    Prevalence-based splitting (output variable)
    Habitat space-based splitting (input variables)
Criteria
  Model performance
  Habitat information
[Figure: 10-fold nested CV diagram with parts labelled Calibration, Validation, Training and Testing; plot of Absence and Presence points with a depth axis]
METHODS
Species distribution data (3 species)
  Anguilla anguilla (European eel): 46%
  Luciobarbus guiraonis (Barbo Mediterraneo): 69%
Species distribution data (statistics)

Variable  1            2     3      4   5     6      7     8       9
Min.      Cabriel: 83  0.00  0.80   0   28    95     3     20.69   0.00
1st Qu.   Jucar: 42    1.00  8.65   7   382   1746   97    122.99  0.15
Median                 2.00  79.00  35  441   3734   1295  152.67  0.30
Mean                   1.88  47.42  32  577   4187   943   165.58  0.41
3rd Qu.                3.00  79.00  54  818   3909   1295  191.57  0.46
Max.                   5.00  79.00  54  1286  18296  4624  383.68  6.76

Variable  10    11     12     13     14     15    16
Min.      1153  5.76   0.09   0.11   0.28   0.42  0.15
1st Qu.   2727  12.00  2.85   3.25   4.52   0.74  0.24
Median    3595  14.25  5.32   5.51   6.44   0.84  0.31
Mean      3910  13.71  5.22   5.49   6.24   0.87  0.36
3rd Qu.   5314  15.30  7.02   7.48   8.35   0.93  0.47
Max.      6298  19.90  12.22  13.63  12.36  3.37  0.91

Variables:
1. River
2. Number of exotic fish species (%)
3. Channel length without artificial barriers (km)
4. Number of tributaries between artificial barriers
5. Altitude (m a.s.l.)
6. Drainage area (km²)
7. Drainage area between artificial barriers (km²)
8. Distance from headwater source (km)
9. Natural slope of the channel (%)
10. Solar radiation (Wh/m²)
11. Water temperature (°C)
12. Mean annual flow rate (m³/s)
13. Mean monthly flow, two years before sampling (m³/s)
14. Inter-annual mean flow, calculated for 5 years (m³/s)
15. Inter-monthly flow variation of the mean monthly flows (5 years before sampling)
16. Coefficient of variation of mean annual flows (calculated for 5 years)
How to split data (random or stratified)
Random splitting
  #0 Read data
  #1 Set the number of CV folds
  #2 Set the number of calibration and validation data points
  #3 Set a random seed
  #4 Split the data set
[Figure legend: Absence, Presence]
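The random splitting steps (#1 to #4) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function names `random_split` and `calibration_validation` are hypothetical, and the data set size of 125 and the seed are arbitrary example values.

```python
import random

def random_split(n_points, n_folds, seed):
    """Assign each data point index to one of n_folds folds at random."""
    rng = random.Random(seed)            # step #3: fix the random seed
    indices = list(range(n_points))
    rng.shuffle(indices)                 # step #4: shuffle, then deal into folds
    return [indices[k::n_folds] for k in range(n_folds)]

def calibration_validation(folds, k):
    """Use fold k for validation and all remaining folds for calibration."""
    validation = sorted(folds[k])
    calibration = sorted(i for j, f in enumerate(folds) if j != k for i in f)
    return calibration, validation

# Example values (assumptions): 125 data points, 10 folds, seed 1.
folds = random_split(n_points=125, n_folds=10, seed=1)
cal, val = calibration_validation(folds, 0)
print(len(cal), len(val))
```

Re-running with the same seed reproduces the same folds, which is what makes a comparison across seeds 1 to 10 repeatable.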
How to split data (random or stratified)
Stratified splitting (prevalence)
  #0 Read data
  #1 Set the number of CV folds
  #2 Set the number of calibration and validation data points
  #3 Set a random seed
  #4 Split the data set according to prevalence or abundance
    #4-1 Separate presence and absence data
    #4-2 Split each of presence and absence data according to the number of CV folds
    #4-3 Merge presence and absence data
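A minimal Python sketch of the prevalence-based steps (#4-1 to #4-3) follows. The function name `prevalence_stratified_split` and the example data (46 presences, 54 absences) are assumptions chosen for illustration, not taken from the original data set.

```python
import random

def prevalence_stratified_split(occurrence, n_folds, seed):
    """Split indices into folds that each keep roughly the overall prevalence.

    occurrence: list of 0/1 values (absence/presence), one per data point.
    """
    rng = random.Random(seed)
    # Step #4-1: separate presence and absence data.
    presence = [i for i, y in enumerate(occurrence) if y == 1]
    absence = [i for i, y in enumerate(occurrence) if y == 0]
    rng.shuffle(presence)
    rng.shuffle(absence)
    # Steps #4-2 and #4-3: split each group over the folds, then merge per fold.
    return [presence[k::n_folds] + absence[k::n_folds] for k in range(n_folds)]

# Hypothetical data: 46 presences and 54 absences (prevalence 46%).
occurrence = [1] * 46 + [0] * 54
folds = prevalence_stratified_split(occurrence, n_folds=10, seed=1)
for fold in folds:
    print(len(fold), sum(occurrence[i] for i in fold))
```

Each fold ends up with 4 or 5 presences, so the validation prevalence stays close to that of the full data set regardless of the seed.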
How to split data (random or stratified)
Stratified splitting (habitat)
  #0 Read data
  #1 Set the number of CV folds
  #2 Set the number of calibration and validation data points
  #3 Set a random seed
  #4 Split the data set according to Euclidean distance of habitat variables
    #4-1 Calculate Euclidean distance of each data point from the center of habitat variable space
    #4-2 Split the data according to the Euclidean distance of habitat variables
    #4-3 Merge data sets to generate calibration and validation data sets
[Figure label: D]
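One possible reading of the habitat space-based steps (#4-1 to #4-3) is sketched below in Python. Several details are assumptions that the slide does not specify: the variables are z-scored before computing distances, the center is the variable-wise mean, and strata are consecutive blocks of `n_folds` points in the distance ranking. The function name `habitat_stratified_split` and the example data are hypothetical.

```python
import math
import random

def habitat_stratified_split(X, n_folds, seed):
    """Split row indices of X into folds stratified by distance from the center.

    X: list of rows, each a list of habitat variable values.
    Assumptions (not from the source): z-scored variables, center = mean,
    strata = consecutive blocks of n_folds points in the distance ranking.
    """
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) or 1.0
           for j in range(p)]
    # Step #4-1: Euclidean distance from the center of habitat variable space.
    dist = [math.sqrt(sum(((row[j] - means[j]) / sds[j]) ** 2 for j in range(p)))
            for row in X]
    order = sorted(range(n), key=lambda i: dist[i])
    # Step #4-2: points that are neighbours in the ranking go to different folds.
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for start in range(0, n, n_folds):
        fold_ids = list(range(n_folds))
        rng.shuffle(fold_ids)
        for i, k in zip(order[start:start + n_folds], fold_ids):
            folds[k].append(i)   # step #4-3: each fold collects points from all strata
    return folds, dist

# Hypothetical data: 50 points with two habitat variables.
data_rng = random.Random(0)
X = [[data_rng.uniform(0, 1300), data_rng.uniform(5, 20)] for _ in range(50)]
folds, dist = habitat_stratified_split(X, n_folds=5, seed=1)
```

With this scheme every fold contains points from near the center as well as from the edges of habitat space, so no validation fold is made up only of extreme habitat conditions.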
Random Forests (Breiman, 2001)
[Figure: Random Forests workflow.
  Data: records 1 to 1064 with Variable 1, …, Variable n and presence/absence (P/A).
  (1) Bootstrap sampling: creation of bootstrap samples (Sample 1, …, Sample k, …, Sample 500).
  (2) Development of CART: one classification tree per sample (Tree 1, …, Tree k, …, Tree 500).
  (3) Voting for classification: the tree results (Result 1, …, Result k, …, Result 500) are combined by majority vote.
  Model: probability of occurrence per record (e.g. 0.005, 0.837, …, 0.129).
  Observation: occurrence per record (0, 1, …, 0).
  Output: CCI, AUC, etc.
  Note: no CV within training data.]
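The three steps in the diagram (bootstrap sampling, one tree per sample, majority vote) can be illustrated with a toy ensemble in Python. This is a simplified sketch and not the Random Forests algorithm itself: single-split "stumps" stand in for full CART trees, there is no random selection of variables at each split, and the data, the number of trees (25) and all function names are assumptions made for the example.

```python
import random

def fit_stump(X, y):
    """Find the one-variable threshold rule with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            for above in (0, 1):   # class predicted when the value exceeds t
                errors = sum((above if row[j] > t else 1 - above) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, j, t, above)
    return best[1:]

def predict_stump(stump, row):
    j, t, above = stump
    return above if row[j] > t else 1 - above

def fit_forest(X, y, n_trees, seed):
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # (1) bootstrap sample
        forest.append(fit_stump([X[i] for i in idx],  # (2) one tree per sample
                                [y[i] for i in idx]))
    return forest

def predict_forest(forest, row):
    votes = sum(predict_stump(s, row) for s in forest)
    prob = votes / len(forest)       # share of trees voting "presence"
    return prob, int(prob >= 0.5)    # (3) majority vote

# Toy data (assumption): presence whenever the single variable exceeds 10.
X = [[float(d)] for d in range(1, 21)]
y = [1 if row[0] > 10 else 0 for row in X]
forest = fit_forest(X, y, n_trees=25, seed=1)
print(predict_forest(forest, [1.0]), predict_forest(forest, [30.0]))
```

The share of trees voting for presence plays the role of the probability column in the diagram, and thresholding it gives the predicted occurrence.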
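Performance measures
CCI: Correctly Classified Instances
  CCI = (TP + TN) / n
  TP: true positives, TN: true negatives, n: size of data set
AUC: Area under the ROC curve
  Ranges between 0.5 and 1, with 1 being perfect (0.5 corresponds to random prediction)
NSE: Nash-Sutcliffe Efficiency
  $$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} \left( X_{\mathrm{obs},i} - X_{\mathrm{model},i} \right)^{2}}{\sum_{i=1}^{n} \left( X_{\mathrm{obs},i} - X_{\mathrm{mean}} \right)^{2}}$$
  X_mean: mean of the observations
  Ranges between -∞ and 1, with 1 being perfect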
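The three performance measures can be computed directly from observed occurrences and model outputs. The Python sketch below is illustrative only: the four-point example data are made up, the 0.5 classification threshold is an assumption, and applying NSE to predicted probabilities against 0/1 observations is one plausible reading of the slide.

```python
def cci(obs, pred_class):
    """Correctly Classified Instances: (TP + TN) / n."""
    return sum(o == p for o, p in zip(obs, pred_class)) / len(obs)

def auc(obs, prob):
    """Area under the ROC curve, computed by comparing every
    presence/absence pair (ties count as half)."""
    pos = [p for o, p in zip(obs, prob) if o == 1]
    neg = [p for o, p in zip(obs, prob) if o == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def nse(obs, model):
    """Nash-Sutcliffe Efficiency: 1 - SS(obs - model) / SS(obs - mean of obs)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - m) ** 2 for o, m in zip(obs, model))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

# Made-up example: four records with observed occurrence and predicted probability.
obs = [0, 1, 0, 1]
prob = [0.1, 0.8, 0.6, 0.7]
pred = [int(p >= 0.5) for p in prob]   # 0.5 threshold is an assumption

print(cci(obs, pred))    # 0.75
print(auc(obs, prob))    # 1.0
print(round(nse(obs, prob), 3))
```

The example shows why the measures can disagree: both presences rank above both absences (AUC of 1), yet one absence is misclassified at the 0.5 threshold and NSE penalises the distance of every probability from its observation.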
Habitat information
Variable importance
  Which habitat variables are important?
  Computed internally in the RF algorithm (varImpPlot)
Habitat suitability curve
  Which habitat conditions are important?
  Computed internally in the RF algorithm (partialPlot)
[Figure: example habitat suitability curves (1: suitable, 0: unsuitable) against (a) water depth (cm), (b) flow velocity (cm s⁻¹), (c) percent vegetation coverage (%)]
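The idea behind a habitat suitability curve derived from a fitted model is partial dependence: fix one habitat variable at a grid value for every record, average the model predictions, and repeat over the grid. The Python sketch below shows that idea in generic form. It does not reproduce the `partialPlot` function named on the slide, and `toy_model`, its coefficients and the three example records are hypothetical stand-ins for a fitted model and real data.

```python
def partial_dependence(predict, X, var_index, grid):
    """Average prediction over the data with one variable fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[var_index] = value   # overwrite only the variable of interest
            total += predict(modified)
        curve.append(total / len(X))
    return curve

def toy_model(row):
    """Hypothetical model: suitability rises with depth and falls with velocity."""
    depth, velocity = row
    return max(0.0, min(1.0, 0.04 * depth - 0.01 * velocity + 0.2))

# Hypothetical records: [water depth (cm), flow velocity (cm/s)].
X = [[5.0, 10.0], [10.0, 20.0], [20.0, 5.0]]
curve = partial_dependence(toy_model, X, var_index=0, grid=[0.0, 10.0, 20.0])
print([round(v, 3) for v in curve])
```

Because the curve averages over the records actually present in the data, different calibration sets can yield curves of different shape and range even when overall accuracy is similar.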
RESULTS & DISCUSSION
Performance (A. anguilla): Random
[Figure: NSE and AUC (y-axis, 0.5 to 1) for calibration and validation across random seeds 1 to 10]
Performance (A. anguilla): Prevalence
[Figure: NSE and AUC (y-axis, 0.5 to 1) for calibration and validation across random seeds 1 to 10]
Performance (A. anguilla): Habitat space
[Figure: NSE and AUC (y-axis, 0.5 to 1) for calibration and validation across random seeds 1 to 10]
Variable importance (A. anguilla)
[Figure: mean decrease in accuracy (x-axis, 0 to 2) by habitat variable index (1 to 16) for (a) Random splitting, (b) Prevalence-based splitting, (c) Habitat space-based splitting]
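Mean decrease in accuracy is a permutation-based importance: shuffle one variable, re-predict, and record how much accuracy drops. The Python sketch below shows the principle on a whole data set with a generic predictor. It is a simplification under stated assumptions: Random Forests computes the measure tree by tree on out-of-bag data, which is not reproduced here, and `toy_model`, the example data and the function name `permutation_importance` are hypothetical.

```python
import random

def permutation_importance(predict, X, y, var_index, seed):
    """Drop in accuracy after shuffling the values of one variable."""
    def accuracy(rows):
        return sum(predict(r) == label for r, label in zip(rows, y)) / len(y)

    rng = random.Random(seed)
    column = [row[var_index] for row in X]
    rng.shuffle(column)                       # break the link to the response
    permuted = [row[:var_index] + [v] + row[var_index + 1:]
                for row, v in zip(X, column)]
    return accuracy(X) - accuracy(permuted)

def toy_model(row):
    """Hypothetical classifier that uses altitude only (column 0)."""
    return 1 if row[0] > 500 else 0

# Hypothetical data: [altitude (m), an irrelevant variable].
X = [[100.0 * k, float(k % 3)] for k in range(1, 11)]
y = [1 if row[0] > 500 else 0 for row in X]

imp_altitude = permutation_importance(toy_model, X, y, var_index=0, seed=1)
imp_other = permutation_importance(toy_model, X, y, var_index=1, seed=1)
print(imp_altitude, imp_other)
```

A variable the model never uses gets an importance of exactly zero, while shuffling an influential variable can only lower accuracy here because the toy classifier fits the unshuffled data perfectly.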
Habitat suitability curves
[Figure: habitat suitability curves (y-axis, −2 to 0) for (a) Random splitting, (b) Prevalence-based splitting, (c) Habitat space-based splitting. Rows: Altitude (m), 0 to 1000, labelled (high); Slope (%), 0 to 6, labelled (mid); Water temperature (°C), 5 to 20, labelled (low)]
SUMMARY
How to split the data when applying CV?
  Three methods were compared
    Random splitting
    Prevalence-based splitting
    Habitat space-based splitting
  Random Forests was used as an SDM
Model performance (AUC, NSE)
  Very similar (almost no influence on RF modelling)
Habitat information
  Variable importance was similar (accuracy was similar…)
  HSCs show variability in shape and range (to some extent, related to the degree of variable importance)
Much clearer influence may be observed in case of data with low prevalence
A beautiful day in Ghent, Belgium
THANK YOU VERY MUCH!!!
Short summary
  Cross-validation schemes: Random & Stratified (output & input)
  Data-driven method: Random Forests
  Model performance & habitat information
  Influence was not clear for accuracy and variable importance
  HSC shapes were variable