
Post on 24-Feb-2016


How to deal with input data characteristics in data-driven aquatic species distribution modelling: in search of an optimal cross-validation categorisation scheme

Shinji Fukuda, Esther Julia Olaya-Marín, Francisco Martínez-Capel, Ans M. Mouton

Shinji Fukuda
Assistant Professor
Faculty of Agriculture
Kyushu University
shinji-fkd@agr.kyushu-u.ac.jp

OUTLINE

How to split our data in cross-validation?
1. Introduction
2. Method
   a) Species distribution data
   b) Modelling approach
   c) How to split data in cross-validation
   d) Performance measures
   e) Habitat information
3. Results and Discussion
4. Summary

INTRODUCTION

Why is cross-validation necessary?
To ensure that your model is generally good
  Avoid over-fitting / improve generalisation ability
  Obtain better model parameters
    K-fold cross-validation
    Leave-one-out cross-validation
    Repeated random sub-sampling validation
To compare your target models fairly
  Nested cross-validation

[Figure: 10-fold nested CV; labels: Training, Testing, Calibration, Validation]
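The nested scheme named on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the slide does not say how the four labelled parts relate, so the code assumes that each outer fold is held out for testing and that the remaining training data are split again into calibration and validation parts. The function names `k_fold_indices` and `nested_cv_splits` are hypothetical.

```python
import random


def k_fold_indices(n, k, seed):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nested_cv_splits(n, outer_k=10, inner_k=10, seed=1):
    """Yield (calibration, validation, testing) index lists.

    Assumed layout: the outer loop holds one fold out for testing; the
    inner loop splits the remaining training data into calibration and
    validation parts.
    """
    outer = k_fold_indices(n, outer_k, seed)
    for o, testing in enumerate(outer):
        training = [i for j, fold in enumerate(outer) if j != o for i in fold]
        # Inner folds are positions within the training list.
        inner = k_fold_indices(len(training), inner_k, seed + o + 1)
        for validation_pos in inner:
            held = set(validation_pos)
            validation = [training[p] for p in validation_pos]
            calibration = [training[p] for p in range(len(training))
                           if p not in held]
            yield calibration, validation, testing
```

With `outer_k=10` and `inner_k=10` this yields 100 calibration/validation/testing triples, and within each triple the three index lists never overlap.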

OBJECTIVE

How to split our data when applying CV?
Compare three potential methods
  Random splitting
  Stratified splitting
    Prevalence-based splitting (output variable)
    Habitat space-based splitting (input variables)
Criteria
  Model performance
  Habitat information

[Figure: labels "depth", "Absence", "Presence"]

METHODS

Species distribution data (3 species)
Anguilla anguilla (European eel): 46%
Luciobarbus guiraonis (Barbo Mediterraneo): 69%

METHODS

Species distribution data (statistics)

Variable  1            2      3       4    5      6       7      8        9
Min.      Cabriel 83   0.00   0.80    0    28     95      3      20.69    0.00
1st Qu.   Jucar 42     1.00   8.65    7    382    1746    97     122.99   0.15
Median                 2.00   79.00   35   441    3734    1295   152.67   0.30
Mean                   1.88   47.42   32   577    4187    943    165.58   0.41
3rd Qu.                3.00   79.00   54   818    3909    1295   191.57   0.46
Max.                   5.00   79.00   54   1286   18296   4624   383.68   6.76

Variable  10     11      12      13      14      15     16
Min.      1153   5.76    0.09    0.11    0.28    0.42   0.15
1st Qu.   2727   12.00   2.85    3.25    4.52    0.74   0.24
Median    3595   14.25   5.32    5.51    6.44    0.84   0.31
Mean      3910   13.71   5.22    5.49    6.24    0.87   0.36
3rd Qu.   5314   15.30   7.02    7.48    8.35    0.93   0.47
Max.      6298   19.90   12.22   13.63   12.36   3.37   0.91

Variables
1. River
2. Number of exotic fish species (%)
3. Channel length without artificial barriers (km)
4. Number of tributaries between artificial barriers
5. Altitude (m a.s.l.)
6. Drainage area (km²)
7. Drainage area between artificial barriers (km²)
8. Distance from headwater source (km)
9. Natural slope of the channel (%)
10. Solar radiation (Wh/m²)
11. Water temperature (°C)
12. Mean annual flow rate (m³/s)
13. Mean monthly flow (two years before sampling) (m³/s)
14. Inter-annual mean flow (calculated for 5 years) (m³/s)
15. Inter-monthly flow variation of the mean monthly flows (5 years before sampling)
16. Coefficient of variation of mean annual flows (calculated for 5 years)

METHODS

How to split data (random or stratified)
Random splitting
#0 Read data
#1 Set the number of CV folds
#2 Set the number of calibration and validation data points
#3 Set a random seed
#4 Split the data set

[Figure legend: Absence, Presence]
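Steps #1 to #4 of random splitting can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `random_split` is hypothetical, the data set is represented only by its row indices (step #0 is assumed to have happened already), and the calibration and validation sizes follow from the number of folds.

```python
import random


def random_split(n_points, n_folds, seed):
    """Return one (calibration, validation) pair of index lists per fold."""
    rng = random.Random(seed)          # 3: set a random seed
    order = list(range(n_points))
    rng.shuffle(order)                 # 4: split the data set at random
    # 1 and 2: n_folds fixes the fold sizes; every fold serves once as
    # validation data, the remaining folds are the calibration data.
    folds = [order[i::n_folds] for i in range(n_folds)]
    splits = []
    for v in range(n_folds):
        validation = sorted(folds[v])
        calibration = sorted(i for f, fold in enumerate(folds)
                             if f != v for i in fold)
        splits.append((calibration, validation))
    return splits
```

Calling `random_split(125, 10, seed)` for seeds 1 to 10 would give ten different 10-fold partitions of a 125-row data set; both numbers are placeholders chosen for the example.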

METHODS

How to split data (random or stratified)
Stratified splitting (prevalence)
#0 Read data
#1 Set the number of CV folds
#2 Set the number of calibration and validation data points
#3 Set a random seed
#4 Split the data set according to prevalence or abundance
  #4-1 Separate presence and absence data
  #4-2 Split each of presence and absence data according to the number of CV folds
  #4-3 Merge presence and absence data
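The prevalence-based steps #4-1 to #4-3 can be sketched as below, assuming the output variable is a list of 0/1 presence values. The function name `prevalence_stratified_folds` is hypothetical and the sketch covers presence/absence data only, not abundance.

```python
import random


def prevalence_stratified_folds(presence, n_folds, seed):
    """Return n_folds index lists with near-equal prevalence in each fold."""
    rng = random.Random(seed)
    # 4-1: separate presence and absence data
    present = [i for i, y in enumerate(presence) if y == 1]
    absent = [i for i, y in enumerate(presence) if y == 0]
    rng.shuffle(present)
    rng.shuffle(absent)
    # 4-2: split each group according to the number of CV folds
    # 4-3: merge the presence and absence parts fold by fold
    return [present[f::n_folds] + absent[f::n_folds] for f in range(n_folds)]
```

Each fold then serves once as validation data while the other folds form the calibration data, exactly as in random splitting.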

METHODS

How to split data (random or stratified)
Stratified splitting (habitat)
#0 Read data
#1 Set the number of CV folds
#2 Set the number of calibration and validation data points
#3 Set a random seed
#4 Split the data set according to Euclidean distance of habitat variables
  #4-1 Calculate Euclidean distance of each data point from center of habitat variable space
  #4-2 Split the data according to the Euclidean distance of habitat variables
  #4-3 Merge data sets to generate calibration and validation data sets
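One possible reading of the habitat space-based steps is sketched below. Several details are assumptions, since the slide does not specify them: the centre is taken as the mean of each variable, the variables are standardised before the distance is computed, and data points are ranked by distance and dealt out in blocks so that every fold spans the whole range of distances. The function name `habitat_space_folds` is hypothetical.

```python
import math
import random


def habitat_space_folds(habitat, n_folds, seed):
    """Split rows of habitat variables into folds by distance from the centre.

    Returns (folds, distances). Assumptions: centre = per-variable mean,
    variables are z-scored so that units do not dominate the distance.
    """
    n, n_vars = len(habitat), len(habitat[0])
    means = [sum(row[j] for row in habitat) / n for j in range(n_vars)]
    sds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in habitat) / n) or 1.0
           for j in range(n_vars)]
    # 4-1: Euclidean distance of each data point from the centre
    dist = [math.sqrt(sum(((row[j] - means[j]) / sds[j]) ** 2
                          for j in range(n_vars)))
            for row in habitat]
    # 4-2: rank by distance, then deal each block of n_folds neighbours
    # to the folds in a seeded random order
    ranked = sorted(range(n), key=lambda i: dist[i])
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for start in range(0, n, n_folds):
        targets = list(range(n_folds))
        rng.shuffle(targets)
        for i, f in zip(ranked[start:start + n_folds], targets):
            folds[f].append(i)
    # 4-3: the caller merges folds into calibration and validation sets
    return folds, dist
```

Dealing out ranked blocks is a design choice made for this sketch: it gives each fold points from both the core and the edge of the habitat space, which is the property a habitat-stratified split is meant to have.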

METHODS

Random Forests (Breiman, 2001)

[Figure: Random Forests workflow]
Data: habitat variables (Variable 1 … Variable n) and presence/absence (P/A) for data points 1, 2, …, 1064
(1) Bootstrap sampling: create bootstrap samples (Sample 1 … Sample k … Sample 500)
(2) Development of CART: build a classification tree for each sample (Tree 1 … Tree k … Tree 500)
(3) Voting for classification: combine the results of the trees (Result 1 … Result k … Result 500) by majority vote
Model: occurrence (0 or 1) and probability (e.g. 0.005, 0.837, 0.129) for each data point
Output: observations compared with model predictions to compute CCI, AUC, etc.
No CV within training data
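The three steps in the figure (bootstrap sampling, one tree per sample, majority vote) can be illustrated with a deliberately small stand-in. This is not Breiman's algorithm: each "tree" here is a single-threshold stump on one randomly chosen variable, whereas Random Forests grows full CART trees and draws a random subset of variables at every node. All function names are hypothetical.

```python
import random


def fit_stump(rows, labels, feature):
    """Best single threshold on one variable (a depth-1 stand-in for CART)."""
    best = None
    for threshold in sorted(set(r[feature] for r in rows)):
        for above in (0, 1):  # class predicted when value > threshold
            correct = sum(
                (above if r[feature] > threshold else 1 - above) == y
                for r, y in zip(rows, labels))
            if best is None or correct > best[0]:
                best = (correct, feature, threshold, above)
    return best[1:]


def predict_stump(stump, row):
    feature, threshold, above = stump
    return above if row[feature] > threshold else 1 - above


def fit_forest(rows, labels, n_trees=500, seed=1):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # (1) bootstrap sample
        feature = rng.randrange(len(rows[0]))          # random variable choice
        forest.append(fit_stump([rows[i] for i in sample],
                                [labels[i] for i in sample],
                                feature))              # (2) one tree per sample
    return forest


def predict_probability(forest, row):
    """Fraction of trees voting for presence."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)


def predict_class(forest, row):
    """(3) Majority vote over all trees."""
    return 1 if predict_probability(forest, row) > 0.5 else 0
```

The vote fraction plays the role of the "Probability" column in the figure and the majority vote gives the "Occurrence" column.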

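METHODS

Performance measures
CCI: Correctly Classified Instances
  CCI = (TP + TN) / n
  TP, TN: numbers of true positives and true negatives; n: size of data set
AUC: Area under the ROC curve
  Typically ranges between 0.5 (no better than random) and 1 (perfect)
NSE: Nash-Sutcliffe Efficiency
  $$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} \left( X_{\mathrm{obs},i} - X_{\mathrm{model},i} \right)^{2}}{\sum_{i=1}^{n} \left( X_{\mathrm{obs},i} - X_{\mathrm{mean}} \right)^{2}}$$
  X_mean: mean of the observations
  Ranges between -∞ and 1, with 1 being perfect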
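The three measures can be computed directly from observed presence/absence values and model outputs. The sketch below is a plain-Python illustration with hypothetical function names; the AUC is computed by pairwise comparison of presences and absences, which is equivalent to the area under the ROC curve, and ties count as one half.

```python
def cci(observed, predicted_class):
    """Correctly classified instances: (TP + TN) / n."""
    hits = sum(o == p for o, p in zip(observed, predicted_class))
    return hits / len(observed)


def auc(observed, probability):
    """Chance that a presence receives a higher score than an absence."""
    pos = [p for o, p in zip(observed, probability) if o == 1]
    neg = [p for o, p in zip(observed, probability) if o == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def nse(observed, modelled):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance about the observed mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - m) ** 2 for o, m in zip(observed, modelled))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst
```

For presence/absence data, `nse` would be applied to the observed 0/1 values and the predicted probabilities; that pairing is an assumption here, as the slide does not state which model output enters the formula.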

METHODS

Habitat information
Variable importance
  Which habitat variables are important?
  Computed internally in the RF algorithm (varImpPlot)
Habitat suitability curve
  Which habitat conditions are important?
  Computed internally in the RF algorithm (partialPlot)

[Figure: example habitat suitability curves, panels (a) to (c); x-axes: water depth (cm), flow velocity (cm s⁻¹), percent vegetation coverage (%); y-axis: habitat suitability (1: suitable, 0: unsuitable)]
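The ideas behind the two outputs can be sketched for any fitted model. The code below is a generic illustration, not the internals of `varImpPlot` or `partialPlot`: importance is taken as the drop in accuracy after shuffling one variable on the supplied data (Random Forests does this on out-of-bag samples), and the suitability curve as the average predicted probability with one variable held at each grid value. The arguments `predict` and `predict_prob` are assumed to be functions that map one row of habitat variables to a class or a probability.

```python
import random


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, feature, seed=1):
    """Decrease in accuracy after shuffling one habitat variable."""
    rng = random.Random(seed)
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    permuted = [r[:feature] + [v] + r[feature + 1:]
                for r, v in zip(rows, shuffled)]
    return accuracy(predict, rows, labels) - accuracy(predict, permuted, labels)


def partial_dependence(predict_prob, rows, feature, grid):
    """Average predicted probability with one variable fixed at each grid value."""
    curve = []
    for value in grid:
        fixed = [r[:feature] + [value] + r[feature + 1:] for r in rows]
        curve.append(sum(predict_prob(r) for r in fixed) / len(fixed))
    return curve
```

A variable the model never uses gets an importance of exactly zero, and its partial dependence curve is flat, which is a quick sanity check when reading such plots.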

RESULTS & DISCUSSION

Performance (A. anguilla): random splitting

[Figure: NSE and AUC (y-axis 0.5 to 1) for calibration and validation, plotted against random seed (1 to 10)]

RESULTS & DISCUSSION

Performance (A. anguilla): prevalence-based splitting

[Figure: NSE and AUC (y-axis 0.5 to 1) for calibration and validation, plotted against random seed (1 to 10)]

RESULTS & DISCUSSION

Performance (A. anguilla): habitat space-based splitting

[Figure: NSE and AUC (y-axis 0.5 to 1) for calibration and validation, plotted against random seed (1 to 10)]

RESULTS & DISCUSSION

Variable importance (A. anguilla)

[Figure: mean decrease in accuracy (0 to 2) by habitat variable index (1 to 16); panels: (a) Random splitting, (b) Prevalence-based splitting, (c) Habitat space-based splitting]

RESULTS & DISCUSSION

Habitat suitability curves

[Figure: habitat suitability curves (y-axis −2 to 0) for altitude (0 to 1000 m), slope (0 to 6 %) and water temperature (5 to 20 °C); panels for each variable: (a) Random splitting, (b) Prevalence-based splitting, (c) Habitat space-based splitting; row labels: Altitude, Slope, Water Temp.; annotations: (high), (mid), (low)]

SUMMARY

How to split the data when applying CV?
Three methods were compared
  Random splitting
  Prevalence-based splitting
  Habitat space-based splitting
Random Forests was used as an SDM
Model performance (AUC, NSE)
  Very similar (almost no influence on RF modelling)
Habitat information
  Variable importance was similar (accuracy was similar…)
  HSCs show variability in shape and range (to some extent, related to the degree of variable importance)
Much clearer influence may be observed in case of data with low prevalence

A beautiful day in Ghent, Belgium

THANK YOU VERY MUCH!!!

Short summary
Cross-validation schemes: Random & Stratified (output & input)
Data-driven method: Random Forests
Model performance & habitat information
Influence was not clear for accuracy and variable importance
HSC shapes were variable
