TRANSCRIPT
-
A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis
Reporter: 范佐搖, 2019/10/17
-
https://blogs.nvidia.com.tw/2016/07/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
-
Dermatologist-level classification of skin cancer with deep neural networks
-
CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning
P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, R. L. Ball, C. Langlotz, K. Shpanskaya, M. P. Lungren, and A. Y. Ng
arXiv, Dec. 2017.
112,120 chest X-ray images
-
Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images
T. Hirasawa, K. Aoyama, T. Tanimoto, et al.
Gastric Cancer (2018) 21:653–660
13,584 images were collected for 2639 histologically proven gastric cancer lesions as a training image data set.
-
Vessel Extraction in X-Ray Angiograms Using Deep Learning
E. Nasr-Esfahani, S. Samavi, N. Karimi, S.M.R. Soroushmehr, K. Ward, M.H. Jafari, B. Felfeliyan, B. Nallamothu, K. Najarian
38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2016, pp. 643–646.
44 X-ray angiography images
-
AI model comparison table (columns: patient cohorts, disease, data set, model, training, verification)
-
Data set workflow: the data set is split into 80% training data and 20% internal test data; the training data go through a data augmentation process into model construction, and model verification is performed on the internal test data and an external test data set.
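The split-and-augment workflow above can be sketched as follows. The 80/20 split comes from the slide; the function names, the toy "flip" augmentation, and the sample data are illustrative assumptions, not from the reviewed studies:

```python
import random

def split_dataset(images, train_frac=0.8, seed=0):
    """Split a list of images into training data (80%) and internal test data (20%)."""
    rng = random.Random(seed)
    shuffled = images[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def augment(training_images):
    """Toy augmentation: add a 'flipped' copy of each training image.
    Real pipelines use rotations, crops, colour jitter, etc."""
    return training_images + [("flipped", img) for img in training_images]

images = [f"img_{i}" for i in range(100)]          # placeholder image identifiers
train, internal_test = split_dataset(images)       # 80 train, 20 internal test
train = augment(train)                             # doubled by augmentation -> 160
print(len(train), len(internal_test))              # 160 20
```

The external test data set is held entirely outside this pipeline and is only used for the final out-of-sample verification step.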
-
Literature search: Jan 1, 2012 to Jun 6, 2019; the database search returned 31,587 records.
-
Number of included studies by disease area:
ophthalmic disease: 18 studies
breast cancer: 10 studies
trauma and orthopaedics: 10 studies
dermatological cancer: 9 studies
respiratory disease: 8 studies
lung cancer: 7 studies
gastroenterological or hepatological cancers: 5 studies
thyroid cancer: 4 studies
gastroenterology and hepatology: 2 studies
cardiology: 2 studies
oral cancer: 2 studies
nephrology: 1 study
neurology: 1 study
maxillofacial surgery: 1 study
rheumatology: 1 study
nasopharyngeal cancer: 1 study
urological disease: 1 study
two different target conditions: 1 study
Total: 82 studies
-
DL model vs health-care professionals
-
Discussion
First, most studies took the approach of assessing deep learning diagnostic accuracy in isolation, in a way that does not reflect clinical practice. Many studies were excluded at screening because they did not provide comparisons with health-care professionals (ie, human vs machine), and very few of the included studies reported comparisons with health-care professionals using the same test dataset. Considering deep learning algorithms in this isolated manner limits our ability to extrapolate the findings to health-care delivery, except perhaps for mass screening.
-
Discussion
Second, there were very few prospective studies done in real clinical environments. Most studies were retrospective, in silico, and based on previously assembled datasets. The ground truth labels were mostly derived from data collected for other purposes, such as in retrospectively collected routine clinical care notes or radiology or histology reports, and the criteria for the presence or absence of disease were often poorly defined. The reporting around handling of missing information in these datasets was also poor across all studies. Most did not report whether any data were missing, what proportion this represented, and how missing data were dealt with in the analysis. Such studies should be considered as hypothesis generating, with real accuracy defined in patients, not just datasets.
-
Discussion
Third, a wide range of metrics were employed to report diagnostic performance in deep learning studies. If a probability function is not reported, the frequency of true positives, false positives, false negatives, and true negatives at a specified threshold should be the minimum requirement for such comparisons. In our review, only 12 studies reported the threshold at which sensitivity and specificity were reported, without justification of how the threshold was chosen; the choice of threshold is often set at the arbitrary value of 0·5, as is convention in machine learning development.
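The minimum reporting requirement described here (true/false positives and negatives at a stated threshold, from which sensitivity and specificity follow) can be illustrated with a small sketch; the labels, probabilities, and the conventional 0·5 threshold are illustrative:

```python
def confusion_at_threshold(y_true, y_prob, threshold=0.5):
    """Count TP, FP, FN, TN when predicted probabilities are binarised at `threshold`."""
    tp = fp = fn = tn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = prob >= threshold
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Illustrative labels (1 = disease present) and model probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]

tp, fp, fn, tn = confusion_at_threshold(y_true, y_prob, threshold=0.5)
sensitivity = tp / (tp + fn)   # true-positive rate
specificity = tn / (tn + fp)   # true-negative rate
print(tp, fp, fn, tn, sensitivity, specificity)  # 2 1 1 2 0.666... 0.666...
```

Reporting these four counts at an explicit, justified threshold lets readers recompute sensitivity and specificity, or re-derive them at other operating points, rather than trusting a single summary number.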
-
Discussion
Fourth, there is inconsistency over key terminology used in deep learning studies.
We suggest distinguishing the datasets involved in the development of an algorithm as the training set (for training the algorithm), the tuning set (for tuning hyperparameters), and the validation test set (for estimating the performance of the algorithm). For describing the different types of validation test sets, we suggest adoption of the suggestion by Altman and Royston: internal validation (for in-sample validation), temporal validation (for in-sample validation with a temporal split), and external validation (for out-of-sample validation).
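The suggested training/tuning/validation-test terminology might be sketched as a three-way split; the 70/15/15 fractions, the function name, and the sample data are assumptions for illustration, not recommendations from the review:

```python
import random

def three_way_split(samples, fracs=(0.7, 0.15, 0.15), seed=0):
    """Split samples into a training set, a tuning set, and a validation test set.
    Fraction values here are illustrative only."""
    assert abs(sum(fracs) - 1.0) < 1e-9
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * fracs[0])
    n_tune = int(n * fracs[1])
    training = shuffled[:n_train]                  # for training the algorithm
    tuning = shuffled[n_train:n_train + n_tune]    # for tuning hyperparameters
    validation_test = shuffled[n_train + n_tune:]  # for estimating performance
    return training, tuning, validation_test

training, tuning, validation_test = three_way_split(list(range(200)))
print(len(training), len(tuning), len(validation_test))  # 140 30 30
```

Under this terminology, a split like the above yields only internal validation; temporal validation would split by date rather than at random, and external validation would use a dataset collected at another site entirely.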
-
Discussion
Finally, although most studies did undertake an out-of-sample validation, most did not do this for both health-care professionals and deep learning algorithms.
Our finding when comparing performance on internal versus external validation was that, as expected, internal validation overestimates diagnostic accuracy in both health-care professionals and deep learning algorithms. This finding highlights the need for out-of-sample external validation in all predictive models.
-
Thanks for listening