TRANSCRIPT
-
A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis
Reporter: 范佐搖, 2019/10/17
-
https://blogs.nvidia.com.tw/2016/07/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/
-
Dermatologist-level classification of skin cancer with deep neural networks
-
CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning
P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, R. L. Ball, C. Langlotz, K. Shpanskaya, M. P. Lungren, and A. Y. Ng
arXiv, Dec. 2017.
112,120 chest X-ray images
-
Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images
T. Hirasawa, K. Aoyama, T. Tanimoto, et al.
Gastric Cancer (2018) 21:653–660
13,584 images were collected for 2639 histologically proven gastric cancer lesions as a training image data set.
-
Vessel Extraction in X-Ray Angiograms Using Deep Learning
E. Nasr-Esfahani, S. Samavi, N. Karimi, S.M.R. Soroushmehr, K. Ward, M.H. Jafari, B. Felfeliyan, B. Nallamothu, K. Najarian
38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2016, pp. 643–646.
44 X-ray angiography images
-
AI model comparison table (columns: patient cohorts, disease, data set, model, training, verification)
-
Data set workflow: the data set is split into 80% training data and 20% internal test data; the training data go through a data augmentation process into model construction, and model verification is performed on the internal test data and an external test data set.
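The split-and-augment workflow above can be sketched as follows. The 80/20 split comes from the slide; the function names, the toy "flip" augmentation, and the sample data are illustrative assumptions, not from the reviewed studies:

```python
import random

def split_dataset(images, train_frac=0.8, seed=0):
    """Split a list of images into training data (80%) and internal test data (20%)."""
    rng = random.Random(seed)
    shuffled = images[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def augment(training_images):
    """Toy augmentation: add a 'flipped' copy of each training image.
    Real pipelines use rotations, crops, colour jitter, etc."""
    return training_images + [("flipped", img) for img in training_images]

images = [f"img_{i}" for i in range(100)]          # placeholder image identifiers
train, internal_test = split_dataset(images)       # 80 train, 20 internal test
train = augment(train)                             # doubled by augmentation -> 160
print(len(train), len(internal_test))              # 160 20
```

The external test data set is held entirely outside this pipeline and is only used for the final out-of-sample verification step.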
-
Literature search: Jan 1, 2012 to Jun 6, 2019; the database search returned 31,587 records.
-
Number of included studies by disease area:
ophthalmic disease: 18 studies
breast cancer: 10 studies
trauma and orthopaedics: 10 studies
dermatological cancer: 9 studies
respiratory disease: 8 studies
lung cancer: 7 studies
gastroenterological or hepatological cancers: 5 studies
thyroid cancer: 4 studies
gastroenterology and hepatology: 2 studies
cardiology: 2 studies
oral cancer: 2 studies
nephrology: 1 study
neurology: 1 study
maxillofacial surgery: 1 study
rheumatology: 1 study
nasopharyngeal cancer: 1 study
urological disease: 1 study
two different target conditions: 1 study
Total: 82 studies
-
DL model vs health-care professionals
-
Discussion
First, most studies took the approach of assessing deep learning diagnostic accuracy in isolation, in a way that does not reflect clinical practice. Many studies were excluded at screening because they did not provide comparisons with health-care professionals (ie, human vs machine), and very few of the included studies reported comparisons with health-care professionals using the same test dataset. Considering deep learning algorithms in this isolated manner limits our ability to extrapolate the findings to health-care delivery, except perhaps for mass screening.
-
Discussion
Second, there were very few prospective studies done in real clinical environments. Most studies were retrospective, in silico, and based on previously assembled datasets. The ground truth labels were mostly derived from data collected for other purposes, such as in retrospectively collected routine clinical care notes or radiology or histology reports, and the criteria for the presence or absence of disease were often poorly defined. The reporting around handling of missing information in these datasets was also poor across all studies. Most did not report whether any data were missing, what proportion this represented, and how missing data were dealt with in the analysis. Such studies should be considered as hypothesis generating, with real accuracy defined in patients, not just datasets.
-
Discussion
Third, a wide range of metrics were employed to report diagnostic performance in deep learning studies. If a probability function is not reported, the frequency of true positives, false positives, false negatives, and true negatives at a specified threshold should be the minimum requirement for such comparisons. In our review, only 12 studies reported the threshold at which sensitivity and specificity were reported, without justification of how the threshold was chosen; the choice of threshold is often set at the arbitrary value of 0·5, as is convention in machine learning development.
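The minimum reporting requirement described here (true/false positives and negatives at a stated threshold, from which sensitivity and specificity follow) can be illustrated with a small sketch; the labels, probabilities, and the conventional 0·5 threshold are illustrative:

```python
def confusion_at_threshold(y_true, y_prob, threshold=0.5):
    """Count TP, FP, FN, TN when predicted probabilities are binarised at `threshold`."""
    tp = fp = fn = tn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = prob >= threshold
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Illustrative labels (1 = disease present) and model probabilities.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.3, 0.1]

tp, fp, fn, tn = confusion_at_threshold(y_true, y_prob, threshold=0.5)
sensitivity = tp / (tp + fn)   # true-positive rate
specificity = tn / (tn + fp)   # true-negative rate
print(tp, fp, fn, tn, sensitivity, specificity)  # 2 1 1 2 0.666... 0.666...
```

Reporting these four counts at an explicit, justified threshold lets readers recompute sensitivity and specificity, or re-derive them at other operating points, rather than trusting a single summary number.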
-
Discussion
Fourth, there is inconsistency over key terminology used in deep learning studies.
We suggest distinguishing the datasets involved in the development of an algorithm as the training set (for training the algorithm), the tuning set (for tuning hyperparameters), and the validation test set (for estimating the performance of the algorithm). For describing the different types of validation test sets, we suggest adoption of the suggestion by Altman and Royston: internal validation (for in-sample validation), temporal validation (for in-sample validation with a temporal split), and external validation (for out-of-sample validation).
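The suggested training/tuning/validation-test terminology might be sketched as a three-way split; the 70/15/15 fractions, the function name, and the sample data are assumptions for illustration, not recommendations from the review:

```python
import random

def three_way_split(samples, fracs=(0.7, 0.15, 0.15), seed=0):
    """Split samples into a training set, a tuning set, and a validation test set.
    Fraction values here are illustrative only."""
    assert abs(sum(fracs) - 1.0) < 1e-9
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * fracs[0])
    n_tune = int(n * fracs[1])
    training = shuffled[:n_train]                  # for training the algorithm
    tuning = shuffled[n_train:n_train + n_tune]    # for tuning hyperparameters
    validation_test = shuffled[n_train + n_tune:]  # for estimating performance
    return training, tuning, validation_test

training, tuning, validation_test = three_way_split(list(range(200)))
print(len(training), len(tuning), len(validation_test))  # 140 30 30
```

Under this terminology, a split like the above yields only internal validation; temporal validation would split by date rather than at random, and external validation would use a dataset collected at another site entirely.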
-
Discussion
Finally, although most studies did undertake an out-of-sample validation, most did not do this for both health-care professionals and deep learning algorithms.
Our finding when comparing performance on internal versus external validation was that, as expected, internal validation overestimates diagnostic accuracy in both health-care professionals and deep learning algorithms. This finding highlights the need for out-of-sample external validation in all predictive models.
-
Thanks for listening