
  • A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis

    Reporter: 范佐搖, 2019/10/17

  • Background on the difference between artificial intelligence, machine learning, and deep learning: https://blogs.nvidia.com.tw/2016/07/whats-difference-artificial-intelligence-machine-learning-deep-learning-ai/

  • Dermatologist-level classification of skin cancer with deep neural networks

  • CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning

    P. Rajpurkar, J. Irvin, K. Zhu, B. Yang, H. Mehta, T. Duan, D. Ding, A. Bagul, R. L. Ball, C. Langlotz, K. Shpanskaya, M. P. Lungren, and A. Y. Ng

    arXiv, Dec. 2017.

    112,120 chest X-ray images

  • Application of artificial intelligence using a convolutional neural network for detecting gastric cancer in endoscopic images

    T. Hirasawa, K. Aoyama, T. Tanimoto, et al.

    Gastric Cancer (2018) 21:653–660

    13,584 images of 2,639 histologically proven gastric cancer lesions were collected as the training image data set.

  • Vessel Extraction in X-Ray Angiograms Using Deep Learning

    E. Nasr-Esfahani, S. Samavi, N. Karimi, S.M.R. Soroushmehr, K. Ward, M.H. Jafari, B. Felfeliyan, B. Nallamothu, K. Najarian

    38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2016, pp. 643–646.

    44 X-ray angiography images

  • AI model development workflow (schematic repeated over four slides): patient cohorts with the target disease provide the data set, which feeds model training and verification.

  • Data-handling workflow (schematic): the data set is split into 80% training data and 20% internal test data; the training data pass through a data augmentation process and model construction, and the model is then verified on the internal test data and on external test data (a minimal sketch of this split follows below).
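
  • A minimal sketch, in Python, of the split-and-augment step described above; the array shapes, the random stand-in data, and the horizontal-flip augmentation are illustrative assumptions, not details taken from any reviewed study:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Illustrative stand-ins for an imaging data set: N grayscale images plus labels.
    images = np.random.rand(100, 64, 64)
    labels = np.random.randint(0, 2, size=100)

    # 80% training data / 20% internal test data, as in the workflow slide.
    X_train, X_internal_test, y_train, y_internal_test = train_test_split(
        images, labels, test_size=0.2, random_state=42, stratify=labels)

    # A simple data augmentation process: append horizontally flipped copies of the
    # training images (standing in for whatever augmentation a given study used).
    X_train_aug = np.concatenate([X_train, X_train[:, :, ::-1]])
    y_train_aug = np.concatenate([y_train, y_train])

    # Model construction and verification would follow: the internal test data give
    # an in-sample estimate, while a separately collected external test set gives
    # an out-of-sample estimate.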

  • Literature search: databases searched from Jan 1, 2012 to Jun 6, 2019; 31,587 records identified.

  • Included studies by disease area (82 studies in total):

    ophthalmic disease: 18 studies
    breast cancer: 10 studies
    trauma and orthopaedics: 10 studies
    dermatological cancer: 9 studies
    respiratory disease: 8 studies
    lung cancer: 7 studies
    gastroenterological or hepatological cancers: 5 studies
    thyroid cancer: 4 studies
    gastroenterology and hepatology: 2 studies
    cardiology: 2 studies
    oral cancer: 2 studies
    nephrology: 1 study
    neurology: 1 study
    maxillofacial surgery: 1 study
    rheumatology: 1 study
    nasopharyngeal cancer: 1 study
    urological disease: 1 study
    two different target conditions: 1 study

  • Results figure: diagnostic performance of the DL models compared with health-care professionals.

  • Discussion

    First, most studies took the approach of assessing deep learning diagnostic accuracy in isolation, in a way that does not reflect clinical practice. Many studies were excluded at screening because they did not provide comparisons with health-care professionals (ie, human vs machine), and very few of the included studies reported comparisons with health-care professionals using the same test dataset. Considering deep learning algorithms in this isolated manner limits our ability to extrapolate the findings to health-care delivery, except perhaps for mass screening.

  • Discussion

    Second, there were very few prospective studies done in real clinical environments. Most studies were retrospective, in silico, and based on previously assembled datasets. The ground truth labels were mostly derived from data collected for other purposes, such as in retrospectively collected routine clinical care notes or radiology or histology reports, and the criteria for the presence or absence of disease were often poorly defined. The reporting around handling of missing information in these datasets was also poor across all studies. Most did not report whether any data were missing, what proportion this represented, and how missing data were dealt with in the analysis. Such studies should be considered as hypothesis generating, with real accuracy defined in patients, not just datasets.

  • Discussion

    Third, a wide range of metrics were employed to report diagnostic performance in deep learning studies. If a probability function is not reported, the frequency of true positives, false positives, false negatives, and true negatives at a specified threshold should be the minimum requirement for such comparisons. In our review, only 12 studies reported the threshold at which sensitivity and specificity were reported, without justification of how the threshold was chosen; choice of threshold is often set at the arbitrary value of 0·5, as is convention in machine learning development.
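
  • As an illustration of the minimum reporting described above, a short Python sketch that derives true/false positives and negatives at an explicit threshold and computes sensitivity and specificity from them; the labels, probabilities, and the 0.5 threshold are made-up values for illustration only:

    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # ground-truth labels
    y_prob = np.array([0.9, 0.4, 0.35, 0.8, 0.2, 0.6, 0.7, 0.1])   # model probabilities

    threshold = 0.5  # the conventional, often arbitrary, operating point
    y_pred = (y_prob >= threshold).astype(int)

    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives

    sensitivity = tp / (tp + fn)   # true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn}, "
          f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")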

  • Discussion

    Fourth, there is inconsistency over key terminology used in deep learning studies. We suggest distinguishing the datasets involved in the development of an algorithm as training set (for training the algorithm), tuning set (for tuning hyperparameters), and validation test set (for estimating the performance of the algorithm). For describing the different types of validation test sets, we suggest adoption of the suggestion by Altman and Royston: internal validation (for in-sample validation), temporal validation (for in-sample validation with a temporal split), and external validation (for out-of-sample validation).
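
  • To make the suggested terminology concrete, a minimal Python sketch of splitting one sample into a training set, a tuning set, and an internal validation test set; the 70/15/15 proportions and random data are illustrative assumptions, and temporal or external validation would instead use data held out by time or collected at another site:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(1000, 32)            # illustrative feature matrix
    y = np.random.randint(0, 2, size=1000)  # illustrative binary labels

    # Training set (70%) versus the remaining 30%.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.30, random_state=0, stratify=y)

    # Split the remainder evenly: a tuning set (for hyperparameters) and an
    # internal validation test set (for in-sample performance estimation).
    X_tune, X_valid_test, y_tune, y_valid_test = train_test_split(
        X_rest, y_rest, test_size=0.50, random_state=0, stratify=y_rest)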

  • Discussion

    Finally, although most studies did undertake an out-of-sample validation, most did not do this for both health-care professionals and deep learning algorithms.

    Our finding when comparing performance on internal versus external validation was that, as expected, internal validation overestimates diagnostic accuracy in both health-care professionals and deep learning algorithms. This finding highlights the need for out-of-sample external validation in all predictive models.

  • Thanks for listening