MILLIMAN WHITE PAPER

Evaluating supervised machine learning classification models in healthcare analytics

Whether you are a hospital administrator looking to improve workflow efficiency, a provider looking to improve patient outcomes, or an insurance administrator looking to decrease the number of fraudulent claims, the ability to evaluate a machine learning model is an essential part of your toolbox.

Ketaki Nagarkar, DPT
Ellyn Russo, MS

December 2022

Artificial intelligence (AI) tools, such as machine learning (ML), have the potential to improve healthcare operations and delivery, assist in diagnosis detection, and improve workflows.1,2

There are several important components to consider when evaluating ML algorithms, including, but not limited to, bias testing and metric selection. Elimination of bias, such as sample bias, prejudicial bias, measurement bias, and algorithmic bias, is critical. Bias testing should be performed on an ongoing basis rather than as a one-time task, and it requires monitoring over the entire project lifecycle.4

We focus the discussion for the remainder of this paper on metric selection and some commonly used tools for evaluating the performance of a supervised binary (two-class) classification model, illustrating them through several example scenarios.

ML model performance evaluation

ML algorithms include both supervised and unsupervised models (see Figure 1 for further detail on these types).3 Supervised ML algorithms used for predictive analytics learn, or train, from labeled data. The trained model is then used to predict future outcomes based on new, unseen data. An example application of a supervised ML algorithm is predicting the likelihood of a condition, or diagnosis, based on a patient's radiological scan. Conversely, an unsupervised ML algorithm does not have a labeled target outcome; it can be used, for example, to identify subgroups of patients with similar characteristics.

Common tools and metrics for performance evaluation of supervised classification models (definitions are provided throughout the remainder of the text and in the Appendix) include:

- Confusion matrix
- Receiver operating characteristic (ROC) curve
- Precision-recall curve
- Accuracy
- F1 score
- Recall / true positive rate / sensitivity
- Precision / positive predictive value
- True negative rate / specificity

A confusion matrix is a helpful visual summary of the information needed for model performance evaluation. It places true and false class predictions into a two-by-two format based on the model's predicted probabilities at a certain threshold (see Figure 2 for a confusion matrix template).

ROC curves are commonly used to visualize models when assessing both classes (0 and 1). An ROC curve plots recall (also called the true positive rate or sensitivity) on the y-axis against the false positive rate (1 − specificity) on the x-axis at different thresholds, or operating points.5 The area under the curve (AUC) quantifies the performance of the model across possible thresholds: the larger the AUC, the better the overall diagnostic performance of a test. As illustrated in Figure 3, model C has a larger ROC-AUC than model B and, thus, a better ability to discriminate.

The ROC-AUC is a well-known measure and is well suited for balancing the trade-off between sensitivity and specificity. The precision-recall curve (PRC), however, is useful when evaluating primarily for the positive class in an imbalanced data set.6 The PRC is obtained by plotting precision, or positive predictive value, on the y-axis against recall on the x-axis at different thresholds (see Figure 4); it does not consider the true negatives. The average precision (AP) score is a useful summary of the PRC-AUC: a higher AP score indicates a better ability to identify the true positive cases (improved recall) while minimizing false positives (better precision).7-11
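The paper illustrates these tools with figures rather than code. As a minimal sketch of how the same quantities can be computed, the snippet below uses scikit-learn on hypothetical labels and predicted probabilities; the library choice and the data are assumptions for illustration, not part of the paper.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             precision_recall_curve, roc_auc_score)

# Hypothetical test-set labels (1 = has the condition) and model-predicted
# probabilities for ten patients; stand-ins for real model output.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.60, 0.25, 0.80, 0.70, 0.40, 0.90])

# Confusion matrix at a chosen threshold (0.5 here). scikit-learn lays it
# out as [[TN, FP], [FN, TP]]: rows are actual, columns are predicted.
y_pred = (y_score >= 0.5).astype(int)
print(confusion_matrix(y_true, y_pred))

# ROC-AUC summarizes the recall vs. false-positive-rate trade-off across
# all thresholds; a larger value means a better ability to discriminate.
print("ROC-AUC:", roc_auc_score(y_true, y_score))

# Precision-recall curve and its average precision (AP) summary. Unlike the
# ROC curve, these ignore true negatives, which makes them more informative
# for the rare positive class in an imbalanced data set.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("Average precision:", average_precision_score(y_true, y_score))
```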
Applications of ML model evaluation metrics

To help understand how these metrics are used, we explore the following four example application scenarios.

Scenario 1: Consider 1,000 people, of which 100 have pulmonary hypertension. Here, 90% of the patients are disease-free, and 10% are the (minority) class of interest, a scenario that may be relatable to many healthcare diagnostic tests that detect medical conditions.

In this example, what is the consequence of a false positive, that is, if the model falsely predicts a healthy patient as having the condition, leading to unneeded treatments (see Figure 5)? On the other hand, what is the penalty of a false negative, that is, if the model falsely predicts the patient as being healthy when, in fact, this patient has the condition, causing a potential delay in, or lack of, lifesaving treatment?

If the goal of the model is to avoid false negatives, then an overall classification accuracy score (correct predictions divided by total predictions) of 90% is not necessarily a helpful measure of performance for detecting the rare occurrence. A false negative would mean a lifesaving intervention was missed; a false positive would likely mean more expensive testing and unnecessary intervention. In this situation, if minimizing false negatives …
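To make the accuracy pitfall in Scenario 1 concrete, the following sketch (a hypothetical illustration; the paper itself contains no code) scores a degenerate model that labels all 1,000 patients as healthy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Scenario 1: 1,000 patients, 100 of whom have pulmonary hypertension
# (label 1). The labels below are a hypothetical encoding of that setup.
y_true = np.array([1] * 100 + [0] * 900)

# A degenerate "model" that simply predicts every patient is healthy.
y_pred = np.zeros(1000, dtype=int)

print("Accuracy: ", accuracy_score(y_true, y_pred))   # 0.90 — looks strong
print("Recall:   ", recall_score(y_true, y_pred))     # 0.00 — all 100 cases missed
print("Precision:", precision_score(y_true, y_pred, zero_division=0))  # undefined; reported as 0
```

Accuracy alone rewards the majority class and hides the 100 missed diagnoses; recall, together with the PRC and AP metrics discussed above, exposes them.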