In statistics, the receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.
The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection in machine learning. The false positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 − specificity). It can also be thought of as a plot of the statistical power as a function of the Type I error of the decision rule (when the performance is calculated from just a sample of the population, it can be thought of as an estimator of these quantities). The ROC curve is thus the sensitivity as a function of fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (area under the probability distribution from −∞ to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false alarm probability on the x-axis.
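As a concrete illustration of how the curve is traced, the following minimal sketch sweeps a decision threshold over classifier scores and computes (FPR, TPR) pairs with NumPy. The labels and scores are made-up example data, not values from the text:

```python
import numpy as np

# Hypothetical example data: 1 = positive class, 0 = negative class.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.5, 0.7, 0.6, 0.3])

# Sweep the discrimination threshold from high to low; each setting
# yields one (FPR, TPR) point on the ROC curve.
for t in np.sort(np.unique(scores))[::-1]:
    pred = scores >= t                # classify as positive above the threshold
    tpr = np.mean(pred[y_true == 1])  # sensitivity / recall / hit rate
    fpr = np.mean(pred[y_true == 0])  # fall-out = 1 - specificity
    print(f"threshold={t:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```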
ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones independently of (and prior to specifying) the cost context or the class distribution. ROC analysis is related in a direct and natural way to cost/benefit analysis of diagnostic decision making.
The ROC curve was first developed by electrical engineers and radar engineers during World War II for detecting enemy objects on battlefields, and was soon introduced to psychology to account for the perceptual detection of stimuli. ROC analysis has since been used in medicine, radiology, biometrics, natural hazard forecasting, meteorology, model performance assessment, and other fields for many decades, and is increasingly used in machine learning and data mining research.
ROC is also known as the relative operating characteristic curve, because it is a comparison of two operating characteristics (TPR and FPR) as the criterion changes.
Basic concepts
A classification model (classifier or diagnosis) is a mapping of instances to particular classes/groups. The classifier or diagnosis result can be a real value (continuous output), in which case the classifier boundary between classes must be determined by a threshold value (for instance, to determine whether a person has hypertension based on a blood pressure measurement). Or it can be a discrete class label, indicating one of the classes.
Let us consider a two-class prediction problem (binary classification), in which the outcomes are labeled either as positive (p) or negative (n). There are four possible outcomes from a binary classifier. If the outcome from a prediction is p and the actual value is also p, then it is called a true positive (TP); however, if the actual value is n then it is said to be a false positive (FP). Conversely, a true negative (TN) has occurred when both the prediction outcome and the actual value are n, and a false negative (FN) is when the prediction outcome is n while the actual value is p.
To get an appropriate example from a real-world problem, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive in this case occurs when the person tests positive but does not actually have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease.
Let us define an experiment with P positive instances and N negative instances for some condition. The four outcomes can be formulated in a 2×2 contingency table or confusion matrix, as follows:
ROC space
From the contingency table one can derive several evaluation "metrics" (see infobox). To draw an ROC curve, only the true positive rate (TPR) and the false positive rate (FPR) are needed (as functions of some classifier parameter). The TPR defines how many correct positive results occur among all positive samples available during the test. The FPR, on the other hand, defines how many incorrect positive results occur among all negative samples available during the test.
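In code, the two rates reduce to a few counting operations over the contingency-table entries. A minimal sketch, assuming 0/1 NumPy arrays of true labels and thresholded predictions:

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) computed from binary label and prediction arrays."""
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
    tpr = tp / (tp + fn)  # correct positives among all positive samples
    fpr = fp / (fp + tn)  # incorrect positives among all negative samples
    return tpr, fpr

print(tpr_fpr(np.array([1, 1, 0, 0]), np.array([1, 0, 1, 0])))  # (0.5, 0.5)
```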
The ROC space is defined by FPR and TPR as the x and y axes, respectively, which depicts the relative trade-off between true positives (benefits) and false positives (costs). Since TPR is equivalent to sensitivity and FPR is equal to 1 − specificity, the ROC graph is sometimes called the sensitivity vs. (1 − specificity) plot. Each prediction result or instance of a confusion matrix represents one point in the ROC space.
The best possible prediction method would yield a point in the upper left corner, at coordinate (0,1) of the ROC space, representing 100% sensitivity (no false negatives) and 100% specificity (no false positives). The point (0,1) is also called a perfect classification. A random guess would give a point along the diagonal line (the so-called line of no discrimination) from the bottom left to the top right corner (regardless of the positive and negative base rates). An intuitive example of random guessing is a decision by flipping a coin. As the size of the sample increases, a random classifier's ROC point tends toward the diagonal line. In the case of a balanced coin, it will tend toward the point (0.5, 0.5).
The diagonal divides the ROC space. Points above the diagonal represent good classification results (better than random); points below the line represent bad results (worse than random). Note that the output of a consistently bad predictor can simply be inverted to obtain a good predictor.
Let us look at four prediction results from 100 positive and 100 negative instances:
A plot of the four results above in the ROC space is given in the figure. Method A clearly shows the best predictive power among A, B, and C. Result B lies on the random guess line (the diagonal line), and it can be seen from the table that the accuracy of B is 50%. However, when C is mirrored across the center point (0.5, 0.5), the resulting method C′ is even better than A. This mirrored method simply reverses the predictions of whatever method or test produced contingency table C. Although the original method C has negative predictive power, simply reversing its decisions leads to a new predictive method C′ which has positive predictive power. When method C predicts p or n, method C′ would predict n or p, respectively. In this manner, test C′ would perform best. While the closer a result from a contingency table is to the upper left corner the better it predicts, the distance from the random guess line in either direction is the best indicator of how much predictive power a method has. If the result is below the line (i.e. the method is worse than a random guess), all of the method's predictions must be reversed in order to utilize its power, thereby moving the result above the random guess line.
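The inversion trick is easy to verify numerically: complementing the decisions (or negating the scores) of a worse-than-random classifier reflects its point through (0.5, 0.5), turning an AUC of a into 1 − a. A sketch, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
# Deliberately anti-correlated scores: a worse-than-random classifier.
scores = -y + rng.normal(scale=2.0, size=1000)

auc_bad = roc_auc_score(y, scores)
auc_reversed = roc_auc_score(y, -scores)  # reverse every decision
print(auc_bad, auc_reversed)              # auc_reversed == 1 - auc_bad
```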
Curve in ROC space
For example, imagine that the blood protein levels in diseased people and healthy people are normally distributed with means of 2 g/dL and 1 g/dL respectively. A medical test might measure the level of a certain protein in a blood sample and classify any number above a certain threshold as indicating disease. The experimenter can adjust the threshold (the black vertical line in the figure), which will in turn change the false positive rate. Increasing the threshold would result in fewer false positives (and more false negatives), corresponding to a leftward movement on the curve. The actual shape of the curve is determined by how much overlap the two distributions have. These concepts are demonstrated in the Receiver Operating Characteristic (ROC) applet.
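Under this normal model the curve has a closed form: for a threshold t, TPR = P(diseased level > t) and FPR = P(healthy level > t). A sketch using SciPy, with an assumed common standard deviation of 1 g/dL (the text does not specify one):

```python
import numpy as np
from scipy.stats import norm

mu_sick, mu_healthy, sigma = 2.0, 1.0, 1.0  # g/dL; sigma is an assumption

# Each threshold gives one (FPR, TPR) point; raising it moves left on the curve.
for t in np.linspace(-1.0, 4.0, 11):
    tpr = norm.sf(t, loc=mu_sick, scale=sigma)     # P(diseased level > t)
    fpr = norm.sf(t, loc=mu_healthy, scale=sigma)  # P(healthy level > t)
    print(f"t={t:4.1f}  FPR={fpr:.3f}  TPR={tpr:.3f}")
```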
Further interpretations
Sometimes, the ROC is used to generate a summary statistic. Common versions are (see the sketch after this list for a worked example of several of them):
- the intercept of the ROC curve with the line at 45 degrees orthogonal to the no-discrimination line - the balance point where Sensitivity = Specificity
- the intercept of the ROC curve with the tangent at 45 degrees parallel to the no-discrimination line that is closest to the error-free point (0,1) - also called Youden's J statistic and generalized as Informedness
- the area between the ROC curve and the no-discrimination line - the Gini coefficient
- the area between the full ROC curve and the triangular ROC curve that includes only (0,0), (1,1) and one selected operating point (tpr, fpr) - Consistency
- the area under the ROC curve, or "AUC" ("Area Under Curve"), or A' (pronounced "a-prime"), or "c-statistic"
- the sensitivity index d' (pronounced "d-prime"), the distance between the mean of the distribution of activity in the system under noise-alone conditions and its distribution under signal-alone conditions, divided by their standard deviation, under the assumption that both of these distributions are normal with the same standard deviation. Under these assumptions, the shape of the ROC is entirely determined by d'.
However, any attempt to summarize the ROC curve into a single number loses information about the pattern of trade-offs of the particular discriminator algorithm.
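For illustration, several of the summary statistics listed above can be read off a sampled ROC curve; the sketch below assumes monotone fpr/tpr arrays tracing one curve from (0,0) to (1,1) (the numbers are made up):

```python
import numpy as np

# Assumed sample of one ROC curve, ordered from (0,0) to (1,1).
fpr = np.array([0.0, 0.1, 0.2, 0.4, 0.7, 1.0])
tpr = np.array([0.0, 0.4, 0.6, 0.8, 0.95, 1.0])

# Trapezoidal area under the curve.
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)
gini = 2 * auc - 1            # twice the area between curve and diagonal
youden_j = np.max(tpr - fpr)  # Youden's J / Informedness at the best point
print(f"AUC={auc:.3f}  Gini={gini:.3f}  J={youden_j:.3f}")
```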
Area under the curve
When using normalized units, the area under the curve (often referred to simply as the AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative'). This can be seen as follows: the area under the curve is given by (the integral boundaries are reversed as a large threshold T has a lower value on the x-axis)

A = \int_{\infty}^{-\infty} \mathrm{TPR}(T)\,\mathrm{FPR}'(T)\,dT = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} I(T' > T)\, f_1(T')\, f_0(T)\, dT'\, dT = P(X_1 > X_0)
where X_1 is the score for a positive instance and X_0 is the score for a negative instance, and f_1 and f_0 are the probability densities as defined in the previous section.
It can further be shown that the AUC is closely related to the Mann-Whitney U, which tests whether positives are ranked higher than negatives. It is also equivalent to the Wilcoxon test of ranks. The AUC is related to the Gini coefficient (G_1) by the formula G_1 = 2\,\mathrm{AUC} - 1, where:

G_1 = 1 - \sum_{k=1}^{n} (X_k - X_{k-1})(Y_k + Y_{k-1})
In this way, it is possible to calculate the AUC by using an average of a number of trapezoidal approximations.
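The two views of the AUC can be checked against each other: the fraction of positive-negative score pairs in which the positive instance ranks higher (counting ties as one half) matches the trapezoidal estimate. A minimal sketch with synthetic scores:

```python
import numpy as np

rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, 200)  # scores of positive instances (X_1)
neg = rng.normal(0.0, 1.0, 200)  # scores of negative instances (X_0)

# Rank-based (Mann-Whitney style) estimate of P(X_1 > X_0).
diff = pos[:, None] - neg[None, :]
auc_rank = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)

# Trapezoidal estimate from a full threshold sweep.
ts = np.sort(np.concatenate([pos, neg]))[::-1]
tpr = np.concatenate([[0.0], [np.mean(pos >= t) for t in ts]])
fpr = np.concatenate([[0.0], [np.mean(neg >= t) for t in ts]])
auc_trap = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

print(auc_rank, auc_trap)  # the two estimates agree
```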
It is also common to calculate the Area Under the ROC Convex Hull (ROC AUCH = ROCH AUC), as any point on the line segment between two prediction results can be achieved by randomly using one or the other system with probabilities proportional to the relative length of the opposite component of the segment. It is also possible to invert concavities - just as in the figure the worse solution can be reflected to become a better solution; concavities can be reflected in any line segment, but this more extreme form of fusion is much more likely to overfit the data.
The machine learning community most often uses the ROC AUC statistic for model comparison. However, this practice has recently been questioned based on new machine learning research showing that the AUC is quite noisy as a classification measure and has some other significant problems in model comparison. A reliable and valid AUC estimate can be interpreted as the probability that the classifier will assign a higher score to a randomly chosen positive example than to a randomly chosen negative example. However, critical research suggests frequent failures in obtaining reliable and valid AUC estimates. Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution. Nonetheless, the coherence of AUC as a measure of aggregated classification performance has been vindicated, in terms of a uniform rate distribution, and AUC has been linked to a number of other performance metrics such as the Brier score.
Another recent explanation of the problem with ROC AUC is that reducing the ROC curve to a single number ignores the fact that it is about the trade-offs between the different systems or performance points plotted, and not the performance of an individual system, and that it also ignores the possibility of concavity repair, so that related alternative measures such as Informedness or DeltaP are recommended. These measures are essentially equivalent to the Gini coefficient for a single prediction point, with DeltaP' = Informedness = 2AUC − 1, while DeltaP = Markedness represents the dual (i.e. predicting the prediction from the real class), and their geometric mean is the Matthews correlation coefficient.
Other measures
The total operating characteristic (TOC) also characterizes diagnostic ability while revealing more information than the ROC. For each threshold, the ROC reveals two ratios, TP/(TP + FN) and FP/(FP + TN). In other words, the ROC reveals hits/(hits + misses) and false alarms/(false alarms + correct rejections). On the other hand, the TOC shows the total information in the contingency table for each threshold. The TOC method reveals all of the information that the ROC method provides, plus additional important information that the ROC does not reveal, i.e. the size of every entry in the contingency table for each threshold. The TOC also provides the popular AUC of the ROC.
These figures are the TOC and ROC curves using the same data and thresholds. Consider the point that corresponds to a threshold of 74. The TOC curve shows the number of hits, which is 3, and hence the number of misses, which is 7. Additionally, the TOC curve shows that the number of false alarms is 4 and the number of correct rejections is 16. At the corresponding point on the ROC curve, it is possible to read the values for false alarms/(false alarms + correct rejections) and hits/(hits + misses). For example, at a threshold of 74, it is evident that the x coordinate is 0.2 and the y coordinate is 0.3. However, these two values are insufficient to construct all the entries of the underlying two-by-two contingency table.
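The arithmetic behind that point is simple to reproduce; the sketch below uses the four contingency entries quoted for threshold 74:

```python
# Contingency entries at threshold 74, as read off the TOC curve above.
hits, misses = 3, 7                  # TP, FN
false_alarms, correct_rej = 4, 16    # FP, TN

tpr = hits / (hits + misses)                       # 3/10 = 0.3 -> ROC y coordinate
fpr = false_alarms / (false_alarms + correct_rej)  # 4/20 = 0.2 -> ROC x coordinate
print(fpr, tpr)
```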
Whereas ROC AUC varies between 0 and 1 - with an uninformative classifier yielding 0.5 - the alternative measures of Informedness and the Gini coefficient (in the single-parameterization or single-system case) all have the advantage that 0 represents chance performance while 1 represents perfect performance, and −1 represents the "perverse" case of full informedness always giving the wrong response. Bringing chance performance to 0 allows these alternative scales to be interpreted as Kappa statistics. Informedness has been shown to have desirable characteristics for machine learning versus other common definitions of Kappa such as Cohen's Kappa and Fleiss' Kappa.
Sometimes it can be more useful to look at a specific region of the ROC curve rather than at the whole curve. It is possible to compute a partial AUC. For example, one could focus on the region of the curve with a low false positive rate, which is often of prime interest for population screening tests. Another common approach for classification problems in which P << N (common in bioinformatics applications) is to use a logarithmic scale for the x-axis.
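As one concrete option, scikit-learn exposes a standardized partial AUC through the max_fpr argument of roc_auc_score; a sketch with made-up data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=500)
scores = y + rng.normal(scale=1.5, size=500)

full_auc = roc_auc_score(y, scores)
partial_auc = roc_auc_score(y, scores, max_fpr=0.1)  # low-FPR screening region only
print(full_auc, partial_auc)
```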
Detection error tradeoff graph
An alternative to the ROC curve is the detection error tradeoff (DET) graph, which plots the false negative rate (missed detections) vs. the false positive rate (false alarms) on non-linearly transformed x- and y-axes. The transformation function is the quantile function of the normal distribution, i.e., the inverse of the cumulative normal distribution. It is, in fact, the same transformation as zROC, below, except that the complement of the hit rate, the miss rate or false negative rate, is used. This alternative spends more graph area on the region of interest. Most of the ROC area is of little interest; one cares primarily about the region tight against the y-axis and the top left corner - which, because of using the miss rate instead of its complement, the hit rate, is the lower left corner in a DET plot. Furthermore, DET graphs have the useful property of linearity and a linear threshold behavior for normal distributions. The DET plot is used extensively in the automatic speaker recognition community, where the name DET was first used. The analysis of ROC performance in graphs with this warping of the axes was used by psychologists in perception studies halfway through the 20th century, where it was dubbed "double probability paper".
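A DET plot can be derived from ROC data by applying the normal quantile (probit) transform to both error rates. A sketch with assumed interior sample points (the probit is infinite at exactly 0 and 1):

```python
import numpy as np
from scipy.stats import norm

# Assumed ROC sample points, strictly inside (0, 1).
fpr = np.array([0.01, 0.05, 0.10, 0.20, 0.40])
tpr = np.array([0.30, 0.55, 0.70, 0.85, 0.95])
fnr = 1.0 - tpr  # miss rate: the complement of the hit rate

x = norm.ppf(fpr)  # probit-warped false-alarm axis
y = norm.ppf(fnr)  # probit-warped miss axis
# Plotting y against x gives the DET curve; normally distributed scores
# map to a straight line under this transform.
print(list(zip(x.round(2), y.round(2))))
```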
Z-score
If a standard score is applied to the ROC curve, the curve will be transformed into a straight line. This z-score is based on a normal distribution with a mean of zero and a standard deviation of one. In memory strength theory, one must assume that the zROC is not only linear, but has a slope of 1.0. The normal distributions of targets (studied objects that the subjects need to recall) and lures (non-studied objects that the subjects attempt to recall) are the factor causing the zROC to be linear.
The linearity of the zROC curve depends on the standard deviations of the target and lure strength distributions. If the standard deviations are equal, the slope will be 1.0. If the standard deviation of the target strength distribution is larger than the standard deviation of the lure strength distribution, then the slope will be smaller than 1.0. In most studies, it has been found that the slopes of zROC curves constantly fall below 1, usually between 0.5 and 0.9. Many experiments yielded a zROC slope of 0.8. A slope of 0.8 implies that the variability of the target strength distribution is 25% larger than the variability of the lure strength distribution.
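That slope statement corresponds to a short computation: transform the hit and false-alarm rates with the inverse normal CDF and fit a line; under the normal strength model the fitted slope estimates the ratio of lure to target standard deviations. A sketch with hypothetical rates:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical hit / false-alarm rates across confidence criteria.
hit = np.array([0.95, 0.85, 0.70, 0.50, 0.30])
fa = np.array([0.60, 0.40, 0.25, 0.12, 0.05])

z_hit, z_fa = norm.ppf(hit), norm.ppf(fa)
slope, intercept = np.polyfit(z_fa, z_hit, 1)  # zROC: fit z(hit) against z(fa)
# slope estimates sigma_lure / sigma_target; a slope of 0.8 implies the
# target distribution is 25% more variable than the lure distribution.
print(slope)
```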
Another variable used is d' (d prime) (discussed above under "Further interpretations"), which can easily be expressed in terms of z-values. Although d' is a commonly used parameter, it must be recognized that it is only relevant when strictly adhering to the very strong assumptions of strength theory made above.
The z-score of an ROC curve is always linear, as assumed, except in special situations. The Yonelinas familiarity-recollection model is a two-dimensional account of recognition memory. Instead of the subject simply answering yes or no to a specific input, the subject gives the input a feeling of familiarity, which operates like the original ROC curve. What changes, though, is the parameter for Recollection (R). Recollection is assumed to be all-or-none, and it trumps familiarity. If there were no recollection component, the zROC would have a predicted slope of 1. However, when the recollection component is added, the zROC curve will be concave up, with a decreased slope. This difference in shape and slope results from an added element of variability due to some items being recollected. Patients with anterograde amnesia are unable to recollect, so their Yonelinas zROC curve would have a slope close to 1.0.
History
The ROC curve was first used during World War II for the analysis of radar signals before it was employed in signal detection theory. Following the attack on Pearl Harbor in 1941, the United States military began new research to increase the prediction of correctly detected Japanese aircraft from their radar signals. For these purposes they measured the ability of a radar receiver operator to make these important distinctions, which was called the Receiver Operating Characteristic.
In the 1950s, ROC curves were employed in psychophysics to assess human (and occasionally non-human) detection of weak signals. In medicine, ROC analysis has been extensively used in the evaluation of diagnostic tests. ROC curves are also used extensively in epidemiology and medical research and are frequently mentioned in conjunction with evidence-based medicine. In radiology, ROC analysis is a common technique for evaluating new radiology techniques. In the social sciences, ROC analysis is often called the ROC Accuracy Ratio, a common technique for judging the accuracy of default probability models. ROC curves are widely used in laboratory medicine to assess the diagnostic accuracy of a test, to choose the optimal cut-off of a test and to compare the diagnostic accuracy of several tests.
ROC curves also proved useful for the evaluation of machine learning techniques. The first application of ROC in machine learning was by Spackman, who demonstrated the value of ROC curves in comparing and evaluating different classification algorithms.
The ROC curve is also used in forecast verification in meteorology.
ROC curves beyond binary classification
The extension of ROC curves to classification problems with more than two classes has always been cumbersome, as the degrees of freedom increase quadratically with the number of classes, and the ROC space has c(c − 1) dimensions, where c is the number of classes. Some approaches have been made for the particular case of three classes (three-way ROC). The calculation of the volume under the ROC surface (VUS) has been analyzed and studied as a performance metric for multi-class problems. However, because of the complexity of approximating the true VUS, some other approaches based on an extension of the AUC are more popular as evaluation metrics.
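In practice, libraries usually fall back on such AUC extensions rather than the full VUS. For instance, scikit-learn offers one-vs-rest averaging of per-class AUCs; a sketch with toy probability scores (all values invented for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy 3-class problem: one row of class probabilities per sample (rows sum to 1).
y_true = np.array([0, 1, 2, 1, 0, 2])
y_score = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.6, 0.3],
                    [0.2, 0.2, 0.6],
                    [0.3, 0.5, 0.2],
                    [0.6, 0.3, 0.1],
                    [0.1, 0.3, 0.6]])

# One-vs-rest averaging: a common AUC-based extension to multi-class problems.
print(roc_auc_score(y_true, y_score, multi_class='ovr'))
```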