Simulation study of a novel method for comparing more than two independent receiver operating characteristic (ROC) curves based on the area under the curves (AUCs)

* Corresponding author (roshini@mail.cmb.ac.lk) Abstract: Receiver operating characteristic (ROC) graphs are useful for organising binary classifiers and visualising their performance. In order to compare classifiers it may be needed to reduce the ROC performance to a single scalar value representing expected performance. Such a commonly used summary statistic is the area under the curve (AUC) of the ROC curve. The AUCs can be estimated either parametrically or non-parametrically. The parametric approach assumes that the signal present (positive) and signal absent (negative) groups can be represented as two overlapping Gaussian distributions. If the observations of two or more ROC curves are obtained from the same region of interest, their AUCs are considered to be correlated.


INTRODUCTION
A ROC curve is simply a graphical plot, which illustrates the performance of a binary classifier system as its discrimination threshold is varied. A variety of summary indices have been proposed as a single measure or simplification for considering the entire curve. One of the most common measures used for summarising the performance of the diagnostic modalities is the value of the area under the curve (AUC), which ranges from 0 to 1, where the higher the value of the AUC a better discrimination power is implied (Fawcett, 2006). There are parametric, nonparametric and semi parametric methods of estimating the area under a ROC curve. The method of estimating the AUC depends on whether the data used is either continuous or in rating form.
The application areas of receiver operating characteristic (ROC) curves range from medical imaging and radiology to machine learning and data mining. In practice it is often required to compare several alternative diagnostic tests and this entails the comparison of several AUCs under the ROC curves. At present this is achieved by comparing all pairwise combinations but this procedure has some serious drawbacks.
To make things clear consider an example from medical testing. Suppose it is required to determine whether suspected coronary artery disease (CAD) patients are positive or negative with respect to CAD. The gold standard test for determining this is an angiogram.
However an angiogram is a very expensive diagnostic tool and specially in developing countries like Sri Lanka only a few persons can afford this test. Two substitute tests for diagnosing CAD are the coronary stress test (CST) and non-invasive carotid artery ultrasound of the neck (NICAUN). In order to compare the performance of the two substitute tests with respect to that of an angiogram, two independent groups of patients of size m and n, respectively can be selected. One group will be given the angiogram and CST and the other group will be given the angiogram and NICAUN. The comparison of each substitute test with respect to the angiogram will be done using the AUCs of the respective ROC curves, which are constructed by plotting the sensitivity versus (1-specificity) for varying discrimination threshold values (Fawcett, 2006). As the two groups of patients are independent the AUCs will also be independent. This corresponds to the comparison of two independent AUCs.
An asymptotic test for comparing several AUCs under the curves has been proposed by Meyen and Sooriyarachchi (2014) and it is of interest to study the effectiveness of this method and validate its properties. Therefore, the motivation of this study arose to address the need for such proper simulation based analysis.
No proper study concerning the properties of the asymptotic test proposed by Meyen and Sooriyarachchi (2014) has been carried out yet. An important assumption of this test is that the AUCs are multivariate normally distributed. Therefore it was of interest to study the properties of the test under different circumstances, as in many real life applications the AUCs may not be multivariate normally distributed. The study determined the appropriate sample sizes and checked the null distribution of the proposed statistic, whilst identifying its limitations. Here the sample size corresponds to the size of the sample used for generating the ROC curve and its AUC. In the simulation study each sample generates a ROC curve. The type of classifier used here is binary.
To carry out this study it was necessary to implement a programme of the Dorfman and Alf method of maximum likelihood estimation (Dorfman & Alf, 1969) in order to estimate the parameters needed to calculate the AUC. The language used for this implementation was C. Furthermore, the analysis of AUCs of ROC curves was carried out for categorical rating-scale data where 3 categories were considered for uncorrelated ROC curves with two ROC curves being compared at once. As per previous research carried out by Cleeves (2002), it was decided to simulate data for sample sizes of 20, 50, 100, 120,140, 250 and 500 observations in total (i.e. sample sizes of 10, 25, 50, 60, 70, 125 and 250 with respect to the positive and negative groups, respectively). The degree of overlap of the two populations was controlled by generating observations from Gaussian distributions whose means differed by 0.5, 0.75 and 1 standard deviations. Additionally, data were simulated assuming equal variances in the two subpopulations, and assuming distributions with standard deviation ratios of 1:1.5. The mean and the standard deviation of the negative population were taken to be 0.5 and 0.1, respectively. Each of the 42 combinations of sample size, degree of overlap, and standard deviation ratio was replicated 1000 times. After simulation for the various sample sizes, likelihood ratio (LR) and score tests based on the beta distribution, which the proposed test statistic should asymptotically follow, were applied to the sample of test statistics thus formed for the different sample sizes. When the sample size was very small (20) confidence intervals based on the above tests could not be constructed as large sample approximations were invalid.

Binormal ROC curves
In order to understand the construction of ROC curves it is important to understand the signal-detection paradigm from which it is derived. According to Grey and Morgan (1972), the signal-detection paradigm consists simply of a subject successively choosing between signal present (with background noise), SN, or no signal present (just noise), N. The model now usually assumed is that the stimulus sets up a response within the subject that can be represented as a continuous, uni-dimensional random variable X with probability distribution function, F N (x) as the stimulus noise and F SN (x) as the stimulus signal and noise.
Typically F N (x) = F(x), F SN (x) = F(bx -a) for some F, i.e. the distributions are two-parameter and of the same form. The ROC curves simulated in this study use F=N(.,.), where N(.,.) denotes the normal distribution. If the subject is required to respond just Yes (signal present) or No (no signal present) then the model assumes that the tests for the underlying distribution by setting up a criterion (z, say) and responding "Yes" if the response X is, say greater than z and "No" if X is less than z. Then it follows that, ... (1) ... (2) ) Journal of the National Science Foundation of Sri Lanka 43 (4) December 2015 Varying z gives the ROC curve [P(Hit) against P(False alarm)], defined by a and b that is taken to represent the subject's performance. In order to estimate b, the experiment must be elaborated, the two standard means of doing so being either a repetition of the above with different z values (which could be obtained by manipulating a payoff matrix) or by requiring that the subject grade the response (for example, Yes, Sure or No, Fairly sure). It is the second elaboration that is considered here: by analogy with the above the model then assumes n criteria, zk , k = 1,2, ... , n (In addition it is convenient to define z 0 = -∞ and z n+1 = + ∞ ) so that the subject makes the overt response R i , if the latent response X falls in the interval: The values of a and b along with other parameters of the ROC curve were estimated using the method of scoring proposed by Dorfman and Alf (1969).

Problems of iteration
The start for the initial iteration was used as the parameter estimates of the simple linear regression as outlined in Grey and Morgan (1972). Iteration continues until either, two successive iterates differ by less than 10 -3 in all of their components and the final iterate is a possible solution (i.e. ẑ 1 ≤ ẑ 2 ≤ ... ≤ ẑ n and b > 0). The degenerate solution for the parameter estimates of the ROC curve can occur from empty cells in the data matrix. Therefore in order to overcome the problem of degeneracy similar to Dorfman and Berbaum (1995) the programme developed adds a small positive constant in order to avoid degeneracy in the case of empty cells.

Determination of the threshold of the categories of the rating data
The threshold of the categories was decided by dividing the false positive fraction (FPF) into equal fractions, so that for example if 3 category rating data were considered the thresholds or cut points of the categories would be 0.33 and 0.66. If the mean and standard deviation for the signal present (with background noise) cases are denoted by µ SN ,σ SN , and the mean and the standard deviation for the noise only cases are denoted as µ N ,σ N , assuming without loss of generality that µ N ≤ µ SN and c represent the cut point on the decision variable axis such that a case is classified as 'negative' if x≤ C and positive if x > C, then the values for the FPF is given as follows where Φ(. ) denotes the cumulative standard normal distribution (Metz & Pan, 1999):

Calculation of the AUC and variance of the AUC
According to Metz et al. (1998), when the mean and standard deviation of the signal present (with background noise) cases, and the mean and the standard deviation of the noise only cases are considered in their usual notation, the values of a and b are given as follows in equations (4) and (5).
It is possible to obtain the AUC of a ROC curve using the following formula where Φ(. ) denotes the cumulative standard normal distribution as shown in equation (6).
In order to calculate the variance of the AUC, the delta method, which is described in detail in Casella and Berger (2002) is made use of, giving the formula as follows for the variance.

Proposed test statistic
The test developed by Meyen and Sooriyarachchi (2014) has been developed using various results from multivariate statistics along with the properties of ROC curves. Since the test is derived for the asymptotic distribution of the statistic it is of interest to study the small sample behaviour of the test statistic as well. In the following sections is the derivation of the test developed by Meyen and Sooriyarachchi (2014).
Let, which is a p×1 vector, where AUC i denotes the AUC of the i th ROC curve.

As
is not known it has to be estimated. From (Hotelling, 1947) the general form of the Hotelling's statistic is as follows, The dimensionality p needs to be reduced by 1 for estimating . Therefore taking instead of for large samples gives the following, Here p is the number of AUCs and n is the number of independent quantities used to calculate the AUCs. For the case of large samples (large n 1 and n 1 ) n will be large. The test statistic can be used to test H 0 .
Confidence intervals for the beta distribution are fit under MATLAB as described in Hahn and Shapiro (1994).

Application of the likelihood ratio and score tests for the test statistic
Since the test statistic is supposed to asymptotically follow a beta distribution it is necessary to check the hypothesis that it comes from a beta distribution with given parameters, which can be calculated for the ROC curves generated. Apart from this confidence intervals also need to be constructed for the maximum likelihood estimates (MLEs) fitted to the data under the hypothesis that they are beta distributed.
Consider the standard form of the distribution with shape parameters and support on . The beta distribution includes some well-known distributions as special cases, such as the uniform distribution and the power distribution . The total loglikelihood function for p,q based on a random sample y 1 , ..., y n of size n can be written as given in equation (10)  where ɡ 1 and ɡ 2 are the geometric means of the y , i s and s respectively, and is the gamma function. The log-likelihood function given above is regular (Maia et al., 2003) with respect to all p and q derivatives up to the fourth order. The score function is where ψ(. ) is the digamma function, and the observed Fisher's information matrix for p and q is given by Where ψ • (x) = dψ(x)/dx is the trigamma function.
The maximum likelihood estimates (MLEs) of p and q have no closed-form expressions and can only be obtained from an iterative process.
For testing H 0 the score statistic is particularly attractive since it does not involve estimation under the alternative hypothesis H 1 , but only requires the evaluation of the score function and the observed information matrix under the null hypothesis. The score statistic for the null hypothesis is given by, where all quantities with tilde represent estimates evaluated at the null hypothesis. The asymptotic chi-squared distribution of the LR and score statistics is used to test statistical hypotheses since their exact distributions are usually unknown.

Analysis of 3 category rating-method data for 20 individuals
When the AUCs of the generated ROC curves were applied to the test statistic in this case it was of interest to note that some of the values obtained were lying outside the uppermost boundary of the beta distribution [ i.e. the value 1 of the interval (0,1)]. This simply means that the large sample theory is not applicable with this sample size. Thus no confidence intervals are calculated for this case. Table 1 gives the results of the analysis of 3 category rating-method data for 20 individuals.

Analysis of 3 category rating-method data for 50 individuals
When the AUCs of the generated ROC curves were applied to the test statistic, it was noted in the case when the AUCs were distributed with means 0.75 standard deviations apart and standard deviations in the ratio 1:1, that it was not possible to apply the likelihood ratio and score tests, as the geometric mean of the AUCs could not be numerically represented. Apart from the particular instance it was possible to apply the likelihood ratio test and score tests to the other cases. However in all the other cases the values of the statistics obtained were high, which led to the rejection of the hypothesis that the AUCs came from beta distributions with the given parameters. of (n _ q _ 1)/2  Table 2 gives the results of the analysis of 3 category rating-method data for 50 individuals.

Analysis of 3 category rating-method data for 100 individuals
In this instance it was possible to compute the score statistics along with the likelihood ratio statistics. Table 3 gives the results of the analysis of 3 category rating-method data for 100 individuals. Table 4 gives the results of the analysis of 3 category rating-method data for 120 individuals. It should be noted in the case when the AUCs were distributed with means 0.75 standard deviations apart and standard deviations in the ratio 1:1.5 that it was not possible to apply the likelihood ratio and score tests as the   Table 5 gives the results of the analysis of 3 category rating-method data for 140 individuals. It is of interest to note that the value of the score statistic is negative when the AUCs were distributed with means 0.5 standard deviations apart and standard deviations in the ratio 1:1.5. According to Morgan et al. (2007) this is possible because, when the observed information matrix is used in place of the expected information matrix and if the observed information matrix is negative definite this could result in negative values for the score statistic. Table 6 gives the results of the analysis of 3 category rating-method data for 250 individuals.

Analysis of 3 category rating-method data for 250 individuals
Journal of the National Science Foundation of Sri Lanka 43(4) December 2015      Table 7: Analysis of 3 category rating-method data for 500 individuals the value of the score statistic is negative when the AUCs were distributed with means 0.5 standard deviations apart and standard deviations in the ratio 1:1, and when the AUCs were distributed with means 0.75 standard deviations apart and standard deviations in the ratio 1:1.5.

DISCUSSION AND CONCLUSION
In the case of the smallest sample size (i.e. 20) it could be seen that it was not appropriate to use the LR and score statistics as the values of the geometric mean of the AUC vectors, which were used in computing these statistics, could not be numerically represented in Matlab as they were very large. This illustrates that large sample theory does not hold for sample size 20. Therefore no confidence intervals were constructed for this case. The observed and expected parameters for this case were very different from each other indicating that the said theory regarding the beta distribution of the test statistic does not hold in the case of sample size 20.
For cases simulated under the sample size of 50 it could be seen that both the likelihood ratio and score test resulted in rejecting the assertion that the test statistic came from a beta distribution with specified parameters, apart from a single case in which both tests were not applicable. However in that case the confidence interval for the parameters did not include the expected values.
When considering the cases simulated under the sample size of 100 it could be seen that when the means differed by 0.5 and the standard deviation between the signal absent and signal present group were taken to be 1:1.5, the likelihood ratio and score tests indicated that the test statistic indeed came from a beta distribution with specified parameters, while the expected values for the parameter lay within the confidence interval thereby confirming this assertion. However, in the other cases it could be seen from the results that the assertion that the test statistic came from a beta distribution with specified parameters was rejected. It could be seen however that when the spread between the two Gaussian populations (i.e. signal present and signal absent) was higher (i.e. the standard deviations were in the ratio 1:1.5) the value of the likelihood ratio and score statistics were closer to that of the 5 % significance level of the distribution, which is indicative of the fact that the test statistic performs better when there is a greater spread between the two populations.
For the cases simulated under the sample size of 120 it could be seen that apart from a single case, the score and likelihood ratio tests could be calculated for all the other cases and they lead to the acceptance of the assertion that the test statistic follows a beta distribution with specified parameters. In this case as well it could be seen by the values of the likelihood ratio and score statistics obtained, that the test statistic appears to perform better when there is a greater spread between the two Gaussian populations.
For all cases simulated under the sample size 140 it could be seen that both the likelihood ratio and score test resulted in acceptance of the assertion that the test statistic comes from a beta distribution with specified parameters. Similar to the case under the sample size of 100 it could be seen that when the spread between the two Gaussian populations were higher, the value of the likelihood ratio and score statistics were closer to that of the 5 % significance level of the distribution, which serves to strengthen the assertion made towards the end in the previous paragraph.
For all cases simulated under the sample size of 250 it could be seen the results obtained were similar to the case when the sample size of 140 was considered.
When considering the cases simulated under the sample size of 500 it could be seen that apart from a single case, the score and likelihood ratio tests could be calculated for all the other cases and they lead to the acceptance of the assertion that the test statistic follows a beta distribution with specified parameters. In this case too it could be seen by the values of the likelihood ratio and score statistics obtained that the test statistic appears to perform better when there is a greater spread between the two Gaussian populations.
Journal of the National Science Foundation of Sri Lanka 43 (4) December 2015

Matlab implementation of the score test for the beta distribution
Given below is the Matlab implementation of the score test for the beta distribution.

Matlab implementation of the likelihood ratio test for the beta distribution
Given below is the Matlab implementation of the likelihood ratio test for the beta distribution.