Determining the sensitivity and specificity of a substitute test as a diagnostic for its gold standard in the presence of severe missingness

When a disease needs to be identified, substitute tests or screening tests are commonly used to recommend patients for the respective “Gold Standard”. Since these gold standards are seldom carried out for those who pass the substitute tests, calculating the sensitivity and specificity of the substitute test is nearly impossible using conventional methods. However, owing to the life-threatening nature of certain diseases such as coronary artery disease (CAD), understanding the effectiveness of these substitute tests in detecting the disease for sub-regions of the world is of utmost importance. Therefore, the primary objective of this study was to develop a theoretical framework to determine the sensitivity and specificity of a diagnostic test in the presence of severe missingness in the results of its gold standard. The methodology involves imputation of the missing response, which is the result of the gold standard for those who have passed the substitute test. Logistic models were used to predict the existence of the disease using pre-defined risk factors. Subsequently, receiver operating characteristic (ROC) curves were used to confirm the existing cut-off for the substitute test. This procedure is illustrated on data from a retrospective study carried out in a General Hospital in Sri Lanka. The ROC curve analysis verified the existing Bruce protocol method cut-off as being the best to classify the existence of CAD. The study confirms that the results conform to world standards.


Background
Screening tests or substitute tests are commonly used in the medical field, either to select individuals before recommending them for more conclusive tests, or to identify diseases before conspicuous symptoms are noticed. Primarily because of high costs and administrative difficulties, many tests have been developed to act as diagnostics for conclusive tests, which are commonly referred to as the "Gold Standard". In the domain of heart disease, cardiac stress tests (CST), echocardiography, and baseline electrocardiography are a few of the screening tests used to identify coronary artery disease, whilst the angiogram is considered its ultimate Gold Standard (Greenland et al., 2010). In the domain of diabetes mellitus, fasting plasma glucose and 2-hour plasma glucose during an oral glucose tolerance test are commonly used as substitute tests to identify Type-2 Diabetes, whilst its controversial gold standard is the level of glycosylated haemoglobin (American Diabetes Association, 2004).
Medical consultants use many factors such as gender, medical history, and social habits, along with the results of such screening tests, as predictors of a disease. However, the predictive capabilities of screening tests such as the CST have been shown to differ from region to region, resulting in controversies (Bokhari et al., 2008).
The appropriateness of diagnostic methods is rarely tested in developing nations, primarily because of problems regarding funding and other resources in such study domains. Here, a complex procedure is needed to identify the actual disease condition of individuals. The disease condition as indicated by a gold standard or a similar device is seldom available for individuals who pass these screening tests, as doctors in developing countries rarely advise the gold standard for those passing the screening test (Atukorale, 2005; Wake & Yoshiyama, 2009), resulting in 'missingness' with regard to vital data. This motivated the present paper, since perhaps the only way to obtain at least an approximate understanding of the precision of the stress test is through statistical techniques using missing data imputation.

Review of literature
In order to unravel the problem of the sensitivity and specificity of the substitute test, four primary statistical methods are needed, namely, a sample size calculation and design of the study, missing value analysis, statistical modeling and ROC curve analysis.
Approaches to design reviews can be seen in studies such as that carried out by Bolland et al. (1998). Royston and Barbiker (2002) recommended the relatively recent review on sample size calculation methods by Sahai and Khurshid (1996), along with the work carried out by Ury and Fleiss (1980) and Lachin (1977), for an extensive understanding of two-group binary outcome studies.
In the area of missing value analysis, the literature points to a vast range of techniques for imputing data, ranging from crude methods such as mean substitution to approximate Bayesian bootstrap methods, EM algorithms, and even non-parametric decision-tree methods. Imputation methods can be further divided into single value imputation (SI) and multiple-value imputation (MI) (Rubin, 1987). In the study by Van-Leeuwen et al. (2007), imputation was repeated 100 times to classify unverified women as having gestational diabetes mellitus or not.
For the purpose of summarizing the predictive power of a binary outcome situation, when test data do not fall into two obviously defined categories, Agresti (2007) recommends the use of ROC curves. The area under the curve (AUC) is one of the most commonly used methods as a summary measure of the ROC curve to compare classification capabilities (Vergara et al., 2008). Parametric, semi-parametric or non-parametric estimation methods can be used to estimate the AUC of a ROC (Vergara et al., 2008). Hanley and McNeil (1983) recommend a test for comparing two AUCs, pair-wise.

Objectives of the study
The primary objective of this study was to recommend a methodology to identify the sensitivity and specificity of a substitute test in the presence of missingness. Further, it was also intended to provide advice on assessing the significance of the substitute test as a diagnostic for its gold standard, and to verify the existing cut-off using an example set of data.

The data
The methodology was illustrated using data from the Cardiology Unit of the Sri Jayawardenapura Hospital in Sri Lanka. This hospital was chosen because it had a relatively well-organized records room, because it is a general hospital with patients from all over the country, and because of the administrative convenience of obtaining large numbers of records. Details of five hundred patient records were collected from bed-head tickets (BHT) from January 2008 to October 2009. A further set of 50 BHTs from the year 2010 was also obtained for the purpose of validating the study findings. Table 1 includes a description of the variables collected for the study.

Brief description of the methodology
The theoretical framework used a sample size design, missing value analysis, statistical modeling and ROC curve analysis.
The sample size formula by Lachin (1977) for the comparison of more than two groups with dichotomous outcomes in an r x c contingency table was adopted for the sampling design. The logistic regression method, an extension of the regression method, was used to multiple impute the missing responses of the gold standard (Yuan, 2001). The missing values were sampled from the posterior distribution of the responses using Monte-Carlo simulation (Tan et al., 2010). Logistic models, as predominantly used in the medical field to model a disease status, were used as the underlying statistical model (Agresti, 2007). Finally, the area under the receiver operating characteristic curve (AUC) was used to identify the best cut-off for the substitute test, using the Dorfman and Alf maximum likelihood estimation approach (Hanley & McNeil, 1983).

The design review
Following the initial data collection process (internal pilot study), a design review for the sample size calculation is usually conducted. Random sampling methods (Kish, 1995) can be used for sampling. In study domains that are possibly the first of their kind, or where there is a lack of prior information, a crude guess ($n_{Preliminary}$) for an initial sample size is decided based on past literature and the data collection carried out. Then, after gathering information about the study parameters using the initial observations, the sample size is re-estimated and a final sample size ($n_{New}$) is fixed (Bolland et al., 1998).

Sample size calculation 1: two groups with dichotomous outcomes
The formula for the comparison of two groups with dichotomous outcomes (that is, having proportions P 1 and P 2 ) is given by Ury and Fleiss (1980) for equal groups as well as for the comparison of unequal groups with dichotomous outcomes.
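As an illustration of a two-group calculation with dichotomous outcomes, the following is a minimal sketch using the common textbook normal-approximation formula for equal groups, without continuity correction. It is not necessarily the exact form given by Ury and Fleiss (1980); the function name `n_per_group` and the example proportions are hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided comparison of two proportions.

    Textbook normal approximation, equal group sizes, no continuity
    correction (an assumption; the paper's exact formula is not shown).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the power
    p_bar = (p1 + p2) / 2                          # pooled proportion
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Hypothetical proportions: 50 % diseased in one group, 30 % in the other.
print(n_per_group(0.5, 0.3))
```

Smaller differences between the two proportions require larger samples, which is why the largest requirement across the selected factors typically determines the final sample size.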
Sample size calculation 2: more than two groups with dichotomous outcomes
Royston and Barbiker (2002) recommended the sample size formula by Lachin (1977) for the comparison of more than two groups with dichotomous outcomes in r x c contingency tables. This procedure can easily be extended to the case where c > 2 by simply substituting for 'j', the number of possible outcomes for the response. The methodology for c = 2 has been incorporated in the ART module of Stata (Royston & Barbiker, 2002), in which the user can obtain the required sample size for six or fewer treatment groups (i = {2, 3, 4, 5, 6}).

Missing value imputation
The literature mentions three types of missingness, namely, missing completely at random (MCAR), missing at random (MAR) and non-ignorable missingness or missing not at random (MNAR) (Acock, 2005). The definitions of missingness are explained by Tan et al. (2010).

Posterior distribution (Tan et al., 2010)
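The Bayesian approach to missing value imputation consists of three steps (Gelman et al., 1995):
1. Constructing a full probability model summarized by a joint distribution for all observable and unobservable quantities.
2. Summarizing the findings for observed quantities of interest based on the derived conditional distributions of these quantities given the observed data.
3. Assessing model adequacy.
The joint distribution of $Y_{com}$ and $\theta$, where $\pi(\theta)$ denotes the prior distribution of $\theta$, is:
$$f(Y_{com}, \theta) = f(Y_{com} \mid \theta)\,\pi(\theta)$$
Conditional distributions of these quantities are obtained by the Bayes theorem:
$$f(\theta \mid Y_{com}) = \frac{f(Y_{com} \mid \theta)\,\pi(\theta)}{c(Y_{com})}$$
where
$$c(Y_{com}) = \int f(Y_{com} \mid \theta)\,\pi(\theta)\,d\theta$$
is the normalizing constant of $f(\theta \mid Y_{com})$. After $Y_{com}$ is observed, one can predict or forecast the future observation, denoted by $\tilde{y}$. The posterior predictive distribution of $\tilde{y}$ given the data $Y_{com}$ is defined as:
$$f(\tilde{y} \mid Y_{com}) = \int f(\tilde{y}, \theta \mid Y_{com})\,d\theta = \int f(\tilde{y} \mid Y_{com}, \theta)\,f(\theta \mid Y_{com})\,d\theta$$
Most frequently, the future observation $\tilde{y}$ and $Y_{com}$ are conditionally independent given $\theta$. In this case we have
$$f(\tilde{y} \mid Y_{com}) = \int f(\tilde{y} \mid \theta)\,f(\theta \mid Y_{com})\,d\theta$$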
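The posterior predictive distribution can be approximated by simulation: draw θ from its posterior, then draw the future observation given θ. The sketch below assumes a simple conjugate Beta-Bernoulli model, which is an illustrative choice and not the model used in this study; the prior parameters `a` and `b` and the example counts are hypothetical.

```python
import random


def posterior_predictive_bernoulli(successes, n, a=1.0, b=1.0,
                                   draws=20000, seed=0):
    """Monte Carlo estimate of P(y_new = 1 | data) under a Beta(a, b) prior."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        # Step 1: draw theta from the posterior Beta(a + s, b + n - s).
        theta = rng.betavariate(a + successes, b + n - successes)
        # Step 2: draw the future observation given theta.
        if rng.random() < theta:
            hits += 1
    return hits / draws


def analytic_predictive(successes, n, a=1.0, b=1.0):
    """Closed form for the same quantity: the posterior mean of theta."""
    return (a + successes) / (a + b + n)


# Hypothetical data: 30 diseased individuals among 100 observed.
print(posterior_predictive_bernoulli(30, 100), analytic_predictive(30, 100))
```

In this conjugate case the integral has a closed form, so the simulation can be checked against it; for models without a closed form, only the simulation route is available.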

Regression method for imputing (Yuan, 2001)
The methodology for imputing using the regression technique is as follows (Yuan, 2001).
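The imputation model for the standard regression model is:
$$Y_j = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n \quad \dots (6)$$
where $Y_j$ is the variable inclusive of missing values, given the covariate variables $X_1, X_2, \dots, X_n$. The fitted model includes the regression parameter estimates $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_n)$ and the associated covariance matrix $\hat{\sigma}_j^2 V_j$, where $V_j$ is the usual $(X'X)^{-1}$ matrix derived from the intercept and covariates $X_1, X_2, \dots, X_n$.
The following steps are used to generate imputed values for each imputation.
1. New parameters $\beta_* = (\beta_{*0}, \beta_{*1}, \dots, \beta_{*n})$ and $\sigma_{*j}^2$ are drawn from the posterior distribution of the parameters. That is, they are simulated from the estimates $\hat{\beta}$, $\hat{\sigma}_j^2$ and $V_j$. The variance is drawn as
$$\sigma_{*j}^2 = \hat{\sigma}_j^2\,(n_j - n - 1)/g$$
where $g$ is a $\chi^2_{n_j - n - 1}$ random variable and $n_j$ is the number of non-missing observations for $Y_j$. The regression coefficients are drawn as
$$\beta_* = \hat{\beta} + \sigma_{*j} V_{hj}' Z$$
where $V_{hj}$ is the upper triangular matrix in the Cholesky decomposition $V_j = V_{hj}' V_{hj}$, and $Z$ is a vector of $n+1$ independent random normal variates.
2. The missing values are then replaced by
$$\beta_{*0} + \beta_{*1} x_1 + \beta_{*2} x_2 + \dots + \beta_{*n} x_n + z_i \sigma_{*j}$$
where $x_1, x_2, \dots, x_n$ are the values of the covariates and $z_i$ is a simulated normal deviate.
The logistic regression method is an extension of the regression method and is defined by
$$\text{logit}(P_i) = \log\left(\frac{P_i}{1 - P_i}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
where $P_i$ is the probability of disease for treatment 'i' given the explanatory variables $X_1, X_2, \dots, X_n$. The model is fitted, and missing $P_i$ values are imputed using the same procedure stated above.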
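The two imputation steps of the regression method can be sketched as follows for the simplest case of a continuous response with an intercept and a single covariate, so that the 2 x 2 matrix algebra can be written out by hand. This is only an illustrative sketch under those assumptions, not the software used in the study, and it covers the linear rather than the logistic version; the function names `fit_ols`, `chol2` and `impute_once` are hypothetical.

```python
import math
import random


def fit_ols(x, y):
    """Least squares fit of y = b0 + b1*x; returns estimates, sigma^2, (X'X)^-1."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    v = [[sxx / det, -sx / det], [-sx / det, n / det]]  # (X'X)^-1
    b0 = v[0][0] * sy + v[0][1] * sxy
    b1 = v[1][0] * sy + v[1][1] * sxy
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return (b0, b1), rss / (n - 2), v


def chol2(v):
    """Lower-triangular Cholesky factor L of a 2x2 matrix, so that V = L L'."""
    l11 = math.sqrt(v[0][0])
    l21 = v[1][0] / l11
    l22 = math.sqrt(v[1][1] - l21 * l21)
    return l11, l21, l22


def impute_once(x_obs, y_obs, x_mis, rng):
    """One imputation: draw new parameters, then draw the missing responses."""
    (b0, b1), sigma2, v = fit_ols(x_obs, y_obs)
    df = len(x_obs) - 2                      # n_j minus number of coefficients
    g = rng.gammavariate(df / 2, 2.0)        # chi-square variate with df d.o.f.
    sigma_star = math.sqrt(sigma2 * df / g)  # drawn residual standard deviation
    l11, l21, l22 = chol2(v)                 # L equals V_hj' in the text
    z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
    b0_star = b0 + sigma_star * l11 * z0
    b1_star = b1 + sigma_star * (l21 * z0 + l22 * z1)
    # Step 2: replace each missing value by its prediction plus random noise.
    return [b0_star + b1_star * xi + rng.gauss(0, 1) * sigma_star for xi in x_mis]
```

Calling `impute_once` repeatedly with the same observed data yields a different completed data set each time, which is what makes multiple imputation reflect the uncertainty in both the parameters and the missing values.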
The conventional method to obtain the final imputed values, as carried out in many studies, is to create multiply imputed data sets, carry out a statistical analysis for each of these, and then combine the results using Rubin's rule (Mehta et al., 2007). Another approach, as used by Van-Leeuwen et al. (2007), is to create many imputed data sets and obtain the average for each observation. In this study 100 such imputations were averaged. The averaged observations were grouped as 1 if $P_i > 0.5$, or else grouped as 0. If a different threshold can be reasoned to be more appropriate, then that threshold can be used instead.
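The averaging approach can be sketched as below: each observation's imputed 0/1 values are averaged over the imputations and the average is compared with the threshold. The function name `average_and_classify` and the tiny example data are hypothetical.

```python
def average_and_classify(imputations, threshold=0.5):
    """Average imputed binary responses per observation and apply a threshold.

    `imputations` is a list of m lists, each holding one imputed 0/1 value
    per observation with a missing response.
    """
    m = len(imputations)
    n = len(imputations[0])
    means = [sum(imp[i] for imp in imputations) / m for i in range(n)]
    # An observation is grouped as 1 only if its average exceeds the threshold.
    labels = [1 if p > threshold else 0 for p in means]
    return means, labels


# Hypothetical example: 4 imputations of 3 missing responses.
means, labels = average_and_classify([[1, 0, 1], [1, 0, 0], [1, 1, 0], [0, 0, 1]])
print(means, labels)
```

Note that an average exactly equal to the threshold is grouped as 0 under the strict inequality used here.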

Monte-Carlo simulation: the inversion method (Tan et al., 2010)
Let X be a random variable with cumulative distribution function F. Since F is a non-decreasing function, the inverse function $F^{-1}$ may be defined by
$$F^{-1}(u) = \inf\{x : F(x) \ge u\}, \quad 0 < u < 1 \quad \dots (8)$$
If $U \sim U(0,1)$, then $F^{-1}(U)$ has the cumulative distribution function F. Hence, in order to generate one sample, say x, from the random variable $X \sim F$, we first draw U from $U(0,1)$, then compute $F^{-1}(U)$ and set it equal to x. The steps can therefore be stated as: first draw U from $U(0,1)$, and then return $X = F^{-1}(U)$.
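A minimal sketch of the inversion method is given below for two distributions whose inverse is easy to write down: the exponential distribution, where $F^{-1}(u) = -\ln(1-u)/\lambda$, and a Bernoulli variable, where the generalized inverse returns 0 if $u \le 1-p$ and 1 otherwise. The choice of these two distributions and the function names are assumptions made for illustration only.

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw from Exp(rate) by inverting F(x) = 1 - exp(-rate * x)."""
    u = rng.random()                 # U ~ U(0, 1)
    return -math.log(1.0 - u) / rate # X = F^{-1}(U)


def sample_bernoulli(p, rng):
    """Draw a 0/1 outcome with P(1) = p using the generalized inverse."""
    u = rng.random()
    # F(0) = 1 - p and F(1) = 1, so inf{x : F(x) >= u} is 0 when u <= 1 - p.
    return 1 if u > 1.0 - p else 0


rng = random.Random(0)
draws = [sample_exponential(2.0, rng) for _ in range(20000)]
print(sum(draws) / len(draws))  # should be close to the mean 1 / rate
```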

Verification bias in the response variable
As mentioned previously in this study, results for the diagnostic gold standard (CAD) are available primarily for patients who are positive for the test under investigation (CST). When this type of missingness is present, data from such studies are subject to what has been termed "verification bias". There are several ways to adjust for verification bias using statistical correction methods (Laurer et al., 2007;Cronin & Vickers, 2008).
Another approach for correcting verification bias under the assumption that the data are missing at random (MAR), the response variable is binary and the number of covariates is relatively large, requiring parametric models for the probability of verification, is multiple imputation (Harel & Zhou, 2006 ;Hua, 2009). Here, multiple imputation based on data augmentation has been used to correct for verification bias. Using simulation, Harel and Zhou (2006) show that imputation methods are better than the existing methods with regard to nominal coverage and confidence interval length for the sensitivity and specificity of the test. Harel and Zhou (2006) also go on to show that for a sample as large as in this study (greater than 200 observations), the biases of sensitivity and specificity from multiple imputation procedures are only marginally higher than from the existing methods.
These findings support our use of multiple imputation and indicate that there is no need to make further verification bias corrections.
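The effect of verification bias can be illustrated with a small simulation in which individuals who test positive are almost always verified by the gold standard while those who test negative rarely are. All numbers below (prevalence, true sensitivity and specificity, verification probabilities) are hypothetical and are not estimates from this study; the function name `simulate` is likewise an assumption.

```python
import random


def simulate(n=50000, prevalence=0.3, sens=0.7, spec=0.8,
             p_verify_pos=0.95, p_verify_neg=0.05, seed=1):
    """Compare true sensitivity with the naive estimate from verified cases only."""
    rng = random.Random(seed)
    tp_all = fn_all = tp_ver = fn_ver = 0
    for _ in range(n):
        diseased = rng.random() < prevalence
        if diseased:
            positive = rng.random() < sens
        else:
            positive = rng.random() >= spec
        # Verification depends only on the screening result (MAR mechanism).
        verified = rng.random() < (p_verify_pos if positive else p_verify_neg)
        if diseased:
            if positive:
                tp_all += 1
                tp_ver += verified
            else:
                fn_all += 1
                fn_ver += verified
    true_sens = tp_all / (tp_all + fn_all)
    naive_sens = tp_ver / (tp_ver + fn_ver)
    return true_sens, naive_sens


print(simulate())
```

Because diseased individuals who pass the screening test are rarely verified, the naive estimate from verified cases alone overstates the sensitivity, which is the problem that imputation of the unverified responses is intended to address.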

Model building and ROC
The theory behind logistic models is well established and has been described by many authors such as Agresti (2007), Collett (1991) and Hosmer and Lemeshow (2000). Agresti (2007) states that, in the use of most diagnostic tests when test data do not fall into two obviously defined categories, the area under the curve (AUC) of a receiver operating characteristic (ROC) curve is one of the most reliable measures of a logistic model's classification capabilities.
ROC curves are plots of sensitivity as a function of 1 − specificity, and are calculated using all possible cut-offs (Agresti, 2007; Vergara et al., 2008). Using the obtained model, the predicted probability $\hat{Y}$ is computed for each individual from the estimates of $\beta_0, \beta_1, \dots, \beta_n$. If $\hat{Y}$ > threshold (k), the predicted outcome is positive, or else it is categorized as negative. Using k values ranging from 0 to 1, the sensitivity and specificity were calculated for each of these thresholds.
That is,
$$\text{Sensitivity}(k) = \frac{TP(k)}{TP(k) + FN(k)}, \qquad \text{Specificity}(k) = \frac{TN(k)}{TN(k) + FP(k)}$$
where $TP(k)$, $FN(k)$, $TN(k)$ and $FP(k)$ denote the numbers of true positives, false negatives, true negatives and false positives obtained at threshold k.
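The calculation of sensitivity and specificity over a range of thresholds can be sketched as follows. The function names and the four-observation example are hypothetical, and the sketch assumes that both diseased and non-diseased individuals are present in the data.

```python
def sens_spec(y_true, y_prob, k):
    """Sensitivity and specificity when predicting positive if probability > k."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p > k)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p <= k)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p <= k)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p > k)
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(y_true, y_prob, thresholds):
    """ROC coordinates (1 - specificity, sensitivity) for each threshold."""
    points = []
    for k in thresholds:
        sens, spec = sens_spec(y_true, y_prob, k)
        points.append((1 - spec, sens))
    return points


# Hypothetical disease status and predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_points(y_true, y_prob, [0.0, 0.3, 0.5, 1.0]))
```

Plotting these points and joining them gives the empirical ROC curve; a threshold of 0 classifies everyone as positive and a threshold of 1 classifies everyone as negative.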

Estimates for ROCs
In order to calculate the AUC, both parametric and semi-parametric estimation methods give a smooth ROC curve and, more importantly, as a result of their distributional assumptions, allow statistical inferences such as hypothesis tests and confidence intervals to be obtained very easily (Vergara et al., 2008). Researchers like Hanley and McNeil (1983) have shown a preference towards using the Dorfman and Alf (1969) maximum likelihood estimation approach. This is the same approach used in the software ROCKIT (Metz et al., 1998). In the terminology, negative cases are patients who actually do not have a disease or given condition, and positive cases are patients who actually do have a disease or a given condition.
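For comparison with the maximum likelihood approach, the AUC can also be estimated non-parametrically as the proportion of (positive case, negative case) pairs in which the positive case receives the higher score, with ties counted as one half. The sketch below implements this Mann-Whitney form; it is not the Dorfman and Alf procedure used by ROCKIT, and the function name and example data are hypothetical.

```python
def auc_mann_whitney(y_true, y_score):
    """Non-parametric AUC: share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive case scored higher
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical disease status and predicted probabilities.
print(auc_mann_whitney([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 0.5 corresponds to the diagonal of the ROC plot (no discrimination) and an AUC of 1 to perfect separation of positive and negative cases.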

Comparing AUCs
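Once the estimates for the AUC, its variance and standard errors are obtained, a pair-wise comparison can be made following Hanley and McNeil (1983). It is generally accepted that for a sufficiently large dataset the AUC estimate approximates a normal random variable. Hence the test statistic for the difference between two AUCs would be:
$$Z = \frac{AUC_1 - AUC_2}{\sqrt{SE_1^2 + SE_2^2 - 2\,r\,SE_1\,SE_2}} \quad \dots (9)$$
where $SE_1$ and $SE_2$ are the standard errors of the two AUC estimates and $r$ is the estimated correlation between them, which arises when both curves are derived from the same cases.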
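A pair-wise comparison of this kind can be sketched as a normal-approximation Z test on the difference between two AUC estimates, where the standard error of the difference allows for a correlation r between the estimates (r = 0 corresponds to independent samples). The function name `compare_aucs` and the AUC and standard error values in the example are hypothetical and are not the values reported in this study.

```python
import math
from statistics import NormalDist


def compare_aucs(auc1, se1, auc2, se2, r=0.0):
    """Z statistic and two-sided p-value for the difference auc1 - auc2."""
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2 - 2 * r * se1 * se2)
    z = (auc1 - auc2) / se_diff
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical estimates for two models.
z, p = compare_aucs(0.85, 0.03, 0.80, 0.04)
print(round(z, 3), round(p, 3))
```

A positive correlation between the two estimates reduces the standard error of the difference, so ignoring it when the curves come from the same cases makes the test conservative.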

Design review and sample size calculation
After planning out the data collection process, a design review for the sample size calculation was conducted based on the study by Bolland et al. (1998). A crude guess of 250 observations was first decided upon and an upper bound of 500 data points was set due to the difficulty in obtaining records. This guess of 250 observations was calculated using the data in the study by McNeer et al. (1978).
Gender, age (categorized into 3 levels), hypertension status and diabetes mellitus status were also selected for the sample size calculation, owing to their impact on the disease (Wilson et al., 1998). The required sample size for each of the selected factors was calculated using the methodology in the ART menu and described under the design review. The significance level for this study was fixed at 0.05 for a two tailed test, and the power was decided upon as 0.80. The reason for choosing such values was due to the infeasibility in collecting large samples due to both inherent missing values and administrative inconveniences. Table 2 depicts the sample sizes obtained for the respective factors. As can be seen from the results, the maximum sample size was 295. Therefore, the required sample size was set at 300 with the test having a power of 80 % with a type I error of 5 %.

Missing value imputation
The most plausible reason for the missingness of most of the data points on covariates, apart from that of the angiogram status, is that medical staff were not able to complete records. However, the missingness in the angiogram response status was mainly due to the fact that individuals who passed the CST were not subjected to an angiogram. This made it impossible to calculate the sensitivity or specificity of the CST directly, since the number of such individuals with an angiogram result was extremely small and as good as non-existent, though a comparison of CST levels only, similar to the work done by McNeer et al. (1978), could have been conducted.
The opinion of the medical doctors involved in the study regarding the missingness of the covariates was that their results could be biased and that those missing values were not conditional on another variable; hence, according to the discussion by Acock (2005), they were not missing at random (NMAR). On the other hand, the missingness of the response variable (CAD) was entirely dependent on another variable, namely the CST, and thus the missingness of the CAD falls under the purview of missing at random (MAR). Current research indicates that, while using imputed values for data that are missing at random or missing completely at random (MAR or MCAR) does not bias the results, the same is not the case for values that are not missing at random (NMAR).
It must be noted that, though angiograms have been in use in Sri Lanka for well over 15 years, a study concerning the sensitivity and specificity levels of the CST has surprisingly not been published, possibly for this reason. Therefore, instead of confining this study to a comparison of the CST levels, it was thought necessary to impute these missing values. Sterne et al. (2009) stated that the "missing at random (MAR) assumption may be reasonable if a variable that is predictive of missing data in a covariate of interest is included in the imputation model". Following from this definition, since the variable that needed to be imputed is the response, the MAR assumption was valid. Further, since a logistic model was to be used in the final analysis, it was considered best to use this method as opposed to mean imputation or hot deck methods. It must also be noted that imputing missing values for the response or dependent variable is seldom carried out when explanatory variables are not missing or imputed, since "in this case MI is the same as list-wise deletion and such imputation only increases sampling variability" (Allison, 2004). However, in this study, as explained above, it would be impossible to find the sensitivity and specificity of the CST if these particular values were not imputed, and hence this procedure was carried out.
Two approaches can be used to impute data. The first method is to compute multiple imputations, generally around 5, analyze those imputed data sets individually using conventional statistical methods and, finally, combine the results using Rubin's rule (Rosenbaum & Rubin, 1983). The second method, however, involves creating many imputations, around 100, and averaging the results (Van-Leeuwen et al., 2007). Though the first method is more popular and perhaps better validated in many studies, owing to its computational ease and sound methodology, the second method was adopted. The imputed values were included in the original data set. The variables used for imputing include age, gender, hypertension, diabetes mellitus, cigarette and alcohol consumption, systolic and diastolic blood pressures, marital status and CST status. No interaction terms were included.
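For reference, the combination step of the first approach can be sketched as follows: the pooled estimate is the mean of the per-imputation estimates, and its total variance is the average within-imputation variance plus the between-imputation variance inflated by (1 + 1/m). The function name `pool_rubin` and the example numbers are hypothetical; this approach was not the one adopted in this study.

```python
def pool_rubin(estimates, variances):
    """Combine m per-imputation estimates and their variances by Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                               # pooled estimate
    within = sum(variances) / m                              # within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between                   # total variance
    return q_bar, total


# Hypothetical coefficient estimates from three imputed data sets.
print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The between-imputation term is what carries the extra uncertainty due to the missing data; averaging imputed values into a single data set, as in the second approach, does not propagate this term into later standard errors.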
After the imputation, logistic regression models were fitted to the imputed data set to determine important covariates. Variables that were insignificant before imputation remained insignificant, and those that were significant remained significant, apart from the variable family history. Yet even this variable is more significant than the other non-significant variables, as was the case before imputation. Perhaps owing to the large increase in power resulting from the increase in sample size, the significance of these variables appears to have increased vastly.

Logistic models
Both forward selection and backward elimination procedures were carried out, considering up to two interaction terms only. The final model obtained using the backward elimination process for the total set of observations, including imputed observations, is as follows: .. (10)
After building the model, it was clearly observed that, among those who failed the stress test, the odds of getting CAD decreased as the CST stage level an individual was able to withstand increased. It was also observed that those who passed the CST had the smallest odds of getting CAD. Following from this observation, the final objective of this study was to identify the best cut-off for the CST, that is, to identify whether using a stage as the cut-off, instead of the conventional Bruce protocol method of passing and failing individuals, gives a significantly better or even similar classification capability. For this purpose, the CST variable was grouped into three categories as given below:
1. Group 1: Those who failed in stage 1 versus the rest (those who passed up to a stage ≥ 1). The corresponding model based on backward elimination is: .. (11)
2. Group 2: Those who failed in stages 1 or 2 versus the rest (those who passed up to a stage ≥ 2). The corresponding model based on backward elimination is: .. (12)
3. Group 3: Those who failed the test (all stages 1 and above) versus those who passed the CST. The corresponding model based on backward elimination is: ..(13)

ROC curve analysis
Three ROC curves were constructed for the three cut-off models. In order to obtain an overall graphical view, the ROC curves corresponding to the three models were plotted and are given in Figure 1.
It can be seen that the first two CST groupings appear to be similar, since their respective lines overlap somewhat. However, in contrast to the first two groupings, the last grouping (colour-coded in off-white), which is the Bruce protocol cut-off of pass and fail as currently used by medical practitioners in Sri Lanka, appears to have its ROC curve almost always above those of the other two groupings, though quite close. In other words, the third grouping not only has the most significant CST grouping with respect to its logistic model but also appears to have better discrimination power than the other two based on the ROC curves. However, in order to test whether this difference is significant, a statistical pair-wise comparison test was carried out.
Using the data obtained through ROCKIT, and under the assumption that for a sufficiently large data set the AUC is distributed normally, the following hypothesis was tested for the three possible comparisons:
$$H_0: AUC_i = AUC_j \quad \text{versus} \quad H_1: AUC_i \neq AUC_j$$
where (i, j) = {(1,2), (1,3), (2,3)}. Using the test statistic explained earlier and the calculated estimates given in Table 3, the outcome given in Table 4 was obtained. It can clearly be seen that the AUCs for the first two groupings were not significantly different. However, the AUC of the third grouping was significantly different from those of both the first and second groupings at the 5 % significance level. Further, since the Z statistic value is positive, the AUC of grouping 3 is significantly larger than that of the other two, at a significance level even smaller than 5 %.

Validation of the model
In studies related to the medical research field, many practical dilemmas can occur and hence bias the results, with or without the knowledge of the researcher. Some of these drawbacks include: possible lack of representativeness in the sample due to administrative issues or missing data; inadequate sample size; and omission of important confounders due to issues ranging from lack of knowledge in the subject area to the inability to measure a confounder because it is being controlled as a precaution, an example being the problem encountered with cholesterol levels in this study. Though such drawbacks are common in medical studies, it is very important that the inferences or results obtained are accurate and precise enough, owing to the possibly life-threatening nature of the disease. Therefore, it is important to verify even a small doubt.
Established validation procedures include methods such as bootstrapping and independent test case validation. For the purpose of validation, it was considered best to use an independent test case, since it would verify the validity of the inferences obtained using a dataset completely unrelated to the first, and also because of the accepted nature and simplicity of this method. Data collection was a problematic issue throughout this study, and therefore it was not possible to obtain a large test dataset. However, the objective of validating using a test dataset was not to make inferences concerning the study hypotheses, but instead to observe whether the data agreed in general with the previously obtained models. Another unrelated set of 50 BHTs and CST records was requested from the Sri Jayawardenapura hospital. Firstly, the dataset was cleaned and, as before, a missing data imputation procedure was carried out.
Then using the forecasts and the actual results, from the modeling procedure, the false positive rate (FPR), false negative rate (FNR), true positive rate (TPR), and true negative rate (TNR) were tabulated.
Further, ROC curves were constructed for the test cases as well. The above mentioned calculations were carried out after deleting the individuals who did not do the CST as they were not needed. This final test dataset had just 34 observations. Therefore, in drawing conclusions from this small dataset, two things should be kept in mind. Firstly, the AUCs could be imprecisely estimated, and secondly, the estimated AUC may not have a normal distribution. Thus results should be carefully interpreted.
Though the overall model has a high TPR of 82 %, its TNR is only 67 %. For the model chosen to find the best cut-off, the TPR was lower, at just 73 %, while the TNR was the same as that observed for the first model, that is, 67 %. In general, the overall model correctly identified roughly 76 % of the cases, while the other correctly classified over 70 % of the cases. In order to obtain an idea of the classification capabilities of the two models, ROC curves were once again constructed but are not presented here. Calculations were carried out for the AUCs and their respective standard errors. The AUCs for the two models were 0.811 and 0.765, respectively. According to Hosmer and Lemeshow (2000), these AUC values imply that the first model has excellent discrimination while the second has acceptable discrimination. Further, it was found through ROCKIT that these AUCs were not significantly different from each other. Yet, because of the small sample size, it must be noted that these estimates and inferences may not be very accurate. Therefore, the ROC curves were used only to obtain a graphical view of the classification capability of the two models. It was observed that the ROC curves for both models lie well away from the diagonal.

Discussion
In this study missing value imputation has been successfully used for determining the values of sensitivity and specificity and thereby determining the diagnostic capabilities of the CST as a substitute for the angiogram. This technique can be similarly used in cases where passing the substitute test results in no gold standard test being done.
The main finding of this study was that the Bruce protocol cut-off was the best classifier of CAD. Also, the results obtained for the example dataset conform to world standards. The values obtained in this study with the aid of missing value imputation were, for the Bruce protocol method, a sensitivity of 87 % and a specificity of 77 % after adjusting for the other confounding variables. Similarly, for the overall model, a sensitivity of 85 % and a specificity of 79 % were observed. In both cases, the sensitivity is slightly higher than the specificity. The sensitivity and specificity values obtained for the above situations may, however, have been enhanced by the confounders' predictive capabilities.
The American Heart Association guidelines state a risk factor-unadjusted "sensitivity and specificity of 68 % and 77 % for detecting significant coronary disease at angiography", whilst Hill and Timmis (2002) concluded that this test has a risk factor-unadjusted sensitivity of 78 % and a specificity of 70 % in detecting coronary artery disease. Fuster et al. (2004) state in their study that "the true diagnostic value of the exercise ECG relates to its relatively high specificity". As can be seen, these values vary somewhat and, to quote Bokhari et al. (2008), "wide variations in the sensitivity and specificity of the exercise ECG for the diagnosis of coronary artery disease (CAD) have been reported". It is interesting to note that the sensitivity and specificity values of these unadjusted studies are low relative to those obtained in this study, which was adjusted for risk factors. However, the study by Koide et al. (2001), which adjusted for some risk factors, gives sensitivity and specificity values of 84 % and 90 %, respectively, indicating that our values are somewhat higher than usually reported values because of the adjustment for risk factors. In general, it can be observed that the sensitivity and specificity values obtained after using missing value imputation are similar to worldwide standards. Though the Bruce protocol method gave reasonable values for the sensitivity and specificity of the CST, that does not, of course, rule out the possibility that other methods may perform better or worse.
A very interesting sub-finding of this study was the interaction terms obtained in the final cut-off model. It is generally accepted that gender and age (Roger et al., 1998) can have a marked impact on the CST results. So much so, that the Bruce protocol cut-offs are adjusted for age.
The reason why the odds ratio of getting CAD for males versus females increased dramatically for those who had passed the CST could be that women who had CAD were more sensitive to the stress test than were the men. That is, if we take the group of individuals who passed the stress test, we can assume that it contains very few females with CAD, as opposed to males, who may still have CAD but managed to pass the CST owing to better fitness levels. This could explain why there was a positive interaction for gender with CST. Another interesting finding was the positive interaction term for age with the level passed in the CST. This implies that as a person gets older, the impact of the CST lessens. Medical practitioners in Sri Lanka also state that if a younger person fails the CST, the patient has a higher chance of having CAD than an older individual. This interaction term appears to agree with this hypothesis. Yet the interaction with diabetes cannot be explained by the above arguments for gender or age. The overall reasoning behind why gender, age and diabetes had positive interactions with CST could be that, when an individual has CAD, the CST predicts it well, hence obscuring the impact of the other risk factors, as opposed to the case where they failed it. Further, the small counts observed may have exaggerated the actual estimates and also resulted in the large confidence intervals obtained.
It can, however, be argued that this observation comes as a result of the imputation procedure. But since the imputation was carried out using many other variables such as systolic and diastolic blood pressure, highly correlated variables such as alcohol and cigarette consumption and even variables such as marital status, imputation appears to be an unlikely cause. That is, due to the inclusion of a large number of other variables in the imputation procedure, which were independent yet highly correlated with CAD, it would be expected that the impact of a few of these variables to be lessened and not strengthened. This would be an interesting topic for further research.