use of Schoenfeld’s global test to test the proportional hazards assumption in the cox proportional hazards model: an application to a clinical study

: Cox proportional hazards (PH) model is one of the finest techniques in identifying combined effects of several covariates on the relative risk (hazard). This model assumes that the hazards of the different strata formed by the levels of the covariates are proportional. The primary objective of this paper is to illustrate the usefulness of a global goodness-of-fit test proposed by Schoenfeld for testing the PH assumption. Though several classical methods have been discussed in previous studies there is no one research paper that compares Schoenfeld’s method with these. Moreover, programmes are developed in SAS for constructing this global goodness-of-fit test. In this paper the proposed test is applied to a real, large scale data set that involves several covariates, whereas Schoenfeld has used only a small data set with only one covariate to illustrate this new test. Using Kaplan-Meier curves, a preliminary analysis was conducted on the survival data. Then, a Cox PH model was fitted to the data. All the methods and residual analysis including the global goodness-of-fit test indicated that for the data set used the assumption of PH is violated. However, other than for the global goodness-of-fit test all other techniques are based on graphical methods and are thus subjective. Hence, for cases where the violation of the PH assumption is marginal these graphical methods may be inadequate to detect this departure. However, as the global goodness-of-fit test is an objective test it is recommended as the best among the methods compared.


IntRoductIon
In the comparison of two survival functions in a clinical trial, it is useful to have a means to measure the difference between the two survival curves.If the corresponding hazard functions are proportional, then the interpretation of relative risk (hazard) can be done using the maximum RESEARCH ARTICLE use of Schoenfeld's global test to test the proportional hazards assumption in the cox proportional hazards model: an application to a clinical study partial likelihood estimator proposed by Cox 1 .The descriptive method of identifying proportionality of hazards for levels of a single covariate involves the plotting of Kaplan-Meier estimates.However, this simple method is cumbersome when there are many covariates.Thus, the estimation of relative risk for a group of subjects depending on several explanatory variables (covariates which can be categorical or continuous) is assessed by a parametric fitting of a proportional hazard (PH) model 1 .
After fitting an appropriate PH model for the given data, it is vital to check the goodness-of-fit and the residuals of the fitted model.The goodness-of-fit of a PH model mainly focuses on checking the validity of the assumption of the proportional hazards (i.e.whether the effects of covariates on risk remain constant over time).For a more general Cox's regression model, the PH property is one of the restrictions to using the model with time-fixed covariates.It assumes that the hazard ratio between two sets of covariates is constant over time, because the common baseline hazard function cancels out in the ratio of the two hazards.However, the impact of time-varying covariates leads the hazard to vary over time, thus violating the assumption of PH in the model.To test this PH assumption of Cox's regression model, Schoenfeld 2 and Moreaue et al. 3 introduced a dummy time-dependent covariate.There, the observed numbers of events in the cells arising from a partition of the Cartesian events product of the range of covariates (or ranges defined by the predicted partial likelihood estimates of the model) and the time axis, (with expected numbers of events predicted by the model) in the cells are computed, and a chi-squared statistic for the fit is obtained.The number of partitions of the time-axis is arbitrary but the defined portions (time intervals) should be non-overlapping.
A variety of residuals have been developed for a fitted PH model such as Cox-Snell residuals 4 , Schoenfeld residuals 5 etc. Plots of these residuals are useful in detecting nonproportionality of predicted hazards of the fitted model over the covariate space for each covariate.
The objective in the present paper is to illustrate the usefulness of the global goodness-of-fit test proposed by Schoenfeld 2 and to discuss other classical methods of testing validity of Cox PH models.This is achieved by examining each method and applying all the mentioned model validation techniques including the new global goodness-of-fit test proposed by Schoenfeld 2 to a large, real life data set that includes several covariates.The data is about the times that heroin addicts remain in a clinic for methadone maintenance treatment 6 .Several possible explanatory variables are recorded along with the termination times for 238 individuals.Simply, the approach discussed in this paper extends the ideas developed for goodness-of-fit testing and residual analysis for a censored set of real data.

MEtHodS And MAtERIALS
Assumption of proportional hazards: Suppose two groups, namely group 1 and 2 (for example say, group 1 is receiving the new treatment and group 2 is receiving the standard treatment), are compared with respect to the hazard of each group.Let λ 1 (t | group 1) and λ 2 (t | group 2) be the hazard functions of group 1 and group 2 respectively, where t > 0. Then the two groups are said to have proportional hazard, when the hazard ratio Ψ is constant over time.That is, Kaplan-Meier curves: The Kaplan-Meier method 7 estimates the survival function that summarizes the survival data.The curves of Kaplan-Meier survival functions work best for time fixed covariates with few levels.If the predictor satisfies the proportional hazard assumption then the graph of the log [-log(survival)] versus log of survival time should result in parallel lines.
The Cox regression model 1 : Let T i be the failure time for subject i, i =1,...,n.If T i follows the Cox proportional hazards regression model, then the hazard function for T i at time t > 0, conditional on the p x 1 covariate vector Z i , is where λ 0 (t) is the baseline hazard function (i.e. the hazard function when all covariates take value zero) and β is a p × 1 vector of regression coefficients.Statistics are designed to check whether interaction terms between elements of Z i or higher order terms in the elements of Z i need to be added to β' Z i .
Using counting process notation, the information in the data can be represented by When the fitted model is correct, the Cox-Snell residuals (r cox-snell ) i , are a plausible sample of observations from a unit exponential distribution.Thus, a plot of Cox-Snell residuals versus observations (or time) will not lead to a symmetric display.
The log-cumulative hazard plot of residuals is given by plotting log log values against log (r cox-snell ) i .A straight line plot with unit slope and zero intercept will then indicate that the fitted survival model is adequate in satisfying the proportional hazard assumption.
Schoenfeld residuals 5 : The Schoenfeld residual vector is calculated on a per event time basis as where Z t is a weighted average of the covariates over the risk set at time t and is given by, Under the proportional hazards assumption, the Schoenfeld residuals have the sample path of a random walk; therefore, they are useful in assessing time trend or lack of proportionality.Due to time dependent covariates the generalized linear regression of the Schoenfeld residuals on functions of time gives a non-zero slope.Thus, a non-zero slope is an indication of a violation of the proportional hazard assumption.As with any regression it is recommended to look at the graph of the regression in addition to performing the tests of non-zero slopes.There are certain types of non-proportionality that will not be detected by the tests of non-zero slopes alone but that might become obvious when looking at the graphs of the residuals such as nonlinear relationship (i.e., a quadratic fit) between the residuals.
Goodness-of-fit statistics 8 : In the Cox regression model, the hazard ratio for subject i versus subject j at time t is When comparing two individuals, the individual with the larger value of exp(β'Z i ) has greater risk of death at time t.To form the proposed goodness-of-fit statistics, the partial likelihood estimate of Then the subjects are grouped or partitioned into regions based on the percentiles of ˆi , which we call percentiles of risk.Following Hosmer and Lemeshow's 9 approach with binary data, it is suggested to form G regions of approximately equal size so that the first group contains the n/G subjects with the smallest ˆi 's, and the last group contains the n/G subjects with the largest ˆi 's.In general, this classification leads to grouping subjects that are considered similar in that they have similar risks of death at any given time i.
Given the partition of the data, the goodness-of-fit statistic is formulated by defining the (G -1)group indicators Then, in order to assess the goodnessof-fit of the model (1), we consider the alternative Cox model, If model ( 1) is correctly specified, then γ 1 = γ 2 =.....= γ G-1 = 0 in (4).Although I ig is based on the random quantities ˆi 's, Moore and Spruill 10 showed that, asymptotically, one can treat the partition as if it was based on the true φ i 's (and thus, regard I ig as a fixed covariate).
To test the goodness-of-fit of model (1) versus alternative (4), the likelihood ratio, Wald, or score statistic can be used to test H 0 : 1) has been correctly specified, each of these statistics has an approximate chi-squared distribution with (G-1) degrees of freedom (d.f.) when the sample size is large.
Although the score, Wald, and likelihood ratio statistics are asymptotically equivalent, the score statistic for H 0 : 4), is proposed over others since it has a nice intuitive interpretation.For model ( 4), the score vector is Let ˆ 0 be the estimate β under the null hypothesis γ = 0.The score test statistic for testing this null is, ( , ) var ( , )  , ), the score statistic is actually based on the large sample distribution of 2 0 ( , 0) u .Using an algebraic identity, 2 0 ( , 0) u can be expressed as where 0 , , exp , , , , being the Breslow 11 estimate of the baseline hazard.
Substituting ( 6) and ( 7) in ( 5), one can show that 0 0 1 ˆ, ,0 , , 0 0 and thus that the g th element of 2 0 ( , 0) u equals, where O g is the observed number of failures in region (group) g and E g is the estimated expected number of failures [under model ( 1)] in region g.Alternatively, (8) can be expressed as where 0 ˆˆ, , , , is the martingale residual given by model (1).Thus, our goodness-of-fit statistic is actually a function of the observed minus the estimated expected number of failures in each region or, equivalently, the martingale residuals in each region.Because of this fact, it can describe what is meant by large samples, i. e., the statistics have approximate chi-squared distribution with (G-1) d.f.when the sample size is large.
Using the partition based on the percentiles of risk (percentiles of ˆi 's), the above statistic has little power to test whether the proportional hazard assumption is valid.However, this statistic can easily be extended to have power to detect non proportional hazards by using the approach of Schoenfeld 2 .
There, in addition to partitioning the subject based on the percentiles of risk as explained above, the time axis should also be partitioned into say, τ intervals, which are consecutive and non-overlapping, containing approximately an equal number of subjects in each interval.
Accordingly, (τ -1) indicators are defined, Then, in order to assess the goodnessof-fit along with testing PH assumption of model (1), we define the alternative Cox model including the (τ-1) (G -1) interaction terms as follows Then, as explained previously, the score test statistic can be used to compare model (1) and model (9).If model ( 1) is found as not significantly deviated from model ( 9) (i.e. the hazard is the same over covariate space as well as over time), then it can be decided that the goodness-of-fit and PH assumption hold for model (1).
A rough sample size criterion for the resulting goodness of fit statistic to be approximately chi-square, can be used in deciding the numbers G and τ.In particular, the τ × G regions formed by the cross classification of the covariate and time partitions should be chosen in such a way that 6 ≤ τG ≤ D/5, where D is the total number of failures in the data set.
Since the score statistic is similar to Pearson's chi-square for contingency tables (a function of observed and expected frequencies), it is suggested that all estimated expected counts E gk in each cell representing τ × G regions be greater than 1 and at least 80% should be greater than or equal to 5. For situations when this does not hold, the chi-square approximation for the score test may be poor.One possible solution to this is to use a smaller number of τ -G regions in the interval [6, D/5], for which 80% of the E gk 's are greater than or equal to 5.

Example
The data set used in this study is from a clinical trial on 238 heroin addicts following methadone maintenance treatment 6 , in which the outcome is the time until the addict terminates the treatment procedure.Thus, the endpoint of interest is not death, but termination of treatment.Some subjects were still in the clinic at the time these data were recorded and this is indicated by the variable "status", which is equal to 1 if the person had departed the clinic on completion of treatment (uncensored) and 0 otherwise (right censored).The measured possible explanatory variables for time to complete treatment are namely, maximum methadone dose ("dose", a continuous variable in units-milligrams), whether the addict had a criminal record ("prison", a categorical variable with two levels, 1 -have criminal records, 0 -no records) and the clinic in which the addict was being treated ("clinic", a categorical variable with two levels, 1 -clinic 1, 0 -clinic 2)-are recorded along with the termination times for all 238 individuals.The overall event rate for this data set is 0.63 (=number terminating treatment/238=150/238).
According to the main aim of this study, a PH model was fitted in order to determine the goodness-of fit of the model by applying the techniques described in the previous section.
As a preliminary identification of proportionality of hazards or departure from it, log cumulative hazard plots based on Kaplan-Meier estimates are drawn for the three explanatory variables where the continuous variable "dose" is categorized into 3 levels <60, 60-79 and 80. Figures 1 is the log cumulative hazard plots for levels of variables "dose" "prison" and "clinic" respectively.
In Figure 1, the log cumulative hazard function (estimated by the Kaplan-Meier method) for the three levels of "dose" are approximately parallel.The log cumulative hazard function for the two levels of "prison" does not seem parallel and as it is depicted in the third graph of Figure 1, the distance between the log cumulative hazard function for the two levels of "clinic" increases over the log of survival time indicating non parallelism and the curves for the two levels are also seen to cross.However, a particular conclusion cannot be made from these results since a Kaplan-Meier survival curve is a univariate method of estimating survival for the levels of a factor which is unadjusted for other existing factors.
The next step is to fit the Cox PH model for the survival experience regressed on the explanatory variables.The explanatory variable -Methadone dose ("dose") is the standardized methadone dose since it is observed that the dose varies by 10 to 15 units in the recorded data.Then, to fit the model, the PROC PHREG procedure of the SAS package was used and the following Cox PH model (parameter estimates are given in Table 1) is fitted using the forward selection criteria 12 . ) where i, i = 1,....,238.
The fitted model (10) includes the categorical covariate "clinic" which showed non-proportional hazard when considered individually.Thus it is expected that the model will fail to hold valid the PH assumption unless the nonproportionality of hazards for "clinic" is cancelled out due to the adjustments of the other explanatory variables.However, without an appropriate residual analysis and goodness-of-fit tests, it is not appropriate to state anything about the validity of the fitted model.

Testing Validity of the Fitted Model:
The Cox-Snell residuals for model ( 10) are computed according to equation ( 2).The log-cumulative hazard plot of the Cox-Snell residuals depicted in Figure 2 is used to identify any departure from a well fitted model that satisfies the proportional hazard assumption.
It is explained under a previous section that a straight line log cumulative hazard plot with unit slope and zero intercept will indicate that the PH assumption in the model holds.Figure 2 illustrates that there is a slight deviation of points from a straight line indicating that this is a boarderline case.Thus, based on this plot it is difficult to come to a firm conclusion regarding departure from PHs.
The Schoenfeld residuals for each covariate "dose", "prison" and "clinic" computed as given in equation ( 3) are observed and are plotted against time (time of termination from treatments).These plots are depicted in Figure 3 where fitted linear and quadratic regression lines for the Schoenfeld residuals are visible on the same plot.
According to Figure 3, the best fitted line for the Schoenfeld residuals for "dose" has a slope which is not significantly different from zero (since p-value for slope is 0.5454>>0.05),and the fitted quadratic line is balanced around the zero horizontal axis.The illustration in the plot of Schoenfeld residuals for "prison" leads to the same interpretation, where the fitted linear regression line has non-significant slope (since p-value for slope is 0.3653>>0.05),and the fitted quadratic line is balanced around the zero horizontal axis.Thus, it can be concluded that the covariates "dose" and "prison" do not violate the PH assumption of the fitted model (10) and are also not time dependent.However, the fitted linear regression line for the plot of Schoenfeld residuals for "clinic" (Third graph of Figure 3) has a slope which is significantly different from zero (p-value for slope is 0.0007<<0.05)and the fitted quadratic curve also highly deviates from the zero horizontal axis.This result indicates that the covariate "clinic" violates the PH assumption of the model ( 10) and hence leads to the conclusion that "clinic" is probably a time dependent covariate.
To justify the findings expressed in this section, the variable "clinic" can be tested for it's time dependency.For that, it is tested whether there is a significant difference between the current model ( 10) and the model regressed on all the covariates in (10) plus the interaction term "time*clinic" [model ( 11 Under the null hypothesis that model (10) fits the data well, the deviance difference between model (10) and model ( 11) follows a Chi-square distribution with 1 d.f.The calculated difference in deviance between model ( 10) and ( 11) is 11.509 (p-value = 0.000693 <<0.05).This indicates that the interaction term "time*clinic" is highly significant and hence it is confirmed that the variable "clinic" is a time dependent covariate and thus violates the PH assumption.
Testing the goodness-of-fit: To determine the goodnessof-fit of model ( 10), the 238 subjects are grouped into G=5 groups (approximately 48 individuals in each group), based on the percentiles of the partial likelihood estimates ( ˆi ) predicted by model (10).Simultaneously, the time axis is partitioned into τ =2 intervals, t ≤ 367 (days) and t > 367 (days), such that the 238 individuals are divided equally among each interval.Determining G and τ are done under restriction 6 ≤ (τ = 2) (G = 5 ) ≤ (D =150)/5, where D = 150 is the total number of failures in the data set, and all estimated expected   counts in each cell representing (τ = 2) × (G = 5) = 10 regions, are greater than 1 and at least 80% are greater than or equal to 5. The expected counts for the 10 regions, estimated by model ( 10) are illustrated in Table 2.
It can be seen from Table 2, that all expected counts are greater than 1 and 8 out of the 10 regions have expected counts greater than 5 (80%).Thus, this grouping is appropriate when using the Chi-square approximation for the goodness-of-fit test discussed in the following.The score test statistic for model ( 10) is 56.27 associated with 3 d.f. and that of the alternative model ( 12) is 159.19 with 7 d.f.Thus the score goodness-of-fit statistic for the fitted model ( 10) is x 2 =102.93 (=159.19-56.27)with 4 (=7-3) d.f.(p-value <0.001) indicating that the alternative model ( 12) is significantly different from model ( 10) and hence the goodness-of-fit fails for model (10).Since this test is powerful in testing the PH assumption too, it can be concluded that the PH assumption in the fitted model (10) is violated.

dIScuSSIon
The Cox PH model is the most popular method of examining the effect of explanatory variables on survival.However, it requires the assumption of proportional hazards between strata formed by the combinations of levels of the different explanatory variables.Thus, when fitting a PH model it is vital to assess the assumption of proportionality.There are numerous methodologies in the literature [2][3][4][5][8][9][10] for checking the assumption of PHs.
Kaplan-Meier survival estimates in graphical format are useful in preliminary identification of proportional hazards for levels of categorical variables taken individually.However, this method is cumbersome when there are several explanatory variables.Also it is a univariate method and does not adjust for other covariates.In this case, more advanced techniques are required.One may fit proportional hazards regressed on several explanatory variables (PH models) in order to identify combined effects of several covariates on the relative risk.However, in the fitted PH model, the explanatory variables included in the model should satisfy the restriction that the relative risk is proportional over the time for different levels of covariates (i.e.PH assumption).If this requirement is present in the fitted PH model then the assumption of PH is not violated.This paper was basically written with the objective of illustrating the usefulness of a new global goodness-of-fit test and discussing a number of established approaches in determining the validity of a fitted Cox PH model.In this paper this new test and a number of established methods for testing the PH assumption are examined by way of an example.Simultaneously developing a software programme in SAS to use in applying these techniques was a secondary objective of this paper.The methods are applied to a data set taken from a clinical trial on heroin addicts following methadone maintenance treatment 6 , in which the time until the addict terminates the treatment procedure is measured.Three possible covariates namely maximum methadone dose ("dose"), whether the addict had a criminal record ("prison") and the clinic in which the addict was been treated ("clinic")-that were suspected to influence the termination time were recorded for 238 heroin addicts.
Prior to model fitting, the conventional descriptive method, in identifying proportional hazards on each categorical covariate taken individually, namely Kaplan-Meier survival curves, are used on data.This identified that the two levels of "clinic" and also the two levels of "prison" do not have proportional hazards, hinting that these variables may violate the PH assumption when it is included in the model.The variables "dose" showed approximate proportional hazards.However, a firm conclusion was not made about this result since a univariate method is not reliable when dealing with several explanatory variables.
A Cox PH model was then fitted to the data, using forward selection procedure that ended up including all 3 explanatory variables into the model.
The Log cumulative hazard plot of Cox-Snell residuals suggested that the goodness-of-fit of the model is boarderline, as the points deviate slightly from a straight line.Then, the Schoenfeld residuals for each covariate were studied.This indicated that the variable "clinic" violates the PH assumption of the fitted model while the other two covariates "prison" and "dose" do not.According to Schoenfeld 5 , one of the reasons that a covariate violates PH assumption is when it is time dependent.
The Cox PH model for the hazard of death at time t for the i th of n individuals in a study can be expressed as equation (1) where x .This model can be generalized to the situation when one or more explanatory variables are time dependent by writing x ji (t) in place of x ji in equation ( 1).There are several simple ways to extend the Cox model to include a time dependent covariate 13 .More complicated forms of relationships between the outcome and explanatory variable over time are discussed by Fisher and Lin 14 .These authors explain the way in choosing the functional form of the time dependence of the covariate which they mention is usually not self evident but may be suggested by biological understanding.Some popular functional forms are time lagged covariates, moving weighted average of value over a lag time period, linear form, Splines, Piecewise constant functions with the covariate etc., to name a few.In our study the simple linear form was selected as the choice of a complex functional form raises the possibility of too much modeling and great over-fitting of the data set.Also it is known that when the PH assumption does not hold this is a good choice, though it is somewhat restricted.Thus, using this approach it is found that the variable "clinic" is time dependent.
As the final approach, the global goodness-offit test for Cox PH models, proposed by Schoenfeld 2 , that has power to detect the insufficiency of covariates in describing the relative risks and the assumption of PH, was applied to the fitted model.Here an alternative model which contains variables that indicate partitions of relative risk (hazard ratios) over covariate space and the time space, where if these indicator variable are found significant in the model, implies that the covariates are not sufficient or the PH assumption does not hold.In other words, to perform this, an alternative model that contains all the covariates in our fitted model plus the indicator variables that indicate the partitioned regions had to be compared with the fitted model using the score test statistic which has chi-squared distribution when our model has been correctly specified.Accordingly, the 238 subjects are grouped into 5 groups based on the percentiles of the partial likelihood estimates predicted by the fitted model and the time axis is partitioned into 2 intervals (t ≤ 367 days and t > 367 days).Thus, altogether there were 5×2 =10 regions and this partitioning did not violate the requirement for a Chi-square distribution of the score statistic since all estimated expected counts (expected number of addicts that terminate treatments) in the 10 regions are greater than 1 and 80% are greater than or equal to 5.Then, the alternative model [model (12)] is fitted as described previously and it was found that there is a highly significant difference between that [model (12)] and our fitted model [model (10)].Thus, with this result, it is determined that the goodness-of-fit along with the PH assumption is not satisfied for the fitted PH model.Finally, it is concluded that the PH assumption does not hold for the fitted model due to the inclusion of time varying covariate -"clinic" -and hence does not satisfy the global goodness-of-fit.
Another important finding of this analysis is that most of the other techniques were subjective and were unable in boarderline cases to confirm the lack-of-fit or violation of PH assumption due to time dependent covariates.The only objective method of all methods examined in this paper is the global goodness-of-fit test proposed by Schoenfeld 2 .Therefore, it is recommended as the most reliable method in validating the PH model.

Figure 1 :
Figure 1: Log cumulative hazard plots for levels of the three variables

Figure 2 :
Figure 2: Log cumulative hazard plot of Cox-Snell residuals for model (10) According to the grouping defined above, the indicator variables I ig [indicating that the i th individual belongs to g th region partitioned based on the percentiles of the partial likelihood estimates ( ˆi )] where g = 1 ,....,4 and I * ik (indicating that the i th individual belongs to k th time interval) where k = 1, are defined.Then, the alternative Cox model given below is fitted,

Figure 3 :
Figure 3: Schoenfeld residuals for each explanatory variable versus survival time where N i (t) takes value one if subject i has been observed to fail prior to time t and takes value zero otherwise and Y i (t) takes value one if subject i is at risk at time t and takes value zero otherwise.Then the Cox partial likelihood score vector equals )].

Table 2 :
Observed and expected number of addicts terminated from treatment, predicted by model(10)