Adjusting for a cluster effect in the logistic regression model: an illustration of theory and its application

* Corresponding author (roshini@mail.cmb.ac.lk) Abstract: When a response having two outcomes is modelled using a logistic model the responses on each observation are often considered to be independent of each other. This assumption may not always be valid and the responses may be correlated with each other as in the case of clustered data, which can occur especially in the case of survey data. When responses are correlated as explained, then the ordinary logistic regression model is unsuitable as the standard errors will be biased, and therefore this model should be adjusted for the cluster effect. In this paper one of the many methods of adjustment suggested in the literature, which is based on robust standard error estimation for cluster sampling data is examined. The objective of this paper is to illustrate the theory and mode of application of this theory by way of using a survey on Paraquet poisoned patients in Sri Lanka. Here, the patients are clustered within hospitals and therefore adjustment of the logistic model for the clustering effect is discussed. Adjusting for the hospital (cluster) effect significantly reduced the standard error of four out of thirty three odds ratios given by the model. That is four odds ratios that were not significant before adjustment became significant after adjustment. There was no change in significance in the other twenty nine odds ratios. On average there is a reduction of 2.29% in the standard error of the odds ratios after adjustment for the cluster effect. This indicates that it is effective and important to adjust for the cluster effect.


INTRODUCTION
A logistic regression model (Agresti, 1990;Collett, 1994;Hosmer & Lemeshow, 2000) is usually fitted when the response variable has two levels. When observations are selected based on strata a cluster effect can be anticipated within strata. Usually in data modelling, stratification or clustering is ignored. If however, stratification is indeed effective, the ignoring of stratification will usually increase the resulting variance of the estimators (Parson, 1992) as the independence between patients is violated due to the correlation within strata, so that the usual maximum likelihood method for estimating standard errors is not valid. In order to overcome this problem a robust method of estimation of standard errors of the parameter estimates has to be used. Therefore an adjustment is required for the cluster (strata) effect in estimating the standard errors of the parameter estimates. In the past various adjustments have been suggested by several authors, the more popular ones being Kacker and Harville(1984), Kenward and Roger(1997), and Huber(1967). Huber's adjustment is now available in the statistical package STATA (Liu, 1998) and is explained here. This method was developed by Huber (1967) and is based on bootstrap jackknife method.
The objective of this paper is to illustrate the theory of adjustment for cluster effects in a logistic model and its application by way of an example to determine the factors contributing to the transfer of poisoned patients from peripheral hospitals to secondary hospitals where the hospital is considered as a cluster. The study (Senerathna, 2006) used is a cross sectional study. The patients admitted to peripheral hospitals in Anuradhapura and Polonnaruwa Districts during the period of July 2005 to December 2005 have been considered for the study. The study population consists of 1024 patients from 36 hospitals. A total of 40 variables were measured on each patient. Table 1 gives the broad areas about which information was collected and the variables measured in the study.

September 2011
Journal of the National Science Foundation of Sri Lanka 39 (3)

METHODS AND MATERIALS
The logistic regression model: Consider a single categorical explanatory variable, A, with I levels. Let p i be the probability of belonging to the response category of interest for an observation having the i th level of A. The logistic regression model (Collett, 1994), which gives the relationship between the response (p i ) and the i th level of a factor A, is as follows.
... (1) Here α is the intercept and β i A is the effect of the i th level of the variable A on the response.
Adjusting for cluster effect: One approach to adjust for the cluster effect is to introduce it as a covariate to the model. Suppose β j B represents the effect of the j th cluster, then the model can be given as ... (2) Although it is possible to estimate all cluster effects, it becomes tedious and not feasible when the numbers of clusters are very large (Paterson & Goldstein, 1991).
In addition, it would be useful to be able to generalize inferences to all the clusters in the population than to restrict it to those in the sample drawn. Therefore, the other approach (Paterson & Goldstein, 1991) would be to introduce the cluster effect (u j ) to the model as a normally distributed random variable with zero mean and a constant variance. This is given by the model: This is a generalized linear mixed model where A is a fixed effect and u is a random effect. More details about this type of model, model fitting and interpretation are given in Brown & Prescott (1999).
Fixed and random effect standard errors are calculated using a formula that is based on a known variance matrix, V. However, because V is infact estimated it is known that in most situations there will be some bias in the standard errors given by the maximum likelihood method (Brown & Prescott, 1999). Various adjustments for bias have been suggested and Huber's adjustment is now available in the statistical package STATA, which is used for the analysis of the example data set used in this paper (Annex). Huber's (1967) method of adjustment for a cluster effect in the estimates of standard errors in a logistic model has been described by Liu (1998) and is briefly explained here.

Huber (1967) method of adjustment of estimates of standard errors:
Let us denote the logistic model, ... (4) Where p i is the probability of an event for the i th unit, x i is the design matrix for the i th unit, β is the vector of regression coefficients and i = 1, … n. Thus p i can be denoted by, ... (5) Then the log likelihood function L for a probability density function is  ... (6) Define the dummy variables y i such that y i =1 if the i th unit has the event of interest and y i =0 otherwise.
Then ... (7) [where is the density function of the Bernoulli distribution].
For the i th observation the Score function can be denoted as ... (8) and the Hessian can be denoted as ... (9) This robust estimation is based on asymptotic bootstrap or jackknife (Efron & Tibshirani, 1993) method. The bootstrap method is based on estimating from resampling with replacement from the original sample. The jackknife method, similarly estimates by systematical recomputing the estimate by dropping one observation at a time from the original sample. Liu (1998) shows that the amount of the shift in the estimate of the standard error when i th observation is dropped is where the matrix ... (10) It is assumed that no single observation has very large effect in the fitting, and then dropping two observations is roughly the sum of dropping each observation individually. The same is extended to a cluster, where dropping all members of the cluster is roughly equivalent to dropping each member individually.
By Huber (1967) formula the standard variance estimate ... (11) Using equation (7) for logistic model the score ... (12) where y i is the observed response of the i th unit and y i ~ bin(n i, p i ) and Hessian Substituting these in equation (11) ... (13) Suppose there are C clusters and g j observations in each cluster (j=1,2….C). Then the robust standard error is ... (14) where and is the contribution to score from each cluster. Liu (1998) suggests two finite sample corrections to adjust the estimate more closely to the population values. This requires a multiplier q c , which is considered to be or asymptotically where N is the number of observations, p is the number of parameters to be estimated and C is the number of clusters.
For large samples, both adjustments are the same (As for large N, p<<N and (N-p-1)≈(N-1).
Selecting variables to the model: Initially a study could consist of a large number of variables and therefore, variables that are significantly associated with the outcome have to be identified using an appropriate univariate test for association i.e. Chi square test (Agresti, 2002). As an univariate test is unadjusted for other factors, a liberal significance level of about 0.2 should be used in selecting variables for modelling (Collett, 1994). The variables selected at the univariate stage should then be examined after adjustment for other variables using a logistic model with all other chosen variables as fixed effects and the cluster variable as a random effect. Forward selection method (Collett, 1994) with a significance level of 5% (0.05) is usually used to select a suitable model. When selecting important variables to the model, the Wald statistic associated with the variable was used instead of the likelihood ratio test (LRT) statistic, which is the standard statistic used in the basic logistic regression model. This is because when a cluster effect is present,

September 2011
Journal of the National Science Foundation of Sri Lanka 39 (3) the standard errors of the parameter estimates are adjusted and thus the LRT is invalid (Rana et al., 2010).

The example:
The study data consists of poisoned patients who were admitted to 36 peripheral hospitals in the North Central Province of Sri Lanka between July 2005 and December 2005 (Senarathna, 2006). The data collected was classified into two groups, namely, patient details and hospital details. Patient data such as age, gender, clinical symptoms and treatments received were obtained from the 'bed head ticket' and the hospital details were collected from relevant authorities of each hospital.
The main response of interest in this study is the outcome for the patient. During the study period only a few deaths were recorded due to poisoning. Due to lack of data, death was excluded as an outcome. Thus only two outcomes, discharged or transferred to a secondary hospital, were considered as outcomes for the patient. It was of interest to determine factors affecting the transfer of poisoned patients from peripheral hospitals (Eddleston, 2006). This is a binary outcome and the suitable model to fit for this type of data is the Binary Logistic regression model. However, as patients were collected from hospitals, patients from the same hospital tend to be correlated (Guittet et al., 2006). In addition, corresponding information on the hospital where the patient is admitted to is the same for patients from the same hospital. Therefore in estimating the standard error of the parameter estimates in a logistic model, we cannot use the usual maximum likelihood method, which assumes the observations to be independent. Therefore an adjustment is required for this hospital effect in estimating the standard errors of the parameter estimates.
Thus the methods discussed are illustrated in this example in order to explain its advantages.

Logistic model for the patient outcome
From univariate analysis, association of the variables with the outcome for the patient is analysed one variable at a time. However it is important to analyse the effect of these variables when present together. In addition it is also important to identify the most significant factors that affect the outcome. Therefore, to achieve these objectives a logistic mixed model is fitted to the data, which adjusts for the cluster (hospital) effect.
The variable selection indicated that the following fixed effects were included in the logistic regression model. The variable name is given within brackets.
Poison type ingested (cpoison) Appearance of the patient's pupil (cpupils) Consciousness of the patient (conscious) Intravenous (IV) access is given to the patient (ivacess) Atropine is given to the patient (atropine) IV fluids are given to the patient (ivfluids) Patient is being monitored (monitoring) Availability of ambuventilation facilities in the hospital (ambuventilation) Availability of fuller's earth at the hospital (fullersearth) Distance from peripheral hospital to the secondary hospital (distance) Availability of diazepam IV in the hospital (diazepamiv) Availability of nasogastric (Ng) tubes at the hospital (ngtubes) Availability of adrenaline in the hospital (hadrenaline) Availability of 2-pyridine aldoxime methiodide (PAM) in the hospital (hpam) Number of sanitary labourers in the hospital (slabourer) Availability of dextrose in the hospital (dextrose) Of these variables, distance and slabourer are continuous variables while all the others are categorical variables. The cluster effect was fitted using the 'cluster' option in the STATA package. This implies that the cluster (hospital) effect is considered to be a random effect. The standard errors are adjusted based on the Huber formula. The model selected by the forward selection method is ... (15) Where β w hospital follows a N(0,σ 2 H ) distribution where σ 2 H is the between hospital variance. The model fitting was limited to the main effects due to large number of variables considered and to keep the complexity of the model to the minimum. The model with these main effects (given by equation 15) was then checked for adequacy using diagnostic methods. Table 2 shows the odds ratios, robust standard errors of the odds ratios and 95% confidence intervals for the odds ratios for the logistic model with mixed effects (equation 15). Additionally, it also includes the  The baseline level for the calculating odds ratios has been considered as the first level for all categorical variables. The * indicates the variables corresponding to the odds ratios that became significant after adjustment for cluster effect.
percentage decrease in the standard error of the odds ratio when compared to the model unadjusted for cluster effect.
The odds ratios of cpoison_2, ivfluids_2, hadrenalin_2 and conscious_2 are significant after adjusting for the cluster effect while these were not significant before adjustment. The significance of all other odds ratios at 5% level are the same before and after adjustment. After adjustment for the cluster effect, the average decrease in standard error of the odds ratios overall is 2.3%. The change in significance of 4 odds ratios indicates that the

September 2011
Journal of the National Science Foundation of Sri Lanka 39 (3) adjustment for cluster effect is important. Interpretation of the odds ratios given in Table 2 is included in another section.

Diagnostics of the model
One method of determining the fit of the model is to observe how the model can classify the outcome for the patient for each observation, which is a function of the observed responses and estimated values of the covariates. Table 3 shows the classification given by the model against actual outcome for the patient. The model fitted, correctly identifies the outcome of 78.42% of the patients. This is nearly 80% and shows that the model classifies the outcome well (Rana et al., 2010) and thus indicates its goodness of fit.

Comparing the outcome without adjusting for cluster (hospital) effect
One matter of interest is to compare the change in the standard errors of the odds ratios with and without adjusting for the cluster/hospital effect. Table 4 gives the odds ratios, robust standard errors of the odds ratios and the 95% confidence intervals for the odds ratios for logistic model without adjusting for the hospital (random) effect.
The standard errors that have higher values compared to corresponding robust standard errors when adjusted for cluster effect is indicated by bold letters. It can be observed that out of the 33 parameters estimated in the model, 18 parameters have a higher standard error when unadjusted for cluster effect. When significance of the odds ratios are compared before and after adjustment, this shows that four odds ratios that were not significant in the unadjusted analysis become significant after the adjustment, indicating that adjustment for cluster effect is important in drawing correct conclusions.

Interpreting the odds ratios
The type of poison ingested by the patient is a significant factor that contributes to the transfer of patients. The odds of transfer for patients with Oleander, Paraquot and spray poisons are approximately 5, 10 and 0.2 times the odds for transfer with organophosphate poisoned patients. This is the case even if there is no adjustment for cluster effect. The adjustment for cluster effect makes the carbamate poisoned patients also have significantly   When cluster effect is adjusted for, it makes those patients that were given IV fluids have lower (0.15) odds of being transferred compared to patients that were not given IV fluids.
When cluster effect is adjusted for, it makes the odds for transfer for a patient admitted to hospitals without adrenalin approximately 3 times higher compared to a patient admitted to hospital with adrenalin.
After adjustment for the cluster effect, a patient that has been unconscious has approximately 6 times higher odds of being transferred compared to a patient that has been conscious.

CONCLUSION
Results indicate that while levels 7, 8 and 9 of poison ingested are significantly different from level 1 of poison ingested in the absence of the adjustment for cluster effect. When cluster effect is adjusted for, additionally level 2 of poison ingested also becomes significant. Adjustment for cluster effect also makes the IV access, adrenalin and consciousness factors significant. Therefore 4 (out of 33) parameters that were not significant before adjustment for cluster effect became significant with the adjustment, indicating the importance of its implementation.

Stata commands used in the analysis:
/* Stata commands used in fitting the fitted model with main effects adjusting for cluster effect. This programme gives the results on the parameter estimates and robust standard errors of the estimates and the 95% confidence interval of the estimates for the adjusted analysis. */ xi:logit outcome i.ncpoison i.ambuventilation /// i.ncpupils i.nivacess i.natropine i.nivfluids /// i.nmonitoring i.fullersearth distance i.diazepamiv /// i.ngtubes i.hadrenaline i.hpam slabourer /// i.ndextrose i.conscious, cluster(hospitals) /* Stata commands used in fitting the fitted model with main effects adjusting for cluster effect. This programme gives the results of the odds ratios, robust standard errors of the odds ratios and the 95% confidence interval of the odds ratios for the adjusted analysis. */ xi:logistic outcome i.ncpoison i.ambuventilation /// i.ncpupils i.nivacess i.natropine i.nivfluids /// i.nmonitoring i.fullersearth distance i.diazepamiv /// i.ngtubes i.hadrenaline i.hpam slabourer /// i.ndextrose i.conscious, cluster(hospitals) /*Stata commands used in fitting the fitted model with main effects unadjusted for cluster effect. This programme gives the results on the parameter estimates and robust standard errors of the estimates and the 95% confidence interval of the estimates for the unadjusted analysis. */ xi:logistic outcome i.ncpoison i.ambuventilation /// i.ncpupils i.nivacess i.natropine i.nivfluids /// i.nmonitoring i.fullersearth distance i.diazepamiv /// i.ngtubes i.hadrenaline i.hpam slabourer /// i.ndextrose i.conscious /*Stata commands used in fitting the fitted model with main effects unadjusted for cluster effect. This programme gives the results of the odds ratios, robust standard errors of the odds ratios and the 95% confidence interval of the odds ratios for the unadjusted analysis. */ xi:logistic outcome i.ncpoison i.ambuventilation /// i.ncpupils i.nivacess i.natropine i.nivfluids /// i.nmonitoring i.fullersearth distance i.diazepamiv /// i.ngtubes i.hadrenaline i.hpam slabourer /// i.ndextrose i.conscious