Original Article
Analysis of case-cohort data: A comparison of different methods

https://doi.org/10.1016/j.jclinepi.2006.06.022

Abstract

Objective

The case-cohort design combines the advantages of a prospective cohort study and the efficiency of a case–control design. Usually a Cox proportional-hazards model is used for the analyses. However, adaptation of the model is necessary because of the sampling. We compared three methods that were proposed in the literature, which differ in weighting of study subjects: Prentice's, Barlow's, and Self and Prentice's method.

Study Design and Setting

In a cohort of 17,357 women we studied the relationship between body mass index and cardiovascular disease (n = 821) with varying subcohort sizes (sampling fraction = 0.005, 0.01, 0.05, 0.10, 0.15).

Results

Even with a sampling fraction of 0.01, all three methods showed identical estimates and standard errors (SE). With sampling fractions ≥0.10, results of the case-cohort analyses were similar to the full-cohort analyses. In simulations, the three methods gave different results if the full cohort was small (<1,250 subjects, subcohort = 10%, 8% failures) or if the subcohort was smaller than 15% of the full cohort (full cohort of 1,000 observations, 8% failures). The difference between the methods did not change with the number of failures or with different effect sizes.

Conclusion

In the above-mentioned situations, the effect estimates and SE of Prentice's method most closely resembled the full-cohort estimates.

Introduction

The case-cohort design was introduced by Prentice [1] and is useful in analyzing cohort data in which failure is rare because covariate information is collected only from all failures and a random sample (with sampling probability α) of the censored observations, referred to as the subcohort. The design is also efficient if the collection of detailed follow-up information is costly and time-consuming. For example, if newly diagnosed cases are relatively easily obtained for the full cohort by contacting a disease-specific registry, but follow-up for mortality or movement outside the study area (to estimate follow-up time) is more difficult, one may consider the case-cohort design. With the selection of a random subcohort, follow-up for the disease is still necessary for the full cohort, but follow-up for other censoring events is restricted to members of the subcohort. For this purpose, the design was used in the Netherlands Cohort Study [2]. In addition, the subcohort is chosen without regard to any outcome and thus may serve as a comparison group for several different diseases. This might be efficient if biological samples are needed, because they then need to be retrieved only once for the comparison group. Furthermore, if applicable, DNA extraction needs to be done only once.
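The sampling step described above can be illustrated with a minimal Python sketch. It assumes a fixed-size subcohort of round(α × N) subjects drawn without regard to outcome, which is only one way to implement a sampling probability α. The function name `draw_case_cohort`, the seeds, and the randomly assigned case identifiers are hypothetical; only the cohort size (17,357) and the number of cases (821) come from the abstract.

```python
import random

def draw_case_cohort(n_cohort, case_ids, alpha, seed=0):
    """Return (subcohort, analysis_set) for a case-cohort sample.

    Assumption: the subcohort is a fixed-size simple random sample of the
    full cohort, drawn without looking at the outcome.
    """
    rng = random.Random(seed)
    n_sub = round(alpha * n_cohort)
    subcohort = set(rng.sample(range(n_cohort), n_sub))
    # Covariates are needed for the subcohort plus every case,
    # including cases that fall outside the subcohort.
    analysis_set = subcohort | set(case_ids)
    return subcohort, analysis_set

# Illustrative data: subject ids 0..N-1 with hypothetical, randomly placed cases.
n_cohort = 17357
case_ids = set(random.Random(1).sample(range(n_cohort), 821))

subcohort, analysis_set = draw_case_cohort(n_cohort, case_ids, alpha=0.10)
cases_outside_subcohort = case_ids - subcohort
```

Because the subcohort is drawn independently of the outcome, the same `subcohort` set could be reused as the comparison group for a second disease by passing a different `case_ids`.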

Three different weighting methods have been proposed, which differ in the way they handle the weighting of the subcohort members and the cases outside the subcohort [1], [3], [4]. All three methods are incorporated in an SAS macro written by Barlow and Ichikawa and made available through Statlib (http://lib.stat.cmu.edu/general/robphreg) [5]. They also compared the three methods with the nested case–control and the full-cohort analysis in a data set described by Breslow and Day [6]. This cohort, however, is very small (full-cohort size: n = 679) and includes only 56 failures. Most current cohorts used for etiologic research include far more subjects.

The purpose of this article is to compare effect estimates and standard errors (SE) yielded by the three different methods of analysis of case-cohort data with each other and with a full-cohort analysis in a large cohort. As an illustration, we investigate the relation between body mass index (BMI) and cardiovascular disease (CVD) with available cohort data. In addition, we study the influence of the full-cohort size, subcohort size, the number of cases, and the estimated effect size (i.e., size of the relative risk) on the three methods using simulated data.

Weighting methods

For the analysis of case-cohort data a pseudolikelihood is used instead of the partial likelihood, which is normally used in analyzing full-cohort data [1], [5]. This pseudolikelihood is a weighted Cox regression model [5]. The contribution of a failure to the likelihood function by person $i$ at time $t_j$ is

$$
\frac{Y_i(t_j)\, e^{z_i(t_j)\beta}}{Y_i(t_j)\, w_i(t_j)\, e^{z_i(t_j)\beta} + \sum_{k \in S,\, k \neq i} Y_k(t_j)\, w_k(t_j)\, e^{z_k(t_j)\beta}}
$$

The first term in the denominator is the contribution by the case, weighted with weight $w_i$. The second term is the summation over
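As a minimal sketch of how a single term of this pseudolikelihood could be evaluated, the Python function below computes the contribution of one failure at one failure time. It assumes a single scalar covariate and takes the at-risk indicators and the weights as inputs, because this excerpt does not give the weights of the three methods. The function name `failure_contribution` and the toy values are hypothetical, and the summation is assumed to run over the subcohort members other than the case.

```python
import math

def failure_contribution(i, at_risk, weights, z, beta, subcohort):
    """Pseudolikelihood contribution of a failure by subject i at one failure time.

    at_risk[k] is the at-risk indicator (0 or 1), weights[k] the weight, and
    z[k] the scalar covariate of subject k, all evaluated at that failure time.
    """
    numerator = at_risk[i] * math.exp(z[i] * beta)
    # First denominator term: the case itself, multiplied by its weight.
    denominator = at_risk[i] * weights[i] * math.exp(z[i] * beta)
    # Second denominator term: weighted sum over the other subcohort members.
    for k in subcohort:
        if k != i:
            denominator += at_risk[k] * weights[k] * math.exp(z[k] * beta)
    return numerator / denominator

# Toy example (hypothetical values): subject 0 fails, subjects 1 and 2 form the subcohort.
at_risk = [1, 1, 1]
z = [1.0, 0.0, 0.0]
unit_weights = [1.0, 1.0, 1.0]
contribution = failure_contribution(0, at_risk, unit_weights, z, beta=0.0, subcohort={1, 2})
```

With all weights equal to 1 and β = 0, each of the three subjects at risk contributes equally to the denominator, so the term reduces to 1/3, the familiar unweighted Cox partial-likelihood term. Changing only the `weights` argument changes the result, which is where different weighting schemes would enter.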

Example

Table 2 shows the results of the example for the full-cohort (n = 15,768) and case-cohort analyses for each of the five subcohort sizes. With subcohorts larger than 1%, there was no difference between the three weighting methods. All methods showed exactly the same estimates as well as identical robust SEs.

However, only in subcohorts of 10% or larger, estimates were also comparable with those of the full-cohort analysis (i.e., BMI  23 kg/m2 and α = 10%: βfull-cohort = −0.17, SE = 0.09; βPrentice = 

Discussion

In our large cohort example, the three methods for analyzing case-cohort data resulted in very similar effect estimates and SE. Only with unrealistically small subcohort sizes of 1% or less did Prentice's method start to show estimates closer to the full-cohort estimates than the other two methods.

Results from the simulations show again that the three methods result in identical estimates in most situations. But when (sub)cohort sizes are small, the estimates of the method proposed by
