Discussion
In the largest study of ML-informed subtypes in T2D to date, we had three major findings. First, we identified four clinically distinguishable subtypes across incident and prevalent T2D: metabolic, early onset, late onset and cardiometabolic, with thorough internal validation. Second, there were distinct trajectories of these T2D subtypes, whether by subtype at the end of the study period, all-cause hospitalization or mortality. Third, we confirmed major differences in new and existing medication usage across novel T2D subtypes.
A recent systematic review included 62 studies of ‘complex’ or ML approaches to T2D subclassification in a total of 793 291 participants.25 However, previous efforts to understand subtypes in T2D have not used nationally representative data or generalizable, reproducible methods, and have neither been validated nor been associated with a wide range of outcomes.25 Using nationally representative EHR data in 420 448 individuals, we have used multiple, explainable methods in incident T2D with a larger number of variables and a longer follow-up than prior studies. The same systematic review questioned ‘whether subclassification approaches at diagnosis alone are enough’; our longitudinal follow-up of individuals and subtype classification at the end of the study period (‘prevalent’) is therefore an advance in methodology. However, prior to clinical implementation, external validation of the subtypes observed in this study in other datasets,20 as well as genetic validation,26 is necessary for cluster replication, and to better understand the overlaps and differences compared with other proposed T2D subtypes. As Misra and colleagues note, clinical utility of T2D subtypes depends on the ability to use easily accessible data. Therefore, our use of routine EHR and simple variables is likely to increase generalizability and applicability of our T2D subtypes.
In 114 231 individuals in the Swedish National Diabetes Register, a recent study derived five subtypes in T2D using K-means clustering: older onset, severe hyperglycemia, severe obesity, younger onset, and insulin use.27 Another study proposed five clusters based on only six clinical variables, including BMI and HbA1c, in All New Diabetics In Scania, a purposively sampled diabetes dataset.16 28 Our subtypes are plausible and consistent with these subtypes and other studies but need to be validated externally prior to clinical implementation or evaluation. The identified subtypes are likely to be more clinically generalizable than those identified in other studies using smaller samples of research cohorts or registry-based populations, which may not represent clinical practice or the general population. Interestingly, our T2D subtypes are also similar to the subtypes which we identified in HF (early onset, late onset, atrial fibrillation related, metabolic, cardiometabolic) and CKD (early onset, late onset, cancer related, metabolic, cardiometabolic).20 21 Increasingly, links between T2D, CVD (particularly HF) and CKD are recognized from epidemiology to clinical practice, and there are also overlaps with other diseases such as non-alcoholic fatty liver disease and obesity. Approaches to precision medicine and subtyping across diseases have been based on incident diseases ‘one-at-a-time’ and depending on which disease occurs first.20 21 25 However, as lifetime risk and multiple long-term conditions are increasingly investigated, the similarity of subtypes across diseases suggests that it may be more appropriate to use clustering approaches in combinations of diseases and over the life course, with implications for potential etiologies and mechanisms of disease subtypes.
T2D is associated with high morbidity, mortality and healthcare utilization and costs, particularly in the context of multiple long-term conditions.29 We confirm the high burden of disease and high hospitalization rates, consistent with previous studies. There are clear differences in prognosis across the four subtypes, whether by all-cause mortality, hospital admissions or new diseases. For mortality, the worst prognosis is in the late-onset subtype and the best prognosis in the metabolic subtype, whereas for hospital admissions, the best prognosis is in the early-onset subtype. The high risk of developing severe complications, including over double the risk of lower limb amputation (HR 2.51, 1.51–4.17, p<0.0001), neuropathy (HR 2.06, 1.63–2.59, p<0.0001), and nephrotic syndrome (HR 2.05, 1.76–2.40, p<0.0001) in the late-onset versus the metabolic subtype, illustrates the major burden of healthcare need associated with T2D. The fact that the majority of individuals with T2D remained in the same subtype throughout the study period suggests that subtyping at time of diagnosis is likely to be clinically useful but requires external validation. There was a significant transition from metabolic to late-onset subtype, and from late-onset to cardiometabolic subtype over time. Both trajectories highlight the importance of preventing progression to more morbid or high-burden subtypes in people with T2D.
Investigation of differences in medication across subtypes of T2D may be instructive both for understanding healthcare utilization and disease trajectory and for informing future preventive and therapeutic strategies and clinical trials. Patients in the early-onset subtype exhibited a lower likelihood of medication prescription within 5 years after T2D onset than other clusters. Patients with T2D diagnosed at ages 19–40 are known to be under-represented in pharmaceutical studies,30 and further investigation in other datasets is required to assess this pattern within the early-onset cluster. Individuals in the late-onset subtype had the highest likelihood of medication prescription within 5 years after T2D diagnosis, adjusted for sex, age, and pre-T2D medication. In the cardiometabolic subtype, there was a high medication burden prior to diagnosis of T2D. Adjusting for pre-T2D medication had a significant impact on the trajectory of medication in the cardiometabolic subtype, emphasizing the importance of considering both pre-T2D and post-T2D medication prescription in analysis of subtypes.
Strengths and limitations
Our study introduces T2D subtyping in the largest study population in a nationally representative dataset encompassing the most comprehensive range of variables. We have implemented and validated ML-based subtype definition using an established framework. Additionally, we have employed an explainable AI technique for subtype characterization, capturing non-linear and complex interconnections among variables and highlighting the positive and negative impacts of variables on cluster membership. Cluster characterization and internal validation were conducted through rigorous cross-validated supervised learning. Compared with simple, statistical subtyping based on single variables, we were able to assess the influence of a wide range of variables on cluster membership in more realistic models, and the representativeness of the study population makes our subtypes highly likely to be generalizable, unlike prior smaller, less representative studies.
There are several limitations to our study. First, we did not have access to biomarker variables in the clustering algorithm. While CPRD is a comprehensive, linked dataset, representative of the UK population, it is a generic EHR rather than a diabetes-specific dataset, leading to incomplete biomarker values at T2D onset or study exit. Therefore, our T2D phenotype relied primarily on clinical coding rather than biomarkers. Without complete biomarker variables at T2D onset in the data, there may be a minority of type 1 diabetes records in the cohort due to miscoding at the point of care. To minimize the risk of misclassification, we have used validated phenotype definitions exclusively based on explicit diagnostic codes for T2D.31 The interpretation of our findings must account for this potential misclassification of diabetes common to code-based phenotyping in EHR data. Second, additional incomplete variables were ethnicity, IMD, and behavioral factors for all individuals. Based on the internal validation findings, we used a complete feature rather than a complete case approach to mitigate these limitations and minimize the effects of incomplete variables. Third, we only considered all-cause rather than cause-specific mortality and hospitalization, and average follow-up was 9.49 years. Fourth, while we have labeled the clusters as metabolic, early onset, late onset and cardiometabolic, they represent the overall attributes of cluster membership rather than strict clustering rules. For example, ‘early onset’ and ‘late onset’ are simplified labels to name the subtypes but should not be interpreted as definitive rules or cut-offs for categorizing patients into subtypes based on age alone. Incorporating a probabilistic soft clustering approach in the future could enhance subtype characterization. Fifth, as already stated, we have not yet externally validated the subtypes in other datasets for phenotypic or genetic replicability. Sixth, we only had data about medication prescription, not adherence. Finally, we have not explored the clinical utility or cost-effectiveness of the defined subtypes, which is needed prior to implementation.