Conclusions
We derived and validated algorithms identifying T1D, among those with diabetes, using primary care EMR data, administrative healthcare data, and EMR combined with administrative healthcare data in Ontario, Canada. Algorithms using EMR data alone or EMR combined with administrative data were able to identify T1D with excellent performance. However, algorithms using administrative data alone had important limitations, namely low sensitivity and only moderate PPV.
In our study, rule-based algorithms outperformed or performed similarly to classification trees and random forests. It is important to note that the best performing rule-based algorithms mimicked classification trees, after the classification tree had been developed. Thus, it is possible to use data-driven approaches such as classification trees to determine optimal variables for inclusion in an algorithm and subsequently develop a rule-based approach that may be simpler for implementation. Rule-based approaches also permitted us to select an algorithm prioritizing specificity and PPV, even at the expense of sensitivity. In contrast, classification trees and random forests generated a single algorithm that best balanced all parameters (sensitivity, specificity, PPV and NPV).
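The trade-off described above, in which a stricter rule gains specificity and PPV at the expense of sensitivity, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Patient` fields, the two rules, and the eight-person toy cohort are hypothetical and are not the variables, algorithms, or data used in this study.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # Hypothetical stand-ins for EMR/administrative variables.
    age_at_diagnosis: int
    on_insulin: bool
    on_other_agent: bool  # antihyperglycemic agent other than metformin
    has_t1d: bool         # reference-standard label


def evaluate(patients, rule):
    """Return sensitivity, specificity, PPV and NPV of a rule against the reference label."""
    tp = fp = tn = fn = 0
    for p in patients:
        predicted = rule(p)
        if predicted and p.has_t1d:
            tp += 1
        elif predicted:
            fp += 1
        elif p.has_t1d:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


def broad_rule(p):
    # Hypothetical inclusive rule: any insulin use counts as T1D.
    return p.on_insulin


def strict_rule(p):
    # Hypothetical restrictive rule: insulin only, and early diagnosis.
    return p.on_insulin and not p.on_other_agent and p.age_at_diagnosis < 30


# Toy cohort (invented for illustration only).
cohort = [
    Patient(12, True, False, True),
    Patient(25, True, False, True),
    Patient(35, True, False, True),
    Patient(55, True, True, False),
    Patient(60, True, False, False),
    Patient(50, False, True, False),
    Patient(28, True, True, False),
    Patient(65, False, False, False),
]

broad = evaluate(cohort, broad_rule)
strict = evaluate(cohort, strict_rule)
print("broad :", broad)
print("strict:", strict)
```

On this toy cohort the broad rule captures every T1D case but half of its positives are false (sensitivity 1.0, PPV 0.5), whereas the strict rule has no false positives but misses one of three T1D cases (PPV 1.0, sensitivity 2/3). A rule-based approach allows an analyst to choose between such rules explicitly.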
The EMR data algorithm had superior performance compared with previously published algorithms. One algorithm based on the ratio of billing codes for T1D versus T2D, prescriptions and blood test results had a sensitivity of 65% and a PPV of 88%, but specificity and NPV were not reported.12 13 Lethebe et al used machine learning models to develop an algorithm that included text terms for T1D diagnosis and age at meeting criteria for diabetes diagnosis, and had a sensitivity of 43%, specificity of 99%, PPV of 85% and NPV of 95%.11 Another algorithm that considered diagnostic codes, medications, and age at diabetes incidence had perfect agreement with a gold standard for diagnosis of T1D versus T2D, but likely included a small number of subjects with T1D.14 Finally, a classification tree that included medications, DKA, billing codes, and age had a sensitivity of 92.8%, specificity of 99.3%, PPV of 89.5%, and NPV of 99.5%.15
We used three candidate administrative healthcare data algorithms to evaluate the minimum, moderate, and maximum estimates of the number of prevalent and incident T1D cases. All three algorithms demonstrated increasing prevalence of T1D from 2010 to 2017. We observed decreasing incidence of T1D between 2010 and 2017 using the minimum and moderate estimate algorithms. However, uptake of insulin pumps would be expected to have been highest shortly after the funding program was initiated in 2008, and therefore early ‘incident’ cases may actually reflect individuals with prevalent T1D who were newly starting insulin pump therapy. Interestingly, the moderate sensitivity, low PPV (maximum estimate) algorithm demonstrated a divergent trend in T1D incidence, with rates decreasing until 2014 but subsequently increasing. Since this algorithm includes DKA that was not specified to be associated with T1D or T2D, we hypothesize that the increasing incidence may reflect rising rates of DKA in patients with T2D taking sodium-glucose transport protein 2 inhibitors, who were misclassified as having T1D by this algorithm.33 Accounting for these factors, incidence rates of T1D in Ontario appear to be stable.
Worldwide, the incidence of T1D in children and youth is reported as ≤0.5% of the population.5–7 Incidence in children has generally been found to be increasing, although in some geographical regions there have been reports of a plateau in incidence rates.5 7 34–39 Our estimates for T1D incidence rates in adults were lower than those reported for children, which is not surprising since the majority of T1D cases are diagnosed in childhood or adolescence.40 Less information on T1D incidence trends in adults is available, but our results are consistent with decreasing incidence rates in individuals over the age of 15 and 25 in Belgium, Lithuania and Sweden.6 8 41 42 In these countries, a concurrent increase in T1D incidence in younger ages has been noted, suggesting a shift in diagnosis to younger ages rather than a true increase in overall incidence of T1D. Although T1D prevalence rates in adults are not explicitly reported in the literature to our knowledge, our findings of increasing prevalence are consistent with reports of decreasing mortality in individuals with T1D and perhaps a shift in age at diagnosis to younger ages.43
There are some important considerations in the application of the administrative healthcare data algorithms derived in this study. First, the proportion of patients with T1D in the reference population may have been lower than the true population prevalence since some individuals with T1D only see specialists and not primary care physicians. Indeed, the proportion of diabetes cases in EMRPC assigned as being T1D (3.6%) was lower than the proportion observed in other populations, which is generally quoted as 5%–10% of all diabetes cases.4 13 14 If the true population proportion of T1D is higher, then the PPV of applied algorithms would also be expected to be higher. Second, administrative healthcare data in Ontario date back to 1991, which means we could not determine if diabetes incidence criteria were met during childhood for individuals who were 42 years or older at the study end-date. This explains why individuals older than age 44 whose diabetes was incident before 28 years of age but who did not meet the criteria for pediatric diabetes were classified as having T1D in the classification tree using administrative healthcare data (figure 2B), since it could not have been determined if these individuals met the criteria for pediatric diabetes. Thus, the algorithm performs better in younger individuals, and we expect that as the retrospective availability of data lengthens, algorithm performance will improve for older ages. Third, the algorithm relies on a database of insulin pump users that is exclusive to T1D, which may not be available in all settings. Finally, we did not assess alternative methods for building predictive models, such as logistic regression or other machine learning approaches.
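The dependence of PPV on the underlying proportion of T1D follows from Bayes' rule: PPV = (sensitivity × proportion) / (sensitivity × proportion + (1 − specificity) × (1 − proportion)). The sketch below evaluates this expression at the proportions quoted above (3.6%, 5% and 10%). The sensitivity and specificity values are assumed placeholders chosen for illustration, not estimates from this study.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity and prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Assumed, illustrative operating characteristics (not study results).
ASSUMED_SENSITIVITY = 0.80
ASSUMED_SPECIFICITY = 0.99

# Proportions of diabetes cases that are T1D, as quoted in the text.
proportions = (0.036, 0.05, 0.10)

ppv_by_proportion = {
    p: ppv(ASSUMED_SENSITIVITY, ASSUMED_SPECIFICITY, p) for p in proportions
}

for p, value in ppv_by_proportion.items():
    print(f"T1D proportion {p:.1%}: PPV = {value:.3f}")
```

With sensitivity and specificity held fixed, PPV rises as the proportion of T1D rises, which is why an algorithm derived in a reference population with a low proportion of T1D would be expected to yield a higher PPV when applied to a population in which T1D is more common.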
Our study has a number of strengths. The sample size for the reference population was large because automated search criteria based on medications were used to first identify individuals with possible T1D. In addition, we validated the algorithms in two different sample populations. We also evaluated multiple methods for deriving algorithms (classification trees, random forests, and rule-based methods). Limitations of our study include possible misclassification of T1D as T2D in the charts that were not manually abstracted, although the number of false-negatives is likely to be low since antihyperglycemic medications other than metformin were rarely used in T1D care prior to 2015. In addition, internal validation of the algorithms (eg, splitting the sample into derivation and validation sets) was not performed due to sample size concerns; however, classification tree estimates were adjusted for optimism, which corrects for the tendency of predictive models to perform better in training data sets than in external data. Finally, algorithms with the highest PPV had only modest sensitivity, which led us to evaluate ranges of plausible estimates for prevalence and incidence rates.
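The text does not specify how the adjustment for optimism was carried out. One widely used approach is the bootstrap optimism correction, sketched below under that assumption. The single-variable threshold rule, the accuracy metric, and the synthetic data are hypothetical stand-ins chosen to keep the example self-contained; they are not the classification trees, performance measures, or data of this study.

```python
import random


def accuracy(cutoff, xs, ys):
    """Accuracy of the rule 'x < cutoff means positive'."""
    return sum((x < cutoff) == y for x, y in zip(xs, ys)) / len(xs)


def fit_threshold(xs, ys):
    """Choose the cutoff that maximizes accuracy on the given sample."""
    candidates = sorted(set(xs)) + [max(xs) + 1]
    return max(candidates, key=lambda c: accuracy(c, xs, ys))


def optimism_corrected_accuracy(xs, ys, n_boot=200, seed=0):
    """Return (apparent, optimism, corrected) accuracy via bootstrap."""
    rng = random.Random(seed)
    n = len(xs)
    # Apparent performance: model fitted and evaluated on the same data.
    apparent = accuracy(fit_threshold(xs, ys), xs, ys)
    total_optimism = 0.0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        cutoff = fit_threshold(bx, by)
        # Optimism: performance in the bootstrap sample minus
        # performance of the same fitted rule in the original sample.
        total_optimism += accuracy(cutoff, bx, by) - accuracy(cutoff, xs, ys)
    optimism = total_optimism / n_boot
    return apparent, optimism, apparent - optimism


# Synthetic data: age at diagnosis with a noisy label (illustration only).
data_rng = random.Random(42)
ages = [data_rng.randint(1, 80) for _ in range(60)]
labels = [(a < 30) != (data_rng.random() < 0.15) for a in ages]

apparent, optimism, corrected = optimism_corrected_accuracy(ages, labels)
print(f"apparent={apparent:.3f} optimism={optimism:.3f} corrected={corrected:.3f}")
```

The corrected estimate subtracts the average optimism from the apparent performance, which approximates how the fitted rule would perform in new data without setting aside a separate validation set.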
In summary, we have derived and validated algorithms identifying T1D that have excellent performance using primary care EMR data and combined EMR and administrative data. Algorithms using administrative data alone have modest performance, but have the benefit of being applicable to an entire population. Application of these algorithms demonstrated increasing prevalence of T1D in Ontario since 2010 and stable incidence. These algorithms will permit further study of the epidemiology, healthcare utilization, and outcomes of T1D in large populations.