Article Text

Download PDFPDF

Validation of a type 1 diabetes algorithm using electronic medical records and administrative healthcare data to study the population incidence and prevalence of type 1 diabetes in Ontario, Canada
  1. Alanna Weisman1,2,3,
  2. Karen Tu3,4,5,6,
  3. Jacqueline Young1,
  4. Matthew Kumar1,
  5. Peter C Austin1,3,
  6. Liisa Jaakkimainen1,3,6,
  7. Lorraine Lipscombe1,2,3,7,
  8. Ronnie Aronson8,
  9. Gillian L Booth1,2,3,9
  1. 1ICES, Toronto, Ontario, Canada
  2. 2Division of Endocrinology & Metabolism, Department of Medicine, University of Toronto, Toronto, Ontario, Canada
  3. 3Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
  4. 4Department of Family and Community Medicine, University of Toronto, Toronto, Ontario, Canada
  5. 5Toronto Western Hospital Family Health Team, University Health Network, Toronto, Ontario, Canada
  6. 6North York General Hospital, Toronto, Ontario, Canada
  7. 7Women’s College Research Institute, Women’s College Hospital, Toronto, Ontario, Canada
  8. 8LMC Diabetes & Endocrinology, Toronto, Ontario, Canada
  9. 9Li Ka Shing Knowledge Institute of St. Michael’s Hospital, Toronto, Ontario, Canada
  1. Correspondence to Dr Alanna Weisman; alanna.weisman{at}


Introduction We aimed to develop algorithms distinguishing type 1 diabetes (T1D) from type 2 diabetes in adults ≥18 years old using primary care electronic medical record (EMRPC) and administrative healthcare data from Ontario, Canada, and to estimate T1D prevalence and incidence.

Research design and methods The reference population was a random sample of patients with diabetes in EMRPC whose charts were manually abstracted (n=5402). Algorithms were developed using classification trees, random forests, and rule-based methods, using electronic medical record (EMR) data, administrative data, or both. Algorithm performance was assessed in EMRPC. Administrative data algorithms were additionally evaluated using a diabetes clinic registry with endocrinologist-assigned diabetes type (n=29 371). Three algorithms were applied to the Ontario population to evaluate the minimum, moderate and maximum estimates of T1D prevalence and incidence rates between 2010 and 2017, and trends were analyzed using negative binomial regressions.

Results Of 5402 individuals with diabetes in EMRPC, 195 had T1D. Sensitivity, specificity, positive predictive value and negative predictive value for the best performing algorithms were 80.6% (75.9–87.2), 99.8% (99.7–100), 94.9% (92.3–98.7), and 99.3% (99.1–99.5) for EMR, 51.3% (44.0–58.5), 99.5% (99.3–99.7), 79.4% (71.2–86.1), and 98.2% (97.8–98.5) for administrative data, and 87.2% (81.7–91.5), 99.9% (99.7–100), 96.6% (92.7–98.7) and 99.5% (99.3–99.7) for combined EMR and administrative data. Administrative data algorithms had similar sensitivity and specificity in the diabetes clinic registry. Of 11 499 711 adults in Ontario in 2017, there were 24 789 (0.22%, minimum estimate) to 102 140 (0.89%, maximum estimate) with T1D. Between 2010 and 2017, the age-standardized and sex-standardized prevalence rates per 1000 person-years increased (minimum estimate 1.7 to 2.56, maximum estimate 7.48 to 9.86, p<0.0001). In contrast, incidence rates decreased (minimum estimate 0.1 to 0.04, maximum estimate 0.47 to 0.09, p<0.0001).

Conclusions Primary care EMR and administrative data algorithms performed well in identifying T1D and demonstrated increasing T1D prevalence in Ontario. These algorithms may permit the development of large, population-based cohort studies of T1D.

  • type 1
  • population-based studies
  • clinical epidemiology
  • validation

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:

Statistics from

Request Permissions

If you wish to reuse any or all of this article please use the link below which will take you to the Copyright Clearance Center’s RightsLink service. You will be able to get a quick price and instant permission to reuse the content in many different ways.

Significance of this study

What is already known about this subject?

  • Validated algorithms identifying diabetes in large electronic medical record and administrative healthcare databases permit population-based research in diabetes but generally do not distinguish type 1 and type 2 diabetes.

  • The inability to identify type 1 diabetes in large databases is a barrier to understanding type 1 diabetes epidemiology and outcomes.

What are the new findings?

  • Algorithms identifying type 1 diabetes in a primary care electronic medical record database had excellent performance, and algorithms identifying type 1 diabetes in administrative healthcare data had good performance.

  • Application of the administrative data algorithms demonstrated that type 1 diabetes prevalence rates among adults in Ontario, Canada have been increasing since 2010, while incidence rates have been stable.

How might these results change the focus of research or clinical practice?

  • Application of these validated algorithms for identifying type 1 diabetes will permit population-based studies on the epidemiology, healthcare utilization, and outcomes of type 1 diabetes.


Type 1 diabetes (T1D) results from the autoimmune destruction of the pancreatic beta cells, leading to lifelong insulin deficiency and hyperglycemia. Although T1D accounts for only an estimated 5%–10% of all diabetes cases, it continues to be associated with excess morbidity, premature mortality, and an economic burden to healthcare systems.1–3 The relative rarity of T1D compared with the burden of type 2 diabetes (T2D) and difficulties in distinguishing T1D from T2D in large databases have been major barriers to understanding T1D epidemiology and outcomes at a population level.4 For example, while the epidemiology of T1D in children has been well described in a number of countries,5–7 there is a paucity of data concerning the epidemiology of T1D in adults, which is important for public health and resource planning.8

Population-based data sources such as electronic medical records (EMRs) or administrative healthcare databases are efficient and cost-effective methods for studying diabetes epidemiology and outcomes.9 An important strategy for conducting population-based studies is ‘electronic phenotyping’, in which algorithms are developed and used to identify patients with a particular condition.10 A number of algorithms distinguishing T1D from T2D have been developed using EMR databases. The majority of these algorithms were developed using rule-based methods or decision trees, although Lethebe and others applied various other machine learning approaches.11 These algorithms commonly rely heavily on physician billing codes that are specific for T1D or T2D.11–15 However, physician billing codes differentiating T1D and T2D are not available in all regions and, even when they are available, result in substantial misclassification due to errors in coding.16 Furthermore, there has been limited comparison in the performance of algorithms derived from different methods (ie, rule-based vs decision trees vs machine learning).

Algorithms distinguishing T1D and T2D using administrative healthcare data have been developed for use in the pediatric population in countries such as the USA and Canada, but are more challenging to apply to adult populations in whom the proportion of T1D to T2D is much lower.17 18 To our knowledge, no algorithms distinguishing T1D from T2D in adults have been developed using administrative healthcare data alone, which lack the rich clinical details available in EMR databases. A major advantage of administrative healthcare data algorithms is their potential to be applied to an entire population.

Our objectives were to (1) derive and validate algorithms distinguishing T1D from T2D in diabetes populations using a primary care EMR database, population-based administrative healthcare data, and combined EMR and administrative data from Ontario, Canada; and (2) evaluate temporal trends in prevalence and incidence of T1D in this population by applying the best performing administrative healthcare data algorithms.

Research design and methods

This study used EMR and administrative healthcare data from the province of Ontario, Canada, which has a single-payer healthcare system and a population of approximately 14 million. Records for a given individual were linked across multiple databases using encoded identifiers based on health card numbers and analyzed at ICES (formerly known as the Institute for Clinical Evaluative Sciences). The study was designed and reported according to guidelines for validation studies using administrative data.19

Creation of the reference population

The reference population was derived from a primary care EMR database, known as the Electronic Medical Records Primary Care (EMRPC). EMRPC (formerly referred to as EMRALD) contains all clinical information from the medical chart of ~450 000 patients, contributed by over 400 primary care physicians using the Practice Solutions EMR in Ontario, Canada. Available data include the Cumulative Patient Profile (CPP) (medical history, problem list, medications, risk factors and allergies), progress notes, laboratory tests, specialist consultation letters, diabetes education center notes, and hospital discharge summaries. The population of EMRPC is generally representative of the entire Ontario population, with slight under-representation of young adult men and slight over-representation of young adult women.20

All patients in EMRPC with diabetes were identified using a previously validated algorithm that has a sensitivity of 83.1% and specificity of 98.2%.21 22 Inclusion criteria were age ≥18 years and use of EMR by the primary care physician for at least 1 year. Exclusion criteria included individuals registered to the primary care physician who had not yet had a clinic visit prior to 30 September 2015 and missing date of birth.

A random 25% subsample of EMRPC patients with diabetes were selected to create the reference population. The number of individuals required in validation studies of a disease state depends on the prevalence of the disease. A larger number of individuals is required for diseases with low prevalence to ensure there are a sufficient number of cases in the validation sample.23 To increase the efficiency of chart abstraction, we used an approach similar to the ‘Strawman’ algorithm employed by Klompas et al,12 by selecting patients more likely to have T1D to undergo chart abstraction, thereby increasing the number of cases for the same number of charts abstracted. A search of medications in the CPP categorized patients into those with possible T1D (prescribed insulin with or without metformin, but no other antihyperglycemic agent) who were selected for chart abstraction, and those unlikely to have T1D (not prescribed insulin or prescribed a non-insulin, non-metformin antihyperglycemic agent) whose charts were not abstracted but were included in the reference population as they were assumed to not have T1D.

Patients classified as having possible T1D based on these criteria were manually reviewed by an endocrinologist (AW) using a standardized abstraction platform that was piloted on the first 100 charts. Fifty charts were randomly selected to be reabstracted by the same abstractor for intrarater reliability assessment, and an additional 50 charts were randomly selected to be abstracted by a second abstractor (a primary care physician, LJ) for inter-rater reliability assessment. During the abstraction, diabetes type was classified as T1D or T2D using the entire EMR record and specific classification criteria (online supplementary appendix 1).

Supplemental material

Validation of algorithms in an independent external sample

Administrative healthcare data algorithms that were derived using EMRPC as the reference population were additionally validated in an independent sample, the LMC Diabetes Registry. LMC is a network of community-based endocrinology practices with a central database that includes diabetes type as determined by an endocrinologist.24 The database used for this study included records for patients from seven clinic sites.

Variables for EMR data algorithms

Automated searches of free text in the CPP were performed, which is the active health history section of the EMR. CPP descriptions of diabetes were classified as ‘Definite T1D’, ‘Possible T1D’ or ‘T2D’ (see table 1 footnote). Automated searches of structured fields were performed for medications (categorized as any insulin, bolus insulin, metformin, or antihyperglycemic medication other than insulin or metformin), body mass index and age.

Table 1

Descriptive characteristics of the EMRPC reference population by diabetes type

Variables for administrative healthcare data algorithms

Hospitalization and emergency department diagnoses were obtained from the CIHI (Canadian Institute for Health Information)-Discharge Abstract Database and the National Ambulatory Care Reporting Services held at ICES. Physician service claims were determined from the Ontario Health Insurance Plan. Prescription medication information was obtained from the Ontario Drug Benefit Database, which includes claims for prescription medications for individuals age 65 or older, or those individuals under age 65 with disability or social assistance. Laboratory data were obtained from the Ontario Laboratories Information System. Insulin pump use was obtained from the Ontario Ministry of Health Assistive Devices Program, which has provided funding for insulin pumps to children with T1D since 2006 and adults with T1D since 2008. Date of diabetes diagnosis (incidence) was determined from the Ontario Diabetes Database, which requires two physician service claims for diabetes within 2 years or one hospitalization record with diabetes.9 Pediatric diabetes was identified by a previously validated algorithm, which requires four physician service codes with a diagnosis of diabetes within 1 year prior to the age of 19 years.25 Hospitalization and emergency records transitioned from the International Classification of Diseases-9 (ICD-9) to ICD-10 in 2002. Prior to 2002, hospitalization records did not distinguish diabetic ketoacidosis (DKA) or diabetes as being specific to T1D or T2D. Therefore, three variables for DKA were considered: any DKA consisted of a hospitalization code for DKA using either ICD-9 or ICD-10 criteria from any date, T1D-specific DKA consisted of a hospitalization code for DKA using ICD-10 criteria for 2002 or later, and T2D-specific DKA consisted of a hospitalization code for DKA using ICD-10 criteria for 2002 or later. A full list of variables and the corresponding codes is available in online supplementary appendix 2.

Derivation of algorithms

Algorithms were derived using the EMRPC reference population. Three approaches to algorithm derivation (all considering the same potential set of variables derived from the data sources above) were used: (1) classification trees; (2) random forests; and (3) rule-based methods. Classification trees categorize individuals as having T1D or T2D based on a sequence of binary splits or partitions that, at each step, consider all possible binary splits of the data (recursive partitioning). First, the single variable which best splits the data into two groups is selected. For each resulting subgroup, another variable (which could be the same variable) is selected that best splits the subgroup into two further subgroups. The process is repeated recursively on each of the two groups until no further partitioning can be performed.26 Variables selected in the classification tree analysis are thus chosen without investigator input or selection. Random forests grow a sequence of classification trees, each classification tree grown in a different bootstrap sample. At each node, only a random sample of the predictor variables are considered when deciding how to partition the given node. Results are combined across the classification trees using a majority-vote approach. Random forests are less susceptible to overfitting compared with conventional classification trees.27 Rule-based methods were initially based on the descriptive characteristics of the study sample and a priori clinical knowledge, and were subsequently refined based on evaluation of the performances of all algorithms.

Evaluation of algorithms

Algorithms were examined with respect to sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) and corresponding 95% CIs. Three algorithms were selected to give minimum, moderate and maximum estimates for prevalence and incidence rates based on their performance characteristics: a low sensitivity, moderate PPV algorithm (minimum estimate); a moderate sensitivity, moderate PPV algorithm (moderate estimate); and a high sensitivity, low PPV algorithm (maximum estimate).

Prevalence and incidence of T1D

The three selected algorithms were then applied to the entire Ontario population to determine yearly crude and age-standardized and sex-standardized prevalence and incidence rates for T1D per 1000 person-years of follow-up time from 1 April 2010 to 31 March 2017. Denominators for these rates were individuals who were alive and eligible for healthcare at the start of each fiscal year (which begins on 1 April each year and concludes on 31 March the following year), and numerators were the number of individuals who met the criteria for the specified algorithm. Rates were standardized to the 1991 Canadian census population.

Sample size calculation

A sample size calculation was performed a priori and abstraction of 385 charts permitted estimation of the 95% CI for sensitivity and specificity with a maximum width of ±4%.28 The sample size for manual chart abstraction could not be estimated precisely a priori since medication criteria were applied to a random subset of the entire diabetes population, and the actual sample size was larger than our sample size requirement calculated.

Statistical analysis

Intrarater and inter-rater reliability were assessed by unweighted Cohen’s kappa. For the classification tree analyses, the R package rpart (V.4.1–8) was used to grow and prune a classification tree with the outcome being T1D.26 29 A stopping rule was that a node did not undergo subsequent splits once the number of subjects in the node was less than 10. The minimum cross-validated mean error was used as a criterion in pruning the final tree. The bootstrapping approach of Harrell et al30 was applied to provide optimism-corrected estimates of performance. Random forests were performed using the R package randomForest (V.4.5–36), and missing data were handled in multiple ways: complete case analysis, simple imputation with mean/median, or imputation using rfimpute.31

Trends in prevalence and incidence rates per 1000 person-years of follow-up were analyzed by negative binomial regression with follow-up time as an offset, comparing the rate in each year with 2010 as a reference. An Omnibus likelihood ratio was calculated for each negative binomial regression and model fit was verified. This process was repeated for each set of estimates (minimum, moderate and maximum). R V.3.1.2 (31 October 2014) was used for performing classification tree and random forest analyses.32 SAS V.9.4 was used for all other analyses.

Sensitivity analyses

As a post-hoc sensitivity analysis, the performances of selected algorithms in the LMC Diabetes Registry were evaluated when stratified by age as ≤40 or >40 years old. Administrative healthcare data in Ontario date back to 1991; thus, in order to identify an individual as having incident diabetes younger than 19 years old, age at index date would have had to be 42 years or younger.


Reference population characteristics

There were 26 238 individuals with diabetes in EMRPC as of 30 September 2015, and 21 518 remained after applying the inclusion and exclusion criteria (figure 1). The reference population included 5402 individuals with diabetes (a random 25% subsample), of whom 435 had charts abstracted for meeting the criteria for possible T1D and 4967 were included as having T2D without chart abstraction based on medication use alone. Of the 435 patients with possible T1D, 195 (44.8%) had definite T1D based on chart abstraction. Thus, the overall proportion of T1D in this sample of primary care patients with diabetes was 3.6%. Unweighted Cohen’s kappa for intrarater and inter-rater agreement was 0.96 and 0.82, respectively. Descriptive characteristics based on EMR and administrative data are reported in table 1.

Figure 1

Selection of subjects for the reference population. EMR, electronic medical record; EMRPC, Electronic Medical Records Primary Care; T1D, type 1 diabetes; T2D, type 2 diabetes.

The LMC Diabetes Registry was used for validation of administrative healthcare data algorithms in an independent sample population. There were 29 371 individuals in the registry, of whom 2379 had T1D and 26 992 had T2D (proportion of T1D out of all diabetes cases: 8.1%; the characteristics are reported in online supplementary table 3).

EMR data algorithms

The best performing EMR algorithm was derived using a classification tree (EMRTree in table 2 and figure 2A), which included the following variables: definite or possible T1D terms in the CPP, insulin, absence of non-insulin, non-metformin antihyperglycemic agent, and age at index date. This algorithm had a sensitivity of 80.6% (95% CI 75.9 to 87.2), specificity of 99.8% (99.7 to 100), PPV of 94.9% (92.3 to 98.7) and NPV of 99.3% (99.1 to 99.5). The random forest (EMRForest) had similar performance to the classification tree, although with a slightly lower PPV. Variable importance factors for the random forest are reported in online supplementary table 4. Rule-based algorithms did not perform as well as the classification tree and random forest. The best performing algorithm using rule-based methods was EMRRule 3, which included definite T1D terms in the CPP and absence of insulin and antihyperglycemic agents other than metformin (sensitivity 77.9% (71.5 to 83.6), specificity 99.5% (99.3 to 99.7), PPV 86.4% (80.4 to 91.1), and NPV 99.2% (98.9 to 99.4)).

Table 2

Performances of selected algorithms*

Figure 2

Classification tree algorithms for EMR data (A), administrative data (B), and combined EMR and administrative data (C). CPP, Cumulative Patient Profile; DKA, diabetic ketoacidosis; EMR, electronic medical record; ODD, Ontario Diabetes Database; T1D, type 1 diabetes; T2D, type 2 diabetes.

Performances of administrative healthcare data algorithms

The best performing algorithm (ie, highest PPV) was AdminRule 2, which was a combination including pediatric diabetes or insulin pump (sensitivity 51.3% (44 to 58.5), specificity 99.5% (99.3 to 99.7), PPV 79.4% (71.2 to 86.1), and NPV 98.2% (97.8 to 98.5)). AdminRule 3, which was a simple combination including age at diabetes incidence younger than 30 years old, pediatric diabetes, DKA and insulin pump use, had a higher sensitivity of 61.5% (54.3 to 68.4) but lower PPV of 75.9% (68.5 to 100). The classification tree (AdminTree; figure 2B) and random forest (AdminForest; online supplementary table 4) had similar performances, but both had lower PPV than the optimally performing algorithms using rule-based algorithms.

Sensitivity and specificity for all algorithms described above were similar in the LMC Diabetes Registry compared with EMRPC. However, PPVs were consistently higher and NPVs slightly lower in the LMC sample, as expected based on the higher proportion of patients with T1D in this sample (online supplementary table 5). For example, AdminRule 2had a sensitivity of 58.9% (56.9 to 60.9), specificity of 99.6% (99.5 to 99.7), PPV of 93.2% (91.8 to 94.4) and NPV of 98.5% (96.3 to 96.7), and AdminRule 3had a sensitivity of 65.4% (63.4 to 67.3), specificity of 99.3% (99.2 to 99.4), PPV of 89.5% (87.9 to 90.9) and NPV of 97% (96.8 to 97.2).

Combined EMR and administrative data algorithms

A rule-based algorithm including insulin, absence of antihyperglycemic medication other than metformin, definite T1D terms in the CPP, insulin pump or pediatric diabetes (EMR+AdminRule 1) was the best performing algorithm, with a sensitivity of 87.2% (81.7 to 91.5), specificity of 99.9% (99.7 to 100), PPV of 96.6% (92.7 to 98.7), and NPV of 99.5% (99.3 to 99.7).

Prevalence and incidence of T1D in Ontario

Three algorithms were applied to the Ontario population to estimate the minimum, moderate, and maximum prevalence and incidence rates of T1D from 2010 to 2017 (online supplementary tables 6–8). The algorithms selected were (1) AdminRule 2: a low sensitivity, moderate PPV algorithm; (2) AdminRule 3: a moderate sensitivity, moderate PPV algorithm; and (3) AdminRule 4: a high sensitivity, low PPV algorithm. The number of individuals with prevalent T1D in 2017 ranged from 24 789 (minimum estimate using AdminRule 2) to 102 140 (maximum estimate using algorithm 14). Age-standardized and sex-standardized prevalence rates increased between 2010 and 2017, with the minimum estimate being an increase from 1.7 (95% CI 1.68 to 1.73) to 2.56 (2.53 to 3.6) per 1000 person-years using AdminRule 2, and the maximum estimate being an increase from 7.48 (7.43 to 7.54) to 9.86 (9.79 to 9.92) per 1000 person-years using AdminRule 4. Overall, this represented a significant increase in T1D prevalence of 32%–50% (p<0.0001 for Omnibus likelihood ratio). Using AdminRule 2 and AdminRule 3, age-standardized and sex-standardized incidence decreased between 2010 and 2014 and subsequently remained stable (figure 3), with the incidence rate per 1000 person-years in 2017 being between 0.04 (0.03–0.04) using the minimum estimate and 0.09 (0.08–0.1) using the moderate estimate (Omnibus likelihood ratio p<0.0001). However, using AdminRule 4 (maximum estimate), incidence rates for T1D decreased from 0.47 (0.46–0.49) per 1000 person-years in 2010 to a nadir of 0.44 in 2013, then increased to 0.61 in 2017 (Omnibus likelihood ratio p<0.0001). Using AdminRule 2 and AdminRule 3, the estimated number of individuals newly diagnosed with T1D in 2017 in Ontario was between 397 and 1013.

Figure 3

Age-standardized and sex-standardized (A) prevalence and (B) incidence trends of T1D in Ontario, Canada per 1000 person-years*. Legend: closed circles: high sensitivity, low PPV algorithm (AdminRule 4); closed triangles: moderate sensitivity, moderate PPV algorithm (AdminRule 3); closed squares: low sensitivity, moderate PPV algorithm (AdminRule 2). *The denominator for determination of prevalence rates was person-years of follow-up for all eligible subjects in each fiscal year. The denominator for determination of incidence rates was person-years of follow-up for eligible subjects excluding those previously identified with T1D prior to the start of each fiscal year. PPV, positive predictive value; T1D, type 1 diabetes.

Sensitivity analyses

The following EMR variables had missing data: body mass index (12.9% missing) and estimated glomerular filtration rate (11.6% missing). The following administrative data variables had missing data: age at diabetes incidence (3.7%) and estimated glomerular filtration rate (22%). For all random forests, multiple approaches for missing data were evaluated (complete case analysis, simple imputation using means/medians, and imputation using rfimpute), and all results were consistent with the primary analysis. Sensitivities of algorithms were higher in individuals ≤40 years old than in those >40 years old, whereas specificities were similar (online supplementary table 9; for example, sensitivity 73.0% (70.4 to 75.5) vs 44.1% (41.2 to 47) for AdminRule 2, and 77.7% (75.3 to 80) vs 52.3% (49.4 to 55.2) for AdminRule 3).


We derived and validated algorithms identifying T1D, among those with diabetes, using primary care EMR data, administrative healthcare data, and EMR combined with administrative healthcare data in Ontario, Canada. Algorithms using EMR data alone or EMR combined with administrative data were able to identify T1D with excellent performance. However, algorithms using administrative data alone had important limitations, namely low sensitivity and only moderate PPV.

In our study, rule-based algorithms outperformed or performed similarly to classification trees and random forests. It is important to note that the best performing rule-based algorithms mimicked classification trees, after the classification tree had been developed. Thus, it is possible to use data-driven approaches such as classification trees to determine optimal variables for inclusion in an algorithm and subsequently develop a rule-based approach that may be simpler for implementation. Rule-based approaches also permitted us to select an algorithm prioritizing specificity and PPV, even at the expense of sensitivity. In contrast, classification trees and random forests generated a single algorithm that best balanced all parameters (sensitivity, specificity, PPV and NPV).

The EMR data algorithm had superior performance compared with previously published algorithms. One algorithm based on the ratio of billing codes for T1D versus T2D, prescriptions and blood test results had a sensitivity of 65% and a PPV of 88% but did not report specificity or NPV.12 13 Lethebe and others used machine learning models to develop an algorithm that included text terms for T1D diagnosis and age at meeting criteria for diabetes diagnosis, and had a sensitivity of 43%, specificity of 99%, PPV of 85% and NPV of 95%.11 Another algorithm that considered diagnostic codes, medications, and age at diabetes incidence had perfect agreement with a gold standard for diagnosis of T1D versus T2D, but likely included a small number of subjects with T1D.14 Finally, a classification tree that included medications, DKA, billing codes, and age had a sensitivity of 92.8%, specificity of 99.3%, PPV of 89.5%, and NPV of 99.5%.15

We used three candidate administrative healthcare data algorithms to evaluate the minimum, moderate, and maximum estimates of the number of prevalent and incident T1D cases. All three algorithms demonstrated increasing prevalence of T1D from 2010 to 2017. We observed decreasing incidence of T1D between 2010 and 2017 using the minimum and moderate estimate algorithms. However, uptake of insulin pumps would be expected to have been highest shortly after the funding program was initiated in 2008, and therefore early ‘incident’ cases may actually reflect individuals with prevalent T1D who were newly starting insulin pump therapy. Interestingly, the moderate sensitivity, low PPV (maximum estimate) algorithm demonstrated a divergent trend in T1D incidence, with rates decreasing until 2014 but subsequently increasing. Since this algorithm includes DKA that was not specified to be associated with T1D or T2D, we hypothesize that the increasing incidence may reflect rising rates of DKA in patients with T2D taking sodium-glucose transport protein 2 inhibitors, who were misclassified as having T1D by this algorithm.33 Accounting for these factors, incidence rates of T1D in Ontario appear to be stable.

Worldwide, the incidence of T1D in children and youth is reported as ≤0.5% of the population.5–7 Incidence in children has generally been found to be increasing, although in some geographical regions there have been reports of a plateau in incidence rates.5 7 34–39 Our estimates for T1D incidence rates in adults were lower than those reported for children, which is not surprising since the majority of T1D cases are diagnosed in childhood or adolescence.40 Less information on T1D incidence trends in adults is available, but our results are consistent with decreasing incidence rates in individuals over the age of 15 and 25 in Belgium, Lithuania and Sweden.6 8 41 42 In these countries, a concurrent increase in T1D incidence in younger ages has been noted, suggesting a shift in diagnosis to younger ages rather than a true increase in overall incidence of T1D. Although T1D prevalence rates in adults are not explicitly reported in the literature to our knowledge, our findings of increasing prevalence are consistent with reports of decreasing mortality in individuals with T1D and perhaps a shift in age at diagnosis to younger ages.43

There are some important considerations in the application of the administrative healthcare data algorithms derived in this study. First, the proportion of patients with T1D in the reference population may have been lower than the true population prevalence since some individuals with T1D only see specialists and not primary care physicians. Indeed, the proportion of diabetes cases in EMRPC assigned as being T1D (3.6%) was lower than observed in other populations, which is generally quoted as 5%–10% of all diabetes cases.4 13 14 If the true population proportion of T1D is higher, then the PPV of applied algorithms would also be expected to be higher. Second, administrative healthcare data in Ontario date back to 1991, which means we could not determine if diabetes incidence criteria were met during childhood for individuals who were 42 years or older at the study end-date. This explains why individuals older than age 44 with incident diabetes younger than 28 years old but who did not meet the criteria for pediatric diabetes were classified as having T1D in the classification tree using administrative healthcare data (figure 2B), since it could not have been determined if these individuals met the criteria for pediatric diabetes. Thus, the algorithm performs better in younger individuals, and we expect that as the retrospective availability of data lengthens, algorithm performance will improve for older ages. Third, the algorithm relies on a database of insulin pump users that is exclusive for T1D, which may not be available in all settings. Finally, we did not assess alternative methods for building predictive models, such as logistic regression or machine learning approaches.

Our study has a number of strengths. The sample size for the reference population was large because automated search criteria based on medications were used to first identify individuals with possible T1D. In addition, we validated the algorithms in two different sample populations. We also evaluated multiple methods for deriving algorithms (classification trees, random forests, and rule-based methods). Limitations of our study include possible misclassification of T1D as T2D in the charts that were not manually abstracted, although the number of false-negatives is likely to be low since antihyperglycemic medications other than metformin were rarely used in T1D care prior to 2015. In addition, internal validation of the algorithms (eg, splitting the sample into derivation and validation sets) was not performed due to sample size concerns; however, classification tree estimates were adjusted for optimism, which corrects for the tendency of predictive models to perform better in training data sets than external data. Finally, algorithms with the highest PPV had only modest sensitivity, which led us to evaluate ranges of plausible estimates for prevalence and incidence rates.

In summary, we have derived and validated algorithms identifying T1D that have excellent performance using primary care EMR data and combined EMR and administrative data. Algorithms using administrative data alone have modest performance, but the benefit of being able to be applied to an entire population. Application of these algorithms demonstrated increasing prevalence of T1D in Ontario since 2010 and stable incidence. These algorithms will permit further study of the epidemiology, healthcare utilization, and outcomes of T1D in large populations.


The authors would like to acknowledge Esha Homenauth and Vicki Ling from ICES, and Ruth Brown from LMC Diabetes & Endocrinology, for preparing the data. The authors thank IMS Brogan for use of their Drug Information Database.



  • Twitter @AlannaWeisman

  • Contributors AW, JY and MK researched and analyzed the data. AW, GLB, LJ, KT, PCA and LL contributed to study design. AW conducted the literature search and wrote the first draft of the manuscript. All authors interpreted the data, provided critical revisions, and approved the final version. AW and GLB are the guarantors for this study and take full responsibility for the work as a whole.

  • Funding This project was funded by Diabetes Action Canada and the Ontario SPOR Support Unit. This study was supported by ICES, a research institute affiliated with the University of Toronto which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care (MOHLTC). The opinions, results and conclusions reported in this paper are those of the authors and are independent from the funding sources. No endorsement by ICES or the Ontario MOHLTC is intended or should be inferred. Parts of this material are based on data and/or information compiled and provided by CIHI. However, the analyses, conclusions, opinions and statements expressed in the material are those of the author(s), and not necessarily those of CIHI. AW is supported by a Canadian Institutes of Health Research Fellowship and a Diabetes Canada fellowship, and the Clinician-Scientist Training Program in the Department of Medicine at the University of Toronto. GLB is supported by a Clinician-Scientist Merit Award from the Department of Medicine at the University of Toronto. PCA is supported by a Mid-Career Investigator Award from the Heart and Stroke Foundation. KT and LJ are supported by Research Scholar Awards from the Department of Family and Community Medicine at the University of Toronto.

  • Competing interests None declared.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement No data are available. Data are held at ICES and are not publicly available.