Discussion
Using two nationally representative datasets, we studied the associations between sociodemographic factors and (1) DM prevalence at the county level and (2) DM status at the individual level. We compared the explanatory power of traditional epidemiological and novel ML methods. All of the sociodemographic factors assessed in the present analysis—including age, gender, African-American race, Hispanic ethnicity, household income, and education level—were significantly associated with diabetes prevalence or status at both the county and individual levels. After adjusting for additional behavioral and health factors, all of these associations remained except education level. We found that ML outperformed multivariate regression in both individual-level and county-level data. We observed that higher proportions of Hispanic individuals were associated with lower rates of county-level DM, but being Hispanic was associated with higher rates of individual-level DM, a novel finding that would not have been uncovered through the analysis of only one type of data.
The finding that Hispanic ethnicity was negatively associated with DM prevalence at the county level, but positively associated with risk of DM at the individual level, is noteworthy and warrants further study. Since Hispanics disproportionately live in the US West, it is possible that counties where Hispanics tend to live have lower overall rates of obesity and other risk factors associated with DM.29 Counties in the US West also have relatively smaller populations of African-Americans, who have a higher risk for DM.4 Nevertheless, Hispanic ethnicity remained an independent predictor of DM prevalence after controlling for region. This finding that a higher percentage of Hispanics was associated with lower county-level rates of DM gives credence to the observations of previous studies, which demonstrated that high proportions of Hispanics are associated with a protective effect against county-level obesity, which is in turn closely associated with DM.29 31 32 The results of those studies and the present analysis indicate that higher Hispanic ethnic density is associated with better health at the county level, corroborating the notion of a ‘Hispanic Paradox’, that is, the epidemiological phenomenon in which Hispanic and Latino Americans tend to have health outcomes and mortality rates that ‘paradoxically’ are comparable with, or in some cases better than, those of their US non-Hispanic white counterparts, even though Hispanics have lower average income and education.33 34 The observation that being of Hispanic ethnicity was associated with higher individual DM risk, however, complicates this idea of a Hispanic paradox but is perhaps unsurprising, given that Hispanic populations have on average lower income and education, both of which are independently associated with higher rates of DM.32–36 This association between Hispanic ethnicity and individual DM status remained significant after controlling for income and education level, which implies that there exist unmeasured risk factors for DM beyond these impactful socioeconomic factors. Future research should explore potential explanations for these paradoxical findings, including whether living alongside more members of one’s own culture is associated with improved health outcomes, or whether individuals who live in communities with higher proportions of Hispanics, or in otherwise more ethnically diverse communities, have more DM-friendly diets or engage in healthier lifestyles or behaviors.
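As a purely illustrative sketch of how county-level and individual-level associations can point in opposite directions, the following Python snippet builds three synthetic counties. Every number in it is an assumption chosen for illustration, not a value from CHR or NHANES, and the mechanism it encodes (a county-wide contextual effect that lowers everyone's risk as the Hispanic share rises, combined with a fixed within-county risk gap) is only one hypothetical explanation among those discussed above.

```python
# Synthetic illustration only: all values are made up, not CHR or NHANES data.
counties = [
    # (population, Hispanic share of population)
    (10_000, 0.10),
    (10_000, 0.20),
    (10_000, 0.30),
]

BASE_RISK = 0.15       # assumed DM risk of a non-Hispanic resident at share 0
CONTEXT_SLOPE = -0.10  # assumed change in everyone's risk per unit Hispanic share
INDIVIDUAL_GAP = 0.06  # assumed extra risk for Hispanic residents within a county


def group_risks(share):
    """Return (Hispanic risk, non-Hispanic risk) in a county with this share."""
    non_hispanic = BASE_RISK + CONTEXT_SLOPE * share
    return non_hispanic + INDIVIDUAL_GAP, non_hispanic


def county_prevalence(share):
    """Expected overall DM prevalence in a county."""
    hispanic, non_hispanic = group_risks(share)
    return share * hispanic + (1 - share) * non_hispanic


def pooled_risks(county_list):
    """Expected DM risk by ethnicity after pooling individuals across counties."""
    cases_h = people_h = cases_n = people_n = 0.0
    for population, share in county_list:
        hispanic, non_hispanic = group_risks(share)
        people_h += population * share
        cases_h += population * share * hispanic
        people_n += population * (1 - share)
        cases_n += population * (1 - share) * non_hispanic
    return cases_h / people_h, cases_n / people_n


prevalences = [county_prevalence(share) for _, share in counties]
hispanic_risk, non_hispanic_risk = pooled_risks(counties)

for (_, share), prevalence in zip(counties, prevalences):
    print(f"Hispanic share {share:.2f}: county prevalence {prevalence:.3f}")
print(f"Pooled individual risk, Hispanic: {hispanic_risk:.3f}")
print(f"Pooled individual risk, non-Hispanic: {non_hispanic_risk:.3f}")
```

Under these assumed parameters, county prevalence falls as the Hispanic share rises, while pooled individual risk is higher for Hispanic than for non-Hispanic residents. The sketch shows only that the two patterns are arithmetically compatible; it says nothing about which mechanism operates in the real data.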
Consistent with other studies, we found higher household income and education level to be protective against DM at the county and individual levels.37–41 Our finding that female gender was protective against DM at the individual level was consistent with past studies.42–44 However, we found that at the county level, a higher proportion of women was associated with higher DM prevalence. This novel and paradoxical finding could be explained by the longer life expectancy of women, which may be insufficiently controlled for by the per cent of the population over 65 years old, or by underlying differences that this analysis did not account for. Nevertheless, because there is relatively little variability in percentage female across counties, this may also simply be a spurious finding. Further research should examine the association between the female share of a county’s population and diabetes prevalence.
African-American race was positively associated with both county-level DM prevalence and individual DM status, consistent with studies reporting a higher risk of DM among African-American individuals.4 Historically, African-Americans live in higher proportions in counties with lower household incomes as well as in the US South, which was independently associated with higher rates of DM (see online supplemental table 5).29 Nevertheless, African-American race was an independent predictor of DM at both the county and individual levels even when income and geography were adjusted for. It should be noted that counties with higher proportions of African-American individuals may have more diabetogenic environments that we could not adjust for, such as fewer healthy food options and fewer opportunities for physical activity.36 45 Thus, both African-Americans and Hispanics are associated with lower income levels and education, but communities with a higher proportion of Hispanics are associated with lower rates of DM, while communities with a higher proportion of African-Americans are associated with higher rates of DM.
Though the findings at the county and individual levels were very similar, the differences that emerged, particularly those for Hispanic individuals and populations, underscore the importance of validating analyses of large datasets against one another. Both population-level and individual-level analyses serve important roles in understanding phenomena that differentially influence health outcomes at different levels of society, but neither tells the complete story. Failing to validate one with the other introduces the possibility of obtaining spurious or misleading findings, which may yield inefficient or misguided efforts to prevent disease development and progression.
We found that ML models explained more of the variation in individual diabetes status and county-level diabetes prevalence than standard regression models, while identifying nearly the same set of risk factors. The fact that ML and standard multivariate regression identified nearly the same risk factors from both individual-level and county-level data provides crucial validation for these ML methods and affords additional confidence in the increased explanatory power associated with ML. The performance improvement generated by ML was significant for the county-level analysis, suggesting that researchers and policymakers should use these models to guide population-level analyses and interventions. The marginal performance improvement of ML for the individual-level analysis suggests that for such studies, more interpretable models will likely be adequate, or potentially superior, for most practical purposes. This may alleviate the concern, among those using traditional interpretable models in this context, that they are missing out on the potentially superior performance of ML. It also provides an additional data point for the current, impassioned debate about whether interpretable models must necessarily suffer from performance far inferior to that of ML models.46
Because publicly available CHR data lack individual-level information—and because publicly available NHANES data lack county-level information—there does not currently exist a single dataset that can be used to examine each potential risk factor at the individual and county levels simultaneously using the same data. Using two independent datasets does, however, confer value in the context of ML by increasing the amount of data available—and, through this, increasing explanatory power—while mitigating the risk of producing spurious or misleading results.
To protect and better serve disadvantaged groups, certain ethical considerations should be made when using ML and traditional methods alike. First, results must not be used to exacerbate existing health or socioeconomic disparities, and results regarding race must be interpreted with the understanding that race itself is a social construct.47 48 Additionally, in clinical practice, it is crucial that providers continue to take into account individual priorities and socioeconomic or cultural circumstances, which statistical methods are unable to account for. Finally, just as with traditional regression, sampling methods in ML must minimize bias; ML models are trained on the data their creators supply, and their predictive power and perceived accuracy on subsequent data are subject to the same biases introduced during the acquisition of those data. Both NHANES and CHR have implemented measures such as oversampling of minority populations, in the case of NHANES, and contribution from multiple datasets, in the case of CHR, to ensure sufficient data are gathered for often-under-represented populations. Nevertheless, as will be discussed, both datasets rely in part on self-reported data and on certain difficult-to-obtain data that cannot be obtained for all individuals or counties, which may introduce bias.
Limitations
This study has several limitations. First, the NHANES and CHR data used in this analysis were derived from time periods of differing lengths. This was necessary to increase the data available for analysis of NHANES: for many of the variables included, 2-year cohorts of NHANES reported fewer than 2000 data points. This was particularly important when training ML algorithms, which rely on larger amounts of data. Another possible limitation is that the NHANES and CHR data are based in part on self-reported measures, which could introduce bias. Nevertheless, self-reported DM status has been shown to be effective in estimating provider-assessed DM in certain populations.49 Moreover, while most population studies rely exclusively on self-reported data, the NHANES dataset incorporates ample objective laboratory and physiologic measures. Another potential limitation is how DM was defined or assessed in these two datasets. In the county-level data, CHR may underestimate county-level DM prevalence by only including those diagnosed by a physician. Nevertheless, as long as this bias is non-differential by county or other factors considered, our statistical results remain directionally valid. Additionally, in the individual-level data, individuals were deemed to have DM if they met particular laboratory or survey criteria. Though imperfect, the criteria used are consistent with established, gold-standard definitions of DM that capture participants with both diagnosed and undiagnosed DM. Finally, given the data available, we were unable to distinguish between type 1 and type 2 DM. Because type 2 represents over 90% of US cases of DM, the impact of this limitation is likely minimal.50 Ultimately, as data continue to be amassed and data linkage techniques grow more sophisticated, it may become possible to join different datasets to overcome these sorts of data limitations.
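The statement above that non-differential underestimation of county-level prevalence leaves results directionally valid can be illustrated with a simplified sketch. It assumes, hypothetically, that every county's observed prevalence is the same fixed fraction k of its true prevalence, and it considers a single predictor x in an ordinary least-squares fit. The symbols k, x, p, and β are introduced here for illustration and do not appear in the analysis itself.

```latex
% Assumption: observed prevalence is a constant fraction k of true prevalence
% in every county c (non-differential underestimation).
\tilde{p}_c = k\,p_c, \qquad 0 < k \le 1

% Simple OLS slope of true prevalence on a county-level predictor x:
\hat{\beta} = \frac{\operatorname{Cov}(x, p)}{\operatorname{Var}(x)}

% Slope estimated from the underestimated prevalence:
\tilde{\beta} = \frac{\operatorname{Cov}(x, k\,p)}{\operatorname{Var}(x)}
             = k\,\hat{\beta}

% Because k > 0, the sign of the slope is unchanged:
\operatorname{sign}(\tilde{\beta}) = \operatorname{sign}(\hat{\beta})
```

Under this assumption the estimated association shrinks toward zero by the factor k but keeps its direction. If k instead varied with x or with other county characteristics, this argument would no longer hold.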