
Multimethod, multidataset analysis reveals paradoxical relationships between sociodemographic factors, Hispanic ethnicity and diabetes
  1. Gabriel M Knight (1),
  2. Gabriela Spencer-Bonilla (2),
  3. David M Maahs (3,4,5),
  4. Manuel R Blum (2,6,7),
  5. Areli Valencia (8),
  6. Bongeka Z Zuma (8),
  7. Priya Prahalad (3),
  8. Ashish Sarraju (9),
  9. Fatima Rodriguez (9),
  10. David Scheinker (3,10,11)
  1. Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
  2. Department of Medicine, Stanford University School of Medicine, Stanford, California, USA
  3. Division of Pediatric Endocrinology, Stanford University School of Medicine, Stanford, California, USA
  4. Stanford Diabetes Research Center, Stanford, California, USA
  5. Department of Health Research and Policy, Stanford University School of Medicine, Stanford, California, USA
  6. Department of General Internal Medicine, Bern University Hospital, Bern, Switzerland
  7. Institute of Primary Health Care, University of Bern, Bern, Switzerland
  8. Stanford University School of Medicine, Stanford, California, USA
  9. Division of Cardiovascular Medicine, Stanford University School of Medicine, Stanford, California, USA
  10. Department of Management Science and Engineering, Stanford University School of Engineering, Stanford, California, USA
  11. Clinical Excellence Research Center, Stanford University School of Medicine, Stanford, California, USA
  Correspondence to Dr David Scheinker; dscheink{at}; Dr Fatima Rodriguez; frodrigu{at}


Introduction Population-level and individual-level analyses each have strengths and limitations, as do ‘blackbox’ machine learning (ML) models and traditional, interpretable models. Diabetes mellitus (DM) is a leading cause of morbidity and mortality with complex sociodemographic dynamics that have not been analyzed in a way that leverages both population-level and individual-level data and both traditional epidemiological and ML models. We analyzed complementary individual-level and county-level datasets with both regression and ML methods to study the association between sociodemographic factors and DM.

Research design and methods County-level DM prevalence, demographics, and socioeconomic status (SES) factors were extracted from the 2018 Robert Wood Johnson Foundation County Health Rankings and merged with US Census data. Analogous individual-level data were extracted from the 2007 to 2016 National Health and Nutrition Examination Survey studies and corrected for oversampling with survey weights. We used multivariate linear regression and ML regression models for county data, and multivariate logistic regression and ML classification models for individual data. Regression and ML models were compared using measures of explained variation (R² for county data and area under the receiver operating characteristic curve (AUC) for individual data).
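The two comparison metrics named above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names `r_squared` and `auc` and the toy inputs are assumptions introduced here, and the AUC uses the simple pairwise definition (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case), which ignores survey weights.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy values, not study data.
    print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect fit
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

R² applies to the continuous county-level outcome (DM prevalence), while AUC applies to the binary individual-level outcome (DM yes or no), which is why each dataset is scored with a different metric.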

Results Among the 3138 counties assessed, the mean DM prevalence was 11.4% (range: 3.0%–21.1%). Among the 12 824 individuals assessed, 1688 met DM criteria (13.2% unweighted; 10.2% weighted). Age, gender, race/ethnicity, income, and education were associated with DM at the county and individual levels. Higher county Hispanic ethnic density was negatively associated with county DM prevalence, while Hispanic ethnicity was positively associated with individual DM. ML outperformed regression in both datasets (mean R² of 0.679 vs 0.610, respectively (p<0.001) for county-level data; mean AUC of 0.737 vs 0.727 (p<0.0427) for individual-level data).
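The gap between the unweighted and weighted prevalence arises because survey weights down-weight oversampled groups. The sketch below shows only the point estimate, under stated assumptions: `prevalence` is a hypothetical helper, the four-person sample is invented for illustration, and real NHANES analyses also account for strata and sampling units when estimating variance.

```python
from typing import Optional, Sequence


def prevalence(has_dm: Sequence[int],
               weights: Optional[Sequence[float]] = None) -> float:
    """Share of the (weighted) sample with DM; unweighted if weights is None."""
    if weights is None:
        weights = [1.0] * len(has_dm)
    total = sum(weights)
    with_dm = sum(w for d, w in zip(has_dm, weights) if d)
    return with_dm / total


# The unweighted figure in the abstract follows directly from the counts.
unweighted_pct = round(100 * 1688 / 12824, 1)

# Hypothetical toy sample (not study data): the two participants with DM come
# from an oversampled group, so each represents fewer people in the population.
toy_dm = [1, 1, 0, 0]
toy_weights = [1.0, 1.0, 3.0, 3.0]

if __name__ == "__main__":
    print(unweighted_pct)                   # 13.2
    print(prevalence(toy_dm))               # 0.5
    print(prevalence(toy_dm, toy_weights))  # 0.25
```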

Conclusions Hispanic individuals are at higher risk of DM, while counties with larger Hispanic populations have lower DM prevalence. Analyses of population-level and individual-level data with multiple methods may afford more confidence in results and identify areas for further study.
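A small constructed example shows how an individual-level association and a county-level association can point in opposite directions. The counts below are invented for illustration and are not study data; the names `COUNTIES` and `ols_slope` are assumptions introduced here. In every toy county Hispanic residents have a higher DM rate than non-Hispanic residents, yet counties with a larger Hispanic share have lower overall prevalence because their baseline rates are lower.

```python
# Hypothetical toy counts, NOT study data:
# (county, Hispanic residents, Hispanic with DM,
#  non-Hispanic residents, non-Hispanic with DM)
COUNTIES = [
    ("A", 100, 20, 900, 135),
    ("B", 300, 48, 700, 77),
    ("C", 600, 72, 400, 28),
]


def ols_slope(x, y):
    """Least-squares slope of y on x (simple linear regression)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


# County level: Hispanic share vs overall DM prevalence.
hispanic_share = [h_n / (h_n + o_n) for _, h_n, _, o_n, _ in COUNTIES]
county_prevalence = [
    (h_dm + o_dm) / (h_n + o_n) for _, h_n, h_dm, o_n, o_dm in COUNTIES
]
county_slope = ols_slope(hispanic_share, county_prevalence)

# Individual level: pooled DM risk by ethnicity across all counties.
hispanic_risk = sum(c[2] for c in COUNTIES) / sum(c[1] for c in COUNTIES)
non_hispanic_risk = sum(c[4] for c in COUNTIES) / sum(c[3] for c in COUNTIES)

if __name__ == "__main__":
    print(f"county-level slope: {county_slope:.3f}")
    print(f"Hispanic risk: {hispanic_risk:.2f}, "
          f"non-Hispanic risk: {non_hispanic_risk:.2f}")
```

The toy data say nothing about why the real pattern arises; they only show that the two findings are not contradictory and that county-level results should not be read as individual-level risk.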

  • informatics
  • diabetes mellitus, type 2
  • ethnic groups
  • risk factors

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See:



  • FR and DS are joint senior authors.

  • Contributors GMK designed the research, performed statistical analysis, and wrote the manuscript. DS and FR contributed to data analysis, interpretation of results, and critical revision of the manuscript. All authors contributed meaningfully to this manuscript and approved the final version. GMK is the guarantor of this work and, as such, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

  • Funding DMM, DS, PP were supported by R18DK122422. DMM was additionally supported by P30DK116074. FR is funded by a career development award from the National Heart, Lung, and Blood Institute (K01 HL 14460) and the American Heart Association/Robert Wood Johnson Harold Amos Medical Faculty Development Program.

  • Map disclaimer The depiction of boundaries on the map(s) in this article does not imply the expression of any opinion whatsoever on the part of BMJ (or any member of its group) concerning the legal status of any country, territory, jurisdiction or area or of its authorities. The map(s) are provided without any warranty of any kind, either express or implied.

  • Competing interests DMM acts as an advisor for Sanofi-Aventis US LLC, Novo Nordisk Inc, WL Gore and Associates, Inc, and Medtronic Minimed, Inc. DS acts as an advisor for Carta Healthcare. No other potential conflicts of interest relevant to this article were reported.

  • Patient consent for publication Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Data availability statement Data are available in a public, open access repository. All data used in this study were publicly available from the National Health and Nutrition Examination Survey and County Health Rankings.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.