Obesity studies

Applying machine learning approaches for predicting obesity risk using US health administrative claims database

Abstract

Introduction Body mass index (BMI) is inadequately recorded in US administrative claims databases. We aimed to validate the sensitivity and positive predictive value (PPV) of BMI-related diagnosis codes using an electronic medical records (EMR) claims-linked database. Additionally, we applied machine learning (ML) to identify features in US claims databases to predict obesity status.

Research design and methods This observational, retrospective analysis included 692 119 people ≥18 years of age, with ≥1 BMI reading in MarketScan Explorys Claims-EMR data (January 2013–December 2019). Claims-based obesity status was compared with EMR-based BMI (gold standard) to assess BMI-related diagnosis code sensitivity and PPV. Logistic regression (LR), penalized LR with L1 penalty (Least Absolute Shrinkage and Selection Operator), extreme gradient boosting (XGBoost) and random forest, with features drawn from insurance claims, were trained to predict obesity status (BMI≥30 kg/m2) from EMR as the gold standard. Model performance was compared using several metrics, including the area under the receiver operating characteristic curve. The best-performing model was applied to assess feature importance. Obesity risk scores were computed from the best model generated from the claims database and compared against the BMI recorded in the EMR.

Results The PPV of diagnosis codes from claims alone remained high over the study period (85.4–89.2%); sensitivity was low (16.8–44.8%). XGBoost performed the best at predicting obesity with the highest area under the curve (AUC; 79.4%) and the lowest Brier score. The number of obesity diagnoses and obesity diagnoses from inpatient settings were the most important predictors of obesity. XGBoost showed an AUC of 74.1% when trained without an obesity diagnosis.

Conclusions Obesity prevalence is under-reported in claims databases. ML models, with or without explicit obesity diagnosis codes as inputs, show promise in improving obesity prediction accuracy compared with obesity codes alone. Improved obesity status prediction may assist practitioners and payors to estimate the burden of obesity and investigate the potential unmet needs of current treatments.

What is already known on this topic

  • Obesity is underestimated in administrative claims databases; a significant proportion of individuals with obesity do not have a diagnosis code for the condition recorded in their claims data. Machine learning methods can optimize predictability and may be used to predict disease status, diagnosis, and clinical variables using real-world evidence.

What this study adds

  • This real-world data analysis confirmed the upward trend in obesity prevalence from 2013 to 2019. While most obesity codes in administrative claims data accurately identify cases of obesity (ie, high positive predictive value), they consistently miss a significant proportion of individuals with this disease (ie, low sensitivity).

  • This study addressed the limitation of underutilization of obesity codes in claims data via a machine learning model, enabling more accurate identification of individuals with obesity in administrative claims data.

  • The most important predictors of obesity were the number of obesity diagnoses, obesity diagnoses from inpatient settings, diagnosis of obstructive sleep apnea, diagnosis of hypertension, and the use of antidiabetic or antihypertensive agents.

How this study might affect research, practice or policy

  • Improved obesity status prediction could help healthcare professionals and payors estimate the burden of the disease and evaluate current obesity treatment strategies.

Introduction

Obesity remains a major health crisis in the USA. About half of the adult US population is likely to have obesity by 2030.1 Obesity is associated with multiple chronic conditions such as type 2 diabetes (T2D), cardiovascular diseases (CVD), metabolic syndrome, chronic kidney disease, metabolic dysfunction-associated steatotic liver disease (MASLD), certain types of cancer, obstructive sleep apnea (OSA), osteoarthritis and various psychological conditions.2–4 Consequently, obesity imposes a significant economic burden.5

The obesity status of an individual can be assessed and classified by measuring body mass index (BMI).6 Based on BMI, the US Centers for Disease Control and Prevention classifies people as underweight (below 18.5 kg/m2), normal weight (18.5 to <25 kg/m2) or overweight (25 to <30 kg/m2), or as having class 1 (30 to <35 kg/m2), class 2 (35 to <40 kg/m2) or class 3 (≥40 kg/m2) obesity.7 Despite its use as a key anthropometric measure of obesity status, BMI is not adequately recorded in US administrative claims databases.8 When International Classification of Disease (ICD) diagnosis codes for obesity are present in claims data, obesity status can be identified with a high positive predictive value (PPV); in practice, however, these diagnosis codes are underused, resulting in low sensitivity.8–11 This limits the use of administrative claims data in studying obesity as an exposure, confounder or effect modifier of interest in weight-related health service research.12

The inherent low sensitivity of administrative claims data in defining obesity can be addressed by predicting the risk of obesity using real-world data (RWD), that is, data related to patient health status or the delivery of health care, including medical claims, and electronic medical records (EMR), or data from registries and digital health technology.13 RWD is imperative to researchers, health technology assessment agencies, payors, and other stakeholders to assess and mitigate the risk of obesity-related comorbidities and to better guide clinical decisions.14 Prediction modeling is often performed using classical statistical regression methods. However, these models may overlook complex associations when a higher number of variables are studied. In addition, choosing the right model is not straightforward when using these methods.15 Machine learning (ML) methods address these limitations by maximizing predictability and effectively handling possible non-linear relationships and higher-order interactions.16 ML methods can be used to develop algorithms to predict disease risk, diagnosis, and clinical variables.17–21

Recently, Wu et al8 applied four ML algorithms: CatBoost, random forest, penalized logistic regression with an L1 penalty (Least Absolute Shrinkage and Selection Operator (Lasso)), and artificial neural networks to predict BMI classifications in a claims database. The study identified a Super-Learner algorithm (SLA) that combined predictions from the four ML algorithms through logistic regression, with an area under the receiver operating characteristic curve (AUC ROC) of approximately 88% for the prediction of BMI classifications of 30–40 kg/m2. However, the study used the EMR and claims databases independently, with the claims data source as the target, which may not accurately reflect actual BMI values given the shortcomings of claims databases.

In the current study, we assessed the validity of obesity diagnosis codes in claims data and their concordance with EMR data using a large US claims-EMR-linked database covering residents across all US census regions. Furthermore, we applied ML algorithms to identify features in the US claims database to predict obesity status, using BMI recorded in the EMR as the gold standard.

Research design and methods

Data source

This observational, retrospective study used data from the Merative MarketScan Explorys Claims-EMR data set. The data set was built by combining MarketScan administrative claims and Explorys EMR data and consists of 6.5 million individuals with claims-EMR linked records from more than 400 000 providers and physicians from all four US census regions. This data set comprises comprehensive longitudinal patient records including healthcare cost information, outpatient prescription fills including specialty pharmacy, coverage eligibility, vital signs and other biometrics, medical and surgical history, laboratory results, implantable devices, patient-reported outcomes, inpatient drugs, ambulatory prescriptions, clinical events, and procedures.

Study participants

The study included individuals aged ≥18 years with at least one valid BMI value (LOINC (Logical Observation Identifiers Names and Codes): 39156-5 and 89270-3) recorded in the claims-EMR data set from January 2013 to December 2019. The date of the last valid BMI record was set as the index date. Individuals were required to be commercially insured or covered under a Medicare Supplemental plan with continuous enrolment during the baseline period (ie, 12 months prior to, or on the index date).

Validity of the obesity diagnosis codes

Obesity prevalence over the study period was determined from the EMR (based on the BMI value (≥30 kg/m2)) and from claims data (based on the presence of obesity diagnosis codes recorded for any healthcare encounter, ie, outpatient, emergency room (ER) and inpatient visits at baseline), each divided by the number of subjects who met the inclusion criteria. The validity of the obesity diagnosis codes, ICD-Clinical Modification (CM)-9 and ICD-CM-10, was evaluated in terms of their sensitivity and PPV.
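As a minimal illustration (not the study's code), the sensitivity and PPV calculations can be sketched as follows, using hypothetical per-person flags for EMR-based obesity status (the gold standard) and the presence of a claims diagnosis code:

```python
# Illustrative sketch: sensitivity and PPV of claims-based obesity codes
# against an EMR-based gold standard. Records are hypothetical.

def sensitivity_ppv(records):
    """records: iterable of (emr_obese, claims_code) booleans."""
    tp = sum(1 for emr, clm in records if emr and clm)      # coded and truly obese
    fn = sum(1 for emr, clm in records if emr and not clm)  # truly obese, no code
    fp = sum(1 for emr, clm in records if not emr and clm)  # coded, not obese
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv

# Toy cohort: 10 people with obesity per EMR, only 3 coded in claims;
# 1 person without obesity carries a code.
cohort = ([(True, True)] * 3 + [(True, False)] * 7
          + [(False, True)] * 1 + [(False, False)] * 9)
sens, ppv = sensitivity_ppv(cohort)
print(round(sens, 2), round(ppv, 2))  # 0.3 0.75: low sensitivity, high PPV
```

The toy numbers mirror the qualitative pattern reported in this study: codes that are usually correct when present (high PPV) but frequently absent for people who truly have obesity (low sensitivity).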

Candidate variables and feature aggregation

Model features included in the prediction model were demographic characteristics such as age, sex, BMI, payor, and index year measured at the index date; clinical characteristics such as diagnosis codes, procedure codes, medication codes, and Charlson Comorbidity Index measured at baseline; and healthcare resource utilization such as the number of inpatient, ER, and outpatient visits at baseline. (A full list of features used in the prediction models is available in online supplemental table 4.) The main binary outcome of interest was obesity, defined as BMI≥30 kg/m2 recorded in the EMR.

The features obtained from the data set were grouped into relevant code hierarchies or clinical concepts to increase the ease of computation and clinical interpretation. The diagnosis codes (ICD-CM-9, ICD-CM-10) and procedure codes (Current Procedural Terminology and Healthcare Common Procedure Coding System (HCPCS)) were grouped using Clinical Classification Software, a web-based analytics software.22 Medications in the National Drug Code and HCPCS were grouped using the Cerner Multum drug database.23 Analytical data sets were created using the Instant Health Data platform (Panalgo, Boston, Massachusetts, USA) and preprocessed with the cohorts, target variables, and features using the data science module within the platform. All features were carried forward without selection during preprocessing.

Model evaluation and testing

The ML algorithms used for feature selection were Lasso, extreme gradient boosting (XGBoost), and random forest. The data set was randomly split into a training/validation set (60%/20%) and a test (holdout) set (20%). Hyperparameters were assessed using five-fold cross-validation in the training/validation set. The models were trained and evaluated multiple times with different data splits to obtain a reliable assessment of their performance for each hyperparameter combination. Models were initially evaluated using their default hyperparameter settings (online supplemental table 1) and then tuned using a random grid search to optimize their predictive ability. The best-performing model was then selected for further tuning.24
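The 60/20/20 partitioning described above can be sketched as follows (a stdlib-only sketch with hypothetical patient IDs, not the study's pipeline); the holdout set is set aside untouched until final model evaluation:

```python
# Minimal sketch of a 60/20/20 train/validation/test split on patient IDs.
import random

def split_60_20_20(ids, seed=42):
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # fixed seed for reproducibility
    n_train = int(len(ids) * 0.6)
    n_val = int(len(ids) * 0.2)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]      # test (holdout) set, used only once
    return train, val, test

train, val, test = split_60_20_20(range(1000))
print(len(train), len(val), len(test))  # 600 200 200
```

In the study, hyperparameter tuning via five-fold cross-validation would operate only within the training/validation portion, leaving the holdout set as a proxy for new data.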

Model performance was evaluated using the test set. The primary metric of evaluation of the model performance was the AUC ROC. Other metrics investigated were PPV (accuracy), sensitivity (recall), precision, Youden Index, F1 score, negative predictive value (NPV), and specificity.8 Validation of the best-performing model was conducted using the test set which was not used during the model’s training and validation. This approach provided a good proxy for how the model would perform with new data.
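The secondary metrics listed above can all be derived from a 2x2 confusion matrix; a hedged sketch with hypothetical counts (tp, fp, fn, tn are not study data):

```python
# Illustrative sketch: secondary classification metrics from confusion counts.

def classification_metrics(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)              # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                      # positive predictive value (precision)
    npv = tn / (tn + fn)                      # negative predictive value
    youden = sensitivity + specificity - 1    # Youden Index
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv, "youden": youden, "f1": f1}

m = classification_metrics(tp=80, fp=20, fn=20, tn=80)
print(round(m["youden"], 3), round(m["f1"], 3))  # 0.6 0.8
```

The AUC ROC, by contrast, is threshold-free and summarizes discrimination across all possible cut-offs, which is why it served as the primary metric.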

Feature importance ranking and risk score computation

A total of 479 features were entered into the ML models. The best-performing model was used to select and rank features according to their relative importance to risk prediction in the training data set. To assess the probability of an individual meeting the criteria for obesity, the final model calculated a risk score for each individual in the cohort. This risk score was then compared against the known BMI from the EMR to characterize the risk score distribution relative to the true BMI.
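The alignment of risk scores with BMI classes can be sketched as a simple per-class summary of predicted probabilities (hypothetical scores, not study data):

```python
# Illustrative sketch: mean predicted obesity probability by EMR BMI class.
from collections import defaultdict

def mean_risk_by_bmi_class(rows):
    """rows: iterable of (bmi_class, predicted_probability) pairs."""
    sums, counts = defaultdict(float), defaultdict(int)
    for bmi_class, prob in rows:
        sums[bmi_class] += prob
        counts[bmi_class] += 1
    return {c: sums[c] / counts[c] for c in sums}

rows = [("underweight", 0.125), ("underweight", 0.25),
        ("class 3 obesity", 0.625), ("class 3 obesity", 0.875)]
print(mean_risk_by_bmi_class(rows))
# {'underweight': 0.1875, 'class 3 obesity': 0.75}
```

A well-calibrated model should show this distribution rising monotonically with BMI class, which is the pattern examined in the risk prediction results.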

Sensitivity analysis

A series of sensitivity analyses were performed with BMI≥35 kg/m2 and ≥40 kg/m2 as binary target variables. Shorter baseline periods (ie, 3 months and 6 months) were used to assess model performance in the absence of 12 months of data. The binary classification model was also trained and validated with BMI/weight-related diagnoses (eg, ICD-CM-10: overweight or obesity: E66*; BMI: Z68*) excluded from the input features, and the model performance was reported.

Ethical review and regulatory considerations

This research was conducted in strict compliance with all state, local, and federal regulatory requirements. The study execution was consistent with Good Clinical Practices, Good Epidemiological Practices, the International Council for Harmonisation, Health Insurance Portability and Accountability Act regulations, the US Department of Health and Human Services, the Office of Human Research Protection and any applicable Institutional Review Board guidelines. When applicable, the research was conducted in accordance with the regulations of the US Food and Drug Administration as described in 21 Code of Federal Regulations 50 and 56 and all applicable laws. Participants’ data were de-identified to protect their privacy.

Results

Demographics and clinical characteristics

A total of 3 532 946 participants had a BMI value recorded in the database; 2 781 248 were aged ≥18 years, of whom 692 119 had continuous enrolment in the baseline period. The cohort attrition is depicted in online supplemental figure 1. Of those with continuous enrolment, 276 646 (40.0%) people had BMI≥30 kg/m2 in the EMR database and 96 427 (13.9%) had an obesity diagnosis code indicating BMI≥30 kg/m2 or an obesity-related diagnosis code in the administrative claims data. The demographics and characteristics are given in table 1. The mean (SD) age of the participants was 50.8 (17.8) years, and the mean (SD) BMI was 29.5 (7.1) kg/m2. There was a higher proportion of women compared with men (55.5% vs 44.5%) across the subsets. The characteristics of people with obesity in the EMR and the claims subsets differed from each other. Relative to those classified as having obesity based on BMI in the EMR, those with claims diagnosis codes tended to be older, have a higher BMI, a higher disease burden, and Medicare Supplemental insurance.

Table 1 | Demographics and cohort characteristics

Prevalence, sensitivity, and PPV

Obesity prevalence, sensitivity, and PPV, as derived from the ICD-CM from claims from 2013 to 2019, are shown in figure 1. The PPV of the diagnosis codes remained relatively high over the study period with an average of 88% (ranging from 85.4% to 89.2%), while sensitivity remained low, with an average of 31% (ranging from 16.8% to 44.8%). The stratified table in online supplemental table 2 provides information on the coding of obesity‐related diagnoses in the claims data by the measured BMI, as recorded in EMR.

Figure 1

Obesity prevalence, sensitivity and PPV from ICD-CM from 2013 to 2019. *Sensitivity=TP/(TP+FN); PPV=TP/(TP+FP). BMI was used to determine obesity status in the EMR; obesity diagnosis codes were used to determine obesity status in the claims. BMI, body mass index; EMR, electronic medical record; FN, false negative; FP, false positive; ICD-CM, International Classification of Diseases-Clinical Modification; PPV, positive predictive value; TP, true positive.

Model prediction

The performance of the predictive models in the training/validation data set, as assessed by AUC, average precision, PPV (accuracy), sensitivity (recall), precision, F1 score, NPV, and specificity, is presented in table 2. XGBoost had the highest AUC (79.4%) and the greatest accuracy (73.5%) of the models tested. The Brier score among the tested models ranged from 0.17 to 0.20, with XGBoost scoring the lowest. The performance of XGBoost was validated in the test/holdout data set, where it again achieved an AUC of 79.4% (online supplemental table 3).

Table 2 | Performance of predictive models fitted to the training set (60%) and evaluated on the validation set (20%)

Feature importance

Figure 2 shows the feature importance plot generated by XGBoost in the training/validation data set. The features identified by XGBoost as positively impacting obesity risk prediction in the training/validation data set were: the number of obesity diagnoses; obesity diagnoses from inpatient settings; diagnoses of OSA, hypertension, T2D, hyperglycemia, pre-diabetes, neurocognitive disorders (delirium, dementia, amnestic and other cognitive disorders), MASLD and metabolic dysfunction-associated steatohepatitis (MASH); use of antidiabetic agents, antihypertensive combinations, diuretics (bumetanide, mannitol, furosemide and torsemide) and bone resorption inhibitors; presence of an obesity ER visit; and older age. The presence of overweight or underweight, diagnoses of osteoporosis, acne and melanocytic nevi, and examination encounters were found to negatively impact the risk prediction.

Figure 2

Top 20 features identified from XGBoost in the train/validation data set. Gain is the relative contribution of the corresponding feature to the model. All features were captured at baseline, except when indicated as index. *Diuretics include bumetanide, mannitol, furosemide and torsemide. #Neurocognitive disorders include delirium, dementia, amnestic and other cognitive disorders. ER, emergency room; MASH, metabolic dysfunction-associated steatohepatitis; MASLD, metabolic dysfunction-associated steatotic liver disease; XGBoost, extreme gradient boosting.

Risk prediction

The probability of an individual meeting the criteria for obesity (BMI≥30 kg/m2), as predicted by the XGBoost model and compared with the actual BMI classification, is shown in figure 3. A mean (SD) predicted probability of 0.72 (0.26) was observed in the class 3 obesity cohort, compared with 0.20 (0.16) in the underweight cohort. The model indicated that individuals with higher predictive values have a greater likelihood of meeting the BMI criteria for obesity.

Figure 3

Predicted probability from XGBoost model versus actual BMI classification. BMI, body mass index; XGBoost, extreme gradient boosting.

Sensitivity analysis

Trained and validated binary classification models for BMI≥35 kg/m2 and ≥40 kg/m2 achieved higher AUCs of 81.0% and 83.6%, respectively, with the XGBoost model. The top common predictive features were similar to those presented in the main analysis of BMI classification ≥30 kg/m2. For BMI≥35 kg/m2, the number of obesity diagnoses and sleep apnea were the most important predictors of obesity, followed by overweight, use of antidiabetic agents, and use of antihypertensive medications. For BMI≥40 kg/m2, the number of obesity diagnoses and sleep apnea were the most important predictors of obesity, followed by use of antidiabetic agents, overweight, and use of diuretic medications.

The XGBoost model performed better with 12 months of baseline data (AUC of 79.4%) than with 3-month or 6-month baseline data. On retraining without baseline BMI/weight-related diagnoses, the XGBoost model yielded a satisfactory performance, with an AUC of 74.1%. The top five features identified were sleep apnea (positive impact), antidiabetic agents (positive impact), hypertension (positive impact), use of antihypertensive combination medications (positive impact), and osteoporosis (negative impact; online supplemental figure 2).

Discussion

The growing prevalence of obesity necessitates the exploration of risk prediction and prevention strategies for obesity. Administrative claims databases can be comprehensive and inexpensive sources of RWD for epidemiological studies as they establish the prevalence and incidence of various chronic diseases across large and demographically diverse populations.25 However, previous studies have shown that the use of these data sources may result in an incorrect estimate of obesity prevalence due to underutilization of the diagnosis codes.8 11 26 27

The current study assessed the validity of obesity diagnosis codes in an administrative claims database and reported high PPV and low sensitivity, in concordance with previous studies.11 12 Ammann et al reported low specificity and high PPV of the ICD-CM-9 and ICD-CM-10 BMI-related diagnosis codes for identification of patients with overweight or obesity.11 Ammann et al also demonstrated higher sensitivity of ICD-CM-10 coding compared with ICD-CM-9 coding11 which was confirmed by Suissa et al,28 a finding potentially attributable to improved coding practices and reimbursement requirements. Moreover, the accuracy or PPV of obesity diagnosis codes was higher among patients with obesity-related complications such as diabetes or hypertension,27 and the probability of having an obesity-related diagnosis code in claims data increased with comorbidity burden and hospitalization.11

In the current study, older individuals, those with severe obesity, or those with a higher disease burden were more likely to have an obesity diagnosis code recorded in the claims data. Possible reasons include that these individuals may be more likely to seek medical care, or that healthcare providers may code obesity for those with greater severity and burden, as they may consider obesity to be a driving diagnosis for the high disease burden. Indeed, people with an obesity diagnosis code were more likely to have increased healthcare utilization, including hospitalizations, emergency room visits, outpatient visits, and increased usage of medications compared with those without obesity diagnosis codes.28 However, despite the increased usage of obesity diagnosis codes in the claims database, the true obesity prevalence was still underestimated in the claims database compared with EMR data in the current study. Furthermore, the sensitivity was low, as only approximately 31% of people with BMI indicative of obesity in the EMR had a corresponding diagnosis in claims. These results further emphasize the magnitude of obesity code underutilization and its impact on assessing obesity in claims data.

Previous studies showed that people with severe obesity were more likely to have a BMI-related diagnosis code in administrative data relative to those of normal weight.9 11 27 People with obesity diagnosis codes in the claims database are more likely to have class 3 obesity than those without obesity diagnosis codes, suggesting that diagnosis codes may not be recorded for people with class 1 obesity.27 Underutilization of obesity diagnosis codes could also result from other factors, such as physicians not considering obesity to be a disease, or the obesity diagnosis not being based on an objective measurement of BMI, thus only capturing cases of severe obesity.25 Taken together with the findings of the current study, these factors emphasize the importance of careful consideration of the diagnosis codes used for inclusion criteria in observational studies using claims data, and how diagnosis codes may impact outcomes such as disease prevalence or incidence. The use of diagnosis codes to help provide a “snapshot” for the better capture of obesity in the identification of target populations is critical to improving public health surveillance and research studies that use these databases, given the established association of obesity with several chronic diseases.25

As noted above, Wu et al8 developed two models applying an SLA: model 1 with recorded baseline BMI values and model 2 with demographics and clinical characteristics data, excluding baseline BMI values, to predict obesity in people of all age groups. Model 1 reported better performance than model 2, with a higher AUC ROC (88% vs 73%), accuracy (ranging from 87.9% to 92.8% vs 73.6% to 80.0%) and specificity (ranging from 91.8% to 94.7% vs 71.6% to 85.9%). However, a notable limitation of Wu et al8 is that the study interpolated BMI from claims data, which under-report BMI. In the current study, we tested the predictive performance of five ML models to differentiate people with and without obesity. Of all the models tested, the XGBoost model demonstrated moderate-to-strong performance in predicting obesity risk. The higher AUC ROC and lower Brier score of the XGBoost model translated to better distinction between people with and without obesity with greater accuracy; however, a lower Brier score (0.17 in the current study) does not necessarily imply better calibration.29 To guard against poor calibration, we followed some of the strategies outlined in Van Calster et al:30 we ensured a sufficient sample size for the number of predictors, used Lasso regression, a penalized regression technique, and employed a simpler model without too many interaction terms.

Furthermore, candidate variables that predict obesity risk were identified in the US administrative claims database using BMI recorded in the EMR. The XGBoost model ranked corresponding features by their relative contribution to the model in terms of “gain”, calculated by averaging each feature’s contribution across the trees in the model. Features were listed in descending order of their predictive power, as variables at the top contribute more to the model than those at the bottom. The highest-ranked predictors of obesity were relatively consistent across the ML algorithms used. Nevertheless, predictor importance did differ slightly between sensitivity analyses, such as different lengths of baseline periods (3 or 6 months) and BMI targets of ≥35 kg/m2 and ≥40 kg/m2. Interestingly, ML models that excluded baseline BMI/weight-related diagnoses were able to predict obesity status based on other risk factors of obesity, such as sleep apnea and use of antidiabetic medications. These findings potentially fill the gap of missing obesity status in administrative claims databases.

Beyond the finding that the number of obesity diagnoses and diagnoses from inpatient settings were the most important predictors of obesity, the importance of OSA and hypertension diagnoses and of antidiabetic or antihypertensive medication use suggests that these factors are strongly associated with obesity and can be used to identify people at high risk for the condition. Interestingly, this study highlighted dermatological conditions such as acne and melanocytic nevi as negatively impacting obesity risk prediction. No epidemiological evidence is available suggesting a relationship between obesity and sebum production, and thus the pathogenesis of acne. In one of the largest risk factor studies on the prevalence of melanocytic nevi among children and adolescents in the Baltic countries, the condition was found to be associated with higher BMI.31 Aside from this, little to no evidence exists assessing the impact of these conditions on obesity risk.

Limitations of this study include, first, the unavailability of some potential predictors in the databases investigated, such as race, region, and physical activity of the participants, as well as possible inaccuracies and misclassification of key variables such as obesity diagnosis codes. Second, the database consists of administrative healthcare data primarily from large employer-sponsored insured individuals, a convenience sample of the US population. Third, while BMI is not the best indicator of obesity, it is a helpful tool for obesity screening and health assessment in clinical practice: it is a standardized metric used worldwide and a simple way to identify individuals who may be at risk for weight-related health problems, prompting further evaluation. Other anthropometric measures of adiposity, such as waist circumference or fat-to-muscle ratio, may provide a more complete picture of a patient’s obesity status. However, these measures are usually not captured in EMRs, and it would be difficult, if not impossible, to exclude specific populations, such as athletes with high BMI but without obesity,32 from the analyses. Therefore, the results may not be generalizable to all populations. Lastly, further research is warranted to externally validate the models using different databases.

Identifying potential candidate variables that predict obesity using RWD may help decision-makers understand the impact of obesity at the population level, helping them to identify appropriate levers to implement policy measures to mitigate risks. Furthermore, ML methods can help improve obesity status prediction and thus assist practitioners and payors to estimate the burden of the condition, investigate the potential unmet need for current treatment, and determine the economic value of new treatments, at both individual and population levels. Moreover, as obesity is a major risk factor for several chronic conditions, predicting its occurrence using RWD will ensure greater accuracy in risk estimates for morbidity and mortality associated with its comorbidities. Future research could focus on validating the ML model in other populations and evaluating predictive scores for obesity-related complications, such as CVD and mortality in administrative claims data. For example, Njei et al recently used an explainable machine learning model for high-risk MASH prediction and compared its performance with well-established biomarkers such as MASLD fibrosis scores.33 The XGBoost model in the study had high sensitivity, specificity, AUC, and accuracy for identifying high-risk MASH. Furthermore, BMI was one of the top five predictors of high-risk MASH. Future studies are needed on how the predictive risk of obesity may change over time with obesity intervention or treatment.

Conclusions

Obesity is under-reported in administrative claims databases. Applying ML approaches to RWD may help predict obesity status and thus estimate the burden of the condition more accurately. The current study demonstrated moderate-to-strong predictive performance of the XGBoost model in identifying people at high risk of obesity using claims data. The computed predictive value helped differentiate people in the claims data based on their obesity status, thus expanding the utility of BMI as a variable in the data source. Improved prediction of obesity status could assist practitioners and payors to estimate the burden of the condition and investigate the potential unmet needs of current treatments at individual and population levels, which may lead to better prevention and treatment strategies for obesity.