Systematic review on the measurement properties of diabetes-specific patient-reported outcome measures (PROMs) for measuring physical functioning in people with type 2 diabetes
  1. Ellen B M Elsman1,
  2. Lidwine B Mokkink1,2,
  3. Marlous Langendoen-Gort3,
  4. Femke Rutters1,2,
  5. Joline Beulens1,
  6. Petra J M Elders2,3,
  7. Caroline B Terwee1,2

  1. Department of Epidemiology and Data Science, Amsterdam UMC Locatie VUmc, Amsterdam, The Netherlands
  2. Amsterdam Public Health Research Institute, Amsterdam, The Netherlands
  3. Department of General Practice and Elderly Care, Amsterdam UMC Locatie VUmc, Amsterdam, The Netherlands

Correspondence to Dr Caroline B Terwee; cb.terwee{at}amsterdamumc.nl

Abstract

We aimed to systematically assess the measurement properties of diabetes-specific patient-reported outcome measures (PROMs) for measuring physical functioning, one of the core outcomes, in adults with type 2 diabetes.

We performed a systematic literature search in EMBASE and MEDLINE for PROMs or subscales measuring physical functioning that were validated to at least some extent. Measurement properties were evaluated according to the COSMIN guideline for systematic reviews of PROMs.

In total 21 articles were included, describing 12 versions of 7 unique diabetes-specific PROMs or subscales measuring physical functioning. In general, there were few high-quality studies on measurement properties of PROMs measuring physical functioning in adults with type 2 diabetes. The Dependence/Daily Life subscale of the Diabetic Foot Ulcer Scale—Short Form (DFS-SF) and the Impact of Weight on Activities of Daily Living Questionnaire (IWADL) were most extensively evaluated. Both had sufficient ratings for aspects of content validity, although with mostly very low-quality evidence. Sufficient ratings for structural validity, internal consistency, and reliability were also found for both instruments, but responsiveness was rated inconsistent for both instruments. The other PROMs or subscales often had insufficient aspects of content validity, or their unidimensionality could not be confirmed.

This systematic review showed that the Dependence/Daily Life subscale of the DFS-SF and the IWADL could be used to measure physical functioning in people with type 2 diabetes in research or clinical practice, while keeping the limitations of these instruments in mind. The measurement properties that have not been evaluated extensively for these PROMs should be evaluated in future studies.

The study protocol was registered in the PROSPERO database, number CRD42021234890.

  • diabetes
  • physical functioning
  • systematic review
  • COSMIN
  • patient-reported outcome measures (PROMs)

Data availability statement

Data sharing not applicable as no datasets generated and/or analyzed for this study. All data relevant to the study are included in the article or uploaded as supplementary information.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Introduction

The number of adults with diabetes has more than tripled over the past 20 years. In 2019, 463 million adults were estimated to have diabetes, and this number is expected to increase to 700 million in 2045. Around 90% of all diabetes is type 2 diabetes. Ten per cent of global health expenditure is spent on diabetes treatment, making the disease a global problem.1 2

Because of the chronic nature of type 2 diabetes and its impact on people’s lives, it is important to measure patient-reported outcomes, such as symptoms and physical, mental, and social functioning, in research and clinical practice. To that end, patient-reported outcome measures (PROMs) can be used, which measure health outcomes that are important to patients.3–7 However, a review of type 2 diabetes clinical trials showed that only 10% (ie, 14 studies) included patient-reported outcomes.8 In total, 68 different outcomes were measured in these studies with 23 PROMs. Most PROMs (87%) were used in only one study.8 This heterogeneity and lack of standardized outcome measurement hamper pooling and comparing outcome data.

To overcome heterogeneity in measured outcomes, a core outcome set (COS) has been developed for type 2 diabetes.9 This COS represents an agreed standardized set of outcomes that should be measured and reported in all trials for type 2 diabetes.9 10 One of the patient-reported outcomes that has been included in the COS for type 2 diabetes is activities of daily living, defined as ‘being able to complete usual everyday tasks and activities, including those related to personal care, household tasks or community-based tasks.’9 However, activities of daily living does not refer to an aspect of health, as opposed to, for example, limitations in the performance of activities of daily living. As such, we believe the term physical functioning, which often includes activities of daily living,11 better covers the construct.

It is important to measure physical functioning with the most suitable PROM, taking specific PROM characteristics into account, such as interpretability of scores (eg, reference values, minimal important change values), feasibility of use, and measurement properties. Measurement properties are the quality aspects of a PROM and include reliability, validity, and responsiveness (see online supplemental appendix 1 for definitions of the measurement properties).12 To make an evidence-based recommendation on the most suitable PROM, all available PROMs suitable for people with type 2 diabetes need to be evaluated on these characteristics in a high-quality systematic review.

Several systematic reviews on PROMs used in diabetes research have been published in the last decade.13–22 Most of these reviews included instruments measuring multidimensional constructs,13 15–19 21 22 but have not reported13 19 nor evaluated15–18 21 22 the results per subscale. They also made little to no effort to provide an overview of the different constructs measured by subscales of PROMs. This is important, because the results of measurement properties can vary among subscales and review users need to know what the best instrument is to measure a certain construct. Moreover, several reviews have not conducted a (complete) risk of bias assessment to assess the quality of individual studies13 17–19 nor have they graded the quality of the total body of evidence for a specific PROM.13 16–19

COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments) is the most comprehensive and widely used methodology to enable the evidence-based selection of the most suitable PROM for a certain construct and population.23 The studies that have used the COSMIN methodology seem not to have applied it correctly; for example, it was unclear how the overall results per PROM were summarized or graded, or this was not done at all.19 22 Because of these limitations, previous reviews provide limited guidance on which PROMs or subscales are most suitable to measure physical functioning. There is thus still a need for a high-quality systematic review of PROMs for people with diabetes.24 Therefore, this study aims to systematically assess the measurement properties of diabetes-specific PROMs for measuring physical functioning in adults with type 2 diabetes to make recommendations on the most suitable PROM to use in research or clinical practice.

Methods

The systematic review was conducted according to the COSMIN guideline for systematic reviews.23

Literature search

This study was part of a larger systematic review, in which (1) all PROMs that have been validated to at least some extent in people with type 2 diabetes have been identified and described,25 (2) the content validity of diabetes-specific PROMs has been investigated,26 and (3) the measurement properties of diabetes-specific PROMs for physical functioning have been assessed (this study). A comprehensive search was performed in the bibliographic databases MEDLINE (through PubMed) and EMBASE (through www.embase.com) from inception up to January 1, 2022 without language restrictions. Non-English papers were included if relevant information could easily be extracted with Google Translate. The search consisted of three elements: (1) type 2 diabetes, using a comprehensive set of search terms from a clinical librarian of the Vrije Universiteit Amsterdam, the Netherlands; (2) PROMs, using a PROM filter27; and (3) measurement properties, using a modified version of the measurement properties filter.28 29 No search terms were used for the construct, as the complete series of reviews intended to find all instruments that have been validated in people with type 2 diabetes. Moreover, for this specific review, we intended to also include physical functioning subscales of PROMs measuring broader constructs, such as quality of life. Adding search terms for physical functioning could have prevented finding these broader instruments as subscales are not always mentioned in the abstract. The complete search strategy can be found in online supplemental appendix 2. Reference lists of included articles were searched by hand to ensure all relevant studies and available translations were considered.25 26

Study selection

Covidence30 was used for screening and selection of abstracts and full-text articles. Relevant articles were selected by first reviewing title and abstract, and if the study seemed relevant or in case of doubt, the full-text article was retrieved and screened. Abstract and full-text screening was done by two reviewers independently. Discrepancies were resolved by discussion and/or consultation of a third reviewer. PROMs that were considered to measure physical functioning based on the Wilson and Cleary model31 in the first review25 were included in the current study when the following criteria were met:

  1. Construct of interest: The PROM or a relevant subscale of a PROM should measure physical functioning. We adopted the definition of the Patient-Reported Outcomes Measurement Information System (PROMIS), a large US initiative that developed generic PROMs for core health outcomes,11 which defined physical functioning as the capability to perform physical activities (ie, what a person can do in the daily environment), rather than performance (ie, what a person actually does) or capacity (ie, what a person can do in a standardized-controlled environment, often measured by performance-based tests). Capability to perform physical activities includes the functioning of one’s upper extremities (dexterity), lower extremities (walking or mobility), and central regions (neck, back), as well as instrumental activities of daily living, such as running errands. In case a subscale of the instrument measures physical functioning, only that subscale was included.

  2. Population: At least 50% of the study population or reported subgroups should consist of adults with type 2 diabetes mellitus.

  3. Instrument type: The instrument should be a questionnaire, to be completed by the person with type 2 diabetes in self-report or interview form.

  4. Measurement properties: At least one of the aims of the paper should be the development of a diabetes-specific PROM or the evaluation of one or more measurement properties of a diabetes-specific PROM. Studies that aim to evaluate the interpretability of a PROM were also included. Studies that use a PROM but do not intend to evaluate its measurement properties or in which the PROM is only used as a comparison instrument in the validation of another instrument were excluded.

Data extraction

PROMs and manuals were retrieved by searching Google or by contacting PROM developers. Characteristics of included PROMs (eg, construct, target population, subscales, number of items, etc.), information on feasibility, and information on interpretability were extracted. For each article, it was determined which measurement properties were evaluated. Data extraction was done by one reviewer and checked by a second reviewer.

Subsequent steps were conducted one measurement property at a time in the following order, as per COSMIN guideline:23 content validity, internal structure (ie, structural validity, internal consistency, and cross-cultural validity/measurement invariance), reliability and measurement error, and the remaining measurement properties (ie, criterion validity, hypotheses testing for construct validity, and responsiveness). Content validity evidence for the physical functioning subscales was taken from the content validity review,26 although standard 1, regarding the clarity of the definition of the construct, was scored again specifically for the included physical functioning subscale. Only Dutch or English papers were included for the evaluation of content validity, because this requires detailed understanding of the methods. All other measurement properties were evaluated in the current study.

Evaluation of the quality of a PROM

Per measurement property, first, data on the study population and the results of studies were extracted.

Second, the methodological quality of each study was assessed using the COSMIN Risk of Bias checklist.32 Each standard was rated on a four-point rating scale as ‘very good’, ‘adequate’, ‘doubtful’, or ‘inadequate’. A total rating per measurement property per study was obtained by taking the lowest rating among the standards (ie, worst-score counts).33

Third, criteria for good measurement properties were applied to each result, resulting in a sufficient (+), insufficient (−), or indeterminate (?) rating (online supplemental appendix 3).23 A priori hypotheses were formulated to evaluate the results on construct validity and responsiveness. Figure 1 shows the predefined hypotheses for comparisons with other instruments. Hypotheses for comparisons between relevant subgroups or before and after intervention were: effect size (eg, Cohen’s d, standardized response mean) ≥0.20 for differences between relevant subgroups, score differences between relevant subgroups ≥10% (eg, people with type 2 diabetes should score 10% worse than controls), or correlation ≥0.30 between relevant subgroups and score. Relevant subgroups were selected in consultation with an expert on type 2 diabetes.

Fourth, evidence from multiple individual studies on the same PROM or subscale was summarized per measurement property and the summarized result was rated against the quality criteria for good measurement properties.23 The rating of the individual studies (+, −, or ?) was also applied to the summarized result when the results of individual studies were consistent. When individual studies showed inconsistent results, explanations for inconsistency in terms of differences in populations or study quality were explored. When inconsistency could be explained, results were summarized and rated per subset of studies. When inconsistency could not be explained, the overall rating was either inconsistent (±), without summarizing the results, or based on the majority of consistent results (+, −, or ?). If studies with a + or − rating were available, studies with a ? were ignored and not included when summarizing the results.

Fifth, the quality of the evidence was graded using a modified Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach resulting in ‘high’, ‘moderate’, ‘low’, or ‘very low’ quality.23 Quality of the evidence was not graded for studies for which the overall rating was indeterminate (?). For all other situations, starting with high-quality evidence, quality of evidence was downgraded (online supplemental appendix 4). For internal consistency, the quality of evidence started at the level of structural validity.23
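
The rating steps described above can be illustrated with a minimal Python sketch. It is not part of the COSMIN methodology or of this review's tooling; all function and variable names are hypothetical. It assumes that the direction of each difference has already been checked, that ASCII "-" stands for the insufficient rating, and that unexplained mixed results are simply rated inconsistent (the majority-based alternative is not modelled).

```python
# Illustrative sketch only; names are hypothetical, not COSMIN software.

# Risk of Bias ratings ordered from worst to best.
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]

# Thresholds for subgroup and before/after hypotheses as stated in the text.
THRESHOLDS = {"effect_size": 0.20, "pct_difference": 10.0, "correlation": 0.30}


def worst_score_counts(standard_ratings):
    """Return the lowest rating among the standards of one study."""
    if not standard_ratings:
        raise ValueError("at least one standard rating is required")
    return min(standard_ratings, key=RATING_ORDER.index)


def subgroup_hypothesis_met(kind, value):
    """Check one result against its threshold (direction assumed correct)."""
    return abs(value) >= THRESHOLDS[kind]


def summarize_ratings(ratings):
    """Summarize per-study ratings ('+', '-', '?') into one overall rating.

    Indeterminate results are ignored when any '+' or '-' is available.
    Mixed '+' and '-' results are rated inconsistent here.
    """
    determinate = [r for r in ratings if r in ("+", "-")]
    if not determinate:
        return "?"
    if len(set(determinate)) == 1:
        return determinate[0]
    return "±"


print(worst_score_counts(["very good", "adequate", "doubtful"]))  # doubtful
print(summarize_ratings(["+", "?", "+"]))  # +
```

In this sketch a single ‘doubtful’ standard determines the study's total rating, which mirrors the worst-score-counts principle.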

Figure 1

Decision tree for hypotheses regarding the comparisons of instruments.

Each step of the quality evaluation was done by two reviewers independently. Discrepancies were resolved by discussion and/or consultation of a third reviewer.

Formulation of recommendations

To formulate recommendations, we considered the results on the measurement properties in order of importance. According to COSMIN, PROMs that have any level of sufficient content validity, which is the most important measurement property, and at least low-quality evidence for sufficient internal consistency (and as such also at least low-quality evidence for sufficient structural validity) can be recommended for use, except when there is high-quality evidence for any insufficient measurement property.23 We subsequently took results on reliability into account when formulating recommendations, and considered construct validity and responsiveness as least important. Importantly, we also took into account the limitations of the PROMs arising from the recommendations.
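
As an illustration of the recommendation rule described above, the following Python sketch encodes the three conditions. The data structure and names are hypothetical, and the example ratings are invented for demonstration; they are not results from this review.

```python
# Illustrative sketch only; data structure and names are hypothetical.

QUALITY_LEVELS = ["very low", "low", "moderate", "high"]  # worst to best


def at_least_low(quality):
    """True when the graded quality of evidence is low, moderate, or high."""
    return quality in QUALITY_LEVELS and (
        QUALITY_LEVELS.index(quality) >= QUALITY_LEVELS.index("low")
    )


def can_be_recommended(summary):
    """Apply the rule to a dict of property -> (rating, quality).

    Conditions, following the text:
    1. content validity rated sufficient at any quality level;
    2. internal consistency rated sufficient with at least low-quality evidence;
    3. no high-quality evidence for any insufficient measurement property.
    """
    cv_rating, _ = summary.get("content validity", ("?", None))
    ic_rating, ic_quality = summary.get("internal consistency", ("?", None))
    high_quality_insufficient = any(
        rating == "-" and quality == "high" for rating, quality in summary.values()
    )
    return (
        cv_rating == "+"
        and ic_rating == "+"
        and at_least_low(ic_quality)
        and not high_quality_insufficient
    )


# Invented example ratings, for demonstration only.
example = {
    "content validity": ("+", "very low"),
    "internal consistency": ("+", "low"),
    "responsiveness": ("±", "high"),
}
print(can_be_recommended(example))  # True
```

Note that an inconsistent (±) rating does not block a recommendation under this rule; only high-quality evidence for an insufficient property does.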

Results

Study selection

The database search and reference check resulted in 12 771 unique abstracts, of which 341 were assessed in full text for eligibility. Ultimately, 21 articles were included in this review, describing 12 versions of 7 unique PROMs or subscales measuring physical functioning. A flowchart can be found in online supplemental appendix 5. For many PROMs, it was unclear what exactly the PROM aimed to measure, and this was even less clear for the PROM subscales. In fact, for 7 of the 12 physical functioning subscales, no description was provided at all (table 1). Most PROMs have 5–7 items, although different versions of the included subscale of the Diabetes-39 contain 5–15 items (table 1). Characteristics of study populations involved in PROM design and content validity studies can be found in online supplemental appendix 6, whereas characteristics of study populations for the assessment of other measurement properties can be found in online supplemental appendix 7. Information on feasibility and information on interpretability can be found in online supplemental appendix 8 and online supplemental appendix 9, respectively.

Table 1

Characteristics of included diabetes-specific PROMs measuring physical functioning in people with type 2 diabetes (n=12)

Measurement properties

Table 2 summarizes the results of the included studies on measurement properties per PROM. Per study, the methodological quality and the result of the study are displayed. A more extensive description of the results can be found in online supplemental appendix 10. Table 3 provides an overview of the summary of findings and the quality of the evidence. More extensive results can be found in online supplemental appendix 11. Below, per measurement property the most important results are discussed, in order of importance.23

Table 2

Results and quality of studies on measurement properties of diabetes-specific PROMs measuring physical functioning in people with type 2 diabetes

Table 3

Summary of findings per diabetes-specific PROM measuring physical functioning in people with type 2 diabetes, with a rating of the summarized result (+, −, ±, ?) and a grading of the quality of the evidence (H, M, L, V)

Content validity

The PROM development was rated as doubtful for the Diabetic Foot Ulcer Scale—Short Form (DFS-SF),34 patient-reported outcomes instrument for Thai patients with type 2 diabetes (PRO-DM-Thai),35 Impact of Weight on Activities of Daily Living Questionnaire (IWADL),36 and Quality of Life Instrument for Indian Diabetes Patients (QOLID)37 (online supplemental appendix 11). For the other PROMs, the development was rated as inadequate, because the construct of the included physical functioning subscale was not clearly described or the PROM was not pilot tested. Four studies examined the comprehensibility of a PROM (DFS-SF38 and Diabetes-3939–41) after translation (online supplemental appendix 12), which were all doubtful or inadequate. As for most PROMs no or only inadequate content validity studies were available and the PROM development study was also inadequate, the ratings of the relevance, comprehensiveness, and comprehensibility of the PROM were mainly based on the subjective ratings by the reviewers. Note that one reviewer had expertise in PROM development and evaluation (CBT), and one reviewer was a general practitioner and full professor in diabetes care, and as such had expertise in treating people with diabetes (PJME). Considering results of the PROM development studies, content validity studies if both were at least doubtful, and the reviewer ratings, the content validity of the DFS,42 DFS-SF,34 38 and IWADL36 for measuring physical functioning was considered sufficient, but often with very low-quality evidence. Details on content validity can be found in our preceding review.26 Many of the other PROMs had items that were not related to physical functioning. For example, at least 8 of the 15 items in the Energy and Mobility subscale of the Diabetes-39 also ask for other health problems, such as loss of vision (n=1), other illnesses (n=3), and energy (n=4).43 Moreover, the items in the Physical Function subscale of the Diabetes Quality of Life Clinical Trial Questionnaire (DQLCTQ) ask for the duration of limitations, rather than the extent of limitations.44

Internal structure

Aspects of internal structure were evaluated for all PROMs or subscales. If studies had inadequate quality for structural validity or cross-cultural validity/measurement invariance, this was often due to small sample sizes. Sufficient structural validity and internal consistency were found for the DFS-SF,34 38 45 PRO-DM-Thai,35 IWADL,36 46 and Chinese Cardiff Wound Impact Schedule (C-CWIS).47 Various factor structures for the subscales of the Diabetes-39 have been found,39 43 48–50 resulting in different versions with 7, 10, 14, and 15 items of the Energy and Mobility subscale. Only the 14- and 15-item versions were found to be unidimensional with sufficient internal consistency. Internal consistency was considered indeterminate for most other PROMs, despite Cronbach’s alpha >0.7, because there was not at least low-quality evidence that the PROMs were unidimensional, which is a prerequisite for correct interpretation of internal consistency. Cross-cultural validity was not evaluated for any of the PROMs, whereas measurement invariance was only assessed for the Dependence/Daily Life subscale of the DFS-SF for the variables sex, age, place of residence, education, type of diabetes, and time since diagnosis.45 Because only sex impacted one item (depend on others to get out of the house), we rated measurement invariance as sufficient (online supplemental appendix 10).
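
To make the dependence of internal consistency on unidimensionality concrete, the sketch below computes Cronbach's alpha with the standard formula and then rates it. It is an illustration under assumptions: the function names are hypothetical, the 0.70 cut-off is passed as a parameter, and the flag for evidence of unidimensionality is supplied by the caller rather than derived from a factor analysis.

```python
# Illustrative sketch only; names and the example data are hypothetical.
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for rows of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    if len(item_scores) < 2:
        raise ValueError("at least two respondents are required")
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("at least two items are required")
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def rate_internal_consistency(alpha, unidimensional_evidence, threshold=0.70):
    """Return '+', '-', or '?' for one internal consistency result.

    Without at least low-quality evidence of unidimensionality the result
    is indeterminate, whatever the value of alpha.
    """
    if not unidimensional_evidence:
        return "?"
    return "+" if alpha >= threshold else "-"


# Invented scores for three respondents on two items.
scores = [[1, 2], [2, 3], [3, 4]]
alpha = cronbach_alpha(scores)
print(rate_internal_consistency(alpha, unidimensional_evidence=False))  # ?
```

The last line shows the situation described above: a high alpha still yields an indeterminate rating when unidimensionality has not been supported.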

Reliability and measurement error

Reliability was evaluated for six PROMs or subscales. All studies with inadequate quality had a time interval that was considered to be too long (ie, more than 4 weeks). Sufficient reliability was found for the Dependence subscale of the DFS (but not for the Daily Life subscale),42 and for the DFS-SF,34 IWADL,46 DQLCTQ,44 and 15-item Energy and Mobility subscale of the Diabetes-39.40 51 Although reliability was also evaluated for the Physical Impairment subscale of the Diabetes-39, the result could not be rated because it was unclear how the reliability parameter was calculated.50 Measurement error was evaluated only for the IWADL.46 The measurement error for using the IWADL in individual persons was rated as insufficient because the smallest detectable change was larger than the minimal important change.
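
The comparison between smallest detectable change (SDC) and minimal important change (MIC) can be sketched as follows. The SDC formula for individual persons, 1.96 × √2 × SEM, is the commonly used one; the function names and the numeric values are hypothetical and are not IWADL results. Treating an SDC equal to the MIC as insufficient is an assumption of this sketch.

```python
# Illustrative sketch only; names and example values are hypothetical.
import math


def smallest_detectable_change(sem, z=1.96):
    """SDC for an individual: z * sqrt(2) * standard error of measurement."""
    if sem < 0:
        raise ValueError("SEM must be non-negative")
    return z * math.sqrt(2) * sem


def rate_measurement_error(sdc, mic):
    """Return '+' when SDC < MIC, '-' otherwise, '?' when either is unknown."""
    if sdc is None or mic is None:
        return "?"
    return "+" if sdc < mic else "-"


# Invented values: an SEM of 2.0 points and an MIC of 4.0 points.
sdc = smallest_detectable_change(2.0)
print(round(sdc, 2))                     # 5.54
print(rate_measurement_error(sdc, 4.0))  # -
```

With these invented values, a change smaller than about 5.5 points in one person cannot be distinguished from measurement error, even though a change of 4 points would already matter to patients.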

Remaining measurement properties

Figure 2 presents an overview of the evidence on hypotheses testing for construct validity and responsiveness (latter marked with *). Bars reaching within the blue area indicate that the results were in accordance with our predefined hypotheses. Panel A shows the correlations with other instruments based on our decision tree, panel B the percentage score differences between subgroups or before and after intervention, and panel C the effect sizes between subgroups or before and after intervention. Studies with an indeterminate rating are not included in figure 2, because hypotheses were not defined or data were not provided to test the hypotheses. All PROMs have been evaluated with respect to construct validity, except the Energy and Mobility subscale of the Diabetes-39 SF.48 Most studies were of at least adequate quality. Three studies were of inadequate quality, because they did not apply an appropriate statistical method to compare subgroups.35 45 47 Construct validity of the Daily Activities subscale of the DFS42 was considered sufficient based on correlations between instruments, because ≥75% of the results were in accordance with our predefined hypotheses. Construct validity of the Dependence/Daily Life subscale of the DFS-SF, Physical Impairment subscale of the Diabetes-39,50 and the Physical Symptoms and Everyday living subscale of the C-CWIS were considered sufficient based on comparisons between subgroups, as ≥75% of the results were in accordance with our predefined hypotheses. Responsiveness (marked with *) was evaluated for five PROMs. Most studies were of very good quality. Responsiveness was not considered sufficient for any of the PROMs.
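
The ≥75% criterion used above amounts to a simple proportion check, sketched here in Python with hypothetical function names. The example uses the counts given for the DFS in the Figure 2 legend (four results in accordance with the hypotheses and one not).

```python
# Illustrative sketch only; function names are hypothetical.


def proportion_confirmed(results):
    """Share of results that are in accordance with the predefined hypotheses."""
    if not results:
        raise ValueError("at least one result is required")
    return sum(1 for confirmed in results if confirmed) / len(results)


def meets_75_percent_rule(results, threshold=0.75):
    """True when at least 75% of the results confirm the hypotheses."""
    return proportion_confirmed(results) >= threshold


# DFS, panel A: four results in accordance with the hypotheses, one not.
dfs_panel_a = [True, True, True, True, False]
print(proportion_confirmed(dfs_panel_a))   # 0.8
print(meets_75_percent_rule(dfs_panel_a))  # True
```

Four of five confirmed hypotheses give 80%, which is why this subscale passes the criterion on correlations between instruments.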

Figure 2

Results on hypotheses testing for construct validity and responsiveness of diabetes-specific PROMs measuring physical functioning in people with type 2 diabetes: (A) correlations with other instruments; (B) Percentage score differences between subgroups or before and after intervention; (C) Effect sizes between subgroups or before and after intervention. *Results of responsiveness; †Correlations between subgroups and instrument score; ‡One of the known-groups tested in the hypotheses was small (n <20); Number in parentheses indicates the number of items in the subscale for the Diabetes-39, for example (15) refers to the 15-item Energy and Mobility subscale; Green: very good study; Yellow: adequate study; Orange: doubtful study; Red: inadequate study; Bars reaching within the blue area indicate that the results are in accordance with our predefined hypotheses, for example, for the DFS in panel A, four results are in accordance with our predefined hypotheses and one is not (one result >0.6, one result 0.4–0.7, two results 0.3–0.6). C-CWIS, Chinese Cardiff Wound Impact Schedule; DFS, Diabetic Foot Ulcer Scale; DFS-SF, Diabetic Foot Ulcer Scale—Short Form; DQLCTQ, Diabetes Quality of Life Clinical Trial Questionnaire; IWADL, Impact of Weight on Activities of Daily Living Questionnaire; PRO-DM-Thai, patient-reported outcomes instrument for Thai patients with type 2 diabetes; PROMs, patient-reported outcome measures; QOLID, Quality of Life Instrument for Indian Diabetes Patients.

Recommendations

The DFS-SF and IWADL had sufficient relevance, comprehensiveness, and comprehensibility, and at least low-quality evidence for sufficient internal consistency, and can thus be considered for use in research and clinical practice. Both also had sufficient reliability, but measurement error of the IWADL was insufficient. The DFS-SF and IWADL had inconsistent responsiveness, with high-quality evidence for the subscale of the DFS-SF. This limitation should be taken into account when considering using the DFS-SF and IWADL.

Discussion

This review systematically evaluated the measurement properties of diabetes-specific PROMs for measuring physical functioning, one of the core outcomes,9 in adults with type 2 diabetes. To ensure a high-quality systematic review with trustworthy results, we adhered to the COSMIN guideline for systematic reviews.23 In our review, 12 versions of seven unique PROMs were identified. The Dependence/Daily Life subscale of the DFS-SF34 38 45 and the IWADL36 46 seem to be the most extensively evaluated and had sufficient content validity, structural validity, internal consistency, and reliability.

Content validity is the most important measurement property,23 and the Dependence/Daily Life subscale of the DFS-SF34 38 45 and the IWADL36 46 have sufficient relevance, comprehensiveness, and comprehensibility, although mostly based on low or very low quality evidence. The content of the IWADL is more focused on limitations with the performance of daily activities, whereas the content of the Dependence/Daily Life subscale of the DFS-SF asks for dependence on others for the performance of daily activities. Moreover, the DFS-SF was specifically developed for people with type 2 diabetes and foot ulcers. These limitations in content and target population should be taken into account when using the DFS-SF and IWADL. After content validity, structural validity is the second most important measurement property,23 and both subscales were considered unidimensional. Sufficient internal consistency and reliability were also found for both instruments. Measurement error was insufficient for the IWADL, but the quality of the evidence was low, so further research on this measurement property is needed. No information about measurement error was available for the Dependence/Daily Life subscale of the DFS-SF. Construct validity in terms of comparisons between subgroups was sufficient for the Dependence/Daily Life subscale of the DFS-SF, whereas this was inconsistent for the IWADL and for correlations between instruments. Responsiveness was also inconsistent for both instruments.

In general, we show in the current review that high-quality studies on measurement properties of PROMs measuring physical functioning in adults with type 2 diabetes are scarce. Five of the studies on PROM development were of inadequate methodological quality, whereas the other four were of doubtful methodological quality. For structural validity, 7 out of 15 studies were of inadequate or doubtful quality and for measurement invariance the one study found was also of inadequate quality. For reliability, six out of eight studies were of inadequate or doubtful quality and for measurement error the one study found was of doubtful quality. The inadequate or doubtful methodological quality of the individual studies resulted in lower quality evidence for many measurement properties. For internal consistency, hypotheses testing for construct validity and responsiveness, the majority of the studies had adequate or very good methodological quality (19 out of 22 for internal consistency, 27 out of 33 for hypotheses testing for construct validity, and 5 out of 7 for responsiveness), leading to higher quality evidence.

Most PROMs or subscales had inconsistent construct validity, often with high-quality evidence, so future studies will probably not change these results. Considering the results on comparisons with other instruments, correlations of the DFS-SF Dependence/Daily Life subscale with more related constructs were higher than those with less related constructs. Most correlations only just failed to meet our hypotheses, which also shows that formulating hypotheses is challenging. On the other hand, correlations of the Diabetes-39 15-item Energy and Mobility subscale were all high, regardless of the comparison instrument’s construct, indicating that the 15-item Energy and Mobility subscale does not measure physical functioning alone. This is in line with the content validity results, with insufficient relevance and comprehensiveness of the Diabetes-39 15-item Energy and Mobility subscale, whereas these were sufficient for the DFS-SF Dependence/Daily Life subscale.

Several PROMs have been translated into various languages, but none of these PROMs have been assessed for cross-cultural validity. This is remarkable, because a large number of PROMs that were originally developed in English have been translated for use in countries that are likely to be culturally different from western countries, for example, Asian or Arabic countries. Evaluating cross-cultural validity is important, because it is not self-evident that the items in translated questionnaires perform similarly to the items of the original instrument.52

The measurement properties that have not been evaluated for various PROMs could be evaluated in future studies. However, it is not very useful to study these measurement properties for a PROM with insufficient content validity. To measure physical functioning in a valid way, a PROM needs to contain items referring to the functioning of one’s upper extremities, lower extremities or central regions, or to activities of daily living that are relevant for people with type 2 diabetes. It should not contain items unrelated to physical functioning, nor should it lack key aspects of physical functioning. Only the Dependence/Daily Life subscale of the DFS-SF and the IWADL fulfill these requirements and are worthwhile subjects for future validation studies.

As an alternative, one could consider using or validating a generic PROM for measuring physical functioning in people with type 2 diabetes. Examples are the Physical Functioning subscale of the 36-item Short Form Health Survey (SF-36), which has been used quite often in diabetes studies (eg, Refs 53–57), or the more modern, generic PROMIS Physical Function measures.58 The necessity of using a disease-specific PROM for such a generic outcome as physical functioning can be questioned. It is likely that relevant physical functioning items (eg, walking, stair climbing, performing household activities) are the same for people with diabetes as for people with other conditions. Furthermore, none of the included PROMs had diabetes attribution in the question(naire) (eg, due to your diabetes…/because of your diabetes…). Diabetes attribution may not always be desirable, because it might lead to differences in interpretation of the items in a PROM. For example, some people will relate the items specifically to their diabetes, while others will ignore the attribution and respond to the items considering their overall health. Also, some people might not know whether their complaints are caused by their diabetes, and as such may be unsure how to respond. This may affect the reliability and validity of the PROM.

At least eight systematic reviews on PROMs used in diabetes research and care that could potentially have included the same instruments and articles have been published in the last decade.13 15–19 21 22 However, these reviews included at best only half of the PROMs that were included in this study. Most of these reviews included the Diabetes-39, but often not all language versions were considered. For example, the recent review by Wee et al22 included only the Arabic39 and Portuguese41 versions of the Diabetes-39, but not the original English version,43 nor the Vietnamese,40 Spanish,51 Chinese,48 59 Thai,49 and German50 versions. The IWADL36 46 has not been evaluated in any of the previous reviews. Moreover, most reviews provided a judgment on the quality of only some of the measurement properties. For example, Bottino et al16 only evaluated internal consistency, and content validity and structural validity were almost never evaluated in any of these reviews. Thus, the current review is the first to give a comprehensive overview of the measurement properties of PROMs or subscales that measure physical functioning in people with type 2 diabetes.

A limitation of the current study is that the assessment of content validity was difficult because we included physical functioning subscales from PROMs that often measured broader constructs, such as health-related quality of life. Although the assessment of measurement properties should be conducted for each subscale separately,23 the information required for the assessment of content validity is often only reported for the PROM as a whole. For other measurement properties, too, information was sometimes reported poorly or unclearly. Thus, as a team, we had to make decisions on how to value the information. This is inherent to using the COSMIN methodology, but other researchers might come to different conclusions. By reporting everything in tables and appendices, we tried to be as open and consistent as possible.

In conclusion, we identified 12 versions of seven unique diabetes-specific PROMs or subscales for measuring physical functioning, one of the core outcomes in adults with type 2 diabetes. The Dependence/Daily Life subscale of the DFS-SF and the IWADL were most extensively evaluated and had sufficient content validity, structural validity, internal consistency, and reliability, although the quality of the evidence for many measurement properties was very low or low. These PROMs could be used to measure physical functioning in people with type 2 diabetes in research or clinical practice, but the limitations of these instruments (eg, the specific content and target population, inconsistent responsiveness) should be kept in mind. As physical functioning may not necessarily need to be measured with a diabetes-specific PROM, future studies should evaluate the validity of generic PROMs, such as PROMIS, in people with type 2 diabetes.

Supplemental material

Data availability statement

Data sharing is not applicable as no datasets were generated and/or analyzed for this study. All data relevant to the study are included in the article or uploaded as supplementary information.

Ethics statements

Patient consent for publication

Ethics approval

Not applicable.

Acknowledgments

We thank Wia Barkema, Lenka Groeneveld, Ilana Halperin, Geetha Mukerji, Amber van der Heijden, and Maartje de Wit for their help with screening titles and abstracts.

Footnotes

  • Contributors CBT, LBM, EBME, and FR conceived the study; CBT, LBM, and EBME designed the study; CBT carried out the literature search; CBT, PJME, JB, FR, and EBME screened titles and abstracts; CBT, ML-G, EBME, and LBM extracted the data; EBME and LBM assessed the study quality, risk of bias, and grading of the evidence; EBME and LBM interpreted the data; EBME wrote the manuscript; JB, PJME, ML-G, LBM, CBT, and FR revised the manuscript; all authors approved the final version of the manuscript.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.