
Reliability of recommended non-invasive chairside screening tests for diabetes-related peripheral neuropathy: a systematic review with meta-analyses

Abstract

The objective is to determine, by systematic review, the reliability of testing methods for the diagnosis of diabetes-related peripheral neuropathy (DPN) as recommended by the most recent guidelines from the International Diabetes Federation, International Working Group on the Diabetic Foot and American Diabetes Association. Electronic searches of the Cochrane Library, EBSCO Megafile Ultimate and EMBASE were performed to May 2021. Articles were included if they reported on the reliability of recommended chairside tests in diabetes cohorts. Quality appraisal was performed using the Quality Appraisal of Reliability Studies checklist and, where possible, meta-analyses were conducted, with reliability reported as estimated Cohen’s kappa (95% CI). Seventeen studies were eligible for inclusion. Pooled analysis found acceptable inter-rater reliability of vibration perception threshold (VPT) (κ=0.61 (0.50 to 0.73)) and ankle reflex testing (κ=0.60 (0.55 to 0.64)), but weak inter-rater reliability for pinprick (κ=0.45 (0.22 to 0.69)) and the 128 Hz tuning fork (κ=0.42 (0.15 to 0.70)), though intra-rater reliability of the 128 Hz tuning fork was moderate (κ=0.54 (0.37 to 0.73)). Inter-rater reliability of the four-site monofilament was acceptable (κ=0.61 (0.45 to 0.77)). These results support the clinical use of VPT, ankle reflexes and the four-site monofilament for screening and ongoing monitoring of DPN as recommended by the latest guidelines. The reliability of temperature perception, pinprick, proprioception, the three-site monofilament and the Ipswich touch test when performed in people with diabetes remains unclear.

Introduction

Globally, diabetes is reported to affect almost 500 million people.1 Diabetes-related peripheral neuropathy (DPN) is a common complication of diabetes that results in sensory loss in the extremities, which can lead to impaired balance and gait,2 as well as the formation of pressure ulcers and subsequent infection.3 DPN is implicated in 50%–75% of all non-traumatic amputations.4 DPN is estimated to be present in up to 50% of those with a diabetes duration of over 10 years,5 6 is present in 10%–30% of people at time of diabetes diagnosis and has also been noted in pre-diabetes.7

Non-invasive chairside tests are recommended for diagnosis of DPN and used for ongoing monitoring to map disease progression. Early diagnosis is vital to implement strategies to reduce the risk of limb-threatening sequelae. A multidisciplinary approach, in combination with patient education, compliance and routine foot care, has demonstrated prophylactic capacity for reducing DPN progression and severity, as well as ulcer risk,8 9 and intensive glucose control has been shown to reduce the incidence of DPN.10 11

Various international guidelines exist providing direction as to which chairside tests should be performed for routine screening and monitoring of DPN. These guidelines differ in recommendations of test type and test protocol. The International Diabetes Federation (IDF),12 International Working Group on the Diabetic Foot (IWGDF)13 and American Diabetes Association (ADA)14 represent three major international organizations that develop some of the most widely used guidelines for diabetes-related foot assessment, diagnosis and management. Collectively, these groups recommend variations of the following chairside tests: 10 g monofilament, 128 Hz tuning fork, light touch/Ipswich touch test, temperature perception, vibration perception threshold (VPT), pinprick, proprioception and ankle reflexes.12–14

Due to the ongoing nature of the testing required to facilitate early diagnosis and monitor progression of DPN, it is imperative that the recommended screening tests demonstrate acceptable reliability.15 However, remarkably, despite the widespread use of the recommended chairside tests, there has been no comprehensive investigation of their reliability. Therefore, the aim of this research was to evaluate, by systematic review of the available evidence, the reliability of screening tests for DPN in the lower limb of adults with diabetes, as per protocols recommended by the most recent guidelines from the IDF, IWGDF and ADA. We hypothesize that all recommended tests will demonstrate acceptable reliability.

Research design and methods

Search strategy

This review was registered in the International Prospective Register of Systematic Reviews (PROSPERO ID: CRD42020186383), and reporting is consistent with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses statement. To identify studies that have investigated the reliability of non-invasive neurological tests in people with diabetes, an electronic search was performed independently by two authors (AM and SL) to May 2021 using the biomedical databases Cochrane Library, EBSCO Megafile Ultimate and EMBASE. Search terms, used in various combinations and with database-relevant truncations, were: reliability, consistency, accuracy, reproducibility, repeatability, agreement, precision, monofilament, neuropen, neurotip, tuning fork, vibration, VPT, neurothesiometer, biothesiometer, maxivibrometer, tip-therm, Ipswich touch test, IpTT, reflex, perception, sensation, nociception, neuropathy and DPN. Abstracts were managed using Endnote X9 software.

Two authors (AM and SL) screened retrieved articles at title and abstract level and final determination of article suitability for inclusion following full-text review was performed in consultation with a third reviewer (VC). Lastly, reference lists of included articles were manually screened for any additional relevant research.

Inclusion and exclusion criteria

Inclusion criteria were: original peer-reviewed research articles or conference abstracts reporting the reliability (inter-rater or intra-rater) of any of the non-invasive screening tests for DPN recommended by the IDF, IWGDF or ADA in a population with diabetes. Articles were also eligible if a subset of participants had diabetes, provided these data were reported separately or available from the authors. Articles investigating the reliability of questionnaires, combination tests such as the Michigan Neuropathy Screening Instrument, tests performed in participants’ upper limbs, and non-English language texts were excluded. In addition, articles were excluded where the time to retest (eg, >1 year) made it likely that results would be affected by disease progression.

Statistical analysis

Data were extracted (AM) and cross-checked (SL) using a customized data extraction form that included study and participant characteristics, statistical analyses and reliability results. Where kappa values or percentage agreement were provided, interpretation of reliability outcomes was in accordance with McHugh.16 This is reported as none (0–0.20 or 0%–4%), minimal (0.21–0.39 or 4%–15%), weak (0.40–0.59 or 15%–35%), moderate (0.60–0.79 or 35%–63%), strong (0.80–0.90 or 64%–81%) and almost perfect (>0.90 or 82%–100%).16 In addition to these interpretations, any kappa values >0.60 were considered acceptable, as per the conservative thresholds suggested by McHugh for health research and practice.16 The coefficient of variation (COV), the ratio of the SD to the mean, was considered to indicate higher reliability the lower the percentage score.17 Intraclass correlation coefficients (ICCs) were interpreted in accordance with Portney and Watkins, that is, good (>0.75), moderate (0.5–0.75) and poor (<0.5) reliability.18 Spearman’s rho was interpreted in accordance with Prion and Haerling as negligible (0.00–0.20), weak (0.21–0.40), moderate (0.41–0.60), strong (0.61–0.80) and very strong (0.81–1.00).19
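
For illustration only, the minimal R sketch below maps kappa values to the McHugh categories and acceptability threshold described above; the function name and structure are ours and do not appear in the included studies or guidelines.

    # Illustrative mapping of kappa values to the McHugh (2012) categories used in this review
    interpret_kappa <- function(k) {
      category <- cut(k,
                      breaks = c(-Inf, 0.20, 0.39, 0.59, 0.79, 0.90, Inf),
                      labels = c("none", "minimal", "weak", "moderate",
                                 "strong", "almost perfect"))
      data.frame(kappa = k, category = category, acceptable = k > 0.60)
    }

    interpret_kappa(c(0.42, 0.61, 0.86))
    # 0.42 -> weak (not acceptable); 0.61 -> moderate (acceptable); 0.86 -> strong (acceptable)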

Adequate data were available to perform meta-analyses on four different neuropathy tests: ankle reflex, pinprick, 128 Hz tuning fork and VPT. The ankle reflex and pinprick tests were assessed only for their inter-rater reliability. The 128 Hz tuning fork and VPT tests were assessed for both their inter-rater and intra-rater reliability.

An alpha level of 0.05 was specified for all tests and confidence intervals. The data were analysed in R V.4.1.0. Data for each study were presented as Cohen’s kappa, with the corresponding variances (of the sampling distribution) calculated from additional study results, in order of preference, as below (a worked R sketch follows this list):

  • If the SE was reported, the variance was calculated by squaring it.

  • If the percentage agreement and number of observations were reported, the variance was calculated by the below formula,20 where ‘p0’ is the percentage agreement, ‘k’ is Cohen’s kappa and ‘n’ is the number of observations:

    Var(k) = [p0 × (1 − k)²] / [n × (1 − p0)]

  • If the CI was reported, the variance was calculated by the below formula, where ‘ku’ and ‘kl’ are the upper and lower 95% confidence limits of kappa, respectively:

    Var(k) = [(ku − kl) / (2 × 1.96)]²
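
As a worked sketch of this order of preference (argument names are illustrative, and ‘p0’ is assumed to be the observed agreement expressed as a proportion):

    # Sketch of the order of preference for deriving each study's kappa variance
    kappa_variance <- function(se = NULL, p0 = NULL, k = NULL, n = NULL,
                               k_upper = NULL, k_lower = NULL) {
      if (!is.null(se)) {
        se^2                                      # 1. square the reported SE
      } else if (!is.null(p0) && !is.null(k) && !is.null(n)) {
        (p0 * (1 - k)^2) / (n * (1 - p0))         # 2. from agreement, kappa and n
      } else if (!is.null(k_upper) && !is.null(k_lower)) {
        ((k_upper - k_lower) / (2 * 1.96))^2      # 3. back-calculated from the 95% CI
      } else {
        NA_real_
      }
    }

    kappa_variance(p0 = 0.85, k = 0.61, n = 50)       # from agreement, kappa and n
    kappa_variance(k_upper = 0.73, k_lower = 0.50)    # from the 95% CI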

Results of the meta-analyses are presented as estimated Cohen’s kappa (95% CI) and total heterogeneity (I²) with accompanying forest plots. The trim-and-fill method was used to detect and adjust for publication bias.

Meta-analyses were performed using the R package ‘metafor’.21 For the inter-rater reliability analyses, a random-effects model (DerSimonian-Laird method with Knapp-Hartung adjustment) was specified when at least three papers were available; when only two papers were available, a fixed-effect model was specified. For the intra-rater reliability analyses, the same approach was used, with the choice of model based on whether at least three raters or only two raters were available.
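
The sketch below illustrates how such models can be specified with ‘metafor’, using hypothetical study-level kappas and variances; it is not the authors’ analysis script.

    library(metafor)

    yi <- c(0.63, 0.55, 0.58, 0.62)     # hypothetical study-level Cohen's kappas
    vi <- c(0.004, 0.010, 0.006, 0.008) # hypothetical variances

    # Three or more studies: random-effects model, DerSimonian-Laird estimator
    # with the Knapp-Hartung adjustment
    re_fit <- rma(yi = yi, vi = vi, method = "DL", test = "knha")
    summary(re_fit)   # pooled kappa, 95% CI and I^2
    forest(re_fit)    # forest plot, as in figures 2 and 3

    # Only two studies: fixed-effect model
    fe_fit <- rma(yi = yi[1:2], vi = vi[1:2], method = "FE")

    # Trim-and-fill to detect and adjust for publication bias
    trimfill(re_fit)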

When data were collected more than once on the same participant, the mean kappa and variance were used in the meta-analysis. This occurred if either the same participant was measured in more than one location by the same rater (eg, left toe and right toe) or the same participant was measured by more than one rater.
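
For example, assuming hypothetical kappas from two sites on the same participants, the collapsed values entered into the meta-analysis would be:

    site_kappas    <- c(0.58, 0.65)     # eg left toe and right toe
    site_variances <- c(0.009, 0.011)

    study_yi <- mean(site_kappas)       # single kappa entered into the meta-analysis
    study_vi <- mean(site_variances)    # single variance entered into the meta-analysis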

Assessment of methodological quality

Methodological quality and risk of bias of included articles were assessed independently by two reviewers (AM and SL) using the Quality Appraisal of Reliability (QAREL) checklist and qualitative methodological assessment,22 with disagreements arbitrated by a third reviewer (SC). Where data were incomplete or methodology unclear, the relevant authors were contacted for clarification.

Results

A total of 2431 articles were retrieved from the database search. Seventy-nine articles were identified as suitable for full-text review, of which 17 satisfied eligibility criteria for inclusion (figure 1). Seven articles were included in respective meta-analyses for individual test methods.

Figure 1
PRISMA flow chart of search strategy. PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

Characteristics and overview of included studies

The 17 studies included in this review reported a total of 1248 participants (table 1). Age was reported as a mean of 50–73 years,23–31 as a range of 8–89 years,28 32 or was unreported.33–38 Sex was reported in 11 studies, with more males (n=617, 59%) than females overall, while sex was unreported in six studies.33–37 39 Diabetes type was specified as type 1,32 38 type 2,23 25 26 39 both type 1 and type 2,24 27–29 31 33 34 36 37 or was unreported.30 35 Diabetes duration was reported as a mean of 3–54 years,23–25 27 31 32 38 as a range of 0–63 years,23 28 31 32 or was unreported.26 29 30 33–37 39 DPN diagnosis was reported in 65% of participants (range 3%–100%) across 12 studies,23 24 26 27 30 32 34 36 37 39 and prevalence was unreported in six.25 28 31 33 35 38 Nine studies assessed inter-rater reliability,23 27–30 33 35 38 39 six studies assessed intra-rater reliability25 31 32 34 36 37 and two studies examined both inter-rater and intra-rater reliability24 26 (table 1). Reliability was reported using the kappa statistic,23 24 26 28 29 32 33 35 38 39 COV,31 34 36 percentage agreement,39 Spearman’s rho37 and ICC.25 30

Table 1 | Participant characteristics

Raters and measurement methods

There was little consistency in the skills and qualifications of raters, with experience ranging from doctors, specialists and people with specialized training in these devices23 24 26 29 30 35 36 38 39 to internists.28 Seven studies did not report the experience level of the raters.25 27 31–34 37 Testing environments varied: eight studies were conducted in tertiary settings,23 27 29 32 34 35 37 39 three in secondary settings,24 33 36 three in education or university settings26 28 30 and three did not specify the testing environment.25 31 38 Time between retests also varied: four studies retested on the same day,24 27 32 38 four within 1 week,25 26 31 36 four within 1 month,30 34 37 39 one within 2 months23 and two did not specify their retesting period.28 33

Methodological quality

All studies evaluated a relevant sample of participants, applied tests appropriately and used appropriate statistical measures of agreement (online supplemental table 1). However, the overall quality of studies varied, primarily due to inconsistency in reporting on blinding of raters and participants, randomization of raters or assessments, and general methodology. For example, two studies did not blind raters to their own prior outcomes,28 32 two did not blind raters to the results of the reference standard28 38 and a further two did not blind raters to clinical information that was not part of the testing procedure.28 30 Three studies did not blind raters to additional cues that were not part of the test.26 31 34 The results of these studies therefore need to be interpreted in the context of these limitations.

Four-site 10 g monofilament reliability

One study assessed the reliability of the four-site monofilament, reporting moderate inter-rater reliability (κ=0.61) (table 2) and varied intra-rater reliability ranging from minimal to moderate (κ=0.34–0.67)26 (table 3).

Table 2 | Inter-rater reliability of peripheral neurological tests, reported as Cohen’s kappa (κ), intraclass correlation coefficient (ICC) or per cent agreement
Table 3 | Intra-rater reliability of peripheral neurological tests, reported as Cohen’s kappa (κ), coefficient of variation (COV) or Spearman’s rho (r)

128 Hz tuning fork reliability

Eight studies assessed the inter-rater reliability of the 128 Hz tuning fork, demonstrating widely varied reliability ranging from none to strong (κ=0–0.86)23 24 26–29 38 39 (table 2). Two studies reported the intra-rater reliability of the 128 Hz tuning fork as weak to moderate agreement (κ=0.41–0.66)24 26 (table 3).

VPT reliability

Four studies assessed inter-rater reliability of VPT through various modalities including the biothesiometer,24 30 neurothesiometer26 33 and maxivibrometer30 (table 2).

Biothesiometer: one study reported weak to moderate reliability (κ=0.58–0.65)24 and one study reported good reliability (ICC: 0.927).30

Neurothesiometer: two studies reported weak to moderate reliability (κ=0.51–0.61).26 33

Maxivibrometer: one study reported good reliability (ICC: 0.958–0.961).30

Eight studies assessed intra-rater reliability of VPT through various modalities including the biothesiometer,24 31 32 36 37 neurothesiometer26 31 34 and Vibratron II25 34 (table 3).

Biothesiometer: two studies reported weak to strong agreement (κ=0.51–0.81),24 32 two reported high levels of agreement (COV (%)=8.6–18.6)31 36 and one reported very strong reliability (rho=0.91).37

Neurothesiometer: one study reported weak agreement (κ=0.51)26 and two studies reported excellent agreement (COV (%)=6–8.1).31 34

Vibration Sensitivity Tester (Vibratron II): one study reported moderate intra-rater reliability (COV (%)=31–34)34 and one study reported excellent reliability.25

Pinprick reliability

Three studies assessed pinprick inter-rater reliability reporting minimal to weak reliability (κ=0.35–0.48)28 33 39 (table 2). Intra-rater reliability was not reported.

Ankle reflex reliability

Four studies assessed ankle reflex inter-rater reliability and reported weak to moderate reliability (κ=0.58–0.60)23 24 28 35 (table 2). One study assessed intra-rater reliability of ankle reflexes reporting weak to moderate agreement (κ=0.51–0.64)24 (table 3).

Proprioception reliability

One study assessed inter-rater reliability of proprioception and reported minimal reliability (κ=0.28)28 (table 2). Intra-rater reliability has not been examined.

Other recommended tests

Our literature search did not identify any investigations into the reliability of light touch/Ipswich touch test, three-site monofilament or temperature perception as performed according to current guidelines.

Meta-analyses

There were sufficient data from included studies to undertake meta-analyses of the inter-rater reliability of ankle reflexes,23 24 28 35 pinprick,28 39 128 Hz tuning fork23 24 26 28 29 39 and VPT (figure 2)24 26 as well as the intra-rater reliability of 128 Hz tuning fork24 26 and VPT (figure 3).24 26 32

Figure 2
Forest plots for inter-rater reliability of screening tests for DPN. (A) Forest plot for inter-rater reliability of ankle reflex test. (B) Forest plot for inter-rater reliability of pinprick test. (C) Forest plot for inter-rater reliability of 128 Hz tuning fork test. (D) Forest plot for inter-rater reliability of vibration perception threshold test. DPN, diabetes-related peripheral neuropathy.

Figure 3
Forest plots for intra-rater reliability of screening tests for DPN. (A) Forest plot for intra-rater reliability of 128 Hz tuning fork test. (B) Forest plot for intra-rater reliability of vibration perception threshold test. DPN, diabetes-related peripheral neuropathy.

Meta-analysis demonstrated the highest inter-rater reliability – reported as estimated Cohen’s kappa (95% CI) – for VPT (κ=0.61 (0.50 to 0.73)), followed by ankle reflexes (κ=0.60 (0.55 to 0.64)), pinprick (κ=0.45 (0.22 to 0.69)) and 128 Hz tuning fork (κ=0.42 (0.15 to 0.70)).

Meta-analysis demonstrated the highest intra-rater reliability for VPT (κ=0.63 (0.45 to 0.81)), followed by the 128 Hz tuning fork (κ=0.54 (0.37 to 0.73)).

The trim-and-fill method used to detect and adjust for publication bias resulted in adjusted estimated Cohen’s kappa (95% CI) for ankle reflexes (κ=0.60 (0.32 to 0.80)), pinprick (κ=0.48 (0.29 to 0.67)) and 128 Hz tuning fork (κ=0.32 (0.05 to 0.60)) (inter-rater reliability) and 128 Hz tuning fork (κ=0.53 (0.35 to 0.72)) (intra-rater reliability).

Conclusions

Of the recommended tests, included articles investigated the reliability of the four-site monofilament,26 128 Hz tuning fork,23 24 26–29 38 39 VPT,24–26 30–34 36 37 pinprick,28 33 39 ankle reflex23 24 28 35 and proprioception.28 This review found that the inter-rater and intra-rater reliability of the recommended neurological tests varies widely when performed in people with diabetes. Based on the limited data available, results of pooled analyses suggest that VPT and ankle reflexes demonstrate acceptable reliability, whereas the reliability of pinprick and 128 Hz tuning fork tests is questionable. Additionally, cohort studies suggest that the four-site monofilament also demonstrates acceptable reliability,26 whereas the reliability of proprioception may be inadequate.28 These findings should be considered in the context of the results of the QAREL assessment, the variability in methodological reporting, the wide CIs of the adjusted pooled reliability estimates (eg, the adjusted inter-rater reliability of the 128 Hz tuning fork, κ=0.32 (0.05 to 0.60)) and the variability of results, which together indicate that the available evidence is of low to moderate quality. Of note, although included in IDF, IWGDF and ADA guidelines, we did not identify any article reporting the reliability of the three-site monofilament, light touch, Ipswich Touch Test or temperature perception tests in people with diabetes. These results also need to be considered in light of the established predictive capacity of the 10 g monofilament and 128 Hz tuning fork for the development of foot wounds.40 41

The findings of this systematic review highlight the need for more exhaustive investigation of the reliability of recommended chairside tests for DPN. A number of the studies assessing reliability of DPN testing reported that 100% of their cohorts had DPN,23 24 27 30 34 37 39 making the weak to moderate inter-rater and intra-rater reliability reported concerning. Although reliability does not infer diagnostic accuracy, studies of reliability are affected by disease prevalence.42 Therefore, when conducted in a cohort in which all participants have the target disease, the results are likely to overstate the reproducibility of the measurement.42 In the case of tests such as monofilament testing, for which pooled estimates of diagnostic accuracy have shown low sensitivity of 0.53 and adequate specificity of 0.88, the likelihood of a false negative result is high for any given test point.43 This is consistent with our findings of weak to moderate test reliability even in populations consisting entirely of participants with DPN. As chairside DPN testing is used both for diagnosis and for ongoing monitoring of DPN, the usefulness of a test that has limited capacity to rule out the presence of the target disease, or to reproduce a positive result in those with the disease, is questionable. Furthermore, given that the earliest nerve damage in DPN is likely to affect small fibers,44 the reliability of chairside small-fiber tests is under-investigated. We identified three studies that investigated the reliability of pinprick. However, we did not identify any studies investigating the reliability of thermal perception, and our review did not investigate question-based tests such as the Total Symptom Score.12 In this context, the reliability of large-fiber tests such as the monofilament and vibration perception needs to be considered together with their limited ability to detect early disease. Further research is thus warranted to determine the reliability of tests capable of detecting early disease.

Methodological differences between included studies are likely to have contributed to the range of results available in the literature. The reliability of various chairside tests was reportedly affected by limited training or differences in the experience levels of clinicians23 26 28 34 35 39 and by inconsistent comprehension of individual test instructions by participants.23 24 26 32 39 Tests such as the tuning fork, monofilament and pinprick all rely on application of controlled pressure by the clinician. As the rate of pressure application is difficult to control, especially between different raters, several studies identified this as a possible influence on test reliability.23 24 26–28 38 39 These issues suggest that adequate clinician training should be undertaken, that training should be consistent with guidelines and that instructions to patients should be clear, all of which may improve the reliability of chairside tests. Clinically, this can be supported by following the recommendations of current guidelines regarding test technique and test sites.12–14 The included literature is limited by small sample sizes,26 34 35 37 39 lack of blinding of assessors to previous results28 30 and heterogeneity in the measures of statistical agreement used. Although the majority of studies used kappa values, some used COV, Spearman’s rho, percentage agreement or ICCs, making comparison of available data across testing methods challenging.

This review has highlighted the need for further investigation of the reliability of chairside DPN testing. Given the range of reliability results and the varied reliability measures across all recommended neurological tests, more extensive research is needed into the reliability of pinprick, proprioception and the other recommended chairside DPN tests that have not yet been investigated. Future research should be conducted in specific populations with diabetes, and in populations where the prevalence of DPN has been established through testing methods with high diagnostic accuracy. Given the additional impacts of age on neurological and cognitive function beyond those resulting from diabetes, there may be age-specific differences in the reliability of chairside tests, and investigations taking age into account are therefore required. In addition, simplifying neurological testing would allow clinicians and patients to communicate test instructions more clearly and reduce variability between clinicians when performing the tests, improving overall reliability. Furthermore, increased clinical knowledge of the reliability of neurological screening tests allows for more informed clinical decision making when selecting multiple tests (eg, monofilament and tuning fork) to aid in the diagnosis and monitoring of DPN.

Although the search strategy employed in this review was designed to be robust, some evidence may not have been captured, for example, unpublished data. It should also be acknowledged that the chairside tests included in this review are drawn from three international consensus statements only. Other commonly used chairside neuropathy tests that warrant further investigation include the monofilament test using additional sites for all-cause peripheral neuropathy,45 conventional and graduated tuning forks,46 two-point discrimination,47 temperature sensation and the Michigan Neuropathy Screening Instrument.48 Lastly, future studies investigating test reliability should ensure adequate reporting, with sufficient detail on cohort characteristics and methodology, and appropriate statistical tests, for example, kappa or intraclass correlation coefficients with relevant CIs.

This systematic review found evidence of acceptable reliability for VPT (using a biothesiometer, neurothesiometer or maxivibrometer), ankle reflexes and the four-site monofilament test. Due to the large range of reported reliability for the 128 Hz tuning fork, we are unable to comment definitively on this testing method. These results support the clinical use of the identified tests for screening and ongoing monitoring of DPN as recommended by the latest guidelines from the IDF, IWGDF and ADA. The reliability of temperature perception (IDF and ADA), pinprick, proprioception (ADA), three-site monofilament and Ipswich touch test (IWGDF) when performed in people with diabetes remains unclear and warrants investigation to determine their suitability for use in this population.