Key Points

Quantifying preferences for healthcare is becoming increasingly popular; however, there exists no recent description of how health-related discrete choice experiments (DCEs) are being employed.

This study identified changes in experimental design, analytical methods, validity tests, qualitative methods and outcome measures over the last 5 years.

To facilitate quality assessment and better integration into health decision-making, future DCE reports should include more complete information, which might be achieved by developing reporting guidelines specifically for DCEs.

1 Introduction

In recent years, there have been increased calls for patient and public involvement in healthcare decision-making [1, 2]. Patient or public involvement can support decision-making at multiple levels: individual (shared decision-making), policy (patient experts on panels) and commissioning (incorporating patient preferences in technology evaluations or health state valuation). Views can be elicited qualitatively, quantitatively or using mixed-methods approaches [3]. Example methods include interviews, focus groups and stated preference techniques such as the standard gamble or time trade-off. Studies by the Medical Device Innovation Consortium (MDIC) [4] and Mahieu et al. [5] highlighted a wide variety of methods to measure both stated and revealed preferences in healthcare.

Among the quantitative methods for eliciting stated health preferences, discrete choice experiments (DCEs) are increasingly advocated [6]. In a DCE, individuals are asked to select their preferred (and/or least preferred) alternative from a set of alternatives. DCEs are grounded in theories that assume that (1) alternatives can be described by their attributes, (2) an individual’s valuation depends upon the levels of these attributes, and (3) choices are based on a latent utility function [7,8,9,10]. The theoretical foundations have implications for the experimental design (principles to construct alternatives and choice sets) and the probabilistic models used to analyse the choice data [7].
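
A compact statement of this latent utility framework may help fix notation (a generic random utility sketch consistent with the cited texts, not a formulation taken from any particular study):

$$U_{nj} = V_{nj} + \varepsilon_{nj} = \beta' x_{nj} + \varepsilon_{nj},$$

where $U_{nj}$ is the latent utility individual $n$ derives from alternative $j$, $x_{nj}$ contains the attribute levels of that alternative, $\beta$ holds the preference weights to be estimated, and $\varepsilon_{nj}$ is a random component. Individual $n$ is assumed to choose alternative $j$ from choice set $C$ whenever $U_{nj} > U_{nk}$ for all $k \neq j$ in $C$.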

Broad reviews by Ryan and Gerard (1990–2000) [11], de Bekker-Grob et al. (2001–2008) [7] and Clark et al. (2009–2012) [6] identified a number of methodological challenges of DCEs (e.g. how to choose among orthogonal, D-efficient and other designs, or how to account for preference heterogeneity when analysing choice data). These reviews, together with published checklists [12] and best-practice guidelines [13,14,15,16,17], provide specific guidance and can potentially improve quality [12, 18]. However, it is unknown whether the challenges identified in prior reviews are still relevant or whether there has been a response to the published suggestions and guidelines. Furthermore, although health-related DCEs are increasingly advocated by organisations such as the MDIC [4], their use for actual decision-making in health remains limited [7, 13]. Key barriers to their wider use in policy include concerns about the robustness and validity of the method and the quality of applied studies [19, 20].

This paper seeks to provide a current overview of the applications and methods used in health-related DCEs. This overview will be created by systematically reviewing the DCE literature published between 2013 and 2017. In addition, historical trends in experimental design, analytical methods, validation procedures and outcome measures will be described by comparing the results with those of prior reviews. For the sake of generality, and to allow examination of trends based on consistent data extraction methods, this comparison will focus on the broad reviews cited above rather than on narrower reviews of DCEs covering specific study designs or disease areas [21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,41]. Recent developments in DCE methods will be incorporated by including new data elements not reported in previous reviews. Potential challenges and recommendations for future research will also be identified.

2 Methods

The current systematic review continued the work conducted in the prior broad DCE reviews [6, 7, 11] by focusing on DCE applications published between 2013 and 2017. The methodology for this systematic review built on that of the prior reviews to allow comparison of results across review periods and identification of trends. The search was initiated in May 2015 and updated in February 2016 and January 2018. We used the same search engine (PubMed) that was used in the latest review by Clark et al. [6] and largely the same search terms. We decided to exclude the search terms ‘conjoint’ and ‘dce’, since these yielded too many irrelevant results (for ‘dce’, particularly because of the rise of dynamic contrast-enhanced imaging) and would have substantially increased the number of abstracts to be reviewed. The final search terms included ‘discrete choice experiment’, ‘discrete choice experiments’, ‘discrete choice modeling’, ‘discrete choice modelling’, ‘discrete choice conjoint experiment’, ‘stated preference’, ‘part-worth utilities’, ‘functional measurement’, ‘paired comparisons’, ‘pairwise choices’, ‘conjoint analysis’, ‘conjoint measurement’, ‘conjoint studies’, ‘conjoint choice experiment’ and ‘conjoint choice experiments’. A study was included if it was applied to health, included a discrete choice exercise (rather than rating or ranking), focused on human beings and was published as a full-text article in English between January 2013 and December 2017. Consistent with prior reviews, DCEs without empirical data (e.g. methodological studies) and studies of samples already included in our review were excluded.
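
For illustration, the listed terms would combine into a Boolean search of roughly the following form (a hypothetical reconstruction for readability; the exact query string, field tags and filters used are not reported in this section):

```
("discrete choice experiment" OR "discrete choice experiments" OR
 "discrete choice modeling" OR "discrete choice modelling" OR
 "discrete choice conjoint experiment" OR "stated preference" OR
 "part-worth utilities" OR "functional measurement" OR
 "paired comparisons" OR "pairwise choices" OR "conjoint analysis" OR
 "conjoint measurement" OR "conjoint studies" OR
 "conjoint choice experiment" OR "conjoint choice experiments")
AND ("2013/01/01"[dp] : "2017/12/31"[dp])
```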

To ensure consistency of data extraction and assist with synthesis of results, the authors used an extraction tool, available in Appendix A of the Electronic Supplementary Material, initially developed using the criteria of Clark et al. [6]. We first considered areas of application (e.g. patient or consumer experience, valuing health outcomes) and background information (country of origin, number and type of attributes, number of choice sets, survey administration method), followed by more detailed information about the experimental design (type, plan, use of blocking, design software, design source, method used to create choice sets, number of alternatives, presence of an opt-out or status quo option, sample size and type), data analysis (model, analysis software, model details), validity checks (external and internal), use of qualitative methods (type and rationale) and presented outcome measures. The authors tested the extraction tool and discussed initial results. To fully capture current DCE design methods, the following data elements were added to the original data extraction tool: number of alternatives, presence of an opt-out or status quo option, sample size, use of blocking, use of a Bayesian design approach, software for econometric analyses and the type of qualitative research methods reported. With regard to analysis methods, this review also extracted additional information on the use of scale-adjusted latent class, heteroskedastic conditional logit and generalised multinomial logit models. Studies were also categorised by journal type.

Each author extracted data from a group of articles, checking online appendices and supplementary materials where relevant. A subsample of studies (20%) was double-checked by V.S. for quality control. We categorised the extracted data and reported the results as percentages. Results for the econometric analysis models were categorised based on the three key characteristics of the multinomial logit model (Fig. 1): (1) the assumption that error terms are independent and identically distributed (IID) according to the extreme value type I distribution, (2) independence of irrelevant alternatives (IIA), which results from the first characteristic, and (3) the presence or absence of preference heterogeneity [7]. The IID characteristic limits flexibility in estimating the error variance; IIA restricts the substitution pattern, meaning the relative odds of choosing one alternative over another are unaffected by the other alternatives available; and assumptions about preference heterogeneity determine whether preferences are allowed to vary across respondents.
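
As a concrete illustration of characteristics (1) and (2), under IID extreme value type I errors the multinomial logit choice probabilities take the familiar closed form (a textbook result, not specific to any reviewed study):

$$P_{nj} = \frac{\exp(\beta' x_{nj})}{\sum_{k \in C} \exp(\beta' x_{nk})}, \qquad \frac{P_{nj}}{P_{ni}} = \frac{\exp(\beta' x_{nj})}{\exp(\beta' x_{ni})},$$

where the ratio on the right involves no alternative other than $i$ and $j$; this is precisely the IIA property. Relaxing the IID and fixed-$\beta$ assumptions yields the more flexible models summarised in Fig. 1.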

Fig. 1 Econometric analysis model overview

3 Results

3.1 Search Results

A total of 7877 abstracts were identified from the beginning of 2013 until the end of 2017. After abstract and full-text review, 301 DCEs (including six case 3 best–worst scaling [BWS] studies) met the inclusion criteria and were selected for data extraction (see Fig. 2) [43–343]. Figure 3 depicts the total number of DCE applications in health across the different review periods: 1990–2000, 2001–2008, 2009–2012 and 2013–2017. The 2009–2012 review reported that the number of studies had increased to 45 per year on average [6]. The current review period averaged 60 studies per year, with a high of 98 studies in 2015 and a low of 32 studies in 2017 (Fig. 3). Figure 3 also shows that the growth in DCE applications during the current review period was less consistent than that observed in prior periods.

Fig. 2 Flow diagram of systematic literature review to identify discrete choice experiments (DCEs)

Fig. 3 Number of discrete choice experiment (DCE) applications by publication year

3.2 Areas of Application

Prior reviews noted that although DCEs were originally introduced in health economics to value patient or consumer experience, their use has broadened considerably [6, 344]. Table 1 summarises information about the different areas of application of DCEs for each review period (Appendix B of the Electronic Supplementary Material contains figures based on the tables in this review). Compared with the previous review period, the largest overall shifts occurred in the areas of patient or consumer experience (category A), trade-offs between health outcomes and patient or consumer experience factors (category C), and health professionals’ preferences for treatment or screening (category G). In the current review period, 8% of studies valued health outcomes such as ‘heart attacks avoided’ (category B, 23 studies, e.g. studies [148, 152, 153, 162, 170]), 4% estimated utility weights within the quality-adjusted life-year (QALY) framework (category D, 13 studies, e.g. [218, 226,227,228, 230]), 6% focused on job choices (category E, 17 studies, e.g. [231, 236, 238, 242, 247]), and 9% developed priority-setting frameworks (category F, 27 studies, e.g. [248, 253, 270, 272, 274]).

Table 1 Areas of study application

Among the DCEs reviewed, the most common journal focus was health services research (n = 139; 46%). About a third (n = 102; 34%) of articles were published in specialty-focused medical journals such as Vaccine (five studies [66, 131, 146, 311, 313]) or the British Journal of Cancer (three studies [47, 70, 171]). Fifty-one (17%) were published in general medical journals such as PLoS One (20 studies, e.g. [44, 64, 81, 91, 99]) and BMJ Open (five studies [100, 102, 109, 169, 264]). More details can be found in Appendix C of the Electronic Supplementary Material.

3.3 Background Information About DCEs

The reviews from Ryan and Gerard [11], de Bekker-Grob et al. [7] and Clark et al. [6] provided detailed information about study characteristics. Information for the current review period is described in the sections below. Table 2 parts (a) and (b) report the current information alongside data from the prior reviews.

Table 2 DCE background information

3.3.1 Country of Origin

Table 2a shows that UK-based studies made up a relatively high proportion of published DCEs (17%, 50 studies), as did studies from the US (17%, 50 studies), the Netherlands (15%, 44 studies), Australia (10%, 30 studies), Germany (9%, 28 studies) and Canada (8%, 25 studies). DCEs were also popular in other European countries, for example, Italy (3%, eight studies) and Sweden (2%, six studies) (not shown). We also observed an increase in studies coming from ‘other’ countries, from 0% to 34% across the four review periods, which reflects an upwards trend towards applying DCEs in low- and middle-income countries (e.g. Cameroon [239], Ghana [244], Laos [232], Malawi [254] and Vietnam [122]).

3.3.2 Attributes, Choices and Survey

In the current review period, the number of attributes per alternative in DCEs ranged from two to 21, with a median of five. We observed a slight decrease in the number of attributes; the modal category was 4–5 attributes (39%, 117 studies). In line with prior reviews, most studies (82%, 247 studies) included four to nine attributes. For the period 2013–2017, most studies included a monetary (50%, 150 studies), time-related (39%, 117 studies) or risk-related (44%, 133 studies) attribute. The proportions of studies including time-related and health status (24%, 71 studies) attributes decreased.

Most DCEs in the current period included nine to 16 choices per individual (54%, 162 studies), with a median of 12 (minimum 1, maximum 32). Prior reviews mentioned increases in online administration of DCEs. This trend continued in the current review period, with 57% of the DCEs conducted online (172 studies), whereas the proportion of DCEs using pencil and paper dropped to 23% (69 studies). Self-completed surveys (online or paper) remained the main mode of survey administration.

3.3.3 Alternatives and Sample

Prior reviews did not collect data about the number of alternatives included in each DCE or whether an opt-out or status quo option was included. For the current period, most of the studies (83%, 251 studies) included two alternatives (not including any opt-out or status quo option), with 8% (23 studies) not clearly reporting the number of included alternatives (Table 2b). The majority of the studies (64%, 194 studies) did not include an opt-out or status quo option.

The prior reviews covering the period 1990–2012 did not extract data about the sample size. In the current period, the mean and median sample size were 728 and 401, respectively. Sample size ranged from a minimum of 35 [116] to a maximum of 30,600 respondents [148]. Most of the samples included patients (37%, 110 studies) or the general public (27%, 81 studies). A large number of DCEs sampled ‘other’ populations (31%, 93 studies) such as healthcare workers, healthcare students or a mixture of these.

3.4 Experimental Design

Experimental design (the planning of alternatives and choice sets) is crucial to the conduct of a DCE. The review by de Bekker-Grob et al. [7] describes DCE design in detail. For more information about the choices researchers have to make when designing the experimental part of a DCE, we also refer readers to a key checklist and a best-practice example [14, 15].

3.4.1 Design Type, Design Plan and Blocking

As in prior review periods, most DCEs made use of a fractional design (89%, 269 studies) (Table 3). Additionally, we observed that for the current review period, the design plan of DCEs most frequently focused on main effects only (29%, 86 studies). This is a decrease compared with the periods 1990–2000, 2001–2008 and 2009–2012 (74%, 89% and 55%, respectively). The percentage of DCEs not clearly reporting design plan information increased to 49% (147 studies) for 2013–2017. When generating the experimental design, blocking (creating different versions of the experiment for different respondent groups) can be used to reduce respondents’ cognitive burden by reducing the number of choices each respondent completes [345]. Reviews for the period 1990–2012 did not collect information about blocking. Data for the current period showed that 50% (150 studies) reported using blocking when generating the experimental design. On average, studies with blocking had 709 participants, each of whom completed 11 choice sets, whereas studies with unblocked designs had 439 participants, each of whom completed 13 choice sets.
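
To make the mechanics of blocking concrete, the sketch below splits a design into two blocks and alternates respondents between them (illustrative Python only; the design size, block count and assignment rule are invented, and in practice design software allocates choice sets to blocks so that design properties such as level balance are preserved):

```python
# Minimal blocking sketch: split a hypothetical 24-choice-set design into
# two blocks of 12 so that each respondent answers half the choice sets.
import random

full_design = [f"choice_set_{i}" for i in range(1, 25)]  # 24 generated sets

random.seed(1)
random.shuffle(full_design)                    # naive allocation, for show only
blocks = [full_design[:12], full_design[12:]]  # two blocks of 12 sets each

def assign_block(respondent_id: int) -> list:
    """Alternate respondents across blocks to keep block sizes balanced."""
    return blocks[respondent_id % len(blocks)]

print(len(assign_block(0)), len(assign_block(1)))  # 12 12
```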

Table 3 Experimental design information for DCEs

3.4.2 Design Software

Ngene became the most popular software tool in the current period for generating experimental designs (21%, 62 studies, e.g. [53, 63, 139, 268, 319]). SAS (18%, 54 studies, e.g. [262, 290, 296, 300, 316]) and Sawtooth (16%, 47 studies, e.g. [46, 141, 207, 276, 323]) remained popular tools. Compared to prior review periods, we observed an increase in the percentage of studies not clearly indicating what software was used to generate the experimental design (33%, 99 studies, e.g. [44, 144, 177, 204, 299]).

3.4.3 Methods to Create Choice Sets

The upwards trend in the use of D-efficient (35%, 105 studies) experimental designs continued in the current review period. Correspondingly, fewer DCEs used orthogonal arrays through methods such as single profiles, random pairing or the foldover technique (Table 3). As with the experimental design characteristics mentioned in the previous sections, we observed that an increasing number of studies (33%, 100 studies in 2013–2017) did not clearly report the methods used to create choice sets.
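
For context, D-efficient designs are typically found by searching for the design $X$ that minimises the D-error (a standard criterion, stated here as background; the reviewed studies do not all report which variant they used):

$$D\text{-error} = \left[\det \Omega(X, \beta)\right]^{1/K},$$

where $\Omega(X, \beta)$ is the asymptotic variance–covariance matrix of the parameter estimates implied by the design and the assumed prior values $\beta$, and $K$ is the number of parameters. Bayesian D-efficient designs average this quantity over a prior distribution for $\beta$ rather than fixing it at a point value.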

3.5 Econometric Analysis Methods

Information about the different econometric analysis methods and the appropriateness of these methods for different DCE applications is described in great detail in the prior reviews [6, 7, 11]. More information can be found in papers by Louviere and Lancsar [12], Bridges et al. [14] and Hauber et al. [17]. Table 4 parts (a) and (b) summarise information about econometric analyses from the current and prior review periods.

Table 4 Econometric analysis details for DCEs

3.5.1 Econometric Analysis Model, Software and Preference Heterogeneity

We present information about econometric analysis models according to the taxonomy described in the “Methods” section and visualised in Fig. 1. Reviews for the periods 1990–2000 and 2001–2008 reported that most DCEs used random-effects (random-intercept) probit models to analyse preference data (53% and 41%, respectively). The review for the period 2009–2012 showed a shift to the use of other methods like multinomial logit models (45%) and mixed (random-parameter) logit models (25%). For the current review period, this trend continued (see Table 4a). Most DCEs in 2013–2017 reported the use of mixed logit models (39%, 118 studies, e.g. [47, 271, 301, 314, 318]) or multinomial logit models (39%, 116 studies, e.g. [92, 110, 166, 294, 339]) to analyse preference data. The current review period also showed an increase in the use of latent class models (12%, 36 studies, e.g. [38, 91, 139, 165, 269]) and other econometric analysis models. Examples include generalised multinomial logit (4%, 12 studies, e.g. [97, 124, 157, 174, 240]) and heteroskedastic multinomial logit (4%, 11 studies, e.g. [134, 139, 184, 256, 309]).

Prior reviews did not collect data about the software used for econometric analysis. For the current review period, Table 4b shows that most DCEs made use of Stata (31%, 94 studies, e.g. [91, 110, 138, 149, 213]) or Nlogit (22%, 65 studies, e.g. [94, 171, 204, 282, 346]) to conduct econometric analysis. However, 26% (79 studies, e.g. [101, 184, 211, 231, 330]) did not clearly report information about the software used.

Among the studies that used mixed logit models to account for preference heterogeneity in the period 2013–2017, 65 (22% of all included DCEs) reported additional information about the distributional assumptions used to conduct the mixed logit analysis and the number of distributional draws (e.g. Halton draws) used to simulate preference heterogeneity. This percentage is similar to that for the period 2009–2012 (21%). The mean number of draws for the current review period was 1354 (median 1000, minimum 50, maximum 10,000), and 18% of the DCEs (53 studies) assumed that parameters followed a normal distribution.
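
To illustrate what such draws are used for, the sketch below simulates mixed logit choice probabilities for a single choice set using Halton draws (illustrative Python; the attributes, coefficient means and standard deviations, and number of draws are invented, not taken from any reviewed study):

```python
# Simulated mixed logit choice probabilities via Halton draws.
import numpy as np
from scipy.stats import norm, qmc

# Attribute matrix for one choice set (rows = alternatives, cols = attributes).
X = np.array([[0.5, 1.0],    # alternative A: [cost, effectiveness]
              [1.0, 2.0],    # alternative B
              [0.0, 0.0]])   # opt-out
mu    = np.array([-1.0, 0.8])  # means of the random coefficients
sigma = np.array([ 0.5, 0.3])  # standard deviations (normal mixing)

R = 1000                                         # number of Halton draws
halton = qmc.Halton(d=2, scramble=True, seed=7)  # quasi-random sequence
draws  = norm.ppf(halton.random(R))              # uniforms -> standard normals

betas = mu + draws * sigma   # R draws from the coefficient distribution
v     = betas @ X.T          # utilities for each draw, shape (R, 3)
p     = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)  # MNL kernel per draw

print(p.mean(axis=0))        # simulated mixed logit choice probabilities
```

Averaging the multinomial logit kernel over the draws approximates the integral over the coefficient distribution; maximum simulated likelihood estimation nests this calculation inside an optimiser.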

3.6 Validity Checks and Qualitative Methods

DCEs are based on responses to hypothetical choices (stated preferences), so internal and external validity checks provide a crucial opportunity to assess data quality or to compare stated preferences from DCEs with revealed preferences. As Clark et al. [6] observed in their review, tests for external validity are rarely reported, possibly because validating hypothetical choice scenarios is difficult [347]. Perhaps for this reason, the review covering the period 1990–2000 did not extract specific information about external validity tests. In the reviews covering 2001–2012, only a very small proportion (1%) of the DCEs reported any details about their investigations into external validity. In the current review period, 2% (seven studies [55, 93, 147, 184, 185, 195, 248]) reported using external validity tests (Table 5).

Table 5 Details of validity checks and qualitative methods

For detailed information about the different internal validity tests, we refer readers to the prior review papers [6, 7, 11]. In the current review period, the percentage of studies that included internal validity checks ranged from 17% (50 studies) for non-satiation checks down to 6% (18 studies) for internal compensatory checks. Internal compensatory checks were reported less frequently than in earlier review periods. For the current review period, ‘other’ validity checks, such as tests for theoretical validity, face validity and consistency, were used frequently (34%, 102 studies).

Another way to enhance quality in a DCE is to complement the quantitative study with qualitative methods [35]. For the current review period, 86% (258 studies) of the DCEs used qualitative methods to enhance the process and/or results. Most DCEs used interviews (50%, 151 studies) or focus group techniques (18%, 54 studies). Qualitative methods were usually used to inform attribute (53%, 160 studies) and/or level (44%, 134 studies) selection, which follows the overall upwards trend reported in prior reviews. The proportion of DCEs using qualitative methods for questionnaire pre-testing (38%, 113 studies) was similar to the level in the previous review period. Overall, just as in previous review periods, few studies in the current review period (4%, 12 studies) used qualitative methods to improve the understanding of results/responses.

3.7 Outcome Measures

Trends in the outcome measures presented by DCEs are summarised in Table 6.

Table 6 Presented outcome measures of DCEs

As mentioned in prior reviews, DCEs have often presented their outcomes in terms of willingness to pay (WTP; a monetary welfare measure) or a utility score [6, 7, 11]. Use of WTP declined over the past two review periods (2001–2012), and use of utility scores decreased from 24% to 8% over the past three periods (1990–2012). Relative to the previous period, we observed increases in the use of utility scores (17%, 50 studies, e.g. [61, 128, 141, 164, 317]), odds ratios (10%, 30 studies, e.g. [80, 146, 200, 234, 280]) and probability scores (13%, 38 studies, e.g. [122, 154, 198, 272, 277]). We also collected information about willingness-to-accept (WTA) measures (4%, 13 studies, e.g. [53, 94, 250, 322, 338]) and regression coefficients (56%, 169 studies, e.g. [44, 57, 231, 244, 276]), which were not collected in previous reviews. The proportion of studies with ‘other’ outcome measures remained near one half (49%, 147 studies, e.g. [48, 87, 114, 207, 273]). Examples from this category include (predicted) choice shares, maximum acceptable risk, relative importance and ranking.
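
For reference, marginal WTP estimates of this kind are conventionally derived as a ratio of coefficients (a standard derivation under a linear utility specification, not specific to the studies reviewed):

$$\text{WTP}_k = -\frac{\beta_k}{\beta_{\text{cost}}},$$

so that, for example, $\beta_k = 0.4$ and $\beta_{\text{cost}} = -0.02$ per € imply a WTP of €20 for a one-unit improvement in attribute $k$; WTA measures are obtained analogously.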

4 Discussion

In this study, we reviewed DCEs published between 2013 and 2017. We followed the methods of prior reviews and compared our extraction results to those reviews to identify trends. We found that DCEs have continued to increase in number and have been undertaken in an increasing number of countries. Studies reported using more sophisticated designs with associated software, for example, D-efficient designs generated using Ngene. The trend towards the use of more sophisticated econometric models has also continued. However, many studies presented sophisticated methods with insufficient detail. For example, we were not able to check whether the results had been interpreted correctly or whether the authors had conducted the appropriate diagnostics (e.g. tested whether the IIA assumption held). Qualitative methods have continued to be popular as an approach to select attributes and levels, which might improve validity. In this study, we also extracted data in several new categories, for example, sample size and type, the use of blocking, software used for econometric analysis and type of qualitative method used. We observed that the mean and median sample size were 728 and 401, respectively, with most samples including patients. We also observed that half of the studies used blocking and that Stata was the most commonly used software for econometric analysis. Interviewing was the most popular qualitative research method used alongside DCEs.

The observed increase in the total number of DCEs in health economics was similar to the trend reported in prior reviews [6, 7, 11], but less consistent from year to year (Fig. 3). This less consistent increase might be explained by the presence of many competing stated preference methods [4, 5, 347]. We hypothesise that other methods may be increasing in popularity or becoming more useful in health settings [348]. Examples of such methods may include BWS case 1 and case 2 [349,350,351], which were not included in this review. Additionally, in this review we excluded a substantial number of studies (n = 31) that addressed methodological considerations about DCEs rather than conducting empirical research. The presence of such studies may indicate that knowledge about DCEs in health has increased and that there is more focus on studies to develop the method. Examples include simulation studies about experimental design, studies comparing the outcomes of a DCE with those of other stated preference methods and studies examining different model specifications [352,353,354]. This might be another explanation for the less consistent increase in DCE application studies.

The common use of fractional designs, as described in prior reviews [6, 7], has continued. This review also found that main-effects DCEs continue to dominate; however, there is a downwards trend as DCE designs incorporate two-way interactions more often. This is in line with the recommendations of Louviere and Lancsar [12], who suggest that the inclusion of interaction terms should be explored in the experimental design stage. Ngene became the most popular software tool in the current review period for generating experimental designs, while D-efficient designs became the most popular method to create choice sets. Perhaps as a consequence of the rise in software-generated designs, this review also showed that an increasing percentage of articles did not include information about experimental design features such as the design plan. Omitting this type of information might inhibit quality assessment and reduce confidence in the results. Future research might focus on the specific reasons why such information is missing and the impact of the missing information on quality assessment of DCEs. One potential reason for omitting methodological details is journal word limits. When confronted with a low word limit, authors should consider using online supplementary material to report additional design and analysis details.

In addition to these observations about the generation of experimental designs, we identified design information that would be helpful to report in DCEs and future systematic reviews. For example, prior reviews did not include information about blocking, and although at least half of the DCEs we reviewed used blocking, 30% of studies did not report whether blocking was used. Blocking could be an important technique in light of the growing literature about the cognitive burden of DCEs and the impact of this burden on respondent outcomes [345]. However, blocking also has the disadvantage of requiring a larger sample size [345]. The approach described by Sándor and Wedel [355] might be another alternative for increasing the validity of DCE outcomes when sample sizes are relatively small or when preference heterogeneity is investigated.

Prior reviews identified a shift to more flexible econometric analysis models [6, 7], which is not necessarily positive. This trend has continued in this review. Most studies included multinomial logit or mixed logit models. Although we did not formally extract information about variance estimation, we noted that among the DCEs using multinomial logit models to analyse choice data, few reported robust or Huber-White standard errors (most studies reported ‘regular’ standard errors). Because these standard errors relax the assumptions of homoskedastic and independent errors, it is common in economics and econometrics to report them instead of ‘regular’ standard errors [356]. Moreover, in the presence of repeated observations from the same individuals, conventional standard errors are biased downward [357]. Thus, future DCEs in health economics could benefit from more appropriate treatment of clustered data (i.e. use of robust standard errors clustered at the respondent level) and more complete reporting of econometric output.
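
As a concrete illustration of this reporting issue, the sketch below contrasts ‘regular’ and cluster-robust standard errors on simulated repeated choices (illustrative Python using a binary logit as a stand-in for a two-alternative DCE; the data-generating values and variable names are invented):

```python
# Compare conventional vs. respondent-clustered standard errors on simulated
# two-alternative choice data (10 repeated choice tasks per respondent).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n_resp, n_tasks = 200, 10
df = pd.DataFrame({
    "resp_id": np.repeat(np.arange(n_resp), n_tasks),
    "d_cost": rng.normal(size=n_resp * n_tasks),  # attribute differences
    "d_risk": rng.normal(size=n_resp * n_tasks),  # between alternatives A and B
})
alpha = rng.normal(scale=1.0, size=n_resp)        # respondent-level effect,
utility = (alpha[df["resp_id"]]                   # induces within-person
           - 0.8 * df["d_cost"] - 0.5 * df["d_risk"])  # correlation
df["chose_A"] = (rng.logistic(size=len(df)) < utility).astype(int)

X = sm.add_constant(df[["d_cost", "d_risk"]])
model = sm.Logit(df["chose_A"], X)
naive = model.fit(disp=False)                     # 'regular' standard errors
robust = model.fit(disp=False, cov_type="cluster",
                   cov_kwds={"groups": df["resp_id"]})  # clustered by respondent

print(naive.bse)   # conventional standard errors
print(robust.bse)  # cluster-robust standard errors
```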

In terms of analytical methods, we also observed some patterns in the exploration of preference and scale heterogeneity. We noted that, among the 39% of studies that used a mixed logit model, many treated heterogeneity as a nuisance, i.e. they used the mixed model to accommodate repeated measures but did not report additional information about the ‘mixed’ aspect of the data (e.g. standard deviation estimates). Since preference heterogeneity is regarded as an important aspect within choice modelling, taking full advantage of the modelling results might help us understand preference heterogeneity better [358]. With regard to scale heterogeneity, work by Fiebig et al. [346] indicated that other models such as the generalised multinomial logit and heteroskedastic multinomial logit models could be considered when analysing DCE data, to identify differences in scale when comparing preferences between groups of respondents [359]. Data from this review identified a small number of DCEs using such methods; for a more detailed breakdown, we refer readers to another review focussing on scale heterogeneity specifically [30]. However, it is important to mention that the generalised multinomial logit model should be used with caution since the ability of this model to capture scale heterogeneity has been questioned in the literature [360].
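
For readers unfamiliar with these models, the generalised multinomial logit of Fiebig et al. [346] can be written in its commonly cited form (shown here as background; parameterisations vary across software and papers):

$$\beta_n = \sigma_n \beta + \left[\gamma + \sigma_n (1 - \gamma)\right] \eta_n,$$

where $\sigma_n$ is an individual-specific scale term, $\eta_n$ captures taste heterogeneity around the mean $\beta$, and $\gamma$ governs how the variance of the taste heterogeneity scales with $\sigma_n$; the mixed logit is recovered when $\sigma_n = 1$.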

Articles by Vass and Payne [19] and Mott [20] describe issues influencing the degree to which DCE findings are used in healthcare decision-making (e.g. health-state valuation and health technology assessment). These articles, the rising popularity of the method, and interest from regulators and funders suggest that DCEs could play an important role in real-world decision-making [361, 362]. However, concerns have been expressed about the validity, reliability, robustness and generalisability of DCEs [11, 363]. A key step in understanding the robustness of DCEs is determining whether stated preferences reflect ‘true’ preferences as revealed in the market [10]. In this study, we observed that the number of studies testing external validity remained small. Future research should focus on identifying and resolving the methodological and practical challenges involved in validity testing, and on guiding the incorporation of DCEs into actual decision-making in healthcare. Another practice that may improve the robustness of DCEs and facilitate their use in healthcare decision-making is the increased use of qualitative methods to complement quantitative DCE analysis [363]. Prior reviews and additional literature suggest that qualitative research methods can strengthen DCEs and other quantitative methods by facilitating numerous investigations, such as (1) identification of relevant attributes and levels, (2) verification that respondents understand the presented information, and (3) learning about respondents’ decision strategies [6, 7, 11, 364]. These investigations can help determine whether respondents are making choices in line with the underpinning utility theories, thereby supporting the legitimacy of the underlying assumptions. This review showed an overall upwards trend in the number of DCEs using qualitative methods to select attributes and levels. This move towards a more mixed-methods approach has been observed by others, for example, in the study by Ikenwilo et al. [365].

4.1 Strengths and Limitations

The current study has several strengths. First, the detailed data extraction was completed by each author individually, with the total number of articles divided approximately equally among authors because of the relatively short timeframe and the need to balance author burden with study quality. Additionally, a subsample of studies (20%) was double-checked by one author (V.S.) for quality control, which enhanced reliability. Second, this study identified trends in empirical DCEs by comparing outcomes from all prior reviews. Additionally, this study included aspects of empirical DCEs not investigated before, although these aspects were recognised in the literature as becoming more important in DCE research (e.g. blocking in experimental design and the type of qualitative methods used in a DCE). Third, our observation of less rapid growth in the number of empirical DCEs (compared with the growth observed in previous reviews) matches the trend in preference research to focus on the broad range of stated preference methods available (rather than DCEs exclusively) [4, 5, 347].

A potential weakness of this study was the use of multiple reviewers with potentially different interpretations of DCE reports, which might have affected the data extraction and, as a consequence, the results presented. To limit inconsistency between reviewers, all co-authors discussed the data extraction frequently, and results were cross-validated by a single author (V.S.). Similar inconsistencies in interpretation may also have occurred between the different review periods. Procedural information from the two most recent reviews was used to ensure consistency, and we are therefore confident that the general trends reported, and the conclusion that more detailed methods reporting is called for, hold. Another potential weakness is the use of only one database (PubMed). However, like the authors of the prior reviews [6, 7], we do not expect the review findings would differ significantly had searches been performed on other databases. Also, since we were interested in identifying trends, and therefore in maximising comparability between the different reviews, we preferred to restrict our searches to this single database. As with many systematic reviews, data were extracted from published manuscripts and online appendices. The results are therefore reliant on what was reported in the final article and do not necessarily reflect all activities of the authors. Trends presented could therefore reflect factors such as publication bias, journal scope, editor preferences and word limits, rather than actual practice. Additionally, although we did update the data extraction tool based on changes in the field, future research might benefit from updating other aspects of the systematic review protocol, such as search terms and inclusion and exclusion criteria (e.g. inclusion of best–best scaling). Finally, although we believe that DCEs are both useful and common enough to deserve focused attention in this review, DCEs represent one method among many for examining health preferences, and other methods may be preferable depending on the circumstances [4].

5 Conclusion

This study provides an overview of the applications and methods used in health-related DCEs. The use of empirical DCEs in health economics has continued to grow, as have the areas of application and the geographic scope. This study identified changes in experimental design (e.g. more frequent use of D-efficient designs), analysis methods (e.g. mixed logit models most frequently used), validity enhancement (e.g. more diverse use of internal validity checks), qualitative methods (e.g. an upwards trend in qualitative methods used for attribute and level selection) and outcome measures (e.g. coefficients most frequently reported). However, we also identified a large number of studies that did not report methodological details. Reports of DCEs should include more complete information, for example, about design generation, blocking, model specification, random-parameter estimation and model results. Developing reporting guidelines specifically for DCEs might positively impact quality assessment, increase confidence in the results and improve the ability of decision-makers to act on the results. How and when to integrate health-related DCE outcomes into decision-making remains an important area for future research.