
Master Thesis

Transparency in Measurement

Reviewing 100 Empirical Papers Using the Hamilton Depression Rating Scale

Linus J. Neumann

Student ID: s2616742
Supervised by Eiko I. Fried, Ph.D.

University of Leiden

Faculty of Social & Behavioural Sciences
Institute of Psychology
Field of Clinical Psychology

03/08/2020


Table of Contents

1. Introduction
   1.1 Transparency About Research Processes in Social and Behavioural Sciences
       1.1.1 Transparency is Crucial to Counter Questionable Research Practices (QRPs)
   1.2 Transparency About Measurement Practices
       1.2.1 Required Information About Measurement in the Field of Psychological Research
   1.3 The Particular Case of Depression Measurement
       1.3.1 The Importance of Valid and Transparent Depression Measurement
       1.3.2 Different Operationalization of Depression Amongst Measures
       1.3.3 Transparency in the Field of Depression Measurement
   1.4 The Hamilton Depression Rating Scale (HDRS)
   1.5 Research Question
2. Methods
   2.1 Review Procedure
   2.2 Introduction to the Used Coding Scheme
3. Results
4. Discussion
   4.1 Limitations
   4.2 Conclusion
   4.3 Implications
References
Appendices
   Appendix A: List of Reviewed Papers
   Appendix B: Supplementary Information on the Results


Abstract

The replication crisis has undermined the credibility of the social and behavioural sciences. Transparency about the various steps taken in the research process, from data collection and analysis to reporting results, is regarded as crucial to counter questionable research practices (QRPs). Complete information about how a study was conducted is needed to evaluate the validity of findings and is necessary (but not sufficient) for replicable science. Despite the open science movement, the empirical clinical literature has not widely adopted open science standards. This review of 100 empirical papers from four established clinical journals critically vets the transparency of measurement practices, in particular defining the construct under study, citing the used measure, justifying the choice of measure, providing measure validity evidence, describing the quantification of the measure and outlining potential measure modifications. Measurement transparency is examined for the case of major depression, one of the most prevalent mental disorders, measured with the Hamilton Depression Rating Scale (HDRS), one of the most commonly used depression measures. The results show that a lack of transparency led to the omission of crucial information, preventing readers from properly vetting the validity of many study results. Twenty-two percent of papers did not contain any definition of depression, and about half of the definitions were highly ambiguous. In 31% of papers, the reference to the used HDRS version (e.g. the 17 or 24 item version) remained unclear, including four incorrect citations that did not match the stated HDRS version. Nine citations were ambiguous. Ninety-one percent of publications did not contain any justification for choosing the HDRS over other depression scales, and in 84% of cases, no validity evidence was provided. Finally, the review concludes that QRPs have to be countered by promoting the importance of transparent measurement, addressing the crisis of theory and adopting open science standards. This is important not only to ensure the replicability and credibility of findings, but also to bolster the validity of the field of depression measurement and of psychological research as a whole.


1. Introduction

Pashler and Wagenmakers (2012) drew attention to the so-called replication crisis in the social and behavioural sciences, which undermines the credibility of the field. Their influential publication concluded with the following statement:

Having found ourselves in the very unwelcome position of being (to some degree at least) the public face for the replicability problems of science in the early 21st century, psychological science has the opportunity to rise to the occasion and provide leadership in finding better ways to overcome bias and error in science generally. (p. 529)

This master thesis aims to contribute insightful information to the debate about open science standards, which promote collective actions and steps towards valid and replicable science in the field. By reviewing 100 empirical papers, we critically vet the lack of transparency about measurement practices as a source of bias and error in the context of the replication crisis, and investigate the extent to which ambiguity about steps in the measurement process obscures information that is needed to determine the credibility of study results.

1.1 Transparency About Research Processes in Social and Behavioural Sciences

In the context of the open science movement, several requirements concerning an improvement of transparency about research processes have been proposed. Such requirements include the publication of materials, data and analysis scripts. Further, conducting power analyses and preregistering hypotheses and analysis plans have been promoted widely for some time now. In addition, recent technological developments simplify the implementation of practices promoting scientific transparency. In this light, it might seem surprising that researchers and reviewers in many fields of the social and behavioural sciences still criticize research practices that do not promote such transparency. Frequent criticisms include not mentioning non-confirmed results, or only partial reporting of methodological details and/or statistical analyses. Evidently, the range of criticisms regarding transparency about the various steps taken in the research process, from data collection, to data analysis, to reporting results, is still wide. Even today, transparency in the social and behavioural sciences still needs improvement (Asendorpf et al., 2013; Chan, Hróbjartsson, Haahr, Gøtzsche, & Altman, 2004; Cybulski, Mayo-Wilson, Grant, Corporation, & Monica, 2016; Flake, Pek, & Hehman, 2017; Grahe, 2018; Hales, Wesselmann, & Hilgard, 2018; Miguel et al., 2014; Steegen, Tuerlinckx, Gelman, & Vanpaemel, 2016).


About a decade ago, community attention was drawn to the problematic circumstance that findings in the social and behavioural sciences proved difficult to verify. This so-called replication crisis has the potential of undermining the credibility of the whole field (Flake & Fried, 2019; Pashler & Wagenmakers, 2012). Apart from methodological and statistical practices, a lack of transparency about the various steps taken in the research process is regarded as an influential source of bias and error. In this context, underreporting of important information makes it difficult to identify the reason for a lack of replicability. In addition, reporting information about the research process is crucial for evaluating the validity of study results. Hence, a lack of transparency undermines the credibility of findings (Flake & Fried, 2018; Flake & Fried, 2019; Flake et al., 2017; Hales et al., 2018; Pashler & Wagenmakers, 2012).

On a metaphorical level, the scientific research process can be compared to a garden of forking paths. Scientific investigations consist of multiple steps, and at each step researchers have options to choose from, resulting in researcher degrees of freedom. Each intermediate step in the research process is a potential source of bias and error, as new important information is generated. What happens if a researcher withholds this information? Questions and misconceptions emerge easily, and both the replicability and credibility of findings become undermined. Therefore, sufficient reporting is an indispensable characteristic of credible and replicable research, allowing readers to fully track and comprehend the entire research process. Accordingly, researchers are clearly advised to be as transparent as possible about their research practices (Flake & Fried, 2018; Flake & Fried, 2019; Gelman & Loken, 2014).

1.1.1 Transparency is Crucial to Counter Questionable Research Practices (QRPs)

Questionable research practices (QRPs) are methodological and statistical practices that cause bias, resulting in both undermined replicability and undermined credibility of study results. Examples of QRPs include selectively reporting significant results, not reporting all of a study's conditions, and collecting more data after checking whether the results were significant. QRPs increase the likelihood of supporting a hypothesis in an untrustworthy way and therefore weaken the validity of a study. A lack of transparency about the multiple steps taken in the research process leads to a situation in which research practices become ambiguous and their impact on a study's findings cannot be evaluated. If transparency is lacking, it cannot be ruled out that a study and its findings are negatively affected by QRPs. Furthermore, it is difficult to replicate findings from a study that is affected by QRPs. Accordingly, transparency about the way a study was conducted is crucial to counter QRPs (Agnoli, Wicherts, Veldkamp, Albiero, & Cubelli, 2017; Flake & Fried, 2019; John, Loewenstein, & Prelec, 2012; Sijtsma, 2016).

John et al. (2012) published a survey of prevalence estimates of QRPs, revealing that they are widely used amongst psychologists. Approximately 90% of 2155 participating academic psychologists at major U.S. universities admitted having engaged at least once in one of 10 listed QRPs. A more recent survey by Agnoli et al. (2017) showed similar results for a different sample of psychologists using the same questionnaire on QRPs. Approximately 90% of 277 psychologists involved in research institutions in Italy also admitted having engaged at least once in one of the 10 presented QRPs.

1.2 Transparency About Measurement Practices

Although measurement is a fundamental part of science, the field is still relatively unexamined in terms of transparency about research practices. Reporting information about the various steps taken in the measurement process is indispensable to counter questionable measurement practices (QMPs). QMPs form an independent source of bias and error in the research process. They are a subcategory of QRPs and can also contribute to obtaining desired results in an untrustworthy way. Examples of QMPs include using measures without reference to their source, choosing measures without providing a justification, and modifying measures in an unacceptable way. A lack of transparency about steps taken in the measurement process creates circumstances in which measurement practices become ambiguous. This ambiguity undermines both the replicability and credibility of findings, as one cannot rule out that a study is affected by QMPs, which in turn would mean that measurement was poor. Poor measurement limits the inferences of a study so that the credibility of findings becomes undermined. In addition, it is difficult to replicate findings from a publication in which questions about measurement remain unanswered or in which measurement was poor (Flake & Fried, 2018; Flake & Fried, 2019; Flake et al., 2017; Hales et al., 2018).

The surveys on QRPs by John et al. (2012) and Agnoli et al. (2017) contained just one item about questionable measurement practices: in a paper, failing to report all of a study's dependent measures. The self-admission rate for this item was 63.4% among 486 [...] (Agnoli et al., 2017). These rates go along with the assumption of Flake and Fried (2019) that QMPs are ubiquitous. Approximately half of all participants in each of the abovementioned studies admitted using this QMP. Considering that the respective item refers to only a single practice, failing to report all of a study's dependent measures, the urgency of investigating transparency about steps taken in the measurement process becomes clear.

1.2.1 Required Information About Measurement in the Field of Psychological Research

In the clinical sciences, there is no single construct for which exactly one globally accepted measure with good validity evidence, and without any degrees of freedom in its use, exists. Accordingly, transparency about measurement practices is crucial in this field to ensure both replicable and valid measurement. There are many steps in the measurement process that require sufficient reporting. Consequently, the whole process of assessment must be outlined clearly (Flake & Fried, 2018; Flake & Fried, 2019; Gelman & Loken, 2014; Hales et al., 2018).

Determining what is actually intended to be measured is the beginning of any measurement process. Accordingly, a clear definition of the construct under study has to be provided by referring to corresponding theoretical models. Otherwise it is not possible to choose or build an appropriate measure that fully captures this construct (Flake & Fried, 2019). Generally, clearly defining the construct of interest is difficult in the social and behavioural sciences. Several constructs have identical names but actually describe different phenomena in the real world, which is known as the jingle fallacy. In turn, differently named constructs often represent one and the same real-world phenomenon, which is categorized as the jangle fallacy. Both fallacies are summarized under the term construct identity fallacy (CIF), describing the ambiguity of construct definitions in the social and behavioural sciences. The CIF contributes to confusion and misunderstandings on researchers' behalf, which makes the process of literature research more difficult (Larsen & Bong, 2016). Furthermore, the CIF impedes the process of clearly conceptualizing a construct. Often, several theoretical frameworks for one and the same construct exist. This applies to constructs such as emotions, mental disorders, and personality traits (Bong, 1996; Flake & Fried, 2019; Larsen & Bong, 2016; Weidman, Steckler, & Tracy, 2016). Still, it is important to define the construct under study in order to choose an appropriate measure. This goes along with reviewing literature and theories or even with developing new theoretical models. Being transparent about a construct's definition is crucial to counter ambiguity in all subsequent steps in the measurement process (Flake & Fried, 2019).

Furthermore, the selection of a measure has to be justified. It is important to outline why the chosen instrument is considered the best option to measure the construct under study. Otherwise, authors cannot be assumed to have had a clear intention when choosing the respective instrument, nor can they be assumed to have considered other instruments at all. Picking a measure without critical consideration is poor measurement, as it remains unclear whether the construct under study was captured fully and in the best possible way (Flake & Fried, 2018; Flake & Fried, 2019; Flake et al., 2017).

Beyond this justification, specific characteristics of the chosen measure itself have to be provided. If several versions of an instrument exist, it has to be stated which one was used. Validity and reliability evidence is required, which implies referring to corresponding research or conducting independent analyses. A lack of the necessary information creates circumstances in which it cannot be ruled out that a measure's validity or reliability was low. In that case, the credibility of a study's findings becomes undermined and it is not possible to draw inferences about the measured construct (Cronbach & Meehl, 1955; Flake & Fried, 2018; Flake & Fried, 2019; Flake et al., 2017; Hussey & Hughes, 2019).

If quantitative actions go along with the use of a measure, any processes regarding the quantification of the measure have to be outlined clearly. This implies describing the way scores were computed and all conducted analyses. There are several ways of scoring, such as averaging items, computing a standardized score, or calculating a sum score or a factor score. Therefore, it is important to describe the quantification of a measure clearly (Flake & Fried, 2019).
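As a minimal illustration of why the scoring procedure needs to be reported, consider the following sketch in R (the item responses and the norm values are invented purely for illustration): the same responses lead to different reported numbers depending on the chosen scoring rule.

```r
# Hypothetical item responses of one participant on a 17-item depression scale
responses <- c(2, 1, 0, 3, 2, 1, 0, 2, 1, 1, 0, 2, 3, 1, 0, 2, 1)

sum_score  <- sum(responses)        # sum score, the most common choice
mean_score <- mean(responses)       # average of item responses
z_score    <- (sum_score - 10) / 5  # standardized against hypothetical norms (M = 10, SD = 5)

c(sum = sum_score, mean = mean_score, z = z_score)
```

Without an explicit description, a reader cannot tell which of these (or which other) computation produced a reported score.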

In case a measure gets modified before or after data collection, this has to be stated, justified and outlined further to ensure both the replicability and validity of study results (Flake & Fried, 2019; Weidman et al., 2016).

1.3 The Particular Case of Depression Measurement

For this review, the field of depression measurement was chosen as an example. It is well suited to underline the importance of transparency about measurement practices for ensuring both the replicability and credibility of findings. Several characteristics of the field, such as the number of existing measures and the variability in the conceptualization of depression, leave researchers with enormous degrees of freedom. Many steps, from defining depression, to choosing a measure, to computing and reporting scores, can be taken in various ways. These degrees of freedom result in an urgent need for transparency.

1.3.1 The Importance of Valid and Transparent Depression Measurement

Globally, an estimated 300 million people were affected by depression in 2015, equivalent to 4.4% of the world's population. In addition, depression is ranked as the single largest contributor to global disability and is also the major contributor to suicide deaths, with close to 800,000 deaths per year (World Health Organization, 2020). It is often comorbid with other mental disorders such as anxiety disorders, impulse control disorders and substance use disorders. Furthermore, depression goes along with a high level of functional impairment of the individual (Kessler et al., 2003). Although it is regarded as questionable to use scores of depression measures as the fundamental basis for diagnosing people or for enrolling them in research studies, such scores are commonly used to draw categorical inferences about depression. Cut-off values are often interpreted as indicators to judge someone as depressed or not. Accordingly, great significance is attributed to scores of depression measures in the field (Fried, 2015; Fried & Nesse, 2015). Depression measurement has far-reaching effects, such as a diagnosis of depression, the influence of this diagnosis on an individual's wellbeing, and subsequent steps such as possible treatment. It does not only lead to inferences about an individual's mental health status, but also impacts scientific findings in global health policy. All these consequences depend on the outcome of the chosen measure. Therefore, depression measurement must be valid, which is only possible if it is transparent.

1.3.2 Different Operationalization of Depression Amongst Measures

Depression is regarded as a psychological construct, like personality or intelligence, which are latent phenomena. Latent phenomena usually cannot be observed directly. Therefore, measurement is carried out with instruments in which the respective phenomenon is operationalized through responses to a specific set of items. The assessment of depression is based on symptomatology, meaning that the items of depression measures commonly describe symptoms. Assessing latent phenomena requires reliable and valid items. It has to be demonstrated transparently that the items and scores of an instrument are actually related to the proposed phenomenon, or in other words, that they do not measure something else (Cronbach & Meehl, 1955; Flake et al., 2017; Fried, 2015; Fried et al., 2016).


The conceptualization of depression varies greatly amongst psychological theories, making it a heterogeneous phenomenon. Depression is characterized by its multifactorial etiology and various symptoms and is known to be a multifarious diagnostic category (Fried, 2015; Kendler, 2012; Parker, 2005). Accordingly, depression is often regarded as a spectrum with several depressive subtypes in order to differentiate between homogeneous categories of depression. Many subtypes have been identified, such as anxious depression, agitated forms of depression, atypical depression, bipolar depression, catatonic depression, double depression (comorbid with dysthymia), hostile depression, melancholic depression, psychotic depression, reactive depression, seasonal depression and unipolar depression (Angst & Merikangas, 1997; Baumeister & Parker, 2012; Cuellar, Johnson, & Winters, 2005; Fava et al., 1997; Lynch, Gunning, & Liston, 2020; Pyszczynski & Greenberg, 1987; Ross, Eduard, Joseph, Mauricio, & Charles, 2010). Several subtypes overlap and, furthermore, do not differ in their response to treatment. Accordingly, it is questionable whether these subtypes significantly differ from each other (Arnow et al., 2015). Overall, the construct validity of depression is doubted in the clinical sciences. Neither a single conceptualization of depression nor a uniform theory has been globally accepted (Fried, 2015).

Construct validity plays a fundamental role in both original and replication research (Flake et al., 2017). Generally, it has to be known what is actually measured. A clear conceptualization of the construct under study is needed to choose an appropriate measure and finally measure it validly (Flake & Fried, 2019). This is challenging in the case of depression (Fried, 2015). The heterogeneity of the phenomenon is further reflected in the field of measurement. When the HDRS was introduced in 1960, Max Hamilton already stated: “The appearance of yet another rating scale for measuring symptoms of mental disorder may seem unnecessary, since there are so many already in existence and many of them have been extensively used.” (Hamilton, 1960, p. 56). Today, more than 280 measures for depression exist in total, developed over the past 100 years. They differ with regard to the content of items, response format and their objectives (Santor, Gregus, & Welch, 2006). This also applies to measures intended to measure the same type of depression. The example of the Beck Depression Inventory (BDI) and the HDRS, which both measure Major Depressive Disorder (MDD), shows that depression is captured differently across measures. MDD is represented through different sets of items, displaying different depressive symptoms. Physical symptoms of depression are represented more strongly in the HDRS, whereas the BDI focuses more on the individual's feelings and mood (Beck, Ward, Mendelsohn, Mock, & Erbaugh, 1961; Fried et al., 2016; Hamilton, 1960; Shafer, 2006). Evidently, there is jingle in the field of depression measurement (Fried, 2017). This makes it difficult to compare outcomes across scales. Furthermore, any individual symptom composition may lead to different outcomes across measures.

1.3.3 Transparency in the Field of Depression Measurement

Valid depression measurement is difficult to realize due to the attributes of the field. The sheer number of existing depression measures is itself questionable. Choosing the most appropriate measure, one that differs from more than 280 others, to assess depression validly in any individual case with any individual composition of symptoms requires critical consideration. Therefore, it is all the more important to be transparent about the choice of measure and to both consider and justify it critically. Lacking transparency about this choice and the corresponding justification leads to a situation in which there is no proof that the respective measure has been chosen for a good reason. This in turn means that it cannot be shown that choosing other instruments from the wide range was even considered. Picking any measure without critical consideration is poor measurement, as it cannot be guaranteed that depression has been captured in the best possible way (Flake et al., 2017; Flake & Fried, 2018; Flake & Fried, 2019; Fried, 2015; Fried, 2017; Santor et al., 2006). Furthermore, even using one and the same measure goes along with flexibility in its use, for example at the moment of scoring. Reporting a score of a well-established depression measure does not give insight into the way it was computed, as there are several ways to convert item responses into scores. There are even more possibilities in case a measure has subscales or various analytic techniques are allowed (Flake & Fried, 2019).

In the overall field of psychological research, measures are sometimes used without reference to their source and often lack validity evidence, so that it is not demonstrated that they actually measure the purported construct. In addition, measure modifications without any provided justification can be identified amongst publications (Barry, Chaney, Piazza-Gardener, & Chavarria, 2014; Flake & Fried, 2019; Hussey & Hughes, 2019; Weidman et al., 2016). According to corresponding reviews, the choice of measure in the field of depression measurement is more or less arbitrary and follows rules of thumb. Although several new instruments have been developed over time, researchers appear to rely on well-known ones that were created decades ago (Bagby, Ryder, Schuller, & Marshall, 2004; Santor et al., 2006).


1.4 The Hamilton Depression Rating Scale (HDRS)

The Hamilton Depression Rating Scale (HDRS) was created by Max Hamilton in 1960 and remains broadly used to measure depression today. The scale is observer-rated, meaning that an outside person judges a patient's symptomatology by responding to a respective set of items. Physical and behavioural features are more strongly represented than other symptoms, such as the patient's feelings (Bagby et al., 2004; Hamilton, 1960; Worboys, 2013). Compared to other depression measures, these are essential distinctive characteristics of the HDRS. The BDI, for example, is a self-report scale and focuses less on somatic features and more on the patient's thoughts and feelings (Beck et al., 1961; Shafer, 2006; Worboys, 2013). The original HDRS version from 1960 was intended to quantitatively illustrate changes in the symptomatology of patients, respectively addicts, who had already been diagnosed with depression. It was not designed to serve as a diagnostic tool. Nevertheless, the scale came to be used as such over time. Furthermore, Max Hamilton expressed the importance of computing and reporting factor scores. The HDRS should serve to identify significant changes in particular groups of symptoms (e.g. insomnia, bodily symptoms). Therefore, a descriptive evaluation of these scores was promoted, rather than making global judgments about depression by focusing on an aggregated sum score (Bagby et al., 2004; Hamilton, 1960; Hamilton, 1967; Worboys, 2013).

Over time, the HDRS became widely known. Users began to modify the HDRS by adding or removing items, creating both longer and shortened adaptations. Several versions of the scale emerged; today, at least 20 have been published in total. They differ in the total number of items, the corresponding interpretations of item responses, and their instructions regarding score computation. Janet Williams (2001) provided an overview of published HDRS versions and their sources, showing that structured interview guides, self-report versions, and digital forms were also introduced. To this day, no single HDRS version has been universally acknowledged, and the same holds for global conventions regarding the use of the HDRS. Accordingly, versions are often used with wrong references to their source (Hamilton, 1960; Williams, 2001; Worboys, 2013; Zitman, Mennen, Griez, & Hooijer, 1990). Still, the HDRS developed into the gold standard for measuring depression. It was introduced in a period in which mental illnesses began to be treated outside of mental hospitals. In this context, the HDRS contributed to changing the view of depression from a lifelong disorder towards a treatable episode of illness. In addition, practical attributes of the scale contributed to its ubiquity: it is known to be simple and fast to use (Bagby et al., 2004; Worboys, 2013).

Overall, the psychometric properties of the HDRS are debated in the field. On the one hand, the scale's internal reliability, interrater reliability, retest reliability, convergent validity, discriminant validity, and predictive validity are adequate. On the other hand, several items of the scale do not measure the severity of depression well; they assess multiple constructs and are therefore actually ineffective. The format of several item responses appears to be problematic as well. Internal reliability at the item level is low in many cases, and interrater and retest coefficients are weak (Bagby et al., 2004). This creates circumstances in which the generated sum score of the scale has no clear meaning, as it is multidimensional. In addition, the scale has not shown one clear underlying factor structure across studies, which should not be the case for a well-constructed instrument measuring a clearly defined construct (Bagby et al., 2004; Williams, 2001). In contrast, a general depression cluster of the HDRS appears to be stable across different cultures, underlining that the psychometric properties of the HDRS remain debated and ambiguous (Vindbjerg, Makransky, Mortensen, & Carlsson, 2019). Because the HDRS was designed in 1960, its representation of depression no longer corresponds to the current version of the DSM. Therefore, the HDRS actually requires a revision, and its use in the clinical sciences has to be well considered (Bagby et al., 2004; Hamilton, 1960; Williams, 2001).

1.5 Research Question

The aim of this review was to investigate how transparent researchers are about the various steps taken in the process of measuring depression with the HDRS (defining depression, justifying the choice of the HDRS, outlining the process of score computation) and how transparently they report important information about the HDRS itself (stating the used version, using the correct reference to its source, providing validity evidence for it, outlining possible modifications).

2. Methods

A review of 100 empirical papers from four established clinical journals (listed in appendix A) was conducted to study transparency in measurement, using the example of depression, measured with the HDRS.


2.1 Review Procedure

Publications were selected using the following procedure, summarized in figure 1. The keyword profile “ALL = (HDRS* OR HAMD* OR Hamilton Depres*)” was used on the Web of Science platform to search for articles published within the last five years in the Journal of Affective Disorders, the Journal of Psychiatric Research, Psychiatry Research, and/or BMC Psychiatry. This search resulted in 396 hits (table 1).

Figure 1

Procedure of Selecting 100 Publications for This Review (flowchart: Web of Science search with the keyword profile ALL = (HDRS* OR HAMD* OR Hamilton Depres*), refined to articles published 2015-2020 in the Journal of Affective Disorders, Psychiatry Research, the Journal of Psychiatric Research, and BMC Psychiatry; 396 hits; applicability check excluding 76 publications that did not use the HDRS, were meta-analyses, or displayed psychometric research about the HDRS; 320 applicable publications)

Subsequently, every publication was checked for its applicability to this review by screening its title, abstract, and method section. A publication was judged as not applicable if at least one of the following criteria was met: 1. the HDRS is not used, 2. the publication is a meta-analysis, 3. the publication displays psychometric research about the HDRS. 76 publications were excluded according to these criteria. The remaining papers were screened for duplicates, but none emerged, leading to 320 publications that were considered applicable for this review (table 1). Per journal, 25 publications were selected randomly in R. We set a seed (1337) and then selected 25 of 187 papers for the Journal of Affective Disorders via sample(187, 25), 25 of 72 papers for Psychiatry Research via sample(72, 25), 25 of 29 papers for the Journal of Psychiatric Research via sample(29, 25), and 25 of 32 papers for BMC Psychiatry via sample(32, 25).

Table 1

Distribution of Publications Amongst the Four Selected Journals Before and After the Applicability Check

Journal                            First Search (396)    After Applicability Check (320)
Journal of Affective Disorders     223                   187
Psychiatry Research                87                    72
Journal of Psychiatric Research    44                    29
BMC Psychiatry                     42                    32
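The random selection step can be reproduced from the reported seed and per-journal counts. The sketch below assumes the seed was set once before the four calls (the exact ordering is not stated in the text) and draws row indices into the per-journal lists of applicable papers:

```r
# Reproduce the per-journal random selection (seed and counts as reported above).
# Whether the seed was reset before each call is not stated; here it is set once.
set.seed(1337)
jad_idx <- sample(187, 25)  # Journal of Affective Disorders: 25 of 187
pr_idx  <- sample(72, 25)   # Psychiatry Research: 25 of 72
jpr_idx <- sample(29, 25)   # Journal of Psychiatric Research: 25 of 29
bmc_idx <- sample(32, 25)   # BMC Psychiatry: 25 of 32
```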

2.2 Introduction to the Used Coding Scheme

A coding scheme was developed to conduct both quantitative and qualitative analyses of transparency in measurement for the specific example of the HDRS. The “set of questions that researchers and consumers of scientific research can consider to identify and avoid [Questionable Measurement Practices]” was the groundwork for the development of the scheme (Flake & Fried, 2019, p. 12). The selected publications were screened with regard to six major categories and their respective subcategories (table 2). In total, there were 10 decisions to be made per paper according to the respective evaluation criteria for each subcategory. In general, “1” was coded if the respective information was provided; for the purpose of qualitative analysis, this information was noted. “0” was coded in case the respective information was not provided or did not fit the set evaluation criteria. The general intention was to code generously. In case of ambiguity about coding with regard to the set criteria, the qualitatively better coding alternative was chosen. This means that information was rather judged as provided and fitting the set criteria than as missing or inappropriate. Authors were generally assumed to have the intention of providing readers with the necessary information. In order to counteract subjectivity in the coding process, the evaluation criteria are outlined per subcategory of the used coding scheme in the following paragraphs.

Table 2

Coding Scheme Developed to Study Transparency in Measurement, Using the Example of the HDRS

Major Category                Subcategory
Construct Conceptualization   Definition of Depression
Important Measure Details     Citation of the Used HDRS Version
                              HDRS Version
Measure Selection             Justification of Choosing the HDRS
Measure Validity Evidence     General HDRS Validity Evidence
                              HDRS Validity Evidence for Sample
Measure Quantification        Description of Score Calculation
                              Description of HDRS Factor Scores
Measure Modification          Modification of the HDRS
                              Justification of Modifying the HDRS
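To make the scheme concrete, a single coded paper can be thought of as one row of ten decisions. The following sketch (with purely invented values and hypothetical variable names) illustrates the structure of the coding data:

```r
# Hypothetical coding of one reviewed paper (values invented for illustration).
# 1 = information provided and fitting the criteria, 0 = not provided or insufficient;
# for modification_of_hdrs: 0 = no modification, 1 = modified before, 2 = modified after data collection.
paper_coding <- data.frame(
  definition_of_depression      = 0,
  citation_of_hdrs_version      = 1,
  hdrs_version                  = 1,
  justification_of_choice       = 0,
  general_validity_evidence     = 0,
  validity_evidence_for_sample  = 0,
  description_of_score_calc     = 0,
  description_of_factor_scores  = 0,
  modification_of_hdrs          = 0,
  justification_of_modification = NA  # not applicable when the scale was not modified
)
```

For several subcategories an additional “ambiguous” code was later introduced (see the Results section).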

2.2.1 Construct Conceptualization

Definition of depression: Any construct under study must be defined clearly; otherwise it is not possible to choose or build an appropriate measure and capture the construct in the process of measurement (Flake & Fried, 2019). Therefore, it was checked whether depression was defined or not. The introduction section of each empirical paper was screened for a definition of depression. An appropriate definition was defined as a clearly stated conceptualization that goes beyond a phenomenological description of depression, such as stating its prevalence, listing depressive symptoms or describing its operationalization via criteria from diagnostic manuals like the DSM-5 or the ICD-10. A clear conceptualization provides the reader with information about the underlying mechanism of depression, which distinguishes it from other phenomena. Definitions that were clearly recognizable as such (e.g. “Depression is an affective disorder [...]”) and contained information about at least two underlying features (e.g. “Depression is an affective disorder, where a low mood and feelings of hopelessness are two of many underlying features [...]”) or, in the better case, references to respective theoretical frameworks, such as Beck's cognitive triad (Beck, 1979), were regarded as valid. Any description of the operationalization of depression, enumeration of a measure's items, prevalence estimates or discussion of just one specific underlying feature was regarded as insufficient.

2.2.2 Important Measure Details

Citation of the used HDRS version: It is important to use any of the existing HDRS versions with a correct reference in order to provide information about both its source and its content. This information is necessary to ensure the replicability and validity of findings. Both the reference and method section of each publication were screened for a citation of the used HDRS version. Furthermore, it was checked whether the respective citation matched the stated HDRS version. Correctness was judged with regard to an overview of 11 published HDRS versions and their sources (table 3), published by Janet Williams (2001).

Table 3

Different Versions of the HDRS and the Respective References According to Janet Williams (2001)

HDRS Version     Reference to its Source / Correct Citation
17 / 21 items    Hamilton, 1960
17 / 11 items    Bech et al., 1986
14 items         Potts et al., 1991
18 items         Carr et al., 1981
23 items         Reynolds & Kobak, 1995
24 items         Riskind et al., 1987
25 items         Miller et al., 1985
27 items         Gelenberg et al., 1990
29 items         Williams et al., 2000
31 items         Roberts et al., 2001


Any other provided citation (e.g. a citation of the Spanish version of the HDRS) that is not included in this overview was checked by screening the respective reference. Max Hamilton published multiple papers about the HDRS, but never introduced a new version of his scale himself. Accordingly, there is only one correct reference to the HDRS, which is Hamilton's publication from 1960 (Hamilton, 1960; Williams, 2001; Zitman et al., 1990). The other papers (e.g. Hamilton 1967, 1980, 1986) address the use of the HDRS in general and outline the meaning of single items. Several of these became very popular (Zitman et al., 1990); Hamilton's publication from 1986 is cited more than 30,000 times (Google Scholar, 2020). Since these papers by Max Hamilton refer to the original version from 1960 but are not the correct reference, it was decided to classify them as ambiguous. This only applied to cases in which the 17/21 item version was stated, since the original version of the HDRS consists of 17/21 items. Otherwise, the respective citations were classified as incorrect, as they did not match the stated version (Hamilton, 1960; Williams, 2001). In cases where information was given for only one of the subcategories “Citation of the used HDRS version” and “Description of the HDRS version” (e.g. a citation of Hamilton (1960) is provided but no information about the used version is given, or no citation is provided but the 17 item version is stated), the correctness of the citation could not be checked. Nevertheless, “0” was only coded for the subcategory with missing information, as it was intended to code generously. It was decided not to choose the qualitatively poorer coding alternative for both subcategories “Citation of the used HDRS version” and “Description of the HDRS version”, because authors provided information for at least one of them.
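This matching step can be pictured as a simple lookup against the overview in table 3. The following sketch (with hypothetical variable names) illustrates how a stated version and a provided citation were compared:

```r
# Hypothetical lookup based on the overview by Williams (2001) shown in table 3
correct_citation <- c(
  "17/21 items" = "Hamilton, 1960",
  "17/11 items" = "Bech et al., 1986",
  "14 items"    = "Potts et al., 1991",
  "18 items"    = "Carr et al., 1981",
  "23 items"    = "Reynolds & Kobak, 1995",
  "24 items"    = "Riskind et al., 1987",
  "25 items"    = "Miller et al., 1985",
  "27 items"    = "Gelenberg et al., 1990",
  "29 items"    = "Williams et al., 2000",
  "31 items"    = "Roberts et al., 2001"
)

# Example: a paper states the 24 item version but cites Hamilton (1960)
stated_version    <- "24 items"
provided_citation <- "Hamilton, 1960"
correct_citation[[stated_version]] == provided_citation  # FALSE, i.e. an incorrect citation
```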

Description of the HDRS version: It was checked whether the used HDRS version was stated beyond a provided citation (e.g. “The HDRS 17 item version was used”).

2.2.3 Measure Selection

Justification of choosing the HDRS: The choice of using the HDRS to measure depression must be justified in order to show that choosing other instruments was considered. This is important to counteract poor measurement in terms of measure selection (Flake & Fried, 2019). Any kind of provided justification was regarded as sufficient if it could be clearly recognized as such (e.g. “We chose the HDRS, because [...]”). A clearly worded justification is necessary to show that authors had the real intention of choosing the HDRS. Detached reliability or validity evidence was not regarded as a valid justification unless it was clearly stated that the psychometric properties of the HDRS were the reason for choosing the scale (e.g. “The HDRS has good reliability” was not regarded as a valid justification, whereas a justification like “We chose the HDRS due to its good reliability” was classified as valid). For this subcategory, a special focus was on the qualitative analysis, since the aim was to gain clear insight into the kinds of justification that authors provide.

2.2.4 Measure Validity Evidence

General HDRS validity evidence: It was investigated whether authors provided general validity evidence for the HDRS. Overall, many definitions of validity exist. The selected 100 papers were screened for structural and external validity evidence. Reliability is not sufficient, but necessary, validity evidence, belonging to the structural phase of the validation process. It shows to what extent an instrument measures consistently and with low measurement error. Something that is not measured reliably cannot be valid (Flake et al., 2017). Evidence for scale reliability (e.g. coefficients like Cronbach's alpha, inter-item correlations, test-retest reliability) is the minimum required to support the validity of a measure. Accordingly, such evidence was regarded as acceptable in this review. External validity describes the accuracy of measurement, as it shows that an instrument clearly measures the construct under study and not something else; in the case of the HDRS, it needs to be shown that depression is measured. External validity evidence for the HDRS requires citing or conducting respective analyses, for example of convergent validity (measurement data of the HDRS are highly correlated with data of other instruments that also measure depression) or discriminant validity (measurement data of the HDRS should correlate only slightly or not at all with data of other instruments that measure different constructs) (Flake et al., 2017). Merely citing the HDRS was not regarded as sufficient validity evidence, as it does not show that authors intentionally provided validity evidence.
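To illustrate what such evidence looks like in practice, the sketch below computes two of the coefficients named above from simulated data (no real study data are implied; the second instrument is an invented comparison measure):

```r
# Illustrative reliability and convergent-validity coefficients from simulated data
set.seed(1)
n <- 50
hdrs_items <- matrix(sample(0:4, n * 17, replace = TRUE), nrow = n)  # 17 hypothetical item responses

# Cronbach's alpha: k/(k - 1) * (1 - sum of item variances / variance of the total score)
k <- ncol(hdrs_items)
alpha <- (k / (k - 1)) * (1 - sum(apply(hdrs_items, 2, var)) / var(rowSums(hdrs_items)))

# Convergent validity: correlation of HDRS sum scores with a second (simulated) depression measure
other_measure <- rowSums(hdrs_items) + rnorm(n, sd = 5)
convergent_r  <- cor(rowSums(hdrs_items), other_measure)

c(alpha = alpha, convergent_r = convergent_r)
```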

HDRS measure validity evidence for sample: Given the fact that a measure’s psychometric properties can vary across samples (e.g. a sample of severe psychiatric inpatients vs. healthy students), each publication was also checked on specific validity evidence for the study sample. The same evaluation criteria as for general validity evidence applied for this subcategory.


2.2.5 Measure Quantification

Description of score computation: Using a measure goes along with flexibility in computing scores. There are several ways to convert item responses into scores, and these multiply if a measure consists of several subscales. Therefore, the process of scoring has to be outlined in order to give sufficient insight into the way a score was computed and to make it reproducible (Flake & Fried, 2019). In this review, reproducible was defined as meaning that any outside person would be able to recalculate the exact same score from a study's dataset.

Description of HDRS factor scores: Max Hamilton underlined the importance of evaluating the HDRS factor scores in order to expose the full depression symptom profile, rather than judging depression severity globally by focusing on the aggregated sum score (Bagby et al., 2004; Hamilton, 1960; Hamilton, 1967; Worboys, 2013). Therefore, it was checked whether authors reported scores for the respective subscales of the HDRS.

2.2.6 Measure Modification

Modification of the HDRS: It was checked whether the used HDRS version was modified in any way. Modifying a measure can, for example, change its psychometric properties, which makes measure modifications an important subject to be transparent about (Flake & Fried, 2019). Coding alternatives for this subcategory ranged from “0” to “2”: no modification was coded “0”, modifications made before data collection were coded “1”, and modifications made after data collection were coded “2”.

Justification of modifying the HDRS: Furthermore, it was investigated whether authors provided any justification for modifying the HDRS.

3. Results

The results of the quantitative analyses are presented per subcategory of the used coding scheme in figure 2. Both quantitative and qualitative analyses are outlined further in the following sections. Supplementary information on the results is summarized in appendix B. For several subcategories, the coding alternative “ambiguous” was introduced during the review process, as it proved difficult to stick to the set evaluation criteria and to the general intention of coding generously at the same time.


Figure 2

Results of the Quantitative Analyses for the 100 Reviewed Publications (Presented in Percent per Subcategory of the Used Coding Scheme)


3.1 Construct Conceptualization: Definition of Depression

Twenty-two percent of the 100 reviewed publications did not contain any definition of depression, whereas 26% did (figure 2). Of these, six definitions contained references to theories or research about the underlying mechanisms of depression. In all of these six cases, underlying biological mechanisms were stated:

1. “Biological genetic changes” (Liang et al., 2018, p. 314)

2. “The neurotrophic hypothesis of depression (Duman and Monteggia, 2006)” (Khan et al., 2019, p. 108)

3. “The monoamine hypothesis of depression (Asberg et al., 1976)” (Bai et al., 2017, p. 296)

4. “Biological [...] pathways (Lett et al., 2004; Kuehl et al.,2012)” (Rahe et al., 2016, p. 164)

5. “Genetic (Seifuddin et al., 2012) [and] biological [factors] (Maletic and Raison, 2014)” (Mrad, Krir, Ajmi, Gaha, & Mechri, 2016, p. 173)

6. “Disruption of neural connectivity, inflammation, and hypoperfusion [...] (Taylor et al., 2013)” (Chen et al., 2019, p. 133).

Four of these six definitions additionally contained references to psychological or behavioural mechanisms, namely:

1. “Psychological factors” (Liang et al., 2018, p. 314)

2. “Cognitive theories of depression (Beck et al., 1979)” (Bai et al., 2017, p. 296)

3. “Behavioral pathways (Lett et al., 2004; Kuehl et al., 2012)” (Rahe et al., 2016, p. 164)

4. “Psychosocial factors (Alloy et al., 2005)” (Mrad, Krir, Ajmi, Gaha, & Mechri, 2016, p. 173).

The definition that best fit the set evaluation criteria was provided by Bai et al. (2017):

Depression is a common, chronic psychiatric disorder, which is a serious burden to patients, their family and society. However, the mechanism of depression is still unclear, although a lot of hypotheses have been proposed for many years, such as the monoamine hypothesis of depression (Asberg et al., 1976) and cognitive theories of depression (Beck et al., 1979). Cognitive theories of depression hold that cognitive biases in information processing play a crucial role in the etiology and maintenance of depressive disorders. The cognitive biases consider that depressed patients partially remember, perceive and attend to affectively negative materials or selectively attenuate the information processing for positive materials. In addition, cognitive biases in depressed patients have been associated with increased risk of relapse (Bouhuys et al., 1999). (p. 296)

The remaining 20 of the 26 provided definitions displayed phenomenological descriptions of at least two features of depression, for example a “low mood” and a “loss of interest” (Yoon, Hattori, Sasayama, & Kunugi, 2018, p. 134) or “guilty feelings” and a “low self-esteem” (Zhao et al., 2019, p. 25).

Fifty-two percent of cases were classified as ambiguous (figure 2). Definitions displayed combinations of descriptive aspects of depression (e.g. prevalence estimates, operationalization), which were individually considered as insufficient to define depression. In order to differentiate between papers in which no information was given at all and papers in which authors at least defined certain aspects of depression, these 52 cases were classified as ambiguous. Overall, six categories of provided characteristics of depression could be identified. Ambiguous definitions mostly contained information from at least two of these.

1. Outlined consequences of depression:

For example: “Major depressive disorder accounts for significant global morbidity, including medical comorbidities, mortality, and disability (Belmaker and Agam, 2008)” (Kostić et al., 2017, p. 66)

2. Outlined information about the treatment of depression:

For example: “Current treatment approaches rely primarily on antidepressant medications and psychotherapy based on clinician's choice, preference, and experience.” (Greden et al., 2019, p. 59)

3. Outlined details about one highly specific aspect (mostly a symptom) that is associated with depression:

For example: “Depression is commonly associated with an imbalanced autonomic nervous system (ANS), which occurs due to a reduced parasympathetic and an increased sympathetic drive. This imbalance is reflected by changes of heart rate variability [...].” (Pawlowski et al., 2017, p. 64)

4. Outlined interrelation between depression and another clinical phenomenon:

For example: “It has been reported that the prevalence rates of depression in schizophrenia range from 30% to 70% (Peitl et al., 2017). The depressive symptoms in schizophrenia patients are often associated with overall poorer functional outcomes, lower quality of life [...].” (Fang et al., 2019, p. 1)


5. Outlined prevalence estimates of depression:

For example: “Major depression effects 5–19% of the population, and it is the most frequent psychiatric disorder (Kessler et al., 2003).” (Camkurt, Fındıklı, İzci, Kurutaş, & Tuman, 2016, p. 81)

6. Outlined operationalization of depression in respective diagnostic manuals:

For example: “Bipolar Disorder (BP) is a mood disorder which definition and inclusion criteria are similar in the DSM-IV and 5 (“presence of five of nine diagnostic symptoms with a minimum duration of 2 weeks and a change from previous functioning”).” (Di Giacomo et al., 2017, pp. 90-91)

3.2 Important Measure Details: Citation of the Used HDRS Version

In 31% of the reviewed publications, the source of the used HDRS version remained unclear. In four of these 31 cases, an incorrect citation was provided; these four citations did not match the stated HDRS version. Nine percent of citations were classified as ambiguous, of which seven were citations of Hamilton's publication from 1967 and two referred to Hamilton's paper from 1980. In 60% of publications, a correct citation for the used HDRS version was provided (figure 2).

3.3 Important Measure Details: Description of the HDRS Version

Fifteen percent of the reviewed papers did not contain information about the used HDRS version. In 82% of cases, the used HDRS version was stated clearly. For 3% of papers, it remained ambiguous which version was used (e.g. only one item was used, but it remained unclear from which version this item was taken) (figure 2).

3.4 Measure Selection: Justification of Choosing the HDRS

Ninety-one percent of papers did not contain any justification of choosing the HDRS over other depression scales. In 8% of cases, the justification was hinted at (figure 2):

1. “The Hamilton Depression Rating Scale (HDRS) (Hamilton,1960), an observer-rated scale, and the Beck Depression Inventory (BDI) (Beck et al., 1961), a self-report inventory, are the most commonly used rating scales of depression.” (Suzuki et al., 2016, p. 49)


2. “The HDRS is a validated 17-item rating scale that has been widely applied in psychiatric studies to measure the severity of depressive symptoms and also has been used to evaluate treatment efficacy and severity of depressive symptoms in several clinical trials in Iran.” (Alamdarsaravi et al., 2017, p. 60)

3. “The HRSD is one of the most widely used tools for depression assessment due to its good psychometric properties [...].” (Zeng et al., 2016, p. 57)

4. “The primary outcome measures were The Hamilton Rating Scale for Depression (HRSD) [...] and Beck Depression Inventory (BDI-II) [...] administered to assess both clinician- and self-evaluations of symptoms.” (Jonassen et al., 2019, p. 3)

5. “It is the most frequently used instrument to measure changes in depressive symptoms in clinical trials, especially in patients with more severe depression [...].” (Kruisdijk, Hopman-Rock, Beekman, & Hendriksen, 2019, p. 4)

6. “It is widely used in research on mood disorders and in clinical practice.” (Liu et al., 2017, p. 3)

7. “[The SIGH SAD] is the current benchmark for assessment of severity of depression in light therapy trials. [The HAM-D] is more commonly used in clinical practice and research.” (Bais et al., 2016, p. 8)

8. “The HAMD is a widely used structural interview for assessment of depressive symptoms and has been used during pregnancy and in the postpartum period” (Nishi et al., 2016, p. 4)

In one case, the provided justification was classified as ambiguous (figure 2), because both the 17 and the 28 item version of the HDRS were used, but only the choice of using the 28 item version was justified.

3.5 Measure Validity Evidence: General HDRS Validity Evidence

In 84% of papers, the HDRS was used without stating any validity evidence. The validity evidence provided in 12 papers fit the set evaluation criteria (figure 2). Amongst these 12 papers, six cases were identified in which only reliability evidence was provided. This reliability evidence mostly concerned internal consistency, and authors referred to corresponding research. In two of these six cases, authors additionally stated a concrete value for Cronbach's alpha:


1. “This instrument has been shown to have a good [...] internal consistency of .82 (Cronbach’s alpha) [...]” (Kruisdijk, Hopman-Rock, Beekman, & Hendriksen, 2019, p. 4)

2. “The Spanish version of the scale that was used in this study has good internal consistency (Cronbach's alpha = 0.78) and test-retest reliability (0.92) (Bobes et al., 2003; Ramos-Brieva and Cordero, 1986).” (Martínez-Amorós et al., 2018, p. 170)

The remaining six of the 12 identified pieces of evidence applied to both the reliability and the validity of the HDRS. Again, authors most frequently referred to corresponding research. In one case, concrete values of the respective correlation coefficients were additionally stated: “The HDRS has been found to have good validity (correlation coefficient for HDRS and clinical changes = 0.26) and reliability (r = 0.88–0.99) in Chinese populations [...]” (Liu et al., 2017, p. 3).

In four cases, the provided validity evidence was ambiguous (figure 2). Two of these four papers, which were written partly by the same authors, were classified as ambiguous because the following text passage was used identically:

It consists of 21 items, and ratings are given on different scales ranging from 3, 4, or 5 points (e.g., insomnia early: 0 = no difficulty falling asleep;1 = complains of occasional difficulty falling asleep, i.e., more than 0.5 h; and 2 = complains of nightly difficulty falling asleep), with higher scores reflecting more marked depressive symptoms (Cronbach's α= 0.80). (Ahmadpanah et al., 2019, p. 3).

The two text passages only differed with regards to the stated value for Cronbach’s alpha, which was stated to be 0.80 in the paper written by Ahmadpanah et al. (2019) and 0.89 in the paper written by Jahangard et al. (2018): “[...] with higher scores reflecting more marked depressive symptoms (Cronbach's α= 0.89)” (Jahangard et al., 2018, p. 320).

3.6 Measure Validity Evidence: HDRS Validity Evidence for Sample

Ninety-seven percent of the reviewed publications did not contain specific HDRS validity evidence for the study sample (figure 2). Amongst these 97 publications were seven cases in which interrater reliability evidence was provided, which was not regarded as measure validity evidence.

In one of the 100 publications, sample-specific HDRS validity evidence was identified (figure 2). Zhang et al. (2018) stated:


The associations between the FAST and GAF, HDRS and YMRS were analyzed using the Pearson correlation analysis to test concurrent validity. The FAST total score was highly associated with GAF (r = −0.952, p < 0.001), HDRS (r = 0.575, p < 0.001) and YRMS total scores (r = 0.394, p < 0.001) at both week 0 and at week 1 (r = −0.945, p < 0.001; r = 0.582, p < 0.001; r = 0.363, p < 0.001), respectively. (p. 158)
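Concurrent validity evidence of this kind typically rests on Pearson correlations between total scores of two instruments administered to the same sample. The sketch below illustrates such a computation with invented scores; the variable names and values are hypothetical and are not based on the data of Zhang et al. (2018).

    import numpy as np
    from scipy import stats

    # Hypothetical total scores of two instruments for the same eight participants
    fast_total = np.array([12, 25, 8, 30, 18, 22, 15, 27])
    hdrs_total = np.array([9, 17, 6, 21, 12, 16, 10, 19])

    r, p = stats.pearsonr(fast_total, hdrs_total)
    print(f"Concurrent validity: r = {r:.3f}, p = {p:.3f}")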

Two percent of publications were classified as ambiguous (figure 2). Authors elaborated on the psychometric properties of the measures used in their studies, but it remained unclear which evidence applied to the HDRS.

3.7 Measure Quantification: Description of Score Computation

In 95% of publications, the process of score computation was ambiguous (figure 2). In almost all of these cases, a total score was used. Several designations for this score were identified, for example “HDRS score”, “HAM-D score”, “HRSD score”, “HDRS total score”, “HAM-D-17-score”, etc. The way of score computation remained unclear, but it could be assumed that a sum score was computed. This did not apply to one paper, which did not contain any information related to score computation at all (figure 2).

In 3% of papers, a detailed description of score calculation was provided (figure 2). Here it became clear that a sum score was computed. Bais et al. (2016) stated:

The SIGH-SAD is a 29-item structured interview and consists of 21 HAM-D (Hamilton Rating Scale for Depression) items and 8 atypical items, of which 11 items can be scored with a value of 0–2, 5 items with a value of 0–3 and 13 items with a value of 0–4 [...]. The sum score ranges from 0 to 63 for the HAM-D items and from 0 to 26 for the atypical items, resulting in a total sum score of 0 to 89 [...]. (p. 8)

Jha et al. (2019) outlined the following:

The individual items of HAMD-17 have three or five choices that are scored from 0 to 2 or 0–4, which are then summed to indicate depression severity of none (< 6), mild (6–13), moderate (14–18), severe (19–23) and very severe (≥24). (p. 166)
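Stated explicitly, the scoring rule described by Jha et al. (2019) amounts to a simple sum of item ratings followed by a severity classification. The following sketch is one interpretation of that description; the function name and the item ratings are hypothetical.

    def hamd17_severity(item_ratings):
        # Sum the 17 item ratings and map the total to the severity bands
        # reported by Jha et al. (2019).
        total = sum(item_ratings)
        if total < 6:
            label = "none"
        elif total <= 13:
            label = "mild"
        elif total <= 18:
            label = "moderate"
        elif total <= 23:
            label = "severe"
        else:
            label = "very severe"
        return total, label

    # Hypothetical ratings for the 17 items
    ratings = [2, 1, 0, 1, 2, 1, 1, 0, 1, 2, 1, 0, 1, 1, 0, 1, 2]
    print(hamd17_severity(ratings))  # (17, 'moderate')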

The description by Qin et al. (2017) was:

To assess the treatment outcome, we used the reductive ratio of the sum score of 17 items HAMD (i.e. (HAMD_baseline − HAMD_eight-week)/HAMD_baseline) and those of the network measures (i.e. P_i and Z_i) of the MDD patients before and after treatment (i.e. (P_i,baseline − P_i,eight-week)/P_i,baseline or (Z_i,baseline − Z_i,eight-week)/Z_i,baseline).
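The reductive ratio referred to by Qin et al. (2017) is a proportional change score relative to baseline. A minimal sketch, assuming hypothetical baseline and eight-week HAMD-17 sum scores, could look as follows:

    def reductive_ratio(baseline, eight_week):
        # Proportional reduction from baseline: (baseline - eight_week) / baseline
        return (baseline - eight_week) / baseline

    # Hypothetical HAMD-17 sum scores before and after eight weeks of treatment
    print(reductive_ratio(24, 9))  # 0.625, i.e. a 62.5% reduction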


For one publication, the analysis of the description of score computation was not applicable, because just one item was used and no score was calculated.

3.8 Measure Quantification: Description of HDRS Factor Scores

In 93% of cases, no description of HDRS factor scores was provided in addition to the description of total scores. Five publications contained a respective description.

One of the 100 reviewed publications was classified as ambiguous, because subscale scores were used, but no concrete values for these scores were provided, so that the description lacked the desired descriptive character.

The quantitative analysis of the description of subscale scores was not applicable in one case, as just one item was used and no score was calculated (figure 2).

3.9 Measure Modification: Modification of the HDRS

In 93% of cases, no modification of the HDRS was identified, whereas six percent of the used scales were modified before data collection (figure 2). Three of these six modified versions had been officially published, and authors thus referred to the respective publications:

1. “To measure core depression, we used the Core Depression Factor Subscale of the HAM-D (Bech, 2006; Detke et al., 2002; Perahia et al., 2006)” (Steenkamp et al., 2017, p. 195).

2. “The therapeutic effect of the treatment was assessed using a modified version of the HDRS-21 (HDRS-NOW) (Leibenluft et al., 1993), from which items that could not be meaningfully rated due to the total sleep deprivation procedure were excluded [...].” (Suzuki et al., 2016, p. 49)

3. “The dimensional structure suggested by Capuron et al. (2009) was used” (Fialho, Pereira, Harrison, Rusted, & Whale, 2017, p. 152)

The remaining three of the six modified scales were changed in terms of the selection of items. Allaert, Demais & Collén (2018) used just one HDRS item, Trombello et al. (2018) used six items related to anxiety from the 17 item version, and Costemale-Lacoste et al. (2018) used three items related to insomnia and one item from the HDRS 17 item version.

One publication was classified as ambiguous (figure 2). It was stated that a modified 24 / 25 item version was used, which was attributed to Mazure, Nelson, and Price (1986).


This paper, however, is not a reference to a specific modified HDRS version but addresses the reliability and validity of depressive symptoms. Furthermore, the number of items remained unclear.

3.10 Measure Modification: Justification of Modifying the HDRS

The analysis of justifications for HDRS modifications was not applicable in 93% of cases. Six modifications were identified in total. Three of these were not justified and one was: authors chose the Core Depression Factor Subscale because of its composition of items (Steenkamp et al., 2017). The remaining two identified modifications were justified ambiguously. One of these two ambiguous justifications did not cover the entire modification, since it remained unclear why authors chose the modified HDRS-NOW version (Suzuki et al., 2016). The other one was not stated clearly and, furthermore, only referred to the selection of three items to study insomnia, although a single item was taken from the HDRS 17 item version as well (Costemale-Lacoste et al., 2018). The remaining modification, which was classified as ambiguous, was not justified.

4. Discussion

The objective of this review was to vet transparency in measurement, using the example of measuring depression with the HDRS. It was investigated whether authors reported necessary information about multiple steps in the measurement process, including important details about the HDRS itself.

Reviewing 100 empirical papers from four established clinical journals showed that important information about measurement was missing or ambiguous to a large extent.

Transparency was mainly lacking for the steps of defining depression, justifying the choice of the HDRS, citing the used HDRS version, and providing validity evidence for it.

Furthermore, the quantification of the HDRS mostly remained ambiguous, as the process of score computation was not outlined in detail. Still, it could be assumed that a sum score was computed. In contrast, the used HDRS version was overall stated transparently, and fewer than 10% of the used scales were modified.

Several questions about the process of measuring depression with the HDRS could not be answered properly due to a lack of transparency. Frequently, authors did not report how they defined depression. In addition, the source of the used HDRS version remained unclear in many cases, and it was rarely proved that the HDRS was a valid instrument to measure depression. Since important information was obfuscated, it was not possible to determine the overall validity of many studies, and the replicability of many findings was undermined. This indicates that QMPs, as a subcategory of QRPs, play a significant role in the context of the replication crisis.

It was assumed that defining depression is challenging due to its complex conceptualization amongst theories. Therefore, generous evaluation criteria were set for the analysis of the construct conceptualization of depression. Still, 22% of papers did not contain any definition of depression and 52% were classified as ambiguous. These results underline the fact that depression is a heterogeneous phenomenon, which is affected by the CIF. Neither a universal conceptualization nor a uniform theory of depression is globally accepted in the field, which makes it difficult to define depression and to ensure valid measurement (Flake & Fried, 2019; Fried, 2015). This is a more extensive problem, which applies to several psychological constructs.

According to current research, psychology suffers from a crisis of theory, contributing to the replication crisis. Apart from poor methodology, the replicability of findings is undermined by the deep-running problem that abstract phenomena lack a concrete theoretical foundation (Oberauer & Lewandowsky, 2019). This circumstance generally undermines the validity of the field. Psychological theories are often vague and do not imply clear predictions in the real world. Consequently, making hypotheses and defining psychological constructs go along with high degrees of freedom on researchers’ behalf, increasing the likelihood of QRPs.

Flake & Fried (2019) outlined the importance of defining the construct under study clearly in order to choose an appropriate measure that captures it in the best possible way. For articles that did not contain a clear conceptualization of depression, it could not be proved that depression was fully captured in the process of measurement. Twenty-two papers did not contain any definition of depression. In these cases, it could not be proved that any theoretical foundation was used at all to measure depression. Due to the intention of coding generously, several definitions were classified as justifiable although the underlying mechanism of depression did not become completely clear. Many definitions addressed more than one underlying feature of depression, but it did not become clear what makes depression different from other psychological phenomena. This ambiguity in the definition of depression undermined the credibility of findings, since it could not be proved that depression was fully captured in the process of measurement in many cases.

Furthermore, the required substantive link between a provided definition of the construct under study and the choice of measure could not be identified. The selection of the HDRS never concurred comprehensibly with the stated conceptualization of depression. This even applied to the definition by Bai et al. (2017), which was classified as the most complete definition amongst all reviewed papers. The authors theoretically explained depression by referring to Beck’s Cognitive Model, which addresses, among other things, the role of cognitive biases (Beck, 1979). Since the HDRS focusses less on cognitive symptoms than, for example, the BDI, the selection of the HDRS over more than 280 existing measures was questionable.

The analysis of the selection of measure further revealed that 91% of cases did not contain a justification for choosing the HDRS. The eight identified justifications were only hinted at, and one justification remained ambiguous. Accordingly, it could not be ruled out that the choice of measure was based on rules of thumb in the majority of papers, which constitutes poor measurement practice. The obtained results are in line with the findings by Santor et al. (2006), indicating that the overall choice of measure in the specific field of depression measurement is arbitrary. Unjustified choices of measure particularly undermine the validity of the field of depression measurement, because more than 280 measures exist. The field was chosen as an example to review transparency in measurement since its overall validity is doubted, and the many possible measure alternatives make a justified selection of measure even more important. Therefore, it is highly problematic that it remained unclear why authors chose the HDRS to measure depression in 91% of cases. Flake & Fried (2019) outlined that choosing measures without a clear justification is a common QMP in the field of psychology, suggesting that similar results might be obtained for other psychological constructs measured with other scales.

Furthermore, the analysis of provided citations for the used HDRS version revealed that 4% of citations were incorrect, 27% of papers did not contain any citation for the HDRS and 9% of citations were ambiguous. These results coincide with two important findings. First, they are consistent with the observation that authors are not clear about which HDRS version they used, which underlines that the use of the HDRS is often vague and ambiguous itself (Williams, 2001; Worboys, 2013). Second, the results coincide with the finding that using measures without reference to their source is a commonly used QMP in psychological research. A correct citation of the used measure is necessarily required to ensure both
