Measuring disease activity in patients with early rheumatoid arthritis

Public defense of the thesis, preceded by an introduction: Friday July 4, 2014 at 14.30 h in Waaier, room 4, University of Twente


Liseth Siemons


Prof. dr. C.A.W. Glas, Universiteit Twente

Assistant‐supervisor: Dr. P.M. ten Klooster, Universiteit Twente

Members:
Prof. dr. E. Krishnan, Stanford University (CA, USA)
Prof. dr. P.L.C.M. van Riel, Radboud universitair medisch centrum
Prof. dr. G. Zielhuis, Radboud universitair medisch centrum
Dr. H. Vonkeman, Universiteit Twente, Medisch Spectrum Twente
Prof. dr. J.A.M. van der Palen, Universiteit Twente, Medisch Spectrum Twente
Prof. dr. R. Sanderman, Universiteit Twente, Rijksuniversiteit Groningen

Funding for this work was provided by the Dutch RhEumatoid Arthritis Monitoring (DREAM) collaboration.

PhD thesis, University of Twente
ISBN: 978‐90‐365‐3661‐5
DOI: 10.3990/1.9789036536615
Cover: Balanced Rock, Arches National Park (UT, USA)
Cover design by: Liseth Siemons
Printed by: Gildeprint Drukkerijen, Enschede, The Netherlands

The publication of this thesis was financially supported by the Medisch Spectrum Twente and the Dutch Arthritis Foundation.

Copyright © 2014 Liseth Siemons, Enschede, The Netherlands
All rights reserved. No part of this book may be reproduced or transmitted, in any form or by any means, without the prior written permission of the author.


DISSERTATION

to obtain the degree of doctor at the University of Twente, on the authority of the rector magnificus, prof. dr. H. Brinksma, on account of the decision of the Doctorate Board (College voor Promoties), to be publicly defended on Friday, July 4, 2014 at 14.45 h

by

Liseth Siemons

born on June 18, 1985 in Sneek


Prof. dr. M.A.F.J. van de Laar (supervisor)

Prof. dr. C.A.W. Glas (supervisor)

Dr. P.M. ten Klooster (assistant‐supervisor)


For Charles, Reini, Wolter, Arnout, and Kien


these past 4 years and I would like to mention some of them in particular.

First of all, my thanks go to my daily supervisor, Peter ten Klooster. Peter, your door was always open and you always had (or made) time to help, give advice, or have a critical discussion. With your scientific understanding, keen eye for detail, and extensive statistical knowledge, your involvement certainly raised this thesis to a higher level. I’m grateful for your guidance and support, and I thank you for the opportunity you have given to me to learn from you during the past 4 years.

Next, I would like to thank my supervisors, Mart van de Laar and Cees Glas. Mart, thank you for your confidence when you offered me this PhD position and for making this valuable experience possible. During the project you always kept track of the “bigger picture”, determining whether we were going in the right direction or whether adaptations of our original plans were necessary. I have much respect for your professionalism in and dedication to clinical research and the freedom that you were willing to give me over the years, including the opportunity to go abroad. I experienced your optimism and involvement as very valuable and your clinical insights were definitely of great importance throughout this thesis.

Cees, your expertise on item response theory and generalizability theory was indispensable for the realization of this thesis. It wasn’t always easy for me to understand the, in my eyes, complex statistical formulas, but you always made sure that I (at least) understood the general idea behind it. I have always experienced your enthusiasm about statistical “problems” or “complexities” as inspiring. More than once I went to you with a question about an article we were working on and you did not only give me an answer, but a whole bunch of new research ideas as well. You were always looking ahead to things we could aspire to after we finished the current study. There is always more to discover and more to learn. Thank you for sharing your knowledge and enthusiasm with me. It should not go unnoticed that this research would not have been possible without the funding of the Dutch RhEumatoid Arthritis Monitoring registry (DREAM), the rheumatologists and nurses who recruited patients and collected data, and the patients who gave their consent to participate. A special thanks must also go to Nancy, for


Harald Vonkeman, Ina Kuper, and Piet van Riel, thank you for your valuable scientific input and the nice collaboration. Harald, I’m glad that your involvement in my project grew over the years. I really value your clinical expertise and appreciate your openness and keen eye for detail.

I would like to express my gratitude to the members of my graduation committee as well: Eswar Krishnan, Piet van Riel, Gerhard Zielhuis, Harald Vonkeman, Job van der Palen, and Robbert Sanderman. Thank you for your willingness to be part of my committee. A special thanks to Eswar, who granted me the opportunity to stay with his group at Stanford University for a period of 3 months. I very much appreciate the way you work, your involvement in the weekly meetings, and your patience when explaining things to the group or person‐to‐person. I’m thankful that I could be part of that for a little while. It was an amazing experience and I enjoyed it very much. Narinder, Weiqi and Linjun, thank you for the very warm welcome you gave me and for the fantastic trip to Alcatraz. Gwen, thank you for your kindness and the many nice chats we had near the coffee machine. And finally, I want to express my gratitude to Joni, who was willing to share her home with me during my stay. I know I was often busy, but I’m grateful for your hospitality and the great trips that we were able to make during my stay.

Then, of course, I also want to thank my Dutch colleagues. Thank you all for the very good time, for all the “gezelligheid”, and the many nice chats. Without you these 4 years would not have been so great. You created a good working environment and a friendly atmosphere. I’ve had many roommates over the years and I appreciated the company of every one of you, but special thanks should go to my fellow‐tower‐colleagues and lunch‐break buddies. Between the hard work there was always room for a little chat or a serious conversation. I really enjoyed your company, the lunches we had together, and the dinner or wine tasting evenings at each other’s homes. I hope we will stay in touch and will continue to have these social gatherings in the future.

I would also like to thank our secretaries. You always stood by to assist with flight reservations, the ordering of materials, the arrangement of certain payments, and many, many other things. Even the organization of a conference or summer school was no problem. Marieke, thanks for lending me a hand with that.


and dad, you are the best. Thank you so much for always being there for me and believing in me. Thank you for your never‐ending love, relentless support, and for helping me to relax when I needed to get my mind away from work. Arnout and Wolter, it is really nice that you wanted to be my paranimfs, but what makes me really happy and immensely proud is the fact that I have such great older brothers. Then, finally, my dearest Kien. I cannot thank you enough for your unconditional love, endless support, and infinite patience and positivism. Thanks for always being there for me, both now and in the future.


Acknowledgements

Chapter 1 Introduction

Chapter 2 Modern psychometrics applied in rheumatology: A systematic review

Chapter 3 Validating the 28‐tender joint count using item response theory

Chapter 4 Contribution of assessing forefoot joints in early rheumatoid arthritis patients: Insights from item response theory

Chapter 5 Interchangeability of 28‐joint disease activity scores using the erythrocyte sedimentation rate or the C‐reactive protein as inflammatory marker

Chapter 6 How age and sex affect the erythrocyte sedimentation rate and C‐reactive protein in early rheumatoid arthritis

Chapter 7 Distinct trajectories of disease activity over the first year in early rheumatoid arthritis patients following a treat‐to‐target strategy

Chapter 8 Further optimization of the reliability of the 28‐joint Disease Activity Score in patients with early rheumatoid arthritis

Chapter 9 General discussion

List of publications

Summary / Samenvatting


Introduction


Rheumatoid arthritis

Rheumatoid arthritis (RA) is a chronic autoimmune disease characterized by symmetric inflammation of the peripheral joints [1‐3]. The course of the disease is highly variable, and the disease has a substantial impact on the physical, psychological, and social health of patients and, in turn, on society in terms of health care costs and decreased productivity. Predominant signs and symptoms of the disease include joint pain, joint swelling, joint damage or deformations, stiffness, fatigue, diminished physical and social functioning, and depression [1‐6]. The prevalence of RA is estimated at 0.5% to 1.0% of the adult population in developed countries [1, 2, 7, 8]; it tends to increase with age and is higher in females than in males.

Although the precise aetiology of RA remains unknown, the pathogenesis is assumed to be multifactorial, involving both genetic and environmental factors [1, 2, 7, 9]. New insights into the biological mechanisms have resulted in new medicines and drastic changes in treatment strategies over the past decade, which have led to significant improvements in the future prospects of patients. Whereas previous treatments were mostly aimed at controlling symptoms and lacked effectiveness in the prevention of joint destruction and maintenance of functional status, current treatments take a more aggressive approach, aiming to suppress RA disease activity as fast as possible, as completely as possible, and as long as possible [2, 3]. For this purpose, different relatively strict monitoring and treatment strategies are used [10, 11]. These strategies emphasize intensive intervention early in the disease and have proven to be effective in early RA patients in daily clinical practice [12], resulting in more and faster remissions compared to usual care. In particular, there are significantly larger improvements in: a) physical functioning, b) the (reduced) number of tender and swollen joints, and c) the patient‐reported assessments of pain and general health [12].

Disease activity

These disease‐activity‐driven, or treat‐to‐target, treatment strategies require optimal measurement of disease activity. A widely used measure for monitoring disease control is the Disease Activity Score for 28 joints (DAS28). The DAS28 is a simplified form of its predecessor, the Disease Activity Score (DAS).

DAS

RA disease activity is a complex and multidimensional construct. Traditionally, rheumatologists based their clinical judgment of a patient’s disease activity on a combination of factors, including laboratory measures, clinical assessments, and radiographic information [13]. However, this unstructured assessment method resulted in large discrepancies between rheumatologists and, in an attempt to formalize this process and reduce discrepancies, a quantitative disease activity index was proposed: the DAS [13].

Starting with a large collection of previously reported potential markers of disease activity, factor analysis was used to classify these variables into groups. The factors that discriminated best between patients with low and high disease activity (as defined by the rheumatologists’ treatment decisions) were retained and their most influential variables were extracted using multiple regression analysis. This resulted in a continuous composite score reflecting a patient’s current disease activity [13], based on (in decreasing importance) the graded Ritchie score (i.e. a tender joint count in 53 joints), a binary rated swollen joint count in 44 joints, a non‐specific measure of inflammation called the erythrocyte sedimentation rate, and a patient‐reported global assessment of general health.

DAS28

Though the development of the DAS aided the quantification of a patient’s disease activity, its practical usefulness in clinical practice was limited due to the elaborate measurement of all 53 tender and 44 swollen joints. This is not only a tedious and time‐consuming job for the clinician, but for the patient as well. Furthermore, some of the included joints are very difficult to assess reliably or might show abnormalities because of processes beyond the rheumatic disease [14‐16]. Consequently, several studies examined the usability of reduced joint counts and found that they were as valid and reliable as more extensive joint counts [14, 16‐19], which led to the development of a modified DAS score, i.e. the DAS28. The erythrocyte sedimentation rate (ESR) and general health measure were retained, but the extensive joint counts were reduced to only 28 joints [18]. Figure 1 gives an overview of the joints administered, including the shoulders [2 joints], elbows [2 joints], wrists [2 joints], metacarpophalangeals (MCPs) [10 joints], proximal interphalangeals (PIPs) [10 joints], and knees [2 joints] [20]. All joints are measured on a binary scale, where 0 reflects no pain/no swelling and 1 reflects the presence of pain/swelling, and are summed into a 28‐tender joint count (TJC28) and 28‐swollen joint count (SJC28). These more or less objective, semi‐objective and subjective measures were combined into a continuous scale of RA disease activity, using the following equation [18, 20]:

$$\text{DAS28-ESR} = 0.56\sqrt{\text{TJC28}} + 0.28\sqrt{\text{SJC28}} + 0.70\ln(\text{ESR}) + 0.014\,\text{GH}$$

where GH denotes the patient‐reported general health assessment.


The resulting DAS28 score reflects present disease activity, ranging from 0 to approximately 10, and can be used to classify patients as being in remission [DAS28 <2.6], or as having low disease activity [2.6 ≤ DAS28 ≤ 3.2], moderate disease activity [3.2 < DAS28 ≤ 5.1], or high disease activity [DAS28 >5.1] [20].
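As a worked illustration of the score and its cut‐offs, the following minimal Python sketch computes a DAS28‐ESR value from binary joint ratings and maps it to the four disease activity categories given above. It assumes the published DAS28‐ESR formula [18, 20] (square roots of the joint counts, natural logarithm of the ESR in mm/h, and general health on a 0 to 100 visual analog scale). The joint labels, function names, and patient values are hypothetical and only serve the example.

```python
import math

# The 28 joints of the DAS28 as listed in the text (labels are hypothetical).
SIDES = ("left", "right")
JOINTS_28 = (
    [f"{s} shoulder" for s in SIDES]
    + [f"{s} elbow" for s in SIDES]
    + [f"{s} wrist" for s in SIDES]
    + [f"{s} MCP {i}" for s in SIDES for i in range(1, 6)]
    + [f"{s} PIP {i}" for s in SIDES for i in range(1, 6)]
    + [f"{s} knee" for s in SIDES]
)


def joint_count(affected):
    """Count the joints rated 1 (tender or swollen); all other joints are 0."""
    unknown = set(affected) - set(JOINTS_28)
    if unknown:
        raise ValueError(f"not a DAS28 joint: {sorted(unknown)}")
    return len(set(affected))


def das28_esr(tjc28, sjc28, esr, gh):
    """DAS28-ESR; esr in mm/h (must be > 0), gh on a 0-100 scale."""
    return (0.56 * math.sqrt(tjc28) + 0.28 * math.sqrt(sjc28)
            + 0.70 * math.log(esr) + 0.014 * gh)


def classify(das28):
    """Map a DAS28 score to the categories used in the text [20]."""
    if das28 < 2.6:
        return "remission"
    if das28 <= 3.2:
        return "low disease activity"
    if das28 <= 5.1:
        return "moderate disease activity"
    return "high disease activity"


# Hypothetical patient, invented for illustration only.
tender = ["left wrist", "right MCP 2", "right MCP 3", "left knee"]
swollen = ["right MCP 2", "right MCP 3"]
score = das28_esr(joint_count(tender), joint_count(swollen), esr=20, gh=50)
print(f"DAS28-ESR = {score:.2f} ({classify(score)})")
```

Because the joint counts enter as square roots and the ESR as a logarithm, a change of one joint or a few mm/h matters more at low values than at high values, which is worth keeping in mind when reading scores near a cut‐off.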

When following a disease‐activity‐driven treatment strategy, therapy adjustments defined by protocol are often based on these resulting DAS28 scores. However, these DAS28 scores have their limitations. Apart from disease activity, all individual component scores might be influenced by comorbidities [21‐23], joint counts can also be affected by physician and patient related factors [24], and a patient’s general health rating can be elevated because of non‐inflammatory or personal factors [25]. At the population level, the DAS28 probably gives a good estimate of disease activity in RA patients, but inconsistencies can occur in individual patients. Moreover, a rheumatologist might still observe disease activity in omitted joints when the DAS28 indicates a state of remission. Therefore, to better interpret a DAS28 score, it is essential to have thorough insight into its individual components and their shortcomings.

Figure 1 ‐ The 28 joints included in the tender and swollen joint counts of the DAS28. MCP = metacarpophalangeal, PIP = proximal interphalangeal

Individual components DAS28

Joint counts

RA is an inflammatory disease which predominantly manifests itself in warm, reddish, painful and swollen joints. Joints are often referred to as the major “organ” involved in RA [26, 27]. Consequently, joint counts are considered essential for assessing RA disease activity. They belong to the core set of disease activity measures and their use is recommended by the American College of Rheumatology as well as the European League Against Rheumatism [20, 27, 28].

Since the assessment of all joints is unfeasible and inaccurate in clinical practice, mainly due to time constraints, reduced joint counts have been proposed over time. The DAS28 uses 28‐joint counts to assess pain and swelling, which have been shown to correlate strongly with more extensive joint counts [14, 16‐19]. Each visit, the presence of both joint pain and swelling is assessed by a trained rheumatologist or nurse practitioner by exerting a certain amount of pressure on the patient’s joints. This is called a semi‐objective assessment method because the degree of joint pain depends on the patient’s pain threshold as well as the amount of pressure exerted by the rheumatologist, whereas the assessment of joint swelling depends on the physician’s perceptions [24]. Large intra‐ and inter‐observer variability has been reported in both joint counts, although reliability was generally found to be higher for the tender joint count [29]. Rating differences might be explained by factors such as the assessors’ levels of training and experience, a lack of standardization in examination methods, unclear definitions of what a swollen joint looks like, or the degree of joint deformity [29, 30]. Reliability can be improved by training, the use of the same assessor at each assessment, or the use of standardized guidelines for joint assessment [29, 31].

Another issue related to these reduced joint counts concerns the presence of residual joint activity in omitted joints [32‐34], which has raised the question whether the 28‐joint counts are sufficient to reflect a patient’s current disease activity. The omission of certain joints seems warranted, though, because of major assessment difficulties or because of influential processes that go beyond RA. For instance, it is very difficult to assess swelling in the hip joints, and foot abnormalities might result from fluid retention or wearing the wrong footwear [14] instead of inflammation. Including these joints might provide relevant clinical information, but it might also add unwanted random error variance to the measurement.

Acute phase reactants

Because RA is an inflammatory disease, it is not surprising that the DAS28 includes an acute phase reactant as an indicator of disease activity. Even though a single laboratory measure can only reflect part of the whole inflammatory response, it can be clinically helpful for evaluating inflammatory severity and for monitoring disease activity over time [22]. Although the DAS28 was originally developed with the ESR as acute phase reactant, it has been argued that the C‐reactive protein (CRP) might give a better reflection of current disease activity because of its faster response to inflammatory stimuli [22, 35]. Consequently, another DAS28 formula was developed using CRP instead of ESR [36]:

$$\text{DAS28-CRP} = 0.56\sqrt{\text{TJC28}} + 0.28\sqrt{\text{SJC28}} + 0.36\ln(\text{CRP} + 1) + 0.014\,\text{GH} + 0.96$$

The ability to calculate DAS28 scores based on different acute phase reactants can be convenient, but it does raise the question whether these two measures can be used interchangeably because both inflammatory markers measure different aspects of the disease. Where ESR values are assumed to reflect the patient’s disease activity over the past few weeks, CRP values may be a better reflection of short‐term changes in disease activity [20, 22, 36, 37].
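The following minimal sketch, which is not part of the analyses in this thesis, shows how the two scores can diverge for one and the same patient. It assumes the published DAS28‐ESR and DAS28‐CRP formulas [18, 20, 36], with ESR in mm/h, CRP in mg/L, and general health (GH) on a 0 to 100 scale. The function names and patient values are invented for illustration.

```python
import math


def das28_esr(tjc28, sjc28, gh, esr_mm_h):
    """DAS28 based on the erythrocyte sedimentation rate (ESR > 0)."""
    return (0.56 * math.sqrt(tjc28) + 0.28 * math.sqrt(sjc28)
            + 0.70 * math.log(esr_mm_h) + 0.014 * gh)


def das28_crp(tjc28, sjc28, gh, crp_mg_l):
    """DAS28 based on the C-reactive protein (CRP >= 0)."""
    return (0.56 * math.sqrt(tjc28) + 0.28 * math.sqrt(sjc28)
            + 0.36 * math.log(crp_mg_l + 1) + 0.014 * gh + 0.96)


# Hypothetical patient: same joint counts and GH, two inflammatory markers.
patient = {"tjc28": 4, "sjc28": 2, "gh": 40}
esr_score = das28_esr(esr_mm_h=25, **patient)
crp_score = das28_crp(crp_mg_l=8, **patient)

print(f"DAS28-ESR = {esr_score:.2f}")
print(f"DAS28-CRP = {crp_score:.2f}")
print(f"difference = {esr_score - crp_score:.2f}")
```

Only the acute phase reactant term and the constant differ between the two formulas, so any discrepancy for a given patient comes entirely from 0.70 ln(ESR) versus 0.36 ln(CRP + 1) + 0.96.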

The ESR is the rate at which red blood cells fall through plasma in a tube of blood in one hour. It is an indirect way of screening for elevated concentrations of acute phase plasma proteins in the blood (e.g. fibrinogen) because these cause the red blood cells to settle more rapidly [22]. ESR values respond slowly to inflammatory stimuli and, consequently, to changes in a patient’s disease activity [22, 35, 37]. The CRP, on the other hand, is a direct measure of the acute phase reaction [35, 37‐39], responding more rapidly to changes in inflammatory stimuli [22, 35, 37, 39, 40]. It is a protein that is produced in the liver as a reaction to certain biologic ligands that appear when inflammation develops [22] and is part of the body’s defense mechanisms against damaged cells or pathogens. Assessment of both inflammatory markers is relatively easy, quick, and inexpensive [22, 35, 38‐40], but the assessment of the ESR requires a fresh blood sample [22, 40], while the measurement of CRP can also be performed on stored frozen specimens in a central laboratory [22, 37, 40].

Their interpretation, however, is complicated by the fact that both are non‐specific acute phase reactants of systemic inflammation, which means that elevated levels are not necessarily (solely) due to the inflammation of the rheumatic disease but can also be influenced by other factors. ESR, for instance, can be increased by other inflammatory conditions and non‐inflammatory‐related biological phenomena like paraproteinemia, abnormally shaped or sized red blood cells, changes in plasma composition, anemia, pregnancy, or certain drugs [22, 35, 37, 39, 40]. CRP production, on the other hand, can be hampered in liver disease and concentrations can be elevated in the presence of comorbid diseases like osteoarthritis, gout, and systemic lupus erythematosus, in the presence of bacterial infections, or because of non‐inflammatory influences like sleep deprivation or unhealthy diets [22]. Finally, some medications, such as tocilizumab, have a direct effect on CRP production bypassing the disease processes of RA.


General health

Whereas the severity and impact of most rheumatic conditions were initially evaluated mainly with clinical measures, patient‐reported outcome measures (PROs) have gained attention since the 1980s [41, 42] and are now part of the core set of disease activity measures as defined by the American College of Rheumatology [28] as well as of the ACR/EULAR remission criteria [27]. The DAS28 also includes a patient‐reported outcome measure – a visual analog scale of general health (GH) – which asks the patient to give an overall assessment of how they currently feel, considering all the ways in which the RA influences their lives (Figure 2).

Figure 2 – Visual analog scale of general health on which patients indicate how they currently feel, considering all the ways in which the RA influences their lives.

A visual analog scale can be administered quickly and easily, yet score interpretation can be very difficult because the patient is asked to consider all the ways in which the RA influences his or her life, touching upon different dimensions of the disease. As defined by the World Health Organization, health is not merely the absence of disease but involves the whole spectrum of physical, mental and social wellbeing [43], all of which can affect the patient’s GH score. Consequently, a score of 20 for one person does not need to have the same meaning as a score of 20 for another person.

The inclusion of the GH component has often been criticized. It is not only the most subjective component of the DAS28, but GH scores can also be elevated when none of the other DAS28 components point to an active disease, possibly due to certain non‐inflammatory effects of RA like pain or fatigue [44]. Additionally, the patient’s perception of health has been shown to differ across patients with equal DAS28 scores, which might be explained by the occurrence of response shifts during the disease course [45]. Still, it is believed that (semi‐)objective clinical measures and subjective patient‐reported outcome measures address different aspects of the disease [46] and that both should be administered to evaluate a multifactorial concept like RA disease activity.

Objective of this thesis

Although the DAS28 has proven its value in rheumatology, several doubts have been raised about residual disease activity in omitted joints [32‐34], the influence of external factors on the level of inflammatory markers [22, 47‐62], and the inclusion of a subjective patient‐reported outcome measure [44, 45]. The aim of this thesis is to address some of these issues to improve our understanding of the complexities behind the measurement of disease activity in early rheumatoid arthritis patients.

Outline of this thesis

Modern psychometrics and joint counts

Statistical methods are essential for developing or evaluating outcome measures. So far, the majority of clinical measures have been developed using methods from classical test theory. In recent years, however, item response theory (IRT) emerged as an alternative, complementary psychometric method. IRT enables a more thorough evaluation of an instrument’s psychometric characteristics by providing more detailed information on the item level [63].
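To illustrate what information on the item level means, the sketch below uses the two‐parameter logistic model, one common IRT model for binary items such as a tender (1) or non‐tender (0) joint. The choice of this model and all parameter values are assumptions made for illustration; they are not estimates or methods taken from this thesis.

```python
import math


def prob_affected(theta, a, b):
    """Probability that an item is scored 1 at latent level theta.

    a is the item's discrimination, b its location (difficulty).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Item information of a two-parameter logistic item: a^2 * P * (1 - P)."""
    p = prob_affected(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical item parameters (discrimination, location), illustration only.
items = {
    "hypothetical joint A": (1.5, -0.5),
    "hypothetical joint B": (0.8, 1.0),
}

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    per_item = {name: item_information(theta, a, b)
                for name, (a, b) in items.items()}
    total = sum(per_item.values())  # test information is the sum over items
    print(f"theta = {theta:+.1f}  total information = {total:.3f}")
```

Each item is most informative around its own location b, and the information of a set of items is the sum of the item contributions, so plotting these curves shows at which levels of the latent trait an instrument measures precisely.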

Although IRT originated from the field of educational measurement, it is gaining attention in the medical field as well [64‐67] and the first study of this thesis provides an overview of its use within rheumatology (Chapter 2). This study shows that IRT applications have markedly increased over the past decades, although they are primarily being applied to patient‐reported measures. To examine its applicability to clinical disease activity measures more thoroughly, it was evaluated whether IRT could be used to internally validate the 28‐tender joint count (Chapter 3).

The focus on the item level provides the interesting opportunity to evaluate the measurement range and precision of the 28 included joints. A commonly raised criticism of the DAS28 is that patients might experience residual disease activity in excluded joints, especially in the feet. To address this issue, IRT information curves were used to evaluate the contribution of the forefoot joints to the measurement range of the 28‐joint counts (Chapter 4).

Inflammatory markers

When working with two DAS28 scores, depending on a preference for or the availability of the ESR or CRP, it is important to know whether these scores can be used interchangeably. Score discrepancies might lead to different interpretations of a patient’s level of disease activity and can have far‐reaching effects on treatment decisions. As such, the interchangeability of the two DAS28 scores is examined in Chapter 5.

Although CRP levels are supposed to be less sensitive to external influences than the ESR [22, 35, 38, 39], an extensive number of studies have demonstrated similar dependencies on gender, age, and body mass index for both acute phase reactants, suggesting that these factors should be taken into account when interpreting a patient’s DAS28 score. Yet, the extent of these effects remains unknown and is evaluated in Chapter 6.

Heterogeneity of the early RA population

As mentioned earlier, DAS28 scores are often embedded within the treatment protocols with the current emphasis on early aggressive treatments. They are used as a criterion variable for treatment decisions or for evaluating treatment effectiveness [12, 68]. In this line of thought it is important to evaluate whether individual differences exist in the course of disease activity (Chapter 7), because patients with different response patterns might be in need of different therapeutic interventions.

DAS28 reliability

The DAS28 established itself as a valid measure [18], but since reliability is a prerequisite for validity, it should be reliable as well. Yet common internal consistency measures like Cronbach’s alpha are ill‐suited for the assessment of composite reliability, i.e. when different components of disease activity are combined into one index measure. Therefore, Chapter 8 applies principles from generalizability theory to determine the reliability of the DAS28 in patients with early RA.

Finally, a general discussion about the findings in Chapters 2 to 8 is given in Chapter 9. It elaborates on clinical implications and explores future research directions.
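To make the remark about Cronbach’s alpha more concrete, the sketch below computes alpha in its standard form, k/(k − 1) × (1 − sum of item variances / variance of the total score). It is shown only to clarify what this coefficient measures; it is not the generalizability analysis of Chapter 8, and the function name and scores are invented for illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for k items.

    items is a list of k lists, each holding one score per patient
    (same patient order in every list).
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per patient
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical scores of four patients on two items, illustration only.
item_1 = [1, 2, 3, 4]
item_2 = [2, 1, 4, 3]
print(f"alpha = {cronbach_alpha([item_1, item_2]):.2f}")
```

Alpha treats the items as interchangeable indicators on a comparable scale that are simply summed. The DAS28 instead combines differently scaled and differently weighted components (joint counts, an acute phase reactant, and a visual analog scale), which is one way to see why such a coefficient fits a composite index poorly.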

References

1. Sangha O. Epidemiology of rheumatic diseases. Rheumatology. 2000;39(2):3‐12.

2. Turkiewicz AM, Moreland LW. Rheumatoid arthritis. In: Bartlett SJ, editor. Clinical care in the rheumatic diseases. Atlanta (GA): Association of Rheumatology Health Professionals; 2006. p. 157‐66.

3. Lee DM, Weinblatt ME. Rheumatoid arthritis. The Lancet. 2001;358:903‐11.

4. Uhlig T, Loge JH, Kristiansen IS, Kvien TK. Quantification of reduced health‐related quality of life in patients with rheumatoid arthritis compared to the general population. J Rheumatol. 2007;34(6):1241‐7.

5. Pollard L, Choy EH, Scott DL. The consequences of rheumatoid arthritis: quality of life measures in the individual patient. Clin Exp Rheumatol. 2005;23(5 Suppl 39):S43‐52.

6. Bath J, Hooper J, Giles M, Steel D, Reed E, Woodland J. Patient perceptions of rheumatoid arthritis. Nurs Stand. 1999;14(3):35‐8.

7. Scott DL, Wolfe F, Huizinga TWJ. Rheumatoid arthritis. The Lancet. 2010;376:1094‐108.

8. Gabriel SE, Michaud K. Epidemiological studies in incidence, prevalence, mortality, and comorbidity of the rheumatic diseases. Arthritis Res Ther. 2009;11(3):229.

9. Tobón GJ, Youinou P, Saraux A. The environment, geo‐epidemiology, and autoimmune disease: Rheumatoid arthritis. J Autoimmun. 2010;35:10‐4.

10. Smolen JS, Aletaha D, Bijlsma JW, Breedveld FC, Boumpas D, Burmester G, et al. Treating rheumatoid arthritis to target: recommendations of an international task force. Ann Rheum Dis. 2010;69(4):631‐7.

11. Combe B, Landewe R, Lukas C, Bolosiu HD, Breedveld F, Dougados M, et al. EULAR recommendations for the management of early arthritis: report of a task force of the European Standing Committee for International Clinical Studies Including Therapeutics (ESCISIT). Ann Rheum Dis. 2007;66(1):34‐45.

12. Schipper LG, Vermeer M, Kuper HH, Hoekstra MO, Haagsma CJ, Den Broeder AA, et al. A tight control treatment strategy aiming for remission in early rheumatoid arthritis is more effective than usual care treatment in daily clinical practice: a study of two cohorts in the Dutch Rheumatoid Arthritis Monitoring registry. Ann Rheum Dis. 2012;71(6):845‐50.

13. Van der Heijde DMFM, van ’t Hof MA, van Riel PLCM, Theunisse LAM, Lubberts EW, van Leeuwen MA, et al. Judging disease activity in clinical practice in rheumatoid arthritis: First step in the development of a disease activity score. Ann Rheum Dis. 1990;49:916‐20.

14. Fuchs HA, Brooks RH, Callahan LF, Pincus T. A simplified twenty‐eight‐joint quantitative articular index in rheumatoid arthritis. Arthritis Rheum. 1989;32:531.

15. van Tuyl LHD, Britsemmer K, Wells GA, Smolen JS, Zhang B, Funovits J, et al. Remission in early rheumatoid arthritis defined by 28 joint counts: limited consequences of residual disease activity in the forefeet on outcome. Ann Rheum Dis. 2012;71(1):33‐7.

16. Fuchs HA, Pincus T. Reduced joint counts in controlled clinical trials in rheumatoid arthritis. Arthritis Rheum. 1994;37(4):470‐5.

17. Prevoo MLL, van Riel PLCM, van ’t Hof MA, van Rijswijk MH, van Leeuwen MA, Kuper HH, et al. Validity and reliability of joint indices. A longitudinal study in patients with recent onset rheumatoid arthritis. Br J Rheumatol. 1993;32(7):589‐94.

18. Prevoo ML, van ’t Hof MA, Kuper HH, van Leeuwen MA, van de Putte LB, van Riel PL. Modified disease activity scores that include twenty‐eight–joint counts: development and validation in a prospective longitudinal study of patients with rheumatoid arthritis. Arthritis Rheum. 1995;38(1):44‐8.

19. Smolen JS, Breedveld FC, Eberl G, Jones I, Leeming M, Wylie GL, et al. Validity and reliability of the twenty‐eight‐joint count for the assessment of rheumatoid arthritis activity. Arthritis Rheum. 1995;38(1):38‐43.

20. Van Riel PLCM, Fransen J, Scott DL. Eular handbook of clinical assessments in rheumatoid arthritis. Alphen aan den Rijn: van Zuiden Communications; 2004.

21. van Tuyl LH, Boers M. Patient's global assessment of disease activity: what are we measuring? Arthritis Rheum. 2012;64(9):2811‐3.

22. Firestein GS, Budd RC, Harris Jr ED, McInnes IB, Ruddy S, Sergent JS. Kelley's textbook of rheumatology. Philadelphia: Saunders Elsevier; 2009.

23. Leeb BF, Andel I, Sautner J, Nothnagl T, Rintelen B. The DAS28 in rheumatoid arthritis and fibromyalgia patients. Rheumatology. 2004;43(12):1504‐7.

24. Thompson PW, Kirwan JR. Joint count: A review of old and new articular indices of joint inflammation. Br J Rheumatol. 1995;34:1003‐8.

25. Masri KR, Shaver TS, Shahouri SH, Wang S, Anderson JD, Busch RE, et al. Validity and reliability problems with patient global as a component of the ACR/EULAR remission criteria as used in clinical practice. J Rheumatol. 2012;39(6):1139‐45.

26. Kapral T, Dernoschnig F, Machold KP, Stamm T, Schoels M, Smolen JS, et al. Remission by composite scores in rheumatoid arthritis: are ankles and feet important? Arthritis Res Ther. 2007;9(4):R72.

27. Felson DT, Smolen JS, Wells G, Zhang B, van Tuyl LHD, Funovits J, et al. American College of Rheumatology/European League Against Rheumatism provisional definition of remission in rheumatoid arthritis for clinical trials. Arthritis Rheum. 2011;63(3):573‐86.

28. Felson DT, Anderson JJ, Boers M, Bombardier C, Chernoff M, Fried B, et al. The American College of Rheumatology preliminary core set of disease activity measures for rheumatoid arthritis clinical trials. The Committee on Outcome Measures in Rheumatoid Arthritis Clinical Trials. Arthritis Rheum. 1993;36(6):729‐40.

29. Cheung PP, Gossec L, Mak A, March L. Reliability of joint count assessment in rheumatoid arthritis: A systematic literature review. Semin Arthritis Rheum. In press.

30. Marhadour T, Jousse‐Joulin S, Chalès G, Grange L, Hacquard C, Loeuille D, et al. Reproducibility of joint swelling assessments in long‐lasting rheumatoid arthritis: influence on Disease Activity Score‐28 values (SEA‐Repro study part I). J Rheumatol. 2010;37(5):932‐7.

31. Pincus T. Limitations of a quantitative swollen and tender joint count to assess and monitor patients with rheumatoid arthritis. Bull NYU Hosp Jt Dis. 2008;66(3):216‐23.

32. Landewé R, van der Heijde D, van der Linden S, Boers M. Twenty‐eight‐joint counts invalidate the DAS28 remission definition owing to the omission of the lower extremity joints: A comparison with the original DAS remission. Ann Rheum Dis. 2006;65:637‐41.

33. Mäkinen H, Kautiainen H, Hannonen P, Sokka T. Is DAS28 an appropriate tool to assess remission in rheumatoid arthritis? Ann Rheum Dis. 2005;64:1410‐3.

34. van der Leeden M, Steultjens MP, van Schaardenburg D, Dekker J. Forefoot disease activity in rheumatoid arthritis patients in remission: results of a cohort study. Arthritis Res Ther. 2010;12(1):R3.

35. Kushner I. C‐reactive protein in rheumatology. Arthritis Rheum. 1991;34(8):1065‐68.

36. Fransen J, Welsing PMJ, de Keijzer RMH, van Riel PLCM. Disease activity scores using C‐reactive protein: CRP may replace ESR in the assessment of RA disease activity. Ann Rheum Dis. 2003;62 (Suppl. 1):151.

37. Wolfe F. Comparative usefulness of C‐reactive protein and erythrocyte sedimentation rate in patients with rheumatoid arthritis. J Rheumatol. 1997;24(8):1477‐85.

38. Crowson CS, Rahman MU, Matteson EL. Which measure of inflammation to use? A comparison of erythrocyte sedimentation rate and C‐reactive protein measurements from randomized clinical trials of golimumab in rheumatoid arthritis. J Rheumatol. 2009;36(8):1606‐10.

39. Husain TM, Kim DH. C‐reactive protein and erythrocyte sedimentation rate in orthopaedics. UPOJ. 2002;15:13‐6.

40. Paulus HE, Brahn E. Is erythrocyte sedimentation rate the preferable measure of the acute phase response in rheumatoid arthritis? J Rheumatol. 2004;31(5):838‐40.

41. Fries JF. The promise of the future, updated: better outcome tools, greater relevance, more efficient study, lower research costs. Fut Rheumatol. 2006;1(4):415‐21.

42. Fries JF, Bruce B, Cella D. The promise of PROMIS: Using item response theory to improve assessment of patient‐reported outcomes. Clin Exp Rheumatol. 2005;23(5 SUPPL. 39).


43. World Health Organization. WHO definition of Health [cited March 1, 2014]; Available from: http://www.who.int/about/definition/en/print.html

44. Vermeer M, Kuper HH, van der Bijl AE, Baan H, Posthumus MD, Brus HLM, et al. The provisional ACR/EULAR definition of remission in RA: a comment on the patient global assessment criterion. Rheumatology (Oxford). 2012;51(6):1076‐80.

45. Kievit W, Welsing PMJ, Adang EMM, Eijsbouts AM, Krabbe PFM, van Riel PLCM. Comment on the use of self‐reporting instruments to assess patients with rheumatoid arthritis: the longitudinal association between the DAS28 and the VAS general health. Arthritis Rheum. 2006;55(5):745‐50.

46. Wolfe F. The prognosis of rheumatoid arthritis: assessment of disease activity and disease severity in the clinic. Am J Med. 1997;103(6A):12S‐8S.

47. Nestel AR. ESR changes with age ‐ a forgotten pearl. BMJ. 2012;344:e1403.

48. Miller A, Green M, Robinson D. Simple rule for calculating normal erythrocyte sedimentation rate. Br Med J (Clin Res Ed). 1983;286(6361):266.

49. Shearn MA, Kang IY. Effect of age and sex on the erythrocyte sedimentation rate. J Rheumatol. 1986;13(2):297‐8.

50. De Silva DA, Woon FP, Chen C, Chang HM, Wong MC. Serum erythrocyte sedimentation rate is higher among ethnic South Asian compared to ethnic Chinese ischemic stroke patients. Is this attributable to metabolic syndrome or central obesity? J Neurol Sci. 2009;276(1‐2):126‐9.

51. Böttiger LE, Svedberg CA. Normal erythrocyte sedimentation rate and age. Br Med J. 1967;2(5544):85‐7.

52. Radovits BJ, Fransen J, van Riel PL, Laan RF. Influence of age and gender on the 28‐joint Disease Activity Score (DAS28) in rheumatoid arthritis. Ann Rheum Dis. 2008;67(8):1127‐31.

53. Ranganath VK, Elashoff DA, Khanna D, Park G, Peter JB, Paulus HE. Age adjustment corrects for apparent differences in erythrocyte sedimentation rate and C‐reactive protein values at the onset of seropositive rheumatoid arthritis in younger and older patients. J Rheumatol. 2005;32(6):1040‐2.

54. Hayes GS, Stinson IN. Erythrocyte sedimentation rate and age. Arch Ophthalmol. 1976;94(6):939‐40.

55. Oeser A, Chung CP, Asanuma Y, Avalos I, Stein CM. Obesity is an independent contributor to functional capacity and inflammation in systemic lupus erythematosus. Arthritis Rheum. 2005;52(11):3651‐9.

56. Kawamoto R, Kusunoki T, Abe M, Kohara K, Miki T. An association between body mass index and high‐sensitivity C‐reactive protein concentrations is influenced by age in community‐dwelling persons. Ann Clin Biochem. 2013;50(Pt5):457‐64.

57. Piéroni L, Bastard JP, Piton A, Khalil L, Hainque B, Jardel C. Interpretation of circulating C‐reactive protein levels in adults: body mass index and gender are a must. Diabetes Metab. 2003;29(2 Pt 1):133‐8.

58. Lee YJ, Lee JH, Shin YH, Kim JK, Lee HR, Lee DC. Gender difference and determinants of C‐reactive protein level in Korean adults. Clin Chem Lab Med. 2009;47(7):863‐9.

59. Lakoski SG, Cushman M, Criqui M, Rundek T, Blumenthal RS, D'Agostino Jr RB, et al. Gender and C‐reactive protein: data from the Multiethnic Study of Atherosclerosis (MESA) cohort. Am Heart J. 2006;152(3):593‐8.

60. Rommel J, Simpson R, Mounsey JP, Chung E, Schwartz J, Pursell I, et al. Effect of body mass index, physical activity, depression, and educational attainment on high‐sensitivity C‐reactive protein in patients with atrial fibrillation. Am J Cardiol. 2013;111(2):208‐12.

61. Kao TW, Lu IS, Liao KC, Lai HY, Loh CH, Kuo HK. Associations between body mass index and serum levels of C‐reactive protein. S Afr Med J. 2009;99(5):326‐30.


62. Rawson ES, Freedson PS, Osganian SK, Matthews CE, Reed G, Ockene IS. Body mass index, but not physical activity, is associated with C‐reactive protein. Med Sci Sports Exerc. 2003;35(7):1160‐6.

63. Reeve BB, Fayers P. Applying item response theory modeling for evaluating questionnaire item and scale properties. In: Fayers P, Hays RD, editors. Assessing Quality of Life in Clinical Trials: Methods of Practice. Oxford, NY: Oxford University Press; 2005. p. 55‐73.

64. Belvedere SL, de Morton NA. Application of Rasch analysis in health care is increasing and is applied for variable reasons in mobility instruments. J Clin Epidemiol. 2010;63:1287‐97.

65. McHorney CA. Generic health measurement: Past accomplishments and a measurement paradigm for the 21st century. Ann Intern Med. 1997;127(5):743‐50.

66. McHorney CA. Ten recommendations for advancing patient‐centered outcomes measurement for older persons. Ann Intern Med. 2003;139:403‐9.

67. Hays RD, Morales LS, Reise SP. Item response theory and health outcomes measurement in the 21st century. Med Care. 2000;38(9):II28‐II42.

68. Vermeer M, Kuper HH, Hoekstra M, Haagsma CJ, Posthumus MD, Brus HLM, et al. Implementation of a treat‐to‐target strategy in very early rheumatoid arthritis. Arthritis Rheum. 2011;63(10):2865‐72.


Modern psychometrics applied in rheumatology

A systematic review

L. Siemons, P.M. ten Klooster, E. Taal, C.A.W. Glas, M.A.F.J. van de Laar

BMC Musculoskelet Disord. 2012, 13: 216

Abstract

Background: Although item response theory (IRT) appears to be increasingly used within health care research in general, a comprehensive overview of the frequency and characteristics of IRT analyses within the rheumatic field is lacking. An overview of the use and application of IRT in rheumatology to date may give insight into future research directions and highlight new possibilities for the improvement of outcome assessment in rheumatic conditions. Therefore, this study systematically reviewed the application of IRT to patient‐reported and clinical outcome measures in rheumatology.

Methods: Literature searches in PubMed, Scopus and Web of Science resulted in 99 original English‐language articles which used some form of IRT‐based analysis of patient‐reported or clinical outcome data in patients with a rheumatic condition. Both general study information and IRT‐specific information were assessed.

Results: Most studies used Rasch modeling for developing or evaluating new or existing patient‐reported outcomes in rheumatoid arthritis or osteoarthritis patients. Outcomes of principal interest were physical functioning and quality of life. Since the last decade, IRT has also been applied to clinical measures more frequently. IRT was mostly used for evaluating model fit, unidimensionality and differential item functioning, the distribution of items and persons along the underlying scale, and reliability. Less frequently used IRT applications were the evaluation of local independence, the threshold ordering of items, and the measurement precision along the scale.

Conclusion: IRT applications have markedly increased within rheumatology over the past decades. To date, IRT has primarily been applied to patient‐reported outcomes; however, applications to clinical measures are gaining interest. Useful IRT applications not yet widely used within rheumatology include the cross‐calibration of instrument scores and the development of computerized adaptive tests, which may reduce the measurement burden for both the patient and the clinician. Also, the measurement precision of outcome measures along the scale was only evaluated occasionally. Performed IRT analyses should be adequately explained, justified, and reported. A global consensus about uniform guidelines should be reached concerning the minimum number of assumptions which should be met and the best ways of testing these assumptions, in order to stimulate the quality appraisal of performed IRT analyses.

Background

Since there is no gold standard for the assessment of disease severity and impact in most rheumatic conditions, it is common practice to administer multiple outcome measures to patients. Initially, the severity and impact of most rheumatic conditions were typically evaluated with clinical measures (CMs) [1, 2], such as laboratory measures of inflammation (e.g. the erythrocyte sedimentation rate [3]) and physician‐based joint counts [4, 5]. Since the 1980s, however, rheumatologists have increasingly started to use patient‐reported outcomes (PROs) [1, 2]. As a result, a wide variety of PROs are currently in use, varying from single‐item visual analog scales (e.g. pain or general health) to multiple‐item scales like the health assessment questionnaire (HAQ) [6], which measures a patient’s functional status, and the 36‐item short form health survey (SF‐36), which measures eight dimensions of health‐related quality of life [7].

Statistical methods are essential for the development and evaluation of all outcome measures. By far, most health outcome measures have been developed using methods from classical test theory (CTT). In recent years, however, an increase in the use of statistical methods based on item response theory (IRT) can be observed in health status assessment [8‐10]. Extensive and detailed descriptions of IRT can be found in the literature [11‐14]. In short, IRT is a collection of probabilistic models, describing the relation between a patient’s response to a categorical question/item and the underlying construct being measured by the scale [11, 15]. IRT supplements CTT methods, because it provides more detailed information on the item level and on the person level. This enables a more thorough evaluation of an instrument’s psychometric characteristics [15], including its measurement range and measurement precision. The evaluation of the contribution of individual items facilitates the identification of the most relevant, precise, and efficient items for the assessment of the construct being measured by the instrument. This is very useful for the development of new instruments, but also for improving existing instruments and developing alternate or short form versions of existing instruments [16]. Additionally, IRT methods are particularly suitable for equating different instruments intended to measure the same construct [17] and for cross‐cultural validation purposes [18]. Finally, IRT provides the basis for developing item banks and patient‐tailored computerized adaptive tests (CATs) [9, 19, 20].

Although IRT appears to be increasingly used within health care research in general, a comprehensive overview of the frequency and characteristics of IRT analyses within the rheumatic field is lacking. The Outcome Measures in Rheumatology (OMERACT) network recently initiated a special interest group aimed at promoting the use of IRT methods in rheumatology [21]. An overview of the use and application of IRT in rheumatology to date may give insight into future research directions and highlight new possibilities for the improvement of outcome assessment in rheumatic conditions. Therefore, the aim of this study was to systematically review the application of IRT to clinical and patient‐reported outcome measures within rheumatology.

Methods

Search strategy

Figure 1 presents an overview of the various stages followed during the search process, starting with an extensive literature search in April 2012 to identify all eligible studies up to and including the year 2011. Electronic database searches of PubMed, Scopus, and Web of Science were carried out, using the terms 'Item response theor*' OR 'Item response model*' OR 'latent trait theor*' OR Rasch OR Mokken, in combination with Rheumat* OR Arthros* OR arthrit*.

Inclusion and exclusion criteria

Only original research articles written in English were included. Articles were considered original when they included original data and when they performed analyses on these data in order to achieve a defined study objective. To be included, studies should present an application of IRT in a sample of which at least 50% had some kind of rheumatic disease (i.e. inflammatory rheumatism, arthrosis, soft tissue rheumatism). In cases where less than 50% of the study sample consisted of rheumatic patients, the study was only included when the rheumatic sample was analysed separately from the rest of the sample. Reviews, letters, editorials, opinion papers, abstracts, posters, and purely descriptive studies were excluded. No limitations were set for study design.

Study identification and selection

The search strategy resulted in a total of 385 studies. After the removal of 189 duplicates, 196 unique articles were identified. Two reviewers independently screened all 196 studies for relevance based on the abstract and title identified from the initial search. If no evident inclusion or exclusion reasons were identified, the full‐text was examined. In total, 103 studies did not meet inclusion criteria and were excluded. The main reasons for exclusion were: the study population (i.e. the study population was not clearly defined or the study contained a rheumatic sample <50% of the total sample which was not separately analysed), the statistical analyses (i.e. no IRT application), and the article type (i.e. non‐original research). Figure 1 includes an overview of the exclusion reasons followed by the number of articles removed.
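The counts in this paragraph can be checked with a few lines of arithmetic. The following is only a bookkeeping sketch in Python; the variable names are illustrative and not part of the original study materials.

```python
# Counts reported in the text for the database search.
records_identified = 385
duplicates_removed = 189
excluded_after_screening = 103

# Unique articles left after removing duplicates.
unique_records = records_identified - duplicates_removed

# Articles from the database search that met the inclusion criteria.
eligible_from_databases = unique_records - excluded_after_screening

print(unique_records)           # 196
print(eligible_from_databases)  # 93
```

The second figure corresponds to the number of eligible studies from the database search that is reported in the Results section.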


Figure 1. Flowchart of the search process.

Data extraction

First, two reviewers independently evaluated a random sample of 15 articles. Both general study information and IRT‐specific information were extracted, using a purpose‐made checklist (Additional file 1) based on both expert input and important issues as mentioned in Tennant and Conaghan [22], Reeve and Fayers [15], and Orlando [23]. Inter‐rater agreement of the evaluated variables was moderate to high, with Cohen’s kappa ranging from 0.60 to 1.00. Most of the disagreements were caused by differing interpretations of some of the extracted variables. For instance, one of the reviewers interpreted the checklist on “performed analyses” as performed analyses using IRT‐based methods only, whereas the other reviewer interpreted it more broadly, including classical test theory methods as well (the latter being the correct interpretation). Consensus about these differences was reached by discussion. Next, one of these reviewers also evaluated the remaining 84 articles.
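For readers less familiar with Cohen’s kappa, the statistic corrects the observed agreement between two raters for the agreement expected by chance. The sketch below computes it in plain Python for one checklist variable. The function name and the ratings are hypothetical and purely illustrative; they are not the data extracted in this review.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items with nominal codes."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical codes for 15 articles (1 = aspect reported, 0 = not reported).
reviewer_1 = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0]
reviewer_2 = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0]
kappa = cohens_kappa(reviewer_1, reviewer_2)
print(round(kappa, 2))
```

In this made‐up example the reviewers disagree on 2 of the 15 articles, which gives a kappa of roughly 0.72.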

General study information

General information concerned the author(s), publication year, study population, the populations’ country of origin, total number of participants (N), study design of the IRT analyses (i.e. cross‐sectional or longitudinal), type of outcome measure (PRO or CM), and main measurement intention (e.g. quality of life, pain, overall physical functioning).

Purpose of analyses

The purpose of the analyses was determined by the main goal the author(s) of the article pursued (e.g. the development, evaluation, comparison, or cross‐cultural validation of instruments).

Specific IRT analyses

Before a researcher can perform IRT analyses, an appropriate IRT model should be selected. Unidimensional models are most widely applied, the simplest being the Rasch model, which assumes that the items are equally discriminating and vary only in their difficulty. The 2‐parameter logistic model (2‐PL model) extends the Rasch model by assuming that the items have a varying ability to discriminate among people with different levels of the underlying construct [11, 15, 19, 23]. These models can be specified further for polytomous items. The rating scale model, graded response model, modified graded response model, partial credit model, and generalized partial credit model can be applied in case of ordered categorical responses. The nominal response model can be applied when response categories are not necessarily ordered [11, 15, 19, 23, 24]. The rating scale model and the partial credit model are generalizations of the Rasch model; the other models are generalizations of the 2‐PL model. In addition to these unidimensional models, multidimensional models and specific non‐parametric models like the Mokken model [25, 26] have been developed. Differences in model assumptions should be taken into account when choosing a model, and model choice should be motivated by taking the discrimination equality of the items and the number of (ordered) response categories into consideration [15, 22‐24].
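The difference between the Rasch model and the 2‐PL model for dichotomous items can be illustrated with a short Python sketch. It uses the common logistic parameterization, in which theta is the person’s level on the underlying construct; the function and parameter names are illustrative assumptions and are not taken from any of the reviewed studies or software packages.

```python
import math

def two_pl_probability(theta, difficulty, discrimination=1.0):
    """Probability of a positive response to a dichotomous item (2-PL model).

    theta          -- person's level on the underlying construct
    difficulty     -- item location on the same scale
    discrimination -- item slope; how sharply the item separates persons
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

def rasch_probability(theta, difficulty):
    """Rasch model: the 2-PL model with equal discrimination (fixed at 1) for all items."""
    return two_pl_probability(theta, difficulty, discrimination=1.0)

# A person located exactly at the item difficulty has a 50% response probability.
print(rasch_probability(0.0, 0.0))

# A more discriminating item reacts more strongly to the same distance from its difficulty.
print(rasch_probability(1.0, 0.0), two_pl_probability(1.0, 0.0, discrimination=2.0))
```

Under the Rasch model only the difficulty differs between items, whereas the 2‐PL model additionally estimates a discrimination parameter per item.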

The applied IRT software and the corresponding item and person parameter estimation method(s) should also be cited, since not all software packages report the findings in the same way [22] and because the use of different estimation methods may result in different parameter estimates [11].

To make IRT results interpretable and trustworthy, three principal assumptions should be evaluated when applying a unidimensional IRT model [15, 23]. The first assumption concerns unidimensionality, meaning that the set of test items measures only a single construct [11, 15, 22, 23]. Analyses for checking unidimensionality can include different types of factor analysis of the items or the residuals. A more advanced method would be to compare a unidimensional IRT model with a multidimensional IRT model, for instance using a likelihood ratio test. The second (related) assumption concerns local independence of the items. A violation of this assumption may indicate that the items have more in common with each other than just the single underlying construct [11, 15, 22, 23]. This may either point to response dependency (e.g. overlapping items in the scale) or to multidimensionality of the scale [22]. It can lead to biased parameter estimates and wrong decisions about, for instance, item selection [15]. Local independence can be tested by a factor analysis of the residual covariations, or with more specific statistics targeted at responses to pairs of items [12]. The third assumption concerns the model’s appropriateness to reflect the true relationship between the underlying construct and the item responses [11, 15, 22, 23]. This can be examined with both item and person fit statistics. More information about these assumptions and suggestions about which aspects to report can be found in the literature [11, 15, 22, 23].

Other useful IRT applications include the evaluation of the presence of differential item functioning, the reliability and measurement precision, the ordering of the response categories or item thresholds, and the hierarchical ordering and distribution of persons and items along the scale of the underlying construct. Differential item functioning (DIF, also called item bias) is present when patients with similar levels of the underlying construct being measured respond differently to an item [15, 22]. Commonly examined types of DIF are DIF across gender and age [22]. Global IRT reliability is equivalent to Cronbach’s alpha, with the difference that not the raw score but the IRT score is being used in its calculation. Which specific global reliability statistics are presented usually depends on the software package used. Contrary to CTT methods, IRT also provides information about the local reliability [12] and, related to this, the instrument’s measurement precision along the scale of the underlying construct.
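The idea of measurement precision along the scale can be made concrete with the item information function. For a dichotomous 2‐PL item, the information at a given level theta equals a² · P · (1 − P), and the standard error of a maximum likelihood estimate of theta is approximately 1 divided by the square root of the summed item information. The Python sketch below assumes this standard parameterization; the five‐item scale is hypothetical and only serves as an illustration.

```python
import math

def item_information(theta, difficulty, discrimination=1.0):
    """Fisher information of a dichotomous 2-PL item at level theta."""
    p = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return discrimination ** 2 * p * (1.0 - p)

def standard_error(theta, items):
    """Approximate standard error of theta for a scale.

    items -- list of (difficulty, discrimination) tuples
    """
    total_information = sum(item_information(theta, b, a) for b, a in items)
    return 1.0 / math.sqrt(total_information)

# Hypothetical five-item scale with item difficulties clustered around 0.
items = [(-1.0, 1.0), (-0.5, 1.0), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]

# Precision is best where the items are located and worsens towards the extremes.
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(standard_error(theta, items), 2))
```

Unlike a single global reliability coefficient, this shows for which part of the scale an instrument measures precisely and where additional items would be needed.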

With rating scale analysis, the ordering of the response categories or item thresholds can be checked, enabling the evaluation of the appropriateness or redundancy of the response categories [15]. Likewise, the hierarchical ordering and/or distribution of persons and items along the scale can be analysed to determine the measurement range of the outcome measure and to determine whether the items function well for the included population sample or whether there is a mismatch between them [23].

Results

General information of included studies

The initial database search yielded a total of 93 eligible studies. Six additional studies were identified by manual reference checks of the selected studies. This resulted in a final selection of 99 studies (Additional file 2). Figure 2 shows that the prevalence of IRT analysis within rheumatology increased markedly over the past decades. This is consistent with conclusions from Hays et al. [19], and with findings from Belvedere and de Morton [8], who examined the frequency of Rasch analyses in the development of mobility instruments.

Figure 2. Number of published articles reporting the application of IRT within rheumatology.

Table 1 presents an overview of the most prominent results. By far, most research was carried out with patients from the United States or the United Kingdom, but data from patients from The Netherlands and Canada were also regularly used. The vast majority of studies involved cross‐sectional IRT analyses. Since the start of the 21st century, however, an increasing number of studies have performed longitudinal IRT analyses, as reflected in a rise in DIF testing over time.

Study samples varied from as few as 18 persons in the study of Penta et al. [27] to as many as 16,519 persons in the study conducted by Wolfe et al. [28]. Most studies (92.9%) performed analyses on a sample of at least 50 persons.

In 85 of the 99 studies IRT analyses were applied to PROs. The remaining 14 studies applied IRT to CMs. The vast majority of the studies applied IRT to data gathered from patients suffering from rheumatoid arthritis (RA) or osteoarthritis (OA).

Outcome measures of overall physical functioning and quality of life were most frequently analysed. To a lesser extent, studies applied IRT to PRO measures of specific functioning [27, 29‐37], pain [35, 38‐43], psychological constructs [44‐46], and work disability [47‐51]. Studies also applied IRT to CMs such as measures of disease activity [52‐54] and disease damage or radiographic severity [55‐57].

Purpose of analyses

The most common main goals for both the PRO and the CM studies were the development or evaluation of new measures, the evaluation of existing measures, and the development or evaluation of alternate or short form versions of an existing measure. In addition, several studies aimed to cross‐culturally validate a patient‐reported or clinical measure. IRT was rarely applied for the development of item banks [17, 58] or computerized adaptive tests [59, 60].

Specific IRT analyses

IRT model and software

The vast majority of IRT applications within rheumatology involved Rasch analyses, although a clear specification and rationale of the applied Rasch model was not always given. Few studies used a two‐parameter IRT model or Mokken analysis. Most analyses were carried out with the software packages Bigsteps/Winsteps or RUMM.

A motivation of the model choice was only provided in 27.3% of the studies. Likewise, the item and person parameter estimation methods were rarely specified (8.1% and 4.0% of the studies, respectively).