
Issues in the Measurement of Health-Related Quality of Life






Rod O'Connor

Working Paper 30   July 1993

NHMRC National Centre for
Health Program Evaluation
Melbourne, Australia


ISSN 1038-9547
ISBN 1 875677 26 7

 

Rod O'Connor is Executive Director of Rod O'Connor & Associates Pty Ltd, Consultants in Health Measurement and Research, and Associate Professor (Conjoint) in the School of Public Health and Community Medicine, University of New South Wales, Sydney, Australia. Email: rod@RodOConnorAssoc.com

Please Note: 'Issues in the Measurement of Health-Related Quality of Life' has been superseded by Rod O'Connor's 'Measuring Quality of Life in Health', Pub. Churchill Livingstone, 2004

This provides a major advance over 'Issues in the Measurement of Health-Related Quality of Life'. Topics receiving special attention in 'Measuring Quality of Life in Health' include:
o Context, role and nature of patient-centred health outcome measures
o The psychological mechanisms that mediate reports of health-related quality of life
o First actions when selecting, assessing, developing or modifying a test
o How to evaluate a test
o The use of single item measures
o Validity, or the degree to which a scale measures what the user intends it to
o Classical approaches to developing a test
o Steps in developing a classical test
o Modern test theory, in particular application of the Rasch Model
o Major current issues, such as interpreting quality of life scores for individual patients, cross-cultural application issues, measuring ‘utilities’, and the problem of missing or incomplete data.

To order, go to http://www.RodOConnorAssoc.com/New_Book.htm

 

'Issues in the Measurement of Health-Related Quality of Life'
 

A. INTRODUCTION

B. CONCEPTS AND DEFINITIONS IN HEALTH-RELATED QUALITY OF LIFE 

1. The Motivation for Developing Quality of Life (QOL) Measures in Health
2. Types of Health-related QOL Measure
3. The Notion of QOL as Used in Health
4. QOL as an Individual's Subjective Well-being
5. Objections to QOL as Subjective Well-being
6. Factors that Influence Reported Subjective Well-being
7. Implications of SWB Findings for QOL Measurement
8. General Problems in the Measurement of a Construct such as HQOL

C. SCALING

1. Stevens: the Effects of Task on Number Properties
2. No Scaling is Direct
3. Magnitude Estimation and Category Rating may Measure Different Things
4. Scaling in Health-state Assessment
4.1. Does category rating provide an interval scale?
4.2. What is the effect of providing defined versus vague endpoints to a scale?
4.3. Effects of task complexity
4.4. Effects of stimulus materials
4.5. What is a category rating task?
5. How Important are Interval Properties for Statistical Operations on Health States?
6. Conclusions and Observations

D. RELIABILITY

1. Types of Reliability
1.1. Test-retest, or measure of stability
1.2. Alternate form method, or measure of equivalence
1.3. Measures of internal consistency - Split-half method
1.4. Measures of internal consistency - Methods based on item covariance, or coefficient alpha.
2. Factors that Affect Reliability Coefficients.
2.1. Characteristics of the subjects: variation in the behaviour, and ability to perform the measurement task
2.2. Test items: number (test length) and homogeneity
3. Reporting and Interpreting Reliability
4. Conclusion

E. VALIDITY

1. Content Validation
2. Criterion-related Validation
3. Construct Validity
3.1. Correlations between the test and other tests
3.2. Correlation between the test and selected variables
3.3. Convergent and Discriminant validation
3.4. Construct representation
3.5. Sensitivity/responsiveness
3.6. `Descriptive validity'
4. Techniques Used in the Measurement and Development of Validity
4.1. Correlation
4.2. Multiple regression
4.3. Factor analysis
5. Conclusions and Observations: The Case for Concept Validity 

F. SPECIFIC ISSUES WHEN CONSTRUCTING HEALTH-RELATED QOL MEASURES 

1. The Importance of Clearly Defining the Purpose to which the HQOL Test will be put and the QOL Concept to be Used
2. The Structure and Outputs Required of the HQOL Instrument
3. Selecting the Task Used to Develop and Scale the Test
4. Determining the Content of the Test
4.1. Forming content materials
4.2. Who should provide the health state assessments?
4.3. Which dimensions should be assessed for a comprehensive HQOL test?
5. The Treatment of Future Events and Mortality
6. Method of Test Administration
6.1. The need for practicality
6.2. Self-administration can produce less valid measures
7. Interpretation of Test Scores

G. SOME CURRENT TOOLS

1. Bergner's Sickness Impact Profile (SIP)
1.1. Initial development
1.2. 1974 field validation
1.3. 1976 field testing
2. Quality of Well-being (QWB) Scale
2.1. Nature of the QWB
2.2. Validation
2.3. Problems for the QWB
3. Torrance's Utility Model
3.1. Testing of instruments
3.2. Approach to validity
3.3. Development of a Multi-attribute Utility (MAU) Scale
4. Rosser's Classification of Illness State
5. Concluding Observations

BIBLIOGRAPHY

APPENDIX 1

Criteria for developing a satisfactory expert-referenced test of work-related disability

 

A INTRODUCTION

This paper reviews the literature and discusses the major issues regarding the development of a reliable, valid and practicable instrument that could comprehensively measure a patient's quality of life.

Issues of definition are first considered, with special attention to the notion that quality of life could be defined as a patient's subjective well-being. The complexities of measuring such a psychological construct are noted, followed by consideration of the major issues that are entailed. These concern the problem of interpreting such measurements as interval or ratio data, and the means of developing and assessing test reliability and validity. Next, issues of test construction and interpretation of specific importance to health status measures are investigated. Finally, there is an examination of the conceptualisation, development and psychometric status of four of the major instruments proposed to provide health status measures (those developed by Bergner; Bush, Kaplan et al.; Torrance; and Rosser).

The aim of the review is to provide basic information and analysis that is central to anyone considering the development of a health status assessment instrument from a psychometric perspective, or who wishes to assess existing instruments. It should not be concluded that the review is definitive: there are a number of issues that are only hinted at in this review (eg. the need for a systematic comparison of rating versus trade-off measures of health state value). However it is hoped that the information, analyses and thoughts presented may be of assistance to those working in what is a very complex and increasingly important inter-disciplinary field.

To aid assimilation of the information, and to permit a rapid overview, the concluding sections of each of the first four chapters may be read (ie. the final summary sections of 'Concepts and Definitions'; 'Scaling'; 'Reliability'; and 'Validity'). The final two chapters, 'Issues when constructing health-related quality of life measures', and 'Some current tools', are best read in their entirety.

 

B CONCEPTS AND DEFINITIONS IN HEALTH-RELATED QUALITY OF LIFE

1 The Motivation for Developing Quality of Life (QOL) Measures in Health

The development of Quality of Life measures in health has been encouraged both by the need to assess the relative merits of rival health programs in a context of increasing pressure on health resources, and a desire to be able to comprehensively assess the impact of clinical therapies.

(a) Resource allocation

The need to have a measure of health program effect that goes beyond traditional output measures (such as number of patients treated) has been a major motivation in the development of generic health status assessment instruments. There has been an increasing need to rationally allocate health service resources across diverse health programs. Arising from this, a number of measures of the broad effects of illness state on a patient's life have been developed with the declared aim of assisting health policy decisions. The Rosser Index has been seen to allow evaluation of health service funding (Gudex, 1986), and Torrance (1972) described the development of his health utility approach as a means of measuring health improvement that is disease and program independent, the aim being to facilitate decisions regarding program funding. Similarly Kaplan (1988a) has applied the QWB to develop a General Health Policy Model.

Even the Sickness Impact Profile (or SIP) developed by Bergner and co-workers (e.g. see Bergner et al., 1976a; 1976b), seemingly developed outside the context of cost-utility theory, had this as a declared aim. Bergner stated that the SIP was developed with the specific aim of providing information on the efficacy of health programs to assist decisions regarding the appropriate allocation of the government's resources. It was intended to provide a "fiscally and logistically practical measure of health status" (Bergner et al., 1976a, p. 393).

(b) Assessing clinical outcomes

Revicki (1989) notes that advances in medical research and therapy have shifted health care resources from the diagnosis and treatment of infectious disease to the prevention and control of chronic disease: with this has come an increased emphasis on changes in functional status and quality of life outcomes. This shift is reinforced by the undesirable aspects of many modern treatments. Bergner (1989) notes that the consequences of treatment and treatment-related side effects may affect all of a patient's life (eg. becoming bald and nauseous, being on a restricted diet, being tied to a machine 12 hours out of 24, etc.), and hence it is important to assess all aspects of a treatment's effects. In this context Revicki (1989) cited a 1986 study by Croog et al. which compared three anti-hypertensive drugs, and selected measures to indicate QOL dimensions of general well-being, sleep dysfunction, sexual problems, work performance, social activity participation, physical distress, and cognitive function. While the drugs were found to have comparable efficacy in decreasing blood pressure, and there were no differences between the treatment groups on measures of sleep dysfunction, social participation, and visual memory, one drug did show differences on general well-being and physical distress.

Deyo and Patrick (1989) have also pointed out that medical interventions may result in improved functional health status without evidence of physiologic improvement (e.g. pulmonary rehabilitation programs may improve exercise capacity without altering pulmonary function tests), while on the other hand therapy may result in physiologic improvement without discernible clinical benefit to patients (e.g. nitrate therapy may alter haemodynamics in patients with heart disease without improving exercise capacity).

Cancer is an area which has been noted as particularly relevant to QOL considerations. Donovan et al. (1989) note evidence that the emotional suffering produced by cancer exceeds the physical suffering it causes, while at the same time pointing out that QOL measures have generally not been included in clinical trials of cancer therapy. This is at least in part due to physicians not being comfortable working with social scientists, whose tools have been seen as "soft" and cumbersome (Skeel, 1989), and lacking credibility (Deyo & Patrick 1989).

2 Types of Health-related QOL Measure

A large number of measures of health status and associated notions are available (eg. see McDowell and Newell, 1987). Most can be characterised in terms of three continuums: disease specific versus generic measures; single dimension versus broad spectrum measures; and the range of values output. (See also Bergner 1989; Donovan et al. 1989).

(a) Disease specific versus generic measures

Measures vary in the degree to which they are developed to measure a specific disease or to be capable of application to many or all illness states. As noted by Deyo and Patrick (1989), disease specific measures have greater salience for physicians, better focus on functional areas of particular concern, and may possess greater responsiveness to disease-specific interventions. On the other hand generic measures permit comparisons across interventions and diagnostic conditions, which is particularly important for policy makers concerned with resource allocation. They also allow dysfunction to be quantified for an individual experiencing several disease conditions (Bombardier et al., 1986; Temkin et al., 1989).

There is some evidence that generic measures can be as responsive in some settings as disease-specific measures. Kaplan et al. (1989) criticise disease-specific methods on the grounds that all diseases and disabilities affect overall quality of life, and the purpose of QOL measures is not to identify clinical information relevant to the disease but to determine the impact of the disease on general function. General QOL measures are proposed as preferable because they can capture a wide variety of dysfunction that might be in different systems, i.e. not specific to the disease condition (e.g. confusion, tiredness, sexual impotence, depression).

Certainly general measures have the ability to capture side effects and benefits that might not have been anticipated (Kaplan & Anderson, 1988), although once identified a disease-specific measure could be prepared to more exactly assess the dimension of interest.

(b) Single dimension versus broad spectrum measures

Measures vary according to the degree to which they focus on particular activities, or attempt to encompass the full range of aspects of living that may influence personal contentment or satisfaction for a person with the condition.

As noted by Bergner (1989), there are measures that focus on particular activities such as walking, eating and dressing (eg. Activities of Daily Living Index), while others measure physical functioning plus other health related aspects such as symptoms, emotional status, cognition, perceptions of health etc. The latter group consists of both measures specific to a disease and general measures (such as Bergner's Sickness Impact Profile).

(c) Range of values output

Measures also vary in that they may output a single value, a series of sub-scale values, or a series of sub-scale values plus an aggregate value. Examples of scales that both measure multiple dimensions and provide an aggregate measure are the QWB and SIP.

3 The Notion of QOL as Used in Health

While there is general agreement on the potential value of QOL measures as key evaluation variables, there is an absence of clear agreement on a definition of QOL: definitions of QOL in the health context are mostly vague or absent. As noted by Deyo and Patrick (1989), conceptions relevant to health and QOL are diverse, scattered through many disciplines, and use many different labels (e.g. health status, functional status, disability scale, quality of life). Bergner (1989) notes that the notion of Quality of Life (QOL) has been a category in Index Medicus since 1966, yet QOL is usually not defined in the reports of clinical trials, and "definitions must be deduced from the dimensions assessed", and that "each investigator that purports to address quality of life actually examines a very narrow and specific set of factors". Generally notions of quality of life are not specified, but are considered to be implicit in the measure used, ie. they are more inferred than explained.

Nonetheless there seems to be acceptance that health-related QOL is a `multi-dimensional concept that encompasses the physical, emotional, and social components associated with an illness or treatment' (Revicki, 1989). Which precise dimensions to include is less agreed. For example, Torrance (1987) states that physiological and emotional functioning contribute directly to quality of life, and "taken together these two constitute health-related quality of life" (Torrance, 1987, p. 593), with social functioning (eg. social role and social contacts) outside the scope of health-related quality of life. On the other hand Kaplan et al. (1989) use the term health-related quality of life to refer to the impact of health conditions on function, but include social role, although suggesting that health-related quality of life may be independent of quality of life relevant to work setting, housing, or similar factors.

The sampling of the proper dimensions when estimating QOL is central to the validity of QOL measures, and is considered further in Chapter F of this report.

4 QOL as an Individual's Subjective Well-being

To develop a clear conceptual base it is useful to attempt to clarify a notion of `quality of life'. The term `quality of life' (QOL) can have several meanings. It may be used to refer to outward material circumstances, such that good quality of life is represented by good physical health, material security, supportive family and friends, etc. Alternatively it can refer to subjective well-being, or SWB, meaning an individual's sense of happiness or satisfaction, typically reflecting a global assessment of all aspects of their life. (McCauley and Bremer, 1991, make a similar distinction between outward circumstances and personal assessment in their proposal of `objective well-being' versus `subjective well-being'.)

Both emotional and cognitive factors may be referred to as part of subjective well-being, while objective conditions such as health, wealth, and comfort are seen to be potential influences but not inherently or necessarily part of the notion. As noted by Diener (1984), the literature on SWB broadly concerns notions such as happiness, morale, positive affect, etc., and covers both positive judgement and affective reactions; it has been concerned either with what leads people to evaluate their lives in positive terms (a global judgement regarding life satisfaction), or happiness in terms of a preponderance of positive affect over negative affect.

The work of Campbell is often referred to when interpreting quality of life as subjective well-being. For example Donovan et al. (1989) cited Campbell (1976) in suggesting that an accepted general definition of quality of life is "a persons subjective sense of well-being, derived from current experience of life as a whole". In the context of treatment selection, Goodinson and Singleton (1989) propose a definition of quality of life as "the degree of satisfaction with perceived present life circumstances" (citing Young and Longman, 1983), this being seen to encompass the `physical, social, and material well-being of an individual', and to concern an evaluation of the physical, psychological and social impact of disease treatment on patients' lives.

Within this framework, QOL is seen to be influenced by quite idiosyncratic factors, with a major determinant of an individual's quality of life being the perceived discrepancy between what is and what could have been. Skeel (1989), considering quality of life from the context of cancer research, cites Calman (1984) in that quality of life "is the extent to which a persons hopes and ambitions are matched and fulfilled by experience". In further support of this interpretation, Campbell (1981) reported that when questioned about the quality of their lives, apparently healthy individuals respond in terms of life satisfaction, usually in relation to specific domains, where satisfaction is proportional to the closeness between aspiration and achievement. Bergner (1989) also reports the notion that QOL is enhanced as the distance between attained and desired goals diminishes.

The implication is that changing expectation can lead to altered perception of QOL in similar circumstances, and different experiences may have different quality of life implications for different individuals. The notion may also help to explain why some people appear to adapt to changed circumstances very rapidly, ie. by reducing their aspirations (see B 6(c)).

5 Objections to QOL as Subjective Well-being

There is no doubt that objective external factors such as income, length of survival, change in tumour volume, etc., influence quality of life. Generally such factors are assessed to be influences on QOL, not the QOL itself; however, there are those who appear to argue that quality of life should be identified with physical conditions only. In the context of health status measurement, Kaplan et al. (1989) stated that "most investigators believe that symptoms and mortality do represent quality of life" (Kaplan et al., 1989, p. S31), contrasting this approach with those who regarded quality of life as "subjective appraisals of life satisfaction" (citing Hunt and McEwen, 1983), or those who combine a patient's subjective evaluation of well being with physical symptoms, sexual function, work performance, emotional status, etc. (citing Croog et al., 1986).

There would seem to be at least two versions of a position where objective, externally observable measures are used exclusively when assessing quality of life. These might be termed a `non observables are banned' position, and an `only in development' argument.

(a) `Non observables are banned'

The `non observables are banned' position states that any behaviour that cannot be directly observed and confirmed by an independent observer is unworthy of analysis. In the development of the SIP, Bergner et al. (1976a) reported deciding that, of the feeling state, clinical, and performance conceptions of an individual's own health state appraisal, only the last of these, the performance conception, was suitable. The feeling state conception was ruled out on the grounds of being inaccessible to external validation, and the clinical conception was seen as unsuitable as it required medical interpretation and hence was reliant on the definitions of physicians and not the person concerned. The performance conception was adopted as it could be based on respondent report, but could also be easily observed and reported by an untrained observer, and also allowed easy comparison between different diseases and dysfunctions.

The difficulty with this approach is that it makes determination of the relative importance of different forms of physical quality of life/objective well-being exceptionally difficult, as the subject's own view of relative desirability would be precluded. Either one does not develop a global index, or one arbitrarily assigns weightings so that dimensions/sub-scales can be combined to form a global index. Kaplan, Bush and Berry (1979) have referred to this issue in suggesting that the category rating task allows a single global rating to be given to total case descriptions so that the subject can consider the multiple dimensions of health jointly and simultaneously, and argue that this is necessary if arbitrary rules for combining attributes into a total case rating are to be avoided. A different means of using patient report to weight dimensions has been noted by Goodinson and Singleton (1989), who refer to a 1985 study by Ferrans and Power where Likert scales were used to measure satisfaction and then measures were obtained on the relevance of each item/domain to the individual, with an aggregate QOL index formed by weighting each item/domain according to its reported relevance value and then summing the weighted values.
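The relevance-weighted aggregation just described can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the function name, the domain names and the rating values are hypothetical, and the simple sum of products is one plausible reading of "weighting each item/domain according to its reported relevance value and then adding together", not the published scoring procedure of the cited study.

```python
def weighted_qol_index(satisfaction, relevance):
    """Return the sum over domains of satisfaction * relevance.

    Both arguments are dicts keyed by domain name. The scale ranges are
    not fixed here; any numeric Likert-type ratings may be supplied.
    (Hypothetical helper written for illustration only.)
    """
    if satisfaction.keys() != relevance.keys():
        raise ValueError("satisfaction and relevance must cover the same domains")
    return sum(satisfaction[d] * relevance[d] for d in satisfaction)


# Made-up ratings for one respondent (illustrative values, not study data).
satisfaction = {"health": 4, "family": 5, "work": 2}
relevance = {"health": 3, "family": 2, "work": 1}

index = weighted_qol_index(satisfaction, relevance)
print(index)  # 4*3 + 5*2 + 2*1 = 24
```

The point of the sketch is that the weights come from the respondent rather than from the test developer, so two individuals with identical satisfaction ratings can obtain different aggregate scores.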

(b) `Only in development'

This alternative position states that any measure of subjective well-being reliant upon the report of an individual is liable to random measurement error, and hence it is preferable to develop an instrument which can be used to predict subjective well-being independently of the subject's own report. While subject report is useful and possibly essential for instrument development, it is to be avoided in instrument application, ie. when making a specific assessment.

This latter approach seems more reasonable. Basically it allows subject report measures to play a role as dependent variables when developing a test instrument, and grants subjective well-being an important role in the development of QOL measures.

An approach that attempts to minimise the role of a direct measure of subjective well-being in a test situation may be sensible as there is considerable evidence that direct report can be misleading. Evidence relating to this issue is discussed in the next section.

6 Factors that Influence Reported Subjective Well-being

(a) Life events and experiences

There is little doubt that subjective well-being is influenced by major life events and experiences, eg. housing, employment, health, marriage etc., and a great deal of research has been concerned with the notion that the major causes of change in SWB are major life events and experiences (Diener, 1984; Heady et al., 1985; Heady and Wearing, 1989).

The relationship between life events and SWB is not simple. As well as issues regarding the relative effect of different types of event, there are questions concerning the effect of overall SWB on satisfaction within a given domain. Among the variables commonly treated as affecting SWB are domain satisfactions (eg. with marriage, health, work etc.), major life events, and reference standards (eg. expectations, aspirations, sense of equity). Furthermore satisfaction within a given domain could conceivably be a consequence of SWB. For example, Heady et al. (1991) have argued that satisfaction with work, standard of living, and leisure satisfaction, are largely the result of overall life satisfaction, and that satisfaction with friendship and general fitness (as opposed to illness) appear to be explicable solely on the basis of personality; on the other hand satisfaction with marriage appeared to both influence and be influenced by overall SWB.

In terms of the effects of illness, Diener (1984) concluded that objective health is significantly related to SWB, although the relationship appeared to be much weaker than that between self-rated health and SWB. The relationship between health and SWB may also be bi-directional: Hughes (1985) found that depression following lung cancer radiotherapy may exacerbate symptom distress (tiredness, anorexia, pain). Donovan et al. (1989) also pointed out that cancer patients experience positive impacts of the disease on their life as well as negative, eg. increased closeness to spouse.

(b) Personality variables as mediators

Evidence has been reported that personality traits of extraversion and neuroticism are highly stable and can predict SWB 20 years later (Costa and McCrae, 1980, 1984, cited Heady and Wearing, 1989). It has been argued that personality can heavily mediate the impact of exogenous life events, with each person having a "normal" equilibrium level of life events and SWB, predictable on the basis of age and personality; only when events deviate from equilibrium levels is SWB seen to change (Heady and Wearing, 1989).

Individuals have also been shown to vary in coping strategies, which in turn can affect physical factors such as health outcomes. Greer (1979) and Pettingale (1984) showed that recurrence free survival at 5 and 10 years after surgery for breast cancer was related to psychological approach at three months (fighting spirit or denial were better than helpless/hopeless responses).

(c) The effects of adaptation

In addition to individual-specific variables, there are general psychological mechanisms that act to increase SWB independently of direct physical effects. For example patients frequently experience release of anxiety and stress in the initial stages of recovery following surgery (Cohen, 1982, see Goodinson and Singleton, 1989). Other effects can develop more steadily, for example Cassillet (1984, cited Breetvelt and Van Dam, 1991) reported that patients with newly diagnosed illness had greater anxiety and depression than patients who had been living with the illness for longer periods.

This adaptation to illness has been much reported, with many studies suggesting that patients may differ from controls markedly in physical complaints while differing little or not at all in terms of psychological complaints. For example Cassillet et al. (1982, cited Breetvelt and Van Dam, 1991) found that melanoma patients had superior psychologic well being to other patients suffering dermatological disorders, and moreover the mean score for patients was not different from that of the normal public.

Adaptation can be so great as to apparently eliminate SWB differences between people who are chronically ill and controls, or even those who have recently had very positive experiences. Brickman et al. (1984) found that lottery winners and quadriplegics differed little from normal controls in SWB, and De Haes and van Knippenburg (1984, cited Goodinson and Singleton, 1989) reported that in many studies of QOL in cancer patients, no differences are found compared to benign controls.

These findings may sometimes reflect inadequacies in the QOL measurement instruments. However they also suggest fundamental homeostatic processes, such as the re-setting of expectations, change in reference standards, etc. Diener (1984) concluded that health does seem correlated with SWB, but that adaptation markedly reduces its influence.

How long it takes to adapt, to what extent people can and do adapt, and the factors determining this, seem still to be largely unknown. However Breetvelt and Van Dam (1991) have reported interesting findings that suggest how adaptation may be measured independently of SWB. First observing that many studies which employ patient self-report suggest that cancer patients are not more anxious or unhappy than other patient groups or even the normal healthy population, they suggest that this seems to conflict with the everyday experience of physicians and other care takers. A recent paper by Epstein et al. (1989) has provided evidence of this, where family/friend care givers were asked to act as proxies for older chronically ill patients. Although proxy and subject-own responses were generally similar for overall health, functional status, and social activity, proxies rated subjects' emotional health and satisfaction significantly lower than did the subjects themselves (of course this could indicate inaccuracy of the proxies).

Breetvelt and Van Dam (1991) argue that the appropriate control for a patient's report of well being is not the report of healthy subjects, but "retrospective pre-test" (citing Hoogstaten, 1985). They present evidence that while patients may not rate their current level of well being differently to that of controls, patients may give a much higher rating to the state that they experienced prior to their illness or accident. In other words, patients rate themselves as being considerably happier in the past than do control groups.

Breetvelt and Van Dam (1991) attributed estimates of current SWB (placed at similar levels to non-patients) to a subjective rescaling of what constitutes happiness, ie. a criterion shift in terms of the quantitative level that constitutes normal well being. It was suggested there may additionally be a change in the relative weighting of psychological components versus physical components, ie. the dimensions patients use to assess well being.

This line of investigation also suggests that there may be real problems in assuming that self-report is a valid measure of SWB. For example if patients make judgements relative to an internal criterion that is in some way adjusted to bring about a positive report (eg. by making downward comparisons with patients even less well-off; Taylor, 1983, cited Breetvelt and Van Dam, 1991; Diener, 1984), then cognitive factors may lead to what Breetvelt and Van Dam call `under reporting', ie. that "patients report less emotional distress, satisfaction or the like than is actually present" (Breetvelt and Van Dam, 1991, p. 983).

That self report may be unreliable has been suggested by other studies. Bombardier et al. (1986) found, when examining the ability of the QWB scale and other measures to detect the effects of auranofin (oral gold) on arthritis, that self-ratings by patients failed to detect significant treatment effects that were indicated by other instruments. Tests of self-versus-interviewer test administration have also suggested that self report can reduce test validity (see Chapter F, section 6).

7 Implications of SWB Findings for QOL Measurement

Subjective well-being is influenced by factors other than external events or physical conditions. For the purposes of developing an instrument to measure health-related QOL, personality effects could possibly be ignored and treated as a random variable. However adaptation effects are a different issue: self-reported QOL may well differ from underlying QOL (ie. represent under-reporting). The apparently malleable and relative nature of QOL self-assessments means self-reported QOL cannot be taken as a criterion measure for the QOL of a given health state (in the sense of validity assessment). The fact that an individual's report indicates acceptance of their state does not mean that they would not greatly prefer an alternative one if the choice were available.

The consequence of this analysis is that in developing a measure of health-related QOL subjective self-report data should be treated very carefully, and alternative methods of estimating current QOL need to be explored. Investigations are needed into:

(a) the factors and conditions determining the rate and extent of adaptation;

(b) the nature of adaptation, ie. can one accept reported SWB at face value, or does it represent under-reporting, where `actual' or underlying SWB is less than reported SWB;

(c) the value of techniques such as `retrospective pre-test', which may provide a more valid measure of `actual' SWB;

(d) whether care givers also experience adaptation (this may in part explain the diminishing relationship between patient report and doctor report as doctors become more senior and experienced; see Chapter F).

While such basic research is conducted, instrument development that makes use of patient report information needs to proceed with caution, and the following guidelines are suggested:

1 Include a measure such as the `retrospective pre-test' as part of the test battery when assessing QOL.

2 Attempt to include the measurement of variables relevant to the assessment of adaptation. Even in the event that an adaptation-free measure of illness impact is possible, it is likely that adaptation has an effect on `underlying' as well as `surface' (ie. current reported) SWB. Furthermore illnesses may differ in the extent to which they allow adaptation: eg. illnesses where acuity fluctuates or is uncertain may be less readily adapted to, compared to stable conditions where the prognosis is clear.

3 Recognise that subjective well-being is not a criterion measure of QOL, but it may still be the single best indicator available.

It may be wise to include self-report wherever possible as a general catch-all in the event that there is a key dimension that is of specific but unexpected relevance to a given condition or program that is overlooked (also see Chapter E for Bergner's notion of descriptive validity, as well as Kaplan's criticisms of factor analysis in test formation).

4 Clearly separate the steps involved in the development of a test measure, from the steps involved in the application of the developed test. Self-reported SWB might play a central role in test development, but a different role, or no role at all, in the final instrument.

8 General Problems in the Measurement of a Construct such as HQOL

If QOL is identified with subjective well-being alone, then a simple approach might have been to ask the patient. However, as described, QOL estimates via simple self report may be misleading. Adaptive cognitive mechanisms can influence the report: the judgement underlying the QOL rating becomes in some way comparative, producing an over-estimation of QOL (eg. through a diminished threshold of what constitutes an acceptable QOL).

Even if the adaptive nature of SWB had not been made evident, some procedure for estimating QOL beyond a simple inquiry would still be necessary. A single response can be a rather unstable, complex, and possibly misleading indicator of an attribute, and in most behavioural measurement it is necessary to combine several responses and types of response if the estimate is to be reliable.
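The gain from combining several responses can be illustrated with the Spearman-Brown formula from classical test theory. The formula is not discussed in this section and is offered only as a sketch: it assumes the responses are parallel items (equal reliability, uncorrelated errors), and the single-item reliability of 0.4 is a hypothetical value chosen for illustration.

```python
def spearman_brown(single_item_reliability: float, n_items: int) -> float:
    """Predicted reliability of the sum (or mean) of n parallel items."""
    r = single_item_reliability
    return n_items * r / (1 + (n_items - 1) * r)


# Hypothetical single-item reliability, for illustration only.
R_SINGLE = 0.4

for k in (1, 2, 5, 10):
    # Predicted reliability rises as more parallel responses are combined.
    print(k, round(spearman_brown(R_SINGLE, k), 3))
```

Under these assumptions the predicted reliability rises steadily with the number of items combined, which is the sense in which a single response is a comparatively unstable indicator.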

There seems to be no simple measure of health-related QOL and reported SWB is likely to be an imperfect measure of `actual' SWB. A test that aims to measure health-related QOL (or SWB) is endeavouring to measure a construct. By construct is meant a hypothetical concept which can never be directly measured or absolutely confirmed (unlike physical attributes such as height or weight), but only inferred from observations of behaviour. To estimate the value of a construct it is necessary to establish an operational definition, ie. a rule or rules of correspondence between the construct and behaviours that indicate it. A test of the construct is then a procedure for obtaining samples of this/these behaviour(s) that allows the value of the construct to be estimated (see Crocker and Algina, 1986; Anastasi, 1990).

To construct a test to measure QOL requires consideration of the conditions under which numbers may be reliably and validly assigned to represent the magnitude or amount of a psychological attribute. It subsumes issues such as reliability, standardisation, validity, and scaling. In addition, and as pointed out by Donovan et al. (1989) in the context of health-related QOL, for a measure to be meaningful it needs to have the psychometric properties of reliability and validity.

The problems facing the measurement of constructs have been outlined by Crocker and Algina (1986). These are stated below, along with a brief reference to where they are addressed in this report.

1 No single approach to the measurement of any psychological construct is universally accepted. Measures of psychological constructs are always indirect, hence theorists who talk about the construct may select very different behaviours to define the construct operationally.

See Chapter E, Validity.

2 Psychological measurements are usually based on limited samples of behaviour - determining the number of items and the variety of content necessary to provide an adequate sample of the behavioural domain is a major problem in developing a sound measurement procedure.

See Chapter E, Content validation.

3 The measurement obtained is always subject to error. It is very unlikely that scores obtained on re-testing the same individuals would ever be identical, due to fatigue, boredom, guessing, carelessness, etc. If a different form of the test is applied, scores may also change because of variation in content.

See Chapter D, Reliability.

4 The measurement scales will tend to lack well-defined units - the properties of the measurement scale, the labelling of the units, and the interpretation of the values derived are complex issues.

See Chapter C, Scaling.

5 The psychological construct measured by the test must be both operationally defined (i.e. defined in terms of observable behaviour) and hence capable of being empirically demonstrated, and defined in terms of its relationship to other constructs or events in the real world.

See Chapter E, Construct Validity.

 

C SCALING

A necessary consideration when developing instruments for assessing QOL is the nature of the data that is input to form the instrument. The development of an instrument for measuring a psychological construct involves the hypothesis that the construct is a property occurring in varying amounts that can be quantified using a scaling rule or a theoretical unidimensional continuum, and entails determining the real-number properties the scale values on this continuum possess, i.e. nominal/ordinal/interval/ratio. This in turn determines which statements concerning values on a scale are meaningful, and which mathematical analyses can be legitimately applied to them.

The relevant area in psychology is known as scaling, scaling being `concerned with the theory and practice of associating numbers with psychological objects' (Eyfuth, 1972). Scaling has been heavily influenced by psychophysics, which in turn concerns the `manner in which living organisms respond to the energetic configurations of the environment' (Stevens, 1972). Efforts to determine the functional relationship between a physical stimulus and perceptual experience produced methods which have been applied to areas away from the simple physical qualities of the environment (eg. to the scaling of attitudes).

1 Stevens: the Effects of Task on Number Properties

The early psychophysicists were concerned to study the relationship between measurement obtained in two different ways of what were presumed to be the same property, eg. they studied the relationship between weight, length and temperature defined by the response of human subjects as instruments, and weight, length and temperature defined by other measuring instruments such as scales, foot rules, and thermometers. A psychophysical law is a statement of the relationship between measurements obtained by these two methods. The experimental methods and statistical processes developed by the early psychophysicists have since been used in psychological testing and in the study of human ability, and have been developed and applied to measure human ability, personality, attitudes, interests, and many other aspects of behaviour.

In what is now classical work, Stevens (eg. see Stevens and Galanter, 1957) divided stimuli into those forming prothetic and metathetic continua. Prothetic continua were seen to be concerned with "how much", ie. quantitative aspects; an example is loudness, where different levels are formed by adding more of the same. On the other hand metathetic continua were concerned with "what kind", or "where", ie. qualitative aspects; an example is pitch, where differences are due to the substitution of new frequencies for old.

Stevens described the main differences between prothetic and metathetic continua as being exhibited in the formal relations that could be observed among three primary kinds of scaling measures, ie. magnitude, partition, and confusion measures.

In magnitude scales, the observer is asked to directly assign numbers to stimuli in proportion to their apparent magnitude, or of ratios among apparent magnitudes. The function relating stimulus magnitude to subjective magnitude is generally determined to be a power one. Cross-modality matching can be used to validate the scaling produced in this way; in this process the functions relating, say, loudness and brightness are first determined, and then it is shown that the exponents of the two functions can be used to predict the third where brightness is directly matched to loudness.
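The cross-modality argument can be sketched numerically. The code below assumes two continua that each follow a power function, psi = k * stimulus ** exponent, and checks that when stimuli on continuum B are matched to stimuli on continuum A, the matching function is itself a power function whose exponent is the ratio of the two separate exponents. The constants and exponents (0.6, 0.3, 1.0, 2.0) are hypothetical values chosen for illustration and are not taken from Stevens' data.

```python
import math


def subjective_magnitude(stimulus, k, exponent):
    """Power function relating stimulus intensity to subjective magnitude."""
    return k * stimulus ** exponent


def matched_stimulus(stimulus_a, k_a, exp_a, k_b, exp_b):
    """Intensity on continuum B judged equal in magnitude to stimulus_a on A."""
    psi = subjective_magnitude(stimulus_a, k_a, exp_a)
    return (psi / k_b) ** (1.0 / exp_b)


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) regressed on log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Hypothetical exponents and scale constants for continua A and B.
EXP_A, EXP_B = 0.6, 0.3
K_A, K_B = 1.0, 2.0

intensities_a = [1, 2, 5, 10, 20, 50]
matches_b = [matched_stimulus(s, K_A, EXP_A, K_B, EXP_B) for s in intensities_a]

# The slope of the matching function in log-log coordinates should equal
# the ratio of the two exponents.
slope = loglog_slope(intensities_a, matches_b)
print(slope, EXP_A / EXP_B)
```

The sketch shows only that the prediction follows if both power functions hold. As the later discussion notes, agreement of this kind does not by itself establish that the numbers assigned by subjects are linearly related to sensation.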

With partition scales, on the other hand, the observer assigns one of a finite set of numbers to each stimulus, eg. the numbers 1 to 5, or adjectives such as small, medium, large. This is seen to represent judgements of subjective differences, or distances. (Also see Eisler, 1962).

On quantitative (prothetic) continua, scales obtained by partition methods (eg. interval, category scales) are usually curved relative to the magnitude scale, ie. the partition scale gives a smaller exponent. With qualitative (metathetic) continua, the scales obtained by the different methods are linearly related. The loss of linearity between methods was seen by Stevens as being a product of forcing the observer to partition the continuum, which prevents the making of a proportional number assignment that would preserve ratios: this restriction causes the dramatic curvature in the scale.

Confusion scales include scales such as JND (Just Noticeable Difference), discrimination, paired comparisons, and successive intervals. These tasks tend to be concerned with determining thresholds of no difference/difference. The common feature is that some measure of variability or confusion is taken as the unit, in the sense that if there were no noise or confusion in human judgements then the JND would become infinitely small. With prothetic continua confusion scales were reported to be logarithmically related to magnitude scales, while being linearly related on metathetic continua.

2 No Scaling is Direct

Since the early work of Stevens, there has been a retreat from the view that saw magnitude estimation as the method of choice for scaling psychophysical functions. In a major review of psychophysical scaling, Gescheider (1988) concluded that it is now clear that sensation magnitude cannot be measured directly by any method, including the method of Stevens which was argued to be direct, ie. methods that require subjects to assign numbers to stimuli that presumably represent sensation magnitude (eg. magnitude estimation or category scaling). Responses in the scaling task are now seen to be a joint function of cognitive and sensory factors (first a sensory stage and then a cognitive stage), with the validity of a psychophysical scale able to be established only through an examination of how well the psychological responses of subjects can be predicted from theories of sensory and cognitive processes.

Attacks on magnitude estimation have come from Shepard (1981, cited Gescheider, 1988), who noted that the subject response in magnitude estimation is really only a discrete verbal response that itself possesses no definite qualitative magnitude, and that evidence from cross-modality matches does not guarantee validity as subjects could simply be assigning numbers to sensation in a nonlinear but consistent way. Zwislocki (1983, cited Gescheider, 1988) also found that individual subjects had unique but varying characteristic nonlinear functions when assigning numbers to sensation magnitudes.

It has also been observed that while average magnitude estimations are approximately a power function of stimulus intensity, subjects' responses can be influenced by many stimulus variables other than individual stimulus magnitude (eg. the range of stimuli presented in the experiment; Marby and Cook, 1986, cited Gescheider, 1988). Models to explain magnitude estimation results now suggest the subject rehearses the psychological continuum in which the stimulus is presented, with the subject's response determined by the location of the stimulus perceived on the continuum relative to various possible anchors.

There is now overwhelming evidence that responses in psychophysical experiments can be biased in many ways, with effects reported in both magnitude estimation and category rating tasks. Which effects appear (some biases are contradictory) seems to depend upon subtle conditions of instructions, stimuli, and response. Gescheider (1988) reviewed numerous examples, including:

- sequential bias effects, a bias to report values similar to the last reported (successive measures are correlated). The result of this is seen to reduce the response range, through assimilation of the responses toward the centre.

- contraction bias, where the subject's responses are closer to the centre of the response range than they should be (possibly a product of sequential dependencies)

- instruction or `framing' effects, eg. where numerical examples are given in magnitude estimation tasks, the function has been found to vary depending on the size of the numerical ratio given as example

- the tendency to use categories equally often in category rating

- stimulus frequency bias, where stimuli presented that are a little larger or smaller than a frequently presented stimulus are judged to be excessively different

- stimulus equalisation bias, a tendency to use the full range of responses whatever the size of the range of the stimuli

On the other hand it has been claimed (eg. by Zwislocki and others; see Gescheider, 1988) that subjects can make judgements of sensation magnitude that are relatively immune to context effects if specifically instructed to assign numbers to stimuli in such a way that the impression of the size of the number matches the impression of the sensation magnitude of the stimulus. When subjects are so instructed and asked to judge each stimulus independently, stimulus context effects are claimed to be relatively small (as opposed to where, say, example stimuli are given with an assigned number standard). However others have disputed this (eg. Mellers, 1983, cited Gescheider, 1988, claimed all judgements to be relative and occurring in context, with instructions and stimulus factors exerting a controlling influence), and at the time of Gescheider's review the issue was in doubt.

It is apparent that the defining of a category rating or magnitude estimation task requires a much more complex model of cognitive processing than the psychophysical one.

3 Magnitude Estimation and Category Rating may Measure Different Things

As noted, a new perspective on the interval scaling debate arose with the growth of cognitive psychology, with its emphasis upon cognitive processes as determinants of perception and response. In this tradition may be placed work by Anderson (1976), who addressed the problem of the `equal-interval' scale in psychological measurement as part of an attempt to develop a general theory of information integration.

Anderson's `method of functional measurement' proposed that whether or not a given task produces an interval measure of an underlying variable or construct may be determined through an examination of the effect of two or more variables on the response scale. If the response measure is an interval scale, and two stimulus variables combine linearly, then this should result in a non-significant F ratio for interaction, and a two-way graph of the data should appear as a set of parallel lines. If parallelism was found to be present, three goals were seen to be simultaneously realised: 1) a linear model was supported; 2) the response was shown to be on an interval scale; and 3) interval scales of the stimulus variables were indicated.

Alternatively, if the plotting of data produced a `linear fan', then 1) a multiplying model was supported; 2) the response is indicated to be on an interval scale; and 3) interval scales of the stimulus variables are demonstrated.
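The parallelism and linear-fan patterns can be sketched with noise-free cell means. The code below is an illustration only, not Anderson's procedure: a real analysis would test the interaction with an F ratio on replicated data. The row and column values are hypothetical. For an adding model the interaction residuals (cell mean minus row mean minus column mean plus grand mean) are all zero, which corresponds to parallel lines; for a multiplying model they are not, and the rows diverge in a fan.

```python
def interaction_residuals(table):
    """Cell value minus row mean minus column mean plus grand mean."""
    n_rows, n_cols = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n_rows * n_cols)
    row_means = [sum(row) / n_cols for row in table]
    col_means = [sum(table[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    return [[table[i][j] - row_means[i] - col_means[j] + grand
             for j in range(n_cols)]
            for i in range(n_rows)]


def max_abs(residuals):
    """Largest absolute residual in the table."""
    return max(abs(value) for row in residuals for value in row)


# Hypothetical subjective values of the levels of two stimulus variables.
row_values = [1.0, 2.0, 4.0]
col_values = [1.0, 3.0, 5.0, 6.0]

additive = [[a + b for b in col_values] for a in row_values]
multiplying = [[a * b for b in col_values] for a in row_values]

# Additive cell means give (numerically) zero interaction: parallel lines.
print(max_abs(interaction_residuals(additive)))
# Multiplying cell means give non-zero interaction: a fan of diverging lines.
print(max_abs(interaction_residuals(multiplying)))
```

Note that the check presupposes the response is recorded on an interval scale: a non-linear monotone transformation of the additive table would introduce apparent interaction, which is why parallelism is taken as joint support for the model and the response scale.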

The functional measurement approach can be applied to any scaling procedure. Anderson's own results (Anderson, 1976) suggested that category judgements did yield interval scales, and as category ratings are almost always non-linearly related to magnitude estimation, magnitude estimation must not. However, since then, the work of others (eg. Marks, 1974, cited Gescheider, 1988) has been claimed to indicate that magnitude estimation, too, satisfies functional measurement requirements.

Gescheider (1988) suggested that both category scaling and magnitude estimation appear to pass a variety of interval-scaling tests, and that perhaps each procedure produces valid measurements of different psychological processes. Magnitude estimation tasks require subjects to give a response in terms of apparent magnitude, while category judgements may entail judgements of difference: the sensory magnitudes and sensory dissimilarities of stimuli are equally meaningful but different dimensions of experience. Magnitude estimation and category rating are equally valid because they measure different things.

Gescheider concluded that the observed non-linear relation between scales obtained by different methods was the most perplexing and one of the oldest problems in psychophysics, and whether the non-linearities are due to cognitive-judgement factors or sensory-perceptual factors was yet to be determined.

4 Scaling in Health-state Assessment

When testing, say, illumination, one can present a range of stimuli and determine a subject's ability to discriminate among stimuli and estimate differences. But for health states, one cannot present the experience of ill health to subjects. Either one uses a range of patients who have experienced different states and applies a common measuring tool, or one uses a single group of subjects and asks them to `imagine' different states. Neither situation could be said to correspond to a classical psychophysical experiment. None the less, psychophysics has been the model for scaling studies, and while the sensation-measurement aspects are likely to be doubtful, health status scaling can be expected to be affected by cognitive factors of task, instruction, and stimulus in a similar way to that found in psychophysical measurement. Experimental findings regarding rating versus magnitude estimation illustrate this. Findings relevant to the issue of category rating versus magnitude estimation in health-related quality of life (HQOL) measurement are discussed below.

4.1 Does category rating provide an interval scale?

The developers of the Quality of Well-Being (QWB) Index have invested significant effort in attempting to determine the characteristics of different scaling procedures. Patrick, Bush and Chen (1973) examined category rating, magnitude estimation, and an equivalence task as methods of measuring preference for health status scenarios, employing states that covered the preference continuum from 1.0 for complete well being, to 0.0 for death (each scenario containing a functional description in terms of mobility, physical and social activity; age; and a symptom-problem complex or CPX - for more details of the QWB, see Chapter G).

The category rating task required subjects to score scenarios on an 11 point scale, where `most desirable' was stated as above 11, and `least desirable' as below 1. The magnitude estimation task required subjects to assess scenarios on a 1000 point scale, where 1000 equalled a standard scenario of an `as healthy as possible' person, and each scenario was to be rated in terms of fractions of this standard healthy person, ie. if a scenario described a person half as healthy as the standard, then a score of 500 should be given. The equivalence task measured trade-offs in numbers of individuals for a given state against a standard population and health state.

The results were that no significant differences were found in the values assigned to items by comparable groups, and the relation between category and magnitude scales was found to be linear. However, as noted by Patrick et al., the magnitude estimation scale was anchored at the upper end as perfectly well function, value 1000, and did not allow the response to be arbitrarily large (as would be allowed if, eg., a central value was given as an example) and this could have converted it to a 0 to 1000 category rating scale. The functional measurement test of Anderson was also applied and revealed results consistent with the separate scales possessing interval properties.

In contrast to Patrick et al., Kaplan, Bush and Berry (1979) did report finding a logarithmic relationship between values obtained via category rating versus magnitude estimation (as also did Kind and Rosser, 1988). In Kaplan et al.'s experiment the magnitude estimation task differed from that of Patrick et al. in that the standard was selected from the middle of the scale, with the result that the scale was now unbounded. Kaplan et al. (1979) pointed out that on logical grounds either category rating or magnitude estimation could be providing interval data, but not both, and concluded in favour of category rating as the magnitude estimation results were seen to provide `intuitively unreasonable' results when interpreted in terms of a 0 to 1 scale. Furthermore Kaplan et al. reported that Anderson's functional measurement approach applied to category rating data in a 100 subject test (hence possessing much statistical power), produced highly significant main effects of social activity and level of well-being, but no interaction between the effects (F-ratio < 1, and the graphed lines per effect being parallel). This was consistent with category rating providing an interval response scale. As reported in the previous section, however, others have suggested that magnitude estimation tasks can satisfy Anderson's test, and it cannot be concluded that the presence of defined endpoints on the response scale, ie. a `bounded' scale, or `category rating' as defined by Kaplan et al. (1979), uniquely provides interval data. On the other hand it does seem that defining a range of allowed responses when estimating stimulus size causes a fundamental change in the nature of the task, and this may be the most useful distinguishing feature of a category rating versus magnitude estimation task.
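One simple way to ask whether two sets of scale values are linearly or logarithmically related is to compare how well category values are predicted by the magnitude values themselves and by their logarithms. The sketch below does this with squared correlations. The paired values are invented for illustration and are not data from Patrick et al. (1973) or Kaplan et al. (1979), and the comparison of two squared correlations is an assumed, informal check rather than the analysis used in those studies.

```python
import math


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


# Hypothetical scale values for the same seven health state scenarios.
magnitude = [10, 20, 50, 100, 200, 500, 1000]   # unbounded magnitude estimates
category = [3.1, 3.5, 4.5, 4.9, 5.7, 6.3, 7.1]  # category ratings

r2_linear = r_squared(magnitude, category)
r2_log = r_squared([math.log10(m) for m in magnitude], category)

# A clearly higher value for the logarithmic fit suggests that the category
# scale is curved (roughly logarithmic) relative to the magnitude scale.
print(round(r2_linear, 3), round(r2_log, 3))
```

With bounded magnitude tasks of the kind used by Patrick et al., the same check would be expected to favour the linear fit, which is the contrast drawn in the text.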

4.2 What is the effect of providing defined versus vague endpoints to a scale?

Notwithstanding the `reasonableness' intuitions of Kaplan et al. (1979), the argument that a scale with well-defined endpoints uniquely possesses the ability to provide an interval scale for health state assessment cannot be upheld. However there are grounds for supporting such a scale beyond Anderson's test of functional measurement. This evidence comes from investigations by Kaplan and Ernst (1983) of the reported tendency for distribution effects in category rating tasks, ie. that subjects tend to spread their responses across all the allowable categories (Stevens and Galanter, 1957; Parducci, 1968). Kaplan and Ernst conducted experiments to investigate whether and under what conditions distribution effects occurred.

In their first experiment, Kaplan and Ernst used a 10 point rating scale, and instructions that included descriptions of a completely well person and a person in a coma so as to define clearly the end-points of the scale. Different groups of subjects were presented with four different groups of health state scenarios: all high scale items, all low, all medium, and mixed.

Analysis of the ratings of health state descriptions by subjects who saw only high scale values produced some evidence of distribution effects. There was a slight trend for these subjects to spread their ratings of high items across the response range (from 0 `as bad as dying' to 10 `completely well') relative to the assignment of the same sub-set of high scale items by different subjects who were presented with the high sub-set in the context of a broader overall range of items. This trend was not supported in a second experiment.

However a further experiment suggested that such effects could occur under conditions where subjects did not have or were not given clear information regarding the types of items that should define the end points of the scale. In this experiment subjects were to assess acts of immorality, and those subjects given instructions which clearly defined the endpoints did produce responses distributed differently to those given minimal instructions.

Kaplan and Ernst concluded from this that health state assessments using a scale with poles of death to complete health are naturally readily understood and hence well defined, and therefore tend to be resistant to context effects. On the other hand, to minimise context effects of the distribution type it was suggested that:

1 the continuum along which states are to be rated should be well defined

2 the end points should be clearly defined

3 the stimuli should not be available for inspection prior to their rating

To this might be added the provision of enough scale points to allow maximum discrimination between judgements.

The provision of endpoints to a scale, but uncertain ones, may in fact lead category rating to produce ordinal scales. For example, Read et al. (1984) appeared to require subjects to define the extremes of the scale themselves, picking the worst and best outcomes (where the best was not a state of perfect health), and then filling in the medium values. This explicit requirement to use all the values of the scale would seem to counteract its `absolute' nature, ie. it encourages subjects to make relative selections of scale points.

A related issue concerns the zero point, and the defining of death on a scale. It has been clearly demonstrated that there is a need for estimates corresponding to states worse than death (eg. Read et al., 1984; Kind and Rosser, 1988; Sutherland, Dunn and Boyd, 1983). This poses problems both for advocates of category scaling (problems in clearly defining the poles, so encouraging distribution effects), and also for those favouring magnitude estimation. Sutherland, Dunn and Boyd (1983) raise the question of whether one can have a ratio scale, as opposed to an interval or ordinal one, when the zero point is indeterminate.

This latter point was indirectly addressed by Haig et al. (1986), who inverted the traditional poles of the illness-health continuum such that now the absence of dysfunction or discomfort (corresponding to perfect health?) became zero, with the other end of the scale open-ended. This variant of magnitude estimation entailed subjects being given the zero state as the first state, which by definition was zero. The second state was randomly selected, and subjects (inpatients of a surgery ward) were apparently free to assign any number to it that appeared to correspond to the magnitude of its undesirability, with all subsequent states in proportion to the earlier judgements.

Haig et al. reported finding a linear relationship between this scale and earlier results reported by Bush and co-workers using category rating, while at the same time claiming that the scale was a truly ratio one because it employed magnitude estimation after Stevens, and death did not represent what they (as have others) found to be an inappropriate zero. The failure to find a logarithmic relationship between this and Bush's scale was not explained.

4.3 Effects of task complexity

If one wishes to minimise the influence of complex cognitive variables, eg. task complexity, etc., and focus on the ostensible aspect of interest, ie. desirability or otherwise of a health state, then it is sensible to use a task that is as simple as possible. Patrick et al. (1973) found that both category and magnitude estimation procedures were easy to use, as opposed to the method of equivalent stimuli which was complex, and reported to be `unrealistic, emotive, confusing, and offensive to some judges' (consistent with this, the equivalence (tradeoff) method also tended to give the largest standard deviations).

In contrast, Torrance (1976) found a test that employed category scaling to be the hardest to use, compared to standard gamble or time trade-off. In Torrance's category scaling the subject had to indicate the relative desirability of three-paragraph health scenarios on a 0 to 100 continuum (using a visual analogue scale), being asked for each of a series of health states to mark three lines - one for the case where the subject has three months left to live, one for 8 years, and one for normal life expectancy. This seems a very complex task, with the subject having to estimate and combine QOL per state and length of time simultaneously. It illustrates the importance of looking at the whole task, not just some abstracted element of it (Kaplan, Bush and Berry, 1976, suggested the result was due to item complexity and because the category rating task was presented first).

Kaplan and Ernst (1983) have also argued that the magnitude estimation task, by its very nature, involves response scales that are not clearly understood by subjects, and hence are likely to be particularly prone to distribution bias.

4.4 Effects of stimulus materials

Llewelyn-Thomas et al. (1984) examined the effect on numerical values assigned to health states of presenting a written description in formalised point form as used in the development of the Index of Well-being (Patrick, Bush and Chen, 1973) versus scenarios similar in style to that used by Torrance (written in detail in the first person singular in common language). The subjects were outpatients with malignant disease, and both standard gamble and a category rating task were used. The point form scenarios were found to produce higher values than the narrative form standard, and standard gamble provided substantially different and systematically higher scores than did category rating (although there was a significant interaction effect of task order). Effects of instruction have also been demonstrated by Tversky and Kahneman (1981), and McNeil et al. (1982).

4.5 What is a category rating task?

It is apparent that what constitutes a category rating task means different things to different people, and as pointed out in 4.1 and 4.2 it may be that clearly defining the boundaries of the scale is a more important aspect of a task than whether the subject is requested to estimate stimulus values in terms of ratios of a standard or differences. Brooks (1991), in a very useful review, exhibits another confusion in commenting upon `perhaps its .. [ie. the category rating task] .. most attractive feature being the visual nature of the approach, which can help raters conceptualise what is required of them in evaluating health states' (Brooks, 1991, p. 19). This is in one sense a good point, but it confuses the use of a visual aid with what constitutes category scaling. A visual scale can be used for the explicit ordinal scaling of states, and a task can require interval-type judgements on a clearly bounded scale without reference to a visual aid. It seems time to abandon simplifying terms such as `magnitude estimation' and `rating scales' and clearly describe the task in all its features.

5 How Important are Interval Properties for Statistical Operations on Health States?

It has been assumed so far that a response scale must possess interval properties if it is to be useful in health status measurement. This is by no means clear.

Firstly, it can be argued that most if not all psychological variables are in fact ordinal, although for statistical purposes they are, quite justifiably, commonly treated as if they were interval or ratio variables.

Ferguson (1966), for example, suggests that in the analysis of statistical data in psychology and education it is common for information to be superimposed on the data, e.g. to assume that a set of ordinal numbers can be replaced by the cardinal numbers, and to then proceed to apply arithmetic operations to the numbers. This involves assumptions about the equality of intervals, when in fact the measuring operation does not yield this. Ferguson suggests that scores on intelligence tests, attitude tests and personality tests are in effect ordinal variables, for no aspect of the operation of measuring intelligence, say, permits the making of meaningful statements about the equality of intervals. It cannot be said that the difference in intelligence between a person with an IQ of 80 and one with an IQ of 90 is in any sense equal to the difference in intelligence between a person with an IQ of 110 and another with an IQ of 120. Much the same argument could be applied to a subjective well-being scale. Although a logical purist may conclude from this that the testing of data in this way should be discontinued, Ferguson disagrees. Practical necessity can dictate a procedure, although it should be understood what assumptions are being made.

A second and related argument is that most parametric statistical tests do not require data to be measured on an interval scale, but rather require assumptions about the distribution of the data: parametric statistics can be used as long as the data meet these distribution assumptions. For example, Anderson (1976) proposed that the present concern with interval or linear scales is in many applications unnecessary, as the power of a statistical test has no necessary relation to the nature of the response scale. A nonlinear or ordinal response may well provide normal distributions.

Some have responded to this view by testing whether the distribution is normal, and then correcting non-normal data using eg. log transformations (cf Hall et al., 1989). Others have been even more pragmatic. Crocker and Algina (1986) concluded that psychology test data should be treated as interval scale data as long as it can be demonstrated empirically that the usefulness of the scores for prediction or description is enhanced by this treatment. Even Stevens, who popularised the ordinal/interval debate, was quoted as sanctioning the use of parametric statistics on the pragmatic grounds that it can lead to fruitful results. Shavelson (1988), in a recent text on statistical reasoning for behavioural science, illustrated the hands-off approach of most to this issue in stating that "The problem, then, is not so much the match of statistical method measurement scales. Rather the problem is one of interpreting the results of a statistical analysis .... ", and then quoting Hays in that "the experimenting psychologist [sociologist, educational researcher] must face the problem of the interpretation of statistical results within psychology and on extramathematical grounds" (Hays, 1973, p. 88, cited Shavelson, 1988).

It would seem that as far as inferential statistics is concerned, it is not necessary to show evidence of the measurements being interval in nature when measuring the effect of a treatment. To a large extent procedures may be used if it can be shown they are useful, ie. on the balance of the evidence they appear to have practical, predictive value.

While this approach may legitimate the use of tests of significance, it could be argued that this only supports attempts to test the relative order of different HQOL estimates, and that one still cannot assume that equal differences on the scale represent equal quantities of health. When concerned with the relative allocation of quantities of resources, a ratio scale may be needed (see Kind and Rosser, 1988). Countering this, Anderson (1976) has argued that when using linear regression for prediction, which can be the aim of a test to predict HQOL, predictive accuracy will tend not to be affected by nonlinearity in the stimulus values employed in the regression. This is a considerable advantage for practical prediction (although considerable caution still needs to be applied when interpreting parameters estimated from the regression equation).

6 Conclusions and Observations

It is now clear that sensation magnitude cannot be measured directly by any method, including magnitude estimation and category scaling. Responses in any scaling task are now recognised to be a joint function of cognitive and sensory factors. It is apparent that the defining of a category rating or magnitude estimation task requires a much more complex model of cognitive processing than the psychophysical one.

Furthermore, both category scaling and magnitude estimation appear to pass a variety of interval-scaling tests, and perhaps each procedure produces valid measurements of different psychological processes. However, Kaplan and Ernst have provided evidence that a form of category rating task does seem more resistant to inter-item context effects, namely one where both the continuum along which states are to be rated and the end points are clearly defined. Task complexity is also likely to have a major effect in producing undesirable error variance (ie. variance related to task factors as opposed to health state factors), and standardisation of stimulus materials and instructions is essential.

It also needs to be clear that no psychological scale may be interval, in the sense that equal quantities at different parts of the scale cannot be treated as meaningfully equivalent (eg. intelligence). The question is what do we want to measure, and how do we measure it so that responses are reliable. Task factors will change results, be they the product of the particular set of stimuli presented, examples given, whether a scale with defined endpoints is presented, or only one standard: people appear to use all available information to define the task, not just the single sentence or paragraph statement of task as given by the experimenter. In scientific research into human behaviour (which includes preferences), the aim is to examine differences between situations, not the absolute level of something. Is there an underlying value to be tapped - or is it created especially for the experiment? Assuming that the former is at least partly the case, the issue is not so much whether its measure has interval or ratio properties, but, if the measurements are at least ordinal, in what way we can legitimately aid decision making in that area. One should aim to clarify the purposes of the investigation as much as possible, defining the type of underlying variable that is to be estimated, then do all one can to minimise undesired sources of variance.

This introduces us to the issues of reliability and validity, addressed in the following chapters.

 

D RELIABILITY

For any test or measure of health status to be useful it must be reliable, that is, repeat measurements made under constant conditions need to give the same result. As put by Anastasi (1990), measures of test reliability allow an estimate of the proportion of test variability that is due to error variance, where error variance is change in scores due to anything other than the characteristic of interest. The need to minimise error variance is the reason why psychological tests typically specify all aspects of the test environment, ie. instructions, time limits, mode of subject-tester interaction, etc., the aim being to eliminate or control extraneous sources of variance that would otherwise affect test results.

The basic measure of test reliability or reliability coefficient is the correlation coefficient, which indicates the consistency between two independently derived sets of scores. The most common of these is Pearson's Product Moment Correlation, which measures the location of items in the two variables to be compared in terms of the amount of deviation each item displays above or below their respective group means (ie. determines the standard scores for all test scores in each variable), and calculates the product of the paired scores. The Pearson correlation coefficient is then the mean of these products (computational formulas simplify this process). Significance tests then determine the probability of the observed correlation occurring by chance alone.
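The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended implementation: the function names and the two sets of scores are invented for the example, and the standard deviation is taken over the group itself (dividing by n), which is the assumption under which the coefficient equals the mean of the products of paired standard scores.

```python
from math import sqrt

def standard_scores(scores):
    """Express each score as a deviation from the group mean in SD units."""
    n = len(scores)
    mean = sum(scores) / n
    sd = sqrt(sum((x - mean) ** 2 for x in scores) / n)  # SD of the group (divide by n)
    return [(x - mean) / sd for x in scores]

def pearson_r(first, second):
    """Mean of the products of the paired standard scores."""
    z_first = standard_scores(first)
    z_second = standard_scores(second)
    return sum(a * b for a, b in zip(z_first, z_second)) / len(first)

# Hypothetical scores for five subjects on two administrations of a test.
first_scores = [12, 15, 9, 20, 17]
second_scores = [11, 16, 10, 19, 18]

r = pearson_r(first_scores, second_scores)
print(f"reliability coefficient r = {r:.3f}")
```

In practice a statistics package would normally be used, and would also supply the significance test mentioned above.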

It is possible to estimate the reliability coefficient for an instrument either through repeated presentation of a test or presentation of parallel forms of a test, or through a single test administration. The advantages and disadvantages of these methods are briefly reviewed below.

1 Types of Reliability

1.1 Test-retest, or measure of stability

The most obvious means for estimating the reliability of a test is through re-presenting the identical test on a later occasion. The correlation coefficient for a test-retest procedure is termed the coefficient of stability, and the Pearson product moment formula can be used. Crocker and Algina (1986) state that few if any standards exist for judging the minimally acceptable value for a coefficient of stability, but that commercially published individually administered aptitude tests are amongst the highest. Subtests of the WAIS (Wechsler Adult Intelligence Scale) have coefficients in the .70s, .80s, and low .90s. Personality, interest or attitude measures are often lower than these, but Crocker and Algina propose that well constructed tests should still have test-retest coefficients in the .80s.

Although apparently simple and straightforward, the problems attendant on the test-retest method are major. Basically the two tests cannot be considered as independent due to practice and/or recall, and while the longer the interval between test and retest the less the risk of memory effects, the greater is the risk of intervening events causing respondents to change their views. A low coefficient may mean either that the test is an unreliable measure of the trait, or that the trait itself is unstable. Alternatively the testee's behaviour may have been altered by the first administration, and the second test may reflect effects of memory, practice, learning, boredom, sensitisation etc.

In assessing the correlation it is hard to decide if it has been inflated (due to memory effects), or deflated (due to change in views).

1.2 Alternate form method, or measure of equivalence

A way of avoiding the difficulties of test-retest is through the use of alternate forms of a test. In the development of alternate forms care is needed to determine that the forms are truly parallel, and are independently constructed to meet the same specifications (Anastasi, 1990).

Crocker and Algina (1986) propose that any tests that have multiple forms should have some evidence of their equivalence. To test this equivalence the two forms are administered within a very short time period, allowing only enough time between testings so that the testees are not fatigued. The order of administration should be balanced. A Pearson product moment correlation coefficient can then be computed between the two forms, to give the coefficient of equivalence.

Crocker and Algina also state that there are no hard and fast rules for what constitutes a minimally acceptable value for alternate form reliability estimates, but that many standardised achievement test manuals report coefficients ranging from the .80s to the .90s for this type of reliability. In addition means, SDs, and standard errors of measurement should be reported for each form and these should be "quite similar".

Anastasi (1990) concluded that while more widely applicable than test-retest reliability, alternate form reliability also has limitations. Alternate form tests can still be subject to practice effects, and differences between the two sets of answers will be a mixture of differences between the items used and other sources of error. Also there may be real problems for many tests of constructing truly equivalent forms, and for these reasons other techniques for estimating reliability may be required.

1.3 Measures of internal consistency - Split-half method

If only one administration of a single form is involved, then changes in test result due to changes in the construct under examination are less likely to occur (although some changes could occur within the period of the test itself). Procedures designed to estimate reliability in these circumstances are called measures of internal consistency, and are primarily concerned with errors caused by content sampling (although errors of measurement because of faulty administration and scoring, guessing, and temporary fluctuations of individual performance within the testing session may also affect the internal consistency coefficient). Crocker and Algina (1986) note that measures of internal consistency are very important in many tests, as the aim of a test is generally not to estimate how the testee would score on the items presented but how the testee would score on a larger content domain of possible items that might have been asked.

The original measure of this type was the split-half method, once the most widely used way of estimating how consistently the testee performed across items or subsets of items on the single test form (Moser and Kalton, 1979). This method involves the division of the test into two sub-tests, each half the length of the original test; the two half-tests are then scored and the correlation coefficient computed between the two scores.

The coefficient so obtained will be an underestimate of the reliability coefficient for the full-length test, for longer tests are generally more reliable than shorter tests because errors of measurement due to content sampling are reduced (see section 2.2 this chapter). To correct this the Spearman Brown prophecy formula can be employed, or other correction methods (e.g. Rulon method), to give the stepped-up reliability of the whole test or rw (see Anastasi, 1990; Crocker and Algina, 1986).
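For reference, the split-half correction is usually written as follows. This is the standard textbook form of the Spearman-Brown formula rather than notation taken from this paper: the symbol r_hh, for the correlation between the two half-tests, is introduced here for illustration, while r_w is the stepped-up reliability of the whole test.

```latex
r_w = \frac{2\, r_{hh}}{1 + r_{hh}}
```

For example, a correlation of 0.60 between the two halves steps up to 1.20/1.60 = 0.75 for the full-length test.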

Split-half methods have been criticised on the grounds that there are many ways of dividing a test into halves, and these can result in different reliability estimates. For this reason, methods based on item covariance may be preferred. This will not apply to a highly heterogeneous test, for which split-half or even test-retest may be the most appropriate method.

1.4 Measures of internal consistency - Methods based on item covariance, or coefficient alpha.

A fourth method for finding reliability, also utilising a single test administration, is based on the consistency of responses to all items in the test.

Anastasi (1990) notes that this interitem consistency is influenced by two sources of error variance, namely content sampling (as also applies with alternate form and split half tests), and heterogeneity of the behaviour domain sampled. The more homogeneous the domain, the higher the interitem consistency.

A highly relevant question in this context is whether the criterion that the test is trying to predict is itself relatively homogeneous or heterogeneous. `A single homogeneous test is obviously not an adequate predictor of a highly heterogeneous criterion' (Anastasi, 1990, p. 123).

The most common procedure for finding interitem consistency is the Kuder-Richardson reliability coefficient, which is equivalent to the mean of all split-half coefficients resulting from different splittings of a test. This is in contrast to the ordinary split half, which is a planned split designed to yield equivalent sets of items. Hence unless the test items are highly homogeneous, measures of interitem consistency will yield much lower coefficients than the split-half method (Anastasi, 1990). Cronbach's Alpha and Hoyt's Analysis of Variance are said to yield identical results to Kuder-Richardson (Crocker and Algina, 1986), all determining the ratio of the sum of the item covariances to the total observed score variance.
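As an illustration of the covariance-based approach, the sketch below computes coefficient alpha in its usual computational form. It is a minimal example under stated assumptions: the function name and the response data are invented, every respondent is assumed to have answered every item, and the formula used is the common textbook expression rather than one quoted from the sources above. The total score variance minus the sum of the item variances equals the sum of the item covariances, which is how this form relates to the ratio just described.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """responses: one list per respondent, holding one score per item."""
    k = len(responses[0])                        # number of items
    items = list(zip(*responses))                # regroup scores item by item
    sum_item_variances = sum(pvariance(item) for item in items)
    total_variance = pvariance([sum(row) for row in responses])
    # (total variance - sum of item variances) is the sum of item covariances
    return (k / (k - 1)) * (1 - sum_item_variances / total_variance)

# Hypothetical ratings from five respondents on a four-item scale.
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

alpha = cronbach_alpha(responses)
print(f"coefficient alpha = {alpha:.3f}")
```

With items scored 0 or 1, the same calculation gives the Kuder-Richardson 20 coefficient.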

2 Factors that Affect Reliability Coefficients.

2.1 Characteristics of the subjects: variation in the behaviour, and ability to perform the measurement task

An important factor influencing the size of a reliability coefficient is the range of individual differences in the group, or subject homogeneity.

The magnitude of a reliability coefficient depends on variation among individuals on both their `true' scores (ie. their score on the underlying behaviour of interest), and error scores. Thus the homogeneity of the test group is a major consideration. If members of the test group are similar with respect to the trait being measured then the reliability coefficient will be much lower than if they varied markedly regarding the trait, because the random error variance will tend to be constant for the two groups (if of equal size), while `true' score variance will be much less and hence account for a much smaller proportion of the observed score variance.
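The effect of group homogeneity can be shown with a small calculation. This sketch assumes the classical test theory decomposition, in which observed score variance is the sum of true score variance and independent error variance, and reliability is the proportion that is true score variance. The function name and the standard deviations are hypothetical values chosen for illustration.

```python
def reliability(true_sd, error_sd):
    """Proportion of observed score variance due to true score variance."""
    true_variance = true_sd ** 2
    error_variance = error_sd ** 2
    return true_variance / (true_variance + error_variance)

# Same error variance in both groups; only the spread of true scores differs.
heterogeneous_group = reliability(true_sd=10.0, error_sd=5.0)  # 100 / 125
homogeneous_group = reliability(true_sd=3.0, error_sd=5.0)     # 9 / 34

print(f"heterogeneous group: {heterogeneous_group:.2f}")
print(f"homogeneous group:   {homogeneous_group:.2f}")
```

The test and its error variance are identical in the two cases; only the composition of the group has changed.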

In other words a test is not reliable or unreliable, rather reliability is a property of the scores on a test for a particular group of subjects. A consequence of this is that to compare tests it is essential to determine whether reported reliability estimates were based on samples similar in composition.

A further factor influencing the size of the reliability coefficient is the ability level of the group upon which the test was developed. For example, if the subjects are confused or pressured by the demands of the task then there is likely to be increased error variance, and hence reduced test reliability. On the other hand it would be difficult to justify eliminating such subjects, particularly if they over-represent a particular group, eg. patients or patient-relatives, as opposed to health professionals. Patients and their relatives may possess less formal education and be relatively unfamiliar with formal testing procedures but also possess relevant and valuable knowledge of health status variables and their effects. Eliminating such subjects because they fail to master the task is likely to reduce the test's validity. Even when no rater elimination occurs it is important to assess consistency across raters, ie. inter-rater reliability, particularly where measures are made of complex judgements.

Examiner variance as a source of error, ie. variation in test scores as a product of experimenter/examiner factors (Anastasi, 1990, p. 125), should also be considered, and, if necessary, controlled for.

2.2 Test items: number (test length) and homogeneity

As already noted, test length affects both true score variance and observed score variance. Longer tests have greater test reliability than shorter tests composed of similar items (errors of measurement due to content sampling are reduced).

The Spearman-Brown prophecy formula can be used to estimate the effects on reliability of increasing or decreasing test size. Thus if two equivalent items could be used in a scale, but only one is in fact used, the correlation between the two items is a measure of the internal consistency reliability for the one item. As pointed out by Moser and Kalton (1979), if the correlation between the two items was 0.5, and both items were used in the test, the Spearman-Brown formula calculates their reliability at 0.67. If further items all intercorrelated 0.5 were added the reliability would increase further. The higher the intercorrelation between items, the fewer the items needed to reach a given level of reliability. Thus only 4 items intercorrelated at 0.7 are needed to reach an rw = 0.9, but 9 if the items were intercorrelated 0.5.
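The figures quoted above can be reproduced with the general form of the Spearman-Brown formula. The sketch below is illustrative only: the function names are invented for the example, and it assumes that all items are equivalent and intercorrelate at the same value.

```python
def spearman_brown(r, k):
    """Reliability of a test of k equivalent items that intercorrelate at r."""
    return k * r / (1 + (k - 1) * r)

def items_needed(r, target):
    """Smallest number of such items whose reliability reaches the target."""
    if not (0 < r < 1 and 0 < target < 1):
        raise ValueError("r and target must lie strictly between 0 and 1")
    k = 1
    # The small tolerance guards against floating-point rounding at the boundary.
    while spearman_brown(r, k) < target - 1e-9:
        k += 1
    return k

print(round(spearman_brown(0.5, 2), 2))   # two items intercorrelated 0.5
print(items_needed(0.7, 0.9))             # items intercorrelated 0.7
print(items_needed(0.5, 0.9))             # items intercorrelated 0.5
```

The same function, with k set to a fraction, estimates the loss of reliability when a test is shortened.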

Moser and Kalton (1979) note that with attribute measurement the set of items rarely intercorrelates highly, and in order to attain an adequate level of reliability multiple items are needed. The higher the intercorrelation between items, ie. the greater the item homogeneity, the fewer items are needed. However item homogeneity may only be obtained by restricting the breadth of the scale, and this may have a direct effect on reducing validity.

3 Reporting and Interpreting Reliability.

The reliability of a test may be expressed in terms of the standard error of measurement or SEM, also termed the standard error of the score. The SEM is particularly suited to the interpretation of individual scores, and Anastasi recommends that when the score for a test is reported an indication of its expected error should be provided. The standard error of measurement may be considered as the average standard deviation of examinees' individual error distributions for a large number of repeated testings, and allows the establishment of a confidence interval in which the true score is expected to lie (for the formula see Anastasi, 1990; Crocker and Algina, 1986).

The standard error of measurement and the reliability are alternative ways of expressing test reliability, although the reliability coefficient is best for comparing the reliability of different tests, while the standard error of measurement is recommended for the interpretation of individual scores. The interpretation of score differences should be made via measures of the standard error of score differences. (See Anastasi, 1990).
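A brief sketch of these quantities follows, using the usual textbook formulas: the SEM is the test's standard deviation multiplied by the square root of one minus the reliability coefficient, and the standard error of the difference between two scores combines the two reliabilities. The function names and numerical values are hypothetical. The band is centred on the observed score, a common simplification, and the difference formula assumes both scores are expressed on a scale with the same standard deviation.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r), in the same units as the test scores."""
    return sd * sqrt(1.0 - reliability)

def score_band(observed, sd, reliability, z=1.96):
    """Approximate confidence band around an observed score (z = 1.96 for 95%)."""
    sem = standard_error_of_measurement(sd, reliability)
    return observed - z * sem, observed + z * sem

def standard_error_of_difference(sd, reliability_a, reliability_b):
    """SE of the difference between two scores reported on the same scale."""
    return sd * sqrt(2.0 - reliability_a - reliability_b)

# Hypothetical test with SD = 15 and a reliability coefficient of 0.91.
sem = standard_error_of_measurement(sd=15.0, reliability=0.91)
low, high = score_band(observed=100.0, sd=15.0, reliability=0.91)
se_diff = standard_error_of_difference(15.0, 0.91, 0.84)

print(f"SEM = {sem:.1f}; band = {low:.1f} to {high:.1f}; SE of difference = {se_diff:.1f}")
```

Because the standard error of the difference is larger than either SEM, a difference between two scores needs to be larger than the error in either score alone before it is treated as real.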

Finally, it is generally accepted that the developer of any test has an obligation not only to investigate the reliability of the test, but to report this information. Guidelines for this, derived from the American "Standards for Educational and Psychological Testing" (1985), are given by Crocker and Algina (1986, p. 152).

4 Conclusions

To summarise reliability measures, for any test or measure of health status to be useful it must be reliable, that is, repeat measurements made under constant conditions need to give the same result. The test-retest method of forming a reliability coefficient has the disadvantage that the obtained value may be inflated (due to memory effects), or deflated (due to change in views). The alternate form method has similar if diminished limitations. Of internal consistency measures, the ordinary split-half has the advantage of being a planned split, which may be appropriate if the items were planned to be highly heterogeneous. If the test items are planned to be homogeneous, then an inter-item consistency measure such as Kuder-Richardson 20 or Cronbach's Alpha would be more appropriate, noting that unless the test items are highly homogeneous measures of interitem consistency will yield much lower coefficients than the split-half method.

Whatever the method chosen, much care needs to be taken when interpreting reliability coefficients, for a test is not reliable or unreliable, rather reliability is a property of the scores on a test for a particular group of subjects. A consequence of this is that to compare tests it is essential to determine whether reliability estimates were based on similar subjects, for the size of a reliability coefficient will tend to be proportional to the variability between subjects on the characteristic being assessed.

It also needs to be borne in mind that test reliability is directly related to item homogeneity. The greater the item homogeneity, the fewer items are needed for a given level of reliability. However item homogeneity may only be obtained by restricting the breadth of the scale, and this may have a direct effect on reducing validity. Thus when a test is assessing a highly heterogeneous construct such as QOL there is likely to be a trade-off between test reliability and test validity.

Ultimately, a high reliability coefficient indicates there is consistency in a testee's scores, but it does not ensure that the inference to be drawn from the test is correct. Validity refers to the ability of the scale to measure what it sets out to measure, so differences between individuals truly reflect differences in the characteristic under study. Reliability is a necessary but not sufficient condition for validity. When examining reliability coefficients healthy scepticism is recommended regarding what is purported to be demonstrated, combined with a close examination of the item and subject conditions.

 

E VALIDITY

Cronbach (1971; cited in Crocker & Algina, 1986) describes validation as the process by which a test developer or test user collects evidence to support the type of inferences that are to be drawn from test scores. To plan a validation study, the desired inference must be clearly identified and then evidence gathered.

Anastasi (1990) states that fundamentally all procedures for determining test validity are concerned with the relationships between performance on the test and other independently observable facts about the behaviour characteristics under consideration.

1 Content Validation

A major priority in constructing any test is to ensure that all the elements necessary to measure the construct have been included. With a multi-dimensional construct there is a need to develop an appropriate weighting of these dimensions. This is the area of content validation.

Content validation involves the systematic examination of the test content to determine whether it covers a representative sample of the behaviour domain to be measured (Anastasi, 1990). The items that form the test should convey the attribute, and also cover the full range of the attribute in a balanced way. The items need to be a representative sample of the universe of content. The purpose of content validation is to assess whether the items truly represent the performance domain or construct of specific interest.

In conducting content validation, Crocker and Algina (1986) note that considerations include how should separate elements that make up a test be weighted; and how we should determine whether all elements necessary to represent the construct have been included. Anastasi (1990) makes the point that the domain should be defined in advance rather than after the test has been prepared.

Kaplan, Bush and Berry (1976) have argued strongly for content validity as the basis of health status validation. This issue is addressed in much greater detail in Section F.

2 Criterion-related Validation

Criterion-related validation procedures indicate the effectiveness of a test in predicting an individual's performance in specified activities. It entails performance on the test being checked against a criterion, that is a direct and independent measure of that which the test is designed to predict (Anastasi, 1990), and validity can be estimated based on the correlation coefficient between predictor and criterion scores.

With criterion-related validity the scale is developed as an indicator of some observable criterion, ie. the behaviour of interest can be directly measured, but either the criterion is available at the time of testing and a test is quicker, simpler, or less expensive (concurrent validity), or the behaviour can only be determined in the future (predictive validity).

A test may be validated against as many criteria as there are uses for it (Anastasi, 1990). Such criteria may come from personal judgement by one considered expert in the area, eg. as with a psychiatric diagnosis, assuming that the diagnosis has been based on prolonged observation and detailed case history, etc., and is itself valid. Anastasi (1990) points out that ratings have been employed in the validation of almost every type of test, and while subject to judgemental errors, they represent a valuable source of criterion data (when obtained under carefully controlled conditions, and raters are adequately trained, etc.; see Anastasi, 1990, pp. 645-647).

Kaplan, Bush and Berry (1976) argue that criterion validity is not possible for a broad health status measure because no criterion exists that accurately measures the phenomena of interest, and indeed the lack of such a measure is the reason why effort has gone into the development of health status measures (if a criterion exists, only greater practicability or less cost justifies the use of some other measure). This is doubtless true, but it should not obscure the fact that while no criterion may exist, there may (as discussed in Chapter B, and see 3.2 following) be a measure (or measures) that is close to such a criterion.

3 Construct Validity

The construct-related validity of a test is the extent to which the test may be said to measure a theoretical construct or trait (Anastasi, 1990), where the construct is manifested in a variety of behaviours and where there is no single behaviour that can be measured and seen to represent it comprehensively. Crocker and Algina (1986) describe a psychological construct as `a product of informed scientific imagination', which is not directly observable. Examples are intelligence, creativity, neuroticism, etc. To be useable a construct needs to be defined (operationally or semantically), and its relationships with other constructs and measures of specific real-world criteria specified.

According to Crocker and Algina, the process of construct validation might involve:

a) defining the construct

b) formulating a hypothesis as to how differences on the construct should be associated with differences on other characteristics

c) measuring the construct

d) gathering empirical data on the hypothesised characteristics

e) determining consistency between the construct levels and the characteristic levels.

If the hypothesised relationship(s) are found as predicted, then both the construct and the test that measures it are useful.

Details of examinations that might be carried out to assess construct validity are as follows.

3.1 Correlations between the test and other tests

Correlations between a new test and similar tests can be cited as evidence that the same general area of behaviour is being assessed. Anastasi (1990) points out that such correlations should be moderately high but not too high, for otherwise the new test represents needless duplication (unless it is eg. briefer, or easier to administer).

3.2 Correlation between the test and selected variables

The construct measure may be used to see if individuals or populations hypothesised to differ on the construct do so. If expected differences are found, the test is supported. If not there may be a fault in the theory underlying the construct, the measure of the construct, or the treatment assumed to provide the difference.

In essence, measures that fail as criterion measures of validity are used as proxy criteria. For example, Donovan et al. (1989), in reviewing QOL-cancer scales, argued that the best form of validation was assessment of the tests' capacity to predict external criteria such as medical or psychological stress indicators, e.g. number of requests for medical help, use of psychiatric services etc.

3.3 Convergent and Discriminant validation

A framework within which to conduct construct validation was proposed by Campbell and colleagues, who pointed out that in order to demonstrate construct validity it should be shown not only that a test correlates highly with other variables it would be theoretically expected to, but also that it does not correlate highly with variables with which it would be expected to differ (Campbell and Fiske, 1959; also see Anastasi, 1990). Campbell and Fiske (1959) proposed a systematic method for exploring this, the Multitrait-Multimethod Matrix method, which entails the assessment of two or more constructs by two or more methods.

Campbell and Fiske's method proceeds from the notion that each test or task employed for measurement purposes is a "trait-method unit", a union of a particular trait content with measurement procedures not specific to the trait being measured. To examine discriminant validity, i.e. that the test does not correlate highly with other tests that it should differ from, the researcher must identify two or more ways of measuring the construct of interest, and at least one further construct which can be measured by the same methods.

Using one sample of subjects, measurements are made on each construct by each method, and correlations computed between each pair of measurements. The matrix of all of the intercorrelations is the multitrait-multimethod matrix (or MM Matrix).

Three types of correlation are formed:

- Reliabilities (or monotrait, monomethod).

- Convergent validity coefficients - correlations between measures of the same construct using different measurement methods (monotrait heteromethod).

- Discriminant validity coefficients - correlations between measures of different constructs using the same method (heterotrait monomethod), or correlations between measures of different constructs using different methods (heterotrait heteromethod).

For satisfactory construct validity, the correlations between scores obtained for the same trait by different methods (validity coefficients) should be higher than the correlations between different traits measured by different methods, and higher than the correlations between different traits measured by the same method (if the latter are high, a person's scores may be affected by an irrelevant common factor such as ability to understand the questions; Anastasi, 1990).

Campbell and Fiske argue that a careful examination of the MM Matrix will indicate what the next steps should be: whether methods should be discarded or replaced, or concepts sharpened in definition, and which concepts are poorly measured because of excessive or confounding method variance. They also caution that many MM Matrices will show no convergent validation: no relationship may be found between two methods of measuring a trait. In this situation alternative propositions are: (a) neither method is adequate for measuring the trait; (b) one of the two methods does not really measure the trait; or, (c) the response tendencies are specific to the non-trait aspects of the test.
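The comparison of convergent and discriminant coefficients described above can be sketched with simulated data. The following is only an illustration under stated assumptions: two hypothetical traits (A and B), two hypothetical methods (m1 and m2), independent traits, and arbitrary variances for trait, method and error. Reliabilities are not computed, since that would need repeated measurement with the same trait-method unit.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(0)
n_subjects = 500
# One list of scores per trait-method unit (all names are hypothetical).
scores = {"A_m1": [], "A_m2": [], "B_m1": [], "B_m2": []}
for _ in range(n_subjects):
    trait = {"A": random.gauss(0, 1), "B": random.gauss(0, 1)}
    # Method bias is shared by every trait measured with that method.
    method = {"m1": random.gauss(0, 0.5), "m2": random.gauss(0, 0.5)}
    for t in ("A", "B"):
        for m in ("m1", "m2"):
            error = random.gauss(0, 0.5)
            scores[f"{t}_{m}"].append(trait[t] + method[m] + error)

# Monotrait heteromethod: same trait, different methods.
convergent = [pearson(scores["A_m1"], scores["A_m2"]),
              pearson(scores["B_m1"], scores["B_m2"])]
# Heterotrait monomethod: different traits, same method.
hetero_mono = [pearson(scores["A_m1"], scores["B_m1"]),
               pearson(scores["A_m2"], scores["B_m2"])]
# Heterotrait heteromethod: different traits, different methods.
hetero_hetero = [pearson(scores["A_m1"], scores["B_m2"]),
                 pearson(scores["A_m2"], scores["B_m1"])]

print("convergent:", [round(r, 2) for r in convergent])
print("heterotrait monomethod:", [round(r, 2) for r in hetero_mono])
print("heterotrait heteromethod:", [round(r, 2) for r in hetero_hetero])
```

With these assumed variances the convergent coefficients should clearly exceed both kinds of discriminant coefficient, and the heterotrait monomethod values should sit above the heterotrait heteromethod values because of the shared method variance.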

3.4 Construct representation

Embretson (1983, 1986, cited Anastasi, 1990) has proposed a new approach to the assessment of construct validity arising from the process orientation of cognitive psychology, as opposed to the focus on outcome measures (correlation between test result and another measure) that arose from psychophysics. In Embretson's approach the study of construct representation is to "identify specific information processing elements and knowledge stores needed to perform the tasks set by the test" (Anastasi, 1990, p. 160). Various task decomposition procedures and methods of experimental manipulation are advised so as to measure the contribution of different response components to test performance, and "to determine what theoretical constructs are assessed by the test".

An approach of this type could well be useful when considering the output of some of the complex tasks demanded of subjects when developing health status assessment measures. As in the discussion of reliability earlier, it is difficult not to believe that the performance of many tasks is more concerned with a subject's capacity to conduct a complex cognitive manipulation than with the declared content of the task. Are the results of tradeoff tasks determined more by values about health, the ability to mentally juggle complex notions, or moral principles regarding the value of life?

3.5 Sensitivity/responsiveness

As noted by Donovan et al (1989), a QOL measure needs to be able to discriminate between conditions and/or within conditions as the disease course changes, and Deyo & Patrick (1989) have suggested that a test should be tested for "responsiveness", meaning sensitivity to change, in addition to reliability and validity. It is equally possible to consider sensitivity as a special aspect of validity, which is the approach favoured in this review.

3.6 `Descriptive validity'

Bergner has proposed the term `descriptive validity' to refer to the ability of an instrument to comprehensively characterise a patient's health status. Bergner et al (1981) proposed that item categories should be retained in the SIP even if they fail to account for additional variance, on the grounds that they contribute to the descriptive capacity of the SIP.

4 Techniques Used in the Measurement and Development of Validity

4.1 Correlation

Calculating the correlation between a test score and a criterion measure forms a validity coefficient. The Pearson Product-Moment correlation can be used for this, and as with correlation coefficients generally (and as with reliability coefficients), sample heterogeneity is a major factor: the wider the range of scores, the higher will be the correlation. Crocker and Algina (1986) suggest that whenever a low correlation coefficient is obtained, the researcher should determine whether a restriction in variance has occurred because of sample selection or some aspect of the measurement process obscuring a possible relationship between the variables of interest. Examination of the scatterplot is recommended.
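The effect of restricted variance on a validity coefficient can be illustrated by simulation. This is a sketch only: the population correlation of about .60, the sample size, and the selection rule (keeping only subjects scoring above the test mean) are all assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(1)
n_subjects = 2000
test = [random.gauss(0, 1) for _ in range(n_subjects)]
# Criterion built to correlate about .60 with the test in the full group.
criterion = [0.6 * t + 0.8 * random.gauss(0, 1) for t in test]
r_full = pearson(test, criterion)

# Hypothetical sample selection: only subjects above the test mean are kept.
selected = [(t, c) for t, c in zip(test, criterion) if t > 0]
r_restricted = pearson([t for t, _ in selected], [c for _, c in selected])

print(f"full range r = {r_full:.2f}")
print(f"restricted range r = {r_restricted:.2f}")
```

The restricted coefficient should come out noticeably lower than the full-range coefficient, even though the underlying relationship between test and criterion is unchanged.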

Attention to the form of the relationship between test and criterion has other uses. The Pearson Product-Moment correlation assumes that the relationship is linear and uniform throughout the range, and it may be useful to examine the bivariate distribution via a scatter plot to determine the relationship, e.g. it may not be of equal variability throughout the range (i.e. not homoscedastic; see Shavelson, 1988).

The validity coefficient may also be interpreted in terms of the standard error of estimate, which is analogous to the error of measurement in connection with reliability. Even with a validity of .80, the error of predicted scores may be considerable, but a test may improve predictive efficiency if it shows any significant correlation with the criterion, however low. Anastasi (1990) suggests that even validities as low as .20 or .30 may justify inclusion, depending on the relative benefit from having the test.
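To see why the error of prediction remains considerable at a validity of .80, the usual textbook formula for the standard error of estimate can be applied: the criterion standard deviation multiplied by the square root of (1 - r squared). The function name and the criterion SD of 1.0 below are illustrative assumptions.

```python
import math


def standard_error_of_estimate(r, sd_criterion=1.0):
    """Standard error of estimate for predicting the criterion from the test.

    r is the validity coefficient; sd_criterion is the criterion SD
    (1.0 here, so the result is a proportion of the criterion SD).
    """
    return sd_criterion * math.sqrt(1.0 - r ** 2)


for validity in (0.20, 0.30, 0.80):
    see = standard_error_of_estimate(validity)
    print(f"validity {validity:.2f}: error is {see:.2f} of the criterion SD")
```

At a validity of .80 the error of estimate is still 60% of the criterion standard deviation, and at .20 or .30 it is only a few percent below what would be obtained by guessing the mean.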

In interpreting correlation coefficients the magnitude of the coefficient needs to be checked against the two criteria of whether it is significantly different from .00, and what percentage of variance in one variable is shared with variance in the other (see Hays, 1981; Shavelson, 1988). Remember also (as noted in the previous chapter) that high correlations between variables do not mean that they are causally related, as some further intervening variable may affect both variables. Also, restriction of variance in the scores of the X or Y variables can reduce the maximum correlation value that can be obtained.

It is particularly important to be aware that high correlations can conceal important differences. Anderson, Bush and Berry (1986) compared dysfunction scores on the QWB scale using self and interviewer modes of administration, against a measure of dysfunction gained from applying the QWB to a detailed examination of ancillary clinical information. While finding a very high correlation between the two measures for all subjects (Pearson product moment r = 0.98), tests of sensitivity and specificity showed appreciable differences. Thus for those with actual dysfunction on the physical activity and social activity scales the self-administered QWB was reported to result in the accurate classification of only 45%, i.e. sensitivity = .45 (note that the misclassifications may still have been of dysfunction, but not at the same level), while a sensitivity level of .86 was determined for the interviewer administered mode. Looking at the dysfunction subjects alone, the correlation between self and interviewer mode was .90. High correlations can mask major differences, and there is a need to look at specificity and sensitivity wherever possible (but criteria against which to check are not often available - here it was a carefully calculated QWB value, not dysfunction per se).

Bergner et al. (1976b) also produced data to suggest that an overall moderate or high correlation may mask shortcomings in sensitivity for particular sub-groups. They found the correlation for a total subject sample between self-assessment of dysfunction and SIP score to be 0.52. This was more than each of several subgroups alone, with the correlation for one sub-group (speech pathology patients) being -.01.
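That a very high correlation can coexist with poor sensitivity can be shown with a small constructed example. The scores below are invented for illustration and are not the data of Anderson et al. or Bergner et al.; the assumption is a self-report that understates the criterion score by a constant amount, with an arbitrary cut-off defining dysfunction.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical dysfunction scores (0 = none) from a careful criterion assessment.
criterion = [0, 0, 10, 10, 20, 30, 40, 50, 60, 70]
# Hypothetical self-report that understates every score by 15 points.
test = [c - 15 for c in criterion]
CUTOFF = 20  # assumed threshold at or above which a subject counts as dysfunctional

r = pearson(criterion, test)

pairs = list(zip(criterion, test))
actual_pos = sum(1 for c, _ in pairs if c >= CUTOFF)
actual_neg = sum(1 for c, _ in pairs if c < CUTOFF)
true_pos = sum(1 for c, t in pairs if c >= CUTOFF and t >= CUTOFF)
true_neg = sum(1 for c, t in pairs if c < CUTOFF and t < CUTOFF)

sensitivity = true_pos / actual_pos
specificity = true_neg / actual_neg

print(f"r = {r:.2f}, sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

Here the correlation is perfect because the two sets of scores differ only by a constant, yet one third of the subjects with dysfunction are missed at the cut-off. The correlation coefficient is blind to exactly the kind of systematic shift that classification accuracy exposes.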

4.2 Multiple regression

Multiple Regression, or MR, is increasingly being used in the development and validation of health status measures for the prediction of health status (eg. see Llewellyn et al, 1992; Lipscomb, 1989; Hall et al, 1989). The degree of error likely in such prediction can again be estimated (for calculating the Standard Error of the Estimate to determine confidence intervals, see Crocker & Algina, 1986).

MR is of particular use in health status tests because of the multi-dimensionality of the construct. Anastasi (1990) proposes that for the prediction of most constructs several (sub)tests are likely to be required. A single test designed to measure an appropriate criterion would have to be highly heterogeneous, and Anastasi recommends the combination of several relatively homogeneous tests rather than a single test consisting of many different sorts of items (on the grounds that tests, or subtests, should be homogeneous in the interest of reliability).

The combination of such subtests to allow a decision can be achieved by multiple regression. In the computation of multiple regression each test is weighted in direct proportion to its correlation with the criterion, and in inverse proportion to its correlations with the other tests. Thus the highest weight will be assigned to the test with the highest validity and the least amount of overlap with the rest of the battery.
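The way overlap between tests lowers their regression weights can be illustrated by solving for standardised weights directly from correlations. The three-test battery and all correlation values below are hypothetical: each test has the same validity (.50), but tests 1 and 2 overlap heavily with each other while test 3 overlaps little with either.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Hypothetical correlations among the three tests.
test_intercorrelations = [
    [1.0, 0.7, 0.1],
    [0.7, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
# Hypothetical validity coefficients (each test with the criterion).
validities = [0.5, 0.5, 0.5]

# Standardised regression weights satisfy R_tests * beta = validities.
betas = solve(test_intercorrelations, validities)
# Squared multiple correlation for standardised variables.
r_squared = sum(b * v for b, v in zip(betas, validities))

print("weights:", [round(b, 3) for b in betas])
print("R squared:", round(r_squared, 3))
```

Although all three validities are equal, test 3 receives the largest weight (about .45 against about .27 for each of the other two) because it duplicates least of what the rest of the battery already measures.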

These weights are optimal only for the particular sample in which they were derived, and it is important that the test algorithm is cross-validated by correlating the predicted criterion scores with the actual criterion scores in a new sample. Formulas are available to estimate shrinkage in a multiple correlation (the larger the original sample, the smaller the shrinkage), but empirical verification is considered preferable.
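As an indication of how sample size affects shrinkage, one widely used correction (the common adjusted R-squared formula) can be computed directly. Which shrinkage formula the cited authors had in mind is not stated, so its use here is an assumption, as are the example values of R-squared, sample size and number of predictors.

```python
def adjusted_r_squared(r_squared, n_subjects, n_predictors):
    """Common adjusted R-squared estimate of the shrunken squared multiple correlation."""
    return 1.0 - (1.0 - r_squared) * (n_subjects - 1) / (n_subjects - n_predictors - 1)


# Hypothetical battery of 5 subtests with a sample R-squared of .40.
for n in (30, 100, 300):
    estimate = adjusted_r_squared(0.40, n, 5)
    print(f"n = {n}: adjusted R squared = {estimate:.3f}")
```

With 30 subjects the estimate falls from .40 to about .28, while with 300 subjects it barely moves. A formula of this kind only estimates the expected shrinkage and does not replace correlating predicted with actual criterion scores in a fresh sample.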

Anastasi (1990) also points out that sometimes a negatively correlated variable is needed to obtain the best correlation, due to the need to eliminate some influential variable that is uncorrelated with the criterion but would otherwise introduce irrelevant variance into the test. For example reading comprehension may correlate highly with scores on a test because the test problems require the ability to understand the instructions. Inserting a measure of reading comprehension in the regression equation will eliminate this error variance and raise the validity of the battery (although it is better to redesign the test to eliminate the undesired variable).
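The behaviour of such a negatively weighted variable can be shown with the standard two-predictor formulas for standardised regression weights. The correlations are hypothetical: a test with validity .50, and a reading comprehension measure that is unrelated to the criterion but correlates .50 with the test.

```python
def two_predictor_betas(r_y1, r_y2, r_12):
    """Standardised regression weights for two predictors of a criterion y."""
    denom = 1.0 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2


r_test_criterion = 0.5     # validity of the test (hypothetical)
r_reading_criterion = 0.0  # reading comprehension is unrelated to the criterion
r_test_reading = 0.5       # but it contaminates the test scores

beta_test, beta_reading = two_predictor_betas(
    r_test_criterion, r_reading_criterion, r_test_reading
)
r_squared_battery = beta_test * r_test_criterion + beta_reading * r_reading_criterion
r_squared_test_alone = r_test_criterion ** 2

print(f"weight for test = {beta_test:.3f}")
print(f"weight for reading comprehension = {beta_reading:.3f}")
print(f"R squared: test alone = {r_squared_test_alone:.3f}, battery = {r_squared_battery:.3f}")
```

Reading comprehension receives a negative weight (about -.33) and the squared multiple correlation rises from .25 to about .33, even though reading comprehension by itself predicts nothing.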

As described by Cohen (1968), multiple regression (MR) and analysis of variance/analysis of covariance (AV/ACV) are essentially identical systems. In fact MR was developed in the course of the study of natural variation, while AV/ACV came out of artificial or experimentally manipulated variation, but they are both general linear models. Both are equally robust to violations of normality assumptions. In essence AV is a special simplified case of MR particularly suited to neat experimental layouts where qualitative treatments are manipulated in appropriate orthogonal relationships. It has far less flexibility than MR as it leads to the dichotomisation of variables (so they can be examined as treatments) with a consequent loss of information and associated statistical power. In contrast to the constraints of AV/ACV programs, the very general MR program can accommodate any given design by coding those independent variables of interest.

Experimentalists often criticise MR as an inferior statistic, particularly compared to Analysis of Variance (ANOVA). However Cohen (1968) argued that MR is far more powerful and flexible an analytic system. Dummy variables allow the coding of nominal scale data, subtle variables can be captured via contrast coding, and curvilinear relationships can be examined by means of a polynomial form in power terms so that non-linear regression can be represented within the linear multiple regression framework. The preference for ANOVA was proposed as in part reflecting the original non-availability of MR, because it requires the computation and inversion of a matrix of correlations (or sums of squares and products) among the independent variables, which require major computation for even few independent variables. With electronic data processing facilities, there is no longer a barrier.
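The use of power terms to bring a curvilinear relationship inside the linear framework can be illustrated with a deliberately simple constructed case. The data are invented: a criterion that is an exact U-shaped function of the predictor, so that the predictor itself shows no linear relationship but its square does.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


predictor = [-2, -1, 0, 1, 2]
# Hypothetical U-shaped criterion: y = 2 + 3 * x squared.
criterion = [2 + 3 * x ** 2 for x in predictor]

# The raw predictor entered as a linear term.
r_linear = pearson(predictor, criterion)
# The same predictor entered as a power term (x squared).
power_term = [x ** 2 for x in predictor]
r_power = pearson(power_term, criterion)

print(f"linear term r = {r_linear:.2f}")
print(f"power term r = {r_power:.2f}")
```

The linear term correlates zero with the criterion while the squared term correlates perfectly. The regression remains linear in its weights; only the coding of the independent variable has changed.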

However, caution is needed when using MR. For example many independent variables can be readily generated, with associated loss of statistical power (as df increases) and the need to be aware of type I error rates. For multiple comparisons, the significance level that gives an appropriate overall error rate of alpha is approximated by alpha/n, where n is the number of simultaneous comparisons (Patrick, Bush, and Chen, 1973). However, organisation of the independent variables, and step-wise admission to the analysis (testing for significant increases in R²), can control this.
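The alpha/n approximation can be checked numerically. The sketch below assumes independent comparisons when computing the overall error rate, which is a simplification; the choice of alpha = .05 and ten comparisons is illustrative.

```python
def per_comparison_alpha(overall_alpha, n_comparisons):
    """Significance level per comparison using the alpha/n approximation."""
    return overall_alpha / n_comparisons


def overall_error_rate(alpha_each, n_comparisons):
    """Chance of at least one type I error across independent comparisons."""
    return 1.0 - (1.0 - alpha_each) ** n_comparisons


n = 10
uncorrected = overall_error_rate(0.05, n)
alpha_each = per_comparison_alpha(0.05, n)
corrected = overall_error_rate(alpha_each, n)

print(f"testing each of {n} comparisons at .05: overall rate = {uncorrected:.3f}")
print(f"testing each at {alpha_each:.3f}: overall rate = {corrected:.3f}")
```

Ten uncorrected comparisons give roughly a 40% chance of at least one spurious result, while testing each at .005 keeps the overall rate just under .05.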

Anderson (1976) has also pointed out the caution that is necessary when using MR to test theoretical models. Anderson describes how assuming a linear model could lead to high correlations of the order of 0.98, even when plotting the independent variables reveals curves that are strictly non-parallel, as demonstrated by a significant ANOVA interaction term. Anderson argued that regression-correlation methodology can be useful in applied prediction, but can be misleading when it comes to testing theoretical models. Indeed the great usefulness of regression-correlation analysis in applied prediction stems largely from its insensitivity to real deviations from linear summation models.
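The insensitivity of correlation to departures from an additive model can be demonstrated with a constructed example. This is not Anderson's data: the assumption is a 5 x 5 factorial design in which the true response is purely multiplicative (the product of the two factor levels), to which an additive main-effects model is then fitted.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


levels = [1, 2, 3, 4, 5]
cells = [(a, b) for a in levels for b in levels]
# Hypothetical true model: response is the product of the two factors.
response = {(a, b): a * b for a, b in cells}

grand_mean = sum(response.values()) / len(cells)
row_mean = {a: sum(response[(a, b)] for b in levels) / len(levels) for a in levels}
col_mean = {b: sum(response[(a, b)] for a in levels) / len(levels) for b in levels}

# Additive (main effects only) prediction for each cell.
additive = {(a, b): row_mean[a] + col_mean[b] - grand_mean for a, b in cells}

observed = [response[c] for c in cells]
predicted = [additive[c] for c in cells]
r = pearson(observed, predicted)
# What the additive model misses is exactly the interaction.
largest_miss = max(abs(response[c] - additive[c]) for c in cells)

print(f"correlation of additive predictions with responses = {r:.3f}")
print(f"largest interaction residual = {largest_miss:.1f}")
```

The additive model correlates about .95 with the responses although the true model contains no additive component at all and the curves for different levels of one factor fan out rather than run parallel. The interaction shows up in the residuals, not in the correlation.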

Anderson (1976) also points out the fallacy that some researchers have engaged in of interpreting the importance of independent variables according to the magnitude of their factor loadings, as it is apparent that such correlations are confounded with a number of factors (such as the range of the relevant variable).

4.3 Factor analysis

Factor analysis began as an attempt by Spearman to examine the question of whether intelligence is the expression of a single major factor or whether there are multiple `intelligences' (Cattell, 1972). Anastasi (1990) describes factor analysis as `particularly relevant to construct validation'. The aim of factor analysis is to explain the correlations among a large number of variables as reflecting variation on a smaller number of underlying inferred factors, to go beyond appearances to basic concepts. Its role is both to generate hypotheses, and to test them, and proceeds from the supposition that many psychological attributes (traits) can be measured only by a whole pattern of variables and not any single variable.

Factor analysis and multiple regression (see Cohen, 1968) are the major expressions of correlation analysis, and Cattell (1972) characterises factor analysis as the principal tool for examining the significance and magnitude of relations among variables when a large number of variables need to be examined simultaneously, just as ANOVA (Analysis of Variance) dominates analysis where variables can be manipulated under strictly controlled conditions.

Factor analysis involves obtaining n measures on the same testees, computing an n × n correlation matrix, and then using factor analysis techniques to identify a number of underlying variables (factors) that account for variation in the n variables.
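The starting point of this procedure, the n × n correlation matrix, can be sketched with simulated data. The example stops short of factor extraction itself and only shows the block pattern in the correlations that an extraction method would then summarise. The six measures, two underlying factors, loadings of .8 and sample size are all assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


random.seed(2)
n_subjects, n_measures = 400, 6
measures = [[] for _ in range(n_measures)]
for _ in range(n_subjects):
    factor_1, factor_2 = random.gauss(0, 1), random.gauss(0, 1)
    for j in range(n_measures):
        # Measures 0-2 reflect factor 1; measures 3-5 reflect factor 2.
        factor = factor_1 if j < 3 else factor_2
        measures[j].append(0.8 * factor + 0.6 * random.gauss(0, 1))

# The n x n correlation matrix from which factor analysis proceeds.
corr = [[pearson(measures[i], measures[j]) for j in range(n_measures)]
        for i in range(n_measures)]

pairs = [(i, j) for i in range(n_measures) for j in range(i + 1, n_measures)]
same_factor = [corr[i][j] for i, j in pairs if (i < 3) == (j < 3)]
different_factor = [abs(corr[i][j]) for i, j in pairs if (i < 3) != (j < 3)]

for row in corr:
    print(" ".join(f"{value:5.2f}" for value in row))
```

Measures built on the same factor should correlate substantially with one another and negligibly with measures built on the other factor. It is this clustering that a factor extraction would express as two factors.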

The n measures may be either items, in which case it may be determined whether the items cluster together as predicted by the theoretical structure of the construct, or tests/measures (that may be made up of sets of items). Again the initial issue is whether the subtests or tests which are supposed to measure the same element are identified as measuring a common factor.

Kaplan, Bush and Berry (1976) attacked factor analysis as a tool in the development of health status indexes. The reason proposed was that once factor analysis derived underlying factors (such as sociability, physical distress, etc.), then it was likely that items that are checked rarely or are poorly correlated with other items (and hence contribute little to explaining variance) would be considered unimportant and excluded. Kaplan et al. argue that such items could correspond to, e.g., rare conditions which, while not loading significantly on any of the larger factors and (because rarely selected) not representing a substantial unique factor, were nonetheless extremely important for the proper assessment of those rare cases. Hence infrequently used items should not be excluded.

The issue here seems to bear on the distinction between construct exploration and data reduction. Infrequent items should not be excluded solely on factor analysis grounds, for as Anastasi (1986) has emphasised, both logic and empiricism must play roles in developing construct measures. On the other hand Hall et al. (1989) have demonstrated the usefulness of factor analysis in health status instrument assessment (for more details, see Chapter F).

5 Conclusions and Observations: The Case for Concept Validity

Anastasi (1986; 1990) has argued that content-, criterion-, and construct-related validation no longer correspond to meaningfully distinct categories, but are products of the developmental history of validation testing. In Anastasi's framework, statistical methodology leads first to the analysing of items against total test scores or external criterion measures, and then to factor analysis, and so on, and construct validity should now be seen as a comprehensive concept that includes all the other types. `Content validation and criteria-related validation can be more appropriately regarded as stages in the construct validation of all tests'. Test scores are seen to be always based on constructs: even in a simple test the factor being measured does not correspond to any single empirical measure. (For example, in a test to measure an individual's walking speed there would be a need to take representative measurements to obtain a distribution of speeds depending on context, purpose, the person's condition at the time etc.) After Messick (1980, cited Anastasi, 1990), Anastasi argues that the term validity, insofar as it refers to the interpretive meaningfulness of a test, should be reserved for construct validity. Content validity should be labelled content-relevance and content-coverage, and criterion-related validity termed predictive utility and diagnostic utility (corresponding to predictive and concurrent validation).

In terms of the test construction process, validity is seen to be built in to a test from the outset, rather than being limited to the last stage of test development. Almost any information gathered in the process of developing a test is related to validity. It begins with the formation of the construct definition, derived from psychological theory or prior research, then follows item preparation and analyses to select the most valid items, followed by internal analyses that may include factor analysis of item clusters or subtests (an item needs to be shown to belong in a scale based both on logic, i.e. the construct definition, and on the results of factor analysis or other procedures of item analysis).

Finally, and most importantly, there should be correlation of scores with external real-life criteria. This needs to be combined with very close examination of the results of such correlation analyses, to ensure that an overall high correlation is not masking a major failure of the instrument for important sub-groups. This is particularly relevant when validating generic health status measures, given the range of patient types that they are required to handle, and the evidence of Bergner and Anderson et al. discussed earlier (section 4.1).

 

F SPECIFIC ISSUES WHEN CONSTRUCTING HEALTH-RELATED QOL MEASURES

It was argued in Chapter B that the measurement of health-related quality of life (HQOL) is best achieved via an instrument that assesses the dimensions found to affect HQOL and combines them into a single index. The proce