Psychological Testing or Psychometrics

The measurement of human mental processes and psychological characteristics. Not all aspects of human psychology lend themselves to formal testing, and psychologists do not rely on test results alone, but may use them in conjunction with other forms of assessment.

Mental states are not directly accessible to measurement. Tests are designed to use information about people's behaviour as indicators of their underlying psychological characteristics. For example, people who suffer from anxiety behave differently in certain situations to those who do not; people of high ability are better equipped to deal with more complex problem-solving tasks than those of lower ability. Tests are designed for use in a variety of situations: for example, to diagnose the reasons for various problems (such as reading difficulties or problems in relating to other people); to make predictions (such as in the selection of people for work); and to assess levels of achievement (as in formalized scholastic attainment tests). Testing uses a range of different types of behaviour as indicators of the underlying psychological characteristics it sets out to measure: from self-reported feelings, attitudes, and behaviour, through the person's ability to cope with various types of problem, to observation of performance in structured tasks.

Tests (of which there are many thousands) are designed to measure various aspects of intellectual and emotional functioning, including personality traits, motivation, beliefs and attitudes, ability and intelligence, and various areas of emotional concern. They are distinguished from simple questionnaires (such as those found in magazines) by the techniques used in their construction, and by the large body of information and evidence which has to be collected about their measurement properties and the meaning of their scores. All tests have a prescribed content, empirically defined measurement properties, and standardized methods of administration and interpretation.

Types of Test

Tests can be divided into two main categories: those that are designed to test maximum performance, and those that assess typical performance. Tests of maximum performance have questions with correct or incorrect answers and the score depends on how many questions the person being tested gets right (or, for some tests, how quickly the test can be completed). Tests of maximum performance include achievement tests, and tests of ability or aptitude (also known as intelligence tests—see IQ). Tests of typical performance do not have "right" answers, but are designed to assess the ways in which people differ in terms of personality, interests, motivation, style of learning, beliefs, attitudes, and values.

Achievement Tests

These tests assess current performance in an academic area. All educational assessment involves some form of achievement testing. However, psychological tests of achievement differ from less-formal school tests in following the principles of psychometrics in their construction and validation. Psychometric achievement testing is used within state-school systems in a large number of countries. Achievement tests administered in a school setting may include separate measurements of vocabulary, language skills and reading, mathematics and problem-solving, science, and social studies. Individual achievement is assessed by comparison of results either with average scores derived from large representative national or local samples (norm-referenced assessment) or with pre-defined expected levels of performance (criterion-referenced assessment).

Aptitude and Ability Tests

These tests assess a person's potential for learning and for future achievement. As such, they are widely used to aid the diagnosis of learning difficulties and also to predict future performance in an area in which an individual is not currently trained. Tests range from ones of specific aptitudes, to those of general reasoning ability. These broad-band tests of general mental ability are often referred to as intelligence tests and are intended to measure the overall capacity of an individual to cope with the intellectual demands made on them.

The more specific tests measure verbal, numerical, spatial, and non-verbal reasoning ability, perception of form, clerical speed and accuracy, motor co-ordination, and finger and manual dexterity. These tests assess skills that are pertinent to many different occupations. In practice, sets of tests are used which focus on a specific area, such as clerical work, computer programming, engineering, or modern languages.

Some tests are designed for use with groups of people and are relatively simple to administer and interpret. They are widely used in occupational testing both for personnel selection and for vocational guidance counselling (where aptitude testing may help clarify an individual's career goals). Individual tests of ability, on the other hand, are used diagnostically and require a greater degree of skill and expertise to administer and to interpret. Such tests include the Stanford-Binet, the Wechsler, and the British Ability Scales.

Self-Report Measures

Self-report instruments allow an individual to indicate preferences for one or more of a set of given alternatives. In interest inventories (tests designed to assess an individual's interests and preferred activities), these alternatives could be job titles or various types of work activity. Self-report personality questionnaires, on the other hand, use alternatives designed to assess a person's social and emotional adjustment. Some are designed for use in clinical assessment to aid the diagnosis of mental health problems and to assess needs for psychological counselling. Others are designed for occupational assessment: to aid the selection of staff, for team-building, or for assessing personal development needs. Instruments are constructed around numbers of personality traits: some may have only 2 or 3 scales, others have 30 or more. However, the traits all tend to relate to five major dimensions of personality: extroversion, emotional stability, conscientiousness, agreeableness, and openness to new experience.

Projective Techniques

Some personality tests use projective techniques (for example, the Rorschach "inkblot" test; the Thematic Apperception Test; word-association techniques; and sentence-completion tests). While some objective and standardized scoring systems have been developed for these tests, the essentially subjective nature of their interpretation and their basis in a theory of projection, which is not widely accepted, make them particularly vulnerable to criticism. They tend to be used mainly in clinical settings in countries which still have a strong psychoanalytic tradition.

Areas of Use

Both the extent of use of psychological tests and the ways in which they are used differ from country to country. However, in general, intelligence and achievement tests are used in educational settings to diagnose learning needs, assess individual accomplishment, and improve instruction and curriculum planning. Screening tests can also be used to identify learning support needs, both for children and adults. Interest inventories and aptitude tests are often used as children approach school-leaving age to assist them in making educational and vocational choices.

In clinics or hospitals, psychological tests may be administered for purposes of diagnosis and treatment planning. Clinical tests can provide information about overall personality functioning and the need for psychotherapy; testing may also focus on some specific question, such as the presence or absence of an organically based brain disorder. Testing is also used in forensic medicine to aid assessment of the intellectual and emotional states of victims and offenders.

Both aptitude and personality tests are widely used in many countries for selection and classification in industrial and organizational settings. There is now very strong evidence to show that general mental ability is one of the best single predictors of future job performance, and there is a growing body of evidence to support links between personality traits and various aspects of job performance and fitting in with people in an organization. Psychological tests are also used to assess training and development needs. The most highly developed use of testing tends to be in areas where the cost of making wrong selection decisions is highest. Such areas include selection for commercial or military pilot training.

Interpreting Test Results

Most tests yield a raw score (an original score which has not been analysed) based on a numerical count of responses, such as the number of correct answers on an ability test. Some produce scores which have no simple relationship to the number of right answers given. Computer-adaptive tests, for example, produce direct estimates of a person's level of ability. Raw scores have limited use, as they are difficult to interpret. Interpretation is aided by one of two main techniques. First, by norm-referencing, which relates a person's score to those of other people. Second, by criterion-referencing, which relates their score to some external criterion or yardstick of performance.

Norm-Referenced Scores

The percentile equivalent of a person's score indicates the proportion of people in the norm reference group that scored below that individual. For example, when a person's raw score falls at the 75th percentile, 75 per cent of people in the norm reference group had lower raw scores. Percentiles are often divided into bands. A common system used with ability tests is the five-grade system. Grades A to E are assigned to the top 10 per cent (A), next 20 per cent (B), middle 40 per cent (C), lower 20 per cent (D), and bottom 10 per cent (E) of the people in the norm group.
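As a rough illustration of the percentile and five-grade ideas above, the following Python sketch computes the percentage of a norm group scoring below a given raw score and assigns a grade band. The function names, the made-up norm sample, and the handling of scores that fall exactly on a band boundary are assumptions for illustration only; published tests define these details in their manuals.

```python
from bisect import bisect_left


def percentile_rank(raw_score, norm_scores):
    """Percentage of the norm group with raw scores strictly below raw_score."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, raw_score)  # count of scores < raw_score
    return 100.0 * below / len(ordered)


def five_grade(percentile):
    """Map a percentile to grades A-E: top 10%, next 20%, middle 40%,
    lower 20%, bottom 10%. Boundary handling here is an assumption."""
    if percentile >= 90:
        return "A"
    if percentile >= 70:
        return "B"
    if percentile >= 30:
        return "C"
    if percentile >= 10:
        return "D"
    return "E"


# Hypothetical norm sample: one person at each raw score from 1 to 100.
norm_group = list(range(1, 101))
pr = percentile_rank(76, norm_group)
print(pr, five_grade(pr))  # 75.0 B
```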

Standard scores are derived by mapping the raw scores on to a bell-shaped distribution known as a normal distribution (see Bell-Shaped Curve). In its standard form, this is defined as having an average of zero and a standard deviation (SD) of one. The SD defines the spread of people's scores about the average, such that the interval of one SD either side of the average contains the scores of about 68 per cent of all the people in the group. By mapping different tests' raw scores on to a common normal distribution, it becomes possible to make direct comparisons between measures which use different raw score scales. It is also possible to convert any standard score into a percentile or percentile-related measure.
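The relationship between raw scores, standard scores, and percentiles can be sketched in a few lines of Python. This example uses a simple linear standardization (subtract the norm group's average, divide by its SD), which assumes the raw scores are already roughly normally distributed; test publishers may instead use other normalizing procedures. The function names and sample data are hypothetical.

```python
from statistics import NormalDist, mean, pstdev


def z_score(raw, norm_scores):
    """Standard score: distance from the norm-group average in SD units."""
    return (raw - mean(norm_scores)) / pstdev(norm_scores)


def z_to_percentile(z):
    """Percentage of a standard normal distribution falling below z."""
    return 100.0 * NormalDist().cdf(z)


# Share of a normal distribution within one SD either side of the average.
within_one_sd = NormalDist().cdf(1.0) - NormalDist().cdf(-1.0)
print(round(within_one_sd, 4))

# Hypothetical norm sample and a raw score to interpret.
norm_group = [40, 45, 50, 55, 60]
z = z_score(57, norm_group)
print(round(z, 2), round(z_to_percentile(z), 1))
```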

The basic standard score scale (average=0, SD=1) is transformed for practical use into various different scales. Many personality measures use sten scores (average=5.5, SD=2), while many ability and aptitude tests use T-scores (average=50, SD=10). General ability measures and measures of intelligence are often expressed as intelligence quotients, or IQ. These are standard scores with an average of 100 with SD=15.
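Each of these scales is a linear transformation of the basic standard score: multiply by the scale's SD and add its average. The sketch below uses the averages and SDs given above; the function and dictionary names are hypothetical, and it leaves scores unrounded, whereas reported stens are conventionally whole numbers from 1 to 10.

```python
# (average, SD) for each reporting scale, as given in the text.
SCALES = {
    "sten": (5.5, 2.0),
    "T": (50.0, 10.0),
    "IQ": (100.0, 15.0),
}


def from_z(z, scale):
    """Convert a basic standard score (average 0, SD 1) to the named scale."""
    average, sd = SCALES[scale]
    return average + sd * z


def to_z(score, scale):
    """Convert a score on the named scale back to a basic standard score."""
    average, sd = SCALES[scale]
    return (score - average) / sd


# A person one SD above the average on each scale.
print(from_z(1.0, "sten"), from_z(1.0, "T"), from_z(1.0, "IQ"))  # 7.5 60.0 115.0
```

Because every scale passes through the same basic standard score, a T-score can be re-expressed as an IQ-style score by chaining the two functions, for example from_z(to_z(60, "T"), "IQ").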

Tables of norms are included in test manuals to enable conversion of raw scores into standard scores or percentile equivalents. These tables are derived from studies in which the test has been administered to a large, representative group of people. The test manual should include a description of the sample of people used to establish norms, including age, sex, geographical location, and occupation.

Criterion-Referenced Scores

These are less common. An example might be a test where raw scores have been directly related to the time taken by people to complete a training course. If a clear relationship can be shown between scores on the test and time taken to achieve training goals, then an individual's test score can be used to predict their likely training performance.

Expectancy tables can be constructed which relate particular ranges of scores on the test to expected levels of performance on the criterion. The criterion can be any variable of interest for which a relationship with a test can be demonstrated. Typically, criteria include job performance measures, training outcomes, effects of treatment, and academic examination scores. To produce expectancy tables, it is necessary to collect a substantial amount of information about the relationship between people's scores on the test and on the criterion.
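One simple way to build such a table, sketched below in Python, is to group people into score bands and record the proportion in each band who reached a given level on the criterion. The data, band limits, and the choice of a pass/fail criterion are all invented for illustration; a real table would rest on a far larger sample.

```python
from collections import defaultdict


def expectancy_table(records, bands):
    """Proportion of people in each score band who met the criterion.

    records: iterable of (test_score, met_criterion) pairs.
    bands:   list of (low, high) inclusive score ranges.
    Bands with no data are omitted from the result.
    """
    counts = defaultdict(lambda: [0, 0])  # band -> [met criterion, total]
    for score, met_criterion in records:
        for band in bands:
            low, high = band
            if low <= score <= high:
                counts[band][0] += int(met_criterion)
                counts[band][1] += 1
                break
    return {band: met / total for band, (met, total) in counts.items()}


# Hypothetical data: (test score, completed training successfully).
records = [
    (12, False), (15, False), (18, True),
    (22, True), (25, False), (28, True),
    (33, True), (35, True), (38, True),
]
bands = [(10, 19), (20, 29), (30, 39)]
table = expectancy_table(records, bands)
for band in bands:
    print(band, round(table[band], 2))
```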

Reliability and Validity

The potential utility of a test score depends on what it measures (its validity) and how accurately it measures (its reliability). An unreliable test cannot provide useful measures of anything. Psychometrics is a set of techniques for estimating the amount of error contained in each test score. The extent to which measures are free from error is measured by the coefficient of reliability. Once the reliability of a test is known, the size of the error of measurement associated with each raw score can be estimated. Reliability sets a limit on validity. However, evidence of validity has to come from a wide range of sources.
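The text does not give formulas, but under classical test theory the error of measurement is commonly estimated as the SD multiplied by the square root of one minus the reliability, and the correlation of a test with any criterion cannot exceed the square root of its reliability. The following Python sketch applies those standard formulas; the function names and example figures are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def score_band(score, sd, reliability, z=1.96):
    """Approximate 95% band around an observed score (z = 1.96)."""
    sem = standard_error_of_measurement(sd, reliability)
    return (score - z * sem, score + z * sem)


def validity_ceiling(reliability):
    """Upper limit on a validity coefficient implied by the reliability."""
    return math.sqrt(reliability)


# Hypothetical test on an IQ-style scale (SD 15) with reliability 0.91.
sem = standard_error_of_measurement(15.0, 0.91)
low, high = score_band(110.0, 15.0, 0.91)
print(round(sem, 2), round(low, 1), round(high, 1))
print(round(validity_ceiling(0.91), 3))
```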

The most important type of validity is construct validity. All the evidence which shows that the test measures what it claims to measure, and that it does not measure what it claims not to measure, is evidence of its construct validity. Such evidence can come from experimental research studies (for instance, studies have shown that extroverts and introverts differ in terms of a range of performance measures) and from correlational studies. These correlational studies involve a new measure being calibrated using scales of established validity, some of which measure similar constructs and others which measure different ones. The pattern of correlations between the new scale and the marker scales can be used to assess what the new scale is measuring.

Content validity concerns the extent to which the sample of items or questions in a test is representative of all the relevant items that might have been used. Words in a spelling test, for example, need to be chosen to sample the relevant areas of word knowledge and levels of spelling difficulty.

Criterion-related validity concerns a test's relationship with external criteria—such as academic attainment, job performance, sales income, or marital compatibility. Measures of criterion-related validity may be concurrent (based on criteria which are measured at the same time the test was taken) or predictive (when the test was taken some time before the criterion was measured). Evidence of predictive validity is particularly important as support for the use of tests in job selection, or in aiding people in making choices about their future.
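Criterion-related validity is usually summarized as a correlation between test scores and the criterion. As a minimal sketch, the Python function below computes a Pearson correlation by hand; the data are invented, and a real validity study would involve many more people and attention to issues such as restricted score ranges.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical predictive study: selection test scores, and supervisor
# ratings of job performance collected some time later.
test_scores = [10, 12, 15, 18, 20]
later_ratings = [2, 3, 3, 4, 5]
validity = pearson_r(test_scores, later_ratings)
print(round(validity, 2))
```

The same calculation serves for a concurrent study; the only difference is that the criterion is measured at the same time as the test.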

Proper Use and Interpretation of Tests

A considerable amount of technical and practical knowledge, understanding, and skill is required to make informed choices between tests and to properly interpret their results. In all areas of use, professionals never rely exclusively on psychological test results in making decisions about people. They always consider them in conjunction with all other available information. Interpretation requires not only an understanding of psychometric principles, but also a good knowledge of the test itself and the body of evidence collected about its measurement properties. Such evidence will be contained in the technical manual provided with the test. This should contain evidence relating to the test's reliability and validity, the conditions under which it should or should not be used, groups for which it might present problems of bias in interpretation, and general guidance and advice on how the results should be interpreted and reported back to other people. The technical manual will also contain information describing the norms of various groups of people, such as adult males, adult females, college graduates, and senior managers.

Used wisely, psychological tests can be of great benefit to individuals and society in helping them to get the best out of education, to improve mental health, to increase satisfaction and well-being at work, and to help industry and commerce make the most of their human resources. However, psychological test data can also confront us with problems which are difficult to handle. Many of the controversies which surround test use reflect existing problems in society. For example, differences in test scores between blacks and whites in the United States are generally considered to be a reflection of the long-term effects of social disadvantage. Research shows that these differences in test scores are not due to bias in the tests. They are real, and are predictive of real social outcomes, such as differences in academic attainment, differences in job performance, and so on.

Given the potential power of psychological tests as a tool, it is vital that they are used carefully and appropriately. Use of tests should always be under the supervision of a competent psychologist, or other suitably qualified, accredited person, and should always be carried out with due regard for local cultural factors and for professional and ethical codes of practice. In the United Kingdom a national system of certification of test user competences is now in place which is based on standards for test use. In most countries, the national psychological association can provide information and guidance covering the supply and use of tests, and standards for test construction. Across countries, the International Test Commission liaises between national psychological associations and provides guidance on technical issues such as adapting tests for use in different cultures.