Principles of assessment
Author: Caroline Clapham
© Dr Caroline Clapham
This article contains a brief introduction to the main principles which should be followed by the constructors of tests and assessments. It briefly introduces the key concepts of test validity, reliability and washback, and provides guidelines for pre-testing. It gives the addresses of three other language testing web sites and has bibliographical pointers to more detailed discussion, of language testing in particular. A comprehensive glossary of testing terms is also provided.
Table of contents
- 1. Introduction
- 2. Validity
- 3. Reliability
- 4. Washback
- 5. Pre-testing
- 6. Concluding remarks
- Related links
1. Introduction
Assessment has profited from the advances that have taken place in statistical analysis over the last century, and it is now considered important for assessors to check their tests empirically as well as rationally. In addition, the upsurge in interest in the results of tests and examinations by governments, parents and candidates has led to a multi-million pound testing industry, and at the same time there has been a huge increase in the number of scholars researching different aspects of testing and assessment.
Although trends in testing, as in other fields, change over time, some principles of assessment are permanent and are not overly affected by current fashions. They need to be held in mind by assessors whenever they construct a test, whether this be a class quiz, a class essay or an end-of-year examination. (Here the terms 'test' and 'assessment' are used interchangeably.) The most important of these principles fall under three heads: validity, reliability and washback. Whilst the following discussion focuses specifically upon language assessment, the principles of testing apply more generally.
2. Validity
'Validity' is an all-encompassing term which is related to questions about what the test is actually assessing. Is the test telling you what you want to know? Does it measure what it is intended to measure? A test is not valid, for example, if it is intended to test a student's level of reading comprehension in a foreign language but instead tests intelligence or background knowledge.
When a new test is constructed it should be assessed for validity in as many ways as possible. The aspects of validity which are looked at will, of course, depend on the purpose for which the test has been designed and will partly depend on the importance of the assessment. A teacher writing a classroom quiz will not have the time or the inclination to carry out many different investigations of validity, but the constructors of an examination which will affect candidates' futures are duty-bound to examine as many aspects of validity as possible. (It should be remembered that it is not the test itself, but the use of the test for a particular purpose, that should be examined; a test of oral French, for example, may be perfectly valid for 16-year-olds in the British examinations system but might not be valid for doctors needing medical French in order to be able to cope in Parisian hospitals.)
There are different views on the best ways of assessing validity, but there are some key aspects, and it is good practice to investigate as many of these as possible:
2.1 Construct validity
The term 'construct validity' refers to the overall construct or trait being measured. It is an inclusive term which, according to some testing practitioners, covers all aspects of validity, and is therefore a synonym for 'validity'. If a test is supposed to be testing the construct of listening, it should indeed be testing listening, rather than reading, writing and/or memory. To assess construct validity the test constructor can use a combination of internal and external quantitative and qualitative methods. For more about this, see section 6. An example of a qualitative validation technique would be for the test constructors to ask test-takers to introspect while they take a test, and to say what they are doing as they do it, so that the test constructors can learn about what the test items are testing, as well as whether the instructions are clear, and so on.
Construct validation also relates to the test method, so it is often felt that the test should follow current pedagogical theories. If the current theory of language teaching emphasises a communicative approach, for example, a test containing only out-of-context, single-sentence, multiple-choice items, which test only one linguistic point at a time, is unlikely to be considered to have construct validity.
2.2 Content validity
The content validity of a test is sometimes checked by subject specialists who compare test items with the test specifications to see whether the items are actually testing what they are supposed to be testing, and whether the items are testing what the designers say they are. (On specifications as a whole, see Davidson & Lynch 2002, and Alderson, Clapham & Wall 1995: Chapter 2.) In the case of a classroom quiz, of course, there will be no test specifications, and the deviser of the quiz may simply need to check the teaching syllabus or the course textbook to see whether each item is appropriate for that quiz.
One of the advantages of even the most rudimentary content validation is that it identifies those items which are easy to test but which add nothing to our knowledge of what the students know; it is tempting for a test writer to write easy-to-test items, and to ignore essential aspects of a foreign language, for example, because they are difficult to assess.
2.3 Face validity
Face validity is an important aspect of a test; it relates to the question of whether people who are not professional testers, such as parents and students, think the test is appropriate. If these non-specialists do not think the test is testing candidates' knowledge in a suitable manner, they may, for example, complain vociferously and the candidates may not tackle the test with the required zeal. If the test lacks face validity, it may not work as it should, and may have to be redesigned. (See Alderson, Clapham & Wall 1995: 172-73.)
2.4 Criterion-related validity
The aspects of test validity described so far relate to the 'internal' validity of the test, but some methods, and these are widely used for 'high-stakes' tests, also assess the 'external', 'criterion-related' validity of a test. To assess criterion-related validity, the students' test scores may, for example, be correlated with other measures of the students' language ability such as teachers' rankings of the students, or with the scores on a similar test. Such measures assess the concurrent validity of the measure. Similarly the future ability of the students can be assessed (the test's predictive validity) to see if the test can accurately foretell how the candidates will fare in the future. For example, if a test is supposed to assess whether students have a high enough level of a foreign language to be able to teach that language to secondary school children, the test should be validated, perhaps by classroom observation, to see whether students who have passed the test do actually have enough of the foreign language to be able to teach it in the classroom.
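As a rough illustration of the concurrent validity check described above, the sketch below correlates a set of test scores with a teacher's rank order of the same students, using Spearman's rank correlation. The scores, the rankings and the function names are all invented for the example, and Spearman's coefficient is only one of several statistics that could be used for this purpose.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks (1 = smallest), averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the tied 1-based positions
        for i in order[pos:end + 1]:
            result[i] = shared
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two sets of ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: six students' test scores and their teacher's ranking
# (1 = strongest student).
test_scores = [72, 65, 58, 80, 49, 61]
teacher_ranks = [2, 4, 5, 1, 6, 3]

# Reverse the teacher's ranking so that a higher number means a stronger
# student, matching the direction of the test scores.
teacher_strength = [len(teacher_ranks) + 1 - r for r in teacher_ranks]

rho = spearman(test_scores, teacher_strength)
print(f"Spearman correlation with teacher ranking: {rho:.2f}")
```

A coefficient close to 1 would suggest that the test orders the students in much the same way as the teacher does; with so few students, of course, any such figure should be treated with caution.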
3. Reliability
The reliability of a test is an estimate of the consistency of its marks; a reliable test is one where, for example, a student will get the same mark if he or she takes the test, possibly with a different examiner, on a Monday morning or a Tuesday afternoon. A test must be reliable, as a test cannot be valid unless it is reliable. However, the converse is not true: it is perfectly possible to have a reliable test which is not valid. For example, a multiple-choice test of grammatical structures may be wonderfully reliable, but it is not valid if teachers are not interested in the grammatical abilities of their students and/or if grammar is not taught in the related language course.
If the test consists of right/wrong items such as multiple-choice items or some sorts of short answer questions, a reliability estimate such as the Alpha Coefficient or Kuder-Richardson 21 may be calculated (see Alderson, Clapham & Wall 1995: 87-89); but if the test consists of an essay or an oral interview, for example, then other forms of test reliability must be estimated. A statistic which can be used by the statistically sophisticated is based on Generalizability Theory (see Crocker & Algina 1986: Chapter 8), but simpler measures such as correlations between the scores a marker gives on Day 1 and Day 5 (intra-rater reliability), and correlations between two different markers' scores (inter-rater reliability) can be estimated, along with calculations of whether the levels of raters' marks, as well as the order of the scores, are similar (see Alderson, Clapham & Wall 1995: Chapter 6).
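The two simpler kinds of estimate mentioned above can be sketched in a few lines. The first function applies the usual textbook form of Kuder-Richardson 21, which needs only the number of items and the mean and variance of the candidates' total scores; the population variance is assumed here, and some sources use the sample variance instead. The second part compares two raters, looking both at the correlation between their marks (do they order the scripts similarly?) and at the difference between their mean marks (is one rater more severe?). All data and names are hypothetical.

```python
from statistics import mean, pvariance


def kr21(total_scores, n_items):
    """Kuder-Richardson 21 estimate from candidates' total scores."""
    m = mean(total_scores)
    variance = pvariance(total_scores)  # assumption: population variance
    return (n_items / (n_items - 1)) * (1 - m * (n_items - m) / (n_items * variance))


def pearson(x, y):
    """Pearson correlation between two equal-length lists of marks."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


# Hypothetical total scores of ten candidates on a 20-item right/wrong test.
totals = [18, 15, 12, 9, 16, 7, 14, 11, 5, 13]
reliability = kr21(totals, n_items=20)

# Hypothetical marks given by two raters to the same six essays.
rater_a = [14, 11, 17, 9, 12, 15]
rater_b = [12, 10, 15, 8, 9, 14]
agreement_on_order = pearson(rater_a, rater_b)
difference_in_level = mean(rater_a) - mean(rater_b)

print(f"KR-21 reliability estimate: {reliability:.2f}")
print(f"Inter-rater correlation: {agreement_on_order:.2f}")
print(f"Rater A marks higher than Rater B on average by: {difference_in_level:.2f}")
```

In this invented example the two raters would agree closely on the order of the essays while differing in severity, which is why it is worth looking at the levels of the marks as well as at the correlation. The same comparison could be used for intra-rater reliability by replacing the second rater's marks with the first rater's marks from a later day.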
4. Washback
Any language test or piece of assessment must have positive washback (backwash), by which I mean that the effect of the test on the teaching must be beneficial. This should be held in mind by the test constructors; it is only too easy to construct a test which leads, for example, to candidates learning material by heart or achieving high marks by simply applying test-taking skills rather than genuine language skills (see Wall 1997).
5. Pre-testing
If test writers wish to follow good practice in assessment, they need to make sure that their tests or examinations are valid and reliable. Both validity and reliability depend on the test items/tasks working well, and for this purpose most tests are pre-tested. Of course there is no time to do this for a class quiz, but for 'high stakes' tests, if it is possible to have 20 or more students in the pre-testing sample, it is essential (see Alderson, Clapham & Wall 1995: Chapter 4). If a test contains some kind of right/wrong items, where the student gets a mark for each correct answer, an important part of test construction is that the test should undergo item analysis. That is, each item should first be examined to see how difficult it is for the group tested (facility value), and scores on that item should also be correlated with the total scores on the test in order to see whether the highest proficiency students tend to get an item right whilst the lowest proficiency ones get it wrong (discrimination index). If the discrimination index of an item is poor, the item may be easier for the linguistically weak students than for the linguistically strong, and there may be something seriously wrong with the item or its answer key. It is probable that that item should be changed or dropped.
An excellent program for producing the statistics relating to the evaluation of right/wrong test items such as multiple-choice items and short-answer questions is ITEMAN 1989.
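For readers who would like to see what such an item analysis involves, the following is a minimal sketch and not a description of how ITEMAN itself works. It assumes a small, invented table of right/wrong responses (one row per student, one column per item, 1 for a correct answer and 0 for a wrong one). The facility value is taken as the proportion of students answering the item correctly, and the discrimination index as the correlation between the item scores and the total scores; other discrimination formulae exist, and here the total includes the item being analysed.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns None if either list has no spread."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return None  # e.g. an item that every student answered correctly
    return cov / (ss_x * ss_y) ** 0.5


def item_analysis(responses):
    """Return the facility value and discrimination index of each item."""
    totals = [sum(row) for row in responses]
    report = []
    for item in range(len(responses[0])):
        item_scores = [row[item] for row in responses]
        report.append({
            "item": item + 1,
            "facility": mean(item_scores),  # proportion answering correctly
            "discrimination": pearson(item_scores, totals),
        })
    return report


# Hypothetical pre-test data: six students, four right/wrong items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]

report = item_analysis(responses)
for row in report:
    print(f"Item {row['item']}: facility {row['facility']:.2f}, "
          f"discrimination {row['discrimination']:.2f}")
```

In these invented data the fourth item is answered correctly only by the students with the lowest totals, so its discrimination index is negative; that is the kind of item whose wording or answer key would need to be checked. A real pre-test would, of course, need far more than six students.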
If students' futures are likely to be affected in any way by the test results, performance tests such as essays and oral interviews should also be tried out, though on fewer candidates. Such tests and tasks should be pre-tested to see, for example, whether they produce the expected kinds of language, and whether the marking criteria are appropriate. (For more about marking criteria see Alderson, Clapham & Wall 1995: Chapter 5, and Weigle 2002.)
6. Concluding remarks
Testing and assessment take up large slices of teachers' time and, perhaps partly because of this, there is currently great interest in validity issues, in washback and in the ethics of testing. More and more tests now have a pre-testing element built into their construction and, in addition, there is much research into the construction, marking and validation of 'subjective tests'. Readers who wish to read more about Language Testing than just introductions to testing are recommended to consult Bachman 1991. They should also, if they want to read short articles on various aspects of language assessment for first and second language learners, look at some of the chapters in Clapham & Corson 1997. (This publication has 29 chapters on key language testing aspects in first and second language assessment, and each chapter is accompanied by approximately 30 references for further reading.)
7. Glossary of testing terms (with special reference to Language Testing)
|Analytic marking scale||This scale contains a set of criteria for marking a test. For example, for a writing test it might include criteria relating to discourse, grammar, vocabulary and task achievement. The raters have to assign marks for each of these criteria. Cf. holistic marking scale.|
|Assessment||An overarching term which covers both 'assessment' and 'testing'. Some people do not like the term 'testing', and refer to most classroom testing as 'assessment'. This is an unnecessary use of the term, but for some people the term 'alternative assessment', which tends to consist of authentic, but often unreliable, tasks, is preferred to the term 'testing'.|
|Achievement test||A test showing how well students have learnt the section of a course that has just been taught. Such a test may be called a 'progress test' or an 'attainment test'. A summative test is a form of achievement test. Cf. formative test.|
|Aptitude test||A test showing how well a student is likely to learn a particular skill. A language aptitude test may contain, for example, subtests of memory, inductive ability and grammatical understanding.|
|Attainment test||See achievement test.|
|Cloze test||A gap-filling task, where words are deleted at fixed stages in a text and the candidate has to replace them. For example, a cloze test may have every 6th word deleted. Cloze tests are easy to prepare, but because of the random effect of the deletion of every nth word, different cloze tests behave very differently from one another. They should, therefore, undergo item analysis before they are given to candidates.|
|Composition/essay||A task where candidates have to produce at least a paragraph of their own written language. Such tasks are marked subjectively (see analytic and holistic marking scales).|
|Computer-adaptive test||A test marked by computer where the computer program selects items for a particular candidate, based on previous item results, so that the items are intended to be of a suitable level of difficulty.|
|Construct validity||One aspect of the validity of a test.|
|Content validity||An aspect of test validity where specialists decide whether a test, or test items, assesses what the test constructors intend to be assessed.|
|C-test||A test where the first half of every second word is removed from a text and the candidate has to restore the missing letters. The candidate may have to read the previous and the succeeding text, in order to be able to replace the missing half-words. This test has many of the same advantages and disadvantages as the cloze test, but is intended to depend less on the candidate's creative imagination.|
|Criterion-referenced test||This term is the opposite of norm-referenced test. In a criterion-referenced test, the candidate's performance is compared to predetermined criteria, and not to the performance of other students. So, for example, a candidate's speaking performance may be compared to a set of speaking criteria. It is common for objective test papers to be norm-referenced and subjective tests to be criterion-referenced.|
|Diagnostic test||A test which diagnoses a student's linguistic strengths and weaknesses. For example, a diagnostic test might reveal that a student has trouble using articles.|
|Dictation||A test of listening in which the candidate writes down what s/he hears. This test may assess more than just recognition of spoken words. It may be marked according to the exact words (including the exact punctuation and spelling), the accuracy of certain phrases, or simply according to meaning.|
|Direct test||A test of a student's language skill tested in a real-life way. Direct test is the opposite of indirect test.|
|Discrimination||The extent to which a test distinguishes between strong and weak students.|
|Discrimination Index||This tells the test constructor how well a particular item discriminates between the strong and the weak students. See Principles of Assessment section 5.|
|Face validity||The views of the 'layman' on the validity of a test. See Principles of Assessment 2.3.|
|Facility value||This is often calculated at the same time as the discrimination index. It tells the test constructor or researcher how easy an item is for a particular group of students. See Principles of Assessment section 5.|
|Formative test||A test used during a course to assess a student's progress. Such a test is generally aimed at producing feedback for the teacher and the student.|
|Gap-filling test||This task is similar to a cloze test, but the test constructor chooses where the gaps in the text should be. The tester can decide what sort of language he or she is testing, and can write or choose the text accordingly. The text gaps can be open-ended or have multiple-choice solutions or can be accompanied by a bank of possible answers from which the candidate can choose appropriate words or phrases.|
|Holistic marking scale||This term is used in opposition to analytic marking scale. The rater gives a single score, rather than an initial variety of scores, for his or her impression of the language level of the essay or extract of speech, using criteria supplied by the test constructors. The rater may only need to read the script once. Some say that this is a less reliable marking system than the analytic marking system.|
|Indirect test||A test which tests a skill in an indirect way. An example of such a test would be one where Speaking is assessed by a multiple-choice test of the recognition of some of the sounds of the foreign language.|
|Information transfer item||A test task where a candidate has to transfer some information (these tasks are commonly used for testing reading and listening) from one form to another. For example, information in a text may have to be transferred to a table or chart.|
|Inter-rater reliability||The agreement between one rater and another on a candidate's performance.|
|Intra-rater reliability||The consistency with which a single rater assesses one or more samples of spoken or written text. To assess this, the rater would need to mark some scripts more than once.|
|Item analysis||Analysis of the performance of test items carried out on objectively marked items. For item analysis to be worthwhile, you need about 20 to 30 people if chance is not going to play too large a part. For a high stakes test you will need to have at least 100 students in your sample, but 20 to 30 students at the same level as those who are going to take the 'live' test will give some idea of how it will perform. This will show up poor items and alert you to problems you might not have thought of, if nothing else.|
|Matching task||A version of a multiple-choice task, where the candidate has to match the items in two lists. There are usually more items in one of the two lists than in the other, so that students cannot get the final answer simply by deduction.|
|Multiple-choice item||A test item where there is a question or statement followed by a range of possible answers. The candidate has to mark the appropriate one or more answers; an answer key is provided for markers. Such items test recognition rather than production, and although they are very easy to mark, they are difficult to construct.|
|Norm-referenced test||This is the opposite of a criterion-referenced test. A norm-referenced test is a test in which a candidate's score is compared to the scores of the other students in the group. For such tests it is important, where possible, to carry out an item analysis in which the facility value and the discrimination index of each item are calculated.|
|Objective test||A test in which the answers have already been decided, so that the candidate's answers can be compared to an answer key. Examples of objective test items are multiple-choice and short answer questions. Such tests are often easily marked by a computer.|
|Open-ended question||See short answer question.|
|Performance test||A form of assessment, for example, an essay or an oral interview, which usually tests a candidate's productive rather than receptive skills.|
|Placement test||A test which assigns to the candidate a level so that he or she can be placed in a particular class.|
|Predictive validity||The accuracy of a test in deciding the future linguistic performance of a candidate. For example, does the test truthfully predict a person's ability to survive in basic Spanish?|
|Proficiency test||A test of the current language level of the candidate. It is distinguished from an achievement test by the fact that its candidates may come from a range of different language backgrounds and may have acquired their foreign language in many different ways.|
|Progress test||See achievement test.|
|Rater||A marker of a test; usually a marker of a subjective test such as a test of writing or speaking.|
|Reliability||The consistency of a test. See Principles of Assessment section 3.|
|Standard score||A score that has been standardised so that a candidate's reported score means the same thing time after time. For example, British A levels are standardised so that an A grade in one year should be roughly equivalent to an A grade in another year.|
|Short answer question||An objectively markable test item where the student has to produce a short answer, which is compared to an answer key. The fewer words that are allowed in the answer, the easier it is to write an all-inclusive answer key. Another term for this is open-ended question.|
|Subjective test||The opposite of an objective test. This test type must be marked subjectively. Most writing and speaking tests are subjectively marked, and so are many dictations and translations. Subjective tests are often marked by raters using 'analytic' or 'holistic' marking scales.|
|Summative test||A test given at the end of a course. The term is in opposition to formative test.|
|Translation||A test in which the candidate has to translate either the whole text or parts of it into or out of the first language. Such tasks are often difficult to mark reliably.|
|True/false item||A sort of multiple-choice item in which the candidate must choose from two options, such as, for example, True and False. Some test writers reduce the inflationary effect of guessing by adding a third option such as 'Not given'.|
|Validity||The accuracy of a test. See Principles of Assessment section 2.|
|Washback||The effect of a test on teaching. See Principles of Assessment section 4.|
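The fixed-interval deletion described in the glossary entry for the cloze test is simple enough to sketch in code. The function below is only an illustration under stated assumptions: the name `make_cloze` and the sample sentence are invented, counting starts from the first word of the text, and any punctuation attached to a deleted word is removed with it, which a test writer would normally tidy by hand.

```python
def make_cloze(text, n=6, gap="_____"):
    """Delete every nth word of a text; return the gapped text and the key."""
    words = text.split()
    answers = []
    # Indices n-1, 2n-1, ... correspond to the nth, 2nth, ... words.
    for index in range(n - 1, len(words), n):
        answers.append(words[index])
        words[index] = gap
    return " ".join(words), answers


# Hypothetical passage, invented for the example.
passage = (
    "The students arrived early because the examination was due to begin "
    "at nine and nobody wanted to be late for it"
)
gapped, answer_key = make_cloze(passage, n=6)
print(gapped)
print(answer_key)
```

Changing the value of n, or the word from which counting starts, produces a different set of gaps from the same text, which is one reason why different cloze tests can behave so differently from one another.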
For further information on language testing terms refer to the Multilingual
Glossary of Language Testing Terms (1998) and Davies et
International Language Testing Association (ILTA) (2002). http://www.surrey.ac.uk/ELI/ILTA/faqs
Wall, D. (1997). Impact and Washback in Language Testing. In C. Clapham & D. Corson (eds), Encyclopaedia of Language and Education 7: Language Testing and Assessment, 291-302. Dordrecht: Kluwer Academic.
Related links
There is an increasing number of language testing sites online, but as a start readers are encouraged to look at:
The International Language Testing Association (ILTA) 2002 Web Site. This has 10-minute videos on various topics relating to Language Testing: http://www.surrey.ac.uk/ELI
The reader could also look at the following two sites, the first of which gives examples of many tests of English and the second of which describes the DIALANG project in Europe:
Referencing this article
Below are the possible formats for citing Good Practice Guide articles. If you are writing for a journal, please check the author instructions for full details before submitting your article.
- MLA style:
Canning, John. "Disability and Residence Abroad". Southampton, 2004. Subject Centre for Languages, Linguistics and Area Studies Guide to Good Practice. 7 October 2008. http://www.llas.ac.uk/resources/gpg/2241.
- Author (Date) style:
Canning, J. (2004). "Disability and residence abroad." Subject Centre for Languages, Linguistics and Area Studies Good Practice Guide. Retrieved 7 October 2008, from http://www.llas.ac.uk/resources/gpg/2241.