Research shows that the typical teacher can spend up to a third of their professional life involved in assessment-related activities (Stiggins and Conklin, 1992), yet a lack of focus on assessment literacy in initial teacher training has left many teachers feeling less than confident in this area. In this blog, we’ll be dipping our toes into some of the key concepts of language testing. If you find this interesting, be sure to sign up for my Oxford English Assessment Professional Development assessment literacy session.
What is assessment literacy?
As with many terms in ELT, there are competing definitions for the term ‘assessment literacy’, but for this blog, we’re adopting Malone’s (2011) definition:
Assessment literacy is an understanding of the measurement basics related directly to classroom learning; language assessment literacy extends this definition to issues specific to language classrooms.
As you can imagine, language assessment literacy (LAL) is a huge area. For now, though, we’re going to limit ourselves to the key concepts encapsulated in ‘VRAIP’.
VRAIP is an abbreviation for Validity, Reliability, Authenticity, Impact and Practicality. These are key concepts in LAL and can be used as a handy checklist for evaluating language tests. Let’s take each one briefly in turn.
Validity
Face, concurrent, construct, content, criterion, predictive… the list of types of validity goes on, but at its core, validity refers to how well a test measures what it sets out to measure. The different types of validity can highlight different strengths and weaknesses of language tests, tell us what test results say about the test taker, and help us see whether a test is being misused. Take construct validity: this refers to the appropriateness of any inferences made on the basis of the test scores; strictly speaking, the test itself is neither valid nor invalid. With that in mind, would you say the test in Figure 1 is a valid classroom progress test of grammar? What about a valid proficiency speaking test?
Figure 1:
Ask your partner the questions about the magazine.
1. What / magazine called?
2. What / read about?
3. How much?
Answer your partner with this information.
Reliability
‘Reliability’ refers to consistency in measurement, and however valid a test, without reliability its results cannot be trusted. Yet ironically, there is a general distrust of statistics itself, reflected in the joke that “a statistician’s role is to turn an unwarranted assumption into a foregone conclusion”. This distrust is often rooted in a lack of appreciation of how statistics works, but the key statistical concepts are well within the average teacher’s grasp. And once you have mastered them, you are in a much stronger position to critically evaluate language tests.
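To make this concrete, here is a minimal sketch of one common reliability statistic, Cronbach’s alpha, which estimates how consistently a test’s items measure the same thing. The student scores below are invented purely for illustration, and this is no substitute for proper psychometric software:

```python
# Hypothetical scores for 5 students on a 4-item quiz (each item scored 0-5).
scores = [
    [4, 3, 5, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 2, 1],
]

def variance(xs):
    """Population variance of a list of numbers."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def cronbach_alpha(rows):
    """Internal-consistency reliability (Cronbach's alpha)."""
    k = len(rows[0])                  # number of items
    items = list(zip(*rows))          # scores regrouped by item
    item_var = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - item_var / total_var)

print(round(cronbach_alpha(scores), 2))  # → 0.95
```

A value close to 1 suggests the items are pulling in the same direction; a low value warns you that the scores may not be trustworthy, however valid the test looks on paper.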
Authenticity
The advent of Communicative Language Teaching in the 1970s saw a greater desire for ‘realism’ in the context of the ELT classroom, and since then the place of ‘authenticity’ has continued to be debated. A useful distinction to make is between ‘text’ authenticity and ‘task’ authenticity, the former concerning the ‘realness’ of spoken or written texts, the latter concerning the type of activity used in the test. Intuitively, it feels right to design tests based on ‘real’ texts, using tasks which closely mirror real-world activities the test taker might do in real life. However, as we will see in the Practicality section below, the ideal is rarely realised.
Impact
An English language qualification can open doors and unlock otherwise unrealisable futures. But the flip side is that a lack of such a qualification can play a gatekeeping role, potentially limiting opportunities. As Pennycook (2001) argues, the English language
‘has become one of the most powerful means of inclusion or exclusion from further education, employment and social positions’.
As language tests are often arbiters of English language proficiency, we need to take the potential impact of language tests seriously.
Back in the ELT classroom, a more local instance of impact is ‘washback’, which can be defined as the positive and negative effects that tests have on teaching and learning. An example of negative washback that many exam preparation course teachers will recognise is the long hours spent teaching students how to answer weird, inauthentic exam questions, hours which could more profitably be spent on actually improving the students’ English.
Take the exam question in Figure 2, for instance, which a test taker has completed. To answer it, you need to make sentence B as close in meaning as possible to sentence A by using the upper-case word. But you mustn’t change the upper-case word. And you mustn’t use more than five words. And you must remember to count contracted words as their full forms. Phew! That’s a lot to teach your students. Is this really how we want to spend our precious time with our students?
By the way, the test taker’s answer in Figure 2 didn’t get full marks. Can you see why? The solution is at the end of this blog.
Figure 2:
A I haven’t received an invite from Anna yet.
B Anna still hasn’t sent an invite.
This type of ‘negative washback’ typically stems from test design that emphasises reliability at the expense of authenticity. But before we get too critical, we need to appreciate that balancing all these elements is always an exercise in compromise, which brings us nicely to the final concept in VRAIP…
Practicality
There is always a trade-off between practicality and the other elements of VRAIP: validity, reliability, authenticity and impact. Want a really short placement test? Then you’re probably going to have to sacrifice some construct validity. Want a digitally delivered proficiency test? Then you’re probably going to have to sacrifice some authenticity. Compromise in language testing is inevitable, so we need to be assessment literate enough to recognise when VRAIP is sufficiently balanced for a test’s purpose.
If you’re a little rusty, or new to key language assessment concepts such as validity, reliability, authenticity, impact, and practicality, then my assessment literacy session is the session for you.
Solution: The test taker did not get full marks because their answer was not ‘as close as possible’ to sentence A. To get full marks, they needed to write “still hasn’t sent me”.
References
- Malone, M. E. (2011). Assessment literacy for language educators. CAL Digest, October 2011.
- Pennycook, A. (2001). English in the World/The World in English. In A. Burns & C. Coffin (Eds.), Analysing English in a Global Context: A Reader. London: Routledge.
- Stiggins, R. J., & Conklin, N. F. (1992). In teachers’ hands: Investigating the practices of classroom assessment. Albany: State University of New York Press.
Colin Finnerty is Head of Assessment Production at Oxford University Press. He has worked in language assessment at OUP for eight years, heading a team which created the Oxford Young Learners Placement Test and the Oxford Test of English. His interests include learner corpora, learning analytics, and adaptive technology.