John H.A.L. de Jong is Professor Emeritus of Language Testing at VU University, Amsterdam, and owner/director of a consultancy business “Language Testing Services”. John graduated from Leiden University and obtained a Ph.D. in Educational Measurement from Twente University.
He has published articles and books in Dutch, English and French on educational measurement and on language acquisition and assessment.
John has been involved, since 1991, in the Council of Europe's projects to define a common framework for language learning, teaching and testing, and has worked on establishing the learning load of a foreign language as a function of the learner's first language.
John de Jong is regularly invited as keynote speaker at international theoretical and technological meetings on language testing and educational measurement.
You started your career as a secondary school teacher a long time ago but later you became interested in testing. What made you focus on educational measurement and on language acquisition and assessment?
I started teaching immediately after getting my Bachelor’s degree in French Linguistics and Literature. Then I continued my studies and I got a Master’s degree majoring in Applied Linguistics with French Literature and English Sociolinguistics as minors. While studying for my Master’s, I realized that my interests lay more in data and methodologies for gathering hard evidence. At the same time, while preparing my secondary school students for their exams, I felt their exams lacked validity and didn’t really reflect their language competences. Therefore, once I passed my MA, I started looking for a job where I could use my knowledge in applied linguistics, specifically in improving the quality of secondary school leaving examinations. I found a position as item writer for French exams at CITO, the Dutch National Institute for Educational Measurement.
When I had been working there for a while, I noticed that each exam officer received large green sheets of computer output containing nothing but numbers that reported on the results of the exam paper they had prepared. Being language people, my colleagues seemed only marginally interested in these computer outputs, but I was very much interested and asked them if I could have a closer look. Pretty soon I was analysing and reanalysing these outputs and I obtained the position of researcher in the unit.
You have been involved in the development of the CEFR, the Council of Europe's project to define a common framework for language learning, teaching and testing. Were there any standardized reporting scales of language proficiency at that time?
In the 1950s the Foreign Service Institute (FSI) in the USA started the development of a scale of foreign language skills for staff at government agencies. This FSI scale comprised 6 levels from 0 (= no functional ability) to 5 (= equivalent to an educated native speaker). Because it was often felt to lack precision, the scale was revised in 1985 by the Interagency Language Roundtable (comprising various agencies of the United States federal government) to include “plus” levels. The scale has since been referred to as the ILR scale. By the 1980s, when the American Council on the Teaching of Foreign Languages (ACTFL) wanted to adopt the FSI scale for use in education, they found that many language teachers would not reach a level above 2 on the ILR scale and decided they needed to increase the granularity of the scale at the lower end in order to make it useful for measuring learners’ progress in schools. Consequently, the levels 0 to 2 of the ILR scale were defined as “Novice”, “Intermediate” and “Advanced” (each divided into Low, Mid and High) on the ACTFL scale, effectively creating nine levels out of the three lowest ILR levels (0, 1 and 2). Anything above ILR level 2 was labelled “Superior” on the ACTFL scale. The ACTFL scale is a well-guarded system with rigorous, centrally controlled certification processes for its testers and raters.
Another offspring of the ILR scale was released in Australia a little earlier than the ACTFL scale: the Australian Second Language Proficiency Ratings (ASLPR), which was renamed the International Second Language Proficiency Ratings (ISLPR) in 1997.
Apart from those rating scales, other scales existed in a number of national contexts and language examinations, and Brian North mentions all of them as source material for the pool of descriptors he assembled when developing the CEFR. The issue with most of these scales, however, was that they were largely based on intuition; it is North’s accomplishment to have developed a data-driven scaling of the descriptors he collected, which probably explains the wider success and international recognition of the CEFR.
Do tests have an impact on planning, learning and teaching?
Tests most certainly have an impact on learning, as language learners wish to prepare for them in order to increase their likelihood of success, and the demand of learners to be well prepared gives rise to commercial teaching programs. While I was teaching in secondary education my students complained that I wasn’t using past exam papers in my lessons like their English and German teachers. I told them that by requiring them to speak and write in French, by reading French texts and by playing French songs and theatre plays, I was in fact giving them a better preparation for the exam, and one more useful for their potential need to understand and use French in their later life. I don’t think they all agreed, but we surely had more fun reading “En attendant Godot” by Samuel Beckett than solving multiple choice questions from past exams.
I do think tests could have more impact on planning students’ learning if they provided more detailed information on why students are not getting the scores they want or need. Simply taking an IELTS exam 17 times, as some students do, and each time getting a score just below what they need on one of the four skills, and the next time on another, is not really very instructive.
Is there a balance between the needs of candidates being assessed and the institutions conducting the assessment?
I assume in many cases this is like a social contract: the students need a particular score on a test to allow them entrance into a university or immigration into a country and the institutions provide them with the exam that is accepted (even if not designed) for that purpose.
Many educators hold the view that tests are artificial, and do not represent the language a learner will need in a real-life situation.
Not all tests are the same in that respect. There certainly are tests that provide very little information on how students passing them will be able to function in real-life language situations. But over the years under the influence of professional language testing associations such as ILTA (International Language Testing Association) and EALTA (European Association for Language Testing and Assessment) there have been improvements and more attention has been paid to validity. This is probably why we see less use of multiple choice questions in some tests nowadays and more items integrating skills.
Language is both the object and instrument of measurement. That is, we must use language itself as a measurement instrument in order to observe its use and measure language ability. Does this make it difficult to distinguish the language abilities we want to measure from the methods used to elicit language?
The language used in the questions and prompts of a language test should not constitute a threshold to the test taker and make it difficult for them to understand what is being demanded. That is why language tests are often designed to measure a specific level of language ability, e.g., aiming at A1 or A2 of the CEFR. But there are also tests that claim to be broad-spectrum and offer measurement along all levels of the CEFR. In those tests special care has to be taken to make sure candidates even at the lowest level are able to understand what is expected of them. But this makes it very difficult to set tasks that are demanding at the higher levels. Therefore tests that truly aim to measure across a range of levels are increasingly taking advantage of the opportunities offered by adaptive testing, a technique in which easier questions are set when a test taker fails to respond correctly to an item and more difficult questions when they succeed, thereby gradually focusing on the level that the candidate actually masters.
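As a rough illustration of the adaptive principle described above, here is a minimal sketch in Python. The item bank, the item difficulties and the update rule are all invented for the example; an operational adaptive test would use a calibrated item bank and a proper ability-estimation procedure (e.g., maximum likelihood) rather than a fixed step.

```python
import math
import random

# Hypothetical item bank with Rasch-style difficulties (invented values).
ITEM_BANK = [{"id": i, "difficulty": d} for i, d in enumerate(
    [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0])]

def p_correct(ability, difficulty):
    """Rasch (one-parameter) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def next_item(ability, administered):
    """Choose the unused item whose difficulty is closest to the current estimate."""
    remaining = [it for it in ITEM_BANK if it["id"] not in administered]
    return min(remaining, key=lambda it: abs(it["difficulty"] - ability))

def adaptive_test(true_ability, n_items=5, step=0.5):
    ability = 0.0            # neutral provisional estimate to start with
    administered = set()
    for _ in range(n_items):
        item = next_item(ability, administered)
        administered.add(item["id"])
        # Simulate the candidate's response.
        correct = random.random() < p_correct(true_ability, item["difficulty"])
        # Crude update: move the estimate up after a success, down after a failure.
        ability += step if correct else -step
    return ability

print(adaptive_test(true_ability=1.0))
```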
Does testing format matter?
The format of any test item type adds additional demands that can have more or less relevance to the language skill that it purports to address and can create inequalities across the population of students that are being assessed. A notorious example is the true/false type of question. More advanced students will be more inclined than students at lower levels of ability to reject any statement as ‘true’, because they feel the statement doesn’t completely cover the truth, irrespective of whether it is intended as ‘true’ or ‘false’ by the test developer. On the other hand, students with limited ability are more inclined to accept any statement as ‘true’, simply because they are more likely to agree with anything they see in print that sounds more or less reasonable. This results in ‘false’ statements being more discriminating between ability levels than ‘true’ statements are. Another example is items that are like puzzles that need to be solved. The c-test tends to be like that. Some students enjoy puzzles, while others detest them and have difficulty engaging with this type of item. Therefore, to minimize the impact of item characteristics, a fair test must include a variety of item types, so that no specific item type gives undue advantage to a specific group of students.
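To make the point about discrimination concrete, the toy calculation below (Python, with entirely invented response data) computes a crude discrimination index for a hypothetical ‘true’-keyed item and a hypothetical ‘false’-keyed item: the mean rest-of-test score of candidates who answered the item correctly minus that of those who did not. With this made-up data the ‘false’ item separates high from low scorers sharply, while the ‘true’ item barely discriminates at all.

```python
# Invented data: for each candidate, whether they answered a 'true'-keyed item
# and a 'false'-keyed item correctly (1/0), plus their score on the rest of the test.
responses = [
    (0, 0, 10), (1, 0, 12), (1, 0, 14), (1, 0, 16),
    (1, 1, 22), (1, 1, 25), (1, 1, 27), (0, 1, 30),
]

def discrimination(item_index):
    """Mean rest-of-test score of those who got the item right minus
    the mean of those who got it wrong (a crude discrimination index)."""
    right = [r[2] for r in responses if r[item_index] == 1]
    wrong = [r[2] for r in responses if r[item_index] == 0]
    if not right or not wrong:
        return 0.0
    return sum(right) / len(right) - sum(wrong) / len(wrong)

print("'true'-keyed item:", round(discrimination(0), 2))   # close to zero
print("'false'-keyed item:", round(discrimination(1), 2))  # clearly positive
```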
Are computer-administered tests more secure than paper-and-pencil exams?
Computer-administered tests are generally more secure than paper-and-pencil exams because they can have built-in algorithms to check test taker behaviour. For example, every keyboard entry and every mouse movement can be stored and used for later scrutiny, analysis and evaluation. In addition, computer-administered tests usually implement more security measures at the individual candidate level, such as cameras that can track all head, hand and even eye movement, and individual test stations with partitions between test takers.
Will cheating occur as long as test results are used to make important educational decisions?
The temptation to cheat is present in any human activity, be it a presidential election, or an exam. The rationale for cheating behaviour is the expectation that the probability of a reward, i.e., an unwarranted win in an election or an undeserved pass on an exam, is greater than the probability of getting caught and thereby losing the sought-after advantage of the cheating behaviour. Test providers therefore have to make clear to test takers that the likelihood of getting caught is extremely high and the penalty on being caught is extremely severe. Educators from their side have the responsibility to inform test takers on the minimal chance of getting away with cheating and more importantly, that test takers are in fact fooling themselves if they succeed in their cheating behaviour.
And one last question. Can human performance be accurately measured?
We must realise that any measurement of human performance, human abilities, psychological traits or language abilities always entails measurement error. Test providers should therefore be open and transparent about the size of the error margin around their reported measurement results and they must do anything in their power and capabilities to reduce the size of this error. The measurement error implies that the actual ability of a test taker lies in a range of scores around the reported score. Test providers have the responsibility to provide indications of the likelihood of the actual ability of a test taker being above and/or below their reported score. It is the responsibility of the score user to decide what kind of error is more important in the situation of their decision: if they choose to accept test scores slightly below a minimum score, they reduce the probability of false negatives, i.e., the probability of rejecting test takers who may in fact meet the minimum requirement; if they only accept scores slightly above the minimum score, they reduce the probability of false positives, i.e., the probability of accepting test takers who in fact do not meet the minimum requirement.
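To illustrate how an error margin and a cut score interact, here is a small sketch in Python. The observed score, the standard error of measurement and the cut score are assumed, purely illustrative figures, and the normal-error model is a simplification; real values and error models come from the test provider.

```python
from statistics import NormalDist

# Illustrative figures only (not from any actual test).
observed_score = 58   # the reported score
sem = 2.5             # assumed standard error of measurement
cut_score = 60        # minimum score required by the score user

# 95% band around the observed score, assuming normally distributed error.
z95 = NormalDist().inv_cdf(0.975)
low, high = observed_score - z95 * sem, observed_score + z95 * sem
print(f"95% band: {low:.1f} to {high:.1f}")

# Probability that the candidate's actual ability meets the cut score.
p_meets = 1 - NormalDist(mu=observed_score, sigma=sem).cdf(cut_score)
print(f"P(actual ability >= {cut_score}) = {p_meets:.2f}")

# Accepting scores from (cut_score - sem) upwards reduces false negatives;
# requiring (cut_score + sem) or more reduces false positives instead.
```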
Anastasia Spyropoulou
anastasia@eltnews.gr
John de Jong is one of our judges in the 2021 ELT Excellence Awards. If circumstances allow, John will join the next Foreign Languages Forum scheduled for the last weekend of August and will present some of the awards.
Quotes
“The format of any test item type adds additional demands that can have more or less relevance to the language skill that it purports to address and can create inequalities across the population of students that are being assessed.”
“We must realise that any measurement of human performance, human abilities, psychological traits or language abilities always entails measurement error.”
“The temptation to cheat is present in any human activity, be it a presidential election, or an exam. Educators from their side have the responsibility to inform test takers on the minimal chance of getting away with cheating and more importantly, that test takers are in fact fooling themselves.”