ChatGPT and assessments discussion panel at SoE conference
As part of a discussion panel on ChatGPT and assessments, I gave this talk at the Engineering the Future Conference 2023 at the University of Manchester on the 29th of March 2023.
Watch it here:
Slides (as of 29/03/2023):
A few additional thoughts (what follows is not a summary of the talk at all):
Two of the other speakers on the panel were Dr. Louise Dennis and Dr. Iliada Eleftheriou. Faced with the novelty of ChatGPT, they took action: the first allowed the use of ChatGPT in a ring-fenced part of an assessment worth 15%; the second developed an “AI Code of Conduct”, encouraged students to use ChatGPT in her course unit, and then polled them on their experiences (I can’t wait to read the paper). The results presented were interesting and insightful, but most importantly… data points.
Undoubtedly because of my pure mathematics background, as you have seen, I’ve been focusing on the slow process of figuring this out the right way, to make sure that, when guidance is developed, it is robust.
I stand by all of my suggestions (e.g. don’t develop policies that rely on detectors, and don’t rely on specific current shortcomings of LLMs) but… in the meantime, reality has not been paused and all things happen in the present: students are writing assignments right now, detectors can help spot some AI-generated essays right now, and current shortcomings can be used to make assessments more robust while we wait for the general guidance to be developed.¹
An example: ChatGPT often starts an answer with a basic introduction that states assumptions, objectives and definitions. I think that is a byproduct of RLHF, but regardless of the reason, it can be used to spot potentially ChatGPT-written answers!
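As a toy illustration of the kind of tell I mean (nothing more than a sketch: the phrase list below is made up by me and is not a validated detector), a marker could skim the opening paragraph of an essay for this sort of boilerplate:

```python
# Toy heuristic: flag essays whose opening paragraph reads like the
# stereotypical ChatGPT preamble (restating assumptions, objectives,
# definitions). The phrase list is purely illustrative, not validated.
BOILERPLATE_PHRASES = [
    "in this essay, i will",
    "before we begin, it is important to",
    "let us first define",
    "the objective of this answer is",
    "we will make the following assumptions",
]

def looks_like_chatgpt_preamble(essay: str) -> bool:
    """Return True if the first paragraph contains a tell-tale phrase."""
    first_paragraph = essay.strip().split("\n\n", 1)[0].lower()
    return any(phrase in first_paragraph for phrase in BOILERPLATE_PHRASES)

if __name__ == "__main__":
    sample = (
        "In this essay, I will outline the assumptions and objectives "
        "of the proposed control system.\n\nThe plant model is..."
    )
    print(looks_like_chatgpt_preamble(sample))  # True
```

Of course a flag from something like this is only a prompt to look closer, never a verdict.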
Can a detector be fooled by decent prompt engineering? Yes (but plagiarists are lazy). Can a tool be trained to generate essays that fool detectors? Yes. Is there such a tool widely available and known to the public? Not yet.
So: if while marking an essay today you spot potentially AI-generated text, running it through a detector is a useful thing to do, and a positive result should update your position a little towards “malpractice” (and vice versa).²
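To make “update a little” concrete, here is a back-of-the-envelope Bayesian sketch; the prior and the detector accuracies are numbers I have made up purely for illustration:

```python
# Back-of-the-envelope Bayes update: how much should a positive detector
# result shift your belief that an essay was AI-generated?
# All numbers below are made up for illustration.
prior = 0.10           # prior probability that this essay was AI-generated
sensitivity = 0.80     # P(detector flags | AI-generated text)
false_positive = 0.10  # P(detector flags | human-written text)

# Total probability of the detector flagging the essay
p_flag = sensitivity * prior + false_positive * (1 - prior)

# Posterior via Bayes' theorem
posterior = sensitivity * prior / p_flag
print(f"posterior after a positive result: {posterior:.2f}")  # ~0.47

# The flag moves you towards "malpractice", but nowhere near certainty:
# an academic judgement is still needed (see footnote 2).
```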
A successful strategy cannot be based on the existence of robust detectors, or on a specific weakness of mainstream variations of a given LLM. But, while said strategy is developed, any help is welcome, and any signal (no matter how noisy) is worth something.
In a nutshell: imperfect tools are useful, temporary workarounds are useful, interim guidance is useful, observational studies are useful, and thankfully some people have been working on each of these. The best thing you can do is something: use GPT daily, talk to students about it, listen to a lot of people and hear about what they have done.
To quote Eliezer Yudkowsky: “They won't do it for you. Nobody else will do your job. There are not competent adults running around looking for things to do. There is nobody who fills in the gaps. There is nobody who holds up the walls. You walk through a mostly empty universe.”
I think about this take a lot, but apparently not enough.
Of course, and as usual, not everyone you listen to will update you in the right direction (e.g. Chomsky is completely wrong here), but crowds are wise, and one’s a crowd.
The very worst thing you can do is disengage: this is here to stay. Even if they stop training new ones, GPT-4 in its infinite incarnations is already here. And even if they shut down GPT-4, LLaMA’s weights have leaked and some version of it can probably run on consumer hardware, like a phone.
¹ Guidance which, if effective, will likely require workload negotiations and, in general, significant changes to be implemented.
² And of course, just as in the current policy “Turnitin alone does not confirm the presence of malpractice; an academic judgement must be made to reach this conclusion”, the same holds here. False positives with GPT detectors happen too, so be cautious. A detector’s output is a hint, not evidence.