Against AI Detection 3: What should we (not) do?
When faced with the unknown, humanity is always tempted by simple solutions. In the (probably very short) era of large language models, the simple solution is the AI output detector.
The underlying idea is simple: text is pasted into a box, and the detector recognises whether it comes from a certain model by checking the sequence of words (tokens) against the weights of that model. Companies know that the idea is simple and appealing, so several detectors have quickly appeared in the wild, like OpenAI’s own detector, ZeroGPT, Originality.ai, and Turnitin’s built-in AI detection. They will tell you that their detectors are validated on a dataset with high sensitivity and specificity, which is probably true. This does not mean that they work as intended.
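To make that statistical idea concrete, here is a minimal sketch of the kind of check most detectors build on - nothing in it comes from any vendor’s actual code; it just scores how predictable a text is under a public reference model (here gpt2, via the Hugging Face transformers library), on the theory that very predictable text is more likely to be machine-generated:

```python
# A minimal sketch of the statistical idea: score how predictable a text
# is under a reference language model. Low perplexity (the text is very
# "expected") is treated as weak evidence of machine generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return its mean
        # cross-entropy loss over the sequence.
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

# A detector would compare this score against a calibrated threshold.
print(perplexity("The quick brown fox jumps over the lazy dog."))
```

Real products wrap this in calibration, per-sentence analysis and training on labelled data, but the core signal is this fragile.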
I believe that AI detection should not be part of the new policies developed for higher education, and the goal of this series is to explain why. While writing, the post grew very long, so I am splitting it into three parts:
Part 3: “What should we (not) do?”.
This is part 3. A short summary of the arguments made in this post is:
“Share your prompt” policies are essentially incompatible with good AI usage.
Tracking drafts could work, but… it’s mass surveillance.
Generative AI improves grades, so every student must use it.
Reform is probably the correct way forward.
“Share your prompt” policies
I have seen a few studies that allowed students to use ChatGPT in assessments, and asked them to share the prompts they used. This is good, and the results are insightful. However, I have also seen a few suggestions for policies that allow generative AI usage as long as the prompts used are submitted as part of the assignment.
I think that the idea comes from a misconception of Generative AI as an “input-output” tool: you ask for something, you get it, and that’s it. Arguably, if markers are able to see how Generative AI was used, then they can take that into account and provide more authentic feedback. A report, or a “methodology” paragraph, disclosing how Generative AI was used, attached as a mandatory part of every assessment, would be an excellent idea. But… the prompts?
Please don’t ask for the prompts! I recently used ChatGPT to write a certain two-page document that I cared about a lot, probably as much as a student cares about an essay. The conversation I used to provide context, then brainstorm, then write it, then refine it, and finally test it against the criteria I wanted it to satisfy… is 134 messages long. The spirit of these messages is not to give 134 instructions and receive 134 outputs (even though that is what is technically going on) - it is a genuine conversation. I did not know what the third message was going to be when I was typing the first.
I think that what I did is good use of AI, the kind of thing that we’d want to encourage in students: use AI to better express their ideas. AI excels when it augments human performance - so what does it mean for students to “share their prompts”? Should we expect submissions with dozens of pages of “prompt” appendices? Is focusing on “quality of prompts” the right approach, when ChatGPT's popularity stems from its conversational nature, which OpenAI specifically prioritised in its training? With all due respect to LinkedIn influencers, the most effective way to engage with ChatGPT is to interact as you would with a friend: “I need to do this, can you help? I’d like…”.
One more remark: if for some reason we put limits on the kinds of prompts we allow, do we really think that students would submit the actual prompts they used? They’d probably just generate five fully-policy-compliant prompts and submit those instead.
Someone proposed critical usage of AI detectors, where students “could use the output from a detector to reflect on their use of AI” - which, since detectors do not work, would simultaneously waste students’ time and keep the AI detection industry around.
I am all for disclosing how AI was used to write an essay, and for educating students on what is good usage and what isn’t, with the goal of preserving critical thinking skills. But we need to give up on the illusion of being able to control the prompts themselves. Instead, we need authentic assessment methods that are AI-resistant.
Tracking drafts
Invigilated, in-person exams are a safe haven. It does not matter how good the AI becomes at assessments if the students can’t use it. However, these are not really suitable for several types of assessments, or degrees (think of remote degrees), and have been widely criticised due to their negative effects on student performance.
During Covid-19, solutions for mass surveillance really gained traction, promising lecturers that they would be able to emulate the “invigilated exam” experience in terms of confidence that no cheating took place. The results were as bad as you’d expect: bad implementations, draconian procedures once a student was flagged, heightened anxiety, problems for students in certain living conditions (e.g. without an isolated room to take exams in), and unclear effectiveness.
The silver lining is that, thanks to those insights, almost nobody is proposing to use proctoring tech against AI now. What is being proposed instead is another flavour of mass surveillance, in which the institution tracks a student’s working patterns. Examples are:
Students can only work on the assignment in dedicated invigilated class hours.
“Fingerprint” the student’s writing style to be able to tell when it changes - instead of detecting AI, detect the student!
Students are either required to submit drafts regularly, or to use a cloud-based University solution (e.g. Google Docs, Microsoft Office 365, or Cadmus) so a chain of drafts can be produced.
While the first two suggestions are obviously terrible, the third would probably work. If we allow generative AI usage to augment human abilities, such an approach would at least guarantee that students spent quite a bit of time on the essay.
Okay, a student could potentially generate an essay with ChatGPT, then manually type it into the cloud platform, then later ask ChatGPT to change a few things, and then manually type it again, and so on… but it is probably easier to just make the changes themselves, and plagiarists are lazy! Further, platforms like Cadmus are planning detection algorithms to catch exactly this kind of behaviour.
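As a purely hypothetical sketch (Cadmus’s actual algorithms are not public), such detection could be as simple as measuring how much new text appears between consecutive saved drafts: incremental typing adds a few characters per revision, while a paste adds hundreds at once.

```python
# Hypothetical draft-tracking heuristic: flag revisions in which a large
# amount of text appears in one go, as opposed to being typed gradually.
import difflib

def inserted_chars(prev: str, curr: str) -> int:
    """Characters in `curr` that have no matching run in `prev`."""
    matcher = difflib.SequenceMatcher(None, prev, curr)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(curr) - matched

# Toy chain of drafts, as a cloud platform would record them.
drafts = ["", "An essay on Moloch. It", "An essay on Moloch. It " + "x" * 600]
PASTE_THRESHOLD = 500  # arbitrary, purely illustrative

for revision, (prev, curr) in enumerate(zip(drafts, drafts[1:]), start=1):
    added = inserted_chars(prev, curr)
    if added > PASTE_THRESHOLD:
        print(f"revision {revision}: {added} characters added at once - flag")
```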
Potential issues include enforceability: what do we do if students work offline and then paste their work in? Mandating the use of a certain tool would surely raise quite a few issues; on the other hand, if we merely recommend it but then scrutinise those who do not use it much more heavily for contract cheating… it does not look great.
As part of a unit in my department, students have been asked in the past to write their work in the cloud-based LaTeX writing software Overleaf, and this was quite successful. With Microsoft Copilot just a few months away, by the time this kind of approach is implemented students might actually be using Office 365 anyway to benefit from AI!
However, we should not lose sight of the fact that this is still quite invasive mass surveillance. We would be asking for the ability to monitor every second of the student’s creative process, effectively choosing surveillance over trust.
Finally, it is entirely possible that in the next few years large language models develop into fully autonomous entities that can use a laptop as a human would, simulating the entire writing process. What then? We would be back at square one.
In short, tracking drafts can be a good mitigation measure, to the point that students have started recommending to each other that they use cloud-based software to save drafts, so they can produce them if wrongly accused of using AI. But it does not solve the core problem, and it might be as vulnerable to future generative AI as the current system is to ChatGPT. Instead, we need to develop authentic assessment methods that are AI-resistant.
Change is necessary
This post might have given some readers the idea that I am in favour of simply letting students use AI freely while we happily go on with our lives. Indeed, I am, because this is what is going to happen either way.
I argue that what needs to change is us, our courses, and our assessments. I give two reasons for this: the first is that preservation of human critical thinking skills is necessary, as we will likely never be able to fully trust AI output. If we do not change our ways, and allow AI-enabled cheating, future humans will not be able to think properly, losing whatever edge we might have against artificial intelligence (if any). I spoke about this in the last part of my January talk on ChatGPT.
There is another, more practical reason. We scale grades, which means that a student’s performance is measured against the performance of other students in comparable situations, and grades are adjusted to smooth out year-on-year variation and enable comparisons across years, classes and groups. With a typical Moloch dynamic, the same essay that got 70% in 2019 might get 60% in 2023, if everyone else is using Generative AI and the overall quality improves.
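To see the effect in numbers, here is a toy sketch - my own illustration, not any institution’s real marking scheme - of z-score scaling against a cohort: the same raw mark of 70 is worth less once everyone else’s marks rise.

```python
# A toy illustration of grade scaling: map raw marks onto a fixed
# target distribution via z-scores against the cohort.
from statistics import mean, stdev

def scale(raw: float, cohort: list[float],
          target_mean: float = 60.0, target_sd: float = 10.0) -> float:
    """Scaled mark of `raw` relative to `cohort`."""
    z = (raw - mean(cohort)) / stdev(cohort)
    return target_mean + target_sd * z

essay = 70.0                                 # the same essay, unchanged
cohort_2019 = [48, 55, 58, 60, 62, 65, 70]   # pre-ChatGPT raw marks
cohort_2023 = [m + 8 for m in cohort_2019]   # everyone else's marks rise

print(f"2019: {scale(essay, cohort_2019):.0f}")  # well above the cohort mean
print(f"2023: {scale(essay, cohort_2023):.0f}")  # the same essay scores lower
```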
This means that it is in the best interest of students to use Generative AI! Even students who dislike it have a strong incentive to learn to use it.
In particular, this means that every resource meant to teach students how to use AI in the “right” way must not impair any of the benefits in workload or quality that generative AI provides. This includes generating the entire essay, if Generative AI ever reaches that level (I predict next year). We cannot ask students to not use AI, or to only use it for certain tasks, and then as a result have students who cheat get higher grades, or better work-life balance, or both.
So, once more: even if we could teach students how to use AI in the “right” way, and even if the majority of students were receptive to these lessons, we need to develop authentic assessment methods that are AI-resistant.
We cannot direct the wind, but we can adjust our sails
So, we need to change our approach. But change does not have to be drastic!
An example from the past: for two decades now, every student enrolled in any linear algebra class has been able to compute determinants of matrices simply by googling a determinant calculator and putting in the numbers[1]. This means that a “compute the determinant of A” homework is vulnerable to cheating. However, this hasn’t quite mattered in the grand scheme of things, as we implemented various solutions:
Asking them to show their work (justify your steps…).
Asking a reflective follow-up (explain why…).
Asking hard-to-compute follow-up questions (“what is the determinant of A^(10^10)?” - see the sketch below).
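That last follow-up rewards theory over computation: the determinant is multiplicative, so a student who knows this can answer instantly, while a plug-in-the-numbers calculator is useless. A quick sketch of the reasoning:

```latex
% The determinant is multiplicative, det(AB) = det(A) det(B),
% so det(A^k) = det(A)^k for every positive integer k. Hence
\[
  \det\!\left(A^{10^{10}}\right) = \det(A)^{10^{10}},
\]
% and the question reduces to one ordinary determinant - no calculator
% helps with the matrix power, but the theory makes it a one-liner.
```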
Computers have been able to solve quite a lot of undergraduate maths problems for a long time - in comparison, a computer could not really produce a critical review of an author’s life work before ChatGPT became available, so mathematics has had longer to adapt and is more impervious to this disruption. However, the general philosophy remains the same: our response to AI should be to acknowledge its existence and its advanced abilities… and reform our assessments accordingly.
Authentic assessment methods that are AI-resistant
The main objective of this post was to argue in favour of the development of said assessment methods, not to actually develop them. This is, of course, a huge task - I fully expect that in a few years we will be seeing books on the topic.
However, we cannot wait for those because the show must go on, and Generative AI is available now, and it is about to be catastrophically more available. Here are my small, generic suggestions:
In-person, invigilated exams are a safe haven - despite all their shortcomings, they should be used whenever suitable in this transition phase.
If the assessment is composed of several tasks, consider shifting the balance. For instance, a project currently marked as 80% essay, 20% oral presentation might be changed to 50% essay, 50% oral presentation, to allow more room for critical reflection and evaluation of the student’s knowledge during the latter.
Small assessments already intended to be more formative than summative (e.g. coursework worth 10%) are probably fine as they are. Having them count towards the final grade encourages participation and effort, so resist the temptation to make them formative only (for instance, in our first-year courses students get marks for tutorial attendance!).
If you are planning on monitoring drafts, consider adding a staged approach to the assessment. For instance, students could receive feedback on each draft, or even marks if the assessment is clearly divided into phases.
Despite GPT-4 being multimodal, non-text-based input is still unavailable to the vast majority of the public, and apparently not very good. Hence, a question that includes any kind of visual, non-textual input is harder for AI to process.
Include explicit instructions on what constitutes appropriate use of AI. You won’t be able to detect most violations, but not all students know this and the more anxious ones might hesitate to use it for fear of repercussions, which puts them at a disadvantage.
Do not rely on misconceptions about the abilities of generative AI. You will hear that it does not excel at scenario-based questions, or at critical reflection, or at references[2]. Don’t trust this kind of argument: we don’t know what current AI can do, and new stuff comes out every day. When you see the sentence “ChatGPT cannot do X”, read it as “I was not able to make ChatGPT do X”.
You can find many more takes on the topic on the internet. Don’t just listen to me, don’t just listen to them: it is our job to think, reflect, evaluate and act in the best interest of our students. External perspectives should inform, not dictate our decisions!
Conclusion
Do not wait for the first light of the fifth day - the cavalry might not be coming. Start working on your courses now! I suggest asking yourself these questions:
Are your assessments vulnerable to generative AI? What small changes can you make to assessments to improve this? Do you need radical changes?
What would you like students to know about generative AI, in the context of what you teach?
Do you have a clear idea of what acceptable usage of AI by students would be? Is it enforceable, and realistic? Have you compared it with other academics? How can you communicate this to students?
Finally, I suggest an exercise. Take past work, from before ChatGPT existed, and (while mindful of GDPR) run it through OpenAI’s text classifier. I predict you’ll find a few surprises. Or not - after all, an 8-ball is 50% accurate about the future.
Should things go south in your institution, remember all this if you find yourself on a malpractice panel!
[1] I am not saying that “generative AI is just like calculators, or Wikipedia” - it’s not! Or, if it is, then only in the sense that “the atomic bomb is just like knives, or bows and arrows”: in terms of impact, disruption, potential, and the scale of change that is necessary. We’ll need to do a lot more than develop a policy on calculators, or educate students to cite Wikipedia’s references instead of Wikipedia itself.
[2] My impression is that these three claims are currently false, false, and true, respectively.