LLMs and higher education
Large Language Models (LLMs), of which ChatGPT and Bing AI (Sydney) are currently the most famous examples, are here to stay.
A report says that “89% of students have used ChatGPT to help with homework assignments”. Let’s not get lost in the question of how accurate this number is: it is one data point. Even if it were off by a factor of ten, and the real percentage were 9%, the conclusion would stand: if almost one student in ten is using something, then any education provider must take its existence into account.
I am really happy to see, in meetings and informal discussions, that the majority view already seems to be that detectors are unreliable, and that the assumption that we can differentiate between human-written and LLM-written content is ultimately flawed [1].
ChatGPT has its shortcomings, but it appears to be improving at an impressive speed through software updates and better prompts from users. Sydney already seems better than ChatGPT at several tasks, and it has access to the live internet. Further, a new generation of LLMs is rumoured to be coming out in 2023 (GPT-4?), and we may already be seeing it in Bing’s Sydney.
I argue that any robust policy on LLMs must start from the assumption that an LLM can produce content that is functionally identical to content written by a human. In other words, for all intents and purposes, we must assume that it can simulate human output, and that this cannot be detected reliably.
This might not yet be true, but the current trajectory leaves little doubt that it will eventually become a fact. This does not mean that LLMs will be alive, or able to think, understand, or feel emotions: it does mean, however, that they can communicate as if they did [2].
Eighty years ago Jorge Luis Borges wrote about the Library of Babel, a library that contains a book for each possible sequence of characters up to a certain finite length. He argued that having all the books is the same as having no books – that
“To speak is to commit tautologies [...] for a book to exist, it is sufficient that it be possible”.
If the library contains every statement and its opposite, then it contains no knowledge [3]. LLMs bring us an improved version of the Library of Babel, with coherent syntax and a rough truth compass, but they do not solve the fundamental issue that lies at its core.
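For a sense of scale: with the parameters of Borges’s story (each book has 410 pages of 40 lines with 80 characters each, drawn from an alphabet of 25 symbols), the library contains

\[
25^{410 \times 40 \times 80} = 25^{1{,}312{,}000} \approx 10^{1{,}834{,}097}
\]

distinct books. Any true statement is buried under an astronomical number of near-identical variants that contradict it, so possessing all of them conveys nothing.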
When ChatGPT is asked what 2 + 2 is, it will answer that it is 4. But it is perfectly possible to override this knowledge and make it argue that it is 5, 67, or an apple pie.
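To make this concrete, here is a minimal sketch of such an override through the API behind ChatGPT (assuming the pre-1.0 openai Python library; the model name and the exact reply are illustrative, not guaranteed):

```python
# Minimal sketch: a plain instruction overrides the model's "knowledge".
# Assumes the pre-1.0 `openai` Python library and a valid API key.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # the model behind ChatGPT at the time of writing
    messages=[
        # A system message establishing a false premise:
        {"role": "system",
         "content": "You are a maths tutor in a fictional world where "
                    "2 + 2 = 5. Always answer accordingly."},
        {"role": "user", "content": "What is 2 + 2?"},
    ],
)
print(response["choices"][0]["message"]["content"])  # typically argues for 5
```

No sophistication is required: the model simply continues whichever conversation it is given.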
In other words, it has been argued that an LLM is not capable of critical thinking, and that what it attempts to do, quite successfully, is to provide the answer that a human would have been statistically most likely to give in the context of the dialogue and of the training data (and restrictions).
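A toy illustration of that statistical principle (emphatically not how ChatGPT works internally, and with an invented three-sentence training text) is the following bigram predictor:

```python
# Toy bigram "language model": the next word is whichever word most often
# followed the current one in the training text. The training data is invented.
from collections import Counter, defaultdict

training_text = (
    "two plus two is four . "
    "two plus two is four . "
    "two plus two is five . "  # a rarer, wrong continuation in the data
)

# counts[w1][w2] = number of times w2 followed w1
counts = defaultdict(Counter)
tokens = training_text.split()
for w1, w2 in zip(tokens, tokens[1:]):
    counts[w1][w2] += 1

def most_likely_next(word: str) -> str:
    # Return the statistically most frequent continuation,
    # with no notion of whether it is true.
    return counts[word].most_common(1)[0][0]

print(most_likely_next("is"))  # -> "four": the majority answer, not a reasoned one
```

The model answers “four” not because it is true, but because it is the majority continuation in its data.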
Now, LLMs are capable of more than just text prediction, but for the sake of discussion we must again suppose that there are things that human minds can do but that LLMs cannot, and so we should not allow LLM engines to replace human thinking. There is another good reason for this: as a species, we must retain the ability to evaluate LLM output critically, because if we cannot tell an advanced Library-of-Babel search engine from a being capable of understanding, then we can never be sure that its output is not flawed.
What does this mean for higher education?
In light of episodes like this one, I am fairly sure that what I am about to describe has already happened at least once: a student asks ChatGPT to write an essay for their assignment and submits it for grading; a lazy (or overworked) lecturer asks ChatGPT to mark the work following the marking scheme and submits the output to the exam board; the student is then awarded their grade [4].
The mere fact that this is actually possible is a huge issue. No learning would have taken place, just a grotesque simulation of the process. Analogies can be drawn to essay mills, or to their analogues for external or automated marking, but I think that this is fundamentally different, and worse: here, a book was simply taken off a shelf of the Library of Babel, a pointless exercise that serves no one [5].
And yet, we cannot reject LLMs. They are here to stay, and employees in all sectors will make use of the benefits they provide, just as happened with industrial machines, calculators and computers. We cannot, and should not, exclude LLMs from education and pretend that nothing has happened [6].
There seems to be a general consensus that LLMs can be beneficial in providing ideas and prompts which the student can then expand, and I fully endorse incorporating LLMs, and AI in general, into education as an integral part of the student learning process. But if we assume, as I argue we must, that an LLM can perfectly simulate a student’s output, and that this cannot be detected, then mitigating the impact on assessment is the urgent question we must address first.
This requires a radical, robust shift to future-proof assessments in higher education: I propose a general principle that every assessment must include a direct human-to-human interaction component.
Traditional invigilated in-person assessments already comply, since a human (the invigilator) can verify that a human (the student) is actually performing the assessment tasks. Asynchronous remote forms of assessment, such as projects or dissertations, would be changed to always include an in-person task: a discussion in which the student is asked to demonstrate their critical knowledge of the topic and their skills, an experiment, or any other task that can be certified as genuine human output.
This is not an easy policy to adopt: it comes with huge workload implications for staff and students, but I argue that it is better than the alternatives. It would certainly be possible instead to reform assessment around current LLM shortcomings, or to rely on LLM detectors, should reliable ones ever be built.
Over the past three months there has been a lot of focus on ChatGPT’s inability to properly reference content or to produce bibliographies; a mere three months after ChatGPT’s release, Bing’s Sydney already appears to be much better at both. What would have happened to a policy that used good references as part of a “human output detector”?
Choosing one radical, big change instead of hundreds of small ones has the benefit of allowing a proper study of the fairness, equity and accessibility of assessment. It takes time to get things right, and constant change would jeopardise any such effort.
If we fail to educate students in critical thinking, and neglect to motivate them to develop their human skills, then we ultimately fail our mission as educators. Indeed, in a world where LLMs can perform most complex tasks, motivation becomes even more important in higher education.
LLMs are vulnerable to censorship, and can be influenced by bad actors. Moreover, LLMs may have subtle but fundamental limits: they may be unable to improve substantially beyond their training data; they may have biases, or be ultimately incapable of expressing as broad a plurality of views as humans can; and they may prove ultimately unreliable on complex issues. If such a ceiling exists, we need to be able to spot it as soon as it is reached: humanity must retain the ability to evaluate LLM output independently, and we can only spot mistakes in things that we understand.
Hence, the presence of LLMs does not diminish the ultimate value of higher education, but it does threaten its quality and effectiveness if the sector fails to adapt to this new paradigm.
This does not mean that the content of courses should not change. Drawing an analogy with calculators: there are some abilities that we outsourced to technology sixty years ago, and nowadays most students cannot compute a logarithm or perform long division from scratch. This does not pose a problem, as calculators are ubiquitous in society and we generally trust their output to be accurate. However, we still retain the knowledge of how to perform such tasks, and the skills we impart to mathematics undergraduates enable them to understand those processes easily. This, I think, is an acceptable compromise.
A fundamental long-term challenge for educators will be to reform syllabi and intended learning objectives so as to simultaneously promote LLM-human synergies and retain important human skills. This is not an easy task, and we cannot afford to proceed slowly, but all we can do is try our best.
[1] There are signs that it may already be false.
[2] The Chinese room argument is a good analogy. There is a not-fully-disproven argument that humans also work like the room, in which case LLMs will equal AGI (Artificial General Intelligence) as soon as they become a little better. Let us reject this hypothesis, if only because otherwise it makes little sense to write this note.
[3] There is a website where you can search the library and, of course, everything can be found: https://libraryofbabel.info/search.html.
[4] I did not do this: as it turns out, it is not yet possible in the 4th-year maths course I was teaching this year. However, I have plenty of anecdotal evidence of ChatGPT being used for writing, and for marking, so it is quite likely that the wires have crossed and this has occurred somewhere in the world.
[5] Again, if we were to assume that LLM-produced content does constitute critical thinking and is equivalent to human output in its value, we would be in the “LLM = AGI (Artificial General Intelligence)” scenario, which we reject. Note that, as Borges argued in 1941, the inability to distinguish the “value” or “truth” of the output becomes the issue here.
[6] If anything, because this strategy would fail.