Large Language Models and Critical Thinking

When asked why people believe in seemingly insane conspiracy theories, a lot of people will say something about “critical thinking” and how we don’t teach it enough. They’re not entirely wrong (though I’m sure the full reasons are much more complicated), but I’m not sure most people who use the term even fully understand what “critical thinking” means.

I actually taught critical thinking for two years. And one of the things I would ask on the first day of class was how, when confronted with conflicting claims such as “humans landed on the moon in 1969” and “the moon landings were actually faked,” people decided which to believe. (Of course, whether either must be believed at all, and whether every controversial issue boils down to exactly two possible positions, are separate questions entirely.) Almost always, the answer I heard was something along the lines of, “I listen to each point of view, and I make my own decision.” When asked why people who believed the other position might have arrived at the “wrong” conclusion, they would usually pinpoint the failure somewhere in the “listen to each point of view” part, not in the “make my own decision” part. It’s almost as though most people believed that “making one’s own decision” was something that happened automatically and infallibly once one had consumed enough information, and that the only way to make a bad decision was to not consume enough information. Most of the rest of the course consisted of trying to convince them that there are plenty of errors that can occur in the “make my own decision” part, and of making them aware of when they were making such errors themselves. Indeed, “being able to make rational decisions once presented with information” is a pretty good working definition of “critical thinking”.

I’ve been thinking about this a lot lately as I’ve seen people go absolutely gaga over large language models such as Galactica and ChatGPT. When these products first launched, there was a lot of fawning praise about how they would “replace search engines” and even put human creatives out of work, and much of that praise sounded as though LLMs were, or would soon become, a sort of omniscient oracle: ask any question and get the correct answer, no further research or fact-checking needed. Then people started identifying lots of questions for which these language models gave laughably wrong answers, and scarier cases in which they gave answers that were wrong, but not laughably so: plausible-sounding answers that people were apt to believe without looking much further. And of course, these problems did not dissuade the true believers; much as with cryptocurrency, when presented with a list of their pet project’s shortcomings, AI evangelists are apt to fall back on the excuse that “it’s early days,” and that with more and better training we will someday get the omniscient computer overlords we’ve been dreaming of.

But large language models of the type that currently dominate the headlines will never get enough training data, because their flaws do not lie in the training data alone. Rather, they lie in what they’re being trained to do – and what they’re not being trained to do.

Let’s first look at the other deep learning success stories that are currently grabbing headlines: text-to-image models, such as Midjourney and Stable Diffusion. These allow one to type in a description of a scene, and get an image. There was a lot of early hype about how they would replace illustrators, followed by the harsh reality check that they have no idea how many fingers a human being is supposed to have. And, of course, there’s the inevitable bias: the way they tend to generate images of white males unless explicitly asked to do otherwise, the way they associate certain negative terms with certain groups of people, and so forth. However, unlike with LLMs, there is a good case to be made that the shortcomings of text-to-image models really are due to problems with the training data. These models are trained on huge datasets of tagged images such as ImageNet, or else on images and their surrounding text that have been scraped from the web (often without the content owners’ permission). Their problems can mostly be attributed to incorrectly tagged images in their training sets, insufficiently diverse images for certain tags, and other issues that could be solved with more and better data. When it comes right down to it, they’re just big, fancy classifiers.
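To make that contrast concrete, here is a toy sketch (purely illustrative names and values, not any real dataset or loss function) of what “just a big, fancy classifier” implies: every training example pairs an input with a human-chosen label, so each example carries its own notion of a “right answer” for the loss to push the model toward.

```python
# Purely illustrative: a labeled example plus a crude stand-in for a
# classification loss. The point is only that a human-supplied label gives
# each example its own measure of correctness.
labeled_example = {
    "image": "photo_of_a_dog.jpg",        # the input
    "tags": ["dog", "golden retriever"],  # human-supplied ground truth
}

def supervised_loss(predicted_tags, true_tags):
    """Fraction of the true tags the model failed to predict."""
    true = set(true_tags)
    return len(true - set(predicted_tags)) / len(true)

print(supervised_loss(["dog"], labeled_example["tags"]))  # 0.5 -- one tag missed
```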

Large language models like Galactica and ChatGPT, however, aren’t trained to predict images given tags. They’re trained to predict which words likely follow other words. When a training set like ImageNet is built, humans decide that some set of words accurately describes whatever is pictured in the accompanying image. Even if a training set is web-scraped rather than curated, there’s a pretty good chance whoever put the image on the web chose to make the surrounding text at least somewhat descriptive of the image. LLM chatbots, however, are not trained to generate correct answers to questions. They’re trained to sound like humans. They’re designed to pass the Turing Test, which notoriously measures an entity’s intelligence by its ability to trick people rather than its ability to think. And so the only criterion determining what text an AI chatbot uses to respond to a question is whether that text was statistically likely to follow the prompt in the training data. None of those responses are tagged as “true” or “false,” not even in the most flawed manner. One can only hope that correct answers outnumber wrong answers in the training set.
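Here’s a toy illustration of that objective (a simple bigram counter on a made-up corpus, nothing like a real LLM’s architecture, but the same flavor of training signal): the “best” continuation is whatever followed the context most often in the training text, and truth never enters into it.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus in which the false claim simply appears more often.
corpus = (
    "the moon landing was real . "
    "the moon landing was faked . "
    "the moon landing was faked . "
).split()

# Count which word follows each word (real models condition on far longer
# contexts with neural networks, but the objective is the same kind of thing).
next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1

def most_likely_next(word):
    return next_word_counts[word].most_common(1)[0][0]

print(most_likely_next("was"))  # "faked" -- more frequent, not more true
```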

In order to mitigate the problems with these large language models, they would have to be fed curated training sets where answers have been tagged as “true” or “false”. This would of course result in the further problem of the taggers’ biases being encoded in the model, but at least the model would have some concept of true and false. The current models don’t. They literally just listen to everyone’s opinion and make a choice unencumbered by critical thinking. There have been some efforts by the ChatGPT team to introduce some dataset curation in order to prevent the most offensive types of answers – efforts performed by armies of underpaid laborers in the Global South – but even this effort stops far short of tagging each training item for veracity. The overall training philosophy is still “throw tons of data into tons of machines and let them figure it out”.
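For what it’s worth, here is a minimal sketch of what such curation might look like: an entirely hypothetical schema, not anything OpenAI or anyone else actually uses, showing a veracity tag on each training item and the filtering step that today’s “throw everything in” pipelines skip.

```python
# Hypothetical veracity-tagged training items; the tag is a human judgment,
# so the taggers' biases come along with it.
training_items = [
    {"prompt": "When did humans land on the moon?",
     "response": "In 1969.",
     "veracity": True},
    {"prompt": "When did humans land on the moon?",
     "response": "They never did; the landings were faked.",
     "veracity": False},
]

# Keep only what a human tagger marked as true before training on it.
curated = [item for item in training_items if item["veracity"]]
print(len(curated))  # 1
```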

Don’t get me wrong – I’m not saying a large language model trained on well-curated data would be any closer to “artificial general intelligence” or “sentience”. Nor am I saying we wouldn’t still have to think critically about who’s doing the tagging or what biases the models are learning. But as long as we’re optimizing for “imitate a human” and assuming that shoveling in more data will fix everything, these models will only ever get things right by accident.