A recently published study has found that GPT-3, a forerunner of the large language models that power AI chatbot ChatGPT, reasons about as well as college students when answering "reasoning problems." Because GPT models do not learn in any way comparable to human beings, the researchers wondered whether this sort of reasoning represents a "brand new" form of intelligence.
However, confirming whether the software is indeed “mimicking human reasoning” or using “a new type of cognitive process” would require backstage access to OpenAI's systems, which is unlikely to happen any time soon.
If ChatGPT is indeed reasoning in a novel way, it could pose a number of interesting quandaries for AI ethics, research, and development.
GPT-3's Logic and Reasoning Skills Surpass Expectations
UCLA’s study, which appeared in Nature Human Behaviour this week, found that “GPT-3 displayed a surprisingly strong capacity for abstract pattern induction” when answering general intelligence and SAT test questions.
With this capacity, GPT-3 matched or surpassed “human capabilities in most settings,” while “preliminary tests of GPT-4 indicated even better performance.”
The results, authors Taylor Webb, Keith J Holyoak, and Hongjing Lu say, “indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems.”
Zero-shot learning refers to a model's ability to identify, recognize, or classify things it has never explicitly been trained on — in the case of LLMs, solving a task without being shown any worked examples of it first.
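To make the distinction concrete, here is a minimal sketch in Python — my own illustration, not the prompt format used in the study — contrasting a zero-shot prompt, which gives the model no worked examples, with a few-shot prompt that includes a demonstration:

```python
def build_prompt(question: str, examples=None) -> str:
    """Assemble a prompt for a language model.

    With examples=None this is a zero-shot prompt: the model must
    solve the task from the instruction alone. Passing (question,
    answer) pairs turns it into a few-shot prompt instead.
    """
    parts = ["Complete the analogy."]
    for q, a in (examples or []):
        parts.append(f"Q: {q}\nA: {a}")  # demonstrations (few-shot only)
    parts.append(f"Q: {question}\nA:")   # the actual problem to solve
    return "\n\n".join(parts)

# Zero-shot: instruction and question only, no demonstrations
zero_shot = build_prompt("a b c -> a b d ; i j k -> ?")

# Few-shot, for contrast: one demonstration is included
few_shot = build_prompt(
    "a b c -> a b d ; i j k -> ?",
    examples=[("1 2 3 -> 1 2 4 ; 5 6 7 -> ?", "5 6 8")],
)
```

The study's claim is that GPT-3 solves analogy problems of this kind from the zero-shot style of prompt — with no demonstrations at all.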
“Language learning models are just trying to do word prediction so we’re surprised they can do reasoning. Over the past two years, the technology has taken a big jump from its previous incarnations.” – Hongjing Lu, co-author of the study
Study co-author Keith J Holyoak adds that, although ChatGPT appears to think like a human in some regards, its training is fundamentally different: “People did not learn by ingesting the entire internet, so the training method is completely different.”
He said they hope to explore whether the language model's cognitive processes are “brand new” and a truly “real artificial intelligence.”
GPT-3's Test Results
GPT-3 did particularly well on Raven’s Progressive Matrices, an image-based pattern-prediction test, scoring 80% where the average human scores below 60%.
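Matrix problems of this kind present a grid whose rows follow a hidden rule, with one cell left blank. This toy sketch — my own illustration, not the study's test materials — solves a simple Raven-style matrix whose rows each increase by a constant step:

```python
def missing_cell(matrix):
    """Infer the blank final cell of a 3x3 matrix whose rows
    follow a constant-step progression (a simple Raven-style rule)."""
    # Learn the step size from the first, complete row
    step = matrix[0][1] - matrix[0][0]
    # Apply the same rule to the incomplete final row
    return matrix[2][1] + step

problem = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, None],  # the cell the test-taker must fill in
]
print(missing_cell(problem))  # -> 9
```

Real Raven's items compose several such rules at once, which is what makes the test a standard measure of abstract reasoning.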
Interestingly, when the language model did make mistakes, it often went wrong in similar ways, and at similar stages, to the human test-takers.
Although GPT-3 performed commendably on analytical reasoning, there were other question sets it struggled with, including those requiring some level of comprehension of physical space.
GPT-3 also didn't do quite as well as students on test questions designed to see whether subjects could pair up analogies conveying similar meanings. Notably, however, GPT-4, the successor to GPT-3 and GPT-3.5, performed better on this test.
LLMs Continue to Show Promise on Tests
This isn’t the first example of a language model being put to the test and producing scores comparable to human subjects.
For example, GPT-3.5 — the model that currently powers the free version of ChatGPT — scored in the bottom 10% of test takers on a simulated law school bar exam. GPT-4, which is available to ChatGPT Plus customers, scored in the top 10%.
Bard’s test results are less impressive. Fortune scraped some SAT math questions from the internet and posed them to Bard, but the chatbot answered between 50% and 75% of them incorrectly. It did, however, perform better on reading tests — at least well enough to secure a place at several US universities.
However, these tests were conducted before Bard shifted to using a more capable language model, called PaLM 2, which performs particularly well when prompted with math and coding-based questions.
Questions Still Remain
However, the question that still remains — and can’t be answered without OpenAI effectively granting research teams a ChatGPT Access All Areas pass — is how these reasoning processes actually occur inside the language models that power ChatGPT.
If ChatGPT has developed a cognitive process for reasoning that departs significantly from known human processes, the ramifications could be wide-reaching, and open up a whole web of ethical questions relating to sentience and evaluating AI systems.