Apple Study Uncovers Flaw in OpenAI, Google, and Meta LLMs

A study by Apple researchers suggests that the reasoning intelligence of LLMs from OpenAI, Google, Meta, and others may be closer to "sophisticated pattern matching" than "true logical reasoning." This holds true for OpenAI’s o1 advanced reasoning model as well.

Suspecting data contamination in GSM8K, the most common benchmark for reasoning skills, the researchers developed a new benchmark called GSM-Symbolic. The new benchmark keeps the essence of the original reasoning problems but varies elements such as names and numbers, adjusts complexity, and adds irrelevant information.
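As an illustration of that idea (a minimal sketch with a hypothetical template and name pool, not the paper's actual generation code), a GSM-Symbolic-style variant can be produced by treating a problem as a template and resampling its names and numbers while keeping the ground-truth computation fixed:

```python
import random

# Illustrative template in the spirit of GSM-Symbolic: the problem's logic is
# fixed, while the names and numbers are resampled for each test instance.
TEMPLATE = (
    "{name} picks {friday} kiwis on Friday. Then {name} picks {saturday} kiwis "
    "on Saturday. On Sunday, {name} picks double the number of kiwis picked on "
    "Friday. How many kiwis does {name} have?"
)

NAMES = ["Oliver", "Mia", "Ravi", "Sofia"]  # hypothetical name pool

def make_variant(seed: int) -> tuple[str, int]:
    """Return (question, correct_answer) for one randomized instance."""
    rng = random.Random(seed)
    friday = rng.randint(10, 90)
    saturday = rng.randint(10, 90)
    question = TEMPLATE.format(name=rng.choice(NAMES), friday=friday, saturday=saturday)
    answer = friday + saturday + 2 * friday  # ground truth stays computable
    return question, answer

if __name__ == "__main__":
    question, answer = make_variant(seed=0)
    print(question)
    print("Expected answer:", answer)
```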

What followed was a discovery of "fragility" in LLM performance. The study tested more than 20 models, including OpenAI's o1 and GPT-4o, Google's Gemma 2, and Meta's Llama 3, and every model's performance declined when the variables were changed.

To test the hypothesis that LLMs rely more on pattern matching than on actual reasoning, the study added superfluous phrases to math problems to see how the models would react. For example: "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?"

Consequently, there was a significant drop in performance across the board. OpenAI's o1-preview fared best, with an accuracy drop of 17.5 percent, while Microsoft's Phi 3 model performed 65 percent worse.

In the kiwi example, the study said LLMs tended to subtract the five smaller kiwis from the total without recognizing that kiwi size was irrelevant to the problem. This illustrates that "models tend to convert statements to operations without truly understanding their meaning," validating the researchers' hypothesis.
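For concreteness, the correct answer ignores the size remark entirely: 44 + 58 + (2 × 44) = 190 kiwis. A model that subtracts the five "smaller" kiwis instead arrives at 185.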

The study stated that testing models on the benchmark variant that includes irrelevant information "exposes a critical flaw in LLMs’ ability to genuinely understand mathematical concepts and discern relevant information for problem-solving."
