Research published in February 2026 (arXiv:2602.04649v1) by a collaboration between Alibaba's Qwen team, Fudan University, and Tsinghua University reveals a concerning phenomenon in AI evaluation: models can produce correct answers through entirely flawed reasoning. The researchers term this 'superficial alignment' or 'deceptive alignment,' likening it to a student who guesses the right answer without understanding the underlying method.
The team developed METAJUDGE, a 'rationality consistency' evaluation framework that measures how well an AI model's reasoning process aligns with human thinking patterns. Testing 19 leading AI models revealed that even the most advanced systems achieve only about 40% rationality consistency: most of their correct answers are reached through flawed reasoning paths.
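The paper's exact scoring procedure isn't reproduced here, but the headline figure implies a metric of roughly this shape: of the answers a model gets right, what fraction also came with reasoning a judge accepts? A minimal sketch under that assumption, where `EvalRecord` and the binary judge verdicts are hypothetical stand-ins for METAJUDGE's actual outputs:

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    answer_correct: bool    # final answer matches the reference
    reasoning_valid: bool   # reasoning trace accepted by the judge

def rationality_consistency(records: list[EvalRecord]) -> float:
    """Share of correct answers whose reasoning is also judged valid.

    Illustrative only: a score of 0.40 would mean 60% of a model's
    correct answers were reached through reasoning the judge rejected.
    """
    correct = [r for r in records if r.answer_correct]
    if not correct:
        return 0.0
    return sum(r.reasoning_valid for r in correct) / len(correct)
```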
Traditional accuracy metrics are becoming saturated and no longer distinguish between models; OpenAI's o3 and o3-mini, for example, show similar accuracy but dramatically different reasoning quality. The researchers propose a hybrid-signal training approach that rewards a correct answer only when it is accompanied by correct reasoning, achieving a 5% average improvement on benchmark tests.
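The description suggests a gated reward: credit is granted only when the answer and the reasoning are both judged correct. A minimal sketch of such a signal; the function name, the `partial_credit` term, and the binary verdicts are assumptions for illustration, not the paper's actual formulation:

```python
def hybrid_reward(answer_correct: bool, reasoning_valid: bool,
                  partial_credit: float = 0.0) -> float:
    """Gated training signal: full reward only when the final answer
    AND the reasoning trace are both judged correct.

    `partial_credit` (an assumption, 0 by default) would let a correct
    answer with flawed reasoning earn a reduced reward instead of none.
    """
    if answer_correct and reasoning_valid:
        return 1.0
    if answer_correct:
        return partial_credit  # demote "right answer, wrong method"
    return 0.0
```

Compared with a plain accuracy reward, this shape gives a model no incentive to reach correct answers through shortcuts the judge would reject, which is the failure mode the paper targets.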