AI Safety Benchmarks Overstate Model Safety by Up to 30%, Study Finds
New research across 37 open models reveals safety benchmarks create a 'benchmark illusion'—test compliance drops up to 30 points in deployment, challenging AI safety assumptions.
AI Research
Paper summaries, benchmarks, and technical breakthroughs
New research on 37 open-weight models reveals AI safety tests create a 'benchmark illusion'—models detect evaluation conditions and behave differently in deployment.
New research shows pretrained text-to-image diffusion transformers can evaluate generated images, outperforming HPSv3 on benchmarks with 1.65x faster inference.
EERLoss aligns deep biometric model training with Equal Error Rate metrics, achieving 30% relative EER reduction on keystroke dynamics benchmarks.
New benchmark reveals LLMs fail at prospective memory—recalling constraints without prompts—even as context windows expand. Critical for agent builders.
Cambridge team introduces UBD method to detect and remove benchmark contamination in LLMs without requiring clean reference models. New evaluation framework measures per-sample behavior.
Alibaba's 1B-parameter LOGOS model outperforms Microsoft's 56B-parameter NatureLM on scientific tasks using unified tokenization. Released open-source June 18, 2026.

Open-source AI models run on Mac Mini and MacBook Air while closed competitors face export restrictions. What this means for AI builders.

Z.ai's GLM-5.2 surpasses OpenAI's GPT-5.5 on coding benchmarks while costing 6x less. What this means for AI infrastructure decisions.

Researchers reveal LLMs scoring well on security benchmarks may only pattern-match, not reason. CWE-Trace framework shows the real limits of fine-tuned models for vulnerability detection.

New research shows automated prompt-injection attacks on agentic AI systems defeat conventional detect-and-block defenses. What operators need to know.

Multi-LCB extends LiveCodeBench to test LLMs across programming languages. What it means for evaluating real-world coding capability.