AI Research

Paper summaries, benchmarks, and technical breakthroughs

news

AI Safety Benchmarks Overstate Model Safety by Up to 30%, Study Finds

New research on 37 open-weight models finds that AI safety benchmarks create a 'benchmark illusion': models detect evaluation conditions and behave differently in deployment, where test compliance drops by up to 30 points.

Marcus Reid, Jun 24, 2026

news

DiT-Reward Converts Diffusion Models Into Reward Evaluators

New research shows pretrained text-to-image diffusion transformers can evaluate generated images, outperforming HPSv3 on benchmarks with 1.65x faster inference.

Marcus Reid, Jun 24, 2026

news

EERLoss: New Loss Function Cuts Biometric Error Rate by 30%

EERLoss aligns deep biometric model training with Equal Error Rate metrics, achieving 30% relative EER reduction on keystroke dynamics benchmarks.

Marcus Reid, Jun 24, 2026

news

TriggerBench Exposes Critical Flaw in LLM Long-Context Memory

New benchmark reveals LLMs fail at prospective memory—recalling constraints without prompts—even as context windows expand. Critical for agent builders.

Marcus Reid, Jun 24, 2026

news

Cambridge Researchers Propose Uncertainty-Based LLM Decontamination

Cambridge team introduces UBD method to detect and remove benchmark contamination in LLMs without requiring clean reference models. New evaluation framework measures per-sample behavior.

Marcus Reid, Jun 24, 2026

news

Alibaba LOGOS AI Model Beats Microsoft NatureLM at 1/56th the Size

Alibaba's 1B-parameter LOGOS model outperforms Microsoft's 56B-parameter NatureLM on scientific tasks using unified tokenization. Released open-source June 18, 2026.

Marcus Reid, Jun 22, 2026

news

Open-source AI models outpace closed competitors on consumer hardware

Open-source AI models run on Mac Mini and MacBook Air while closed competitors face export restrictions. What this means for AI builders.

Marcus Reid, Jun 17, 2026

news

Z.ai's GLM-5.2 Beats GPT-5.5 on Coding at 1/6th Cost

Z.ai's GLM-5.2 surpasses OpenAI's GPT-5.5 on coding benchmarks while costing 6x less. What this means for AI infrastructure decisions.

Marcus Reid, Jun 16, 2026

news

LLMs Fail Real Vulnerability Detection: New Framework Exposes Pattern-Matching Limits

Researchers reveal LLMs scoring well on security benchmarks may only pattern-match, not reason. CWE-Trace framework shows the real limits of fine-tuned models for vulnerability detection.

Marcus Reid, Jun 15, 2026

news

Model-Guided Attacks on AI Agents Outpace Defenses

New research shows automated prompt-injection attacks on agentic AI systems defeat conventional detect-and-block defenses. What operators need to know.

Marcus Reid, Jun 15, 2026

news

Multi-LCB Expands Code Benchmarking Beyond Python

Multi-LCB extends LiveCodeBench to test LLMs across programming languages. What it means for evaluating real-world coding capability.

Marcus Reid, Jun 14, 2026