No Multimodal LLM Is Order-Invariant: 18-Model Audit Reveals 24-50% Flip Rates
New audit of 18 multimodal LLMs finds none are order-invariant. Flip rates hit 24-50%. Prompt-level fixes don't transfer across modalities. Operators take note.
What Happened
On June 24, 2026, researchers Akshay Paruchuri, Sanmi Koyejo, and Ehsan Adeli published a paper on arXiv titled "Same Evidence, Different Answer: Auditing Order Sensitivity in Multimodal Large Language Models." The paper introduces Facet-Probe, a structured audit framework that tests whether multimodal LLMs (MLLMs) produce consistent answers when the ordering of inputs is shuffled — a property that should be irrelevant to the correct answer but, as the study shows, is not.
Facet-Probe audits five distinct ordering facets: option ordering (e.g., multiple-choice answer positions), evidence-chunk ordering (as in RAG pipelines), document-rank ordering, image-set ordering, and mixed-modality ordering (interleaving text and images differently). The researchers evaluated 18 frontier and open-weight MLLMs and found that none exhibited order invariance. Per-facet panel-mean flip rates spanned 24–50%.
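To make the idea of "shuffling the ordering" concrete, here is a minimal sketch of how reordered variants of one input list (options, evidence chunks, or image references) could be generated reproducibly. It is not taken from Facet-Probe; the function name `sample_orderings` and its parameters are hypothetical, and the paper's own sampling procedure may differ.

```python
import math
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_orderings(items: Sequence[T], k: int, seed: int = 0) -> List[Tuple[T, ...]]:
    """Return the original ordering first, then distinct shuffled orderings.

    At most k orderings are returned (and never more than len(items)!).
    Permutations are tracked by position, so duplicate items do not collapse.
    """
    rng = random.Random(seed)  # seeded so an audit run can be repeated exactly
    identity = tuple(range(len(items)))
    perms = [identity]
    seen = {identity}
    target = min(k, math.factorial(len(items)))
    while len(perms) < target:
        candidate = list(identity)
        rng.shuffle(candidate)
        key = tuple(candidate)
        if key not in seen:  # reject repeats so every ordering is distinct
            seen.add(key)
            perms.append(key)
    return [tuple(items[i] for i in perm) for perm in perms]


orderings = sample_orderings(["A", "B", "C"], k=4, seed=1)
for ordering in orderings:
    print(ordering)
```

Each returned tuple would be sent to the model as a separate request, with everything else in the prompt held fixed.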
Critically, the study includes a same-ordering control using Gemini at temperature 0 to establish a decoder-stochastic noise floor. The observed ordering flips exceed this floor in verified cells, confirming that the instability is not merely random sampling noise — it is a systematic sensitivity to input arrangement.
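The comparison against a noise floor can be sketched in a few lines. This assumes an item counts as "flipped" when its answers are not all identical across runs, which is one plausible definition and may not match the paper's exact metric. The answer data below is invented toy data for illustration only, not results from the study.

```python
from typing import Dict, List


def flip_rate(answers_by_item: Dict[str, List[str]]) -> float:
    """Fraction of items whose answers are not all identical across runs.

    Answers should be compared by content (e.g. the option text), not by
    position label, since labels move when options are reordered.
    """
    if not answers_by_item:
        raise ValueError("need at least one item")
    flipped = sum(1 for answers in answers_by_item.values() if len(set(answers)) > 1)
    return flipped / len(answers_by_item)


# Toy data: same items, inputs presented in different orderings.
reordered_runs = {
    "q1": ["cat", "cat", "dog"],
    "q2": ["7", "7", "7"],
    "q3": ["yes", "no", "yes"],
    "q4": ["B-text", "B-text", "B-text"],
}
# Toy data: same items, SAME ordering repeated (the noise-floor control).
repeated_runs = {
    "q1": ["cat", "cat", "cat"],
    "q2": ["7", "7", "7"],
    "q3": ["yes", "yes", "no"],
    "q4": ["B-text", "B-text", "B-text"],
}

ordering_rate = flip_rate(reordered_runs)  # 0.5
noise_floor = flip_rate(repeated_runs)     # 0.25
excess = ordering_rate - noise_floor       # 0.25
print(ordering_rate, noise_floor, excess)
```

Only the excess over the same-ordering control can be attributed to input arrangement; a raw flip rate on its own also includes whatever nondeterminism the serving stack introduces.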
The researchers also tested prompt-level mitigations on Gemini. The result: training-free prompt changes are modality-conditional — they may help in text-only reasoning but do not transfer to visual reasoning tasks. This suggests that prompt engineering alone is insufficient for achieving order robustness.
Why It Matters
Standard MLLM benchmarks score each test item in one canonical ordering. If a model gets the right answer in that specific arrangement, it passes. But production systems rarely feed inputs in benchmark-canonical order — RAG pipelines retrieve and rank chunks dynamically, agentic workflows assemble context programmatically, and user-facing applications receive inputs in unpredictable sequences.
This paper suggests that the gap between benchmark performance and real-world reliability is substantial. A model that scores 90% on a benchmark in the canonical ordering might still change its answer on a large share of those same items when inputs are reordered; the audit's per-facet panel-mean flip rates ranged from 24% to 50%. For operators, this means:
- Decision consistency is worse than benchmarks suggest. If your multimodal pipeline feeds evidence or options in varying orders, you're getting different answers a significant percentage of the time.
- Temperature 0 doesn't save you. In the Gemini control at temperature 0, ordering flips still exceeded the decoder-stochastic noise floor in verified cells, so deterministic decoding settings do not eliminate the problem.
- Prompt fixes are modality-conditional. A prompt strategy that stabilizes text reasoning may not carry over to visual reasoning, so a one-size-fits-all prompt mitigation is unlikely to work.
The researchers propose cross-ordering flip rate as a standard reporting axis for MLLMs — a metric that should sit alongside accuracy, latency, and cost in any production evaluation suite.
Who Is Affected
AI startups building multimodal RAG, document analysis, or agentic systems are the most directly affected. Their products' reliability may be worse than benchmark scores imply, and the inconsistency can surface in customer-facing outputs.
Enterprise teams evaluating MLLMs for deployment need to expand internal evaluation beyond accuracy. Cross-ordering flip rate should become a standard axis in vendor selection and model comparison.
Open-source model developers should treat order invariance as a training-time and architectural concern, not a post-hoc evaluation footnote. The paper explicitly calls for future work on training-time and architectural approaches, given that prompt-level mitigation is insufficient.
Strategic Implications
For AI startup founders: If your product depends on MLLM outputs for decisions such as document classification, multimodal QA, or evidence-based reasoning, add cross-ordering consistency tests to your eval pipeline immediately. With per-facet panel-mean flip rates of 24–50%, your product's reliability may be materially worse than benchmark scores suggest. Consider input canonicalization (sorting evidence, fixing option order) as a first-line defense, and majority voting across orderings for high-stakes calls.
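Input canonicalization, as mentioned above, could look roughly like the sketch below. The helper names `canonical_order` and `build_options` are hypothetical, and sorting by content hash is just one possible canonical order. It makes the same set of inputs always arrive in the same arrangement, but it does not make that arrangement the "best" one, and it discards retrieval rank, which some pipelines may want to keep.

```python
import hashlib
from typing import Dict, List, Sequence, Tuple


def canonical_order(chunks: Sequence[str]) -> List[str]:
    """Order evidence chunks by a hash of their content.

    The same set of chunks always yields the same sequence, regardless of
    the order in which a retriever returned them.
    """
    return sorted(
        chunks,
        key=lambda c: hashlib.sha256(c.strip().encode("utf-8")).hexdigest(),
    )


def build_options(options: Sequence[str]) -> Tuple[str, Dict[str, str]]:
    """Fix option order by normalized text and assign labels A, B, C, ...

    Returns the prompt block and a label -> text map, so the model's letter
    can be mapped back to option content. Assumes at most 26 options.
    """
    ordered = sorted(options, key=lambda o: o.strip().lower())
    labels = {chr(ord("A") + i): text for i, text in enumerate(ordered)}
    block = "\n".join(f"{label}. {text}" for label, text in labels.items())
    return block, labels


block, labels = build_options(["Rome", "Paris", "berlin"])
print(block)
print(canonical_order(["chunk two", "chunk one"]) == canonical_order(["chunk one", "chunk two"]))
```

Storing the label map alongside the response matters: downstream code should record the chosen option's text, not its letter, so results stay comparable if the option set changes.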
For developers/operators building with AI APIs: Don't assume temperature 0 solves ordering sensitivity; the Gemini control indicates it doesn't. Implement deterministic input ordering where possible, and for critical paths, run 2–3 orderings and check for consensus. If outputs diverge, flag the result for human review rather than auto-acting on it.
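A consensus check of this kind might be sketched as follows. The function `consensus_answer` and the `ask_model` callable are hypothetical placeholders for whatever client code wraps your provider's API; the sketch uses simple rotations of the evidence list as its orderings, which is an assumption made for determinism rather than a recommendation from the paper.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def consensus_answer(
    question: str,
    evidence: Sequence[str],
    ask_model: Callable[[str, List[str]], str],
    n_orderings: int = 3,
) -> Tuple[Optional[str], List[str]]:
    """Query the model under several evidence orderings.

    Returns (answer, all_answers). The answer is None when the orderings
    disagree, which the caller should treat as "route to human review".
    """
    if n_orderings < 1:
        raise ValueError("n_orderings must be at least 1")
    items = list(evidence)
    answers = []
    for shift in range(n_orderings):
        # Rotation by `shift` positions; shift 0 is the original ordering.
        ordering = items[shift:] + items[:shift] if items else []
        answers.append(ask_model(question, ordering))
    if len(set(answers)) == 1:
        return answers[0], answers
    return None, answers


# Stub models standing in for a real API call (illustration only).
def order_insensitive(question: str, ordering: List[str]) -> str:
    return "yes"


def order_sensitive(question: str, ordering: List[str]) -> str:
    return ordering[0]  # echoes whichever chunk happens to come first


print(consensus_answer("q", ["a", "b", "c"], order_insensitive))
print(consensus_answer("q", ["a", "b", "c"], order_sensitive))
```

Requiring unanimity is the strict reading of "check for consensus"; a majority threshold would reduce the number of escalations at the cost of acting on answers the model itself does not give consistently.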
For non-technical business owners evaluating AI tools: Ask vendors directly: "What is your cross-ordering flip rate?" If they can't answer, run a simple test — submit the same query with reordered inputs and compare. High variance is a red flag for any automated decision-making workflow.
What to Watch Next
Watch for whether model providers (OpenAI, Google, Anthropic) adopt cross-ordering flip rate as a reported metric in their model cards. The paper arrives alongside a recent cluster of work on evaluation methodology gaps, including papers on evaluation awareness, bias frameworks, and safety judge comparisons. Also monitor for training-time approaches to order invariance, which this paper explicitly motivates but does not solve.
Frequently Asked Questions
Q: What is order sensitivity in multimodal LLMs?
A: Order sensitivity is when a model gives different answers to the same question depending on the order in which inputs (options, evidence chunks, images) are presented. It should not happen, because the correct answer doesn't change with input arrangement, but the Facet-Probe audit found it in all 18 tested models, with per-facet panel-mean flip rates of 24–50%.
Q: Can prompt engineering fix order sensitivity?
A: Not reliably. The study found that training-free prompt changes are modality-conditional — they may help for text reasoning but do not transfer to visual reasoning. The researchers conclude that prompt-level mitigation alone is unlikely to provide general order robustness, and architectural or training-time solutions are needed.
Q: Does setting temperature to 0 eliminate the problem?
A: No. A Gemini control at temperature 0 showed that observed ordering flips still exceed the decoder-stochastic noise floor, meaning the instability is systematic, not random sampling noise.