MMLU
1 mention from 1 source
Massive Multitask Language Understanding - a benchmark for evaluating AI models across diverse academic subjects and knowledge areas.
"even something simpler like MMLU, which is a multiple-choice benchmark. If you just change the format slightly, like, I don't know, if you use a dot instead of a parenthesis or something like that, the model accuracy will vastly differ."
From: State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI | Lex Fridman Podcast #490 • 1:46:50 • Jan 2026
Attribution: Sebastian uses MMLU as an example of how format sensitivity affects model evaluation
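The format sensitivity described above can be illustrated with a minimal sketch: the same multiple-choice question rendered with two option-marker styles (parenthesis vs. dot). The question, options, and helper function here are hypothetical examples, not taken from the MMLU benchmark itself.

```python
def format_question(question, options, style="paren"):
    """Render a multiple-choice prompt; `style` picks the option marker."""
    lines = [question]
    for letter, text in zip("ABCD", options):
        if style == "paren":
            lines.append(f"({letter}) {text}")   # e.g. "(A) Berlin"
        elif style == "dot":
            lines.append(f"{letter}. {text}")    # e.g. "A. Berlin"
        else:
            raise ValueError(f"unknown style: {style}")
    lines.append("Answer:")
    return "\n".join(lines)

# Two surface variants of the same question; per the quote, a model's
# measured accuracy can differ substantially between these formats.
q = "What is the capital of France?"
opts = ["Berlin", "Paris", "Madrid", "Rome"]
print(format_question(q, opts, style="paren"))
print()
print(format_question(q, opts, style="dot"))
```

Everything except the option markers is identical between the two prompts, which is what makes the resulting accuracy gap a measurement artifact rather than a capability difference.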