Carsten Felix Draschner, PhD

Fast New ELO Over MixEval. But Looking into Code, I Got Doubts!

Have you seen that MixEval Hard has two interesting but somewhat critical aspects?


TLDR

Issues with MixEval

  1. For free-form answers, it has an LLM judge (e.g. GPT-3.5 Turbo) emit values like "[[0.9]]" and then parses that string as a reliable correctness score.
  2. The MixEval eval data contains several duplicates due to its sampling approach.

My Take

I like MixEval's concept of an open-source LLM benchmark with good overlap to Arena Elo, but I am unsure whether I trust an LLM judge that tries to produce continuous scores for free-form answers, or a benchmark that contains duplicated samples.

Extract of My Hands-On Criteria

Credit

Thanks to Philipp Schmid for your post(s).

My Questions

#generativeai #artificialintelligence #llm #machinelearning #benchmark