“We've beaten GPT-4!” ... is a sentence that is starting to annoy me.
About Mistrust in LLM Evaluation. Benchmark contamination in LLMs? How to Evaluate GenAI?!
𝗧𝗟;𝗗𝗥 ⏱️
- The news is flooded with LLMs claimed to be “better” than GPT-4
- LLM Evaluation is Difficult
- Benchmark Contamination is a serious issue
- Build your own use case specific benchmarks
Why do we need LLM benchmarks? 📊
- Many use cases can be solved by GenAI, more specifically by LLMs
- Many LLMs are available as AIaaS or as open-source models
- At some point, you need to select a specific model for your dedicated use case
- The news is flooded with a multitude of models that claim to be better than some reference LLM like OpenAI's GPT-4
Why is LLM Evaluation Difficult? 👩🏽🔬
- When we speak about leaderboards and benchmarks, we look into specific types of tasks.
- Those tasks need to be “easily measurable”, as LLMs might generate arbitrary text.
- e.g., MMLU is simply multiple choice and checks whether the first generated character matches the correct answer letter (A-D); see the sketch after this list.
- Other benchmarks use, e.g., another LLM as a judge, which is both expensive and fuzzy.
- It is almost impossible to verify whether an LLM learned the benchmark data by heart during its pretraining/fine-tuning stage just to climb to the top of the leaderboard.
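For illustration, here is a minimal sketch of how MMLU-style scoring typically works, assuming a hypothetical `generate(prompt)` callable that returns the model's raw text output; the question IDs and gold answers are made up.

```python
# Minimal sketch of MMLU-style scoring (assumption: a `generate(prompt)`
# callable returning the model's raw text; questions and answers are made up).
GOLD_ANSWERS = {"q1": "B", "q2": "D"}  # question id -> correct answer letter

def score_mmlu_style(prompts: dict, generate) -> float:
    correct = 0
    for qid, prompt in prompts.items():
        output = generate(prompt).strip()
        # Only the first generated character counts -- "B) because ..." and "B"
        # score exactly the same; any further reasoning is ignored.
        if output[:1].upper() == GOLD_ANSWERS[qid]:
            correct += 1
    return correct / len(prompts)
```

Everything beyond that first character never enters the score, which is exactly why such numbers say little about how a model behaves on free-form, use-case-specific tasks.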
My Tasks 🔍
- Leaderboards are only a starting point for model selection.
- GenAI approach selection is a multidimensional problem.
- Develop a use-case-specific evaluation framework, e.g., does the generated code run, pass the unit tests, and is it secure and fast? (see the sketch after this list)
- For most of my use cases, I barely care whether the model can solve English multiple-choice questions that are scored by simply checking if the first generated character is an A, B, C, or D, like in MMLU.
- Even simple throughput benchmarks seem to have their issues, see: https://rb.gy/5l8qqp
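As an example of what a use-case-specific check for code generation can look like: a minimal sketch that runs a unit-test file against the generated code in a subprocess with a timeout. The file path, the pytest dependency, and the timeout value are assumptions, not a fixed framework; security and latency checks would be separate steps.

```python
# Minimal sketch of a use-case-specific check for generated code:
# does it run and pass the unit tests within a time budget?
# Assumptions: the generated module and a matching pytest file are on disk.
import subprocess

def passes_unit_tests(test_file: str, timeout_s: int = 30) -> bool:
    """Return True if the tests for the generated code pass in time."""
    try:
        result = subprocess.run(
            ["python", "-m", "pytest", test_file, "-q"],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hanging solution counts as a failure
    return result.returncode == 0
```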
Extract of my hands-on criteria 👨🏼💻
- Under which license is the model available and does the license allow my intended usage?
- What do we know about the model's training, especially regarding multi-language support, instruction tuning, alignment, and context length?
- What hardware requirements/costs do we face, and what throughput can we provide? e.g., 13B vs. 8x7B vs. 70B ... (rough estimate below)
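To make that last point concrete, a rough back-of-envelope calculation of the weight memory alone (ignoring KV cache, activations, and runtime overhead); the parameter counts and precisions are illustrative assumptions, e.g. treating an 8x7B mixture-of-experts model as roughly 47B total parameters.

```python
# Back-of-envelope GPU memory for model weights only
# (ignores KV cache, activations, and framework overhead).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, dtype: str = "fp16") -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1024**3

for name, params in [("13B", 13), ("8x7B (~47B total)", 47), ("70B", 70)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB fp16, "
          f"~{weight_memory_gb(params, 'int4'):.0f} GB int4")
```

For these rough numbers, fp16 weights alone land at about 24 / 88 / 130 GB, which directly determines whether a single GPU, a multi-GPU node, or quantization is needed.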
Credit ❤️
- To Hugging Face and other platforms for providing LLM Leaderboards and easily accessible models
- To OpenAI, whose GPT-4 has become the established reference to beat
- To all benchmark creators and researchers supporting transparent and reliable GenAI evaluation
My Questions?
- What are the criteria you look at?
- What are the best benchmarks for you & why do you trust those?
- Do you create your own benchmarks, as we do at Comma Soft AG?
#generativeai #artificialintelligence #llm #machinelearning #benchmark