type: Post
Created date: Apr 28, 2024 05:39 AM
category: LLM
tags: Machine Learning, Artificial Intelligence
status: Published
Language: Chinese
Chain-of-thought Hub is a work in progress that aims to become a unified platform for evaluating the reasoning capabilities of language models.
A list of complex reasoning tasks:
- Math (GSM8K)
    - A classic benchmark for measuring chain-of-thought mathematical reasoning performance. It is not the only metric, but a good way to read it is "how well does the model do math while keeping its other general capabilities", which is also very hard. A minimal evaluation sketch follows this list.
- Science (MATH)
- Symbolic (BBH)
- Knowledge (MMLU)
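To make the GSM8K entry above concrete, here is a minimal sketch of a chain-of-thought evaluation loop: build a few-shot prompt, let the model reason step by step, then compare the last number it produces against the reference answer. It assumes the dataset is loaded through the Hugging Face `datasets` library, and `generate(prompt)` is a hypothetical stand-in for whatever model API is being benchmarked; this illustrates the general recipe, not Chain-of-thought Hub's actual harness.

```python
# Minimal sketch of a GSM8K chain-of-thought evaluation loop.
# Assumption: `generate(prompt)` is a hypothetical model call returning text.
import re
from datasets import load_dataset

FEW_SHOT = (
    "Q: Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?\n"
    "A: In May she sold 48 / 2 = 24 clips. In total she sold 48 + 24 = 72 clips. "
    "The answer is 72.\n\n"
)

def extract_final_number(text):
    """Take the last number in the model's reasoning as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def gsm8k_accuracy(generate, n_examples=100):
    ds = load_dataset("gsm8k", "main", split="test").select(range(n_examples))
    correct = 0
    for ex in ds:
        prompt = FEW_SHOT + f"Q: {ex['question']}\nA:"
        # A real harness would also truncate the completion at the next "Q:".
        prediction = extract_final_number(generate(prompt))
        # GSM8K reference answers end with "#### <number>".
        reference = ex["answer"].split("####")[-1].strip().replace(",", "")
        correct += prediction == reference
    return correct / n_examples
```

The prompt format (few-shot Q/A pairs ending with "The answer is ...") follows the common chain-of-thought setup; the only requirement is that the final number can be parsed out consistently.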
The graph shows that:
- GPT-4 clearly outperforms all other models on GSM8K and MMLU.
- Claude is the only model family comparable to the GPT series.
- Smaller models such as FlanT5 11B and LLaMA 7B clearly lag behind on the leaderboard, which suggests that complex reasoning may be a capability only of large models.
More to come
- MMLU (Massive Multitask Language Understanding): A large-scale language benchmark that tests model performance across a wide range of subjects and types of knowledge, from professional domains to high school-level topics.
- HellaSwag: A benchmark for testing a model's ability to predict the ending of a story or scenario. It challenges the AI's common sense reasoning and understanding of everyday activities.
- ANLI (Adversarial NLI): A stress-test benchmark designed to evaluate the robustness of models against adversarial examples in natural language inference tasks.
- GSM-8K: A benchmark focused on grade school math problems, testing the AI's ability to understand and solve mathematical questions stated in natural language.
- MedQA: Involves medical question answering, where the model is tested on its understanding of medical concepts and terminology, often requiring reasoning over multiple pieces of information.
- AGIEval: A benchmark built from questions in human standardized exams (such as the SAT, LSAT, and the Chinese college entrance exam), evaluating general reasoning and problem-solving ability.
- TriviaQA: A question-answering benchmark of trivia questions paired with supporting evidence documents, testing the model's ability to recall or extract the correct answers.
- Arc-C (ARC Challenge) and Arc-E (ARC Easy): These benchmarks from the AI2 Reasoning Challenge test the model's ability to answer more difficult (Challenge) and easier (Easy) grade-school level multiple-choice science questions.
- PIQA (Physical Interaction QA): Tests a model's physical commonsense, assessing its understanding of the physical world through questions about everyday physical interactions.
- SociQA (Social IQa): Focuses on social commonsense, evaluating how well a model understands social norms, intentions, and the consequences of everyday human interactions.
- BigBench-Hard (BBH): A curated subset of the broader BIG-bench (Beyond the Imitation Game benchmark) made up of tasks on which earlier language models failed to beat the average human rater, requiring advanced multi-step reasoning.
- WinoGrande: A dataset designed to test commonsense reasoning, specifically targeting the resolution of pronoun ambiguities in Winograd schema-style sentences.
- OpenBookQA: Tests a model's ability to answer elementary-science questions by combining a small "open book" of core facts with broader commonsense knowledge.
- BoolQ: A question-answering dataset where models must answer naturally occurring yes/no questions based on a supporting passage.
- CommonSenseQA: A test of how well AI systems can answer questions that require commonsense knowledge to resolve.
- TruthfulQA: Tests a model's ability to generate truthful and non-misleading answers, focusing on honesty and factual accuracy.
- HumanEval: Commonly used to test code generation models, assessing their ability to write functional programming code based on a given prompt.
- MBPP (Mostly Basic Python Problems): Assesses a model's ability to solve short, entry-level programming problems, usually in Python; like HumanEval, results are typically reported as pass@k (see the sketch after this list).
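For the two code-generation benchmarks above (HumanEval and MBPP), the standard metric is pass@k: the probability that at least one of k sampled completions passes the unit tests. Below is a sketch of the unbiased, numerically stable estimator popularized by the HumanEval paper, pass@k = 1 - C(n-c, k) / C(n, k); the sample counts in the usage example are purely illustrative.

```python
# Unbiased pass@k estimator for code benchmarks such as HumanEval / MBPP:
# pass@k = 1 - C(n - c, k) / C(n, k), where n = samples drawn per problem
# and c = samples that pass the unit tests. Computed as a running product
# to avoid evaluating huge binomial coefficients directly.
import numpy as np

def pass_at_k(n, c, k):
    """Probability that at least one of k samples (out of n, with c correct) passes."""
    if n - c < k:
        return 1.0  # fewer failing samples than k, so any draw of k must include a pass
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers: 200 samples per problem, 37 of which pass the tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185, i.e. simply c / n
print(round(pass_at_k(n=200, c=37, k=10), 3))  # higher: any of 10 samples may pass
```

Per-problem pass@k values are then averaged over the benchmark to give the reported score.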
Reference
- Author: Jason Siu
- URL: https://jason-siu.com/article/64067aa0-1d3c-4734-8e05-8f1f5d25f0ff
- Copyright: Unless otherwise stated, all articles on this blog are licensed under the CC BY-NC-SA agreement. Please credit the source when reposting!