type: Post
Created date: Apr 28, 2024 05:39 AM
category: LLM
tags: Machine Learning, Artificial Intelligence
status: Published
Language: Chinese
Chain-of-thought Hub is a work in progress that aims to become a unified platform for evaluating the reasoning capabilities of language models.
A list of complex reasoning tasks:
- Math (GSM8K)
    - A classic benchmark for measuring chain-of-thought mathematical reasoning performance. It is not the only metric, but a good way to read it is "how well does the model do math while keeping its other general capabilities", which is also very hard. A minimal evaluation sketch follows this list.
- Science (MATH)
- Symbolic (BBH)
- Knowledge (MMLU)
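To make the GSM8K entry above concrete, here is a minimal sketch of a chain-of-thought evaluation loop: build a few-shot prompt, let the model reason step by step, then compare the last number it produces against the reference answer. It assumes the dataset is loaded through the Hugging Face `datasets` library, and `generate(prompt)` is a hypothetical stand-in for whatever model API is being benchmarked; this illustrates the general recipe, not Chain-of-thought Hub's actual harness.

```python
# Minimal sketch of a GSM8K chain-of-thought evaluation loop.
# Assumption: `generate(prompt)` is a hypothetical model call returning text.
import re
from datasets import load_dataset

FEW_SHOT = (
    "Q: Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?\n"
    "A: In May she sold 48 / 2 = 24 clips. In total she sold 48 + 24 = 72 clips. "
    "The answer is 72.\n\n"
)

def extract_final_number(text):
    """Take the last number in the model's reasoning as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def gsm8k_accuracy(generate, n_examples=100):
    ds = load_dataset("gsm8k", "main", split="test").select(range(n_examples))
    correct = 0
    for ex in ds:
        prompt = FEW_SHOT + f"Q: {ex['question']}\nA:"
        # A real harness would also truncate the completion at the next "Q:".
        prediction = extract_final_number(generate(prompt))
        # GSM8K reference answers end with "#### <number>".
        reference = ex["answer"].split("####")[-1].strip().replace(",", "")
        correct += prediction == reference
    return correct / n_examples
```

The prompt format (few-shot Q/A pairs ending with "The answer is ...") follows the common chain-of-thought setup; the only requirement is that the final number can be parsed out consistently.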
The graph shows that:
- GPT-4 clearly outperforms all other models on GSM8K and MMLU.
- Claude is the only model family comparable to the GPT series.
- Smaller models such as FlanT5 11B and LLaMA 7B clearly lag behind on the leaderboard, which suggests that complex reasoning may be a capability only of large models.
More to come
- MMLU (Massive Multitask Language Understanding): A large-scale language benchmark that tests model performance across a wide range of subjects and types of knowledge, from professional domains to high school-level topics.
- HellaSwag: A benchmark for testing a model's ability to predict the ending of a story or scenario. It challenges the AI's common sense reasoning and understanding of everyday activities.
- ANLI (Adversarial NLI): A stress-test benchmark designed to evaluate the robustness of models against adversarial examples in natural language inference tasks.
- GSM-8K: A benchmark focused on grade school math problems, testing the AI's ability to understand and solve mathematical questions stated in natural language.
- MedQA: Involves medical question answering, where the model is tested on its understanding of medical concepts and terminology, often requiring reasoning over multiple pieces of information.
- AGIEval: A benchmark built from questions in human standardized exams (such as the SAT, LSAT, and the Chinese college entrance exam), evaluating general reasoning and problem-solving ability.
- TriviaQA: A question-answering benchmark of trivia questions paired with supporting evidence documents, testing the model's ability to recall or extract the correct answers.
- Arc-C (ARC Challenge) and Arc-E (ARC Easy): These benchmarks from the AI2 Reasoning Challenge test the model's ability to answer more difficult (Challenge) and easier (Easy) grade-school level multiple-choice science questions.
- PIQA (Physical Interaction QA): Tests a model's physical commonsense, assessing its understanding of the physical world through questions about everyday physical interactions.
- SociQA (Social IQa): Focuses on social commonsense, evaluating how well a model understands social norms, intentions, and the consequences of everyday human interactions.
- BigBench-Hard (BBH): A curated subset of the broader BIG-bench (Beyond the Imitation Game benchmark) made up of tasks on which earlier language models failed to beat the average human rater, requiring advanced multi-step reasoning.
- WinoGrande: A dataset designed to test commonsense reasoning, specifically targeting the resolution of pronoun ambiguities in Winograd schema-style sentences.
- OpenBookQA: Tests a model's ability to answer elementary-science questions by combining a small "open book" of core facts with broader commonsense knowledge.
- BoolQ: A question-answering dataset where models must answer naturally occurring yes/no questions based on a supporting passage.
- CommonSenseQA: A test of how well AI systems can answer questions that require commonsense knowledge to resolve.
- TruthfulQA: Tests a model's ability to generate truthful and non-misleading answers, focusing on honesty and factual accuracy.
- HumanEval: Commonly used to test code generation models, assessing their ability to write functional programming code based on a given prompt.
- MBPP (Mostly Basic Python Problems): Assesses a model's ability to solve short, entry-level programming problems, usually in Python; like HumanEval, results are typically reported as pass@k (see the sketch after this list).
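For the two code-generation benchmarks above (HumanEval and MBPP), the standard metric is pass@k: the probability that at least one of k sampled completions passes the unit tests. Below is a sketch of the unbiased, numerically stable estimator popularized by the HumanEval paper, pass@k = 1 - C(n-c, k) / C(n, k); the sample counts in the usage example are purely illustrative.

```python
# Unbiased pass@k estimator for code benchmarks such as HumanEval / MBPP:
# pass@k = 1 - C(n - c, k) / C(n, k), where n = samples drawn per problem
# and c = samples that pass the unit tests. Computed as a running product
# to avoid evaluating huge binomial coefficients directly.
import numpy as np

def pass_at_k(n, c, k):
    """Probability that at least one of k samples (out of n, with c correct) passes."""
    if n - c < k:
        return 1.0  # fewer failing samples than k, so any draw of k must include a pass
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative numbers: 200 samples per problem, 37 of which pass the tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185, i.e. simply c / n
print(round(pass_at_k(n=200, c=37, k=10), 3))  # higher: any of 10 samples may pass
```

Per-problem pass@k values are then averaged over the benchmark to give the reported score.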
Reference
- Author: Jason Siu
- URL: https://jason-siu.com/article/64067aa0-1d3c-4734-8e05-8f1f5d25f0ff
- Copyright: Unless otherwise stated, all articles on this blog are licensed under the CC BY-NC-SA agreement. Please credit the source when reposting!