type: Post
Created date: Apr 28, 2024 05:39 AM
category: LLM
tags: Machine Learning, Artificial Intelligence
status: Published
Language: Chinese
 
Chain-of-thought Hub is an ongoing project that aims to become a unified platform for evaluating the reasoning capabilities of language models.
Its list of complex reasoning tasks:
  • Math (GSM8K)
    • This is the classic benchmark for measuring chain-of-thought math reasoning performance. It is not the only metric, but a good way to read it is "how well does the model do math while retaining its other general abilities", which is also very hard. (A minimal scoring sketch follows this list.)
  • Science (MATH)
  • Symbolic (BBH)
  • Knowledge (MMLU)
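To make the GSM8K item concrete, here is a minimal sketch of how a chain-of-thought evaluation of this kind is typically scored: a few-shot CoT prompt is prepended to each question, the final number is extracted from the model's reasoning, and accuracy is the fraction of exact matches. The `generate` function, the prompt text, and the answer-extraction regex are illustrative assumptions, not the Chain-of-thought Hub's actual code.
```python
# Minimal sketch of a GSM8K-style chain-of-thought evaluation.
# `generate` is a hypothetical stand-in for whatever LLM API you use;
# the prompt format and answer-extraction regex are assumptions, not
# the exact Chain-of-thought Hub implementation.
import re

COT_PROMPT = (
    "Q: Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?\n"
    "A: In April she sold 48 clips. In May she sold 48 / 2 = 24 clips. "
    "In total she sold 48 + 24 = 72 clips. The answer is 72.\n\n"
)

def extract_answer(text: str) -> str | None:
    """Take the last number after 'The answer is' as the model's final answer."""
    matches = re.findall(r"The answer is\s*(-?[\d,\.]+)", text)
    return matches[-1].replace(",", "").rstrip(".") if matches else None

def gsm8k_accuracy(problems, generate) -> float:
    """problems: list of {'question': str, 'answer': str (gold number as text)}."""
    correct = 0
    for p in problems:
        completion = generate(COT_PROMPT + f"Q: {p['question']}\nA:")
        pred = extract_answer(completion)
        correct += (pred is not None and pred == p["answer"])
    return correct / len(problems)
```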
[Figure: leaderboard comparing models on GSM8K, MMLU, and other reasoning benchmarks]
This graph shows:
  • GPT-4 clearly outperforms all other models on GSM8K and MMLU.
  • Claude is the only model family that is comparable to the GPT series.
  • Smaller models such as FlanT5 11B and LLaMA 7B fall well behind on the leaderboard, which suggests that complex reasoning may be a capability that only large models have.
 
More to come
  1. MMLU (Massive Multitask Language Understanding): A large-scale benchmark that tests model performance across a wide range of subjects and types of knowledge, from professional domains to high-school-level topics.
  2. HellaSwag: A benchmark that tests a model's ability to pick the most plausible continuation of a scenario, challenging its common-sense reasoning and understanding of everyday activities.
  3. ANLI (Adversarial NLI): A stress-test benchmark designed to evaluate the robustness of models against adversarially collected examples in natural language inference tasks.
  4. GSM-8K: A benchmark of grade-school math word problems, testing the model's ability to understand and solve mathematical questions stated in natural language.
  5. MedQA: Medical question answering, testing the model's understanding of medical concepts and terminology, often requiring reasoning over multiple pieces of information.
  6. AGIEval: Evaluates models on human-centric standardized exams, such as college entrance exams, law school admission tests, and math competitions.
  7. TriviaQA: A question-answering benchmark built from trivia questions and supporting evidence documents, testing factual recall and reading comprehension.
  8. Arc-C (ARC Challenge) and Arc-E (ARC Easy): Benchmarks from the AI2 Reasoning Challenge that test the model's ability to answer harder (Challenge) and easier (Easy) grade-school multiple-choice science questions.
  9. PIQA (Physical Interaction QA): Tests a model's physical common sense, assessing its understanding of the physical world through questions about everyday physical interactions.
  10. SociQA: Focuses on social common sense, evaluating how well a model understands social norms and human interactions.
  11. BigBench-Hard: A curated subset of especially challenging tasks from BIG-bench (Beyond the Imitation Game benchmark), designed to test advanced reasoning.
  12. WinoGrande: A dataset designed to test common-sense reasoning, specifically targeting the resolution of pronoun ambiguities in Winograd-schema-style sentences.
  13. OpenBookQA: Tests a model's ability to answer open-ended questions using both reasoning and knowledge of the kind found in a typical school textbook.
  14. BoolQ: A question-answering dataset where models must answer naturally occurring yes/no questions based on a passage.
  15. CommonSenseQA: Tests how well a model can answer questions that require commonsense knowledge to resolve.
  16. TruthfulQA: Tests a model's ability to generate truthful, non-misleading answers, focusing on honesty and factual accuracy.
  17. HumanEval: Commonly used to test code-generation models, assessing their ability to write functionally correct code from a given prompt (a scoring sketch follows this list).
  18. MBPP (Mostly Basic Python Problems): Assesses a model's ability to solve basic programming problems, usually in Python.
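Unlike the text benchmarks above, HumanEval and MBPP score functional correctness: the generated code is run against the task's unit tests. Here is a minimal sketch of that style of scoring, assuming a HumanEval-like task layout (prompt, completion, test code, entry point); the field names, the example task, and the bare `exec` call are illustrative assumptions, and a real harness would run candidates in an isolated, time-limited process.
```python
# Minimal sketch of how a HumanEval-style benchmark judges a completion:
# the model's code is executed against the task's unit tests, and the
# sample counts as correct only if the tests pass. The bare `exec` here
# is NOT sandboxed; it is for illustration only.
def passes_unit_tests(prompt: str, completion: str, test_code: str,
                      entry_point: str) -> bool:
    """Return True if prompt+completion defines `entry_point` and passes `test_code`."""
    program = prompt + completion + "\n" + test_code + f"\ncheck({entry_point})\n"
    namespace: dict = {}
    try:
        exec(program, namespace)  # WARNING: no sandboxing or timeout in this sketch
        return True
    except Exception:
        return False

# Illustrative HumanEval-style task (not an official problem):
prompt = "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n"
completion = "    return a + b\n"
test_code = "def check(fn):\n    assert fn(2, 3) == 5\n    assert fn(-1, 1) == 0\n"
print(passes_unit_tests(prompt, completion, test_code, "add"))  # True
```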