[{"content":"在过去的一年里，AI Agent 的演进出现了两个非常重要的趋势：\n智能体正在变得更通用（Generalist）：可以承担越来越多类型的任务； 智能体的任务时长变得更长（Long-horizon）：能够连续执行几十甚至上百个步骤的复杂任务。 根据 METR 的基准测试，AI 能自动完成的人类任务等效时长大约 每 7 个月翻倍。这意味着智能体从“短对话助手”，发展为“能够连续运行数百甚至上千步的自主系统”。\n与此同时，通用智能体数量激增，如 Manus 和 Claude Code 等系统正在承担远不止“写代码”或“回答问题”的任务。它们能够组织研究流程、规划任务、调用大量工具，并产出复杂成果。但随着任务时长与任务复杂度提升，工程上的挑战也随之而来：\nManus：典型任务需要调用约 50 次工具 Anthropic Production Agents：实际生产系统常常会进行 数百轮对话与推理 这些长周期、多步骤、工具密集型的系统，被称为 Agentic System。尽管不同项目的具体实现差别较大，但它们普遍遵循以下四个核心原则：\n使用 Planning 保证任务方向正确 在超长任务中，如果缺乏规划，模型非常容易偏航。因此 Agentic System 普遍采用明确的任务计划来做“航向控制”。\n常见做法包括：\nManus：使用 todo.md 保存任务清单，并在执行过程中不断读写更新； Claude Code：要求用户先批准计划，再执行工具与操作； Gemini Deep Research：在执行前强制生成计划、并请求用户确认； Anthropic Multi-Agent Researcher：将 research plan 写入文件系统，在关键流程中重新读取，确保最终报告“遵循原计划”； 使用 Filesystem 进行 Offload Context 随着工具调用次数、搜索结果、观察内容不断增多，把所有内容保存在消息历史里会快速耗尽上下文窗口。\n因此，可以采用“外部化记忆”机制：\n将庞大中间结果（如搜索原始结果）保存到磁盘 在上下文中只放简短摘要，节省 Tokens 在需要时再从文件读取完整内容 长期记忆可以独立保存，不随对话窗口消失 Example:\nAnthropic Multi-Agent Researcher：将研究计划写到文件里 → 调研完成后再读回来 → 保证报告结构与原计划一致； Manus：将 todo.md 持久化，多次更新、反复读取 文件系统就是 Agentic System 的 外部化记忆（Externalized Memory）。\n使用Sub-agents 隔离上下文与任务 Agentic System 另一个关键能力是：任务拆分与子智能体协同。原因很简单：每个子智能体都有独立的上下文窗口。\n主智能体无需承载所有推理上下文 子智能体可以专注于独立的子任务 可以并行化任务（如资料收集、数据比对） Example:\nClaude Code：代码分析和修复任务由不同子 agent 完成 Anthropic Multi-Agent Researcher：子 agent 专门负责检索、整理、验证 Manus：子 agent 负责某些并行可拆解的子任务 Open Deep Research：子智能体只负责“收集信息”，最终报告由主 agent 统一撰写，避免结构冲突 Cognition / Walden Yan 提到过“多智能体隐式决策冲突的问题”，如果多个子智能体分别撰写部分内容，那么最终合并会出现风格不统一、推理冲突、结构不一致。\n因此子智能体应该只做可并行、可聚合、无冲突的任务，例如：信息收集、文献爬取、数据提取、独立判断或评分，不建议独立撰写报告章节（容易冲突）。\n使用精心设计的 Prompt Engineering Agentic System 的 System Prompt 通常非常巨大、精细、迭代多次。\nExample:\nClaude Code几十次迭代 Manus长而复杂 Open-Deep-Research结构明确、角色分工清晰 尽管 Agentic System 在架构上看似只是：“LLM + 工具调用 + 循环”，但背后依赖的 Prompt 复杂度非常高，用于确保：\n工具调用顺序正确 输出格式一致 避免幻觉 长周期行为稳定 Sub-agent 的工作可被主 agent 正确整合 因此 Prompt 工程是 Agentic System 的“隐形关键工程”。\nSummery: Agentic System 的方法论矩阵\nAgent Filesystem Planning Sub-Agents Prompting Manus ✔ 
使用文件系统 ✔ todo.md \u0026amp; 重读计划 ✔ 多个并行子 agent ✔ 大量 prompt 迭代 Anthropic Researcher ✔ 保存计划与资料 ✔ 写计划并重复读取 ✔ delegated researcher – Open Deep Research ✔ 使用 LangGraph State ✔ think tool ✔ 信息收集子 agent ✔ 自定义 System Prompt Claude Code ✔ 使用文件系统存代码与补丁 ✔ plan mode ✔ 任务拆分 ✔ 强大的 prompt 体系 这些系统尽管背景不同，却共享了深度智能体的“四大原则”：\nPlanning：保持方向、避免偏航 FileSystem：外部化记忆、节省上下文窗口 Sub-Agent：上下文隔离、任务解耦 Prompt Engineering：确保整个系统稳定运行 Planning 随着智能体承担的任务越来越复杂，例如多文件代码修改、多轮研究任务、产品设计等，仅依靠“下一步该做什么”的即时推理已经完全不够。Agentic System 都引入了 Planning 工具：\nClaude Code: 使用 TodoWrite 工具生成带审批的任务计划，允许用户确认再执行。 Manus: 自动生成并持续更新 todo.md 文件，贯穿整个任务流程。 智能体需要一个结构化、可追踪、可更新的任务计划（TODO 列表），并将其保存到“状态”（State）中。\n构建 State Planning 需要在State中维护以下信息:\nmessages: 对话历史\ntodos: 任务规划和进度追踪\nfiles: 上下文信息存储\n因此，我们扩展默认的 AgentState，创建 DeepAgentState：\n定义 TODO 数据结构 from typing import Literal from typing_extensions import TypedDict class Todo(TypedDict): \u0026#34;\u0026#34;\u0026#34;结构化的任务项,用于追踪复杂工作流的进度 属性: content: 简短、具体的任务描述 status: 当前状态 - pending(待处理)、in_progress(进行中)或 completed(已完成) \u0026#34;\u0026#34;\u0026#34; content: str status: Literal[\u0026#34;pending\u0026#34;, \u0026#34;in_progress\u0026#34;, \u0026#34;completed\u0026#34;] 文件系统Reducer 为了支持增量式的文件更新,我们需要一个 reducer function:\ndef file_reducer(left, right): \u0026#34;\u0026#34;\u0026#34;合并两个文件字典,右侧优先 作为 files 字段的归约函数,允许对虚拟文件系统进行增量更新 参数: left: 左侧字典(现有文件) right: 右侧字典(新增/更新的文件) 返回: 合并后的字典,右侧值覆盖左侧值 \u0026#34;\u0026#34;\u0026#34; if left is None: return right elif right is None: return left else: return {**left, **right} 扩展Agent的State 通过继承 AgentState，我们保留 messages，同时加入 todos 和 files。这样，整个 Agent 就在一个统一的状态树中管理：对话内容、任务计划、虚拟文件。这极大增强了 agent 的能力与可控性。\nfrom typing import Annotated, NotRequired from langgraph.prebuilt.chat_agent_executor import AgentState class DeepAgentState(AgentState): \u0026#34;\u0026#34;\u0026#34;扩展的代理状态,包含任务追踪和虚拟文件系统 继承自 LangGraph 的 AgentState 并添加: - todos: Todo 项列表,用于任务规划和进度追踪 - files: 虚拟文件系统,存储为文件名到内容的字典映射 \u0026#34;\u0026#34;\u0026#34; todos: NotRequired[list[Todo]] files: 
Annotated[NotRequired[dict[str, str]], file_reducer] Planning 工具设计 为了让智能体能够管理自己的任务计划，我们需要设计两个基本工具：\nwrite_todos：写入或更新 TODO 列表\nread_todos：从 state 中读取当前 TODO，辅助决策\nwrite_todos 这个工具会将 LLM 生成的 Todo 列表直接写入 state：\n@tool(description=WRITE_TODOS_DESCRIPTION, parse_docstring=True) def write_todos(todos, tool_call_id): return Command( update={ \u0026#34;todos\u0026#34;: todos, \u0026#34;messages\u0026#34;: [ ToolMessage(f\u0026#34;Updated todo list to {todos}\u0026#34;, tool_call_id=tool_call_id) ], } ) 工具返回 Command\n更新了 state.todos 写入了一条 ToolMessage 因此，执行 write_todos 后，智能体的状态自动包含新的任务计划。\nread_todos @tool(parse_docstring=True) def read_todos(state, tool_call_id): todos = state.get(\u0026#34;todos\u0026#34;, []) ... 与 write_todos 不同，read_todos 返回的是字符串，不是 Command。\n但 create_react_agent 会自动将字符串包装成 ToolMessage 并更新 messages 字段。\n价值与优势 TODO 是智能体的“显式认知步骤（Explicit Cognition）”: 帮助模型在长任务中拆解、记忆、执行。\nTODO 是智能体的“规划工具”, 支持如下能力：\n多步骤执行 子任务跟踪 状态回溯 推理清晰化 避免遗忘中间步骤 TODO 是智能体的“可控接口”: 开发者或用户可以“审阅”智能体的计划，从而让智能体更可控、更稳定。 FileSystem 在长时间运行的代理任务中，代理可能需要执行数十次工具调用。在这个过程中，重要的上下文信息可能会丢失或被遗忘。通过将关键信息保存到文件中，我们可以：\n持久化重要信息 在多次工具调用后重新获取上下文 更好地引导代理完成复杂任务 虚拟文件系统设计 我们将定义三个工具，并为它们提供清晰的描述，以便 LLM 能够理解和使用它们：\nls：列出虚拟文件系统（即字典的 所有键）中的所有文件。\nread_file：根据指定的文件路径读取文件内容。\nwrite_file：将指定的内容写入指定的文件路径。\n核心实现 列出文件：ls 使用 Injected State 访问图状态 从状态中获取 files 字典并返回所有键（文件路径）的列表 @tool(description=LS_DESCRIPTION) def ls(state: Annotated[DeepAgentState, InjectedState]) -\u0026gt; list[str]: return list(state.get(\u0026#34;files\u0026#34;, {}).keys()) 读取文件：read_file 这个工具支持：\n根据路径读取内容 根据 offset 和 limit 分页读取 自动编号（方便调试） 长行截断 @tool(description=READ_FILE_DESCRIPTION, parse_docstring=True) def read_file(file_path, state, offset=0, limit=2000) -\u0026gt; str: ... 读取流程：查看文件是否存在 -\u0026gt; 将内容按行切分 -\u0026gt; 按 offset、limit 截取 -\u0026gt; 返回带行号的结果。\n效果：\n1 The MCP (Model Context Protocol) is... 2 It allows systems to... 
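上述读取流程（检查文件是否存在 → 按行切分 → 按 offset、limit 截取 → 返回带行号的结果）的核心逻辑，可以用下面这个纯函数来示意。这里省略了 @tool 装饰器与 InjectedState 注入，函数名 format_file_content、行宽截断阈值等均为本文示意用的假设：

```python
def format_file_content(content: str, offset: int = 0, limit: int = 2000,
                        max_line_len: int = 2000) -> str:
    """按 offset/limit 截取文件内容，返回带行号的结果（长行会被截断）"""
    lines = content.splitlines()
    selected = lines[offset : offset + limit]
    numbered = []
    # 行号从 offset + 1 开始，与分页读取的位置对应
    for i, line in enumerate(selected, start=offset + 1):
        if len(line) > max_line_len:
            line = line[:max_line_len] + "..."  # 截断超长行，避免撑爆上下文
        numbered.append(f"{i} {line}")
    return "\n".join(numbered)

# 模拟虚拟文件系统（state["files"]）中的一个文件
files = {"notes.txt": "The MCP (Model Context Protocol) is...\nIt allows systems to..."}
print(format_file_content(files["notes.txt"]))
# 1 The MCP (Model Context Protocol) is...
# 2 It allows systems to...
```

真实工具在此基础上还需先检查 file_path 是否存在于 state 的 files 字典中，不存在时返回错误提示。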
写入文件：write_file 写文件需要更新 state，因此使用 Command：\n@tool(description=WRITE_FILE_DESCRIPTION, parse_docstring=True) def write_file(file_path, content, state, tool_call_id) -\u0026gt; Command: files = state.get(\u0026#34;files\u0026#34;, {}) files[file_path] = content return Command( update={ \u0026#34;files\u0026#34;: files, \u0026#34;messages\u0026#34;: [ ToolMessage(f\u0026#34;Updated file {file_path}\u0026#34;, tool_call_id=tool_call_id) ], } ) 实现更新 state.files，写入 ToolMessage 到 state.messages\nSummary 虚拟文件系统是构建深度代理的重要基础设施。虽然在简单示例中可能显得不必要,但在处理复杂的多步骤任务时，它能够显著提升代理的可靠性和性能。\n✓ 防止 LLM 遗忘重要信息\n✓ 随时加载任意步骤的上下文\n✓ 支持多文件代码生成\n✓ 支持复杂研究任务（多次搜索、多次整合）\n✓ 支持“草稿—修改—再草稿”的循环\n✓ 支持在不同 agent 之间共享状态\n这套文件操作工具（ls、read_file、write_file）将是深度代理抽象的核心功能之一。通过理解其底层实现原理——基于 LangGraph 状态的模拟文件系统——你可以根据自己的需求进行定制和扩展。\n","permalink":"https://mig217.github.io/post/2025-11-30-deep-agents/","summary":"\u003cp\u003e在过去的一年里，AI Agent 的演进出现了两个非常重要的趋势：\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003e智能体正在变得更通用（Generalist）\u003c/strong\u003e：可以承担越来越多类型的任务；\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003e智能体的任务时长变得更长（Long-horizon）\u003c/strong\u003e：能够连续执行几十甚至上百个步骤的复杂任务。\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003e根据 METR 的基准测试，AI 能自动完成的人类任务等效时长大约 每 7 个月翻倍。这意味着智能体从“短对话助手”，发展为“能够连续运行数百甚至上千步的自主系统”。\u003c/p\u003e\n\u003cfigure class=\"align-center\"\u003e\n    \u003cimg loading=\"lazy\" src=\"/static/images/length-of-tasks-log.png\" width=\"700px\"/\u003e \n\u003c/figure\u003e\n\n\u003cp\u003e与此同时，通用智能体数量激增，如 Manus 和 Claude Code 等系统正在承担远不止“写代码”或“回答问题”的任务。它们能够\u003cstrong\u003e组织研究流程、规划任务、调用大量工具，并产出复杂成果\u003c/strong\u003e。但随着任务时长与任务复杂度提升，工程上的挑战也随之而来：\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eManus\u003c/strong\u003e：典型任务需要调用约 50 次工具\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAnthropic Production Agents\u003c/strong\u003e：实际生产系统常常会进行 数百轮对话与推理\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e这些长周期、多步骤、工具密集型的系统，被称为 \u003cstrong\u003eAgentic 
System\u003c/strong\u003e。尽管不同项目的具体实现差别较大，但它们普遍遵循以下四个核心原则：\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003e使用 Planning 保证任务方向正确\u003c/strong\u003e\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003e在超长任务中，如果缺乏规划，模型非常容易偏航。因此 Agentic System 普遍采用明确的任务计划来做“航向控制”。\u003c/p\u003e\n\u003cp\u003e常见做法包括：\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eManus\u003c/strong\u003e：使用 \u003ccode\u003etodo.md\u003c/code\u003e 保存任务清单，并在执行过程中不断读写更新；\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eClaude Code\u003c/strong\u003e：要求用户先批准计划，再执行工具与操作；\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eGemini Deep Research\u003c/strong\u003e：在执行前强制生成计划、并请求用户确认；\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eAnthropic Multi-Agent Researcher\u003c/strong\u003e：将 research plan 写入文件系统，在关键流程中重新读取，确保最终报告“遵循原计划”；\u003c/li\u003e\n\u003c/ul\u003e\n\u003col start=\"2\"\u003e\n\u003cli\u003e\u003cstrong\u003e使用 Filesystem 进行 Offload Context\u003c/strong\u003e\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003e随着工具调用次数、搜索结果、观察内容不断增多，把所有内容保存在消息历史里会快速耗尽上下文窗口。\u003c/p\u003e\n\u003cp\u003e因此，可以采用“外部化记忆”机制：\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e将庞大中间结果（如搜索原始结果）保存到磁盘\u003c/li\u003e\n\u003cli\u003e在上下文中只放简短摘要，节省 Tokens\u003c/li\u003e\n\u003cli\u003e在需要时再从文件读取完整内容\u003c/li\u003e\n\u003cli\u003e长期记忆可以独立保存，不随对话窗口消失\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eExample:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eAnthropic Multi-Agent Researcher\u003c/strong\u003e：将研究计划写到文件里 → 调研完成后再读回来 → 保证报告结构与原计划一致；\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eManus\u003c/strong\u003e：将 \u003ccode\u003etodo.md\u003c/code\u003e 持久化，多次更新、反复读取\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e文件系统就是 Agentic System 的 外部化记忆（Externalized Memory）。\u003c/p\u003e","title":"Deep Agents From LangGraph"},{"content":"大家可能都已经对 LLM 很熟悉了。大概在两三年前，ChatGPT、Claude、Llama、DeepSeek 等模型相继出现，可以说是彻底改变了世界。但在使用这些强大工具的同时,一个核心问题值得探讨：这些模型到底是如何训练的？\n本文将从宏观视角梳理 LLM 的训练流程，重点关注训练 AI Agents 所需的关键技术路径,而非底层实现细节。\nLLM 
Training Pipeline LLM 的训练是一项复杂的系统工程,通常可以划分为三个核心阶段: 预训练(Pre-training)、经典后训练(Classic Post-training/RLHF) 和 推理强化学习(RL for Reasoning)。在实际应用中,我们还会结合 提示工程(Prompting) 和 微调(Fine-tuning) 来进一步激发模型潜力。\nGeneral LLM Training Pipeline 从整体上看，大语言模型（LLM）的训练分为三个阶段，每个阶段的目标、规模和挑战各不相同：\nPre-training：在大规模文本上学习预测下一个词，建立通用知识基础；这是规模最大、成本最高的一步，瓶颈在于高质量数据和算力资源。 Classic Post-training / RLHF：通过人类反馈强化学习，使模型输出更符合用户偏好；相比预训练，所需数据和成本大幅降低，但高度依赖优质反馈和有效评测体系。 RL for Reasoning：让模型在回答前进行推理，提升解决数学、编程等客观问题的能力；规模和成本介于前两者之间，难点在于设计合适的 RL 环境并防止模型“自我黑客”。 三个阶段对比\n阶段 核心目标 数据规模 训练时间 成本级别 主要瓶颈 预训练（Pre-training） 学习预测下一个词，构建知识底座 ~10 万亿 tokens 数月 千万美元级 高质量数据、算力资源 经典后训练 / RLHF（Classic Post-training / RLHF） 让模型符合用户偏好 ~10 万个问题 几天 数万–十万美元 人类反馈数据、评测体系 推理型 RL（RL for Reasoning） 提升推理和思考能力 百万级问题 数周 百万美元级 RL 环境设计、防止自我黑客 成功的关键要素 了解了训练的宏观阶段后，我们再深入一层，看看在每个阶段中，哪些要素是决定模型成败的关键。\n模型架构 (Architecture): 模型的骨架，目前 Transformer 仍是绝对主流，MoE (Mixture of Experts) 因其高效扩展性而备受关注。 训练算法与损失函数 (Algorithm \u0026amp; Loss): 模型优化的导航系统，决定了模型如何从数据中学习。 数据与强化学习环境 (Data \u0026amp; RL Environment): 模型学习的“教材”。尤其在对齐和能力提升阶段，高质量的数据和精心设计的 RL 环境至关重要。 评测 (Evaluation): 衡量训练效果的标尺，指导着整个优化迭代的方向。 系统与基础设施 (Systems \u0026amp; Infrastructure): 支撑训练的动力引擎，决定了你能否高效、稳定地将模型规模化。 一个有趣的转变是，在 2023 年之前，学术界和业界的焦点更多集中在架构和算法的创新上。而如今，随着技术路线逐渐收敛，共识已经形成：真正拉开模型效果差距的，是数据、评测和系统这三大支柱。\nLLM 训练已从单纯的算法竞赛，演变为一项复杂的系统工程：\n数据决定上限：数据的质量和多样性，直接决定了模型可能达到的最终高度。 评测决定方向：科学的评测体系，是迭代优化、做出正确技术决策的指南针。 系统决定规模：强大的基础设施，决定了你能否训练更大、更强的模型。 提示工程与领域微调 当我们获得一个训练好的基础模型后，如何高效地将其应用于特定场景？提示工程 (Prompting) 和 领域微调 (Fine-tuning) 是两种最核心的技术。\n提示（Prompting）\n我们可以将称之为 “提问的艺术”，即通过精心设计的指令 (Prompt) 来引导模型精准地执行任务。其核心特点是轻量、低成本且由评测驱动。对于大多数开发者而言，这是将模型与业务结合的最高效途径。\n微调（Fine-tuning）\n微调的核心思想是利用特定领域的数据，对基础模型进行“专项强化训练”，使其成为该领域的“专家”。例如，医疗公司可利用专业医疗数据对模型进行微调，显著提升其在医疗报告理解上的准确性。相比提示工程，微调需要额外的数据和算力投入，但能让模型更深入地内化领域知识，效果上限更高。\nPretraining 接下来，我们将深入探讨预训练阶段。这部分将围绕三个核心展开：方法 (Method)、数据 (Data) 和算力 (Computation)。\nMethod 预训练的目标是让模型学习世界知识，但实现这一目标的任务却很简单：预测下一个词 (Next Token Prediction)。\n这个过程与手机输入法预测下一个词的原理类似。然而，当模型在超过 10 万亿 Token 
的互联网级数据上完成这个任务后，推理、归纳甚至“思考”的能力便会“涌现”出来。这一核心思想自 GPT-2 以来，便成为业界主流。\n简单语言模型：N-gram 要理解现代 LLM，可以回顾其最朴素的前身——N-gram 模型。这是一种纯粹的统计方法，其核心思想是：一个词的出现概率，只与它前面的 N-1 个词相关。\n预测方式：通过在语料库中统计 N-gram 片段的出现频率来计算概率。 致命缺陷： 存储灾难：需要存储海量片段，无法扩展到互联网数据。 无法泛化：无法处理未在语料库中见过的文本。 为了解决这些问题，研究者们转向了神经网络。神经网络语言模型本质上可以看作是 N-gram 模型的一种高效、可泛化的“参数化近似”。它不再死记硬背所有统计频次，而是学习文本中蕴含的模式，并将这些模式压缩到模型的参数中。\n自回归 (Autoregressive) 神经网络模型 现代 LLM 普遍采用自回归 (AR) 模式进行训练，其工作流程可分解为以下几步：\n文本向量化 (Word Embedding)：\n分词 (Tokenization)：将文本 \u0026ldquo;She likely prefers\u0026rdquo; 切分为 Token 序列 [\u0026ldquo;She\u0026rdquo;, \u0026ldquo;likely\u0026rdquo;, \u0026ldquo;prefers\u0026rdquo;]。 嵌入 (Embedding)：将每个 Token 映射为一个高维向量。这组向量成为神经网络的数字输入。 上下文融合 (Neural Network Processing)： 这组向量被送入 Transformer 网络。通过自注意力等机制，模型对输入信息进行复杂的加权融合，生成一个蕴含了全部上下文信息的新向量。\n概率预测 (Probability Prediction)：\n这个上下文向量通过一个线性层，被转换为一个与词汇表等大的高维向量（称为 logits）。Logits 向量中的每个值代表对应词成为下一个词的可能性得分。 使用 Softmax 函数将 logits 转换为一个总和为 1 的概率分布。例如，\u0026ldquo;dogs\u0026rdquo; 的概率为 0.7，\u0026ldquo;cats\u0026rdquo; 为 0.2 等。 训练与推理：\n训练时：模型将预测概率（如 \u0026ldquo;dogs\u0026rdquo; 的 0.7）与真实答案（概率 1.0）进行比较，通过交叉熵损失 (Cross-Entropy Loss) 计算差距，并据此反向传播更新模型参数。 推理时：模型根据生成的概率分布进行采样 (Sampling)，通常选择概率最高的词，并将其拼接回输入，循环往复，生成完整的回答。 Data 预训练数据是大模型的\u0026quot;养料\u0026quot;，其规模之大令人咂舌。当前主流模型通常需要消耗 超过 10 万亿 tokens 的数据：\nLlama 4：大约在 20–40 万亿 tokens DeepSeek V3：大约 15 万亿 tokens Llama 3：大约 15 万亿 tokens 换算一下，这相当于 超过 200 亿个独立网页 的内容量。然而规模不等于质量。从充斥着广告和低质信息的互联网中提炼出“精华”，是预训练成败的关键。\n数据处理流程\n数据抓取与文本提取：利用 Common Crawl 等开源项目，从互联网上下载海量的原始网页快照，再从中精准提取正文文本。这是计算成本最高的环节之一。\n多层过滤与去重：\n内容过滤：移除有害、不适宜 (NSFW) 及隐私内容。 启发式过滤：通过长度、字符等简单规则，快速筛掉明显的低质量文本。 数据去重：去除高度雷同的内容，避免模型学习无效模式。 基于模型的质量筛选：训练一个分类器，让它学习**高质量文本（如维基百科）**的特征，然后用它去为海量网页打分，筛选出最“像”高质量内容的文本。\n数据混合 (Data Mix)：将来自网页、代码、书籍、论文等不同来源的干净数据，按特定比例混合。这个数据配方直接塑造了模型的能力倾向，例如，增加代码数据的权重，可以显著提升模型的编程能力。\n预训练后期的策略\n中期训练 (Mid-training)：在预训练后期，使用一个规模更小但质量极高的“黄金数据集”（如书籍、论文）进行第二阶段训练，以巩固和深化核心知识。 长上下文持续预训练 (Continual Pre-training for Longer Context)：为节约成本，通常在训练前期使用较短上下文（如 4k），在最后阶段才逐步增加到目标长度（如 128k）。 
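上文提到的数据混合（Data Mix）思路，可以用一个按权重采样的小例子来示意：按配方给每个来源分配采样权重，训练时逐样本地抽取来源。其中来源类别与权重均为假设的示例配方（真实配方是各机构的核心机密），sample_source 为示意用的函数名：

```python
import random

# 假设的数据配方：来源 -> 采样权重（示例值，非任何真实模型的配方）
data_mix = {"web": 0.6, "code": 0.2, "books": 0.1, "papers": 0.1}

def sample_source(mix: dict[str, float], rng: random.Random) -> str:
    """按配方权重随机选择下一个训练样本的来源"""
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in data_mix}
for _ in range(10_000):
    counts[sample_source(data_mix, rng)] += 1
# 抽样足够多次后，counts 中各来源的占比会接近配方权重（如 web 约占 60%）
```

调整权重（例如提高 code 的占比）即相当于改变“数据配方”，从而改变模型的能力倾向。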
由于涉及核心竞争力和潜在的版权风险，顶级的数据处理流程和数据配方，是各家机构真正的护城河。\nComputation 在预训练中，计算量 (Compute) 投入几乎是决定模型性能的唯一关键因素。这个规律对所有类型的数据和模型都普遍适用。\n计算量主要由两个维度决定：\n模型规模 (Parameters)：模型包含的参数数量。 数据量 (Training Tokens)：用于训练的文本总量。 幸运的是，模型性能与计算量之间的关系并非杂乱无章，而是遵循着一种可预测的规律——缩放法则 (Scaling Laws)。这意味着我们可以在小规模实验中观察模型表现，然后依据缩放曲线，精准推断出在投入百倍、千倍计算量后模型的性能会达到何种水平。\nScaling Laws: Tuning 缩放法则彻底改变了超参数的调优方式，将一个极其昂贵的过程变得高效。\n传统做法：直接启动多个不同超参的大模型并行试错，成本高昂。 现代做法：在多个小模型上快速实验，找到超参数与模型规模之间的缩放关系，再将最优配置外推 (Extrapolate) 到目标大模型上。 缩放法则让我们可以从小模型的廉价试错中，找到适用于大模型的规律，极大提升了研发效率。\nScaling laws for development 在选择模型架构时，我们真正关心的不应是它在当前算力下的表现，而是当计算资源扩大10倍、100倍后，哪个架构会变得更强(@kaplanScalingLawsNeural2020)。关键在于比较两个指标：\n初始性能 (Constant)：在同等算力下，谁的性能更好。 缩放速率 (Scaling Rate)：每增加单位算力，谁的性能提升得更快。 Example: Transformer vs. LSTM\n问题: Transformer 架构还是 LSTM 架构？ 方法: 我们可以在较小的规模下，分别训练这两种模型，并绘制它们的性能曲线 观察 (见图): Transformer 的线始终在 LSTM 的线下方: 在任何计算规模下，Transformer 的表现都更好。这被称为有更好的“常数”。 Transformer 的线的斜率更陡: 每增加一点计算资源，Transformer 性能的提升幅度比 LSTM 更大。这被称为有更好的“缩放速率”。 应永远选择缩放速率更优的架构。 一个架构即使当前表现稍逊，但只要其潜力巨大，就更值得投资。\nScaling laws: eg Chinchilla 在计算预算固定的前提下，资源应该更偏向于扩大模型规模还是增加数据量？DeepMind 的 Chinchilla 论文通过缩放法则给出了答案(@hoffmannTrainingComputeOptimalLarge2022)。\nChinchilla 发现：为了最高效地利用算力，模型参数量与数据量之间存在一个“黄金比例”——每 1 个模型参数，大约需要 20 个 Token 的数据来训练。\n局限性：\n被忽略的成本：Chinchilla 法则只优化了训练成本，完全没有考虑模型部署后的推理成本。 现实权衡：对于商业服务而言，推理成本是决定成败的关键。因此，更明智的决策可能是：宁愿在训练阶段投入更多资源（例如，用更多数据训练一个相对较小的模型），以换取未来长期、低廉的推理成本。 Summary: 训练一个前沿模型的代价 这里以一个假设的405B参数的 Llama 3 模型为例，估算了其训练所需的各项成本（基于2023-2024年数据）:\n规模：405B 参数，15.6T Tokens 数据。 硬件：16,000 张 NVIDIA H100 GPU。 时间：约 70 天。 金钱成本: 综合硬件租用和人力等成本，为 5200万美元（范围可能在 5000 万到 8000 万美元之间）。 环境成本: 仅训练这一个模型所产生的碳排放，约等于 2000 次从纽约到伦敦的往返机票。 未来趋势: 每一代新模型的训练算力消耗都将大约是上一代的 10 倍。 大模型的预训练阶段，本质上是一场围绕方法、数据和算力的系统性工程。它以一个极其简单的任务为起点，通过海量高质量数据和庞大计算资源的投入，最终实现了智能的涌现。\nPosttraining 预训练的目标是预测下一个词。这使得模型精通语言模式，却不理解“遵循指令”或“帮助用户”这类概念。\n例如，如果你对一个原始的预训练模型说：“给6岁的孩子解释一下登月”，它很可能不会回答，而是续写一个相似的问题，比如：“给6岁的孩子解释一下万有引力”。\n因此，我们需要一个额外的“后训练”阶段，来校准模型的行为，使其真正变得有用、可控。\n后训练的两个主要阶段 1. 
Alignment / Instruction Following\n这是让模型变得有用的第一步，也是 2022年 ChatGPT 取得成功的核心。\n目标：让模型学会理解并遵循用户指令，知道什么样的回答是“好”的回答。 任务：主要通过监督微调 (SFT) 和基于人类反馈的强化学习 (RLHF)。 数据：与预训练相比，数据量小得多，大约在 5千到 50万个高质量问答对之间。 这个阶段好比教一个满腹经纶的学者“如何与人有效沟通”，只需少量高质量范例，即可让它学会互动的范式。\n2. Reasoning\n这是在“对齐”之后更高级的阶段，是 2024年以来的新趋势（以 o1 等模型为代表）。\n目标：不仅要回答得“好”，更要通过深度思考，确保答案的正确性，尤其是在数学、编程等有客观标准的任务上。 方法：主要通过带有验证器的强化学习 (RL with Verifiers)。 它引入了Test-time Compute 的概念。传统模型性能在训练结束后就已固定；而具备推理能力的模型，可以在回答问题时投入更多计算资源来“思考”，从而得到更准确的答案。\nSupervised Finetuning SFT (Supervised Fine-tuning) 是对齐模型的第一步，其核心是Behavior Cloning。\n它的原理与预训练一致——预测下一个词，但关键区别在于，它不再使用海量的互联网数据，而是在一个高质量的、我们期望的“问题-答案”数据集上进行训练，强制模型只学习和模仿“标准答案”的说话方式。\nSFT 可以做什么？\nSFT 可以将一个原始的知识库模型，转变为一个能干的助手，教会它：\nInstruction following：听懂并执行命令。 Desired format or style：例如，使用表情符号、要点列表，或保持某种特定的语气。 Tool use：教会模型调用计算器、搜索引擎等外部API来完成复杂任务。 Early reasoning：通过学习带有推理链的范例，教会模型在回答前先“思考”。 理论上，任何你能提供“优质输入-输出”配对的任务，都可以通过 SFT 来学习。\nSFT 的数据从何而来？\nAsk Humans 最直接的方式。雇佣人类专家为各种问题编写高质量答案。这是 GPT-3 进化到初代 ChatGPT 的关键一步，优点是质量最高，缺点是成本高昂且速度缓慢。\nSynthetic Data 利用一个更强大的“教师模型”来自动生成海量问答数据。**斯坦福的 Alpaca 模型(@duboisAlpacaFarmSimulationFramework2024a)**是这一方法的早期成功典范，它开启了使用合成数据来复刻闭源模型能力的浪潮。\nGenerate \u0026amp; Verify 当训练最强模型，没有更强的“老师”可以请教时，你需要一个能判断对错的“裁判”。其步骤是：\n让模型针对一个问题生成多个候选答案。 使用Verifier——例如代码测试程序、数学检查器——来自动筛选。 只保留通过验证的优质答案作为训练数据。 DeepSeek R1 正是通过这种“头脑风暴 + 裁判筛选”的模式，训练出了顶级的推理能力\nSFT 需要多少数据？\nSFT 所需的数据量远小于预训练：\n简单任务 (学习风格)：约1万条数据通常足够。LIMA 论文(\u0026lt;@zhouLIMALessMore2023\u0026gt;)发现，数千个高质量范例就能教会模型期望的风格。 复杂任务 (推理、工具使用)：需要更多数据，例如 DeepSeek R1 使用了约80万条样本。 预训练已让模型学到了几乎所有知识。SFT 的作用更像是激活和规范，它告诉模型在何种情境下，应该唤醒哪种它已具备的能力，使其行为模式符合特定用户的偏好。\nReinforcement Learning SFT 本质是“行为克隆”，但这存在三大缺陷：\n受限于范例质量：模型的上限无法超越提供标准答案的人类或教师模型。但人类作为“裁判”的能力，远高于作为“创作者”的能力。 可能教会模型幻觉：当模型模仿一个它自己无法验证真实性的答案时（例如引用一篇它没读过的文献），它学会的不是事实，而是“编造看起来煞有介事的假信息”的行为模式。 成本高昂：编写大量完美的“标准答案”费时费钱。 RL 的核心思想：最大化期望行为\n为了解决上述问题，RL 提出与其克隆好的行为，不如去强化好的行为。我们不再提供唯一标准答案，而是让模型自己探索多种答案，然后通过一个 Reward Signal 告诉它哪些更好，从而引导它产出更高分的答案。\nReward Signal 从何而来？\nRule-based Rewards: 
对于有明确对错标准的任务（如代码、数学），直接用规则来打分。代码能通过单元测试就给高分，反之则给低分。 RLHF - RL from Human Feedback: 这是成就 ChatGPT 的关键技术。让模型生成多个答案，然后让人类选择“更喜欢哪一个”，用这些偏好数据训练一个 Reward Model，让这个AI裁判去指导主模型。 LLM as a Judge: 直接用一个最强的 LLM 来给答案打分，作为奖励信号。 DeepSeek R1 的强化学习流程 DeepSeek R1 采用了一种精细化的“分而治之”策略，对不同任务“对症下药”，如图所示(@alammarIllustratedDeepSeekR12025)：\n对于推理等客观任务：使用基于规则的验证器（如单元测试）作为“硬核裁判”，奖励信号准确可靠。 对于写作等主观任务：使用模拟人类偏好的奖励模型作为“AI裁判”，判断回答的有用性和安全性。 这种结合了客观规则和主观模型的混合式强化学习，使得模型在保证逻辑严谨的同时，也提升了在开放性对话中的表现。\nGRPO 算法 GRPO 由 DeepSeek R1 推广(@shaoDeepSeekMathPushingLimits2024)，是目前开源社区最常用的 RL 算法。其流程简单直接：\n生成：针对一个问题，生成多个不同答案。 打分：用奖励模型或验证器为每个答案打分。 学习：更新模型参数，鼓励模型多产出高分答案的行为，抑制低分行为。 在实际操作中，通常会加入 KL散度约束 (KL Divergence)，它像一根“缰绳”，防止模型为了追求高分而“走火入魔”，确保其生成的内容依然自然、流畅。\nInfra is Key 在大规模 RL 中，算法本身并非最难，底层的软硬件基础设施才是真正的挑战，尤其是在需要模型与环境进行多步交互的智能体 (Agent) 任务中，生成样本（采样）的过程极其消耗计算资源。\nKimi 团队的解决方案展示了顶级基础设施的形态(@teamKimiK2Open2025)：\n智能调度：暂停耗时过长的采样任务，先用已有数据更新模型，再用新模型继续被暂停的任务，避免流程卡顿。 大规模并行：当一个任务等待外部API时，GPU会立即切换到其他任务，最大化利用率。 架构优化：将训练、推理等引擎部署在同一物理节点，最大限度减少通信延迟。 Evaluation 评估是机器学习与人工智能中最关键的环节之一。它的重要性体现在三个方面：\n量化进展：帮助识别模型改进方向，衡量性能变化，指导超参数选择。 模型选择：用于比较不同模型，以确定最适合特定应用场景的方案。 生产可用性判断：即使模型在评测中表现最佳，也需要通过评估确定其是否达到实际应用要求。 评估主要分为两类：Closed-ended Evaluation 和 Open-ended Evaluation。\nClosed-ended Evaluation 核心思想：将评测问题转化为有少数几个固定答案的格式（如多选题），从而可以轻松地自动化验证。\n典型例子：MMLU (Massive Multitask Language Understanding) 是一个广泛使用的基准测试，它包含大量类似大学考试的多项选择题，覆盖了从数学到历史的众多学科(@hendrycksMeasuringMassiveMultitask2021)。 主要挑战： Prompt Sensitivity: 对同一个问题，不同的提问方式可能会导致模型给出截然不同的答案。 Data Contamination: 评测数据常出现在公开语料中，模型可能在预训练阶段就已经“见过”了考题，导致评测分数虚高。 Open-ended Evaluation 对于像 ChatGPT 这样以对话和生成为核心的模型，封闭式评测远远不够。开放式评测旨在评估它们在真实、无固定答案场景下的表现。\n主要挑战：\n应用场景多样：模型需要处理从聊天、编程到内容摘要等各种任务。 答案开放性强：模型的回答通常很长，且没有唯一的“标准答案”，因此无法通过简单的文本匹配来判断对错。 因此，研究者提出了 Preference Comparison 的评估思路：\n人工评估：ChatBot Arena Chatbot Arena (@chiangChatbotArenaOpen2024) 采用了双盲人类（double-blind human evaluation）评估方式：用户在不知道模型身份的情况下，与随机两款聊天机器人交互并投票选择更优者。\n优点： 结果客观、可信度高。 缺点： 成本高昂且速度缓慢，难以进行大规模、高频率的迭代测试。 LLM 自动评估：AlpacaEval 为了降低人工成本，研究者提出用 LLM 
充当评审员。AlpacaEval（@duboisLengthControlledAlpacaEvalSimple2025） 是早期代表方法，核心流程如下：\n针对同一个指令，分别获取基准模型和待评测模型的回答。 将这两个回答提交给一个强大的裁判LLM，让它判断哪个更好。 通过大量比较，计算出待评测模型的胜率 (Win Rate)。 优点\n与 Chatbot Arena 的结果高度一致（Spearman 相关系数 0.98）。 成本极低（约 3 分钟、10 美元以内即可完成全流程）。 缺点\n存在Spurious Correlation风险，即评审模型的偏好可能影响评测结果。 Systems 当我们讨论如何提升模型性能时，一个共识是“扩展性决定上限 (scaling is what matters)”。这意味着投入更多的计算资源通常能带来更好的结果。但现实是，我们所有人都受限于计算资源。\n既然计算是瓶颈，为什么不直接购买更多的 GPU 呢？原因有三：\nGPU昂贵且稀缺: 顶级 GPU 不仅价格高昂，而且供应紧张。即使有预算，也未必能买到。 物理限制: 大规模 GPU 集群的通信开销巨大。GPU 之间的数据传输速度可能成为新的瓶颈，拖慢整体训练速度。 效率问题: 必须确保每一块 GPU 的计算潜力都被充分压榨，否则只是徒增成本。 因此，与其盲目堆砌硬件，不如通过系统级的优化来高效地分配和利用现有资源。\nGPU 的基础概念 要优化训练过程，首先需要了解 GPU 的核心特性：\n大规模并行处理：与 CPU 核心少而强不同，GPU 拥有成千上万个核心。它采用“单指令多数据”（SIMD）模式，在大量线程上对不同数据执行相同的指令，专为高吞吐量而生。 矩阵乘法优化：GPU 最初为图形处理设计，而图形处理的本质是密集的矩阵运算。因此，GPU 内置了专门用于加速矩阵乘法的硬件单元（如 Tensor Cores），这类运算的速度通常是其他浮点操作的 10 倍以上。 计算不再是瓶颈：GPU 的浮点计算能力 (FLOPs) 增速远超其内存带宽和通信速度(@ivanovDataMovementAll2021)。这意味着，如今的瓶颈不再是“算得慢”，而是“喂不饱”。保持计算单元持续有数据可算，是系统优化的核心挑战。如下图所示，计算性能的提升曲线远比内存带宽陡峭。 内存层次结构：GPU内部的内存是分层级的。离计算核心越近的内存（如L1 Cache、Shared Memory）速度越快，但容量越小；离得越远（如Global Memory DRAM）容量越大，但速度越慢。高效的算法需要精心设计，以最大化地利用高速缓存，减少对慢速全局内存的访问。 关键指标：MFU\n模型 FLOP 利用率 (Model Flop Utilization, MFU) 是衡量系统效率的关键指标：\n$$\\text{MFU} = \\frac{\\text{观测到的模型吞吐量 (Observed Throughput)}}{\\text{GPU 的理论峰值吞吐量 (Theoretical Best)}}$$\nMFU 的值反映了你的代码在多大程度上压榨出了 GPU 的理论性能。一个 MFU 为 1 的程序意味着计算单元在任何时候都处于忙碌状态。 在实际应用中，MFU 能达到 50% 就已经是非常出色的表现，许多大公司也需要投入大量精力才能将这个数字从 15%-20% 提升到 50%。\n系统优化技术 了解了 GPU 的特性后，我们可以探讨一些具体的优化方法。\nLow-Precision Training 核心思想：使用更少的bits（如bf16）来表示数字，从而减少内存占用和加速数据传输。\n由于深度学习训练过程（尤其是随机梯度下降）本身充满了噪声，因此大多数运算并不需要32位浮点数（fp32）的高精度。将矩阵乘法等主要计算从 fp32 切换到 bf16 可以带来显著的性能提升。\nFor Training: Automatic Mixed Precision (AMP)\nWeights: 以 fp32 格式存储主权重，以保持精度。 Computation: 在进行前向和反向传播时，将权重和激活值转换为 bf16 进行矩阵运算，以获得速度提升和内存节省。 Gradients: 以 bf16 格式存储，进一步节省内存。 Update: 将 bf16 梯度转换回 fp32，用于更新主权重。 Operator Fusion 核心思想：将多个连续的计算操作合并成一个单一的计算内核（Kernel），以减少对全局内存的读写次数。\n在PyTorch等框架中，每一行独立的计算（如 y = x.cos()）都可能触发一次全局内存读写。\nOperator Fusion 可以将 y = 
x.cos() 和 z = y.sin() 这样的多个操作合并，实现一次读取、多次计算、一次写回，从而大幅减少内存访问开销。\ntorch.compile 就是PyTorch中实现 Operator Fusion 的强大工具。\nTiling 核心思想：通过巧妙地组织计算顺序，最大化地复用已加载到高速缓存中的数据。\n以矩阵乘法为例，传统的计算方式需要频繁地从全局内存中读取整个行和列。而Tiling则将大矩阵切分成小块（Tile）。计算时，先将几个小块加载到高速的共享内存中，完成所有与这些小块相关的计算后，再加载下一批。\n这样，加载到高速缓存中的每个数据点都被多次使用，极大地减少了对慢速全局内存的访问次数。\nExample: Flash Attention\nFlashAttention 是注意力机制优化算法(@daoFlashAttentionFastMemoryEfficient2022)，它结合了上述技术：\nOperator Fusion: 将注意力的多个计算步骤（矩阵乘法、缩放、掩码、Softmax）融合成一个内核。 Tiling: 在计算注意力分数时使用Tiling策略，避免实例化巨大的注意力矩阵。 Recomputation: 在反向传播时，不存储中间结果（如注意力矩阵），而是重新计算它们。因为重计算比从全局内存中读取这些巨大的中间结果要快得多。 通过这些系统级的优化，FlashAttention 在没有改变任何模型逻辑的情况下，实现了高达 1.7 倍的端到端训练加速。\n多GPU并行策略 当模型规模大到单个 GPU 无法容纳时，我们就必须采用并行计算策略。\n当训练一个 P 参数的模型，通常需要约 16P GB 的内存：\nModel Weights: 4P GB Gradients: 4P GB Optimizer States: 8P GB (Adam需存储均值和方差) 这意味着，训练一个7B参数的模型大约需要 $16 \\times 7B \\approx 112GB$ 的显存，这远远超出了单张 GPU 的容量。\nData Parallelism 核心思想：将模型和优化器完整地复制到每张 GPU 上，然后将训练数据分片，每张 GPU 处理一部分数据，每一步计算完成后，聚合所有的 GPU 的梯度，然后同步更新各自的模型副本。\n优点：简单直接，可以有效利用多张 GPU 加速训练。 缺点：完全没有节省内存。如果模型本身放不下一张卡，数据并行也无能为力。 为了解决显存问题，ZeRO(@rajbhandariZeROMemoryOptimizations2020)等技术应运而生。它是一种增强的数据并行，通过将优化器状态、梯度甚至模型参数分片到不同的 GPU 上，从而极大地降低了单张 GPU 的显存压力。\nModel Parallelism 核心思想：将模型本身切分到不同的 GPU 上。这适用于数据并行无法解决的超大模型场景。\nPipeline Parallelism: 按模型的Layer进行分层。例如，GPU 0 负责第 1-10 层，GPU 1 负责 11-20 层，数据像流水线一样依次流过各个 GPU (@huangGPipeEfficientTraining2019)。 Tensor Pipeline: 在单层内部进行切分。例如，将一个巨大的权重矩阵切分成几块，分别放到不同的 GPU 上进行计算，最后将结果聚合(@shoeybiMegatronLMTrainingMultiBillion2020)。 Architectural Sparsity 核心思想： 并非每个输入数据都需要经过模型的所有参数。通过激活模型的一部分参数来处理输入，可以有效降低计算量。\n**混合专家模型（Mixture of Experts, MoE）**是该思想的典型代表。MoE 模型包含多个专家子网络和一个路由器。对于每个输入，路由器会选择性地激活一个或几个专家来处理它 (@fedusReviewSparseExpert2022)。\n优点：可以在保持计算量（FLOPs）不变的情况下，大幅增加模型总参数量。由于每次只有部分专家被激活，因此非常适合并行化，可以将不同的专家部署在不同的 GPU 上。\nSummary 
本次讨论虽然涵盖了许多核心内容，但AI领域的发展日新月异，还有很多重要话题我们未能深入：\n在技术实现层面，我们没有探讨像MoE（混合专家模型）和SSM（状态空间模型）这样的前沿架构；也未涉及模型解码策略与推理优化、ChatGPT等产品的UI/工具设计，以及至关重要的多模态技术。\n整个领域还面临着更宏大和深刻的挑战：如何有效防止技术滥用、如何突破上下文窗口的限制、如何应对高质量数据枯竭的“数据墙”危机，以及如何解决数据收集的合法性问题。\n这些悬而未决的议题，共同构成了AI下一阶段的发展蓝图和核心挑战。\nReference Alammar, Jay. 2025. “The Illustrated DeepSeek-R1.” https://newsletter.languagemodels.co/p/the-illustrated-deepseek-r1.\nChiang, Wei-Lin, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, et al. 2024. “Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.” arXiv. https://doi.org/10.48550/arXiv.2403.04132.\nDao, Tri, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” arXiv. https://doi.org/10.48550/arXiv.2205.14135.\nDubois, Yann, Balázs Galambosi, Percy Liang, and Tatsunori B. Hashimoto. 2025. “Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators.” arXiv. https://doi.org/10.48550/arXiv.2404.04475.\nDubois, Yann, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2024. “AlpacaFarm: A Simulation Framework for Methods That Learn from Human Feedback.” arXiv. https://doi.org/10.48550/arXiv.2305.14387.\nFedus, William, Jeff Dean, and Barret Zoph. 2022. “A Review of Sparse Expert Models in Deep Learning.” arXiv. https://doi.org/10.48550/arXiv.2209.01667.\nHendrycks, Dan, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. “Measuring Massive Multitask Language Understanding.” arXiv. https://doi.org/10.48550/arXiv.2009.03300.\nHoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, et al. 2022. “Training Compute-Optimal Large Language Models.” arXiv. 
https://doi.org/10.48550/arXiv.2203.15556.\nHuang, Yanping, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, et al. 2019. “GPipe: Efficient Training of Giant Neural Networks Using Pipeline Parallelism.” arXiv. https://doi.org/10.48550/arXiv.1811.06965.\nIvanov, Andrei, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2021. “Data Movement Is All You Need: A Case Study on Optimizing Transformers.” arXiv. https://doi.org/10.48550/arXiv.2007.00072.\nKaplan, Jared, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. “Scaling Laws for Neural Language Models.” arXiv. https://doi.org/10.48550/arXiv.2001.08361.\nRajbhandari, Samyam, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” arXiv. https://doi.org/10.48550/arXiv.1910.02054.\nShao, Zhihong, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, et al. 2024. “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.” arXiv. https://doi.org/10.48550/arXiv.2402.03300.\nShoeybi, Mohammad, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. “Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism.” arXiv. https://doi.org/10.48550/arXiv.1909.08053.\nTeam, Kimi, Yifan Bai, Yiping Bao, Guanduo Chen, Jiahao Chen, Ningxin Chen, Ruijue Chen, et al. 2025. “Kimi K2: Open Agentic Intelligence.” arXiv. https://doi.org/10.48550/arXiv.2507.20534.\nZhou, Chunting, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, et al. 2023. “LIMA: Less Is More for Alignment.” arXiv. 
https://doi.org/10.48550/arXiv.2305.11206.\n","permalink":"https://mig217.github.io/post/2025-10-30-training-llms-for-agents/","summary":"\u003cp\u003e大家可能都已经对 LLM 很熟悉了。大概在两三年前，ChatGPT、Claude、Llama、DeepSeek 等模型相继出现，可以说是彻底改变了世界。但在使用这些强大工具的同时,一个核心问题值得探讨：\u003cstrong\u003e这些模型到底是如何训练的？\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e本文将从宏观视角梳理 LLM 的训练流程，重点关注训练 AI Agents 所需的关键技术路径,而非底层实现细节。\u003c/p\u003e\n\u003ch2 id=\"llm-training-pipeline\"\u003eLLM Training Pipeline\u003c/h2\u003e\n\u003cp\u003eLLM 的训练是一项复杂的系统工程,通常可以划分为三个核心阶段: \u003cstrong\u003e预训练(Pre-training)、经典后训练(Classic Post-training/RLHF) 和 推理强化学习(RL for Reasoning)\u003c/strong\u003e。在实际应用中,我们还会结合 提示工程(Prompting) 和 微调(Fine-tuning) 来进一步激发模型潜力。\u003c/p\u003e\n\u003ch3 id=\"general-llm-training-pipeline\"\u003eGeneral LLM Training Pipeline\u003c/h3\u003e\n\u003cp\u003e从整体上看，大语言模型（LLM）的训练分为三个阶段，每个阶段的目标、规模和挑战各不相同：\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003ePre-training\u003c/strong\u003e：在大规模文本上学习预测下一个词，建立通用知识基础；这是规模最大、成本最高的一步，瓶颈在于高质量数据和算力资源。\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eClassic Post-training / RLHF\u003c/strong\u003e：通过人类反馈强化学习，使模型输出更符合用户偏好；相比预训练，所需数据和成本大幅降低，但高度依赖优质反馈和有效评测体系。\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eRL for Reasoning\u003c/strong\u003e：让模型在回答前进行推理，提升解决数学、编程等客观问题的能力；规模和成本介于前两者之间，难点在于设计合适的 RL 环境并防止模型“自我黑客”。\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003e\u003cimg loading=\"lazy\" src=\"../../docs/images/general-llm-training-pipeline.png\"\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e三个阶段对比\u003c/strong\u003e\u003c/p\u003e\n\u003ctable\u003e\n  \u003cthead\u003e\n      \u003ctr\u003e\n          \u003cth\u003e阶段\u003c/th\u003e\n          \u003cth\u003e核心目标\u003c/th\u003e\n          \u003cth\u003e数据规模\u003c/th\u003e\n          \u003cth\u003e训练时间\u003c/th\u003e\n          \u003cth\u003e成本级别\u003c/th\u003e\n          \u003cth\u003e主要瓶颈\u003c/th\u003e\n      \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n      \u003ctr\u003e\n          
\u003ctd\u003e预训练（Pre-training）\u003c/td\u003e\n          \u003ctd\u003e学习预测下一个词，构建知识底座\u003c/td\u003e\n          \u003ctd\u003e~10 万亿 tokens\u003c/td\u003e\n          \u003ctd\u003e数月\u003c/td\u003e\n          \u003ctd\u003e千万美元级\u003c/td\u003e\n          \u003ctd\u003e高质量数据、算力资源\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e经典后训练 / RLHF（Classic Post-training / RLHF）\u003c/td\u003e\n          \u003ctd\u003e让模型符合用户偏好\u003c/td\u003e\n          \u003ctd\u003e~10 万个问题\u003c/td\u003e\n          \u003ctd\u003e几天\u003c/td\u003e\n          \u003ctd\u003e数万–十万美元\u003c/td\u003e\n          \u003ctd\u003e人类反馈数据、评测体系\u003c/td\u003e\n      \u003c/tr\u003e\n      \u003ctr\u003e\n          \u003ctd\u003e推理型 RL（RL for Reasoning）\u003c/td\u003e\n          \u003ctd\u003e提升推理和思考能力\u003c/td\u003e\n          \u003ctd\u003e百万级问题\u003c/td\u003e\n          \u003ctd\u003e数周\u003c/td\u003e\n          \u003ctd\u003e百万美元级\u003c/td\u003e\n          \u003ctd\u003eRL 环境设计、防止自我黑客\u003c/td\u003e\n      \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003ch3 id=\"成功的关键要素\"\u003e成功的关键要素\u003c/h3\u003e\n\u003cp\u003e了解了训练的宏观阶段后，我们再深入一层，看看在每个阶段中，哪些要素是决定模型成败的关键。\u003c/p\u003e","title":"Introduction to training LLMs for AI agents"},{"content":"By 2025, existing models have already become remarkably intelligent. However, even the smartest system cannot perform effectively without understanding what it is being asked to do. Prompr engineering refers to the practice of phrasing tasks in an optimal way for large language model-based chatbots. Context engineering, on the other hand, represents the next stage - aiming to automate this process within dynamic systems.\nWhat is Context Engineering? 
Tobi, from Shopify, shared an interesting post in which he expressed his appreciation for the term “Context Engineering.” Later, Karpathy followed up with a brilliant definition: Context engineering is the art and science of filling the context window with just the right information at each step of an agent’s trajectory.\nKarpathy made a powerful analogy: he compared LLMs to a computer\u0026rsquo;s CPU, and the context window to RAM. Because memory is always limited, the operating system decides what should be loaded into RAM to keep the system running effectively. Similarly, the goal of context engineering is to determine what information should be placed into the limited context window at each step of the LLM\u0026rsquo;s reasoning process.\nTypes of Context Engineering Context Engineering is an umbrella discipline that encompasses several types of contextual inputs:\nInstructions: Commonly referred to as \u0026ldquo;prompts\u0026rdquo;, these specify what the AI should do. Knowledge: Facts and data retrieved from external files or databases. Memories: Previous conversation history or reference examples that provide continuity. Tools: Information about which tools the AI can use (e.g., calculator, search engine) and the results returned by these tools. Why Is This Harder for Agents? There are two main reasons why context engineering becomes more challenging for agents:\nAgents typically handle longer-running or more complex tasks. Agents extensively use tool calling. Both characteristics lead to increased context load. For example, in multi-turn tasks, each tool call\u0026rsquo;s feedback gets written to the context window. 
The first round calls one tool, the second calls another tool\u0026hellip; As the number of rounds increases, the accumulated tool feedback in the context grows continuously, consuming significant tokens.\nDrew Breunig provides an excellent summary of context failures in his blog post, including:\nContext Poisoning: Malicious or misleading information injected into the context.\nContext Distraction: Irrelevant information that diverts attention from the main task.\nContext Confusion/Curation Errors: Poor organization or conflicting information.\nContext Clash: Contradictory information that creates confusion.\nAs context length grows, models must process more information, increasing the likelihood of errors: they may become confused due to information conflicts or be misled by injected hallucinations, producing incorrect responses. As Cognition recently emphasized in a blog post:\nContext Engineering is effectively the #1 job of engineers building AI agents.\nApproaches Lance outlines four main strategies for context engineering in his blog post:\nWriting Context: Store information outside the context window to assist the agent in completing tasks.\nSelecting Context: Selectively retrieve relevant context into the window to support execution.\nCompressing Context: Retain only the most important tokens, removing redundant or irrelevant context to save space.\nIsolating Context: Break context into segments or modules to reduce noise and enhance clarity.\nWriting Context The core idea of Writing Context is to store information outside the AI\u0026rsquo;s \u0026ldquo;short-term memory\u0026rdquo; (its context window) so it can be retrieved and referenced when needed. This mirrors how humans tackle complex problems - we take notes and build memories. 
Similarly, agents can do both:\nTaking notes -\u0026gt; Use a scratchpad\nBuilding memories -\u0026gt; Use long-term memory\nScratchpad: Temporary Notes When an Agent is carrying out a specific, single task, it needs a scratchpad to record intermediate thoughts, plans, or key information because it may need to refer back to them later.\nExample: Anthropic’s multi-agent researcher\nThe LeadResearcher begins by thinking through the approach and saving its plan to Memory to persist the context, since if the context window exceeds 200,000 tokens it will be truncated and it is important to retain the plan.\nThe scratchpad can be implemented in different ways, for example:\nSaving drafts to a file (e.g., JSON, TXT) so the Agent can review them later. Storing notes in a runtime state object (such as LangGraph\u0026rsquo;s state) because this keeps the context alive while the task is running. Core idea: Let the Agent \u0026ldquo;write as it works\u0026rdquo; because this allows it to immediately consult what it just wrote. Once the task is completed, the scratchpad is usually no longer needed.\nMemories: Experiences Across Multiple Interactions Memory is different from a scratchpad, because it is designed to store information that the Agent should remember across multiple, separate conversations or tasks. In this sense, it works more like the Agent\u0026rsquo;s accumulated \u0026ldquo;experience\u0026rdquo;.\nExample:\nGenerative Agents: create \u0026ldquo;memory chunks\u0026rdquo; because they synthesize past interactions and feedback.\nChatGPT’s memory: automatically records user preferences, tone, and commonly shared details because these improve personalization in future conversations.\nCursor \u0026amp; Windsurf: generate memories from user behavior because this supports more coherent context completion.\nAs the Agent interacts with you, new information is generated. 
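The scratchpad/memory split above can be sketched in a few lines of Python. This is a toy, framework-free version under my own naming (none of these classes come from LangGraph or any other library):

```python
import json
from pathlib import Path

class Scratchpad:
    """Task-scoped notes: written while a single task runs, discarded afterwards."""

    def __init__(self):
        self.notes: list[str] = []

    def write(self, note: str) -> None:
        self.notes.append(note)

    def read_all(self) -> str:
        return "\n".join(self.notes)

class LongTermMemory:
    """Cross-session store: persists facts (e.g., user preferences) to disk."""

    def __init__(self, path: str = "memory.json"):
        self.path = Path(path)
        self.facts = json.loads(self.path.read_text()) if self.path.exists() else {}

    def remember(self, key: str, value: str) -> None:
        self.facts[key] = value
        self.path.write_text(json.dumps(self.facts))

    def recall(self, key: str):
        return self.facts.get(key)
```

The scratchpad lives only for the duration of one task, while the memory file survives across conversations.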
The Agent continuously receives new context and updates its memory dynamically because integrating fresh input with old knowledge creates a more complete, personalized background.\nSelecting Context In the Writing Context phase, information has already been stored. Now the challenge of **Selecting Context** is to decide which part of this vast pool of information should be brought into the Agent\u0026rsquo;s limited \u0026ldquo;working memory\u0026rdquo;, because the Agent can only operate effectively on a small window of context at a time.\nAn Agent can select content it just \u0026ldquo;wrote\u0026rdquo; into the scratchpad, because that is the most immediate and relevant source. But more interesting and subtle is selecting from long-term memory. To understand this better, we can compare it to human memory types:\n| Memory Type | What is Stored | Human Example | Agent Example |\n| --- | --- | --- | --- |\n| Semantic | Facts | Things I learned in school. | Facts about a user |\n| Episodic | Experiences | Things I did. | Past agent actions |\n| Procedural | Instructions | Instincts or motor skills. | Agent system prompt |\nThese memory types can be pulled into the context window in different ways because each type supports the Agent in solving a specific kind of problem.\nPractical Applications 1. Instructions\nProcedural memory often comes from rule or instruction files, because Agents need a stable reference for style guides or tool usage. For example, when using a code agent, the CLAUDE.md file may contain guidelines or standard instructions for a project. In many cases, these files are pulled entirely into the context window, because they provide the Agent with consistent rules to follow. For instance, when you start Claude Code, it loads relevant project and organization files.\n2. Facts\nWhen the knowledge base is very large (e.g., company-wide documents or all historical emails), we need to select only the facts relevant to the current query. 
Two common techniques are:\nEmbedding-based Similarity Search: all documents are converted into vectors, and when a query comes in, the system finds the semantically closest chunks because this ensures meaning-based matching rather than exact keyword overlap. Graph Databases: store information and their relationships as a graph, because this enables more complex logical queries than plain text search. 3. Tools\nWhen an Agent has access to many tools, loading all tool descriptions indiscriminately hurts performance. Research shows that once the tool count exceeds 30, LLM performance degrades, and at 100 tools, it nearly collapses. A common solution is applying RAG techniques to tool descriptions.\nEach tool\u0026rsquo;s description is indexed like documents in a knowledge base. When a task arises (e.g., \u0026ldquo;Check the weather in Beijing tomorrow\u0026rdquo;), the system performs semantic search to find the most relevant tool (e.g., \u0026ldquo;weather query\u0026rdquo;), and because only the selected tool\u0026rsquo;s usage guide is injected into the context, efficiency and accuracy improve.\n4. Knowledge\nRAG is a broad area. 
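The tool-RAG idea from the Tools point above can be sketched with a toy similarity search. The "embedding" here is just a bag-of-words count standing in for a real embedding model, and the tool names and descriptions are made up for illustration:

```python
import math
from collections import Counter

# Hypothetical tool registry: name -> natural-language description.
TOOLS = {
    "weather_query": "Look up the weather forecast for a given city and date.",
    "calculator": "Evaluate arithmetic expressions.",
    "web_search": "Search the web for general background information.",
}

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. A real system would use a
    # sentence-embedding model and a vector index instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_tool(query: str) -> str:
    # Only the best-matching tool's description gets injected into context.
    q = embed(query)
    return max(TOOLS, key=lambda name: cosine(q, embed(TOOLS[name])))
```

A query like "What is the weather in Beijing tomorrow" matches `weather_query`, so only that one tool description is loaded rather than all of them.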
You can think of memory as a subset of RAG because memory focuses on personalization, while RAG broadly addresses knowledge retrieval.\nExample: CodiumAI\u0026rsquo;s code retrieval strategy, as shared by Windsurf\u0026rsquo;s CEO Varun:\nSmart Chunking: Code is split along semantic boundaries (e.g., a function or class) rather than arbitrarily, because this preserves meaning and ensures retrieved chunks are complete units.\nHybrid Search: Vector search is combined with keyword search (grep), knowledge graphs, etc., because relying on a single retrieval method can be unreliable.\nLLM-based Re-ranking: All candidate chunks are re-scored by an LLM, and the top results are selected, because this yields higher-quality context than raw retrieval alone.\nThis example shows that Selecting Context is not a single step, but a layered and highly engineered process, because each stage reduces noise and ensures the Agent gets the most relevant information.\nCompressing Context Even after writing and selecting information, the chosen content may still be too long, because even carefully selected data can exceed the limits of the context window. This is where Compressing Context comes in.\nThe core idea is: keep only the tokens (text units) necessary to execute the task, because removing redundant information frees up space while preserving utility.\nSummarization Summarization is the most common form of context compression, and it can be applied at different ranges and stages:\n1. Summarizing the entire conversation\nThis acts as a fallback strategy, because it prevents an overly long conversation from crashing the system.\nExample: Claude Code “auto compact”\nWhen using Claude\u0026rsquo;s code assistant, if the dialogue approaches its 200,000-token context window (at 95% usage), an \u0026ldquo;auto compact\u0026rdquo; function is triggered. It summarizes the entire conversation history into a shorter version, because this frees up space without losing essential continuity.\n2. 
Summarizing specific parts\nThis is a more precise method, because it compresses only certain sections along the way instead of waiting until the very end.\nCompleted work sections In Anthropic\u0026rsquo;s article How we built our multi-agent research system, once a subtask or research section is finished, the system summarizes that section. Because the summary preserves conclusions while discarding lengthy process details, valuable context space is saved.\nPassing context to linear sub-agents In Cognition\u0026rsquo;s post Don\u0026rsquo;t Build Multi-Agents, tasks are broken down and delegated to sub-agents. Summarization is used as a form of information handoff: before passing work to Sub-Agent 1, the main agent generates a summary. Sub-Agent 1 reads this compressed version and can start immediately, because it doesn\u0026rsquo;t need the full raw context to understand the task.\nTrimming Besides generating summaries, there is a more direct way to compress context called trimming. It does not rewrite content; instead of rephrasing, it simply deletes tokens judged to be less important.\nHeuristics:\nThis is the simplest method. For example, a rule like \u0026ldquo;keep only the last 10 turns of dialogue\u0026rdquo; works because older records are often less relevant to the current task.\nLearned Pruning:\nA smaller, faster LLM can be used to decide which parts of the context are less relevant to the task, and those parts are removed.\nIsolating Context The core idea of Isolating Context is to divide context into multiple independent segments, because letting each agent or task module only process the part of context it actually needs improves efficiency and scalability. 
This approach is especially effective for handling highly complex tasks.\nMulti-Agent Systems Instead of relying on a single \u0026ldquo;all-purpose\u0026rdquo; Agent, we can build a team of Agents, each with its own context window, tools, and instructions, so that every agent can focus on a specific responsibility.\n1. OpenAI Swarm library\nThis framework is based on the principle of separation of concerns. A complex task is broken down and distributed across different AI agents. Because each agent works within its own independent context window, they can operate without interfering with each other.\n2. Anthropic Multi-Agent Research System\nIn Anthropic\u0026rsquo;s research, sub-agents run in parallel, each exploring a different aspect of the same problem within its own context window.\nSubagents facilitate compression by operating in parallel with their own context windows, exploring different aspects of the question simultaneously before condensing the most important tokens for the lead research agent.\nThe biggest benefit is that the total information capacity of the system grows far beyond the limits of a single model\u0026rsquo;s context window. 
One agent can deeply investigate Subtopic A, another can focus on Subtopic B, and in the end, their results are combined because this produces a richer and more comprehensive report than any one agent could generate alone.\nSandboxed Execution This method separates an AI\u0026rsquo;s \u0026ldquo;thinking\u0026rdquo; from its \u0026ldquo;execution\u0026rdquo;, because keeping heavy operations outside the LLM\u0026rsquo;s context window avoids wasting tokens on irrelevant details.\nHuggingFace\u0026rsquo;s Open Deep Research\nThe Agent generates a snippet of code that describes the tools and logic it wants to run.\nThe code is executed in an isolated sandbox environment.\nThe sandbox can handle \u0026ldquo;heavy assets\u0026rdquo; such as large files, images, or audio, because these objects are too costly in tokens to load into the LLM\u0026rsquo;s context window.\nAfter execution, only the most essential and compact results (e.g., a calculation output, file path, or state variable) are passed back to the LLM for the next round of reasoning.\nThe elegance of this method lies in the fact that the sandbox can persist state across multiple dialogue turns. This means you don\u0026rsquo;t have to reload everything into the LLM context each time, because the sandbox efficiently maintains continuity while isolating token-intensive objects from the model\u0026rsquo;s working memory.\nRuntime State Objects This is a very direct, code-level way of implementing isolation, because it explicitly separates different kinds of information into structured \u0026ldquo;buckets.\u0026rdquo;\nWe can define a dedicated data model for managing state (e.g., a Pydantic model in Python). 
Each field acts like its own bucket:\nA history bucket to store conversation history.\nA file bucket to hold retrieved document contents.\nA tool output bucket for results returned by external tools.\nclass AgentState(BaseModel):\n    user_goal: str  # The user\u0026#39;s task goal\n    plan_summary: str  # Summary of the current plan\n    tool_outputs: Dict[str, Any]  # Tool output bucket\n    recent_messages: List[str]  # History bucket (dialogue history)\n    retrieved_files: Dict[str, str]  # File bucket (retrieved document contents)\nAt any stage of the Agent\u0026rsquo;s runtime, you can decide which bucket to pull information from and how to combine it before passing it into the LLM. This works because it gives developers explicit control over what enters the context, while keeping different types of state neatly separated.\n","permalink":"https://mig217.github.io/post/2025-08-20-context-engineering/","summary":"\u003cp\u003eBy 2025, existing models have already become remarkably intelligent. However, even the smartest system cannot perform effectively without understanding what it is being asked to do. \u003cstrong\u003ePrompt engineering\u003c/strong\u003e refers to the practice of phrasing tasks in an optimal way for large language model-based chatbots. 
\u003cstrong\u003eContext engineering\u003c/strong\u003e, on the other hand, represents the next stage - aiming to automate this process within dynamic systems.\u003c/p\u003e\n\u003ch2 id=\"what-is-context-engineering\"\u003eWhat is Context Engineering?\u003c/h2\u003e\n\u003cp\u003e\n  \u003ca href=\"https://x.com/tobi/status/1935533422589399127\" target=\"_blank\" rel=\"noopener\"\u003eTobi\u003c/a\u003e, from Shopify, shared an interesting post in which he expressed his appreciation for the term “Context Engineering.” \n  Later, \u003ca href=\"https://x.com/karpathy/status/1937902205765607626\" target=\"_blank\" rel=\"noopener\"\u003eKarpathy\u003c/a\u003e followed up with a brilliant definition:\n\u003c/p\u003e","title":"Context Engineering"},{"content":" The following insights are drawn from the Reasoning with o1 video course by DeepLearning.ai. This article explores how to effectively prompt and utilize the new generation of reasoning models. Models released over the past year have demonstrated remarkable progress in reasoning and planning tasks. OpenAI has deeply optimized Chain of Thought (CoT) processing, using reinforcement learning to fine-tune models so they automatically integrate step-by-step reasoning into their response process.\nWhile current model performance is already impressive, the more significant long-term development is reasoning-time scalability. Reasoning model performance improves not only with increased training compute but also with the thinking time allocated during inference (test-time or inference-time compute). This provides an entirely new dimension for scaling large model performance.\nHowever, reasoning models aren\u0026rsquo;t suitable for every scenario. This article will cover the types of tasks reasoning models excel at, and when you might need smaller, faster models, or even hybrid approaches. 
This article structure includes:\nIntroduction to Reasoning Models Designing Prompts for Reasoning Models Using Reasoning Models for Planning LLMs as Judges Meta Prompting Introduction Before reasoning models emerged, most AI models behaved like children, always blurting out the first thing that came to mind. The revolutionary breakthrough of reasoning models lies in learning a valuable skill: think before you speak. This enables them to achieve unprecedented performance levels in complex tasks including mathematics, programming, science, strategic planning, and logical reasoning.\nCoT: The Core Mechanism The key advantage of reasoning models lies in their native integration of a chain-of-thought reasoning process (@weiChainofThoughtPromptingElicits2023). Let\u0026rsquo;s understand this through an example. When we present the model with a letter scrambling problem:\noyfjdnisdr rtqwainr acxz mynzbhhx -\u0026gt; Think step by step Use the example above to decode: oyekaijzdf aaptcg suaokybhai ouow aght mynznvaatzacdfoulxxz Rather than providing an immediate answer, the model engages in the following thought process: Cipher Decoding Process\nProblem Understanding: Analyze the given example to identify patterns Hypothesis Formation: Consider whether it\u0026rsquo;s an anagram or some form of cipher Hypothesis Testing: Notice that the encrypted text is exactly twice the length of the original Iterative Refinement: When the first hypothesis fails, use existing information to form new hypotheses Optimal Path Discovery: Through continuous trial and error, ultimately find the correct solution This process encompasses the key steps humans use when solving complex problems:\nProblem and solution space identification Hypothesis development and testing Approach adjustment and path selection What makes reasoning models so special is that you don\u0026rsquo;t need complex prompts to guide them through deep thinking. 
This means reasoning models require less contextual prompting to produce high-quality results for complex tasks, truly achieving the leap from \u0026ldquo;quick reaction\u0026rdquo; to \u0026ldquo;deep thinking\u0026rdquo;.\nOf course, this deep reasoning comes with trade-offs - when using reasoning models, you need to balance reasoning quality against response speed.\nBreakthrough and Performance Leap The performance leap of reasoning models is primarily attributed to two key breakthroughs:\n1. Inference-time Compute\nResearch has found that in the model\u0026rsquo;s post-training phase, the more reinforcement learning conducted, the higher the model\u0026rsquo;s accuracy. But more surprisingly, allowing models to \u0026ldquo;think longer\u0026rdquo; during inference significantly improves result quality. By giving models more thinking time, even with the same model parameters and training data, superior performance can be achieved.\nImage source: [OpenAI](https://openai.com/index/learning-to-reason-with-llms/) 2. Consensus Voting\nAnother key breakthrough is teaching models to verify outputs through consensus voting. The mechanism works as follows:\nGenerate multiple different solutions for the same problem Train the model to select the most frequently occurring solution as the final answer In Minerva\u0026rsquo;s experiments, MATH benchmark accuracy improved from 33.6% to 50.3% (@brownLargeLanguageMonkeys2024). Experiments showed that the consensus mechanism stabilizes at around 100 samples, meaning significant performance improvements can be achieved without generating a massive number of samples.\nComparing coverage (performance with an oracle verifier) to mainstream methods available for picking the correct answer (majority voting, reward model selection and reward model majority voting) as we increase the number of samples. 
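The consensus-voting mechanism described above reduces to a majority vote over sampled answers. A minimal sketch (in a real pipeline each sample would come from an independent model generation; here they are hard-coded):

```python
from collections import Counter

def consensus_answer(samples: list[str]) -> str:
    """Majority voting: sample several solutions to the same problem,
    then return the most frequently occurring final answer."""
    return Counter(samples).most_common(1)[0][0]
```

With samples like `["42", "41", "42", "42", "40"]` the vote settles on `"42"`, even though two of the five generations disagreed.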
These breakthroughs have yielded remarkable results: Taking GPT-4o and o1 models as examples:\nMathematical Olympiad-level abilities (AIME 2024): GPT-4o achieved 13% accuracy, while o1 reached 83% Coding abilities: GPT-4o scored 11%, while o1 achieved 89% o1 greatly improves over GPT-4o on challenging reasoning benchmarks. General Mathematics (MATH): o1 achieved a massive 30% improvement over GPT-4o College-level Knowledge (MMLU): o1 improved across all categories, with college mathematics accuracy reaching an astounding 98.1% o1 improves over GPT-4o on a wide range of benchmarks, including 54/57 MMLU subcategories. Emerging Capabilities of Reasoning Models Beyond leaps in traditional benchmarks, reasoning models have demonstrated some exciting emerging capabilities.\n1. Abstract Reasoning\nWhen given 16 words and asked to find underlying categories and correctly classify them, GPT-4o\u0026rsquo;s performance was somewhat random, identifying only two categories with errors. Meanwhile, o1 perfectly identified all four categories and correctly classified all 16 words. This capability is crucial for handling abstract problems beyond standardized tests.\nAbstract reasoning 2. Generator-Verifier Gap\nFor many problems (such as mathematics, programming, puzzles), verifying a good answer is much easier than generating one from scratch. Reasoning models excel at leveraging this principle. They can:\nGenerate an initial solution Verify it and identify issues Iterate based on feedback, gradually approaching the perfect answer When this generator-verifier gap exists, we can trade more computation at inference time for better performance.\n3. 
Potential Application Areas\nThis powerful reasoning capability makes these models highly promising in the following domains:\nData Analysis: Interpreting complex datasets, such as genomic sequencing results in biology Scientific Computing: Writing and debugging specialized code for computational fluid dynamics or astrophysics simulations Experimental Design: Proposing new experimental approaches in chemistry or explaining complex physics experimental results Algorithm Development: Assisting in creating or optimizing data analysis algorithms in bioinformatics Literature Synthesis: Reasoning across multiple research papers to form coherent conclusions Effective Prompting Here are four key prompting principles that have emerged for working with reasoning models. While these principles don\u0026rsquo;t cover every scenario, they can help you explore and understand how this new generation of reasoning models differs from others.\nKeep It Simple and Direct When writing prompts, aim for clarity and conciseness. Direct instructions often produce the best results, while complex descriptions and excessive background information can actually interfere with the model\u0026rsquo;s internal reasoning process.\nNo Need for Explicit Chain-of-Thought Reasoning models no longer require manually adding \u0026ldquo;think step by step\u0026rdquo; or breaking tasks into steps in your prompts like previous models did. These models have been trained to naturally provide effective explanations and reasoning processes in their responses. Therefore, starting with a simple, direct prompt is always the best approach.\nOf course, for highly specialized tasks or complex contexts, you might still consider adding CoT-style step-by-step prompts. 
However, it\u0026rsquo;s recommended to start with simple prompts and adjust details based on output quality.\nFor example: Suppose you need a function that outputs SMILES IDs for all molecules related to insulin.\nGood Prompt\n\u0026#34;Generate a function that outputs the SMILES IDs for all the molecules involved in insulin.\u0026#34; Bad Prompt\n\u0026#34;Generate a function that outputs the SMILES IDs for all the molecules involved in insulin.\u0026#34; \u0026#34;Think through this step by step, and don\u0026#39;t skip any steps:\u0026#34; \u0026#34;- Identify all the molecules involved in insulin\u0026#34; \u0026#34;- Make the function\u0026#34; \u0026#34;- Loop through each molecule, outputting each into the function and returning a SMILES ID\u0026#34; \u0026#34;Molecules:\u0026#34; Use Structured Prompts When your prompt content becomes complex, use separators (like Markdown, XML tags, or quotes) to break it into different sections. This structured format not only improves model accuracy but also makes troubleshooting much easier.\nExample: Customer service assistant scenario\n\u0026#34;\u0026lt;instructions\u0026gt;You are a customer service assistant for AnyCorp, a provider\u0026#34; \u0026#34;of fine storage solutions. Your role is to follow your policy to answer the user\u0026#39;s question. \u0026#34; \u0026#34;Be kind and respectful at all times.\u0026lt;/instructions\u0026gt;\\n\u0026#34; \u0026#34;\u0026lt;policy\u0026gt;**AnyCorp Customer Service Assistant Policy**\\n\\n\u0026#34; \u0026#34;1. **Refunds**\\n\u0026#34; \u0026#34; - You are authorized to offer refunds to customers in accordance \u0026#34; \u0026#34;with AnyCorp\u0026#39;s refund guidelines.\\n\u0026#34; \u0026#34; - Ensure all refund transactions are properly documented and \u0026#34; \u0026#34;processed promptly.\\n\\n\u0026#34; \u0026#34;2. 
**Recording Complaints**\\n\u0026#34; \u0026#34; - Listen attentively to customer complaints and record all relevant \u0026#34; \u0026#34;details accurately.\\n\u0026#34; \u0026#34; - Provide assurance that their concerns will be addressed and \u0026#34; \u0026#34;escalate issues when necessary.\\n\\n\u0026#34; \u0026#34;3. **Providing Product Information**\\n\u0026#34; \u0026#34; - Supply accurate and helpful information about AnyCorp\u0026#39;s storage \u0026#34; \u0026#34;solutions.\\n\u0026#34; \u0026#34; - Stay informed about current products, features, and any updates \u0026#34; \u0026#34;to assist customers effectively.\\n\\n\u0026#34; \u0026#34;4. **Professional Conduct**\\n\u0026#34; \u0026#34; - Maintain a polite, respectful, and professional demeanor in all \u0026#34; \u0026#34;customer interactions.\\n\u0026#34; \u0026#34; - Address customer inquiries promptly and follow up as needed to \u0026#34; \u0026#34;ensure satisfaction.\\n\\n\u0026#34; \u0026#34;5. **Compliance**\\n\u0026#34; \u0026#34; - Adhere to all AnyCorp policies and procedures during customer \u0026#34; \u0026#34;interactions.\\n\u0026#34; \u0026#34; - Protect customer privacy by handling personal information \u0026#34; \u0026#34;confidentially.\\n\\n\u0026#34; \u0026#34;6. **Refusals**\\n\u0026#34; \u0026#34; - If you receive questions about topics outside of these, refuse \u0026#34; \u0026#34;to answer them and remind them of the topics you can talk about.\u0026lt;/policy\u0026gt;\\n\u0026#34; We use \u0026lt;instructions\u0026gt; tags to clearly define the task role and behavioral guidelines. Then we provide structured policy content through \u0026lt;policy\u0026gt; tags, making it clear to the model what \u0026ldquo;policy\u0026rdquo; specifically refers to. 
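Assembling such a structured prompt programmatically keeps the sections cleanly separated. A minimal sketch under my own function name, with tag names mirroring the example above:

```python
def build_structured_prompt(instructions: str, policy: str, question: str) -> str:
    """Wrap each section in XML-style tags so the model can tell the role
    definition, the policy it must follow, and the user input apart."""
    return (
        f"<instructions>{instructions}</instructions>\n"
        f"<policy>{policy}</policy>\n"
        f"<question>{question}</question>"
    )

prompt = build_structured_prompt(
    "You are a customer service assistant for AnyCorp.",
    "1. Refunds: follow AnyCorp refund guidelines. 6. Refusals: stay on topic.",
    "Can I get a refund for my storage unit?",
)
```

The same helper also makes troubleshooting easier: if the model misbehaves, you can inspect or swap out one clearly delimited section at a time.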
This way, the model clearly understands its role, the rules it should follow, and the tasks it needs to complete, with clear boundaries and minimal confusion.\nShow rather than Tell Instead of explaining your requirements through lengthy text descriptions, provide a relevant example that allows the model to intuitively understand the task domain and output format you expect.\nHere\u0026rsquo;s an example: We still use \u0026lt;prompt\u0026gt; and \u0026lt;policy\u0026gt; to define roles and rules, but we add an \u0026lt;example\u0026gt; tag that directly provides a sample question-answer pair to help the model understand the expected response format and citation style.\n\u0026#34;\u0026lt;prompt\u0026gt;You are a lawyer specializing in competition law, \u0026#34; \u0026#34;assisting business owners with their questions.\u0026lt;/prompt\u0026gt;\\n\u0026#34; \u0026#34;\u0026lt;policy\u0026gt;As a legal professional, provide clear and accurate \u0026#34; \u0026#34;information about competition law while maintaining \u0026#34; \u0026#34;confidentiality and professionalism. Avoid giving specific \u0026#34; \u0026#34;legal advice without sufficient context, and encourage clients \u0026#34; \u0026#34;to seek personalized counsel when necessary.\u0026lt;/policy\u0026gt;\\n\u0026#34; \u0026#34;\u0026#34;\u0026#34;\u0026lt;example\u0026gt; \u0026lt;question\u0026gt; I\u0026#39;m considering collaborating with a competitor on a joint marketing campaign. Are there any antitrust issues I should be aware of? \u0026lt;/question\u0026gt; \u0026lt;response\u0026gt; Collaborating with a competitor on a joint marketing campaign can raise antitrust concerns under U.S. antitrust laws, particularly the Sherman Antitrust Act of 1890 (15 U.S.C. §§ 1–7). Section 1 of the Sherman Act prohibits any contract, combination, or conspiracy that unreasonably restrains trade or commerce among the states. **Key Considerations:** 1. 
**Per Se Illegal Agreements:** Certain collaborations are considered automatically illegal (\u0026#34;per se\u0026#34; violations), such as price-fixing, bid-rigging, and market allocation agreements. For example, in *United States v. Topco Associates, Inc.*, 405 U.S. 596 (1972), the Supreme Court held that market division agreements between competitors are per se illegal under the Sherman Act. 2. **Rule of Reason Analysis:** Collaborations that are not per se illegal are evaluated under the \u0026#34;rule of reason,\u0026#34; which assesses whether the pro-competitive benefits outweigh the anti-competitive effects. In *Broadcast Music, Inc. v. Columbia Broadcasting System, Inc.*, 441 U.S. 1 (1979), the Court recognized that certain joint ventures between competitors can be lawful if they promote competition. 3. **Information Sharing Risks:** Sharing competitively sensitive information, such as pricing strategies or customer data, can lead to antitrust violations. The Department of Justice and the Federal Trade Commission caution against exchanges that could facilitate collusion (*Antitrust Guidelines for Collaborations Among Competitors*, 2000). **Recommendations:** - **Define the Scope:** Clearly delineate the parameters of the collaboration to focus on the marketing campaign without involving competitive aspects like pricing or market division. - **Implement Safeguards:** Establish protocols to prevent the exchange of sensitive information that is not essential to the marketing effort. - **Legal Consultation:** Given the complexities of antitrust laws, consult with a legal professional to ensure the collaboration complies with all legal requirements. **Conclusion:** While joint marketing campaigns between competitors are not inherently illegal, they must be structured carefully to avoid antitrust pitfalls. Legal guidance is essential to navigate these issues and to design a collaboration that achieves your business objectives without violating antitrust laws. 
\u0026lt;/response\u0026gt; \u0026lt;/example\u0026gt;\u0026#34;\u0026#34;\u0026#34; In practice, you can use these principles as your default starting point and gradually adjust based on task complexity. If the model doesn\u0026rsquo;t perform as expected, rather than adding more explanations, prioritize adding or improving examples first. This typically leads to more robust results.\nTask Planning with Reasoning Models When building multi-step logical tasks, efficiently leveraging models with different capabilities is key to improving performance and reducing cost. Reasoning models are especially strong in multi-step planning and problem solving. Given a task scenario, constraints, and available tools, they can quickly generate structured, logical solutions.\nHowever, using a reasoning model to execute every step can lead to high latency and unnecessary computational overhead. To address this, we adopt a two-stage strategy:\nUse a reasoning model to generate a task plan. Use a standard model to execute each step of the plan. This design is widely adopted in real-world systems. It combines intelligent reasoning with fast, cost-effective execution.\nPlan Generation + Execution The entire workflow can be broken down into three stages:\nInput Scenario \u0026amp; Constraints: The user submits a task request that requires multi-step logical reasoning. Generate Solution Plan: The reasoning model uses the task context, tool descriptions, and constraints to generate a multi-step executable plan. Execute Plan Steps: A standard model (in this case, DeepSeek-V3) executes each step of the plan in sequence to produce the final result. Core Implementation We start with a scenario, usually a customer request that requires multi-step reasoning. This scenario is passed to a planning model (e.g., deepseek-reasoner) which is equipped with tools and planning instructions. It produces a structured execution plan. 
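The two-stage plan/execute loop can be sketched as follows. The model calls are stubbed out as plain callables, since the actual API client wiring (e.g., for DeepSeek's endpoints) depends on your setup; the function names here are my own:

```python
from typing import Callable

def run_task(scenario: str,
             plan_model: Callable[[str], str],
             exec_model: Callable[[str], str]) -> list[str]:
    """Stage 1: a reasoning model drafts a numbered, step-by-step plan.
    Stage 2: a cheaper chat model executes each step in sequence."""
    plan = plan_model(f"Create a numbered, step-by-step plan for: {scenario}")
    results = []
    for step in plan.splitlines():
        if step.strip():  # skip blank lines between plan steps
            results.append(exec_model(f"Execute this step: {step}"))
    return results
```

In production, each callable would wrap a chat-completions request to the planning model and the execution model respectively, so the expensive reasoning model is invoked once while the cheap model handles every individual step.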
We then pass this plan to an execution model (e.g., deepseek-chat) to carry out each step. Once all steps are complete, the final answer is returned to the user.

**Planning Model Prompt Design**

The core task of the planning model is to understand the scenario and generate a structured execution plan. Its prompt consists of several key components:

- **Role Setting:** Defines the model's role and tasks it with creating an executable plan to fulfill a user request.
- **Tool Description:** Lists the functions available to the execution model. The planning model does not call these functions directly but must understand their capabilities to design a valid plan.
- **Plan Structure Instructions:** Guidelines for organizing steps so the execution model can parse and execute them accurately.

```python
# Prompt for the planning model
planning_prompt = """
You are a supply chain management assistant. The first input you will receive will be a complex task that needs to be carefully reasoned through to solve.
Your task is to review the challenge, and create a detailed plan to process customer orders, manage inventory, and handle logistics.

You will have access to an LLM agent that is responsible for executing the plan that you create and will return results.

The LLM agent has access to the following functions:
- get_inventory_status(product_id)
  - This gets the currently available product that we have
- get_product_details(product_id)
  - This function gets the necessary components we need to manufacture additional product
...

When creating a plan for the LLM to execute, break your instructions into a logical, step-by-step order, using the specified format:
- **Main actions are numbered** (e.g., 1, 2, 3).
- **Sub-actions are lettered** under their relevant main actions (e.g., 1a, 1b).
- **Sub-actions should start on new lines**
- **Specify conditions using clear 'if...then...else' statements** (e.g., 'If the product was purchased within 30 days, then...').
- **For actions that require using one of the above functions defined**, write a step to call a function using backticks for the function name (e.g., `call the get_inventory_status function`).
- Ensure that the proper input arguments are given to the model for instruction. There should not be any ambiguity in the inputs.
- **The last step** in the instructions should always be calling the `instructions_complete` function. This is necessary so we know the LLM has completed all of the instructions you have given it.
- **Detailed steps** The plan generated must be extremely detailed and thorough with explanations at every step.

Use markdown format when generating the plan with each step and sub-step.

Please find the scenario below.
"""
```

**Execution Model Prompt Design**

Once a plan is generated, the execution model is responsible for carrying it out. The system prompt clearly defines its role and behavior, including:

- **Define Core Responsibility:** We define its role and emphasize that its primary task is to "strictly follow the given policy." This prevents it from deviating from the intended workflow.
- **Explain Decision-Making:** We require the model to explain the logic behind each step it takes. This provides real-time insight into task progress and helps identify potential issues quickly.
- **Chain-of-Thought:** We provide the execution model with clear CoT instructions. This helps guide its reasoning toward more accurate judgments when performing specific, detailed steps.

```python
# System prompt for the execution model
execution_prompt = """
You are a helpful assistant responsible for executing the policy on handling incoming orders. Your task is to follow the policy exactly as it is written and perform the necessary actions.

You must explain your decision-making process across various steps.

# Steps

1. **Read and Understand Policy**: Carefully read and fully understand the given policy on handling incoming orders.
2. **Identify the exact step in the policy**: Determine which step in the policy you are at, and execute the instructions according to the policy.
3. **Decision Making**: Briefly explain your actions and why you are performing them.
4. **Action Execution**: Perform the actions required by calling any relevant functions and input parameters.

POLICY:
{policy}
"""
```

**Process Orchestration**

After setting up prompts, variables, and tool functions, we define a main controller function to run the whole process:

1. **Receive Scenario:** The function takes the initial customer scenario as input.
2. **Generate Plan:** It calls the planning model to generate a detailed, step-by-step action plan based on the scenario.
3. **Initiate Execution Loop:** It initializes an execution model "worker" and provides it with the plan. This worker enters a loop, using the defined function-calling mechanism to execute each step.
4. **Return Results:** When all steps are complete, the loop terminates. The conversation history and final results are returned, clearly showing how the system addressed and completed the task.
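The controller function calls a `call_planning(scenario)` helper that is not shown in this excerpt. The following is a minimal sketch of what it might look like, assuming a module-level OpenAI-compatible `client`, the `planning_prompt` defined earlier, and a hypothetical `R1_MODEL` constant naming the reasoning model:

```python
R1_MODEL = "deepseek-reasoner"  # assumption: the reasoning model used for planning


def build_planning_messages(system_prompt, scenario):
    """Planning instructions go in the system role; the scenario is the user turn."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": scenario},
    ]


def call_planning(scenario):
    """Ask the reasoning model for a structured, step-by-step plan (plain text)."""
    # `client` and `planning_prompt` are assumed to be defined at module level,
    # mirroring how call_execution uses a shared client.
    response = client.chat.completions.create(
        model=R1_MODEL,
        messages=build_planning_messages(planning_prompt, scenario),
    )
    return response.choices[0].message.content
```

Note that the planning call passes no `tools` argument: the plan is returned as markdown text, and only the execution model actually invokes functions.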
```python
def process_scenario(scenario):
    append_message({'type': 'status', 'message': 'Generating plan...'})
    plan = call_planning(scenario)
    append_message({'type': 'plan', 'content': plan})
    append_message({'type': 'status', 'message': 'Executing plan...'})
    messages = call_execution(plan)
    append_message({'type': 'status', 'message': 'Processing complete.'})
    return messages
```

The `call_execution(plan)` function is the core of this process. It receives the plan from the reasoning model and initiates a while loop to interact with the execution model:

- In each iteration of the loop, it provides the current plan and conversation history to the execution model.
- The execution model analyzes the information and decides whether to make a tool call.
- The system intercepts this request, executes the corresponding Python function, and returns the result as a new message.
- This process repeats until the execution model calls a special `instructions_complete` function, signaling that all tasks are finished. At this point, the loop breaks, and the execution is complete.
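The execution loop refers to `TOOLS` and `function_mapping`, neither of which appears in the excerpt. A plausible sketch follows; the schemas, stub inventory data, and product IDs are invented for illustration, and a real system would define one schema/implementation pair per business function:

```python
# Stub inventory data, for illustration only.
_INVENTORY = {"P100": 42, "P200": 0}


def get_inventory_status(product_id):
    """Return the currently available stock for a product (stub data)."""
    return {"product_id": product_id, "available": _INVENTORY.get(product_id, 0)}


def instructions_complete():
    """Sentinel function the execution model calls once the plan is finished."""
    return {"status": "complete"}


# OpenAI-style function schemas advertised to the execution model.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_inventory_status",
            "description": "Get the currently available quantity of a product.",
            "parameters": {
                "type": "object",
                "properties": {"product_id": {"type": "string"}},
                "required": ["product_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "instructions_complete",
            "description": "Signal that every step of the plan has been executed.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]

# Maps a tool-call name from the model to the local Python implementation.
function_mapping = {
    "get_inventory_status": get_inventory_status,
    "instructions_complete": instructions_complete,
}
```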
```python
import json


def call_execution(plan):
    # Fill the policy into the system prompt template. Binding the result to a
    # new name avoids shadowing the module-level execution_prompt.
    system_prompt = execution_prompt.replace("{policy}", plan)
    messages = [
        {'role': 'system', 'content': system_prompt},
    ]
    while True:
        response = client.chat.completions.create(
            model=V3_MODEL,
            messages=messages,
            tools=TOOLS,
            parallel_tool_calls=False
        )
        assistant_message = response.choices[0].message.to_dict()
        print(assistant_message)
        messages.append(assistant_message)
        append_message({'type': 'assistant',
                        'content': assistant_message.get('content', '')})

        # The model signals completion by calling instructions_complete.
        if (response.choices[0].message.tool_calls
                and response.choices[0].message.tool_calls[0].function.name == 'instructions_complete'):
            break

        if not response.choices[0].message.tool_calls:
            continue

        for tool in response.choices[0].message.tool_calls:
            tool_id = tool.id
            function_name = tool.function.name
            input_arguments_str = tool.function.arguments
            append_message({'type': 'tool_call',
                            'function_name': function_name,
                            'arguments': input_arguments_str})
            try:
                input_arguments = json.loads(input_arguments_str)
            except (ValueError, json.JSONDecodeError):
                continue

            # Dispatch to the local Python implementation of the tool.
            if function_name in function_mapping:
                try:
                    function_response = function_mapping[function_name](**input_arguments)
                except Exception as e:
                    function_response = {'error': str(e)}
            else:
                function_response = {'error': f"Function '{function_name}' not implemented."}

            try:
                serialized_output = json.dumps(function_response)
            except (TypeError, ValueError):
                serialized_output = str(function_response)

            # Feed the tool result back to the model as a tool message.
            messages.append({
                "role": "tool",
                "tool_call_id": tool_id,
                "content": serialized_output
            })
            append_message({'type': 'tool_response',
                            'function_name': function_name,
                            'response': serialized_output})
    return messages
```

**Define and Run a Scenario**

With everything in place, we define a specific business scenario to kick off the workflow. The scenario includes:

- **Mandatory Instruction:** The customer must be notified before the final set of operations is completed.
- **Prioritization Rule:** The strategy should prioritize fulfilling orders with existing inventory first, before placing new purchase orders for out-of-stock components.

```python
# Example usage
scenario_text = ("We just received a major shipment of new orders. "
                 "Please generate a plan that gets the list of awaiting "
                 "orders and determines the best policy to fulfill them.\n\n"
                 "The plan should include checking inventory, ordering "
                 "necessary components from suppliers, scheduling production "
                 "runs with available capacity, ordering new components "
                 "required from suppliers, and arranging shipping to the "
                 "retailer's distribution center in Los Angeles. Notify "
                 "the customer before completing.\n\n"
                 "Prioritize getting any possible orders out that you can "
                 "while placing orders for any backlog items.")

# Process the scenario
messages = process_scenario(scenario_text)
```

Finally, we pass this scenario to the system to generate a plan and observe how it autonomously executes the complex task from start to finish.

## LLM as a Judge

*The above is from the post "How to Use Reasoning Models?", with insights drawn from the Reasoning with o1 video course by [DeepLearning.ai](https://learn.deeplearning.ai/courses/reasoning-with-o1/lesson/h8dkv/introduction).*

---

In today's rapidly evolving AI landscape, the remarkable progress we've witnessed is largely attributed to open scientific research and fully open models. However, as time progresses, more and more research and development work is becoming increasingly closed off.

We still need to delve deeper into how language models work, improve their capabilities, and make them safer, more efficient, and more reliable. Simultaneously, we need to extend language models' abilities beyond text into domains like healthcare, science, and even complex decision-making processes.
Most importantly, we must bring these models into real-world applications, ensuring they are deployable, interpretable, and effectively mitigate biases and risks.

To achieve these goals, we need:

- **Fully open language models:** including data, code, and training details
- **Transparent research processes:** facilitating review and understanding
- **Reproducible results:** driving robust scientific progress
- **Accessible ecosystems:** supporting broader researcher participation

At the University of Washington and AI2, this commitment to openness is foundational. Through two major initiatives, OLMo (Open Language Models) and Tulu (an open post-training framework), they are building a fully open language model ecosystem that spans pretraining, post-training, intermediate training stages, and agent development.

In this post, I'll introduce three of their key efforts to improve reasoning in language models, each representing a distinct but complementary stage of the development pipeline:

- **Pretraining:** OLMo, Dolma
- **Post-training:** Tulu, OpenInstruct
- **Test-time inference:** S1, Self-RAG, OpenScholar

## Overview

### OLMo 2

OLMo 2 (@olmo2OLMo22025) is available in two sizes: 7B and 13B parameters. Despite being trained on significantly fewer tokens, both versions achieve performance on par with Llama 3 and Qwen 2.5 across standard open-source evaluation benchmarks.

On the "compute vs. model quality" curve, OLMo 2 falls on the so-called Pareto optimal frontier, demonstrating that with careful data curation and training strategies, it is possible to achieve competitive results without relying on massive computational resources.

### Tulu 3

Tulu 3 (@lambertTulu3Pushing2025) is AI2's latest instruction-tuned model, built on top of Llama 3-405B.
Through multi-stage instruction tuning, safety alignment, and tool-augmented reasoning, Tulu 3 now surpasses DeepSeek V3 and comes close to GPT-4o on reasoning-intensive tasks.

In the following sections, I will walk through this post-training pipeline step by step and explain how each component contributes to the final performance gains.

## Post-Training: Tulu

We start with post-training because most of the "reasoning capabilities" in modern large language models are developed and strengthened during this stage. Building modern LLMs typically involves two main phases:

- **Pretraining:** Models learn to predict the next token through large-scale data (primarily from the internet). This stage produces base models with certain general capabilities, but they are not yet safe and lack instruction-following and strong reasoning abilities.
- **Post-training:** Fine-tuning the base models to enable them to understand human intentions, use tools, perform reasoning, and comply with safety and regulatory requirements.

> The base pre-trained LMs are neither safe nor robust for public use and interactions, thus require post-training. (@Tulu3Opens)

### Tulu: Open Instruction Tuning Recipe

Tulu is an open, reproducible post-training methodology with leading performance. The post-training pipeline consists of three key steps, with iterative adjustments and refinements based on model feedback between these steps.

*An overview of the Tülu 3 recipe.*

**Instruction Tuning:** Fine-tuning base models using large amounts of "instruction + response" data (which can be human-annotated or synthetically generated by models) to make them better at executing instruction tasks.

**Preference Tuning:** Collecting human preference feedback (e.g., "which response do you prefer?") to train models to generate answers that better align with human expectations.
Tulu 3 systematically compares the DPO (Direct Preference Optimization) and PPO (Proximal Policy Optimization) methods.

**Reinforcement Learning with Verifiable Rewards:** Building upon RLHF with further innovation, fine-tuning models through controllable and verifiable reward signals to enhance their robustness in complex scenarios.

In addition, the following four steps are also crucial for successful model adaptation:

1. Establish clear evaluation criteria for each target capability, such as mathematics, programming, safety, etc.
2. Design representative task prompts for these capabilities
3. Ensure data compliance and legality, avoiding copyright issues
4. Decontaminate data to ensure no overlap between evaluation and training sets

### Step 1: Supervised Finetuning

The first step in the Tulu training pipeline is Supervised Finetuning (SFT), aimed at giving pretrained language models basic task execution capabilities. SFT involves fine-tuning models using large amounts of "Prompt + Completion" samples to teach them how to respond to human input.

To construct high-quality instruction data, the Tulu team proposed a "dual-track data construction strategy" that can be executed in parallel:

- **Data Curation:** Design and collect high-quality samples around core tasks, such as dialogue, programming, reasoning, etc.
- **Data Mixing:** Combine human-annotated data with model-generated data to cover broad capabilities and improve generalization

In model evaluation, the team systematically tested performance across multiple capability dimensions (dialogue, knowledge, reasoning, code, multilingual, safety).
To further improve training effectiveness, they adopted the following strategies to optimize data configuration:

- Select datasets that perform exceptionally well on specific tasks
- Mix human and synthetic data to ensure diversity and scale
- Adjust data proportions by task capability dimension to achieve training balance

*Comparison of different instruction tuning datasets, showing that different instruction-tuning datasets can excel in different aspects, and mixtures perform best on average.*

**Data Challenges and Solutions for Reasoning Capabilities**

Compared to tasks that output single answers, reasoning problems are often more complex, requiring models to have multi-step thinking capabilities. Research shows that CoT data is extremely effective for such tasks. However, high-quality CoT data often requires expert step-by-step annotation, which is expensive, inefficient, difficult to scale, and lacks diversity in style.

To address these challenges, the Tulu team proposed a "Persona-Driven data generation" method based on the paper "Scaling Synthetic Data Creation with 1,000,000,000 Personas" (@geScalingSyntheticData2025):

- Design specific personas (such as chemists, children, programmers) for particular skills (like mathematics, coding)
- Have models generate tasks and problem-solving processes based on persona settings, improving content diversity and scalability

*Personas can work with a wide range of data synthesis prompts (e.g., "create a math problem") to guide an LLM to synthesize data with corresponding perspectives.*

The Tulu team designed approximately 250,000 personas and guided models to generate three types of core task data, combined with GPT-4o and Claude Sonnet to complete step-by-step solutions, forming complete CoT samples.

Experimental results show:

- After adding persona data, models significantly improved on mathematical tasks, especially on complex problems.
- Improvement on simple problems like GSM8K was relatively limited.

To further improve quality, Tulu introduced a GPT-4 self-consistency voting mechanism, retaining optimal solution paths and filtering out nearly 40% of noisy samples. Ultimately, retaining only 60% of the data still achieved higher accuracy.

**Other approaches to generating CoT data**

- **Manual Human Annotation** (e.g., the GSM8K dataset): Annotators write step-by-step solutions
  - High-quality reasoning traces
  - Limited scale (only 7K)
  - Lack of diversity in reasoning styles
- **Program-Aided Language Models (PAL):** Convert math problems into Python code execution traces
  - Guarantee correctness through execution
  - Less natural language reasoning, less intuitive
  - Limited to problems that can be coded
- **Self-generated CoT (self-ask):** Use LLMs to generate their own reasoning paths
  - Scalable to many problems
  - Quality highly dependent on the base model

**Capability-driven Data Mixing**

*Data mixing for SFT.*

- Training on real user interactions with strong models is helpful almost across the board.
- Safety training is largely orthogonal to the other skills.
- Persona-based data synthesis is very useful for targeting new skills.

*Performance during our SFT ablations, showing the effect of removing safety, WildChat, Persona, and Math data in isolation.*

**SFT performance potential**

SFT mixtures show strong performance, achieving a higher average score than other comparable mixes. All models, including Tülu 2 SFT, were trained on either Llama 3.0 or 3.1. Our final Tülu 3 70B model was used to help format this table.

*Summary of the performance of our Tülu 3 SFT models against comparable baselines.*

### Step 2: Preference Tuning

After supervised fine-tuning, we move to preference tuning to align the model with human preferences.

The key idea is: instead of training on "correct" answers, we train on preference comparisons.
For example:

- Input: "Write a haiku about AI"
- Output 1: "Sure, here's a haiku: ..." 👍
- Output 2: "Sorry, I cannot help you with that." 👎

In this case, human annotators (or even AI models, in the case of RLAIF, Reinforcement Learning with AI Feedback) choose the better response. This feedback creates strong training signals that improve the style, helpfulness, and conversational quality of responses.

While preference tuning continues to enhance skills developed in SFT, its biggest gains are in areas like tone, clarity, and user alignment, not necessarily raw task performance like math.

Since we can't use standard supervised learning on preference comparisons, we need specialized algorithms like RLHF (Reinforcement Learning with Human Feedback).

**RLHF**

RLHF (@christianoDeepReinforcementLearning2023) uses a reinforcement learning framework to incorporate preference data. In this framework:

- **Policy:** the language model responsible for generating the next token
- **State:** the user's input prompt
- **Action:** the model-generated response
- **Environment:** a reward model trained on preference data, which determines which response is better

The reward model is a specially trained neural network that takes a prompt and multiple candidate responses as input, and outputs a preference score. This score guides the model to learn to produce responses more aligned with human preferences.

*Source: [HuggingFace](https://huggingface.co/blog/rlhf)*

To optimize this process, algorithms like PPO (Proximal Policy Optimization) are commonly used. In newer approaches, DPO (Direct Preference Optimization) is also gaining popularity. Let's take a closer look at how these methods work.

**Unpacking DPO vs. PPO**

The core challenge of preference tuning lies in balancing two goals:

1. **Maximizing rewards:** training the model to produce outputs aligned with human preferences
2. **Staying close to the base model:** avoiding drastic shifts that could degrade core capabilities

**PPO: Proximal Policy Optimization**

PPO (@schulmanProximalPolicyOptimization2017) follows the traditional reinforcement learning approach, in two steps:

1. Train a reward model using human preference data (e.g., A is better than B). This model learns to assign scores based on human feedback.
2. Optimize the policy model (i.e., the language model) using RL to generate outputs that maximize the learned reward.

PPO delivers strong performance but is complex to implement, costly to train, and requires maintaining both a policy and a reward model.

**DPO: Direct Preference Optimization**

DPO (@rafailovDirectPreferenceOptimization2024a), proposed more recently, simplifies this process. Instead of training a separate reward model, it treats preference data as a ranking problem (e.g., A > B) and directly updates the policy model using a derived objective.

DPO is simpler to implement and more efficient to train, and has inspired variants like SimPO (@mengSimPOSimplePreference2024) and length-normalized DPO (@IterativeLengthRegularizedDirect) for greater flexibility.

Understanding DPO and PPO.
Source: [labellerr](https://www.labellerr.com/blog/dpo-vs-ppo-for-llm-all/)

The figure below summarizes findings from a recent study on preference tuning in the Tulu system (@ivisonUnpackingDPOPPO2024a).

*Performance improvements resulting from changing different components in the preference training of TÜLU.* (@ivisonUnpackingDPOPPO2024a)

Key insights:

- **Data quality is the single most important factor:** upgrading data led to a 56% -> 61% performance jump
- **PPO consistently outperforms DPO**, but DPO's simplicity makes it attractive for real-world deployment
- **Large reward models offer diminishing returns**
- **Domain-specific prompting has a strong effect:** to boost performance in areas like code, math, or creative writing, use targeted prompts and preference data from that domain.

**Building Tulu 3**

To build Tulu 3, the team systematically integrated the techniques discussed earlier, optimizing each key component of the pipeline (@lambertTulu3Pushing2025).

*Pipeline for generating and scaling preference data, based on Ultrafeedback (Cui et al., 2023).*

**Data strategy and prompt selection**

Since prompt selection plays a critical role, the team constructed a diverse dataset by combining different types of prompts:

- Prompts reused from SFT, to retain continuity and maintain accuracy;
- New prompts not seen during SFT, to improve generalization;
- Out-of-domain prompts, to broaden the model's ability to handle unfamiliar topics.

**Response generation from multiple models**

To generate high-quality preference data, the team collected responses from a range of models, from smaller Llama-7B variants to top-tier models like GPT-4o. This diversity enabled the creation of strong contrastive examples for preference learning.

Because of the importance of real-world performance, the team also included on-policy completions: responses generated by the current version of Tulu 3 itself.
This helped ensure that preference data reflected how the model behaves in practice, allowing the system to learn when its own responses were better or worse than the alternatives.

**RLAIF and Preference Optimization**

For preference modeling, the team used Reinforcement Learning from AI Feedback (RLAIF). They employed GPT-4o as a judge to evaluate responses across four dimensions: helpfulness, instruction-following, truthfulness, and honesty. Each comparison was binarized into a chosen vs. rejected label to support supervised preference training.

For optimization, they tested several algorithms, including DPO, PPO, and CPO, but found limited gains from PPO. Given DPO's simplicity and effectiveness, they adopted it as the main method.

**Key findings of data ablations**

The experiments revealed several critical insights:

- **LLM as Judge:** GPT-4o consistently provided the most reliable preference judgments across models. (*Performance of DPO models trained on preference annotations by different LLM judges.*)
- **On-policy vs. Off-policy:** Adding on-policy data yielded significantly better results than relying solely on off-policy data.
- **SFT vs. New Prompts:** Introducing new and out-of-domain prompts improved overall performance.

With SFT and preference optimization complete, the team introduced a third novel step: Reinforcement Learning with Verifiable Rewards, which will be discussed in the next section.

### Step 3: Reinforcement Learning with Verifiable Rewards

After completing preference tuning with methods like DPO, the Tulu team further examined how model performance evolved with increased training steps across different tasks:

- **AlpacaEval:** Performance quickly plateaued, showing limited further gains.
- **IFEval:** Accuracy in following complex instructions began to decline as training continued.
- **GSM8K (Reasoning):** Initially improved, but quickly overfit and degraded.
These trends suggest that for more complex tasks, like reasoning and instruction following, over-optimization becomes a real concern, leading to performance drop-offs rather than improvements.

**Rethinking the Reward Model**

Trained on human preference data, neural reward models assign scalar scores to responses (e.g., 10.5), indicating how "good" each response is. However, these scores are often difficult to interpret and may not align well with actual task objectives.

Consider a simple example:

- Prompt: "What is 2 + 2?"
- Expected Answer: "4"

A neural reward model might return scores like 1.0, 5.5, or 1000, offering little insight into correctness. For tasks with objectively verifiable outcomes, such scoring can be misleading.

This insight led the team to propose a simpler, more transparent solution: for tasks with verifiable outcomes, such as math and programming, it's more effective to replace neural reward models with rule-based reward functions. These are easier to interpret, more aligned with task objectives, and offer a clearer signal for optimization.

**RLVR Recipe and Analyses**

The idea of replacing human preference signals with verifiable rewards is not unique. Earlier this year, the DeepSeek-V3 model adopted a similar philosophy, highlighting the growing momentum and promise of this direction.

**Experimental Setup**

- **Starting point:** Began with the Tulu 3 model that had already been optimized via DPO
- **Environment:** Targeted datasets paired with automatic verifiers to evaluate model outputs
- **Algorithm:** Returned to classical RL, specifically leveraging PPO
- **Datasets:** Focused on three datasets where answers can be objectively verified: GSM8K, MATH, and IFEval.

Some verification tasks, like math reasoning, are straightforward: simply check if the predicted answer matches the correct one (e.g., if prediction == answer -> 1, else -> 0).
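Such a verifier is trivial to express in code. A minimal sketch of the exact-match check just described (the whitespace normalization is an illustrative detail, not from the paper):

```python
def verifiable_reward(prediction: str, answer: str) -> int:
    """Rule-based reward for math-style tasks: 1 on exact match, 0 otherwise."""
    # Trim surrounding whitespace so trivially equivalent strings still match.
    return 1 if prediction.strip() == answer.strip() else 0
```

Unlike a neural reward model's opaque scalar, this signal is fully interpretable: a reward of 1 means the answer was correct, nothing more and nothing less.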
Others, such as constraint satisfaction in instruction following, are more nuanced. These require checking which constraints are met and computing an overall satisfaction rate. Nonetheless, the underlying principle remains similar.

**Results and Observations**

- **GSM8K:** Consistent improvements were observed in both stages, with the most significant gains achieved when RLVR was applied after DPO. Notably, there was no sign of overfitting.
- **MATH:** A slight dip in performance was seen when starting from DPO, but this quickly rebounded with continued training.
- **IFEval:** Strong improvements emerged when starting from the SFT checkpoint. However, gains from the DPO starting point were smaller, likely due to limited training data, according to the team's analysis.

**Scaling Up: From 7B to 405B**

The Tulu team scaled this "three-stage" RLVR recipe (SFT -> DPO -> RLVR) across model sizes from 7B to 70B, and all the way up to 405B. The results were compelling:

- On various benchmarks, the final 405B model achieved performance on par with GPT-4o and surpassed DeepSeek V3. (*Summary of Tülu 3 results relative to peer 405B models.*)
- The 8B and 70B versions of Tulu 3 significantly outperformed other open-source models of similar scale, such as Qwen-Instruct and Llama-3-Instruct. In fact, these models reached performance levels comparable to small proprietary models like GPT-4o-mini and Claude 3 Haiku. (*Overview of results on the Tülu 3 Eval suite, over both 8B and 70B models.*)

One particularly interesting insight: RLVR delivers greater gains at scale.
This aligns with the team's hypothesis that larger, stronger base models are better positioned to benefit from reinforcement via verifiable rewards.

## Test-Time Inference

A significant area of current research focuses on enhancing model performance during the inference phase, particularly by improving reasoning capabilities at test time.

### A Minimal Recipe for Reasoning & Test-Time Scaling

We begin by introducing the paper *s1: Simple test-time scaling* (@muennighoffS1SimpleTesttime2025), which presents a minimalist yet powerful approach to improving model reasoning through test-time scaling.

Similar to other advancements in the language model domain, this method's core lies in a meticulously curated dataset, named s1K. This dataset is then paired with a straightforward test-time scaling algorithm to produce the final s1 model.

**Data Curation**

The s1K dataset was constructed by filtering a large collection of advanced reasoning problems, including mathematics, logic puzzles, and probability questions. The complexity of this data significantly exceeds that of previous datasets like Tulu 3 (which primarily contains elementary to high school-level math), focusing instead on highly challenging problems, such as those found in Olympiad-level math competitions.

The data curation process involved several key steps:

1. **Initial collection:** 59k problems spanning logic puzzles, mathematics, and other domains
2. **Quality filtering:** reduced to 52k
3. **Difficulty filtering:** reduced further to 24k
4. **Diversity optimization:** final selection of 1k unique and challenging questions

Interestingly, benchmark evaluations revealed that performance using the curated 1k dataset was nearly identical to performance using the full 59k dataset.

**Distill Reasoning Traces & Answers**

Once the problems were selected, they were annotated with detailed reasoning traces and answers.
For instance, given the following problem:
An often-repeated fun fact is that humans produce more power per unit volume than stars. If the sun were the same size, but it produced the same amount of power per unit volume as a human, what would its surface temperature be?... The researchers initially used Google's Gemini model to generate CoT annotations. These annotations intentionally included "thinking" tokens (e.g., "that happens, but let me think more") to capture the reasoning process.
In the latest version of s1, these annotations were replaced with results from DeepSeek R1, which unexpectedly led to a significant improvement in the final performance.
The resulting dataset spans a wide range of domains, from geometry and number theory to control theory and astronomy.
s1K and s1-32B. (left) s1K is a dataset of 1,000 high-quality, diverse, and difficult questions with reasoning traces. (right) s1-32B, a 32B parameter model finetuned on s1K, is on the sample-efficiency frontier. Test-Time Scaling with Budget Forcing
Researchers employed a surprisingly simple yet highly effective method called budget forcing.
The mechanism is straightforward: when the model generates a response to a prompt (e.g., "How many r's are in the raspberry?"), its output length is checked against a predefined token budget. If the output is shorter than the budget, a special wait token is appended to the sequence, prompting the model to continue generating. The wait token acts as a hint, effectively telling the model, "We are not sure your answer is complete; please continue thinking."
Budget forcing with s1-32B. Training and Results
A Qwen 32B model was fine-tuned on the s1K data.
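The budget-forcing check described above is simple enough to sketch in a few lines. The following is purely illustrative: the `ToyModel` class and its `generate`/`count_tokens` interface are hypothetical stand-ins, not the s1 codebase.

```python
class ToyModel:
    """Hypothetical stand-in for an LLM: emits a fixed number of 'tokens'
    (whitespace-separated words) per call, so the loop can be exercised."""
    def __init__(self, tokens_per_call=4):
        self.tokens_per_call = tokens_per_call

    def generate(self, prompt, max_tokens):
        return " think" * min(self.tokens_per_call, max_tokens)

    def count_tokens(self, text):
        return len(text.split())


def budget_forced_generate(model, prompt, min_budget, max_budget, wait_token=" Wait"):
    """Budget forcing: if generation stops below the minimum thinking budget,
    append a wait token and ask the model to keep reasoning."""
    output = model.generate(prompt, max_tokens=max_budget)
    while model.count_tokens(output) < min_budget:
        output += wait_token  # the hint: "please continue thinking"
        remaining = max_budget - model.count_tokens(output)
        output += model.generate(prompt + output, max_tokens=remaining)
    return output
```

With a real model, `generate` would be a decoding call and `count_tokens` the tokenizer's length function; the control flow (check length, append the wait token, continue) is the whole trick.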
The results demonstrate a clear scaling trend:
MATH500 Dataset: As the allocated token budget was increased from 512 to 2048, the model's accuracy consistently improved, demonstrating a clear scaling law.
AIME24 & GPQA Datasets: On these more challenging datasets, the model was prompted to generate even longer responses (exceeding 8,000 tokens). Again, performance scaled positively with the number of generated tokens.
Test-time scaling with s1-32B. Researchers compared different test-time scaling methods. Budget forcing, a form of sequential scaling, produced a steeper performance curve and proved more effective than parallel scaling methods. Parallel approaches, such as generating multiple reasoning paths and using majority voting or self-consistency checks, showed some gains but were less significant.
Sequential and parallel test-time scaling. Ablation studies further validated these findings. The performance difference between the 1k s1K dataset and the full 59k dataset was minimal. However, using a randomly selected 1k sample resulted in significantly worse performance, underscoring the critical importance of high-quality, curated data.
s1K data ablations. Self-Guided Generation at Inference The Self-RAG framework (@asaiSelfRAGLearningRetrieve2023) introduces an innovative approach based on RAG. Its unique characteristic is a language model trained not only to generate content but also to actively critique its output and self-improve.
This process establishes a feedback loop. During generation, the model periodically inserts critic tokens to evaluate whether its response is sound and whether the retrieved documents are relevant. This mechanism allows the model to dynamically optimize its answers during inference, enabling more powerful test-time scaling.
Overview of SELF-RAG.
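The Self-RAG feedback loop can be caricatured as a retrieve-generate-critique cycle. The sketch below is schematic only: the actual system emits learned reflection tokens inside a single decoding pass, whereas here `retrieve`, `generate`, and `critique` are hypothetical caller-supplied functions.

```python
def self_rag_answer(query, retrieve, generate, critique, max_rounds=3):
    """Illustrative retrieve -> generate -> critique loop.
    `critique` returns (is_supported, is_relevant); while the critic is
    unhappy, we either re-retrieve evidence or revise the draft answer."""
    docs = retrieve(query)
    answer = generate(query, docs)
    for _ in range(max_rounds):
        supported, relevant = critique(query, docs, answer)
        if supported and relevant:
            return answer  # critic is satisfied; stop refining
        if not relevant:
            # Evidence missed the mark: retrieve again, conditioning
            # on the current draft for extra context.
            docs = retrieve(query + " " + answer)
        answer = generate(query, docs)  # revise the draft
    return answer
```

The key design point is that critique happens during generation rather than as a separate post-hoc verifier, so each extra round of refinement is a form of test-time compute.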
Further research shows this self-improvement loop is especially effective for tasks requiring substantial reasoning, such as synthesizing scientific literature and answering complex scientific questions (@asaiOpenScholarSynthesizingScientific2024).
You can interact with a live demo at openscholar.allen.ai. There, you can input complex queries and observe how the model retrieves, integrates information from multiple sources, and constructs a well-reasoned answer.
Open Pre-Training: OLMo Early attempts showed that methods like RLVR are not effective on weaker base models. This makes strengthening the base model itself a critical priority. A modern "base model" isn't built in a single pass of next-token prediction with a fixed learning rate; instead, it's forged through a multi-stage training process.
Stage 1: Pre-training This is the most resource-intensive phase, consuming about 99% of the total compute budget.
Objective: To learn broad knowledge and language patterns through next-token prediction.
Data: The model is trained on trillions of tokens of unstructured, diverse text sourced from the web, code repositories, academic papers, and more. The strategy is to use the largest and most diverse dataset possible within the given compute constraints.
Composition of the pretraining data for OLMo 2. (@olmo2OLMo22025) Stage 2: Mid-training Following the extensive pre-training, the model enters a brief but crucial mid-training stage. This phase uses only about 1% of the compute budget but is vital for enhancing complex reasoning abilities.
Objective: To selectively strengthen specific capabilities, particularly addressing weaknesses from the pre-training stage.
Data: Unlike pre-training, this stage uses a smaller, highly curated dataset designed to:
Boost reasoning and code: Upsample high-quality data focused on reasoning, mathematics, and code.
Patch model weaknesses: Identify and fix shortcomings by "injecting" targeted knowledge. Introduce new knowledge: Incorporate valuable data that was too scarce to be used effectively during pre-training. Composition of the mid-training data (Dolmino). (@olmo2OLMo22025) Results and Evaluation This two-stage approach yields significant results. Evaluations of the OLMo 2 model show a dramatic performance boost after mid-training across multiple benchmarks. The most significant gains were observed in tasks requiring complex reasoning, such as GSM8K and DROP (reading comprehension with reasoning).
Evaluations comparing OLMo 2 7B and 13B at the end of pretraining and mid-training stages Ultimately, the optimized OLMo 2 model achieves performance that is on par with or better than models like Llama 3 8B.
OLMo 2 on par with or better than Llama 3 and Qwen 2.5 Conclusion and Outlook While the AI field has made tremendous progress, significant challenges and opportunities remain. Areas like reasoning agents and domain-specific language models are particularly ripe for future research and innovation.
Reference Asai, Akari, Jacqueline He, Rulin Shao, Weijia Shi, Amanpreet Singh, Joseph Chee Chang, Kyle Lo, et al. 2024. “OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs.” arXiv. https://doi.org/10.48550/arXiv.2411.14199.
Asai, Akari, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. “Self-RAG: Learning to Retrieve, Generate, and Critique Through Self-Reflection.” arXiv.org. https://arxiv.org/abs/2310.11511v1.
Christiano, Paul, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2023. “Deep Reinforcement Learning from Human Preferences.” arXiv. https://doi.org/10.48550/arXiv.1706.03741.
Ge, Tao, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. 2025. “Scaling Synthetic Data Creation with 1,000,000,000 Personas.” arXiv.
https://doi.org/10.48550/arXiv.2406.20094.\n“Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level.” n.d. https://arxiv.org/html/2406.11817v1. Accessed June 6, 2025.\nIvison, Hamish, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, and Hannaneh Hajishirzi. 2024. “Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback.” arXiv. https://doi.org/10.48550/arXiv.2406.09279.\nLambert, Nathan, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V. Miranda, et al. 2025. “Tulu 3: Pushing Frontiers in Open Language Model Post-Training.” arXiv. https://doi.org/10.48550/arXiv.2411.15124.\nMeng, Yu, Mengzhou Xia, and Danqi Chen. 2024. “SimPO: Simple Preference Optimization with a Reference-Free Reward.” arXiv. https://doi.org/10.48550/arXiv.2405.14734.\nMuennighoff, Niklas, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. “S1: Simple Test-Time Scaling.” arXiv. https://doi.org/10.48550/arXiv.2501.19393.\nOLMo, Team, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, et al. 2025. “2 OLMo 2 Furious.” arXiv. https://doi.org/10.48550/arXiv.2501.00656.\nRafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. “Direct Preference Optimization: Your Language Model Is Secretly a Reward Model.” arXiv. https://doi.org/10.48550/arXiv.2305.18290.\nSchulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. “Proximal Policy Optimization Algorithms.” arXiv. https://doi.org/10.48550/arXiv.1707.06347.\n“Tülu 3 Opens Language Model Post-Training up to More Tasks and More People Ai2.” n.d. https://allenai.org/blog/tulu-3. 
Accessed May 18, 2025.
Language Agents have emerged as one of the most exciting research directions in AI over the past two years. This article explores three core components: long-term memory via HippoRAG, reasoning capabilities with Grokked Transformers, and world modeling through WebDreamer.
Why Agents Again?
Russell & Norvig in “Artificial Intelligence: A Modern Approach” define an agent as “anything that can perceive its environment through sensors and act upon that environment through actions.” (@ArtificialIntelligenceModern)
Fig.1: Agent-Environment Interaction Framework Many people believe modern agents can be simply defined as “LLM + external environment.” This view suggests that language models themselves have limited functionality, with only text input-output interfaces; once connected to an external environment, so that they can perceive environmental information and act upon it, they become agents.
Fig.2: ‘Modern’ agent = LLM + external environment? However, this definition is oversimplified. In reality, there are two main competing views in the community:
LLM-first view: We make an LLM into an agent
Implications: scaffold on top of LLMs, prompting focused, heavy on engineering Agent-first view: We integrate LLMs into AI agents so they can use language for reasoning and communication
Implications: All the same challenges faced by previous AI agents (e.g., perception, reasoning, world models, planning) still remain, but we need to re-examine them through the new lens of LLMs and tackle new ones (e.g., synthetic data, self-reflection, internalized search) Characteristics of Modern Language Agents Contemporary AI agents, with integrated LLMs, can use language as a vehicle for reasoning and communication:
Instruction following, in-context learning, output customization Reasoning (for better acting): state inference, self-reflection, replanning, etc. Unlike traditional agents, reasoning in language agents is essentially a new form of “action”. In traditional AI agents, actions typically refer to the external world (such as manipulating robots).
But in language agents, reasoning occurs in the internal environment, in the form of “inner monologue.” Its core processes include:
Fig.3: Inner Monologue and Reasoning in Language Agents Reasoning by generating tokens is a new type of action (vs. actions in external environments) Internal environment, where reasoning takes place in an inner-monologue fashion Self-reflection is a ‘meta’ reasoning action (i.e., reasoning over the reasoning process), akin to metacognitive functions Reasoning is for better acting, by inferring environmental states, retrospection, etc. Percept and external action spaces are substantially expanded, thanks to using language for communication and multimodal perception Evolution of AI agents To understand the uniqueness of language agents, we can compare the evolution of AI agents:

| Feature | Logical Agent | Neural Agent | Language Agent |
|---|---|---|---|
| Expressiveness | Low: bounded by the logical language | Medium: anything a (small-ish) NN can encode | High: almost anything, especially verbalizable parts of the world |
| Reasoning | Logical inferences: sound, explicit, rigid | Parametric inferences: stochastic, implicit, rigid | Language-based inferences: fuzzy, semi-explicit, flexible |
| Adaptivity | Low: bounded by knowledge curation | Medium: data-driven but sample inefficient | High: strong prior from LLMs + language use |

Early AI agents could only capture limited aspects of human intelligence, such as symbolic reasoning or unimodal perception.
Language agents show significant improvements over traditional logical agents and neural agents in expressiveness, reasoning flexibility, and adaptivity.
Their language-driven reasoning abilities enable them to better handle uncertainties in complex environments and formulate more reasonable action strategies.
A Conceptual Framework for Language Agents The capabilities of language agents can be organized into three levels (as shown in the figure): core competencies analogous to human cognitive processes, ranging from lower-level perception, memory, and embodiment up to higher-level planning, reasoning, and world models. They simultaneously span cross-cutting issues of safety, evaluation, synthetic data, and efficiency.
Fig.4: Capability Hierarchy and Challenges of Language Agents That concludes the introduction. This article will further explore three main aspects of language agents:
On long-term memory: HippoRAG On reasoning: Grokked Transformers On world models and planning: WebDreamer HippoRAG: Neurobiologically-Inspired Long-Term Memory for LLMs Humans and animals continuously learn by gaining and strengthening knowledge. Nobel Prize winner Eric Kandel highlighted memory’s vital role, saying, “Memory is everything. Without it, we are nothing.” (@marksSearchMemoryEmergence2006) Memory relies on synaptic plasticity, where brain connections grow stronger to support learning. Sleep even helps solidify memories for the long term.
Ideally, AI, especially large language models (LLMs), should learn and build knowledge over time too. But LLMs struggle with this, often suffering from catastrophic forgetting, where they lose past knowledge, a major limitation.
Non-Parametric Memory Researchers use non-parametric memory to help large language models (LLMs) learn continuously by storing new knowledge externally, as seen in Retrieval-Augmented Generation (RAG). This lets LLMs dynamically pull in outside information, acting as long-term memory.
According to studies (@xieAdaptiveChameleonStubborn2024a), LLMs adapt well to external data, even when it contradicts their own knowledge.
Fig.5: LLMs can effectively incorporate external evidence, even when it conflicts with their parametric memory, provided the evidence is coherent and persuasive Despite these benefits, current RAG implementations have limitations. Traditional RAG systems rely on vector embeddings for retrieval, which often struggle to capture complex associations.
Long-term Memory in Humans The hippocampal indexing theory (@teylerHippocampalMemoryIndexing1986) provides insights into how human memory achieves efficient recall. It suggests that:
Neocortex stores raw sensory data (e.g., auditory and visual information). Hippocampus acts as an index, linking disparate memory fragments into a structured retrieval system. Parahippocampal regions facilitate connections between stored experiences, aiding in memory retrieval. Fig.6: Hippocampus creates an index for the memories to be stored in different parts of the neocortex (@sExploringHippoRAGNeurobiologically2024) This indexing procedure enables two fundamental faculties of human memory:
Pattern separation: process for differentiating memories (neocortex and parahippocampus) Pattern completion: process for recovering complete memories from relevant associations (mostly hippocampus, specifically CA3) HippoRAG: Bringing Human-Like Memory to LLMs HippoRAG (@gutierrezHippoRAGNeurobiologicallyInspired2025a) simulates this memory mechanism by building a similar structured index for RAG systems.
Its workflow is divided into two phases:\nOffline Indexing Phase:\nConcept Extraction: Uses an LLM to extract triplets (concepts, noun phrases, and their relationships) from text Knowledge Graph Construction: Builds a schema-less knowledge graph using the extracted concepts and relationships as nodes and edges Dense Encoding: Employs dense retrievers to consolidate similar or synonymous concepts Online Query Phase:\nConcept Identification: Identifies key concepts from the query (such as \u0026ldquo;Stanford\u0026rdquo; and \u0026ldquo;Alzheimer\u0026rsquo;s\u0026rdquo;) Similar Node Retrieval: Finds nodes in the index similar to query concepts to serve as seed nodes Graph Search: Employs the Personalized PageRank algorithm to search the graph Reranking: Reranks original passages based on concept weights Fig.7: Detailed HippoRAG Methodology. The Personalized PageRank algorithm is a critical component of HippoRAG. It performs a random walk starting from seed nodes, dispersing probability mass to neighboring nodes. Nodes close to seed nodes or at the intersection of multiple seed nodes naturally receive higher weights.\nPerformance HippoRAG delivers significant performance improvements across multiple benchmark datasets, particularly in multi-hop QA tasks and iterative retrieval scenarios.\nMulti-Hop QA Performance\n2WikiMultiHopQA: Achieves an 11% improvement in R@2 and 20% in R@5, leveraging its entity-centric design for superior retrieval. MuSiQue: Shows a 3% improvement, demonstrating robustness across datasets. Fig.8: Single-step retrieval performance. Integration with Existing Methods\nHippoRAG complements existing iterative retrieval approaches:\nWhen integrated with IRCoT, R@5 performance improves further, highlighting the synergistic benefits of structured retrieval with multi-step reasoning. Fig.9: Multi-step retrieval performance. Memory in LLMs: Key Insights Memory is fundamental to human learning. 
Our sophisticated memory mechanisms enable pattern recognition, association creation, and dynamic recall of relevant memories beyond surface-level similarities.\nWhile LLMs struggle with long-term memory through parametric continual learning, non-parametric memory (e.g., RAG) offers a promising solution.\nRecent developments in RAG focus on adding more structure to embeddings (e.g., HippoRAG, GraphRAG) to enhance:\nSensemaking: the ability to interpret larger, more complex, or uncertain contexts. Associativity: the capacity to draw multi-hop connections between disparate pieces of info. Despite these advances, we are still far from developing a truly sophisticated memory system. Key challenges, such as handling episodic memory and spatiotemporal reasoning, remain unsolved.\nAs we refine memory systems, the next crucial step is to explore reasoning, which builds upon memory to enable more advanced cognitive abilities.\nGrokking of Implicit Relations in Transformers In the current landscape of LLM research, explicit reasoning methods such as Chain of Thought (CoT) have garnered significant attention. However, implicit reasoning—a more fundamental capability—is essential for understanding the true nature of these models. Let\u0026rsquo;s explore the implicit reasoning mechanisms within the Transformer architecture.\nImplicit Reasoning in LMs Implicit reasoning refers to a model\u0026rsquo;s ability to generate correct outputs without explicitly showing its reasoning steps. 
This fundamental capability shapes how language models process and utilize information.
Key Aspects of Implicit Reasoning:
Models learn to predict next tokens without explicit reasoning chains during pre-training Shapes how language models develop structured knowledge representations Recent insights into emergent reasoning capabilities: Base models develop fundamental reasoning constructs during pre-training Reinforcement learning optimizes selection of existing reasoning patterns Current Challenges:
Research has identified several limitations in language models’ implicit reasoning abilities:
Compositional Reasoning Models excel primarily at single-step reasoning (Yang et al. 2024) The gap in compositional ability persists even as models scale up (Press et al. 2023) Comparative Analysis Even advanced models like GPT-4 face difficulties with implicit attribute comparisons, despite having access to the relevant information (Zhu et al. 2023) Grokked Transformers are Implicit Reasoners These limitations have fueled a narrative that autoregressive LLMs cannot truly reason. However, a recent paper (@wangGrokkedTransformersAre2024) challenges this view, suggesting Transformers possess untapped reasoning potential worthy of deeper investigation.
Research questions This investigation explores two key questions:
Can Transformers learn to reason implicitly? What factors control the acquisition of implicit reasoning? Experimental Design Model implementation:
The study uses a standard GPT-2 style Transformer (8 layers, 768 hidden dimensions, 12 attention heads) with conventional AdamW optimization (learning rate 1e-1, batch size 512, weight decay 0.1, 2000 warm-up steps).
Compositional Reasoning Framework:
For testing implicit reasoning, the authors created synthetic knowledge graphs with $|E|$ entities and 200 relation types, split into ID and OOD atomic facts.
The key mechanism is two-hop composition: $$(h, r_1, b) \land (b, r_2, t) \Rightarrow (h, r_1 \circ r_2, t)$$
Example: from “Barack has-wife Michelle” and “Michelle born-in 1964,” infer “Barack has-wife∘born-in 1964.”
Inductive Learning Assessment:
The study examines how models learn deduction rules from examples without explicit instruction, using two test scenarios:
ID Generalization: Novel combinations of familiar atomic facts used in other compositions. OOD/Systematic Generalization: Facts seen individually but never used in compositions; success here indicates true reasoning rather than memorization. Key Takeaways Takeaway #1: Transformers Learn to Reason Through ‘Grokking’
Initially, models quickly reach 100% training accuracy (overfitting) while test accuracy remains low. However, after continuing training for approximately 20 times more steps beyond overfitting, test accuracy suddenly jumps to 100%.
This establishes a clear connection between grokking and the emergence of reasoning capabilities in transformers: reasoning abilities aren’t learned immediately but emerge after extended training periods.
Fig.10: transformers can learn to reason implicitly, but this skill is only robustly acquired through grokking Takeaway #2: Generalization Varies Across Reasoning Types
With compositional reasoning, models achieved perfect performance on in-distribution (ID) test examples but failed to generalize to out-of-distribution (OOD) scenarios. For comparative reasoning, however, models eventually reached 100% accuracy on both ID and OOD test sets. This indicates that the type of logical structure being learned significantly impacts how well the acquired reasoning generalizes.
Takeaway #3: Data Distribution Matters More Than Data Size
While previous research suggested that grokking requires a critical threshold of data size, this study challenges that assumption.
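As a concrete illustration of the two-hop composition rule, a toy generator can derive inferred facts from atomic triples. This is an assumption-laden sketch, not the paper's data pipeline; how many such inferred facts are mixed with the atomic facts in training is what determines the inferred-to-atomic ratio.

```python
def compose_facts(atomic_facts):
    """Derive two-hop inferred facts: whenever (h, r1, b) and (b, r2, t)
    both hold, emit the composed fact (h, r1∘r2, t)."""
    inferred = []
    for (h, r1, b) in atomic_facts:
        for (b2, r2, t) in atomic_facts:
            if b == b2:
                inferred.append((h, (r1, r2), t))
    return inferred

atomic = [
    ("Barack", "has-wife", "Michelle"),
    ("Michelle", "born-in", "1964"),
]
print(compose_facts(atomic))  # [('Barack', ('has-wife', 'born-in'), '1964')]
```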
The researchers found that data distribution—specifically the ratio between inferred facts and atomic facts (φ)—is far more important than total data quantity.
When keeping this ratio fixed and increasing data size, generalization speed remained consistent. But when maintaining data size while increasing the φ ratio from 3.6 to 18, generalization speed increased dramatically.
Fig.11: The speed of grokking on the in-distribution (ID) test performance (a) correlates with the ratio between inferred and atomic facts, and (b) is not influenced by the size of training data. Analyzing the changes during grokking To understand the internal changes during the grokking process, researchers employed two standard mechanistic interpretability tools:
Logit Lens: Shows us how the network processes information at different stages. Causal Tracing: Measures how different parts of the network influence the final output. Generalization Circuits for Different Reasoning Tasks
It has been discovered that different types of reasoning tasks lead to distinctly different generalization circuits inside Transformers.

| Type of Reasoning | Circuit Type | Working Mechanism | Generalization Characteristics |
|---|---|---|---|
| Compositional reasoning | Two-stage circuit | First identify the bridging entity $b$, then perform reasoning via $r_2$ | May be limited by the model’s ability to learn the bridging entity; prone to errors |
| Comparative reasoning | Parallel circuit | Directly retrieve numerical values in parallel, followed by magnitude comparison | Relatively strong generalization ability and more stable performance |

Fig.12: The (evolution of) generalizing circuit for composition. Fig.13: The (evolution of) generalizing circuit for comparison. Why Compositional Reasoning Struggles with Generalization
Transformers often fail at compositional reasoning in out-of-distribution (OOD) settings.
The reason lies in how they store and reuse atomic facts across layers.
Two-stage reasoning is required: first finding a bridge entity (h, r₁ → b), then inferring the answer (b, r₂ → t). Transformers tend to store both hops in lower layers, but don’t re-store the second hop in higher layers. This leads to failure when encountering unseen combinations of known facts — the core of OOD generalization. Core Problem and Solution
The key issue is that models don’t store atomic facts in higher layers. The solution is to force the storage of second-hop atomic facts in higher layers.
Possible Solutions
Data Augmentation: Train the model to learn compositional structures in upper layers through special tasks and annotations. Regularization Incentives: Design loss functions that encourage storing atomic facts in both lower and higher layers. Structural Adjustment: Modify the Transformer’s self-attention mechanism to actively recall relationships across different layers. Expected Outcomes
This approach should lead to better systematic generalization by allowing the model to:
Break through OOD generalization limitations More flexibly combine previously unseen facts Improve applicability in real-world scenarios World Models and Planning In the context of language agents, planning can be defined as: given a goal G, determining a sequence of actions a₀, a₁, …, aₙ that, when executed, lead to a state that satisfies or exceeds the requirements of goal G.
Unlike traditional systems that use constrained formal languages (like PDDL) to explicitly describe goals, modern language agents typically use natural language to express objectives.
This approach enhances expressiveness and flexibility but introduces challenges such as semantic ambiguity and goal uncertainty (@liuLLM+PEmpoweringLarge2023; @kambhampatiLLMsCantPlan2024).
To address these challenges, current research has proposed various planning paradigms to improve language agents’ goal modeling and task execution capabilities.
Planning paradigms for language agents Language agents employ several key planning mechanisms:
Prompt-based Planning: Guides large language models to generate action sequences through carefully designed prompts. For example, the ReAct framework alternates between reasoning and acting to enhance task coherence.
Plan-then-Act Architecture: Divides tasks into two phases—first generating a global action plan, then executing step by step. This approach emphasizes forward-looking goal understanding, as seen in methods like AutoGPT and WebGPT.
Iterative Planning/Replanning: Accounts for environmental dynamics by adjusting plans in real-time during execution. The Reflexion framework exemplifies this approach, where agents update their strategies based on feedback.
Program-aided Planning: Incorporates program execution or external tools to support planning, adding verifiability and structure to the planning process.
World Models in Language Agents In language agents, a World Model is an abstract representation of environmental states that helps agents reason about the consequences of future actions.
Simply put, world models answer the question: \u0026ldquo;What will happen if I take a certain action?\u0026rdquo;\nWhile traditional reinforcement learning represents world models as state transition functions, language agents employ more flexible forms, often relying on language expressions, knowledge graphs, structured memory, or multimodal information.\nWorld Models serve several critical functions:\nEnvironmental Perception: Agents build world models to understand current states and constraints (e.g., webpage structures, task requirements, conversation history). Forward Simulation: Simulating potential future states resulting from specific actions—similar to a \u0026ldquo;mental rehearsal\u0026rdquo; process (as in WebDreamer\u0026rsquo;s \u0026ldquo;dreaming\u0026rdquo; process). Multi-step Planning Support: Using world models as auxiliary modules to predict outcomes at each step of a plan sequence, thereby optimizing overall strategy. World models can be constructed through:\nLanguage-based Simulation: Using language models to generate predictive outcomes for actions—flexible but difficult to verify. Tool-enhanced Modeling: Combining external tools (crawlers, APIs, environment simulators) to build structured state information. Memory-augmented Modeling: Incorporating long-term memory modules to record interaction history or external knowledge, enhancing continuous reasoning capabilities. While world models significantly improve planning performance, their accuracy and stability remain research bottlenecks.\nCase Study: WebDreamer WebDreamer (@guYourLLMSecretly2025) exemplifies the integration of planning and world model construction, emphasizing the \u0026ldquo;imagine first, then act\u0026rdquo; philosophy.\nFig.14: Schematic illustration of different web agent strategies as a search problem, where each node represents a webpage. Its primary workflow includes:\nExtracting Task Goals and Constraints: Parsing user intent from natural language. 
Building an \u0026ldquo;Imagined\u0026rdquo; World Model: Reasoning about future states and possible paths based on language input (the \u0026ldquo;dreaming\u0026rdquo; process). Generating Executable Plans: Developing feasible action steps using the world model and iteratively updating based on feedback. Fig.15: Illustration of WEBDREAMER simulating outcomes for three candidate actions using GPT-4o: (1) Click \u0026#39;Office Products\u0026#39;, (2) Click \u0026#39;Electronics\u0026#39;, and (3) Type \u0026#39;Disk\u0026#39; into textbox. WebDreamer\u0026rsquo;s strength lies in its \u0026ldquo;imagination\u0026rdquo; process, giving agents clearer global awareness of complex or multi-step tasks, enhancing plan generation capabilities and execution robustness.\nFig.16: Success rate (%) on VisualWebArena (Koh et al., 2024a), Online-Mind2Web (Xue et al., 2025), and Mind2Web-Live (Pan et al., 2024b). Key Takeaways on Planning Compared to traditional symbolic planning, language agents require stronger language understanding and plan generalization capabilities when facing open-ended, natural language goals.\nMultiple planning paradigms (prompt-based, plan-then-act, iterative, program-aided) offer different modeling paths for language agents.\nIncorporating world models (like WebDreamer) significantly improves planning quality and contextual consistency, particularly for complex tasks with multi-step requirements or ambiguous goals.\nA major ongoing challenge is establishing stable, controllable, and verifiable bridges between natural language and executable plans.\nFuture Directions for Language Agents Open Research Questions As language agents continue to evolve, several critical research questions remain unsolved:\nMemory and Continual Learning: How can language agents truly learn over time without catastrophic forgetting? 
The challenge involves creating systems that remember past conversations, learn from mistakes, and continuously improve while developing personalized memory systems that respect user privacy.\nReasoning in Uncertain Environments: Unlike environments with clear metrics, language agents operate in fuzzy worlds filled with ambiguity. The field is actively exploring how to implement reasoning frameworks where \u0026ldquo;correct\u0026rdquo; answers aren\u0026rsquo;t clear-cut and how agents integrate observations with actions when information may be contradictory.\nPlanning and World Models: Current planning capabilities remain primitive compared to their potential. Finding the balance between computationally intensive simulations and simple reactive approaches presents a fascinating optimization problem, alongside maintaining coherent planning over longer horizons where small errors compound.\nSafety and Security: The attack surface of web-enabled agents encompasses potentially the entire internet. 
Research must focus on mitigating both endogenous risks (agent incompetence) and exogenous threats (adversarial attacks) while developing monitoring systems that detect when an agent operates beyond its competence.\nPromising Applications Despite these challenges, several exciting applications are emerging that show significant potential:\nAgentic Search and Deep Research Tools like Perplexity Pro and Google/OpenAI\u0026rsquo;s deep research agents show clear business potential Enhanced information synthesis across multiple sources with factual grounding Workflow Automation End-to-end automation of complex multi-step processes Integration with existing software ecosystems and APIs Adaptive workflows that learn from human feedback and patterns Scientific Research Assistants Literature review and hypothesis generation Experimental design optimization Data analysis and pattern recognition Cross-disciplinary knowledge synthesis Reference “Artificial Intelligence: A Modern Approach, 4th US Ed.” n.d. https://aima.cs.berkeley.edu/. Accessed March 25, 2025.\nGu, Yu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, et al. 2025. “Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents.” arXiv. https://doi.org/10.48550/arXiv.2411.06559.\nGutiérrez, Bernal Jiménez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2025. “HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models.” arXiv. https://doi.org/10.48550/arXiv.2405.14831.\nKambhampati, Subbarao, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. 2024. “LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks.” arXiv. https://doi.org/10.48550/arXiv.2402.01817.\nLiu, Bo, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. 2023. “LLM+P: Empowering Large Language Models with Optimal Planning Proficiency.” arXiv. 
https://doi.org/10.48550/arXiv.2304.11477.\nMarks, Andrew R. 2006. “In Search of Memory: The Emergence of a New Science of Mind.” Journal of Clinical Investigation 116 (5): 1131. https://doi.org/10.1172/JCI28674.\nSuruthi, S. 2024. “Exploring HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models.” Medium. https://suruthi41.medium.com/exploring-hipporag-neurobiologically-inspired-long-term-memory-for-large-language-models-a43d65b35c01.\nTeyler, T. J., and P. DiScenna. 1986. “The Hippocampal Memory Indexing Theory.” Behavioral Neuroscience 100 (2): 147–54. https://doi.org/10.1037//0735-7044.100.2.147.\nWang, Boshi, Xiang Yue, Yu Su, and Huan Sun. 2024. “Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization.” arXiv. https://doi.org/10.48550/arXiv.2405.15071.\nXie, Jian, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. 2024. “Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts.” arXiv. https://doi.org/10.48550/arXiv.2305.13300.\n","permalink":"https://mig217.github.io/post/2025-03-30-memory-reasoning-and-planning-of-language-agents/","summary":"\u003cp\u003eLanguage Agents have emerged as one of the most exciting research directions in AI over the past two years.
This article explores three core components: \u003cstrong\u003elong-term memory via HippoRAG, reasoning capabilities with Grokked Transformers, and world modeling through WebDreamer\u003c/strong\u003e.\u003c/p\u003e\n\u003ch2 id=\"why-agents-again\"\u003eWhy Agents Again?\u003c/h2\u003e\n\u003cp\u003eRussell \u0026amp; Norvig in “Artificial Intelligence: A Modern Approach” define an agent as “\u003cstrong\u003eanything that can perceive its environment through sensors and act upon that environment through actions.\u003c/strong\u003e” (Russell and Norvig, n.d.)\u003c/p\u003e","title":"Memory, Reasoning, and Planning of Language Agents"},{"content":"Introduction To understand LLM agents, we need to break the term into two foundational components: Large Language Models (LLMs) and Agents. While LLMs have gained widespread recognition, the concept of \u0026ldquo;agent\u0026rdquo; in this context requires deeper exploration.\nWhat is an Agent? In artificial intelligence, an agent is an \u0026ldquo;intelligent\u0026rdquo; system that perceives and interacts with an \u0026ldquo;environment\u0026rdquo; to achieve specific goals. The classification of agents varies based on their operational environment:\nPhysical environments: robots, autonomous vehicles, etc. Digital environments: DQN for Atari games, Siri, AlphaGo Humans as environments: Chatbots Agents typically follow a perception-reasoning-action cycle, where they:\nObserve their environment Process information and make decisions Take actions that affect the environment Fig.1: Agent-Environment Interaction What\u0026rsquo;s an LLM Agent? An LLM agent integrates the powerful language capabilities of LLMs with the goal-oriented, interactive nature of agents.
These systems represent a significant evolution in AI, with capabilities ranging from basic conversational skills to complex reasoning and planning.\nLLM agents can be classified into three progressive levels of sophistication:\nLevel 1: Text Agent\nBasic agents that process and respond to text input Examples: ELIZA, LSTM-DQN Level 2: LLM Agent\nAdvanced agents that leverage LLMs for direct action generation Examples: SayCan, Language Planner Level 3: Reasoning Agent\nAgents that use LLM reasoning to decide how to act Examples: ReAct, AutoGPT Fig.2: Three Levels of LLM Agents: From Text to Reasoning Pre-LLM Language Agents ELIZA (1966): The Pioneer of Text Agents The development of text-based agents dates back to the early days of AI. ELIZA, created in 1966, marked a significant milestone as one of the first chatbots.\nIts simple yet effective rule-based approach involved pattern matching and response templates to simulate human conversation. While users found ELIZA remarkably engaging, the system had inherent limitations:\nLimited to specific domains and use cases Required extensive manual rule creation Unable to handle complex interactions or deeper understanding Fig.3: ELIZA: The First Chatbot Despite these constraints, ELIZA established the conceptual foundation for future conversational agents and demonstrated the potential of natural language interfaces.\nLSTM-DQN (2015): Reinforcement Learning for Text Agents Prior to the emergence of LLMs, RL was a dominant approach for developing text-based agents. This methodology treated text as both the observation and action space, similar to how traditional RL handles pixels and keyboard inputs in video games.
The core idea was that optimizing for reward signals would naturally lead to the emergence of language intelligence[1].\nFig.4: LSTM-DQN for Reinforcement Learning in Text-Based Environments However, this approach faced several significant limitations:\nDomain-specific applications Dependence on explicit scalar reward signals Extensive training requirements These early approaches highlighted both the promise and challenges of creating intelligent text-based agents, setting the stage for the transformative impact that large language models would later bring to this field.\nThe Emergence of Large Language Models LLMs have revolutionized text agents through next-token prediction on massive text corpora. During inference, they solve diverse new tasks through prompting alone[2]. This emergent generality creates exciting possibilities for building more capable agents.\nA Brief History of LLM Agents The rise of LLM agents began with models like GPT-3 in 2020. Initially, researchers explored their potential across diverse tasks, which broadly fell into two categories:\nReasoning tasks: such as symbolic question answering and logical inference Acting tasks: including interactive applications like games and robotics Fig.5: Evolution of LLM Agents: From Reasoning and Acting to ReAct Over time, reasoning and acting converged, giving rise to reasoning agents—models that combine structured thinking with goal-driven actions. This led to two key research directions:\nApplications: Web interaction, software engineering, scientific discovery, and more Methods: Memory systems, planning, multi-agent collaboration, and adaptive learning Enhancing LLMs with External Knowledge and Computation While LLMs excel at many tasks, some require more than just next-token prediction—they demand reasoning, external knowledge, or computation.
To address these limitations, researchers have developed various techniques.\n(1) Code-Augmented Computation\nFor tasks involving calculations or formal reasoning, LLMs can generate code instead of directly predicting an answer. The generated code is then executed to produce the final result[3].\nExample: Prime factorization, Fibonacci sequences\n(2) Retrieval-Augmented Generation (RAG) for Knowledge\nFor knowledge-intensive queries, LLMs can retrieve relevant information from external corpora before generating a response[4]. This typically relies on:\nExternal corpora A retriever (e.g., BM25, DPR, etc.) Fig.6: An illustration comparing (a) black-box language models and (b) retrieval-oriented NLP models, the paradigm this post advocates for Limitation: RAG depends on the availability of a relevant corpus. If the needed information is missing (e.g., \u0026ldquo;Who is the latest Prime Minister?\u0026rdquo;), retrieval alone is insufficient.\n(3) Tool-Use for Dynamic Information\nWhen static corpora fall short, LLMs can invoke external tools in real time. This is achieved by introducing special tokens that trigger API calls[5][6]. Common tools include:\nSearch engine, calculator, etc. Task-specific models (translation) APIs This approach significantly expands capabilities but introduces new challenges in tool selection and interaction management.\nFig.7: Examples of TALM Text-to-Text Interface in Different Tasks What if both knowledge and reasoning are needed? Many tasks require both reasoning and external knowledge, pushing researchers to develop hybrid approaches. For example, one can interleave retrieval with chain-of-thought reasoning[7] or generate follow-up queries to refine responses[8].\nFig.8: IRCoT interleaves chain-of-thought (CoT) generation and retrieval steps to guide the retrieval by CoT and vice-versa. However, early solutions were fragmented.
Even within a single task like QA, different benchmarks posed distinct challenges, leading to a proliferation of task-specific techniques.\nFig.9: QA Methods and Reasoning Approaches To achieve generality, we need a framework that integrates knowledge retrieval with structured reasoning.\nReAct: A Unified Framework for Reasoning and Action Prior approaches to LLM agents faced a fundamental divide: reasoning-focused methods lacked external information, while action-focused methods lacked structured thinking[9]. The Core Idea of ReAct ReAct integrates reasoning and action paradigms into a unified framework, allowing language models to generate both simultaneously rather than in isolation. Key advantages include: Synergy of reasoning and acting Simple and intuitive to use General across domains Fig.10: Comparison of Reasoning-Only, Acting-Only, and ReAct (Reason \u0026#43; Act) Paradigms How ReAct Works A ReAct agent follows an iterative reasoning-action loop until reaching a final conclusion:\nGenerate Thought: break down the problem and decide on the next step. Take Action: query a retrieval system, execute code, or call an external API. Observe Outcome: analyze the retrieved data or computed result. Repeat: use the new information to generate the next thought and action. Fig.11: ReAct: Integrating Reasoning and Acting for Improved QA Performance Implementation approaches:\nOne-shot prompting Few-shot prompting Fine-tuning ReAct Enables Systematic Exploration At the heart of this approach is the concept of treating reasoning as an action, which expands the action space of AI agents.
This enables them to explore complex tasks in a more systematic way.\nReAct outperforms standard prompting methods with just a few examples, approaching or even exceeding the performance of supervised learning across multiple tasks:\nFig.12: Performance Comparison of QA and Interactive Tasks Using Different Prompting Methods Unlike traditional AI agents constrained by the environment, ReAct agents can not only perform external actions but also reason actively, breaking free from fixed action sets. This allows for unlimited expressiveness, influencing decision-making without altering the external world.\nLong-term Memory: Expanding Agent Capabilities Beyond the Context Window Building upon the ReAct framework\u0026rsquo;s integration of reasoning and action, we now explore how agents can overcome the fundamental limitations of their working memory through effective long-term memory systems.\nShort-term Memory vs. Long-term Memory An agent’s short-term memory is limited to the language model’s context window. While this allows it to dynamically add thoughts, actions, and observations, it comes with significant constraints: Short-term Memory Long-term Memory - Append-only - Limited context - Limited attention - No persistence across tasks - Read and write - Stores experience, knowledge, skills - Persists over new experiences Short-term memory is like a goldfish with its legendary three-second memory - an agent might solve remarkable problems but must start from scratch the next time. This limitation motivates the need for long-term memory systems.\nFig.13: The Ephemeral Genius: A Goldfish’s Dilemma Reflexion: A Simple Form of Long-term Memory Reflexion builds directly upon ReAct by adding a memory layer that enables agents to learn from past experiences[10]. 
The process works as follows:\nTask: The agent attempts to solve a problem Trajectory: The agent follows a reasoning-action path toward a solution Evaluation: The solution is tested against success criteria Reflection: The agent analyzes its performance, identifying strengths and weaknesses Next Trajectory: The agent incorporates these reflections into future attempts Fig.14: Reflexion works on decision-making, programming, and reasoning tasks. This approach proves particularly effective for coding tasks, where clear feedback from unit tests creates ideal learning conditions. Fig.15: Pass@1 accuracy for various model-strategy-language combinations. The base strategy is a single code generation sample. All instruction-based models follow zero-shot code generation. Traditional RL vs. Reflexion Another way to understand Reflexion is through its contrast with traditional reinforcement learning approaches. RL relies on sparse scalar rewards and weight updates for learning. In contrast, Reflexion introduces a more direct, language-driven learning process:\nTraditional RL Reflexion: “Verbal” RL - Learns via scalar rewards (sparse feedback) - Learns by updating weights (credit assignment) - Learns via text feedback - Learns by updating language (a long-term memory of task knowledge) By leveraging explicit textual feedback, Reflexion allows agents to iteratively refine their reasoning and decision-making, creating interpretable knowledge that persists as long-term memory for future tasks without requiring weight adjustments.\nAgent Architecture Unified Perspective on Agents The final step in understanding agent architecture is recognizing that language itself functions as a form of long-term memory. Agents can improve their capabilities through two complementary mechanisms:\nNeural Adaptation: Updating the parameters of the language model through fine-tuning. External Memory: Storing and retrieving knowledge in structured textual or symbolic formats. 
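The external-memory mechanism above can be illustrated with a minimal sketch: a plain text store that an agent writes reflections and facts into and later retrieves from. The `MemoryStore` class and its word-overlap scoring are illustrative stand-ins, not taken from any cited system; a real agent would use an embedding-based retriever.

```python
class MemoryStore:
    """Minimal external long-term memory: append text entries, retrieve by word overlap."""

    def __init__(self):
        self.entries = []  # persisted reflections, facts, and skills

    def write(self, text):
        """Persist a new memory entry (survives beyond the current task)."""
        self.entries.append(text)

    def read(self, query, k=1):
        """Return the k stored entries sharing the most words with the query."""
        q = set(query.lower().split())
        scored = sorted(self.entries,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return scored[:k]

memory = MemoryStore()
memory.write("Unit test failed: off-by-one error when slicing the last element")
memory.write("User prefers concise answers with code examples")
# On a later task, the agent retrieves the most relevant past experience.
print(memory.read("why did the unit test fail?"))
```

The point of the sketch is the read/write asymmetry with the context window: entries persist across tasks, and only the retrieved subset re-enters short-term memory.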
By viewing both neural networks and external text repositories as forms of long-term memory, we arrive at a unified abstraction of learning. This perspective emphasizes the agent\u0026rsquo;s ability to reason over short-term memory while maintaining persistent long-term memory that extends beyond the current task. As demonstrated in CoALA[11], any agent can be expressed through three fundamental components:\nMemory: Where information is stored Action Space: What the agent can do Decision-making procedure: How the agent selects actions based on memory and capabilities Fig.16: A compilation of language agents using the Cognitive Architectures for Language Agents (🐨CoALA) framework. This framework provides a sufficient and comprehensive way to conceptualize agent systems of any complexity.\nA Brief History of Agents Looking beyond LLM agents specifically, we can place agents in the broader historical context of AI development. While simplified, this timeline illustrates shifts in agent architectures:\nEra Agent Paradigm Representation Characteristics Early AI Symbolic AI Rule-based logical expressions • Programmed rules for environment interaction • Expert systems • Intensive design effort • Task-specific Post-AI Winter Deep RL Neural embeddings/vectors • Data-driven learning • Breakthroughs like Atari, AlphaGo • Millions of training steps • Limited generalization Recent LLM Agents Natural language • Language as intermediate representation • Rich priors from pre-training • Inference-time scalability • General and generalizable The fundamental difference between these paradigms lies in how they transform observations into actions:\nSymbolic AI agents map observations to symbolic states, then apply logical rules to determine actions Deep RL agents convert observations into neural embeddings, processed through networks to produce actions LLM agents use natural language as the intermediate representation, mimicking human thought processes This language-based approach
offers several distinct advantages:\nLeverages rich priors from pre-trained LLMs Provides inference-time scalability Facilitates generalization across diverse tasks This fundamental shift enables an entirely new class of applications that were previously impossible. Let\u0026rsquo;s explore these next.\nDigital Automation LLM Agents enable a new class of digital automation applications. This advancement stands in stark contrast to previous digital assistants like Siri, which had fundamentally limited capabilities. The breakthrough with LLM agents comes from:\nReasoning over real-world language (and other modalities) Making decisions across open-ended actions and long horizons Earlier sequence-to-sequence systems couldn\u0026rsquo;t handle complex tasks requiring contextual understanding and extended planning.\nThe evolution of LLM agents represents a parallel advancement in both mathematical models and practical applications - a dual progression critical to unlocking digital automation\u0026rsquo;s potential.\nApplications WebArena: General Web Interaction WebArena[12] extends AI applications in web environments beyond online shopping, enabling a broader range of web-based interactions such as information retrieval, form filling, web navigation, and content generation. This highlights AI\u0026rsquo;s potential for diverse real-world applications.\nFig.17: Overview of WebArena – A Self-Hosted Agent-Driven Automation Framework. SWE-Bench: AI for Software Engineering SWE-Bench[13] is a benchmark designed to evaluate LLMs on real-world software issues collected from GitHub. Given a codebase and an issue report, the AI must generate a file diff that resolves the issue. In this task:\nInput: GitHub repository + issue report Output: File diff to fix the issue Evaluation: Passing unit tests from a pull request Fig.18: Language Model-Assisted Code Fixing and Testing Workflow.
The task definition is straightforward, but solving it requires complex repository interaction, creating unit tests, executing code, and iterative debugging, mirroring the workflow of human software engineers.\nChemCrow: AI-Driven Scientific Discovery ChemCrow[14] leverages ReAct to enable LLM agents in chemical discovery. Given chemical data and access to tools like Python and online databases, the AI can analyze information, reason through potential compounds, and take actions to propose novel chemical structures.\nFig.19: ChemCrow—An Autonomous Agent for AI-Driven Chemistry Research. A key breakthrough is the integration of AI with physical experiments: suggested compounds are synthesized in a lab, providing real-world feedback that refines the AI\u0026rsquo;s predictions. This demonstrates how AI agents can operate beyond digital tasks, extending into scientific research and real-world experimentation.\nSummary Lessons for Research Some of the most impactful research is remarkably simple; think of CoT and ReAct. Simplicity is powerful because it often leads to generality. However, achieving simplicity is challenging. It requires both:\nAbstract Thinking: Looking beyond specific tasks or datasets to identify broader principles.\nTask Familiarity (but not over-reliance on task-specific methods): Understanding problems deeply without getting stuck in incremental improvements.\nStudying history and diverse disciplines can aid in developing abstraction skills, helping researchers identify more generalizable solutions.\nWhat\u0026rsquo;s Next? The future of AI research is multi-dimensional, with several promising directions. Here are five key areas worth exploring:\nTraining: How can we effectively train models for AI agents? Where does the data come from? Interface: How do we design environments for AI agents? Robustness: How do we ensure AI solutions work reliably in real-world scenarios? Human interaction: How do AI systems perform when interacting with people?
Benchmarking: How do we create meaningful benchmarks to measure progress? References [1] Narasimhan, Karthik, Tejas Kulkarni, and Regina Barzilay. “Language Understanding for Text-Based Games Using Deep Reinforcement Learning.” arXiv preprint, September 11, 2015.\n[2] Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, and others. “Language Models Are Few-Shot Learners.” arXiv preprint, July 22, 2020.\n[3] Chen, Wenhu, Xueguang Ma, Xinyi Wang, and William W. Cohen. “Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks.” arXiv. November 22, 2022.\n[4] SAIL Blog. \u0026ldquo;Building Scalable, Explainable, and Adaptive NLP Models with Retrieval.\u0026rdquo; October 5, 2021.\n[5] Parisi, Aaron, Yao Zhao, and Noah Fiedel. \u0026ldquo;TALM: Tool Augmented Language Models.\u0026rdquo; arXiv, May 24, 2022.\n[6] Schick, Timo, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, and others. \u0026ldquo;Toolformer: Language Models Can Teach Themselves to Use Tools.\u0026rdquo; arXiv, February 9, 2023.\n[7] Trivedi, Harsh, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. \u0026ldquo;Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions.\u0026rdquo; arXiv, June 23, 2023.\n[8] Press, Ofir, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. \u0026ldquo;Measuring and Narrowing the Compositionality Gap in Language Models.\u0026rdquo; arXiv, October 17, 2023.\n[9] Yao, Shunyu, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. \u0026ldquo;ReAct: Synergizing Reasoning and Acting in Language Models.\u0026rdquo; arXiv, March 10, 2023.\n[10] Shinn, Noah, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. \u0026ldquo;Reflexion: Language Agents with Verbal Reinforcement Learning.\u0026rdquo; arXiv, October 10, 2023.\n[11] Sumers, T. 
R., Yao, S., Narasimhan, K., and Griffiths, T. L. \u0026ldquo;Cognitive Architectures for Language Agents.\u0026rdquo; arXiv.org, September 05, 2023.\n[12] Zhou, Shuyan, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, and others. \u0026ldquo;WebArena: A Realistic Web Environment for Building Autonomous Agents.\u0026rdquo; arXiv, April 16, 2024.\n[13] Jimenez, Carlos E., John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. \u0026ldquo;SWE-bench: Can Language Models Resolve Real-World GitHub Issues?\u0026rdquo; arXiv, November 11, 2024.\n[14] Bran, Andres M., Sam Cox, Oliver Schilter, Carlo Baldassari, Andrew D. White, and Philippe Schwaller. \u0026ldquo;ChemCrow: Augmenting large-language models with chemistry tools.\u0026rdquo; arXiv, October 2, 2023.\n","permalink":"https://mig217.github.io/post/2025-03-10-llm-agents-overview/","summary":"\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eTo understand LLM agents, we need to break the term into two foundational components: \u003cstrong\u003eLarge Language Models (LLMs)\u003c/strong\u003e and \u003cstrong\u003eAgents\u003c/strong\u003e. While LLMs have gained widespread recognition, the concept of \u0026ldquo;agent\u0026rdquo; in this context requires deeper exploration.\u003c/p\u003e\n\u003ch3 id=\"what-is-an-agent\"\u003eWhat is an Agent?\u003c/h3\u003e\n\u003cp\u003eIn artificial intelligence, an agent is \u003cstrong\u003ean \u0026ldquo;intelligent\u0026rdquo; system that perceives and interacts with an \u0026ldquo;environment\u0026rdquo; to achieve specific goals\u003c/strong\u003e. 
The classification of agents varies based on their operational environment:\u003c/p\u003e","title":"LLM Agents: Brief History and Overview"},{"content":"This article is based on Jason Weston's (Meta) lecture in the UC Berkeley Advanced Large Language Model Agents course, which explores improving the reasoning capabilities of large language models. The lecture content follows:\nAI capabilities are advancing rapidly, as shown by the breakthrough results of models such as O1 and R1 on reasoning benchmarks. This article focuses on model self-improvement.\nTo better understand how AI reasons, we first need to distinguish two basic modes of thinking: System 1 and System 2:\nFig.1: Hybrid Reasoning Framework: System 1 and System 2 Collaboration in LLMs System 1: The Fast, Intuitive System\nThis is a fast mode of thinking analogous to human intuition, relying mainly on associative reasoning. In LLMs, it corresponds to the basic operation of the transformer network. Its main characteristics:\nA fixed amount of computation per token; Direct output of the answer; Limitations: prone to learning spurious correlations, producing hallucinations, sycophantic answers, out-of-scope responses, and other problems; System 2: The Deliberate Reasoning System\nThis represents a deeper mode of thinking, currently realized mainly through CoT. Before producing a final answer, System 2 performs systematic reasoning and analysis. Its advantages:\nThe ability to carry out complex tasks such as planning, search, and verification; Dynamic computation, enabling flexible reasoning through CoT, ToT, and similar methods; We can improve System 1 by optimizing the model architecture or weights, and improve System 2 by refining how reasoning chains are generated. The ultimate goal is an AI capable of self-learning, which involves three key abilities:\nAutonomously designing challenging training tasks Evaluating the quality of task completion, forming a self-reward mechanism Continuously updating and improving itself based on understanding and feedback The discussion proceeds in two parts: first a review of the historical development of language models, then a deep dive into the major research advances of the past year.\nLLM Post-training: The Optimization Path Before O1/R1 InstructGPT InstructGPT, proposed in 2022, is a language-model optimization method[1] that combines supervised learning with reinforcement learning from human feedback (RLHF), going beyond plain supervised fine-tuning. The method consists of three key steps (see figure):\nFig.2: Reinforcement Learning from Human Feedback (RLHF) Training Pipeline Supervised fine-tuning (SFT): collect human-annotated demonstrations and perform an initial behavioral adjustment of the base model; Reward model training: train a reward model that judges output quality from human rankings of multiple model outputs; Reinforcement learning optimization: use the reward model's feedback scores to continuously optimize the model's outputs via the PPO algorithm; By introducing human feedback and a self-optimization mechanism, this training approach lets the model keep improving its output quality. This not only boosts performance but also takes a step toward the goal of self-training.\nDirect Preference Optimization (DPO) DPO, or Direct Preference Optimization[2], has attracted wide attention in recent years as an alternative to RLHF. DPO simplifies the training process by optimizing directly on human preference data, without explicitly training a reward model (see figure):\nFig.3: DPO optimizes for human preferences while avoiding reinforcement learning The main steps of DPO:\nPreference data collection: gather a preference dataset containing multiple outputs; for example, generate two responses for a given prompt and have a human choose the one that better matches expectations;\nObjective transformation: through a mathematical transformation, recast the reinforcement-learning objective as a classification problem, optimizing the policy difference under the Bradley-Terry model;\nPolicy optimization: optimize the model parameters by maximizing the likelihood of the preference data;\nIn summary, post-training techniques such as InstructGPT and DPO greatly improved LLMs' instruction-following ability and output quality by incorporating human feedback and preferences. Although this stage had no explicit CoT reasoning mechanism, it was already a significant advance over raw pretrained language models.\nImproving Reasoning via System 2: Prompting Methods Chain-of-Verification (CoVe): Reducing Hallucination in LLMs The core idea of CoVe is to have the LLM not only generate a preliminary answer (think of it as a first draft) but also verify that answer itself[3]. The CoVe workflow is as follows:\nGenerate an Initial Answer
(Baseline Response): the model generates a preliminary answer to the user query; Plan Verifications: based on the baseline answer, the model generates a series of verification questions; Execute Verifications: the model produces a short answer to each verification question; Generate the Final Verified Response: the model compares the baseline answer with the verification answers, identifies and corrects errors, and produces the final answer; Fig.4: Chain-of-Verification (CoVe) method. Given a user query, a large language model generates a baseline response that may contain inaccuracies. Research shows that models are usually more accurate when handling short question-answer pairs than when generating long-form responses. By planning and executing these verification questions, the LLM can detect and correct contradictions and errors in its baseline answer; across knowledge-intensive tasks, CoVe significantly improves answer accuracy (see figure):\nFig.5: Test Precision and average number of positive and negative (hallucination) entities for list-based questions on the Wikidata and Wiki-Category list tasks. System 2 Attention (S2A): Making Attention More Focused Because LLMs use a \u0026ldquo;soft attention\u0026rdquo; mechanism, they are vulnerable to semantic leakage and sycophancy: the entire context, including irrelevant information, influences the output. To address this, the \u0026ldquo;System 2 Attention\u0026rdquo; paper proposes S2A[4], which uses CoT reasoning to rewrite the original instruction and remove bias. The S2A steps (see figure):\nRewrite the instruction: prompt the LLM to rewrite the original question, removing irrelevant information or bias Answer based on the rewritten question: feed the rewritten question back into the LLM to generate the final answer Fig.6: Reducing Bias in LLM Responses Through Question Rewriting. Applications of CoT and System 2 reasoning go far beyond math problems. By using these intermediate tokens for reasoning, models can think through many kinds of tasks and solve a wide range of problems effectively.\nBranch-Solve-Merge (BSM): Decomposing Complex Tasks When a task is complex and its instructions hard to follow, even a model as strong as GPT-4 makes mistakes. BSM is a divide-and-conquer strategy: break a large problem into several smaller ones, solve them one by one, and finally integrate the results[5]. The BSM workflow (as shown in the figure):\nBranch: decompose the complex task into several mutually independent subtasks; Solve: solve each branch's subtask independently; Merge: integrate the branch solutions into a final answer; this is not simple concatenation but requires holistic synthesis; Research shows that by giving the language model extra thinking time, BSM significantly improves the quality of evaluation results.\nFig.7: Branch-Solve-Merge Improves Large Language Model Evaluation and Generation.
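The three BSM stages can be sketched as plain functions around a stubbed model call. Everything here is illustrative: `call_llm` is a placeholder, and the fixed branches stand in for the decomposition a real LLM would generate.

```python
def call_llm(prompt):
    # Placeholder for a real LLM call; echoes the prompt so the data flow is visible.
    return f"<answer to: {prompt}>"

def branch(task):
    """Decompose a complex task into independent subtasks (fixed here for illustration)."""
    return [f"{task} -- evaluate factual accuracy",
            f"{task} -- evaluate writing quality"]

def solve(subtask):
    """Solve each branch independently, each with its own focused prompt."""
    return call_llm(subtask)

def merge(task, solutions):
    """Integrate branch solutions into one final answer (synthesis, not concatenation)."""
    joined = "\n".join(solutions)
    return call_llm(f"Combine these partial evaluations of '{task}' into one verdict:\n{joined}")

task = "Judge response A vs response B"
final = merge(task, [solve(s) for s in branch(task)])
```

Because each branch runs independently, the solve stage can also be parallelized before the single merge call.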
Although prompting methods can markedly improve LLM performance on complex tasks through carefully designed prompts, they still depend on human intervention: a specialized prompt must be crafted for each task.\nWhat we really need is to train models end to end to reason, rather than relying on clever prompt design. Hence the new research trend: enhancing reasoning through model self-improvement.\nBetter Reasoning Through Self-Improvement In traditional machine learning, humans supervise AI systems weaker than themselves (figure, left). To align with superintelligence (intelligence far beyond humans), humans would need to supervise AI stronger than themselves (figure, center)[6]. This raises a key question: how can we keep improving models that surpass humans?\nFig.8: A simple analogy for superalignment Self-Rewarding LLMs While this problem cannot be solved directly, related ideas offer a path: LLMs can themselves judge the quality of outputs well [7][8], and good judgment can help them keep improving [9][10]. Building on this, Self-Rewarding LLMs replace human feedback with the model's own feedback, letting the model \u0026ldquo;self-evaluate and self-train\u0026rdquo; during training[11]. The model has two core capabilities:\nInstruction following: accurately understanding and responding to user instructions Self-evaluation: objectively assessing the quality of different responses The training pipeline (see figure):\nFig.9: How language models can teach themselves to follow instructions Base model: start from pretrained LLAMA-2-70B (M0) Initial training: multi-task training on seed instruction-following and evaluation data Iterative refinement: complete two rounds of the self-rewarding training loop (yielding models M2 and M3) Experimental results\nInstruction following Internal test set (256 diverse instructions): iterative training steadily improves capability (Fig.10) Fig.10: Human evaluation results. AlpacaEval 2.0: after two rounds of training, performance approaches that of GPT-4 0314 (Fig.11) Fig.11: AlpacaEval 2.0 results (win rate over GPT-4 Turbo evaluated by GPT-4). MT-Bench: improvements across task types, with general writing standing out (Fig.12) Fig.12: MT-Bench Results (on a scale of 10). Self-evaluation OpenAssistant validation set: the data show the model's evaluation ability keeps strengthening across training cycles Despite this notable progress, a challenge remains: how to further improve performance on complex reasoning tasks?\nIterative Reasoning Preference Optimization (IRPO) IRPO builds on Self-Rewarding by introducing CoT: the model must produce not only a final answer but also a complete reasoning process[12]. The workflow (Fig.13):\nFig.13: Iterative Reasoning Preference Optimization. Use the current model to generate multiple CoTs and corresponding answers for each training example Construct preference pairs based on answer correctness (correct vs. incorrect), selecting high-quality reasoning paths Train with DPO plus an NLL loss, raising the probability of generating correct answers while suppressing incorrect ones After training, use the updated model for the next round, continually improving reasoning ability Experimental results\nOn GSM8K, the method improves accuracy by nearly 10% over 1 to 4 iterations (Fig.14) Fig.14: GSM8K results comparing Iterative Reasoning Preference Optimization (Iterative RPO) against other baselines that are based on the same base model and training data. On ARC Challenge and harder math reasoning tasks, it also yields clear gains (Fig.15) Fig.15: ARC and MATH results. Experiments show DPO training is essential; SFT alone cannot achieve the same effect, and only by suppressing negative samples can reasoning ability truly improve (Fig.16).\nFig.16: SFT trained on chosen seqs; init from Llama.
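The preference-pair construction step of IRPO can be sketched as follows. The sampled chains of thought and the equality check against a gold answer are toy stand-ins; real IRPO samples CoTs from the current model and scores them against reference answers before running DPO + NLL training.

```python
# Toy stand-in for sampled (chain-of-thought, final answer) pairs on one
# GSM8K-style problem whose gold answer is 14.
samples = [
    ("3 + 4 = 7, then 7 * 2 = 14", 14),
    ("3 + 4 = 8, then 8 * 2 = 16", 16),
    ("(3 + 4) * 2 = 14", 14),
]
gold = 14

def build_preference_pairs(samples, gold):
    """Pair every correct CoT (chosen) with every incorrect CoT (rejected)."""
    correct = [s for s in samples if s[1] == gold]
    wrong = [s for s in samples if s[1] != gold]
    return [(c, w) for c in correct for w in wrong]

pairs = build_preference_pairs(samples, gold)
# Each (chosen, rejected) pair feeds the DPO + NLL objective: raise the
# probability of the chosen reasoning path, suppress the rejected one.
```

After training on these pairs, the updated model generates the next round's samples, closing the iterative loop.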
Meta-Rewarding LLMs: Improving the Model's Judging Ability Meta-Rewarding addresses the rapid saturation seen in Self-Rewarding training. It focuses not only on answer quality but also on improving judging ability, having the model play three roles at once: Actor, Judge, and Meta-Judge[13].\nCore mechanisms:\nMeta-Judge: evaluates the existing judging criteria, ensuring the model keeps refining its judgment logic Meta-Rewards: introduces a new training signal that helps the model dynamically improve its judging criteria Iterative training loop: the training process breaks down into the following 3 steps, cycled for continual improvement (Fig.17)\nFig.17: Meta-Rewarding iterative training scheme. Generate Actor data: the model generates responses and evaluates its own answers Generate Judge data: an LLM-as-a-Meta-Judge further evaluates these judgments, producing Meta-Rewards DPO training: train with DPO on preference pairs, so the model learns both to answer questions better (Step 1) and to judge answers more accurately (Step 2) This method achieves higher win rates on AlpacaEval and also performs well on several production-grade LLM evaluations (Fig.18).\nFig.18: AlpacaEval 2: The evaluation on AlpacaEval shows significant improvement with MetaRewarding training. Meta-Rewarding improves not only answer quality but also the model's capacity for self-improvement, making its judging criteria more robust and trustworthy (Fig.19).\nFig.19: AlpacaEval 2. Length-controlled (LC) win rate increases with Meta-Rewarding iterations, even approaching Claude-Opus level. Reflections and Future Directions Summary Self-Rewarding Models (Section 3.1) demonstrate a new paradigm in which models can train and improve themselves, potentially moving toward intelligence beyond the human level.\nVerifiable Rewards: improving reasoning and evaluation capabilities by optimizing CoT, e.g., Iterative Reasoning Preference Optimization (Section 3.2) and the DeepSeek and O1 approaches.\nBetter Judges: evaluators with reasoning ability (such as CoT-based judge models) can help train models to think on non-verifiable tasks.\nMeta-Rewarding \u0026amp; Meta-Reasoning (Section 3.3): models can evaluate not only the task but also their own evaluations, refining the judging process.\nFuture work on large models will aim to integrate these techniques, folding advances such as Meta-Rewarding and CoT into the base architecture.\nMeanwhile, research is extending from text-based CoT to vector-based reasoning: COCONUT[14], for example, replaces tokens with continuous vectors and can match or even exceed classic CoT on some tasks, especially complex search. These experiments are still limited to small-scale tasks, however, and their scalability remains to be verified.\nWhat else comes next? Self-improving \u0026amp; Self-evaluation: the key to breaking through performance bottlenecks; devoting more inference-time compute to evaluation may be critical to further capability gains.\nLearning from interaction: models should improve their reasoning not only from static data but also through interaction with humans, the internet, or themselves, closely tied to research on Agents and Synthetic Data.\nBreaking through at the System 1 level: current research focuses mainly on System 2 reasoning (explicit, logic-based reasoning processes), but improving the Transformer architecture itself, such as exploring better attention mechanisms or developing new neural network layers, could bring even more disruptive progress.\nReferences [1] Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, and others.
\u0026ldquo;Training Language Models to Follow Instructions with Human Feedback.\u0026rdquo; arXiv, March 4, 2022.\n[2] Rafailov, Rafael, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. \u0026ldquo;Direct Preference Optimization: Your Language Model is Secretly a Reward Model.\u0026rdquo; arXiv, July 29, 2024.\n[3] Dhuliawala, Shehzaad, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. \u0026ldquo;Chain-of-Verification Reduces Hallucination in Large Language Models.\u0026rdquo; arXiv, September 25, 2023.\n[4] Weston, Jason, and Sainbayar Sukhbaatar. \u0026ldquo;System 2 Attention (Is Something You Might Need Too).\u0026rdquo; arXiv, November 20, 2023.\n[5] Saha, Swarnadeep, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, and Xian Li. \u0026ldquo;Branch-Solve-Merge Improves Large Language Model Evaluation and Generation.\u0026rdquo; arXiv, June 7, 2024.\n[6] OpenAI. \u0026ldquo;Weak-to-Strong Generalization.\u0026rdquo; OpenAI, February 14, 2024.\n[7] Bai, Yuntao, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, and others. \u0026ldquo;Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.\u0026rdquo; arXiv, April 12, 2022.\n[8] Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, and others. \u0026ldquo;Llama 2: Open Foundation and Fine-Tuned Chat Models.\u0026rdquo; arXiv, July 19, 2023.\n[9] Zheng, Lianmin, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, and others. \u0026ldquo;Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.\u0026rdquo; arXiv, December 24, 2023.\n[10] Dubois, Yann, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, and others. \u0026ldquo;AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback.\u0026rdquo; arXiv, January 8, 2024.\n[11] Yuan, Weizhe, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. 
\u0026ldquo;Self-Rewarding Language Models.\u0026rdquo; arXiv, February 8, 2024.\n[12] Pang, Richard Yuanzhe, Weizhe Yuan, Kyunghyun Cho, He He, Sainbayar Sukhbaatar, and Jason Weston. “Iterative Reasoning Preference Optimization.” arXiv, 2024.\n[13] Wu, Tianhao, Weizhe Yuan, Olga Golovneva, Jing Xu, and Yuandong Tian. “Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge.” arXiv preprint arXiv:2407.19594, July 30, 2024.\n[14] Hao, Shibo, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. “Training Large Language Models to Reason in a Continuous Latent Space.” Preprint, December 11, 2024.\n","permalink":"https://mig217.github.io/post/2025-03-01-llm-self-improvement-and-reasoning-evolution/","summary":"\u003cp\u003e本文内容来自 \u003ca href=\"https://rdi.berkeley.edu/adv-llm-agents/slides/Jason-Weston-Reasoning-Alignment-Berkeley-Talk.pdf\"\u003eJason Weston (Meta) 在 UC Berkeley Advanced Large Language Model Agents 课程中的分享，探讨了大语言模型的推理能力提升\u003c/a\u003e 。以下为讲座内容：\u003c/p\u003e\n\u003chr\u003e\n\u003cp\u003eAI 能力正在快速发展，如 O1、R1 等模型在推理基准测试中取得的突破性进展。本文将聚焦于模型的\u003cstrong\u003e自我提升能力(self-improvement)\u003c/strong\u003e。\u003c/p\u003e\n\u003cp\u003e为了更好地理解AI的推理机制，我们首先需要区分两种基本的思维模式：\u003cstrong\u003eSystem 1和 System 2\u003c/strong\u003e：\u003c/p\u003e\n\u003cfigure class=\"align-center\"\u003e\n    \u003cimg loading=\"lazy\" src=\"/images/System1\u0026amp;System2.png\" width=\"700px\"/\u003e \u003cfigcaption\u003e\n            Fig.1: Hybrid Reasoning Framework: System 1 and System 2 Collaboration in LLMs\n        \u003c/figcaption\u003e\n\u003c/figure\u003e\n\n\u003cp\u003e\u003cstrong\u003eSystem 
1：快速直觉系统\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e这是一种类似人类直觉反应的快速思维系统，主要依赖关联性思维。在LLM中，这种能力体现在\u003cstrong\u003etransformer神经网络的基础运作机制上\u003c/strong\u003e。其主要特征包括：\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e每个token使用固定的计算资源；\u003c/li\u003e\n\u003cli\u003e直接输出答案；\u003c/li\u003e\n\u003cli\u003e局限性：\u003cstrong\u003e容易学习到虚假关联，产生幻觉、迎合性回答、越界等问题\u003c/strong\u003e；\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eSystem 2: 深度思考系统\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e这代表了一种更深层次的思维模式，目前主要通过\u003cstrong\u003eCoT\u003c/strong\u003e来实现。在生成最终答案之前，System 2会进行系统性的推理分析。它具有以下优势：\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e能够执行\u003cstrong\u003e规划、搜索和验证\u003c/strong\u003e等复杂任务；\u003c/li\u003e\n\u003cli\u003e具备动态计算能力，可以通过CoT、ToT等方式实现灵活推理；\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e我们可以通过优化\u003cstrong\u003e模型架构或权重来提升System 1的表现\u003c/strong\u003e，也可以通过\u003cstrong\u003e改进推理链的生成方式来增强System 2\u003c/strong\u003e的表现；最终目标是让AI具备\u003cstrong\u003e自我学习\u003c/strong\u003e能力，这包括3个关键方面：\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e能够自主设计具有挑战性的训练任务\u003c/li\u003e\n\u003cli\u003e评估任务的完成质量，形成自我奖励机制\u003c/li\u003e\n\u003cli\u003e根据理解和反馈，持续更新优化自身能力\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e接下来，我们将分两个部分展开讨论：首先回顾语言模型的历史发展历程，然后深入探讨过去一年中的重要研究进展。\u003c/p\u003e\n\u003ch2 id=\"llm-post-trainingo1r1-之前的优化之路\"\u003eLLM Post-training：O1/R1 之前的优化之路\u003c/h2\u003e\n\u003ch3 id=\"instrcut-gpt\"\u003eInstructGPT\u003c/h3\u003e\n\u003cp\u003eInstructGPT 是在 2022 年提出的一种语言模型优化方法[1]，它\u003cstrong\u003e结合了监督学习和基于人类反馈的强化学习（RLHF）\u003c/strong\u003e，比单纯的监督微调更为先进。这种方法包含三个关键步骤(如图)：\u003c/p\u003e\n\u003cfigure class=\"align-center\"\u003e\n    \u003cimg loading=\"lazy\" src=\"/images/illustrating.png\" width=\"700px\"/\u003e \u003cfigcaption\u003e\n            Fig.2: Reinforcement Learning from Human Feedback (RLHF) Training Pipeline\n        
\u003c/figcaption\u003e\n\u003c/figure\u003e\n\n\u003col\u003e\n\u003cli\u003e\u003cstrong\u003e监督微调（SFT）阶段\u003c/strong\u003e：通过收集人类标注的示范数据，对基础模型进行初步的行为调整；\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003e奖励模型训练阶段\u003c/strong\u003e：根据人类对模型多个输出的排序评估，训练一个能够判断输出质量的奖励模型；\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003e强化学习优化阶段\u003c/strong\u003e：利用奖励模型的反馈评分，通过 PPO 算法持续优化模型输出；\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003e这种训练方法\u003cstrong\u003e通过引入人类反馈和自我优化机制，使模型能够持续改进其输出质量\u003c/strong\u003e。这不仅提升了模型的性能，更朝着自我训练的目标迈进。\u003c/p\u003e","title":"大语言模型的自我提升与推理能力进化(Jason Weston, Meta)"},{"content":"2024年，大语言模型在推理能力方面取得了显著突破。以O系列模型为例，在ARC-AGI评估任务中展现了令人瞩目的性能【1】：\nO3模型达到了87.5%的准确率，尽管每个任务的计算成本较高（超过$1,000） 相比之下，未采用特殊推理技术的传统LLMs准确率通常低于25% Fig.1: O-Series Performance 如何通过有效的Prompting 来激发大语言模型的深层次推理能力，一直是研究者和开发者关注的核心问题。以下是几种主要的触发方法：\n少量示例CoT提示（Few-shot CoT）：通过提供少量推理示例，引导模型学习推理模式并应用到新问题中。 指令型提示（Instruction prompting）：明确指导模型逐步思考问题，避免直接跳至答案。 指令微调（Instruction tuning）：针对多步思考的推理任务对模型进行微调，提升其在类似任务中生成连贯思维链的能力。 强化学习（Reinforcement learning）：利用强化学习技术训练模型，使其能够生成更完整、更准确的推理链。 本文重点：\n我们将深入探讨Inference-time techniques，特别关注如何通过扩展token预算来提升LLM的推理能力。主要包括三个维度：\n基本提示词技巧：使用更多的token预算来生成单一的解决方案。 从多个候选中进行搜索和选择，增加推理的宽度。 模型迭代自我改进，增加推理的深度，最终到达最优解。 使用更多的Token生成单一解决方案 优化提示词能显著提升模型在各类任务中的表现。本节将介绍一些提示词工程技术，帮助我们更好地完成复杂任务。 下图对比了Standard Prompting和CoT Prompting两种方法【2】【3】：\nStandard Prompting：仅给出最终答案，没有推理过程，容易导致错误结果。 CoT Prompting：展示完整的推理过程，让模型清晰地说明从问题到答案的推导步骤。 Fig.2: Chain-of-thought prompting enables large language models to tackle complex arithmetic, commonsense, and symbolic reasoning tasks. Chain-of-thought reasoning processes are highlighted. 
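Standard 与 CoT 两种提示的形式差异，可以用一个极简的提示拼接示意（示例题目与推理文本均为虚构，仅用于说明提示结构）：

```python
def build_prompt(question, examples, with_cot=False):
    """拼接 few-shot 提示：with_cot=True 时在示例答案前加入推理过程。"""
    parts = []
    for ex in examples:
        if with_cot:
            answer = f"{ex['rationale']}，所以答案是 {ex['answer']}。"
        else:
            answer = ex["answer"]
        parts.append(f"Q: {ex['question']}\nA: {answer}")
    parts.append(f"Q: {question}\nA:")  # 末尾留空，待模型续写
    return "\n\n".join(parts)

# 一个虚构的 few-shot 示例
example = [{"question": "小明有3个苹果，又买了2个，一共几个？",
            "rationale": "3 + 2 = 5", "answer": "5"}]
cot_prompt = build_prompt("停车场有4辆车，开走1辆，还剩几辆？", example, with_cot=True)
```

两种提示喂给同一个模型，区别仅在于示例答案是否带有推理过程；正如上文所述，带推理过程的示例会引导模型先写出推导步骤再给答案。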
Zero-shot CoT Prompting：通过指令引导生成思维链推理 0-shot CoT 是一种通过简短指令引导LLM进行推理的方法，无需依赖任何示例。在这种方法中，模型不需要看到具体的示范或训练数据，仅通过一个简单的指令（如\u0026quot;Let\u0026rsquo;s think step by step.\u0026quot;）（Fig.3）即可开始推理【4】。\nFig.3: Example inputs and outputs of GPT-3 优缺点：\n0-shot CoT的表现显著优于普通的0-shot方法，特别是在数学和符号推理等较难的任务中（Fig.4） 相比Few-shot CoT，0-shot CoT更加便捷，因为无需手动标注示例。但从性能来看，0-shot CoT的表现仍然不如Few-shot CoT（Fig.5） Fig.4: Accuracy comparison of Zero-shot-CoT with Zero-shot on each tasks. Fig.5: Comparison with baseline methods using accuracies on MultiArith and GSM8K. Analogical Prompting：让模型自主生成参考案例 在analogical prompting中，我们并不直接向LLM提供样本，而是先指示模型回忆相关的示例，然后再解决测试问题（Fig.6）。具体来说，模型会首先自我生成一些相关示例，接着利用这些示例去解决目标问题【5】。\nFig.6: Overview of our approach, analogical prompting. Analogical prompting 表现优于 0-shot CoT和 Few-shot CoT方法。\n优势：\n示例由LLM自主生成，无需手动标注 生成的示例能够根据特定问题量身定制，更具相关性。 除了生成示例外，LLM还能产生更高层次的知识概括，为问题提供更广泛的见解，从而帮助解决原始问题。（Fig.7） Fig.7: Actual example of our prompt (top) and LLM output (bottom) for the Codeforces task. 局限性：\n自动生成示例可能比人工标注示例包含更多错误。 有时生成的示例可能与问题无关，或包含错误的解题步骤，影响最终的推理质量。 LLM作为优化器，迭代改进Prompt 在这种方法中，我们将LLM作为优化器，通过分析历史轨迹（即已尝试的提示词及其对应的效果分数）来不断改进提示词质量【6】。具体实现需要两个LLM配合（Fig.8）：\nFig.8: An overview of the OPRO framework. Optimizer：基于历史提示词和任务示例，生成新的更优提示词 Evaluator：评估提示词的准确性表现 为实现这一目标，我们需要设计Meta prompt（Fig.9），主要包含两个关键要素：Trajectory（记录过往提示词及准确率）和 Exemplars（展示待优化的目标任务）\nFig.9: An example of the meta-prompt for prompt optimization with instruction-tuned PaLM 2-L (PaLM 2-L-IT) on GSM8K, where the generated instruction will be prepended to the beginning of “A:” in the scorer LLM output (A_begin in Section 4.1). 实验结果表明，这种方法效果显著（Fig.10）：\nFig.10: Top instructions with the highest GSM8K zero-shot test accuracies from prompt optimization with different optimizer LLMs. All results use the pre-trained PaLM 2-L as the scorer. 
从基础提示词\u0026quot;Let\u0026rsquo;s solve the problem\u0026quot;（准确率60.8%）开始 优化后的最佳提示词比\u0026quot;Let\u0026rsquo;s think step by step\u0026quot;提升了约8%，达到80.7%的准确率，与PaLM-2使用少量示例CoT的效果相当 这种方法带来两个重要启示：\n无需手动编写示例即可达到与few-shot CoT相当的性能 不仅节省了人工调优时间，还能发现一些意想不到的新视角，比如\u0026quot;Take a deep breath and work on this problem\u0026quot;这样的提示策略（Fig.10） Least-to-most prompting: 通过问题分解实现推理能力提升 Least-to-most prompting的核心思想是通过指导LLM如何分解复杂问题来提升其推理能力【7】。这种方法包含2个关键步骤（Fig.11）：\nFig.11: Least-to-most prompting solving a math word problem in two stages: (1) query the language model to decompose the problem into subproblems; (2) query the language model to sequentially solve the subproblems. Problem reduction：将复杂问题分解为简单的子问题 Sequentially solve subquestions：按顺序解决子问题，并将解决方案组合起来得到最终答案 Example: 解决SCAN任务\n任务目标：将合成的自然语言命令转换为对应的动作序列。比如，“look thrice after jump”可能转换为“JUMP LOOK LOOK LOOK”（Fig.12）。\nFig.12: Example commands in SCAN and their corresponding action sequences. 实验结果：使用从易到难提示法，模型在该任务上取得了接近完美的表现，准确率高达99.7% (Fig.13)\nFig.13: Accuracies (%) of different prompting methods on the test set of SCAN under length split. Self-Discover：让模型自主构建推理结构 不同的推理任务往往需要不同的推理结构，包括任务分解方式和各阶段的规划等。Self-Discover方法的创新之处在于，引导模型自动构建特定任务的推理结构，且无需手动编写示例【8】。这种方法分为两个主要阶段（Fig.14）：\nFig.14: Illustration of using SELF-DISCOVER for problem-solving. Stage 1: 模型通过Self-Discover模块自动发现并生成适合任务的推理结构 Stage 2: 模型利用已发现的推理结构来解答具体的任务实例 优势：Self-Discover方法在处理复杂推理任务时表现优异，相比传统的直接答案和链式思维方法，在多个复杂推理任务中展现出了优越的性能。尤其在需要进行多步推理的任务中，Self-Discover能够提供更高的准确率和更强的推理能力（Fig.15）。\nFig.15: SELF-DISCOVER guides LLMs to self-discover and compose atomic reasoning modules into a reasoning structure to solve challenging tasks. 
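以 Least-to-most prompting 为例，上述"先分解、再顺序求解"的两阶段流程可以写成如下控制流示意（llm 为假设的模型调用函数，提示词文案为虚构）：

```python
def least_to_most(question, llm):
    """Least-to-most 两阶段示意：先分解为子问题，再按顺序求解并累积上下文。

    llm(prompt) -> str 是本文假设的模型调用接口。
    """
    # 阶段 1：问题分解（假设模型按行返回子问题列表）
    subqs = llm(f"把下面的问题分解为更简单的子问题，每行一个：\n{question}").splitlines()
    # 阶段 2：顺序求解，把已解决的子问题及其答案拼入后续提示
    context = question
    answer = ""
    for sq in subqs:
        answer = llm(f"{context}\n子问题：{sq}\n回答：")
        context += f"\n子问题：{sq}\n回答：{answer}"
    return answer  # 最后一个子问题的答案即最终答案
```

关键设计在于阶段 2 的上下文累积：每个子问题求解时都能看到之前子问题的解，这正是"将解决方案组合起来"的含义。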
增加推理的宽度，探索更广泛的解决方案空间 为提升LLM的推理能力，我们不应局限于为每个问题只生成1个解决方案。通过探索多个推理分支，可以让模型从不同角度进行问题求解，从而提高推理的准确性和灵活性。\nSelf-consistency: 多路径推理提升准确率 Self-consistency 是一个简单但效果显著的方法。它的核心思想是让模型生成多个推理路径，然后从中选择最一致的答案，而不是仅依赖单一推理过程【9】。具体实现包含2个关键步骤（Fig.16）：\nFig.16: The self-consistency method contains three steps: (1) prompt a language model using chain-of-thought (CoT) prompting; (2) replace the “greedy decode” in CoT prompting by sampling from the language model’s decoder to generate a diverse set of reasoning paths; and (3) marginalize out the reasoning paths and aggregate by choosing the most consistent answer in the final answer set. 多路径生成：让模型对同一个问题生成多个不同推理路径 答案聚合：基于最终答案的一致性来选择最优解\nNote: 答案的选择仅基于最终结果，不需要不同推理路径之间完全一致 这一看似简单的策略，却能显著提升模型的表现（Fig.17）：\nFig.17: Arithmetic reasoning accuracy by self-consistency compared to chain-of-thought prompting (Wei et al., 2022). 准确率随样本量提升：实验数据显示，随着推理路径数量的增加，模型的准确率显著提升（Fig.18）。 Fig.18: Self-consistency significantly outperforms sample-and-rank with the same # of samples. 该方法揭示了准确率与一致性之间的关联。当多个推理路径指向相同答案时，LLM对其预测结论的信心更高，聚合后答案的正确性往往也更高（Fig.19）。 Fig.19: The consistency is correlated with model’s accuracy. 基于Self-consistency的应用：AlphaCode 在代码生成领域，基于Self-consistency的方法展现出了强大的效果。Google DeepMind的AlphaCode项目就是采用了这一方法（Fig.20），其核心是通过\u0026quot;Filtering \u0026amp; Clustering\u0026quot;来优化代码生成的结果【10】。具体步骤如下：\nFig.20: Overview of AlphaCode. 过滤LLM生成的代码（仅保留通过示例测试用例的程序），并在新的输入上执行测试 将所有输出相同的程序聚类在一起 从最大的10个聚类中，各选择1个程序作为代表 在 Codeforces 上的实验结果表明，聚类方法相比单纯的过滤带来了显著的提升。然而与Oracle selection相比，仍然存在一定的差距。（Fig.21，蓝色为Oracle selection）\nFig.21: Comparison of different sample selection methods. 局限性：Self-consistency在自由生成任务中的效果，不如代码生成中理想。因为自由生成任务没有明确的答案，解码过程复杂且结果不稳定，模型可能难以保持稳定的输出质量。\nUniversal Self-consistency (USC) : 让模型自主进行一致性选择 USC的核心思想是：取代传统的答案提取过程，直接让LLM执行基于一致性的选择【11】。具体来说：我们向模型发出指令，要求其基于多数共识来选择最一致的回答，并对所有候选回答进行审视（Fig.22）。\nFig.22: Overview of the Universal Self-Consistency workflow. 这种方法在实践中展现出了显著优势（Fig.23）\nFig.23: USC results with different number of samples. 
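上述 Self-consistency 的多数投票聚合可以用几行代码示意（sample_answer 为假设的单次采样接口，每次调用返回一条推理路径的最终答案）：

```python
from collections import Counter

def self_consistency(question, sample_answer, n_paths=10):
    """采样 n_paths 条推理路径的最终答案，返回出现次数最多的答案。

    注意：投票仅基于最终答案，不要求中间推理路径一致。
    """
    answers = [sample_answer(question) for _ in range(n_paths)]
    return Counter(answers).most_common(1)[0][0]
```

USC 则把这一步投票交给 LLM 本身来完成，因此可以用于没有可直接比较的最终答案的开放式任务。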
在摘要生成和问答等开放式生成任务中，USC取得了显著的性能提升。 在数学推理和编程等任务中，USC能够达到与Self-Consistency方法相当的表现，同时无需进行答案提取和代码执行。 需要注意的是，USC的性能受限于模型处理长文本的能力。\nTree-of-thoughts (ToT) : 让LLM进行深度思考 到目前为止，我们所讨论的方法只在最终解决方案层面进行选择，而ToT方法则更进一步。\n通过引入逐步评分机制，ToT能够在解决过程中进行树搜索。这意味着我们不用等到完整解决方案后才做出判断，而是可以在搜索过程中优先探索更有希望的步骤【12】。\nFig.24: Schematic illustrating various approaches to problem solving with LLMs. Example: 24点游戏\nToT方法的工作流程如下（Fig.25）：\nFig.25: ToT in a game of 24. The LM is prompted for (a) thought generation and (b) valuation. 思维生成：让模型提出可能的下一步思考方向 思维评估：让模型评估当前状态的潜力/可行性 LLM会通过多次投票来选择最佳方案，最终采用得票最多的选项（Fig.26）。\nFig.26: A step of deliberate search in a randomly picked Creative Writing task. Given the input, the LM samples 5 different plans, then votes 5 times to decide which plan is best. 研究结果表明：在相同的token预算下，采用广度优先搜索（BFS）的ToT方法比Standard Prompting 和 CoT Prompting表现更好。\n模型迭代自我改进，迈向最优解 在前文中，我们探讨了通过生成多个解来帮助减少单次预测的错误。但这其实是一种相对次优的错误修正方式。因为所有的响应都是同时生成的，模型无法从之前的错误中吸取经验教训。\n因此，在这一部分中，我们将重点介绍如何让LLM在推理过程中不断学习和改进，通过迭代优化来提升最终输出的质量。\n反思与自我改进：利用内外部反馈提升LLM性能 Reflexion 与 Self-Refine 是让LLM持续优化输出的两种方法【13】【14】。在生成解决方案后，模型会经历2个关键步骤（Fig.28）：\nFig.28: Reflexion works on decision-making 4.1, programming 4.3, and reasoning 4.2 tasks. LLM根据观察结果生成反馈（这个过程可以引入外部评估提供更客观的参考） 在Reflexion论文【13】中模型充当代理（agent），向环境提出行动请求，环境根据模型的输入反馈观察结果。这些外部信号帮助模型判断当前步骤的有效性 LLM结合内部反思和外部反馈优化输出，为下一步预测提供更好的基础 该方法在多个任务中表现出色，尤其是在能够获得高质量外部评估信号的任务中：\n在ALFWorld任务中，通过有效的评估启发式方法，反思显著提高了模型性能 (Fig.29) Fig.29: (a) AlfWorld performance across 134 tasks showing cumulative proportions of solved tasks using self-evaluation techniques of (Heuristic) and (GPT) for binary classification. (b) Classification of AlfWorld trajectories by reason of failure. 在HotPotQA任务中，外部评估为每次反思提供了答案的准确性，从而帮助模型在每次反思后做出更好的改进 (Fig.30) Fig.30: Chain-of-Thought (CoT) and ReAct. Reflexion improves search, information retrieval, and reasoning capabilities on 100 HotPotQA questions. 
(a) Reflexion ReAct vs Reflexion CoT (b) Reflexion CoT (GT) for reasoning only (c) Reflexion vs episodic memory ablation. LLMs 自我纠错能力仍有局限 Self-correction方法被证明可以提升模型性能，但这些研究大多依赖于oracle verifier。然而在实际应用中，我们通常无法获得这样的外部验证机制。那么在没有外部反馈的情况下，LLM的表现如何呢？\n在论文“Large Language Models Cannot Self-Correct Reasoning Yet”中，展示了这一问题的负面结果【15】：\n在没有oracle verifier的情况下，LLM需要自行判断其响应的正确性\nLLM可能错误地判断自己的预测正确性，从而导致自我修正后性能反而下降\n多智能体辩论 vs Self-consistency 多智能体辩论的核心思想是：让模型并行生成多个响应，然后提示LLM评估这些响应并给出更新后的答案。\n虽然 “Multiagent Debate” 论文表明多智能体辩论优于Self-consistency方法【16】，但深入分析发现这种方法存在明显的token使用效率问题【15】：如果我们控制相同的回答数量和预算限制，Self-consistency的表现实际上更好。\n如何设计有效的推理技术? 在探索了各种技术后，我们可以总结出几个关键原则：\n通用性和可扩展性优先【17】：\n最强大的方法往往是那些能随着计算能力增长而持续扩展的通用方法。 搜索和学习是2种展现出优异扩展性的基础方法。 理解模型的能力边界：\n不同任务可能需要不同的推理策略。 选择的技术应该与模型的实际能力相匹配。 注重发现机制【17】：\n我们的目标是打造能够自主发现的AI系统，而不是简单地将人类的发现嵌入其中。 过度依赖预设的解决方案可能会阻碍我们理解真正的发现过程。 参考文献： “OpenAI O3 Breakthrough High Score on ARC-AGI-Pub.” ARC Prize. Accessed February 15, 2025. Wei, Jason, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, et al. \u0026ldquo;Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.\u0026rdquo; Preprint, arXiv, January 10, 2023. Nye, Maxwell, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, et al. \u0026ldquo;Show Your Work: Scratchpads for Intermediate Computation with Language Models.\u0026rdquo; Preprint, arXiv, November 30, 2021. Kojima, Takeshi, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. \u0026ldquo;Large Language Models are Zero-Shot Reasoners.\u0026rdquo; Preprint, arXiv, January 29, 2023. Yasunaga, Michihiro, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, et al. \u0026ldquo;Large Language Models as Analogical Reasoners.\u0026rdquo; Preprint, arXiv, March 9, 2024. Yang, Chengrun, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, and Xinyun Chen. \u0026ldquo;Large Language Models as Optimizers.\u0026rdquo; Preprint, arXiv, April 15, 2024. 
Zhou, Denny, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, et al. \u0026ldquo;Least-to-Most Prompting Enables Complex Reasoning in Large Language Models.\u0026rdquo; Preprint, arXiv, April 16, 2023. Zhou, Pei, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, et al. \u0026ldquo;Self-Discover: Large Language Models Self-Compose Reasoning Structures.\u0026rdquo; Preprint, arXiv, February 6, 2024. Wang, Xuezhi, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and others. \u0026ldquo;Self-Consistency Improves Chain of Thought Reasoning in Language Models.\u0026rdquo; Preprint, arXiv, March 7, 2023. Li, Yujia, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, and others. \u0026ldquo;Competition-Level Code Generation with AlphaCode.\u0026rdquo; Science, December 9, 2022. Chen, Xinyun, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, and others. \u0026ldquo;Universal Self-Consistency for Large Language Model Generation.\u0026rdquo; Preprint, arXiv, November 29, 2023. Yao, Shunyu, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. \u0026ldquo;Tree of Thoughts: Deliberate Problem Solving with Large Language Models.\u0026rdquo; Preprint, arXiv, December 3, 2023. Shinn, Noah, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. \u0026ldquo;Reflexion: Language Agents with Verbal Reinforcement Learning.\u0026rdquo; Preprint, arXiv, October 10, 2023. Madaan, Aman, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, and others. \u0026ldquo;Self-Refine: Iterative Refinement with Self-Feedback.\u0026rdquo; Preprint, arXiv, May 25, 2023. Huang, Jie, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. \u0026ldquo;Large Language Models Cannot Self-Correct Reasoning Yet.\u0026rdquo; Preprint, arXiv, March 14, 2024. Du, Yilun, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 
\u0026ldquo;Improving Factuality and Reasoning in Language Models through Multiagent Debate.\u0026rdquo; Preprint, arXiv, May 23, 2023. Sutton, Richard. The Bitter Lesson. ","permalink":"https://mig217.github.io/post/2025-02-16-improving-llm-reasoning/","summary":"\u003cp\u003e2024年，大语言模型在推理能力方面取得了显著突破。以O系列模型为例，在ARC-AGI评估任务中展现了令人瞩目的性能【1】：\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eO3模型达到了87.5%的准确率\u003c/strong\u003e，尽管每个任务的计算成本较高（超过$1,000）\u003c/li\u003e\n\u003cli\u003e相比之下，\u003cstrong\u003e未采用特殊推理技术的传统LLMs准确率通常低于25%\u003c/strong\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cfigure class=\"align-center\"\u003e\n    \u003cimg loading=\"lazy\" src=\"/images/o-series-performance.jpg\" width=\"700px\"/\u003e \u003cfigcaption\u003e\n            Fig.1: O-Series Performance\n        \u003c/figcaption\u003e\n\u003c/figure\u003e\n\n\u003cp\u003e如何通过有效的Prompting 来激发大语言模型的深层次推理能力，一直是研究者和开发者关注的核心问题。以下是几种主要的触发方法：\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003e少量示例CoT提示（Few-shot CoT）\u003c/strong\u003e：通过提供少量推理示例，引导模型学习推理模式并应用到新问题中。\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003e指令型提示（Instruction prompting）\u003c/strong\u003e：明确指导模型逐步思考问题，避免直接跳至答案。\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003e指令微调（Instruction tuning）\u003c/strong\u003e：针对多步思考的推理任务对模型进行微调，提升其在类似任务中生成连贯思维链的能力。\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003e强化学习（Reinforcement learning）\u003c/strong\u003e：利用强化学习技术训练模型，使其能够生成更完整、更准确的推理链。\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003e本文重点：\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e我们将深入探讨Inference-time techniques，特别关注如何通过扩展token预算来提升LLM的推理能力。主要包括三个维度：\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e基本提示词技巧：\u003cstrong\u003e使用更多的token预算来生成单一的解决方案\u003c/strong\u003e。\u003c/li\u003e\n\u003cli\u003e从多个候选中进行搜索和选择，增加推理的\u003cstrong\u003e宽度\u003c/strong\u003e。\u003c/li\u003e\n\u003cli\u003e模型迭代自我改进，增加推理的\u003cstrong\u003e深度，最终到达最优解\u003c/strong\u003e。\u003c/li\u003e\n\u003c/ul\u003e\n\u003ch2 
id=\"使用更多的token生成单一解决方案\"\u003e使用更多的Token生成单一解决方案\u003c/h2\u003e\n\u003cp\u003e优化提示词能显著提升模型在各类任务中的表现。本节将介绍一些提示词工程技术，帮助我们更好地完成复杂任务。\n下图对比了Standard Prompting和CoT Prompting两种方法【2】【3】：\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003cstrong\u003eStandard Prompting\u003c/strong\u003e：仅给出最终答案，没有推理过程，容易导致错误结果。\u003c/li\u003e\n\u003cli\u003e\u003cstrong\u003eCoT Prompting\u003c/strong\u003e：展示完整的推理过程，让\u003cstrong\u003e模型清晰地说明从问题到答案的推导步骤\u003c/strong\u003e。\u003c/li\u003e\n\u003c/ul\u003e\n\u003cfigure class=\"align-center\"\u003e\n    \u003cimg loading=\"lazy\" src=\"/images/CotvsSd.png\" width=\"700px\"/\u003e \u003cfigcaption\u003e\n            Fig.2: Chain-of-thought prompting enables large language models to tackle complex arithmetic commonsense, and symbolic reasoning tasks. Chain-of-thought reasoning processes are highlighted.\n        \u003c/figcaption\u003e\n\u003c/figure\u003e\n\n\u003ch3 id=\"zero-shot-cot-prompting通过指令引导生成思维链推理\"\u003eZero-shot CoT Prompting：通过指令引导生成思维链推理\u003c/h3\u003e\n\u003cp\u003e0-shot CoT 是一种通过简短指令引导LLM进行推理的方法，无需依赖任何示例。在这种方法中，模型不需要看到具体的示范或训练数据，仅\u003cstrong\u003e通过一个简单的指令（如\u0026quot;Let\u0026rsquo;s think step by step.\u0026quot;）（Fig.3）即可开始推理\u003c/strong\u003e【4】。\u003c/p\u003e\n\u003cfigure class=\"align-center\"\u003e\n    \u003cimg loading=\"lazy\" src=\"/images/0-shot-cot.png\" width=\"700px\"/\u003e \u003cfigcaption\u003e\n            Fig.3: Example inputs and outputs of GPT-3\n        \u003c/figcaption\u003e\n\u003c/figure\u003e\n\n\u003cp\u003e\u003cstrong\u003e优缺点：\u003c/strong\u003e\u003c/p\u003e","title":"如何优化大语言模型（LLM）的推理能力？"}]