From "Reasoning" Thinking to "Agentic" Thinking

The last two years reshaped how we evaluate models and what we expect from them. OpenAI's o1 showed that "thinking" could be a first-class capability, something you train for and expose to users. DeepSeek-R1 proved that reasoning-style post-training could be reproduced and scaled outside the original labs. OpenAI described o1 as a model trained with reinforcement learning to "think before it answers." DeepSeek positioned R1 as an open reasoning model competitive with o1.

That phase mattered. But the first half of 2025 was mostly about reasoning thinking: how to make models spend more inference-time compute, how to train them with stronger rewards, how to expose or control that extra reasoning effort. The question now is what comes next. I believe the answer is agentic thinking: thinking in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world.

1. What the Rise of o1 and R1 Actually Taught Us

The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical.
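
As a toy illustration of the "correctness rather than plausibility" point, here is a minimal sketch of a verifiable reward (the candidate programs, test cases, and reward scale are invented for the example): a deterministic checker that executes a candidate solution against unit tests and pays out only for exact correctness, with no judge model in the loop.

```python
# Verifiable reward: run the candidate's function `f` against exact test
# cases and reward correctness deterministically. Sandboxing, timeouts,
# and real task formats are omitted from this sketch.

def verifiable_reward(candidate_src: str, tests: list) -> float:
    namespace: dict = {}
    exec(candidate_src, namespace)  # assumes the candidate defines f(x)
    f = namespace["f"]
    return 1.0 if all(f(x) == y for x, y in tests) else 0.0

good = "def f(x):\n    return x * 2"
bad = "def f(x):\n    return x + 2"
tests = [(1, 2), (3, 6)]
print(verifiable_reward(good, tests), verifiable_reward(bad, tests))  # 1.0 0.0
```

Because the checker either passes or fails, the same signal scales to millions of rollouts without drifting the way a learned preference model can.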

Once a model is trained to reason through longer trajectories, RL stops being a lightweight add-on to supervised fine-tuning. It becomes a systems problem. You need rollouts at scale, high-throughput verification, stable policy updates, and efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story. OpenAI described o1 as a reasoning line trained with RL, and DeepSeek-R1 later reinforced that direction by showing how much dedicated algorithmic and infrastructure work reasoning-based RL demands. The first big transition: from scaling pretraining to scaling post-training for reasoning.

2. The Real Problem Was Never Just "Merge Thinking and Instruct"

At the beginning of 2025, many of us on the Qwen team had an ambitious picture in mind. The ideal system would unify thinking and instruct modes. It would support adjustable reasoning effort, similar in spirit to low / medium / high reasoning settings. Better still, it would automatically infer the appropriate amount of reasoning from the prompt and context, so the model could decide when to answer immediately, when to think longer, and when to spend much more computation on a truly difficult problem.

Conceptually, this was the right direction. Qwen3 was one of the clearest public attempts. It introduced "hybrid thinking modes," supported both thinking and non-thinking behavior in one family, emphasized controllable thinking budgets, and described a four-stage post-training pipeline that explicitly included "thinking mode fusion" after long-CoT cold start and reasoning RL.

But merging is much easier to describe than to execute well. The hard part is data. When people talk about merging thinking and instruct, they often think first about model-side compatibility: can one checkpoint support both modes, can one chat template switch between them, can one serving stack expose the right toggles. The deeper issue is that the data distributions and behavioral objectives of the two modes are substantially different.

We did not get everything right when trying to balance model merging with improving the quality and diversity of post-training data. During that revision process, we also paid close attention to how users were actually engaging with thinking and instruct modes. A strong instruct model is typically rewarded for directness, brevity, formatting compliance, and low latency on repetitive, high-volume enterprise tasks such as rewriting, labeling, templated support, structured extraction, and operational QA. A strong thinking model is rewarded for spending more tokens on difficult problems, maintaining coherent intermediate structure, exploring alternative paths, and preserving enough internal computation to meaningfully improve final correctness.

These two behavior profiles pull against each other. If the merged data is not carefully curated, the result is usually mediocre in both directions: the "thinking" behavior becomes noisy, bloated, or insufficiently decisive, while the "instruct" behavior becomes less crisp, less reliable, and more expensive than what commercial users actually want.

Separation remained attractive in practice. Later in 2025, after the initial hybrid framing of Qwen3, the 2507 line shipped distinct Instruct and Thinking updates, including separate 30B and 235B variants. In commercial deployment, a large number of customers still wanted high-throughput, low-cost, highly steerable instruct behavior for batch operations. For those scenarios, merging wasn't obviously a benefit. Separating the lines allowed teams to focus on solving the data and training problems of each mode more cleanly.

Other labs chose the opposite route. Anthropic publicly argued for an integrated model philosophy: Claude 3.7 Sonnet was introduced as a hybrid reasoning model where users could choose ordinary responses or extended thinking, and API users could set a thinking budget. Anthropic explicitly said they believed reasoning should be an integrated capability rather than a separate model. GLM-4.5 also publicly positioned itself as a hybrid reasoning model with both thinking and non-thinking modes, unifying reasoning, coding, and agent capabilities; DeepSeek later moved in a similar direction with V3.1's "Think & Non-Think" hybrid inference.

The key question is whether the merge is organic. If thinking and instruct are merely co-located inside one checkpoint but still behave like two awkwardly stitched personalities, the product experience remains unnatural. A truly successful merge requires a smooth spectrum of reasoning effort. The model should be able to express multiple levels of effort, and ideally choose among them adaptively. GPT-style effort control points toward this: a policy over compute, rather than a binary switch.

3. Why Anthropic's Direction Was a Useful Corrective

為什么Anthropic的方針是一種有益的糾正措施

Anthropic's public framing around Claude 3.7 and Claude 4 was restrained. They emphasized integrated reasoning, user-controlled thinking budgets, real-world tasks, coding quality, and later the ability to use tools during extended thinking. Claude 3.7 was presented as a hybrid reasoning model with controllable budgets; Claude 4 extended that by allowing reasoning to interleave with tool use, while Anthropic simultaneously emphasized coding, long-running tasks, and agent workflows as primary goals.

Producing a longer reasoning trace doesn't automatically make a model more intelligent. In many cases, excessive visible reasoning signals weak allocation. If the model is trying to reason about everything in the same verbose way, it may be failing to prioritize, failing to compress, or failing to act. Anthropic's trajectory suggested a more disciplined view: thinking should be shaped by the target workload. If the target is coding, then thinking should help with codebase navigation, planning, decomposition, error recovery, and tool orchestration. If the target is agent workflows, then thinking should improve execution quality over long horizons rather than producing impressive intermediate prose.

This emphasis on targeted utility points toward something larger: we are moving from the era of training models to the era of training agents. We made this explicit in the Qwen3 blog, writing that "we are transitioning from an era focused on training models to one centered on training agents," and linking future RL advances to environmental feedback for long-horizon reasoning. An agent is a system that can formulate plans, decide when to act, use tools, perceive environment feedback, revise strategy, and continue over long horizons. It is defined by closed-loop interaction with the world.

4. What "Agentic Thinking" Really Means

Agentic thinking is a different optimization target. Reasoning thinking is usually judged by the quality of internal deliberation before a final answer: can the model solve the theorem, write the proof, produce the correct code, or pass the benchmark. Agentic thinking is about whether the model can keep making progress while interacting with an environment.

The central question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?" Agentic thinking has to handle several things that pure reasoning models can mostly avoid: deciding when to act rather than keep deliberating, using tools, perceiving feedback from the environment, recovering from errors, and revising plans over long horizons. In short, agentic thinking means a model that reasons through action.

5. Why Agentic RL Infrastructure Is Harder

為什么智能體強(qiáng)化學(xué)習(xí)基礎(chǔ)設(shè)施更難

Once the objective shifts from solving benchmark problems to solving interactive tasks, the RL stack changes. The infrastructure used for classical reasoning RL isn't enough. In reasoning RL, you can often treat rollouts as mostly self-contained trajectories with relatively clean evaluators. In agentic RL, the policy is embedded inside a larger harness: tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and orchestration frameworks. The environment is no longer a static verifier; it's part of the training system.

This creates a new systems requirement: training and inference must be more cleanly decoupled. Without that decoupling, rollout throughput collapses. Consider a coding agent that must execute generated code against a live test harness: the inference side stalls waiting for execution feedback, the training side starves for completed trajectories, and the whole pipeline operates far below the GPU utilization you would expect from classical reasoning RL. Adding tool latency, partial observability, and stateful environments amplifies these inefficiencies. The result is that experimentation slows and becomes painful long before you reach the capability levels you are targeting.
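
The throughput argument can be made concrete with a toy producer-consumer sketch (pure illustration: real agentic RL stacks use distributed rollout services and replay buffers, not Python threads): rollout workers absorb environment latency asynchronously, while the learner drains a shared trajectory buffer at its own pace, so no single slow tool call stalls training.

```python
import queue
import threading
import time

# Finished trajectories flow from rollout workers to the learner through
# this buffer; the two sides never block on each other directly.
trajectory_buffer: queue.Queue = queue.Queue(maxsize=64)

def rollout_worker(worker_id: int, n_episodes: int) -> None:
    for ep in range(n_episodes):
        time.sleep(0.01)  # stands in for tool latency / code execution
        trajectory_buffer.put({"worker": worker_id, "episode": ep, "reward": 1.0})

def learner(total: int) -> list:
    # Consumes whichever trajectory finishes next, regardless of which
    # environment produced it.
    return [trajectory_buffer.get() for _ in range(total)]

workers = [threading.Thread(target=rollout_worker, args=(i, 5)) for i in range(4)]
for w in workers:
    w.start()
collected = learner(total=20)  # 4 workers x 5 episodes
for w in workers:
    w.join()
print(len(collected))
```

Even in this caricature, the structural point survives: the learner's throughput is bounded by aggregate rollout production, not by the latency of any individual environment interaction.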

The environment itself also becomes a first-class research artifact. In the SFT era, we obsessed over data diversity. In the agent era, we should obsess over environment quality: stability, realism, coverage, difficulty, diversity of states, richness of feedback, exploit resistance, and scalability of rollout generation. Environment-building has started to become a real startup category rather than a side project. If the agent is being trained to operate in production-like settings, then the environment is part of the core capability stack.
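
At the interface level, treating the environment as a first-class artifact might look like the minimal sketch below (the protocol and the toy two-step task are invented; real harnesses expose far richer state and feedback): a stateful environment with reset/step semantics whose terminal reward is verifiable.

```python
from typing import Protocol

class AgentEnv(Protocol):
    # Conventional RL-style surface: reset to a task, step with an action,
    # receive (observation, reward, done).
    def reset(self, task_id: str) -> dict: ...
    def step(self, action: dict) -> tuple: ...

class ToyShellEnv:
    """Trivial stateful environment: the agent must 'write' before 'test'."""

    def reset(self, task_id: str) -> dict:
        self.written = False
        return {"task": task_id}

    def step(self, action: dict) -> tuple:
        if action["cmd"] == "write":
            self.written = True
            return {"stdout": "file written"}, 0.0, False
        if action["cmd"] == "test" and self.written:
            return {"stdout": "all tests pass"}, 1.0, True  # verifiable reward
        return {"stdout": "tests fail"}, 0.0, True
```

Once the environment is an object like this, stability, exploit resistance, and rollout scalability become testable properties of it, in the same way model code is tested.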

6. The Next Frontier Is More Usable Thought

My expectation is that agentic thinking will become the dominant form of thinking. I think it may eventually replace much of the old static-monologue version of reasoning thinking: excessively long, isolated internal traces that try to compensate for lack of interaction by emitting more and more text. Even on very difficult math or coding tasks, a genuinely advanced system should have the right to search, simulate, execute, inspect, verify, and revise. The objective is to solve problems robustly and productively.

The hardest challenge in training such systems is reward hacking. As soon as the model gets meaningful tool access, reward hacking becomes much more dangerous. A model with search might learn to look up answers directly during RL. A coding agent might exploit future information in a repository, misuse logs, or discover shortcuts that invalidate the task. An environment with hidden leaks can make the policy look superhuman while actually training it to cheat. This is where the agent era becomes much more delicate than the reasoning era. Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization. We should expect the next serious research bottlenecks to come from environment design, evaluator robustness, anti-cheating protocols, and more principled interfaces between policy and world. Still, the direction is clear. Tool-enabled thinking is simply more useful than isolated thinking, and has a far better chance of improving real productivity.
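
One narrow slice of the anti-cheating problem can be shown concretely (the file names and reward scale are made up for this sketch): an evaluator that withholds reward from any trajectory whose tool calls touched resources holding ground-truth answers, so a "success" obtained by peeking scores zero.

```python
# Resources that leak ground truth; in a real harness this would come
# from the environment's manifest rather than a hard-coded set.
FORBIDDEN = {"answers.json", "test_expected_output.txt"}

def score_trajectory(tool_calls: list, task_passed: bool) -> float:
    """Zero out reward for runs that read leaked answers."""
    for call in tool_calls:
        if call.get("path") in FORBIDDEN:
            return 0.0  # likely spurious success; withhold reward
    return 1.0 if task_passed else 0.0
```

This only catches the leaks you know to enumerate, which is exactly why evaluator robustness and leak-free environment design become research problems in their own right.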

Agentic thinking will also mean harness engineering. The core intelligence will increasingly come from how multiple agents are organized: an orchestrator that plans and routes work, specialized agents that act like domain experts, and sub-agents that execute narrower tasks while helping control context, avoid pollution, and preserve separation between different levels of reasoning. The future is a shift from training models to training agents, and from training agents to training systems.
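
The orchestrator/sub-agent structure described above can be caricatured in a few lines (the agent names, routing table, and context slicing are invented for illustration): a top-level orchestrator plans the sequence and routes work to specialized sub-agents, each of which sees only the slice of context it needs.

```python
# Each sub-agent is just a function here; in a real harness each would be
# a model call with its own prompt, tools, and isolated context window.

def research_agent(task: str, context: dict) -> str:
    return f"notes on {task}"

def coder_agent(task: str, context: dict) -> str:
    return f"patch for {task} using {context['notes']}"

ROUTES = {"research": research_agent, "implement": coder_agent}

def orchestrator(goal: str) -> str:
    # The orchestrator sequences the work and passes each sub-agent only
    # what it needs, keeping levels of reasoning separated.
    notes = ROUTES["research"](goal, {})
    return ROUTES["implement"](goal, {"notes": notes})
```

The narrowing of each sub-agent's context is the point: it is what controls pollution and keeps one level of reasoning from leaking into another.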

Conclusion

The first phase of the reasoning wave established something important: RL on top of language models can produce qualitatively stronger cognition when the feedback signal is reliable and the infrastructure can support it.

The deeper transition is from reasoning thinking to agentic thinking: from thinking longer to thinking in order to act. The core object of training has shifted. It is the model-plus-environment system, or more concretely, the agent and the harness around it. That changes what research artifacts matter most: model architecture and training data, yes, but also environment design, rollout infrastructure, evaluator robustness, and the interfaces through which multiple agents coordinate. It changes what "good thinking" means: the most useful trace for sustaining action under real-world constraints, rather than the longest or most visible one.

It also changes where the competitive edge will come from. In the reasoning era, the edge came from better RL algorithms, stronger feedback signals, and more scalable training pipelines. In the agentic era, the edge will come from better environments, tighter train-serve integration, stronger harness engineering, and the ability to close the loop between a model's decisions and the consequences those decisions produce.
