Arxiv Papers of Today

Generated: 2026-01-23 16:33:32 (UTC+8); arXiv release time: 2026-01-23 20:00 EST (2026-01-24 09:00 UTC+8)

There are 21 relevant papers today.

Keyword: reinforcement learning

ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation

Authors: Zhebo Wang, Xiaohu Mu, Zijie Zhou, Mohan Li, Wenpeng Xing, Dezhang Kong, Meng Han
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.15330
Pdf link: https://arxiv.org/pdf/2601.15330
Abstract Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial instructions. We find that standard post-training techniques like Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate this issue by rewarding confident, direct answers, thereby inducing overconfidence and discouraging the model from seeking clarification. To address this, we propose Illocution-Calibrated Policy Optimization (ICPO), a novel training framework that sensitizes the model to instruction ambiguity. ICPO augments the training corpus with underspecified prompts and conditions the reward signal on the user's illocutionary intent, rewarding the model for expressing uncertainty or asking for clarification when faced with ambiguity. Experiments demonstrate that ICPO fosters appropriate humility, yielding a substantial average improvement of 75% in multi-turn conversation, while preserving robust performance on single-turn benchmarks. Our work presents a practical path toward more robust and collaborative conversational AI that can better navigate the nuances of human interaction.

A Mobile Magnetic Manipulation Platform for Gastrointestinal Navigation with Deep Reinforcement Learning Control

Authors: Zhifan Yan, Chang Liu, Yiyang Jiang, Wenxuan Zheng, Xinhao Chen, Axel Krieger
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.15545
Pdf link: https://arxiv.org/pdf/2601.15545
Abstract Targeted drug delivery in the gastrointestinal (GI) tract using magnetic robots offers a promising alternative to systemic treatments. However, controlling these robots is a major challenge. Stationary magnetic systems have a limited workspace, while mobile systems (e.g., coils on a robotic arm) suffer from a "model-calibration bottleneck", requiring complex, pre-calibrated physical models that are time-consuming to create and computationally expensive. This paper presents a compact, low-cost mobile magnetic manipulation platform that overcomes this limitation using Deep Reinforcement Learning (DRL). Our system features a compact four-electromagnet array mounted on a UR5 collaborative robot. A Soft Actor-Critic (SAC)-based control strategy is trained through a sim-to-real pipeline, enabling effective policy deployment within 15 minutes and significantly reducing setup time. We validated the platform by controlling a 7-mm magnetic capsule along 2D trajectories. Our DRL-based controller achieved a root-mean-square error (RMSE) of 1.18 mm for a square path and 1.50 mm for a circular path. We also demonstrated successful tracking over a clinically relevant, 30 cm × 20 cm workspace. This work demonstrates a rapidly deployable, model-free control framework capable of precise magnetic manipulation in a large workspace, validated using a 2D GI phantom.
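
The RMSE figures quoted above summarize how far the capsule strays from its reference path. As a minimal sketch of one way such a tracking error could be computed, assuming reference and measured positions are already paired point by point (the abstract does not say how the authors pair or sample them), the function name `tracking_rmse` and the sample coordinates below are purely illustrative:

```python
import math


def tracking_rmse(reference, measured):
    """Root-mean-square of the Euclidean distance between paired 2D points."""
    if len(reference) != len(measured) or not reference:
        raise ValueError("need two non-empty point lists of equal length")
    squared = [
        (rx - mx) ** 2 + (ry - my) ** 2
        for (rx, ry), (mx, my) in zip(reference, measured)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical square-path corners in millimetres and made-up measurements;
# these are not data from the paper.
reference = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
measured = [(0.0, 1.0), (10.0, -1.0), (11.0, 10.0), (-1.0, 10.0)]
rmse_mm = tracking_rmse(reference, measured)
```

Every measured point in this toy example is exactly 1 mm from its reference, so the RMSE is 1 mm.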

When Sharpening Becomes Collapse: Sampling Bias and Semantic Coupling in RL with Verifiable Rewards

Authors: Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang, Jun Zhou
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.15609
Pdf link: https://arxiv.org/pdf/2601.15609
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) is a central paradigm for turning large language models (LLMs) into reliable problem solvers, especially in logic-heavy domains. Despite its empirical success, it remains unclear whether RLVR elicits novel capabilities or merely sharpens the distribution over existing knowledge. We study this by formalizing over-sharpening, a phenomenon where the policy collapses onto limited modes, suppressing valid alternatives. At a high level, we discover finite-batch updates intrinsically bias learning toward sampled modes, triggering a collapse that propagates globally via semantic coupling. To mitigate this, we propose inverse-success advantage calibration to prioritize difficult queries and distribution-level calibration to diversify sampling via a memory network. Empirical evaluations validate that our strategies can effectively improve generalization.

AION: Aerial Indoor Object-Goal Navigation Using Dual-Policy Reinforcement Learning

Authors: Zichen Yan, Yuchen Hou, Shenao Wang, Yichao Gao, Rui Huang, Lin Zhao
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.15614
Pdf link: https://arxiv.org/pdf/2601.15614
Abstract Object-Goal Navigation (ObjectNav) requires an agent to autonomously explore an unknown environment and navigate toward target objects specified by a semantic label. While prior work has primarily studied zero-shot ObjectNav under 2D locomotion, extending it to aerial platforms with 3D locomotion capability remains underexplored. Aerial robots offer superior maneuverability and search efficiency, but they also introduce new challenges in spatial perception, dynamic control, and safety assurance. In this paper, we propose AION for vision-based aerial ObjectNav without relying on external localization or global maps. AION is an end-to-end dual-policy reinforcement learning (RL) framework that decouples exploration and goal-reaching behaviors into two specialized policies. We evaluate AION on the AI2-THOR benchmark and further assess its real-time performance in IsaacSim using high-fidelity drone models. Experimental results show that AION achieves superior performance across comprehensive evaluation metrics in exploration, navigation efficiency, and safety. The video can be found at this https URL.

Explainable Deepfake Detection with RL Enhanced Self-Blended Images

Authors: Ning Jiang, Dingheng Zeng, Yanhong Liu, Haiyang Yi, Shijie Yu, Minghe Weng, Haifeng Shen, Ying Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.15624
Pdf link: https://arxiv.org/pdf/2601.15624
Abstract Most prior deepfake detection methods lack explainable outputs. With the growing interest in multimodal large language models (MLLMs), researchers have started exploring their use in interpretable deepfake detection. However, a major obstacle in applying MLLMs to this task is the scarcity of high-quality datasets with detailed forgery attribution annotations, as textual annotation is both costly and challenging - particularly for high-fidelity forged images or videos. Moreover, multiple studies have shown that reinforcement learning (RL) can substantially enhance performance in visual tasks, especially in improving cross-domain generalization. To facilitate the adoption of mainstream MLLM frameworks in deepfake detection with reduced annotation cost, and to investigate the potential of RL in this context, we propose an automated Chain-of-Thought (CoT) data generation framework based on Self-Blended Images, along with an RL-enhanced deepfake detection framework. Extensive experiments validate the effectiveness of our CoT data construction pipeline, tailored reward mechanism, and feedback-driven synthetic data generation approach. Our method achieves performance competitive with state-of-the-art (SOTA) approaches across multiple cross-dataset benchmarks. Implementation details are available at this https URL.

Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors

Authors: Zhiwei Zhang, Fei Zhao, Rui Wang, Zezhong Wang, Bin Liang, Jiakang Wang, Yao Hu, Shaosheng Cao, Kam-Fai Wong
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.15625
Pdf link: https://arxiv.org/pdf/2601.15625
Abstract Large language models (LLMs) can call tools effectively, yet they remain brittle in multi-turn execution: following a tool call error, smaller models often degenerate into repetitive invalid re-invocations, failing to interpret error feedback and self-correct. This brittleness hinders reliable real-world deployment, where the execution errors are inherently inevitable during tool interaction procedures. We identify a key limitation of current approaches: standard reinforcement learning (RL) treats errors as sparse negative rewards, providing no guidance on how to recover, while pre-collected synthetic error-correction datasets suffer from distribution mismatch with the model's on-policy error modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a finetuned Error Simulator, then resampling recovery rollouts on-policy. This enables the model to learn from the precise errors it makes during exploration, rather than from static, pre-collected error cases. On the BFCL v4 Multi-Turn, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7% absolute, crucially, yielding a 4% overall accuracy gain (42.75% to 46.75%) over GRPO and outperforming specialized tool-use agents.
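
For context on the GRPO baseline named above, the sketch below shows the commonly described group-relative advantage: each rollout's reward is normalized against the other rollouts sampled for the same prompt. It is an illustration of that generic normalization only, not of Fission-GRPO. The function name, the `eps` constant, and the choice of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group (mean 0, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumed choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical binary rewards for four rollouts of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group in which every rollout fails yields all-zero advantages.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Under this normalization a group whose rollouts all fail contributes no gradient signal at all. That is one concrete way to read the abstract's point that sparse failure rewards offer no guidance on how to recover.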

EmotionThinker: Prosody-Aware Reinforcement Learning for Explainable Speech Emotion Reasoning

Authors: Dingdong Wang, Shujie Liu, Tianhua Zhang, Youjun Chen, Jinyu Li, Helen Meng
Subjects: Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2601.15668
Pdf link: https://arxiv.org/pdf/2601.15668
Abstract Emotional information in speech plays a unique role in multimodal perception. However, current Speech Large Language Models (SpeechLLMs), similar to conventional speech emotion recognition (SER) systems, still treat emotion understanding as a simple classification problem. This provides limited interpretability of predictions, while leaving the LLMs' expressive and reasoning capabilities underutilized. In this work, we take the first step to reformulate SER as a deep reasoning problem through reinforcement learning (RL). We propose EmotionThinker, which is designed to generate accurate emotion predictions with interpretable explanations grounded in fine-grained acoustic cues. To achieve this, we first construct EmotionCoT-35K, an emotional reasoning dataset with Chain-of-Thought annotations and detailed captions. Second, we observe that current SpeechLLMs exhibit weak prosody perception, whereas prosodic cues constitute fundamental signals for interpreting emotions. To address this, we develop the prosody-enhanced foundation model EmotionThinker-Base, and demonstrate that prosody enhancement improves emotion understanding. Third, we introduce Group-Relative-Policy-Optimization with Progressive-Trust-aware-Reasoning-Reward (GRPO-PTR) for RL. Different from standard GRPO, which relies only on rule-based outcome rewards, GRPO-PTR progressively introduces reasoning reward, dynamically adjusts it with a trustworthiness weight reflecting the alignment between reasoning and outcome, and evaluates the overall reasoning quality with a reward model based on multi-dimensional criteria. EmotionThinker outperforms previous state-of-the-art evaluation models both in emotion accuracy and explanation quality, advancing SER toward interpretable multimodal reasoning. Project page: this https URL

Performance-guided Reinforced Active Learning for Object Detection

Authors: Zhixuan Liang, Xingyu Zeng, Rui Zhao, Ping Luo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.15688
Pdf link: https://arxiv.org/pdf/2601.15688
Abstract Active learning (AL) strategies aim to train high-performance models with minimal labeling efforts, only selecting the most informative instances for annotation. Current approaches to evaluating data informativeness predominantly focus on the data's distribution or intrinsic information content and do not directly correlate with downstream task performance, such as mean average precision (mAP) in object detection. Thus, we propose Performance-guided (i.e. mAP-guided) Reinforced Active Learning for Object Detection (MGRAL), a novel approach that leverages the concept of expected model output changes as informativeness. To address the combinatorial explosion challenge of batch sample selection and the non-differentiable correlation between model performance and selected batches, MGRAL skillfully employs a reinforcement learning-based sampling agent that optimizes selection using policy gradient with mAP improvement as reward. Moreover, to reduce the computational overhead of mAP estimation with unlabeled samples, MGRAL utilizes an unsupervised way with fast look-up tables, ensuring feasible deployment. We evaluate MGRAL's active learning performance on detection tasks over PASCAL VOC and COCO benchmarks. Our approach demonstrates the highest AL curve with convincing visualizations, establishing a new paradigm in reinforcement learning-driven active object detection.

From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models

Authors: Jiaxin Zhang, Wendi Cui, Zhuohang Li, Lifu Huang, Bradley Malin, Caiming Xiong, Chien-Sheng Wu
Subjects: Artificial Intelligence (cs.AI); Applications (stat.AP)
Arxiv link: https://arxiv.org/abs/2601.15690
Pdf link: https://arxiv.org/pdf/2601.15690
Abstract While Large Language Models (LLMs) show remarkable capabilities, their unreliability remains a critical barrier to deployment in high-stakes domains. This survey charts a functional evolution in addressing this challenge: the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers: in advanced reasoning to optimize computation and trigger self-correction; in autonomous agents to govern metacognitive decisions about tool use and information seeking; and in reinforcement learning to mitigate reward hacking and enable self-improvement via intrinsic rewards. By grounding these advancements in emerging theoretical frameworks like Bayesian methods and Conformal Prediction, we provide a unified perspective on this transformative trend. This survey provides a comprehensive overview, critical analysis, and practical design patterns, arguing that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.

Dancing in Chains: Strategic Persuasion in Academic Rebuttal via Theory of Mind

Authors: Zhitao He, Zongwei Lyu, Yi R Fung
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.15715
Pdf link: https://arxiv.org/pdf/2601.15715
Abstract Although artificial intelligence (AI) has become deeply integrated into various stages of the research workflow and achieved remarkable advancements, academic rebuttal remains a significant and underexplored challenge. This is because rebuttal is a complex process of strategic communication under severe information asymmetry rather than a simple technical debate. Consequently, current approaches struggle as they largely imitate surface-level linguistics, missing the essential element of perspective-taking required for effective persuasion. In this paper, we introduce RebuttalAgent, the first framework to ground academic rebuttal in Theory of Mind (ToM), operationalized through a ToM-Strategy-Response (TSR) pipeline that models reviewer mental state, formulates persuasion strategy, and generates strategy-grounded response. To train our agent, we construct RebuttalBench, a large-scale dataset synthesized via a novel critique-and-refine approach. Our training process consists of two stages, beginning with a supervised fine-tuning phase to equip the agent with ToM-based analysis and strategic planning capabilities, followed by a reinforcement learning phase leveraging the self-reward mechanism for scalable self-improvement. For reliable and efficient automated evaluation, we further develop Rebuttal-RM, a specialized evaluator trained on over 100K samples of multi-source rebuttal data, which achieves scoring consistency with human preferences surpassing powerful judge GPT-4.1. Extensive experiments show RebuttalAgent significantly outperforms the base model by an average of 18.3% on automated metrics, while also outperforming advanced proprietary models across both automated and human evaluations. Disclaimer: the generated rebuttal content is for reference only to inspire authors and assist in drafting. It is not intended to replace the author's own critical analysis and response.

PhysProver: Advancing Automatic Theorem Proving for Physics

Authors: Hanning Zhang, Ruida Wang, Rui Pan, Wenyuan Wang, Bingxu Meng, Tong Zhang
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2601.15737
Pdf link: https://arxiv.org/pdf/2601.15737
Abstract The combination of verifiable languages and LLMs has significantly influenced both the mathematical and computer science communities because it provides a rigorous foundation for theorem proving. Recent advancements in the field provide foundation models and sophisticated agentic systems pushing the boundaries of formal mathematical reasoning to approach the natural language capability of LLMs. However, little attention has been given to formal physics reasoning, which also heavily relies on similar problem-solving and theorem-proving frameworks. To solve this problem, this paper presents, to the best of our knowledge, the first approach to enhance formal theorem proving in the physics domain. We compose a dedicated dataset PhysLeanData for the task. It is composed of theorems sampled from PhysLean and data generated by a conjecture-based formal data generation pipeline. In the training pipeline, we leverage DeepSeek-Prover-V2-7B, a strong open-source mathematical theorem prover, and apply Reinforcement Learning with Verifiable Rewards (RLVR) to train our model PhysProver. Comprehensive experiments demonstrate that, using only ~5K training samples, PhysProver achieves an overall 2.4% improvement in multiple sub-domains. Furthermore, after formal physics training, we observe 1.3% gains on the MiniF2F-Test benchmark, which indicates non-trivial generalization beyond physics domains and enhancement for formal math capability as well. The results highlight the effectiveness and efficiency of our approach, which provides a paradigm for extending formal provers outside mathematical domains. To foster further research, we will release both our dataset and model to the community.

Off-Policy Actor-Critic with Sigmoid-Bounded Entropy for Real-World Robot Learning

Authors: Xiefeng Wu, Mingyu Hu, Shu Zhang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.15761
Pdf link: https://arxiv.org/pdf/2601.15761
Abstract Deploying reinforcement learning in the real world remains challenging due to sample inefficiency, sparse rewards, and noisy visual observations. Prior work leverages demonstrations and human feedback to improve learning efficiency and robustness. However, offline-to-online methods need large datasets and can be unstable, while VLA-assisted RL relies on large-scale pretraining and fine-tuning. As a result, a low-cost real-world RL method with minimal data requirements has yet to emerge. We introduce SigEnt-SAC, an off-policy actor-critic method that learns from scratch using a single expert trajectory. Our key design is a sigmoid-bounded entropy term that prevents negative-entropy-driven optimization toward out-of-distribution actions and reduces Q-function oscillations. We benchmark SigEnt-SAC on D4RL tasks against representative baselines. Experiments show that SigEnt-SAC substantially alleviates Q-function oscillations and reaches a 100% success rate faster than prior methods. Finally, we validate SigEnt-SAC on four real-world robotic tasks across multiple embodiments, where agents learn from raw images and sparse rewards; results demonstrate that SigEnt-SAC can learn successful policies with only a small number of real-world interactions, suggesting a low-cost and practical pathway for real-world RL deployment.

Decoupling Return-to-Go for Efficient Decision Transformer

Authors: Yongyi Wang, Hanyu Liu, Lingfeng Li, Bozhou Chen, Ang Li, Qirui Zheng, Xionghui Yang, Wenxin Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.15953
Pdf link: https://arxiv.org/pdf/2601.15953
Abstract The Decision Transformer (DT) has established a powerful sequence modeling approach to offline reinforcement learning. It conditions its action predictions on Return-to-Go (RTG), using it both to distinguish trajectory quality during training and to guide action generation at inference. In this work, we identify a critical redundancy in this design: feeding the entire sequence of RTGs into the Transformer is theoretically unnecessary, as only the most recent RTG affects action prediction. We show that this redundancy can impair DT's performance through experiments. To resolve this, we propose the Decoupled DT (DDT). DDT simplifies the architecture by processing only observation and action sequences through the Transformer, using the latest RTG to guide the action prediction. This streamlined approach not only improves performance but also reduces computational cost. Our experiments show that DDT significantly outperforms DT and establishes competitive performance against state-of-the-art DT variants across multiple offline RL tasks.
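
For readers unfamiliar with the Return-to-Go signal discussed above, here is a minimal sketch of how it is usually defined: the undiscounted sum of rewards from the current step to the end of the trajectory. The function name and the toy reward list are illustrative; the sketch does not reproduce the paper's DDT architecture or its redundancy argument.

```python
def returns_to_go(rewards):
    """Undiscounted suffix sums: rtg[t] = rewards[t] + rewards[t+1] + ..."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


# Hypothetical per-step rewards for one short trajectory.
rewards = [1.0, 0.0, 2.0, 3.0]
rtg = returns_to_go(rewards)  # [6.0, 5.0, 5.0, 3.0]

# Consecutive RTGs differ only by the reward just received.
steps_consistent = all(
    rtg[t + 1] == rtg[t] - rewards[t] for t in range(len(rewards) - 1)
)
```

The recurrence `rtg[t + 1] = rtg[t] - rewards[t]` shows how tightly consecutive RTG values are linked. That is useful background for the abstract's claim that feeding the whole RTG sequence into the Transformer is redundant.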

PUMA: Perception-driven Unified Foothold Prior for Mobility Augmented Quadruped Parkour

Authors: Liang Wang, Kanzhong Yao, Yang Liu, Weikai Qin, Jun Wu, Zhe Sun, Qiuguo Zhu
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2601.15995
Pdf link: https://arxiv.org/pdf/2601.15995
Abstract Parkour tasks for quadrupeds have emerged as a promising benchmark for agile locomotion. While human athletes can effectively perceive environmental characteristics to select appropriate footholds for obstacle traversal, endowing legged robots with similar perceptual reasoning remains a significant challenge. Existing methods often rely on hierarchical controllers that follow pre-computed footholds, thereby constraining the robot's real-time adaptability and the exploratory potential of reinforcement learning. To overcome these challenges, we present PUMA, an end-to-end learning framework that integrates visual perception and foothold priors into a single-stage training process. This approach leverages terrain features to estimate egocentric polar foothold priors, composed of relative distance and heading, guiding the robot in active posture adaptation for parkour tasks. Extensive experiments conducted in simulation and real-world environments across various discrete complex terrains demonstrate PUMA's exceptional agility and robustness in challenging scenarios.
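
The "egocentric polar" representation mentioned above (relative distance plus heading) can be illustrated with a small geometric sketch. It assumes a planar world frame, a known robot yaw, and a foothold given in world coordinates. The function name and frame conventions are hypothetical, and the paper estimates such priors from terrain features rather than from ground-truth coordinates.

```python
import math


def egocentric_polar(robot_xy, robot_yaw, foothold_xy):
    """Return (distance, heading) of a foothold relative to the robot's pose.

    heading is the bearing to the foothold minus the robot's yaw,
    wrapped to the interval [-pi, pi].
    """
    dx = foothold_xy[0] - robot_xy[0]
    dy = foothold_xy[1] - robot_xy[1]
    distance = math.hypot(dx, dy)
    bearing_world = math.atan2(dy, dx)
    diff = bearing_world - robot_yaw
    heading = math.atan2(math.sin(diff), math.cos(diff))  # angle wrap
    return distance, heading


# Hypothetical pose: robot at (1, 1) facing +y, foothold 2 m straight ahead.
distance, heading = egocentric_polar((1.0, 1.0), math.pi / 2, (1.0, 3.0))
```

In this toy pose the foothold lies directly ahead of the robot, so the distance is 2 and the heading is 0.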

Keyframe-Based Feed-Forward Visual Odometry

Authors: Weichen Dai, Wenhan Su, Da Kong, Yuhang Ming, Wanzeng Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.16020
Pdf link: https://arxiv.org/pdf/2601.16020
Abstract The emergence of visual foundation models has revolutionized visual odometry (VO) and SLAM, enabling pose estimation and dense reconstruction within a single feed-forward network. However, unlike traditional pipelines that leverage keyframe methods to enhance efficiency and accuracy, current foundation model based methods, such as VGGT-Long, typically process raw image sequences indiscriminately. This leads to computational redundancy and degraded performance caused by low inter-frame parallax, which provides limited contextual stereo information. Integrating traditional geometric heuristics into these methods is non-trivial, as their performance depends on high-dimensional latent representations rather than explicit geometric metrics. To bridge this gap, we propose a novel keyframe-based feed-forward VO. Instead of relying on hand-crafted rules, our approach employs reinforcement learning to derive an adaptive keyframe policy in a data-driven manner, aligning selection with the intrinsic characteristics of the underlying foundation model. We train our agent on the TartanAir dataset and conduct extensive evaluations across several real-world datasets. Experimental results demonstrate that the proposed method achieves consistent and substantial improvements over state-of-the-art feed-forward VO methods.

Dynamic Tactile Sensing System and Soft Actor Critic Reinforcement Learning for Inclusion Characterization

动态触觉感知系统和软演员批评强化学习用于包容性刻画

Authors: John Bannan, Nazia Rahman, Chang-Hee Won
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2601.16061
Pdf link: https://arxiv.org/pdf/2601.16061
Abstract This paper presents the Dynamic Tactile Sensing System, which utilizes robotic tactile sensing in conjunction with reinforcement learning to locate and characterize embedded inclusions. A dual-arm robot is integrated with an optical Tactile Imaging Sensor and utilizes the Soft Actor Critic algorithm to acquire tactile data based on a pixel intensity reward. A Dynamic Interrogation procedure for tactile exploration is developed that enables the robot to first localize inclusions and then refine their positions for precise imaging. Experimental validation conducted on polydimethylsiloxane phantoms demonstrates that the robot using the Tactile Soft Actor Critic Model was able to achieve size estimation errors of 2.61% and 5.29% for soft and hard inclusions, compared to 7.84% and 6.87% for expert human operators. Results also show that the Dynamic Tactile Sensing System was able to locate embedded inclusions and autonomously determine their mechanical properties, which is useful in applications such as breast tumor characterization.
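Chinese abstract (translated) This paper presents the Dynamic Tactile Sensing System, which combines robotic tactile sensing with reinforcement learning to locate and characterize embedded inclusions. A dual-arm robot is integrated with an optical Tactile Imaging Sensor and uses the Soft Actor Critic algorithm to acquire tactile data based on a pixel intensity reward. A Dynamic Interrogation procedure for tactile exploration lets the robot first localize inclusions and then refine their positions for precise imaging. Experimental validation on polydimethylsiloxane phantoms shows that the robot using the Tactile Soft Actor Critic Model achieved size estimation errors of 2.61% and 5.29% for soft and hard inclusions, compared with 7.84% and 6.87% for expert human operators. The results also show that the Dynamic Tactile Sensing System can locate embedded inclusions and autonomously determine their mechanical properties, which is useful in applications such as breast tumor characterization.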
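The abstract says tactile data is acquired "based on a pixel intensity reward" but does not give the formula. As a rough illustration only, the sketch below assumes the reward is the mean grayscale intensity of the tactile image scaled to [0, 1]; the function name `pixel_intensity_reward` and the scaling are hypothetical, not the authors' definition.

```python
def pixel_intensity_reward(image, max_value=255):
    """Mean pixel intensity of a grayscale image, scaled to [0, 1].

    Assumption: a brighter tactile image means better contact with the
    inclusion, so a higher mean intensity earns a higher reward.
    """
    pixels = [p for row in image for p in row]
    return sum(pixels) / (len(pixels) * max_value)


# Toy 2x2 "tactile image": half the pixels saturated, half dark.
reward = pixel_intensity_reward([[0, 255], [255, 0]])
```

A Soft Actor Critic agent would receive a value like `reward` after each probing action; the real system may well crop, threshold, or difference the image first.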

SAMTok: Representing Any Mask with Two Words

SAMTok: Representing Any Mask with Two Words

Authors: Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2601.16093
Pdf link: https://arxiv.org/pdf/2601.16093
Abstract Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
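Chinese abstract (translated) Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain hard to scale because of complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask from them with high fidelity. By treating masks as new language tokens, SAMTok lets base MLLMs (such as the QwenVL series) learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, with no architectural modifications or specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks, using a mask encoder and a residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation samples, QwenVL-SAMTok reaches state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We also introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation and substantially improves results on the GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for giving MLLMs strong pixel-wise capabilities. Our code and models are available.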
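To make the "residual vector quantizer ... two special tokens" idea concrete, here is a toy two-stage residual vector quantization sketch in plain Python. It is not the SAMTok implementation: the 2-D vectors, the hand-written codebooks `COARSE` and `FINE`, and all function names are illustrative assumptions, whereas the real model quantizes learned mask embeddings with learned codebooks.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage encodes the residual left by the previous one."""
    tokens, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical codebooks: a coarse stage and a fine stage.
COARSE = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
FINE = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, -0.25]]

tokens = rvq_encode([1.2, 1.0], [COARSE, FINE])  # two indices, i.e. two "words"
recon = rvq_decode(tokens, [COARSE, FINE])
```

With two stages, every input is represented by exactly two discrete indices, which is the sense in which a mask could be written as two tokens in a language model's vocabulary.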

Efficiently Learning Robust Torque-based Locomotion Through Reinforcement with Model-Based Supervision

Efficiently Learning Robust Torque-based Locomotion Through Reinforcement with Model-Based Supervision

Authors: Yashuai Yan, Tobias Egle, Christian Ott, Dongheui Lee
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2601.16109
Pdf link: https://arxiv.org/pdf/2601.16109
Abstract We propose a control framework that integrates model-based bipedal locomotion with residual reinforcement learning (RL) to achieve robust and adaptive walking in the presence of real-world uncertainties. Our approach leverages a model-based controller, comprising a Divergent Component of Motion (DCM) trajectory planner and a whole-body controller, as a reliable base policy. To address the uncertainties of inaccurate dynamics modeling and sensor noise, we introduce a residual policy trained through RL with domain randomization. Crucially, we employ a model-based oracle policy, which has privileged access to ground-truth dynamics during training, to supervise the residual policy via a novel supervised loss. This supervision enables the policy to efficiently learn corrective behaviors that compensate for unmodeled effects without extensive reward shaping. Our method demonstrates improved robustness and generalization across a range of randomized conditions, offering a scalable solution for sim-to-real transfer in bipedal locomotion.
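Chinese abstract (translated) We propose a control framework that combines model-based bipedal locomotion with residual reinforcement learning (RL) to achieve robust and adaptive walking under real-world uncertainties. Our approach uses a model-based controller, made up of a Divergent Component of Motion (DCM) trajectory planner and a whole-body controller, as a reliable base policy. To handle the uncertainties caused by inaccurate dynamics modeling and sensor noise, we introduce a residual policy trained through RL with domain randomization. Crucially, we use a model-based oracle policy, which has privileged access to ground-truth dynamics during training, to supervise the residual policy through a novel supervised loss. This supervision lets the policy efficiently learn corrective behaviors that compensate for unmodeled effects without extensive reward shaping. Our method shows improved robustness and generalization across a range of randomized conditions, offering a scalable solution for sim-to-real transfer in bipedal locomotion.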
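The structure described above (base controller plus learned residual, with an oracle supervising the residual) can be sketched in a few lines. The abstract does not specify the form of the supervised loss, so the mean-squared-error term and the `weight` parameter below are assumptions for illustration, as are all function names.

```python
def combined_action(base_action, residual_action):
    """Final command: model-based controller output plus the learned correction."""
    return [b + r for b, r in zip(base_action, residual_action)]


def supervised_loss(residual_action, oracle_residual):
    """Assumed form: mean squared error between learned and oracle corrections."""
    n = len(residual_action)
    return sum((r - o) ** 2 for r, o in zip(residual_action, oracle_residual)) / n


def total_loss(rl_loss, residual_action, oracle_residual, weight=0.5):
    """RL objective plus a weighted oracle-supervision term (weight is hypothetical)."""
    return rl_loss + weight * supervised_loss(residual_action, oracle_residual)


torques = combined_action([1.0, 2.0], [0.5, -0.5])
loss = total_loss(2.0, [1.0, 0.0], [0.0, 0.0], weight=0.5)
```

The design point is that the oracle only exists during training, where ground-truth dynamics are available; at deployment the robot runs `combined_action` with the base controller and the trained residual alone.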

Structured Hints for Sample-Efficient Lean Theorem Proving

Structured Hints for Sample-Efficient Lean Theorem Proving

Authors: Zachary Burton
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.16172
Pdf link: https://arxiv.org/pdf/2601.16172
Abstract State-of-the-art neural theorem provers like DeepSeek-Prover-V1.5 combine large language models with reinforcement learning, achieving impressive results through sophisticated training. We ask: do these highly-trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention -- a fixed prompt schedule over 15 common tactic skeletons -- on the miniF2F benchmark. This simple approach yields 21.7% pass@16 compared to 15.2% for standard sampling from the same model, a 43% relative improvement using the same number of samples (k=16) and same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underutilize structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.
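Chinese abstract (translated) State-of-the-art neural theorem provers such as DeepSeek-Prover-V1.5 combine large language models with reinforcement learning and achieve impressive results through sophisticated training. We ask: do these highly trained models still benefit from simple structural guidance at inference time? We evaluate a lightweight intervention, a fixed prompt schedule over 15 common tactic skeletons, on the miniF2F benchmark. This simple approach yields 21.7% pass@16, compared with 15.2% for standard sampling from the same model, a 43% relative improvement with the same number of samples (k=16) and the same maximum generation length (1024 tokens). Our results suggest that even capable RL-trained provers underuse the structural priors available in the tactic language, and that simple inference-time guidance remains a cheap, complementary boost.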
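The 43% figure is a relative gain: (21.7 - 15.2) / 15.2 ≈ 0.428. The sketch below checks that arithmetic and shows one plausible reading of a "fixed prompt schedule", namely assigning skeletons to the k samples in round-robin order. The skeleton strings in `SKELETONS` are hypothetical placeholders (the paper's 15 skeletons are not listed in the abstract), and the round-robin rule is an assumption.

```python
from itertools import cycle, islice

# Hypothetical tactic skeletons; the paper uses 15, which the abstract does not list.
SKELETONS = ["simp", "norm_num", "linarith", "induction n with n ih", "nlinarith"]


def hint_schedule(skeletons, k):
    """Assign one skeleton to each of the k samples in a fixed round-robin order."""
    return list(islice(cycle(skeletons), k))


def relative_improvement(baseline, new):
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


schedule = hint_schedule(SKELETONS, 16)
gain_percent = round(relative_improvement(15.2, 21.7) * 100)  # 43
```

Because the schedule is fixed rather than sampled, the comparison with standard sampling uses the same budget of k=16 generations per theorem.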

Learning to Discover at Test Time

Learning to Discover at Test Time

Authors: Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.16175
Pdf link: https://arxiv.org/pdf/2601.16175
Abstract How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
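Chinese abstract (translated) How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, searches by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can keep training, now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to others. Our learning objective and search subroutine are therefore designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets a new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to $2\times$ faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs use Tinker, an API by Thinking Machines, at a cost of only a few hundred dollars per problem.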

LLM-in-Sandbox Elicits General Agentic Intelligence

LLM-in-Sandbox Elicits General Agentic Intelligence

Authors: Daixuan Cheng, Shaohan Huang, Yuxian Gu, Huatong Song, Guoxin Chen, Li Dong, Wayne Xin Zhao, Ji-Rong Wen, Furu Wei
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2601.16206
Pdf link: https://arxiv.org/pdf/2601.16206
Abstract We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs, without additional training, exhibit generalization capabilities to leverage the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, leverage the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be enhanced through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments demonstrate that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization spanning mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze LLM-in-Sandbox's efficiency from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.
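Chinese abstract (translated) We introduce LLM-in-Sandbox, which lets LLMs explore within a code sandbox (i.e., a virtual computer) to elicit general intelligence in non-code domains. We first show that strong LLMs, without additional training, can generalize to using the code sandbox for non-code tasks. For example, LLMs spontaneously access external resources to acquire new knowledge, use the file system to handle long contexts, and execute scripts to satisfy formatting requirements. We further show that these agentic capabilities can be strengthened through LLM-in-Sandbox Reinforcement Learning (LLM-in-Sandbox-RL), which uses only non-agentic data to train models for sandbox exploration. Experiments show that LLM-in-Sandbox, in both training-free and post-trained settings, achieves robust generalization across mathematics, physics, chemistry, biomedicine, long-context understanding, and instruction following. Finally, we analyze the efficiency of LLM-in-Sandbox from computational and system perspectives, and open-source it as a Python package to facilitate real-world deployment.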

Keyword: diffusion policy

There are no results.