Generated: 2026-06-04 19:18:50 (UTC+8); arXiv release time: 2026-06-04 20:00 EDT (2026-06-05 08:00 UTC+8)
52 related papers today
Keyword: reinforcement learning
Position: Deployed Reinforcement Learning should be Continual
- Authors: Parnian Behdin, Kevin Roice, Golnaz Mesbahi
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04029
- Pdf link: https://arxiv.org/pdf/2606.04029
- Abstract
Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained agents do not learn while interacting with the world until performance degrades and retraining becomes necessary. In this position paper, we argue that deploying an agent that is incapable of optimality, but receives an evaluative reward signal, is inherently a continual RL problem. We identify four sources of non-stationarity after deployment that necessitate never-ending learning, and highlight why the best deployed agents never stop adapting. We analyze successful examples of continual RL in the real world, and present the community with the advantages and measures to move away from the current train-then-fix paradigm.
Self-Distilled Policy Gradient
- Authors: Yifeng Liu, Shiyuan Zhang, Yifan Zhang, Quanquan Gu
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.04036
- Pdf link: https://arxiv.org/pdf/2606.04036
- Abstract
On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. Actually, it can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation, as well as reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines. The code is available at this https URL.
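A minimal sketch of two ingredients the SDPG abstract names: group-relative verifier advantages normalized by the group standard deviation, and a full-vocabulary reverse KL from student to teacher. The function names, the `eps` constants, and the reading of "reverse KL" as KL(student || teacher) are assumptions for illustration, not the authors' implementation.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Center verifier rewards within one sampled group and divide by the group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher), summed over the full vocabulary at one token position."""
    return sum(p * (math.log(p + eps) - math.log(q + eps))
               for p, q in zip(student, teacher))

# Four sampled answers to one prompt, scored 1 (verified correct) or 0.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Toy two-token vocabulary: student without privileged context vs. teacher with it.
penalty = reverse_kl([0.5, 0.5], [0.9, 0.1])
```

The expectation in `reverse_kl` is taken under the student distribution, which is what makes the loss on-policy in this reading.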
RUBAS: Rubric-Based Reinforcement Learning for Agent Safety
- Authors: Xian Qi Loye, Qinglin Su, Zhexin Zhang, Shiyao Cui, Qi Zhu, Fei Mi, Hongning Wang, Minlie Huang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2606.04051
- Pdf link: https://arxiv.org/pdf/2606.04051
- Abstract
The evolution of LLMs into tool-enabled agents creates a new class of safety challenges associated with real-world execution rather than simple text generation. Existing alignment methods often rely on coarse refusal signals or static supervision, making it difficult to balance safety with useful tool execution across diverse agentic risks. We introduce RUBAS, a rubric-based reinforcement learning framework for agent safety. RUBAS decomposes agent behavior into four dimensions: tool-use safety, argument safety, response safety, and helpfulness. These structured rubrics provide fine-grained and interpretable rewards over complete agent trajectories, enabling reinforcement learning to optimize safe tool use while preserving task completion. Extensive experiments across multiple agent safety benchmarks and models show that RUBAS improves safety over standard alignment baselines, reduces tool-grounded hallucinations, and maintains competitive utility. Our results suggest that multi-dimensional rubric rewards provide an effective training signal for aligning LLM agents in safety-critical tool-use settings.
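To illustrate how a four-dimension rubric could become one scalar trajectory reward, here is a small sketch. The abstract does not say how RUBAS combines its dimensions, so the weighted average, the weight values, and the [0, 1] score scale are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RubricScores:
    # One score per rubric dimension named in the abstract, assumed to lie in [0, 1].
    tool_use_safety: float
    argument_safety: float
    response_safety: float
    helpfulness: float

# Hypothetical weights; the abstract does not specify an aggregation rule.
WEIGHTS = {
    "tool_use_safety": 0.3,
    "argument_safety": 0.3,
    "response_safety": 0.2,
    "helpfulness": 0.2,
}

def trajectory_reward(scores, weights=WEIGHTS):
    """Weighted average of rubric scores for one complete agent trajectory."""
    total = sum(weights.values())
    return sum(w * getattr(scores, name) for name, w in weights.items()) / total

safe_and_useful = trajectory_reward(RubricScores(1.0, 1.0, 1.0, 1.0))
blanket_refusal = trajectory_reward(RubricScores(1.0, 1.0, 1.0, 0.0))
```

Keeping helpfulness as its own dimension means a blanket refusal scores lower than a safe completion, which matches the abstract's goal of preserving task completion.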
A Goal-Set Characterization of Task Composition in the Boolean Task Algebra
- Authors: Eduardo Terrés-Caballero, Herke van Hoof
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04053
- Pdf link: https://arxiv.org/pdf/2606.04053
- Abstract
The Boolean Task Algebra (BTA) provides a principled framework for zero-shot task composition in reinforcement learning by equipping goal-reaching tasks with Boolean operations. We revisit its structural assumptions and formalize a collapse in the space of optimal extended Q-value functions: in deterministic MDPs, every such function is fully determined by the universal and empty tasks. This makes the logarithmic set of base tasks proposed in the original BTA formulation redundant. Building on this observation, we introduce a goal-set-based composition method that performs logical operations on goal sets and reconstructs composed value functions by selecting slices from the universal and empty value functions. This reduces learning costs for standard BTA and reduces composition time for both BTA and Skill Machines, while preserving policy performance. Experiments across tabular, visual, function-approximation, and continuous-control domains show that learning additional base tasks does not yield better performance. Finally, we study the stochastic setting and provide a counterexample showing that this collapse need not hold, that is, optimal composition may require accounting for exponentially many policies in the number of goals. Code is available at this https URL.
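The goal-set composition described in the abstract can be sketched as follows, assuming a deterministic MDP: Boolean operations act on sets of goals, and the composed extended Q-function takes the universal-task slice for goals in the set and the empty-task slice otherwise. The nested-dict layout `q[state][goal][action]`, the function names, and the toy values are hypothetical.

```python
def compose_q(q_universal, q_empty, goal_set):
    """Build q[state][goal][action] for the task whose desirable goals are goal_set."""
    composed = {}
    for state, per_goal in q_universal.items():
        composed[state] = {
            goal: dict(per_goal[goal] if goal in goal_set else q_empty[state][goal])
            for goal in per_goal
        }
    return composed

def greedy_action(q, state):
    """Maximize over goals for each action, then pick the best action."""
    best = {}
    for per_action in q[state].values():
        for action, value in per_action.items():
            best[action] = max(best.get(action, float("-inf")), value)
    return max(best, key=best.get)

# Toy extended Q-values for one state, two goals, two actions.
q_universal = {"s0": {"A": {"left": 1.0, "right": 0.2}, "B": {"left": 0.1, "right": 0.9}}}
q_empty = {"s0": {"A": {"left": 0.0, "right": -0.1}, "B": {"left": -0.1, "right": 0.0}}}

all_goals = {"A", "B"}
task_a, task_b = {"A"}, {"B"}
union, intersection, negation = task_a | task_b, task_a & task_b, all_goals - task_a

action_for_a = greedy_action(compose_q(q_universal, q_empty, task_a), "s0")
action_for_not_a = greedy_action(compose_q(q_universal, q_empty, negation), "s0")
```

Only the two value functions are learned in this sketch; every other task is obtained by set algebra plus slice selection.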
Need to Know: Contextual-Integrity-Grounded Query Rewriting for Privacy-Conscious LLM Delegation
- Authors: Xinyue Huang, Xiaochun Cao, Wenyuan Yang
- Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04067
- Pdf link: https://arxiv.org/pdf/2606.04067
- Abstract
As LLMs become increasingly woven into everyday workflows, user queries sent to cloud-hosted LLMs routinely mix task-essential content with sensitive disclosures that are not essential to the task. Type-based PII redaction is context-agnostic and may raise two issues: over-disclosing untyped sensitive context and over-removing answer-bearing spans. We recast privacy-preserving query rewriting under Contextual Integrity: a span should be forwarded only if it is necessary for the task. We introduce DelegateCI-Bench, the first task-based Contextual Integrity benchmark for privacy-conscious delegation, comprising 3,167 samples that combine high-quality synthetic data spanning 11 tasks and 20 task types, WildChat-based real user queries, and a medical challenge set with dense sensitive information. Building on this benchmark, we propose a CI-guided reinforcement learning framework that converts essential and non-essential sensitive spans into verifiable optimization signals, and train a query rewriter to preserve task-critical information while suppressing unnecessary sensitive disclosure. Experiments show that our learned rewriter achieves the best privacy-utility tradeoff, with up to +10.1 average utility over on-device baselines.
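As a rough illustration of turning annotated spans into a verifiable signal, the sketch below rewards a rewrite for keeping essential spans and penalizes it for leaking non-essential sensitive ones. The substring matching, the "utility minus leakage" form, and the example spans are assumptions; the abstract does not give the actual reward.

```python
def ci_reward(rewritten, essential_spans, nonessential_sensitive_spans):
    """Fraction of essential spans kept minus fraction of sensitive spans leaked."""
    text = rewritten.lower()
    kept = sum(span.lower() in text for span in essential_spans)
    leaked = sum(span.lower() in text for span in nonessential_sensitive_spans)
    utility = kept / len(essential_spans) if essential_spans else 1.0
    leakage = (leaked / len(nonessential_sensitive_spans)
               if nonessential_sensitive_spans else 0.0)
    return utility - leakage

essential = ["cover letter", "data analyst"]
sensitive = ["Jane Doe", "recent divorce"]

original = ("I'm Jane Doe. After my recent divorce I need a cover letter "
            "for a data analyst role.")
rewrite = "Please draft a cover letter for a data analyst role."

reward_original = ci_reward(original, essential, sensitive)
reward_rewrite = ci_reward(rewrite, essential, sensitive)
```

Forwarding the query unchanged keeps every essential span but leaks every sensitive one, so it scores below the rewrite.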
Large Language Models Hack Rewards, and Society
- Authors: Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei, Yulan He
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)
- Arxiv link: https://arxiv.org/abs/2606.04075
- Pdf link: https://arxiv.org/pdf/2606.04075
- Abstract
Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.
SaliMory: Orchestrating Cognitive Memory for Conversational Agents
- Authors: Kai Zhang, Xinyuan Zhang, Hongda Jiang, Shiun-Zu Kuo, Hyokun Yun, Ejaz Ahmed, Shereen Oraby, Ziyun Li, Sanat Sharma, Ann Lee, Ahmed A Aly, Anuj Kumar, Raffay Hamid, Xin Luna Dong
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04120
- Pdf link: https://arxiv.org/pdf/2606.04120
- Abstract
Conversational agents that serve as lifelong companions must maintain persistent memory across all interactions. However, simply expanding context windows with raw retrieval degrades reasoning quality, while training memory agents via standard reinforcement learning creates a severe credit assignment bottleneck in a multi-stage pipeline. To solve this, we introduce SALIMORY, a framework that trains a single language model to manage a cognitively structured memory spanning user facts, preferences, and working memory. By introducing a hierarchical stage-wise process reward and reward-decomposed contrastive refinement, SALIMORY provides isolated supervision for distinct memory operations (selective filtering, consolidation, and cue-driven recall) end-to-end. SALIMORY cuts memory-attributed failures by one-third, outperforms the state-of-the-art by over 10% in end-to-end accuracy, and more than doubles the Good Personalization rate.
SocialCoach: Personalized Social Skill Learning with RL-based Agentic Tutoring and Practice
- Authors: Tianfu Wang, Max Xiong, Jianxun Lian, Hongyuan Zhu, Zhengyu Hu, Yuxuan Lei, Linxiao Gong, Xiaofang Li, Peiting Tsai, Nicholas Jing Yuan, Qi Zhang
- Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computers and Society (cs.CY)
- Arxiv link: https://arxiv.org/abs/2606.04155
- Pdf link: https://arxiv.org/pdf/2606.04155
- Abstract
Social skills such as negotiation and leadership are crucial for personal and professional success in today's interconnected world. However, scalable and effective training remains a significant challenge due to the scarcity of expert coaching. In this paper, we introduce SocialCoach, a holistic LLM-powered agentic tutoring system for personalized social skill development at scale. First, SocialCoach automatically constructs a pedagogically-grounded, theory-to-practice knowledge corpus from diverse expert sources, leveraging a multi-agent pipeline. Second, to personalize the learning journey, it employs an adaptive practice scheduling module that follows a prescription-retrieval-adaptation process. To maximize the long-term learning experience while overcoming the cold-start problem, this policy is optimized within a learner simulation environment through reinforcement learning. Finally, SocialCoach integrates immersive, goal-driven practice, causality-driven proficiency assessment and knowledge-grounded, reflective tutoring to help address the knowing-doing gap. We deploy it in our product, EQoach, and conduct extensive experiments. The results show that SocialCoach improves simulated pathway quality and judge-rated tutoring quality over baseline approaches, while early user feedback indicates strong perceived engagement and usefulness. These findings suggest a practical architecture for personalized and gamified pedagogical platforms on soft skill learning.
Smart Transportation Without Neurons -- Fair Metro Network Expansion with Tabular Reinforcement Learning
- Authors: Dimitris Michailidis, Sennay Ghebreab, Fernando P. Santos
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04167
- Pdf link: https://arxiv.org/pdf/2606.04167
- Abstract
We tackle the Metro Network Expansion Problem (MNEP), a subset of the Transport Network Design Problem (TNDP), which focuses on expanding metro systems to satisfy travel demand. Traditional methods rely on exact and heuristic approaches that require expert-defined constraints to reduce the search space. Recently, deep reinforcement learning (Deep RL) has emerged due to its effectiveness in complex sequential decision-making processes; it remains, however, computationally expensive, environmentally costly, and requires additional engineering to interpret. We show that MNEP problems are small enough to not require Deep RL methods. Reformulating the MNEP as a Non-Markovian Rewards Decision Process (NMRDP), we use tabular RL to achieve similar performance with significantly fewer training episodes, while also offering greater interpretability. In addition, we incorporate social equity criteria into the reward functions, focusing on efficiency and fairness, highlighting the versatility of our method. Evaluated in two real-world settings, Xi'an and Amsterdam, our method reduces total episodes by a factor of 18 and total carbon emissions by a factor of 12 on average, while remaining competitive with Deep RL. This approach offers a replicable, modular, interpretable, and resource-efficient solution with potential applications to other combinatorial optimization problems.
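The abstract says only "tabular RL", so the following is a generic tabular Q-learning update rather than the paper's algorithm. One common way to cope with non-Markovian rewards, assumed here, is to make the state the full tuple of stations placed so far; the cell names, reward value, and hyperparameters are hypothetical.

```python
from collections import defaultdict

def q_learning_step(q, state, action, reward, next_state, next_actions,
                    alpha=0.1, gamma=0.99):
    """One tabular Q-learning update; q maps (state, action) pairs to values."""
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)

# The state is the tuple of grid cells already on the line, so a reward that
# depends on the whole line is still a function of (state, action).
state = ("cell_3",)
next_state = state + ("cell_4",)
updated = q_learning_step(q, state, "cell_4", reward=2.0,
                          next_state=next_state,
                          next_actions=["cell_5", "cell_8"])
```

Because every entry of the table is a readable (partial line, next cell) pair, the learned values can be inspected directly, which is the interpretability argument the abstract makes for tabular methods.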
Exact Unlearning in Reinforcement Learning
- Authors: Thanh Nguyen-Tang, Raman Arora
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2606.04182
- Pdf link: https://arxiv.org/pdf/2606.04182
- Abstract
We formulate the problem of \emph{exact unlearning} in reinforcement learning, where the goal is to design an efficient framework that enables the removal of any user's data upon deletion request, i.e., the online learner's output after unlearning is \emph{indistinguishable} from what would have been produced had the deleted user never interacted with the learner. For any $\rho >0$, we show that there exists a reinforcement learning (RL) algorithm that is $\rho$-TV-stable and supports an exact unlearning procedure whose expected computational cost is only a $\rho \sqrt{\ln T}$ fraction of the computational cost of retraining from scratch. We construct such a $\rho$-TV-stable RL algorithm for tabular Markov decision processes (MDPs), which achieves a regret bound of $\mathcal{O}(H^2 \sqrt{SAT} + H^3 S^2 A + {H^{2.5} S^2 A}/{\rho})$, where $S, A, H$, and $T$ denote the number of states, the number of actions, the episode horizon, and the number of episodes, respectively. We also establish a lower bound of $\Omega(H\sqrt{SAT} + {SAH}/{\rho})$ for $\rho$-TV-stable RL algorithms, showing that our algorithm is nearly minimax optimal.
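To get a feel for the trade-off that $\rho$ controls, the sketch below evaluates the unlearning cost fraction $\rho\sqrt{\ln T}$ and the three terms of the stated regret upper bound. The constants hidden by $\mathcal{O}(\cdot)$ are ignored, and the problem sizes are made up, so the numbers indicate scaling only.

```python
import math

def unlearning_cost_fraction(rho, T):
    """Expected unlearning cost as a fraction of retraining: rho * sqrt(ln T)."""
    return rho * math.sqrt(math.log(T))

def regret_upper_bound_terms(S, A, H, T, rho):
    """The three terms of the stated upper bound, with constants dropped."""
    return (
        H ** 2 * math.sqrt(S * A * T),   # H^2 sqrt(SAT)
        H ** 3 * S ** 2 * A,             # H^3 S^2 A
        H ** 2.5 * S ** 2 * A / rho,     # H^{2.5} S^2 A / rho
    )

# Hypothetical problem size: 10 states, 4 actions, horizon 5, one million episodes.
fraction = unlearning_cost_fraction(rho=0.01, T=10 ** 6)
tight = regret_upper_bound_terms(S=10, A=4, H=5, T=10 ** 6, rho=0.01)
loose = regret_upper_bound_terms(S=10, A=4, H=5, T=10 ** 6, rho=0.1)
```

A smaller $\rho$ makes unlearning cheaper relative to retraining but inflates the $1/\rho$ regret term, which is the tension the matching lower bound formalizes.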
Dual Advantage Fields
- Authors: Alexey Zemtsov, Maxim Bobrin, Alexander Nikulin, Dmitry V. Dylov, Fakhri Karray, Vladislav Kurenkov, Martin Takáč, Arip Asadulaev
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.04188
- Pdf link: https://arxiv.org/pdf/2606.04188
- Abstract
Offline goal-conditioned reinforcement learning requires both long-horizon reachability estimates and local action comparisons. Dual goal representations provide value fields that capture global goal reachability, but they do not directly specify which action should be preferred at a given state. We propose Dual Advantage Fields, a policy-extraction method that turns a bilinear dual value model into a local advantage signal. Under bilinear dual parameterization, the goal embedding is the gradient of the value field with respect to the state representation. DAF learns an action-effect model that predicts the discounted feature displacement induced by an action and scores actions by the alignment between this displacement and the goal direction. In the realizable case, this score equals the goal-conditioned Bellman advantage, yielding a standard local policy-improvement guarantee. On OGBench locomotion, manipulation, and puzzle tasks, DAF improves aggregate RLiable metrics and performs strongly in settings where locally correct actions differ from direct movement toward the final goal.
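The scoring rule in the abstract can be sketched with plain dot products. Under a bilinear model $V(s, g) = \phi(s)^\top \psi(g)$, the gradient with respect to $\phi(s)$ is $\psi(g)$, so an action is scored by how well its predicted feature displacement aligns with $\psi(g)$. In the paper the displacement comes from a learned action-effect model; here the displacements, the 2-D embeddings, and the action names are hard-coded, hypothetical values.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bilinear_value(phi_s, psi_g):
    """V(s, g) = <phi(s), psi(g)>; its gradient w.r.t. phi(s) is psi(g)."""
    return dot(phi_s, psi_g)

def advantage_scores(displacements, psi_g):
    """Score each action by the alignment of its displacement with the goal direction."""
    return {action: dot(delta, psi_g) for action, delta in displacements.items()}

psi_goal = [1.0, 0.0]  # goal embedding, i.e. the goal direction in feature space
predicted_displacement = {
    "forward": [0.5, 0.1],
    "detour": [0.8, -0.4],
    "back": [-0.5, 0.0],
}

scores = advantage_scores(predicted_displacement, psi_goal)
best_action = max(scores, key=scores.get)
```

The score compares actions locally at one state, which is the piece a value field alone does not provide.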
RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training
- Authors: Rachit Bansal, Clara Mohri, Tian Qin, David Alvarez-Melis, Sham Kakade
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.04272
- Pdf link: https://arxiv.org/pdf/2606.04272
- Abstract
The standard LLM training pipeline applies reinforcement learning (RL) only after pre-training and supervised fine-tuning (SFT). We question this status quo by training an LLM from scratch and applying RL, SFT, and SFT followed by RL directly to intermediate pre-training checkpoints. We find that RL is effective very early, and often matches the full SFT$\to$RL pipeline early as well. Through experiments on harder problems, we find that targeted pre-training data composition is a strong lever for RL effectiveness, even more so than model scale. Beyond reasoning accuracy, applying RL directly to base checkpoints expands the model's distribution; the sharpening effect reported in recent work arises only when RL follows SFT. The general capabilities of the model remain essentially unchanged by RL, while they degrade following SFT. Finally, we merge RL and SFT objectives by parallel averaging, which outperforms all other training methods discussed across metrics, while preserving general capabilities. Together, these results suggest that LLM training might benefit from an expanded use of RL.
From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments
- Authors: Saket Tiwari, Tejas Kotwal, George Konidaris
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04275
- Pdf link: https://arxiv.org/pdf/2606.04275
- Abstract
We present a novel theoretical framework for deep reinforcement learning (RL) in continuous environments by modeling the problem as a continuous-time stochastic process, drawing on insights from stochastic control. Building on previous work, we introduce a viable model of an actor-critic algorithm that incorporates both exploration and stochastic transitions. For single-hidden-layer neural networks, we show that the state of the environment can be formulated as a two-time-scale process: the environment time and the gradient time. Within this formulation, we characterize how the time-dependent random variables that represent the environment's state and the estimate of the cumulative discounted return evolve over gradient steps in the infinite-width limit of two-layer networks. Using the theory of stochastic differential equations, we derive, for the first time in continuous RL, an equation describing the infinitesimal change in the state distribution at each gradient step, under a vanishingly small learning rate. Overall, our work provides a novel nonparametric formulation for studying overparametrized neural actor-critic algorithms. We empirically corroborate our theoretical result using a toy continuous control task.
Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling
- Authors: Yifan Wang, Jinyi Mu, Mayank Jobanputra, Yu Wang, Ji-Ung Lee, Soyoung Oh, Isabel Valera, Vera Demberg
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.04284
- Pdf link: https://arxiv.org/pdf/2606.04284
- Abstract
Preference modeling plays a central role in reinforcement learning from human feedback (RLHF), enabling large language models (LLMs) to align with human values. However, most existing approaches assume a universal reward function, neglecting the diversity and heterogeneity of human preferences. To address this limitation without additional annotation costs, recent work has proposed learning multiple preference components from binary data and combining them to model individual preferences. Nevertheless, these components often fail to capture coherent and disentangled patterns, limiting their interpretability and effectiveness for personalization. In this work, we propose a sparse Mixture-of-Experts (MoE) reward model that encourages sparse routing and expert diversity during training on binary preference data. Across controlled and real-world experiments, sparse MoE learns interpretable routing patterns and specialized experts. It also improves test-time personalization, and post-adaptation shifts in expert weights provide a qualitative lens for analyzing how the model adapts to personalized preferences.
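A sparse mixture-of-experts reward can be sketched as top-k routing over per-expert scalar rewards, trained on binary preferences with a Bradley-Terry likelihood. The abstract does not give the router, the value of k, or the loss, so the top-k softmax gate, `k=2`, and the Bradley-Terry form are assumptions, and all numbers are toy values.

```python
import math

def top_k_gate(router_logits, k):
    """Keep the k largest router logits and softmax over them; the rest get weight 0."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    norm = sum(exps.values())
    return {i: e / norm for i, e in exps.items()}

def moe_reward(expert_rewards, router_logits, k=2):
    """Reward for one response: gate-weighted sum over the selected experts."""
    gate = top_k_gate(router_logits, k)
    return sum(weight * expert_rewards[i] for i, weight in gate.items())

def preference_prob(reward_chosen, reward_rejected):
    """Bradley-Terry probability that the chosen response is preferred."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

logits = [2.0, 0.0, 2.0, -1.0]  # router output for one (user, prompt) context
gate = top_k_gate(logits, k=2)
reward = moe_reward([1.0, 5.0, 3.0, 9.0], logits, k=2)
```

Because only k experts receive nonzero weight, the gate itself shows which experts a given user's preferences rely on.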
Generalizable Multi-Task Learning for Wireless Networks Using Prompt Decision Transformers
- Authors: Fatih Temiz, Shavbo Salehi, Melike Erol-Kantarci
- Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04328
- Pdf link: https://arxiv.org/pdf/2606.04328
- Abstract
Future wireless networks demand rapid adaptation to highly heterogeneous environments and dynamic task configurations, necessitating a shift from conventional rule-based and optimization-driven radio resource management (RRM) toward artificial intelligence (AI)-driven RRM. AI-driven approaches can learn complex nonlinear relationships, generalize across diverse network conditions and enable real-time, scalable and autonomous decision-making. Among RRM techniques, coordinated multipoint (CoMP) transmission is pivotal for mitigating inter-cell interference and enhancing cell-edge performance, thereby improving quality of experience (QoE) in dense deployments. However, optimal multi-cell selection remains a complex combinatorial challenge as it requires jointly optimizing over many possible serving-cell combinations under dynamic traffic and channel conditions. Despite their success, conventional deep reinforcement learning (DRL) methods such as proximal policy optimization (PPO) suffer from poor sample efficiency, limited generalization, and costly retraining when state and action spaces change. To address these bottlenecks, we propose a Prompt Decision Transformer (PromptDT) based multi-task learning framework capable of learning across diverse network configurations and reformulating multi-cell selection as a sequence modeling problem. By leveraging offline trajectories and task-specific prompts, PromptDT enables scalable learning across diverse network configurations, including varying base stations and user equipment counts, and scheduler policies. Experimental results demonstrate that PromptDT improves QoE by up to 49% in multi-task settings compared to baselines, with performance scaling positively alongside model capacity. Moreover, PromptDT generalizes effectively to unseen tasks, achieving robust few-shot adaptation to new network configurations without retraining or fine-tuning.
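Casting multi-cell selection as sequence modeling means feeding the model interleaved (return-to-go, state, action) tokens, with a short task-specific prompt trajectory placed in front. The sketch below builds such a token list; the dict-based trajectory format, the string states and actions, and the undiscounted return-to-go are assumptions in the style of Decision Transformer, not the paper's data pipeline.

```python
def returns_to_go(rewards):
    """Undiscounted sum of rewards from each step to the end of the segment."""
    total, out = 0.0, []
    for r in reversed(rewards):
        total += r
        out.append(total)
    return out[::-1]

def build_sequence(prompt, trajectory):
    """Prepend a task prompt (a short demonstration) to the recent trajectory."""
    tokens = []
    for segment in (prompt, trajectory):
        rtg = returns_to_go([step["reward"] for step in segment])
        for g, step in zip(rtg, segment):
            tokens += [("rtg", g), ("state", step["state"]), ("action", step["action"])]
    return tokens

# Hypothetical steps: the action is the set of serving cells chosen for a user.
prompt = [{"state": "cfg_3bs_10ue", "action": "cells{1,2}", "reward": 1.0}]
trajectory = [
    {"state": "t0", "action": "cells{1}", "reward": 0.5},
    {"state": "t1", "action": "cells{1,3}", "reward": 2.0},
]
tokens = build_sequence(prompt, trajectory)
```

Changing the number of base stations or users then changes only the prompt and token contents, not the training procedure, which is how the abstract frames adaptation without retraining.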
Policy Gradient for Continuous-Time Robust Markov Decision Processes
- Authors: Tanya Veeravalli, David M. Bossens, Atsushi Nitanda
- Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2606.04335
- Pdf link: https://arxiv.org/pdf/2606.04335
- Abstract
The framework of robust Markov decision processes (RMDPs) allows the design of reinforcement learning agents that satisfy performance guarantees under worst-case transition dynamics. Traditional RMDPs consider discrete-time dynamics and recently, sample-efficient policy gradient algorithms have been considered in this context. This paper investigates policy gradient algorithms within a continuous-time RMDP framework. Policy gradients and adversarial gradients are derived using pathwise and adjoint-based formulas for stochastic and ordinary differential equations. We propose double-loop optimisers to obtain linear convergence in the oracle-based setting and an $\tilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ sample complexity in the sample-based setting in an analysis which also derives novel tools for the framework of undiscounted total cost MDPs. Additionally, we propose mean-field optimisers as distributional optimisers with an $\tilde{\mathcal{O}}(\frac{1}{K})$ oracle-based convergence rate and an $\tilde{\mathcal{O}}(\frac{N^2}{\epsilon})$ sample complexity under $N$-particle approximation. The effectiveness of continuous-time policy gradient algorithms is confirmed for both optimisers on continuous-time RMDPs with neural ordinary differential equation dynamics.
Learning to cooperate with emergent reputation via multi-agent reinforcement learning
- Authors: Xinwei Song, Yizhe Huang, Dengji Zhao, Xue Feng
- Subjects: Computer Science and Game Theory (cs.GT)
- Arxiv link: https://arxiv.org/abs/2606.04359
- Pdf link: https://arxiv.org/pdf/2606.04359
- Abstract
Reputation, the aggregation of peer assessments diffused through social networks, is a pivotal mechanism for promoting cooperation in social dilemmas ubiquitous to distributed multi-agent systems comprising agents with limited perception and cognitive capabilities. Exploring efficient reputation systems, comprising reputation assessment rules and reputation-based policies, is a long-standing challenge. Previous work assumes predefined reputation assessment rules or models reputation as an intrinsic reward to learn policies, compromising the methods' ability for generalization and adaptation. To address this, we propose a distributed multi-agent reinforcement learning method $\textbf{COOPER}$ ($\textbf{COOP}$eration with $\textbf{E}$mergent $\textbf{R}$eputation), which jointly learns reputation assessment rules and reputation-based policies entirely from environment rewards. Notably, leveraging the underlying mechanisms of reputation, we deliberately design the constituent modules of $\textbf{COOPER}$ and the data flows among them, overcoming the latency and noise in the feedback signal, caused by the deep entanglement between reputation and policy. Experiments on the donation game and the coin game in grid world environments demonstrate that $\textbf{COOPER}$ effectively adapts to various existing reputation systems and co-players. Furthermore, we observe the co-emergence of reputation norms and cooperation in self-play settings. These results hold robustly across diverse social network topologies, underscoring the generalizability and efficacy of our approach.
- 中文摘要
声誉是经由社交网络传播的同伴评价的汇聚,是在社会困境中促进合作的关键机制;这类困境普遍存在于由感知和认知能力有限的智能体组成的分布式多智能体系统中。探索包含声誉评估规则和基于声誉的策略的高效声誉系统,是一项长期挑战。以往工作要么假设预先定义的声誉评估规则,要么将声誉建模为内在奖励来学习策略,这削弱了方法的泛化和适应能力。为此,我们提出了一种分布式多智能体强化学习方法 $\textbf{COOPER}$($\textbf{COOP}$eration with $\textbf{E}$mergent $\textbf{R}$eputation),该方法完全从环境奖励中联合学习声誉评估规则和基于声誉的策略。值得注意的是,我们借助声誉的底层机制,有针对性地设计了 $\textbf{COOPER}$ 的各组成模块及其间的数据流,克服了声誉与策略深度纠缠所导致的反馈信号延迟和噪声。在网格世界环境中的捐赠博弈和硬币博弈实验表明,$\textbf{COOPER}$ 能够有效适应各种现有的声誉系统和共同参与者。此外,我们在自博弈设定中观察到声誉规范与合作的共同涌现。这些结果在多种社交网络拓扑下都稳健成立,凸显了我们方法的泛化性和有效性。
When Clients Stop Following: A Cognitive Conceptualization Diagram-driven Framework for Strategic Counseling
当客户停止遵循:一个基于认知概念化图解的战略咨询框架
- Authors: Yihao Qin, Junyi Zhao, Changsheng Ma, Yongfeng Tao, Minqiang Yang, Chang Liu, Bin Hu
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.04389
- Pdf link: https://arxiv.org/pdf/2606.04389
- Abstract
Large Language Models (LLMs) show promise in psychological counseling, yet existing benchmarks rely heavily on highly cooperative simulated clients. We observe a critical counselor-following phenomenon: these clients often rapidly shift from resistance to compliance after only a few turns, creating an illusion of therapeutic progress and inflating scores under current evaluation protocols through superficial empathy. To address this evaluation mismatch, we propose a Cognitive Behavioral Therapy (CBT)-grounded resistance-aware framework. We introduce CARS, a client simulator that explicitly models dynamic resistance via Cognitive Conceptualization Diagrams (CCDs). We present STREAMS, a dual-module framework that decouples strategic reasoning (Thinker) from response generation (Presenter) and optimizes it via reinforcement learning. We further propose EWTS-MI, an entropy-weighted metric for evaluating responsiveness under high-friction interactions. Experiments across resistant and non-resistant counseling settings validate our findings on evaluation mismatch and demonstrate the effectiveness of resistance-aware training for improving strategic robustness under challenging counseling interactions.
- 中文摘要
大型语言模型(LLMs)在心理咨询中展现出潜力,但现有基准高度依赖高度合作的模拟客户。我们观察到一个关键的跟随咨询师现象:这些客户往往在几回合内迅速从抗拒转变为顺从,制造治疗进展的假象,并通过表面的同理心抬高当前评估方案下的分数。为解决这种评估不匹配,我们提出了基于认知行为疗法(CBT)的抗拒意识框架。我们介绍CARS,一种客户模拟器,通过认知概念图(CCD)明确建模动态阻力。我们介绍STREAMS,一个双模块框架,将战略推理(Thinker)与反应生成(Presenter)解耦,并通过强化学习进行优化。我们还提出了EWTS-MI这一熵加权指标,用于评估高摩擦相互作用下的响应性。在抗拒性和非抗拒性咨询环境中的实验验证了我们关于评估不匹配的发现,并证明了抗拒意识训练在挑战性咨询互动下提升战略稳健性的有效性。
Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models
阅读追踪,引导路径:扩散语言模型的轨迹感知强化学习
- Authors: Anant Khandelwal, Manish Gupta
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.04396
- Pdf link: https://arxiv.org/pdf/2606.04396
- Abstract
Diffusion large language models (dLLMs) generate responses by iteratively unmasking and revising many positions in parallel. This process leaves a rich denoising trace depicting which tokens become confident, which remain unstable, and when commitments form. Existing dLLM reinforcement learning methods use this signal only weakly. Flat rollouts are cheap, but assign a single outcome reward to the whole trajectory. Tree rollouts provide finer, verifiable training signals by branching partial trajectories and propagating leaf rewards upward, but are compute intensive. We ask whether the denoising trace itself can provide tree-like supervision without tree-level compute. We introduce CAPR (Cached-Amortized Path Refinement), a dLLM-RL algorithm that summarizes the denoising trace into a compact path state, uses cached trajectory states to generate cheap sibling continuations, and trains a block-level value head for local block-wise supervision. Under a block-wise unmasking schedule, CAPR records path-state and block-progress features, then redistributes the final outcome reward across blocks according to the tokens revealed in each block. This trains the value head to convert one sparse reward into block-level PPO weights. CAPR therefore recovers much of the granularity of tree search while avoiding full tree expansion, reducing rollout-generation cost to roughly 0.75x that of flat rollouts and 0.6x that of tree rollouts (under standard settings). Across 4x4 Sudoku, Countdown, GSM8K, and Math500, on dense and mixture-of-experts LLaDA backbones, CAPR sets a new state of the art for RL-tuned dLLMs at 256- and 512-token budgets. On Sudoku, it matches the strongest tree-structured baseline at less than one third of the per-step compute.
- 中文摘要
扩散大型语言模型(dLLMs)通过并行地迭代去掩码并修订多个位置来生成响应。该过程留下丰富的去噪轨迹,记录了哪些词元变得置信、哪些仍不稳定,以及承诺何时形成。现有的dLLM强化学习方法对这一信号利用不足。扁平rollout成本低,但只给整条轨迹分配一个结果奖励。树状rollout通过对部分轨迹分支并将叶节点奖励向上传播,提供更细致、可验证的训练信号,但计算量大。我们探讨去噪轨迹本身能否在不付出树级计算的情况下提供类似树的监督。我们提出CAPR(Cached-Amortized Path Refinement,缓存摊销路径细化),这是一种dLLM-RL算法,它将去噪轨迹概括为紧凑的路径状态,利用缓存的轨迹状态生成廉价的兄弟延续,并训练一个块级价值头用于局部的逐块监督。在逐块去掩码调度下,CAPR记录路径状态和块进度特征,然后根据每个块中揭示的词元,将最终结果奖励重新分配到各个块。这使价值头学会把一个稀疏奖励转换为块级PPO权重。因此,CAPR在避免完整树扩展的同时恢复了树搜索的大部分粒度,将rollout生成成本降低到约为扁平rollout的0.75倍、树状rollout的0.6倍(标准设置下)。在4x4数独、Countdown、GSM8K和Math500上,基于稠密和专家混合(MoE)的LLaDA骨干,CAPR在256和512词元预算下为经强化学习调优的dLLM创下新的最优水平。在数独上,它以不到三分之一的每步计算量达到了最强树结构基线的水平。
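The CAPR abstract above says the final outcome reward is redistributed across blocks "according to the tokens revealed in each block". A minimal sketch of one way to read that sentence follows. It assumes a plain proportional split; the function name `redistribute_reward` and the even-split fallback are illustrative assumptions, not the authors' implementation.

```python
def redistribute_reward(final_reward, tokens_revealed_per_block):
    """Split one trajectory-level reward across blocks.

    Assumption (not from the paper): each block's share is proportional
    to the number of tokens it revealed during denoising.
    """
    total = sum(tokens_revealed_per_block)
    n_blocks = len(tokens_revealed_per_block)
    if n_blocks == 0:
        return []
    if total == 0:
        # Hypothetical fallback: nothing was revealed, so split evenly.
        return [final_reward / n_blocks] * n_blocks
    return [final_reward * t / total for t in tokens_revealed_per_block]


# Three blocks that revealed 8, 16, and 8 tokens share a reward of 1.0.
block_targets = redistribute_reward(1.0, [8, 16, 8])
print(block_targets)  # [0.25, 0.5, 0.25]
```

Targets of this kind could serve as regression labels for a block-level value head; how CAPR actually turns them into PPO weights is not specified in the abstract.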
When Chatbots Accommodate: What AI Companions Optimize for in Vulnerable Conversations
当聊天机器人迁就时:AI伴侣在脆弱对话中优化什么
- Authors: Minh Duc Chu, Yifan Wu, Zhiyi Chen, Angel Hsing-Chi Hwang, Luca Luceri
- Subjects: Human-Computer Interaction (cs.HC)
- Arxiv link: https://arxiv.org/abs/2606.04431
- Pdf link: https://arxiv.org/pdf/2606.04431
- Abstract
Millions turn to AI companion chatbots during loneliness, grief, and personal crises. How these companion platforms respond in such moments can shape the trajectory of a user's vulnerable state. Yet we lack tools to characterize what each platform actually does when users open up. Existing audits score reactions to pre-defined crisis prompts and miss the underlying decision policy that governs sustained interaction. We address these gaps with two key contributions. First, we introduce the AI Companion Vulnerability-Response Taxonomy, a paired taxonomy of user vulnerability and chatbot response designed for analyzing extended companion chatbot interactions. Second, we infer the response policy each platform follows across distinct vulnerability scenarios by applying Inverse Reinforcement Learning to ~48k turns of real-world user conversations with GPT-4.1, this http URL, and Replika. Our findings reveal what AI companions prioritize in conversations with vulnerable users: GPT-4.1 reaches for advice, this http URL spreads its response across different strategies without a dominant mode, and Replika consistently asks questions and stays present. Each, however, downweights the responses that introduce corrective friction: GPT-4.1 probes less as conversations continue and when interacting with psychologically high-risk users; Replika advises bonded users more and challenges them less; this http URL shows no committed engagement strategy on internal distress. Estimated policies are invisible to output-level audits, providing a new lens for auditing chatbots in the wild and enabling more realistic safety evaluation.
- 中文摘要
数百万人在孤独、悲伤和个人危机时求助于AI伴侣聊天机器人。这些伴侣平台在此类时刻的回应方式,可能影响用户脆弱状态的发展轨迹。然而,我们缺乏工具来刻画用户敞开心扉时每个平台实际做了什么。现有审计只对预设危机提示的反应进行评分,忽略了支配持续互动的底层决策策略。我们通过两项关键贡献来弥补这些空白。首先,我们提出AI伴侣脆弱性-回应分类法,这是一种将用户脆弱性与聊天机器人回应配对的分类法,用于分析长时间的伴侣聊天机器人交互。其次,我们对约4.8万轮与GPT-4.1、this http URL和Replika的真实用户对话应用逆向强化学习,推断每个平台在不同脆弱性场景下遵循的回应策略。我们的发现揭示了AI伴侣在与脆弱用户对话时优先考虑什么:GPT-4.1倾向于给出建议,this http URL将回应分散在不同策略上而没有主导模式,Replika则持续提问并保持陪伴。然而,三者都降低了会引入纠正性摩擦的回应的权重:GPT-4.1随着对话持续以及在与心理高风险用户互动时追问更少;Replika对已建立情感联结的用户建议更多、质疑更少;this http URL在内在困扰方面没有表现出明确的参与策略。估计出的策略对输出层面的审计是不可见的,这为审计真实环境中的聊天机器人提供了新的视角,并使更贴近现实的安全评估成为可能。
AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning
AgentJet:一个用于智能体强化学习的灵活群体训练框架
- Authors: Qingxu Fu, Boyin Liu, Shuchang Tao, Zhaoyang Liu, Bolin Ding
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
- Arxiv link: https://arxiv.org/abs/2606.04484
- Pdf link: https://arxiv.org/pdf/2606.04484
- Abstract
We present AgentJet, a distributed swarm training framework for large language model (LLM) agent reinforcement learning. Unlike centralized frameworks that tightly couple agent rollouts with model optimization, AgentJet adopts a decoupled multi-node architecture in which swarm server nodes host trainable models and run optimization on GPU clusters, whereas swarm client nodes execute arbitrary agents on arbitrary devices. This design provides capabilities that are difficult to support in centralized frameworks: (1) heterogeneous multi-model reinforcement learning, enabling the training of heterogeneous multi-agent teams with multiple LLMs as brains; (2) multi-task cocktail training with isolated agent runtimes; (3) fault-tolerant execution that prevents external environment failures from interrupting the training process; and (4) live code iteration, which allows agents to be edited during training by replacing swarm client nodes. To support efficient RL in multi-model, multi-turn, and multi-agent settings, AgentJet introduces a context tracking module with timeline merging, which consolidates redundant context and achieves a 1.5-10x training speedup. Finally, AgentJet introduces an automated research system that takes a research topic as input and autonomously conducts long-horizon, multi-day RL studies on large-scale clusters. By leveraging the swarm architecture, this system reproduces key exploratory workflows of RL researchers without human intervention during execution.
- 中文摘要
我们提出AgentJet,一个用于大型语言模型(LLM)智能体强化学习的分布式群体训练框架。与将智能体rollout和模型优化紧密耦合的集中式框架不同,AgentJet采用解耦的多节点架构:群体服务器节点托管可训练模型并在GPU集群上运行优化,而群体客户端节点则在任意设备上执行任意智能体。该设计提供了集中式框架难以支持的能力:(1)异构多模型强化学习,可训练以多个LLM为大脑的异构多智能体团队;(2)多任务混合(cocktail)训练,各智能体运行时相互隔离;(3)容错执行,防止外部环境故障中断训练过程;以及(4)实时代码迭代,允许在训练过程中通过替换群体客户端节点来修改智能体。为了在多模型、多轮和多智能体设定下支持高效的强化学习,AgentJet引入了带时间线合并的上下文跟踪模块,它合并冗余上下文,实现1.5至10倍的训练加速。最后,AgentJet引入了一套自动化研究系统,它以研究课题为输入,在大规模集群上自主开展长周期、多日的强化学习研究。借助群体架构,该系统在执行过程中无需人工干预即可复现强化学习研究者的关键探索性工作流。
Episodic Memory Temporal Consistency for Cooperative Multi-Agent Reinforcement Learning
合作多代理强化学习中的情节记忆时间一致性
- Authors: Zicheng Zhao, Yu Lan, Chengzhengxu Li, Zhaohan Zhang, Xiaoming Liu
- Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
- Arxiv link: https://arxiv.org/abs/2606.04492
- Pdf link: https://arxiv.org/pdf/2606.04492
- Abstract
Cooperative Multi-Agent Reinforcement Learning (MARL) frequently suffers from severe reward sparsity and exploration bottlenecks. While episodic memory mechanisms mitigate these issues by reusing high-return trajectories, they often trap agents in local optima due to unconstrained incentive distribution and semantic representation collapse. To address this, we propose Episodic Memory Temporal Consistency (EMTC), a framework that robustly constructs and selectively leverages historical experiences. EMTC introduces two synergistic components: (1) a Temporally Consistent Semantic Embedder that integrates contrastive learning with time-conditioned state reconstruction, preventing representation collapse and enabling precise memory retrieval; and (2) a Temporal Consistency Gating Mechanism that dynamically modulates episodic incentives based on temporal consistency error. This adaptive gate filters misleading signals from pseudo-successful trajectories, effectively mitigating Q-value overestimation. We provide theoretical guarantees, establishing a strict error bound that directly links the observable temporal consistency error to the underlying trajectory optimality and representation quality. Extensive evaluations on the SMAC and GRF benchmarks demonstrate that EMTC consistently outperforms state-of-the-art baselines. Notably, compared to the strongest episodic baseline, EMTC achieves absolute win-rate improvements of up to 24% in super-hard SMAC scenarios and an average improvement of 28% across GRF tasks.
- 中文摘要
合作式多智能体强化学习(MARL)经常受困于严重的奖励稀疏和探索瓶颈。情景记忆机制通过重用高回报轨迹来缓解这些问题,但由于激励分配不受约束以及语义表征坍缩,它们常常使智能体陷入局部最优。为此,我们提出情景记忆时间一致性(EMTC)框架,它能稳健地构建并有选择地利用历史经验。EMTC引入两个协同组件:(1)时间一致语义嵌入器,将对比学习与时间条件的状态重建相结合,防止表征坍缩并实现精确的记忆检索;以及(2)时间一致性门控机制,根据时间一致性误差动态调节情景激励。这一自适应门控过滤掉伪成功轨迹带来的误导信号,有效缓解Q值高估。我们给出理论保证,建立了严格的误差界,将可观测的时间一致性误差与底层轨迹的最优性和表征质量直接联系起来。在SMAC和GRF基准上的大量评估表明,EMTC始终优于最先进的基线。值得注意的是,与最强的情景记忆基线相比,EMTC在超难SMAC场景中的绝对胜率提升最高达24%,在GRF任务上平均提升28%。
Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots
黑暗中的聪明选择:通过追踪元认知枢纽实现高效的RLVR推理
- Authors: Guangcheng Zhu, Shenzhi Yang, Haobo Wang, Xing Zheng, Yingfan MA, Xuening Feng, Zhongqi Chen, Bowen Song, Weiqiang Wang, Gang Chen
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04503
- Pdf link: https://arxiv.org/pdf/2606.04503
- Abstract
Reinforcement learning with verifiable rewards (RLVR) has greatly advanced large reasoning models (LRMs), but it requires timely training on a huge fully-annotated dataset. To this end, data-efficient RLVR methods have been widely studied from two perspectives: (i) data selection methods identify a small subset of "golden" samples that yield near-full-data performance, but they rely on a pre-existing pool of labeled data; (ii) unsupervised RLVR methods train the model using its own internal supervision signals on large-scale unlabeled data, yet they exhibit suboptimal performance. Accordingly, we investigate the "pick in the dark" setup for RLVR, which aims to select, without prior supervision, unlabeled samples that are most beneficial for training and worthy of annotation. Through systematic analysis, we demonstrate that smart picks hinge on a well-calibrated uncertainty estimator to enable strategic partitioning of data for adaptive training regimes. Building on this insight, we propose PivotTrace, a three-way data triage framework that leverages attention dynamics to trace metacognitive pivots during reasoning. By precisely quantifying uncertainty through pivot density, PivotTrace achieves automated data routing to synergistically maximize both annotation and training efficiency. Empirically, PivotTrace surpasses the fully supervised LRM with only 29.3% of the samples annotated and 2.75x faster convergence.
- 中文摘要
带有可验证奖励的强化学习(RLVR)极大地推动了大型推理模型(LRM)的发展,但它需要在庞大的全标注数据集上及时训练。为此,数据高效的RLVR方法已从两个角度得到广泛研究:(i)数据选择方法识别出一小部分“黄金”样本,其效果接近使用全部数据,但依赖于已有的标注数据池;(ii)无监督RLVR方法在大规模无标注数据上使用模型自身的内部监督信号进行训练,但性能欠佳。因此,我们研究RLVR的“暗中挑选”(pick in the dark)设定,旨在没有先验监督的情况下,选出对训练最有益、最值得标注的无标注样本。通过系统分析,我们证明明智的挑选取决于一个校准良好的不确定性估计器,它使数据得以按策略划分,以适配不同的训练方案。基于这一见解,我们提出PivotTrace,一个三路数据分流框架,利用注意力动态来追踪推理过程中的元认知枢纽。通过以枢纽密度精确量化不确定性,PivotTrace实现了自动数据路由,协同最大化标注效率和训练效率。实验上,PivotTrace仅用29.3%的标注样本就超过了全监督的LRM,且收敛速度快2.75倍。
Self-Evolving Deep Research via Joint Generation and Evaluation
通过联合生成与评估实现自我演进的深度研究
- Authors: Han Zhu, Chengkun Cai, Yuanfeng Song, Xing Chen, Sirui Han, Yike Guo
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04507
- Pdf link: https://arxiv.org/pdf/2606.04507
- Abstract
Large Language Models (LLMs) have become increasingly adopted in daily applications, with deep research standing out as a particularly important capability. Unlike traditional question-answering (QA) tasks, deep research report generation lacks definitive ground-truth, making reward design inherently unverifiable and limiting effective reinforcement learning. Existing approaches mitigate this challenge with LLM-as-a-judge and query-dependent evaluation rubrics, but they still rely on static evaluators that cannot adapt their standards as the solver improves, leading to insufficient and eventually saturated optimization pressure. We address this limitation with a \textbf{s}elf-evolving \textbf{co}-evolutionary training framework for deep \textbf{re}search evaluation and generation (SCORE), which tightly couples an evaluator and a solver in a shared-parameter learning process. Rather than treating generation and evaluation as isolated modules, we leverage their intrinsic connection to enable joint improvement within a single shared-parameter model. To restrict this process, we introduce a meta-harness, which dynamically controls the evaluation environment based on solver performance, encouraging valid evaluation dimensions and sufficiently deep evaluator search. Extensive experiments on deep research benchmarks demonstrate consistent improvement in report generation quality, showing that co-evolving evaluation and generation is a promising direction for training open-ended research agents.
- 中文摘要
大型语言模型(LLMs)在日常应用中日益普及,其中深度研究是一项尤为重要的能力。与传统问答(QA)任务不同,深度研究报告生成缺乏明确的标准答案,使得奖励设计本质上不可验证,限制了强化学习的有效性。现有方法通过LLM-as-a-judge和依赖查询的评估量规来缓解这一挑战,但它们仍依赖静态评估器,无法随着求解器的进步调整标准,导致优化压力不足并最终饱和。针对这一局限,我们提出一个用于深度研究评估与生成的自进化协同进化训练框架SCORE(\textbf{s}elf-evolving \textbf{co}-evolutionary training framework for deep \textbf{re}search evaluation and generation),它在共享参数的学习过程中将评估器与求解器紧密耦合。我们不把生成和评估视为孤立模块,而是利用二者的内在联系,在单一共享参数模型内实现联合改进。为约束这一过程,我们引入了元框架(meta-harness),它根据求解器的表现动态控制评估环境,鼓励有效的评估维度和足够深入的评估器搜索。在深度研究基准上的大量实验表明报告生成质量持续提升,说明评估与生成的协同进化是训练开放式研究智能体的一个有前景的方向。
GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling
GeoMin:通过几何分布建模实现数据高效半监督RLVR
- Authors: Guangcheng Zhu, Shenzhi Yang, Haobo Wang, Xing Zheng, Yingfan MA, Xuening Feng, Zhongqi Chen, Kai Tang, Zhengqing Zang, Bowen Song, Weiqiang Wang, Gang Chen
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04516
- Pdf link: https://arxiv.org/pdf/2606.04516
- Abstract
Reinforcement learning with verifiable rewards (RLVR) significantly advances LLM reasoning, yet it faces a dilemma: standard supervised scaling is throttled by high annotation costs, while unsupervised alternatives suffer from severe model collapse. Recent semi-supervised RLVR methods address this by using a small labeled set to guide unlabeled data, achieving a promising trade-off between training efficacy and annotation cost. However, they suffer from a severe data-efficiency bottleneck due to the reliance on coarse performance heuristics, leaving a vast majority of valuable instances underutilized. To this end, we propose GeoMin, which models global feature distributions on labeled data to decode the structural discrepancy between correct and incorrect rollouts, thereby establishing a robust prior to assess the reliability of self-reward signals and fully unleash the potential of unlabeled data. Empirically, GeoMin outperforms the strongest baselines by +4.1% and even surpasses fully supervised models with only 10% of the annotations, demonstrating remarkable data efficiency.
- 中文摘要
带可验证奖励的强化学习(RLVR)显著推动了LLM推理,但面临两难:标准的有监督扩展受制于高昂的标注成本,而无监督替代方案则容易出现严重的模型崩溃。近期的半监督RLVR方法用少量有标注数据来引导无标注数据,在训练效果与标注成本之间取得了有希望的权衡。然而,由于依赖粗糙的性能启发式,它们存在严重的数据效率瓶颈,绝大多数有价值的样本未被充分利用。为此,我们提出GeoMin,它在有标注数据上建模全局特征分布,以解析正确与错误rollout之间的结构差异,从而建立稳健的先验来评估自奖励信号的可靠性,充分释放无标注数据的潜力。实验上,GeoMin比最强基线高出4.1%,并且仅用10%的标注就超过了全监督模型,展现出显著的数据效率。
Rollout-Level Advantage-Prioritized Experience Replay for GRPO
面向GRPO的rollout级优势优先经验回放
- Authors: Gyeongtae Yoo, Sanghyeok Park, Soohyuk Jang, Ik-hwan Kim, Sungroh Yoon
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04560
- Pdf link: https://arxiv.org/pdf/2606.04560
- Abstract
Reinforcement learning from verifiable rewards with GRPO is a standard approach for post-training reasoning LLMs. It remains sample inefficient. Each rollout is used for a single gradient update and then discarded. Naive replay is not well suited in this setting because LLM policies drift quickly per gradient step. Stored rollouts therefore become stale and can destabilize training. We propose a rollout-level replay buffer for GRPO that stores and samples individual rollouts rather than whole groups. The buffer bounds staleness through age eviction. Any rollout older than tau_max training steps is removed. The buffer also preserves on-policy data via fresh-anchored composition. Each batch keeps its fresh on-policy rollouts and then concatenates replay rollouts drawn separately from the buffer. We prioritize replay by per-rollout advantage magnitude and recycle individual rollouts whose advantages are large. Across three Qwen3-Base scales on five math benchmarks, our method outperforms GRPO and naive replay baselines. Gains are positive at every scale and grow with model size. The largest gain is +4.35 pp on the five-benchmark average at 4B. Under an AES metric that jointly measures accuracy and token efficiency, the efficiency margin over GRPO is again largest at 4B, at +0.579.
- 中文摘要
使用GRPO进行基于可验证奖励的强化学习,是对推理LLM进行后训练的标准方法,但其样本效率仍然较低:每个rollout只用于一次梯度更新,随后即被丢弃。朴素的回放并不适合这一设定,因为LLM策略在每个梯度步都会快速漂移,存储的rollout因此变得陈旧,并可能破坏训练的稳定性。我们为GRPO提出一种rollout级回放缓冲区,它存储并采样单个rollout,而非整个组。缓冲区通过按年龄淘汰来限制陈旧程度:任何早于tau_max个训练步的rollout都会被移除。缓冲区还通过以新鲜数据为锚的组合方式保留on-policy数据:每个批次保留其新生成的on-policy rollout,再拼接从缓冲区中单独抽取的回放rollout。我们按每个rollout的优势幅度确定回放优先级,重复利用优势较大的单个rollout。在三种规模的Qwen3-Base和五个数学基准上,我们的方法优于GRPO和朴素回放基线。各个规模上的增益均为正,并随模型规模增大而增长。最大增益出现在4B规模,五个基准的平均值提升4.35个百分点。在同时衡量准确率和词元效率的AES指标下,相对GRPO的效率优势同样在4B规模最大,为+0.579。
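The replay abstract above names three mechanisms: age eviction bounded by tau_max, fresh-anchored batch composition, and prioritization by per-rollout advantage magnitude. The sketch below puts those three pieces together in plain Python. The class and function names (`Rollout`, `RolloutReplayBuffer`, `compose_batch`), the sampling-with-replacement choice, and the small epsilon added to priorities are assumptions made for illustration; the paper's actual sampling rule may differ.

```python
import random
from dataclasses import dataclass


@dataclass
class Rollout:
    prompt_id: int
    tokens: list
    advantage: float
    step: int  # training step at which this rollout was generated


class RolloutReplayBuffer:
    """Stores individual rollouts (not whole GRPO groups)."""

    def __init__(self, tau_max, seed=0):
        self.tau_max = tau_max
        self.items = []
        self.rng = random.Random(seed)

    def add(self, rollouts):
        self.items.extend(rollouts)

    def evict(self, current_step):
        # Age eviction: drop any rollout older than tau_max training steps.
        self.items = [r for r in self.items
                      if current_step - r.step <= self.tau_max]

    def sample(self, k):
        # Assumed priority: proportional to |advantage|, sampled with
        # replacement. The epsilon keeps every weight strictly positive.
        if not self.items or k <= 0:
            return []
        weights = [abs(r.advantage) + 1e-6 for r in self.items]
        return self.rng.choices(self.items, weights=weights, k=k)


def compose_batch(fresh, buffer, current_step, n_replay):
    """Fresh-anchored composition: fresh on-policy rollouts come first,
    replayed rollouts are drawn separately and concatenated after them."""
    buffer.evict(current_step)
    replay = buffer.sample(n_replay)
    buffer.add(fresh)  # fresh rollouts become replay candidates next step
    return fresh + replay
```

A training loop would call `compose_batch` once per step, passing the rollouts it just generated, and compute the policy loss on the returned list.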
Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models
尼提亚巴斯:理性智能体模型中不确定性意识公共政策优化框架
- Authors: Janani Venugopalan, Gaurav Deshkar, Rishabh Gaur, Harshal Hayatnagarkar, Jayanta Kshirsagar
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
- Arxiv link: https://arxiv.org/abs/2606.04562
- Pdf link: https://arxiv.org/pdf/2606.04562
- Abstract
Purpose: The WHO's COVID-19 non-pharmaceutical interventions (e.g., lockdowns, vaccinations) effectively curb transmission but impose heavy economic strains. Existing research often neglects individual behaviors and falsely assumes perfect infection tracking and flawless policy execution, failing to account for real-world uncertainties and errors. Methods: We propose an integrative approach incorporating uncertainties in both epidemic measurement (infections/hospitalizations) and policy implementation. We built a simulation model of 1,000 individuals making real-time choices regarding mask-wearing, vaccination, and shopping. Concurrently, policymakers deploy interventions (lockdowns, mandates) based on health and economic observations. This framework is driven by hierarchical reinforcement learning agents, utilizing deep Q-networks alongside uncertainty-aware policy gradient variants (DDPG and TD3). Results: The simulations effectively managed the epidemic's progression. Masking and vaccinations proved highly effective, significantly reducing both the outbreak's peak height and duration. By integrating individual behaviors, policy uncertainties, and multifaceted interventions, our dynamic control approach successfully mitigated the epidemic's impact. Conclusions: Our model overcomes previous research limitations by embedding uncertainty and human behavior into public health policy frameworks. The simulation demonstrates that accounting for individual choices and imperfect data is crucial for designing effective interventions during complex pandemics, with masks and vaccines serving as pivotal tools.
- 中文摘要
目的:世卫组织针对COVID-19的非药物干预措施(如封锁、疫苗接种)能有效遏制传播,但会带来沉重的经济压力。现有研究常常忽视个体行为,并错误地假设感染追踪完美、政策执行无误,未能考虑现实世界中的不确定性和误差。方法:我们提出一种整合方法,在疫情测量(感染/住院)和政策实施两方面都纳入不确定性。我们构建了一个包含1000名个体的模拟模型,个体就戴口罩、接种疫苗和购物实时做出选择。与此同时,政策制定者根据健康和经济观测部署干预措施(封锁、强制令)。该框架由分层强化学习智能体驱动,使用深度Q网络以及不确定性感知的策略梯度变体(DDPG和TD3)。结果:模拟有效控制了疫情的发展。戴口罩和接种疫苗效果显著,明显降低了疫情的峰值高度并缩短了持续时间。通过整合个体行为、政策不确定性和多方面干预措施,我们的动态控制方法成功减轻了疫情的影响。结论:我们的模型将不确定性和人类行为嵌入公共卫生政策框架,克服了以往研究的局限。模拟表明,在复杂的大流行中,考虑个体选择和不完美数据对于设计有效干预措施至关重要,口罩和疫苗是其中的关键工具。
Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning
基于深度强化学习的加密货币市场动态多对交易策略
- Authors: Damian Lebiedź, Robert Ślepaczuk
- Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Statistical Finance (q-fin.ST); Trading and Market Microstructure (q-fin.TR); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2606.04574
- Pdf link: https://arxiv.org/pdf/2606.04574
- Abstract
This study aims to determine whether the application of Deep Reinforcement Learning (DRL) as a specialized execution overlay can enhance pair trading in highly volatile cryptocurrency markets. Although classical implementations of the strategy have proven successful in traditional equities, they frequently exhibit rigidity and suffer from severe divergence risks when applied to high-variance environments. To address this need, this research introduces novel concepts. To construct a robust system, we developed a hierarchical "Filter-then-Rank" pair selection methodology and a proprietary "Fixed Risk, Adaptive Mean" execution model. The system employs a Proximal Policy Optimization (PPO) agent with a Long Short-Term Memory (LSTM) layer to govern execution decisions within strict deterministic risk management boundaries. Evaluated on 1-hour interval data from the Binance USD-M Futures market, the optimized RL policy achieved an out-of-sample performance that substantially outperformed the heuristic baseline. A stationary circular block bootstrap robustness check confirms that the agent's risk-adjusted outperformance is statistically significant at the 10 percent level. Although falling marginally short of the stricter 5 percent threshold, this result highlights the extreme idiosyncratic variance characteristic of digital assets. Ultimately, this thesis contributes to the quantitative finance literature by introducing a hybrid architecture that combines statistical arbitrage with DRL execution policies. Furthermore, it delivers a novel framework for safe reinforcement learning via deterministic shielding, proving that anchoring a neural policy to statistically robust boundaries successfully mitigates severe divergence risks.
- 中文摘要
本研究旨在确定,将深度强化学习(DRL)用作专门的执行叠加层,能否改进高波动加密货币市场中的配对交易。尽管该策略的经典实现在传统股票市场中已被证明有效,但应用于高方差环境时往往表现僵化,并面临严重的背离风险。为此,本研究引入了若干新概念。为构建稳健的系统,我们开发了分层的“先筛选后排序”(Filter-then-Rank)配对选择方法和自有的“固定风险、自适应均值”(Fixed Risk, Adaptive Mean)执行模型。该系统采用带长短期记忆(LSTM)层的近端策略优化(PPO)智能体,在严格的确定性风险管理边界内做出执行决策。在币安USD-M期货市场的1小时间隔数据上评估,优化后的RL策略的样本外表现显著优于启发式基线。平稳循环块自助法(stationary circular block bootstrap)稳健性检验证实,智能体的风险调整后超额表现在10%水平上具有统计显著性。虽然略未达到更严格的5%阈值,这一结果凸显了数字资产极端的特质性方差。最终,本论文通过引入一种将统计套利与DRL执行策略相结合的混合架构,为量化金融文献做出了贡献。此外,它通过确定性屏蔽(shielding)提供了一个安全强化学习的新框架,证明将神经策略锚定在统计上稳健的边界内能够有效缓解严重的背离风险。
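The pair-trading abstract above reports a circular block bootstrap robustness check. As a rough illustration of that family of tests, the sketch below resamples a return series in wrap-around blocks and estimates a one-sided p-value. It makes several simplifying assumptions that are not from the paper: a fixed block length (a stationary bootstrap would draw random block lengths), mean excess return as the statistic instead of a risk-adjusted measure, and hypothetical function names.

```python
import random


def circular_block_bootstrap(series, block_len, rng):
    """Resample a series in contiguous blocks that wrap around its end."""
    if block_len < 1:
        raise ValueError("block_len must be at least 1")
    n = len(series)
    out = []
    while len(out) < n:
        start = rng.randrange(n)  # random block start
        out.extend(series[(start + i) % n] for i in range(block_len))
    return out[:n]  # trim the last block to the original length


def bootstrap_p_value(excess_returns, block_len=24, n_boot=1000, seed=0):
    """One-sided p-value for 'mean excess return > 0'.

    The series (assumed non-empty) is centered to impose the null
    hypothesis of zero mean before resampling.
    """
    rng = random.Random(seed)
    n = len(excess_returns)
    observed = sum(excess_returns) / n
    centered = [x - observed for x in excess_returns]
    hits = 0
    for _ in range(n_boot):
        sample = circular_block_bootstrap(centered, block_len, rng)
        if sum(sample) / n >= observed:
            hits += 1
    return hits / n_boot
```

Resampling whole blocks keeps short-range autocorrelation in hourly returns intact, which an i.i.d. bootstrap would destroy.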
SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification
SCI-PRM:一种基于工具感知的过程奖励模型,用于科学推理验证
- Authors: Xiangyu Zhao, Hengyuan Zhao, Yiheng Wang, Wanghan Xu, Yuhao Zhou, Qinglong Cao, Zhiwang Zhou, Lei Bai, Wenlong Zhang, Xiao-Ming Wu
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04579
- Pdf link: https://arxiv.org/pdf/2606.04579
- Abstract
While Process Reward Models (PRMs) have achieved remarkable success in mathematical reasoning, their application in complex scientific domains, such as biology, chemistry, and physics, remains largely unexplored. Scientific problems demand not only logical rigor but also factual consistency and the precise usage of domain-specific tools, areas where current models often suffer from hallucinations and lack of verification. In this paper, we first construct SCIPRM70K, a large-scale dataset featuring Chain-of-Tool trajectories that explicitly interleave reasoning with the execution of scientific tools. Building upon this, we train an efficient reward model called Sci-PRM to provide fine-grained supervision on tool selection, execution accuracy, and result interpretation at each step in one inference. Experiments demonstrate that Sci-PRM significantly enhances foundation models in two key aspects: (1) it enables effective test-time scaling via Best-of-N selection; and (2) when integrated into Reinforcement Learning, it serves as a dense reward signal that mitigates the critical issue of advantage disappearance, allowing the model to break through existing performance ceilings.
- 中文摘要
虽然过程奖励模型(PRM)在数学推理方面取得了显著成功,但其在生物学、化学和物理等复杂科学领域的应用仍然鲜有深入探讨。科学问题不仅需要逻辑严谨性,还需要事实一致性以及精确使用领域特定工具,而现有模型常常存在幻觉和缺乏验证。本文首先构建了SCIPRM70K,这是一个大规模数据集,具有工具链轨迹,明确将推理与科学工具的执行交织在一起。基于此,我们训练了一个高效的奖励模型——Sci-PRM,在推理的每一步对工具选择、执行准确性和结果解释提供细致监督。实验表明,Sci-PRM在两个关键方面显著增强了基础模型:(1)通过N中最佳选择实现有效的测试时间缩放;(2)当它集成到强化学习中时,它作为一个密集的奖励信号,缓解了优势消失的关键问题,使模型能够突破现有的性能上限。
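The Sci-PRM abstract above mentions test-time scaling via Best-of-N selection with a step-level reward model. A minimal sketch of that selection rule follows. The aggregation by minimum step score is an assumption (mean or product are common alternatives), and `toy_scores` stands in for a real process reward model, which is not reproduced here.

```python
def trajectory_score(step_scores):
    """Collapse per-step rewards into one score.

    Assumption: a trajectory is only as good as its weakest step.
    """
    return min(step_scores) if step_scores else float("-inf")


def best_of_n(candidates, score_steps):
    """Return the candidate whose step scores aggregate highest.

    candidates:  list of N sampled trajectories
    score_steps: callable mapping a trajectory to its per-step scores
                 (a hypothetical stand-in for the reward model)
    """
    return max(candidates, key=lambda c: trajectory_score(score_steps(c)))


# Toy per-step scores for three sampled trajectories.
toy_scores = {
    "traj_a": [0.9, 0.2, 0.8],
    "traj_b": [0.7, 0.6, 0.7],
    "traj_c": [0.95, 0.9, 0.1],
}
best = best_of_n(list(toy_scores), toy_scores.__getitem__)
print(best)  # traj_b
```

With min-aggregation, a single badly scored tool call sinks an otherwise strong trajectory, which is one way to reward step-level correctness over final-answer plausibility.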
Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues
多模态长篇对话中的细粒度片段检索
- Authors: Hanbo Bi, Zhiqiang Yuan, Chongyang Li, Qiwei Yan, Zexi Jia, Jiapei Zhang, Xiaoyue Duan, Yingchao Feng, Jinchao Zhang, Jie Zhou
- Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
- Arxiv link: https://arxiv.org/abs/2606.04591
- Pdf link: https://arxiv.org/pdf/2606.04591
- Abstract
With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.
- 中文摘要
随着多模态交流平台的广泛采用,文本与图像交错的长对话变得越来越普遍。用户通常需要检索与特定主题相关的连贯对话片段,而非孤立的话语。我们提出细粒度片段检索(FFR),在多模态长对话中定位语义相关的多话语、多图像片段。我们探索两种设定:(1)单一对话内的FFR,从给定对话中检索片段;以及(2)对话语料库内的FFR,面向开放域场景从大规模语料库中检索。对于(1),我们提出F2RVLM,这是一种基于生成的检索模型,通过强化学习训练,利用多目标奖励和难度感知的课程采样来增强片段连贯性。对于(2),我们开发了FFRS,这是一个将离线片段级索引与在线检索相结合的两阶段系统。具体来说,每个对话被分解为最小语义片段,由片段嵌入模型(FEM)编码后存入向量数据库;推理时,FEM快速召回Top-K候选,再由F2RVLM进行细粒度推理以识别最相关的子内容。为支持FFR,我们构建了MLDR(迄今最长的多模态对话检索数据集)以及一个基于微信的真实世界测试集。在两个基准上的实验表明,F2RVLM和FFRS在单对话和语料库级FFR上均持续取得更优的性能。
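The FFRS description above is a two-stage pipeline: fragments are embedded offline into a vector store, and at inference the embedding model recalls Top-K candidates before a second model refines them. The first stage can be sketched with a brute-force cosine-similarity search. The in-memory dictionary index, the function names, and the toy two-dimensional vectors are all illustrative assumptions; a real system would use learned embeddings and a vector database.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall_top_k(query_vec, index, k):
    """Stage 1: return ids of the k fragments most similar to the query.

    index: dict mapping fragment id -> precomputed embedding
           (a stand-in for the offline fragment-level index)
    """
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [frag_id for frag_id, _ in ranked[:k]]


# Toy index with three fragment embeddings.
index = {"frag_a": [1.0, 0.0], "frag_b": [0.0, 1.0], "frag_c": [0.9, 0.1]}
print(recall_top_k([1.0, 0.0], index, k=2))  # ['frag_a', 'frag_c']
```

The recalled ids would then be handed to the second-stage model for fine-grained reasoning over the candidate fragments.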
VentAgent: When LLMs Learn to Breathe -- Multi-Objective Arbitration for ARDS Ventilation
VentAgent:当大型语言模型学会呼吸——ARDS通气的多目标仲裁
- Authors: Teqi Hao, Yuxuan Fu, Xiaoyu Tan, Shaojie Shi, Bohao Lv, Yinghui Xu, Xihe Qiu
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.04632
- Pdf link: https://arxiv.org/pdf/2606.04632
- Abstract
Mechanical ventilation for Acute Respiratory Distress Syndrome (ARDS) requires balancing competing physiological goals, including oxygenation, lung protection, and acid-base homeostasis. However, current data-driven methods, especially those imitating retrospective Electronic Health Records (EHR), often suffer from imitation bias. They may capture superficial correlations from inconsistent clinical demonstrations, such as associating passive ventilator settings with survival because such settings are common in stable patients, and thus fail to generalize to volatile or out-of-distribution phenotypes. Standard Reinforcement Learning (RL) methods also struggle with the adversarial trade-offs of critical care and often produce opaque policies with limited clinical interpretability. To address these limitations, we introduce VentAgent, a hierarchical framework in which Large Language Models (LLMs) act as transparent arbitrators for mechanical ventilation. We reformulate ventilation control as a dynamic Multi-Objective Arbitration process rather than single-objective optimization. VentAgent decomposes decision-making into three interpretable stages: Perception, Planning, and Orchestration. By leveraging the semantic reasoning capabilities of LLMs, it synthesizes strategies from heterogeneous experts and resolves conflicting clinical priorities through an explicit coordination mechanism. Evaluations on a high-fidelity physiological simulator show that VentAgent outperforms state-of-the-art RL and classical control baselines. Moreover, it converts control decisions into human-readable reasoning chains, offering a safer, more interpretable, and adaptable paradigm for critical care automation.
- 中文摘要
急性呼吸窘迫综合征(ARDS)的机械通气需要平衡多种生理目标,包括供氧、肺部保护和酸碱稳态。然而,当前以数据为驱动的方法,尤其是模仿回顾性电子健康记录(EHR)的方法,常常存在模仿偏差。它们可能捕捉到临床表现不一致的表面相关性,例如将被动呼吸机设置与存活率联系起来,因为此类环境在稳定患者中较为常见,因此无法推广到波动性或分布外的表型。标准强化学习(RL)方法也难以应对重症护理的对抗权衡,常常产生不透明且临床可解释性有限的政策。为解决这些局限性,我们引入了VentAgent,一个层级框架,其中大型语言模型(LLMs)作为机械通气的透明仲裁者。我们将通风控制重新表述为动态的多目标仲裁过程,而非单一目标优化。VentAgent将决策分解为三个可解释阶段:感知、规划和编排。通过利用LLMs的语义推理能力,它综合了来自异质专家的策略,并通过显式协调机制解决了临床优先事项的冲突。高保真生理模拟器的评估显示,VentAgent优于最先进的强化学习和经典对照基线。此外,它将控制决策转化为人类可读的推理链,为重症护理自动化提供了更安全、更易解释和适应性的范式。
Explainably Safe Reinforcement Learning
可解释的安全强化学习
- Authors: Sabine Rieder, Stefan Pranger, Debraj Chakraborty, Jan Křetínský, Bettina Könighofer
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.04634
- Pdf link: https://arxiv.org/pdf/2606.04634
- Abstract
Trust in a decision-making system requires both safety guarantees and the ability to interpret and understand its behavior. This is particularly important for learned systems, whose decision-making processes are often highly opaque. Shielding is a prominent model-based technique for enforcing safety in reinforcement learning. However, because shields are automatically synthesized using rigorous formal methods, their decisions are often similarly difficult for humans to interpret. Recently, decision trees became customary to represent controllers and policies. However, since shields are inherently non-deterministic, their decision tree representations become too large to be explainable in practice. To address this challenge, we propose a novel approach for explainable safe RL that enhances trust by providing human-interpretable explanations of the shield's decisions. Our method represents the shielding policy as a hierarchy of decision trees, offering top-down, case-based explanations. At design time, we use a world model to analyze the safety risks of executing actions in given states. Based on this analysis, we construct both the shield and a high-level decision tree that classifies states into risk categories (safe, critical, dangerous, unsafe), explaining why a situation may be safety-critical. At runtime, we generate localized decision trees that explain which actions are allowed and why others are deemed unsafe. Our method facilitates explainability of the safety aspect in safe-by-shielding reinforcement learning, requires no additional information beyond what is already used for shielding, incurs minimal overhead, and integrates readily into existing shielded RL pipelines. In our experiments, we compute explanations using decision trees that are several orders of magnitude smaller than the original shield.
- 中文摘要
对决策系统的信任既需要安全保障,也需要能够解读和理解其行为。这对于通过学习得到的系统尤为重要,因为其决策过程往往高度不透明。屏蔽(shielding)是强化学习中一种用于强制保证安全性的重要的基于模型的技术。然而,由于屏蔽器(shield)是通过严格的形式化方法自动合成的,其决策对人类来说往往同样难以理解。近来,用决策树表示控制器和策略已成为惯例。然而,由于屏蔽器本质上是非确定性的,其决策树表示过于庞大,在实践中无法用于解释。为应对这一挑战,我们提出了一种新颖的可解释安全强化学习方法,通过为屏蔽器的决策提供人类可理解的解释来增强信任。我们的方法将屏蔽策略表示为决策树的层级结构,提供自上而下、基于案例的解释。在设计阶段,我们使用世界模型分析在给定状态下执行动作的安全风险。基于该分析,我们构建屏蔽器和一个高层决策树,后者将状态划分为风险类别(安全、临界、危险、不安全),解释为何某一情境可能是安全攸关的。在运行时,我们生成局部决策树,解释哪些动作被允许,以及为何其他动作被视为不安全。我们的方法提升了基于屏蔽的安全强化学习中安全方面的可解释性,除屏蔽本身已使用的信息外无需额外信息,开销极小,且易于集成到现有的屏蔽强化学习流水线中。在实验中,我们用比原始屏蔽器小几个数量级的决策树来计算解释。
CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation
CoRe-MoE:面向多地形人形机器人运动与步态适应的对比重加权专家混合
- Authors: Kailun Huang (1), Zikang Xie (1), Yanzhe Xie (1), Panpan Liao (3), Fanghai Zhang (1), Yanheng Mai (1), Wenhao Xu (2), Yunheng Wang (1), Renjing Xu (1), Haohui Huang (3) ((1) Hong Kong University of Science and Technology (Guangzhou), (2) South China Agricultural University, (3) Guangdong University of Technology)
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04718
- Pdf link: https://arxiv.org/pdf/2606.04718
- Abstract
Humans primarily rely on walking and running to traverse complex terrains, without resorting to unnecessarily complex motion patterns. Similarly, humanoid robots should achieve smooth transitions between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference and the distribution shift induced by terrain-dependent visual and dynamic variations. Although Mixture-of-Experts (MoE) architectures can alleviate multi-skill interference, naive joint training often fails to yield clear expert specialization, limiting their effectiveness. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced and trained with a contrastive objective to shape the gating network, enabling it to capture structured terrain representations and promote expert specialization. The final action is obtained via weighted fusion of the base gait policy and the terrain-aware branch, allowing the policy to preserve stable locomotion patterns while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains, while maintaining accurate foothold placement and dynamic stability under external disturbances.
- 中文摘要
人类主要依靠步行和奔跑来穿越复杂地形,而不会诉诸不必要的复杂运动模式。同样,人形机器人应实现行走与奔跑之间的平滑过渡,同时保持自然稳定的运动。然而,由于梯度干扰以及随地形变化的视觉和动力学差异所引起的分布偏移,在单一策略中统一步态转换和多地形适应仍具挑战性。尽管专家混合(MoE)架构可以缓解多技能干扰,但朴素的联合训练往往无法产生明确的专家分工,从而限制了其效果。为应对这些挑战,我们提出了CoRe-MoE,一种将步态生成与地形适应解耦的两阶段强化学习框架。在第一阶段,学习一个稳定的运动策略,以产生自然的行走和奔跑行为,并实现平滑过渡。在第二阶段,引入一个地形感知的MoE分支,并以对比目标进行训练来塑造门控网络,使其能够捕捉结构化的地形表示并促进专家分工。最终动作由基础步态策略与地形感知分支加权融合得到,使策略在适应复杂地形的同时保持稳定的运动模式。大量仿真结果表明,该方法在成功率、运动稳定性和多地形适应性方面优于基线方法。此外,在Unitree G1人形机器人上的零样本部署验证了我们框架的有效性,实现了在楼梯、斜坡、台阶、障碍物及非结构化户外地形上的稳健行走和奔跑,同时在外部扰动下保持准确的落脚点位置和动态稳定性。
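The final fusion step described in the CoRe-MoE abstract (a base gait action combined with a gated mixture of terrain experts) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the softmax gating, the fixed fusion weight `alpha`, and all function names are assumptions, and the contrastive training of the gate is not shown.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def fused_action(base_action, expert_actions, gate_logits, alpha=0.5):
    """Blend the base gait action with a gate-weighted mixture of expert actions.

    alpha is a hypothetical fusion weight; the abstract does not say how the
    two branches are weighted.
    """
    gates = softmax(gate_logits)
    dim = len(base_action)
    moe = [sum(g * a[i] for g, a in zip(gates, expert_actions)) for i in range(dim)]
    return [(1 - alpha) * b + alpha * m for b, m in zip(base_action, moe)]

# Two experts with equal gate logits: the mixture is their average.
action = fused_action([1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With `alpha=0` the output reduces to the base gait policy, which is one simple way to see how such a fusion could preserve the stage-one locomotion behavior.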
Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning
迹介导的峰值偏差:在深度强化学习中连接时间信用分配与认知启发式
- Authors: Viktor Veselý, Aleksandar Todorov, Erwan Escudie, Matthia Sabatelli
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04735
- Pdf link: https://arxiv.org/pdf/2606.04735
- Abstract
Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement learning (RL) termed Trace-Mediated Peak Bias (TMPB). At intermediate eligibility trace depths, agents irrationally prefer trajectories with high-magnitude reward "peaks" over alternatives with higher cumulative returns. This provides a mechanistic account of the Peak-End Rule: a human memory bias where experiences are judged by their most intense moments rather than integrated utility. We show that TMPB emerges because traces amplify distal Temporal Difference errors into "gradient shocks" that fixed-step-size Stochastic Gradient Descent cannot normalize, leading to global overestimation. Conversely, adaptive optimizers mitigate this pathology via second-moment normalization. Our results suggest that human-like saliency distortions may emerge naturally from the mathematical constraints of credit assignment in distributed systems, and that adaptive optimization is a theoretical necessity for rational value estimation.
- 中文摘要
时间信用分配对生物智能和人工智能都至关重要,但它与非线性函数逼近之间的相互作用尚未得到充分理解。我们在深度强化学习(RL)中识别出一种系统性失效模式,称为迹介导的峰值偏差(TMPB)。在中等资格迹深度下,智能体会非理性地偏好具有高幅度奖励“峰值”的轨迹,而非累计回报更高的替代方案。这为峰终定律提供了一种机制性解释:这是一种人类记忆偏差,即人们以最强烈的时刻而非综合效用来评判经历。我们表明,TMPB的出现是因为资格迹将远端的时间差分误差放大为“梯度冲击”,而固定步长的随机梯度下降无法对其归一化,导致全局高估。相反,自适应优化器通过二阶矩归一化来缓解这种病态。我们的结果表明,类人的显著性扭曲可能自然地源于分布式系统中信用分配的数学约束,而自适应优化是理性价值估计的理论必要条件。
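The contrast between fixed-step SGD and second-moment normalization can be illustrated with a toy scalar example. The sketch below is not the paper's experiment and involves no eligibility traces or value functions: it feeds one hypothetical gradient sequence (small gradients with a single large spike) to a plain SGD update and to a standard Adam update, then compares the largest single-step parameter change. The gradient values and learning rate are assumptions chosen for illustration.

```python
import math

def sgd_steps(grads, lr):
    """Per-step parameter changes under fixed-step-size SGD."""
    return [-lr * g for g in grads]

def adam_steps(grads, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Per-step parameter changes under Adam with bias-corrected moments."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        steps.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps

# Hypothetical gradient stream: a single "shock" 500x larger than the rest.
grads = [0.1] * 20 + [50.0] + [0.1] * 20
lr = 0.01
max_sgd = max(abs(s) for s in sgd_steps(grads, lr))
max_adam = max(abs(s) for s in adam_steps(grads, lr))
```

Under SGD the spike translates directly into a step 500 times larger than the others, whereas the Adam step stays on the order of the learning rate because the spike also inflates the second-moment estimate it is divided by.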
COP-Q: Safety-First Reinforcement Learning for Robot Control via Cholesky-Ordered Projection
COP-Q:通过乔莱斯基有序投影实现机器人控制的安全优先强化学习
- Authors: Guopeng Li, Moritz A. Zanger, Matthijs T. J. Spaan, Julian F. P. Kooij
- Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.04749
- Pdf link: https://arxiv.org/pdf/2606.04749
- Abstract
Safe robot control requires maximizing return while satisfying safety constraints. In off-policy safe reinforcement learning, reward and safety Q-values are commonly learned by separate critic ensembles, with uncertainty handled independently for each objective. This objective-wise treatment neglects inter-objective correlation and can lead to overly conservative value estimates, thereby reducing sample efficiency. To address this issue, we propose Cholesky-Ordered Projection Q-learning (COP-Q), a safety-first method that incorporates inter-objective covariance into vector-valued Q-value estimation. COP-Q constructs a generalized confidence bound in the joint Q-value space and uses Cholesky factorization to encode objective priority in a sequential form. This preserves conservatism on safety while adaptively reducing excessive conservatism on the reward objective. The resulting estimate is used in both temporal-difference target computation and actor optimization. COP-Q incurs minimal computational overhead and is readily compatible with most existing deep Q-learning frameworks. Experiments on robot locomotion in Brax and safe navigation in Safety-Gymnasium, covering both hard- and soft-safety settings, demonstrate that COP-Q achieves strong safety performance together with competitive or improved sample efficiency relative to representative baselines.
- 中文摘要
安全的机器人控制需要在满足安全约束的同时最大化回报。在离策略安全强化学习中,奖励和安全Q值通常由各自独立的critic集成来学习,不确定性则针对每个目标独立处理。这种逐目标的处理方式忽视了目标间的相关性,可能导致过于保守的价值估计,从而降低样本效率。为解决这一问题,我们提出了Cholesky有序投影Q学习(COP-Q),这是一种安全优先的方法,将目标间协方差纳入向量值Q值估计中。COP-Q在联合Q值空间中构造一个广义置信界,并利用Cholesky分解以顺序形式编码目标优先级。这既保持了安全目标上的保守性,又能自适应地减少奖励目标上的过度保守。所得估计值同时用于时间差分目标计算和actor优化。COP-Q的计算开销极小,且易于与大多数现有的深度Q学习框架兼容。在Brax中的机器人运动和Safety-Gymnasium中的安全导航上的实验涵盖了硬安全和软安全两种设置,结果表明COP-Q在实现强安全性能的同时,相对于代表性基线具有相当或更高的样本效率。
Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment
《爱之迷雾:在游戏环境中用亲和力强化学习工程化美德代理行为》
- Authors: Ajay Vishwanath, Christian Omlin
- Subjects: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.04750
- Pdf link: https://arxiv.org/pdf/2606.04750
- Abstract
Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully dependent on the reward function design. Thus far, this technique has been demonstrated to be effective in grid worlds and toy-problem environments with minimal state and action spaces. To expand this research to more sophisticated environments, we introduce a two-player multi-agent environment based on the role-playing board game known as Fog of Love. In this environment, two agents compete to fulfill their individual virtues, while also cooperating to satisfy their relationship. Given the multi-agent nature, this is a complex problem where multi-agent deep deterministic policy gradient agents neither compete nor cooperate successfully. We present evidence that localized affinities enhance agent performance in achieving both competitive and cooperative objectives, as reflected in superior overall scores in both domains. This not only results in virtuous choices but also clarifies an agent's teleology and makes its behavior interpretable at a human level.
- 中文摘要
在人工智能中灌输美德行为的兴趣日益增加。其中一种技术被称为亲和力强化学习,利用目标函数的策略正则化来激励美德行为,而不完全依赖奖励函数设计。到目前为止,该技术已被证明在网格世界和玩具问题环境中,状态和动作空间极少时有效。为了将研究扩展到更复杂的环境,我们引入了一个基于名为《Fog of Love》的角色扮演桌游的双人多代理环境。在这种环境中,两位代理人竞争以实现各自的优点,同时合作以满足他们的关系。鉴于多智能体的特性,这是一个复杂的问题,多智能体深度确定性策略梯度智能体既无法成功竞争也无法成功合作。我们提供了证据,表明局部亲和力提升了代理在实现竞争和合作目标方面的表现,这源于两者整体得分的优越。这不仅带来了美德的选择,还澄清了主体的目的论,使其行为变得可理解为人类层面。
AIP: A Graph Representation for Learning and Governing Agent Skills
AIP:用于学习和治理智能体技能的图表示
- Authors: Zachary Blumenfeld, Jim Webber
- Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.04781
- Pdf link: https://arxiv.org/pdf/2606.04781
- Abstract
Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench - a statistically significant gain (Wilcoxon signed-rank p = 0.011), winning 12 tasks to 2 with 13 ties - often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.
- 中文摘要
如今的智能体技能(Agent Skills)主要由自由形式的文字构成,要求智能体在每次会话中阅读、解读并重新推导如何行动。这带来了两个相互叠加的代价:在实现密集型任务上的可靠性降低,以及技能创建和改进的困难,因为编辑文字是一个脆弱的过程,人类和智能体都难以胜任,对于在模型训练中代表性不足的领域特定过程性知识尤其如此。智能体指令协议(AIP)通过将技能建模为有向执行图来同时解决这两个问题:离散步骤作为节点,由确定性脚本或自然语言描述支撑,通过显式的带类型输入/输出边连接,并由经过模式验证的YAML规范管理。一个编译器元技能将现有的人工编写技能转换为这种形式。好处有两方面。首先,将人工编写的技能编译为AIP后,Claude Sonnet在SkillsBench的27个真实智能体任务上的平均任务奖励从0.60提升到0.71,通过率从53%提升到67%,这是统计上显著的提升(Wilcoxon符号秩检验 p = 0.011),以12比2的任务数胜出,另有13个持平,且通常耗费更少的实际运行时间。该图向智能体提供经过审核的可运行单元,而不是要求其从自然语言中重新推导代码、命令和工具调用。其次,在创建和改进方面,由于每个技能都经过模式验证、可进行功能测试且可逐节点寻址,故障可以被精确诊断和修复。两个人工编写技能的失败被追溯到脚本层面。调整AIP规范并重新编译后,两者均得以恢复且没有出现回归(其中一项任务从0/5升至5/5),将技能改进转化为可度量的调优循环,而非文字重写。同样的图结构还支持语料库级别的治理和技能内省,并为在技能之上进行强化学习提供了自然的动作空间。
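The idea of a skill as a directed execution graph can be sketched in a few lines. The example below is only an illustration of the general pattern and does not follow the actual AIP schema, which the abstract does not give: the node names, the dictionary layout, and the `validate` check (a stand-in for schema validation) are all hypothetical. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) to run nodes in dependency order.

```python
from graphlib import TopologicalSorter

# Hypothetical skill: each node lists the nodes it takes input from and a
# deterministic step to run on their outputs.
skill = {
    "load":   {"inputs": [],               "run": lambda: [3, 1, 2]},
    "sort":   {"inputs": ["load"],         "run": lambda xs: sorted(xs)},
    "report": {"inputs": ["load", "sort"], "run": lambda raw, s: f"{len(raw)} items, min={s[0]}"},
}

def validate(graph):
    """Reject edges that point at undefined nodes."""
    for name, node in graph.items():
        for dep in node["inputs"]:
            if dep not in graph:
                raise ValueError(f"{name} depends on unknown node {dep}")

def execute(graph):
    """Run nodes in dependency order, passing each node its inputs' outputs."""
    validate(graph)
    deps = {name: set(node["inputs"]) for name, node in graph.items()}
    outputs = {}
    for name in TopologicalSorter(deps).static_order():
        node = graph[name]
        outputs[name] = node["run"](*(outputs[d] for d in node["inputs"]))
    return outputs

results = execute(skill)
```

Because every node's output is stored by name, a failing step can be inspected or replaced on its own, which is the kind of node-by-node addressability the abstract describes.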
Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees
具有概率近似安全保证的风险感知强化学习情景生成
- Authors: Mohit Prashant, Arvind Easwaran
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04812
- Pdf link: https://arxiv.org/pdf/2606.04812
- Abstract
Guaranteeing safety is critical to the deployment of reinforcement learning (RL) agents in the real world, especially as policies learned using deep RL may be susceptible to transition perturbations that result in unknown or unsafe behaviour. One method of policy verification is to construct probabilistic barrier-certificates by sampling policy trajectories with respect to safety constraints, thereby demarcating known safe behaviour from unknown behaviour. Obtaining tight upper and lower bounds on the probability of violation of these constraints may be difficult if the policy is susceptible to transition uncertainty or perturbation that places the agent in insufficiently explored states. To address this, we approximate the distribution of the encountered state-space using a variational autoencoder (VAE) and construct upper and lower-bound barrier-certificates using latent characteristics of states to optimize for regions of known, safe behaviour with high confidence. We frame this in our work as a dual optimization problem where the lower-bound barrier-certificate presents a more conservative estimate of the safe region than the upper-bound barrier-certificate. Sampling states that lie within the set difference of the two during training, i.e. the non-robust region, allows us to tighten the upper and lower bounds to provide sharper probabilistic guarantees on safety. Within our study, we describe the guarantees placed and demonstrate the tightness of our bounds experimentally.
- 中文摘要
保障安全对于强化学习(RL)智能体在现实世界中的部署至关重要,尤其是因为通过深度强化学习的策略可能显示出易受转移扰动影响,导致未知或不安全行为。政策验证的一种方法是通过抽样政策轨迹与安全约束来构建概率障碍证书,从而区分已知的安全行为与未知行为。如果策略易受转移不确定性或扰动影响,使智能体处于探索不足的状态,获得这些约束被违反概率的严格上下界可能很困难。为此,我们利用变分自编码器(VAE)近似对遇到的状态空间分布,并利用状态的潜在特性构造上界和下界障碍证书,以高置信度优化已知安全行为区域。我们在工作中将其框架为一个对偶优化问题,其中下界障碍证书比上界障碍证书更保守地估计安全区域。在训练过程中,采样状态落在两者的集合差范围内,即非稳健区域,使我们能够收紧上下界,从而提供更明确的概率安全性保证。在我们的研究中,我们描述了所设置的保证值,并通过实验展示了我们界限的紧密性。
Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents
边行动边学习:面向在线终身学习智能体的技能增强测试时协同进化框架
- Authors: Bo Mao, Jie Zhou, Yutao Yang, Xin Li, Xian Wei, Qin Chen, Xingjiao Wu, Liang He
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04815
- Pdf link: https://arxiv.org/pdf/2606.04815
- Abstract
Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing lifelong learning agents for long-horizon tasks typically depend on retrieving discrete skills or past experiences with static parameters during inference, which prevents them from continuously internalizing test-time feedback like human learners. To bridge this gap, we propose Skill-enhanced Test-Time Co-Evolution (LifeSkill), a two-stage reinforcement learning framework for Online Lifelong Learning Agents. Specifically, we design Verifier-Guided Skill Learning that addresses the lack of direct supervision for skill extraction by rewarding candidate skills according to the average verifier success of multiple skill-conditioned policy rollouts, encouraging the model to generate skills that are useful for solving tasks rather than merely plausible in text. Furthermore, we introduce Online Skill Internalization, which continuously improves the policy model during test-time interaction by transforming skill-conditioned trajectories into reward signals. This enables the agent to directly internalize reasoning capabilities into its parameters, avoiding the context bloat of experience retrieval. Experiments on LifelongAgentBench show that LifeSkill improves average performance by 7 absolute points compared with existing lifelong agent baselines.
- 中文摘要
终身学习对于在动态、交互环境中运行的大型语言模型(LLM)智能体至关重要。然而,现有面向长时程任务的终身学习智能体通常依赖于在推理时以静态参数检索离散技能或过往经验,这使它们无法像人类学习者那样持续内化测试时的反馈。为弥合这一差距,我们提出了技能增强的测试时协同进化(LifeSkill),这是一个面向在线终身学习智能体的两阶段强化学习框架。具体来说,我们设计了验证器引导的技能学习,根据多次以技能为条件的策略rollout的平均验证器成功率来奖励候选技能,从而解决技能提取缺乏直接监督的问题,鼓励模型生成对解决任务有用、而非仅在文本上看似合理的技能。此外,我们引入了在线技能内化,在测试时交互过程中通过将以技能为条件的轨迹转化为奖励信号,持续改进策略模型。这使得智能体能够直接将推理能力内化到参数中,避免了经验检索带来的上下文膨胀。在LifelongAgentBench上的实验表明,与现有终身学习智能体基线相比,LifeSkill将平均性能提升了7个绝对点。
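The skill reward described in the LifeSkill abstract (the average verifier success over several skill-conditioned rollouts) can be written down directly. The sketch below shows only that scoring rule. The rollout and verifier functions are deterministic toy stand-ins invented for illustration, where a real system would sample from a policy model, and none of the names come from the paper.

```python
def skill_reward(skill, task, rollout, verifier, n_rollouts=8):
    """Score a candidate skill by the mean verifier success over several rollouts."""
    outcomes = [verifier(task, rollout(task, skill, i)) for i in range(n_rollouts)]
    return sum(outcomes) / n_rollouts

# Toy stand-ins (hypothetical): the task is to add two numbers, and the
# rollout index i plays the role of sampling randomness.
def toy_rollout(task, skill, i):
    a, b = task
    if skill == "add the operands":
        return a + b if i % 4 != 3 else a + b + 1   # useful skill, occasional slip
    return a + b if i % 4 == 0 else a * b            # plausible text, rarely helps

def toy_verifier(task, answer):
    return answer == sum(task)

task = (3, 4)
useful = skill_reward("add the operands", task, toy_rollout, toy_verifier)
plausible = skill_reward("restate the question", task, toy_rollout, toy_verifier)
```

Averaging over rollouts rewards the skill for how often it leads to verified success, not for how reasonable its text looks, which is the distinction the abstract draws.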
M3imic: Learning a Versatile Whole-Body Controller for Multimodal Motion Mimicking
M3imic:学习一款多功能的全身控制器,用于多模态运动模拟
- Authors: Zuxing Lu, Ziang Zheng, Yao Lyu, Jingyu Liu, Feihong Zhang, Song Lu, Xin Yuan, Changyin Sun, Xingxing Zuo, Shengbo Eben Li
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.04829
- Pdf link: https://arxiv.org/pdf/2606.04829
- Abstract
Building a general-purpose whole-body controller is essential for enabling diverse motion capabilities in humanoid robots across a wide range of downstream tasks, including locomotion and loco-manipulation. Different tasks rely on distinct motion reference modalities: locomotion primarily depends on coordinated robot joint trajectories, whereas manipulation requires precise end-effector trajectory tracking. Existing methods often overlook the representational mismatch between dense robot joint angles and sparse end-effector poses. To address this, we propose Multi-Modal Mimic (M3imic), a versatile multi-modal whole-body control framework that unifies heterogeneous motion reference modalities, including robot joint angles, human pose trajectories, and end-effector poses, using modality-specific encoders to map them into a shared latent space. Leveraging large-scale reinforcement learning in the simulator, we train a single policy that achieves sim-to-real transfer across multiple motion reference modalities without modality-specific retraining. Extensive simulation and real-world experiments on the Unitree G1 robot are conducted to evaluate the proposed framework. In simulation, the policy achieves a peak success rate of 98.42% on an unseen test dataset, demonstrating its exceptional generalization capability. The code is available at this https URL
- 中文摘要
构建通用的全身控制器,对于使人形机器人在包括移动和移动操作(loco-manipulation)在内的各类下游任务中具备多样化运动能力至关重要。不同任务依赖于不同的运动参考模态:移动主要依赖协调的机器人关节轨迹,而操作则需要精确的末端执行器轨迹跟踪。现有方法常常忽视稠密的机器人关节角度与稀疏的末端执行器位姿之间的表征不匹配。为此,我们提出了多模态模仿(M3imic),这是一种多功能的多模态全身控制框架,它使用模态专用编码器将机器人关节角度、人体姿态轨迹和末端执行器位姿等异构运动参考模态映射到共享的潜在空间,从而将它们统一起来。借助仿真器中的大规模强化学习,我们训练了一个单一策略,无需针对特定模态重新训练即可在多种运动参考模态上实现仿真到现实的迁移。我们在Unitree G1机器人上进行了大量仿真和真实世界实验来评估所提出的框架。在仿真中,该策略在未见过的测试数据集上达到98.42%的峰值成功率,展示了其出色的泛化能力。代码可在此 https URL 获取
MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU
MusaCoder:基于Moore线程GPU的全栈训练原生GPU内核生成
- Authors: Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.04847
- Pdf link: https://arxiv.org/pdf/2606.04847
- Abstract
Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.
- 中文摘要
原生GPU内核生成将高级张量程序转化为可执行且高效的低级代码。现有的大型语言模型(LLM)在这一任务上遇到困难,而基于执行的强化学习则因奖励稀疏、奖励黑客和训练不稳定而受挫。我们介绍MusaCoder,一个用于CUDA和MUSA后端原生GPU内核生成的全栈训练框架。MusaCoder 结合了渐进式内核数据综合、保持多样性的拒绝微调以及通过 MooreEval 实现的执行反馈强化学习(RL),这是一个分布式验证器和奖励环境。为稳定强化学习,MusaCoder 引入了 PrimeEcho 用于首回合锚定多回合奖励,Buffered Dynamic Retry 用于从所有失败的硬采样中恢复信号,以及 MirrorPop 用于非策略序列过滤。在KernelBench和MUSA移植版本上的实验显示,MusaCoder在正确性和实证加速方面均优于强力的开源和专有基线,9B模型与前沿闭源模型匹敌甚至超越,27B模型则奠定了新的技术水平。这些结果不仅证明了全栈执行反馈训练对原生内核生成的有效性,也证明了Moore Threads GPU支持完整LLM后训练栈的能力,为新兴加速器上的大型模型训练和优化提供了实用基础。
Learning Empirically Admissible Neural Heuristics for Combinatorial Search
学习用于组合搜索的经验可采纳神经启发式
- Authors: Siddharth Sahay
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2606.04860
- Pdf link: https://arxiv.org/pdf/2606.04860
- Abstract
Finding optimal solution paths for combinatorial puzzles like the Rubik's Cube, sliding tile puzzles, and Lights Out remains a classical challenge in artificial intelligence. Heuristic search algorithms, such as A*, guarantee path optimality only when using an admissible heuristic, that is, one that never overestimates the true remaining cost-to-go. Deep reinforcement learning (RL) methods like DeepCubeA train deep neural networks to approximate cost-to-go heuristics. However, standard mean-squared error (MSE) training regularly yields overestimations, violating admissibility and compromising solution optimality. In this paper, we introduce a generalizable framework for learning validation-calibrated admissible neural heuristics. We train a value network using an underestimating Admissible Bellman Operator combined with an Asymmetric Loss function to penalize overestimation. To account for residual neural function approximation errors, we propose a post-hoc calibration safety offset computed over validation scrambles. We demonstrate that our calibrated neural heuristics achieve no observed admissibility violations under the evaluation protocol and preserve path optimality in practice while reducing search node expansions by up to 83.0% on a 2 by 2 Rubik's Cube, 19.9% on a 3 by 3 Lights Out grid, and 1.9% on an 8-Puzzle compared to standard analytical baselines.
- 中文摘要
为魔方、滑块拼图和“熄灯”(Lights Out)等组合谜题寻找最优解路径,仍是人工智能领域的经典挑战。A*等启发式搜索算法仅在使用可采纳启发式(即从不高估真实剩余代价的启发式)时才能保证路径最优。DeepCubeA等深度强化学习(RL)方法训练深度神经网络来近似剩余代价启发式。然而,标准的均方误差(MSE)训练经常产生高估,违反可采纳性并损害解的最优性。本文提出了一个可推广的框架,用于学习经验证集校准的可采纳神经启发式。我们使用一个偏向低估的可采纳贝尔曼算子,并结合惩罚高估的非对称损失函数来训练价值网络。为了处理残余的神经函数逼近误差,我们提出了一种在验证打乱状态上计算的事后校准安全偏移量。我们证明,经过校准的神经启发式在评估协议下未出现任何可观察到的可采纳性违规,并在实践中保持路径最优性,同时与标准解析基线相比,在2×2魔方上将搜索节点扩展数最多减少83.0%,在3×3“熄灯”网格上减少19.9%,在8数码问题上减少1.9%。
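Two of the ingredients named in this abstract, an asymmetric loss that penalizes overestimation and a post-hoc safety offset computed on validation states, can be sketched as follows. The abstract does not give the exact forms, so this is a minimal illustration under stated assumptions: the loss is a weighted squared error with a hypothetical `over_weight`, the offset is taken as the largest overestimation seen on the validation set, and the validation numbers are made up.

```python
def asymmetric_loss(pred, target, over_weight=10.0):
    """Squared error that weights overestimation (pred > target) more heavily."""
    err = pred - target
    return (over_weight if err > 0 else 1.0) * err * err

def safety_offset(preds, true_costs):
    """Largest overestimation observed on validation states (0 if none)."""
    return max(0.0, max(p - c for p, c in zip(preds, true_costs)))

def calibrated(pred, offset):
    """Shift the heuristic down by the offset, never below zero."""
    return max(0.0, pred - offset)

# Hypothetical validation predictions and true costs-to-go.
val_preds = [2.4, 5.1, 7.9, 3.0]
val_costs = [2.0, 5.0, 8.0, 3.0]
offset = safety_offset(val_preds, val_costs)
h = [calibrated(p, offset) for p in val_preds]
```

After the shift, no calibrated value exceeds its true cost on the validation set. As the abstract's wording ("no observed admissibility violations") suggests, this is an empirical property of the states checked, not a proof of admissibility everywhere.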
GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards
GRAIL:面向带可验证奖励的强化学习的梯度重加权优势
- Authors: Tej Deep Pala, Vernon Toh, Soujanya Poria
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.04889
- Pdf link: https://arxiv.org/pdf/2606.04889
- Abstract
Reinforcement learning with verifiable rewards (e.g. GRPO) is now a common way to improve mathematical reasoning in Large Language Models (LLMs). However, current methods usually broadcast one sequence-level advantage to all tokens, or use costly process reward models (PRMs) for step-level supervision. Uniform advantage distribution assumes that all tokens contribute equally to the final reward. This dilutes the gradient signal, since flawed reasoning steps and filler words are updated as strongly as valid logical inferences. To address this, we introduce Gradient-Reweighted Advantage (GRAIL), an intrinsic token-wise advantage reweighting method. GRAIL uses gradient-activation saliency to place more weight on tokens that are more locally sensitive to the final answer. Evaluations across five models from the Qwen3, R1-distilled and OctoThinker families show that GRAIL consistently outperforms GRPO. GRAIL achieved an average improvement of 3.60% in accuracy and 3.05% in Pass@3, demonstrating that fine-grained reasoning alignment can be achieved without process-level supervision.
- 中文摘要
带可验证奖励的强化学习(例如GRPO)现已成为提升大型语言模型(LLM)数学推理能力的常用方式。然而,当前方法通常将一个序列级优势广播给所有词元(token),或使用代价高昂的过程奖励模型(PRM)进行步骤级监督。均匀的优势分配假设所有词元对最终奖励的贡献相同。这会稀释梯度信号,因为有缺陷的推理步骤和填充词会与有效的逻辑推断得到同样强度的更新。为此,我们提出了梯度重加权优势(GRAIL),这是一种内在的逐词元优势重加权方法。GRAIL利用梯度-激活显著性,为对最终答案局部更敏感的词元赋予更大的权重。在Qwen3、R1蒸馏和OctoThinker系列的五个模型上的评估表明,GRAIL始终优于GRPO。GRAIL在准确率上平均提升3.60%,在Pass@3上平均提升3.05%,表明无需过程级监督即可实现细粒度的推理对齐。
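The difference between broadcasting one sequence-level advantage and reweighting it per token can be shown with a small sketch. How GRAIL computes gradient-activation saliency and how it normalizes the weights is not stated in the abstract, so the saliency scores below are hypothetical inputs and the mean-one normalization is an assumption chosen so that the total advantage over the sequence is unchanged.

```python
def reweight_advantage(seq_advantage, saliency):
    """Spread one sequence-level advantage over tokens in proportion to saliency.

    Weights are normalized to have mean 1, so the summed advantage equals
    what uniform broadcasting would give.
    """
    n = len(saliency)
    total = sum(saliency)
    if total == 0:
        return [seq_advantage] * n   # fall back to uniform broadcasting
    return [seq_advantage * n * s / total for s in saliency]

saliency = [0.1, 0.1, 0.6, 0.2]      # hypothetical per-token saliency scores
uniform = [1.5] * len(saliency)      # GRPO-style broadcast of one advantage
token_adv = reweight_advantage(1.5, saliency)
```

Here the third token, with the highest saliency, receives most of the update signal, while low-saliency filler tokens receive less than under uniform broadcasting.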
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning
在基于评分标准的强化学习中,重现、分析和检测奖励黑客行为
- Authors: Xuekang Wang, Zhuoyuan Hao, Shuo Hou, Hao Peng, Juanzi Li, Xiaozhi Wang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.04923
- Pdf link: https://arxiv.org/pdf/2606.04923
- Abstract
Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at this https URL.
- 中文摘要
基于评分标准的强化学习(RL)使用作为评判的LLM(LaaJ)根据评分标准对模型输出进行评分。然而,政策模型可能利用评委潜在的偏见,导致奖励黑客行为以及无效或不安全的培训结果。在基于评分标准的现实学习中,这种黑客行为往往很微妙,且与多重评判偏差纠缠在一起,难以分析、检测和缓解。本文介绍了CHERRL,一个可控的基于评分标准的强化学习黑客环境。通过将已知偏差注入LaaJ,CHERRL实现了奖励黑客的稳定复现,明确观察奖励发散,并精确识别黑客起始。这为研究基于评分标准的强化学习中奖励黑客的机制和缓解措施提供了清晰的实验测试平台。为了展示其实用性,我们从可发现性和可利用性角度分析不同的评判偏差,并探索一种基于智能体的系统,从训练日志中自动检测奖励黑客的启动。代码和环境在此 https URL 公开。
Sequential Data Poisoning in LLM Post-Training
LLM后训练中的序贯数据投毒
- Authors: Jack Sanderson, Yihan Wang, Xiaoqian Lu, Gautam Kamath, Yiwei Lu
- Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
- Arxiv link: https://arxiv.org/abs/2606.04929
- Pdf link: https://arxiv.org/pdf/2606.04929
- Abstract
LLM post-training proceeds through multiple stages, e.g., supervised fine-tuning (SFT) followed by reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO), where each stage draws data from different, potentially untrusted sources. Existing literature assumes data poisoning attacks may occur at each training stage, but neglects the possibility of multiple attackers. To study the trustworthiness of the entire post-training pipeline, we propose the threat model of sequential data poisoning, where multiple adversaries separately poison the SFT and preference datasets. Under this threat model, we identify the single-attacker illusion: each adversary, evaluated in isolation, appears to pose a negligible threat. Yet when adversaries collaborate across stages, the true vulnerability is revealed. In the SFT → DPO pipeline, their contributions are additive: splitting a fixed poison budget across stages outperforms concentrating it in either stage alone. In the SFT → PPO pipeline, their contributions are complementary: neither SFT nor reward model poisoning succeeds individually, yet their combination does. These findings show that security analyses of individual post-training stages systematically underestimate compound vulnerabilities that emerge only from their interaction. Code is available at this https URL.
- 中文摘要
LLM的后训练分为多个阶段,例如先进行监督微调(SFT),再进行基于人类反馈的强化学习(RLHF)或直接偏好优化(DPO),每个阶段从不同且可能不可信的来源获取数据。现有文献假设数据投毒攻击可能发生在每个训练阶段,但忽视了存在多个攻击者的可能性。为了研究整个后训练流程的可信度,我们提出了序贯数据投毒的威胁模型,即多个攻击者分别对SFT数据集和偏好数据集投毒。在这一威胁模型下,我们发现了单一攻击者错觉:每个攻击者在被单独评估时,构成的威胁似乎微不足道。然而,当攻击者跨阶段协作时,真正的脆弱性便会显现。在SFT → DPO流程中,它们的贡献是可加的:将固定的投毒预算分散到各个阶段,比集中在任一单独阶段更有效。在SFT → PPO流程中,它们的贡献是互补的:SFT投毒和奖励模型投毒单独都无法成功,但二者结合却能成功。这些发现表明,针对单个后训练阶段的安全分析会系统性地低估仅在阶段间相互作用中才出现的复合脆弱性。代码可在此 https URL 获取。
Potential-Guided Flow Matching for Vision-Language-Action Policy Improvement
用于视觉-语言-动作策略改进的势引导流匹配
- Authors: Yunpeng Mei, Jiakai He, Hongjie Cao, Chenyu Wang, Xiaowen Zhu, Yihan Zhou, Jiamin Wang, Chenbo Xin, Peng Cheng, Yuxuan Yang, Yijie Wang, Xinhu Zheng, Gao Huang, Jie Chen, Gang Wang
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.04968
- Pdf link: https://arxiv.org/pdf/2606.04968
- Abstract
Large vision-language-action (VLA) policies are increasingly trained as conditional generative models over action chunks. Yet deployment produces mixed-quality experience (successful demonstrations, partial completions, recoverable mistakes, and failures) that is difficult to use with standard imitation. Full behavior cloning (BC) imitates failures, filtered BC discards useful sub-trajectories, and offline reinforcement learning adds a large critic. We introduce ForesightFlow, a self-guided flow-matching policy that augments each generated action chunk with a learned success-potential trajectory. The same flow proposes and scores candidate actions, enabling best-of-$K$ inference without an external critic. The key issue is that policy improvement and value calibration require different supervision: advantage weighting should emphasize high-quality actions, but applying the same weights to potential coordinates suppresses failure gradients and creates overconfident scores. We address this with decoupled advantage-weighted flow matching, applying exponentiated advantage weights only to action velocities while training potential velocities uniformly. We further derive a one-step boundary estimator for conditional flow matching, allowing advantage computation with a single stop-gradient forward pass. Across five BEHAVIOR-1K simulation tasks and five real-world bimanual tasks, ForesightFlow improves over imitation baselines, matches the strongest separate-critic baseline in simulation success, improves real-world success, and reduces training compute by $38\%$. Ablations show that decoupling prevents value hallucination, the one-step estimator preserves candidate-ranking fidelity, and self-guided sampling improves long-horizon execution.
- 中文摘要
大型视觉-语言-动作(VLA)策略越来越多地被训练为基于动作块的条件生成模型。然而,部署会产生质量参差不齐的经验(成功演示、部分完成、可恢复的错误和失败),这些经验难以用标准模仿学习加以利用。完整行为克隆(BC)会模仿失败,过滤式BC会丢弃有用的子轨迹,离线强化学习则需要额外引入一个大型评论家(critic)。我们提出了ForesightFlow,一种自引导的流匹配策略,为每个生成的动作块附加一条学习到的成功势(success-potential)轨迹。同一个流既提出候选动作又为其打分,从而无需外部评论家即可实现best-of-$K$推理。关键问题在于策略改进和价值校准需要不同的监督:优势加权应强调高质量的动作,但对势坐标施加相同的权重会抑制失败样本的梯度并造成过度自信的评分。我们通过解耦的优势加权流匹配来解决这个问题,仅对动作速度施加指数化的优势权重,同时对势速度进行均匀训练。我们还进一步推导出一个用于条件流匹配的一步边界估计器,使优势计算只需一次停止梯度的前向传播。在五个BEHAVIOR-1K仿真任务和五个真实世界双臂任务中,ForesightFlow优于模仿基线,在仿真成功率上与最强的独立评论家基线持平,提升了真实世界成功率,并将训练计算量减少了$38\%$。消融实验表明,解耦可防止价值幻觉,一步估计器保持了候选排序的保真度,自引导采样则改善了长时程执行。
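The decoupling described in the ForesightFlow abstract (exponentiated advantage weights on action velocities only, uniform training on potential velocities) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `decoupled_awfm_loss`, the `temperature` and `max_weight` parameters, and the squared-error form of the flow-matching regression are all assumptions made for illustration.

```python
import math

def decoupled_awfm_loss(pred_action_vel, target_action_vel,
                        pred_potential_vel, target_potential_vel,
                        advantages, temperature=1.0, max_weight=20.0):
    """Return (action_loss, potential_loss) averaged over the batch.

    Hypothetical sketch: each argument is a list with one entry per sample;
    velocity entries are lists of floats, advantages are floats.
    """
    n = len(advantages)
    action_loss = 0.0
    potential_loss = 0.0
    for i in range(n):
        # Exponentiated advantage weight, clipped for stability (assumed detail).
        w = min(math.exp(advantages[i] / temperature), max_weight)
        a_err = sum((p - t) ** 2
                    for p, t in zip(pred_action_vel[i], target_action_vel[i]))
        p_err = sum((p - t) ** 2
                    for p, t in zip(pred_potential_vel[i], target_potential_vel[i]))
        action_loss += w * a_err   # weighted: emphasizes high-advantage actions
        potential_loss += p_err    # unweighted: failures keep their full gradient
    return action_loss / n, potential_loss / n
```

Because the potential term never sees the advantage weights, low-advantage (failed) samples contribute to it as much as successful ones, which is the property the abstract credits with avoiding overconfident scores.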
GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation
GARL:多智能体战略优先级中的博弈论强化学习
- Authors: Yuxiao Ye, Yiwen Zhang, Huiyuan Xie, Yuqin Huang, Zhiyuan Liu
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.05002
- Pdf link: https://arxiv.org/pdf/2606.05002
- Abstract
LLM-based multi-agent systems are increasingly used for strategic decision-making tasks. In such settings, performance depends not only on individual model capabilities, but also on the policies by which agents interact and adapt. Multi-agent reinforcement learning can optimise these interaction policies, but its reward design often remains task-specific and weakly grounded in interaction structure. To address this gap, we propose GARL, a GAme-theoretic Reinforcement Learning framework for multi-agent strategic prioritisation. GARL formalises strategic prioritisation as a two-stage game: competing agents first allocate strategic resources over a shared candidate set, and a higher-level arbiter then produces the final ranking. The resulting game-theoretic utilities are converted into role-specific reinforcement signals, allowing policy optimisation to be guided by structured interaction. We instantiate GARL on issues-in-dispute ranking, where the goal is to prioritise core issues in legal proceedings. Experiments show that GARL improves ranking performance, enables small open-source LLMs to become competitive with a strong closed-source LLM under the same candidate-ranking setting, and yields gains in legal-domain competence and broader strategic decision-making. Overall, GARL demonstrates how game-theoretic interaction structure can be turned into reinforcement-learning objectives, providing a principled approach to policy optimisation in multi-agent strategic prioritisation.
- 中文摘要
基于LLM的多智能体系统越来越多地被用于战略决策任务。在这种环境下,性能不仅取决于单个模型的能力,还取决于代理之间交互和适应的策略。多智能体强化学习可以优化这些交互策略,但其奖励设计往往仍是任务特定且缺乏交互结构基础。为弥补这一空白,我们提出了GARL,一种基于GAme理论的强化学习框架,用于多智能体战略优先级排序。GARL将战略优先级排序形式化为两阶段博弈:竞争的代理首先在共享候选集上分配战略资源,然后由更高级别的仲裁者产生最终排名。由此产生的博弈论效用被转化为特定角色的强化信号,使策略优化能够由结构化交互引导。我们将GARL应用于争议议题排名,目标是优先处理法律程序中的核心问题。实验显示,GARL提升了排名表现,使小型开源大型语言模型在相同候选人排名设置下能够与强大的闭源LLM竞争,并在法律领域能力和更广泛的战略决策上获得提升。总体而言,GARL展示了如何将博弈论交互结构转化为强化学习目标,为多智能体战略优先级制定中的策略优化提供了有原则的方法。
Generalization of World Models under Environmental Variability for Vision-based Quadrotor Navigation
基于视觉的四旋翼导航环境变异性下世界模型的推广
- Authors: Luca Zanatta, Grzegorz Malczyk, Kostas Alexis
- Subjects: Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2606.05015
- Pdf link: https://arxiv.org/pdf/2606.05015
- Abstract
World models, learned generative models that predict how an environment evolves, have become a promising tool for sample-efficient robot learning. Yet how robust they are to environmental variability remains poorly understood. To address this, we conduct a systematic study using vision-based quadrotor navigation as a testbed problem, training DreamerV3-based world models under varying levels of environmental randomness and evaluating them across all levels through cross-environment validation, spanning both Self-Supervised Learning (SSL) pretraining and Reinforcement Learning (RL) fine-tuning. We then deploy all world models and associated navigation policies on a real quadrotor in unseen environments, including an open-loop run where the model receives just 2.5s of real sensory input before all sensors are cut off, leaving the system to navigate entirely in imagination over a 12m traverse. Our results show that world model robustness during SSL pretraining is a strong predictor of sim-to-real transfer: every model that generalized well in cross-environment SSL validation deployed successfully in the real world, passing through gaps as narrow as 0.67m, whereas the model that dominated simulation policy evaluation failed on the real platform. We further identify (a) the discrete latent size and (b) the training-sequence length as the dominant factors governing world model quality.
- 中文摘要
世界模型,即预测环境如何演变的学习型生成模型,已成为样本高效的机器人学习的一种有前景的工具。然而,它们对环境变异性的鲁棒程度仍不十分明了。为此,我们开展了一项系统性研究,以基于视觉的四旋翼导航作为测试问题,在不同程度的环境随机性下训练基于DreamerV3的世界模型,并通过跨环境验证在所有随机性水平上对其进行评估,涵盖自监督学习(SSL)预训练和强化学习(RL)微调。然后,我们将所有世界模型及相关导航策略部署到真实四旋翼上,在未见过的环境中运行,其中包括一次开环运行:模型仅接收2.5秒的真实传感输入,随后所有传感器被切断,系统只能完全凭想象完成12米的穿越。我们的结果表明,世界模型在SSL预训练期间的鲁棒性是仿真到现实迁移的强预测因子:所有在跨环境SSL验证中泛化良好的模型都在现实世界中成功部署,穿过了窄至0.67米的缝隙,而在仿真策略评估中表现最好的模型却在真实平台上失败了。我们还进一步确定(a)离散潜变量规模和(b)训练序列长度是决定世界模型质量的主要因素。
Enhancing the MADDPG Algorithm for Multi-Agent Learning via Action Inference and Importance Sampling
通过动作推断和重要性抽样增强多智能体学习的MADDPG算法
- Authors: Marc Walden, Jason Liu, Shaashwath Sivakumar, Ryan Liu, Hamza Khan
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2606.05021
- Pdf link: https://arxiv.org/pdf/2606.05021
- Abstract
We investigate multi-agent deep reinforcement learning and propose two enhancements to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm. First, we introduce a novel Action Inference mechanism that enables each agent to predict other agents' intended actions, thereby improving the accuracy and stability of its own policy. Second, we apply an importance sampling strategy, using a geometric distribution, in the replay buffer to prioritize more recent and informative experiences, which helps mitigate the non-stationarity inherent in multi-agent environments. We evaluate both modifications on the discrete-action Predator-Prey task provided by the PettingZoo library, a flexible Python interface for general multi-agent reinforcement learning benchmarks. Our results indicate that Action Inference is effective in improving learning stability and inter-agent cooperation, and that importance sampling using a geometric distribution can lead to significant improvements in exploration efficiency over standard MADDPG. Code is available at this https URL.
- 中文摘要
我们研究多智能体深度强化学习,并提出了对多智能体深度确定性策略梯度(MADDPG)算法的两项改进方案。首先,我们引入了一种新的动作推断机制,使每个智能体能够预测其他智能体的预期动作,从而提高自身策略的准确性和稳定性。其次,我们在重放缓冲区中采用了基于几何分布的重要性采样策略,优先考虑较新且信息丰富的经验,这有助于缓解多智能体环境中固有的非平稳性。我们在PettingZoo库提供的离散动作捕食者-猎物任务上评估了这两项修改,该库是一个灵活的Python接口,用于通用多智能体强化学习基准测试。我们的结果表明,动作推断能有效改善学习稳定性和智能体间协作,且与标准MADDPG相比,基于几何分布的重要性采样能显著提升探索效率。代码可在此 https URL 获取。
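One plausible reading of the geometric replay-sampling idea in the MADDPG abstract above is to draw the "age" of each sampled transition from a geometric distribution, so that newer transitions are exponentially more likely to be picked. The sketch below assumes exactly that; the function name `sample_recent_indices`, the parameter `p`, and the truncation-by-rejection step are illustrative choices, not details confirmed by the abstract.

```python
import math
import random

def sample_recent_indices(buffer_size, batch_size, p=0.01, rng=random):
    """Sample replay-buffer indices, where index buffer_size - 1 is the newest.

    Hypothetical sketch: the age k >= 0 of a sampled transition follows
    Geometric(p), i.e. P(age = k) = p * (1 - p) ** k, truncated to the buffer.
    """
    if buffer_size <= 0:
        raise ValueError("buffer must contain at least one transition")
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    indices = []
    while len(indices) < batch_size:
        # Inverse-CDF draw of a geometric age from a uniform variate in [0, 1).
        u = rng.random()
        age = int(math.log(1.0 - u) / math.log(1.0 - p))
        if age < buffer_size:  # reject ages that fall outside the buffer
            indices.append(buffer_size - 1 - age)
    return indices
```

A larger `p` concentrates sampling on the most recent transitions, while `p` close to 0 approaches uniform sampling over the buffer, so `p` acts as the knob trading recency against coverage.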
Arithmetic Pedagogy for Language Models
语言模型的算术教学法
- Authors: Andhika Bernard Lumbantobing, Hokky Situngkir
- Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
- Arxiv link: https://arxiv.org/abs/2606.05106
- Pdf link: https://arxiv.org/pdf/2606.05106
- Abstract
We investigate whether methods of human mathematics pedagogy can guide the training of language models toward arithmetic reasoning. Building on the GASING method -- an Indonesian pedagogy that solves basic arithmetic through a left-to-right procedure aligned with the causal order of token generation -- we operationalize each operation as a computational procedure whose execution trace is serialized into natural-language Chain-of-Thought (CoT) supervision. A small GPT-2 decoder (86M parameters) with a syllabic-agglutinative TOBA tokenizer for Indonesian is trained from scratch on this data using only a next-token prediction objective, without reinforcement learning or reward-based optimization. Monitoring training reveals three distinct learning phases, and mechanistic analyses -- attention-masking interventions on the CoT information graph, residual-stream probing, and logit-lens inspection -- show that the model first internalizes a procedural pathway and subsequently develops an associative, "mental-arithmetic" capacity that retrieves intermediate results without explicit step-by-step computation. The trained model reaches over 80% accuracy on held-out problems and attains competitive performance against substantially larger language models, indicating that targeted, pedagogically grounded training can yield strong and economical arithmetic capability at small scale.
- 中文摘要
我们研究人类数学教学法是否能引导语言模型的训练,使其具备算术推理能力。基于GASING方法(印尼的一种教学法,通过与词元生成的因果顺序一致的从左到右过程求解基础算术),我们将每种运算实现为一个计算过程,其执行轨迹被序列化为自然语言的思维链(Chain-of-Thought,CoT)监督。一个配备面向印尼语的音节-黏着式TOBA分词器的小型GPT-2解码器(86M参数)仅以下一词元预测为目标,在这些数据上从零训练,未使用强化学习或基于奖励的优化。对训练过程的监测揭示了三个不同的学习阶段;机制性分析(对CoT信息图的注意力掩码干预、残差流探测和logit lens检查)表明,模型首先内化了一条过程性路径,随后发展出一种联想式的“心算”能力,能够在不进行显式逐步计算的情况下检索中间结果。训练好的模型在留出问题上的准确率超过80%,并取得了可与规模大得多的语言模型相竞争的性能,表明有针对性、以教学法为基础的训练可以在小规模下带来强大且经济的算术能力。
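To make the idea of a left-to-right procedure with a serialized execution trace concrete, here is a generic sketch of most-significant-digit-first addition. It is not the GASING procedure itself, whose exact steps the abstract does not spell out; the function name `add_left_to_right` and the trace wording are assumptions chosen for illustration.

```python
def add_left_to_right(a, b):
    """Add two non-negative integers, most significant digit first.

    Returns (total, trace), where trace holds one natural-language line
    per digit column, in the order the columns are processed.
    """
    xs, ys = str(a), str(b)
    width = max(len(xs), len(ys))
    xs, ys = xs.zfill(width), ys.zfill(width)
    digits = []  # running result, most significant digit first
    trace = []
    for x, y in zip(xs, ys):
        s = int(x) + int(y)
        digits.append(s % 10)
        if s >= 10:
            # Carry into digits already written, rippling left through any 9s.
            i = len(digits) - 2
            while i >= 0 and digits[i] == 9:
                digits[i] = 0
                i -= 1
            if i >= 0:
                digits[i] += 1
            else:
                digits.insert(0, 1)
        running = "".join(map(str, digits))
        trace.append(f"{x} + {y} = {s}; running result {running}")
    return int("".join(map(str, digits))), trace
```

For 478 + 256, the running result moves through 6, 72, and 734, so every emitted line depends only on digits already read, matching the left-to-right order in which a causal language model generates tokens.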
Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data
自我评估已经存在:在基础大型语言模型中以极少数据诱导潜在判定校准
- Authors: XiuYu Zhang, Yi Shan, Junfeng Fang, Zhenkai Liang
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.05122
- Pdf link: https://arxiv.org/pdf/2606.05122
- Abstract
Large language models are increasingly evaluated by other models, raising a natural question: can a model predict how a judge will score its own output? We find that the ability is largely present before any targeted training: prompted few-shot, a base model already predicts an external judge's multi-attribute quality scores on open-ended responses well above chance across three benchmarks. We introduce Self-Evaluation Elicitation (SEE), a method that surfaces this latent ability through a short cycle comprising a calibration-coupled reinforcement learning phase that improves the answer and predicts the judge, followed by a masked distillation phase that sharpens the prediction while leaving the answer untouched. From 160 unique examples, roughly 31x fewer than a reinforcement learning baseline, SEE improves held-out calibration across three benchmarks while preserving answer quality. The elicited self-evaluation is sharply localized within the model's own token distribution and stable across judges it was never trained against, indicating a transferable notion of quality rather than a single judge's preference. These results reframe judge-aligned self-evaluation as a problem of elicitation rather than acquisition.
- 中文摘要
大型语言模型越来越多地被其他模型评估,这自然引发了一个问题:模型能否预测评委会如何给它自己的输出评分?我们发现,这种能力在任何有针对性的训练之前就已基本存在:在少样本提示下,基础模型已经能预测外部评委对开放式回答的多属性质量评分,在三个基准测试中远高于随机水平。我们提出了自我评估诱导(SEE),这种方法通过一个短周期激发这一潜在能力:先是校准耦合的强化学习阶段,在改进答案的同时预测评委评分;随后是掩码蒸馏阶段,在保持答案不变的情况下使预测更精确。SEE仅使用160个不重复样本(约为强化学习基线的1/31),就在三个基准测试上提升了留出集上的校准,同时保持了答案质量。被诱导出的自我评估高度集中在模型自身的词元分布之内,并且在从未针对其训练过的评委之间保持稳定,这表明它体现的是一种可迁移的质量概念,而非单一评委的偏好。这些结果将与评委对齐的自我评估重新定位为一个诱导问题,而非习得问题。
Reinforcement Learning from Rich Feedback with Distributional DAgger
利用分布式DAgger进行丰富反馈的强化学习
- Authors: Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2606.05152
- Pdf link: https://arxiv.org/pdf/2606.05152
- Abstract
Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a blackbox expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon divergence fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
- 中文摘要
推理模型发展迅速,但主流的可验证奖励强化学习(RLVR)方案仍然出奇地狭窄:采样多个回答,并为每个回答给出一个表示最终答案是否正确的单比特奖励。然而,许多场景提供了丰富的反馈,包括执行轨迹、工具输出、专家修正和模型自我评估。我们研究如何通过经典模仿学习算法DAgger的分布型变体来利用此类反馈,其中学习者可以在当前策略访问到的状态上局部访问专家分布。这产生了一个简单的前向交叉熵目标,该目标允许使用黑箱专家,其序列级梯度通过将未来的专家-学生分歧传播回早期决策来实现丰富的信用分配。我们表明,先前基于反向KL或Jensen-Shannon散度的带自蒸馏的强化学习目标无法保证单调的策略改进:即使专家的奖励更高,其更新也可能增加较差动作的概率。相比之下,我们表明前向交叉熵允许单调的策略改进,并具有遗憾(regret)保证。我们还进一步证明,我们的目标优化了教师加权成功似然的下界,从而提升了Pass@N。在实验中,我们的方法DistIL在科学推理、编码和求解困难数学问题等多个领域上优于RLVR以及带自蒸馏的强化学习基线。
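The two objective families contrasted in the DistIL abstract can be written down for a single state with a small discrete action set. The sketch below only illustrates the standard definitions of forward cross-entropy and reverse KL; it does not reproduce the paper's sequence-level objective or its guarantees, and the function names and the `eps` floor are assumptions.

```python
import math

def forward_cross_entropy(expert_probs, student_probs, eps=1e-12):
    """H(expert, student) = -sum_a expert(a) * log student(a)."""
    return -sum(e * math.log(max(s, eps))
                for e, s in zip(expert_probs, student_probs))

def reverse_kl(expert_probs, student_probs, eps=1e-12):
    """KL(student || expert) = sum_a student(a) * log(student(a) / expert(a))."""
    return sum(s * math.log(s / max(e, eps))
               for e, s in zip(expert_probs, student_probs) if s > 0.0)

# An expert that splits its mass over two actions, and a student that has
# collapsed onto just one of them.
expert = [0.5, 0.5, 0.0]
collapsed_student = [1.0, 0.0, 0.0]
fce = forward_cross_entropy(expert, collapsed_student)
rkl = reverse_kl(expert, collapsed_student)
```

Here the reverse KL stays small (log 2) because the student only places mass where the expert does, whereas the forward cross-entropy is large because the student assigns near-zero probability to an action the expert takes half the time. This mode-covering versus mode-seeking difference is a textbook property of the two divergences, shown here only as background for the comparison in the abstract.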
Keyword: diffusion policy
There is no result