Generated: 2025-10-15 16:29:59 (UTC+8); arXiv release time: 2025-10-15 20:00 EDT (2025-10-16 08:00 UTC+8)
35 related papers today
Keyword: reinforcement learning
AI Agents for the Dhumbal Card Game: A Comparative Study
Dhumbal 纸牌游戏的 AI 代理:一项比较研究
- Authors: Sahaj Raj Malla
- Subjects: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.11736
- Pdf link: https://arxiv.org/pdf/2510.11736
- Abstract
This study evaluates Artificial Intelligence (AI) agents for Dhumbal, a culturally significant multiplayer card game with imperfect information, through a systematic comparison of rule-based, search-based, and learning-based strategies. We formalize Dhumbal's mechanics and implement diverse agents, including heuristic approaches (Aggressive, Conservative, Balanced, Opportunistic), search-based methods such as Monte Carlo Tree Search (MCTS) and Information Set Monte Carlo Tree Search (ISMCTS), and reinforcement learning approaches including Deep Q-Network (DQN) and Proximal Policy Optimization (PPO), and a random baseline. Evaluation involves within-category tournaments followed by a cross-category championship. Performance is measured via win rate, economic outcome, Jhyap success, cards discarded per round, risk assessment, and decision efficiency. Statistical significance is assessed using Welch's t-test with Bonferroni correction, effect sizes via Cohen's d, and 95% confidence intervals (CI). Across 1024 simulated rounds, the rule-based Aggressive agent achieves the highest win rate (88.3%, 95% CI: [86.3, 90.3]), outperforming ISMCTS (9.0%) and PPO (1.5%) through effective exploitation of Jhyap declarations. The study contributes a reproducible AI framework, insights into heuristic efficacy under partial information, and open-source code, thereby advancing AI research and supporting digital preservation of cultural games.
- 中文摘要
本研究通过系统比较基于规则、基于搜索和基于学习的策略,评估了 Dhumbal 的人工智能 (AI) 代理,Dhumbal 是一种具有文化意义的多人纸牌游戏,但信息不完善。我们正式化了 Dhumbal 的机制并实现了多种代理,包括启发式方法(积极、保守、平衡、机会主义)、基于搜索的方法,如蒙特卡洛树搜索 (MCTS) 和信息集蒙特卡洛树搜索 (ISMCTS),以及强化学习方法,包括深度 Q 网络 (DQN) 和近端策略优化 (PPO),以及随机基线。评估包括类别内锦标赛,然后是跨类别锦标赛。绩效是通过胜率、经济成果、Jhyap 成功、每轮丢弃的牌、风险评估和决策效率来衡量的。使用带有 Bonferroni 校正的 Welch t 检验、通过 Cohen's d 的效应大小和 95% 置信区间 (CI) 评估统计显着性。在 1024 轮模拟回合中,基于规则的攻击性代理通过有效利用 Jhyap 声明实现了最高的胜率(88.3%,95% CI:[86.3,90.3]),优于 ISMCTS(9.0%)和 PPO(1.5%)。该研究贡献了可重复的人工智能框架、对部分信息下启发式功效的洞察以及开源代码,从而推进人工智能研究并支持文化游戏的数字保存。
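The reported interval for the Aggressive agent can be sanity-checked with a few lines of Python. This is a minimal sketch under two assumptions that the abstract does not state: that the interval is a normal-approximation (Wald) interval, and that all 1024 simulated rounds form the sample for the 88.3% win rate. The helper name `wald_ci` is hypothetical.

```python
import math


def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a proportion.

    Assumption: the abstract does not say which interval method was used;
    the Wald interval is chosen here only for illustration.
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)  # standard error of the proportion
    return p_hat - z * se, p_hat + z * se


# Win rate and round count taken from the abstract.
low, high = wald_ci(0.883, 1024)
print(f"{100 * low:.1f}, {100 * high:.1f}")  # prints "86.3, 90.3"
```

Under these assumptions the result agrees with the interval [86.3, 90.3] quoted in the abstract.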
GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving
GAR:用于形式定理证明的生成对抗强化学习
- Authors: Ruida Wang, Jiarui Yao, Rui Pan, Shizhe Diao, Tong Zhang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.11769
- Pdf link: https://arxiv.org/pdf/2510.11769
- Abstract
Solving math problems through verifiable languages such as Lean has significantly impacted both the mathematics and computer science communities. Current state-of-the-art models are often trained with expensive online Reinforcement Learning (RL) or expert iteration. However, these approaches rely on fixed problem sets, which causes inefficient training and limits the model's ability to tackle complex problems. To overcome these limitations, we propose GAR: Generative Adversarial Reinforcement learning, a comprehensive RL training framework that jointly trains the problem composer and solver in an adversarial loop. GAR introduces an implicit curriculum learning mechanism, which aligns task difficulty with the prover's evolving capability. It thereby improves training efficiency and enables stronger performance in proving advanced theorems. Experiments show that with GAR training, Goedel-Prover-V2-8B and DeepSeek-Prover-V2-7B achieve an average relative improvement in pass@32 of 4.20% on the MiniF2F-Test benchmark, while DeepSeek-Prover-V2's pass@32 on ProofNet-Test increases from 22.58% to 25.81%. Beyond formal proving, GAR establishes a general RL paradigm for co-evolution of problem generation and solving under verifiable environments.
- 中文摘要
通过 Lean 等可验证语言解决数学问题对数学和计算机科学社区产生了重大影响。当前最先进的模型通常使用昂贵的在线强化学习 (RL) 或专家迭代进行训练。然而,这些方法依赖于固定的问题集,这会导致训练效率低下并限制了模型处理复杂问题的能力。为了克服这些限制,我们提出了 GAR:生成对抗强化学习,这是一个全面的 RL 训练框架,在对抗循环中联合训练问题编写者和求解者。GAR 引入了隐式课程学习机制,使任务难度与证明者不断发展的能力保持一致,从而提高了训练效率,并在证明高级定理方面取得更强的性能。实验表明,通过GAR训练,Goedel-Prover-V2-8B和DeepSeek-Prover-V2-7B在MiniF2F-Test基准测试上的平均相对pass@32提升为4.20%,而DeepSeek-Prover-V2在ProofNet-Test上的pass@32从22.58%提高到25.81%。除了形式证明之外,GAR还建立了一种通用的RL范式,用于在可验证的环境中问题生成和解决的共同演化。
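The abstract mixes a relative figure (4.20% on MiniF2F-Test) with an absolute before/after pair (ProofNet-Test). The short sketch below shows how the two views differ for the ProofNet-Test numbers. The function name `relative_improvement` is hypothetical, and the computed relative gain is derived here, not reported by the authors.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, as a fraction of `before`."""
    return (after - before) / before


# pass@32 on ProofNet-Test, as quoted in the abstract (percent).
before, after = 22.58, 25.81
abs_gain = after - before
rel_gain = relative_improvement(before, after)
print(f"absolute: {abs_gain:.2f} points, relative: {100 * rel_gain:.1f}%")
# prints "absolute: 3.23 points, relative: 14.3%"
```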
Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning
协作多智能体强化学习的鲁棒性和弹性的实证研究
- Authors: Simin Li, Zihao Mao, Hanxiao Li, Zonglei Jing, Zhuohang bian, Jun Guo, Li Wang, Zhuoran Han, Ruixiao Xu, Xin Yu, Chengdong Ma, Yuqing Ma, Bo An, Yaodong Yang, Weifeng Lv, Xianglong Liu
- Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.11824
- Pdf link: https://arxiv.org/pdf/2510.11824
- Abstract
In cooperative Multi-Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions--a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also vary by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones. Code and results are available at this https URL.
- 中文摘要
在协作多智能体强化学习 (MARL) 中,在理想的模拟环境中调整超参数以最大限度地提高协作性能是一种常见的做法。然而,在现实世界的不确定性下,为合作而调整的政策往往无法保持稳健性和弹性。构建值得信赖的 MARL 系统需要深入了解稳健性,这确保了不确定性下的稳定性,以及弹性,即从中断中恢复的能力——这个概念在控制系统中得到了广泛研究,但在 MARL 中基本上被忽视了。在本文中,我们提出了一项大规模实证研究,包括超过 82,620 个实验,以评估 MARL 在 4 个真实世界环境、13 种不确定性类型和 15 个超参数中的合作、鲁棒性和弹性。我们的主要发现是:(1)在轻度不确定性下,优化合作可以提高鲁棒性和弹性,但随着扰动的加剧,这种联系会减弱。鲁棒性和弹性也因算法和不确定性类型而异。(2)鲁棒性和弹性不会在不确定性模式或智能体范围内推广:对所有智能体的动作噪声具有鲁棒性的策略在单个智能体的观察噪声下可能会失败。(3) 超参数调优对于值得信赖的 MARL 至关重要:令人惊讶的是,参数共享、GAE 和 PopArt 等标准做法可能会损害鲁棒性,而提前停止、高批评者学习率和 Leaky ReLU 始终会有所帮助。仅通过优化超参数,我们观察到所有 MARL 主干的协作、鲁棒性和弹性都有显着改善,这种现象也推广到这些主干的鲁棒 MARL 方法。代码和结果可在此 https URL 中找到。
Don't Walk the Line: Boundary Guidance for Filtered Generation
不要走这条线:过滤生成的边界指南
- Authors: Sarah Ball, Andreas Haupt
- Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2510.11834
- Pdf link: https://arxiv.org/pdf/2510.11834
- Abstract
Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier's decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier's margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.
- 中文摘要
生成式模型越来越多地与过滤有害或不良输出的安全分类器配对。一种常见的策略是微调生成器以降低被过滤的概率,但这可能是次优的:它通常会推动模型在分类器的决策边界附近生成样本,从而同时增加误报和漏报。我们提出了边界引导 (Boundary Guidance),这是一种强化学习微调方法,可以明确地引导生成远离分类器的决策边界。在越狱和模棱两可的提示的基准上,边界引导提高了输出的安全性和实用性,正如 LLM-as-a-Judge 评估所判断的那样。跨模型规模和奖励设计的全面消融实验证明了我们方法的稳健性。
Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling
通过序列建模在随机博弈中进行鲁棒对抗强化学习
- Authors: Xiaohang Tang, Zhuowen Cheng, Satyabrat Kumar
- Subjects: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
- Arxiv link: https://arxiv.org/abs/2510.11877
- Pdf link: https://arxiv.org/pdf/2510.11877
- Abstract
The Transformer, a highly expressive architecture for sequence modeling, has recently been adapted to solve sequential decision-making, most notably through the Decision Transformer (DT), which learns policies by conditioning on desired returns. Yet, the adversarial robustness of reinforcement learning methods based on sequence modeling remains largely unexplored. Here we introduce the Conservative Adversarially Robust Decision Transformer (CART), to our knowledge the first framework designed to enhance the robustness of DT in adversarial stochastic games. We formulate the interaction between the protagonist and the adversary at each stage as a stage game, where the payoff is defined as the expected maximum value over subsequent states, thereby explicitly incorporating stochastic state transitions. By conditioning Transformer policies on the NashQ value derived from these stage games, CART generates policies that are simultaneously less exploitable (adversarially robust) and conservative to transition uncertainty. Empirically, CART achieves more accurate minimax value estimation and consistently attains superior worst-case returns across a range of adversarial stochastic games.
- 中文摘要
Transformer 是一种高度富有表现力的序列建模架构,最近已被调整用于解决顺序决策问题,最引人注目的是通过决策 Transformer (DT),它通过以期望回报为条件来学习策略。然而,基于序列建模的强化学习方法的对抗鲁棒性在很大程度上仍未得到探索。在这里,我们介绍保守对抗稳健决策转换器 (CART),据我们所知,这是第一个旨在增强对抗随机博弈中 DT 鲁棒性的框架。我们将主角和对手在每个阶段的互动表述为一个阶段博弈,其中收益被定义为后续状态的预期最大值,从而明确地纳入随机状态转换。通过根据从这些阶段博弈中得出的 NashQ 值来调节 Transformer 策略,CART 生成的策略同时具有较少的可利用性(对抗性稳健)和对过渡不确定性的保守性。根据经验,CART 实现了更准确的极小极大值估计,并在一系列对抗性随机博弈中始终获得卓越的最坏情况回报。
ADARL: Adaptive Low-Rank Structures for Robust Policy Learning under Uncertainty
ADARL:不确定性下稳健策略学习的自适应低秩结构
- Authors: Chenliang Li, Junyu Leng, Jiaxiang Li, Youbang Sun, Shixiang Chen, Shahin Shahrampour, Alfredo Garcia
- Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
- Arxiv link: https://arxiv.org/abs/2510.11899
- Pdf link: https://arxiv.org/pdf/2510.11899
- Abstract
Robust reinforcement learning (Robust RL) seeks to handle epistemic uncertainty in environment dynamics, but existing approaches often rely on nested min-max optimization, which is computationally expensive and yields overly conservative policies. We propose Adaptive Rank Representation (AdaRL), a bi-level optimization framework that improves robustness by aligning policy complexity with the intrinsic dimension of the task. At the lower level, AdaRL performs policy optimization under fixed-rank constraints with dynamics sampled from a Wasserstein ball around a centroid model. At the upper level, it adaptively adjusts the rank to balance the bias-variance trade-off, projecting policy parameters onto a low-rank manifold. This design avoids solving adversarial worst-case dynamics while ensuring robustness without over-parameterization. Empirical results on MuJoCo continuous control benchmarks demonstrate that AdaRL not only consistently outperforms fixed-rank baselines (e.g., SAC) and state-of-the-art robust RL methods (e.g., RNAC, Parseval), but also converges toward the intrinsic rank of the underlying tasks. These results highlight that adaptive low-rank policy representations provide an efficient and principled alternative for robust RL under model uncertainty.
- 中文摘要
鲁棒强化学习(Robust RL)旨在处理环境动力学中的认知不确定性,但现有方法通常依赖于嵌套的最小-最大优化,这在计算上很昂贵,并且会产生过于保守的策略。我们提出了自适应秩表示 (AdaRL),这是一个双层优化框架,通过使策略复杂性与任务的内在维度保持一致来提高鲁棒性。在下层,AdaRL 在固定秩约束下执行策略优化,其动力学从围绕质心模型的 Wasserstein 球中采样。在上层,它自适应地调整秩以平衡偏差-方差权衡,将策略参数投影到低秩流形上。这种设计避免了求解对抗性最坏情况的动力学,同时确保鲁棒性而不过度参数化。MuJoCo连续控制基准的实证结果表明,AdaRL不仅始终优于固定秩基线(例如SAC)和最先进的鲁棒RL方法(例如RNAC,Parseval),而且还收敛到底层任务的内在秩。这些结果强调,自适应低秩策略表示为模型不确定性下的鲁棒RL提供了一种有效且有原则的替代方案。
Efficient Restarts in Non-Stationary Model-Free Reinforcement Learning
非平稳无模型强化学习中的高效重启
- Authors: Hiroshi Nonaka, Simon Ambrozak, Sofia R. Miskala-Dinc, Amedeo Ercole, Aviva Prins
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.11933
- Pdf link: https://arxiv.org/pdf/2510.11933
- Abstract
In this work, we propose three efficient restart paradigms for model-free non-stationary reinforcement learning (RL). We identify two core issues with the restart design of Mao et al. (2022)'s RestartQ-UCB algorithm: (1) complete forgetting, where all the information learned about an environment is lost after a restart, and (2) scheduled restarts, in which restarts occur only at predefined timings, regardless of the incompatibility of the policy with the current environment dynamics. We introduce three approaches, which we call partial, adaptive, and selective restarts, to modify the algorithms RestartQ-UCB and RANDOMIZEDQ (Wang et al., 2025). We find near-optimal empirical performance in multiple different environments, decreasing dynamic regret by up to 91% relative to RestartQ-UCB.
- 中文摘要
在这项工作中,我们提出了三种有效的重启范式,用于无模型非平稳强化学习(RL)。我们确定了 Mao 等人(2022)的RestartQ-UCB算法的重启设计的两个核心问题:(1)完全遗忘,即在重启后学到的有关环境的所有信息都会丢失,以及(2)计划重启,其中重启仅在预定义的时间发生,而不管策略与当前环境动态是否不兼容。我们引入了三种方法,我们称之为部分重启、自适应重启和选择性重启,用来修改算法 RestartQ-UCB 和 RANDOMIZEDQ(Wang 等人,2025 年)。我们在多个不同环境中观察到接近最优的经验性能,相对于 RestartQ-UCB 将动态遗憾降低了高达 91%。
Scaling Long-Horizon LLM Agent via Context-Folding
通过上下文折叠扩展长视野 LLM 代理
- Authors: Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, Jiecao Chen
- Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.11967
- Pdf link: https://arxiv.org/pdf/2510.11967
- Abstract
Large language model (LLM) agents are fundamentally constrained by context length on long-horizon tasks. We introduce Context-Folding, a framework that empowers agents to actively manage their working context. An agent can procedurally branch into a sub-trajectory to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, we develop an end-to-end reinforcement learning framework FoldGRPO with specific process rewards to encourage effective task decomposition and context management. On complex long-horizon tasks (Deep Research and SWE), our folding agent matches or outperforms the ReAct baselines while using an active context 10× smaller and significantly outperforms models that rely on summarization-based context management.
- 中文摘要
大型语言模型 (LLM) 代理在长期任务上从根本上受到上下文长度的限制。我们引入了上下文折叠,这是一个使代理能够主动管理其工作上下文的框架。代理可以按程序分支到子轨迹中来处理子任务,然后在完成后将其折叠,折叠中间步骤,同时保留结果的简明摘要。为了使这种行为具有可学习性,我们开发了一个端到端的强化学习框架 FoldGRPO,具有特定的过程奖励,以鼓励有效的任务分解和上下文管理。在复杂的长期任务(深度研究和 SWE)上,我们的折叠代理与 ReAct 基线相当或更优,同时使用的活动上下文小 10 倍,并且明显优于依赖基于摘要的上下文管理的模型。
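The branch-then-fold behaviour described in the abstract can be pictured with a small data structure. This is only an illustrative sketch, not the authors' implementation: the class `FoldingContext`, its method names, and the `[branch]` / `[folded]` markers are all hypothetical, and the summary is supplied by the caller instead of being generated by a model.

```python
from dataclasses import dataclass, field


@dataclass
class FoldingContext:
    """Toy working context that supports branching and folding (hypothetical API)."""

    messages: list[str] = field(default_factory=list)
    _branch_starts: list[int] = field(default_factory=list)

    def add(self, message: str) -> None:
        """Append one step to the active context."""
        self.messages.append(message)

    def branch(self, subtask: str) -> None:
        """Open a sub-trajectory and remember where it starts."""
        self._branch_starts.append(len(self.messages))
        self.messages.append(f"[branch] {subtask}")

    def fold(self, summary: str) -> None:
        """Collapse the most recent sub-trajectory into a one-line summary."""
        start = self._branch_starts.pop()
        del self.messages[start:]  # drop the intermediate steps
        self.messages.append(f"[folded] {summary}")


ctx = FoldingContext()
ctx.add("task: fix the failing build")
ctx.branch("locate the failing test")
ctx.add("ran the test suite")
ctx.add("read the traceback")
ctx.fold("failure is in parser.py")
print(ctx.messages)
```

After the fold, the active context holds two entries (the task and the folded summary) instead of five, which is the kind of reduction the abstract attributes to folding.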
Rethinking the Role of Dynamic Sparse Training for Scalable Deep Reinforcement Learning
重新思考动态稀疏训练在可扩展深度强化学习中的作用
- Authors: Guozheng Ma, Lu Li, Zilin Wang, Haoyu Wang, Shengchao Hu, Leszek Rutkowski, Dacheng Tao
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.12096
- Pdf link: https://arxiv.org/pdf/2510.12096
- Abstract
Scaling neural networks has driven breakthrough advances in machine learning, yet this paradigm fails in deep reinforcement learning (DRL), where larger models often degrade performance due to unique optimization pathologies such as plasticity loss. While recent works show that dynamically adapting network topology during training can mitigate these issues, existing studies have three critical limitations: (1) applying uniform dynamic training strategies across all modules despite encoder, critic, and actor following distinct learning paradigms, (2) focusing evaluation on basic architectures without clarifying the relative importance and interaction between dynamic training and architectural improvements, and (3) lacking systematic comparison between different dynamic approaches including sparse-to-sparse, dense-to-sparse, and sparse-to-dense. Through comprehensive investigation across modules and architectures, we reveal that dynamic sparse training strategies provide module-specific benefits that complement the primary scalability foundation established by architectural improvements. We finally distill these insights into Module-Specific Training (MST), a practical framework that further exploits the benefits of architectural improvements and demonstrates substantial scalability gains across diverse RL algorithms without algorithmic modifications.
- 中文摘要
扩展神经网络推动了机器学习的突破性进步,但这种范式在深度强化学习 (DRL) 中失败了,在深度强化学习 (DRL) 中,较大的模型通常会由于可塑性损失等独特的优化病态而降低性能。虽然最近的研究表明,在训练过程中动态适应网络拓扑可以缓解这些问题,但现有研究有三个关键局限性:(1)尽管编码器、批评者和参与者遵循不同的学习范式,但在所有模块中应用统一的动态训练策略,(2)将评估重点放在基本架构上,而没有阐明动态训练和架构改进之间的相对重要性和相互作用, (3)缺乏对稀疏到稀疏、密集到稀疏和稀疏到密集等不同动态方法的系统比较。通过对模块和架构的全面调查,我们发现动态稀疏训练策略提供了特定于模块的好处,这些好处补充了架构改进建立的主要可扩展性基础。我们最终将这些见解提炼成模块特定训练 (MST),这是一个实用的框架,它进一步利用了架构改进的好处,并在不修改算法的情况下展示了跨不同 RL 算法的可扩展性显着增益。
Self-Verifying Reflection Helps Transformers with CoT Reasoning
自验证反思帮助 Transformer 进行 CoT 推理
- Authors: Zhongwei Yu, Wannian Xia, Xue Yan, Bo Xu, Haifeng Zhang, Yali Du, Jun Wang
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.12157
- Pdf link: https://arxiv.org/pdf/2510.12157
- Abstract
Advanced large language models (LLMs) frequently reflect in reasoning chain-of-thoughts (CoTs), where they self-verify the correctness of current solutions and explore alternatives. However, given recent findings that LLMs detect limited errors in CoTs, how reflection contributes to empirical improvements remains unclear. To analyze this issue, in this paper, we present a minimalistic reasoning framework to support basic self-verifying reflection for small transformers without natural language, which ensures analytic clarity and reduces the cost of comprehensive experiments. Theoretically, we prove that self-verifying reflection guarantees improvements if verification errors are properly bounded. Experimentally, we show that tiny transformers, with only a few million parameters, benefit from self-verification in both training and reflective execution, reaching remarkable LLM-level performance in integer multiplication and Sudoku. Similar to LLM results, we find that reinforcement learning (RL) improves in-distribution performance and incentivizes frequent reflection for tiny transformers, yet RL mainly optimizes shallow statistical patterns without faithfully reducing verification errors. In conclusion, integrating generative transformers with discriminative verification inherently facilitates CoT reasoning, regardless of scaling and natural language.
- 中文摘要
先进的大型语言模型 (LLM) 经常在推理思维链 (CoT) 中进行反思,即自我验证当前解决方案的正确性并探索替代方案。然而,鉴于最近的研究结果表明,LLM 在 CoT 中检测到的错误有限,反思如何带来经验上的改进仍不清楚。为了分析这个问题,在本文中,我们提出了一个极简的推理框架,以支持不使用自然语言的小型 Transformer 进行基本的自验证反思,从而保证了分析的清晰度并降低了综合实验的成本。从理论上讲,我们证明,如果验证误差被适当限制,自验证反思可以保证改进。通过实验,我们表明,只有几百万个参数的微型 Transformer 在训练和反思执行中都受益于自我验证,在整数乘法和数独方面达到了出色的 LLM 级性能。与LLM结果类似,我们发现强化学习(RL)提高了分布内性能,并激励了微型 Transformer 的频繁反思,但RL主要优化了浅层统计模式,而没有切实地减少验证误差。总之,将生成式 Transformer 与判别式验证相结合本质上有助于 CoT 推理,无论模型规模和是否使用自然语言。
Reinforced Preference Optimization for Recommendation
推荐的强化偏好优化
- Authors: Junfei Tan, Yuxin Chen, An Zhang, Junguang Jiang, Bin Liu, Ziru Xu, Han Zhu, Jian Xu, Bo Zheng, Xiang Wang
- Subjects: Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2510.12211
- Pdf link: https://arxiv.org/pdf/2510.12211
- Abstract
Recent breakthroughs in large language models (LLMs) have fundamentally shifted recommender systems from discriminative to generative paradigms, where user behavior modeling is achieved by generating target items conditioned on historical interactions. Yet current generative recommenders still suffer from two core limitations: the lack of high-quality negative modeling and the reliance on implicit rewards. Reinforcement learning with verifiable rewards (RLVR) offers a natural solution by enabling on-policy sampling of harder negatives and grounding optimization in explicit reward signals. However, applying RLVR to generative recommenders remains non-trivial. Its unique generation space often leads to invalid or repetitive items that undermine sampling efficiency, and ranking supervision is sparse since most items receive identical zero rewards. To address these challenges, we propose Reinforced Preference Optimization for Recommendation (ReRe), a reinforcement-based paradigm tailored to LLM-based recommenders, an important direction in generative recommendation. ReRe incorporates constrained beam search to improve sampling efficiency and diversify hard negatives, while augmenting rule-based accuracy rewards with auxiliary ranking rewards for finer-grained supervision. Extensive experiments on three real-world datasets demonstrate that ReRe consistently outperforms both traditional and LLM-based recommenders in ranking performance. Further analysis shows that ReRe not only enhances performance across both base and SFT-initialized models but also generalizes robustly across different backbone families and scales. Beyond empirical gains, we systematically investigate the design space of RLVR in recommendation across generation, sampling strategy, reward modeling, and optimization algorithm, offering insights for future research.
- 中文摘要
大型语言模型(LLM)的最新突破从根本上将推荐系统从判别范式转变为生成范式,其中用户行为建模是通过生成以历史交互为条件的目标项来实现的。然而,当前的生成式推荐器仍然受到两个核心限制:缺乏高质量的负面建模和对隐性奖励的依赖。具有可验证奖励的强化学习 (RLVR) 通过启用更难的负值的策略采样和在显式奖励信号中进行基础优化,提供了一种自然的解决方案。然而,将 RLVR 应用于生成式推荐器仍然并非易事。其独特的生成空间往往会导致无效或重复的物品,从而破坏抽样效率,并且由于大多数物品获得相同的零奖励,因此排名监督稀疏。为了应对这些挑战,我们提出了推荐的强化偏好优化(ReRe),这是一种针对基于LLM的推荐者量身定制的基于强化的范式,这是生成式推荐的一个重要方向。ReRe 结合了约束波束搜索,以提高采样效率并使硬负值多样化,同时通过辅助排名奖励来增强基于规则的准确性奖励,以实现更细粒度的监督。对三个真实世界数据集的广泛实验表明,ReRe 在排名性能方面始终优于传统和基于 LLM 的推荐器。进一步的分析表明,ReRe不仅增强了基础模型和SFT初始化模型的性能,而且还在不同的主干系列和规模上进行了稳健的泛化。除了实证收益之外,我们还系统地研究了RLVR在推荐方面的设计空间,包括生成、采样策略、奖励建模和优化算法,为未来的研究提供了启示。
PromptFlow: Training Prompts Like Neural Networks
PromptFlow:像训练神经网络一样训练提示
- Authors: Jingyi Wang, Hongyuan Zhu, Ye Niu, Yunhui Deng
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.12246
- Pdf link: https://arxiv.org/pdf/2510.12246
- Abstract
Large Language Models (LLMs) have demonstrated profound impact on Natural Language Processing (NLP) tasks. However, their effective deployment across diverse domains often requires domain-specific adaptation strategies, as generic models may underperform when faced with specialized data distributions. Recent advances in prompt engineering (PE) offer a promising alternative to extensive retraining by refining input instructions to align LLM outputs with task objectives. This paradigm has emerged as a rapid and versatile approach for model fine-tuning. Despite its potential, manual prompt design remains labor-intensive and heavily depends on specialized expertise, often requiring iterative human effort to achieve optimal formulations. To address this limitation, automated prompt engineering methodologies have been developed to systematically generate task-specific prompts. However, current implementations predominantly employ static update rules and lack mechanisms for dynamic strategy selection, resulting in suboptimal adaptation to varying NLP task requirements. Furthermore, most methods treat and update the whole prompt at each step, without considering editing prompt sections at a finer granularity. Finally, the problem of how to recycle experience in LLMs remains underexplored. To this end, we propose PromptFlow, a modular training framework inspired by TensorFlow, which integrates meta-prompts, operators, optimization, and evaluator. Our framework can be equipped with the latest optimization methods and autonomously explores optimal prompt refinement trajectories through gradient-based meta-learning, requiring minimal task-specific training data. Specifically, we devise a reinforcement learning method to recycle experience for LLMs in the PE process. Finally, we conduct extensive experiments on various datasets, and demonstrate the effectiveness of PromptFlow.
- 中文摘要
大型语言模型 (LLM) 已证明对自然语言处理 (NLP) 任务产生了深远的影响。然而,它们在不同领域的有效部署通常需要特定领域的适应策略,因为通用模型在面对专门的数据分布时可能表现不佳。提示工程 (PE) 的最新进展通过改进输入指令使 LLM 输出与任务目标保持一致,为广泛的再训练提供了一种有前途的替代方案。这种范式已成为一种快速且通用的模型微调方法。尽管具有潜力,但手动提示设计仍然是劳动密集型的,并且严重依赖专业知识,通常需要迭代的人力才能得到最佳表述。为了解决这一限制,已经开发了自动化提示工程方法来系统地生成特定于任务的提示。然而,当前的实现主要采用静态更新规则,缺乏动态策略选择机制,导致对不同NLP任务要求的适应不理想。此外,大多数方法都会在每个步骤中处理和更新整个提示,而不考虑以更精细的粒度编辑提示部分。最后,如何回收利用 LLM 经验的问题仍然没有得到充分探索。为此,我们提出了 PromptFlow,这是一个受 TensorFlow 启发的模块化训练框架,它集成了元提示、运算符、优化和评估器。我们的框架可以配备最新的优化方法,并通过基于梯度的元学习自主探索最优的提示细化轨迹,需要最少的任务特定训练数据。具体来说,我们设计了一种强化学习方法,在提示工程 (PE) 过程中为 LLM 回收利用经验。最后,我们对各种数据集进行了广泛的实验,并证明了 PromptFlow 的有效性。
Diffusion Models for Reinforcement Learning: Foundations, Taxonomy, and Development
强化学习的扩散模型:基础、分类法和发展
- Authors: Changfu Xu, Jianxiong Guo, Yuzhu Liang, Haiyang Huang, Haodong Zou, Xi Zheng, Shui Yu, Xiaowen Chu, Jiannong Cao, Tian Wang
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.12253
- Pdf link: https://arxiv.org/pdf/2510.12253
- Abstract
Diffusion Models (DMs), as a leading class of generative models, offer key advantages for reinforcement learning (RL), including multi-modal expressiveness, stable training, and trajectory-level planning. This survey delivers a comprehensive and up-to-date synthesis of diffusion-based RL. We first provide an overview of RL, highlighting its challenges, and then introduce the fundamental concepts of DMs, investigating how they are integrated into RL frameworks to address key challenges in this research field. We establish a dual-axis taxonomy that organizes the field along two orthogonal dimensions: a function-oriented taxonomy that clarifies the roles DMs play within the RL pipeline, and a technique-oriented taxonomy that situates implementations across online versus offline learning regimes. We also provide a comprehensive examination of this progression from single-agent to multi-agent domains, thereby forming several frameworks for DM-RL integration and highlighting their practical utility. Furthermore, we outline several categories of successful applications of diffusion-based RL across diverse domains, discuss open research issues of current methodologies, and highlight key directions for future research to advance the field. Finally, we summarize the survey to identify promising future development directions. We are actively maintaining a GitHub repository (this https URL) for papers and other related resources to apply DMs for RL.
- 中文摘要
扩散模型(DM)作为一类领先的生成模型,为强化学习(RL)提供了关键优势,包括多模态表达性、稳定训练和轨迹级规划。该调查提供了基于扩散的 RL 的全面和最新的综合。我们首先概述了 RL,强调了其挑战,然后介绍了 DM 的基本概念,研究了它们如何集成到 RL 框架中以应对该研究领域的关键挑战。我们建立了一个双轴分类法,沿着两个正交维度组织该领域:一个面向功能的分类法,阐明了 DM 在 RL 管道中所扮演的角色,以及一个面向技术的分类法,将实施定位在在线和离线学习制度之间。我们还对从单代理域到多代理域的这种进展进行了全面检查,从而形成了 DM-RL 集成的几个框架并强调了它们的实际实用性。此外,我们概述了基于扩散的RL在不同领域的成功应用的几类,讨论了当前方法的开放研究问题,并强调了未来研究推进该领域的关键方向。最后,对调查进行总结,确定未来有前景的发展方向。我们正在积极维护一个 GitHub 存储库(此 https URL),用于论文和其他相关资源,以将 DM 应用于 RL。
$\mathbf{T^3}$: Reducing Belief Deviation in Reinforcement Learning for Active Reasoning
$\mathbf{T^3}$:减少强化学习中的信念偏差以进行主动推理
- Authors: Deyu Zou, Yongqiang Chen, Jianxiang Wang, Haochen Yang, Mufei Li, James Cheng, Pan Li, Yu Gong
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.12264
- Pdf link: https://arxiv.org/pdf/2510.12264
- Abstract
Active reasoning requires large language models (LLMs) to interact with external sources and strategically gather information to solve problems. Central to this process is belief tracking: maintaining a coherent understanding of the problem state and the missing information toward the solution. However, due to limited reasoning capabilities, LLM-based agents often suffer from belief deviation: they struggle to correctly model beliefs, lose track of problem states, and fall into uninformative or repetitive actions. Once this happens, errors compound and reinforcement learning (RL) training fails to properly credit the crucial exploratory steps. To address this issue, we propose to track the deviation of model beliefs and develop $\mathbf{T^3}$, a simple yet effective method that detects excessive belief deviation and truncates trajectories during training to remove uninformative tails. By preserving credit for informative prefixes, $\mathbf{T^3}$ systematically improves policy optimization. Across 5 challenging tasks, $\mathbf{T^3}$ consistently enhances training stability, token efficiency, and final performance, achieving up to 30% gains while cutting rollout tokens by roughly 25%. These results highlight belief control as a key principle for developing robust and generalizable LLM-based active reasoners.
- 中文摘要
主动推理需要大型语言模型 (LLM) 与外部资源交互并战略性地收集信息以解决问题。这个过程的核心是信念跟踪:保持对问题状态和解决方案缺失信息的连贯理解。然而,由于推理能力有限,基于 LLM 的智能体经常遭受信念偏差的困扰:他们难以正确地对信念进行建模,忘记了对问题状态的跟踪,并陷入了缺乏信息或重复的行为。一旦发生这种情况,错误就会复合,强化学习 (RL) 训练就无法正确地归功于关键的探索步骤。为了解决这个问题,我们建议跟踪模型信念的偏差,并开发 $\mathbf{T^3}$,这是一种简单而有效的方法,可以在训练过程中检测过度的信念偏差并截断轨迹,以去除无信息的尾部。通过保留信息前缀的功劳,$\mathbf{T^3}$ 系统地改进了策略优化。在 5 项具有挑战性的任务中,$\mathbf{T^3}$ 持续增强训练稳定性、代币效率和最终性能,实现高达 30% 的收益,同时将推出代币减少约 25%。这些结果强调了信念控制是开发稳健且可通用的基于 LLM 的主动推理器的关键原则。
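The idea of cutting an uninformative tail while keeping the informative prefix can be sketched as follows. This is a hypothetical illustration only: the abstract does not specify how belief deviation is detected, so the sketch assumes a per-step boolean flag `informative` and a `patience` threshold, both invented here as a stand-in for the paper's actual criterion.

```python
def truncate_trajectory(steps: list[str], informative: list[bool], patience: int = 2) -> list[str]:
    """Keep the prefix that precedes the first run of `patience` uninformative steps.

    Assumption: `informative[i]` is a hypothetical proxy signal for whether
    step i still reflects a well-tracked belief state.
    """
    run = 0  # length of the current run of uninformative steps
    for i, ok in enumerate(informative):
        run = 0 if ok else run + 1
        if run >= patience:
            # Drop the whole uninformative run and everything after it.
            return steps[: i - patience + 1]
    return list(steps)  # no excessive deviation detected; keep everything


steps = ["ask A", "ask B", "repeat B", "repeat B", "repeat B"]
flags = [True, True, False, False, False]
kept = truncate_trajectory(steps, flags, patience=2)
print(kept)  # prints "['ask A', 'ask B']"
```

In an RL training loop, only the kept prefix would then receive credit, which matches the abstract's description of preserving credit for informative prefixes.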
Human-in-the-Loop Bandwidth Estimation for Quality of Experience Optimization in Real-Time Video Communication
用于实时视频通信体验质量优化的人机交互带宽估计
- Authors: Sami Khairy, Gabriel Mittag, Vishak Gopal, Ross Cutler
- Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2510.12265
- Pdf link: https://arxiv.org/pdf/2510.12265
- Abstract
The quality of experience (QoE) delivered by video conferencing systems is significantly influenced by accurately estimating the time-varying available bandwidth between the sender and receiver. Bandwidth estimation for real-time communications remains an open challenge due to rapidly evolving network architectures, increasingly complex protocol stacks, and the difficulty of defining QoE metrics that reliably improve user experience. In this work, we propose a deployed, human-in-the-loop, data-driven framework for bandwidth estimation to address these challenges. Our approach begins with training objective QoE reward models derived from subjective user evaluations to measure audio and video quality in real-time video conferencing systems. Subsequently, we collect roughly 1M network traces with objective QoE rewards from real-world Microsoft Teams calls to curate a bandwidth estimation training dataset. We then introduce a novel distributional offline reinforcement learning (RL) algorithm to train a neural-network-based bandwidth estimator aimed at improving QoE for users. Our real-world A/B test demonstrates that the proposed approach reduces the subjective poor call ratio by 11.41% compared to the baseline bandwidth estimator. Furthermore, the proposed offline RL algorithm is benchmarked on D4RL tasks to demonstrate its generalization beyond bandwidth estimation.
- 中文摘要
视频会议系统提供的体验质量 (QoE) 在很大程度上取决于能否准确估计发送方和接收方之间时变的可用带宽。由于快速发展的网络架构、日益复杂的协议堆栈以及难以定义可靠地改善用户体验的 QoE 指标,实时通信的带宽估计仍然是一个悬而未决的挑战。在这项工作中,我们提出了一个已部署的、人在回路的、数据驱动的带宽估计框架来应对这些挑战。我们的方法首先训练从主观用户评估中得出的客观 QoE 奖励模型,以衡量实时视频会议系统中的音频和视频质量。随后,我们从真实世界的 Microsoft Teams 通话中收集大约 100 万条带有客观 QoE 奖励的网络轨迹,以构建带宽估计训练数据集。然后,我们引入了一种新颖的分布式离线强化学习(RL)算法来训练基于神经网络的带宽估计器,旨在提高用户的QoE。我们的真实 A/B 测试表明,与基线带宽估计器相比,所提出的方法将主观不良通话率降低了 11.41%。此外,所提出的离线RL算法在D4RL任务上进行了基准测试,以证明其超越带宽估计的泛化性。
Heterogeneous RBCs via deep multi-agent reinforcement learning
基于深度多智能体强化学习的异质性真实商业周期 (RBC) 模型
- Authors: Federico Gabriele, Aldo Glielmo, Marco Taboga
- Subjects:
Multiagent Systems (cs.MA); Machine Learning (cs.LG); Theoretical Economics (econ.TH)
- Arxiv link: https://arxiv.org/abs/2510.12272
- Pdf link: https://arxiv.org/pdf/2510.12272
- Abstract
Current macroeconomic models with agent heterogeneity can be broadly divided into two main groups. Heterogeneous-agent general equilibrium (GE) models, such as those based on Heterogeneous Agents New Keynesian (HANK) or Krusell-Smith (KS) approaches, rely on GE and 'rational expectations', somewhat unrealistic assumptions that make the models very computationally cumbersome, which in turn limits the amount of heterogeneity that can be modelled. In contrast, agent-based models (ABMs) can flexibly encompass a large number of arbitrarily heterogeneous agents, but typically require the specification of explicit behavioural rules, which can lead to a lengthy trial-and-error model-development process. To address these limitations, we introduce MARL-BC, a framework that integrates deep multi-agent reinforcement learning (MARL) with Real Business Cycle (RBC) models. We demonstrate that MARL-BC can: (1) recover textbook RBC results when using a single agent; (2) recover the results of the mean-field KS model using a large number of identical agents; and (3) effectively simulate rich heterogeneity among agents, a hard task for traditional GE approaches. Our framework can be thought of as an ABM if used with a variety of heterogeneous interacting agents, and can reproduce GE results in limit cases. As such, it is a step towards a synthesis of these often opposed modelling paradigms.
- 中文摘要
当前包含主体异质性的宏观经济模型大致可分为两大类。异质主体一般均衡 (GE) 模型,例如基于异质主体新凯恩斯主义 (HANK) 或 Krusell-Smith (KS) 方法的模型,依赖于一般均衡和“理性预期”这两个不太现实的假设,使模型的计算非常繁重,进而限制了可以建模的异质性程度。相比之下,基于主体的模型 (ABM) 可以灵活地容纳大量任意异质的主体,但通常需要指定明确的行为规则,这可能导致漫长的试错式模型开发过程。为了解决这些局限,我们引入了 MARL-BC,这是一个将深度多智能体强化学习 (MARL) 与真实商业周期 (RBC) 模型相结合的框架。我们证明 MARL-BC 可以:(1) 在使用单个智能体时复现教科书中的 RBC 结果;(2) 在使用大量相同智能体时复现平均场 KS 模型的结果;(3) 有效模拟智能体之间丰富的异质性,而这对传统 GE 方法来说十分困难。当与多种相互作用的异质智能体一起使用时,我们的框架可以被视为一个 ABM,并且可以在极限情形下复现 GE 结果。因此,它朝着综合这两种常被视为对立的建模范式迈出了一步。
Deep SPI: Safe Policy Improvement via World Models
Deep SPI:基于世界模型的安全策略改进
- Authors: Florent Delgrange, Raphael Avalos, Willem Röpke
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.12312
- Pdf link: https://arxiv.org/pdf/2510.12312
- Abstract
Safe policy improvement (SPI) offers theoretical control over policy updates, yet existing guarantees largely concern offline, tabular reinforcement learning (RL). We study SPI in general online settings, when combined with world model and representation learning. We develop a theoretical framework showing that restricting policy updates to a well-defined neighborhood of the current policy ensures monotonic improvement and convergence. This analysis links transition and reward prediction losses to representation quality, yielding online, "deep" analogues of classical SPI theorems from the offline RL literature. Building on these results, we introduce DeepSPI, a principled on-policy algorithm that couples local transition and reward losses with regularised policy updates. On the ALE-57 benchmark, DeepSPI matches or exceeds strong baselines, including PPO and DeepMDPs, while retaining theoretical guarantees.
- 中文摘要
安全策略改进 (SPI) 为策略更新提供了理论上的控制,但现有的保证主要针对离线的表格型强化学习 (RL)。我们研究一般在线设置下与世界模型和表示学习相结合的 SPI。我们建立了一个理论框架,表明将策略更新限制在当前策略的一个定义明确的邻域内,可以确保单调改进和收敛。该分析将转移和奖励预测损失与表示质量联系起来,得到了离线 RL 文献中经典 SPI 定理的在线“深度”版本。基于这些结果,我们提出了 DeepSPI,这是一种有理论依据的同策略 (on-policy) 算法,它将局部的转移和奖励损失与正则化的策略更新相结合。在 ALE-57 基准上,DeepSPI 达到或超过了包括 PPO 和 DeepMDPs 在内的强基线,同时保留了理论保证。
Finite-time Convergence Analysis of Actor-Critic with Evolving Reward
奖励不断演变时 Actor-Critic 的有限时间收敛分析
- Authors: Rui Hu, Yu Chen, Longbo Huang
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.12334
- Pdf link: https://arxiv.org/pdf/2510.12334
- Abstract
Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions, through techniques such as reward shaping, entropy regularization, or curriculum learning, yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our result shows that an $O(1/\sqrt{T})$ convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with bounded gradient and on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of $\log^2T$ in the static-reward case.
- 中文摘要
许多流行的实用强化学习 (RL) 算法通过奖励塑形、熵正则化或课程学习等技术使用不断演变的奖励函数,但其理论基础仍不完善。本文首次给出了马尔可夫采样下、奖励函数不断演变时单时间尺度 actor-critic 算法的有限时间收敛分析。我们考虑奖励参数可能在每个时间步发生变化的设置,这种变化同时影响策略优化和价值估计。在标准假设下,我们推导出 actor 误差和 critic 误差的非渐近界。我们的结果表明,只要奖励参数演变得足够慢,就可以达到 $O(1/\sqrt{T})$ 的收敛速率,与静态奖励下已知的最优速率一致。当奖励通过梯度有界的基于梯度的规则更新,且更新的时间尺度与 actor 和 critic 相同时,该速率依然成立,这为许多流行的 RL 技术提供了理论基础。作为次要贡献,我们提出了一种对马尔可夫采样下分布失配的新分析,在静态奖励情形下将已知的最优速率改进了 $\log^2T$ 倍。
Physics-Informed Reinforcement Learning for Large-Scale EV Smart Charging Considering Distribution Network Voltage Constraints
考虑配电网电压约束的大规模电动汽车智能充电的物理信息强化学习
- Authors: Stavros Orfanoudakis, Frans Oliehoek, Peter Palensky, Pedro P. Vergara
- Subjects:
Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2510.12335
- Pdf link: https://arxiv.org/pdf/2510.12335
- Abstract
Electric Vehicles (EVs) offer substantial flexibility for grid services, yet large-scale, uncoordinated charging can threaten voltage stability in distribution networks. Existing Reinforcement Learning (RL) approaches for smart charging often disregard physical grid constraints or have limited performance for complex large-scale tasks, limiting their scalability and real-world applicability. This paper introduces a physics-informed (PI) RL algorithm that integrates a differentiable power flow model and voltage-based reward design into the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, enabling EVs to deliver real-time voltage support while meeting user demands. The resulting PI-TD3 algorithm achieves faster convergence, improved sample efficiency, and reliable voltage magnitude regulation under uncertain and overloaded conditions. Benchmarks on the IEEE 34-bus and 123-bus networks show that the proposed PI-TD3 outperforms both model-free RL and optimization-based baselines in grid constraint management, user satisfaction, and economic metrics, even as the system scales to hundreds of EVs. These advances enable robust, scalable, and practical EV charging strategies that enhance grid resilience and support distribution networks operation.
- 中文摘要
电动汽车 (EV) 为电网服务提供了可观的灵活性,但大规模、无协调的充电可能威胁配电网的电压稳定性。现有用于智能充电的强化学习 (RL) 方法往往忽视电网的物理约束,或在复杂的大规模任务上性能有限,从而限制了其可扩展性和实际适用性。本文提出一种物理信息 (PI) RL 算法,将可微分的潮流模型和基于电压的奖励设计集成到双延迟深度确定性策略梯度 (TD3) 算法中,使电动汽车能够在满足用户需求的同时提供实时电压支撑。由此得到的 PI-TD3 算法在不确定和过载条件下实现了更快的收敛、更高的样本效率和可靠的电压幅值调节。在 IEEE 34 节点和 123 节点网络上的基准测试表明,即使系统扩展到数百辆电动汽车,所提出的 PI-TD3 在电网约束管理、用户满意度和经济指标方面仍优于无模型 RL 和基于优化的基线。这些进展使稳健、可扩展且实用的电动汽车充电策略成为可能,从而增强电网韧性并支持配电网运行。
Pretraining in Actor-Critic Reinforcement Learning for Robot Motion Control
面向机器人运动控制的 Actor-Critic 强化学习预训练
- Authors: Jiale Fan, Andrei Cramariuc, Tifanny Portela, Marco Hutter
- Subjects:
Robotics (cs.RO); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.12363
- Pdf link: https://arxiv.org/pdf/2510.12363
- Abstract
The pretraining-finetuning paradigm has facilitated numerous transformative advancements in artificial intelligence research in recent years. However, in the domain of reinforcement learning (RL) for robot motion control, individual skills are often learned from scratch despite the high likelihood that some generalizable knowledge is shared across all task-specific policies belonging to a single robot embodiment. This work aims to define a paradigm for pretraining neural network models that encapsulate such knowledge and can subsequently serve as a basis for warm-starting the RL process in classic actor-critic algorithms, such as Proximal Policy Optimization (PPO). We begin with a task-agnostic exploration-based data collection algorithm to gather diverse, dynamic transition data, which is then used to train a Proprioceptive Inverse Dynamics Model (PIDM) through supervised learning. The pretrained weights are loaded into both the actor and critic networks to warm-start the policy optimization of actual tasks. We systematically validated our proposed method on seven distinct robot motion control tasks, showing significant benefits to this initialization strategy. Our proposed approach on average improves sample efficiency by 40.1% and task performance by 7.5%, compared to random initialization. We further present key ablation studies and empirical analyses that shed light on the mechanisms behind the effectiveness of our method.
- 中文摘要
近年来,预训练-微调范式推动了人工智能研究中的许多变革性进展。然而,在面向机器人运动控制的强化学习 (RL) 领域,尽管同一机器人本体的所有任务特定策略很可能共享某些可泛化的知识,各项技能通常仍是从零开始学习的。这项工作旨在定义一种预训练神经网络模型的范式,使模型封装这类知识,并随后作为经典 actor-critic 算法(例如近端策略优化 (PPO))中热启动 RL 过程的基础。我们首先使用一种与任务无关、基于探索的数据收集算法来收集多样的动态转移数据,然后利用这些数据通过监督学习训练本体感觉逆动力学模型 (PIDM)。预训练权重被同时加载到 actor 网络和 critic 网络中,以热启动实际任务的策略优化。我们在七个不同的机器人运动控制任务上系统地验证了所提出的方法,结果显示这种初始化策略带来了显著收益。与随机初始化相比,我们的方法平均将样本效率提高了 40.1%,将任务性能提高了 7.5%。我们还给出了关键的消融研究和实证分析,以阐明该方法有效的内在机制。
Robot Learning: A Tutorial
机器人学习:教程
- Authors: Francesco Capuano, Caroline Pascal, Adil Zouitine, Thomas Wolf, Michel Aractingi
- Subjects:
Robotics (cs.RO); Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.12403
- Pdf link: https://arxiv.org/pdf/2510.12403
- Abstract
Robot learning is at an inflection point, driven by rapid advancements in machine learning and the growing availability of large-scale robotics data. This shift from classical, model-based methods to data-driven, learning-based paradigms is unlocking unprecedented capabilities in autonomous systems. This tutorial navigates the landscape of modern robot learning, charting a course from the foundational principles of Reinforcement Learning and Behavioral Cloning to generalist, language-conditioned models capable of operating across diverse tasks and even robot embodiments. This work is intended as a guide for researchers and practitioners, and our goal is to equip the reader with the conceptual understanding and practical tools necessary to contribute to developments in robot learning, with ready-to-use examples implemented in $\texttt{lerobot}$.
- 中文摘要
在机器学习快速进步和大规模机器人数据日益丰富的推动下,机器人学习正处于一个拐点。从经典的基于模型的方法到数据驱动、基于学习的范式的转变,正在释放自主系统前所未有的能力。本教程梳理了现代机器人学习的全貌,从强化学习和行为克隆的基本原理,一路讲到能够跨越不同任务乃至不同机器人本体运行的、以语言为条件的通用模型。本文旨在为研究人员和从业者提供指南,我们的目标是让读者掌握为机器人学习的发展做出贡献所需的概念理解和实用工具,并提供在 $\texttt{lerobot}$ 中实现的即用型示例。
Biased-Attention Guided Risk Prediction for Safe Decision-Making at Unsignalized Intersections
面向无信号交叉口安全决策的偏置注意力引导风险预测
- Authors: Chengyang Dong, Nan Guo
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.12428
- Pdf link: https://arxiv.org/pdf/2510.12428
- Abstract
Autonomous driving decision-making at unsignalized intersections is highly challenging due to complex dynamic interactions and high conflict risks. To achieve proactive safety control, this paper proposes a deep reinforcement learning (DRL) decision-making framework integrated with a biased attention mechanism. The framework is built upon the Soft Actor-Critic (SAC) algorithm. Its core innovation lies in the use of biased attention to construct a traffic risk predictor. This predictor assesses the long-term risk of collision for a vehicle entering the intersection and transforms this risk into a dense reward signal to guide the SAC agent in making safe and efficient driving decisions. Finally, the simulation results demonstrate that the proposed method effectively improves both traffic efficiency and vehicle safety at the intersection, thereby proving the effectiveness of the intelligent decision-making framework in complex scenarios. The code of our work is available at this https URL.
- 中文摘要
由于动态交互复杂、冲突风险高,无信号交叉口的自动驾驶决策极具挑战性。为了实现主动安全控制,本文提出了一种结合偏置注意力机制的深度强化学习 (DRL) 决策框架。该框架建立在 Soft Actor-Critic (SAC) 算法之上,其核心创新在于利用偏置注意力构建交通风险预测器。该预测器评估车辆进入交叉口的长期碰撞风险,并将该风险转化为密集的奖励信号,以引导 SAC 智能体做出安全高效的驾驶决策。最后,仿真结果表明,所提方法有效提高了交叉口的通行效率和车辆安全性,从而证明了该智能决策框架在复杂场景下的有效性。我们工作的代码可在此 https URL 中找到。
Bayesian Optimization for Dynamic Pricing and Learning
动态定价和学习的贝叶斯优化
- Authors: Anush Anand, Pranav Agrawal, Tejas Bodas
- Subjects:
Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.12447
- Pdf link: https://arxiv.org/pdf/2510.12447
- Abstract
Dynamic pricing is the practice of adjusting the selling price of a product to maximize a firm's revenue by responding to market demand. The literature typically distinguishes between two settings: infinite inventory, where the firm has unlimited stock and time to sell, and finite inventory, where both inventory and selling horizon are limited. In both cases, the central challenge lies in the fact that the demand function -- how sales respond to price -- is unknown and must be learned from data. Traditional approaches often assume a specific parametric form for the demand function, enabling the use of reinforcement learning (RL) to identify near-optimal pricing strategies. However, such assumptions may not hold in real-world scenarios, limiting the applicability of these methods. In this work, we propose a Gaussian Process (GP) based nonparametric approach to dynamic pricing that avoids restrictive modeling assumptions. We treat the demand function as a black-box function of the price and develop pricing algorithms based on Bayesian Optimization (BO) -- a sample-efficient method for optimizing unknown functions. We present BO-based algorithms tailored for both infinite and finite inventory settings and provide regret guarantees for both regimes, thereby quantifying the learning efficiency of our methods. Through extensive experiments, we demonstrate that our BO-based methods outperform several state-of-the-art RL algorithms in terms of revenue, while requiring fewer assumptions and offering greater robustness. This highlights Bayesian Optimization as a powerful and practical tool for dynamic pricing in complex, uncertain environments.
- 中文摘要
动态定价是指根据市场需求调整产品售价,以最大化公司收入的做法。文献通常区分两种设置:无限库存,即公司拥有无限的库存和销售时间;以及有限库存,即库存和销售期限都有限。在这两种情况下,核心挑战都在于需求函数(即销量如何随价格变化)是未知的,必须从数据中学习。传统方法通常为需求函数假设特定的参数形式,从而可以使用强化学习 (RL) 来寻找接近最优的定价策略。然而,这类假设在现实场景中可能不成立,限制了这些方法的适用性。在这项工作中,我们提出了一种基于高斯过程 (GP) 的非参数动态定价方法,避免了限制性的建模假设。我们将需求函数视为价格的黑盒函数,并基于贝叶斯优化 (BO) 这一用于优化未知函数的样本高效方法开发定价算法。我们针对无限库存和有限库存两种设置分别提出了基于 BO 的算法,并为这两种情形给出了遗憾 (regret) 保证,从而量化了我们方法的学习效率。通过大量实验,我们证明基于 BO 的方法在收入方面优于几种最先进的 RL 算法,同时需要更少的假设并具有更强的鲁棒性。这表明贝叶斯优化是在复杂、不确定的环境中进行动态定价的强大而实用的工具。
A Task-Efficient Reinforcement Learning Task-Motion Planner for Safe Human-Robot Cooperation
一种用于安全人机协作的任务高效强化学习任务运动规划器
- Authors: Gaoyuan Liu, Joris de Winter, Kelly Merckaert, Denis Steckelmacher, Ann Nowe, Bram Vanderborght
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.12477
- Pdf link: https://arxiv.org/pdf/2510.12477
- Abstract
In a Human-Robot Cooperation (HRC) environment, safety and efficiency are the two core properties for evaluating robot performance. However, safety mechanisms usually hinder task efficiency since human intervention will cause backup motions and goal failures of the robot. Frequent motion replanning will increase the computational load and the chance of failure. In this paper, we present a hybrid Reinforcement Learning (RL) planning framework that comprises an interactive motion planner and an RL task planner. The RL task planner attempts to choose statistically safe and efficient task sequences based on the feedback from the motion planner, while the motion planner keeps the task execution process collision-free by detecting human arm motions and deploying new paths when the previous path is no longer valid. Intuitively, the RL agent will learn to avoid dangerous tasks, while the motion planner ensures that the chosen tasks are safe. The proposed framework is validated on the cobot in both simulation and the real world, where we compare the planner with hard-coded task motion planning methods. The results show that our planning framework can 1) react to uncertain human motions at both joint and task levels; 2) reduce the number of repeated failed goal commands; 3) reduce the total number of replanning requests.
- 中文摘要
在人机协作 (HRC) 环境中,安全性和效率是评估机器人性能的两个核心属性。然而,安全机制通常会降低任务效率,因为人的介入会导致机器人的回退动作和目标失败。频繁的运动重规划会增加计算负载和失败的概率。在本文中,我们提出了一个混合强化学习 (RL) 规划框架,由交互式运动规划器和 RL 任务规划器组成。RL 任务规划器根据运动规划器的反馈,尝试选择统计意义上安全且高效的任务序列;运动规划器则通过检测人的手臂运动,并在原有路径不再有效时部署新路径,使任务执行过程保持无碰撞。直观地讲,RL 智能体将学会避开危险任务,而运动规划器确保所选任务是安全的。所提出的框架在协作机器人上进行了仿真和真实环境验证,并与硬编码的任务运动规划方法进行了比较。结果表明,我们的规划框架可以:1) 在关节层面和任务层面对不确定的人体运动做出反应;2) 减少重复下发失败目标命令的次数;3) 减少重规划请求的总数。
Inclusive Fitness as a Key Step Towards More Advanced Social Behaviors in Multi-Agent Reinforcement Learning Settings
内含适应度 (inclusive fitness) 是多智能体强化学习环境中迈向更高级社会行为的关键一步
- Authors: Andries Rosseau, Raphaël Avalos, Ann Nowé
- Subjects:
Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Social and Information Networks (cs.SI)
- Arxiv link: https://arxiv.org/abs/2510.12555
- Pdf link: https://arxiv.org/pdf/2510.12555
- Abstract
The competitive and cooperative forces of natural selection have driven the evolution of intelligence for millions of years, culminating in nature's vast biodiversity and the complexity of human minds. Inspired by this process, we propose a novel multi-agent reinforcement learning framework where each agent is assigned a genotype and where reward functions are modelled after the concept of inclusive fitness. An agent's genetic material may be shared with other agents, and our inclusive reward function naturally accounts for this. We study the resulting social dynamics in two types of network games with prisoner's dilemmas and find that our results align with well-established principles from biology, such as Hamilton's rule. Furthermore, we outline how this framework can extend to more open-ended environments with spatial and temporal structure, finite resources, and evolving populations. We hypothesize the emergence of an arms race of strategies, where each new strategy is a gradual improvement over earlier adaptations of other agents, effectively producing a multi-agent autocurriculum analogous to biological evolution. In contrast to the binary team-based structures prevalent in earlier research, our gene-based reward structure introduces a spectrum of cooperation ranging from full adversity to full cooperativeness based on genetic similarity, enabling unique non-team-based social dynamics. For example, one agent can have a mutually cooperative relationship with two other agents while those two agents behave adversarially towards each other. We argue that incorporating inclusive fitness in agents provides a foundation for the emergence of more strategically advanced and socially intelligent agents.
- 中文摘要
数百万年来,自然选择中的竞争与合作力量推动了智能的进化,最终造就了自然界丰富的生物多样性和人类心智的复杂性。受这一过程的启发,我们提出了一种新颖的多智能体强化学习框架,其中每个智能体都被赋予一个基因型,奖励函数则依据内含适应度 (inclusive fitness) 的概念来建模。一个智能体的遗传物质可能与其他智能体共享,我们的内含奖励函数自然地考虑了这一点。我们研究了两类带有囚徒困境的网络博弈中产生的社会动态,发现结果与生物学中公认的原理(例如汉密尔顿法则)一致。此外,我们概述了该框架如何扩展到具有空间和时间结构、有限资源和演化种群的更开放的环境。我们假设会出现一场策略上的军备竞赛,其中每一种新策略都是对其他智能体早先适应的逐步改进,从而有效地产生一种类似于生物进化的多智能体自动课程。与早期研究中普遍采用的二元团队结构不同,我们基于基因的奖励结构引入了一个合作谱系,根据遗传相似度从完全对抗到完全合作,从而产生独特的、不以团队为基础的社会动态。例如,一个智能体可以与另外两个智能体保持相互合作的关系,而这两个智能体彼此之间却相互对抗。我们认为,在智能体中引入内含适应度,为出现策略上更先进、社会智能更高的智能体奠定了基础。
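The abstract describes reward functions modelled on inclusive fitness and reports agreement with Hamilton's rule. As a rough illustration only, the sketch below assumes the inclusive reward is a relatedness-weighted sum of individual rewards; the paper's actual reward definition is not given in the abstract. The names `inclusive_rewards`, `hamilton_favors_helping`, and the relatedness matrix are hypothetical.

```python
def inclusive_rewards(relatedness, rewards):
    """Return each agent's relatedness-weighted sum of all agents' rewards.

    relatedness[i][j] is a hypothetical genetic-similarity weight that agent i
    places on agent j's reward (1.0 for itself in this sketch).
    """
    return [
        sum(r_ij * reward_j for r_ij, reward_j in zip(row, rewards))
        for row in relatedness
    ]


def hamilton_favors_helping(r, benefit, cost):
    """Hamilton's rule: helping a relative is favored when r * benefit > cost."""
    return r * benefit > cost


# Assumed example: agent 0 is related to agents 1 and 2,
# while agents 1 and 2 are unrelated to each other (weight 0.0).
relatedness = [
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.0],
    [0.5, 0.0, 1.0],
]
individual = [2.0, 4.0, 6.0]

print(inclusive_rewards(relatedness, individual))  # [7.0, 5.0, 7.0]
print(hamilton_favors_helping(0.5, 3.0, 1.0))      # True
```

A zero weight here only models indifference between agents 1 and 2. Representing the adversarial end of the spectrum mentioned in the abstract would presumably need negative weights, which is a further assumption.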
CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving
CoIRL-AD:自动驾驶潜在世界模型中的协作-竞争模仿-强化学习
- Authors: Xiaoji Zheng, Ziyuan Yang, Yanhao Chen, Yuhang Peng, Yuanrong Tang, Gengyuan Liu, Bokui Chen, Jiangtao Gong
- Subjects:
Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.12560
- Pdf link: https://arxiv.org/pdf/2510.12560
- Abstract
End-to-end autonomous driving models trained solely with imitation learning (IL) often suffer from poor generalization. In contrast, reinforcement learning (RL) promotes exploration through reward maximization but faces challenges such as sample inefficiency and unstable convergence. A natural solution is to combine IL and RL. Moving beyond the conventional two-stage paradigm (IL pretraining followed by RL fine-tuning), we propose CoIRL-AD, a competitive dual-policy framework that enables IL and RL agents to interact during training. CoIRL-AD introduces a competition-based mechanism that facilitates knowledge exchange while preventing gradient conflicts. Experiments on the nuScenes dataset show an 18% reduction in collision rate compared to baselines, along with stronger generalization and improved performance on long-tail scenarios. Code is available at: this https URL.
- 中文摘要
仅通过模仿学习 (IL) 训练的端到端自动驾驶模型通常存在泛化性差的问题。相比之下,强化学习(RL)通过奖励最大化来促进探索,但面临样本效率低下和收敛不稳定等挑战。一个自然的解决方案是结合 IL 和 RL。超越传统的两阶段范式(IL预训练,然后RL微调),我们提出了CoIRL-AD,这是一种竞争性双策略框架,使IL和RL代理能够在训练过程中进行交互。CoIRL-AD 引入了一种基于竞争的机制,促进知识交流,同时防止梯度冲突。在 nuScenes 数据集上的实验表明,与基线相比,碰撞率降低了 18%,同时在长尾场景下具有更强的泛化性和改进的性能。代码可在以下位置获得:此 https URL。
Laminar: A Scalable Asynchronous RL Post-Training Framework
Laminar:可扩展的异步 RL 后训练框架
- Authors: Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, Chuan Wu
- Subjects:
Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
- Arxiv link: https://arxiv.org/abs/2510.12633
- Pdf link: https://arxiv.org/pdf/2510.12633
- Abstract
Reinforcement learning (RL) post-training for Large Language Models (LLMs) is now scaling to large clusters and running for extended durations to enhance model reasoning performance. However, the scalability of existing RL frameworks is limited, as extreme long-tail skewness in RL trajectory generation causes severe GPU underutilization. Current asynchronous RL systems attempt to mitigate this, but they rely on global weight synchronization between the actor and all rollouts, which creates a rigid model update schedule. This global synchronization is ill-suited for the highly skewed and evolving distribution of trajectory generation latency in RL training, crippling training efficiency. Our key insight is that efficient scaling requires breaking this lockstep through trajectory-level asynchrony, which generates and consumes each trajectory independently. We propose Laminar, a scalable and robust RL post-training system built on a fully decoupled architecture. First, we replace global updates with a tier of relay workers acting as a distributed parameter service. This enables asynchronous and fine-grained weight synchronization, allowing rollouts to pull the latest weight anytime without stalling the actor's training loop. Second, a dynamic repack mechanism consolidates long-tail trajectories onto a few dedicated rollouts, maximizing generation throughput. The fully decoupled design also isolates failures, ensuring robustness for long-running jobs. Our evaluation on a 1024-GPU cluster shows that Laminar achieves up to 5.48$\times$ training throughput speedup over state-of-the-art systems, while reducing model convergence time.
- 中文摘要
面向大型语言模型 (LLM) 的强化学习 (RL) 后训练如今正扩展到大型集群并长时间运行,以提升模型的推理性能。然而,现有 RL 框架的可扩展性有限,因为 RL 轨迹生成中极端的长尾偏态会导致 GPU 利用率严重不足。当前的异步 RL 系统试图缓解这一问题,但它们依赖于 actor 与所有 rollout 之间的全局权重同步,这造成了僵化的模型更新节奏。这种全局同步并不适合 RL 训练中高度偏斜且不断变化的轨迹生成延迟分布,严重损害了训练效率。我们的关键见解是,高效的扩展需要通过轨迹级异步来打破这种锁步,即独立地生成和消费每条轨迹。我们提出了 Laminar,这是一个基于完全解耦架构构建的可扩展且稳健的 RL 后训练系统。首先,我们用一层充当分布式参数服务的中继 worker 取代全局更新。这实现了异步且细粒度的权重同步,使 rollout 可以随时拉取最新权重,而不会阻塞 actor 的训练循环。其次,动态重打包 (repack) 机制将长尾轨迹集中到少数专用 rollout 上,从而最大化生成吞吐量。完全解耦的设计还可以隔离故障,确保长时间运行作业的稳健性。我们在 1024-GPU 集群上的评估表明,与最先进的系统相比,Laminar 实现了最高 5.48$\times$ 的训练吞吐量加速,同时缩短了模型收敛时间。
Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks
记忆即行动:面向长时程智能体任务的自主上下文管理
- Authors: Yuxiang Zhang, Jiangming Shu, Ye Ma, Xueyuan Lin, Shangxi Wu, Jitao Sang
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.12635
- Pdf link: https://arxiv.org/pdf/2510.12635
- Abstract
Large Language Models face challenges in long-horizon agentic tasks as their constrained memory is easily overwhelmed by distracting or irrelevant context. Existing working memory methods typically rely on external, heuristic mechanisms that are decoupled from the agent's core policy. In this work, we reframe working memory management as a learnable, intrinsic capability. We propose a novel framework, Memory-as-Action, where an agent actively manages its working memory by executing explicit editing operations as part of a unified policy. This formulation allows an agent, trained via reinforcement learning, to balance memory curation against long-term task objectives under given resource constraints. However, such memory editing actions break the standard assumption of a continuously growing prefix in LLM interactions, leading to what we call trajectory fractures. These non-prefix changes disrupt the causal continuity required by standard policy gradient methods, making those methods inapplicable. To address this, we propose a new algorithm, Dynamic Context Policy Optimization, which enables stable end-to-end reinforcement learning by segmenting trajectories at memory action points and applying trajectory-level advantages to the resulting action segments. Our results demonstrate that jointly optimizing for task reasoning and memory management in an end-to-end fashion not only reduces overall computational consumption but also improves task performance, driven by adaptive context curation strategies tailored to the model's intrinsic capabilities.
- 中文摘要
大型语言模型在长时程智能体任务中面临挑战,因为其有限的记忆很容易被干扰性或不相关的上下文淹没。现有的工作记忆方法通常依赖与智能体核心策略解耦的外部启发式机制。在这项工作中,我们将工作记忆管理重新定义为一种可学习的内在能力。我们提出了一个新颖的框架 Memory-as-Action(记忆即行动),其中智能体通过执行显式的编辑操作来主动管理其工作记忆,而这些操作是统一策略的一部分。这种建模方式使通过强化学习训练的智能体能够在给定的资源约束下,在记忆管理与长期任务目标之间取得平衡。然而,这类记忆编辑动作打破了 LLM 交互中前缀持续增长的标准假设,导致了我们所说的轨迹断裂。这些非前缀式的改动破坏了标准策略梯度方法所要求的因果连续性,使这些方法不再适用。为了解决这个问题,我们提出了一种新算法,即动态上下文策略优化 (Dynamic Context Policy Optimization),它在记忆动作点处切分轨迹,并将轨迹级优势应用于切分得到的动作片段,从而实现稳定的端到端强化学习。我们的结果表明,以端到端的方式联合优化任务推理和记忆管理,不仅降低了总体计算开销,还提升了任务性能,这得益于与模型内在能力相适应的自适应上下文管理策略。
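The abstract says Dynamic Context Policy Optimization segments trajectories at memory action points and applies trajectory-level advantages to the resulting segments. The sketch below illustrates only that bookkeeping and is not the authors' implementation. The step format, the `"memory_edit"` label, and the choice to let a memory edit close the segment it appears in are all assumptions.

```python
def segment_at_memory_actions(steps):
    """Split a trajectory into segments, closing a segment after each memory edit.

    Each step is a dict with a hypothetical "kind" field. After a memory edit the
    context is no longer a growing prefix, so later steps start a new segment.
    """
    segments, current = [], []
    for step in steps:
        current.append(step)
        if step["kind"] == "memory_edit":
            segments.append(current)
            current = []
    if current:  # trailing steps after the last memory edit
        segments.append(current)
    return segments


def assign_advantages(segments, trajectory_advantage):
    """Give every step in every segment the same trajectory-level advantage."""
    return [[trajectory_advantage for _ in segment] for segment in segments]


steps = [
    {"kind": "tool_call"},
    {"kind": "memory_edit"},
    {"kind": "tool_call"},
    {"kind": "answer"},
]
segments = segment_at_memory_actions(steps)
print([len(segment) for segment in segments])  # [2, 2]
print(assign_advantages(segments, 0.5))        # [[0.5, 0.5], [0.5, 0.5]]
```

Each segment could then be scored by a policy-gradient loss against its own edited context, since within a segment the prefix grows in the usual way.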
Expert or not? assessing data quality in offline reinforcement learning
专家与否?评估离线强化学习中的数据质量
- Authors: Arip Asadulaev, Fakhri Karray, Martin Takac
- Subjects:
Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2510.12638
- Pdf link: https://arxiv.org/pdf/2510.12638
- Abstract
Offline reinforcement learning (RL) learns exclusively from static datasets, without further interaction with the environment. In practice, such datasets vary widely in quality, often mixing expert, suboptimal, and even random trajectories. The choice of algorithm therefore depends on dataset fidelity. Behavior cloning can suffice on high-quality data, whereas mixed- or low-quality data typically benefits from offline RL methods that stitch useful behavior across trajectories. Yet in the wild it is difficult to assess dataset quality a priori because the data's provenance and skill composition are unknown. We address the problem of estimating offline dataset quality without training an agent. We study a spectrum of proxies from simple cumulative rewards to learned value based estimators, and introduce the Bellman Wasserstein distance (BWD), a value aware optimal transport score that measures how dissimilar a dataset's behavioral policy is from a random reference policy. BWD is computed from a behavioral critic and a state conditional OT formulation, requiring no environment interaction or full policy optimization. Across D4RL MuJoCo tasks, BWD strongly correlates with an oracle performance score that aggregates multiple offline RL algorithms, enabling efficient prediction of how well standard agents will perform on a given dataset. Beyond prediction, integrating BWD as a regularizer during policy optimization explicitly pushes the learned policy away from random behavior and improves returns. These results indicate that value aware, distributional signals such as BWD are practical tools for triaging offline RL datasets and policy optimization.
- 中文摘要
离线强化学习 (RL) 仅从静态数据集中学习,不再与环境交互。在实践中,此类数据集的质量差异很大,往往混合了专家轨迹、次优轨迹甚至随机轨迹。因此,算法的选择取决于数据集的质量。对于高质量数据,行为克隆就已足够;而混合质量或低质量的数据通常更适合使用能够跨轨迹拼接有用行为的离线 RL 方法。然而,在实际应用中很难事先评估数据集质量,因为数据的来源和技能构成是未知的。我们研究在不训练智能体的情况下估计离线数据集质量的问题。我们考察了从简单的累积奖励到学习得到的基于价值的估计器等一系列代理指标,并提出了 Bellman Wasserstein 距离 (BWD),这是一种价值感知的最优传输分数,用于衡量数据集的行为策略与随机参考策略之间的差异程度。BWD 由行为 critic 和以状态为条件的最优传输 (OT) 形式计算得到,不需要环境交互,也不需要完整的策略优化。在 D4RL MuJoCo 任务上,BWD 与一个汇总了多种离线 RL 算法的 oracle 性能分数高度相关,从而可以高效预测标准智能体在给定数据集上的表现。除了预测之外,在策略优化中将 BWD 用作正则项,可以显式地使学习到的策略远离随机行为并提高回报。这些结果表明,BWD 这类价值感知的分布性信号是对离线 RL 数据集进行筛查以及用于策略优化的实用工具。
Reasoning Pattern Matters: Learning to Reason without Human Rationales
推理模式很重要:在没有人工推理依据的情况下学习推理
- Authors: Chaoxu Pang, Yixuan Cao, Ping Luo
- Subjects:
Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.12643
- Pdf link: https://arxiv.org/pdf/2510.12643
- Abstract
Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities under the widely adopted SFT+RLVR paradigm, which first performs Supervised Fine-Tuning (SFT) on human-annotated reasoning trajectories (rationales) to establish initial reasoning behaviors, then applies Reinforcement Learning with Verifiable Rewards (RLVR) to optimize the model using verifiable signals without golden rationales. However, annotating high-quality rationales for the SFT stage remains prohibitively expensive. This paper investigates when and how rationale annotation costs can be substantially reduced without compromising reasoning performance. We identify a broad class of problems, termed patterned reasoning tasks, where reasoning follows a fixed, procedural strategy consistent across instances. Although instances vary in content such as domain knowledge, factual information, or numeric values, the solution derives from applying a shared reasoning pattern. We argue that the success of SFT+RLVR on such tasks primarily stems from its ability to enable models to internalize these reasoning patterns. Using numerical semantic matching as a representative task, we provide both causal and behavioral evidence showing that reasoning patterns rather than the quantity or quality of rationales are the key determinant of performance. Building on these insights, we propose Pattern-Aware LLMs as Rationale AnnOtators (PARO), a simple yet effective framework that enables LLMs to generate rationales aligned with task-specific reasoning patterns without requiring human rationale annotations. Experiments show that PARO-generated rationales achieve comparable SFT+RLVR performance to human rationales that are 10 times larger. These results suggest that large-scale human rationale annotations can be replaced with LLM-based automatic annotations requiring only limited human supervision over reasoning patterns.
- 中文摘要
大型语言模型 (LLM) 在被广泛采用的 SFT+RLVR 范式下展现出了卓越的推理能力。该范式首先在人工标注的推理轨迹(即推理依据,rationales)上进行监督微调 (SFT) 以建立初始推理行为,然后应用可验证奖励强化学习 (RLVR),在没有标准推理依据的情况下利用可验证信号优化模型。然而,为 SFT 阶段标注高质量的推理依据仍然极其昂贵。本文研究在何时以及如何能够在不损害推理性能的前提下大幅降低推理依据的标注成本。我们识别出一大类问题,称为模式化推理任务,其中推理遵循一种在各实例间保持一致的固定程序化策略。尽管各实例在领域知识、事实信息或数值等内容上有所不同,但解答都来自对同一推理模式的应用。我们认为,SFT+RLVR 在此类任务上的成功主要源于它能使模型内化这些推理模式。以数值语义匹配作为代表性任务,我们给出了因果和行为两方面的证据,表明决定性能的关键因素是推理模式,而不是推理依据的数量或质量。基于这些见解,我们提出了 Pattern-Aware LLMs as Rationale AnnOtators (PARO),这是一个简单而有效的框架,使 LLM 无需人工标注的推理依据即可生成符合任务特定推理模式的推理依据。实验表明,使用 PARO 生成的推理依据,可以达到与使用规模为其 10 倍的人工推理依据相当的 SFT+RLVR 性能。这些结果表明,大规模的人工推理依据标注可以被基于 LLM 的自动标注取代,只需对推理模式进行有限的人工监督。
Autonomous Legged Mobile Manipulation for Lunar Surface Operations via Constrained Reinforcement Learning
通过约束强化学习实现面向月球表面作业的自主腿式移动操作
- Authors: Alvaro Belmonte-Baeza, Miguel Cazorla, Gabriel J. García, Carlos J. Pérez-Del-Pulgar, Jorge Pomares
- Subjects:
Robotics (cs.RO); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2510.12684
- Pdf link: https://arxiv.org/pdf/2510.12684
- Abstract
Robotics plays a pivotal role in planetary science and exploration, where autonomous and reliable systems are crucial due to the risks and challenges inherent to space environments. The establishment of permanent lunar bases demands robotic platforms capable of navigating and manipulating in the harsh lunar terrain. While wheeled rovers have been the mainstay for planetary exploration, their limitations in unstructured and steep terrains motivate the adoption of legged robots, which offer superior mobility and adaptability. This paper introduces a constrained reinforcement learning framework designed for autonomous quadrupedal mobile manipulators operating in lunar environments. The proposed framework integrates whole-body locomotion and manipulation capabilities while explicitly addressing critical safety constraints, including collision avoidance, dynamic stability, and power efficiency, in order to ensure robust performance under lunar-specific conditions, such as reduced gravity and irregular terrain. Experimental results demonstrate the framework's effectiveness in achieving precise 6D task-space end-effector pose tracking, achieving an average positional accuracy of 4 cm and orientation accuracy of 8.1 degrees. The system consistently respects both soft and hard constraints, exhibiting adaptive behaviors optimized for lunar gravity conditions. This work effectively bridges adaptive learning with essential mission-critical safety requirements, paving the way for advanced autonomous robotic explorers for future lunar missions.
- 中文摘要
机器人技术在行星科学和探索中发挥着举足轻重的作用,由于太空环境固有的风险和挑战,自主和可靠的系统至关重要。永久性月球基地的建立需要能够在恶劣的月球地形中导航和操作的机器人平台。虽然轮式漫游车一直是行星探索的主力,但它们在非结构化和陡峭地形中的局限性促使人们采用具有更优机动性和适应性的腿式机器人。本文介绍了一种约束强化学习框架,该框架专为在月球环境中运行的自主四足移动机械臂而设计。所提出的框架集成了全身运动和操作能力,同时明确处理关键的安全约束,包括避免碰撞、动态稳定性和能效,以确保在低重力和不规则地形等月球特定条件下的稳健性能。实验结果表明,该框架能够有效实现精确的 6D 任务空间末端执行器位姿跟踪,平均位置精度为 4 cm,姿态精度为 8.1 度。该系统始终遵守软约束和硬约束,表现出针对月球重力条件优化的自适应行为。这项工作有效地将自适应学习与任务关键的安全要求结合起来,为未来月球任务中先进的自主机器人探索者铺平了道路。
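The abstract above reports 6D end-effector tracking accuracy as a positional error (4 cm) and an orientation error (8.1 degrees). As a rough illustration of how such errors are commonly computed, the sketch below uses Euclidean distance for position and the geodesic angle between unit quaternions for orientation. This is an assumption for illustration only: the abstract does not state the paper's exact metric, and the function names and the `(w, x, y, z)` quaternion convention are hypothetical.

```python
import math


def position_error(p, p_ref):
    """Euclidean distance between two 3D positions (same unit as the inputs)."""
    return math.dist(p, p_ref)


def orientation_error_deg(q, q_ref):
    """Geodesic angle in degrees between two quaternions given as (w, x, y, z).

    Inputs are normalized first. The absolute value of the dot product
    accounts for q and -q describing the same rotation.
    """
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_r = math.sqrt(sum(a * a for a in q_ref))
    dot = abs(sum(a * b for a, b in zip(q, q_ref))) / (norm_q * norm_r)
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))


if __name__ == "__main__":
    # Hypothetical poses: 5 cm position offset, 90 degree yaw offset.
    print(position_error((0.0, 0.0, 0.0), (0.03, 0.04, 0.0)))
    half = math.radians(45.0)
    print(orientation_error_deg((math.cos(half), 0.0, 0.0, math.sin(half)),
                                (1.0, 0.0, 0.0, 0.0)))
```

Averaging these two quantities over a set of commanded poses would give figures comparable in form to those quoted in the abstract, under the stated assumptions.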
ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning
ERA:通过具身先验学习和在线强化学习将VLM转化为具身智能体
- Authors: Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, Mengchao Zhang, Jose Barreiros, Aykut Onol, ChengXiang Zhai, Heng Ji, Manling Li, Huan Zhang, Tong Zhang
- Subjects:
Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2510.12693
- Pdf link: https://arxiv.org/pdf/2510.12693
- Abstract
Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing systems rely on large-scale models that are costly to deploy, while smaller VLMs lack the necessary knowledge and skills to succeed. To bridge this gap, we present Embodied Reasoning Agent (ERA), a two-stage framework that integrates prior knowledge learning and online reinforcement learning (RL). The first stage, Embodied Prior Learning, distills foundational knowledge from three types of data: (1) Trajectory-Augmented Priors, which enrich existing trajectory data with structured reasoning generated by stronger models; (2) Environment-Anchored Priors, which provide in-environment knowledge and grounding supervision; and (3) External Knowledge Priors, which transfer general knowledge from out-of-environment datasets. In the second stage, we develop an online RL pipeline that builds on these priors to further enhance agent performance. To overcome the inherent challenges in agent RL, including long horizons, sparse rewards, and training instability, we introduce three key designs: self-summarization for context management, dense reward shaping, and turn-level policy optimization. Extensive experiments on both high-level planning (EB-ALFRED) and low-level control (EB-Manipulation) tasks demonstrate that ERA-3B surpasses both prompting-based large models and previous training-based baselines. Specifically, it achieves overall improvements of 8.4% on EB-ALFRED and 19.4% on EB-Manipulation over GPT-4o, and exhibits strong generalization to unseen tasks. Overall, ERA offers a practical path toward scalable embodied intelligence, providing methodological insights for future embodied AI systems.
- 中文摘要
具身人工智能的最新进展凸显了视觉语言模型 (VLM) 作为能够在复杂环境中进行感知、推理和交互的智能体的潜力。然而,性能最佳的系统依赖于部署成本高昂的大型模型,而较小的 VLM 则缺乏成功所需的知识和技能。为了弥合这一差距,我们提出了具身推理智能体 (Embodied Reasoning Agent, ERA),这是一个集成了先验知识学习和在线强化学习 (RL) 的两阶段框架。第一阶段为具身先验学习,从三种类型的数据中提炼基础知识:(1) 轨迹增强先验,通过更强大的模型生成的结构化推理来丰富现有的轨迹数据;(2) 环境锚定先验,提供环境内知识和 grounding 监督;(3) 外部知识先验,从环境外数据集中迁移通用知识。在第二阶段,我们开发了一个在线 RL 流程,该流程建立在这些先验的基础上,以进一步提高智能体性能。为了克服智能体 RL 的固有挑战,包括长时程、稀疏奖励和训练不稳定性,我们引入了三个关键设计:用于上下文管理的自我总结、密集奖励塑造和回合级策略优化。在高层规划 (EB-ALFRED) 和低层控制 (EB-Manipulation) 任务上的大量实验表明,ERA-3B 超越了基于提示的大模型和以往基于训练的基线。具体来说,与 GPT-4o 相比,它在 EB-ALFRED 上实现了 8.4% 的整体提升,在 EB-Manipulation 上实现了 19.4% 的整体提升,并且对未见过的任务表现出很强的泛化能力。总体而言,ERA 为实现可扩展的具身智能提供了一条实用的途径,为未来的具身人工智能系统提供了方法论见解。
Reflection-Based Task Adaptation for Self-Improving VLA
基于反思的自强VLA任务自适应
- Authors: Baicheng Li, Dong Wu, Zike Yan, Xinchen Liu, Zecui Zeng, Lusong Li, Hongbin Zha
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.12710
- Pdf link: https://arxiv.org/pdf/2510.12710
- Abstract
Pre-trained Vision-Language-Action (VLA) models represent a major leap towards general-purpose robots, yet efficiently adapting them to novel, specific tasks in-situ remains a significant hurdle. While reinforcement learning (RL) is a promising avenue for such adaptation, the process often suffers from low efficiency, hindering rapid task mastery. We introduce Reflective Self-Adaptation, a framework for rapid, autonomous task adaptation without human intervention. Our framework establishes a self-improving loop where the agent learns from its own experience to enhance both strategy and execution. The core of our framework is a dual-pathway architecture that addresses the full adaptation lifecycle. First, a Failure-Driven Reflective RL pathway enables rapid learning by using the VLM's causal reasoning to automatically synthesize a targeted, dense reward function from failure analysis. This provides a focused learning signal that significantly accelerates policy exploration. However, optimizing such proxy rewards introduces a potential risk of "reward hacking," where the agent masters the reward function but fails the actual task. To counteract this, our second pathway, Success-Driven Quality-Guided SFT, grounds the policy in holistic success. It identifies and selectively imitates high-quality successful trajectories, ensuring the agent remains aligned with the ultimate task goal. This pathway is strengthened by a conditional curriculum mechanism to aid initial exploration. We conduct experiments in challenging manipulation tasks. The results demonstrate that our framework achieves faster convergence and higher final success rates compared to representative baselines. Our work presents a robust solution for creating self-improving agents that can efficiently and reliably adapt to new environments.
- 中文摘要
预训练的视觉-语言-动作 (VLA) 模型代表了向通用机器人迈出的重大一步,但让它们在现场高效地适应新的特定任务仍然是一个重大障碍。虽然强化学习 (RL) 是实现这种适应的一条有前途的途径,但该过程往往效率低下,阻碍了任务的快速掌握。我们提出了反思式自适应 (Reflective Self-Adaptation),这是一个无需人工干预即可快速、自主地适应任务的框架。我们的框架建立了一个自我改进的循环,智能体从自己的经验中学习,以同时改进策略和执行。我们框架的核心是覆盖整个适应生命周期的双路径架构。首先,失败驱动的反思式 RL 路径利用 VLM 的因果推理,从失败分析中自动合成有针对性的密集奖励函数,从而实现快速学习。这提供了一个有针对性的学习信号,显著加速了策略探索。然而,优化此类代理奖励会带来潜在的“奖励黑客” (reward hacking) 风险,即智能体掌握了奖励函数,却未能完成实际任务。为了解决这个问题,我们的第二条路径,即成功驱动的质量引导 SFT,将策略锚定在整体成功之上。它识别并有选择地模仿高质量的成功轨迹,确保智能体与最终任务目标保持一致。这条路径还通过有条件的课程机制得到加强,以帮助初期探索。我们在具有挑战性的操作任务中进行了实验。结果表明,与代表性基线相比,我们的框架实现了更快的收敛和更高的最终成功率。我们的工作为创建能够高效可靠地适应新环境的自我改进智能体提供了一个稳健的解决方案。
Residual MPC: Blending Reinforcement Learning with GPU-Parallelized Model Predictive Control
残差 MPC:将强化学习与 GPU 并行模型预测控制相结合
- Authors: Se Hwan Jeon, Ho Jae Lee, Seungwoo Hong, Sangbae Kim
- Subjects:
Robotics (cs.RO)
- Arxiv link: https://arxiv.org/abs/2510.12717
- Pdf link: https://arxiv.org/pdf/2510.12717
- Abstract
Model Predictive Control (MPC) provides interpretable, tunable locomotion controllers grounded in physical models, but its robustness depends on frequent replanning and is limited by model mismatch and real-time computational constraints. Reinforcement Learning (RL), by contrast, can produce highly robust behaviors through stochastic training but often lacks interpretability, suffers from out-of-distribution failures, and requires intensive reward engineering. This work presents a GPU-parallelized residual architecture that tightly integrates MPC and RL by blending their outputs at the torque-control level. We develop a kinodynamic whole-body MPC formulation evaluated across thousands of agents in parallel at 100 Hz for RL training. The residual policy learns to make targeted corrections to the MPC outputs, combining the interpretability and constraint handling of model-based control with the adaptability of RL. The model-based control prior acts as a strong bias, initializing and guiding the policy towards desirable behavior with a simple set of rewards. Compared to standalone MPC or end-to-end RL, our approach achieves higher sample efficiency, converges to greater asymptotic rewards, expands the range of trackable velocity commands, and enables zero-shot adaptation to unseen gaits and uneven terrain.
- 中文摘要
模型预测控制 (MPC) 提供基于物理模型的可解释、可调的运动控制器,但其鲁棒性取决于频繁的重新规划,并受到模型失配和实时计算约束的限制。相比之下,强化学习 (RL) 可以通过随机化训练产生高度鲁棒的行为,但通常缺乏可解释性,容易出现分布外失效,并且需要大量的奖励工程。这项工作提出了一种 GPU 并行的残差架构,它通过在力矩控制层面融合 MPC 和 RL 的输出来紧密集成两者。我们开发了一种运动动力学 (kinodynamic) 全身 MPC 公式,可在数千个智能体上以 100 Hz 并行求解,用于 RL 训练。残差策略学习对 MPC 输出进行有针对性的修正,将基于模型的控制的可解释性和约束处理能力与 RL 的适应性相结合。基于模型的控制先验充当强偏置,只需一组简单的奖励即可初始化策略并引导其产生理想的行为。与独立的 MPC 或端到端 RL 相比,我们的方法实现了更高的样本效率,收敛到更高的渐近奖励,扩大了可跟踪速度指令的范围,并能够零样本适应未见过的步态和不平坦的地形。
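The abstract above describes a residual policy that corrects MPC outputs, with the two blended at the torque-control level. A minimal sketch of one way such blending could look is given below. It assumes a purely additive residual, a policy output normalized to [-1, 1], and hypothetical `residual_scale` and `tau_limit` parameters. None of these details come from the paper, which may blend the signals differently.

```python
def blend_torques(tau_mpc, residual, residual_scale=0.2, tau_limit=40.0):
    """Add a bounded learned correction to per-joint MPC torques.

    tau_mpc:        per-joint torques from the model-based controller (N*m)
    residual:       per-joint policy outputs, assumed to lie in [-1, 1]
    residual_scale: fraction of tau_limit the policy may add (hypothetical)
    tau_limit:      symmetric actuator torque limit (hypothetical)
    """
    blended = []
    for t_mpc, r in zip(tau_mpc, residual):
        r = max(-1.0, min(1.0, r))  # keep the correction bounded
        tau = t_mpc + residual_scale * tau_limit * r
        # Saturate at the actuator limit after blending.
        blended.append(max(-tau_limit, min(tau_limit, tau)))
    return blended


if __name__ == "__main__":
    # Hypothetical two-joint example.
    print(blend_torques([10.0, -39.0], [0.5, -1.0]))
```

Bounding the residual to a fraction of the torque limit keeps the model-based command dominant, which matches the abstract's description of the MPC prior acting as a strong bias. The specific bound used here is illustrative.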
DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
DeepMMSearch-R1:在多模态 Web 搜索中赋能多模态 LLM
- Authors: Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan
- Subjects:
Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
- Arxiv link: https://arxiv.org/abs/2510.12801
- Pdf link: https://arxiv.org/pdf/2510.12801
- Abstract
Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval augmented generation (RAG) methods, search agents, and search equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold start supervised finetuning phase followed by an online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web-search.
- 中文摘要
实际应用中的多模态大型语言模型 (MLLM) 需要访问外部知识源,并且必须对动态和不断变化的现实世界信息保持响应,以解决信息搜索和知识密集型用户查询。现有方法,例如检索增强生成 (RAG) 方法、搜索代理和配备搜索的 MLLM,经常存在僵化的管道、过多的搜索调用和构建不良的搜索查询,从而导致效率低下和结果不理想。为了解决这些限制,我们推出了 DeepMMSearch-R1,这是第一个能够执行按需、多轮次 Web 搜索并为图像和文本搜索工具动态制作查询的多模态 LLM。具体来说,DeepMMSearch-R1 可以根据输入图像的相关裁剪发起网络搜索,使图像搜索更加有效,并且可以根据检索到的信息迭代调整文本搜索查询,从而实现自我反思和自我纠正。我们的方法依赖于两阶段的训练管道:冷启动监督微调阶段,然后是在线强化学习优化。对于训练,我们引入了 DeepMMSearchVQA,这是一种新颖的多模态 VQA 数据集,通过与来自网络搜索工具的真实信息混合的自动化管道创建。该数据集包含各种多跳查询,这些查询集成了文本和视觉信息,教导模型何时搜索、搜索什么、使用哪种搜索工具以及如何对检索到的信息进行推理。我们在一系列知识密集型基准测试中进行了广泛的实验,以证明我们方法的优越性。最后,我们分析了结果并提供了对推进多模态网络搜索有价值的见解。
Keyword: diffusion policy
There is no result