Arxiv Papers of Today

Generated: 2026-03-23 17:01:46 (UTC+8); arXiv release time: 2026-03-23 20:00 EDT (2026-03-24 08:00 UTC+8)

There are 35 related papers today.

Keyword: reinforcement learning

LARFT: Closing the Cognition-Action Gap for Length Instruction Following in Large Language Models

LARFT：弥合大型语言模型长度指令跟随中的认知-行动差距

Authors: Wei Zhang, Lintong Du, Yuanhe Zhang, Zhenhong Zhou, Kun Wang, Li Sun, Sen Su
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.19255
Pdf link: https://arxiv.org/pdf/2603.19255
Abstract Despite the strong performance of Large Language Models (LLMs) on complex instruction-following tasks, precise control of output length remains a persistent challenge. Existing methods primarily attempt to enforce length constraints by externally imposing length signals or optimization objectives, while largely overlooking the underlying limitation: the model's intrinsic deficit in length cognition. To address this, we propose LARFT (Length-Aware Reinforcement Fine-Tuning), a training framework that aligns the model's length cognition with its action. Specifically, LARFT integrates length-oriented reinforcement learning with a hindsight length awareness. By transforming on-policy data into hindsight self-awareness tasks where the model learns to identify the actual length of its own generation, LARFT jointly optimizes the model's internal representation of length information and refines its policy to satisfy length constraints, thereby achieving precise and reliable length instruction following. Extensive experiments across four base models demonstrate that LARFT outperforms existing baselines, achieving an average improvement of +20.92 points across three length instruction following benchmarks with only a marginal decline of -1.45 points on four general capability benchmarks.
中文摘要 尽管大型语言模型（LLMs）在复杂的指令跟随任务中表现优异，但对输出长度的精确控制仍是一个持续的挑战。现有方法主要通过外部施加长度信号或优化目标来强制执行长度约束，却在很大程度上忽视了根本的局限：模型在长度认知上的固有缺陷。为此，我们提出了LARFT（长度感知强化微调），这是一种使模型的长度认知与其行为保持一致的训练框架。具体来说，LARFT将面向长度的强化学习与事后长度感知相结合。通过将同策略（on-policy）数据转化为事后自我感知任务，让模型学会识别自身生成内容的实际长度，LARFT联合优化模型对长度信息的内部表示并改进其策略以满足长度约束，从而实现精确可靠的长度指令跟随。在四个基础模型上的大量实验表明，LARFT优于现有基线，在三个长度指令跟随基准上平均提升+20.92分，而在四个通用能力基准上仅有-1.45分的轻微下降。
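The LARFT abstract pairs a length-oriented reward with a hindsight task in which the model identifies the length of its own output. The sketch below illustrates that idea only; it is not the paper's formulation. The linear reward shape, the word-level length unit, and the names `length_reward` and `hindsight_example` are all assumptions made for illustration.

```python
def length_reward(target_len: int, actual_len: int) -> float:
    """Hypothetical reward in [0, 1]: 1.0 on an exact match,
    decreasing linearly with the relative length error."""
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    rel_err = abs(actual_len - target_len) / target_len
    return max(0.0, 1.0 - rel_err)


def hindsight_example(response: str) -> dict:
    """Turn an on-policy response into a self-awareness task whose
    label is the response's actual length (counted in words here,
    which is an assumption; the paper may count tokens)."""
    n_words = len(response.split())
    return {
        "prompt": "How many words does the following text contain?\n" + response,
        "label": n_words,
    }


if __name__ == "__main__":
    print(length_reward(100, 150))
    print(hindsight_example("a short generated answer")["label"])
```

In a training loop, the reward would score rollouts against the requested length, while the hindsight examples would be built from those same rollouts as an auxiliary supervised task.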

Probing to Refine: Reinforcement Distillation of LLMs via Explanatory Inversion

以探促精：通过解释反演对大型语言模型进行强化蒸馏

Authors: Zhen Tan, Chengshuai Zhao, Song Wang, Jundong Li, Tianlong Chen, Huan Liu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.19266
Pdf link: https://arxiv.org/pdf/2603.19266
Abstract Distilling robust reasoning capabilities from large language models (LLMs) into smaller, computationally efficient student models remains an unresolved challenge. Despite recent advances, distilled models frequently suffer from superficial pattern memorization and subpar generalization. To overcome these limitations, we introduce a novel distillation framework that moves beyond simple mimicry to instill a deeper conceptual understanding. Our framework features two key innovations. First, to address pattern memorization, Explanatory Inversion (EI) generates targeted "explanatory probes" that compel the student to articulate the underlying logic behind an answer, rather than just memorizing it. Second, to improve generalization, Explanatory GRPO (EXGRPO) uses a reinforcement learning algorithm with a novel Dialogue Structure Utility Bonus, which explicitly rewards the student for maintaining a coherent reasoning process across these probes. Extensive evaluations on 12 datasets demonstrate significant improvements. Using Gemma-7b as the student model, our method yields an average 20.39% increase over zero-shot performance and a 6.02% improvement over the state-of-the-art distillation baselines. Moreover, models distilled with our method show remarkable training efficiency (e.g., surpassing vanilla fine-tuning with 10-25% training data) and strong generalization to out-of-distribution tasks. Implementation is released at this https URL.
中文摘要 将大型语言模型（LLMs）的稳健推理能力蒸馏到更小、计算高效的学生模型中，仍是一个未解决的挑战。尽管近期有所进展，蒸馏后的模型常常存在表层模式记忆和泛化不足的问题。为克服这些局限，我们提出了一种超越简单模仿、旨在灌输更深层概念理解的新蒸馏框架。我们的框架有两项关键创新。第一，为了解决模式记忆问题，解释反演（EI）生成有针对性的“解释探针”，迫使学生模型阐述答案背后的逻辑，而不仅仅是死记硬背。第二，为了提高泛化性，解释性GRPO（EXGRPO）采用了一种带有新颖的对话结构效用奖励（Dialogue Structure Utility Bonus）的强化学习算法，明确奖励学生模型在这些探针之间保持连贯的推理过程。在12个数据集上的广泛评估显示出显著改进。以Gemma-7b为学生模型，我们的方法比零样本性能平均提升20.39%，比最先进的蒸馏基线提升6.02%。此外，采用我们方法蒸馏的模型表现出显著的训练效率（例如，仅用10-25%的训练数据即可超越普通微调），并且对分布外任务具有很强的泛化能力。实现代码发布于此 https URL。

Full-Stack Domain Enhancement for Combustion LLMs: Construction and Optimization

面向燃烧领域大型语言模型的全栈领域增强：构建与优化

Authors: Quanjia Xiao, Weimin Ouyang, Zonglin Yang, Tianhao Wu, Qingguo Zhou, Runze Mao, Zhi X. Chen
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.19268
Pdf link: https://arxiv.org/pdf/2603.19268
Abstract Large language models (LLMs) in the direction of task adaptation and capability enhancement for professional fields demonstrate significant application potential. Nevertheless, for complex physical systems such as combustion science, general-purpose LLMs often generate severe hallucinations due to insufficient domain knowledge and the inability to adhere to physical conservation laws. To address this issue, we propose the first full-stack domain-enhanced LLM workflow tailored for the field of combustion science, which integrates automated domain corpus construction, incremental pre-training, instruction fine-tuning, and verifiable reward-based reinforcement learning. This workflow ensures that the model truly internalizes physical laws rather than merely learning textual statistical patterns. We also release FlameBench, a standardized evaluation benchmark specifically designed for complex reasoning tasks in combustion science. Experimental results demonstrate that the model developed in this work significantly outperforms state-of-the-art general-purpose closed-source models and traditional retrieval-augmented generation methods on combustion science reasoning tasks. This work lays a solid technical and resource foundation for the subsequent development of domain-specific scientific research agents with reliable scientific reasoning capabilities.
中文摘要 面向专业领域进行任务适配和能力增强的大型语言模型（LLMs）展现出显著的应用潜力。然而，对于燃烧科学等复杂物理系统，通用大型语言模型常因领域知识不足和无法遵守物理守恒定律而产生严重的幻觉。为解决这一问题，我们提出了首个专为燃烧科学领域量身定制的全栈领域增强LLM工作流程，集成了自动化领域语料库构建、增量预训练、指令微调以及基于可验证奖励的强化学习。这一工作流程确保模型真正内化物理定律，而不仅仅是学习文本统计模式。我们还发布了FlameBench，这是一个专门为燃烧科学中的复杂推理任务设计的标准化评估基准。实验结果表明，本研究开发的模型在燃烧科学推理任务上显著优于最先进的通用闭源模型和传统的检索增强生成方法。这项工作为后续开发具有可靠科学推理能力的领域专用科研智能体奠定了坚实的技术和资源基础。

MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

MemReward：基于图的经验记忆，用于有限标签下的LLM奖励预测

Authors: Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.19310
Pdf link: https://arxiv.org/pdf/2603.19310
Abstract Training large language models (LLMs) for complex reasoning via reinforcement learning requires reward labels that specify whether the generated rollouts are correct. However, obtaining reward labels at scale often requires expensive human labeling or time-consuming verification procedures; for instance, evaluating mathematical proofs demands expert review, while open-ended question answering lacks definitive ground truth. When reward labels are limited, the effectiveness of reinforcement learning fine-tuning is constrained by the scarcity of reward labels. We introduce MemReward, a graph-based experience memory framework: an initial LLM policy generates rollouts for each query, each comprising a thinking process and a final answer, and these rollouts are stored as experience memory. Queries, thinking processes, and answers form nodes in a heterogeneous graph with similarity and structural edges; a GNN trained on labeled nodes propagates rewards to unlabeled rollouts during online optimization. Experiments on Qwen2.5-3B and 1.5B across mathematics, question answering, and code generation demonstrate that MemReward, with only 20% labels, achieves 97.3% of Oracle performance on 3B and 96.6% on 1.5B, surpassing Oracle on out-of-domain tasks. Performance scales smoothly with label budget, reaching 99.4% of Oracle at 70% labels.
中文摘要 通过强化学习训练大型语言模型（LLMs）进行复杂推理，需要奖励标签来指明生成的轨迹（rollout）是否正确。然而，大规模获取奖励标签通常需要昂贵的人工标注或耗时的验证程序；例如，评估数学证明需要专家审核，而开放式问答缺乏明确的标准答案。当奖励标签有限时，标签的稀缺性限制了强化学习微调的有效性。我们提出了MemReward，一种基于图的经验记忆框架：初始的LLM策略为每个查询生成轨迹，每条轨迹包含一个思考过程和一个最终答案，这些轨迹作为经验记忆存储。查询、思考过程和答案构成异构图中的节点，并由相似性边和结构性边连接；在带标签节点上训练的GNN在在线优化过程中将奖励传播到未标注的轨迹。在Qwen2.5-3B和1.5B上进行的数学、问答和代码生成实验表明，MemReward仅用20%的标签，就在3B上达到了Oracle性能的97.3%，在1.5B上达到96.6%，并在域外任务上超过了Oracle。性能随标签预算平稳提升，在70%标签时达到Oracle的99.4%。

PrefPO: Pairwise Preference Prompt Optimization

PrefPO：成对偏好提示优化

Authors: Rahul Singhal, Pradyumna Tambwekar, Karime Maamari
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.19311
Pdf link: https://arxiv.org/pdf/2603.19311
Abstract Prompt engineering is effective but labor-intensive, motivating automated optimization methods. Existing methods typically require labeled datasets, which are often unavailable, and produce verbose, repetitive prompts. We introduce PrefPO, a minimal prompt optimization approach inspired by reinforcement learning from human feedback (RLHF). Its preference-based approach reduces the need for labeled data and hyperparameter tuning: only a starting prompt and natural language criteria are needed. PrefPO uses an LLM discriminator to express pairwise preferences over model outputs and provide feedback to an LLM optimizer, iteratively improving performance. We evaluate PrefPO on 9 BIG-Bench Hard (BBH) tasks and IFEval-Hard, a newly-curated, challenging subset of IFEval. PrefPO matches or exceeds SOTA methods, including GEPA, MIPRO, and TextGrad, on 6/9 tasks and performs comparably to TextGrad on IFEval-Hard (82.4% vs 84.5%). Unlike other methods, PrefPO can optimize in both labeled and unlabeled settings. Without labels, PrefPO closely matches its labeled performance on 6/9 tasks, proving effective without ground truth. PrefPO also improves prompt hygiene: we find existing methods produce prompts 14.7x their original length or with 34% repetitive content; PrefPO reduces these issues by 3-5x. Furthermore, both LLM and human judges rate PrefPO's prompts higher than TextGrad's. Finally, we identify prompt hacking in prompt optimizers, where methods game evaluation criteria, and find PrefPO is susceptible at half the rate of TextGrad (37% vs 86%), generating fewer brittle, misaligned prompts.
中文摘要 提示工程有效但耗费人力，这推动了自动化优化方法的发展。现有方法通常需要带标签的数据集，而这些数据集往往无法获得，并且会产生冗长重复的提示。我们提出了PrefPO，这是一种受基于人类反馈的强化学习（RLHF）启发的极简提示优化方法。其基于偏好的方法减少了对标注数据和超参数调优的需求：只需一个起始提示和自然语言标准即可。PrefPO使用LLM判别器对模型输出表达成对偏好，并向LLM优化器提供反馈，从而迭代提升性能。我们在9个BIG-Bench Hard（BBH）任务和IFEval-Hard上评估了PrefPO，后者是IFEval中新整理的具有挑战性的子集。PrefPO在6/9的任务上达到或超过SOTA方法（包括GEPA、MIPRO和TextGrad），并在IFEval-Hard上与TextGrad表现相当（82.4%对84.5%）。与其他方法不同，PrefPO可以在有标签和无标签两种设置下进行优化。在没有标签的情况下，PrefPO在6/9的任务上非常接近其有标签时的表现，证明其在没有真实标签（ground truth）时依然有效。PrefPO还改善了提示的整洁度：我们发现现有方法生成的提示长度达到原始长度的14.7倍，或含有34%的重复内容；PrefPO将这些问题减少了3-5倍。此外，LLM和人类评审对PrefPO提示的评分都高于TextGrad的提示。最后，我们发现了提示优化器中的提示作弊（prompt hacking）现象，即方法钻评估标准的空子，并发现PrefPO出现该问题的比率约为TextGrad的一半（37%对86%），生成的脆弱、失准提示更少。

Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

Goedel-Code-Prover：开放最先进代码验证的分层证明搜索

Authors: Zenan Li, Ziran Yang, Deyuan (Mike) He, Haoyu Zhao, Andrew Zhao, Shange Tang, Kaiyu Yang, Aarti Gupta, Zhendong Su, Chi Jin
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.19329
Pdf link: https://arxiv.org/pdf/2603.19329
Abstract Large language models (LLMs) can generate plausible code but offer limited guarantees of correctness. Formally verifying that implementations satisfy specifications requires constructing machine-checkable proofs, a task that remains beyond current automation. We propose a hierarchical proof search framework for automated code verification in Lean 4 that decomposes complex verification goals into structurally simpler subgoals before attempting tactic-level proving. Central to our approach is a principled decomposition score that combines constructive justification with structural effectiveness. Crucially, this score serves as both the training reward and the inference-time ranking criterion, ensuring strict alignment between optimization and deployment. We train Goedel-Code-Prover-8B, a single unified policy for both decomposition and completion, via supervised initialization followed by hybrid reinforcement learning, where a continuous decomposition reward drives planning exploration while supervised replay stabilizes proof generation. On three Lean-based code verification benchmarks comprising 427 tasks, our 8B-parameter model achieves a 62.0% prove success rate, a 2.6× improvement over the strongest baseline, surpassing neural provers up to 84× larger. We further observe consistent inference-time scaling: success rates improve monotonically with search iterations and sampling budget, with our trained model achieving greater efficiency than frontier off-the-shelf models of comparable scale.
中文摘要 大型语言模型（LLM）可以生成看似合理的代码，但对正确性的保证有限。形式化验证实现是否满足规范需要构建机器可检查的证明，而这项任务仍超出当前自动化的能力。我们提出了一个用于Lean 4自动代码验证的分层证明搜索框架，在尝试策略（tactic）级证明之前，先将复杂的验证目标分解为结构上更简单的子目标。我们方法的核心是一个有原则的分解评分，它将构造性论证与结构有效性相结合。关键在于，该评分既是训练奖励，也是推理时的排序标准，确保优化与部署之间严格对齐。我们通过监督初始化和随后的混合强化学习训练Goedel-Code-Prover-8B，这是一个同时用于分解和补全的统一策略，其中连续的分解奖励驱动规划探索，监督回放则稳定证明生成。在三个基于Lean的代码验证基准（共427个任务）上，我们的8B参数模型实现了62.0%的证明成功率，是最强基线的2.6倍，并超过了规模最多大84倍的神经证明器。我们还观察到一致的推理时扩展规律：成功率随搜索迭代次数和采样预算单调提升，且我们训练的模型比同等规模的前沿现成模型效率更高。

Optimizing Resource-Constrained Non-Pharmaceutical Interventions for Multi-Cluster Outbreak Control Using Hierarchical Reinforcement Learning

利用层级强化学习优化资源受限的非药物干预措施以控制多群组疫情

Authors: Xueqiao Peng, Andrew Perrault
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.19397
Pdf link: https://arxiv.org/pdf/2603.19397
Abstract Non-pharmaceutical interventions (NPIs), such as diagnostic testing and quarantine, are crucial for controlling infectious disease outbreaks but are often constrained by limited resources, particularly in early outbreak stages. In real-world public health settings, resources must be allocated across multiple outbreak clusters that emerge asynchronously, vary in size and risk, and compete for a shared resource budget. Here, a cluster corresponds to a group of close contacts generated by a single infected index case. Thus, decisions must be made under uncertainty and heterogeneous demands, while respecting operational constraints. We formulate this problem as a constrained restless multi-armed bandit and propose a hierarchical reinforcement learning framework. A global controller learns a continuous action cost multiplier that adjusts global resource demand, while a generalized local policy estimates the marginal value of allocating resources to individuals within each cluster. We evaluate the proposed framework in a realistic agent-based simulator of SARS-CoV-2 with dynamically arriving clusters. Across a wide range of system scales and testing budgets, our method consistently outperforms RMAB-inspired and heuristic baselines, improving outbreak control effectiveness by 20%-30%. Experiments on up to 40 concurrently active clusters further demonstrate that the hierarchical framework is highly scalable and enables faster decision-making than the RMAB-inspired method.
中文摘要 非药物干预措施（NPI），如诊断检测和隔离，对于控制传染病暴发至关重要，但往往受限于有限的资源，尤其是在暴发初期。在现实的公共卫生环境中，资源必须在多个暴发集群之间分配，这些集群异步出现、规模和风险各异，并竞争共享的资源预算。这里，一个集群对应由单个受感染的指示病例产生的一组密切接触者。因此，决策必须在不确定性和异质需求下做出，同时遵守操作约束。我们将这个问题表述为一个带约束的不安定多臂老虎机（restless multi-armed bandit，RMAB），并提出了一个分层强化学习框架。全局控制器学习一个连续的动作成本乘数，用以调整全局资源需求，而一个通用的局部策略则估算向每个集群内的个体分配资源的边际价值。我们在一个带有动态到达集群的、逼真的基于智能体的SARS-CoV-2模拟器中评估该框架。在广泛的系统规模和检测预算范围内，我们的方法始终优于受RMAB启发的基线和启发式基线，将疫情控制效果提升了20%-30%。在多达40个同时活跃集群上的实验进一步表明，该分层框架具有高度可扩展性，并且比受RMAB启发的方法决策更快。

Cooperation and Exploitation in LLM Policy Synthesis for Sequential Social Dilemmas

序贯社会困境中LLM策略合成的合作与剥削

Authors: Víctor Gallego
Subjects: Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
Arxiv link: https://arxiv.org/abs/2603.19453
Pdf link: https://arxiv.org/pdf/2603.19453
Abstract We study LLM policy synthesis: using a large language model to iteratively generate programmatic agent policies for multi-agent environments. Rather than training neural policies via reinforcement learning, our framework prompts an LLM to produce Python policy functions, evaluates them in self-play, and refines them using performance feedback across iterations. We investigate feedback engineering (the design of what evaluation information is shown to the LLM during refinement) comparing sparse feedback (scalar reward only) against dense feedback (reward plus social metrics: efficiency, equality, sustainability, peace). Across two canonical Sequential Social Dilemmas (Gathering and Cleanup) and two frontier LLMs (Claude Sonnet 4.6, Gemini 3.1 Pro), dense feedback consistently matches or exceeds sparse feedback on all metrics. The advantage is largest in the Cleanup public goods game, where providing social metrics helps the LLM calibrate the costly cleaning-harvesting tradeoff. Rather than triggering over-optimization of fairness, social metrics serve as a coordination signal that guides the LLM toward more effective cooperative strategies, including territory partitioning, adaptive role assignment, and the avoidance of wasteful aggression. We further perform an adversarial experiment to determine whether LLMs can reward hack these environments. We characterize five attack classes and discuss mitigations, highlighting an inherent tension in LLM policy synthesis between expressiveness and safety. Code at this https URL.
中文摘要 我们研究LLM策略综合：利用大型语言模型迭代生成多智能体环境的程序化代理策略。我们的框架不是通过强化学习来训练神经策略，而是提示LLM生成Python策略函数，在自我游戏中评估它们，并通过性能反馈在迭代中进行优化。我们研究反馈工程（即在精细过程中向LLM展示哪些评估信息的设计），比较稀疏反馈（仅标量奖励）与密集反馈（奖励加上社会指标：效率、平等、可持续性、和平）。在两个典型的顺序社会困境（Gathering and Cleanup）以及两个前沿大型语言模型（Claude Sonnet 4.6、Gemini 3.1 Pro）中，密集反馈在所有指标上始终匹配甚至超过稀疏反馈。优势在公共物品清理领域最大，提供社会指标有助于LLM校准昂贵的清理与收割权衡。社会指标并非触发过度优化公平，而是作为协调信号引导LLM采取更有效的合作策略，包括领地划分、适应性角色分配和避免浪费性攻击。我们还进行了对抗性实验，以确定大型语言模型是否能在这些环境中进行奖励黑客攻击。我们描述了五类攻击，并讨论了缓解措施，凸显了LLM策略综合中表现力与安全性之间的固有张力。代码就在这个 https URL 上。

Deep Hilbert–Galerkin Methods for Infinite-Dimensional PDEs and Optimal Control

深度希尔伯特-伽辽金方法：无限维偏微分方程与最优控制

Authors: Samuel N. Cohen, Filippo de Feo, Jackson Hebner, Justin Sirignano
Subjects: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Numerical Analysis (math.NA); Optimization and Control (math.OC); Probability (math.PR)
Arxiv link: https://arxiv.org/abs/2603.19463
Pdf link: https://arxiv.org/pdf/2603.19463
Abstract We develop deep learning-based approximation methods for fully nonlinear second-order PDEs on separable Hilbert spaces, such as HJB equations for infinite-dimensional control, by parameterizing solutions via Hilbert–Galerkin Neural Operators (HGNOs). We prove the first Universal Approximation Theorems (UATs) which are sufficiently powerful to address these problems, based on novel topologies for Hessian terms and corresponding novel continuity assumptions on the fully nonlinear operator. These topologies are non-sequential and non-metrizable, making the problem delicate. In particular, we prove UATs for functions on Hilbert spaces, together with their Fréchet derivatives up to second order, and for unbounded operators applied to the first derivative, ensuring that HGNOs are able to approximate all the PDE terms. For control problems, we further prove UATs for optimal feedback controls in terms of our approximating value function HGNO. We develop numerical training methods, which we call Deep Hilbert–Galerkin and Hilbert Actor-Critic (reinforcement learning) Methods, for these problems by minimizing the $L^2_\mu(H)$-norm of the residual of the PDE on the whole Hilbert space, not just a projected PDE to finite dimensions. This is the first paper to propose such an approach. The models considered arise in many applied sciences, such as functional differential equations in physics and Kolmogorov and HJB PDEs related to controlled PDEs, SPDEs, path-dependent systems, partially observed stochastic systems, and mean-field SDEs. We numerically solve examples of Kolmogorov and HJB PDEs related to the optimal control of deterministic and stochastic heat and Burgers' equations, demonstrating the promise of our deep learning-based approach.
中文摘要 我们通过希尔伯特-伽辽金神经算子（HGNO）参数化解，为可分希尔伯特空间上的完全非线性二阶偏微分方程（如无限维控制的HJB方程）开发基于深度学习的近似方法。我们基于针对Hessian项的新拓扑以及对完全非线性算子相应的新连续性假设，证明了首批足以处理这些问题的万能逼近定理（UAT）。这些拓扑是非序列的且不可度量化的，这使问题变得微妙。特别地，我们证明了关于希尔伯特空间上的函数及其直至二阶的弗雷歇导数的UAT，以及关于作用于一阶导数的无界算子的UAT，确保HGNO能够逼近偏微分方程的所有项。对于控制问题，我们进一步证明了以近似值函数HGNO表示的最优反馈控制的UAT。我们为这些问题开发了数值训练方法，称为深度希尔伯特-伽辽金方法和希尔伯特演员-评论家（强化学习）方法，其做法是在整个希尔伯特空间上最小化偏微分方程残差的$L^2_\mu(H)$范数，而不只是针对投影到有限维的偏微分方程。这是首篇提出此类方法的论文。所考虑的模型出现在许多应用科学中，如物理学中的泛函微分方程，以及与受控偏微分方程、随机偏微分方程（SPDE）、路径依赖系统、部分可观测随机系统和平均场随机微分方程相关的柯尔莫哥洛夫方程和HJB偏微分方程。我们数值求解了与确定性和随机热方程及伯格斯方程的最优控制相关的柯尔莫哥洛夫和HJB偏微分方程的算例，展示了我们基于深度学习的方法的前景。

ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models

ProactiveBench：多模态大型语言模型中的主动性基准测试

Authors: Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2603.19466
Pdf link: https://arxiv.org/pdf/2603.19466
Abstract Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we found that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.
中文摘要 有效的协作始于知道何时寻求帮助。例如，当试图识别一个被遮挡的物体时，人类会请别人移开遮挡物。MLLM能否通过请求简单的用户干预表现出类似的“主动”行为？为此，我们引入了ProactiveBench，这是一个由七个重新利用的数据集构建的基准，用于测试不同任务中的主动性，如识别被遮挡物体、提升图像质量和解读粗略草图。我们在ProactiveBench上评估了22个MLLM，结果显示：（i）它们普遍缺乏主动性；（ii）主动性与模型容量不相关；（iii）“暗示”主动性仅带来边际收益。令人惊讶的是，我们发现对话历史和上下文学习会引入负面偏差，妨碍性能。最后，我们探讨了一种基于强化学习的简单微调策略：其结果表明主动性是可以学习的，甚至可以泛化到未见过的场景。我们公开发布ProactiveBench，作为构建主动多模态模型的第一步。

Teaching an Agent to Sketch One Part at a Time

教代理一次绘制一个部分

Authors: Xiaodan Du, Ruize Xu, David Yunis, Yael Vinker, Greg Shakhnarovich
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.19500
Pdf link: https://arxiv.org/pdf/2603.19500
Abstract We develop a method for producing vector sketches one part at a time. To do this, we train a multi-modal language model-based agent using a novel multi-turn process-reward reinforcement learning following supervised fine-tuning. Our approach is enabled by a new dataset we call ControlSketch-Part, containing rich part-level annotations for sketches, obtained using a novel, generic automatic annotation pipeline that segments vector sketches into semantic parts and assigns paths to parts with a structured multi-stage labeling process. Our results indicate that incorporating structured part-level data and providing the agent with visual feedback throughout the process enables interpretable, controllable, and locally editable text-to-vector sketch generation.
中文摘要 我们开发了一种逐部分生成矢量草图的方法。为此，我们在监督微调之后，使用一种新颖的多轮过程奖励强化学习来训练一个基于多模态语言模型的智能体。我们的方法得益于一个名为ControlSketch-Part的新数据集，该数据集包含丰富的部件级草图标注，这些标注通过一种新颖的通用自动标注流水线获得，该流水线将矢量草图分割为语义部件，并通过结构化的多阶段标注过程将路径分配给各部件。我们的结果表明，结合结构化的部件级数据并在整个过程中为智能体提供视觉反馈，能够实现可解释、可控且可局部编辑的文本到矢量草图生成。

Stochastic Sequential Decision Making over Expanding Networks with Graph Filtering

基于图滤波的扩展网络随机序贯决策

Authors: Zhan Gao, Bishwadeep Das, Elvin Isufi
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Arxiv link: https://arxiv.org/abs/2603.19501
Pdf link: https://arxiv.org/pdf/2603.19501
Abstract Graph filters leverage topological information to process networked data with existing methods mainly studying fixed graphs, ignoring that graphs often expand as nodes continually attach with an unknown pattern. The latter requires developing filter-based decision-making paradigms that take evolution and uncertainty into account. Existing approaches rely on either pre-designed filters or online learning, limited to a myopic view considering only past or present information. To account for future impacts, we propose a stochastic sequential decision-making framework for filtering networked data with a policy that adapts filtering to expanding graphs. By representing filter shifts as agents, we model the filter as a multi-agent system and train the policy following multi-agent reinforcement learning. This accounts for long-term rewards and captures expansion dynamics through sequential decision-making. Moreover, we develop a context-aware graph neural network to parameterize the policy, which tunes filter parameters based on information of both the graph and agents. Experiments on synthetic and real datasets from cold-start recommendation to COVID prediction highlight the benefits of using a sequential decision-making perspective over batch and online filtering alternatives.
中文摘要 图滤波器利用拓扑信息处理网络化数据，但现有方法主要研究固定图，忽略了图常常随着节点以未知模式不断接入而扩展这一事实。后者要求发展将演化和不确定性考虑在内的基于滤波器的决策范式。现有方法依赖预先设计的滤波器或在线学习，局限于只考虑过去或当前信息的短视视角。为了考虑未来的影响，我们提出了一个用于滤波网络化数据的随机序贯决策框架，其策略使滤波适应不断扩展的图。通过将滤波器的移位表示为智能体，我们将滤波器建模为多智能体系统，并按照多智能体强化学习训练策略。这考虑了长期回报，并通过序贯决策捕捉扩展动态。此外，我们开发了一个上下文感知的图神经网络来参数化策略，它根据图和智能体两方面的信息调整滤波器参数。在从冷启动推荐到新冠预测的合成和真实数据集上的实验，凸显了采用序贯决策视角相较于批量和在线滤波方案的优势。
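The abstract above refers to graph filters and their "filter shifts" without defining them. For orientation, the sketch below shows the standard polynomial graph filter, y = sum_k h[k] S^k x, where S is a graph shift operator such as the adjacency matrix. This is the textbook form only; how the paper assigns agents to shifts or adapts the coefficients as the graph expands is not specified in the abstract and is not modeled here. The function names are illustrative.

```python
def mat_vec(S, x):
    """Multiply a square matrix S (list of rows) by a vector x."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]


def graph_filter(S, x, h):
    """Polynomial graph filter: y = sum_k h[k] * S^k x.

    S: graph shift operator (e.g. adjacency matrix) as a list of rows
    x: graph signal, one value per node
    h: filter coefficients, one per shift order k = 0, 1, ...
    """
    y = [0.0] * len(x)
    shifted = list(x)  # S^0 x
    for h_k in h:
        y = [y_i + h_k * z_i for y_i, z_i in zip(y, shifted)]
        shifted = mat_vec(S, shifted)  # advance to the next shift order
    return y


if __name__ == "__main__":
    # Path graph on three nodes: 0 - 1 - 2
    S = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(graph_filter(S, [1.0, 0.0, 0.0], [1.0, 0.5]))
```

Each coefficient h[k] weights information gathered from k-hop neighborhoods, so tuning the coefficients changes how far the filter looks across the graph.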

EvidenceRL: Reinforcing Evidence Consistency for Trustworthy Language Models

EvidenceRL：强化可信语言模型的证据一致性

Authors: J. Ben Tamo, Yuxing Lu, Benoit L. Marteau, Micky C. Nnamdi, May D. Wang
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.19532
Pdf link: https://arxiv.org/pdf/2603.19532
Abstract Large Language Models (LLMs) are fluent but prone to hallucinations, producing answers that appear plausible yet are unsupported by available evidence. This failure is especially problematic in high-stakes domains where decisions must be justified by verifiable information. We introduce EvidenceRL, a reinforcement learning framework that enforces evidence adherence during training. EvidenceRL scores candidate responses for grounding (entailment with retrieved evidence and context) and correctness (agreement with reference answers) and optimizes the generator using Group Relative Policy Optimization (GRPO). We evaluate across two high-stakes domains, cardiac diagnosis and legal reasoning, where EvidenceRL consistently improves evidence grounding and faithfulness without sacrificing task accuracy. On cardiac diagnosis, F1@3 increases from 37.0 to 54.5 on Llama-3.2-3B while grounding ($G_{\max}@3$) rises from 47.6 to 78.2; hallucinations drop nearly 5× and evidence-supported diagnoses increase from 31.8% to 61.6%. On legal reasoning, EvidenceRL raises Faithfulness from 32.8% to 67.6% on Llama-3.1-8B, demonstrating consistent behavioral change across domains. Our code is open-sourced at this https URL.
中文摘要 大型语言模型（LLMs）表达流畅但容易产生幻觉，给出看似合理却缺乏现有证据支持的答案。这种失败在高风险领域尤为严重，因为这些领域的决策必须以可验证的信息为依据。我们提出了EvidenceRL，一种在训练过程中强制证据遵循的强化学习框架。EvidenceRL从依据性（与检索到的证据和上下文的蕴含关系）和正确性（与参考答案的一致性）两方面对候选回答评分，并使用组相对策略优化（GRPO）优化生成器。我们在心脏诊断和法律推理这两个高风险领域进行评估，EvidenceRL在不牺牲任务准确率的情况下，持续提升了证据依据性和忠实度。在心脏诊断上，Llama-3.2-3B的F1@3从37.0升至54.5，依据性（$G_{\max}@3$）从47.6升至78.2；幻觉减少近5倍，有证据支持的诊断比例从31.8%上升到61.6%。在法律推理上，EvidenceRL将Llama-3.1-8B的忠实度（Faithfulness）从32.8%提升至67.6%，展示了跨领域一致的行为变化。我们的代码开源于此 https URL。
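EvidenceRL is described as scoring responses for grounding and correctness and then optimizing with GRPO. The sketch below shows only the group-relative advantage step that gives GRPO its name: rewards for a group of responses to the same prompt are standardized within that group. The weighted sum in `combined_reward`, its weight `w`, and the use of the population standard deviation are assumptions for illustration; the abstract does not say how the two scores are combined.

```python
from statistics import mean, pstdev


def combined_reward(grounding: float, correctness: float, w: float = 0.5) -> float:
    """Hypothetical scalar reward: a weighted sum of the two scores."""
    return w * grounding + (1.0 - w) * correctness


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Standardize rewards within one group of sampled responses.

    Responses scoring above the group mean get positive advantages,
    those below get negative ones; eps guards against a zero spread.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    scores = [(0.9, 1.0), (0.2, 1.0), (0.8, 0.0), (0.1, 0.0)]
    rewards = [combined_reward(g, c) for g, c in scores]
    print(group_relative_advantages(rewards))
```

Because advantages are relative to the group, no separate value network is needed to provide a baseline.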

PA2D-MORL: Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning

PA2D-MORL：基于帕累托上升方向分解的多目标强化学习

Authors: Tianmeng Hu, Biao Luo
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.19579
Pdf link: https://arxiv.org/pdf/2603.19579
Abstract Multi-objective reinforcement learning (MORL) provides an effective solution for decision-making problems involving conflicting objectives. However, achieving high-quality approximations to the Pareto policy set remains challenging, especially in complex tasks with continuous or high-dimensional state-action space. In this paper, we propose the Pareto Ascent Directional Decomposition based Multi-Objective Reinforcement Learning (PA2D-MORL) method, which constructs an efficient scheme for multi-objective problem decomposition and policy improvement, leading to a superior approximation of Pareto policy set. The proposed method leverages Pareto ascent direction to select the scalarization weights and computes the multi-objective policy gradient, which determines the policy optimization direction and ensures joint improvement on all objectives. Meanwhile, multiple policies are selectively optimized under an evolutionary framework to approximate the Pareto frontier from different directions. Additionally, a Pareto adaptive fine-tuning approach is applied to enhance the density and spread of the Pareto frontier approximation. Experiments on various multi-objective robot control tasks show that the proposed method clearly outperforms the current state-of-the-art algorithm in terms of both quality and stability of the outcomes.
中文摘要 多目标强化学习（MORL）为涉及冲突目标的决策问题提供了有效的解决方案。然而，获得对帕累托策略集的高质量近似仍然具有挑战性，尤其是在具有连续或高维状态-动作空间的复杂任务中。本文提出了基于帕累托上升方向分解的多目标强化学习（PA2D-MORL）方法，该方法构建了高效的多目标问题分解和策略改进方案，从而得到对帕累托策略集更优的近似。所提方法利用帕累托上升方向选择标量化权重，并计算多目标策略梯度，由此确定策略优化方向并确保所有目标的联合改进。同时，在进化框架下有选择地优化多个策略，以从不同方向逼近帕累托前沿。此外，还采用了帕累托自适应微调方法，以提升帕累托前沿近似的密度和覆盖范围。在多种多目标机器人控制任务上的实验表明，所提方法在结果的质量和稳定性方面均明显优于当前最先进的算法。
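The Pareto frontier that PA2D-MORL approximates can be made concrete with a small dominance check. The sketch below filters a set of policy return vectors down to the non-dominated ones, assuming every objective is maximized. It illustrates the definition only and is not the paper's Pareto ascent direction or fine-tuning procedure; the function names are illustrative.

```python
def dominates(a, b) -> bool:
    """True if a is at least as good as b on every objective and
    strictly better on at least one (all objectives maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Each tuple is one policy's return on two objectives (made-up values).
    returns = [(1, 5), (3, 3), (5, 1), (2, 2), (1, 1)]
    print(pareto_front(returns))
```

The quadratic scan is fine for small policy populations; larger sets would call for a sort-based method.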

SaFRO: Satisfaction-Aware Fusion via Dual-Relative Policy Optimization for Short-Video Search

SaFRO：通过双相对策略优化实现满足感感知融合，用于短视频搜索

Authors: Renzhe Zhou, Songyang Li, Feiran Zhu, Chenglei Dai, Yi Zhang, Yi Wang, Jingwei Zhuo
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2603.19585
Pdf link: https://arxiv.org/pdf/2603.19585
Abstract Multi-Task Fusion plays a pivotal role in industrial short-video search systems by aggregating heterogeneous prediction signals into a unified ranking score. However, existing approaches predominantly optimize for immediate engagement metrics, which often fail to align with long-term user satisfaction. While Reinforcement Learning (RL) offers a promising avenue for user satisfaction optimization, its direct application to search scenarios is non-trivial due to the inherent data sparsity and intent constraints compared to recommendation feeds. To this end, we propose SaFRO, a novel framework designed to optimize user satisfaction in short-video search. We first construct a satisfaction-aware reward model that utilizes query-level behavioral proxies to capture holistic user satisfaction beyond item-level interactions. Then we introduce Dual-Relative Policy Optimization (DRPO), an efficient policy learning method that updates the fusion policy through relative preference comparisons within groups and across batches. Furthermore, we design a Task-Relation-Aware Fusion module to explicitly model the interdependencies among different objectives, enabling context-sensitive weight adaptation. Extensive offline evaluations and large-scale online A/B tests on Kuaishou short-video search platform demonstrate that SaFRO significantly outperforms state-of-the-art baselines, delivering substantial gains in both short-term ranking quality and long-term user retention.
中文摘要 多任务融合在工业短视频搜索系统中发挥着关键作用，它将异构预测信号聚合为统一的排序分数。然而，现有方法主要优化即时参与度指标，而这些指标往往与长期用户满意度不一致。虽然强化学习（RL）为用户满意度优化提供了有前景的途径，但与推荐信息流相比，搜索场景存在固有的数据稀疏性和意图约束，因此直接应用并不简单。为此，我们提出了SaFRO，一种旨在优化短视频搜索用户满意度的新框架。我们首先构建了一个满意度感知奖励模型，利用查询级行为代理指标，捕捉超越条目级交互的整体用户满意度。随后，我们引入了双相对策略优化（DRPO），这是一种高效的策略学习方法，通过组内和跨批次的相对偏好比较来更新融合策略。此外，我们设计了一个任务关系感知融合模块，显式建模不同目标之间的相互依赖关系，实现上下文敏感的权重自适应。在快手短视频搜索平台上进行的大量离线评估和大规模在线A/B测试表明，SaFRO显著优于最先进的基线方法，在短期排序质量和长期用户留存两方面均带来了可观提升。

DeepStock: Reinforcement Learning with Policy Regularizations for Inventory Management

DeepStock：用于库存管理的带策略正则化的强化学习

Authors: Yaqi Xie, Xinru Hao, Jiaxi Liu, Will Ma, Linwei Xin, Lei Cao, Yidong Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.19621
Pdf link: https://arxiv.org/pdf/2603.19621
Abstract Deep Reinforcement Learning (DRL) provides a general-purpose methodology for training inventory policies that can leverage big data and compute. However, off-the-shelf implementations of DRL have seen mixed success, often plagued by high sensitivity to the hyperparameters used during training. In this paper, we show that by imposing policy regularizations, grounded in classical inventory concepts such as "Base Stock", we can significantly accelerate hyperparameter tuning and improve the final performance of several DRL methods. We report details from a 100% deployment of DRL with policy regularizations on Alibaba's e-commerce platform, Tmall. We also include extensive synthetic experiments, which show that policy regularizations reshape the narrative on what is the best DRL method for inventory management.
中文摘要 深度强化学习（DRL）为训练库存策略提供了一种通用方法，能够利用大数据和算力。然而，现成的DRL实现成效参差不齐，常常受训练中对超参数高度敏感的问题困扰。本文表明，通过施加基于经典库存概念（如“基础库存”，Base Stock）的策略正则化，我们可以显著加速超参数调优，并提升多种DRL方法的最终性能。我们报告了在阿里巴巴电商平台天猫上100%部署带策略正则化的DRL的细节。我们还进行了大量合成数据实验，结果表明策略正则化会改变人们对库存管理最佳DRL方法的认识。
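
The "Base Stock" concept named in the abstract is the classical order-up-to rule from inventory theory. The sketch below illustrates only that rule, assuming zero lead time and backlogged demand; it is not the paper's regularization scheme, and the function names `base_stock_order` and `simulate` are hypothetical.

```python
def base_stock_order(inventory_position, base_stock_level):
    """Order-up-to rule: raise the inventory position to the base-stock level."""
    return max(0, base_stock_level - inventory_position)


def simulate(demands, base_stock_level, initial_inventory=0):
    """Simulate a zero-lead-time system with backlogged demand; return the orders placed."""
    inventory = initial_inventory
    orders = []
    for demand in demands:
        order = base_stock_order(inventory, base_stock_level)
        orders.append(order)
        # Negative inventory represents backorders (an assumption of this sketch)
        inventory = inventory + order - demand
    return orders


orders = simulate(demands=[3, 5, 2], base_stock_level=10, initial_inventory=4)
print(orders)
```

After the first period, each order simply replaces the previous period's demand, which is the structural prior such a rule encodes.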

ContractionPPO: Certified Reinforcement Learning via Differentiable Contraction Layers

ContractionPPO：通过可微收缩层实现可认证的强化学习

Authors: Vrushabh Zinage, Narek Harutyunyan, Eric Verheyden, Fred Y. Hadaegh, Soon-Jo Chung
Subjects: Robotics (cs.RO); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.19632
Pdf link: https://arxiv.org/pdf/2603.19632
Abstract Legged locomotion in unstructured environments demands not only high-performance control policies but also formal guarantees to ensure robustness under perturbations. Control methods often require carefully designed reference trajectories, which are challenging to construct in high-dimensional, contact-rich systems such as quadruped robots. In contrast, Reinforcement Learning (RL) directly learns policies that implicitly generate motion, and uniquely benefits from access to privileged information, such as full state and dynamics during training, that is not available at deployment. We present ContractionPPO, a framework for certified robust planning and control of legged robots by augmenting Proximal Policy Optimization (PPO) RL with a state-dependent contraction metric layer. This approach enables the policy to maximize performance while simultaneously producing a contraction metric that certifies incremental exponential stability of the simulated closed-loop system. The metric is parameterized as a Lipschitz neural network and trained jointly with the policy, either in parallel or as an auxiliary head of the PPO backbone. While the contraction metric is not deployed during real-world execution, we derive upper bounds on the worst-case contraction rate and show that these bounds ensure the learned contraction metric generalizes from simulation to real-world deployment. Our hardware experiments on quadruped locomotion demonstrate that ContractionPPO enables robust, certifiably stable control even under strong external perturbations.
中文摘要 非结构化环境中的腿式运动不仅需要高性能的控制策略，还需要形式化保证，以确保在扰动下的鲁棒性。控制方法通常需要精心设计的参考轨迹，而在高维、接触丰富的系统（如四足机器人）中构建这些轨迹具有挑战性。相比之下，强化学习（RL）直接学习隐式生成运动的策略，并且独特地受益于训练期间可用但部署时无法获得的特权信息，如完整状态和动力学信息。我们提出ContractionPPO，这是一个通过为近端策略优化（PPO）强化学习增加状态依赖的收缩度量层，实现腿式机器人可认证鲁棒规划与控制的框架。这种方法使策略能够最大化性能，同时产生一个收缩度量，用以证明仿真闭环系统的增量指数稳定性。该度量被参数化为Lipschitz神经网络，并与策略联合训练，既可并行训练，也可以作为PPO骨干网络的辅助头。虽然收缩度量在实际执行时不会被部署，但我们推导出了最坏情况收缩率的上界，并证明这些上界确保所学收缩度量能够从仿真推广到现实部署。我们在四足运动上的硬件实验表明，ContractionPPO即使在强烈外部扰动下也能实现鲁棒且可证明稳定的控制。

Heavy-Tailed and Long-Range Dependent Noise in Stochastic Approximation: A Finite-Time Analysis

随机近似中的重尾噪声和长程相关噪声：有限时间分析

Authors: Siddharth Chandak, Anuj Yadav, Ayfer Ozgur, Nicholas Bambos
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2603.19648
Pdf link: https://arxiv.org/pdf/2603.19648
Abstract Stochastic approximation (SA) is a fundamental iterative framework with broad applications in reinforcement learning and optimization. Classical analyses typically rely on martingale difference or Markov noise with bounded second moments, but many practical settings, including finance and communications, frequently encounter heavy-tailed and long-range dependent (LRD) noise. In this work, we study SA for finding the root of a strongly monotone operator under these non-classical noise models. We establish the first finite-time moment bounds in both settings, providing explicit convergence rates that quantify the impact of heavy tails and temporal dependence. Our analysis employs a noise-averaging argument that regularizes the impact of noise without modifying the iteration. Finally, we apply our general framework to stochastic gradient descent (SGD) and gradient play, and corroborate our finite-time analysis through numerical experiments.
中文摘要 随机近似（SA）是一种基础的迭代框架，在强化学习和优化中有广泛应用。经典分析通常依赖于二阶矩有界的鞅差噪声或马尔可夫噪声，但许多实际场景（包括金融和通信）经常遇到重尾噪声和长程依赖（LRD）噪声。在本研究中，我们研究在这些非经典噪声模型下，用SA寻找强单调算子的根。我们在这两种设定下建立了首个有限时间矩界，给出了显式的收敛速率，以量化重尾和时间依赖性的影响。我们的分析采用了噪声平均论证，在不修改迭代过程的情况下正则化噪声的影响。最后，我们将通用框架应用于随机梯度下降（SGD）和梯度博弈（gradient play），并通过数值实验验证有限时间分析结果。
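
As a minimal sketch of the setting in this abstract, the code below runs the basic SA iteration x_{k+1} = x_k - a_k (F(x_k) + noise_k) on a scalar, strongly monotone operator with heavy-tailed noise. The operator `F`, the step size a_k = 1/(k+1), and the symmetrized Pareto noise with tail index 1.5 (finite mean, infinite variance) are all assumptions chosen for illustration, not taken from the paper.

```python
import random


def symmetric_pareto_noise(rng, tail_index=1.5):
    """Zero-mean heavy-tailed sample; the variance is infinite for tail_index <= 2."""
    magnitude = rng.paretovariate(tail_index)
    return magnitude if rng.random() < 0.5 else -magnitude


def stochastic_approximation(operator, x0, steps, noise_fn):
    """Iterate x_{k+1} = x_k - a_k * (F(x_k) + noise_k) with step size a_k = 1 / (k + 1)."""
    x = x0
    for k in range(steps):
        step_size = 1.0 / (k + 1)
        x = x - step_size * (operator(x) + noise_fn())
    return x


def F(x):
    # Strongly monotone operator with its root at x = 3 (illustrative choice)
    return x - 3.0


rng = random.Random(0)
noisy_estimate = stochastic_approximation(
    F, x0=0.0, steps=20000, noise_fn=lambda: symmetric_pareto_noise(rng)
)
noise_free_estimate = stochastic_approximation(F, x0=0.0, steps=10, noise_fn=lambda: 0.0)
print(noisy_estimate, noise_free_estimate)
```

With this particular operator and step size, the iterate equals 3 minus the running average of the noise samples, so the example also shows in the simplest case how averaging tames the noise without changing the iteration.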

A Subgoal-driven Framework for Improving Long-Horizon LLM Agents

一个以子目标为驱动的框架，用于改进长程LLM智能体

Authors: Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2603.19685
Pdf link: https://arxiv.org/pdf/2603.19685
Abstract Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track as new information arrives, lacking a clear and adaptive path toward the final goal. This issue is further exacerbated during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we propose two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism improves proprietary models such as Gemini by approximately a 10% absolute increase in success rate (SR) on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model increases its success rate from 6.4% to 43.0%. This performance surpasses proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent's long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.
中文摘要 基于大型语言模型（LLM）的智能体已成为数字环境（包括移动界面、操作系统和网页浏览器）的强大自主控制器。例如，网页导航需要处理动态内容和冗长的操作序列，这使得它尤其具有挑战性。现有基于LLM的智能体在长程规划方面主要面临两大困难。在线执行过程中，随着新信息的到来，它们常常迷失方向，缺乏通往最终目标的清晰且自适应的路径。在强化学习（RL）微调过程中，这一问题进一步加剧：奖励稀疏且延迟，使智能体难以识别哪些动作带来成功，从而无法在长时间任务中保持连贯的推理。为应对这些挑战，我们提出了两项贡献。首先，我们引入了一个智能体框架，利用专有模型通过子目标分解进行在线规划。其次，我们提出MiRA（Milestoning your Reinforcement Learning Enhanced Agent），这是一个使用密集的、基于里程碑的奖励信号的强化学习训练框架。实时规划机制使Gemini等专有模型在WebArena-Lite基准上的成功率（SR）绝对提升约10%。与此同时，将MiRA应用于开放的Gemma3-12B模型，其成功率从6.4%提升至43.0%。这一性能超过了GPT-4-Turbo（17.6%）和GPT-4o（13.9%）等专有系统，以及此前开放模型的最先进方法WebRL（38.4%）。总体而言，我们的发现表明，将显式的推理时规划与基于里程碑的奖励相结合，可显著提升智能体的长程能力，为更稳健、更通用的自主系统铺平了道路。

LoopRPT: Reinforcement Pre-Training for Looped Language Models

LoopRPT：循环语言模型的强化预训练

Authors: Guo Tang, Shixin Jiang, Heng Chang, Nuo Chen, Yuhan Li, Huiming Fan, Jia Li, Ming Liu, Bing Qin
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.19714
Pdf link: https://arxiv.org/pdf/2603.19714
Abstract Looped language models (LoopLMs) perform iterative latent computation to refine internal representations, offering a promising alternative to explicit chain-of-thought (CoT) reasoning. However, existing reinforcement learning (RL) paradigms primarily target output tokens, creating a structural mismatch with looped architectures whose reasoning unfolds implicitly. In this work, we propose LoopRPT, a reinforcement pre-training framework tailored for LoopLMs. By reframing next-token prediction as a next-token reasoning task, LoopRPT assigns reinforcement signals directly to latent steps using an EMA teacher reference and noisy latent rollouts. This formulation enables RL to directly shape intermediate representations, compressing effective reasoning into fewer iterations. We instantiate LoopRPT on the Ouro architecture across multiple model scales. Results demonstrate that LoopRPT consistently improves per-step representation quality, achieving Pareto dominance in accuracy-computation trade-offs. Notably, significant gains on hard tokens indicate that LoopRPT enhances early-stage reasoning rather than merely encouraging premature exits. Our findings highlight reinforcement pre-training as a principled paradigm for learning efficient latent reasoning in LoopLMs.
中文摘要 循环语言模型（LoopLMs）通过迭代的潜在计算来精炼内部表示，为显式思维链（CoT）推理提供了有前景的替代方案。然而，现有的强化学习（RL）范式主要针对输出词元，与推理隐式展开的循环架构存在结构性不匹配。在本研究中，我们提出了LoopRPT，一种专为LoopLMs量身定制的强化预训练框架。通过将下一词元预测重新表述为下一词元推理任务，LoopRPT利用EMA教师参考和带噪声的潜在展开（rollout），将强化信号直接分配给潜在步骤。这种表述使强化学习能够直接塑造中间表示，将有效推理压缩到更少的迭代次数中。我们在多个模型规模的Ouro架构上实现了LoopRPT。结果表明，LoopRPT持续提升每步表示质量，在准确率与计算量的权衡中实现帕累托占优。值得注意的是，在困难词元上的显著提升表明，LoopRPT增强的是早期阶段的推理能力，而不仅仅是鼓励过早退出。我们的发现凸显了强化预训练是在LoopLMs中学习高效潜在推理的一种有原则的范式。

FedPDPO: Federated Personalized Direct Preference Optimization for Large Language Model Alignment

FedPDPO：面向大型语言模型对齐的联邦个性化直接偏好优化

Authors: Kewen Zhu, Liping Yi, Zhiming Zhao, Zhuang Qi, Han Yu, Qinghua Hu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.19741
Pdf link: https://arxiv.org/pdf/2603.19741
Abstract Aligning large language models (LLMs) with human preferences in federated learning (FL) is challenging due to decentralized, privacy-sensitive, and highly non-IID preference data. Direct Preference Optimization (DPO) offers an efficient alternative to reinforcement learning with human feedback (RLHF), but its direct application in FL suffers from severe performance degradation under non-IID data and limited generalization of implicit rewards. To bridge this gap, we propose FedPDPO (Federated Personalized Direct Preference Optimization), a personalized federated framework for preference alignment of LLMs. It adopts a parameter-efficient fine-tuning architecture where each client maintains a frozen pretrained LLM backbone augmented with a Low-Rank Adaptation (LoRA) adapter, enabling communication-efficient aggregation. To address non-IID heterogeneity, we devise (1) the globally shared LoRA adapter with the personalized client-specific LLM head. Moreover, we introduce (2) a personalized DPO training strategy with a client-specific explicit reward head to complement implicit rewards and further alleviate non-IID heterogeneity, and (3) a bottleneck adapter to balance global and local features. We provide theoretical analysis establishing the probabilistic foundation and soundness. Extensive experiments on multiple preference datasets demonstrate state-of-the-art performance, achieving up to 4.80% average accuracy improvements in federated intra-domain and cross-domain settings.
中文摘要 由于偏好数据去中心化、隐私敏感且高度非IID，在联邦学习（FL）中将大型语言模型（LLMs）与人类偏好对齐具有挑战性。直接偏好优化（DPO）为基于人类反馈的强化学习（RLHF）提供了一种高效的替代方案，但其在FL中的直接应用会在非IID数据下出现严重的性能下降，且隐式奖励的泛化能力有限。为弥合这一差距，我们提出了FedPDPO（联邦个性化直接偏好优化），这是一个用于LLM偏好对齐的个性化联邦框架。它采用参数高效的微调架构，每个客户端维护一个冻结的预训练LLM骨干网络，并配备低秩适配（LoRA）适配器，实现通信高效的聚合。为解决非IID异构性，我们设计了（1）全局共享的LoRA适配器，并配以个性化的客户端专属LLM头。此外，我们引入了（2）个性化DPO训练策略，配备客户端专属的显式奖励头，以补充隐式奖励并进一步缓解非IID异构性，以及（3）瓶颈适配器，以平衡全局与局部特征。我们提供了理论分析，确立了其概率基础和合理性。在多个偏好数据集上的大量实验显示出最先进的性能，在联邦域内和跨域设置中平均准确率提升最高达4.80%。
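
For readers unfamiliar with the "implicit rewards" mentioned above, here is a minimal sketch of the standard DPO loss for a single preference pair, computed from sequence log-probabilities. It shows plain DPO only; the personalized heads, explicit reward head, and federated aggregation described in the abstract are not reproduced, and the function name and the default `beta=0.1` are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    # Implicit rewards are beta-scaled log-ratios between the policy and a frozen reference
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written as softplus(-margin) in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# A policy identical to the reference gives a zero margin, i.e. a loss of log(2)
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
```

The loss falls below log(2) once the policy raises the chosen response's log-probability relative to the rejected one, which is the signal each client would optimize locally.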

Generalized Task-Driven Design of Soft Robots via Reduced-Order FEM-based Surrogate Modeling

通过基于降阶有限元法的代理建模实现软机器人的通用任务驱动设计

Authors: Yao Yao, David Howard, Perla Maiolino
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.19794
Pdf link: https://arxiv.org/pdf/2603.19794
Abstract Task-driven design of soft robots requires models that are physically accurate and computationally efficient, while remaining transferable across actuator designs and task scenarios. However, existing modeling approaches typically face a fundamental trade-off between physical fidelity and computational efficiency, which limits model reuse across design and task variations and constrains scalable task-driven optimization. This paper presents a unified reduced-order finite element method (FEM)-based surrogate modeling pipeline for generalized task-driven soft robot design. High-fidelity FEM simulations characterize actuator behavior at the modular level, from which compact surrogate joint models are constructed for evaluation within a pseudo-rigid body model (PRBM). A meta-model maps actuator design parameters to surrogate representations, enabling rapid instantiation across a parameterized actuator family. The resulting models are embedded into a PRBM-based simulation environment, supporting task-level simulation and optimization under realistic physical constraints. The proposed pipeline is validated through sim-to-real transfer across multiple actuator types, including bellow-type pneumatic actuators and a tendon-driven soft finger, as well as two task-driven design studies: soft gripper co-design via Reinforcement Learning (RL) and 3D actuator shape matching via evolutionary optimization. The results demonstrate high accuracy, efficiency, and reliable reuse, providing a scalable foundation for autonomous task-driven soft robot design.
中文摘要 软机器人的任务驱动设计需要模型既物理准确又计算高效，同时能够在不同执行器设计和任务场景之间迁移。然而，现有建模方法通常面临物理保真度与计算效率之间的根本权衡，这限制了模型在设计和任务变体间的复用，并制约了可扩展的任务驱动优化。本文提出了一种统一的、基于降阶有限元法（FEM）的代理建模流水线，用于通用的任务驱动软机器人设计。高保真FEM仿真在模块层面表征执行器行为，并据此构建紧凑的代理关节模型，用于在伪刚体模型（PRBM）中进行评估。元模型将执行器设计参数映射到代理表示，从而在参数化的执行器族中实现快速实例化。所得模型被嵌入基于PRBM的仿真环境中，支持在真实物理约束下的任务级仿真和优化。该流水线通过多种执行器类型上的仿真到现实迁移得到验证，包括波纹管式气动执行器和腱驱动软手指，并通过两项任务驱动设计研究得到验证：基于强化学习（RL）的软体夹爪协同设计和基于进化优化的三维执行器形状匹配。结果显示出高精度、高效率和可靠的复用性，为自主的任务驱动软机器人设计提供了可扩展的基础。

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

FIPO：通过Future-KL影响的策略优化激发深度推理

Authors: Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2603.19835
Pdf link: https://arxiv.org/pdf/2603.19835
Abstract We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
中文摘要 我们提出了Future-KL影响的策略优化（FIPO），这是一种旨在克服大型语言模型推理瓶颈的强化学习算法。虽然GRPO式训练可有效扩展，但它通常依赖于基于结果的奖励（ORM），将全局优势均匀分配到轨迹中的每个词元。我们认为，这种粗粒度的信用分配无法区分关键的逻辑转折点和无关紧要的词元，从而造成了性能上限。FIPO通过将折扣的未来KL散度纳入策略更新来解决这一问题，构建了一种密集的优势表述，根据词元对后续轨迹行为的影响对其重新加权。实验上，FIPO使模型能够突破标准基线中出现的长度停滞。在Qwen2.5-32B上的评估显示，FIPO将平均思维链长度从约4000个词元扩展到超过10000个词元，并将AIME 2024 Pass@1准确率从50.0%提升至峰值58.0%（收敛于约56.0%）。这一结果优于DeepSeek-R1-Zero-Math-32B（约47.0%）和o1-mini（约56.0%）。我们的结果表明，建立密集的优势表述是改进基于ORM的算法、释放基础模型全部推理潜力的重要路径。我们开源了基于verl框架构建的训练系统。

NASimJax: GPU-Accelerated Policy Learning Framework for Penetration Testing

NASimJax：GPU加速策略学习框架用于渗透测试

Authors: Raphael Simon, José Carrasquel, Wim Mees, Pieter Libin
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Arxiv link: https://arxiv.org/abs/2603.19864
Pdf link: https://arxiv.org/pdf/2603.19864
Abstract Penetration testing, the practice of simulating cyberattacks to identify vulnerabilities, is a complex sequential decision-making task that is inherently partially observable and features large action spaces. Training reinforcement learning (RL) policies for this domain faces a fundamental bottleneck: existing simulators are too slow to train on realistic network scenarios at scale, resulting in policies that fail to generalize. We present NASimJax, a complete JAX-based reimplementation of the Network Attack Simulator (NASim), achieving up to 100x higher environment throughput than the original simulator. By running the entire training pipeline on hardware accelerators, NASimJax enables experimentation on larger networks under fixed compute budgets that were previously infeasible. We formulate automated penetration testing as a Contextual POMDP and introduce a network generation pipeline that produces structurally diverse and guaranteed-solvable scenarios. Together, these provide a principled basis for studying zero-shot policy generalization. We use the framework to investigate action-space scaling and generalization across networks of up to 40 hosts. We find that Prioritized Level Replay better handles dense training distributions than Domain Randomization, particularly at larger scales, and that training on sparser topologies yields an implicit curriculum that improves out-of-distribution generalization, even on topologies denser than those seen during training. To handle linearly growing action spaces, we propose a two-stage action decomposition (2SAS) that substantially outperforms flat action masking at scale. Finally, we identify a failure mode arising from the interaction between Prioritized Level Replay's episode-reset behaviour and 2SAS's credit assignment structure. NASimJax thus provides a fast, flexible, and realistic platform for advancing RL-based penetration testing.
中文摘要 渗透测试，即通过模拟网络攻击来识别漏洞的实践，是一项复杂的序贯决策任务，本质上是部分可观测的，且具有较大的动作空间。为该领域训练强化学习（RL）策略面临一个根本瓶颈：现有模拟器速度过慢，无法在大规模的真实网络场景上训练，导致策略无法泛化。我们提出NASimJax，这是对网络攻击模拟器（NASim）基于JAX的完整重新实现，环境吞吐量比原模拟器最高高出100倍。通过在硬件加速器上运行整个训练流水线，NASimJax使得在固定计算预算下对更大网络进行此前不可行的实验成为可能。我们将自动化渗透测试建模为上下文POMDP（Contextual POMDP），并引入了一条网络生成流水线，可生成结构多样且保证可解的场景。二者共同为研究零样本策略泛化提供了有原则的基础。我们利用该框架研究了最多40台主机的网络上的动作空间扩展和泛化。我们发现，优先级关卡回放（Prioritized Level Replay）比域随机化更能处理密集的训练分布，尤其是在较大规模上；并且在较稀疏的拓扑上训练会产生隐式课程，从而改善分布外泛化，即使在比训练中所见更密集的拓扑上也是如此。为了处理线性增长的动作空间，我们提出了一种两阶段动作分解（2SAS），在大规模场景下显著优于扁平动作掩码。最后，我们识别出一种失败模式，它源于Prioritized Level Replay的回合重置行为与2SAS的信用分配结构之间的相互作用。因此，NASimJax为推进基于强化学习的渗透测试提供了一个快速、灵活且逼真的平台。

What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

如果共识说谎怎么办？测试时的选择性-互补强化学习

Authors: Dong Yan, Jian Liang, Yanbo Wang, Shuo Lu, Ran He, Tieniu Tan
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.19880
Pdf link: https://arxiv.org/pdf/2603.19880
Abstract Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at this https URL.
中文摘要 测试时强化学习（TTRL）使大型语言模型（LLMs）能够通过多数投票共识获得伪奖励，从而在无标注的测试流上增强推理能力。然而，现有的TTRL方法完全依赖于正向伪标注策略。在答案分布高度分散的困难场景下，这种依赖变得脆弱：共识薄弱，会无意中将错误的轨迹强化为监督信号。本文提出了SCRL（选择性-互补强化学习），这是一种稳健的测试时强化学习框架，可有效缓解标签噪声放大。SCRL提出了选择性正向伪标注，通过严格的共识标准过滤不可靠的多数。作为补充，SCRL引入了熵门控负向伪标注，这是TTRL中首个负向监督机制，能够基于生成不确定性可靠地剪除错误轨迹。在多个推理基准上的大量实验表明，SCRL相较基线取得了显著改进，同时在受限的采样（rollout）预算下保持了稳健的泛化能力和训练稳定性。我们的代码可在此 https URL 访问。
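
The idea of filtering unreliable majorities can be sketched as majority voting with a consensus threshold: a pseudo-label is emitted only when the top answer's vote share is high enough. The threshold value and the function name `select_positive_label` are hypothetical, and the abstract's entropy-gated negative labeling is not reproduced here because its exact rule is not stated.

```python
from collections import Counter


def select_positive_label(answers, consensus_threshold=0.6):
    """Return the majority answer if its vote share meets the threshold, else None."""
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    vote_share = top_count / len(answers)
    # Weak consensus: abstain rather than reinforce a possibly wrong majority
    return top_answer if vote_share >= consensus_threshold else None


strong = select_positive_label(["42"] * 7 + ["41"] * 3)
dispersed = select_positive_label(["1", "2", "3", "4", "1"])
print(strong, dispersed)
```

In the dispersed case the most common answer holds only 40% of the votes, so no positive pseudo-label is produced for that prompt.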

Learning Adaptive Parameter Policies for Nonlinear Bayesian Filtering

学习非线性贝叶斯滤波的自适应参数策略

Authors: Ondrej Straka, Felipe Giraldo-Grueso, Renato Zanetti
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2603.19910
Pdf link: https://arxiv.org/pdf/2603.19910
Abstract Algorithms for Bayesian state estimation of nonlinear systems inevitably introduce approximation errors. These algorithms depend on parameters that influence the accuracy of the numerical approximations used. The parameters include, for example, the number of particles, scaling parameters, and the number of iterations in iterative computations. Typically, these parameters are fixed or adjusted heuristically, although the approximation accuracy can change over time with the local degree of nonlinearity and uncertainty. The approximation errors introduced at a time step propagate through subsequent updates, affecting the accuracy, consistency, and robustness of future estimates. This paper presents adaptive parameter selection in nonlinear Bayesian filtering as a sequential decision-making problem, where parameters influence not only the immediate estimation outcome but also the future estimates. The decision-making problem is addressed using reinforcement learning to learn adaptive parameter policies for nonlinear Bayesian filters. Experiments with the unscented Kalman filter and stochastic integration filter demonstrate that the learned policies improve both estimate quality and consistency.
中文摘要 非线性系统的贝叶斯状态估计算法不可避免地会引入近似误差。这些算法依赖于一些参数，这些参数会影响所用数值近似的精度，例如粒子数量、缩放参数以及迭代计算中的迭代次数。通常，这些参数是固定的或通过启发式方式调整，尽管近似精度会随局部非线性程度和不确定性而随时间变化。在某一时间步引入的近似误差会在后续更新中传播，影响未来估计的准确性、一致性和鲁棒性。本文将非线性贝叶斯滤波中的自适应参数选择表述为一个序贯决策问题，其中参数不仅影响当前的估计结果，还影响未来的估计。该决策问题通过强化学习求解，以学习非线性贝叶斯滤波器的自适应参数策略。在无迹卡尔曼滤波器和随机积分滤波器上的实验表明，所学策略同时提升了估计质量和一致性。
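
To make the role of a "scaling parameter" concrete, the sketch below implements the basic scalar unscented transform that underlies the unscented Kalman filter, with `kappa` as the tunable parameter that sets how far the sigma points spread. This is the textbook three-point form for a one-dimensional state, given as an assumption-laden illustration; the learned parameter policy from the paper is not shown.

```python
import math


def unscented_mean_1d(f, mean, variance, kappa=2.0):
    """Approximate E[f(x)] for a scalar x with given mean and variance using three sigma points."""
    n = 1  # state dimension
    spread = math.sqrt((n + kappa) * variance)
    sigma_points = [mean, mean + spread, mean - spread]
    # The centre point carries weight kappa / (n + kappa); the outer points share the rest
    weights = [kappa / (n + kappa), 0.5 / (n + kappa), 0.5 / (n + kappa)]
    return sum(w * f(x) for w, x in zip(weights, sigma_points))


# For f(x) = x^2 the exact answer is mean^2 + variance
estimate = unscented_mean_1d(lambda x: x * x, mean=1.0, variance=4.0, kappa=2.0)
print(estimate)
```

For a quadratic the transform is exact for any valid `kappa`; for stronger nonlinearities the approximation error depends on `kappa`, which is the kind of parameter the abstract proposes to adapt over time.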

Robust Beam Codebooks for mmWave/THz Systems: Toward a Stochastic RL Approach

面向毫米波/太赫兹系统的稳健波束码本：迈向随机强化学习方法

Authors: Anouar Nechi, Rainer Buchty, Mladen Berekovic, Saleh Mulhem
Subjects: Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2603.19930
Pdf link: https://arxiv.org/pdf/2603.19930
Abstract Millimeter-wave (mmWave) and terahertz (THz) massive MIMO systems often rely on predefined beamforming codebooks, which are usually suboptimal in Non-Line-of-Sight (NLoS) conditions and for hardware-limited transceivers. Reinforcement Learning (RL) enables adaptive, data-driven codebook design without explicit Channel State Information (CSI), but the robustness of such algorithms in practical conditions is underexplored. This paper introduces a robust multi-agent RL framework that learns beam codebooks directly from environmental feedback, eliminating the need for prior channel knowledge. Our method is well-suited for real-world deployments facing unpredictable propagation and hardware constraints. We conduct a comprehensive analysis of three off-policy algorithms, Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC), evaluating their resilience to hardware impairments and feedback noise. Simulations show that SAC consistently outperforms deterministic methods, achieving superior beamforming gains and stability in NLoS scenarios, even under severe impairments. These results demonstrate the promise of RL-based codebook design for robust mmWave/THz massive MIMO systems.
中文摘要 毫米波（mmWave）和太赫兹（THz）大规模MIMO系统通常依赖预定义的波束成形码本，而这些码本在非视距（NLoS）条件下以及对于硬件受限的收发器通常并非最优。强化学习（RL）使得无需显式信道状态信息（CSI）的自适应、数据驱动的码本设计成为可能，但此类算法在实际条件下的鲁棒性尚未得到充分研究。本文提出了一个稳健的多智能体强化学习框架，直接从环境反馈中学习波束码本，无需先验信道知识。我们的方法非常适合面临不可预测传播和硬件约束的实际部署。我们对三种离策略（off-policy）算法进行了全面分析：深度确定性策略梯度（DDPG）、双延迟DDPG（TD3）和软演员-评论家（SAC），评估它们对硬件损伤和反馈噪声的抵抗能力。仿真表明，即使在严重损伤下，SAC在NLoS场景中也始终优于确定性方法，实现了更优的波束成形增益和稳定性。这些结果展示了基于强化学习的码本设计在稳健的毫米波/太赫兹大规模MIMO系统中的前景。

SAGE: Sustainable Agent-Guided Expert-tuning for Culturally Attuned Translation in Low-Resource Southeast Asia

SAGE：可持续的智能体引导专家调优，面向低资源东南亚地区的文化适配翻译

Authors: Zhixiang Lu, Chong Zhang, Yulong Li, Angelos Stefanidis, Anh Nguyen, Imran Razzak, Jionglong Su, Zhengyong Jiang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.19931
Pdf link: https://arxiv.org/pdf/2603.19931
Abstract The vision of an inclusive World Wide Web is impeded by a severe linguistic divide, particularly for communities in low-resource regions of Southeast Asia. While large language models (LLMs) offer a potential solution for translation, their deployment in data-poor contexts faces a dual challenge: the scarcity of high-quality, culturally relevant data and the prohibitive energy costs of training on massive, noisy web corpora. To resolve the tension between digital inclusion and environmental sustainability, we introduce Sustainable Agent-Guided Expert-tuning (SAGE). This framework pioneers an energy-aware paradigm that prioritizes the "right data" over "big data". Instead of carbon-intensive training on unfiltered datasets, SAGE employs a reinforcement learning (RL) agent, optimized via Group Relative Policy Optimization (GRPO), to autonomously curate a compact training set. The agent utilizes a semantic reward signal derived from a small, expert-constructed set of community dialogues to filter out noise and cultural misalignment. We then efficiently fine-tune open-source LLMs on this curated data using Low-Rank Adaptation (LoRA). We applied SAGE to translation tasks between English and seven low-resource languages (LRLs) in Southeast Asia. Our approach establishes new state-of-the-art performance on BLEU-4 and COMET-22 metrics, effectively capturing local linguistic nuances. Crucially, SAGE surpasses baselines trained on full datasets while reducing data usage by 97.1% and training energy consumption by 95.2%. By delivering high-performance models with a minimal environmental footprint, SAGE offers a scalable and responsible pathway to bridge the digital divide in the Global South.
中文摘要 包容性万维网的愿景受到严重语言鸿沟的阻碍，尤其是对东南亚低资源地区的社区而言。虽然大型语言模型（LLMs）为翻译提供了潜在解决方案，但它们在数据匮乏环境中的部署面临双重挑战：高质量且具文化相关性的数据稀缺，以及在庞大、嘈杂的网络语料库上训练的高昂能源成本。为了化解数字包容性与环境可持续性之间的矛盾，我们提出了可持续的智能体引导专家调优（SAGE）。该框架开创了一种能耗感知范式，优先考虑“正确的数据”而非“大数据”。SAGE不在未过滤的数据集上进行碳密集型训练，而是采用一个通过Group Relative Policy Optimization（GRPO）优化的强化学习（RL）智能体，自主筛选出紧凑的训练集。该智能体利用源自一小组由专家构建的社区对话的语义奖励信号，过滤噪声和文化错位。随后，我们利用低秩适配（LoRA）在这些筛选后的数据上高效地微调开源LLM。我们将SAGE应用于英语与东南亚七种低资源语言（LRL）之间的翻译任务。我们的方法在BLEU-4和COMET-22指标上取得了新的最先进性能，有效捕捉了当地语言的细微差别。关键在于，SAGE超越了在完整数据集上训练的基线，同时将数据使用量减少了97.1%，训练能耗降低了95.2%。通过以最小的环境足迹提供高性能模型，SAGE为弥合全球南方的数字鸿沟提供了一条可扩展且负责任的路径。

GustPilot: A Hierarchical DRL-INDI Framework for Wind-Resilient Quadrotor Navigation

GustPilot：用于抗风四旋翼导航的分层DRL-INDI框架

Authors: Amir Atef Habel, Roohan Ahmed Khan, Fawad Mehboob, Clement Fortin, Dzmitry Tsetserukou
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.19966
Pdf link: https://arxiv.org/pdf/2603.19966
Abstract Wind disturbances remain a key barrier to reliable autonomous navigation for lightweight quadrotors, where the rapidly varying airflow can destabilize both planning and tracking. This paper introduces GustPilot, a hierarchical wind-resilient navigation stack in which a deep reinforcement learning (DRL) policy generates inertial-frame velocity reference for gate traversal. At the same time, a geometric Incremental Nonlinear Dynamic Inversion (INDI) controller provides low-level tracking with fast residual disturbance rejection. The INDI layer achieves this by providing incremental feedback on both specific linear acceleration and angular acceleration rate, using onboard sensor measurements to reject wind disturbances rapidly. Robustness is obtained through a two-level strategy, wind-aware planning learned via fan-jet domain randomization during training, and rapid execution-time disturbance rejection by the INDI tracking controller. We evaluate GustPilot in real flights on a 50g quad-copter platform against a DRL-PID baseline across four scenarios ranging from no-wind to fully dynamic conditions with a moving gate and a moving disturbance source. Despite being trained only in a minimal single-gate and single-fan setup, the policy generalizes to significantly more complex environments (up to six gates and four fans) without retraining. Across 80 experiments, DRL-INDI achieves a 94.7% versus 55.0% for DRL-PID as average Overall Success Rate (OSR), reduces tracking RMSE up to 50%, and sustains speeds up to 1.34 m/s under wind disturbances up to 3.5 m/s. These results demonstrate that combining DRL-based velocity planning with structured INDI disturbance rejection provides a practical and generalizable approach to wind-resilient autonomous flight navigation.
中文摘要 风扰动仍是轻型四旋翼飞行器可靠自主导航的主要障碍，快速变化的气流可能同时破坏规划和跟踪的稳定性。本文提出了GustPilot，一种分层的抗风导航栈，其中深度强化学习（DRL）策略生成用于穿越门框的惯性系速度参考。同时，几何增量非线性动态逆（INDI）控制器提供底层跟踪，并快速抑制残余扰动。INDI层通过对线性比加速度（specific linear acceleration）和角加速度提供增量反馈，利用机载传感器测量快速抑制风扰来实现这一点。鲁棒性通过两级策略获得：训练期间通过风扇射流域随机化学习到的风感知规划，以及INDI跟踪控制器在执行时的快速扰动抑制。我们在一个50克的四旋翼平台上进行真实飞行，以DRL-PID为基线，在四种场景下评估GustPilot，涵盖从无风到带有移动门框和移动扰动源的全动态条件。尽管仅在最简的单门单风扇设置中训练，该策略无需重新训练即可泛化到复杂得多的环境（最多六个门和四个风扇）。在80次实验中，DRL-INDI的平均总体成功率（OSR）为94.7%，而DRL-PID为55.0%；跟踪RMSE最多降低50%；在最高3.5米/秒的风扰动下可维持最高1.34米/秒的速度。这些结果表明，将基于DRL的速度规划与结构化的INDI扰动抑制相结合，为抗风自主飞行导航提供了一种实用且可泛化的方法。

Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States

通过重新引入马尔可夫状态，突破LLM后训练的能力上限

Authors: Yurun Yuan, Tengyang Xie
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.19987
Pdf link: https://arxiv.org/pdf/2603.19987
Abstract Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent "capability ceiling": unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond "history-as-state" modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.
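Abstract (translated from Chinese): Reinforcement learning (RL) has become a standard paradigm for post-training and aligning large language models (LLMs), but recent evidence suggests that it faces a persistent "capability ceiling": unlike classical RL systems, which discover new strategies, RL for LLMs often only refines patterns already latent in the pre-trained weights. In this work, we identify a fundamental structural bottleneck: classical RL relies on compact, informative Markov states, whereas current LLM post-training formulations are tied to an ever-growing history of actions. We revisit a classical principle that has long been central to RL but is missing from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees showing that using estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks through the performance limits of standard RL post-training on a suite of complex logic puzzles. Our findings suggest that moving beyond "history-as-state" modeling toward structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in generative AI.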
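
The contrast between "history-as-state" and a compact Markov state can be illustrated with a toy example. The sketch below is not from the paper: the one-dimensional track puzzle and the names TRACK_LEN, step, markov_state, and count_states are all hypothetical, chosen only to show that the number of distinct action histories grows exponentially with the horizon while the number of Markov states stays bounded.

```python
from itertools import product

TRACK_LEN = 5  # hypothetical toy puzzle: a token on cells 0..4


def step(pos, action):
    """Move the token one cell left ("L") or right ("R"), clamped to the track."""
    delta = -1 if action == "L" else 1
    return min(max(pos + delta, 0), TRACK_LEN - 1)


def markov_state(history, start=0):
    """Fold an action history into the compact state that determines the future."""
    pos = start
    for action in history:
        pos = step(pos, action)
    return pos


def count_states(horizon):
    """Return (number of distinct histories, number of distinct Markov states)."""
    histories = list(product("LR", repeat=horizon))
    markov = {markov_state(h) for h in histories}
    return len(histories), len(markov)


if __name__ == "__main__":
    for t in (2, 4, 8):
        n_hist, n_markov = count_states(t)
        print(f"horizon={t}: {n_hist} histories, {n_markov} Markov states")
```

In this toy setting, a learner that conditions on the full history sees each of the 2^t histories as a different input, whereas one that conditions on the token position sees at most TRACK_LEN inputs. How the paper estimates Markov states for LLM post-training is not described in the abstract.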

ReViSQL: Achieving Human-Level Text-to-SQL

Title (translated from Chinese): ReViSQL: Achieving Human-Level Text-to-SQL

Authors: Yuxuan Zhu, Tengjun Jin, Yoojin Choi, Daniel Kang
Subjects: Databases (cs.DB); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2603.20004
Pdf link: https://arxiv.org/pdf/2603.20004
Abstract Translating natural language to SQL (Text-to-SQL) is a critical challenge in both database research and data analytics applications. Recent efforts have focused on enhancing SQL reasoning by developing large language models and AI agents that decompose Text-to-SQL tasks into manually designed, step-by-step pipelines. However, despite these extensive architectural engineering efforts, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet achieved human-level accuracy on the BIRD benchmark. In this paper, we show that closing this gap does not require further architectural complexity, but rather clean training data to improve SQL reasoning of the underlying models. We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time. Instead of complex AI agents, ReViSQL leverages reinforcement learning with verifiable rewards (RLVR) on BIRD-Verified, a dataset we curated comprising 2.5k verified Text-to-SQL instances based on the BIRD Train set. To construct BIRD-Verified, we design a data correction and verification workflow involving SQL experts. We identified and corrected data errors in 61.1% of a subset of BIRD Train. By training on BIRD-Verified, we show that improving data quality alone boosts the single-generation accuracy by 8.2-13.9% under the same RLVR algorithm. To further enhance performance, ReViSQL performs inference-time scaling via execution-based reconciliation and majority voting. Empirically, we demonstrate the superiority of our framework with two model scales: ReViSQL-235B-A22B and ReViSQL-30B-A3B. On an expert-verified BIRD Mini-Dev set, ReViSQL-235B-A22B achieves 93.2% execution accuracy, exceeding the proxy human-level accuracy (92.96%) and outperforming the prior open-source SOTA method by 9.8%. Our lightweight ReViSQL-30B-A3B matches the prior SOTA at a 7.5× lower per-query cost.
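Abstract (translated from Chinese): Translating natural language to SQL (Text-to-SQL) is a key challenge in database research and data analytics applications. Recent work has focused on strengthening SQL reasoning by developing large language models and AI agents that decompose Text-to-SQL tasks into manually designed, step-by-step pipelines. Despite this extensive architectural engineering, a significant gap remains: even state-of-the-art (SOTA) AI agents have not yet reached human-level accuracy on the BIRD benchmark. This paper shows that closing the gap does not require more architectural complexity, but rather clean training data that improves the SQL reasoning of the underlying models. We introduce ReViSQL, a streamlined framework that achieves human-level accuracy on BIRD for the first time. Instead of complex AI agents, ReViSQL applies reinforcement learning with verifiable rewards (RLVR) to BIRD-Verified, a dataset we curated from the BIRD Train set that contains 2.5k verified Text-to-SQL instances. To build BIRD-Verified, we designed a data correction and verification workflow involving SQL experts, and we identified and corrected data errors in 61.1% of a subset of BIRD Train. Training on BIRD-Verified shows that improving data quality alone raises single-generation accuracy by 8.2-13.9% under the same RLVR algorithm. To improve performance further, ReViSQL performs inference-time scaling through execution-based reconciliation and majority voting. Empirically, we demonstrate the advantage of our framework at two model scales: ReViSQL-235B-A22B and ReViSQL-30B-A3B. On an expert-verified BIRD Mini-Dev set, ReViSQL-235B-A22B achieves 93.2% execution accuracy, exceeding the proxy human-level accuracy (92.96%) and outperforming the prior open-source SOTA method by 9.8%. Our lightweight ReViSQL-30B-A3B matches the prior SOTA at a 7.5× lower per-query cost.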
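
The abstract mentions inference-time scaling via execution-based reconciliation and majority voting but gives no details. The following is a minimal sketch of a generic execution-based voting scheme, not the authors' implementation: the helper names execute and vote, the use of SQLite, and the order-insensitive comparison of result rows are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a candidate query; return an order-insensitive result key, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so rows containing NULLs or mixed types still compare safely.
    return tuple(sorted(rows, key=repr))


def vote(conn, candidates):
    """Return a candidate whose execution result is shared by the most candidates."""
    results = [(sql, execute(conn, sql)) for sql in candidates]
    counts = Counter(res for _, res in results if res is not None)
    if not counts:
        return None  # every candidate failed to execute
    winner = counts.most_common(1)[0][0]
    # Return the first candidate (in input order) that produced the winning result.
    return next(sql for sql, res in results if res == winner)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
    candidates = [
        "SELECT COUNT(*) FROM t",
        "SELECT COUNT(id) FROM t",
        "SELECT MAX(id) FROM t WHERE id < 3",
        "SELECT COUNT(* FROM t",  # malformed on purpose; it is dropped from the vote
    ]
    print(vote(conn, candidates))
```

Voting on execution results rather than on SQL text lets syntactically different but semantically equivalent queries support each other, and candidates that fail to execute are excluded from the count.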

Experience is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs

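Title (translated from Chinese): Experience Is the Best Teacher: Motivating Effective Exploration in Reinforcement Learning for LLMs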

Authors: Wenjian Zhang, Kongcheng Zhang, Jiaxin Qi, Baisheng Lai, Jianqiang Huang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.20046
Pdf link: https://arxiv.org/pdf/2603.20046
Abstract Reinforcement Learning (RL) with rubric-based rewards has recently shown remarkable progress in enhancing general reasoning capabilities of Large Language Models (LLMs), yet still suffers from ineffective exploration confined to the current policy distribution. In fact, RL optimization can be viewed as steering the policy toward an ideal distribution that maximizes the rewards, while effective exploration should align efforts with the desired target. Leveraging this insight, we propose HeRL, a Hindsight experience guided Reinforcement Learning framework to bootstrap effective exploration by explicitly telling LLMs the desired behaviors specified in rewards. Concretely, HeRL treats failed trajectories along with their unmet rubrics as hindsight experience, which serves as in-context guidance for the policy to explore desired responses beyond its current distribution. Additionally, we introduce a bonus reward to incentivize responses with greater potential for improvement under such guidance. HeRL facilitates effective learning from desired high quality samples without repeated trial-and-error from scratch, yielding a more accurate estimation of the expected gradient theoretically. Extensive experiments across various benchmarks demonstrate that HeRL achieves superior performance gains over baselines, and can further benefit from experience guided self-improvement at test time. Our code is available at this https URL.
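Abstract (translated from Chinese): Reinforcement learning (RL) with rubric-based rewards has recently made remarkable progress in enhancing the general reasoning capabilities of large language models (LLMs), but it still suffers from ineffective exploration that is confined to the current policy distribution. RL optimization can in fact be viewed as steering the policy toward an ideal distribution that maximizes reward, and effective exploration should align its effort with that target. Building on this insight, we propose HeRL, a hindsight-experience-guided reinforcement learning framework that bootstraps effective exploration by explicitly telling LLMs the desired behaviors specified in the rewards. Concretely, HeRL treats failed trajectories, together with their unmet rubrics, as hindsight experience that serves as in-context guidance for the policy to explore desired responses beyond its current distribution. We also introduce a bonus reward to incentivize responses that have greater potential for improvement under such guidance. HeRL enables effective learning from desired high-quality samples without repeated trial and error from scratch, and in theory yields a more accurate estimate of the expected gradient. Extensive experiments on a range of benchmarks show that HeRL achieves larger performance gains than the baselines and can benefit further from experience-guided self-improvement at test time. Our code is available at this https URL.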

Fine-tuning Timeseries Predictors Using Reinforcement Learning

Title (translated from Chinese): Fine-Tuning Time-Series Predictors Using Reinforcement Learning

Authors: Hugo Cazaux, Ralph Rudd, Hlynur Stefánsson, Sverrir Ólafsson, Eyjólfur Ingi Ásgeirsson
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.20063
Pdf link: https://arxiv.org/pdf/2603.20063
Abstract This chapter presents three major reinforcement learning algorithms used for fine-tuning financial forecasters. We propose a clear implementation plan for backpropagating the loss of a reinforcement learning task to a model trained using supervised learning, and compare the performance before and after the fine-tuning. We find an increase in performance after fine-tuning, and transfer learning properties to the models, indicating the benefits of fine-tuning. We also highlight the tuning process and empirical results for future implementation by practitioners.
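Abstract (translated from Chinese): This chapter presents three major reinforcement learning algorithms used to fine-tune financial forecasters. We propose a clear implementation plan for backpropagating the loss of a reinforcement learning task into a model trained with supervised learning, and we compare performance before and after fine-tuning. We find that performance improves after fine-tuning and that the models show transfer learning properties, which indicates the benefits of fine-tuning. We also highlight the tuning process and empirical results to support future implementation by practitioners.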

Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning

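Title (translated from Chinese): Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning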

Authors: Jiajie Li, Chenhui Xu, Meihuan Liu, Jinjun Xiong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2603.20116
Pdf link: https://arxiv.org/pdf/2603.20116
Abstract Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model's inherent reasoning and perceptual capabilities. CoA introduces a structured reasoning format that enhances domain alignment without sacrificing general multimodal competence by reinforcement learning. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model's core visual-language abilities, providing a reliable pathway for domain specialization in VLMs.
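Abstract (translated from Chinese): Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors and reduce generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while preserving the model's inherent reasoning and perceptual capabilities. CoA introduces a structured reasoning format that uses reinforcement learning to strengthen domain alignment without sacrificing general multimodal competence. Experiments on standard surgical benchmarks, in both in-distribution and out-of-distribution settings, show that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Ablation studies further confirm that CoA effectively preserves the model's core vision-language abilities, providing a reliable path to domain specialization in VLMs.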

AGILE: A Comprehensive Workflow for Humanoid Loco-Manipulation Learning

Title (translated from Chinese): AGILE: A Comprehensive Workflow for Humanoid Loco-Manipulation Learning

Authors: Huihua Zhao, Rafael Cathomen, Lionel Gulich, Wei Liu, Efe Arda Ongan, Michael Lin, Shalin Jain, Soha Pouya, Yan Chang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2603.20147
Pdf link: https://arxiv.org/pdf/2603.20147
Abstract Recent advances in reinforcement learning (RL) have enabled impressive humanoid behaviors in simulation, yet transferring these results to new robots remains challenging. In many real deployments, the primary bottleneck is no longer simulation throughput or algorithm design, but the absence of systematic infrastructure that links environment verification, training, evaluation, and deployment in a coherent loop. To address this gap, we present AGILE, an end-to-end workflow for humanoid RL that standardizes the policy-development lifecycle to mitigate common sim-to-real failure modes. AGILE comprises four stages: (1) interactive environment verification, (2) reproducible training, (3) unified evaluation, and (4) descriptor-driven deployment via robot/task configuration descriptors. For the evaluation stage, AGILE supports both scenario-based tests and randomized rollouts under a shared suite of motion-quality diagnostics, enabling automated regression testing and principled robustness assessment. AGILE also incorporates a set of training stabilizations and algorithmic enhancements in the training stage to improve optimization stability and sim-to-real transfer. With this pipeline in place, we validate AGILE across five representative humanoid skills spanning locomotion, recovery, motion imitation, and loco-manipulation on two hardware platforms (Unitree G1 and Booster T1), achieving consistent sim-to-real transfer. Overall, AGILE shows that a standardized, end-to-end workflow can substantially improve the reliability and reproducibility of humanoid RL development.
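Abstract (translated from Chinese): Recent advances in reinforcement learning (RL) have produced impressive humanoid behaviors in simulation, but transferring these results to new robots remains challenging. In many real deployments, the main bottleneck is no longer simulation throughput or algorithm design, but the lack of systematic infrastructure that connects environment verification, training, evaluation, and deployment into a coherent loop. To fill this gap, we present AGILE, an end-to-end workflow for humanoid RL that standardizes the policy-development lifecycle to mitigate common sim-to-real failure modes. AGILE comprises four stages: (1) interactive environment verification, (2) reproducible training, (3) unified evaluation, and (4) descriptor-driven deployment via robot/task configuration descriptors. In the evaluation stage, AGILE supports both scenario-based tests and randomized rollouts under a shared suite of motion-quality diagnostics, enabling automated regression testing and principled robustness assessment. In the training stage, AGILE also incorporates a set of training stabilizations and algorithmic enhancements to improve optimization stability and sim-to-real transfer. With this pipeline in place, we validate AGILE on two hardware platforms (Unitree G1 and Booster T1) across five representative humanoid skills spanning locomotion, recovery, motion imitation, and loco-manipulation, achieving consistent sim-to-real transfer. Overall, AGILE shows that a standardized, end-to-end workflow can substantially improve the reliability and reproducibility of humanoid RL development.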

Keyword: diffusion policy

No results found.