Arxiv Papers of Today

Generated: 2026-04-22 17:23:58 (UTC+8); arXiv release time: 2026-04-22 20:00 EDT (2026-04-23 08:00 UTC+8)

35 relevant papers today

Keyword: reinforcement learning

ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

ARGUS：由数据流不变量引导的代理GPU优化

Authors: Haohui Mai, Xiaoyan Guo, Xiangyun Ding, Daifeng Li, Qiuchu Yu, Chenzhun Guo, Cong Wang, Jiacheng Zhao, Christos Kozyrakis, Binhang Yuan
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Arxiv link: https://arxiv.org/abs/2604.18616
Pdf link: https://arxiv.org/pdf/2604.18616
Abstract LLM-based coding agents can generate functionally correct GPU kernels, yet their performance remains far below hand-optimized libraries on critical computations such as matrix multiplication, attention, and Mixture-of-Experts (MoE). Peak GPU performance requires coordinated reasoning over tightly coupled optimizations, including tiling, shared-memory staging, software pipelining, and instruction scheduling, while existing agents rely on sparse pass/fail feedback, leaving them unable to diagnose global constraint violations. We present Argus, an agentic framework that addresses this through data-flow invariants: compile-time specifications encoding how data must be choreographed throughout kernel execution. Argus introduces a tile-based, Pythonic DSL exposing hardware instructions and compiler policies while hiding low-level representations. The DSL provides tag functions to propagate symbolic annotations through data and control flow, and tag assertions to enforce relational constraints at use sites. When violations occur, the compiler returns concrete counterexamples identifying the thread, data element, and program point, enabling dense, structured feedback for targeted fixes. Invariants are verified at compile time via abstract interpretation over a layout algebra and SMT solving, with zero runtime overhead. An in-context reinforcement learning planner learns to select optimizations and synthesize effective invariants, supported by a curated knowledge base of GPU optimization techniques. We evaluate Argus on the AMD MI300X GPU across GEMM, flash attention, and MoE kernels accounting for over 90% of GPU time in LLM inference. Generated kernels achieve 99-104% of state-of-the-art hand-optimized assembly throughput and are 2-1543x faster than existing agentic systems. Argus further generalizes to 200 KernelBench tasks, solving 100% of Level 1 and 90% of Level 2 problems.
中文摘要 基于LLM的编码代理可以生成功能正确的GPU内核，但在矩阵乘法、注意力和专家混合（MoE）等关键计算上，其性能仍远低于手工优化的库。GPU的峰值性能需要对紧密耦合的优化进行协调推理，包括分块（tiling）、共享内存暂存、软件流水线和指令调度，而现有代理依赖稀疏的通过/失败反馈，导致无法诊断全局约束违规。我们介绍Argus，一个通过数据流不变量来解决这一问题的代理框架：数据流不变量是编译时规范，编码数据在内核执行过程中必须如何编排。Argus 引入了基于分块的 Pythonic DSL，暴露硬件指令和编译器策略，同时隐藏底层表示。DSL提供标签函数，用于通过数据和控制流传播符号注释，并提供标签断言，用于在使用点强制执行关系约束。当发生违规时，编译器会返回具体的反例，指出线程、数据元素和程序点，从而为有针对性的修复提供密集且结构化的反馈。不变量在编译时通过基于布局代数的抽象解释和SMT求解进行验证，运行时开销为零。上下文强化学习规划器学习选择优化并综合有效的不变量，并辅以精心策划的GPU优化技术知识库。我们在 AMD MI300X GPU 上针对 GEMM、flash attention 和 MoE 内核评估了 Argus，这些内核在 LLM 推理中占 GPU 时间的 90% 以上。生成的内核达到了最先进手工优化汇编吞吐量的99-104%，速度比现有代理系统快2倍到1543倍。Argus进一步推广至200个KernelBench任务，解决了100%的Level 1问题和90%的Level 2问题。

Discrete Tilt Matching

离散倾斜匹配

Authors: Yuyuan Chen, Shiyi Wang, Peter Potaptchik, Jaeyeon Kim, Michael S. Albergo
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2604.18739
Pdf link: https://arxiv.org/pdf/2604.18739
Abstract Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) methods have recently been adapted to dLLM fine-tuning, their objectives typically depend on sequence-level marginal likelihoods, which are intractable for masked diffusion models. To address this, we derive Discrete Tilt Matching (DTM), a likelihood-free method that recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting. DTM takes the form of a weighted cross-entropy objective with explicit minimizer, and admits control variates that improve training stability. On a synthetic maze-planning task, we analyze how DTM's annealing schedule and control variates affect training stability and prevent mode collapse. At scale, fine-tuning LLaDA-8B-Instruct with DTM yields strong gains on Sudoku and Countdown while remaining competitive on MATH500 and GSM8K.
中文摘要 掩盖扩散大型语言模型（dLLMs）是自回归生成的有前景替代方案。虽然强化学习（RL）方法近年来已被应用于dLLM微调，但其目标通常依赖于序列层面的边际似然，而这对掩蔽扩散模型来说难以处理。为此，我们推导出离散倾斜匹配（DTM），这是一种无似然方法，将dLLM微调重新定义为在奖励倾斜下局部揭露后验的状态级匹配。DTM采用加权交叉熵目标，带有显式最小化，并允许控制变量以提升训练稳定性。在合成迷宫规划任务中，我们分析DTM退火计划和控制变量如何影响训练稳定性并防止模式崩溃。在大规模上，LLaDA-8B-Instruct与DTM微调可实现数独和倒计时的强劲提升，同时在MATH500和GSM8K上保持竞争力。

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System

ARES：自适应红队化及政策奖励系统的端到端修复

Authors: Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Charith Peris
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.18789
Pdf link: https://arxiv.org/pdf/2604.18789
Abstract Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches primarily target policy-level weaknesses, they overlook what we term systemic weaknesses: cases where both the core LLM and the RM fail in tandem. We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a "Safety Mentor" that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the RM simultaneously. Using the vulnerabilities it uncovers, ARES implements a two-stage repair process: first fine-tuning the RM to better detect harmful content, then leveraging the improved RM to optimize the core model. Experiments across multiple adversarial safety benchmarks demonstrate that ARES substantially enhances safety robustness while preserving model capabilities, establishing a new paradigm for comprehensive RLHF safety alignment.
中文摘要 来自人类反馈的强化学习（RLHF）是对齐大型语言模型（LLMs）的核心，但它也带来了一个关键漏洞：不完美的奖励模型（RM）如果未能惩罚不安全的行为，可能会成为单点故障。现有的红队方法主要针对策略层面的弱点，但它们忽视了我们所称的系统性弱点，即核心大型语言模型和奖励模型（RM）同时失效的情况。我们介绍ARES，一个系统性发现并缓解此类双重脆弱性的框架。ARES 采用“安全导师”，通过结合结构化组件类型（主题、人物、战术、目标）动态组合语义连贯的对抗提示，并生成相应的恶意响应和安全响应。这种双目标方法同时暴露了核心LLM和RM的弱点。利用所发现的漏洞，ARES 实现了两阶段修复流程：首先微调 RM 以更好地检测有害内容，然后利用改进后的 RM 优化核心模型。跨多个对抗性安全基准的实验表明，ARES 在保持模型能力的同时大幅提升了安全稳健性，建立了全面的 RLHF 安全对齐新范式。

Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness

优先考虑最佳：通过奖励超越答案正确性的奖励，激励可靠的多模态推理

Authors: Mengzhao Jia, Zhihan Zhang, Meng Jiang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.18892
Pdf link: https://arxiv.org/pdf/2604.18892
Abstract Reinforcement Learning with Verifiable Rewards (RLVR) improves multimodal reasoning by rewarding verifiable final answers. Yet answer-correct trajectories may still rely on incomplete derivations, weak evidence, or statements that contradict their conclusions. This gap between answer correctness and reasoning validity, which we call reasoning-answer inconsistency, motivates trajectory supervision in multimodal RL. We compare two main approaches: Reward Models (RMs) and Generative Rewards (GRs). RMs are efficient and help early in training, but their gains weaken as the policy distribution shifts; GRs improve performance, but may give unstable rewards and are computationally expensive. We therefore propose Groupwise Ranking Reward, which ranks verifier-passed trajectories for the same prompt in one pass and redistributes reward accordingly. Groupwise comparison better separates stronger and weaker correct trajectories with lower judge overhead than GRs. Experiments show that RLVR aggravates reasoning-answer inconsistency, while trajectory supervision alleviates it. Groupwise Ranking Reward performs best overall, improving reliability-conditioned accuracy from 47.4% to 54.7% over RLVR.
中文摘要 可验证奖励的强化学习（RLVR）通过奖励可验证的最终答案来提升多模态推理能力。然而，答案正确的轨迹仍可能依赖于不完整的推导、薄弱的证据或与结论相矛盾的陈述。这种答案正确性与推理有效性的差距，我们称之为推理-答案不一致，这促使我们在多模态强化学习中引入轨迹监督。我们比较了两种主要方法：奖励模型（RM）和生成式奖励（GR）。RM效率高且在训练初期有帮助，但随着策略分布的偏移，其收益会减弱；GR能提升性能，但可能给出不稳定的奖励，且计算开销高昂。因此，我们提出了分组排名奖励，即在一次评判中对同一提示下通过验证器的轨迹进行排名，并相应地重新分配奖励。分组比较比GR更能区分较强和较弱的正确轨迹，且评判开销更低。实验显示RLVR加剧推理-答案不一致，而轨迹监督则缓解此问题。分组排名奖励整体表现最佳，相比RLVR将可靠性条件准确率从47.4%提升至54.7%。

From Particles to Perils: SVGD-Based Hazardous Scenario Generation for Autonomous Driving Systems Testing

从粒子到危险：基于SVGD的自动驾驶系统测试危险场景生成

Authors: Linfeng Liang, Xiao Cheng, Tsong Yueh Chen, Xi Zheng
Subjects: Software Engineering (cs.SE); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.18918
Pdf link: https://arxiv.org/pdf/2604.18918
Abstract Simulation-based testing of autonomous driving systems (ADS) must uncover realistic and diverse failures in dense, heterogeneous traffic. However, existing search-based seeding methods (e.g., genetic algorithms) struggle in high-dimensional spaces, often collapsing to limited modes and missing many failure scenarios. We present PtoP, a framework that combines adaptive random seed generation with Stein Variational Gradient Descent (SVGD) to produce diverse, failure-inducing initial conditions. SVGD balances attraction toward high-risk regions and repulsion among particles, yielding risk-seeking yet well-distributed seeds across multiple failure modes. PtoP is plug-and-play and enhances existing online testing methods (e.g., reinforcement learning-based testers) by providing principled seeds. Evaluation in CARLA on two industry-grade ADS (Apollo, Autoware) and a native end-to-end system shows that PtoP improves safety violation rate (up to 27.68%), scenario diversity (9.6%), and map coverage (16.78%) over baselines.
中文摘要 基于仿真的自动驾驶系统（ADS）测试必须发现密集、异构交通中真实且多样化的故障。然而，现有基于搜索的种子方法（如遗传算法）在高维空间中表现不佳，常常崩溃到有限的模态，并错过许多失败场景。我们提出了PtoP框架，该框架结合了自适应随机种子生成与斯坦变分梯度下降（SVGD），以产生多样化且易导致失败的初始条件。SVGD平衡了对高风险区域的吸引和粒子间的排斥，产生了既寻求风险又分布均匀的种子，覆盖多种失效模式。PtoP即插即用，通过提供原则性种子，增强了现有的在线测试方法（例如基于强化学习的测试器）。在CARLA中对两款行业级ADS（Apollo、Autoware）和原生端到端系统的评估显示，PtoP相比基线提升了安全违规率（最高27.68%）、场景多样性（9.6%）和地图覆盖率（16.78%）。
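The attraction/repulsion balance described in this abstract can be illustrated with a minimal SVGD sketch. This is not the PtoP implementation: the one-dimensional, two-mode "risk" density, the RBF bandwidth `h`, the step size, and the names `grad_log_risk`, `svgd_step`, and `run_svgd` are all assumptions made for illustration.

```python
import numpy as np

def grad_log_risk(x):
    """Score of a toy target 0.5*N(-2, 1) + 0.5*N(+2, 1) (hypothetical risk density)."""
    return -x + 2.0 * np.tanh(2.0 * x)

def svgd_step(x, grad_logp, step=0.1, h=1.0):
    """One SVGD update for particles x of shape (n, d), using an RBF kernel."""
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]          # diff[i, j] = x_i - x_j
    k = np.exp(-np.sum(diff ** 2, axis=-1) / h)   # k[i, j] = exp(-|x_i - x_j|^2 / h)
    attract = k @ grad_logp(x)                    # pulls particles toward high density
    repulse = (2.0 / h) * np.sum(k[:, :, None] * diff, axis=1)  # pushes particles apart
    return x + step * (attract + repulse) / n

def run_svgd(n_particles=50, n_steps=500, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 0.5, size=(n_particles, 1))  # all seeds start near the origin
    for _ in range(n_steps):
        x = svgd_step(x, grad_log_risk)
    return x

particles = run_svgd()
print(particles.min(), particles.max())
```

In this toy setting the kernel-weighted score term plays the role of attraction toward high-risk regions, while the kernel-gradient term keeps particles from collapsing onto a single mode.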

Fine-Tuning Small Reasoning Models for Quantum Field Theory

量子场论中微调小推理模型

Authors: Nathaniel S. Woodward, Zhiqi Gao, Yurii Kvasiuk, Kendrick M. Smith, Frederic Sala, Moritz Münchmeyer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); High Energy Physics - Phenomenology (hep-ph); High Energy Physics - Theory (hep-th)
Arxiv link: https://arxiv.org/abs/2604.18936
Pdf link: https://arxiv.org/pdf/2604.18936
Abstract Despite the growing application of Large Language Models (LLMs) to theoretical physics, there is little academic exploration into how domain-specific physics reasoning ability develops while training these models. To investigate this, we perform the first academic fine-tuning study of small (7B-parameter) reasoning models dedicated specifically to theoretical physics. Because open-source verifiable training data required to train such capabilities is scarce, we developed a robust data generation pipeline that can both create synthetic problems and make existing human-authored problems suitable for model training. Selecting Quantum Field Theory (QFT) as our primary domain, we generated over 2,500 synthetic problems alongside a curated collection of human-adapted problems sourced from arXiv and standard pedagogical resources. We conduct both Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) experiments, benchmarking performance gains as well as generalization to other physics domains. We perform an extensive analysis of model chains-of-thought before and after fine-tuning, to understand how reasoning errors evolve during RL and SFT. Finally, we publicly release our data pipeline, verifiable QFT training data, and ~200M tokens of QFT reasoning traces.
中文摘要 尽管大型语言模型（LLMs）在理论物理中的应用日益增多，但关于在训练这些模型时，特定领域物理推理能力如何发展的学术研究仍然很少。为此，我们进行了首个专门针对理论物理的小型（7B参数）推理模型的学术微调研究。由于训练此类能力所需的开源可验证训练数据稀缺，我们开发了一个稳健的数据生成流水线，既能生成合成问题，也能使现有的人工编写问题适合模型训练。我们选择量子场论（QFT）作为主要领域，生成了2500多个合成问题，并收集了来自arXiv和标准教学资源、经人工改编的精选问题。我们进行强化学习（RL）和监督微调（SFT）实验，对性能提升以及向其他物理领域的泛化进行基准测试。我们对微调前后的模型思维链进行了广泛分析，以理解在强化学习和SFT过程中推理错误的演变。最后，我们公开发布数据流水线、可验证的量子场论训练数据，以及约2亿个token的量子场论推理轨迹。

Reasoning Structure Matters for Safety Alignment of Reasoning Models

推理结构对推理模型安全对齐至关重要

Authors: Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.18946
Pdf link: https://arxiv.org/pdf/2604.18946
Abstract Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks but often generate harmful responses to malicious user queries. This paper investigates the underlying cause of these safety risks and shows that the issue lies in the reasoning structure itself. Based on this insight, we claim that effective safety alignment can be achieved by altering the reasoning structure. We propose AltTrain, a simple yet effective post-training method that explicitly alters the reasoning structure of LRMs. AltTrain is both practical and generalizable, requiring no complex reinforcement learning (RL) training or reward design, only supervised fine-tuning (SFT) with a lightweight set of 1K training examples. Experiments across LRM backbones and model sizes demonstrate strong safety alignment, along with robust generalization across reasoning, QA, summarization, and multilingual settings.
中文摘要 大型推理模型（LRM）在复杂推理任务中表现优异，但常常对恶意用户查询产生有害的响应。本文探讨了这些安全风险的根本原因，并指出问题出在推理结构本身。基于这一见解，我们认为通过改变推理结构可以实现有效的安全对齐。我们提出了AltTrain，这是一种简单但有效的后训练方法，明确改变LRM的推理结构。AltTrain既实用又可推广，无需复杂的强化学习（RL）训练或奖励设计，只需使用轻量级的1K训练示例进行监督微调（SFT）。跨LRM骨干和模型规模的实验显示出出色的安全对齐效果，以及在推理、问答、摘要和多语言场景中的稳健泛化。

Self-Improving Tabular Language Models via Iterative Group Alignment

通过迭代组比对实现自我改进的表形式语言模型

Authors: Yunbo Long, Tejumade Afonja, Alexandra Brintrup, Mario Fritz
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.18966
Pdf link: https://arxiv.org/pdf/2604.18966
Abstract While language models have been adapted for tabular data generation, two fundamental limitations remain: (1) static fine-tuning produces models that cannot learn from their own generated samples and adapt to self-correct, and (2) autoregressive objectives preserve local token coherence but neglect global statistical properties, degrading tabular quality. Reinforcement learning offers a potential solution but requires designing reward functions that balance competing objectives, which is impractical for tabular data. To fill the gap, we introduce TabGRAA (Tabular Group-Relative Advantage Alignment), the first self-improving framework for tabular data generation via automated feedback. At each iteration, TabGRAA uses an automated quality signal (such as a two-sample distinguishability classifier or a distance-based reward) to partition newly generated samples into high- and low-quality groups, then optimizes a group-relative advantage objective that reinforces realistic patterns while penalizing artifacts. The specific signal is a modular choice rather than a fixed component of the framework. This establishes a virtuous feedback cycle, where the quality signal is re-computed against newly generated synthetic samples at each round; the language model is only fine-tuned on these self-generated signals, so no additional real record is exposed during alignment, mitigating data-leakage risk beyond the initial supervised fine-tuning. Experiments show TabGRAA outperforms existing methods in fidelity, utility, and privacy, while matching or exceeding diffusion-based synthesizers, advancing tabular synthesis from static statistical replication to dynamic, self-improving generation.
中文摘要 虽然语言模型已被改编用于表格数据生成，但仍存在两个基本限制：（1）静态微调产生的模型无法从自身生成的样本中学习并适应自我纠正;（2）自回归目标保留局部令牌一致性，但忽视了全局统计属性，降低了表格质量。强化学习提供了一个潜在的解决方案，但需要设计平衡竞争目标的奖励函数——这对于表格数据来说并不切实际。为弥补这一空白，我们引入了TabGRAA（表组-相对优势对齐），这是首个通过自动反馈实现表数据生成自我改进的框架。在每次迭代中，TabGRAA 使用一个 \emph{自动质量信号}——例如两样本可区分性分类器或基于距离的奖励——将新生成的样本划分为高质量和低质量组，然后优化一个强化真实模式同时惩罚伪影的群体相对优势目标。具体信号是模块化选择，而非框架的固定组成部分。这建立了良性反馈循环，每轮对新生成的合成样本重新计算质量信号;语言模型仅对这些自生成信号进行微调，因此在对齐过程中不会暴露额外的真实记录，从而降低了超出初始监督微调之外的数据泄漏风险。实验显示，TabGRAA在保真度、实用性和隐私性方面优于现有方法，同时可与基于扩散的合成器匹敌甚至超越，推动了表格合成从静态统计复制向动态、自我改进生成的进步。

Toward Clinically Acceptable Chest X-ray Report Generation: A Qualitative Retrospective Pilot Study of CXRMate-2

迈向临床可接受的胸部X光报告生成：CXRMate-2的定性回顾性初步研究

Authors: Aaron Nicolson, Elizabeth J. Cooper, Hwan-Jin Yoon, Claire McCafferty, Ramya Krishnan, Michelle Craigie, Nivene Saad, Jason Dowling, Ian A. Scott, Bevan Koopman
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.18967
Pdf link: https://arxiv.org/pdf/2604.18967
Abstract Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress, yet their clinical utility remains uncertain due to limited evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that integrates structured multimodal conditioning and reinforcement learning with a composite reward for semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B). To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates between radiologist reports and acceptable generated reports for seven of the eight analysed findings. Preference for radiologist reports was driven primarily by higher recall, while generated reports were often preferred for readability. Together, these results suggest a credible pathway to clinically acceptable CXR RRG. Improvements in recall, alongside better detection of subtle findings (e.g., pulmonary congestion), are likely sufficient to achieve non-inferiority to radiologist reporting. With these targeted advances, CXR RRG systems may be ready for prospective evaluation in assistive roles within radiologist-led workflows.
中文摘要 胸部X光（CXR）放射报告生成（RRG）模型取得了快速进展，但由于放射科医生评估有限，其临床效用仍不确定。我们介绍CXRMate-2，一种先进的CXR RRG模型，将结构化多模态条件反射和强化学习与对放射科医生报告语义对齐的复合奖励相结合相结合。在MIMIC-CXR、CheXpert Plus和ReXgradient数据集中，CXRMate-2相较于强基准测试实现了统计学上的显著提升，包括相较于MedGemma 1.5（4B）在MIMIC-CXR上的GREEN和RadGraph-XL分别提升了11.2%和24.4%。为直接比较CXRMate-2与放射科医生报告，我们进行了盲法、随机定性回顾性评估。三位放射科医生顾问比较了MIMIC-CXR测试集120项研究中产生的报告和放射科医生报告。生成报告在45%的评级中被认为可接受（定义为优先或与放射科医生报告同等评分），在八项分析结果中有七项中，放射科医生报告与可接受生成报告的偏好率没有统计学上显著差异。对放射科医生报告的偏好主要由更高的召回率驱动，而生成报告通常更受优先考虑可读性。综合来看，这些结果为临床可接受的CXR RRG提供了可信的路径。回忆能力的提升，加上对细微发现（如肺充血）的更好检测，很可能足以达到放射科医生报告的不劣水平。有了这些针对性的进步，CXR RRG系统有望在放射科医生主导的工作流程中，作为辅助角色的前瞻性评估。

Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning

非策略强化学习中批判者学习的低秩适应

Authors: Yuan Zhuang, Yuexin Bian, Sihong He, Jie Feng, Qing Su, Songyang Han, Jonathan Petit, Shihao Ji, Yuanyuan Shi, Fei Miao
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.18978
Pdf link: https://arxiv.org/pdf/2604.18978
Abstract Scaling critic capacity is a promising direction for enhancing off-policy reinforcement learning (RL). However, larger critics are prone to overfitting and unstable in replay-buffer-based bootstrap training. This paper leverages Low-Rank Adaptation (LoRA) as a structural-sparsity regularizer for off-policy critics. Our approach freezes randomly initialized base matrices and solely optimizes low-rank adapters, thereby constraining critic updates to a low-dimensional subspace. Built on top of SimbaV2, we further develop a LoRA formulation, compatible with SimbaV2, that preserves its hyperspherical normalization geometry under frozen-backbone training. We evaluate our method with SAC and FastTD3 on DeepMind Control locomotion and IsaacLab robotics benchmarks. LoRA consistently achieves lower critic loss during training and stronger policy performance. Extensive experiments demonstrate that adaptive low-rank updates provide a simple, scalable, and effective structural regularization for critic learning in off-policy RL.
中文摘要 提升批评能力是提升非策略强化学习（RL）的一个有前景方向。然而，较大的批评者在基于重放缓冲的自助训练中容易出现过拟合和不稳定。本文利用低阶适应（LoRA）作为非政策批评者的结构稀疏性规范工具。我们的方法冻结随机初始化的基矩阵，仅优化低秩适配器，从而限制批评者更新在低维子空间。基于SimbaV2，我们进一步开发了与SimbaV2兼容的LoRA公式，在冻结主干训练下保持其超球面归一化几何。我们用SAC和FastTD3在DeepMind Control的locomotion和IsaacLab机器人基准测试中评估了我们的方法。LoRA在培训期间持续实现较低的批评损失和更强的政策执行。大量实验表明，自适应低秩更新为非策略强化学习中的批评学习提供了简单、可扩展且有效的结构正则化。
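The core idea in this abstract (freeze a randomly initialized base matrix and train only low-rank adapters) can be sketched as a single linear layer. This is a minimal illustration under stated assumptions, not the paper's method: the class name `LoRALinear`, the initialization scales, the plain SGD update, and the toy regression data are hypothetical, and SimbaV2's hyperspherical normalization is not modeled.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x (W0 + B A)^T with W0 frozen and only A, B trainable."""

    def __init__(self, d_in, d_out, rank, rng):
        self.W0 = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(d_out, d_in))  # frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))                   # trainable
        self.B = np.zeros((d_out, rank))   # trainable; zero init so the update starts at 0

    def forward(self, x):
        return x @ (self.W0 + self.B @ self.A).T

    def sgd_step(self, x, grad_out, lr=0.05):
        g_w = grad_out.T @ x           # gradient w.r.t. the effective weight matrix
        g_a = self.B.T @ g_w           # chain rule through W = W0 + B A
        g_b = g_w @ self.A.T
        self.A -= lr * g_a
        self.B -= lr * g_b             # W0 is never touched

rng = np.random.default_rng(0)
layer = LoRALinear(d_in=8, d_out=4, rank=2, rng=rng)
w0_before = layer.W0.copy()
x = rng.normal(size=(32, 8))
target = rng.normal(size=(32, 4))
for _ in range(100):
    grad_out = 2.0 * (layer.forward(x) - target) / len(x)  # gradient of mean squared error
    layer.sgd_step(x, grad_out)
```

Because every update passes through the product `B @ A`, the change to the effective weights stays within a subspace of rank at most `rank`, which is the constraint the abstract describes.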

SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution

SAVOIR：通过Shapley奖励归因学习社会智慧

Authors: Xiachong Feng, Yi Jiang, Xiaocheng Feng, Deyi Yin, Libo Qin, Yangfan Ye, Lei Huang, Weitao Ma, Yuxuan Gu, Chonghan Qin, Bing Qin, Lingpeng Kong
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.18982
Pdf link: https://arxiv.org/pdf/2604.18982
Abstract Social intelligence, the ability to navigate complex interpersonal interactions, presents a fundamental challenge for language agents. Training such agents via reinforcement learning requires solving the credit assignment problem: determining how individual utterances contribute to multi-turn dialogue outcomes. Existing approaches directly employ language models to distribute episode-level rewards, yielding attributions that are retrospective and lack theoretical grounding. We propose SAVOIR (ShApley Value fOr SocIal RL), a novel principled framework grounded in cooperative game theory. Our approach combines two complementary principles: expected utility shifts evaluation from retrospective attribution to prospective valuation, capturing an utterance's strategic potential for enabling favorable future trajectories; Shapley values ensure fair credit distribution with axiomatic guarantees of efficiency, symmetry, and marginality. Experiments on the SOTOPIA benchmark demonstrate that SAVOIR achieves new state-of-the-art performance across all evaluation settings, with our 7B model matching or exceeding proprietary models including GPT-4o and Claude-3.5-Sonnet. Notably, even large reasoning models consistently underperform, suggesting social intelligence requires qualitatively different capabilities than analytical reasoning.
中文摘要 社会智力，即应对复杂人际互动的能力，是语言代理面临的根本挑战。通过强化学习训练此类代理需要解决学分分配问题：确定单个话语如何影响多回合对话结果。现有方法直接利用语言模型分配剧集级奖励，导致归因具有回顾性且缺乏理论基础。我们提出了SAVOIR（ShApley价值，社会强化学习），这是一个基于合作博弈论的新颖原则框架。我们的方法结合了两个互补原则：期望效用将评估从事后归因转向前瞻性估值，捕捉话语为有利未来轨迹带来的战略潜力;夏普利值确保了公平的信用分配，并保证了效率、对称性和边际性的公理性保障。SOTOPIA基准测试的实验表明，SAVOIR在所有评估环境中都实现了最先进的性能，我们的7B模型能够匹配甚至超越包括GPT-4o和Claude-3.5-Sonnet在内的专有模型。值得注意的是，即使是大型推理模型也持续表现不佳，表明社会智能所需的能力与分析推理有质的不同。
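To make the credit-assignment idea concrete, the following sketch computes exact Shapley values for a three-utterance dialogue. It is an illustration only: the value function `dialogue_value`, the utterance names `u1` to `u3`, and the payoffs are hypothetical, and the paper's expected-utility valuation is not modeled here.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values; `value` maps a frozenset of players to a float."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for r in range(n):
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for subset in combinations(others, r):
                s = frozenset(subset)
                phi[p] += weight * (value(s | {p}) - value(s))  # marginal contribution
    return phi

def dialogue_value(utterances):
    """Toy outcome: u1 and u2 only pay off together; u3 adds a small bonus alone."""
    score = 1.0 if {"u1", "u2"} <= utterances else 0.0
    if "u3" in utterances:
        score += 0.2
    return score

phi = shapley_values(["u1", "u2", "u3"], dialogue_value)
print(phi)
```

The credits sum to the value of the full dialogue (efficiency), and the two interchangeable utterances `u1` and `u2` receive equal credit (symmetry), which are two of the axioms the abstract mentions. Exact enumeration is exponential in the number of utterances, so longer dialogues would need sampling-based approximation.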

Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning

引导分布匹配蒸馏与基于梯度的强化学习

Authors: Linwei Dong, Ruoyu Guo, Ge Bai, Zehuan Yuan, Yawei Luo, Changqing Zou
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.19009
Pdf link: https://arxiv.org/pdf/2604.19009
Abstract Diffusion distillation, exemplified by Distribution Matching Distillation (DMD), has shown great promise in few-step generation but often sacrifices quality for sampling speed. While integrating Reinforcement Learning (RL) into distillation offers potential, a naive fusion of these two objectives relies on suboptimal raw sample evaluation. This sample-based scoring creates inherent conflicts with the distillation trajectory and produces unreliable rewards due to the noisy nature of early-stage generation. To overcome these limitations, we propose GDMD, a novel framework that redefines the reward mechanism by prioritizing distillation gradients over raw pixel outputs as the primary signal for optimization. By reinterpreting the DMD gradients as implicit target tensors, our framework enables existing reward models to directly evaluate the quality of distillation updates. This gradient-level guidance functions as an adaptive weighting that synchronizes the RL policy with the distillation objective, effectively neutralizing optimization divergence. Empirical results show that GDMD sets a new SOTA for few-step generation. Specifically, our 4-step models outperform the quality of their multi-step teacher and substantially exceed previous DMDR results in GenEval and human-preference metrics, exhibiting strong scalability potential.
中文摘要 扩散蒸馏以分布匹配蒸馏（DMD）为代表，在少步生成中展现出巨大前景，但通常牺牲质量以换取样速度。虽然将强化学习（RL）整合进蒸馏具有潜力，但将这两个目标简单结合，往往依赖于对未达到最佳的原始样本评估。这种基于样本的评分与蒸馏轨迹产生固有冲突，并因早期生成噪声性而产生不可靠的奖励。为克服这些局限，我们提出了GDMD，这是一种新颖框架，通过优先将蒸馏梯度置于原始像素输出作为主要优化信号，重新定义了奖励机制。通过将DMD梯度重新解释为隐式目标张量，我们的框架使现有奖励模型能够直接评估蒸馏更新的质量。这种梯度级指导作为自适应权重，使强化学习策略与蒸馏目标同步，有效中和优化的背离。实证结果表明，GDMD为少步生成设定了新的SOTA。具体来说，我们的四步模型优于其多步教师的质量，并在GenEval和人类偏好指标上显著优于以往DMDR结果，展现出强大的可扩展性潜力。

Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

基于人类反馈安全强化学习的政策梯度原始-对偶方法

Authors: Qiang Liu, Adrienne Kline, Ermin Wei
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.19024
Pdf link: https://arxiv.org/pdf/2604.19024
Abstract Safe Reinforcement Learning from Human Feedback (Safe RLHF) has recently achieved empirical success in developing helpful and harmless large language models by decoupling human preferences regarding helpfulness and harmlessness. Existing approaches typically rely on fitting fixed-horizon reward models from human feedback and have only been validated empirically. In this paper, we formulate Safe RLHF as an infinite-horizon discounted Constrained Markov Decision Process (CMDP), since humans may interact with the model over a continuing sequence of interactions rather than within a single finite episode. We propose two Safe RLHF algorithms that do not require reward model fitting and, in contrast to prior work assuming fixed-length trajectories, support flexible trajectory lengths for training. Both algorithms are based on the primal-dual method and achieve global convergence guarantees with polynomial rates in terms of policy gradient iterations, trajectory sample lengths, and human preference queries. To the best of our knowledge, this is the first work to study infinite-horizon discounted CMDP under human feedback and establish global, non-asymptotic convergence.
中文摘要 基于人类反馈的安全强化学习（Safe RLHF）最近通过解耦人类对有益性和无害性的偏好，在开发有益且无害的大语言模型方面取得了实证上的成功。现有方法通常依赖于从人类反馈中拟合固定时域奖励模型，且仅通过实证验证。本文将Safe RLHF表述为无限时域折扣约束马尔可夫决策过程（CMDP），因为人类可能通过连续的交互序列与模型互动，而非局限于单个有限回合。我们提出了两种Safe RLHF算法，这些算法无需奖励模型拟合，且与以往假设固定长度轨迹的研究不同，支持灵活的轨迹长度进行训练。这两种算法都基于原始-对偶方法，并在策略梯度迭代次数、轨迹样本长度和人类偏好查询次数方面实现了具有多项式速率的全局收敛保证。据我们所知，这是首个研究人类反馈下无限时域折扣CMDP并建立全局非渐近收敛性的工作。
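For readers unfamiliar with the primal-dual method named in this abstract, the textbook form for a discounted CMDP can be written as follows. This is the generic formulation, not the paper's algorithm: the symbols $V_r$, $V_c$, $b$, $\theta$, $\lambda$, and the step sizes $\eta_\theta$, $\eta_\lambda$ are assumptions, and the paper's preference-based estimators are not shown.

```latex
% Constrained problem: maximize reward value subject to a cost budget b
\max_{\theta} \; V_r(\pi_\theta) \quad \text{s.t.} \quad V_c(\pi_\theta) \le b

% Lagrangian with dual variable \lambda \ge 0
L(\theta, \lambda) = V_r(\pi_\theta) - \lambda \bigl( V_c(\pi_\theta) - b \bigr)

% Primal step: policy gradient ascent on the Lagrangian
\theta_{k+1} = \theta_k + \eta_\theta \, \nabla_\theta L(\theta_k, \lambda_k)

% Dual step: projected ascent on the constraint violation
\lambda_{k+1} = \bigl[ \lambda_k + \eta_\lambda \bigl( V_c(\pi_{\theta_{k+1}}) - b \bigr) \bigr]_{+}
```

The dual variable grows while the cost constraint is violated and shrinks toward zero once it is satisfied, which is how the method trades off helpfulness against harmlessness.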

Intentional Updates for Streaming Reinforcement Learning

流式强化学习的有意更新

Authors: Arsalan Sharifnassab, Mohamed Elsayed, Kris De Asis, A. Rupam Mahmood, Richard S. Sutton
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19033
Pdf link: https://arxiv.org/pdf/2604.19033
Abstract In gradient-based learning, a step size chosen in parameter units does not produce a predictable per-step change in function output. This often leads to instability in the streaming setting (i.e., batch size = 1), where stochasticity is not averaged out and update magnitudes can momentarily become arbitrarily large or small. Instead, we propose intentional updates: first specify the intended outcome of an update and then solve for the step size that approximately achieves it. This strategy has precedent in online supervised linear regression via the Normalized Least Mean Squares (NLMS) algorithm, which selects a step size to yield a specified change in the function output proportional to the current error. We extend this principle to streaming deep reinforcement learning by defining appropriate intended outcomes: Intentional TD aims for a fixed fractional reduction of the TD error, and Intentional Policy Gradient aims for a bounded per-step change in the policy, limiting local KL divergence. We propose practical algorithms combining eligibility traces and diagonal scaling. Empirically, these methods yield state-of-the-art streaming performance, frequently performing on par with batch and replay-buffer approaches.
中文摘要 在基于梯度的学习中，以参数单位选择的步长不会产生可预测的每步函数输出变化。这常导致流式设置（即批次大小=1）下的不稳定，此时随机性未被平均，更新幅度可能暂时变得任意大或小。为此，我们提出有意更新：先指定更新的预期结果，然后求解大致实现该结果的步长。该策略在通过归一化最小均方（NLMS）算法进行的在线监督线性回归中有先例，该算法选择一个步长，使函数输出产生与当前误差成正比的指定变化。我们通过定义适当的预期结果，将该原则扩展到流式深度强化学习：有意TD旨在使TD误差按固定比例减小，有意策略梯度则旨在使策略的每步变化有界，从而限制局部KL散度。我们提出了结合资格迹和对角缩放的实用算法。实验表明，这些方法带来了最先进的流式性能，常常能与批处理和重放缓冲区方法相当。
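The NLMS precedent cited in this abstract shows the "intentional update" idea in its simplest form: pick the outcome (shrink the error by a fraction `eta`), then solve for the step size. The sketch below covers only that linear-regression case; the function name `nlms_step` and the toy numbers are assumptions, and the paper's Intentional TD and Intentional Policy Gradient algorithms are not reproduced.

```python
import numpy as np

def nlms_step(w, x, y, eta=0.5, eps=1e-8):
    """One Normalized LMS update for the linear predictor w @ x.

    The step size eta / (x @ x) is chosen so that the prediction error on
    this sample shrinks by (approximately) the fraction eta.
    """
    err = y - w @ x
    alpha = eta / (x @ x + eps)     # step size solved from the intended outcome
    return w + alpha * err * x, err

w = np.zeros(2)
x = np.array([3.0, 4.0])
y = 10.0
w_new, err_before = nlms_step(w, x, y, eta=0.5)
err_after = y - w_new @ x
print(err_before, err_after)
```

With `eta=0.5` the error on the sample is halved regardless of the scale of `x`, whereas a fixed step size in parameter units would overshoot for large inputs and barely move for small ones.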

TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only

TRN-R1-Zero：仅强化学习的文本丰富网络推理 LLMs

Authors: Yilun Liu, Ruihong Qiu, Zi Huang
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.19070
Pdf link: https://arxiv.org/pdf/2604.19070
Abstract Zero-shot reasoning on text-rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task-specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN-R1-Zero, a post-training framework for TRN reasoning trained solely via reinforcement learning. TRN-R1-Zero directly optimises base LLMs using a Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN-R1-Zero requires no supervised fine-tuning or chain-of-thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co-purchase TRN benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Moreover, relying strictly on node-level training, TRN-R1-Zero achieves zero-shot inference on edge- and graph-level tasks, extending beyond cross-domain transfer. The codebase is publicly available at this https URL.
中文摘要 文本富网络（TRNs）上的零样本推理仍是一个充满挑战的前沿，因为模型必须将文本语义与关系结构整合，且不需针对特定任务的监督。虽然图神经网络依赖固定标签空间和监督目标，但基于大型语言模型（LLM）的近期方法常常忽视图上下文或依赖从更大模型提炼，限制了泛化。我们提出了TRN-R1-Zero，一种仅通过强化学习训练TRN推理的后期框架。TRN-R1-Zero 直接利用邻居感知群相对策略优化目标，基于邻居信号信息量的新型边际收益指标动态调整奖励，有效引导模型走向关系推理。与以往方法不同，TRN-R1-Zero 不需要由大型推理模型生成的监督微调或思维链数据。在引用、超链接、社交和共购TRN基准测试中的广泛实验证明了TRN-R1-Zero的优越性和稳健性。此外，严格依赖节点级训练，TRN-R1-Zero实现了边缘和图级任务的零样本推断，超越了跨域转移。代码库在此 https 网址公开。

OLLM: Options-based Large Language Models

OLLM：基于选项的大型语言模型

Authors: Shashank Sharma, Janina Hoffmann, Vinay Namboodiri
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19087
Pdf link: https://arxiv.org/pdf/2604.19087
Abstract We introduce Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a set of learned options for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options which can be selected or searched by a downstream policy. Architecturally, OLLM is a lightweight "plug-in" that inserts two layers, an encoder and a decoder, before the output head, allowing almost any pretrained LLM to be converted with minimal additional parameters. We apply OLLM to a 1.7B-parameter backbone (only 1.56% of parameters trainable) trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at 51% final answer correctness, while OLLM's option set allows up to ~70% under optimal latent selection. We then train a compact policy in the latent space that emits latents to control generation. Operating in a low-dimensional option space makes reward optimization far more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), as the policy is constrained to options learned during SFT. Crucially, this alignment arises from model structure rather than additional KL or handcrafted alignment losses. Our results demonstrate that optionized next-token modeling enhances controllability, robustness, and efficiency in math reasoning, and highlight latent-space policy learning as a promising direction for reinforcement learning in LLMs.
中文摘要 我们提出了选项大型语言模型（Options LLM，OLLM），这是一种简单通用的方法，用一个由离散潜变量索引的 \textit{学习得到的选项集合} 替代标准大型语言模型中单一的下一个词元预测。OLLM 不依赖温度或采样启发式方法来诱导多样性，而是显式建模变化：一个小的潜空间参数化多个合理的下一个词元选项，这些选项可由下游策略选择或搜索。从架构上看，OLLM 是一个轻量级“插件”，在输出头之前插入两层（编码器和解码器），使几乎任何预训练的 LLM 都能以极少的额外参数完成转换。我们将 OLLM 应用于一个 17 亿（1.7B）参数的骨干模型（仅 $1.56\%$ 的参数可训练），该模型在 OpenMathReasoning 上训练，并在 OmniMath 上评估。SOTA 的 LoRA 适配基线的最终答案正确率最高为 $51\%$，而 OLLM 的选项集在最优潜变量选择下最高可达 $\sim 70\%$。随后我们在潜空间中训练一个紧凑的策略，由该策略输出潜变量来控制生成。在低维选项空间中操作使奖励优化的样本效率大幅提高，并显著减少常见的错位现象（如语言切换或退化推理），因为策略被限制在 SFT 期间学到的选项之内。关键在于，这种对齐源自模型结构，而非额外的 KL 或手工设计的对齐损失。我们的结果表明，选项化的下一个词元建模提升了数学推理的可控性、鲁棒性和效率，并凸显了潜空间策略学习是大型语言模型强化学习的一个有前景的方向。

Multi-Gait Learning for Humanoid Robots Using Reinforcement Learning with Selective Adversarial Motion Prior

基于选择性对抗运动先验的强化学习，用于类人机器人的多步态学习

Authors: Yuanye Wu, Keyi Wang, Linqi Ye, Boyang Xing
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19102
Pdf link: https://arxiv.org/pdf/2604.19102
Abstract Learning diverse locomotion skills for humanoid robots in a unified reinforcement learning framework remains challenging due to the conflicting requirements of stability and dynamic expressiveness across different gaits. We present a multi-gait learning approach that enables a humanoid robot to master five distinct gaits -- walking, goose-stepping, running, stair climbing, and jumping -- using a consistent policy structure, action space, and reward formulation. The key contribution is a selective Adversarial Motion Prior (AMP) strategy: AMP is applied to periodic, stability-critical gaits (walking, goose-stepping, stair climbing) where it accelerates convergence and suppresses erratic behavior, while being deliberately omitted for highly dynamic gaits (running, jumping) where its regularization would over-constrain the motion. Policies are trained via PPO with domain randomization in simulation and deployed on a physical 12-DOF humanoid robot through zero-shot sim-to-real transfer. Quantitative comparisons demonstrate that selective AMP outperforms a uniform AMP policy across all five gaits, achieving faster convergence, lower tracking error, and higher success rates on stability-focused gaits without sacrificing the agility required for dynamic ones.
中文摘要 在统一的强化学习框架下学习类人机器人多样化的运动技能仍然具有挑战性，因为不同步态之间对稳定性和动态表达性的需求存在冲突。我们提出了一种多步态学习方法，使类人机器人能够掌握五种不同的步态——走路、踩鹅步、跑步、爬楼梯和跳跃——采用一致的政策结构、行动空间和奖励表述。其关键贡献是选择性对抗运动先验（AMP）策略：AMP应用于周期性、稳定性关键步态（行走、鹅步、爬楼梯），加速收敛并抑制异常行为;而在高度动态步态（如跑步、跳跃）中，因其规则化会过度限制运动，则被有意省略。策略通过PPO和域随机化在仿真中训练，并通过零时模拟到实物传输部署在物理12自由度的人形机器人上。定量比较表明，选择性AMP在五种步态上优于统一AMP策略，在稳定为中心步态上实现更快的收敛、更低的跟踪误差和更高的成功率，同时不牺牲动态步态所需的敏捷性。
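The selective AMP idea in this abstract can be illustrated with a small reward-composition sketch. This is not the authors' implementation: the function name `combined_reward`, the `style_weight` value, and the assumption that the AMP discriminator output is already available as a scalar `style_reward` are all hypothetical. The sketch only shows the gating by gait type that the abstract describes.

```python
# Gaits that receive the AMP style term, as listed in the abstract.
AMP_GAITS = {"walking", "goose-stepping", "stair climbing"}
# Highly dynamic gaits for which the abstract says AMP is deliberately omitted.
DYNAMIC_GAITS = {"running", "jumping"}


def combined_reward(gait, task_reward, style_reward, style_weight=0.5):
    """Return the per-step training reward for one gait.

    style_weight is a hypothetical value chosen for illustration only;
    style_reward stands in for a scalar produced by an AMP discriminator.
    """
    if gait in AMP_GAITS:
        # Periodic, stability-critical gaits: task reward plus motion-prior term.
        return task_reward + style_weight * style_reward
    if gait in DYNAMIC_GAITS:
        # Dynamic gaits: the motion prior is left out so it cannot over-constrain motion.
        return task_reward
    raise ValueError(f"unknown gait: {gait}")
```

With these assumptions, `combined_reward("running", 1.0, 0.8)` ignores the style term entirely, while the same inputs for `"walking"` add half of the style reward.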

Reinforcement Learning Enabled Adaptive Multi-Task Control for Bipedal Soccer Robots

基于强化学习的双足足球机器人自适应多任务控制

Authors: Yulai Zhang, Yinrong Zhang, Ting Wu, Linqi Ye
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19104
Pdf link: https://arxiv.org/pdf/2604.19104
Abstract Developing bipedal football robots in dynamic combat environments presents challenges related to motion stability and deep coupling of multiple tasks, as well as control switching issues between different states such as upright walking and fall recovery. To address these problems, this paper proposes a modular reinforcement learning (RL) framework for achieving adaptive multi-task control. Firstly, this framework combines an open-loop feedforward oscillator with a reinforcement learning-based feedback residual strategy, effectively separating the generation of basic gaits from complex football actions. Secondly, a posture-driven state machine is introduced, clearly switching between the ball seeking and kicking network (BSKN) and the fall recovery network (FRN), fundamentally preventing state [...]. The FRN is efficiently trained through a progressive force attenuation curriculum learning strategy. The architecture was verified in Unity simulations of bipedal robots, demonstrating excellent spatial adaptability (reliably finding and kicking the ball even in restricted corner scenarios) and rapid autonomous fall recovery (with an average recovery time of 0.715 seconds). This ensures seamless and stable operation in complex multi-task environments.
中文摘要 在动态对抗环境中开发双足足球机器人面临运动稳定性和多任务深度耦合的挑战，同时也面临直立行走与跌倒恢复等不同状态之间的控制切换问题。为解决这些问题，本文提出了一个模块化强化学习（RL）框架，用于实现自适应多任务控制。首先，该框架将开环前馈振荡器与基于强化学习的反馈残差策略相结合，有效地将基本步态的生成与复杂的足球动作分离开来。其次，引入了姿态驱动状态机，在寻球与踢球网络（BSKN）和跌倒恢复网络（FRN）之间明确切换，从根本上防止状态[...]。FRN 通过渐进式力衰减课程学习策略高效训练。该架构在 Unity 双足机器人仿真中得到了验证，展现出卓越的空间适应性（即使在受限的角落场景中也能可靠地找到并踢到球）以及快速的自主跌倒恢复（平均恢复时间为 0.715 秒）。这确保了在复杂多任务环境中无缝且稳定的运行。

GraphRAG-IRL: Personalized Recommendation with Graph-Grounded Inverse Reinforcement Learning and LLM Re-ranking

GraphRAG-IRL：基于图的逆向强化学习和大型语言模型重新排序的个性化推荐

Authors: Siqi Liang, Xiawei Wang, Yudi Zhang, Jiaying Zhou
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2604.19128
Pdf link: https://arxiv.org/pdf/2604.19128
Abstract Personalized recommendation requires models that capture sequential user preferences while remaining robust to sparse feedback and semantic ambiguity. Recent work has explored large language models (LLMs) as recommenders and re-rankers, but pure prompt-based ranking often suffers from poor calibration, sensitivity to candidate ordering, and popularity bias. These limitations make LLMs useful semantic reasoners, but unreliable as standalone ranking engines. We present \textbf{GraphRAG-IRL}, a hybrid recommendation framework that combines graph-grounded feature construction, inverse reinforcement learning (IRL), and persona-guided LLM re-ranking. Our method constructs a heterogeneous knowledge graph over items, categories, and concepts, retrieves both individual and community preference context, and uses these signals to train a Maximum Entropy IRL model for calibrated pre-ranking. An LLM is then applied only to a short candidate list, where persona-guided prompts provide complementary semantic judgments that are fused with IRL rankings. Experiments show that GraphRAG-IRL is a strong standalone recommender: IRL-MLP with GraphRAG improves NDCG@10 by 15.7\% on MovieLens and 16.6\% on KuaiRand over supervised baselines. The results also show that IRL and GraphRAG are superadditive, with the combined gain exceeding the sum of their individual improvements. Persona-guided LLM fusion further improves ranking quality, yielding up to 16.8\% NDCG@10 improvement over the IRL-only baseline on MovieLens ml-1m, while score fusion on KuaiRand provides consistent gains of 4--6\% across LLM providers.
中文摘要 个性化推荐需要模型能够捕捉用户的序列偏好，同时对稀疏反馈和语义模糊保持鲁棒。近期研究探讨了将大型语言模型（LLM）用作推荐器和重排序器，但纯基于提示的排序常常存在校准不良、对候选顺序敏感以及流行度偏差等问题。这些局限使得 LLM 是有用的语义推理器，但作为独立的排序引擎并不可靠。我们提出了 \textbf{GraphRAG-IRL}，这是一个结合了基于图的特征构造、逆强化学习（IRL）和角色引导的 LLM 重排序的混合推荐框架。我们的方法构建了一个覆盖物品、类别和概念的异构知识图谱，检索个体和群体偏好上下文，并利用这些信号训练一个最大熵 IRL 模型以进行校准的预排序。随后 LLM 仅应用于一个简短的候选列表，角色引导的提示提供补充性的语义判断，并与 IRL 排序相融合。实验显示，GraphRAG-IRL 是一个强有力的独立推荐器：配合 GraphRAG 的 IRL-MLP 相比监督基线，在 MovieLens 上将 NDCG@10 提升 15.7%，在 KuaiRand 上提升 16.6%。结果还显示，IRL 和 GraphRAG 具有超加性，二者结合的增益超过各自改进之和。角色引导的 LLM 融合进一步提升了排序质量，在 MovieLens ml-1m 上相比仅 IRL 的基线最高带来 16.8% 的 NDCG@10 提升，而 KuaiRand 上的分数融合在各 LLM 提供商之间带来 4-6% 的稳定增益。
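The results in this entry are reported as NDCG@10. As a reminder of what that metric measures, here is a minimal sketch of NDCG@k using the linear-gain form of DCG. The paper does not state which gain variant it uses, so that choice is an assumption; with binary relevance labels the linear and exponential gains coincide. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain).

    relevances[i] is the relevance label of the item ranked at position i (0-based).
    """
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """Normalize DCG@k by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that places the only relevant item first scores 1.0, and moving that item down the list lowers the score by the logarithmic position discount.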

The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models

大型语言模型中言语抽动的兴起：跨前沿模型的系统分析

Authors: Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang, Ran Wang
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19139
Pdf link: https://arxiv.org/pdf/2604.19139
Abstract As Large Language Models (LLMs) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics -- repetitive, formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers ("That's a great question!", "Awesome!") to pseudo-empathetic affirmations ("I completely understand your concern", "I'm right here to catch you") and overused vocabulary ("delve", "tapestry", "nuanced"). In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Utilizing a custom evaluation framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. Human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p < 0.001). These results underscore the "alignment tax" of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.
中文摘要 随着大型语言模型（LLM）通过人类反馈强化学习（RLHF）和宪法式人工智能（Constitutional AI）等对齐技术不断演进，一个日益普遍且愈发显眼的现象出现了：言语抽动的泛滥，即充斥于模型输出中的重复、公式化的语言模式。这些模式包括谄媚的开场白（“这是个好问题！”、“太棒了！”）、伪共情的肯定（“我完全理解你的担忧”、“我就在这里接住你”）以及被过度使用的词汇（“delve”、“tapestry”、“nuanced”）。本文系统分析了八款最先进的大型语言模型中的言语抽动现象：GPT-5.4、Claude Opus 4.7、Gemini 3.1 Pro、Grok 4.2、Doubao-Seed-2.0-pro、Kimi K2.5、DeepSeek V3.2 和 MiMo-V2-Pro。我们利用定制的基于 API 的标准化评估框架，在英语和中文两种语言下评估了覆盖 10 个任务类别的 10,000 个提示，得到 160,000 条模型响应。我们提出了言语抽动指数（Verbal Tic Index，VTI），这是一个量化抽动普遍程度的综合指标，并分析了它与谄媚、词汇多样性及人类感知自然度的相关性。我们的发现显示出显著的模型间差异：Gemini 3.1 Pro 的 VTI 最高（0.590），而 DeepSeek V3.2 最低（0.295）。我们还进一步证明，言语抽动在多轮对话中不断累积，在主观任务中被放大，并表现出明显的跨语言模式。人类评估（N = 120）证实谄媚与感知自然度之间存在强烈的负相关（r = -0.87，p < 0.001）。这些结果凸显了当前训练范式的“对齐税”，并表明迫切需要更真实的人机交互框架。

ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation

ReflectMT：内化反思以实现高效且高质量的机器翻译

Authors: Kunquan Li, Yingxue Zhang, Fandong Meng, Jinsong Su
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.19144
Pdf link: https://arxiv.org/pdf/2604.19144
Abstract Recent years have witnessed growing interest in applying Large Reasoning Models (LRMs) to Machine Translation (MT). Existing approaches predominantly adopt a "think-first-then-translate" paradigm. Although explicit reasoning trajectories significantly enhance translation quality, they incur prohibitive inference costs and latency. To address these limitations, we propose ReflectMT, a two-stage reflection internalization algorithm for machine translation that employs a "translate-first-think-later" paradigm. Our approach develops the model's "translate-reflect-refine" capability through reinforcement learning. In the first stage, we cultivate the model's capacity for high-quality reflection and refinement, thereby enhancing its semantic comprehension and task-specific knowledge. In the second stage, we train the model to internalize the knowledge acquired during reflection. As a result, during inference, ReflectMT operates in a direct translation mode, producing high-quality translations on the first attempt without any explicit reasoning steps. Experimental results on datasets such as WMT24 demonstrate that our model's first-pass translations during inference outperform multi-step reasoning LRMs such as DeepSeek-R1 in both automatic metrics and GPT-based evaluation, achieving a 2.16-point improvement in GPT-based translation quality evaluation while reducing token consumption by 94.33%.
中文摘要 近年来，将大型推理模型（LRM）应用于机器翻译（MT）的兴趣日益增长。现有方法主要采用“先思考后翻译”的范式。虽然显式推理轨迹显著提升翻译质量，但它们会带来高昂的推理成本和延迟。为解决这些局限性，我们提出了ReflectMT，一种两阶段的机器翻译反思内化算法，采用“先翻译后思考”的范式。我们的方法通过强化学习开发模型的“翻译-反射-精炼”能力。第一阶段，我们培养模型的高质量反思和精炼能力，从而提升其语义理解和任务专属知识。第二阶段，我们训练模型内化反思中获得的知识。因此，在推理过程中，ReflectMT以直接翻译模式运行，首次尝试即可生成高质量的翻译，无需任何显式推理步骤。在如WMT24等数据集上的实验结果表明，我们模型在推理中的首轮翻译在自动指标和基于GPT的评估中均优于多步推理LRM，如DeepSeek-R1，在基于GPT的翻译质量评估中提升了2.16分，同时减少了94.33%的令牌消耗。

RL-ABC: Reinforcement Learning for Accelerator Beamline Control

RL-ABC：加速器光束线控制的强化学习

Authors: Anwar Ibrahim, Fedor Ratnikov, Maxim Kaledin, Alexey Petrenko, Denis Derkach
Subjects: Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex)
Arxiv link: https://arxiv.org/abs/2604.19146
Pdf link: https://arxiv.org/pdf/2604.19146
Abstract Particle accelerator beamline optimization is a high-dimensional control problem traditionally requiring significant expert intervention. We present RLABC (Reinforcement Learning for Accelerator Beamline Control), an open-source Python framework that automatically transforms standard Elegant beamline configurations into reinforcement learning environments. RLABC integrates with the widely-used Elegant beam dynamics simulation code via SDDS-based interfaces, enabling researchers to apply modern RL algorithms to beamline optimization with minimal RL-specific development. The main contribution is a general methodology for formulating beamline tuning as a Markov decision process: RLABC automatically preprocesses lattice files to insert diagnostic watch points before each tunable element, constructs a 57-dimensional state representation from beam statistics, covariance information, and aperture constraints, and provides a configurable reward function for transmission optimization. The framework supports multiple RL algorithms through Stable-Baselines3 compatibility and implements stage learning strategies for improved training efficiency. Validation on a test beamline derived from the VEPP-5 injection complex (37 control parameters across 11 quadrupoles and 4 dipoles) demonstrates that the framework successfully enables RL-based optimization, with a Deep Deterministic Policy Gradient agent achieving 70.3\% particle transmission -- performance matching established methods such as differential evolution. The framework's stage learning capability allows decomposition of complex optimization problems into manageable subproblems, improving training efficiency. The complete framework, including configuration files and example notebooks, is available as open-source software to facilitate adoption and further research.
中文摘要 粒子加速器光束线优化是一个高维控制问题，传统上需要大量专家干预。我们介绍RLABC（加速器光束线控制强化学习），这是一个开源的Python框架，能够自动将标准的Elegant光束线配置转换为强化学习环境。RLABC通过基于SDDS的接口与广泛使用的Elegant束流动力学仿真代码集成，使研究人员能够以最小限度的强化学习特性，将现代强化学习算法应用于光束线优化。其主要贡献是将光束线调谐建模作为马尔可夫决策过程的通用方法：RLABC自动预处理晶格文件，在每个可调元素前插入诊断观察点，结合光束统计、协方差信息和孔径约束构建57维状态表示，并提供可配置的传输优化奖励函数。该框架通过Stable-Baselines3兼容性支持多种强化学习算法，并实现阶段学习策略以提高训练效率。对源自VEPP-5注入复合体的测试光束线（11个四极和4个偶极子的37个控制参数）的验证表明，该框架成功实现基于强化学习的优化，深度确定性策略梯度代理实现了70.3%的粒子传输率——性能与已知方法如差分演化相匹配。该框架的阶段学习能力允许将复杂优化问题分解为可管理的子问题，从而提高训练效率。完整的框架，包括配置文件和示例笔记本，均作为开源软件提供，以便于采用和进一步研究。

Reasoning-Aware AIGC Detection via Alignment and Reinforcement

通过对齐和强化实现推理感知AIGC检测

Authors: Zhao Wang, Max Xiong, Jianxun Lian, Zhicheng Dou
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19172
Pdf link: https://arxiv.org/pdf/2604.19172
Abstract The rapid advancement and widespread adoption of Large Language Models (LLMs) have elevated the need for reliable AI-generated content (AIGC) detection, which remains challenging as models evolve. We introduce AIGC-text-bank, a comprehensive multi-domain dataset with diverse LLM sources and authorship scenarios, and propose REVEAL, a detection framework that generates interpretable reasoning chains before classification. Our approach uses a two-stage training strategy: supervised fine-tuning to establish reasoning capabilities, followed by reinforcement learning to improve accuracy, improve logical consistency, and reduce hallucinations. Extensive experiments show that REVEAL achieves state-of-the-art performance across multiple benchmarks, offering a robust and transparent solution for AIGC detection. The project is open-source at this https URL
中文摘要 大型语言模型（LLM）的快速发展和广泛采用，提高了对可靠AI生成内容（AIGC）检测的需求，而随着模型的发展，这一挑战依然存在。我们介绍了AIGC-text-bank，这是一个涵盖多元LLM来源和作者场景的综合多域数据集，并提出了REVEAL检测框架，在分类前生成可解释的推理链。我们的方法采用两阶段训练策略：监督式微调以建立推理能力，随后进行强化学习以提高准确性、逻辑一致性并减少幻觉。大量实验表明，REVEAL 在多个基准测试中实现了最先进的性能，为 AIGC 检测提供了稳健且透明的解决方案。该项目开源地址为 https URL。

Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification

先思考再匹配：面向通用行人重识别的强化推理范式

Authors: Quan Zhang, Jingze Wu, Jialong Wang, Xiaohua Xie, Jianhuang Lai, Hongbo Chen
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.19218
Pdf link: https://arxiv.org/pdf/2604.19218
Abstract Learning identity-discriminative representations with multi-scene generality has become a critical objective in person re-identification (ReID). However, mainstream perception-driven paradigms tend to fit massive annotated data rather than understand identity-causal cues, which yields representations that are fragile under multiple disruptions. In this work, ReID-R is proposed as a novel reasoning-driven paradigm that achieves explicit identity understanding and reasoning by incorporating chain-of-thought into the ReID pipeline. Specifically, ReID-R consists of a two-stage contribution: (i) Discriminative reasoning warm-up, where a model is trained in a CoT label-free manner to acquire identity-aware feature understanding; and (ii) Efficient reinforcement learning, which proposes a non-trivial sampling to construct scene-generalizable data. On this basis, ReID-R leverages high-quality reward signals to guide the model toward focusing on ID-related cues, achieving accurate reasoning and correct responses. Extensive experiments on multiple ReID benchmarks demonstrate that ReID-R achieves identity discrimination competitive with superior methods using only 14.3K non-trivial data samples (20.9% of the existing data scale). Furthermore, benefiting from inherent reasoning, ReID-R can provide high-quality interpretations of its results.
中文摘要 学习具有多场景泛化性的身份判别表征已成为行人重识别（ReID）的关键目标。然而，主流的感知驱动范式倾向于从大量标注数据中学习拟合，而非理解身份因果线索，这使得表征在多种干扰下十分脆弱。在本研究中，ReID-R 被提出为一种新颖的推理驱动范式，通过将思维链纳入 ReID 流水线，实现显式的身份理解和推理。具体来说，ReID-R 包含两个阶段的贡献：（i）判别式推理预热，即以无需 CoT 标签的方式训练模型，使其获得身份感知的特征理解；以及（ii）高效强化学习，提出一种非平凡采样方法来构建可跨场景泛化的数据。在此基础上，ReID-R 利用高质量的奖励信号引导模型聚焦于与身份（ID）相关的线索，实现准确的推理和正确的响应。在多个 ReID 基准上的大量实验表明，ReID-R 仅使用 14.3K 条非平凡数据（现有数据规模的 20.9%）就取得了与更优方法相当的身份判别能力。此外，得益于内在的推理能力，ReID-R 还能为结果提供高质量的解释。

Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation

学会为正确的步骤分配信用：面向视觉生成的目标感知过程优化

Authors: Rui Li, Ke Hao, Yuanzhi Liang, Haibin Huang, Chi Zhang, YunGu, XueLong Li
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.19234
Pdf link: https://arxiv.org/pdf/2604.19234
Abstract Reinforcement learning, particularly Group Relative Policy Optimization (GRPO), has emerged as an effective framework for post-training visual generative models with human preference signals. However, its effectiveness is fundamentally limited by coarse reward credit assignment. In modern visual generation, multiple reward models are often used to capture heterogeneous objectives, such as visual quality, motion consistency, and text alignment. Existing GRPO pipelines typically collapse these rewards into a single static scalar and propagate it uniformly across the entire diffusion trajectory. This design ignores the stage-specific roles of different denoising steps and produces mistimed or incompatible optimization signals. To address this issue, we propose Objective-aware Trajectory Credit Assignment (OTCA), a structured framework for fine-grained GRPO training. OTCA consists of two key components. Trajectory-Level Credit Decomposition estimates the relative importance of different denoising steps. Multi-Objective Credit Allocation adaptively weights and combines multiple reward signals throughout the denoising process. By jointly modeling temporal credit and objective-level credit, OTCA converts coarse reward supervision into a structured, timestep-aware training signal that better matches the iterative nature of diffusion-based generation. Extensive experiments show that OTCA consistently improves both image and video generation quality across evaluation metrics.
中文摘要 强化学习，特别是群体相对策略优化（Group Relative Policy Optimization，GRPO），已成为带有人类偏好信号的后期可视化生成模型的有效框架。然而，其有效性在基础上受限于粗略的奖励信用分配。在现代视觉生成中，常用多重奖励模型来捕捉异构目标，如视觉质量、运动一致性和文本对齐。现有的GRPO管道通常将这些奖励压缩为单一静态标量，并均匀地在整个扩散轨迹中传播。该设计忽略了不同去噪步骤在阶段特有的作用，产生时序不佳或不兼容的优化信号。为解决这一问题，我们提出了目标感知轨迹学分分配（OTCA），这是一种用于细粒度GRPO培训的结构化框架。OTCA由两个关键组成部分组成。轨迹级信用分解估算不同去噪步骤的相对重要性。多目标信用分配在去噪过程中自适应加权并组合多个奖励信号。通过联合建模时间信用和客观级信用，OTCA将粗奖励监督转化为结构化、时间步感知的训练信号，更符合基于扩散生成的迭代特性。大量实验表明，OTCA在各评估指标上持续提升图像和视频生成质量。
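For context, the baseline that this abstract criticizes (collapsing several rewards into one static scalar and propagating it uniformly over the denoising trajectory) might look like the sketch below. The reward names, weights, and helper names are hypothetical, and the mean and standard-deviation normalization is the commonly used group-relative form, not something this abstract confirms. OTCA itself is not implemented here.

```python
from statistics import mean, pstdev


def scalarize(rewards, weights):
    """Collapse a dict of per-objective rewards into one static scalar."""
    return sum(weights[name] * value for name, value in rewards.items())


def group_relative_advantages(scalar_rewards, eps=1e-8):
    """Normalize each sample's reward against its sampling group.

    Uses the population standard deviation; implementations differ on this detail.
    """
    mu = mean(scalar_rewards)
    sigma = pstdev(scalar_rewards)
    return [(r - mu) / (sigma + eps) for r in scalar_rewards]


def broadcast_over_steps(advantage, num_steps):
    """Assign the same advantage to every denoising step (the uniform baseline)."""
    return [advantage] * num_steps
```

In this baseline every denoising step of a sample receives an identical signal, which is the coarse credit assignment that the abstract proposes to replace with step-aware and objective-aware weights.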

LASER: Learning Active Sensing for Continuum Field Reconstruction

LASER：学习主动感知以重建连续场

Authors: Huayu Deng, Jinghui Zhong, Xiangming Zhu, Yunbo Wang, Xiaokang Yang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Arxiv link: https://arxiv.org/abs/2604.19355
Pdf link: https://arxiv.org/pdf/2604.19355
Abstract High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate ''what-if'' sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high-information regions beyond current observations. Our experiments demonstrate that LASER consistently outperforms static and offline-optimized strategies, achieving high-fidelity reconstruction under sparsity across diverse continuum fields.
中文摘要 高保真的连续物理场测量对于科学发现和工程设计至关重要，但在稀疏且受限的传感条件下仍具挑战性。传统重建方法通常依赖固定的传感器布局，无法适应不断演化的物理状态。我们提出了 LASER，一种统一的闭环框架，将主动传感表述为部分可观测马尔可夫决策过程（POMDP）。LASER 的核心是一个连续场潜在世界模型，它捕捉底层的物理动态并提供内在奖励反馈。这使得强化学习策略能够在潜在想象空间中模拟“假设”传感情景。通过以预测的潜在状态为条件来决定传感器运动，LASER 能够导航至当前观测之外可能蕴含高信息量的区域。我们的实验表明，LASER 始终优于静态策略和离线优化策略，在多种连续场的稀疏条件下实现高保真重建。

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

HP-Edit：面向图像编辑的人类偏好后训练框架

Authors: Fan Li, Chonghuinan Wang, Lina Lei, Yuping Qiu, Jiaqi Xu, Jiaxiu Jiang, Xinran Qin, Zhikai Chen, Fenglong Song, Zhixin Wang, Renjing Pei, Wangmeng Zuo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19406
Pdf link: https://arxiv.org/pdf/2604.19406
Abstract Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset across eight common tasks and balancing common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer--an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.
中文摘要 常见的图像编辑任务通常采用强大的生成扩散模型，作为现实世界内容编辑的领先范式。与此同时，尽管强化学习（RL）方法如扩散-DPO和Flow-GRPO进一步提升了生成质量，但由于缺乏可扩展的人类偏好数据集和针对多样化编辑需求的框架，有效应用人类反馈强化学习（RLHF）在基于扩散的编辑中仍然鲜有探索。为填补这一空白，我们提出了HP-Edit，一个针对人类偏好对齐编辑的后期培训框架，并引入了RealPref-50K，这是一个涵盖八个常见任务和平衡常见对象编辑的真实世界数据集。具体来说，HP-Edit 利用少量人类偏好评分数据和预训练的视觉大型语言模型（VLM）开发了 HP-Scorer——一个自动、符合人类偏好的评估器。然后我们使用 HP-Scorer 高效构建可扩展的偏好数据集，并作为编辑模型后期训练的奖励函数。我们还介绍了RealPref-Bench，这是评估现实编辑性能的一个基准测试。大量实验表明，我们的方法显著增强了诸如Qwen-Image-Edit-2509等模型，使其输出更贴合人类偏好。

EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

EVPO：面向LLM后训练中自适应利用评论家的解释方差策略优化

Authors: Chengjun Pan, Shichun Liu, Jiahang Lin, Dingwei Zhu, Jiazheng Zhang, Shihan Dou, Songyang Gao, Zhenhua Han, Binghai Wang, Rui Zheng, Xuanjing Huang, Tao Gui, Yansong Feng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.19485
Pdf link: https://arxiv.org/pdf/2604.19485
Abstract Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.
中文摘要 LLM后培训的强化学习（RL）面临一个根本设计抉择：是否将学习批评者作为策略优化的基线。经典理论更倾向于基于批评者的方法，如PPO以减少方差，但无批评的替代方案如GRPO因其简洁性和竞争力而被广泛采用。我们表明，在稀疏奖励环境下，学习型批评者可以注入超过其捕获状态信号的估计噪声，增加而非减少优势方差。通过将基线选择归结为卡尔曼滤波问题，我们将PPO和GRPO统一为卡尔曼增益的两个极端，并证明可从单一训练批次计算出的解释方差（EV）确定了精确边界：正EV表示批判者减少了方差，而零或负EV则表示方差膨胀。基于这一见解，我们提出了解释性方差策略优化（EVPO），该方法在每个训练步骤监控批次级的EV，并在基于批评者和批量均值优势估计之间自适应切换，证明在每个步骤中，方差不会比两者中较优者更大。在涵盖经典控制、代理互动和数学推理的四项任务中，EVPO无论哪个固定基线在特定任务中更强，都始终优于PPO和GRPO。进一步分析确认，自适应门槛追踪批评者在训练过程中的成熟，且理论上推导出的零阈值在经验上最优。
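The gating rule described in this abstract (use the critic when batch-level explained variance is positive, otherwise fall back to a batch-mean baseline) can be sketched as follows. This is a simplified illustration, not the authors' code: it uses plain returns instead of any particular advantage estimator, the function names are hypothetical, and treating a constant-return batch as EV = 0 is an assumption made here to avoid division by zero.

```python
from statistics import mean, pvariance


def explained_variance(returns, values):
    """EV = 1 - Var(returns - values) / Var(returns), computed on one batch."""
    var_returns = pvariance(returns)
    if var_returns == 0:
        # Assumption: a constant-return batch carries no signal about the critic.
        return 0.0
    residuals = [g - v for g, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


def gated_advantages(returns, values):
    """Pick the baseline per batch using the zero threshold named in the abstract."""
    ev = explained_variance(returns, values)
    if ev > 0:
        # Critic explains part of the return variance: use it as the baseline.
        return [g - v for g, v in zip(returns, values)], ev
    # Zero or negative EV: the critic adds noise, so use the batch mean instead.
    baseline = mean(returns)
    return [g - baseline for g in returns], ev
```

For example, with returns `[0, 1, 0, 1]` a critic predicting `[0.1, 0.9, 0.1, 0.9]` yields a positive EV and is used, while a critic predicting `[1, 0, 1, 0]` yields a negative EV and is bypassed.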

Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

多模态推理与大型语言模型用于视觉语义算术

Authors: Chuou Xu, Liya Ji, Qifeng Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19567
Pdf link: https://arxiv.org/pdf/2604.19567
Abstract Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy "king"-"man"+"woman" = "queen" illustrates relational reasoning, yet replacing text with images of "king" and "man" significantly reduces performance because it requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that "powder" and "cake" are related by "is made of" grounds symbolic relations in perception, enabling tool substitution, task generalization, and improved semantic reasoning. Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, but suffers from modality gaps and lacks systematic evaluation. In this paper, we formulate two novel tasks, two-term subtraction and three-term operations, and construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. We further propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO). Our method achieves state-of-the-art results on IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning, this work advances domestic robots' ability to ground symbolic reasoning in perception, enhancing decision-making, tool adaptability, and human-robot interaction in complex environments. Datasets and source code are provided in the supplementary material.
中文摘要 强化学习（RL）作为后训练，对于提升大型语言模型（LLMs）在编码和数学中的推理能力至关重要。然而，它们在视觉语义算术中通过图像推断关系的能力尚未被充分探索。经典文本类比“国王”-“男人”+“女人”=“女王”说明了关系推理，但用“国王”和“男人”的图像替代文本会显著降低表现，因为这需要常识知识和从无关紧要的视觉细节中提取简明概念。这一能力对于服务和家庭机器人在非结构化环境中非常重要，因为机器人必须推断对象、代理和动作之间的语义关系。在厨房中，从图像中识别出“粉末”和“蛋糕”的关联是“是由”做的“，从而建立感知上的象征关系，促进工具替代、任务推广和语义推理的改进。此前的工作通过向量算术解码图像特征来处理语义算术，但存在模态缺口且缺乏系统性评估。本文提出了两项新颖任务：两项减法和三项运算，并构建了用于基准测试的图像-关系对数据集（IRPD）。我们还提出了语义算术强化微调（SAri-RFT），该方法通过可验证函数和群相对策略优化（Group Relative Policy Optimization，GRPO）对大型视觉语言模型（LVLM）进行后期训练。我们的方法在IRPD和现实世界的Visual7W-Telling数据集上取得了最先进的结果。通过为LVLM配备强大的跨模态关系推理能力，这项工作提升了家用机器人在感知中扎根符号推理的能力，提升了决策能力、工具适应性以及复杂环境中的人机交互能力。补充材料中提供了数据集和源代码。

Lyapunov-Certified Direct Switching Theory for Q-Learning

Lyapunov认证的Q-学习直接切换理论

Authors: Donghwan Lee
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2604.19569
Pdf link: https://arxiv.org/pdf/2604.19569
Abstract Q-learning is one of the most fundamental algorithms in reinforcement learning. We analyze constant-stepsize Q-learning through a direct stochastic switching system representation. The key observation is that the Bellman maximization error can be represented exactly by a stochastic policy. Therefore, the Q-learning error admits a switched linear conditional-mean recursion with martingale-difference noise. The intrinsic drift rate is the joint spectral radius (JSR) of the direct switching family, which can be strictly smaller than the standard row-sum rate. Using this representation, we derive a finite-time final-iterate bound via a JSR-induced Lyapunov function and then give a computable quadratic-certificate version.
中文摘要 Q-学习是强化学习中最基础的算法之一。我们通过直接随机切换系统表示来分析常步长 Q-学习。关键观察是，贝尔曼最大化误差可以由一个随机策略精确表示。因此，Q-学习误差满足一个带有鞅差噪声的切换线性条件均值递推。其内在漂移率是直接切换族的联合谱半径（JSR），它可以严格小于标准的行和速率。利用该表示，我们通过 JSR 诱导的李雅普诺夫函数推导出有限时间的最终迭代界，并给出一个可计算的二次证书版本。
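As background for this entry, the constant-stepsize tabular Q-learning update that the paper analyzes can be written in a few lines. The sketch below is the textbook update only; it does not reproduce the switching-system representation or the Lyapunov bound from the paper. The toy single-state problem and all function names are illustrative assumptions.

```python
def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one constant-stepsize Q-learning update in place; return the TD error."""
    td_target = reward + gamma * max(q[next_state])
    td_error = td_target - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


def greedy_policy(q):
    """Return the greedy action index for each state (ties go to the lowest index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


def run_single_state_demo(steps=2000, alpha=0.1, gamma=0.9):
    """Toy problem: one state, one action, reward 1 on every step.

    The fixed point of the update is 1 / (1 - gamma).
    """
    q = [[0.0]]
    for _ in range(steps):
        q_learning_update(q, 0, 0, 1.0, 0, alpha, gamma)
    return q[0][0]
```

In the toy problem the error shrinks by a factor of `1 - alpha * (1 - gamma)` per step, so the iterate approaches `1 / (1 - gamma)` geometrically.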

SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing

SmartPhotoCrafter：自动摄影图像编辑的统一推理、生成与优化

Authors: Ying Zeng, Miaosen Luo, Guangyuan Li, Yang Yang, Ruiyang Fan, Linxiao Shi, Qirui Yang, Jian Zhang, Chengcheng Liu, Siming Zheng, Jinwei Chen, Bo Li, Peng-Tao Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2604.19587
Pdf link: https://arxiv.org/pdf/2604.19587
Abstract Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method which formulates image editing as a tightly coupled reasoning-to-generation process. The proposed model first performs image quality comprehension and identifies deficiencies by the Image Critic module, and then the Photographic Artist module realizes targeted edits to enhance image appeal, eliminating the need for explicit human instructions. A multi-stage training pipeline is adopted: (i) Foundation pretraining to establish basic aesthetic understanding and editing capabilities, (ii) Adaptation with reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) Coordinated reasoning-to generation reinforcement learning to jointly optimize reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic image generation, while supporting both image restoration and retouching tasks with consistent adherence to color- and tone-related semantics. We also construct a stage-specific dataset, which progressively builds reasoning and controllable generation, effective cross-module collaboration, and ultimately high-quality photographic enhancement. Experiments demonstrate that SmartPhotoCrafter outperforms existing generative models on the task of automatic photographic enhancement, achieving photo-realistic results while exhibiting higher tonal sensitivity to retouching instructions. Project page: this https URL.
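Chinese abstract (English translation) Traditional photographic image editing usually requires users to have enough aesthetic understanding to give suitable instructions for adjusting image quality and camera parameters. This paradigm, however, depends on explicit human statements of aesthetic intent, which are often ambiguous, incomplete, or out of reach for non-expert users. In this work we propose SmartPhotoCrafter, an automatic photographic image editing method that treats image editing as a tightly coupled reasoning-to-generation process. The model first uses the Image Critic module to understand image quality and identify deficiencies; the Photographic Artist module then carries out targeted edits to improve the image's appeal, with no explicit human instructions required. Training proceeds in multiple stages: (i) foundation pretraining to establish basic aesthetic understanding and editing ability, (ii) adaptation under reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) coordinated reasoning-to-generation reinforcement learning that jointly optimizes reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic generation while supporting both restoration and retouching tasks, consistently adhering to color- and tone-related semantics. We also build a stage-specific dataset that progressively develops reasoning and controllable generation, effective cross-module collaboration, and finally high-quality photographic enhancement. Experiments show that SmartPhotoCrafter outperforms existing generative models on automatic photographic enhancement, producing photo-realistic results with higher tonal sensitivity to retouching instructions. Project page: this https URL.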

Pause or Fabricate? Training Language Models for Grounded Reasoning

Authors: Yiwen Qiu, Linjuan Wu, Yizhou Liu, Yuchen Yan, Jin Ma, Xu Tan, Yao Hu, Daoxin Zhang, Wenqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2604.19656
Pdf link: https://arxiv.org/pdf/2604.19656
Abstract Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions -- a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness -- the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.
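Chinese abstract (English translation) Large language models have made remarkable progress on complex reasoning tasks. When inputs are incomplete, however, they often implicitly fabricate information and produce confident but unreliable conclusions, a failure mode we call ungrounded reasoning. We argue that the problem stems not from insufficient reasoning ability but from a lack of inferential boundary awareness, that is, the ability to recognize when the premises needed for valid inference are missing. To address it, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL splits the reasoning process into two stages: clarify and pause, which judges whether the available information is sufficient, and grounded reasoning, which solves the task once the necessary premises are established. We design stage-specific rewards that penalize hallucinations, so that models can detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (by up to 45%), raising task success by 30% while cutting average response length by more than 20%. Further analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.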

Learning Hybrid-Control Policies for High-Precision In-Contact Manipulation Under Uncertainty

Authors: Hunter L. Brown, Geoffrey Hollinger, Stefan Lee
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.19677
Pdf link: https://arxiv.org/pdf/2604.19677
Abstract Reinforcement learning-based control policies have been frequently demonstrated to be more effective than analytical techniques for many manipulation tasks. Commonly, these methods learn neural control policies that predict end-effector pose changes directly from observed state information. For tasks like inserting delicate connectors which induce force constraints, pose-based policies have limited explicit control over force and rely on carefully tuned low-level controllers to avoid executing damaging actions. In this work, we present hybrid position-force control policies that learn to dynamically select when to use force or position control in each control dimension. To improve learning efficiency of these policies, we introduce Mode-Aware Training for Contact Handling (MATCH) which adjusts policy action probabilities to explicitly mirror the mode selection behavior in hybrid control. We validate MATCH's learned policy effectiveness using fragile peg-in-hole tasks under extreme localization uncertainty. We find MATCH substantially outperforms pose-control policies -- solving these tasks with up to 10% higher success rates and 5x fewer peg breaks than pose-only policies under common types of state estimation error. MATCH also demonstrates data efficiency equal to pose-control policies, despite learning in a larger and more complex action space. In over 1600 sim-to-real experiments, we find MATCH succeeds twice as often as pose policies in high noise settings (33% vs. 68%) and applies ~30% less force on average compared to variable impedance policies on a Franka FR3 in laboratory conditions.
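Chinese abstract (English translation) Reinforcement learning-based control policies have often been shown to be more effective than analytical techniques on many manipulation tasks. These methods typically learn neural control policies that predict end-effector pose changes directly from observed state information. For tasks that impose force constraints, such as inserting delicate connectors, pose-based policies have limited explicit control over force and depend on carefully tuned low-level controllers to avoid damaging actions. In this work we present hybrid position-force control policies that learn to choose dynamically, in each control dimension, when to use force control and when to use position control. To make these policies more efficient to learn, we introduce Mode-Aware Training for Contact Handling (MATCH), which adjusts the policy's action probabilities to explicitly mirror the mode-selection behavior of hybrid control. We validate the learned MATCH policies on fragile peg-in-hole tasks under extreme localization uncertainty. Under common types of state estimation error, MATCH reaches success rates up to 10% higher than pose-only policies with 5x fewer peg breaks. MATCH also matches the data efficiency of pose-control policies despite learning in a larger, more complex action space. Across more than 1600 sim-to-real experiments, MATCH succeeds twice as often as pose policies in high-noise settings (33% vs. 68%) and, on a Franka FR3 in laboratory conditions, applies about 30% less force on average than variable impedance policies.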
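The per-dimension choice between force and position control described in this abstract can be illustrated with a minimal sketch of classical hybrid position/force control. It is not the MATCH implementation: the function name `hybrid_command`, the `HybridGains` values, and the simple proportional control laws are all assumptions made for illustration, and in the paper the mode bits would come from a learned policy rather than being passed in by hand.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class HybridGains:
    # Hypothetical gains, chosen only to make the example concrete.
    kp_pos: float = 50.0    # proportional gain on position error
    kp_force: float = 0.5   # proportional gain on force error


def hybrid_command(
    mode: Sequence[int],
    pos_err: Sequence[float],
    force_err: Sequence[float],
    gains: HybridGains = HybridGains(),
) -> List[float]:
    """Return one command per control dimension.

    mode[i] == 1 selects force control on axis i, mode[i] == 0 selects
    position control. Exactly one of the two laws drives each axis.
    """
    if not (len(mode) == len(pos_err) == len(force_err)):
        raise ValueError("mode, pos_err and force_err must have equal length")
    command = []
    for m, pe, fe in zip(mode, pos_err, force_err):
        if m == 1:
            command.append(gains.kp_force * fe)  # force-controlled axis
        else:
            command.append(gains.kp_pos * pe)    # position-controlled axis
    return command


# Position control on x and y, force control on z (e.g. the insertion axis).
cmd = hybrid_command(
    mode=[0, 0, 1],
    pos_err=[0.01, -0.02, 0.5],
    force_err=[0.0, 0.0, -2.0],
)
```

In this example the large position error on z is ignored because that axis is force-controlled, which is the property that lets a hybrid policy limit contact force during insertion.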

FASTER: Value-Guided Sampling for Fast RL

Authors: Perry Dong, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2604.19730
Pdf link: https://arxiv.org/pdf/2604.19730
Abstract Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at this https URL.
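Chinese abstract (English translation) Some of today's best-performing reinforcement learning algorithms can be prohibitively expensive because they rely on test-time scaling methods such as sampling several action candidates and picking the best one. In this work we propose FASTER, a method that obtains the benefits of sampling-based test-time scaling for diffusion-based policies without the computational cost, by tracing the performance gain of action samples back to earlier points in the denoising process. Our key insight is that denoising several action candidates and selecting the best one can be modeled as a Markov Decision Process (MDP) whose goal is to filter action candidates progressively before denoising is complete. With this MDP we can learn a policy and a value function in the denoising space that predict the downstream value of action candidates during denoising and filter them while maximizing returns. The result is a lightweight method that can be plugged into existing generative RL algorithms. On challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER reaches the same performance while substantially reducing training and inference compute. Code is available at this https URL.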
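The idea of filtering action candidates before denoising finishes can be sketched in a few lines. This is only an illustration under strong assumptions, not the FASTER algorithm: the function `filter_while_denoising`, the fixed `keep_fraction` rule, the scalar "actions", and the toy denoiser and value function are all hypothetical, whereas the paper learns the filtering policy and value function.

```python
from typing import Callable, List, Sequence


def filter_while_denoising(
    candidates: Sequence[float],
    denoise_step: Callable[[float, int], float],
    value_fn: Callable[[float, int], float],
    num_steps: int,
    keep_fraction: float = 0.5,
) -> float:
    """Denoise a pool of candidates, dropping low-value ones after each step."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool: List[float] = list(candidates)
    for t in range(num_steps):
        # One denoising step for every candidate still in the pool.
        pool = [denoise_step(a, t) for a in pool]
        if len(pool) > 1:
            # Rank partially denoised candidates by predicted value, keep the top share.
            pool.sort(key=lambda a: value_fn(a, t), reverse=True)
            pool = pool[: max(1, int(len(pool) * keep_fraction))]
    # Pick the best survivor once denoising is complete.
    return max(pool, key=lambda a: value_fn(a, num_steps))


class CountingDenoiser:
    """Toy denoiser that halves its input and counts how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, action: float, step: int) -> float:
        self.calls += 1
        return 0.5 * action


def toy_value(action: float, step: int) -> float:
    # Hypothetical value estimate: actions closer to zero are better.
    return -abs(action)


denoiser = CountingDenoiser()
best = filter_while_denoising([4.0, -1.0, 2.0, -8.0], denoiser, toy_value, num_steps=2)
print(best, denoiser.calls)  # -0.25 6
```

Plain best-of-N with four candidates and two steps would need eight denoising calls; the filtered version above uses six, and the gap grows with more candidates and more steps.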

Safe Continual Reinforcement Learning in Non-stationary Environments

Authors: Austin Coursey, Abel Diaz-Gonzalez, Marcos Quinones-Grueiro, Gautam Biswas
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2604.19737
Pdf link: https://arxiv.org/pdf/2604.19737
Abstract Reinforcement learning (RL) offers a compelling data-driven paradigm for synthesizing controllers for complex systems when accurate physical models are unavailable; however, most existing control-oriented RL methods assume stationarity and, therefore, struggle in real-world non-stationary deployments where system dynamics and operating conditions can change unexpectedly. Moreover, RL controllers acting in physical environments must satisfy safety constraints throughout their learning and execution phases, rendering transient violations during adaptation unacceptable. Although continual RL and safe RL have each addressed non-stationarity and safety, respectively, their intersection remains comparatively unexplored, motivating the study of safe continual RL algorithms that can adapt over the system's lifetime while preserving safety. In this work, we systematically investigate safe continual reinforcement learning by introducing three benchmark environments that capture safety-critical continual adaptation and by evaluating representative approaches from safe RL, continual RL, and their combinations. Our empirical results reveal a fundamental tension between maintaining safety constraints and preventing catastrophic forgetting under non-stationary dynamics, with existing methods generally failing to achieve both objectives simultaneously. To address this shortcoming, we examine regularization-based strategies that partially mitigate this trade-off and characterize their benefits and limitations. Finally, we outline key open challenges and research directions toward developing safe, resilient learning-based controllers capable of sustained autonomous operation in changing environments.
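Chinese abstract (English translation) Reinforcement learning (RL) offers a compelling data-driven paradigm for synthesizing controllers for complex systems when accurate physical models are unavailable. Most existing control-oriented RL methods assume stationarity, however, and therefore struggle in real-world non-stationary deployments where system dynamics and operating conditions can change unexpectedly. RL controllers acting in physical environments must also satisfy safety constraints throughout learning and execution, which makes transient violations during adaptation unacceptable. Continual RL and safe RL have addressed non-stationarity and safety respectively, but their intersection remains comparatively unexplored, which motivates the study of safe continual RL algorithms that adapt over the system's lifetime while preserving safety. In this work we systematically investigate safe continual reinforcement learning by introducing three benchmark environments that capture safety-critical continual adaptation and by evaluating representative approaches from safe RL, continual RL, and their combinations. Our empirical results reveal a fundamental tension between maintaining safety constraints and preventing catastrophic forgetting under non-stationary dynamics, with existing methods generally failing to achieve both at once. To address this shortcoming, we examine regularization-based strategies that partially mitigate the trade-off and characterize their benefits and limitations. Finally, we outline key open challenges and research directions toward safe, resilient learning-based controllers capable of sustained autonomous operation in changing environments.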

Keyword: diffusion policy

No results.