Arxiv Papers of Today

Generated: 2026-07-03 18:38:13 (UTC+8); arXiv release time: 2026-07-03 20:00 EDT (2026-07-04 08:00 UTC+8)

34 relevant papers today

Keyword: reinforcement learning

WaveLander: A Generalizable Hierarchical Control Framework for UAV Landing on Wave-Disturbed Platforms via Reinforcement Learning

WaveLander：一种可通用的分层控制框架，通过强化学习实现无人机在波扰平台上着陆

Authors: Chun-Kit Li, Iok Long Sit, Ming Fung Siu, Ka Yu Kui, Hin Wang Lin, Pengyu Wang, Ling Shi
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.01281
Pdf link: https://arxiv.org/pdf/2607.01281
Abstract Autonomous landing of unmanned aerial vehicles (UAVs) on wave-disturbed marine platforms remains challenging due to stochastic platform motion, time-varying platform attitude, and uncertain touchdown conditions. Existing model-based methods often require accurate motion prediction and online optimization, while end-to-end learning approaches may suffer from high training complexity and limited interpretability. This paper presents WaveLander, a hierarchical control framework via reinforcement learning (RL) that decouples vertical landing decision-making from low-level flight stabilization. The RL policy maps a compact platform-relative observation to a scalar vertical velocity reference, while a conventional low-level flight controller maintains attitude stability and lateral tracking. This formulation reduces dynamic platform landing to a low-dimensional, timing-aware control problem and enables smooth landing behavior without explicit switching rules. Simulation results under randomized wave-induced platform motions show that WaveLander achieves robust landing performance and generalizes to unseen disturbance conditions, demonstrating the potential of hierarchical learning-based control for marine UAV recovery.
中文摘要 由于平台运动随机、平台姿态时变和着陆条件不确定，无人机（UAV）在受波扰海洋平台上自主着陆仍然充满挑战。现有基于模型的方法通常需要准确的运动预测和在线优化，而端到端学习方法则可能存在高训练复杂性和有限的解释性。本文介绍了WaveLander，一种通过强化学习（RL）实现的分层控制框架，将垂直着陆决策与低空飞行稳定脱钩。强化操作策略将紧凑的平台相对观测映射到标量垂直速度参考，而传统的低空飞行控制器则保持姿态稳定性和横向跟踪。该表述将动态平台着陆简化为低维、时序感知控制问题，实现平稳着陆行为而无需显式切换规则。在随机波浪诱导平台运动下的模拟结果显示，WaveLander 实现了稳健的着陆性能，并能推广到看不见的扰动条件，展示了基于分层学习的控制在海洋无人机回收中的潜力。

Simulation Based Reward Function Validation for Multi-Agent On Orbit Inspection

基于仿真的多智能体轨道检测奖励函数验证

Authors: Patrick Quinn, Bala Prenith Reddy Gopu, George M. Nehma, Madhur Tiwari
Subjects: Multiagent Systems (cs.MA); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.01367
Pdf link: https://arxiv.org/pdf/2607.01367
Abstract A proposed method for the control of groups of inspection spacecraft is Multi-Agent Reinforcement Learning (MARL). While MARL has already been employed for this purpose in previous work, the reward functions used focus on reaching a finite set of predetermined inspection points around the target. In this work, we study and develop a generalized reward function for the MARL inspection task informed by the analysis of 3D reconstructions of inspected objects in orbit. Because the reward function is generalized such that any number of images at arbitrary locations may be evaluated, we also allow trained agents to have complete control over when images are collected. With this approach, we gather insights into best practices for the specific MARL inspection task and also gain key takeaways that inform the broader inspection task outside of a MARL context.
中文摘要 一种用于控制检查航天器群的建议方法是多智能体强化学习（Multi-Agent Reinforcement Learning，MARL）。虽然MARL已在之前的工作中用于此目的，但所用的奖励函数侧重于达到目标周围有限的预定检查点。本研究基于轨道上被检查物体三维重建分析，研究并开发MARL检查任务的广义奖励函数。由于奖励函数是推广的，可以评估任意数量的任意位置图像，我们也允许受过训练的代理完全控制图像采集时间。通过这种方法，我们不仅能获得针对特定MARL检查任务的最佳实践见解，还能获得对MARL环境之外更广泛检查任务的重要信息。

The Rollout Infrastructure Tax in Coding-Agent Reinforcement Learning

编码代理强化学习中的 Rollout 基础设施税

Authors: Daniel Thi Graviet, Lovre Pesut, Ivan Dagelic, Vedran Jukic, Ivan Burazin
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2607.01415
Pdf link: https://arxiv.org/pdf/2607.01415
Abstract Coding-agent reinforcement learning treats execution infrastructure as a background implementation detail, despite relying on large numbers of interactive software rollouts. This is a missed opportunity: measuring infrastructure overhead can reveal practical efficiency gains for RL post-training, where small per-rollout savings compound at scale. We present a comparative study of four execution substrates: single containers, hosted sandboxes, Kubernetes-orchestrated containers, and cloud virtual machines. We find up to $110\times$ variation in cold-start latency and a $1.8\times$ spread in projected worker-hours for one million 150-step trajectories. Our results suggest that future coding-agent RL systems should optimize execution substrates as part of the training system itself, not merely as deployment plumbing.
中文摘要 编码代理强化学习将执行基础设施视为后台实现细节，尽管它依赖大量交互式软件 rollout。这是一个错失的机会：衡量基础设施开销可以揭示强化学习后训练中的实际效率提升，因为每次 rollout 的少量节省会在大规模下累积。我们对四种执行基底进行了比较研究：单一容器、托管沙箱、Kubernetes 编排容器和云虚拟机。我们发现冷启动延迟的差异最高可达 $110\times$，而对于 100 万条 150 步轨迹，预计工时的差距为 $1.8\times$。我们的结果表明，未来的编码代理强化学习系统应将执行基底作为训练系统本身的一部分进行优化，而不仅仅将其视为部署管道。
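
The worker-hour projection mentioned in this abstract can be illustrated with simple arithmetic. The sketch below is not the authors' cost model: it assumes one cold start per trajectory plus a fixed per-step overhead, and the function name and all latency values are hypothetical.

```python
def projected_worker_hours(n_trajectories, steps_per_trajectory,
                           cold_start_s, per_step_overhead_s):
    """Worker-hours spent on infrastructure overhead alone.

    Assumes (hypothetically) one cold start per trajectory and a
    constant overhead for every step of the trajectory.
    """
    seconds = n_trajectories * (
        cold_start_s + steps_per_trajectory * per_step_overhead_s
    )
    return seconds / 3600.0


# Made-up latencies for two substrates, used only to show the arithmetic.
fast = projected_worker_hours(1_000_000, 150,
                              cold_start_s=0.5, per_step_overhead_s=0.05)
slow = projected_worker_hours(1_000_000, 150,
                              cold_start_s=55.0, per_step_overhead_s=0.05)
print(f"fast substrate: {fast:.0f} h, slow substrate: {slow:.0f} h")
```

Under these assumptions a large cold-start gap shrinks to a much smaller gap in total hours, because per-step overhead is paid 150 times per trajectory while the cold start is paid once.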

FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning

FaithMed：培训LLMs以忠实循证医学推理

Authors: Zhiyun Zhang, Liwen Sun, Xiang Qian, Chenyan Xiong
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2607.01440
Pdf link: https://arxiv.org/pdf/2607.01440
Abstract Faithful reasoning is essential in medicine, where clinical decisions require transparent justification grounded in reliable evidence. Current medical LLMs either lack active access to evidence or use retrieved evidence without supervising how it should be appraised and applied during reasoning. To address this, we formalize evidence-based medicine principles as process-level criteria and introduce FaithMed, a framework that combines clinician-designed, automatically refined rubrics with reinforcement learning using step-level process reward assignment and advantage grouping. Across seven medical benchmarks, FaithMed improves over agentic-search baselines (+9% on average) and outcome-only RL (+5.8%), while raising average evidence-based medicine rubric scores over agentic-search Qwen3 baselines (+15.5%). This work demonstrates that explicit step-level supervision can improve both task success and the faithfulness of the reasoning process. Code is available at this https URL.
中文摘要 忠实推理在医学中至关重要，临床决策需要基于可靠证据的透明正当性。当前的医学大型语言模型要么缺乏主动证据访问，要么在推理过程中未监督如何评估和应用检索到的证据。为此，我们将循证医学原则正式化为过程层面的标准，并引入了FaithMed框架，该框架结合了临床医生设计、自动优化的评分标准与基于步骤级过程奖励分配和优势分组的强化学习。在七个医学基准中，FaithMed优于代理搜索基线（平均+9%）和仅结果的强化分析（+5.8%），同时在基于循证医学评分标准的平均值上高于代理搜索Qwen3基线（+15.5%）。这项工作表明，明确的步骤级监督可以提升任务成功率和推理过程的忠实性。代码可在此 https URL 访问。

Beyond Next-Token Prediction: An RLVR Proof of Concept for Tool-Use Agents on Atlassian Workflows

超越下一个令牌预测：Atlassian工作流中工具使用代理的RLVR概念验证

Authors: Karthikeya Aditya Vissa, Sankalp Mane, Ananya Mantravadi, Harshit Rajgarhia, Abhishek Mukherji
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.01465
Pdf link: https://arxiv.org/pdf/2607.01465
Abstract Large language models are trained to predict the next token, not to act inside a specific API. In niche enterprise SaaS workflows -- where success means hitting the right endpoint with the right nested arguments in the right order -- this objective mismatch shows up as silent failures: dropped required fields, hallucinated tools, or early stops after a single read. We ask whether Reinforcement Learning with Verifiable Rewards (RLVR), applied directly in the target environment, closes the gap. As a proof of concept we build a suite of five synthetic environments emulating the Jira REST v3 and Confluence v2 APIs at schema fidelity; rewards are computed entirely from the tool-call trace, with no live API, no learned judge, and no human label in the loop. Scoring prompted Qwen3-1.7B and Qwen3.5-4B on the same checkers that drive GRPO training, we find that on the four scenarios whose rewards are non-degenerate the RL-trained policy lifts average reward from a 4B-baseline range of 0.35--0.92 to 0.95--1.00, with the largest single gain on Confluence page creation ($0.35 \rightarrow 1.00$). We position this as a preliminary step toward outcome-optimised small models for niche enterprise APIs, and foreground two limitations a workshop reader should weigh: hand-crafting verifiable rewards does not scale beyond the handful of endpoints reported here, and one of our five scenarios (ticket-transition) has a saturating reward shape that the prompted 4B already maxes out.
中文摘要 大型语言模型训练是为了预测下一个令牌，而不是在特定API内行动。在细分企业SaaS工作流中——成功意味着以正确的顺序、正确的嵌套参数击中正确的端点——这种客观不匹配表现为无声的失败：丢弃必需字段、错觉工具，或一次读取后提前停止。我们探问，直接应用于目标环境中的可验证奖励强化学习（RLVR）是否能缩小差距。作为概念验证，我们构建了一套由五个合成环境组成的套件，模拟Jira REST v3和Confluence v2的API，且符合模式的保真度;奖励完全基于工具调用追踪计算，没有实时API，没有学过的评判，也没有人工标签。在驱动GRPO训练的同一检查器上，Qwen3-1.7B和Qwen3.5-4B的评分结果显示，在四个奖励非退化的情景中，RL训练的政策将平均奖励从4B基线区间0.35-0.92提升到0.95-1.00，其中Confluence页面创建时获得的最大单一收益（$0.35 \ rightarrow 1.00$）。我们将此定位为迈向细分企业API中优化结果的小模型的初步步骤，并强调研讨会读者应权衡的两个局限：手工打造可验证奖励的规模仅限于此处提及的少数几个端点;以及我们五个场景之一（工单转移）拥有饱和的奖励形态，而提示的4B已达到极限。
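
A reward computed entirely from the tool-call trace, as this abstract describes, might look roughly like the checker below. This is a minimal sketch and not the paper's checker: the trace format, the tool names (`get_issue`, `create_page`), and the in-order matching rule are all assumptions made for illustration.

```python
def trace_reward(trace, expected):
    """Fraction of expected tool calls found in the trace, in order.

    trace, expected: lists of {"tool": str, "args": dict} (assumed format).
    A call matches when the tool name is equal and every expected
    argument is present with the same value.
    """
    if not expected:
        return 0.0
    matched = 0
    pos = 0  # earliest trace index still available for matching
    for want in expected:
        for i in range(pos, len(trace)):
            call = trace[i]
            same_tool = call["tool"] == want["tool"]
            same_args = all(call["args"].get(k) == v
                            for k, v in want["args"].items())
            if same_tool and same_args:
                matched += 1
                pos = i + 1
                break
    return matched / len(expected)


# Hypothetical scenario: read an issue, then create a page.
expected = [
    {"tool": "get_issue", "args": {"key": "PROJ-1"}},
    {"tool": "create_page", "args": {"space": "ENG", "title": "Notes"}},
]
full_trace = [
    {"tool": "get_issue", "args": {"key": "PROJ-1"}},
    {"tool": "create_page", "args": {"space": "ENG", "title": "Notes"}},
]
early_stop = full_trace[:1]  # agent stops after a single read
dropped_field = [
    {"tool": "get_issue", "args": {"key": "PROJ-1"}},
    {"tool": "create_page", "args": {"space": "ENG"}},  # required title missing
]
```

The early-stop and dropped-field traces correspond to two of the silent failure modes the abstract lists; both receive partial credit here, whereas a stricter checker could return zero for them.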

Procedural Memory Distillation: Online Reflection for Self-Improving Language Models

程序性记忆提炼：自我提升语言模型的在线反思

Authors: Ye Liu, Srijan Bansal, Bo Pang, Yang Li, Zeyu Leo Liu, Yifei Ming, Zixuan Ke, Shafiq Joty, Semih Yavuz
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.01480
Pdf link: https://arxiv.org/pdf/2607.01480
Abstract Reinforcement learning with verifiable rewards (RLVR), along with recent self-distillation variants such as SDPO, evaluates each rollout against a verifier and updates the policy from that episode-level signal. However, the richer procedural information in the rollout is rarely retained or reused. Across episodes and epochs, the model repeatedly encounters related problems under a changing policy, producing cross-episode signals that episode-local updates cannot capture: which strategies consistently pass verification, which failure modes persist, which patterns recur. We propose Procedural Memory Distillation (PMD), which converts these cross-episode signals into reusable procedural memory and distills it into the policy's weights during training. This memory functions as a training scaffold, absorbed into the policy itself, yielding a memory-free model at inference. PMD organizes the memory at three levels of abstraction: raw trajectories, self-reflected strategies and lessons, and higher-level behavioral patterns that recur across problems, all extracted online from the model's own trajectories. A memory-conditioned self-teacher draws on the accumulated experience to supervise the student on its own rollouts, enabling the student to progressively internalize procedural knowledge within its parameters. The central design principle is co-evolution: the policy generates rollouts that update the memory, and memory shapes the supervision that updates the policy. Empirically, across Qwen3-8B and OLMo3-Instruct-7B, PMD improves over SDPO by 3.8-5.5% on SCIKNOWEVAL and 7.9-13.6% on LIVECODEBENCH. Co-evolution powers these gains: freezing either the memory or the policy trails PMD by more than 10% across SCIKNOWEVAL domains.
中文摘要 带可验证奖励的强化学习（RLVR）以及近期的自蒸馏变体如SDPO，会根据验证者对每个部署进行评估，并根据该剧集级信号更新策略。然而，部署中更丰富的程序性信息很少被保留或重复使用。跨越不同事件和时代，模型反复遇到相关问题，且策略变化，产生跨事件信号，这些信号无法通过本地事件更新：哪些策略持续通过验证，哪些失败模式持续存在，哪些模式反复出现。我们提出了程序性记忆蒸馏（PMD），将这些跨事件信号转换为可重复使用的程序性记忆，并在训练过程中提炼为策略权重。该内存作为训练支架，被策略吸收，推理时生成无内存模型。PMD将记忆组织在三个抽象层次：原始轨迹、自我反思的策略和经验教训，以及跨问题反复出现的高层行为模式，所有这些都从模型自身的轨迹中在线提取。记忆条件自学者利用积累的经验来监督学生自身的推广，使学生能够在其参数范围内逐步内化程序知识。核心设计原则是共进化：策略生成更新内存的部署，内存塑造更新策略的监督。从实证角度看，在Qwen3-8B和OLMo3-Instruct-7B中，PMD在SCIKNOWEVAL上比SDPO提升了3.8%-5.5%，在LIVECODEBENCH上提升了7.9-13.6%。共进推动了这些收益：冻结内存或策略在SCIKNOWEVAL域中PMD落后超过10%。

Don't Let Gains FADE: Breaking Down Policy Gradient Weights in RL

别让收益消逝：拆解强化环境中的政策梯度权重

Authors: Juliette Decugis, Sean O'Brien, Francis Bach, Gabriel Synnaeve, Taco Cohen
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.01490
Pdf link: https://arxiv.org/pdf/2607.01490
Abstract Reinforcement learning post-training dramatically improves LLM reasoning, but suffers from training instability and diversity collapse. Advantage functions offer an appealing fix: they reshape the training objective, reweight which rollouts drive learning, and are trivial to implement. Yet a proliferation of methods makes it unclear which advantage to use and when. We cut through the confusion with a unifying framework that decomposes any advantage into its positive and negative gradient mass along two orthogonal axes. On the sign axis, imbalanced updates collapse either entropy or weight geometry. On the difficulty axis, hard-problem focus sharpens signal but costs sample size. Both trade-offs shift during training: exploration favors balance and hard focus; exploitation favors suppression and medium focus. This motivates FADE (Focal Advantage with Dynamic Entropy), a self-adapting advantage that reads training dynamics to schedule the gradient weight automatically. FADE reaches peak pass@1 20k steps earlier than the best static baseline at the 7B scale and 2k steps earlier at the 32B scale, while achieving the best accuracy-diversity trade-off across all pass@k on LiveCodeBench and AIME.
中文摘要 训练后强化学习显著提升了LLM推理能力，但也存在训练不稳定性和多样性崩溃的问题。优势函数提供了一个有吸引力的解决方案：它们重塑培训目标，重新调整推动学习的推广权重，而且实施起来非常简单。然而，方法的激增使得何时以及如何使用哪种优势变得不明确。我们用一个统一框架打破混淆，将任何优势分解为沿两条正交轴的正负梯度质量。在符号轴上，不平衡更新会导致熵或权重几何结构的坍缩。在难度轴上，难题聚焦能增强信号，但代价是样本量。这两种权衡在训练中都会变化：探索更倾向于平衡和专注;剥削偏向压制和中等专注。这促使了FADE（动态熵焦点优势），这是一种自我适应优势，能够读取训练动态自动调度梯度权重。FADE 在 7B 尺度pass@1最佳静态基线比 20k 步达到峰值，32B 尺度时比 2000 步早，同时在 LiveCodeBench 和 AIME 上实现了所有pass@k的最佳准确性与多样性权衡。
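
To make the idea of positive and negative gradient mass concrete, the sketch below splits a simple mean-centred advantage by sign for one group of rollouts. It is an illustration only and does not reproduce FADE or the paper's exact decomposition: the mean-centred baseline, the use of pass rate as a difficulty proxy, and both function names are assumptions.

```python
def signed_gradient_mass(rewards):
    """Split a mean-centred advantage into positive and negative mass.

    rewards: per-rollout scalar rewards for one prompt (one group).
    Returns (advantages, positive_mass, negative_mass), where each mass
    is the summed magnitude of the advantages with that sign.
    """
    mean = sum(rewards) / len(rewards)
    advantages = [r - mean for r in rewards]
    positive = sum(a for a in advantages if a > 0)
    negative = sum(-a for a in advantages if a < 0)
    return advantages, positive, negative


def pass_rate(rewards):
    """Share of rollouts with positive reward; a rough difficulty proxy."""
    return sum(1 for r in rewards if r > 0) / len(rewards)


# A hard prompt: one success out of four rollouts.
adv, pos, neg = signed_gradient_mass([1, 0, 0, 0])
```

With a plain mean-centred baseline the two masses are always equal, so any reweighting scheme that favours one sign or one difficulty band is a deliberate departure from this balanced starting point.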

Wind-Aware Reinforcement Learning Control of a Small Quadrotor Using Learned Onboard Wind Estimation in Simulated Atmospheric Turbulence

利用机载风速估算学习小型四旋翼机的风知强化学习控制，模拟大气湍流

Authors: Abdullah Al Tasim, Wei Sun
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2607.01528
Pdf link: https://arxiv.org/pdf/2607.01528
Abstract Small multirotor aircraft are increasingly tasked with operations in the atmospheric boundary layer, where turbulent winds comparable to the vehicle's airspeed degrade trajectory tracking and can defeat conventional feedback control. This work illustrates a two-stage learning pipeline that first estimates the local wind from onboard kinematics and dynamics and then exploits that estimate inside a reinforcement learning (RL) flight controller. The wind estimator, an attention-augmented gated recurrent network trained on thousands of simulated flights through von Karman turbulence with power-law shear and veer, recovers the horizontal wind vector with a per-flight root-mean-square error of 0.40 m/s and a direction error of 3.2 degrees on unseen wind regimes, an accuracy near the floor imposed by unresolved turbulence, and generalizes to vertical ascent profiles with a skill score of 0.861 over a constant-wind reference. A proximal policy optimization controller receiving the frozen estimator's output reduces horizontal trajectory tracking error by 48% relative to a wind-blind proportional-derivative baseline across mean winds of 4 m/s to 12 m/s, winning on 100% of evaluation episodes. A three-way ablation decomposes this improvement into a kinematic component, available without wind information, and a wind-perception component; the perception share rises with wind speed, from small in light winds toward roughly half the total benefit in strong winds, consistent with the quadratic scaling of aerodynamic drag. The controller degrades gracefully on out-of-distribution winds of 13 m/s to 15 m/s, where the baseline fails catastrophically.
中文摘要 小型多旋翼飞机越来越多地承担在大气边界层内的任务，在那里，相当于飞行器空速的湍流会降低轨迹追踪能力，甚至可能破坏传统的反馈控制。这项工作展示了一个两阶段学习流程，首先通过机载运动学和动力学估算本地风速，然后在强化学习（RL）飞行控制器中利用该估计值。风速估计器是一个注意力增强的门控循环网络，通过数千次模拟冯卡门湍流飞行训练，采用幂律剪切和偏向，能够恢复水平风向矢量，在未见风向区间，每飞行均方根误差为0.40米/秒，方向误差为3.2度，这种精度在地面附近因未解决湍流而产生。并推广到垂直上升剖面，技能评分为0.861，且在恒风参考条件下。接收冻结估计器输出的近端策略优化控制器，相较于4米/秒至12米/秒平均风速下的风盲比例导数基线，水平轨迹跟踪误差降低48%，在100%的评估中获胜。三向消融将这一改进分解为一个运动学成分（不含风信息）和一个风感知成分;这种感知份额随着风速提升而上升，从轻风中较小的优势到强风时约为总益的一半，这与空气阻力的二次方尺度相符。控制器在13 m/s至15 m/s的偏离分布风速时会优雅地衰减，此时基线会灾难性失效。

Safe and Adaptive Cloud Healing: Verifying LLM-Generated Recovery Plans with a Neural-Symbolic World Model

安全且自适应的云修复：用神经符号世界模型验证LLM生成的恢复计划

Authors: Junyan Tan, Haoran Lin, Siyuan Guo, Yichen Fang, Xinyue Luo, Tianyu Shen, Zeyu Qiao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2607.01595
Pdf link: https://arxiv.org/pdf/2607.01595
Abstract As the scale and complexity of cloud-based AI systems continue to escalate, ensuring service reliability through rapid fault detection and adaptive recovery has become a critical challenge. While existing approaches integrate Large Language Models (LLMs) for semantic understanding and Deep Reinforcement Learning (DRL) for policy optimization, they often rely on sequential, loosely coupled architectures that underutilize the generative and reasoning capabilities of LLMs. In this paper, we propose a paradigm shift with PASE, a Planning-Aware Semantic self-healing engine, a novel fault self-healing framework that reconceptualizes recovery as a neuro-symbolic program synthesis task. PASE employs an LLM as a core Plan Synthesis Engine to generate structured recovery plans from a library of semantic primitives. A Neural-Symbolic World Model verifies plan feasibility through simulation, while a Meta-Prompt Optimizer, trained via DRL, learns to generate optimal prompts that guide the LLM's planning process. This tight reason-plan-verify-adapt loop enables dynamic, context-aware recovery strategy generation beyond predefined action spaces. Experiments on a real-world cloud fault injection dataset demonstrate that PASE significantly outperforms state-of-the-art methods, reducing average system recovery time by over 40% and improving fault detection accuracy in unknown fault scenarios. Our framework advances autonomous system management by unifying LLM-based reasoning with model-assisted verification and meta-learned guidance.
中文摘要 随着基于云的人工智能系统的规模和复杂度不断提升，通过快速故障检测和自适应恢复确保服务可靠性已成为一项关键挑战。现有方法整合了大型语言模型（LLMs）进行语义理解，深度强化学习（DRL）进行策略优化，但它们通常依赖顺序、松耦合的架构，未能充分发挥LLM的生成和推理能力。本文提出范式转变，采用PASE——一种规划感知语义自愈引擎，这是一种新型的故障自愈框架，将恢复重新概念化为神经符号程序综合任务。PASE 使用大型语言模型作为核心计划合成引擎，从语义原语库生成结构化恢复计划。神经符号世界模型通过模拟验证计划可行性，而通过日程学习学习的元提示优化器则学习生成指导LLM规划过程的最优提示。这种紧密的理由-计划-验证-适应循环使得动态、情境感知的恢复策略能够超越预设行动空间。在真实世界云故障注入数据集上的实验表明，PASE显著优于最先进方法，平均系统恢复时间缩短了40%以上，并在未知故障场景下提高了故障检测准确率。我们的框架通过统一基于LLM的推理、模型辅助验证和元学习指导，推动自主系统管理的发展。

Scaling with Confidence: Calibrating Confidence of LLMs for Adaptive Test Time Scaling

信心扩展：校准LLMs的自适应测试时间扩展信心

Authors: Xuqing Yang, Yi Yuan, Shanzhe Lei, Xuhong Wang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.01612
Pdf link: https://arxiv.org/pdf/2607.01612
Abstract Training large language models (LLMs) with reinforcement learning (RL) has significantly advanced their performance on reasoning and question-answering tasks. However, prevailing RL reward designs typically prioritize response correctness, neglecting to incentivize models to express their confidence accurately. This leads to a critical problem: performance gains are often accompanied by poor calibration between confidence and accuracy, misleading models to overconfidently hallucinate when uncertain. To address this limitation, we propose Correctness and Confidence Calibration Reinforcement Learning (C3RL), a novel RL algorithm integrating correctness, calibration and dataset-informed reference accuracy rewards together. Comprehensive evaluation across 8 text and multimodal datasets demonstrates that C3RL enhances calibration without sacrificing accuracy, outperforming the current state-of-the-art method in both performance and calibration metrics. Utilizing the well-calibrated verbalized confidence from C3RL, we further introduce Confidence-based Adaptive Test Time Scaling (CAS), an adjustable inference-time strategy that allocates computational resources based on response confidence. Experiments show that CAS surpasses majority voting on both in-domain and out-of-domain datasets while reducing the inference budget by up to 12.33 times. We believe the synergy of C3RL and CAS paves the way for deploying more reliable and resource-efficient LLMs. The code, data and models will be released.
中文摘要 用强化学习（RL）训练大型语言模型（LLMs）显著提升了它们在推理和问答任务中的表现。然而，主流的强化学习奖励设计通常优先考虑回答的正确性，忽视了激励模型准确表达置信度。这导致了一个关键问题：性能提升常伴随着置信度与准确性之间的校准不佳，使模型在不确定时产生过度自信的幻觉。为解决这一限制，我们提出了正确性与置信度校准强化学习（Correctness and Confidence Calibration Reinforcement Learning，C3RL），这是一种新颖的强化学习算法，将正确性、校准和基于数据集的参考准确率奖励结合在一起。对8个文本和多模态数据集的全面评估表明，C3RL在不牺牲准确性的情况下提升了校准，在性能和校准指标上均优于当前最先进方法。利用C3RL中校准良好的口头置信度，我们进一步引入了基于置信度的自适应测试时扩展（Confidence-based Adaptive Test Time Scaling，CAS），这是一种基于回答置信度分配计算资源的可调推理时策略。实验显示，CAS在域内和域外数据集上均超过多数投票，同时将推理预算减少了多达12.33倍。我们相信C3RL与CAS的协同作用为部署更可靠、更高效的大型语言模型铺平了道路。代码、数据和模型将会被发布。
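
Allocating inference compute by response confidence, as CAS is described above, can be sketched as an early-stopping sampling loop. This is a hedged illustration and not the released CAS procedure: the stopping rule, the confidence-weighted vote, the threshold, and the `sample_fn` interface are all assumptions.

```python
from collections import defaultdict


def adaptive_answer(sample_fn, threshold=0.9, max_samples=8):
    """Sample until one response is confident enough, else vote.

    sample_fn: hypothetical callable returning (answer, confidence),
    with confidence in [0, 1] as verbalized by the model.
    Returns (chosen_answer, number_of_samples_used).
    """
    votes = defaultdict(float)
    used = 0
    for _ in range(max_samples):
        answer, confidence = sample_fn()
        used += 1
        votes[answer] += confidence
        if confidence >= threshold:
            return answer, used  # confident enough: stop spending compute
    # No sample cleared the threshold: fall back to a weighted vote.
    return max(votes, key=votes.get), used


def scripted(samples):
    """Turn a fixed list of (answer, confidence) pairs into a sample_fn."""
    it = iter(samples)
    return lambda: next(it)


easy = adaptive_answer(scripted([("A", 0.4), ("B", 0.95)]))
hard = adaptive_answer(scripted([("A", 0.4), ("B", 0.3), ("A", 0.5)]),
                       max_samples=3)
```

The savings relative to fixed-budget majority voting come from the early return, which is only safe when the verbalized confidence is well calibrated.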

One Demonstration Is Enough for Real-World Robotic Reinforcement Learning

一次演示就足以实现现实世界的机器人强化学习

Authors: Yuwan Liu, Hongze Yu, Song Liu, Yuhan Wang, Junge Zhang, Yaodong Yang, Yuanpei Chen, Ceyao Zhang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.01651
Pdf link: https://arxiv.org/pdf/2607.01651
Abstract Learning effective robot control policies on physical hardware is challenging due to costly data collection and the difficulty of reward specification. Prior work has incorporated demonstrations into reinforcement learning (RL), yet existing approaches either require large numbers of demonstrations or depend on continuous human intervention during training. To address these limitations, we present AutoSERL, a framework that leverages a single demonstration to fully automate the intervention process in real-world robot RL. The framework includes three complementary mechanisms to accomplish certain tasks: a sliding window intervention mechanism that continuously guides exploration to prevent local optima and unsafe deviations, a safety recovery mechanism that detects and corrects failure states via predefined trajectory recovery points, and an intervention termination criterion that automatically disables guidance once the policy can independently complete the task, preserving its exploration advantage. We evaluate AutoSERL on six contact-intensive manipulation tasks across two robot platforms, spanning insertion, hanging, and hinge-based tasks. AutoSERL consistently outperforms SERL initialized with 20 demonstrations, behavior cloning, and MILES -- a dedicated one-shot imitation learning baseline -- across all tasks while matching HIL-SERL, achieves 100% success rate on insertion tasks, and demonstrates improved robustness to positional variations, all from a single demonstration. Code and videos are available on our project website: this https URL.
中文摘要 由于数据收集成本高昂且奖励难以指定，学习在物理硬件上有效的机器人控制策略具有挑战性。以往的研究已将演示纳入强化学习（RL），但现有方法要么需要大量演示，要么依赖训练期间持续的人类干预。为解决这些局限性，我们提出了AutoSERL，一个利用单一演示实现现实机器人强化学习干预过程完全自动化的框架。该框架包含三种互补机制以完成特定任务：滑动窗口干预机制，持续引导探索以防止局部最优和不安全偏差;安全恢复机制通过预设轨迹恢复点检测并纠正失效状态;以及干预终止标准，一旦策略独立完成任务，自动禁用指导，保持探索优势。我们在两个机器人平台上评估了六个接触密集型操作任务，涵盖插入、悬挂和铰链任务。AutoSERL在所有任务中始终优于初始化的SERL，完成20个演示、行为克隆和MILES——一个专门的一次性模拟学习基线——同时匹配HIL-SERL，插入任务成功率达到100%，并展示了对位置变化的更强鲁棒性，所有演示均来自一次演示。代码和视频可在我们的项目网站上获取：https URL。

CoRe: Combined Rewards with Vision-Language Model Feedback for Preference-Aligned Reinforcement Learning

CoRe：结合奖励与视觉语言模型反馈，实现偏好对齐的强化学习

Authors: Hexian Ni, Tao Lu, Yinghao Cai
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.01721
Pdf link: https://arxiv.org/pdf/2607.01721
Abstract Reward design remains a central challenge in reinforcement learning (RL). Hand-crafted rewards are often difficult to specify and may lead to suboptimal policies, while learned rewards from preferences can suffer from inefficiency and unstable training. Inspired by the dual nature of human learning explored in cognitive science, we decompose rewards into two complementary components: Formal Rewards (FR), explicitly designed based on task knowledge, and Residual Rewards (RR), learned from observations to capture implicit and nuanced preferences. Based on this decomposition, we propose CoRe, a hybrid framework that integrates FR and RR with vision-language model (VLM) feedback to achieve preference-aligned policies without human involvement. Our contributions are twofold: (1) We propose a Formal Reward Module (FRM) that leverages VLMs to iteratively design and optimize FR based on task knowledge and preference feedback, enabling the continual improvement of the policy during training; (2) We introduce a Residual Reward Module (RRM) that learns RR from video-level preferences by employing VLMs to generate preference labels and capturing nuanced rewards that complement FR, ensuring alignment with human intent. Through the synergy of FRM and RRM, CoRe enables the automatic construction of reliable rewards that are efficient and preference-aligned. Extensive experiments demonstrate that CoRe outperforms existing approaches in terms of policy learning effectiveness and efficiency on ten robotic manipulation tasks in simulation and five real-world tasks. Videos can be found on our project website: this https URL
中文摘要 奖励设计仍然是强化学习（RL）中的核心挑战。手工制作的奖励通常难以具体说明，可能导致政策不优;而通过偏好学习的奖励则可能存在低效和不稳定的训练问题。受认知科学中人类学习的双重性探索启发，我们将奖励分解为两个互补组成部分：基于任务知识明确设计的形式奖励（FR）和从观察中学习以捕捉隐性和细致偏好的残余奖励（RR）。基于这一分解，我们提出了CoRe，一种混合框架，将视觉语言模型（VLMs）反馈整合，实现偏好对齐的政策，无需人工干预。我们的贡献有两个方面：（1）我们提出一个正式奖励模块（FRM），利用VLM基于任务知识和偏好反馈迭代设计和优化FR，从而在培训期间持续改进政策;（2）我们引入了残余奖励模块（RRM），通过使用VLM生成偏好标签并捕捉补充FR的细致奖励，从视频层级的偏好中学习RR，确保与人类意图的对齐。通过FRM和RRM的协同，CoRe能够自动构建高效且符合偏好的可靠奖励。大量实验表明，CoRe在十项机器人操作任务和五项真实世界任务中，在策略学习的有效性和效率方面优于现有方法。视频可在我们的项目网站上观看：此 https URL

DRL-CLBA: A Clean Label Backdoor Attack for Speech Classification via DDPG Reinforcement Learning

DRL-CLBA：通过DDPG强化学习进行语音分类的清洁标签后门攻击

Authors: Yueming Huang, Wenhan Yao, Fen Xiao, Xiarun Chen, Weiping Wen
Subjects: Artificial Intelligence (cs.AI); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2607.01729
Pdf link: https://arxiv.org/pdf/2607.01729
Abstract Deep learning models for speech classification are vulnerable to backdoor attacks, where malicious triggers cause misclassification at inference time. While sample-specific attacks can bypass many defenses, they often rely on poisoned-label attacks, making them detectable via manual data inspection. In this paper, we propose DRL-CLBA, a novel clean-label backdoor attack for speech classification that leverages Deep Deterministic Policy Gradient (DDPG) reinforcement learning. We also utilize deep audio steganography to embed sample-specific triggers into source audio, creating feature-space anchors. The proposed reinforcement learning framework effectively optimizes target samples toward trigger-bearing anchor points in the model's deep latent space, enabling label-migration-free poisoning of target samples. Experimental results across three datasets and four different DNNs demonstrate that DRL-CLBA achieves a high attack success rate, effectively bypassing some backdoor defenses. The attack demonstrates strong resistance against fine-tuning, pruning, and spectral signature defenses, exposing critical vulnerabilities in speech-controlled systems.
中文摘要 用于语音分类的深度学习模型易受后门攻击影响，恶意触发器在推理时导致错误分类。虽然样本特定攻击可以绕过许多防御，但它们通常依赖于有毒标签攻击，因此可以通过人工数据防御被检测到。本文提出了DRL-CLBA，一种新型的干净标签后门攻击，用于语音分类，利用深度确定性策略梯度（DDPG）强化学习。我们还利用深度音频隐写术将采样特定触发嵌入源音频，创建特征空间锚点。所提出的强化学习框架有效优化目标样本，朝向模型深潜空间中的触发锚点，实现无标签迁移的目标样本中毒。在三个数据集和四个不同DNNs上的实验结果表明，DRL-CLBA实现了较高的攻击成功率，有效绕过了一些后门防御。该攻击对微调、修剪和频谱特征防御表现出强烈抵抗力，暴露了语音控制系统中的关键漏洞。

Denser $\neq$ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

更密集的$\neq$ 更好：持续后培训中政策自我提炼的限制

Authors: Meng Wang, Haohan Zhao, Wenzhuo Liu, Lu Yang, Geng Liu, Haiyang Guo, Guo-Sen Xie, Gaofeng Meng, Hongbin Liu, Fei Zhu
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2607.01763
Pdf link: https://arxiv.org/pdf/2607.01763
Abstract Continual post-training enables foundation models to acquire new knowledge while preserving existing capabilities. Recent work suggests that on-policy learning can mitigate forgetting, with on-policy self-distillation emerging as a particularly attractive approach. In this work, we revisit this optimistic view through self-distillation policy optimization (SDPO). Our experiments show that SDPO can accelerate in-domain specialization when teacher signals are stable and well aligned, but it struggles to generalize to out-of-distribution scenarios. In continual post-training, SDPO exhibits stronger forgetting and can even collapse, whereas on-policy reinforcement learning methods such as GRPO adapt more conservatively and better preserve prior capabilities. Further analyses reveal that denser self-distillation induces larger drift in both parameter space and response space, and can amplify high-frequency formatting artifacts through a self-reinforcing teacher--student loop. These findings suggest that on-policy data alone is insufficient for continual learning. Dense self-distillation can accelerate specialization when teacher targets are stable and token-level supervision is reliable, but it should not be treated as a default stabilizer for continual post-training. Our code is available at this https URL.
中文摘要 持续的后期训练使基础模型能够在保持现有能力的同时获得新知识。最新研究表明，政策内学习可以减少遗忘，政策内自我提炼成为一种特别有吸引力的方法。在本研究中，我们通过自我蒸馏政策优化（SDPO）重新审视这一乐观观点。我们的实验表明，当教师信号稳定且对齐良好时，SDPO可以加速领域内专业化，但难以推广到分布外场景。在持续的后训练中，SDPO表现出更强的遗忘现象，甚至可能崩溃，而策略内强化学习方法如GRPO则更保守地适应，更好地保留了先前的能力。进一步分析显示，更密集的自蒸馏在参数空间和响应空间中引起更大的漂移，并通过自我强化的师生循环放大高频格式化伪影。这些发现表明，单靠政策数据不足以实现持续学习。当教师目标稳定且代币级监督可靠时，密集的自我提炼可以加速专业化，但不应被视为持续后期培训的默认稳定因素。我们的代码可在此 https URL 访问。

Lightweight Safe Reinforcement Learning for End-to-End UAV Navigation

轻量级安全强化学习，用于端到端无人机导航

Authors: Shenghui Zhang, YuXuan Gao, Songwei Zhao, Jifeng Hu, Zijing Zhang, Hechang Chen
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.01794
Pdf link: https://arxiv.org/pdf/2607.01794
Abstract With the rapid development of autonomous aerial systems, Unmanned Aerial Vehicles (UAVs) are increasingly deployed in applications such as inspection, environmental monitoring, and rescue, creating growing demand for reliable autonomous navigation. However, autonomous UAV navigation in dense environments remains challenging under sparse perception and dynamic constraints. Most reinforcement learning (RL) methods lack explicit safety mechanisms, leading to unsafe exploration, unstable training, and risky behaviors, especially during high-speed flight. Even in safe RL approaches, safety is often enforced by projecting policy outputs onto a safe action set, which may introduce instability. Meanwhile, many learning-based methods rely on dense inputs or large networks, increasing computational burden and limiting lightweight onboard deployment. Facing the above challenges, we propose a safety-constrained perception-control integrated framework for UAV navigation. A lightweight network encodes sparse observations into collision-risk-aware features using asymmetric and depthwise separable convolutions. We formulate the task as a constrained Markov decision process within a hierarchical control architecture and solve it using a Lagrangian-based safe PPO algorithm. Curriculum learning further improves training stability. Experiments with varying obstacle densities and flight speeds demonstrate higher success rates, improved safety, and better efficiency than existing reinforcement learning baselines.
中文摘要 随着自主空中系统的快速发展，无人机（UAV）越来越多地应用于检查、环境监测和救援等领域，带来了对可靠自主导航需求的增长。然而，在密集环境中自主导航在感知稀疏和动态限制下仍具挑战性。大多数强化学习（RL）方法缺乏明确的安全机制，导致探索不安全、训练不稳定和高风险行为，尤其是在高速飞行时。即使在安全的强化学习方法中，安全性通常也通过将策略输出投射到安全动作集来实现，这可能会带来不稳定性。与此同时，许多基于学习的方法依赖于密集输入或大型网络，增加了计算负担，限制了轻量化的板载部署。面对上述挑战，我们提出了一个安全约束的感知-控制综合无人机导航框架。轻量级网络利用非对称和深度可分离卷积将稀疏观测值编码为碰撞风险感知特征。我们将任务表述为一个受限马尔可夫决策过程，采用层级控制架构，并使用基于拉格朗日的安全PPO算法求解。课程学习进一步提升了培训的稳定性。在不同障碍密度和飞行速度下的实验显示，成功率更高，安全性更高，效率也优于现有强化学习基线。
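
The Lagrangian treatment of a constrained Markov decision process mentioned in this abstract usually pairs a policy update with a dual-ascent update on a cost multiplier. The sketch below shows that generic textbook mechanism under stated assumptions; it is not the authors' algorithm, and the function names, learning rate, and cost limit are hypothetical.

```python
def update_multiplier(lam, episode_cost, cost_limit, lr=0.05):
    """Dual ascent on the Lagrange multiplier.

    The multiplier grows when the measured cost exceeds the limit,
    shrinks otherwise, and is clipped so it never becomes negative.
    """
    return max(0.0, lam + lr * (episode_cost - cost_limit))


def penalized_advantage(reward_adv, cost_adv, lam):
    """Advantage fed to the PPO surrogate: reward minus weighted cost."""
    return reward_adv - lam * cost_adv


# Hypothetical training signal: the constraint is violated, then satisfied.
lam = 0.0
lam = update_multiplier(lam, episode_cost=5.0, cost_limit=2.0)   # increases
lam_after_violation = lam
lam = update_multiplier(lam, episode_cost=0.0, cost_limit=10.0)  # clipped at 0
lam_after_safe = lam
```

Because the penalty enters through the advantage rather than through a projection of the action, the policy output itself is never modified, which is the contrast the abstract draws with projection-based safe RL.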

Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling

多种声音，一个奖励：面向LLM评判与奖励建模的多角色评分标准生成

Authors: Dazhi Fu, Jiuding Yang, Yiwen Guo, Jicong Fan
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.01830
Pdf link: https://arxiv.org/pdf/2607.01830
Abstract Reliable reward and preference signals are critical for evaluating and optimizing large language models on open-ended tasks. Rubric-based judges offer a transparent way to decompose such judgments into explicit evaluation criteria, but existing annotation-free rubric generators typically rely on a single generic evaluator. As a result, they may overlook important dimensions of human preference, a failure mode we term dimensional blind spots. To address this limitation, we propose Multi-Role Rubric Generation (MRRG), a training-free and reference-free framework that elicits evaluation criteria from multiple complementary roles and consolidates them into an auditable rubric-based scorer. This scorer can be used both to validate pairwise preferences and to provide rewards for GRPO-style Reinforcement Learning with Verifiable Rewards (RLVR). Experiments on preference validation benchmarks show that MRRG consistently outperforms single-role rubric generation baselines across multiple backbone models. Further RLVR experiments demonstrate that MRRG yields a stronger reward signal for improving open-ended generation.
中文摘要 可靠的奖励和偏好信号对于评估和优化开放式任务中的大型语言模型至关重要。基于评分标准的评审提供了一种透明的方式将此类判断分解为明确的评估标准，但现有无注释的评分标准生成器通常依赖单一通用评估器。因此，他们可能会忽视人类偏好中重要的维度，这种失败模式我们称之为维度盲点。为解决这一局限，我们提出了多角色评分标准生成（MRRG），这是一种无需培训、无需参考的框架，能够从多个互补角色中提取评估标准，并将其整合为可审计的评分标准。该评分器既可用于验证成对偏好，也可用于为带有可验证奖励的GRPO式强化学习（RLVR）提供奖励。偏好验证基准测试的实验显示，MRRG在多个骨干模型中始终优于单角色评分标准生成基线。进一步的RLVR实验表明，MRRG在改善开放式生成时能产生更强的奖励信号。
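
To make the link between a rubric-based scorer and GRPO-style training concrete, the sketch below scores a response as the weighted fraction of rubric criteria it satisfies, then converts a group of such scores into group-relative advantages. It is an illustration under assumptions: `rubric_score`, the weights, and the boolean criterion checks are hypothetical stand-ins for the LLM judge, and the mean/standard-deviation normalization is the commonly used GRPO form rather than anything stated in the abstract.

```python
import statistics


def rubric_score(satisfied, weights):
    """Weighted fraction of rubric criteria that a response satisfies."""
    total = sum(weights)
    return sum(w for ok, w in zip(satisfied, weights) if ok) / total


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the advantages depend only on how responses rank within a group, the scorer needs to be consistent across responses to one prompt rather than calibrated in absolute terms.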

Decomposer: Learning to Decompile Symbolic Music to Programs

Decomposer：学习将符号音乐反编译为程序

Authors: Yewon Kim, Apurva Gandhi, David Chung, Graham Neubig, Chris Donahue
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
Arxiv link: https://arxiv.org/abs/2607.01849
Pdf link: https://arxiv.org/pdf/2607.01849
Abstract Musical performance involves executing a set of high-level musical instructions, yet recovering those instructions from the performance is a challenging inverse problem. We present Decomposer, a post-training framework for symbolic music decompilation: the task of recovering executable, editable music programs from symbolic music. We instantiate the task as MIDI-to-Strudel decompilation, where the model takes symbolic MIDI as input and produces a program in Strudel, a music programming language, that reconstructs the input when executed. The task poses two challenges: Strudel is a low-resource language with little naturally paired MIDI-code data, and optimizing faithful reconstruction of MIDI alone can collapse to unreadable note-by-note transliteration. We address these challenges in two stages. First, we construct Strudel-Synth, a synthetic corpus of paired Strudel programs and rendered MIDI, and use it for supervised fine-tuning. Second, we refine the model with reinforcement learning on unpaired MIDI, optimizing rewards for both MIDI reconstruction faithfulness and code readability. Our evaluation across synthetic and real-world MIDI benchmarks shows that Decomposer achieves substantially higher MIDI reconstruction faithfulness than closed-source LLMs while producing more readable and diverse code than the heuristic converter.
中文摘要 音乐表演涉及执行一组高级音乐指令，但从演奏中恢复这些指令是一个具有挑战性的逆问题。我们介绍Decomposer，一个用于符号音乐反编译的后期培训框架：从符号音乐中恢复可执行、可编辑的音乐程序。我们将该任务实例化为MIDI到Strudel的反编译，模型接收符号MIDI作为输入，并生成一个用Strudel（一种音乐编程语言）执行时重建输入的程序。该任务面临两个挑战：Strudel 是一种资源有限的语言，几乎没有自然配对的 MIDI 代码数据;仅靠忠实重建 MIDI 可能会陷入无法辨认的逐音符音译。我们将这些挑战分两阶段进行应对。首先，我们构建了 Strudel-Synth，这是一个由成对的 Strudel 程序和渲染 MIDI 组成的合成语料库，并用于监督微调。其次，我们在未配对MIDI上通过强化学习优化模型，优化MIDI重建的忠实性和代码可读性。我们在合成和现实世界MIDI基准测试中的评估显示，Decomposer在MIDI重建忠实度上远高于闭源LLMs，同时生成的代码比启发式转换器更易读、更多样化。

Learning the Supports for Categorical Critic in Reinforcement Learning

在强化学习中学习类别型评论家的支撑集

Authors: Jen-Yen Chang, Takayuki Osa, Tatsuya Harada
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.01880
Pdf link: https://arxiv.org/pdf/2607.01880
Abstract Value functions are an essential component in actor-critic based deep reinforcement learning (RL). Conventionally, these functions are trained as a regression task by minimising the mean squared error (MSE) relative to bootstrapped target values. Meanwhile, in distributional RL, a distribution of returns is modelled based on the distributional Bellman operator. This work investigates the Gaussian Histogram Loss (HL-Gauss), a recent approach that reframes value estimation as classification by encoding each scalar Bellman target as a Gaussian-smoothed categorical target. Despite its potential, applying histogram-based losses to RL presents inherent challenges, most notably the requirement to pre-define a fixed support interval, which is often complicated by the non-stationary and stochastic nature of target values typically found in RL tasks. In this work, we propose an approach that dynamically learns the lower and upper bounds of the support instead of assigning them beforehand. We derive an objective that jointly learns these bounds whilst learning the categorical representation of the scalar values, and we show that this objective forms an upper bound on the mean-squared Bellman error. Our theoretical analysis further shows that this bound is tighter than that of non-learned supports of HL-Gauss. Empirically, the proposed objective enables stable adaptation of the support interval and matches HL-Gauss-based actor-critic algorithms on most continuous-control tasks whilst improving on a subset, without requiring a pre-specified support interval.
中文摘要 价值函数是基于演员-批评者的深度强化学习（RL）中不可或缺的组成部分。传统上，这些函数通过最小化均方误差（MSE）来训练为回归任务，相对于自助目标值。而在分布式强化学习中，收益分布基于分布贝尔曼算子建模。本研究研究高斯直方图损失（HL-Gauss），这是一种新方法，通过将每个标量贝尔曼目标编码为高斯平滑的类别目标，将值估计重新定义为分类。尽管具有潜力，基于直方图的损耗应用于强化学习仍存在固有挑战，最显著的是预先定义固定支持区间的要求，这通常因强化学习任务中常见的目标值非平稳和随机性而复杂化。在本研究中，我们提出了一种动态学习支撑的上下界的方法，而不是事先分配它们。我们推导出一个目标，它在学习标量值的范畴表示的同时，共同学习这些界限，并证明该目标构成了均方贝尔曼误差的上界。我们的理论分析进一步表明，这一界限比HL-高斯的非学习支持更紧密。从经验角度看，该目标能够稳定适应支持区间，并在大多数连续控制任务上匹配基于HL-Gauss的actor-critic算法，同时在部分子集上进行改进，且无需预先设定的支持区间。
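
The HL-Gauss encoding described in this abstract can be sketched as follows: a scalar target is turned into a categorical distribution by integrating a Gaussian centred on the target over each bin of a fixed support. This is a minimal sketch of the fixed-support baseline only, not the learned-bounds objective the paper proposes; the function names and parameters (`v_min`, `v_max`, `num_bins`, `sigma`) are illustrative.

```python
import math


def hl_gauss_target(y, v_min, v_max, num_bins, sigma):
    """Encode scalar y as Gaussian-smoothed bin probabilities on [v_min, v_max]."""
    width = (v_max - v_min) / num_bins
    edges = [v_min + i * width for i in range(num_bins + 1)]
    # Gaussian CDF centred at y, evaluated at every bin edge.
    cdf = [0.5 * (1.0 + math.erf((e - y) / (sigma * math.sqrt(2.0)))) for e in edges]
    # Renormalize by the mass that falls inside the support.
    mass = cdf[-1] - cdf[0]
    return [(cdf[i + 1] - cdf[i]) / mass for i in range(num_bins)]


def expected_value(probs, v_min, v_max):
    """Decode a categorical distribution back to a scalar via bin centres."""
    width = (v_max - v_min) / len(probs)
    centers = [v_min + (i + 0.5) * width for i in range(len(probs))]
    return sum(p * c for p, c in zip(probs, centers))
```

If `y` lies far outside `[v_min, v_max]`, the in-support mass approaches zero and the encoding degenerates, which illustrates why a pre-specified support interval is awkward when target values drift during training.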

Rank-Then-Act: Reward-Free Control from Frame-Order Progress

先排序后行动：基于帧序进度的无奖励控制

Authors: Yuriy Maksyuta, George Bredis, Ruslan Rakhimov, Daniil Gavrilov
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.01897
Pdf link: https://arxiv.org/pdf/2607.01897
Abstract We introduce Rank-Then-Act (RTA), a framework for learning control policies from expert video demonstrations without environment rewards. RTA trains a Vision-Language Model (VLM) offline as a progress-based ordinal scorer, using a Group Relative Policy Optimization (GRPO) objective over shuffled frame sequences, which forces the model to recover temporal ordering from visual semantics rather than trivial time cues. Importantly, instead of using the scorer directly as a scalar reward model, we propose a correlation-based reward function for reinforcement learning: at each interaction window, we compute the Spearman rank correlation between predicted progress rankings and true temporal indices, yielding a bounded, scale-invariant learning signal. This design decouples reward learning from absolute calibration and enables stable transfer across tasks and environments. We evaluate RTA on discrete control benchmarks (PyBoy: Catrap, Kirby) and continuous control tasks (PointMaze, MetaWorld). RTA consistently matches or outperforms prior video-based reward learning methods and rank-based baselines, while demonstrating strong cross-task reuse of a single pretrained progress scorer. Our results suggest that correlation-structured supervision over video-derived ordinal signals is sufficient for policy learning, offering a scalable alternative to explicit reward design.
中文摘要 我们引入了排序后行动（RTA），这是一个无需环境奖励的专家视频演示学习控制政策的框架。RTA以基于进度的序数评分器形式离线训练视觉语言模型（VLM），使用群相对策略优化（GRPO）目标，针对洗牌的帧序列进行分析，迫使模型从视觉语义而非琐碎的时间线索恢复时间顺序。重要的是，我们提出基于相关性的奖励函数用于强化学习：在每个交互窗口，我们计算预测进展排名与真实时间指标之间的Spearman秩相关性，得到有界且尺度不变的学习信号。该设计将奖励学习与绝对校准分离，实现任务和环境间的稳定转移。我们在离散对照基准测试（PyBoy：Catrap、Kirby）和连续控制任务（PointMaze、MetaWorld）上评估RTA。RTA始终能与以往基于视频的奖励学习方法和基于排名的基线相匹配甚至优于，同时展现出单一预训练进度评分器的强力跨任务重用能力。我们的结果表明，基于视频衍生序数信号的相关结构监督足以进行政策学习，为显式奖励设计提供了可扩展的替代方案。
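
The correlation-based reward described in this abstract can be sketched as a Spearman rank correlation between the scorer's predicted progress values for a window of frames and the frames' true temporal indices. This is an illustrative sketch, not the authors' code: `spearman_reward` and its handling of ties and constant windows (returning 0.0) are assumptions.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman_reward(progress_scores):
    """Correlation between predicted progress ranks and true time order."""
    n = len(progress_scores)
    rx = ranks(progress_scores)
    ry = [float(t + 1) for t in range(n)]  # true temporal indices
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # assumed convention for uninformative windows
    return cov / math.sqrt(vx * vy)
```

The result always lies in [-1, 1] and is unchanged by any monotone rescaling of the scores, which matches the bounded, scale-invariant signal the abstract describes.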

SPLC: Social Preference Learning for Crowd Robot Navigation

SPLC：人群机器人导航中的社会偏好学习

Authors: Zixuan Chen, Hao Fu, Haiwen Hu, Shiquan Zheng (Wuhan University of Science and Technology, Wuhan, China)
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.01925
Pdf link: https://arxiv.org/pdf/2607.01925
Abstract Offline reinforcement learning (RL) holds significant potential for crowd robot navigation in human-robot coexistence applications. However, the inherent complexity of pedestrian motion renders the design of effective reward functions for promoting socially compliant robot behaviors a persistent challenge. This paper proposes a Social Preference Learning for Crowd Robot Navigation (SPLC) algorithm to eliminate the need for detailed reward design. Its core innovation lies in the introduction of a social preference feedback mechanism to automatically generate preference data through principled preference evaluation criteria. By explicitly accounting for the intricacies of pedestrian dynamics, the pipeline mitigates the reward bias and facilitates the systematic quantification of broad social norms, thereby fostering socially compliant behaviors. Extensive experiments integrating SPLC with offline RL methods demonstrate consistent improvements over state-of-the-art baselines across standard performance metrics. Furthermore, real-world experiments on the TurtleBot4 further validate the effectiveness of SPLC in practical human-robot coexistence settings. Our code and video demos are available at this https URL.
中文摘要 离线强化学习（RL）在人机共存应用中的人群机器人导航具有巨大潜力。然而，行人动作的固有复杂性使得设计有效的奖励函数以促进社会顺从的机器人行为成为一个持续的挑战。本文提出了一种人群机器人导航的社会偏好学习（SPLC）算法，以消除对详细奖励设计的需求。其核心创新在于引入了社会偏好反馈机制，通过原则性偏好评估标准自动生成偏好数据。通过明确考虑行人动态的复杂性，管道缓解了奖励偏差，促进了广泛社会规范的系统量化，从而促进社会顺从行为。广泛实验将SPLC与离线强化学习方法整合，显示在标准性能指标上相较于最先进基线有持续的提升。此外，TurtleBot4上的真实实验进一步验证了SPLC在人机共存实际环境中的有效性。我们的代码和视频演示可在此 https URL 访问。

TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B

TUDUM：面向Qwen3.5-27B的土耳其语思考推理流水线

Authors: Baran Bingol, Bahaeddin Turkoglu
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.01927
Pdf link: https://arxiv.org/pdf/2607.01927
Abstract This paper presents TUDUM (Türkçe Düşünen Üretken Model), a project pipeline for adapting a Qwen-family 27B thinking model toward Turkish reasoning. The central problem is not only to answer Turkish prompts in Turkish, but to make the explicit reasoning trace itself Turkish. A thinking model may translate a Turkish prompt into an English-centered internal or visible scratchpad, solve the problem mostly in English, and only localize the final answer. TUDUM instead treats the generated ... block as a trainable behavior. The pipeline starts from the project base checkpoint unsloth/Qwen3.5-27B, applies supervised fine-tuning (SFT) on 15,991 Turkish reasoning examples using LoRA adapters, and then applies GRPO-family reinforcement learning on a proxy-filtered Turkish mathematics environment. The results are mixed. SFT made the model shorter and more consistently Turkish in its reasoning behavior, with large reductions in average response length and thinking exhaustion, but reduced benchmark accuracy. RL recovered some mathematical performance, especially AIME24 at the best early checkpoint, yet did not uniformly improve all benchmarks and did not exceed the base model on the reported Macro-6 average. The contribution is therefore best framed as a technically honest Turkish-thinking reasoning pipeline and evaluation, not as a claim of state-of-the-art Turkish reasoning. The released step-50 model is publicly available.
中文摘要 本文介绍了TUDUM（Türkçe Düşünen Üretken Model），这是一个将Qwen系列27B思考模型适配到土耳其语推理的项目流水线。核心问题不仅是用土耳其语回答土耳其语提示，还要让显式的推理轨迹本身也是土耳其语。思考模型可能会把土耳其语提示翻译成以英语为中心的内部或可见草稿，主要用英语解决问题，只对最终答案进行本地化。TUDUM则将生成的 ... 块视为一种可训练的行为。该流水线从项目基础检查点 unsloth/Qwen3.5-27B 出发，使用LoRA适配器在15,991个土耳其语推理示例上进行监督微调（SFT），然后在经代理过滤的土耳其语数学环境上应用GRPO系列强化学习。结果好坏参半。SFT使模型的输出更短、推理行为更一致地使用土耳其语，平均响应长度和思考耗尽的情况大幅减少，但降低了基准准确率。RL恢复了部分数学性能，尤其是在表现最好的早期检查点上的AIME24，但并未一致地提升所有基准，在所报告的Macro-6平均值上也未超过基础模型。因此，该贡献最恰当的定位是一条技术上诚实的土耳其语思考推理流水线及其评估，而不是宣称达到了最先进的土耳其语推理水平。已发布的step-50模型可公开获取。

Cross-Platform Control for Autonomous Surface Vehicles via Adaptive Reinforcement Learning

通过自适应强化学习实现自主水面航行器的跨平台控制

Authors: Ruiheng Jiang, Thomas Bi, Raffaello D'Andrea, Aswin Ramachandran
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.02037
Pdf link: https://arxiv.org/pdf/2607.02037
Abstract Autonomous surface vehicles vary widely in hydrodynamic and actuation characteristics, yet most controllers are designed for single-platform deployment. We present an adaptive reinforcement learning approach for trajectory tracking that enables zero-shot cross-platform deployment using a single policy. Since the deployment platform's dynamics are unknown to the policy, we address cross-platform generalization with the standard partial-observability approach of conditioning on interaction history, employing a teacher-student architecture in which a learned module infers a latent representation of the platform dynamics. The policy is trained in simulation under randomized vessel dynamics and is deployed zero-shot to two real-world platforms without any fine-tuning, despite relying on a simple analytical dynamics model rather than a high-fidelity hydrodynamic simulator. In real-world experiments on two different platforms, the adaptive policy outperforms non-adaptive learning-based baselines by up to 58% in position mean absolute error while approaching the tracking accuracy of a platform-specific tuned controller.
中文摘要 自主水面航行器在水动力和驱动特性上差异很大，但大多数控制器只针对单一平台设计。我们提出了一种用于轨迹跟踪的自适应强化学习方法，使单一策略能够零样本跨平台部署。由于策略并不知道部署平台的动力学，我们采用处理部分可观测性的标准做法，即以交互历史为条件来实现跨平台泛化，并采用教师-学生架构，由一个学习得到的模块推断平台动力学的潜在表示。该策略在随机化船舶动力学的仿真中训练，并在不做任何微调的情况下零样本部署到两个真实平台，尽管仿真依赖的只是简单的解析动力学模型，而非高保真水动力模拟器。在两个不同平台的真实实验中，自适应策略在位置平均绝对误差上优于非自适应的学习基线，幅度最高达58%，同时接近针对特定平台调优的控制器的跟踪精度。

Evidence-State Rewards for Long-Context Reasoning

长语境推理的证据状态奖励

Authors: Ya Gao, Pekka Marttinen
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.02073
Pdf link: https://arxiv.org/pdf/2607.02073
Abstract Long-context reasoning requires models to locate, revise, and synthesize evidence distributed across lengthy inputs. Existing long-context RL methods usually reward final answers or static evidence extraction, offering little feedback on how intermediate actions change the model's evidence state. We propose Maven, a reinforcement learning framework with an editable evidence memory. Maven defines an answer-conditioned evidence-state value and rewards action-level state transitions: add actions are credited by marginal gain and hindsight contribution, link actions by evidence synergy, and drop actions by improved answer support after removing misleading evidence. These rewards are assigned to the corresponding action spans in GRPO. Across Llama and Qwen models on LongBench v2, LongReason, and RULER, Maven outperforms outcome-only RL and evidence-identification baselines, producing more sufficient evidence sets and lower distractor retention. Our results show that long-context RL benefits from optimizing stateful evidence navigation rather than one-shot evidence extraction.
中文摘要 长上下文推理需要模型定位、修正并综合分布在长输入中的证据。现有的长上下文强化学习方法通常奖励最终答案或静态证据提取，几乎不反馈中间动作如何改变模型证据状态。我们提出了Maven，一种具有可编辑证据记忆的强化学习框架。Maven定义了答案条件证据-状态值，并奖励行动级状态转换：添加行动通过边际收益和事后诸葛贡献获得认可，通过证据协同获得关联动作，删除行为则通过消除误导性证据后改进的答案支持来实现。这些奖励分配给GRPO中对应的动作区间。在LongBench v2、LongReason和RULER的Llama和Qwen模型中，Maven优于仅结果的强化学习和证据识别基线，提供了更多充分证据集和更低的干扰者保留率。我们的结果表明，长上下文强化学习受益于优化有状态证据导航，而非一次性提取证据。

Enhancing Fitness Intelligence through Domain-Specific LLM Post-Training

通过领域特定LLM后训练提升健身智能

Authors: Xingtao Zhao, Tian Yang, Han Jiang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.02118
Pdf link: https://arxiv.org/pdf/2607.02118
Abstract Scientific Fitness Coaching (SFC) is typically delivered by human professionals, making it costly and inaccessible to many. While recent advances in Large Language Models (LLMs) show considerable promise for more inclusive fitness coaching, directly deploying prevailing general-purpose LLMs in SFC reveals critical limitations. These models often lack sufficient domain-specific knowledge integration, leading to weak performance on complex SFC scenarios. In this paper, we introduce FitOne, a series of fitness LLMs (with 8B and 32B parameters) designed to improve reliability and domain specialization for SFC applications. Built upon the Qwen3 foundation models, FitOne is developed through a three-stage post-training pipeline consisting of continual pre-training, supervised fine-tuning, and reinforcement learning, using large-scale, high-quality datasets derived from rigorous knowledge engineering. We conduct comprehensive evaluations of FitOne on professional fitness certification exams, including ACSM-EP and NSCA-CSCS, as well as general capabilities such as knowledge reasoning and instruction following. Experimental results show that, while retaining strong general capabilities, FitOne-8B/32B achieves average improvements of up to 10.09%/9.29% and 12.73%/7.01% on the ACSM-EP and NSCA-CSCS exams, respectively, compared with the Qwen3 base models. Furthermore, in-depth ablation studies confirm the necessity of each training stage, highlighting the pipeline's effectiveness in balancing domain expertise enhancement with general ability retention. We believe this research advances LLM systems toward more reliable fitness intelligence and will inspire future research on developing domain-specific LLMs.
中文摘要 科学健身教练（SFC）通常由专业人员提供，这使得它成本高昂且对许多人来说难以获得。尽管大型语言模型（LLMs）的最新进展显示出更具包容性健身教练的巨大潜力，但直接在SFC中部署通用LLM暴露出关键局限性。这些模型通常缺乏足够的领域特定知识整合，导致在复杂的SFC场景下表现较弱。本文介绍了FitOne，一系列适应度大型语言模型（参数为8B和32B），旨在提升SFC应用的可靠性和领域专一化。FitOne基于Qwen3基础模型，通过三阶段训练后流程开发，包括持续的预训练、监督式微调和强化学习，使用来自严谨知识工程的大规模高质量数据集。我们对FitOne进行专业体能认证考试的全面评估，包括ACSM-EP和NSCA-CSCS，以及知识推理和教学跟随等通用能力。实验结果显示，FitOne-8B/32B在保持强劲的通用能力的同时，ACSM-EP和NSCA-CSCS考试的平均提升分别高达10.09%/9.29%和12.73%/7.01%，相比Qwen3基础模型。此外，深入的消融研究证实了每个培训阶段的必要性，凸显了该管道在平衡领域技能提升与整体能力保持方面的有效性。我们相信这项研究推动了LLM系统朝向更可靠的适应度智能发展，并将激励未来开发领域特定LLM的研究。

ART for Diffusion Sampling: Continuous-Time Control and Actor-Critic Learning

扩散采样的ART：连续时间控制与演员-批评者学习

Authors: Yilie Huang, Wenpin Tang, Xun Yu Zhou
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2607.02137
Pdf link: https://arxiv.org/pdf/2607.02137
Abstract We study timestep allocation for score-based diffusion sampling, where a learned reverse-time dynamics is discretized on a finite grid. Uniform and hand-crafted schedules are standard choices, but they rely on fixed prescriptions and can therefore be suboptimal. To address this limitation, we propose Adaptive Reparameterized Time (ART), a continuous-time control formulation that learns a time change by treating the speed of the sampling clock as the control, so that a uniform grid on the learned clock induces adaptive timesteps in the original diffusion time. Based on a leading-order Euler error surrogate, ART provides a principled objective for allocating timesteps along the sampling trajectory. To solve this deterministic control problem, we introduce ART-RL, an auxiliary randomized formulation with Gaussian policies that turns schedule learning into a continuous-time reinforcement learning problem. We prove that the randomized ART-RL formulation is equivalent to ART at the optimizer level, in the sense that its optimal Gaussian policy recovers the optimal ART time-warping rate through its mean. We further establish policy evaluation and policy improvement characterizations and derive trajectory-based moment identities that yield implementable actor--critic updates for learning the schedule. Across experiments ranging from controlled low-dimensional settings to image generation, ART-RL can be plugged into existing diffusion samplers by changing only the timestep grid, consistently improving sample quality over strong baseline schedules at matched budgets while leaving the rest of the sampling pipeline unchanged. The learned schedules also exhibit broad generalization, transferring without retraining across sampling budgets, datasets, solvers, pipelines, and representation spaces.
中文摘要 我们研究基于分数的扩散采样的时间步分配，其中学习到的反时间动力学被离散化在有限网格上。统一且手工制作的排班是标准选择，但它们依赖固定处方，因此可能不尽如人意。为解决这一限制，我们提出了自适应重参数化时间（ART）方法，这是一种连续时间控制表述，通过将采样时钟的速度视为控制来学习时间变化，使得学习时钟上的均匀网格在原始扩散时间中诱导自适应时间步长。基于先导阶欧拉误差替代，ART为沿采样轨迹分配时间步长提供了原则性目标。为解决这一确定性控制问题，我们引入了ART-RL，一种辅助随机表述，采用高斯策略，将计划学习转变为连续时间强化学习问题。我们证明了随机ART-RL表述在优化器层面等价于ART的表现，即其最优高斯策略通过均值恢复了最优的ART时间扭曲率。我们进一步建立了策略评估和策略改进的特征，并推导出基于轨迹的矩标识，从而产生可实现的actor-critic更新，用于学习时间表。在从受控低维设置到图像生成等实验中，ART-RL只需改变时间步长网格即可插入现有的扩散采样器，在匹配预算下持续提升强基线时程的采样质量，同时保持采样流程其余部分不变。所学的时程还表现出广泛的泛化，能够在采样预算、数据集、求解器、流水线和表示空间之间无需重新训练即可转移。
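
The time-change idea in this abstract can be sketched numerically: given a positive clock-speed function, a uniform grid on the reparameterized clock maps to non-uniform timesteps in the original diffusion time. This is only an illustration under assumptions, not the ART-RL algorithm: the speed function is supplied by hand rather than learned, and `adaptive_grid`, the normalization to fixed endpoints, and the trapezoid quadrature are illustrative choices.

```python
def adaptive_grid(speed, t_start, t_end, num_steps, quad_points=2000):
    """Map a uniform grid on s in [0, 1] to diffusion times via a clock speed.

    speed(s) must be positive; the cumulative integral of speed is normalized
    so the grid starts at t_start and ends at t_end.
    """
    ds = 1.0 / quad_points
    cum = [0.0]
    for i in range(quad_points):
        s0, s1 = i * ds, (i + 1) * ds
        # Trapezoid rule for the integral of speed over [s0, s1].
        cum.append(cum[-1] + 0.5 * (speed(s0) + speed(s1)) * ds)
    total = cum[-1]
    grid = []
    for k in range(num_steps + 1):
        idx = round(k * quad_points / num_steps)
        grid.append(t_start + (t_end - t_start) * cum[idx] / total)
    return grid
```

A constant speed reproduces the uniform schedule, while a speed that increases in `s` spends smaller steps early and larger steps late; only the timestep grid changes, so the rest of a sampler could stay as it is.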

Actuator Reality Shaping for Zero-Shot Sim-to-Real Robot Learning

面向零样本仿真到现实机器人学习的执行器现实塑形

Authors: Satoshi Yamamori, Koji Ishihara, Kentaro Minamikawa, Kiyoharu Ohomori, Taiyo Yazaki, Norikazu Sugimoto, Jun Morimoto
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.02205
Pdf link: https://arxiv.org/pdf/2607.02205
Abstract Sim-to-real transfer in robot learning is often limited by discrepancies between the ideal actuator dynamics assumed during policy training and the nonlinear, hardware-dependent behavior of physical motors. While conventional approaches attempt to bridge this gap by increasing simulator fidelity through system identification, domain randomization, or learned actuator models, we introduce an alternative paradigm: actuator reality shaping. Instead of modifying the simulator to match the real world, our method shapes the closed-loop behavior of physical actuators to match the idealized second-order reference dynamics used in simulation. By equipping each joint with a two-degree-of-freedom feedforward--feedback controller, we decouple reference-response shaping from robust stabilization, thereby providing a standardized actuator interface for reinforcement learning policies. As a result, policies trained only with the prescribed reference model can be deployed zero-shot on real hardware without task-level fine-tuning or learned actuator models. We validate the approach on a single-joint high-gear-ratio servo under external loads and a 7-DOF robotic arm reaching task, where actuator reality shaping substantially reduces sim-to-real tracking error and improves zero-shot task performance compared with standard servo-control and representative real-to-sim-to-real baselines. We further demonstrate zero-shot transfer on a wheeled-legged robot driving over a slope and a humanoid robot walking, suggesting that actuator reality shaping can serve as a reusable interface for robot learning across diverse hardware platforms.
中文摘要 机器人学习中的仿真到现实迁移，常常受限于策略训练时假设的理想执行器动力学与物理电机非线性、依赖硬件的行为之间的差异。传统方法试图通过系统辨识、域随机化或学习得到的执行器模型来提高模拟器保真度，以弥合这一差距；我们则引入另一种范式：执行器现实塑形。我们的方法不是修改模拟器去匹配现实世界，而是塑造物理执行器的闭环行为，使其匹配仿真中使用的理想化二阶参考动力学。通过为每个关节配备一个二自由度前馈-反馈控制器，我们将参考响应塑形与鲁棒镇定解耦，从而为强化学习策略提供标准化的执行器接口。因此，仅用给定参考模型训练的策略可以零样本部署到真实硬件上，无需任务级微调或学习得到的执行器模型。我们在承受外部负载的单关节高减速比伺服和7自由度机械臂到达任务上验证了该方法；与标准伺服控制和有代表性的real-to-sim-to-real基线相比，执行器现实塑形显著降低了仿真到现实的跟踪误差，并提升了零样本任务性能。我们还在轮腿机器人驶过斜坡和人形机器人行走上进一步展示了零样本迁移，表明执行器现实塑形可以作为跨多种硬件平台的机器人学习的可复用接口。
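
The idealized second-order reference dynamics mentioned in this abstract can be illustrated with a standard mass-spring-damper style model. This sketch covers only the reference model that a policy would be trained against, not the two-degree-of-freedom controller that shapes the hardware; the parameter names (`omega_n`, `zeta`), their values, and the semi-implicit Euler integrator are assumptions.

```python
def simulate_reference(command, omega_n, zeta, dt, steps, x0=0.0, v0=0.0):
    """Simulate x'' = omega_n^2 * (command - x) - 2 * zeta * omega_n * x'.

    Returns the joint position after each of the `steps` integration steps.
    """
    x, v = x0, v0
    trajectory = []
    for _ in range(steps):
        accel = omega_n ** 2 * (command - x) - 2.0 * zeta * omega_n * v
        v += accel * dt  # semi-implicit Euler: update velocity first
        x += v * dt      # then position, using the new velocity
        trajectory.append(x)
    return trajectory
```

With `zeta = 1.0` the response settles on the commanded position without sustained oscillation, while a small `zeta` overshoots; the premise of the abstract is that real actuators are made to follow one such prescribed response.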

DetailAnywhere: Fashion Detail Generation via Cross-Modal Feature Alignment Distillation

DetailAnywhere：通过跨模态特征对齐提炼生成时尚细节

Authors: Zijun Li, Yimin Zhou, Jia Sun, Honglie Wang, Pengcheng Wei, Junlong Wu, Yongrui Heng, Jiyuan Wang, Huan Ouyang, Boheng Zhang, Huaiqing Wang, Dewen Fan, Qianqian Gan, Fan Yang, Tingting Gao
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2607.02220
Pdf link: https://arxiv.org/pdf/2607.02220
Abstract Diffusion-based generative AI has achieved remarkable success in e-commerce applications such as virtual try-on, poster generation, and product background synthesis. However, when making online purchasing decisions for apparel, consumers also desire the freedom to examine specific detail regions of interest, such as collars, cuffs, and fabric textures, yet existing methods have not explicitly studied this setting. We therefore formalize a new, non-template task: Fashion Detail Generation with focus conditioning, and release FDBench, the first benchmark comprising 40K+ human-verified reference-detail pairs across 41 different categories. This task poses a unique semantic gap challenge: the model must bridge the correspondence between a focus marker on a product reference image and a photorealistic close-up view of the indicated region, while faithfully preserving the garment's identity, without any precise prompt. To bridge this gap, we propose Cross-modal Feature Alignment Distillation (CFAD), which leverages a fine-tuned DINOv3 teacher to align both branches of a Multimodal Diffusion Transformer in a shared semantic space via dual-branch distillation. To further improve consistency between generated details and reference images, we introduce a consistency reward model that jointly scores image pairs along three quality axes and optimizes generation via reinforcement learning. Experiments show that our model DetailAnywhere significantly outperforms all state-of-the-art opensource methods across all metrics and human evaluations.
中文摘要 基于扩散的生成式人工智能在虚拟试穿、海报生成和产品背景综合等电子商务应用中取得了显著成功。然而，在网上购买服装时，消费者也希望能够自由地考察特定细节关注区域，如领口、袖口和织物质地，但现有方法尚未明确研究这一环境。因此，我们正式制定了一个新的非模板任务：带有焦点条件的时尚细节生成，并发布了FDBench，这是首个包含41个不同类别、40K+人工验证的参考-细节对的基准测试。该任务带来了独特的语义缺口挑战：模型必须在产品参考图像上的焦点标记与所示区域的写实特写视图之间建立桥梁，同时忠实地保持服装的身份，且无需精确提示。为弥合这一空白，我们提出了跨模态特征比对蒸馏（CFAD），利用经过精细调优的DINOv3教师，通过双分支蒸馏将多模扩散变换器的两个分支比对到共享语义空间中。为了进一步提升生成细节与参考图像之间的一致性，我们引入了一致性奖励模型，该模型沿三个质量轴联合评分图像对，并通过强化学习优化生成。实验显示，我们的模型DetailAnywhere在所有指标和人工评估中显著优于所有最先进的开源方法。

Generalization in offline RL: The structure is more important than the amount of pessimism

离线强化学习中的泛化：结构比悲观的程度更重要

Authors: Max Weltevrede, Matthijs T.J. Spaan, Wendelin Böhmer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.02288
Pdf link: https://arxiv.org/pdf/2607.02288
Abstract While pessimism counteracts overestimation bias in offline reinforcement learning (RL), being overly conservative has been associated with hindering certain forms of generalization. However, in this paper we demonstrate that being overly pessimistic does not inherently prevent optimal generalization in contextual MDPs (CMDPs). Instead, we argue successful generalization depends not on the amount of pessimism, but whether the pessimistic structure respects the underlying symmetries of the optimal solution. We prove that a mildly pessimistic, non-symmetric value function can generalize worse than an overly pessimistic, symmetric one. In offline RL, the structure of the pessimism is determined by the structure of the dataset coverage. As such, enforcing a symmetric value function can be non-trivial, and might require techniques such as data augmentation (DA). Inspired by our theoretical results, we argue that DA can best be applied through a consistency loss during policy extraction, rather than the common practice of (regular) offline training on an augmented dataset. This is empirically validated using IQL and CQL on a rotationally symmetric reacher environment.
中文摘要 虽然悲观主义能抵消离线下强化学习（RL）中的高估偏差，但过于保守则与某些形式的泛化受阻相关。然而，本文表明，过于悲观并不固有地阻碍上下文MDP（CMDP）的最佳推广。相反，我们认为成功的推广不取决于悲观程度，而在于悲观结构是否尊重最优解的潜在对称性。我们证明了一个轻度悲观的非对称价值函数的推广效果可能比过于悲观的对称函数更差。在离线强化学习中，悲观的结构由数据集覆盖结构决定。因此，强制执行对称值函数可能并不简单，可能需要数据增强（DA）等技术。受理论结果启发，我们认为DA最好通过策略提取时的一致性丢失来应用，而非常规的（常规）扩展数据集训练。这一点通过在旋转对称的前进环境中的 IQL 和 CQL 实证验证。
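
One way to picture a consistency loss that applies data augmentation during policy extraction is to penalize a policy whose output on a rotated state differs from the rotated output on the original state. The sketch below is a guess at the general shape, not the paper's loss: it assumes 2D states and actions that both rotate with the environment, and `rotate` and `consistency_loss` are hypothetical helpers. If actions were invariant under the symmetry, the target would be the unrotated action instead.

```python
import math


def rotate(vec, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])


def consistency_loss(policy, states, angles):
    """Mean squared gap between policy(rotated state) and rotated policy(state)."""
    total = 0.0
    for state in states:
        action = policy(state)
        for angle in angles:
            predicted = policy(rotate(state, angle))
            target = rotate(action, angle)
            total += (predicted[0] - target[0]) ** 2 + (predicted[1] - target[1]) ** 2
    return total / (len(states) * len(angles))
```

A policy that already respects the symmetry incurs zero loss, so the term only pushes on the asymmetric part of the learned behaviour.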

Optimizing Visual Generative Models via Distribution-wise Rewards

通过分布级奖励优化视觉生成模型

Authors: Ruihang Li, Mengde Xu, Shuyang Gu, Leigang Qu, Fuli Feng, Han Hu, Wenjie Wang
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2607.02291
Pdf link: https://arxiv.org/pdf/2607.02291
Abstract Conventional reinforcement learning strategies for visual generation typically employ sample-wise reward functions, yet this practice frequently results in reward hacking that degrades image diversity and introduces visual anomalies. To address these limitations, we present a novel framework that finetunes generative models using distribution-wise rewards, ensuring better alignment with real-world data distributions. Unlike rewards that evaluate samples individually, distribution-wise reward accounts for the data distribution of the samples, mitigating the mode collapse problem that occurs when all samples optimize towards the same direction independently. To overcome the prohibitive computational cost of estimating these rewards, we introduce a subset-replace strategy that efficiently provides reward signals by updating only a small subset of a generated reference set. Additionally, we apply RL to optimize post-hoc model merging coefficients, potentially mitigating the train-inference inconsistency caused by introducing stochastic differential equation (SDE) in regular RL practices. Extensive experiments show our approach significantly improves FID-50K across various base models, from 8.30 to 5.77 for SiT and from 3.74 to 3.52 for EDM2. Qualitative evaluation also confirms that our method enhances perceptual quality while preserving sample diversity.
中文摘要 用于视觉生成的传统强化学习策略通常采用逐样本奖励函数，但这种做法常常导致奖励投机（reward hacking），降低图像多样性并引入视觉异常。为解决这些局限，我们提出了一个新框架，利用分布级奖励微调生成模型，使其更好地与真实世界的数据分布对齐。与逐个评估样本的奖励不同，分布级奖励考虑了样本的数据分布，缓解了所有样本各自独立地朝同一方向优化时出现的模式崩溃问题。为了克服估计这类奖励的高昂计算成本，我们引入了一种子集替换策略，只更新已生成参考集中的一小部分，即可高效地提供奖励信号。此外，我们用强化学习优化事后模型合并系数，这有望缓解常规强化学习做法中引入随机微分方程（SDE）所导致的训练-推理不一致。大量实验表明，我们的方法在多种基础模型上显著改善了FID-50K：SiT从8.30降至5.77，EDM2从3.74降至3.52。定性评估也证实，我们的方法在保持样本多样性的同时提升了感知质量。

DecompRL: Solving Harder Problems by Learning Modular Code Generation

DecompRL：通过学习模块化代码生成解决更难的问题

Authors: Juliette Decugis, Fabian Gloeckle, Francis Bach, Taco Cohen, Gabriel Synnaeve
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2607.02390
Pdf link: https://arxiv.org/pdf/2607.02390
Abstract How can Large Language Models (LLMs) solve problems they currently cannot? Repeated sampling scales test-time compute, but GPU cost grows linearly with attempts, while reinforcement learning (RL) with verifiable rewards improves single-attempt accuracy at the expense of sample diversity. Both strategies ultimately fail when the base policy has near-zero probability of producing a correct solution: no amount of sampling or gradient signal can overcome a search space that is simply too large. We take a different approach: rather than sampling harder, we make the task easier by decomposing problems into smaller, independently solvable sub-functions whose implementations can be recombined. Since off-the-shelf models are not trained for this modular generation, we introduce DecompRL, an RL algorithm that explicitly learns to decompose and implement hierarchical code structures. Recombining $k$ implementations of $n$ modules yields up to $k^{n}$ candidate solutions, shifting the bottleneck from GPU inference to cheap CPU evaluation and cutting GPU token cost by $\sim 50\times$. On LiveCodeBench and CodeContests (Qwen 2.5 7B, Code World Model 32B), DecompRL outperforms standard and diversity-optimized RL baselines beyond $10^5$ tokens per problem, solving problems that standard generation cannot reach.
中文摘要 大型语言模型（LLM）如何解决它们目前无法解决的问题？重复采样可以扩展测试时计算，但GPU成本随尝试次数线性增长；带有可验证奖励的强化学习（RL）能提高单次尝试的准确率，却以牺牲样本多样性为代价。当基础策略生成正确解的概率接近于零时，这两种策略最终都会失败：无论多少采样或梯度信号，都无法克服过大的搜索空间。我们采取不同的方法：不是加大采样力度，而是把问题分解成更小的、可独立求解的子函数，并将其实现重新组合，从而降低任务难度。由于现成模型并未针对这种模块化生成进行训练，我们引入了DecompRL，一种显式学习分解并实现层级代码结构的强化学习算法。将 $n$ 个模块各自的 $k$ 种实现重新组合，最多可得到 $k^{n}$ 个候选解，从而把瓶颈从GPU推理转移到廉价的CPU评估，并将GPU token成本降低约 $50\times$。在LiveCodeBench和CodeContests（Qwen 2.5 7B、Code World Model 32B）上，当每个问题的token预算超过 $10^5$ 时，DecompRL优于标准的和针对多样性优化的RL基线，解决了标准生成无法解决的问题。
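
The recombination count in this abstract, up to $k^{n}$ candidates from $k$ implementations of each of $n$ modules, can be sketched with a Cartesian product over per-module candidates followed by cheap CPU-side testing. This is an illustrative sketch, not DecompRL itself: `recombine`, `first_passing`, and the `assemble` callback are hypothetical names, and the learned decomposition step is not modelled.

```python
import itertools


def recombine(module_impls):
    """Yield every assignment of one implementation per module.

    module_impls maps a module name to a list of candidate callables, so
    k candidates for each of n modules yield k ** n assignments.
    """
    names = list(module_impls)
    for combo in itertools.product(*(module_impls[name] for name in names)):
        yield dict(zip(names, combo))


def first_passing(module_impls, assemble, tests):
    """Return the first assignment whose assembled program passes all tests."""
    for candidate in recombine(module_impls):
        program = assemble(candidate)
        try:
            if all(program(x) == y for x, y in tests):
                return candidate
        except Exception:
            continue  # a crashing combination simply counts as a failure
    return None
```

Each module only has to be generated a handful of times, while the combinatorial search over assemblies runs as ordinary test execution, which is the shift from GPU inference to CPU evaluation that the abstract points to.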

WorldSample: Closed-loop Real-robot RL with World Modelling

WorldSample：结合世界建模的闭环真实机器人强化学习

Authors: Yuquan Xue, Le Xu, Zeyi Liu, Zhenyu Wu, Zhengyi Gu, Xinyang Song, Bofang Jia, Ziwei Wang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2607.02431
Pdf link: https://arxiv.org/pdf/2607.02431
Abstract Reinforcement learning (RL) can overcome the demonstration-coverage limitation of imitation learning (IL) by allowing robots to improve through trial-and-error interaction beyond the states observed in demonstrations. However, deploying RL on real robots remains constrained by high interaction costs, since each physical rollout is costly and reflects only one realized action-outcome path. To address this challenge, we propose WorldSample, a physically grounded data augmentation framework for real-robot RL that closes a real-synthetic loop between physical rollouts, world-model generation, and policy improvement. Grounded on real rollouts, WorldSample generates high-fidelity synthetic transitions through a post-trained world model, which greatly lowers the visual hallucination. Specifically, rather than simply using these transitions as real-world experience, WorldSample introduces Policy-Paced Learning (PPL) to regulate the training process through sample selection and scheduling, balancing useful augmentation against value overestimation and mitigating the hallucination-induced noise. Experiments on robot manipulation tasks involving contact-rich and precise tasks show that WorldSample improves policy success rate by 28% while reducing training steps by 59% compared with baselines. Furthermore, WorldSample improves world model visual fidelity by 19.4dB in PSNR and 0.47 in SSIM over demonstration-only post-training, validating the effectiveness of the real-synthetic loop for both policy and world model performance.
中文摘要 强化学习（RL）允许机器人通过试错交互，在演示所覆盖的状态之外继续改进，从而克服模仿学习（IL）受演示覆盖范围限制的问题。然而，在真实机器人上部署强化学习仍受制于高昂的交互成本，因为每次物理rollout代价很高，且只反映一条已实现的动作-结果路径。为应对这一挑战，我们提出了WorldSample，一种面向真实机器人强化学习、以物理数据为基础的数据增强框架，它在物理rollout、世界模型生成和策略改进之间构成真实-合成闭环。WorldSample以真实rollout为基础，通过经过后训练的世界模型生成高保真的合成转移（transition），大幅减少了视觉幻觉。具体而言，WorldSample并非简单地把这些转移当作真实经验使用，而是引入策略节奏学习（PPL），通过样本选择和调度来调节训练过程，在有效增强与价值高估之间取得平衡，并减轻幻觉引入的噪声。在接触丰富且要求精确的机器人操作任务上的实验表明，与基线相比，WorldSample将策略成功率提高了28%，同时将训练步数减少了59%。此外，相比仅用演示数据进行后训练，WorldSample将世界模型的视觉保真度在PSNR上提升了19.4dB，在SSIM上提升了0.47，验证了真实-合成闭环对策略和世界模型性能的有效性。
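
For readers unfamiliar with the PSNR figure quoted above, the sketch below computes PSNR from its standard definition, $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$, and converts a gain in dB into the corresponding reduction in mean squared error. It assumes 8-bit intensities (`max_val=255.0`) and flat pixel sequences. The function names are illustrative, and nothing here reproduces the paper's evaluation pipeline.

```python
import math

def psnr(reference, estimate, max_val=255.0):
    """PSNR in dB between two equal-length sequences of pixel intensities."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals: PSNR is unbounded
    return 10.0 * math.log10(max_val ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

print(psnr([0, 0], [255, 255]))             # worst case for 8-bit data: 0 dB
print(round(mse_ratio_for_gain(19.4), 1))   # MSE reduction implied by +19.4 dB
```

Because PSNR is logarithmic, a 19.4 dB improvement corresponds to a mean squared error roughly 87 times smaller. SSIM, by contrast, is a unitless score.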

Learning Agile Intruder Interception using Differentiable Quadrotor Dynamics

学习使用可微四旋翼动力学进行敏捷入侵者拦截

Authors: Michael Anoruo, Xiaoyu Tian, Abhishek Rathod, Timothy Naudet, Thomas Canchola, Eric Sturzinger, Kshitij Goel, Wennie Tabib
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2607.02472
Pdf link: https://arxiv.org/pdf/2607.02472
Abstract This paper presents a methodology for learning a control policy to intercept an intruder using the 3D direction unit vector to the intruder and the interceptor state. Prior deep reinforcement learning approaches assume either relative position or distance to the intruder is available, but this information is not readily accessible in real-world applications that employ passive, monocular camera sensors. Instead, we propose a solution that leverages an analytical policy gradient method using differentiable quadrotor dynamics to learn agile interception at speeds up to 10 m/s. The proposed approach outperforms baseline methods that utilize simplified point mass dynamics by an average of 30%.
中文摘要 本文提出了一种学习控制策略的方法，利用指向入侵者的三维方向单位向量和拦截器自身状态来拦截入侵者。以往的深度强化学习方法假设可以获得与入侵者的相对位置或距离，但在使用被动单目相机传感器的实际应用中，这些信息并不容易获得。为此，我们提出了一种解决方案，借助可微四旋翼动力学的解析策略梯度方法，学习速度最高达10 m/s的敏捷拦截。所提方法比使用简化质点动力学的基线方法平均高出30%。
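
The observation described in the abstract, a 3D unit vector pointing at the intruder, carries direction but no range. The sketch below illustrates this with a hypothetical helper, `bearing_unit_vector`, that normalizes a relative position as one might in simulation. On real hardware the bearing would come from the camera rather than from known positions, and the paper's actual observation pipeline is not given in the abstract.

```python
import math

def bearing_unit_vector(interceptor_pos, intruder_pos):
    """Unit vector from the interceptor to the intruder (3D tuples, same frame)."""
    dx = intruder_pos[0] - interceptor_pos[0]
    dy = intruder_pos[1] - interceptor_pos[1]
    dz = intruder_pos[2] - interceptor_pos[2]
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    if norm == 0.0:
        raise ValueError("bearing is undefined when the positions coincide")
    return (dx / norm, dy / norm, dz / norm)

# Two intruders along the same line of sight, at 5 m and at 50 m.
near = bearing_unit_vector((0.0, 0.0, 0.0), (3.0, 0.0, 4.0))
far = bearing_unit_vector((0.0, 0.0, 0.0), (30.0, 0.0, 40.0))
print(near, far)
```

Both calls yield the same vector, so a policy fed only this observation cannot read distance from a single frame. That is the information gap the abstract contrasts with prior methods that assume relative position or distance.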

Visually Grounded Self-Reflection for Vision-Language Models via Reinforcement Learning

通过强化学习实现视觉语言模型基于视觉的自我反思

Authors: Liyan Tang, Fangcong Yin, Greg Durrett
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2607.02490
Pdf link: https://arxiv.org/pdf/2607.02490
Abstract Large vision-language models can reason over multimodal inputs by generating textual chains of thought (CoT). A key capability exhibited in CoT reasoning is self-reflection: revisiting earlier decisions and correcting previous errors. However, existing LVLMs often fail to properly attend to visual inputs during reflection, limiting their ability to translate feedback into grounded corrections, especially for out-of-distribution images. To address this issue, we propose a novel reinforcement learning training framework VRRL, with two components explicitly designed to elicit visually grounded self-reflection. First, we randomly mask trajectory prefixes during training to emphasize recovery from incorrect intermediate predictions rather than making early mistakes. Second, we introduce buffered roll-ins from an experience replay buffer to expose the model to diverse failure states that it must learn to correct. We evaluate our approach on visual grounding tasks involving tables and charts, as well as spatial navigation benchmarks. While off-the-shelf and conventionally fine-tuned models degrade substantially under distribution shift, our method substantially improves average out-of-distribution accuracy over standard RL and reflection-oriented fine-tuning baselines by using self-reflection effectively.
中文摘要 大型视觉语言模型可以通过生成文本思维链（CoT）对多模态输入进行推理。CoT推理中展现出的一项关键能力是自我反思：重新审视先前的决策并纠正此前的错误。然而，现有的LVLM在反思时往往不能恰当地关注视觉输入，限制了其将反馈转化为有视觉依据的修正的能力，对分布外图像尤其如此。为解决这一问题，我们提出了一种新的强化学习训练框架VRRL，其中两个组成部分专门用于引导模型进行基于视觉的自我反思。首先，我们在训练中随机遮蔽轨迹前缀，以强调从错误的中间预测中恢复，而不是在早期犯错。其次，我们引入来自经验回放缓冲区的缓冲式roll-in，使模型接触到它必须学会纠正的多种失败状态。我们在涉及表格和图表的视觉定位任务以及空间导航基准上评估了该方法。现成模型和常规微调模型在分布偏移下会显著退化，而我们的方法通过有效利用自我反思，相较于标准强化学习和面向反思的微调基线，显著提高了平均分布外准确率。

Seek to Segment: Active Perception for Panoramic Referring Segmentation

寻找以分割：面向全景指代分割的主动感知

Authors: Song Tang, Shuming Hu, Xincheng Shuai, Henghui Ding, Yu-Gang Jiang
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Arxiv link: https://arxiv.org/abs/2607.02497
Pdf link: https://arxiv.org/pdf/2607.02497
Abstract Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in the continuous 360$^\circ$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction ($\Delta\theta, \Delta\phi$) to explore the 360$^\circ$ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360$^\circ$ representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker's exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.
中文摘要 现有的指代分割模型被动地处理从固定视角拍摄的静态图像，这限制了它们在具身智能中的适用性，因为智能体必须在连续的360$^\circ$环境中进行主动感知。为弥合这一差距，我们提出了一项新任务：主动全景指代分割（APRS）。在该设定下，智能体需要调整其观察方向（$\Delta\theta, \Delta\phi$）来探索360$^\circ$环境，寻找用户指令所指定的物体并对其进行分割。为应对这一具有挑战性的任务，我们提出了PanoSeeker，一种用于高效完成APRS的记忆增强智能体。PanoSeeker不依赖启发式扫描，而是将视觉语言模型（VLM）与显式空间视觉记忆EgoSphere相结合。EgoSphere逐步将连续的局部观测整合为统一的360$^\circ$表示，使智能体能够规划高效且无冗余的搜索轨迹。一旦找到目标，智能体便主动进行视点对齐并输出分割掩码。此外，我们整理了一个带有记忆时间线、由专家标注的搜索轨迹数据集用于监督微调，随后进行强化学习后训练，以显式优化PanoSeeker的探索效率。在我们新建立的APRS基准上的大量实验表明，PanoSeeker取得了更优的搜索效率和分割精度，显著优于经过适配的最先进基线。
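
The action $(\Delta\theta, \Delta\phi)$ in the abstract adjusts a viewing direction on a 360$^\circ$ sphere. The sketch below shows one way such an update could be applied. It assumes, without confirmation from the abstract, that $\theta$ is an azimuth that wraps around and $\phi$ is an elevation clamped at the poles. The function `update_view` and the degree convention are illustrative choices, not PanoSeeker's interface.

```python
def update_view(azimuth_deg, elevation_deg, d_azimuth_deg, d_elevation_deg):
    """Apply a relative view change; azimuth wraps, elevation is clamped.

    Assumed convention (not from the paper): azimuth in [0, 360) degrees,
    elevation in [-90, 90] degrees.
    """
    new_azimuth = (azimuth_deg + d_azimuth_deg) % 360.0
    new_elevation = max(-90.0, min(90.0, elevation_deg + d_elevation_deg))
    return new_azimuth, new_elevation

# Turning right past the seam wraps around instead of leaving the panorama.
print(update_view(350.0, 0.0, 20.0, 0.0))
# Looking up past the zenith is clamped at the pole.
print(update_view(10.0, 80.0, -30.0, 25.0))
```

Wrapping the azimuth keeps the search trajectory continuous across the panorama seam, which matters for any memory that stitches local views into a single spherical representation.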

Keyword: diffusion policy

There is no result