Arxiv Papers of Today

Generated: 2025-10-30 16:28:05 (UTC+8); Arxiv release time: 2025-10-30 20:00 EDT (2025-10-31 08:00 UTC+8)

29 related papers today

Keyword: reinforcement learning

Learning to Attack: Uncovering Privacy Risks in Sequential Data Releases

学会攻击：揭示连续数据发布中的隐私风险

Authors: Ziyao Cui, Minxing Zhang, Jian Pei
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.24807
Pdf link: https://arxiv.org/pdf/2510.24807
Abstract Privacy concerns have become increasingly critical in modern AI and data science applications, where sensitive information is collected, analyzed, and shared across diverse domains such as healthcare, finance, and mobility. While prior research has focused on protecting privacy in a single data release, many real-world systems operate under sequential or continuous data publishing, where the same or related data are released over time. Such sequential disclosures introduce new vulnerabilities, as temporal correlations across releases may enable adversaries to infer sensitive information that remains hidden in any individual release. In this paper, we investigate whether an attacker can compromise privacy in sequential data releases by exploiting dependencies between consecutive publications, even when each individual release satisfies standard privacy guarantees. To this end, we propose a novel attack model that captures these sequential dependencies by integrating a Hidden Markov Model with a reinforcement learning-based bi-directional inference mechanism. This enables the attacker to leverage both earlier and later observations in the sequence to infer private information. We instantiate our framework in the context of trajectory data, demonstrating how an adversary can recover sensitive locations from sequential mobility datasets. Extensive experiments on Geolife, Porto Taxi, and SynMob datasets show that our model consistently outperforms baseline approaches that treat each release independently. The results reveal a fundamental privacy risk inherent to sequential data publishing, where individually protected releases can collectively leak sensitive information when analyzed temporally. These findings underscore the need for new privacy-preserving frameworks that explicitly model temporal dependencies, such as time-aware differential privacy or sequential data obfuscation strategies.
中文摘要 在现代人工智能和数据科学应用中，隐私问题变得越来越重要，在这些应用中，敏感信息在医疗保健、金融和移动性等不同领域被收集、分析和共享。虽然之前的研究侧重于在单个数据发布中保护隐私，但许多现实世界的系统在顺序或连续数据发布下运行，其中相同或相关的数据会随着时间的推移而发布。这种顺序披露引入了新的漏洞，因为跨版本的时间相关性可能使对手能够推断出隐藏在任何单个版本中的敏感信息。在本文中，我们研究了攻击者是否可以通过利用连续发布之间的依赖关系来损害连续数据发布中的隐私，即使每个单独的发布都满足标准隐私保证。为此，我们提出了一种新颖的攻击模型，通过将隐马尔可夫模型与基于强化学习的双向推理机制集成来捕获这些顺序依赖关系。这使攻击者能够利用序列中较早和较晚的观察来推断私人信息。我们在轨迹数据的上下文中实例化我们的框架，演示了对手如何从顺序移动数据集中恢复敏感位置。对 Geolife、Porto Taxi 和 SynMob 数据集的广泛实验表明，我们的模型始终优于独立处理每个版本的基线方法。结果揭示了顺序数据发布固有的基本隐私风险，即单独受保护的发布在进行时间分析时可能会集体泄露敏感信息。这些发现强调了需要新的隐私保护框架来显式模拟时间依赖关系，例如时间感知差分隐私或顺序数据混淆策略。
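The abstract's point that later releases can sharpen inference about earlier ones can be illustrated with classical forward-backward smoothing on a toy Hidden Markov Model. This is a minimal sketch of the textbook algorithm, not the paper's reinforcement learning-based mechanism; the two-state model, its probabilities, and the function name `forward_backward` are all illustrative assumptions.

```python
from typing import List


def forward_backward(init: List[float],
                     trans: List[List[float]],
                     emit: List[List[float]],
                     obs: List[int]) -> List[List[float]]:
    """Return smoothed posteriors P(state_t | all observations) for a discrete HMM."""
    n, steps = len(init), len(obs)

    # Forward pass: filtered beliefs that use only observations up to time t.
    alpha = []
    cur = [init[s] * emit[s][obs[0]] for s in range(n)]
    z = sum(cur)
    alpha.append([c / z for c in cur])
    for t in range(1, steps):
        cur = [emit[s][obs[t]] * sum(alpha[t - 1][r] * trans[r][s] for r in range(n))
               for s in range(n)]
        z = sum(cur)
        alpha.append([c / z for c in cur])

    # Backward pass: evidence contributed by observations after time t.
    beta = [[1.0] * n for _ in range(steps)]
    for t in range(steps - 2, -1, -1):
        cur = [sum(trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r] for r in range(n))
               for s in range(n)]
        z = sum(cur)
        beta[t] = [c / z for c in cur]

    # Combine both directions and normalize per time step.
    post = []
    for t in range(steps):
        w = [alpha[t][s] * beta[t][s] for s in range(n)]
        z = sum(w)
        post.append([x / z for x in w])
    return post


# Hypothetical toy model: two hidden locations with "sticky" transitions
# and noisy released observations.
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [[0.8, 0.2], [0.2, 0.8]]
posteriors = forward_backward(init, trans, emit, obs=[0, 1, 0])
```

In this toy setting the middle observation alone points to state 1, but once the following release is taken into account the smoothed posterior favors state 0, which is the kind of cross-release leakage the abstract describes.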

Scheduling Your LLM Reinforcement Learning with Reasoning Trees

使用推理树安排 LLM 强化学习

Authors: Hong Wang, Zhezheng Hao, Jian Luo, Chenxing Wei, Yao Shu, Lei Liu, Qiang Lin, Hande Dong, Jiawei Chen
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.24832
Pdf link: https://arxiv.org/pdf/2510.24832
Abstract Using Reinforcement Learning with Verifiable Rewards (RLVR) to optimize Large Language Models (LLMs) can be conceptualized as progressively editing a query's "Reasoning Tree". This process involves exploring nodes (tokens) and dynamically modifying the model's policy at each node. When combined with data scheduling, this process yields further gains in data efficiency and accuracy. However, existing RLVR data scheduling methods typically rely on path-based metrics to rank queries, overlooking the reasoning tree structures of these queries. In this paper, we introduce a novel metric, namely Reasoning Score (r-score), which measures the query's learning difficulty based on the structure of its reasoning tree. Based on the r-score, we propose the Reasoning Tree Schedule (Re-Schedule), a scheduling algorithm that constructs a curriculum progressing from structurally simple (high r-score) to complex (low r-score) queries. Experiments on six math-reasoning benchmarks show that Re-Schedule significantly improves average accuracy, achieving gains of up to 3.2%. These strong results validate our approach and demonstrate that a structural understanding of the reasoning tree provides a more powerful and principled foundation for RLVR data scheduling.
中文摘要 使用具有可验证奖励的强化学习（RLVR）来优化大型语言模型（LLM）可以概念化为逐步编辑查询的“推理树”。此过程涉及探索节点（词元）并在每个节点上动态修改模型的策略。当与数据调度相结合时，该过程可以进一步提高数据效率和准确性。然而，现有的RLVR数据调度方法通常依赖于基于路径的指标来对查询进行排名，而忽略了这些查询的推理树结构。在本文中，我们引入了一种新颖的指标，即推理分数（r-score），它根据推理树的结构来衡量查询的学习难度。基于 r 分数，我们提出了推理树计划（Re-Schedule），这是一种调度算法，用于构建从结构简单（高 r 分数）到复杂（低 r 分数）查询的课程。对六个数学推理基准的实验表明，Re-Schedule 显著提高了平均准确率，实现了高达 3.2% 的提升。这些强大的结果验证了我们的方法，并证明对推理树的结构理解为 RLVR 数据调度提供了更强大和更有原则的基础。
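The curriculum ordering described in the abstract (high r-score first, low r-score last) can be sketched as a sort followed by a split into stages. How the r-score is computed from a reasoning tree is not given here, so the scores below are made-up inputs, and `re_schedule_order` and `split_into_stages` are hypothetical helper names rather than the authors' implementation.

```python
from typing import Dict, List


def re_schedule_order(r_scores: Dict[str, float]) -> List[str]:
    """Order query ids from high r-score (structurally simple) to low (complex)."""
    return sorted(r_scores, key=lambda q: r_scores[q], reverse=True)


def split_into_stages(ordered: List[str], num_stages: int) -> List[List[str]]:
    """Cut the ordered queries into contiguous curriculum stages of near-equal size."""
    size, rem = divmod(len(ordered), num_stages)
    stages, start = [], 0
    for i in range(num_stages):
        end = start + size + (1 if i < rem else 0)
        stages.append(ordered[start:end])
        start = end
    return stages


# Hypothetical r-scores for four queries.
scores = {"q1": 0.2, "q2": 0.9, "q3": 0.5, "q4": 0.7}
order = re_schedule_order(scores)
stages = split_into_stages(order, num_stages=2)
```

Under these assumed scores, the first stage holds the two structurally simplest queries and the second stage the two hardest.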

Deep Reinforcement Learning Approach to QoS-Aware Load Balancing in 5G Cellular Networks under User Mobility and Observation Uncertainty

用户移动性和观测不确定性下 5G 蜂窝网络 QoS 感知负载均衡的深度强化学习方法

Authors: Mehrshad Eskandarpour, Hossein Soleimani
Subjects: Networking and Internet Architecture (cs.NI); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.24869
Pdf link: https://arxiv.org/pdf/2510.24869
Abstract Efficient mobility management and load balancing are critical to sustaining Quality of Service (QoS) in dense, highly dynamic 5G radio access networks. We present a deep reinforcement learning framework based on Proximal Policy Optimization (PPO) for autonomous, QoS-aware load balancing implemented end-to-end in a lightweight, pure-Python simulation environment. The control problem is formulated as a Markov Decision Process in which the agent periodically adjusts Cell Individual Offset (CIO) values to steer user-cell associations. A multi-objective reward captures key performance indicators (aggregate throughput, latency, jitter, packet loss rate, Jain's fairness index, and handover count), so the learned policy explicitly balances efficiency and stability under user mobility and noisy observations. The PPO agent uses an actor-critic neural network trained from trajectories generated by the Python simulator with configurable mobility (e.g., Gauss-Markov) and stochastic measurement noise. Across 500+ training episodes and stress tests with increasing user density, the PPO policy consistently improves KPI trends (higher throughput and fairness, lower delay, jitter, packet loss, and handovers) and exhibits rapid, stable convergence. Comparative evaluations show that PPO outperforms rule-based ReBuHa and A3 as well as the learning-based CDQL baseline across all KPIs while maintaining smoother learning dynamics and stronger generalization as load increases. These results indicate that PPO's clipped policy updates and advantage-based training yield robust, deployable control for next-generation RAN load balancing using an entirely Python-based toolchain.
中文摘要 高效的移动管理和负载平衡对于在密集、高度动态的 5G 无线接入网络中维持服务质量（QoS）至关重要。我们提出了一个基于近端策略优化（PPO）的深度强化学习框架，用于在轻量级纯 Python 仿真环境中端到端实现自主、QoS 感知负载平衡。控制问题被表述为马尔可夫决策过程，其中代理定期调整单元格单个偏移量（CIO）值以引导用户-单元格关联。多目标奖励捕获关键性能指标（聚合吞吐量、延迟、抖动、丢包率、Jain 的公平性指数和切换计数），因此学习到的策略在用户移动性和嘈杂观察下明确平衡了效率和稳定性。PPO 代理使用根据 Python 模拟器生成的轨迹训练的 actor-critic 神经网络，具有可配置的移动性（例如，高斯-马尔可夫）和随机测量噪声。在 500+ 个训练事件和压力测试中，随着用户密度的增加，PPO 策略持续改善 KPI 趋势（更高的吞吐量和公平性、更低的延迟、抖动、丢包和切换），并表现出快速、稳定的收敛。比较评估表明，PPO 在所有 KPI 上都优于基于规则的 ReBuHa 和 A3 以及基于学习的 CDQL 基线，同时随着负载的增加保持更平滑的学习动态和更强的泛化。这些结果表明，PPO 的裁剪策略更新和基于优势的训练使用完全基于 Python 的工具链为下一代 RAN 负载平衡产生了强大、可部署的控制。
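One of the reward terms listed in the abstract, Jain's fairness index, has a standard closed form: the squared sum of the allocations divided by n times the sum of their squares. The sketch below computes only that index; how the paper weights it against the other KPIs is not stated here, and the function name is an assumption.

```python
from typing import Sequence


def jain_fairness(throughputs: Sequence[float]) -> float:
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), in the range [1/n, 1]."""
    n = len(throughputs)
    sum_sq = sum(x * x for x in throughputs)
    if n == 0 or sum_sq == 0:
        return 0.0  # convention chosen here for empty or all-zero input
    total = sum(throughputs)
    return (total * total) / (n * sum_sq)


equal_share = jain_fairness([5.0, 5.0, 5.0, 5.0])    # perfectly fair allocation
single_user = jain_fairness([20.0, 0.0, 0.0, 0.0])   # one user gets everything
```

The index equals 1 when every user receives the same throughput and drops to 1/n when a single user receives all of it.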

LRT-Diffusion: Calibrated Risk-Aware Guidance for Diffusion Policies

LRT-Diffusion：扩散策略的校准风险感知指南

Authors: Ximan Sun, Xiang Cheng
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.24983
Pdf link: https://arxiv.org/pdf/2510.24983
Abstract Diffusion policies are competitive for offline reinforcement learning (RL) but are typically guided at sampling time by heuristics that lack a statistical notion of risk. We introduce LRT-Diffusion, a risk-aware sampling rule that treats each denoising step as a sequential hypothesis test between the unconditional prior and the state-conditional policy head. Concretely, we accumulate a log-likelihood ratio and gate the conditional mean with a logistic controller whose threshold tau is calibrated once under H0 to meet a user-specified Type-I level alpha. This turns guidance from a fixed push into an evidence-driven adjustment with a user-interpretable risk budget. Importantly, we deliberately leave training vanilla (two heads with standard epsilon-prediction) under the structure of DDPM. LRT guidance composes naturally with Q-gradients: critic-gradient updates can be taken at the unconditional mean, at the LRT-gated mean, or a blend, exposing a continuum from exploitation to conservatism. We standardize states and actions consistently at train and test time and report a state-conditional out-of-distribution (OOD) metric alongside return. On D4RL MuJoCo tasks, LRT-Diffusion improves the return-OOD trade-off over strong Q-guided baselines in our implementation while honoring the desired alpha. Theoretically, we establish level-alpha calibration, concise stability bounds, and a return comparison showing when LRT surpasses Q-guidance, especially when off-support errors dominate. Overall, LRT-Diffusion is a drop-in, inference-time method that adds principled, calibrated risk control to diffusion policies for offline RL.
中文摘要 扩散策略对于离线强化学习（RL）具有竞争力，但在采样时通常由缺乏风险统计概念的启发式方法指导。我们引入了 LRT-Diffusion，这是一种风险感知采样规则，它将每个去噪步骤视为无条件先验和状态条件策略头之间的序贯假设检验。具体来说，我们累积一个对数似然比，并使用逻辑控制器对条件均值进行门控，其阈值 tau 在 H0 下校准一次，以满足用户指定的第一类错误水平 alpha。这将引导从固定推动转变为具有用户可解释风险预算的证据驱动调整。重要的是，我们特意在 DDPM 的结构下保持训练过程不变（两个采用标准 epsilon-prediction 的头）。LRT 引导可以自然地与 Q 梯度组合：critic 梯度更新可以在无条件均值、LRT 门控均值或二者的混合处进行，从而呈现出从利用到保守的连续谱。我们在训练和测试时一致地标准化状态和动作，并在回报之外报告状态条件分布外（OOD）指标。在 D4RL MuJoCo 任务上，LRT-Diffusion 在我们的实现中相比强 Q 引导基线改进了回报与 OOD 之间的权衡，同时满足所需的 alpha。从理论上讲，我们建立了 alpha 水平校准、简洁的稳定性界，以及一个回报比较，显示 LRT 何时超过 Q 引导，尤其是当支撑集外误差占主导时。总体而言，LRT-Diffusion 是一种即插即用的推理时方法，它为离线 RL 的扩散策略添加了有原则的、经过校准的风险控制。
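The gating idea in the abstract (accumulate a log-likelihood ratio, pass it through a logistic controller with threshold tau, and blend the two means) can be sketched in one dimension. Everything concrete below is an assumption for illustration: equal-variance Gaussian heads, a scalar action, the `temperature` parameter, and tau taken as an empirical quantile of LLRs simulated under H0. The paper's actual calibration procedure may differ.

```python
import math
from typing import List


def gaussian_llr(x: float, mu_cond: float, mu_uncond: float, sigma: float) -> float:
    """Log-likelihood ratio log N(x; mu_cond, sigma^2) - log N(x; mu_uncond, sigma^2)."""
    return ((x - mu_uncond) ** 2 - (x - mu_cond) ** 2) / (2.0 * sigma ** 2)


def logistic_gate(llr: float, tau: float, temperature: float = 1.0) -> float:
    """Numerically stable sigmoid of (llr - tau) / temperature, a value in [0, 1]."""
    z = (llr - tau) / temperature
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_mean(mu_uncond: float, mu_cond: float, gate: float) -> float:
    """Move from the unconditional mean toward the conditional mean by `gate`."""
    return mu_uncond + gate * (mu_cond - mu_uncond)


def calibrate_tau(null_llrs: List[float], alpha: float) -> float:
    """Assumed calibration: empirical (1 - alpha) quantile of LLRs drawn under H0."""
    ordered = sorted(null_llrs)
    idx = math.ceil((1.0 - alpha) * len(ordered)) - 1
    return ordered[min(max(idx, 0), len(ordered) - 1)]


# Hypothetical numbers: the evidence sits exactly at the threshold, so the gate is 0.5.
tau = calibrate_tau([3.0, 1.0, 4.0, 2.0], alpha=0.5)
gate = logistic_gate(llr=tau, tau=tau)
mean = gated_mean(mu_uncond=0.0, mu_cond=2.0, gate=gate)
```

With little evidence for the conditional head the gate stays near 0 and the sample follows the unconditional prior; strong evidence pushes the gate toward 1.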

Enhancing Hierarchical Reinforcement Learning through Change Point Detection in Time Series

通过时间序列中的变化点检测增强分层强化学习

Authors: Hemanath Arumugam, Falong Fan, Bo Liu
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.24988
Pdf link: https://arxiv.org/pdf/2510.24988
Abstract Hierarchical Reinforcement Learning (HRL) enhances the scalability of decision-making in long-horizon tasks by introducing temporal abstraction through options, i.e., policies that span multiple timesteps. Despite its theoretical appeal, the practical implementation of HRL suffers from the challenge of autonomously discovering semantically meaningful subgoals and learning optimal option termination boundaries. This paper introduces a novel architecture that integrates a self-supervised, Transformer-based Change Point Detection (CPD) module into the Option-Critic framework, enabling adaptive segmentation of state trajectories and the discovery of options. The CPD module is trained using heuristic pseudo-labels derived from intrinsic signals to infer latent shifts in environment dynamics without external supervision. These inferred change-points are leveraged in three critical ways: (i) to serve as supervisory signals for stabilizing termination function gradients, (ii) to pretrain intra-option policies via segment-wise behavioral cloning, and (iii) to enforce functional specialization through inter-option divergence penalties over CPD-defined state partitions. The overall optimization objective enhances the standard actor-critic loss using structure-aware auxiliary losses. In our framework, option discovery arises naturally as CPD-defined trajectory segments are mapped to distinct intra-option policies, enabling the agent to autonomously partition its behavior into reusable, semantically meaningful skills. Experiments on the Four-Rooms and Pinball tasks demonstrate that CPD-guided agents exhibit accelerated convergence, higher cumulative returns, and significantly improved option specialization. These findings confirm that integrating structural priors via change-point segmentation leads to more interpretable, sample-efficient, and robust hierarchical policies in complex environments.
中文摘要 分层强化学习（HRL）通过跨越多个时间步长的选项策略引入时间抽象，增强了长期任务中决策的可扩展性。尽管 HRL 具有理论吸引力，但实际实施仍面临着自主发现语义有意义的子目标和学习最佳选项终止边界的挑战。本文介绍了一种新颖的架构，该架构将自监督的、基于 Transformer 的变化点检测（CPD）模块集成到 Option-Critic 框架中，从而实现状态轨迹的自适应分割和选项的发现。CPD 模块使用源自内在信号的启发式伪标签进行训练，无需外部监督即可推断环境动力学的潜在变化。这些推断的变化点以三种关键方式被利用：（i）作为稳定终止函数梯度的监督信号，（ii）通过分段行为克隆预训练选项内策略，以及（iii）通过对 CPD 定义的状态分区的选项间散度惩罚来强制执行功能专业化。总体优化目标使用结构感知辅助损失增强标准 actor-critic 损失。在我们的框架中，当 CPD 定义的轨迹段映射到不同的选项内策略时，选项发现自然而然地出现，使智能体能够自主地将其行为划分为可重用的、语义上有意义的技能。对 Four-Rooms 和 Pinball 任务的实验表明，CPD 引导的智能体表现出加速收敛、更高的累积回报和显著提高的选项专业化。这些发现证实，通过变化点分割整合结构先验可以在复杂环境中产生更可解释、样本效率更高、更稳健的分层策略。

Control Synthesis with Reinforcement Learning: A Modeling Perspective

强化学习控制综合：建模视角

Authors: Nikki Xu, Hien Tran
Subjects: Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.25063
Pdf link: https://arxiv.org/pdf/2510.25063
Abstract Controllers designed with reinforcement learning can be sensitive to model mismatch. We demonstrate that controllers designed in a virtual simulation environment with an inaccurate model are not suitable for deployment in a physical setup. Controllers designed using an accurate model are robust against disturbance and small mismatch between the physical setup and the mathematical model derived from first principles, while a poor model results in a controller that performs well in simulation but fails in physical experiments. Sensitivity analysis is used to justify these discrepancies, and an empirical region of attraction estimation helps us visualize their robustness.
中文摘要 使用强化学习设计的控制器可能对模型不匹配很敏感。我们证明，在具有不准确模型的虚拟仿真环境中设计的此类控制器不适合在物理设置中部署。使用精确模型设计的控制器对干扰以及物理设置与根据第一性原理得出的数学模型之间的小幅不匹配具有鲁棒性；而糟糕的模型会导致控制器在仿真中表现良好，但在物理实验中失败。敏感性分析用于解释这些差异，经验吸引域估计则帮助我们可视化其鲁棒性。

Reasoning-Aware GRPO using Process Mining

使用流程挖掘的推理感知 GRPO

Authors: Taekhyun Park, Yongjae Lee, Hyerim Bae
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25065
Pdf link: https://arxiv.org/pdf/2510.25065
Abstract Reinforcement learning (RL)-based post-training has been crucial for enabling multi-step reasoning in large reasoning models (LRMs), yet current reward schemes are typically outcome-centric. We propose PM4GRPO, a reasoning-aware Group Relative Policy Optimization (GRPO) that augments standard answer/format rewards with signals over the reasoning procedure. To this end, process mining techniques are utilized to compute a scalar conformance reward that measures how closely a policy model's reasoning aligns with the pretrained teacher model. The empirical results on five benchmarks demonstrate that PM4GRPO significantly outperforms existing methodologies for GRPO-based post-training. These results highlight that leveraging process mining for reasoning-aware GRPO effectively enhances the reasoning capabilities of policy models.
中文摘要 基于强化学习（RL）的后训练对于在大型推理模型（LRM）中实现多步推理至关重要，但当前的奖励方案通常以结果为中心。我们提出了 PM4GRPO，这是一种推理感知的群体相对策略优化（GRPO），它通过推理过程中的信号来增强标准答案/格式奖励。为此，利用流程挖掘技术来计算标量一致性奖励，该奖励衡量策略模型的推理与预训练教师模型的一致性。五个基准的实证结果表明，PM4GRPO 明显优于基于 GRPO 的后训练的现有方法。这些结果强调，利用流程挖掘进行推理感知GRPO可以有效增强策略模型的推理能力。
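The reward design in the abstract can be sketched as follows: each sampled response in a group gets an answer reward, a format reward, and a scalar conformance reward, and GRPO then normalizes the totals within the group. The additive combination with weight `w_conf` is an assumption made for illustration, as are the function names; the abstract does not say how the three signals are combined, and the conformance values here are made up rather than produced by process mining.

```python
import statistics
from typing import List, Sequence


def combined_rewards(answer: Sequence[float],
                     fmt: Sequence[float],
                     conformance: Sequence[float],
                     w_conf: float = 1.0) -> List[float]:
    """Assumed additive mix of outcome rewards and a reasoning-conformance reward."""
    return [a + f + w_conf * c for a, f, c in zip(answer, fmt, conformance)]


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four responses to the same prompt.
rewards = combined_rewards(answer=[1, 0, 1, 0],
                           fmt=[1, 1, 1, 1],
                           conformance=[0.5, 0.5, 0.5, 0.5])
advantages = group_relative_advantages(rewards)
```

Because the conformance reward is identical across this toy group, it cancels in the normalization; it only changes the advantages when responses differ in how closely their reasoning follows the teacher.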

KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA

KnowCoder-A1：通过结果监督激励 KBQA 的代理推理能力

Authors: Zhuo Chen, Fei Wang, Zixuan Li, Zhao Zhang, Weiwei Ding, Chuanguang Yang, Yongjun Xu, Xiaolong Jin, Jiafeng Guo
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.25101
Pdf link: https://arxiv.org/pdf/2510.25101
Abstract Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via a multi-stage curriculum reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.
中文摘要 知识库问答（KBQA）旨在通过结构化知识库（KB）回答自然语言问题。最近的工作通过采用代理推理范式来改进 KBQA，其中大型语言模型（LLM）迭代分解问题，生成相应的逻辑查询，并与知识库交互以得出答案。然而，这些方法通常会根据通过过程监督合成的推理轨迹对 LLM 进行微调，这对探索的激励作用较弱，因此无法增强智能体推理能力。在本文中，我们提出了 KnowCoder-A1，这是一种可以自主地对知识库进行代理推理以获得答案的 LLM。为了激励自主探索，KnowCoder-A1 通过多阶段课程强化学习和从易到难的课程，在仅结果监督下训练 LLM。为了建立基本的代理能力，KnowCoder-A1 首先根据通过基于结果的拒绝采样获得的一小部分高质量轨迹对 LLM 进行微调。然后，为了缓解仅结果监督中固有的奖励稀疏性，它应用了多阶段课程 RL，奖励时间表从易到难。KnowCoder-A1 经过仅结果监督的训练，表现出强大的推理行为，并且在三个主流数据集中始终优于先前的方法。值得注意的是，在 GrailQA 的零样本子集上，KnowCoder-A1 实现了高达 11.1% 的相对改进，而仅使用了十二分之一的训练数据，展示了强大的智能体推理能力。

Energy-Efficient Autonomous Driving with Adaptive Perception and Robust Decision

具有自适应感知和稳健决策的节能自动驾驶

Authors: Yuyang Xia, Zibo Liang, Liwei Deng, Yan Zhao, Han Su, Kai Zheng
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25205
Pdf link: https://arxiv.org/pdf/2510.25205
Abstract Autonomous driving is an emerging technology that is expected to bring significant social, economic, and environmental benefits. However, these benefits come with rising energy consumption by computation engines, limiting the driving range of vehicles, especially electric ones. Perception computing is typically the most power-intensive component, as it relies on large-scale deep learning models to extract environmental features. Recently, numerous studies have employed model compression techniques, such as sparsification, quantization, and distillation, to reduce computational consumption. However, these methods often result in either a substantial model size or a significant drop in perception accuracy compared to high-computation models. To address these challenges, we propose an energy-efficient autonomous driving framework, called EneAD. In the adaptive perception module, a perception optimization strategy is designed from the perspective of data management and tuning. First, we manage multiple perception models with different computational consumption and adjust the execution framerate dynamically. Then, we define them as knobs and design a transferable tuning method based on Bayesian optimization to identify promising knob values that achieve low computation while maintaining desired accuracy. To adaptively switch the knob values in various traffic scenarios, a lightweight classification model is proposed to distinguish the perception difficulty in different scenarios. In the robust decision module, we propose a decision model based on reinforcement learning and design a regularization term to enhance driving stability in the face of perturbed perception results. Extensive experiments demonstrate the superiority of our framework in both energy consumption and driving performance. EneAD can reduce perception consumption by 1.9x to 3.5x and thus improve driving range by 3.9% to 8.5%.
中文摘要 自动驾驶是一项新兴技术，有望带来显著的社会、经济和环境效益。然而，这些好处伴随着计算引擎能耗的增加，限制了车辆（尤其是电动汽车）的行驶里程。感知计算通常是功耗最高的组件，因为它依赖于大规模深度学习模型来提取环境特征。最近，许多研究采用了模型压缩技术，如稀疏化、量化和蒸馏，以减少计算消耗。然而，与高计算模型相比，这些方法通常会导致模型大小过大或感知准确性显着下降。为了应对这些挑战，我们提出了一种节能的自动驾驶框架，称为 EneAD。在自适应感知模块中，从数据管理和调优的角度设计了感知优化策略。首先，我们管理多个具有不同计算消耗的感知模型，并动态调整执行帧率;然后，我们将它们定义为旋钮，并设计一种基于贝叶斯优化的可转移调谐方法，以识别有希望的旋钮值，在保持所需精度的同时实现低计算。为了在各种交通场景下自适应切换旋钮值，提出了一种轻量级分类模型来区分不同场景下的感知难度。在鲁棒决策模块中，我们提出了一种基于强化学习的决策模型，并设计了一个正则化项，以增强面对扰动感知结果时的驾驶稳定性。广泛的实验证明了我们的框架在能耗和驾驶性能方面的优越性。EneAD 可以将感知消耗减少 1.9 倍至 3.5 倍，从而将行驶里程提高 3.9% 至 8.5%

RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models

RAVR：大型语言模型的参考-答案引导变分推理

Authors: Tianqianjin Lin, Xi Zhao, Xingyao Zhang, Rujiao Long, Yi Xu, Zhuoren Jiang, Wenbo Su, Bo Zheng
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2510.25206
Pdf link: https://arxiv.org/pdf/2510.25206
Abstract Reinforcement learning (RL) can refine the reasoning abilities of large language models (LLMs), but critically depends on a key prerequisite: the LLM can already generate high-utility reasoning paths with non-negligible probability. For tasks beyond the LLM's current competence, such reasoning paths can be hard to sample, and learning risks reinforcing familiar but suboptimal reasoning. We are motivated by the insight from cognitive science that "Why is this the answer?" is often an easier question than "What is the answer?", as it avoids the heavy cognitive load of open-ended exploration, opting instead for explanatory reconstruction: systematically retracing the reasoning that links a question to its answer. We show that LLMs can similarly leverage answers to derive high-quality reasoning paths. We formalize this phenomenon and prove that conditioning on the answer increases the expected utility of sampled reasoning paths, thereby transforming intractable problems into learnable ones. Building on this insight, we introduce RAVR (Reference-Answer-guided Variational Reasoning), an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning. Experiments in both general and math domains demonstrate consistent improvements over strong baselines. We further analyze the reasoning behavior and find that RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.
中文摘要 强化学习（RL）可以完善大型语言模型（LLM）的推理能力，但关键取决于一个关键的先决条件：LLM已经可以生成具有不可忽略概率的高效用推理路径。对于超出 LLM 当前能力范围的任务，这种推理路径可能很难采样，并且学习可能会强化熟悉但次优的推理。我们受到认知科学的洞察力的激励，即“为什么这是答案”通常比“答案是什么”更容易回答，因为它避免了开放式探索的沉重认知负担，而是选择解释性重建，即系统地追溯将问题与其答案联系起来的推理。我们表明，LLM 可以类似地利用答案来推导出高质量的推理路径。我们将这种现象形式化，并证明以答案为条件可以提高采样推理路径的预期效用，从而将棘手的问题转化为可学习的问题。基于这一见解，我们引入了 RAVR（参考-答案引导变分推理），这是一个端到端框架，它使用答案条件推理作为纯问题推理的变分替代物。一般和数学领域的实验都表明，与强基线相比取得了一致的改进。我们进一步分析了推理行为，发现RAVR减少了犹豫，加强了结论巩固，并促进了推理中针对特定问题的策略。

FELA: A Multi-Agent Evolutionary System for Feature Engineering of Industrial Event Log Data

FELA：工业事件日志数据特征工程的多智能体演化系统

Authors: Kun Ouyang, Haoyu Wang, Dong Fang
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25223
Pdf link: https://arxiv.org/pdf/2510.25223
Abstract Event log data, recording fine-grained user actions and system events, represent one of the most valuable assets for modern digital services. However, the complexity and heterogeneity of industrial event logs--characterized by large scale, high dimensionality, diverse data types, and intricate temporal or relational structures--make feature engineering extremely challenging. Existing automatic feature engineering approaches, such as AutoML or genetic methods, often suffer from limited explainability, rigid predefined operations, and poor adaptability to complicated heterogeneous data. In this paper, we propose FELA (Feature Engineering LLM Agents), a multi-agent evolutionary system that autonomously extracts meaningful and high-performing features from complex industrial event log data. FELA integrates the reasoning and coding capabilities of large language models (LLMs) with an insight-guided self-evolution paradigm. Specifically, FELA employs specialized agents--Idea Agents, Code Agents, and Critic Agents--to collaboratively generate, validate, and implement novel feature ideas. An Evaluation Agent summarizes feedback and updates a hierarchical knowledge base and dual-memory system to enable continual improvement. Moreover, FELA introduces an agentic evolution algorithm, combining reinforcement learning and genetic algorithm principles to balance exploration and exploitation across the idea space. Extensive experiments on real industrial datasets demonstrate that FELA can generate explainable, domain-relevant features that significantly improve model performance while reducing manual effort. Our results highlight the potential of LLM-based multi-agent systems as a general framework for automated, interpretable, and adaptive feature engineering in complex real-world environments.
中文摘要 事件日志数据记录细粒度的用户操作和系统事件，是现代数字服务最有价值的资产之一。然而，工业事件日志的复杂性和异构性（以大规模、高维度、多样化的数据类型以及复杂的时间或关系结构为特征）使得特征工程极具挑战性。现有的自动特征工程方法，例如 AutoML 或遗传方法，通常存在可解释性有限、预定义操作僵化以及对复杂异构数据的适应性差的问题。在本文中，我们提出了FELA（Feature Engineering LLM Agents），这是一种多智能体进化系统，可以从复杂的工业事件日志数据中自主提取有意义和高性能的特征。FELA 将大型语言模型（LLM）的推理和编码能力与洞察引导的自我进化范式集成在一起。具体来说，FELA 采用专门的智能体（创意智能体、代码智能体和批评智能体）来协作生成、验证和实现新颖的特征创意。评估智能体总结反馈并更新分层知识库和双记忆系统，以实现持续改进。此外，FELA 还引入了智能体进化算法，结合强化学习和遗传算法原理，以平衡整个创意空间的探索和利用。在真实工业数据集上的大量实验表明，FELA 可以生成可解释的、与领域相关的特征，从而显著提高模型性能，同时减少人工工作。我们的研究结果强调了基于 LLM 的多智能体系统作为复杂现实环境中自动化、可解释和自适应特征工程的通用框架的潜力。

One-shot Humanoid Whole-body Motion Learning

一次性人形全身运动学习

Authors: Hao Huang, Geeta Chandra Raju Bethala, Shuaihang Yuan, Congcong Wen, Anthony Tzes, Yi Fang
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25241
Pdf link: https://arxiv.org/pdf/2510.25241
Abstract Whole-body humanoid motion represents a cornerstone challenge in robotics, integrating balance, coordination, and adaptability to enable human-like behaviors. However, existing methods typically require multiple training samples per motion category, rendering the collection of high-quality human motion datasets both labor-intensive and costly. To address this, we propose a novel approach that trains effective humanoid motion policies using only a single non-walking target motion sample alongside readily available walking motions. The core idea lies in leveraging order-preserving optimal transport to compute distances between walking and non-walking sequences, followed by interpolation along geodesics to generate new intermediate pose skeletons, which are then optimized for collision-free configurations and retargeted to the humanoid before integration into a simulated environment for policy training via reinforcement learning. Experimental evaluations on the CMU MoCap dataset demonstrate that our method consistently outperforms baselines, achieving superior performance across metrics. Code will be released upon acceptance.
中文摘要 全身人形运动代表了机器人技术的基石挑战，它整合了平衡、协调和适应性以实现类似人类的行为。然而，现有方法通常需要每个运动类别多个训练样本，这使得高质量的人体运动数据集的收集既劳动密集型又成本高昂。为了解决这个问题，我们提出了一种新方法，仅使用单个非步行目标运动样本以及现成的步行运动来训练有效的人形运动策略。核心思想在于利用保持阶的最佳传输来计算步行和非步行序列之间的距离，然后沿测地线进行插值以生成新的中间姿态骨架，然后针对无碰撞配置进行优化，并重新定位到人形生物，然后通过强化学习集成到模拟环境中进行策略训练。对 CMU MoCap 数据集的实验评估表明，我们的方法始终优于基线，在各个指标上都取得了卓越的性能。代码将在接受后发布。

The influence of the random numbers quality on the results in stochastic simulations and machine learning

随机数质量对随机模拟和机器学习结果的影响

Authors: Benjamin A. Antunes (LIRMM | DALI)
Subjects: Performance (cs.PF)
Arxiv link: https://arxiv.org/abs/2510.25269
Pdf link: https://arxiv.org/pdf/2510.25269
Abstract Pseudorandom number generators (PRNGs) are ubiquitous in stochastic simulations and machine learning (ML), where they drive sampling, parameter initialization, regularization, and data shuffling. While widely used, the potential impact of PRNG statistical quality on computational results remains underexplored. In this study, we investigate whether differences in PRNG quality, as measured by standard statistical test suites, can influence outcomes in representative stochastic applications. Seven PRNGs were evaluated, ranging from low-quality linear congruential generators (LCGs) with known statistical deficiencies to high-quality generators such as Mersenne Twister, PCG, and Philox. We applied these PRNGs to four distinct tasks: an epidemiological agent-based model (ABM), two independent from-scratch MNIST classification implementations (Python/NumPy and C++), and a reinforcement learning (RL) CartPole environment. Each experiment was repeated 30 times per generator using fixed seeds to ensure reproducibility, and outputs were compared using appropriate statistical analyses. Results show that very poor statistical quality, as in the "bad" LCG failing 125 TestU01 Crush tests, produces significant deviations in ABM epidemic dynamics, reduces MNIST classification accuracy, and severely degrades RL performance. In contrast, mid- and good-quality LCGs, despite failing a limited number of Crush or BigCrush tests, performed comparably to top-tier PRNGs in most tasks, with the RL experiment being the primary exception where performance scaled with statistical quality. Our findings indicate that, once a generator meets a sufficient statistical robustness threshold, its family or design has negligible impact on outcomes for most workloads, allowing selection to be guided by performance and implementation considerations. However, the use of low-quality PRNGs in sensitive stochastic computations can introduce substantial and systematic errors.
中文摘要 伪随机数生成器（PRNG）在随机模拟和机器学习（ML）中无处不在，它们驱动采样、参数初始化、正则化和数据洗牌。虽然被广泛使用，但 PRNG 统计质量对计算结果的潜在影响仍未得到充分探索。在这项研究中，我们调查了通过标准统计测试套件衡量的 PRNG 质量差异是否会影响代表性随机应用的结果。评估了七种PRNG，从已知统计缺陷的低质量线性同等发生器（LCG）到高质量发生器，如Mersenne Twister、PCG和Philox。我们将这些 PRNG 应用于四个不同的任务：基于流行病学代理的模型（ABM）、两个独立的从头开始的 MNIST 分类实现（Python/NumPy 和 C++）以及强化学习（RL） CartPole 环境。每个实验使用固定种子每个发生器重复 30 次，以确保可重复性，并使用适当的统计分析比较输出。结果表明，统计质量非常差，如未通过 125 次 TestU01 Crush 测试的“坏”LCG，会在 ABM 流行动态中产生显着偏差，降低 MNIST 分类准确性，并严重降低 RL 性能。相比之下，中等和优质的 LCG——尽管未能通过有限数量的 Crush 或 BigCrush 测试——在大多数任务中的表现与顶级 PRNG 相当，RL 实验是性能随统计质量成比例的主要例外。我们的研究结果表明，一旦生成器满足足够的统计鲁棒性阈值，其系列或设计对大多数工作负载的结果的影响可以忽略不计，因此可以根据性能和实现考虑来指导选择。然而，在敏感的随机计算中使用低质量的PRNG可能会引入大量系统误差。
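To make "an LCG with known statistical deficiencies" concrete, the sketch below implements a generic linear congruential generator and instantiates it with the classic RANDU parameters (a = 65539, c = 0, m = 2^31). RANDU is chosen here only as a well-known bad example; whether it is the "bad" LCG used in the paper is an assumption. Its defect is that every output is an exact linear combination of the two before it, x[n+2] = 6*x[n+1] - 9*x[n] (mod 2^31), which the code checks directly.

```python
from typing import Iterator


def lcg(seed: int, a: int, c: int, m: int) -> Iterator[int]:
    """Generic linear congruential generator: x <- (a * x + c) mod m."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x


def randu(seed: int = 1) -> Iterator[int]:
    """RANDU parameters; the seed should be odd."""
    return lcg(seed, a=65539, c=0, m=2 ** 31)


gen = randu(seed=1)
xs = [next(gen) for _ in range(1000)]

# Count triples that break the recurrence x[n+2] = 6*x[n+1] - 9*x[n] (mod 2^31).
# For RANDU none do, so consecutive triples are confined to a few planes in 3D.
violations = sum(
    (xs[i + 2] - 6 * xs[i + 1] + 9 * xs[i]) % 2 ** 31 != 0
    for i in range(len(xs) - 2)
)
```

A generator with this kind of exact low-dimensional structure fails standard test batteries, which is the sort of deficiency the abstract links to degraded simulation and RL results.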

Adaptive Design of mmWave Initial Access Codebooks using Reinforcement Learning

基于强化学习的毫米波初始访问码本的自适应设计

Authors: Sabrine Aroua, Christos Anastasios Bovolis, Bo Göransson, Anastasios Giovanidis, Mathieu Leconte, Apostolos Destounis
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2510.25271
Pdf link: https://arxiv.org/pdf/2510.25271
Abstract Initial access (IA) is the process by which user equipment (UE) establishes its first connection with a base station. In 5G systems, particularly at millimeter-wave frequencies, IA integrates beam management to support highly directional transmissions. The base station employs a codebook of beams for the transmission of Synchronization Signal Blocks (SSBs), which are periodically swept to detect and connect users. The design of this SSB codebook is critical for ensuring reliable, wide-area coverage. In current networks, SSB codebooks are meticulously engineered by domain experts. While these expert-defined codebooks provide a robust baseline, they lack flexibility in dynamic or heterogeneous environments where user distributions vary, limiting their overall effectiveness. This paper proposes a hybrid Reinforcement Learning (RL) framework for adaptive SSB codebook design. Building on top of expert knowledge, the RL agent leverages a pool of expert-designed SSB beams and learns to adaptively select or combine them based on real-time feedback. This enables the agent to dynamically tailor codebooks to the actual environment, without requiring explicit user location information, while always respecting practical beam constraints. Simulation results demonstrate that, on average, the proposed approach improves user connectivity by 10.8% compared to static expert configurations. These findings highlight the potential of combining expert knowledge with data-driven optimization to achieve more intelligent, flexible, and resilient beam management in next-generation wireless networks.
中文摘要 初始接入（IA）是用户设备（UE）与基站建立首次连接的过程。在 5G 系统中，特别是在毫米波频段，IA 集成了波束管理以支持高度定向的传输。基站采用波束码本来传输同步信号块（SSB），这些波束被周期性扫描以检测和连接用户。该 SSB 码本的设计对于确保可靠的广域覆盖至关重要。在当前的网络中，SSB 码本由领域专家精心设计。虽然这些专家定义的码本提供了稳健的基线，但它们在用户分布变化的动态或异构环境中缺乏灵活性，从而限制了整体有效性。本文提出了一种用于自适应 SSB 码本设计的混合强化学习（RL）框架。RL 代理以专家知识为基础，利用专家设计的 SSB 波束池，并学习根据实时反馈自适应地选择或组合这些波束。这使代理能够根据实际环境动态定制码本，无需显式的用户位置信息，同时始终遵守实际的波束约束。仿真结果表明，与静态专家配置相比，所提出的方法平均将用户连接性提高了 10.8%。这些发现凸显了将专家知识与数据驱动优化相结合的潜力，可在下一代无线网络中实现更智能、更灵活、更具弹性的波束管理。

Dense and Diverse Goal Coverage in Multi Goal Reinforcement Learning

多目标强化学习中的密集多样目标覆盖

Authors: Sagalpreet Singh, Rishi Saket, Aravindan Raghuveer
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25311
Pdf link: https://arxiv.org/pdf/2510.25311
Abstract Reinforcement Learning algorithms are primarily focused on learning a policy that maximizes expected return. As a result, the learned policy can exploit one or a few reward sources. However, in many natural situations, it is desirable to learn a policy that induces a dispersed marginal state distribution over rewarding states, while maximizing the expected return which is typically tied to reaching a goal state. This aspect remains relatively unexplored. Existing techniques based on entropy regularization and intrinsic rewards use stochasticity for encouraging exploration to find an optimal policy, which may not necessarily lead to a dispersed marginal state distribution over rewarding states. Other RL algorithms which match a target distribution assume the latter to be available a priori. This may be infeasible in large scale systems where enumeration of all states is not possible and a state is determined to be a goal state only upon reaching it. We formalize the problem of maximizing the expected return while uniformly visiting the goal states as Multi Goal RL, in which an oracle classifier over the state space determines the goal states. We propose a novel algorithm that learns a high-return policy mixture with marginal state distribution dispersed over the set of goal states. Our algorithm is based on optimizing a custom RL reward which is computed, based on the current policy mixture, at each iteration for a set of sampled trajectories. The latter are used via an offline RL algorithm to update the policy mixture. We prove performance guarantees for our algorithm, showing efficient convergence bounds for optimizing a natural objective which captures the expected return as well as the dispersion of the marginal state distribution over the goal states. We design and perform experiments on synthetic MDPs and standard RL environments to evaluate the effectiveness of our algorithm.
中文摘要 强化学习算法主要侧重于学习最大化预期回报的策略。因此，学习到的策略可能只利用一个或少数几个奖励来源。然而，在许多自然情形下，我们希望学到的策略能在奖励状态上诱导出分散的边际状态分布，同时最大化通常与到达目标状态相关的预期回报。这方面的研究仍相对较少。基于熵正则化和内在奖励的现有技术利用随机性来鼓励探索以找到最优策略，但这不一定会在奖励状态上产生分散的边际状态分布。其他匹配目标分布的 RL 算法则假设目标分布是先验已知的。这在大规模系统中可能不可行，因为无法枚举所有状态，而且只有在到达某个状态时才能确定它是否为目标状态。我们将在均匀访问目标状态的同时最大化预期回报的问题形式化为多目标 RL，其中由状态空间上的预言机分类器确定目标状态。我们提出了一种新颖的算法，用于学习高回报的策略混合，其边际状态分布分散在目标状态集合上。我们的算法基于优化一种自定义 RL 奖励，该奖励在每次迭代时根据当前策略混合针对一组采样轨迹计算。这些轨迹随后通过离线 RL 算法用于更新策略混合。我们证明了算法的性能保证，给出了优化一个自然目标的高效收敛界，该目标同时刻画预期回报和边际状态分布在目标状态上的分散程度。我们在合成 MDP 和标准 RL 环境中设计并进行了实验，以评估算法的有效性。

GAP: Graph-Based Agent Planning with Parallel Tool Use and Reinforcement Learning

GAP：基于图的代理规划与并行工具使用和强化学习

Authors: Jiaqi Wu, Qinlao Zhao, Zefeng Chen, Kai Qin, Yifei Zhao, Xueqian Wang, Yuhang Yao
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.25320
Pdf link: https://arxiv.org/pdf/2510.25320
Abstract Autonomous agents powered by large language models (LLMs) have shown impressive capabilities in tool manipulation for complex task-solving. However, existing paradigms such as ReAct rely on sequential reasoning and execution, failing to exploit the inherent parallelism among independent sub-tasks. This sequential bottleneck leads to inefficient tool utilization and suboptimal performance in multi-step reasoning scenarios. We introduce Graph-based Agent Planning (GAP), a novel framework that explicitly models inter-task dependencies through graph-based planning to enable adaptive parallel and serial tool execution. Our approach trains agent foundation models to decompose complex tasks into dependency-aware sub-task graphs, autonomously determining which tools can be executed in parallel and which must follow sequential dependencies. This dependency-aware orchestration achieves substantial improvements in both execution efficiency and task accuracy. To train GAP, we construct a high-quality dataset of graph-based planning traces derived from the Multi-Hop Question Answering (MHQA) benchmark. We employ a two-stage training strategy: supervised fine-tuning (SFT) on the curated dataset, followed by reinforcement learning (RL) with a correctness-based reward function on strategically sampled queries where tool-based reasoning provides maximum value. Experimental results on MHQA datasets demonstrate that GAP significantly outperforms traditional ReAct baselines, particularly on multi-step retrieval tasks, while achieving dramatic improvements in tool invocation efficiency through intelligent parallelization. The project page is available at: this https URL.
中文摘要 由大型语言模型（LLM）驱动的自主代理在利用工具操作解决复杂任务方面表现出令人印象深刻的能力。然而，现有范式（如 ReAct）依赖于顺序推理和执行，未能利用独立子任务之间固有的并行性。这种顺序瓶颈导致工具利用效率低下，并在多步骤推理场景中性能欠佳。我们引入了基于图的代理规划（GAP），这是一种新颖的框架，它通过基于图的规划显式建模任务间依赖关系，以实现自适应的并行和串行工具执行。我们的方法训练代理基础模型将复杂任务分解为依赖感知的子任务图，自主确定哪些工具可以并行执行，哪些工具必须遵循顺序依赖关系。这种依赖感知的编排在执行效率和任务准确性方面都取得了显著提升。为了训练 GAP，我们构建了一个高质量的基于图的规划轨迹数据集，该数据集源自多跳问答（MHQA）基准。我们采用两阶段训练策略：先在整理后的数据集上进行监督微调（SFT），然后在经过策略性采样、基于工具的推理能带来最大价值的查询上，使用基于正确性的奖励函数进行强化学习（RL）。MHQA 数据集上的实验结果表明，GAP 明显优于传统的 ReAct 基线，特别是在多步骤检索任务上，同时通过智能并行化显著提高了工具调用效率。项目页面位于：此 https URL。
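
The idea of dependency-aware parallel tool execution can be sketched with Python's standard `graphlib` module. This is only an illustration of grouping a sub-task graph into batches whose members have no unmet dependencies; the task names are hypothetical, and the sketch does not reproduce how GAP itself plans or trains.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def parallel_batches(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group sub-tasks into batches that could run in parallel.

    `deps` maps each sub-task to the set of sub-tasks it depends on.
    Every task in a batch has all of its dependencies in earlier batches.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock successors
    return batches


# Hypothetical multi-hop question: two independent lookups, then a comparison.
plan = {
    "search_person_a": set(),
    "search_person_b": set(),
    "compare_birth_years": {"search_person_a", "search_person_b"},
    "final_answer": {"compare_birth_years"},
}
print(parallel_batches(plan))
# [['search_person_a', 'search_person_b'], ['compare_birth_years'], ['final_answer']]
```

A purely sequential agent would need four tool-calling rounds for this plan, whereas the batched schedule needs three.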

Multi-party Agent Relation Sampling for Multi-party Ad Hoc Teamwork

多方临时团队合作的多方代理关系抽样

Authors: Beiwen Zhang, Yongheng Liang, Hejun Wu
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25340
Pdf link: https://arxiv.org/pdf/2510.25340
Abstract Multi-agent reinforcement learning (MARL) has achieved strong results in cooperative tasks but typically assumes fixed, fully controlled teams. Ad hoc teamwork (AHT) relaxes this by allowing collaboration with unknown partners, yet existing variants still presume shared conventions. We introduce Multi-party Ad Hoc Teamwork (MAHT), where controlled agents must coordinate with multiple mutually unfamiliar groups of uncontrolled teammates. To address this, we propose MARs, which builds a sparse skeleton graph and applies relational modeling to capture cross-group dynamics. Experiments on MPE and StarCraft II show that MARs outperforms MARL and AHT baselines while converging faster.
中文摘要 多智能体强化学习（MARL）在协作任务中取得了很好的成果，但通常假设团队固定且完全受控。临时团队合作（AHT）通过允许与未知伙伴协作放宽了这一假设，但现有变体仍然假定存在共享约定。我们引入了多方临时团队合作（MAHT），其中受控代理必须与多个互不熟悉的非受控队友群体进行协调。为了解决这个问题，我们提出了 MARs，它构建稀疏骨架图，并应用关系建模来捕获跨组动态。在 MPE 和 StarCraft II 上的实验表明，MARs 优于 MARL 和 AHT 基线，且收敛更快。

Sim-to-Real Gentle Manipulation of Deformable and Fragile Objects with Stress-Guided Reinforcement Learning

使用应力引导强化学习对可变形和易碎物体进行模拟到真实的温和操作

Authors: Kei Ikemura, Yifei Dong, David Blanco-Mulero, Alberta Longhini, Li Chen, Florian T. Pokorny
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2510.25405
Pdf link: https://arxiv.org/pdf/2510.25405
Abstract Robotic manipulation of deformable and fragile objects presents significant challenges, as excessive stress can lead to irreversible damage to the object. While existing solutions rely on accurate object models or specialized sensors and grippers, this adds complexity and often lacks generalization. To address this problem, we present a vision-based reinforcement learning approach that incorporates a stress-penalized reward to discourage damage to the object explicitly. In addition, to bootstrap learning, we incorporate offline demonstrations as well as a designed curriculum progressing from rigid proxies to deformables. We evaluate the proposed method in both simulated and real-world scenarios, showing that the policy learned in simulation can be transferred to the real world in a zero-shot manner, performing tasks such as picking up and pushing tofu. Our results show that the learned policies exhibit a damage-aware, gentle manipulation behavior, demonstrating their effectiveness by decreasing the stress applied to fragile objects by 36.5% while achieving the task goals, compared to vanilla RL policies.
中文摘要 机器人操纵可变形和易碎物体面临重大挑战，因为过大的应力会对物体造成不可逆的损坏。现有解决方案依赖于精确的物体模型或专门的传感器和夹持器，这增加了复杂性，并且通常缺乏泛化能力。为了解决这个问题，我们提出了一种基于视觉的强化学习方法，该方法引入应力惩罚奖励，以显式抑制对物体的损害。此外，为了引导学习，我们结合了离线演示以及一套从刚性替代物逐步过渡到可变形物体的课程设计。我们在模拟和真实场景中评估了所提方法，结果表明在模拟中学到的策略可以零样本迁移到现实世界，执行拾取和推动豆腐等任务。我们的结果表明，学习到的策略表现出具有损伤感知的温和操纵行为：与普通 RL 策略相比，在实现任务目标的同时，施加在易碎物体上的应力降低了 36.5%。

Generalized Pseudo-Relevance Feedback

广义伪相关性反馈

Authors: Yiteng Tu, Weihang Su, Yujia Zhou, Yiqun Liu, Fen Lin, Qin Liu, Qingyao Ai
Subjects: Information Retrieval (cs.IR)
Arxiv link: https://arxiv.org/abs/2510.25488
Pdf link: https://arxiv.org/pdf/2510.25488
Abstract Query rewriting is a fundamental technique in information retrieval (IR). It typically employs the retrieval result as relevance feedback to refine the query and thereby addresses the vocabulary mismatch between user queries and relevant documents. Traditional pseudo-relevance feedback (PRF) and its vector-based extension (VPRF) improve retrieval performance by leveraging top-retrieved documents as relevance feedback. However, they are constructed based on two major hypotheses: the relevance assumption (top documents are relevant) and the model assumption (rewriting methods need to be designed specifically for particular model architectures). While recent large language model (LLM)-based generative relevance feedback (GRF) enables model-free query reformulation, it either suffers from severe LLM hallucination or, again, relies on the relevance assumption to guarantee the effectiveness of rewriting quality. To overcome these limitations, we introduce an assumption-relaxed framework: Generalized Pseudo Relevance Feedback (GPRF), which performs model-free, natural language rewriting based on retrieved documents, not only eliminating the model assumption but also reducing dependence on the relevance assumption. Specifically, we design a utility-oriented training pipeline with reinforcement learning to ensure robustness against noisy feedback. Extensive experiments across multiple benchmarks and retrievers demonstrate that GPRF consistently outperforms strong baselines, establishing it as an effective and generalizable framework for query rewriting.
中文摘要 查询重写是信息检索（IR）中的一项基本技术。它通常使用检索结果作为相关性反馈来优化查询，从而解决用户查询与相关文档之间的词汇不匹配问题。传统的伪相关性反馈（PRF）及其基于向量的扩展（VPRF）通过利用排名靠前的检索文档作为相关性反馈来提高检索性能。然而，它们建立在两个主要假设之上：相关性假设（排名靠前的文档是相关的）和模型假设（重写方法需要针对特定模型架构专门设计）。虽然最近基于大型语言模型（LLM）的生成式相关性反馈（GRF）实现了与模型无关的查询重构，但它要么受到严重的 LLM 幻觉影响，要么同样依赖相关性假设来保证重写质量。为了克服这些限制，我们引入了一个放宽假设的框架：广义伪相关性反馈（GPRF），它基于检索到的文档执行与模型无关的自然语言重写，不仅消除了模型假设，还减少了对相关性假设的依赖。具体来说，我们设计了一个结合强化学习、以效用为导向的训练流程，以确保对噪声反馈的鲁棒性。在多个基准和检索器上的大量实验表明，GPRF 始终优于强基线，是一个有效且可泛化的查询重写框架。

MTIR-SQL: Multi-turn Tool-Integrated Reasoning Reinforcement Learning for Text-to-SQL

MTIR-SQL：用于文本转SQL的多轮工具集成推理强化学习

Authors: Zekun Xu, Siyu Xia, Chuhuai Yue, Jiajun Chai, Mingxue Tian, Xiaohan Wang, Wei Lin, Haoxuan Li, Guojun Yin
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25510
Pdf link: https://arxiv.org/pdf/2510.25510
Abstract As large language models (LLMs) are increasingly used in Text-to-SQL tasks, Reinforcement Learning (RL) has become a common method for improving performance. Existing methods primarily rely on static execution feedback, which restricts real-time error correction. However, integrating multi-turn tool invocation along with dynamic feedback could significantly improve adaptability and robustness, ultimately enhancing model performance. To address these issues, we propose MTIR-SQL, an innovative Multi-turn Tool-Integrated Reasoning reinforcement learning framework for Text-to-SQL. Our approach introduces an execution-aware multi-turn reasoning paradigm that seamlessly incorporates database execution feedback at each reasoning step, enabling context-sensitive query generation and progressive refinement throughout the reasoning process. The framework extends the GRPO algorithm to accommodate complex multi-turn interaction scenarios. Considering the training instability characteristics of MTIR and the potential for significant deviation of the model distribution from the initial model, we enhance the GRPO algorithm by adding a trajectory filtering mechanism and removing KL loss constraints. Experimental results demonstrate that MTIR-SQL, with 4B parameters, achieves 64.4% accuracy on BIRD Dev and 84.6% execution accuracy on SPIDER Dev, significantly outperforming existing approaches.
中文摘要 随着大型语言模型（LLM）越来越多地用于文本转 SQL 任务，强化学习（RL）已成为提高性能的常用方法。现有方法主要依赖于静态执行反馈，这限制了实时纠错。然而，将多轮工具调用与动态反馈相结合可以显著提高适应性和鲁棒性，最终提升模型性能。为了解决这些问题，我们提出了 MTIR-SQL，这是一种用于文本转 SQL 的创新多轮工具集成推理强化学习框架。我们的方法引入了一种执行感知的多轮推理范式，在每个推理步骤中无缝整合数据库执行反馈，从而在整个推理过程中实现上下文相关的查询生成和渐进式优化。该框架扩展了 GRPO 算法，以适应复杂的多轮交互场景。考虑到 MTIR 的训练不稳定性，以及模型分布可能显著偏离初始模型，我们通过添加轨迹过滤机制并移除 KL 损失约束来增强 GRPO 算法。实验结果表明，具有 4B 参数的 MTIR-SQL 在 BIRD Dev 上达到了 64.4% 的准确率，在 SPIDER Dev 上达到了 84.6% 的执行准确率，明显优于现有方法。
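
As a rough illustration of what "GRPO with trajectory filtering" could look like, the sketch below computes group-relative advantages (each reward standardized against the other rollouts sampled for the same prompt) while excluding filtered trajectories. The `keep` mask and its filtering criterion are assumptions made for this example; the abstract does not say how MTIR-SQL decides which trajectories to drop.

```python
import statistics
from typing import List, Optional


def group_relative_advantages(
    rewards: List[float],
    keep: List[bool],
    eps: float = 1e-6,
) -> List[Optional[float]]:
    """Standardize rewards within one group of rollouts for the same prompt.

    Trajectories with keep[i] == False (a hypothetical filter, e.g. a rollout
    whose tool calls were malformed) get no advantage and do not influence
    the group mean or standard deviation.
    """
    if len(rewards) != len(keep):
        raise ValueError("rewards and keep must have the same length")
    kept = [r for r, k in zip(rewards, keep) if k]
    if not kept:
        return [None] * len(rewards)
    mean = statistics.fmean(kept)
    std = statistics.pstdev(kept)  # population std over the kept rollouts
    return [
        (r - mean) / (std + eps) if k else None
        for r, k in zip(rewards, keep)
    ]


# Four rollouts for one question; the last one is filtered out.
advantages = group_relative_advantages(
    rewards=[1.0, 0.0, 1.0, 0.0],
    keep=[True, True, True, False],
)
print(advantages)
```

Correct rollouts receive a positive advantage and incorrect ones a negative advantage, and the filtered rollout contributes nothing to the update.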

Zero Reinforcement Learning Towards General Domains

对一般领域的零强化学习

Authors: Yuyuan Zeng, Yufei Huang, Can Xu, Qingfeng Sun, Jianfeng Yan, Guanghui Xu, Tao Yang, Fengzong Lian
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25528
Pdf link: https://arxiv.org/pdf/2510.25528
Abstract Zero Reinforcement Learning (Zero-RL) has proven to be an effective approach for enhancing the reasoning capabilities of large language models (LLMs) by directly applying reinforcement learning with verifiable rewards on pretrained models, without the need for a supervised fine-tuning phase. However, current research on zero-RL primarily focuses on domains with easily verifiable reward signals, such as mathematics, programming, and other reasoning tasks. The challenge of eliciting reasoning abilities in more diverse scenarios, where verification is not straightforward, remains underexplored. To address this gap, we propose a novel zero-RL paradigm designed to improve a model's reasoning ability across both verifiable and non-verifiable domains. By combining verifiable rewards with a generative reward model, we conduct multi-task zero-RL training across both domains, facilitating the transfer of reasoning capabilities between them. Furthermore, to mitigate reward hacking in the generative reward model, we design a smooth length penalty that encourages the generation of more comprehensive thinking tokens in general domains. Experimental results on Qwen3-8B-Base and Qwen3-14B-Base demonstrate that our approach achieves superior reasoning performance, not only on tasks requiring extensive reasoning but also on more general tasks.
中文摘要 零强化学习（Zero-RL）已被证明是一种有效的方法，它直接在预训练模型上应用带有可验证奖励的强化学习，无需监督微调阶段，即可增强大型语言模型（LLM）的推理能力。然而，目前对 Zero-RL 的研究主要集中在奖励信号易于验证的领域，例如数学、编程和其他推理任务。在验证并不简单的更多样化场景中激发推理能力的挑战仍未得到充分探索。为了弥补这一空白，我们提出了一种新颖的 Zero-RL 范式，旨在同时提高模型在可验证和不可验证领域的推理能力。通过将可验证奖励与生成式奖励模型相结合，我们在两类领域上进行多任务 Zero-RL 训练，促进推理能力在它们之间迁移。此外，为了缓解生成式奖励模型中的奖励欺骗（reward hacking），我们设计了一种平滑的长度惩罚，鼓励在通用领域生成更全面的思考 token。在 Qwen3-8B-Base 和 Qwen3-14B-Base 上的实验结果表明，我们的方法不仅在需要大量推理的任务上，而且在更通用的任务上都取得了优异的推理性能。

Off-policy Reinforcement Learning with Model-based Exploration Augmentation

基于模型的探索增强的策略外强化学习

Authors: Likun Wang, Xiangteng Zhang, Yinuo Wang, Guojian Zhan, Wenxuan Wang, Haoyu Gao, Jingliang Duan, Shengbo Eben Li
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25529
Pdf link: https://arxiv.org/pdf/2510.25529
Abstract Exploration is fundamental to reinforcement learning (RL), as it determines how effectively an agent discovers and exploits the underlying structure of its environment to achieve optimal performance. Existing exploration methods generally fall into two categories: active exploration and passive exploration. The former introduces stochasticity into the policy but struggles in high-dimensional environments, while the latter adaptively prioritizes transitions in the replay buffer to enhance exploration, yet remains constrained by limited sample diversity. To address the limitation in passive exploration, we propose Modelic Generative Exploration (MoGE), which augments exploration through the generation of under-explored critical states and synthesis of dynamics-consistent experiences through transition models. MoGE is composed of two components: (1) a diffusion-based generator that synthesizes critical states under the guidance of a utility function evaluating each state's potential influence on policy exploration, and (2) a one-step imagination world model for constructing critical transitions based on the critical states for agent learning. Our method adopts a modular formulation that aligns with the principles of off-policy learning, allowing seamless integration with existing algorithms to improve exploration without altering their core structures. Empirical results on OpenAI Gym and DeepMind Control Suite reveal that MoGE effectively bridges exploration and policy learning, leading to remarkable gains in both sample efficiency and performance across complex control tasks.
中文摘要 探索是强化学习（RL）的基础，因为它决定了代理发现并利用环境底层结构以实现最佳性能的效率。现有的探索方法一般分为两类：主动探索和被动探索。前者在策略中引入随机性，但在高维环境中表现不佳；后者自适应地对重放缓冲区中的转移进行优先级排序以增强探索，但仍受限于有限的样本多样性。为了解决被动探索的局限性，我们提出了 Modelic Generative Exploration（MoGE），它通过生成未被充分探索的关键状态，并借助转移模型合成与动力学一致的经验来增强探索。MoGE 由两个组件组成：（1）基于扩散的生成器，在效用函数的引导下合成关键状态，该效用函数评估每个状态对策略探索的潜在影响；（2）单步想象世界模型，用于基于关键状态构建关键转移，供代理学习。我们的方法采用符合离策略（off-policy）学习原理的模块化形式，可与现有算法无缝集成，在不改变其核心结构的情况下改进探索。在 OpenAI Gym 和 DeepMind Control Suite 上的实证结果表明，MoGE 有效地衔接了探索和策略学习，在复杂控制任务中显著提升了样本效率和性能。

Deep Reinforcement Learning-Based Cooperative Rate Splitting for Satellite-to-Underground Communication Networks

基于深度强化学习的卫星到地下通信网络协作速率分拆

Authors: Kaiqiang Lin, Kangchun Zhao, Yijie Mao
Subjects: Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP); Systems and Control (eess.SY)
Arxiv link: https://arxiv.org/abs/2510.25562
Pdf link: https://arxiv.org/pdf/2510.25562
Abstract Reliable downlink communication in satellite-to-underground networks remains challenging due to severe signal attenuation caused by underground soil and refraction in the air-soil interface. To address this, we propose a novel cooperative rate-splitting (CRS)-aided transmission framework, where an aboveground relay decodes and forwards the common stream to underground devices (UDs). Based on this framework, we formulate a max-min fairness optimization problem that jointly optimizes power allocation, message splitting, and time slot scheduling to maximize the minimum achievable rate across UDs. To solve this high-dimensional non-convex problem under uncertain channels, we develop a deep reinforcement learning solution framework based on the proximal policy optimization (PPO) algorithm that integrates distribution-aware action modeling and a multi-branch actor network. Simulation results under a realistic underground pipeline monitoring scenario demonstrate that the proposed approach achieves average max-min rate gains exceeding 167% over conventional benchmark strategies across various numbers of UDs and underground conditions.
中文摘要 由于地下土壤造成的严重信号衰减以及空气-土壤界面处的折射，卫星到地下网络中的可靠下行链路通信仍然具有挑战性。为了解决这个问题，我们提出了一种新型的协作速率分拆（CRS）辅助传输框架，其中地上中继解码公共流并将其转发到地下设备（UD）。基于该框架，我们构建了一个最大-最小公平性优化问题，联合优化功率分配、消息拆分和时隙调度，以最大化各 UD 的最小可达速率。为了在信道不确定的条件下求解这一高维非凸问题，我们开发了一种基于近端策略优化（PPO）算法的深度强化学习求解框架，该框架集成了分布感知的动作建模和多分支 actor 网络。在真实的地下管道监测场景下的仿真结果表明，在不同 UD 数量和地下条件下，所提出的方法相比传统基准策略实现了超过 167% 的平均最大-最小速率增益。

EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

EHR-R1：用于电子健康记录分析的推理增强基础语言模型

Authors: Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, Jinjie Gu, Ya Zhang, Yanfeng Wang, Yu Wang, Weidi Xie
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.25628
Pdf link: https://arxiv.org/pdf/2510.25628
Abstract Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge the gap. Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables the generation of high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench have significantly advanced the development of more reliable and clinically relevant EHR analysis.
中文摘要 电子健康记录（EHR）包含丰富而复杂的信息，其自动分析对于临床决策至关重要。尽管大型语言模型（LLM）最近在临床工作流程中取得了进展，但由于任务覆盖范围狭窄且缺乏面向 EHR 的推理能力，它们分析 EHR 的能力仍然有限。本文旨在弥合这一差距。具体来说，我们提出了 EHR-Ins，这是一个大规模、全面的 EHR 推理指令数据集，涵盖 42 个不同的 EHR 任务，包括 30 万个高质量推理案例和 400 万个非推理案例。其核心创新是一个思维图驱动的框架，能够大规模生成高质量的推理数据。在此基础上，我们开发了 EHR-R1，这是一系列为 EHR 分析量身定制的推理增强型 LLM，参数量最高达 72B。通过包括领域适应、推理增强和强化学习在内的多阶段训练范式，EHR-R1 系统地获取领域知识和多样化的推理能力，从而实现准确而稳健的 EHR 分析。最后，我们推出了 EHR-Bench，这是一个基于 MIMIC-IV 构建的新基准，涵盖 42 个任务，用于全面评估各类 EHR 场景中的推理和预测能力。实验表明，EHR-R1 始终优于最先进的商业和开源 LLM（包括 DeepSeek-V3 和 GPT-4o），在 MIMIC-Bench 上比 GPT-4o 高出 30 多分，并在 EHRSHOT 上实现了高出 10% 的零样本 AUROC。总的来说，EHR-Ins、EHR-R1 和 EHR-Bench 显著推进了更可靠、更具临床相关性的 EHR 分析的发展。

Learning to Plan & Schedule with Reinforcement-Learned Bimanual Robot Skills

学习通过强化学习的双手机器人技能进行计划和调度

Authors: Weikang Wan, Fabio Ramos, Xuning Yang, Caelan Garrett
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2510.25634
Pdf link: https://arxiv.org/pdf/2510.25634
Abstract Long-horizon contact-rich bimanual manipulation presents a significant challenge, requiring complex coordination involving a mixture of parallel execution and sequential collaboration between arms. In this paper, we introduce a hierarchical framework that frames this challenge as an integrated skill planning & scheduling problem, going beyond purely sequential decision-making to support simultaneous skill invocation. Our approach is built upon a library of single-arm and bimanual primitive skills, each trained using Reinforcement Learning (RL) in GPU-accelerated simulation. We then train a Transformer-based planner on a dataset of skill compositions to act as a high-level scheduler, simultaneously predicting the discrete schedule of skills as well as their continuous parameters. We demonstrate that our method achieves higher success rates on complex, contact-rich tasks than end-to-end RL approaches and produces more efficient, coordinated behaviors than traditional sequential-only planners.
中文摘要 长时程、接触丰富的双手操作是一项重大挑战，它需要复杂的协调，其中既有双臂的并行执行，也有顺序协作。在本文中，我们引入了一个分层框架，将这一挑战表述为一个集成的技能规划与调度问题，超越了纯粹的顺序决策，支持同时调用多个技能。我们的方法建立在单臂和双手基本技能库之上，每个技能都在 GPU 加速的仿真中使用强化学习（RL）进行训练。然后，我们在技能组合数据集上训练一个基于 Transformer 的规划器作为高层调度器，同时预测技能的离散调度及其连续参数。我们证明，与端到端 RL 方法相比，我们的方法在复杂、接触丰富的任务上取得了更高的成功率，并且比传统的纯顺序规划器产生更高效、更协调的行为。

ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents

ALDEN：用于长文档中主动导航和证据收集的强化学习

Authors: Tianyu Yang, Terry Ruas, Yijun Tian, Jan Philip Wahle, Daniel Kurzawe, Bela Gipp
Subjects: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
Arxiv link: https://arxiv.org/abs/2510.25668
Pdf link: https://arxiv.org/pdf/2510.25668
Abstract Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
中文摘要 视觉语言模型（VLM）擅长解读富含文本的图像，但难以处理冗长且视觉上复杂的文档，这类文档需要分析并整合分布在多个页面上的信息。现有方法通常依赖于固定的推理模板或僵化的流程，这迫使 VLM 扮演被动角色，并妨碍效率和泛化。我们提出了主动长文档导航（ALDEN），这是一个多轮强化学习框架，它将 VLM 微调为能够主动导航冗长且视觉丰富的文档的交互式代理。ALDEN 引入了一种新颖的 fetch（获取）操作，可通过索引直接访问页面，它补充了经典的 search（搜索）操作，并能更好地利用文档结构。为了实现密集的过程监督和高效训练，我们提出了一种基于规则的跨层级奖励，同时提供轮次级和 token 级信号。为了解决经验上观察到的、由长文档中大量视觉 token 引起的训练不稳定性，我们进一步提出了一种视觉语义锚定机制，该机制应用双路径 KL 散度约束，在训练过程中分别稳定视觉和文本表示。ALDEN 在由三个开源数据集构建的语料库上训练，在五个长文档基准上实现了最先进的性能。总体而言，ALDEN 标志着从被动文档阅读向能够在冗长且视觉丰富的文档中自主导航和推理的代理迈进了一步，为更准确、更高效的长文档理解提供了一条稳健的途径。

Navigation in a Three-Dimensional Urban Flow using Deep Reinforcement Learning

基于深度强化学习的三维城市流导航

Authors: Federica Tonti, Ricardo Vinuesa
Subjects: Artificial Intelligence (cs.AI); Fluid Dynamics (physics.flu-dyn)
Arxiv link: https://arxiv.org/abs/2510.25679
Pdf link: https://arxiv.org/pdf/2510.25679
Abstract Unmanned Aerial Vehicles (UAVs) are increasingly populating urban areas for delivery and surveillance purposes. In this work, we develop an optimal navigation strategy based on Deep Reinforcement Learning. The environment is represented by a three-dimensional high-fidelity simulation of an urban flow, characterized by turbulence and recirculation zones. The algorithm presented here is a flow-aware Proximal Policy Optimization (PPO) combined with a Gated Transformer eXtra Large (GTrXL) architecture, giving the agent richer information about the turbulent flow field in which it navigates. The results are compared with a PPO+GTrXL without the secondary prediction tasks, a PPO combined with Long Short Term Memory (LSTM) cells and a traditional navigation algorithm. The obtained results show a significant increase in the success rate (SR) and a lower crash rate (CR) compared to a PPO+LSTM, PPO+GTrXL and the classical Zermelo's navigation algorithm, paving the way to a completely reimagined UAV landscape in complex urban environments.
中文摘要 无人机（UAV）越来越多地出现在城市地区，用于配送和监视。在这项工作中，我们开发了一种基于深度强化学习的最优导航策略。环境由城市流动的三维高保真模拟表示，其特征是湍流和回流区。这里介绍的算法是流场感知的近端策略优化（PPO）与 Gated Transformer eXtra Large（GTrXL）架构的结合，为代理提供了关于其所处湍流场的更丰富信息。我们将结果与不含辅助预测任务的 PPO+GTrXL、结合长短期记忆（LSTM）单元的 PPO 以及一种传统导航算法进行了比较。结果表明，与 PPO+LSTM、PPO+GTrXL 和经典的 Zermelo 导航算法相比，成功率（SR）显著提高，坠毁率（CR）降低，为在复杂城市环境中彻底重塑无人机应用格局铺平了道路。

PairUni: Pairwise Training for Unified Multimodal Language Models

PairUni：统一多模态语言模型的成对训练

Authors: Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian, Kunpeng Qiu, Ye Tian, Haochen Wang, Zhuochen Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2510.25682
Pdf link: https://arxiv.org/pdf/2510.25682
Abstract Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: this https URL
中文摘要 统一视觉语言模型（UVLM）必须在单个架构中同时执行理解和生成，但这些任务依赖于异构的数据和监督，因此在强化学习（RL）期间很难平衡它们。我们提出了 PairUni，这是一个统一的框架，它将数据重组为理解-生成（UG）对，并相应地对齐优化过程。我们首先使用 GPT-o3 来增强单任务数据，为理解样本生成描述文本（caption），为生成样本生成问答（QA）对，从而由同一实例形成对齐对。此外，对于每个生成样本，我们检索一个语义相关的理解样本以形成检索对，将不同但相关的数据点联系起来。这些配对结构揭示了跨任务的语义对应关系，并支持一致的策略学习。为了利用这种结构，我们提出了 Pair-GPRO，这是一种基于组相对策略优化（Group Relative Policy Optimization）的配对感知变体。它为每对样本分配一个相似度分数来调节优势，从而加强对良好对齐样本的学习并减少任务干扰。我们整理了一个名为 PairUG 的高质量数据集，包含 16K 个 UG 对，用于 RL 微调，并在强大的 Janus-Pro UVLM 上评估 PairUni。我们的方法在各种 UVLM 上实现了均衡的改进，优于强大的 UVLM RL 基线。代码：此 https URL

MetaLore: Learning to Orchestrate Communication and Computation for Metaverse Synchronization

MetaLore：学习编排通信和计算以实现元宇宙同步

Authors: Elif Ebru Ohri, Qi Liao, Anastasios Giovanidis, Francesca Fossati, Nour-El-Houda Yellas
Subjects: Networking and Internet Architecture (cs.NI)
Arxiv link: https://arxiv.org/abs/2510.25705
Pdf link: https://arxiv.org/pdf/2510.25705
Abstract As augmented and virtual reality evolve, achieving seamless synchronization between physical and digital realms remains a critical challenge, especially for real-time applications where delays affect the user experience. This paper presents MetaLore, a Deep Reinforcement Learning (DRL) based framework for joint communication and computational resource allocation in Metaverse or digital twin environments. MetaLore dynamically shares the communication bandwidth and computational resources among sensors and mobile devices to optimize synchronization, while offering high throughput performance. Special treatment is given in satisfying end-to-end delay guarantees. A key contribution is the introduction of two novel Age of Information (AoI) metrics: Age of Request Information (AoRI) and Age of Sensor Information (AoSI), integrated into the reward function to enhance synchronization quality. An open source simulator has been extended to incorporate and evaluate the approach. The DRL solution is shown to achieve the performance of full-enumeration brute-force solutions by making use of a small, task-oriented observation space of two queue lengths at the network side. This allows the DRL approach the flexibility to effectively and autonomously adapt to dynamic traffic conditions.
中文摘要 随着增强现实和虚拟现实的发展，实现物理世界和数字世界之间的无缝同步仍然是一项关键挑战，特别是对于延迟会影响用户体验的实时应用。本文提出了 MetaLore，这是一个基于深度强化学习（DRL）的框架，用于在元宇宙或数字孪生环境中联合分配通信和计算资源。MetaLore 在传感器和移动设备之间动态共享通信带宽和计算资源，以优化同步，同时提供高吞吐量性能。该框架特别关注满足端到端延迟保证。一个关键贡献是引入了两个新的信息年龄（AoI）指标：请求信息年龄（AoRI）和传感器信息年龄（AoSI），它们被集成到奖励函数中以提高同步质量。我们扩展了一个开源模拟器以纳入并评估该方法。结果表明，DRL 解决方案仅利用网络侧由两个队列长度构成的小型、面向任务的观测空间，即可达到全枚举暴力求解方案的性能。这使得 DRL 方法能够灵活、有效且自主地适应动态流量条件。
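
For readers unfamiliar with Age of Information, the sketch below computes the generic AoI at a receiver: the time elapsed since the freshest update received so far was generated. It illustrates only this standard definition; the paper's AoRI and AoSI variants are not specified in the abstract, and the delivery records here are hypothetical.

```python
from typing import List, Optional, Tuple

# Each delivery is (generated_at, received_at), in seconds.
Delivery = Tuple[float, float]


def age_of_information(now: float, deliveries: List[Delivery]) -> Optional[float]:
    """Return the AoI at time `now`, or None if nothing has arrived yet.

    AoI = now - generation time of the freshest update received by `now`.
    """
    generated = [g for g, r in deliveries if r <= now]
    if not generated:
        return None
    return now - max(generated)


# Hypothetical sensor updates: the third one suffers a long network delay.
log = [(0.0, 1.0), (2.0, 2.5), (4.0, 6.0)]
print(age_of_information(3.0, log))  # 1.0 (freshest received update is from t=2.0)
print(age_of_information(5.0, log))  # 3.0 (the t=4.0 update is still in flight)
print(age_of_information(6.0, log))  # 2.0 (the t=4.0 update has just arrived)
```

The age grows linearly between deliveries and drops when a fresher update arrives, which is why penalizing it in a reward function pushes a scheduler toward timely synchronization.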

Keyword: diffusion policy

There is no result