Generated: 2025-12-29 16:34:30 (UTC+8); arXiv release time: 2025-12-29 20:00 EST (2025-12-30 09:00 UTC+8)
18 relevant papers today
Keyword: reinforcement learning
CosmoCore-Evo: Evolutionary Dream-Replay Reinforcement Learning for Adaptive Code Generation
CosmoCore-Evo:进化梦境重放强化学习用于自适应代码生成
- Authors: Santhosh Kumar Ravindran
- Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
- Arxiv link: https://arxiv.org/abs/2512.21351
- Pdf link: https://arxiv.org/pdf/2512.21351
- Abstract
Building on the affective dream-replay reinforcement learning framework of CosmoCore, we introduce CosmoCore-Evo, an extension that incorporates evolutionary algorithms to enhance adaptability and novelty in code generation tasks. Inspired by anthropological aspects of human evolution, such as natural selection and adaptation in early hominids, CosmoCore-Evo treats RL trajectories as "genomes" that undergo mutation and selection during the nocturnal replay phase. This mechanism allows agents to break free from trained patterns, fostering emergent behaviors and improved performance in distribution-shifted environments, such as changing APIs or novel libraries. We augment the Dream Queue with evolutionary operations, including mutation of high-fitness trajectories and enterprise-tuned fitness functions that incorporate efficiency, compliance, and scalability metrics. Evaluated on extended benchmarks including HumanEval variants with shifts, BigCodeBench, and a custom PySpark pipeline simulation, CosmoCore-Evo achieves up to 35% higher novelty in solutions and 25% faster adaptation compared to the original CosmoCore and baselines like PPO and REAMER. Ablations confirm the role of evolutionary components in bridging the sentient gap for LLM agents. Code for replication, including a toy simulation, is provided.
- 中文摘要
基于CosmoCore的情感梦境重放强化学习框架,我们提出CosmoCore-Evo,这是一个引入进化算法以增强代码生成任务中适应性和新颖性的扩展。受人类进化的人类学因素(如早期人族的自然选择与适应)启发,CosmoCore-Evo将强化学习轨迹视为“基因组”,在夜间重放阶段经历变异和选择。该机制使智能体能够摆脱已训练的模式,促进涌现行为,并在分布偏移环境(如API变更或新库)中提升性能。我们用进化操作增强梦境队列(Dream Queue),包括对高适应度轨迹的变异,以及纳入效率、合规性和可扩展性指标、面向企业场景调优的适应度函数。在扩展基准(包括带分布偏移的HumanEval变体、BigCodeBench和定制的PySpark流水线模拟)上的评估表明,与原始CosmoCore以及PPO、REAMER等基线相比,CosmoCore-Evo的解决方案新颖性最高提升35%,适应速度提升25%。消融实验证实了进化组件在弥合LLM智能体“感知鸿沟”(sentient gap)方面的作用。我们提供了用于复现的代码,其中包括一个玩具模拟。
A Reinforcement Learning Approach to Synthetic Data Generation
合成数据生成的强化学习方法
- Authors: Natalia Espinosa-Dice, Nicholas J. Jackson, Chao Yan, Aaron Lee, Bradley A. Malin
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.21395
- Pdf link: https://arxiv.org/pdf/2512.21395
- Abstract
Synthetic data generation (SDG) is a promising approach for enabling data sharing in biomedical studies while preserving patient privacy. Yet, state-of-the-art generative models often require large datasets and complex training procedures, limiting their applicability in small-sample settings. In this work, we reframe SDG as a reinforcement learning (RL) problem and introduce RLSyn, a novel framework that models the data generator as a stochastic policy over patient records and optimizes it using Proximal Policy Optimization with discriminator-derived rewards, yielding more stable and data-efficient training. We evaluate RLSyn on two biomedical datasets, AI-READI and MIMIC-IV, and benchmark it against state-of-the-art generative adversarial networks (GANs) and diffusion-based methods across extensive privacy, utility, and fidelity evaluations. RLSyn performs comparably to diffusion models and outperforms GANs on MIMIC-IV, while outperforming both diffusion models and GANs on the smaller AI-READI dataset. These results demonstrate that reinforcement learning provides a principled and effective alternative for synthetic biomedical data generation, particularly in data-scarce regimes.
- 中文摘要
合成数据生成(SDG)是一种有前景的方法,可以在保护患者隐私的同时实现生物医学研究中的数据共享。然而,最先进的生成模型通常需要大量数据集和复杂的训练程序,限制了其在小样本环境中的适用性。在本研究中,我们将SDG重新定义为强化学习(RL)问题,并引入了RLSyn,这是一个新颖框架,将数据生成器建模为患者记录上的随机策略,并利用近端策略优化与判别器衍生的奖励进行优化,从而实现更稳定、数据高效的训练。我们基于两个生物医学数据集——AI-READI和MIMIC-IV——评估RLSyn,并将其与最先进的生成对抗网络(GAN)和基于扩散的方法进行基准测试,涵盖广泛的隐私、效用和保真度评估。RL-Syn在MIMIC-IV上表现与扩散模型相当,且优于GANs,同时在较小的AI-READI数据集上也优于扩散模型和GANs。这些结果表明,强化学习为合成生物医学数据生成提供了一种原则性且有效的替代方案,尤其是在数据稀缺的环境中。
A Survey of Freshness-Aware Wireless Networking with Reinforcement Learning
面向新鲜度感知无线网络的强化学习综述
- Authors: Alimu Alibotaiken, Suyang Wang, Oluwaseun T. Ajayi, Yu Cheng
- Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2512.21412
- Pdf link: https://arxiv.org/pdf/2512.21412
- Abstract
The age of information (AoI) has become a central measure of data freshness in modern wireless systems, yet existing surveys either focus on classical AoI formulations or provide broad discussions of reinforcement learning (RL) in wireless networks without addressing freshness as a unified learning problem. Motivated by this gap, this survey examines RL specifically through the lens of AoI and generalized freshness optimization. We organize AoI and its variants into native, function-based, and application-oriented families, providing a clearer view of how freshness should be modeled in B5G and 6G systems. Building on this foundation, we introduce a policy-centric taxonomy that reflects the decisions most relevant to freshness, consisting of update-control RL, medium-access RL, risk-sensitive RL, and multi-agent RL. This structure provides a coherent framework for understanding how learning can support sampling, scheduling, trajectory planning, medium access, and distributed coordination. We further synthesize recent progress in RL-driven freshness control and highlight open challenges related to delayed decision processes, stochastic variability, and cross-layer design. The goal is to establish a unified foundation for learning-based freshness optimization in next-generation wireless networks.
- 中文摘要
信息年龄(AoI)已成为衡量现代无线系统数据新鲜度的核心指标,然而现有综述要么聚焦于经典的AoI表述,要么宽泛地讨论无线网络中的强化学习(RL),而未将新鲜度作为统一的学习问题来处理。针对这一空白,本综述专门从AoI和广义新鲜度优化的视角审视强化学习。我们将AoI及其变体归纳为原生型、基于函数型和面向应用型三类,更清晰地展示了B5G和6G系统中应如何对新鲜度建模。在此基础上,我们引入以策略为中心的分类法,反映与新鲜度最相关的决策,包括更新控制强化学习、介质访问强化学习、风险敏感强化学习和多智能体强化学习。这一结构为理解学习如何支持采样、调度、轨迹规划、介质访问和分布式协调提供了连贯的框架。我们进一步综合了强化学习驱动的新鲜度控制的最新进展,并指出与延迟决策过程、随机波动性和跨层设计相关的开放挑战。目标是为下一代无线网络中基于学习的新鲜度优化建立统一的基础。
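As a rough illustration of the age-of-information metric this survey is organized around, the sketch below computes instantaneous AoI under the standard definition: the current time minus the generation time of the freshest update received so far. The function name `age_of_information` and the delivery log are hypothetical and are not taken from the survey.

```python
def age_of_information(t, deliveries):
    """Instantaneous AoI at time t.

    deliveries: list of (generation_time, delivery_time) pairs (hypothetical log).
    Returns None if no update has been delivered by time t.
    """
    # Only updates that have actually arrived by time t count.
    received = [gen for gen, dlv in deliveries if dlv <= t]
    if not received:
        return None
    # Age is measured from the generation time of the freshest received update.
    return t - max(received)


# Made-up log: three updates as (generated, delivered).
updates = [(0.0, 1.0), (2.0, 2.5), (3.0, 6.0)]
print(age_of_information(5.0, updates))
```

An RL scheduler in this setting would typically use a function of this age (for example its time average) as a cost signal; that choice is an assumption here, not a claim about any specific method in the survey.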
dUltra: Ultra-Fast Diffusion Language Models via Reinforcement Learning
dUltra:通过强化学习实现超快速扩散语言模型
- Authors: Shirui Chen, Jiantao Jiao, Lillian J. Ratliff, Banghua Zhu
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.21446
- Pdf link: https://arxiv.org/pdf/2512.21446
- Abstract
Masked diffusion language models (MDLMs) offer the potential for parallel token generation, but most open-source MDLMs decode fewer than 5 tokens per model forward pass even with sophisticated sampling strategies. As a result, their sampling speeds are often comparable to AR + speculative decoding schemes, limiting their advantage over mainstream autoregressive approaches. Existing distillation-based accelerators (dParallel, d3LLM) finetune MDLMs on trajectories generated by a base model, which can become off-policy during finetuning and restrict performance to the quality of the base model's samples. We propose dUltra, an on-policy reinforcement learning framework based on Group Relative Policy Optimization (GRPO) that learns unmasking strategies for efficient parallel decoding. dUltra introduces an unmasking planner head that predicts per-token unmasking likelihoods under independent Bernoulli distributions. We jointly optimize the base diffusion LLM and the unmasking order planner using reward signals combining verifiable reward, distillation reward, and the number of unmasking steps. Across mathematical reasoning and code generation tasks, dUltra improves the accuracy-efficiency trade-off over state-of-the-art heuristic and distillation baselines, moving towards achieving "diffusion supremacy" over autoregressive models.
- 中文摘要
掩码扩散语言模型(MDLM)具备并行生成token的潜力,但大多数开源MDLM即使采用复杂的采样策略,每次模型前向传播解码的token也少于5个。因此,它们的采样速度通常只与自回归(AR)加推测解码方案相当,限制了其相较于主流自回归方法的优势。现有基于蒸馏的加速方法(dParallel、d3LLM)在基础模型生成的轨迹上微调MDLM,这些轨迹在微调过程中可能变为离策略(off-policy)数据,并使性能受限于基础模型样本的质量。我们提出dUltra,这是一个基于组相对策略优化(GRPO)的同策略(on-policy)强化学习框架,用于学习实现高效并行解码的去掩码策略。dUltra引入了一个去掩码规划头,在独立伯努利分布下预测每个token的去掩码概率。我们联合优化基础扩散LLM和去掩码顺序规划器,所用奖励信号结合了可验证奖励、蒸馏奖励和去掩码步数。在数学推理和代码生成任务上,dUltra相较于最先进的启发式和蒸馏基线改善了准确率与效率之间的权衡,朝着实现对自回归模型的“扩散霸权”(diffusion supremacy)迈进。
DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO
DiverseGRPO:通过多样性感知GRPO缓解图像生成中的模式崩溃
- Authors: Henglin Liu, Huijuan Huang, Jing Wang, Chang Liu, Xiu Li, Xiangyang Ji
- Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.21514
- Pdf link: https://arxiv.org/pdf/2512.21514
- Abstract
Reinforcement learning (RL), particularly GRPO, improves image generation quality significantly by comparing the relative performance of images generated within the same group. However, in the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity, which restricts its application scenarios. This issue can be analyzed from both reward modeling and generation dynamics perspectives. First, traditional GRPO relies on single-sample quality as the reward signal, driving the model to converge toward a few high-reward generation modes while neglecting distribution-level diversity. Second, conventional GRPO regularization neglects the dominant role of early-stage denoising in preserving diversity, causing a misaligned regularization budget that limits the achievable quality-diversity trade-off. Motivated by these insights, we revisit the diversity degradation problem from both reward modeling and generation dynamics. At the reward level, we propose a distributional creativity bonus based on semantic grouping. Specifically, we construct a distribution-level representation via spectral clustering over samples generated from the same caption, and adaptively allocate exploratory rewards according to group sizes to encourage the discovery of novel visual modes. At the generation level, we introduce a structure-aware regularization, which enforces stronger early-stage constraints to preserve diversity without compromising reward optimization efficiency. Experiments demonstrate that our method achieves a 13% to 18% improvement in semantic diversity under matched quality scores, establishing a new Pareto frontier between image quality and diversity for GRPO-based image generation.
- 中文摘要
强化学习(RL),特别是GRPO,通过比较同一组内生成图像的相对表现,显著提升了图像生成质量。然而,在训练后期,模型往往产生同质化输出,缺乏创造力和视觉多样性,这限制了其应用场景。这个问题可以从奖励建模和生成动态两个角度进行分析。首先,传统GRPO依赖单样本质量作为奖励信号,推动模型收敛到少数高奖励生成模式,而忽视了分布层面的多样性。其次,传统的GRPO正则化忽视了早期去噪阶段在保持多样性中的主导作用,导致正则化预算分配失当,限制了可实现的质量与多样性权衡。基于这些见解,我们从奖励建模和生成动态两方面重新审视多样性退化问题。在奖励层面,我们提出基于语义分组的分布层面创造力奖励。具体来说,我们对同一文本描述(caption)生成的样本进行谱聚类以构建分布级表示,并根据各组规模自适应地分配探索性奖励,以鼓励发现新的视觉模式。在生成层面,我们引入结构感知正则化,在早期阶段施加更强的约束,以保持多样性而不牺牲奖励优化效率。实验表明,在质量评分相当的条件下,我们的方法使语义多样性提升了13%至18%,为基于GRPO的图像生成确立了图像质量与多样性之间新的帕累托前沿。
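For readers unfamiliar with the "relative performance within the same group" idea that GRPO relies on, here is a minimal sketch of the usual group-normalized advantage. It shows only the baseline GRPO normalization, not the diversity bonus or structure-aware regularization proposed in this paper. The function name, the use of the population standard deviation, and the `eps` value are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of samples generated from the same prompt.

    Each sample's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three images generated from one caption.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Because only per-sample quality enters this signal, several near-identical high-reward samples all receive positive advantages, which is the homogenization pressure the abstract describes.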
Generative Actor Critic
生成式演员-评论家
- Authors: Aoyang Qin, Deqian Kong, Wei Wang, Ying Nian Wu, Song-Chun Zhu, Sirui Xie
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.21527
- Pdf link: https://arxiv.org/pdf/2512.21527
- Abstract
Conventional Reinforcement Learning (RL) algorithms, typically focused on estimating or maximizing expected returns, face challenges when refining offline pretrained models with online experiences. This paper introduces Generative Actor Critic (GAC), a novel framework that decouples sequential decision-making by reframing policy evaluation as learning a generative model of the joint distribution over trajectories and returns, $p(\tau, y)$, and policy improvement as performing versatile inference on this learned model. To operationalize GAC, we introduce a specific instantiation based on a latent variable model that features continuous latent plan vectors. We develop novel inference strategies for both exploitation, by optimizing latent plans to maximize expected returns, and exploration, by sampling latent plans conditioned on dynamically adjusted target returns. Experiments on Gym-MuJoCo and Maze2D benchmarks demonstrate GAC's strong offline performance and significantly enhanced offline-to-online improvement compared to state-of-the-art methods, even in the absence of step-wise rewards.
- 中文摘要
传统的强化学习(RL)算法通常侧重于估计或最大化期望回报,在利用在线经验改进离线预训练模型时面临挑战。本文提出生成式演员-评论家(GAC),这是一个新颖框架,它对序列决策进行解耦:将策略评估(policy evaluation)重新表述为学习轨迹与回报联合分布 $p(\tau, y)$ 的生成模型,将策略改进(policy improvement)重新表述为在该已学习模型上进行灵活的推断。为了实现GAC,我们给出了一个基于潜变量模型的具体实例,该模型具有连续的潜在计划向量。我们开发了新的推断策略:在利用(exploitation)方面,通过优化潜在计划来最大化期望回报;在探索(exploration)方面,根据动态调整的目标回报对潜在计划进行条件采样。在Gym-MuJoCo和Maze2D基准上的实验表明,即使没有逐步奖励,GAC也具有强劲的离线性能,并且其离线到在线的提升显著优于最先进方法。
Leash: Adaptive Length Penalty and Reward Shaping for Efficient Large Reasoning Model
牵引绳:自适应长度惩罚与奖励塑造以实现高效大型推理模型
- Authors: Yanhao Li, Lu Ma, Jiaran Zhang, Lexiang Tang, Wentao Zhang, Guibo Luo
- Subjects: Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.21540
- Pdf link: https://arxiv.org/pdf/2512.21540
- Abstract
Existing approaches typically rely on fixed length penalties, but such penalties are hard to tune and fail to adapt to the evolving reasoning abilities of LLMs, leading to suboptimal trade-offs between accuracy and conciseness. To address this challenge, we propose Leash (adaptive LEngth penAlty and reward SHaping), a reinforcement learning framework for efficient reasoning in LLMs. We formulate length control as a constrained optimization problem and employ a Lagrangian primal-dual method to dynamically adjust the penalty coefficient. When generations exceed the target length, the penalty is intensified; when they are shorter, it is relaxed. This adaptive mechanism guides models toward producing concise reasoning without sacrificing task performance. Experiments on Deepseek-R1-Distill-Qwen-1.5B and Qwen3-4B-Thinking-2507 show that Leash reduces the average reasoning length by 60% across diverse tasks - including in-distribution mathematical reasoning and out-of-distribution domains such as coding and instruction following - while maintaining competitive performance. Our work thus presents a practical and effective paradigm for developing controllable and efficient LLMs that balance reasoning capabilities with computational budgets.
- 中文摘要
现有方法通常依赖固定的长度惩罚,但这种惩罚难以调节,且无法适应大型语言模型不断演变的推理能力,导致准确性和简洁性之间的权衡欠佳。为应对这一挑战,我们提出了Leash(adaptive LEngth penAlty and reward SHaping,自适应长度惩罚与奖励塑造),这是一种面向LLM高效推理的强化学习框架。我们将长度控制表述为一个约束优化问题,并采用拉格朗日原始-对偶方法动态调整惩罚系数。当生成内容超过目标长度时,惩罚会加强;当生成内容较短时,惩罚会放松。这种自适应机制引导模型在不牺牲任务表现的前提下产生简洁的推理。在Deepseek-R1-Distill-Qwen-1.5B和Qwen3-4B-Thinking-2507上的实验显示,Leash在多种任务(包括分布内的数学推理以及分布外的编程和指令跟随等领域)上将平均推理长度减少了60%,同时保持了有竞争力的性能。因此,我们的工作为开发可控且高效、能在推理能力与计算预算之间取得平衡的大型语言模型提供了一种实用且有效的范式。
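The adaptive penalty described above (intensify when generations exceed the target length, relax when they are shorter) can be sketched as a projected dual-ascent step on a Lagrange multiplier. This is a generic primal-dual illustration, not the paper's exact update. The function names, the step size, and the normalized-excess form of the shaped reward are all assumptions.

```python
def update_penalty(lam, mean_length, target_length, step_size=1e-3):
    """One projected dual-ascent step on the length-penalty coefficient.

    The coefficient grows when the batch's mean length exceeds the target
    and shrinks (never below zero) when it falls short.
    """
    violation = mean_length - target_length  # > 0 when generations are too long
    return max(0.0, lam + step_size * violation)


def shaped_reward(task_reward, length, target_length, lam):
    """Hypothetical shaping: task reward minus lam times normalized excess length."""
    return task_reward - lam * (length - target_length) / target_length


# Made-up numbers: mean length 1200 tokens against a 1000-token target.
lam = update_penalty(0.5, mean_length=1200, target_length=1000)
print(lam, shaped_reward(1.0, length=1200, target_length=1000, lam=lam))
```

In a training loop, the policy update (primal step) would maximize the shaped reward while `update_penalty` is called once per batch (dual step).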
Towards Learning-Based Formula 1 Race Strategies
迈向基于学习的一级方程式比赛策略
- Authors: Giona Fieni, Joschua Wüthrich, Marc-Philippe Neumann, Mohammad M. Moradi, Christopher H. Onder
- Subjects: Systems and Control (eess.SY)
- Arxiv link: https://arxiv.org/abs/2512.21570
- Pdf link: https://arxiv.org/pdf/2512.21570
- Abstract
This paper presents two complementary frameworks to optimize Formula 1 race strategies, jointly accounting for energy allocation, tire wear and pit stop timing. First, the race scenario is modeled using lap time maps and a dynamic tire wear model capturing the main trade-offs arising during a race. Then, we solve the problem by means of a mixed-integer nonlinear program that handles the integer nature of the pit stop decisions. The same race scenario is embedded into a reinforcement learning environment, on which an agent is trained. Providing fast inference at runtime, this method is suited to improve human decision-making during real races. The learned policy's suboptimality is assessed with respect to the optimal solution, both in a nominal scenario and with an unforeseen disturbance. In both cases, the agent achieves approximately 5s of suboptimality on 1.5h of race time, mainly attributable to the different energy allocation strategy. This work lays the foundations for learning-based race strategies and provides a benchmark for future developments.
- 中文摘要
本文提出了两个互补的框架来优化一级方程式比赛策略,同时考虑能量分配、轮胎磨损和进站时机。首先,利用圈速图和动态轮胎磨损模型对比赛场景建模,以刻画比赛中出现的主要权衡。然后,我们通过混合整数非线性规划求解该问题,该规划能够处理进站决策的整数特性。同一比赛场景还被嵌入到强化学习环境中,并在其上训练智能体。该方法在运行时推理速度快,适合在真实比赛中辅助改进人类决策。我们在标称场景和存在未预见扰动的场景下,相对于最优解评估了所学策略的次优性。在这两种情况下,智能体在1.5小时的比赛时间内约有5秒的次优差距,主要归因于不同的能量分配策略。这项工作为基于学习的比赛策略奠定了基础,并为未来发展提供了基准。
Videos are Sample-Efficient Supervisions: Behavior Cloning from Videos via Latent Representations
视频是样本高效的监督信号:通过潜在表示从视频中进行行为克隆
- Authors: Xin Liu, Haoran Li, Dongbin Zhao
- Subjects: Machine Learning (cs.LG)
- Arxiv link: https://arxiv.org/abs/2512.21586
- Pdf link: https://arxiv.org/pdf/2512.21586
- Abstract
Humans can efficiently extract knowledge and learn skills from the videos within only a few trials and errors. However, it poses a big challenge to replicate this learning process for autonomous agents, due to the complexity of visual input, the absence of action or reward signals, and the limitations of interaction steps. In this paper, we propose a novel, unsupervised, and sample-efficient framework to achieve imitation learning from videos (ILV), named Behavior Cloning from Videos via Latent Representations (BCV-LR). BCV-LR extracts action-related latent features from high-dimensional video inputs through self-supervised tasks, and then leverages a dynamics-based unsupervised objective to predict latent actions between consecutive frames. The pre-trained latent actions are fine-tuned and efficiently aligned to the real action space online (with collected interactions) for policy behavior cloning. The cloned policy in turn enriches the agent experience for further latent action finetuning, resulting in an iterative policy improvement that is highly sample-efficient. We conduct extensive experiments on a set of challenging visual tasks, including both discrete control and continuous control. BCV-LR enables effective (even expert-level on some tasks) policy performance with only a few interactions, surpassing state-of-the-art ILV baselines and reinforcement learning methods (provided with environmental rewards) in terms of sample efficiency across 24/28 tasks. To the best of our knowledge, this work for the first time demonstrates that videos can support extremely sample-efficient visual policy learning, without the need to access any other expert supervision.
- 中文摘要
人类只需几次试错,就能高效地从视频中提取知识并学习技能。然而,由于视觉输入复杂、缺乏动作或奖励信号以及交互步数受限,让自主智能体复现这一学习过程面临巨大挑战。本文提出了一种新颖、无监督且样本高效的框架,用于实现从视频中模仿学习(ILV),名为通过潜在表示从视频中进行行为克隆(BCV-LR)。BCV-LR通过自监督任务从高维视频输入中提取与动作相关的潜在特征,然后利用基于动力学的无监督目标预测连续帧之间的潜在动作。预训练得到的潜在动作在线(利用收集到的交互)进行微调,并高效地对齐到真实动作空间,用于策略行为克隆。克隆得到的策略反过来丰富了智能体的经验,用于进一步的潜在动作微调,从而实现样本效率很高的迭代式策略改进。我们在一组具有挑战性的视觉任务上进行了广泛实验,包括离散控制和连续控制。BCV-LR仅需少量交互即可实现有效(在部分任务上甚至达到专家级)的策略性能,在28个任务中的24个上,样本效率超过了最先进的ILV基线和(提供环境奖励的)强化学习方法。据我们所知,这项工作首次证明视频可以支持样本效率极高的视觉策略学习,而无需任何其他专家监督。
Rethinking Sample Polarity in Reinforcement Learning with Verifiable Rewards
重新思考带有可验证奖励的强化学习中的样本极性
- Authors: Xinyu Tang, Yuliang Zhan, Zhixun Li, Wayne Xin Zhao, Zhenduo Zhang, Zujie Wen, Zhiqiang Zhang, Jun Zhou
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2512.21625
- Pdf link: https://arxiv.org/pdf/2512.21625
- Abstract
Large reasoning models (LRMs) are typically trained using reinforcement learning with verifiable reward (RLVR) to enhance their reasoning abilities. In this paradigm, policies are updated using both positive and negative self-generated rollouts, which correspond to distinct sample polarities. In this paper, we provide a systematic investigation into how these sample polarities affect RLVR training dynamics and behaviors. We find that positive samples sharpen existing correct reasoning patterns, while negative samples encourage exploration of new reasoning paths. We further explore how adjusting the advantage values of positive and negative samples at both the sample level and the token level affects RLVR training. Based on these insights, we propose an Adaptive and Asymmetric token-level Advantage shaping method for Policy Optimization, namely A3PO, that more precisely allocates advantage signals to key tokens across different polarities. Experiments across five reasoning benchmarks demonstrate the effectiveness of our approach.
- 中文摘要
大型推理模型(LRM)通常通过带可验证奖励的强化学习(RLVR)进行训练,以增强推理能力。在这种范式中,策略利用自生成的正向和负向采样轨迹(rollout)进行更新,它们对应不同的样本极性。本文系统地研究了这些样本极性如何影响RLVR的训练动态和行为。我们发现,正样本会强化已有的正确推理模式,而负样本则鼓励探索新的推理路径。我们还进一步探讨了在样本级和token级调整正负样本的优势值如何影响RLVR训练。基于这些见解,我们提出了一种自适应且非对称的token级优势塑造策略优化方法,即A3PO,它能更精确地将优势信号分配给不同极性下的关键token。在五个推理基准上的实验证明了我们方法的有效性。
Variance-Aware Prior-Based Tree Policies for Monte Carlo Tree Search
面向蒙特卡洛树搜索的方差感知先验树策略
- Authors: Maximilian Weichart
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.21648
- Pdf link: https://arxiv.org/pdf/2512.21648
- Abstract
Monte Carlo Tree Search (MCTS) has profoundly influenced reinforcement learning (RL) by integrating planning and learning in tasks requiring long-horizon reasoning, exemplified by the AlphaZero family of algorithms. Central to MCTS is the search strategy, governed by a tree policy based on an upper confidence bound (UCB) applied to trees (UCT). A key factor in the success of AlphaZero is the introduction of a prior term in the UCB1-based tree policy PUCT, which improves exploration efficiency and thus accelerates training. While many alternative UCBs with stronger theoretical guarantees than UCB1 exist, extending them to prior-based UCTs has been challenging, since PUCT was derived empirically rather than from first principles. Recent work retrospectively justified PUCT by framing MCTS as a regularized policy optimization (RPO) problem. Building on this perspective, we introduce Inverse-RPO, a general methodology that systematically derives prior-based UCTs from any prior-free UCB. Applying this method to the variance-aware UCB-V, we obtain two new prior-based tree policies that incorporate variance estimates into the search. Experiments indicate that these variance-aware prior-based UCTs outperform PUCT across multiple benchmarks without incurring additional computational cost. We also provide an extension of the mctx library supporting variance-aware UCTs, showing that the required code changes are minimal and intended to facilitate further research on principled prior-based UCTs. Code: this http URL.
- 中文摘要
蒙特卡洛树搜索(MCTS)通过在需要长程推理的任务中整合规划与学习,深刻影响了强化学习(RL),AlphaZero系列算法即为其代表。MCTS的核心是搜索策略,它由一种树策略控制,该策略基于应用于树的置信上界(UCB),即UCT。AlphaZero成功的一个关键因素是在基于UCB1的树策略PUCT中引入了先验项,这提高了探索效率,从而加快了训练。虽然存在许多理论保证强于UCB1的其他UCB,但将它们推广到基于先验的UCT一直存在困难,因为PUCT是经验性地得出的,而非从第一性原理推导而来。近期研究通过将MCTS表述为正则化策略优化(RPO)问题,事后论证了PUCT的合理性。基于这一视角,我们提出Inverse-RPO,这是一种通用方法,可以系统地从任意无先验的UCB推导出基于先验的UCT。将该方法应用于方差感知的UCB-V,我们得到了两个新的基于先验的树策略,它们将方差估计纳入搜索。实验表明,这些方差感知的基于先验的UCT在多个基准上优于PUCT,且不产生额外的计算成本。我们还提供了支持方差感知UCT的mctx库扩展,表明所需的代码改动极少,旨在促进对有理论依据的基于先验的UCT的进一步研究。代码:this http URL。
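To make the "prior term" in PUCT concrete, the sketch below implements the commonly cited AlphaZero-style selection score, Q + c * P * sqrt(N_parent) / (1 + N_child). It illustrates the baseline PUCT rule only, not the variance-aware policies derived in this paper, and it does not use the mctx API. The function names and the `c_puct` default are assumptions.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.25):
    """AlphaZero-style PUCT score: value estimate plus a prior-weighted exploration bonus."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def select_action(q_values, priors, visit_counts, c_puct=1.25):
    """Return the index of the child with the highest PUCT score."""
    parent_visits = sum(visit_counts)
    scores = [
        puct_score(q, p, parent_visits, n, c_puct)
        for q, p, n in zip(q_values, priors, visit_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up statistics: equal values and priors, but child 0 is already heavily visited.
print(select_action([0.5, 0.5], [0.5, 0.5], [10, 0]))
```

The exploration bonus shrinks as a child accumulates visits, so the unvisited child is preferred in the example even though both have the same value estimate.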
Multiconnectivity for SAGIN: Current Trends, Challenges, AI-driven Solutions, and Opportunities
面向SAGIN的多连接:当前趋势、挑战、AI驱动的解决方案与机遇
- Authors: Abd Ullah Khan, Adnan Shahid, Haejoon Jung, Hyundong Shin
- Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
- Arxiv link: https://arxiv.org/abs/2512.21717
- Pdf link: https://arxiv.org/pdf/2512.21717
- Abstract
Space-air-ground-integrated network (SAGIN)-enabled multiconnectivity (MC) is emerging as a key enabler for next-generation networks, enabling users to simultaneously utilize multiple links across multi-layer non-terrestrial networks (NTN) and multi-radio access technology (multi-RAT) terrestrial networks (TN). However, the heterogeneity of TN and NTN introduces complex architectural challenges that complicate MC implementation. Specifically, the diversity of link types, spanning air-to-air, air-to-space, space-to-space, space-to-ground, and ground-to-ground communications, renders optimal resource allocation highly complex. Recent advancements in reinforcement learning (RL) and agentic artificial intelligence (AI) have shown remarkable effectiveness in optimal decision-making in complex and dynamic environments. In this paper, we review the current developments in SAGIN-enabled MC and outline the key challenges associated with its implementation. We further highlight the transformative potential of AI-driven approaches for resource optimization in a heterogeneous SAGIN environment. To this end, we present a case study on resource allocation optimization enabled by agentic RL for SAGIN-enabled MC involving diverse radio access technologies (RATs). Results show that learning-based methods can effectively handle complex scenarios and substantially enhance network performance in terms of latency and capacity while incurring a moderate increase in power consumption as an acceptable tradeoff. Finally, open research problems and future directions are presented to realize efficient SAGIN-enabled MC.
- 中文摘要
由空天地一体化网络(SAGIN)支持的多连接(MC)正成为下一代网络的关键使能技术,使用户能够同时利用跨多层非地面网络(NTN)和多无线接入技术(multi-RAT)地面网络(TN)的多条链路。然而,TN与NTN的异构性带来了复杂的架构挑战,使MC的实现更加困难。具体来说,链路类型多种多样,涵盖空对空、空对天、天对天、天对地和地对地通信,使得最优资源分配极为复杂。强化学习(RL)和智能体式人工智能(agentic AI)的最新进展在复杂动态环境中的最优决策方面展现出显著成效。本文回顾了SAGIN支持的MC的当前发展,并概述了其实现所面临的主要挑战。我们进一步强调了人工智能驱动的方法在异构SAGIN环境中进行资源优化的变革潜力。为此,我们给出了一个案例研究:针对涉及多种无线接入技术(RAT)的SAGIN多连接,利用智能体式强化学习进行资源分配优化。结果表明,基于学习的方法能够有效处理复杂场景,并在时延和容量方面显著提升网络性能,代价是功耗适度增加,这是可以接受的权衡。最后,我们给出了开放性研究问题和未来方向,以实现高效的SAGIN多连接。
Q-A3C2: Quantum Reinforcement Learning with Time-Series Dynamic Clustering for Adaptive ETF Stock Selection
Q-A3C2:基于时间序列动态聚类的量子强化学习用于自适应ETF股票选择
- Authors: Yen-Ku Liu, Yun-Cheng Tsai, Samuel Yen-Chi Chen
- Subjects: Computational Engineering, Finance, and Science (cs.CE)
- Arxiv link: https://arxiv.org/abs/2512.21819
- Pdf link: https://arxiv.org/pdf/2512.21819
- Abstract
Traditional ETF stock selection methods and reinforcement learning models such as the Asynchronous Advantage Actor-Critic (A3C) often suffer from high-dimensional feature spaces and overfitting when applied to complex financial markets. Moreover, static clustering algorithms fail to capture evolving market regimes, as the cluster with higher returns in one period may not remain optimal in the next. To address these limitations, this paper proposes Q-A3C2, a quantum-enhanced A3C framework that integrates time-series dynamic clustering. By embedding Variational Quantum Circuits (VQCs) into the policy network, Q-A3C2 enhances nonlinear feature representation and enables adaptive decision-making at the cluster level. Experimental results on the S&P 500 constituents show that Q-A3C2 achieves a cumulative return of 17.09%, outperforming the benchmark's 7.09%, demonstrating superior adaptability and exploration in dynamic financial environments.
- 中文摘要
传统的ETF选股方法以及异步优势演员-评论家(A3C)等强化学习模型,在应用于复杂金融市场时常常受困于高维特征空间和过拟合问题。此外,静态聚类算法无法捕捉不断演变的市场状态,因为某一时期回报较高的聚类在下一时期未必仍是最优。为解决这些局限性,本文提出Q-A3C2,一种集成了时间序列动态聚类的量子增强型A3C框架。通过将变分量子电路(VQC)嵌入策略网络,Q-A3C2增强了非线性特征表示,并实现了聚类层面的自适应决策。在标普500(S&P 500)成分股上的实验结果显示,Q-A3C2的累计回报率为17.09%,优于基准的7.09%,展现了在动态金融环境中更优的适应性和探索能力。
A Comedy of Estimators: On KL Regularization in RL Training of LLMs
估计器的喜剧:论LLM强化学习训练中的KL正则化
- Authors: Vedant Shah, Johan Obando-Ceron, Vineet Jain, Brian Bartoldson, Bhavya Kailkhura, Sarthak Mittal, Glen Berseth, Pablo Samuel Castro, Yoshua Bengio, Nikolay Malkin, Moksh Jain, Siddarth Venkatraman, Aaron Courville
- Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.21852
- Pdf link: https://arxiv.org/pdf/2512.21852
- Abstract
The reasoning performance of large language models (LLMs) can be substantially improved by training them with reinforcement learning (RL). The RL objective for LLM training involves a regularization term, which is the reverse Kullback-Leibler (KL) divergence between the trained policy and the reference policy. Since computing the KL divergence exactly is intractable, various estimators are used in practice to estimate it from on-policy samples. Despite its wide adoption, including in several open-source libraries, there is no systematic study analyzing the numerous ways of incorporating KL estimators in the objective and their effect on the downstream performance of RL-trained models. Recent works show that prevailing practices for incorporating KL regularization do not provide correct gradients for stated objectives, creating a discrepancy between the objective and its implementation. In this paper, we further analyze these practices and study the gradients of several estimator configurations, revealing how design choices shape gradient bias. We substantiate these findings with empirical observations by RL fine-tuning Qwen2.5-7B, Llama-3.1-8B-Instruct and Qwen3-4B-Instruct-2507 with different configurations and evaluating their performance on both in- and out-of-distribution tasks. Through our analysis, we observe that, in on-policy settings: (1) estimator configurations with biased gradients can result in training instabilities; and (2) using estimator configurations resulting in unbiased gradients leads to better performance on in-domain as well as out-of-domain tasks. We also investigate the performance resulting from different KL configurations in off-policy settings and observe that KL regularization can help stabilize off-policy RL training resulting from asynchronous setups.
- 中文摘要
通过强化学习(RL)训练可以显著提升大型语言模型(LLM)的推理性能。用于LLM训练的RL目标包含一个正则化项,即被训练策略与参考策略之间的反向Kullback-Leibler(KL)散度。由于精确计算KL散度不可行,实践中会使用各种估计器根据同策略(on-policy)样本对其进行估计。尽管这种做法被广泛采用(包括在多个开源库中),但目前尚无系统性研究分析将KL估计器纳入目标的多种方式及其对RL训练模型下游性能的影响。近期工作表明,现行的KL正则化做法并未为所声明的目标提供正确的梯度,造成目标与其实现之间的偏差。本文进一步分析这些做法,研究了多种估计器配置的梯度,揭示设计选择如何影响梯度偏差。我们用不同配置对Qwen2.5-7B、Llama-3.1-8B-Instruct和Qwen3-4B-Instruct-2507进行RL微调,并评估其在分布内和分布外任务上的表现,以实证观察佐证这些发现。通过分析我们观察到,在同策略设置下:(1)梯度有偏的估计器配置可能导致训练不稳定;(2)使用梯度无偏的估计器配置能在域内和域外任务上取得更好的性能。我们还研究了离策略(off-policy)设置下不同KL配置带来的性能,并观察到KL正则化有助于稳定由异步设置导致的离策略RL训练。
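The abstract does not name the estimators it studies. As background, the sketch below shows three per-sample estimators of the reverse KL that are widely used in practice (often called k1, k2 and k3), and checks their expected values exactly on a toy categorical distribution. Treating these three as the relevant ones is an assumption, and the sketch concerns the bias of the estimated value only, not the gradient bias the paper analyzes.

```python
import math


def kl_estimators(logp_policy, logp_ref):
    """Per-sample estimators of KL(policy || ref) for one sample drawn from the policy."""
    log_ratio = logp_ref - logp_policy          # log r, with r = pi_ref(x) / pi(x)
    k1 = -log_ratio                             # unbiased, can be negative
    k2 = 0.5 * log_ratio ** 2                   # biased, always non-negative
    k3 = math.exp(log_ratio) - 1.0 - log_ratio  # unbiased, always non-negative
    return k1, k2, k3


# Toy categorical distributions (made-up numbers).
policy = [0.7, 0.2, 0.1]
ref = [0.5, 0.3, 0.2]
exact_kl = sum(p * math.log(p / q) for p, q in zip(policy, ref))

# Exact expectation of each estimator under the policy (no sampling noise).
expected = [
    sum(p * kl_estimators(math.log(p), math.log(q))[i] for p, q in zip(policy, ref))
    for i in range(3)
]
print(exact_kl, expected)
```

Here k1 and k3 match the exact KL in expectation while k2 does not. Whether each estimator also yields the right gradient when placed in the loss or in the reward is a separate question, and it is the one the paper addresses.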
SWE-RM: Execution-free Feedback For Software Engineering Agents
SWE-RM:软件工程代理的无执行反馈
- Authors: KaShun Shum, Binyuan Hui, Jiawei Chen, Lei Zhang, X. W., Jiaxi Yang, Yuzhen Huang, Junyang Lin, Junxian He
- Subjects: Computation and Language (cs.CL)
- Arxiv link: https://arxiv.org/abs/2512.21919
- Pdf link: https://arxiv.org/pdf/2512.21919
- Abstract
Execution-based feedback like unit testing is widely used in the development of coding agents through test-time scaling (TTS) and reinforcement learning (RL). This paradigm requires scalable and reliable collection of unit test cases to provide accurate feedback, and the resulting feedback is often sparse and cannot effectively distinguish between trajectories that are both successful or both unsuccessful. In contrast, execution-free feedback from reward models can provide more fine-grained signals without depending on unit test cases. Despite this potential, execution-free feedback for realistic software engineering (SWE) agents remains underexplored. In aiming to develop versatile reward models that are effective across TTS and RL, we observe that two verifiers with nearly identical TTS performance can nevertheless yield very different results in RL. Intuitively, TTS primarily reflects the model's ability to select the best trajectory, but this ability does not necessarily generalize to RL. To address this limitation, we identify two additional aspects that are crucial for RL training: classification accuracy and calibration. We then conduct comprehensive controlled experiments to investigate how to train a robust reward model that performs well across these metrics. In particular, we analyze the impact of various factors such as training data scale, policy mixtures, and data source composition. Guided by these investigations, we introduce SWE-RM, an accurate and robust reward model adopting a mixture-of-experts architecture with 30B total parameters and 3B activated during inference. SWE-RM substantially improves SWE agents on both TTS and RL performance. For example, it increases the accuracy of Qwen3-Coder-Flash from 51.6% to 62.0%, and Qwen3-Coder-Max from 67.0% to 74.6% on SWE-Bench Verified using TTS, achieving new state-of-the-art performance among open-source models.
- 中文摘要
基于执行的反馈(如单元测试)被广泛用于通过测试时扩展(TTS)和强化学习(RL)开发编码智能体。该范式需要可扩展且可靠地收集单元测试用例以提供准确反馈,而由此得到的反馈往往稀疏,无法有效区分同为成功或同为失败的轨迹。相比之下,来自奖励模型的免执行反馈可以提供更细粒度的信号,而无需依赖单元测试用例。尽管有此潜力,面向真实软件工程(SWE)智能体的免执行反馈仍未得到充分探索。我们的目标是开发在TTS和RL上均有效的通用奖励模型,但我们观察到,两个TTS表现几乎相同的验证器在RL中却可能产生截然不同的结果。直观上,TTS主要反映模型选出最佳轨迹的能力,但这种能力不一定能泛化到RL。为解决这一局限,我们确定了另外两个对RL训练至关重要的方面:分类准确率和校准。随后,我们进行了全面的受控实验,探讨如何训练一个在这些指标上均表现良好的稳健奖励模型。特别地,我们分析了训练数据规模、策略混合和数据源组成等多种因素的影响。在这些研究的指导下,我们提出了SWE-RM,这是一个准确且稳健的奖励模型,采用混合专家架构,总参数量为30B,推理时激活3B。SWE-RM显著提升了SWE智能体在TTS和RL上的表现。例如,在SWE-Bench Verified上使用TTS时,它将Qwen3-Coder-Flash的准确率从51.6%提升至62.0%,将Qwen3-Coder-Max的准确率从67.0%提升至74.6%,在开源模型中达到新的最先进水平。
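The SWE-RM abstract observes that two verifiers with nearly identical TTS performance can behave very differently in RL, and names classification accuracy and calibration as the extra properties that matter. The following is a minimal sketch of that distinction on toy data, not the authors' evaluation code. All function names, the 0.5 threshold, the 5-bin expected calibration error, and the scores themselves are illustrative assumptions.

```python
from typing import List, Sequence


def best_of_n_accuracy(scores: Sequence[Sequence[float]],
                       labels: Sequence[Sequence[int]]) -> float:
    """TTS-style metric: how often the top-scored trajectory of a task succeeded."""
    hits = 0
    for task_scores, task_labels in zip(scores, labels):
        best = max(range(len(task_scores)), key=lambda i: task_scores[i])
        hits += task_labels[best]
    return hits / len(scores)


def classification_accuracy(scores: Sequence[float], labels: Sequence[int],
                            threshold: float = 0.5) -> float:
    """Fraction of trajectories whose thresholded score matches the true label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def expected_calibration_error(scores: Sequence[float], labels: Sequence[int],
                               n_bins: int = 5) -> float:
    """Weighted mean gap between average score and success rate per score bin."""
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, s in enumerate(scores):
        bins[min(int(s * n_bins), n_bins - 1)].append(i)
    ece = 0.0
    for idx in bins:
        if not idx:
            continue
        confidence = sum(scores[i] for i in idx) / len(idx)
        success_rate = sum(labels[i] for i in idx) / len(idx)
        ece += len(idx) / len(scores) * abs(confidence - success_rate)
    return ece


def flatten(groups: Sequence[Sequence[float]]) -> list:
    return [x for group in groups for x in group]


# Toy data (hypothetical): 3 tasks, 3 candidate trajectories each; 1 = resolved.
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
scores_a = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.2], [0.3, 0.2, 0.1]]
# Verifier B ranks trajectories exactly like A but squeezes scores into [0.7, 1.0].
scores_b = [[0.7 + 0.3 * s for s in task] for task in scores_a]

tts_a = best_of_n_accuracy(scores_a, labels)
tts_b = best_of_n_accuracy(scores_b, labels)
acc_a = classification_accuracy(flatten(scores_a), flatten(labels))
acc_b = classification_accuracy(flatten(scores_b), flatten(labels))
ece_a = expected_calibration_error(flatten(scores_a), flatten(labels))
ece_b = expected_calibration_error(flatten(scores_b), flatten(labels))

print(f"best-of-n: A={tts_a:.2f} B={tts_b:.2f}")
print(f"accuracy:  A={acc_a:.2f} B={acc_b:.2f}")
print(f"ECE:       A={ece_a:.2f} B={ece_b:.2f}")
```

Because verifier B is a monotone transform of verifier A, both select the same trajectory per task and tie on the best-of-n metric. B's inflated scores nonetheless label almost every trajectory as successful, so its classification accuracy and calibration are worse, and those scores would serve as a much noisier reward signal if fed directly into RL.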
Latency-Optimal Cache-aided Multicast Streaming via Forward-Backward Reinforcement Learning
通过前向-后向强化学习实现延迟最优的缓存辅助多播流传输
- Authors: Mohsen Amidzadeh
- Subjects: Information Theory (cs.IT)
- Arxiv link: https://arxiv.org/abs/2512.21954
- Pdf link: https://arxiv.org/pdf/2512.21954
- Abstract
We consider a cellular network equipped with cache-enabled base-stations (BSs) leveraging an orthogonal multipoint multicast (OMPMC) streaming scheme. The network operates in a time-slotted fashion to serve content-requesting users by streaming cached files. Users who are not satisfied by the multicast streaming face a delivery outage, implying that they will remain interested in their preference at the next time-slot, which leads to a forward dynamics on the user preference. To design a latency-optimal streaming policy, the dynamics of latency is properly modeled and included in the learning procedure. We show that this dynamics surprisingly represents a backward dynamics. The combination of the problem's forward and backward dynamics then yields a forward-backward Markov decision process (FB-MDP) that fully captures the network evolution across time. This FB-MDP necessitates the use of a forward-backward multi-objective reinforcement learning (FB-MORL) algorithm to optimize the expected latency as well as other performance metrics of interest, including the overall outage probability and total resource consumption. Simulation results show the merit of the proposed FB-MORL algorithm in finding a promising dynamic cache policy.
- 中文摘要
我们考虑一个配备具有缓存功能的基站(BS)的蜂窝网络,采用正交多点多播(OMPMC)流传输方案。该网络以时隙方式运行,通过流式传输缓存文件为请求内容的用户提供服务。未被多播流满足的用户会遇到传输中断,这意味着他们在下一个时隙仍会保持对其偏好内容的兴趣,从而形成用户偏好上的前向动态。为了设计延迟最优的流传输策略,我们对延迟的动态进行了恰当建模,并将其纳入学习过程。我们证明,这一动态出人意料地表现为后向动态。问题的前向动态与后向动态相结合,形成了一个前向-后向马尔可夫决策过程(FB-MDP),能够完整刻画网络随时间的演变。该FB-MDP需要使用前向-后向多目标强化学习(FB-MORL)算法来优化期望延迟以及其他关注的性能指标,包括总体中断概率和总资源消耗。仿真结果表明,所提出的FB-MORL算法在寻找有前景的动态缓存策略方面具有优势。
Meta-Learning-Based Handover Management in NextG O-RAN
NextG O-RAN中的基于元学习的切换管理
- Authors: Michail Kalntis, George Iosifidis, José Suárez-Varela, Andra Lutu, Fernando A. Kuipers
- Subjects: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.22022
- Pdf link: https://arxiv.org/pdf/2512.22022
- Abstract
While traditional handovers (THOs) have served as a backbone for mobile connectivity, they increasingly suffer from failures and delays, especially in dense deployments and high-frequency bands. To address these limitations, 3GPP introduced Conditional Handovers (CHOs) that enable proactive cell reservations and user-driven execution. However, both handover (HO) types present intricate trade-offs in signaling, resource usage, and reliability. This paper presents unique, countrywide mobility management datasets from a top-tier mobile network operator (MNO) that offer fresh insights into these issues and call for adaptive and robust HO control in next-generation networks. Motivated by these findings, we propose CONTRA, a framework that, for the first time, jointly optimizes THOs and CHOs within the O-RAN architecture. We study two variants of CONTRA: one where users are a priori assigned to one of the HO types, reflecting distinct service or user-specific requirements, as well as a more dynamic formulation where the controller decides on-the-fly the HO type, based on system conditions and needs. To this end, it relies on a practical meta-learning algorithm that adapts to runtime observations and guarantees performance comparable to an oracle with perfect future information (universal no-regret). CONTRA is specifically designed for near-real-time deployment as an O-RAN xApp and aligns with the 6G goals of flexible and intelligent control. Extensive evaluations leveraging crowdsourced datasets show that CONTRA improves user throughput and reduces both THO and CHO switching costs, outperforming 3GPP-compliant and Reinforcement Learning (RL) baselines in dynamic and real-world scenarios.
- 中文摘要
虽然传统切换(THO)一直是移动连接的支柱,但它们越来越多地受到失败和延迟的困扰,尤其是在密集部署和高频段场景中。为解决这些局限,3GPP引入了条件切换(CHO),支持主动的小区预留和用户驱动的执行。然而,这两种切换(HO)类型在信令、资源使用和可靠性方面都存在复杂的权衡。本文给出了来自一家顶级移动网络运营商(MNO)的独特的全国范围移动性管理数据集,为这些问题提供了新的见解,并表明下一代网络需要自适应且稳健的HO控制。基于这些发现,我们提出了CONTRA框架,首次在O-RAN架构中联合优化THO和CHO。我们研究了CONTRA的两种变体:一种是用户被预先分配到某一HO类型,以反映不同的服务需求或用户特定需求;另一种是更动态的形式,由控制器根据系统状况和需求即时决定HO类型。为此,它依赖一种实用的元学习算法,该算法能够适应运行时观测,并保证性能可与拥有完美未来信息的预言机相当(通用无悔)。CONTRA专为作为O-RAN xApp进行近实时部署而设计,符合6G灵活智能控制的目标。基于众包数据集的大量评估表明,CONTRA提升了用户吞吐量,并降低了THO和CHO的切换成本,在动态和真实场景中优于符合3GPP标准的基线和强化学习(RL)基线。
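The CONTRA abstract refers to a no-regret guarantee without describing the algorithm. As a hedged illustration of what "no-regret" means in general, the sketch below runs the textbook Hedge (exponential weights) learner over two options, loosely labelled THO and CHO. This is not CONTRA's meta-learning algorithm. The loss ranges, horizon, and learning rate are all assumptions, and the sketch only measures regret against the best fixed option in hindsight, a weaker benchmark than the oracle with perfect future information mentioned in the abstract.

```python
import math
import random


def hedge(loss_rounds, eta):
    """Exponential-weights learner; returns (cumulative expected loss, final distribution)."""
    n_arms = len(loss_rounds[0])
    weights = [1.0] * n_arms
    expected_loss = 0.0
    for losses in loss_rounds:
        total = sum(weights)
        probs = [w / total for w in weights]
        # Loss incurred in expectation when sampling an option from probs.
        expected_loss += sum(p * l for p, l in zip(probs, losses))
        # Multiplicatively down-weight each option according to its observed loss.
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(weights)
    return expected_loss, [w / total for w in weights]


# Hypothetical per-round costs in [0, 1]: index 0 = THO, index 1 = CHO.
rng = random.Random(0)
T = 1000
loss_rounds = [(rng.uniform(0.4, 1.0), rng.uniform(0.0, 0.6)) for _ in range(T)]

# Standard tuning for losses in [0, 1] and a known horizon T.
eta = math.sqrt(8 * math.log(2) / T)
learner_loss, final_probs = hedge(loss_rounds, eta)

best_fixed_loss = min(sum(r[a] for r in loss_rounds) for a in range(2))
regret = learner_loss - best_fixed_loss
bound = math.sqrt(T * math.log(2) / 2)  # classical Hedge regret bound

print(f"regret={regret:.2f} bound={bound:.2f} final_probs={final_probs}")
```

The regret grows on the order of the square root of T, so the average per-round gap to the best fixed option vanishes as T increases, which is the sense in which such learners are called no-regret.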
Keyword: diffusion policy
Flexible Multitask Learning with Factorized Diffusion Policy
基于分解扩散策略的灵活多任务学习
- Authors: Chaoqi Liu, Haonan Chen, Sigmund H. Høeg, Shaoxiong Yao, Yunzhu Li, Kris Hauser, Yilun Du
- Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
- Arxiv link: https://arxiv.org/abs/2512.21898
- Pdf link: https://arxiv.org/pdf/2512.21898
- Abstract
Multitask learning poses significant challenges due to the highly multimodal and diverse nature of robot action distributions. However, effectively fitting policies to these complex task distributions is often difficult, and existing monolithic models often underfit the action distribution and lack the flexibility required for efficient adaptation. We introduce a novel modular diffusion policy framework that factorizes complex action distributions into a composition of specialized diffusion models, each capturing a distinct sub-mode of the behavior space for a more effective overall policy. In addition, this modular structure enables flexible policy adaptation to new tasks by adding or fine-tuning components, which inherently mitigates catastrophic forgetting. Empirically, across both simulation and real-world robotic manipulation settings, we illustrate how our method consistently outperforms strong modular and monolithic baselines.
- 中文摘要
由于机器人动作分布高度多模态且多样化,多任务学习面临重大挑战。然而,让策略有效拟合这些复杂的任务分布往往很困难,现有的单体模型常常对动作分布欠拟合,且缺乏高效适应所需的灵活性。我们提出了一种新的模块化扩散策略框架,将复杂的动作分布分解为多个专门扩散模型的组合,每个模型刻画行为空间中一个不同的子模式,从而得到更有效的整体策略。此外,这种模块化结构可以通过添加或微调组件使策略灵活适应新任务,从而从本质上缓解灾难性遗忘。在实验中,无论是在仿真还是真实机器人操作环境下,我们的方法都持续优于强大的模块化和单体基线。