Arxiv Papers of Today

Generated: 2025-12-15 16:34:43 (UTC+8); Arxiv announcement time: 2025-12-15 20:00 EST (2025-12-16 09:00 UTC+8)

20 relevant papers today

Keyword: reinforcement learning

KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering

Authors: Xin Sun, Zhongqi Chen, Xing Zheng, Qiang Liu, Shu Wu, Bowen Song, Zilei Wang, Weiqiang Wang, Liang Wang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.10999
Pdf link: https://arxiv.org/pdf/2512.10999
Abstract Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we present KBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions, leveraging Group Relative Policy Optimization (GRPO) to refine its strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce Referenced Rejection Sampling (RRS), a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance, effectively grounding LLM reasoning in verifiable execution.
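
The group-relative part of GRPO mentioned above can be illustrated with a small sketch. It assumes the commonly used formulation in which each rollout's reward is standardized against the other rollouts sampled for the same question. The reward values and the name `group_relative_advantages` are hypothetical and are not taken from KBQA-R1.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical execution rewards for four rollouts of one question
# (1.0 = the generated logical form returned the right answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Rollouts that beat the group average receive a positive advantage and the others a negative one, so no separate value network is needed to form a baseline.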

In-Context Multi-Objective Optimization

Authors: Xinyu Zhang, Conor Hassan, Julien Martinelli, Daolang Huang, Samuel Kaski
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Arxiv link: https://arxiv.org/abs/2512.11114
Pdf link: https://arxiv.org/pdf/2512.11114
Abstract Balancing competing objectives is omnipresent across disciplines, from drug design to autonomous systems. Multi-objective Bayesian optimization is a promising solution for such expensive, black-box problems: it fits probabilistic surrogates and selects new designs via an acquisition function that balances exploration and exploitation. In practice, it requires tailored choices of surrogate and acquisition that rarely transfer to the next problem, is myopic when multi-step planning is often required, and adds refitting overhead, particularly in parallel or time-sensitive loops. We present TAMO, a fully amortized, universal policy for multi-objective black-box optimization. TAMO uses a transformer architecture that operates across varying input and objective dimensions, enabling pretraining on diverse corpora and transfer to new problems without retraining: at test time, the pretrained model proposes the next design with a single forward pass. We pretrain the policy with reinforcement learning to maximize cumulative hypervolume improvement over full trajectories, conditioning on the entire query history to approximate the Pareto frontier. Across synthetic benchmarks and real tasks, TAMO produces fast proposals, reducing proposal time by 50-1000x versus alternatives while matching or improving Pareto quality under tight evaluation budgets. These results show that transformers can perform multi-objective optimization entirely in-context, eliminating per-task surrogate fitting and acquisition engineering, and open a path to foundation-style, plug-and-play optimizers for scientific discovery workflows.
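
The hypervolume improvement that the abstract uses as a training signal can be sketched for the two-objective case. This is a minimal illustration, assuming both objectives are maximized and a reference point that every design dominates. The function names and the sample front are hypothetical and do not come from TAMO.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded below by `ref` (maximization)."""
    # Keep only points that strictly dominate the reference point.
    pts = sorted(
        (p for p in points if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: p[0],
        reverse=True,
    )
    area = 0.0
    best_y = ref[1]
    # Sweep from the largest first objective; each point adds a new strip
    # only if it raises the best second objective seen so far.
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def hypervolume_improvement(front, candidate, ref):
    """Gain in dominated area from adding one candidate design."""
    return hypervolume_2d(front + [candidate], ref) - hypervolume_2d(front, ref)


front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
ref = (0.0, 0.0)
gain = hypervolume_improvement(front, (2.5, 2.5), ref)
print(hypervolume_2d(front, ref), gain)
```

Summing such gains along a whole query trajectory gives a cumulative quantity of the kind the abstract describes as the reinforcement learning objective.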

Benchmarking RL-Enhanced Spatial Indices Against Traditional, Advanced, and Learned Counterparts

Authors: Guanli Liu, Renata Borovica-Gajic, Hai Lan, Zhifeng Bao
Subjects: Databases (cs.DB)
Arxiv link: https://arxiv.org/abs/2512.11161
Pdf link: https://arxiv.org/pdf/2512.11161
Abstract Reinforcement learning has recently been used to enhance index structures, giving rise to reinforcement learning-enhanced spatial indices (RLESIs) that aim to improve query efficiency during index construction. However, their practical benefits remain unclear due to the lack of unified implementations and comprehensive evaluations, especially in disk-based settings. We present the first modular and extensible benchmark for RLESIs. Built on top of an existing spatial index library, our framework decouples index training from building, supports parameter tuning, and enables consistent comparison with traditional, advanced, and learned spatial indices. We evaluate 12 representative spatial indices across six datasets and diverse workloads, including point, range, kNN, spatial join, and mixed read/write queries. Using latency, I/O, and index statistics as metrics, we find that while RLESIs can reduce query latency with tuning, they consistently underperform learned spatial indices and advanced variants in both query efficiency and index build cost. These findings highlight that although RLESIs offer promising architectural compatibility, their high tuning costs and limited generalization hinder practical adoption.

CORL: Reinforcement Learning of MILP Policies Solved via Branch and Bound

Authors: Akhil S Anand, Elias Aarekol, Martin Mziray Dalseg, Magnus Stalhane, Sebastien Gros
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Arxiv link: https://arxiv.org/abs/2512.11169
Pdf link: https://arxiv.org/pdf/2512.11169
Abstract Combinatorial sequential decision-making problems are typically modeled as mixed integer linear programs (MILPs) and solved via branch and bound (B&B) algorithms. The inherent difficulty of modeling MILPs that accurately represent stochastic real-world problems leads to suboptimal performance in the real world. Recently, machine learning methods have been applied to build MILP models that are judged by decision quality rather than by how accurately they model the real-world problem. However, these approaches typically rely on supervised learning, assume access to true optimal decisions, and use surrogates for the MILP gradients. In this work, we introduce a proof-of-concept CORL framework that fine-tunes an MILP scheme end to end using reinforcement learning (RL) on real-world data to maximize its operational performance. We enable this by casting an MILP solved by B&B as a differentiable stochastic policy compatible with RL. We validate the CORL method in a simple illustrative combinatorial sequential decision-making example.
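
For readers unfamiliar with branch and bound, the sketch below applies it to a toy 0/1 knapsack, which is a small binary linear program. It only illustrates the B&B solver loop named in the abstract, not the CORL method itself. The function name and the instance data are hypothetical, and item weights are assumed to be positive.

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Maximize total value subject to a weight limit, with binary choices."""
    # Sort by value density so the LP relaxation is a simple greedy fill.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / weights[i], reverse=True)
    v = [values[i] for i in order]
    w = [weights[i] for i in order]
    best = 0

    def bound(k, value, room):
        # Upper bound from the relaxation: take whole items while they fit,
        # then a fraction of the first item that does not.
        for i in range(k, len(v)):
            if w[i] <= room:
                room -= w[i]
                value += v[i]
            else:
                return value + v[i] * room / w[i]
        return value

    def branch(k, value, room):
        nonlocal best
        best = max(best, value)
        # Prune when no items remain or the bound cannot beat the incumbent.
        if k == len(v) or bound(k, value, room) <= best:
            return
        if w[k] <= room:
            branch(k + 1, value + v[k], room - w[k])  # x_k = 1
        branch(k + 1, value, room)                    # x_k = 0

    branch(0, 0, capacity)
    return best


optimum = knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50)
print(optimum)
```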

Bandwidth-constrained Variational Message Encoding for Cooperative Multi-agent Reinforcement Learning

Authors: Wei Duan, Jie Lu, En Yu, Junyu Xuan
Subjects: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.11179
Pdf link: https://arxiv.org/pdf/2512.11179
Abstract Graph-based multi-agent reinforcement learning (MARL) enables coordinated behavior under partial observability by modeling agents as nodes and communication links as edges. While recent methods excel at learning sparse coordination graphs (determining who communicates with whom), they do not address what information should be transmitted under hard bandwidth constraints. We study this bandwidth-limited regime and show that naive dimensionality reduction consistently degrades coordination performance. Hard bandwidth constraints force selective encoding, but deterministic projections lack mechanisms to control how compression occurs. We introduce Bandwidth-constrained Variational Message Encoding (BVME), a lightweight module that treats messages as samples from learned Gaussian posteriors regularized via KL divergence to an uninformative prior. BVME's variational framework provides principled, tunable control over compression strength through interpretable hyperparameters, directly constraining the representations used for decision-making. Across SMACv1, SMACv2, and MPE benchmarks, BVME achieves comparable or superior performance while using 67-83% fewer message dimensions, with gains most pronounced on sparse graphs where message quality critically impacts coordination. Ablations reveal U-shaped sensitivity to bandwidth, with BVME excelling at extreme ratios while adding minimal overhead.
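
The variational message idea can be sketched as follows. The abstract only says that messages are samples from learned Gaussian posteriors regularized by a KL term toward an uninformative prior, so this sketch assumes a diagonal Gaussian posterior and a standard normal prior N(0, I). All names and numbers are hypothetical rather than BVME's actual implementation.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def sample_message(mu, log_var, rng):
    """Reparameterized sample: mu + sigma * eps with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mu, log_var)
    ]


rng = random.Random(0)
mu, log_var = [1.0, 0.0], [0.0, 0.0]   # hypothetical encoder outputs
message = sample_message(mu, log_var, rng)
penalty = kl_to_standard_normal(mu, log_var)
print(message, penalty)
```

In a training loop the penalty would typically be added to the task loss with a weight, and raising that weight pushes messages toward the prior, that is, toward stronger compression.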

Multi-Objective Reinforcement Learning for Large-Scale Mixed Traffic Control

Authors: Iftekharul Islam, Weizi Li
Subjects: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.11247
Pdf link: https://arxiv.org/pdf/2512.11247
Abstract Effective mixed traffic control requires balancing efficiency, fairness, and safety. Existing approaches excel at optimizing efficiency and enforcing safety constraints but lack mechanisms to ensure equitable service, resulting in systematic starvation of vehicles on low-demand approaches. We propose a hierarchical framework combining multi-objective reinforcement learning for local intersection control with strategic routing for network-level coordination. Our approach introduces a Conflict Threat Vector that provides agents with explicit risk signals for proactive conflict avoidance, and a queue parity penalty that ensures equitable service across all traffic streams. Extensive experiments on a real-world network across different robot vehicle (RV) penetration rates demonstrate substantial improvements: reductions of up to 53% in average wait time, up to 86% in maximum starvation, and up to 86% in conflict rate compared to baselines, while maintaining fuel efficiency. Our analysis reveals that strategic routing effectiveness scales with RV penetration, becoming increasingly valuable at higher autonomy levels. The results demonstrate that multi-objective optimization through well-curated reward functions paired with strategic RV routing yields significant benefits in fairness and safety metrics critical for equitable mixed-autonomy deployment.

A-LAMP: Agentic LLM-Based Framework for Automated MDP Modeling and Policy Generation

Authors: Hong Je-Gal, Chan-Bin Yi, Hyun-Suk Lee
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.11270
Pdf link: https://arxiv.org/pdf/2512.11270
Abstract Applying reinforcement learning (RL) to real-world tasks requires converting informal descriptions into a formal Markov decision process (MDP), implementing an executable environment, and training a policy agent. Automating this process is challenging due to modeling errors, fragile code, and misaligned objectives, which often impede policy training. We introduce an agentic large language model (LLM)-based framework for automated MDP modeling and policy generation (A-LAMP), which automatically translates free-form natural language task descriptions into an MDP formulation and a trained policy. The framework decomposes modeling, coding, and training into verifiable stages, ensuring semantic alignment throughout the pipeline. Across both classic control and custom RL domains, A-LAMP consistently achieves higher policy generation capability than a single state-of-the-art LLM. Notably, even its lightweight variant, which is built on smaller language models, approaches the performance of much larger models. Failure analysis reveals why these improvements occur. In addition, a case study demonstrates that A-LAMP generates environments and policies that preserve the task's optimality, confirming its correctness and reliability.

When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents

Authors: Mrinal Rawat, Arkajyoti Chakraborty, Neha Gupta, Roberto Pieraccini
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.11277
Pdf link: https://arxiv.org/pdf/2512.11277
Abstract Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. However, SFT can have difficulty generalizing when the underlying data distribution changes, even when the new data does not fall completely outside the training domain. Recent reasoning-focused models such as o1 and R1 have demonstrated consistent gains over their non-reasoning counterparts, highlighting the importance of reasoning for improved generalization and reliability. However, collecting high-quality reasoning traces for SFT remains challenging: annotations are costly, subjective, and difficult to scale. To address this limitation, we leverage Reinforcement Learning (RL) to enable models to learn reasoning strategies directly from task outcomes. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools (e.g., function calls) and the final answer generation for conversational agents. Our method employs Group Relative Policy Optimization (GRPO) with rewards designed around tool accuracy and answer correctness, allowing the model to iteratively refine its reasoning and actions. Experimental results demonstrate that our approach improves both the quality of reasoning and the precision of tool invocations, achieving a 1.5% relative improvement over the SFT model (trained without explicit thinking) and a 40% gain over the vanilla Qwen3-1.7B base model. These findings demonstrate the promise of unifying reasoning and action learning through RL to build more capable and generalizable conversational agents.

RollMux: Phase-Level Multiplexing for Disaggregated RL Post-Training

Authors: Tianyuan Wu, Lunxi Cao, Yining Wei, Wei Gao, Yuheng Zhao, Dakai An, Shaopan Xiong, Zhiqiang Lv, Ju Huang, Siran Yang, Yinghao Yu, Jiamang Wang, Lin Qu, Wei Wang
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Arxiv link: https://arxiv.org/abs/2512.11306
Pdf link: https://arxiv.org/pdf/2512.11306
Abstract Rollout-training disaggregation is emerging as the standard architecture for Reinforcement Learning (RL) post-training, where memory-bound rollout and compute-bound training are physically disaggregated onto purpose-built clusters to maximize hardware efficiency. However, the strict synchronization required by on-policy algorithms introduces severe dependency bubbles, forcing one cluster to idle while the dependent phase is running on the other. We present RollMux, a cluster scheduling framework that reclaims these bubbles through cross-cluster orchestration. RollMux is built on the insight that the structural idleness of one job can be effectively utilized by the active phase of another. To realize this, we introduce the co-execution group abstraction, which partitions the cluster into isolated locality domains. This abstraction enables a two-tier scheduling architecture: an inter-group scheduler that optimizes job placement using conservative stochastic planning, and an intra-group scheduler that orchestrates a provably optimal round-robin schedule. The group abstraction also imposes a residency constraint, ensuring that massive model states remain cached in host memory to enable "warm-start" context switching. We evaluate RollMux on a production-scale testbed with 328 H20 and 328 H800 GPUs. RollMux improves cost efficiency by 1.84x over standard disaggregation and 1.38x over state-of-the-art co-located baselines, all while achieving 100% SLO attainment.

DAPO: Design Structure-Aware Pass Ordering in High-Level Synthesis with Graph Contrastive and Reinforcement Learning

Authors: Jinming Ge, Linfeng Du, Likith Anaparty, Shangkun Li, Tingyuan Liang, Afzal Ahmad, Vivek Chaturvedi, Sharad Sinha, Zhiyao Xie, Jiang Xu, Wei Zhang
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.11342
Pdf link: https://arxiv.org/pdf/2512.11342
Abstract High-Level Synthesis (HLS) tools are widely adopted in FPGA-based domain-specific accelerator design. However, existing tools rely on fixed optimization strategies inherited from software compilation, limiting their effectiveness. Tailoring optimization strategies to specific designs requires deep semantic understanding, accurate hardware metric estimation, and advanced search algorithms, capabilities that current approaches lack. We propose DAPO, a design structure-aware pass ordering framework that extracts program semantics from control and data flow graphs, employs contrastive learning to generate rich embeddings, and leverages an analytical model for accurate hardware metric estimation. These components jointly guide a reinforcement learning agent to discover design-specific optimization strategies. Evaluations on classic HLS designs demonstrate that our end-to-end flow delivers a 2.36x speedup over Vitis HLS on average.

Symmetry-Aware Steering of Equivariant Diffusion Policies: Benefits and Limits

Authors: Minwoo Park, Junwoo Chang, Jongeun Choi, Roberto Horowitz
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.11345
Pdf link: https://arxiv.org/pdf/2512.11345
Abstract Equivariant diffusion policies (EDPs) combine the generative expressivity of diffusion models with the strong generalization and sample efficiency afforded by geometric symmetries. While steering these policies with reinforcement learning (RL) offers a promising mechanism for fine-tuning beyond demonstration data, directly applying standard (non-equivariant) RL can be sample-inefficient and unstable, as it ignores the symmetries that EDPs are designed to exploit. In this paper, we theoretically establish that the diffusion process of an EDP is equivariant, which in turn induces a group-invariant latent-noise MDP that is well-suited for equivariant diffusion steering. Building on this theory, we introduce a principled symmetry-aware steering framework and compare standard, equivariant, and approximately equivariant RL strategies through comprehensive experiments across tasks with varying degrees of symmetry. While we identify the practical boundaries of strict equivariance under symmetry breaking, we show that exploiting symmetry during the steering process yields substantial benefits: enhancing sample efficiency, preventing value divergence, and achieving strong policy improvements even when EDPs are trained from extremely limited demonstrations.
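
The equivariance property at the heart of this abstract can be checked numerically on a toy example. The sketch assumes planar rotations as the symmetry group and uses two hand-written deterministic policies. Neither policy nor any function name comes from the paper, whose policies are diffusion models.

```python
import math


def rotate(vec, theta):
    """Rotate a 2D vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])


def is_equivariant(policy, obs, theta, tol=1e-9):
    """Check policy(g . obs) == g . policy(obs) for one rotation g."""
    lhs = policy(rotate(obs, theta))
    rhs = rotate(policy(obs), theta)
    return all(abs(a - b) < tol for a, b in zip(lhs, rhs))


def toward_origin(obs):
    # Moves straight toward the origin: commutes with every rotation.
    return (-obs[0], -obs[1])


def constant_push(obs):
    # Always pushes along +x: breaks rotational symmetry.
    return (1.0, 0.0)


obs, theta = (2.0, 1.0), math.pi / 2
print(is_equivariant(toward_origin, obs, theta))
print(is_equivariant(constant_push, obs, theta))
```

A learner that knows the first kind of structure holds can reuse experience across rotated copies of a state, which is the intuition behind the sample-efficiency claim.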

Mitigating the Safety Alignment Tax with Null-Space Constrained Policy Optimization

Authors: Yifan Niu, Han Xiao, Dongyi Liu, Nuo Chen, Jia Li
Subjects: Machine Learning (cs.LG)
Arxiv link: https://arxiv.org/abs/2512.11391
Pdf link: https://arxiv.org/pdf/2512.11391
Abstract As Large Language Models (LLMs) are increasingly deployed in real-world applications, it is important to ensure their behaviors align with human values, societal norms, and ethical principles. However, safety alignment under Reinforcement Learning (RL) often suffers from forgetting learned general abilities, which is also known as the alignment tax. To address this issue, we introduce Null-Space constrained Policy Optimization (NSPO), a novel RL framework for LLM safety alignment that preserves the models' core abilities. The safety policy gradients are geometrically projected into the null space of general tasks, thereby mitigating the safety alignment tax. In addition, we theoretically prove that NSPO preserves the model's original core capabilities, while still guaranteeing a descent direction for effective safety alignment. Extensive experiments demonstrate that NSPO outperforms existing methods by a large margin, achieving state-of-the-art safety performance without sacrificing accuracy on general tasks, including math, code, and instruction-following tasks. Notably, NSPO is data-efficient and only requires 40% of public human-annotated safety data from PKU-SafeRLHF to achieve promising safety performance, without the large amount of mixed general-task data used in existing alignment methods.
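
The geometric step described above, projecting a gradient into a null space, can be sketched with plain orthogonal projection. The abstract does not say how NSPO builds the null space of general tasks, so this example simply assumes a handful of hypothetical "general task" direction vectors and removes every component of the safety gradient that lies in their span.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project_to_null_space(grad, general_dirs, tol=1e-12):
    """Return the part of `grad` orthogonal to every vector in `general_dirs`."""
    # Build an orthonormal basis of the span via Gram-Schmidt.
    basis = []
    for d in general_dirs:
        u = list(d)
        for b in basis:
            coeff = dot(u, b)
            u = [ui - coeff * bi for ui, bi in zip(u, b)]
        norm = math.sqrt(dot(u, u))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([ui / norm for ui in u])
    # Subtract the projection of grad onto each basis vector.
    out = list(grad)
    for b in basis:
        coeff = dot(out, b)
        out = [oi - coeff * bi for oi, bi in zip(out, b)]
    return out


safety_grad = [1.0, 2.0, 3.0]                      # hypothetical gradient
general_dirs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]  # hypothetical directions
projected = project_to_null_space(safety_grad, general_dirs)
print(projected)
```

An update along `projected` has zero first-order effect along the assumed general-task directions, which is the intuition behind preserving general abilities while still moving on the safety objective.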

Towards Trustworthy Multi-Turn LLM Agents via Behavioral Guidance

Authors: Gonca Gürsun
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.11421
Pdf link: https://arxiv.org/pdf/2512.11421
Abstract Large Language Models demonstrate strong reasoning and generation abilities, yet their behavior in multi-turn tasks often lacks reliability and verifiability. We present a task completion framework that enables LLM-based agents to act under explicit behavioral guidance in environments described by reinforcement learning formalisms with defined observation, action, and reward signals. The framework integrates three components: a lightweight task profiler that selects reasoning and generation strategies, a reasoning module that learns verifiable observation-action mappings, and a generation module that enforces constraint-compliant outputs through validation or deterministic synthesis. We show that as the agent interacts with the environment, these components co-evolve, yielding trustworthy behavior.

Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes

Authors: Junghwan Lim, Sungmin Lee, Dongseok Kim, Taehyun Kim, Eunhwan Park, Jeesoo Lee, Jeongdoo Lee, Junhyeok Lee, Wai Ting Cheung, Dahye Choi, Minsu Ha, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Beomgyu Kim, Minjae Kim, Taewhan Kim, Youngrok Kim, Hyukjin Kweon, Haesol Lee, Kungyu Lee, Dongpin Oh, Yeongjae Park, Bokki Ryu, Dongjoo Weon
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.11463
Pdf link: https://arxiv.org/pdf/2512.11463
Abstract We introduce Motif-2-12.7B-Reasoning, a 12.7B parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach combines memory-efficient infrastructure for 64K-token contexts using hybrid parallelism and kernel-level optimizations with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results demonstrate that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.

Three methods, one problem: Classical and AI approaches to no-three-in-line

Authors: Pranav Ramanathan, Thomas Prellberg, Matthew Lewis, Prathamesh Dinesh Joshi, Raj Abhijit Dandekar, Rajat Dandekar, Sreedath Panat
Subjects: Artificial Intelligence (cs.AI)
Arxiv link: https://arxiv.org/abs/2512.11469
Pdf link: https://arxiv.org/pdf/2512.11469
Abstract The No-Three-In-Line problem asks for the maximum number of points that can be placed on an n by n grid with no three collinear, representing a famous problem in combinatorial geometry. While classical methods like Integer Linear Programming (ILP) guarantee optimal solutions, they face exponential scaling with grid size, and recent advances in machine learning offer promising alternatives for pattern-based approximation. This paper presents the first systematic comparison of classical optimization and AI approaches to this problem, evaluating their performance against traditional algorithms. We apply PatternBoost transformer learning and reinforcement learning (PPO) to this problem for the first time, comparing them against ILP. ILP achieves provably optimal solutions up to 19 by 19 grids, while PatternBoost matches optimal performance up to 14 by 14 grids with 96% test loss reduction. PPO achieves perfect solutions on 10 by 10 grids but fails at 11 by 11 grids, where constraint violations prevent valid configurations. These results demonstrate that classical optimization remains essential for exact solutions while AI methods offer competitive performance on smaller instances, with hybrid approaches presenting the most promising direction for scaling to larger problem sizes.
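
The constraint that all three methods must satisfy is easy to state in code. The sketch below is a brute-force validity checker for a candidate placement, using an integer cross product to test collinearity. It is an illustration of the problem definition only, not any of the solvers compared in the paper, and the function names are hypothetical.

```python
from itertools import combinations


def collinear(p, q, r):
    """True if three grid points lie on one straight line."""
    # Zero cross product of (q - p) and (r - p) means collinear.
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]) == 0


def is_valid_placement(points):
    """True if no three of the given points are collinear."""
    return not any(collinear(p, q, r) for p, q, r in combinations(points, 3))


# Six points on a 3 by 3 grid, two per row and two per column.
placement = [(0, 0), (0, 1), (1, 0), (1, 2), (2, 1), (2, 2)]
print(is_valid_placement(placement))
print(is_valid_placement([(0, 0), (1, 1), (2, 2)]))
```

The check costs O(k^3) for k points, which is fine for verifying a proposed solution. Finding a maximum placement is the hard part that the ILP, PatternBoost, and PPO approaches address.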

Rethinking Expert Trajectory Utilization in LLM Post-training

Authors: Bowen Ding, Yuhan Chen, Jiayang Lv, Jiyao Yuan, Qi Zhu, Shuangshuang Tian, Dantong Zhu, Futing Wang, Heyuan Deng, Fei Mi, Lifeng Shang, Tao Lin
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.11470
Pdf link: https://arxiv.org/pdf/2512.11470
Abstract While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the optimal mechanism for utilizing expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to theoretically ground this landscape, decomposing performance into foundational SFT performance and the subsequent RL plasticity. Through extensive benchmarking, we establish the Sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. Furthermore, we derive precise scaling guidelines: (1) transitioning to RL at the SFT Stable or Mild Overfitting Sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) refuting "Less is More" in the context of SFT-then-RL scaling, we demonstrate that Data Scale determines the primary post-training potential, while Trajectory Difficulty acts as a performance multiplier; and (3) the Minimum SFT Validation Loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.
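Abstract (translated from Chinese) While effective post-training integrates Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), the best mechanism for using expert trajectories remains unresolved. We propose the Plasticity-Ceiling Framework to ground this landscape theoretically, decomposing performance into foundational SFT performance and subsequent RL plasticity. Through extensive benchmarking, we establish the sequential SFT-then-RL pipeline as the superior standard, overcoming the stability deficits of synchronized approaches. We also derive precise scaling guidelines: (1) transitioning to RL at the SFT Stable or Mild Overfitting sub-phase maximizes the final ceiling by securing foundational SFT performance without compromising RL plasticity; (2) refuting "Less is More" in the context of SFT-then-RL scaling, we show that data scale determines the primary post-training potential, while trajectory difficulty acts as a performance multiplier; and (3) the minimum SFT validation loss serves as a robust indicator for selecting the expert trajectories that maximize the final performance ceiling. Our findings provide actionable guidelines for maximizing the value extracted from expert trajectories.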

DentalGPT: Incentivizing Multimodal Complex Reasoning in Dentistry

DentalGPT: Incentivizing Complex Multimodal Reasoning in Dentistry

Authors: Zhenyang Cai, Jiaming Zhang, Junjie Zhao, Ziyi Zeng, Yanchao Li, Jingyi Liang, Junying Chen, Yunjin Yang, Jiajun You, Shuzhi Deng, Tongfei Wang, Wanting Chen, Chunxiu Hao, Ruiqi Xie, Zhenwei Wen, Xiangyi Feng, Zou Ting, Jin Zou Lin, Jianquan Li, Guangjun Yu, Liangyi Chen, Junwen Wang, Shan Jiang, Benyou Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.11558
Pdf link: https://arxiv.org/pdf/2512.11558
Abstract Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was constructed by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features, making it the multimodal dataset with the most extensive collection of dental images to date. Training on this dataset significantly enhances the MLLM's visual understanding of dental conditions, while the subsequent reinforcement learning stage further strengthens its capability for multimodal complex reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, along with dental subsets of medical VQA benchmarks, show that DentalGPT achieves superior performance in disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results demonstrate that high-quality dental data combined with staged adaptation provides an effective pathway for building capable and domain-specialized dental MLLMs.
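Abstract (translated from Chinese) Reliable interpretation of multimodal data in dentistry is essential for automated oral healthcare, yet current multimodal large language models (MLLMs) struggle to capture fine-grained dental visual details and lack sufficient reasoning ability for precise diagnosis. To address these limitations, we present DentalGPT, a specialized dental MLLM developed through high-quality domain knowledge injection and reinforcement learning. Specifically, the largest annotated multimodal dataset for dentistry to date was built by aggregating over 120k dental images paired with detailed descriptions that highlight diagnostically relevant visual features, making it the multimodal dataset with the most extensive collection of dental images to date. Training on this dataset significantly improves the MLLM's visual understanding of dental conditions, and the subsequent reinforcement learning stage further strengthens its capability for complex multimodal reasoning. Comprehensive evaluations on intraoral and panoramic benchmarks, together with dental subsets of medical VQA benchmarks, show that DentalGPT performs strongly on disease classification and dental VQA tasks, outperforming many state-of-the-art MLLMs despite having only 7B parameters. These results show that high-quality dental data combined with staged adaptation offers an effective path to building capable, domain-specialized dental MLLMs.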

UniBYD: A Unified Framework for Learning Robotic Manipulation Across Embodiments Beyond Imitation of Human Demonstrations

UniBYD: A Unified Framework for Cross-Embodiment Robotic Manipulation Learning That Goes Beyond Imitating Human Demonstrations

Authors: Tingyu Yuan, Biaoliang Guan, Wen Ye, Ziyan Tian, Yi Yang, Weijie Zhou, Yan Huang, Peng Wang, Chaoyang Zhao, Jinqiao Wang
Subjects: Robotics (cs.RO)
Arxiv link: https://arxiv.org/abs/2512.11609
Pdf link: https://arxiv.org/pdf/2512.11609
Abstract In embodied intelligence, the embodiment gap between robotic and human hands brings significant challenges for learning from human demonstrations. Although some studies have attempted to bridge this gap using reinforcement learning, they remain confined to merely reproducing human manipulation, resulting in limited task performance. In this paper, we propose UniBYD, a unified framework that uses a dynamic reinforcement learning algorithm to discover manipulation policies aligned with the robot's physical characteristics. To enable consistent modeling across diverse robotic hand morphologies, UniBYD incorporates a unified morphological representation (UMR). Building on UMR, we design a dynamic PPO with an annealed reward schedule, enabling reinforcement learning to transition from imitation of human demonstrations to explore policies adapted to diverse robotic morphologies better, thereby going beyond mere imitation of human hands. To address the frequent failures of learning human priors in the early training stage, we design a hybrid Markov-based shadow engine that enables reinforcement learning to imitate human manipulations in a fine-grained manner. To evaluate UniBYD comprehensively, we propose UniManip, the first benchmark encompassing robotic manipulation tasks spanning multiple hand morphologies. Experiments demonstrate a 67.90% improvement in success rate over the current state-of-the-art. Upon acceptance of the paper, we will release our code and benchmark at this https URL.
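Abstract (translated from Chinese) In embodied intelligence, the embodiment gap between robotic and human hands poses significant challenges for learning from human demonstrations. Although some studies have tried to bridge this gap with reinforcement learning, they remain limited to reproducing human manipulation, which results in limited task performance. This paper proposes UniBYD, a unified framework that uses a dynamic reinforcement learning algorithm to discover manipulation policies aligned with the robot's physical characteristics. To enable consistent modeling across different robotic hand morphologies, UniBYD adopts a unified morphological representation (UMR). Building on UMR, we design a dynamic PPO with an annealed reward schedule, allowing reinforcement learning to move from imitating human demonstrations to better exploring policies adapted to diverse robotic morphologies, thereby going beyond mere imitation of human hands. To address the frequent failures of learning human priors in the early training stage, we design a hybrid Markov-based shadow engine that lets reinforcement learning imitate human manipulation in a fine-grained manner. To evaluate UniBYD comprehensively, we propose UniManip, the first benchmark covering robotic manipulation tasks across multiple hand morphologies. Experiments show a 67.90% improvement in success rate over the current state of the art. Upon acceptance of the paper, we will release our code and benchmark at this https URL.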

SUMFORU: An LLM-Based Review Summarization Framework for Personalized Purchase Decision Support

SUMFORU: An LLM-Based Review Summarization Framework for Personalized Purchase Decisions

Authors: Yuming Feng, Xinrui Jiang
Subjects: Computation and Language (cs.CL)
Arxiv link: https://arxiv.org/abs/2512.11755
Pdf link: https://arxiv.org/pdf/2512.11755
Abstract Online product reviews contain rich but noisy signals that overwhelm users and hinder effective decision-making. Existing LLM-based summarizers remain generic and fail to account for individual preferences, limiting their practical utility. We propose SUMFORU, a steerable review summarization framework that aligns outputs with explicit user personas to support personalized purchase decisions. Our approach integrates a high-quality data pipeline built from the Amazon 2023 Review Dataset with a two-stage alignment procedure: (1) persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained, persona-relevant signals. We evaluate the model across rule-based, LLM-based, and human-centered metrics, demonstrating consistent improvements in consistency, grounding, and preference alignment. Our framework achieves the highest performance across all evaluation settings and generalizes effectively to unseen product categories. Our results highlight the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.
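Abstract (translated from Chinese) Online product reviews contain rich but noisy signals that overwhelm users and hinder effective decision-making. Existing LLM-based summarizers remain generic and fail to account for individual preferences, which limits their practical utility. We propose SUMFORU, a steerable review summarization framework that aligns outputs with explicit user personas to support personalized purchase decisions. Our approach combines a high-quality data pipeline built from the Amazon 2023 Review Dataset with a two-stage alignment procedure: (1) persona-aware Supervised Fine-Tuning (SFT) via asymmetric knowledge distillation, and (2) Reinforcement Learning with AI Feedback (RLAIF) using a preference estimator to capture fine-grained, persona-relevant signals. We evaluate the model on rule-based, LLM-based, and human-centered metrics, showing consistent improvements in consistency, grounding, and preference alignment. Our framework achieves the highest performance across all evaluation settings and generalizes effectively to unseen product categories. Our results highlight the promise of steerable pluralistic alignment for building next-generation personalized decision-support systems.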

Agile Flight Emerges from Multi-Agent Competitive Racing

Agile Flight Emerges from Multi-Agent Racing Competition

Authors: Vineet Pasumarti, Lorenzo Bianchi, Antonio Loquercio
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Arxiv link: https://arxiv.org/abs/2512.11781
Pdf link: https://arxiv.org/pdf/2512.11781
Abstract Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed motion pushing the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, e.g., progress on the raceline, in particular when the complexity of the environment increases, e.g., in the presence of obstacles. Moreover, we find that multi-agent competition yields policies that transfer more reliably to the real world than policies trained with a single-agent progress-based reward, despite the two methods using the same simulation environment, randomization strategy, and hardware. In addition to improved sim-to-real transfer, the multi-agent policies also exhibit some degree of generalization to opponents unseen at training time. Overall, our work, following in the tradition of multi-agent competitive game-play in digital domains, shows that sparse task-level rewards are sufficient for training agents capable of advanced low-level control in the physical world. Code: this https URL
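Abstract (translated from Chinese) Through multi-agent competition and the sparse high-level objective of winning a race, we find that both agile flight (e.g., high-speed motion that pushes the platform to its physical limits) and strategy (e.g., overtaking or blocking) emerge from agents trained with reinforcement learning. We provide evidence in both simulation and the real world that this approach outperforms the common paradigm of training agents in isolation with rewards that prescribe behavior, such as progress along the raceline, particularly as the complexity of the environment increases, for example in the presence of obstacles. We also find that multi-agent competition yields policies that transfer more reliably to the real world than policies trained with a single-agent progress-based reward, even though the two methods use the same simulation environment, randomization strategy, and hardware. Beyond improved sim-to-real transfer, the multi-agent policies also show some degree of generalization to opponents unseen at training time. Overall, our work, which follows the tradition of multi-agent competitive game-play in digital domains, shows that sparse task-level rewards are sufficient for training agents capable of advanced low-level control in the physical world. Code: this https URL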

Keyword: diffusion policy

There is no result
