Generated: 2026-05-08 17:27:04 (UTC+8); arXiv release time: 2026-05-08 20:00 EDT (2026-05-09 08:00 UTC+8)

71 relevant papers today

Keyword: reinforcement learning

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

Topology-Driven Anti-Entanglement Control for Soft Robots

Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning

Two-Stage Learned Decomposition for Scalable Routing on Multigraphs

LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks

Approximate Next Policy Sampling: Replacing Conservative Target Policy Updates in Deep RL

Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

SPARK: Self-Play with Asymmetric Reward from Knowledge Graphs

Nonsense Helps: Prompt Space Perturbation Broadens Reasoning Exploration

When2Speak: A Dataset for Temporal Participation and Turn-Taking in Multi-Party Conversations for Large Language Models

MotionGRPO: Overcoming Low Intra-Group Diversity in GRPO-Based Egocentric Motion Recovery

Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

LLM-Enhanced Deep Reinforcement Learning for Task Offloading in Collaborative Edge Computing

Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using LLM Judges with Closed-Loop Reinforcement Learning Feedback

Confidence is the key: how conformal prediction enhances the generative design of permeable peptides

A Measure-Theoretic Finite-Sample Theory for Adaptive-Data Fitted Q-Iteration

Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs

Unified Value Alignment for Generative Recommendation in Industrial Advertising

Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

Measuring Learning Progress via Gradient-Momentum Coupling

Offline Reinforcement Learning for Rotation Profile Control in Tokamaks

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

Foundation Twins: A New Generation of Power Systems Digital Twins using Foundation AI Models

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning

4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding

Optimal Transport for LLM Reward Modeling from Noisy Preference

Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning

Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

Causal Reinforcement Learning for Complex Card Games: A Magic The Gathering Benchmark

Arena as Offline Reward: Efficient Fine-Grained Preference Optimization for Diffusion Models

Milestone-Guided Policy Learning for Long-Horizon Language Agents

VISD: Enhancing Video Reasoning via Structured Self-Distillation

Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer

Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs

Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

Unifying Goal-Conditioned RL and Unsupervised Skill Learning via Control-Maximization

AdaGamma: State-Dependent Discounting for Temporal Adaptation in Reinforcement Learning

Entropy-Regularized Adjoint Matching for Offline RL

OPSD Compresses What RLVR Teaches: A Post-RL Compaction Stage for Reasoning Models

A$^2$TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

Soft Deterministic Policy Gradient with Gaussian Smoothing

Safactory: A Scalable Agent Factory for Trustworthy Autonomous Intelligence

Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning

Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

A Unified Pair-GRPO Family: From Implicit to Explicit Preference Constraints for Stable and General RL Alignment

Independent Learning of Nash Equilibria in Partially Observable Markov Potential Games with Decoupled Dynamics

Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

Distributed Online Learning for Time-Critical Communication in 6G Industrial Subnetworks

Hitting Time Isomorphism for Multi-Stage Planning with Foundation Policies

Operator-Guided Invariance Learning for Continuous Reinforcement Learning

MARBLE: Multi-Aspect Reward Balance for Diffusion RL

On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

Delay-Robust Deep Reinforcement Learning for Ranging-Free Channel Access under Mobility in Underwater Acoustic Networks

Sequential Design of Genetic Circuits Under Uncertainty With Reinforcement Learning

Coordination Matters: Evaluation of Cooperative Multi-Agent Reinforcement Learning

SNAPO: Smooth Neural Adjoint Policy Optimization for Optimal Control via Differentiable Simulation

ReActor: Reinforcement Learning for Physics-Aware Motion Retargeting

Cross-Modal Navigation with Multi-Agent Reinforcement Learning

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Recursive Agent Optimization

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

Keyword: diffusion policy

No results found.