Generated: 2026-05-13 18:32:09 (UTC+8); arXiv release time: 2026-05-13 20:00 EDT (2026-05-14 08:00 UTC+8)

72 relevant papers today

Keyword: reinforcement learning

$ξ$-DPO: Direct Preference Optimization via Ratio Reward Margin

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network

Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness

Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

Quotient-Categorical Representations for Bellman-Compatible Average-Reward Distributional Reinforcement Learning

Epistemic Uncertainty for Test-Time Discovery

gym-invmgmt: An Open Benchmarking Framework for Inventory Management Methods

Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies

FG-Expo: Frontier-Guided Exploration-Prioritized Policy Optimization via Adaptive KL and Gaussian Curriculum

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

Robust Multi-Agent Path Finding under Observation Attacks: A Principled Adversarial-Plus-Smoothing Training Recipe

TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

Selective Off-Policy Reference Tuning with Plan Guidance

Hierarchical LLM-Driven Control for HAPS-Assisted UAV Networks: Joint Optimization of Flight and Connectivity

Agents Should Replace Narrow Predictive AI as the Orchestrator in 6G AI-RAN

UNIPO: Unified Interactive Visual Explanation for RL Fine-Tuning Policy Optimization

TwiSTAR: Think Fast, Think Slow, Then Act, Generative Recommendation with Adaptive Reasoning

OUI as a Structural Observable: Towards an Activation-Centric View of Neural Network Training

CuSearch: Curriculum Rollout Sampling via Search Depth for Agentic RAG

Evolutionary Task Discovery: Advancing Reasoning Frontiers via Skill Composition and Complexity Scaling

DORA: Dynamic Online Reinforcement Agent for Token Merging in Vision Transformers

Rainbow Deep Q-Learning with Kinematics-Aware Design for Cooperative Delta and 3-RRS Parallel Robot Insertion

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

Federated Client Selection under Partial Visibility: A POMDP Approach with Spatio-Temporal Attention

NavOL: Navigation Policy with Online Imitation Learning

Entropy Polarity in Reinforcement Fine-Tuning: Direction, Asymmetry, and Control

GEAR: Granularity-Adaptive Advantage Reweighting for LLM Agents via Self-Distillation

EvoNav: Evolutionary Reward Function Design for Robot Navigation with Large Language Models

RecRM-Bench: Benchmarking Multidimensional Reward Modeling for Agentic Recommender Systems

Adaptive TD-Lambda for Cooperative Multi-agent Reinforcement Learning

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

Towards Order Fairness: Mitigating LLMs Order Sensitivity through Dual Group Advantage Optimization

Stochastic Minimum-Cost Reach-Avoid Reinforcement Learning

On Predicting the Post-training Potential of Pre-trained LLMs

Learning Agentic Policy from Action Guidance

SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation

SkillGraph: Skill-Augmented Reinforcement Learning for Agents via Evolving Skill Graphs

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Learning What Matters: Adaptive Information-Theoretic Objectives for Robot Exploration

Rollout Cards: A Reproducibility Standard for Agent Research

Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model

Overtrained, Not Misaligned

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning

Intrinsic Vicarious Conditioning for Deep Reinforcement Learning

Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning

Delay-Empowered Causal Hierarchical Reinforcement Learning

PriorZero: Bridging Language Priors and World Models for Decision Making

Transferable Delay-Aware Reinforcement Learning via Implicit Causal Graph Modeling

Reinforcing VLAs in Task-Agnostic World Models

BSO: Safety Alignment Is Density Ratio Matching

Discrete Flow Matching for Offline-to-Online Reinforcement Learning

Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training

Events as Triggers for Behavioral Diversity in Multi-Agent Reinforcement Learning

Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

Aligning Flow Map Policies with Optimal Q-Guidance

ORCE: Order-Aware Alignment of Verbalized Confidence in Large Language Models

LychSim: A Controllable and Interactive Simulation Framework for Vision Research

Towards Affordable Energy: A Gymnasium Environment for Electric Utility Demand-Response Programs

Reward Hacking in Rubric-Based Reinforcement Learning

OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation

Keyword: diffusion policy

Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

NavOL: Navigation Policy with Online Imitation Learning

SI-Diff: A Framework for Learning Search and High-Precision Insertion with a Force-Domain Diffusion Policy
