Generated: 2026-05-11 19:26:45 (UTC+8); arXiv release time: 2026-05-11 20:00 EDT (2026-05-12 08:00 UTC+8)

66 relevant papers today

Keyword: reinforcement learning

Hidden Coalitions in Multi-Agent AI: A Spectral Diagnostic from Internal Representations

On Training in Imagination

Gated QKAN-FWP: Scalable Quantum-inspired Sequence Learning

The Causally Emergent Alignment Hypothesis: Causal Emergence Aligns with and Predicts Final Reward in Reinforcement Learning Agents

Gradient Extrapolation-Based Policy Optimization

Revisiting Adam for Streaming Reinforcement Learning

Randomness is sometimes necessary for coordination

How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment

On the Divergence of Differential Temporal Difference Learning without Local Clocks

Mitigating Cognitive Bias in RLHF by Altering Rationality

Rollback-Free Stable Brick Structures Generation

Multi-Objective Constraint Inference using Inverse reinforcement learning

$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses

Bridging Textual Profiles and Latent User Embeddings for Personalization

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

A Systematic Investigation of The RL-Jailbreaker in LLMs

PACEvolve++: Improving Test-time Learning for Evolutionary Search Agents

Towards Differentially Private Reinforcement Learning with General Function Approximation

Integrating Causal DAGs in Deep RL: Activating Minimal Markovian States with Multi-Order Exposure

Self-Consolidating Language Models: Continual Knowledge Incorporation from Context

Actor-Critic with Active Importance Sampling

Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

Almost Sure Convergence Rates of Stochastic Approximation and Reinforcement Learning via a Poisson-Moreau Drift

Theoretical Limits of Language Model Alignment

Where to Spend Rollouts: Hit-Utility Optimal Rollout Allocation for Group-Based RLVR

Stabilized neural Hamilton--Jacobi--Bellman solvers: Error analysis and applications in model-based reinforcement learning

Convergence and Emergence of In-Context Reinforcement Learning with Chain of Thought

Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents

Beyond Reasoning: Reinforcement Learning Unlocks Parametric Knowledge in LLMs

Rethinking Experience Utilization in Self-Evolving Language Model Agents

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

Towards Autonomous Business Intelligence via Data-to-Insight Discovery Agent

Improved Model-based Reinforcement Learning with Smooth Kernels

Teaching Language Models to Think in Code

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

Structured Role-Aware Policy Optimization for Multimodal Reasoning

Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

SparseRL-Sync: Lossless Weight Synchronization with ~100x Less Communication

Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

Beyond Linear Attention: Softmax Transformers Implement In-Context Reinforcement Learning

Gradient-Based LoRA Rank Allocation Under GRPO: An Empirical Study

RELO: Reinforcement Learning to Localize for Visual Object Tracking

Offline Policy Optimization with Posterior Sampling

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

ExpThink: Experience-Guided Reinforcement Learning for Adaptive Chain-of-Thought Compression

Implicit Preference Alignment for Human Image Animation

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

Gradient Starvation in Binary-Reward GRPO: Why Group-Mean Centering Fails and Why the Simplest Fix Works

Guidance Is Not a Hyperparameter: Learning Dynamic Control in Diffusion Language Models

SOD: Step-wise On-policy Distillation for Small Language Model Agents

POETS: Uncertainty-Aware LLM Optimization via Compute-Efficient Policy Ensembles

Approximation-Free Differentiable Oblique Decision Trees

From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

Interpreting Reinforcement Learning Agents with Susceptibilities

Learning CLI Agents with Structured Action Credit under Selective Observation

Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

Reinforcement Learning for Exponential Utility: Algorithms and Convergence in Discounted MDPs

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

123D: Unifying Multi-Modal Autonomous Driving Data at Scale

Keyword: diffusion policy

Decentralized Diffusion Policy Learning for Enhanced Exploration in Cooperative Multi-agent Reinforcement Learning

TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning
