Generated: 2026-06-04 19:18:50 (UTC+8); arXiv release time: 2026-06-04 20:00 EDT (2026-06-05 08:00 UTC+8)

52 related papers today

Keyword: reinforcement learning

Position: Deployed Reinforcement Learning should be Continual

Self-Distilled Policy Gradient

RUBAS: Rubric-Based Reinforcement Learning for Agent Safety

A Goal-Set Characterization of Task Composition in the Boolean Task Algebra

Need to Know: Contextual-Integrity-Grounded Query Rewriting for Privacy-Conscious LLM Delegation

Large Language Models Hack Rewards, and Society

SaliMory: Orchestrating Cognitive Memory for Conversational Agents

SocialCoach: Personalized Social Skill Learning with RL-based Agentic Tutoring and Practice

Smart Transportation Without Neurons -- Fair Metro Network Expansion with Tabular Reinforcement Learning

Exact Unlearning in Reinforcement Learning

Dual Advantage Fields

RL Excursions during Pre-Training: Re-examining Policy Optimization for LLM training

From Ticks to Flows: Dynamics of Neural Reinforcement Learning in Continuous Environments

Sparse Mixture-of-Experts Reward Models Learn Interpretable and Specialized Experts for Personalized Preference Modeling

Generalizable Multi-Task Learning for Wireless Networks Using Prompt Decision Transformers

Policy Gradient for Continuous-Time Robust Markov Decision Processes

Learning to cooperate with emergent reputation via multi-agent reinforcement learning

When Clients Stop Following: A Cognitive Conceptualization Diagram-driven Framework for Strategic Counseling

Read the Trace, Steer the Path: Trajectory-Aware Reinforcement Learning for Diffusion Language Models

When Chatbots Accommodate: What AI Companions Optimize for in Vulnerable Conversations

AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

Episodic Memory Temporal Consistency for Cooperative Multi-Agent Reinforcement Learning

Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots

Self-Evolving Deep Research via Joint Generation and Evaluation

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

Dynamic Multi-Pair Trading Strategy in Cryptocurrency Markets with Deep Reinforcement Learning

SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

VentAgent: When LLMs Learn to Breathe -- Multi-Objective Arbitration for ARDS Ventilation

Explainably Safe Reinforcement Learning

CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation

Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning

COP-Q: Safety-First Reinforcement Learning for Robot Control via Cholesky-Ordered Projection

Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment

AIP: A Graph Representation for Learning and Governing Agent Skills

Scenario Generation for Risk-Aware Reinforcement Learning with Probably Approximately Safe Guarantees

Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents

M3imic: Learning a Versatile Whole-Body Controller for Multimodal Motion Mimicking

MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

Learning Empirically Admissible Neural Heuristics for Combinatorial Search

GRAIL: Gradient-Reweighted Advantages for Reinforcement Learning with Verifiable Rewards

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Sequential Data Poisoning in LLM Post-Training

Potential-Guided Flow Matching for Vision-Language-Action Policy Improvement

GARL: Game-Theoretic Reinforcement Learning for Multi-Agent Strategic Prioritisation

Generalization of World Models under Environmental Variability for Vision-based Quadrotor Navigation

Enhancing the MADDPG Algorithm for Multi-Agent Learning via Action Inference and Importance Sampling

Arithmetic Pedagogy for Language Models

Self-Evaluation Is Already There: Eliciting Latent Judge Calibration in Base LLMs with Minimal Data

Reinforcement Learning from Rich Feedback with Distributional DAgger

Keyword: diffusion policy

No results found