Generated: 2026-05-22 18:54:54 (UTC+8); arXiv release time: 2026-05-22 20:00 EDT (2026-05-23 08:00 UTC+8)

48 relevant papers today

Keyword: reinforcement learning

Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

Value-Gradient Hypothesis of RL for LLMs

Closed-Loop Sim-to-Real Reinforcement Learning for Deformable Microfiber Shape Control

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

Implicit Safety Alignment from Crowd Preferences

OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

CCLab: Adversarial Testing of Learning- and Non-Learning-Based Congestion Controllers

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

Auction-Consensus Algorithm with Learned Bidding Scheme for Multi-Robot Systems

AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems

Reinforced Preference Optimization for Reasoning-Augmented Recommendations

Reasoning through Verifiable Forecast Actions: Consistency-Grounded RL for Financial LLMs

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

OPERA: An Agent for Image Restoration with End-to-End Joint Planning-Execution Optimization

Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

One-Way Policy Optimization for Self-Evolving LLMs

SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs

Kernel-Based Safe Exploration in Deep Reinforcement Learning

CLORE: Content-Level Optimization for Reasoning Efficiency

Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

Emergence of agriculture in an artificial society of reinforcement learning agents

Long-term Fairness with Selective Labels

ACCoRD: Actor-Critic Conflict Resolution with Deep learning for O-RAN xApps

Integrating Chain-of-Thought into Generative Retrieval: A Preliminary Study

Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning

Unified Data Selection for LLM Reasoning

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

From Recognition to Reasoning: Benchmarking and Enhancing MLLMs on Real-World Receipt Document Understanding

Don't Forget the Critic: Value-Based Data Rehearsal for Multi-Cyclic Continual Reinforcement Learning

F-TIS: Harnessing Diverse Models in Collaborative GRPO

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

A note on convergence of Wasserstein policy optimization

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

Abstraction for Offline Goal-Conditioned Reinforcement Learning

N3P: Accelerated Automated Parking via a Learning-Based Naturalistic Three-Stage Scheme

Post-Training is About States, Not Tokens: A State Distribution View of SFT, RL, and On-Policy Distillation

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

Remember to be Curious: Episodic Context and Persistent Worlds for 3D Exploration

Keyword: diffusion policy

No results