Generated: 2026-05-29 19:33:31 (UTC+8); arXiv release time: 2026-05-29 20:00 EDT (2026-05-30 08:00 UTC+8)

67 related papers today

Keyword: reinforcement learning

Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models

Aryabhata 2: Scaling Reinforcement Learning for Advanced STEM Reasoning

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

FedQHD: Closed-Form Function-Space Federated Reinforcement Learning

Tensorized Radiative Heat Transfer for a Scalable and Calibrated Building Energy Simulator

Label-Free Reinforcement Learning via Cross-Model Entropy

Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

Moment Matching Q-Learning

Differentiable Belief-based Opponent Shaping

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

OISD: On-Policy Internal Self-Distillation of Language Models

PRO-CUA: Process-Reward Optimization for Computer Use Agents

CA-AC-MPC: CUDA-Accelerated Actor-Critic Model Predictive Control

When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

Prompt-Level Reward Specifications for Open-Ended Post-Training

UniNote: A Unified Embedding Model for Multimodal Representation and Ranking

LLM-ALSO: LLM-Driven Adaptive Learning-Signal Optimization for Multi-Agent Reinforcement Learning

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

GrepSeek: Training Search Agents for Direct Corpus Interaction

Rubric-Guided Process Reward for Stepwise Model Routing

STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

Rethinking Post-Training Recipes for Multimodal Time-Series Forecasting

ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control

FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

On Asymmetric Optimization of Reasoning and Perception in Vision-Language Model Post-Training

Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation

VE2VF: Vision-Enabled to Vision-Free Distillation via Real-world Reinforcement Learning for Robust Contact-Rich Manipulation

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning

GAPD: Gold-Action Policy Distillation for Agentic Reinforcement Learning in Knowledge Base Question Answering

Training Deliberative Monitors for Black-Box Scheming Detection

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

Momentum Based Reward Design for Low Emission Traffic Signal Control

Fairness-Aware Profit Maximization using Deep Reinforcement Learning

ARIADNE: AI-RAN Informed Link Adaptation in Digital Twin Network Environments

Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

Quantifying and Optimizing Simplicity via Polynomial Representations

EvoRubric: Self-Evolving Rubric-Driven RL for Open-Ended Generation

ESPO: Early-Stopping Proximal Policy Optimization

CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

Train the Agent, Not the Expert: Learning to Harness Heterogeneous Experts for Multi-Turn Visual Reasoning

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

Who Am I? History-Aware Profiles for Student Simulation in Tutoring Dialogues

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies

RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

On Distributional Reinforcement Learning in Chaotic Dynamical Systems

Mean-Field Diffuser: Scaling Offline MARL to Thousands of Agents

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

TriSearch: Learning to Optimize Triangulations via Bistellar Flips

How's it going? Reinforcement learning in language models recruits a functional welfare axis

Reinforcement Learning with Robust Rubric Rewards

Stable-Layers: Fine-Tuning Image Layer Decomposition Models with VLM-Scored Reinforcement Learning

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

In-Context Reward Adaptation for Robust Preference Modeling

Reasoning with Sampling: Cutting at Decision Points

Keyword: diffusion policy

Fisher-Preserving Guidance: Training-Free Manifold Constraints for Safe Diffusion Control

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance
