Generated: 2026-04-27 18:18:46 (UTC+8); arXiv release time: 2026-04-27 20:00 EDT (2026-04-28 08:00 UTC+8)

15 related papers today

Keyword: reinforcement learning

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

Removing Sandbagging in LLMs by Training with Weak Supervision

A Hybrid Reinforcement and Self-Supervised Learning Aided Benders Decomposition Algorithm

Do Not Imitate, Reinforce: Iterative Classification via Belief Refinement

Optimal sequential decision-making for error propagation mitigation in digital twins

Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

Learning Control Policies to Provably Satisfy Hard Affine Constraints for Black-Box Hybrid Dynamical Systems

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders

SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

Learning Evidence Highlighting for Frozen LLMs

Adversarial Co-Evolution of Malware and Detection Models: A Bilevel Optimization Perspective

Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought

ATRS: Adaptive Trajectory Re-splitting via a Shared Neural Policy for Parallel Optimization

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Keyword: diffusion policy

No results.