Generated: 2026-04-06 17:12:21 (UTC+8); arXiv release time: 2026-04-06 20:00 EDT (2026-04-07 08:00 UTC+8)

38 related papers today

Keyword: reinforcement learning

LLM Reasoning with Process Rewards for Outcome-Guided Steps

Contextual Intelligence: The Next Leap for Reinforcement Learning

OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration

Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning

From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

A Survey on AI for 6G: Challenges and Opportunities

Compositional Neuro-Symbolic Reasoning

Mitigating Data Scarcity in Spaceflight Applications for Offline Reinforcement Learning Using Physics-Informed Deep Generative Models

RL-Loop: Reinforcement Learning-Driven Real-Time 5G Slice Control for Connected and Autonomous Mobility Services

Tune to Learn: How Controller Gains Shape Robot Policy Learning

Interpretable Deep Reinforcement Learning for Element-level Bridge Life-cycle Optimization

Moondream Segmentation: From Words to Masks

Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

Generalization Limits of Reinforcement Learning Alignment

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving

Data-Driven Synthesis of Probabilistic Controlled Invariant Sets for Linear MDPs

Multi-agent Reinforcement Learning-based Joint Design of Low-Carbon P2P Market and Bidding Strategy in Microgrids

Learning Locomotion on Complex Terrain for Quadrupedal Robots with Foot Position Maps and Stability Rewards

Fully Byzantine-Resilient Distributed Multi-Agent Q-Learning

CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

Dependency-Guided Repository-Level C-to-Rust Translation with Reinforcement Alignment

Multi-Turn Reinforcement Learning for Tool-Calling Agents with Iterative Reward Calibration

Towards Near-Real-Time Telemetry-Aware Routing with Neural Routing Algorithms

Digital Twin-Assisted In-Network and Edge Collaboration for Joint User Association, Task Offloading, and Resource Allocation in the Metaverse

Mitigating Reward Hacking in RLHF via Advantage Sign Robustness

R2-Write: Reflection and Revision for Open-Ended Writing with Deep Reasoning

Behavior-Constrained Reinforcement Learning with Receding-Horizon Credit Assignment for High-Performance Control

ARM: Advantage Reward Modeling for Long-Horizon Manipulation

JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency

Distributed Snitch Digital Twin-Based Anomaly Detection for Smart Voltage Source Converter-Enabled Wind Power Systems

Self-Distilled RLVR

FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation

Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

Keyword: diffusion policy

Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling