Generated: 2025-10-10 16:29:41 (UTC+8); arXiv release time: 2025-10-10 20:00 EDT (2025-10-11 08:00 UTC+8)

61 relevant papers today

Keyword: reinforcement learning

ConCuR: Conciseness Makes State-of-the-Art Kernel Generation

L2M-AID: Autonomous Cyber-Physical Defense by Fusing Semantic Reasoning of Large Language Models with Multi-Agent Reinforcement Learning (Preprint)

Parameter-Free Federated TD Learning with Markov Noise in Heterogeneous Environments

Reasoning by Exploration: A Unified Approach to Retrieval and Generation over Graphs

Reinforcement Learning-based Task Offloading in the Internet of Wearable Things

Expanding the Action Space of LLMs to Reason Beyond Language

AgentAsk: Multi-Agent Systems Need to Ask

Value Flows

LiveThinking: Enabling Real-Time Efficient Reasoning for AI-Powered Livestreaming via Reinforcement Learning

Control Synthesis of Cyber-Physical Systems for Real-Time Specifications through Causation-Guided Reinforcement Learning

RePainter: Empowering E-commerce Object Removal via Spatial-matting Reinforcement Learning

DEAS: DEtached value learning with Action Sequence for Scalable Offline RL

ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs

OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment

Human-in-the-Loop Optimization with Model-Informed Priors

From Noisy to Native: LLM-driven Graph Restoration for Test-Time Graph Domain Adaptation

Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

GCPO: When Contrast Fails, Go Gold

Strategic Communication under Threat: Learning Information Trade-offs in Pursuit-Evasion Games

An LLM-Powered Cooperative Framework for Large-Scale Multi-Vehicle Navigation

Network Topology and Information Efficiency of Multi-Agent Systems: Study based on MARL

MARC: Memory-Augmented RL Token Compression for Efficient Video Understanding

A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning

Climate Surrogates for Scalable Multi-Agent Reinforcement Learning: A Case Study with CICERO-SCM

TaoSR-SHE: Stepwise Hybrid Examination Reinforcement Learning Framework for E-commerce Search Relevance

TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance

A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

Real-Time Motion-Controllable Autoregressive Video Diffusion

ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

Training-Free Group Relative Policy Optimization

Expressive Value Learning for Scalable Offline Reinforcement Learning

Reinforcement Learning from Probabilistic Forecasts for Safe Decision-Making via Conditional Value-at-Risk Planning

Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization

The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

Opponent Shaping in LLM Agents

Mix- and MoE-DPO: A Variational Inference Approach to Direct Preference Optimization

Evaluation of a Robust Control System in Real-World Cable-Driven Parallel Robots

Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window

Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries

DeepEN: Personalized Enteral Nutrition for Critically Ill Patients using Deep Reinforcement Learning

QAgent: A modular Search Agent with Interactive Query Understanding

Reinforcing Diffusion Models by Direct Group Preference Optimization

ClauseLens: Clause-Grounded, CVaR-Constrained Reinforcement Learning for Trustworthy Reinsurance Pricing

xRouter: Training Cost-Aware LLMs Orchestration System via Reinforcement Learning

Gaze on the Prize: Shaping Visual Attention with Return-Guided Contrastive Learning

DexMan: Learning Bimanual Dexterous Manipulation from Human and Generated Videos

Rethinking Provenance Completeness with a Learning-Based Linux Scheduler

Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools

DYNAMIX: RL-based Adaptive Batch Size Optimization in Distributed Machine Learning Systems

Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

Convergence Theorems for Entropy-Regularized and Distributional Reinforcement Learning

CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

Entropy Regularizing Activation: Boosting Continuous Control, Large Language Models, and Image Classification with Activation as Entropy Constraints

Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

Agent Learning via Early Experience

Keyword: diffusion policy

Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization

ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving
