Generated: 2026-02-02 16:50:00 (UTC+8); arXiv release time: 2026-02-02 20:00 EST (2026-02-03 09:00 UTC+8)

53 related papers today

Keyword: reinforcement learning

ShellForge: Adversarial Co-Evolution of Webshell Generation and Multi-View Detection for Robust Webshell Defense

Latent Spherical Flow Policy for Reinforcement Learning with Combinatorial Actions

Aligning Microscopic Vehicle and Macroscopic Traffic Statistics: Reconstructing Driving Behavior from Partial Data

Learning Reward Functions for Cooperative Resilience in Multi-Agent Systems

Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning

Models Under SCOPE: Scalable and Controllable Routing via Pre-hoc Reasoning

Quantum-Inspired Reinforcement Learning for Secure and Sustainable AIoT-Driven Supply Chain Systems

SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning

Unrewarded Exploration in Large Language Models Reveals Latent Learning from Psychology

Continual Policy Distillation from Distributed Reinforcement Learning Teachers

RulePlanner: All-in-One Reinforcement Learner for Unifying Design Rules in 3D Floorplanning

SSL: Sweet Spot Learning for Differentiated Guidance in Agentic Optimization

Action-Sufficient Goal Representations

DreamVAR: Taming Reinforced Visual Autoregressive Model for High-Fidelity Subject-Driven Image Generation

Mock Worlds, Real Skills: Building Small Agentic Language Models with Synthetic Tasks, Simulated Environments, and Rubric-Based Rewards

RoboStriker: Hierarchical Decision-Making for Autonomous Humanoid Boxing

One Ring to Rule Them All: Unifying Group-Based RL via Dynamic Power-Mean Geometry

Detect and Act: Automated Dynamic Optimizer through Meta-Black-Box Optimization

Adapting Reinforcement Learning for Path Planning in Constrained Parking Scenarios

PersonaAct: Simulating Short-Video Users with Personalized Agents for Counterfactual Filter Bubble Auditing

Exo-Plore: Exploring Exoskeleton Control Space through Human-aligned Simulation

Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR

From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

COBRA++: Enhanced COBRA Optimizer with Augmented Surrogate Pool and Reinforced Surrogate Selection

Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability

Real-Time Aligned Reward Model beyond Semantics

A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization

TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

Clipping-Free Policy Optimization for Large Language Models

CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning

Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment

Robust Rigid Body Assembly via Contact-Implicit Optimal Control with Exact Second-Order Derivatives

Degradation-Aware Frequency Regulation of a Heterogeneous Battery Fleet via Reinforcement Learning

Reinforcement Learning-Based Co-Design and Operation of Chiller and Thermal Energy Storage for Cost-Optimal HVAC Systems

PlatoLTL: Learning to Generalize Across Symbols in LTL Instructions for Multi-Task RL

MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

MTDrive: Multi-turn Interactive Reinforcement Learning for Autonomous Driving

SWE-Manager: Selecting and Synthesizing Golden Proposals Before Coding

Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

Automatic Constraint Policy Optimization based on Continuous Constraint Interpolation Framework for Offline Reinforcement Learning

Mem-T: Densifying Rewards for Long-Horizon Memory Agents

Guided by Trajectories: Repairing and Rewarding Tool-Use Trajectories for Tool-Integrated Reasoning

From Absolute to Relative: Rethinking Reward Shaping in Group-Based Reinforcement Learning

RN-D: Discretized Categorical Actors with Regularized Networks for On-Policy Reinforcement Learning

Why GRPO Needs Normalization: A Local-Curvature Perspective on Adaptive Gradients

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

On Safer Reinforcement Learning Policies for Sedation and Analgesia in Intensive Care

Unsupervised Hierarchical Skill Discovery

Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training

Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Agile Reinforcement Learning through Separable Neural Architecture

IRL-DAL: Safe and Adaptive Trajectory Planning for Autonomous Driving via Energy-Guided Diffusion Models

Keyword: diffusion policy

Self-Imitated Diffusion Policy for Efficient and Robust Visual Navigation
